Overview: This article walks through the full workflow of single-machine fine-tuning with LLaMA Factory, covering environment setup, data preparation, model training, and evaluation, with reusable code examples and practical advice to help developers efficiently optimize models locally.
Single-machine fine-tuning with LLaMA Factory requires Python 3.8+. Using conda to create an isolated virtual environment is recommended to avoid dependency conflicts. First, install the base dependencies:
conda create -n llama_factory python=3.9
conda activate llama_factory
pip install torch==2.0.1 transformers==4.30.2 datasets==2.12.0 accelerate==0.20.3
Key points:
Match the CUDA version to your driver: check the installed driver with nvidia-smi, then pick the corresponding PyTorch build from the official PyTorch website.
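As a quick sanity check (a minimal sketch, not part of the original setup steps), you can confirm that the installed PyTorch build actually sees your GPU and report which CUDA toolkit it was compiled against:

import torch

# Print the PyTorch version, the CUDA toolkit it was built with, and GPU visibility.
print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))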
Next, install LLaMA Factory from source:

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
If GPU memory is tight, the --gradient_accumulation_steps flag lets the trainer split each effective batch into smaller chunks and accumulate gradients across them.

Data quality directly affects model performance, so clean and normalize the raw text before training. For example:
import re

def clean_text(text):
    text = re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace
    return re.sub(r'[^\w\s]', '', text)        # strip punctuation (adjust to the task)
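A quick check of the cleaner on a throwaway sample string (hypothetical input, shown only to illustrate the behavior):

raw = "  Hello,   world!!  "
print(clean_text(raw))   # -> "Hello world"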
Each training sample is a JSON record with prompt and response fields:
{"prompt": "解释量子计算的基本原理", "response": "量子计算利用..."}{"prompt": "用Python实现快速排序", "response": "def quick_sort(arr):..."}
Split the data into training, validation, and test sets at an 8:1:1 ratio, making sure the three splits share the same distribution. sklearn's train_test_split works well:
from sklearn.model_selection import train_test_split

train_data, temp_data = train_test_split(all_data, test_size=0.2)
val_data, test_data = train_test_split(temp_data, test_size=0.5)
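To persist the three splits for the training scripts (file names here are assumptions, chosen to match the --data_path used later):

import json

def save_split(records, path):
    # Write one split as a UTF-8 JSON file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

save_split(train_data, "data/train.json")
save_split(val_data, "data/val.json")
save_split(test_data, "data/test.json")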
LLaMA Factory supports both LoRA (low-rank adaptation) and full-parameter fine-tuning; starting with LoRA is recommended to keep compute costs down.
Core parameters:
--lora_rank 16: rank of the LoRA matrices, usually 8/16/32; larger values can improve quality but increase compute.
--lora_alpha 32: scaling factor that, together with lora_rank, controls the strength of the adaptation.
--lora_dropout 0.1: dropout rate that guards against overfitting.

Example command:
python llama_factory/src/train_lora.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path ./data/train.json \
    --output_dir ./output/lora \
    --lora_rank 16 \
    --lora_alpha 32 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --warmup_steps 100 \
    --fp16
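For intuition about what the LoRA flags configure, the sketch below shows roughly how the same settings would map onto a peft LoraConfig; LLaMA Factory builds the adapter internally, and the target_modules listed here are an assumption rather than the project's defaults:

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                 # --lora_rank: rank of the low-rank update matrices
    lora_alpha=32,        # --lora_alpha: scaling applied to the update
    lora_dropout=0.1,     # --lora_dropout: dropout on adapter inputs
    target_modules=["q_proj", "v_proj"],  # assumed attention projections to adapt
)

Note also that the effective batch size of the command above is per_device_train_batch_size × gradient_accumulation_steps = 4 × 8 = 32 samples per optimizer step.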
When data is plentiful (more than roughly 100k samples) and the hardware allows it, full-parameter fine-tuning is worth trying:
python llama_factory/src/train_full.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path ./data/train.json \
    --output_dir ./output/full \
    --num_train_epochs 5 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-5 \
    --weight_decay 0.01 \
    --bf16
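A rough back-of-the-envelope estimate of why full fine-tuning is so much heavier than LoRA (assumptions: bf16 weights, Adam keeping an fp32 master copy plus two fp32 moment buffers; activations and gradients are excluded, so these are lower bounds):

# Approximate parameter count for Llama-2-7B.
params = 7e9

# Full fine-tuning: bf16 weights (2 B) + fp32 master copy (4 B) + two fp32 Adam moments (4 B + 4 B).
full_gb = params * (2 + 4 + 4 + 4) / 1024**3
print(f"full fine-tuning, weights + optimizer states: ~{full_gb:.0f} GB")   # ~91 GB

# LoRA: the base weights stay frozen in bf16; only the adapter (tens of millions
# of parameters, order-of-magnitude assumption) carries optimizer state.
lora_params = 40e6
lora_gb = params * 2 / 1024**3 + lora_params * (2 + 4 + 4 + 4) / 1024**3
print(f"LoRA, frozen weights + adapter states: ~{lora_gb:.0f} GB")          # ~14 GB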
Key differences from the LoRA run (all visible in the two commands above): a lower learning rate (1e-5 vs 2e-5), more epochs, a smaller per-device batch with more gradient accumulation, bf16 instead of fp16, and explicit regularization (--weight_decay) to prevent overfitting.

After training, evaluate the model along the following lines:
Human evaluation: use the generate.py script to produce samples and manually rate them for relevance and fluency:
python llama_factory/src/generate.py \
    --model_name_or_path ./output/lora \
    --prompt "Explain the process of photosynthesis" \
    --max_new_tokens 200
Automatic metrics such as ROUGE complement the human review:

from datasets import load_metric

# The rouge metric additionally requires the rouge_score package.
metric = load_metric("rouge")
predictions = [model_output]
references = [ground_truth]
results = metric.compute(predictions=predictions, references=references)
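Note that load_metric is deprecated in newer releases of datasets; in more recent environments the equivalent call goes through the evaluate package (a minimal sketch, assuming evaluate and rouge_score are installed):

import evaluate

rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=["Photosynthesis converts light energy into chemical energy..."],
    references=["Photosynthesis is the process by which plants turn light into chemical energy..."],
)
print(results)   # rouge1 / rouge2 / rougeL scores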
Out-of-memory errors:
Reduce per_device_train_batch_size and raise gradient_accumulation_steps to keep the effective batch size unchanged.
Enable --gradient_checkpointing to trade extra compute for lower memory use.

Slow training:
Use fp16 or bf16 mixed precision.
Enable --use_flash_attn (requires the flash-attn library).
Monitor GPU utilization with nvidia-smi; sustained utilization below 80% usually points to an I/O bottleneck.

Model not converging:
Increase warmup_steps (for example from 100 to 500).

Once fine-tuning is finished, the model can be deployed in the following ways.

Local inference with transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes ./output/lora holds a full (merged) model; if it only contains LoRA
# adapter weights, see the peft-based loading sketch below.
model = AutoModelForCausalLM.from_pretrained("./output/lora", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Explain machine learning", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
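If the output directory contains only LoRA adapter weights rather than a merged checkpoint, a common pattern (a sketch using the peft library, not taken from LLaMA Factory's own scripts) is to attach the adapter to the base model and optionally merge it:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
model = PeftModel.from_pretrained(base, "./output/lora")   # attach the LoRA adapter
model = model.merge_and_unload()                           # optional: merge weights for faster inference

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")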
API serving: build a REST interface with FastAPI:
from fastapi import FastAPI
import torch
from transformers import pipeline

app = FastAPI()
generator = pipeline(
    "text-generation",
    model="./output/lora",
    device=0 if torch.cuda.is_available() else "cpu",
)

@app.post("/generate")
async def generate(prompt: str):
    result = generator(prompt, max_length=100, do_sample=True)
    return {"text": result[0]["generated_text"]}
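A small client-side sketch for exercising the endpoint (the module name app and the port are assumptions; note that, as written, the endpoint reads prompt from the query string):

# Start the server first, e.g.:  uvicorn app:app --host 0.0.0.0 --port 8000
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "Explain machine learning"},   # query parameter, matching the endpoint signature
)
print(resp.json()["text"])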
To load a specific intermediate checkpoint rather than the final adapter, point the LoRA weights argument at that checkpoint directory, for example:

--lora_weights ./output/lora/checkpoint-1000
The helper below probes the largest batch size that fits in GPU memory by doubling the batch until a runtime (out-of-memory) error occurs:

import torch

def get_max_batch_size(model, tokenizer, max_length=512):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if tokenizer.pad_token is None:        # Llama tokenizers ship without a pad token
        tokenizer.pad_token = tokenizer.eos_token
    batch_size = 1
    while True:
        try:
            # Pad every probe sequence to max_length so memory use reflects real batches.
            inputs = tokenizer([" "] * batch_size, return_tensors="pt", padding="max_length",
                               truncation=True, max_length=max_length).to(device)
            with torch.no_grad():
                _ = model(**inputs)
            batch_size *= 2
        except RuntimeError:               # CUDA OOM surfaces as a RuntimeError
            torch.cuda.empty_cache()
            return max(batch_size // 2, 1)
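A hypothetical usage sketch (model and tokenizer names are assumptions; loading a 7B model this way needs a GPU with sufficient memory):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")

print("max inference batch size:", get_max_batch_size(model, tokenizer, max_length=512))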
With this end-to-end walkthrough, developers can master the core workflow of single-machine fine-tuning with LLaMA Factory, forming a complete loop from environment setup to model deployment. In real projects, tune the hyperparameters to the specific task (for example question answering or code generation) and compare different fine-tuning strategies with A/B tests.