Full Fine-Tuning: Updating All Model Parameters
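---
title: "Full Fine-Tuning: Updating All Model Parameters"
description: "Learn the technical details, resource planning, and best practices of full fine-tuning, for scenarios that demand maximum performance"
tags: ["full fine-tuning", "full training", "large model training", "distributed training"]
category: "llm"
icon: "🧠"
---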
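Introduction to Full Fine-Tuning
Full fine-tuning updates every parameter of a pretrained model so that it adapts to a specific task. It is resource-intensive, but with enough task data it can reach higher quality than parameter-efficient methods. Because no weights are frozen, the model's full capacity is available for adapting to the target task.
Advantages of full fine-tuning:
- Maximum flexibility: the model can adapt completely to the target task
- High performance ceiling: with sufficient data it usually outperforms parameter-efficient methods
- No architectural constraints: no extra adapter modules are required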
Resource Requirements
Estimating GPU Memory
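def estimate_training_memory(model_params, batch_size, seq_length,
                             hidden_size=4096, num_layers=32):
    """
    Rough GPU memory estimate for full fine-tuning with mixed-precision AdamW.

    model_params: parameter count in billions (B)
    batch_size: micro-batch size
    seq_length: sequence length in tokens
    hidden_size, num_layers: model shape (defaults match LLaMA-7B)
    """
    # Model weights (FP16): 2 bytes per parameter
    param_memory = model_params * 2  # GB
    # Gradients (FP16): 2 bytes per parameter
    gradient_memory = model_params * 2  # GB
    # Optimizer state: AdamW keeps an FP32 weight copy plus first and
    # second moments, 4 + 4 + 4 = 12 bytes per parameter
    optimizer_memory = model_params * 12  # GB
    # Activations (rough lower bound): assumes full gradient checkpointing,
    # so only each layer's FP16 input (2 bytes per element) is kept
    activation_memory = 2 * batch_size * seq_length * hidden_size * num_layers / 1e9  # GB
    total = param_memory + gradient_memory + optimizer_memory + activation_memory
    return {
        "weights": f"{param_memory:.1f} GB",
        "gradients": f"{gradient_memory:.1f} GB",
        "optimizer state": f"{optimizer_memory:.1f} GB",
        "activations": f"{activation_memory:.1f} GB",
        "total": f"{total:.1f} GB"
    }

# Example: full fine-tuning of LLaMA-7B
memory_estimate = estimate_training_memory(
    model_params=7,  # 7B parameters
    batch_size=8,
    seq_length=2048
)
print(memory_estimate)
# Total comes to roughly 116 GB, of which 112 GB is weights,
# gradients, and optimizer state (16 bytes per parameter)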
Recommended GPU Configurations
7B model: 2-4 × A100 80GB
13B model: 4-8 × A100 80GB
30B model: 8-16 × A100 80GB
65B model: 16-32 × A100 80GB
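These ranges can be sanity-checked against the 16 bytes per parameter used in the estimate above (2 for FP16 weights, 2 for gradients, 12 for AdamW state). The sketch below is a back-of-the-envelope calculation, not a sizing tool. The helper `min_gpus` and the `usable_fraction` of 0.8 (headroom for activations, buffers, and fragmentation) are assumptions made for illustration, and it presumes that model state is fully sharded across GPUs.

```python
import math

BYTES_PER_PARAM = 16  # 2 (FP16 weights) + 2 (FP16 gradients) + 12 (AdamW state)

def min_gpus(params_billion, gpu_memory_gb=80, usable_fraction=0.8):
    """Smallest GPU count whose pooled usable memory holds the model state."""
    state_gb = params_billion * BYTES_PER_PARAM
    usable_gb = gpu_memory_gb * usable_fraction  # assumed headroom, not a measured value
    return math.ceil(state_gb / usable_gb)

for size in (7, 13, 30, 65):
    print(f"{size}B: at least {min_gpus(size)} x A100 80GB")
```

Under these assumptions the lower bounds land at or near the low end of each range; the upper end leaves room for larger batches and longer sequences.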
Implementation Steps
Data Preparation
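from datasets import DatasetDict, load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# The LLaMA tokenizer has no pad token, so reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
        padding="max_length"
    )

# A single JSON file loads as one "train" split, so hold out part of it
# for validation (the 5% ratio is an illustrative choice)
raw = load_dataset("json", data_files="train.json")["train"]
split = raw.train_test_split(test_size=0.05, seed=42)
dataset = DatasetDict({"train": split["train"], "validation": split["test"]})
tokenized_dataset = dataset.map(tokenize_function, batched=True)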
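Padding every example to `max_length` is simple, but it can spend much of each batch on pad tokens when the texts are short. The following sketch, in plain Python, estimates that overhead from a list of token counts. The function name `padding_fraction` and the sample lengths are hypothetical, chosen only to illustrate the arithmetic.

```python
def padding_fraction(token_counts, max_length=2048):
    """Share of positions that are padding when every example is padded to max_length."""
    # Examples longer than max_length are truncated, so they contribute no padding
    real_tokens = sum(min(n, max_length) for n in token_counts)
    total_positions = len(token_counts) * max_length
    return 1 - real_tokens / total_positions

# Hypothetical token counts for four training examples
lengths = [512, 1024, 2048, 3000]
print(f"padding share: {padding_fraction(lengths):.3f}")
```

If the share is large, padding only to the longest example in each batch, or packing several examples into one sequence, are common remedies.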
Loading the Model
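import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Load the full model in BF16; FP16 weights cannot be combined with
# the Trainer's fp16 loss scaling
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # remove when launching with DeepSpeed ZeRO-3 or FSDP
    attn_implementation="flash_attention_2"  # use FlashAttention-2 (needs the flash-attn package)
)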
Training Configuration
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./full_finetune_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,  # full fine-tuning uses a smaller learning rate than LoRA
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,  # matches the DeepSpeed config; fp16=True fails if the weights were loaded in FP16
    gradient_checkpointing=True,
    optim="adamw_torch",
    max_grad_norm=1.0,
    save_strategy="epoch",
    eval_strategy="epoch",  # named evaluation_strategy before transformers 4.41
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss"
)

# Causal LM collator: mlm=False builds labels from input_ids and masks padding
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator
)
trainer.train()
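To make `warmup_ratio=0.03` and `lr_scheduler_type="cosine"` concrete, the sketch below reproduces the shape of a linear-warmup, cosine-decay schedule in plain Python. It illustrates the curve and is not the Trainer's internal code; the Trainer may round partial batches differently when counting steps. The 10,000-sample dataset and the 4-GPU setup are assumed values.

```python
import math

def total_steps(num_samples, per_device_batch, num_gpus, grad_accum, epochs):
    """Optimizer steps for the whole run (one step per effective batch)."""
    effective_batch = per_device_batch * num_gpus * grad_accum
    return math.ceil(num_samples / effective_batch) * epochs

def lr_at(step, total, base_lr=2e-5, warmup_ratio=0.03):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup = math.ceil(total * warmup_ratio)
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Assumed setup: 10,000 samples, 4 GPUs, settings from the configuration above
steps = total_steps(num_samples=10_000, per_device_batch=4, num_gpus=4,
                    grad_accum=8, epochs=3)
for s in (0, steps // 2, steps):
    print(f"step {s}: lr = {lr_at(s, steps):.2e}")
```

The learning rate starts at zero, reaches `base_lr` when warmup ends, and decays smoothly back to zero by the final step.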
Distributed Training
DeepSpeed Integration
# ds_config.json
{
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"},
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6
  },
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 4,
  "wall_clock_breakdown": false
}
# Launch training (train_batch_size = 4 per GPU × 8 accumulation steps × 4 GPUs = 128)
# deepspeed --num_gpus=4 train.py --deepspeed ds_config.json
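DeepSpeed expects the three batch settings to agree: `train_batch_size` must equal `train_micro_batch_size_per_gpu` times `gradient_accumulation_steps` times the number of GPUs. A small pre-flight check, sketched below in plain Python, catches a mismatch before launch. The helper name `check_batch_config` is hypothetical, and the inline dictionary stands in for the parsed `ds_config.json`.

```python
def check_batch_config(config, num_gpus):
    """Return the expected train_batch_size, raising if the config disagrees."""
    micro = config["train_micro_batch_size_per_gpu"]
    accum = config["gradient_accumulation_steps"]
    expected = micro * accum * num_gpus
    declared = config["train_batch_size"]
    if declared != expected:
        raise ValueError(
            f"train_batch_size={declared}, but {micro} x {accum} x {num_gpus} = {expected}"
        )
    return expected

# Stand-in for the parsed ds_config.json; values assume a 4-GPU launch
config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "train_batch_size": 128,
}
print(check_batch_config(config, num_gpus=4))
```

When the config is passed through the Hugging Face Trainer, these fields can also be set to `"auto"` so that they are filled in from `TrainingArguments`.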
FSDP Configuration
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy
)

# FSDP sharding settings
fsdp_config = {
    "fsdp_sharding_strategy": "FULL_SHARD",
    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
    "fsdp_cpu_ram_efficient_loading": True,
    "fsdp_use_orig_params": True
}
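The effect of sharding on per-GPU memory follows from the same 2 + 2 + 12 bytes per parameter used in the memory estimate earlier. The sketch below compares the ZeRO stages under the usual accounting: stage 1 shards optimizer state, stage 2 also shards gradients, and stage 3 also shards weights. FSDP's `FULL_SHARD` behaves like stage 3 in this respect. Activations, buffers, and CPU offload are ignored, and `model_state_per_gpu` is a name introduced here for illustration.

```python
def model_state_per_gpu(params_billion, num_gpus, stage):
    """Approximate model-state memory per GPU in GB for a ZeRO stage (0-3)."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # optimizer state is sharded
    if stage >= 2:
        grads /= num_gpus    # gradients are sharded as well
    if stage >= 3:
        weights /= num_gpus  # weights are sharded as well
    return params_billion * (weights + grads + optim)

for stage in range(4):
    gb = model_state_per_gpu(7, num_gpus=4, stage=stage)
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7B model on 4 GPUs this gives 112 GB without sharding and 28 GB with full sharding, which is why stage 3 or `FULL_SHARD` is the usual choice for full fine-tuning.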
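Hyperparameter Tuning
Learning Rate
# Full fine-tuning usually needs a smaller learning rate than LoRA;
# treat these values as starting points
learning_rates = {
    "7B": 1e-5,   # 7B model
    "13B": 5e-6,  # 13B model
    "30B": 2e-6,  # 30B model
    "65B": 1e-6   # 65B model
}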
Batch Size
# Effective batch size = per_device_batch_size × num_gpus × gradient_accumulation_steps
# Recommended effective batch size: 32-128
effective_batch_size = 4 * 4 * 8  # 128
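When the GPU count or the per-device batch size changes, gradient accumulation is the setting that keeps the effective batch size fixed. A minimal sketch of that calculation follows; `accumulation_steps` is a hypothetical helper, not part of any library.

```python
def accumulation_steps(target_batch, per_device_batch, num_gpus):
    """Gradient accumulation steps needed to reach target_batch exactly."""
    per_step = per_device_batch * num_gpus
    if target_batch % per_step != 0:
        raise ValueError(
            f"target {target_batch} is not a multiple of {per_step} samples per step"
        )
    return target_batch // per_step

# Keep an effective batch of 128 while varying the hardware
for gpus in (2, 4, 8):
    print(f"{gpus} GPUs: accumulate {accumulation_steps(128, 4, gpus)} steps")
```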
Number of Epochs
# Small datasets (<10K samples): 3-5 epochs
# Medium datasets (10K-100K samples): 1-3 epochs
# Large datasets (>100K samples): 1 epoch
Checkpoint Management
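from transformers import TrainerCallback

class CheckpointCallback(TrainerCallback):
    def on_save(self, args, state, control, **kwargs):
        # Export the full weights, tagged with the best metric seen so far
        # (compare against None so that a metric of 0.0 is not skipped)
        if state.best_metric is not None:
            kwargs["model"].save_pretrained(
                f"./best_model_{state.best_metric:.4f}"
            )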
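Full fine-tuning checkpoints are large, so it helps to budget disk space before choosing a save frequency. The estimate below is a rough sketch under stated assumptions: 2 bytes per parameter for half-precision weights, plus 8 bytes per parameter for AdamW's two FP32 moments when optimizer state is saved for resuming. Actual sizes depend on the framework and on whether an FP32 weight copy is stored as well. Both helper names are introduced here for illustration.

```python
def checkpoint_size_gb(params_billion, with_optimizer=True):
    """Approximate on-disk size of one checkpoint in GB."""
    weights = params_billion * 2  # half-precision weights (assumed)
    optimizer = params_billion * 8 if with_optimizer else 0  # two FP32 AdamW moments (assumed)
    return weights + optimizer

def disk_budget_gb(params_billion, num_checkpoints, with_optimizer=True):
    """Disk needed to keep num_checkpoints checkpoints at once."""
    return num_checkpoints * checkpoint_size_gb(params_billion, with_optimizer)

# 7B model, one checkpoint per epoch for 3 epochs
print(f"weights only: {disk_budget_gb(7, 3, with_optimizer=False)} GB")
print(f"resumable:    {disk_budget_gb(7, 3)} GB")
```

`TrainingArguments` has a `save_total_limit` option that caps how many checkpoints are kept on disk.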
Evaluation and Monitoring
import math

import wandb
from transformers import TrainerCallback

# Initialize WandB
wandb.init(project="full_finetune", name="llama2-7b")

# Custom evaluation metrics
class EvalCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if metrics:
            wandb.log({
                "eval_loss": metrics["eval_loss"],
                # eval_loss is in nats, so perplexity is exp(loss), not 2 ** loss
                "perplexity": math.exp(metrics["eval_loss"])
            })
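Perplexity is the exponential of the mean cross-entropy loss. The loss reported by PyTorch-based trainers uses the natural logarithm, so the matching base is e; base 2 applies only when the loss is measured in bits. The sketch below shows that both conversions agree once the units are handled. The loss value 1.8 is an arbitrary example.

```python
import math

def perplexity_from_nats(loss):
    """Perplexity from a cross-entropy loss measured in nats (natural log)."""
    return math.exp(loss)

def perplexity_from_bits(loss_bits):
    """Perplexity from a cross-entropy loss measured in bits (log base 2)."""
    return 2 ** loss_bits

eval_loss = 1.8                      # arbitrary example, in nats
loss_bits = eval_loss / math.log(2)  # the same loss expressed in bits
print(perplexity_from_nats(eval_loss))
print(perplexity_from_bits(loss_bits))
```

Applying `2 ** loss` directly to a loss in nats would understate perplexity, which makes runs look better than they are.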
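Best Practices
- Put data quality first: make sure the training data is both clean and diverse
- Ramp up gradually: warm up with a small learning rate before training at the target rate
- Evaluate regularly: validate after every epoch to catch overfitting
- Save intermediate results: write checkpoints regularly so an interrupted run can resume
- Monitor the training curves: track loss with TensorBoard or WandB
Full fine-tuning is resource-intensive, but it remains the strongest option when maximum task performance is the goal.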