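---
title: "Model Alignment"
description: "A detailed guide to model alignment techniques, including RLHF, DPO, and preference learning"
tags: ["model alignment", "RLHF", "DPO", "preference learning"]
category: "llm"
icon: "🧠"
---
Model Alignment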
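Model alignment is the key technique for ensuring that a large language model behaves in line with human intent and values. Aligning the model's outputs with human preferences improves its safety, helpfulness, and honesty.
RLHF (Reinforcement Learning from Human Feedback)
RLHF is the pioneering approach to alignment and proceeds in three stages:
Stage 1: Supervised Fine-Tuning (SFT)
Fine-tune the base model on high-quality instruction-response data: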
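from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# SFT data: instruction-response pairs
sft_data = [
    {"instruction": "Explain quantum computing", "output": "Quantum computing exploits the superposition and entanglement of qubits..."},
    {"instruction": "Write a poem", "output": "A spring breeze brushes the long willow strands, fine rain wets the flowers and fills the hall with fragrance..."},
]

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 defines no pad token

def format_sft(example):
    return {"text": f"User: {example['instruction']}\nAssistant: {example['output']}"}

def tokenize(example):
    # Trainer expects token IDs, not raw text
    return tokenizer(example["text"], truncation=True, max_length=512)

dataset = Dataset.from_list(sft_data).map(format_sft)
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    save_strategy="epoch",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    # Pads each batch and copies input_ids into labels for causal LM training
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("./sft_model")
tokenizer.save_pretrained("./sft_model")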
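One common refinement, not used in the snippet above, is to compute the loss only on the response tokens so that the model is not trained to reproduce the prompt. The sketch below shows how such labels could be built with plain lists. The helper name `build_labels` and the token IDs are made up for illustration; -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(prompt_ids, response_ids):
    """Return (input_ids, labels) with prompt positions masked out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical token IDs: three prompt tokens followed by three response tokens
input_ids, labels = build_labels([101, 7, 8], [42, 43, 2])
print(input_ids)
print(labels)
```

Only the positions that carry a real token ID in `labels` contribute to the loss; the masked prompt positions still serve as context.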
Stage 2: Reward Model Training
Train a reward model on human preference data so that it learns how humans judge response quality:
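import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        # base_model must expose last_hidden_state (e.g. loaded with AutoModel)
        self.base_model = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids, attention_mask=attention_mask)
        # Read the last non-padding token (assumes right padding);
        # indexing [:, -1] would pick up a padding position instead
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        last_token = outputs.last_hidden_state[batch_idx, last_idx]
        return self.reward_head(last_token).squeeze(-1)

def reward_loss(chosen_rewards, rejected_rewards):
    """Pairwise preference loss: the chosen response should score higher than the rejected one."""
    # logsigmoid is numerically safer than log(sigmoid(x))
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Training data: (prompt, chosen, rejected) triples
preference_data = [
    {
        "prompt": "How should I learn programming?",
        "chosen": "Start with Python and practise two hours a day...",
        "rejected": "Just dabble in it, no need to take it seriously...",
    },
]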
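To make the pairwise loss concrete, the following minimal sketch evaluates it on plain Python floats rather than tensors. The helper name `pairwise_loss` and the sample reward values are illustrative assumptions, not part of the training code above.

```python
import math

def pairwise_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(chosen - rejected))."""
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branching keeps exp() from overflowing
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Made-up reward pairs: correct ranking, tie, and inverted ranking
for chosen, rejected in [(2.0, -1.0), (0.0, 0.0), (-1.0, 2.0)]:
    loss = pairwise_loss(chosen, rejected)
    print(f"margin={chosen - rejected:+.1f}  loss={loss:.4f}")
```

The loss shrinks toward zero as the chosen response outscores the rejected one by a wider margin, equals log 2 for a tie, and grows roughly linearly when the ranking is inverted.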
Stage 3: PPO Reinforcement Learning
Use the PPO algorithm to optimize the language model against the reward model's feedback:
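import torch
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# Written against the legacy TRL PPO API (trl < 0.12);
# later releases changed PPOConfig and PPOTrainer substantially.
config = PPOConfig(
    learning_rate=1.4e-5,
    batch_size=64,
    mini_batch_size=16,
    ppo_epochs=4,
    kl_penalty="kl",  # KL-divergence penalty against the reference model
    init_kl_coef=0.2,
)
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")
ppo_trainer = PPOTrainer(config, ppo_model, tokenizer=tokenizer)

# Assumes `dataloader` yields batches whose "input_ids" is a list of
# config.batch_size 1-D query tensors, and `reward_model` is the trained RewardModel.
for batch in dataloader:
    queries = batch["input_ids"]
    # Generate responses (without the prompt tokens)
    responses = ppo_trainer.generate(queries, return_prompt=False, max_new_tokens=256)
    # Score each prompt + response pair; step() expects a list of scalar tensors
    rewards = []
    for query, response in zip(queries, responses):
        ids = torch.cat([query, response]).unsqueeze(0)
        with torch.no_grad():
            rewards.append(reward_model(ids, torch.ones_like(ids)).squeeze())
    # PPO update
    stats = ppo_trainer.step(queries, responses, rewards)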
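The KL penalty configured above is commonly applied per token: each generated token is penalised by the coefficient times the log-probability gap between the policy and the reference model, and the reward model's score is added at the final token. The sketch below illustrates that shaping with plain Python lists. The function name `shaped_rewards` and all numbers are assumptions for illustration; real trainers work on batched tensors.

```python
def shaped_rewards(policy_logprobs, ref_logprobs, score, kl_coef=0.2):
    """Per-token rewards: -kl_coef * (log pi - log pi_ref), plus the score on the last token."""
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score  # the sequence-level reward lands on the final token
    return rewards

# Made-up per-token log-probabilities for a three-token response
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.5, -0.5, -1.0]
print(shaped_rewards(policy_lp, ref_lp, score=1.0))
```

Tokens where the policy assigns more probability than the reference are penalised, which discourages the policy from drifting far from the SFT model while it chases reward.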
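DPO (Direct Preference Optimization)
DPO skips explicit reward model training and optimizes the policy model directly on preference data, which greatly simplifies the training pipeline: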
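import torch.nn.functional as F
from datasets import Dataset
from transformers import AutoModelForCausalLM
from trl import DPOTrainer, DPOConfig

dpo_config = DPOConfig(
    output_dir="./dpo_model",
    learning_rate=5e-7,
    beta=0.1,  # strength of the implicit KL constraint toward the reference model
    loss_type="sigmoid",
    per_device_train_batch_size=4,
    num_train_epochs=3,
)

def dpo_loss(policy_chosen, policy_rejected, reference_chosen, reference_rejected, beta=0.1):
    """DPO loss; every argument is a tensor of summed sequence log-probabilities."""
    chosen_logratios = policy_chosen - reference_chosen
    rejected_logratios = policy_rejected - reference_rejected
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# DPO starts from the SFT checkpoint, not from the PPO model with its value head
policy_model = AutoModelForCausalLM.from_pretrained("./sft_model")
# Records with "prompt", "chosen" and "rejected" fields
preference_dataset = Dataset.from_list(preference_data)

trainer = DPOTrainer(
    model=policy_model,
    ref_model=None,  # when None, TRL builds the reference from a copy of the initial model
    args=dpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,  # recent TRL releases rename this argument to processing_class
)
trainer.train()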
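As a sanity check on the DPO loss, this sketch evaluates the same formula on scalar sequence log-probabilities using only the standard library. The function name `dpo_loss_scalar` and the log-probability values are hypothetical. When the policy equals the reference, both log-ratios are zero and the loss is log 2.

```python
import math

def dpo_loss_scalar(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Scalar DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # Stable form of -log(sigmoid(x))
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Policy identical to the reference (the state at the start of training)
start = dpo_loss_scalar(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen log-probability and lowered the rejected one
improved = dpo_loss_scalar(-10.0, -18.0, -12.0, -15.0)
print(f"at initialisation: {start:.4f}")
print(f"after shifting probability toward chosen: {improved:.4f}")
```

Because only differences from the reference enter the loss, the absolute log-probabilities do not matter; `beta` scales how strongly a given shift is rewarded.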
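Comparison of Alignment Techniques
| Method | Strengths | Weaknesses | Best suited for |
|---|---|---|---|
| RLHF | Mature, with well-established theory | Complex training that needs three stages | Large-scale alignment |
| DPO | Simpler pipeline, no separate reward model | Training can be unstable | Small to medium-scale alignment |
| RLAIF | Reduces human annotation | May introduce AI bias | When annotation is costly |
| Constitutional AI | Self-alignment against written principles, with little human labelling | Depends on the model's capacity for self-critique | Safety alignment |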
Evaluating Alignment
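def evaluate_alignment(model, tokenizer, test_cases):
    """Evaluate alignment as the mean judge score per category."""
    results = {"helpfulness": [], "safety": [], "honesty": []}
    for case in test_cases:
        inputs = tokenizer(case["prompt"], return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=512)
        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True)
        # evaluate_with_judge is user-supplied; it scores the response with a judge model such as GPT-4
        score = evaluate_with_judge(response, case)
        results[case["category"]].append(score)
    # Skip categories with no test cases to avoid dividing by zero
    return {k: sum(v) / len(v) for k, v in results.items() if v}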
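The evaluation function calls `evaluate_with_judge` without defining it. One possible shape for it is sketched below, assuming the judge is reached through a hypothetical `call_judge` callable that takes a prompt string and returns the judge model's text reply. The template wording, the 1 to 10 scale, and the extra `call_judge` parameter are all assumptions; a stub stands in for the real judge so the example runs offline.

```python
import re

JUDGE_TEMPLATE = (
    "Rate the assistant response for {category} on a scale of 1 to 10.\n"
    "Prompt: {prompt}\nResponse: {response}\n"
    "Answer in the form 'Score: <number>'."
)

def parse_score(judge_output, low=1, high=10):
    """Extract the first 'Score: N' from the reply; None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        return None
    value = int(match.group(1))
    return value if low <= value <= high else None

def evaluate_with_judge(response, case, call_judge):
    """call_judge is a hypothetical str -> str wrapper around the judge model."""
    prompt = JUDGE_TEMPLATE.format(
        category=case["category"], prompt=case["prompt"], response=response
    )
    return parse_score(call_judge(prompt))

# Stub judge for demonstration only; a real one would query the judge model
def fake_judge(prompt):
    return "Score: 8"

case = {"prompt": "How should I learn programming?", "category": "helpfulness"}
print(evaluate_with_judge("Start with Python and practise daily.", case, fake_judge))
```

To match the two-argument call in the evaluation loop, the judge callable could be bound in advance with `functools.partial`. Replies that cannot be parsed return `None` and should be filtered out before averaging.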
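Model alignment is a core technique in AI safety. As model capabilities grow, alignment methods must keep evolving to meet new safety challenges.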