Model Alignment

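Model alignment is the key technique for ensuring that a large language model's behavior matches human intent and values. Aligning the model's outputs with human preferences improves its safety, helpfulness, and honesty.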

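RLHF (Reinforcement Learning from Human Feedback)

RLHF is the pioneering alignment method and proceeds in three stages:

Stage 1: Supervised Fine-Tuning (SFT)

Fine-tune the base model on high-quality instruction-response data: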

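from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# Prepare SFT data
sft_data = [
    {"instruction": "Explain quantum computing", "output": "Quantum computing uses the superposition and entanglement of qubits..."},
    {"instruction": "Write a poem", "output": "The spring breeze brushes my face, the willow strands hang long; fine rain wets the blossoms, their scent fills the hall..."},
]

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

def format_sft(example):
    # Join instruction and reply into one training string, then tokenize it
    text = f"User: {example['instruction']}\nAssistant: {example['output']}"
    return tokenizer(text, truncation=True, max_length=1024)

# Trainer needs token IDs, so drop the raw text columns after tokenizing
dataset = Dataset.from_list(sft_data).map(format_sft, remove_columns=["instruction", "output"])

training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    # Pads each batch and builds the causal LM labels from input_ids
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("./sft_model")  # write final weights to the output root for the later stages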
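
The example above trains on the whole formatted string, prompt included. A common refinement, which the article itself does not use, is to compute the loss only on the reply tokens. The sketch below shows the idea in plain Python; the helper name `build_labels` and the token IDs are made up for illustration, and `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(prompt_ids, reply_ids):
    """Return (input_ids, labels) with prompt positions masked out of the loss."""
    input_ids = list(prompt_ids) + list(reply_ids)
    # Prompt tokens get IGNORE_INDEX so only reply tokens contribute to the loss
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(reply_ids)
    return input_ids, labels

# Hypothetical token IDs: three prompt tokens followed by three reply tokens
input_ids, labels = build_labels([101, 7, 8], [42, 43, 2])
print(input_ids)
print(labels)
```

Whether masking helps depends on the data; with very short prompts the difference is usually small.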

Stage 2: Reward Model Training

Train a reward model on human preference data so that it learns how humans judge the quality of a reply:

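import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scalar reward head on top of a transformer backbone (e.g. one loaded with AutoModel)."""

    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state
        # Read the last non-padding token of each sequence (assumes right padding)
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_token = hidden[batch_idx, last_idx]
        return self.reward_head(last_token).squeeze(-1)

def reward_loss(chosen_rewards, rejected_rewards):
    """Pairwise preference loss: the chosen reply should score higher than the rejected one."""
    # logsigmoid is numerically more stable than log(sigmoid(x))
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Training data: (prompt, chosen, rejected) triples
preference_data = [
    {
        "prompt": "How should I learn programming?",
        "chosen": "Start with Python and practice for two hours every day...",
        "rejected": "Just dabble in it, no need to take it seriously..."
    },
]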
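
To make the pairwise loss concrete, the following sketch evaluates it for a single preference pair at a few reward margins, using only the standard library. The function name `pairwise_loss` and the margin values are illustrative. When the two rewards are equal the loss is ln 2 (about 0.693), and it shrinks as the chosen reply pulls ahead.

```python
import math

def pairwise_loss(chosen_reward, rejected_reward):
    """-log(sigmoid(chosen - rejected)) for a single preference pair."""
    margin = chosen_reward - rejected_reward
    # log(1 + exp(-margin)) is algebraically equal to -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# Rejected reward fixed at 0, so the chosen reward is the margin itself
for margin in (-2.0, 0.0, 2.0):
    print(f"margin={margin:+.1f}  loss={pairwise_loss(margin, 0.0):.4f}")
```

A negative margin means the reward model currently ranks the pair the wrong way round, which is where the loss and its gradient are largest.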

Stage 3: PPO Reinforcement Learning

Use the PPO algorithm to optimize the language model against the reward model's feedback:

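import torch
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# Note: this uses the legacy TRL PPO API (PPOTrainer.step), which later
# TRL releases replaced; pin the TRL version accordingly.
config = PPOConfig(
    learning_rate=1.4e-5,
    batch_size=64,
    mini_batch_size=16,
    ppo_epochs=4,
    kl_penalty="kl",   # KL divergence penalty
    init_kl_coef=0.2,  # initial weight of the KL penalty
)

ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")
# With no ref_model given, the trainer keeps a frozen copy of ppo_model as the reference
ppo_trainer = PPOTrainer(config, ppo_model, tokenizer=tokenizer)

# dataloader is assumed to yield batches of config.batch_size prompts whose
# "input_ids" entry is a list of 1-D prompt tensors
for batch in dataloader:
    query_tensors = batch["input_ids"]
    # Generate replies (prompt tokens excluded from the returned tensors)
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, max_new_tokens=256
    )
    # Score each prompt + reply with the reward model
    rewards = []
    for query, response in zip(query_tensors, response_tensors):
        ids = torch.cat([query, response]).unsqueeze(0)
        with torch.no_grad():
            rewards.append(reward_model(ids, torch.ones_like(ids)).squeeze())
    # PPO update
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)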
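
The `init_kl_coef` setting above controls how strongly the policy is penalized for drifting away from the reference model. The sketch below shows one common way RLHF implementations combine the two signals: every token receives a penalty proportional to the log-probability gap between policy and reference, and the reward model's score is added on the final token. The function name `shaped_rewards` and the numbers are illustrative assumptions, not TRL's internal code.

```python
def shaped_rewards(score, policy_logprobs, ref_logprobs, kl_coef=0.2):
    """Per-token rewards: KL penalty on every token, reward-model score on the last one."""
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    # (policy - ref) is a per-token estimate of the KL divergence
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score  # the sequence-level score arrives at the end of the reply
    return rewards

# Hypothetical log-probabilities for a three-token reply
rewards = shaped_rewards(1.0, [-1.0, -0.5, -2.0], [-1.2, -0.5, -1.5])
print(rewards)
```

Tokens that the policy makes more likely than the reference does are penalized, and tokens it makes less likely receive a small bonus, so the total penalty tracks how far the reply has moved from the reference distribution.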

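DPO (Direct Preference Optimization)

DPO skips explicit reward model training and optimizes the policy model directly on preference data, which greatly simplifies the training pipeline: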

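import torch.nn.functional as F
from datasets import Dataset
from transformers import AutoModelForCausalLM
from trl import DPOTrainer, DPOConfig

dpo_config = DPOConfig(
    output_dir="./dpo_model",
    learning_rate=5e-7,
    beta=0.1,  # strength of the implicit KL constraint toward the reference model
    loss_type="sigmoid",
    per_device_train_batch_size=4,
    num_train_epochs=3,
)

def dpo_loss(policy_chosen, policy_rejected, reference_chosen, reference_rejected, beta=0.1):
    """DPO loss. Inputs are the log-probabilities of each reply under the policy and reference models.

    Shown for illustration only; DPOTrainer computes this loss internally.
    """
    chosen_logratios = policy_chosen - reference_chosen
    rejected_logratios = policy_rejected - reference_rejected
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# DPO starts from the SFT checkpoint (a plain causal LM, not the PPO value-head model)
policy_model = AutoModelForCausalLM.from_pretrained("./sft_model")
preference_dataset = Dataset.from_list(preference_data)  # columns: prompt, chosen, rejected

trainer = DPOTrainer(
    model=policy_model,
    ref_model=None,  # with None, TRL builds the reference from the initial policy weights
    args=dpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,  # named processing_class in newer TRL releases
)
trainer.train()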
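
The role of the reference model in the DPO loss is easiest to see with numbers. The sketch below evaluates the same formula for a single pair in plain Python; the function name `dpo_pair_loss` and the log-probability values are hypothetical. At the start of training the policy equals the reference, both log-ratios are zero, and the loss is ln 2 regardless of how likely each reply is in absolute terms.

```python
import math

def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))

# Policy identical to the reference: logits are 0, so the loss is ln 2
start = dpo_pair_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen reply and lowered the rejected one relative to the reference
improved = dpo_pair_loss(-8.0, -14.0, -10.0, -12.0)
print(f"start={start:.4f}  improved={improved:.4f}")
```

Only the change relative to the reference matters, which is how DPO keeps the policy close to its starting point without a separate KL term.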

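Comparison of Alignment Techniques

| Method | Strengths | Weaknesses | Best suited for |
| --- | --- | --- | --- |
| RLHF | Mature results, well-studied theory | Complex training, requires three stages | Large-scale alignment |
| DPO | Simpler pipeline, no reward model needed | Training can be unstable | Small- to medium-scale alignment |
| RLAIF | Reduces human annotation | May introduce AI bias | When annotation is expensive |
| Constitutional AI | Self-alignment from written principles, little human labeling | Depends on the model's capacity for self-critique | Safety alignment |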

Evaluating Alignment

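def evaluate_alignment(model, tokenizer, test_cases):
    """Evaluate alignment; returns the mean judge score per category."""
    results = {"helpfulness": [], "safety": [], "honesty": []}

    for case in test_cases:
        inputs = tokenizer(case["prompt"], return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=512)
        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True)

        # Use GPT-4 as the judge model. evaluate_with_judge is not defined
        # here and must be supplied by the evaluation setup.
        score = evaluate_with_judge(response, case)
        results[case["category"]].append(score)

    # Skip categories with no test cases to avoid dividing by zero
    return {k: sum(v) / len(v) for k, v in results.items() if v}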
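
The `evaluate_with_judge` helper is left undefined in the article. One plausible structure is to build a grading prompt, send it to the judge model, and parse a numeric score from the reply. The sketch below covers only the prompt construction and the parsing. The names `build_judge_prompt` and `parse_judge_score`, the 1 to 10 scale, and the `Score: N` reply format are all assumptions, and the call to the judge model itself is deliberately omitted.

```python
import re

def build_judge_prompt(response, case):
    """Format a grading request for a judge model (prompt wording is illustrative)."""
    return (
        f"Rate the following reply for {case['category']} on a scale of 1 to 10.\n"
        f"Question: {case['prompt']}\n"
        f"Reply: {response}\n"
        "Answer in the form 'Score: N'."
    )

def parse_judge_score(judge_reply, lo=1, hi=10):
    """Extract the integer after 'Score:'; return None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None

case = {"prompt": "How should I learn programming?", "category": "helpfulness"}
print(build_judge_prompt("Start with Python.", case))
print(parse_judge_score("The reply is concrete and actionable. Score: 8"))
```

Returning `None` for unparseable replies lets the caller drop or retry those cases instead of averaging in a bogus score.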

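Model alignment is a core technique of AI safety. As model capabilities grow, alignment methods must keep evolving to meet new safety challenges.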