Preference Data: Building Human Preference Data to Optimize LLM Behavior

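What Is Preference Data

Preference data is a data format that records which of a model's outputs humans prefer relative to one another. Unlike conventional supervised-learning data, it does not supply a single "correct answer"; instead, it labels which of several outputs for the same input is better. This format has become a core resource for alignment training of modern large language models.

The basic form is a triple (prompt, chosen, rejected), where chosen is the response the human preferred and rejected is the response they did not.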
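To make the triple concrete, here is a minimal sketch of one way to represent it in Python. The class name `PreferencePair` and its `to_json` helper are illustrative assumptions, not part of any particular library.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class PreferencePair:
    """One preference record: a prompt plus the preferred and dispreferred response."""
    prompt: str
    chosen: str
    rejected: str

    def __post_init__(self):
        # A pair carries no preference signal if both responses are identical
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected must differ")

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-ASCII text readable in the output
        return json.dumps(asdict(self), ensure_ascii=False)

pair = PreferencePair(
    prompt="Explain the basic principles of quantum computing",
    chosen="Quantum computing uses superposition and entanglement to ...",
    rejected="Quantum computing just means computing with a quantum computer.",
)
print(pair.to_json())
```

Rejecting identical responses at construction time is a design choice for this sketch; a pipeline could equally filter such pairs later.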

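Why Preference Data Matters

Preference data plays a key role in the LLM training pipeline:

Alignment training: bring model outputs in line with human values and expectations
Safety: steer the model toward refusing harmful requests
Quality control: reduce the likelihood of inaccurate or unhelpful output
Personalization: adjust the model's behavior to the preferences of different user groups

Companies such as OpenAI, Anthropic, and Google make heavy use of preference data in model training, through reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO).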
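As a rough illustration of how a single (prompt, chosen, rejected) triple is consumed by DPO, the sketch below computes the standard DPO loss for one pair. The function name, argument names, and the default `beta=0.1` are assumptions made for this example; the log-probabilities would come from the policy being trained and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response,
    under either the policy or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), written in a numerically stable form
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# The policy favors the chosen response more than the reference does,
# so the loss falls below log(2), its value when the two models agree.
print(dpo_loss(-10.0, -20.0, -15.0, -15.0))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.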

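How Preference Data Is Built

1. Data Collection

Start by collecting a large and varied set of prompts. They should cover the range of situations in the target application: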

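# Prompt templates grouped by category; each {placeholder} is a slot to fill in later
prompt_categories = {
    "creative_writing": [
        "Write a poem about {topic}",
        "Write a short story from the perspective of {character}",
        "Write ad copy in a {style} style"
    ],
    "factual_qa": [
        "When did {event} happen?",
        "Explain the basic principles of {concept}",
        "What is the difference between {tech_a} and {tech_b}?"
    ],
    "coding": [
        "Implement {feature} in Python",
        "Debug the following code: {code}",
        "Optimize the performance of {algorithm}"
    ]
}

def generate_prompts(categories, n_per_category=100):
    """Build prompt records by cycling through each category's templates.

    The templates are stored unfilled: every {placeholder} still has to be
    replaced with a concrete value before the prompt is sent to a model.
    """
    prompts = []
    for category, templates in categories.items():
        for i in range(n_per_category):
            # Round-robin over the templates of this category
            template = templates[i % len(templates)]
            prompts.append({
                "id": f"{category}_{i}",
                "category": category,
                "prompt": template
            })
    return prompts

prompts = generate_prompts(prompt_categories)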
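The templates above still contain unfilled placeholders. One possible way to fill them is sketched below; the `slot_values` pool and the `fill_template` helper are hypothetical names introduced for this example, and the two templates are repeated here so the snippet runs on its own.

```python
import random

# Hypothetical slot values; a real project would draw these from a curated pool
slot_values = {
    "topic": ["autumn", "the sea", "city nights"],
    "concept": ["gradient descent", "public-key cryptography"],
}

templates = [
    "Write a poem about {topic}",
    "Explain the basic principles of {concept}",
]

def fill_template(template, slots, rng):
    """Replace each {placeholder} in the template with a randomly drawn value."""
    drawn = {name: rng.choice(values) for name, values in slots.items()}
    # str.format ignores unused keyword arguments, so extra slots are harmless
    return template.format(**drawn)

rng = random.Random(0)  # fixed seed so the sample is reproducible
filled = [fill_template(t, slot_values, rng) for t in templates]
for prompt in filled:
    print(prompt)
```

A template that names a slot missing from `slot_values` raises `KeyError`, which surfaces incomplete slot pools early.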

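2. Generating Model Responses

Generate candidate responses using several different models, or the same model with different sampling parameters: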

from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def generate_candidates(prompt, n_responses=4, temperature_range=(0.3, 1.2)):
    """Generate several candidate responses for one prompt"""
    candidates = []
    low, high = temperature_range

    for i in range(n_responses):
        # Spread temperatures evenly from low to high; guard the single-response
        # case, where the interpolation would otherwise divide by zero
        step = i / (n_responses - 1) if n_responses > 1 else 0.0
        temp = low + (high - low) * step

        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,
            max_tokens=512
        )

        candidates.append({
            "model": "gpt-4",
            "temperature": temp,
            "response": response.choices[0].message.content
        })

    return candidates

# Generate candidate responses in batch
dataset = []
for item in prompts[:10]:  # first 10 prompts as an example
    candidates = generate_candidates(item["prompt"])
    dataset.append({
        "prompt": item["prompt"],
        "candidates": candidates
    })
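Once each prompt has several candidates, they need to be turned into comparison pairs for annotators. The sketch below enumerates every unordered pair; `build_comparison_pairs` and the output field names are assumptions for illustration, and the candidates only need a `response` key as in the records above.

```python
from itertools import combinations

def build_comparison_pairs(prompt, candidates):
    """Enumerate every unordered pair of distinct candidate responses."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(candidates), 2):
        if a["response"] == b["response"]:
            continue  # identical texts give annotators nothing to compare
        pairs.append({
            "prompt": prompt,
            "candidate_ids": (i, j),
            "response_a": a["response"],
            "response_b": b["response"],
        })
    return pairs

example_candidates = [
    {"response": "Answer one"},
    {"response": "Answer two"},
    {"response": "Answer three"},
    {"response": "Answer one"},  # duplicate of the first candidate
]
pairs = build_comparison_pairs("Explain recursion", example_candidates)
print(len(pairs))
```

With n distinct candidates this yields n(n-1)/2 pairs, so the annotation cost grows quickly; sampling a subset of pairs is a common way to keep it manageable.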

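3. Human Annotation Workflow

Annotation is the most critical step in building preference data. Annotators compare the candidate responses in pairs: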

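import random

class PreferenceAnnotation:
    """Preference annotation workflow"""

    def __init__(self, annotation_guidelines):
        self.guidelines = annotation_guidelines

    def create_annotation_task(self, prompt, candidates):
        """Create an annotation task"""
        # Shuffle a copy of the candidates to avoid position bias
        shuffled = candidates.copy()
        random.shuffle(shuffled)

        return {
            "prompt": prompt,
            "candidates": shuffled,
            "criteria": self.guidelines["criteria"],
            "instructions": self.guidelines["instructions"]
        }

    def validate_annotation(self, annotation):
        """Check that an annotation contains every required field"""
        required_fields = ["prompt_id", "chosen_id", "rejected_id", "reason"]
        return all(field in annotation for field in required_fields)

    def calculate_agreement(self, annotations_per_item):
        """Compute raw agreement between the first two annotators of each item.

        This is the observed agreement rate, not Cohen's kappa: it does not
        correct for the agreement expected by chance.
        """
        agreements = 0
        compared = 0  # only items with at least two annotations are counted

        for item_id, annotations in annotations_per_item.items():
            if len(annotations) >= 2:
                compared += 1
                if annotations[0]["choice"] == annotations[1]["choice"]:
                    agreements += 1

        return agreements / compared if compared > 0 else 0

# Annotation guidelines
guidelines = {
    "criteria": ["accuracy", "helpfulness", "safety", "language quality"],
    "instructions": "Compare the two responses and choose the one you think is better. Consider whether the information is accurate, whether the answer is complete and helpful, whether it is safe and harmless, and whether the language is fluent and natural."
}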
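For a chance-corrected measure of inter-annotator agreement, Cohen's kappa can be computed directly from two annotators' labels. The following is a minimal sketch for exactly two annotators who labeled the same items in the same order; the function name `cohens_kappa` and the label values are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators gave the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with their own label rates
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "B", "B", "B"]
print(cohens_kappa(annotator_1, annotator_2))
```

With more than two annotators per item, a multi-rater statistic such as Fleiss' kappa would be the usual choice instead.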

4. Quality Control

Ensuring the quality of preference data takes several layers of quality control:

from collections import Counter

class QualityControl:
    """Data quality control"""

    def __init__(self):
        self.min_annotators = 3
        # With exactly 3 annotators, 0.7 requires a unanimous vote (2/3 falls short)
        self.agreement_threshold = 0.7

    def check_annotation_consensus(self, annotations):
        """Check consensus; returns (passed, agreement_ratio)"""
        if len(annotations) < self.min_annotators:
            return False, 0.0  # too few annotators

        # Share of the votes held by the most common choice
        counts = Counter(a["choice"] for a in annotations)
        agreement_ratio = counts.most_common(1)[0][1] / len(annotations)

        return agreement_ratio >= self.agreement_threshold, agreement_ratio

    def filter_low_quality(self, dataset):
        """Keep only the items whose annotations reach consensus"""
        filtered = []
        for item in dataset:
            consensus, ratio = self.check_annotation_consensus(item["annotations"])
            if consensus:
                # Take the majority-vote result
                counts = Counter(a["choice"] for a in item["annotations"])
                item["final_choice"] = counts.most_common(1)[0][0]
                item["agreement_ratio"] = ratio
                filtered.append(item)
        return filtered

    def detect_bias(self, dataset):
        """Detect position bias; returns (share of first-position picks, flagged)"""
        if not dataset:
            return 0.5, False  # nothing to measure

        first = sum(1 for item in dataset if item["chosen_position"] == "first")
        bias_ratio = first / len(dataset)
        # Flag the dataset when one position wins more than 60% of the time
        return bias_ratio, bias_ratio > 0.6 or bias_ratio < 0.4
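After quality control, the surviving comparisons still have to be converted into (prompt, chosen, rejected) records. The sketch below shows one way to do that and serialize the result as JSONL. The input layout (`response_a`, `response_b`, and a `votes` list of "a"/"b") and the function names are assumptions for this example, not a fixed schema.

```python
import json
from collections import Counter

def to_preference_records(items, min_agreement=0.7):
    """Turn voted comparisons into (prompt, chosen, rejected) records."""
    records = []
    for item in items:
        if not item["votes"]:
            continue  # no annotations to aggregate
        winner, count = Counter(item["votes"]).most_common(1)[0]
        if count / len(item["votes"]) < min_agreement:
            continue  # annotators disagree too much to trust the label
        loser = "b" if winner == "a" else "a"
        records.append({
            "prompt": item["prompt"],
            "chosen": item[f"response_{winner}"],
            "rejected": item[f"response_{loser}"],
        })
    return records

def dump_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

items = [
    {"prompt": "p1", "response_a": "good", "response_b": "bad", "votes": ["a", "a", "a"]},
    {"prompt": "p2", "response_a": "x", "response_b": "y", "votes": ["a", "b", "b"]},
    {"prompt": "p3", "response_a": "weak", "response_b": "strong", "votes": ["b", "b", "b", "b"]},
]
records = to_preference_records(items)
print(dump_jsonl(records))
```

Here the second item is dropped because a 2-to-1 vote falls below the 0.7 threshold.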

Preference Data Formats

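OpenAI Format (preference fine-tuning)

{
  "input": {
    "messages": [
      {"role": "user", "content": "Explain the basic principles of quantum computing"}
    ]
  },
  "preferred_output": [
    {"role": "assistant", "content": "Quantum computing uses the quantum-mechanical principles of superposition and entanglement to perform computation..."}
  ],
  "non_preferred_output": [
    {"role": "assistant", "content": "Quantum computing just means computing stuff with a quantum computer..."}
  ]
}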

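Anthropic HH Format

{
  "chosen": "\n\nHuman: How do I learn programming?\n\nAssistant: You can start with the following steps:\n1. Pick a beginner-friendly language such as Python\n...",
  "rejected": "\n\nHuman: How do I learn programming?\n\nAssistant: Just find someone to teach you."
}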

DPO Training Data Format

{
  "system": "You are a helpful assistant",
  "prompt": "Write a sorting algorithm",
  "chosen": "```python\ndef quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n    pivot = arr[0]\n    left = [x for x in arr[1:] if x <= pivot]\n    right = [x for x in arr[1:] if x > pivot]\n    return quicksort(left) + [pivot] + quicksort(right)\n```",
  "rejected": "Sorting means arranging numbers from smallest to largest. You can use bubble sort, which compares adjacent pairs and swaps them."
}
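Before training, it is worth checking that every record actually matches the expected layout. The sketch below validates records in the flat prompt / chosen / rejected layout used by the DPO example above; `validate_record` and `load_valid_records` are hypothetical helpers, and other layouts would need their own checks.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_record(record):
    """Return a list of problems found in one preference record (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems

def load_valid_records(jsonl_text):
    """Parse JSONL text, keeping only the records that pass validation."""
    valid = []
    for line in jsonl_text.splitlines():
        if line.strip():  # skip blank lines
            record = json.loads(line)
            if not validate_record(record):
                valid.append(record)
    return valid

sample = "\n".join([
    json.dumps({"prompt": "p1", "chosen": "good", "rejected": "bad"}),
    json.dumps({"prompt": "p2", "chosen": "same", "rejected": "same"}),
    json.dumps({"prompt": "p3", "chosen": "only chosen"}),
])
print(load_valid_records(sample))
```

Returning a list of problems rather than a boolean makes it easier to log why a record was dropped.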

Common Problems with Preference Data

Position Bias

Annotators tend to favor the option that is presented first. One mitigation:

def mitigate_position_bias(dataset):
    """Mitigate position bias by presenting each pair in both orders"""
    augmented = []
    for item in dataset:
        # Forward order: the chosen response is shown first
        augmented.append({
            "prompt": item["prompt"],
            "option_a": item["chosen"],
            "option_b": item["rejected"],
            "position_mapping": {"a": "chosen", "b": "rejected"}
        })
        # Reverse order: the rejected response is shown first
        augmented.append({
            "prompt": item["prompt"],
            "option_a": item["rejected"],
            "option_b": item["chosen"],
            "position_mapping": {"a": "rejected", "b": "chosen"}
        })
    return augmented
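Presenting both orders only helps if the two answers are then reconciled. A minimal sketch, assuming the same annotator sees the forward and the reverse version and picks "a" or "b" each time; `reconcile_orders` is a hypothetical helper that reuses the `position_mapping` convention above.

```python
def reconcile_orders(forward_pick, reverse_pick):
    """Combine one annotator's picks from the two presentation orders.

    forward_pick: "a" or "b" when option_a held the chosen response
    reverse_pick: "a" or "b" when option_a held the rejected response
    Returns "chosen" or "rejected" if both picks name the same underlying
    response, or None if the pick followed the position instead.
    """
    forward_mapping = {"a": "chosen", "b": "rejected"}
    reverse_mapping = {"a": "rejected", "b": "chosen"}
    first = forward_mapping[forward_pick]
    second = reverse_mapping[reverse_pick]
    # Same underlying response both times: the preference is order-independent
    return first if first == second else None

print(reconcile_orders("a", "b"))  # same response picked in both orders
print(reconcile_orders("a", "a"))  # first slot picked both times
```

Pairs that reconcile to None are candidates for re-annotation or removal, since the answer tracked the slot rather than the content.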

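Inter-Annotator Disagreement

Different annotators may prefer different responses in the same pair. Multiple annotators and agreement checks address this: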

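from collections import Counter

def resolve_annotation_disagreement(annotations, method="majority"):
    """Resolve disagreement between annotators; returns None if unresolved"""
    if method == "majority":
        counts = Counter(a["choice"] for a in annotations).most_common()
        # No annotations, or a tie for first place: there is no clear majority
        if not counts or (len(counts) > 1 and counts[0][1] == counts[1][1]):
            return None
        return counts[0][0]
    elif method == "expert":
        # Defer to the first expert annotator, if there is one
        expert_annotations = [a for a in annotations if a.get("is_expert")]
        if expert_annotations:
            return expert_annotations[0]["choice"]
    return None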

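Best Practices

Diverse prompts: cover as many scenarios and topics as possible
Multiple annotators: at least 3 independent annotations per item
Counterbalanced presentation: show each pair in both orders to reduce position bias
Continuous iteration: adjust the annotation strategy based on feedback from model training
Documentation: record the annotation guidelines and decisions in detail
Data balance: keep a reasonable ratio of positive to negative samples
Privacy: remove sensitive personal information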
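As one illustration of the privacy item, the sketch below redacts email addresses and phone-like digit runs with regular expressions. The patterns and the `redact_pii` name are assumptions for this example and are deliberately loose; real pipelines typically combine such rules with dedicated PII detection and manual review.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose pattern: at least 7 digits, optionally separated by spaces, dots, or dashes
PHONE_RE = re.compile(r"\+?\d(?:[\s.-]?\d){6,}")

def redact_pii(text):
    """Replace email addresses and phone-like numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

print(redact_pii("Contact alice@example.com or +1 415-555-0100 today"))
```

The same function can be applied to the prompt, chosen, and rejected fields of every record before the dataset is shared.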

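The quality of preference data directly determines how well alignment training works. Investing in high-quality preference data is a key step toward a strong LLM.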