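---
title: "LLM Evaluation Basics"
description: "An introduction to evaluation methods, metrics, and benchmarks for large language models"
tags: ["LLM evaluation", "benchmarks", "NLP", "model evaluation"]
category: "llm"
icon: "🧠"
---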
LLM Evaluation Basics
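Why Evaluate?
Evaluation is the key step for measuring LLM performance, choosing a suitable model, and improving the quality of the overall system.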
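Evaluation Dimensions
- **Language ability**: fluency, coherence, grammatical correctness
- **Knowledge**: factual accuracy, reasoning ability
- **Safety**: harmful content, bias
- **Efficiency**: latency, throughput, cost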
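Human Evaluation
def human_evaluate(responses, criteria):
    """Interactively collect a 1-5 score per criterion for each response."""
    results = []
    for response in responses:
        scores = {}
        for criterion in criteria:
            score = input(f"Rate the {criterion} of the following response (1-5):\n{response}\n")
            # int() raises ValueError if the rater types a non-integer
            scores[criterion] = int(score)
        results.append(scores)
    return results

criteria = ["accuracy", "fluency", "helpfulness"]
# scores = human_evaluate(["response 1", "response 2"], criteria)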
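
Raw per-response ratings are usually condensed into one number per criterion before models are compared. The following is a minimal sketch of that step, assuming the list-of-dicts shape returned by `human_evaluate`; the helper name `summarize_scores` and the sample ratings are illustrative and not part of the original code.

```python
from statistics import mean

def summarize_scores(results, criteria):
    """Average each criterion across all rated responses."""
    # Raises statistics.StatisticsError if results is empty
    return {c: mean(r[c] for r in results) for c in criteria}

# Hypothetical ratings in the shape returned by human_evaluate
results = [
    {"accuracy": 4, "fluency": 5},
    {"accuracy": 2, "fluency": 4},
]
summary = summarize_scores(results, ["accuracy", "fluency"])
print(summary)
```

With more than one rater, it is also worth checking how much the raters agree before trusting the averages.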
Automatic Evaluation Metrics
from collections import Counter

def bleu_score(reference, hypothesis, n=1):
    """Clipped n-gram precision, the core term of BLEU (no brevity penalty)."""
    # Whitespace tokenization; text without spaces needs a real tokenizer
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    ref_ngrams = Counter(tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens)-n+1))
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i+n]) for i in range(len(hyp_tokens)-n+1))
    # Counter intersection clips each n-gram's count at its count in the reference
    matches = sum((hyp_ngrams & ref_ngrams).values())
    total = sum(hyp_ngrams.values())
    return matches / total if total > 0 else 0

def rouge_1(reference, hypothesis):
    """Simplified ROUGE-1 recall over unique unigrams (repeated tokens count once)."""
    ref_tokens = set(reference.split())
    hyp_tokens = set(hypothesis.split())
    intersection = ref_tokens & hyp_tokens
    return len(intersection) / len(ref_tokens) if ref_tokens else 0
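
The `bleu_score` function above returns a single n-gram precision. BLEU as usually defined also takes the geometric mean of the precisions for n = 1 to 4 and multiplies it by a brevity penalty, which keeps very short hypotheses from scoring well. The block below is a minimal, unsmoothed sentence-level sketch of that combination. The names `ngram_counts` and `sentence_bleu` are introduced here for illustration, and for reported results an established implementation is preferable.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace tokens."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not hyp_tokens:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hyp_tokens, n)
        ref_ngrams = ngram_counts(ref_tokens, n)
        matches = sum((hyp_ngrams & ref_ngrams).values())
        total = sum(hyp_ngrams.values())
        # Without smoothing, one order with no match zeroes the whole score
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))

    # Brevity penalty: 1 if the hypothesis is longer than the reference,
    # otherwise exp(1 - r/c) with r = reference length, c = hypothesis length
    r, c = len(ref_tokens), len(hyp_tokens)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

    # Geometric mean of the precisions, computed in log space
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

Because this sketch is unsmoothed, any hypothesis shorter than `max_n` tokens scores 0; practical implementations apply a smoothing method for short sentences.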
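Using an LLM to Evaluate an LLM
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def llm_evaluate(question, answer, criteria="accuracy, completeness, fluency"):
    """Ask a judge model to score an answer and explain its rating."""
    prompt = f"""Please evaluate the quality of the following answer.
Question: {question}
Answer: {answer}
Evaluation criteria: {criteria}
Give a score from 1 to 10 and briefly explain your reasoning."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    # The judge replies in free text; the score is not parsed here
    return response.choices[0].message.content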
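
`llm_evaluate` returns the judge's reply as free text, so comparing models across many questions requires pulling a number out of that text. The sketch below shows one simple way to do this, assuming the reply states the score as a standalone integer before any other in-range number. `extract_score` is a hypothetical helper, not part of the OpenAI library.

```python
import re

def extract_score(reply, low=1, high=10):
    """Return the first integer in [low, high] found in the reply, or None."""
    for match in re.finditer(r"\d+", reply):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None

# Hypothetical judge replies
print(extract_score("Score: 8/10. The answer is accurate but brief."))
print(extract_score("The answer is off-topic."))
```

This heuristic misreads replies that mention another number first (for example "on a 1-10 scale, I give 8"). Asking the judge for a fixed output format makes parsing more reliable.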
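Common Benchmarks
| Benchmark | Capability Evaluated | Difficulty |
|---|---|---|
| MMLU | Breadth of knowledge | Medium |
| GSM8K | Mathematical reasoning | Medium |
| HumanEval | Code generation | Hard |
| TruthfulQA | Truthfulness | Hard |
| MT-Bench | Dialogue quality | Medium |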
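
Benchmarks with a single correct answer per item, such as MMLU and GSM8K, are commonly scored as accuracy against reference answers. The following is a minimal sketch of that scoring, assuming the model's final answers have already been extracted as strings. The function name `exact_match_accuracy` and the sample data are illustrative and do not come from any benchmark harness.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical final answers for three math problems
predictions = ["72", " 10 ", "5"]
references = ["72", "10", "6"]
accuracy = exact_match_accuracy(predictions, references)
print(f"accuracy = {accuracy:.2f}")
```

Code and dialogue benchmarks such as HumanEval and MT-Bench use other scoring schemes, so this sketch does not apply to them directly.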
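Summary
LLM evaluation is a multi-dimensional, systematic effort. Measuring model performance comprehensively requires combining human evaluation, automatic metrics, and benchmarks.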