
LLM Evaluation Basics




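Why Evaluate?

Evaluation is the key step for measuring LLM performance, choosing a suitable model, and optimizing how well the overall system works.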

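Evaluation Dimensions

  1. Language ability: fluency, coherence, grammatical correctness
  2. Knowledge ability: factual accuracy, reasoning ability
  3. Safety: harmful content, bias
  4. Efficiency: latency, throughput, cost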
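Of the four dimensions, efficiency is the easiest to measure directly. The following is a minimal sketch of timing a generation function, not a production benchmark harness. `generate` and `fake_generate` are hypothetical stand-ins for a real model call, and counting tokens by whitespace splitting is an assumption made only to keep the example self-contained.

```python
import time

def measure_efficiency(generate, prompts):
    """Time a text-generation callable over a list of prompts.

    `generate` is a hypothetical function: prompt (str) -> output text (str).
    Tokens are approximated by whitespace splitting, which is an assumption.
    """
    latencies = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(output.split())

    total_time = sum(latencies)
    return {
        "mean_latency_s": total_time / len(latencies) if latencies else 0.0,
        "tokens_per_s": total_tokens / total_time if total_time > 0 else 0.0,
        "total_tokens": total_tokens,
    }

def fake_generate(prompt):
    # Stand-in for a real model call, used only for illustration.
    time.sleep(0.01)
    return "echo " + prompt

stats = measure_efficiency(fake_generate, ["hello world", "how are you"])
print(stats)
```

With a real model, the tokenizer's own token count would likely be a better basis for throughput than whitespace splitting.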

Human Evaluation

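def human_evaluate(responses, criteria):
    """Collect a 1-5 score from a human rater for every response and criterion."""
    results = []
    for response in responses:
        scores = {}
        for criterion in criteria:
            # Prompt the rater on the console; int() raises ValueError on non-numeric input.
            score = input(f"Rate the {criterion} of the following answer (1-5):\n{response}\n")
            scores[criterion] = int(score)
        results.append(scores)
    return results

criteria = ["accuracy", "fluency", "usefulness"]
# scores = human_evaluate(["answer 1", "answer 2"], criteria)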
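Raw per-response scores are hard to compare across models, so a natural next step is to average them per criterion. The sketch below assumes the list-of-dicts shape that `human_evaluate` returns; `aggregate_scores` and the example ratings are hypothetical and only illustrate the idea.

```python
from statistics import mean

def aggregate_scores(results):
    """Average per-criterion scores across all rated responses.

    `results` is assumed to be a list of {criterion: score} dicts,
    the same shape that human_evaluate returns.
    """
    by_criterion = {}
    for scores in results:
        for criterion, score in scores.items():
            by_criterion.setdefault(criterion, []).append(score)
    return {criterion: mean(values) for criterion, values in by_criterion.items()}

# Hypothetical ratings for two responses.
example_results = [
    {"accuracy": 4, "fluency": 5},
    {"accuracy": 2, "fluency": 4},
]
summary = aggregate_scores(example_results)
print(summary)
```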

Automatic Evaluation Metrics

from collections import Counter

def bleu_score(reference, hypothesis, n=1):
    """Simplified BLEU: clipped n-gram precision only.

    Full BLEU also combines several n-gram orders and applies a brevity penalty.
    """
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()

    ref_ngrams = Counter(tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens)-n+1))
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i+n]) for i in range(len(hyp_tokens)-n+1))

    # Counter intersection clips each n-gram count at its count in the reference.
    matches = sum((hyp_ngrams & ref_ngrams).values())
    total = sum(hyp_ngrams.values())

    return matches / total if total > 0 else 0

def rouge_1(reference, hypothesis):
    """Simplified ROUGE-1 recall over unique tokens (repeated words count once)."""
    ref_tokens = set(reference.split())
    hyp_tokens = set(hypothesis.split())
    intersection = ref_tokens & hyp_tokens
    return len(intersection) / len(ref_tokens) if ref_tokens else 0
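The `rouge_1` function above works on sets, so repeated words count once and only recall is reported. As a possible refinement, the sketch below counts tokens and returns precision, recall, and F1 together. `rouge_1_f1` is a hypothetical name, and whitespace tokenization is an assumption: text without spaces between words, such as Chinese, would need a word segmenter first.

```python
from collections import Counter

def rouge_1_f1(reference, hypothesis):
    """Count-based ROUGE-1 precision, recall, and F1 over whitespace tokens."""
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((ref_counts & hyp_counts).values())

    precision = overlap / sum(hyp_counts.values()) if hyp_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    if precision + recall == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_1_f1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

In this example five of the six tokens overlap on each side, so precision, recall, and F1 all come out to 5/6.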

Using an LLM to Evaluate an LLM

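from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

def llm_evaluate(question, answer, criteria="accuracy, completeness, fluency"):
    """Ask a judge model to score an answer and explain its reasoning."""
    prompt = f"""Please evaluate the quality of the following answer.

Question: {question}
Answer: {answer}

Evaluation criteria: {criteria}

Please give a score from 1 to 10 and briefly explain your reasoning."""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    # Return the judge's free-text verdict (score plus explanation).
    return response.choices[0].message.content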
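`llm_evaluate` returns free text, so comparing models usually means pulling the number out of the reply. The helper below is one rough way to do that, assuming the judge states its score as a plain integer in the 1 to 10 range. `extract_score` is a hypothetical name, and the sample replies are made up for illustration.

```python
import re

def extract_score(judge_reply, low=1, high=10):
    """Return the first integer in [low, high] found in the reply, else None."""
    for match in re.finditer(r"\d+", judge_reply):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None

# Made-up judge replies, for illustration only.
print(extract_score("Score: 8. The answer is accurate but brief."))
print(extract_score("No rating given."))
```

A more robust setup might ask the judge for a fixed output format (for example a line starting with "Score:") and parse only that line, since free-form replies can mention other numbers first.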

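Common Benchmarks

| Benchmark | Capability evaluated | Difficulty |
| --- | --- | --- |
| MMLU | Breadth of knowledge | Medium |
| GSM8K | Mathematical reasoning | Medium |
| HumanEval | Code generation | Hard |
| TruthfulQA | Truthfulness | Hard |
| MT-Bench | Dialogue quality | Medium |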
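Multiple-choice benchmarks such as MMLU are typically reported as plain accuracy. The sketch below shows that computation under assumed conventions: the item layout (`question`, `choices`, `answer` keys), the `predict` callable, and the toy questions are all hypothetical and not taken from any real benchmark or its official harness.

```python
def multiple_choice_accuracy(items, predict):
    """Fraction of items where the predicted choice letter matches the answer.

    `items` is a hypothetical list of dicts with "question", "choices", and
    "answer" keys; `predict` is a hypothetical callable returning a letter.
    """
    if not items:
        return 0.0
    correct = 0
    for item in items:
        predicted = predict(item["question"], item["choices"]).strip().upper()
        if predicted == item["answer"]:
            correct += 1
    return correct / len(items)

# Toy data for illustration only; not taken from any real benchmark.
toy_items = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Oslo", "Bern"], "answer": "A"},
]

def always_a(question, choices):
    # Trivial baseline that always answers "A".
    return "A"

accuracy = multiple_choice_accuracy(toy_items, always_a)
print(accuracy)
```

A constant-answer baseline like `always_a` is a useful sanity check: a model that does not clearly beat it has learned little about the task.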

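Summary

LLM evaluation is a multi-dimensional, system-level effort. Measuring model performance comprehensively requires combining human evaluation, automatic metrics, and benchmarks.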