
LLM Logging System



What Makes LLM Logs Different

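Logs from large language model (LLM) systems differ markedly from those of traditional web applications. Beyond the request and response, LLM logs need to capture the prompt content, the model parameters, the generated token sequence, and quality evaluation results. This data is high in volume and structurally complex, so it needs a purpose-built design to be managed effectively.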

Key Log Types

Structured Log Design

Log Schema Definition

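from datetime import datetime
from enum import Enum
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field


class LogLevel(str, Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


class LLMLogEntry(BaseModel):
    """One record per inference request."""
    timestamp: datetime
    request_id: str
    # Optional fields get an explicit None default: under Pydantic v2,
    # Optional[...] without a default is still a required field.
    user_id: Optional[str] = None
    model_name: str
    prompt: str
    response: Optional[str] = None
    system_prompt: Optional[str] = None
    temperature: float
    max_tokens: int
    input_tokens: int
    output_tokens: int
    latency_ms: float                              # end-to-end latency
    first_token_latency_ms: Optional[float] = None  # time to first token
    stop_reason: str
    level: LogLevel
    metadata: Dict[str, Any] = Field(default_factory=dict)


class QualityLogEntry(BaseModel):
    """Human or automated quality feedback, linked to a request by request_id."""
    request_id: str
    rating: Optional[int] = None
    feedback: Optional[str] = None
    annotations: List[str] = Field(default_factory=list)
    reviewer_id: Optional[str] = None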
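Because both entry types share a `request_id`, quality feedback can be joined back to the request that produced it. The following is a minimal sketch of that join, assuming the entries have already been loaded as plain dictionaries whose keys match the schema fields above. The helper name `average_rating_by_model` and the sample data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_rating_by_model(request_logs, quality_logs):
    """Join quality entries to request entries on request_id and
    return the average rating per model."""
    model_by_request = {r["request_id"]: r["model_name"] for r in request_logs}
    ratings = defaultdict(list)
    for q in quality_logs:
        model = model_by_request.get(q["request_id"])
        # Skip feedback with no matching request or no numeric rating
        if model is not None and q.get("rating") is not None:
            ratings[model].append(q["rating"])
    return {model: mean(values) for model, values in ratings.items()}


request_logs = [
    {"request_id": "r1", "model_name": "gpt-4"},
    {"request_id": "r2", "model_name": "gpt-4"},
    {"request_id": "r3", "model_name": "other-model"},
]
quality_logs = [
    {"request_id": "r1", "rating": 5},
    {"request_id": "r2", "rating": 3},
    {"request_id": "r3", "rating": None},  # reviewed but not rated
    {"request_id": "r9", "rating": 1},     # no matching request log
]

result = average_rating_by_model(request_logs, quality_logs)
print(result)
```

Keeping quality feedback in a separate entry type means it can arrive long after the request was served without rewriting the original log record.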

Logger Implementation

import json
import logging
import uuid
from datetime import datetime, timezone


class StructuredLogger:
    """Writes one JSON object per line to the console (stderr by default)."""

    def __init__(self, service_name: str):
        self.service_name = service_name
        self.logger = logging.getLogger(service_name)
        # getLogger() returns the same logger for the same name, so attach the
        # handler only once; otherwise every line would be emitted repeatedly.
        if not self.logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(logging.Formatter('%(message)s'))
            self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def log_request(self, entry: LLMLogEntry):
        log_record = {
            "timestamp": entry.timestamp.isoformat(),
            "service": self.service_name,
            "level": entry.level.value,
            "request_id": entry.request_id,
            "model": entry.model_name,
            "metrics": {
                "input_tokens": entry.input_tokens,
                "output_tokens": entry.output_tokens,
                "latency_ms": entry.latency_ms,
                "ttft_ms": entry.first_token_latency_ms
            },
            "config": {
                "temperature": entry.temperature,
                "max_tokens": entry.max_tokens
            },
            "prompt": entry.prompt[:1000],  # truncate to keep log lines small
            "stop_reason": entry.stop_reason
        }
        # Emit at the entry's own severity rather than always at INFO
        py_level = getattr(logging, entry.level.name)
        # ensure_ascii=False keeps non-ASCII prompt text readable in the log
        self.logger.log(py_level, json.dumps(log_record, ensure_ascii=False))


# Usage example
logger = StructuredLogger("llm-inference")

entry = LLMLogEntry(
    timestamp=datetime.now(timezone.utc),  # timezone-aware, so the ISO string carries an offset
    request_id=str(uuid.uuid4()),
    user_id="user_123",
    model_name="gpt-4",
    prompt="Explain the basic concepts of machine learning",
    response="Machine learning is a subfield of artificial intelligence...",
    system_prompt="You are an AI assistant",
    temperature=0.7,
    max_tokens=2048,
    input_tokens=15,
    output_tokens=256,
    latency_ms=2340.5,
    first_token_latency_ms=180.2,
    stop_reason="stop",
    level=LogLevel.INFO
)
logger.log_request(entry)
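The usage example hard-codes `latency_ms` and `first_token_latency_ms`. In a streaming setup, these values can be measured by wrapping the token stream. The sketch below is one possible approach, not part of the logger above. `StreamTimer` and `fake_stream` are hypothetical names, and `output_tokens` here counts streamed chunks, which may not equal the tokenizer's token count.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


class StreamTimer:
    """Wraps a token iterator and records time-to-first-token and
    total latency, both in milliseconds."""

    def __init__(self, tokens: Iterable[str],
                 clock: Callable[[], float] = time.perf_counter):
        self._tokens = tokens
        self._clock = clock
        self.first_token_latency_ms: Optional[float] = None
        self.latency_ms: Optional[float] = None
        self.output_tokens = 0

    def __iter__(self) -> Iterator[str]:
        start = self._clock()
        for token in self._tokens:
            if self.first_token_latency_ms is None:
                self.first_token_latency_ms = (self._clock() - start) * 1000
            self.output_tokens += 1
            yield token
        # Only set once the stream has been fully consumed
        self.latency_ms = (self._clock() - start) * 1000


def fake_stream() -> Iterator[str]:
    # Hypothetical stand-in for a model's streaming response
    for token in ["Machine", " learning", " is", "..."]:
        time.sleep(0.01)
        yield token


timer = StreamTimer(fake_stream())
text = "".join(timer)
print(text, timer.output_tokens, timer.first_token_latency_ms, timer.latency_ms)
```

If the consumer stops iterating early, `latency_ms` stays `None`, which is worth handling before building the log entry.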

ELK Stack Architecture

Architecture Components

User request → LLM service → Logstash          → Elasticsearch → Kibana
                             (parse/transform)   (store/index)   (visualize)

Logstash Configuration

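# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  # Parse the JSON log line into the nested [llm] field
  json {
    source => "message"
    target => "llm"
  }

  # Nested fields use the [outer][inner] reference syntax, not dots
  date {
    match => ["[llm][timestamp]", "ISO8601"]
    target => "@timestamp"
  }

  mutate {
    add_field => {
      "service" => "%{[llm][service]}"
    }
    remove_field => ["[llm][prompt]"]  # filter out sensitive content
  }

  # Extract key metrics
  ruby {
    code => "
      metrics = event.get('[llm][metrics]')
      if metrics
        # metrics is a plain Ruby Hash, so index it with []
        event.set('input_tokens', metrics['input_tokens'])
        event.set('output_tokens', metrics['output_tokens'])
        event.set('latency_ms', metrics['latency_ms'])
      end
      # Promote the fields that the index template maps at the top level
      ['request_id', 'model', 'level', 'stop_reason'].each do |key|
        value = event.get('[llm][' + key + ']')
        event.set(key, value) unless value.nil?
      end
    "
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "llm-logs-%{+YYYY.MM.dd}"
    template_name => "llm-logs"
  }
}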

Elasticsearch Index Template

{
  "index_patterns": ["llm-logs-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "5s"
  },
  "mappings": {
    "properties": {
      "@timestamp": {"type": "date"},
      "request_id": {"type": "keyword"},
      "model": {"type": "keyword"},
      "input_tokens": {"type": "integer"},
      "output_tokens": {"type": "integer"},
      "latency_ms": {"type": "float"},
      "stop_reason": {"type": "keyword"},
      "level": {"type": "keyword"}
    }
  }
}
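The `llm-logs-*` pattern matches the daily indices produced by the `llm-logs-%{+YYYY.MM.dd}` setting in the Logstash output, which formats the event's `@timestamp` in UTC. The sketch below shows how a client could compute the same index names, for example to limit a query to the days it needs. The helper names `daily_index_name` and `index_names_between` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def daily_index_name(ts: datetime, prefix: str = "llm-logs") -> str:
    """Return the daily index name for a timezone-aware timestamp,
    using the UTC calendar date."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return f"{prefix}-{ts.astimezone(timezone.utc):%Y.%m.%d}"


def index_names_between(start: datetime, end: datetime,
                        prefix: str = "llm-logs") -> List[str]:
    """List the daily index names covering [start, end], inclusive."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    names = []
    while day <= last:
        names.append(f"{prefix}-{day:%Y.%m.%d}")
        day += timedelta(days=1)
    return names


# 07:30 at UTC+8 on 1 March is still 29 February in UTC
local_ts = datetime(2024, 3, 1, 7, 30, tzinfo=timezone(timedelta(hours=8)))
print(daily_index_name(local_ts))
```

The UTC conversion matters near midnight: a request logged early in the morning local time can land in the previous day's index.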

Log Analysis and Queries

Kibana Query Example

// Find high-latency requests
{
  "query": {
    "bool": {
      "must": [
        {"range": {"latency_ms": {"gte": 5000}}},
        {"term": {"level": "info"}}
      ]
    }
  },
  "aggs": {
    "avg_latency_by_model": {
      "terms": {"field": "model"},
      "aggs": {
        "avg_latency": {"avg": {"field": "latency_ms"}}
      }
    }
  }
}
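To make the semantics of the query concrete, here is an in-memory Python equivalent: keep documents with `latency_ms` of at least 5000 and `level` equal to `info`, group them by `model`, and average the latency in each group. It is an illustration of what the query computes, not a replacement for it. The function name and sample documents are hypothetical, and a real `terms` aggregation returns only the top buckets by document count.

```python
from collections import defaultdict


def avg_latency_by_model(docs, min_latency_ms=5000, level="info"):
    """Filter like the bool/must clause, then bucket by model and
    average latency_ms per bucket."""
    latencies = defaultdict(list)
    for doc in docs:
        if doc["latency_ms"] >= min_latency_ms and doc["level"] == level:
            latencies[doc["model"]].append(doc["latency_ms"])
    return {model: sum(values) / len(values)
            for model, values in latencies.items()}


docs = [
    {"model": "gpt-4", "level": "info", "latency_ms": 6000.0},
    {"model": "gpt-4", "level": "info", "latency_ms": 8000.0},
    {"model": "gpt-4", "level": "info", "latency_ms": 1200.0},         # below threshold
    {"model": "other-model", "level": "error", "latency_ms": 9000.0},  # level does not match
]

result = avg_latency_by_model(docs)
print(result)
```

Note that the average is computed only over the slow requests that pass the filter, not over all traffic for the model.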

Log Alerting Rules

# Log-based alerts
- alert: HighErrorRate
  condition: count(errors) / count(all) > 0.05
  window: 5m
  severity: critical

- alert: SlowInference
  condition: percentile(latency_ms, 95) > 10000
  window: 10m
  severity: warning
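The two conditions above can be sketched in Python as follows, assuming the records for each time window have already been fetched as dictionaries with `level` and `latency_ms` keys. The function names are hypothetical, and the percentile uses the exact nearest-rank method, whereas a log backend may compute an approximate percentile.

```python
import math


def error_rate(records):
    """Fraction of records whose level is 'error' (0.0 for an empty window)."""
    if not records:
        return 0.0
    errors = sum(1 for r in records if r["level"] == "error")
    return errors / len(records)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_alerts(records_5m, records_10m):
    """Return (alert name, severity) pairs for every rule that fires."""
    fired = []
    if error_rate(records_5m) > 0.05:
        fired.append(("HighErrorRate", "critical"))
    latencies = [r["latency_ms"] for r in records_10m]
    if latencies and percentile(latencies, 95) > 10000:
        fired.append(("SlowInference", "warning"))
    return fired


# 18 healthy requests plus 2 slow failures: 10% errors, p95 = 12000 ms
window = ([{"level": "info", "latency_ms": 2000.0}] * 18
          + [{"level": "error", "latency_ms": 12000.0}] * 2)
print(evaluate_alerts(window, window))
```

An empty window fires no alerts here; whether missing data should itself raise an alert is a separate design decision.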

Log Management Best Practices

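  1. Privacy protection: redact user input and remove PII before it is logged.
  2. Sampling: log a sample of successful requests, and log every failed request in full.
  3. Lifecycle: define a retention policy, for example hot data for 30 days, warm data for 90 days, and cold data archived.
  4. Compressed storage: compress log files with Gzip to reduce storage overhead.
  5. Real-time analysis: aggregate key metrics in real time with stream processing.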
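The first two practices can be sketched in a few lines. This is a minimal illustration under stated assumptions: the regular expressions cover only simple email and phone formats and are not a complete PII detector, the 10% default sample rate is arbitrary, and the names `redact_pii` and `should_log` are hypothetical.

```python
import hashlib
import re

# Deliberately simple patterns; real PII detection needs broader coverage
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def should_log(request_id: str, level: str,
               success_sample_rate: float = 0.1) -> bool:
    """Always keep non-info entries; keep a deterministic sample of the rest.

    Hashing the request_id means every log line for the same request
    gets the same keep/drop decision.
    """
    if level != "info":
        return True
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < success_sample_rate * 2**64


clean = redact_pii("Contact alice@example.com or +1 415-555-0100")
print(clean)
print(should_log("req-1", "error"), should_log("req-1", "info"))
```

Redaction should run before the prompt is truncated and written, so that no raw PII ever reaches the log pipeline.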

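A well-built LLM logging system is the foundation for troubleshooting, performance optimization, and quality assurance, and it merits a serious investment of engineering resources.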