
LLM Monitoring Systems



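Why LLM Monitoring Is Needed

When deploying large language models (LLMs) in production, monitoring is key to keeping the system stable and reliable. Unlike a traditional API, an LLM produces probabilistic, non-deterministic output, so a successful status code alone does not show that a response was acceptable. Catching problems requires finer-grained monitoring.

LLM monitoring focuses on the following dimensions:

Latency: time to first token (TTFT) and total generation latency
Throughput: tokens processed per second (TPS)
Resource utilization: GPU/CPU usage and memory footprint
Quality metrics: user feedback, refusal rate, and hallucination detection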
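
To make the latency and throughput dimensions concrete, here is a minimal sketch that derives TTFT, total latency, and TPS from per-token arrival timestamps. The `RequestTiming` type and the `summarize_timing` helper are illustrative names, not part of any library. The sketch assumes the timestamps come from a monotonic clock such as `time.perf_counter()`, and it computes TPS over the whole request, including the wait for the first token.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestTiming:
    """Per-request latency and throughput summary (illustrative type)."""
    ttft_seconds: float
    total_seconds: float
    tokens_per_second: float


def summarize_timing(request_start: float,
                     token_times: Sequence[float]) -> RequestTiming:
    """Derive TTFT, total latency, and TPS from token arrival timestamps.

    request_start and token_times must come from the same clock.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    # Throughput over the full request; guard against a zero-length interval.
    tps = len(token_times) / total if total > 0 else 0.0
    return RequestTiming(ttft, total, tps)


# Four tokens arriving every 0.5 s after the request started at t=0.
timing = summarize_timing(0.0, [0.5, 1.0, 1.5, 2.0])
print(timing)
```

Some teams report decode throughput separately by excluding the time before the first token. Either definition works as long as every dashboard uses the same one.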

Core Monitoring Metrics

Latency Metrics

import time
from prometheus_client import Histogram

# Define the latency histograms
TTFT_HISTOGRAM = Histogram(
    'llm_first_token_latency_seconds',
    'Time to first token',
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0]
)

GENERATION_HISTOGRAM = Histogram(
    'llm_generation_latency_seconds',
    'Total generation latency',
    buckets=[1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)

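def generate_response(prompt: str) -> str:
    """Generate a model response, recording TTFT and total latency."""
    start = time.perf_counter()
    first_token_seen = False
    response = []
    # `model` is assumed to be a client defined elsewhere that exposes
    # generate_stream() and yields tokens as strings.
    for token in model.generate_stream(prompt):
        if not first_token_seen:
            # Observe TTFT once, when the first token arrives.
            TTFT_HISTOGRAM.observe(time.perf_counter() - start)
            first_token_seen = True
        response.append(token)
    # Observe end-to-end generation time after the stream is exhausted.
    GENERATION_HISTOGRAM.observe(time.perf_counter() - start)
    return ''.join(response)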

Token-Level Metrics

from prometheus_client import Counter, Gauge

TOKENS_GENERATED = Counter(
    'llm_tokens_generated_total',
    'Total tokens generated',
    ['model', 'status']
)

ACTIVE_REQUESTS = Gauge(
    'llm_active_requests',
    'Number of active inference requests'
)

GPU_UTILIZATION = Gauge(
    'llm_gpu_utilization_percent',
    'GPU utilization percentage',
    ['gpu_id']
)
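
Declaring the metrics is only half the work: the request path also has to update them. Below is a minimal sketch of one way to do that, assuming the `prometheus_client` package is installed. `fake_stream`, `handle_request`, and the `demo-model` label value are hypothetical stand-ins. The metrics are re-declared on a private `CollectorRegistry` so the sketch runs on its own without clashing with the definitions above.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge

# A private registry keeps this sketch isolated from the default registry.
registry = CollectorRegistry()

TOKENS_GENERATED = Counter(
    'llm_tokens_generated_total', 'Total tokens generated',
    ['model', 'status'], registry=registry,
)
ACTIVE_REQUESTS = Gauge(
    'llm_active_requests', 'Number of active inference requests',
    registry=registry,
)


def fake_stream(prompt: str):
    """Hypothetical stand-in for a streaming model client."""
    yield from prompt.split()


def handle_request(prompt: str, model_name: str = 'demo-model') -> str:
    tokens = []
    status = 'success'
    # track_inprogress() increments the gauge on entry and decrements on exit.
    with ACTIVE_REQUESTS.track_inprogress():
        try:
            for token in fake_stream(prompt):
                tokens.append(token)
        except Exception:
            status = 'error'
            raise
        finally:
            # Count whatever was generated, labelled by the final outcome.
            TOKENS_GENERATED.labels(model=model_name, status=status).inc(len(tokens))
    return ' '.join(tokens)


print(handle_request("hello from the model"))
```

Incrementing the counter in the `finally` block means tokens from requests that fail partway through are still counted, under `status="error"`.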

Prometheus Integration Architecture

Metric Collection Flow

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'llm-inference'
    static_configs:
      - targets: ['llm-server:8000']
    metrics_path: '/metrics'
    scrape_interval: 5s

  - job_name: 'llm-gpu'
    static_configs:
      - targets: ['nvidia-exporter:9400']

  - job_name: 'llm-proxy'
    static_configs:
      - targets: ['api-gateway:9100']

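Exposing Metrics from the Server

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()

# The FastAPI app serves /metrics itself, matching the scrape target
# llm-server:8000 in prometheus.yml. A separate start_http_server(8000)
# is unnecessary and would compete for the port if the app also listens
# on 8000.
@app.get("/metrics")
def metrics() -> Response:
    # Render every metric in the default registry in the text exposition format.
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )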

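Grafana Dashboard Design

Dashboard Layout

A complete LLM monitoring dashboard typically contains the following panels:

Row 1: Overview metrics

Current QPS (queries per second)
Latency percentiles (P50/P95/P99)
Error rate
GPU memory utilization

Row 2: Latency trends

TTFT latency distribution histogram
Generation latency over time
Token generation rate (TPS)

Row 3: Resource monitoring

GPU utilization time series
Memory usage trend
CPU utilization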

{
  "dashboard": {
    "title": "LLM Inference Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "stat",
        "targets": [{
          "expr": "rate(llm_requests_total[5m])",
          "legendFormat": "{{model}}"
        }]
      },
      {
        "title": "Latency Distribution",
        "type": "timeseries",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(llm_generation_latency_seconds_bucket[5m]))",
          "legendFormat": "P95"
        }]
      }
    ]
  }
}
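
The P95 panel depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts, not from raw samples. The sketch below reimplements that estimate in plain Python to show why bucket boundaries limit its precision. It is a simplified illustration, not Prometheus's source code. It assumes positive bucket bounds and a final `+Inf` bucket, and the sample counts are made up for the example.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float,
                       buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(buckets)
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        raise ValueError("need at least two buckets, the last with le=+Inf")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(ordered) if count >= rank)
    if idx == len(ordered) - 1:
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return ordered[-2][0]
    upper, count = ordered[idx]
    lower, prev_count = (0.0, 0.0) if idx == 0 else ordered[idx - 1]
    if count == prev_count:
        return upper
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Made-up cumulative counts for the generation-latency buckets used earlier.
sample = [(1.0, 10), (2.0, 40), (5.0, 80), (10.0, 95),
          (30.0, 99), (60.0, 100), (math.inf, 100)]
print(histogram_quantile(0.5, sample))
```

Because values are interpolated linearly within a bucket, a wide bucket such as 10 s to 30 s can hide large differences. Placing bucket boundaries near the alert thresholds keeps the estimate useful.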

Alert Rule Configuration

# alerts.yml
groups:
  - name: llm_alerts
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(llm_generation_latency_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM P95 latency above 10 seconds"

      - alert: HighErrorRate
        expr: rate(llm_errors_total[5m]) / rate(llm_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate above 5%"

      - alert: GPUMemoryHigh
        expr: llm_gpu_memory_used_bytes / llm_gpu_memory_total_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory utilization above 90%"
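
The `HighErrorRate` rule reads `llm_requests_total` and `llm_errors_total`, which the earlier snippets never declare. Below is a minimal sketch of how the service might define and update them, assuming the `prometheus_client` package is installed. The `model` label and the `record_request` wrapper are assumptions made for illustration. A private registry keeps the sketch self-contained.

```python
from typing import Callable, TypeVar

from prometheus_client import CollectorRegistry, Counter

T = TypeVar("T")

# A private registry keeps this sketch isolated from the default registry.
registry = CollectorRegistry()

REQUESTS = Counter(
    'llm_requests_total', 'Total inference requests',
    ['model'], registry=registry,
)
ERRORS = Counter(
    'llm_errors_total', 'Total failed inference requests',
    ['model'], registry=registry,
)


def record_request(model_name: str, run: Callable[[], T]) -> T:
    """Count the request, run the callable, and count an error if it raises."""
    REQUESTS.labels(model=model_name).inc()
    # Touching the error child creates the series at 0, so the ratio in the
    # alert expression has data even before the first failure.
    errors = ERRORS.labels(model=model_name)
    try:
        return run()
    except Exception:
        errors.inc()
        raise


print(record_request("demo-model", lambda: "ok"))
```

Both counters use the same label set. The division in the alert expression matches series by their labels, so a mismatch would make the ratio return no data.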

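Production Best Practices

Layered monitoring: monitor the infrastructure, model-serving, and application layers separately
Sampling: sample high-frequency requests so the monitoring system does not become a bottleneck
Correlated tracing: combine metrics with distributed tracing to make problems easier to locate
Baselines: establish a performance baseline soon after launch for later comparison
Automated response: configure autoscaling rules that adjust resources dynamically based on monitoring metrics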
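
The sampling practice above can be sketched as a deterministic, hash-based decision keyed on a request ID. This is one possible approach, not something the article prescribes, and `should_sample` is a hypothetical helper. It suits expensive per-request detail such as traces or logged prompts. Cheap counters and histograms are usually still updated for every request.

```python
import hashlib


def should_sample(request_id: str, sample_rate: float) -> bool:
    """Return True for roughly sample_rate of all request IDs.

    The decision depends only on the ID, so every service that sees the
    same request makes the same choice.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be within [0, 1]")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Compare the first 8 bytes, read as an integer, against a scaled threshold.
    value = int.from_bytes(digest[:8], "big")
    return value < int(sample_rate * 2**64)


sampled = sum(should_sample(f"req-{i}", 0.1) for i in range(10_000))
print(f"sampled {sampled} of 10000 requests")
```

Hashing the ID instead of calling a random number generator keeps the decision reproducible. That helps when a sampled metric has to be matched with the trace of the same request.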

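With a comprehensive LLM monitoring system in place, teams can quickly detect and resolve production issues and continuously improve the quality of the model service.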