LLM Monitoring System
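---
title: "LLM Monitoring System"
description: "A comprehensive introduction to the architecture and implementation of monitoring systems for large language models, covering metrics collection, Prometheus integration, Grafana dashboard setup, and production best practices."
tags: ["LLM Monitoring", "Monitoring Systems", "Prometheus", "Grafana", "Observability"]
category: "llm"
icon: "🧠"
---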
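Why LLM Monitoring Is Needed
When deploying large language models (LLMs) in production, monitoring is key to keeping the system stable and reliable. Unlike a traditional API, an LLM produces probabilistic, non-deterministic output, so finer-grained monitoring is needed to catch potential problems.
LLM monitoring focuses on the following dimensions:
- Latency: time to first token (TTFT) and total generation latency
- Throughput: tokens processed per second (TPS)
- Resource utilization: GPU/CPU utilization and memory usage
- Quality metrics: user feedback, refusal rate, and hallucination detection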
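To make the latency and throughput definitions above concrete, here is a minimal sketch that derives TTFT, total generation latency, and TPS from the timestamps of a single streamed response. `StreamTiming` and `summarize` are illustrative names rather than part of any library, and TPS is computed here over the whole request, which is only one of several conventions in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    """Timestamps (in seconds) collected for one streamed response."""
    request_start: float
    token_times: List[float]  # arrival time of each generated token


def summarize(timing: StreamTiming) -> dict:
    """Derive TTFT, total latency, and tokens per second for one request."""
    if not timing.token_times:
        raise ValueError("no tokens were generated")
    ttft = timing.token_times[0] - timing.request_start
    total = timing.token_times[-1] - timing.request_start
    # Throughput over the whole request; guard against a zero-length interval
    tps = len(timing.token_times) / total if total > 0 else 0.0
    return {"ttft_s": ttft, "total_s": total, "tokens_per_s": tps}


# Example: request sent at t=10.0, four tokens arrive over two seconds
stats = summarize(StreamTiming(10.0, [10.5, 11.0, 11.5, 12.0]))
```

Some teams instead report TPS over the decode phase only (excluding the time before the first token), so it is worth stating which definition a dashboard uses.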
Core Monitoring Metrics
Latency Metrics
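import time

from prometheus_client import Histogram

# Latency histograms; bucket boundaries are in seconds
TTFT_HISTOGRAM = Histogram(
    'llm_first_token_latency_seconds',
    'Time to first token',
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0]
)
GENERATION_HISTOGRAM = Histogram(
    'llm_generation_latency_seconds',
    'Total generation latency',
    buckets=[1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)


def generate_response(prompt: str) -> str:
    """Generate a model response and record TTFT and total latency."""
    start = time.perf_counter()
    first_token_seen = False
    response = []
    # `model` is assumed to be an already-loaded object exposing generate_stream()
    for token in model.generate_stream(prompt):
        if not first_token_seen:
            # Observe TTFT once, when the first token arrives
            TTFT_HISTOGRAM.observe(time.perf_counter() - start)
            first_token_seen = True
        response.append(token)
    # Observe the end-to-end generation latency
    GENERATION_HISTOGRAM.observe(time.perf_counter() - start)
    return ''.join(response)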
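The bucket boundaries chosen above determine how much detail the histograms retain. As a rough illustration, the sketch below counts observations the way Prometheus `_bucket{le="..."}` series do: each bucket is cumulative, and an implicit `+Inf` bucket catches everything. The function name `cumulative_bucket_counts` and the sample values are made up for this example.

```python
import math
from typing import Dict, List


def cumulative_bucket_counts(observations: List[float],
                             bounds: List[float]) -> Dict[float, int]:
    """Count observations per cumulative bucket (value <= upper bound)."""
    uppers = sorted(bounds) + [math.inf]  # the +Inf bucket is always present
    counts = {upper: 0 for upper in uppers}
    for value in observations:
        for upper in uppers:
            if value <= upper:
                counts[upper] += 1
    return counts


# Same boundaries as the TTFT histogram, with five made-up latencies in seconds
ttft_bounds = [0.1, 0.25, 0.5, 1.0, 2.0, 5.0]
counts = cumulative_bucket_counts([0.08, 0.3, 0.3, 1.5, 7.0], ttft_bounds)
```

Note that the 7.0 s observation is only visible in the `+Inf` bucket. If a meaningful share of requests lands there, the largest finite boundary is probably too small for the workload.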
Token-Level Metrics
from prometheus_client import Counter, Gauge

TOKENS_GENERATED = Counter(
    'llm_tokens_generated_total',
    'Total tokens generated',
    ['model', 'status']
)
ACTIVE_REQUESTS = Gauge(
    'llm_active_requests',
    'Number of active inference requests'
)
GPU_UTILIZATION = Gauge(
    'llm_gpu_utilization_percent',
    'GPU utilization percentage',
    ['gpu_id']
)
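Defining the metrics is only half the work; they also have to be updated around each request. One possible pattern, sketched below, is a context manager that increments the active-request gauge on entry and records the token count with a `success` or `error` status on exit. The helper `track_request` is hypothetical. It takes the gauge and counter as arguments and relies only on the `inc()`, `dec()`, and `labels(...)` methods that `prometheus_client` gauges and counters provide.

```python
from contextlib import contextmanager


@contextmanager
def track_request(active_gauge, token_counter, model_name: str):
    """Track one in-flight request and count its tokens by outcome.

    `active_gauge` and `token_counter` are expected to behave like the
    ACTIVE_REQUESTS gauge and TOKENS_GENERATED counter.
    """
    active_gauge.inc()
    produced = {"tokens": 0}  # the caller updates this while streaming
    status = "success"
    try:
        yield produced
    except Exception:
        status = "error"
        raise
    finally:
        # Always release the gauge, even when generation fails
        active_gauge.dec()
        token_counter.labels(model=model_name, status=status).inc(produced["tokens"])


# Usage sketch ("my-model" and `model` are placeholders):
# with track_request(ACTIVE_REQUESTS, TOKENS_GENERATED, "my-model") as produced:
#     for token in model.generate_stream(prompt):
#         produced["tokens"] += 1
```

Putting the bookkeeping in a `finally` block keeps the gauge from drifting upward when requests fail midway.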
Prometheus Integration Architecture
Metrics Collection Flow
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'llm-inference'
    static_configs:
      - targets: ['llm-server:8000']
    metrics_path: '/metrics'
    scrape_interval: 5s
  - job_name: 'llm-gpu'
    static_configs:
      - targets: ['nvidia-exporter:9400']
  - job_name: 'llm-proxy'
    static_configs:
      - targets: ['api-gateway:9100']
Exposing Metrics from the Server
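from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()

# Assuming the app is served on port 8000 (the scrape target in prometheus.yml),
# the /metrics route below is sufficient. Also calling start_http_server(8000)
# would try to bind the same port, so no separate metrics server is started.
@app.get("/metrics")
async def metrics():
    """Return all registered metrics in the Prometheus text format."""
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )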
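Grafana Dashboard Design
Dashboard Layout
A complete LLM monitoring dashboard typically includes the following panels:
Row 1: Overview metrics
- Current QPS (queries per second)
- Latency percentiles (P50/P95/P99)
- Error rate
- GPU memory utilization
Row 2: Latency trends
- TTFT latency distribution histogram
- Generation latency over time
- Token generation rate (TPS)
Row 3: Resource monitoring
- GPU utilization time series
- Memory usage trend
- CPU utilization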
{
  "dashboard": {
    "title": "LLM Inference Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "stat",
        "targets": [{
          "expr": "rate(llm_requests_total[5m])",
          "legendFormat": "{{model}}"
        }]
      },
      {
        "title": "Latency Distribution",
        "type": "heatmap",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(llm_generation_latency_seconds_bucket[5m]))",
          "legendFormat": "P95"
        }]
      }
    ]
  }
}
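The `histogram_quantile(0.95, ...)` expression above does not see individual request latencies. It estimates the percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following simplified sketch of that idea assumes classic histograms with non-negative bounds and a lower bound of 0 for the first bucket; `estimate_quantile` and the sample counts are illustrative.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank lies beyond the largest finite bucket: report that bound
                return buckets[-2][0]
            # Linear interpolation within the bucket holding the target rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts for the generation latency buckets
generation_buckets = [(1.0, 10), (2.0, 30), (5.0, 70), (10.0, 90),
                      (30.0, 98), (60.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, generation_buckets)  # about 22.5 seconds
```

Because the estimate is interpolated, its precision is limited by bucket width: here the P95 falls somewhere in the 10 s to 30 s bucket, so adding boundaries in that range would tighten it.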
Alert Rule Configuration
# alerts.yml
groups:
  - name: llm_alerts
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(llm_generation_latency_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM P95 latency is above 10 seconds"
      - alert: HighErrorRate
        # Aggregate both sides so that differing label sets on the two counters still match
        expr: sum(rate(llm_errors_total[5m])) / sum(rate(llm_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate is above 5%"
      - alert: GPUMemoryHigh
        expr: llm_gpu_memory_used_bytes / llm_gpu_memory_total_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory utilization is above 90%"
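The `for` field in each rule means the condition must hold continuously for that duration before the alert fires. Until then the alert is only pending, and a single false evaluation resets the timer. The sketch below is a simplified model of that behavior and not Prometheus code; the function name, the 30-second evaluation spacing, and the sample data are assumptions made for illustration.

```python
from typing import List, Tuple


def firing_times(samples: List[Tuple[float, bool]], for_seconds: float) -> List[float]:
    """Return the evaluation timestamps at which the alert is firing.

    `samples` is a time-ordered list of (timestamp_seconds, condition_is_true).
    """
    firing = []
    pending_since = None
    for ts, condition in samples:
        if not condition:
            pending_since = None  # any false evaluation resets the pending timer
            continue
        if pending_since is None:
            pending_since = ts
        if ts - pending_since >= for_seconds:
            firing.append(ts)
    return firing


# An error-rate condition evaluated every 30 s against a 2-minute `for` window
evaluations = [(0, True), (30, True), (60, False), (90, True),
               (120, True), (150, True), (180, True), (210, True)]
fired_at = firing_times(evaluations, for_seconds=120)
```

In this run the brief recovery at t=60 restarts the timer, so the alert fires only at t=210. This is the trade-off behind choosing `for`: longer windows suppress flapping but delay notification.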
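Production Best Practices
- Layered monitoring: monitor the infrastructure, model-serving, and application layers separately
- Sampling strategy: sample high-frequency requests so that the monitoring system does not become a bottleneck
- Correlated tracing: combine metrics with distributed tracing to make problems easier to locate
- Baselines: establish a performance baseline early after launch for later comparison
- Automated response: configure autoscaling rules that adjust resources dynamically based on monitoring metrics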
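As one way to realize the sampling strategy above, expensive per-request telemetry (for example full prompt logging or quality evaluation) can be restricted to a deterministic fraction of requests, while cheap counters and histograms still cover every request. The sketch below hashes a request ID to make that decision. `should_sample` and the request ID format are assumptions for illustration.

```python
import hashlib


def should_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether to record detailed telemetry."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and compare it
    # against the matching fraction of that range
    value = int.from_bytes(digest[:8], "big")
    return value < sample_rate * 2**64


# Roughly 10% of requests are selected, and the same ID always gets the same answer
selected = sum(should_sample(f"req-{i}", 0.1) for i in range(10_000))
```

Hashing instead of calling a random number generator means every service that sees the same request ID makes the same decision, which helps when sampled metrics need to be joined with traces.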
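With a comprehensive LLM monitoring setup in place, teams can quickly detect and resolve production issues and continuously improve model service quality.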