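Monitoring LLMs with Prometheus
--- title: "Monitoring LLMs with Prometheus" description: "A detailed guide to monitoring LLM applications with Prometheus, covering metric definitions, data collection, and alert configuration." tags: ["Prometheus", "LLM", "Monitoring"] category: "llm" icon: "🧠"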
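Introduction to Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit that is particularly well suited to cloud-native applications. It collects metrics with a pull model, periodically scraping HTTP endpoints exposed by its targets, and provides a powerful query language, PromQL.
For LLM applications, Prometheus can help you:
- Monitor model performance in real time
- Track resource usage
- Define alerting rules
- Feed visual dashboards (for example, in Grafana)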
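To make the pull model concrete, the sketch below prints the text that a single scrape would return. It assumes the `prometheus_client` package is installed; the metric name `demo_llm_requests_total` and the label value `demo-model` are made up for illustration and are not part of any real service.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# A private registry keeps this demo isolated from the process-wide default registry.
registry = CollectorRegistry()

# Hypothetical counter, used only for illustration.
demo_requests = Counter(
    "demo_llm_requests_total",
    "Demo counter of LLM requests",
    ["model", "status"],
    registry=registry,
)

# Simulate three successful requests.
demo_requests.labels(model="demo-model", status="success").inc(3)

# generate_latest() returns the text exposition format as bytes. This is the
# payload Prometheus receives each time it scrapes a target's /metrics endpoint.
exposition = generate_latest(registry).decode("utf-8")
print(exposition)
```

The application only keeps counters in memory and serializes them on request; Prometheus decides when to collect them, which is why the scrape interval is set on the Prometheus side rather than in the application.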
Basic Configuration
Installation and Startup
# Run Prometheus with Docker
docker run -d -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
# Or use Docker Compose (docker-compose.yml)
version: '3.8'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
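Configuration File
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  # Note: when Prometheus runs in a container, localhost refers to that container;
  # replace it with a host name that is reachable from the Prometheus container.
  - job_name: 'llm-service'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
  - job_name: 'llm-model-server'
    static_configs:
      - targets: ['localhost:8001']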
LLM Metric Definitions
Using the Prometheus Client Library
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
from functools import wraps
# Define the metrics
REQUEST_COUNT = Counter(
    'llm_requests_total',
    'Total number of LLM requests',
    ['model', 'method', 'status']
)
REQUEST_LATENCY = Histogram(
    'llm_request_duration_seconds',
    'LLM request latency in seconds',
    ['model', 'method'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
TOKEN_USAGE = Counter(
    'llm_tokens_total',
    'Total tokens processed',
    ['model', 'type']  # the 'type' label is either "input" or "output"
)
ACTIVE_REQUESTS = Gauge(
    'llm_active_requests',
    'Number of active LLM requests',
    ['model']
)
# Decorator that records the metrics automatically
def track_metrics(model: str):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            ACTIVE_REQUESTS.labels(model=model).inc()
            # perf_counter() is monotonic, so it is safe for measuring durations
            start_time = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                REQUEST_COUNT.labels(model=model, method=func.__name__, status='success').inc()
                return result
            except Exception:
                REQUEST_COUNT.labels(model=model, method=func.__name__, status='error').inc()
                raise
            finally:
                # Runs on both success and failure, so latency and the in-flight gauge stay accurate
                duration = time.perf_counter() - start_time
                REQUEST_LATENCY.labels(model=model, method=func.__name__).observe(duration)
                ACTIVE_REQUESTS.labels(model=model).dec()
        return wrapper
    return decorator
Recording Token Usage
# `client` is assumed to be an OpenAI-compatible client created elsewhere in the application
def call_llm(model: str, prompt: str, **kwargs):
    # Call the LLM API
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs
    )
    # Record token usage, split by direction
    TOKEN_USAGE.labels(model=model, type='input').inc(response.usage.prompt_tokens)
    TOKEN_USAGE.labels(model=model, type='output').inc(response.usage.completion_tokens)
    return response
Exposing the Metrics Endpoint
from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest, start_http_server
app = FastAPI()
@app.get("/metrics")
async def metrics():
    # Serialize every metric in the default registry
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )
@app.get("/health")
async def health():
    return {"status": "healthy"}
# Start the metrics server and the application
if __name__ == "__main__":
    import uvicorn
    start_http_server(8001)  # separate, metrics-only port
    uvicorn.run(app, host="0.0.0.0", port=8000)
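PromQL Query Examples
Common Queries
# Requests per second
rate(llm_requests_total[5m])
# P95 request latency (aggregate the buckets by le before taking the quantile)
histogram_quantile(0.95, sum by (le) (rate(llm_request_duration_seconds_bucket[5m])))
# Error rate (both sides are aggregated so that their label sets match)
sum(rate(llm_requests_total{status="error"}[5m])) / sum(rate(llm_requests_total[5m]))
# Output token throughput (tokens per second)
rate(llm_tokens_total{type="output"}[5m])
# Active requests
llm_active_requests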
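The P95 query above estimates a quantile from cumulative bucket counts rather than from raw observations. The following sketch illustrates the linear-interpolation idea behind that estimate in plain Python. It is a simplified illustration, not Prometheus's actual implementation; the function name `estimate_quantile` and the sample counts are invented for the example, while the bucket bounds are the ones defined for `llm_request_duration_seconds`.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by upper
    bound and ending with the +Inf bucket, like the *_bucket series.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(upper):
                # The quantile lies beyond the last finite bound; report that bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Made-up cumulative counts for the bucket bounds used in this article.
sample = [
    (0.1, 5), (0.25, 20), (0.5, 50), (1.0, 80),
    (2.5, 95), (5.0, 99), (10.0, 100), (math.inf, 100),
]
p90 = estimate_quantile(0.9, sample)
print(f"estimated P90: {p90:.2f}s")
```

Because the estimate can never be finer than the bucket boundaries, it is worth choosing buckets that bracket the latencies you care about. With the bounds above, any latency over 10 seconds is reported as 10 seconds.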
Advanced Queries
# Estimated cost over the last hour in US dollars (assuming $0.02 per 1K tokens)
sum(increase(llm_tokens_total[1h])) * 0.02 / 1000
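# Latency anomaly detection: models whose mean latency exceeds the cross-model mean by more than two standard deviations
# (a histogram exposes only _bucket, _sum and _count series, so mean latency is derived as _sum / _count)
(sum by (model) (rate(llm_request_duration_seconds_sum[5m])) / sum by (model) (rate(llm_request_duration_seconds_count[5m])))
  > scalar(
    avg(sum by (model) (rate(llm_request_duration_seconds_sum[5m])) / sum by (model) (rate(llm_request_duration_seconds_count[5m])))
    + 2 * stddev(sum by (model) (rate(llm_request_duration_seconds_sum[5m])) / sum by (model) (rate(llm_request_duration_seconds_count[5m])))
  )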
# Usage distribution across models
sum by (model) (rate(llm_requests_total[1h]))
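The cost estimate is plain arithmetic on token counts, so it can be cross-checked outside Prometheus. The sketch below does that in Python. Splitting the price into input and output rates is an assumption added here because `llm_tokens_total` carries a `type` label; the `TokenPrice` class, the function name, and the token counts are hypothetical, and only the $0.02 per 1K tokens figure comes from the query above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """Hypothetical price per 1,000 tokens, in US dollars."""
    input_per_1k: float
    output_per_1k: float


def estimate_cost(input_tokens: int, output_tokens: int, price: TokenPrice) -> float:
    """Return the estimated cost in US dollars for the given token counts."""
    return (
        input_tokens * price.input_per_1k
        + output_tokens * price.output_per_1k
    ) / 1000


# Flat $0.02 per 1K tokens for both directions, matching the query's assumption.
flat_price = TokenPrice(input_per_1k=0.02, output_per_1k=0.02)

# Made-up hourly totals: 120,000 input tokens and 30,000 output tokens.
hourly_cost = estimate_cost(120_000, 30_000, flat_price)
print(f"estimated hourly cost: ${hourly_cost:.2f}")
```

If input and output tokens are billed at different rates, the same split can be expressed in PromQL by selecting `type="input"` and `type="output"` separately and adding the two products.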
Alerting Rule Configuration
Alerting Rules File
# alerts.yml
groups:
  - name: llm_alerts
    rules:
      # High error rate alert
      - alert: HighErrorRate
        expr: sum(rate(llm_requests_total{status="error"}[5m])) / sum(rate(llm_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM service error rate is too high"
          description: "Error rate: {{ $value | humanizePercentage }}"
      # Latency alert
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(llm_request_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM service latency is too high"
          description: "P95 latency: {{ $value }}s"
      # Resource usage alert
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / 1024 / 1024 > 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage is too high"
          description: "Current memory usage: {{ $value }}MB"
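The `for` clause in these rules means an alert does not fire the moment its expression becomes true: it first sits in a pending state and fires only after the expression has stayed true for the whole duration. The sketch below simulates that behavior for the `for: 2m` rule with the 15-second evaluation interval configured earlier. It is a simplified model for illustration, not Prometheus's rule engine; the function name and the input sequence are invented.

```python
from typing import List, Optional


def alert_states(condition: List[bool], interval_s: int, for_s: int) -> List[str]:
    """Return the alert state after each rule evaluation.

    `condition[i]` says whether the alert expression returned a result at
    evaluation i; evaluations are `interval_s` seconds apart.
    """
    states: List[str] = []
    active_since: Optional[int] = None  # time at which the condition last became true
    for i, is_true in enumerate(condition):
        now = i * interval_s
        if not is_true:
            # Any evaluation without a result resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Four true evaluations, one gap, then ten true evaluations.
observed = [True] * 4 + [False] + [True] * 10
states = alert_states(observed, interval_s=15, for_s=120)
for index, state in enumerate(states):
    print(f"t={index * 15:>3}s {state}")
```

In this run the single false evaluation resets the timer, so the alert fires only after the second stretch has lasted a full two minutes. This is why a brief spike does not page anyone, and also why a flapping condition can delay an alert.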
Enabling Alerts in Prometheus
# prometheus.yml
rule_files:
  - "alerts.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
Grafana Integration
Adding the Prometheus Data Source
- Add a Prometheus data source in Grafana
- Set the URL to http://prometheus:9090
- Create a dashboard
Preconfigured Dashboard JSON
{
  "dashboard": {
    "title": "LLM Service Monitoring",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [{
          "expr": "rate(llm_requests_total[5m])",
          "legendFormat": "{{model}}"
        }]
      }
    ]
  }
}
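When a dashboard grows beyond one panel, generating the JSON from a list of queries can be easier than editing it by hand. The sketch below builds a document with the same shape as the JSON above. The helper name `build_dashboard` and the panel list are hypothetical, and only the fields shown above are emitted; a real Grafana dashboard usually needs more, such as panel positions and a data source reference.

```python
import json
from typing import Any, Dict, List


def build_dashboard(title: str, panels: List[Dict[str, str]]) -> Dict[str, Any]:
    """Build a dashboard document from simple panel descriptions."""
    return {
        "dashboard": {
            "title": title,
            "panels": [
                {
                    "title": panel["title"],
                    "type": "graph",
                    "targets": [
                        {"expr": panel["expr"], "legendFormat": panel["legend"]}
                    ],
                }
                for panel in panels
            ],
        }
    }


# Panel queries reuse the metric names defined earlier in this article.
panel_specs = [
    {
        "title": "Request Rate",
        "expr": "rate(llm_requests_total[5m])",
        "legend": "{{model}}",
    },
    {
        "title": "Active Requests",
        "expr": "llm_active_requests",
        "legend": "{{model}}",
    },
]

dashboard = build_dashboard("LLM Service Monitoring", panel_specs)
dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```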
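Best Practices
- Metric naming: use a consistent naming convention (such as an llm_ prefix)
- Label management: avoid high-cardinality labels and keep the number of label dimensions under control
- Sampling strategy: adjust the scrape interval to the volume of data
- Storage planning: estimate the data volume and configure the retention policy accordingly
- High availability: use Prometheus federation or Thanos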
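The advice on label cardinality and storage planning can be made tangible with a rough calculation. The sketch below counts the time series one histogram produces and estimates the disk space they need, using the sizing formula from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The label cardinalities, the 15-day retention, and the function names are assumptions chosen for illustration.

```python
import math
from typing import Iterable

SECONDS_PER_DAY = 86_400


def series_count(label_cardinalities: Iterable[int], series_per_combination: int = 1) -> int:
    """Number of time series: one set per combination of label values."""
    return math.prod(label_cardinalities) * series_per_combination


def disk_bytes(series: int, scrape_interval_s: float, retention_days: float,
               bytes_per_sample: float = 2.0) -> float:
    """Rough disk usage: retention seconds x samples per second x bytes per sample."""
    samples_per_second = series / scrape_interval_s
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# llm_request_duration_seconds has 7 finite buckets plus +Inf, _sum and _count:
# at least 10 series per label combination (some clients expose extra series).
HISTOGRAM_SERIES = 10

# Assumed cardinalities: 5 models x 3 methods.
bounded = series_count([5, 3], HISTOGRAM_SERIES)

# Adding a hypothetical user_id label with 10,000 distinct values.
unbounded = series_count([5, 3, 10_000], HISTOGRAM_SERIES)

print(f"bounded labels:    {bounded:>9,} series")
print(f"with user_id:      {unbounded:>9,} series")
print(f"disk for bounded:  {disk_bytes(bounded, 15, 15) / 1e6:.1f} MB")
print(f"disk with user_id: {disk_bytes(unbounded, 15, 15) / 1e9:.1f} GB")
```

A single unbounded label multiplies every existing series, which is why identifiers such as user or request IDs belong in logs or traces rather than in metric labels.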
With Prometheus, you can build a complete monitoring system for your LLM application and base operational decisions on data.