
Monitoring LLMs with Prometheus


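Introduction to Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit that is well suited to cloud-native applications. It collects metrics with a pull model, scraping an HTTP endpoint that each target exposes, and provides PromQL, a powerful query language for the collected time series.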
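To make the pull model concrete, here is a minimal sketch that uses only the Python standard library. It hand-writes one counter in the Prometheus text exposition format and fetches it with a plain HTTP GET, which is roughly what a scrape does. The metric name `demo_requests_total` and the in-memory counter are hypothetical; a real service would normally rely on `prometheus_client` rather than formatting the text itself.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter standing in for real application state.
REQUESTS_SERVED = {"count": 3}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Text exposition format: HELP and TYPE lines, then one sample per line.
        body = (
            "# HELP demo_requests_total Total demo requests\n"
            "# TYPE demo_requests_total counter\n"
            f"demo_requests_total {REQUESTS_SERVED['count']}\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape(host: str, port: int, path: str = "/metrics") -> str:
    """Fetch the metrics text the way a scraper would: one HTTP GET."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
scraped = scrape("127.0.0.1", server.server_port)
server.shutdown()
server.server_close()
print(scraped)
```

The application never pushes anything: it only answers when asked, and the scraper decides how often to ask.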

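For LLM applications, Prometheus can help track request throughput, latency, error rates, token usage, and resource consumption, which are the signals covered in the sections that follow.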

Basic Configuration

Installation and Startup

# Run Prometheus with Docker
docker run -d -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

# Or use Docker Compose (save as docker-compose.yml)
version: '3.8'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

Configuration File

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'llm-service'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
    
  - job_name: 'llm-model-server'
    static_configs:
      - targets: ['localhost:8001']

Defining LLM Metrics

Using the Prometheus Client Library

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
from functools import wraps

# Metric definitions
REQUEST_COUNT = Counter(
    'llm_requests_total',
    'Total number of LLM requests',
    ['model', 'method', 'status']
)

REQUEST_LATENCY = Histogram(
    'llm_request_duration_seconds',
    'LLM request latency in seconds',
    ['model', 'method'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

TOKEN_USAGE = Counter(
    'llm_tokens_total',
    'Total tokens processed',
    ['model', 'type']  # the "type" label is either "input" or "output"
)

ACTIVE_REQUESTS = Gauge(
    'llm_active_requests',
    'Number of active LLM requests',
    ['model']
)

# Decorator that records request metrics automatically
def track_metrics(model: str):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            ACTIVE_REQUESTS.labels(model=model).inc()
            start_time = time.perf_counter()  # monotonic clock, unaffected by wall-clock adjustments
            
            try:
                result = func(*args, **kwargs)
                REQUEST_COUNT.labels(model=model, method=func.__name__, status='success').inc()
                return result
            except Exception:
                REQUEST_COUNT.labels(model=model, method=func.__name__, status='error').inc()
                raise
            finally:
                # Runs on success and on failure, so latency and the in-flight gauge stay accurate
                duration = time.perf_counter() - start_time
                REQUEST_LATENCY.labels(model=model, method=func.__name__).observe(duration)
                ACTIVE_REQUESTS.labels(model=model).dec()
        
        return wrapper
    return decorator
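The decorator above wraps synchronous functions. If the LLM call is a coroutine, as is common in FastAPI handlers, a synchronous wrapper would only time the creation of the coroutine object, not the awaited work. The following is a minimal sketch of an async variant under that assumption. The name `track_metrics_async`, the stand-in `fake_completion` coroutine, and the private `CollectorRegistry` are illustrative choices that are not part of the original code.

```python
import asyncio
import time
from functools import wraps

from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# A private registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

REQUESTS = Counter("llm_requests_total", "Total number of LLM requests",
                   ["model", "method", "status"], registry=registry)
LATENCY = Histogram("llm_request_duration_seconds", "LLM request latency in seconds",
                    ["model", "method"],
                    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0], registry=registry)
ACTIVE = Gauge("llm_active_requests", "Number of active LLM requests",
               ["model"], registry=registry)


def track_metrics_async(model: str):
    """Hypothetical async counterpart of the synchronous decorator."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            ACTIVE.labels(model=model).inc()
            start = time.perf_counter()
            status = "error"  # overwritten only if the awaited call succeeds
            try:
                result = await func(*args, **kwargs)
                status = "success"
                return result
            finally:
                elapsed = time.perf_counter() - start
                LATENCY.labels(model=model, method=func.__name__).observe(elapsed)
                REQUESTS.labels(model=model, method=func.__name__, status=status).inc()
                ACTIVE.labels(model=model).dec()
        return wrapper
    return decorator


@track_metrics_async("demo-model")
async def fake_completion(prompt: str) -> str:
    # Stand-in for a real async LLM call.
    await asyncio.sleep(0.01)
    return prompt.upper()


result = asyncio.run(fake_completion("hello"))
success_count = registry.get_sample_value(
    "llm_requests_total",
    {"model": "demo-model", "method": "fake_completion", "status": "success"},
)
print(result, success_count)
```

Setting `status` before the `try` block and updating it on success keeps the counter logic in one place, so the `finally` clause records exactly one outcome per call.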

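Recording Token Usage

# Assumption: `client` is an OpenAI Python SDK (v1+) client; the original snippet does not define it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(model: str, prompt: str, **kwargs):
    # Call the LLM API
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs
    )
    
    # Record token usage; `usage` is optional in the SDK's response type
    if response.usage is not None:
        TOKEN_USAGE.labels(model=model, type='input').inc(response.usage.prompt_tokens)
        TOKEN_USAGE.labels(model=model, type='output').inc(response.usage.completion_tokens)
    
    return response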

Exposing the Metrics Endpoint

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest, start_http_server

app = FastAPI()

@app.get("/metrics")
async def metrics():
    # Serialize every metric in the default registry
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )

@app.get("/health")
async def health():
    return {"status": "healthy"}

if __name__ == "__main__":
    import uvicorn
    # Also serve the same registry on a dedicated metrics port.
    # Port 8000 (/metrics route above) and port 8001 return identical data.
    start_http_server(8001)
    uvicorn.run(app, host="0.0.0.0", port=8000)

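PromQL Query Examples

Common Queries

# Requests per second for each series (5-minute window)
rate(llm_requests_total[5m])

# P95 request latency across all models and methods
histogram_quantile(0.95, sum by (le) (rate(llm_request_duration_seconds_bucket[5m])))

# Error ratio (both sides are summed; without that, the division matches series
# label-by-label and only ever compares error series with themselves)
sum(rate(llm_requests_total{status="error"}[5m])) / sum(rate(llm_requests_total[5m]))

# Output token throughput (tokens per second)
rate(llm_tokens_total{type="output"}[5m])

# Requests currently in flight
llm_active_requests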
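The P95 query relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that idea in plain Python for classic buckets with non-negative bounds. It is a simplified illustration, not Prometheus's implementation, and the bucket counts are made up; in PromQL the inputs would be per-bucket rates rather than raw counts, but the arithmetic is the same.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the largest finite bound; report that bound.
        return buckets[-2][0]
    # The first bucket is assumed to start at 0 with nothing below it.
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


# Bucket bounds from the histogram definition; the counts are hypothetical.
observed = [(0.1, 5), (0.25, 20), (0.5, 50), (1.0, 80),
            (2.5, 90), (5.0, 98), (10.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, observed)
print(p95)
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: a P95 that falls in the wide 5.0 to 10.0 bucket can be off by seconds, which is worth keeping in mind when choosing `buckets` for slow LLM calls.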

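Advanced Queries

# Estimated cost of the last hour in USD (assuming $0.02 per 1K tokens)
sum(increase(llm_tokens_total[1h])) * 0.02 / 1000

# Latency anomaly detection: series whose mean latency over 5 minutes exceeds the
# overall mean by more than 2 standard deviations. The metric is a histogram, so
# the mean is derived from its _sum and _count series.
(rate(llm_request_duration_seconds_sum[5m]) / rate(llm_request_duration_seconds_count[5m]))
  > scalar(
      avg(rate(llm_request_duration_seconds_sum[5m]) / rate(llm_request_duration_seconds_count[5m]))
      + 2 * stddev(rate(llm_request_duration_seconds_sum[5m]) / rate(llm_request_duration_seconds_count[5m]))
    )

# Request distribution by model
sum by (model) (rate(llm_requests_total[1h]))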
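The cost estimate is simple arithmetic on token counts, so it can be checked by hand. The sketch below works through it in Python, first with the flat $0.02 per 1K tokens assumed in the query and then with separate input and output prices. The token counts and the per-type prices are hypothetical placeholders; substitute your provider's actual rates.

```python
from typing import Mapping

FLAT_PRICE_PER_1K = 0.02  # USD, the flat rate assumed in the cost query

# Hypothetical per-type prices in USD per 1K tokens.
PRICE_PER_1K = {"input": 0.01, "output": 0.03}


def flat_cost(total_tokens: float) -> float:
    """Cost when every token is billed at the same rate."""
    return total_tokens / 1000 * FLAT_PRICE_PER_1K


def split_cost(tokens_by_type: Mapping[str, float]) -> float:
    """Cost when input and output tokens are billed at different rates."""
    return sum(tokens / 1000 * PRICE_PER_1K[kind]
               for kind, tokens in tokens_by_type.items())


# Hypothetical token increase over one hour, keyed by the `type` label.
last_hour = {"input": 120_000, "output": 40_000}

flat = flat_cost(sum(last_hour.values()))  # 160K tokens at the flat rate
split = split_cost(last_hour)              # priced per token type
print(f"flat: ${flat:.2f}, split: ${split:.2f}")
```

Since `llm_tokens_total` already carries a `type` label, the split calculation maps directly onto two label selectors in PromQL.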

Alert Rule Configuration

Alert Rules File

# alerts.yml
groups:
  - name: llm_alerts
    rules:
      # High error rate alert
      - alert: HighErrorRate
        expr: sum(rate(llm_requests_total{status="error"}[5m])) / sum(rate(llm_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM service error rate is too high"
          description: "Error rate is {{ $value | humanizePercentage }}"
      
      # Latency alert
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(llm_request_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM service latency is too high"
          description: "P95 latency is {{ $value }} seconds"
      
      # Resource usage alert (resident memory above 1024 MiB)
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / 1024 / 1024 > 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage is too high"
          description: "Current memory usage is {{ $value }} MiB"
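The `for:` clause in each rule means the expression must stay true across consecutive evaluations for that long before the alert fires; until then the alert is pending, and a single false evaluation resets it. The following is a simplified simulation of that behaviour, assuming the 15-second `evaluation_interval` and the 2-minute `for:` of the error-rate rule. The function name and the sample values are hypothetical, and real Prometheus tracks more state than this sketch does.

```python
from typing import List, Optional, Tuple


def simulate_alert(samples: List[Tuple[int, float]],
                   threshold: float,
                   for_seconds: int) -> List[Tuple[int, str]]:
    """Return (timestamp, state) for each evaluation of a threshold rule."""
    active_since: Optional[int] = None
    states = []
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            held = ts - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            active_since = None  # any false evaluation resets the timer
            state = "inactive"
        states.append((ts, state))
    return states


# Error ratio sampled every 15 s: a brief spike, then a sustained breach from t=45.
samples = [(0, 0.01), (15, 0.08), (30, 0.02)]
samples += [(45 + 15 * i, 0.09) for i in range(10)]

states = simulate_alert(samples, threshold=0.05, for_seconds=120)
first_firing = next(ts for ts, state in states if state == "firing")
print(first_firing)
```

The spike at t=15 never fires because it clears one evaluation later, which is the point of `for:`: it trades a delay in notification for fewer alerts on transient blips.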

Enabling Alerting in Prometheus

# prometheus.yml
rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

Grafana Integration

Adding the Prometheus Data Source

  1. Add a Prometheus data source in Grafana
  2. Set the URL to http://prometheus:9090
  3. Create a dashboard

Preconfigured Dashboard JSON

{
  "dashboard": {
    "title": "LLM Service Monitoring",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [{
          "expr": "rate(llm_requests_total[5m])",
          "legendFormat": "{{model}}"
        }]
      }
    ]
  }
}
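Hand-editing dashboard JSON gets tedious once there are several panels. One option is to generate it, as in the sketch below, which builds the same structure for three of the queries used in this article. The `panel` helper is hypothetical, and only the fields that appear in the JSON above are emitted; a dashboard imported into a real Grafana instance may need additional fields.

```python
import json
from typing import Dict, List


def panel(title: str, expr: str, legend: str) -> Dict:
    """Build one graph panel with a single PromQL target."""
    return {
        "title": title,
        "type": "graph",
        "targets": [{"expr": expr, "legendFormat": legend}],
    }


panels: List[Dict] = [
    panel("Request Rate",
          "sum by (model) (rate(llm_requests_total[5m]))",
          "{{model}}"),
    panel("P95 Latency",
          "histogram_quantile(0.95, sum by (le, model) "
          "(rate(llm_request_duration_seconds_bucket[5m])))",
          "{{model}}"),
    panel("Output Tokens per Second",
          'sum by (model) (rate(llm_tokens_total{type="output"}[5m]))',
          "{{model}}"),
]

dashboard = {"dashboard": {"title": "LLM Service Monitoring", "panels": panels}}
dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```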

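Best Practices

  1. Metric naming: use a consistent naming convention (for example, an llm_ prefix)
  2. Label management: avoid high-cardinality labels and keep the number of label dimensions under control
  3. Sampling strategy: adjust the scrape interval to the data volume
  4. Storage planning: estimate the data volume and configure a suitable retention policy
  5. High availability: use Prometheus federation or Thanos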
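The points on labels and storage planning can be made concrete with a back-of-the-envelope calculation. The sketch below counts the time series a labelled histogram produces and estimates disk usage with the formula from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The label value counts and the `user_id` label are hypothetical and are there only to show how quickly cardinality grows.

```python
import math
from typing import Sequence

SCRAPE_INTERVAL_S = 15   # scrape_interval from prometheus.yml
RETENTION_DAYS = 15      # Prometheus's default retention period
BYTES_PER_SAMPLE = 2     # upper end of the documented 1-2 bytes per sample


def histogram_series(label_value_counts: Sequence[int], finite_buckets: int) -> int:
    """Series produced by one histogram: every label combination gets
    one series per finite bucket, plus +Inf, _sum, and _count."""
    combinations = math.prod(label_value_counts)
    return combinations * (finite_buckets + 3)


def disk_bytes(series: int) -> int:
    """Approximate disk usage for `series` time series over the retention period."""
    samples_per_day = 86_400 // SCRAPE_INTERVAL_S
    return series * samples_per_day * RETENTION_DAYS * BYTES_PER_SAMPLE


# llm_request_duration_seconds has 7 finite buckets and labels (model, method).
# Assume 5 models and 3 methods.
base = histogram_series([5, 3], finite_buckets=7)

# Adding a hypothetical user_id label with 10,000 distinct values.
with_user_id = histogram_series([5, 3, 10_000], finite_buckets=7)

print(base, disk_bytes(base))                  # small and cheap
print(with_user_id, disk_bytes(with_user_id))  # 10,000 times larger
```

A single unbounded label multiplies every series, which is why identifiers such as user or request IDs usually belong in logs or traces rather than in metric labels.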

With Prometheus, you can build a complete monitoring system for LLM applications and base operational decisions on data.