
Visualizing LLM Applications with Grafana


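---
title: "Visualizing LLM Applications with Grafana"
description: "How to build visualization dashboards for LLM applications with Grafana, covering data source configuration, panel design, and alert setup."
tags: ["Grafana", "LLM", "visualization"]
category: "llm"
icon: "🧠"
---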


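Introduction to Grafana

Grafana is an open-source data visualization and monitoring platform. It supports many data sources and lets you build rich dashboards and alert rules. For LLM applications, it gives teams a direct view of system state and performance trends.

Advantages of Grafana: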

Installation and Configuration

Installing with Docker

# Start Grafana
docker run -d -p 3000:3000 \
  -v grafana-storage:/var/lib/grafana \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  grafana/grafana

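# Or use Docker Compose (docker-compose.yml)
version: '3.8'
services:
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana
    environment:
      # Example password only; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin

# Named volumes must be declared at the top level
volumes:
  grafana-storage: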

Configuring Data Sources

# provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
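
Before handing a provisioning file to Grafana, it can help to sanity-check its structure. The sketch below is a minimal, hypothetical validator: it takes the configuration as a Python dict that mirrors the YAML above (the standard library has no YAML parser) and checks for the fields used in this article. The function name `validate_datasources` and the set of required keys are assumptions for illustration, not part of Grafana.

```python
def validate_datasources(config):
    """Return a list of problems found in a provisioning config dict."""
    problems = []
    seen_names = set()
    defaults = 0
    for ds in config.get("datasources", []):
        name = ds.get("name", "<unnamed>")
        # Keys assumed to be required for the HTTP-based sources used here.
        for key in ("name", "type", "url"):
            if not ds.get(key):
                problems.append(f"{name}: missing '{key}'")
        if name in seen_names:
            problems.append(f"{name}: duplicate name")
        seen_names.add(name)
        defaults += bool(ds.get("isDefault"))
    # Only one data source should be marked as the default.
    if defaults > 1:
        problems.append("more than one data source is marked isDefault")
    return problems


# Dict equivalent of provisioning/datasources/prometheus.yml
config = {
    "apiVersion": 1,
    "datasources": [
        {"name": "Prometheus", "type": "prometheus", "access": "proxy",
         "url": "http://prometheus:9090", "isDefault": True, "editable": True},
        {"name": "Loki", "type": "loki", "access": "proxy",
         "url": "http://loki:3100", "editable": True},
    ],
}
problems = validate_datasources(config)
print(problems or "no problems found")
```

A check like this could run in CI so that a typo in a provisioning file is caught before Grafana starts.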

Creating an LLM Monitoring Dashboard

Dashboard Structure

{
  "dashboard": {
    "title": "LLM Service Monitoring",
    "tags": ["llm", "ai", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [{
          "expr": "sum(rate(llm_requests_total[5m])) by (model)",
          "legendFormat": "{{model}}"
        }]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0},
        "targets": [{
          "expr": "sum(rate(llm_requests_total{status=\"error\"}[5m])) / sum(rate(llm_requests_total[5m])) * 100"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      }
    ]
  }
}
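
The `{"dashboard": ...}` envelope above is the shape Grafana's HTTP API accepts at `POST /api/dashboards/db`. As a rough sketch, the following builds that payload in Python and prepares the upload request without sending it. The helper names, the base URL `http://localhost:3000`, and the placeholder token are assumptions for illustration; the 24-column check reflects Grafana's grid width.

```python
import json
import urllib.request


def build_dashboard_payload(title, panels, overwrite=True):
    """Wrap panels in the {"dashboard": ...} envelope used above."""
    for panel in panels:
        pos = panel["gridPos"]
        # Grafana lays panels out on a 24-column grid.
        if pos["x"] + pos["w"] > 24:
            raise ValueError(f"panel {panel['title']!r} exceeds the 24-column grid")
    return {
        "dashboard": {"id": None, "uid": None, "title": title, "panels": panels},
        "overwrite": overwrite,
    }


def build_upload_request(base_url, api_token, payload):
    """Prepare (but do not send) a POST to Grafana's dashboard API."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


panels = [
    {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [{"expr": "sum(rate(llm_requests_total[5m])) by (model)"}],
    }
]
payload = build_dashboard_payload("LLM Service Monitoring", panels)
request = build_upload_request("http://localhost:3000", "YOUR_API_TOKEN", payload)
# urllib.request.urlopen(request) would send it to a running Grafana instance.
```

Generating dashboards from code like this keeps them in version control, at the cost of maintaining the generator alongside the dashboards.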

Core Panel Configuration

Request Statistics Panel

{
  "title": "Request Statistics",
  "type": "stat",
  "targets": [
    {
      "expr": "sum(increase(llm_requests_total[24h]))",
      "legendFormat": "Total requests"
    },
    {
      "expr": "sum(increase(llm_requests_total{status=\"success\"}[24h]))",
      "legendFormat": "Successful requests"
    },
    {
      "expr": "sum(increase(llm_requests_total{status=\"error\"}[24h]))",
      "legendFormat": "Failed requests"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "color": {"mode": "thresholds"},
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"}
        ]
      }
    }
  }
}

Latency Distribution Panel

{
  "title": "Latency Distribution",
  "type": "histogram",
  "targets": [{
    "expr": "llm_request_duration_seconds_bucket",
    "legendFormat": "{{le}}"
  }],
  "options": {
    "bucketBound": "le",
    "combine": false
  }
}
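
Prometheus histogram buckets are cumulative: each `le` series counts every observation less than or equal to its bound. The sketch below shows, in simplified form, how a quantile can be estimated from such buckets by linear interpolation inside the bucket that contains the target rank. It is an illustration of the idea rather than Prometheus's exact implementation, and the bucket counts are made-up sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (le, count) buckets.

    Simplified sketch: interpolates linearly inside the target bucket and
    ignores edge cases that Prometheus itself handles.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bound.
                return prev_bound
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for llm_request_duration_seconds_bucket
buckets = [(0.5, 20), (1.0, 60), (2.0, 90), (5.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(f"estimated p95 latency: {p95:.3f}s")
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket bounds bracket the latencies you care about.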

Model Usage Heatmap

{
  "title": "Model Usage Heatmap",
  "type": "heatmap",
  "targets": [{
    "expr": "sum(rate(llm_requests_total[5m])) by (model, le)",
    "format": "heatmap"
  }],
  "options": {
    "calculate": false,
    "yAxis": {
      "unit": "short"
    }
  }
}

Variables and Templates

Creating Template Variables

{
  "templating": {
    "list": [
      {
        "name": "model",
        "type": "query",
        "query": "label_values(llm_requests_total, model)",
        "refresh": 2,
        "includeAll": true,
        "multi": true
      },
      {
        "name": "time_range",
        "type": "interval",
        "query": "1m,5m,15m,1h",
        "current": {
          "text": "5m",
          "value": "5m"
        }
      }
    ]
  }
}

Using Variables in Queries

# Filter by the model variable
sum(rate(llm_requests_total{model=~"$model"}[$time_range])) by (model)

# Use the time_range interval variable
histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket{model=~"$model"}[$time_range]))
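
To make the substitution concrete, here is a rough sketch of what a query looks like after variables are filled in. For Prometheus, a multi-value selection is expanded into a regex alternation such as `(a|b)`, which is why the queries above use the `=~` matcher. The function `interpolate` and the model names are hypothetical, and regex escaping is deliberately left out: values with special characters are rejected instead.

```python
import re

# Values limited to characters that need no regex escaping (an assumption
# made to keep this sketch short).
SAFE_VALUE = re.compile(r"[A-Za-z0-9_-]+")


def interpolate(query, variables):
    """Substitute $name placeholders roughly as a Prometheus query would see them."""

    def expand(match):
        value = variables[match.group(1)]
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            if not SAFE_VALUE.fullmatch(v):
                raise ValueError(f"value {v!r} would need regex escaping")
        if isinstance(value, (list, tuple)):
            # Multi-value selections become a regex alternation for =~ matchers.
            return "(" + "|".join(values) + ")"
        return value

    return re.sub(r"\$(\w+)", expand, query)


query = 'sum(rate(llm_requests_total{model=~"$model"}[$time_range])) by (model)'
expanded = interpolate(query, {"model": ["gpt-4", "llama-3"], "time_range": "5m"})
print(expanded)
```

Using `=` instead of `=~` in the original query would break as soon as more than one model is selected, since the alternation would be compared as a literal string.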

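Alert Configuration

Configuring Alerts in the UI

  1. Open the Alerting menu
  2. Create a new alert rule
  3. Configure the query and condition
  4. Set up the notification channel

Alert Rules in YAML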

# alerting/rules/llm-alerts.yml
# Prometheus-format rule file, evaluated by Prometheus or a compatible ruler
groups:
  - name: llm-alerts
    rules:
      - alert: HighLLMErrorRate
        expr: |
          sum(rate(llm_requests_total{status="error"}[5m])) by (model)
          /
          sum(rate(llm_requests_total[5m])) by (model)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for LLM model {{ $labels.model }}"
          description: "Error rate: {{ $value | humanizePercentage }}"
      
      - alert: LLMHighLatency
        expr: |
          histogram_quantile(0.95, 
            sum(rate(llm_request_duration_seconds_bucket[5m])) by (model, le)
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for LLM model {{ $labels.model }}"
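
The `for: 5m` clause means a rule does not fire on the first breach: the alert stays pending until the condition has held continuously for the whole duration. The sketch below models that behavior for the error-rate rule in a simplified way, assuming a 60-second evaluation interval. The function `alert_states` and the sample ratios are hypothetical.

```python
def alert_states(error_ratios, threshold=0.05, interval_s=60, for_s=300):
    """Return the alert state after each evaluation of the rule.

    Simplified model: one error-ratio sample per evaluation, evaluated
    every `interval_s` seconds (an assumed interval).
    """
    states = []
    breached = 0  # consecutive evaluations above the threshold
    for ratio in error_ratios:
        if ratio > threshold:
            breached += 1
            # Time elapsed since the first breaching evaluation.
            elapsed = (breached - 1) * interval_s
            states.append("firing" if elapsed >= for_s else "pending")
        else:
            breached = 0  # any recovery resets the pending timer
            states.append("inactive")
    return states


ratios = [0.01, 0.08, 0.09, 0.07, 0.10, 0.06, 0.07, 0.02]
states = alert_states(ratios)
for ratio, state in zip(ratios, states):
    print(f"{ratio:.2f} -> {state}")
```

A longer `for` duration reduces noise from short spikes but delays the notification by the same amount.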

Advanced Visualization Techniques

Marking Events with Annotations

{
  "annotations": {
    "list": [{
      "datasource": "Prometheus",
      "enable": true,
      "expr": "changes(llm_model_version[1m]) > 0",
      "iconColor": "rgba(255, 96, 96, 1)",
      "titleFormat": "Model version updated",
      "tagKeys": "version"
    }]
  }
}

Creating Dashboard Rows

{
  "rows": [
    {
      "title": "Performance Overview",
      "collapsed": false,
      "panels": [...]
    },
    {
      "title": "Detailed Metrics",
      "collapsed": true,
      "panels": [...]
    }
  ]
}
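
Recent Grafana versions represent rows as entries of type `row` inside the flat `panels` list rather than as a top-level `rows` array: an expanded row is followed by its panels, while a collapsed row nests them. As a hedged sketch, the following converts row groups like the ones above into that flat form and assigns grid positions. The helper `flatten_rows`, the per-panel width key `w`, and the stub panels are assumptions for illustration.

```python
def flatten_rows(rows, panel_height=8):
    """Convert row groups into a flat `panels` list with row panels.

    Assumes each child panel carries a width `w` (1-24 grid columns).
    """
    panels = []
    y = 0
    for row in rows:
        row_panel = {
            "type": "row",
            "title": row["title"],
            "collapsed": row.get("collapsed", False),
            "gridPos": {"h": 1, "w": 24, "x": 0, "y": y},
            "panels": [],
        }
        panels.append(row_panel)
        y += 1
        x = 0
        children = []
        for child in row["panels"]:
            width = child.get("w", 12)
            if x + width > 24:  # wrap to the next line of the grid
                x = 0
                y += panel_height
            placed = {k: v for k, v in child.items() if k != "w"}
            placed["gridPos"] = {"h": panel_height, "w": width, "x": x, "y": y}
            children.append(placed)
            x += width
        if row_panel["collapsed"]:
            row_panel["panels"] = children  # collapsed rows nest their panels
            y = row_panel["gridPos"]["y"] + 1
        else:
            panels.extend(children)  # expanded rows are followed by their panels
            if children:
                y += panel_height
    return panels


rows = [
    {"title": "Performance Overview", "collapsed": False, "panels": [
        {"title": "Request Rate", "type": "timeseries", "w": 12},
        {"title": "Error Rate", "type": "gauge", "w": 6},
    ]},
    {"title": "Detailed Metrics", "collapsed": True, "panels": [
        {"title": "Latency Distribution", "type": "histogram", "w": 24},
    ]},
]
flat = flatten_rows(rows)
print([p["title"] for p in flat])
```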

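Best Practices

  1. Layered design: organize dashboards by functional module
  2. Consistency: use a uniform color and style convention
  3. Actionability: panel titles should state clearly what is being monitored
  4. Performance: avoid overly complex queries that slow dashboard loading
  5. Documentation: add explanatory notes to dashboards

With Grafana, you can turn the complex metrics of an LLM application into clear visual dashboards that help the team understand system state quickly and respond to it.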