Visualizing LLM Applications with Grafana
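Introduction to Grafana
Grafana is an open-source data visualization and monitoring platform that supports many data sources and lets you build rich dashboards and alert rules. For LLM applications, it gives teams an at-a-glance view of system state and performance trends.
Advantages of Grafana:
- Multiple data sources: works with Prometheus, Loki, Elasticsearch, and others
- Rich visualizations: charts, tables, heatmaps, and other display types
- Flexible alerting: notifications can be sent to multiple channels
- Collaboration-friendly: dashboards can be shared and exported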
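Installation and Configuration
Installing with Docker
# Start Grafana (change the default admin password outside local testing)
docker run -d -p 3000:3000 \
  -v grafana-storage:/var/lib/grafana \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  grafana/grafana
# Or use Docker Compose (docker-compose.yml)
version: '3.8'
services:
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
# Named volumes must be declared at the top level
volumes:
  grafana-storage: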
Configuring Data Sources
# provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
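When the list of data sources grows, it can help to generate the provisioning file rather than edit it by hand. The following is a minimal sketch, not an official Grafana tool: the `DataSource` class and `render_provisioning` function are hypothetical names introduced here, and the sketch only covers the fields used in the file above. It also checks two constraints before writing anything: data source names should be unique, and at most one entry should be marked as the default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """One entry of the provisioning file (hypothetical helper type)."""
    name: str
    type: str  # Grafana plugin id, e.g. "prometheus" or "loki"
    url: str
    is_default: bool = False


def render_provisioning(sources: list[DataSource]) -> str:
    """Render a data source provisioning file as YAML text."""
    if sum(s.is_default for s in sources) > 1:
        raise ValueError("at most one data source may be the default")
    names = [s.name for s in sources]
    if len(names) != len(set(names)):
        raise ValueError("data source names must be unique")

    lines = ["apiVersion: 1", "datasources:"]
    for s in sources:
        lines += [
            f"  - name: {s.name}",
            f"    type: {s.type}",
            "    access: proxy",
            f"    url: {s.url}",
            f"    isDefault: {str(s.is_default).lower()}",
            "    editable: true",
        ]
    return "\n".join(lines) + "\n"


SOURCES = [
    DataSource("Prometheus", "prometheus", "http://prometheus:9090", is_default=True),
    DataSource("Loki", "loki", "http://loki:3100"),
]
PROVISIONING_YAML = render_provisioning(SOURCES)
print(PROVISIONING_YAML)
```

The printed text can be saved as `provisioning/datasources/prometheus.yml`. Building the YAML by string formatting keeps the sketch dependency-free; a real project would more likely use a YAML library.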
Building an LLM Monitoring Dashboard
Dashboard Structure
{
  "dashboard": {
    "title": "LLM Service Monitoring",
    "tags": ["llm", "ai", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [{
          "expr": "sum(rate(llm_requests_total[5m])) by (model)",
          "legendFormat": "{{model}}"
        }]
      },
      {
        "title": "Error rate",
        "type": "gauge",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0},
        "targets": [{
          "expr": "sum(rate(llm_requests_total{status=\"error\"}[5m])) / sum(rate(llm_requests_total[5m])) * 100"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      }
    ]
  }
}
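Each panel's `gridPos` places it on Grafana's 24-column dashboard grid, where `w` and `h` are the size and `x` and `y` the top-left corner. Writing these coordinates by hand gets tedious as panels are added, so the sketch below computes them by filling rows left to right. The `layout` helper and its input shape (a `"w"` key per panel) are assumptions made for illustration, not part of Grafana.

```python
import json

GRID_COLUMNS = 24  # Grafana dashboards use a 24-column grid


def layout(panels: list[dict], height: int = 8) -> list[dict]:
    """Assign gridPos to panels left to right, wrapping at the grid width.

    Each input dict needs a "w" key (width in grid columns); it is moved
    into the generated "gridPos" object.
    """
    x = y = 0
    placed = []
    for panel in panels:
        panel = dict(panel)  # copy so the caller's dict is left untouched
        w = panel.pop("w")
        if not 0 < w <= GRID_COLUMNS:
            raise ValueError(f"panel width {w} does not fit the grid")
        if x + w > GRID_COLUMNS:  # no room left on this row, start a new one
            x, y = 0, y + height
        panel["gridPos"] = {"h": height, "w": w, "x": x, "y": y}
        placed.append(panel)
        x += w
    return placed


panels = layout([
    {"title": "Request rate", "type": "timeseries", "w": 12},
    {"title": "Error rate", "type": "gauge", "w": 6},
    {"title": "Request statistics", "type": "stat", "w": 12},
])
dashboard = {"dashboard": {"title": "LLM Service Monitoring", "panels": panels}}
print(json.dumps(dashboard, indent=2))
```

The first two panels end up at the same positions as in the JSON above; the third does not fit in the remaining six columns and wraps to the next row.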
Core Panel Configuration
Request Statistics Panel
{
  "title": "Request statistics",
  "type": "stat",
  "targets": [
    {
      "expr": "sum(increase(llm_requests_total[24h]))",
      "legendFormat": "Total requests"
    },
    {
      "expr": "sum(increase(llm_requests_total{status=\"success\"}[24h]))",
      "legendFormat": "Successful requests"
    },
    {
      "expr": "sum(increase(llm_requests_total{status=\"error\"}[24h]))",
      "legendFormat": "Failed requests"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "color": {"mode": "thresholds"},
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"}
        ]
      }
    }
  }
}
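Latency Distribution Panel
{
  "title": "Latency distribution",
  "type": "bargauge",
  "targets": [{
    "expr": "sum(increase(llm_request_duration_seconds_bucket[$__range])) by (le)",
    "format": "heatmap",
    "instant": true,
    "legendFormat": "{{le}}"
  }],
  "options": {
    "displayMode": "gradient",
    "orientation": "vertical"
  }
}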
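The `llm_request_duration_seconds_bucket` series are cumulative: each `le` bucket counts every request that finished at or below that bound, so a distribution chart needs the differences between neighbouring buckets. The sketch below shows that conversion and a simplified version of the linear interpolation that PromQL's `histogram_quantile` applies to classic histograms. The bucket bounds and counts are made-up sample data, and the function names are hypothetical.

```python
import math

# Made-up cumulative counts per upper bound ("le"), in seconds
BUCKETS = [(0.5, 20.0), (1.0, 50.0), (2.5, 80.0), (5.0, 95.0),
           (10.0, 100.0), (math.inf, 100.0)]


def per_bucket_counts(cumulative):
    """Turn cumulative (le, count) pairs into per-bucket counts."""
    result, previous = [], 0.0
    for le, count in sorted(cumulative):
        result.append((le, count - previous))
        previous = count
    return result


def estimate_quantile(q, cumulative):
    """Estimate a quantile by linear interpolation inside the target bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(cumulative)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Quantile lies beyond the last finite bound: report that bound
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * fraction
        lower_bound, lower_count = le, count
    return math.nan


print(per_bucket_counts(BUCKETS))
print(estimate_quantile(0.5, BUCKETS), estimate_quantile(0.95, BUCKETS))
```

Because the estimate interpolates within a bucket, its accuracy depends on how finely the bucket bounds are chosen around the latencies of interest.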
Model Usage Heatmap
{
  "title": "Model usage heatmap",
  "type": "heatmap",
  "targets": [{
    "expr": "sum(rate(llm_requests_total[5m])) by (model)",
    "format": "time_series",
    "legendFormat": "{{model}}"
  }],
  "options": {
    "calculate": false,
    "yAxis": {
      "unit": "short"
    }
  }
}
Variables and Templates
Creating Template Variables
{
  "templating": {
    "list": [
      {
        "name": "model",
        "type": "query",
        "query": "label_values(llm_requests_total, model)",
        "refresh": 2,
        "includeAll": true,
        "multi": true
      },
      {
        "name": "time_range",
        "type": "interval",
        "query": "1m,5m,15m,1h",
        "current": {
          "text": "5m",
          "value": "5m"
        }
      }
    ]
  }
}
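Using Variables in Queries
# Filter by the model variable
sum(rate(llm_requests_total{model=~"$model"}[$time_range])) by (model)
# Use the time_range variable as the rate window for P95 latency
histogram_quantile(0.95, sum(rate(llm_request_duration_seconds_bucket{model=~"$model"}[$time_range])) by (model, le))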
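Grafana substitutes template variables into the query text before the query reaches Prometheus. For a multi-value variable used with a regex matcher (`=~`), the selected values are typically joined into an alternation such as `(a|b)`. The sketch below imitates that substitution to make the resulting PromQL visible. It is a simplification under stated assumptions: the model names `gpt-4o` and `llama3` are hypothetical, the helper functions are not Grafana APIs, and no regex escaping is done (Grafana also escapes regex metacharacters in values).

```python
def expand_multi_value(values: list[str]) -> str:
    """Approximate how selected values are rendered for a regex matcher.

    Simplification: values are assumed to contain no regex metacharacters.
    """
    if len(values) == 1:
        return values[0]
    return "(" + "|".join(values) + ")"


def interpolate(query: str, variables: dict[str, str]) -> str:
    """Replace $name placeholders, longest names first to avoid prefix clashes."""
    for name in sorted(variables, key=len, reverse=True):
        query = query.replace(f"${name}", variables[name])
    return query


QUERY = 'sum(rate(llm_requests_total{model=~"$model"}[$time_range])) by (model)'

rendered = interpolate(QUERY, {
    "model": expand_multi_value(["gpt-4o", "llama3"]),  # two models selected
    "time_range": "5m",
})
print(rendered)
```

Seeing the expanded query explains why the matcher has to be `=~` rather than `=`: with several models selected, an equality matcher would compare the label against the literal text `(gpt-4o|llama3)` and match nothing.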
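Alert Configuration
Configuring Alerts in the UI
- Open the Alerting menu
- Create a new alert rule
- Configure the query and the alert condition
- Choose where notifications are sent
Alert Rules in YAML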
# alerting/rules/llm-alerts.yml
# Prometheus alerting-rule format; loaded via rule_files in prometheus.yml
groups:
  - name: llm-alerts
    rules:
      - alert: HighLLMErrorRate
        expr: |
          sum(rate(llm_requests_total{status="error"}[5m])) by (model)
            /
          sum(rate(llm_requests_total[5m])) by (model)
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for LLM model {{ $labels.model }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: LLMHighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(llm_request_duration_seconds_bucket[5m])) by (model, le)
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for LLM model {{ $labels.model }}"
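The `for: 5m` clause means a rule does not fire the first time its expression is true: the alert stays pending until the condition has held at every evaluation for the whole duration, and it resets as soon as one evaluation fails. The sketch below simulates that state machine for the error-rate rule. The `AlertState` class, the 60-second evaluation interval, and the sample ratios are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert through inactive -> pending -> firing."""
    threshold: float
    hold_seconds: int
    active_since: Optional[int] = None

    def evaluate(self, now: int, value: float) -> str:
        if not value > self.threshold:  # also covers NaN (no traffic)
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now  # condition just became true
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"


# Hypothetical error ratios observed at successive evaluations, 60 s apart
RATIOS = [0.02, 0.08, 0.09, 0.07, 0.10, 0.12, 0.11, 0.01]

alert = AlertState(threshold=0.05, hold_seconds=5 * 60)
states = [alert.evaluate(i * 60, ratio) for i, ratio in enumerate(RATIOS)]
for i, (ratio, state) in enumerate(zip(RATIOS, states)):
    print(f"t={i * 60:>3}s ratio={ratio:.2f} -> {state}")
```

In this run the ratio first exceeds 5% at t=60s, so the alert only fires at t=360s, five minutes later, and drops back to inactive when the ratio recovers. A longer `for` reduces noise from short spikes at the cost of slower notification.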
Advanced Visualization Techniques
Marking Events with Annotations
{
  "annotations": {
    "list": [{
      "datasource": "Prometheus",
      "enable": true,
      "expr": "changes(llm_model_version[1m]) > 0",
      "iconColor": "rgba(255, 96, 96, 1)",
      "titleFormat": "Model version updated",
      "tagKeys": "version"
    }]
  }
}
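The annotation query relies on PromQL's `changes()`, which counts how often a series' sample value changed within the range window. The sketch below reproduces that counting on a plain list. It assumes that `llm_model_version` is a gauge whose value is a numeric version; if the version were carried only in a label, each version would be a separate series with a constant value and `changes()` would stay at zero.

```python
def changes(samples: list[float]) -> int:
    """Count how many times consecutive samples differ (like PromQL changes())."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


# Hypothetical samples of llm_model_version inside two 1-minute windows
during_rollout = [3.0, 3.0, 3.0, 4.0, 4.0]  # version bumped from 3 to 4
steady_state = [4.0, 4.0, 4.0, 4.0]

for window in (during_rollout, steady_state):
    # The annotation is drawn only when the count is greater than zero
    print(window, "->", "annotate" if changes(window) > 0 else "no annotation")
```

Only the window that contains the version bump produces an annotation, which is what the `> 0` filter in the query expresses.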
Creating Dashboard Rows
{
  "rows": [
    {
      "title": "Performance overview",
      "collapsed": false,
      "panels": [...]
    },
    {
      "title": "Detailed metrics",
      "collapsed": true,
      "panels": [...]
    }
  ]
}
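Best Practices
- Layered design: organize dashboards by functional module
- Consistency: use a uniform color and style convention
- Actionability: panel titles should state clearly what is being monitored
- Performance: avoid overly complex queries that slow dashboard loading
- Documentation: add explanatory notes to each dashboard
With Grafana, you can turn the complex metrics of an LLM application into intuitive visualizations that help the team understand and respond to system state quickly.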