Grafana Dashboards: Multiple Data Sources and Alerting Configuration

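Grafana Architecture Overview

Grafana is an open-source data visualization platform that supports many data sources and provides a wide range of chart types along with alerting. It serves as the visualization layer of an observability stack.

Grafana architecture: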
┌────────────────────────────────────────────────────┐
│                   Grafana Server                   │
├────────────────┬────────────────┬──────────────────┤
│   Dashboards   │     Panels     │ Alerting engine  │
├────────────────┴────────────────┴──────────────────┤
│                  Data Source API                   │
├────────────┬──────┬────────┬───────┬───────────────┤
│ Prometheus │ Loki │ Jaeger │ MySQL │ Elasticsearch │
└────────────┴──────┴────────┴───────┴───────────────┘

Configuring Multiple Data Sources

Data source configuration example

# provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    uid: prometheus  # fixed uid so alert rules can reference it via datasourceUid
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: '15s'
      httpMethod: POST
  
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
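      derivedFields:
        # Turn trace_id=<id> in a log line into a link to the Jaeger data source
        - datasourceUid: jaeger
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          # $$ escapes $ so provisioning does not treat it as an environment variable
          url: "$${__value.raw}"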
  
  - name: Jaeger
    uid: jaeger  # must match datasourceUid in the Loki derivedFields
    type: jaeger
    access: proxy
    url: http://jaeger:16686
  
  - name: MySQL
    type: mysql
    url: mysql:3306
    database: grafana
    user: grafana
    secureJsonData:
      # Placeholder value; do not commit a real password to version control
      password: 'password'
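
To make the Loki derivedFields entry more concrete, the sketch below applies the same trace_id=(\w+) pattern to a log line in Python. The sample log line and the helper name extract_trace_id are assumptions for illustration only; Grafana runs the match itself, and this merely mirrors that step.

```python
import re

# Same pattern as matcherRegex in the Loki data source above.
# In double-quoted YAML, "\\w" is the regex token \w.
MATCHER_REGEX = r"trace_id=(\w+)"


def extract_trace_id(log_line):
    """Return the first captured trace ID, or None if the line has none."""
    match = re.search(MATCHER_REGEX, log_line)
    return match.group(1) if match else None


# Hypothetical log line; real formats depend on the application.
sample = 'level=error msg="upstream timeout" trace_id=4bf92f3577b34da6 duration=2.1s'
print(extract_trace_id(sample))
```

Note that \w does not match hyphens, so trace IDs written in a dashed UUID style would be cut at the first dash and would need a wider character class.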

Dashboard Design

Dashboard JSON model

{
  "dashboard": {
    "title": "Service Overview",
    "tags": ["microservice", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)",
          "refresh": 2,
          "multi": true,
          "includeAll": true
        },
        {
          "name": "interval",
          "type": "interval",
          "query": "1m,5m,15m,30m,1h",
          "auto": true,
          "auto_min": "1m"
        }
      ]
    },
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (status)",
            "legendFormat": "{{status}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps",
            "custom": {
              "drawStyle": "line",
              "fillOpacity": 20
            }
          }
        }
      }
    ]
  }
}
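
Because the service variable allows multiple values and an All option, Grafana expands it to a regex alternation for Prometheus queries, which is why the label matcher should be =~ rather than =. The following is a simplified sketch of that substitution, not Grafana's actual implementation: the function name interpolate is hypothetical, and regex escaping of special characters in values is left out.

```python
import re


def interpolate(expr, variables):
    """Replace $name in expr with the selected values of that variable.

    A single value is inserted as-is; several values become (a|b|c),
    which only works inside a regex matcher such as =~.
    """
    def substitute(match):
        values = variables.get(match.group(1))
        if values is None:
            return match.group(0)  # unknown variable: leave it untouched
        if len(values) == 1:
            return values[0]
        return "(" + "|".join(values) + ")"

    return re.sub(r"\$(\w+)", substitute, expr)


query = 'sum(rate(http_requests_total{service=~"$service"}[$interval])) by (status)'
selected = {"service": ["cart", "checkout"], "interval": ["5m"]}
print(interpolate(query, selected))
# sum(rate(http_requests_total{service=~"(cart|checkout)"}[5m])) by (status)
```

With an equality matcher, the expanded string (cart|checkout) would be compared literally and match no series.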

Common panel types

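panel_types:
  timeseries:
    description: "Time-series line chart"
    use_case: "Trend analysis, monitoring curves"
    features: ["multiple series", "fill", "gradient"]

  stat:
    description: "Single-value statistic"
    use_case: "Current value, status indication"
    features: ["color mapping", "thresholds", "sparkline"]

  gauge:
    description: "Gauge"
    use_case: "Percentages, progress"
    features: ["threshold ranges", "gradient colors"]

  barchart:
    description: "Bar chart"
    use_case: "Comparative analysis"
    features: ["stacking", "horizontal/vertical"]

  table:
    description: "Table"
    use_case: "Detailed data"
    features: ["sorting", "filtering", "pagination"]

  heatmap:
    description: "Heatmap"
    use_case: "Distribution analysis"
    features: ["histogram", "color mapping"]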

Grafana Alerting Configuration

Alert rules

# Alert rule configuration
apiVersion: 1
groups:
  - orgId: 1
    name: Service Alerts
    folder: Production
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
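        data:
          # A: instant query for the ratio of 5xx requests to all requests
          - refId: A
            # Query window in seconds relative to the evaluation time
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: prometheus  # must match the uid of the Prometheus data source
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
              instant: true
          # B: reduce the result of A to a single value (the last one)
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          # C: the alert condition, true when B is greater than 0.05 (5%)
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [0.05]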
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
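
The rule chains three steps (query A, reduce B, threshold C) and only fires after the condition has held for the for duration. The sketch below is a rough model of that behavior in plain Python, under the assumption that one sample arrives per 1m evaluation interval. The function names and the state handling are illustrative, not Grafana's scheduler; in particular, a zero request rate is treated here as a ratio of 0, whereas Prometheus would return no usable value.

```python
THRESHOLD = 0.05        # params: [0.05] in step C
FOR_SECONDS = 5 * 60    # for: 5m
INTERVAL_SECONDS = 60   # interval: 1m


def error_ratio(errors_per_s, total_per_s):
    """Steps A and B: the instant query already yields one value per evaluation."""
    if total_per_s == 0:
        return 0.0  # simplification, see the note above
    return errors_per_s / total_per_s


def evaluate(samples):
    """Return the alert state after each evaluation.

    samples is a list of (errors_per_s, total_per_s) tuples, one per interval.
    """
    states = []
    breaching_for = None  # seconds the condition has been continuously true
    for errors, total in samples:
        if error_ratio(errors, total) > THRESHOLD:  # step C: gt 0.05
            if breaching_for is None:
                breaching_for = 0
            else:
                breaching_for += INTERVAL_SECONDS
            states.append("Alerting" if breaching_for >= FOR_SECONDS else "Pending")
        else:
            breaching_for = None  # condition cleared, the pending timer resets
            states.append("Normal")
    return states
```

In this model a 10% error ratio has to persist for six consecutive evaluations (five minutes after the first breach) before the state moves from Pending to Alerting, and a single healthy evaluation resets the timer.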

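Notification channels

# Contact point configuration
apiVersion: 1
contactPoints:
  - orgId: 1
    name: Slack
    receivers:
      - uid: slack-alerts  # required; any unique string works
        type: slack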
        settings:
          url: https://hooks.slack.com/services/xxx
          recipient: "#alerts"
          title: |
            [{{ len .Alerts.Firing }}] {{ .GroupLabels.alertname }}
          text: |
            {{ range .Alerts }}
            *Instance:* {{ .Labels.instance }}
            *Summary:* {{ .Annotations.summary }}
            {{ end }}
  
  - orgId: 1
    name: Webhook
    receivers:
      - uid: webhook-alerts  # required; any unique string works
        type: webhook
        settings:
          url: http://alert-handler:8080/alert
          httpMethod: POST
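
On the receiving side of the webhook contact point, something has to accept the POST at /alert. The following is a minimal sketch using only the Python standard library. It assumes the payload carries an alerts list whose items have status, labels, and annotations. The names summarize, AlertHandler, and serve are hypothetical, and the alert-handler service itself is not described in this article.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Turn a webhook payload into one human-readable line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{}] {} severity={}: {}".format(
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            labels.get("severity", "none"),
            annotations.get("summary", ""),
        ))
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/alert":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Block forever, handling alert webhooks on the given port."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling serve() starts the listener on port 8080, matching the URL in the contact point. A production handler would also want authentication and would forward alerts somewhere instead of printing them.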

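Best Practices

  1. Layered dashboards: drill down level by level, Overview → Service → Instance
  2. Variable-driven: use template variables so one dashboard can be reused
  3. Consistent colors: red = error, yellow = warning, green = healthy
  4. Tiered alerts: grade alerts as Critical/Warning/Info to avoid alert fatigue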
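
One way to keep the color convention consistent is to generate the thresholds block for every panel from a single helper instead of hand-editing JSON. The sketch below is one possible approach: the helper name threshold_steps and the example values 0.01 and 0.05 are assumptions, while the output follows the thresholds structure used under fieldConfig.defaults in the dashboard JSON.

```python
def threshold_steps(warning, critical):
    """Build a thresholds block: green below warning, yellow up to critical, red above."""
    if not warning < critical:
        raise ValueError("warning must be lower than critical")
    return {
        "mode": "absolute",
        "steps": [
            {"color": "green", "value": None},     # base step, serialized as null
            {"color": "yellow", "value": warning},
            {"color": "red", "value": critical},
        ],
    }


# Example: attach the same color scheme to an error-ratio panel.
panel = {"title": "Error Ratio", "type": "stat", "fieldConfig": {"defaults": {}}}
panel["fieldConfig"]["defaults"]["thresholds"] = threshold_steps(0.01, 0.05)
```

Using the same critical value here as in the alert rule threshold keeps what the dashboard shows in red aligned with what actually pages someone.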