← 返回首页
📊

完整可观测性平台

📂 devops ⏱ 2 min 259 words

完整可观测性平台

可观测性架构

数据源
├── Metrics → Prometheus → Grafana
├── Logs → Fluentd → Elasticsearch → Kibana
└── Traces → Jaeger → Grafana

统一查询
└── Grafana (多数据源)

Docker Compose部署

version: '3.8'

services:
  # Metrics
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
  
  # Logs
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
  
  logstash:
    image: docker.elastic.co/logstash/logstash:8.10.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
  
  kibana:
    image: docker.elastic.co/kibana/kibana:8.10.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
  
  # Traces
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "4317:4317"
      - "4318:4318"

volumes:
  grafana_data:
  es_data:

Prometheus配置

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'app'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/metrics'
  
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

rule_files:
  - 'rules/*.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Grafana仪表板

{
  "dashboard": {
    "title": "Observability Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "format": "percent"
      },
      {
        "title": "Latency P99",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ]
      }
    ]
  }
}

告警配置

groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"

实践:完整监控系统

# 1. 启动所有服务
docker-compose up -d

# 2. 配置Grafana数据源
# Prometheus: http://prometheus:9090
# Elasticsearch: http://elasticsearch:9200
# Jaeger: http://jaeger:16686

# 3. 导入仪表板
# Node Exporter: 1860
# Docker: 893
# MySQL: 7362

最佳实践

  1. 统一数据模型
  2. 关联指标、日志和追踪
  3. 自动化告警
  4. 定期审查
  5. 文档化仪表板

总结

完整的可观测性平台需要整合Metrics、Logs和Traces。通过Prometheus、ELK Stack和Jaeger,可以构建全面的可观测性系统。