A Complete Observability Platform
Observability Architecture
Data sources
├── Metrics → Prometheus → Grafana
├── Logs → Logstash → Elasticsearch → Kibana
└── Traces → Jaeger → Grafana
Unified query
└── Grafana (multiple data sources)
Docker Compose Deployment
version: '3.8'
services:
  # Metrics
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    environment:
      # Default password for local testing only; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
  # Logs
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
    environment:
      - discovery.type=single-node
      # Security is disabled for local development only
      - xpack.security.enabled=false
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
  logstash:
    image: docker.elastic.co/logstash/logstash:8.10.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
  kibana:
    image: docker.elastic.co/kibana/kibana:8.10.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
  # Traces
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI and query API
      - "4317:4317"    # OTLP over gRPC
      - "4318:4318"    # OTLP over HTTP
volumes:
  grafana_data:
  es_data:
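Once the stack is up, it helps to confirm that each service answers on its published port. The sketch below is one way to do that using only the Python standard library. The ports come from the Compose file; the paths are the commonly documented health endpoints (`/-/healthy` for Prometheus, `/api/health` for Grafana, `/_cluster/health` for Elasticsearch, `/api/status` for Kibana), and the Jaeger entry simply requests the UI root. The `localhost` host and the helper names are assumptions for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Published ports come from the Compose file; paths are assumed health endpoints.
HEALTH_URLS: Dict[str, str] = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
    "elasticsearch": "http://localhost:9200/_cluster/health",
    "kibana": "http://localhost:5601/api/status",
    "jaeger": "http://localhost:16686/",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def check_stack(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each service name to True when its endpoint answers with HTTP 200."""
    return {name: fetch(url) == 200 for name, url in HEALTH_URLS.items()}
```

Calling `check_stack()` from the Docker host returns a dictionary keyed by service name. Elasticsearch and Kibana can take a minute or so to start, so a `False` right after startup is not necessarily a failure. The `fetch` parameter exists so the check can be exercised without a running stack.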
Prometheus Configuration
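global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Note: app, node-exporter, and alertmanager are not defined in the Compose file above;
# add those services or remove the entries that reference them.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'app'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/metrics'
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
rule_files:
  # Resolved relative to this file, so the rules directory must also be mounted into the container
  - 'rules/*.yml'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']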
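The `app` job expects the application to serve metrics at `/metrics` on port 8080 in the Prometheus text exposition format. A real service would normally use an official client library; the sketch below uses only the standard library to show roughly what the scrape target returns. The in-memory counter, the `record_request` helper, and the `checkout` service name used in the note are hypothetical.

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# (service, status) -> number of requests seen so far
REQUESTS: Counter = Counter()


def record_request(service: str, status: int) -> None:
    """Count one handled request for the given service and HTTP status."""
    REQUESTS[(service, str(status))] += 1


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (service, status), count in sorted(REQUESTS.items()):
        lines.append(
            f'http_requests_total{{service="{service}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered counters at /metrics and 404 everything else."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Serve /metrics on the port that the 'app' scrape job targets."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

After a few calls such as `record_request("checkout", 200)`, running `serve()` makes the counters available to the scrape job. The `service` and `status` labels match the ones used by the dashboard and alert expressions in this article.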
Grafana Dashboard
{
  "dashboard": {
    "title": "Observability Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent"
          }
        }
      },
      {
        "title": "Latency P99",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ]
      }
    ]
  }
}
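The Latency P99 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified re-implementation meant for intuition only. It assumes the buckets are already sorted by upper bound, end with `+Inf`, and have positive finite bounds (as latency buckets do); it also excludes `q = 0` to keep the arithmetic simple.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be within (0, 1] in this sketch")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, the last one being +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)

    if idx == len(buckets) - 1:
        # The rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    upper, count = buckets[idx]
    # The first bucket is assumed to start at 0.
    lower, prev_count = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly within the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
```

With made-up buckets `[(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]` and `q = 0.75`, the rank of 75 falls in the (0.1, 0.5] bucket and interpolates to about 0.35 seconds. The estimate can never be more precise than the bucket boundaries allow, which is why bucket layout matters for latency panels.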
Alerting Configuration
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
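The `for: 5m` clause means the expression must stay true across consecutive evaluations for five minutes before the alert fires; until then the alert is only pending. The sketch below models that behavior in a simplified way, assuming evaluations arrive as (timestamp in seconds, error ratio) pairs. The function name and default values are hypothetical and mirror the `HighErrorRate` rule.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 0.05,
    for_seconds: float = 300.0,
) -> List[Tuple[float, str]]:
    """Return (timestamp, state) per evaluation: inactive, pending, or firing."""
    states: List[Tuple[float, str]] = []
    active_since: Optional[float] = None  # when the condition first became true

    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            states.append((ts, "firing" if held >= for_seconds else "pending"))
        else:
            # Any evaluation at or below the threshold resets the pending timer.
            active_since = None
            states.append((ts, "inactive"))
    return states
```

A single evaluation at or below the threshold resets the timer, so a short `for` window reacts quickly to real incidents while a longer one filters out brief spikes. Choosing the value is a trade-off between detection delay and alert noise.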
Hands-on: A Complete Monitoring System
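# 1. Start all services (with Compose V2, use: docker compose up -d)
docker-compose up -d
# 2. Configure Grafana data sources
#    Prometheus: http://prometheus:9090
#    Elasticsearch: http://elasticsearch:9200
#    Jaeger: http://jaeger:16686
# 3. Import dashboards by their grafana.com dashboard ID
#    Node Exporter: 1860
#    Docker: 893
#    MySQL: 7362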
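Step 2 can also be scripted against Grafana's HTTP API (`POST /api/datasources`) instead of clicking through the UI. The sketch below builds one payload per backend from the URLs listed above and posts them with basic auth. The data source names, the `admin`/`admin` credentials from the Compose file, and the `localhost:3000` address are assumptions; the Elasticsearch data source usually also needs index and time-field settings, which are omitted here.

```python
import base64
import json
import urllib.request
from typing import Dict, List

# In-network URLs taken from the setup steps; display names are illustrative.
DATASOURCES = [
    ("Prometheus", "prometheus", "http://prometheus:9090"),
    ("Elasticsearch", "elasticsearch", "http://elasticsearch:9200"),
    ("Jaeger", "jaeger", "http://jaeger:16686"),
]


def build_payloads() -> List[Dict[str, object]]:
    """Build one request body per data source; the first becomes the default."""
    return [
        {
            "name": name,
            "type": ds_type,
            "url": url,
            "access": "proxy",  # Grafana's backend performs the requests
            "isDefault": index == 0,
        }
        for index, (name, ds_type, url) in enumerate(DATASOURCES)
    ]


def create_datasources(
    base_url: str = "http://localhost:3000",
    user: str = "admin",
    password: str = "admin",
) -> None:
    """POST each payload to Grafana; raises urllib.error.HTTPError on failure."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    for payload in build_payloads():
        request = urllib.request.Request(
            f"{base_url}/api/datasources",
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Basic {token}",
            },
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=5) as response:
            print(payload["name"], response.status)
```

The URLs use the Compose service names because Grafana resolves them from inside the Docker network, while the script itself talks to Grafana through the published port. Running `create_datasources()` a second time would likely fail with a conflict, since the names already exist.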
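Best Practices
- Use a unified data model
- Correlate metrics, logs, and traces
- Automate alerting
- Review alert rules and dashboards regularly
- Document dashboards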
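Correlating the three signals usually comes down to sharing identifiers: if every log line carries the trace ID of the request that produced it, a log entry in Kibana can be matched to a trace in Jaeger. The sketch below shows one way to do that with the standard `logging` module. The field names (`trace_id`, `service`) and the `contextvars`-based propagation are assumptions, not a fixed schema.

```python
import contextvars
import json
import logging
import uuid

# Trace ID of the request currently being handled (empty outside a request).
current_trace_id = contextvars.ContextVar("current_trace_id", default="")


class JsonTraceFormatter(logging.Formatter):
    """Format records as JSON and attach the active trace ID."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            "trace_id": current_trace_id.get(),
        })


def new_trace_id() -> str:
    """Generate a 32-hex-character ID, the length used by W3C Trace Context."""
    return uuid.uuid4().hex


def configure_logging(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines with the trace ID to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonTraceFormatter(service))
    logger = logging.getLogger(service)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

In a request handler, calling `current_trace_id.set(new_trace_id())` before logging makes every subsequent log line in that context carry the same ID. In practice the ID would come from the tracing library rather than being generated here, so that logs and spans agree.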
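Summary
A complete observability platform needs to integrate metrics, logs, and traces. Prometheus, the ELK Stack, and Jaeger together provide the building blocks for such a system, with Grafana as the unified query layer.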