Advanced Monitoring and Alerting
Monitoring Architecture
Data collection
├── Metrics: Prometheus, Node Exporter
├── Logs: Fluentd, Filebeat
└── Traces: Jaeger, OpenTelemetry
Data storage
├── Prometheus TSDB
├── Elasticsearch
└── Thanos (long-term storage)
Visualization and alerting
├── Grafana (dashboards)
├── Alertmanager (alert routing)
└── PagerDuty (notifications)
Advanced Prometheus Configuration
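global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Rule files (recording and alerting rules)
rule_files:
  - 'rules/recording.yml'
  - 'rules/alerting.yml'
# Remote write
remote_write:
  - url: http://thanos-receive:19291/api/v1/receive
# Service discovery
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the scrape address to <pod IP>:<port from prometheus.io/port>.
      # Both the address and the annotation are needed as source labels;
      # Prometheus joins them with ';' before matching the regex.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2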
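
For reference, a pod that the `kubernetes-pods` job above would discover might look like the sketch below. The annotation names follow from the `__meta_kubernetes_pod_annotation_prometheus_io_*` labels used in the relabel rules (Prometheus replaces `.` and `/` in annotation names with `_`); the pod name, image, and port are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                     # hypothetical name
  annotations:
    prometheus.io/scrape: "true"     # matched by the `keep` relabel rule
    prometheus.io/port: "8080"       # used to rewrite __address__
spec:
  containers:
    - name: app
      image: example/demo-app:1.0    # hypothetical image
      ports:
        - containerPort: 8080
```

Annotation values must be strings, which is why both values are quoted.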
Recording Rules
# rules/recording.yml
groups:
  - name: recording_rules
    rules:
      # Per-second request rate, averaged over the last minute
      - record: job:http_requests:rate1m
        expr: sum(rate(http_requests_total[1m])) by (job)
      # Error ratio (share of 5xx responses) over the last hour
      - record: job:http_errors:ratio1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h])) by (job)
          /
          sum(rate(http_requests_total[1h])) by (job)
      # P99 latency over the last 5 minutes
      - record: job:http_duration:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
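
Once a series is recorded, alerts and dashboards can query it by name instead of re-evaluating the underlying expression. A minimal sketch of an alert built on `job:http_errors:ratio1h`; the group name, alert name, 1% threshold, and `for` duration are assumptions chosen for illustration:

```yaml
groups:
  - name: recorded_series_alerts        # hypothetical group name
    rules:
      - alert: HighErrorRatio           # hypothetical alert name
        # Reads the precomputed series rather than recomputing the 1h rates
        expr: job:http_errors:ratio1h > 0.01   # 1% threshold is an assumption
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Error ratio above 1% for job {{ $labels.job }}"
```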
Advanced Alerting Rules
# rules/alerting.yml
groups:
  - name: advanced_alerts
    rules:
      # Predictive alert: extrapolate the last 6h of free space 24h ahead
      - alert: DiskSpaceWillFill
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk space is predicted to run out within 24 hours"
      # Anomaly detection: current rate deviates from the 7-day mean
      # by more than 3 standard deviations
      - alert: AnomalousTraffic
        expr: |
          abs(
            sum(rate(http_requests_total[5m]))
            - avg_over_time(sum(rate(http_requests_total[5m]))[7d:5m])
          ) > 3 * stddev_over_time(sum(rate(http_requests_total[5m]))[7d:5m])
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Anomalous traffic pattern detected"
      # Error budget alert: 0.001 is the error budget of a 99.9% SLO,
      # 14.4 is the burn-rate threshold
      - alert: ErrorBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > 0.001 * 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget is being consumed too fast"
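
The `14.4` factor in the error budget rule corresponds to spending 2% of a 30-day error budget within one hour (0.02 × 720 h = 14.4), assuming a 99.9% SLO. A common refinement, described in the Google SRE Workbook, is to require a short window to exceed the same threshold as well, so the alert clears quickly once errors stop. The sketch below shows that idea; the group name, alert name, and the 5m short window are assumptions, not part of the original rule set.

```yaml
groups:
  - name: burn_rate_alerts                      # hypothetical group name
    rules:
      - alert: ErrorBudgetBurnRateMultiWindow   # hypothetical alert name
        # Fires only when both the 1h and the 5m error ratios exceed
        # 14.4x the error budget of a 99.9% SLO.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.001 * 14.4
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.001 * 14.4
        labels:
          severity: critical
        annotations:
          summary: "Error budget burn rate above 14.4x on both 1h and 5m windows"
```

Because the short window already filters out stale conditions, a rule of this shape can usually do without a `for` clause.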
Alertmanager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
  # Slack receivers also need a webhook URL: set slack_api_url here,
  # or api_url on each slack_configs entry.
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
    - match:
        severity: warning
      receiver: 'warning'
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
  - name: 'critical'
    pagerduty_configs:
      - service_key: 'xxx'  # placeholder; replace with your integration key
    slack_configs:
      - channel: '#critical-alerts'
        send_resolved: true
  - name: 'warning'
    slack_configs:
      - channel: '#warning-alerts'
        send_resolved: true
# Suppress warning alerts while a critical alert with the same
# alertname, cluster, and service is firing
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
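
Newer Alertmanager releases deprecate `match`, `source_match`, and `target_match` in favor of `matchers`, `source_matchers`, and `target_matchers`. As a sketch, the routing tree and inhibition rule above could be expressed in the newer syntax as follows (receivers unchanged and omitted here):

```yaml
route:
  group_by: ['alertname', 'cluster', 'service']
  receiver: 'default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'critical'
    - matchers:
        - severity="warning"
      receiver: 'warning'
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'cluster', 'service']
```

The matcher strings also accept `!=`, `=~`, and `!~`, which the older map-style `match` could not express without a separate `match_re` block.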
Hands-on: A Complete Monitoring Stack
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      # TSDB directory, shared with the Thanos sidecar
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
  thanos:
    image: thanosio/thanos
    command:
      - sidecar
      - --tsdb.path=/data
      - --prometheus.url=http://prometheus:9090
    volumes:
      # Same volume as Prometheus, so the sidecar can read its TSDB blocks
      - prometheus_data:/data
volumes:
  grafana_data:
  prometheus_data:
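
To have Grafana pick up Prometheus automatically instead of configuring it through the UI, a data source provisioning file can be used. The sketch below assumes a file such as `./grafana/provisioning/datasources/prometheus.yml` (a hypothetical path) that is mounted into the Grafana container under `/etc/grafana/provisioning/datasources/`; the compose file above does not include that mount.

```yaml
# Grafana data source provisioning (hypothetical file: prometheus.yml)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend proxies the queries
    url: http://prometheus:9090    # service name from docker-compose.yml
    isDefault: true
```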
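Best Practices
- Use recording rules to precompute expensive queries
- Tier alerts by severity and route each tier to its own receiver
- Use predictive alerts to catch problems before they cause outages
- Use silences and inhibition to reduce alert noise
- Review alerting rules regularly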
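Summary
Advanced monitoring and alerting are key to keeping systems reliable. Recording rules, predictive alerts, and a tiered alerting strategy together make the monitoring system smarter and less noisy.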