Metric Design: RED, USE, and the Golden Signals
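Overview of Metric Design Methodologies
Metric design is the foundation of observability. RED and USE are two classic methodologies, oriented toward services and resources respectively, that help teams decide which metrics to monitor.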
Metric design methodologies:
┌─────────────────────────────────────────────────┐
│                   RED Method                    │
│        Service-oriented (request-driven)        │
│            Rate | Errors | Duration             │
├─────────────────────────────────────────────────┤
│                   USE Method                    │
│                Resource-oriented                │
│        Utilization | Saturation | Errors        │
├─────────────────────────────────────────────────┤
│                 Golden Signals                  │
│           The four Google SRE signals           │
│     Latency | Traffic | Errors | Saturation     │
└─────────────────────────────────────────────────┘
The RED Method (Service-Oriented)
Core Metrics
from prometheus_client import Counter, Histogram


class REDMetrics:
    """RED method: Rate, Errors, Duration."""

    def __init__(self, service_name: str):
        # service_name is used as a metric-name prefix, so it must be a valid
        # Prometheus identifier (letters, digits, underscores; no hyphens).
        self.service_name = service_name
        # Rate: request counter; QPS is derived at query time with rate()
        self.request_rate = Counter(
            f'{service_name}_requests_total',
            'Total requests',
            ['method', 'endpoint', 'status']
        )
        # Errors: error counter; the error rate is derived at query time
        self.error_rate = Counter(
            f'{service_name}_errors_total',
            'Total errors',
            ['method', 'endpoint', 'error_type']
        )
        # Duration: request latency in seconds
        self.request_duration = Histogram(
            f'{service_name}_request_duration_seconds',
            'Request duration',
            ['method', 'endpoint'],
            buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
        )

    def record_request(self, method: str, endpoint: str,
                       status: int, duration: float):
        # Count the request
        self.request_rate.labels(
            method=method, endpoint=endpoint, status=str(status)
        ).inc()
        # Count server-side errors (5xx only)
        if status >= 500:
            self.error_rate.labels(
                method=method, endpoint=endpoint, error_type='server'
            ).inc()
        # Record the latency
        self.request_duration.labels(
            method=method, endpoint=endpoint
        ).observe(duration)
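One way to feed `record_request` is to wrap each handler in a timing decorator. The sketch below is illustrative only: `timed_request` and the `Recorder` type are hypothetical names, the handler is assumed to return its HTTP status code as an `int`, and an uncaught exception is assumed to count as a 500. It uses only the standard library, so any callable with the same four arguments can stand in for the recorder.

```python
import functools
import time
from typing import Callable

# Hypothetical recorder signature: (method, endpoint, status, duration_seconds)
Recorder = Callable[[str, str, int, float], None]


def timed_request(record: Recorder, method: str, endpoint: str):
    """Wrap a handler so every call reports its status and duration."""
    def decorator(handler: Callable[..., int]) -> Callable[..., int]:
        @functools.wraps(handler)
        def wrapper(*args, **kwargs) -> int:
            start = time.perf_counter()
            status = 500  # assumption: an uncaught exception counts as a 500
            try:
                status = handler(*args, **kwargs)
                return status
            finally:
                # Runs on both the success and the exception path
                record(method, endpoint, status, time.perf_counter() - start)
        return wrapper
    return decorator


# Usage with a stand-in recorder that simply collects the calls
recorded = []


@timed_request(lambda *call: recorded.append(call), "GET", "/ping")
def ping() -> int:
    return 200


ping()
```

In a real service the recorder would be something like the `record_request` method of a `REDMetrics` instance rather than a list append.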
RED Metric Queries
# Rate: requests per second
sum(rate(http_requests_total[5m])) by (service)
# Errors: fraction of requests that returned 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# Duration: P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
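The P99 query returns an estimate, not an exact value: `histogram_quantile` only sees cumulative bucket counts and interpolates linearly inside the bucket that contains the target rank. The following is a simplified sketch of that idea, not Prometheus's actual implementation. It assumes `0 < q <= 1`, cumulative counts, and a final `+Inf` bucket; the sample bucket values are made up for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls past the last finite bucket:
                # the best available answer is the highest finite bound
                return prev_bound
            # Linear interpolation inside the bucket that holds the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Illustrative cumulative counts, as a rate() over *_bucket series would give
sample = [(0.1, 50), (0.25, 90), (0.5, 99), (1.0, 100), (math.inf, 100)]
p50 = histogram_quantile(0.50, sample)
p95 = histogram_quantile(0.95, sample)
```

Because the result depends on where the bucket boundaries sit, it helps to place a boundary at or near any latency threshold you intend to alert on.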
The USE Method (Resource-Oriented)
Core Metrics
from prometheus_client import Counter, Gauge


class USEMetrics:
    """USE method: Utilization, Saturation, Errors."""

    def __init__(self, resource_name: str):
        self.resource_name = resource_name
        # Utilization: how busy the resource is, in percent
        self.utilization = Gauge(
            f'{resource_name}_utilization_percent',
            'Resource utilization',
            ['type']  # cpu, memory, disk, network
        )
        # Saturation: work that is queued because the resource is full
        self.saturation = Gauge(
            f'{resource_name}_saturation',
            'Resource saturation',
            ['type']  # queue_length, waiting_threads
        )
        # Errors: resource error count
        self.errors = Counter(
            f'{resource_name}_errors_total',
            'Resource errors',
            ['type']  # hardware, software
        )

    def record_utilization(self, resource_type: str, value: float):
        self.utilization.labels(type=resource_type).set(value)

    def record_saturation(self, resource_type: str, value: float):
        self.saturation.labels(type=resource_type).set(value)

    def record_error(self, error_type: str):
        # Added so the errors counter defined above can actually be incremented
        self.errors.labels(type=error_type).inc()
USE Metric Queries
# CPU utilization (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk saturation: average I/O queue depth (the metric is a counter, so take its rate)
rate(node_disk_io_time_weighted_seconds_total[5m])
# Network errors: receive errors per second
rate(node_network_receive_errs_total[5m])
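The CPU utilization query is easier to read once you notice that `node_cpu_seconds_total{mode="idle"}` is a per-core counter of idle seconds, so its rate is the fraction of time each core sat idle. A minimal sketch of the same arithmetic on two raw samples follows; the function name and the sample numbers are illustrative, and unlike `rate()` it does not handle counter resets or extrapolation.

```python
from typing import Sequence


def cpu_utilization_percent(idle_before: Sequence[float],
                            idle_after: Sequence[float],
                            window_seconds: float) -> float:
    """CPU utilization in percent, averaged over cores.

    idle_before / idle_after hold the idle-seconds counter of each core
    at the start and at the end of the window.
    """
    if len(idle_before) != len(idle_after) or not idle_before:
        raise ValueError("need one before/after sample per core")
    # Per-core idle fraction: idle seconds gained per second of wall time
    idle_fractions = [
        (after - before) / window_seconds
        for before, after in zip(idle_before, idle_after)
    ]
    avg_idle = sum(idle_fractions) / len(idle_fractions)
    return 100.0 - avg_idle * 100.0


# Two cores over a 5-minute (300 s) window: idle for 240 s and 180 s
utilization = cpu_utilization_percent([1000.0, 2000.0], [1240.0, 2180.0], 300.0)
```

With these numbers the cores were idle 80% and 60% of the time, so the average utilization comes out at roughly 30%.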
Golden Signals (Google SRE)
The Four Signals
golden_signals:
  latency:
    description: "Time taken to serve a request"
    metric: "http_request_duration_seconds"
    percentiles: [50, 90, 99]
    example: "P99 latency < 200ms"
  traffic:
    description: "Request traffic"
    metric: "http_requests_total"
    rate: "QPS"
    example: "QPS > 1000"
  errors:
    description: "Failed requests"
    metric: "http_requests_total{status=~'5..'}"
    rate: "error rate"
    example: "error rate < 0.1%"
  saturation:
    description: "Resource saturation"
    metrics:
      - cpu_usage_percent
      - memory_usage_percent
      - disk_usage_percent
    example: "CPU < 80%"
Defining SLIs
SLI Types
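sli_types:
  availability:
    description: "Availability"
    formula: "successful requests / total requests"
    target: "99.9%"
  latency:
    description: "Latency"
    formula: "requests served faster than the threshold / total requests"
    target: "99% of requests < 200ms"
  throughput:
    description: "Throughput"
    formula: "requests processed per unit of time"
    target: "> 1000 QPS"
  correctness:
    description: "Correctness"
    formula: "correct responses / total responses"
    target: "99.99%"
  freshness:
    description: "Freshness"
    formula: "time since the data was last updated"
    target: "< 5 minutes"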
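SLI Implementation
# Availability SLI: fraction of requests that did not return 5xx
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Latency SLI: P99 below 200 ms (aggregate buckets by le before taking the quantile)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) < 0.2
# Throughput SLI
sum(rate(http_requests_total[5m])) > 1000
# Freshness SLI: data updated within the last 5 minutes
(time() - data_last_updated_timestamp) < 300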
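The availability and latency SLIs are both "good events divided by total events" ratios. To make the formulas concrete, here is a small sketch that computes them from raw request records instead of Prometheus series. The `Request` record and function names are hypothetical, the 0.2 s threshold mirrors the 200 ms target used above, and returning NaN for an empty window is a choice made for this example.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Request:
    status: int              # HTTP status code
    duration_seconds: float  # time taken to serve the request


def _ratio(good: int, total: int) -> float:
    # With no traffic the SLI is undefined; NaN makes that explicit
    return good / total if total else math.nan


def availability_sli(requests: Iterable[Request]) -> float:
    """Fraction of requests that did not fail with a 5xx status."""
    reqs = list(requests)
    return _ratio(sum(1 for r in reqs if r.status < 500), len(reqs))


def latency_sli(requests: Iterable[Request],
                threshold_seconds: float = 0.2) -> float:
    """Fraction of requests served within the latency threshold."""
    reqs = list(requests)
    fast = sum(1 for r in reqs if r.duration_seconds <= threshold_seconds)
    return _ratio(fast, len(reqs))


window = [
    Request(200, 0.05),
    Request(200, 0.15),
    Request(500, 0.30),
    Request(404, 0.10),  # a client error is not a server failure
]
availability = availability_sli(window)  # 3 of 4 requests are non-5xx
latency = latency_sli(window)            # 3 of 4 requests took <= 0.2 s
```

Each result is then compared against its target, for example 99.9% for availability.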
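Metric Design Best Practices
best_practices:
  - name: "Choose the right granularity"
    description: "Balance label cardinality against usefulness"
    example: "Group by service/status; never label by request_id"
  - name: "Use standard naming"
    description: "Follow the Prometheus naming conventions"
    example: "http_requests_total, not req_count"
  - name: "Record metadata"
    description: "Attach meaningful labels"
    example: "service, version, environment"
  - name: "Define SLIs/SLOs"
    description: "Define a target value for each metric"
    example: "P99 latency < 200ms (SLO: 99.9%)"
  - name: "Avoid over-collection"
    description: "Collect only the metrics you need"
    example: "Sample or aggregate away high-cardinality labels"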
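The granularity advice comes down to simple multiplication: in the worst case a metric produces one time series per combination of label values. The sketch below estimates that upper bound; the function name and the label counts are invented for illustration, and real cardinality is usually lower because not every combination occurs.

```python
from math import prod
from typing import Dict


def max_series(distinct_values_per_label: Dict[str, int]) -> int:
    """Upper bound on the time series one metric can produce."""
    # A metric with no labels still yields one series; prod of an empty
    # iterable is 1, which matches that case
    return prod(distinct_values_per_label.values())


# Grouping by service and status stays small
by_service_status = max_series({"service": 20, "status": 5})

# Adding a per-request label multiplies the series count by the request count
with_request_id = max_series(
    {"service": 20, "status": 5, "request_id": 1_000_000}
)
```

Here the first metric is bounded by 100 series, while the second could reach 100 million, which is why identifiers such as `request_id` belong in logs or traces and not in metric labels.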
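Metric Classification Framework
Metric categories:
├── Business metrics
│   ├── Order volume
│   ├── Conversion rate
│   └── Average order value
├── Application metrics
│   ├── Request volume
│   ├── Error rate
│   └── Latency
├── Infrastructure metrics
│   ├── CPU / memory / disk
│   ├── Network I/O
│   └── Container resources
└── Middleware metrics
    ├── Database connections
    ├── Redis hit rate
    └── Message queue backlog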