The Three Pillars of Monitoring: Metrics/Logging/Tracing Architecture
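Overview of the Three Pillars of Observability
Observability is the ability to infer a system's internal state from its external outputs. Each of the three pillars gives a view of the system from a different dimension:
The three pillars of observability: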
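┌───────────────────────────────────────────────────────────────┐
│                     Internal system state                     │
├───────────────────┬──────────────────┬────────────────────────┤
│ Metrics           │ Logging          │ Tracing                │
│ Metric monitoring │ Log management   │ Distributed tracing    │
├───────────────────┼──────────────────┼────────────────────────┤
│ Time-series data  │ Event records    │ Request tracking       │
│ Aggregate stats   │ Detailed context │ Distributed call chain │
│ Trend analysis    │ Troubleshooting  │ Performance analysis   │
│ Prometheus        │ ELK/Loki         │ Jaeger/Zipkin          │
└───────────────────┴──────────────────┴────────────────────────┘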
Metrics System
Metric Types
# Metric type definitions
metric_types:
  counter:
    description: "Monotonically increasing counter"
    examples:
      - http_requests_total
      - errors_total
    use_case: "Request counts, error counts"
  gauge:
    description: "Instantaneous value that can go up or down"
    examples:
      - cpu_usage_percent
      - memory_used_bytes
      - active_connections
    use_case: "Resource utilization, queue length"
  histogram:
    description: "Distribution of observed values"
    examples:
      - http_request_duration_seconds
      - response_size_bytes
    use_case: "Latency distribution, response size"
  summary:
    description: "Quantiles computed on the client side"
    examples:
      - go_gc_duration_seconds
    use_case: "Percentile statistics"
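The difference between a histogram and a summary is easier to see once the bucket mechanics are spelled out. The following is a minimal standard-library sketch, not the Prometheus implementation: the class name `BucketHistogram` and its methods are hypothetical, and the quantile estimate uses linear interpolation inside a bucket, which is similar in spirit to what a server-side quantile query does over cumulative buckets.

```python
import bisect
import math


class BucketHistogram:
    """Toy cumulative-bucket histogram (hypothetical, for illustration only)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                 # finite, inclusive upper bounds
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is the +Inf bucket
        self.total = 0.0                             # sum of all observed values
        self.n = 0                                   # number of observations

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.n += 1

    def cumulative(self):
        """Return [(upper_bound, cumulative_count), ...], ending with +Inf."""
        out, running = [], 0
        for bound, count in zip(self.bounds + [math.inf], self.counts):
            running += count
            out.append((bound, running))
        return out

    def quantile(self, q):
        """Estimate the q-quantile by interpolating inside the matching bucket."""
        if self.n == 0:
            return math.nan
        rank = q * self.n
        lower, prev_cum = 0.0, 0
        for bound, cum in self.cumulative():
            if cum >= rank:
                in_bucket = cum - prev_cum
                if math.isinf(bound) or in_bucket == 0:
                    return lower  # nothing to interpolate against
                return lower + (bound - lower) * (rank - prev_cum) / in_bucket
            lower, prev_cum = bound, cum
        return lower


h = BucketHistogram([0.1, 0.5, 1.0])
for latency in (0.05, 0.2, 0.3, 0.7):
    h.observe(latency)
print(h.cumulative())
print(h.quantile(0.5))
```

Because only bucket counts are stored, the estimate is only as precise as the bucket boundaries. That is the trade-off against a summary, which computes quantiles in the client but cannot be aggregated across instances afterwards.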
RED Method
# RED metrics: Rate, Errors, Duration
from prometheus_client import Counter, Histogram


class REDMetrics:
    def __init__(self, service_name: str):
        # service_name is used as a metric-name prefix, so it must be a valid
        # Prometheus identifier (letters, digits, underscores; no hyphens).
        self.request_count = Counter(
            f'{service_name}_requests_total',
            'Total requests',
            ['method', 'endpoint', 'status']
        )
        self.error_count = Counter(
            f'{service_name}_errors_total',
            'Total errors',
            ['method', 'endpoint', 'error_type']
        )
        self.request_duration = Histogram(
            f'{service_name}_request_duration_seconds',
            'Request duration',
            ['method', 'endpoint'],
            buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
        )

    def record_request(self, method: str, endpoint: str,
                       status: int, duration: float):
        # Rate: every request is counted, labeled by its status code.
        self.request_count.labels(
            method=method, endpoint=endpoint, status=str(status)
        ).inc()
        # Errors: only 5xx responses are counted as server errors.
        if status >= 500:
            self.error_count.labels(
                method=method, endpoint=endpoint, error_type='server_error'
            ).inc()
        # Duration: observed in seconds, for every request.
        self.request_duration.labels(
            method=method, endpoint=endpoint
        ).observe(duration)
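The counters above only store running totals. The "Rate" and "Errors" values of RED are derived from how those totals change over a time window. A minimal sketch of that derivation, using plain lists of scraped counter values: the function names `counter_increase` and `red_summary` are hypothetical, and the reset handling is a simplified version of what a monitoring backend would do.

```python
def counter_increase(samples):
    """Total increase across successive counter samples, tolerating resets.

    A counter only goes up, so a drop means the process restarted and the
    counter began again from zero; the post-reset value is then the increase.
    """
    increase = 0.0
    for prev, curr in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    return increase


def red_summary(request_samples, error_samples, window_seconds):
    """Derive request rate and error ratio from raw counter samples."""
    requests = counter_increase(request_samples)
    errors = counter_increase(error_samples)
    return {
        "rate_per_second": requests / window_seconds,
        "error_ratio": errors / requests if requests else 0.0,
    }


# Three scrapes over a 60-second window (illustrative numbers).
print(red_summary([0, 300, 600], [0, 3, 6], window_seconds=60))
```

The "Duration" part of RED comes from the histogram rather than from a counter, typically as a high percentile such as p95 or p99.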
Logging Architecture
Structured Logging
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "message": "Payment processing failed",
  "error": {
    "type": "PaymentGatewayError",
    "message": "Insufficient funds",
    "stack": "..."
  },
  "context": {
    "user_id": "user123",
    "order_id": "order456",
    "amount": 99.99,
    "currency": "USD"
  }
}
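One way to produce records in this shape is a custom formatter on top of Python's standard `logging` module. The sketch below is an assumption about how such a formatter might look, not a prescribed implementation: the class name `JsonFormatter` is hypothetical, and the `trace_id`, `span_id`, and `context` fields are expected to arrive through the `extra=` argument.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical sketch)."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        # Always emit UTC so logs line up with metrics and traces.
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("trace_id", "span_id", "context"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        if record.exc_info:
            entry["error"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
                "stack": self.formatException(record.exc_info),
            }
        return json.dumps(entry, ensure_ascii=False)


logger = logging.getLogger("payment-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payment-api"))
logger.addHandler(handler)
logger.error(
    "Payment processing failed",
    extra={"trace_id": "abc123def456", "context": {"order_id": "order456"}},
)
```

Emitting one JSON object per line keeps the output easy to ship to a log backend such as ELK or Loki without custom parsing rules.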
Log Level Policy
log_levels:
  DEBUG:
    description: "Detailed debugging information"
    retention: "7 days"
    production: false
  INFO:
    description: "Normal operational information"
    retention: "30 days"
    production: true
  WARN:
    description: "Warnings about potential problems"
    retention: "90 days"
    production: true
  ERROR:
    description: "Error events"
    retention: "1 year"
    production: true
  FATAL:
    description: "Fatal errors"
    retention: "permanent"
    production: true
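A policy like this could be enforced in code roughly as follows. This is a sketch under stated assumptions: the names `RETENTION_DAYS`, `minimum_level`, and `is_expired` are hypothetical, one year is approximated as 365 days, and `None` stands for permanent retention.

```python
import logging

# Hypothetical encoding of the retention policy above, in days.
# None means the level is kept permanently.
RETENTION_DAYS = {"DEBUG": 7, "INFO": 30, "WARN": 90, "ERROR": 365, "FATAL": None}


def minimum_level(production: bool) -> int:
    """DEBUG is switched off in production; INFO and above are always emitted."""
    return logging.INFO if production else logging.DEBUG


def is_expired(level: str, age_days: int) -> bool:
    """Return True when a record of this level is older than its retention."""
    limit = RETENTION_DAYS[level]
    return limit is not None and age_days > limit


logging.basicConfig(level=minimum_level(production=True))
```

In practice retention is usually applied by the log storage backend rather than by the application, so `is_expired` only illustrates the rule.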
Distributed Tracing
How Distributed Tracing Works
# Tracing with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Note: the Jaeger Thrift exporter package is deprecated in recent
# OpenTelemetry Python releases; newer setups typically export over OTLP.
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Configure the tracer
provider = TracerProvider()
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)


# Create spans
# validate_payment_details and call_payment_gateway are application
# functions assumed to be defined elsewhere.
def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)
        # Child span: validation
        with tracer.start_as_current_span("validate_payment"):
            validate_payment_details(order_id)
        # Child span: gateway call
        with tracer.start_as_current_span("call_gateway"):
            result = call_payment_gateway(amount)
        span.set_attribute("payment.status", result.status)
        return result
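Spans in different services end up in the same trace because the trace context travels with each request. The sketch below shows the idea using the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`) and only the standard library. The helper names `inject` and `extract` are hypothetical stand-ins for what a tracing SDK's propagator does internally.

```python
import re
import secrets

# version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)   # 8 random bytes -> 16 hex characters


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers: dict):
    """Read the caller's trace context from incoming headers, or None."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_span_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None  # all-zero ids are invalid
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# Service A starts a trace and calls service B.
outgoing = inject({}, new_trace_id(), new_span_id())
# Service B continues the same trace with a new span of its own.
ctx = extract(outgoing)
child_span = {"trace_id": ctx["trace_id"], "span_id": new_span_id(),
              "parent_span_id": ctx["parent_span_id"]}
print(child_span)
```

The trace id stays constant across every hop while each service adds its own span id. This is what lets the backend reassemble the spans into one call tree.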
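Correlating the Three Pillars
# Linking the three pillars
correlation:
  metrics_to_logs:
    description: "When a metric looks abnormal, inspect the related logs"
    example: "When the error rate spikes, look at the ERROR logs"
  metrics_to_traces:
    description: "When a metric looks abnormal, trace specific requests"
    example: "When latency rises, look at the traces of slow requests"
  traces_to_logs:
    description: "From a trace, inspect the related logs"
    example: "Correlate logs through trace_id"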
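Two of these jumps, metrics to logs and traces to logs, come down to filtering structured log records by time window, level, or `trace_id`. A minimal sketch over JSON-lines input, assuming each record carries the `timestamp`, `level`, and `trace_id` fields used in the structured log example. The function name `select_logs` is hypothetical; a real deployment would run the equivalent query in the log backend.

```python
import json
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2024-01-15T10:30:00.123Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_logs(lines, trace_id=None, level=None, start=None, end=None):
    """Return parsed log entries matching every filter that is given."""
    selected = []
    for line in lines:
        entry = json.loads(line)
        if trace_id is not None and entry.get("trace_id") != trace_id:
            continue
        if level is not None and entry.get("level") != level:
            continue
        ts = parse_ts(entry["timestamp"])
        if start is not None and ts < start:
            continue
        if end is not None and ts > end:
            continue
        selected.append(entry)
    return selected


logs = [
    '{"timestamp": "2024-01-15T10:29:59.000Z", "level": "INFO", "trace_id": "t1", "message": "start"}',
    '{"timestamp": "2024-01-15T10:30:00.123Z", "level": "ERROR", "trace_id": "t1", "message": "failed"}',
    '{"timestamp": "2024-01-15T10:45:00.000Z", "level": "ERROR", "trace_id": "t2", "message": "timeout"}',
]

# traces_to_logs: every log line written while handling trace t1
print(select_logs(logs, trace_id="t1"))

# metrics_to_logs: ERROR logs inside the window where the error rate spiked
window_start = parse_ts("2024-01-15T10:25:00.000Z")
window_end = parse_ts("2024-01-15T10:35:00.000Z")
print(select_logs(logs, level="ERROR", start=window_start, end=window_end))
```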
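Best Practices
- Unified timestamps: use UTC everywhere so that metrics, logs, and traces line up in time
- Correlation IDs: use trace_id to tie metrics, logs, and traces together; for metrics, attaching trace_id as an exemplar rather than as a label avoids a high-cardinality label
- Sampling strategy: for high-QPS services, use smart sampling to balance storage cost against observability
- Alert linkage: have metric alerts automatically trigger log analysis and trace queries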
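As one possible reading of "smart sampling", the sketch below keeps every error and every slow request and samples the remainder at a fixed ratio derived from the trace id. The function name `should_keep`, the 1% default ratio, and the 1-second threshold are illustrative assumptions. Because the decision depends on the request's outcome, a rule like this has to run after the request finishes (tail-based sampling).

```python
def should_keep(trace_id: str, is_error: bool, duration_s: float,
                sample_ratio: float = 0.01, slow_threshold_s: float = 1.0) -> bool:
    """Decide whether a finished trace is stored (hypothetical policy).

    trace_id is assumed to be a 32-character hex string.
    """
    # Always keep the traces most likely to be needed for diagnosis.
    if is_error or duration_s >= slow_threshold_s:
        return True
    # Deterministic ratio sampling: the same trace_id yields the same
    # decision in every service, so sampled traces are never partial.
    bucket = int(trace_id[-16:], 16)  # lower 64 bits of the trace id
    return bucket < sample_ratio * 2 ** 64


print(should_keep("0" * 31 + "1", is_error=False, duration_s=0.05))
print(should_keep("f" * 32, is_error=False, duration_s=0.05))
print(should_keep("f" * 32, is_error=True, duration_s=0.05))
```

Deriving the decision from the trace id instead of a random number means no coordination between services is needed to agree on which traces to keep.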