The Three Pillars of Monitoring: Metrics/Logging/Tracing Architecture

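Overview of the Three Pillars of Observability

Observability is the ability to infer a system's internal state from its external outputs. Each of the three pillars offers a different view of the system.

The three pillars of observability: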
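┌───────────────────────────────────────────────────────────────────────┐
│                         System internal state                         │
├─────────────────────┬─────────────────────┬───────────────────────────┤
│  Metrics            │  Logging            │  Tracing                  │
│  Metric monitoring  │  Log management     │  Distributed tracing      │
├─────────────────────┼─────────────────────┼───────────────────────────┤
│  Time-series data   │  Event records      │  Request tracing          │
│  Aggregated stats   │  Detailed context   │  Distributed call chains  │
│  Trend analysis     │  Problem diagnosis  │  Performance analysis     │
│  Prometheus         │  ELK/Loki           │  Jaeger/Zipkin            │
└─────────────────────┴─────────────────────┴───────────────────────────┘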

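Metrics

Metric Types

# Metric type definitions
metric_types:
  counter:
    description: "Monotonically increasing counter (resets only when the process restarts)"
    examples:
      - http_requests_total
      - errors_total
    use_case: "Request counts, error counts"

  gauge:
    description: "Instantaneous value that can go up or down"
    examples:
      - cpu_usage_percent
      - memory_used_bytes
      - active_connections
    use_case: "Resource utilization, queue length"

  histogram:
    description: "Distribution of observed values, counted in buckets"
    examples:
      - http_request_duration_seconds
      - response_size_bytes
    use_case: "Latency distribution, response sizes"

  summary:
    description: "Quantiles computed on the client side"
    examples:
      - go_gc_duration_seconds
    use_case: "Percentile statistics"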
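
To make the histogram type more concrete, here is a minimal pure-Python sketch of bucketed counting. It is not the Prometheus client; the class name `BucketHistogram` and the method `quantile_upper_bound` are hypothetical names chosen for illustration. It assumes inclusive upper bounds and an implicit +Inf bucket, which is how Prometheus-style histograms are usually described.

```python
import bisect


class BucketHistogram:
    """Toy histogram with bucket counts (illustrative only, not a real client)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                 # inclusive upper bounds, ascending
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is the +Inf bucket
        self.total = 0.0                             # sum of all observed values
        self.n = 0                                   # number of observations

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value,
        # i.e. the smallest bucket that can hold this observation
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.n += 1

    def cumulative(self):
        # Running totals: entry i counts every observation <= bound i
        out, running = [], 0
        for count in self.counts:
            running += count
            out.append(running)
        return out

    def quantile_upper_bound(self, q):
        # Smallest bucket bound whose cumulative count covers fraction q
        target = q * self.n
        for bound, cum in zip(self.bounds + [float("inf")], self.cumulative()):
            if cum >= target:
                return bound
        return float("inf")
```

Because only bucket counts are stored, a quantile can be located no more precisely than the bucket it falls in. That is the trade-off against a summary, which computes quantiles on the client but cannot be aggregated across instances in the same way.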

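The RED Method

# RED metrics: Rate, Errors, Duration
from prometheus_client import Counter, Histogram


class REDMetrics:
    """Request rate, error, and duration metrics for a single service."""

    def __init__(self, service_name: str):
        # service_name is used as a metric-name prefix, so it must be a valid
        # Prometheus identifier (letters, digits, underscores; no hyphens).

        # Rate: every request, labelled by method, endpoint, and status code
        self.request_count = Counter(
            f'{service_name}_requests_total',
            'Total requests',
            ['method', 'endpoint', 'status']
        )
        # Errors: failed requests, labelled by error type
        self.error_count = Counter(
            f'{service_name}_errors_total',
            'Total errors',
            ['method', 'endpoint', 'error_type']
        )
        # Duration: latency distribution in seconds
        self.request_duration = Histogram(
            f'{service_name}_request_duration_seconds',
            'Request duration',
            ['method', 'endpoint'],
            buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
        )

    def record_request(self, method: str, endpoint: str,
                       status: int, duration: float):
        self.request_count.labels(
            method=method, endpoint=endpoint, status=str(status)
        ).inc()

        # Only 5xx responses count as errors here; 4xx responses do not
        if status >= 500:
            self.error_count.labels(
                method=method, endpoint=endpoint, error_type='server_error'
            ).inc()

        self.request_duration.labels(
            method=method, endpoint=endpoint
        ).observe(duration)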
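
The counters above only ever grow, so Rate and Errors are read as increases over a time window. The following is a minimal sketch of that calculation on raw counter samples. The function names `counter_increase` and `red_summary` are hypothetical, and the reset handling is a simplification of what a query engine such as Prometheus does (no extrapolation to the window edges).

```python
def counter_increase(samples):
    """Total increase across time-ordered counter samples, tolerating resets."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again from 0,
        # so the whole current value counts as new increase
        increase += cur if cur < prev else cur - prev
    return increase


def red_summary(request_samples, error_samples, window_seconds):
    """Derive request rate and error ratio from two counter series."""
    requests = counter_increase(request_samples)
    errors = counter_increase(error_samples)
    return {
        "rate_per_second": requests / window_seconds,
        # Guard against division by zero when no requests arrived in the window
        "error_ratio": errors / requests if requests else 0.0,
    }
```

For example, `red_summary([100, 160, 20, 80], [5, 8, 1, 4], 60)` treats the drop from 160 to 20 as a restart, giving 140 requests and 7 errors over the minute.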

Logging Architecture

Structured Logging

{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "message": "Payment processing failed",
  "error": {
    "type": "PaymentGatewayError",
    "message": "Insufficient funds",
    "stack": "..."
  },
  "context": {
    "user_id": "user123",
    "order_id": "order456",
    "amount": 99.99,
    "currency": "USD"
  }
}
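
One way to produce entries shaped like the one above is a custom formatter on Python's standard `logging` module. This is a sketch under assumptions: `JsonFormatter` and `build_logger` are hypothetical names, the `error` object is omitted for brevity, and the trace fields are expected to arrive through the `extra` argument rather than from a tracing library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        # UTC timestamp with millisecond precision and a trailing "Z"
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            # Fields passed via `extra=` become attributes on the record
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False          # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger
```

A call such as `build_logger("payment-api").error("Payment processing failed", extra={"trace_id": "abc123def456", "context": {"order_id": "order456"}})` would then emit a single JSON line carrying the trace ID alongside the message.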

Log Level Strategy

log_levels:
  DEBUG:
    description: "Detailed debugging information"
    retention: "7 days"
    production: false

  INFO:
    description: "Normal operational messages"
    retention: "30 days"
    production: true

  WARN:
    description: "Warnings about potential problems"
    retention: "90 days"
    production: true

  ERROR:
    description: "Error events"
    retention: "1 year"
    production: true

  FATAL:
    description: "Fatal errors"
    retention: "permanent"
    production: true
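
The policy above can be expressed as two small checks: whether a level should be emitted in a given environment, and whether a stored entry has outlived its retention. This is only a sketch; the names `should_emit` and `is_expired` are hypothetical, and "1 year" is assumed to mean 365 days.

```python
from datetime import timedelta

# Retention per level, mirroring the policy above; None means keep indefinitely
RETENTION = {
    "DEBUG": timedelta(days=7),
    "INFO": timedelta(days=30),
    "WARN": timedelta(days=90),
    "ERROR": timedelta(days=365),   # assumption: 1 year = 365 days
    "FATAL": None,
}

# Levels that are enabled in production according to the policy
PRODUCTION_LEVELS = {"INFO", "WARN", "ERROR", "FATAL"}


def should_emit(level, production):
    """Return True if a message at this level should be written at all."""
    return level in RETENTION and (not production or level in PRODUCTION_LEVELS)


def is_expired(level, age):
    """Return True if an entry of this level and age may be deleted."""
    retention = RETENTION[level]
    return retention is not None and age > retention
```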

Distributed Tracing

How Distributed Tracing Works

# Tracing with OpenTelemetry
# Note: the Jaeger Thrift exporter package used here has been deprecated
# upstream; newer setups typically export over OTLP instead.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Configure the tracer
provider = TracerProvider()
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Create spans
# validate_payment_details and call_payment_gateway are assumed to be defined elsewhere
def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)

        # Child span: validation
        with tracer.start_as_current_span("validate_payment"):
            validate_payment_details(order_id)

        # Child span: call the gateway
        with tracer.start_as_current_span("call_gateway"):
            result = call_payment_gateway(amount)
            # `span` is the outer process_payment span, so the status is
            # recorded on the parent rather than on call_gateway
            span.set_attribute("payment.status", result.status)

        return result
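
Inside one process, nested spans join the same trace automatically. Across service boundaries the trace context has to travel with the request, and OpenTelemetry's default propagator does this with the W3C `traceparent` header. The sketch below builds and parses that header by hand to show what is carried; the function names are hypothetical, and a real service would normally rely on the library's propagators instead.

```python
import re
import secrets

# Version 00 layout: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),   # lowest bit is the sampled flag
    }


def child_traceparent(header):
    """Header for an outgoing call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(header)
    if ctx is None:
        return new_traceparent()                  # no valid parent: start a new trace
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

The key property is that the trace ID stays constant along the whole call chain while each hop contributes its own span ID, which is what lets a backend such as Jaeger reassemble the chain.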

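How the Three Pillars Work Together

# Correlating the three pillars
correlation:
  metrics_to_logs:
    description: "When a metric looks abnormal, inspect the related logs"
    example: "When the error rate spikes, look at ERROR-level logs"

  metrics_to_traces:
    description: "When a metric looks abnormal, trace the specific requests"
    example: "When latency rises, inspect the traces of slow requests"

  traces_to_logs:
    description: "From a trace, view the related logs"
    example: "Join logs to a trace through trace_id"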
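
As a rough illustration of the metrics-to-traces and traces-to-logs steps, the sketch below first picks the slowest traces and then pulls the matching log entries by `trace_id`. The in-memory lists and the field names `parent_id` and `duration` are assumptions made for the example; in practice these lookups would be queries against the tracing and logging backends.

```python
LEVEL_ORDER = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"]


def slowest_traces(spans, threshold_seconds):
    """Return trace IDs whose root span exceeds the threshold, slowest first."""
    # A root span has no parent; its duration covers the whole request
    slow = [s for s in spans
            if s["parent_id"] is None and s["duration"] > threshold_seconds]
    slow.sort(key=lambda s: s["duration"], reverse=True)
    return [s["trace_id"] for s in slow]


def logs_for_trace(logs, trace_id, min_level="ERROR"):
    """Return log entries for one trace at or above the given level."""
    floor = LEVEL_ORDER.index(min_level)
    return [entry for entry in logs
            if entry.get("trace_id") == trace_id
            and LEVEL_ORDER.index(entry["level"]) >= floor]
```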

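Best Practices

Unified timestamps: use UTC everywhere so that metrics, logs, and traces line up in time
Correlation ID: use trace_id to tie metrics, logs, and traces together (on metrics, attach it as an exemplar rather than a label to avoid unbounded label cardinality)
Sampling strategy: for high-QPS services, use smart sampling to balance storage cost against observability
Alert linkage: let metric alerts automatically trigger log analysis and trace queries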
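
One possible reading of "smart sampling" is sketched below: always keep traces that failed or were slow, and keep a fixed fraction of the rest, chosen deterministically from the trace ID so that every service reaches the same decision. The function name `keep_trace` and its parameters are hypothetical, and deciding on errors or duration assumes the decision is made after the trace has finished (tail-based sampling).

```python
def keep_trace(trace_id, sample_rate, is_error=False,
               duration=0.0, slow_threshold=1.0):
    """Decide whether to store a finished trace.

    trace_id is a hex string; sample_rate is a fraction between 0 and 1.
    """
    # Errors and slow requests are the traces most worth keeping
    if is_error or duration >= slow_threshold:
        return True
    # Map the low 64 bits of the trace ID onto [0, 1]; the same ID always
    # yields the same value, so the decision is reproducible
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < sample_rate
```

With `sample_rate=0.1`, roughly one in ten ordinary traces would be retained, while every failed or slow request is still stored.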