Logging Architecture: Structured Logs, Sampling, and Centralized Management

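Design Principles

A logging architecture for a modern distributed system has to balance observability, storage cost, and query performance. The core principles are structured output, intelligent sampling, centralized collection, and lifecycle management.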

Layers of the logging architecture:
┌─────────────────────────────────────────────────┐
│               Log generation layer              │
│     Structured | Common fields | Correlation ID │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│               Collection layer                  │
│     Agent | Sidecar | DaemonSet                 │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│               Transport layer                   │
│     Kafka | Fluentd | Logstash                  │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│               Storage layer                     │
│     Elasticsearch | Loki | S3                   │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│               Query and analysis layer          │
│     Kibana | Grafana | Custom queries           │
└─────────────────────────────────────────────────┘

Structured Log Design

Log Format Specification

{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "INFO",
  "service": "user-service",
  "version": "1.2.3",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "User login successful",
  "context": {
    "user_id": "user-123",
    "ip_address": "192.168.1.100",
    "user_agent": "Mozilla/5.0...",
    "request_id": "req-456"
  },
  "metrics": {
    "response_time_ms": 45,
    "query_count": 3
  }
}
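
One way to produce entries in the format above from Python is a custom formatter for the standard `logging` module. This is a minimal sketch, not a reference implementation: `JsonFormatter`, `build_logger`, `SERVICE`, `VERSION`, and `LEVEL_NAMES` are names invented for illustration, and the optional fields are assumed to arrive through the `extra=` argument.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Placeholder values for illustration; a real service would inject these.
SERVICE = "user-service"
VERSION = "1.2.3"

# Python calls these levels WARNING and CRITICAL; the format above uses WARN and FATAL.
LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "FATAL"}


class JsonFormatter(logging.Formatter):
    """Render each LogRecord as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # Millisecond-precision UTC timestamp ending in "Z"
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": SERVICE,
            "version": VERSION,
            "message": record.getMessage(),
        }
        # Optional fields are copied only when the caller supplied them via extra=.
        for key in ("trace_id", "span_id", "context", "metrics"):
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        return json.dumps(entry, ensure_ascii=False)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes JSON lines to stdout."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info(
        "User login successful",
        extra={
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "context": {"user_id": "user-123", "request_id": "req-456"},
            "metrics": {"response_time_ms": 45, "query_count": 3},
        },
    )
```

Writing one JSON object per line to stdout keeps the application unaware of the collection layer, which can then pick the lines up from the container log files.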

Log Field Definitions

log_fields:
  required:
    - name: timestamp
      type: string
      format: ISO8601
      description: "Time the event occurred"

    - name: level
      type: enum
      values: [DEBUG, INFO, WARN, ERROR, FATAL]
      description: "Log level"

    - name: service
      type: string
      description: "Service name"

    - name: message
      type: string
      description: "Log message"

  optional:
    - name: trace_id
      type: string
      description: "Distributed tracing ID"

    - name: user_id
      type: string
      description: "User ID"

    - name: request_id
      type: string
      description: "Request ID"
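
The field definitions above can be enforced before an entry leaves the service. The following is a minimal sketch under stated assumptions: `validate_entry`, `REQUIRED_FIELDS`, and `ALLOWED_LEVELS` are illustrative names, and only the required fields are checked.

```python
from datetime import datetime

REQUIRED_FIELDS = ("timestamp", "level", "service", "message")
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry conforms."""
    problems = []

    # Every required field must be present as a non-empty string.
    for name in REQUIRED_FIELDS:
        value = entry.get(name)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or non-string required field: {name}")

    level = entry.get("level")
    if isinstance(level, str) and level and level not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {level}")

    ts = entry.get("timestamp")
    if isinstance(ts, str) and ts:
        # fromisoformat() rejects a trailing "Z" before Python 3.11,
        # so rewrite it as an explicit UTC offset first.
        normalized = ts[:-1] + "+00:00" if ts.endswith("Z") else ts
        try:
            datetime.fromisoformat(normalized)
        except ValueError:
            problems.append(f"timestamp is not ISO 8601: {ts}")

    return problems


if __name__ == "__main__":
    sample = {
        "timestamp": "2024-01-15T10:30:00.123Z",
        "level": "INFO",
        "service": "user-service",
        "message": "User login successful",
    }
    print(validate_entry(sample))
    print(validate_entry({"level": "TRACE"}))
```

Returning a list of problems instead of raising lets the caller decide whether a malformed entry should be dropped, repaired, or logged as-is with a warning.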

Log Sampling Strategies

Sampling Types

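import random
import time


class LogSampler:
    def __init__(self, strategy: str, rate: float):
        self.strategy = strategy  # "head", "tail", or "adaptive"
        self.rate = rate          # probability of keeping an ordinary entry
        # Counters for the QPS estimate used by the adaptive strategy
        self._window_start = time.monotonic()
        self._window_count = 0
        self._last_qps = 0.0

    def get_current_qps(self) -> float:
        """Estimate QPS from the number of calls per one-second window.

        Assumed helper: the original left this method undefined. Replace it
        with a real metrics source in production.
        """
        self._window_count += 1
        now = time.monotonic()
        elapsed = now - self._window_start
        if elapsed >= 1.0:
            self._last_qps = self._window_count / elapsed
            self._window_start, self._window_count = now, 0
        return self._last_qps

    def should_sample(self, log_entry: dict) -> bool:
        """Decide whether to keep a log entry under the configured strategy."""

        if self.strategy == "head":
            # Head sampling: random decision, made without inspecting the entry
            return random.random() < self.rate

        elif self.strategy == "tail":
            # Tail sampling: decide based on the entry's content
            if log_entry.get("level") == "ERROR":
                return True  # keep every error log
            if log_entry.get("response_time_ms", 0) > 1000:
                return True  # keep every slow request
            return random.random() < self.rate

        elif self.strategy == "adaptive":
            # Adaptive sampling: lower the rate as QPS rises
            current_qps = self.get_current_qps()
            if current_qps > 10000:
                return random.random() < 0.01  # high QPS: keep 1%
            elif current_qps > 1000:
                return random.random() < 0.1   # medium QPS: keep 10%
            else:
                return True  # low QPS: keep everything

        # Unknown strategy: keep everything
        return True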

Sampling Configuration

# Sampling strategy configuration
sampling:
  default:
    strategy: head
    rate: 0.1  # keep 10%
  
  rules:
    - name: error-logs
      match:
        level: ERROR
      action: keep  # keep every matching entry
    
    - name: slow-requests
      match:
        response_time_ms: "> 1000"
      action: keep
    
    - name: high-frequency-debug
      match:
        level: DEBUG
        service: high-qps-service
      action: sample
      rate: 0.01  # keep 1%
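
The configuration above does not say how rules are evaluated, so the following sketch makes two assumptions: the first matching rule wins, and a match value of the form "> N" is a numeric threshold on a top-level field. The rules are mirrored as plain Python data to avoid a YAML dependency; `RULES`, `field_matches`, `effective_rate`, and `should_keep` are hypothetical names.

```python
import random

# Mirror of the sampling configuration as plain Python data.
DEFAULT_RATE = 0.1
RULES = [
    {"name": "error-logs", "match": {"level": "ERROR"}, "action": "keep"},
    {"name": "slow-requests", "match": {"response_time_ms": "> 1000"}, "action": "keep"},
    {
        "name": "high-frequency-debug",
        "match": {"level": "DEBUG", "service": "high-qps-service"},
        "action": "sample",
        "rate": 0.01,
    },
]


def field_matches(actual, expected) -> bool:
    """Compare one field; a string such as '> 1000' means a numeric threshold."""
    if isinstance(expected, str) and expected.startswith(">"):
        return isinstance(actual, (int, float)) and actual > float(expected[1:])
    return actual == expected


def effective_rate(entry: dict) -> float:
    """Return the keep probability from the first matching rule (assumed semantics)."""
    for rule in RULES:
        if all(field_matches(entry.get(k), v) for k, v in rule["match"].items()):
            return 1.0 if rule["action"] == "keep" else rule["rate"]
    return DEFAULT_RATE


def should_keep(entry: dict, rng=random.random) -> bool:
    """Apply the keep probability; rng is injectable so tests can be deterministic."""
    return rng() < effective_rate(entry)


if __name__ == "__main__":
    print(effective_rate({"level": "ERROR"}))
    print(effective_rate({"level": "INFO", "response_time_ms": 1500}))
    print(effective_rate({"level": "DEBUG", "service": "high-qps-service"}))
    print(effective_rate({"level": "INFO"}))
```

Separating `effective_rate` from the random draw keeps the rule logic testable without seeding the random number generator.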

Centralized Log Management

Kubernetes Log Collection

# Fluentd DaemonSet configuration
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      serviceAccountName: fluentd
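      tolerations:
      # Allow scheduling on control-plane nodes. Newer clusters use the
      # control-plane taint; older ones use the master taint.
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule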
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch8
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        - name: FLUENT_ELASTICSEARCH_SCHEME
          value: "https"
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
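        # Needed only on nodes that use the Docker runtime; containerd and
        # CRI-O write container logs under /var/log/pods, covered by varlog.
        - name: containers
          mountPath: /var/lib/docker/containers
          readOnly: true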
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: containers
        hostPath:
          path: /var/lib/docker/containers

Log Lifecycle Management

# Index lifecycle configuration
index_lifecycle:
  hot:
    duration: "7d"
    actions:
      - rollover: "30GB or 1 day"
      - set_priority: 100
  
  warm:
    duration: "30d"
    actions:
      - shrink: 1
      - forcemerge: 1 segment
  
  cold:
    duration: "90d"
    actions:
      - freeze  # deprecated since Elasticsearch 7.14; a no-op in 8.x
  
  delete:
    duration: "180d"
    actions:
      - delete
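
The lifecycle configuration above is pseudo-config, not the syntax Elasticsearch accepts. The sketch below builds a request body in the shape used by the ILM policy API (`PUT _ilm/policy/<name>`). The mapping is an assumption, because the pseudo-config does not say whether `duration` is a phase length or an age threshold: here the index moves to warm at 7d, to cold at 30d, and is deleted at 180d, while the cold phase's 90d value has no direct counterpart and is left out.

```python
import json

# Assumed translation of the pseudo-config into ILM policy fields.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {
                    # Roll over at 30 GB or after 1 day, whichever comes first
                    "rollover": {"max_size": "30gb", "max_age": "1d"},
                    "set_priority": {"priority": 100},
                },
            },
            "warm": {
                "min_age": "7d",  # assumption: warm starts when the hot duration ends
                "actions": {
                    "shrink": {"number_of_shards": 1},
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            "cold": {
                "min_age": "30d",  # assumption: cold starts when the warm duration ends
                # Mirrors the original; freeze is deprecated since 7.14
                "actions": {"freeze": {}},
            },
            "delete": {
                "min_age": "180d",
                "actions": {"delete": {}},
            },
        }
    }
}

if __name__ == "__main__":
    # Print the body that would be sent to the ILM policy endpoint.
    print(json.dumps(ilm_policy, indent=2))
```

When rollover is enabled, `min_age` for the later phases is measured from the rollover time, not from index creation, so actual retention can exceed 180 days by up to the rollover interval.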

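Best Practices

  1. Structure first: emit logs as JSON and avoid unstructured text
  2. Correlation IDs: use trace_id/request_id to tie together logs across services
  3. Balanced sampling: sample high-QPS services, but keep every error log
  4. Lifecycle: configure an ILM policy so that expired logs are cleaned up automatically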
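
Practices 2 and 3 interact: purely random sampling can keep some entries of a request and drop others, which breaks the trail that trace_id is meant to provide. One common alternative, not described in this article, is to derive the decision from a hash of the trace ID so that every service reaches the same verdict for the same request. A minimal sketch, with `keep_trace` and `should_keep` as hypothetical names:

```python
import hashlib


def keep_trace(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and compare it
    # with the matching fraction of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


def should_keep(entry: dict, rate: float = 0.1) -> bool:
    """Keep all errors; sample everything else consistently per trace."""
    if entry.get("level") in ("ERROR", "FATAL"):
        return True
    trace_id = entry.get("trace_id")
    if not trace_id:
        # Arbitrary choice for this sketch: entries without a trace are kept.
        return True
    return keep_trace(trace_id, rate)


if __name__ == "__main__":
    tid = "4bf92f3577b34da6a3ce929d0e0e4736"
    # The same trace ID always yields the same decision.
    print(keep_trace(tid, 0.1) == keep_trace(tid, 0.1))
    print(should_keep({"level": "ERROR", "trace_id": tid}, rate=0.0))
```

Because the decision depends only on the trace ID and the rate, services need no coordination beyond agreeing on the rate and the hash function.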