Logging Architecture: Structure, Sampling, and Centralized Management
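Design principles for logging architecture
A logging architecture for a modern distributed system has to balance observability, storage cost, and query performance. The core principles are structured output, intelligent sampling, centralized collection, and lifecycle management.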
Layers of the logging architecture:
┌─────────────────────────────────────────────────┐
│              Log generation layer               │
│   Structured | Common fields | Correlation ID   │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│                Collection layer                 │
│           Agent | Sidecar | DaemonSet           │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│                 Transport layer                 │
│           Kafka | Fluentd | Logstash            │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│                  Storage layer                  │
│            Elasticsearch | Loki | S3            │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│            Query and analysis layer             │
│        Kibana | Grafana | Custom queries        │
└─────────────────────────────────────────────────┘
Structured log design
Log format specification
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "INFO",
  "service": "user-service",
  "version": "1.2.3",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "User login successful",
  "context": {
    "user_id": "user-123",
    "ip_address": "192.168.1.100",
    "user_agent": "Mozilla/5.0...",
    "request_id": "req-456"
  },
  "metrics": {
    "response_time_ms": 45,
    "query_count": 3
  }
}
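One way to produce entries in this shape from application code is a custom formatter for Python's standard `logging` module. The sketch below is only illustrative: `JsonFormatter`, `build_logger`, and `LEVEL_NAMES` are hypothetical names that do not come from the original text, and optional fields are assumed to arrive through the `extra=` argument.

```python
import json
import logging
from datetime import datetime, timezone

# Map Python's level names onto the names used in the log format above.
LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "FATAL"}


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object (hypothetical helper)."""

    def __init__(self, service: str, version: str):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": self.service,
            "version": self.version,
            "message": record.getMessage(),
        }
        # Optional fields are attached to the record via logging's `extra=`.
        for key in ("trace_id", "span_id", "context", "metrics"):
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        return json.dumps(entry, ensure_ascii=False)


def build_logger(service: str, version: str, stream=None) -> logging.Logger:
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter(service, version))
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("user-service", "1.2.3").info("User login successful", extra={"metrics": {"response_time_ms": 45}})` would then emit a line with the same top-level fields as the example.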
Log field definitions
log_fields:
  required:
    - name: timestamp
      type: string
      format: ISO8601
      description: "Time the event occurred"
    - name: level
      type: enum
      values: [DEBUG, INFO, WARN, ERROR, FATAL]
      description: "Log level"
    - name: service
      type: string
      description: "Service name"
    - name: message
      type: string
      description: "Log message"
  optional:
    - name: trace_id
      type: string
      description: "Distributed tracing ID"
    - name: user_id
      type: string
      description: "User ID"
    - name: request_id
      type: string
      description: "Request ID"
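A field definition like this is most useful when something enforces it. The following is a minimal sketch of a validator for the required fields; `validate_entry`, `LEVELS`, and `REQUIRED` are hypothetical names, and the check covers only what the definition above states (presence, string type, allowed levels, ISO 8601 timestamp).

```python
from datetime import datetime

LEVELS = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}
REQUIRED = ("timestamp", "level", "service", "message")


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry is valid."""
    problems = []
    for name in REQUIRED:
        if not isinstance(entry.get(name), str):
            problems.append(f"missing or non-string field: {name}")

    level = entry.get("level")
    if isinstance(level, str) and level not in LEVELS:
        problems.append(f"unknown level: {level}")

    ts = entry.get("timestamp")
    if isinstance(ts, str):
        try:
            # fromisoformat() rejects a trailing "Z" before Python 3.11,
            # so normalize it to an explicit UTC offset first.
            datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"timestamp is not ISO 8601: {ts}")
    return problems
```

Running such a check in tests or at the collection layer is one way to catch services that drift from the agreed schema before their logs reach storage.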
Log sampling strategies
Sampling types
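import random


class LogSampler:
    def __init__(self, strategy: str, rate: float):
        self.strategy = strategy
        self.rate = rate

    def get_current_qps(self) -> float:
        """Placeholder, not defined in the original: wire this to a metrics source."""
        raise NotImplementedError

    def should_sample(self, log_entry: dict) -> bool:
        """Decide whether to keep a log entry under the configured strategy."""
        if self.strategy == "head":
            # Head sampling: random, independent of the entry's content
            return random.random() < self.rate
        elif self.strategy == "tail":
            # Tail sampling: decide based on the entry's content
            if log_entry.get("level") == "ERROR":
                return True  # keep every error log
            # response_time_ms is nested under "metrics" in the log format shown earlier
            if log_entry.get("metrics", {}).get("response_time_ms", 0) > 1000:
                return True  # keep every slow request
            return random.random() < self.rate
        elif self.strategy == "adaptive":
            # Adaptive sampling: adjust the rate to the current QPS
            current_qps = self.get_current_qps()
            if current_qps > 10000:
                return random.random() < 0.01  # high QPS: sample 1%
            elif current_qps > 1000:
                return random.random() < 0.1  # medium QPS: sample 10%
            else:
                return True  # low QPS: keep everything
        return True  # unknown strategy: keep everything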
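The adaptive branch depends on a QPS reading that the sampler does not define. One possible source, sketched here under assumptions, is a sliding-window counter. `QpsCounter` and `adaptive_rate` are hypothetical names. The thresholds are copied from the adaptive branch above, and the injectable `clock` parameter exists only to make the behavior testable.

```python
import time
from collections import deque


class QpsCounter:
    """Count events over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds: float = 1.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self.events = deque()  # timestamps of recent events, oldest first

    def record(self) -> None:
        """Register one event, e.g. one log call or one request."""
        self.events.append(self.clock())

    def current_qps(self) -> float:
        """Drop events older than the window and return events per second."""
        cutoff = self.clock() - self.window
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events) / self.window


def adaptive_rate(qps: float) -> float:
    """Map a QPS reading to a sampling rate using the thresholds above."""
    if qps > 10000:
        return 0.01
    if qps > 1000:
        return 0.1
    return 1.0
```

Storing one timestamp per event is simple but costs memory proportional to the traffic inside the window. A bucketed counter would be a reasonable alternative at very high QPS.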
Sampling configuration
# Sampling strategy configuration
sampling:
  default:
    strategy: head
    rate: 0.1  # sample 10%
  rules:
    - name: error-logs
      match:
        level: ERROR
      action: keep  # keep everything
    - name: slow-requests
      match:
        response_time_ms: "> 1000"
      action: keep
    - name: high-frequency-debug
      match:
        level: DEBUG
        service: high-qps-service
      action: sample
      rate: 0.01  # sample 1%
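The configuration above does not say how rules are evaluated, so the sketch below makes two assumptions: the first matching rule wins, and `match` fields are looked up at the top level of the entry. `should_keep`, `field_matches`, and the inlined `CONFIG` dictionary are hypothetical and stand in for whatever loader and engine a real pipeline uses.

```python
import operator
import random

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

# The YAML above, written out as a Python dict to keep the sketch self-contained.
CONFIG = {
    "default": {"strategy": "head", "rate": 0.1},
    "rules": [
        {"name": "error-logs", "match": {"level": "ERROR"}, "action": "keep"},
        {"name": "slow-requests", "match": {"response_time_ms": "> 1000"},
         "action": "keep"},
        {"name": "high-frequency-debug",
         "match": {"level": "DEBUG", "service": "high-qps-service"},
         "action": "sample", "rate": 0.01},
    ],
}


def field_matches(expected, actual) -> bool:
    """Match a literal value or a comparison string such as "> 1000"."""
    if isinstance(expected, str):
        op, _, threshold = expected.partition(" ")
        if op in OPS and threshold:
            return isinstance(actual, (int, float)) and OPS[op](actual, float(threshold))
    return expected == actual


def should_keep(entry: dict, config: dict = CONFIG, rng=random.random) -> bool:
    """Apply the first matching rule; fall back to the default rate."""
    for rule in config["rules"]:
        if all(field_matches(v, entry.get(k)) for k, v in rule["match"].items()):
            if rule["action"] == "keep":
                return True
            return rng() < rule["rate"]
    return rng() < config["default"]["rate"]
```

Passing the random source as `rng` keeps the decision deterministic in tests, for example `should_keep(entry, rng=lambda: 0.5)`.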
Centralized log management
Log collection on Kubernetes
# Fluentd DaemonSet configuration
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      serviceAccountName: fluentd
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        # Newer Kubernetes releases taint control-plane nodes with this key instead
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch8
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch.logging.svc.cluster.local"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
            - name: FLUENT_ELASTICSEARCH_SCHEME
              value: "https"
          resources:
            limits:
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 200Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            # Docker-specific path; with containerd the log files live under /var/log/pods
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
Log lifecycle management
# Index lifecycle configuration
index_lifecycle:
  hot:
    duration: "7d"
    actions:
      - rollover: "30GB or 1 day"
      - set_priority: 100
  warm:
    duration: "30d"
    actions:
      - shrink: 1
      - forcemerge: 1 segment
  cold:
    duration: "90d"
    actions:
      - freeze
  delete:
    duration: "180d"
    actions:
      - delete
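Best practices
- Structure first: emit logs as JSON and avoid unstructured text
- Correlation IDs: use trace_id/request_id to link logs across services
- Balanced sampling: sample high-QPS services, but keep all error logs
- Lifecycle: configure an ILM policy so expired logs are cleaned up automatically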
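The correlation-ID practice can be sketched with Python's `contextvars` and a `logging.Filter`, so that every record written while a request is being handled carries the same `trace_id`. All names here (`trace_id_var`, `TraceIdFilter`, `handle_request`) are hypothetical. A real service would typically read the ID from an incoming header or from its tracing library rather than generate one as shown.

```python
import contextvars
import logging
import uuid

# Holds the trace_id of the request currently being processed, if any.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the current trace_id from context onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drop the record, only annotate it


def handle_request(logger: logging.Logger, incoming_trace_id=None) -> str:
    """Hypothetical handler: reuse the caller's trace_id or create a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex  # 32 hex characters
    token = trace_id_var.set(trace_id)
    try:
        logger.info("request handled")
    finally:
        trace_id_var.reset(token)  # do not leak the ID into the next request
    return trace_id
```

Attaching the filter with `logger.addFilter(TraceIdFilter())` makes `record.trace_id` available to whichever formatter writes the structured output. Because `contextvars` values are scoped per task, concurrent `asyncio` requests keep their IDs separate.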