Distributed Tracing: Jaeger and Zipkin Architecture
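Distributed Tracing Principles
Distributed tracing generates a unique trace ID at the request entry point and propagates it through the entire call chain. Each unit of work is recorded as a span with its duration and status, which makes it possible to locate performance bottlenecks in a distributed system.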
User request → API Gateway → Service A → Service B → Database
     │              │            │           │          │
     └──────────────┴────────────┴───────────┴──────────┘
                      Trace ID: abc123
Span tree structure:
Trace (abc123)
├── Span: API Gateway (100ms)
│   └── Span: Service A (80ms)
│       ├── Span: Service B (50ms)
│       │   └── Span: Database Query (20ms)
│       └── Span: Cache Lookup (5ms)
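To make the span tree concrete, here is a minimal Python sketch that models the example trace and computes each span's self time, meaning its duration minus the durations of its direct children. The `Span` class and the `self_times` function are illustrative names, not part of any tracing library, and the sketch assumes that child spans run one after another.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical in-memory span: a name, a duration, and child spans."""
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)


def self_times(span: Span) -> dict[str, int]:
    """Return {span name: time spent in the span itself, excluding children}.

    Assumes children execute sequentially, so their durations add up.
    """
    own = span.duration_ms - sum(child.duration_ms for child in span.children)
    result = {span.name: own}
    for child in span.children:
        result.update(self_times(child))
    return result


# The example trace abc123 from the span tree.
trace = Span("API Gateway", 100, [
    Span("Service A", 80, [
        Span("Service B", 50, [Span("Database Query", 20)]),
        Span("Cache Lookup", 5),
    ]),
])

times = self_times(trace)
slowest = max(times, key=times.get)  # span with the largest self time
```

With these numbers, `slowest` is `Service B` (30 ms of self time), even though API Gateway has the largest total duration. Ranking by self time rather than total time is what points at the actual bottleneck.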
Jaeger Architecture
Core Components
Jaeger architecture:
┌─────────────────────────────────────────────────┐
│               Jaeger Architecture               │
├─────────────────┬───────────────────────────────┤
│      Agent      │           Collector           │
│ Receives spans  │ Processes, validates, indexes │
│ Listens on UDP  │       Writes to storage       │
├─────────────────┴───────────────────────────────┤
│                  Query Service                  │
│               Query API + Web UI                │
├─────────────────────────────────────────────────┤
│                 Storage Backend                 │
│   Elasticsearch / Cassandra (Kafka as buffer)   │
└─────────────────────────────────────────────────┘
Jaeger Deployment (Kubernetes)
# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 16686  # UI
            - containerPort: 14268  # HTTP collector
            - containerPort: 6831   # Agent UDP
              protocol: UDP
          env:
            - name: SPAN_STORAGE_TYPE  # selects the span storage backend
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
    - name: collector
      port: 14268
Zipkin Architecture
Core Components
Zipkin architecture:
┌─────────────────────────────────────────────────┐
│                     Zipkin                      │
├─────────────────┬───────────────────────────────┤
│    Collector    │            Storage            │
│ Collects spans  │       Stores trace data       │
├─────────────────┴───────────────────────────────┤
│                    API + UI                     │
│             Query and visualization             │
└─────────────────────────────────────────────────┘
Zipkin Configuration
# docker-compose.yml
version: '3'
services:
  zipkin:
    image: openzipkin/zipkin:latest
    ports:
      - "9411:9411"
    environment:
      - STORAGE_TYPE=elasticsearch
      - ES_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
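To show what the Zipkin collector ingests, the following sketch builds two spans in Zipkin's v2 JSON format using only the Python standard library. The service name, span names, and tag values are made up for illustration. With the port mapping above, the payload would be POSTed to `http://localhost:9411/api/v2/spans`, but the sketch stops at serialization and does not send anything.

```python
import json
import secrets
import time


def make_zipkin_span(name, service_name, duration_us, trace_id=None, parent_id=None):
    """Build one span as a dict in the Zipkin v2 JSON shape."""
    span = {
        "traceId": trace_id or secrets.token_hex(16),  # 32 lower-hex chars (128-bit)
        "id": secrets.token_hex(8),                    # 16 lower-hex chars (64-bit)
        "name": name,
        "kind": "SERVER",
        "timestamp": int(time.time() * 1_000_000),     # start time, epoch microseconds
        "duration": duration_us,                       # microseconds
        "localEndpoint": {"serviceName": service_name},
        "tags": {"http.method": "GET"},                # illustrative tag
    }
    if parent_id is not None:
        span["parentId"] = parent_id
    return span


# A root span and one child that shares its trace ID.
root = make_zipkin_span("get /api/users", "service-a", 80_000)
child = make_zipkin_span("select users", "service-a", 20_000,
                         trace_id=root["traceId"], parent_id=root["id"])
child["kind"] = "CLIENT"

payload = json.dumps([root, child])  # the endpoint expects a JSON array of spans
```

The parent-child link is carried entirely by `traceId` and `parentId`, which is how the UI reassembles the span tree from spans reported independently by different services.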
Trace Propagation
W3C Trace Context
# Propagation via HTTP headers
# Header format:
# traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
# tracestate: vendor1=value1,vendor2=value2
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def call_downstream_service():
    headers = {}
    inject(headers)  # inject the current context into the headers
    response = requests.get(
        "http://downstream-service/api",
        headers=headers
    )
    return response

# Receiving side: extract the context
def handle_request(request):
    context = extract(request.headers)  # extract the context from the incoming headers
    with tracer.start_as_current_span("handle_request", context=context):
        process_request()  # application-specific handler, defined elsewhere
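For readers who want to see what `inject` and `extract` put on the wire, here is a standard-library sketch that formats and parses a version-00 `traceparent` header. The function names are illustrative and this is not the OpenTelemetry implementation; it covers only the basic field layout and validity checks.

```python
import re
import secrets

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex trace-flags
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Assemble a version-00 traceparent value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero trace or span IDs are invalid.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming: str) -> str:
    """Keep the trace ID and mint a new span ID for the outgoing call."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        # No valid parent: start a new trace (marked sampled here for simplicity).
        return format_traceparent(secrets.token_hex(16), secrets.token_hex(8), True)
    trace_id, _parent_id, sampled = parsed
    return format_traceparent(trace_id, secrets.token_hex(8), sampled)


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
outgoing = child_traceparent(incoming)
```

The trace ID stays constant across hops while the span ID changes at each one, so the receiving service can record the incoming span ID as its parent.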
Baggage Propagation
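from opentelemetry import baggage
from opentelemetry.propagate import inject, extract

# Set baggage (key-value pairs propagated across services)
ctx = baggage.set_baggage("user.id", "user123")
ctx = baggage.set_baggage("request.id", "req456", context=ctx)

# Propagate the baggage
headers = {}
inject(headers, context=ctx)

# Downstream service reads the baggage
# (`request` is the incoming request object provided by the web framework)
ctx = extract(request.headers)
user_id = baggage.get_baggage("user.id", context=ctx)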
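On the wire, baggage travels in a `baggage` HTTP header as comma-separated `key=value` members with percent-encoded values. The sketch below illustrates that encoding with the standard library only. The helper names are hypothetical, and the parsing is simplified: it drops optional per-entry properties instead of preserving them.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict[str, str]) -> str:
    """Serialize entries as a baggage header value: k1=v1,k2=v2."""
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in entries.items())


def decode_baggage(header: str) -> dict[str, str]:
    """Parse a baggage header value, ignoring properties that follow ';'."""
    entries = {}
    for member in header.split(","):
        member = member.strip()
        if "=" not in member:
            continue  # skip malformed members
        key, _, value = member.partition("=")
        value = value.split(";", 1)[0]  # drop optional properties
        entries[key.strip()] = unquote(value.strip())
    return entries


header_value = encode_baggage({"user.id": "user123", "request.id": "req456"})
decoded = decode_baggage(header_value)
```

Because every downstream request carries the full header, baggage is best kept to a few short values and should not contain sensitive data.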
Sampling Strategies
# Sampling configuration
sampling_strategies:
  # Probabilistic sampling
  probabilistic:
    type: probabilistic
    param: 0.1  # 10% sampling rate
  # Rate-limiting sampling
  rateLimiting:
    type: ratelimiting
    param: 100  # 100 traces per second
  # Tail sampling (based on latency/errors)
  tail_sampling:
    policies:
      - name: error-based
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: latency-based
        type: latency
        latency:
          threshold_ms: 1000
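The three strategies differ mainly in when the decision is made and what information it uses. The following is a simplified sketch of each, not the Jaeger or OpenTelemetry implementation; the function and class names, and the dict shape used for spans, are assumptions made for illustration.

```python
import time


def probabilistic_keep(trace_id_hex: str, rate: float) -> bool:
    """Head sampling: keep a fixed fraction of traces, decided from the trace ID.

    Deriving the decision from the ID (instead of a random draw) means every
    service that sees the same trace ID reaches the same decision.
    """
    return int(trace_id_hex[-16:], 16) < int(rate * 2**64)  # compare low 64 bits


class RateLimiter:
    """Token bucket: allow at most `per_second` traces per second on average."""

    def __init__(self, per_second: float, clock=time.monotonic):
        self.per_second = per_second
        self.clock = clock
        self.tokens = per_second  # start with a full bucket
        self.last = clock()

    def keep(self) -> bool:
        now = self.clock()
        refill = (now - self.last) * self.per_second
        self.tokens = min(self.per_second, self.tokens + refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def tail_keep(spans: list[dict], threshold_ms: int = 1000) -> bool:
    """Tail sampling: decide once the trace is complete.

    Keeps the trace if any span reported an error or ran longer than the threshold.
    """
    has_error = any(span.get("status") == "ERROR" for span in spans)
    too_slow = any(span.get("duration_ms", 0) > threshold_ms for span in spans)
    return has_error or too_slow
```

Head sampling is cheap because the decision is made before any span is recorded, while tail sampling has to buffer all spans of a trace until it completes, which is the price for never dropping an error or a slow request.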
Best Practices
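- Sampling strategy: use probabilistic sampling for high-QPS services and tail sampling for critical paths.
- Context propagation: make sure every HTTP/gRPC call propagates the trace context correctly.
- Span naming: use meaningful span names, such as GET /api/users/{id}.
- Error recording: record error stack traces on the span to make problems easier to locate.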
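A span name like GET /api/users/{id} is useful because it stays the same for every user, so traces for one endpoint can be grouped together. The sketch below derives such a name from a concrete request path. It assumes that identifiers are either purely numeric or UUIDs, which is a guess about the route shapes; where the web framework exposes the matched route template, using that directly is more reliable.

```python
import re

# Assumed ID shapes: purely numeric segments and UUIDs. Adjust for real routes.
_ID_SEGMENT = re.compile(
    r"(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})",
    re.IGNORECASE,
)


def span_name(method: str, path: str) -> str:
    """Build a low-cardinality span name such as 'GET /api/users/{id}'."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = [
        "{id}" if _ID_SEGMENT.fullmatch(segment) else segment
        for segment in path.split("/")
    ]
    return f"{method.upper()} {'/'.join(segments)}"


name = span_name("get", "/api/users/42?verbose=1")
```

Here `name` is `GET /api/users/{id}`. The concrete ID is better recorded as a span attribute than in the name, so it remains searchable without multiplying the number of distinct span names.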