Distributed Tracing: OTel Collector and Sampling Strategies
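Core Concepts of Distributed Tracing
Distributed tracing uses a TraceID and SpanIDs to link calls across services into a single call chain. Each unit of work a service performs is recorded as a Span, and all Spans that share a TraceID form a Trace tree. OpenTelemetry is the CNCF observability standard; it unifies the collection of traces, metrics, and logs.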
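The relationship between spans and the trace tree can be sketched in plain Java. This is only an illustration and does not use the OpenTelemetry API: the SpanRecord class, its fields, and the service names are assumptions made for the example. Each span points at its parent through a parent span ID, and the tree is rebuilt by grouping spans under their parent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TraceTreeSketch {
    /** Simplified, hypothetical span; real spans also carry timestamps, attributes, and status. */
    static final class SpanRecord {
        final String traceId;
        final String spanId;
        final String parentSpanId; // null for the root span
        final String name;

        SpanRecord(String traceId, String spanId, String parentSpanId, String name) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.name = name;
        }
    }

    /** Groups spans by parent span ID so the tree can be walked from the root. */
    static Map<String, List<SpanRecord>> childrenByParent(List<SpanRecord> spans) {
        Map<String, List<SpanRecord>> children = new LinkedHashMap<>();
        for (SpanRecord span : spans) {
            if (span.parentSpanId != null) {
                children.computeIfAbsent(span.parentSpanId, k -> new ArrayList<>()).add(span);
            }
        }
        return children;
    }

    /** Renders the subtree under the given span, one line per span, indented by depth. */
    static String render(SpanRecord span, Map<String, List<SpanRecord>> children, int depth) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(span.name).append('\n');
        for (SpanRecord child : children.getOrDefault(span.spanId, Collections.emptyList())) {
            out.append(render(child, children, depth + 1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<SpanRecord> spans = Arrays.asList(
            new SpanRecord("t1", "a", null, "gateway: GET /orders"),
            new SpanRecord("t1", "b", "a", "order-service: processOrder"),
            new SpanRecord("t1", "c", "b", "inventory-service: reserve"),
            new SpanRecord("t1", "d", "b", "payment-service: charge"));
        System.out.print(render(spans.get(0), childrenByParent(spans), 0));
    }
}
```

Tracing backends perform a similar grouping when they display a trace: spans arrive independently from each service and are joined by TraceID and parent span ID.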
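// OpenTelemetry initialization
// ResourceAttributes comes from the semconv artifact; its package depends on the version in use.
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
@Configuration
public class TracingConfig {
    @Bean
    public OpenTelemetry openTelemetry() {
        // Describe this service; merged on top of the SDK's default resource
        Resource resource = Resource.getDefault()
            .merge(Resource.create(Attributes.builder()
                .put(ResourceAttributes.SERVICE_NAME, "my-service")
                .put(ResourceAttributes.SERVICE_VERSION, "1.0.0")
                .build()));
        // Export to the Collector over OTLP/gRPC; the endpoint must include a scheme
        OtlpGrpcSpanExporter spanExporter = OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://otel-collector:4317")
            .build();
        // Sample 10% of new traces; child spans follow their parent's decision
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .setResource(resource)
            .addSpanProcessor(BatchSpanProcessor.builder(spanExporter).build())
            .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.1)))
            .build();
        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .build();
    }
}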
// Creating a span
Tracer tracer = openTelemetry.getTracer("my-service");
Span span = tracer.spanBuilder("processOrder")
    .setAttribute("order.id", orderId)
    .startSpan();
try (Scope scope = span.makeCurrent()) {
    // Business logic
    processOrderInternal(orderId);
    span.setStatus(StatusCode.OK);
} catch (Exception e) {
    // Mark the span as failed and attach the exception as a span event
    span.setStatus(StatusCode.ERROR, e.getMessage());
    span.recordException(e);
} finally {
    // Always end the span, otherwise it is never exported
    span.end();
}
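OTel Collector Architecture
The OTel Collector is a standalone processing pipeline for observability data. It works in three stages: receiving, processing, and exporting.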
# OTel Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      thrift_compact:
        endpoint: 0.0.0.0:6831
      thrift_http:
        endpoint: 0.0.0.0:14268
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
exporters:
  # Note: recent Collector releases dropped the jaeger exporter (Jaeger accepts OTLP directly)
  # and replaced the logging exporter with debug; adjust these for the version you run.
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  logging:
    verbosity: normal
service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, batch, attributes]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
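To make the three stages concrete, here is a minimal sketch of how a pipeline could be modeled in plain Java. It is not the Collector's implementation (the Collector is written in Go); the PipelineSketch class and its method names are assumptions for illustration. A receiver hands a batch to the pipeline, the processors run in the order listed in the configuration, and the result is fanned out to every exporter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

final class PipelineSketch<T> {
    private final List<UnaryOperator<List<T>>> processors;
    private final List<Consumer<List<T>>> exporters;

    PipelineSketch(List<UnaryOperator<List<T>>> processors, List<Consumer<List<T>>> exporters) {
        this.processors = new ArrayList<>(processors);
        this.exporters = new ArrayList<>(exporters);
    }

    /** Called by a receiver: run processors in order, then hand the result to every exporter. */
    void receive(List<T> batch) {
        List<T> current = batch;
        for (UnaryOperator<List<T>> processor : processors) {
            current = processor.apply(current);
        }
        for (Consumer<List<T>> exporter : exporters) {
            exporter.accept(current);
        }
    }

    public static void main(String[] args) {
        // Hypothetical processor: tags each item, loosely like the attributes upsert above.
        UnaryOperator<List<String>> tagEnvironment = batch -> {
            List<String> out = new ArrayList<>();
            for (String item : batch) {
                out.add(item + " environment=production");
            }
            return out;
        };
        List<UnaryOperator<List<String>>> processors = new ArrayList<>();
        processors.add(tagEnvironment);
        List<Consumer<List<String>>> exporters = new ArrayList<>();
        exporters.add(batch -> System.out.println("backend: " + batch));
        exporters.add(batch -> System.out.println("console: " + batch));

        new PipelineSketch<>(processors, exporters).receive(Arrays.asList("span-1", "span-2"));
    }
}
```

The fan-out loop is the part that matters for migrations: the same processed data reaches every exporter listed in a pipeline.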
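Sampling Strategies
A sampling strategy trades data completeness against storage cost. Common strategies include fixed-ratio sampling, adaptive sampling, and tail sampling.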
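Fixed-ratio sampling is usually made deterministic by deriving the decision from the trace ID, so that every service handling the same trace reaches the same answer. The sketch below illustrates that idea in plain Java. It is similar in spirit to a trace-ID-ratio sampler but is not the OpenTelemetry SDK's code; the class name and the choice of the low 64 bits are assumptions for the example.

```java
final class RatioSamplerSketch {
    private final double ratio;
    private final long upperBound;

    RatioSamplerSketch(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be within [0, 1]");
        }
        this.ratio = ratio;
        // Map the ratio onto the range of non-negative long values.
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    /** Decides from the trace ID alone, so the result is the same on every service. */
    boolean shouldSample(String traceIdHex) {
        if (traceIdHex == null || traceIdHex.length() < 16) {
            throw new IllegalArgumentException("trace ID must have at least 16 hex characters");
        }
        if (ratio >= 1.0) {
            return true;
        }
        // Take the last 16 hex characters (low 64 bits) and clear the sign bit.
        String low64 = traceIdHex.substring(traceIdHex.length() - 16);
        long value = Long.parseUnsignedLong(low64, 16) & Long.MAX_VALUE;
        return value < upperBound;
    }

    public static void main(String[] args) {
        RatioSamplerSketch sampler = new RatioSamplerSketch(0.1);
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println(traceId + " sampled: " + sampler.shouldSample(traceId));
    }
}
```

Because no random number is involved, a trace is kept or dropped as a whole instead of losing individual spans, provided all services use the same ratio.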
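// Sampling strategy implementation (adaptive head sampler, sketch)
// A head sampler runs when a span starts, so it sees only start-time attributes;
// the final status and latency are not known yet. hasError and isSlowRequest are
// helper methods not shown here and can only inspect hints available at span start.
// Reliable error- and latency-based decisions belong to tail sampling in the Collector.
public class AdaptiveSampler implements Sampler {
    private final RateLimiter rateLimiter; // e.g. Guava's RateLimiter
    private final int maxTracesPerSecond;
    public AdaptiveSampler(int maxTracesPerSecond) {
        this.maxTracesPerSecond = maxTracesPerSecond;
        this.rateLimiter = RateLimiter.create(maxTracesPerSecond);
    }
    @Override
    public SamplingResult shouldSample(Context parentContext, String traceId, String name,
            SpanKind spanKind, Attributes attributes, List<LinkData> parentLinks) {
        // Always sample requests flagged as errors
        if (hasError(attributes)) {
            return SamplingResult.create(SamplingDecision.RECORD_AND_SAMPLE);
        }
        // Always sample requests flagged as slow
        if (isSlowRequest(attributes)) {
            return SamplingResult.create(SamplingDecision.RECORD_AND_SAMPLE);
        }
        // Adaptive part: sample the rest only while the rate limiter has capacity
        if (rateLimiter.tryAcquire()) {
            return SamplingResult.create(SamplingDecision.RECORD_AND_SAMPLE);
        }
        return SamplingResult.create(SamplingDecision.DROP);
    }
    @Override
    public String getDescription() {
        return "AdaptiveSampler{maxTracesPerSecond=" + maxTracesPerSecond + "}";
    }
}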
# OTel Collector tail sampling configuration
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: error-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: latency-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
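The decision these three policies describe can be sketched in plain Java as follows. This is an illustration of the idea and not the Collector's implementation: the FinishedSpan class is hypothetical, the trace is assumed to be kept if any policy matches, and the random draw is passed in as a parameter so the function stays deterministic for testing.

```java
import java.util.Arrays;
import java.util.List;

final class TailSamplingSketch {
    /** Hypothetical view of a completed span, as seen after decision_wait has elapsed. */
    static final class FinishedSpan {
        final boolean error;
        final long startMillis;
        final long endMillis;

        FinishedSpan(boolean error, long startMillis, long endMillis) {
            this.error = error;
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }
    }

    /**
     * Keeps the trace if it contains an error, if it ran longer than the threshold,
     * or if the supplied draw (a value in [0, 100)) falls under the sampling percentage.
     */
    static boolean keepTrace(List<FinishedSpan> spans, long latencyThresholdMs,
                             double samplingPercentage, double randomPercent) {
        long start = Long.MAX_VALUE;
        long end = Long.MIN_VALUE;
        for (FinishedSpan span : spans) {
            if (span.error) {
                return true; // error-policy
            }
            start = Math.min(start, span.startMillis);
            end = Math.max(end, span.endMillis);
        }
        if (!spans.isEmpty() && end - start > latencyThresholdMs) {
            return true; // latency-policy: earliest start to latest end
        }
        return randomPercent < samplingPercentage; // probabilistic-policy
    }

    public static void main(String[] args) {
        List<FinishedSpan> slowTrace = Arrays.asList(
            new FinishedSpan(false, 0, 1500),
            new FinishedSpan(false, 100, 400));
        double draw = Math.random() * 100.0;
        System.out.println("keep slow trace: " + keepTrace(slowTrace, 1000, 10.0, draw));
    }
}
```

Unlike a head sampler, this function sees the whole trace, which is why error and latency conditions can be evaluated reliably here.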
Trace Context Propagation
Trace context travels between services in HTTP headers or gRPC metadata. Both the W3C TraceContext and B3 formats are supported.
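// W3C TraceContext propagation
// Header format: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
// Propagating the context manually
Span currentSpan = Span.current();
TextMapPropagator propagator = GlobalOpenTelemetry.getPropagators().getTextMapPropagator();
// Inject the context into the request headers
Context context = Context.current().with(currentSpan);
Map<String, String> headers = new HashMap<>();
propagator.inject(context, headers, (carrier, key, value) -> carrier.put(key, value));
// Extract the context from the request headers.
// TextMapGetter declares two methods (keys and get), so it cannot be written as a lambda.
TextMapGetter<Map<String, String>> getter = new TextMapGetter<Map<String, String>>() {
    @Override
    public Iterable<String> keys(Map<String, String> carrier) {
        return carrier.keySet();
    }
    @Override
    public String get(Map<String, String> carrier, String key) {
        return carrier == null ? null : carrier.get(key);
    }
};
Context extractedContext = propagator.extract(Context.current(), headers, getter);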
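To show what a propagator reads from the traceparent header, here is a minimal parsing sketch in plain Java. It handles only version 00 of the W3C format and returns null for anything else, which is a simplification; the class and field names are assumptions for the example and are not part of the OpenTelemetry API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceparentSketch {
    // version (00) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex), lowercase
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    private TraceparentSketch(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    /** Returns the parsed header, or null if it is malformed or uses another version. */
    static TraceparentSketch parse(String header) {
        Matcher matcher = FORMAT.matcher(header == null ? "" : header.trim());
        if (!matcher.matches()) {
            return null;
        }
        String traceId = matcher.group(1);
        String parentSpanId = matcher.group(2);
        // All-zero IDs are defined as invalid by the W3C specification.
        if (traceId.matches("0+") || parentSpanId.matches("0+")) {
            return null;
        }
        int flags = Integer.parseInt(matcher.group(3), 16);
        // The lowest bit of trace-flags is the "sampled" flag.
        return new TraceparentSketch(traceId, parentSpanId, (flags & 0x01) != 0);
    }

    public static void main(String[] args) {
        TraceparentSketch parsed =
            parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(parsed.traceId + " " + parsed.parentSpanId + " " + parsed.sampled);
    }
}
```

The sampled flag is how an upstream sampling decision reaches downstream services, which is what a parent-based sampler relies on.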
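Choosing a Tracing System
Jaeger and Zipkin are the two mainstream tracing systems. Jaeger supports more flexible storage backends and more powerful querying, while Zipkin is lighter and simpler. The OTel Collector can export to several backends at once, which makes it possible to migrate between systems or run them side by side.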