
Distributed Tracing: OTel Collector and Sampling Strategies



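Core Concepts of Distributed Tracing

Distributed tracing uses a TraceID and SpanIDs to link calls across services into a single call chain. Each unit of work a service performs is recorded as a Span, and all Spans that share a TraceID form a Trace tree. OpenTelemetry, a CNCF project, is the de facto observability standard and unifies the collection of traces, metrics, and logs.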
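To make the relationship between Spans and the Trace tree concrete, here is a minimal sketch in plain Java with no OpenTelemetry dependency. `SpanRecord` and `TraceTree` are hypothetical names used only for illustration; they model how Spans with the same TraceID link into a tree through their parent SpanIDs, and leave out everything else a real span carries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified span: real spans also carry timestamps, attributes, and status.
class SpanRecord {
    final String traceId;
    final String spanId;
    final String parentSpanId; // null for the root span
    final String name;

    SpanRecord(String traceId, String spanId, String parentSpanId, String name) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.name = name;
    }
}

class TraceTree {
    // Child spans grouped by the span ID of their parent.
    private final Map<String, List<SpanRecord>> childrenByParent = new HashMap<>();
    private SpanRecord root;

    TraceTree(List<SpanRecord> spans) {
        for (SpanRecord span : spans) {
            if (span.parentSpanId == null) {
                root = span;
            } else {
                childrenByParent
                    .computeIfAbsent(span.parentSpanId, k -> new ArrayList<>())
                    .add(span);
            }
        }
    }

    // Renders the tree depth-first, indenting each level by two spaces.
    String render() {
        StringBuilder out = new StringBuilder();
        if (root != null) {
            render(root, 0, out);
        }
        return out.toString();
    }

    private void render(SpanRecord span, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(span.name).append('\n');
        for (SpanRecord child : childrenByParent.getOrDefault(span.spanId, new ArrayList<>())) {
            render(child, depth + 1, out);
        }
    }
}
```

Given a gateway span with two children, one of which calls a database, `render()` returns one line per span, indented by its depth in the call chain.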

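// OpenTelemetry initialization
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.semconv.ResourceAttributes; // package depends on the semconv artifact version
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TracingConfig {
    @Bean
    public OpenTelemetry openTelemetry() {
        Resource resource = Resource.getDefault()
            .merge(Resource.create(Attributes.builder()
                .put(ResourceAttributes.SERVICE_NAME, "my-service")
                .put(ResourceAttributes.SERVICE_VERSION, "1.0.0")
                .build()));

        // Export over OTLP to the Collector; the endpoint must include the scheme
        OtlpGrpcSpanExporter spanExporter = OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://otel-collector:4317")
            .build();

        // Sample 10% of root traces; child spans follow the parent's decision
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .setResource(resource)
            .addSpanProcessor(BatchSpanProcessor.builder(spanExporter).build())
            .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.1)))
            .build();

        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            // Without propagators, context injection and extraction are no-ops
            .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
            .build();
    }
}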

// Create a Span
Tracer tracer = openTelemetry.getTracer("my-service");
Span span = tracer.spanBuilder("processOrder")
    .setAttribute("order.id", orderId)
    .startSpan();
try (Scope scope = span.makeCurrent()) {
    // Business logic
    processOrderInternal(orderId);
    span.setStatus(StatusCode.OK);
} catch (Exception e) {
    span.setStatus(StatusCode.ERROR, e.getMessage());
    span.recordException(e);
    throw e; // rethrow so the caller still sees the failure
} finally {
    span.end();
}

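OTel Collector Architecture

The OTel Collector is a standalone pipeline for observability data. Data passes through three stages: receivers ingest it, processors transform it, and exporters send it to backends.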

# OTel Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      thrift_compact:
        endpoint: 0.0.0.0:6831
      thrift_http:
        endpoint: 0.0.0.0:14268

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048
  
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400
  
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

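exporters:
  # Jaeger ingests OTLP natively; the dedicated jaeger exporter
  # has been removed from recent Collector releases
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889

  # Replaces the deprecated logging exporter
  debug:
    verbosity: normal

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      # memory_limiter runs first and batch runs last
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]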
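The receive, process, export flow that a pipeline describes can be modeled in a few lines of plain Java. This is only a conceptual sketch: the `Pipeline` class and its methods are hypothetical, spans are represented as strings, and the real Collector is written in Go with batching, backpressure, and retries that are not shown here. It illustrates two properties of the configuration: processors run in the listed order, and every exporter receives the processed data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

// Hypothetical model of one Collector pipeline; spans are plain strings for brevity.
class Pipeline {
    private final List<UnaryOperator<List<String>>> processors = new ArrayList<>();
    private final List<Consumer<List<String>>> exporters = new ArrayList<>();

    Pipeline addProcessor(UnaryOperator<List<String>> processor) {
        processors.add(processor);
        return this;
    }

    Pipeline addExporter(Consumer<List<String>> exporter) {
        exporters.add(exporter);
        return this;
    }

    // Called by a receiver when data arrives.
    void receive(List<String> spans) {
        List<String> current = spans;
        // Processors run in the order they were added.
        for (UnaryOperator<List<String>> processor : processors) {
            current = processor.apply(current);
        }
        // Every exporter gets the same processed data (fan-out).
        for (Consumer<List<String>> exporter : exporters) {
            exporter.accept(current);
        }
    }
}
```

Registering two exporters on the same pipeline is what makes side-by-side backends possible: each one receives an identical copy of the processed spans.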

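Sampling Strategies

A sampling strategy trades data completeness against storage cost. Common strategies are fixed-ratio sampling, adaptive sampling, and tail sampling.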
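Fixed-ratio sampling is usually made deterministic by deriving the decision from the trace ID, so that every service reaches the same verdict for the same trace. The following is a simplified sketch of that idea in plain Java. The `RatioSampler` class is hypothetical and is not the SDK's implementation; it assumes a 32-character lowercase hex trace ID and treats its low 8 bytes as a pseudo-random number.

```java
// Hypothetical sketch of trace-ID-based ratio sampling; not the OpenTelemetry SDK code.
class RatioSampler {
    private final double ratio;
    private final long upperBound;

    RatioSampler(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0, 1]");
        }
        this.ratio = ratio;
        // Map the ratio onto the range of non-negative long values.
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    // traceId: 32 lowercase hex characters (16 bytes), as in W3C TraceContext.
    boolean shouldSample(String traceId) {
        if (traceId.length() != 32) {
            throw new IllegalArgumentException("trace ID must be 32 hex characters");
        }
        if (ratio >= 1.0) {
            return true; // sample everything
        }
        // Interpret the low 8 bytes of the trace ID as a number.
        long low = Long.parseUnsignedLong(traceId.substring(16), 16);
        // Clear the sign bit so the value lies in [0, Long.MAX_VALUE].
        long value = low & Long.MAX_VALUE;
        return value < upperBound;
    }
}
```

Because the decision depends only on the trace ID and the ratio, calling `shouldSample` twice with the same ID always gives the same answer, which keeps a trace from being half-recorded when several services sample at the same ratio.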

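// Sampling strategy sketch against the OpenTelemetry Java Sampler interface.
// A head sampler runs when the span starts and sees only start-time attributes,
// so the error and slow-request checks rely on hint attributes set by the caller.
// Outcomes known only at span end need tail sampling in the Collector instead.
public class AdaptiveSampler implements Sampler {
    // Hypothetical hint attributes, not part of any semantic convention
    private static final AttributeKey<Boolean> ERROR_HINT = AttributeKey.booleanKey("app.error_hint");
    private static final AttributeKey<Boolean> SLOW_HINT = AttributeKey.booleanKey("app.slow_hint");

    private final RateLimiter rateLimiter; // com.google.common.util.concurrent.RateLimiter

    public AdaptiveSampler(int maxTracesPerSecond) {
        this.rateLimiter = RateLimiter.create(maxTracesPerSecond);
    }

    @Override
    public SamplingResult shouldSample(Context parentContext, String traceId, String name,
            SpanKind spanKind, Attributes attributes, List<LinkData> parentLinks) {
        // Always sample requests flagged as errors
        if (Boolean.TRUE.equals(attributes.get(ERROR_HINT))) {
            return SamplingResult.create(SamplingDecision.RECORD_AND_SAMPLE);
        }

        // Always sample requests flagged as slow
        if (Boolean.TRUE.equals(attributes.get(SLOW_HINT))) {
            return SamplingResult.create(SamplingDecision.RECORD_AND_SAMPLE);
        }

        // Adaptive sampling: accept up to maxTracesPerSecond of the remaining traffic
        if (rateLimiter.tryAcquire()) {
            return SamplingResult.create(SamplingDecision.RECORD_AND_SAMPLE);
        }

        return SamplingResult.create(SamplingDecision.DROP);
    }

    @Override
    public String getDescription() {
        return "AdaptiveSampler";
    }
}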

# OTel Collector tail sampling configuration
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: error-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: latency-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
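The three policies above amount to a decision made after a trace has finished: keep it if anything failed, keep it if it was slow, otherwise keep a random share. The sketch below models that decision in plain Java. `FinishedSpan` and `TailSampler` are hypothetical names, the random source is injected so the behavior is testable, and the latency check looks at individual span durations for brevity, which may differ from how the Collector measures trace latency.

```java
import java.util.List;
import java.util.function.DoubleSupplier;

// Hypothetical view of a completed span: only the fields the policies need.
class FinishedSpan {
    final boolean error;
    final long durationMs;

    FinishedSpan(boolean error, long durationMs) {
        this.error = error;
        this.durationMs = durationMs;
    }
}

// Hypothetical tail sampler: decides once all spans of a trace have been buffered.
class TailSampler {
    private final long latencyThresholdMs;
    private final double samplingPercentage;
    private final DoubleSupplier random; // must return values in [0, 1)

    TailSampler(long latencyThresholdMs, double samplingPercentage, DoubleSupplier random) {
        this.latencyThresholdMs = latencyThresholdMs;
        this.samplingPercentage = samplingPercentage;
        this.random = random;
    }

    // A trace is kept if any policy votes to keep it.
    boolean keep(List<FinishedSpan> trace) {
        for (FinishedSpan span : trace) {
            if (span.error) {
                return true; // error policy
            }
            if (span.durationMs >= latencyThresholdMs) {
                return true; // latency policy
            }
        }
        // Probabilistic policy for everything else.
        return random.getAsDouble() * 100.0 < samplingPercentage;
    }
}
```

In contrast to a head sampler, this decision can use the error flag and the measured duration because it runs only after the spans are complete; the cost is that every span must be buffered until the decision is made.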

Trace Context Propagation

Trace context travels between services in HTTP headers or gRPC metadata. Both the W3C TraceContext and B3 formats are supported.

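// W3C TraceContext propagation
// Header format: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

// Manual context propagation
Span currentSpan = Span.current();
TextMapPropagator propagator = GlobalOpenTelemetry.getPropagators().getTextMapPropagator();

// Inject the context into the request headers
Context context = Context.current().with(currentSpan);
Map<String, String> headers = new HashMap<>();
propagator.inject(context, headers, (carrier, key, value) -> carrier.put(key, value));

// Extract the context from the request headers.
// TextMapGetter declares both keys() and get(), so it cannot be written as a lambda.
TextMapGetter<Map<String, String>> getter = new TextMapGetter<Map<String, String>>() {
    @Override
    public Iterable<String> keys(Map<String, String> carrier) {
        return carrier.keySet();
    }

    @Override
    public String get(Map<String, String> carrier, String key) {
        return carrier == null ? null : carrier.get(key);
    }
};
Context extractedContext = propagator.extract(Context.current(), headers, getter);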
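To show what a propagator reads from the `traceparent` header, here is a minimal parser in plain Java. `TraceParent` is a hypothetical class written for this illustration; it handles only version `00` of the W3C format (version, trace ID, parent span ID, flags) and ignores `tracestate`. In practice the SDK's propagator does this work.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for version 00 of the W3C traceparent header.
class TraceParent {
    // Layout: 00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String spanId;
    final boolean sampled;

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("invalid traceparent: " + header);
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero IDs are invalid according to the specification.
        if (traceId.matches("0+") || spanId.matches("0+")) {
            throw new IllegalArgumentException("all-zero ID in traceparent: " + header);
        }
        int flags = Integer.parseInt(m.group(3), 16);
        // The lowest flag bit carries the upstream sampling decision.
        return new TraceParent(traceId, spanId, (flags & 0x01) != 0);
    }

    // Serializes back to the header value, keeping only the sampled flag.
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }
}
```

Parsing the sample header from the comment above yields the trace ID `4bf92f3577b34da6a3ce929d0e0e4736`, the parent span ID `00f067aa0ba902b7`, and a sampled flag of `true`. That flag is what a parent-based sampler follows on the receiving side.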

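Choosing a Tracing Backend

Jaeger and Zipkin are two of the most widely used tracing systems. Jaeger offers more flexible storage backends and richer query capabilities, while Zipkin is lighter and simpler to run. Because the OTel Collector can export to several backends at once, it can be used to migrate between systems or to run them side by side.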