OpenTelemetry: A Unified Observability Framework
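OpenTelemetry Architecture
OpenTelemetry is a CNCF-incubated observability framework. It provides a unified set of APIs, SDKs, and tools for collecting the three pillars of telemetry data: traces, metrics, and logs.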
OpenTelemetry architecture:
┌─────────────────────────────────────────────────┐
│                   Application                   │
├─────────────────┬───────────────────────────────┤
│    OTel API     │           OTel SDK            │
│   Unified API   │        Implementation         │
├─────────────────┴───────────────────────────────┤
│              Auto-Instrumentation               │
│            (Java/Python/Go/Node.js)             │
└──────────────────────┬──────────────────────────┘
                       │ OTLP (gRPC/HTTP)
                       ▼
┌─────────────────────────────────────────────────┐
│                 OTel Collector                  │
├─────────────┬─────────────┬─────────────────────┤
│ Receivers   │ Processors  │ Exporters           │
│ OTLP        │ Batch       │ Prometheus          │
│ Jaeger      │ Filter      │ Jaeger              │
│ Zipkin      │ Transform   │ Loki                │
│ Kafka       │Tail-sampling│ Elasticsearch       │
└─────────────┴─────────────┴─────────────────────┘
OTel SDK Integration
Python SDK Initialization
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource

# Define the resource attributes attached to all telemetry from this service
resource = Resource.create({
    "service.name": "payment-service",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})

# Configure the tracer provider
# insecure=True: the Collector's OTLP gRPC receiver is assumed to run without TLS
trace_exporter = OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(tracer_provider)

# Configure the meter provider (metrics are exported every 5 seconds)
metric_exporter = OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True)
metric_reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=5000)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# Obtain a tracer and a meter
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Create the metric instrument once, at module level
order_counter = meter.create_counter(
    "orders.processed",
    description="Number of orders processed",
    unit="1",
)

# Create a span and record a metric for each processed order
def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        order_counter.add(1, {"status": "success"})
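The BatchSpanProcessor used in the setup above buffers finished spans and hands them to the exporter in batches instead of making one export call per span. The following is a minimal, stdlib-only sketch of that idea, not the SDK's implementation: `ToyBatcher`, `max_batch_size`, and the `export` callback are hypothetical names chosen for illustration, and the real processor also flushes on a timer from a background thread.

```python
from typing import Callable, List


class ToyBatcher:
    """Illustrative size-based batcher (hypothetical, not the OTel SDK class)."""

    def __init__(self, export: Callable[[List[str]], None], max_batch_size: int = 3):
        self._export = export
        self._max_batch_size = max_batch_size
        self._buffer: List[str] = []

    def on_end(self, span_name: str) -> None:
        # Queue the finished span; export once the buffer is full.
        self._buffer.append(span_name)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        # Export whatever is buffered (a shutdown hook would call this too).
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._export(batch)


exported: List[List[str]] = []
batcher = ToyBatcher(exported.append, max_batch_size=3)
for i in range(7):
    batcher.on_end(f"span-{i}")
batcher.flush()  # drain the remainder
print([len(b) for b in exported])  # prints [3, 3, 1]
```

Batching trades a small delay for far fewer network calls, which is why the final `flush()` matters: without it, the last partial batch would be lost when the process exits.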
Auto-Instrumentation
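# Auto-instrumentation setup (Python)
# Option 1: zero-code instrumentation via the opentelemetry-instrument command
#   opentelemetry-instrument python app.py

# Option 2: enable individual instrumentation libraries in code
from fastapi import FastAPI
from sqlalchemy import create_engine
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor

app = FastAPI()
# Placeholder database URL for illustration; replace it with the real connection string
engine = create_engine("sqlite:///:memory:")

# Instrument FastAPI
FastAPIInstrumentor.instrument_app(app)
# Instrument SQLAlchemy
SQLAlchemyInstrumentor().instrument(engine=engine)
# Instrument Redis (patches the redis client library globally)
RedisInstrumentor().instrument()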
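Instrumentation libraries like the ones above generally work by wrapping a framework's entry points, so each call is timed and recorded without changes to the application code. A rough stdlib-only sketch of that wrapping idea follows. `traced`, `RECORDED_SPANS`, and `fetch_user` are hypothetical names, and real instrumentors create OpenTelemetry spans rather than plain dicts.

```python
import functools
import time
from typing import Any, Callable, Dict, List

RECORDED_SPANS: List[Dict[str, Any]] = []  # stand-in for an exporter


def traced(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap func so that every call records a span-like dict."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        status = "OK"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "ERROR"
            raise
        finally:
            # Runs on both success and failure, like ending a span.
            RECORDED_SPANS.append({
                "name": func.__name__,
                "status": status,
                "duration_s": time.perf_counter() - start,
            })

    return wrapper


def fetch_user(user_id: int) -> dict:
    if user_id < 0:
        raise ValueError("invalid user id")
    return {"id": user_id}


# "Instrument" the function after the fact, without editing its body.
fetch_user = traced(fetch_user)

fetch_user(1)
try:
    fetch_user(-1)
except ValueError:
    pass
print([(s["name"], s["status"]) for s in RECORDED_SPANS])
```

The exception is re-raised after being recorded, so the wrapper stays transparent to callers while failures still show up in the telemetry.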
Collector Configuration
# otel-collector-config.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          static_configs:
            - targets: ['localhost:8888']

processors:
  batch:
    send_batch_size: 10000
    timeout: 200ms
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 1000
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
      - name: latency
        type: latency
        latency:
          threshold_ms: 500
      - name: error
        type: status_code
        status_code:
          status_codes: [ERROR]

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
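The three tail_sampling policies in the configuration above combine so that a trace is kept when any one of them matches: it contains an error, it is slower than the latency threshold, or it falls into the probabilistic sample. The sketch below models that decision in plain Python under simplifying assumptions. `TraceSummary`, `keep_by_probability`, and `should_sample` are hypothetical names, and the SHA-256 bucketing only stands in for the Collector's own trace ID hashing.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str       # hex string identifying the trace
    duration_ms: float  # end-to-end latency of the whole trace
    has_error: bool     # True if any span ended with status ERROR


def keep_by_probability(trace_id: str, percentage: float) -> bool:
    # Deterministic choice derived from the trace ID, so every evaluation
    # of the same trace yields the same answer (an assumption of this sketch).
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < percentage


def should_sample(t: TraceSummary, percentage: float = 10, threshold_ms: float = 500) -> bool:
    # A trace is kept if ANY policy matches.
    return (
        t.has_error
        or t.duration_ms > threshold_ms
        or keep_by_probability(t.trace_id, percentage)
    )


slow = TraceSummary("4bf92f3577b34da6a3ce929d0e0e4736", 900.0, False)
failed = TraceSummary("0af7651916cd43dd8448eb211c80319c", 20.0, True)
print(should_sample(slow), should_sample(failed))  # prints True True
```

Because the decision needs the duration and status of the complete trace, it can only be made after the spans have arrived, which is what `decision_wait` allows for in the Collector.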
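Best Practices
- Resource attributes: set service.name, service.version, and similar resource attributes on every service
- Sampling strategy: use tail-based sampling in production to balance storage cost against observability
- Context propagation: make sure trace context is propagated correctly across service boundaries
- Incremental adoption: start with auto-instrumentation, then add manual spans for critical paths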
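To make the context propagation point concrete, the sketch below builds and parses a W3C Trace Context `traceparent` header, which is the format OpenTelemetry propagates by default. It is a simplified illustration that handles only version `00`. `SpanContext`, `inject`, `extract`, and `child_of` are hypothetical helpers, not the OpenTelemetry propagator API, which should be used in real services.

```python
import re
import secrets
from typing import NamedTuple, Optional


class SpanContext(NamedTuple):
    trace_id: str  # 32 lowercase hex characters
    span_id: str   # 16 lowercase hex characters
    sampled: bool


# version "00" - trace-id - parent-id - trace-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(ctx: SpanContext, headers: dict) -> None:
    # Write the context into outgoing request headers.
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: dict) -> Optional[SpanContext]:
    # Read the context from incoming headers; None means "start a new trace".
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_of(parent: SpanContext) -> SpanContext:
    # Same trace ID, fresh span ID, inherited sampling decision.
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
parent = extract(incoming)
outgoing: dict = {}
inject(child_of(parent), outgoing)
print(outgoing["traceparent"])
```

The outgoing header keeps the incoming trace ID and carries a new span ID, which is what lets a backend stitch spans from different services into one trace.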