Request Routing
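---
title: "Request Routing"
description: "An introduction to smart request routing strategies for LLM services, including load balancing, model routing, and failover"
tags: ["request routing", "load balancing", "model routing", "failover"]
category: "llm"
icon: "🧠"
---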
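Why Request Routing Matters
In LLM deployments that span multiple models and multiple instances, the router decides which backend service receives each request. A well-chosen routing strategy improves overall service quality, raises resource utilization, and helps keep costs under control.
Routing Architecture
Basic Routing Layer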
import random
from dataclasses import dataclass
from typing import List


class NoHealthyBackendError(Exception):
    """Raised when no healthy backend is available to serve a request."""


@dataclass
class BackendEndpoint:
    url: str
    model_name: str
    weight: int = 1
    healthy: bool = True
    current_load: float = 0.0


class LLMSmartRouter:
    def __init__(self):
        self.backends: List[BackendEndpoint] = []
        # HealthChecker is defined later in this document.
        self.health_checker = HealthChecker()

    def add_backend(self, endpoint: BackendEndpoint):
        self.backends.append(endpoint)

    def route_request(self, request) -> BackendEndpoint:
        healthy_backends = [b for b in self.backends if b.healthy]
        if not healthy_backends:
            raise NoHealthyBackendError()

        # Weighted random selection: walk the cumulative weights until r is covered.
        total_weight = sum(b.weight for b in healthy_backends)
        r = random.uniform(0, total_weight)
        cumulative = 0
        for backend in healthy_backends:
            cumulative += backend.weight
            if r <= cumulative:
                return backend
        # Defensive fallback; the loop above should always return.
        return healthy_backends[0]
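To see what weighted random selection does in practice, the sketch below draws many samples and counts how often each backend is picked. It is a minimal illustration, not the router itself: `Endpoint`, `pick_weighted`, `sample_distribution`, and the `*.internal` URLs are hypothetical names, and it uses the standard library's `random.choices` rather than the hand-written cumulative loop.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical stand-in for BackendEndpoint, reduced to the fields used here."""
    url: str
    weight: int = 1
    healthy: bool = True


def pick_weighted(endpoints, rng):
    """Pick one healthy endpoint with probability proportional to its weight."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy backend")
    return rng.choices(healthy, weights=[e.weight for e in healthy], k=1)[0]


def sample_distribution(endpoints, n=10_000, seed=0):
    """Count how often each URL is chosen over n draws (seeded for repeatability)."""
    rng = random.Random(seed)
    return Counter(pick_weighted(endpoints, rng).url for _ in range(n))


endpoints = [
    Endpoint("http://a.internal", weight=3),
    Endpoint("http://b.internal", weight=1),
    Endpoint("http://c.internal", weight=5, healthy=False),  # skipped: unhealthy
]
counts = sample_distribution(endpoints)
```

With weights 3 and 1, roughly three quarters of the draws should land on the first backend, and the unhealthy backend should never be chosen regardless of its weight.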
Smart Routing Strategies
1. Load-Based Routing
Send each request to the backend with the lowest current load:
class LeastLoadRouter:
    def select_backend(self, backends):
        healthy = [b for b in backends if b.healthy]
        if not healthy:
            return None
        # Pick the healthy backend with the lowest current load.
        return min(healthy, key=lambda b: b.current_load)
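Least-load selection is only as good as the load figure behind it. One possible way to keep that figure current is to count in-flight requests per backend, as sketched below. `TrackedBackend`, `track_request`, `select_least_loaded`, and the `capacity` field are assumptions made for this illustration; the counter is not thread-safe and is meant for a single event loop or thread.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class TrackedBackend:
    """Hypothetical backend record that derives load from in-flight requests."""
    url: str
    capacity: int = 8      # assumed maximum number of concurrent requests
    in_flight: int = 0
    healthy: bool = True

    @property
    def current_load(self) -> float:
        # Fraction of the assumed capacity currently in use.
        return self.in_flight / self.capacity


@contextmanager
def track_request(backend):
    """Increment the in-flight counter for the duration of one request."""
    backend.in_flight += 1
    try:
        yield backend
    finally:
        # Always release the slot, even if the request raised.
        backend.in_flight -= 1


def select_least_loaded(backends):
    """Return the healthy backend with the lowest load, or None if there is none."""
    healthy = [b for b in backends if b.healthy]
    return min(healthy, key=lambda b: b.current_load) if healthy else None
```

Wrapping each outgoing call in `with track_request(backend):` keeps `current_load` in step with real traffic, so the selection reacts to slow requests as well as to request counts.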
2. Capability-Based Routing
Choose the most suitable model based on the characteristics of the request:
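class ModelCapabilityRouter:
    def __init__(self):
        # Illustrative limits and strengths; verify against each provider's current documentation.
        self.model_capabilities = {
            "gpt-4o": {"max_tokens": 128000, "strengths": ["reasoning", "code"]},
            "gpt-4o-mini": {"max_tokens": 16000, "strengths": ["general", "fast"]},
            "claude-3.5-sonnet": {"max_tokens": 200000, "strengths": ["analysis", "creative"]},
        }

    def route(self, request):
        # Pick a model from the request's characteristics.
        small_limit = self.model_capabilities["gpt-4o-mini"]["max_tokens"]
        if request.requires_reasoning:
            return "gpt-4o"
        elif request.max_length > small_limit:
            # Too long for the small model: use the model with the largest limit.
            return "claude-3.5-sonnet"
        else:
            return "gpt-4o-mini"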
3. Cost-Aware Routing
Choose the cheapest model that still meets the quality requirement:
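class CostAwareRouter:
    def __init__(self):
        # Illustrative prices. The unit must match the budget passed to
        # route_by_budget; check current provider pricing before relying on these.
        self.pricing = {
            "gpt-4o-mini": 0.60,
            "deepseek-v3": 1.10,
            "gpt-4o": 10.00,
        }

    def meets_quality(self, model, request) -> bool:
        # Placeholder (not in the original): replace with a real quality check.
        return True

    def route_by_budget(self, request, max_cost_per_1k=1.0):
        # Walk the models from cheapest to most expensive.
        for model, cost in sorted(self.pricing.items(), key=lambda x: x[1]):
            if cost <= max_cost_per_1k and self.meets_quality(model, request):
                return model
        # Nothing fits the budget and quality bar: fall back to the cheapest model.
        return min(self.pricing, key=self.pricing.get)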
Failover Mechanisms
Health Checks
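import asyncio
import aiohttp


class HealthChecker:
    def __init__(self, check_interval=30):
        # Seconds between health-check rounds.
        self.check_interval = check_interval

    async def check_health(self, endpoint):
        # A backend counts as healthy only if GET /health returns 200 within 5 seconds.
        timeout = aiohttp.ClientTimeout(total=5)
        try:
            async with aiohttp.ClientSession(timeout=timeout) as session:
                async with session.get(f"{endpoint.url}/health") as resp:
                    return resp.status == 200
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return False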
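A single probe says little on its own, so a router usually runs probes on a schedule and reacts to repeated failures. The sketch below shows one way to do that under stated assumptions: `MonitoredBackend`, `apply_probe_result`, `run_health_checks`, the `rounds` parameter, and the threshold of three consecutive failures are all hypothetical choices, not part of the original design. `probe` can be any async callable that returns a bool, such as `HealthChecker.check_health`.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class MonitoredBackend:
    """Hypothetical backend record carrying health state."""
    url: str
    healthy: bool = True
    consecutive_failures: int = 0


def apply_probe_result(backend, ok, failure_threshold=3):
    """Update health from one probe; mark unhealthy only after repeated failures."""
    if ok:
        backend.consecutive_failures = 0
        backend.healthy = True
    else:
        backend.consecutive_failures += 1
        if backend.consecutive_failures >= failure_threshold:
            backend.healthy = False


async def run_health_checks(backends, probe, interval=30.0, rounds=None):
    """Probe all backends concurrently every `interval` seconds.

    rounds=None loops forever; a finite number is convenient for tests.
    """
    completed = 0
    while rounds is None or completed < rounds:
        results = await asyncio.gather(*(probe(b) for b in backends))
        for backend, ok in zip(backends, results):
            apply_probe_result(backend, ok)
        completed += 1
        # Sleep between rounds, but not after the final one.
        if rounds is None or completed < rounds:
            await asyncio.sleep(interval)
```

Requiring several consecutive failures avoids removing a backend because of one slow response, while a single successful probe restores it immediately. Both behaviours are tunable policy choices rather than fixed rules.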
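Automatic Failover
When the primary backend becomes unavailable, traffic switches automatically to a standby backend. Configuring several standby backends provides multi-level failover.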
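Multi-level failover can be sketched as trying backends in priority order until one succeeds. The example below is a minimal illustration: `call_with_failover`, `AllBackendsFailedError`, and the `send` callable are hypothetical names, and a production router would catch narrower exception types and probably add per-attempt timeouts.

```python
from typing import Callable, Sequence


class AllBackendsFailedError(Exception):
    """Raised when every backend in the failover chain has failed."""


def call_with_failover(backends: Sequence[str], send: Callable[[str], str]) -> str:
    """Try backends in priority order and return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return send(backend)
        except Exception as exc:  # assumption: any exception triggers failover
            errors.append(f"{backend}: {exc}")
    # Every backend failed: surface all collected errors to the caller.
    raise AllBackendsFailedError("; ".join(errors))


def fake_send(backend: str) -> str:
    """Stand-in transport for the demo: the primary is down, the others respond."""
    if backend == "primary":
        raise ConnectionError("connection refused")
    return f"ok from {backend}"


result = call_with_failover(["primary", "secondary", "tertiary"], fake_send)
```

Here `result` comes from `secondary`, because the primary raises and the chain stops at the first backend that answers. The third backend is never contacted.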
Routing Rule Configuration
routing_rules:
  - match:
      model: "gpt-4o"
      path: "/v1/chat/completions"
    route:
      backend: "openai-production"
      priority: 1
  - match:
      model: "deepseek-v3"
    route:
      backend: "deepseek-cluster"
      priority: 1
  - fallback:
      route:
        backend: "default-backend"
        priority: 10
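The configuration does not spell out how rules are evaluated, so the sketch below makes two assumptions explicit: a rule matches when every field under `match` equals the corresponding request field, and among matching rules the lowest `priority` number wins, with `fallback` used only when nothing matches. The rules are written as Python literals mirroring the YAML to keep the example free of third-party parsers; `RULES` and `resolve_backend` are hypothetical names.

```python
RULES = [
    {"match": {"model": "gpt-4o", "path": "/v1/chat/completions"},
     "route": {"backend": "openai-production", "priority": 1}},
    {"match": {"model": "deepseek-v3"},
     "route": {"backend": "deepseek-cluster", "priority": 1}},
    {"fallback": {"route": {"backend": "default-backend", "priority": 10}}},
]


def resolve_backend(rules, request):
    """Return the backend name for a request, or None if no rule applies.

    Assumed semantics: all fields under `match` must equal the request's
    fields; the lowest priority number wins; fallback applies last.
    """
    candidates = []
    fallback = None
    for rule in rules:
        if "fallback" in rule:
            fallback = rule["fallback"]["route"]
        elif all(request.get(key) == value for key, value in rule["match"].items()):
            candidates.append(rule["route"])
    if candidates:
        return min(candidates, key=lambda route: route["priority"])["backend"]
    return fallback["backend"] if fallback else None
```

Under these assumptions, a `gpt-4o` request to `/v1/chat/completions` resolves to `openai-production`, while the same model on any other path falls through to `default-backend` because the first rule also constrains `path`.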
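Monitoring and Observability
Record the details of every routing decision, including the selected backend, the reason for the choice, and the response time. Analyze these routing logs to refine the routing strategy over time, and set up route-level SLA monitoring to confirm that overall service quality meets its targets.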
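One way to capture routing decisions and derive a simple SLA figure from them is sketched below. `RoutingDecision`, `RoutingLog`, and `sla_compliance` are hypothetical names, the records are kept in memory for brevity, and the SLA is assumed to be the fraction of requests finishing within a latency threshold. A real deployment would ship the JSON lines to its logging or metrics system.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class RoutingDecision:
    backend: str        # backend that was selected
    reason: str         # why it was selected, e.g. "least_load"
    latency_ms: float   # observed response time
    timestamp: float = field(default_factory=time.time)


class RoutingLog:
    def __init__(self):
        self.records = []

    def record(self, backend, reason, latency_ms):
        """Store one decision and return it as a JSON line for structured logging."""
        decision = RoutingDecision(backend, reason, latency_ms)
        self.records.append(decision)
        return json.dumps(asdict(decision))

    def sla_compliance(self, backend, threshold_ms):
        """Fraction of a backend's requests within threshold_ms, or None if no data."""
        latencies = [r.latency_ms for r in self.records if r.backend == backend]
        if not latencies:
            return None
        return sum(ms <= threshold_ms for ms in latencies) / len(latencies)
```

Comparing `sla_compliance` across backends over time gives a concrete signal for adjusting weights or rules, for example lowering the weight of a backend whose compliance drops below the target.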