
Request Routing


Why Request Routing Matters

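In an LLM deployment that runs multiple models across multiple instances, the request router decides which backend service handles each request. A well-chosen routing strategy improves overall service quality, makes better use of available resources, and keeps costs under control.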

Routing Architecture

Base Routing Layer

from dataclasses import dataclass
from typing import List, Optional
import random


class NoHealthyBackendError(Exception):
    """Raised when no healthy backend is available to serve a request."""


@dataclass
class BackendEndpoint:
    url: str
    model_name: str
    weight: int = 1
    healthy: bool = True
    current_load: float = 0.0


class LLMSmartRouter:
    def __init__(self, health_checker: Optional["HealthChecker"] = None):
        self.backends: List[BackendEndpoint] = []
        # HealthChecker (see the health-check section) keeps each backend's `healthy` flag current
        self.health_checker = health_checker

    def add_backend(self, endpoint: BackendEndpoint) -> None:
        self.backends.append(endpoint)

    def route_request(self, request) -> BackendEndpoint:
        healthy_backends = [b for b in self.backends if b.healthy]

        if not healthy_backends:
            raise NoHealthyBackendError()

        # Weighted random selection: each backend is chosen with probability weight / total_weight
        total_weight = sum(b.weight for b in healthy_backends)
        r = random.uniform(0, total_weight)
        cumulative = 0

        for backend in healthy_backends:
            cumulative += backend.weight
            if r <= cumulative:
                return backend

        # Only reached through floating-point edge cases
        return healthy_backends[-1]

Smart Routing Strategies

1. Load-Based Routing

Send each request to the healthy backend with the lowest current load:

class LeastLoadRouter:
    def select_backend(self, backends):
        healthy = [b for b in backends if b.healthy]
        if not healthy:
            return None
        
        # Pick the backend with the lowest current load
        return min(healthy, key=lambda b: b.current_load)
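The `LeastLoadRouter` above assumes `current_load` is kept accurate, but the original does not say how. One common option, assumed here, is to treat load as the number of in-flight requests (a least-connections policy) and adjust it around each call. The sketch below uses a trimmed stand-in `Backend` class and a hypothetical `track_load` context manager; neither comes from the original design.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Backend:
    url: str
    healthy: bool = True
    current_load: float = 0.0  # assumption: number of in-flight requests


def select_least_loaded(backends: List[Backend]) -> Optional[Backend]:
    healthy = [b for b in backends if b.healthy]
    return min(healthy, key=lambda b: b.current_load) if healthy else None


@contextmanager
def track_load(backend: Backend) -> Iterator[Backend]:
    # Hypothetical helper: count the request as in flight for the duration of the block
    backend.current_load += 1
    try:
        yield backend
    finally:
        backend.current_load -= 1


backends = [Backend("http://a"), Backend("http://b")]
with track_load(select_least_loaded(backends)) as first:
    # While `first` is busy, the next selection goes to the other backend
    second = select_least_loaded(backends)
    print(first.url, second.url)  # http://a http://b
```

The `finally` clause makes sure the counter is decremented even when the backend call raises, so a failing request does not leave a backend looking permanently busy.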

2. Capability-Based Model Routing

Choose the model best suited to the characteristics of each request:

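class ModelCapabilityRouter:
    def __init__(self):
        # Illustrative capability table; verify token limits against each provider's documentation
        self.model_capabilities = {
            "gpt-4o": {"max_tokens": 128000, "strengths": ["reasoning", "code"]},
            "gpt-4o-mini": {"max_tokens": 16000, "strengths": ["general", "fast"]},
            "claude-3.5-sonnet": {"max_tokens": 200000, "strengths": ["analysis", "creative"]},
        }

    def route(self, request) -> str:
        # Assumes `request` exposes `requires_reasoning` (bool) and `max_length` (tokens)
        if request.requires_reasoning:
            return "gpt-4o"
        if request.max_length > self.model_capabilities["gpt-4o-mini"]["max_tokens"]:
            # Too long for the small model: use the model with the largest token limit
            return "claude-3.5-sonnet"
        return "gpt-4o-mini"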

3. Cost-Aware Routing

Choose the cheapest model that still meets the quality requirements:

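class CostAwareRouter:
    def __init__(self):
        # USD per 1M output tokens (published list prices change; verify before use)
        self.pricing = {
            "gpt-4o-mini": 0.60,
            "deepseek-v3": 1.10,
            "gpt-4o": 10.00,
        }

    def meets_quality(self, model: str, request) -> bool:
        # Placeholder: replace with a real check, e.g. per-task evaluation scores
        return True

    def route_by_budget(self, request, max_cost_per_1m: float = 1.0) -> str:
        # Walk models from cheapest to most expensive; return the first within budget and quality
        for model, cost in sorted(self.pricing.items(), key=lambda x: x[1]):
            if cost <= max_cost_per_1m and self.meets_quality(model, request):
                return model
        # Nothing qualifies: fall back to the cheapest model
        return min(self.pricing, key=self.pricing.get)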
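A budget expressed per million tokens is easier to reason about once it is converted into a per-request figure. The sketch below assumes the `pricing` values above are USD per 1M output tokens and, for brevity, ignores input-token pricing (a simplification; real bills usually charge input and output tokens separately). `estimate_output_cost` is a hypothetical helper.

```python
# Assumption: USD per 1M output tokens, copied from the pricing table above
PRICING_PER_1M = {"gpt-4o-mini": 0.60, "deepseek-v3": 1.10, "gpt-4o": 10.00}


def estimate_output_cost(model: str, output_tokens: int) -> float:
    """Estimated USD cost of generating `output_tokens` with `model`."""
    return PRICING_PER_1M[model] * output_tokens / 1_000_000


# 2,000 output tokens -> gpt-4o-mini $0.0012, deepseek-v3 $0.0022, gpt-4o $0.0200
for model in PRICING_PER_1M:
    print(f"{model}: ${estimate_output_cost(model, 2_000):.4f}")
```

At these rates the same 2,000-token answer costs roughly 17 times more on `gpt-4o` than on `gpt-4o-mini`, which is the gap a cost-aware router tries to exploit when the cheaper model is good enough.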

Failover Mechanisms

Health Checks

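import asyncio
import aiohttp

class HealthChecker:
    def __init__(self, check_interval: float = 30):
        self.check_interval = check_interval  # seconds between probe rounds

    async def check_health(self, endpoint) -> bool:
        # Healthy means GET {url}/health returns 200 within 5 seconds
        timeout = aiohttp.ClientTimeout(total=5)
        try:
            async with aiohttp.ClientSession(timeout=timeout) as session:
                async with session.get(f"{endpoint.url}/health") as resp:
                    return resp.status == 200
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return False

    async def run(self, endpoints) -> None:
        # Probe all endpoints concurrently every `check_interval` seconds and update `healthy`
        while True:
            results = await asyncio.gather(*(self.check_health(e) for e in endpoints))
            for endpoint, ok in zip(endpoints, results):
                endpoint.healthy = ok
            await asyncio.sleep(self.check_interval)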

Automatic Failover

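When the primary backend becomes unavailable, traffic switches automatically to a standby backend. Configuring several standby backends in priority order provides multi-level failover.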
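Multi-level failover can be sketched as trying backends in priority order until one succeeds. This is a minimal synchronous sketch under several assumptions not stated in the original: a lower `priority` number is tried first, a single failed call marks a backend unhealthy until a health check restores it, and `send` is an injected callable standing in for the real HTTP request. `call_with_failover`, `fake_send`, and `AllBackendsFailedError` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

T = TypeVar("T")


class AllBackendsFailedError(Exception):
    """Raised when every backend in the failover chain has failed."""


@dataclass
class Backend:
    name: str
    priority: int  # assumption: lower value = tried first
    healthy: bool = True


def call_with_failover(backends: List[Backend], send: Callable[[Backend], T]) -> T:
    errors: List[str] = []
    # Try healthy backends from highest to lowest priority
    for backend in sorted(backends, key=lambda b: b.priority):
        if not backend.healthy:
            continue
        try:
            return send(backend)
        except Exception as exc:  # e.g. a timeout or connection error from the backend
            # Mark it down so later requests skip it until a health check restores it
            backend.healthy = False
            errors.append(f"{backend.name}: {exc}")
    raise AllBackendsFailedError("; ".join(errors))


def fake_send(backend: Backend) -> str:
    # Stand-in for a real HTTP call: the primary is down
    if backend.name == "primary":
        raise ConnectionError("connection refused")
    return f"served by {backend.name}"


chain = [Backend("standby-2", 3), Backend("primary", 1), Backend("standby-1", 2)]
print(call_with_failover(chain, fake_send))  # served by standby-1
```

Marking a backend down after one failure is deliberately aggressive; production setups often require several consecutive failures (a circuit breaker) before removing a backend from rotation.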

Routing Rule Configuration

routing_rules:
  - match:
      model: "gpt-4o"
      path: "/v1/chat/completions"
    route:
      backend: "openai-production"
      priority: 1
  
  - match:
      model: "deepseek-v3"
    route:
      backend: "deepseek-cluster"
      priority: 1
  
  - fallback:
      route:
        backend: "default-backend"
        priority: 10
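The YAML above does not spell out how rules are evaluated. One plausible reading, assumed here, is: a rule matches when every key under `match` equals the corresponding request field, the matching rule with the lowest `priority` number wins, and the `fallback` route applies when nothing matches. The rules are written as Python data to keep the sketch dependency-free (loading the YAML itself would need a parser such as PyYAML); `resolve_backend` is a hypothetical name.

```python
from typing import Any, Dict, List, Optional

# The YAML rules above, expressed as Python data
ROUTING_RULES: List[Dict[str, Any]] = [
    {"match": {"model": "gpt-4o", "path": "/v1/chat/completions"},
     "route": {"backend": "openai-production", "priority": 1}},
    {"match": {"model": "deepseek-v3"},
     "route": {"backend": "deepseek-cluster", "priority": 1}},
    {"fallback": {"route": {"backend": "default-backend", "priority": 10}}},
]


def resolve_backend(request: Dict[str, str], rules: List[Dict[str, Any]]) -> Optional[str]:
    candidates = []
    fallback = None
    for rule in rules:
        if "fallback" in rule:
            fallback = rule["fallback"]["route"]
            continue
        # A rule matches when every key under `match` equals the request's value
        if all(request.get(k) == v for k, v in rule["match"].items()):
            candidates.append(rule["route"])
    if candidates:
        # Assumption: a lower `priority` number wins
        return min(candidates, key=lambda r: r["priority"])["backend"]
    return fallback["backend"] if fallback else None


print(resolve_backend({"model": "gpt-4o", "path": "/v1/chat/completions"}, ROUTING_RULES))  # openai-production
print(resolve_backend({"model": "deepseek-v3", "path": "/v1/completions"}, ROUTING_RULES))  # deepseek-cluster
print(resolve_backend({"model": "gpt-4o", "path": "/v1/embeddings"}, ROUTING_RULES))  # default-backend
```

Note that the third request names `gpt-4o` but still falls through to `default-backend`, because the first rule also constrains `path`.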

Monitoring and Observability

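Log the details of every routing decision: the selected backend, the reason it was selected, and the response time. Analyzing these routing logs over time shows where the strategy needs tuning. Route-level SLA monitoring then verifies that overall service quality stays within its targets.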
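A minimal sketch of decision logging plus an SLA check follows. It assumes one JSON log line per decision, uses p95 latency (nearest-rank method) as the SLA metric, and sets a 500 ms threshold purely for illustration. The metric, threshold, and field names are assumptions, not part of the original design.

```python
import json
import logging
import math
import time
from collections import defaultdict
from dataclasses import asdict, dataclass
from typing import Dict, List

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("router")


@dataclass
class RoutingDecision:
    request_id: str
    backend: str
    reason: str        # e.g. "least_load", "fallback"
    latency_ms: float
    timestamp: float


latencies: Dict[str, List[float]] = defaultdict(list)


def record_decision(decision: RoutingDecision) -> None:
    # One structured JSON line per decision, easy to aggregate in a log pipeline
    logger.info(json.dumps(asdict(decision)))
    latencies[decision.backend].append(decision.latency_ms)


def p95(values: List[float]) -> float:
    # Nearest-rank 95th percentile
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def sla_breaches(threshold_ms: float) -> List[str]:
    # Backends whose p95 latency exceeds the SLA threshold
    return [b for b, vals in latencies.items() if vals and p95(vals) > threshold_ms]


for i, ms in enumerate([120, 180, 150, 900, 140]):
    record_decision(RoutingDecision(f"req-a{i}", "backend-a", "least_load", ms, time.time()))
for i, ms in enumerate([200, 210, 190]):
    record_decision(RoutingDecision(f"req-b{i}", "backend-b", "fallback", ms, time.time()))

print(sla_breaches(threshold_ms=500))  # ['backend-a']
```

With only five samples, a single 900 ms outlier already pushes `backend-a` over the threshold; in practice percentiles are computed over a sliding time window with enough samples to be meaningful.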