
LLM Failover


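---
title: "LLM Failover"
description: "A systematic walkthrough of failover for LLM services, covering health checks, automatic switchover, traffic routing, state synchronization, and failover best practices"
tags: ["failover", "high availability", "load balancing", "automatic switchover"]
category: "llm"
icon: "🧠"
---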


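Basic Concepts of Failover

Failover is the process by which a system automatically switches service to a standby node when the primary node fails. For an LLM inference service, failover must preserve service continuity while ensuring that requests are not lost and that state is synchronized correctly.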
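
As a minimal sketch of the idea, the function below tries a primary backend and replays the request on a standby when the primary raises. The names `call_with_failover`, `broken_primary`, and `healthy_standby` are hypothetical, and the sketch assumes the request is safe to replay; a real service would likely distinguish retryable errors from non-retryable ones.

```python
from typing import Callable


def call_with_failover(primary: Callable[[str], str],
                       standby: Callable[[str], str],
                       prompt: str) -> str:
    """Try the primary backend; on any exception, retry once on the standby."""
    try:
        return primary(prompt)
    except Exception:
        # Assumption: the request is idempotent and can be replayed elsewhere.
        return standby(prompt)


def broken_primary(prompt: str) -> str:
    # Stand-in for a node that has failed
    raise ConnectionError("primary is down")


def healthy_standby(prompt: str) -> str:
    # Stand-in for a working backup node
    return f"standby answered: {prompt}"


result = call_with_failover(broken_primary, healthy_standby, "hello")
```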

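Health Check Mechanism

Failover depends on accurately detecting each node's health. A health check usually covers several dimensions; the class below checks HTTP reachability, GPU status, GPU memory usage, and response latency: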

import time
import requests

class HealthChecker:
    def __init__(self, endpoints, timeout=5):
        self.endpoints = endpoints
        self.timeout = timeout
        self.failure_counts = {}
    
    def check(self, endpoint):
        """综合健康检查"""
        checks = {
            'http': self._check_http(endpoint),
            'gpu': self._check_gpu(endpoint),
            'memory': self._check_memory(endpoint),
            'latency': self._check_latency(endpoint),
        }
        return all(checks.values()), checks
    
    def _check_http(self, endpoint):
        try:
            resp = requests.get(f"{endpoint}/health", timeout=self.timeout)
            return resp.status_code == 200
        except requests.RequestException:
            return False
    
    def _check_gpu(self, endpoint):
        try:
            resp = requests.get(f"{endpoint}/gpu-status", timeout=self.timeout)
            return resp.json().get('healthy', False)
        except Exception:
            return False
    
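    def _check_latency(self, endpoint):
        # Use a monotonic clock so system clock adjustments cannot skew the measurement
        start = time.monotonic()
        try:
            requests.get(f"{endpoint}/health", timeout=self.timeout)
            # Healthy only if the round trip finishes in under 2 seconds
            return (time.monotonic() - start) < 2.0
        except Exception:
            return False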
    
    def _check_memory(self, endpoint):
        try:
            resp = requests.get(f"{endpoint}/gpu-memory", timeout=self.timeout)
            # A missing field defaults to 1.0 (fully used), so the check fails closed
            used_ratio = resp.json().get('used_ratio', 1.0)
            return used_ratio < 0.95
        except Exception:
            return False
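
The `failure_counts` dictionary in `HealthChecker` is initialized but never used in the code shown. One plausible use, sketched here as an assumption rather than the author's design, is to mark a node as down only after several consecutive failed checks, so that a single slow response does not trigger a switch. The class name `FailureTracker` and the threshold of 3 are hypothetical.

```python
class FailureTracker:
    """Treat an endpoint as down only after `threshold` consecutive failed checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failure_counts = {}

    def record(self, endpoint, healthy):
        """Record one check result; return True while the endpoint still counts as up."""
        if healthy:
            self.failure_counts[endpoint] = 0  # any success resets the streak
        else:
            self.failure_counts[endpoint] = self.failure_counts.get(endpoint, 0) + 1
        return self.failure_counts[endpoint] < self.threshold


tracker = FailureTracker(threshold=3)
outcomes = (False, False, True, False, False, False)
results = [tracker.record("http://node-a:8000", ok) for ok in outcomes]
```

In this run the endpoint stays up through the first two failures because the success in between resets the streak, and it is reported down only on the third consecutive failure at the end.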

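Failover Strategies

Primary-Standby Switchover

The simplest failover mode: when the primary node fails, traffic switches to the standby node.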

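class PrimaryStandbyFailover:
    # Assumption: each node is a client object exposing `.url` (the base URL
    # used for health checks) and `.generate(request)`.
    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.active = primary
        self.health_checker = HealthChecker([primary.url, standby.url])

    def route_request(self, request):
        healthy, details = self.health_checker.check(self.active.url)
        if not healthy:
            self._perform_switch()
        return self.active.generate(request)

    def _perform_switch(self):
        # Switch only if the other node passes its own health check
        candidate = self.standby if self.active is self.primary else self.primary
        healthy, _ = self.health_checker.check(candidate.url)
        if not healthy:
            raise RuntimeError("no healthy node available for failover")
        self.active = candidate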

Round-Robin Failover

Multiple nodes serve requests in turn, and failed nodes are skipped.
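
A minimal sketch of round-robin selection that skips unhealthy nodes might look like the following. The class name `RoundRobinFailover` and the `is_healthy` callback are assumptions made for illustration; in practice the callback could wrap `HealthChecker.check`.

```python
class RoundRobinFailover:
    """Rotate through nodes in order, skipping any that fail the health callback."""

    def __init__(self, nodes, is_healthy):
        self.nodes = list(nodes)
        self.is_healthy = is_healthy  # callable: node -> bool
        self._next = 0                # index of the next node to try

    def select_node(self):
        # Try each node at most once per call, starting from the rotation point
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if self.is_healthy(node):
                return node
        raise RuntimeError("no healthy node available")


down = {"node-b"}  # simulated set of failed nodes
rr = RoundRobinFailover(["node-a", "node-b", "node-c"], lambda n: n not in down)
picks = [rr.select_node() for _ in range(4)]
```

With `node-b` marked as down, the four selections alternate between `node-a` and `node-c`.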

Weight-Based Failover

Weights are assigned dynamically according to node performance, so poorly performing nodes receive less traffic.

import random

class WeightedFailover:
    def __init__(self, nodes):
        self.nodes = nodes
        self.weights = {node: 1.0 for node in nodes}
    
    def update_weights(self, metrics):
        for node, metric in metrics.items():
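            # Adjust the weight from latency and error rate: the latency score
            # reaches 0 at an avg_latency of 5.0, the error score at a 10% error rate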
            latency_score = max(0, 1 - metric['avg_latency'] / 5.0)
            error_score = max(0, 1 - metric['error_rate'] * 10)
            self.weights[node] = (latency_score + error_score) / 2
    
    def select_node(self):
        total = sum(self.weights.values())
        r = random.random() * total
        cumulative = 0
        for node, weight in self.weights.items():
            cumulative += weight
            if r <= cumulative:
                return node
        return list(self.weights.keys())[-1]
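
To make the scoring in `update_weights` concrete, the sketch below applies the same formula to three hypothetical nodes. The standalone function `node_weight` is introduced here only for illustration, and the sketch assumes `avg_latency` is measured in seconds, which the original code does not state.

```python
def node_weight(avg_latency, error_rate):
    """Same formula as update_weights: the mean of a latency score and an error score."""
    latency_score = max(0, 1 - avg_latency / 5.0)  # 0 once latency reaches 5.0 (assumed seconds)
    error_score = max(0, 1 - error_rate * 10)      # 0 once the error rate reaches 10%
    return (latency_score + error_score) / 2


fast = node_weight(avg_latency=1.0, error_rate=0.02)  # both scores are about 0.8
slow = node_weight(avg_latency=4.0, error_rate=0.08)  # both scores are about 0.2
dead = node_weight(avg_latency=6.0, error_rate=0.5)   # both scores clamp to 0

# Share of traffic the fast node would receive under weighted random selection
fast_share = fast / (fast + slow + dead)
```

Under these inputs the fast node carries roughly four times the weight of the slow one, and the node whose scores clamp to zero receives essentially no traffic.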

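State Synchronization

LLM services are usually stateless, but some scenarios, such as multi-turn conversations and streaming output, require state to be synchronized across nodes.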
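
One common way to handle the multi-turn case is to keep conversation history in a store shared by all nodes, so the standby can rebuild the context after a switch. The sketch below is an assumption-laden illustration: `SessionStore` is an in-memory stand-in for a shared store such as a cache or database, and `EchoNode` and `chat` are hypothetical names. It does not cover resuming a stream that was interrupted mid-output.

```python
class SessionStore:
    """In-memory stand-in for a store shared by all nodes."""

    def __init__(self):
        self._history = {}

    def append(self, session_id, role, content):
        self._history.setdefault(session_id, []).append(
            {"role": role, "content": content}
        )

    def history(self, session_id):
        # Return a copy so callers cannot mutate the stored history
        return list(self._history.get(session_id, []))


class EchoNode:
    """Hypothetical inference node that keeps no per-session state of its own."""

    def __init__(self, name):
        self.name = name

    def generate(self, messages):
        return f"{self.name} saw {len(messages)} messages"


def chat(node, store, session_id, user_text):
    """Persist the user turn, generate from the full history, persist the reply."""
    store.append(session_id, "user", user_text)
    reply = node.generate(store.history(session_id))
    store.append(session_id, "assistant", reply)
    return reply


store = SessionStore()
first = chat(EchoNode("primary"), store, "s1", "hi")
# Simulated failover: the next turn of the same session lands on the standby
second = chat(EchoNode("standby"), store, "s1", "still there?")
```

Because the history lives outside the nodes, the standby sees the earlier user turn and the primary's reply in addition to the new message.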

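Considerations for Traffic Switching

Keep the following in mind when performing a failover:

  1. Graceful switchover: wait for in-flight requests to finish before switching, so that inference already in progress is not interrupted
  2. Failback: once the original primary node recovers, evaluate whether to switch back to it
  3. Gradual switchover: route a small share of traffic to the new node first, then switch over completely once it is verified to work
  4. Alerting: every failover should notify the operations team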
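
The first item, graceful switchover, can be sketched by counting in-flight requests per node and waiting for the old node's count to reach zero. This is a minimal illustration under stated assumptions: `DrainingSwitch` and its methods are hypothetical names, nodes are represented by plain strings, and failback, gradual rollout, and alerting are left out.

```python
import threading


class DrainingSwitch:
    """Route new requests to the active node and let the old node drain on a switch."""

    def __init__(self, active):
        self.active = active
        self._in_flight = {}                 # node -> number of unfinished requests
        self._cond = threading.Condition()

    def begin_request(self):
        """Register a request and return the node that should serve it."""
        with self._cond:
            node = self.active
            self._in_flight[node] = self._in_flight.get(node, 0) + 1
            return node

    def end_request(self, node):
        """Mark a request on `node` as finished and wake any waiting switch."""
        with self._cond:
            self._in_flight[node] -= 1
            self._cond.notify_all()

    def switch_to(self, new_node, drain_timeout=30.0):
        """Send new traffic to `new_node`, then wait for the old node to drain.

        Returns True if the old node drained before the timeout expired.
        """
        with self._cond:
            old = self.active
            self.active = new_node
            return self._cond.wait_for(
                lambda: self._in_flight.get(old, 0) == 0, timeout=drain_timeout
            )


switch = DrainingSwitch("primary")
node = switch.begin_request()  # a request starts on the primary
# Simulate that request finishing 50 ms later on another thread
timer = threading.Timer(0.05, switch.end_request, args=[node])
timer.start()
drained = switch.switch_to("standby", drain_timeout=2.0)
timer.join()
```

Tracking counts per node matters here: requests that start on the new node during the drain must not keep the switch waiting.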

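Failover Testing

Run fault-injection tests regularly to verify that the failover mechanism works. Chaos engineering tools can simulate node failures so you can observe whether the system completes the switch automatically and keeps the service available. Test thoroughly in a non-production environment before deploying to production.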
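
A fault-injection test can start very small. The sketch below uses a hypothetical test double, `FlakyNode`, whose failure can be toggled, and a simplified `Router` that tries nodes in order; both are assumptions for illustration and stand in for a real deployment and a real chaos engineering tool.

```python
class FlakyNode:
    """Hypothetical test double whose failure can be toggled on demand."""

    def __init__(self, name):
        self.name = name
        self.down = False

    def generate(self, prompt):
        if self.down:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}: {prompt}"


class Router:
    """Try nodes in order and return the first successful response."""

    def __init__(self, nodes):
        self.nodes = nodes

    def generate(self, prompt):
        last_error = None
        for node in self.nodes:
            try:
                return node.generate(prompt)
            except ConnectionError as exc:
                last_error = exc  # remember the failure and try the next node
        raise RuntimeError("all nodes failed") from last_error


def run_fault_injection():
    primary, standby = FlakyNode("primary"), FlakyNode("standby")
    router = Router([primary, standby])
    before = router.generate("ping")
    primary.down = True    # inject the fault
    during = router.generate("ping")
    primary.down = False   # let the primary recover
    after = router.generate("ping")
    return before, during, after


before, during, after = run_fault_injection()
```

The three responses show the service being answered by the primary, then by the standby while the fault is active, and by the primary again after recovery.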