← 返回首页
🧠

模型缓存

📂 llm ⏱ 2 min 349 words

--- title: "模型缓存" description: "详解LLM模型缓存技术,优化模型加载和推理性能" tags: ["模型缓存", "推理优化", "内存管理"] category: "llm" icon: "🧠"

模型缓存

模型缓存是将LLM模型权重和推理状态保存在高速存储中,避免重复加载和初始化的技术。对于本地部署的LLM,模型缓存能显著减少首次推理延迟。

模型缓存的层次

模型缓存通常分为三个层次:GPU显存缓存(最快)、内存缓存(中等)、磁盘缓存(最慢但容量大)。合理的多层缓存策略能在性能和成本之间取得平衡。

GPU显存缓存

将频繁使用的模型保持在GPU显存中:

class GPUCache:
    def __init__(self, max_gpu_memory=8):
        self.max_memory = max_gpu_memory
        self.loaded_models = {}
        self.access_order = []

    def load_model(self, model_name):
        if model_name in self.loaded_models:
            self.access_order.remove(model_name)
            self.access_order.append(model_name)
            return self.loaded_models[model_name]

        model_size = self.estimate_size(model_name)
        self.evict_until_fit(model_size)

        model = self.load_from_disk(model_name)
        self.loaded_models[model_name] = model
        self.access_order.append(model_name)
        return model

    def evict_until_fit(self, needed):
        while self.get_used_memory() + needed > self.max_memory:
            oldest = self.access_order.pop(0)
            del self.loaded_models[oldest]

GPU显存管理采用LRU策略,优先淘汰最近最少使用的模型。

模型权重缓存

缓存模型权重到高速存储:

class WeightCache:
    def __init__(self, cache_dir="/tmp/model_cache"):
        self.cache_dir = cache_dir
        self.index = {}

    async def get_weights(self, model_id, layer=None):
        cache_path = os.path.join(self.cache_dir, model_id, f"{layer}.pt")

        if os.path.exists(cache_path):
            return torch.load(cache_path)

        weights = await self.download_weights(model_id, layer)
        os.makedirs(os.path.dirname(cache_path), exist_ok=True)
        torch.save(weights, cache_path)
        return weights

权重缓存避免重复下载和计算,加速模型加载。

KV Cache优化

优化Transformer的KV Cache提升推理性能:

class KVCacheOptimizer:
    def __init__(self, max_cache_size=4096):
        self.max_size = max_cache_size

    def optimize_cache(self, kv_cache, attention_mask):
        if kv_cache[0].shape[2] > self.max_size:
            head_size = kv_cache[0].shape[2] // 4
            kv_cache = self.compress_cache(kv_cache, head_size)
        return kv_cache

    def compress_cache(self, kv_cache, target_size):
        compressed = []
        for layer_cache in kv_cache:
            compressed.append(layer_cache[:, :, -target_size:, :])
        return tuple(compressed)

KV Cache压缩减少显存占用,允许更大的批次处理。

预热缓存

系统启动时预加载常用模型:

class ModelWarmer:
    def __init__(self, model_registry):
        self.registry = model_registry

    async def warmup(self):
        popular_models = await self.registry.get_popular(limit=5)
        tasks = [self.warm_model(m) for m in popular_models]
        await asyncio.gather(*tasks)

    async def warm_model(self, model_info):
        model = await self.load_model(model_info["id"])
        await self.prefill_cache(model, model_info["common_prompts"])

预热确保热门模型在首次请求时就已就绪。

缓存淘汰策略

根据使用模式选择合适的淘汰策略:

class CacheEviction:
    def __init__(self):
        self.strategies = {
            "lru": self.lru_evict,
            "lfu": self.lfu_evict,
            "size": self.size_evict,
        }

    def lru_evict(self, cache, target_size):
        while self.get_size(cache) > target_size:
            oldest_key = min(cache.keys(), key=lambda k: cache[k]["last_access"])
            del cache[oldest_key]

    def lfu_evict(self, cache, target_size):
        while self.get_size(cache) > target_size:
            least_freq = min(cache.keys(), key=lambda k: cache[k]["access_count"])
            del cache[least_freq]

    def size_evict(self, cache, target_size):
        while self.get_size(cache) > target_size:
            largest_key = max(cache.keys(), key=lambda k: cache[k]["size"])
            del cache[largest_key]

LRU适合时序局部性场景,LFU适合访问频率差异大的场景。

分布式模型缓存

多节点共享模型缓存:

class DistributedModelCache:
    def __init__(self, storage_backend):
        self.storage = storage_backend
        self.local_cache = {}

    async def get_model(self, model_id, node_id):
        if model_id in self.local_cache:
            return self.local_cache[model_id]

        storage_key = f"models/{model_id}/{node_id}"
        model = await self.storage.get(storage_key)

        if model is None:
            model = await self.load_and_store(model_id, node_id)

        self.local_cache[model_id] = model
        return model

分布式缓存减少跨节点的模型传输开销。

缓存一致性管理

确保多副本缓存的一致性:

class CacheConsistency:
    def __init__(self):
        self.version_map = {}

    async def update_model(self, model_id, new_version):
        self.version_map[model_id] = new_version
        await self.broadcast_invalidation(model_id)

    async def check_consistency(self, model_id, local_version):
        remote_version = self.version_map.get(model_id)
        if remote_version != local_version:
            await self.sync_model(model_id)
            return False
        return True

总结

模型缓存是LLM推理性能优化的关键技术。GPU显存缓存、权重缓存、KV Cache优化、预热和分布式缓存的组合使用,能显著提升推理速度并降低资源消耗。