← 返回首页
🧠

LLM响应缓存

📂 llm ⏱ 3 min 477 words

--- title: "LLM响应缓存" description: "全面介绍LLM响应缓存策略,实现高效的内容复用和成本优化" tags: ["响应缓存", "内容复用", "成本优化"] category: "llm" icon: "🧠"

LLM响应缓存

响应缓存是将LLM生成的结果存储起来供后续请求直接使用的技术。与模型缓存不同,响应缓存关注的是输出内容的复用,能直接减少API调用次数和延迟。

响应缓存的价值

LLM API调用的成本和延迟是主要瓶颈。响应缓存通过存储历史响应,对重复或相似查询直接返回缓存结果。典型的FAQ场景下,缓存命中率可达60-80%,显著降低成本。

精确匹配缓存

最简单的响应缓存实现:

import hashlib
import json

class ExactResponseCache:
    def __init__(self, ttl=3600, max_size=10000):
        self.cache = {}
        self.ttl = ttl
        self.max_size = max_size

    def make_key(self, prompt, **kwargs):
        content = json.dumps({"prompt": prompt, **kwargs}, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, prompt, **kwargs):
        key = self.make_key(prompt, **kwargs)
        entry = self.cache.get(key)
        if entry and time.time() - entry["time"] < self.ttl:
            return entry["response"]
        return None

    def set(self, prompt, response, **kwargs):
        if len(self.cache) >= self.max_size:
            self.evict_oldest()
        key = self.make_key(prompt, **kwargs)
        self.cache[key] = {"response": response, "time": time.time()}

精确匹配简单可靠,但要求输入完全一致。

语义相似缓存

基于语义相似度的响应缓存:

from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticResponseCache:
    def __init__(self, threshold=0.90):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = threshold
        self.entries = []

    def get(self, query):
        query_emb = self.encoder.encode(query)
        best_match = None
        best_score = 0

        for entry in self.entries:
            score = np.dot(query_emb, entry["embedding"])
            if score > best_score and score >= self.threshold:
                best_score = score
                best_match = entry

        return best_match["response"] if best_match else None

    def set(self, query, response):
        embedding = self.encoder.encode(query)
        self.entries.append({
            "query": query,
            "response": response,
            "embedding": embedding
        })

语义缓存对同义表述也能命中,但需要维护向量索引。

上下文感知缓存

根据对话上下文进行缓存:

class ContextAwareCache:
    def __init__(self):
        self.cache = {}

    def make_key(self, messages, model):
        context_hash = hashlib.sha256(
            json.dumps(messages[-3:], sort_keys=True).encode()
        ).hexdigest()[:16]
        return f"{model}:{context_hash}"

    def get(self, messages, model):
        key = self.make_key(messages, model)
        return self.cache.get(key)

    def set(self, messages, model, response):
        key = self.make_key(messages, model)
        self.cache[key] = response

上下文感知缓存考虑对话历史,提升相关性。

缓存预热策略

提前填充高价值缓存:

class ResponseCacheWarmer:
    def __init__(self, cache, llm_client):
        self.cache = cache
        self.client = llm_client

    async def warm_by_category(self, categories):
        for category in categories:
            queries = await self.get_popular_queries(category)
            for query in queries:
                if not self.cache.get(query):
                    response = await self.client.generate(query)
                    self.cache.set(query, response)

    async def warm_by_user(self, user_id):
        history = await self.get_user_history(user_id)
        for interaction in history:
            response = await self.client.generate(interaction["prompt"])
            self.cache.set(interaction["prompt"], response)

预热确保热门内容在请求到达前就已缓存。

缓存失效机制

确保缓存内容的新鲜度:

class ResponseCacheInvalidation:
    def __init__(self, cache):
        self.cache = cache

    def invalidate_by_ttl(self, max_age=3600):
        now = time.time()
        expired = [k for k, v in self.cache.items() if now - v["time"] > max_age]
        for key in expired:
            del self.cache[key]

    def invalidate_by_pattern(self, pattern):
        import re
        to_remove = [k for k in self.cache.keys() if re.match(pattern, k)]
        for key in to_remove:
            del self.cache[key]

    def invalidate_by_version(self, current_version):
        to_remove = [k for k, v in self.cache.items() if v.get("version", 0) < current_version]
        for key in to_remove:
            del self.cache[key]

TTL适合时效性内容,版本号适合结构化数据。

缓存降级策略

缓存不可用时的备选方案:

class CacheFallback:
    def __init__(self, cache, llm_client):
        self.cache = cache
        self.client = llm_client
        self.degraded = False

    async def get_or_generate(self, prompt, **kwargs):
        if not self.degraded:
            cached = self.cache.get(prompt, **kwargs)
            if cached:
                return cached

        response = await self.client.generate(prompt, **kwargs)

        if not self.degraded:
            self.cache.set(prompt, response, **kwargs)

        return response

    def set_degraded(self, degraded):
        self.degraded = degraded

降级模式下跳过缓存,直接调用LLM API。

缓存分析与优化

分析缓存使用情况优化策略:

class CacheAnalyzer:
    def __init__(self, cache):
        self.cache = cache
        self.stats = {"hits": 0, "misses": 0, "total_size": 0}

    def analyze(self):
        hit_rate = self.stats["hits"] / max(1, self.stats["hits"] + self.stats["misses"])
        avg_entry_size = self.stats["total_size"] / max(1, len(self.cache))
        return {
            "hit_rate": hit_rate,
            "entry_count": len(self.cache),
            "avg_entry_size": avg_entry_size,
            "recommendation": self.get_recommendation(hit_rate)
        }

    def get_recommendation(self, hit_rate):
        if hit_rate < 0.3:
            return "考虑增加语义缓存或调整缓存策略"
        elif hit_rate < 0.6:
            return "缓存效果中等,可优化缓存键设计"
        else:
            return "缓存效果良好"

总结

响应缓存是降低LLM成本、提升响应速度的直接手段。精确匹配、语义缓存、上下文感知、预热和失效机制的组合使用,构建了完整的响应缓存体系。持续监控和优化是保持缓存效果的关键。