LLM响应缓存
--- title: "LLM响应缓存" description: "全面介绍LLM响应缓存策略,实现高效的内容复用和成本优化" tags: ["响应缓存", "内容复用", "成本优化"] category: "llm" icon: "🧠"
LLM响应缓存
响应缓存是将LLM生成的结果存储起来供后续请求直接使用的技术。与模型缓存不同,响应缓存关注的是输出内容的复用,能直接减少API调用次数和延迟。
响应缓存的价值
LLM API调用的成本和延迟是主要瓶颈。响应缓存通过存储历史响应,对重复或相似查询直接返回缓存结果。典型的FAQ场景下,缓存命中率可达60-80%,显著降低成本。
精确匹配缓存
最简单的响应缓存实现:
import hashlib
import json
class ExactResponseCache:
def __init__(self, ttl=3600, max_size=10000):
self.cache = {}
self.ttl = ttl
self.max_size = max_size
def make_key(self, prompt, **kwargs):
content = json.dumps({"prompt": prompt, **kwargs}, sort_keys=True)
return hashlib.sha256(content.encode()).hexdigest()
def get(self, prompt, **kwargs):
key = self.make_key(prompt, **kwargs)
entry = self.cache.get(key)
if entry and time.time() - entry["time"] < self.ttl:
return entry["response"]
return None
def set(self, prompt, response, **kwargs):
if len(self.cache) >= self.max_size:
self.evict_oldest()
key = self.make_key(prompt, **kwargs)
self.cache[key] = {"response": response, "time": time.time()}
精确匹配简单可靠,但要求输入完全一致。
语义相似缓存
基于语义相似度的响应缓存:
from sentence_transformers import SentenceTransformer
import numpy as np
class SemanticResponseCache:
def __init__(self, threshold=0.90):
self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
self.threshold = threshold
self.entries = []
def get(self, query):
query_emb = self.encoder.encode(query)
best_match = None
best_score = 0
for entry in self.entries:
score = np.dot(query_emb, entry["embedding"])
if score > best_score and score >= self.threshold:
best_score = score
best_match = entry
return best_match["response"] if best_match else None
def set(self, query, response):
embedding = self.encoder.encode(query)
self.entries.append({
"query": query,
"response": response,
"embedding": embedding
})
语义缓存对同义表述也能命中,但需要维护向量索引。
上下文感知缓存
根据对话上下文进行缓存:
class ContextAwareCache:
def __init__(self):
self.cache = {}
def make_key(self, messages, model):
context_hash = hashlib.sha256(
json.dumps(messages[-3:], sort_keys=True).encode()
).hexdigest()[:16]
return f"{model}:{context_hash}"
def get(self, messages, model):
key = self.make_key(messages, model)
return self.cache.get(key)
def set(self, messages, model, response):
key = self.make_key(messages, model)
self.cache[key] = response
上下文感知缓存考虑对话历史,提升相关性。
缓存预热策略
提前填充高价值缓存:
class ResponseCacheWarmer:
def __init__(self, cache, llm_client):
self.cache = cache
self.client = llm_client
async def warm_by_category(self, categories):
for category in categories:
queries = await self.get_popular_queries(category)
for query in queries:
if not self.cache.get(query):
response = await self.client.generate(query)
self.cache.set(query, response)
async def warm_by_user(self, user_id):
history = await self.get_user_history(user_id)
for interaction in history:
response = await self.client.generate(interaction["prompt"])
self.cache.set(interaction["prompt"], response)
预热确保热门内容在请求到达前就已缓存。
缓存失效机制
确保缓存内容的新鲜度:
class ResponseCacheInvalidation:
def __init__(self, cache):
self.cache = cache
def invalidate_by_ttl(self, max_age=3600):
now = time.time()
expired = [k for k, v in self.cache.items() if now - v["time"] > max_age]
for key in expired:
del self.cache[key]
def invalidate_by_pattern(self, pattern):
import re
to_remove = [k for k in self.cache.keys() if re.match(pattern, k)]
for key in to_remove:
del self.cache[key]
def invalidate_by_version(self, current_version):
to_remove = [k for k, v in self.cache.items() if v.get("version", 0) < current_version]
for key in to_remove:
del self.cache[key]
TTL适合时效性内容,版本号适合结构化数据。
缓存降级策略
缓存不可用时的备选方案:
class CacheFallback:
def __init__(self, cache, llm_client):
self.cache = cache
self.client = llm_client
self.degraded = False
async def get_or_generate(self, prompt, **kwargs):
if not self.degraded:
cached = self.cache.get(prompt, **kwargs)
if cached:
return cached
response = await self.client.generate(prompt, **kwargs)
if not self.degraded:
self.cache.set(prompt, response, **kwargs)
return response
def set_degraded(self, degraded):
self.degraded = degraded
降级模式下跳过缓存,直接调用LLM API。
缓存分析与优化
分析缓存使用情况优化策略:
class CacheAnalyzer:
def __init__(self, cache):
self.cache = cache
self.stats = {"hits": 0, "misses": 0, "total_size": 0}
def analyze(self):
hit_rate = self.stats["hits"] / max(1, self.stats["hits"] + self.stats["misses"])
avg_entry_size = self.stats["total_size"] / max(1, len(self.cache))
return {
"hit_rate": hit_rate,
"entry_count": len(self.cache),
"avg_entry_size": avg_entry_size,
"recommendation": self.get_recommendation(hit_rate)
}
def get_recommendation(self, hit_rate):
if hit_rate < 0.3:
return "考虑增加语义缓存或调整缓存策略"
elif hit_rate < 0.6:
return "缓存效果中等,可优化缓存键设计"
else:
return "缓存效果良好"
总结
响应缓存是降低LLM成本、提升响应速度的直接手段。精确匹配、语义缓存、上下文感知、预热和失效机制的组合使用,构建了完整的响应缓存体系。持续监控和优化是保持缓存效果的关键。