Semantic Caching for AI Agents: Why It Matters, When to Use It, and What It Actually Saves
You built an AI agent. It answers questions, helps users, and feels like magic. Then the bill arrives. Every question — even ones you have already answered a hundred times — costs the same: embed the query, search the knowledge base, send thousands of tokens to the LLM, wait for the response. Same work, same cost, same latency. Semantic caching changes that equation entirely.
This post breaks down the fundamentals — what tokens are, how inference pricing works, what embeddings actually do — and then shows you exactly when semantic caching makes sense, when it doesn't, and how to implement it with Redis Cloud and MongoDB Atlas.
First, Understand the Cost: Tokens, Prompts, and Inference
Before you can optimize cost, you need to understand what you are paying for. Every LLM API call is priced in tokens — and the bill has two sides.
What is a token?
A token is a piece of text — roughly ¾ of a word in English. The sentence 'What is semantic caching?' is about 6 tokens. The LLM does not see words; it sees tokens. Every token sent in and every token generated out has a price.
Input tokens vs output tokens
When your agent handles a question, the LLM receives a prompt — this is the input. The prompt includes everything the model needs to generate a good answer: the system instructions, the retrieved context from your knowledge base, the conversation history, and the user's actual question. All of this is measured in input tokens.
The model's response — the answer it generates — is measured in output tokens. For a typical RAG (Retrieval-Augmented Generation) agent, input tokens are 3-10x larger than output tokens because the prompt is stuffed with context.
| Component | Typical Size | Direction |
|---|---|---|
| System prompt (persona, rules) | 200–500 tokens | Input |
| Retrieved knowledge chunks (5 docs) | 2,000–5,000 tokens | Input |
| Conversation history (last 8 messages) | 300–800 tokens | Input |
| User's question | 10–50 tokens | Input |
| Total input | 3,000–6,000 tokens | Sent to LLM |
| LLM's response | 200–1,000 tokens | Output from LLM |
So a single question might use 5,000 input tokens and 500 output tokens. If 100 users ask variations of the same question, that is 500,000 input tokens and 50,000 output tokens — all generating the same answer.
What is inference?
Inference is the act of running a trained model to produce output. When you call an LLM API, you are paying for inference — the GPU compute time to process your tokens and generate a response. Inference is the single most expensive operation in the AI pipeline. Embedding a query costs fractions of a cent; generating a 500-token response can cost 1-10 cents depending on the model.
What is a prompt?
A prompt is everything you send to the LLM in a single request. It is not just the user's question — it is the full package: system instructions that define the agent's personality, context retrieved from your knowledge base, the conversation history for continuity, and finally the user's message. The prompt is what the model 'reads' before generating its answer. A well-constructed prompt is the difference between a useful agent and an expensive hallucination machine.
What Semantic Caching Actually Does
Traditional caching (Redis GET/SET with a string key) works on exact matches. The query 'What is semantic caching?' and 'Tell me about semantic caching' produce different cache keys — two misses, two LLM calls, double the cost. Semantic caching fixes this by comparing the meaning of questions, not their text.
How it works step by step
- Step 1 — Embed the query: Convert the user's question into a vector (an array of 1,024 floating-point numbers) using an embedding model like Amazon Titan Embed v2. This vector captures the meaning of the question.
- Step 2 — Search the cache: Use Redis vector search (FT.SEARCH with KNN) to find the most similar previously-asked question in the cache. This takes 2-5ms.
- Step 3 — Check the similarity score: If the cached query's similarity is above your threshold (e.g., 95%), the meaning is close enough. Return the cached answer immediately — no LLM call needed.
- Step 4 — On a miss: If nothing matches, call the LLM normally, get the response, and store both the query embedding and the response in the cache for future matches.
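The four steps above can be sketched in miniature. This is a self-contained toy — `embed` is a stand-in bag-of-words hasher, not a real embedding model like Titan Embed v2, and the linear scan stands in for Redis vector search — but the lookup/store logic mirrors the production flow:

```python
import hashlib
import math

def embed(text):
    # Toy stand-in for a real embedding model (e.g. Titan Embed v2):
    # hashes words into a 64-bucket bag-of-words vector. Real embeddings
    # capture meaning far better; this only illustrates the mechanics.
    vec = [0.0] * 64
    for word in text.lower().split():
        word = word.strip("?!.,")
        bucket = int.from_bytes(hashlib.md5(word.encode()).digest()[:2], "big") % 64
        vec[bucket] += 1.0
    return vec

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def lookup(self, query):
        # Steps 1-3: embed the query, find the nearest cached question,
        # and return its answer only if similarity clears the threshold.
        qvec = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine_similarity(qvec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query, response):
        # Step 4: on a miss, cache the embedding alongside the LLM's answer.
        self.entries.append((embed(query), response))
```

In production the linear scan is replaced by an HNSW index lookup, which is what keeps step 2 at a few milliseconds even with many thousands of cached entries.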
Understanding the Similarity Threshold
The threshold is the most important configuration in semantic caching. It determines how similar a new question must be to a cached question before the cached answer is returned instead of calling the LLM. Similarity is measured using cosine similarity — mathematically a score from -1 to 1, though embedding vectors in practice land between 0 (completely unrelated) and 1 (identical meaning).
| Similarity | What It Means | Example |
|---|---|---|
| 1.00 (100%) | Identical meaning | "What is semantic caching?" vs "What is semantic caching?" |
| 0.96 (96%) | Same question, rephrased | "What is semantic caching?" vs "Tell me about semantic caching" |
| 0.84 (84%) | Same topic, different angle | "What is semantic caching?" vs "What is caching semantics?" |
| 0.58 (58%) | Loosely related | "Azure cloud?" vs "What is Azure?" |
| 0.10 (10%) | Unrelated | "What is Redis?" vs "How to bake a cake?" |
At a threshold of 0.95, only near-identical questions hit the cache. This is safe but conservative — it prevents returning wrong answers but catches fewer rephrasings. At 0.85, you catch more rephrasings but risk returning a cached answer about 'semantic caching' when someone asked about 'caching semantics' (which is actually a different question). The right threshold depends on your domain. For customer support where accuracy matters, use 0.93-0.97. For internal tools where speed matters more, 0.85-0.90 works.
The Cost Math: Why This Matters
Let us do the math with real numbers. Consider an AI agent on a company's website answering questions about their product.
Without semantic caching
| Metric | Value |
|---|---|
| Queries per day | 200 |
| Average input tokens per query | 4,000 |
| Average output tokens per query | 500 |
| Input token cost (DeepSeek v3.2) | $0.0015 / 1K tokens |
| Output token cost (DeepSeek v3.2) | $0.0075 / 1K tokens |
| Embedding cost per query | ~$0.0001 |
| Cost per query | ~$0.01 |
| Daily cost | ~$2.00 |
| Monthly cost | ~$60.00 |
With semantic caching (60% hit rate)
In practice, 40-70% of questions on a product or documentation site are variations of the same 20-30 core questions. With a 60% cache hit rate, 120 of those 200 daily queries are served from Redis in under 100ms at zero LLM cost.
| Metric | Value |
|---|---|
| Cache hit queries (120/day) | $0.0001 each (embedding only) |
| Cache miss queries (80/day) | $0.01 each (full pipeline) |
| Daily cost | ~$0.81 |
| Monthly cost | ~$24.00 |
| Monthly savings | ~$36.00 (60%) |
| Latency on cache hit | ~100ms vs ~2-5 seconds |
The savings scale linearly. At 2,000 queries per day, you go from $600/month to $240/month. At 20,000 queries per day, you go from $6,000/month to $2,400/month. And the latency improvement — from 2-5 seconds to under 100 milliseconds — is often more valuable than the cost savings.
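The arithmetic is easy to reproduce. A quick sanity check using the per-1K-token prices from the tables above (the tables round the per-query cost to $0.01, so the exact figures land a couple of cents lower):

```python
INPUT_PRICE = 0.0015    # $ per 1K input tokens (from the table above)
OUTPUT_PRICE = 0.0075   # $ per 1K output tokens
EMBED_COST = 0.0001     # $ per query for embedding

def cost_per_miss(input_tokens=4000, output_tokens=500):
    # A miss runs the full pipeline: embed, then pay for inference both ways.
    return (EMBED_COST
            + input_tokens / 1000 * INPUT_PRICE
            + output_tokens / 1000 * OUTPUT_PRICE)

def daily_cost(queries_per_day=200, hit_rate=0.0):
    hits = queries_per_day * hit_rate
    misses = queries_per_day - hits
    # A hit still pays for its embedding, but skips the LLM entirely.
    return hits * EMBED_COST + misses * cost_per_miss()

without = daily_cost(hit_rate=0.0)     # ~1.97 -> the table's ~$2.00/day
with_cache = daily_cost(hit_rate=0.6)  # ~0.80 -> the table's ~$0.81/day
savings = 1 - with_cache / without     # ~0.59 -> roughly 60%
```

Plug in your own traffic and prices; the break-even question is simply whether your hit rate times your per-miss cost exceeds what you pay to run the cache.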
When Semantic Caching Makes Sense
Great use cases
- Customer support bots — 'How do I reset my password?' gets asked 500 times a day in 50 different phrasings. Cache the first answer, serve the rest instantly.
- Product documentation agents — The same 30 questions cover 80% of traffic. Semantic caching turns a $600/month agent into a $200/month agent.
- Marketing site chatbots — Visitors ask about pricing, features, and comparisons. Highly repetitive, perfectly cacheable.
- Internal knowledge bases — Employees asking the same HR, IT, or policy questions. Cache hits save time and money.
- FAQ-style interactions — Any scenario where users ask variations of common questions.
When NOT to use it
- Personalized queries — 'Show me MY order #12345' is unique every time. No two users share the same query, so nothing is cacheable.
- Real-time data — 'What is the current stock price?' changes every second. A cached answer would be dangerously wrong.
- Creative generation — 'Write me a unique poem about Redis' should produce different results each time. Caching defeats the purpose.
- Multi-turn reasoning — Complex conversations where each response depends on a unique chain of prior messages. The context is different every time.
- Low-volume applications — If you handle 10 queries a day, the infrastructure cost of semantic caching exceeds the savings.
How to Implement It: Redis Cloud + MongoDB Atlas
The implementation requires two data stores working together. Redis Cloud handles the hot path — storing embeddings and serving cache lookups in single-digit milliseconds. MongoDB Atlas handles the cold path — storing metadata, tracking what has been cached, and logging metrics for analysis.
Step 1: Create the Redis vector index
Redis Cloud with the RediSearch module supports vector similarity search natively. First, create an index on your cache keys. This tells Redis to build an HNSW (Hierarchical Navigable Small World) graph on the embedding field — enabling fast approximate nearest-neighbor search.
```
FT.CREATE idx:semantic_cache ON JSON PREFIX 1 cache: SCHEMA $.query_embedding AS query_embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1024 DISTANCE_METRIC COSINE
```

This creates an index that watches all JSON keys starting with `cache:` and indexes their `query_embedding` field as a 1024-dimensional vector using cosine distance.
Step 2: Store a response in the cache
After the LLM generates a response, store it in Redis with the query's embedding vector. The key is an MD5 hash of the query text — this prevents duplicate entries for the exact same question while letting the vector search handle semantic similarity.
```
JSON.SET cache:a1b2c3d4e5f6 $ '{"query": "What is semantic caching?", "query_embedding": [0.023, -0.051, ...1024 floats...], "response": "Semantic caching works by...", "model": "deepseek.v3.2", "created_at": 1712764800}'
EXPIRE cache:a1b2c3d4e5f6 86400
```

The EXPIRE sets a 24-hour TTL so the entry ages out automatically.
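The same write from Python with redis-py (the JSON path assumes the RedisJSON module, as on Redis Cloud). The key derivation is the part worth getting right; `store_response` itself just mirrors the commands above:

```python
import hashlib
import time

def cache_key(query):
    # MD5 of the normalized query text: the exact same question always maps
    # to one key, while vector search handles the merely-similar phrasings.
    # (The strip/lower normalization is an extra precaution, not required.)
    return "cache:" + hashlib.md5(query.strip().lower().encode("utf-8")).hexdigest()

def store_response(r, query, embedding, response, model, ttl_seconds=86400):
    # r is a redis.Redis client with RedisJSON available (e.g. Redis Cloud).
    key = cache_key(query)
    r.json().set(key, "$", {
        "query": query,
        "query_embedding": embedding,  # list of 1024 floats from the embedder
        "response": response,
        "model": model,
        "created_at": int(time.time()),
    })
    r.expire(key, ttl_seconds)  # 24h TTL: stale answers expire automatically
```
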
Step 3: Check the cache on new queries
When a new question arrives, embed it and search the cache index for the nearest match:

```
FT.SEARCH idx:semantic_cache '*=>[KNN 1 @query_embedding $vec AS distance]' PARAMS 2 vec <binary_vector> DIALECT 2
```

This finds the single closest cached query. Because the index uses the COSINE distance metric, the returned score is a distance, not a similarity: if (1 - distance) >= 0.95, return the cached response. If not, proceed to the full RAG pipeline.
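In Python with redis-py, the lookup might look like this sketch. The KNN score is aliased as `distance` because a COSINE index returns cosine distance; `is_hit` converts it back to a similarity before comparing against the threshold. Index and field names follow the FT.CREATE example above:

```python
import struct

def to_bytes(vec):
    # RediSearch expects the query vector as a packed float32 blob.
    return struct.pack(f"{len(vec)}f", *vec)

def is_hit(distance, threshold=0.95):
    # A COSINE index returns a distance; similarity = 1 - distance.
    return (1.0 - float(distance)) >= threshold

def cache_lookup(r, query_embedding, threshold=0.95):
    from redis.commands.search.query import Query  # redis-py >= 4.3

    q = (
        Query("*=>[KNN 1 @query_embedding $vec AS distance]")
        .sort_by("distance")
        .return_fields("response", "distance")
        .dialect(2)
    )
    res = r.ft("idx:semantic_cache").search(
        q, query_params={"vec": to_bytes(query_embedding)}
    )
    if res.docs and is_hit(res.docs[0].distance, threshold):
        return res.docs[0].response  # hit: skip the LLM entirely
    return None  # miss: run the full RAG pipeline, then store the result
```
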
Step 4: Log metrics in MongoDB
Every query — hit or miss — gets logged to MongoDB Atlas for analysis. Track the cache hit rate, average similarity scores, which queries are most common, and how much money the cache is saving. This data drives threshold tuning and helps you understand your users' question patterns.
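A minimal logging shape, assuming a pymongo collection named `cache_metrics` (the collection name and field names are illustrative). One document per query keeps the later aggregation trivial:

```python
import time

def build_metric_doc(query, cache_hit, similarity, latency_ms, model=None):
    # One document per query, hit or miss; aggregate later for hit rate,
    # similarity distribution, and dollar savings.
    return {
        "query": query,
        "cache_hit": cache_hit,
        "similarity": similarity,  # None when nothing similar was found
        "latency_ms": latency_ms,
        "model": model,            # None on a hit (no LLM was called)
        "ts": time.time(),
    }

# With pymongo:
#   db.cache_metrics.insert_one(build_metric_doc("What is caching?", True, 0.97, 8))
# Overall hit rate, via the aggregation pipeline:
HIT_RATE_PIPELINE = [
    {"$group": {"_id": None,
                "hits": {"$sum": {"$cond": ["$cache_hit", 1, 0]}},
                "total": {"$sum": 1}}},
]
```
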
Architecture: Where Each Piece Fits
| Component | Role | Why This Tool |
|---|---|---|
| Redis Cloud | Cache storage + vector similarity search | Sub-millisecond reads, native vector search with RediSearch, built-in TTL for automatic expiry |
| MongoDB Atlas | Metrics logging, cache analytics, knowledge base metadata | Durable storage, flexible queries, aggregation pipeline for hit-rate analysis |
| AWS Bedrock (Titan Embed v2) | Query embedding (text → vector) | 1024 dimensions, fast inference, pay-per-use, no GPU management |
| AWS Bedrock (DeepSeek/Claude) | LLM response generation | Streaming support, multiple model choices, native SDK integration |
The key architectural insight is separation of concerns. Redis handles everything that needs to be fast (cache lookups, vector search, session state). MongoDB handles everything that needs to be durable and queryable (metrics, metadata, audit logs). The embedding model is stateless — call it, get a vector, move on.
Production Considerations
Cache invalidation
Use TTLs aggressively. A 24-hour TTL means stale answers expire automatically. When you update your knowledge base (new blog posts, updated docs), flush the cache and let it rebuild organically. The cost of a few extra LLM calls is far less than serving outdated information.
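Flushing just the cache keys — and not, say, session state living in the same Redis — comes down to a prefix scan. A sketch, assuming the `cache:` key prefix from earlier:

```python
def flush_semantic_cache(r, prefix="cache:"):
    # Delete only the semantic-cache keys. SCAN iterates incrementally
    # instead of blocking Redis the way KEYS would on a large keyspace.
    deleted = 0
    for key in r.scan_iter(match=prefix + "*", count=500):
        r.delete(key)
        deleted += 1
    return deleted
```
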
Threshold tuning
Start strict (0.95) and loosen gradually based on data. Log every cache hit with its similarity score. After a week, review the hits between 0.90 and 0.95 — if they are returning correct answers, lower the threshold. If they are returning wrong answers, keep it tight. This is an empirical process, not a theoretical one.
Embedding model consistency
The cache only works if you use the same embedding model for storing and retrieving. If you switch from Titan Embed v2 to a different model, the vectors are incompatible — flush the entire cache and rebuild. This is why the embedding model choice should be stable and deliberate.
Multi-tenant isolation
If your agent serves multiple customers, each with different knowledge bases, never share cache across tenants. Prefix cache keys with the tenant ID (cache:tenant42:a1b2c3) and create separate indexes or use TAG filters in your FT.SEARCH query. A cached answer from Company A's knowledge base is wrong for Company B.
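Both isolation options reduce to small string conventions. A sketch, assuming a `tenant` TAG field has been added to the index schema (that field is not in the FT.CREATE example above):

```python
def tenant_cache_key(tenant_id, query_hash):
    # Key-level isolation: e.g. cache:tenant42:a1b2c3
    return f"cache:{tenant_id}:{query_hash}"

def tenant_knn_query(tenant_id, k=1):
    # Query-level isolation: a TAG pre-filter restricts the KNN search
    # to one tenant's entries before the nearest neighbor is chosen.
    return f"(@tenant:{{{tenant_id}}})=>[KNN {k} @query_embedding $vec AS distance]"
```
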
Measuring Success
Track these metrics to know if your semantic cache is working:
| Metric | Target | What It Tells You |
|---|---|---|
| Cache hit rate | 40–70% | Percentage of queries served from cache. Below 30% means your traffic is too unique for caching. |
| Average similarity on hits | > 0.96 | If hits average 0.92, your threshold might be too low — you could be serving wrong answers. |
| P95 cache lookup latency | < 10ms | Cache lookups should be near-instant. If slow, check your Redis index or network. |
| Monthly cost reduction | 40–70% | Compare monthly LLM spend before and after. This is the bottom line. |
| False positive rate | < 2% | Manually review a sample of cache hits. If wrong answers are being served, raise the threshold. |
The Bottom Line
Semantic caching is not a silver bullet — it is a targeted optimization for repetitive workloads. If your AI agent handles the same 50 questions in 500 different phrasings, semantic caching can cut your LLM costs by 60% and your response latency by 95%. If every query is unique, caching adds complexity without benefit.
The implementation is straightforward: embed queries, search with Redis vector search, return cached answers when similarity is high enough, log everything to MongoDB for analysis. The hard part is tuning the threshold — and the only way to tune it is to ship it, measure it, and iterate.
The best cache is one where users never notice it exists — their questions just get answered faster and cheaper. That is what semantic caching delivers when implemented correctly.
At Polystreak, we build this exact stack for production AI agents. Our live agent demo at polystreak.com/agent uses Redis Cloud for semantic caching, MongoDB Atlas for metrics, and AWS Bedrock for inference — and you can watch the cache in action through the X-Ray panel. Try asking the same question twice and see latency drop from 5 seconds to 100 milliseconds.