AI Agents · Semantic Caching · LLM · Cost Optimization · Redis Cloud · MongoDB Atlas · Embeddings · AI Infrastructure · Vector Search · Production AI

Semantic Caching for AI Agents: Why It Matters, When to Use It, and What It Actually Saves

Polystreak Team · 2026-04-10 · 12 min read

You built an AI agent. It answers questions, helps users, and feels like magic. Then the bill arrives. Every question — even ones you have already answered a hundred times — costs the same: embed the query, search the knowledge base, send thousands of tokens to the LLM, wait for the response. Same work, same cost, same latency. Semantic caching changes that equation entirely.

This post breaks down the fundamentals — what tokens are, how inference pricing works, what embeddings actually do — and then shows you exactly when semantic caching makes sense, when it doesn't, and how to implement it with Redis Cloud and MongoDB Atlas.

First, Understand the Cost: Tokens, Prompts, and Inference

Before you can optimize cost, you need to understand what you are paying for. Every LLM API call is priced in tokens — and the bill has two sides.

What is a token?

A token is a piece of text — roughly ¾ of a word in English. The sentence 'What is semantic caching?' is about 6 tokens. The LLM does not see words; it sees tokens. Every token sent in and every token generated out has a price.
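As a quick intuition check, the ¾-of-a-word rule works out to roughly four characters per token in English. A rough stdlib estimator (a real count requires the model's own tokenizer, so treat this as a ballpark only):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of
    thumb for English text. Only a ballpark; real counts come from the
    model's own tokenizer."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("What is semantic caching?"))  # about 6, as above
```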

Input tokens vs output tokens

When your agent handles a question, the LLM receives a prompt — this is the input. The prompt includes everything the model needs to generate a good answer: the system instructions, the retrieved context from your knowledge base, the conversation history, and the user's actual question. All of this is measured in input tokens.

The model's response — the answer it generates — is measured in output tokens. For a typical RAG (Retrieval-Augmented Generation) agent, input tokens are 3-10x larger than output tokens because the prompt is stuffed with context.

| Component | Typical Size | Direction |
| --- | --- | --- |
| System prompt (persona, rules) | 200–500 tokens | Input |
| Retrieved knowledge chunks (5 docs) | 2,000–5,000 tokens | Input |
| Conversation history (last 8 messages) | 300–800 tokens | Input |
| User's question | 10–50 tokens | Input |
| Total input | 3,000–6,000 tokens | Sent to LLM |
| LLM's response | 200–1,000 tokens | Output from LLM |

So a single question might use 5,000 input tokens and 500 output tokens. If 100 users ask variations of the same question, that is 500,000 input tokens and 50,000 output tokens — all generating the same answer.

What is inference?

Inference is the act of running a trained model to produce output. When you call an LLM API, you are paying for inference — the GPU compute time to process your tokens and generate a response. Inference is the single most expensive operation in the AI pipeline. Embedding a query costs fractions of a cent; generating a 500-token response can cost 1-10 cents depending on the model.

What is a prompt?

A prompt is everything you send to the LLM in a single request. It is not just the user's question — it is the full package: system instructions that define the agent's personality, context retrieved from your knowledge base, the conversation history for continuity, and finally the user's message. The prompt is what the model 'reads' before generating its answer. A well-constructed prompt is the difference between a useful agent and an expensive hallucination machine.

What Semantic Caching Actually Does

Traditional caching (Redis GET/SET with a string key) works on exact matches. The query 'What is semantic caching?' and 'Tell me about semantic caching' produce different cache keys — two misses, two LLM calls, double the cost. Semantic caching fixes this by comparing the meaning of questions, not their text.

How it works step by step

  • Step 1 — Embed the query: Convert the user's question into a vector (an array of 1,024 floating-point numbers) using an embedding model like Amazon Titan Embed v2. This vector captures the meaning of the question.
  • Step 2 — Search the cache: Use Redis vector search (FT.SEARCH with KNN) to find the most similar previously-asked question in the cache. This takes 2-5ms.
  • Step 3 — Check the similarity score: If the cached query's similarity is above your threshold (e.g., 95%), the meaning is close enough. Return the cached answer immediately — no LLM call needed.
  • Step 4 — On a miss: If nothing matches, call the LLM normally, get the response, and store both the query embedding and the response in the cache for future matches.
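The loop above can be sketched as a minimal in-memory toy. This is an illustration, not the Redis implementation described later: `embed_fn` stands in for a real embedding model, and there is no LLM call, TTL, or persistence.

```python
import hashlib
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: 1.0 = identical meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    def __init__(self, embed_fn, threshold=0.95):
        self.embed = embed_fn      # text -> vector (stand-in for a real model)
        self.threshold = threshold
        self.entries = {}          # md5(query) -> (vector, response)

    def lookup(self, query):
        """Steps 1-3: embed the query, find the nearest cached question,
        and return its answer if similarity clears the threshold."""
        vec = self.embed(query)
        best_response, best_sim = None, 0.0
        for stored_vec, response in self.entries.values():
            sim = cosine_similarity(vec, stored_vec)
            if sim > best_sim:
                best_response, best_sim = response, sim
        return best_response if best_sim >= self.threshold else None

    def store(self, query, response):
        """Step 4: on a miss, keep the embedding and answer for next time."""
        key = hashlib.md5(query.encode("utf-8")).hexdigest()
        self.entries[key] = (self.embed(query), response)
```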

Understanding the Similarity Threshold

The threshold is the most important configuration in semantic caching. It determines how similar a new question must be to a cached question before returning the cached answer instead of calling the LLM. This is measured using cosine similarity — a score from 0 (completely unrelated) to 1 (identical meaning).

| Similarity | What It Means | Example |
| --- | --- | --- |
| 1.00 (100%) | Identical meaning | "What is semantic caching?" vs "What is semantic caching?" |
| 0.96 (96%) | Same question, rephrased | "What is semantic caching?" vs "Tell me about semantic caching" |
| 0.84 (84%) | Same topic, different angle | "What is semantic caching?" vs "What is caching semantics?" |
| 0.58 (58%) | Loosely related | "What is semantic caching?" vs "How does Redis store data?" |
| 0.10 (10%) | Unrelated | "What is Redis?" vs "How to bake a cake?" |

At a threshold of 0.95, only near-identical questions hit the cache. This is safe but conservative — it prevents returning wrong answers but catches fewer rephrasings. At 0.85, you catch more rephrasings but risk returning a cached answer about 'semantic caching' when someone asked about 'caching semantics' (which is actually a different question). The right threshold depends on your domain. For customer support where accuracy matters, use 0.93-0.97. For internal tools where speed matters more, 0.85-0.90 works.

The Cost Math: Why This Matters

Let us do the math with real numbers. Consider an AI agent on a company's website answering questions about their product.

Without semantic caching

| Metric | Value |
| --- | --- |
| Queries per day | 200 |
| Average input tokens per query | 4,000 |
| Average output tokens per query | 500 |
| Input token cost (DeepSeek v3.2) | $0.0015 / 1K tokens |
| Output token cost (DeepSeek v3.2) | $0.0075 / 1K tokens |
| Embedding cost per query | ~$0.0001 |
| Cost per query | ~$0.01 |
| Daily cost | ~$2.00 |
| Monthly cost | ~$60.00 |

With semantic caching (60% hit rate)

In practice, 40-70% of questions on a product or documentation site are variations of the same 20-30 core questions. With a 60% cache hit rate, 120 of those 200 daily queries are served from Redis in under 100ms at zero LLM cost.

| Metric | Value |
| --- | --- |
| Cache hit queries (120/day) | $0.0001 each (embedding only) |
| Cache miss queries (80/day) | $0.01 each (full pipeline) |
| Daily cost | ~$0.81 |
| Monthly cost | ~$24.00 |
| Monthly savings | ~$36.00 (60%) |
| Latency on cache hit | ~100ms vs ~2–5 seconds |

The savings scale linearly. At 2,000 queries per day, you go from $600/month to $240/month. At 20,000 queries per day, you go from $6,000/month to $2,400/month. And the latency improvement — from 2-5 seconds to under 100 milliseconds — is often more valuable than the cost savings.
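The two cost tables reduce to a few lines of arithmetic. A sketch using the per-1K-token rates from the tables above:

```python
# Rates from the cost tables above (DeepSeek v3.2, USD per 1K tokens).
INPUT_RATE, OUTPUT_RATE, EMBED_COST = 0.0015, 0.0075, 0.0001

def query_cost(input_tokens=4000, output_tokens=500):
    """Full-pipeline cost of one miss: embedding + input + output tokens."""
    return (EMBED_COST
            + input_tokens / 1000 * INPUT_RATE
            + output_tokens / 1000 * OUTPUT_RATE)

def monthly_cost(queries_per_day, hit_rate=0.0, days=30):
    """Hits pay only the embedding; misses pay the full pipeline."""
    hits = queries_per_day * hit_rate
    misses = queries_per_day - hits
    return (hits * EMBED_COST + misses * query_cost()) * days

print(round(monthly_cost(200), 2))                # about $59, the ~$60 above
print(round(monthly_cost(200, hit_rate=0.6), 2))  # about $24 at a 60% hit rate
```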

When Semantic Caching Makes Sense

Great use cases

  • Customer support bots — 'How do I reset my password?' gets asked 500 times a day in 50 different phrasings. Cache the first answer, serve the rest instantly.
  • Product documentation agents — The same 30 questions cover 80% of traffic. Semantic caching turns a $600/month agent into a $200/month agent.
  • Marketing site chatbots — Visitors ask about pricing, features, and comparisons. Highly repetitive, perfectly cacheable.
  • Internal knowledge bases — Employees asking the same HR, IT, or policy questions. Cache hits save time and money.
  • FAQ-style interactions — Any scenario where users ask variations of common questions.

When NOT to use it

  • Personalized queries — 'Show me MY order #12345' is unique every time. No two users share the same query, so nothing is cacheable.
  • Real-time data — 'What is the current stock price?' changes every second. A cached answer would be dangerously wrong.
  • Creative generation — 'Write me a unique poem about Redis' should produce different results each time. Caching defeats the purpose.
  • Multi-turn reasoning — Complex conversations where each response depends on a unique chain of prior messages. The context is different every time.
  • Low-volume applications — If you handle 10 queries a day, the infrastructure cost of semantic caching exceeds the savings.

How to Implement It: Redis Cloud + MongoDB Atlas

The implementation requires two data stores working together. Redis Cloud handles the hot path — storing embeddings and serving cache lookups in single-digit milliseconds. MongoDB Atlas handles the cold path — storing metadata, tracking what has been cached, and logging metrics for analysis.

Step 1: Create the Redis vector index

Redis Cloud with the RediSearch module supports vector similarity search natively. First, create an index on your cache keys. This tells Redis to build an HNSW (Hierarchical Navigable Small World) graph on the embedding field — enabling fast approximate nearest-neighbor search.

FT.CREATE idx:semantic_cache ON JSON PREFIX 1 cache: SCHEMA $.query_embedding AS query_embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1024 DISTANCE_METRIC COSINE — This creates an index that watches all JSON keys starting with 'cache:' and indexes their query_embedding field as a 1024-dimensional vector using cosine distance.
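If you create the index from application code rather than redis-cli, the same command can be assembled as an argument list. A minimal stdlib sketch; passing the result to `r.execute_command(*args)` with redis-py is one option, but any client that accepts raw commands works:

```python
def create_cache_index_args(index="idx:semantic_cache", prefix="cache:", dim=1024):
    """Build the FT.CREATE command shown above as an argument list.
    The '6' after HNSW is the count of attribute name/value pairs that follow."""
    return [
        "FT.CREATE", index, "ON", "JSON",
        "PREFIX", "1", prefix,
        "SCHEMA", "$.query_embedding", "AS", "query_embedding",
        "VECTOR", "HNSW", "6",
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]
```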

Step 2: Store a response in the cache

After the LLM generates a response, store it in Redis with the query's embedding vector. The key is an MD5 hash of the query text — this prevents duplicate entries for the exact same question while letting the vector search handle semantic similarity.

JSON.SET cache:a1b2c3d4e5f6 $ '{"query": "What is semantic caching?", "query_embedding": [0.023, -0.051, ...1024 floats...], "response": "Semantic caching works by...", "model": "deepseek.v3.2", "created_at": 1712764800}' — Then set a TTL: EXPIRE cache:a1b2c3d4e5f6 86400 (24 hours).
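A sketch of that write path in Python, using only the standard library. The key derivation and field layout mirror the command above; the redis-py calls in the trailing comment are one way to perform the actual write:

```python
import hashlib
import json
import time

def build_cache_entry(query, embedding, response, model="deepseek.v3.2"):
    """Key is the MD5 of the exact query text, as described above.
    Identical questions dedupe on the key; rephrased questions are
    matched by the vector index, not the key."""
    key = "cache:" + hashlib.md5(query.encode("utf-8")).hexdigest()
    doc = {
        "query": query,
        "query_embedding": embedding,  # list of 1024 floats in production
        "response": response,
        "model": model,
        "created_at": int(time.time()),
    }
    return key, json.dumps(doc)

# With redis-py, the equivalent of the JSON.SET + EXPIRE above would be:
#   r.json().set(key, "$", doc)
#   r.expire(key, 86400)
```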

Step 3: Check the cache on new queries

When a new question arrives, embed it and search the cache index for the nearest match. FT.SEARCH idx:semantic_cache '*=>[KNN 1 @query_embedding $vec AS score]' PARAMS 2 vec <binary_vector> DIALECT 2 — This finds the single closest cached query. Note that with DISTANCE_METRIC COSINE, the score Redis returns is cosine distance, not similarity. If (1 - score) >= 0.95 — that is, the distance is at most 0.05 — return the cached response. If not, proceed to the full RAG pipeline.
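The <binary_vector> placeholder is the raw FLOAT32 bytes of the query embedding, which most clients require you to serialize yourself. A stdlib sketch (numpy users often write np.array(vec, dtype=np.float32).tobytes() instead):

```python
import struct

def to_float32_bytes(vector):
    """Serialize an embedding to the FLOAT32 binary blob that FT.SEARCH
    expects for the $vec parameter: little-endian, 4 bytes per float."""
    return struct.pack(f"<{len(vector)}f", *vector)

blob = to_float32_bytes([0.023, -0.051, 0.5])  # 3 floats -> 12 bytes
```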

Step 4: Log metrics in MongoDB

Every query — hit or miss — gets logged to MongoDB Atlas for analysis. Track the cache hit rate, average similarity scores, which queries are most common, and how much money the cache is saving. This data drives threshold tuning and helps you understand your users' question patterns.
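A sketch of what each log record might look like. The field names here are illustrative, not a required schema; with pymongo the write itself is a single insert_one call:

```python
import time

def build_metrics_doc(query, cache_hit, similarity=None,
                      latency_ms=None, cost_usd=None):
    """One log record per query. similarity is the best match score
    (None on a cold miss); cost_usd lets you sum savings later."""
    return {
        "query": query,
        "cache_hit": cache_hit,
        "similarity": similarity,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "ts": int(time.time()),
    }

# With pymongo:
#   db.cache_metrics.insert_one(build_metrics_doc(query, True, 0.97, 85, 0.0001))
```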

Architecture: Where Each Piece Fits

| Component | Role | Why This Tool |
| --- | --- | --- |
| Redis Cloud | Cache storage + vector similarity search | Sub-millisecond reads, native vector search with RediSearch, built-in TTL for automatic expiry |
| MongoDB Atlas | Metrics logging, cache analytics, knowledge base metadata | Durable storage, flexible queries, aggregation pipeline for hit-rate analysis |
| AWS Bedrock (Titan Embed v2) | Query embedding (text → vector) | 1024 dimensions, fast inference, pay-per-use, no GPU management |
| AWS Bedrock (DeepSeek/Claude) | LLM response generation | Streaming support, multiple model choices, native SDK integration |

The key architectural insight is separation of concerns. Redis handles everything that needs to be fast (cache lookups, vector search, session state). MongoDB handles everything that needs to be durable and queryable (metrics, metadata, audit logs). The embedding model is stateless — call it, get a vector, move on.

Production Considerations

Cache invalidation

Use TTLs aggressively. A 24-hour TTL means stale answers expire automatically. When you update your knowledge base (new blog posts, updated docs), flush the cache and let it rebuild organically. The cost of a few extra LLM calls is far less than serving outdated information.

Threshold tuning

Start strict (0.95) and loosen gradually based on data. Log every cache hit with its similarity score. After a week, review the hits between 0.90 and 0.95 — if they are returning correct answers, lower the threshold. If they are returning wrong answers, keep it tight. This is an empirical process, not a theoretical one.
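That review loop is easy to script against your logs. A toy sketch, assuming you have exported cache hits as (similarity, was_correct) pairs, where was_correct comes from your manual review:

```python
def review_band(hits, low=0.90, high=0.95):
    """Given logged cache hits as (similarity, was_correct) pairs,
    report the accuracy in the band you are considering opening up.
    Returns None if no hits fell in the band."""
    band = [ok for sim, ok in hits if low <= sim < high]
    if not band:
        return None
    return sum(band) / len(band)

# If accuracy in the 0.90-0.95 band stays high, lowering the
# threshold toward 0.90 is probably safe; if not, keep it tight.
hits = [(0.97, True), (0.93, True), (0.91, True), (0.92, False)]
print(review_band(hits))  # 2 of 3 hits in the band were correct
```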

Embedding model consistency

The cache only works if you use the same embedding model for storing and retrieving. If you switch from Titan Embed v2 to a different model, the vectors are incompatible — flush the entire cache and rebuild. This is why the embedding model choice should be stable and deliberate.

Multi-tenant isolation

If your agent serves multiple customers, each with different knowledge bases, never share cache across tenants. Prefix cache keys with the tenant ID (cache:tenant42:a1b2c3) and create separate indexes or use TAG filters in your FT.SEARCH query. A cached answer from Company A's knowledge base is wrong for Company B.

Measuring Success

Track these metrics to know if your semantic cache is working:

| Metric | Target | What It Tells You |
| --- | --- | --- |
| Cache hit rate | 40–70% | Percentage of queries served from cache. Below 30% means your traffic is too unique for caching. |
| Average similarity on hits | > 0.96 | If hits average 0.92, your threshold might be too low — you could be serving wrong answers. |
| P95 cache lookup latency | < 10ms | Cache lookups should be near-instant. If slow, check your Redis index or network. |
| Monthly cost reduction | 40–70% | Compare monthly LLM spend before and after. This is the bottom line. |
| False positive rate | < 2% | Manually review a sample of cache hits. If wrong answers are being served, raise the threshold. |

The Bottom Line

Semantic caching is not a silver bullet — it is a targeted optimization for repetitive workloads. If your AI agent handles the same 50 questions in 500 different phrasings, semantic caching can cut your LLM costs by 60% and your response latency by 95%. If every query is unique, caching adds complexity without benefit.

The implementation is straightforward: embed queries, search with Redis vector search, return cached answers when similarity is high enough, log everything to MongoDB for analysis. The hard part is tuning the threshold — and the only way to tune it is to ship it, measure it, and iterate.

The best cache is one where users never notice it exists — their questions just get answered faster and cheaper. That is what semantic caching delivers when implemented correctly.

At Polystreak, we build this exact stack for production AI agents. Our live agent demo at polystreak.com/agent uses Redis Cloud for semantic caching, MongoDB Atlas for metrics, and AWS Bedrock for inference — and you can watch the cache in action through the X-Ray panel. Try asking the same question twice and see latency drop from 5 seconds to 100 milliseconds.