How to Decide the Semantic Similarity Threshold for Your AI Cache
You have built a semantic cache. Queries are embedded, vectors are indexed in Redis, and cache hits skip the LLM entirely. Everything works — until a user rephrases their question slightly and your cache misses. Or worse, a loosely related question hits the cache and returns a completely wrong answer. The difference between those two outcomes is a single floating-point number: the similarity threshold.
This post is not theory. We ran a real experiment on the Polystreak AI agent — using DeepSeek v3.2 for chat and Amazon Titan Embed Text v2 (1024 dimensions) for embeddings on AWS Bedrock — and tested 14 query variants against a single cached question. The data tells you exactly how thresholds behave in practice and how to pick the right one for your domain.
The Models We Used
Before diving into the threshold decision, here is the exact stack we tested with. The choice of embedding model directly affects the similarity scores you will see, so these numbers are specific to this configuration.
| Component | Model | Details |
|---|---|---|
| Chat LLM | DeepSeek v3.2 | Streaming via AWS Bedrock native SDK, used for generating responses on cache miss |
| Embedding Model | Amazon Titan Embed Text v2 | 1024-dimensional vectors, invoked via @aws-sdk/client-bedrock-runtime |
| Vector Index | Redis Cloud (RediSearch) | HNSW index with COSINE distance metric, FT.SEARCH with KNN 1 |
| Cache Storage | Redis Cloud (RedisJSON) | JSON documents with query, embedding, response, and metadata |
Amazon Titan Embed Text v2 produces 1024-dimensional vectors. We chose it for its good balance of semantic discrimination and low latency (~15ms per embedding call). The cosine similarity scores in this post are specific to this model — if you use OpenAI text-embedding-3-small, Cohere embed-v3, or another model, your absolute numbers will differ, but the relative patterns and decision framework still apply.
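For reference, every score in this post is plain cosine similarity between two embedding vectors. Here is a minimal sketch of the math itself (Redis computes this internally via its COSINE distance metric, so you never call this in the cache path, but it is useful for offline calibration):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1].
// Identical direction scores 1.0; orthogonal vectors score 0.0.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Note that Titan Embed v2 can return normalized vectors, in which case the denominators are 1 and cosine similarity reduces to a dot product.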
The Experiment: 14 Variants of One Question
We cached the response to a single question: 'What is semantic caching?' Then we sent 14 different queries — ranging from identical rephrasings to completely unrelated questions — and recorded the cosine similarity score returned by Redis vector search for each one.
| # | Query | Cosine Similarity | Same Intent? |
|---|---|---|---|
| 1 | What is semantic caching? | 1.000 | Yes (identical) |
| 2 | Tell me about semantic caching | 0.965 | Yes |
| 3 | Explain semantic caching to me | 0.952 | Yes |
| 4 | How does semantic caching work? | 0.941 | Yes |
| 5 | What does semantic caching mean? | 0.937 | Yes |
| 6 | Semantic caching explained | 0.928 | Yes |
| 7 | Can you describe semantic caching? | 0.924 | Yes |
| 8 | What is caching semantics? | 0.844 | Different — about compiler/memory caching semantics |
| 9 | What is the meaning of caching? | 0.812 | Different — generic caching, not semantic caching |
| 10 | What is Redis? | 0.534 | No |
| 11 | Azure cloud? | 0.321 | No |
| 12 | How to bake a cake? | 0.108 | No |
| 13 | What is vector search? | 0.612 | No — related topic, different question |
| 14 | How do embeddings work? | 0.587 | No — related topic, different question |
This data reveals the core challenge. Queries 1-7 are legitimate rephrasings — same intent, different words. Queries 8-9 are dangerously close in vocabulary but ask different things. Queries 10-14 are clearly different. Your threshold must thread the needle: catch 1-7 as hits, reject 8-14 as misses.
The Threshold Decision Table
Using the 14 variants above, here is how each threshold performs. 'Hits' counts how many of the 7 legitimate rephrasings (queries 1-7) would be served from cache. 'False hits' counts how many of the 7 non-matching queries (8-14) would incorrectly be served a cached answer.
| Threshold | Hits out of 7 rephrasings | Hit rate | False hits out of 7 non-matches | Risk |
|---|---|---|---|---|
| 0.98 | 1 (only identical) | 14% | 0 | Ultra-safe but nearly useless — only exact duplicates hit |
| 0.95 | 3 | 43% | 0 | Conservative — catches close rephrasings, zero false hits |
| 0.93 | 5 | 71% | 0 | Balanced — good hit rate with no false positives |
| 0.90 | 7 | 100% | 0 | Aggressive — catches all rephrasings, still no false hits with this data |
| 0.85 | 7 | 100% | 0 | High recall — still safe for this example, but gap to false hits shrinks |
| 0.80 | 7 | 100% | 2 (queries 8-9) | Risky — serves 'caching semantics' and 'meaning of caching' as hits |
| 0.75 | 7 | 100% | 2 | Same as 0.80 — next false hit (vector search) comes at ~0.61 |
| 0.60 | 7 | 100% | 3 (queries 8, 9, 13) | Dangerous — catches loosely related topics like vector search (0.612); embeddings (0.587) only just misses |
The sweet spot in this experiment is between 0.85 and 0.93. At 0.93 you catch 71% of rephrasings with zero risk. At 0.90 you catch them all while still maintaining a safe margin above the nearest false match (0.844). Below 0.85, you start accepting queries that look similar but mean something different.
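The decision table above can be reproduced directly from the 14 scores. Here is a sketch of the evaluation loop, using the experiment's data (the `sameIntent` labels mirror the "Same Intent?" column):

```typescript
interface ScoredQuery { score: number; sameIntent: boolean; }

// The 14 variants from the experiment: 7 rephrasings, 7 non-matches.
const variants: ScoredQuery[] = [
  { score: 1.000, sameIntent: true },  { score: 0.965, sameIntent: true },
  { score: 0.952, sameIntent: true },  { score: 0.941, sameIntent: true },
  { score: 0.937, sameIntent: true },  { score: 0.928, sameIntent: true },
  { score: 0.924, sameIntent: true },  { score: 0.844, sameIntent: false },
  { score: 0.812, sameIntent: false }, { score: 0.534, sameIntent: false },
  { score: 0.321, sameIntent: false }, { score: 0.108, sameIntent: false },
  { score: 0.612, sameIntent: false }, { score: 0.587, sameIntent: false },
];

// A hit is any score at or above the threshold. Count the good and bad hits.
function evaluate(threshold: number, data: ScoredQuery[]) {
  const hits = data.filter(q => q.score >= threshold);
  return {
    trueHits: hits.filter(q => q.sameIntent).length,
    falseHits: hits.filter(q => !q.sameIntent).length,
  };
}
```

Running `evaluate` at 0.95, 0.90, and 0.80 reproduces the 3/0, 7/0, and 7/2 rows of the table.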
Why One Number Does Not Fit All
The optimal threshold depends on three factors that vary by application.
1. The embedding model
Different embedding models produce different similarity distributions. Amazon Titan Embed v2 (1024-dim) tends to spread scores more widely — unrelated queries often fall below 0.60, while rephrasings cluster above 0.90. OpenAI text-embedding-3-large (3072-dim) may produce tighter clusters with higher absolute scores. Cohere embed-v3 may behave differently again. Always calibrate your threshold using your actual embedding model, not numbers from a blog post that used a different one.
2. The domain vocabulary
Technical domains have dense, overlapping vocabularies. 'Semantic caching' and 'caching semantics' are 0.844 similar because they share the same words — but they mean entirely different things. In a domain with highly specific jargon (medical, legal, financial), word overlap creates more false-positive risk, so you need a higher threshold. In a general-purpose FAQ domain with diverse vocabulary, you can afford a lower threshold because unrelated questions score much lower.
3. The cost of a wrong answer
This is the most important factor. If your agent handles customer support for a fintech app and a wrong cached answer could cause a user to make a bad financial decision, the cost of a false hit is enormous — use 0.95 or higher. If your agent is an internal FAQ bot where a slightly off answer just means someone asks a follow-up question, the cost of a false hit is low — 0.85-0.90 is fine.
Pros and Cons of Each Strategy
High threshold (0.93 – 0.98): The safety-first approach
| Pros | Cons |
|---|---|
| Near-zero false hit risk | Low cache hit rate (14-71% of rephrasings) |
| Every cached response is almost certainly correct | Users still pay full LLM cost for most rephrasings |
| Simple to defend in audits and compliance | Less latency improvement since fewer queries are cached |
| Works with any domain without tuning | You may wonder why you built a semantic cache at all |
Medium threshold (0.85 – 0.92): The balanced approach
| Pros | Cons |
|---|---|
| Catches most natural rephrasings (85-100%) | Requires domain-specific calibration |
| Significant cost savings — 60%+ of repetitive queries cached | Some edge cases may return wrong answers |
| Sub-100ms responses for most repeated questions | Needs monitoring and periodic threshold review |
| Best ROI for FAQ-heavy and support-bot workloads | More operational overhead than a simple key-value cache |
Low threshold (0.75 – 0.84): The aggressive approach
| Pros | Cons |
|---|---|
| Maximum cache hit rate | Serves wrong answers for similar-but-different queries |
| Lowest possible LLM costs | Erodes user trust when cached answers don't match the question |
| Great for very narrow domains with little vocabulary overlap | Requires strong guardrails: human review, confidence disclaimers, or fallback |
| Can work if combined with metadata filters (e.g., topic tags) | Debugging is harder — users see plausible but wrong answers |
A Practical Decision Framework
Here is a step-by-step process to pick and refine your threshold.
Step 1: Collect a seed dataset
Take 20-30 of your most common questions from production logs. For each, write 3-5 natural rephrasings (what real users would say) and 3-5 different-but-similar questions (same words, different intent). Embed all of them using your production embedding model.
Step 2: Compute the similarity matrix
For each original question, compute cosine similarity against all its rephrasings and against all the decoy questions. You are looking for the gap — the score range where rephrasings end and different questions begin. In our experiment, rephrasings bottomed out at 0.924 and the nearest false match was 0.844 — a gap of 0.08.
Step 3: Pick a threshold in the gap
Set your threshold at the midpoint of the gap, leaning toward safety. In our case: (0.924 + 0.844) / 2 = 0.884. We rounded down to 0.88 for a balanced starting point, then tested at 0.85 and 0.90 to see the practical difference. The Polystreak agent currently runs at 0.80 for demonstration purposes — lower than we would use in production — to show that the cache catches even aggressively rephrased queries.
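Steps 2 and 3 reduce to a few lines of code. A sketch (function and variable names are illustrative, not from the Polystreak codebase):

```typescript
// Given similarity scores for known rephrasings and known decoys,
// find the calibration gap and a starting threshold at its midpoint.
function pickThreshold(rephrasings: number[], decoys: number[]) {
  const floor = Math.min(...rephrasings); // weakest true rephrasing
  const ceiling = Math.max(...decoys);    // strongest false match
  if (ceiling >= floor) {
    throw new Error("no gap — collect more data or add metadata filters");
  }
  return { gap: floor - ceiling, threshold: (floor + ceiling) / 2 };
}
```

With this experiment's numbers (floor 0.924, ceiling 0.844) it returns a gap of 0.08 and a threshold of 0.884, matching the calculation above.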
Step 4: Deploy with logging, not with confidence
Log every cache hit with the similarity score, the original cached query, and the incoming query. After one week, review all hits where the score was within 0.05 of your threshold. These are your borderline cases. If they are all correct, consider lowering the threshold. If some are wrong, raise it. This is an empirical process — run it quarterly.
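The weekly borderline review is easy to automate once hits are logged. A sketch, with hypothetical log fields matching the three values this step says to record:

```typescript
interface CacheHitLog {
  score: number;         // similarity at hit time
  cachedQuery: string;   // the query that was originally cached
  incomingQuery: string; // the query that triggered the hit
}

// Pull out borderline hits: scores within `margin` above the threshold.
// These are the hits worth reviewing by hand each week.
function borderlineHits(logs: CacheHitLog[], threshold: number, margin = 0.05) {
  return logs.filter(h => h.score >= threshold && h.score < threshold + margin);
}
```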
Step 5: Use metadata filters as a safety net
If you can tag your cached responses by topic, category, or entity, add a filter to your vector search. Instead of just searching by similarity, search by similarity AND topic match. This lets you use a lower threshold (for better hit rate) while the metadata filter prevents cross-topic contamination. Redis Cloud supports TAG filters in FT.SEARCH queries alongside vector KNN.
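In RediSearch syntax, this means replacing the `*` prefix of the KNN query with a TAG filter (requires query DIALECT 2). A sketch of the query builder; the `@topic` and `@embedding` field names are placeholders for whatever your index schema defines:

```typescript
// Hybrid query: pre-filter by TAG, then run KNN within the filtered set.
// Plain similarity search uses '*' as the filter; a tag narrows the pool.
function knnQuery(topic: string | null, k = 1): string {
  const filter = topic ? `(@topic:{${topic}})` : "*";
  return `${filter}=>[KNN ${k} @embedding $vec AS score]`;
}
```

With a topic filter, a cached answer about caching can never be served for a billing question even at an aggressive threshold, because the candidate set is restricted before similarity is ever compared.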
Real-World Code: How We Check Similarity
Here is the actual flow in the Polystreak agent. The embedding model generates a 1024-dimensional vector, Redis runs a KNN search against the semantic cache index, and the threshold determines whether to return the cached response or proceed to the full RAG pipeline.
Generating the embedding (Amazon Titan Embed v2 via Bedrock)
Searching the cache (Redis Vector Search)
Streaming the LLM response (DeepSeek v3.2 via Bedrock)
The Score Distribution Pattern
Our experiment revealed a consistent pattern across multiple test queries with Titan Embed v2. The scores cluster into three natural bands:
| Band | Score Range | What Lives Here |
|---|---|---|
| High similarity | 0.92 – 1.00 | True rephrasings — same question, different words |
| Danger zone | 0.80 – 0.91 | Vocabulary overlap — same words, possibly different meaning |
| Clearly different | Below 0.80 | Different topics — safe to reject |
The danger zone is where threshold tuning matters most. In this range, queries share significant vocabulary with the cached question but may have different intent. 'Caching semantics' (0.844) sounds like 'semantic caching' but asks about something else entirely. The only way to know if your danger zone contains false positives is to test it with your actual data.
Threshold Recommendations by Use Case
| Use Case | Recommended Threshold | Reasoning |
|---|---|---|
| Customer support (external users) | 0.93 – 0.97 | Wrong answers damage trust and create support tickets. Prioritize correctness. |
| Product documentation bot | 0.90 – 0.95 | Users can verify answers against docs. Moderate risk tolerance. |
| Internal knowledge base | 0.85 – 0.92 | Users know the domain and can spot wrong answers. Higher hit rate saves more time. |
| Marketing chatbot | 0.88 – 0.93 | Brand risk from wrong answers, but queries are highly repetitive. Good cache ROI. |
| Developer tools / API docs | 0.90 – 0.95 | Precision matters — a wrong code example causes debugging pain. |
| Demo / proof of concept | 0.80 – 0.88 | Show the cache working aggressively. Accuracy is secondary to demonstrating the concept. |
Common Mistakes
- Using a threshold from a blog post without testing — Similarity distributions vary by embedding model, domain, and query length. Always calibrate with your own data.
- Setting the threshold once and forgetting it — User language evolves, your knowledge base changes, and new edge cases appear. Review quarterly.
- Ignoring the embedding model change — If you switch from Titan Embed v2 to OpenAI text-embedding-3-small, flush the cache AND recalibrate the threshold. Different models produce incompatible vectors.
- Not logging borderline hits — Hits scored between threshold and threshold+0.05 are your early warning system. If wrong answers appear here, raise the threshold before production users notice.
- Treating all cache hits equally — A hit at 0.99 similarity is almost certainly correct. A hit at 0.91 (just above a 0.90 threshold) should be monitored more closely. Consider adding a confidence tier.
Advanced: Tiered Thresholds
Instead of a single binary threshold (hit or miss), some production systems use two thresholds to create three tiers:
| Tier | Score Range (example) | Behavior |
|---|---|---|
| Confident hit | >= 0.95 | Return cached response immediately, no disclaimers |
| Tentative hit | 0.88 – 0.94 | Return cached response with a note: 'Based on a similar question — let me know if this doesn't answer your query' |
| Miss | < 0.88 | Full LLM pipeline — embed, search knowledge base, generate fresh response |
This approach captures more cache hits while being transparent about uncertainty. The tentative hit tier is especially useful for customer-facing agents where users appreciate honesty about approximate matching.
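The tier logic is a pair of comparisons. A sketch using the example thresholds from the table above (both are tunable parameters, not fixed values):

```typescript
type Tier = "confident" | "tentative" | "miss";

// Two thresholds turn the binary hit/miss decision into three tiers.
function classify(similarity: number, confident = 0.95, tentative = 0.88): Tier {
  if (similarity >= confident) return "confident"; // serve cached answer as-is
  if (similarity >= tentative) return "tentative"; // serve with a disclaimer
  return "miss";                                   // full LLM pipeline
}
```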
Measuring the Impact
After deploying your threshold, track these metrics weekly:
| Metric | What to Watch | Action If Off |
|---|---|---|
| Cache hit rate | Should stabilize at 40-70% for repetitive workloads | If below 30%, threshold may be too high or traffic is too unique for caching |
| Average similarity on hits | Should be well above threshold (e.g., 0.96 avg with 0.90 threshold) | If average is close to threshold, many hits are borderline — review them |
| False positive rate | Sample 50 hits/week manually — should be < 2% | If > 5%, raise the threshold immediately |
| Cost savings | Track tokens avoided × price per token | Compare against cache infrastructure cost to ensure positive ROI |
| User satisfaction | Monitor thumbs-down or follow-up correction messages | Spike in corrections after threshold change = too aggressive |
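The cost-savings row is simple arithmetic: tokens avoided times price per token. A sketch with illustrative inputs (the numbers in the test are made up for the example, not Polystreak figures):

```typescript
// Savings from cache hits: each hit avoids one full LLM call.
// pricePerMTokens is the model's price per million tokens.
function weeklySavings(
  hits: number,
  avgTokensPerCall: number,
  pricePerMTokens: number,
): number {
  return hits * avgTokensPerCall * (pricePerMTokens / 1_000_000);
}
```

Compare this figure against your Redis and embedding-call costs each week; if it is not comfortably larger, the cache is not paying for itself.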
The Bottom Line
The similarity threshold is not a configuration — it is a product decision. It balances cost savings against answer quality, speed against accuracy, and engineering simplicity against operational rigor. There is no universal right number.
Start with the data. Embed your real queries. Measure the gap between rephrasings and different-but-similar questions. Set your threshold in that gap. Deploy with aggressive logging. Review weekly. Adjust quarterly. That is the process — and it works whether you are running Amazon Titan Embed v2 at 1024 dimensions or any other embedding model.
The best threshold is the one you arrived at by testing your own data — not the one you copied from a blog post. Including this one.
Try it yourself: visit polystreak.com/agent and ask the same question twice in different words. Watch the X-Ray panel show a cache hit on the second query — the similarity score, the latency drop from seconds to milliseconds, and the cost going to $0.0000. That is the threshold in action.