How to Decide the Semantic Similarity Threshold for Your AI Cache
You have built a semantic cache. Queries are embedded, vectors are indexed in Redis, and cache hits skip the LLM entirely. Everything works — until a user rephrases their question slightly and your cache misses. Or worse, a loosely related question hits the cache and returns a completely wrong answer. The difference between those two outcomes is a single floating-point number: the similarity threshold.
This post is not theory. We ran a real experiment on the Polystreak AI agent — using DeepSeek v3.2 for chat and Amazon Titan Embed Text v2 (1024 dimensions) for embeddings on AWS Bedrock — and tested 14 query variants against a single cached question. The data tells you exactly how thresholds behave in practice and how to pick the right one for your domain.
The Models We Used
Before diving into the threshold decision, here is the exact stack we tested with. The choice of embedding model directly affects the similarity scores you will see, so these numbers are specific to this configuration.
| Component | Model | Details |
|---|---|---|
| Chat LLM | DeepSeek v3.2 | Streaming via AWS Bedrock native SDK, used for generating responses on cache miss |
| Embedding Model | Amazon Titan Embed Text v2 | 1024-dimensional vectors, invoked via @aws-sdk/client-bedrock-runtime |
| Vector Index | Redis Cloud (RediSearch) | HNSW index with COSINE distance metric, FT.SEARCH with KNN 1 |
| Cache Storage | Redis Cloud (RedisJSON) | JSON documents with query, embedding, response, and metadata |
Amazon Titan Embed Text v2 produces 1024-dimensional vectors. We chose it for its good balance of semantic discrimination and low latency (~15ms per embedding call). The cosine similarity scores in this post are specific to this model — if you use OpenAI text-embedding-3-small, Cohere embed-v3, or another model, your absolute numbers will differ, but the relative patterns and decision framework still apply.
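For reference, every score in this post is plain cosine similarity between two embedding vectors. Here is a minimal sketch of the math itself (Redis computes this internally via its COSINE distance metric, so you never call this in the cache path, but it is useful for offline calibration):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1].
// Identical direction scores 1.0; orthogonal vectors score 0.0.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Note that Titan Embed v2 can return normalized vectors, in which case the denominators are 1 and cosine similarity reduces to a dot product.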
The Experiment: 14 Variants of One Question
We cached the response to a single question: 'What is semantic caching?' Then we sent 14 different queries — ranging from identical rephrasings to completely unrelated questions — and recorded the cosine similarity score returned by Redis vector search for each one.
| # | Query | Cosine Similarity | Same Intent? |
|---|---|---|---|
| 1 | What is semantic caching? | 1.000 | Yes (identical) |
| 2 | Tell me about semantic caching | 0.965 | Yes |
| 3 | Explain semantic caching to me | 0.952 | Yes |
| 4 | How does semantic caching work? | 0.941 | Yes |
| 5 | What does semantic caching mean? | 0.937 | Yes |
| 6 | Semantic caching explained | 0.928 | Yes |
| 7 | Can you describe semantic caching? | 0.924 | Yes |
| 8 | What is caching semantics? | 0.844 | Different — about compiler/memory caching semantics |
| 9 | What is the meaning of caching? | 0.812 | Different — generic caching, not semantic caching |
| 10 | What is Redis? | 0.534 | No |
| 11 | Azure cloud? | 0.321 | No |
| 12 | How to bake a cake? | 0.108 | No |
| 13 | What is vector search? | 0.612 | No — related topic, different question |
| 14 | How do embeddings work? | 0.587 | No — related topic, different question |
This data reveals the core challenge. Queries 1-7 are legitimate rephrasings — same intent, different words. Queries 8-9 are dangerously close in vocabulary but ask different things. Queries 10-14 are clearly different. Your threshold must thread the needle: catch 1-7 as hits, reject 8-14 as misses.
The Threshold Decision Table
Using the 14 variants above, here is how each threshold performs. 'Hits' counts how many of the 7 legitimate rephrasings (queries 1-7) would be served from cache. 'False hits' counts how many of the 7 non-matching queries (8-14) would incorrectly be served a cached answer.
| Threshold | Hits out of 7 rephrasings | Hit rate | False hits out of 7 non-matches | Risk |
|---|---|---|---|---|
| 0.98 | 1 (only identical) | 14% | 0 | Ultra-safe but nearly useless — only exact duplicates hit |
| 0.95 | 3 | 43% | 0 | Conservative — catches close rephrasings, zero false hits |
| 0.93 | 5 | 71% | 0 | Balanced — good hit rate with no false positives |
| 0.90 | 7 | 100% | 0 | Aggressive — catches all rephrasings, still no false hits with this data |
| 0.85 | 7 | 100% | 0 | High recall — still safe for this example, but gap to false hits shrinks |
| 0.80 | 7 | 100% | 2 (queries 8-9) | Risky — serves 'caching semantics' and 'meaning of caching' as hits |
| 0.75 | 7 | 100% | 2 | Same as 0.80 — next false hit (vector search) comes at ~0.61 |
| 0.60 | 7 | 100% | 3 (queries 8, 9, 13) | Dangerous — catches loosely related topics like vector search (0.612); embeddings (0.587) only just misses |
The sweet spot in this experiment is between 0.85 and 0.93. At 0.93 you catch 71% of rephrasings with zero risk. At 0.90 you catch them all while still maintaining a safe margin above the nearest false match (0.844). Below 0.85, you start accepting queries that look similar but mean something different.
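The decision table above can be reproduced directly from the 14 scores. Here is a sketch of the evaluation loop, using the experiment's data (the `sameIntent` labels mirror the "Same Intent?" column):

```typescript
interface ScoredQuery { score: number; sameIntent: boolean; }

// The 14 variants from the experiment: 7 rephrasings, 7 non-matches.
const variants: ScoredQuery[] = [
  { score: 1.000, sameIntent: true },  { score: 0.965, sameIntent: true },
  { score: 0.952, sameIntent: true },  { score: 0.941, sameIntent: true },
  { score: 0.937, sameIntent: true },  { score: 0.928, sameIntent: true },
  { score: 0.924, sameIntent: true },  { score: 0.844, sameIntent: false },
  { score: 0.812, sameIntent: false }, { score: 0.534, sameIntent: false },
  { score: 0.321, sameIntent: false }, { score: 0.108, sameIntent: false },
  { score: 0.612, sameIntent: false }, { score: 0.587, sameIntent: false },
];

// A hit is any score at or above the threshold. Count the good and bad hits.
function evaluate(threshold: number, data: ScoredQuery[]) {
  const hits = data.filter(q => q.score >= threshold);
  return {
    trueHits: hits.filter(q => q.sameIntent).length,
    falseHits: hits.filter(q => !q.sameIntent).length,
  };
}
```

Running `evaluate` at 0.95, 0.90, and 0.80 reproduces the 3/0, 7/0, and 7/2 rows of the table.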
Why One Number Does Not Fit All
The optimal threshold depends on three factors that vary by application.
1. The embedding model
Different embedding models produce different similarity distributions. Amazon Titan Embed v2 (1024-dim) tends to spread scores more widely — unrelated queries often fall below 0.60, while rephrasings cluster above 0.90. OpenAI text-embedding-3-large (3072-dim) may produce tighter clusters with higher absolute scores. Cohere embed-v3 may behave differently again. Always calibrate your threshold using your actual embedding model, not numbers from a blog post that used a different one.
2. The domain vocabulary
Technical domains have dense, overlapping vocabularies. 'Semantic caching' and 'caching semantics' are 0.844 similar because they share the same words — but they mean entirely different things. In a domain with highly specific jargon (medical, legal, financial), word overlap creates more false-positive risk, so you need a higher threshold. In a general-purpose FAQ domain with diverse vocabulary, you can afford a lower threshold because unrelated questions score much lower.
3. The cost of a wrong answer
This is the most important factor. If your agent handles customer support for a fintech app and a wrong cached answer could cause a user to make a bad financial decision, the cost of a false hit is enormous — use 0.95 or higher. If your agent is an internal FAQ bot where a slightly off answer just means someone asks a follow-up question, the cost of a false hit is low — 0.85-0.90 is fine.
Pros and Cons of Each Strategy
High threshold (0.93 – 0.98): The safety-first approach
| Pros | Cons |
|---|---|
| Near-zero false hit risk | Low cache hit rate (14-71% of rephrasings) |
| Every cached response is almost certainly correct | Users still pay full LLM cost for most rephrasings |
| Simple to defend in audits and compliance | Less latency improvement since fewer queries are cached |
| Works with any domain without tuning | You may wonder why you built a semantic cache at all |
Medium threshold (0.85 – 0.92): The balanced approach
| Pros | Cons |
|---|---|
| Catches most natural rephrasings (85-100%) | Requires domain-specific calibration |
| Significant cost savings — 60%+ of repetitive queries cached | Some edge cases may return wrong answers |
| Sub-100ms responses for most repeated questions | Needs monitoring and periodic threshold review |
| Best ROI for FAQ-heavy and support-bot workloads | More operational overhead than a simple key-value cache |
Low threshold (0.75 – 0.84): The aggressive approach
| Pros | Cons |
|---|---|
| Maximum cache hit rate | Serves wrong answers for similar-but-different queries |
| Lowest possible LLM costs | Erodes user trust when cached answers don't match the question |
| Great for very narrow domains with little vocabulary overlap | Requires strong guardrails: human review, confidence disclaimers, or fallback |
| Can work if combined with metadata filters (e.g., topic tags) | Debugging is harder — users see plausible but wrong answers |
A Practical Decision Framework
Here is a step-by-step process to pick and refine your threshold.
Step 1: Collect a seed dataset
Take 20-30 of your most common questions from production logs. For each, write 3-5 natural rephrasings (what real users would say) and 3-5 different-but-similar questions (same words, different intent). Embed all of them using your production embedding model.
Step 2: Compute the similarity matrix
For each original question, compute cosine similarity against all its rephrasings and against all the decoy questions. You are looking for the gap — the score range where rephrasings end and different questions begin. In our experiment, rephrasings bottomed out at 0.924 and the nearest false match was 0.844 — a gap of 0.08.
Step 3: Pick a threshold in the gap
Set your threshold at the midpoint of the gap, leaning toward safety. In our case: (0.924 + 0.844) / 2 = 0.884. We rounded down to 0.88 for a balanced starting point, then tested at 0.85 and 0.90 to see the practical difference. The Polystreak agent currently runs at 0.80 for demonstration purposes — lower than we would use in production — to show that the cache catches even aggressively rephrased queries.
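Steps 2 and 3 reduce to a few lines of code. A sketch (function and variable names are illustrative, not from the Polystreak codebase):

```typescript
// Given similarity scores for known rephrasings and known decoys,
// find the calibration gap and a starting threshold at its midpoint.
function pickThreshold(rephrasings: number[], decoys: number[]) {
  const floor = Math.min(...rephrasings); // weakest true rephrasing
  const ceiling = Math.max(...decoys);    // strongest false match
  if (ceiling >= floor) {
    throw new Error("no gap — collect more data or add metadata filters");
  }
  return { gap: floor - ceiling, threshold: (floor + ceiling) / 2 };
}
```

With this experiment's numbers (floor 0.924, ceiling 0.844) it returns a gap of 0.08 and a threshold of 0.884, matching the calculation above.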
Step 4: Deploy with logging, not with confidence
Log every cache hit with the similarity score, the original cached query, and the incoming query. After one week, review all hits where the score was within 0.05 of your threshold. These are your borderline cases. If they are all correct, consider lowering the threshold. If some are wrong, raise it. This is an empirical process — run it quarterly.
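The weekly borderline review is easy to automate once hits are logged. A sketch, with hypothetical log fields matching the three values this step says to record:

```typescript
interface CacheHitLog {
  score: number;         // similarity at hit time
  cachedQuery: string;   // the query that was originally cached
  incomingQuery: string; // the query that triggered the hit
}

// Pull out borderline hits: scores within `margin` above the threshold.
// These are the hits worth reviewing by hand each week.
function borderlineHits(logs: CacheHitLog[], threshold: number, margin = 0.05) {
  return logs.filter(h => h.score >= threshold && h.score < threshold + margin);
}
```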
Step 5: Use metadata filters as a safety net
If you can tag your cached responses by topic, category, or entity, add a filter to your vector search. Instead of just searching by similarity, search by similarity AND topic match. This lets you use a lower threshold (for better hit rate) while the metadata filter prevents cross-topic contamination. Redis Cloud supports TAG filters in FT.SEARCH queries alongside vector KNN.
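In RediSearch syntax, this means replacing the `*` prefix of the KNN query with a TAG filter (requires query DIALECT 2). A sketch of the query builder; the `@topic` and `@embedding` field names are placeholders for whatever your index schema defines:

```typescript
// Hybrid query: pre-filter by TAG, then run KNN within the filtered set.
// Plain similarity search uses '*' as the filter; a tag narrows the pool.
function knnQuery(topic: string | null, k = 1): string {
  const filter = topic ? `(@topic:{${topic}})` : "*";
  return `${filter}=>[KNN ${k} @embedding $vec AS score]`;
}
```

With a topic filter, a cached answer about caching can never be served for a billing question even at an aggressive threshold, because the candidate set is restricted before similarity is ever compared.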
Real-World Code: How We Check Similarity
Here is the actual flow in the Polystreak agent. The embedding model generates a 1024-dimensional vector, Redis runs a KNN search against the semantic cache index, and the threshold determines whether to return the cached response or proceed to the full RAG pipeline.
Generating the embedding (Amazon Titan Embed v2 via Bedrock)
Searching the cache (Redis Vector Search)
Streaming the LLM response (DeepSeek v3.2 via Bedrock)
The Score Distribution Pattern
Our experiment revealed a consistent pattern across multiple test queries with Titan Embed v2. The scores cluster into three natural bands:
| Band | Score Range | What Lives Here |
|---|---|---|
| High similarity | 0.92 – 1.00 | True rephrasings — same question, different words |
| Danger zone | 0.80 – 0.91 | Vocabulary overlap — same words, possibly different meaning |
| Clearly different | Below 0.80 | Different topics — safe to reject |
The danger zone is where threshold tuning matters most. In this range, queries share significant vocabulary with the cached question but may have different intent. 'Caching semantics' (0.844) sounds like 'semantic caching' but asks about something else entirely. The only way to know if your danger zone contains false positives is to test it with your actual data.
Threshold Recommendations by Use Case
| Use Case | Recommended Threshold | Reasoning |
|---|---|---|
| Customer support (external users) | 0.93 – 0.97 | Wrong answers damage trust and create support tickets. Prioritize correctness. |
| Product documentation bot | 0.90 – 0.95 | Users can verify answers against docs. Moderate risk tolerance. |
| Internal knowledge base | 0.85 – 0.92 | Users know the domain and can spot wrong answers. Higher hit rate saves more time. |
| Marketing chatbot | 0.88 – 0.93 | Brand risk from wrong answers, but queries are highly repetitive. Good cache ROI. |
| Developer tools / API docs | 0.90 – 0.95 | Precision matters — a wrong code example causes debugging pain. |
| Demo / proof of concept | 0.80 – 0.88 | Show the cache working aggressively. Accuracy is secondary to demonstrating the concept. |
Common Mistakes
- Using a threshold from a blog post without testing — Similarity distributions vary by embedding model, domain, and query length. Always calibrate with your own data.
- Setting the threshold once and forgetting it — User language evolves, your knowledge base changes, and new edge cases appear. Review quarterly.
- Ignoring the embedding model change — If you switch from Titan Embed v2 to OpenAI text-embedding-3-small, flush the cache AND recalibrate the threshold. Different models produce incompatible vectors.
- Not logging borderline hits — Hits scored between threshold and threshold+0.05 are your early warning system. If wrong answers appear here, raise the threshold before production users notice.
- Treating all cache hits equally — A hit at 0.99 similarity is almost certainly correct. A hit at 0.91 (just above a 0.90 threshold) should be monitored more closely. Consider adding a confidence tier.
Advanced: Tiered Thresholds
Instead of a single binary threshold (hit or miss), some production systems use two thresholds to create three tiers:
| Tier | Score Range (example) | Behavior |
|---|---|---|
| Confident hit | >= 0.95 | Return cached response immediately, no disclaimers |
| Tentative hit | 0.88 – 0.94 | Return cached response with a note: 'Based on a similar question — let me know if this doesn't answer your query' |
| Miss | < 0.88 | Full LLM pipeline — embed, search knowledge base, generate fresh response |
This approach captures more cache hits while being transparent about uncertainty. The tentative hit tier is especially useful for customer-facing agents where users appreciate honesty about approximate matching.
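The tier logic is a pair of comparisons. A sketch using the example thresholds from the table above (both are tunable parameters, not fixed values):

```typescript
type Tier = "confident" | "tentative" | "miss";

// Two thresholds turn the binary hit/miss decision into three tiers.
function classify(similarity: number, confident = 0.95, tentative = 0.88): Tier {
  if (similarity >= confident) return "confident"; // serve cached answer as-is
  if (similarity >= tentative) return "tentative"; // serve with a disclaimer
  return "miss";                                   // full LLM pipeline
}
```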
Measuring the Impact
After deploying your threshold, track these metrics weekly:
| Metric | What to Watch | Action If Off |
|---|---|---|
| Cache hit rate | Should stabilize at 40-70% for repetitive workloads | If below 30%, threshold may be too high or traffic is too unique for caching |
| Average similarity on hits | Should be well above threshold (e.g., 0.96 avg with 0.90 threshold) | If average is close to threshold, many hits are borderline — review them |
| False positive rate | Sample 50 hits/week manually — should be < 2% | If > 5%, raise the threshold immediately |
| Cost savings | Track tokens avoided × price per token | Compare against cache infrastructure cost to ensure positive ROI |
| User satisfaction | Monitor thumbs-down or follow-up correction messages | Spike in corrections after threshold change = too aggressive |
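The cost-savings row is simple arithmetic: tokens avoided times price per token. A sketch with illustrative inputs (the numbers in the test are made up for the example, not Polystreak figures):

```typescript
// Savings from cache hits: each hit avoids one full LLM call.
// pricePerMTokens is the model's price per million tokens.
function weeklySavings(
  hits: number,
  avgTokensPerCall: number,
  pricePerMTokens: number,
): number {
  return hits * avgTokensPerCall * (pricePerMTokens / 1_000_000);
}
```

Compare this figure against your Redis and embedding-call costs each week; if it is not comfortably larger, the cache is not paying for itself.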
The Bottom Line
The similarity threshold is not a configuration — it is a product decision. It balances cost savings against answer quality, speed against accuracy, and engineering simplicity against operational rigor. There is no universal right number.
Start with the data. Embed your real queries. Measure the gap between rephrasings and different-but-similar questions. Set your threshold in that gap. Deploy with aggressive logging. Review weekly. Adjust quarterly. That is the process — and it works whether you are running Amazon Titan Embed v2 at 1024 dimensions or any other embedding model.
The best threshold is the one you arrived at by testing your own data — not the one you copied from a blog post. Including this one.
Try it yourself: visit polystreak.com/agent and ask the same question twice in different words. Watch the X-Ray panel show a cache hit on the second query — the similarity score, the latency drop from seconds to milliseconds, and the cost going to $0.0000. That is the threshold in action.