Tags: AI Agents, Semantic Caching, Similarity Threshold, Embeddings, Redis Cloud, AWS Bedrock, Cost Optimization, Vector Search, LLM, Production AI

How to Decide the Semantic Similarity Threshold for Your AI Cache

Polystreak Team · 2026-04-10 · 14 min read

You have built a semantic cache. Queries are embedded, vectors are indexed in Redis, and cache hits skip the LLM entirely. Everything works — until a user rephrases their question slightly and your cache misses. Or worse, a loosely related question hits the cache and returns a completely wrong answer. The difference between those two outcomes is a single floating-point number: the similarity threshold.

This post is not theory. We ran a real experiment on the Polystreak AI agent — using DeepSeek v3.2 for chat and Amazon Titan Embed Text v2 (1024 dimensions) for embeddings on AWS Bedrock — and tested 14 query variants against a single cached question. The data tells you exactly how thresholds behave in practice and how to pick the right one for your domain.

The Models We Used

Before diving into the threshold decision, here is the exact stack we tested with. The choice of embedding model directly affects the similarity scores you will see, so these numbers are specific to this configuration.

| Component | Model | Details |
| --- | --- | --- |
| Chat LLM | DeepSeek v3.2 | Streaming via AWS Bedrock native SDK, used for generating responses on cache miss |
| Embedding Model | Amazon Titan Embed Text v2 | 1024-dimensional vectors, invoked via @aws-sdk/client-bedrock-runtime |
| Vector Index | Redis Cloud (RediSearch) | HNSW index with COSINE distance metric, FT.SEARCH with KNN 1 |
| Cache Storage | Redis Cloud (RedisJSON) | JSON documents with query, embedding, response, and metadata |

Amazon Titan Embed Text v2 produces 1024-dimensional vectors. We chose it for its good balance of semantic discrimination and low latency (~15ms per embedding call). The cosine similarity scores in this post are specific to this model — if you use OpenAI text-embedding-3-small, Cohere embed-v3, or another model, your absolute numbers will differ, but the relative patterns and decision framework still apply.

The Experiment: 14 Variants of One Question

We cached the response to a single question: 'What is semantic caching?' Then we sent 14 different queries — ranging from identical rephrasings to completely unrelated questions — and recorded the cosine similarity score returned by Redis vector search for each one.
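The score Redis reports is simply the cosine similarity of the two embedding vectors. Here is a minimal standalone sketch for offline calibration (not the agent's actual code; in production Redis computes this server-side inside FT.SEARCH):

```javascript
// Cosine similarity between two equal-length vectors:
// dot(a, b) / (|a| * |b|). Useful for calibrating thresholds offline
// against embeddings you have already collected.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Identical directions score 1.0; orthogonal vectors score 0.0.
console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```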

| # | Query | Cosine Similarity | Same Intent? |
| --- | --- | --- | --- |
| 1 | What is semantic caching? | 1.000 | Yes (identical) |
| 2 | Tell me about semantic caching | 0.965 | Yes |
| 3 | Explain semantic caching to me | 0.952 | Yes |
| 4 | How does semantic caching work? | 0.941 | Yes |
| 5 | What does semantic caching mean? | 0.937 | Yes |
| 6 | Semantic caching explained | 0.928 | Yes |
| 7 | Can you describe semantic caching? | 0.924 | Yes |
| 8 | What is caching semantics? | 0.844 | Different — about compiler/memory caching semantics |
| 9 | What is the meaning of caching? | 0.812 | Different — generic caching, not semantic caching |
| 10 | What is Redis? | 0.534 | No |
| 11 | Azure cloud? | 0.321 | No |
| 12 | How to bake a cake? | 0.108 | No |
| 13 | What is vector search? | 0.612 | No — related topic, different question |
| 14 | How do embeddings work? | 0.587 | No — related topic, different question |

This data reveals the core challenge. Queries 1-7 are legitimate rephrasings — same intent, different words. Queries 8-9 are dangerously close in vocabulary but ask different things. Queries 10-14 are clearly different. Your threshold must thread the needle: catch 1-7 as hits, reject 8-14 as misses.

The Threshold Decision Table

Using the 14 variants above, here is how each threshold performs. 'Hits' counts how many of the 7 legitimate rephrasings (queries 1-7) would be served from cache. 'False hits' counts how many of the 7 non-matching queries (8-14) would incorrectly be served a cached answer.

| Threshold | Hits (of 7 rephrasings) | Hit rate | False hits (of 7 non-matches) | Risk |
| --- | --- | --- | --- | --- |
| 0.98 | 1 (only identical) | 14% | 0 | Ultra-safe but nearly useless — only exact duplicates hit |
| 0.95 | 3 | 43% | 0 | Conservative — catches close rephrasings, zero false hits |
| 0.93 | 5 | 71% | 0 | Balanced — good hit rate with no false positives |
| 0.90 | 7 | 100% | 0 | Aggressive — catches all rephrasings, still no false hits with this data |
| 0.85 | 7 | 100% | 0 | High recall — still safe for this example, but the gap to false hits shrinks |
| 0.80 | 7 | 100% | 2 (queries 8-9) | Risky — serves 'caching semantics' and 'meaning of caching' as hits |
| 0.75 | 7 | 100% | 2 | Same as 0.80 — next false hit (vector search) comes at ~0.61 |
| 0.60 | 7 | 100% | 3 | Dangerous — also catches loosely related topics like vector search (0.612); embeddings (0.587) falls just below |

The sweet spot in this experiment is between 0.85 and 0.93. At 0.93 you catch 71% of rephrasings with zero risk. At 0.90 you catch them all while still maintaining a safe margin above the nearest false match (0.844). Below 0.85, you start accepting queries that look similar but mean something different.
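The decision table can be reproduced mechanically from the experiment's scores. A sketch, using the numbers above (the variable and function names are ours, not from the agent's codebase):

```javascript
// Similarity scores from the 14-variant experiment.
const rephrasings = [1.000, 0.965, 0.952, 0.941, 0.937, 0.928, 0.924]; // queries 1-7
const nonMatches  = [0.844, 0.812, 0.534, 0.321, 0.108, 0.612, 0.587]; // queries 8-14

// Count hits and false hits at a given threshold; score >= threshold is a hit.
function evaluateThreshold(threshold) {
  const hits = rephrasings.filter((s) => s >= threshold).length;
  const falseHits = nonMatches.filter((s) => s >= threshold).length;
  return { threshold, hits, hitRate: hits / rephrasings.length, falseHits };
}

for (const t of [0.98, 0.95, 0.93, 0.90, 0.85, 0.80, 0.75, 0.60]) {
  const { hits, falseHits } = evaluateThreshold(t);
  console.log(`threshold ${t}: ${hits}/7 hits, ${falseHits}/7 false hits`);
}
```

Run this against your own labeled queries (rephrasings vs. decoys) and the sweet spot for your domain falls out directly.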

Why One Number Does Not Fit All

The optimal threshold depends on three factors that vary by application.

1. The embedding model

Different embedding models produce different similarity distributions. Amazon Titan Embed v2 (1024-dim) tends to spread scores more widely — unrelated queries often fall below 0.60, while rephrasings cluster above 0.90. OpenAI text-embedding-3-large (3072-dim) may produce tighter clusters with higher absolute scores. Cohere embed-v3 may behave differently again. Always calibrate your threshold using your actual embedding model, not numbers from a blog post that used a different one.

2. The domain vocabulary

Technical domains have dense, overlapping vocabularies. 'Semantic caching' and 'caching semantics' are 0.844 similar because they share the same words — but they mean entirely different things. In a domain with highly specific jargon (medical, legal, financial), word overlap creates more false-positive risk, so you need a higher threshold. In a general-purpose FAQ domain with diverse vocabulary, you can afford a lower threshold because unrelated questions score much lower.

3. The cost of a wrong answer

This is the most important factor. If your agent handles customer support for a fintech app and a wrong cached answer could cause a user to make a bad financial decision, the cost of a false hit is enormous — use 0.95 or higher. If your agent is an internal FAQ bot where a slightly off answer just means someone asks a follow-up question, the cost of a false hit is low — 0.85-0.90 is fine.

Pros and Cons of Each Strategy

High threshold (0.93 – 0.98): The safety-first approach

| Pros | Cons |
| --- | --- |
| Near-zero false hit risk | Low cache hit rate (14-71% of rephrasings) |
| Every cached response is almost certainly correct | Users still pay full LLM cost for most rephrasings |
| Simple to defend in audits and compliance | Less latency improvement since fewer queries are cached |
| Works with any domain without tuning | You may wonder why you built a semantic cache at all |

Medium threshold (0.85 – 0.92): The balanced approach

| Pros | Cons |
| --- | --- |
| Catches most natural rephrasings (85-100%) | Requires domain-specific calibration |
| Significant cost savings — 60%+ of repetitive queries cached | Some edge cases may return wrong answers |
| Sub-100ms responses for most repeated questions | Needs monitoring and periodic threshold review |
| Best ROI for FAQ-heavy and support-bot workloads | More operational overhead than a simple key-value cache |

Low threshold (0.75 – 0.84): The aggressive approach

| Pros | Cons |
| --- | --- |
| Maximum cache hit rate | Serves wrong answers for similar-but-different queries |
| Lowest possible LLM costs | Erodes user trust when cached answers don't match the question |
| Great for very narrow domains with little vocabulary overlap | Requires strong guardrails: human review, confidence disclaimers, or fallback |
| Can work if combined with metadata filters (e.g., topic tags) | Debugging is harder — users see plausible but wrong answers |

A Practical Decision Framework

Here is a step-by-step process to pick and refine your threshold.

Step 1: Collect a seed dataset

Take 20-30 of your most common questions from production logs. For each, write 3-5 natural rephrasings (what real users would say) and 3-5 different-but-similar questions (same words, different intent). Embed all of them using your production embedding model.

Step 2: Compute the similarity matrix

For each original question, compute cosine similarity against all its rephrasings and against all the decoy questions. You are looking for the gap — the score range where rephrasings end and different questions begin. In our experiment, rephrasings bottomed out at 0.924 and the nearest false match was 0.844 — a gap of 0.08.

Step 3: Pick a threshold in the gap

Set your threshold at the midpoint of the gap, leaning toward safety. In our case: (0.924 + 0.844) / 2 = 0.884. We rounded down to 0.88 for a balanced starting point, then tested at 0.85 and 0.90 to see the practical difference. The Polystreak agent currently runs at 0.80 for demonstration purposes — lower than we would use in production — to show that the cache catches even aggressively rephrased queries.
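The midpoint rule is simple enough to encode, including the degenerate case where rephrasings and decoys overlap and no clean threshold exists. A sketch (the function name is ours):

```javascript
// Given the lowest-scoring legitimate rephrasing and the highest-scoring
// decoy, place the starting threshold at the midpoint of the gap.
function midpointThreshold(minRephrasing, maxDecoy) {
  if (minRephrasing <= maxDecoy) {
    // No clean gap: no single threshold separates intents. Fall back to a
    // conservative value plus metadata filters or manual review instead.
    return null;
  }
  return (minRephrasing + maxDecoy) / 2;
}

console.log(midpointThreshold(0.924, 0.844)); // ~0.884, as in the experiment
```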

Step 4: Deploy with logging, not with confidence

Log every cache hit with the similarity score, the original cached query, and the incoming query. After one week, review all hits where the score was within 0.05 of your threshold. These are your borderline cases. If they are all correct, consider lowering the threshold. If some are wrong, raise it. This is an empirical process — run it quarterly.
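A sketch of the borderline filter from this step, assuming hits are logged as plain objects with a score and the two queries (a logging shape we invented for illustration):

```javascript
// Pull out cache hits that landed within `margin` of the threshold.
// These are the borderline cases worth a weekly human review.
function borderlineHits(hits, threshold, margin = 0.05) {
  return hits.filter((h) => h.score >= threshold && h.score < threshold + margin);
}

const hitLog = [
  { score: 0.99, cachedQuery: "What is semantic caching?", incomingQuery: "What is semantic caching?" },
  { score: 0.91, cachedQuery: "What is semantic caching?", incomingQuery: "Explain caching semantics" },
];
console.log(borderlineHits(hitLog, 0.90)); // only the 0.91 hit surfaces for review
```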

Step 5: Use metadata filters as a safety net

If you can tag your cached responses by topic, category, or entity, add a filter to your vector search. Instead of just searching by similarity, search by similarity AND topic match. This lets you use a lower threshold (for better hit rate) while the metadata filter prevents cross-topic contamination. Redis Cloud supports TAG filters in FT.SEARCH queries alongside vector KNN.
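In RediSearch query syntax, the hybrid search is a TAG pre-filter followed by the KNN clause. A sketch assuming cached documents carry a `topic` TAG field (the index and field names here are our own choices, not confirmed by the source):

```javascript
// Build a hybrid RediSearch query: pre-filter by topic TAG, then KNN.
// Only documents whose topic matches are candidates for the vector search,
// which lets you run a lower similarity threshold more safely.
function hybridQuery(topic, k = 1) {
  return `(@topic:{${topic}})=>[KNN ${k} @embedding $vec AS score]`;
}

console.log(hybridQuery("caching"));
// (@topic:{caching})=>[KNN 1 @embedding $vec AS score]

// Usage with node-redis (v4+), binding the query vector via PARAMS:
// await client.ft.search("idx:semantic-cache", hybridQuery("caching"), {
//   PARAMS: { vec: embeddingBuffer },
//   DIALECT: 2,
// });
```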

Real-World Code: How We Check Similarity

Here is the actual flow in the Polystreak agent. The embedding model generates a 1024-dimensional vector, Redis runs a KNN search against the semantic cache index, and the threshold determines whether to return the cached response or proceed to the full RAG pipeline.

Generating the embedding (Amazon Titan Embed v2 via Bedrock)
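A sketch of the embedding call with @aws-sdk/client-bedrock-runtime; region and error handling are simplified, and credentials are assumed to be configured in the environment. The float32 conversion at the end matters: Redis expects the query vector as a raw little-endian float32 byte blob.

```javascript
// Convert a float array to the float32 blob Redis expects for vector
// search PARAMS (1024 floats -> 4096 bytes for Titan Embed v2).
function toFloat32Buffer(vector) {
  return Buffer.from(new Float32Array(vector).buffer);
}

// Embed a query with Amazon Titan Embed Text v2 (1024-dim) via Bedrock.
// Sketch only: assumes AWS credentials and the Bedrock SDK are available.
async function embedQuery(text) {
  const { BedrockRuntimeClient, InvokeModelCommand } =
    await import("@aws-sdk/client-bedrock-runtime");
  const client = new BedrockRuntimeClient({ region: "us-east-1" });
  const response = await client.send(new InvokeModelCommand({
    modelId: "amazon.titan-embed-text-v2:0",
    contentType: "application/json",
    body: JSON.stringify({ inputText: text, dimensions: 1024 }),
  }));
  const { embedding } = JSON.parse(new TextDecoder().decode(response.body));
  return toFloat32Buffer(embedding);
}
```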

Searching the cache (Redis Vector Search)
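A sketch of the KNN 1 lookup, assuming a connected node-redis (v4+) client and an index name of our choosing. One subtlety: with the COSINE metric, Redis returns a cosine distance, not a similarity, so convert before comparing against the threshold.

```javascript
// With the COSINE metric, Redis returns cosine *distance* (0 = identical).
// Convert back to similarity before applying the threshold.
function distanceToSimilarity(distance) {
  return 1 - distance;
}

// KNN 1 lookup against the semantic cache. Sketch: assumes a connected
// node-redis v4 client and an index named "idx:semantic-cache" (our name).
async function searchCache(client, embeddingBuffer, threshold = 0.90) {
  const result = await client.ft.search(
    "idx:semantic-cache",
    "*=>[KNN 1 @embedding $vec AS dist]",
    { PARAMS: { vec: embeddingBuffer }, DIALECT: 2, RETURN: ["dist", "$.response"] },
  );
  if (result.total === 0) return null;
  const doc = result.documents[0];
  const similarity = distanceToSimilarity(Number(doc.value.dist));
  return similarity >= threshold
    ? { response: doc.value["$.response"], similarity } // cache hit
    : null;                                             // below threshold: miss
}
```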

Streaming the LLM response (DeepSeek v3.2 via Bedrock)
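A sketch of the streaming call using Bedrock's model-agnostic Converse Stream API. We have not verified the Bedrock model ID for DeepSeek v3.2, so the `modelId` below is an explicit placeholder:

```javascript
// Pull the text delta out of a ConverseStream event, or null for
// non-text events (message start/stop, metadata, ...).
function extractToken(event) {
  return event?.contentBlockDelta?.delta?.text ?? null;
}

// Stream a chat completion on cache miss via Bedrock's Converse Stream API.
// Sketch: assumes AWS credentials are configured; the modelId is a
// placeholder, not a verified Bedrock identifier.
async function streamResponse(prompt, onToken) {
  const { BedrockRuntimeClient, ConverseStreamCommand } =
    await import("@aws-sdk/client-bedrock-runtime");
  const client = new BedrockRuntimeClient({ region: "us-east-1" });
  const { stream } = await client.send(new ConverseStreamCommand({
    modelId: "deepseek.v3", // placeholder: substitute the DeepSeek v3.2 ID enabled in your account
    messages: [{ role: "user", content: [{ text: prompt }] }],
  }));
  let full = "";
  for await (const event of stream) {
    const token = extractToken(event);
    if (token !== null) {
      full += token;
      onToken(token); // push to the client as it arrives
    }
  }
  return full; // the caller stores this in the cache alongside the query embedding
}
```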

The Score Distribution Pattern

Our experiment revealed a consistent pattern across multiple test queries with Titan Embed v2. The scores cluster into three natural bands:

| Band | Score Range | What Lives Here |
| --- | --- | --- |
| High similarity | 0.92 – 1.00 | True rephrasings — same question, different words |
| Danger zone | 0.80 – 0.91 | Vocabulary overlap — same words, possibly different meaning |
| Clearly different | Below 0.80 | Different topics — safe to reject |

The danger zone is where threshold tuning matters most. In this range, queries share significant vocabulary with the cached question but may have different intent. 'Caching semantics' (0.844) sounds like 'semantic caching' but asks about something else entirely. The only way to know if your danger zone contains false positives is to test it with your actual data.

Threshold Recommendations by Use Case

| Use Case | Recommended Threshold | Reasoning |
| --- | --- | --- |
| Customer support (external users) | 0.93 – 0.97 | Wrong answers damage trust and create support tickets. Prioritize correctness. |
| Product documentation bot | 0.90 – 0.95 | Users can verify answers against docs. Moderate risk tolerance. |
| Internal knowledge base | 0.85 – 0.92 | Users know the domain and can spot wrong answers. Higher hit rate saves more time. |
| Marketing chatbot | 0.88 – 0.93 | Brand risk from wrong answers, but queries are highly repetitive. Good cache ROI. |
| Developer tools / API docs | 0.90 – 0.95 | Precision matters — a wrong code example causes debugging pain. |
| Demo / proof of concept | 0.80 – 0.88 | Show the cache working aggressively. Accuracy is secondary to demonstrating the concept. |

Common Mistakes

  • Using a threshold from a blog post without testing — Similarity distributions vary by embedding model, domain, and query length. Always calibrate with your own data.
  • Setting the threshold once and forgetting it — User language evolves, your knowledge base changes, and new edge cases appear. Review quarterly.
  • Ignoring the embedding model change — If you switch from Titan Embed v2 to OpenAI text-embedding-3-small, flush the cache AND recalibrate the threshold. Different models produce incompatible vectors.
  • Not logging borderline hits — Hits scored between threshold and threshold+0.05 are your early warning system. If wrong answers appear here, raise the threshold before production users notice.
  • Treating all cache hits equally — A hit at 0.99 similarity is almost certainly correct. A hit at 0.91 (just above a 0.90 threshold) should be monitored more closely. Consider adding a confidence tier.

Advanced: Tiered Thresholds

Instead of a single binary threshold (hit or miss), some production systems use two thresholds to create three tiers:

| Tier | Score Range (example) | Behavior |
| --- | --- | --- |
| Confident hit | >= 0.95 | Return cached response immediately, no disclaimers |
| Tentative hit | 0.88 – 0.94 | Return cached response with a note: 'Based on a similar question — let me know if this doesn't answer your query' |
| Miss | < 0.88 | Full LLM pipeline — embed, search knowledge base, generate fresh response |

This approach captures more cache hits while being transparent about uncertainty. The tentative hit tier is especially useful for customer-facing agents where users appreciate honesty about approximate matching.
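The tier logic is only a few lines. A sketch using the example boundaries from the table (the function name is ours):

```javascript
// Two thresholds, three tiers: confident hit, tentative hit, miss.
// Boundary values are the example numbers from the tier table.
function classifyHit(similarity, confident = 0.95, tentative = 0.88) {
  if (similarity >= confident) return "confident"; // serve the cached answer as-is
  if (similarity >= tentative) return "tentative"; // serve with a disclaimer
  return "miss";                                   // full LLM pipeline
}

console.log(classifyHit(0.97)); // "confident"
console.log(classifyHit(0.91)); // "tentative"
console.log(classifyHit(0.52)); // "miss"
```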

Measuring the Impact

After deploying your threshold, track these metrics weekly:

| Metric | What to Watch | Action If Off |
| --- | --- | --- |
| Cache hit rate | Should stabilize at 40-70% for repetitive workloads | If below 30%, threshold may be too high or traffic is too unique for caching |
| Average similarity on hits | Should be well above threshold (e.g., 0.96 avg with 0.90 threshold) | If average is close to threshold, many hits are borderline — review them |
| False positive rate | Sample 50 hits/week manually — should be < 2% | If > 5%, raise the threshold immediately |
| Cost savings | Track tokens avoided × price per token | Compare against cache infrastructure cost to ensure positive ROI |
| User satisfaction | Monitor thumbs-down or follow-up correction messages | Spike in corrections after threshold change = too aggressive |

The Bottom Line

The similarity threshold is not a configuration — it is a product decision. It balances cost savings against answer quality, speed against accuracy, and engineering simplicity against operational rigor. There is no universal right number.

Start with the data. Embed your real queries. Measure the gap between rephrasings and different-but-similar questions. Set your threshold in that gap. Deploy with aggressive logging. Review weekly. Adjust quarterly. That is the process — and it works whether you are running Amazon Titan Embed v2 at 1024 dimensions or any other embedding model.

The best threshold is the one you arrived at by testing your own data — not the one you copied from a blog post. Including this one.

Try it yourself: visit polystreak.com/agent and ask the same question twice in different words. Watch the X-Ray panel show a cache hit on the second query — the similarity score, the latency drop from seconds to milliseconds, and the cost going to $0.0000. That is the threshold in action.