
Monitoring Your LLM RAG Pipeline: A Practical Guide to Observability That Actually Matters

Polystreak Team · 2026-04-11 · 16 min read

You built a RAG agent. It embeds queries, searches a knowledge base, assembles prompts, streams LLM responses, and caches answers for next time. Users are happy — until they are not. A response takes 8 seconds instead of 3. The cost doubles in a week and nobody knows why. The agent starts returning confident-sounding answers that are completely wrong. Without observability, you are flying blind.

This post is based on the Polystreak AI agent — a production RAG pipeline running DeepSeek v3.2 for chat and Amazon Titan Embed Text v2 (1024 dimensions) for embeddings on AWS Bedrock, with Redis Cloud for vector search and semantic caching, and MongoDB Atlas for metrics and metadata. Every number in this post comes from real requests, real latencies, and real costs. The patterns apply to any RAG stack.

Why LLM Pipelines Are Harder to Monitor Than Traditional APIs

A traditional REST API is straightforward to monitor: measure response time, track error rates, alert on 5xx spikes. An LLM pipeline is fundamentally different in three ways that make standard monitoring insufficient.

  • Non-deterministic output — The same input can produce different responses. You cannot write assertions like 'response equals X'. Quality monitoring requires semantic evaluation, not string matching.
  • Variable cost per request — A cache hit costs $0.0001 (embedding only). A cache miss with 5,000 input tokens and 1,000 output tokens costs $0.015. A single endpoint has 150x cost variance depending on the request path.
  • Multi-step pipeline with cascading failures — A slow embedding call delays everything downstream. A bad vector search returns irrelevant chunks. Irrelevant chunks produce a hallucinated response. The failure at step 2 manifests as a quality problem at step 5, and traditional error monitoring sees nothing wrong.

This is why LLM observability needs three pillars, not just dashboards.

The Three Pillars

| Pillar | What It Answers | Granularity |
| --- | --- | --- |
| Tracing | What happened in THIS specific request? | Per-request, per-step |
| Metrics | How is the system performing over time? | Aggregated — hourly, daily, weekly trends |
| Evaluation | Are the answers actually correct? | Per-response quality scoring |

Most teams start with metrics (dashboards), skip tracing entirely, and never build evaluation. That is backwards. Tracing is the foundation — it gives you the per-request detail you need to debug anything. Metrics aggregate traces into trends. Evaluation tells you whether the output is worth serving at all.

Pillar 1: Tracing — The Request Waterfall

A trace is a single request's journey through every step of your pipeline, broken into spans. Each span records what happened, how long it took, and what data flowed through it. Here is what a real trace looks like for the Polystreak agent on a cache miss.

Example trace: cache miss (total 3,240ms)

| Span | Duration | Status | Key Attributes |
| --- | --- | --- | --- |
| session_load | 12ms | OK | session_id: s_abc123, messages_in_session: 4 |
| embed_query | 18ms | OK | model: amazon.titan-embed-text-v2:0, dimensions: 1024 |
| cache_check | 4ms | MISS | best_similarity: 0.72, threshold: 0.90 |
| vector_search | 6ms | OK | index: idx:knowledge_base, results: 5, top_score: 0.94 |
| prompt_assembly | 1ms | OK | input_tokens: 4,812, chunks_included: 5, history_messages: 4 |
| llm_inference | 3,180ms | OK | model: deepseek.v3.2, output_tokens: 975, time_to_first_token: 420ms |
| cache_store | 8ms | OK | key: cache:a1b2c3, ttl: 86400s |
| session_update | 5ms | OK | messages_added: 2 |
| metrics_log | 6ms | OK | collection: agenti_ai_metrics |

Total: 3,240ms. The LLM inference span dominates at 3,180ms (98% of the request). Everything else combined is 60ms. This is typical — in a cache miss, the LLM is always the bottleneck.

Example trace: cache hit (total 38ms)

| Span | Duration | Status | Key Attributes |
| --- | --- | --- | --- |
| session_load | 10ms | OK | session_id: s_abc123, messages_in_session: 6 |
| embed_query | 16ms | OK | model: amazon.titan-embed-text-v2:0, dimensions: 1024 |
| cache_check | 3ms | HIT | similarity: 0.96, threshold: 0.90, cached_query: 'What is semantic caching?' |
| vector_search | – | SKIP | reason: cache_hit |
| llm_inference | – | SKIP | reason: cache_hit, tokens_saved: 5,787 |
| session_update | 4ms | OK | messages_added: 2 |
| metrics_log | 5ms | OK | collection: agenti_ai_metrics |

Total: 38ms. The cache hit skips vector search and LLM inference entirely. The user gets an answer 85x faster. The cost drops from ~$0.015 to ~$0.0001. Without tracing, you see '38ms response time' in your logs — but you do not know WHY it was fast. Tracing tells you: cache hit at 0.96 similarity.
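That branch can be sketched as one pure decision that records the similarity and threshold it compared, so borderline cases (a miss at 0.89 against a 0.90 threshold) stay debuggable. The names here (`decideCachePath`, `CacheDecision`) are illustrative, not the agent's actual code:

```typescript
// Illustrative sketch of the cache-path decision behind the two traces:
// a hit (similarity >= threshold) skips vector search and LLM inference.

type CacheDecision = {
  hit: boolean;
  similarity: number; // logged so borderline misses can be reviewed later
  threshold: number;
  skippedSpans: string[];
};

function decideCachePath(bestSimilarity: number, threshold = 0.9): CacheDecision {
  const hit = bestSimilarity >= threshold;
  return {
    hit,
    similarity: bestSimilarity,
    threshold,
    // On a hit, the two expensive spans never run; on a miss, nothing is skipped.
    skippedSpans: hit ? ["vector_search", "llm_inference"] : [],
  };
}

// The 0.96-similarity request above takes the 38ms path; 0.72 runs the full pipeline.
const fastPath = decideCachePath(0.96); // hit: true
const slowPath = decideCachePath(0.72); // hit: false
```

Writing the decision this way means the similarity and threshold always land in the trace together, which is exactly what you need when tuning the threshold later.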

What tracing reveals that metrics cannot

  • A slow request where the LLM took 12 seconds — was it the model, the prompt size, or network? The span attributes show the input token count was 11,000 (someone had a 20-message conversation history). Fix: trim the context window.
  • A cache miss at 0.89 similarity with a 0.90 threshold — was this a rephrasing that should have hit? Open the trace, read the cached query and incoming query, decide if the threshold needs adjustment.
  • An empty response from the LLM — the embedding and search worked fine, but the model returned nothing. The span shows a 401 error: expired API credentials. Without the trace, this looks like a mystery empty response.
  • A correct but slow response — tracing shows the embedding call took 450ms instead of the usual 18ms. The Bedrock endpoint had a cold start. This is intermittent and invisible in aggregated P50 metrics.

Implementing traces

The simplest approach is to record spans yourself using timestamps and structured metadata. In our agent, the metrics tracker already captures per-step timing. To turn this into proper traces, wrap each step in a span that records start time, end time, status, and attributes.
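Sticking with that simplest approach, here is a minimal hand-rolled span wrapper. It is a sketch only: `Span`, `withSpan`, and the stubbed pipeline steps are illustrative names, not the agent's real API, and the real calls to Bedrock and Redis would go inside the wrapped functions.

```typescript
// Minimal hand-rolled tracing: wrap each pipeline step in a span that
// records start time, duration, status, and attributes.

type Span = {
  name: string;
  startMs: number;
  durationMs: number;
  status: "OK" | "ERROR";
  attributes: Record<string, unknown>;
};

const trace: Span[] = []; // one array per request in a real system

async function withSpan<T>(
  name: string,
  attributes: Record<string, unknown>,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    trace.push({ name, startMs: start, durationMs: Date.now() - start, status: "OK", attributes });
    return result;
  } catch (err) {
    // Record the failure with its context, then rethrow for normal error handling.
    trace.push({
      name,
      startMs: start,
      durationMs: Date.now() - start,
      status: "ERROR",
      attributes: { ...attributes, error: String(err) },
    });
    throw err;
  }
}

// Stubbed pipeline: real code would call Bedrock and Redis inside the spans.
async function handleQuery(query: string): Promise<void> {
  await withSpan(
    "embed_query",
    { model: "amazon.titan-embed-text-v2:0", dimensions: 1024 },
    async () => new Array(1024).fill(0), // stand-in for the Bedrock embedding call
  );
  await withSpan("cache_check", { threshold: 0.9 }, async () => null);
}
```

Attach the same attributes shown in the trace tables above (similarity scores, token counts, model IDs); they are what turn a bare duration into a debuggable record.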

For production systems, use OpenTelemetry or a purpose-built LLM tracing tool like Langfuse or LangSmith. These provide trace visualization UIs, search across traces, and correlation with downstream metrics — without building the infrastructure yourself.

Pillar 2: Metrics — Trends, Dashboards, and Alerts

Traces tell you what happened in a single request. Metrics tell you what is happening across all requests over time. Here are the metrics that matter for an LLM RAG pipeline, organized by category.

Latency metrics

Track percentiles (P50, P95, P99), not averages. A P50 of 2 seconds with a P99 of 15 seconds means most users are happy but 1 in 100 is waiting unreasonably long.
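To see why, here is a small percentile helper (illustrative, not from the agent's codebase) run against exactly that scenario: ninety-nine 2-second requests and one 15-second outlier. The mean looks healthy; the P99 does not.

```typescript
// Percentiles expose tail latency that averages hide.
// Simple floor-rank method over a sorted copy of the samples.

function percentile(valuesMs: number[], p: number): number {
  if (valuesMs.length === 0) throw new Error("no samples");
  const sorted = [...valuesMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

// 99 requests at 2s and one at 15s.
const latencies = [...Array<number>(99).fill(2000), 15000];
const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length; // 2130ms, looks fine
const p50 = percentile(latencies, 50); // 2000ms
const p99 = percentile(latencies, 99); // 15000ms, the user you are losing
```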

| Metric | Healthy Range | Alert Threshold | What It Tells You |
| --- | --- | --- | --- |
| Total response time (P50) | 100ms (hit) / 2-4s (miss) | > 6s | Overall user experience. If this degrades, drill into spans. |
| Total response time (P95) | 200ms (hit) / 5-8s (miss) | > 12s | Tail latency. Often caused by cold starts or large prompts. |
| Embedding time | 15-25ms | > 100ms | Bedrock Titan v2 is usually fast. Spikes indicate cold starts or throttling. |
| Vector search time | 3-8ms | > 50ms | Redis Cloud KNN search. If slow, check index size or network latency. |
| Cache check time | 2-5ms | > 20ms | Similar to vector search — uses the same Redis instance. |
| LLM time-to-first-token | 300-600ms | > 2s | How long before the user sees the first word streaming. Critical for perceived speed. |
| LLM total inference time | 2-5s | > 10s | Depends on output length. Long responses take longer — this is expected. |

The Polystreak agent typically shows: embed 18ms + cache check 4ms + vector search 6ms + LLM 3,200ms = 3,228ms total on a cache miss. On a cache hit: embed 16ms + cache check 3ms = 19ms total (no LLM call). The 170x difference is why cache hit rate is your most impactful metric.

Cost metrics

| Metric | How to Calculate | Why It Matters |
| --- | --- | --- |
| Cost per query (average) | Total token spend / total queries | Your unit economics. Track daily. |
| Cost per query (P95) | 95th percentile of per-query cost | Identifies expensive outliers — long conversations with large context windows. |
| Daily token spend (input) | Sum of all input tokens × price per token | Input tokens dominate cost in RAG. DeepSeek v3.2: ~$0.0015/1K input tokens. |
| Daily token spend (output) | Sum of all output tokens × price per token | Output tokens are more expensive per token but smaller volume. DeepSeek v3.2: ~$0.0075/1K output tokens. |
| Cost avoided by cache | (Cache hits × avg miss cost) − (cache hits × embedding cost) | Proves ROI. At a 60% hit rate with 200 queries/day, saves ~$1.30/day, or ~$39/month. |
| Embedding cost | Queries × embedding price per call | Amazon Titan Embed v2: ~$0.0001 per call. Small, but it adds up at scale. |

Here is a real cost breakdown from the Polystreak agent over a sample day with 200 queries.

| Path | Queries | Avg Input Tokens | Avg Output Tokens | Cost per Query | Daily Cost |
| --- | --- | --- | --- | --- | --- |
| Cache hit | 120 (60%) | 0 (no LLM call) | 0 | $0.0001 | $0.012 |
| Cache miss (short context) | 50 (25%) | 3,200 | 400 | $0.0078 | $0.39 |
| Cache miss (long conversation) | 30 (15%) | 6,500 | 900 | $0.0165 | $0.495 |
| **Total** | 200 | | | | $0.90 |

Without caching, the same 200 queries at an average miss cost of $0.011 would cost $2.20/day. The cache saves $1.30/day — 59% reduction. Track this metric weekly. If it drops, your cache hit rate is declining or your average prompt is growing.
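The table's arithmetic, written out with the post's stated DeepSeek v3.2 and Titan rates. The constant and function names are ours, not from the agent:

```typescript
// Reproducing the daily cost table with the per-1K-token rates quoted above.

const INPUT_PER_1K = 0.0015;  // DeepSeek v3.2, $ per 1K input tokens
const OUTPUT_PER_1K = 0.0075; // DeepSeek v3.2, $ per 1K output tokens
const EMBED_COST = 0.0001;    // Titan Embed v2, $ per call (the cache-hit path)

function queryCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * INPUT_PER_1K + (outputTokens / 1000) * OUTPUT_PER_1K;
}

const shortMiss = queryCost(3200, 400); // ≈ $0.0078
const longMiss = queryCost(6500, 900);  // ≈ $0.0165

// 200 queries/day: 120 cache hits, 50 short misses, 30 long misses.
const dailyWithCache = 120 * EMBED_COST + 50 * shortMiss + 30 * longMiss; // ≈ $0.90
const avgMissCost = (50 * shortMiss + 30 * longMiss) / 80;                // ≈ $0.011
const dailyWithoutCache = 200 * avgMissCost;                              // ≈ $2.21
```

Rerunning this with next month's traffic mix is how you catch "average prompt is growing" before the bill does.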

Cache metrics

| Metric | Target | What Drift Means |
| --- | --- | --- |
| Cache hit rate | 40-70% | Below 30%: traffic is too unique, or threshold is too high. Above 80%: great, or possibly serving stale answers. |
| Average similarity on hits | > 0.94 | If the average drops toward your threshold (e.g., 0.91 with threshold 0.90), you have many borderline hits — review them for correctness. |
| Cache entries count | Grows, then plateaus | If it grows indefinitely, your TTL is too long or you are not deduplicating. |
| Cache eviction rate | Matches TTL expectations | If entries expire faster than expected, check Redis memory limits and eviction policies. |
| Similarity score distribution | Bimodal: peaks at >0.95 and <0.50 | A flat distribution means your embedding model is not discriminating well between related and unrelated queries. |

Error metrics

| Error Type | Source | Impact | How to Detect |
| --- | --- | --- | --- |
| LLM timeout / 5xx | AWS Bedrock | User gets no response or an error message | Track non-200 status codes or caught exceptions in the LLM span |
| Auth failure (401/403) | AWS IAM credentials | All requests fail until credentials are refreshed | Alert on any auth error — this is always critical |
| Embedding failure | Bedrock Titan Embed | No embedding = no cache check, no vector search. Pipeline falls back or fails. | Track embedding span errors separately |
| Redis connection failure | Redis Cloud | No cache, no vector search, no sessions. Entire agent degrades. | Health check ping + connection error counter |
| MongoDB write failure | MongoDB Atlas | Metrics and metadata not logged. Agent still works for users, but you lose observability. | Track metrics_log span failures. This is ironic — the observability system failing silently. |
| Rate limit exceeded | App-level or Bedrock | User gets throttled. Legitimate if protecting cost; bad if threshold is too low. | Track rate limit rejections per hour |

Pillar 3: Evaluation — Is the Answer Actually Good?

This is the pillar most teams skip, and it is the most important. Your agent can have perfect latency, low cost, and zero errors — while confidently returning wrong answers. Evaluation measures output quality.

What can go wrong even when everything 'works'

  • Retrieval failure — The vector search returns 5 chunks, but none of them are relevant to the question. The LLM hallucinates an answer from its training data instead of the knowledge base. Every metric looks green.
  • Context poisoning — One of the 5 retrieved chunks contains outdated information. The LLM faithfully references it. The answer is wrong but grounded in a source — so it looks trustworthy.
  • Cache contamination — A cached response from a slightly different question is served. The similarity was 0.91 (above the 0.90 threshold) but the intent was different. The user gets a plausible but incorrect answer instantly.
  • Truncation — The conversation history grew too long. The context window was trimmed, cutting out a critical earlier message. The LLM contradicts something it said three turns ago.

Online evaluation (real-time, automated)

These checks run during or immediately after each request.

| Check | How It Works | Cost |
| --- | --- | --- |
| Chunk relevance gate | If the top chunk score from vector search is below 0.70, flag the response as potentially ungrounded. | Free — the score is already computed |
| Answer-chunk overlap | Check if key terms from the retrieved chunks appear in the LLM response. Low overlap suggests the model ignored the context. | Free — string matching on existing data |
| Response length anomaly | If the response is unusually short (<50 tokens) or unusually long (>2,000 tokens) compared to the running average, flag for review. | Free — compare against stored statistics |
| User feedback signal | Add thumbs up/down buttons. Track the ratio. A sudden drop in positive feedback after a change is an immediate quality signal. | Free infrastructure — but requires UI changes |
| LLM-as-judge (lightweight) | Send the (query, context, response) triple to a fast model and ask: 'Is this response faithful to the provided context? Answer yes or no.' Run on 10% of requests. | ~$0.001 per evaluated request |
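The first two checks cost nothing because they run on data the pipeline already holds. A sketch, with a deliberately crude tokenizer (a real implementation would stem and drop stop words); both function names are ours:

```typescript
// Two of the free online checks: a relevance gate on the existing top chunk
// score, and a rough term-overlap test between retrieved context and response.

function relevanceGate(topChunkScore: number, floor = 0.7): boolean {
  // true = flag the response as potentially ungrounded
  return topChunkScore < floor;
}

function termOverlap(chunks: string[], response: string): number {
  // Crude tokenizer: lowercase words longer than 3 characters.
  const tokenize = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter((w) => w.length > 3));
  const chunkTerms = tokenize(chunks.join(" "));
  const responseTerms = tokenize(response);
  if (chunkTerms.size === 0) return 0;
  let shared = 0;
  chunkTerms.forEach((t) => {
    if (responseTerms.has(t)) shared++;
  });
  // Fraction of distinct chunk terms echoed in the answer; low values
  // suggest the model ignored the retrieved context.
  return shared / chunkTerms.size;
}
```

Both run in microseconds on data you already logged, so they can gate every response rather than a sample.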

Offline evaluation (batch, weekly)

Sample 50-100 traces per week. For each, store the complete triple: the user query, the retrieved chunks, and the generated response. Run these through evaluation frameworks.

| Framework | What It Measures | How It Works |
| --- | --- | --- |
| RAGAS (open source) | Faithfulness, answer relevance, context precision, context recall | Automated scoring using an evaluator LLM. Scores 0-1 per dimension. Best open-source RAG evaluation. |
| LangSmith Evaluations | Custom evaluators on datasets | Upload query-response pairs, define evaluation criteria, track scores over time |
| Langfuse Scores | Attach quality scores to traces | Manual or automated scoring linked to individual traces. Good for trend tracking. |
| Human review | Ground truth correctness | Domain expert reviews sampled responses. Expensive but irreplaceable for high-stakes domains. |

The evaluation cadence matters. Run automated checks on every request (chunk relevance gate, answer length). Run LLM-as-judge on 10% of requests. Run RAGAS or human review weekly on a sample. This gives you continuous quality monitoring without the cost of evaluating every single response with a second LLM call.
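For the 10% LLM-as-judge slice, deterministic sampling beats `Math.random()`: hashing the request ID means a given request is either always judged or never, which keeps re-runs and A/B comparisons stable. A sketch using FNV-1a (any stable hash works; `shouldJudge` is an illustrative name):

```typescript
// Deterministic ~10% sampling by hashing the request ID (FNV-1a, 32-bit).
// The same ID always produces the same decision.

function shouldJudge(requestId: string, sampleRate = 0.1): boolean {
  let hash = 2166136261; // FNV-1a offset basis
  for (let i = 0; i < requestId.length; i++) {
    hash ^= requestId.charCodeAt(i);
    hash = Math.imul(hash, 16777619); // FNV prime, 32-bit multiply
  }
  // Map the unsigned 32-bit hash into [0, 1) and compare to the rate.
  return (hash >>> 0) / 4294967296 < sampleRate;
}
```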

Putting It Together: The Metrics Document

In the Polystreak agent, every request writes a metrics document to MongoDB Atlas. This single document powers all three pillars — tracing (per-step timing), metrics (aggregatable fields), and evaluation (chunk scores, token counts).
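A plausible shape for that document, sketched as a TypeScript literal. The field names (`promptTokens`, `estimatedCost`, `cacheHit`, `pipeline`, and so on) match the dashboard queries later in this post, but the exact layout is our reconstruction, not the agent's verbatim schema:

```typescript
// One metrics document per request, written to the agenti_ai_metrics
// collection. Values below mirror the cache-miss trace shown earlier.

const metricsDoc = {
  timestamp: new Date("2026-04-11T10:32:00Z"),
  session_id: "s_abc123",
  cacheHit: false,
  cacheSimilarity: 0.72,   // best similarity found during cache_check
  embeddingMs: 18,
  vectorSearchMs: 6,
  llmTotalMs: 3180,
  timeToFirstTokenMs: 420,
  promptTokens: 4812,
  completionTokens: 975,
  estimatedCost: 0.0145,   // input + output tokens priced at the model's rates
  chunksRetrieved: 5,
  topChunkScore: 0.94,
  pipeline: [              // per-step spans: the tracing pillar lives here
    { step: "embed_query", ms: 18, status: "ok" },
    { step: "cache_check", ms: 4, status: "miss" },
    { step: "vector_search", ms: 6, status: "ok" },
    { step: "llm_inference", ms: 3180, status: "ok" },
  ],
};
```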

With this document structure, a single MongoDB aggregation pipeline can answer any question: 'What is the P95 LLM latency this week?', 'What is the cache hit rate today?', 'Which sessions cost more than $0.05?', 'How many requests had a top chunk score below 0.70?'
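For example, the cache-hit-rate question as an aggregation pipeline. This is a sketch of the pipeline stages themselves, assuming the metrics document carries `timestamp` and `cacheHit` fields; run it with your driver's `collection.aggregate()`:

```typescript
// "What is the cache hit rate today?" over agenti_ai_metrics.
// Three stages: filter to today, count hits vs total, divide.

const cacheHitRateToday = [
  { $match: { timestamp: { $gte: new Date(new Date().setHours(0, 0, 0, 0)) } } },
  {
    $group: {
      _id: null,
      total: { $sum: 1 },
      hits: { $sum: { $cond: ["$cacheHit", 1, 0] } }, // count only cacheHit: true
    },
  },
  { $project: { _id: 0, hitRate: { $divide: ["$hits", "$total"] } } },
];
```

The P95 and cost questions follow the same pattern, swapping in `$percentile` or `$sum` over `estimatedCost`.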

Dashboard Design: What to Put on Screen

Three dashboards cover the full picture. Keep them separate — mixing operational and business metrics on one screen means nobody reads either.

Dashboard 1: Operational health (for engineers)

| Panel | Visualization | Data Source |
| --- | --- | --- |
| Request rate | Time series — requests per minute | COUNT over agenti_ai_metrics grouped by minute |
| Latency percentiles | Time series — P50, P95, P99 lines | PERCENTILE over llmTotalMs + embeddingMs + vectorSearchMs |
| Error rate | Time series — errors per hour | COUNT where pipeline contains status: 'error' |
| Cache hit rate | Single stat + sparkline | COUNT(cacheHit=true) / COUNT(*) per hour |
| Active sessions | Single stat | COUNT DISTINCT session_id in last 30 minutes |

Dashboard 2: Cost and efficiency (for engineering leads)

| Panel | Visualization | Data Source |
| --- | --- | --- |
| Daily token spend | Stacked bar — input vs output tokens | SUM promptTokens, SUM completionTokens grouped by day |
| Daily cost | Time series — actual spend vs budget line | SUM estimatedCost grouped by day, with $0.33/day budget line ($10/month) |
| Cost saved by cache | Single stat | (COUNT cache hits × avg miss cost) − (COUNT cache hits × embedding cost) |
| Cost per query trend | Time series — 7-day moving average | AVG estimatedCost per day, smoothed |
| Top expensive sessions | Table — session ID, total cost, query count | GROUP BY session_id, SUM estimatedCost, ORDER DESC, LIMIT 10 |

Dashboard 3: Quality (for product leads)

| Panel | Visualization | Data Source |
| --- | --- | --- |
| Chunk relevance distribution | Histogram of topChunkScore | Bucket distribution — how many responses had top chunk > 0.90 vs < 0.70? |
| Zero-result queries | Table — queries where chunksRetrieved = 0 | The agent had no knowledge base context. These answers are LLM-only (risky). |
| Cache similarity distribution | Histogram of cacheSimilarity on hits | Are hits clustered at 0.98 (safe) or spread near the threshold (risky)? |
| Borderline cache hits | Table — hits where similarity is within 0.05 of threshold | These are the responses most likely to be wrong. Review them manually. |
| User feedback ratio | Pie chart — thumbs up vs thumbs down | If you have feedback signals, this is the ultimate quality metric |

Alerting: What to Wake Up For

Not every metric needs an alert. Over-alerting causes alert fatigue, and a fatigued team eventually ignores everything, including the alerts that matter. Here are the alerts that warrant immediate attention.

| Alert | Condition | Severity | Why |
| --- | --- | --- | --- |
| Auth failure spike | More than 2 auth errors in 5 minutes | Critical | All requests are failing. Likely expired AWS credentials. |
| LLM error rate | > 5% of requests fail at LLM step in 15 minutes | Critical | Bedrock outage, rate limiting, or model deprecation. |
| P95 latency spike | P95 total response time > 12s for 10 minutes | Warning | Something is slow. Could be LLM cold start, large prompts, or network. |
| Cache hit rate drop | Drops below 20% for 1 hour (was previously > 40%) | Warning | Cache was flushed, TTLs expired in bulk, or traffic pattern changed. |
| Daily cost overrun | Estimated daily cost exceeds 2x the 7-day average | Warning | Unusual traffic spike or a prompt injection attack inflating token usage. |
| Redis connection failure | Any connection error | Critical | No cache, no vector search, no sessions. Core infrastructure down. |

Set alerts on symptoms, not causes. 'P95 latency > 12s' is a symptom. When it fires, use tracing to find the cause: was it the LLM, the embedding, the network, or a specific session with a massive conversation history?
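Symptom conditions like these stay easy to test when written as pure functions over aggregated numbers. A sketch of the cost-overrun check; the function name is ours, and the inputs would come from summing `estimatedCost` per day:

```typescript
// "Daily cost overrun" as a pure check: fire when today's estimated spend
// exceeds twice the trailing 7-day average.

function costOverrunAlert(last7DaysCost: number[], todayCost: number): boolean {
  if (last7DaysCost.length === 0) return false; // no baseline yet, stay quiet
  const avg = last7DaysCost.reduce((a, b) => a + b, 0) / last7DaysCost.length;
  return todayCost > 2 * avg;
}
```

Keeping the condition separate from the notification plumbing means you can unit-test the threshold logic and replay it against historical days before turning it on.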

Tool Comparison: What to Use

The LLM observability space has exploded. Here is how the options compare for a RAG pipeline like ours.

| Tool | Best For | Tracing | Metrics | Evaluation | Pricing |
| --- | --- | --- | --- | --- | --- |
| MongoDB Charts + Atlas Alerts | Quick start, no new infra | Manual (query the metrics collection) | Yes (built-in charts) | No | Free with Atlas |
| Langfuse (open source) | LLM-native observability | Yes (purpose-built for LLM traces) | Yes (cost, latency dashboards) | Yes (scores, datasets) | Free self-hosted / cloud tiers |
| LangSmith | LangChain ecosystem | Yes (deep LangChain integration) | Yes | Yes (evaluations, datasets) | Free tier / paid |
| OpenTelemetry + Grafana | Multi-service, vendor-neutral | Yes (standard spans) | Yes (Prometheus export) | No (add RAGAS separately) | Free (self-hosted) |
| Datadog LLM Observability | Enterprise, existing Datadog | Yes (APM + LLM-specific) | Yes (full Datadog metrics) | Partial | $$$ (per-host + per-trace pricing) |
| Helicone (open source) | Proxy-based, zero-code | Yes (intercepts LLM calls) | Yes (cost, latency) | No | Free tier / cloud |

Our recommendation

Start with MongoDB Charts — you already have the data in agenti_ai_metrics. Build the three dashboards described above. This takes an hour and gives you 60% of the value. When you need proper tracing with a visual waterfall and quality evaluation, add Langfuse. Its TypeScript SDK wraps your existing code with minimal changes, and it is purpose-built for LLM pipelines. Only move to Datadog or a full OpenTelemetry stack when you are running multiple agents or services and need unified cross-service observability.

The Monitoring Maturity Model

Most teams progress through four stages. Knowing where you are helps you invest in the right layer.

| Stage | What You Have | What You Are Missing | Next Step |
| --- | --- | --- | --- |
| Level 0: Blind | Console.log statements | Everything. You debug by reading server logs. | Add structured metrics logging (the MongoDB document above). |
| Level 1: Metrics | Dashboards showing latency, cost, hit rate | Per-request detail. You know P95 is bad but not why. | Add tracing — per-step spans with attributes. |
| Level 2: Traces + Metrics | Full request waterfall + trend dashboards | Quality measurement. You know it is fast and cheap, but is it correct? | Add evaluation — chunk relevance gates, LLM-as-judge, weekly RAGAS. |
| Level 3: Full Observability | Traces, metrics, evaluation, alerts | Proactive optimization. You react to problems instead of preventing them. | Add anomaly detection, automated threshold tuning, A/B testing for prompts. |

The Polystreak agent is currently at Level 1 — structured metrics in MongoDB, with the X-Ray panel providing per-request visibility. The data is there for Level 2 (the trace structure exists in the pipeline array); it just needs a visualization layer.

Common Mistakes

  • Monitoring only the LLM call — The LLM is 98% of the latency but only 20% of the failure modes. Embedding failures, bad retrieval, cache contamination, and session corruption all happen upstream.
  • Averaging latency instead of using percentiles — A P50 of 2s with a P99 of 20s means you have a serious tail latency problem that averages hide completely.
  • Ignoring the cost of cache misses vs hits — If your dashboard shows '$1.50/day total' but your cache hit rate drops from 60% to 20%, tomorrow's bill is $3.50. Track the trend, not just the number.
  • Not logging the retrieved chunks — When a response is wrong, you need to know what chunks the model saw. If you did not log them, you cannot distinguish between bad retrieval and bad generation.
  • Treating all errors equally — A MongoDB logging failure is invisible to users. A Redis connection failure breaks the entire agent. Your alerting should reflect this asymmetry.
  • Building dashboards before defining alerts — Dashboards are for investigation. Alerts are for detection. If nobody is watching the dashboard, problems go unnoticed. Set up the critical alerts first, then build dashboards for when they fire.

The Bottom Line

Monitoring an LLM RAG pipeline is not an extension of traditional API monitoring — it is a different discipline. The non-deterministic nature of LLM output, the variable cost per request, and the multi-step pipeline with cascading failure modes all demand specialized observability.

Start with the three pillars: tracing for per-request debugging, metrics for trend analysis and alerting, and evaluation for quality assurance. Log a structured metrics document on every request — even if you start with MongoDB Charts, that data powers every future observability tool you add.

The agent that serves a wrong answer in 100 milliseconds is worse than the agent that takes 5 seconds to give a correct one. Monitor for quality first, latency second, cost third.

See the observability in action: visit polystreak.com/agent and watch the X-Ray panel during your conversation. Every metric described in this post — embedding time, cache similarity, vector search latency, LLM inference duration, token counts, and estimated cost — is computed and displayed in real time. That is Level 1 observability. The data to reach Level 3 is already being logged.