
Monitoring Your LLM RAG Pipeline: A Practical Guide to Observability That Actually Matters

Polystreak Team · 2026-04-11 · 16 min read

You built a RAG agent. It embeds queries, searches a knowledge base, assembles prompts, streams LLM responses, and caches answers for next time. Users are happy — until they are not. A response takes 8 seconds instead of 3. The cost doubles in a week and nobody knows why. The agent starts returning confident-sounding answers that are completely wrong. Without observability, you are flying blind.

This post is based on the Polystreak AI agent — a production RAG pipeline running DeepSeek v3.2 for chat and Amazon Titan Embed Text v2 (1024 dimensions) for embeddings on AWS Bedrock, with Redis Cloud for vector search and semantic caching, and MongoDB Atlas for metrics and metadata. Every number in this post comes from real requests, real latencies, and real costs. The patterns apply to any RAG stack.

Why LLM Pipelines Are Harder to Monitor Than Traditional APIs

A traditional REST API is straightforward to monitor: measure response time, track error rates, alert on 5xx spikes. An LLM pipeline is fundamentally different in three ways that make standard monitoring insufficient.

  • Non-deterministic output — The same input can produce different responses. You cannot write assertions like 'response equals X'. Quality monitoring requires semantic evaluation, not string matching.
  • Variable cost per request — A cache hit costs $0.0001 (embedding only). A cache miss with 5,000 input tokens and 1,000 output tokens costs $0.015. A single endpoint has 150x cost variance depending on the request path.
  • Multi-step pipeline with cascading failures — A slow embedding call delays everything downstream. A bad vector search returns irrelevant chunks. Irrelevant chunks produce a hallucinated response. The failure at step 2 manifests as a quality problem at step 5, and traditional error monitoring sees nothing wrong.

This is why LLM observability needs three pillars, not just dashboards.

The Three Pillars

| Pillar | What It Answers | Granularity |
| --- | --- | --- |
| Tracing | What happened in THIS specific request? | Per-request, per-step |
| Metrics | How is the system performing over time? | Aggregated — hourly, daily, weekly trends |
| Evaluation | Are the answers actually correct? | Per-response quality scoring |

Most teams start with metrics (dashboards), skip tracing entirely, and never build evaluation. That is backwards. Tracing is the foundation — it gives you the per-request detail you need to debug anything. Metrics aggregate traces into trends. Evaluation tells you whether the output is worth serving at all.

Pillar 1: Tracing — The Request Waterfall

A trace is a single request's journey through every step of your pipeline, broken into spans. Each span records what happened, how long it took, and what data flowed through it. Here is what a real trace looks like for the Polystreak agent on a cache miss.

Example trace: cache miss (total 3,240ms)

| Span | Duration | Status | Key Attributes |
| --- | --- | --- | --- |
| session_load | 12ms | OK | session_id: s_abc123, messages_in_session: 4 |
| embed_query | 18ms | OK | model: amazon.titan-embed-text-v2:0, dimensions: 1024 |
| cache_check | 4ms | MISS | best_similarity: 0.72, threshold: 0.90 |
| vector_search | 6ms | OK | index: idx:knowledge_base, results: 5, top_score: 0.94 |
| prompt_assembly | 1ms | OK | input_tokens: 4,812, chunks_included: 5, history_messages: 4 |
| llm_inference | 3,180ms | OK | model: deepseek.v3.2, output_tokens: 975, time_to_first_token: 420ms |
| cache_store | 8ms | OK | key: cache:a1b2c3, ttl: 86400s |
| session_update | 5ms | OK | messages_added: 2 |
| metrics_log | 6ms | OK | collection: agenti_ai_metrics |

Total: 3,240ms. The LLM inference span dominates at 3,180ms (98% of the request). Everything else combined is 60ms. This is typical — in a cache miss, the LLM is always the bottleneck.

Example trace: cache hit (total 38ms)

| Span | Duration | Status | Key Attributes |
| --- | --- | --- | --- |
| session_load | 10ms | OK | session_id: s_abc123, messages_in_session: 6 |
| embed_query | 16ms | OK | model: amazon.titan-embed-text-v2:0, dimensions: 1024 |
| cache_check | 3ms | HIT | similarity: 0.96, threshold: 0.90, cached_query: 'What is semantic caching?' |
| vector_search | – | SKIP | reason: cache_hit |
| llm_inference | – | SKIP | reason: cache_hit, tokens_saved: 5,787 |
| session_update | 4ms | OK | messages_added: 2 |
| metrics_log | 5ms | OK | collection: agenti_ai_metrics |

Total: 38ms. The cache hit skips vector search and LLM inference entirely. The user gets an answer 85x faster. The cost drops from ~$0.015 to ~$0.0001. Without tracing, you see '38ms response time' in your logs — but you do not know WHY it was fast. Tracing tells you: cache hit at 0.96 similarity.
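That branch can be sketched as one pure decision that records the similarity and threshold it compared, so borderline cases (a miss at 0.89 against a 0.90 threshold) stay debuggable. The names here (`decideCachePath`, `CacheDecision`) are illustrative, not the agent's actual code:

```typescript
// Illustrative sketch of the cache-path decision behind the two traces:
// a hit (similarity >= threshold) skips vector search and LLM inference.

type CacheDecision = {
  hit: boolean;
  similarity: number; // logged so borderline misses can be reviewed later
  threshold: number;
  skippedSpans: string[];
};

function decideCachePath(bestSimilarity: number, threshold = 0.9): CacheDecision {
  const hit = bestSimilarity >= threshold;
  return {
    hit,
    similarity: bestSimilarity,
    threshold,
    // On a hit, the two expensive spans never run; on a miss, nothing is skipped.
    skippedSpans: hit ? ["vector_search", "llm_inference"] : [],
  };
}

// The 0.96-similarity request above takes the 38ms path; 0.72 runs the full pipeline.
const fastPath = decideCachePath(0.96); // hit: true
const slowPath = decideCachePath(0.72); // hit: false
```

Writing the decision this way means the similarity and threshold always land in the trace together, which is exactly what you need when tuning the threshold later.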

What tracing reveals that metrics cannot

  • A slow request where the LLM took 12 seconds — was it the model, the prompt size, or network? The span attributes show the input token count was 11,000 (someone had a 20-message conversation history). Fix: trim the context window.
  • A cache miss at 0.89 similarity with a 0.90 threshold — was this a rephrasing that should have hit? Open the trace, read the cached query and incoming query, decide if the threshold needs adjustment.
  • An empty response from the LLM — the embedding and search worked fine, but the model returned nothing. The span shows a 401 error: expired API credentials. Without the trace, this looks like a mystery empty response.
  • A correct but slow response — tracing shows the embedding call took 450ms instead of the usual 18ms. The Bedrock endpoint had a cold start. This is intermittent and invisible in aggregated P50 metrics.

Implementing traces

The simplest approach is to record spans yourself using timestamps and structured metadata. In our agent, the metrics tracker already captures per-step timing. To turn this into proper traces, wrap each step in a span that records start time, end time, status, and attributes.
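Sticking with that simplest approach, here is a minimal hand-rolled span wrapper. It is a sketch only: `Span`, `withSpan`, and the stubbed pipeline steps are illustrative names, not the agent's real API, and the real calls to Bedrock and Redis would go inside the wrapped functions.

```typescript
// Minimal hand-rolled tracing: wrap each pipeline step in a span that
// records start time, duration, status, and attributes.

type Span = {
  name: string;
  startMs: number;
  durationMs: number;
  status: "OK" | "ERROR";
  attributes: Record<string, unknown>;
};

const trace: Span[] = []; // one array per request in a real system

async function withSpan<T>(
  name: string,
  attributes: Record<string, unknown>,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    trace.push({ name, startMs: start, durationMs: Date.now() - start, status: "OK", attributes });
    return result;
  } catch (err) {
    // Record the failure with its context, then rethrow for normal error handling.
    trace.push({
      name,
      startMs: start,
      durationMs: Date.now() - start,
      status: "ERROR",
      attributes: { ...attributes, error: String(err) },
    });
    throw err;
  }
}

// Stubbed pipeline: real code would call Bedrock and Redis inside the spans.
async function handleQuery(query: string): Promise<void> {
  await withSpan(
    "embed_query",
    { model: "amazon.titan-embed-text-v2:0", dimensions: 1024 },
    async () => new Array(1024).fill(0), // stand-in for the Bedrock embedding call
  );
  await withSpan("cache_check", { threshold: 0.9 }, async () => null);
}
```

Attach the same attributes shown in the trace tables above (similarity scores, token counts, model IDs); they are what turn a bare duration into a debuggable record.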

For production systems, use OpenTelemetry or a purpose-built LLM tracing tool like Langfuse or LangSmith. These provide trace visualization UIs, search across traces, and correlation with downstream metrics — without building the infrastructure yourself.

Pillar 2: Metrics — Trends, Dashboards, and Alerts

Traces tell you what happened in a single request. Metrics tell you what is happening across all requests over time. Here are the metrics that matter for an LLM RAG pipeline, organized by category.

Latency metrics

Track percentiles (P50, P95, P99), not averages. A P50 of 2 seconds with a P99 of 15 seconds means most users are happy but 1 in 100 is waiting unreasonably long.
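To see why, here is a small percentile helper (illustrative, not from the agent's codebase) run against exactly that scenario: ninety-nine 2-second requests and one 15-second outlier. The mean looks healthy; the P99 does not.

```typescript
// Percentiles expose tail latency that averages hide.
// Simple floor-rank method over a sorted copy of the samples.

function percentile(valuesMs: number[], p: number): number {
  if (valuesMs.length === 0) throw new Error("no samples");
  const sorted = [...valuesMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

// 99 requests at 2s and one at 15s.
const latencies = [...Array<number>(99).fill(2000), 15000];
const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length; // 2130ms, looks fine
const p50 = percentile(latencies, 50); // 2000ms
const p99 = percentile(latencies, 99); // 15000ms, the user you are losing
```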

| Metric | Healthy Range | Alert Threshold | What It Tells You |
| --- | --- | --- | --- |
| Total response time (P50) | 100ms (hit) / 2-4s (miss) | > 6s | Overall user experience. If this degrades, drill into spans. |
| Total response time (P95) | 200ms (hit) / 5-8s (miss) | > 12s | Tail latency. Often caused by cold starts or large prompts. |
| Embedding time | 15-25ms | > 100ms | Bedrock Titan v2 is usually fast. Spikes indicate cold starts or throttling. |
| Vector search time | 3-8ms | > 50ms | Redis Cloud KNN search. If slow, check index size or network latency. |
| Cache check time | 2-5ms | > 20ms | Similar to vector search — uses the same Redis instance. |
| LLM time-to-first-token | 300-600ms | > 2s | How long before the user sees the first word streaming. Critical for perceived speed. |
| LLM total inference time | 2-5s | > 10s | Depends on output length. Long responses take longer — this is expected. |

The Polystreak agent typically shows: embed 18ms + cache check 4ms + vector search 6ms + LLM 3,200ms = 3,228ms total on a cache miss. On a cache hit: embed 16ms + cache check 3ms = 19ms total (no LLM call). The 170x difference is why cache hit rate is your most impactful metric.

Cost metrics

| Metric | How to Calculate | Why It Matters |
| --- | --- | --- |
| Cost per query (average) | Total token spend / total queries | Your unit economics. Track daily. |
| Cost per query (P95) | 95th percentile of per-query cost | Identifies expensive outliers — long conversations with large context windows. |
| Daily token spend (input) | Sum of all input tokens × price per token | Input tokens dominate cost in RAG. DeepSeek v3.2: ~$0.0015/1K input tokens. |
| Daily token spend (output) | Sum of all output tokens × price per token | Output tokens are more expensive per token but smaller volume. DeepSeek v3.2: ~$0.0075/1K output tokens. |
| Cost avoided by cache | (Cache hits × avg miss cost) − (cache hits × embedding cost) | Proves ROI. At a 60% hit rate with 200 queries/day, saves ~$1.30/day, or ~$39/month. |
| Embedding cost | Queries × embedding price per call | Amazon Titan Embed v2: ~$0.0001 per call. Small, but it adds up at scale. |

Here is a real cost breakdown from the Polystreak agent over a sample day with 200 queries.

| Path | Queries | Avg Input Tokens | Avg Output Tokens | Cost per Query | Daily Cost |
| --- | --- | --- | --- | --- | --- |
| Cache hit | 120 (60%) | 0 (no LLM call) | 0 | $0.0001 | $0.012 |
| Cache miss (short context) | 50 (25%) | 3,200 | 400 | $0.0078 | $0.39 |
| Cache miss (long conversation) | 30 (15%) | 6,500 | 900 | $0.0165 | $0.495 |
| **Total** | 200 | | | | $0.90 |

Without caching, the same 200 queries at an average miss cost of $0.011 would cost $2.20/day. The cache saves $1.30/day — 59% reduction. Track this metric weekly. If it drops, your cache hit rate is declining or your average prompt is growing.
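The table's arithmetic, written out with the post's stated DeepSeek v3.2 and Titan rates. The constant and function names are ours, not from the agent:

```typescript
// Reproducing the daily cost table with the per-1K-token rates quoted above.

const INPUT_PER_1K = 0.0015;  // DeepSeek v3.2, $ per 1K input tokens
const OUTPUT_PER_1K = 0.0075; // DeepSeek v3.2, $ per 1K output tokens
const EMBED_COST = 0.0001;    // Titan Embed v2, $ per call (the cache-hit path)

function queryCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * INPUT_PER_1K + (outputTokens / 1000) * OUTPUT_PER_1K;
}

const shortMiss = queryCost(3200, 400); // ≈ $0.0078
const longMiss = queryCost(6500, 900);  // ≈ $0.0165

// 200 queries/day: 120 cache hits, 50 short misses, 30 long misses.
const dailyWithCache = 120 * EMBED_COST + 50 * shortMiss + 30 * longMiss; // ≈ $0.90
const avgMissCost = (50 * shortMiss + 30 * longMiss) / 80;                // ≈ $0.011
const dailyWithoutCache = 200 * avgMissCost;                              // ≈ $2.21
```

Rerunning this with next month's traffic mix is how you catch "average prompt is growing" before the bill does.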

Cache metrics

| Metric | Target | What Drift Means |
| --- | --- | --- |
| Cache hit rate | 40-70% | Below 30%: traffic is too unique, or threshold is too high. Above 80%: great, or possibly serving stale answers. |
| Average similarity on hits | > 0.94 | If the average drops toward your threshold (e.g., 0.91 with threshold 0.90), you have many borderline hits — review them for correctness. |
| Cache entries count | Grows, then plateaus | If it grows indefinitely, your TTL is too long or you are not deduplicating. |
| Cache eviction rate | Matches TTL expectations | If entries expire faster than expected, check Redis memory limits and eviction policies. |
| Similarity score distribution | Bimodal: peaks at >0.95 and <0.50 | A flat distribution means your embedding model is not discriminating well between related and unrelated queries. |

Error metrics

| Error Type | Source | Impact | How to Detect |
| --- | --- | --- | --- |
| LLM timeout / 5xx | AWS Bedrock | User gets no response or an error message | Track non-200 status codes or caught exceptions in the LLM span |
| Auth failure (401/403) | AWS IAM credentials | All requests fail until credentials are refreshed | Alert on any auth error — this is always critical |
| Embedding failure | Bedrock Titan Embed | No embedding = no cache check, no vector search. Pipeline falls back or fails. | Track embedding span errors separately |
| Redis connection failure | Redis Cloud | No cache, no vector search, no sessions. Entire agent degrades. | Health check ping + connection error counter |
| MongoDB write failure | MongoDB Atlas | Metrics and metadata not logged. Agent still works for users, but you lose observability. | Track metrics_log span failures. This is ironic — the observability system failing silently. |
| Rate limit exceeded | App-level or Bedrock | User gets throttled. Legitimate if protecting cost; bad if threshold is too low. | Track rate limit rejections per hour |

Pillar 3: Evaluation — Is the Answer Actually Good?

This is the pillar most teams skip, and it is the most important. Your agent can have perfect latency, low cost, and zero errors — while confidently returning wrong answers. Evaluation measures output quality.

What can go wrong even when everything 'works'

  • Retrieval failure — The vector search returns 5 chunks, but none of them are relevant to the question. The LLM hallucinates an answer from its training data instead of the knowledge base. Every metric looks green.
  • Context poisoning — One of the 5 retrieved chunks contains outdated information. The LLM faithfully references it. The answer is wrong but grounded in a source — so it looks trustworthy.
  • Cache contamination — A cached response from a slightly different question is served. The similarity was 0.91 (above the 0.90 threshold) but the intent was different. The user gets a plausible but incorrect answer instantly.
  • Truncation — The conversation history grew too long. The context window was trimmed, cutting out a critical earlier message. The LLM contradicts something it said three turns ago.

Online evaluation (real-time, automated)

These checks run during or immediately after each request.

| Check | How It Works | Cost |
| --- | --- | --- |
| Chunk relevance gate | If the top chunk score from vector search is below 0.70, flag the response as potentially ungrounded. | Free — the score is already computed |
| Answer-chunk overlap | Check if key terms from the retrieved chunks appear in the LLM response. Low overlap suggests the model ignored the context. | Free — string matching on existing data |
| Response length anomaly | If the response is unusually short (<50 tokens) or unusually long (>2,000 tokens) compared to the running average, flag for review. | Free — compare against stored statistics |
| User feedback signal | Add thumbs up/down buttons. Track the ratio. A sudden drop in positive feedback after a change is an immediate quality signal. | Free infrastructure — but requires UI changes |
| LLM-as-judge (lightweight) | Send the (query, context, response) triple to a fast model and ask: 'Is this response faithful to the provided context? Answer yes or no.' Run on 10% of requests. | ~$0.001 per evaluated request |
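The first two checks cost nothing because they run on data the pipeline already holds. A sketch, with a deliberately crude tokenizer (a real implementation would stem and drop stop words); both function names are ours:

```typescript
// Two of the free online checks: a relevance gate on the existing top chunk
// score, and a rough term-overlap test between retrieved context and response.

function relevanceGate(topChunkScore: number, floor = 0.7): boolean {
  // true = flag the response as potentially ungrounded
  return topChunkScore < floor;
}

function termOverlap(chunks: string[], response: string): number {
  // Crude tokenizer: lowercase words longer than 3 characters.
  const tokenize = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter((w) => w.length > 3));
  const chunkTerms = tokenize(chunks.join(" "));
  const responseTerms = tokenize(response);
  if (chunkTerms.size === 0) return 0;
  let shared = 0;
  chunkTerms.forEach((t) => {
    if (responseTerms.has(t)) shared++;
  });
  // Fraction of distinct chunk terms echoed in the answer; low values
  // suggest the model ignored the retrieved context.
  return shared / chunkTerms.size;
}
```

Both run in microseconds on data you already logged, so they can gate every response rather than a sample.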

Offline evaluation (batch, weekly)

Sample 50-100 traces per week. For each, store the complete triple: the user query, the retrieved chunks, and the generated response. Run these through evaluation frameworks.

| Framework | What It Measures | How It Works |
| --- | --- | --- |
| RAGAS (open source) | Faithfulness, answer relevance, context precision, context recall | Automated scoring using an evaluator LLM. Scores 0-1 per dimension. Best open-source RAG evaluation. |
| LangSmith Evaluations | Custom evaluators on datasets | Upload query-response pairs, define evaluation criteria, track scores over time |
| Langfuse Scores | Attach quality scores to traces | Manual or automated scoring linked to individual traces. Good for trend tracking. |
| Human review | Ground truth correctness | Domain expert reviews sampled responses. Expensive but irreplaceable for high-stakes domains. |

The evaluation cadence matters. Run automated checks on every request (chunk relevance gate, answer length). Run LLM-as-judge on 10% of requests. Run RAGAS or human review weekly on a sample. This gives you continuous quality monitoring without the cost of evaluating every single response with a second LLM call.
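For the 10% LLM-as-judge slice, deterministic sampling beats `Math.random()`: hashing the request ID means a given request is either always judged or never, which keeps re-runs and A/B comparisons stable. A sketch using FNV-1a (any stable hash works; `shouldJudge` is an illustrative name):

```typescript
// Deterministic ~10% sampling by hashing the request ID (FNV-1a, 32-bit).
// The same ID always produces the same decision.

function shouldJudge(requestId: string, sampleRate = 0.1): boolean {
  let hash = 2166136261; // FNV-1a offset basis
  for (let i = 0; i < requestId.length; i++) {
    hash ^= requestId.charCodeAt(i);
    hash = Math.imul(hash, 16777619); // FNV prime, 32-bit multiply
  }
  // Map the unsigned 32-bit hash into [0, 1) and compare to the rate.
  return (hash >>> 0) / 4294967296 < sampleRate;
}
```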

Putting It Together: The Metrics Document

In the Polystreak agent, every request writes a metrics document to MongoDB Atlas. This single document powers all three pillars — tracing (per-step timing), metrics (aggregatable fields), and evaluation (chunk scores, token counts).
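A plausible shape for that document, sketched as a TypeScript literal. The field names (`promptTokens`, `estimatedCost`, `cacheHit`, `pipeline`, and so on) match the dashboard queries later in this post, but the exact layout is our reconstruction, not the agent's verbatim schema:

```typescript
// One metrics document per request, written to the agenti_ai_metrics
// collection. Values below mirror the cache-miss trace shown earlier.

const metricsDoc = {
  timestamp: new Date("2026-04-11T10:32:00Z"),
  session_id: "s_abc123",
  cacheHit: false,
  cacheSimilarity: 0.72,   // best similarity found during cache_check
  embeddingMs: 18,
  vectorSearchMs: 6,
  llmTotalMs: 3180,
  timeToFirstTokenMs: 420,
  promptTokens: 4812,
  completionTokens: 975,
  estimatedCost: 0.0145,   // input + output tokens priced at the model's rates
  chunksRetrieved: 5,
  topChunkScore: 0.94,
  pipeline: [              // per-step spans: the tracing pillar lives here
    { step: "embed_query", ms: 18, status: "ok" },
    { step: "cache_check", ms: 4, status: "miss" },
    { step: "vector_search", ms: 6, status: "ok" },
    { step: "llm_inference", ms: 3180, status: "ok" },
  ],
};
```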

With this document structure, a single MongoDB aggregation pipeline can answer any question: 'What is the P95 LLM latency this week?', 'What is the cache hit rate today?', 'Which sessions cost more than $0.05?', 'How many requests had a top chunk score below 0.70?'
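For example, the cache-hit-rate question as an aggregation pipeline. This is a sketch of the pipeline stages themselves, assuming the metrics document carries `timestamp` and `cacheHit` fields; run it with your driver's `collection.aggregate()`:

```typescript
// "What is the cache hit rate today?" over agenti_ai_metrics.
// Three stages: filter to today, count hits vs total, divide.

const cacheHitRateToday = [
  { $match: { timestamp: { $gte: new Date(new Date().setHours(0, 0, 0, 0)) } } },
  {
    $group: {
      _id: null,
      total: { $sum: 1 },
      hits: { $sum: { $cond: ["$cacheHit", 1, 0] } }, // count only cacheHit: true
    },
  },
  { $project: { _id: 0, hitRate: { $divide: ["$hits", "$total"] } } },
];
```

The P95 and cost questions follow the same pattern, swapping in `$percentile` or `$sum` over `estimatedCost`.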

Dashboard Design: What to Put on Screen

Three dashboards cover the full picture. Keep them separate — mixing operational and business metrics on one screen means nobody reads either.

Dashboard 1: Operational health (for engineers)

| Panel | Visualization | Data Source |
| --- | --- | --- |
| Request rate | Time series — requests per minute | COUNT over agenti_ai_metrics grouped by minute |
| Latency percentiles | Time series — P50, P95, P99 lines | PERCENTILE over llmTotalMs + embeddingMs + vectorSearchMs |
| Error rate | Time series — errors per hour | COUNT where pipeline contains status: 'error' |
| Cache hit rate | Single stat + sparkline | COUNT(cacheHit=true) / COUNT(*) per hour |
| Active sessions | Single stat | COUNT DISTINCT session_id in last 30 minutes |

Dashboard 2: Cost and efficiency (for engineering leads)

| Panel | Visualization | Data Source |
| --- | --- | --- |
| Daily token spend | Stacked bar — input vs output tokens | SUM promptTokens, SUM completionTokens grouped by day |
| Daily cost | Time series — actual spend vs budget line | SUM estimatedCost grouped by day, with $0.33/day budget line ($10/month) |
| Cost saved by cache | Single stat | (COUNT cache hits × avg miss cost) − (COUNT cache hits × embedding cost) |
| Cost per query trend | Time series — 7-day moving average | AVG estimatedCost per day, smoothed |
| Top expensive sessions | Table — session ID, total cost, query count | GROUP BY session_id, SUM estimatedCost, ORDER DESC, LIMIT 10 |

Dashboard 3: Quality (for product leads)

| Panel | Visualization | Data Source |
| --- | --- | --- |
| Chunk relevance distribution | Histogram of topChunkScore | Bucket distribution — how many responses had top chunk > 0.90 vs < 0.70? |
| Zero-result queries | Table — queries where chunksRetrieved = 0 | The agent had no knowledge base context. These answers are LLM-only (risky). |
| Cache similarity distribution | Histogram of cacheSimilarity on hits | Are hits clustered at 0.98 (safe) or spread near the threshold (risky)? |
| Borderline cache hits | Table — hits where similarity is within 0.05 of threshold | These are the responses most likely to be wrong. Review them manually. |
| User feedback ratio | Pie chart — thumbs up vs thumbs down | If you have feedback signals, this is the ultimate quality metric |

Alerting: What to Wake Up For

Not every metric needs an alert. Over-alerting causes alert fatigue, and a fatigued team eventually ignores everything, including the alerts that matter. Here are the alerts that warrant immediate attention.

| Alert | Condition | Severity | Why |
| --- | --- | --- | --- |
| Auth failure spike | More than 2 auth errors in 5 minutes | Critical | All requests are failing. Likely expired AWS credentials. |
| LLM error rate | > 5% of requests fail at LLM step in 15 minutes | Critical | Bedrock outage, rate limiting, or model deprecation. |
| P95 latency spike | P95 total response time > 12s for 10 minutes | Warning | Something is slow. Could be LLM cold start, large prompts, or network. |
| Cache hit rate drop | Drops below 20% for 1 hour (was previously > 40%) | Warning | Cache was flushed, TTLs expired in bulk, or traffic pattern changed. |
| Daily cost overrun | Estimated daily cost exceeds 2x the 7-day average | Warning | Unusual traffic spike or a prompt injection attack inflating token usage. |
| Redis connection failure | Any connection error | Critical | No cache, no vector search, no sessions. Core infrastructure down. |

Set alerts on symptoms, not causes. 'P95 latency > 12s' is a symptom. When it fires, use tracing to find the cause: was it the LLM, the embedding, the network, or a specific session with a massive conversation history?
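Symptom conditions like these stay easy to test when written as pure functions over aggregated numbers. A sketch of the cost-overrun check; the function name is ours, and the inputs would come from summing `estimatedCost` per day:

```typescript
// "Daily cost overrun" as a pure check: fire when today's estimated spend
// exceeds twice the trailing 7-day average.

function costOverrunAlert(last7DaysCost: number[], todayCost: number): boolean {
  if (last7DaysCost.length === 0) return false; // no baseline yet, stay quiet
  const avg = last7DaysCost.reduce((a, b) => a + b, 0) / last7DaysCost.length;
  return todayCost > 2 * avg;
}
```

Keeping the condition separate from the notification plumbing means you can unit-test the threshold logic and replay it against historical days before turning it on.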

Tool Comparison: What to Use

The LLM observability space has exploded. Here is how the options compare for a RAG pipeline like ours.

| Tool | Best For | Tracing | Metrics | Evaluation | Pricing |
| --- | --- | --- | --- | --- | --- |
| MongoDB Charts + Atlas Alerts | Quick start, no new infra | Manual (query the metrics collection) | Yes (built-in charts) | No | Free with Atlas |
| Langfuse (open source) | LLM-native observability | Yes (purpose-built for LLM traces) | Yes (cost, latency dashboards) | Yes (scores, datasets) | Free self-hosted / cloud tiers |
| LangSmith | LangChain ecosystem | Yes (deep LangChain integration) | Yes | Yes (evaluations, datasets) | Free tier / paid |
| OpenTelemetry + Grafana | Multi-service, vendor-neutral | Yes (standard spans) | Yes (Prometheus export) | No (add RAGAS separately) | Free (self-hosted) |
| Datadog LLM Observability | Enterprise, existing Datadog | Yes (APM + LLM-specific) | Yes (full Datadog metrics) | Partial | $$$ (per-host + per-trace pricing) |
| Helicone (open source) | Proxy-based, zero-code | Yes (intercepts LLM calls) | Yes (cost, latency) | No | Free tier / cloud |

Our recommendation

Start with MongoDB Charts — you already have the data in agenti_ai_metrics. Build the three dashboards described above. This takes an hour and gives you 60% of the value. When you need proper tracing with a visual waterfall and quality evaluation, add Langfuse. Its TypeScript SDK wraps your existing code with minimal changes, and it is purpose-built for LLM pipelines. Only move to Datadog or a full OpenTelemetry stack when you are running multiple agents or services and need unified cross-service observability.

The Monitoring Maturity Model

Most teams progress through four stages. Knowing where you are helps you invest in the right layer.

| Stage | What You Have | What You Are Missing | Next Step |
| --- | --- | --- | --- |
| Level 0: Blind | Console.log statements | Everything. You debug by reading server logs. | Add structured metrics logging (the MongoDB document above). |
| Level 1: Metrics | Dashboards showing latency, cost, hit rate | Per-request detail. You know P95 is bad but not why. | Add tracing — per-step spans with attributes. |
| Level 2: Traces + Metrics | Full request waterfall + trend dashboards | Quality measurement. You know it is fast and cheap, but is it correct? | Add evaluation — chunk relevance gates, LLM-as-judge, weekly RAGAS. |
| Level 3: Full Observability | Traces, metrics, evaluation, alerts | Proactive optimization. You react to problems instead of preventing them. | Add anomaly detection, automated threshold tuning, A/B testing for prompts. |

The Polystreak agent is currently at Level 1 — structured metrics in MongoDB, with the X-Ray panel providing per-request visibility. The data is there for Level 2 (the trace structure exists in the pipeline array); it just needs a visualization layer.

Common Mistakes

  • Monitoring only the LLM call — The LLM is 98% of the latency but only 20% of the failure modes. Embedding failures, bad retrieval, cache contamination, and session corruption all happen upstream.
  • Averaging latency instead of using percentiles — A P50 of 2s with a P99 of 20s means you have a serious tail latency problem that averages hide completely.
  • Ignoring the cost of cache misses vs hits — If your dashboard shows '$1.50/day total' but your cache hit rate drops from 60% to 20%, tomorrow's bill is $3.50. Track the trend, not just the number.
  • Not logging the retrieved chunks — When a response is wrong, you need to know what chunks the model saw. If you did not log them, you cannot distinguish between bad retrieval and bad generation.
  • Treating all errors equally — A MongoDB logging failure is invisible to users. A Redis connection failure breaks the entire agent. Your alerting should reflect this asymmetry.
  • Building dashboards before defining alerts — Dashboards are for investigation. Alerts are for detection. If nobody is watching the dashboard, problems go unnoticed. Set up the critical alerts first, then build dashboards for when they fire.

The Bottom Line

Monitoring an LLM RAG pipeline is not an extension of traditional API monitoring — it is a different discipline. The non-deterministic nature of LLM output, the variable cost per request, and the multi-step pipeline with cascading failure modes all demand specialized observability.

Start with the three pillars: tracing for per-request debugging, metrics for trend analysis and alerting, and evaluation for quality assurance. Log a structured metrics document on every request — even if you start with MongoDB Charts, that data powers every future observability tool you add.

The agent that serves a wrong answer in 100 milliseconds is worse than the agent that takes 5 seconds to give a correct one. Monitor for quality first, latency second, cost third.

See the observability in action: visit polystreak.com/agent and watch the X-Ray panel during your conversation. Every metric described in this post — embedding time, cache similarity, vector search latency, LLM inference duration, token counts, and estimated cost — is computed and displayed in real time. That is Level 1 observability. The data to reach Level 3 is already being logged.