Evaluating AI Agents in Production: Metrics, Instrumentation, and Data Systems
Shipping an AI agent is easy compared to knowing whether it is getting better. In production, failures are often partial: the model answers politely, calls the wrong tool, or completes the task while subtly violating a policy. Evaluation is how you turn those fuzzy failures into measurable regressions, cost signals, and release gates that your team can trust.
Start With Outcomes, Not Model Scores
Model-level benchmarks rarely predict business outcomes for agents that orchestrate tools, retrieve documents, and iterate over multiple steps. Production evaluation should anchor on task-level success, safety, latency, and economics. Everything else is diagnostic detail you use when those north-star metrics move.
Core metrics every production agent should track
| Metric | What it measures | Typical healthy band (indicative) |
|---|---|---|
| Task completion rate | End-to-end success on defined tasks | 92%–97% on curated eval sets; 78%–88% on noisy live traffic |
| Hallucination rate | Outputs contradicted by ground truth or policy | Under 3% on fact-heavy tasks; under 1% after RAG hardening |
| Tool call accuracy | Correct tool, arguments, and ordering | 96%–99% argument validity; under 2% wrong-tool rate |
| P95 end-to-end latency | Wall clock from request to final answer | 1.8s–4.0s for chat; 8s–25s for multi-tool workflows |
| Token cost per successful task | Spend divided by completed tasks | $0.012–$0.09 for mid-tier models at moderate context |
| Context utilization | Share of context window used before truncation | 55%–80% peak; spikes above 90% correlate with quality drops |
The bands above are not universal targets. They are reference ranges we see when teams have basic guardrails, retrieval, and structured outputs in place. Your baseline depends on domain risk, user tolerance, and how strictly you define success.
Task Completion and Hallucination: Make Success Binary First
Define a task schema: inputs, allowed tools, required artifacts, and acceptance checks. Completion should be machine-checkable when possible (JSON schema validation, database row state, test assertions on generated code). For open-ended work, use rubric scoring but still publish a primary pass or fail field so dashboards stay legible.
- Log the final assistant message, all tool calls with arguments and responses, retrieval snippets with IDs, and the prompt template version hash.
- Store human labels on a stratified sample (5%–15% of production traffic) to calibrate automated judges weekly.
- Track disagreement rate between human reviewers; above 12% usually means your rubric or task definition is ambiguous.
- Separate policy violations from factual errors; they regress for different reasons and need different mitigations.
If you cannot say what correct looks like for a task, you cannot evaluate the agent—only the prose.
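The "machine-checkable completion" idea above can be sketched as a small acceptance check. The task schema here (a refund task with required output fields and an allowed status set) is hypothetical, invented for illustration; real schemas would come from your task definitions.

```python
import json

# Hypothetical task schema: required artifact fields and acceptance checks.
TASK_SCHEMA = {
    "required_fields": ["order_id", "refund_amount", "status"],
    "allowed_status": {"refunded", "rejected"},
}

def check_completion(final_message: str, schema: dict) -> dict:
    """Return a primary pass/fail plus a diagnostic reason for dashboards."""
    try:
        artifact = json.loads(final_message)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "output_not_json"}
    missing = [f for f in schema["required_fields"] if f not in artifact]
    if missing:
        return {"passed": False, "reason": "missing_fields:" + ",".join(missing)}
    if artifact["status"] not in schema["allowed_status"]:
        return {"passed": False, "reason": "invalid_status"}
    return {"passed": True, "reason": "ok"}
```

The `reason` field stays machine-readable so failures can be grouped on a dashboard, while `passed` remains the single binary signal the text recommends.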
Tool call accuracy beyond pass or fail
Tool errors are the silent killer of agent reliability. Instrument each call with latency, HTTP status, retry count, idempotency key, and a normalized error class. Validate argument JSON against a schema before execution when safety allows; when you must execute speculatively, log both the proposed and the executed arguments.
| Signal | Example threshold | Why it matters |
|---|---|---|
| Wrong-tool rate | Below 1.5% of calls | High values indicate prompt drift or ambiguous tool descriptions |
| Validator rejection rate | Below 4% after two weeks stable | Spikes often precede user-visible failures |
| Retry success rate | Above 70% when retries are allowed | Low success means flaky dependencies, not model noise |
| Average tools per successful task | Stable within ±8% week over week | Sudden jumps usually mean new ambiguity or context bloat |
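A minimal sketch of the pre-execution validation step, assuming a hypothetical tool registry of expected argument types. The normalized error classes feed the wrong-tool and validator-rejection rates in the table above.

```python
# Hypothetical registry mapping tool names to expected argument types.
TOOL_ARG_TYPES = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount": float},
}

def validate_tool_call(tool: str, args: dict) -> str:
    """Return a normalized error class, or 'ok' if the call may execute."""
    expected = TOOL_ARG_TYPES.get(tool)
    if expected is None:
        return "wrong_tool"  # counts toward the wrong-tool rate
    for name, typ in expected.items():
        if name not in args:
            return "missing_argument"
        if not isinstance(args[name], typ):
            return "argument_type_mismatch"
    if set(args) - set(expected):
        return "unexpected_argument"
    return "ok"  # safe to execute; still log the arguments
```

Because every rejection maps to a small fixed vocabulary of classes, spikes in any one class are easy to alert on before they become user-visible failures.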
Latency Budgets and Token Economics
Agents are multi-step systems. Break latency into model time, retrieval time, tool round-trips, and queueing. A common pattern is that tool and retrieval time dominate once the agent becomes capable; at that point, optimizing the LLM alone stops moving the P95.
| Stage | Example P50 | Example P95 | Notes |
|---|---|---|---|
| First token | 320 ms | 890 ms | Includes routing and auth |
| Retrieval | 140 ms | 410 ms | Vector + rerank + cache miss path |
| Single tool round-trip | 180 ms | 920 ms | Highly dependent on downstream APIs |
| Full multi-step task | 4.2 s | 17 s | Three to six tool calls typical |
Token cost per task should be reported with and without cache hits. Teams that only look at averages hide regressions where a small fraction of sessions burn extreme context. Track the 90th percentile token count per task family alongside the mean.
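The percentile-alongside-mean reporting above can be sketched as follows; the nearest-rank percentile and the `(task_family, cache_hit)` record shape are illustrative choices, not a prescribed format.

```python
import math
from collections import defaultdict

def percentile(values, q):
    """Nearest-rank percentile; adequate for dashboard-grade estimates."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def token_report(records):
    """records: iterable of (task_family, tokens, cache_hit).
    Report mean and P90 per family, split by cache hits so a small
    fraction of context-burning sessions is not averaged away."""
    by_family = defaultdict(list)
    for family, tokens, cache_hit in records:
        by_family[(family, cache_hit)].append(tokens)
    return {
        key: {"mean": sum(v) / len(v), "p90": percentile(v, 90)}
        for key, v in by_family.items()
    }
```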
How to Instrument Agents for Evaluation
- Emit a trace ID per user request and propagate it through every model, tool, and retrieval call.
- Record model name, temperature, max tokens, structured output mode, and safety filter outcomes.
- Capture retrieval scores, top-k IDs, and whether the answer cited those IDs.
- Version prompts and tools in the log; evaluations without version metadata are not comparable across releases.
Store raw transcripts for sampled sessions and aggregate metrics for all sessions. Raw logs power root cause analysis; aggregates power alerting. If you keep only aggregates, you will know that quality dropped but not why.
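The instrumentation bullets above reduce to one structured log record per event, stamped with the request's trace ID and version metadata. A minimal sketch, assuming JSON-lines logging; field names are illustrative.

```python
import json
import time
import uuid

def new_trace_id() -> str:
    """One trace ID per user request, propagated to every sub-call."""
    return uuid.uuid4().hex

def log_event(trace_id: str, kind: str, payload: dict, *,
              prompt_hash: str, model_name: str) -> str:
    """Emit one structured log line. Every model, tool, and retrieval
    call shares the request's trace_id so a session can be reassembled,
    and carries prompt_hash so runs are comparable across releases."""
    record = {
        "trace_id": trace_id,
        "ts": time.time(),
        "kind": kind,                # "model_call" | "tool_call" | "retrieval"
        "prompt_hash": prompt_hash,  # version metadata, per the text
        "model_name": model_name,
        "payload": payload,
    }
    return json.dumps(record)
```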
Offline Evaluation Versus Online Evaluation
Offline evaluation uses fixed datasets, golden answers, and replayed tool stubs. It is fast, repeatable, and ideal for regression gates in CI. Online evaluation measures behavior under real data drift, messy user phrasing, and production dependencies. You need both; neither alone is sufficient.
- Offline: run nightly on 800–5,000 scenarios; block releases when task completion drops more than 1.5 points on core suites.
- Online: use shadow traffic or canaries to compare candidate policies; prefer shadow mode, which makes no user-visible changes, when risk is high.
- Continuous: stream lightweight scores (helpfulness, safety flags) on a sample of live traffic with human spot checks.
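The release-blocking rule from the offline bullet can be sketched as a small CI gate. Suite names and rates here are hypothetical; the 1.5-point threshold is the one from the text.

```python
def release_gate(baseline: dict, candidate: dict, max_drop_points: float = 1.5):
    """Block a release when task completion on any core suite drops by
    more than max_drop_points percentage points versus the baseline.
    baseline/candidate map suite name -> completion rate in [0, 1]."""
    failures = []
    for suite, base_rate in baseline.items():
        drop = (base_rate - candidate.get(suite, 0.0)) * 100
        if drop > max_drop_points:
            failures.append((suite, round(drop, 2)))
    return {"blocked": bool(failures), "regressions": failures}
```

A missing suite in the candidate run counts as a full regression, which is usually the right failure mode for a gate: absent evidence should block, not pass.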
LLM-as-judge patterns that survive production scrutiny
Judges should consume the same evidence a human would: user message, tool traces, retrieved documents, and the final answer. Use pairwise comparisons when absolute scoring is noisy; they are often more stable than single-number grades. Calibrate judges against human labels and retire prompts that systematically disagree with your reviewers.
A judge is just another model call with its own failure modes. Treat its output as a metric, not ground truth.
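A pairwise judge with position swapping can be sketched as below. `call_model` is a hypothetical judge-LLM callable (returning `"first"` or `"second"`); in line with the caveat above, the function returns a metric-grade verdict, not ground truth.

```python
def pairwise_judge(task, answer_a, answer_b, evidence, call_model, n_trials=3):
    """Pairwise comparison with alternating presentation order to cancel
    position bias. Use an odd n_trials so votes cannot tie. call_model is
    a judge-LLM callable that sees the same evidence a human reviewer
    would and answers 'first' or 'second'."""
    votes = {"a": 0, "b": 0}
    for trial in range(n_trials):
        swap = trial % 2 == 1  # alternate which answer is shown first
        first, second = (answer_b, answer_a) if swap else (answer_a, answer_b)
        verdict = call_model(task=task, first=first, second=second,
                             evidence=evidence)
        winner = ("b", "a") if swap else ("a", "b")
        votes[winner[0] if verdict == "first" else winner[1]] += 1
    return "a" if votes["a"] > votes["b"] else "b"
```

Taking the callable as a parameter also makes the judge trivially testable with a stub, which is how you calibrate it against human labels offline.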
Regression Testing for Agents
Treat agent changes like any other critical service. Maintain versioned eval suites tagged by capability (retrieval, tools, formatting, safety). On each pull request, run a fast subset (50–120 cases) focused on the touched surface. Nightly jobs run the full corpus and produce trend lines. When a metric regresses, bisect prompt changes, tool schemas, and retrieval settings before touching the base model.
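Selecting the fast PR-time subset by capability tag can be sketched as below; the case shape (`{"id", "tags"}` with tags as a set) is an illustrative assumption.

```python
def select_fast_subset(cases, touched_tags, min_cases=50, max_cases=120):
    """Pick the PR-time subset: cases tagged with a touched capability
    first, then backfill with other cases up to min_cases so even a
    small surface change still runs a floor of smoke coverage."""
    touched = [c for c in cases if c["tags"] & touched_tags]
    rest = [c for c in cases if not (c["tags"] & touched_tags)]
    subset = touched[:max_cases]
    if len(subset) < min_cases:
        subset += rest[: min_cases - len(subset)]
    return subset
```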
A/B Testing Agent Behavior
A/B tests for agents should pre-register primary metrics (task completion, P95 latency, cost per success) and guardrail metrics (hallucination rate, policy violations, tool error rate). Use consistent bucketing on user or session IDs. Short experiments with high variance need larger samples than typical web experiments; plan for at least several thousand eligible tasks per arm when effect sizes are small.
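The "several thousand tasks per arm" claim can be checked with the standard two-proportion sample-size approximation. This sketch fixes significance at two-sided 95% and power at 80%; treat it as a planning estimate, not a substitute for a proper power analysis.

```python
import math

def per_arm_sample_size(p_base: float, delta: float) -> int:
    """Approximate per-arm sample size to detect an absolute change of
    `delta` in a proportion, at two-sided alpha=0.05 and 80% power."""
    z_alpha = 1.96    # two-sided 95% confidence
    z_beta = 0.8416   # 80% power
    p_bar = p_base + delta / 2
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / delta ** 2
    return math.ceil(n)

# Detecting a 2-point change on a 90% completion rate needs a few
# thousand eligible tasks per arm, consistent with the text.
print(per_arm_sample_size(0.90, 0.02))
```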
MongoDB Atlas: Durable Eval History and Dashboards
Evaluation generates high-cardinality time series and rich documents: traces, judge outputs, human labels, and release metadata. MongoDB Atlas fits this shape well. Store one document per task execution with embedded tool steps, index by service version and task family, and query historical performance when you need to explain a regression from six weeks ago.
- Persist eval runs with {run_id, git_sha, prompt_hash, model_id, suite_name, aggregate_metrics} for auditability.
- Use compound indexes on (date, task_family) and (version, outcome) so dashboard queries return in under a second at moderate scale.
- Materialize nightly summary collections for leadership views while retaining raw traces in a colder tier or bucket export pipeline.
- Join human review queues with agent traces using a shared trace_id field so reviewers see full context in one lookup.
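The one-document-per-execution shape from the bullets above can be sketched in plain Python; field names follow the text, and the actual writes (for example via PyMongo's `insert_one` and `create_index`) are omitted so the shape stands on its own.

```python
import datetime

def task_execution_doc(run_id, git_sha, prompt_hash, model_id,
                       task_family, outcome, tool_steps):
    """One document per task execution, with tool steps embedded so a
    regression from weeks ago can be explained from a single lookup."""
    return {
        "run_id": run_id,
        "git_sha": git_sha,
        "prompt_hash": prompt_hash,
        "model_id": model_id,
        "task_family": task_family,
        "outcome": outcome,  # "pass" | "fail"
        "date": datetime.datetime.now(datetime.timezone.utc),
        # Embedded steps: [{"tool", "args", "latency_ms", "status"}, ...]
        "tool_steps": tool_steps,
    }

# Compound index key specs matching the dashboard queries in the text.
INDEXES = [
    [("date", 1), ("task_family", 1)],
    [("version", 1), ("outcome", 1)],
]
```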
Redis Cloud: Real-Time Counters, Streaming Scores, and Latency Tracking
Operational monitoring for agents needs sub-second visibility. Redis Cloud excels at counters, rolling windows, and lightweight leaderboards. Track running success counts, sliding-window P95 estimates per route, and circuit-breaker style signals when tool failure rates spike.
- Increment per-outcome counters per model version for live success-rate tiles.
- Maintain time-bucketed hashes for latency histograms updated on each request; expire keys with TTL aligned to your SLO window.
- Push streaming eval scores from online judges into Redis Streams for async consumers that write summaries to Atlas.
- Use Redis as a coordination layer for A/B assignment caching so routing stays fast and consistent within a session.
Used together, Redis Cloud gives you the live nervous system while MongoDB Atlas becomes the memory of what shipped, how it scored, and whether the last change was an improvement. Close the loop by alerting off Redis-backed SLOs and driving postmortems off Atlas-stored traces.
Putting It Together
Evaluating agents in production is an engineering discipline: define tasks clearly, instrument deeply, separate offline regression from online drift, use judges humbly, and store enough history to learn from mistakes. When metrics move, you should be able to name the version, the prompt, and the tool trace behind the change. That is the bar for trustworthy iteration.