Evaluating AI Agents in Production: Metrics, Instrumentation, and Data Systems
Shipping an AI agent is easy compared to knowing whether it is getting better. In production, failures are often partial: the model answers politely, calls the wrong tool, or completes the task while subtly violating a policy. Evaluation is how you turn those fuzzy failures into measurable regressions, cost signals, and release gates that your team can trust.
Start With Outcomes, Not Model Scores
Model-level benchmarks rarely predict business outcomes for agents that orchestrate tools, retrieve documents, and iterate over multiple steps. Production evaluation should anchor on task-level success, safety, latency, and economics. Everything else is diagnostic detail you use when those north-star metrics move.
Core metrics every production agent should track
| Metric | What it measures | Typical healthy band (indicative) |
|---|---|---|
| Task completion rate | End-to-end success on defined tasks | 92%–97% on curated eval sets; 78%–88% on noisy live traffic |
| Hallucination rate | Outputs contradicted by ground truth or policy | Under 3% on fact-heavy tasks; under 1% after RAG hardening |
| Tool call accuracy | Correct tool, arguments, and ordering | 96%–99% argument validity; under 2% wrong-tool rate |
| P95 end-to-end latency | Wall clock from request to final answer | 1.8s–4.0s for chat; 8s–25s for multi-tool workflows |
| Token cost per successful task | Spend divided by completed tasks | $0.012–$0.09 for mid-tier models at moderate context |
| Context utilization | Share of context window used before truncation | 55%–80% peak; spikes above 90% correlate with quality drops |
The bands above are not universal targets. They are reference ranges we see when teams have basic guardrails, retrieval, and structured outputs in place. Your baseline depends on domain risk, user tolerance, and how strictly you define success.
Task Completion and Hallucination: Make Success Binary First
Define a task schema: inputs, allowed tools, required artifacts, and acceptance checks. Completion should be machine-checkable when possible (JSON schema validation, database row state, test assertions on generated code). For open-ended work, use rubric scoring but still publish a primary pass or fail field so dashboards stay legible.
- Log the final assistant message, all tool calls with arguments and responses, retrieval snippets with IDs, and the prompt template version hash.
- Store human labels on a stratified sample (5%–15% of production traffic) to calibrate automated judges weekly.
- Track disagreement rate between human reviewers; above 12% usually means your rubric or task definition is ambiguous.
- Separate policy violations from factual errors; they regress for different reasons and need different mitigations.
If you cannot say what correct looks like for a task, you cannot evaluate the agent—only the prose.
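The "machine-checkable completion" idea above can be sketched as a small acceptance check. The task schema here (a refund task with required output fields and an allowed status set) is hypothetical, invented for illustration; real schemas would come from your task definitions.

```python
import json

# Hypothetical task schema: required artifact fields and acceptance checks.
TASK_SCHEMA = {
    "required_fields": ["order_id", "refund_amount", "status"],
    "allowed_status": {"refunded", "rejected"},
}

def check_completion(final_message: str, schema: dict) -> dict:
    """Return a primary pass/fail plus a diagnostic reason for dashboards."""
    try:
        artifact = json.loads(final_message)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "output_not_json"}
    missing = [f for f in schema["required_fields"] if f not in artifact]
    if missing:
        return {"passed": False, "reason": "missing_fields:" + ",".join(missing)}
    if artifact["status"] not in schema["allowed_status"]:
        return {"passed": False, "reason": "invalid_status"}
    return {"passed": True, "reason": "ok"}
```

The `reason` field stays machine-readable so failures can be grouped on a dashboard, while `passed` remains the single binary signal the text recommends.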
Tool call accuracy beyond pass or fail
Tool errors are the silent killer of agent reliability. Instrument each call with latency, HTTP status, retry count, idempotency key, and a normalized error class. Validate argument JSON against a schema before execution when safety allows; when you must execute speculatively, log both the proposed and the executed arguments.
| Signal | Example threshold | Why it matters |
|---|---|---|
| Wrong-tool rate | Below 1.5% of calls | High values indicate prompt drift or ambiguous tool descriptions |
| Validator rejection rate | Below 4% after two weeks stable | Spikes often precede user-visible failures |
| Retry success rate | Above 70% when retries are allowed | Low success means flaky dependencies, not model noise |
| Average tools per successful task | Stable within ±8% week over week | Sudden jumps usually mean new ambiguity or context bloat |
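A minimal sketch of the pre-execution validation step, assuming a hypothetical tool registry of expected argument types. The normalized error classes feed the wrong-tool and validator-rejection rates in the table above.

```python
# Hypothetical registry mapping tool names to expected argument types.
TOOL_ARG_TYPES = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount": float},
}

def validate_tool_call(tool: str, args: dict) -> str:
    """Return a normalized error class, or 'ok' if the call may execute."""
    expected = TOOL_ARG_TYPES.get(tool)
    if expected is None:
        return "wrong_tool"  # counts toward the wrong-tool rate
    for name, typ in expected.items():
        if name not in args:
            return "missing_argument"
        if not isinstance(args[name], typ):
            return "argument_type_mismatch"
    if set(args) - set(expected):
        return "unexpected_argument"
    return "ok"  # safe to execute; still log the arguments
```

Because every rejection maps to a small fixed vocabulary of classes, spikes in any one class are easy to alert on before they become user-visible failures.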
Latency Budgets and Token Economics
Agents are multi-step systems. Break latency into model time, retrieval time, tool round-trips, and queueing. A common pattern is that tool and retrieval time dominate once the agent becomes capable; at that point, optimizing the LLM alone stops moving the P95.
| Stage | Example P50 | Example P95 | Notes |
|---|---|---|---|
| First token | 320 ms | 890 ms | Includes routing and auth |
| Retrieval | 140 ms | 410 ms | Vector + rerank + cache miss path |
| Single tool round-trip | 180 ms | 920 ms | Highly dependent on downstream APIs |
| Full multi-step task | 4.2 s | 17 s | Three to six tool calls typical |
Token cost per task should be reported with and without cache hits. Teams that only look at averages hide regressions where a small fraction of sessions burn extreme context. Track the 90th percentile token count per task family alongside the mean.
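The percentile-alongside-mean reporting above can be sketched as follows; the nearest-rank percentile and the `(task_family, cache_hit)` record shape are illustrative choices, not a prescribed format.

```python
import math
from collections import defaultdict

def percentile(values, q):
    """Nearest-rank percentile; adequate for dashboard-grade estimates."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def token_report(records):
    """records: iterable of (task_family, tokens, cache_hit).
    Report mean and P90 per family, split by cache hits so a small
    fraction of context-burning sessions is not averaged away."""
    by_family = defaultdict(list)
    for family, tokens, cache_hit in records:
        by_family[(family, cache_hit)].append(tokens)
    return {
        key: {"mean": sum(v) / len(v), "p90": percentile(v, 90)}
        for key, v in by_family.items()
    }
```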
How to Instrument Agents for Evaluation
- Emit a trace ID per user request and propagate it through every model, tool, and retrieval call.
- Record model name, temperature, max tokens, structured output mode, and safety filter outcomes.
- Capture retrieval scores, top-k IDs, and whether the answer cited those IDs.
- Version prompts and tools in the log; evaluations without version metadata are not comparable across releases.
Store raw transcripts for sampled sessions and aggregate metrics for all sessions. Raw logs power root cause analysis; aggregates power alerting. If you keep only aggregates, you will know that quality dropped but not why.
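The instrumentation bullets above reduce to one structured log record per event, stamped with the request's trace ID and version metadata. A minimal sketch, assuming JSON-lines logging; field names are illustrative.

```python
import json
import time
import uuid

def new_trace_id() -> str:
    """One trace ID per user request, propagated to every sub-call."""
    return uuid.uuid4().hex

def log_event(trace_id: str, kind: str, payload: dict, *,
              prompt_hash: str, model_name: str) -> str:
    """Emit one structured log line. Every model, tool, and retrieval
    call shares the request's trace_id so a session can be reassembled,
    and carries prompt_hash so runs are comparable across releases."""
    record = {
        "trace_id": trace_id,
        "ts": time.time(),
        "kind": kind,                # "model_call" | "tool_call" | "retrieval"
        "prompt_hash": prompt_hash,  # version metadata, per the text
        "model_name": model_name,
        "payload": payload,
    }
    return json.dumps(record)
```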
Offline Evaluation Versus Online Evaluation
Offline evaluation uses fixed datasets, golden answers, and replayed tool stubs. It is fast, repeatable, and ideal for regression gates in CI. Online evaluation measures behavior under real data drift, messy user phrasing, and production dependencies. You need both; neither alone is sufficient.
- Offline: run nightly on 800–5,000 scenarios; block releases when task completion drops more than 1.5 points on core suites.
- Online: use shadow traffic or canaries to compare candidate policies; prefer shadow mode, which makes no user-visible changes, when risk is high.
- Continuous: stream lightweight scores (helpfulness, safety flags) on a sample of live traffic with human spot checks.
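The release-blocking rule from the offline bullet can be sketched as a small CI gate. Suite names and rates here are hypothetical; the 1.5-point threshold is the one from the text.

```python
def release_gate(baseline: dict, candidate: dict, max_drop_points: float = 1.5):
    """Block a release when task completion on any core suite drops by
    more than max_drop_points percentage points versus the baseline.
    baseline/candidate map suite name -> completion rate in [0, 1]."""
    failures = []
    for suite, base_rate in baseline.items():
        drop = (base_rate - candidate.get(suite, 0.0)) * 100
        if drop > max_drop_points:
            failures.append((suite, round(drop, 2)))
    return {"blocked": bool(failures), "regressions": failures}
```

A missing suite in the candidate run counts as a full regression, which is usually the right failure mode for a gate: absent evidence should block, not pass.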
LLM-as-judge patterns that survive production scrutiny
Judges should consume the same evidence a human would: user message, tool traces, retrieved documents, and the final answer. Use pairwise comparisons when absolute scoring is noisy; they are often more stable than single-number grades. Calibrate judges against human labels and retire prompts that systematically disagree with your reviewers.
A judge is just another model call with its own failure modes. Treat its output as a metric, not ground truth.
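A pairwise judge with position swapping can be sketched as below. `call_model` is a hypothetical judge-LLM callable (returning `"first"` or `"second"`); in line with the caveat above, the function returns a metric-grade verdict, not ground truth.

```python
def pairwise_judge(task, answer_a, answer_b, evidence, call_model, n_trials=3):
    """Pairwise comparison with alternating presentation order to cancel
    position bias. Use an odd n_trials so votes cannot tie. call_model is
    a judge-LLM callable that sees the same evidence a human reviewer
    would and answers 'first' or 'second'."""
    votes = {"a": 0, "b": 0}
    for trial in range(n_trials):
        swap = trial % 2 == 1  # alternate which answer is shown first
        first, second = (answer_b, answer_a) if swap else (answer_a, answer_b)
        verdict = call_model(task=task, first=first, second=second,
                             evidence=evidence)
        winner = ("b", "a") if swap else ("a", "b")
        votes[winner[0] if verdict == "first" else winner[1]] += 1
    return "a" if votes["a"] > votes["b"] else "b"
```

Taking the callable as a parameter also makes the judge trivially testable with a stub, which is how you calibrate it against human labels offline.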
Regression Testing for Agents
Treat agent changes like any other critical service. Maintain versioned eval suites tagged by capability (retrieval, tools, formatting, safety). On each pull request, run a fast subset (50–120 cases) focused on the touched surface. Nightly jobs run the full corpus and produce trend lines. When a metric regresses, bisect prompt changes, tool schemas, and retrieval settings before touching the base model.
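Selecting the fast PR-time subset by capability tag can be sketched as below; the case shape (`{"id", "tags"}` with tags as a set) is an illustrative assumption.

```python
def select_fast_subset(cases, touched_tags, min_cases=50, max_cases=120):
    """Pick the PR-time subset: cases tagged with a touched capability
    first, then backfill with other cases up to min_cases so even a
    small surface change still runs a floor of smoke coverage."""
    touched = [c for c in cases if c["tags"] & touched_tags]
    rest = [c for c in cases if not (c["tags"] & touched_tags)]
    subset = touched[:max_cases]
    if len(subset) < min_cases:
        subset += rest[: min_cases - len(subset)]
    return subset
```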
A/B Testing Agent Behavior
A/B tests for agents should pre-register primary metrics (task completion, P95 latency, cost per success) and guardrail metrics (hallucination rate, policy violations, tool error rate). Use consistent bucketing on user or session IDs. Short experiments with high variance need larger samples than typical web experiments; plan for at least several thousand eligible tasks per arm when effect sizes are small.
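The "several thousand tasks per arm" claim can be checked with the standard two-proportion sample-size approximation. This sketch fixes significance at two-sided 95% and power at 80%; treat it as a planning estimate, not a substitute for a proper power analysis.

```python
import math

def per_arm_sample_size(p_base: float, delta: float) -> int:
    """Approximate per-arm sample size to detect an absolute change of
    `delta` in a proportion, at two-sided alpha=0.05 and 80% power."""
    z_alpha = 1.96    # two-sided 95% confidence
    z_beta = 0.8416   # 80% power
    p_bar = p_base + delta / 2
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / delta ** 2
    return math.ceil(n)

# Detecting a 2-point change on a 90% completion rate needs a few
# thousand eligible tasks per arm, consistent with the text.
print(per_arm_sample_size(0.90, 0.02))
```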
MongoDB Atlas: Durable Eval History and Dashboards
Evaluation generates high-cardinality time series and rich documents: traces, judge outputs, human labels, and release metadata. MongoDB Atlas fits this shape well. Store one document per task execution with embedded tool steps, index by service version and task family, and query historical performance when you need to explain a regression from six weeks ago.
- Persist eval runs with {run_id, git_sha, prompt_hash, model_id, suite_name, aggregate_metrics} for auditability.
- Use compound indexes on (date, task_family) and (version, outcome) so dashboard queries return in under a second at moderate scale.
- Materialize nightly summary collections for leadership views while retaining raw traces in a colder tier or bucket export pipeline.
- Join human review queues with agent traces using a shared trace_id field so reviewers see full context in one lookup.
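The one-document-per-execution shape from the bullets above can be sketched in plain Python; field names follow the text, and the actual writes (for example via PyMongo's `insert_one` and `create_index`) are omitted so the shape stands on its own.

```python
import datetime

def task_execution_doc(run_id, git_sha, prompt_hash, model_id,
                       task_family, outcome, tool_steps):
    """One document per task execution, with tool steps embedded so a
    regression from weeks ago can be explained from a single lookup."""
    return {
        "run_id": run_id,
        "git_sha": git_sha,
        "prompt_hash": prompt_hash,
        "model_id": model_id,
        "task_family": task_family,
        "outcome": outcome,  # "pass" | "fail"
        "date": datetime.datetime.now(datetime.timezone.utc),
        # Embedded steps: [{"tool", "args", "latency_ms", "status"}, ...]
        "tool_steps": tool_steps,
    }

# Compound index key specs matching the dashboard queries in the text.
INDEXES = [
    [("date", 1), ("task_family", 1)],
    [("version", 1), ("outcome", 1)],
]
```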
Redis Cloud: Real-Time Counters, Streaming Scores, and Latency Tracking
Operational monitoring for agents needs sub-second visibility. Redis Cloud excels at counters, rolling windows, and lightweight leaderboards. Track running success counts, sliding-window P95 estimates per route, and circuit-breaker style signals when tool failure rates spike.
- Increment per-outcome counters per model version for live success-rate tiles.
- Maintain time-bucketed hashes for latency histograms updated on each request; expire keys with TTL aligned to your SLO window.
- Push streaming eval scores from online judges into Redis Streams for async consumers that write summaries to Atlas.
- Use Redis as a coordination layer for A/B assignment caching so routing stays fast and consistent within a session.
Used together, Redis Cloud gives you the live nervous system while MongoDB Atlas becomes the memory of what shipped, how it scored, and whether the last change was an improvement. Close the loop by alerting off Redis-backed SLOs and driving postmortems off Atlas-stored traces.
Putting It Together
Evaluating agents in production is an engineering discipline: define tasks clearly, instrument deeply, separate offline regression from online drift, use judges humbly, and store enough history to learn from mistakes. When metrics move, you should be able to name the version, the prompt, and the tool trace behind the change. That is the bar for trustworthy iteration.