Guardrails and Safety for Production AI Agents
Production agents sit on untrusted user input, expensive model APIs, and powerful tools. Guardrails are not a single library call—they are a layered system of checks, budgets, and policies enforced before, during, and after each turn. This post walks through what teams actually ship when they say an agent is safe in production.
Why guardrails are infrastructure, not prompts
System prompts can steer behavior, but they cannot enforce budgets, prove who called a tool, or retain tamper-evident history. Engineering teams treat guardrails like API middleware: deterministic rules, measurable latency, and explicit failure modes when something crosses a line.
If you cannot explain in one sentence what happens when a guardrail fires, you do not have a guardrail—you have a hope.
Input validation: prompt injection and PII
Assume every message may contain instructions meant for the model, not your product. Classic jailbreak strings are only one slice; users also paste emails, tickets, and documents that embed hidden directives. Pair heuristic filters with structure-aware checks: separate system and user channels in the API, strip or neutralize obvious delimiter attacks, and run lightweight classifiers or regex passes for high-risk patterns before the prompt ever reaches the planner.
- Treat tool descriptions and retrieved documents as semi-trusted; never let them override signed policy text.
- Use allowlisted tool names and argument schemas; reject free-form tool calls at the boundary.
- Detect and redact or block common PII (SSN patterns, credit cards, national IDs) before logging or forwarding to third parties.
- Cap attachment size and depth of nested context so a single message cannot blow the context window.
Teams often budget 5–25 ms of synchronous input latency for regex and small-model classifiers on the hot path, with heavier analysis (full-document PII scans) offloaded async when policy allows. The goal is predictable overhead, not perfect recall.
| Control | Purpose | Typical placement |
|---|---|---|
| Schema validation | Reject malformed tool args and oversized payloads | API gateway or agent runtime |
| PII detection | Block or mask before storage and downstream APIs | Pre-LLM and pre-log pipeline |
| Instruction firewall | Reduce prompt-injection surface between user and system content | Templating layer |
| Content hash / reputation | Throttle known abuse payloads | Edge + Redis-backed counters |
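The PII row above can be sketched as a synchronous regex pass. The patterns and labels here are illustrative assumptions, not a complete detector—a production pass would cover more formats and apply checksums such as Luhn for card numbers:

```python
import re

# Illustrative patterns only; a real detector covers many more formats.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d[ -]?){12,15}\d\b"),  # 13-16 digits, optional separators
}

def redact_pii(text):
    """Replace matches with a typed placeholder; return hit labels for audit."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            hits.append(label)
            text = pattern.sub(f"[REDACTED-{label}]", text)
    return text, hits
```

Returning the hit labels (not the matched values) lets the audit pipeline count detector firings without re-logging the PII it just removed.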
Output validation: hallucinations, format, and policy
Models can sound confident while citing nonexistent URLs, inventing policy exceptions, or returning JSON that parses but violates business rules. Output guardrails verify structure first (JSON Schema, typed fields), then semantics (allowed enums, numeric ranges), then softer checks such as citation presence for RAG answers or grounding against retrieved chunks.
- Enforce response format with parsers; on failure, retry with a stricter prompt or return a safe fallback.
- Run content filters for regulated categories before rendering to users or writing to CRM systems.
- For numeric claims, cross-check against tool outputs or database facts when available.
- Attach confidence or abstention: sometimes the safest output is "I cannot verify that."
In internal benchmarks, a two-pass output check (schema + lightweight verifier model) often adds 150–400 ms end-to-end for short answers, versus 20–80 ms for schema-only. That trade is usually acceptable for workflows where a wrong answer is costlier than a slower one.
| Check | What it catches | Example failure mode |
|---|---|---|
| JSON Schema | Wrong types, missing keys | Tool call with null "id" field |
| Allowlist URLs | Hallucinated links | Fake doc links in customer-facing chat |
| Numeric bounds | Impossible quantities | Approvals above tenant limit |
| Toxicity / policy classifiers | Brand and compliance risk | Disallowed medical or legal advice tone |
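The structure-then-semantics ordering can be sketched in a few lines. The field names (`action`, `amount_usd`), the action enum, and the refund limit are hypothetical, standing in for whatever schema a real workflow defines:

```python
import json

ALLOWED_ACTIONS = {"refund", "escalate", "reply"}   # illustrative enum
MAX_REFUND_USD = 500.0                              # illustrative tenant limit

def validate_output(raw):
    """Check structure first, then business semantics; raise on violation."""
    data = json.loads(raw)                          # structure: must parse at all
    if not isinstance(data.get("action"), str):
        raise ValueError("missing or non-string 'action'")
    if data["action"] not in ALLOWED_ACTIONS:       # semantics: enum check
        raise ValueError(f"disallowed action {data['action']!r}")
    amount = data.get("amount_usd", 0.0)
    if not isinstance(amount, (int, float)) or not 0 <= amount <= MAX_REFUND_USD:
        raise ValueError("amount outside tenant limit")  # semantics: numeric bounds
    return data
```

On a raised error, the caller decides between a stricter retry prompt and a safe fallback, per the list above.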
Cost circuit breakers: tokens, steps, and runaway loops
An agent stuck in a tool loop can burn thousands of dollars in hours. One real incident class: a misconfigured retriever plus a max-step count in the hundreds produced roughly $1,200–$4,800 in LLM spend over a weekend for a mid-traffic B2B app—mostly input tokens from repeated context. Circuit breakers stop the bleeding: per-run token ceilings, per-user daily caps, and hard limits on planner iterations.
- Per-request token budget: fail closed when projected completion exceeds remaining budget.
- Per-session and per-tenant daily spend caps in USD or token equivalents.
- Loop detection: the same tool called with identical arguments N times in a row triggers a halt.
- Wall-clock timeouts on the whole agent trace, not just individual HTTP calls.
Implement counters in a low-latency store so enforcement is consistent across horizontally scaled workers. When a breaker trips, return a user-safe message, emit an alert, and persist the event for review—never silently retry without bounds.
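The token-budget and loop-detection breakers can be sketched as a small per-run object. The thresholds are illustrative, and in production the counters would live in a shared low-latency store rather than process memory:

```python
from collections import deque

class CircuitBreaker:
    """Per-run breakers: a token ceiling and identical-call loop detection."""

    def __init__(self, token_budget=50_000, max_repeats=3):
        self.tokens_left = token_budget
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=max_repeats)   # sliding window of calls

    def charge(self, tokens):
        """Deduct tokens; fail closed once the budget is exhausted."""
        self.tokens_left -= tokens
        if self.tokens_left < 0:
            raise RuntimeError("token budget exhausted: halting run")

    def observe(self, tool, args):
        """Halt if the same tool + args repeats max_repeats times in a row."""
        self.recent.append((tool, args))
        if len(self.recent) == self.max_repeats and len(set(self.recent)) == 1:
            raise RuntimeError("loop detected: identical call repeated")
```

The planner calls `charge` before each model request and `observe` before each tool dispatch, so a trip stops the run before the next spend, not after.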
Rate limiting per user and tenant
Fairness and abuse control require more than a global API limit. Effective patterns combine user-level, tenant-level, and IP-level keys, with burst allowances for interactive chat but stricter sustained limits for batch or automation endpoints. Sliding-window or token-bucket algorithms work well when state lives in a shared data plane.
| Dimension | Example limit | Rationale |
|---|---|---|
| User | 60 agent turns / minute | Stops scripted flooding from one account |
| Tenant | 5,000 turns / hour | Protects shared infrastructure and billing |
| IP (unauthenticated) | 20 turns / minute | Reduces anonymous scraping |
| Tool family | 10 external API calls / minute per user | Prevents third-party quota burn |
Redis Cloud is a natural enforcement layer: atomic INCR with TTL for fixed windows, or Lua-backed token buckets for smooth burst handling. Sub-millisecond reads and writes mean limits apply at the edge of your agent service without becoming the bottleneck—typical added latency is well under 1 ms per check when colocated in the same region.
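A single-process stand-in for the fixed-window pattern makes the mechanics concrete; in production the counter would be a Redis `INCR` with a TTL so that all horizontally scaled workers share one view of the window:

```python
import time

class FixedWindowLimiter:
    """In-memory sketch of the Redis INCR + EXPIRE fixed-window pattern."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.counters = {}   # key -> (count, window_expiry_timestamp)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        count, expires = self.counters.get(key, (0, now + self.window_s))
        if now >= expires:                        # window rolled over: reset
            count, expires = 0, now + self.window_s
        if count >= self.limit:
            return False                          # over limit: reject
        self.counters[key] = (count + 1, expires)
        return True
```

Keying on `f"{tenant}:{user}"` versus `tenant` alone gives the user- and tenant-level dimensions from the table with the same primitive.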
Tool call authorization: who can do what, when
Not every principal should invoke every tool. Bind tools to roles, environments, and data scopes. A support agent might read tickets but not issue refunds; a finance workflow might post journal entries only after a human approval record exists. Enforce this in code at the tool dispatcher, not only in the prompt.
- Maintain a policy map: principal → allowed tool names → argument constraints.
- Require signed or server-minted capability tokens for high-risk actions.
- Log every tool invocation with correlation IDs tying back to the user, tenant, and model trace.
- Use read-only replicas or scoped API keys so retrieval tools cannot accidentally write.
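The policy map from the first bullet can be sketched as a plain dictionary checked at the dispatcher. The roles, tool names, and the `requires_approval` constraint are illustrative stand-ins for a real policy store:

```python
# principal role -> allowed tool names -> argument/precondition constraints
POLICY = {
    "support": {"read_ticket": {}, "reply": {}},
    "finance": {"read_ticket": {}, "post_journal": {"requires_approval": True}},
}

def authorize(role, tool, has_approval=False):
    """Deny-by-default check run in the tool dispatcher, not the prompt."""
    constraints = POLICY.get(role, {}).get(tool)
    if constraints is None:
        return False                     # not on the allowlist: deny
    if constraints.get("requires_approval") and not has_approval:
        return False                     # high-risk action needs an approval record
    return True
```

Because unknown roles and unknown tools both fall through to `None`, the check fails closed rather than open.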
Human-in-the-loop patterns
Some actions should never be autonomous on the first attempt: large money movement, irreversible deletes, mass outbound email, or access to sensitive HR data. Design the agent to prepare a draft, write a pending state, and wait for an authenticated human approval in a separate channel. Time-bound approvals (for example, 15 minutes) reduce stale actions sitting in queues.
The best production agents escalate early and often; the worst ones complete the task at any cost.
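A minimal sketch of the draft-then-approve pattern with a 15-minute TTL. The in-memory dict stands in for a durable pending-actions store, and all names here are assumptions:

```python
import time
import uuid

APPROVAL_TTL_S = 15 * 60        # time-bound approvals, per the text above
_pending = {}                   # token -> {"action": ..., "expires": ...}

def request_approval(action, now=None):
    """Agent writes a pending draft and returns a token for the approval UI."""
    now = time.time() if now is None else now
    token = uuid.uuid4().hex
    _pending[token] = {"action": action, "expires": now + APPROVAL_TTL_S}
    return token

def approve(token, now=None):
    """Human approval path: returns the action once, or None if unknown/stale."""
    now = time.time() if now is None else now
    record = _pending.pop(token, None)
    if record is None or now > record["expires"]:
        return None             # expired or already consumed: do not execute
    return record["action"]
```

Popping the record makes each approval single-use, so a replayed token cannot re-trigger the action.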
Audit logging and post-incident analytics
Guardrails generate signal: violations, overrides, retries, and breaker trips. That signal belongs in a queryable store with retention policies aligned to compliance. MongoDB Atlas fits well for structured audit documents—each event can capture tenant, user, rule name, severity, redacted input hash, model ID, token usage, and outcome—while Atlas Search or aggregation pipelines support dashboards on trends (which rules fire most, which tenants drift).
- Store guardrail rule definitions as versioned documents so you can replay "what policy applied on Tuesday?"
- Index on tenantId, timestamp, and ruleId for incident timelines.
- Keep raw prompts out of logs when possible; store hashes and truncated excerpts instead.
- Export aggregates to your SIEM or warehouse for longer retention than hot operational data.
| Event type | Fields to capture | Primary consumer |
|---|---|---|
| Rate limit hit | key, limit, current, endpoint | SRE dashboards |
| PII blocked | detector, location, redaction mode | Security / compliance |
| Breaker trip | reason, spend, token estimate | FinOps + on-call |
| HITL requested | action, approver role, SLA | Workflow UI |
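One illustrative shape for such an audit document, destined for an Atlas collection; the field names (`tenantId`, `ruleId`, and so on) are assumptions rather than any standard schema:

```python
import hashlib
import time

def audit_event(tenant_id, user_id, rule_id, severity, raw_input, outcome):
    """Build a guardrail audit document; stores a hash + excerpt, never the full prompt."""
    return {
        "tenantId": tenant_id,       # indexed for incident timelines
        "userId": user_id,
        "ruleId": rule_id,           # joins to the versioned rule definition
        "severity": severity,
        "inputHash": hashlib.sha256(raw_input.encode()).hexdigest(),
        "inputExcerpt": raw_input[:80],
        "outcome": outcome,          # e.g. "blocked", "redacted", "escalated"
        "ts": time.time(),
    }
```

Hashing the input lets analysts correlate repeated abuse payloads across tenants without retaining the raw prompt in hot logs.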
Redis Cloud and MongoDB Atlas in one architecture
Use Redis Cloud for millisecond decisions: rate limit counters, per-run token budgets, circuit breaker flags, short-lived capability tokens, and feature flags that flip stricter modes under attack. Use MongoDB Atlas for durable truth: audit trails, guardrail configuration, violation history, and analytics jobs that answer questions like "How many injection attempts did we see this quarter?" The split keeps the hot path fast while preserving rich, queryable records for security and product teams.
- Redis: INCR/DECR and EXPIRE for windows; hashes for per-tenant budget state; pub/sub or streams for alerting fan-out.
- Atlas: collections for policies, incidents, and anonymized training/eval sets derived from production guardrail outcomes.
- Correlation: propagate a single trace ID from edge through Redis checks into Atlas audit documents.
- Drills: regularly simulate breaker trips and verify Atlas retention and Redis failover behavior match runbooks.
Shipping production agents means designing for adversarial input, expensive failure modes, and regulatory scrutiny. Layer input and output validation, hard cost and rate limits, explicit tool policy, human gates for irreversible work, and auditable records. Redis Cloud and MongoDB Atlas are not mandatory brand choices—they are representative of the speed-vs-durability split that makes guardrails both enforceable in real time and explainable after the fact.