LLM Gateway Architecture: Routing, Resilience, and Cost at the Edge
If your application calls OpenAI today, Anthropic tomorrow, and a self-hosted vLLM cluster next week, you already feel the pain: divergent SDKs, uneven reliability, opaque latency, and invoices that spike when a single expensive model becomes the default. An LLM gateway is the control plane that normalizes those differences behind one internal API.
What an LLM Gateway Is (and Why You Need One)
At minimum, a gateway is an authenticated HTTP edge that accepts your canonical request shape, attaches the right provider credentials, translates payloads, enforces quotas, and returns a consistent response envelope. In practice it is also where you implement routing policies, caching, circuit breaking, and the telemetry that product and finance teams actually read.
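The canonical request and response envelope described above can be sketched as a pair of small dataclasses. This is a minimal illustration, not a standard schema; field names like `model_alias` and `tenant_id` are assumptions for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class GatewayRequest:
    # Canonical shape every internal client sends, regardless of provider.
    model_alias: str            # logical alias, e.g. "prod.chat.default"
    messages: list              # provider-neutral chat messages
    tenant_id: str              # used for quotas, routing rules, and billing
    max_tokens: int = 1024
    stream: bool = True


@dataclass
class GatewayResponse:
    # Consistent envelope returned to clients after provider translation.
    text: str
    provider: str
    model_id: str
    input_tokens: int
    output_tokens: int
    finish_reason: str = "stop"
```

Clients only ever see these two shapes; everything provider-specific lives behind the gateway's adapter layer.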
Symptoms that mean you are ready
- Engineers hard-code provider URLs and API keys across services.
- Incidents require redeploys to switch models or disable a bad endpoint.
- You cannot answer simple questions: cost per feature, p95 time-to-first-token by provider, or cache hit rate for embeddings.
- A single provider brownout takes down customer-facing flows with no graceful degradation path.
Routing Across OpenAI, Anthropic, Google, and Local Models
Multi-provider routing is not exotic anymore. Teams map logical model aliases (for example, prod.chat.default) to concrete endpoints. The gateway resolves the alias to a provider, model ID, and region; applies transforms (message format, tool schema, safety settings); and streams tokens back through one SSE or WebSocket contract your clients already understand.
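Alias resolution can be sketched as a plain lookup with per-tenant overrides applied first. The routing table entries and model IDs below are illustrative placeholders, not real endpoint configuration.

```python
# Routing table mapping logical aliases to concrete provider targets.
# In the architecture above this would be cached in Redis and versioned in Atlas.
ROUTES = {
    "prod.chat.default": {"provider": "openai", "model": "gpt-4.1", "region": "us-east"},
    "prod.chat.cheap":   {"provider": "google", "model": "gemini-flash", "region": "us-central"},
}


def resolve(alias, overrides=None):
    """Resolve a model alias to provider + model ID + region.

    `overrides` maps alias -> alias and models per-tenant rules or
    experiments that redirect traffic before the concrete lookup.
    """
    alias = (overrides or {}).get(alias, alias)
    route = ROUTES.get(alias)
    if route is None:
        raise KeyError(f"unknown model alias: {alias}")
    return route
```

Keeping the alias indirection in one place is what makes incident-time model swaps a config change rather than a redeploy.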
| Provider | Typical p50 TTFT (chat) | Indicative input $/1M tokens | Indicative output $/1M tokens | Notes for gateway design |
|---|---|---|---|---|
| OpenAI (GPT-4.1-class) | 350–900 ms | ~$2–5 | ~$8–16 | Strong tool-use ecosystem; watch rate-limit headers and organization-level quotas. |
| Anthropic (Claude 3.5/4-class) | 400–950 ms | ~$3–6 | ~$12–22 | Long-context workloads; normalize system vs user blocks carefully in adapters. |
| Google (Gemini 2.x-class) | 320–850 ms | ~$1.5–4 | ~$6–14 | Multimodal payloads differ; gateway should centralize media preprocessing. |
| Local (vLLM / TGI on GPU) | 80–600 ms | CapEx + power | N/A (self-hosted) | Best for steady, high-QPS internal traffic; expose health and queue depth to the gateway. |
Figures above are order-of-magnitude ranges observed across North American regions with warm connections and modest prompts; your measured p50 and p99 will dominate policy decisions. The gateway should record per-provider histograms rather than trusting vendor marketing pages.
Load balancing strategies that work in production
- Weighted random among healthy backends when models are fungible (same capability tier).
- Least outstanding requests for streaming workloads, so slow streams do not starve the pool.
- Key-hash by tenant or session for sticky routing when provider-side KV caches or prompt templates benefit from locality.
- Canary slices: route 1–5% of traffic to a candidate model and compare error rate, latency, and downstream task success.
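Three of the strategies above can be sketched in a few lines each; the backend dicts (with `healthy`, `weight`, `inflight` fields) are an assumed in-memory stand-in for real health-checked pool state.

```python
import hashlib
import random


def weighted_pick(backends):
    """Weighted random among healthy backends (fungible models)."""
    healthy = [b for b in backends if b["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy backends")
    total = sum(b["weight"] for b in healthy)
    r = random.uniform(0, total)
    for b in healthy:
        r -= b["weight"]
        if r <= 0:
            return b
    return healthy[-1]


def least_outstanding(backends):
    """Least outstanding requests: slow streams do not starve the pool."""
    healthy = [b for b in backends if b["healthy"]]
    return min(healthy, key=lambda b: b["inflight"])


def sticky_pick(backends, tenant_id):
    """Key-hash by tenant for sticky routing (provider-side cache locality)."""
    healthy = [b for b in backends if b["healthy"]]
    digest = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
    return healthy[digest % len(healthy)]
```

Canary slices are just `weighted_pick` with a 1–5% weight on the candidate plus separate metric labels for comparison.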
Automatic Failover When a Provider Degrades
Transient failures dominate real outages. In aggregated production logs we commonly see 0.2–1.5% of calls fail with 429, 5xx, or connect timeouts during busy windows even when dashboards show green. A gateway implements circuit breakers: after error-rate or latency thresholds trip, traffic shifts to secondary models for a cooling period, then half-open probes restore the primary.
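The closed → open → half-open cycle described above can be sketched with an error-rate breaker over a sliding window of recent calls. In production this state would live in Redis so all gateway replicas agree; the in-process version below is a simplified stand-in, and the thresholds are illustrative.

```python
import time


class CircuitBreaker:
    """Minimal error-rate circuit breaker: closed -> open -> half-open."""

    def __init__(self, error_threshold=0.5, window=20, cooldown_s=30.0):
        self.error_threshold = error_threshold
        self.window = window          # number of recent calls tracked
        self.cooldown_s = cooldown_s
        self.results = []             # True = success, False = failure
        self.opened_at = None         # None means the breaker is closed

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # After the cooldown, let requests through as half-open probes.
        return now - self.opened_at >= self.cooldown_s

    def record(self, ok, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is not None:
            # Half-open probe result: close on success, re-open on failure.
            if ok:
                self.opened_at = None
                self.results = []
            else:
                self.opened_at = now
            return
        self.results.append(ok)
        self.results = self.results[-self.window:]
        failures = self.results.count(False)
        if (len(self.results) >= self.window
                and failures / len(self.results) >= self.error_threshold):
            self.opened_at = now
```

While the breaker is open, the router simply skips that backend and falls through to the secondary model tier.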
| Gateway component | Primary responsibility | Redis Cloud role | MongoDB Atlas role |
|---|---|---|---|
| Router / policy engine | Resolve model alias, tenant rules, and experiments | Hot cache of routing tables and feature flags (sub-ms reads) | Versioned routing documents, audit trail, scheduled promotions |
| Adapter layer | Request/response mapping per provider | None (stateless) or short-lived scratch for large uploads | Store adapter compatibility matrices and regression fixtures |
| Resilience | Retries, timeouts, circuit breakers, bulkheads | Breaker state, per-tenant token buckets, concurrency semaphores | Incident timelines, postmortem queries on failover events |
| Cache | Semantic and exact response reuse | Vector index or exact key cache with TTL and eviction | Optional cold archive of cache entries for debugging (PII-aware) |
| Observability | Metrics, traces, structured logs | Real-time counters and rolling windows for rate dashboards | Durable request/response metadata, cost rollups, BI exports |
Redis Cloud gives you replicated, low-latency memory for the data that must be correct within milliseconds: limiter counters, breaker flags, and routing snapshots. MongoDB Atlas holds the slower-moving and compliance-sensitive history: who was routed where, what it cost, and how each provider behaved over weeks.
Request and Response Transformation
Normalization is where gateways earn their keep. Tool definitions, JSON mode, reasoning traces, and image parts all differ. Keep transforms declarative (JSON or DSL) and test them with golden files. At the edge, strip or redact PII before persistence, clamp max tokens, and attach internal trace IDs that propagate to every provider call.
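One common transform is splitting system blocks out of the message list, since some providers take the system prompt as a separate top-level field. The sketch below shows the shape of such an adapter; the output field names follow Anthropic's general request style but are illustrative, and the 4096 clamp is an assumed policy, not a provider limit.

```python
def to_anthropic_style(req):
    """Map a canonical request dict to a provider payload that takes
    `system` as a top-level field instead of an in-list message."""
    system = " ".join(
        m["content"] for m in req["messages"] if m["role"] == "system"
    )
    msgs = [m for m in req["messages"] if m["role"] != "system"]
    return {
        "model": req["model"],
        "system": system or None,
        "messages": msgs,
        # Clamp max tokens at the edge, per gateway policy.
        "max_tokens": min(req.get("max_tokens", 1024), 4096),
    }
```

Golden-file tests then pin each adapter's output byte-for-byte, so a provider schema change surfaces as a failing fixture instead of a production incident.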
API Key Management Without Sprawl
Never distribute raw provider keys to every microservice. The gateway holds secrets in a vault or KMS; MongoDB Atlas can store encrypted key references plus rotation metadata, while Redis Cloud caches short-lived derived tokens only when a provider issues them. Per-tenant subkeys or proxy credentials let you revoke access in one place.
Cost-Based Routing: Send Cheap Work to Cheap Models
Not every user message needs a frontier model. Classify intent with a small classifier or heuristic router, then map tiers: summaries and regex-able tasks to compact models, code generation to coding-specialized endpoints, and only high-stakes reasoning to flagship SKUs. Teams that instrument cost per successful task routinely cut spend 25–45% without hurting quality metrics—if routing is measured, not guessed.
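A tier router can start as a heuristic and be replaced by a trained classifier later without changing the interface. The keywords, length threshold, and alias names below are toy assumptions to show the shape, not a recommended ruleset.

```python
def classify_tier(prompt):
    """Toy heuristic intent router; a real deployment would use a small
    classifier model trained on labeled traffic."""
    p = prompt.lower()
    if len(p) < 200 and any(k in p for k in ("summarize", "tl;dr", "extract")):
        return "compact"
    if any(k in p for k in ("def ", "class ", "function", "refactor", "traceback")):
        return "code"
    return "flagship"


# Tier -> logical alias; the gateway's normal alias resolution takes it from here.
TIER_TO_ALIAS = {
    "compact":  "prod.chat.cheap",
    "code":     "prod.code.default",
    "flagship": "prod.chat.default",
}
```

The important part is the indirection: the classifier emits a tier, the tier maps to an alias, and cost reporting is keyed on both, so you can verify the 25–45% claim against your own success metrics.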
Latency-Based Routing and SLO Guardrails
Maintain rolling p95 time-to-first-token per route in Redis (time-series friendly structures or sorted sets with TTL). When SLO risk rises—say, p95 exceeds 1.2 seconds for three consecutive windows—shift discretionary traffic to a faster provider or enable a cached answer path. Latency routing pairs naturally with hedged requests for idempotent non-streaming calls, but use hedging sparingly on chat streams to avoid double billing.
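The "three consecutive breached windows" rule can be sketched as a small guard fed one window of TTFT samples at a time. The nearest-rank p95 and the in-memory state are simplifications; as noted above, the rolling samples would live in Redis in production.

```python
def p95(samples_ms):
    """Nearest-rank p95 over one window of latency samples."""
    s = sorted(samples_ms)
    idx = max(0, int(0.95 * len(s)) - 1)
    return s[idx]


class SloGuard:
    """Trip after p95 TTFT breaches the SLO for N consecutive windows."""

    def __init__(self, slo_ms=1200, breach_windows=3):
        self.slo_ms = slo_ms
        self.breach_windows = breach_windows
        self.consecutive = 0

    def observe_window(self, ttft_samples_ms):
        if p95(ttft_samples_ms) > self.slo_ms:
            self.consecutive += 1
        else:
            self.consecutive = 0
        # True means: shift discretionary traffic / enable the cached path.
        return self.consecutive >= self.breach_windows
```

Requiring consecutive windows rather than a single breach keeps the router from flapping on one noisy interval.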
Semantic Caching at the Gateway Layer
Exact-match caches miss on paraphrases. Semantic caching stores embeddings of prompts (or normalized prompt hashes plus embedding buckets) in Redis vector capabilities or a sidecar index, then serves prior completions when similarity exceeds a threshold. Typical cache hit rates land between 8% and 30% for support and documentation bots, directly reducing provider spend and tail latency.
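The lookup logic can be sketched with cosine similarity over unit-normalized vectors. To keep the sketch self-contained, `embed` below is a stand-in (character-bigram hashing into a small dense vector); a real gateway would call an embedding model and store vectors in Redis's vector index rather than a Python list.

```python
import math


def embed(text):
    """Stand-in embedding: hash character bigrams into a unit vector."""
    v = [0.0] * 64
    for a, b in zip(text, text[1:]):
        v[(ord(a) * 31 + ord(b)) % 64] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


class SemanticCache:
    """Serve a prior completion when prompt similarity clears a threshold."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, completion)

    def get(self, prompt):
        q = embed(prompt)
        best, best_sim = None, 0.0
        for emb, completion in self.entries:
            # Dot product == cosine similarity, since vectors are unit-norm.
            sim = sum(a * b for a, b in zip(q, emb))
            if sim > best_sim:
                best, best_sim = completion, sim
        return best if best_sim >= self.threshold else None

    def put(self, prompt, completion):
        self.entries.append((embed(prompt), completion))
```

The threshold is the key tuning knob: too low and paraphrase hits become wrong-answer hits; too high and you converge back to an exact-match cache.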
Treat the gateway as a product surface: if you cannot explain routing, cost, and failure behavior in one diagram, your operators will improvise—and improvisation does not scale.
Request Queuing and Backpressure
When upstreams throttle, unbounded in-memory queues become outages. Push queue depth, per-tenant fair scheduling, and deadline-aware dropping into Redis lists, streams, or consumer groups. Clients receive explicit 429 or retry-after semantics instead of hanging sockets. For burst workloads, combine queueing with autoscaling workers that drain toward provider rate limits without violating per-tenant fairness.
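The bounded-queue-plus-deadline pattern can be sketched in a few lines; the deque stands in for a Redis list or stream, and the `False` return is what the edge maps to an explicit 429 with Retry-After.

```python
import collections
import time


class BoundedQueue:
    """Bounded FIFO with deadline-aware dropping: callers get an explicit
    rejection instead of a hanging socket when the queue is full."""

    def __init__(self, maxlen=100):
        self.q = collections.deque()
        self.maxlen = maxlen

    def enqueue(self, item, deadline):
        if len(self.q) >= self.maxlen:
            return False  # edge maps this to HTTP 429 + Retry-After
        self.q.append((deadline, item))
        return True

    def dequeue(self, now=None):
        now = time.monotonic() if now is None else now
        while self.q:
            deadline, item = self.q.popleft()
            if deadline > now:
                return item
            # Deadline passed: drop expired work rather than burning
            # provider quota on an answer nobody is waiting for.
        return None
```

Per-tenant fairness then becomes one such queue per tenant with workers draining them round-robin toward the provider rate limit.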
Logging, Analytics, and Operational Truth
Log structured events: tenant, route, model, token counts, finish reason, cache status, and trace ID. Ship metrics to your observability stack, but keep MongoDB Atlas as the queryable system of record for finance and reliability reviews—aggregates by feature flag, customer segment, and provider. That is how you catch silent quality regressions when a cheaper route slips error rate from 0.4% to 2.1%.
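One structured event per request might look like the sketch below; the field names mirror the list above but are illustrative, not a fixed schema.

```python
import json
import time
import uuid


def log_event(tenant, route, model, usage, cache_status, finish_reason):
    """Serialize one gateway request event as a JSON line suitable for
    shipping to the metrics pipeline and persisting in the system of record."""
    event = {
        "ts": time.time(),
        "trace_id": str(uuid.uuid4()),   # propagated to every provider call
        "tenant": tenant,
        "route": route,
        "model": model,
        "input_tokens": usage.get("input", 0),
        "output_tokens": usage.get("output", 0),
        "cache": cache_status,           # "hit", "miss", or "bypass"
        "finish_reason": finish_reason,
    }
    return json.dumps(event)
```

Because every event carries tenant, route, and cache status, the error-rate-by-route query that catches a 0.4% → 2.1% regression is a one-line aggregation instead of a log-grepping exercise.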
Reference SLO table (illustrative targets)
| Signal | Target | Where to measure |
|---|---|---|
| Gateway availability | 99.95% monthly | Synthetic probes + edge success rate |
| p95 TTFT (interactive chat) | < 1.0 s | Per provider histogram in metrics backend |
| Provider error budget | < 0.5% 5xx/timeout | Gateway logs correlated with provider status |
| Semantic cache hit rate | > 12% for FAQ-like traffic | Redis cache metadata + Atlas daily rollup |
| Cost drift alert | > 15% WoW without launch | Atlas cost aggregates by feature |
Redis Cloud: Hot Path State That Must Be Fast
- Exact and semantic response caches with TTLs aligned to content freshness requirements.
- Token-bucket and sliding-window rate limiters per tenant, API key, and route.
- Circuit breaker and half-open probe state replicated for regional failover.
- Cached routing configuration and model capability maps refreshed on change events.
- Queues and backpressure signals using Redis Streams for fair worker consumption.
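The token-bucket limiter from the list above reduces to a refill-then-take step. The in-process class below is a sketch; in Redis the same refill-and-take would typically run as a single atomic script so all gateway replicas share one bucket per tenant.

```python
class TokenBucket:
    """Per-tenant token-bucket rate limiter (in-process sketch)."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s        # steady-state refill rate
        self.burst = burst            # maximum bucket capacity
        self.tokens = float(burst)
        self.last = 0.0               # timestamp of the last refill

    def take(self, now, cost=1.0):
        # Refill based on elapsed time, capped at burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                  # caller returns 429 / Retry-After
```

Setting `cost` proportional to estimated tokens rather than a flat 1.0 per request makes the limiter track actual provider quota consumption.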
MongoDB Atlas: Durable History and Configuration
- Append-only request and response metadata for compliance, debugging, and replay (with redaction policies).
- Cost analytics: input/output tokens, cached tokens, and blended $/1K tokens by product surface.
- Provider performance history: error classes, latency percentiles, and failover counts over time.
- Encrypted references to API keys, rotation schedules, and scoped credentials.
- Authoritative routing rules, experiments, and promotion workflows with version tags.
A mature LLM gateway is boring on purpose: one contract for clients, explicit policies for routing and spend, and fast state in Redis Cloud paired with trustworthy history in MongoDB Atlas. Build it before provider count and traffic force you to bolt on partial fixes—your future on-call self will recognize the shape of the architecture immediately.