Monitoring Redis Cloud with Datadog: Setup, Key Metrics, and AI Workload Dashboards
Datadog is the fastest path to production monitoring for Redis Cloud. Install the integration, point at your cluster, and dashboards appear in minutes. But speed of setup creates a different problem: metric sprawl. Redis Cloud emits 200+ metrics per database. At Datadog's per-custom-metric pricing, scraping everything is a fast way to a $50K/year monitoring bill. The discipline is in choosing what to collect.
Datadog makes it easy to monitor everything. The skill is in monitoring only what matters — and filtering out the 180+ metrics that don't.
How the Integration Works
Redis Cloud supports Datadog integration natively. There are two approaches depending on your setup: the Redis Cloud native integration (managed, configured entirely in the Redis Cloud console) and the Datadog Agent's Redis check (an agent you run yourself that scrapes the database endpoint directly).
Option 1: Redis Cloud Native Integration
Redis Cloud Pro and Enterprise subscriptions offer a built-in Datadog integration. In the Redis Cloud console, navigate to your subscription settings, select the Datadog integration, enter your Datadog API key and region, and enable it. Redis Cloud pushes metrics directly to Datadog — no agent deployment needed. Metrics appear under the redis.cloud namespace.
Option 2: Datadog Agent with Redis Check
If you need more control or want to combine Redis metrics with host-level metrics from your application servers, deploy the Datadog Agent on a host that can reach your Redis Cloud endpoint (via VPC peering). Configure the Redis check in the agent's conf.d/redisdb.d/conf.yaml with the Redis Cloud endpoint, port, and password. The agent scrapes metrics on each check interval (default 15 seconds) and pushes to Datadog.
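As a sketch, assuming a standard Agent 7 layout and using placeholder endpoint, port, and password values you would replace with your own database's, the check config might look like:

```yaml
# conf.d/redisdb.d/conf.yaml on the agent host
init_config:

instances:
  - host: redis-12345.internal.example.cloud   # placeholder Redis Cloud endpoint
    port: 12345                                # placeholder port from the console
    password: "<your-database-password>"       # prefer the agent's secrets management over plaintext
    ssl: true                                  # enable if TLS is configured on your database
```

After editing, restart the agent and confirm the check is passing with `datadog-agent status`.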
| Approach | Setup | Agent Required | Best For |
|---|---|---|---|
| Redis Cloud native integration | Console config + API key | No | Quick setup, managed metrics, no infrastructure to maintain |
| Datadog Agent + Redis check | Agent on EC2/EKS + conf.yaml | Yes | Combined host + Redis metrics, custom check intervals, metric filtering |
The Metric Filtering Problem
Datadog charges per custom metric per month. A single Redis Cloud database can emit 200+ distinct metric names. With multiple databases and shards, you can easily hit thousands of custom metrics. At $0.05 per custom metric per month (standard tier), 2,000 unfiltered metrics across 5 databases = $100/month just for Redis metrics you'll never look at.
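The billing arithmetic is worth making explicit. A quick back-of-the-envelope estimator, using the $0.05/metric/month standard-tier rate quoted above (the function name and figures are illustrative, not a Datadog API):

```python
def monthly_metric_cost(metric_count: int, price_per_metric: float = 0.05) -> float:
    """Estimate monthly Datadog custom-metric spend in USD."""
    return metric_count * price_per_metric

# The 2,000-metric example above, versus a 20-metric whitelist per database
unfiltered = monthly_metric_cost(2000)
filtered = monthly_metric_cost(5 * 20)
print(f"unfiltered: ${unfiltered:.2f}/mo, filtered: ${filtered:.2f}/mo")
```

The whitelist cuts the Redis line item by roughly 95% while keeping every metric in the tables below.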
The fix: filter at the source. In the agent's conf.yaml, use the metric_patterns option's include list to collect only the metrics you need. For the native integration, where you can't filter before ingestion, use Metrics without Limits to stop indexing the metrics you don't want billed (the Metrics Summary page shows which ones are actually queried).
Every Redis metric you don't filter is a line item on your Datadog invoice. Whitelist the 15-20 that matter. Exclude the rest.
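Sketched against the agent config, a whitelist could look like the following. metric_patterns is an agent-level instance option in recent Agent 7 releases; the patterns are regexes, and the endpoint values are placeholders:

```yaml
instances:
  - host: redis-12345.internal.example.cloud   # placeholder endpoint
    port: 12345
    password: "<your-database-password>"
    metric_patterns:
      include:                 # series not matching any pattern are dropped before submission
        - redis\.net\..*
        - redis\.mem\..*
        - redis\.keys\.evicted
        - redis\.stats\.keyspace.*
        - redis\.replication\.delay
```

Verify the resulting series count in Metrics Summary after a deploy; a typo in a pattern silently drops the metric rather than erroring.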
The 15 Metrics That Matter for AI Workloads
Out of 200+ available metrics, these are the ones that directly correlate with AI agent performance, reliability, and cost.
Latency (The Golden Signal)
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| redis.net.commands.duration | Average command latency. The single most important metric. | Alert if P99 > 5ms |
| redis.slowlog.micros.95percentile | 95th percentile of slow commands from Redis slowlog. | Alert if > 10ms |
| redis.net.instantaneous_ops_per_sec | Current throughput. Use alongside latency to detect saturation. | Alert on sudden drop (> 50% decrease) |
Memory (Saturation)
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| redis.mem.used | Total memory consumed by data + overhead. | Alert at 80% of maxmemory |
| redis.mem.fragmentation_ratio | Memory fragmentation (RSS / logical memory used). Above 1.5 means the OS holds 50% more memory than the data needs. | Alert if > 1.5 sustained |
| redis.keys.evicted | Keys evicted due to maxmemory. For AI context stores, each eviction is lost data. | Alert on any non-zero rate |
| redis.mem.rss | Resident set size — actual OS memory consumed. | Alert if RSS >> used (indicates fragmentation) |
Connections (Traffic)
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| redis.net.clients | Current connected clients. | Alert at 80% of maxclients |
| redis.net.rejected_connections | Connections rejected because maxclients was reached. | Alert on any non-zero value |
| redis.net.blocked_clients | Clients blocked on BLPOP/BRPOP/BLMOVE. Non-zero is normal for blocking queue consumers. | Alert on unexpected sustained growth |
Cache Effectiveness
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| redis.stats.keyspace_hits | Successful key lookups. Use with misses for hit ratio. | N/A (use ratio) |
| redis.stats.keyspace_misses | Failed key lookups. | Alert if hit ratio < 90% |
| redis.keys.expired | Keys expired by TTL. Expected behavior, but track for patterns. | Informational |
Replication
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| redis.replication.delay | Replication lag in seconds. | Alert if > 5 seconds |
| redis.net.slaves | Number of connected replicas. | Alert if drops below expected count |
Building the Dashboard
Organize your Datadog Redis dashboard into four rows mapping to the Golden Signals. Each row answers one question.
- Row 1 — Latency: Timeseries of command duration (P50, P90, P99). Heatmap of slowlog entries. Goal: spot latency spikes instantly.
- Row 2 — Traffic: Timeseries of ops/sec and connected clients. Overlay with application request rate for correlation. Goal: understand load patterns.
- Row 3 — Errors: Timeseries of rejected connections, evicted keys, and keyspace miss rate. Goal: zero on all three.
- Row 4 — Saturation: Gauge of memory usage vs maxmemory. Timeseries of fragmentation ratio. Connection count vs maxclients. Goal: stay below 80% on all.
Add a fifth row for AI-specific metrics: cache hit ratio (computed from keyspace_hits / (hits + misses)), eviction rate (keys lost from context store), and command breakdown by type (track FT.SEARCH latency separately from GET/SET for vector search workloads).
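The hit-ratio computation is simple enough to sanity-check locally. A minimal sketch using the same counters the dashboard formula divides (function name is illustrative; in Datadog the equivalent is a formula query over redis.stats.keyspace_hits and redis.stats.keyspace_misses):

```python
def cache_hit_ratio(keyspace_hits: int, keyspace_misses: int) -> float:
    """Hit ratio from Redis INFO counters; treat no traffic as a perfect ratio."""
    total = keyspace_hits + keyspace_misses
    return 1.0 if total == 0 else keyspace_hits / total

print(cache_hit_ratio(9_000, 1_000))  # 0.9
```

The zero-traffic guard matters on dashboards: a freshly provisioned database should not render as a 0% hit ratio and trip the alert.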
Alerting: What Deserves a Page
Not every metric needs an alert. Alerts should fire only when a human needs to act. Over-alerting causes alert fatigue and missed real incidents.
| Alert | Condition | Severity |
|---|---|---|
| Memory critical | redis.mem.used > 90% of maxmemory for 5 minutes | P1 — Evictions imminent |
| Connections exhausted | redis.net.rejected_connections > 0 | P1 — Clients being dropped |
| Latency spike | redis.net.commands.duration P99 > 10ms for 5 minutes | P2 — Performance degrading |
| Replication lag | redis.replication.delay > 10 seconds for 5 minutes | P2 — Read replicas serving stale data |
| Eviction rate | redis.keys.evicted rate > 100/min | P2 — Context data being lost |
| Cache hit ratio drop | Hit ratio < 80% for 15 minutes | P3 — Investigate query pattern change |
Six alerts. That's it. If your Redis dashboard has 20 alerts, you have zero useful alerts — because the team has already learned to ignore them.
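As one concrete illustration, the memory-critical alert can be defined as a Datadog metric monitor. This is a sketch, not a drop-in definition: it assumes redis.mem.used and redis.mem.maxmemory are both collected under those names and that your series carry an env:prod tag; adjust names, tags, and the notification handle to your environment:

```json
{
  "name": "Redis Cloud memory critical",
  "type": "metric alert",
  "query": "avg(last_5m):avg:redis.mem.used{env:prod} / avg:redis.mem.maxmemory{env:prod} > 0.9",
  "message": "Redis memory above 90% of maxmemory for 5 minutes. Evictions imminent. @pagerduty",
  "options": {
    "thresholds": {"critical": 0.9, "warning": 0.8},
    "notify_no_data": true
  }
}
```

The warning threshold at 80% gives the on-call a buffer before the P1 fires, matching the 80% saturation goal in the dashboard section.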
Cost Optimization: Controlling Your Datadog Redis Bill
- Whitelist metrics in conf.yaml — Only collect the 15-20 metrics listed above. Drop everything else with a metric_patterns include list.
- Use metric tags wisely — Tags increase cardinality. Avoid high-cardinality tags like request_id or user_id on Redis metrics.
- Set the collection interval appropriately — The default 15 seconds is fine for production. For dev/staging, raise it to 60 seconds to reduce data volume.
- Use Datadog's Metrics without Limits — Submit all metrics but only index and query the ones you configure. Reduces cost for metrics you want to keep but don't query often.
- Review Metrics Summary monthly — Datadog shows which custom metrics are active. Remove integrations or metrics you're no longer using.
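For the collection-interval point, the knob is min_collection_interval on the check instance (a standard Agent option; 15 seconds is the default for most checks). A sketch for a staging agent, with placeholder connection values:

```yaml
# conf.d/redisdb.d/conf.yaml on a staging agent
instances:
  - host: redis-staging.internal.example.cloud  # placeholder endpoint
    port: 12345
    password: "<staging-password>"
    min_collection_interval: 60   # run the check once a minute instead of every 15s
```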
The best Datadog Redis setup is one where you pay for 20 metrics that you actually look at — not 200 that you ingested because the default config scraped everything.