Monitoring Redis Cloud with Datadog: Setup, Key Metrics, and AI Workload Dashboards
Datadog is the fastest path to production monitoring for Redis Cloud. Install the integration, point at your cluster, and dashboards appear in minutes. But speed of setup creates a different problem: metric sprawl. Redis Cloud emits 200+ metrics per database. At Datadog's per-custom-metric pricing, scraping everything is a fast way to a $50K/year monitoring bill. The discipline is in choosing what to collect.
Datadog makes it easy to monitor everything. The skill is in monitoring only what matters — and filtering out the 180+ metrics that don't.
How the Integration Works
Redis Cloud supports Datadog integration natively. There are two approaches depending on your setup: the Redis Cloud native integration (managed, configured entirely in the Redis Cloud console) and the Datadog Agent's Redis check (an agent you run yourself that scrapes the database endpoint directly).
Option 1: Redis Cloud Native Integration
Redis Cloud Pro and Enterprise subscriptions offer a built-in Datadog integration. In the Redis Cloud console, navigate to your subscription settings, select the Datadog integration, enter your Datadog API key and region, and enable it. Redis Cloud pushes metrics directly to Datadog — no agent deployment needed. Metrics appear under the redis.cloud namespace.
Option 2: Datadog Agent with Redis Check
If you need more control or want to combine Redis metrics with host-level metrics from your application servers, deploy the Datadog Agent on a host that can reach your Redis Cloud endpoint (via VPC peering). Configure the Redis check in the agent's conf.d/redisdb.d/conf.yaml with the Redis Cloud endpoint, port, and password. The agent scrapes metrics on each check interval (default 15 seconds) and pushes to Datadog.
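As a sketch, assuming a standard Agent 7 layout and using placeholder endpoint, port, and password values you would replace with your own database's, the check config might look like:

```yaml
# conf.d/redisdb.d/conf.yaml on the agent host
init_config:

instances:
  - host: redis-12345.internal.example.cloud   # placeholder Redis Cloud endpoint
    port: 12345                                # placeholder port from the console
    password: "<your-database-password>"       # prefer the agent's secrets management over plaintext
    ssl: true                                  # enable if TLS is configured on your database
```

After editing, restart the agent and confirm the check is passing with `datadog-agent status`.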
| Approach | Setup | Agent Required | Best For |
|---|---|---|---|
| Redis Cloud native integration | Console config + API key | No | Quick setup, managed metrics, no infrastructure to maintain |
| Datadog Agent + Redis check | Agent on EC2/EKS + conf.yaml | Yes | Combined host + Redis metrics, custom check intervals, metric filtering |
The Metric Filtering Problem
Datadog charges per custom metric per month. A single Redis Cloud database can emit 200+ distinct metric names. With multiple databases and shards, you can easily hit thousands of custom metrics. At $0.05 per custom metric per month (standard tier), 2,000 unfiltered metrics across 5 databases = $100/month just for Redis metrics you'll never look at.
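The billing arithmetic is worth making explicit. A quick back-of-the-envelope estimator, using the $0.05/metric/month standard-tier rate quoted above (the function name and figures are illustrative, not a Datadog API):

```python
def monthly_metric_cost(metric_count: int, price_per_metric: float = 0.05) -> float:
    """Estimate monthly Datadog custom-metric spend in USD."""
    return metric_count * price_per_metric

# The 2,000-metric example above, versus a 20-metric whitelist per database
unfiltered = monthly_metric_cost(2000)
filtered = monthly_metric_cost(5 * 20)
print(f"unfiltered: ${unfiltered:.2f}/mo, filtered: ${filtered:.2f}/mo")
```

The whitelist cuts the Redis line item by roughly 95% while keeping every metric in the tables below.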
The fix: filter at the source. In the agent's conf.yaml, use the metric_patterns option's include list to collect only the metrics you need. For the native integration, where you can't filter before ingestion, use Metrics without Limits to stop indexing the metrics you don't want billed (the Metrics Summary page shows which ones are actually queried).
Every Redis metric you don't filter is a line item on your Datadog invoice. Whitelist the 15-20 that matter. Exclude the rest.
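Sketched against the agent config, a whitelist could look like the following. metric_patterns is an agent-level instance option in recent Agent 7 releases; the patterns are regexes, and the endpoint values are placeholders:

```yaml
instances:
  - host: redis-12345.internal.example.cloud   # placeholder endpoint
    port: 12345
    password: "<your-database-password>"
    metric_patterns:
      include:                 # series not matching any pattern are dropped before submission
        - redis\.net\..*
        - redis\.mem\..*
        - redis\.keys\.evicted
        - redis\.stats\.keyspace.*
        - redis\.replication\.delay
```

Verify the resulting series count in Metrics Summary after a deploy; a typo in a pattern silently drops the metric rather than erroring.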
The 15 Metrics That Matter for AI Workloads
Out of 200+ available metrics, these are the ones that directly correlate with AI agent performance, reliability, and cost.
Latency (The Golden Signal)
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| redis.net.commands.duration | Average command latency. The single most important metric. | Alert if P99 > 5ms |
| redis.slowlog.micros.95percentile | 95th percentile of slow commands from Redis slowlog. | Alert if > 10ms |
| redis.net.instantaneous_ops_per_sec | Current throughput. Use alongside latency to detect saturation. | Alert on sudden drop (> 50% decrease) |
Memory (Saturation)
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| redis.mem.used | Total memory consumed by data + overhead. | Alert at 80% of maxmemory |
| redis.mem.fragmentation_ratio | Memory fragmentation (RSS / logical memory used). Above 1.5 means the OS holds 50% more memory than the data needs. | Alert if > 1.5 sustained |
| redis.keys.evicted | Keys evicted due to maxmemory. For AI context stores, each eviction is lost data. | Alert on any non-zero rate |
| redis.mem.rss | Resident set size — actual OS memory consumed. | Alert if RSS >> used (indicates fragmentation) |
Connections (Traffic)
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| redis.net.clients | Current connected clients. | Alert at 80% of maxclients |
| redis.net.rejected_connections | Connections rejected because maxclients was reached. | Alert on any non-zero value |
| redis.net.blocked_clients | Clients blocked on BLPOP/BRPOP/BLMOVE. Non-zero is normal for blocking queue consumers. | Alert on unexpected sustained growth |
Cache Effectiveness
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| redis.stats.keyspace_hits | Successful key lookups. Use with misses for hit ratio. | N/A (use ratio) |
| redis.stats.keyspace_misses | Failed key lookups. | Alert if hit ratio < 90% |
| redis.keys.expired | Keys expired by TTL. Expected behavior, but track for patterns. | Informational |
Replication
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| redis.replication.delay | Replication lag in seconds. | Alert if > 5 seconds |
| redis.net.slaves | Number of connected replicas. | Alert if drops below expected count |
Building the Dashboard
Organize your Datadog Redis dashboard into four rows mapping to the Golden Signals. Each row answers one question.
- Row 1 — Latency: Timeseries of command duration (P50, P90, P99). Heatmap of slowlog entries. Goal: spot latency spikes instantly.
- Row 2 — Traffic: Timeseries of ops/sec and connected clients. Overlay with application request rate for correlation. Goal: understand load patterns.
- Row 3 — Errors: Timeseries of rejected connections, evicted keys, and keyspace miss rate. Goal: zero on all three.
- Row 4 — Saturation: Gauge of memory usage vs maxmemory. Timeseries of fragmentation ratio. Connection count vs maxclients. Goal: stay below 80% on all.
Add a fifth row for AI-specific metrics: cache hit ratio (computed from keyspace_hits / (hits + misses)), eviction rate (keys lost from context store), and command breakdown by type (track FT.SEARCH latency separately from GET/SET for vector search workloads).
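The hit-ratio computation is simple enough to sanity-check locally. A minimal sketch using the same counters the dashboard formula divides (function name is illustrative; in Datadog the equivalent is a formula query over redis.stats.keyspace_hits and redis.stats.keyspace_misses):

```python
def cache_hit_ratio(keyspace_hits: int, keyspace_misses: int) -> float:
    """Hit ratio from Redis INFO counters; treat no traffic as a perfect ratio."""
    total = keyspace_hits + keyspace_misses
    return 1.0 if total == 0 else keyspace_hits / total

print(cache_hit_ratio(9_000, 1_000))  # 0.9
```

The zero-traffic guard matters on dashboards: a freshly provisioned database should not render as a 0% hit ratio and trip the alert.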
Alerting: What Deserves a Page
Not every metric needs an alert. Alerts should fire only when a human needs to act. Over-alerting causes alert fatigue and missed real incidents.
| Alert | Condition | Severity |
|---|---|---|
| Memory critical | redis.mem.used > 90% of maxmemory for 5 minutes | P1 — Evictions imminent |
| Connections exhausted | redis.net.rejected_connections > 0 | P1 — Clients being dropped |
| Latency spike | redis.net.commands.duration P99 > 10ms for 5 minutes | P2 — Performance degrading |
| Replication lag | redis.replication.delay > 10 seconds for 5 minutes | P2 — Read replicas serving stale data |
| Eviction rate | redis.keys.evicted rate > 100/min | P2 — Context data being lost |
| Cache hit ratio drop | Hit ratio < 80% for 15 minutes | P3 — Investigate query pattern change |
Six alerts. That's it. If your Redis dashboard has 20 alerts, you have zero useful alerts — because the team has already learned to ignore them.
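As one concrete illustration, the memory-critical alert can be defined as a Datadog metric monitor. This is a sketch, not a drop-in definition: it assumes redis.mem.used and redis.mem.maxmemory are both collected under those names and that your series carry an env:prod tag; adjust names, tags, and the notification handle to your environment:

```json
{
  "name": "Redis Cloud memory critical",
  "type": "metric alert",
  "query": "avg(last_5m):avg:redis.mem.used{env:prod} / avg:redis.mem.maxmemory{env:prod} > 0.9",
  "message": "Redis memory above 90% of maxmemory for 5 minutes. Evictions imminent. @pagerduty",
  "options": {
    "thresholds": {"critical": 0.9, "warning": 0.8},
    "notify_no_data": true
  }
}
```

The warning threshold at 80% gives the on-call a buffer before the P1 fires, matching the 80% saturation goal in the dashboard section.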
Cost Optimization: Controlling Your Datadog Redis Bill
- Whitelist metrics in conf.yaml — Only collect the 15-20 metrics listed above. Drop everything else with a metric_patterns include list.
- Use metric tags wisely — Tags increase cardinality. Avoid high-cardinality tags like request_id or user_id on Redis metrics.
- Set the collection interval appropriately — The default 15 seconds is fine for production. For dev/staging, raise it to 60 seconds to reduce data volume.
- Use Datadog's Metrics without Limits — Submit all metrics but only index and query the ones you configure. Reduces cost for metrics you want to keep but don't query often.
- Review Metrics Summary monthly — Datadog shows which custom metrics are active. Remove integrations or metrics you're no longer using.
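For the collection-interval point, the knob is min_collection_interval on the check instance (a standard Agent option; 15 seconds is the default for most checks). A sketch for a staging agent, with placeholder connection values:

```yaml
# conf.d/redisdb.d/conf.yaml on a staging agent
instances:
  - host: redis-staging.internal.example.cloud  # placeholder endpoint
    port: 12345
    password: "<staging-password>"
    min_collection_interval: 60   # run the check once a minute instead of every 15s
```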
The best Datadog Redis setup is one where you pay for 20 metrics that you actually look at — not 200 that you ingested because the default config scraped everything.