
Monitoring Redis Cloud with Datadog: Setup, Key Metrics, and AI Workload Dashboards

Polystreak Team · 2026-03-20 · 9 min read

Datadog is the fastest path to production monitoring for Redis Cloud. Install the integration, point it at your cluster, and dashboards appear in minutes. But speed of setup creates a different problem: metric sprawl. Redis Cloud emits 200+ metrics per database, and multiple databases, shards, and tags multiply that into thousands of billable series. At Datadog's per-custom-metric pricing, scraping everything is a fast way to a five-figure annual monitoring bill. The discipline is in choosing what to collect.

Datadog makes it easy to monitor everything. The skill is in monitoring only what matters — and filtering out the 180+ metrics that don't.

How the Integration Works

Redis Cloud supports Datadog integration natively. There are two approaches depending on your setup: the Redis Cloud native Datadog integration (managed, configured in the Redis Cloud console) and the Datadog Agent's Redis check (direct endpoint scraping from a host you manage).

Option 1: Redis Cloud Native Integration

Redis Cloud Pro and Enterprise subscriptions offer a built-in Datadog integration. In the Redis Cloud console, navigate to your subscription settings, select the Datadog integration, enter your Datadog API key and region, and enable it. Redis Cloud pushes metrics directly to Datadog — no agent deployment needed. Metrics appear under the redis.cloud namespace.

Option 2: Datadog Agent with Redis Check

If you need more control or want to combine Redis metrics with host-level metrics from your application servers, deploy the Datadog Agent on a host that can reach your Redis Cloud endpoint (via VPC peering). Configure the Redis check in the agent's conf.d/redisdb.d/conf.yaml with the Redis Cloud endpoint, port, and password. The agent scrapes metrics on each check interval (default 15 seconds) and pushes to Datadog.
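A minimal sketch of the check config described above — the endpoint, port, and password are placeholders for your own Redis Cloud database values, and the tags are illustrative:

```yaml
# conf.d/redisdb.d/conf.yaml -- illustrative sketch, not a drop-in config.
# Replace host/port/password with your Redis Cloud database's values.
init_config:

instances:
  - host: redis-12345.c1.us-east-1-1.ec2.cloud.redislabs.com  # example endpoint
    port: 12345
    password: "<database-password>"
    # Tag the series so dashboards and monitors can filter per environment
    tags:
      - env:prod
      - service:agent-context-store
```

Restart the agent after editing and confirm the check is reporting with `datadog-agent status`.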

| Approach | Setup | Agent Required | Best For |
| --- | --- | --- | --- |
| Redis Cloud native integration | Console config + API key | No | Quick setup, managed metrics, no infrastructure to maintain |
| Datadog Agent + Redis check | Agent on EC2/EKS + conf.yaml | Yes | Combined host + Redis metrics, custom check intervals, metric filtering |

The Metric Filtering Problem

Datadog charges per custom metric per month. A single Redis Cloud database can emit 200+ distinct metric names. With multiple databases and shards, you can easily hit thousands of custom metrics. At $0.05 per custom metric per month (standard tier), 2,000 unfiltered metrics across 5 databases = $100/month just for Redis metrics you'll never look at.
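The back-of-envelope math is worth making explicit. A quick sketch, using the $0.05/metric/month figure from above (actual Datadog rates vary by plan and tier):

```python
# Back-of-envelope Datadog custom-metric cost for Redis Cloud scraping.
# $0.05 per custom metric per month is the standard-tier figure cited above;
# real pricing varies by plan.
PRICE_PER_METRIC_MONTH = 0.05

def monthly_cost(metrics_per_db: int, databases: int) -> float:
    """Monthly cost if every emitted metric name is indexed as a custom metric."""
    return metrics_per_db * databases * PRICE_PER_METRIC_MONTH

unfiltered = monthly_cost(400, 5)  # ~2,000 series across 5 databases
filtered = monthly_cost(20, 5)     # only the 15-20 allowlisted metrics
print(unfiltered, filtered)        # 100.0 5.0
```

Filtering down to the short allowlist cuts the Redis line item by roughly 95%.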

The fix: use Datadog's metric filtering. With the agent, allowlist only the metrics you need in the check's conf.yaml (recent Agent versions support per-instance include/exclude lists). For the native integration, use Datadog's Metrics Summary to identify and exclude metrics you don't want billed.

Every Redis metric you don't filter is a line item on your Datadog invoice. Whitelist the 15-20 that matter. Exclude the rest.
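One way to express the allowlist, assuming a recent Datadog Agent that supports per-instance `metric_patterns` include lists (check your Agent version's docs before relying on this — the endpoint and port below are placeholders):

```yaml
# conf.d/redisdb.d/conf.yaml -- allowlist sketch. `metric_patterns` is
# available on recent Agent versions; verify support for your Agent first.
instances:
  - host: <redis-cloud-endpoint>
    port: 12345
    password: "<database-password>"
    metric_patterns:
      include:                  # only series matching these prefixes are kept
        - redis.net.commands
        - redis.slowlog
        - redis.mem
        - redis.keys
        - redis.net.clients
        - redis.net.rejected_connections
        - redis.net.blocked_clients
        - redis.stats.keyspace
        - redis.replication.delay
```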

The 15 Metrics That Matter for AI Workloads

Out of 200+ available metrics, these are the ones that directly correlate with AI agent performance, reliability, and cost.

Latency (The Golden Signal)

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| redis.net.commands.duration | Average command latency. The single most important metric. | Alert if P99 > 5ms |
| redis.slowlog.micros.95percentile | 95th percentile of slow commands from the Redis slowlog. | Alert if > 10ms |
| redis.net.instantaneous_ops_per_sec | Current throughput. Use alongside latency to detect saturation. | Alert on sudden drop (> 50% decrease) |

Memory (Saturation)

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| redis.mem.used | Total memory consumed by data + overhead. | Alert at 80% of maxmemory |
| redis.mem.fragmentation_ratio | Memory fragmentation. Above 1.5 means 50% waste. | Alert if > 1.5 sustained |
| redis.keys.evicted | Keys evicted due to maxmemory. For AI context stores, each eviction is lost data. | Alert on any non-zero rate |
| redis.mem.rss | Resident set size: actual OS memory consumed. | Alert if RSS >> used (indicates fragmentation) |

Connections (Traffic)

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| redis.net.clients | Current connected clients. | Alert at 80% of maxclients |
| redis.net.rejected_connections | Connections rejected because maxclients was reached. | Alert on any non-zero value |
| redis.net.blocked_clients | Clients blocked on BLPOP/BRPOP. | Alert if > 0 sustained |

Cache Effectiveness

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| redis.stats.keyspace_hits | Successful key lookups. Use with misses for hit ratio. | N/A (use ratio) |
| redis.stats.keyspace_misses | Failed key lookups. | Alert if hit ratio < 90% |
| redis.keys.expired | Keys expired by TTL. Expected behavior, but track for patterns. | Informational |

Replication

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| redis.replication.delay | Replication lag in seconds. | Alert if > 5 seconds |
| redis.net.slaves | Number of connected replicas. | Alert if it drops below the expected count |

Building the Dashboard

Organize your Datadog Redis dashboard into four rows mapping to the Golden Signals. Each row answers one question.

  • Row 1 — Latency: Timeseries of command duration (P50, P90, P99). Heatmap of slowlog entries. Goal: spot latency spikes instantly.
  • Row 2 — Traffic: Timeseries of ops/sec and connected clients. Overlay with application request rate for correlation. Goal: understand load patterns.
  • Row 3 — Errors: Timeseries of rejected connections, evicted keys, and keyspace miss rate. Goal: zero on all three.
  • Row 4 — Saturation: Gauge of memory usage vs maxmemory. Timeseries of fragmentation ratio. Connection count vs maxclients. Goal: stay below 80% on all.
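In Datadog's dashboard JSON, one of these rows might look roughly like the widget below. The schema is abbreviated and the scope tag is an assumption — treat this as a sketch of the shape, not a paste-ready definition:

```json
{
  "definition": {
    "title": "Latency: command duration",
    "type": "timeseries",
    "requests": [
      { "q": "avg:redis.net.commands.duration{env:prod}" },
      { "q": "max:redis.net.commands.duration{env:prod}" }
    ]
  }
}
```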

Add a fifth row for AI-specific metrics: cache hit ratio (computed from keyspace_hits / (hits + misses)), eviction rate (keys lost from context store), and command breakdown by type (track FT.SEARCH latency separately from GET/SET for vector search workloads).
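The derived hit-ratio number can also be computed directly from the raw counters. A minimal sketch, operating on a dict shaped like the `stats` section of Redis `INFO` (a live check would pass `redis.Redis(...).info("stats")` instead of a literal):

```python
# Cache hit ratio from Redis INFO counters: keyspace_hits / (hits + misses).
def hit_ratio(stats: dict) -> float:
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    # No lookups yet: report 1.0 rather than divide by zero.
    return hits / total if total else 1.0

stats = {"keyspace_hits": 9_450, "keyspace_misses": 550}
print(f"hit ratio: {hit_ratio(stats):.1%}")  # hit ratio: 94.5%
```

The same arithmetic can be expressed as a dashboard formula query over the two keyspace metrics, so the widget updates continuously instead of point-in-time.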

Alerting: What Deserves a Page

Not every metric needs an alert. Alerts should fire only when a human needs to act. Over-alerting causes alert fatigue and missed real incidents.

| Alert | Condition | Severity |
| --- | --- | --- |
| Memory critical | redis.mem.used > 90% of maxmemory for 5 minutes | P1: Evictions imminent |
| Connections exhausted | redis.net.rejected_connections > 0 | P1: Clients being dropped |
| Latency spike | redis.net.commands.duration P99 > 10ms for 5 minutes | P2: Performance degrading |
| Replication lag | redis.replication.delay > 10 seconds for 5 minutes | P2: Read replicas serving stale data |
| Eviction rate | redis.keys.evicted rate > 100/min | P2: Context data being lost |
| Cache hit ratio drop | Hit ratio < 80% for 15 minutes | P3: Investigate query pattern change |

Six alerts. That's it. If your Redis dashboard has 20 alerts, you have zero useful alerts, because the team has already learned to ignore them.
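In Datadog's monitor query syntax, the two P1 alerts might look roughly like this. The scope tag is an assumption, and the memory query assumes a maxmemory gauge is available from your integration — adjust to the metric names your account actually receives:

```
# Memory critical: used memory above 90% of maxmemory, sustained 5 minutes
avg(last_5m):avg:redis.mem.used{env:prod} / avg:redis.mem.maxmemory{env:prod} > 0.9

# Connections exhausted: any rejected connection in the last 5 minutes
sum(last_5m):sum:redis.net.rejected_connections{env:prod} > 0
```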

Cost Optimization: Controlling Your Datadog Redis Bill

  • Whitelist metrics in conf.yaml — Only collect the 15-20 metrics listed above. Exclude everything else with the check's include/exclude metric pattern lists.
  • Use metric tags wisely — Tags increase cardinality. Avoid high-cardinality tags like request_id or user_id on Redis metrics.
  • Set check_interval appropriately — Default 15 seconds is fine for production. For dev/staging, increase to 60 seconds to reduce data volume.
  • Use Datadog's Metrics without Limits — Submit all metrics but only index and query the ones you configure. Reduces cost for metrics you want to keep but don't query often.
  • Review Metrics Summary monthly — Datadog shows which custom metrics are active. Remove integrations or metrics you're no longer using.
The best Datadog Redis setup is one where you pay for 20 metrics that you actually look at — not 200 that you ingested because the default config scraped everything.