Redis Cloud v2 Metrics: Complete Reference for Monitoring, Alerting, and Production Observability
Redis Cloud v2 metrics provide deep, production-grade observability into every layer of your Redis deployment. Unlike basic INFO-command metrics that most monitoring setups rely on, the v2 metric set includes true histogram latency distributions, memory allocator internals, keyspace distribution by data structure and size, Active-Active CRDT synchronization tracking, and node-level infrastructure metrics. It's the difference between knowing Redis is slow and knowing exactly why.
Redis Cloud exports these metrics natively — no agent required. Connect your monitoring platform using an API key, select your region, and metrics start flowing within minutes. The metrics follow standard conventions: Gauge (current value), Count (monotonically increasing counter), Histogram (latency distribution), and Info (metadata). They arrive pre-tagged with cluster, database, shard, region, and role labels for instant filtering.
Basic Redis monitoring tells you 'latency is high.' v2 metrics tell you 'P99 write latency crossed 8ms, memory fragmentation is at 1.7, the allocator is holding 40% more resident memory than active, and 3 keys in the large sorted-set bucket are blocking the event loop.' That's the depth difference.
What the v2 Metric Set Covers
80+ metrics organized across 12 categories — from Redis process internals to node hardware.
- Configuration and metadata — database config state, throughput limits, cluster metadata
- Memory — 13 metrics covering usage, limits, allocator internals, fragmentation, and background process indicators
- Latency — true histogram distributions for read, write, and other operations (P50/P90/P95/P99, not averages)
- Traffic — request/response counts by type, ingress/egress bytes, backpressure indicators
- Connections — connected clients, blocked clients, connection churn, proxy disconnections, establishment failures
- Network — Redis-level input/output bytes
- CPU — process and per-thread CPU consumption
- Keyspace — key counts, expiry counts, evictions, hit/miss ratios separated by read and write
- Keyspace distribution — key counts by data structure type (String, Hash, List, Set, Sorted Set) and size bucket
- Replication and syncer — replication offsets, Active-Active lag, syncer status, byte-level sync tracking
- Client tracking — server-assisted caching metrics for Redis 6+ client-side caching
- Node-level infrastructure — CPU, memory, network I/O, packet counts on the underlying hardware (Pro subscriptions)
Configuration and Metadata Metrics
These metrics expose database and cluster configuration state. Use them to track config drift, detect unexpected changes, and correlate configuration modifications with performance shifts.
| Metric | Type | Description | Unit |
|---|---|---|---|
| db_config | Info | Database configuration metadata — TLS mode, Redis version, port. Track changes over time to detect config drift. | N/A |
| bdb_max_throughput | Gauge | Maximum configured throughput for the database. If ops/sec approaches this limit, requests will be throttled. | Ops/sec |
| bdb_data | Info | Database-level metadata and configuration data. | N/A |
| cluster_data | Info | Cluster-level metadata — account_id, subscription, region, maintenance flags. The identity record for the cluster. | N/A |
Memory Metrics (13 Metrics)
Memory is the most critical monitoring category for Redis. These 13 metrics cover the full stack: from how much data you're storing, to how the allocator is managing physical memory, to whether background persistence operations are spiking latency.
| Metric | Type | What It Tells You |
|---|---|---|
| redis_server_used_memory | Gauge | Total memory consumed by data. The primary capacity metric. Alert at 80% of maxmemory to leave headroom; evictions begin once the limit itself is reached. |
| redis_server_maxmemory | Gauge | Configured maxmemory limit. The ceiling. Compare with used_memory for utilization percentage. |
| db_memory_limit_bytes | Gauge | Database-level memory limit configured in Redis Cloud. Different from maxmemory when overcommit is enabled. |
| redis_server_used_memory_overhead | Gauge | Memory consumed by Redis internals — buffers, metadata, data structures overhead. Not your data, but still your bill. |
| redis_server_mem_fragmentation_ratio | Gauge | Ratio of RSS (physical memory) to used_memory (logical data). Above 1.5 means 50%+ waste from fragmentation. Below 1.0 means Redis is swapping to disk — critical. |
| redis_server_allocator_allocated | Gauge | Bytes allocated from jemalloc. Includes internal fragmentation within allocated pages. |
| redis_server_allocator_active | Gauge | Bytes in allocator active pages. Includes external fragmentation. Compare with allocated to quantify fragmentation. |
| redis_server_allocator_resident | Gauge | Resident memory held by allocator. The actual OS-level memory footprint Redis occupies. |
| redis_server_active_defrag_running | Gauge | 1 if active defragmentation is running. Correlate with latency — defrag can cause micro-spikes during compaction. |
| redis_server_mem_aof_buffer | Gauge | Memory consumed by the AOF (Append-Only File) buffer. Spikes during heavy write bursts as commands queue for persistence. |
| redis_server_mem_replication_backlog | Gauge | Memory used by the replication backlog. Sized to handle replica reconnection without triggering a full resync. |
| redis_server_rdb_bgsave_in_progress | Gauge | 1 if RDB background save is running. The fork operation can cause latency spikes proportional to dataset size. |
| redis_server_aof_rewrite_in_progress | Gauge | 1 if AOF rewrite is in progress. Another fork-based operation that can spike latency. |
Three memory metrics matter most: used_memory (how full you are), maxmemory (the ceiling before evictions), and mem_fragmentation_ratio (how efficiently you're using what you have). Everything else is diagnostic — reach for them when those three raise a flag.
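Those three signals can be checked mechanically. A minimal sketch, assuming hypothetical sample byte values; the thresholds are the ones discussed above, not official Redis Cloud defaults:

```python
def memory_flags(used, maxmemory, rss, warn_pct=0.80):
    """Evaluate the three headline memory signals.

    used, maxmemory, rss are bytes (used_memory, maxmemory, and
    physical RSS). Thresholds are illustrative, not defaults.
    """
    utilization = used / maxmemory
    fragmentation = rss / used  # mem_fragmentation_ratio: RSS over logical data
    return {
        "near_capacity": utilization >= warn_pct,  # evictions start at 100%
        "fragmented": fragmentation > 1.5,         # >50% physical-memory waste
        "swapping": fragmentation < 1.0,           # RSS below logical size: OS swap
    }

# Hypothetical sample: 850 MB used of a 1 GB limit, 1.4 GB resident
flags = memory_flags(used=850_000_000, maxmemory=1_000_000_000, rss=1_400_000_000)
print(flags)  # {'near_capacity': True, 'fragmented': True, 'swapping': False}
```

The same function works for alert evaluation or for annotating a dashboard panel; only the input source changes.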
Latency Metrics — True Histogram Distributions (9 Metrics)
The v2 metrics provide true histogram latency — not averages. This is the single most important difference from basic Redis monitoring. Averages hide outliers: a distribution with a P50 of 1ms and a P99 of 50ms produces an average that looks fine while 1% of your requests are unacceptably slow. Histograms give you the full distribution: P50, P90, P95, P99.
Latency is broken into three operation categories — read, write, and other — each with count, sum, and bucket metrics.
| Metric | Type | Description |
|---|---|---|
| endpoint_read_requests_latency_histogram_count | Count | Total number of read latency observations. Rate this for read throughput. |
| endpoint_read_requests_latency_histogram_sum | Count | Sum of all read latency values (microseconds). Divide by count for average — but prefer percentiles. |
| endpoint_read_requests_latency_histogram_bucket | Histogram | Read latency distribution across buckets. The raw data for computing P50/P90/P95/P99 read latency. |
| endpoint_write_requests_latency_histogram_count | Count | Total write latency observations. |
| endpoint_write_requests_latency_histogram_sum | Count | Sum of write latency values (microseconds). |
| endpoint_write_requests_latency_histogram_bucket | Histogram | Write latency distribution. The most important metric for AI context store write performance. |
| endpoint_other_requests_latency_histogram_count | Count | Total other command latency observations (admin commands, Pub/Sub, module commands). |
| endpoint_other_requests_latency_histogram_sum | Count | Sum of other command latency values. |
| endpoint_other_requests_latency_histogram_bucket | Histogram | Other command latency distribution. Watch for FT.SEARCH and vector search operations here. |
To compute P99 read latency, use the standard histogram percentile formula: histogram_quantile(0.99, sum(rate(endpoint_read_requests_latency_histogram_bucket[5m])) by (le)). The result is in microseconds — divide by 1000 for milliseconds. This works in any monitoring platform that supports PromQL-style queries.
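For intuition, the interpolation that histogram_quantile performs can be sketched in plain Python: find the cumulative bucket containing the target rank, then linearly interpolate within it. The bucket boundaries and counts below are hypothetical, not actual Redis Cloud bucket edges:

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets.

    buckets: list of (le_upper_bound_us, cumulative_count), sorted by bound.
    Mirrors the Prometheus-style linear interpolation inside the target bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:  # empty bucket: no interpolation possible
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical read-latency buckets in microseconds (le bound -> cumulative count)
buckets = [(500, 9000), (1000, 9500), (5000, 9900), (10000, 10000)]
p99_us = histogram_quantile(0.99, buckets)
print(f"P99 read latency: {p99_us / 1000:.2f} ms")  # P99 read latency: 5.00 ms
```

Note how a healthy-looking median (most observations land under 500 microseconds) coexists with a 5 ms P99 — exactly the outlier-hiding the text describes.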
Traffic Metrics (10 Metrics)
Traffic metrics separate reads, writes, and other commands — and further separate requests from responses. This lets you detect asymmetries: if requests consistently exceed responses, commands are being dropped, timing out, or queuing.
| Metric | Type | Description |
|---|---|---|
| endpoint_read_requests | Count | Total read requests received. Rate this for reads/sec. |
| endpoint_write_requests | Count | Total write requests received. |
| endpoint_other_requests | Count | Non-read/write commands — PING, CONFIG, SUBSCRIBE, module commands, etc. |
| endpoint_read_responses | Count | Responses sent for read requests. Compare with read_requests to detect drops. |
| endpoint_write_responses | Count | Responses sent for write requests. |
| endpoint_other_responses | Count | Responses for other commands. |
| endpoint_ingress | Count | Total bytes transferred into the database. Track for data transfer cost estimation. |
| endpoint_egress | Count | Total bytes transferred out of the database. The primary driver of data transfer cost. |
| endpoint_egress_pending | Gauge | Pending outgoing bytes waiting to be sent. Sustained non-zero values indicate network backpressure — the client can't consume fast enough. |
| endpoint_egress_pending_discarded | Count | Pending bytes discarded because the client disconnected before receiving them. Indicates clients timing out. |
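The request/response asymmetry check described above reduces to a ratio over counter deltas taken on the same window. A sketch with hypothetical sample numbers:

```python
def traffic_asymmetry(requests_delta, responses_delta):
    """Fraction of requests in an interval that never produced a response.

    Inputs are deltas of the endpoint_*_requests / endpoint_*_responses
    counters over the same window. A sustained positive gap suggests
    drops, timeouts, or queuing.
    """
    if requests_delta == 0:
        return 0.0
    return max(0, requests_delta - responses_delta) / requests_delta

# Hypothetical: 120k requests, 118.8k responses in one 5-minute window
gap = traffic_asymmetry(requests_delta=120_000, responses_delta=118_800)
print(f"{gap:.1%} of requests unanswered this interval")  # 1.0% ...
```

Responses can slightly exceed requests at window edges (a request counted in one window, its response in the next), which is why the gap is clamped at zero.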
Connection and Client Metrics (8 Metrics)
| Metric | Type | What It Tells You |
|---|---|---|
| redis_server_connected_clients | Gauge | Current connected clients. Alert at 80% of maxclients to prevent connection exhaustion. |
| redis_server_blocked_clients | Gauge | Clients blocked on BLPOP/BRPOP/WAIT. Sustained non-zero values indicate a consumer bottleneck. |
| redis_server_instantaneous_ops_per_sec | Gauge | Real-time operations per second. The headline throughput metric. |
| endpoint_client_connections | Count | New client connection establishment events. High rate means high connection churn — a sign of missing or broken connection pooling. |
| endpoint_client_disconnections | Count | Client-initiated disconnections. Normal during scale-down or deployment. |
| endpoint_proxy_disconnections | Count | Proxy-initiated disconnections. Non-zero means the Redis Cloud proxy is actively dropping connections — investigate maxclients or proxy resource limits. |
| endpoint_client_connection_expired | Count | Connections expired due to idle TTL. Expected behavior for connection lifecycle management. |
| endpoint_client_establishment_failures | Count | Failed connection attempts. Non-zero means clients are failing to connect — check DNS resolution, TLS certificate validity, maxclients limit, or network connectivity. |
The most overlooked connection metric: endpoint_client_connections rate. If you see hundreds of new connections per minute while connected_clients stays low, your application is connecting and disconnecting on every request. You're paying the TCP+TLS handshake cost — 2-5ms — on every single operation. Fix the connection pool.
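One way to make that churn check concrete, as a rough sketch (the 600-per-minute and 20-client figures are hypothetical):

```python
def churn_ratio(new_connections_per_min, connected_clients):
    """New connections per minute relative to the steady-state client count.

    A ratio well above ~1 means clients reconnect far faster than the pool
    size would require: the connect-per-request anti-pattern. The threshold
    you alert on is a deployment-specific choice.
    """
    return new_connections_per_min / max(connected_clients, 1)

# Hypothetical: 600 new connections/min against only 20 steady clients
ratio = churn_ratio(600, 20)
print(f"churn ratio: {ratio:.0f}x")  # 30x: pooling is almost certainly broken
```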
Network and CPU Metrics
Network (Redis-Level)
| Metric | Type | Description |
|---|---|---|
| redis_server_total_net_input_bytes | Count | Total bytes received by the Redis process. Rate this for inbound bandwidth utilization. |
| redis_server_total_net_output_bytes | Count | Total bytes sent by Redis. Rate this for outbound bandwidth. Also the basis for estimating data transfer costs. |
CPU (Process-Level)
| Metric | Type | Description |
|---|---|---|
| namedprocess_namegroup_cpu_seconds_total | Count | Total CPU seconds consumed by the Redis process. Rate this for overall CPU utilization. |
| namedprocess_namegroup_thread_cpu_seconds_total | Count | CPU seconds per Redis thread. Identifies hot threads — critical for diagnosing I/O thread saturation in Redis 6+ multi-threaded I/O. |
Keyspace Metrics (10 Metrics)
Keyspace metrics tell you what's inside your database: how many keys, how many expire, and whether your reads are hitting or missing. The v2 metrics separate hits and misses by read and write — more granular than the combined keyspace_hits/keyspace_misses in basic Redis INFO.
| Metric | Type | What It Tells You |
|---|---|---|
| redis_server_db_keys | Gauge | Total keys in the database. Track the growth rate to predict when memory limits will be reached. |
| redis_server_db_expires | Gauge | Keys with expiration set. If db_expires is much less than db_keys, many keys are permanent and will never be reclaimed by TTL. |
| redis_server_expired_keys | Count | Keys expired by TTL. Expected behavior — rate indicates how fast your data is cycling through. |
| redis_server_evicted_keys | Count | Keys evicted by the maxmemory eviction policy. Every eviction is data loss. For AI context stores, this means lost agent memories. Alert on any non-zero rate. |
| redis_server_keys_trimmed | Count | Keys trimmed (stream MAXLEN enforcement). Indicates Redis Streams length management is active. |
| redis_server_up | Gauge | Database availability: 1 = up, 0 = down. The most fundamental health check. Alert immediately on 0. |
| redis_server_keyspace_read_hits | Count | Successful read lookups — the key existed. Use with read_misses for read hit ratio calculation. |
| redis_server_keyspace_write_hits | Count | Successful write lookups — the key existed before the write operation. |
| redis_server_keyspace_read_misses | Count | Failed read lookups — the requested key did not exist. High miss rate indicates cold cache, wrong keys, or expired data. |
| redis_server_keyspace_write_misses | Count | Write operations to keys that didn't previously exist (new key creation). |
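The read hit ratio referenced in the table is computed from the split counters. A minimal sketch using hypothetical counter deltas over one window:

```python
def read_hit_ratio(read_hits, read_misses):
    """Read hit ratio from the split v2 counters.

    Inputs are deltas of redis_server_keyspace_read_hits and
    redis_server_keyspace_read_misses over the same window.
    """
    lookups = read_hits + read_misses
    return read_hits / lookups if lookups else float("nan")

# Hypothetical deltas over a 5-minute window
ratio = read_hit_ratio(read_hits=95_000, read_misses=5_000)
print(f"read hit ratio: {ratio:.1%}")  # read hit ratio: 95.0%
```

The same formula with the write counters yields the new-key-creation rate instead of a cache signal, since a write "miss" simply means the key did not previously exist.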
Keyspace Distribution by Data Structure (15 Metrics)
These metrics are unique to the v2 metric set — you won't find them in basic Redis monitoring. They break down your key population by data structure type and size bucket. This is how you find the oversized keys that are silently degrading performance.
| Data Structure | Small Bucket | Medium Bucket | Large Bucket |
|---|---|---|---|
| Strings | strings_sizes_under_128M (< 128MB) | strings_sizes_128M_to_512M | strings_sizes_over_512M (> 512MB) |
| Sorted Sets | zsets_items_under_1M (< 1M items) | zsets_items_1M_to_8M | zsets_items_over_8M (> 8M items) |
| Sets | sets_items_under_1M (< 1M items) | sets_items_1M_to_8M | sets_items_over_8M |
| Lists | lists_items_under_1M (< 1M items) | lists_items_1M_to_8M | lists_items_over_8M |
| Hashes | hashes_items_under_1M (< 1M items) | hashes_items_1M_to_8M | hashes_items_over_8M |
All metric names are prefixed with redis_server_. If any 'large' bucket is non-zero, investigate immediately. A single sorted set with 10 million items or a string exceeding 512MB will cause latency spikes on every operation that touches it. Large keys block the Redis event loop during serialization, deletion, and persistence — affecting all other operations on that shard.
The keyspace distribution table is the fastest way to find the keys that will break your system at scale. If the 'Large' column has any non-zero value, you have a problem — even if latency looks fine today. It will degrade as traffic grows.
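A quick automated check over this table might look like the sketch below; the scrape dict and its single non-zero value are hypothetical, and the metric names follow the bucket table above (redis_server_ prefix included):

```python
# 'Large' bucket metrics from the keyspace distribution table
LARGE_BUCKETS = [
    "redis_server_strings_sizes_over_512M",
    "redis_server_zsets_items_over_8M",
    "redis_server_sets_items_over_8M",
    "redis_server_lists_items_over_8M",
    "redis_server_hashes_items_over_8M",
]

def oversized_key_alerts(scrape):
    """Return the large-bucket metrics with non-zero key counts."""
    return {m: scrape[m] for m in LARGE_BUCKETS if scrape.get(m, 0) > 0}

# Hypothetical scrape: three oversized sorted sets, everything else clean
scrape = {"redis_server_zsets_items_over_8M": 3}
print(oversized_key_alerts(scrape))  # {'redis_server_zsets_items_over_8M': 3}
```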
Replication and Syncer Metrics (9 Metrics)
These metrics cover two replication modes: standard primary-replica synchronization and Active-Active (CRDT) cross-region database synchronization. If you run geo-distributed AI agent deployments with Active-Active databases, the syncer metrics are essential for detecting cross-region lag before it causes stale context retrieval.
| Metric | Type | What It Tells You |
|---|---|---|
| redis_server_master_repl_offset | Gauge | Replication offset on the primary. Compare with slave_offset to compute lag in bytes. |
| redis_server_slave_offset | Gauge | Replication offset on the replica. The difference (master_repl_offset - slave_offset) = replication lag in bytes. |
| database_syncer_dst_lag | Gauge | Lag between the syncer and the destination (milliseconds). The primary health metric for Active-Active sync. |
| database_syncer_current_status | Gauge | Syncer status indicator. 0 = not running. Monitor for unexpected state transitions. |
| database_syncer_total_requests | Count | Total write operations delivered to the destination by the syncer. Rate this for sync throughput. |
| database_syncer_ingress_bytes | Count | Bytes read from the source shard by the syncer. |
| database_syncer_ingress_bytes_decompressed | Count | Decompressed bytes received by the syncer. Compare with ingress_bytes to measure wire compression effectiveness. |
| database_syncer_syncer_repl_offset | Gauge | The syncer's own replication tracking offset. |
| database_syncer_dst_repl_offset | Gauge | The destination's replication offset. Compare with syncer_repl_offset for sync position lag. |
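The byte-level lag formula from the table is a subtraction, and pairing it with write throughput gives a rough catch-up estimate. A sketch with hypothetical offsets and rates:

```python
def replication_lag_bytes(master_repl_offset, slave_offset):
    """Replica lag in bytes: primary offset minus replica offset."""
    return master_repl_offset - slave_offset

def catchup_seconds(lag_bytes, replica_apply_rate, write_rate):
    """Rough time for the replica to catch up, assuming it applies bytes
    faster than new writes arrive. Rates in bytes/sec; values hypothetical."""
    headroom = replica_apply_rate - write_rate
    return float("inf") if headroom <= 0 else lag_bytes / headroom

lag = replication_lag_bytes(master_repl_offset=1_048_576_000,
                            slave_offset=1_048_320_000)
print(lag, catchup_seconds(lag, replica_apply_rate=10_000_000,
                           write_rate=2_000_000))
# 256000 bytes behind; ~0.03 s to drain at 8 MB/s of headroom
```

When headroom is zero or negative the lag never closes — that is the condition worth alerting on, not any particular absolute byte count.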
Client Tracking and Caching Metrics (4 Metrics)
Redis 6+ introduced server-assisted client-side caching — where the server tracks which keys a client has cached locally and sends invalidation messages when those keys change. These metrics show adoption and correctness of that protocol.
| Metric | Type | What It Tells You |
|---|---|---|
| endpoint_client_tracking_on_requests | Count | CLIENT TRACKING ON commands issued. Shows how many clients are using server-assisted caching. |
| endpoint_client_tracking_off_requests | Count | CLIENT TRACKING OFF commands. Clients opting out of tracking. |
| endpoint_disposed_commands_after_client_caching | Count | Commands disposed due to client caching protocol misuse. Non-zero means a client library bug — investigate. |
| endpoint_client_expiration_refresh | Count | Client connection expiration TTL refresh events. |
Node-Level Infrastructure Metrics (9 Metrics — Pro Only)
Pro subscriptions expose the underlying node hardware metrics — the physical machine running your Redis shards. These are invisible on Essentials plans. When Redis-level metrics look fine but performance is degraded, node-level metrics reveal whether you're hitting hardware ceilings: CPU saturation, memory exhaustion at the OS level, or network interface limits.
| Metric | Type | What It Tells You |
|---|---|---|
| node_available_memory_bytes | Gauge | Available memory on the node. If this approaches zero, the OS OOM-killer will terminate processes. |
| node_memory_MemFree_bytes | Gauge | Strictly unallocated memory on the node. Lower than available memory, which also counts reclaimable cache and buffers. |
| node_cpu_seconds_total | Count | Total CPU seconds consumed per mode (user, system, iowait, idle). Rate by mode for CPU utilization breakdown. |
| node_network_receive_bytes_total | Count | Bytes received on the node's network interface. Rate for inbound bandwidth utilization. |
| node_network_transmit_bytes_total | Count | Bytes transmitted. Rate for outbound bandwidth. |
| node_ingress_bytes | Count | Total incoming traffic across all processes on the node. |
| node_egress_bytes | Count | Total outgoing traffic across all processes on the node. |
| node_network_receive_packets_total | Count | Network packets received. High packet rate with low byte rate means small-payload inefficiency — batch your operations. |
| node_network_transmit_packets_total | Count | Network packets transmitted. |
Labels and Tags: The Complete Taxonomy
Every v2 metric arrives pre-tagged with rich labels for filtering, grouping, and dashboard segmentation. There are 9 distinct label categories. Understanding them is the difference between a dashboard that shows 'average across everything' and one that shows 'P99 latency on shard-3 of database-prod-context in us-east-1.'
Default System Tags (Auto-Attached to All Metrics)
| Tag | Description | Use Case |
|---|---|---|
| cluster | Redis Cloud cluster identifier | Filter metrics to a specific cluster |
| db | Database identifier | Scope dashboards to one database |
| shard | Shard identifier | Per-shard analysis for clustered databases — find hot shards |
| region | Cloud region where the database is deployed | Regional dashboards, multi-region latency comparison |
| role | Node role: master or replica | Compare primary vs replica performance |
| account_id | Redis Cloud account identifier | Multi-account environments |
| subscription_id | Subscription identifier | Group metrics by subscription for cost allocation |
| syncer_type | crdt or replica (present only on syncer metrics) | Distinguish Active-Active from standard replication |
Redis Enterprise Core Labels
| Category | Label | Description |
|---|---|---|
| Identity | cluster | Cluster FQDN |
| Identity | bdb / bdb_id | Redis Enterprise database ID |
| Identity | bdb_name | Database name (human-readable) |
| Identity | db | Database ID (CSE) or Redis logical DB (0-15) |
| Identity | node | Node identifier |
| Identity | redis | Shard identifier |
| Topology | role | master/primary or slave |
| Topology | slots | Hash slot range assigned to this shard |
| Topology | status | Shard operational status |
| Topology | shard_type | ram, flash, or total — indicates storage tier |
Active-Active (CRDT) Labels
Available only for Active-Active databases — geo-distributed conflict-free replicated databases across multiple regions.
| Label | Description |
|---|---|
| crdt_guid | Active-Active database GUID — the unique identifier across all participating regions |
| crdt_replica_id | Replica ID (1-10) — identifies which geographic instance within the Active-Active group |
| crdt_peer | Peer ID — the remote region this metric relates to |
| crdt_backlog | Backlog indicator — pending sync data not yet delivered |
| src_id | Source ID for syncer operations |
| dst_id | Destination ID for syncer operations |
Account and Subscription Labels
Attached via cluster_data metrics. Essential for multi-tenant environments, cost allocation dashboards, and subscription-level capacity planning.
| Label | Description |
|---|---|
| account_id | Redis Cloud account ID |
| account_name | Account name |
| cluster_id | Cluster ID |
| cluster_name | Cluster name (human-readable) |
| subscription | Subscription ID |
| vip | Virtual IP assigned to the cluster |
| shared | Shared cluster flag (true for multi-tenant) |
| under_maintenance | Maintenance status — flag active maintenance windows |
CSE Labels (job=rlec_v2)
| Label | Context | Why It Matters |
|---|---|---|
| db_port | db_config metric | Database port — useful for multi-database-per-cluster identification |
| db_version | db_config metric | Redis version running on this database. Track for version drift across databases. |
| tls_mode | db_config metric | TLS enabled or disabled. Flag any production database without TLS enabled. |
Node and Proxy Labels
| Label | Description |
|---|---|
| addr | Node IP address |
| cnm_version | Cluster Node Manager version |
| proxy | Proxy ID |
| endpoint | Endpoint ID |
| port | Listener port |
| driver | Storage driver (e.g., speedb) |
Alerting Labels (job=rlec_node)
Used for infrastructure alert metrics emitted at the node level.
| Label | Description |
|---|---|
| alertname | Alert name identifier |
| alertstate | Alert state (firing, pending, resolved) |
| severity | Alert severity level |
| cloud | Cloud provider |
| region | Cloud region |
| zone_id / zone_name | Availability zone identification |
| machine_type | Instance type (e.g., r6g.xlarge) |
| process | Process name |
| disk_path / directory_path / file_path | Storage path identifiers |
Critical: The db vs bdb Naming Difference
There is a naming inconsistency between metric sources that will break your dashboards and monitors if you mix them.
| Concept | CSE Metrics (rlec_v2) | Standard Metrics |
|---|---|---|
| Database ID | db | bdb |
| Database Name | db_name | bdb_name |
Always check which job the metric originates from before building queries. Mixing db and bdb labels in the same query produces empty results or incorrect aggregations. This is the #1 debugging issue when building Redis Cloud dashboards.
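One defensive pattern is to normalize labels at relabeling or query-building time so a single dashboard template works against both jobs. A sketch — the mapping direction here is a design choice, not a Redis Cloud feature:

```python
def normalize_db_label(labels):
    """Copy the CSE-style 'db'/'db_name' labels into 'bdb'/'bdb_name'
    so queries written against standard jobs also match rlec_v2 series.
    Existing bdb values are never overwritten."""
    out = dict(labels)
    if "db" in out and "bdb" not in out:
        out["bdb"] = out["db"]
    if "db_name" in out and "bdb_name" not in out:
        out["bdb_name"] = out["db_name"]
    return out

# Hypothetical label set from an rlec_v2 (CSE) series
print(normalize_db_label({"db": "42", "db_name": "prod-context"}))
# {'db': '42', 'db_name': 'prod-context', 'bdb': '42', 'bdb_name': 'prod-context'}
```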
Query Grouping Best Practices
How you group metrics determines the granularity of your monitoring. Choose the right level for each dashboard panel.
| Scope | Group By | When to Use |
|---|---|---|
| Account-level | sum by (account_id) | Executive dashboards, multi-account cost views |
| Subscription-level | sum by (subscription) | Subscription cost tracking, capacity planning |
| Database-level | sum by (bdb) or sum by (db) | Per-database monitoring. Use bdb for standard jobs, db for CSE. |
| Shard-level | sum by (shard) | Diagnosing hot shards, detecting uneven data distribution |
| Role-level | sum by (role) | Comparing primary vs replica latency and throughput |
- Avoid grouping by instance unless you are specifically debugging metric collection issues.
- Never aggregate histogram _bucket metrics without preserving the le (bucket boundary) label — you will destroy the distribution and get meaningless percentile values.
- Use custom database tags in Redis Cloud (team, environment, service) — they flow automatically into exported metrics as labels for business-context filtering.
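The le rule in practice: when summing _bucket series across shards, group only by the boundary label and drop everything else. A minimal sketch over a hypothetical two-shard scrape:

```python
from collections import defaultdict

def sum_buckets_preserving_le(series):
    """Aggregate histogram _bucket series across shards while keeping the
    le (bucket boundary) label. 'series' is a list of
    (label_dict, cumulative_count) pairs from one scrape."""
    out = defaultdict(int)
    for labels, count in series:
        out[labels["le"]] += count  # group ONLY by le; shard et al. are dropped
    return dict(out)

# Hypothetical bucket samples from two shards of the same database
series = [
    ({"shard": "1", "le": "1000"}, 500),
    ({"shard": "2", "le": "1000"}, 700),
    ({"shard": "1", "le": "5000"}, 520),
    ({"shard": "2", "le": "5000"}, 730),
]
print(sum_buckets_preserving_le(series))  # {'1000': 1200, '5000': 1250}
```

Summing the same four samples without the le split would collapse the distribution into a single meaningless number, which is exactly what the rule above warns against.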
Building Production Dashboards
Organize your Redis Cloud monitoring dashboard into six sections, each answering one operational question. For Pro subscriptions, add a seventh.
- Section 1 — Health Overview: redis_server_up status indicator, connected_clients gauge, instantaneous_ops_per_sec timeseries, memory utilization percentage. The 10-second glance that answers 'is everything OK right now?'
- Section 2 — Latency Distribution: Histogram percentile graphs for read, write, and other latency. Show P50, P90, P95, P99 on the same chart. For AI context retrieval, P99 read latency is the metric that determines user-perceived agent performance.
- Section 3 — Memory Deep Dive: used_memory vs maxmemory as a utilization gauge. Fragmentation ratio timeseries. Allocator breakdown (allocated vs active vs resident) to quantify fragmentation. Eviction rate. RDB/AOF save indicators overlaid on the latency chart to correlate fork-induced spikes.
- Section 4 — Traffic and Throughput: Read/write/other request rates as stacked area. Ingress/egress bytes for data transfer cost visibility. Request vs response count comparison to detect command drops. egress_pending for backpressure detection.
- Section 5 — Connections: Connected clients timeseries. New connection rate (endpoint_client_connections). Establishment failures. Proxy disconnections. Blocked clients. If new connections per minute significantly exceeds connected_clients, connection pooling is broken.
- Section 6 — Keyspace Intelligence: Key count and expiry count over time. Eviction rate (should be zero for context stores). Read hit ratio as a computed metric (read_hits / (read_hits + read_misses)). Keyspace distribution by data structure — any non-zero 'large' bucket should be a red indicator.
- Section 7 (Pro only) — Node Infrastructure: Node available memory, CPU utilization by mode (user/system/iowait), network packet rates and bandwidth utilization.
Alerting: 6 Monitors That Cover Production Failure Modes
Six alerts. Not twenty. Every additional alert beyond the critical set dilutes team attention and causes alert fatigue — the state where every alert is ignored because most are noise.
| Alert | Condition | Severity | Why It Matters |
|---|---|---|---|
| Database down | redis_server_up = 0 | P1 Critical | Complete outage. All reads and writes fail. Every second counts. |
| Memory critical | used_memory > 85% of maxmemory for 5 minutes | P1 High | Evictions are imminent. Context data, session data, and cached embeddings will be dropped. |
| Evictions active | redis_server_evicted_keys rate > 0 | P1 High | Data is actively being lost. For AI agent context stores, this means lost memories and degraded agent quality. |
| Latency spike | Read or write P95 > configured threshold for 5 minutes | P2 Warning | Context retrieval or write performance degrading. End-user response times will increase. |
| Replication lag | database_syncer_dst_lag > threshold for 5 minutes | P2 Warning | Replicas serving stale data. Active-Active regions are out of sync. Cross-region reads return outdated context. |
| Blocked clients | redis_server_blocked_clients > baseline for 10 minutes | P3 Info | Consumer bottleneck on blocking list/stream operations. Investigate consumer throughput. |
If your Redis monitoring has 20 alerts, you effectively have zero. The team learned to ignore them two weeks after they were created. Six well-tuned alerts with clear severity levels and runbook links outperform fifty noisy ones every time.
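The six conditions can be expressed as one evaluation function. A sketch, where the latency, lag, and blocked-client thresholds are hypothetical placeholders for your own baselines:

```python
def evaluate_alerts(m):
    """Evaluate the six alert conditions against one sample of
    pre-computed metric values 'm'. Duration-based conditions
    ("for 5 minutes") would be enforced by the alerting platform,
    not shown here."""
    return {
        "database_down": m["up"] == 0,
        "memory_critical": m["used_memory"] > 0.85 * m["maxmemory"],
        "evictions_active": m["evicted_keys_rate"] > 0,
        "latency_spike": m["p95_write_ms"] > 8,          # hypothetical threshold
        "replication_lag": m["syncer_dst_lag_ms"] > 500,  # hypothetical threshold
        "blocked_clients": m["blocked_clients"] > 5,      # hypothetical baseline
    }

# Hypothetical sample: memory nearly full and evictions already running
sample = {"up": 1, "used_memory": 900, "maxmemory": 1000,
          "evicted_keys_rate": 2.5, "p95_write_ms": 3.1,
          "syncer_dst_lag_ms": 40, "blocked_clients": 0}
firing = [name for name, on in evaluate_alerts(sample).items() if on]
print(firing)  # ['memory_critical', 'evictions_active']
```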
v2 Metrics vs Basic Redis INFO Monitoring
Many teams monitor Redis using the basic INFO command output — 30-40 metrics covering memory, clients, stats, and replication. The v2 metric set exported by Redis Cloud goes significantly deeper.
| Capability | v2 Metrics (Redis Cloud Export) | Basic INFO Monitoring |
|---|---|---|
| Metric count | 80+ metrics across 12 categories | ~30-40 from INFO sections |
| Latency measurement | True histogram buckets — P50/P90/P95/P99 | Average only (or none without Slowlog parsing) |
| Keyspace distribution | By data structure type and size bucket (15 metrics) | Total keys and expires only |
| Active-Active sync | Full syncer metrics — lag, offsets, bytes, status | Not available |
| Client tracking | Server-assisted caching metrics | Not available |
| Node infrastructure | CPU, memory, network at OS level (Pro) | Not available on managed services |
| Request/response split | Separate read, write, other — requests vs responses | Combined cmdstat only |
| Memory allocator | jemalloc allocated, active, resident | Basic used_memory and RSS only |
| Connection lifecycle | Connect, disconnect, expire, proxy drop, establishment failure | connected_clients count only |
| Labeling depth | 9 label categories — cluster, db, shard, role, region, account, CRDT, node, alerting | None (flat metrics) |
For production AI workloads running on Redis Cloud, the v2 metric set is the monitoring foundation. Basic INFO monitoring was adequate when Redis was a simple cache. When Redis holds your agent's context, vector indexes, session state, and real-time feature stores — you need the full picture.