Observability · Datadog · MongoDB Atlas · Database Monitoring

MongoDB Atlas Database Monitoring with Datadog: Query-Level Visibility for Replica Sets and Sharded Clusters

Polystreak Team · 2026-03-23 · 10 min read

Cluster-level metrics (ops/sec, latency averages, connection count) tell you that something is wrong. Database Monitoring tells you exactly what. Which query is slow? What's its explain plan? Is it doing a COLLSCAN on a 50-million-document collection? Is one shard hotter than the others? Datadog's Database Monitoring (DBM) for MongoDB delivers the kind of query-level observability that Atlas Performance Advisor offers inside the Atlas console, but surfaced in Datadog and correlated with everything else you monitor.

Cluster metrics say 'latency is high.' Database Monitoring says 'this query on this collection scans 2 million documents to return 15 results because it's missing a compound index.' That's the difference between knowing and fixing.

What Datadog Database Monitoring Provides

DBM goes deeper than the Atlas API integration. Instead of cluster aggregates, it provides per-query, per-collection, per-operation visibility.

| Feature | Atlas API Integration | Datadog DBM |
| --- | --- | --- |
| Ops/sec by type | Yes (cluster aggregate) | Yes, plus per-collection breakdown |
| Latency | Average per operation type | Per-query P50/P90/P99 |
| Slow queries | Via Atlas Profiler (separate UI) | Inline in Datadog with explain plans |
| Query explain plans | Manual via Atlas console | Automatic for slow/frequent queries |
| Index recommendations | Atlas Performance Advisor | Not built-in (use with Atlas Advisor) |
| Per-shard metrics | Limited | Full per-shard query distribution |
| Correlation with APM | No (separate system) | Yes: trace to query to collection |
| Historical query trends | Limited retention | Full retention in Datadog |

Architecture: How DBM Connects to Atlas

Datadog DBM requires the Datadog Agent running in your infrastructure (EC2 instance, EKS pod, or any host in a VPC peered with Atlas). The agent connects to Atlas using the MongoDB connection string and a monitoring user, then collects query samples, slow query logs, current operations, and database stats.

For Replica Sets

A standard Atlas replica set (M10+) has a primary and two secondaries across availability zones. The Datadog Agent connects to the replica set using the mongodb+srv:// connection string. It automatically discovers all members and collects metrics from each. The agent tracks which node is primary, replication lag per secondary, and per-node operation distribution.

  • The agent connects to the primary for write operation stats and current operations.
  • It connects to each secondary for replication lag, read distribution (if using readPreference: secondaryPreferred), and per-node cache stats.
  • On failover, the agent automatically re-discovers the new primary within one check cycle.
  • One Datadog Agent instance monitors the entire replica set — no per-node agent needed.
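The per-secondary lag the agent reports is derived from the replSetGetStatus command: the primary's last op time minus each secondary's last applied op time. A minimal sketch of that calculation, run here against a captured (simplified) status document rather than a live connection — the helper name and sample values are illustrative:

```python
from datetime import datetime

def replication_lag(repl_set_status):
    """Compute per-secondary lag in seconds from a replSetGetStatus document."""
    members = repl_set_status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    lag = {}
    for m in members:
        if m["stateStr"] == "SECONDARY":
            # Lag = primary's last op time minus this secondary's last applied op time.
            delta = primary["optimeDate"] - m["optimeDate"]
            lag[m["name"]] = max(delta.total_seconds(), 0.0)
    return lag

# Simplified status document as replSetGetStatus would return it:
status = {
    "members": [
        {"name": "node-a:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2026, 3, 23, 12, 0, 30)},
        {"name": "node-b:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2026, 3, 23, 12, 0, 29)},
        {"name": "node-c:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2026, 3, 23, 12, 0, 0)},
    ]
}
print(replication_lag(status))  # node-c is 30 seconds behind
```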

For Sharded Clusters

A sharded Atlas cluster adds mongos routers and config servers on top of the shard replica sets. DBM handles this topology by connecting through the mongos router and collecting shard-level metrics from each underlying replica set.

  • The agent connects to the mongos router endpoint (the standard Atlas connection string for sharded clusters).
  • It discovers the shard topology — how many shards, which ranges/chunks each shard owns, and the config server state.
  • Per-shard metrics: ops/sec, latency, document counts, storage size. This reveals shard imbalance — one shard handling 70% of traffic while others are idle.
  • Scatter-gather query detection: DBM identifies queries that fan out to all shards (no shard key in the filter) vs targeted queries that hit one shard. Scatter-gather queries are the #1 performance killer in sharded clusters.
  • Chunk migration tracking: When the balancer moves chunks between shards, it creates I/O pressure. DBM shows this as temporary latency spikes correlated with migration events.

In a sharded cluster, the most dangerous query is the one without a shard key in its filter. It hits every shard, waits for all responses, and returns the slowest shard's latency. DBM finds these before your users do.
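The routing rule behind scatter-gather detection can be sketched as a small predicate. This is a deliberately conservative simplification (real mongos routing can also range-target a subset of shards); the helper name and shard key are illustrative:

```python
def is_scatter_gather(shard_key_fields, query_filter):
    """Conservative check: a query is single-shard targetable only if its
    filter pins every shard key field to an equality value. Anything else
    forces mongos to fan out to all shards (scatter-gather)."""
    for field in shard_key_fields:
        value = query_filter.get(field)
        # Missing field, or an operator document like {"$gt": ...}, means fan-out.
        if value is None or isinstance(value, dict):
            return True
    return False

# Shard key {tenant_id: 1, created_at: 1}:
print(is_scatter_gather(["tenant_id", "created_at"],
                        {"tenant_id": "t-42", "created_at": {"$gt": 0}}))   # True
print(is_scatter_gather(["tenant_id", "created_at"],
                        {"tenant_id": "t-42", "created_at": 1700000000}))  # False
```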

Setting Up DBM for Atlas

Step 1: Create a Monitoring User

Create a dedicated user in Atlas for Datadog monitoring. This user needs the clusterMonitor role (for serverStatus, replSetGetStatus, dbStats) and read access to the admin database. Never use your application user for monitoring — separate concerns, separate credentials.
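Atlas manages database users itself, so this user is typically created in the Atlas UI (Database Access) or via the Atlas Admin API rather than with db.createUser. A hedged sketch of the API payload — field names follow the database-users endpoint, but verify against your API version, and the username and password here are placeholders:

```json
{
  "databaseName": "admin",
  "username": "datadog_monitor",
  "password": "<generate-a-strong-password>",
  "roles": [
    { "roleName": "clusterMonitor", "databaseName": "admin" },
    { "roleName": "read", "databaseName": "admin" }
  ]
}
```

Scope the user's network access the same way: if the agent lives in a peered VPC, restrict the user (and the Atlas IP access list) to that network.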

Step 2: Deploy the Datadog Agent

Deploy the Datadog Agent in your VPC — either as a sidecar in your EKS cluster, a dedicated EC2 instance, or on any host that can reach the Atlas cluster via VPC peering. The agent needs outbound HTTPS access to Datadog's intake endpoints and TCP access to your Atlas cluster on port 27017.

Step 3: Configure the MongoDB Check

Create the agent configuration at conf.d/mongo.d/conf.yaml. The key settings are the connection URI (mongodb+srv://), the monitoring user credentials, and the DBM-specific options.

| Config Parameter | Value | Purpose |
| --- | --- | --- |
| hosts | mongodb+srv://cluster.abc123.mongodb.net | Atlas connection string |
| username / password | datadog_monitor / <password> | Monitoring user credentials |
| dbm | true | Enable Database Monitoring features |
| tls | true | Required for Atlas (TLS enforced) |
| replica_check | true | Monitor all replica set members |
| additional_metrics | ["metrics.commands", "tcmalloc", "collection", "top"] | Extra metric categories to collect |
| operations_sample_rate | 1.0 | Sample rate for query collection (1.0 = all; reduce for very high throughput) |
| database_autodiscovery | true | Auto-discover all databases on the cluster |
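Put together, a conf.d/mongo.d/conf.yaml might look like the following. Parameter names mirror the table above; validate the exact keys and nesting against the conf.yaml.example shipped with your Agent version, since the integration schema changes between releases:

```yaml
init_config:

instances:
  - hosts:
      - mongodb+srv://cluster.abc123.mongodb.net
    username: datadog_monitor
    password: "<password>"
    tls: true
    dbm: true
    replica_check: true
    additional_metrics:
      - metrics.commands
      - tcmalloc
      - collection
      - top
    operations_sample_rate: 1.0
    database_autodiscovery: true
```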

Step 4: Verify

Run datadog-agent check mongo to verify the configuration. The output should show successful connections to all replica set members (or mongos for sharded clusters), metric collection counts, and no authentication errors. In Datadog, navigate to Databases > MongoDB to see your clusters appear with query-level data.

What DBM Reveals: Replica Set Insights

For replica sets, DBM provides these actionable insights that cluster-level metrics alone cannot.

  • Slow queries with explain plans — Every query exceeding the slow threshold (configurable, default 100ms) is captured with its execution plan. COLLSCAN in the plan means a missing index. SORT_KEY_GENERATOR means an in-memory sort that should be covered by an index.
  • Top queries by execution count — The query that runs 500,000 times per day at 5ms each is consuming more total capacity than the one that runs once at 2 seconds. DBM ranks by total time = count × avg duration.
  • Read distribution across members — If you use readPreference: secondaryPreferred, DBM shows the read distribution across primary and secondaries. Unbalanced reads mean one node is overloaded while others are idle.
  • Replication lag per secondary — Not just the cluster average, but per-secondary lag. One lagging secondary with 30 seconds of lag means reads from that node return stale context data.
  • Lock contention — DBM shows operations waiting for locks. High lock waits on a collection indicate write contention — often caused by frequent updates to the same hot documents.
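The ranking rule from the "top queries" bullet — total time = count × avg duration — is worth making concrete, because it often inverts intuition about which query to fix first. A minimal sketch with illustrative query shapes and numbers:

```python
def rank_by_total_time(query_stats):
    """Rank query shapes by total time = execution count * avg duration (ms),
    the same ordering DBM uses for its top-queries list."""
    return sorted(query_stats,
                  key=lambda q: q["count"] * q["avg_ms"],
                  reverse=True)

stats = [
    {"shape": "find users by status+createdAt", "count": 500_000, "avg_ms": 5},
    {"shape": "aggregate orders report",        "count": 1,       "avg_ms": 2_000},
]
top = rank_by_total_time(stats)
# The 5 ms query dominates: 2,500,000 ms total versus 2,000 ms.
print(top[0]["shape"])
```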

What DBM Reveals: Sharded Cluster Insights

For sharded clusters, DBM adds shard-specific visibility that is critical for performance.

  • Shard imbalance — Per-shard ops/sec and latency. If shard-0 handles 3x the traffic of shard-1, your shard key distribution is uneven. This creates a hot shard that limits your cluster's throughput ceiling.
  • Scatter-gather queries — Queries without the shard key in the filter hit every shard. DBM identifies these and shows the fan-out cost. A query that takes 5ms on one shard takes 50ms across 10 shards waiting for the slowest.
  • Chunk migration impact — The balancer moves data chunks between shards for even distribution. During migration, the source and destination shards have elevated I/O. DBM shows the latency spike correlated with migration events.
  • Config server health — The config servers store the shard map. If config servers are slow, every routing decision (every query through mongos) is delayed. DBM tracks config server latency.
  • Per-shard storage distribution — Uneven storage across shards indicates a monotonically increasing shard key (like timestamps) that creates ever-growing 'tail' shards.
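Shard imbalance like the 70/30 split above reduces to a simple check on per-shard ops: what fraction of total traffic does the hottest shard carry? A value near 1/n means balanced; anything much higher means a hot shard. A sketch with an illustrative helper and sample numbers:

```python
def hottest_shard_share(ops_per_shard):
    """Return (shard name, fraction of total ops) for the busiest shard.
    In a balanced n-shard cluster this fraction should be close to 1/n."""
    total = sum(ops_per_shard.values())
    name, ops = max(ops_per_shard.items(), key=lambda kv: kv[1])
    return name, ops / total

ops = {"shard-0": 7000, "shard-1": 2000, "shard-2": 1000}
print(hottest_shard_share(ops))  # shard-0 handles 70% of traffic
```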

Building the DBM Dashboard

  • Query Performance tab — Top queries sorted by total execution time. Click any query to see the explain plan, execution count, and P50/P90/P99 latency. This is your primary optimization target list.
  • Replica Set Health panel — Per-member latency, replication lag, and connection count. Heatmap for quick visual identification of a lagging member.
  • Shard Distribution panel (sharded clusters) — Per-shard ops/sec as a stacked bar chart. Per-shard storage as a comparative bar. Imbalance is immediately visible.
  • Waiting Operations panel — Current operations with wait times > 0. Groups by lock type and collection. Identifies write contention hot spots.
  • Query Patterns Over Time — Timeseries of top query shapes by count. Useful for detecting new query patterns introduced by deployments.

DBM + APM: End-to-End Trace to Query

The most powerful feature of Datadog DBM is its integration with APM traces. When your application makes a MongoDB query, the APM trace shows the span for that database call. Click the span and Datadog links it directly to the DBM query — showing the exact explain plan, the collection it hit, and the latency breakdown. This means you can go from 'user reported slow response' to 'this specific query on the users collection does a COLLSCAN because it's missing a compound index on { status: 1, createdAt: -1 }' in three clicks.

APM tells you the application is slow. DBM tells you which query made it slow. The explain plan tells you why. Three clicks from symptom to root cause.

Recommendations

| Topology | DBM Value | Key Metrics to Watch |
| --- | --- | --- |
| Replica Set (M10-M40) | Slow query detection, explain plans, replication lag per secondary | Top queries by total time, COLLSCAN count, secondary lag |
| Replica Set (M50+) | All above + lock contention analysis, operation sampling at scale | Lock wait time, cache eviction correlation with slow queries |
| Sharded Cluster | All above + shard imbalance, scatter-gather detection, chunk migration impact | Per-shard ops/sec variance, scatter-gather query count, migration events |

For AI agent data layers, the single most important DBM metric is the COLLSCAN count — queries scanning full collections instead of using indexes. Every COLLSCAN is a latency time bomb that gets worse as data grows. Review the DBM Query Performance tab weekly and aim for zero COLLSCANs on any query that runs more than 100 times per day.
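Counting COLLSCANs in an explain plan is mechanical: walk the winningPlan tree and count stages named COLLSCAN. A simplified sketch (real explain output also nests children under other keys, such as per-shard sub-plans; the helper is illustrative):

```python
def count_collscans(plan):
    """Recursively count COLLSCAN stages in an explain() winningPlan document."""
    if not isinstance(plan, dict):
        return 0
    hits = 1 if plan.get("stage") == "COLLSCAN" else 0
    # Child stages appear under inputStage (single child) or inputStages (list).
    children = []
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    children.extend(plan.get("inputStages", []))
    return hits + sum(count_collscans(c) for c in children)

# An in-memory sort fed by a full collection scan: the classic missing-index plan.
plan = {"stage": "SORT", "inputStage": {"stage": "COLLSCAN"}}
print(count_collscans(plan))  # 1
```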

Database Monitoring isn't a nice-to-have for production MongoDB. It's how you find the queries that will break your system at scale — before scale arrives.