Monitoring MongoDB Atlas with Datadog: Integration, Metrics, and Production Dashboards
MongoDB Atlas includes excellent built-in monitoring — real-time performance panels, slow query analysis, and cluster health metrics. But it lives in the Atlas console. If your production stack already uses Datadog for application APM, Redis monitoring, Kubernetes metrics, and log management, having MongoDB in a separate tab means slower incident response. When latency spikes, you need to see application P99, Redis latency, and MongoDB latency on the same graph — not in three different browser tabs.
The goal isn't to replace Atlas monitoring. It's to put MongoDB metrics next to everything else — so when something breaks, you see the full picture in one place.
Setting Up the Integration
Datadog offers a native MongoDB Atlas integration that pulls metrics directly from the Atlas API. No agent installation on the Atlas cluster is needed — Atlas is a managed service, so there's no host to install an agent on. The integration uses Atlas's monitoring API to fetch cluster-level and process-level metrics.
Step 1: Create an Atlas API Key
In the Atlas console, navigate to Organization Access Manager > API Keys. Create a new API key with the Organization Read Only role (minimum) or Project Read Only if you want to scope it to a specific project. Save the Public Key and Private Key — you'll enter these in Datadog.
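Before wiring the key into Datadog, it's worth confirming it works against the Atlas Administration API directly. A minimal Python sketch using the requests library (the Atlas API uses HTTP digest auth, with the public key as username and the private key as password; the key values below are placeholders):

```python
import requests
from requests.auth import HTTPDigestAuth

# Atlas Administration API uses HTTP digest auth:
# public key = username, private key = password.
PUBLIC_KEY = "<your-public-key>"    # placeholder
PRIVATE_KEY = "<your-private-key>"  # placeholder

resp = requests.get(
    "https://cloud.mongodb.com/api/atlas/v1.0/groups",  # lists projects visible to the key
    auth=HTTPDigestAuth(PUBLIC_KEY, PRIVATE_KEY),
)
resp.raise_for_status()
for project in resp.json().get("results", []):
    print(project["id"], project["name"])
```

If this returns your projects, the key has at least the read scope the integration needs.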
Step 2: Configure the Datadog Integration
- In Datadog, go to Integrations > MongoDB Atlas.
- Click Add New and enter your Atlas API Public Key and Private Key.
- Optionally restrict to specific Atlas projects by entering the Project ID.
- Enable the integration. Datadog begins polling the Atlas API for metrics (default interval: 60 seconds).
- Metrics appear under the mongodb.atlas namespace within 2-5 minutes; a quick way to verify ingestion is sketched after this list.
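Once polling starts, you can confirm ingestion programmatically as well as in the Metrics Explorer. A sketch using the datadog Python library; the API and application keys are placeholders, and the metric query is illustrative:

```python
import time
from datadog import initialize, api

# Datadog API + application keys (placeholders)
initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# Query the last 15 minutes of connection counts pulled from Atlas
now = int(time.time())
result = api.Metric.query(
    start=now - 900,
    end=now,
    query="avg:mongodb.atlas.connections.current{*}",
)
for series in result.get("series", []):
    print(series["metric"], series["pointlist"][-1])
```

An empty `series` list after 5 minutes usually means the Atlas key lacks read access or the project filter is wrong.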
Step 3: Enable Atlas Database Metrics (Optional but Recommended)
For deeper monitoring, enable the Datadog MongoDB Database Monitoring (DBM) integration. This provides query-level visibility — slow queries, query plans, and per-operation latency — on top of the cluster-level metrics. DBM requires the Datadog Agent running in your VPC with network access to the Atlas cluster (via VPC peering).
| Integration | Agent Required | What You Get | Best For |
|---|---|---|---|
| Atlas API integration | No | Cluster metrics: connections, ops, memory, disk, replication | Quick setup, cluster-level visibility |
| Datadog DBM | Yes (agent in your VPC) | Query-level: slow queries, explain plans, per-operation stats | Deep query performance analysis |
| Both together | Yes | Full stack: cluster health + query-level diagnostics | Production AI workloads |
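One DBM prerequisite worth calling out: the Agent connects with a dedicated read-only database user, and on Atlas database users are managed through the Atlas control plane rather than db.createUser. A sketch of creating that user via the Atlas Admin API; the project ID, credentials, and exact role set are assumptions (clusterMonitor plus read on local is the commonly documented minimum for the Agent's MongoDB check; verify against the current Datadog docs), and note this call needs an API key with project write access, not the read-only key from Step 1:

```python
import requests
from requests.auth import HTTPDigestAuth

GROUP_ID = "<atlas-project-id>"  # placeholder: your Atlas project ID

# Atlas manages database users through its Admin API, not db.createUser,
# so the Agent's monitoring user is created with a POST to databaseUsers.
resp = requests.post(
    f"https://cloud.mongodb.com/api/atlas/v1.0/groups/{GROUP_ID}/databaseUsers",
    auth=HTTPDigestAuth("<public-key>", "<private-key>"),  # needs project write access
    json={
        "databaseName": "admin",          # authentication database
        "username": "datadog",
        "password": "<secure-password>",  # placeholder
        "roles": [
            {"roleName": "clusterMonitor", "databaseName": "admin"},
            {"roleName": "read", "databaseName": "local"},
        ],
    },
)
resp.raise_for_status()
print("Monitoring user created:", resp.json()["username"])
```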
Key Metrics for MongoDB Atlas
Atlas exposes dozens of metrics through the Datadog integration. For AI agent data layers, focus on these categories mapped to the Golden Signals.
Latency
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| mongodb.atlas.oplatencies.reads.avg | Average read operation latency (microseconds) | Alert if > 10ms for indexed queries |
| mongodb.atlas.oplatencies.writes.avg | Average write operation latency | Alert if > 20ms sustained |
| mongodb.atlas.oplatencies.commands.avg | Command latency (aggregations, $vectorSearch) | Monitor — varies by query complexity |
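One gotcha when turning these into monitors: the oplatencies metrics are reported in microseconds, while the thresholds above are written in milliseconds, so the monitor query needs the converted value. A sketch using the datadog library; the cluster_name:prod tag and notification handle are assumptions:

```python
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# 10 ms expressed in microseconds, since oplatencies is reported in µs
api.Monitor.create(
    type="metric alert",
    query="avg(last_10m):avg:mongodb.atlas.oplatencies.reads.avg{cluster_name:prod} > 10000",
    name="MongoDB Atlas read latency > 10ms",
    message="Read latency above 10ms for 10 minutes. @slack-db-oncall",
    options={"thresholds": {"critical": 10000}},
)
```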
Throughput
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| mongodb.atlas.opcounters.query | Read operations per second | Alert on sudden drop > 50% |
| mongodb.atlas.opcounters.insert | Insert operations per second | Monitor for write spikes |
| mongodb.atlas.opcounters.update | Update operations per second | Alert if unexpected spike (runaway upserts) |
| mongodb.atlas.opcounters.getmore | Cursor getMore operations per second; high values indicate large result sets | Monitor for sustained highs, a sign queries return too many documents |
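The "sudden drop" condition maps to a change-based monitor, which compares a recent window against an earlier one rather than testing a fixed threshold. A sketch; the window sizes, tag, and -50% cutoff are illustrative:

```python
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# Alert when the 5-minute average query rate drops more than 50%
# compared to the same metric 30 minutes earlier.
api.Monitor.create(
    type="query alert",  # change-based metric monitor
    query=(
        "pct_change(avg(last_5m),last_30m):"
        "avg:mongodb.atlas.opcounters.query{cluster_name:prod} < -50"
    ),
    name="MongoDB Atlas query throughput dropped >50%",
    message="Read traffic fell off sharply. Check app deploys and connectivity. @pagerduty",
)
```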
Connections
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| mongodb.atlas.connections.current | Current open connections | Alert at 80% of tier limit |
| mongodb.atlas.connections.available | Remaining available connections | Alert if < 20% remaining |
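Connection limits vary by tier, so instead of hard-coding a limit you can alert on the ratio of current to total (current plus available) connections. A sketch, assuming Datadog's support for arithmetic between metric queries in a monitor:

```python
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# current / (current + available) > 0.8, i.e. 80% of the tier's
# connection limit without hard-coding the limit itself.
api.Monitor.create(
    type="metric alert",
    query=(
        "avg(last_5m):"
        "avg:mongodb.atlas.connections.current{cluster_name:prod} / ("
        "avg:mongodb.atlas.connections.current{cluster_name:prod} + "
        "avg:mongodb.atlas.connections.available{cluster_name:prod}) > 0.8"
    ),
    name="MongoDB Atlas connections above 80% of tier limit",
    message="Connection pool nearing the tier limit. @slack-db-oncall",
)
```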
Memory and Storage
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| mongodb.atlas.mem.resident | Resident memory used by the mongod process | Alert if approaching tier limit |
| mongodb.atlas.extra_info.page_faults | Page faults — reads hitting disk instead of cache | Alert on sustained increase |
| mongodb.atlas.wiredtiger.cache.bytes_currently_in_cache | Data in WiredTiger cache vs configured max | Alert at 80% of cache size |
| mongodb.atlas.dbstats.storage_size | On-disk storage used | Alert at 80% of provisioned storage |
Replication
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| mongodb.atlas.replset.replication_lag | Replication lag from primary to secondary (seconds) | Alert if > 10 seconds |
| mongodb.atlas.replset.oplog_window | Hours of operations retained in the oplog. If a secondary's lag exceeds this window, it needs a full resync. | Alert if < 2 hours |
Building the Dashboard
Structure your Datadog MongoDB Atlas dashboard around the four Golden Signals, with a fifth section for AI-specific workload patterns.
- Row 1 — Latency: Timeseries of read, write, and command latency. Overlay with application P99 from APM for correlation.
- Row 2 — Traffic: Timeseries of opcounters by type (query, insert, update, delete). Stacked area chart shows read/write ratio over time.
- Row 3 — Errors: Timeseries of assertion rates. Connection failures. Query targeting ratio (docsExamined vs docsReturned — from DBM).
- Row 4 — Saturation: WiredTiger cache usage as percentage. Connection count vs tier limit. Disk IOPS vs provisioned.
- Row 5 — AI Workloads: $vectorSearch latency (from DBM query stats), document retrieval ops/sec, page faults during context retrieval, replication lag for read-from-secondary patterns.
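This layout can be clicked together in the UI or managed as code. A sketch of the first row using the datadog library's Dashboard API; widget definitions for the remaining rows follow the same pattern, and the tag filter is an assumption:

```python
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Dashboard.create(
    title="MongoDB Atlas - Golden Signals",
    layout_type="ordered",
    widgets=[
        {
            "definition": {
                "type": "timeseries",
                "title": "Row 1 - Latency: reads vs writes vs commands (µs)",
                "requests": [
                    {"q": "avg:mongodb.atlas.oplatencies.reads.avg{cluster_name:prod}"},
                    {"q": "avg:mongodb.atlas.oplatencies.writes.avg{cluster_name:prod}"},
                    {"q": "avg:mongodb.atlas.oplatencies.commands.avg{cluster_name:prod}"},
                ],
            }
        },
        # ...remaining rows (traffic, errors, saturation, AI workloads)
        # follow the same widget pattern.
    ],
)
```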
Alerting Recommendations
| Alert | Condition | Severity |
|---|---|---|
| Connection exhaustion | connections.current > 80% of tier max for 5 min | P1 — New connections will fail |
| Replication lag critical | replication_lag > 30 seconds for 5 min | P1 — Secondaries serving stale data |
| Read latency spike | oplatencies.reads.avg > 50ms for 10 min | P2 — Context retrieval degraded |
| Page faults sustained | page_faults rate > 100/min for 15 min | P2 — Working set exceeds cache |
| Oplog window shrinking | oplog_window < 2 hours | P2 — Risk of replica resync |
| Storage 80% | storage_size > 80% provisioned | P3 — Plan capacity increase |
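These rows translate mechanically into API calls, so a table-driven loop keeps the thresholds reviewable in one place. A sketch covering three of the alerts; queries, tags, and routing handles are illustrative, and the read-latency threshold is converted to microseconds as discussed earlier:

```python
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# (name, query, priority) rows mirroring the alert table above.
ALERTS = [
    ("Replication lag critical",
     "avg(last_5m):avg:mongodb.atlas.replset.replication_lag{cluster_name:prod} > 30",
     1),
    ("Read latency spike",  # 50ms = 50000 µs
     "avg(last_10m):avg:mongodb.atlas.oplatencies.reads.avg{cluster_name:prod} > 50000",
     2),
    ("Oplog window shrinking",  # metric is reported in hours
     "avg(last_5m):avg:mongodb.atlas.replset.oplog_window{cluster_name:prod} < 2",
     2),
]

for name, query, priority in ALERTS:
    api.Monitor.create(
        type="metric alert",
        query=query,
        name=f"MongoDB Atlas: {name}",
        message="@slack-db-oncall",  # placeholder routing
        priority=priority,           # Datadog monitor priority, 1 (highest) to 5
    )
```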
Correlating MongoDB with the Rest of Your Stack
The real power of Datadog for MongoDB monitoring isn't the MongoDB metrics alone — it's the correlation. When your AI agent's P99 latency spikes from 200ms to 2 seconds, you need to see in one view: Was it the application code? Redis cache miss spike? MongoDB read latency? Kubernetes pod restart? Network issue?
- Create a Service Map that links your application → Redis → MongoDB. Datadog APM traces show which database call contributed to end-to-end latency.
- Use Datadog Notebooks for incident investigation — pull MongoDB latency, Redis latency, application errors, and Kubernetes events into one timeline.
- Tag MongoDB metrics with environment, cluster_name, and database to filter by service or team.
- Set up Composite Monitors that fire only when MongoDB latency AND application error rate both spike, which reduces false positives from either signal alone (a sketch follows this list).
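A composite monitor is just a boolean expression over existing monitor IDs. A sketch assuming two hypothetical monitors, 123 for MongoDB read latency and 456 for application error rate:

```python
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# Fire only when BOTH underlying monitors are in alert.
# 123 = MongoDB read latency monitor, 456 = app error-rate monitor
# (hypothetical IDs; look yours up in the monitor URL or via the API).
api.Monitor.create(
    type="composite",
    query="123 && 456",
    name="MongoDB latency AND app errors elevated",
    message="Correlated database and application degradation. @pagerduty",
)
```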
A MongoDB latency spike means nothing in isolation. It means everything when you can see it alongside the Redis cache miss that caused it and the application timeout it produced.