The AWS DocumentDB Maintenance Trap: Why You Cannot Postpone Forever
Part 1 of the DocDB Maintenance Survival Guide. AWS will eventually pick the date for you. Here is why that matters, and the critical distinction between cluster and instance maintenance that nobody explains until you hit it in production.
If you run a production workload on AWS DocumentDB, you already know the dreaded email: "Required maintenance update for your DocumentDB cluster."
You can postpone it. Once. Twice. Maybe three times if you are lucky.
Then AWS forces it on you, on a date they picked, at a time they picked.
Until recently, my team genuinely believed the worst we had to worry about was a few seconds of read-write blocking. We were wrong, and it was the kind of wrong that only shows up in production logs.
This is the first post in a three-part series about what we learned running through two production DocumentDB maintenances in five days — one cluster maintenance, one instance maintenance — and the very different things they break.
1. TL;DR
- AWS DocumentDB maintenance can be postponed only a finite number of times before AWS force-applies it on a date you do not control.
- There are two distinct maintenance types — cluster and instance — and they fail in completely different ways.
- Cluster maintenance causes a roughly 30–50 second read/write block on the cluster endpoint, without primary failover.
- Instance maintenance causes primary↔replica failover, which is a different (and worse) failure mode for your application.
- Plan your downtime window for instance maintenance. Plan your fallback logic for cluster maintenance. They are not the same problem.
2. The Forced Maintenance Problem
Managed AWS services are wonderful right up until they are not.
DocumentDB pushes engine patches and infrastructure updates on a schedule that AWS controls. You receive notifications, you get a "preferred maintenance window" you can configure, and you can postpone individual maintenance events through the AWS console.
What the documentation does not put in bold is this: postponement has a hard limit.
After a finite number of deferrals (typically 2–3, depending on the severity of the patch), the maintenance moves into a force-apply state. AWS will execute it on a specific date, regardless of what your business is doing that week.
A typical timeline looks like this:
- Original window: a workday afternoon.
- Postponed once → pushed by one week.
- Postponed again → pushed by another week.
- AWS notice: "Will be applied on YYYY-MM-DD, HH:00 UTC" — non-negotiable.
If the forced window lands on peak traffic hours, you have two choices:
- Let it happen and hope your application survives the failover under load.
- Pre-empt AWS by executing the maintenance manually through your DBA team during a low-traffic window, before the force-apply date.
Option 2 is almost always the right call. It is the difference between a clean tens-of-seconds outage during off-hours and an unbounded incident in the middle of business hours.
Key takeaway: Treat the AWS maintenance notification as a deadline, not a suggestion. Coordinate with your DBA team to execute the maintenance manually before the forced window. You get to pick the time. AWS picks the date.
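You do not have to wait for the email to know where you stand. As a rough sketch, the pending actions can also be checked programmatically; this assumes the AWS SDK for Java v2 `docdb` module, with region and credentials resolved from the environment, and the helper name is ours, not part of any AWS API:

```kotlin
import software.amazon.awssdk.services.docdb.DocDbClient
import software.amazon.awssdk.services.docdb.model.DescribePendingMaintenanceActionsRequest

// Hypothetical helper: list pending maintenance actions on your DocumentDB resources
// and print the dates that matter.
fun printPendingMaintenance() {
    DocDbClient.create().use { docdb ->
        val response = docdb.describePendingMaintenanceActions(
            DescribePendingMaintenanceActionsRequest.builder().build()
        )
        for (resource in response.pendingMaintenanceActions()) {
            for (action in resource.pendingMaintenanceActionDetails()) {
                println(
                    "${resource.resourceIdentifier()} -> ${action.action()}: " +
                        "currentApplyDate=${action.currentApplyDate()}, " +
                        "forcedApplyDate=${action.forcedApplyDate()}, " + // the non-negotiable date
                        "autoAppliedAfterDate=${action.autoAppliedAfterDate()}"
                )
            }
        }
    }
}
```

Anything that comes back with a forced-apply date should go straight onto your DBA team's calendar.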
3. Cluster vs Instance Maintenance: The Distinction Nobody Explains
Here is what the AWS console will not clearly tell you. DocumentDB maintenance comes in two flavors, and they have completely different failure characteristics.
| Aspect | Cluster Maintenance | Instance Maintenance |
|---|---|---|
| What it patches | Cluster-wide engine update | Individual instance OS / hardware |
| Failover | ❌ No failover | ✅ Primary↔replica failover |
| Cluster endpoint | Blocked for ~30–50s | Blocked briefly per instance |
| Driver impact | Read/write timeout on existing connections | Topology change, new primary discovery |
| Application symptoms | MongoSocketReadException bursts | DataAccessResourceFailureException, then recovery on new primary |
| Recovery time | 30–50 seconds | 2–4 minutes |
🟢 Cluster Maintenance: The Simpler Beast
The cluster endpoint stops responding for 30–50 seconds while AWS rolls the engine version. Your driver sees connection failures, your application sees timeouts, and then everything comes back. No primary changes. No replica set topology mutation. Existing connections die, new connections succeed once the engine is back up.
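What "plan your fallback logic" means in practice depends on your read paths. As an illustration only (every type and name below is hypothetical, not from the incident), a read that can tolerate staleness can degrade to a cached value while the cluster endpoint is blocked:

```kotlin
import org.springframework.dao.DataAccessResourceFailureException
import java.util.concurrent.ConcurrentHashMap

// Hypothetical domain types, defined here only to keep the sketch self-contained.
data class UserSettings(val userId: String, val pushEnabled: Boolean)

interface UserSettingsRepository {
    fun findByUserId(userId: String): UserSettings?
}

class UserSettingsReader(private val repository: UserSettingsRepository) {

    private val lastKnown = ConcurrentHashMap<String, UserSettings>()

    fun read(userId: String): UserSettings? =
        try {
            // Happy path: refresh the local copy on every successful read.
            repository.findByUserId(userId)?.also { lastKnown[userId] = it }
        } catch (ex: DataAccessResourceFailureException) {
            // Cluster maintenance window: existing connections are dead while the
            // engine restarts. Serve the last known value instead of surfacing the outage.
            lastKnown[userId]
        }
}
```

Write paths cannot be papered over the same way, which is exactly what instance maintenance exposes.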
🔴 Instance Maintenance: Where It Gets Interesting
AWS rolls instances one at a time. Each instance going down triggers a topology change in the replica set. If the instance being patched is the primary, the cluster elects a new primary — that is the failover. Your driver has to:
- Detect the old primary is gone.
- Wait for the cluster to elect a new primary (this takes seconds).
- Re-establish connections to the new primary.
- Resume routing writes there.
Until step 3 completes, every write request fails. Reads against secondaryPreferred may also fail if the secondary itself is the one being patched.
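The practical consequence: a write that lands in that election gap needs to be given time, not failed instantly. Here is a minimal retry sketch, assuming Spring Data MongoDB surfaces the failure as a `DataAccessException` subtype; the helper, attempt count, and delays are illustrative, not our production values:

```kotlin
import org.springframework.dao.DataAccessResourceFailureException
import org.springframework.dao.TransientDataAccessException

// Illustrative retry helper, sized against "election takes seconds, full recovery
// takes minutes". Not our production code.
fun <T> retryDuringFailover(
    attempts: Int = 5,
    initialDelayMs: Long = 2_000,
    block: () -> T
): T {
    var delayMs = initialDelayMs
    var lastError: RuntimeException? = null
    repeat(attempts) {
        try {
            return block()
        } catch (ex: DataAccessResourceFailureException) {
            lastError = ex // no reachable primary yet
        } catch (ex: TransientDataAccessException) {
            lastError = ex // e.g. "not primary" surfaced mid-election
        }
        Thread.sleep(delayMs)
        delayMs *= 2 // exponential backoff between attempts
    }
    throw lastError ?: IllegalStateException("retryDuringFailover exhausted $attempts attempts")
}

// Usage (hypothetical repository): retryDuringFailover { auditRepository.save(entry) }
```

Only wrap idempotent writes this way; blindly retrying a non-idempotent write trades lost data for duplicated data.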
4. How We Verified This in Production
We ran both maintenance types back to back on the same production cluster within a single week. The logs are unambiguous. Cluster names and exact timestamps below are anonymized; the error signatures and counts are exactly what we observed.
📊 Cluster Maintenance (~47 second window)
- Driver-level signal: `MongoSocketReadException: Prematurely reached end of stream` on the cluster endpoint.
- Zero "no longer a member of the replica set" messages from the driver.
- 105 application-level errors during the window, all `DataAccessResourceFailureException`.
- Recovery: driver reconnects to the same primary once the engine is back online.
📊 Instance Maintenance (~4 minute window, two phases)
Instance maintenance unfolds in two distinct phases as AWS rolls each instance.
Reader phase (a non-primary instance patched, call it db-replica-a):
- ~3 minutes of socket exceptions.
- Then: `Server db-replica-a is no longer a member of the replica set`.
- No application errors. Writes continued flowing to the unaffected primary.
Writer phase (the primary patched, call it db-primary-1):
- ~4 minutes of socket exceptions.
- Then: `Server db-primary-1 is no longer a member of the replica set`.
- Then: `Discovered replica set primary db-primary-2` (the previous secondary, promoted).
- 5 application errors during the window, all on the consumer side. Read paths were quiet because we ran the maintenance in a planned low-traffic window with traffic gated upstream.
The same MongoDB Java driver, the same Spring Boot application, the same cluster — but the failure modes were completely different.
5. Why This Matters for Capacity Planning
If you have only ever experienced cluster maintenance, you probably built your runbook around "wait 60 seconds, everything comes back."
That runbook breaks the first time you hit instance maintenance during normal traffic.
Here is what changes:
- 🟢 Cluster maintenance: your fallback logic gets exercised. Whatever your driver, circuit-breaker, and retry stack does during a 30–50 second outage, that is your blast radius. Mostly survivable with reasonable timeouts.
- 🔴 Instance maintenance: your write path gets exercised, specifically the period where the old primary is gone but a new primary has not been elected yet. Anything that requires a write (Kafka consumers persisting messages, REST endpoints inserting data, audit logs) will fail.
If your application uses @Transactional writes against MongoDB, those transactions will roll back. If those writes were triggered by Kafka messages and you are using Spring Kafka's default record listener, you are about to learn whether your error handling sends them to a Dead Letter Topic, retries with backoff, or, if you wrote a @CircuitBreaker fallback that returns null, silently drops them.
(Spoiler for Part 2: silently dropping them is the default if you are not careful. We learned this the hard way and shipped a 5-line throw change to fix it.)
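To make the anti-pattern concrete, here is roughly what that failure shape looks like. This is an illustration, not the actual change from the incident: the class, bean, and circuit breaker names are hypothetical, and it assumes Resilience4j's Spring annotation support:

```kotlin
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker
import org.springframework.dao.DataAccessException
import org.springframework.stereotype.Service

// Hypothetical domain types, defined here only to keep the sketch self-contained.
data class Notification(val id: String, val payload: String)

interface NotificationStore {
    fun save(notification: Notification)
}

@Service
class NotificationPersister(private val store: NotificationStore) {

    @CircuitBreaker(name = "mongo", fallbackMethod = "onMongoDown")
    fun persist(notification: Notification) {
        store.save(notification)
    }

    // Anti-pattern: a fallback that returns normally makes the Kafka listener
    // "succeed", the offset gets committed, and the message is silently lost.
    //
    // fun onMongoDown(notification: Notification, ex: DataAccessException) { /* log only */ }

    // Fix: rethrow so the listener container's error handler can retry or route
    // the record to a Dead Letter Topic.
    fun onMongoDown(notification: Notification, ex: DataAccessException) {
        throw ex
    }
}
```

The difference really is one throw: a fallback that completes normally tells the listener container the record was processed, and the message is gone.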
6. What Most Teams Get Wrong
Three patterns I have seen repeatedly:
- Treating "preferred maintenance window" as a guarantee. It is a preference. AWS will respect it for normal patches, but security and critical patches can be applied outside that window with limited notice.
- Running maintenance during business hours because postponement felt safe. Postponement just shifts the problem and removes your ability to pick the timing. Run it manually, off-hours, before AWS forces your hand.
- Testing only against cluster maintenance. Most engineering teams only see cluster maintenance for years before hitting their first instance maintenance, so they never validate their write-path failure handling. Instance maintenance is the real test.
7. Action Items for This Week
If you run DocumentDB in production, do this before your next AWS maintenance email:
- Audit your DBA process. Confirm your team can manually execute DocumentDB maintenance ahead of the force-apply window. Test the runbook end-to-end at least once a year.
- Identify which of your writes are recoverable. For every write path that hits MongoDB, ask: if this fails for 4 minutes during failover, what happens? (Lost? Retried? DLT? Returned 5xx to the user?)
- Read your driver and Spring Data MongoDB defaults. The MongoDB Java driver defaults to a 30-second `serverSelectionTimeout`. That is the maximum time a request will wait for a primary to be elected before failing. We will dig into why that matters in Part 3; a configuration sketch follows this list.
- Schedule a quarterly chaos drill. Force a manual failover on a non-prod cluster (`db.adminCommand({failover:1})`) and verify that your application, including consumers, schedulers, and APIs, recovers cleanly.
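For the driver-defaults item above, here is a minimal Spring Boot sketch that makes those defaults explicit so they show up in code review instead of hiding inside the driver. The timeout values are illustrative, not recommendations; tune them against your own failover measurements:

```kotlin
import com.mongodb.MongoClientSettings
import org.springframework.boot.autoconfigure.mongo.MongoClientSettingsBuilderCustomizer
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import java.util.concurrent.TimeUnit

@Configuration
class MongoTimeoutConfig {

    @Bean
    fun mongoTimeouts(): MongoClientSettingsBuilderCustomizer =
        MongoClientSettingsBuilderCustomizer { builder: MongoClientSettings.Builder ->
            builder.applyToClusterSettings { cluster ->
                // Maximum time a request waits for a usable primary before failing.
                // The driver default is 30 seconds; your upstream HTTP and Kafka
                // timeouts need to be longer than this, or callers give up first.
                cluster.serverSelectionTimeout(30, TimeUnit.SECONDS)
            }
            builder.applyToSocketSettings { socket ->
                socket.connectTimeout(10, TimeUnit.SECONDS)
                socket.readTimeout(15, TimeUnit.SECONDS)
            }
            // DocumentDB connection strings are typically configured with
            // retryWrites=false, since retryable writes are not supported by the engine.
            builder.retryWrites(false)
        }
}
```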
8. Coming in Part 2
Now that we know why maintenance is unavoidable and what the two failure modes look like, the next post tackles the harder question:
When the failover lands during peak traffic, what does it take for zero messages to be lost between Kafka and MongoDB?
Spoiler: the difference between "all your push notifications silently disappear" and "all your push notifications get retried by Kafka" is exactly one line of Kotlin.
This series is based on real production incidents. All cluster names, instance identifiers, internal ticket references, and organization-specific details have been anonymized or generalized. Error signatures, log messages, error counts, and outage durations are real and unmodified.