When Your CircuitBreaker Never Opens: Lessons from Two Production DocDB Failovers

Part 3 of the DocDB Maintenance Survival Guide. Defaults that look reasonable on paper can leave your breaker closed through a real failover. Here is the tuning to apply, and the MongoDB driver setting that quietly defeats your fast-fail strategy.

In Part 1 we covered how DocumentDB maintenance fails. In Part 2 we showed how a one-line throw from a Resilience4j fallback prevents Kafka message loss during a failover.

Both are necessary. Neither is sufficient on its own.

In both production failovers we observed, the CircuitBreaker never opened. We had a safety net that, under realistic outage conditions, never got exercised in production. The fallback throw still saved us, but the breaker itself, the thing that is supposed to fail fast and shed load, sat on the bench.

This post is about why, and what to change.

1. TL;DR

  1. Resilience4j's default minimumNumberOfCalls = 100 with COUNT_BASED sliding window is too high for short, infrastructure-level outages distributed across multiple pods.
  2. Connection-pool exceptions like MongoConnectionPoolClearedException are not in most teams' RECORD_EXCEPTIONS list, which silently exempts them from the failure count.
  3. The MongoDB Java driver's serverSelectionTimeout defaults to 30 seconds. If you only set connectTimeout and readTimeout, your application is still bound by that 30-second wall during failover.
  4. The recommended combination is B + C + E: lower the minimum call threshold, add the missing exception, and explicitly configure serverSelectionTimeout to fail fast.
  5. Validate every change with a forced failover on a non-prod cluster before relying on it.

2. The Setup: A Reasonable-Looking CircuitBreaker

Here is the configuration we went to production with. It looks unobjectionable.

import com.mongodb.MongoSocketException
import com.mongodb.MongoTimeoutException
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType.COUNT_BASED
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.dao.DataAccessResourceFailureException
import java.time.Duration

@Configuration
class Resilience4JConfig {

    @Bean
    fun circuitBreakerConfig(): CircuitBreakerConfig = CircuitBreakerConfig.custom()
        .slidingWindowType(COUNT_BASED)
        .slidingWindowSize(100)
        .minimumNumberOfCalls(100)
        .failureRateThreshold(30f)
        .waitDurationInOpenState(Duration.ofSeconds(30))
        .permittedNumberOfCallsInHalfOpenState(10)
        .recordExceptions(
            DataAccessResourceFailureException::class.java,
            MongoSocketException::class.java,
            MongoTimeoutException::class.java,
            // ... 10 more
        )
        .build()
}

The intent is reasonable. We do not want the breaker to flip on a single hiccup, so we require 100 calls in the sliding window before the breaker even starts evaluating. We track the most relevant Mongo exception types. Failure threshold of 30%. Standard stuff.

Now look at what happened in our two production DocumentDB events.

3. Why It Did Not Fire: The Math

📊 Event A: Cluster Maintenance (8 pods, 47s window)

  • Total Mongo-related exceptions across the fleet: ~989
  • Pod with the most exceptions: 75
  • Sliding window threshold required to start evaluating: 100
  • Result: No pod recorded enough calls during the 47-second window to reach the 100-call floor, so the breaker never evaluated a failure rate.

📊 Event B: Instance Maintenance Failover (4 history-writer pods)

  • Total errors: 5
  • Distributed as: 3 / 1 / 1 / 0 across pods
  • Maximum on any single pod: 3
  • Result: Even further below the threshold. Spring Kafka's retry absorbed all 5 because the new primary was elected before the retries were exhausted.

The fundamental problem with a COUNT_BASED sliding window of size 100 is that it assumes sustained traffic. During an infrastructure outage, the failures are concentrated in time but diluted across pods. A short outage on a horizontally scaled service produces single-digit errors per pod, nowhere near the minimum-calls floor.

The breaker is designed to protect against endpoint degradation under heavy load. It is not designed to detect brief, total failures. Tuning has to acknowledge that.
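
This arithmetic is easy to reproduce against the Resilience4j state machine itself. The following is a minimal, standalone sketch, not our production wiring: it replays 75 recorded failures, the worst single-pod count from Event A, against the section 2 thresholds and confirms the breaker never evaluates a failure rate.

import com.mongodb.MongoTimeoutException
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType.COUNT_BASED
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry
import java.util.concurrent.TimeUnit

fun main() {
    // The section 2 thresholds: 100-call window, 100-call floor, 30% failure rate.
    val config = CircuitBreakerConfig.custom()
        .slidingWindowType(COUNT_BASED)
        .slidingWindowSize(100)
        .minimumNumberOfCalls(100)
        .failureRateThreshold(30f)
        .recordExceptions(MongoTimeoutException::class.java)
        .build()
    val breaker = CircuitBreakerRegistry.of(config).circuitBreaker("event-a-worst-pod")

    // Replay the hardest-hit pod from Event A: 75 recorded failures during the outage.
    repeat(75) {
        breaker.onError(0, TimeUnit.MILLISECONDS, MongoTimeoutException("simulated outage"))
    }

    // 75 recorded calls < minimumNumberOfCalls(100): the failure rate is never calculated.
    println(breaker.state)               // CLOSED
    println(breaker.metrics.failureRate) // -1.0, i.e. "not enough calls to evaluate"
}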

4. The Tuning Options

We considered five options. Each has trade-offs.

  • Option A: switch to a TIME_BASED sliding window (10-second window, minimumNumberOfCalls 10); see the sketch after this list. Pros: catches short, sharp outages. Cons: a short window risks false positives on transient hiccups.
  • Option B ✅: lower minimumNumberOfCalls from 100 to 20–30. Pros: the smallest behavioral change; the breaker can fire at realistic per-pod failure counts. Cons: slightly more sensitive to spikes; pair it with an appropriate failure-rate threshold.
  • Option C ✅: add MongoConnectionPoolClearedException to recordExceptions. Pros: stops silently exempting a major failure mode. Cons: on its own it does not solve the threshold problem (38 extra errors across 8 pods is roughly +5 per pod).
  • Option D: leave the defaults and rely on the fallback throw + Kafka retry. Pros: already shown to prevent message loss. Cons: the breaker remains untested in production, and HTTP read paths still pay the full failover latency.
  • Option E ✅: explicitly configure serverSelectionTimeout on the MongoDB driver. Pros: caps per-request latency during failover and accelerates failure accrual into the breaker window. Cons: too short a value causes false positives during normal driver re-discovery.
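
For reference, Option A, which we did not adopt, would look roughly like the sketch below (same imports as the section 2 config; note that with TIME_BASED, slidingWindowSize is interpreted as seconds, not calls).

// Option A sketch (not adopted): evaluate over the last 10 seconds instead of the last 100 calls.
CircuitBreakerConfig.custom()
    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.TIME_BASED)
    .slidingWindowSize(10)          // seconds, because the window is TIME_BASED
    .minimumNumberOfCalls(10)
    .failureRateThreshold(30f)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .build()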

The recommended combination is B + C + E. Each addresses a different facet of the same problem.

5. Option E in Detail: The serverSelectionTimeout Trap

Option E deserves its own section because it is the setting most teams have wrong by default without realizing it.

🔍 The Discovery

During the post-mortem of one of these maintenance events, we noticed something odd in the Dead Letter Topic. A single message had landed there. The interesting part was the timing of its retry attempts:

  • Event published to Kafka: 15:04:59
  • First processing attempt failed: 15:05:29 — exactly 30 seconds later.
  • Subsequent retries: 3-second intervals (matching FixedBackOff(3000L, 3L)).

A 30-second gap between Kafka pulling the message and the first error is not normal. The Kafka consumer should have started processing immediately. So where did the 30 seconds go?

📚 What the Driver Was Doing

Reading the official MongoDB Java Driver and Spring Data MongoDB documentation:

serverSelectionTimeoutMS — The maximum number of milliseconds the driver will wait while attempting to find a suitable server before throwing a server selection error. Default: 30000 ms.

During a failover, the driver knows the old primary is gone and needs to find a new one. It enters a server-selection phase. This phase has nothing to do with connectTimeout or readTimeout — those govern the socket once selected. While selecting, the driver waits up to 30 seconds for a suitable server to appear in its topology view.

AWS DocumentDB cluster failover, per AWS documentation, "typically completes within 30 seconds." If your driver default is also 30 seconds, you are racing the cluster recovery against the driver timeout, with no safety margin and no fast-fail signal.
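
You can see the trap without any cluster at all. A minimal sketch, assuming only the driver on the classpath: configure socket timeouts the way most services do, and the server-selection budget is still the 30-second default.

import com.mongodb.MongoClientSettings
import java.util.concurrent.TimeUnit

fun main() {
    // Socket timeouts only, nothing applied to cluster settings.
    val settings = MongoClientSettings.builder()
        .applyToSocketSettings { socket ->
            socket.connectTimeout(3, TimeUnit.SECONDS).readTimeout(3, TimeUnit.SECONDS)
        }
        .build()

    // The server-selection budget is untouched by the socket settings: still the driver default.
    println(settings.clusterSettings.getServerSelectionTimeout(TimeUnit.MILLISECONDS)) // 30000
}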

🛠️ The Fix

Most teams configure socket timeouts but forget the cluster-settings builder, where server-selection lives:

import com.mongodb.MongoClientSettings
import org.springframework.boot.autoconfigure.mongo.MongoConnectionDetails
import org.springframework.context.annotation.Configuration
import org.springframework.data.mongodb.config.AbstractMongoClientConfiguration
import java.util.concurrent.TimeUnit

@Configuration
class MongoConfig(private val connectionDetails: MongoConnectionDetails) :
    AbstractMongoClientConfiguration() {

    // getDatabaseName() override omitted here for brevity.

    override fun configureClientSettings(builder: MongoClientSettings.Builder) {
        builder.applyConnectionString(connectionDetails.connectionString)
            .applyToConnectionPoolSettings { ... }
            .applyToSocketSettings { socket ->
                socket
                    .connectTimeout(3, TimeUnit.SECONDS)
                    .readTimeout(3, TimeUnit.SECONDS)
            }
            .applyToClusterSettings { cluster ->
                cluster.serverSelectionTimeout(10, TimeUnit.SECONDS)
            }
    }
}

The applyToClusterSettings block is the addition. Picking the value is the harder question:

  • Too short (e.g. 5s): false positives during normal driver re-discovery, especially on cold start or transient network blips.
  • Too long (default 30s): every failover-affected request blocks for 30 seconds before failing. CircuitBreaker accrual is delayed, user-facing endpoints look hung.
  • 10–15s: covers most observed AWS failover durations with margin, fails fast enough to feed the breaker.

Pick a value, validate it with a forced failover on a non-prod cluster, and watch what your driver does during topology re-discovery. Adjust based on observed reality, not online advice.
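
If your service configures the driver through a connection string rather than a MongoClientSettings builder, the same knob is exposed as the serverSelectionTimeoutMS URI option. A sketch with placeholder host and database names, mirroring the builder values above:

import com.mongodb.ConnectionString

// Placeholder host/database; the timeout options mirror the builder configuration above.
val connectionString = ConnectionString(
    "mongodb://docdb.example.internal:27017/appdb" +
        "?serverSelectionTimeoutMS=10000&connectTimeoutMS=3000&socketTimeoutMS=3000"
)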

6. Option C in Detail: The RECORD_EXCEPTIONS Audit

The MongoDB driver throws a wide variety of exceptions during failure scenarios. Most teams populate recordExceptions with the obvious ones:

  • DataAccessResourceFailureException (Spring's translation of socket-level failures)
  • MongoSocketException
  • MongoTimeoutException
  • UncategorizedMongoDbException

But MongoDB also throws connection-pool-management exceptions that do not inherit from any of the above:

com.mongodb.MongoConnectionPoolClearedException:
  Connection pool for db.example.com:27017 was cleared because another
  operation failed with: MongoSocketException

This appears during cascading pool failures — one connection breaks, the pool clears, every other in-flight request that was about to use that pool fails with this exception. In a 47-second cluster outage we observed 38 of these. None of them counted toward the breaker's failure tally.

Add it explicitly:

.recordExceptions(
    DataAccessResourceFailureException::class.java,
    MongoSocketException::class.java,
    MongoTimeoutException::class.java,
    MongoConnectionPoolClearedException::class.java,  // often missing
    // ...
)

A pragmatic audit: write a small test that forces a failure scenario (kill the database, or use Testcontainers + chaos tooling), capture every exception type that bubbles up, and confirm each one is in your recordExceptions list.
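
Here is a minimal sketch of the "confirm each one is recorded" step, assuming the section 2 bean has already been updated with the section 7 tuning. Types that are awkward to construct by hand, MongoConnectionPoolClearedException among them, are better exercised through the integration-level tests in section 8:

import com.mongodb.MongoSocketException
import com.mongodb.MongoTimeoutException
import com.mongodb.ServerAddress
import org.springframework.dao.DataAccessResourceFailureException

fun main() {
    // The bean from section 2, assumed to already include the section 7 changes.
    val config = Resilience4JConfig().circuitBreakerConfig()
    val recorded = config.recordExceptionPredicate

    // Every exception type captured during the forced failure must be recorded by the breaker.
    val observed = listOf(
        MongoTimeoutException("simulated: no suitable server"),
        MongoSocketException("simulated: connection reset", ServerAddress()),
        DataAccessResourceFailureException("simulated: translated socket failure"),
    )
    observed.forEach { ex ->
        check(recorded.test(ex)) { "${ex::class.simpleName} is not covered by recordExceptions" }
    }
}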

7. The Combined Tuning

Putting B + C + E together:

// Resilience4j config
CircuitBreakerConfig.custom()
    .slidingWindowType(COUNT_BASED)
    .slidingWindowSize(100)
    .minimumNumberOfCalls(25)                 // B: was 100
    .failureRateThreshold(30f)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .recordExceptions(
        DataAccessResourceFailureException::class.java,
        MongoSocketException::class.java,
        MongoTimeoutException::class.java,
        MongoConnectionPoolClearedException::class.java,  // C
    )
    .build()

// MongoDB driver config
builder.applyToClusterSettings { cluster ->
    cluster.serverSelectionTimeout(10, TimeUnit.SECONDS)  // E
}

Why each piece is necessary:

  • B alone: the threshold is lower, but if half of your failures are MongoConnectionPoolClearedException, they still do not count. The breaker still does not fire.
  • C alone: the right exceptions are counted, but minimumNumberOfCalls 100 is still unreachable in a short outage spread across multiple pods. The breaker still does not fire.
  • E alone: each individual request fails faster, so failures accrue faster. But B and C still gate whether they reach the breaker; E only helps if B and C are also fixed.
  • B + C + E: lower threshold + correct exception coverage + fast-fail per request. The breaker can now actually fire under realistic outage conditions, and user-facing latency during failover is bounded.

8. Validating Before You Trust It

Configuration changes that you have not validated against a real failure are configuration changes you should not trust. Three ways to validate, in increasing order of confidence:

  1. Unit test the breaker config. Build a CircuitBreaker from your config, fire 25 simulated failures of each type in recordExceptions, and assert the breaker transitions to OPEN (a sketch follows this list). This catches typos, missing exceptions, and threshold mistakes.
  2. Integration test against a flaky Mongo. Use Testcontainers + Toxiproxy to simulate connection failures and observe both the breaker state and your application's behavior. This catches driver-level issues like the serverSelectionTimeout default.
  3. Forced failover on non-prod. Trigger a manual failover of your dev/staging DocumentDB cluster (via the AWS console or aws docdb failover-db-cluster). Observe the actual driver topology change, the breaker transition, and the consumer behavior. This is the only way to catch issues that appear only in a real cluster topology.
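
A sketch of step 1, essentially the section 3 sketch with the section 7 floor: with minimumNumberOfCalls at 25, a burst of 25 simulated failures must leave the breaker OPEN. Repeat it for every type in your recordExceptions list.

import com.mongodb.MongoTimeoutException
import io.github.resilience4j.circuitbreaker.CircuitBreaker
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType.COUNT_BASED
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry
import java.util.concurrent.TimeUnit

fun main() {
    // The section 7 thresholds: same 100-call window, but a 25-call floor.
    val config = CircuitBreakerConfig.custom()
        .slidingWindowType(COUNT_BASED)
        .slidingWindowSize(100)
        .minimumNumberOfCalls(25)
        .failureRateThreshold(30f)
        .recordExceptions(MongoTimeoutException::class.java)
        .build()
    val breaker = CircuitBreakerRegistry.of(config).circuitBreaker("validation")

    // 25 recorded failures reach the call floor at a 100% failure rate ...
    repeat(25) {
        breaker.onError(0, TimeUnit.MILLISECONDS, MongoTimeoutException("simulated failover"))
    }

    // ... so the breaker must now be OPEN.
    check(breaker.state == CircuitBreaker.State.OPEN) { "expected OPEN, was ${breaker.state}" }
}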

Schedule a quarterly chaos drill. Maintenance happens whether you are ready for it or not; rehearsal is cheaper than recovery.

9. Action Items

  1. Audit minimumNumberOfCalls against realistic per-pod failure counts during a short outage. Drop it to 20–30 if your outage profile is short and concentrated.
  2. List every exception your driver throws during a forced failure scenario. Add the missing ones (commonly MongoConnectionPoolClearedException) to recordExceptions.
  3. Explicitly configure serverSelectionTimeout via applyToClusterSettings. Pick a value (10–15s is a defensible starting point) and validate.
  4. Run a forced failover on a non-prod cluster. Watch the breaker transition logs, watch driver topology logs, watch consumer lag. Iterate.
  5. Document your outage profile. If you ever read your own runbook in 18 months, you will want the per-pod failure counts and recovery times written down somewhere.

10. Series Wrap-Up

Across three posts we walked through the full picture of running Spring Boot + Spring Data MongoDB + Kafka against AWS DocumentDB through real maintenance events:

  • Part 1 — AWS forces maintenance on its own schedule. Cluster and instance maintenance fail differently. Plan accordingly. Read →
  • Part 2 — A silent CircuitBreaker fallback is a message-loss bug. One throw turns it into a Kafka-protected retry chain. Read →
  • Part 3 (this post) — Default Resilience4j thresholds and default serverSelectionTimeout together produce a breaker that never fires. B + C + E fixes that.

The deeper lesson across all three: defaults are designed for the average case, and infrastructure outages are not the average case. The best time to find out which of your defaults are wrong is in a non-prod chaos drill, not at 3 AM during an AWS-forced maintenance window.

This series is based on real production incidents. All cluster names, instance identifiers, internal ticket references, and organization-specific details have been anonymized or generalized. Error signatures, log messages, error counts, and outage durations are real and unmodified.