Surviving DocDB Failover with Spring Data MongoDB and Kafka: The One-Line Throw That Saved Our Messages

Part 2 of the DocDB Maintenance Survival Guide. When a MongoDB primary fails over during peak traffic, the difference between losing every message and retrying every message is exactly one line of Kotlin.

In Part 1, we covered why AWS DocumentDB maintenance is unavoidable and how cluster and instance maintenance fail in completely different ways. The summary: instance maintenance triggers a real failover, your write path goes down for 2–4 minutes, and whatever your error handling does during that window is your blast radius.

This post is about that error handling. Specifically: a Kafka consumer that persists messages to MongoDB, wrapped in a Resilience4j @CircuitBreaker, with a fallback method that — following an extremely common pattern — logs the error and returns silently.

That pattern silently drops every message during a failover. Here is how we found out, and the one-line fix.

1. TL;DR

  1. A Resilience4j @CircuitBreaker fallback that logs and returns will silently absorb all failures, including the failures that should retry.
  2. For Kafka consumers persisting to MongoDB, "absorb" means the offset gets committed and the message is gone.
  3. The fix is to throw the exception argument from the fallback. Spring Kafka then engages its retry / backoff / DLT machinery, and Kafka's own offset semantics protect you.
  4. We verified this in production through two real DocumentDB maintenances: 105 errors during a cluster maintenance, 5 errors during an instance maintenance failover. Zero messages lost. One DLT entry across both events, redriven cleanly.
  5. If you only ship one change after reading this post: audit every @CircuitBreaker fallback in your Kafka consumer paths and confirm it throws.

2. The Setup: Kafka → CircuitBreaker → MongoDB

The architecture is unremarkable. A Kafka consumer receives a message, deserializes it, calls a service method, and the service method calls an adaptor that performs a Mongo write through Spring Data MongoDB.

The Mongo write is wrapped in a Resilience4j @CircuitBreaker. This is a perfectly reasonable choice: when the database becomes unhealthy, you want to fail fast instead of holding consumer threads hostage on every blocked request.

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker
import org.slf4j.LoggerFactory
import org.springframework.stereotype.Component

@Component
class PushHistoryAdaptor(
    private val repository: PushHistoryRepository,
) {
    private val logger = LoggerFactory.getLogger(javaClass)

    @CircuitBreaker(name = "mongoWrite", fallbackMethod = "saveAllFallback")
    fun saveAll(entities: List<PushHistory>) {
        repository.saveAll(entities)
    }

    // Invoked by Resilience4j whenever saveAll throws, or the breaker is open.
    private fun saveAllFallback(
        entities: List<PushHistory>,
        e: Exception,
    ) {
        logger.error("PushHistory save failed", e)
        // silently returns Unit
    }
}
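
The "mongoWrite" instance referenced by the annotation gets its thresholds (sliding window size, minimum calls, failure rate) from Resilience4j configuration, either in application.yml or programmatically. A minimal programmatic sketch, assuming the resilience4j-spring-boot starter; the numbers are placeholders, not our production tuning (that tuning is the subject of Part 3):

import io.github.resilience4j.common.circuitbreaker.configuration.CircuitBreakerConfigCustomizer
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import java.time.Duration

@Configuration
class MongoWriteCircuitBreakerConfig {

    // Illustrative thresholds for the "mongoWrite" instance; the real values
    // are part of the tuning discussion in Part 3.
    @Bean
    fun mongoWriteCustomizer(): CircuitBreakerConfigCustomizer =
        CircuitBreakerConfigCustomizer.of("mongoWrite") { builder ->
            builder
                .slidingWindowSize(100)
                .minimumNumberOfCalls(100)
                .failureRateThreshold(50.0f)
                .waitDurationInOpenState(Duration.ofSeconds(30))
        }
}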

From the consumer side it looks like this:

@KafkaListener(topics = ["push-history-events"])
fun consume(records: List<ConsumerRecord<String, PushHistoryEvent>>) {
    val entities = records.map { it.value().toEntity() }
    pushHistoryAdaptor.saveAll(entities)
}

The pattern reads cleanly. The fallback "handles" the error. The consumer method completes without throwing. Spring Kafka commits the offset. Everyone goes home happy.

Until the database fails over.

3. The Anti-Pattern: Silent Fallback

Here is what happens when the MongoDB primary disappears for 47 seconds during a cluster maintenance:

  1. The consumer pulls a batch of records from Kafka.
  2. saveAll calls repository.saveAll(entities).
  3. Spring Data MongoDB tries to acquire a connection; the driver cannot reach a primary, and Spring translates the failure into DataAccessResourceFailureException.
  4. Resilience4j catches the exception, calls saveAllFallback.
  5. The fallback logs the error and returns. The consumer method sees no exception.
  6. Spring Kafka commits the offset for that batch.
  7. The next batch arrives. Mongo is still down. Step 4 fires again. Step 6 fires again.

By the time the database recovers, every message that arrived during the outage has been read, "handled", offset-committed, and permanently lost. There is no DLT. There is no retry. The consumer never sees a failure that it can react to. The error log is all you have, and you are now in the recovery business of replaying messages from upstream — if upstream still has them.

The anti-pattern in one sentence: a fallback that does not throw transforms an infrastructure outage into silent message loss, because Spring Kafka's offset commit is driven by whether the listener method threw, not by whether your business logic actually succeeded.

4. The Fix: One Line

Resilience4j fallback methods receive the original exception as their last argument. The fix is to throw it.

private fun saveAllFallback(
    entities: List<PushHistory>,
    e: Exception,
) {
    logger.error("PushHistory save failed", e)
    throw e   // ← the only change
}

That is the entire diff. The fallback still exists. It still logs. The CircuitBreaker still flips open under sustained failure. The only difference is the exception now propagates back to the listener.

What changes downstream of that throw:

  • The @KafkaListener method receives the exception.
  • Spring Kafka's DefaultErrorHandler kicks in.
  • The configured BackOff (we use FixedBackOff(3000L, 3L)) retries the batch up to 3 times with 3-second delays.
  • If all retries fail, the DeadLetterPublishingRecoverer sends the record to a DLT topic.
  • Crucially, the offset is only committed once the record either succeeds or lands in the DLT. Until then, Kafka considers the message unprocessed.

The Spring Kafka error-handler configuration that backs this looks roughly like:

@Bean
fun errorHandler(template: KafkaTemplate<Any, Any>): DefaultErrorHandler {
    val recoverer = DeadLetterPublishingRecoverer(template) { record, _ ->
        TopicPartition("${record.topic()}.DLT", record.partition())
    }
    return DefaultErrorHandler(recoverer, FixedBackOff(3000L, 3L))
}
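
For completeness: the error handler only takes effect once it is registered on the listener container factory, and because the listener above receives a List of records, the factory has to be in batch mode. A rough sketch of that wiring (bean names and generic types are assumptions, not the exact production setup):

import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory
import org.springframework.kafka.core.ConsumerFactory
import org.springframework.kafka.listener.DefaultErrorHandler

@Configuration
class KafkaConsumerConfig {

    @Bean
    fun kafkaListenerContainerFactory(
        consumerFactory: ConsumerFactory<String, PushHistoryEvent>,
        errorHandler: DefaultErrorHandler,
    ): ConcurrentKafkaListenerContainerFactory<String, PushHistoryEvent> {
        val factory = ConcurrentKafkaListenerContainerFactory<String, PushHistoryEvent>()
        factory.setConsumerFactory(consumerFactory)
        // The listener receives a List<ConsumerRecord<...>>, so batch mode is required.
        factory.setBatchListener(true)
        // Retry with backoff, then DLT, on any exception thrown by the listener.
        factory.setCommonErrorHandler(errorHandler)
        return factory
    }
}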

5. Why This Works: Kafka's Offset Semantics

The whole pattern hinges on a property of Spring Kafka that is easy to forget: the listener throwing is the signal that something went wrong. There is no other signal.

When the listener method returns normally, Spring Kafka treats it as successful processing and proceeds to commit the offset for that record (or batch). It does not introspect your code. It does not check whether repository.saveAll() actually wrote anything. It does not read the logger output. The contract is binary: returned cleanly = success, threw = failure.

A silent fallback breaks this contract. It catches an actual failure and lies about it. Spring Kafka has no way to know.

Throwing from the fallback restores the contract. Now Spring Kafka sees the exception and:

  • ⏳ Retry phase: The DefaultErrorHandler seeks back to the offset of the failed record and re-delivers it after the backoff. If the database is back by retry 2, the message goes through.
  • 📦 DLT phase: If retries exhaust, the recoverer publishes the record to a Dead Letter Topic. Now you have a durable record of the failure that ops can inspect, redrive, or replay (a minimal redrive sketch follows this list).
  • ✅ Offset advance: Only after success or DLT publish does the offset advance. There is no path that loses data without explicit human visibility.
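
When a record does end up in the DLT, the redrive itself does not have to be elaborate. A sketch of a one-shot redrive loop, assuming string keys and byte-array values; our actual process is an admin endpoint with the same shape, covered in the caveats below:

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
import java.time.Duration

// One-shot redrive sketch: copy everything currently sitting in the DLT back
// onto the original topic, committing DLT offsets as we go. The key/value
// types (String, ByteArray) are assumptions, not the real serialization config.
fun redrive(
    dltTopic: String,
    targetTopic: String,
    consumer: KafkaConsumer<String, ByteArray>,
    producer: KafkaProducer<String, ByteArray>,
) {
    consumer.subscribe(listOf(dltTopic))
    while (true) {
        val records = consumer.poll(Duration.ofSeconds(5))
        if (records.isEmpty) break // nothing left in the DLT
        records.forEach { record ->
            producer.send(ProducerRecord(targetTopic, record.key(), record.value()))
        }
        producer.flush()
        consumer.commitSync()
    }
}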

6. Production Verification: Two Real Failovers

Cluster names and identifiers below are anonymized. Error counts, durations, and signatures are real.

📊 Event A: DocumentDB Cluster Maintenance (~47s window)

During regular production traffic, AWS rolled the cluster engine version and the cluster endpoint stopped responding.

  • Application errors: 105 (60 from API gateway, 45 from Kafka consumer pods)
  • Error signature: DataAccessResourceFailureException (100% of cases)
  • Pod restarts: 0
  • Messages routed to DLT: 1 (redriven cleanly post-recovery)
  • Messages lost: 0

Without the fallback throw, every one of those 45 consumer-side errors would have meant a batch of messages silently dropped instead of retried.

📊 Event B: DocumentDB Instance Maintenance with Real Failover

This one ran during a maintenance window with traffic gated upstream, but the Kafka consumers were still draining queued messages. The primary instance was patched and a new primary was elected.

  • Topology change duration: ~4 minutes (primary loss + election + driver re-discovery)
  • Application errors: 5 (all on consumer pods)
  • Error signature: DataAccessResourceFailureException: Prematurely reached end of stream
  • Messages routed to DLT: 0
  • Messages lost: 0 (all 5 absorbed by retry, primary re-elected before DLT threshold)

This is the cleaner outcome. The 4-minute window included the actual primary election, but Spring Kafka's 3-attempt retry with 3-second backoff was generous enough to span the topology change. Every failed write retried, found the new primary, and committed.

7. The Wider Pattern: Where Else This Bites

The CircuitBreaker fallback is the most common case, but the same anti-pattern appears anywhere a layer "handles" an exception that should propagate. Audit your code for:

  • Try-catch blocks that log and continue in service or adaptor layers called from a Kafka listener.
  • Reactive onErrorResume / onErrorReturn in WebFlux pipelines that convert an upstream failure into a successful empty result (see the sketch at the end of this section).
  • @Async methods whose exceptions are dropped because nothing awaits the returned CompletableFuture.
  • Scheduled jobs with broad catch (Exception e) that quietly skip the failed iteration.

The rule of thumb: any place where a failure is logged but not signalled to the caller is a place where the system can silently drop work. In transactional or message-driven contexts, that is exactly what you do not want.
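
To make the reactive variant concrete, here is a minimal sketch of the same anti-pattern in a Reactor pipeline. The ReactivePushHistoryAdaptor, the reactive repository, and the toEntity call are illustrative, not code from this system:

import org.slf4j.LoggerFactory
import org.springframework.data.mongodb.repository.ReactiveMongoRepository
import reactor.core.publisher.Mono

class ReactivePushHistoryAdaptor(
    private val repository: ReactiveMongoRepository<PushHistory, String>,
) {
    private val logger = LoggerFactory.getLogger(javaClass)

    // Anti-pattern: the failure is logged, then replaced with an empty success
    // signal. Whatever acknowledges the message upstream sees a completed Mono
    // and commits, exactly like the silent fallback.
    fun save(event: PushHistoryEvent): Mono<PushHistory> =
        repository.save(event.toEntity())
            .onErrorResume { e ->
                logger.error("PushHistory save failed", e)
                Mono.empty()
            }
}

The fix mirrors the blocking case: log, then return Mono.error(e) (or simply let the error propagate) so the subscriber still sees the failure.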

8. Caveats and Trade-offs

Throwing from the fallback is not free. A few things to be aware of:

  • Consumer lag will grow during outages. If the database is down for 4 minutes and your consumer cannot make progress, lag accumulates. Right answer; bad for dashboards. Make sure your alerting differentiates "lag from infrastructure outage" from "lag from slow processing".
  • The CircuitBreaker still flips. Once you cross minimumNumberOfCalls with a high enough failure rate, the breaker opens and short-circuits subsequent attempts. The fallback throws either way (the original exception while the breaker is closed, CallNotPermittedException once it is open), so the consumer keeps retrying via Spring Kafka's backoff, which is now extra cheap because the breaker fails fast. This is the desired behavior.
  • DLT topics need lifecycle planning. If your DLT receives a message during an outage, you need a redrive process. We use a small admin endpoint that consumes from the DLT and re-publishes to the original topic; a one-off redrive can also be scripted with the Kafka console consumer and producer, or with the loop sketched in section 5.
  • Idempotency matters. Retries can deliver the same message more than once. Make sure your write is idempotent: a unique index on the natural key, an upsert, or an explicit dedup check (a sketch follows this list). Without that, you trade message loss for duplicate writes.
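
For the idempotency point, a minimal upsert sketch with MongoTemplate; the messageId key and the other field names are assumptions about the schema, not the real document model:

import org.springframework.data.mongodb.core.MongoTemplate
import org.springframework.data.mongodb.core.query.Criteria
import org.springframework.data.mongodb.core.query.Query
import org.springframework.data.mongodb.core.query.Update

// Upsert keyed on a natural/business key: a redelivered message overwrites
// the same document instead of inserting a duplicate. Field names are illustrative.
fun upsertPushHistory(mongoTemplate: MongoTemplate, entity: PushHistory) {
    val query = Query(Criteria.where("messageId").`is`(entity.messageId))
    val update = Update()
        .set("userId", entity.userId)
        .set("sentAt", entity.sentAt)
    mongoTemplate.upsert(query, update, PushHistory::class.java)
}

If you stay on repository.saveAll, deriving the document id from the natural key gives the same overwrite behavior; a unique index alone turns a duplicate into a loud DuplicateKeyException that you can treat as already-processed.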

9. Action Items

  1. Grep your codebase for fallbackMethod and audit every fallback. If it does not throw and the wrapped operation is on a write path, fix it.
  2. Verify your Spring Kafka error handler is configured with both a BackOff and a DeadLetterPublishingRecoverer. A retry-only configuration without DLT will eventually exhaust and you are back to silent loss.
  3. Write a test that asserts the fallback throws. Mock a failing repository and assert that the exception propagates out of the adaptor; make sure the Resilience4j aspect is applied (for example via a Spring test context), otherwise the @CircuitBreaker fallback is never invoked and the test passes without proving anything. A sketch follows this list.
  4. Make every Mongo write idempotent. Unique indexes on natural keys, or explicit upserts. Retries become safe; duplicate deliveries become non-events.
  5. Run a chaos drill. Force a failover on a non-prod cluster and confirm your consumer drains its lag without losing messages or restarting pods.
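
A rough shape of the regression test from action item 3. It assumes the Resilience4j Spring Boot starter is active in the test context so the @CircuitBreaker advice (and therefore the fallback) is actually applied; in practice you may want a narrower test slice than a full @SpringBootTest:

import org.junit.jupiter.api.Test
import org.junit.jupiter.api.assertThrows
import org.mockito.ArgumentMatchers.anyList
import org.mockito.BDDMockito.given
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.boot.test.context.SpringBootTest
import org.springframework.boot.test.mock.mockito.MockBean
import org.springframework.dao.DataAccessResourceFailureException

@SpringBootTest
class PushHistoryAdaptorFallbackTest {

    @Autowired
    private lateinit var adaptor: PushHistoryAdaptor

    @MockBean
    private lateinit var repository: PushHistoryRepository

    @Test
    fun `fallback rethrows when the Mongo write fails`() {
        // Simulate the failover-era signature we saw in production.
        given(repository.saveAll(anyList<PushHistory>()))
            .willThrow(DataAccessResourceFailureException("connection refused"))

        // If the fallback silently swallowed the exception, nothing would
        // propagate here and this assertion would catch the regression.
        assertThrows<DataAccessResourceFailureException> {
            adaptor.saveAll(emptyList())
        }
    }
}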

10. Coming in Part 3

The fallback throw is necessary, but it is not sufficient. In both production events above, the CircuitBreaker never opened. Sliding window thresholds were never reached. We had a working safety net that was never tested in production at the level we expected.

When your CircuitBreaker is configured with minimumNumberOfCalls = 100 and your real production failure produces 5 errors across 4 pods, what tuning makes the breaker actually fire? And what does serverSelectionTimeout have to do with it?

Part 3 covers the tuning analysis and the MongoDB driver setting that, by default, blocks every request for 30 seconds during failover — whether you wanted it to or not.

This series is based on real production incidents. All cluster names, instance identifiers, internal ticket references, and organization-specific details have been anonymized or generalized. Error signatures, log messages, error counts, and outage durations are real and unmodified.