
Bulkhead Pattern

Status: Complete
Category: Architecture
Default enforcement: Advisory
Author: PushBackLog team


Tags

  • Topic: architecture, resilience
  • Skillset: backend, devops
  • Technology: generic
  • Stage: planning, execution

Summary

The Bulkhead pattern isolates components of a system into separate resource pools so that a failure or overload in one component cannot exhaust the resources of another. Named after the watertight compartments in ship hulls, the pattern ensures that a breach floods a single compartment, not the whole ship. It is a fundamental resilience pattern for any system where one noisy or failing dependency should not cascade into broader system failure.


Rationale

Cascading failure is the most common serious outage pattern

A cascading failure begins with a single component — a slow database query, an unresponsive third-party API, a spike in traffic to one endpoint — and propagates through the system because shared resources are exhausted. Thread pools fill with blocked requests waiting for the slow dependency. Connection pools are held open. Memory and CPU are consumed by requests that will never complete. The rest of the system, which was functioning correctly, degrades or fails entirely.

The mechanism is not exotic. It is the ordinary consequence of shared resource pools in systems that have not explicitly isolated their failure domains.

Isolation is the design response

Bulkheads prevent cascade by partitioning resources. If a payment gateway integration is slow, the threads blocked waiting for it should come from a pool dedicated to payment operations — not from the shared pool that serves user authentication, dashboard loading, and search. The slow dependency affects payment throughput; it does not affect everything else.


Guidance

Thread pool isolation

Assign separate thread pools (or async worker pools) to different categories of external dependency. In a thread-per-request model, set pool limits per dependency group.

// Example using a custom async limiter per dependency category
import pLimit from 'p-limit';

const paymentGatewayLimit = pLimit(10);    // max 10 concurrent payment calls
const inventoryServiceLimit = pLimit(20);  // max 20 concurrent inventory calls
const emailServiceLimit = pLimit(5);       // max 5 concurrent email calls

async function processOrder(order: Order): Promise<void> {
  // Each dependency draws on its own concurrency pool: a saturated email
  // service can exhaust emailServiceLimit, but never the payment pool
  await paymentGatewayLimit(() => chargeCard(order));
  await inventoryServiceLimit(() => reserveStock(order));
  await emailServiceLimit(() => sendConfirmation(order));
}
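The same behaviour does not require a dependency. A limiter with p-limit's call shape can be sketched in a few lines (an approximation for illustration, not the library's actual implementation):

```typescript
// Dependency-free sketch of a p-limit-style concurrency limiter:
// pLimit(n) returns a function that runs at most n tasks at once,
// queueing the rest. Each call to pLimit creates an independent pool,
// which is what makes it usable as a bulkhead.
function pLimit(concurrency: number) {
  let active = 0;
  const queue: Array<() => void> = [];

  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= concurrency) {
      // Pool full: wait until a running task hands over its slot.
      await new Promise<void>(resolve => queue.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = queue.shift();
      if (next) next();   // hand this slot directly to one queued task
      else active--;
    }
  };
}
```

Because each limiter owns its own `active` counter and queue, saturating one pool has no effect on any other: the isolation is structural, not coincidental.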

Connection pool isolation

Do not share a single database or HTTP connection pool across all request types. Separate pools for different criticality levels or different backing services.

// Separate connection pools for different databases / access patterns
// (createPool is illustrative; pg's Pool, for example, accepts host and max)
const userDbPool = createPool({ host: USER_DB_HOST, max: 20 });
const analyticsDbPool = createPool({ host: ANALYTICS_DB_HOST, max: 5 });

// If analytics DB is slow, user operations still get their full pool allocation

Request isolation in orchestration platforms

In Kubernetes, bulkheads map to:

  • Separate deployments/pods for different service categories
  • Resource limits and requests per deployment (CPU/memory quotas)
  • Network policies restricting blast radius of compromised services
  • Separate namespaces for production vs. batch/background workloads

Combining bulkhead with circuit breaker

Bulkhead limits concurrency to prevent resource exhaustion. Circuit breaker detects failure rates and stops sending requests to a failing dependency. They are complementary, not alternatives.

Pattern         | Protects against                        | Mechanism
----------------|-----------------------------------------|-----------------------------------------
Bulkhead        | Resource exhaustion, thread starvation  | Partition resource pools
Circuit breaker | Cascading calls to a failing service    | Fail fast when error threshold exceeded
Timeout         | Indefinitely blocked requests           | Kill requests after a time limit

A production-grade resilience strategy uses all three together.
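A minimal sketch of how the three layers compose. All names here (CircuitOpenError, withTimeout, CircuitBreaker) are illustrative; production code would normally reach for an existing library such as opossum in Node:

```typescript
// Illustrative composition of timeout, circuit breaker, and bulkhead.

class CircuitOpenError extends Error {}

// Timeout: kill a request after a time limit.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    p.then(
      value => { clearTimeout(timer); resolve(value); },
      err => { clearTimeout(timer); reject(err); },
    );
  });
}

// Circuit breaker: fail fast once consecutive failures cross a threshold,
// allowing calls through again only after a cooldown period.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold: number, private cooldownMs: number) {}

  async exec<T>(task: () => Promise<T>): Promise<T> {
    const open =
      this.failures >= this.threshold &&
      Date.now() - this.openedAt < this.cooldownMs;
    if (open) throw new CircuitOpenError('dependency marked unhealthy');

    try {
      const result = await task();
      this.failures = 0;                 // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Layered around a per-dependency limiter like the earlier payment example, a call might read `paymentBreaker.exec(() => withTimeout(paymentGatewayLimit(() => chargeCard(order)), 2000))`: the bulkhead caps concurrency, the timeout bounds each call, and the breaker stops sending work once the dependency is clearly down.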

Design checklist

  • Identify which external dependencies could be slow or unreliable under load
  • Assign separate async/thread pools or concurrency limits per dependency category
  • Set meaningful pool sizes based on expected throughput and dependency latency
  • Monitor pool saturation — a consistently full pool is a capacity signal
  • Combine with circuit breakers for dependencies that can fail hard
  • Test bulkhead behaviour by injecting latency into dependencies (chaos testing)
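The last checklist item can start small. A latency-injection wrapper (a sketch; the helper name is made up) lets a test make any one dependency pathologically slow, then assert that requests routed through other pools still complete:

```typescript
// Chaos-style latency injection: wrap a dependency call so tests can
// make it arbitrarily slow without touching the dependency itself.
function injectLatency<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  delayMs: number,
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    await new Promise(resolve => setTimeout(resolve, delayMs));
    return fn(...args);
  };
}
```

In a bulkhead test, the wrapped function would stand in for one dependency (say, the email service) while the test drives load through every pool and verifies that only the slowed pool's latency degrades.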