Bulkhead Pattern
Status: Complete
Category: Architecture
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: architecture, resilience
- Skillset: backend, devops
- Technology: generic
- Stage: planning, execution
Summary
The Bulkhead pattern isolates components of a system into separate resource pools so that a failure or overload in one component cannot exhaust the resources of another. Named after the watertight compartments in a ship's hull, the pattern ensures that a breach floods a single compartment rather than sinking the ship. It is a fundamental resilience pattern for any system where one noisy or failing dependency should not cascade into broader system failure.
Rationale
Cascading failure is the most common serious outage pattern
A cascading failure begins with a single component — a slow database query, an unresponsive third-party API, a spike in traffic to one endpoint — and propagates through the system because shared resources are exhausted. Thread pools fill with blocked requests waiting for the slow dependency. Connection pools are held open. Memory and CPU are consumed by requests that will never complete. The rest of the system, which was functioning correctly, degrades or fails entirely.
The mechanism is not exotic. It is the ordinary consequence of shared resource pools in systems that have not explicitly isolated their failure domains.
Isolation is the design response
Bulkheads prevent cascade by partitioning resources. If a payment gateway integration is slow, the threads blocked waiting for it should come from a pool dedicated to payment operations — not from the shared pool that serves user authentication, dashboard loading, and search. The slow dependency affects payment throughput; it does not affect everything else.
Guidance
Thread pool isolation
Assign separate thread pools (or async worker pools) to different categories of external dependency. In a thread-per-request model, set pool limits per dependency group.
// Example using a per-dependency async limiter (p-limit)
import pLimit from 'p-limit';

const paymentGatewayLimit = pLimit(10); // max 10 concurrent payment calls
const inventoryServiceLimit = pLimit(20); // max 20 concurrent inventory calls
const emailServiceLimit = pLimit(5); // max 5 concurrent email calls

async function processOrder(order: Order): Promise<void> {
  // Each dependency has its own concurrency budget: a backlog of slow
  // email calls queues on emailServiceLimit and cannot consume the
  // capacity reserved for payment or inventory operations.
  await paymentGatewayLimit(() => chargeCard(order));
  await inventoryServiceLimit(() => reserveStock(order));
  await emailServiceLimit(() => sendConfirmation(order));
}
Connection pool isolation
Do not share a single database or HTTP connection pool across all request types. Use separate pools for different criticality levels or for different backing services.
// Separate connection pools for different databases / access patterns
const userDbPool = createPool({ host: USER_DB_HOST, max: 20 });
const analyticsDbPool = createPool({ host: ANALYTICS_DB_HOST, max: 5 });
// If analytics DB is slow, user operations still get their full pool allocation
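The same isolation applies to outbound HTTP connections. In Node.js, for instance, each downstream service can get its own http.Agent, whose maxSockets value caps that service's socket pool. This is a minimal sketch; the host name, path, and pool sizes are illustrative assumptions:

```typescript
import http from 'node:http';

// One agent, and therefore one socket pool, per downstream service.
// maxSockets bounds concurrent connections to that service; excess
// requests queue on that agent alone, so a slow dependency cannot
// starve connections needed by the others.
const paymentAgent = new http.Agent({ keepAlive: true, maxSockets: 10 });
const inventoryAgent = new http.Agent({ keepAlive: true, maxSockets: 20 });

function getInvoice(path: string): void {
  // Route the request through the dedicated payment-side pool
  // ('payments.internal' is a placeholder host)
  http.get({ host: 'payments.internal', path, agent: paymentAgent }, res => {
    res.resume(); // drain the response body
  }).on('error', () => { /* handle per-dependency errors here */ });
}
```

Because queuing happens per agent, a stall in one dependency backs up only its own socket pool.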
Request isolation in orchestration platforms
In Kubernetes, bulkheads map to:
- Separate deployments/pods for different service categories
- Resource limits and requests per deployment (CPU/memory quotas)
- Network policies restricting blast radius of compromised services
- Separate namespaces for production vs. batch/background workloads
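As a sketch, the per-deployment resource quotas from the list above might look like the following Kubernetes manifest fragment; all names, image references, and values are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api            # isolated from other service categories
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: example/payments-api:1.0   # placeholder image
          resources:
            requests:           # guaranteed share of node capacity
              cpu: "250m"
              memory: 256Mi
            limits:             # hard ceiling; overload stays contained
              cpu: "500m"
              memory: 512Mi
```

The requests keep the workload schedulable with guaranteed capacity; the limits stop a runaway deployment from consuming resources its neighbours depend on.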
Combining bulkhead with circuit breaker
Bulkhead limits concurrency to prevent resource exhaustion. Circuit breaker detects failure rates and stops sending requests to a failing dependency. They are complementary, not alternatives.
| Pattern | Protects against | Mechanism |
|---|---|---|
| Bulkhead | Resource exhaustion, thread starvation | Partition resource pools |
| Circuit breaker | Cascading calls to a failing service | Fail fast when error threshold exceeded |
| Timeout | Indefinitely blocked requests | Kill requests after a time limit |
A production-grade resilience strategy uses all three together.
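To make the composition concrete, the following minimal sketch wraps a single dependency in both patterns. The class names, thresholds, and reject-when-full policy are illustrative assumptions, not a production library:

```typescript
// Bulkhead: caps in-flight calls; rejects immediately when full.
class Bulkhead {
  private active = 0;
  constructor(private readonly maxConcurrent: number) {}
  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      throw new Error('bulkhead full'); // reject instead of queueing
    }
    this.active++;
    try { return await task(); } finally { this.active--; }
  }
}

// Circuit breaker: fails fast once consecutive failures hit a threshold,
// then allows a retry after resetMs has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(private readonly threshold: number,
              private readonly resetMs: number) {}
  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold &&
        Date.now() - this.openedAt < this.resetMs) {
      throw new Error('circuit open'); // fail fast, no call made
    }
    try {
      const result = await task();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      this.openedAt = Date.now();
      throw err;
    }
  }
}

// Compose them: the breaker is checked first so that calls rejected
// while the circuit is open never consume a bulkhead slot.
const breaker = new CircuitBreaker(5, 30_000);
const bulkhead = new Bulkhead(10);

function callPaymentGateway<T>(call: () => Promise<T>): Promise<T> {
  return breaker.run(() => bulkhead.run(call));
}
```

A timeout around each individual call would complete the trio from the table above.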
Design checklist
- Identify which external dependencies could be slow or unreliable under load
- Assign separate async/thread pools or concurrency limits per dependency category
- Set meaningful pool sizes based on expected throughput and dependency latency
- Monitor pool saturation — a consistently full pool is a capacity signal
- Combine with circuit breakers for dependencies that can fail hard
- Test bulkhead behaviour by injecting latency into dependencies (chaos testing)
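The pool-saturation item in the checklist can be made observable. This is an illustrative pool that tracks its own active and pending counts, with a reporting helper to call from a metrics loop; all names and the alert condition are assumptions for the sketch:

```typescript
interface PoolStats { active: number; pending: number; limit: number; }

// Concurrency-limited pool that queues excess work and exposes counters.
class MonitoredPool {
  private active = 0;
  private pending = 0;
  private readonly queue: Array<() => void> = [];

  constructor(private readonly limit: number) {}

  stats(): PoolStats {
    return { active: this.active, pending: this.pending, limit: this.limit };
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      this.pending++;
      await new Promise<void>(res => this.queue.push(res)); // wait for a slot
      this.pending--;
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // wake the next queued task, if any
    }
  }
}

// Call from a periodic metrics loop: a pool that is consistently full
// with work queued behind it is the capacity signal the checklist means.
function reportSaturation(name: string, pool: MonitoredPool): void {
  const s = pool.stats();
  if (s.active === s.limit && s.pending > 0) {
    console.warn(`${name} pool saturated`, s);
  }
}
```

If a library limiter such as p-limit is already in use, its documented activeCount and pendingCount properties serve the same purpose.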