Bulkhead Pattern
Status: Complete
Category: Architecture
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: architecture, resilience
- Skillset: backend, devops
- Technology: generic
- Stage: planning, execution
Summary
The Bulkhead pattern isolates components of a system into separate resource pools so that a failure or overload in one component cannot exhaust the resources of another. Named after the watertight compartments in a ship's hull, the pattern ensures that a breach floods a single compartment rather than sinking the ship. It is a fundamental resilience pattern for any system where one noisy or failing dependency should not cascade into broader system failure.
Rationale
Cascading failure is the most common serious outage pattern
A cascading failure begins with a single component — a slow database query, an unresponsive third-party API, a spike in traffic to one endpoint — and propagates through the system because shared resources are exhausted. Thread pools fill with blocked requests waiting for the slow dependency. Connection pools are held open. Memory and CPU are consumed by requests that will never complete. The rest of the system, which was functioning correctly, degrades or fails entirely.
The mechanism is not exotic. It is the ordinary consequence of shared resource pools in systems that have not explicitly isolated their failure domains.
Isolation is the design response
Bulkheads prevent cascade by partitioning resources. If a payment gateway integration is slow, the threads blocked waiting for it should come from a pool dedicated to payment operations — not from the shared pool that serves user authentication, dashboard loading, and search. The slow dependency affects payment throughput; it does not affect everything else.
Guidance
Thread pool isolation
Assign separate thread pools (or async worker pools) to different categories of external dependency. In a thread-per-request model, set pool limits per dependency group.
// Example using a per-dependency async limiter (p-limit)
import pLimit from 'p-limit';

const paymentGatewayLimit = pLimit(10); // max 10 concurrent payment calls
const inventoryServiceLimit = pLimit(20); // max 20 concurrent inventory calls
const emailServiceLimit = pLimit(5); // max 5 concurrent email calls

async function processOrder(order: Order): Promise<void> {
  // Each dependency has its own concurrency budget: a backlog of slow
  // email calls queues on emailServiceLimit and cannot consume the
  // capacity reserved for payment or inventory operations.
  await paymentGatewayLimit(() => chargeCard(order));
  await inventoryServiceLimit(() => reserveStock(order));
  await emailServiceLimit(() => sendConfirmation(order));
}
Connection pool isolation
Do not share a single database or HTTP connection pool across all request types. Use separate pools for different criticality levels or for different backing services.
// Separate connection pools for different databases / access patterns
const userDbPool = createPool({ host: USER_DB_HOST, max: 20 });
const analyticsDbPool = createPool({ host: ANALYTICS_DB_HOST, max: 5 });
// If analytics DB is slow, user operations still get their full pool allocation
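The same isolation applies to outbound HTTP connections. In Node.js, for instance, each downstream service can get its own http.Agent, whose maxSockets value caps that service's socket pool. This is a minimal sketch; the host name, path, and pool sizes are illustrative assumptions:

```typescript
import http from 'node:http';

// One agent, and therefore one socket pool, per downstream service.
// maxSockets bounds concurrent connections to that service; excess
// requests queue on that agent alone, so a slow dependency cannot
// starve connections needed by the others.
const paymentAgent = new http.Agent({ keepAlive: true, maxSockets: 10 });
const inventoryAgent = new http.Agent({ keepAlive: true, maxSockets: 20 });

function getInvoice(path: string): void {
  // Route the request through the dedicated payment-side pool
  // ('payments.internal' is a placeholder host)
  http.get({ host: 'payments.internal', path, agent: paymentAgent }, res => {
    res.resume(); // drain the response body
  }).on('error', () => { /* handle per-dependency errors here */ });
}
```

Because queuing happens per agent, a stall in one dependency backs up only its own socket pool.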
Request isolation in orchestration platforms
In Kubernetes, bulkheads map to:
- Separate deployments/pods for different service categories
- Resource limits and requests per deployment (CPU/memory quotas)
- Network policies restricting blast radius of compromised services
- Separate namespaces for production vs. batch/background workloads
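As a sketch, the per-deployment resource quotas from the list above might look like the following Kubernetes manifest fragment; all names, image references, and values are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api            # isolated from other service categories
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: example/payments-api:1.0   # placeholder image
          resources:
            requests:           # guaranteed share of node capacity
              cpu: "250m"
              memory: 256Mi
            limits:             # hard ceiling; overload stays contained
              cpu: "500m"
              memory: 512Mi
```

The requests keep the workload schedulable with guaranteed capacity; the limits stop a runaway deployment from consuming resources its neighbours depend on.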
Combining bulkhead with circuit breaker
Bulkhead limits concurrency to prevent resource exhaustion. Circuit breaker detects failure rates and stops sending requests to a failing dependency. They are complementary, not alternatives.
| Pattern | Protects against | Mechanism |
|---|---|---|
| Bulkhead | Resource exhaustion, thread starvation | Partition resource pools |
| Circuit breaker | Cascading calls to a failing service | Fail fast when error threshold exceeded |
| Timeout | Indefinitely blocked requests | Kill requests after a time limit |
A production-grade resilience strategy uses all three together.
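To make the composition concrete, the following minimal sketch wraps a single dependency in both patterns. The class names, thresholds, and reject-when-full policy are illustrative assumptions, not a production library:

```typescript
// Bulkhead: caps in-flight calls; rejects immediately when full.
class Bulkhead {
  private active = 0;
  constructor(private readonly maxConcurrent: number) {}
  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      throw new Error('bulkhead full'); // reject instead of queueing
    }
    this.active++;
    try { return await task(); } finally { this.active--; }
  }
}

// Circuit breaker: fails fast once consecutive failures hit a threshold,
// then allows a retry after resetMs has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(private readonly threshold: number,
              private readonly resetMs: number) {}
  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold &&
        Date.now() - this.openedAt < this.resetMs) {
      throw new Error('circuit open'); // fail fast, no call made
    }
    try {
      const result = await task();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      this.openedAt = Date.now();
      throw err;
    }
  }
}

// Compose them: the breaker is checked first so that calls rejected
// while the circuit is open never consume a bulkhead slot.
const breaker = new CircuitBreaker(5, 30_000);
const bulkhead = new Bulkhead(10);

function callPaymentGateway<T>(call: () => Promise<T>): Promise<T> {
  return breaker.run(() => bulkhead.run(call));
}
```

A timeout around each individual call would complete the trio from the table above.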
Design checklist
- Identify which external dependencies could be slow or unreliable under load
- Assign separate async/thread pools or concurrency limits per dependency category
- Set meaningful pool sizes based on expected throughput and dependency latency
- Monitor pool saturation — a consistently full pool is a capacity signal
- Combine with circuit breakers for dependencies that can fail hard
- Test bulkhead behaviour by injecting latency into dependencies (chaos testing)
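The pool-saturation item in the checklist can be made observable. This is an illustrative pool that tracks its own active and pending counts, with a reporting helper to call from a metrics loop; all names and the alert condition are assumptions for the sketch:

```typescript
interface PoolStats { active: number; pending: number; limit: number; }

// Concurrency-limited pool that queues excess work and exposes counters.
class MonitoredPool {
  private active = 0;
  private pending = 0;
  private readonly queue: Array<() => void> = [];

  constructor(private readonly limit: number) {}

  stats(): PoolStats {
    return { active: this.active, pending: this.pending, limit: this.limit };
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      this.pending++;
      await new Promise<void>(res => this.queue.push(res)); // wait for a slot
      this.pending--;
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // wake the next queued task, if any
    }
  }
}

// Call from a periodic metrics loop: a pool that is consistently full
// with work queued behind it is the capacity signal the checklist means.
function reportSaturation(name: string, pool: MonitoredPool): void {
  const s = pool.stats();
  if (s.active === s.limit && s.pending > 0) {
    console.warn(`${name} pool saturated`, s);
  }
}
```

If a library limiter such as p-limit is already in use, its documented activeCount and pendingCount properties serve the same purpose.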