Chaos Engineering
Status: Complete
Category: Testing
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: testing, resilience, reliability
- Skillset: devops, backend
- Technology: generic
- Stage: execution, review
Summary
Chaos engineering is the practice of intentionally introducing failures into a system in a controlled environment to verify that it can withstand and recover from real-world disruptions. Rather than waiting to discover that a failure mode exists when it occurs in production, chaos engineering proactively explores the system’s behaviour under faults — network partitions, service crashes, latency spikes, resource exhaustion — and uses the results to harden resilience mechanisms. It is a practice for mature systems with good observability, not a starting point.
Rationale
Systems will fail; the question is whether failure has been rehearsed
Every production system operates under a distribution of failure conditions: dependencies become unavailable, disks fill, network connections drop, nodes restart during deployments. Teams that have never observed their systems fail under these conditions discover failure modes for the first time during production incidents, at the worst possible time. Chaos engineering rehearses these failures deliberately, in controlled conditions, with observability in place, so the team knows what happens before it matters.
Netflix coined the term with Chaos Monkey, which randomly terminated production instances to force the development of resilience patterns. The principle — verify resilience through deliberate fault injection — is applicable at any scale.
Chaos engineering surfaces hidden assumptions
Systems have implicit resilience assumptions: “the database will always be available”, “the payment gateway will respond within 200ms”, “the event queue will never back up”. Chaos experiments test these assumptions systematically. The result is either confirmation that the assumption is warranted, or discovery of a gap in resilience handling before it becomes a customer-visible incident.
Guidance
Define the steady state first
Before injecting any chaos, define and verify the “steady state” — the measurable evidence that the system is behaving normally. Chaos experiments test whether steady state is maintained under faults, not whether chaos causes unexpected behaviour.
```yaml
# Steady state hypothesis
name: Order API behaves normally
steady_state_hypothesis:
  title: System is healthy
  probes:
    - type: http
      url: https://api.example.com/health
      expected_status: 200
    - type: metric
      name: order_creation_p95_latency_ms
      expected_max: 300
    - type: metric
      name: error_rate
      expected_max: 0.01
```
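A steady-state check like this can be evaluated mechanically. A minimal sketch in TypeScript, assuming probe values (status codes, metric readings) have already been fetched — the probe shapes mirror the YAML above, but the evaluator itself is illustrative:

```typescript
// Illustrative probe types mirroring the steady-state hypothesis above.
type Probe =
  | { type: "http"; url: string; expected_status: number }
  | { type: "metric"; name: string; expected_max: number };

// Check one probe against an observed value. Fetching the value
// (an HTTP call or a metrics query) is left to the caller.
function probeHolds(probe: Probe, observed: number): boolean {
  return probe.type === "http"
    ? observed === probe.expected_status
    : observed <= probe.expected_max;
}

// Steady state holds only when every probe holds.
function steadyStateHolds(results: Array<[Probe, number]>): boolean {
  return results.every(([probe, observed]) => probeHolds(probe, observed));
}
```

A single failed probe is enough to reject the hypothesis, which is why each probe needs an explicit threshold rather than a vague "looks healthy".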
The chaos experiment cycle
1. Define steady state
2. Hypothesise: "steady state will continue when [fault] is applied"
3. Inject the fault
4. Observe behaviour against the steady state
5. Restore the system
6. Analyse: was steady state maintained? If not, what is the gap?
7. Fix the gap, then repeat
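The cycle above can be sketched as a small harness. This is a minimal TypeScript sketch — the `Experiment` and `runExperiment` names are illustrative, not from any specific chaos tool:

```typescript
// Illustrative harness for the experiment cycle. In a real experiment,
// steadyState would evaluate the probes from the hypothesis, injectFault
// would call a tool such as Toxiproxy or Chaos Mesh, and restore would
// remove the fault.
interface Experiment {
  name: string;
  steadyState: () => Promise<boolean>; // steps 1 and 4: define and observe
  injectFault: () => Promise<void>;    // step 3
  restore: () => Promise<void>;        // step 5
}

async function runExperiment(exp: Experiment): Promise<{ maintained: boolean }> {
  // Refuse to run if steady state does not hold before the fault
  if (!(await exp.steadyState())) {
    throw new Error(`${exp.name}: steady state not met before injection`);
  }
  await exp.injectFault();
  try {
    // Step 4: observe behaviour against the steady state under fault
    return { maintained: await exp.steadyState() };
  } finally {
    // Step 5: always restore, even if observation throws
    await exp.restore();
  }
}
```

Steps 6 and 7 — analysing the gap and fixing it — stay with the team; the harness only reports whether steady state held.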
Starting with planned chaos
Begin with simple, controlled experiments before running automated chaos:
| Experiment | What it tests |
|---|---|
| Kill a single service instance | Auto-healing and load balancer health checks |
| Introduce 200ms latency to calls to a dependency | Timeout handling and circuit breakers |
| Return 503 from a downstream service | Fallback behaviour and degraded mode UX |
| Fill the disk on an application server | Graceful degradation vs crash |
| Exhaust database connections | Connection pool handling and error responses |
| Delay message queue consumption | Backpressure and queue depth alerting |
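The "return 503 from a downstream service" experiment can be rehearsed in an ordinary integration test before touching shared environments. A minimal sketch in TypeScript using Node's built-in `http` module — `fetchWithFallback` and the stub server are illustrative, not a real client:

```typescript
import * as http from "http";
import { AddressInfo } from "net";

// Fault injection: a stub downstream that always answers 503,
// standing in for the real dependency.
const broken = http.createServer((_req, res) => {
  res.writeHead(503);
  res.end();
});

// Hypothetical fetchWithFallback: returns the downstream body, or a
// fallback value on 5xx / connection errors (degraded-mode behaviour).
function fetchWithFallback(url: string, fallback: string): Promise<string> {
  return new Promise((resolve) => {
    http
      .get(url, (res) => {
        if ((res.statusCode ?? 500) >= 500) {
          res.resume();      // drain and discard the failed response
          resolve(fallback); // serve the degraded-mode value instead
          return;
        }
        let body = "";
        res.on("data", (chunk) => (body += chunk));
        res.on("end", () => resolve(body));
      })
      .on("error", () => resolve(fallback));
  });
}

async function main(): Promise<void> {
  await new Promise<void>((done) => broken.listen(0, done));
  const { port } = broken.address() as AddressInfo;
  // The downstream is broken, so the fallback value comes back
  const price = await fetchWithFallback(`http://127.0.0.1:${port}/price`, "cached-price");
  broken.close();
  console.log(price);
}

main();
```

The assertion to make in a real test is the same as in the table: the caller gets a defined degraded-mode response, never a hang or an unhandled error.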
Tooling
| Tool | Scope | Notes |
|---|---|---|
| Chaos Monkey (Netflix OSS) | AWS EC2/ECS instance termination | Original Netflix implementation |
| Gremlin | Full platform | Commercial; comprehensive fault injection |
| AWS Fault Injection Service (FIS) | AWS infrastructure | Native AWS chaos experiments |
| Chaos Mesh | Kubernetes | Pod failures, network chaos, I/O chaos |
| Pumba | Docker | Container-level chaos for local/CI environments |
| Toxiproxy | Network | Controlled network proxy for latency/failure injection in tests |
Using Toxiproxy for local chaos testing
```javascript
// In integration tests, use Toxiproxy to simulate network failures.
// The client API shown here is illustrative — check your Toxiproxy
// client library's documentation for exact class and method names.
const toxiproxy = new ToxiproxyApi('localhost:8474');
const proxy = await toxiproxy.createProxy({
  name: 'stripe-api',
  listen: '0.0.0.0:8001',
  upstream: 'api.stripe.com:443',
});

// Add latency to simulate a slow payment gateway
await proxy.addToxic({
  type: 'latency',
  attributes: { latency: 2000 }, // 2 second delay
});

// The application should time out and return a graceful error, not hang
const result = await paymentClient.charge(order);
expect(result.error).toBe('PAYMENT_GATEWAY_TIMEOUT');

// Cleanup
await proxy.removeToxic('latency');
```
Prerequisites for chaos engineering
Do not begin chaos experiments without:
- Observability — structured logging, distributed tracing, dashboards; you must be able to observe what happens
- Alerting — you must detect when experiments cause real degradation vs. expected deviation
- Circuit breakers and timeouts — otherwise chaos experiments reveal only that the system degrades, without the means to improve it
- Rollback capability — you must be able to restore normal state quickly
- Team alignment — everyone affected by the experiment should know it is happening
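These prerequisites can be enforced as an explicit gate that runs before any experiment starts. A minimal sketch — the `PreflightCheck` shape, `preflight` helper, and check names are illustrative:

```typescript
// Illustrative pre-flight gate: every prerequisite check must pass
// before a chaos experiment is allowed to start.
interface PreflightCheck {
  name: string;
  ok: () => Promise<boolean>;
}

// Returns the names of failed checks; an empty array means safe to proceed.
async function preflight(checks: PreflightCheck[]): Promise<string[]> {
  const failures: string[] = [];
  for (const check of checks) {
    if (!(await check.ok())) failures.push(check.name);
  }
  return failures;
}
```

Real checks would probe dashboards, alert routing, and rollback tooling rather than return constants; the point is that the gate is executable, not a checklist in a wiki.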