Chaos Engineering
Status: Complete
Category: Testing
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: testing, resilience, reliability
- Skillset: devops, backend
- Technology: generic
- Stage: execution, review
Summary
Chaos engineering is the practice of intentionally introducing failures into a system in a controlled environment to verify that it can withstand and recover from real-world disruptions. Rather than waiting to discover that a failure mode exists when it occurs in production, chaos engineering proactively explores the system’s behaviour under faults — network partitions, service crashes, latency spikes, resource exhaustion — and uses the results to harden resilience mechanisms. It is a practice for mature systems with good observability, not a starting point.
Rationale
Systems will fail; the question is whether failure has been rehearsed
Every production system operates under a distribution of failure conditions: dependencies become unavailable, disks fill, network connections drop, nodes restart during deployments. Teams that have never observed their systems fail under these conditions discover failure modes for the first time during production incidents, at the worst possible time. Chaos engineering rehearses these failures deliberately, in controlled conditions, with observability in place, so the team knows what happens before it matters.
Netflix coined the term with Chaos Monkey, which randomly terminated production instances to force the development of resilience patterns. The principle — verify resilience through deliberate fault injection — is applicable at any scale.
Chaos engineering surfaces hidden assumptions
Systems have implicit resilience assumptions: “the database will always be available”, “the payment gateway will respond within 200ms”, “the event queue will never back up”. Chaos experiments test these assumptions systematically. The result is either confirmation that the assumption is warranted, or discovery of a gap in resilience handling before it becomes a customer-visible incident.
Guidance
Define the steady state first
Before injecting any chaos, define and verify the “steady state” — the measurable evidence that the system is behaving normally. Chaos experiments test whether steady state is maintained under faults, not whether chaos causes unexpected behaviour.
```yaml
# Steady state hypothesis
name: Order API behaves normally
steady_state_hypothesis:
  title: System is healthy
  probes:
    - type: http
      url: https://api.example.com/health
      expected_status: 200
    - type: metric
      name: order_creation_p95_latency_ms
      expected_max: 300
    - type: metric
      name: error_rate
      expected_max: 0.01
```
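A steady-state check like this can be evaluated mechanically. A minimal sketch in TypeScript, assuming probe values (status codes, metric readings) have already been fetched — the probe shapes mirror the YAML above, but the evaluator itself is illustrative:

```typescript
// Illustrative probe types mirroring the steady-state hypothesis above.
type Probe =
  | { type: "http"; url: string; expected_status: number }
  | { type: "metric"; name: string; expected_max: number };

// Check one probe against an observed value. Fetching the value
// (an HTTP call or a metrics query) is left to the caller.
function probeHolds(probe: Probe, observed: number): boolean {
  return probe.type === "http"
    ? observed === probe.expected_status
    : observed <= probe.expected_max;
}

// Steady state holds only when every probe holds.
function steadyStateHolds(results: Array<[Probe, number]>): boolean {
  return results.every(([probe, observed]) => probeHolds(probe, observed));
}
```

A single failed probe is enough to reject the hypothesis, which is why each probe needs an explicit threshold rather than a vague "looks healthy".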
The chaos experiment cycle
1. Define steady state
2. Hypothesise: "steady state will continue when [fault] is applied"
3. Inject the fault
4. Observe behaviour against the steady state
5. Restore the system
6. Analyse: was steady state maintained? If not, what is the gap?
7. Fix the gap, then repeat
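The cycle above can be sketched as a small harness. This is a minimal TypeScript sketch — the `Experiment` and `runExperiment` names are illustrative, not from any specific chaos tool:

```typescript
// Illustrative harness for the experiment cycle. In a real experiment,
// steadyState would evaluate the probes from the hypothesis, injectFault
// would call a tool such as Toxiproxy or Chaos Mesh, and restore would
// remove the fault.
interface Experiment {
  name: string;
  steadyState: () => Promise<boolean>; // steps 1 and 4: define and observe
  injectFault: () => Promise<void>;    // step 3
  restore: () => Promise<void>;        // step 5
}

async function runExperiment(exp: Experiment): Promise<{ maintained: boolean }> {
  // Refuse to run if steady state does not hold before the fault
  if (!(await exp.steadyState())) {
    throw new Error(`${exp.name}: steady state not met before injection`);
  }
  await exp.injectFault();
  try {
    // Step 4: observe behaviour against the steady state under fault
    return { maintained: await exp.steadyState() };
  } finally {
    // Step 5: always restore, even if observation throws
    await exp.restore();
  }
}
```

Steps 6 and 7 — analysing the gap and fixing it — stay with the team; the harness only reports whether steady state held.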
Starting with planned chaos
Begin with simple, controlled experiments before running automated chaos:
| Experiment | What it tests |
|---|---|
| Kill a single service instance | Auto-healing and load balancer health checks |
| Introduce 200ms latency to calls to a dependency | Timeout handling and circuit breakers |
| Return 503 from a downstream service | Fallback behaviour and degraded mode UX |
| Fill the disk on an application server | Graceful degradation vs crash |
| Exhaust database connections | Connection pool handling and error responses |
| Delay message queue consumption | Backpressure and queue depth alerting |
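The "return 503 from a downstream service" experiment can be rehearsed in an ordinary integration test before touching shared environments. A minimal sketch in TypeScript using Node's built-in `http` module — `fetchWithFallback` and the stub server are illustrative, not a real client:

```typescript
import * as http from "http";
import { AddressInfo } from "net";

// Fault injection: a stub downstream that always answers 503,
// standing in for the real dependency.
const broken = http.createServer((_req, res) => {
  res.writeHead(503);
  res.end();
});

// Hypothetical fetchWithFallback: returns the downstream body, or a
// fallback value on 5xx / connection errors (degraded-mode behaviour).
function fetchWithFallback(url: string, fallback: string): Promise<string> {
  return new Promise((resolve) => {
    http
      .get(url, (res) => {
        if ((res.statusCode ?? 500) >= 500) {
          res.resume();      // drain and discard the failed response
          resolve(fallback); // serve the degraded-mode value instead
          return;
        }
        let body = "";
        res.on("data", (chunk) => (body += chunk));
        res.on("end", () => resolve(body));
      })
      .on("error", () => resolve(fallback));
  });
}

async function main(): Promise<void> {
  await new Promise<void>((done) => broken.listen(0, done));
  const { port } = broken.address() as AddressInfo;
  // The downstream is broken, so the fallback value comes back
  const price = await fetchWithFallback(`http://127.0.0.1:${port}/price`, "cached-price");
  broken.close();
  console.log(price);
}

main();
```

The assertion to make in a real test is the same as in the table: the caller gets a defined degraded-mode response, never a hang or an unhandled error.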
Tooling
| Tool | Scope | Notes |
|---|---|---|
| Chaos Monkey (Netflix OSS) | AWS EC2/ECS instance termination | Original Netflix implementation |
| Gremlin | Full platform | Commercial; comprehensive fault injection |
| AWS Fault Injection Service (FIS) | AWS infrastructure | Native AWS chaos experiments |
| Chaos Mesh | Kubernetes | Pod failures, network chaos, I/O chaos |
| Pumba | Docker | Container-level chaos for local/CI environments |
| Toxiproxy | Network | Controlled network proxy for latency/failure injection in tests |
Using Toxiproxy for local chaos testing
```javascript
// In integration tests, use Toxiproxy to simulate network failures.
// The client API shown here is illustrative — check your Toxiproxy
// client library's documentation for exact class and method names.
const toxiproxy = new ToxiproxyApi('localhost:8474');
const proxy = await toxiproxy.createProxy({
  name: 'stripe-api',
  listen: '0.0.0.0:8001',
  upstream: 'api.stripe.com:443',
});

// Add latency to simulate a slow payment gateway
await proxy.addToxic({
  type: 'latency',
  attributes: { latency: 2000 }, // 2 second delay
});

// The application should time out and return a graceful error, not hang
const result = await paymentClient.charge(order);
expect(result.error).toBe('PAYMENT_GATEWAY_TIMEOUT');

// Cleanup
await proxy.removeToxic('latency');
```
Prerequisites for chaos engineering
Do not begin chaos experiments without:
- Observability — structured logging, distributed tracing, dashboards; you must be able to observe what happens
- Alerting — you must detect when experiments cause real degradation vs. expected deviation
- Circuit breakers and timeouts — otherwise chaos experiments reveal only that the system degrades, without the means to improve it
- Rollback capability — you must be able to restore normal state quickly
- Team alignment — everyone affected by the experiment should know it is happening
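These prerequisites can be enforced as an explicit gate that runs before any experiment starts. A minimal sketch — the `PreflightCheck` shape, `preflight` helper, and check names are illustrative:

```typescript
// Illustrative pre-flight gate: every prerequisite check must pass
// before a chaos experiment is allowed to start.
interface PreflightCheck {
  name: string;
  ok: () => Promise<boolean>;
}

// Returns the names of failed checks; an empty array means safe to proceed.
async function preflight(checks: PreflightCheck[]): Promise<string[]> {
  const failures: string[] = [];
  for (const check of checks) {
    if (!(await check.ok())) failures.push(check.name);
  }
  return failures;
}
```

Real checks would probe dashboards, alert routing, and rollback tooling rather than return constants; the point is that the gate is executable, not a checklist in a wiki.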