PushBackLog

Chaos Engineering


Status: Complete
Category: Testing
Default enforcement: Advisory
Author: PushBackLog team


Tags

  • Topic: testing, resilience, reliability
  • Skillset: devops, backend
  • Technology: generic
  • Stage: execution, review

Summary

Chaos engineering is the practice of intentionally introducing failures into a system in a controlled environment to verify that it can withstand and recover from real-world disruptions. Rather than waiting to discover that a failure mode exists when it occurs in production, chaos engineering proactively explores the system’s behaviour under faults — network partitions, service crashes, latency spikes, resource exhaustion — and uses the results to harden resilience mechanisms. It is a practice for mature systems with good observability, not a starting point.


Rationale

Systems will fail; the question is whether failure has been rehearsed

Every production system operates under a distribution of failure conditions: dependencies become unavailable, disks fill, network connections drop, nodes restart during deployments. Teams that have never observed their systems fail under these conditions discover failure modes for the first time during production incidents, at the worst possible time. Chaos engineering rehearses these failures deliberately, in controlled conditions, with observability in place, so the team knows what happens before it matters.

Netflix coined the term with Chaos Monkey, which randomly terminated production instances to force the development of resilience patterns. The principle — verify resilience through deliberate fault injection — is applicable at any scale.

Chaos engineering surfaces hidden assumptions

Systems have implicit resilience assumptions: “the database will always be available”, “the payment gateway will respond within 200ms”, “the event queue will never back up”. Chaos experiments test these assumptions systematically. The result is either confirmation that the assumption is warranted, or discovery of a gap in resilience handling before it becomes a customer-visible incident.


Guidance

Define the steady state first

Before injecting any chaos, define and verify the “steady state” — the measurable evidence that the system is behaving normally. Chaos experiments test whether this steady state is maintained while a fault is applied; without it, there is no baseline against which to judge the result.

# Steady state hypothesis
name: Order API behaves normally
steady_state_hypothesis:
  title: System is healthy
  probes:
    - type: http
      url: https://api.example.com/health
      expected_status: 200
    - type: metric
      name: order_creation_p95_latency_ms
      expected_max: 300
    - type: metric  
      name: error_rate
      expected_max: 0.01
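
The same hypothesis can be checked in code. A minimal sketch, assuming a metrics backend you can query — `fetchMetric` here is a stub standing in for that query, and the sample values are illustrative:

```javascript
// Sketch: verify the steady-state hypothesis before (and during) an experiment.
// fetchMetric is a stand-in for a real metrics query (e.g. against Prometheus).
function fetchMetric(name) {
  const sample = { order_creation_p95_latency_ms: 240, error_rate: 0.004 };
  return sample[name];
}

// Mirrors the probes in the YAML hypothesis above.
const steadyStateProbes = [
  { name: 'order_creation_p95_latency_ms', expectedMax: 300 },
  { name: 'error_rate', expectedMax: 0.01 },
];

function steadyStateHolds(probes) {
  // Every probe must be at or below its threshold for the hypothesis to hold.
  return probes.every((p) => fetchMetric(p.name) <= p.expectedMax);
}

console.log(steadyStateHolds(steadyStateProbes)); // true with the sample values
```

Run the same check before the experiment (to confirm you are starting from a healthy system) and during it (to see whether the fault breaks the hypothesis).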

The chaos experiment cycle

1. Define steady state
2. Hypothesise: "steady state will continue when [fault] is applied"
3. Inject the fault
4. Observe behaviour against the steady state
5. Restore the system
6. Analyse: was steady state maintained? If not, what is the gap?
7. Fix the gap, then repeat
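
The cycle above can be sketched as a driver loop. Everything here is illustrative: the fault, the probe, and the thresholds are placeholders for real tooling and real metrics:

```javascript
// Illustrative chaos-experiment driver following the steps above.
let faultActive = false;

const fault = {
  inject: () => { faultActive = true; },   // e.g. add latency via a proxy
  restore: () => { faultActive = false; }, // always undo the fault
};

// Simulated probe: the error rate degrades while the fault is active.
function errorRate() {
  return faultActive ? 0.05 : 0.002;
}

function runExperiment() {
  const baselineOk = errorRate() <= 0.01;  // 1. verify steady state first
  if (!baselineOk) return { run: false };  // never start from a degraded system
  let maintained;
  fault.inject();                          // 3. inject the fault
  try {
    maintained = errorRate() <= 0.01;      // 4. observe against steady state
  } finally {
    fault.restore();                       // 5. restore, even if observation fails
  }
  return { run: true, maintained };        // 6. analyse: a gap exists if false
}

const result = runExperiment();
```

The `finally` block matters: restoring the system must not depend on the observation step succeeding.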

Starting with planned chaos

Begin with simple, controlled experiments before running automated chaos:

Experiment | What it tests
Kill a single service instance | Auto-healing and load balancer health checks
Introduce 200ms latency to calls to a dependency | Timeout handling and circuit breakers
Return 503 from a downstream service | Fallback behaviour and degraded-mode UX
Fill the disk on an application server | Graceful degradation vs crash
Exhaust database connections | Connection pool handling and error responses
Delay message queue consumption | Backpressure and queue depth alerting
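
The latency experiment in the table only proves something if the caller enforces a deadline. A minimal sketch of the behaviour under test, using a race against a timer (the 50ms deadline and 200ms slow dependency are illustrative values):

```javascript
// Sketch of the timeout behaviour a latency experiment should exercise.
function withTimeout(promise, ms) {
  // Reject if the underlying call does not settle within the deadline.
  const deadline = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('DEPENDENCY_TIMEOUT')), ms)
  );
  return Promise.race([promise, deadline]);
}

// Simulated slow dependency, as a latency fault would produce.
const slowCall = new Promise((resolve) => setTimeout(() => resolve('ok'), 200));

withTimeout(slowCall, 50)
  .then(() => console.log('responded in time'))
  .catch((err) => console.log(err.message)); // DEPENDENCY_TIMEOUT
```

If the system has no such deadline, the experiment will reveal hanging requests rather than a clean, bounded failure — which is itself the gap to fix.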

Tooling

Tool | Level | Notes
Chaos Monkey (Netflix OSS) | AWS EC2/ECS instance termination | Original Netflix implementation
Gremlin | Full platform | Commercial; comprehensive fault injection
AWS Fault Injection Service (FIS) | AWS infrastructure | Native AWS chaos experiments
Chaos Mesh | Kubernetes | Pod failures, network chaos, I/O chaos
Pumba | Docker | Container-level chaos for local/CI environments
Toxiproxy | Network | Controlled network proxy for latency/failure injection in tests
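
As an example of declarative fault injection, a Chaos Mesh experiment is defined as a Kubernetes resource. The sketch below adds 200ms of latency to a service for one minute; the namespace and label selector are illustrative, and the exact fields should be checked against the Chaos Mesh NetworkChaos reference for your version:

```yaml
# Illustrative Chaos Mesh NetworkChaos experiment (verify against current schema)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency          # hypothetical experiment name
spec:
  action: delay
  mode: one                      # inject into a single matching pod
  selector:
    namespaces:
      - default                  # hypothetical namespace
    labelSelectors:
      app: payment-service       # hypothetical target label
  delay:
    latency: '200ms'
  duration: '60s'                # experiment is bounded and self-terminating
```

A bounded `duration` is a useful safety property: the fault is removed automatically even if the operator walks away.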

Using Toxiproxy for local chaos testing

// In integration tests, use Toxiproxy to simulate network failures.
// Client API shown is illustrative; check the toxiproxy client library
// docs for exact class and method names.
const toxiproxy = new Toxiproxy('http://localhost:8474');

// Route payment traffic through the proxy: in test configuration, point the
// payment client at localhost:8001 instead of api.stripe.com directly.
const proxy = await toxiproxy.createProxy({
  name: 'stripe-api',
  listen: '0.0.0.0:8001',
  upstream: 'api.stripe.com:443',
});

// Add a named latency toxic to simulate a slow payment gateway
await proxy.addToxic({
  name: 'stripe-latency',
  type: 'latency',
  attributes: { latency: 2000 }, // 2 second delay
});

// The application should time out and return a graceful error, not hang
const result = await paymentClient.charge(order);
expect(result.error).toBe('PAYMENT_GATEWAY_TIMEOUT');

// Cleanup: toxics are removed by name
await proxy.removeToxic('stripe-latency');

Prerequisites for chaos engineering

Do not begin chaos experiments without:

  • Observability — structured logging, distributed tracing, dashboards; you must be able to observe what happens
  • Alerting — you must detect when experiments cause real degradation vs. expected deviation
  • Circuit breakers and timeouts — otherwise chaos experiments reveal only that the system degrades, without the means to improve it
  • Rollback capability — you must be able to restore normal state quickly
  • Team alignment — everyone affected by the experiment should know it is happening
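
On the circuit-breaker prerequisite: the point is that fault injection should exercise an existing resilience mechanism, not just demonstrate degradation. A minimal sketch of the mechanism itself, assuming a simple consecutive-failure policy (threshold and error names are illustrative):

```javascript
// Minimal circuit-breaker sketch: after `threshold` consecutive failures the
// breaker opens and callers fail fast instead of piling onto a broken dependency.
function createBreaker(threshold) {
  let failures = 0;
  return async function call(fn) {
    if (failures >= threshold) {
      throw new Error('CIRCUIT_OPEN'); // fail fast while the breaker is open
    }
    try {
      const value = await fn();
      failures = 0; // a success resets the count
      return value;
    } catch (err) {
      failures += 1;
      throw err;
    }
  };
}

// Usage: wrap calls to a dependency; after three consecutive failures,
// subsequent calls are rejected immediately with CIRCUIT_OPEN.
const callPayments = createBreaker(3);
```

A production breaker would also need a half-open state that periodically retries the dependency; this sketch omits that to show only the fail-fast behaviour a chaos experiment verifies.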