Production is going to fail. The question is whether you find the failure mode first, or your customers do. Chaos Monkey helps you find it first.
Start Breaking Things 💥Controlled failure injection for every layer of your stack.
Randomly terminate application processes to verify your restart policies, health checks, and process supervisors work as expected.
Inject latency, packet loss, and partitions. Simulate cross-region failures, DNS outages, and upstream timeouts.
Fill disks, slow I/O, corrupt writes. Find out what your app does when storage misbehaves — before your SSD surprises you.
Evict pods, drain nodes, delete namespaces. Test your PodDisruptionBudgets, liveness probes, and autoscaler response times.
Shift system clocks forward, backward, or into the future. Certificate expiry, cron drift, and distributed consensus — all exposed.
Saturate CPU cores and exhaust memory to test OOM handling, autoscaling triggers, and graceful degradation under load.
Controlled chaos, not reckless destruction.
Pick your target — a single pod, a node, an availability zone. Set boundaries.
Run the experiment. Chaos Monkey applies the failure condition within your defined scope.
Watch your metrics, alerts, and dashboards. Did the system self-heal? Did you get paged?
Harden the weakness. Add it to your test suite. Run the experiment again to verify.