Let Your Team Break Stuff On Purpose

The best engineering team I ever ran had a Friday afternoon ritual. Someone would walk into a meeting room with a stopwatch, pick a service from a hat, and break it. Then we sat and watched what happened.

The first time we did it, two people refused. They thought I had lost the plot. Why would I deliberately damage a working system?

Because a working system is not a known system. It is a system which has not failed yet. And the day it fails, you find out exactly how little you knew about it.

[Image: a calm engineer in a data centre taking notes while a single server bursts into flames]

Netflix figured this out in 2011

I did not invent this. Netflix made it famous around 2011, when they were moving everything to AWS and were terrified of how brittle the cloud felt. Their answer was a tool called Chaos Monkey, whose entire purpose was to kill random servers in production during business hours.

According to Wikipedia's chaos engineering page, Netflix released the source for Chaos Monkey in 2012. Amazon had been running similar Game Days since 2003. Google's DiRT programme started in 2006. Facebook had Project Storm. The biggest, most reliable systems on the internet are all built by teams who break things on purpose.

If the most uptime-obsessed companies in the world deliberately tear at their own systems, why does your team treat any unplanned outage as a fireable offence?

"Don't break it" is how you build fragile teams

Walk into a typical mid-size engineering org and tell them you want to deliberately take down a service. Watch the room change colour. The on-call engineer goes pale. The product manager starts drafting an objection. The director asks if you have approval.

This is fragility. Not in the system, in the people.

A team afraid to break things has stopped learning about its own product. They write tests for the happy path. They mock the failures so the suite stays green. They have never watched what happens when the auth service times out, so they have no idea whether the retry logic works, whether the circuit breaker trips, or whether the user sees something graceful instead of a 500 page covered in stack traces.
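Here is a minimal sketch of what "watching the auth service time out" can look like in code. Everything in it is hypothetical: `CircuitBreaker`, `flaky_auth` and `login_page` are illustrative names, not a real library, and the thresholds are arbitrary. The point is only that the failure is injected on purpose and the user-facing result is checked.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive timeouts; allows a trial call after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result


def flaky_auth(token):
    """Hypothetical stand-in for the auth service during an injected timeout."""
    raise TimeoutError("auth service did not respond")


def login_page(breaker, token):
    """Return what the user would see: a graceful message, never a stack trace."""
    try:
        breaker.call(flaky_auth, token)
        return "welcome"
    except CircuitOpenError:
        return "login is temporarily unavailable"
    except TimeoutError:
        return "login is slow, please retry"


def run_drill():
    """Hit the broken dependency five times and record each user-facing response."""
    breaker = CircuitBreaker(max_failures=3, reset_after=30.0)
    return [login_page(breaker, "abc") for _ in range(5)]
```

In this sketch the first three calls surface the timeout and the last two are rejected by the open breaker without touching the dependency at all. A suite which mocks the failure away never exercises either branch.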

I have worked with teams like this. They are the same teams who get paged at 3am for the third time in a week because something they "tested" did not survive contact with real life.

The principles are simple

The Principles of Chaos site distils it down to four ideas worth memorising.

Build a hypothesis about steady-state behaviour. Pick the metric you care about. Latency. Order rate. Login success. Whatever defines "the system is fine."

Vary real-world events. Kill a process. Drop network packets. Spike traffic. Inject a 30-second delay into the database. Fail the regional cache.

Run experiments in production. This is the line people balk at. Staging is a lie. Staging has the wrong data, the wrong load, the wrong dependencies. The only environment which tells you the truth is the one your users are in.

Automate the experiments to run continuously. A one-off Game Day is good. A weekly cron job firing off small failures is better. You are looking for the regression, the new dependency, the silent change in behaviour, and you only catch those if the experiments never stop.

There is a fifth principle on the same list, and it is the one nobody talks about loudly enough: minimise the blast radius. Start small. One pod. One region. Off-peak. You are not trying to take down the system. You are trying to learn from it.
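The ideas above fit into a surprisingly small loop. What follows is a sketch under my own assumptions, not the API of any chaos tool: `Experiment`, `run_experiment` and the in-memory "fleet" in `demo` are all made up for illustration. It checks the steady-state hypothesis, injects a fault into a limited number of targets, checks the hypothesis again, and always rolls back.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    name: str
    steady_state: Callable[[], bool]  # hypothesis: True while "the system is fine"
    inject: Callable[[str], None]     # start the fault on one target
    rollback: Callable[[str], None]   # undo the fault on that target
    targets: List[str]                # everything that could be hit
    blast_radius: int = 1             # how many targets may be hit per run


def run_experiment(exp, rng=random):
    """Run one experiment and report whether the steady state held under the fault."""
    # Never start an experiment on a system that is already unhealthy.
    if not exp.steady_state():
        return {"name": exp.name, "status": "skipped", "hit": []}

    victims = rng.sample(exp.targets, min(exp.blast_radius, len(exp.targets)))
    injected = []
    try:
        for target in victims:
            exp.inject(target)
            injected.append(target)
        held = exp.steady_state()
    finally:
        # Roll back whatever was injected, even if a step above raised.
        for target in injected:
            exp.rollback(target)

    return {"name": exp.name, "status": "passed" if held else "failed", "hit": victims}


def demo():
    """Toy fleet of three pods; the hypothesis is that two healthy pods are enough."""
    healthy = {"pod-a": True, "pod-b": True, "pod-c": True}
    exp = Experiment(
        name="kill one pod",
        steady_state=lambda: sum(healthy.values()) >= 2,
        inject=lambda pod: healthy.update({pod: False}),
        rollback=lambda pod: healthy.update({pod: True}),
        targets=list(healthy),
        blast_radius=1,
    )
    return run_experiment(exp, rng=random.Random(7)), healthy
```

In a real setup `steady_state` would query the metric you picked (latency, order rate, login success) and `inject` would call whatever actually kills the pod. Scheduling `run_experiment` from a weekly job is what turns a one-off Game Day into a continuous check.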

[Image: a friendly cartoon chaos monkey tugging at server cables in a data centre]

The numbers do not lie

Gremlin's State of Chaos Engineering report found 60% of surveyed teams had run at least one chaos experiment. The interesting bit is what the regular practitioners get out of it. Top-performing chaos engineering teams hit four-nines availability with an MTTR of under one hour. Read it again. Four-nines uptime, with bad days resolved in under sixty minutes.
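To put the four-nines figure in perspective, here is the arithmetic as a small Python sketch. The function names are mine, not Gremlin's, and the second function assumes the pessimistic case where every incident is a full outage lasting the whole recovery time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability):
    """Minutes of downtime per year allowed by an availability target between 0 and 1."""
    return (1 - availability) * MINUTES_PER_YEAR


def incidents_within_budget(availability, recovery_minutes):
    """How many full outages of `recovery_minutes` each fit inside the yearly budget."""
    return int(downtime_budget_minutes(availability) // recovery_minutes)
```

Four nines (`0.9999`) leaves about 52.6 minutes of downtime a year. Under the full-outage assumption, fifteen-minute recoveries leave room for three incidents, while a single sixty-minute outage already overspends the budget. Fast recovery is not a nice-to-have at this level; it is the only way the sums work.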

This is not because their software is magic. It is because their team has rehearsed the failure modes so many times the response is muscle memory. They know which dashboard to open. They know which command to run. They know whether the symptom is the cause or a knock-on effect, because they saw this exact failure last Tuesday in a controlled experiment.

The teams not doing this drift toward longer MTTRs, longer outages, and the special breed of stress which comes from facing a problem you have never seen before while three executives ask for a status update.

This is a permission slip, not a tool

Here is the thing most leaders get wrong. They buy a chaos engineering tool. Gremlin, Steadybit, Litmus. They install it, schedule one experiment, watch nothing burn down, and conclude they are "doing chaos engineering."

You are not. You are using a tool.

What Netflix and Amazon did was give their engineers permission to be curious about what hurts. They built a culture where finding a weakness was a win, not an embarrassment. Where the person who took the system down on a Friday afternoon was a hero, not a problem. Where you measured your team by how fast they recovered, not by whether they ever fell.

This cultural shift is harder than the tooling. The tooling is a Saturday project. The culture is a year of conversations.

I have watched leaders try to short-cut this by mandating chaos days. It does not work. The team goes through the motions, runs a sanctioned experiment in a sandbox, ticks a box, and goes back to being afraid. The fear is the problem. The tool is not.

Things to break on Monday

If you have never done this before, you do not need a platform. You need a willing team and an hour. Try one of these.

Kill a single backend pod during a deploy and watch what your load balancer does.

Block outbound traffic to a third-party API for five minutes and see whether your fallback fires or whether half your user flows go silent.

Add 500ms of latency to every database call and see how many of your "fast" endpoints turn into timeouts.

Take one of your CI runners offline mid-build and see whether the build queue recovers or stalls forever.

Force a leader election on your distributed coordinator and see how long the brownout lasts.

Each of these is small. Each takes minutes. Each will teach your team more about the system than a quarter of incident retros.
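The database latency experiment is the easiest one to try without any tooling. The sketch below is an assumption-laden toy: `with_latency`, `fetch_user` and `profile_endpoint` are hypothetical names, and the "database" is a plain function. It wraps a query in a fixed delay and reports whether an endpoint making one query per item still fits inside its time budget.

```python
import time
from functools import wraps


def with_latency(delay_s, sleep=time.sleep):
    """Decorator: pause for `delay_s` seconds before every call to the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            sleep(delay_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def fetch_user(user_id):
    """Hypothetical stand-in for a database query."""
    return {"id": user_id, "name": "example"}


def profile_endpoint(query, user_ids, budget_s):
    """Run one query per user (an N+1 pattern) and report whether the time budget held."""
    started = time.monotonic()
    rows = [query(uid) for uid in user_ids]
    elapsed = time.monotonic() - started
    return {"rows": len(rows), "elapsed_s": elapsed, "timed_out": elapsed > budget_s}
```

For the experiment described above you would use `with_latency(0.5)(fetch_user)` and the endpoint's real timeout as `budget_s`. An endpoint making four sequential queries picks up two full seconds, which is usually where the "fast" label stops being true.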

[Image: a group of engineers around a whiteboard labelled GAME DAY mapping out failure scenarios on sticky notes]

The question every leader should ask

If your team had to handle a regional AWS outage right now, do you trust them to do it well?

If the honest answer is no, you have two options. You wait for the outage to happen and find out the hard way. Or you break the system yourself, on a Tuesday morning, with a stopwatch and a postmortem template, and you learn before the customers do.

The teams who break things on purpose are the teams who sleep at night.

The teams who refuse are the ones who get paged at three.