Learn from your system with Chaos Engineering

Take a look at your system. A deep, honest look.

Now, ask yourself: "Am I genuinely confident that my system is strong and healthy?" Why would anyone ask themselves that question?

Well, for one, your system is your business, and assessing it honestly supports the success of that business. But it is also an intellectual and engineering exercise.

That question pulls the thread to another one: as software developers, do we see our whole system or simply the services we have worked on? From my past experience, I would say young engineers rarely consider all facets of the system they are part of, but I have also met many seasoned engineers who barely grasped it (or cared about it).

I wonder, in fact, whether the core property of DevOps is caring for the whole system, not just discrete services. But I digress.

So, can we consider that our system, in other words what our users experience, is healthy because a set of services is healthy? Is that enough? Ops engineers would gladly say "of course not!". After all, services do not live in a vacuum. Services live on a platform, within an infrastructure, and are connected (loosely) to each other or to external systems. There are more critical paths outside of your service than inside it. In other words, as a developer, your control is actually very limited if you only look at your service.

A note here. A microservice architecture does not really increase the level of complexity (complex, that is, not complicated). Instead, it exposes what used to be encapsulated in a monolith and designed by a subset of your whole team (the developers, that is). Breaking up your monolith makes it more transparent, and therefore something the whole team can engage with and act on. Complexity is only shifted.

Back to our system discussion: we now have more and more connected components that can (and probably should) come and go. In other words, we have fluidity. The difficulty with fluidity is that you cannot trust the state to remain the same for very long. This circles back to asking yourself how confident you are in your system's health once you consider it to be highly dynamic.

Chaos Engineering is an emerging discipline that hopes to answer that very question. Its principles are powerful because they let you pose a hypothesis about your system and run an experiment against it to confirm or refute it. That experimental approach gives you honest feedback about your system (provided you ask the right question) for you to learn from. Then adapt.
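
To make that loop a little more concrete, here is a minimal sketch in Python. It only illustrates the idea of hypothesis, experiment and feedback. Every name in it (Experiment, run, the replica functions) is made up for this post, and the "system" is a dictionary in memory standing in for two replicas of a service. It is not the API of any real tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """A hypothetical, minimal experiment (illustration only)."""
    title: str
    steady_state: Callable[[], bool]  # probe: does the system look healthy?
    method: Callable[[], None]        # action: introduce a disturbance
    rollback: Callable[[], None]      # undo the disturbance afterwards


def run(experiment: Experiment) -> str:
    # 1. The hypothesis must hold before we disturb anything,
    #    otherwise the experiment cannot teach us much.
    if not experiment.steady_state():
        return "aborted"
    try:
        # 2. Change the conditions the system runs under.
        experiment.method()
        # 3. Check the hypothesis again: did the system cope?
        return "confirmed" if experiment.steady_state() else "refuted"
    finally:
        # 4. Always try to leave the system as we found it.
        experiment.rollback()


# A toy system, assumed for the example: two replicas of one service.
replicas = {"a": True, "b": True}


def users_are_served() -> bool:
    # Users are served as long as at least one replica is up.
    return any(replicas.values())


def stop_replica_a() -> None:
    replicas["a"] = False


def restart_replica_a() -> None:
    replicas["a"] = True


experiment = Experiment(
    title="Users are still served when one replica goes away",
    steady_state=users_are_served,
    method=stop_replica_a,
    rollback=restart_replica_a,
)

result = run(experiment)
print(result)
```

With both replicas up, the hypothesis survives the loss of one of them. Run the same experiment when replica "b" is already down and the hypothesis is refuted, which is exactly the kind of honest feedback you want to learn from.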

Alongside Russ Miles, I have created the Chaos Toolkit, an open source tool that lets you describe and run experiments against your software with, I hope, a simple and comprehensible approach.

The idea behind chaos engineering is not new: plenty of engineers, and not just software engineers, have carried out similar experiments throughout history. But the fact that this is becoming a full-fledged software engineering discipline shows our industry is maturing.

We are looking at enabling that discipline along with others who have already paved the way. Please join us to keep that discussion going!