Learn from your system with Chaos Engineering

Take a look at your system. A deep, honest look.

Now, ask yourself « Am I genuinely confident that my system is strong and healthy? ». Why would anyone ask themselves that question?

Well, for one, your system is your business, and assessing it honestly supports your business's success. However, this is also an intellectual and engineering exercise.

That question actually leads to another one: as software developers, do we see our whole system or simply the services we have worked on? In my experience, young engineers rarely consider all facets of the system they are part of, but I have also met many seasoned engineers who barely grasped it (or cared for it).

Interestingly, I wonder if the core property of DevOps is caring for the whole system, not just discrete services. I digress.

So, can we consider that our system, in other words, what our users experience, is healthy because a set of services is healthy? Is it enough? Ops engineers would gladly say « of course not! ». After all, services do not live in a vacuum. Services live on a platform, in an infrastructure, and are connected (loosely) to each other or to external systems. There are more critical paths outside of your service than inside it. In other words, as a developer, your control is actually very limited if you only look at your service.

A note here. A microservice architecture does not really increase the level of complexity (not in the sense of complicated). Instead, it exposes what was encapsulated in a monolith and designed by a subset of your whole team (developers, that is). Breaking up your monolith makes it more transparent and therefore actionable for the whole team. Complexity is only shifted.

Back to our system discussion: we now have more and more connected components that can (and probably should) come and go. In other words, we have fluidity. The difficulty with fluidity is that you cannot trust state to remain the same for very long. This circles back to asking yourself how confident you are in your system's health if you consider it to be highly dynamic.

Chaos Engineering is an emerging discipline that hopes to answer that very question. The principles are powerful because they let you pose a hypothesis about your system and run an experiment against it to confirm or refute it. That experimental approach gives you honest feedback (provided you ask the right question) about your system for you to learn from. Then adapt.
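To make that loop a little more concrete, here is a minimal sketch in Python of what "pose a hypothesis, run an experiment, confirm or refute" could look like. It is only an illustration under stated assumptions: `ToySystem`, `Experiment`, `kill_replicas` and `restore` are hypothetical names I made up for this example, the "system" is a counter of replicas rather than anything real, and none of this is the API of an actual tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToySystem:
    """Hypothetical stand-in for a real system: a pool of service replicas."""
    replicas: int = 3

    def is_serving(self) -> bool:
        # Assumption: users are served as long as one replica is alive.
        return self.replicas >= 1


@dataclass
class Experiment:
    """Hypothetical experiment: a hypothesis, a disturbance and a rollback."""
    title: str
    hypothesis: Callable[[], bool]   # what "healthy" means, as a check
    method: Callable[[], None]       # the disturbance we introduce
    rollback: Callable[[], None]     # puts the system back as we found it
    journal: List[str] = field(default_factory=list)

    def run(self) -> bool:
        # The hypothesis must hold before we disturb anything,
        # otherwise the experiment teaches us nothing.
        if not self.hypothesis():
            self.journal.append("hypothesis not met beforehand, aborted")
            return False
        try:
            self.method()
            held = self.hypothesis()
            self.journal.append(
                "hypothesis confirmed" if held else "hypothesis refuted"
            )
            return held
        finally:
            # Always restore the initial conditions.
            self.rollback()


def kill_replicas(system: ToySystem, count: int) -> None:
    system.replicas = max(0, system.replicas - count)


def restore(system: ToySystem, replicas: int) -> None:
    system.replicas = replicas


system = ToySystem(replicas=3)

losing_one = Experiment(
    title="Users are still served when one replica dies",
    hypothesis=system.is_serving,
    method=lambda: kill_replicas(system, 1),
    rollback=lambda: restore(system, 3),
)
losing_all = Experiment(
    title="Users are still served when every replica dies",
    hypothesis=system.is_serving,
    method=lambda: kill_replicas(system, 3),
    rollback=lambda: restore(system, 3),
)

one_result = losing_one.run()
all_result = losing_all.run()
print(losing_one.title, "->", losing_one.journal[-1])
print(losing_all.title, "->", losing_all.journal[-1])
```

The second experiment is the interesting one: a refuted hypothesis is not a failure of the experiment, it is the feedback you were looking for.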

Alongside Russ Miles, I have created the Chaos Toolkit, an open source tool that lets you describe and run experiments against your software with, I hope, a simple and comprehensible approach.

The idea behind chaos engineering is not new. Plenty of engineers, not just in software, have carried out similar experiments throughout history. But the fact that this is becoming a full-fledged software engineering discipline shows our industry is maturing.

We are looking at enabling that discipline along with others who have already paved the way. Please join us to keep that discussion going!

Chaos is the new order

Let it be said: «Chaos is the new order».

I have been working with microservices for the past few years and it’s been quite a ride. Breaking down the monolith into more focused, independent entities has raised new questions and challenges: obviously from a software architecture perspective, but also operational and organizational ones.

Indeed, in a monolith, developers make a lot of the calls that can impact even operations, but leave the ops side with few knobs to tweak. One of the first consequences of the breakdown is that operations should be much more involved in the design choices of the application. A sane DevOps culture might even emerge.

With that said, running a healthy system made of unpredictable parts is challenging; there is no way around it. All teams that have transitioned to a microservices architecture have had to learn, adapt and become creative to face those challenges: service discovery, faulty services, stateless services, network latency, authentication across the system and many more.

The ecosystem is now ripe for running those systems with confidence and speed. I will now be working full steam on providing hands-on content about running microservices.

Running is one thing. As my friend Russ Miles often says, we must stress the system as it’s a living thing. Software does not live in a vacuum. Things go bad and we need to learn how to cope and respond fast, with confidence. Russ and I will therefore talk more about how chaos experiments will make your teams stronger and faster at delivering great software.

Stay tuned because this is going to be fun: I will speak about Kubernetes, service mesh, logging, monitoring, storage and more! Embrace chaos, as it is the new order.