Why Python is a great choice for the Chaos Toolkit project!

Russ Miles and I started the chaostoolkit project about three months ago with the idea of expressing in code the discussions we had been having on Chaos Engineering and its principles. I’ll let you read Russ’s post about the Chaos Toolkit effort, as I would like to focus here on the choice we made to use Python to implement that Open API for Chaos Engineering.

When we started the core implementation, we wondered which language we should pick for the project. As a long-time Python developer, I have always felt faster and more at ease with it, but I thought it would be wise to give the other existing languages some thought.

At its core, the Chaos Toolkit attempts to define a declarative API to drive your chaos engineering experiments. An experiment is serialized in a JSON file, which the chaos command ingests and validates before executing the declared activities sequentially.
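To make that flow concrete, here is a minimal sketch of the load, validate, then run sequentially idea. The field names (`title`, `method`, `type`, `name`) and the `validate` and `run` helpers are assumptions made for illustration; they are not the toolkit's actual schema or code.

```python
import json

# Hypothetical experiment document: the field names are illustrative
# assumptions, not the toolkit's real schema.
EXPERIMENT_JSON = """
{
  "title": "Service survives a restart",
  "method": [
    {"name": "stop-service", "type": "action"},
    {"name": "check-health", "type": "probe"}
  ]
}
"""


def validate(experiment):
    """Raise ValueError if the experiment is missing a required field."""
    if not experiment.get("title"):
        raise ValueError("experiment needs a title")
    method = experiment.get("method")
    if not isinstance(method, list) or not method:
        raise ValueError("experiment needs a non-empty method")
    for activity in method:
        if "name" not in activity or activity.get("type") not in ("action", "probe"):
            raise ValueError(f"invalid activity: {activity!r}")


def run(experiment):
    """Validate first, then walk the activities in their declared order."""
    validate(experiment)
    journal = []
    for activity in experiment["method"]:
        # A real runner would execute the activity here; this sketch only records it.
        journal.append((activity["type"], activity["name"]))
    return journal


experiment = json.loads(EXPERIMENT_JSON)
journal = run(experiment)
```

Validating the whole document before running anything means a malformed experiment fails early, before any activity has touched the system.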

It felt clear to us that we would be handling a lot of strings, so we wanted something that was simple, yet powerful, when dealing with them. That alone is not a strong enough requirement to make a choice. After all, most dynamic languages are capable with strings, while static languages, although perhaps more verbose, would manage as well.

We also knew we meant the chaostoolkit to be extended as widely as possible, so we looked for a rich ecosystem with various capabilities. It also meant we should pick a language that would help us load extensions in a simple fashion. (Note that the Chaos Toolkit API describes three extension mechanisms: Python functions, HTTP calls or processes. This means that you do not need to write Python code when you want to extend the toolkit.)
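As a rough sketch of why a dynamic language makes this easy, the three mechanisms can be dispatched from a plain dictionary. The `call_provider` helper and the keys it reads (`type`, `module`, `func`, `arguments`, `command`, `url`) are hypothetical names chosen for this example, not the toolkit's real provider format.

```python
import importlib
import subprocess
import sys
import urllib.request


def call_provider(provider):
    """Dispatch one activity to its provider and return the result."""
    kind = provider["type"]
    if kind == "python":
        # Import the module by name and look the function up at runtime.
        module = importlib.import_module(provider["module"])
        func = getattr(module, provider["func"])
        return func(**provider.get("arguments", {}))
    if kind == "process":
        # Any executable can be an extension, whatever language it is written in.
        done = subprocess.run(provider["command"], capture_output=True, text=True)
        return done.stdout.strip()
    if kind == "http":
        # Not exercised below because it needs a reachable endpoint.
        with urllib.request.urlopen(provider["url"], timeout=5) as response:
            return response.status
    raise ValueError(f"unknown provider type: {kind!r}")


as_function = call_provider(
    {"type": "python", "module": "json", "func": "dumps", "arguments": {"obj": [1, 2]}}
)
as_process = call_provider(
    {"type": "process", "command": [sys.executable, "-c", "print('ok')"]}
)
```

Here the Python provider resolves `json.dumps` purely from strings, and the process provider runs a one-line script through the current interpreter. Neither requires the caller to know anything about the extension beyond its declaration.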

It was also important to select an ecosystem that was widely available, to reduce the number of steps users need to take to get started.

On the other hand, speed was never a requirement per se. Indeed, the Chaos Toolkit is a CLI that runs things sequentially from a single machine. There is no need for raw speed because the Chaos Toolkit runs things as fast as they actually need to run.

A language’s prowess with paradigm X or Y didn’t matter to us as long as the language was sound and dependable.

Those requirements, if we can call them that, aren’t very strong or well defined, but they served us as guidelines.

Having worked a little with a variety of languages these past two years, I kept wondering what to select. After all, languages such as Clojure, TypeScript or Go all fitted the bill fairly well.

I dismissed Clojure mostly because I didn’t know its ecosystem or community well enough and I could not afford the time to thoroughly explore it. However, had I known it better, it would have been a strong candidate, if only because its code-as-data approach felt appropriate. Nonetheless, I do still wonder about its ecosystem.

TypeScript was an interesting one because it reminds me of Python in many respects. Much like Clojure, however, I left it out because I wasn’t sure its ecosystem would be the right one for the sort of experiments that users would care about.

As for Go, it was mostly a question of whether a statically typed, compiled language would fit the requirement for string handling. I don’t mean to say Go is bad at dealing with strings. I simply decided that if I had to end up with empty interfaces everywhere, I might as well rely directly on a dynamic language.

I realise you could argue it’s mostly a lack of knowledge of all those languages and ecosystems that made me uneasy about picking them up. Indeed, I certainly think they could all have worked out.

But I settled on Python. In fact, I picked Python 3 specifically. To me Python 2 feels legacy and new projects should not start there any longer (except in rare cases).

Python does check all the boxes the project needed. It’s readily available, has a massive ecosystem with native libraries for the platforms we believe the Chaos Toolkit will benefit the most (such as Kubernetes), and has powerful string handling (especially with the most recent Python 3 releases).

I know most DevOps tools seem to be written in Go these days, but I strongly believe Python is a sound choice depending on your requirements. Especially Python 3.

Will choosing Python over Go impact the Chaos Toolkit as a project? I certainly hope not, and I worked hard to design the code base so that it is comprehensible and easy to get involved with (well, I hope anyhow ;)).

In the end, what tipped it over was that Python is awesome and absolutely fun to work with.

Python has a tremendous ecosystem made of high-quality libraries. The changes in the language in the past years have not just made it more feature-rich but also actually more sound and relevant to new scenarios.

But at the end of the day, that is just the backstory; what matters is that we are proud of the Chaos Toolkit and really enjoy using it. We sincerely hope it will find its community, so if you fancy a project for 2018 or want to experiment with Chaos Engineering, make sure to join in and say hello! We love a good chat.

Learn from your system with Chaos Engineering

Take a look at your system. A deep, honest look.

Now, ask yourself « Am I genuinely confident that my system is strong and healthy? ». Why would anyone ask themselves that question?

Well, for one, your system is your business, and assessing it honestly supports your business’s success. However, this is also an intellectual and engineering exercise.

That question pulls on the thread of another one: as software developers, do we see our whole system or simply the services we have worked on? From my past experience, I would say young engineers rarely consider all facets of the system they are part of, but I have also met many seasoned engineers who barely grasped it (or cared about it).

Interestingly, I wonder if the core property of DevOps is caring for the whole system, not just discrete services. But I digress.

So, can we consider that our system, in other words what our users experience, is healthy because a set of services is healthy? Is it enough? Ops engineers would gladly say « of course not! ». After all, services do not live in a vacuum. Services live on a platform, within an infrastructure, and are connected (loosely) to each other or to external systems. There are more critical paths outside of your service than inside it. In other words, as a developer, your control is actually very limited if you only look at your service.

A note here. A microservice architecture does not really increase the level of complexity (complex, not in the sense of complicated). Instead, it exposes what was encapsulated in a monolith and designed by a subset of your whole team (developers, that is). Exploding your monolith makes it more transparent, and therefore actionable, for the whole team to engage with. Complexity is only shifted.

Back to our system discussion: we now have more and more connected components that can (and probably should) come and go. In other words, we have fluidity. The difficulty with fluidity is that you cannot trust that state remains the same for very long. This circles back to asking yourself how confident you are in your system’s health if you consider it to be highly dynamic.

Chaos Engineering is an emerging discipline that hopes to answer that very question. The principles are powerful because they let you pose a hypothesis about your system and run an experiment against it to confirm or refute it. That experimental approach gives you honest feedback (provided you ask the right question) about your system for you to learn from. Then adapt.
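The hypothesis-then-experiment loop can be sketched in a few lines. This is a toy illustration under my own assumptions: `run_experiment`, its three callbacks and the `replicas` dictionary are made-up names, and a real experiment would probe a live system rather than a dictionary.

```python
def run_experiment(steady_state, perturb, rollback):
    """Return a verdict on the hypothesis 'the system stays steady'."""
    if not steady_state():
        # Without a healthy baseline the experiment would teach us nothing.
        return "aborted: system was not steady before the experiment"
    try:
        perturb()
        return "confirmed" if steady_state() else "refuted"
    finally:
        # Always undo the perturbation, whatever the verdict.
        rollback()


# Toy system: two replicas, considered healthy as long as one of them is up.
replicas = {"a": True, "b": True}
verdict = run_experiment(
    steady_state=lambda: any(replicas.values()),
    perturb=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)
```

Checking the steady state before perturbing anything is what keeps the feedback honest: a "refuted" verdict can then be attributed to the perturbation rather than to a system that was already unhealthy.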

Alongside Russ Miles, I have created the Chaos Toolkit, an open source tool that lets you describe and run experiments against your software with, I hope, a simple and comprehensible approach.

The idea behind chaos engineering is not new; plenty of engineers, not just in software, have carried out similar experiments throughout history. But the fact that this is becoming a full-fledged software engineering discipline shows our industry is maturing.

We are looking at enabling that discipline along with others who have already paved the way. Please join us to keep that discussion going!

Chaos is the new order

Let it be said: «Chaos is the new order».

I have been working with microservices for the past few years and it’s been quite a ride. Breaking down the monolith into more focused, independent entities has raised new questions and challenges: obviously from a software architecture perspective, but also operational and organizational ones.

Indeed, in a monolith, developers make a lot of the calls that can impact even operations, but leave the ops side with few knobs to tweak. One of the first consequences of the breakdown is that operations should be much more involved in the design choices of the application. A sane DevOps culture might even emerge.

With that said, running a healthy system made of unpredictable parts is challenging; there is no way around it. All teams that have transitioned to a microservices architecture have had to learn, adapt and become creative to face those challenges: service discovery, faulty services, stateless services, network latency, authentication across the system and many more.

The ecosystem is now ripe for running those systems with confidence and speed. I will now be working full steam on providing hands-on content about running microservices.

Running is one thing but, as my friend Russ Miles often says, we must stress the system as it’s a living thing. Software does not live in a vacuum. Things go bad and we need to learn how to cope and respond fast, with confidence. Russ and I will therefore talk more about how chaos experiments will make your teams stronger and faster at delivering great software.

Stay tuned because this is going to be fun, as I will speak about Kubernetes, service mesh, logging, monitoring, storage and more! Embrace chaos, as it is going to be the new order.