What are Chaos Monkeys and what do they have to do with engineering and testing?
What is Chaos engineering?
Chaos engineering is the practice of intentionally trying to harm an application in production. As agile and DevOps practices become predominant, testing becomes more of a challenge because software is delivered more quickly and more frequently. As a result, defects are more likely to manifest themselves in production.
One of the characteristics of high-quality software is resiliency. This means that an application can perform acceptably under adverse circumstances. Adverse circumstances include things going wrong in the production environment that might bring the application down or seriously degrade performance. They could also include defects in the application itself that might crash it or cause it to generate errors.
Chaos engineering was first conceived of by Netflix, which wanted to ensure that its cloud-based streaming services were robust enough to withstand unexpected failures. Such failures could include network segments failing, data centers going down, distributed denial of service attacks, or another type of failure in production.
Netflix defined chaos engineering as an experiment with a null and alternative hypothesis. It starts with the definition of a normal operating state for the application, then postulates what might happen if a specific failure occurred. The experiment usually involves injecting or simulating a failure into the infrastructure to determine how the application responds.
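The experiment structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Netflix's tooling: ToyService, run_experiment, and the 0.99 success-rate tolerance are hypothetical names and values invented for the example. The sketch checks the normal operating state, injects one failure, checks the state again, and always rolls the failure back.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool    # was the system in its normal state before injection?
    steady_during: bool    # did it stay in its normal state while the failure was active?
    hypothesis_held: bool  # True when the application tolerated the failure


def run_experiment(
    probe: Callable[[], float],
    tolerance: Callable[[float], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one experiment: verify steady state, inject a failure, verify again."""
    if not tolerance(probe()):
        # The system is already unhealthy, so injecting a failure would prove nothing.
        return ExperimentResult(False, False, False)
    inject()
    try:
        steady_during = tolerance(probe())
    finally:
        rollback()  # always undo the injected failure, even if the probe raises
    return ExperimentResult(True, steady_during, steady_during)


class ToyService:
    """Hypothetical stand-in for a real deployment: requests succeed if any replica is up."""

    def __init__(self, replicas: int) -> None:
        self.healthy = [True] * replicas

    def kill(self, index: int) -> None:
        self.healthy[index] = False

    def restore(self) -> None:
        self.healthy = [True] * len(self.healthy)

    def success_rate(self) -> float:
        return 1.0 if any(self.healthy) else 0.0


# Hypothesis: losing one replica does not push the success rate below 99%.
service = ToyService(replicas=2)
result = run_experiment(
    probe=service.success_rate,
    tolerance=lambda rate: rate >= 0.99,
    inject=lambda: service.kill(0),
    rollback=service.restore,
)
```

In a real experiment the probe would read a production metric and the injection would act on real infrastructure. The shape of the experiment stays the same.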
Netflix built a set of tools for this purpose, starting with Chaos Monkey and later expanded into a suite called the Simian Army, that test and in some cases cause havoc with production applications. These tools introduce network delays, cause instances or even entire data center segments to go offline, or identify security vulnerabilities. They can also perform health checks on an application and clean up unused system resources.
For working specifically with applications, the Apache-licensed Chaos Toolkit has recently become available. The Chaos Toolkit simplifies access to chaos engineering concepts. It provides an API that lets the experimentation approach be applied at different levels: infrastructure, platform, and application.
Who owns chaos engineering?
Chaos engineering is typically under the control of the team's DevOps engineer, who is responsible for defining the scenario, executing the test, and determining and recording the results. That person is also responsible for minimizing the impact of the experiment on customers using the production system.
As you might imagine, those goals have the DevOps engineer walking a very fine line. It’s one thing to test the resiliency of a distributed application by trying to make it crash, and another thing to actually have it crash and start affecting customer sessions. Part of the planning for a chaos engineering experiment is knowing when to shut it down if things start going badly.
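One way to picture the shutdown decision is as a guard that checks a customer-impact metric after every step and stops as soon as a budget is exceeded. The sketch below is a hypothetical illustration: run_with_abort, the step names, and the 5% error budget are assumptions made for the example and are not taken from any particular tool.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_with_abort(
    steps: List[Step],
    error_rate: Callable[[], float],
    max_error_rate: float,
    rollback: Callable[[], None],
) -> Tuple[bool, List[str]]:
    """Apply failure steps in order; abort once customer impact exceeds the budget."""
    completed: List[str] = []
    aborted = False
    try:
        for name, step in steps:
            step()
            completed.append(name)
            if error_rate() > max_error_rate:
                aborted = True  # things are going badly, so stop escalating
                break
    finally:
        rollback()  # restore the system whether or not the experiment was aborted
    return aborted, completed


# Hypothetical stand-in for a production error-rate metric.
state = {"errors": 0.0}

steps: List[Step] = [
    ("add latency", lambda: state.update(errors=0.01)),
    ("drop replica", lambda: state.update(errors=0.08)),
    ("drop zone", lambda: state.update(errors=0.50)),
]

aborted, completed = run_with_abort(
    steps,
    error_rate=lambda: state["errors"],
    max_error_rate=0.05,
    rollback=lambda: state.update(errors=0.0),
)
```

Here the second step pushes the simulated error rate past the budget, so the third and most disruptive step never runs and the rollback restores the starting state.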
Chaos engineering is typically associated with DevOps teams, in part because the typical cloud deployment environment can’t easily be replicated in development and test. In short, teams test resiliency in production because it can’t be realistically tested prior to deployment.
Own your chaos testing
However, chaos engineering is also tied to DevOps through testing. Given the automated nature of DevOps workflows, the vast majority of testing is by necessity automated. From unit testing to smoke testing, DevOps is designed to deliver software without a tester touching the build.
This is why testers have to own chaos engineering. It’s certainly not a core testing focus as defined today, and many testers still believe their job is done when an application reaches production. But by contributing to the DevOps toolchain, Chaos Monkey meets the need for continuous testing. Chaos engineering is testing by any reasonable definition.
While chaos engineering isn’t yet mainstream practice, it has an enthusiastic following, especially among companies deploying customer-facing applications using DevOps practices. The practice of chaos engineering enables teams to test their applications when it really counts, in the production environment. While resiliency is only a part of a full testing regime, it takes on increasing importance when deploying to the cloud.
For testers, chaos engineering turns out to be much more interesting and relevant than traditional functional testing. It involves elements of both science and art, in that it requires a specific stimulus applied in an engineering manner to the application, but also an appreciation of how far you can push the application without causing heartburn to the business. Chaos engineering enables testers to expand their skills and add value in determining the quality of an application.