Posted on Aug 11, 2023 • Originally published at gtrekter.Medium

Mastering Chaos Engineering with Azure Chaos Studio

The rise of microservices architecture has sparked significant discussions within the technology industry. Despite the challenges posed by global events like the Russia-Ukraine war and the COVID-19 pandemic, experts project a remarkable growth in the market size of microservices architecture in the coming years. According to the IMARC Group, the market is expected to reach $7.8 billion by 2028. The Business Research Company paints an even brighter picture, predicting a market value of $10.86 billion by 2027. Market Research Future (MRFR) is even more optimistic, forecasting a growth to $21.61 billion by 2030. These projections confirm the widespread adoption of microservices architecture, making it crucial for developers and engineers to enhance their skills in technologies such as Docker, Kubernetes, and REST APIs.

The Complexity Challenge

However, as the adoption of microservices architecture continues to grow, so does the complexity of these systems. Testing intricate systems made up of hundreds of nodes and thousands of microservices is challenging, and predicting failures has become increasingly difficult. Such failures can result in costly outages for companies. According to an International Data Corporation (IDC) report, infrastructure failures can cost large businesses around $100,000 per hour, while critical application failures can cost from $500,000 to $1 million per hour. Furthermore, a survey conducted by the Uptime Institute found that nearly one-third of all data centers experienced an outage in 2020.

To proactively address the challenges posed by the complexity of microservices architecture, an increasing number of companies are turning to Chaos Engineering.

Introducing Chaos Engineering

Chaos Engineering is a proactive testing practice designed by Netflix to test the stability of its systems after migrating to Amazon Web Services. Its original purpose was to assess how those systems responded when critical components of the infrastructure were taken down. By intentionally inducing failures and closely monitoring the system’s responses, engineers were able to identify weaknesses that would otherwise have remained hidden under normal operating conditions. Gaining real-time insights into how a system responded under pressure prepared teams for actual failures and helped identify latent bugs.

By purposefully “breaking things,” businesses can improve their ability to find and resolve issues before they lead to costly outages, ensuring the resilience and reliability of their microservices architectures.

Netflix took it a step further and developed an entire suite of automated stress tests of their infrastructure called The Simian Army. However, this is a topic for a different time.

Phases of chaos engineering

As mentioned earlier, Chaos Engineering involves running thoughtful, planned experiments that teach us how our systems behave in the face of failure based on the following phases:

Pick a hypothesis: Before running an experiment, you should have an idea of the possible outcome.
Scope: Choose the blast radius of the experiment. Low-risk experiments affect only a few users by injecting failures into a small subset of devices. Riskier but more accurate experiments run at large scale without custom routing and can impact users outside the experiment group through circuit breakers and shared resource constraints.
Run the experiment: Execute the chaos experiment and collect metrics.
Analyze the results: Use the metrics you’ve collected to validate or invalidate the initial hypothesis.
Increase the scope: After gaining confidence from running smaller-scale experiments, you can gradually increase the blast radius and gain additional insights.
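
The phases above can be sketched in a few lines of Python. This is only an illustration of the loop, not part of Azure Chaos Studio or any other tool: the `Hypothesis` class, the error-rate threshold, and the doubling rule for the blast radius are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """What we expect to stay true while the fault is active (hypothetical fields)."""
    description: str
    max_error_rate: float  # highest acceptable share of failed requests, 0.0 to 1.0


def hypothesis_holds(hypothesis: Hypothesis, total_requests: int, failed_requests: int) -> bool:
    """Analyze phase: compare the collected metrics against the hypothesis."""
    if total_requests == 0:
        # No traffic was observed, so the experiment proved nothing.
        return False
    return failed_requests / total_requests <= hypothesis.max_error_rate


def next_blast_radius(current_percent: int, held: bool, factor: int = 2) -> int:
    """Increase-the-scope phase: widen only after a successful run, capped at 100%."""
    if not held:
        return current_percent
    return min(current_percent * factor, 100)


# Illustrative numbers only: 42 failures out of 10,000 requests is 0.42%.
hypothesis = Hypothesis("The app stays available when one pod fails", max_error_rate=0.01)
held = hypothesis_holds(hypothesis, total_requests=10_000, failed_requests=42)
print(held)                        # True
print(next_blast_radius(5, held))  # 10
```

If the hypothesis does not hold, the blast radius stays where it is, which mirrors the advice to widen the scope only after gaining confidence at a smaller scale.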

Chaos Experiments and Microsoft Azure

While Microsoft Azure has long supported several third-party chaos engineering tools and services such as Gremlin and Chaos Mesh, it wasn’t until March 13, 2023, that they publicly made available their chaos engineering service with the release of Azure Chaos Studio in East Asia and other regions.

What is Azure Chaos Studio?

Azure Chaos Studio is a service provided by Microsoft that enables you to orchestrate controlled fault injections into your Azure resources, including but not limited to Azure Cosmos DB, AKS, Azure VM, and many others. Depending on the target, it supports different kinds of faults:

Service-direct: These faults are directly applied to an Azure resource, without necessitating any installation or instrumentation.
Agent-based: These faults are executed within VMs or VMSS to induce in-guest failures.

In the context of AKS, Azure Chaos Studio leverages Chaos Mesh, an open-source chaos engineering platform that empowers users to easily inject failures into an AKS cluster. As of this writing, there are several limitations to keep in mind when considering Azure Chaos Studio. For instance, it only supports Linux nodes and requires local cluster accounts to be enabled.

Integrating Azure Chaos Studio with AKS

In this section, I will guide you on how to integrate Azure Chaos Studio into your AKS cluster and execute an experiment.

Install Chaos Mesh in your AKS cluster

The first thing you will need to do is install Chaos Mesh on your AKS cluster. To do so:

Get the access credentials for your AKS cluster and merge them into your local kubeconfig file so that you can interact directly with your Kubernetes cluster.

$ az aks get-credentials -g rg-training-aks-uks-01 -n aks-training-uks-01
Merged "aks-training-uks-01" as current context in /home/gtrekter/.kube/config

Install Helm on your local machine. It is a Kubernetes package manager that simplifies application deployment and management. It uses packages called “charts,” which are pre-configured Kubernetes resources, to install complex applications into a Kubernetes cluster.

curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
sudo apt-get install apt-transport-https --yes
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm

Add the Chaos Mesh chart repository to Helm, update your local Helm chart repository cache, create a new namespace in your Kubernetes cluster called chaos-testing, and install a new release called chaos-mesh using the chaos-mesh chart.

helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
kubectl create ns chaos-testing
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock

Once the installation has completed, you’ll see an array of new pods dedicated to several tasks related to Chaos Mesh.

$ kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
chaos-testing chaos-controller-manager-674f467db4-dznmm 1/1 Running 0 2m49s
chaos-testing chaos-controller-manager-674f467db4-f2wq4 1/1 Running 0 2m49s
chaos-testing chaos-controller-manager-674f467db4-vpjth 1/1 Running 0 2m49s
chaos-testing chaos-daemon-qnlbc 1/1 Running 0 2m49s
chaos-testing chaos-dashboard-d47f8c5cd-x55qs 1/1 Running 0 2m49s
chaos-testing chaos-dns-server-84d96c6dbc-v74b2 1/1 Running 0 2m49s
default azure-vote-back-78df98c548-5jxpm 1/1 Running 0 14h
...

And their respective services.

$ kubectl get service -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
chaos-testing chaos-daemon ClusterIP None <none> 31767/TCP,31766/TCP 3m44s
chaos-testing chaos-dashboard NodePort 10.0.252.240 <none> 2333:31753/TCP,2334:30871/TCP 3m44s
chaos-testing chaos-mesh-controller-manager ClusterIP 10.0.250.234 <none> 443/TCP,10081/TCP,10082/TCP,10080/TCP 3m44s
chaos-testing chaos-mesh-dns-server ClusterIP 10.0.16.244 <none> 53/UDP,53/TCP,9153/TCP,9288/TCP 3m44s
...

Among them, the chaos-controller-manager pods handle the management of chaos experiments, whereas the chaos-daemon pod oversees coordination and execution of these chaos experiments.
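
Before moving on, it can help to confirm that every Chaos Mesh pod actually reached the Running state. The following is a minimal sketch, assuming you have captured the text printed by kubectl get pod -A. The `pods_not_running` helper is hypothetical, and the Pending row in the sample is invented to show a failing case.

```python
from typing import List


def pods_not_running(kubectl_output: str, namespace: str) -> List[str]:
    """Return the names of pods in `namespace` whose STATUS is not Running."""
    not_running = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        columns = line.split()
        if len(columns) < 4:
            continue  # ignore truncated rows such as "..."
        pod_namespace, name, _ready, status = columns[:4]
        if pod_namespace == namespace and status != "Running":
            not_running.append(name)
    return not_running


# Sample text in the same column layout as `kubectl get pod -A`.
# The Pending dashboard row is made up for illustration.
sample = """\
NAMESPACE NAME READY STATUS RESTARTS AGE
chaos-testing chaos-controller-manager-674f467db4-dznmm 1/1 Running 0 2m49s
chaos-testing chaos-daemon-qnlbc 1/1 Running 0 2m49s
chaos-testing chaos-dashboard-d47f8c5cd-x55qs 0/1 Pending 0 2m49s
...
"""
print(pods_not_running(sample, "chaos-testing"))  # ['chaos-dashboard-d47f8c5cd-x55qs']
```

An empty list means every pod in the namespace is Running and the cluster is ready for the next step.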

Include your AKS cluster among the resources targetable by Chaos Studio experiments

Even if Chaos Mesh is installed in your cluster, before Chaos Studio can start injecting faults, it’s necessary to include your AKS cluster in the list of resources managed by Chaos Studio. Here’s how you do that:

Browse to the Azure portal and log in.
Search for Chaos Studio in the main search bar, and select Targets.
Check your AKS cluster, then click Enable targets and select Enable service-direct targets.

Click Review + Enable, and then Enable to confirm.

Create a chaos experiment

With your AKS cluster now enabled, it’s time to start defining our experiments.

Navigate to Chaos Studio.
Click on Experiments, then Create, and finally select New Experiment.

In the experiment creation form, assign a name to your experiment and choose the region in which the experiment will be stored.

Next, select the permissions that will be used to execute the experiment. These can come from either a system-assigned or a user-assigned managed identity.

Next comes the Experiment Designer section. This is where we’ll define the actions that will be performed against the targeted resources (in this case, the AKS instance). Depending on their configuration, these actions will be executed either sequentially or in parallel.

Assign a name to your Step and Branch, and click on Add fault.
In the side panel, you’ll have the option to choose from a wide array of pre-configured faults. For the purposes of this example, we’ll opt to cause a pod to fail for a duration of 10 minutes.

While the parameters of the faults depend on the type you’ve chosen, it’s worth noting that AKS faults, being based on Chaos Mesh, share two common parameters: Duration and jsonSpec.

In Chaos Mesh, you usually run chaos experiments by deploying YAML manifests to your Kubernetes cluster, as shown below:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: all
  duration: '600s'
  selector:
    namespaces:
      - default

To use your Chaos Mesh manifests in Azure Chaos Studio, all you have to do is convert the contents of the spec: block to JSON. For example:

{ "action": "pod-failure", "mode": "all", "duration": "600s", "selector":{ "namespaces":[ "default" ]}}
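
If you have many manifests, the same conversion can be scripted. Below is a minimal sketch using only the Python standard library. It assumes the spec: block has already been written out as a Python dict (parsing the YAML file itself would need a third-party parser), and the `to_json_spec` helper and the key/value wrapper at the end are illustrative names rather than an official API.

```python
import json

# The contents of the spec: block of the PodChaos manifest, written as a Python dict.
pod_chaos_spec = {
    "action": "pod-failure",
    "mode": "all",
    "duration": "600s",
    "selector": {"namespaces": ["default"]},
}


def to_json_spec(spec: dict) -> str:
    """Serialize a Chaos Mesh spec to a compact, single-line JSON string."""
    return json.dumps(spec, separators=(",", ":"))


json_spec = to_json_spec(pod_chaos_spec)
print(json_spec)
# {"action":"pod-failure","mode":"all","duration":"600s","selector":{"namespaces":["default"]}}

# Assumption: when the value is passed as a fault parameter, it travels as a
# string under the jsonSpec key, so it is wrapped here as a key/value pair.
fault_parameter = {"key": "jsonSpec", "value": json_spec}
```

The compact separators only remove whitespace. The resulting string carries the same data as the spaced version, so either form can be pasted into the jsonSpec field.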

Select Next: Target resources, and select the resources you wish to target with the experiment. Note that you’ll only see the resources that have been enabled in Chaos Studio.

Click Add to confirm.
Finally, click Review + create, and then Create. Upon completion, you’ll see the experiment in your resource group.

Grant Permissions to the Experiment Managed Identity

When a chaos experiment is created, Chaos Studio automatically generates a system-assigned managed identity that carries out faults on your target resources. However, we need to manually assign appropriate permissions to it before it can start interacting with the AKS cluster. Depending on the fault selected, the managed identity will require different permissions. You can view the complete list through this link.

Supported resource types and role assignments for Chaos Studio - Microsoft

To grant these permissions to the Experiment managed identity, follow these steps:

Browse to your AKS cluster page, select Access control (IAM), then click on Add followed by Add role assignment.

Choose the necessary role, then navigate to the Members tab, click on Select members, and click the name of the experiment that you created earlier.

Click Select, then Review + Assign.

Start the Experiment

Now it’s time to introduce some chaos and ‘break’ things in our resources! 😈 To do so, just browse to your experiment and click Start.

If you’re interested in a more detailed look at what’s happening within your experiment, you can access information about the currently executing step by clicking Details.