
Designing for Resilience: How Cloudairy Cloudchart Streamlines Kubernetes Architecture for AZ Failure

Cloudairy Blog | 6 Feb, 2025 | Kubernetes

Introduction

Ensuring applications function correctly during infrastructure failures is crucial in highly distributed systems. One common failure scenario is the unavailability of an entire Availability Zone (AZ). Applications are often deployed across multiple AZs for high availability and fault tolerance, especially in cloud environments like Amazon Web Services (AWS).

 

Kubernetes helps manage and deploy applications across multiple nodes and AZs. However, testing application behavior during an AZ failure can be challenging. This is where fault injection comes in. AWS Fault Injection Service (AWS FIS, formerly AWS Fault Injection Simulator) can intentionally inject faults into a system to test its resilience. In this blog post, we will explore how to use AWS FIS to simulate an AZ failure for Kubernetes workloads.

Solution Overview

To confirm that Kubernetes workloads can handle failures, you need to test their resilience by simulating real-world failure scenarios. Kubernetes allows you to deploy workloads across multiple AZs, but that alone does not tell you how the system behaves when one of those AZs goes away. In this blog post, we will use a microservice for product details, run it on a cluster whose nodes are auto-scaled by Karpenter, and test how the system responds to varying traffic levels.


To make the failure scenario realistic, we will run a load test that mimics hundreds of users accessing the service concurrently. While that load is running, AWS FIS disrupts network connectivity to simulate an AZ failure in a controlled manner. Karpenter, a node autoscaler, automatically adjusts the cluster size based on the resource requirements of the running workloads.

Steps to Simulate AZ Failure Using AWS FIS

1. Set Up a Kubernetes Cluster Across Multiple AZs

Ensure your Kubernetes cluster is deployed across multiple AZs in AWS. Use Amazon Elastic Kubernetes Service (EKS) to create a cluster whose subnets and nodes span multiple AZs. This setup is the precondition for surviving the failure of an entire AZ: workloads can only be spread across zones if there are nodes in each zone to run them.
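As a quick sanity check on this step, the sketch below counts nodes per AZ from their labels. It assumes you have already exported each node's labels as a dictionary (for example from `kubectl get nodes -o json`); the sample data and the function names are illustrative, not part of any tool mentioned in this post. The label key `topology.kubernetes.io/zone` is the standard Kubernetes zone label.

```python
from collections import Counter

# Well-known Kubernetes node label that carries the zone name.
ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_per_zone(nodes):
    """Count nodes per zone from a list of node-label dictionaries."""
    return Counter(labels.get(ZONE_LABEL, "unknown") for labels in nodes)


def spans_multiple_zones(nodes, min_zones=2):
    """Return True if the nodes cover at least `min_zones` known zones."""
    counts = nodes_per_zone(nodes)
    counts.pop("unknown", None)  # nodes without a zone label do not count
    return len(counts) >= min_zones


# Hypothetical sample data standing in for real node labels.
sample_nodes = [
    {ZONE_LABEL: "us-east-1a"},
    {ZONE_LABEL: "us-east-1a"},
    {ZONE_LABEL: "us-east-1b"},
    {ZONE_LABEL: "us-east-1c"},
]

print(dict(nodes_per_zone(sample_nodes)))
print(spans_multiple_zones(sample_nodes))
```

A check like this only confirms that nodes exist in several zones; it says nothing about where the pods actually run.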

2. Deploy a Microservice with Karpenter

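Deploy your microservice, such as a product details service, in the Kubernetes cluster. Install Karpenter to manage the cluster size based on workload demands. Note that Karpenter scales nodes, not pods: it provisions capacity when pods cannot be scheduled and removes capacity that is no longer needed. To scale the microservice pods themselves based on CPU and memory usage, pair it with a pod-level autoscaler such as the Horizontal Pod Autoscaler. Together, these let your application handle varying loads efficiently.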
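One way to make the deployment zone-aware is a topology spread constraint on the pod template. The sketch below builds such a Deployment manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The service name, image, replica count, and resource requests are placeholder assumptions; only the Deployment fields themselves are standard Kubernetes API.

```python
import json


def product_details_deployment(replicas=6, image="example.com/product-details:latest"):
    """Build a Deployment manifest that spreads pods evenly across zones.

    The name, image, and resource requests are hypothetical placeholders.
    """
    labels = {"app": "product-details"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "product-details"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Keep the pod count per zone within 1 of every other zone.
                    "topologySpreadConstraints": [{
                        "maxSkew": 1,
                        "topologyKey": "topology.kubernetes.io/zone",
                        "whenUnsatisfiable": "DoNotSchedule",
                        "labelSelector": {"matchLabels": labels},
                    }],
                    "containers": [{
                        "name": "product-details",
                        "image": image,
                        # Requests tell the scheduler (and a node autoscaler)
                        # how much capacity each pod needs.
                        "resources": {
                            "requests": {"cpu": "250m", "memory": "256Mi"},
                        },
                    }],
                },
            },
        },
    }


manifest = product_details_deployment()
print(json.dumps(manifest, indent=2))
```

With `DoNotSchedule`, pods that would violate the spread stay pending, which is the signal a node autoscaler such as Karpenter reacts to by adding capacity in the under-represented zone.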

3. Configure AWS Fault Injection Simulator (AWS FIS)

Create an AWS FIS experiment template to simulate an AZ failure. An experiment template defines targets (the resources to affect) and actions (what to do to them). To approximate an AZ failure, you can terminate or stop instances, disrupt network connectivity, or stop services in the targeted AZ. For example, you can target all instances or subnets in a specific AZ and define actions that terminate those instances or cut their network traffic.
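A minimal sketch of what such a template might look like, built as a Python dictionary. The action ID `aws:network:disrupt-connectivity`, the `aws:ec2:subnet` resource type, and the `availabilityZoneIdentifier` parameter reflect our reading of the AWS FIS documentation and should be verified against the current docs before use. The AZ name, role ARN, duration, and the target and action names are hypothetical placeholders.

```python
import json


def az_network_disruption_template(az, role_arn, duration="PT10M"):
    """Build an FIS experiment template that cuts network traffic for one AZ.

    Field values are assumptions based on the AWS FIS documentation;
    check them against the current docs before creating a real template.
    """
    return {
        "description": f"Disrupt network connectivity for subnets in {az}",
        "roleArn": role_arn,
        # No automatic stop condition here; for real experiments, consider
        # wiring a CloudWatch alarm so the experiment halts on excess impact.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "az-subnets": {
                "resourceType": "aws:ec2:subnet",
                "parameters": {"availabilityZoneIdentifier": az},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "disrupt-az-network": {
                "actionId": "aws:network:disrupt-connectivity",
                "parameters": {"scope": "all", "duration": duration},
                "targets": {"Subnets": "az-subnets"},
            }
        },
    }


# Hypothetical AZ and IAM role ARN for illustration only.
template = az_network_disruption_template(
    az="us-east-1a",
    role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",
)
print(json.dumps(template, indent=2))
```

A dictionary of this shape is intended to be passed as keyword arguments to the FIS `create_experiment_template` API, for example through the boto3 `fis` client. Run it first against a non-production account.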

4. Run Load Tests

Use a load testing tool (e.g., Apache JMeter, Locust) to generate traffic to the microservice, simulating hundreds of users accessing the service concurrently. Before introducing any faults, monitor the microservice's performance and scaling behavior under normal conditions. This establishes a baseline (for example, latency percentiles and error rate) against which you can compare the results of the fault experiment.
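To illustrate what a baseline looks like, here is a small standard-library sketch that runs concurrent simulated users and summarizes latency and error rate. It is not a replacement for JMeter or Locust. The `fake_request` function is a stand-in we made up so the example runs anywhere; in practice you would swap in a real HTTP call to your service.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send_request, users=100, requests_per_user=5):
    """Run `users` concurrent workers; return (latencies_in_seconds, error_count)."""
    def worker(_):
        latencies, errors = [], 0
        for _ in range(requests_per_user):
            start = time.perf_counter()
            try:
                send_request()
                latencies.append(time.perf_counter() - start)
            except Exception:
                errors += 1  # any exception counts as a failed request
        return latencies, errors

    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(worker, range(users)))
    all_latencies = [lat for lats, _ in results for lat in lats]
    total_errors = sum(err for _, err in results)
    return all_latencies, total_errors


def summarize(latencies, errors):
    """Summarize a run as request count, error rate, and p50/p95 latency."""
    total = len(latencies) + errors
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies, n=100) if len(latencies) >= 2 else []
    return {
        "requests": total,
        "error_rate": errors / total if total else 0.0,
        "p50": cuts[49] if cuts else None,
        "p95": cuts[94] if cuts else None,
    }


def fake_request():
    """Hypothetical stand-in for a real HTTP call to the microservice."""
    time.sleep(0.001)


baseline = summarize(*run_load(fake_request, users=20, requests_per_user=5))
print(baseline)
```

Recording the same summary before and during the fault experiment gives you two directly comparable snapshots.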

5. Execute AWS FIS Experiment

Start the AWS FIS experiment to simulate an AZ failure while the load test is still running. Observe the impact on the Kubernetes cluster and the microservice: watch how pods in the affected AZ are rescheduled, how Karpenter adjusts the cluster size, and how the microservice handles the failure. This helps identify weaknesses in your system's fault tolerance.
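The start-and-wait loop can be sketched as follows. The function takes a client object so that it can be exercised without AWS access. The method names `start_experiment` and `get_experiment` and the response shape follow the boto3 `fis` client as we understand it and should be checked against the boto3 documentation. `FakeFis` and the IDs are made up for this dry run.

```python
import time
import uuid

# States in which the experiment is still in progress; anything else is
# treated as final (for example "completed", "stopped", or "failed").
IN_PROGRESS = {"pending", "initiating", "running", "stopping"}


def run_experiment(fis, template_id, poll_seconds=15, sleep=time.sleep):
    """Start an experiment and poll until it leaves the in-progress states."""
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    while True:
        response = fis.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
        if status not in IN_PROGRESS:
            return experiment_id, status
        sleep(poll_seconds)


class FakeFis:
    """Hypothetical stand-in for the FIS client, used for a dry run."""

    def __init__(self):
        self._statuses = iter(["initiating", "running", "completed"])

    def start_experiment(self, clientToken, experimentTemplateId):
        return {"experiment": {"id": "EXP-hypothetical"}}

    def get_experiment(self, id):
        return {"experiment": {"state": {"status": next(self._statuses)}}}


# Dry run with the fake client and no real waiting.
print(run_experiment(FakeFis(), "EXT-hypothetical", sleep=lambda seconds: None))
```

Against a real account, you would pass a boto3 FIS client in place of `FakeFis()` and keep the default `sleep`.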

6. Monitor and Analyze Results

Use Amazon CloudWatch and Kubernetes dashboards to monitor metrics such as pod health, instance status, and network latency. Analyze how the microservice responds to the simulated AZ failure: check whether the service remains available and how quickly it recovers, and compare latency and error rate against the baseline from step 4. Review auto-scaling events to confirm that Karpenter scaled the cluster appropriately to handle the increased load or to replace failed instances.
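Availability and recovery time can be derived from a simple series of health probes. The sketch below assumes you have collected `(timestamp, ok)` pairs from a periodic check against the service. The sample probes, the function name, and the definition of recovery (first successful probe after the last failure) are our own illustrative choices.

```python
def analyze_probes(probes, fault_start):
    """Compute availability and recovery time after a fault.

    probes: list of (timestamp_seconds, ok) tuples sorted by time.
    fault_start: timestamp at which the fault was injected.
    """
    after = [(t, ok) for t, ok in probes if t >= fault_start]
    failures = [t for t, ok in after if not ok]
    availability = (
        sum(1 for _, ok in after if ok) / len(after) if after else None
    )
    if not failures:
        # The service never failed a probe after the fault.
        return {"availability": availability, "recovery_seconds": 0.0}
    last_failure = failures[-1]
    # Recovery point: first successful probe after the last failure.
    recovered_at = next((t for t, ok in after if ok and t > last_failure), None)
    return {
        "availability": availability,
        "recovery_seconds": (
            None if recovered_at is None else recovered_at - fault_start
        ),
    }


# Hypothetical probe results: one probe every 10 seconds, fault at t=15.
probes = [(0, True), (10, True), (20, False), (30, False), (40, True), (50, True)]
result = analyze_probes(probes, fault_start=15)
print(result)
```

A `recovery_seconds` of `None` means the service had not recovered by the end of the observation window, which is itself a finding worth investigating.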

Conclusion

Simulating AZ failures in a Kubernetes infrastructure using AWS FIS helps test the resilience of your applications. By deploying workloads across multiple AZs, using Karpenter for node auto-scaling, and performing controlled fault injections, you can gain evidence that your system is robust and can recover from infrastructure failures, and find the gaps where it cannot. This proactive approach to failure testing improves the reliability and availability of your applications in a cloud environment.

Designing for Resilience: How Cloudairy Cloudchart Aids in Simulating AZ Failures (Before They Happen)

While Cloudairy Cloudchart doesn't directly participate in simulating failures within Kubernetes, it excels at designing and visualizing the architecture beforehand. Using drag-and-drop components for Kubernetes objects and AWS resources, Cloudairy Cloudchart lets you visually map your application deployment across multiple AZs, including autoscalers like Karpenter. This visual representation can help you understand how your architecture will handle failures and pinpoint potential weak spots before you even set up a Kubernetes cluster or deploy your application.
