In the previous subchapter we looked at regions. Now we go one level deeper to discover one of the most important concepts in cloud computing: Availability Zones (AZs). They are the foundation that lets your applications keep running even if an entire datacenter fails.

What is an Availability Zone

Each AWS region is internally made up of several Availability Zones. An AZ is one or more physical datacenters, with their own power, cooling, and network, independent from the other AZs in the region.

Think of it like this:

Europe Region (Spain)  ← a geographic area
 ├── AZ-a   ← independent datacenter(s)
 ├── AZ-b   ← independent datacenter(s)
 └── AZ-c   ← independent datacenter(s)

The AZs in the same region are:

  • Far enough apart from each other so that a local disaster (fire, flood, power outage) doesn’t affect them all at once.
  • Close enough to be connected by low-latency, high-bandwidth network links, so they work together almost as if they were one.

They are usually named by adding a letter to the region code: eu-south-2a, eu-south-2b, eu-south-2c.
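The naming convention above can be sketched in a few lines of Python. This is only an illustration of the "region code plus letter" pattern: the helper names `az_names` and `region_of` are made up for this example, and the sketch assumes plain names like the ones shown here.

```python
import string


def az_names(region_code, count):
    """Build AZ names by appending the letters a, b, c, ... to a region code."""
    if not 1 <= count <= len(string.ascii_lowercase):
        raise ValueError("count must be between 1 and 26")
    return [region_code + letter for letter in string.ascii_lowercase[:count]]


def region_of(az_name):
    """Recover the region code by dropping the trailing letter (assumed convention)."""
    return az_name[:-1]


print(az_names("eu-south-2", 3))
# → ['eu-south-2a', 'eu-south-2b', 'eu-south-2c']
print(region_of("eu-south-2b"))
# → eu-south-2
```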

Why they exist: fault tolerance

The idea is simple but powerful: if you spread your application across multiple AZs, the failure of one won’t bring down your service.

Analogy: Imagine you have an important store. Instead of putting all your stock in a single warehouse (which could catch fire), you spread it across three warehouses in different neighborhoods. If one burns down, the other two keep fulfilling orders. You lose a third of your capacity, but you don’t shut down.

That’s exactly what AZs do: they let you design systems that survive the failure of an entire datacenter.
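The warehouse arithmetic can be made concrete with a small sketch. It assumes, for simplicity, that capacity is spread evenly across AZs and that every server is identical; the function names are hypothetical and exist only for this example.

```python
import math


def surviving_fraction(total_azs, failed_azs=1):
    """Fraction of capacity left when some AZs fail (even spread assumed)."""
    if total_azs < 1 or not 0 <= failed_azs <= total_azs:
        raise ValueError("invalid AZ counts")
    return (total_azs - failed_azs) / total_azs


def servers_per_az(servers_needed, total_azs, tolerated_failures=1):
    """Servers to run in each AZ so demand is still met after AZ failures."""
    surviving = total_azs - tolerated_failures
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    return math.ceil(servers_needed / surviving)


print(round(surviving_fraction(3), 2))  # → 0.67
print(servers_per_az(6, 3))             # → 3
print(servers_per_az(6, 2))             # → 6
```

Read the last two lines this way: if you need 6 servers' worth of capacity at all times, three AZs require 3 servers each (9 in total), while two AZs require 6 each (12 in total). Under these assumptions, spreading across more AZs reduces the spare capacity you have to keep around.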

Real example: a fault-tolerant website

Imagine a well-designed online store:

            Users
              │
       [Load Balancer]   ← distributes traffic
        ╱     │     ╲
   Server   Server   Server
   (AZ-a)   (AZ-b)   (AZ-c)

  • User traffic arrives at a load balancer (we’ll see this in Chapter 13).
  • The balancer distributes requests among servers in different AZs.
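  • If AZ-b suffers a power outage, the balancer detects that its server is no longer responding, stops sending it traffic, and only uses AZ-a and AZ-c.
  • As long as the remaining servers can absorb the extra load, users barely notice anything. The website keeps working.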
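The behavior described in this list can be imitated with a toy round-robin balancer. This is a minimal sketch, not how a real AWS load balancer is implemented: the `LoadBalancer` class and its methods are invented for this example, and health is set by hand instead of being detected with health checks.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin balancer that skips AZs marked as unhealthy."""

    def __init__(self, azs):
        self.azs = list(azs)
        self.healthy = set(self.azs)
        self._rotation = cycle(self.azs)

    def mark_down(self, az):
        """Stop sending traffic to the given AZ."""
        self.healthy.discard(az)

    def mark_up(self, az):
        """Resume sending traffic to a known AZ."""
        if az in self.azs:
            self.healthy.add(az)

    def route(self):
        """Return the AZ that should serve the next request."""
        if not self.healthy:
            raise RuntimeError("no healthy AZ available")
        # At least one AZ is healthy, so this loop always returns.
        for az in self._rotation:
            if az in self.healthy:
                return az


lb = LoadBalancer(["AZ-a", "AZ-b", "AZ-c"])
print([lb.route() for _ in range(3)])  # → ['AZ-a', 'AZ-b', 'AZ-c']

lb.mark_down("AZ-b")  # simulate the power outage in AZ-b
print([lb.route() for _ in range(4)])  # → ['AZ-a', 'AZ-c', 'AZ-a', 'AZ-c']
```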

This is what “high availability by design” means: resilience isn’t an add-on, it’s part of how you build from the start.

The golden rule: “design for failure”

In the cloud, it’s assumed that hardware will fail sooner or later. It’s not pessimism, it’s realism: with millions of components, something is always breaking. The winning strategy isn’t to prevent anything from failing (impossible), but to design so that when something fails, it doesn’t matter.

That’s why one of the first best practices you’ll learn is:

Never put your entire application in a single AZ. Always spread it across at least two.

Services like managed databases (RDS Multi-AZ, Chapter 8) or auto-scaling groups (Chapter 13) are designed precisely to be distributed across AZs with very little effort on your part.
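To make the golden rule tangible, here is a small sketch that checks whether a list of servers respects it. The input is simply the AZ name of each server; the function names and the sample placements are hypothetical, chosen only for this example.

```python
from collections import Counter


def az_distribution(instance_azs):
    """Count how many instances run in each AZ."""
    return dict(Counter(instance_azs))


def follows_golden_rule(instance_azs, minimum_azs=2):
    """True if the instances are spread across at least `minimum_azs` AZs."""
    return len(set(instance_azs)) >= minimum_azs


# Hypothetical placements: one AZ name per running server.
risky = ["eu-south-2a", "eu-south-2a", "eu-south-2a"]
spread = ["eu-south-2a", "eu-south-2b", "eu-south-2a"]

print(follows_golden_rule(risky))   # → False
print(follows_golden_rule(spread))  # → True
print(az_distribution(spread))      # → {'eu-south-2a': 2, 'eu-south-2b': 1}
```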

AZ vs Region: don’t confuse them

It’s important to distinguish two levels of resilience:

Level                                    Protects against        Example of covered failure
Multi-AZ (multiple zones, same region)   Datacenter failure      Power outage in a building
Multi-region (multiple regions)          Entire region failure   Catastrophe affecting a whole geographic area

For most applications, multi-AZ is enough, and it is much simpler and cheaper. Multi-region is reserved for critical systems that must keep running even if an entire region becomes unavailable (we’ll see this in Chapter 26 on disaster recovery).
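The two levels can be told apart mechanically from the AZ names a deployment uses. The sketch below is an illustration only: `resilience_level` is a made-up helper, and it assumes every name follows the simple "region code plus one letter" pattern, which does not hold for every kind of zone AWS offers.

```python
def resilience_level(azs):
    """Classify a deployment as single-AZ, multi-AZ, or multi-region."""
    zones = set(azs)
    if not zones:
        raise ValueError("a deployment needs at least one AZ")
    # Assumption: the region code is the AZ name without its trailing letter.
    regions = {az[:-1] for az in zones}
    if len(regions) > 1:
        return "multi-region"
    if len(zones) > 1:
        return "multi-AZ"
    return "single-AZ"


print(resilience_level(["eu-south-2a"]))                 # → single-AZ
print(resilience_level(["eu-south-2a", "eu-south-2b"]))  # → multi-AZ
print(resilience_level(["eu-south-2a", "us-east-1a"]))   # → multi-region
```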

What you should remember

  • A region is divided into several Availability Zones (AZs), each made up of one or more independent datacenters (with their own power, network, and cooling).
  • AZs are isolated from each other but connected by fast networks.
  • Spreading your application across multiple AZs makes it tolerant to the failure of an entire datacenter.
  • Golden rule: design for failure and never use just one AZ in production.
  • Multi-AZ protects against datacenter failures; multi-region protects against failures of an entire region (more expensive and complex).

In the next subchapter we’ll look at a third level of AWS presence, even closer to the user: edge locations and the CloudFront service.

Cloud, AWS & Terraform — From Zero to Expert

© Copyright 2024. All rights reserved