In the previous subchapter, we saw disaster recovery strategies where, if the main system fails, traffic must switch to a backup system. But how do you detect that the main one has failed and redirect users automatically, without a human having to intervene at 3 a.m.? The answer combines two features of Route 53 (the AWS DNS we saw in subchapter 16.1): health checks and failover.

Reminder: what Route 53 does

Remember from subchapter 16.1 that Route 53 is AWS's DNS service: it translates a domain name (like myshop.com) into the address of the server that should handle it. A DNS lookup is the first step a user's browser takes to find out where to connect. This gives Route 53 a privileged position: it decides where the traffic goes. And that's the key to automatic failover.

The problem: redirecting people when something fails

Imagine you have your main system in one region and a backup in another (as in the strategies from subchapter 26.2). If the main one goes down, you need users to stop going to the main (down) system and go to the backup (healthy) one. And you need this to happen:

  • Automatically (without waiting for a human to notice and act).
  • Quickly (every minute of downtime counts).
  • In a reliable way (without sending people to a broken system).

For this, you first need to detect that the main one failed, and then redirect. Route 53 does both.

Health checks: monitoring if a system is healthy

A Route 53 health check is an automatic monitor that periodically checks if your system is responding correctly. Route 53 "asks" your system at a fixed interval (every 30 seconds by default, or every 10 seconds with the fast option): "are you okay?", and based on the responses, marks it as healthy or unhealthy.

Route 53 every X seconds:  "Main system, are you okay?"
   → responds correctly  → HEALTHY ✓   (keeps sending traffic there)
   → doesn't respond / gives errors → UNHEALTHY ✗ (stops sending traffic there)

Analogy: a health check is like taking a patient's pulse every few minutes. As long as the pulse is normal, all is well. If the pulse stops or becomes abnormal, the alarm goes off and action is taken. Route 53 "takes the pulse" of your systems continuously to know which ones are alive and healthy.

The health check can check things like: does the website respond? Does it return a correct code? Does it respond in time? You define what "healthy" means.
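
To make the idea concrete, here is a minimal Python sketch of how a health check could turn individual probe results into a healthy/unhealthy state. It is a toy model, not Route 53's implementation: the names `HealthCheck` and `probe_ok`, the 4-second timeout, and the rule "flip the state after 3 consecutive disagreeing probes" are assumptions chosen for illustration.

```python
from dataclasses import dataclass


def probe_ok(status_code: int, elapsed_s: float, timeout_s: float = 4.0) -> bool:
    """One possible definition of "healthy": a 2xx/3xx answer that arrives in time.

    The timeout value is an assumption for this sketch.
    """
    return 200 <= status_code < 400 and elapsed_s <= timeout_s


@dataclass
class HealthCheck:
    """Toy model of a health check (hypothetical class, for illustration only)."""
    failure_threshold: int = 3  # consecutive disagreeing probes needed to flip the state
    healthy: bool = True        # current state of the monitored system
    streak: int = 0             # consecutive probes that disagree with the current state

    def record(self, ok: bool) -> bool:
        """Feed one probe result and return the resulting state."""
        if ok == self.healthy:
            self.streak = 0     # probe agrees with the current state
        else:
            self.streak += 1
            if self.streak >= self.failure_threshold:
                self.healthy = ok
                self.streak = 0
        return self.healthy


# One good probe, two error responses, then one response that is too slow.
check = HealthCheck()
probes = [probe_ok(200, 0.2), probe_ok(500, 0.1), probe_ok(503, 0.1), probe_ok(200, 9.0)]
states = [check.record(p) for p in probes]
print(states)  # [True, True, True, False]
```

Requiring several consecutive failures before changing state is a deliberate choice: a single slow or failed response does not immediately trigger a switch.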

Failover: switching to the backup automatically

Here's the magic. Failover is Route 53's ability to automatically redirect traffic from the main system to the backup when the health check detects that the main one is unhealthy.

Remember the routing policies from subchapter 16.1: one of them is precisely failover. You configure Route 53 like this:

Route 53 (failover policy):
   Main:    region A   (with health check)
   Backup:  region B

   While A is HEALTHY  → all traffic goes to A
   If A becomes UNHEALTHY → Route 53 AUTOMATICALLY redirects to B
   When A is HEALTHY again → traffic goes back to A

   Normal operation:           After A fails:
   Users → [Region A ✓]        Users → [Region A ✗]──╳
                                         └──────────► [Region B ✓]
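
As a rough illustration, the failover configuration above could be described as a pair of DNS records, written here as plain Python data in the shape that boto3's `change_resource_record_sets` accepts as its `ChangeBatch` argument. Nothing is sent to AWS in this sketch. The domain name, the IP addresses (taken from documentation ranges), the 60-second TTL, and the health check ID are placeholders, and `failover_record` is a hypothetical helper.

```python
from typing import Optional


def failover_record(role: str, ip: str, health_check_id: Optional[str] = None) -> dict:
    """Build one failover record set. role is "PRIMARY" or "SECONDARY"."""
    record = {
        "Name": "www.example.com.",         # placeholder domain
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-site",
        "Failover": role,                   # marks the record as main or backup
        "TTL": 60,                          # assumed value; a short TTL shortens the switch
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        # Attach the health check that decides whether this record is served.
        record["HealthCheckId"] = health_check_id
    return record


change_batch = {
    "Changes": [
        # Main: region A, with health check (placeholder ID).
        {"Action": "UPSERT",
         "ResourceRecordSet": failover_record("PRIMARY", "192.0.2.10", "placeholder-health-check-id")},
        # Backup: region B.
        {"Action": "UPSERT",
         "ResourceRecordSet": failover_record("SECONDARY", "198.51.100.20")},
    ]
}

for change in change_batch["Changes"]:
    rrset = change["ResourceRecordSet"]
    print(rrset["Failover"], rrset["ResourceRecords"][0]["Value"])
```

In a Terraform project the same two records would normally be declared as resources rather than built by hand; the dictionary form is used here only to show which pieces of information the policy needs.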

Analogy: failover is like an emergency power generator in a hospital. As long as there's power from the grid (main system healthy), everything works normally. The instant the power goes out (main fails), a system automatically detects the outage and starts the generator (backup) in seconds, without anyone having to run and do it. The hospital keeps running without patients noticing. Health check = outage detector; failover = automatic generator startup.

How health checks and failover work together

The two are inseparable: the health check detects, the failover reacts:

HEALTH CHECK  → monitors and detects that the main system went down
        │
        ▼
FAILOVER      → automatically redirects traffic to the backup

Without the health check, Route 53 wouldn't know something failed. Without failover, knowing it failed would be useless. Together, they achieve automatic traffic recovery, which is exactly what makes DR strategies (26.2) work without human intervention.
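
The detect-then-react loop can be sketched in a few lines of Python. This is a simplified simulation under stated assumptions, not Route 53's actual algorithm: `route_over_time` is a hypothetical function, the region names are placeholders, and the state is assumed to flip after 3 consecutive probes that disagree with it.

```python
from typing import List


def route_over_time(probes: List[bool], failure_threshold: int = 3,
                    primary: str = "region-a", backup: str = "region-b") -> List[str]:
    """Return where traffic is sent after each probe of the main system."""
    healthy = True   # the main system starts out healthy
    streak = 0       # consecutive probes that disagree with the current state
    answers = []
    for ok in probes:
        # Health check: count probes that contradict the current state.
        streak = 0 if ok == healthy else streak + 1
        if streak >= failure_threshold:
            healthy, streak = ok, 0
        # Failover: answer with the main system only while it is healthy.
        answers.append(primary if healthy else backup)
    return answers


# The main system fails after the first probe and recovers four probes later.
timeline = route_over_time([True, False, False, False, False, True, True, True])
print(timeline)
# ['region-a', 'region-a', 'region-a', 'region-b', 'region-b', 'region-b', 'region-b', 'region-a']
```

The output shows both halves of the behavior: traffic moves to the backup only after the failure is confirmed, and it returns to the main system only after recovery is confirmed.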

Real-world example: a company has its main website in the Ireland region and a backup (warm standby, subchapter 26.2) in Frankfurt, with Route 53 configured for failover. One night, the Ireland region has a problem and the website stops responding. After a few consecutive failed checks (typically within a minute or two with default settings), the Route 53 health check marks Ireland as unhealthy. From then on, Route 53 answers DNS queries with Frankfurt, which was ready, and users move over as their cached DNS records expire. Customers barely notice a brief interruption. No one on the team had to wake up or do anything: the system recovered by itself. The next morning, when Ireland is restored and passes its health check again, traffic automatically returns. That's resilience done right.
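
How long such a switch takes can be estimated with simple arithmetic: the time to confirm the failure (check interval times the number of consecutive failures required) plus the time for cached DNS answers to expire (the record's TTL). The sketch below is a rough estimate only. `estimated_failover_seconds` is a hypothetical helper, the 60-second TTL is an assumed value, and real-world timing also depends on factors this formula ignores, such as how resolvers and clients cache DNS.

```python
def estimated_failover_seconds(request_interval_s: int, failure_threshold: int,
                               dns_ttl_s: int) -> int:
    """Rough estimate of the time until users reach the backup."""
    detection = request_interval_s * failure_threshold  # time to confirm the failure
    return detection + dns_ttl_s                        # plus cached answers expiring


# Standard checks: every 30 s, 3 consecutive failures, 60 s TTL (assumed).
print(estimated_failover_seconds(30, 3, 60))  # 150

# Fast checks: every 10 s, same threshold and TTL.
print(estimated_failover_seconds(10, 3, 60))  # 90
```

The two levers are visible in the formula: a shorter check interval speeds up detection, and a shorter TTL speeds up the moment users actually follow the new answer.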

Beyond failover: geographic load balancing

These same capabilities (health checks + Route 53 routing policies) are also used to distribute users between regions by proximity (remember the geolocation and latency policies from subchapter 16.1), sending each user to the closest healthy region. This way, system health is taken into account not only in emergencies, but also to provide the best service day to day.
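
The selection rule "closest healthy region" can be sketched as follows. This is an illustrative simplification, not Route 53's algorithm: `pick_region` is a hypothetical function, the latency figures are invented for the example, and returning `None` when no region is healthy is a choice made for this sketch.

```python
from typing import Dict, Optional


def pick_region(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the healthy region with the lowest latency, or None if there is none."""
    candidates = {region: ms for region, ms in latency_ms.items()
                  if healthy.get(region, False)}   # unknown regions count as unhealthy
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Invented latencies as seen by one user in Europe.
latency = {"eu-west-1": 20.0, "eu-central-1": 35.0, "us-east-1": 90.0}

print(pick_region(latency, {"eu-west-1": True, "eu-central-1": True, "us-east-1": True}))
# eu-west-1
print(pick_region(latency, {"eu-west-1": False, "eu-central-1": True, "us-east-1": True}))
# eu-central-1
```

Filtering by health first and only then ranking by latency is what keeps users away from a nearby region that is currently down.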

What you should remember

  • Route 53 (the AWS DNS, subchapter 16.1) decides where traffic goes, which is what lets it manage automatic failover.
  • A health check periodically checks if a system responds well and marks it as healthy or unhealthy. Like taking a patient's pulse continuously.
  • Failover automatically redirects traffic from the main system to the backup when the health check detects the main is unhealthy (and returns it when it recovers). Like an emergency generator that starts by itself when the power goes out.
  • They work together: the health check detects, the failover reacts. Together they achieve automatic traffic recovery, without human intervention, making DR strategies work.
  • The same capabilities are used for geographic load balancing (sending each user to the closest healthy region), not just for emergencies.

In the last subchapter of the chapter (and of Part VI) we'll see how to protect your data with centralized and automatic backups: AWS Backup.

Cloud, AWS & Terraform — From Zero to Expert

© Copyright 2024. All rights reserved