Logs and metrics (subchapters 24.1 and 24.2) are great when your application is a single piece. But modern architectures are made up of many services that collaborate: a request goes through a load balancer, then a Lambda that calls another Lambda, which queries a database and writes to a queue... When something is slow or fails, where in the journey is the problem? Distributed tracing exists to answer that question, and in AWS the tool is X-Ray.

The Problem: A Request’s Journey Through Many Services

Remember the microservices and decoupled architectures we’ve seen (Lambda in Chapter 14, messaging in Chapter 15, containers in Chapter 17). A single user request can go through many components:

User → API Gateway → Lambda A → Lambda B → Database
                              └──→ SQS Queue → Lambda C

If that request takes 5 seconds (too long), where is the slowness? In Lambda A? In the database? In Lambda B? With isolated logs from each service, it’s very difficult to reconstruct the complete journey and see where time is lost. You need to follow the trail of that specific request through the whole system.
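
One way to picture "following the trail" is a shared identifier that every service attaches to its log records. The sketch below is only an illustration under stated assumptions, not X-Ray's actual mechanism: the names `new_trace_id`, `log_event`, and `journey` are hypothetical, the services are simulated as plain functions, and the log store is a single in-memory list.

```python
import uuid

# Hypothetical shared log store; in a real system each service logs separately.
LOGS = []

def new_trace_id() -> str:
    """Create one identifier per request at the system's entry point."""
    return uuid.uuid4().hex

def log_event(service: str, trace_id: str, message: str) -> None:
    """Record a log line tagged with the ID of the request being served."""
    LOGS.append({"service": service, "trace_id": trace_id, "message": message})

def lambda_b(trace_id: str) -> None:
    log_event("Lambda B", trace_id, "queried the database")

def lambda_a(trace_id: str) -> None:
    log_event("Lambda A", trace_id, "received the request")
    lambda_b(trace_id)  # the ID travels with the call

def journey(trace_id: str) -> list:
    """Reconstruct which services one specific request passed through."""
    return [rec["service"] for rec in LOGS if rec["trace_id"] == trace_id]

# Two separate requests interleave in the same log store.
first = new_trace_id()
second = new_trace_id()
lambda_a(first)
lambda_a(second)

print(journey(first))  # ['Lambda A', 'Lambda B']
```

Because every record carries the same identifier, the journey of one request can be separated from all the others, which is the part that isolated logs cannot do on their own.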

What Distributed Tracing Is

Distributed tracing consists of following a request through all the services it passes through, measuring how long it takes in each one. The result is a trace: the complete map of that request’s journey, with the times for each stage.

Trace of a request (how long it took in each part):
   API Gateway   ▕█▏           20 ms
   Lambda A      ▕███▏         80 ms
   Lambda B      ▕██▏          50 ms
   Database      ▕██████████▏ 4,500 ms  ← here’s the problem!
   ──────────────────────────────────
   TOTAL: ~4,650 ms
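
The arithmetic behind the example trace above can be sketched in a few lines. This is a simplified model, assuming the stages run one after another with no overlap (in a real trace, a caller's time usually includes the time of the services it calls); the `Span` class and the helper names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Span:
    """One stage of a request's journey (hypothetical, simplified model)."""
    service: str
    duration_ms: int

# Durations taken from the example trace above.
trace = [
    Span("API Gateway", 20),
    Span("Lambda A", 80),
    Span("Lambda B", 50),
    Span("Database", 4500),
]

def total_ms(spans):
    """Total time, assuming the stages are sequential and do not overlap."""
    return sum(span.duration_ms for span in spans)

def bottleneck(spans):
    """Return the stage where the most time is spent."""
    return max(spans, key=lambda span: span.duration_ms)

total = total_ms(trace)
slowest = bottleneck(trace)
share = slowest.duration_ms / total

print(f"{slowest.service}: {slowest.duration_ms} ms ({share:.0%} of {total} ms)")
# Database: 4500 ms (97% of 4650 ms)
```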

Analogy: distributed tracing is like tracking a package you send by courier. You don’t just know it took 3 days: you see each stage of the journey—“picked up at origin (1h), at logistics center A (2 days ⚠️), out for delivery (3h), delivered”—and you discover exactly where it got stuck. Without that tracking, you’d only know it took a long time, without knowing why.

What X-Ray Is

AWS X-Ray is AWS’s distributed tracing service. It follows requests through your services (Lambda, API Gateway, ECS, etc.) and shows you:

  • A service map: a visual diagram of how your components connect and how requests flow between them.
  • The detailed traces: the journey of each request, with the time spent in each service.
  • Where the bottlenecks and errors are: which part is slow or failing.

X-Ray service map:

   [API Gateway] ──► [Lambda A] ──► [Database] 🔴 slow
                          └──────► [Lambda B] ✓

X-Ray colors each service on the map according to the outcome of its calls (green for successful calls, red for server faults, yellow for client errors, purple for throttling), so at a glance you see where to look.
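
To see how a map like this can be derived from trace data, here is a minimal sketch. It is not X-Ray's real data model or API: the record fields (`id`, `parent`, `service`, `duration_ms`) and the `SLOW_MS` threshold are assumptions made for illustration.

```python
from collections import Counter

SLOW_MS = 1000  # hypothetical threshold for flagging a service as slow

# Hypothetical, simplified trace records: each one names the record that called it.
segments = [
    {"id": "s1", "parent": None, "service": "API Gateway", "duration_ms": 20},
    {"id": "s2", "parent": "s1", "service": "Lambda A", "duration_ms": 80},
    {"id": "s3", "parent": "s2", "service": "Database", "duration_ms": 4500},
    {"id": "s4", "parent": "s2", "service": "Lambda B", "duration_ms": 50},
]

def service_edges(records):
    """Count caller -> callee links by following each record's parent."""
    by_id = {rec["id"]: rec for rec in records}
    edges = Counter()
    for rec in records:
        parent = by_id.get(rec["parent"])
        if parent is not None:
            edges[(parent["service"], rec["service"])] += 1
    return edges

def slow_services(records):
    """Services whose own recorded duration exceeds the threshold."""
    return {rec["service"] for rec in records if rec["duration_ms"] > SLOW_MS}

edges = service_edges(segments)
slow = slow_services(segments)

for (caller, callee), calls in edges.items():
    flag = " (slow)" if callee in slow else ""
    print(f"[{caller}] -> [{callee}]{flag}  calls: {calls}")
```

Aggregating many traces this way is what lets a map show dependencies nobody wrote down: the edges come from observed calls, not from a diagram someone drew.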

What X-Ray Is For

  • Find bottlenecks: see exactly which service is making a request slow (like the database in the example).
  • Locate errors: see at what point in the journey a failure occurs.
  • Understand your architecture: the service map shows how your components are really connected (sometimes it’s surprising to see dependencies you didn’t remember).
  • Optimize performance: measure and improve the slow parts with concrete data, not guesswork.

Real-world example: users of a booking application complain that “the confirmation page is very slow.” The team enables X-Ray. The trace reveals that the request goes through four services and that 90% of the time is spent on a call to an external payment service that responds slowly. The problem wasn’t in their code but in an external dependency. With that information, they return an “in process” response immediately and confirm the payment in the background, and the page becomes fast again. Without X-Ray, they could have spent days looking for the problem in the wrong place.

X-Ray vs. Logs and Metrics

All three complement each other and answer different questions:

Tool                     Question it answers
──────────────────────────────────────────────────────────────────────────────
Metrics (24.1)           How much? (CPU, errors, total latency)
Logs (24.1)              What exactly happened in a service? (the detail)
Traces / X-Ray (this)    Where did the request go and where did it slow down?

Metrics, logs, and traces are the three pillars of observability. Metrics alert you that something is generally wrong, traces tell you in which service along the journey the problem is, and the logs from that service give you the detail of the cause.

What You Should Remember

  • In architectures with many services, a request passes through several components, and isolated logs alone make it difficult to locate a slowdown or an error.
  • Distributed tracing follows a request through every service it passes through, measuring the time spent in each. The result is a trace (the journey map), much like tracking a package.
  • AWS X-Ray is AWS’s distributed tracing service: it offers a visual service map, detailed traces with times per stage, and marks bottlenecks and errors.
  • It’s used to find bottlenecks, locate errors, understand your real architecture, and optimize performance with data.
  • Metrics (how much), logs (what/detail), and traces (where/where it slows down) are the three pillars of observability and complement each other.

In the next subchapter, we’ll look at an open standard that unifies logs, metrics, and traces without tying you to a provider: OpenTelemetry.

Cloud, AWS & Terraform — From Zero to Expert

© Copyright 2024. All rights reserved