In the previous subchapter, we built a data lake to analyze stored data. But much data arrives in real time, continuously: clicks on a website as users browse, sensor readings every second, transactions as they happen... How do you capture and process that continuous stream of data in the moment, without losing anything? That's the job of Amazon Kinesis, AWS's service for real-time (streaming) data. We'll look at its two main components: Kinesis Data Streams and Kinesis Data Firehose.

The problem: data arriving non-stop, right now

Some data doesn't arrive "every now and then" in files, but as a continuous stream that never stops:

Examples of real-time (streaming) data:
   - Clicks and navigation from thousands of users on a website (every second)
   - IoT sensor readings (temperature, GPS...) every second
   - Financial transactions as they occur
   - Events from a live application

Processing this presents challenges: it arrives constantly and in large volume, you can't lose data, and often you want to react instantly (detect fraud while it's happening, not the next day). You need something capable of capturing and moving that continuous stream reliably and at scale.

What is stream processing

Real-time data processing (streaming) consists of capturing and processing data as it is generated, continuously, instead of waiting to have a large batch and then processing it (which would be "batch" processing).

   Batch processing:    you wait → gather lots of data → process (later)
   Stream processing:   data arrives → you process it NOW (instantly)
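
The contrast above can be sketched in a few lines of Python. This is only an illustration, not Kinesis code: the function names and the sample temperature readings are made up for the example. The batch version waits for all the data before producing one result, while the streaming version produces an updated result as each reading arrives.

```python
from typing import Iterable, Iterator


def process_batch(readings: Iterable[float]) -> float:
    """Batch: gather every reading first, then compute the average once."""
    collected = list(readings)            # wait until all the data is in
    return sum(collected) / len(collected)


def process_stream(readings: Iterable[float]) -> Iterator[float]:
    """Streaming: update the running average as each reading arrives."""
    total = 0.0
    count = 0
    for value in readings:                # handle each reading immediately
        total += value
        count += 1
        yield total / count               # an up-to-date result after every reading


sensor_readings = [20.0, 22.0, 24.0]      # hypothetical temperature readings

batch_result = process_batch(sensor_readings)
stream_results = list(process_stream(sensor_readings))
```

Both approaches end at the same average, but the streaming version already had a usable answer after the first reading.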

Analogy: the difference is like that between a dam and a river. Batch processing is like a dam: you accumulate water and release it all at once every so often. Streaming is like a river that flows non-stop: the water (the data) passes by continuously and you use it as it flows. Kinesis is the "channel" built to carry that river of data without overflowing or losing anything.

What is Kinesis

Amazon Kinesis is the AWS family of services to capture, process, and analyze real-time (streaming) data at scale. It allows you to reliably ingest huge streams of continuous data. It has several components; we'll look at the two main ones.

Kinesis Data Streams: the real-time stream

Kinesis Data Streams captures a continuous stream of real-time data and makes it available for your applications to process instantly. Data enters the "stream" and your consumers (for example, Lambda functions; remember from subchapter 14.2 that Kinesis can be a Lambda trigger) read and process it in the moment.

   Producers (web, sensors...) → Kinesis Data Streams → Consumers
   send data non-stop            (the live stream)      process INSTANTLY
                                                        (Lambda, analytics...)

  • Use case: when you need to react in real time to data (detect fraud instantly, alert on a sensor anomaly, update a live dashboard).
  • Key: data is available to be processed immediately, with minimal latency.
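
On the producer side, sending one event to a stream might look like the sketch below. It assumes the boto3 Kinesis client and its put_record call; the stream name "clickstream", the event fields, and the FakeKinesisClient stand-in are hypothetical and exist only so the example runs without an AWS account.

```python
import json


def send_event(kinesis_client, stream_name: str, event: dict) -> dict:
    """Put one event on a Kinesis data stream."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),   # the payload is sent as bytes
        PartitionKey=str(event["user_id"]),       # records sharing a key land on the same shard
    )


class FakeKinesisClient:
    """Stand-in used here so the sketch runs without AWS credentials."""

    def __init__(self):
        self.records = []

    def put_record(self, **kwargs):
        self.records.append(kwargs)               # remember what would have been sent
        return {"SequenceNumber": str(len(self.records))}


# With real AWS access you would instead create the client like this:
#   import boto3
#   client = boto3.client("kinesis")
client = FakeKinesisClient()
response = send_event(client, "clickstream", {"user_id": 42, "page": "/home"})
```

Passing the client in as a parameter keeps the function easy to try locally and to test.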

Analogy: Kinesis Data Streams is like a live conveyor belt where data passes by, and your workers (applications) pick it up and process it as it passes, without waiting. Ideal when every piece of data matters now.
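
A "worker" on that belt could be a Lambda function triggered by the stream. The sketch below assumes the usual shape of the event Lambda receives from Kinesis, where each record's payload arrives base64-encoded under record["kinesis"]["data"]. The sensor fields, the temperature limit, and the make_record helper are hypothetical, added only for local experimentation.

```python
import base64
import json

TEMPERATURE_LIMIT = 80.0   # hypothetical alert threshold


def handler(event, context):
    """Lambda entry point: called with a batch of records read from the stream."""
    alerts = []
    for record in event["Records"]:
        # Each payload reaches Lambda base64-encoded, so decode it first.
        payload = base64.b64decode(record["kinesis"]["data"])
        reading = json.loads(payload)
        if reading["temperature"] > TEMPERATURE_LIMIT:
            alerts.append(reading["sensor_id"])   # react as soon as the anomaly shows up
    return {"alerts": alerts}


def make_record(reading: dict) -> dict:
    """Build a record shaped like the ones Lambda receives (local testing only)."""
    data = base64.b64encode(json.dumps(reading).encode("utf-8")).decode("ascii")
    return {"kinesis": {"data": data}}


sample_event = {
    "Records": [
        make_record({"sensor_id": "s1", "temperature": 21.5}),
        make_record({"sensor_id": "s2", "temperature": 93.0}),
    ]
}
result = handler(sample_event, None)
```

Here only sensor s2 exceeds the limit, so it is the only one reported.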

Kinesis Data Firehose: loading the stream into a destination

Kinesis Data Firehose (renamed Amazon Data Firehose in 2024) focuses on something different: collecting a stream of data and automatically delivering it to a storage or analytics destination (such as S3, your data lake from subchapter 29.1, or Redshift), without you having to write any code to manage the delivery. It's the simplest way to load streaming data into a place where you can store or analyze it.

   Producers       → Kinesis Data Firehose → automatically delivers to S3 / Redshift / ...
   continuous data   (collects and loads)    (your data lake, data warehouse...)

  • Use case: when you want to send a stream of data to your data lake (S3) or another destination automatically and easily, without needing to process it instantly.
  • Key: it's fully managed and very easy: you configure the source and destination, and Firehose takes care of moving the data (it can even transform or batch it along the way).
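
From the producer's point of view, using Firehose can be as small as the sketch below. It assumes the boto3 Firehose client and its put_record call; the delivery stream name "events-to-s3", the event fields, and the FakeFirehoseClient stand-in are hypothetical and only there so the example runs without AWS access.

```python
import json


def deliver_event(firehose_client, delivery_stream: str, event: dict) -> dict:
    """Hand one event to Firehose, which then takes care of delivering it."""
    line = json.dumps(event) + "\n"    # newline so each event ends up on its own line
    return firehose_client.put_record(
        DeliveryStreamName=delivery_stream,
        Record={"Data": line.encode("utf-8")},
    )


class FakeFirehoseClient:
    """Stand-in used here so the sketch runs without AWS credentials."""

    def __init__(self):
        self.records = []

    def put_record(self, **kwargs):
        self.records.append(kwargs)    # remember what would have been sent
        return {"RecordId": str(len(self.records))}


# With real AWS access you would instead create the client like this:
#   import boto3
#   client = boto3.client("firehose")
client = FakeFirehoseClient()
response = deliver_event(client, "events-to-s3", {"player": "p1", "action": "jump"})
```

Notice that there is no consumer code at all: where the data ends up is part of the delivery stream's configuration, not of the program.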

Analogy: if Data Streams is a live conveyor belt, Firehose (whose name means "fire hose") is like a hose that channels the stream of data directly into the tank (S3). You don't worry about managing the belt or the workers: you just connect the hose to the tank and the data flows there automatically.

Streams vs Firehose: when to use each

                Kinesis Data Streams               Kinesis Data Firehose
   Use case     Process the stream in real time    Deliver the stream to a destination (S3, etc.)
   Reaction     Immediate (process instantly)      Not immediate (loads data for later)
   Management   You program the consumers          Fully managed (just configure)
   Ideal for    Fraud detection, live alerts       Filling the data lake with streaming data

💡 Rule of thumb: if you need to react instantly to the data, use Data Streams. If you just want to send streaming data to a place (like your data lake in S3) easily and automatically, use Firehose. They're often used together: Streams to react live and Firehose to archive the same stream in S3.

How it connects with the data lake

Kinesis is often the entry point for real-time data into the data lake from subchapter 29.1:

   Real-time data → Kinesis Data Firehose → S3 (data lake)
                                            │
                                            Glue catalogs, Athena queries
   → streaming data ends up being analyzable along with the rest

Thus, data that arrives continuously ends up in your data lake, ready to be analyzed along with the rest. Streaming (Kinesis) and data lake (S3+Glue+Athena) combine into a complete data platform.
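
To picture where the delivered data lands, the sketch below computes the folder a record would go to, assuming Firehose's default behaviour of prefixing S3 objects with the UTC arrival time in the form YYYY/MM/dd/HH/. The function name and the sample timestamp are made up for the illustration, and a custom prefix configured on the delivery stream would change the layout.

```python
from datetime import datetime, timezone


def default_s3_prefix(arrival_time: datetime) -> str:
    """Return the YYYY/MM/dd/HH/ folder for a record's arrival time, in UTC."""
    return arrival_time.astimezone(timezone.utc).strftime("%Y/%m/%d/%H/")


arrival = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)   # hypothetical arrival time
prefix = default_s3_prefix(arrival)
```

A time-based layout like this is convenient later on, because queries over a given day or hour only need to read the matching folders.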

Real-world example: an online gaming platform wants to analyze player behavior in real time and also store it for later analysis. They use Kinesis Data Streams to capture every player action (millions per minute) and process them instantly with Lambda functions that, for example, detect cheating or adjust difficulty live. At the same time, they use Kinesis Data Firehose to dump that same stream of events into S3 (their data lake), where they later analyze it with Athena to understand long-term trends. Streaming to react now, data lake to understand the history: the best of both worlds.
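
As an idea of what "detect cheating" could mean inside such a consumer, here is a minimal sketch of one possible rule: flag a player who performs more actions than a limit within a short time window. The rule, the limit, the window, and all names are assumptions made for the illustration, not something the platform in the example is known to use.

```python
from collections import defaultdict, deque

MAX_ACTIONS = 3         # hypothetical limit of actions per window
WINDOW_SECONDS = 1.0    # hypothetical window length


class RateDetector:
    """Flags players who act more often than the limit allows."""

    def __init__(self, max_actions=MAX_ACTIONS, window_seconds=WINDOW_SECONDS):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)    # player_id -> timestamps inside the window

    def observe(self, player_id: str, timestamp: float) -> bool:
        """Record one action; return True if the player now exceeds the limit."""
        times = self.recent[player_id]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_actions


detector = RateDetector()
actions = [("p1", 0.0), ("p1", 0.2), ("p2", 0.3), ("p1", 0.4), ("p1", 0.5), ("p1", 2.0)]
flags = [detector.observe(player, ts) for player, ts in actions]
```

Only the fifth action is flagged: it is the fourth one by p1 within a second. The check runs per event as it arrives, which is exactly what the streaming side is for, while the archived copy in S3 serves the long-term analysis.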

What you should remember

  • Much data arrives in real time, continuously (clicks, sensors, transactions); processing it requires capturing that continuous stream without losing anything, often to react instantly.
  • Stream processing handles data as it is generated (like a flowing river), versus batch processing (like a dam that accumulates and releases).
  • Amazon Kinesis captures, processes, and analyzes real-time data at scale. Two main components:
    • Kinesis Data Streams: captures a live stream to process it instantly (react in real time: fraud, alerts). Like a live conveyor belt.
    • Kinesis Data Firehose: collects a stream and automatically delivers it to a destination (S3, Redshift...), fully managed. Like a hose to the tank.
  • 💡 Data Streams to react instantly; Firehose to send data to a destination easily. They're often used together.
  • Kinesis is the entry point for real-time data into the data lake (S3), combining streaming and historical data.

In the next subchapter, we'll look at the other major pillar of analytics: the data warehouse optimized for large-scale queries, Redshift.

Cloud, AWS & Terraform — From Zero to Expert

© Copyright 2024. All rights reserved