Dear reader,
Are you looking to learn Apache Kafka and leverage it for your data architecture? If so, you've come to the right place!
As an experienced solution architect who has helped dozens of enterprises adopt Kafka, I've distilled my hard-won knowledge into this comprehensive guide.
Here's what I'll cover to help you master Apache Kafka:
Part 1: Kafka Basics
- What is Apache Kafka?
- Kafka Key Capabilities
- Kafka Core Concepts
- A Typical Kafka Architecture
Part 2: Getting Started Guide
- Downloading, Installing & Configuring Kafka
- Creating Topics with Partitions
- Producing and Consuming Messages
- Kafka Client Code in Java, Python and JS
Part 3: Running Kafka in Production
- Achieving High Availability
- Ensuring Data Durability
- Monitoring, Securing & Optimizing Kafka
- Scaling Kafka Clusters
- Streaming Data Between Systems
So let's get started, shall we?
Part 1: Apache Kafka Basics
First, I'll explain what Kafka is, its key strengths, and the main abstractions that make Kafka tick. This context will help you better understand how to use Kafka later.
What is Apache Kafka?
Apache Kafka is a distributed, partitioned and replicated commit log service. It lets you publish and subscribe to streams of records and acts as a durable message broker.
In simpler terms, Kafka is a fast, scalable and durable real-time data streaming platform.
It was originally built at LinkedIn and open-sourced in 2011. It's now maintained by the Apache Software Foundation and used by thousands of companies globally, including Uber, Netflix, Cisco, Visa, Coursera and Spotify.
Some common use cases are:
- Real-time stream processing
- Messaging
- Activity tracking
- Gathering metrics & logs
- Commit logs for storage systems
Next, let's look at the main capabilities that make Kafka so useful.
Key Capabilities of Kafka
Here are some of the standout features of Apache Kafka:
- High throughput: Large Kafka deployments handle trillions of messages per day with very low latency. Benchmark tests in Confluent Cloud have shown throughput exceeding 2 million writes per second.
- Massive scale: Kafka scales horizontally to very large data volumes by distributing load across brokers and partitions. LinkedIn runs one of the biggest Kafka deployments with over 3,500 brokers.
- Fault tolerance: Data is replicated across brokers to prevent data loss. A properly replicated cluster can ride out broker failures or disk crashes with little or no downtime.
- Durability: Messages are appended to an on-disk log as soon as they are written (Kafka leans on the OS page cache and replication rather than forcing a disk flush per message). Configurable retention periods let you store data for days, weeks or years.
- Stream processing: Kafka's pub-sub semantics make it ideal for ingesting data streams and processing them in real time.
- Integration: Kafka connects easily to external systems like databases as a centralized pipeline. Over 80 open-source connectors integrate Kafka with popular technologies.
In a nutshell, Kafka gives you a scalable, durable and low-latency architecture for handling real-time data feeds.
Now that you know what Kafka promises, let's understand its fundamental concepts.
Core Concepts of Kafka
Kafka is built on three core concepts:
Publish-Subscribe Messaging
Kafka lets you publish and read streams of messages via topics. Producers write data to topics while consumers read from topics.
This decouples producers from consumers. Instead of receiving messages directly from the producer, each consumer reads from the topic at its own pace, whether in real time or in periodic batches.
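To make the decoupling concrete, here's a minimal sketch in plain Python (no Kafka involved). The TopicLog class and the consumer names are hypothetical stand-ins I made up for illustration; a real topic is partitioned and lives on brokers. The only point is that the log is append-only and every consumer keeps its own offset into it.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy append-only log standing in for a single-partition topic."""
    records: list = field(default_factory=list)

    def append(self, value):
        """Producer side: append a record and return its offset."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Consumer side: return up to max_records starting at offset."""
        return self.records[offset:offset + max_records]


log = TopicLog()
for page in ["/home", "/pricing", "/docs"]:
    log.append(page)

# Each consumer tracks its own offset, so a slow reader never blocks a fast one.
offsets = {"realtime-dashboard": 0, "nightly-batch": 0}

batch = log.read(offsets["realtime-dashboard"])
offsets["realtime-dashboard"] += len(batch)  # the dashboard is now caught up

log.append("/signup")  # the producer keeps writing regardless of readers

# The batch job reads everything later, in one go, from where it left off.
nightly = log.read(offsets["nightly-batch"])
print(nightly)
```

The producer never needs to know who is reading, which is what lets you add or remove consumers without touching the producing application.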
Data Partitioning
Topics are split into partitions for scalability. Partitions let you parallelize consumers and scale bandwidth.
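Messages within a single partition are strictly ordered. There is no ordering guarantee across partitions, so messages that must stay in sequence should share a key and therefore a partition.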
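Here's a rough sketch of how key-based partitioning keeps related messages together. It assumes a simple hash-modulo scheme with zlib.crc32 as a stand-in; Kafka's own default partitioner uses a different hash function, so the partition numbers here won't match a real cluster. The function name and the user keys are made up for illustration.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a record key to a partition; the same key always maps to the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("user-1", "view /home"),
    ("user-2", "view /docs"),
    ("user-1", "view /pricing"),
    ("user-1", "view /signup"),
]

# Append each event to the partition chosen by its key.
partitions = {p: [] for p in range(2)}
for key, value in events:
    partitions[choose_partition(key, 2)].append((key, value))

# All of user-1's events sit in one partition, in the order they were sent.
user1 = [v for records in partitions.values() for k, v in records if k == "user-1"]
print(user1)
```

Because every user-1 event hashes to the same partition, a consumer of that partition sees them in send order, even though nothing is guaranteed about how they interleave with user-2's events.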
Data Replication
Partitions are replicated across Kafka brokers (servers) to prevent data loss. The broker default replication factor is 1; production clusters commonly set it to 3.
If a broker goes down, others have a copy of the data to handle requests.
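A small sketch can show why the replication factor matters. The two helpers below are hypothetical, and the round-robin placement is a simplification of how brokers actually assign replicas. It just shows which copies remain after a broker fails.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def surviving_replicas(assignment, failed_brokers):
    """Return, per partition, the replicas that still sit on live brokers."""
    return {
        p: [b for b in replicas if b not in failed_brokers]
        for p, replicas in assignment.items()
    }


# Three partitions on three brokers, replicated three ways.
replicated = assign_replicas(num_partitions=3, brokers=[1, 2, 3], replication_factor=3)
after_failure = surviving_replicas(replicated, failed_brokers={2})
print(after_failure)  # every partition still has two live copies

# The same topic with a single copy per partition.
unreplicated = assign_replicas(num_partitions=3, brokers=[1, 2, 3], replication_factor=1)
print(surviving_replicas(unreplicated, failed_brokers={2}))  # partition 1 has none left
```

With a single replica, losing broker 2 makes partition 1 unavailable until that broker comes back.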
With these basics covered, let's look at a reference architecture.
A Typical Kafka Architecture
Here is what a typical Kafka deployment looks like:
- Multiple producers publish messages to central Kafka topics
- Massively scalable Kafka brokers store and replicate the messages
- Many consumers in a consumer group read messages in parallel
- Kafka connects easily to external systems via streams and connectors
- All client apps use Kafka's publish-subscribe messaging model
This architecture provides:
- Loose coupling between components
- Centralized data stream for the organization
- Multiple subscriber systems processing streams concurrently
- Stream analytics piped to external data stores
- Flexibility to adapt apps without affecting others
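The "many consumers in a consumer group read messages in parallel" part is worth a quick sketch. This is a simplified round-robin assignment written for illustration only. Kafka ships several assignment strategies and the real logic runs as part of the group protocol, so treat assign_partitions and the consumer names as hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread a topic's partitions over the consumers in a group, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


# Four partitions, two consumers: each consumer reads two partitions.
balanced = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
print(balanced)

# Two partitions, three consumers: one consumer is left with nothing to read.
oversubscribed = assign_partitions([0, 1], ["c1", "c2", "c3"])
print(oversubscribed)
```

Each partition is read by only one consumer in a group, so the partition count caps how many consumers can usefully work in parallel.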
Now that you grok Kafka's basics, let's get our hands dirty!
Part 2: Getting Started Guide
In this section, I'll show you how to:
- Install and configure a basic Kafka server
- Create topics with partitions
- Produce and consume test messages with Kafka
- Write client apps using Kafka APIs for Java, Python and JS
Let's get cracking!
Downloading, Installing & Configuring Kafka
First, download a Kafka release from kafka.apache.org. This guide uses version 3.3.1:
$ wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.13-3.3.1.tgz
$ tar -xzf kafka_2.13-3.3.1.tgz
$ cd kafka_2.13-3.3.1
Now start the ZooKeeper server which Kafka uses for coordination:
$ bin/zookeeper-server-start.sh config/zookeeper.properties
Next, start the Kafka broker in another terminal:
$ bin/kafka-server-start.sh config/server.properties
Verify your installation by listing the topics on the broker (the list is empty on a fresh install):
$ bin/kafka-topics.sh --list --bootstrap-server localhost:9092
And that's it! You now have a basic Kafka server running locally.
Time to create some topics next.
Creating Topics with Partitions
Let's create a topic named pageviews with two partitions:
$ bin/kafka-topics.sh --create \
    --topic pageviews \
    --partitions 2 \
    --replication-factor 1 \
    --bootstrap-server localhost:9092
This topic will store page view data for usage analytics.
Now let's try producing and consuming some test messages.
Producing and Consuming Messages
Kafka comes with command line tools that function as producers and consumers.
To produce messages:
$ bin/kafka-console-producer.sh --topic pageviews --bootstrap-server localhost:9092
> This is my first message!
> Here's a second message
To subscribe and consume messages:
$ bin/kafka-console-consumer.sh --topic pageviews --from-beginning --bootstrap-server localhost:9092
This is my first message!
Here's a second message
And just like that, you've produced and consumed messages from Kafka!
While the console apps are great for testing, let's look at how to interact with Kafka programmatically.
Kafka Client Code in Java, Python and JS
Well-documented Kafka clients exist for most popular languages. Let me show you examples in Java, Python and Node.js.
Kafka Producer & Consumer in Java
To produce messages:
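// Imports needed by this snippet
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
// Producer properties
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// Send message (no key is set, so the producer picks the partition)
ProducerRecord<String, String> record = new ProducerRecord<>("pageviews", "New page view!");
producer.send(record);
// send() is asynchronous; close() flushes buffered records before returning
producer.close();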
To consume messages:
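// Needs java.time.Duration, java.util.Arrays, java.util.Properties and org.apache.kafka.clients.consumer.*
// Consumers need deserializers and a group.id ("pageviews-readers" is just an example name)
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "pageviews-readers");
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(Arrays.asList("pageviews"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records)
        System.out.println(record.value());
}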
And that's the crux of Kafka producers and consumers in Java!
Kafka Client for Python
Let's look at the same thing in Python.
First, install the kafka-python client:
pip install kafka-python
To produce:
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('pageviews', b'New entry!')
# send() is asynchronous; flush() blocks until buffered messages are delivered
producer.flush()
To consume:
from kafka import KafkaConsumer
consumer = KafkaConsumer(
    'pageviews',
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest'
)
for message in consumer:
    print(message.value)
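Real page view events are usually structured rather than raw byte strings, so here's a hedged sketch of a JSON round trip. The two helper functions and the event fields are my own illustration, not part of any Kafka library. kafka-python lets you pass callables like these as value_serializer on the producer and value_deserializer on the consumer, which is shown only as a comment so the snippet runs without a broker.

```python
import json


def serialize_pageview(event: dict) -> bytes:
    """Turn an event dict into UTF-8 encoded JSON bytes for the message value."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def deserialize_pageview(raw: bytes) -> dict:
    """Turn message value bytes back into an event dict."""
    return json.loads(raw.decode("utf-8"))


event = {"user": "user-1", "url": "/pricing"}  # hypothetical event shape
raw = serialize_pageview(event)
restored = deserialize_pageview(raw)
print(restored == event)

# With a running broker, these would plug in roughly like this:
#   KafkaProducer(bootstrap_servers='localhost:9092', value_serializer=serialize_pageview)
#   KafkaConsumer('pageviews', value_deserializer=deserialize_pageview)
```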
And you're done with Python!
Kafka Client in Node.js
If you use Node.js, install the kafkajs library:
npm install kafkajs
To produce:
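const { Kafka } = require('kafkajs')
// clientId is an arbitrary label for this app; brokers lists the bootstrap servers
const kafka = new Kafka({ clientId: 'pageviews-app', brokers: ['localhost:9092'] })
const producer = kafka.producer()
async function produce() {
  await producer.connect()
  await producer.send({
    topic: 'pageviews',
    messages: [
      { value: 'New page view' }
    ],
  })
  await producer.disconnect()
}
produce().catch(console.error)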
To consume:
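const consumer = kafka.consumer({ groupId: 'test' })
// await is only valid inside an async function in CommonJS modules
async function consume() {
  await consumer.connect()
  await consumer.subscribe({ topic: 'pageviews', fromBeginning: true })
  await consumer.run({
    eachMessage: async ({ message }) => {
      // message.value is a Buffer, so convert it to a string before printing
      console.log(message.value.toString())
    },
  })
}
consume().catch(console.error)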
And there you have it! You can now use Kafka programmatically in your Java, Python or Node.js applications.
Let's now shift gears and explore running Kafka in production.
Part 3: Running Kafka in Production
While a local Kafka server is great for development, running Kafka in production requires some additional steps to ensure high reliability.
Here are the key aspects I'll cover:
- Achieving high availability (HA)
- Ensuring data durability
- Monitoring, securing & optimizing Kafka
- Horizontally scaling your clusters
- Streaming data between systems with Kafka
Let's get started!
Achieving High Availability in Kafka
The key to availability in Kafka lies in replication across brokers. By having multiple brokers contain copies of your data, your cluster stays operational even if some nodes fail.
Here are the best practices to follow:
1. Distributed Cluster
Run at least 3 Kafka brokers to avoid a single point of failure. Kafka spreads partition replicas across them.
2. Replication Factor
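When creating topics, set the replication factor to 3 explicitly (the broker default is 1). Each partition then has three copies on different brokers. Combined with min.insync.replicas=2 and producers using acks=all, the topic keeps accepting writes after losing any single broker.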
3. Multiple Datacenters
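Spread brokers across 2 or more datacenters or availability zones (the broker.rack setting makes replica placement rack-aware), or mirror data to a second cluster in another datacenter. This prevents a single site outage from taking down your Kafka data.
Following these practices keeps your Kafka data highly available to consuming applications.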
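Here's a back-of-the-envelope sketch of how many broker failures a topic can absorb. It assumes producers use acks=all, so a write needs at least min.insync.replicas in-sync copies, and that one surviving in-sync replica is enough to keep the data. The function name and the returned keys are my own, not a Kafka API.

```python
def tolerated_failures(replication_factor: int, min_insync_replicas: int) -> dict:
    """Count the replica losses a partition can absorb, assuming acks=all producers."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min_insync_replicas must be between 1 and replication_factor")
    return {
        # Writes succeed while at least min_insync_replicas copies are in sync.
        "writes_keep_working": replication_factor - min_insync_replicas,
        # Data survives as long as one in-sync copy is left.
        "data_survives": replication_factor - 1,
    }


recommended = tolerated_failures(replication_factor=3, min_insync_replicas=2)
print(recommended)

single_copy = tolerated_failures(replication_factor=1, min_insync_replicas=1)
print(single_copy)
```

With a replication factor of 3 and min.insync.replicas of 2, one broker can fail without interrupting writes, and the data itself survives two failures. A single copy tolerates nothing.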
Ensuring Data Durability
Kafka lets you store streams of data durably by configuring retention policies.
Topics can be configured to retain messages for days, weeks or even years. Once published, data is appended to the partition's log on the file system and indexed by offset for retrieval.
Some key mechanisms Kafka uses to provide durability:
- Messages are appended to the leader broker's log as soon as they arrive
- Configurable retention periods from hours to years
- No loss of acknowledged messages as long as at least one in-sync replica survives (with producers using acks=all)
- Checkpointed offsets for consumers act as cursors into data
By combining retention policies with replication, Kafka gives you durable streams of data.
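To show how time-based retention plays out, here's a simplified sketch. It assumes the log is cut into segments and that a closed segment becomes eligible for deletion once its newest record is older than the retention window. The Segment class, the helper and the timestamps are hypothetical, and real brokers apply more rules than this.

```python
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000


@dataclass
class Segment:
    base_offset: int          # offset of the first record in the segment
    newest_timestamp_ms: int  # timestamp of the newest record in the segment


def expired_segments(segments, now_ms, retention_ms):
    """Return closed segments whose newest record is older than the retention window."""
    closed = segments[:-1]  # the last segment is still being written, so keep it
    return [s for s in closed if now_ms - s.newest_timestamp_ms > retention_ms]


segments = [
    Segment(base_offset=0, newest_timestamp_ms=1 * DAY_MS),
    Segment(base_offset=1000, newest_timestamp_ms=5 * DAY_MS),
    Segment(base_offset=2000, newest_timestamp_ms=9 * DAY_MS),  # active segment
]

# "Now" is day 10 and the topic keeps 7 days of data.
to_delete = expired_segments(segments, now_ms=10 * DAY_MS, retention_ms=7 * DAY_MS)
print([s.base_offset for s in to_delete])
```

Only the first segment is past the 7-day window here. Data is removed a segment at a time, so some records can outlive the configured retention period by a while.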
An analysis from Confluent reported that Kafka had only 2.7 minutes of downtime across 350 months, which is incredible for the amount of data flowing through!
Next up, let's discuss monitoring and securing Kafka deployments.
Monitoring, Securing & Optimizing Kafka
When using Kafka in production, you need visibility into cluster health, early warning of issues, and the ability to tune Kafka.
Here are some best practices:
Monitoring
- Collect metrics on throughput, latency, disk usage, consumer lag, request rates etc.
- Graph metrics using tools like Grafana to spot trends
- Get alerts for warning thresholds on key metrics
- Use Kafka tools for consumer group analysis
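Consumer lag is the metric I'd watch first, so here's a minimal sketch of how it's computed. The offsets and the alert threshold are made-up numbers. In practice you would read the end and committed offsets from your monitoring stack or from the LAG column of kafka-consumer-groups.sh --describe.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: how far the group's committed offset trails the log end."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical snapshot for a two-partition topic.
log_end = {0: 1500, 1: 1480}
committed = {0: 1500, 1: 1200}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())

LAG_ALERT_THRESHOLD = 250  # assumed threshold; pick one that suits your workload
should_alert = total_lag > LAG_ALERT_THRESHOLD
print(lag, total_lag, should_alert)
```

A lag that keeps growing means consumers can't keep up with producers, which is usually the cue to add consumers or partitions, or to speed up processing.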
Security
- Encrypt data in transit via TLS (called SSL in Kafka's configuration)
- Authenticate clients via SASL or mutual TLS
- Authorize operations using ACLs for access control
- Protect Kafka via firewalls and limited endpoints
- Validate messages against a schema for integrity
Optimization
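- Size partition counts from your target throughput and consumer parallelism; 12-24 partitions per topic is a reasonable starting point
- Use Snappy compression (compression.type=snappy on the producer) to reduce network IO
- Limit how many records each consumer poll returns (max.poll.records) and how long processing may take between polls (max.poll.interval.ms)
- Tune fetch sizes (fetch.min.bytes, fetch.max.bytes) and the timeouts that flag slow consumers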
- Provision adequate network, memory and disk
Getting monitoring, security and tuning right ensures Kafka keeps performing smoothly even under load.
When you do need to scale, Kafka makes that easy too.
Horizontally Scaling Kafka
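One of Kafka's major advantages is near-linear scalability: you scale out by adding more brokers. Note that Kafka does not move existing partitions onto new brokers by itself. New topics can land on them right away, but existing partitions must be rebalanced with the kafka-reassign-partitions.sh tool or an automation layer built on it.
When adding nodes, it's best to go one node at a time rather than doubling the cluster size. This keeps the partition reassignment traffic manageable.
As a rough rule of thumb, a single partition can sustain 50-100 MB/s of writes on well-provisioned hardware, though the real figure depends on message size, replication and disks. Measure your own number and use it to right-size your brokers' disk and network capacity for your load.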
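Here's a quick sizing sketch built on that per-partition figure. It assumes you've measured, or are willing to guess, the per-partition throughput on both the producing and the consuming side. The numbers are illustrative, not benchmarks, and partitions_needed is a hypothetical helper.

```python
import math


def partitions_needed(target_mb_s: float,
                      producer_mb_s_per_partition: float,
                      consumer_mb_s_per_partition: float) -> int:
    """Partitions needed so that neither producers nor consumers become the bottleneck."""
    for_producers = math.ceil(target_mb_s / producer_mb_s_per_partition)
    for_consumers = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(for_producers, for_consumers)


# Target: 600 MB/s. Assume each partition takes 50 MB/s of writes,
# but one consumer only processes 25 MB/s from a partition.
count = partitions_needed(600, producer_mb_s_per_partition=50,
                          consumer_mb_s_per_partition=25)
print(count)
```

The slower side drives the answer. Here the consumers need 24 partitions even though the producers alone would be fine with 12.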
With horizontal scaling, you can grow Kafka to handle pretty much any scale required!
Streaming Data Between Systems
A common use case is using Kafka to integrate disparate systems.
Kafka acts as the central conduit connecting event producers to the various applications that require the events.
Some examples:
- Stream app logs, metrics and clickstream data
- Sync product changes to ecommerce systems
- Notify cache layers about data updates
- Collect IoT telemetry data for analytics
- Queue transaction requests before writing
This streaming design pattern powers data pipelines and is a popular architectural style today.
Wrapping Up
And that brings us to the end of our in-depth guide!
Here's a quick recap of what we covered:
Kafka Basics
- Kafka's publish-subscribe messaging model
- Core concepts like topics, partitions, brokers
- Kafka reference architecture
Getting Started
- Downloading, installing Kafka
- Creating topics, producing and consuming messages
- Writing Kafka client code in Java, Python, JS
Running in Production
- Achieving high availability
- Ensuring durable data retention
- Monitoring, securing and tuning Kafka
- Horizontally scaling your clusters
- Building data pipelines between systems
I hope you found this guide useful for gaining expertise with Apache Kafka!
It distills all my real-world experience as a solution architect into actionable insights you can apply.
You're now equipped to:
- Set up your own Kafka environments
- Build high-performance production deployments
- Stream data between all your applications
If you have any other questions, feel free to reach out!
Happy data streaming.