Apache Kafka: An In-Depth Guide for Developers

Dear reader,

Are you looking to learn Apache Kafka and leverage it for your data architecture? If so, you've come to the right place!

As an experienced solution architect who has helped dozens of enterprises adopt Kafka, I've distilled my hard-won knowledge into this comprehensive guide.

Here's what I'll cover to help you master Apache Kafka:

Part 1: Kafka Basics

What is Apache Kafka?
Kafka Key Capabilities
Kafka Core Concepts
A Typical Kafka Architecture

Part 2: Getting Started Guide

Downloading, Installing & Configuring Kafka
Creating Topics with Partitions
Producing and Consuming Messages
Kafka Client Code in Java, Python and JS

Part 3: Running Kafka in Production

Achieving High Availability
Ensuring Data Durability
Monitoring, Securing & Optimizing Kafka
Scaling Kafka Clusters
Streaming Data Between Systems

So let's get started, shall we?

Part 1: Apache Kafka Basics

First, I'll explain what Kafka is, its key strengths, and the main abstractions that make Kafka tick. This context will help you better understand how to use Kafka later.

What is Apache Kafka?

Apache Kafka is a distributed, partitioned and replicated commit log service. It lets you publish and subscribe to streams of records and acts as a durable message broker.

In simpler terms, Kafka is a fast, scalable and durable real-time data streaming platform.

It was originally built by LinkedIn and later open-sourced in 2011. Now it's maintained by the Apache Software Foundation and used by thousands of companies globally including Uber, Netflix, Cisco, Visa, Coursera and Spotify.

Some common use cases are:

Real-time stream processing
Messaging
Activity tracking
Gathering metrics & logs
Commit logs for storage systems

Next, let's understand Kafka's main capabilities that make it so useful.

Key Capabilities of Kafka

Here are some of the standout features of Apache Kafka:

High throughput: The largest Kafka deployments handle trillions of messages per day with very low latency. Published benchmarks have reported throughput of around 2 million writes per second on a small cluster of commodity machines.
Massive scale: Kafka horizontally scales to handle any data volume by distributing load across brokers and partitions. LinkedIn runs one of the biggest Kafka deployments with over 3,500 brokers.
Fault tolerance: Data is replicated across brokers to prevent data loss. Kafka can sustain node failures or disk crashes without any downtime.
Durability: Messages written to Kafka are appended to an on-disk log. By default the broker writes through the OS page cache and leaves flushing to the operating system, relying on replication for safety. Configurable retention periods let you store data for days, weeks or years.
Stream processing: Kafka's pub-sub semantics make it ideal for ingesting data streams and then processing them in real time.
Integration: Kafka connects easily to external systems like databases as a centralized pipeline. Over 80 open-source connectors integrate Kafka with popular technologies.

In a nutshell, Kafka gives you a scalable, durable and low latency architecture for handling real-time data feeds.

Now that you know what Kafka promises, let's understand its fundamental concepts.

Core Concepts of Kafka

Apache Kafka is built on three core concepts:

Publish-Subscribe Messaging

Kafka lets you publish and read streams of messages via topics. Producers write data to topics while consumers read from topics.

This decouples producers from consumers: each consumer reads at its own pace, whether in real time or in periodic batches, instead of receiving messages directly from the sender.

Data Partitioning

Topics are split into partitions for scalability. Partitions let you parallelize consumers and scale bandwidth.

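Within a partition, messages are strictly ordered by offset. Kafka makes no ordering guarantee across partitions, so records that must stay in order should share a key and therefore a partition.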
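
To make the link between keys, partitions and ordering concrete, here is a minimal sketch of key-based partitioning. It assumes a simplified partitioner (CRC32 of the key modulo the partition count) as a stand-in; Kafka's own default partitioner uses a different hash, so real partition numbers will differ. The function name choose_partition and the sample events are hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition (illustrative stand-in hash, not Kafka's own)."""
    return zlib.crc32(key) % num_partitions


# Hypothetical page view events keyed by user ID.
events = [(b"user-1", "home"), (b"user-2", "cart"), (b"user-1", "checkout")]

partitions = {0: [], 1: []}
for key, page in events:
    # Records with the same key always land in the same partition,
    # so they keep their relative order.
    partitions[choose_partition(key, 2)].append((key, page))

for partition_id, records in partitions.items():
    print(partition_id, records)
```

Whatever partition user-1 maps to, both of that user's events end up there in the order they were sent, which is the property per-key ordering relies on.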

Data Replication

Partitions are replicated across Kafka brokers (servers) to prevent data loss. A replication factor of 3 is the usual choice in production; note that the broker default (default.replication.factor) is 1.

If a broker goes down, others have a copy of the data to handle requests.
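
As a rough illustration of why a failed broker does not take the data with it, the sketch below places replicas round-robin and then removes one broker. This is a simplified model: Kafka's real placement also randomizes the starting broker and can take racks into account. The function names and broker IDs are hypothetical.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers, round-robin (simplified)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def surviving_replicas(assignment, failed_brokers):
    """Return the replicas of each partition that remain after some brokers fail."""
    return {
        p: [b for b in replicas if b not in failed_brokers]
        for p, replicas in assignment.items()
    }


assignment = assign_replicas(num_partitions=3, brokers=[1, 2, 3], replication_factor=2)
print(assignment)
print(surviving_replicas(assignment, failed_brokers={2}))
```

With a replication factor of 2 across three brokers, every partition still has at least one copy after broker 2 fails.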

With these basics covered, let's look at a reference architecture.

A Typical Kafka Architecture

Here is what a typical Kafka deployment looks like:

Multiple producers publish messages to central Kafka topics
Massively scalable Kafka brokers store and replicate the messages
Many consumers in a consumer group read messages in parallel
Kafka connects easily to external systems via streams and connectors
All client apps use Kafka's publish-subscribe messaging model

This architecture provides:

Loose coupling between components
Centralized data stream for the organization
Multiple subscriber systems processing streams concurrently
Stream analytics piped to external data stores
Flexibility to adapt apps without affecting others
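
The point about consumers in a group reading in parallel can be sketched as follows. Each partition is read by exactly one consumer in a group, so the partition count caps the useful parallelism. The round-robin strategy here is a simplification of what Kafka's assignors do, and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin (simplified)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


# Four partitions shared by two consumers: each gets two.
print(assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"]))

# Two partitions but three consumers: one consumer has nothing to read.
print(assign_partitions([0, 1], ["consumer-a", "consumer-b", "consumer-c"]))
```

The second call shows why adding consumers beyond the number of partitions does not increase throughput.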

Now that you grok Kafka's basics, let's get our hands dirty!

Part 2: Getting Started Guide

In this section, I'll show you how to:

Install and configure a basic Kafka server
Create topics with partitions
Produce and consume test messages with Kafka
Write client apps using Kafka APIs for Java, Python and JS

Let's get cracking!

Downloading, Installing & Configuring Kafka

First, download a Kafka release from kafka.apache.org (this guide uses version 3.3.1):

$ wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.13-3.3.1.tgz
$ tar -xzf kafka_2.13-3.3.1.tgz
$ cd kafka_2.13-3.3.1

Now start the ZooKeeper server which Kafka uses for coordination:

$ bin/zookeeper-server-start.sh config/zookeeper.properties

Next, start the Kafka broker in another terminal:

$ bin/kafka-server-start.sh config/server.properties

Verify your installation by listing the topics on the broker (the list is empty on a fresh install):

$ bin/kafka-topics.sh --list --bootstrap-server localhost:9092

And that's it! You now have a basic Kafka server running locally.

Time to create some topics next.

Creating Topics with Partitions

Let's create a topic named pageviews with two partitions:

$ bin/kafka-topics.sh --create \
    --topic pageviews \
    --partitions 2 \
    --replication-factor 1 \
    --bootstrap-server localhost:9092

This will store page view data for usage analytics.

Now let's try producing and consuming some test messages.

Producing and Consuming Messages

Kafka comes with command line tools that function as producers and consumers.

To produce messages:

$ bin/kafka-console-producer.sh --topic pageviews --bootstrap-server localhost:9092

> This is my first message!
> Here's a second message

To subscribe and consume messages:

$ bin/kafka-console-consumer.sh --topic pageviews --from-beginning --bootstrap-server localhost:9092

This is my first message!
Here's a second message

And just like that, you've produced and consumed messages from Kafka!

While the console apps are great for testing, let's look at how to programmatically interact with Kafka.

Kafka Client Code in Java, Python and JS

Kafka provides well-documented clients for most popular languages. Let me show you examples in Java, Python and NodeJS.

Kafka Producer & Consumer in Java

To produce messages:

// Needs java.util.Properties and org.apache.kafka.clients.producer.*
// Producer properties
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// try-with-resources closes the producer, which flushes any buffered records
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
  // Send message (no key is set, so the producer picks the partition)
  ProducerRecord<String, String> record = new ProducerRecord<>("pageviews", "New page view!");
  producer.send(record);
}

To consume messages:

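// Needs java.time.Duration, java.util.Arrays, java.util.Properties and org.apache.kafka.clients.consumer.*
// Consumer properties: deserializers and a group.id are required ("pageviews-readers" is an example name)
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "pageviews-readers");
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(Arrays.asList("pageviews"));

while (true) {
  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));

  for (ConsumerRecord<String, String> record : records)
    System.out.println(record.value());
}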

And that's the crux of Kafka producers and consumers in Java!

Kafka Client for Python

Let's look at the same thing in Python.

First, install the Kafka python client:

pip install kafka-python

To produce:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('pageviews', b'New entry!')
producer.flush()  # send() is asynchronous; wait until the message is delivered

To consume:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'pageviews',
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest'
)

for message in consumer:
    print(message.value)
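
The Python examples above send and print raw bytes. Real events are usually structured, so here is a small sketch of JSON helpers that could be passed to kafka-python through its value_serializer and value_deserializer arguments. The helper names and the event fields are hypothetical, and the client calls are shown only as comments because they need a running broker.

```python
import json


def serialize_event(event: dict) -> bytes:
    """Encode an event dict as UTF-8 JSON bytes."""
    return json.dumps(event).encode("utf-8")


def deserialize_event(raw: bytes) -> dict:
    """Decode UTF-8 JSON bytes back into an event dict."""
    return json.loads(raw.decode("utf-8"))


# With a broker running, the helpers could be plugged into the clients like this:
#   KafkaProducer(bootstrap_servers="localhost:9092", value_serializer=serialize_event)
#   KafkaConsumer("pageviews", bootstrap_servers="localhost:9092",
#                 value_deserializer=deserialize_event)

event = {"user": "user-1", "page": "/home"}
print(deserialize_event(serialize_event(event)))
```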

And you're done with Python!

Kafka Client in NodeJS

If you use Node.js, install the kafkajs library:

npm install kafkajs

To produce:

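const { Kafka } = require('kafkajs')

// clientId is an arbitrary name that identifies this application
const kafka = new Kafka({ clientId: 'pageviews-app', brokers: ['localhost:9092'] })
const producer = kafka.producer()

async function produce() {
  await producer.connect()
  await producer.send({
    topic: 'pageviews',
    messages: [
      { value: 'New page view' }
    ],
  })
  await producer.disconnect()
}

produce().catch(console.error)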

To consume:

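// `kafka` must be a client instance: new Kafka({ brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'test' })

async function consume() {
  await consumer.connect()
  await consumer.subscribe({ topic: 'pageviews' })

  await consumer.run({
    eachMessage: async ({ message }) => {
      // message.value is a Buffer, so convert it to a string
      console.log(message.value.toString())
    },
  })
}

consume().catch(console.error)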

And there you have it! You can now use Kafka programmatically in your Java, Python or Node.js applications.

Let's now shift gears and explore running Kafka in production.

Part 3: Running Kafka in Production

While a local Kafka server is great for development, running Kafka in production requires some additional steps to ensure high reliability.

Here are the key aspects I'll cover:

Achieving high availability (HA)
Ensuring data durability
Monitoring, securing & optimizing Kafka
Horizontally scaling your clusters
Streaming data between systems with Kafka

Let's get started!

Achieving High Availability in Kafka

The key to availability in Kafka lies in replication across brokers. By having multiple brokers contain copies of your data, your cluster stays operational even if some nodes fail.

Here are the best practices to follow:

1. Distributed Cluster

Run at least 3 Kafka brokers to avoid a single point of failure. Kafka spreads partition replicas across them.

2. Replication Factor

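When creating topics, set the replication factor to 3. This is a common production choice rather than the broker default, which is 1. Combined with min.insync.replicas=2 on the topic and acks=all on producers, the topic stays writable and keeps acknowledged data through the loss of any single broker.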

3. Multiple Datacenters

Place brokers across 2 or more datacenters (or availability zones), keeping in mind that replication between them adds latency. This prevents a single datacenter outage from taking down your Kafka cluster.

Following this ensures your Kafka data stays highly-available to consuming applications.
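
The arithmetic behind these replication settings can be sketched as below. It assumes producers use acks=all and considers the worst case, where only the minimum number of in-sync replicas held a record when it was acknowledged. The function name and result keys are hypothetical.

```python
def tolerated_failures(replication_factor: int, min_insync_replicas: int) -> dict:
    """Broker failures a partition can absorb, assuming producers use acks=all."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min.insync.replicas <= replication factor")
    return {
        # acks=all writes are rejected once fewer than min.insync.replicas remain.
        "writable_after_failures": replication_factor - min_insync_replicas,
        # Worst case: an acknowledged record exists on exactly min.insync.replicas brokers.
        "acked_data_safe_after_failures": min_insync_replicas - 1,
    }


print(tolerated_failures(replication_factor=3, min_insync_replicas=2))
print(tolerated_failures(replication_factor=3, min_insync_replicas=1))
```

The first call gives one tolerated failure on both counts. The second shows the trade-off: with min.insync.replicas=1 the topic stays writable after two failures, but an acknowledged record may have existed on only one broker.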

Ensuring Data Durability

Kafka lets you store streams of data durably by configuring retention policies.

Topics can be configured to retain messages for days, weeks or even years. Once published, data is appended to the partition's log on the file system and indexed for retrieval.

Some key mechanisms Kafka uses to provide durability:

Messages appended to the on-disk log on the leader broker
Configurable retention periods from hours to years
No loss of acknowledged data as long as at least one in-sync replica survives (with acks=all on the producer)
Checkpointed offsets for consumers act as cursors into data

By combining retention policies with replication, Kafka gives you durable streams of data.
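
To show how a time-based retention period plays out, here is a simplified model. Kafka stores each partition as a sequence of segment files and removes whole segments once they are older than the retention period, leaving the active segment alone. The Segment class, field names and timestamps are hypothetical, and real brokers apply further rules such as size-based retention.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    newest_timestamp_ms: int  # timestamp of the newest record in the segment


def expired_segments(segments, now_ms, retention_ms):
    """Return closed segments whose newest record is older than the retention period."""
    closed = segments[:-1]  # the last (active) segment is still being written
    return [s for s in closed if now_ms - s.newest_timestamp_ms > retention_ms]


DAY_MS = 24 * 60 * 60 * 1000
now = 10 * DAY_MS
log = [Segment(0, 1 * DAY_MS), Segment(500, 4 * DAY_MS), Segment(900, 9 * DAY_MS)]

# With 7 days of retention, only the first segment is old enough to delete.
print([s.base_offset for s in expired_segments(log, now, retention_ms=7 * DAY_MS)])
```

Because deletion works on whole segments, records can outlive the configured retention period by a while, which is worth remembering when sizing disks.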

An analysis from Confluent reportedly showed Kafka with only 2.7 minutes of downtime across 350 months, which is incredible for the amount of data flowing through!

Next up, let's discuss monitoring and securing Kafka deployments.

Monitoring, Securing & Optimizing Kafka

When using Kafka in production, you need visibility into cluster health, early warning of issues, and the ability to tune Kafka.

Here are some best practices:

Monitoring

Collect metrics on throughput, latency, disk usage, consumer lag, request rates etc.
Graph metrics using tools like Grafana to spot trends
Get alerts for warning thresholds on key metrics
Use Kafka tools for consumer group analysis
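
Consumer lag, one of the metrics listed above, is simply the gap between the newest offset in each partition and the offset the consumer group has committed. A minimal sketch with hypothetical offsets follows; it assumes a partition with no committed offset is read from offset 0.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: records written but not yet committed by the group."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical offsets for a two-partition topic.
lag = consumer_lag({0: 1500, 1: 1480}, {0: 1500, 1: 1200})
print(lag, "total:", sum(lag.values()))
```

A lag that keeps growing on one partition usually means that partition's consumer cannot keep up, which makes it a good alerting signal.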

Security

Encrypt data in transit via SSL/TLS for data security
Authenticate clients via SASL protocol
Authorization using ACLs for access control
Protect Kafka via firewalls and limited endpoints
Validate messages against a schema for integrity
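
On the client side, the encryption and authentication items above come down to a few connection settings. The sketch below builds keyword arguments for the kafka-python client. The broker address and environment variable names are hypothetical, and the chosen mechanism (PLAIN over TLS) is only one option; it has to match what your brokers are configured for.

```python
import os


def secure_client_config(bootstrap_servers: str) -> dict:
    """Build keyword arguments for a kafka-python client using TLS plus SASL."""
    return {
        "bootstrap_servers": bootstrap_servers,
        "security_protocol": "SASL_SSL",  # TLS encryption plus SASL authentication
        "sasl_mechanism": "PLAIN",        # assumption: brokers accept SASL/PLAIN
        # Credentials come from hypothetical environment variables, not source code.
        "sasl_plain_username": os.environ.get("KAFKA_USERNAME", ""),
        "sasl_plain_password": os.environ.get("KAFKA_PASSWORD", ""),
        # CA certificate used to verify the brokers' certificates.
        "ssl_cafile": os.environ.get("KAFKA_CA_FILE"),
    }


config = secure_client_config("broker1.example.com:9093")
# With matching broker settings, this could be used as: KafkaProducer(**config)
print(config["security_protocol"], config["sasl_mechanism"])
```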

Optimization

Size topics into 12-24 partitions for throughput
Use Snappy compression to reduce network IO
Limit consumer poll time and batch size
Tune fetch sizes and slow consumer thresholds
Provision adequate network, memory and disk

Getting monitoring, security and tuning right ensures Kafka keeps performing smoothly even under load.

When you do need to scale, Kafka makes that easy too.

Horizontally Scaling Kafka

One of Kafka's major advantages is near-linear scalability. You can scale Kafka by adding more brokers. Note that new brokers only pick up partitions of newly created topics on their own; existing partitions have to be moved with the kafka-reassign-partitions.sh tool or a balancer such as Cruise Control.

When adding nodes, it's best to go one node at a time rather than doubling cluster size. This reduces rebalancing overhead.

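As a rough guide, a single partition can handle 50-100 MB/s of write throughput, though the real figure depends heavily on hardware, message size and configuration, so benchmark your own setup. Use this number to right-size your brokers' disk and network capacity for your load.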
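
A per-partition throughput figure also lets you estimate how many partitions a topic needs. The sketch below applies a common rule of thumb: take enough partitions that neither the producing side nor the consuming side is the bottleneck. All throughput numbers are hypothetical and should be replaced with measurements from your own cluster.

```python
import math


def partitions_needed(target_mb_s: float, producer_mb_s: float, consumer_mb_s: float) -> int:
    """Partitions needed so neither producing nor consuming becomes the bottleneck."""
    return max(
        math.ceil(target_mb_s / producer_mb_s),
        math.ceil(target_mb_s / consumer_mb_s),
    )


# Target 600 MB/s; one partition sustains 50 MB/s of writes,
# and one consumer processes 25 MB/s from a partition.
print(partitions_needed(target_mb_s=600, producer_mb_s=50, consumer_mb_s=25))
```

In this example the consumers are the limiting factor, so the topic needs 24 partitions rather than the 12 that write throughput alone would suggest.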

With horizontal scaling, you can grow Kafka to handle pretty much any scale required!

Streaming Data Between Systems

A common use case is using Kafka to integrate disparate systems.

Kafka acts as the central conduit connecting event producers to the various applications that require the events.

Some examples:

Stream app logs, metrics and clickstream data
Sync product changes to ecommerce systems
Notify cache layers about data updates
Collect IoT telemetry data for analytics
Queue transaction requests before writing

This streaming design pattern powers data pipelines and is a popular architectural style today.
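
What makes this pattern work is that every consumer group keeps its own position in the log, so several downstream systems can read the same events independently. Here is a tiny in-memory model of that idea. The TopicLog class, group names and events are hypothetical, and this is a teaching model rather than how a broker is implemented.

```python
class TopicLog:
    """A tiny in-memory model of one topic partition with per-group offsets."""

    def __init__(self):
        self.records = []
        self.group_offsets = {}  # group id -> next offset to read

    def publish(self, record):
        self.records.append(record)

    def poll(self, group_id, max_records=10):
        """Return unread records for a group and advance only that group's offset."""
        start = self.group_offsets.get(group_id, 0)
        batch = self.records[start:start + max_records]
        self.group_offsets[group_id] = start + len(batch)
        return batch


log = TopicLog()
log.publish("order-created")
log.publish("order-shipped")

print(log.poll("analytics"))      # both events
print(log.poll("cache-updater"))  # the same two events, read independently
print(log.poll("analytics"))      # nothing new for this group
```

Reading never removes records from the log, which is why adding a new downstream system does not affect the ones already consuming.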

Wrapping Up

And that brings us to the end of our in-depth guide!

Here's a quick recap of what we covered:

Kafka Basics

Kafka's publish-subscribe messaging model
Core concepts like topics, partitions, brokers
Kafka reference architecture

Getting Started

Downloading, installing Kafka
Creating topics, producing and consuming messages
Writing Kafka client code in Java, Python, JS

Running in Production

Achieving high availability
Ensuring durable data retention
Monitoring, securing and tuning Kafka
Horizontally scaling your clusters
Building data pipelines between systems

I hope you found this guide useful for gaining expertise with Apache Kafka!

It distills all my real-world experience as a solution architect into actionable insights you can apply.

You're now equipped to:

Set up your own Kafka environments
Build high-performance production deployments
Stream data between all your applications

If you have any other questions, feel free to reach out!

Happy data streaming.