Apache Kafka: An In-Depth Guide for Developers

Dear reader,

Are you looking to learn Apache Kafka and leverage it for your data architecture? If so, you've come to the right place!

As an experienced solution architect who has helped dozens of enterprises adopt Kafka, I've distilled my hard-won knowledge into this comprehensive guide.

Here's what I'll cover to help you master Apache Kafka:

Part 1: Kafka Basics

  • What is Apache Kafka?
  • Kafka Key Capabilities
  • Kafka Core Concepts
  • A Typical Kafka Architecture

Part 2: Getting Started Guide

  • Downloading, Installing & Configuring Kafka
  • Creating Topics with Partitions
  • Producing and Consuming Messages
  • Kafka Client Code in Java, Python and JS

Part 3: Running Kafka in Production

  • Achieving High Availability
  • Ensuring Data Durability
  • Monitoring, Securing & Optimizing Kafka
  • Scaling Kafka Clusters
  • Streaming Data Between Systems

So let's get started, shall we?

Part 1: Apache Kafka Basics

First, I'll explain what Kafka is, its key strengths, and the main abstractions that make Kafka tick. This context will help you better understand how to use Kafka later.

What is Apache Kafka?

Apache Kafka is a distributed, partitioned and replicated commit log service. It lets you publish and subscribe to streams of records and acts as a durable message broker.

In simpler terms, Kafka is a fast, scalable and durable real-time data streaming platform.

It was originally built by LinkedIn and open-sourced in 2011. Now it's maintained by the Apache Software Foundation and used by thousands of companies globally, including Uber, Netflix, Cisco, Visa, Coursera and Spotify.

Some common use cases are:

  • Real-time stream processing
  • Messaging
  • Activity tracking
  • Gathering metrics & logs
  • Commit logs for storage systems

Next, let's understand Kafka's main capabilities that make it so useful.

Key Capabilities of Kafka

Here are some of the standout features of Apache Kafka:

  1. High throughput: Kafka can handle millions of messages per second with very low latency, and the largest deployments process trillions of messages per day. Published benchmarks have shown throughput exceeding 2 million writes per second.

  2. Massive scale: Kafka horizontally scales to handle any data volume by distributing load across brokers and partitions. LinkedIn runs one of the biggest Kafka deployments with over 3,500 brokers.

  3. Fault tolerance: Data is replicated across brokers to prevent data loss. Kafka can sustain node failures or disk crashes without any downtime.

  4. Durability: Messages written to Kafka are persisted to the broker's log as soon as they are received. Configurable retention periods let you store data for days, weeks or years.

  5. Stream processing: Kafka's pub-sub semantics make it ideal for ingesting data streams and then processing them in real time.

  6. Integration: Kafka connects easily to external systems like databases as a centralized pipeline. Over 80 open-source connectors integrate Kafka with popular technologies.

In a nutshell, Kafka gives you a scalable, durable and low latency architecture for handling real-time data feeds.

Now that you know what Kafka promises, let's understand its fundamental concepts.

Core Concepts of Kafka

Apache Kafka is built around three core concepts:

Publish-Subscribe Messaging

Kafka lets you publish and read streams of messages via topics. Producers write data to topics while consumers read from topics.

This decouples producers from consumers: downstream systems can process the data in real time or in periodic batches, without any direct point-to-point messaging between the end systems.

Data Partitioning

Topics are split into partitions for scalability. Partitions let you parallelize consumers and scale bandwidth.

Messages within a partition are strictly ordered; there is no ordering guarantee across partitions.
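
To make this concrete, here's a minimal sketch using the Java producer covered later in this guide. It assumes the pageviews topic from Part 2 and an illustrative user-42 key; the point is that records sharing a key are hashed to the same partition, so per-key order is preserved.

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// Both records carry the key "user-42", so they land in the same partition
// and are read back in exactly this order.
producer.send(new ProducerRecord<>("pageviews", "user-42", "viewed /home"));
producer.send(new ProducerRecord<>("pageviews", "user-42", "viewed /pricing"));
producer.close();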

Data Replication

Partitions are replicated across Kafka brokers (servers) to prevent data loss. A replication factor of 3 is typical in production (the broker default is 1).

If a broker goes down, others have a copy of the data to handle requests.

With these basics covered, let's look at a reference architecture.

A Typical Kafka Architecture

Here is what a typical Kafka deployment looks like:

  • Multiple producers publish messages to central Kafka topics
  • Massively scalable Kafka brokers store and replicate the messages
  • Many consumers in a consumer group read messages in parallel
  • Kafka connects easily to external systems via streams and connectors
  • All client apps use Kafka's publish-subscribe messaging model

This architecture provides:

  • Loose coupling between components
  • Centralized data stream for the organization
  • Multiple subscriber systems processing streams concurrently
  • Stream analytics piped to external data stores
  • Flexibility to adapt apps without affecting others

Now that you grok Kafka's basics, let's get our hands dirty!

Part 2: Getting Started Guide

In this section, I'll show you how to:

  1. Install and configure a basic Kafka server
  2. Create topics with partitions
  3. Produce and consume test messages with Kafka
  4. Write client apps using Kafka APIs for Java, Python and JS

Let's get cracking!

Downloading, Installing & Configuring Kafka

First, download a Kafka release from kafka.apache.org (this guide uses 3.3.1):

$ wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.13-3.3.1.tgz
$ tar -xzf kafka_2.13-3.3.1.tgz
$ cd kafka_2.13-3.3.1

Now start the ZooKeeper server which Kafka uses for coordination:

$ bin/zookeeper-server-start.sh config/zookeeper.properties

Next, start the Kafka broker in another terminal:

$ bin/kafka-server-start.sh config/server.properties

Verify your installation by listing the topics on the broker (the list will be empty on a fresh install):

$ bin/kafka-topics.sh --list --bootstrap-server localhost:9092

And that's it! You now have a basic Kafka server running locally.

Time to create some topics next.

Creating Topics with Partitions

Let's create a topic named pageviews with two partitions:

$ bin/kafka-topics.sh --create \
    --topic pageviews \
    --partitions 2 \
    --replication-factor 1 \
    --bootstrap-server localhost:9092

This will store page view data for usage analytics.

Now let's try producing and consuming some test messages.

Producing and Consuming Messages

Kafka comes with command line tools that function as producers and consumers.

To produce messages:

$ bin/kafka-console-producer.sh --topic pageviews --bootstrap-server localhost:9092

> This is my first message!
> Here's a second message

To subscribe and consume messages:

$ bin/kafka-console-consumer.sh --topic pageviews --from-beginning --bootstrap-server localhost:9092

This is my first message!
Here's a second message

And just like that, you've produced and consumed messages from Kafka!

While the console apps are great for testing, let's look at how to programmatically interact with Kafka.

Kafka Client Code in Java, Python and JS

Kafka provides well-documented clients for most popular languages. Let me show you examples in Java, Python and Node.js.

Kafka Producer & Consumer in Java

To produce messages:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Producer properties
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Create a producer with String keys and values
KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// Send a message to the pageviews topic
ProducerRecord<String, String> record = new ProducerRecord<>("pageviews", "New page view!");
producer.send(record);

// Flush outstanding messages and release resources when done
producer.close();

To consume messages:

// Consumers need deserializers and a consumer group id (the group name here is just an example)
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "pageview-readers");
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(Arrays.asList("pageviews"));

while (true) {
  // Poll for new records, waiting up to 100 ms
  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
  for (ConsumerRecord<String, String> record : records)
    System.out.println(record.value());
}

And that's the crux of Kafka producers and consumers in Java!

Kafka Client for Python

Let's look at the same thing in Python.

First, install the Kafka python client:

pip install kafka-python

To produce:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('pageviews', b'New entry!')
producer.flush()  # make sure the message is actually sent before the script exits

To consume:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'pageviews',
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest'
)

for message in consumer:
    print(message.value)

And you're done with Python!

Kafka Client in Node.js

If you use Node.js, install the kafkajs library:

npm install kafkajs

To produce:

const { Kafka } = require('kafkajs')

// Point the client at the local broker started earlier
const kafka = new Kafka({ brokers: ['localhost:9092'] })
const producer = kafka.producer()

async function produce() {
  await producer.connect()
  await producer.send({
    topic: 'pageviews',
    messages: [
      { value: 'New page view' }
    ],
  })
}

produce()

To consume:

const consumer = kafka.consumer({ groupId: 'test' })

async function consume() {
  await consumer.connect()
  await consumer.subscribe({ topic: 'pageviews', fromBeginning: true })

  await consumer.run({
    eachMessage: async ({ message }) => {
      // message.value is a Buffer
      console.log(message.value.toString())
    },
  })
}

consume()

And there you have it! You can now use Kafka programmatically in your Java, Python or Node.js applications.

Let's now shift gears and explore running Kafka in production.

Part 3: Running Kafka in Production

While a local Kafka server is great for development, running Kafka in production requires some additional steps to ensure high reliability.

Here are the key aspects I'll cover:

  • Achieving high availability (HA)
  • Ensuring data durability
  • Monitoring, securing & optimizing Kafka
  • Horizontally scaling your clusters
  • Streaming data between systems with Kafka

Let's get started!

Achieving High Availability in Kafka

The key to availability in Kafka lies in replication across brokers. By having multiple brokers contain copies of your data, your cluster stays operational even if some nodes fail.

Here are the best practices to follow:

1. Distributed Cluster

Run at least 3 Kafka brokers to avoid a single point of failure. Kafka will distribute partition replicas across them.

2. Replication Factor

When creating topics, set the replication factor to 3 (note that the broker default is 1). This stores each partition on three brokers, so the topic tolerates the loss of any single broker.
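
For example, here's a hedged sketch of creating such a topic programmatically with the Java AdminClient. The topic name, partition count, and the companion min.insync.replicas setting are illustrative choices, not requirements:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
  // 3 replicas per partition; min.insync.replicas=2 means a write made with acks=all
  // must reach at least 2 replicas before it is acknowledged
  NewTopic orders = new NewTopic("orders", 6, (short) 3)
      .configs(Map.of("min.insync.replicas", "2"));
  admin.createTopics(List.of(orders)).all().get();
}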

3. Multiple Datacenters

Place brokers across two or more datacenters or availability zones. This prevents a single site outage from taking down your whole Kafka cluster.

Following these practices keeps your Kafka data highly available to consuming applications.

Ensuring Data Durability

Kafka lets you store streams of data durably by configuring retention policies.

Topics can be configured to retain messages for days, weeks or even years. Once published, data is written to the broker's log on the file system and indexed for retrieval.
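
As an illustration, retention is set per topic. Here's a hedged sketch that uses the Java AdminClient to keep the pageviews data for 30 days; the value is just an example:

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

// Keep messages in the pageviews topic for 30 days (retention.ms is in milliseconds)
ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "pageviews");
AlterConfigOp setRetention = new AlterConfigOp(
    new ConfigEntry("retention.ms", String.valueOf(30L * 24 * 60 * 60 * 1000)),
    AlterConfigOp.OpType.SET);

try (AdminClient admin = AdminClient.create(props)) {
  Map<ConfigResource, Collection<AlterConfigOp>> updates =
      Map.of(topic, List.of(setRetention));
  admin.incrementalAlterConfigs(updates).all().get();
}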

Some key mechanisms Kafka uses to provide durability:

  • Messages are appended to the leader broker's on-disk log as soon as they are received
  • Configurable retention periods from hours to years
  • Zero data loss (with acks=all) as long as at least one in-sync replica holds the data
  • Checkpointed consumer offsets act as cursors into the data

By combining retention policies with replication, Kafka gives you durable streams of data.
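
On the producer side, durability also depends on acknowledgement settings. Here's a minimal sketch of common (not mandatory) choices: acks=all waits for all in-sync replicas before confirming a write, and idempotence makes retries safe:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Wait for all in-sync replicas to acknowledge each write
props.put("acks", "all");
// Retries will not introduce duplicate messages
props.put("enable.idempotence", "true");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);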

One published analysis from Confluent reported only 2.7 minutes of downtime across 350 months of operation – remarkable given the amount of data flowing through!

Next up, let's discuss monitoring and securing Kafka deployments.

Monitoring, Securing & Optimizing Kafka

When using Kafka in production, you need visibility into cluster health, early warning of issues, and the ability to tune Kafka.

Here are some best practices:

Monitoring

  1. Collect metrics on throughput, latency, disk usage, consumer lag, request rates etc.
  2. Graph metrics using tools like Grafana to spot trends
  3. Get alerts for warning thresholds on key metrics
  4. Use Kafka's consumer group tools to analyze groups and track lag (see the sketch below)
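
As an example of the kind of check you might automate, here's a rough sketch of computing consumer lag (committed offsets vs. latest offsets) with the Java AdminClient. The group id matches the example consumer from Part 2; in practice you would more often rely on Kafka's CLI tools or a metrics exporter feeding your dashboards:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
  // Offsets the group has committed so far
  Map<TopicPartition, OffsetAndMetadata> committed =
      admin.listConsumerGroupOffsets("pageview-readers")
           .partitionsToOffsetAndMetadata().get();

  // Latest offsets for the same partitions
  Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
  committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
  Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
      admin.listOffsets(latestSpec).all().get();

  // Lag = latest offset minus committed offset, per partition
  committed.forEach((tp, offset) ->
      System.out.println(tp + " lag = " + (latest.get(tp).offset() - offset.offset())));
}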

Security

  1. Encrypt data in transit with TLS/SSL (see the client config sketch below)
  2. Authenticate clients via SASL
  3. Authorize access with ACLs
  4. Protect Kafka with firewalls and a limited set of exposed endpoints
  5. Validate messages against a schema for integrity
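
For instance, a client connecting to a cluster secured with TLS and SASL/PLAIN needs properties along these lines. The hostnames, file paths and credentials are placeholders, and your cluster may use a different SASL mechanism:

Properties props = new Properties();
props.put("bootstrap.servers", "broker1.example.com:9093");

// Encrypt traffic and verify the brokers' certificates
props.put("security.protocol", "SASL_SSL");
props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
props.put("ssl.truststore.password", "changeit");

// Authenticate this client with SASL/PLAIN credentials
props.put("sasl.mechanism", "PLAIN");
props.put("sasl.jaas.config",
    "org.apache.kafka.common.security.plain.PlainLoginModule required "
    + "username=\"app-client\" password=\"app-secret\";");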

Optimization

  1. Give topics enough partitions to match your consumers' parallelism (12-24 is a common starting point)
  2. Use compression such as Snappy to reduce network IO (see the tuning sketch below)
  3. Tune consumer poll intervals and batch sizes
  4. Tune fetch sizes and slow consumer thresholds
  5. Provision adequate network, memory and disk
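
Here's a hedged sketch of producer and consumer settings that put several of these points into practice. The exact values are starting points to benchmark against your own workload, not universal recommendations:

// Producer tuning: compress and batch writes to cut network IO
Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("compression.type", "snappy");
producerProps.put("batch.size", "65536");   // bytes per batch
producerProps.put("linger.ms", "10");       // wait briefly so batches can fill up

// Consumer tuning: control how much data each poll pulls back
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "pageview-readers");
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("max.poll.records", "500");
consumerProps.put("fetch.min.bytes", "1048576");   // wait for ~1 MB of data...
consumerProps.put("fetch.max.wait.ms", "500");     // ...or at most 500 ms per fetch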

Getting monitoring, security and tuning right ensures Kafka keeps performing smoothly even under load.

When you do need to scale, Kafka makes that easy too.

Horizontally Scaling Kafka

One of Kafka's major advantages is horizontal scalability. You can scale a cluster by adding more brokers; new partitions are placed on them, and existing partitions can be moved onto them with Kafka's partition reassignment tooling.

When adding nodes, it's best to expand incrementally rather than doubling the cluster size at once. This keeps partition reassignment overhead manageable.

As a rough rule of thumb, a single partition can sustain on the order of 50-100 MB/s of writes, though real throughput depends heavily on message size, batching, replication and hardware. Use measured numbers for your workload to right-size your brokers' disk and network capacity.

With horizontal scaling, you can grow Kafka to handle pretty much any scale required!

Streaming Data Between Systems

A common use case is using Kafka to integrate disparate systems.

Kafka acts as the central conduit connecting event producers to the various applications that consume those events.

Some examples:

  • Stream app logs, metrics and clickstream data
  • Sync product changes to ecommerce systems
  • Notify cache layers about data updates
  • Collect IoT telemetry data for analytics
  • Queue transaction requests before writing

This streaming design pattern powers data pipelines and is a popular architectural style today.

Wrapping Up

And that brings us to the end of our in-depth guide!

Here's a quick recap of what we covered:

Kafka Basics

  • Kafka's publish-subscribe messaging model
  • Core concepts like topics, partitions, brokers
  • Kafka reference architecture

Getting Started

  • Downloading, installing Kafka
  • Creating topics, producing and consuming messages
  • Writing Kafka client code in Java, Python, JS

Running in Production

  • Achieving high availability
  • Ensuring durable data retention
  • Monitoring, securing and tuning Kafka
  • Horizontally scaling your clusters
  • Building data pipelines between systems

I hope you found this guide useful for gaining expertise with Apache Kafka!

It distills all my real-world experience as a solution architect into actionable insights you can apply.

You're now equipped to:

  1. Set up your own Kafka environments
  2. Build high-performance production deployments
  3. Stream data between all your applications

If you have any other questions, feel free to reach out!

Happy data streaming.