Apache Kafka is a distributed event streaming platform used for high-throughput, fault-tolerant message processing. It combines messaging, storage, and stream processing.

Core Concepts

Concept         Description
Broker          Kafka server node
Topic           Named stream of records
Partition       Ordered, immutable sequence within a topic
Offset          Position of a record in a partition
Producer        Publishes records to topics
Consumer        Reads records from topics
Consumer Group  Coordinated consumers sharing load
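
To make the partition and offset entries concrete, here is a minimal in-memory sketch of a single partition as an append-only list. The class `PartitionLogSketch` and its methods are hypothetical names for illustration only; they are not part of the Kafka client API, and a real partition lives on disk on a broker.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one partition; not a Kafka API.
class PartitionLogSketch {
    private final List<String> records = new ArrayList<>();

    // Appends a record to the end of the log and returns its offset.
    // Earlier records are never modified, which is what "immutable" means here.
    public long append(String value) {
        records.add(value);
        return records.size() - 1;
    }

    // Returns every record from the given offset to the end of the log.
    public List<String> readFrom(long offset) {
        return new ArrayList<>(records.subList((int) offset, records.size()));
    }

    public static void main(String[] args) {
        PartitionLogSketch partition = new PartitionLogSketch();
        partition.append("order-created");
        partition.append("order-paid");
        long offset = partition.append("order-shipped");
        System.out.println(offset);                // prints 2
        System.out.println(partition.readFrom(1)); // prints [order-paid, order-shipped]
    }
}
```

Reading does not remove anything from the list, so a consumer can go back to an earlier offset and read the same records again.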

Architecture

  Producer → Topic (Partition 0, 1, 2) → Consumer Group
              ↕ replication
           Broker 1, 2, 3
  

Each partition is replicated across multiple brokers for fault tolerance.
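
One simplified way to picture how replicas are spread over brokers is a round-robin placement, sketched below. This is an assumption-laden illustration: `ReplicaPlacementSketch` and `replicasFor` are hypothetical names, and Kafka's actual assignment also randomizes the starting broker and can take racks into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical round-robin placement; a simplification of what Kafka does.
class ReplicaPlacementSketch {

    // Returns the broker IDs that hold a partition's replicas.
    // The first entry is treated as the leader in this sketch.
    public static List<Integer> replicasFor(int partition, List<Integer> brokerIds, int replicationFactor) {
        if (replicationFactor > brokerIds.size()) {
            throw new IllegalArgumentException("replication factor exceeds broker count");
        }
        List<Integer> replicas = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            // Step through the broker list, wrapping around at the end.
            replicas.add(brokerIds.get((partition + i) % brokerIds.size()));
        }
        return replicas;
    }

    public static void main(String[] args) {
        List<Integer> brokers = Arrays.asList(1, 2, 3);
        for (int p = 0; p < 3; p++) {
            System.out.println("Partition " + p + " -> " + replicasFor(p, brokers, 2));
        }
        // prints:
        // Partition 0 -> [1, 2]
        // Partition 1 -> [2, 3]
        // Partition 2 -> [3, 1]
    }
}
```

With this placement every broker leads one partition and follows another, so losing a single broker leaves a copy of each partition available.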

Key Properties

Property     Value
Throughput   Up to millions of messages/sec, depending on cluster size and hardware
Retention    Configurable (time or size)
Ordering     Guaranteed within a partition
Durability   Replicated to multiple brokers
Scalability  Horizontal (add brokers/partitions)

Topic Configuration

  # Create a topic with 3 partitions and a replication factor of 2
  kafka-topics.sh --create \
    --bootstrap-server localhost:9092 \
    --topic orders \
    --partitions 3 \
    --replication-factor 2

  # Describe the topic
  kafka-topics.sh --describe --topic orders --bootstrap-server localhost:9092

Partition Strategy

Partition choice affects ordering and parallelism:

  Key-based:  same key → same partition → ordering guaranteed for that key
  No key:     records spread across partitions (round-robin, or sticky batching in clients since Kafka 2.4) → no ordering guarantee
  Custom:     implement the Partitioner interface
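
The key-based rule can be sketched as "hash the key bytes, then take the result modulo the partition count". The snippet below is only an illustration: `KeyPartitioningSketch` and `partitionFor` are hypothetical names, and `Arrays.hashCode` is swapped in for the murmur2 hash that the Java client's built-in partitioner applies, so the partition numbers it produces will not match a real cluster.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of key-based partition selection; not the Kafka Partitioner.
class KeyPartitioningSketch {

    // Maps a key to a partition index in the range [0, numPartitions).
    // Assumption: Arrays.hashCode stands in for murmur2 over the key bytes.
    public static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        // Clear the sign bit so the modulo result is never negative.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition for a fixed partition count.
        System.out.println(partitionFor("customer-17", 3));
        System.out.println(partitionFor("customer-17", 3));
        // A different partition count may move the key elsewhere.
        System.out.println(partitionFor("customer-17", 6));
    }
}
```

Because the result depends on the partition count, adding partitions to a topic later can send existing keys to a different partition, which breaks per-key ordering across the change.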
  

Replication

  Partition 0:  Leader (Broker 1) → Follower (Broker 2), Follower (Broker 3)
  Partition 1:  Leader (Broker 2) → Follower (Broker 1), Follower (Broker 3)
  
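  • The leader handles all writes and, by default, all reads (since Kafka 2.4, consumers can be configured to fetch from followers)
  • Followers replicate from the leader
  • ISR (In-Sync Replicas): the replicas, leader included, that are fully caught up with the leader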
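
The ISR rule can be modelled roughly as follows. This is a simplified sketch, not broker code: `IsrSketch` and its methods are hypothetical, the timestamps are made-up inputs, and the model assumes the time-based check controlled by the broker setting `replica.lag.time.max.ms` together with the `min.insync.replicas` requirement that applies to producers using `acks=all`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of in-sync replica tracking; a simplification of broker behavior.
class IsrSketch {

    // Returns the leader plus every follower that was fully caught up
    // within the last maxLagMs milliseconds.
    public static List<Integer> inSyncReplicas(int leaderId,
                                               Map<Integer, Long> followerLastCaughtUpMs,
                                               long nowMs,
                                               long maxLagMs) {
        List<Integer> isr = new ArrayList<>();
        isr.add(leaderId); // the leader is always part of its own ISR
        for (Map.Entry<Integer, Long> follower : followerLastCaughtUpMs.entrySet()) {
            if (nowMs - follower.getValue() <= maxLagMs) {
                isr.add(follower.getKey());
            }
        }
        return isr;
    }

    // A write sent with acks=all is accepted only while the ISR is large enough.
    public static boolean acceptsAcksAllWrite(int isrSize, int minInSyncReplicas) {
        return isrSize >= minInSyncReplicas;
    }

    public static void main(String[] args) {
        Map<Integer, Long> followers = new LinkedHashMap<>();
        followers.put(2, 95_000L); // broker 2 caught up 5 s ago
        followers.put(3, 60_000L); // broker 3 caught up 40 s ago
        List<Integer> isr = inSyncReplicas(1, followers, 100_000L, 30_000L);
        System.out.println(isr);                                // prints [1, 2]
        System.out.println(acceptsAcksAllWrite(isr.size(), 2)); // prints true
    }
}
```

In this model broker 3 has dropped out of the ISR, yet writes still succeed because two replicas remain in sync; if broker 2 also fell behind, `acks=all` writes would be rejected until a follower caught up.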

Kafka vs Traditional MQ

Feature            Kafka                        RabbitMQ
Model              Log-based                    Queue/exchange
Message retention  Configurable (time or size)  Deleted after ack
Throughput         Very high                    High
Ordering           Per partition                Per queue
Replay             Yes (by offset)              No (queues); yes with RabbitMQ streams
Best for           Event streaming, logs        Task queues, RPC

Java Dependency

  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>3.6.1</version>
  </dependency>
  

Best Practices

  • Choose the partition count from target throughput; avoid over-partitioning, since every partition adds broker and client overhead
  • Use replication factor ≥ 3 in production
  • Choose partition keys to preserve ordering where needed
  • Monitor consumer lag; steadily growing lag indicates a processing bottleneck
  • Set appropriate retention based on replay requirements
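
Consumer lag, mentioned in the list above, is the gap between the newest offset in each partition and the offset the group has committed. The sketch below shows the arithmetic only; `ConsumerLagSketch` and `totalLag` are hypothetical names and the offsets are made-up sample values. In a running cluster, `kafka-consumer-groups.sh --describe --group <group>` reports the same figures per partition.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that sums consumer lag over a topic's partitions.
class ConsumerLagSketch {

    // Lag per partition = log end offset - committed offset, floored at zero.
    // Assumption: a partition without a committed offset is counted from offset 0.
    public static long totalLag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> end : logEndOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(end.getKey(), 0L);
            total += Math.max(0L, end.getValue() - committed);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Long> logEnd = new HashMap<>();
        logEnd.put(0, 120L);
        logEnd.put(1, 80L);
        logEnd.put(2, 200L);

        Map<Integer, Long> committed = new HashMap<>();
        committed.put(0, 100L);
        committed.put(1, 80L);
        committed.put(2, 150L);

        System.out.println(totalLag(logEnd, committed)); // prints 70
    }
}
```

A single reading says little on its own; sampling the total at intervals and watching whether it keeps rising is what reveals a bottleneck.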