A Weekend with Kafka

As a teenager, I discovered the writings of Franz Kafka when I read his short story A Hunger Artist. (To this day, that is far and away one of my favorite pieces ever.) This post, however, is going to be much less literary in nature. Instead, it deals with the joys of stream processing.

So ya, I spent some time this weekend trying to understand Apache Kafka. This post describes some of what I did and how I went about it. Specifically, I talk about getting an Apache Kafka server set up, and then interacting with it from Python scripts.

Setting Up the Server

Even though this post isn’t about Apache Samza, I do eventually want to work with Samza.

The easiest way to get started working with Apache Samza and Apache Kafka is to use a preconfigured Docker image as a build environment. Once we have that, the best way I found to provision a simple server “grid” (which must contain Apache YARN, Apache ZooKeeper, and Apache Kafka) is to utilize a script called (appropriately enough) “grid” that comes with Apache’s hello-samza example project.

Conveniently, a preconfigured Docker image is available on Docker Hub for this.

Pull the image down and get it going by running the command:

docker run --rm --name hello-samza --net host -it -p 8088:8088 anaerobic/hello-samza bash
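(One note on that command: because of --net host, the container shares the host’s network stack, which means Docker actually ignores the -p 8088:8088 mapping; the port mapping would only matter if you dropped --net host.)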

The next step is running that “grid” script. As I mentioned, it will automatically download and install ZooKeeper, Kafka, and YARN, then check out Samza itself and compile it.

Complete instructions are provided in the hello-samza readme. Follow them, and make sure the machine hosting your Docker image has at least 4 GB of RAM available.

Finally, once you have everything compiled, start the servers by executing this command:

./bin/grid start all
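(When you are done, the same grid script should also be able to shut everything back down with ./bin/grid stop all.)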

The Importance of Getting Started Somewhere

I think the best place to get started with all this is to try to actually interact with the Kafka server. When you are dealing with a “technology stack” that you are not familiar with, you have to start wrapping your head around the concepts somewhere.

In this instance, my initial investigations led me to approach it like this:

  • Apache YARN / Apache ZooKeeper: these components are specialized frameworks for distributed coordination and resource management, used under the hood by Kafka and Samza. While potentially interesting in their own right, for my purposes right now I don’t care about them. As I said above, after a little work, our Docker-based build environment can be set up to host them, at which point all we care about is that they are running.
  • Kafka – It appears to me that while the implementation is surely complicated, Kafka is actually pretty simple to understand conceptually: it is essentially a key-value store that can be written to by producers and read from by consumers.
  • Samza appears to me to be more difficult to understand initially. Wikipedia describes it as an “asynchronous computational framework for stream processing.” I don’t think it is a good place to just start “hacking code.” Instead, the first thing to do is to try to understand the example code, which Apache nicely provides a code walkthrough for here.

Therefore, as far as I can tell, the easiest way to get started is to just use Python to interact with Kafka, treating it as “yet another” way to store key-value pairs.

Additionally, because Kafka is set up around producers and consumers, we will just develop two simple Python scripts: one to feed data into Kafka, and another to read it back out.

Using Python

For ease of use, I just set up Python to run locally within the Docker instance. That way, we can create a simple Kafka producer and a simple Kafka consumer in only a few lines of code.

Due to the magic of modern package management, I just did that by running:

apt-get install python3-pip
pip3 install kafka-python
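
Before writing the actual scripts, a quick sanity check is handy. Here is a minimal sketch, assuming the broker from the grid setup is listening on localhost:9092:

from kafka import KafkaConsumer

# Ask the broker which topics it currently knows about. If nothing is
# listening on localhost:9092, kafka-python raises NoBrokersAvailable.
consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
print(consumer.topics())

If that prints a set of topic names (even just an empty set) instead of throwing an error, the scripts below have something to talk to.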

Kafka-Producer-Test.py

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('test_topic', b'Line1')  # send() is asynchronous
producer.send('test_topic', b'Line2')
producer.flush()  # block until the queued messages actually reach the broker
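
Incidentally, since I described Kafka above as a key-value store: kafka-python also lets you attach a key to each message. A small sketch, using the same test_topic (the key here is just a made-up example):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
# Messages sharing a key are routed to the same partition, which
# preserves their relative ordering.
producer.send('test_topic', key=b'user-42', value=b'Line1')
producer.flush()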

Kafka-Consumer-Test.py

from kafka import KafkaConsumer

# Start from the beginning of the topic rather than only new messages.
consumer = KafkaConsumer('test_topic', bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest')
for message in consumer:
    print(message)
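
Each message that comes back is a ConsumerRecord (a named tuple) carrying the topic, partition, offset, key, and value, so printing message.value instead of the whole record gives you just the payload. To test the pair, start the consumer in one terminal and then run the producer in another.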

The Bad News

So far I haven’t been able to get the two Python scripts above to actually work, but what would life be without some challenges!

So I will continue to debug my setup, and update this post once I make some headway.

Summing it Up

I wrote this post just to document my process as I started to investigate an unfamiliar technology stack.

My approach when doing something like this is heavily influenced by my experience as a student, and my aspirations to be a teacher.

As a student, I have most recently been affected by the amazing experience of participating in Harvard’s CS50. In particular, I really like the way the class balances theoretical study with hands-on “learning by doing,” literally from day one. Therefore, whenever I am learning something new, I approach it by:

  1. Reading Books / Wikipedia / Articles / Videos, etc.
  2. Actually writing little programs apropos of what I am trying to learn about, even if they are just throw-away ones. (And as a bonus: try to utilize your own creativity to think of interesting toy programs to write!)

Additionally, as an aspiring teacher, I am practicing writing blog posts like this one, explaining how to do various kinds of programming tasks. I have found that adding this writing component to my learning process really helps me clarify and refine my own thinking and understanding, and I really enjoy it. (You can see other programming blog posts I wrote here.)