May 6, 2014

Setting up your own Apache Kafka cluster with Vagrant – Step by Step

Apache Kafka is a distributed publish-subscribe messaging system that aims to be fast, scalable, and durable. If you want to just get up and running quickly with a cluster of Vagrant virtual machines configured with Kafka, take a look at this awesome blog post. It sets up all the VMs for you and configures each node in the cluster, in one fell swoop.

However, if you want to learn how to install and configure a Kafka cluster yourself, using your own Vagrant boxes, then read on. This step-by-step walk-through will guide you through building a Kafka cluster from the ground up, with vanilla Debian as a base. Kafka requires Apache Zookeeper, a service that coordinates distributed applications. In this walk-through, we will set up our first box from scratch. We will then package that box and use it as the base box for the other nodes in the cluster. When we’re finished, we’ll have a fully functional 3-node Zookeeper and Kafka cluster. It would probably be better practice to automate this with existing Chef recipes, but that’s hardly walk-through material. We are going to do it the simple, long-winded way, and I think you will find that it isn’t too painful. Onward!

Part I – Setting up a single Zookeeper/Kafka node, starting from a Vagrant base box

1. Download and install VirtualBox from virtualbox.org
Note: This walk-through uses a Vagrant base box that requires VirtualBox 4.2.10. If you already have Vagrant configured to work with VMware, there is a VMware Fusion version of the same base box. I will point it out in step 3 below.

2. Download and install Vagrant from vagrantup.com

3. Initialize a new Vagrant box. This particular box is vanilla Debian from Puppet Labs. I recommend creating it in a directory with a name that accurately describes what the box represents. If you are new to Vagrant, it’s easy to get carried away and wind up with an over-abundance of VMs on your machine.

mkdir debian-cluster-node-1
cd debian-cluster-node-1
vagrant init debian-cluster-node-1 http://puppet-vagrant-boxes.puppetlabs.com/debian-70rc1-x64-vbox4210.box

Or, if you’re using Vagrant with VMware:

vagrant init debian-cluster-node-1 http://puppet-vagrant-boxes.puppetlabs.com/debian-70rc1-x64-vf503.box

This will create a Vagrantfile in the directory. You use this file to configure your VM.

4. Edit the Vagrantfile to your liking
It’s a good idea to bump up the memory. 2048 should be sufficient.

config.vm.provider :virtualbox do |vb|
  vb.customize ["modifyvm", :id, "--memory", "2048"]
end

The only other setting of note is the private IP. This allows the host (your computer’s OS) and other VMs to access your new Vagrant box via a local network IP address.

Find the line

# config.vm.network :private_network, ip: "192.168.33.10"

Uncomment it, and change the IP address if you feel like it; otherwise, just leave it as is. I set mine to 192.168.33.21. I will be referring to that IP address throughout this walk-through.
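If you would rather script this edit than make it by hand, here is a minimal sketch. It assumes the Vagrantfile still contains the commented-out private_network line exactly as shown above. The function name set_private_ip is made up for this example and is not part of Vagrant.

```shell
# Hypothetical helper: uncomment the private_network line in a Vagrantfile
# and set its IP address. Assumes the line still looks like the default
# commented-out one shown in this step.
set_private_ip() {
  vagrantfile=$1
  ip=$2
  # Drop the leading "# " and swap in the new IP. A .bak backup is kept.
  sed -i.bak \
    "s|^\( *\)# *\(config\.vm\.network :private_network, ip: \)\"[^\"]*\"|\1\2\"$ip\"|" \
    "$vagrantfile"
}
```

Run it from the box's directory, for example: set_private_ip Vagrantfile 192.168.33.21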

5. Set up the Vagrant box
Start the box:

vagrant up

The first time takes quite a while, because it needs to download and unpack the box first.

Log in to the box:

vagrant ssh

Install dependencies (you only need Java, and you might want to install a text editor too)

sudo apt-get update
sudo apt-get install openjdk-7-jdk

For the following steps, change to the root user (sudo su).

4. Download, build, and install Kafka
I’ve had issues trying to get things up and running with just the binary download, so we’ll build from source. Even the Kafka Quick Start tells you to build from source, so that’s what we’re going to do here. Don’t worry, it’s easy.
Note: You don’t have to install it in /usr/local/kafka. You can put it wherever you want.

wget https://archive.apache.org/dist/kafka/kafka-0.8.0-beta1-src.tgz
mkdir /usr/local/kafka
tar -zxvf kafka-0.8.0-beta1-src.tgz
cd kafka-0.8.0-beta1-src
./sbt update
./sbt package
./sbt assembly-package-dependency
cd ../
mv kafka-0.8.0-beta1-src /usr/local/kafka

5. Install Zookeeper
Note: You don’t have to install it in /usr/local/zookeeper. You can put it wherever you want.

wget http://apache.claz.org/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
mkdir /usr/local/zookeeper
tar -zxvf zookeeper-3.4.6.tar.gz --directory /usr/local/zookeeper
cp /usr/local/zookeeper/zookeeper-3.4.6/conf/zoo_sample.cfg /usr/local/zookeeper/zookeeper-3.4.6/conf/zoo.cfg

6. Configure Zookeeper
Before configuring, create a directory for the Zookeeper data.

mkdir -p /var/zookeeper/data

Edit the Zookeeper configuration file, /usr/local/zookeeper/zookeeper-3.4.6/conf/zoo.cfg

Change the dataDir property to the directory you created above.

dataDir=/var/zookeeper/data

Find the list of servers that’s commented out. If these lines aren’t there, add them.

#server.1=zookeeper1:2888:3888
#server.2=zookeeper2:2888:3888
#server.3=zookeeper3:2888:3888

Uncomment the server.1 property, and change “zookeeper1” to the private IP address that you assigned to this VM.

server.1=192.168.33.21:2888:3888

Important step, often forgotten!
We need to create a myid file in the data directory.
Zookeeper uses a file named “myid” to identify itself within the cluster. It holds a single number between 1 and 255, which must be unique for each server. Let’s set it to 1.

echo "1" > /var/zookeeper/data/myid
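Since this step is easy to get wrong, a small guard around it can help. The sketch below is only an illustration: write_myid is a hypothetical helper, and it does nothing more than check that the id is a whole number from 1 to 255 before writing the file.

```shell
# Hypothetical helper: write a Zookeeper myid file after checking that the
# id is a whole number from 1 to 255.
write_myid() {
  data_dir=$1
  id=$2
  # Reject empty values and anything containing a non-digit.
  case $id in
    ''|*[!0-9]*) echo "myid must be a number: '$id'" >&2; return 1 ;;
  esac
  if [ "$id" -lt 1 ] || [ "$id" -gt 255 ]; then
    echo "myid must be between 1 and 255: $id" >&2
    return 1
  fi
  mkdir -p "$data_dir" && echo "$id" > "$data_dir/myid"
}
```

For this first box, that would be: write_myid /var/zookeeper/data 1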

7. Configure Kafka
If you followed the above installation instructions, the config directory will be here:
/usr/local/kafka/kafka-0.8.0-beta1-src/config
Edit the server.properties file
Take note of the broker.id value. Each Kafka instance will need to have a unique broker.id, just as each Zookeeper instance needs to have a distinct value in the myid file. Let’s set this to 1.

broker.id=1

Uncomment #host.name=localhost and set it to the private IP address of the VM.

host.name=192.168.33.21

Locate the zookeeper.connect property. The default setting is fine, but we will be adding more nodes as we build up the cluster.
Change “localhost” to the IP address of the VM.

zookeeper.connect=192.168.33.21:2181
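The same three properties get edited on every node, so it may be worth scripting them. Here is a rough sketch, assuming server.properties has one line per key, possibly commented out with a leading #. set_property is a made-up name, and the sed expression is only safe for simple values like the IPs and ports used here (no | or & characters).

```shell
# Hypothetical helper: set key=value in a Java properties file.
# If the key is already present (even commented out with a leading #),
# that line is replaced; otherwise the setting is appended.
# Note: dots in the key are treated as regex wildcards, which is loose
# but harmless for keys like broker.id and host.name.
set_property() {
  file=$1
  key=$2
  value=$3
  if grep -q "^#\{0,1\}$key=" "$file"; then
    sed -i.bak "s|^#\{0,1\}$key=.*|$key=$value|" "$file"
  else
    echo "$key=$value" >> "$file"
  fi
}
```

For this node, the three edits above would become:

set_property $KAFKA_HOME/config/server.properties broker.id 1
set_property $KAFKA_HOME/config/server.properties host.name 192.168.33.21
set_property $KAFKA_HOME/config/server.properties zookeeper.connect 192.168.33.21:2181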

8. Test the current setup
You probably want to add these to your ~/.bash_profile first

export ZK_HOME=/usr/local/zookeeper/zookeeper-3.4.6/
export KAFKA_HOME=/usr/local/kafka/kafka-0.8.0-beta1-src/
export PATH=$ZK_HOME/bin:$KAFKA_HOME/bin:$PATH

Start Zookeeper

sudo $ZK_HOME/bin/zkServer.sh start

Start Kafka

sudo $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &

Test Kafka
List topics (should not have any to start with)

$KAFKA_HOME/bin/kafka-list-topic.sh --zookeeper 192.168.33.21:2181

Create a new topic

$KAFKA_HOME/bin/kafka-create-topic.sh --zookeeper 192.168.33.21:2181 --replica 1 --partition 1 --topic topic-1

Produce messages to that topic from the console

$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list 192.168.33.21:9092 --topic topic-1
Hi
My
Name
Is
Kafka

(ctrl-c to kill the console producer)
Run the console consumer to verify that the messages are there for the new topic

$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper 192.168.33.21:2181 --topic topic-1 --from-beginning

You should see the output:

Hi
My
Name
Is
Kafka

9. Package the box
Assuming that everything works, it’s time to package up this box so that we can use it as our new base box for the other VMs in the cluster.

On your host, find the name of your current VM.

VBoxManage list vms

Mine happens to be “vagrant_default_1399123653833_13594”.
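VBoxManage prints each VM as a quoted name followed by a UUID in braces. If you want just the names (to paste into the next command, say), a small filter does it. This is only a sketch, and vm_names is a name I made up.

```shell
# Hypothetical helper: read `VBoxManage list vms` output on stdin and print
# only the VM names, one per line. Each input line looks like:
#   "vagrant_default_1399123653833_13594" {some-uuid}
vm_names() {
  sed -n 's/^"\(.*\)" {.*}$/\1/p'
}
```

Usage: VBoxManage list vms | vm_names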

Now package it up into a box.

vagrant package --base vagrant_default_1399123653833_13594 --output debian-cluster.box

Put the box in a more easily recognizable location.

mkdir ~/boxes
mv debian-cluster.box ~/boxes

10. Shut down the VM

vagrant halt

Part II – Adding new nodes to the cluster from the newly created base box

1. Make a directory for a new cluster node and cd to it. “debian-cluster-node-2” sounds good to me.

vagrant init debian-cluster-node-2 ~/boxes/debian-cluster.box

2. Edit the Vagrantfile. Do NOT overwrite it with the Vagrantfile from your other box.
Set the memory to 2048 and set the private IP address to something different this time. I will use this: 192.168.33.22

3. Start up the new box and log in

vagrant up
vagrant ssh

4. Edit the Kafka config settings
If you set $KAFKA_HOME in your .bash_profile before packaging the box in Part I of this walk-through, it will be here:
$KAFKA_HOME/config/server.properties

Set the following properties:

broker.id=2
host.name=192.168.33.22

Leave the Zookeeper settings alone for now.

5. In another terminal window, start your first Vagrant box up again and log in. (Cluster Node 1)

vagrant up
vagrant ssh

6. Start Zookeeper and Kafka

sudo $ZK_HOME/bin/zkServer.sh start
sudo $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &

7. Go back to your newly created VM for your second cluster node and start Kafka (Cluster Node 2)

sudo $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &

That’s it! Your Kafka servers are now clustered together.
To test, go back to the terminal window for node 1.
Produce some messages to the topic that you created earlier, but this time use your new VM as the broker.

$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list 192.168.33.22:9092 --topic topic-1
Hello
From
Broker 2

(ctrl-c)

Check to see that your messages were successfully produced

$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper 192.168.33.21:2181 --topic topic-1 --from-beginning

You should be able to produce messages to either broker now, or you can pass in both brokers to the console producer:

$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list 192.168.33.21:9092,192.168.33.22:9092 --topic topic-1

What about Zookeeper?

Zookeeper uses a “majority rule” strategy to make its decisions. If we were to set up a 2-server Zookeeper cluster and 1 server died, only 1 of the 2 would remain, which is not enough to be a “majority.” See this post for a better explanation.
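The arithmetic behind that is short enough to write down. A majority of n servers is floor(n / 2) + 1, and the cluster survives as long as that many are still up. The function names below are just for illustration.

```shell
# Majority needed for an ensemble of n servers: floor(n / 2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# How many servers can fail while a majority is still running.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
  echo "$n servers: quorum $(quorum_size "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

A 2-server cluster needs both servers for a majority, so it tolerates no failures. A 3-server cluster needs 2, so it can lose one. That is why we go straight from 1 Zookeeper server to 3.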

Now let’s add a third node so that we can configure a 3-node Zookeeper cluster.
Follow steps 1-3 above, but name this node “debian-cluster-node-3”, and give it a different private IP in the Vagrantfile. I will use 192.168.33.23. At step 4, we’ll do things a little differently, so come back here when you’ve finished steps 1-3.

4. Edit the Kafka server properties

$KAFKA_HOME/config/server.properties

Just as we did for the second node, we set the broker.id and host.name properties.

broker.id=3
host.name=192.168.33.23

This time, since we will have a 3-node Zookeeper cluster, we will also edit the zookeeper.connect property.

zookeeper.connect=192.168.33.21:2181,192.168.33.22:2181,192.168.33.23:2181

At this time, go back and edit the server.properties file in your other two boxes and set the zookeeper.connect property to be the same as what you have here.
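Because the same value has to be typed on all three boxes, it can help to build it once from the list of IPs. A small sketch, assuming every Zookeeper server listens on the client port 2181 used throughout this walk-through; zk_connect_string is a hypothetical name.

```shell
# Hypothetical helper: join IP addresses into a zookeeper.connect value,
# appending the client port 2181 to each one.
zk_connect_string() {
  result=""
  for ip in "$@"; do
    # Add a comma separator only when result is already non-empty.
    result="${result:+$result,}$ip:2181"
  done
  echo "$result"
}
```

Usage: zk_connect_string 192.168.33.21 192.168.33.22 192.168.33.23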

5. Edit the Zookeeper config (for all servers)
We ignored this step when setting up the second node in the cluster, because we didn’t have enough servers for a proper Zookeeper cluster yet. We’re going to have to go back and take care of that now.

In all three of your servers, open the $ZK_HOME/conf/zoo.cfg file and make sure you have the following:

server.1=192.168.33.21:2888:3888
server.2=192.168.33.22:2888:3888
server.3=192.168.33.23:2888:3888

6. Set the myid file for the second and third servers
Remember that 1-character long file we created on the first box?
We need to do the same thing for the second and third servers, or our Zookeeper cluster will not work.
Since we used “1” for the first server, let’s keep it simple for the other servers.

On your second server:

echo "2" > /var/zookeeper/data/myid

On your third server:

echo "3" > /var/zookeeper/data/myid
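Steps 5 and 6 both come from one ordered list of IP addresses: the Nth address becomes server.N in zoo.cfg, and N is what goes in that server’s myid file. Here is a small sketch of generating the server lines. zk_server_lines is a hypothetical helper, and 2888 and 3888 are simply the ports used in this walk-through.

```shell
# Hypothetical helper: print one server.N line per IP address, numbering
# them in the order given. N is also the value for that server's myid file.
zk_server_lines() {
  n=1
  for ip in "$@"; do
    echo "server.$n=$ip:2888:3888"
    n=$((n + 1))
  done
}
```

Usage: zk_server_lines 192.168.33.21 192.168.33.22 192.168.33.23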

7. Shut them all down, hurry!
Ok, no hurry, but let’s shut down all the boxes and then bring them up one at a time, just to be sure we’re starting fresh.

For each VM

exit
vagrant halt
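If typing this three times gets old, the loop can be run from the host instead. This is a rough sketch, assuming all three node directories (debian-cluster-node-1 through debian-cluster-node-3) sit under one parent directory. each_node and VAGRANT_BIN are names I made up for this example.

```shell
# Hypothetical helper: run one vagrant subcommand in each node directory.
# Assumes debian-cluster-node-1..3 all live under the directory given as
# the first argument. VAGRANT_BIN is a made-up override for dry runs.
each_node() {
  parent=$1
  shift
  for n in 1 2 3; do
    # Subshell so the cd does not leak into the caller's working directory.
    ( cd "$parent/debian-cluster-node-$n" && "${VAGRANT_BIN:-vagrant}" "$@" ) || return 1
  done
}
```

For example, if the node directories were under ~/vagrant (an assumed location): each_node ~/vagrant halt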

8. Start them all up again

vagrant up
vagrant ssh

9. Start Zookeeper and Kafka on each server
Zookeeper first
It’s a good idea to start up all the Zookeeper instances first before starting Kafka, so for each VM:

sudo $ZK_HOME/bin/zkServer.sh start
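Before starting Kafka, it can be worth checking that the Zookeeper servers actually formed a quorum. zkServer.sh has a status command that reports a line such as "Mode: leader" or "Mode: follower" once the ensemble is up. The zk_mode filter below is a hypothetical helper that pulls out just that word, and the sample status line in its comment is illustrative.

```shell
# Hypothetical helper: read `zkServer.sh status` output on stdin and print
# only the mode (for example leader, follower, or standalone).
# The relevant line of the status output looks like:
#   Mode: follower
zk_mode() {
  sed -n 's/^Mode: //p'
}
```

Usage: sudo $ZK_HOME/bin/zkServer.sh status 2>&1 | zk_mode

With all three servers running, you would expect one leader and two followers. If nothing is printed, the server has probably not joined the cluster yet.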

Now start Kafka on each node.

sudo $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &

Your cluster should be in full swing now!

Test again with the console producer, this time using the third node as the broker.

$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list 192.168.33.23:9092 --topic topic-1
Hello
From
Broker
3

(ctrl-c)

And then use the console consumer to read the topic. This time, use one of your new Zookeeper nodes for the --zookeeper argument.

$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper 192.168.33.23:2181 --topic topic-1 --from-beginning

Now let’s create a new replicated topic and produce some messages to it.

$KAFKA_HOME/bin/kafka-create-topic.sh --zookeeper 192.168.33.22:2181 --replica 3 --partition 1 --topic replicated-topic-1
$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list 192.168.33.23:9092 --topic replicated-topic-1
I
Am
A
Replicated
Topic

Now consume the new topic from one of your other servers.

$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper 192.168.33.21:2181 --topic replicated-topic-1 --from-beginning

Play around with producing to different brokers and consuming with different zookeepers. Hopefully, it all works!

You can do A LOT with Zookeeper and Kafka. The purpose of this walk-through is just to get you to a point where you can be ready to explore all of Kafka’s goodness within a clustered environment. For more information, please read the documentation.

http://kafka.apache.org/documentation.html
http://zookeeper.apache.org/doc/r3.4.6/

Cheers!


One thought on “Setting up your own Apache Kafka cluster with Vagrant – Step by Step”

  1. Assaf Epstein says:

    sconnelly, thanks for the helpful post!
    though I would say only half the job done.
    Before implementing it myself, I would like to ask the author sconnelly, or any other reader of this comment, for a script which I could run which will receive a list of IPs

    ./createZkClusterVMachines.sh znode1 znode2 … znodeN

    and will create a VM image for each of the provided IPs, where each VM will have all configurations set and ready for execution of zkServer as part of a known quorum:

    1. in each machine, $ZK_HOME/conf/zoo.cfg will have the list:

    server.1=znode1:2888:3888
    server.2=znode2:2888:3888

    server.N=znodeN:2888:3888

    2. in each machine, /var/zookeeper/data/myid will have its own unique id

    At the end of the script’s execution we’ll have N directories, each one describing an image of a znode. All we have to do is, for each znode, cd into its directory and:

    start up machine: vagrant up
    login into machine: vagrant ssh
    run zkServer: sudo $ZK_HOME/bin/zkServer.sh start

    next step will be combining all the executions into one automatic script.

    I’m sure someone has done this already, so here I am asking, so as not to reinvent the wheel.

    Thanks!!

    1. Steve Connelly says:

      Hi Assaf, thank you for your comment. The point of this blog post is simply to highlight all of the steps that are needed in order to set up a Kafka cluster. My hope is that, after reading the blog post, the reader will have a grasp of the basic essentials needed to write his/her own script. If you want something that does it all in one fell swoop, read this blog post here, which I link to in the first paragraph:
      http://allthingshadoop.com/2013/12/07/using-vagrant-to-get-up-and-running-with-apache-kafka/

      1. Ben Yelsey says:

        Hi Steve! Thanks for writing this.

        I worked through your steps with Kafka v0.8.1.1 and there were just a few minor changes I thought I’d share:

        Part I Step 4 update for downloading/building Kafka:
        wget http://apache.mirrors.tds.net/kafka/0.8.1.1/kafka-0.8.1.1-src.tgz
        tar -zxvf kafka-0.8.1.1-src.tgz
        sudo mkdir /usr/local/kafka
        sudo mv kafka-0.8.1.1-src /usr/local/kafka
        sudo /usr/local/kafka/kafka-0.8.1.1-src/gradlew jar

        Part I Step 8 updates:
        Change to ~/.bash_profile:
        export KAFKA_HOME=/usr/local/kafka/kafka-0.8.1.1-src/

        Some updates to bin/ commands (now use bin/kafka-topics.sh for all topic-related needs)
        list topics:
        $KAFKA_HOME/bin/kafka-topics.sh --zookeeper 192.168.33.21:2181 --list
        create topic:
        $KAFKA_HOME/bin/kafka-topics.sh --zookeeper 192.168.33.21:2181 --create --replication-factor 1 --partitions 1 --topic topic-1

        Part II Step 9 update:
        create replicated topic:
        $KAFKA_HOME/bin/kafka-topics.sh --zookeeper 192.168.33.22:2181 --create --replication-factor 3 --partitions 1 --topic replicated-topic-1

        1. Ben Yelsey says:

          Hi Steve,
          Sorry to bother you about this, but it would be great if the “code” sections of my reply were monospaced etc. Wasn’t sure how to do that from the “Reply” form. Feel free to delete this reply…

  2. Steve Connelly says:

    Thanks for the update Ben!

    I’ll see if I can format the comments. If not, I will just append the update into the original post and give you credit.

  3. Steve Connelly says:

    Hello again Ben. Unfortunately, I’m unable to edit the post or comments. But thanks so much for the update!

  4. Hardik says:

    Hi Steve,

    Thanks for the great article. Can you use the Vagrant box when the host OS is Windows? I am not too sure about the LAN part.

    Thanks,
    Hardik

  5. Srikanth says:

    In order to use jConsole to read JMX mBeans for Kafka
    I added the following to bin/kafka-run-class.sh

    if [ -z "$KAFKA_JMX_OPTS" ]; then
    KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=192.168.33.110 -Dcom.sun.management.jmxremote.port=9999"
    fi

  6. Fergus says:

    Thanks Steve for this.
    However, a certain amount of this has now (Mar 2016) been superseded by the availability of the latest Kafka release packages on the Ubuntu Precise64 Vagrant box.
    The following steps can be executed on a Vagrant VM (or in the Vagrantfile).
    http://www.bogotobogo.com/Hadoop/BigData_hadoop_Zookeeper_Kafka.php
    It is necessary, though, to use Oracle Java 1.8 (I haven’t tried OpenJDK 1.7) rather than the default JRE (1.6) suggested.
    Regards

    1. Fergus says:

      The following provisioning in the Vagrantfile should work to start you off:

      config.vm.provision "shell",
      inline: "
      sudo apt-get -y update
      sudo apt-get -y install git
      sudo apt-get install -y oracle-java8-installer
      sudo apt-get install -y oracle-java8-set-default
      sudo useradd -m kafka
      sudo adduser kafka sudo
      sudo apt-get install -y zookeeperd
      cd
      mkdir kafka
      cd kafka
      wget http://www-eu.apache.org/dist/kafka/0.9.0.1/kafka_2.11-0.9.0.1.tgz
      tar xvzf kafka_2.11-0.9.0.1.tgz --strip 1
      "

  7. chitra says:

    Thank you for all detailed steps here.
