Setting up your own Apache Kafka cluster with Vagrant – Step by Step
Apache Kafka is a distributed publish-subscribe messaging system that aims to be fast, scalable, and durable. If you want to just get up and running quickly with a cluster of Vagrant virtual machines configured with Kafka, take a look at this awesome blog post. It sets up all the VMs for you and configures each node in the cluster, in one fell swoop.
However, if you want to learn how to install and configure a Kafka cluster yourself, utilizing your own Vagrant boxes, then read on. This step-by-step walk-through will guide you through building a Kafka cluster from the ground up, with vanilla Debian as a base. Kafka requires Apache Zookeeper, a service that coordinates distributed applications. In this walk-through, we will set up our first box from scratch. We will then package that box and use it as the base box for the other nodes in the cluster. When we’re finished, we’ll have a fully functional 3-node Zookeeper and Kafka cluster. It would probably be better practice to automate this with existing Chef recipes, but that’s hardly walk-through material. We are going to do it the simple, long-winded way. And I think you will find that it isn’t too painful. Onward!
Part I – Setting up a single Zookeeper/Kafka node, starting from a Vagrant base box
1. Download and install VirtualBox from virtualbox.org
Note: This walk-through uses a Vagrant base box that requires VirtualBox 4.2.10. If you already have Vagrant configured to work with VMware, there is a VMware Fusion version of the same base box. I will point it out in step 3 below.
2. Download and install Vagrant from vagrantup.com
3. Initialize a new Vagrant box. This particular box is vanilla Debian from Puppet Labs. I recommend creating it in a directory with a name that accurately describes what the box represents. If you are new to Vagrant, it’s easy to get carried away and wind up with an over-abundance of VMs on your machine.
mkdir debian-cluster-node-1
cd debian-cluster-node-1
vagrant init debian-cluster-node-1 http://puppet-vagrant-boxes.puppetlabs.com/debian-70rc1-x64-vbox4210.box
Or, if you’re using Vagrant with VMware:
vagrant init debian-cluster-node-1 http://puppet-vagrant-boxes.puppetlabs.com/debian-70rc1-x64-vf503.box
This will create a Vagrantfile in the directory. You use this file to configure your VM.
4. Edit the Vagrantfile to your liking
It’s a good idea to bump up the memory. 2048 should be sufficient.
config.vm.provider :virtualbox do |vb|
  vb.customize ["modifyvm", :id, "--memory", "2048"]
end
The only other setting of note is the private IP. This allows the host (your computer’s OS) and other VMs to access your new Vagrant box via a local network IP address.
Find the line
# config.vm.network :private_network, ip: "192.168.33.10"
Uncomment it, and change the IP address if you feel like it, otherwise just leave it as is. I set mine to 192.168.33.21. I will be referring to that IP address throughout this walk-through.
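Put together, the two settings from this step might look like the sketch below. It writes a minimal Vagrantfile from the shell, assuming the `Vagrant.configure("2")` syntax and the box name and IP used in this walk-through. The optional directory argument is my own addition; without it, the file goes to a scratch directory so it won't overwrite the Vagrantfile that `vagrant init` generated.

```shell
#!/bin/sh
# Sketch: write a minimal Vagrantfile with the memory and private-IP settings.
# Assumes the box name and IP from this walk-through; adjust to taste.
target_dir="${1:-$(mktemp -d)}"

cat > "$target_dir/Vagrantfile" <<'EOF'
Vagrant.configure("2") do |config|
  config.vm.box = "debian-cluster-node-1"

  # Private network IP reachable from the host and from the other VMs
  config.vm.network :private_network, ip: "192.168.33.21"

  # Give the VM 2 GB of memory
  config.vm.provider :virtualbox do |vb|
    vb.customize ["modifyvm", :id, "--memory", "2048"]
  end
end
EOF

echo "Wrote $target_dir/Vagrantfile"
```

Compare the result with your generated Vagrantfile rather than copying it over blindly; the generated file contains plenty of commented-out options that are worth keeping around.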
5. Set up the Vagrant box
Start the box:
vagrant up
The first time takes quite a while. It needs to download and unpack the box first.
Log in to the box:
vagrant ssh
Install dependencies (you only need Java, and you might want to install a text editor too)
sudo apt-get update
sudo apt-get install openjdk-7-jdk
For the following steps, change to the root user (sudo su)
4. Download, build, and install Kafka
I’ve had issues trying to get things up and running with just the binary download, so we’ll build from source. Even the Kafka Quick Start tells you to build from source, so that’s what we’re going to do here. Don’t worry, it’s easy.
Note: You don’t have to install it in /usr/local/kafka. You can put it wherever you want.
wget https://archive.apache.org/dist/kafka/kafka-0.8.0-beta1-src.tgz
mkdir /usr/local/kafka
tar -zxvf kafka-0.8.0-beta1-src.tgz
cd kafka-0.8.0-beta1-src
./sbt update
./sbt package
./sbt assembly-package-dependency
cd ../
mv kafka-0.8.0-beta1-src /usr/local/kafka
5. Install Zookeeper
Note: You don’t have to install it in /usr/local/zookeeper. You can put it wherever you want.
wget http://apache.claz.org/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
mkdir /usr/local/zookeeper
tar -zxvf zookeeper-3.4.6.tar.gz --directory /usr/local/zookeeper
cp /usr/local/zookeeper/zookeeper-3.4.6/conf/zoo_sample.cfg /usr/local/zookeeper/zookeeper-3.4.6/conf/zoo.cfg
6. Configure Zookeeper
Before configuring, create a directory for the Zookeeper data.
mkdir -p /var/zookeeper/data
Edit the Zookeeper configuration file, /usr/local/zookeeper/zookeeper-3.4.6/conf/zoo.cfg
Change the dataDir property to the directory you created above.
dataDir=/var/zookeeper/data
Find the list of servers that’s commented out. If these lines aren’t there, add them.
#server.1=zookeeper1:2888:3888
#server.2=zookeeper2:2888:3888
#server.3=zookeeper3:2888:3888
Uncomment the server.1 property, and change “zookeeper1” to the private IP address that you assigned to this VM.
server.1=192.168.33.21:2888:3888
Important step, often forgotten!
We need to create a myid file in the data directory.
Zookeeper uses a file named “myid” to identify itself within the cluster. It holds a single integer between 1 and 255 that must be unique for each server. Let’s set it to 1.
echo "1" > /var/zookeeper/data/myid
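Since this step is so easy to get wrong, a quick sanity check can help. The sketch below is not part of Zookeeper; `valid_myid` is a hypothetical helper that only checks that a value is a whole number between 1 and 255.

```shell
#!/bin/sh
# Sketch: check that a myid value is a whole number between 1 and 255.
valid_myid() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 255 ]
}

# Hypothetical usage: pass the path to a myid file as the first argument.
myid_file="${1:-}"
if [ -n "$myid_file" ]; then
  if valid_myid "$(cat "$myid_file")"; then
    echo "myid OK: $(cat "$myid_file")"
  else
    echo "myid invalid in $myid_file" >&2
  fi
fi
```

Saved as, say, check-myid.sh (the name is just an example), you would run it as `sh check-myid.sh /var/zookeeper/data/myid`.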
7. Configure Kafka
If you followed the above installation instructions, the config directory will be here:
/usr/local/kafka/kafka-0.8.0-beta1-src/config
Edit the server.properties file
Take note of the broker.id value. Each Kafka instance will need to have a unique broker.id, just as each Zookeeper instance needs to have a distinct value in the myid file. Let’s set this to 1.
broker.id=1
Uncomment #host.name=localhost and set it to the private IP address of the VM.
host.name=192.168.33.21
Locate the zookeeper.connect property. The default setting is fine, but we will be adding more nodes as we build up the cluster.
Change “localhost” to the IP address of the VM.
zookeeper.connect=192.168.33.21:2181
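If you would rather script these three edits than make them by hand, here is one way it could look. This is a sketch, not something that ships with Kafka: `set_prop` is a hypothetical helper, and the demo runs against a scratch file that mimics the three lines we care about instead of touching your real server.properties.

```shell
#!/bin/sh
# Sketch: set key=value in a Java-style properties file.
# Rewrites every line that sets the key (commented out or not); appends if absent.
# Values containing "|" or "&" would need escaping; IPs and ports are fine.
set_prop() {
  file="$1"; key="$2"; value="$3"
  key_re="$(printf '%s' "$key" | sed 's/\./\\./g')"   # escape dots for the regex
  if grep -q "^#\{0,1\}${key_re}=" "$file"; then
    sed "s|^#\{0,1\}${key_re}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
      mv "$file.tmp" "$file"
  else
    echo "${key}=${value}" >> "$file"
  fi
}

# Demo on a scratch file that mimics the three lines we care about
demo="$(mktemp)"
printf '%s\n' 'broker.id=0' '#host.name=localhost' 'zookeeper.connect=localhost:2181' > "$demo"

set_prop "$demo" broker.id 1
set_prop "$demo" host.name 192.168.33.21
set_prop "$demo" zookeeper.connect 192.168.33.21:2181
cat "$demo"
```

To use it for real, point `set_prop` at the server.properties file in the config directory instead of the scratch file, and keep a backup copy first.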
8. Test the current setup
You probably want to add these to your ~/.bash_profile first
export ZK_HOME=/usr/local/zookeeper/zookeeper-3.4.6/
export KAFKA_HOME=/usr/local/kafka/kafka-0.8.0-beta1-src/
export PATH=$ZK_HOME/bin:$KAFKA_HOME/bin:$PATH
Start Zookeeper
sudo $ZK_HOME/bin/zkServer.sh start
Start Kafka
sudo $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &
Test Kafka
List topics (should not have any to start with)
$KAFKA_HOME/bin/kafka-list-topic.sh --zookeeper 192.168.33.21:2181
Create a new topic
$KAFKA_HOME/bin/kafka-create-topic.sh --zookeeper 192.168.33.21:2181 --replica 1 --partition 1 --topic topic-1
Produce messages to that topic from the console
$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list 192.168.33.21:9092 --topic topic-1
Hi My Name Is Kafka
(ctrl-c to kill the console producer)
Run the console consumer to verify that the messages are there for the new topic
$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper 192.168.33.21:2181 --topic topic-1 --from-beginning
You should see the output
Hi My Name Is Kafka
9. Package the box
Assuming that everything works, it’s time to package up this box so that we can use it as our new base box for the other VMs in the cluster.
On your host, find the name of your current VM.
VBoxManage list vms
Mine happens to be “vagrant_default_1399123653833_13594”
Now package it up into a box.
vagrant package --base vagrant_default_1399123653833_13594 --output debian-cluster.box
Put the box in a more easily recognizable location.
mkdir ~/boxes
mv debian-cluster.box ~/boxes
10. Shut down the VM
vagrant halt
Part II – Adding new nodes to the cluster from the newly created base box
1. Make a directory for a new cluster node and cd to it. “debian-cluster-node-2” sounds good to me.
mkdir debian-cluster-node-2
cd debian-cluster-node-2
vagrant init debian-cluster-node-2 ~/boxes/debian-cluster.box
2. Edit the Vagrantfile. Do NOT overwrite it with the Vagrantfile from your other box.
Set the memory to 2048 and set the private IP address to something different this time. I will use this: 192.168.33.22
3. Start up the new box and log in
vagrant up
vagrant ssh
4. Edit the Kafka config settings
If you set $KAFKA_HOME in your .bash_profile before packaging the box in Part I of this walk-through, it will be here:
$KAFKA_HOME/config/server.properties
Set the following properties:
broker.id=2
host.name=192.168.33.22
Leave the Zookeeper settings alone for now.
5. In another terminal window, start your first Vagrant box up again and log in. (Cluster Node 1)
vagrant up
vagrant ssh
6. Start Zookeeper and Kafka
sudo $ZK_HOME/bin/zkServer.sh start
sudo $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &
7. Go back to your newly created VM for your second cluster node and start Kafka (Cluster Node 2)
sudo $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &
That’s it! Your Kafka servers are now clustered together.
To test, go back to the terminal window for node 1.
Produce some messages to the topic that you created earlier, but this time use your new VM as the broker.
$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list 192.168.33.22:9092 --topic topic-1
Hello From Broker 2
(ctrl-c)
Check to see that your messages were successfully produced
$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper 192.168.33.21:2181 --topic topic-1 --from-beginning
You should be able to produce messages to either broker now, or you can pass in both brokers to the console producer:
$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list 192.168.33.21:9092,192.168.33.22:9092 --topic topic-1
What about Zookeeper?
Zookeeper uses a “majority rule” strategy to make its decisions. If we were to set up a 2-server Zookeeper cluster and 1 server died, only 1 out of 2 would remain, which is not enough to be a “majority,” so a 2-server cluster cannot survive any failure. With 3 servers, 2 out of 3 remain after a single failure, which is still a majority. See this post for a better explanation.
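The arithmetic behind that is simple enough to check in a few lines of shell. This is just a sketch of the majority calculation, with function names of my own choosing; it does not talk to Zookeeper at all.

```shell
#!/bin/sh
# Sketch: majority quorum size and tolerated failures for N servers.
quorum_size() { echo $(( $1 / 2 + 1 )); }
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "servers=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Note that going from 3 servers to 4 raises the quorum size without raising the number of failures you can survive, which is why odd-sized clusters are the usual choice.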
Now let’s add a third node so that we can configure a 3-node Zookeeper cluster.
Follow steps 1-3 above, but name this node “debian-cluster-node-3”, and give it a different private IP in the Vagrantfile. I will use 192.168.33.23. At step 4, we’ll do things a little differently, so come back here when you’ve finished steps 1-3.
4. Edit the Kafka server properties
$KAFKA_HOME/config/server.properties
Just as we did for the second node, we set the broker.id and host.name properties.
broker.id=3
host.name=192.168.33.23
This time, since we will have a 3-node Zookeeper cluster, we will also edit the zookeeper.connect property.
zookeeper.connect=192.168.33.21:2181,192.168.33.22:2181,192.168.33.23:2181
At this time, go back and edit the server.properties file in your other two boxes and set the zookeeper.connect property to be the same as what you have here.
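Because the same value has to be typed into three different files, it can be less error-prone to generate it once and paste it. A minimal sketch, assuming the default client port 2181 on every node; `zk_connect` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: join Zookeeper hosts into a zookeeper.connect value.
# Assumes every node listens on the default client port, 2181.
zk_connect() {
  port=2181
  out=""
  for host in "$@"; do
    out="${out:+$out,}${host}:${port}"   # add a comma only between entries
  done
  echo "$out"
}

echo "zookeeper.connect=$(zk_connect 192.168.33.21 192.168.33.22 192.168.33.23)"
```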
5. Edit the Zookeeper config (for all servers)
We ignored this step when setting up the second node in the cluster, because we didn’t have enough servers for a proper Zookeeper cluster yet. We’re going to have to go back and take care of that now.
In all three of your servers, open up $ZK_HOME/conf/zoo.cfg file, and make sure you have the following:
server.1=192.168.33.21:2888:3888
server.2=192.168.33.22:2888:3888
server.3=192.168.33.23:2888:3888
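The server list follows a regular pattern, so for a bigger cluster you could generate it instead of typing it. The sketch below assumes the servers are numbered from 1 in the order you list their IPs and that they all use ports 2888 and 3888, as in this walk-through; `zk_server_lines` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: print one server.N line per IP, numbering from 1 in argument order.
zk_server_lines() {
  id=1
  for ip in "$@"; do
    echo "server.${id}=${ip}:2888:3888"
    id=$((id + 1))
  done
}

zk_server_lines 192.168.33.21 192.168.33.22 192.168.33.23
```

The output can be pasted into zoo.cfg on each server. The order matters: the number on each line has to match the myid value on the server with that IP.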
6. Set the myid file for the second and third servers
Remember that 1-character long file we created on the first box?
We need to do the same thing for the second and third servers, or our Zookeeper cluster will not work.
Since we used “1” for the first server, let’s keep it simple for the other servers.
On your second server:
echo "2" > /var/zookeeper/data/myid
On your third server:
echo "3" > /var/zookeeper/data/myid
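A mismatch between the myid file and the server.N lines is an easy mistake to make. One way to guard against it is to look the id up from zoo.cfg instead of typing it. This is only a sketch: `myid_for_ip` is a hypothetical helper, and the demo uses a scratch copy of the three server lines, not your real config.

```shell
#!/bin/sh
# Sketch: look up which server.N line in zoo.cfg carries a given IP.
myid_for_ip() {
  cfg="$1"; ip="$2"
  # grep -F treats the IP literally; the "=" and ":" around it stop
  # 192.168.33.2 from matching 192.168.33.21
  grep -F "=${ip}:" "$cfg" | sed -n 's/^server\.\([0-9][0-9]*\)=.*/\1/p'
}

# Demo against a scratch copy of the three server lines
cfg="$(mktemp)"
printf '%s\n' \
  'server.1=192.168.33.21:2888:3888' \
  'server.2=192.168.33.22:2888:3888' \
  'server.3=192.168.33.23:2888:3888' > "$cfg"

myid_for_ip "$cfg" 192.168.33.22
```

On a real node you could pass $ZK_HOME/conf/zoo.cfg and that node's private IP, then write the result to /var/zookeeper/data/myid.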
7. Shut them all down, hurry!
Ok, no hurry, but let’s shut down all the boxes and then bring them up one at a time, just to be sure we’re starting fresh.
For each VM
exit
vagrant halt
8. Start them all up again
vagrant up
vagrant ssh
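If hopping between three terminal windows gets tedious, the halt and up commands can be run from the host in a loop. The sketch below assumes the three node directories from this walk-through sit side by side under one parent directory, which may not match your layout; `each_node` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: run one vagrant command in each node directory, from the host.
# Assumes the node directories sit side by side under one parent directory.
each_node() {
  parent="$1"; shift
  for node in debian-cluster-node-1 debian-cluster-node-2 debian-cluster-node-3; do
    if [ -d "$parent/$node" ]; then
      (cd "$parent/$node" && vagrant "$@")
    else
      echo "skipping $node: no such directory under $parent" >&2
    fi
  done
}

# Hypothetical usage, from the parent directory:
#   each_node . halt
#   each_node . up
```

You still need to `vagrant ssh` into each box to start the services.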
9. Start Zookeeper and Kafka on each server
Zookeeper first
It’s a good idea to start up all the Zookeeper instances first before starting Kafka, so for each VM:
sudo $ZK_HOME/bin/zkServer.sh start
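Before moving on to Kafka, you may want to confirm that each Zookeeper instance is answering. Zookeeper 3.4.x responds to the four-letter command `ruok` with `imok` on its client port. The sketch below assumes `nc` (netcat) is installed and that the client port is the default 2181; `zk_is_ok` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: send "ruok" to a Zookeeper server and expect "imok" back.
# Assumes nc (netcat) is available; -w 2 gives up after two seconds.
zk_is_ok() {
  reply="$(echo ruok | nc -w 2 "$1" "${2:-2181}" 2>/dev/null)"
  [ "$reply" = "imok" ]
}

# Hypothetical usage: pass the node IPs as arguments.
for ip in "$@"; do
  if zk_is_ok "$ip"; then
    echo "$ip: ok"
  else
    echo "$ip: not responding"
  fi
done
```

An `imok` reply only says the process is running. `sudo $ZK_HOME/bin/zkServer.sh status` on each node tells you whether it has joined the cluster as leader or follower.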
Now start Kafka on each node.
sudo $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &
Your cluster should be in full swing now!
Test again with the console producer, this time using the third node as the broker.
$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list 192.168.33.23:9092 --topic topic-1
Hello From Broker 3
(ctrl-c)
And then use the console consumer to read the topic. This time, use one of your new Zookeeper nodes for the --zookeeper argument.
$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper 192.168.33.23:2181 --topic topic-1 --from-beginning
Now let’s create a new replicated topic and produce some messages to it.
$KAFKA_HOME/bin/kafka-create-topic.sh --zookeeper 192.168.33.22:2181 --replica 3 --partition 1 --topic replicated-topic-1
$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list 192.168.33.23:9092 --topic replicated-topic-1
I Am A Replicated Topic
Now consume the new topic from one of your other servers.
$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper 192.168.33.21:2181 --topic replicated-topic-1 --from-beginning
Play around with producing to different brokers and consuming with different zookeepers. Hopefully, it all works!
You can do A LOT with Zookeeper and Kafka. The purpose of this walk-through is just to get you to a point where you can be ready to explore all of Kafka’s goodness within a clustered environment. For more information, please read the documentation.
http://kafka.apache.org/documentation.html
http://zookeeper.apache.org/doc/r3.4.6/
Cheers!
sconnelly, thanks for the helpful post!
Though I would say it’s only half the job done.
Before implementing it myself, I would like to ask the author sconnelly, or any other reader of this comment, for a script which I could run which will receive a list of IPs
./createZkClusterVMachines.sh znode1 znode2 … znodeN
and will create a VM image for each of the provided IPs, where each VM will have all configurations set and ready for execution of zkServer as part of a known quorum:
1. in each machine, $ZK_HOME/conf/zoo.cfg will have the list:
server.1=znode1:2888:3888
server.2=znode2:2888:3888
…
server.N=znodeN:2888:3888
2. in each machine, /var/zookeeper/data/myid will have its own unique id
At the end of the script’s execution we’ll have N directories, each one describing an image of a znode. All we have to do for each znode is cd into its directory and:
start up machine: vagrant up
log in to machine: vagrant ssh
run zkServer: sudo $ZK_HOME/bin/zkServer.sh start
The next step will be combining all the executions into one automatic script.
I’m sure someone has done this already, so here I am asking, so as not to reinvent the wheel.
Thanks!!
Hi Assaf, thank you for your comment. The point of this blog post is simply to highlight all of the steps needed to set up a Kafka cluster. My hope is that, after reading the blog post, the reader will have a grasp of the basic essentials needed to write his/her own script. If you want something that does it all in one fell swoop, read this blog post, which I link to in the first paragraph:
http://allthingshadoop.com/2013/12/07/using-vagrant-to-get-up-and-running-with-apache-kafka/
Hi Steve! Thanks for writing this.
I worked through your steps with Kafka v0.8.1.1 and there were just a few minor changes I thought I’d share:
Part I Step 4 update for downloading/building Kafka:
wget http://apache.mirrors.tds.net/kafka/0.8.1.1/kafka-0.8.1.1-src.tgz
tar -zxvf kafka-0.8.1.1-src.tgz
sudo mkdir /usr/local/kafka
sudo mv kafka-0.8.1.1-src /usr/local/kafka
sudo /usr/local/kafka/kafka-0.8.1.1-src/gradlew jar
Part I Step 8 updates:
Change to ~/.bash_profile:
export KAFKA_HOME=/usr/local/kafka/kafka-0.8.1.1-src/
Some updates to bin/ commands (now use bin/kafka-topics.sh for all topic-related needs)
list topics:
$KAFKA_HOME/bin/kafka-topics.sh --zookeeper 192.168.33.21:2181 --list
create topic:
$KAFKA_HOME/bin/kafka-topics.sh --zookeeper 192.168.33.21:2181 --create --replication-factor 1 --partitions 1 --topic topic-1
Part II Step 9 update:
create replicated topic:
$KAFKA_HOME/bin/kafka-topics.sh --zookeeper 192.168.33.22:2181 --create --replication-factor 3 --partitions 1 --topic replicated-topic-1
Hi Steve,
Sorry to bother you about this, but it would be great if the “code” sections of my reply were monospaced etc. Wasn’t sure how to do that from the “Reply” form. Feel free to delete this reply…
Thanks for the update Ben!
I’ll see if I can format the comments. If not, I will just append the update into the original post and give you credit.
Hello again Ben. Unfortunately, I’m unable to edit the post or comments. But thanks so much for the update!
Hi Steve,
Thanks for the great article. Can you use the Vagrant box when the host OS is Windows? I am not too sure about the LAN part.
Thanks,
Hardik
In order to use jConsole to read JMX mBeans for Kafka
I added the following to bin/kafka-run-class.sh
if [ -z "$KAFKA_JMX_OPTS" ]; then
KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=192.168.33.110 -Dcom.sun.management.jmxremote.port=9999"
fi
Thanks Steve for this.
However, a certain amount of this has now (Mar 2016) been superseded by the availability of the latest Kafka release packages on the Ubuntu Precise64 Vagrant box.
The following steps can be executed on a Vagrant VM (or in the Vagrantfile).
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Zookeeper_Kafka.php
It is necessary, though, to use Oracle Java 1.8 (I haven’t tried OpenJDK 1.7) rather than the default JRE (1.6) suggested.
Regards
The following provisioning in the Vagrantfile should work to start you off:
config.vm.provision "shell",
inline: "
sudo apt-get -y update
sudo apt-get -y install git
sudo apt-get install -y oracle-java8-installer
sudo apt-get install -y oracle-java8-set-default
sudo useradd -m kafka
sudo adduser kafka sudo
sudo apt-get install -y zookeeperd
cd
mkdir kafka
cd kafka
wget http://www-eu.apache.org/dist/kafka/0.9.0.1/kafka_2.11-0.9.0.1.tgz
tar xvzf kafka_2.11-0.9.0.1.tgz --strip-components=1
"
Thank you for all detailed steps here.