UPDATE: This has been replaced by a newer post, Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2. I found that using the pre-built distributions of Hadoop and HBase is much better than trying to build from source; I need more Java/Ant-fu to do the build from scratch. The HBase 0.20.0 release candidates are really great, and it seems easier to get a cluster going with them than with previous releases.
Introduction
Hadoop and Map/Reduce are all the rage nowadays, so we figured we should be using them too.
HBase is an implementation of Google's Bigtable. It's built on top of the Hadoop Distributed File System (HDFS).
It's trivial to install it standalone on top of a local filesystem, but I had some difficulty getting it working on top of HDFS in "Pseudo-Distributed" mode.
Follow the Instructions
I set up Hadoop with no problems following the instructions on the Hadoop site for Pseudo-Distributed Operation, which runs HBase on top of HDFS but with everything on one server (i.e., it's configured pretty much like a cluster, but all the pieces are on the same server). Another helpful set of instructions is Running Hadoop On Ubuntu Linux (Single-Node Cluster).
I followed the HBase installation instructions also for Pseudo-Distributed Operation.
A few things to be aware of:
- Make sure that the Hadoop and HBase major version numbers are the same (I used Hadoop 0.18.2 and HBase 0.18.1)
- Make sure that the Hadoop and HBase trees, as well as the directories and files that hold the HDFS filesystem, are owned by hadoop:hadoop (you have to create the user and group); a quick check is sketched just below
- No need to disable IPv6, as some sites suggest
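As a quick sanity check of the first two points (a sketch of my own, assuming the /usr/local layout and hadoop user described later in this post):

/usr/local/hadoop/bin/hadoop version   # confirm the Hadoop version you think you installed
ls -ld /usr/local/hadoop /usr/local/hbase /var/hadoop_datastore   # all should be owned by hadoop:hadoop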
You can download the Hadoop tar file from http://www.apache.org/dyn/closer.cgi/hadoop/core/ and the Hbase tar file from http://www.apache.org/dyn/closer.cgi/hadoop/hbase/
They are also available as git repositories via:
git clone git://git.apache.org/hadoop.git
git clone git://git.apache.org/hbase.git
You can track a particular branch with the following commands (we're stuck at Hadoop 0.19.1 / HBase 0.19.0):
cd hadoop
git branch --track release-0.19.1 origin/tags/release-0.19.1
git checkout release-0.19.1
cd ../hbase
git branch --track 0.19.0 origin/tags/0.19.0
git checkout 0.19.0
Then build things in each directory. As far as I can tell you just need the default ant build, but you can also build the jar:
cd ../hadoop
ant
ant jar

cd ../hbase
ant
ant jar
Biggest Problem I Had
The thing that took the longest to get right was accessing HBase from other hosts. You would think you could put the DNS fully qualified domain name (FQDN) in the config file. It turns out that, by default, the Hadoop tools don't seem to use the host's DNS resolver, only what is in /etc/hosts (as far as I can tell). So you have to use the IP address in the config file.
I believe there are ways to configure around this, but I haven't found them yet.
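One workaround that should behave the same way (a sketch of my own, not something I have verified on this setup): put the name-to-address mapping into /etc/hosts yourself on every machine involved, so the name resolves even without DNS. The host name here is hypothetical; the IP is the one used in hbase-site.xml below.

# hbase-master.example.com stands in for your real FQDN
echo "192.168.10.50   hbase-master.example.com hbase-master" | sudo tee -a /etc/hosts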
Configuration Examples
File System Layout
I untarred the distributions into /usr/local/pkgs, made symbolic links at /usr/local/hadoop and /usr/local/hbase, and created the directory that Hadoop/HDFS will use for storage.
For Ubuntu:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
For Mac:
Create a Home Directory
mkdir /Users/_hadoop
Find an unused groupid by seeing what ids are already in use:
sudo dscl . -list /Groups PrimaryGroupID | cut -c 32-34 | sort -rn
Then find an unused userid by seeing what userids are in use:
sudo dscl . -list /Users UniqueID | cut -c 20-22 | sort -rn
Pick a number that is in neither list. In our case we will use 402 for both the userid and groupid for _hadoop (Mac OS X puts an underscore in front of daemon user/group names). We will also append hadoop, without the underscore, as an alternate record name for both the group and the user.
sudo dscl . -create /Groups/_hadoop PrimaryGroupID 402
sudo dscl . -append /Groups/_hadoop RecordName hadoop
Use the same value for dsAttrTypeStandard:PrimaryGroupID (402 in this case) as the group ID in the following commands:
sudo dscl . -create /Users/_hadoop UniqueID 402
sudo dscl . -create /Users/_hadoop RealName "Hadoop Service"
sudo dscl . -create /Users/_hadoop PrimaryGroupID 402
sudo dscl . -create /Users/_hadoop NFSHomeDirectory /Users/_hadoop
sudo dscl . -append /Users/_hadoop RecordName hadoop
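To double-check that the account came out right (an optional check I added, not part of the original steps):

sudo dscl . -read /Groups/_hadoop PrimaryGroupID
sudo dscl . -read /Users/_hadoop UniqueID PrimaryGroupID NFSHomeDirectory
id _hadoop    # should report uid and gid 402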
For both Ubuntu and Mac (note that on the Mac the user and group will be named _hadoop):
cd /usr/local/pkgs
tar xzf hadoop-0.18.2.tar.gz
tar xzf hbase-0.18.1.tar.gz
cd ..
ln -s /usr/local/pkgs/hadoop-0.18.2 hadoop
ln -s /usr/local/pkgs/hbase-0.18.1 hbase
mkdir /var/hadoop_datastore
chown -R hadoop:hadoop hadoop/ hbase/ /var/hadoop_datastore /Users/_hadoop   # /Users/_hadoop exists only on the Mac
Hadoop Config files
The following are all in /usr/local/hadoop/conf
hadoop-env.sh
Need to set the JAVA_HOME variable. I installed Java 6 via Synaptic. You can also install it with:
apt-get install sun-java6-jdk
The Macintosh is easy if you have an Intel Core 2 Duo (the Intel Core Duo doesn't count). Apple only supports Java 1.6 on their 64-bit processors. If you have a 32-bit processor, like the first-generation MacBook Pro 17″ or first-generation Mac Mini, or you have a PPC, see Tech Tip: How to Set Up JDK 6 and JavaFX on 32-bit Intel Macs
So my config is (only the things I changed, the rest was left as is):
...
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
...
For the Macintosh:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/Current
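On either platform you can sanity-check the Java setup before going further (my addition):

echo $JAVA_HOME    # should match what you exported in hadoop-env.sh
java -version      # the default JVM; on the Mac make sure Java 6 is selected in Java Preferences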
hadoop-site.xml
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>hadoop.tmp.dir</name> <value>/var/hadoop_datastore/hadoop-${user.name}</value> <description>A base for other temporary directories.</description> </property> <property> <name>fs.default.name</name> <value>hdfs://localhost:54310</value> <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description> </property> <property> <name>mapred.job.tracker</name> <value>localhost:54311</value> <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description> </property> <property> <name>dfs.replication</name> <value>1</value> <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> </property> <!-- As per note in http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200810.mbox/<C20126171.post@talk.nabble.com> --> <property> <name>dfs.datanode.socket.write.timeout</name> <value>0</value> </property> <property> <name>dfs.datanode.max.xcievers</name> <value>1023</value> </property> </configuration>
HBase Config Files
The following are all in /usr/local/hbase/conf
hbase-env.sh
Again, just need to set up JAVA_HOME:
...
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
...
For the Macintosh:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/Current
hbase-site.xml
Here is where I wanted to give an FQDN for the host that is the hbase.master, but had to use an IP address instead.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>hbase.rootdir</name> <value>hdfs://localhost:54310/hbase</value> <description>The directory shared by region servers. Should be fully-qualified to include the filesystem to use. E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR </description> </property> <property> <name>hbase.master</name> <value>192.168.10.50:60000</value> <description>The host and port that the HBase master runs at. </description> </property> </configuration>
Formatting the Name Node
You must do this as the same user that will run the daemons (hadoop):
su hadoop -s /bin/sh -c "/usr/local/hadoop/bin/hadoop namenode -format"
on the Mac:
/usr/bin/su _hadoop -c "/usr/local/hadoop/bin/hadoop namenode -format"
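If the format worked, the name directory should now exist under the hadoop.tmp.dir set in hadoop-site.xml (an optional check I added; the exact subdirectory layout may vary by version):

ls /var/hadoop_datastore/hadoop-hadoop/dfs/name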
Setup passphraseless ssh
Now check that you can ssh to the localhost without a passphrase:
su - hadoop
ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands (as hadoop):
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Ubuntu /etc/init.d style startup scripts
I scoured the InterTubes for example hadoop/hbase startup scripts and found absolutely none! I ended up creating a minimal one that is so far only suited for the Pseudo-Distributed Operation mode as it just calls the start-all / stop-all scripts.
/etc/init.d/hadoop
Create the place it will put its startup logs
mkdir /var/log/hadoop
Create /etc/init.d/hadoop with the following:
#!/bin/sh
### BEGIN INIT INFO
# Provides:          hadoop services
# Required-Start:    $network
# Required-Stop:     $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Description:       Hadoop services
# Short-Description: Enable Hadoop services including hdfs
### END INIT INFO

PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
HADOOP_BIN=/usr/local/hadoop/bin
NAME=hadoop
DESC=hadoop
USER=hadoop
ROTATE_SUFFIX=

test -x $HADOOP_BIN || exit 0

RETVAL=0

set -e
cd /

start_hadoop () {
    set +e
    su $USER -s /bin/sh -c $HADOOP_BIN/start-all.sh > /var/log/hadoop/startup_log
    case "$?" in
        0)
            echo SUCCESS
            RETVAL=0
            ;;
        1)
            echo TIMEOUT - check /var/log/hadoop/startup_log
            RETVAL=1
            ;;
        *)
            echo FAILED - check /var/log/hadoop/startup_log
            RETVAL=1
            ;;
    esac
    set -e
}

stop_hadoop () {
    set +e
    if [ $RETVAL = 0 ] ; then
        su $USER -s /bin/sh -c $HADOOP_BIN/stop-all.sh > /var/log/hadoop/shutdown_log
        RETVAL=$?
        if [ $RETVAL != 0 ] ; then
            echo FAILED - check /var/log/hadoop/shutdown_log
        fi
    else
        echo No nodes running
        RETVAL=0
    fi
    set -e
}

restart_hadoop() {
    stop_hadoop
    start_hadoop
}

case "$1" in
    start)
        echo -n "Starting $DESC: "
        start_hadoop
        echo "$NAME."
        ;;
    stop)
        echo -n "Stopping $DESC: "
        stop_hadoop
        echo "$NAME."
        ;;
    force-reload|restart)
        echo -n "Restarting $DESC: "
        restart_hadoop
        echo "$NAME."
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|force-reload}" >&2
        RETVAL=1
        ;;
esac

exit $RETVAL
/etc/init.d/hbase
Create the place it will put its startup logs
mkdir /var/log/hbase
Create /etc/init.d/hbase with the following:
#!/bin/sh
### BEGIN INIT INFO
# Provides:          hbase services
# Required-Start:    $network
# Required-Stop:     $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Description:       Hbase services
# Short-Description: Enable Hbase services including hdfs
### END INIT INFO

PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
HBASE_BIN=/usr/local/hbase/bin
NAME=hbase
DESC=hbase
USER=hadoop
ROTATE_SUFFIX=

test -x $HBASE_BIN || exit 0

RETVAL=0

set -e
cd /

start_hbase () {
    set +e
    su $USER -s /bin/sh -c $HBASE_BIN/start-hbase.sh > /var/log/hbase/startup_log
    case "$?" in
        0)
            echo SUCCESS
            RETVAL=0
            ;;
        1)
            echo TIMEOUT - check /var/log/hbase/startup_log
            RETVAL=1
            ;;
        *)
            echo FAILED - check /var/log/hbase/startup_log
            RETVAL=1
            ;;
    esac
    set -e
}

stop_hbase () {
    set +e
    if [ $RETVAL = 0 ] ; then
        su $USER -s /bin/sh -c $HBASE_BIN/stop-hbase.sh > /var/log/hbase/shutdown_log
        RETVAL=$?
        if [ $RETVAL != 0 ] ; then
            echo FAILED - check /var/log/hbase/shutdown_log
        fi
    else
        echo No nodes running
        RETVAL=0
    fi
    set -e
}

restart_hbase() {
    stop_hbase
    start_hbase
}

case "$1" in
    start)
        echo -n "Starting $DESC: "
        start_hbase
        echo "$NAME."
        ;;
    stop)
        echo -n "Stopping $DESC: "
        stop_hbase
        echo "$NAME."
        ;;
    force-reload|restart)
        echo -n "Restarting $DESC: "
        restart_hbase
        echo "$NAME."
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|force-reload}" >&2
        RETVAL=1
        ;;
esac

exit $RETVAL
Set up the init system
This assumes you put the above init files in /etc/init.d
chmod +x /etc/init.d/{hbase,hadoop}
update-rc.d hadoop defaults
update-rc.d hbase defaults 25
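You can confirm update-rc.d created the start/stop links (my addition; runlevel 2 is the usual default on Ubuntu):

ls /etc/rc2.d/ | grep -E 'hadoop|hbase'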
You can now start / stop hadoop by saying:
/etc/init.d/hadoop start
/etc/init.d/hadoop stop
And similarly with hbase
/etc/init.d/hbase start
/etc/init.d/hbase stop
Make sure you start Hadoop before HBase, and stop HBase before you stop Hadoop. For example:
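A full bring-up and shutdown in the right order, with a quick HDFS check in between (the hadoop dfs -ls line is my addition):

/etc/init.d/hadoop start
/etc/init.d/hbase start
su hadoop -s /bin/sh -c "/usr/local/hadoop/bin/hadoop dfs -ls /"   # HDFS should answer; /hbase appears once HBase is up
# ... later, shut down in the reverse order:
/etc/init.d/hbase stop
/etc/init.d/hadoop stop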
Macintosh launchd style startup
Starting processes on Macintosh Leopard is pretty easy with launchd/launchctl.
For hadoop, create a file /Library/LaunchAgents/com.yourdomain.hadoop.plist with the following content (replace yourdomain with the domain you want to use for this class of apps):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>GroupName</key>
<string>_hadoop</string>
<key>KeepAlive</key>
<true/>
<key>Label</key>
<string>com.yourdomain.hadoop</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/hadoop/bin/start-all.sh</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>ServiceDescription</key>
<string>Hadoop Process</string>
<key>UserName</key>
<string>_hadoop</string>
</dict>
</plist>
And for hbase, /Library/LaunchAgents/com.yourdomain.hbase.plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>GroupName</key>
<string>_hadoop</string>
<key>KeepAlive</key>
<true/>
<key>Label</key>
<string>com.yourdomain.hbase</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/hbase/bin/start-hbase.sh</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>UserName</key>
<string>_hadoop</string>
</dict>
</plist>
Set the owner to root and the mode to 644:
chown root /Library/LaunchAgents/com.yourdomain.hadoop.plist /Library/LaunchAgents/com.yourdomain.hbase.plist
chmod 644 /Library/LaunchAgents/com.yourdomain.hadoop.plist /Library/LaunchAgents/com.yourdomain.hbase.plist
The next time you restart, it should start hbase and hadoop. You can also start them manually with the commands:
sudo launchctl load /Library/LaunchAgents/com.yourdomain.hadoop.plist
sudo launchctl load /Library/LaunchAgents/com.yourdomain.hbase.plist
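To confirm both jobs are loaded (my addition):

sudo launchctl list | grep com.yourdomain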
Conclusion
You should now be able to see the HBase web interface at http://<your domain name>:60010
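From the command line you can poke the web interfaces directly (a sketch I added; 60010 is the HBase master UI port, and 50070 is the stock Hadoop NameNode UI port, which we did not override above):

curl -sI http://localhost:60010/ | head -1    # HBase master UI
curl -sI http://localhost:50070/ | head -1    # Hadoop NameNode UI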
If you have problems check /var/log/{hbase,hadoop}/startup_log as well as /usr/local/hadoop/logs/hadoop-hadoop-namenode-yourhostname.log and /usr/local/hbase/logs/hbase-hadoop-master-yourhostname.log
The error messages are pretty poor (i.e., useless, as far as I could tell, when tracking down the FQDN/IP address problem), but they are better than nothing.
I will post an update when I deploy a Full Cluster.
Thanks for the post it was very useful. I had the same issue, it doesn’t seem to like IP addresses in the root path either.
It definitely does do a full DNS lookup, as I resorted to adding a sub-domain on our public DNS server for it. It returns a private IP, so it will still only be available internally.
Yeah, I don't think I fully understand what's going on with it. It's even worse when I try to deploy it to Amazon EC2, where it resolves the DNS to the NAT'd local address even if you specify the DNS FQDN of the public IP address!
When I figure it out, I’ll publish it here and make a comment. I think you get an automatic notification…
Added info on getting hadoop and hbase via git
Minor addition on how to build using ant if you installed from git. I'm still having a bit of trouble understanding the right way to build from source and then use the result, though. Should I really be doing ant package and then using the package?
It’s a very useful post indeed, thanks!
Robert, I think I might have a solution for your DNS problem.
Ubuntu, by default, adds a line ‘127.0.1.1 somehostname’ to your /etc/hosts.
Remove this line, and your problems might just be fixed.
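If you want to script that fix on Ubuntu, something like this should do it (my sketch of the suggestion above; back up the file first):

sudo cp /etc/hosts /etc/hosts.bak
sudo sed -i '/^127\.0\.1\.1/d' /etc/hosts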
Thanks for the post. I have finished installing Hadoop and HBase now, but I have a problem when using HBase. I created a table named fire and it succeeded, but when I try to create a table with the same name I get an error, and I don't know where the fault is.
hbase(main):003:0> create "fire", "firework"
0 row(s) in 1.1460 seconds
hbase(main):004:0> create "fire", "one"
NativeException: org.apache.hadoop.hbase.TableExistsException: org.apache.hadoop.hbase.TableExistsException: fire
at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:798)
at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:762)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
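The exception just means a table named fire already exists; HBase will not create it twice. If you actually want to rebuild it with a different column family, you have to remove the old table first, which deletes its data (a sketch in the hbase shell, not from the original post):

disable "fire"
drop "fire"
create "fire", "one"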
I am new to HBase, and while trying to install it on an Ubuntu system I am facing a problem.
Below is the error log from the ZooKeeper log file:
2014-01-18 06:10:51,392 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x143a5b052980000, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:744)
2014-01-18 06:10:51,394 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:56671 which had sessionid 0x143a5b052980000
Below is the error log from the master log:
2014-01-18 06:10:51,381 INFO org.apache.zookeeper.ZooKeeper: Session: 0x143a5b052980000 closed
2014-01-18 06:10:51,381 INFO org.apache.hadoop.hbase.master.HMaster: HMaster main thread exiting
2014-01-18 06:10:51,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
java.lang.RuntimeException: HMaster Aborted
    at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160)
    at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
    at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2120)
Please note, I am able to start HBase successfully: after starting it, jps shows HMaster running. But as soon as I try to go to the HBase shell, this issue arises, and then jps no longer shows HMaster in the list.
Please help me with this issue; I have been trying to solve it by myself for days, but no luck. Please help.