
Contents

1 Installing Hadoop
2 Installing MySQL
3 Download HadoopDB
4 Install Hive
5 Data
6 HadoopDB Catalog and Running a Job



HadoopDB is an interesting project going on at Yale under Prof. Daniel Abadi's supervision that I've been meaning to play with for some time now. I initially read the paper describing HadoopDB last year and intended to document how to set up a HadoopDB system using MySQL, but I got busy with school work and never got around to it. Since I have a little more free time now that I've finished my thesis, I figured it was about time I got down to playing around with HadoopDB and describing how to set up a HadoopDB system using MySQL as the single-node database. With that, I'm going to describe how to get up and running with HadoopDB. If you have not read the paper before starting, I strongly encourage you to give it a read. It's very well written and not that difficult to get through.
In this guide, I'm installing on Ubuntu Server 10.04 64-bit. Thus, I will be using the Ubuntu package manager heavily. I have not tested on other platforms but a lot of what is described here should apply to other platforms such as CentOS.
This guide is only on how to set up a single node system. It would not be difficult to extend what is contained here for setting up a multi-node system which I may write about in the future.

1 Installing Hadoop #

Before installing Hadoop, Java needs to be installed. As of 10.04, the Sun JDK packages have been dropped from the Multiverse section of the Ubuntu archive. You can still install the Sun JDK if you wish, but for this article I used OpenJDK without issues:

sudo apt-get install openjdk-6-jdk
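
Once that package is installed, it's worth confirming the JDK is actually available before going any further. A quick sanity check:

java -version
# the JVM directory we will point JAVA_HOME at later should also exist
ls /usr/lib/jvm/java-6-openjdk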

Before getting into the installation of Hadoop, I encourage you to read Michael Noll's in-depth guide to installing Hadoop on Ubuntu. I borrow from his articles a lot here.
First, create a user account and group that Hadoop will run as:

sudo groupadd hadoop
sudo useradd -m -g hadoop -d /home/hadoop -s /bin/bash -c "Hadoop software owner" hadoop

Next, we download Hadoop and create directories for storing the software and data. For this article, Hadoop 0.20.2 was used:

cd /opt
sudo wget http://www.gtlib.gatech.edu/pub/apache/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
sudo tar zxvf hadoop-0.20.2.tar.gz
sudo ln -s /opt/hadoop-0.20.2 /opt/hadoop
sudo chown -R hadoop:hadoop /opt/hadoop /opt/hadoop-0.20.2
sudo mkdir -p /opt/hadoop-data/tmp-base
sudo chown -R hadoop:hadoop /opt/hadoop-data/

Alternatively, Cloudera has created Deb packages that can be used if you wish. I have not used them before so I can't comment on how they work.
Next, we need to configure SSH for the hadoop user. Hadoop requires this in order to manage its nodes.

su - hadoop
ssh-keygen -t rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

When the ssh-keygen command is run, be sure to leave the passphrase blank so that you will not be prompted for a password.
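Before moving on, it's worth confirming that passwordless SSH actually works for the hadoop user, since the Hadoop start/stop scripts depend on it (the first connection will ask you to accept the host key):

ssh localhost
exit
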
We will want to update the .bashrc file for the hadoop user with appropriate environment variables to make administration easier:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF=$HADOOP_HOME/conf
export HADOOP_PATH=$HADOOP_HOME/bin
export HIVE_HOME=/opt/hive
export HIVE_PATH=$HIVE_HOME/bin

export PATH=$HIVE_PATH:$HADOOP_PATH:$PATH
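
These variables only take effect in a new shell, so after editing .bashrc either log back in as the hadoop user or re-source the file. A quick way to apply and check the settings:

source ~/.bashrc
echo $HADOOP_HOME   # should print /opt/hadoop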

We will cover installing Hive later in this article but for now, leave that environment variable in there. For the remainder of this article, I will refer to various locations, such as the Hadoop installation directory, using the environment variables defined above. Next, we want to configure Hadoop. There are 3 configuration files in Hadoop that we need to modify:

  • $HADOOP_CONF/core-site.xml
  • $HADOOP_CONF/mapred-site.xml
  • $HADOOP_CONF/hdfs-site.xml

Based on the directory structure I created beforehand, these 3 files looked as follows for me:
core-site.xml

<configuration>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-data/tmp-base</value>
    <description>A base for other temporary directories</description>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>localhost:54311</value>
    <description>
      The name of the default file system.  A URI whose
      scheme and authority determine the FileSystem implementation.  The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class.  The uri's authority is used to
      determine the host, port, etc. for a filesystem.
    </description>
  </property>

  <property>
    <name>hadoopdb.config.file</name>
    <value>HadoopDB.xml</value>
    <description>The name of the HadoopDB cluster configuration file</description>
  </property>

</configuration>

mapred-site.xml

<configuration>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54310</value>
    <description>
      The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.
    </description>
  </property>

</configuration>

hdfs-site.xml

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>
      Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
    </description>
  </property>

</configuration>

Notice the reference to the HadoopDB XML file. We will cover that later, but this property must be present in your configuration in order to use HadoopDB.
Next, we need to modify the $HADOOP_CONF/hadoop-env.sh file so that the JAVA_HOME variable is correctly set in that file. Thus, I have the following 2 lines in my hadoop-env.sh file:

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

Next, we need to format the Hadoop filesystem:

$ hadoop namenode -format
10/05/07 14:24:12 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop1/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/07 14:24:12 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
10/05/07 14:24:12 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/07 14:24:12 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/07 14:24:12 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/07 14:24:12 INFO common.Storage: Storage directory /opt/hadoop-data/tmp-base/dfs/name has been
successfully formatted.
10/05/07 14:24:12 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop1/127.0.1.1
************************************************************/
$

The above is the output from a successful format. Now, we can finally start our single-node Hadoop installation:

$ start-all.sh
starting namenode, logging to /opt/hadoop/bin/../logs/hadoop-hadoop-namenode-hadoop1.out
localhost: starting datanode, logging to /opt/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop1.out
localhost: starting secondarynamenode, logging to
/opt/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-hadoop1.out
starting jobtracker, logging to /opt/hadoop/bin/../logs/hadoop-hadoop-jobtracker-hadoop1.out
localhost: starting tasktracker, logging to
/opt/hadoop/bin/../logs/hadoop-hadoop-tasktracker-hadoop1.out
$

Again, if you don't see output similar to the above, something went wrong. The log files under /opt/hadoop/logs are quite helpful for troubleshooting.
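Another quick check that all of the daemons came up is the jps tool that ships with the JDK. On a healthy single-node setup it should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (plus Jps itself); anything missing points you at the corresponding log file:

jps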

2 Installing MySQL #

Installing MySQL is quite simple on Ubuntu. I went with the MySQL Server package:

sudo apt-get install mysql-server

We don't need to perform any special configuration of MySQL for HadoopDB. Just make sure to take note of the password you specify for the root user, since we will perform all work with HadoopDB as the root user (this is not mandatory, but it keeps things simple).
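Before wiring HadoopDB up to it, it doesn't hurt to confirm that the MySQL server is running and that the root credentials work. For example (you'll be prompted for the root password you chose during installation):

mysql -u root -p -e "SELECT VERSION();"
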
Next, we need to install the MySQL JDBC driver. For this article, I used Connector/J. After downloading the jar file, we need to copy it into Hadoop's lib directory so Hadoop has access to it:

cp mysql-connector-java-5.1.12-bin.jar $HADOOP_HOME/lib

It's worth noting that in the paper, the authors say they initially used MySQL with HadoopDB but switched to PostgreSQL. The main reason cited is the poor join algorithms in MySQL, which I take to mean the fact that only nested-loop joins are supported in MySQL. I don't attempt any comparison of HadoopDB running with MySQL versus PostgreSQL, but I wanted to point out the authors' observation.

3 Download HadoopDB #

Now we can download HadoopDB. I'm going to grab both the jar file and check out the source from Subversion. After downloading the jar file, we need to copy it into Hadoop's lib directory so Hadoop has access to it:

cp hadoopdb.jar $HADOOP_HOME/lib

I also checked out the source code from Subversion in case I needed to re-build the jar file at any time:

svn co https://hadoopdb.svn.sourceforge.net/svnroot/hadoopdb hadoopdb

4 Install Hive #

Hive is used by HadoopDB as a SQL interface to the system. It's not a requirement for working with HadoopDB, but it is another way to interact with HadoopDB, so I'll cover how to install it.
First, we need to create directories in HDFS:

hadoop fs -mkdir /tmp
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse
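
You can confirm the directories were created with the expected permissions before moving on:

hadoop fs -ls /
hadoop fs -ls /user/hive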

Next, we need to download the SMS_dist tar file from the HadoopDB download page:

tar zxvf SMS_dist.tar.gz
sudo mv dist /opt/hive
sudo chown -R hadoop:hadoop /opt/hive

Since we already set up the Hive-related environment variables earlier when installing Hadoop, everything we need should now be in our path:

$ hive
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201005081717_1990651345.txt
hive> 

create     describe   exit       from       load       quit       set
hive> quit;
$

5 Data #

We want some data to play around with for testing purposes. For this article, I'm going to use the data from the paper published last summer, 'A Comparison of Approaches to Large-Scale Data Analysis'. Documentation on how to reproduce the benchmarks in that paper is provided in the link I gave to the paper. Since I'm only running one Hadoop node and have absolutely no interest in generating lots of data, I modified the scripts provided to produce tiny amounts of data:

svn co http://graffiti.cs.brown.edu/svn/benchmarks/
cd benchmarks/datagen/teragen

Within the benchmarks/datagen/teragen folder, there is a Perl script named teragen.pl that is responsible for generating the data. I modified that script for my purposes to look like:

use strict;
use warnings;

my $CUR_HOSTNAME = `hostname -s`;
chomp($CUR_HOSTNAME);

my $NUM_OF_RECORDS_1TB    = 10000000000;
my $NUM_OF_RECORDS_535MB  = 100;
my $BASE_OUTPUT_DIR   = "/data";
my $PATTERN_STRING    = "XYZ";
my $PATTERN_FREQUENCY = 108299;
my $TERAGEN_JAR       = "teragen.jar";
my $HADOOP_COMMAND    = $ENV{'HADOOP_HOME'}."/bin/hadoop";

my %files = ( "535MB" => 1,
);
system("$HADOOP_COMMAND fs -rmr $BASE_OUTPUT_DIR"); # IGNORE ERRORS
foreach my $target (keys %files) {
   my $output_dir = $BASE_OUTPUT_DIR."/SortGrep$target";
   my $num_of_maps = $files{$target};
   my $num_of_records = ($target eq "535MB" ? $NUM_OF_RECORDS_535MB : $NUM_OF_RECORDS_1TB);
   print "Generating $num_of_maps files in '$output_dir'\n";

   ##
   ## EXEC: hadoop jar teragen.jar 10000000000 /data/SortGrep/ XYZ 108299 100
   ##
   my @args = ( $num_of_records,
                $output_dir,
                $PATTERN_STRING,
                $PATTERN_FREQUENCY,
                $num_of_maps );
   my $cmd = "$HADOOP_COMMAND jar $TERAGEN_JAR ".join(" ", @args);
   print "$cmd\n";
   system($cmd) == 0 || die("ERROR: $!");
} # FOR
exit(0);
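
To actually generate the data, the script is run from the teragen directory. As a rough sketch, this assumes teragen.jar has already been built and sits in the current directory, and that the hadoop user's environment is set up as described earlier:

perl teragen.pl
# confirm the generated file landed in HDFS
hadoop fs -ls /data/SortGrep535MB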

Running the above Perl script generates the data and writes it into HDFS. HadoopDB comes with a data partitioner that can partition data into a specified number of partitions. This is not particularly important for this article since we are running a single-node cluster, so we only have 1 partition. The idea is that each partition can be bulk-loaded into a separate database node and indexed appropriately. For us, we just need to create a database and table in our MySQL database; since we only have 1 partition, the database name will reflect that. The procedure to load the data set we generated into our single MySQL node is:

hadoop fs -get /data/SortGrep535MB/part-00000 my_file
mysql -u root -ppassword
mysql> create database grep0;
mysql> use grep0;
mysql> create table grep (
    ->   key1 char(10),
    ->   field char(90)
    -> );
mysql> load data local infile 'my_file' into table grep fields terminated by '|' (key1, field);
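
As a quick sanity check that the load worked, the row count should match the number of records generated by the modified teragen.pl (100 in my case). For example, from the shell:

mysql -u root -p -e "SELECT COUNT(*) FROM grep0.grep;"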

We now have data loaded into both HDFS and MySQL. The data we are working with is from the grep benchmark, which is not the best fit for HadoopDB since it is unstructured data. However, since this article is just about how to set up HadoopDB and not about testing its performance, I didn't worry about that too much.

6 HadoopDB Catalog and Running a Job #

The HadoopDB catalog is stored as an XML file in HDFS. A tool is provided that generates this XML file from a properties file. For this article, the properties file I used is:

#Properties for Catalog Generation
##################################
nodes_file=machines.txt
relations_unchunked=grep, EntireRankings
relations_chunked=Rankings, UserVisits
catalog_file=HadoopDB.xml
##
#DB Connection Parameters
##
port=3306
username=root
password=password
driver=com.mysql.jdbc.Driver
url_prefix=jdbc\:mysql\://
##
#Chunking properties
##
chunks_per_node=0
unchunked_db_prefix=grep
chunked_db_prefix=cdb
##
#Replication Properties
##
dump_script_prefix=/root/dump_
replication_script_prefix=/root/load_replica_
dump_file_u_prefix=/mnt/dump_udb
dump_file_c_prefix=/mnt/dump_cdb
##
#Cluster Connection
##
ssh_key=id_rsa-gsg-keypair

The machines.txt file must exist; for this article, my machines.txt file had only 1 entry: localhost
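Creating it is trivial; in the same directory as the properties file above, something like this is all that's needed for a single-node setup:

echo "localhost" > machines.txt
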
Then in order to generate the XML file and store it in HDFS, the following is performed:

java -cp $HADOOP_HOME/lib/hadoopdb.jar edu.yale.cs.hadoopdb.catalog.SimpleCatalogGenerator \
> Catalog.properties
hadoop dfs -put HadoopDB.xml HadoopDB.xml
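
To confirm the catalog was generated and uploaded correctly, you can dump it back out of HDFS:

hadoop dfs -ls HadoopDB.xml
hadoop dfs -cat HadoopDB.xml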

Please note that the above tool is quite fragile and expects the input properties file to be in a certain format with certain fields. It's pretty easy to break the tool, which is understandable given that this is a research project.
We are now ready to run a HadoopDB job! The HadoopDB distribution comes with a bunch of benchmarks that were used in the paper that was published on HadoopDB. The data I generated in this article corresponds to the data that was used for their benchmarks so I can use jobs that have already been written in order to test my setup.
I'm using the grep task from the paper to search for a pattern in the data I loaded earlier. Thus, to kick off a job I do:

java -cp $CLASSPATH:hadoopdb.jar edu.yale.cs.hadoopdb.benchmark.GrepTaskDB \
> -pattern %wo% -output padraig -hadoop.config.file HadoopDB.xml

Running the job, I see output like the following:
java -cp $CLASSPATH:hadoopdb.jar edu.yale.cs.hadoopdb.benchmark.GrepTaskDB \
> -pattern %wo% -output padraig -hadoop.config.file HadoopDB.xml
10/05/08 18:01:41 INFO exec.DBJobBase: grep_db_job
10/05/08 18:01:41 INFO exec.DBJobBase: SELECT key1, field FROM grep WHERE field LIKE '%%wo%%';
10/05/08 18:01:41 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker,
sessionId=
10/05/08 18:01:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
Applications should implement Tool for the same.
10/05/08 18:01:41 INFO mapred.JobClient: Running job: job_local_0001
10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: Data locality failed for
hadoop1.localdomain
10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: Task from hadoop1.localdomain is connecting
to chunk 0 on host localhost with db url jdbc:mysql://localhost:3306/grep0
10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: SELECT key1, field FROM grep WHERE field
LIKE '%%wo%%';
10/05/08 18:01:41 INFO mapred.MapTask: numReduceTasks: 0
10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: DB times (ms): connection = 245, query
execution = 2, row retrieval  = 36
10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: Rows retrieved = 3
10/05/08 18:01:41 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the
process of commiting
10/05/08 18:01:41 INFO mapred.LocalJobRunner: 
10/05/08 18:01:41 INFO mapred.TaskRunner: Task attempt_local_0001_m_000000_0 is allowed to commit
now
10/05/08 18:01:41 INFO mapred.FileOutputCommitter: Saved output of task
'attempt_local_0001_m_000000_0' to file:/home/hadoop/padraig
10/05/08 18:01:41 INFO mapred.LocalJobRunner: 
10/05/08 18:01:41 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
10/05/08 18:01:42 INFO mapred.JobClient:  map 100% reduce 0%
10/05/08 18:01:42 INFO mapred.JobClient: Job complete: job_local_0001
10/05/08 18:01:42 INFO mapred.JobClient: Counters: 6
10/05/08 18:01:42 INFO mapred.JobClient:   FileSystemCounters
10/05/08 18:01:42 INFO mapred.JobClient:     FILE_BYTES_READ=115486
10/05/08 18:01:42 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=130574
10/05/08 18:01:42 INFO mapred.JobClient:   Map-Reduce Framework
10/05/08 18:01:42 INFO mapred.JobClient:     Map input records=3
10/05/08 18:01:42 INFO mapred.JobClient:     Spilled Records=0
10/05/08 18:01:42 INFO mapred.JobClient:     Map input bytes=3
10/05/08 18:01:42 INFO mapred.JobClient:     Map output records=3
10/05/08 18:01:42 INFO exec.DBJobBase: 
grep_db_job JOB TIME : 1747 ms.

$

The results are stored in HDFS and I also specified I wanted the results put in an output directory named padraig. Inspecting the results I see:

$ cd padraig
$ cat part-00000
~k~MuMq=	w0000000000{XSq#Bq6,3xd.tg_Wfa"+woX1e_L*]H-UE%+]L]DiT5#QOS5<
vkrvkB8	6i0000000000.h9RSz'>Kfp6l~kE0FV"aP!>xnL^=C^W5Y}lTWO%N4$F0 Qu@:]-N4-(J%+Bm*wgF^-{BcP^5NqA
]&{`H%]1{E0000000000Z[@egp'h9!	BV8p~MuIuwoP4;?Zr' :!s=,@!F8p7e[9VOq`L4%+3h.*3Rb5e=Nu`>q*{6=7
$

I can verify this result by running the same query against the data stored in MySQL:

mysql> select key1, field from grep where field like '%wo%';
+--------------------------------+------------------------------------------------------------------------------------------+
| key1                           | field                                                                                    |
+--------------------------------+------------------------------------------------------------------------------------------+
| ~k~MuMq=                       | w0000000000{XSq#Bq6,3xd.tg_Wfa"+woX1e_L*]H-UE%+]L]DiT5#QOS5<                             |
| vkrvkB8                        | 6i0000000000.h9RSz'>Kfp6l~kE0FV"aP!>xnL^=C^W5Y}lTWO%N4$F0 Qu@:]-N4-(J%+Bm*wgF^-{BcP^5NqA |
| ]&{`H%]1{E0000000000Z[@egp'h9! | BV8p~MuIuwoP4;?Zr' :!s=,@!F8p7e[9VOq`L4%+3h.*3Rb5e=Nu`>q*{6=7                            |
+--------------------------------+------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

mysql>

Thus, I can see the same rows were returned by the HadoopDB job.
Conclusion
I didn't get to use the Hive interface to HadoopDB as I had issues getting it going. If I get it going in the future, I'll likely write about it. HadoopDB is a pretty interesting project and I enjoyed reading the paper on it a lot. A demo of HadoopDB will be given at SIGMOD this year, which should be interesting.
Overall, I think it's a pretty interesting project but I'm not sure how active it is. Based on the fact that a demo is being given at SIGMOD, I'm sure there is research being done on it, but compared to other open source projects it's difficult to tell how much development is occurring. I'm sure this has more to do with the fact that it is a research project first and foremost whose source code just happens to be available. It would be nice to see a mailing list or something pop up around this project though. For example, if I wanted to contribute a patch, it's not really clear how I should go about doing that and whether it would be integrated or not.
I do think it's some interesting research though, and I'll be keeping my eye on it and trying to mess around with it whenever I have spare time. The next thing I want to look into regarding HadoopDB is hooking it up to the column-oriented database MonetDB (http://monetdb.cwi.nl/), which I will write about if I get the chance.