Set up Pseudo-distributed Hadoop Environment on Mac OS X Lion

MapReduce is a parallel programming paradigm that aims to process large volumes of data on big clusters with ease. When using the MapReduce framework, you focus on the processing logic, expressed as key-value mapping and reducing, and stay completely unaware of the underlying message passing and high-availability mechanisms. Apache Hadoop is the most widely used open-source MapReduce implementation. Many top corporations have adopted Hadoop, and big data processing has become more accessible than ever before.

This tutorial will show you how to set up a pseudo-distributed Hadoop environment on your Mac. You will get everything needed for developing and testing MapReduce programs, including HDFS and the MapReduce daemons, all running on a single machine.

Rather than root, I suggest running the Hadoop instances with your normal Mac account. These procedures were verified on Mac OS X 10.7, but the tutorial should apply to other Unix-like OSes with slight modifications.

The rest of this post is organized as follows:

  1. Configure SSH login to localhost
  2. Get Hadoop
  3. Configure Hadoop
  4. Initialize HDFS
  5. Start HDFS and Hadoop
  6. Try some MapReduce programs
  7. Stop Hadoop

Prerequisite: SSH login to localhost

Hadoop starts and stops its various server roles via SSH. In this post, all of those roles live on a single machine: localhost. To avoid typing your SSH passphrase every time you start or stop Hadoop, you should add your key to ~/.ssh/authorized_keys; ssh-copy-id does the job. One more thing: Mac OS X does NOT ship with ssh-copy-id. You can get it with brew.
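
If you use Homebrew, installing it should be a one-liner (the formula name ssh-copy-id is an assumption here; double-check with brew search if it has moved):

$ brew install ssh-copy-id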

$ ssh-copy-id YOUR_ACCOUNT@localhost

If you use a custom SSH port other than 22, you need to specify the port number in ~/.ssh/config.

Host localhost
    HostName 127.0.0.1
    Port YOUR_SSH_PORT

Check that SSH login now works.

$ ssh localhost

Get Hadoop

You can get the stable Hadoop release from the Beijing Jiaotong University mirror. The current stable version is hadoop-1.0.1.

$ cd /tmp
$ wget http://mirror.bjtu.edu.cn/apache/hadoop/core/stable/hadoop-1.0.1.tar.gz
$ wget http://mirror.bjtu.edu.cn/apache/hadoop/core/stable/hadoop-1.0.1.tar.gz.mds
$ md5 *.gz
# Compare the md5 of the package with the checksum recorded in the .mds file.
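
To see what the .mds file records (it is a plain-text list of digests, MD5 among them), print it and compare the MD5 line with the output above:

$ cat hadoop-1.0.1.tar.gz.mds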

Then extract the package into your home directory and make a symbolic link.

$ cd
$ tar xzvpf /tmp/*.tar.gz
$ ln -s hadoop-* hadoop

To make Hadoop's executables visible to the shell, add the following line to ~/.bash_profile.

export PATH="$HOME/hadoop/bin:$PATH"

Then reload the profile so the new PATH takes effect.

$ source ~/.bash_profile
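
As a quick sanity check (the exact path depends on where you extracted the tarball):

$ which hadoop
# Expect something like /Users/YOUR_ACCOUNT/hadoop/bin/hadoop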

Configure Hadoop

hadoop-env.sh: Configure Hadoop's Java environment

JAVA_HOME should point to the Java home directory. On a Mac, you can use /usr/libexec/java_home to determine it on the fly.

# File: ~/hadoop/conf/hadoop-env.sh
export JAVA_HOME=$(/usr/libexec/java_home)

Then uncomment the HADOOP_HEAPSIZE line to specify a larger heap size limit (in MB) if necessary; 2000 MB is a reasonable choice.

# File: ~/hadoop/conf/hadoop-env.sh
export HADOOP_HEAPSIZE=2000

Mac OS X Lion users will see the message Unable to load realm info from SCDynamicStore when they first initialize the HDFS node. This issue is tracked as HADOOP-7489. To work around it, modify the corresponding HADOOP_OPTS line in hadoop-env.sh as follows.

# File: ~/hadoop/conf/hadoop-env.sh
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

In summary, the following changes are added to hadoop/conf/hadoop-env.sh.

# File: ~/hadoop/conf/hadoop-env.sh
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"   # For Mac OS X only
export HADOOP_HEAPSIZE=2000

core-site.xml: Configure Local Data Directory for HDFS

Modify hadoop/conf/core-site.xml to something similar to the following. You can change the data directory to whatever you prefer, but make sure that the account running Hadoop has read and write permission to it. Since I recommend running Hadoop with your normal account, a directory under your home makes sense.

<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/Users/YOUR_ACCOUNT/.hdfs.tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
</property>

</configuration>
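
Hadoop creates this directory when HDFS is formatted, but if you prefer to create it up front and verify the permissions yourself, the following works (the path simply mirrors the hadoop.tmp.dir value above):

$ mkdir -p ~/.hdfs.tmp
$ ls -ld ~/.hdfs.tmp   # should be owned and writable by YOUR_ACCOUNT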

hdfs-site.xml: Configure HDFS tuning parameters

You can set HDFS tuning parameters in hadoop/conf/hdfs-site.xml. For example, since replication does not speed anything up in a pseudo-distributed environment, the following lines tell HDFS to keep only one copy of each block.

<configuration>

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

</configuration>

mapred-site.xml: Configure MapReduce engine

You can specify the location of the job tracker, the maximum number of map tasks, the maximum number of reduce tasks, and so on in hadoop/conf/mapred-site.xml. In the following settings, I specify the job tracker port (you can choose any port you prefer) and limit both the map and reduce task slots to a maximum of 4.

<configuration>

    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:8021</value>
    </property>

    <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>4</value>
    </property>

    <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>4</value>
    </property>

</configuration>

Congratulations! You have finished all the necessary changes to the configuration files. Follow the instructions below to get your Hadoop rocking.

Initialize HDFS For the First Time

$ hadoop namenode -format

Start Hadoop

$ start-all.sh

Check that all the Hadoop MapReduce processes are running with jps.

$ jps
30920 TaskTracker
30723 JobTracker
30944 Jps
30256 NameNode
30654 SecondaryNameNode
30456 DataNode
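
If any of these daemons is missing, the logs under ~/hadoop/logs usually tell you why; the file name below assumes Hadoop's usual hadoop-USER-ROLE-HOSTNAME.log naming.

$ tail -n 50 ~/hadoop/logs/hadoop-*-namenode-*.log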

You can also browse the cluster's status via the built-in web UIs.
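
Assuming the default Hadoop 1.x ports, the status pages should be reachable at:

http://localhost:50070/   # NameNode (HDFS)
http://localhost:50030/   # JobTracker (MapReduce)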

Try some Hadoop Programs

Some example MapReduce programs ship with Hadoop itself, packaged in the hadoop-examples-*.jar file. You can list them by running hadoop jar hadoop-examples-*.jar with no further arguments. The output looks like:

$ hadoop jar ${HADOOP_HOME}/hadoop-examples-*.jar   # Replace ${HADOOP_HOME} with the real Hadoop path on your system

An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  dbcount: An example job that count the pageview counts from a database.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using monte-carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sleep: A job that sleeps at each map and reduce task.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.

Estimate Pi with the Monte Carlo method

The following line starts a MapReduce job that estimates the value of Pi via the Monte Carlo method, using 10 map tasks with 100 samples each.

$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 100

Distributed grep

In this task, you copy text files into HDFS, then run a grep job on them.

Copy a local directory DIR_TEXT containing a bunch of text files into HDFS under the name input.

$ hadoop fs -put DIR_TEXT input

Run a distributed grep for 'hadoop' over the files in input and save the results to the directory output on HDFS.

$ hadoop jar ${HADOOP_HOME}/hadoop-examples-*.jar grep input output 'hadoop'

Check the output, copy it from HDFS to the local file system, then delete the working directories on HDFS.

$ hadoop fs -ls output   # OR hadoop fs -cat output/*
$ hadoop fs -get output output_local
$ hadoop fs -rmr input output
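
To inspect the results copied back to the local file system (part-* is Hadoop's default naming for reducer output):

$ cat output_local/part-*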

Stop Hadoop

One warning: start and stop the services in a stand-alone shell, not in tmux or screen (see Troubleshooting below).

$ stop-all.sh

Troubleshooting

Starting or stopping Hadoop takes too long

Try to run start-all.sh or stop-all.sh in a standalone shell, not in tmux or screen.

HDFS gets stuck in safe mode

HDFS may sometimes take too long to recover. I have not figured out the root cause, but the following operations reliably get it working again.

$ hadoop dfsadmin -safemode leave
$ hadoop fsck /

SLF4J complains about multiple bindings

SLF4J may complain about multiple SLF4J bindings when HBase coexists with Hadoop in your home directory. Don't worry about it; it is not a big deal (as far as I can tell). My Hadoop and HBase have been working together just fine.

Other Tips

Add hadoop fs command alias

Add the following line into ~/.bash_profile. Then hafs -cat will have the same effect as hadoop fs -cat.

alias hafs='hadoop fs'

Use Hadoop in Python

Hadoop provides a Streaming interface for easier day-to-day text processing. Any scripting language that can read from standard input and write to standard output can use Hadoop's power for big data processing. Michael Noll provides a nice tutorial on using Hadoop in Python.
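
As a minimal sketch of what a Streaming invocation looks like, the following run uses plain Unix tools as the mapper and reducer; the path to the streaming jar is assumed to match the Hadoop 1.0.1 tarball layout.

$ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input input -output output_streaming \
    -mapper /bin/cat -reducer /usr/bin/wc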
