Getting started with Hadoop for Bioinformatics

  • SeqAhead Sardinia Hackathon, Task 9:
  • Authors: Andrea Pinna and Oliver Hunewald

The primary goal of this tutorial is to install Cloudera Manager in a virtual machine in order to administer a Hadoop pseudo-cluster on your own machine.

Some Hadoop bioinformatics tools (e.g. Pydoop, Seal) are then installed in the virtual cluster, and a few test runs are executed.

The tutorial is divided into the following steps:

  • Installation of VirtualBox
  • Installation of Cloudera Manager for VirtualBox
  • Test of Hadoop with WordCount
  • Installation of Pydoop and other software
  • Installation of Biodoop-Seal

Installation of Oracle's VirtualBox

Download and install Oracle's VirtualBox.

In our case study, we installed the VirtualBox build for Ubuntu 12.04 LTS on a Sony Vaio laptop.

Installation of Cloudera Manager

To install Cloudera Manager for Hadoop, follow these steps:

  • Download Cloudera Quickstart VM for VirtualBox. The VM runs CentOS 6.2 and includes CDH4.2, Cloudera Manager 4.5.2, and Cloudera Impala 1.0.
  • Create an appropriate folder in your home directory, e.g. 'cloudera-vm'.
  • Unpack the downloaded file inside the folder you just created.
  • Start VirtualBox
  • Press the button 'New'
  • Select Linux → Red Hat 64 bit
  • Select: 'use existing virtual hard drive'

If VirtualBox complains about:

VT-x features locked or unavailable in MSR. (VERR_VMX_MSR_LOCKED_OR_DISABLED)

then you have to enable the virtualization settings in the BIOS.
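
On a Linux host, a quick way to see whether the CPU advertises hardware virtualization at all is to look for the vmx (Intel VT-x) or svm (AMD-V) flags in /proc/cpuinfo. The sketch below is only an illustration: the function name count_virt_flags is made up for this example, and a non-zero count only shows that the CPU supports the feature, not that it is unlocked in the BIOS.

```shell
#!/bin/sh
# Hypothetical helper (not part of VirtualBox): count the lines of a
# cpuinfo-style file that list the vmx (Intel VT-x) or svm (AMD-V) flag.
# $1 is the file to inspect; it defaults to /proc/cpuinfo (Linux only).
count_virt_flags() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so "|| true" keeps the function's status clean.
    grep -E -c -w 'vmx|svm' "${1:-/proc/cpuinfo}" || true
}

# Only query the real file where it exists (i.e. on Linux hosts).
if [ -r /proc/cpuinfo ]; then
    echo "CPU lines advertising VT-x/AMD-V: $(count_virt_flags)"
fi
```

If the count is 0, the CPU most likely lacks the feature, and changing the BIOS setting will not help.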

Hadoop basic functional test

Test the Hadoop installation with the famous WordCount example.

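  • Create the file WordCount.java by copying the source code.
  • Compile it against the Hadoop core libraries (necessary for CDH4). The output directory wordcount_classes must already exist, and the version in the JAR names must match your CDH release:
javac -classpath /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.0.0.jar:/usr/lib/hadoop/client/hadoop-mapreduce-client-core-2.0.0-cdh4.0.0.jar -d wordcount_classes WordCount.java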
  • Create new Java archive:
jar -cvf wordcount.jar -C wordcount_classes/ .
  • Add a sufficiently large text file to the Hadoop filesystem:
hadoop fs -put big.txt /user/cloudera/wordcount/input
  • Count the words in big.txt:
hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
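
To sanity-check the job on a small input, the same counts can be reproduced locally with standard Unix tools and compared with the job output. This sketch assumes that WordCount splits its input on whitespace only, as the usual example does; the function name local_wordcount and the file name small.txt are invented here.

```shell
#!/bin/sh
# Hypothetical local cross-check for the WordCount job: reads text on
# stdin and prints "word<TAB>count" lines sorted by word.
# Assumption: words are separated by whitespace only.
local_wordcount() {
    tr -s '[:space:]' '\n' |
        grep -v '^$' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ print $2 "\t" $1 }'
    # Stages: one word per line, drop empty lines, group identical words,
    # count each group, then reorder the columns to word<TAB>count.
}

# Example use on a small file, then compare with the job output:
#   local_wordcount < small.txt
#   hadoop fs -cat /user/cloudera/wordcount/output/part-*
```

The part-* glob is used because the exact names of the output files depend on the job configuration.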

Installation of Pydoop

The installation of Pydoop and other software is required before installing Biodoop-Seal.

  • Check which Java is installed: java -version. The Oracle (Sun) version is recommended: it should be included in the VirtualBox image.
  • Check your Python version: python --version (version 2.7 is recommended).
  • Set Hadoop path: export HADOOP_HOME=/usr/lib/hadoop.
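
The checks in this list can be bundled into a small pre-flight script. This is a minimal sketch: the helper name have_cmd is hypothetical, and /usr/lib/hadoop is simply the path used in this tutorial.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the prerequisites listed above.

# have_cmd NAME: succeed if NAME is found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

for tool in java python; do
    if have_cmd "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# Use the path from the tutorial unless HADOOP_HOME is already set.
HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"
export HADOOP_HOME
if [ -d "$HADOOP_HOME" ]; then
    echo "HADOOP_HOME=$HADOOP_HOME"
else
    echo "HADOOP_HOME points to a missing directory: $HADOOP_HOME"
fi
```

The script only reports what it finds; the versions still have to be checked by hand with java -version and python --version.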

Modules for Python 2.6

In case you need to use Python 2.6, you'll have to install the backported importlib and argparse modules.

Install them with:

pip install importlib --user
pip install argparse --user


  • Download Pydoop (in this case with Git):
git clone
  • Enter the directory and install it with either:
 python build

or:

pip install pydoop


Install Apache Ant and the ant-apache-regexp package:

sudo yum install ant
sudo yum install ant-apache-regexp

Protocol Buffers

  • Download Protocol Buffers (version 2.5.0).
  • Extract the package to your home folder, cd into it, then build and install it (installation instructions are inside the package):
./configure
make
make check
sudo make install


  • Download the C++ library Boost for Python (version 1.53.0).
  • Extract the package and cd into the directory, then build it (only the Python component):
./bootstrap.sh
./b2 --with-python


git clone hadoop-bam-code
  • Set the CLASSPATH:
export CLASSPATH=/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*
  • Before running Ant, open the build.xml file, then search for and delete the four lines containing source="1.6".
  • Execute Ant in the Hadoop-BAM folder to install it:
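
The build.xml edit from the earlier step can also be scripted before Ant is run. The sketch below assumes the attribute appears literally as source="1.6" with straight quotes; the function name strip_source_attr_lines is invented for this example, and the original file is kept as build.xml.bak.

```shell
#!/bin/sh
# Hypothetical helper: delete every line containing source="1.6" from the
# given Ant build file, keeping the original as <file>.bak.
strip_source_attr_lines() {
    # -i.bak edits in place and writes a backup (works with GNU and BSD sed).
    sed -i.bak '/source="1\.6"/d' "$1"
}

# Usage, from inside the Hadoop-BAM folder:
#   strip_source_attr_lines build.xml
#   grep -c 'source="1.6"' build.xml   # should now print 0
```

Comparing build.xml with build.xml.bak afterwards (for example with diff) shows exactly which lines were removed.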

Installation of Biodoop-Seal

git clone git:// biodoop-seal-code
  • cd into the directory and install it:
export HADOOP_BAM=<path to hadoop bam directory>
python build

Now you are finally ready to explore the potential of Hadoop!

Last modified: 2013/06/14 15:03 by Luca Pireddu