Pydoop has been tested on Gentoo, Ubuntu and CentOS. Although we currently have no information regarding other Linux distributions, we expect Pydoop to work (possibly with some tweaking) on them as well.
In order to build and install Pydoop, you need the following software:
Optional:
These are also runtime requirements for all cluster nodes. Note that installing Pydoop and your MapReduce application on every cluster node (or on an NFS share) is not required: see Installation-free Usage for more information. Moreover, since it is based on Pipes, Pydoop cannot be used with standalone Hadoop installations.
Other versions of Hadoop may or may not work depending on how different they are from the ones listed above.
Before compiling and installing Pydoop, install all missing dependencies.
In addition, if your distribution does not include them by default, install basic development tools (such as a C/C++ compiler) and Python header files. On Ubuntu, for instance, you can do that as follows:
sudo apt-get install build-essential python-dev
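On CentOS, the corresponding packages can usually be installed with yum (package names may vary slightly across releases):
sudo yum groupinstall "Development Tools"
sudo yum install python-devel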
Set the JAVA_HOME environment variable to your JDK installation directory, e.g.:
export JAVA_HOME=/usr/local/java/jdk
Note
If you don’t know where your Java home is, try finding the actual path of the java executable and stripping the trailing /jre/bin/java:
$ readlink -f $(which java)
/usr/lib/jvm/java-6-oracle/jre/bin/java
$ export JAVA_HOME=/usr/lib/jvm/java-6-oracle
If you have installed Hadoop from a tarball, set the HADOOP_HOME environment variable so that it points to where the tarball was extracted, e.g.:
export HADOOP_HOME=/opt/hadoop-1.0.4
The above step is not necessary if you installed CDH from distribution-specific packages. Build Pydoop with:
python setup.py build
This builds Pydoop with the “native” HDFS backend. To build the (experimental) JPype backend instead, run:
python setup.py build --hdfs-core-impl=jpype-bridged
For a system-wide installation, run the following:
sudo python setup.py install --skip-build
For a user-local installation:
python setup.py install --skip-build --user
The latter installs Pydoop in ~/.local/lib/python2.X/site-packages. This may be a particularly handy solution if your home directory is accessible on the entire cluster.
To install to an arbitrary path:
python setup.py install --skip-build --home <PATH>
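Note that with the distutils home scheme modules are installed under <PATH>/lib/python, which Python does not search by default, so you will probably need something like the following (replace <PATH> with the directory you chose):
export PYTHONPATH="<PATH>/lib/python:${PYTHONPATH}"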
“java home not found” error, even with JAVA_HOME properly exported: try setting JAVA_HOME explicitly in hadoop-env.sh
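For instance, you could add a line like the following (path taken from the example above; adjust it to your JDK) to ${HADOOP_HOME}/conf/hadoop-env.sh (etc/hadoop/hadoop-env.sh on Hadoop 2):
export JAVA_HOME=/usr/lib/jvm/java-6-oracle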
“libjvm.so not found” error: try the following:
export LD_LIBRARY_PATH="${JAVA_HOME}/jre/lib/amd64/server:${LD_LIBRARY_PATH}"
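The exact location of libjvm.so depends on your JDK and platform (the amd64/server path above is only the most common case); if in doubt, you can locate it with:
find "${JAVA_HOME}" -name libjvm.so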
non-standard include/lib directories: the setup script looks for includes and libraries in standard places (read setup.py for details). If some of the requirements are installed in non-standard locations, you need to add them to the search path. Example:
python setup.py build_ext -L/my/lib/path -I/my/include/path -R/my/lib/path
python setup.py build
python setup.py install --skip-build
Alternatively, you can write a small setup.cfg file for distutils:
[build_ext]
include_dirs=/my/include/path
library_dirs=/my/lib/path
rpath=%(library_dirs)s
and then run python setup.py install.
Finally, you can achieve the same result by manipulating the environment. This is particularly useful in the case of automatic download and install with pip:
export CPATH="/my/include/path:${CPATH}"
export LD_LIBRARY_PATH="/my/lib/path:${LD_LIBRARY_PATH}"
pip install pydoop
Hadoop version issues: the Hadoop version to build against is automatically detected from the output of hadoop version. If detection fails for any reason, you can provide the correct version string through the HADOOP_VERSION environment variable, e.g.:
export HADOOP_VERSION="1.0.4"
After Pydoop has been successfully installed, you might want to run unit tests to verify that everything works fine.
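As a quick sanity check before running the full test suite, you can verify that the installed package is importable (run this from outside the source tree, so that the local pydoop directory does not shadow the installed package):
python -c "import pydoop"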
IMPORTANT NOTICE: in order to run HDFS tests you must:
make sure that Pydoop is able to detect your Hadoop home and configuration directories. If auto-detection fails, try setting the HADOOP_HOME and HADOOP_CONF_DIR environment variables to the appropriate locations;
one of the test cases connects to an HDFS instance with an explicitly set host and port; if yours differ from “localhost” and 9000, respectively (8020 for package-based CDH), set the HDFS_HOST and HDFS_PORT environment variables accordingly (see the example after this list);
start HDFS:
${HADOOP_HOME}/bin/start-dfs.sh
wait until HDFS exits from safe mode:
${HADOOP_HOME}/bin/hadoop dfsadmin -safemode wait
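For instance, assuming the tarball installation used above and HDFS listening on a non-default port (all values below are examples; adjust them to your cluster):
export HADOOP_HOME=/opt/hadoop-1.0.4
export HADOOP_CONF_DIR="${HADOOP_HOME}/conf"
export HDFS_HOST=localhost
export HDFS_PORT=8020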
To run the unit tests, move to the test subdirectory and run as the cluster superuser (see below):
python all_tests.py
The following HDFS tests may fail if not run by the cluster superuser: capacity, chown and used. To get superuser privileges, you can either start the Hadoop daemons with your own user account, or add a property like the following to hdfs-site.xml, setting dfs.permissions.supergroup (dfs.permissions.superusergroup on Hadoop 2) to one of your user's groups:
<property>
<name>dfs.permissions.supergroup</name>
<value>admin</value>
</property>
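To pick a suitable value, you can list the groups your user belongs to, e.g.:
groups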
If you can’t acquire superuser privileges to run the tests, just keep in mind that the failures reported may be due to this reason.
With Apache Hadoop 2 / CDH 4, before running the unit tests, edit hdfs-site.xml and set dfs.namenode.fs-limits.min-block-size to a low value:
<property>
<name>dfs.namenode.fs-limits.min-block-size</name>
<value>512</value>
</property>
then restart the Hadoop daemons.
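How to restart them depends on your installation; for a tarball-based Hadoop 2 setup, for instance, something along these lines should work (package-based installs typically provide service scripts instead):
${HADOOP_HOME}/sbin/stop-dfs.sh
${HADOOP_HOME}/sbin/start-dfs.sh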