Pydoop Submit User Guide
Pydoop applications are run via the pydoop submit command. To start, you will need a working Hadoop cluster. If you don’t have one available, you can bring up a single-node Hadoop cluster on your machine – see the Hadoop web site for instructions. Alternatively, the source directory contains a Dockerfile that can be used to build an image with Hadoop and Pydoop installed and (minimally) configured. Check out .travis.yml for usage hints.
If your application is contained in a single (local) file named wc.py, with an entry point called __main__ (see Writing Full-Featured Applications), you can run it as follows:

    pydoop submit --upload-file-to-cache wc.py wc input output
where input (file or directory) and output (directory) are HDFS paths. Note that the output directory will not be overwritten: instead, an error will be generated if it already exists when you launch the program. If your entry point has a different name, specify it via --entry-point.
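For reference, here is a minimal sketch of what such a wc.py might contain, assuming the Mapper/Reducer classes and the run_task/Factory helpers of the Pydoop MapReduce API (see Writing Full-Featured Applications for the authoritative details); the word count logic itself is only illustrative:

    # wc.py -- minimal word count sketch; aside from the __main__ entry
    # point required by pydoop submit, treat this as an approximation of
    # the API described in Writing Full-Featured Applications.
    import pydoop.mapreduce.api as api
    import pydoop.mapreduce.pipes as pipes


    class Mapper(api.Mapper):
        def map(self, context):
            # Emit each word of the current input line with a count of 1.
            for word in context.value.split():
                context.emit(word, 1)


    class Reducer(api.Reducer):
        def reduce(self, context):
            # Sum all counts collected for a given word.
            context.emit(context.key, sum(context.values))


    def __main__():
        # Default entry point looked up by the pydoop submit launcher.
        pipes.run_task(pipes.Factory(Mapper, reducer_class=Reducer))

The module name passed to pydoop submit (wc in the example above) must match the name of the uploaded file without its .py extension.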
The following table shows the command line options for pydoop submit:
| Short | Long | Meaning |
|---|---|---|
| | --num-reducers | Number of reduce tasks. Specify 0 to only perform the map phase |
| | --no-override-home | Don’t set the script’s HOME directory to the $HOME in your environment. Hadoop will set it to the value of the ‘mapreduce.admin.user.home.dir’ property |
| | --no-override-env | Use the default PATH, LD_LIBRARY_PATH and PYTHONPATH, instead of copying them from the submitting client node |
| | --no-override-ld-path | Use the default LD_LIBRARY_PATH instead of copying it from the submitting client node |
| | --no-override-pypath | Use the default PYTHONPATH instead of copying it from the submitting client node |
| | --no-override-path | Use the default PATH instead of copying it from the submitting client node |
| | --set-env | Set environment variables for the tasks. If a variable is set to ‘’, it will not be overridden by Pydoop |
| -D | | Set a Hadoop property, e.g., -D mapreduce.job.priority=high |
| | --python-zip | Additional python zip file |
| | --upload-file-to-cache | Upload and add this file to the distributed cache |
| | --upload-archive-to-cache | Upload and add this archive file to the distributed cache |
| | --log-level | Logging level |
| | --job-name | Name of the job |
| | --python-program | Python executable that should be used by the wrapper |
| | --pretend | Do not actually submit a job; print the generated config settings and the command line that would be invoked |
| | --hadoop-conf | Hadoop configuration file |
| | --input-format | Java classname of the InputFormat |
| | --disable-property-name-conversion | Do not adapt property names to the Hadoop version used |
| | --do-not-use-java-record-reader | Disable the Java RecordReader |
| | --do-not-use-java-record-writer | Disable the Java RecordWriter |
| | --output-format | Java classname of the OutputFormat |
| | --libjars | Additional comma-separated list of jar files |
| | --cache-file | Add this HDFS file to the distributed cache as a file |
| | --cache-archive | Add this HDFS archive file to the distributed cache as an archive |
| | --entry-point | Explicitly execute MODULE.ENTRY_POINT() in the launcher script |
| | --avro-input | Avro input mode (key, value or both) |
| | --avro-output | Avro output mode (key, value or both) |
| | --pstats-dir | Profile each task and store stats in this dir |
| | --pstats-fmt | pstats filename pattern (expert use only) |
| | --keep-wd | Don’t remove the work dir |
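Several options can be combined in a single invocation. As an illustrative sketch, assuming the option names listed above and reusing the wc.py example, the following command sets a Hadoop property and raises the logging level:

    pydoop submit -D mapreduce.job.priority=high --log-level DEBUG --upload-file-to-cache wc.py wc input output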
Setting the Environment for your Program
When working on a shared cluster where you don’t have root access, you might have a lot of software installed in non-standard locations, such as your home directory. Since non-interactive ssh connections do not usually preserve your environment, you might lose essential settings like LD_LIBRARY_PATH. For this reason, by default pydoop submit copies some environment variables from the submitting node to the driver script that runs each task on Hadoop. If this behavior is not desired, you can disable it via the --no-override-env command line option.
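For instance, reusing the wc.py example above, the following (illustrative) command skips the copying step, so the tasks run with the default PATH, LD_LIBRARY_PATH and PYTHONPATH of the cluster nodes:

    pydoop submit --no-override-env --upload-file-to-cache wc.py wc input output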