pydoop.hadut — Hadoop shell interaction

Provides access to some of the functionality available via the Hadoop shell.

exception pydoop.hadut.RunCmdError(returncode, cmd, output=None)

Raised by run_tool_cmd() and all functions that make use of it to indicate that the call failed (the command exited with a non-zero status).

pydoop.hadut.collect_output(mr_out_dir, out_file=None)

Return all mapreduce output in mr_out_dir.

Append the output to out_file if provided. Otherwise, return the result as a single string (it is the caller’s responsibility to ensure that the amount of data retrieved fits into memory).
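As an illustration of the in-memory mode, the sketch below splits the returned string into key/value pairs. It assumes the job wrote one tab-separated record per line (the usual text output layout), which may not hold for your job. The names parse_mr_output and load_pairs are hypothetical helpers, not part of pydoop.

```python
def parse_mr_output(text, sep="\t"):
    """Split collected output into (key, value) pairs, one per non-empty line.

    Assumes one record per line with key and value separated by `sep`.
    A line without a separator yields an empty value.
    """
    pairs = []
    for line in text.splitlines():
        if not line:
            continue
        key, _, value = line.partition(sep)
        pairs.append((key, value))
    return pairs


def load_pairs(mr_out_dir):
    """Hypothetical wrapper: fetch all output from mr_out_dir and parse it."""
    # Imported lazily so parse_mr_output can be used without pydoop installed.
    import pydoop.hadut as hadut
    return parse_mr_output(hadut.collect_output(mr_out_dir))
```

For output that may not fit into memory, passing out_file and processing the resulting file line by line is likely the safer choice.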

pydoop.hadut.run_class(class_name, args=None, properties=None, classpath=None, hadoop_conf_dir=None, logger=None, keep_streams=True)

Run a Java class with Hadoop (equivalent of running hadoop <class_name> from the command line).

Additional HADOOP_CLASSPATH elements can be provided via classpath (either as a non-string sequence where each element is a classpath element or as a ':'-separated string). Other arguments are passed to run_cmd().

Note

HADOOP_CLASSPATH makes dependencies available only on the client side. If you are running a MapReduce application, use args=['-libjars', 'jar1,jar2,...'] to make them available to the server side as well.
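A minimal sketch of passing the same jars on both sides: as classpath elements for the client and as a -libjars argument for the server side. The helper names (build_classpath, libjars_args, run_with_jars) are hypothetical and only illustrate the two formats described above. Note that -libjars is a generic Hadoop option, so it is assumed here that the Java class parses generic options (for example via ToolRunner).

```python
def build_classpath(elements):
    """Return the ':'-separated form of a classpath.

    Accepts either an already joined string or a non-string sequence
    of classpath elements.
    """
    if isinstance(elements, str):
        return elements
    return ":".join(elements)


def libjars_args(jars, app_args=()):
    """Prepend '-libjars jar1,jar2,...' to the application's own arguments."""
    args = []
    if jars:
        args += ["-libjars", ",".join(jars)]
    return args + list(app_args)


def run_with_jars(class_name, jars, app_args=()):
    """Hypothetical wrapper: make `jars` visible to client and server side."""
    # Imported lazily so the helpers above can be used without pydoop installed.
    import pydoop.hadut as hadut
    return hadut.run_class(
        class_name,
        args=libjars_args(jars, app_args),
        classpath=build_classpath(jars),
    )
```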

pydoop.hadut.run_cmd(cmd, args=None, properties=None, hadoop_home=None, hadoop_conf_dir=None, logger=None, keep_streams=True)

Run the hadoop command.

Calls run_tool_cmd() with "hadoop" as the first argument.

pydoop.hadut.run_tool_cmd(tool, cmd, args=None, properties=None, hadoop_conf_dir=None, logger=None, keep_streams=True)

Run a Hadoop command.

If keep_streams is set to True (the default), the stdout and stderr of the command will be buffered in memory. If the command succeeds, the former will be returned; if it fails, a RunCmdError will be raised with the latter as the message. This mode is appropriate for short-running commands whose “result” is represented by their standard output (e.g., rval = run_tool_cmd("hdfs", "dfsadmin", ["-safemode", "get"])).

If keep_streams is set to False, the command will write directly to the stdout and stderr of the calling process, and the return value will be empty. This mode is appropriate for long-running commands that do not write their “real” output to stdout.
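The two modes can be pictured with a small stand-in built on the standard subprocess module. This is a sketch of the behaviour described above, not pydoop's actual implementation: the names CmdError and run are hypothetical, and raising on a non-zero exit status in the unbuffered mode is an assumption based on the RunCmdError description.

```python
import subprocess
import sys


class CmdError(Exception):
    """Hypothetical stand-in for RunCmdError."""

    def __init__(self, returncode, cmd, output=None):
        super().__init__(output)
        self.returncode = returncode
        self.cmd = cmd


def run(argv, keep_streams=True):
    """Run argv, mimicking the two keep_streams modes."""
    if keep_streams:
        # Buffer stdout and stderr in memory.
        proc = subprocess.run(
            argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
        )
        if proc.returncode != 0:
            # On failure, stderr becomes the exception message.
            raise CmdError(proc.returncode, argv, proc.stderr)
        return proc.stdout
    # Let the child write directly to this process's stdout and stderr.
    returncode = subprocess.call(argv)
    if returncode != 0:
        raise CmdError(returncode, argv)
    return ""


if __name__ == "__main__":
    print(run([sys.executable, "-c", "print('ok')"]).strip())
```

With the real module, the equivalent pattern would be to wrap the call in try/except pydoop.hadut.RunCmdError and inspect the exception message for the command's stderr.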