pydoop.hadut — Hadoop shell interaction

The hadut module provides access to some functionalities available via the Hadoop shell.

class pydoop.hadut.PipesRunner(prefix=None, logger=None)

Allows to set up and run pipes jobs, optionally automating a few common tasks.

Parameters:
  • prefix (str) – if specified, it must be a writable directory path that all nodes can see (the latter could be an issue if the local file system is used rather than HDFS)
  • logger (logging.Logger) – optional logger

If prefix is set, the runner object will create a working directory with that prefix and use it to store the job’s input and output — the intended use is for quick application testing. If it is not set, you must call set_output() with an hdfs path as its argument, and put will be ignored in your call to set_input(). In any event, the launcher script will be placed in the output directory’s parent (this has to be writable for the job to succeed).

clean()

Remove the working directory, if any.

collect_output(out_file=None)

Run collect_output() on the job’s output directory.

run(**kwargs)

Run the pipes job. Keyword arguments are passed to run_pipes().

set_exe(pipes_code)

Dump launcher code to the distributed file system.

set_input(input_, put=False)

Set the input path for the job. If put is True, copy (local) input_ to the working directory.

set_output(output)

Set the output path for the job. Optional if the runner has been instantiated with a prefix.

class pydoop.hadut.PydoopScriptRunner(prefix=None, logger=None)

Specialization of PipesRunner to support the set up and running of pydoop script jobs.

run(script, more_args=None, pydoop_exe='/usr/bin/pydoop')

Run the pipes job. Keyword arguments are passed to run_pipes().

exception pydoop.hadut.RunCmdError(returncode, cmd, output=None)

This exception is raised by run_cmd and all functions that make use of it to indicate that the call failed (returned non-zero).

pydoop.hadut.collect_output(mr_out_dir, out_file=None)

Return all mapreduce output in mr_out_dir.

Append the output to out_file if provided. Otherwise, return the result as a single string (it is the caller’s responsibility to ensure that the amount of data retrieved fits into memory).

pydoop.hadut.dfs(args=None, properties=None, hadoop_conf_dir=None)

Run the Hadoop file system shell.

All arguments are passed to run_class().

pydoop.hadut.find_jar(jar_name, root_path=None)

Look for the named jar in:

  1. root_path, if specified
  2. working directory – PWD
  3. ${PWD}/build
  4. /usr/share/java

Return the full path of the jar if found; else return None.

pydoop.hadut.get_num_nodes(properties=None, hadoop_conf_dir=None, offline=False)

Get the number of task trackers in the Hadoop cluster.

All arguments are passed to get_task_trackers().

pydoop.hadut.get_task_trackers(properties=None, hadoop_conf_dir=None, offline=False)

Get the list of task trackers in the Hadoop cluster.

Each element in the returned list is in the (host, port) format. All arguments are passed to run_class().

If offline is True, try getting the list of task trackers from the slaves file in Hadoop’s configuration directory (no attempt is made to contact the Hadoop daemons). In this case, ports are set to 0.

pydoop.hadut.path_exists(path, properties=None, hadoop_conf_dir=None)

Return True if path exists in the default HDFS.

Keyword arguments are passed to dfs().

This function does the same thing as hdfs.path.exists, but it uses a wrapper for the Hadoop shell rather than the hdfs extension.

pydoop.hadut.run_class(class_name, args=None, properties=None, classpath=None, hadoop_conf_dir=None, logger=None, keep_streams=True)

Run a Java class with Hadoop (equivalent of running hadoop <class_name> from the command line).

Additional HADOOP_CLASSPATH elements can be provided via classpath (either as a non-string sequence where each element is a classpath element or as a ':'-separated string). Other arguments are passed to run_cmd().

>>> cls = 'org.apache.hadoop.fs.FsShell'
>>> try: out = run_class(cls, args=['-test', '-e', 'file:/tmp'])
... except RunCmdError: tmp_exists = False
... else: tmp_exists = True

Note

HADOOP_CLASSPATH makes dependencies available only on the client side. If you are running a MapReduce application, use args=['-libjars', 'jar1,jar2,...'] to make them available to the server side as well.

pydoop.hadut.run_jar(jar_name, more_args=None, properties=None, hadoop_conf_dir=None, keep_streams=True)

Run a jar on Hadoop (hadoop jar command).

All arguments are passed to run_cmd() (args = [jar_name] + more_args) .

pydoop.hadut.run_pipes(executable, input_path, output_path, more_args=None, properties=None, force_pydoop_submitter=False, hadoop_conf_dir=None, logger=None, keep_streams=False)

Run a pipes command.

more_args (after setting input/output path) and properties are passed to run_cmd().

If not specified otherwise, this function sets the properties hadoop.pipes.java.recordreader and hadoop.pipes.java.recordwriter to "true".

This function works around a bug in Hadoop pipes that affects versions of Hadoop with security when the local file system is used as the default FS (no HDFS); see https://issues.apache.org/jira/browse/MAPREDUCE-4000. In those set-ups, the function uses Pydoop’s own pipes submitter application. You can force the use of Pydoop’s submitter by passing the argument force_pydoop_submitter=True.

pydoop.hadut.run_tool_cmd(tool, cmd, args=None, properties=None, hadoop_conf_dir=None, logger=None, keep_streams=True)

Run a Hadoop command.

If keep_streams is set to True (the default), the stdout and stderr of the command will be buffered in memory. If the command succeeds, the former will be returned; if it fails, a RunCmdError will be raised with the latter as the message. This mode is appropriate for short-running commands whose “result” is represented by their standard output (e.g., "dfsadmin", ["-safemode", "get"]).

If keep_streams is set to False, the command will write directly to the stdout and stderr of the calling process, and the return value will be empty. This mode is appropriate for long running commands that do not write their “real” output to stdout (such as pipes).

>>> hadoop_classpath = run_cmd('classpath')