
pydoop.mapreduce.simulator — Hadoop Simulator API

This module provides basic, stand-alone Hadoop simulators for debugging support.

class pydoop.mapreduce.simulator.HadoopSimulatorLocal(factory, logger=None, loglevel=50, context_cls=None, avro_input=None, avro_output=None, avro_output_key_schema=None, avro_output_value_schema=None)

Simulates the invocation of program components in a Hadoop workflow.

from pydoop.mapreduce.simulator import HadoopSimulatorLocal
from my_mr_app import Factory

hs = HadoopSimulatorLocal(Factory())
job_conf = {...}  # fill in with the required job configuration
# fin and fout are open file objects for the job's input and output
with open('input.txt') as fin, open('output.txt', 'w') as fout:
    hs.run(fin, fout, job_conf)
counters = hs.get_counters()

run(file_in, file_out, job_conf, num_reducers=1, input_split='')

Run the simulator as configured by job_conf, with num_reducers reducers. If file_in is not None, simulate the behavior of Hadoop’s TextLineReader, creating a record for each line in file_in. Otherwise, assume that the factory argument given to the constructor defines a RecordReader, and that job_conf provides a suitable InputSplit. Similarly, if file_out is None, assume that factory defines a RecordWriter with appropriate parameters in job_conf.
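As a minimal sketch of the second mode (the factory module and all configuration keys here are hypothetical, not part of the Pydoop API), passing None for both file arguments delegates all record I/O to the factory-defined components:

from pydoop.mapreduce.simulator import HadoopSimulatorLocal
from my_mr_app import Factory  # hypothetical factory that also defines
                               # a RecordReader and a RecordWriter

hs = HadoopSimulatorLocal(Factory())
# placeholder keys: job_conf must carry whatever parameters the custom
# RecordReader and RecordWriter expect
job_conf = {"my.app.output.dir": "file:///tmp/wc_output"}
# serialized input split for the RecordReader (see the
# HadoopSimulatorNetwork example below for one way to build one)
split = "..."
# with file_in and file_out set to None, reading and writing records
# is handled by the factory-defined components
hs.run(None, None, job_conf, num_reducers=1, input_split=split)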

class pydoop.mapreduce.simulator.HadoopSimulatorNetwork(program=None, logger=None, loglevel=50, sleep_delta=3, context_cls=None, avro_input=None, avro_output=None, avro_output_key_schema=None, avro_output_value_schema=None)

Simulates the invocation of program components in a Hadoop workflow using network connections to communicate with a user-provided pipes program.

import os
import logging

from pydoop.mapreduce.simulator import HadoopSimulatorNetwork
# InputSplit is the helper used below to serialize the
# (uri, offset, length) triple; its import path depends on your
# Pydoop version

logging.basicConfig()
logger = logging.getLogger("wordcount")

program_name = '../wordcount/new_api/wordcount_full.py'
data_in = '../input/alice.txt'
output_dir = './output'
data_in_path = os.path.realpath(data_in)
data_in_uri = 'file://' + data_in_path
data_in_size = os.stat(data_in_path).st_size
os.makedirs(output_dir)
output_dir_uri = 'file://' + os.path.realpath(output_dir)
conf = {
  "mapred.job.name": "wordcount",
  "mapred.work.output.dir": output_dir_uri,
  "mapred.task.partition": "0",
}
# serialize the input split: the whole input file, starting at offset 0
input_split = InputSplit.to_string(data_in_uri, 0, data_in_size)
hsn = HadoopSimulatorNetwork(program=program_name, logger=logger,
                             loglevel=logging.INFO)
# file_in and file_out are None: the pipes program reads via the given
# input split and writes under mapred.work.output.dir
hsn.run(None, None, conf, input_split=input_split)

The Pydoop application program will be launched sleep_delta seconds after framework initialization.

run(file_in, file_out, job_conf, num_reducers=1, input_split='')

Run the program through the simulated Hadoop infrastructure, piping the contents of file_in to the program in the same way Hadoop’s TextInputFormat would. Setting file_in to None means the program is expected to read its data through its own RecordReader, using the provided input_split. Analogously, the final results are written to file_out unless it is set to None, in which case the program is expected to define a RecordWriter.
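For instance, to stream a plain text file through the program and collect the results on a local file (a sketch only: the program and file names are assumptions, and file_in/file_out are taken to be open file objects, as in the HadoopSimulatorLocal example above):

import logging
from pydoop.mapreduce.simulator import HadoopSimulatorNetwork

hsn = HadoopSimulatorNetwork(program='wordcount_full.py',
                             loglevel=logging.INFO)
conf = {"mapred.job.name": "wordcount", "mapred.task.partition": "0"}
# input is fed to the program line by line, as TextInputFormat would
# do; output records end up in wc_out.txt
with open('alice.txt') as fin, open('wc_out.txt', 'w') as fout:
    hsn.run(fin, fout, conf, num_reducers=1)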