pydoop.mapreduce.api — MapReduce API

This module provides the base abstract classes used to develop MapReduce application components (Mapper, Reducer, etc.).

class pydoop.mapreduce.api.Combiner(context)

A Combiner performs the same actions as a Reducer, but it runs locally within a map task. This helps cut down the amount of data sent across the network to the reducers, at the cost of extra memory in map tasks to cache intermediate key/value pairs. The cache size is controlled by "mapreduce.task.io.sort.mb" and defaults to 100 MB.

Note that it’s not strictly necessary to extend this class in order to write a combiner: all that’s required is that it has the same interface as a reducer. Indeed, in many cases it’s useful to set the combiner class to be the same as the reducer class.
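
For example, in a word count application the reducer logic (summing partial counts) works unchanged as a combiner. A minimal sketch, assuming the generic factory described under Factory below accepts reducer_class and combiner_class keyword arguments:

import pydoop.mapreduce.api as api

class WordCountReducer(api.Reducer):

    def reduce(self, context):
        # Summing partial counts is associative and commutative, so the same
        # class can run both as the combiner (locally, within map tasks) and
        # as the reducer.
        context.emit(context.key, sum(context.values))

# WordCountMapper is defined in the Factory example further below:
# factory = pipes.Factory(WordCountMapper, reducer_class=WordCountReducer,
#                         combiner_class=WordCountReducer)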

class pydoop.mapreduce.api.Component(context)
class pydoop.mapreduce.api.Context

Context objects are used for communication between the framework and the MapReduce application. These objects are instantiated by the framework and passed to user methods as parameters:

import pydoop.mapreduce.api as api

class Mapper(api.Mapper):

    def map(self, context):
        key, value = context.key, context.value
        ...
        context.emit(new_key, new_value)

emit(key, value)

Emit a key, value pair to the framework.

get_counter(group, name)

Get a Counter from the framework.

Parameters
  • group (str) – counter group name

  • name (str) – counter name

The counter can be updated via increment_counter().

increment_counter(counter, amount)

Update a Counter by the specified amount.
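
For instance, a mapper can obtain a counter once in its constructor and bump it for every record it processes. A minimal sketch (the group and counter names are arbitrary, application-chosen labels):

import pydoop.mapreduce.api as api

class CountingMapper(api.Mapper):

    def __init__(self, context):
        super(CountingMapper, self).__init__(context)
        # "EXAMPLE" / "PROCESSED_RECORDS" are illustrative names.
        self.record_counter = context.get_counter("EXAMPLE", "PROCESSED_RECORDS")

    def map(self, context):
        context.increment_counter(self.record_counter, 1)
        context.emit(context.key, context.value)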

input_split

The InputSplit for this task (map tasks only).

This tries to deserialize the raw split sent from upstream. In the most common scenario (file-based input format), the returned value will be a FileSplit.

To get the raw split, call get_input_split() with raw=True.
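
For example, a mapper can inspect the file-based split assigned to its task. A minimal sketch, assuming the FileSplit exposes filename, offset and length attributes:

import pydoop.mapreduce.api as api

class SplitAwareMapper(api.Mapper):

    def __init__(self, context):
        super(SplitAwareMapper, self).__init__(context)
        split = context.input_split  # a FileSplit for file-based input formats
        # Record which part of which file this map task has been assigned.
        self.split_info = (split.filename, split.offset, split.length)

    def map(self, context):
        context.emit(context.key, context.value)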

job_conf

MapReduce job configuration as a JobConf object.

key

Input key.

set_status(status)

Set the current status.

Parameters

status (str) – a description of the current status

value

Input value (map tasks only).

values

Iterator over all values for the current key (reduce tasks only).

class pydoop.mapreduce.api.Factory

Creates MapReduce application components (e.g., mapper, reducer).

A factory object must be created by the application and passed to the framework as the first argument to run_task(). All MapReduce applications need at least a mapper object, while other components are optional (the corresponding create_ method can return None). Note that the reducer is optional only in map-only jobs, where the number of reduce tasks has been set to 0.

Factory provides a generic implementation that takes component classes as initialization parameters and creates component objects as needed.
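
For instance, a word count application can be assembled by handing the mapper and reducer classes to the generic factory and passing the resulting object to run_task(). A minimal sketch, assuming the generic factory lives in pydoop.mapreduce.pipes and that __main__ is the entry point expected by pydoop submit:

import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes

class WordCountMapper(api.Mapper):

    def map(self, context):
        # Emit a partial count of 1 for every word in the input value.
        for word in context.value.split():
            context.emit(word, 1)

class WordCountReducer(api.Reducer):

    def reduce(self, context):
        # Sum the partial counts collected for the current key.
        context.emit(context.key, sum(context.values))

def __main__():
    factory = pipes.Factory(WordCountMapper, reducer_class=WordCountReducer)
    pipes.run_task(factory)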

create_combiner(context)

Create a combiner object.

Return the new combiner or None, if one is not needed.

create_partitioner(context)

Create a partitioner object.

Return the new partitioner or None, if the default partitioner should be used.

create_record_reader(context)

Create a record reader object.

Return the new record reader or None, if the Java record reader should be used.

create_record_writer(context)

Create an application record writer.

Return the new record writer or None, if the Java record writer should be used.

class pydoop.mapreduce.api.FileSplit

A subset (described by offset and length) of an input file.

class pydoop.mapreduce.api.InputSplit

Represents a subset of the input data assigned to a single map task.

InputSplit objects are created by the framework and made available to the user application via the input_split context attribute.

class pydoop.mapreduce.api.JobConf

Configuration properties assigned to this job.

JobConf objects are instantiated by the framework and support the same interface as dictionaries, plus a few methods that perform automatic type conversion:

>>> jc['a']
'1'
>>> jc.get_int('a')
1
get_bool(key, default=None)

Same as dict.get(), but the value is converted to a bool.

The value is considered True if the string equals 'true' and False if it equals 'false', ignoring case in both comparisons.

get_float(key, default=None)

Same as dict.get(), but the value is converted to a float.

get_int(key, default=None)

Same as dict.get(), but the value is converted to an int.
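
Within a task, the typed getters are typically applied to context.job_conf. A minimal sketch, assuming the job was submitted with the hypothetical properties my.filter.max.len and my.filter.case.sensitive:

import pydoop.mapreduce.api as api

class FilterMapper(api.Mapper):

    def __init__(self, context):
        super(FilterMapper, self).__init__(context)
        jc = context.job_conf
        # Hypothetical, application-defined job properties.
        self.max_len = jc.get_int("my.filter.max.len", 80)
        self.case_sensitive = jc.get_bool("my.filter.case.sensitive", False)

    def map(self, context):
        value = context.value if self.case_sensitive else context.value.lower()
        if len(value) <= self.max_len:
            context.emit(context.key, value)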

class pydoop.mapreduce.api.Mapper(context)

Maps input key/value pairs to a set of intermediate key/value pairs.

map(context)

Called once for each key/value pair in the input split. Applications must override this, emitting an output key/value pair through the context.

Parameters

context (Context) – the context object passed by the framework, used to get the input key/value pair and emit the output key/value pair.

class pydoop.mapreduce.api.OpaqueSplit

A wrapper for an arbitrary Python object.

Opaque splits are created on the Python side before job submission, serialized as hadoop.io.Writable objects and stored in an HDFS file. The Java submitter reads the splits from the above file and forwards them to the Python tasks.

Note

Opaque splits are only available when running a job via pydoop submit. The HDFS path where splits are stored is specified via the pydoop.mapreduce.pipes.externalsplits.uri configuration key.

class pydoop.mapreduce.api.Partitioner(context)

Controls the partitioning of intermediate keys output by the Mapper. The key (or a subset of it) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the reduce tasks the intermediate key (and hence the record) is sent to for reduction.

partition(key, num_of_reduces)

Get the partition number for key given the total number of partitions, i.e., the number of reduce tasks for the job. Applications must override this.

Parameters
  • key (str) – the key of the key/value pair being dispatched.

  • num_of_reduces (int) – the total number of reduce tasks for the job.

Return type

int

Returns

the partition number for key.
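
A custom partitioner typically hashes (a subset of) the key and takes the result modulo the number of reduce tasks. A minimal sketch that partitions on the first tab-separated field of the key only (zlib.crc32 is used because, unlike Python's built-in string hash, it is deterministic across task processes):

import zlib

import pydoop.mapreduce.api as api

class FirstFieldPartitioner(api.Partitioner):

    def partition(self, key, num_of_reduces):
        # All keys that share the same first field end up in the same
        # partition, and hence in the same reduce task.
        first_field = key.split("\t", 1)[0]
        return zlib.crc32(first_field.encode("utf-8")) % num_of_reduces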

class pydoop.mapreduce.api.RecordReader(context)

Breaks the data into key/value pairs for input to the Mapper.

get_progress()

The current progress of the record reader through its data.

Return type

float

Returns

the fraction of data read up to now, as a float between 0 and 1.

next()

Called by the framework to provide a key/value pair to the Mapper. Applications must override this, making sure it raises StopIteration when there are no more records to process.

Return type

tuple

Returns

a two-element tuple containing, respectively, the key and the value (as strings)
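
For example, a reader that returns the whole byte range assigned to the task as a single record could look like the following. A minimal sketch, assuming a file-based FileSplit (filename, offset, length) and using pydoop.hdfs for file access:

import pydoop.hdfs as hdfs
import pydoop.mapreduce.api as api

class WholeSplitReader(api.RecordReader):

    def __init__(self, context):
        super(WholeSplitReader, self).__init__(context)
        self.split = context.input_split
        self.done = False

    def next(self):
        if self.done:
            raise StopIteration
        # Read the byte range assigned to this task and return it as a
        # single record, keyed by the file name.
        f = hdfs.open(self.split.filename, "rb")
        try:
            f.seek(self.split.offset)
            data = f.read(self.split.length).decode("utf-8")
        finally:
            f.close()
        self.done = True
        return self.split.filename, data

    def get_progress(self):
        return 1.0 if self.done else 0.0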

class pydoop.mapreduce.api.RecordWriter(context)

Writes the output key/value pairs to an output file.

emit(key, value)

Writes a key/value pair. Applications must override this.

Parameters
  • key (str) – a final output key

  • value (str) – a final output value
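
A custom writer usually derives its output destination from the job configuration. A minimal sketch that mimics the default text output format, assuming the standard Hadoop properties mapreduce.task.partition and mapreduce.task.output.dir are set in the job configuration and using pydoop.hdfs for file access:

import pydoop.hdfs as hdfs
import pydoop.mapreduce.api as api

class TextWriter(api.RecordWriter):

    def __init__(self, context):
        super(TextWriter, self).__init__(context)
        jc = context.job_conf
        part = jc.get_int("mapreduce.task.partition")
        out_dir = jc["mapreduce.task.output.dir"]
        # One output file per task, named like the default Java writer's.
        self.file = hdfs.open("%s/part-%05d" % (out_dir, part), "wt")
        self.sep = jc.get("mapreduce.output.textoutputformat.separator", "\t")

    def close(self):
        # Called by the framework when the task is done.
        self.file.close()

    def emit(self, key, value):
        self.file.write("%s%s%s\n" % (key, self.sep, value))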

class pydoop.mapreduce.api.Reducer(context)

Reduces a set of intermediate values which share a key to a (possibly) smaller set of values.

reduce(context)

Called once for each key. Applications must override this, emitting an output key/value pair through the context.

Parameters

context (Context) – the context object passed by the framework, used to get the input key and corresponding set of values and emit the output key/value pair.

pydoop.mapreduce.pipes.run_task(factory, **kwargs)

Run a MapReduce task.

Available keyword arguments:

  • raw_keys (default: False): pass map input keys to context as byte strings (ignore any type information)

  • raw_values (default: False): pass map input values to context as byte strings (ignore any type information)

  • private_encoding (default: True): automatically serialize map output k/v and deserialize reduce input k/v (pickle)

  • auto_serialize (default: True): automatically serialize reduce output (map output in map-only jobs) k/v (call str/unicode then encode as utf-8)

Advanced keyword arguments:

  • pstats_dir: run the task with cProfile and store stats in this dir

  • pstats_fmt: use this pattern for pstats filenames (experts only)

The pstats directory and filename pattern can also be set via pydoop submit arguments; in case of a clash, the values passed to run_task() take precedence.
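
For instance, an application that wants to work directly with raw byte strings and handle its own serialization could disable the automatic behavior. A minimal sketch (BytesMapper and BytesReducer are hypothetical application classes):

import pydoop.mapreduce.pipes as pipes

from myapp import BytesMapper, BytesReducer  # hypothetical components

def __main__():
    factory = pipes.Factory(BytesMapper, reducer_class=BytesReducer)
    pipes.run_task(
        factory,
        raw_keys=True,           # map input keys arrive as byte strings
        raw_values=True,         # map input values arrive as byte strings
        private_encoding=False,  # do not pickle map output / reduce input
        auto_serialize=False,    # do not str()-encode the final output
    )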