pydoop.mapreduce.api — MapReduce API

The MapReduce API allows you to write the components of a MapReduce application.

The basic MapReduce components (Mapper, Reducer, RecordReader, etc.) are provided as abstract classes that must be subclassed by the developer, providing implementations for all methods called by the framework.

class pydoop.mapreduce.api.Closable
close()

Called after the object has finished its job.

Overriding this method is not required.

class pydoop.mapreduce.api.Context

Context objects are used for communication between the framework and the MapReduce application. These objects are instantiated by the framework and passed to user methods as parameters:

class Mapper(api.Mapper):

    def map(self, context):
        key, value = context.key, context.value
        ...
        context.emit(new_key, new_value)
emit(key, value)

Emit a key/value pair to the framework.

get_counter(group, name)

Get a Counter from the framework.

Parameters:
  • group (str) – counter group name
  • name (str) – counter name

The counter can be updated via increment_counter().

increment_counter(counter, amount)

Update a Counter by the specified amount.

job_conf

MapReduce job configuration as a JobConf object.

key

Input key.

set_status(status)

Set the current status.

Parameters:
  • status (str) – a description of the current status
value

Input value.
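
As an illustration of how a mapper interacts with these methods, here is a minimal sketch using a hypothetical stand-in context (FakeCounter, FakeContext and WordMapper are illustration-only names, not part of the API; in a real application the context and counters are instantiated by the framework):

```python
# Illustration only: FakeCounter and FakeContext are hypothetical
# stand-ins for the objects the framework instantiates.
class FakeCounter:
    def __init__(self):
        self.value = 0

class FakeContext:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.emitted = []
        self._counters = {}

    def get_counter(self, group, name):
        # same (group, name) always maps to the same counter object
        return self._counters.setdefault((group, name), FakeCounter())

    def increment_counter(self, counter, amount):
        counter.value += amount

    def emit(self, key, value):
        self.emitted.append((key, value))

class WordMapper:
    def map(self, context):
        words = context.value.split()
        for word in words:
            context.emit(word, 1)
        counter = context.get_counter("WORDCOUNT", "NUM_WORDS")
        context.increment_counter(counter, len(words))

ctx = FakeContext(0, "the quick brown fox")
WordMapper().map(ctx)
print(ctx.emitted[0])  # ('the', 1)
print(ctx.get_counter("WORDCOUNT", "NUM_WORDS").value)  # 4
```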

class pydoop.mapreduce.api.Counter(counter_id)

An interface to the Hadoop counters infrastructure.

Counter objects are instantiated and directly manipulated by the framework; users get and update them via the Context interface.

class pydoop.mapreduce.api.Factory

Creates MapReduce application components.

The classes to use for each component must be specified as arguments to the constructor.

create_combiner(context)

Create a combiner object.

Return the new combiner, or None if one is not needed.

create_partitioner(context)

Create a partitioner object.

Return the new partitioner, or None if the default partitioner should be used.

create_record_reader(context)

Create a record reader object.

Return the new record reader, or None if the Java record reader should be used.

create_record_writer(context)

Create an application record writer.

Return the new record writer, or None if the Java record writer should be used.

class pydoop.mapreduce.api.JobConf(values)

Configuration properties assigned to this job.

JobConf objects are instantiated by the framework and support the same interface as dictionaries, plus a few methods that perform automatic type conversion:

>>> jc = JobConf(['a', '1'])
>>> jc['a']
'1'
>>> jc.get_int('a')
1

Warning

For the most part, a JobConf object behaves like a dict. For backwards compatibility, however, there are two important exceptions:

  1. objects are constructed from a [key1, value1, key2, value2, ...] sequence
  2. if k is not in jc, jc.get(k) raises RuntimeError instead of returning None (jc.get(k, None) returns None as in dict).
get(k[, d])

Return D[k] if k is in D, else d; if d is not given and k is not in D, RuntimeError is raised (see the warning above).

get_bool(key, default=None)

Same as dict.get(), but the value is converted to a bool.

The string value is interpreted, ignoring case, as True if it equals 'true' and as False if it equals 'false'.

get_float(key, default=None)

Same as dict.get(), but the value is converted to a float.

get_int(key, default=None)

Same as dict.get(), but the value is converted to an int.
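
The behavior described above can be sketched with a small dict subclass (MiniJobConf is an illustration-only stand-in, not the real class, which is instantiated by the framework):

```python
# Illustration only: a stand-in reproducing the documented JobConf
# behavior (flat key/value constructor, RuntimeError on missing keys,
# typed getters).
_MISSING = object()

class MiniJobConf(dict):
    def __init__(self, values):
        # constructed from a [key1, value1, key2, value2, ...] sequence
        super().__init__(zip(values[::2], values[1::2]))

    def get(self, key, default=_MISSING):
        if key in self:
            return self[key]
        if default is _MISSING:
            raise RuntimeError(key)
        return default

    def get_int(self, key, default=None):
        value = self.get(key, default)
        return value if value is default else int(value)

    def get_float(self, key, default=None):
        value = self.get(key, default)
        return value if value is default else float(value)

    def get_bool(self, key, default=None):
        value = self.get(key, default)
        if value is default:
            return value
        return {"true": True, "false": False}[value.lower()]

jc = MiniJobConf(["a", "1", "flag", "TRUE"])
print(jc["a"])              # '1'
print(jc.get_int("a"))      # 1
print(jc.get_bool("flag"))  # True
```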

class pydoop.mapreduce.api.MapContext

The context given to the mapper.

get_input_split(raw=False)

Get the current input split.

If raw is False (the default), return an InputSplit object; if it’s True, return the raw byte string (the serialized split as received via the downlink).

get_input_value_class()

Return the type of the input value.

input_key_class

The type of the input key.

input_split

The current input split as an InputSplit object.

class pydoop.mapreduce.api.Mapper(context)

Maps input key/value pairs to a set of intermediate key/value pairs.

map(context)

Called once for each key/value pair in the input split. Applications must override this, emitting an output key/value pair through the context.

Parameters:
  • context (MapContext) – the context object passed by the framework, used to get the input key/value pair and emit the output key/value pair.
class pydoop.mapreduce.api.Partitioner(context)

Controls the partitioning of intermediate keys output by the Mapper. The key (or a subset of it) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.

partition(key, num_of_reduces)

Get the partition number for key given the total number of partitions, i.e., the number of reduce tasks for the job. Applications must override this.

Parameters:
  • key (str) – the key of the key/value pair being dispatched.
  • num_of_reduces (int) – the total number of reduce tasks.
Return type: int
Returns: the partition number for key.
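
A typical implementation hashes the key, as in this sketch (HashPartitioner is an illustration-only name; a real application would subclass Partitioner):

```python
import zlib

class HashPartitioner:
    def partition(self, key, num_of_reduces):
        # zlib.crc32 is stable across runs and interpreters, unlike
        # Python's built-in hash()
        return zlib.crc32(key.encode("utf-8")) % num_of_reduces

p = HashPartitioner()
assert 0 <= p.partition("foo", 4) < 4
```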

exception pydoop.mapreduce.api.PydoopError
class pydoop.mapreduce.api.RecordReader(context=None)

Breaks the data into key/value pairs for input to the Mapper.

get_progress()

The current progress of the record reader through its data.

Return type: float
Returns: the fraction of data read up to now, as a float between 0 and 1.
next()

Called by the framework to provide a key/value pair to the Mapper. Applications must override this, making sure it raises StopIteration when there are no more records to process.

Return type: tuple
Returns: a two-element tuple containing, respectively, the key and the value (as strings).
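
For instance, a line-oriented reader could be sketched as follows (LineRecordReader is an illustration-only name; a real reader would subclass RecordReader and read its share of the input split):

```python
# Illustration only: keys are byte offsets, values are lines; next()
# raises StopIteration when the input is exhausted.
class LineRecordReader:
    def __init__(self, data):
        self._lines = data.splitlines(True)  # keep line terminators
        self._pos = 0
        self._offset = 0

    def next(self):
        if self._pos >= len(self._lines):
            raise StopIteration
        line = self._lines[self._pos]
        key = self._offset
        self._pos += 1
        self._offset += len(line)
        return key, line.rstrip("\n")

    def get_progress(self):
        if not self._lines:
            return 1.0
        return self._pos / len(self._lines)

reader = LineRecordReader("a\nbb\n")
print(reader.next())  # (0, 'a')
print(reader.next())  # (2, 'bb')
```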
class pydoop.mapreduce.api.RecordWriter(context=None)

Writes the output key/value pairs to an output file.

emit(key, value)

Write a key/value pair. Applications must override this.

Parameters:
  • key (str) – a final output key
  • value (str) – a final output value
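
A sketch of a writer producing tab-separated text (TextRecordWriter is an illustration-only name; a real writer would subclass RecordWriter and open its output file based on the context):

```python
import io

# Illustration only: writes key/value pairs as tab-separated lines to
# any file-like object.
class TextRecordWriter:
    def __init__(self, stream):
        self._stream = stream

    def emit(self, key, value):
        self._stream.write("%s\t%s\n" % (key, value))

    def close(self):
        # called by the framework after the object has finished its job
        self._stream.close()

buf = io.StringIO()
writer = TextRecordWriter(buf)
writer.emit("foo", "1")
writer.emit("bar", "2")
print(buf.getvalue())  # 'foo\t1\nbar\t2\n'
```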
class pydoop.mapreduce.api.ReduceContext

The context given to the reducer.

next_value()

Return True if there is another value that can be processed.

class pydoop.mapreduce.api.Reducer(context=None)

Reduces a set of intermediate values which share a key to a (possibly) smaller set of values.

reduce(context)

Called once for each key. Applications must override this, emitting an output key/value pair through the context.

Parameters:
  • context (ReduceContext) – the context object passed by the framework, used to get the input key and corresponding set of values and emit the output key/value pair.
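
The following sketch shows the typical reduce loop driven by ReduceContext.next_value() (FakeReduceContext and SumReducer are illustration-only stand-ins; the real context is instantiated by the framework):

```python
# Illustration only: a stand-in context exposing next_value(), key,
# value and emit() as documented above.
class FakeReduceContext:
    def __init__(self, key, values):
        self.key = key
        self._values = iter(values)
        self.value = None
        self.emitted = []

    def next_value(self):
        # advance to the next value for the current key, if any
        try:
            self.value = next(self._values)
            return True
        except StopIteration:
            return False

    def emit(self, key, value):
        self.emitted.append((key, value))

class SumReducer:
    def reduce(self, context):
        total = 0
        while context.next_value():
            total += int(context.value)
        context.emit(context.key, total)

rctx = FakeReduceContext("word", ["1", "2", "3"])
SumReducer().reduce(rctx)
print(rctx.emitted)  # [('word', 6)]
```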
class pydoop.mapreduce.pipes.InputSplit(data)

Represents the data to be processed by an individual Mapper.

Typically, it presents a byte-oriented view on the input and it is the responsibility of the RecordReader to convert this to a record-oriented view.

The InputSplit is a logical representation of the actual dataset chunk, expressed through the filename, offset and length attributes.

InputSplit objects are instantiated by the framework and accessed via MapContext.input_split.

pydoop.mapreduce.pipes.run_task(factory, port=None, istream=None, ostream=None, private_encoding=True, context_class=<class 'pydoop.mapreduce.pipes.TaskContext'>, cmd_file=None, fast_combiner=False, auto_serialize=True)

Run the assigned task in the framework.

Return type: bool
Returns: True if the task succeeded.