The MapReduce API lets you write the components of a MapReduce application.
The basic MapReduce components (Mapper, Reducer, RecordReader, etc.) are provided as abstract classes that the developer must subclass, providing implementations for all methods called by the framework.
Context objects are used for communication between the framework and the MapReduce application. These objects are instantiated by the framework and passed to user methods as parameters:
import pydoop.mapreduce.api as api

class Mapper(api.Mapper):

    def map(self, context):
        key, value = context.key, context.value
        ...
        context.emit(new_key, new_value)
The main methods and properties exposed by the context are:
emit(key, value): emit a key/value pair to the framework.
get_counter(group, name): get a Counter from the framework; the counter can then be updated via increment_counter().
key: the input key.
set_status(status): set the current status; status (str) is a description of the current status.
value: the input value.
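As a sketch of how these pieces fit together (the counter group and name are arbitrary labels, and values are emitted here as plain ints, which assumes the job's serialization accepts them), a mapper might use the context as follows:

import pydoop.mapreduce.api as api

class WordCountMapper(api.Mapper):

    def map(self, context):
        words = context.value.split()
        # counters are obtained from the framework and updated through the context;
        # real code would typically fetch the counter once and cache it
        counter = context.get_counter("WORDCOUNT", "INPUT_WORDS")
        context.increment_counter(counter, len(words))
        context.set_status("processing input key %r" % (context.key,))
        for word in words:
            context.emit(word, 1)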
An interface to the Hadoop counters infrastructure.
Counter objects are instantiated and directly manipulated by the framework; users get and update them via the Context interface.
Creates MapReduce application components.
The classes to use for each component must be specified as arguments to the constructor.
create_combiner(context): create a combiner object. Return the new combiner, or None if one is not needed.
create_partitioner(context): create a partitioner object. Return the new partitioner, or None if the default partitioner should be used.
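A minimal factory sketch, assuming components are constructed with the context (as the framework-provided factory does) and that the mandatory mapper and reducer are created by create_mapper() and create_reducer(); WordCountMapper and WordCountReducer are the hypothetical classes used in the other examples on this page:

import pydoop.mapreduce.api as api
# hypothetical module containing the mapper/reducer classes sketched elsewhere on this page
from wordcount_components import WordCountMapper, WordCountReducer

class WordCountFactory(api.Factory):

    def create_mapper(self, context):
        return WordCountMapper(context)

    def create_reducer(self, context):
        return WordCountReducer(context)

    def create_combiner(self, context):
        # reuse the reducer as a combiner; returning None would mean "no combiner"
        return WordCountReducer(context)

    def create_partitioner(self, context):
        # returning None means "use the default partitioner"
        return None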
Configuration properties assigned to this job.
JobConf objects are instantiated by the framework and support the same interface as dictionaries, plus a few methods that perform automatic type conversion:
>>> jc['a']
'1'
>>> jc.get_int('a')
1
Warning
For the most part, a JobConf object behaves like a dict. For backwards compatibility, however, there are two important exceptions:
get_bool(key, default=None): same as dict.get(), but the value is converted to a bool. The value is considered, respectively, True or False if the string equals, ignoring case, 'true' or 'false'.
get_float(key, default=None): same as dict.get(), but the value is converted to a float.
get_int(key, default=None): same as dict.get(), but the value is converted to an int.
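Continuing the example above (where 'a' maps to the string '1') and using a hypothetical 'flag' property set to the string 'TRUE', the typed getters behave as follows:

>>> jc.get_float('a')
1.0
>>> jc['flag']
'TRUE'
>>> jc.get_bool('flag')
True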
The context given to the mapper.
Get the raw input split as a byte string (backward compatibility).
Return the type of the input key.
Return the type of the input value.
Get the current input split as an InputSplit object.
Maps input key/value pairs to a set of intermediate key/value pairs.
Called once for each key/value pair in the input split. Applications must override this, emitting an output key/value pair through the context.
Parameters: context (MapContext) – the context object passed by the framework, used to get the input key/value pair and emit the output key/value pair.
Controls the partitioning of intermediate keys output by the Mapper. The key (or a subset of it) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.
Get the partition number for key given the total number of partitions, i.e., the number of reduce tasks for the job. Applications must override this.
Returns: the partition number for key, as an int.
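A minimal partitioner sketch; it assumes keys are strings and uses a CRC32 digest rather than Python's built-in hash(), which is randomized per process and would not give a consistent key-to-partition mapping across map tasks:

import zlib

import pydoop.mapreduce.api as api

class CRC32Partitioner(api.Partitioner):

    def partition(self, key, num_reduces):
        # derive the partition from a deterministic hash of the key
        return zlib.crc32(key.encode("utf-8")) % num_reduces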
Breaks the data into key/value pairs for input to the Mapper.
The current progress of the record reader through its data.
Returns: the fraction of data read up to now, as a float between 0 and 1.
Called by the framework to provide a key/value pair to the Mapper. Applications must override this, making sure it raises StopIteration when there are no more records to process.
Returns: a tuple of two elements: respectively, the key and the value (as strings).
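As a sketch of the next()/get_progress() contract, the reader below serves (line number, line) pairs from a local text file named by a hypothetical 'my.input.file' configuration property, assuming the context exposes the job configuration as job_conf; a real reader would instead read the portion of the input described by its InputSplit (see below):

import pydoop.mapreduce.api as api

class LocalFileReader(api.RecordReader):

    def __init__(self, context):
        super(LocalFileReader, self).__init__(context)
        # 'my.input.file' is a hypothetical configuration property used only in this sketch
        with open(context.job_conf["my.input.file"]) as f:
            self.lines = f.readlines()
        self.pos = 0

    def next(self):
        if self.pos >= len(self.lines):
            raise StopIteration  # no more records to process
        key, value = str(self.pos), self.lines[self.pos]
        self.pos += 1
        return key, value

    def get_progress(self):
        # fraction of records served so far, between 0 and 1
        return float(self.pos) / max(len(self.lines), 1)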
Writes the output key/value pairs to an output file.
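A matching writer sketch, again with a hypothetical 'my.output.file' configuration property, assuming the framework calls emit() once per output pair and close() when the task is done:

import pydoop.mapreduce.api as api

class LocalFileWriter(api.RecordWriter):

    def __init__(self, context):
        super(LocalFileWriter, self).__init__(context)
        # 'my.output.file' is a hypothetical configuration property used only in this sketch
        self.f = open(context.job_conf["my.output.file"], "w")

    def close(self):
        self.f.close()

    def emit(self, key, value):
        # one tab-separated record per output pair
        self.f.write("%s\t%s\n" % (key, value))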
The context given to the reducer.
Reduces a set of intermediate values which share a key to a (possibly) smaller set of values.
Called once for each key. Applications must override this, emitting an output key/value pair through the context.
Parameters: context (ReduceContext) – the context object passed by the framework, used to get the input key and corresponding set of values and emit the output key/value pair.
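Paired with a mapper that emits (word, 1) pairs, a minimal reducer sketch looks like this (context.values iterates over all values associated with the current key):

import pydoop.mapreduce.api as api

class WordCountReducer(api.Reducer):

    def reduce(self, context):
        # sum all the counts emitted for this word and emit the total
        context.emit(context.key, sum(context.values))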
Represents the data to be processed by an individual Mapper.
Typically, it presents a byte-oriented view on the input and it is the responsibility of the RecordReader to convert this to a record-oriented view.
The InputSplit is a logical representation of the actual dataset chunk, expressed through the filename, offset and length attributes.
InputSplit objects are instantiated by the framework and accessed via MapContext.input_split.
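For example, a record reader can use the split's attributes to open and position itself on its own chunk of the input; the sketch below assumes HDFS-backed input files and the pydoop.hdfs module:

import pydoop.hdfs as hdfs
import pydoop.mapreduce.api as api

class SplitReader(api.RecordReader):

    def __init__(self, context):
        super(SplitReader, self).__init__(context)
        self.isplit = context.input_split  # an InputSplit object
        # open the file backing this split and seek to its starting offset
        self.file = hdfs.open(self.isplit.filename)
        self.file.seek(self.isplit.offset)
        self.bytes_read = 0
        # next() would then read records until bytes_read exceeds self.isplit.length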