pydoop.mapreduce.api — MapReduce API

This module provides the base abstract classes used to develop MapReduce application components (Mapper, Reducer, etc.).
class pydoop.mapreduce.api.Combiner(context)

    A Combiner performs the same actions as a Reducer, but it runs locally within a map task. This helps cut down the amount of data sent to reducers across the network, with the downside that map tasks require extra memory to cache intermediate key/value pairs. The cache size is controlled by "mapreduce.task.io.sort.mb" and defaults to 100 MB.

    Note that it's not strictly necessary to extend this class in order to write a combiner: all that's required is that it has the same interface as a reducer. Indeed, in many cases it's useful to set the combiner class to be the same as the reducer class.
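    As a concrete illustration of the last point, here is a minimal sketch that reuses the reducer class as the combiner. It assumes the generic factory in pydoop.mapreduce.pipes accepts reducer_class and combiner_class keyword arguments, and it imports the WordCountMapper / WordCountReducer classes (sketched under Mapper and Reducer below) from a hypothetical wordcount module:

        import pydoop.mapreduce.pipes as pipes

        from wordcount import WordCountMapper, WordCountReducer  # hypothetical module

        factory = pipes.Factory(
            WordCountMapper,
            reducer_class=WordCountReducer,
            combiner_class=WordCountReducer,  # the reducer doubles as a local combiner
        )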
class pydoop.mapreduce.api.Component(context)
class pydoop.mapreduce.api.Context

    Context objects are used for communication between the framework and the MapReduce application. These objects are instantiated by the framework and passed to user methods as parameters:

        class Mapper(api.Mapper):

            def map(self, context):
                key, value = context.key, context.value
                ...
                context.emit(new_key, new_value)
    emit(key, value)
        Emit a key, value pair to the framework.
    get_counter(group, name)
        Get a Counter from the framework. The counter can be updated via increment_counter().
    increment_counter(counter, amount)
        Update a Counter by the specified amount (see the usage sketch at the end of this class description).
    input_split
        The InputSplit for this task (map tasks only).

        This tries to deserialize the raw split sent from upstream. In the most common scenario (file-based input format), the returned value will be a FileSplit.

        To get the raw split, call get_input_split() with raw=True.
    key
        Input key.
    set_status(status)
        Set the current status.

        Parameters:
            status (str) - a description of the current status
    value
        Input value (map tasks only).
    values
        Iterator over all values for the current key (reduce tasks only).
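    Putting the above together, here is a hedged sketch of a mapper that leans on the context: it inspects the input split (assuming a file-based input format, so offset and length follow the FileSplit description below, while the filename attribute name is an assumption), reports its status, and keeps a counter while emitting output pairs:

        import pydoop.mapreduce.api as api

        class LineLengthMapper(api.Mapper):

            def __init__(self, context):
                super(LineLengthMapper, self).__init__(context)
                split = context.input_split  # a FileSplit for file-based input formats
                # "filename" is an assumed attribute name
                context.set_status("processing %s at offset %d (%d bytes)"
                                   % (split.filename, split.offset, split.length))
                self.lines = context.get_counter("LINELENGTH", "NUM_LINES")

            def map(self, context):
                context.increment_counter(self.lines, 1)
                context.emit(context.key, len(context.value))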
class pydoop.mapreduce.api.Factory

    Creates MapReduce application components (e.g., mapper, reducer).

    A factory object must be created by the application and passed to the framework as the first argument to run_task(). All MapReduce applications need at least a mapper object, while other components are optional (the corresponding create_* method can return None). Note that the reducer is optional only in map-only jobs, where the number of reduce tasks has been set to 0.

    Factory provides a generic implementation that takes component classes as initialization parameters and creates component objects as needed (a sketch of a hand-written factory is given after the method descriptions below).
    create_combiner(context)
        Create a combiner object.

        Return the new combiner, or None if one is not needed.
    create_partitioner(context)
        Create a partitioner object.

        Return the new partitioner, or None if the default partitioner should be used.
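    A hedged sketch of a hand-written factory. Only create_combiner and create_partitioner are documented above; the names of the other create_* methods, and the WordCountMapper / WordCountReducer classes imported from a hypothetical wordcount module, are assumptions based on the pattern they describe:

        import pydoop.mapreduce.api as api

        from wordcount import WordCountMapper, WordCountReducer  # hypothetical module

        class WordCountFactory(api.Factory):

            def create_mapper(self, context):        # assumed method name
                return WordCountMapper(context)

            def create_reducer(self, context):       # assumed method name
                return WordCountReducer(context)

            def create_combiner(self, context):
                # reuse the reducer to pre-aggregate within the map task
                return WordCountReducer(context)

            def create_partitioner(self, context):
                return None  # fall back to the default partitioner

            def create_record_reader(self, context):  # assumed method name
                return None  # use the input format's default reader

            def create_record_writer(self, context):  # assumed method name
                return None  # use the default writer

    In practice, the generic implementation described above, which takes the component classes as initialization parameters, usually makes a hand-written factory unnecessary.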
class pydoop.mapreduce.api.FileSplit

    A subset (described by offset and length) of an input file.
class pydoop.mapreduce.api.InputSplit

    Represents a subset of the input data assigned to a single map task. InputSplit objects are created by the framework and made available to the user application via the input_split context attribute.
class pydoop.mapreduce.api.JobConf

    Configuration properties assigned to this job.

    JobConf objects are instantiated by the framework and support the same interface as dictionaries, plus a few methods that perform automatic type conversion:

        >>> jc['a']
        '1'
        >>> jc.get_int('a')
        1
    get_bool(key, default=None)
        Same as dict.get(), but the value is converted to a bool.

        The value is considered True if the string is equal, ignoring case, to 'true', and False if it is equal, ignoring case, to 'false'.
    get_float(key, default=None)
        Same as dict.get(), but the value is converted to a float.
    get_int(key, default=None)
        Same as dict.get(), but the value is converted to an int.
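    A hedged sketch of the typed getters in use. It assumes the job configuration is reachable from the task context (e.g., via a job_conf attribute, which is not listed in the Context section above) and uses made-up property names:

        import pydoop.mapreduce.api as api

        class ThresholdMapper(api.Mapper):

            def __init__(self, context):
                super(ThresholdMapper, self).__init__(context)
                jc = context.job_conf  # assumed attribute name
                # the property names below are made up for illustration
                self.threshold = jc.get_float("myapp.score.threshold", 0.5)
                self.verbose = jc.get_bool("myapp.verbose", False)

            def map(self, context):
                score = float(context.value)
                if self.verbose:
                    context.set_status("scoring %s" % context.key)
                if score >= self.threshold:
                    context.emit(context.key, score)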
class pydoop.mapreduce.api.Mapper(context)

    Maps input key/value pairs to a set of intermediate key/value pairs.
    map(context)
        Called once for each key/value pair in the input split. Applications must override this, emitting an output key/value pair through the context.

        Parameters:
            context (Context) - the context object passed by the framework, used to get the input key/value pair and emit the output key/value pair.
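    A minimal word-count style mapper, used as a placeholder by the other sketches in this section (the class name is hypothetical):

        import pydoop.mapreduce.api as api

        class WordCountMapper(api.Mapper):

            def map(self, context):
                # one input record per call; emit (word, 1) for each token
                for word in context.value.split():
                    context.emit(word, 1)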
class pydoop.mapreduce.api.OpaqueSplit

    A wrapper for an arbitrary Python object.

    Opaque splits are created on the Python side before job submission, serialized as hadoop.io.Writable objects and stored in an HDFS file. The Java submitter reads the splits from the above file and forwards them to the Python tasks.

    Note: Opaque splits are only available when running a job via pydoop submit. The HDFS path where splits are stored is specified via the pydoop.mapreduce.pipes.externalsplits.uri configuration key.
class pydoop.mapreduce.api.Partitioner(context)

    Controls the partitioning of intermediate keys output by the Mapper. The key (or a subset of it) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.

    partition(key, num_of_reduces)
        Get the partition number for key given the total number of partitions, i.e., the number of reduce tasks for the job. Applications must override this.
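    A minimal sketch of a custom partitioner. It uses zlib.crc32 rather than Python's built-in hash(), whose value for strings can differ between the processes running different map tasks (string keys are assumed):

        import zlib

        import pydoop.mapreduce.api as api

        class CRC32Partitioner(api.Partitioner):

            def partition(self, key, num_of_reduces):
                # crc32 is deterministic across processes, so equal keys always
                # end up in the same partition no matter which map task emits them
                return zlib.crc32(key.encode("utf-8")) % num_of_reduces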
class pydoop.mapreduce.api.RecordReader(context)

    Breaks the data into key/value pairs for input to the Mapper.
    get_progress()
        The current progress of the record reader through its data.

        Return type:
            float

        Returns:
            the fraction of data read up to now, as a float between 0 and 1.
    next()
        Called by the framework to provide a key/value pair to the Mapper. Applications must override this, making sure it raises StopIteration when there are no more records to process.

        Return type:
            tuple

        Returns:
            a tuple of two elements. They are, respectively, the key and the value (as strings).
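    A hedged sketch of a line-oriented reader. It assumes a file-based input format (so the input split is a FileSplit; the filename attribute name is an assumption) and uses pydoop.hdfs for file access; progress is approximated by characters read:

        import pydoop.hdfs as hdfs
        import pydoop.mapreduce.api as api

        class LineRecordReader(api.RecordReader):

            def __init__(self, context):
                super(LineRecordReader, self).__init__(context)
                self.split = context.input_split  # assumed to be a FileSplit
                self.bytes_read = 0
                self.file = hdfs.open(self.split.filename)  # "filename" is assumed
                if self.split.offset > 0:
                    self.file.seek(self.split.offset)
                    # the partial first line belongs to the previous split
                    self.bytes_read += len(self.file.readline())

            def next(self):
                if self.bytes_read > self.split.length:
                    raise StopIteration  # past the end of this split
                line = self.file.readline()
                if not line:
                    raise StopIteration  # end of file
                key = str(self.split.offset + self.bytes_read)
                self.bytes_read += len(line)
                return key, line

            def get_progress(self):
                return min(self.bytes_read / float(max(self.split.length, 1)), 1.0)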
class pydoop.mapreduce.api.RecordWriter(context)

    Writes the output key/value pairs to an output file.
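    A hedged sketch of a writer that stores key/value pairs as tab-separated lines. The emit and close method names are not documented in this section and are assumptions, as is the job_conf context attribute; the output path is derived from standard Hadoop task properties:

        import pydoop.hdfs as hdfs
        import pydoop.mapreduce.api as api

        class TSVRecordWriter(api.RecordWriter):

            def __init__(self, context):
                super(TSVRecordWriter, self).__init__(context)
                jc = context.job_conf  # assumed attribute name
                out_dir = jc["mapreduce.task.output.dir"]
                part = jc.get_int("mapreduce.task.partition")
                self.file = hdfs.open("%s/part-%05d" % (out_dir, part), "w")

            def emit(self, key, value):  # assumed method name
                self.file.write("%s\t%s\n" % (key, value))

            def close(self):  # assumed method name
                self.file.close()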
class pydoop.mapreduce.api.Reducer(context)

    Reduces a set of intermediate values which share a key to a (possibly) smaller set of values.
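    A minimal word-count style reducer, matching the mapper sketched above. The reduce method is not listed here; its name is assumed from the standard pydoop examples:

        import pydoop.mapreduce.api as api

        class WordCountReducer(api.Reducer):

            def reduce(self, context):
                # context.values iterates over all counts emitted for context.key
                context.emit(context.key, sum(context.values))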
pydoop.mapreduce.pipes.run_task(factory, **kwargs)

    Run a MapReduce task.

    Available keyword arguments:

    * raw_keys (default: False): pass map input keys to context as byte strings (ignore any type information)
    * raw_values (default: False): pass map input values to context as byte strings (ignore any type information)
    * private_encoding (default: True): automatically serialize map output k/v and deserialize reduce input k/v (pickle)
    * auto_serialize (default: True): automatically serialize reduce output (map output in map-only jobs) k/v (call str/unicode, then encode as utf-8)

    Advanced keyword arguments:

    * pstats_dir: run the task with cProfile and store stats in this dir
    * pstats_fmt: use this pattern for pstats filenames (experts only)

    The pstats dir and filename pattern can also be provided via pydoop submit arguments, with lower precedence in case of clashes.
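    A minimal driver sketch tying the pieces together. It assumes the generic factory in pydoop.mapreduce.pipes takes the component classes as initialization parameters, that the WordCountMapper / WordCountReducer classes sketched above live in a hypothetical wordcount module, and that the script is launched via pydoop submit, which is assumed here to call a module-level __main__ function:

        import pydoop.mapreduce.pipes as pipes

        from wordcount import WordCountMapper, WordCountReducer  # hypothetical module

        def __main__():
            factory = pipes.Factory(WordCountMapper, reducer_class=WordCountReducer)
            # keep the default private_encoding / auto_serialize behaviour, but
            # read map input values as raw byte strings
            pipes.run_task(factory, raw_values=True)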