Pydoop is a Python interface to Hadoop that allows you to write MapReduce applications in pure Python:

    import pydoop.mapreduce.api as api

    class Mapper(api.Mapper):

        def map(self, context):
            words = context.value.split()
            for w in words:
                context.emit(w, 1)

    class Reducer(api.Reducer):

        def reduce(self, context):
            s = sum(context.values)
            context.emit(context.key, s)

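
To see what the word count above computes without a Hadoop cluster, the data flow can be imitated locally. The following is only a rough sketch under stated assumptions: `MapContext`, `ReduceContext`, and `run_local` are hypothetical stand-ins written for this illustration, not part of Pydoop, and the mapper and reducer here are plain classes rather than subclasses of `api.Mapper` and `api.Reducer`, so that the snippet runs without Pydoop installed. The real framework also handles input splitting, serialization, and distribution, none of which is modeled here.

```python
from collections import defaultdict


class MapContext:
    """Hypothetical stand-in for the context passed to map()."""

    def __init__(self, value, sink):
        self.value = value   # one input record (here: a line of text)
        self._sink = sink    # dict collecting emitted values per key

    def emit(self, key, value):
        self._sink[key].append(value)


class ReduceContext:
    """Hypothetical stand-in for the context passed to reduce()."""

    def __init__(self, key, values, out):
        self.key = key
        self.values = values  # all values emitted for this key
        self._out = out

    def emit(self, key, value):
        self._out[key] = value


class Mapper:
    def map(self, context):
        # Same logic as the Pydoop mapper: emit (word, 1) per word.
        for w in context.value.split():
            context.emit(w, 1)


class Reducer:
    def reduce(self, context):
        # Same logic as the Pydoop reducer: sum the counts for one key.
        context.emit(context.key, sum(context.values))


def run_local(lines):
    """Run map, group by key, then reduce, all in one process."""
    grouped = defaultdict(list)
    mapper = Mapper()
    for line in lines:
        mapper.map(MapContext(line, grouped))
    result = {}
    reducer = Reducer()
    for key in sorted(grouped):
        reducer.reduce(ReduceContext(key, grouped[key], result))
    return result


counts = run_local(["the quick brown fox", "the lazy dog", "the fox"])
print(counts)
```

The grouping step in `run_local` plays the role of the shuffle phase: every value emitted under the same key reaches a single `reduce` call.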
Pydoop offers several features not commonly found in other Python libraries for Hadoop:
Pydoop enables MapReduce programming through a Python client for Hadoop Pipes that is pure Python except for a performance-critical serialization section, and provides HDFS access through an extension module based on libhdfs.
To get started, read the tutorial. Full docs, including installation instructions, are listed below.