- Added support for HDP 2.2.
- Pyavroc is now automatically loaded if installed, enabling much faster (30-40x) Avro (de)serialization.
- Added Timer objects to help debug performance issues.
- NoSeparatorTextOutputFormat is now available for all MR versions.
- Added Avro support to the Hadoop Simulator.
- Bug fixes and performance improvements.
- Pydoop now features a brand new, more pythonic MapReduce API
- Added built-in Avro support (for now, only with Hadoop 2). By setting a few flags in the submitter and selecting AvroContext as your application’s context class, you can read and write Avro data, transparently manipulating records as Python dictionaries. See the Avro I/O docs for further details.
- The new pydoop submit tool drastically simplifies job submission, in particular when running applications without installing Pydoop and other dependencies on the cluster nodes (see Installation-free Usage).
- Added support for testing Pydoop programs in a simulated Hadoop framework
- Added experimental support for MapReduce V2 input/output formats (see Writing a Custom InputFormat)
- The path module offers many new functions that serve as the HDFS-aware counterparts of those in os.path
- The pipes backend (except for the performance-critical serialization section) has been reimplemented in pure Python
- An alternative (optional) JPype HDFS backend is available (currently slower than the one based on libhdfs)
- Added support for CDH5 and Apache Hadoop 2.4.1, 2.5.2 and 2.6.0
- Removed support for CDH3 and Apache Hadoop 0.20.2
- Installation has been greatly simplified: now Pydoop does not require any external library to build its native extensions
- YARN is now fully supported
- Added support for CDH 4.4.0 and CDH 4.5.0
- Added support for Apache Hadoop 2.2.0
- Added support for Apache Hadoop 1.2.1
- Added support for CDH 4.3.0
- Added a walk() method to hdfs instances (works similarly to os.walk() from Python’s standard library)
- The Hadoop version parser is now more flexible. It should be able to parse version strings for all CDH releases, including older ones (note that most of them are not supported)
- Pydoop script can now handle modules whose file name has no extension
- Fixed “unable to load native-hadoop library” problem (thanks to Liam Slusser)
- Fixed a bug that was causing the pipes runner to incorrectly preprocess command line options.
- Fixed several bugs triggered by using a local file system as the default file system for Hadoop. This happens when you set a file: path as the value of fs.default.name in core-site.xml. For instance:

      <property>
        <name>fs.default.name</name>
        <value>file:///var/hadoop/data</value>
      </property>
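
To check whether a given configuration falls into the local default file system case described in the last item, something like the following sketch may help. It is not part of Pydoop: the helper names `default_fs` and `is_local_default_fs` and the `CORE_SITE` sample are hypothetical, and the sketch only inspects an explicit fs.default.name setting using the Python standard library.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Sample core-site.xml content (assumption: the property above wrapped
# in the usual <configuration> root element).
CORE_SITE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///var/hadoop/data</value>
  </property>
</configuration>"""


def default_fs(core_site_xml):
    """Return the value of fs.default.name, or None if it is not set."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "fs.default.name":
            return prop.findtext("value", "").strip()
    return None


def is_local_default_fs(core_site_xml):
    """Return True if fs.default.name is explicitly set to a file: URI."""
    value = default_fs(core_site_xml)
    return value is not None and urlparse(value).scheme == "file"


if __name__ == "__main__":
    print(default_fs(CORE_SITE))
    print(is_local_default_fs(CORE_SITE))
```

To check a real installation, read the contents of your own core-site.xml and pass them to `is_local_default_fs`.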