The HDFS API

The HDFS API allows you to connect to an HDFS installation, read and write files, and get information on files, directories, and global file system properties:

>>> import pydoop.hdfs as hdfs
>>> hdfs.mkdir('test')
>>> hdfs.dump('hello, world', 'test/hello.txt')
>>> hdfs.load('test/hello.txt')
b'hello, world'
>>> hdfs.load('test/hello.txt', mode='rt')
'hello, world'
>>> [hdfs.path.basename(_) for _ in hdfs.ls('test')]
['hello.txt']
>>> hdfs.stat('test/hello.txt').st_size
12
>>> hdfs.path.isdir('test')
True
>>> hdfs.path.isfile('test')
False
>>> hdfs.path.basename('test/hello.txt')
'hello.txt'
>>> hdfs.cp('test', 'test.copy')
>>> [hdfs.path.basename(_) for _ in hdfs.ls('test.copy')]
['hello.txt']
>>> hdfs.get('test/hello.txt', '/tmp/hello.txt')
>>> with open('/tmp/hello.txt') as f:
...     f.read()
...
'hello, world'
>>> hdfs.put('/tmp/hello.txt', 'test.copy/hello.txt.copy')
>>> for x in hdfs.ls('test.copy'): print(repr(hdfs.path.basename(x)))
...
'hello.txt'
'hello.txt.copy'
>>> with hdfs.open('test/hello.txt', 'r') as fi:
...     fi.read(3)
...
b'hel'
>>> with hdfs.open('test/hello.txt', 'rt') as fi:
...     fi.read(3)
...
'hel'
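As the session shows, opening a file in binary mode (`'r'`) yields `bytes`, while text mode (`'rt'`) yields `str`, matching the semantics of Python's own file objects. A rough stand-alone analogue using the standard `io` module (this is an illustration of the mode semantics, not Pydoop's actual implementation):

```python
import io

# A bytes buffer standing in for an HDFS file opened in binary mode.
raw = io.BytesIO(b"hello, world")
print(raw.read(3))  # b'hel' -- binary mode yields bytes

raw.seek(0)
# Wrapping the binary stream gives text-mode ("rt"-style) reads: str objects.
text = io.TextIOWrapper(raw, encoding="utf-8")
print(text.read(3))  # 'hel' -- text mode yields str
```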

Low-level API

However convenient, the high-level API showcased above is inefficient when performing multiple operations on the same HDFS instance, because each function opens a separate connection to the HDFS server and closes it before returning. The following example shows how to gather statistics on HDFS usage by block size by directly instantiating an hdfs object, which represents an open connection to an HDFS instance. The full source code for the example, including a script that generates a test HDFS directory tree, is located under examples/hdfs in the Pydoop distribution.

import pydoop.hdfs as hdfs
from common import MB, TEST_ROOT  # constants shared by the HDFS examples


def usage_by_bs(fs, root):
    """Return a dict mapping block size to total bytes stored with that size."""
    stats = {}
    for info in fs.walk(root):
        if info['kind'] == 'directory':
            continue
        bs = int(info['block_size'])
        size = int(info['size'])
        stats[bs] = stats.get(bs, 0) + size
    return stats


if __name__ == "__main__":
    # A single connection is opened here and reused for all operations.
    with hdfs.hdfs() as fs:
        root = "%s/%s" % (fs.working_directory(), TEST_ROOT)
        print("BS(MB)\tBYTES")
        for k, v in usage_by_bs(fs, root).items():
            print("%.1f\t%d" % (k / float(MB), v))

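The aggregation in usage_by_bs does not itself depend on HDFS, so it can be exercised offline by feeding it walk()-style records. The fake file system and its data below are made up purely for illustration:

```python
MB = 2 ** 20


def usage_by_bs(fs, root):
    # Same aggregation as in the example above: total bytes per block size.
    stats = {}
    for info in fs.walk(root):
        if info['kind'] == 'directory':
            continue
        bs = int(info['block_size'])
        size = int(info['size'])
        stats[bs] = stats.get(bs, 0) + size
    return stats


class FakeFS:
    """Minimal stand-in for an hdfs instance, yielding walk()-style dicts."""

    def walk(self, root):
        yield {'kind': 'directory', 'block_size': 0, 'size': 0}
        yield {'kind': 'file', 'block_size': 64 * MB, 'size': 10 * MB}
        yield {'kind': 'file', 'block_size': 64 * MB, 'size': 5 * MB}
        yield {'kind': 'file', 'block_size': 128 * MB, 'size': 1 * MB}


stats = usage_by_bs(FakeFS(), "/fake/root")
for bs, total in sorted(stats.items()):
    print("%.1f\t%d" % (bs / float(MB), total))
# 64.0    15728640
# 128.0   1048576
```

The directory record is skipped, and the two 64 MB-block files are summed into a single bucket, which is exactly what the real script does over a live connection.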
For more information, see the HDFS API reference.