The HDFS API
The HDFS API allows you to connect to an HDFS installation, read and write files, and get information on files, directories, and global file system properties:
>>> import pydoop.hdfs as hdfs
>>> hdfs.mkdir('test')
>>> hdfs.dump('hello, world', 'test/hello.txt')
>>> hdfs.load('test/hello.txt')
b'hello, world'
>>> hdfs.load('test/hello.txt', mode='rt')
'hello, world'
>>> [hdfs.path.basename(_) for _ in hdfs.ls('test')]
['hello.txt']
>>> hdfs.stat('test/hello.txt').st_size
12
>>> hdfs.path.isdir('test')
True
>>> hdfs.path.isfile('test')
False
>>> hdfs.path.basename('test/hello.txt')
'hello.txt'
>>> hdfs.cp('test', 'test.copy')
>>> [hdfs.path.basename(_) for _ in hdfs.ls('test.copy')]
['hello.txt']
>>> hdfs.get('test/hello.txt', '/tmp/hello.txt')
>>> with open('/tmp/hello.txt') as f:
... f.read()
...
'hello, world'
>>> hdfs.put('/tmp/hello.txt', 'test.copy/hello.txt.copy')
>>> for x in sorted(hdfs.ls('test.copy')): print(repr(hdfs.path.basename(x)))
...
'hello.txt'
'hello.txt.copy'
>>> with hdfs.open('test/hello.txt', 'r') as fi:
... fi.read(3)
...
b'hel'
>>> with hdfs.open('test/hello.txt', 'rt') as fi:
... fi.read(3)
...
'hel'
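When you are done, the files and directories created above can be removed. The following sketch assumes the 'test' and 'test.copy' trees from the session and uses hdfs.rmr, which deletes a path recursively:

>>> hdfs.rmr('test')
>>> hdfs.rmr('test.copy')
>>> hdfs.path.exists('test')
False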
Low-level API
The high-level API showcased above can be inefficient when performing multiple operations on the same HDFS instance, because under the hood each function opens a separate connection to the HDFS server and closes it before returning. The following example shows how to gather statistics on HDFS usage by block size by directly instantiating an hdfs object, which represents an open connection to an HDFS instance. The full source code for the example, including a script that can be used to generate an HDFS directory tree, is located under examples/hdfs in the Pydoop distribution.
import pydoop.hdfs as hdfs

from common import MB, TEST_ROOT


def usage_by_bs(fs, root):
    # Map each block size to the total number of bytes stored with it
    stats = {}
    for info in fs.walk(root):
        if info['kind'] == 'directory':
            continue
        bs = int(info['block_size'])
        size = int(info['size'])
        stats[bs] = stats.get(bs, 0) + size
    return stats


if __name__ == "__main__":
    with hdfs.hdfs() as fs:
        root = "%s/%s" % (fs.working_directory(), TEST_ROOT)
        print("BS(MB)\tBYTES")
        for k, v in usage_by_bs(fs, root).items():
            print("%.1f\t%d" % (k / float(MB), v))
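By default, hdfs.hdfs() connects to the file system defined in the Hadoop configuration files. Explicit connection parameters can be passed instead; the following is a minimal sketch, with a placeholder namenode hostname and port:

import pydoop.hdfs as hdfs

# Connect to an explicit namenode (hostname and port are placeholders)
with hdfs.hdfs(host='namenode.example.com', port=8020) as fs:
    print(fs.working_directory())

# An empty host string (with port 0) connects to the local file system
with hdfs.hdfs(host='', port=0) as local_fs:
    print(local_fs.working_directory())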
For more information, see the HDFS API reference.