This module allows you to connect to an HDFS installation, read and write files and get information on files, directories and global filesystem properties.
The hdfs module is built on top of libhdfs, in turn a JNI wrapper around the Java filesystem code: therefore, for the module to work properly, the CLASSPATH environment variable must include all paths to the relevant Hadoop jars. Pydoop will do this for you, but it needs to know where your Hadoop installation and Hadoop configuration directory are located: if Pydoop cannot find these directories automatically, make sure the HADOOP_HOME and HADOOP_CONF_DIR environment variables are set to the appropriate values.
Another important environment variable for this module is LIBHDFS_OPTS, which sets options for the JVM on top of which the module runs, most notably the amount of memory it uses. If LIBHDFS_OPTS is not set, libhdfs falls back to your system's default JVM heap size, typically 1 GB. In our experience, this is much more than most applications need and adds a lot of unnecessary memory overhead. For this reason, the hdfs module sets LIBHDFS_OPTS to -Xmx48m, a value we found appropriate for most applications. If your needs are different, set the environment variable externally and it will override the above setting.
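The override described above can be sketched as follows; the -Xmx128m heap size is an illustrative value, not a recommendation, and the pydoop import is shown commented out since it requires a working Hadoop installation:

```python
import os

# LIBHDFS_OPTS must be set before the hdfs module starts the JVM,
# i.e., before pydoop.hdfs is first used in the process.
os.environ["LIBHDFS_OPTS"] = "-Xmx128m"  # example value, not a recommendation

# import pydoop.hdfs  # the JVM would now start with a 128 MB max heap
```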
A handle to an HDFS instance.
Note: when connecting to the local file system, user is ignored (i.e., it will always be the current UNIX user).
Return the raw capacity of the filesystem.
Return type: int
Returns: filesystem capacity
Change file mode bits.
Change file owner and group.
Close the HDFS handle (disconnect).
Copy file from one filesystem to another.
Create directory path (non-existent parents will be created as well).
Parameters: path (str) – the path of the directory
Raises: IOError
Get the default block size.
Return type: int
Returns: the default blocksize
Delete path.
Check if a given path exists on the filesystem.
Parameters: path (str) – the path to look for
Return type: bool
Returns: True if path exists
Get hostnames where a particular block (determined by pos and blocksize) of a file is stored. Due to replication, a single block could be present on multiple hosts.
Return type: list
Returns: list of hosts that store the block
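The block arithmetic implied above can be illustrated with a small sketch (this is not the Pydoop implementation, just the offset math): the block holding a given byte offset is the integer quotient of the offset by the block size, and a file's blocks start at multiples of the block size.

```python
# Illustrative sketch of HDFS block arithmetic.
def block_index(pos, blocksize):
    """Index of the block containing byte offset pos."""
    return pos // blocksize

def block_starts(file_size, blocksize):
    """Start offsets of all blocks covering a file of file_size bytes."""
    return list(range(0, file_size, blocksize))

MB = 1024 * 1024
# A 300 MB file with a 128 MB block size spans three blocks:
print(block_starts(300 * MB, 128 * MB))  # [0, 128 MB, 256 MB]
print(block_index(200 * MB, 128 * MB))   # 1 (second block)
```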
Get information about path as a dict of properties.
The return value is modeled upon fs.FileStatus from the Java API.
Parameters: path (str) – a path in the filesystem
Return type: dict
Returns: path information
Raises: IOError
The actual hdfs hostname (empty string for the local fs).
Get list of files and directories for path.
Parameters: path (str) – the path of the directory
Return type: list
Returns: list of files and directories in path
Raises: IOError
Move file from one filesystem to another.
Open an HDFS file.
Pass 0 as buff_size, replication or blocksize if you want to use the “configured” values, i.e., the ones set in the Hadoop configuration files.
Returns: handle to the open file
The actual hdfs port (0 for the local fs).
Rename file.
Set the replication of path to replication.
Set the working directory to path. All relative paths will be resolved relative to it.
Parameters: path (str) – the path of the directory
Raises: IOError
Return the total raw size of all files in the filesystem.
Return type: int
Returns: total size of files in the file system
The user associated with this HDFS connection.
Change file last access and modification times.
Generate path information for all paths in the tree rooted at top (top itself included).
The top parameter can be either an HDFS path string or a dictionary of properties as returned by get_path_info().
Parameters: top (str or dict) – an HDFS path or path info dict
Return type: iterator
Returns: path infos of files and directories in the tree rooted at top
Raises: IOError; ValueError if top is empty
Get the current working directory.
Return type: str
Returns: current working directory
Open a file, returning an hdfs_file object.
hdfs_path and user are passed to split(), while the other args are passed to the hdfs_file constructor.
Write data to hdfs_path.
Additional keyword arguments, if any, are handled like in open().
Read the content of hdfs_path and return it.
Additional keyword arguments, if any, are handled like in open().
Copy the contents of src_hdfs_path to dest_hdfs_path.
Additional keyword arguments, if any, are handled like in open(). If src_hdfs_path is a directory, its contents will be copied recursively.
Copy the contents of src_path to dest_hdfs_path.
src_path is forced to be interpreted as an ordinary local path (see abspath()). Additional keyword arguments, if any, are handled like in open().
Copy the contents of src_hdfs_path to dest_path.
dest_path is forced to be interpreted as an ordinary local path (see abspath()). Additional keyword arguments, if any, are handled like in open().
Create a directory and its parents as needed.
Recursively remove files and directories.
Return a list of dictionaries of file properties.
If hdfs_path is a file, there is only one item corresponding to the file itself; if it is a directory and recursive is False, each list item corresponds to a file or directory contained by it; if it is a directory and recursive is True, the list contains one item for every file or directory in the tree rooted at hdfs_path.
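The recursive-listing semantics described above can be mimicked on the local filesystem with the standard library (this is only an analogy, not how Pydoop lists HDFS paths): a non-recursive listing returns direct children, while a recursive one returns every path in the tree.

```python
import os
import tempfile

def list_paths(top, recursive=False):
    """Local-fs analogy for lsl(): direct children, or the whole tree."""
    if not recursive:
        return [os.path.join(top, name) for name in os.listdir(top)]
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            found.append(os.path.join(dirpath, name))
    return found

with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "a", "b"))
    open(os.path.join(d, "a", "f.txt"), "w").close()
    print(len(list_paths(d)))                  # 1: just "a"
    print(len(list_paths(d, recursive=True)))  # 3: "a", "a/b", "a/f.txt"
```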
Return a list of hdfs paths.
Works in the same way as lsl(), except for the fact that list items are hdfs paths instead of dictionaries of properties.
Change file mode bits.
Move or rename src to dest.
See fs.hdfs.chown().
See fs.hdfs.rename().
Rename from_path to to_path, creating parents as needed.
Perform the equivalent of os.stat() on path, returning a StatResult object.
Perform the equivalent of os.access() on path.
Mimics the object type returned by os.stat().
Objects of this class are instantiated from dictionaries with the same structure as the ones returned by get_path_info().
Attributes starting with st_ have the same meaning as the corresponding ones in the object returned by os.stat(), although some of them may not make sense for an HDFS path (in this case, their value will be set to 0). In addition, the kind, name and replication attributes are available, with the same values as in the input dict.
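A minimal sketch of such a wrapper is shown below; the input dict keys used here ("size", "last_mod", "kind", "name", "replication") are illustrative assumptions, not necessarily the exact keys returned by get_path_info():

```python
# Sketch of a StatResult-like object built from a path-info dict.
class StatResultSketch:
    def __init__(self, info):
        self.st_size = info.get("size", 0)
        self.st_mtime = info.get("last_mod", 0)
        # Fields that make no sense for an HDFS path default to 0:
        self.st_ino = 0
        self.st_dev = 0
        # HDFS-specific attributes, copied from the input dict:
        self.kind = info["kind"]
        self.name = info["name"]
        self.replication = info["replication"]

info = {"name": "hdfs://localhost:9000/user/me/f", "kind": "file",
        "size": 1024, "replication": 3}
s = StatResultSketch(info)
print(s.st_size, s.kind, s.replication)  # 1024 file 3
```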
Return an absolute path for hdfs_path.
The user arg is passed to split(). The local argument forces hdfs_path to be interpreted as an ordinary local path:
>>> import os
>>> os.chdir('/tmp')
>>> import pydoop.hdfs.path as hpath
>>> hpath.abspath('file:/tmp')
'file:/tmp'
>>> hpath.abspath('file:/tmp', local=True)
'file:/tmp/file:/tmp'
Note that this function always returns a full URI:
>>> import pydoop.hdfs.path as hpath
>>> hpath.abspath('/tmp')
'hdfs://localhost:9000/tmp'
Perform the equivalent of os.access() on path.
Return the final component of hdfs_path.
Return the directory component of hdfs_path.
Replace initial ~ or ~user with the user’s home directory.
NOTE: if the default file system is HDFS, the ~user form is expanded regardless of the user’s existence.
Expand environment variables in path.
Get time of last access of path.
Get time of creation / last metadata change of path.
Get time of last modification of path.
Get size, in bytes, of path.
Return True if path is absolute.
A path is absolute if it is a full URI (see isfull()) or starts with a forward slash. No check is made to determine whether path is a valid HDFS path.
Return True if path is a full URI (starts with a scheme followed by a colon).
No check is made to determine whether path is a valid HDFS path.
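The two rules stated above can be sketched directly (an illustrative re-statement, not Pydoop's implementation): a path is "full" if it starts with a scheme followed by a colon, and "absolute" if it is full or starts with a forward slash.

```python
import re

# A URI scheme: a letter followed by letters, digits, "+", "." or "-".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")

def isfull_sketch(path):
    """True if path starts with a scheme followed by a colon."""
    return bool(_SCHEME.match(path))

def isabs_sketch(path):
    """True if path is full or starts with a forward slash."""
    return isfull_sketch(path) or path.startswith("/")

print(isfull_sketch("hdfs://localhost:9000/tmp"))  # True
print(isabs_sketch("/tmp"))                        # True
print(isabs_sketch("data/part-00000"))             # False
```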
Return True if path is a symbolic link.
Currently this function always returns False for non-local paths.
Return True if path is a mount point.
This function always returns False for non-local paths.
Join path name components, inserting / as needed.
If any component is an absolute path (see isabs()), all previous components will be discarded. However, full URIs (see isfull()) take precedence over incomplete ones:
>>> import pydoop.hdfs.path as hpath
>>> hpath.join('bar', '/foo')
'/foo'
>>> hpath.join('hdfs://host:1/', '/foo')
'hdfs://host:1/foo'
Note that this is not the reverse of split(), but rather a specialized version of os.path.join(). No check is made to determine whether the returned string is a valid HDFS path.
Get the kind of item (“file” or “directory”) that the path references.
Return None if path doesn’t exist.
Normalize path, collapsing redundant separators and up-level refs.
Parse the given path and return its components.
Parameters: hdfs_path (str) – an HDFS path, e.g., hdfs://localhost:9000/user/me
Return type: tuple
Returns: scheme, netloc, path
Return path with symlinks resolved.
Currently this function returns non-local paths unchanged.
Return True if both path arguments refer to the same path.
Split hdfs_path into a (hostname, port, path) tuple.
Return type: tuple
Returns: hostname, port, path
Same as os.path.splitext().
Split hdfs_path into a (head, tail) pair, according to the same rules as os.path.split().
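Since the (head, tail) split and splitext() follow the same rules as their os.path counterparts, those stdlib functions illustrate the behavior on a plain path component:

```python
import os.path

# Same rules as the HDFS-path split()/splitext() described above,
# applied to the path component.
print(os.path.split("/user/me/data/part-00000"))
# ('/user/me/data', 'part-00000')
print(os.path.splitext("/user/me/logs.tar.gz"))
# ('/user/me/logs.tar', '.gz')
```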
Perform the equivalent of os.stat() on path, returning a StatResult object.
Instances of this class represent HDFS file objects.
Objects from this class should not be instantiated directly. To open an HDFS file, use open_file(), or the top-level open function in the hdfs package.
Number of bytes that can be read from this input stream without blocking.
Return type: int
Returns: available bytes
Close the file.
Force any buffered output to be written.
The file’s hdfs instance.
The I/O mode for the file.
The file’s fully qualified name.
Return the next input line, or raise StopIteration when EOF is hit.
Read length bytes of data from the file, starting from position.
Return type: string
Returns: the chunk of data read from the file
Works like pread(), but data is stored in the writable buffer chunk rather than returned. Reads at most a number of bytes equal to the size of chunk.
Return type: int
Returns: the number of bytes read
Read length bytes from the file. If length is negative or omitted, read all data until EOF.
Parameters: length (int) – the number of bytes to read
Return type: string
Returns: the chunk of data read from the file
Works like read(), but data is stored in the writable buffer chunk rather than returned. Reads at most a number of bytes equal to the size of chunk.
Parameters: chunk (writable string buffer) – a C-like string buffer, such as the one returned by the create_string_buffer function in the ctypes module
Return type: int
Returns: the number of bytes read
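The read()/readinto() semantics described above mirror standard Python stream behavior, which can be demonstrated with an in-memory stream (a stdlib analogy, not an HDFS file): a negative or omitted length reads to EOF, and the "into" variant fills a caller-supplied writable buffer and returns the byte count.

```python
import io

f = io.BytesIO(b"hello hdfs")
print(f.read(5))      # b'hello'
print(f.read(-1))     # b' hdfs' (reads the rest of the stream)

f.seek(0)
buf = bytearray(4)
n = f.readinto(buf)   # fills at most len(buf) bytes
print(n, bytes(buf))  # 4 b'hell'
```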
Read and return a line of text.
Return type: str
Returns: the next line of text in the file, including the newline character
Seek to position in file.
The file’s size in bytes. This attribute is initialized when the file is opened and updated when it is closed.
Get the current byte offset in the file.
Return type: int
Returns: current offset in bytes