Python and HDF5 - Fast Storage for Large Data

Mike Müller

Type:: Talk
Audience level:: Novice
Category:: Databases/NoSQL

March 10th 1:35 p.m. – 2:15 p.m.

Description

The presentation introduces ways to use HDF5 (Hierarchical Data Format) from Python. HDF5 is one of the fastest ways to store large amounts of numerical data. The talk is for scientists who would like to store their measured or calculated data, as well as for programmers who are interested in non-relational data storage.

Abstract

HDF5 (Hierarchical Data Format) makes it possible to store large amounts of data quickly. Many scientists use HDF5 for numerical data. Multidimensional arrays and database-like tables can be nested. This makes HDF5 useful for other user groups as well, such as people working with image data.

The main objective of HDF5 is the storage of data in the GB and TB range. An HDF5 file has a hierarchical structure with groups and sub-groups, similar to a file system with directories and sub-directories. The counterparts of files are homogeneous, multidimensional arrays or database-like tables. The hierarchical structure uses B-trees that may span several files.
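The directory analogy can be sketched with h5py as follows. This is a minimal illustration, not material from the talk: the group path `experiment/run_001`, the dataset name `temperature`, and the attribute value are hypothetical, and the file is kept in memory through h5py's `core` driver so nothing is written to disk.

```python
import numpy as np
import h5py

# In-memory HDF5 file: the "core" driver with backing_store=False
# means nothing is written to disk. All names below are hypothetical.
with h5py.File("hierarchy.h5", "w", driver="core", backing_store=False) as f:
    # Groups play the role of directories; intermediate groups
    # ("experiment") are created automatically.
    run = f.create_group("experiment/run_001")

    # Datasets play the role of files: homogeneous, multidimensional arrays.
    values = np.arange(12, dtype="f8").reshape(3, 4)
    run.create_dataset("temperature", data=values)

    # Attributes attach small pieces of metadata to groups or datasets.
    run.attrs["operator"] = "hypothetical"

    # Access uses path-like names, much like a file system.
    dset = f["experiment/run_001/temperature"]
    shape = dset.shape
    group_names = list(f["experiment"].keys())
    first_row = dset[0, :]  # only the requested slice is read
```

Slicing the dataset reads just the selected part, which is what keeps access practical when a dataset is far larger than the available memory.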

HDF5 comes with compression options that allow compact data storage. Because fewer bytes have to pass through the hard drive, write and read rates measured against the uncompressed data can exceed the drive's maximum raw transfer rate.
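The arithmetic behind this claim can be sketched with the standard library alone. Here `zlib` stands in for HDF5's gzip (deflate) filter, the data series is made up and deliberately regular, and the drive throughput of 100 MB/s is an assumed figure. The result is an upper bound, since the CPU time spent on compression is ignored.

```python
import struct
import zlib

# Hypothetical, highly regular "measurement" series of 100,000 float64 values.
values = [float(i % 100) for i in range(100_000)]
raw = struct.pack(f"<{len(values)}d", *values)

# zlib implements deflate, the algorithm behind HDF5's gzip filter.
compressed = zlib.compress(raw, 4)
ratio = len(raw) / len(compressed)

# Assumed raw drive throughput in MB/s (illustrative, not measured).
DISK_RATE_MB_S = 100.0

# If the drive is the bottleneck, the rate relative to the uncompressed
# data scales with the compression ratio. CPU cost is ignored here.
effective_rate = DISK_RATE_MB_S * ratio

# Compression is lossless: the original bytes come back unchanged.
roundtrip = zlib.decompress(compressed)

print(f"compression ratio: {ratio:.1f}")
print(f"effective rate (upper bound): {effective_rate:.0f} MB/s")
```

Real measurement data usually compresses far less well than this artificial series, so the gain in practice depends strongly on the data and on the chosen filter.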

HDF5 is popular among users from scientific and technical fields. It has proven valuable for a variety of applications. The speed is often considerably higher than that of user-defined binary formats. HDF5 is very attractive because its storage capacity is practically unlimited and the data access is very convenient. In addition, there are many tools that help visualize and interpret data stored in HDF5 files.

HDF5 can be interesting beyond scientific applications. Multidimensional arrays can be stored in tables. This opens new possibilities for efficient and easy storage of image data, including indexing. Another application could be platform-independent virtual file systems based on HDF5.
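A table whose rows each hold a small image might look like the following PyTables sketch. The record layout (`image_id`, `label`, a 4x4 `pixels` column) and the file name are hypothetical and chosen only to keep the example small; real images would use a larger shape.

```python
import os
import tempfile

import numpy as np
import tables


class ImageRecord(tables.IsDescription):
    """Hypothetical row layout: scalar metadata plus one 2-D array per row."""
    image_id = tables.Int32Col()
    label = tables.StringCol(16)
    pixels = tables.UInt8Col(shape=(4, 4))  # multidimensional table cell


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "images.h5")
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "images", ImageRecord)

        # Fill three rows; each "image" is a constant 4x4 array.
        row = table.row
        for i in range(3):
            row["image_id"] = i
            row["label"] = f"img{i}".encode("ascii")
            row["pixels"] = np.full((4, 4), i, dtype=np.uint8)
            row.append()
        table.flush()

        # Index the id column so that queries on it can avoid a full scan.
        table.cols.image_id.create_index()

        # Select rows with a condition string; copy the arrays because
        # the row object is reused during iteration.
        matches = [r["pixels"].copy() for r in table.where("image_id == 2")]
        n_rows = table.nrows
```

The query returns only the rows that satisfy the condition, so the image arrays of all other rows never have to be loaded.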

There are HDF5 libraries for different programming languages such as C, C++ and Fortran. There are two libraries for Python:

h5py, which exposes the full C API with all its options to Python, and
PyTables, which adds Pythonic features that simplify work with tables in particular.
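To give a feel for the two libraries, the sketch below writes the same small array once with each and reads it back. It assumes that both packages are installed in the same environment; the file names and the dataset name `signal` are hypothetical.

```python
import os
import tempfile

import numpy as np
import h5py
import tables

data = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp:
    # h5py: files and groups behave like dictionaries,
    # datasets behave like NumPy arrays.
    h5py_path = os.path.join(tmp, "with_h5py.h5")
    with h5py.File(h5py_path, "w") as f:
        f.create_dataset("signal", data=data)
    with h5py.File(h5py_path, "r") as f:
        from_h5py = f["signal"][...]

    # PyTables: nodes are reachable as attributes of the root group.
    pt_path = os.path.join(tmp, "with_pytables.h5")
    with tables.open_file(pt_path, mode="w") as h5:
        h5.create_array("/", "signal", obj=data)
    with tables.open_file(pt_path, mode="r") as h5:
        from_pytables = h5.root.signal.read()
```

Both variants produce ordinary HDF5 files. The difference lies mainly in the programming interface: dictionary-style access in h5py versus attribute-style access to nodes in PyTables.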

This presentation gives examples of how to work with both libraries. Python programs for reading and writing HDF5 data are typically several times shorter than their counterparts in C or Fortran. Combining the elegance of Python with the extraordinary speed of HDF5 makes both programming and program execution highly efficient.