gcmap file
The gcmap file stores Genome Contact Maps (gcmap) in HDF5 format. It is designed for both portability and readability. A single file may contain maps of all chromosomes at various resolutions. It also contains the properties of each map, stored as attributes.
Structure of gcmap file
Because HDF5 supports a hierarchical data model, the contact maps are stored hierarchically: each chromosome is a group, and each resolution is a dataset within that group. The overall structure is as follows:
HDF5
│
├──────── chr1 ──── Attributes : ['xlabel', 'ylabel', 'compression']
│ │
│ ├────── 10kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 20kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 40kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 60kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 80kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 160kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 320kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 640kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ │
│ ├────── 10kb-bNoData ( 1D Numpy Array )
│ ├────── 20kb-bNoData ( 1D Numpy Array )
│ ├────── 40kb-bNoData ( 1D Numpy Array )
│ ├────── 60kb-bNoData ( 1D Numpy Array )
│ ├────── 80kb-bNoData ( 1D Numpy Array )
│ ├────── 160kb-bNoData ( 1D Numpy Array )
│ ├────── 320kb-bNoData ( 1D Numpy Array )
│ └────── 640kb-bNoData ( 1D Numpy Array )
│
├──────── chr2 ──── Attributes : ['xlabel', 'ylabel', 'compression']
│ │
│ ├────── 10kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 20kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 40kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 60kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 80kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 160kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 320kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ ├────── 640kb ( 2D Numpy Array ) ─── Attributes : ['minvalue', 'maxvalue', 'xshape', 'yshape', 'binsize']
│ │
│ ├────── 10kb-bNoData ( 1D Numpy Array )
│ ├────── 20kb-bNoData ( 1D Numpy Array )
│ ├────── 40kb-bNoData ( 1D Numpy Array )
│ ├────── 60kb-bNoData ( 1D Numpy Array )
│ ├────── 80kb-bNoData ( 1D Numpy Array )
│ ├────── 160kb-bNoData ( 1D Numpy Array )
│ ├────── 320kb-bNoData ( 1D Numpy Array )
│ └────── 640kb-bNoData ( 1D Numpy Array )
:
:
:
└───── ...
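The layout above can be explored with h5py. The following is a minimal sketch, not the gcMapExplorer API: it first writes a tiny demo file that mimics the tree (the file name, array sizes, attribute values, and the boolean content of the `-bNoData` arrays are all assumptions made for illustration), and then walks the hierarchy to list the available maps.

```python
import os
import tempfile

import h5py
import numpy as np


def build_demo_file(path):
    """Write a tiny file that mimics the layout above (demo values only)."""
    with h5py.File(path, "w") as f:
        for chrom in ("chr1", "chr2"):
            group = f.create_group(chrom)
            group.attrs["xlabel"] = chrom
            group.attrs["ylabel"] = chrom
            group.attrs["compression"] = "gzip"  # assumed value format
            for name, binsize, n in (("10kb", 10000, 8), ("20kb", 20000, 4)):
                matrix = np.arange(n * n, dtype=np.float32).reshape(n, n)
                dset = group.create_dataset(name, data=matrix, compression="gzip")
                dset.attrs["minvalue"] = float(matrix.min())
                dset.attrs["maxvalue"] = float(matrix.max())
                dset.attrs["xshape"] = n
                dset.attrs["yshape"] = n
                dset.attrs["binsize"] = binsize
                # 1D companion array; its content here is a placeholder
                group.create_dataset(name + "-bNoData", data=np.zeros(n, dtype=bool))


def list_maps(path):
    """Return {chromosome: {resolution: (shape, binsize)}} for the 2D maps."""
    summary = {}
    with h5py.File(path, "r") as f:
        for chrom, group in f.items():
            summary[chrom] = {}
            for name, dset in group.items():
                if name.endswith("-bNoData"):
                    continue  # skip the 1D companion arrays
                summary[chrom][name] = (dset.shape, int(dset.attrs["binsize"]))
    return summary


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gcmap")
    build_demo_file(demo_path)
    maps = list_maps(demo_path)

print(maps)
```

Only generic h5py calls are used, so the same traversal should work on any file that follows the tree shown above.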
Compression
In a gcmap file, each contact map is stored as a compressed 2D matrix. Presently, two compression methods are allowed:
- LZF
- GZIP
By default, LZF is used to compress the arrays. This method is very fast and allows rapid reading of contact maps. However, it reduces file size only moderately in comparison with GZIP.
Warning
The LZF method is only available through the Python h5py module, and therefore an LZF-compressed file cannot be read from another programming language through the standard HDF5 library. For portability, use the GZIP compression method, which is available in the standard HDF5 library.
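As an illustration of the two filters, the sketch below writes the same matrix once with LZF and once with GZIP using h5py, then reads back the filter name and the stored size of each dataset. The demo matrix, the dataset names, and the GZIP level of 4 are assumptions chosen for this example and are not taken from gcMapExplorer.

```python
import os
import tempfile

import h5py
import numpy as np

# Sparse-looking demo matrix: mostly zeros, with contacts near the diagonal
n = 512
matrix = np.zeros((n, n), dtype=np.float32)
idx = np.arange(n)
matrix[idx, idx] = 10.0
matrix[idx[:-1], idx[1:]] = 3.0

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "compression_demo.h5")

    with h5py.File(path, "w") as f:
        # LZF: fast, but the filter ships with h5py
        f.create_dataset("lzf", data=matrix, compression="lzf")
        # GZIP: slower, but part of the standard HDF5 library
        f.create_dataset("gzip", data=matrix, compression="gzip", compression_opts=4)

    with h5py.File(path, "r") as f:
        filters = {name: f[name].compression for name in ("lzf", "gzip")}
        stored_bytes = {name: f[name].id.get_storage_size() for name in ("lzf", "gzip")}
        roundtrip_ok = bool(np.array_equal(f["gzip"][...], matrix))

raw_bytes = matrix.nbytes
print(filters, stored_bytes, raw_bytes)
```

The actual size ratio between the two filters depends on the data, so it is worth measuring on a real contact map before choosing one.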
Portability and Readability
A gcmap file with GZIP-compressed arrays can be read and written from any programming language that has an HDF5 library. For C/C++/Java, a standard HDF5 library is available from the HDF5 Group.
For the R programming language, the h5 and rhdf5 packages are available.
Both GZIP and LZF compression reduce the file size significantly compared with the corresponding flat text file. Therefore, this file format is also suitable for storage and transfer.
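If a file was written with LZF and needs to be read outside Python, one option is to copy it into a new file with GZIP-compressed datasets. The function below is a hypothetical helper written with generic h5py calls only; gcMapExplorer may provide its own utility for this. It assumes every dataset is a non-scalar array, as in the structure described above, and it copies all attributes verbatim, so a `compression` attribute that records the filter name may need to be updated separately.

```python
import os
import tempfile

import h5py
import numpy as np


def recompress_to_gzip(src_path, dst_path, level=4):
    """Copy all groups, datasets and attributes, storing datasets with GZIP."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        # Attributes attached to the file root
        for key, value in src.attrs.items():
            dst.attrs[key] = value

        def copy_item(name, obj):
            if isinstance(obj, h5py.Group):
                new = dst.require_group(name)
            else:
                new = dst.create_dataset(
                    name, data=obj[...], compression="gzip", compression_opts=level
                )
            for key, value in obj.attrs.items():
                new.attrs[key] = value

        # Visit every group and dataset below the root
        src.visititems(copy_item)


# Demo with a tiny LZF-compressed file (names and values are illustrative)
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "lzf.gcmap")
    dst_path = os.path.join(tmp, "gzip.gcmap")
    original = np.eye(16, dtype=np.float32)

    with h5py.File(src_path, "w") as f:
        dset = f.create_group("chr1").create_dataset(
            "10kb", data=original, compression="lzf"
        )
        dset.attrs["binsize"] = 10000

    recompress_to_gzip(src_path, dst_path)

    with h5py.File(dst_path, "r") as f:
        new_filter = f["chr1/10kb"].compression
        new_binsize = int(f["chr1/10kb"].attrs["binsize"])
        same_data = bool(np.array_equal(f["chr1/10kb"][...], original))

print(new_filter, new_binsize, same_data)
```

Each dataset is loaded fully into memory before being written, which keeps the sketch short but may be unsuitable for very large maps.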
Convert Hi-C data to gcmap
Hi-C data are available in several different formats. Presently, the following formats can be converted to gcmap using the implemented tools:
- COO sparse matrix
- Paired COO sparse matrix
- Homer Hi-C interaction matrix
- Bin-Contact pair files
- Hic files
The following tools are available for the conversion:
- Convert using gcMapExplorer Python modules:
  - COO sparse matrix: gcMapExplorer.lib.importer.CooMatrixHandler
  - Paired COO sparse matrix: gcMapExplorer.lib.importer.PairCooMatrixHandler
  - Homer Hi-C interaction matrix: gcMapExplorer.lib.importer.HomerInputHandler
  - Bin-Contact pair files: gcMapExplorer.lib.importer.BinsNContactFilesHandler
See also
A tutorial on converting external Hi-C maps into ccmap or gcmap using the gcMapExplorer module is shown here.