encode2h5ΒΆ

Description:

Download and Convert ENCODE datasets to h5 files
=================================================

It can be used to download and convert multiple datasets from ENCODE Experiment
matrix (https://www.encodeproject.org/matrix/?type=Experiment).
Presently, only bigWig files are downloaded and then converted.

At first search the datasets on https://www.encodeproject.org/matrix/?type=Experiment .
Then click on download button on top of the page. A text file will be downloaded.
This text file can be used as input in this program. All bigWig files will be
downloaded and converted to gcMapExplorer compatible hdf5 format.

NOTE: At first a metafile is automatically downloaded and then files
      are filtered according to bigWig format and Assembly. Subsequently,
      if several replicates are present, only datasets with combined
      replicates are considered. In case if two replicates are present
      and combined replicates are not present, replicates will be combined with
      '-mtc/--method-to-combine' option.

NOTE: Because downloading and conversion might take very long time, it also
      generates a checkpoint file in the output directory. Therefore,
      in case of crash or abrupt exit, the process can be continued from the
      last file.

Name of output files:

    (1) For ChIP-seq assay:
        a. signal-<Experiment target>-<Experiment accession>-<File accessions.h
        b. fold-<Experiment target>-<Experiment accession>-<File accessions>.h5

    (2) For RNA-seq:
        a. uniq-reads-<date>-<Experiment accession>-<File accessions>.h
        b. plus-uniq-reads-<date>-<Experiment accession>-<File accessions>.h
        c. minus-uniq-reads-<date>-<Experiment accession>-<File accessions>.h
        d. all-reads-<date>-<Experiment accession>-<File accessions>.h5
        e. plus-all-reads-<date>-<Experiment accession>-<File accessions>.h5
        f. minus-all-reads-<date>-<Experiment accession>-<File accessions>.h5
        g. signal-<date>-<Experiment accession>-<File accession>.h5

    (2) For DNase-seq:
        a. uniq-reads-signal-<date>-<Experiment accession>-<File accessions>.h
        b. raw-signal-<date>-<Experiment accession>-<File accessions>.h
        c. all-reads-signal-<date>-<Experiment accession>-<File accessions>.h
        d. signal-<date>-<Experiment accession>-<File accessions>.h5

    Note that name of cell-line is not included here. Therefore, use the
    directory name as a identifiers for cell-lines or species. The Experiment
    and File accession can be used to back-track about the dataset on ENCODE
    website.

Requirements
============
1) bigWigToWig : It converts binary bigWig file to ascii Wig file.
2) bigWigInfo : It fetches the information about chromosomes from bigWig file.

Both tools can be downloaded from http://hgdownload.cse.ucsc.edu/admin/exe/
for linux and Mac platform. However, these tools are not yet available for
Windows OS.

Path to these tools can be set using gcMapExplorer configure utility or can be
given with the command.

Resolutions
===========
By default, original data are downsampled to following resolutions:  '1kb',
'2kb', '4kb', '5kb', '8kb', '10kb', '20kb', '40kb', '80kb', '100kb', '160kb',
'200kb', '320kb', '500kb', '640kb',  and '1mb'.

The data are downsampled at this stage only to speed up the visualization
process as downsampling might slow down the interactive visualization.

Downsampling/Coarsening method
==============================
Presently, six methods are implemented:
1) min    -> Minimum value
2) max    -> Maximum value
3) amean  -> Arithmetic mean or average
4) hmean  -> Harmonic mean
5) gmean  -> Geometric mean
6) median -> Median

All these methods are used by default.
See below help for "-dm/--downsample-method" option to change the methods.

To keep original 1 base resolution data
=======================================
By default, the output h5 file does not contain original 1-base resolution
data to reduce the file size. To keep the original data in h5 file, used
-ko/--keep-original flag.

Usage:

usage: gcMapExplorer encode2h5 [-h] [-i input.txt] [-amb hg19] [-asy ChIP-seq]
                               [-b2w bigWigToWig] [-binfo bigWigInfo]
                               [-r "List of Resolutions"]
                               [-dm "List of downsampling method"]
                               [-cmeth lzf] [-mtc mean] [-od outDir] [-ko]
                               [-wd /home/rajendra/deskForWork/scratch]

Optional arguments:

-h, --help            show this help message and exit
-i input.txt, --input input.txt
                      Input text file.
                      At first search the datasets on https://www.encodeproject.org/matrix/?type=Experiment.
                      Then click on download button on top of the page. A text file will be downloaded.
                      This text file can be used as input in this program.

-amb hg19, --assembly hg19
                       Name of reference genome.
                      Example: hg19, GRCh38 etc.

-asy ChIP-seq, --assay ChIP-seq
                       Name of assay.
                      Presently, four assays are implemented:
                      'ChIP-seq', 'RNA-seq', 'DNase-seq' and 'FAIRE-seq'.

-b2w bigWigToWig, --bigWigToWig bigWigToWig
                      Path to bigWigToWig tool.

                      This is not necessary when bigWigToWig path is already set using gcMapExplorer
                      configure utility.

                      It can be downloaded from http://hgdownload.cse.ucsc.edu/admin/exe/
                      for linux and Mac platform.

                      If it is not present in configuration file, the input path should
                      be provided. It will be stored in configuration file for later use.

-binfo bigWigInfo, --bigWigInfo bigWigInfo
                       Path to bigWigInfo tool.

                      This is not necessary when bigWigInfo path is already set using gcMapExplorer
                      configure utility.

                      It can be downloaded from http://hgdownload.cse.ucsc.edu/admin/exe/
                      for linux and Mac platform.

                      If it is not present in configuration file, the input path should
                      be provided. It will be stored in configuration file for later use.

-r "List of Resolutions", --resolutions "List of Resolutions"
                      Additional input resolutions other than these resolutions: 1kb', '2kb',
                      '4kb', '5kb', '8kb', '10kb', '20kb', '40kb', '80kb', '100kb', '160kb','200kb',
                      '320kb', '500kb', '640kb',  and '1mb'.

                      Resolutions should be provided in comma separated values. For Example:
                      -r "25kb, 50kb, 75kb"

-dm "List of downsampling method", --downsample-method "List of downsampling method"
                      Methods to coarse or downsample the data for converting from 1-base
                      to coarser resolutions. If this option is not provided, all six methods (see
                      above) will be considered. User may use only subset of these methods.
                      For example: -dm "max, amean" can be used for downsampling by only these
                      two methods.

-cmeth lzf, --compression-method lzf
                      Data compression method in h5 file.
-mtc mean, --method-to-combine mean
                      Methods to combine data from more than two input file. Presently, three
                      methods can be used: 'mean', 'max' and 'min' for average, maximum and minimum
                      value, respectively.

-od outDir, --outDir outDir
                       Directory to save all h5 files. It is an essential input.

-ko, --keep-original  To copy original 1-base resolution data in h5 file. This will increase the
                      file size significantly.

-wd /home/rajendra/deskForWork/scratch, --work-dir /home/rajendra/deskForWork/scratch
                      Directory where temporary files will be stored.