class EncodeDatasetsConverter

EncodeDatasetsConverter(inputFile, assembly) Download and convert datasets from ENCODE Experiments matrix
EncodeDatasetsConverter.saveAsH5(outDir[, …]) Download the files and convert to gcMapExplorer compatible hdf5 file.
class EncodeDatasetsConverter(inputFile, assembly, methodToCombine='mean', pathTobigWigToWig=None, pathTobigWigInfo=None, workDir=None)

Download and convert datasets from ENCODE Experiments matrix

It can be used to download and convert multiple datasets from the ENCODE Experiment matrix (https://www.encodeproject.org/matrix/?type=Experiment). Presently, only bigWig files are downloaded and converted.

First, search for the datasets on https://www.encodeproject.org/matrix/?type=Experiment . Then click the download button at the top of the page to download a text file. This text file is the input to this program. All bigWig files listed in it will be downloaded and converted to the gcMapExplorer compatible hdf5 format.

Note

First, a metafile is downloaded automatically, and the files are filtered by format (bigWig) and assembly. Subsequently, if several replicates are present, only datasets with combined replicates are considered. If two replicates are present and no combined-replicate dataset exists, the first replicate is used. Combining replicates is not yet implemented.

Warning

Presently, bigWigToWig and bigWigInfo are not available for Windows. Therefore, this class will fail on that operating system.

inputFile

str – Name of input file downloaded from ENCODE Experiments matrix website.

assembly

str – Name of reference genome. Example: hg19, GRCh38 etc.

pathTobigWigToWig

str – Path to the bigWigToWig program. It can be downloaded from http://hgdownload.cse.ucsc.edu/admin/exe/ for MacOSX and Linux. If the path to the program is already present in the configuration file, it is taken from there.

If it is not present in the configuration file, the path must be provided as input. It is then stored in the configuration file for later use.

pathTobigWigInfo

str – Path to the bigWigInfo program. It can be downloaded from http://hgdownload.cse.ucsc.edu/admin/exe/ for MacOSX and Linux. If the path to the program is already present in the configuration file, it is taken from there.

If it is not present in the configuration file, the path must be provided as input. It is then stored in the configuration file for later use.

metafile

str – Name of the metafile downloaded from the ENCODE website. It is downloaded automatically using the input file. It contains all the metadata required for processing.

metaData

list of dict – A list of dictionaries read from the metafile. It is already filtered according to criteria such as file format, assembly and replicates.

Parameters:
  • inputFile (str) – Name of input file downloaded from ENCODE Experiments matrix website.
  • assembly (str) – Name of reference genome. Example: hg19, GRCh38 etc.
  • pathTobigWigToWig (str) –

    Path to the bigWigToWig program. It can be downloaded from http://hgdownload.cse.ucsc.edu/admin/exe/ for MacOSX and Linux. If the path to the program is already present in the configuration file, it is taken from there.

    If it is not present in the configuration file, the path must be provided as input. It is then stored in the configuration file for later use.

  • pathTobigWigInfo (str) –

    Path to the bigWigInfo program. It can be downloaded from http://hgdownload.cse.ucsc.edu/admin/exe/ for MacOSX and Linux. If the path to the program is already present in the configuration file, it is taken from there.

    If it is not present in the configuration file, the path must be provided as input. It is then stored in the configuration file for later use.
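A minimal usage sketch, assuming gcMapExplorer is installed on Linux or macOS and that the class can be imported from the module path used elsewhere in this reference (gcMapExplorer.lib.genomicsDataHandler). The wrapper name convert_encode_datasets and its argument names are hypothetical, and the file and directory names in the closing comment are placeholders.

```python
def convert_encode_datasets(input_file, assembly, out_dir,
                            bigwig_to_wig=None, bigwig_info=None):
    """Download and convert every bigWig listed in an ENCODE matrix file.

    Hypothetical wrapper: only the constructor and the saveAsH5 call come
    from the documented interface.
    """
    # Imported lazily so this sketch can be loaded without gcMapExplorer.
    from gcMapExplorer.lib.genomicsDataHandler import EncodeDatasetsConverter

    converter = EncodeDatasetsConverter(
        input_file,
        assembly,
        # None: rely on the path stored in the configuration file, if any.
        pathTobigWigToWig=bigwig_to_wig,
        pathTobigWigInfo=bigwig_info,
    )
    converter.saveAsH5(out_dir)
    return converter


# Example call (placeholder names):
# convert_encode_datasets("files.txt", "hg19", "encode_h5")
```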

_checkBigWigInfoProgram(pathTobigWigInfo)

Check if bigWigInfo program is available or accessible.

If program is not available in configuration file, the given path will be stored in the file after checking its accessibility.

The path is stored in gcMapExplorer.lib.genomicsDataHandler.EncodeDatasetsConverter.pathTobigWigInfo

Parameters: pathTobigWigInfo (str) – Path to bigWigInfo program
_checkBigWigToWigProgram(pathTobigWigToWig)

Check if bigWigToWig program is available or accessible.

If program is not available in configuration file, the given path will be stored in the file after checking its accessibility.

The path is stored in gcMapExplorer.lib.genomicsDataHandler.EncodeDatasetsConverter.pathTobigWigToWig

Parameters: pathTobigWigToWig (str) – Path to bigWigToWig program
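
The accessibility check performed by these two methods can be pictured with the standard library alone. This is a sketch under assumptions, not the library's code: the helper name is_program_accessible is hypothetical, and storing the path in the configuration file is left out.

```python
import os
import shutil


def is_program_accessible(path):
    """Return True if `path` names an executable file or a command on PATH."""
    if not path:
        return False
    # Explicit path to an existing, executable file.
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return True
    # Otherwise, try to resolve it as a command name through PATH.
    return shutil.which(path) is not None
```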
_readFromCheckPoint()

Read the titles from checkpoint file.

_removeBigWigFiles()

Remove downloaded bigwig file

_writeToCheckPoint(idx)

Write the completed titles to the checkpoint file.

downloadMetaData()

Download the metadata file

It downloads the metadata file and stores it at gcMapExplorer.lib.genomicsDataHandler.EncodeDatasetsConverter.metafile

filterReplicates()

It filters the metaData according to the replicates. If several replicates for a dataset are present, it only reads the dataset with combined replicates.

If two replicates are present and no combined-replicate dataset exists, the first replicate is used. Combining replicates is not yet implemented.
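
The rule above can be sketched on a list of dictionaries shaped like metaData. This is an illustration, not the library's implementation. It assumes that a combined dataset lists several replicates in 'Biological replicate(s)' separated by commas (for example '1, 2'), and that entries compete only within the same experiment and type. The function name filter_replicates and all accession values are hypothetical placeholders.

```python
from collections import defaultdict


def filter_replicates(meta_data):
    """Keep combined-replicate entries; otherwise keep the first replicate."""
    groups = defaultdict(list)
    for entry in meta_data:
        # Entries of the same experiment and type compete with each other.
        groups[(entry["Experiment accession"], entry["type"])].append(entry)

    kept = []
    for entries in groups.values():
        # Assumption: a comma marks a dataset built from several replicates.
        combined = [e for e in entries if "," in e["Biological replicate(s)"]]
        if combined:
            kept.extend(combined)
        else:
            # No combined dataset: fall back to the lowest-numbered replicate.
            kept.append(min(entries,
                            key=lambda e: int(e["Biological replicate(s)"])))
    return kept


example = [
    {"Experiment accession": "EXP-A", "File accession": "FILE-1",
     "type": "fold", "Biological replicate(s)": "1"},
    {"Experiment accession": "EXP-A", "File accession": "FILE-2",
     "type": "fold", "Biological replicate(s)": "2"},
    {"Experiment accession": "EXP-A", "File accession": "FILE-3",
     "type": "fold", "Biological replicate(s)": "1, 2"},
    {"Experiment accession": "EXP-B", "File accession": "FILE-4",
     "type": "signal", "Biological replicate(s)": "1"},
    {"Experiment accession": "EXP-B", "File accession": "FILE-5",
     "type": "signal", "Biological replicate(s)": "2"},
]
print([e["File accession"] for e in filter_replicates(example)])
# ['FILE-3', 'FILE-4']
```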

readMetaData()

Read the metafile and extract the information

It reads the metafile, filters the datasets according to assembly and file format, and builds a list of dictionaries.

Each dictionary contains the following fields:
  • title : Experiment target-Experiment accession-File accession-[fold/signal]
  • type : fold for fold change over control or signal for signal p-value.
  • url : URL to file
  • Experiment accession
  • File accession
  • Biological replicate(s)
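
As an illustration of the fields listed above, the following sketch builds one such dictionary. The helper names make_title and make_entry are hypothetical, and the target, accessions and URL are placeholders rather than real ENCODE identifiers. The title is joined in the documented order.

```python
def make_title(target, experiment_accession, file_accession, signal_type):
    """Join the fields in the documented order: target-experiment-file-type."""
    if signal_type not in ("fold", "signal"):
        raise ValueError("signal_type must be 'fold' or 'signal'")
    return "-".join([target, experiment_accession, file_accession, signal_type])


def make_entry(target, experiment_accession, file_accession, signal_type,
               url, replicates):
    """Build one metadata dictionary with the documented fields."""
    return {
        "title": make_title(target, experiment_accession,
                            file_accession, signal_type),
        "type": signal_type,
        "url": url,
        "Experiment accession": experiment_accession,
        "File accession": file_accession,
        "Biological replicate(s)": replicates,
    }


entry = make_entry("TARGET", "EXP-A", "FILE-1", "fold",
                   "https://example.org/FILE-1.bigWig", "1, 2")
print(entry["title"])  # TARGET-EXP-A-FILE-1-fold
```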
saveAsH5(outDir, resolutions=None, retryDownload=5, coarsening_methods=None, compression='lzf', keep_original=False)

Download the files and convert to gcMapExplorer compatible hdf5 file.

Name of output files:

  1. For ChIP-seq assay:
       a. signal-<Experiment target>-<Experiment accession>-<File accessions>.h5
       b. fold-<Experiment target>-<Experiment accession>-<File accessions>.h5
  2. For RNA-seq:
       a. uniq-reads-<date>-<Experiment accession>-<File accessions>.h5
       b. plus-uniq-reads-<date>-<Experiment accession>-<File accessions>.h5
       c. minus-uniq-reads-<date>-<Experiment accession>-<File accessions>.h5
       d. all-reads-<date>-<Experiment accession>-<File accessions>.h5
       e. plus-all-reads-<date>-<Experiment accession>-<File accessions>.h5
       f. minus-all-reads-<date>-<Experiment accession>-<File accessions>.h5
       g. signal-<date>-<Experiment accession>-<File accession>.h5
  3. For DNase-seq:
       a. uniq-reads-signal-<date>-<Experiment accession>-<File accessions>.h5
       b. raw-signal-<date>-<Experiment accession>-<File accessions>.h5
       c. all-reads-signal-<date>-<Experiment accession>-<File accessions>.h5
       d. signal-<date>-<Experiment accession>-<File accessions>.h5
  4. For siRNA + RNA-seq:
       a. uniq-reads-signal-<Experiment target>-<Experiment accession>-<File accessions>.h5
       b. all-reads-signal-<Experiment target>-<Experiment accession>-<File accessions>.h5
       c. signal-<Experiment target>-<Experiment accession>-<File accessions>.h5

Note

Because downloading and conversion can take a very long time, a checkpoint file is also generated in the output directory. In case of a crash or abrupt exit, the process can therefore be continued from the last file.
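
The checkpoint idea can be sketched in a few lines. The format of the real checkpoint file is not described here, so the sketch assumes a plain text file with one finished title per line. The names read_checkpoint, process_with_checkpoint and checkpoint.txt are hypothetical.

```python
import os
import tempfile


def read_checkpoint(path):
    """Return the set of titles already recorded as done."""
    if not os.path.exists(path):
        return set()
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def process_with_checkpoint(titles, checkpoint_path, convert):
    """Convert each title once, recording progress after every success."""
    done = read_checkpoint(checkpoint_path)
    for title in titles:
        if title in done:
            continue  # finished in an earlier run
        convert(title)
        # Append immediately so a crash loses at most the file in progress.
        with open(checkpoint_path, "a") as handle:
            handle.write(title + "\n")


# Simulate a crash on the second file, then a resumed run.
converted = []
crash_once = {"pending": True}


def flaky_convert(title):
    if title == "B" and crash_once["pending"]:
        crash_once["pending"] = False
        raise RuntimeError("simulated crash")
    converted.append(title)


with tempfile.TemporaryDirectory() as out_dir:
    checkpoint = os.path.join(out_dir, "checkpoint.txt")
    try:
        process_with_checkpoint(["A", "B", "C"], checkpoint, flaky_convert)
    except RuntimeError:
        pass  # first run stops at "B"
    process_with_checkpoint(["A", "B", "C"], checkpoint, flaky_convert)

print(converted)  # ['A', 'B', 'C']
```

In the resumed run, "A" is skipped because it is already listed in the checkpoint file, so it is converted only once.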

Parameters:
  • outDir (str) – Output directory where all files will be saved. Checkpoint file will be stored in same directory.
  • resolutions (list of str) –

    Additional input resolutions other than these default resolutions: ‘1kb’, ‘2kb’, ‘4kb’, ‘5kb’, ‘8kb’, ‘10kb’, ‘20kb’, ‘40kb’, ‘80kb’, ‘100kb’, ‘160kb’, ‘200kb’, ‘320kb’, ‘500kb’, ‘640kb’, and ‘1mb’.

    For example, use resolutions=['25kb', '50kb', '75kb'] to add 25kb, 50kb and 75kb resolution data.

  • retryDownload (int) – Maximum number of attempts to download each bigWig file. The time gap between attempts is two seconds.
  • coarsening_methods (list of str) –

    Methods used to coarsen (downsample) the data when converting from 1-base to coarser resolutions. Presently, the following methods are implemented.

    • 'min' -> Minimum value
    • 'max' -> Maximum value
    • 'amean' -> Arithmetic mean or average
    • 'hmean' -> Harmonic mean
    • 'gmean' -> Geometric mean
    • 'median' -> Median

    If None, all of the methods listed above are used. A subset may also be given; for example, coarsening_methods=['max', 'amean'] downsamples by only these two methods.

  • compression (str) – Data compression method for the HDF5 file: lzf or gzip.
  • keep_original (bool) – Whether the original data present in the bigWig file should be incorporated in the HDF5 file. This will significantly increase the size of the HDF5 file.
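
To make the coarsening methods concrete, the sketch below reduces consecutive bins of a signal with each documented method name, using Python's statistics module (Python 3.8 or later for geometric_mean). It is an illustration only: how the library treats bin edges, zeros or missing values is not described here, and the names REDUCERS and coarsen are hypothetical.

```python
import statistics

# Map the documented method names to standard-library reducers.
REDUCERS = {
    "min": min,
    "max": max,
    "amean": statistics.mean,
    "hmean": statistics.harmonic_mean,
    "gmean": statistics.geometric_mean,
    "median": statistics.median,
}


def coarsen(values, bin_size, method):
    """Reduce each run of `bin_size` consecutive values to one number."""
    reducer = REDUCERS[method]
    return [reducer(values[i:i + bin_size])
            for i in range(0, len(values), bin_size)]


signal = [1.0, 4.0, 2.0, 8.0]
print(coarsen(signal, 2, "max"))    # [4.0, 8.0]
print(coarsen(signal, 2, "amean"))  # [2.5, 5.0]
```

Note that the harmonic and geometric means in this sketch require positive values, so a real signal containing zeros would need separate handling.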