class BEDHandler
¶
BEDHandler (filenames[, column, chromName, …]) |
To convert BED files to hdf5/h5 file |
BEDHandler.parseBed () |
To parse bed files |
BEDHandler.setChromosome (chromName) |
Set the target chromosome for reading and extracting from bed file |
BEDHandler.saveAsH5 (hdf5Out[, title, …]) |
To convert bed files to hdf5 file |
-
class
BEDHandler
(filenames, column=7, chromName=None, indexFile=None, tmpNumpyArrayFiles=None, methodToCombine='mean', workDir=None, maxEntryWrite=10000000)¶ To convert BED files to hdf5/h5 file
It parses bed files and saves all data to an hdf5/h5 file at the given resolutions.
-
BedFileNames
¶ list[str] – List of input bed files.
Note
If
BEDHandler.chromName
is provided, only one bed file is accepted.
-
column
¶ int – The number of the column to use as the data column. The column number varies with the BED format. For example:
- ENCODE broadPeak format (BED 6+3): 7th column
- ENCODE gappedPeak format (BED 12+3): 13th column
- ENCODE narrowPeak format (BED 6+4): 7th column
- ENCODE RNA elements format (BED 6+3): 7th column
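As a rough illustration of how such a column number maps onto a BED record, the sketch below pulls the chromosome, start, end and the chosen data column out of whitespace-separated lines. The helper name `read_bed_column` is hypothetical and not part of BEDHandler, and the skipping of `track`, `browser` and `#` lines is an assumption about typical BED files.

```python
def read_bed_column(lines, column=7):
    """Yield (chrom, start, end, value) tuples from BED lines.

    `column` is counted from 1, as in the format list above.
    Hypothetical helper for illustration; not part of BEDHandler.
    """
    for line in lines:
        line = line.strip()
        # Assumption: skip blank lines and common header/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        if len(fields) < column:
            raise ValueError(
                f"expected at least {column} columns, got {len(fields)}")
        yield fields[0], int(fields[1]), int(fields[2]), float(fields[column - 1])


# A narrowPeak-style record (BED 6+4): the signal value is the 7th column.
records = list(read_bed_column(
    ["chr1\t100\t200\tpeak1\t0\t.\t5.5\t-1\t-1\t50"], column=7))
```

For a gappedPeak file the same call would use `column=13`, following the list above.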
-
chromName
¶ str – Name of the target chromosome to be extracted from the bed file.
-
chromSizeInfo
¶ dict – A dictionary containing chromosome size information.
-
_chromPointerInFile
¶ dict – A dictionary containing position index of each chromosome in bed file.
-
indexFile
¶ str – A file in json format containing indices (positions in the bed file) and sizes of chromosomes. If a file name is given but the file does not exist, a new file is generated. If the file exists, indices and sizes are taken from it. If the index and size of the input chromosome are not present in the json file, they are determined from the bed file and stored in the same json file. This file is very helpful when the same bed file has to be read many times, because the step that determines the index and size of each chromosome is skipped.
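The caching behaviour described for the index file can be sketched as follows. This is only an illustration under stated assumptions: the function names, the JSON layout (`pointer` and `size` keys) and the use of the largest end coordinate as the chromosome size are all hypothetical, since the actual schema is not documented here.

```python
import json


def build_chrom_index(bed_path):
    """Map each chromosome to the byte offset of its first record and its size.

    Assumption: "size" is taken as the largest end coordinate seen.
    """
    index = {}
    with open(bed_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            fields = line.decode().split()
            # Assumption: skip short lines and common header/comment lines.
            if (len(fields) < 3 or fields[0] in ("track", "browser")
                    or fields[0].startswith("#")):
                continue
            chrom, end = fields[0], int(fields[2])
            entry = index.setdefault(chrom, {"pointer": offset, "size": end})
            entry["size"] = max(entry["size"], end)
    return index


def load_or_build_index(bed_path, index_path):
    """Reuse a cached JSON index if it exists, otherwise build and save it."""
    try:
        with open(index_path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        index = build_chrom_index(bed_path)
        with open(index_path, "w") as handle:
            json.dump(index, handle, indent=2)
        return index
```

On a second call with the same `index_path`, the bed file is not scanned again, which is the saving the description above refers to.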
-
methodToCombine
¶ str – Method used to combine bed files. Presently, the accepted keywords are: mean, min and max.
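To make the three keywords concrete, here is a minimal sketch of combining values from several files location by location. The function `combine_tracks` and the dictionary-per-file representation are hypothetical, and the handling of locations missing from some files is an assumption rather than the library's documented behaviour.

```python
from statistics import mean

# Hypothetical mapping from the documented keywords to reducers.
COMBINERS = {"mean": mean, "min": min, "max": max}


def combine_tracks(tracks, method="mean"):
    """Combine per-location values from several input files.

    `tracks` is a list of {location: value} dicts, one per file.
    Assumption: a location missing from a file is left out of the
    reduction; the library may treat missing values differently.
    """
    reducer = COMBINERS[method]
    locations = sorted(set().union(*tracks))
    return {loc: reducer([t[loc] for t in tracks if loc in t])
            for loc in locations}


combined = combine_tracks([{1: 2.0, 2: 4.0}, {1: 4.0, 3: 1.0}], method="mean")
```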
-
tmpNumpyArrayFiles
¶ TempNumpyArrayFiles – This TempNumpyArrayFiles instance stores information about the temporary numpy array files.
-
isBedParsed
¶ bool – Whether bed files are already parsed.
-
maxEntryWrite
¶ int – Number of lines read from the bed file at a time; after this many lines, the data is dumped to a temporary numpy array file.
Parameters:
- filenames (str or list(str)) – List of input bed files.
Note
If BEDHandler.chromName is provided, only one bed file is accepted.
- column (int) – The number of the column to use as the data column. The column number varies with the BED format. For example:
- ENCODE broadPeak format (BED 6+3): 7th column
- ENCODE gappedPeak format (BED 12+3): 13th column
- ENCODE narrowPeak format (BED 6+4): 7th column
- ENCODE RNA elements format (BED 6+3): 7th column
- chromName (str) – Name of the target chromosome to be extracted from the bed file.
- indexFile (str) – A file in json format containing indices (positions in the bed file) and sizes of chromosomes. If a file name is given but the file does not exist, a new file is generated. If the file exists, indices and sizes are taken from it. If the index and size of the input chromosome are not present in the json file, they are determined from the bed file and stored in the same json file. This file is very helpful when the same bed file has to be read many times, because the step that determines the index and size of each chromosome is skipped.
- tmpNumpyArrayFiles (TempNumpyArrayFiles) – This TempNumpyArrayFiles instance stores information about the temporary numpy array files.
- methodToCombine (str) – Method used to combine bed files. Presently, the accepted keywords are: mean, min and max.
- maxEntryWrite (int) – Number of lines read from the bed file at a time; after this many lines, the data is dumped to a temporary numpy array file. Lower this number to reduce memory (RAM) usage, because larger values need more RAM.
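The memory trade-off behind maxEntryWrite can be sketched as a simple buffer-and-flush loop. The function `parse_in_batches` and the `flush` callback are hypothetical stand-ins; in particular, the callback only imitates the dump to a temporary numpy array file and is not the library's implementation.

```python
def parse_in_batches(records, flush, max_entry_write=10_000_000):
    """Buffer (location, value) pairs and flush every `max_entry_write` entries.

    `flush` is a caller-supplied callback standing in for the dump to a
    temporary numpy array file; both names are hypothetical.
    """
    locations, values = [], []
    for location, value in records:
        locations.append(location)
        values.append(value)
        if len(locations) >= max_entry_write:
            flush(locations, values)
            locations, values = [], []  # release the buffered entries
    if locations:  # flush the final, partially filled buffer
        flush(locations, values)


batches = []
parse_in_batches(
    ((i, float(i)) for i in range(5)),
    lambda loc, val: batches.append((list(loc), list(val))),
    max_entry_write=2,
)
```

At most `max_entry_write` entries are held in memory at once, so a smaller value lowers peak RAM at the cost of more frequent writes.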
-
_FillDataInNumpyArrayFile
(ChromTitle, location_list, value_list)¶ Fill the extracted data from bed file to temporary numpy array file
Warning
Private method. Use it at your own risk. It is used internally in BEDHandler._parseBed().
Parameters:
- ChromTitle (str) – Name of chromosome
- location_list (list of int) – List of locations for given chromosome
- value_list (list of float) – List of values for respective chromosome location
-
_PerformDataCoarsening
(Chrom, resolution, coarse_method)¶ Base method to perform Data coarsening.
This method reads temporary numpy array files and performs data coarsening using the given input method.
Warning
Private method. Use it at your own risk. It is used internally in BEDHandler._StoreInHdf5File().
Parameters:
-
_StoreInHdf5File
(hdf5Out, title, resolutions=None, coarsening_methods=None, compression='lzf', keep_original=False)¶ Base method to store coarsened data in hdf5/h5 file.
At first data is coarsened and subsequently stored in h5 file.
Warning
Private method. Use it at your own risk. It is used internally in BEDHandler.saveAsH5().
Parameters:
- hdf5Out (str or HDF5Handler) – Name of the output hdf5 file or an instance of HDF5Handler
- title (str) – Title of data
- resolutions (list of str) – Additional input resolutions other than these default resolutions: ‘1kb’, ‘2kb’, ‘4kb’, ‘5kb’, ‘8kb’, ‘10kb’, ‘20kb’, ‘40kb’, ‘80kb’, ‘100kb’, ‘160kb’, ‘200kb’, ‘320kb’, ‘500kb’, ‘640kb’, and ‘1mb’. For example, use resolutions=['25kb', '50kb', '75kb'] to add 25kb, 50kb and 75kb resolution data.
- coarsening_methods (list of str) – Methods to coarsen or downsample the data when converting from 1-base to coarser resolutions. Presently, six methods are implemented:
  - 'min' -> Minimum value
  - 'max' -> Maximum value
  - 'amean' -> Arithmetic mean or average
  - 'hmean' -> Harmonic mean
  - 'gmean' -> Geometric mean
  - 'median' -> Median
  If None, all six methods are used. A subset of these methods may also be given. For example, coarsening_methods=['max', 'amean'] downsamples using only these two methods.
- compression (str) – Data compression method in the HDF5 file: lzf or gzip.
- keep_original (bool) – Whether the original data present in the bed file should be included in the HDF5 file. This significantly increases the size of the HDF5 file.
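To illustrate what the six coarsening keywords compute, the sketch below reduces consecutive bins of a 1-base signal using Python's standard `statistics` module. The `coarsen` function and the `COARSENERS` table are hypothetical; the treatment of a trailing partial bin is an assumption, and this is not the library's own implementation.

```python
import statistics

# The six documented keywords mapped to standard-library reducers.
# Note: harmonic_mean rejects negative values and geometric_mean
# requires strictly positive values.
COARSENERS = {
    "min": min,
    "max": max,
    "amean": statistics.mean,
    "hmean": statistics.harmonic_mean,
    "gmean": statistics.geometric_mean,  # needs Python 3.8+
    "median": statistics.median,
}


def coarsen(values, bin_size, method="amean"):
    """Reduce each consecutive run of `bin_size` values to a single value.

    Assumption: a trailing partial bin is reduced like a full one.
    """
    reducer = COARSENERS[method]
    return [reducer(values[i:i + bin_size])
            for i in range(0, len(values), bin_size)]


signal = [1.0, 2.0, 4.0, 8.0, 3.0]
coarse_max = coarsen(signal, 2, "max")
coarse_mean = coarsen(signal, 2, "amean")
```

With a bin size of 1000, the same idea would turn 1-base data into the 1kb resolution listed among the defaults.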
-
_getChromSizeInfo
(bedFileName, inputChrom=None)¶ Get chromosome sizes and index the bed file
This method parses a bed file, extracts the size of each chromosome and indexes the position of each chromosome in the file.
It sets BEDHandler._chromPointerInFile and BEDHandler.chromSizeInfo.
Warning
Private method. Use it at your own risk. It is used internally during initialization and in BEDHandler.setChromosome().
Parameters:
-
_loadChromSizeAndIndex
()¶ Load chromosome sizes and indices from a json file
-
_parseBed
(bedFileName)¶ Base method to parse a bed file.
This method parses a bed file and copies the extracted data to temporary numpy array files.
Warning
Private method. Use it at your own risk. It is used internally in BEDHandler.parseBed().
Parameters: bedFileName (str) – Name of the bed file
-
_saveChromSizeAndIndex
()¶ Save chromosomes sizes and indices dictionary to a json file
-
getRawWigDataAsDictionary
(dicOut=None)¶ To get an entire dictionary of data from the bed file
It generates a dictionary of numpy arrays, one for each chromosome. These arrays are stored in temporary numpy array files of TempNumpyArrayFiles.
Parameters: dicOut (dict) – The output dictionary to which data will be added or replaced.
Returns: dicOut – The output dictionary.
Return type: dict
-
parseBed
()¶ To parse bed files
This method parses all bed files listed in BEDHandler.bedFileNames. The extracted data is then stored in temporary numpy array files of the respective chromosome. These numpy array files can be used either for data coarsening or for further analysis.
- To save as h5: Use BEDHandler.saveAsH5().
- To perform analysis: Use BEDHandler.getRawWigDataAsDictionary() to get a dictionary of numpy arrays.
-
saveAsH5
(hdf5Out, title=None, resolutions=None, coarsening_methods=None, compression='lzf', keep_original=False)¶ To convert bed files to hdf5 file
It parses bed files, coarsens the data and stores it in the given hdf5/h5 file.
Parameters:
- hdf5Out (HDF5Handler or str) – Output hdf5 file name or HDF5Handler instance
- title (str) – Title of the data
- resolutions (list of str) – Additional input resolutions other than these default resolutions: ‘1kb’, ‘2kb’, ‘4kb’, ‘5kb’, ‘8kb’, ‘10kb’, ‘20kb’, ‘40kb’, ‘80kb’, ‘100kb’, ‘160kb’, ‘200kb’, ‘320kb’, ‘500kb’, ‘640kb’, and ‘1mb’. For example, use resolutions=['25kb', '50kb', '75kb'] to add 25kb, 50kb and 75kb resolution data.
- coarsening_methods (list of str) – Methods to coarsen or downsample the data when converting from 1-base to coarser resolutions. Presently, six methods are implemented:
  - 'min' -> Minimum value
  - 'max' -> Maximum value
  - 'amean' -> Arithmetic mean or average
  - 'hmean' -> Harmonic mean
  - 'gmean' -> Geometric mean
  - 'median' -> Median
  If None, all six methods are used. A subset of these methods may also be given. For example, coarsening_methods=['max', 'amean'] downsamples using only these two methods.
- compression (str) – Data compression method in the HDF5 file: lzf or gzip.
- keep_original (bool) – Whether the original data present in the bed file should be included in the HDF5 file. This significantly increases the size of the HDF5 file.
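The resolution labels accepted by saveAsH5 are plain strings such as '25kb' or '1mb'. The sketch below shows one way such labels could be turned into bin sizes in base pairs. The function `resolution_to_bases` is hypothetical, and both the decimal multipliers and the extra 'b' unit are assumptions, since the parsing rules are not documented here.

```python
import re

# Assumption: 'kb' and 'mb' are decimal multiples; 'b' is added for
# completeness and does not appear in the documented defaults.
UNITS = {"b": 1, "kb": 1_000, "mb": 1_000_000}


def resolution_to_bases(resolution):
    """Convert a resolution label such as '25kb' or '1mb' to base pairs."""
    match = re.fullmatch(r"(\d+)(b|kb|mb)", resolution.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised resolution: {resolution!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


extra = sorted(resolution_to_bases(r) for r in ["25kb", "50kb", "75kb"])
```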
-
setChromosome
(chromName)¶ Set the target chromosome for reading and extracting from bed file
To read and convert data of another chromosome from a bed file, set it here. After this, directly use BEDHandler.saveAsH5() to save the data in the H5 file.
Parameters: chromName (str) – Name of new target chromosome
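The set-then-save pattern described above can be wrapped in a small loop. The sketch below takes an already constructed handler as an argument and relies only on the setChromosome() and saveAsH5() calls documented in this section; the wrapper `convert_chromosomes` itself is hypothetical and not part of the library.

```python
def convert_chromosomes(handler, hdf5_out, chromosomes, title=None, **save_kwargs):
    """Save several chromosomes from one bed file, one after another.

    `handler` is expected to expose setChromosome() and saveAsH5() as
    documented for BEDHandler; this wrapper is a hypothetical helper.
    """
    for chrom in chromosomes:
        # Point the handler at the next chromosome, then save directly.
        handler.setChromosome(chrom)
        handler.saveAsH5(hdf5_out, title=title, **save_kwargs)
```

A call might look like `convert_chromosomes(handler, "out.h5", ["chr1", "chr2"], title="signal", compression="gzip")`, where `handler` is a BEDHandler created for a single bed file and the file name and title are placeholders.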
-