statDist module¶
This module contains functions and methods for calculating stationary distribution by assuming markov-chain.
This module contains functions to calculate probability transition matrix and subsequently to calculate stationary distribution.
Example¶
Below is a simple example to calculate stationary distribution. It consists of three major steps:
- Calculate median-subtracted normalized matrix
- Calculate probability transition matrix
- Calculate stationary distribution
# median-subtracted normalized matrix, stype should be 'o-e', Other arguments can be changed.
gmlib.normalizer.normalizeGCMapByMCFS('input_raw.gcmap', 'mcfs_O-E.gcmap', stats='median', stype='o-e')
# Probability transition matrix, calculate matrix at '40kb' resolution.
gmlib.statDist.transitionProbabilityMatrixForGCMap('mcfs_O-E.gcmap', 'prob_mat.gcmap', '40kb')
# Stationary distribution at '40kb' resolution
gmlib.statDist.statDistrByEigenDecompForGCMap('out_prob_mat.gcmap', 'stat_dist.h5', '40kb')
Summary¶
statDist.calculateTransitionProbabilityMatrix (A) |
Core function to calculate transition probability matrix. |
statDist.transitionProbabilityMatrixForCCMap (ccMap) |
To calculate transition probability matrix. |
statDist.transitionProbabilityMatrixForGCMap (…) |
To calculate transition matrices using a gcmap file. |
statDist.statDistrByEigenDecompForCCMap (ccMap) |
Calculate stationary distribution from a ccmap file or object. |
statDist.statDistrByEigenDecompForGCMap (…) |
Calculate stationary distribution using transition matrices from gcmap file for given resolution. |
statDist.stationaryDistributionByEigenDecomp (…) |
To calculate stationary distribution from probability transition matrix. |
Documentation¶
-
MatrixPowerRAM
(M, n)¶ Raise a square matrix to the (integer) power n.
Note
Ported from numpy/matrixlib/defmatrix.py. This function was rewritten to speed-up the calculation. In place of numpy dot function, a mdot function is used, which directly uses blas dgemm function.
For positive integers n, the power is computed by repeated matrix squaring and matrix multiplications. If
n == 0
, the identity matrix of the same shape as M is returned. Ifn < 0
, the inverse is computed and then raised to theabs(n)
.Parameters: - M (ndarray or matrix object) – Matrix to be “powered.” Must be square, i.e.
M.shape == (m, m)
, with m a positive integer. - n (int) – The exponent can be any integer or long integer, positive, negative, or zero.
Returns: M**n – The return value is the same shape and type as M; if the exponent is positive or zero then the type of the elements is the same as those of M. If the exponent is negative the elements are floating-point.
Return type: ndarray or matrix object
Raises: LinAlgError
– If the matrix is not numerically invertible.- M (ndarray or matrix object) – Matrix to be “powered.” Must be square, i.e.
-
calculateTransitionProbabilityMatrix
(A, Out=None, bNoData=None)¶ Core function to calculate transition probability matrix.
It is a core function to calculate transition probability matrix.
Parameters: - A (numpy.memmap or
gcMapExplorer.lib.ccmap.CCMAP.matrix
or numpy.ndarray) – Input map or matrix - Out (numpy.memmap or
gcMapExplorer.lib.ccmap.CCMAP.matrix
or numpy.ndarray) – Ouput transition matrix. In case if it isNone
, ouput matrix will be returned. - bNoData (numpy.array[bool]) – 1D-array containing
True
andFalse
values. Its size should be equal to input array row/column size. Row/Column withFalse
value will be considered during the calculation.
Returns: Out – Ouput transition matrix. In case if
Out
is passed,None
will be returned.Return type: numpy.ndarray or
None
.- A (numpy.memmap or
-
statDistrByEigenDecompForCCMap
(ccMap, chrom=None, hdf5Handle=None, compression='lzf', workDir=None)¶ Calculate stationary distribution from a ccmap file or object.
It uses eigendecomposition method to calculate stationary distribution from transition matrix.
Parameters: - ccMap (
gcMapExplorer.lib.ccmap.CCMAP
or ccmap file) – A CCMAP object containing observed contact frequency or a ccmap file - chrom (str) – Name of chromosome – necessary as used in output HDF5 file.
- hdf5Handle (
gcMapExplorer.lib.genomicsDataHandler.HDF5Handler
) – If it is provided, stationary distribution will be directly added to this file. In case ofNone
, stationary distribution will be returned as a 1D array. - compression (str) – Compression method in output HDF5 file. Presently allowed :
lzf
for LZF compression andgzip
for GZIP compression. - workDir (str) – Path to the directory where temporary intermediate files are generated.
If
None
, files are generated in the temporary directory according to the main configuration.
Returns: sdist – If
hdf5Handle
is providedNone
is returned otherwise stationary distribution as 1D array is returned.Return type: numpy.1darray or
None
- ccMap (
-
statDistrByEigenDecompForGCMap
(gcMapInputFile, outFile, resolution, overwrite=False, compression='lzf', workDir=None)¶ Calculate stationary distribution using transition matrices from gcmap file for given resolution.
It uses eigendecomposition method to calculate stationary distribution from transition matrix.
Note
Use same input resolution as used during calculation of transition matrix using gcmap file.
Parameters: - gcMapInputFile (str) – Name of input gcmap file.
- outFile (str) – Name of output HDF5 file to store calculated stationary distribution.
- resolution (str) – Input resolution at which transition matrix was calculated. And stationary distribution will be calculated.
- compression (str) – Compression method in output HDF5 file. Presently allowed :
lzf
for LZF compression andgzip
for GZIP compression. - workDir (str) – Path to the directory where temporary intermediate files are generated.
If
None
, files are generated in the temporary directory according to the main configuration.
-
stationaryDistributionByEigenDecomp
(prob_matrix, bNoData)¶ To calculate stationary distribution from probability transition matrix.
Stationary distribution is calculated using probability transition matrix with eigendecomposition.
Parameters: - prob_matrix (numpy.memmap or
gcMapExplorer.lib.ccmap.CCMAP.matrix
or numpy.ndarray) – Input probability transition matrix - bNoData (numpy.array[bool]) – 1D-array containing
True
andFalse
values. Its size should be equal to input matrix row/column size. Row/Column withFalse
value will be considered during the calculation. Row/Column withTrue
value will not be considered during calculation and in these locations, minimum stationary distribution will be filled in the output.
Returns: sdist – Ouput stationary distribution as 1D numpy array.
Return type: numpy.ndarray
- prob_matrix (numpy.memmap or
-
transitionProbabilityMatrixForCCMap
(ccMap, outFile=None, percentile_threshold_no_data=None, threshold_data_occup=None, workDir=None)¶ To calculate transition probability matrix.
This method can be used to calculate transition probability matrix. This is similar to markov-chain transition matrix.
Note: This transition matrix is not symmetric, because each row represents stochastic row vector, which contains contact probability of this bin with every other bins and sum of row is always equal to one. See here: Markov-chain example.
Parameters: - ccMap (
gcMapExplorer.lib.ccmap.CCMAP
or ccmap file) – A CCMAP object containing observed contact frequency or a ccmap file - outFile (str) – Name of output ccmap file, to save directly the map as a ccmap file. In case of this option,
None
will return. - percentile_threshold_no_data (int) –
It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded.
percentile_threshold_no_data
should be between 1 and 100. This options discard the rows and columns which are above this percentile. For example: if this value is 99, those row or columns will be discarded which contains larger than number of zeros (missing data) at 99 percentile.To calculate percentile, all blank rows are removed, then in all rows, number of zeros are counted. Afterwards, number of zeros at percentile_threshold_no_data percentile is obtained. In next step, if a row contain number of zeros larger than this percentile value, the whole row and column is assigned to have missing data. This percentile indicates highest numbers of zeros (missing data) in given rows/columns.
- threshold_data_occup (float) –
It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded. This ratio is (number of bins with data) / (total number of bins in the given row/column). For example: if threshold_data_occup = 0.8, then all rows containing more than 20% of missing data will be discarded.
Note that this parameter is suitable for low resolution data because maps are likely to be much less sparse.
Returns: normCCMap – Transition matrix. When
outFile
is provided,None
is returned. In case of any other error,None
is returned.Return type: gcMapExplorer.lib.ccmap.CCMAP
orNone
- ccMap (
-
transitionProbabilityMatrixForGCMap
(gcMapInputFile, gcMapOutFile, resolution, compression='lzf', percentile_threshold_no_data=None, threshold_data_occup=None, workDir=None, logHandler=None)¶ To calculate transition matrices using a gcmap file.
It can be used to calculate transition matrices (markov-chain) for all maps present in a gcmap file for given resolution.
Note
Matrices will be calculated for only input resolution. For coarser resolutions, data will be downsampled, and therefore in output gcmap, only matrices corresponding to input resolution is correct. Other coarsed resolutionsm matrices are only for visualization purpose.
Parameters: - gcMapInputFile (str) – Name of input gcmap file.
- gcMapOutFile (str) – Name of output gcmap file.
- resolution (str) – Input resolution at which transition matrix will be calculated.
- compression (str) – Compression method in output gcmap file. Presently allowed :
lzf
for LZF compression andgzip
for GZIP compression. - percentile_threshold_no_data (int) –
It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded.
percentile_threshold_no_data
should be between 1 and 100. This options discard the rows and columns which are above this percentile. For example: if this value is 99, those row or columns will be discarded which contains larger than number of zeros (missing data) at 99 percentile.To calculate percentile, all blank rows are removed, then in all rows, number of zeros are counted. Afterwards, number of zeros at percentile_threshold_no_data percentile is obtained. In next step, if a row contain number of zeros larger than this percentile value, the whole row and column is assigned to have missing data. This percentile indicates highest numbers of zeros (missing data) in given rows/columns.
- threshold_data_occup (float) –
It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded. This ratio is (number of bins with data) / (total number of bins in the given row/column). For example: if threshold_data_occup = 0.8, then all rows containing more than 20% of missing data will be discarded.
Note that this parameter is suitable for low resolution data because maps are likely to be much less sparse.
- workDir (str) – Path to the directory where temporary intermediate files are generated.
If
None
, files are generated in the temporary directory according to the main configuration.