statDist module¶

This module contains functions and methods for calculating stationary distribution by assuming markov-chain.

This module contains functions to calculate probability transition matrix and subsequently to calculate stationary distribution.

Example¶

Below is a simple example to calculate stationary distribution. It consists of three major steps:

Calculate median-subtracted normalized matrix

Calculate probability transition matrix

Calculate stationary distribution

# median-subtracted normalized matrix, stype should be 'o-e', Other arguments can be changed.
gmlib.normalizer.normalizeGCMapByMCFS('input_raw.gcmap', 'mcfs_O-E.gcmap', stats='median', stype='o-e')

# Probability transition matrix, calculate matrix at '40kb' resolution.
gmlib.statDist.transitionProbabilityMatrixForGCMap('mcfs_O-E.gcmap', 'prob_mat.gcmap', '40kb')

# Stationary distribution at '40kb' resolution
gmlib.statDist.statDistrByEigenDecompForGCMap('out_prob_mat.gcmap', 'stat_dist.h5', '40kb')

Summary¶

`statDist.calculateTransitionProbabilityMatrix`(A)	Core function to calculate transition probability matrix.
`statDist.transitionProbabilityMatrixForCCMap`(ccMap)	To calculate transition probability matrix.
`statDist.transitionProbabilityMatrixForGCMap`(…)	To calculate transition matrices using a gcmap file.
`statDist.statDistrByEigenDecompForCCMap`(ccMap)	Calculate stationary distribution from a ccmap file or object.
`statDist.statDistrByEigenDecompForGCMap`(…)	Calculate stationary distribution using transition matrices from gcmap file for given resolution.
`statDist.stationaryDistributionByEigenDecomp`(…)	To calculate stationary distribution from probability transition matrix.

Documentation¶

MatrixPowerRAM(M, n)¶

Raise a square matrix to the (integer) power n.

Note

Ported from numpy/matrixlib/defmatrix.py. This function was rewritten to speed-up the calculation. In place of numpy dot function, a mdot function is used, which directly uses blas dgemm function.

For positive integers n, the power is computed by repeated matrix squaring and matrix multiplications. If n == 0, the identity matrix of the same shape as M is returned. If n < 0, the inverse is computed and then raised to the abs(n).

Parameters:	M (ndarray or matrix object) – Matrix to be “powered.” Must be square, i.e. `M.shape == (m, m)`, with m a positive integer. n (int) – The exponent can be any integer or long integer, positive, negative, or zero.
Returns:	Mn** – The return value is the same shape and type as M; if the exponent is positive or zero then the type of the elements is the same as those of M. If the exponent is negative the elements are floating-point.
Return type:	ndarray or matrix object
Raises:	`LinAlgError` – If the matrix is not numerically invertible.

calculateTransitionProbabilityMatrix(A, Out=None, bNoData=None)¶

Core function to calculate transition probability matrix.

It is a core function to calculate transition probability matrix.

Parameters:	A (numpy.memmap or `gcMapExplorer.lib.ccmap.CCMAP.matrix` or numpy.ndarray) – Input map or matrix Out (numpy.memmap or `gcMapExplorer.lib.ccmap.CCMAP.matrix` or numpy.ndarray) – Ouput transition matrix. In case if it is `None`, ouput matrix will be returned. bNoData (numpy.array[bool]) – 1D-array containing `True` and `False` values. Its size should be equal to input array row/column size. Row/Column with `False` value will be considered during the calculation.
Returns:	Out – Ouput transition matrix. In case if `Out` is passed, `None` will be returned.
Return type:	numpy.ndarray or `None`.

statDistrByEigenDecompForCCMap(ccMap, chrom=None, hdf5Handle=None, compression='lzf', workDir=None)¶

Calculate stationary distribution from a ccmap file or object.

It uses eigendecomposition method to calculate stationary distribution from transition matrix.

Parameters:	ccMap (`gcMapExplorer.lib.ccmap.CCMAP` or ccmap file) – A CCMAP object containing observed contact frequency or a ccmap file chrom (str) – Name of chromosome – necessary as used in output HDF5 file. hdf5Handle (`gcMapExplorer.lib.genomicsDataHandler.HDF5Handler`) – If it is provided, stationary distribution will be directly added to this file. In case of `None`, stationary distribution will be returned as a 1D array. compression (str) – Compression method in output HDF5 file. Presently allowed : `lzf` for LZF compression and `gzip` for GZIP compression. workDir (str) – Path to the directory where temporary intermediate files are generated. If `None`, files are generated in the temporary directory according to the main configuration.
Returns:	sdist – If `hdf5Handle` is provided `None` is returned otherwise stationary distribution as 1D array is returned.
Return type:	numpy.1darray or `None`

statDistrByEigenDecompForGCMap(gcMapInputFile, outFile, resolution, overwrite=False, compression='lzf', workDir=None)¶

Calculate stationary distribution using transition matrices from gcmap file for given resolution.

It uses eigendecomposition method to calculate stationary distribution from transition matrix.

Note

Use same input resolution as used during calculation of transition matrix using gcmap file.

Parameters:

gcMapInputFile (str) – Name of input gcmap file.
outFile (str) – Name of output HDF5 file to store calculated stationary distribution.
resolution (str) – Input resolution at which transition matrix was calculated. And stationary distribution will be calculated.
compression (str) – Compression method in output HDF5 file. Presently allowed : lzf for LZF compression and gzip for GZIP compression.
workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory according to the main configuration.

stationaryDistributionByEigenDecomp(prob_matrix, bNoData)¶

To calculate stationary distribution from probability transition matrix.

Stationary distribution is calculated using probability transition matrix with eigendecomposition.

Parameters:	prob_matrix (numpy.memmap or `gcMapExplorer.lib.ccmap.CCMAP.matrix` or numpy.ndarray) – Input probability transition matrix bNoData (numpy.array[bool]) – 1D-array containing `True` and `False` values. Its size should be equal to input matrix row/column size. Row/Column with `False` value will be considered during the calculation. Row/Column with `True` value will not be considered during calculation and in these locations, minimum stationary distribution will be filled in the output.
Returns:	sdist – Ouput stationary distribution as 1D numpy array.
Return type:	numpy.ndarray

transitionProbabilityMatrixForCCMap(ccMap, outFile=None, percentile_threshold_no_data=None, threshold_data_occup=None, workDir=None)¶

To calculate transition probability matrix.

This method can be used to calculate transition probability matrix. This is similar to markov-chain transition matrix.

Note:	This transition matrix is not symmetric, because each row represents stochastic row vector, which contains contact probability of this bin with every other bins and sum of row is always equal to one. See here: Markov-chain example.
Parameters:	ccMap (`gcMapExplorer.lib.ccmap.CCMAP` or ccmap file) – A CCMAP object containing observed contact frequency or a ccmap file outFile (str) – Name of output ccmap file, to save directly the map as a ccmap file. In case of this option, `None` will return. percentile_threshold_no_data (int) – It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded. `percentile_threshold_no_data` should be between 1 and 100. This options discard the rows and columns which are above this percentile. For example: if this value is 99, those row or columns will be discarded which contains larger than number of zeros (missing data) at 99 percentile. To calculate percentile, all blank rows are removed, then in all rows, number of zeros are counted. Afterwards, number of zeros at percentile_threshold_no_data percentile is obtained. In next step, if a row contain number of zeros larger than this percentile value, the whole row and column is assigned to have missing data. This percentile indicates highest numbers of zeros (missing data) in given rows/columns. threshold_data_occup (float) – It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded. This ratio is (number of bins with data) / (total number of bins in the given row/column). For example: if threshold_data_occup = 0.8, then all rows containing more than 20% of missing data will be discarded. Note that this parameter is suitable for low resolution data because maps are likely to be much less sparse.
Returns:	normCCMap – Transition matrix. When `outFile` is provided, `None` is returned. In case of any other error, `None` is returned.
Return type:	`gcMapExplorer.lib.ccmap.CCMAP` or `None`

transitionProbabilityMatrixForGCMap(gcMapInputFile, gcMapOutFile, resolution, compression='lzf', percentile_threshold_no_data=None, threshold_data_occup=None, workDir=None, logHandler=None)¶

To calculate transition matrices using a gcmap file.

It can be used to calculate transition matrices (markov-chain) for all maps present in a gcmap file for given resolution.

Note

Matrices will be calculated for only input resolution. For coarser resolutions, data will be downsampled, and therefore in output gcmap, only matrices corresponding to input resolution is correct. Other coarsed resolutionsm matrices are only for visualization purpose.

Parameters:

gcMapInputFile (str) – Name of input gcmap file.
gcMapOutFile (str) – Name of output gcmap file.
resolution (str) – Input resolution at which transition matrix will be calculated.
compression (str) – Compression method in output gcmap file. Presently allowed : lzf for LZF compression and gzip for GZIP compression.
percentile_threshold_no_data (int) –
It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded. percentile_threshold_no_data should be between 1 and 100. This options discard the rows and columns which are above this percentile. For example: if this value is 99, those row or columns will be discarded which contains larger than number of zeros (missing data) at 99 percentile.

To calculate percentile, all blank rows are removed, then in all rows, number of zeros are counted. Afterwards, number of zeros at percentile_threshold_no_data percentile is obtained. In next step, if a row contain number of zeros larger than this percentile value, the whole row and column is assigned to have missing data. This percentile indicates highest numbers of zeros (missing data) in given rows/columns.
threshold_data_occup (float) –
It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded. This ratio is (number of bins with data) / (total number of bins in the given row/column). For example: if threshold_data_occup = 0.8, then all rows containing more than 20% of missing data will be discarded.

Note that this parameter is suitable for low resolution data because maps are likely to be much less sparse.
workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory according to the main configuration.