statDist module

This module contains functions and methods for calculating stationary distribution by assuming markov-chain.

This module contains functions to calculate probability transition matrix and subsequently to calculate stationary distribution.

Example

Below is a simple example to calculate stationary distribution. It consists of three major steps:

  • Calculate median-subtracted normalized matrix
  • Calculate probability transition matrix
  • Calculate stationary distribution
# median-subtracted normalized matrix, stype should be 'o-e', Other arguments can be changed.
gmlib.normalizer.normalizeGCMapByMCFS('input_raw.gcmap', 'mcfs_O-E.gcmap', stats='median', stype='o-e')

# Probability transition matrix, calculate matrix at '40kb' resolution.
gmlib.statDist.transitionProbabilityMatrixForGCMap('mcfs_O-E.gcmap', 'prob_mat.gcmap', '40kb')

# Stationary distribution at '40kb' resolution
gmlib.statDist.statDistrByEigenDecompForGCMap('out_prob_mat.gcmap', 'stat_dist.h5', '40kb')

Summary

statDist.calculateTransitionProbabilityMatrix(A) Core function to calculate transition probability matrix.
statDist.transitionProbabilityMatrixForCCMap(ccMap) To calculate transition probability matrix.
statDist.transitionProbabilityMatrixForGCMap(…) To calculate transition matrices using a gcmap file.
statDist.statDistrByEigenDecompForCCMap(ccMap) Calculate stationary distribution from a ccmap file or object.
statDist.statDistrByEigenDecompForGCMap(…) Calculate stationary distribution using transition matrices from gcmap file for given resolution.
statDist.stationaryDistributionByEigenDecomp(…) To calculate stationary distribution from probability transition matrix.

Documentation

MatrixPowerRAM(M, n)

Raise a square matrix to the (integer) power n.

Note

Ported from numpy/matrixlib/defmatrix.py. This function was rewritten to speed-up the calculation. In place of numpy dot function, a mdot function is used, which directly uses blas dgemm function.

For positive integers n, the power is computed by repeated matrix squaring and matrix multiplications. If n == 0, the identity matrix of the same shape as M is returned. If n < 0, the inverse is computed and then raised to the abs(n).

Parameters:
  • M (ndarray or matrix object) – Matrix to be “powered.” Must be square, i.e. M.shape == (m, m), with m a positive integer.
  • n (int) – The exponent can be any integer or long integer, positive, negative, or zero.
Returns:

M**n – The return value is the same shape and type as M; if the exponent is positive or zero then the type of the elements is the same as those of M. If the exponent is negative the elements are floating-point.

Return type:

ndarray or matrix object

Raises:

LinAlgError – If the matrix is not numerically invertible.

calculateTransitionProbabilityMatrix(A, Out=None, bNoData=None)

Core function to calculate transition probability matrix.

It is a core function to calculate transition probability matrix.

Parameters:
  • A (numpy.memmap or gcMapExplorer.lib.ccmap.CCMAP.matrix or numpy.ndarray) – Input map or matrix
  • Out (numpy.memmap or gcMapExplorer.lib.ccmap.CCMAP.matrix or numpy.ndarray) – Ouput transition matrix. In case if it is None, ouput matrix will be returned.
  • bNoData (numpy.array[bool]) – 1D-array containing True and False values. Its size should be equal to input array row/column size. Row/Column with False value will be considered during the calculation.
Returns:

Out – Ouput transition matrix. In case if Out is passed, None will be returned.

Return type:

numpy.ndarray or None.

statDistrByEigenDecompForCCMap(ccMap, chrom=None, hdf5Handle=None, compression='lzf', workDir=None)

Calculate stationary distribution from a ccmap file or object.

It uses eigendecomposition method to calculate stationary distribution from transition matrix.

Parameters:
  • ccMap (gcMapExplorer.lib.ccmap.CCMAP or ccmap file) – A CCMAP object containing observed contact frequency or a ccmap file
  • chrom (str) – Name of chromosome – necessary as used in output HDF5 file.
  • hdf5Handle (gcMapExplorer.lib.genomicsDataHandler.HDF5Handler) – If it is provided, stationary distribution will be directly added to this file. In case of None, stationary distribution will be returned as a 1D array.
  • compression (str) – Compression method in output HDF5 file. Presently allowed : lzf for LZF compression and gzip for GZIP compression.
  • workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory according to the main configuration.
Returns:

sdist – If hdf5Handle is provided None is returned otherwise stationary distribution as 1D array is returned.

Return type:

numpy.1darray or None

statDistrByEigenDecompForGCMap(gcMapInputFile, outFile, resolution, overwrite=False, compression='lzf', workDir=None)

Calculate stationary distribution using transition matrices from gcmap file for given resolution.

It uses eigendecomposition method to calculate stationary distribution from transition matrix.

Note

Use same input resolution as used during calculation of transition matrix using gcmap file.

Parameters:
  • gcMapInputFile (str) – Name of input gcmap file.
  • outFile (str) – Name of output HDF5 file to store calculated stationary distribution.
  • resolution (str) – Input resolution at which transition matrix was calculated. And stationary distribution will be calculated.
  • compression (str) – Compression method in output HDF5 file. Presently allowed : lzf for LZF compression and gzip for GZIP compression.
  • workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory according to the main configuration.
stationaryDistributionByEigenDecomp(prob_matrix, bNoData)

To calculate stationary distribution from probability transition matrix.

Stationary distribution is calculated using probability transition matrix with eigendecomposition.

Parameters:
  • prob_matrix (numpy.memmap or gcMapExplorer.lib.ccmap.CCMAP.matrix or numpy.ndarray) – Input probability transition matrix
  • bNoData (numpy.array[bool]) – 1D-array containing True and False values. Its size should be equal to input matrix row/column size. Row/Column with False value will be considered during the calculation. Row/Column with True value will not be considered during calculation and in these locations, minimum stationary distribution will be filled in the output.
Returns:

sdist – Ouput stationary distribution as 1D numpy array.

Return type:

numpy.ndarray

transitionProbabilityMatrixForCCMap(ccMap, outFile=None, percentile_threshold_no_data=None, threshold_data_occup=None, workDir=None)

To calculate transition probability matrix.

This method can be used to calculate transition probability matrix. This is similar to markov-chain transition matrix.

Note:

This transition matrix is not symmetric, because each row represents stochastic row vector, which contains contact probability of this bin with every other bins and sum of row is always equal to one. See here: Markov-chain example.

Parameters:
  • ccMap (gcMapExplorer.lib.ccmap.CCMAP or ccmap file) – A CCMAP object containing observed contact frequency or a ccmap file
  • outFile (str) – Name of output ccmap file, to save directly the map as a ccmap file. In case of this option, None will return.
  • percentile_threshold_no_data (int) –

    It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded. percentile_threshold_no_data should be between 1 and 100. This options discard the rows and columns which are above this percentile. For example: if this value is 99, those row or columns will be discarded which contains larger than number of zeros (missing data) at 99 percentile.

    To calculate percentile, all blank rows are removed, then in all rows, number of zeros are counted. Afterwards, number of zeros at percentile_threshold_no_data percentile is obtained. In next step, if a row contain number of zeros larger than this percentile value, the whole row and column is assigned to have missing data. This percentile indicates highest numbers of zeros (missing data) in given rows/columns.

  • threshold_data_occup (float) –

    It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded. This ratio is (number of bins with data) / (total number of bins in the given row/column). For example: if threshold_data_occup = 0.8, then all rows containing more than 20% of missing data will be discarded.

    Note that this parameter is suitable for low resolution data because maps are likely to be much less sparse.

Returns:

normCCMap – Transition matrix. When outFile is provided, None is returned. In case of any other error, None is returned.

Return type:

gcMapExplorer.lib.ccmap.CCMAP or None

transitionProbabilityMatrixForGCMap(gcMapInputFile, gcMapOutFile, resolution, compression='lzf', percentile_threshold_no_data=None, threshold_data_occup=None, workDir=None, logHandler=None)

To calculate transition matrices using a gcmap file.

It can be used to calculate transition matrices (markov-chain) for all maps present in a gcmap file for given resolution.

Note

Matrices will be calculated for only input resolution. For coarser resolutions, data will be downsampled, and therefore in output gcmap, only matrices corresponding to input resolution is correct. Other coarsed resolutionsm matrices are only for visualization purpose.

Parameters:
  • gcMapInputFile (str) – Name of input gcmap file.
  • gcMapOutFile (str) – Name of output gcmap file.
  • resolution (str) – Input resolution at which transition matrix will be calculated.
  • compression (str) – Compression method in output gcmap file. Presently allowed : lzf for LZF compression and gzip for GZIP compression.
  • percentile_threshold_no_data (int) –

    It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded. percentile_threshold_no_data should be between 1 and 100. This options discard the rows and columns which are above this percentile. For example: if this value is 99, those row or columns will be discarded which contains larger than number of zeros (missing data) at 99 percentile.

    To calculate percentile, all blank rows are removed, then in all rows, number of zeros are counted. Afterwards, number of zeros at percentile_threshold_no_data percentile is obtained. In next step, if a row contain number of zeros larger than this percentile value, the whole row and column is assigned to have missing data. This percentile indicates highest numbers of zeros (missing data) in given rows/columns.

  • threshold_data_occup (float) –

    It can be used to filter the map, where rows/columns with largest numbers of missing data can be discarded. This ratio is (number of bins with data) / (total number of bins in the given row/column). For example: if threshold_data_occup = 0.8, then all rows containing more than 20% of missing data will be discarded.

    Note that this parameter is suitable for low resolution data because maps are likely to be much less sparse.

  • workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory according to the main configuration.