ccmapHelpers module

ccmapHelpers.MemoryMappedArray – Convenient wrapper for numpy memory mapped array file
ccmapHelpers.MemoryMappedArray.copy – Copy this numpy memory mapped array and generate a new one
ccmapHelpers.MemoryMappedArray.copy_from – Copy values from source MemoryMappedArray
ccmapHelpers.MemoryMappedArray.copy_to – Copy values to destination MemoryMappedArray
ccmapHelpers.get_nonzeros_index(matrix[, …]) – To get a numpy array of bool values for all rows/columns which have NO missing data
ccmapHelpers.remove_zeros(matrix[, …]) – To remove rows/columns with missing data (zero values)

gcMapExplorer.ccmapHelpers

get_nonzeros_index(matrix, threshold_percentile=None, threshold_data_occup=None, filterByDiagonal=False)

To get a numpy array of bool values for all rows/columns which have NO missing data

Parameters:
  • matrix (numpy.memmap or gcMapExplorer.lib.ccmap.CCMAP.matrix) – Input matrix
  • threshold_percentile (int) –

    Filters the map by discarding the rows/columns with the largest numbers of missing data. threshold_percentile should be between 1 and 100. Rows and columns whose number of zeros exceeds the value at this percentile are discarded. For example, if this value is 99, any row or column containing more zeros (missing data) than the zero count at the 99th percentile is discarded.

    To calculate the percentile, all blank rows are removed first, and the number of zeros is counted in each remaining row. The number of zeros at the threshold_percentile percentile is then obtained. If a row contains more zeros than this value, the whole row and the corresponding column are marked as missing data. The percentile therefore sets the highest number of zeros (missing data) allowed in a row/column.

  • threshold_data_occup (float) –

    Filters the map by discarding the rows/columns with the largest numbers of missing data. The ratio is (number of bins with data) / (total number of bins in the given row/column). For example, if threshold_data_occup = 0.8, all rows in which more than 20% of the bins are missing data are discarded.

    Note that this parameter is suitable for low-resolution data, because such maps are likely to be much less sparse.

Returns:

bData – 1D-array containing True and False values.

  • If True: row/column has data above the threshold
  • If False: row/column has no data under the threshold

Return type:

numpy.array[bool]
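
The percentile filter described above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the function name nonzero_index_by_percentile and the example matrix are hypothetical, zeros are taken to mark missing data, and a row is kept when its zero count does not exceed the value at the chosen percentile.

```python
import numpy as np

def nonzero_index_by_percentile(matrix, percentile):
    """Hypothetical helper: True for rows/columns kept by the percentile filter."""
    matrix = np.asarray(matrix)
    # Count zeros (missing data) in every row.
    zeros_per_row = (matrix == 0).sum(axis=1)
    # Blank rows (all zeros) are excluded before the percentile is computed.
    not_blank = zeros_per_row < matrix.shape[1]
    # Zero count at the requested percentile among the non-blank rows.
    cutoff = np.percentile(zeros_per_row[not_blank], percentile)
    # Keep non-blank rows whose zero count does not exceed the cutoff.
    return not_blank & (zeros_per_row <= cutoff)

# Small symmetric example; the last row/column is completely blank.
example = np.array([
    [5, 2, 3, 1, 0],
    [2, 4, 1, 0, 0],
    [3, 1, 6, 2, 0],
    [1, 0, 2, 7, 0],
    [0, 0, 0, 0, 0],
])

keep_all_non_blank = nonzero_index_by_percentile(example, 100)
keep_median = nonzero_index_by_percentile(example, 50)
```

With a percentile of 100 only the blank row is flagged; lowering the percentile flags the rows with the most zeros as well.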

remove_zeros(matrix, threshold_percentile=None, threshold_data_occup=None, workDir=None)

To remove rows/columns with missing data (zero values)

Parameters:
  • matrix (numpy.memmap or gcMapExplorer.lib.ccmap.CCMAP.matrix) – Input matrix
  • threshold_percentile (int) –

    Filters the map by discarding the rows/columns with the largest numbers of missing data. threshold_percentile should be between 1 and 100. Rows and columns whose number of zeros exceeds the value at this percentile are discarded. For example, if this value is 99, any row or column containing more zeros (missing data) than the zero count at the 99th percentile is discarded.

    To calculate the percentile, all blank rows are removed first, and the number of zeros is counted in each remaining row. The number of zeros at the threshold_percentile percentile is then obtained. If a row contains more zeros than this value, the whole row and the corresponding column are marked as missing data. The percentile therefore sets the highest number of zeros (missing data) allowed in a row/column.

  • threshold_data_occup (float) –

    Filters the map by discarding the rows/columns with the largest numbers of missing data. The ratio is (number of bins with data) / (total number of bins in the given row/column). For example, if threshold_data_occup = 0.8, all rows in which more than 20% of the bins are missing data are discarded.

    Note that this parameter is suitable for low-resolution data, because such maps are likely to be much less sparse.

  • workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory according to the OS type.
Returns:

  • A (MemoryMappedArray) – MemoryMappedArray instance containing new truncated array as memory mapped file
  • bNoData (numpy.array[bool]) – 1D-array containing True and False values.

      • If True: row/column has no data under the threshold
      • If False: row/column has data above the threshold
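
The data-occupancy ratio and the truncation step can be illustrated with an in-memory NumPy sketch. It is only an approximation of the idea: remove_low_occupancy is a hypothetical name, the result is a plain array rather than a MemoryMappedArray, and the sketch assumes a square matrix in which zeros mark missing data.

```python
import numpy as np

def remove_low_occupancy(matrix, threshold_data_occup):
    """Hypothetical helper: drop rows/columns whose data occupancy is too low."""
    matrix = np.asarray(matrix)
    # Occupancy = (number of bins with data) / (total number of bins in the row).
    occupancy = np.count_nonzero(matrix, axis=1) / matrix.shape[1]
    # True where the row/column is treated as missing data.
    b_no_data = occupancy < threshold_data_occup
    keep = ~b_no_data
    # Remove the flagged rows and the matching columns together.
    truncated = matrix[np.ix_(keep, keep)]
    return truncated, b_no_data

example = np.array([
    [5, 2, 3, 1, 0],
    [2, 4, 1, 0, 0],
    [3, 1, 6, 2, 0],
    [1, 0, 2, 7, 0],
    [0, 0, 0, 0, 0],
])

# Rows 0 and 2 have occupancy 0.8; rows 1 and 3 have 0.6; row 4 is blank.
truncated, b_no_data = remove_low_occupancy(example, 0.7)
```

Returning the mask alongside the truncated matrix keeps enough information to place results back into the original coordinate system later.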

MemoryMappedArray class

class MemoryMappedArray

Convenient wrapper for numpy memory mapped array file

For more details, see the NumPy memmap documentation.

path2matrix

str – Path to numpy memory mapped array file

arr

numpy.memmap – Pointer to memory mapped numpy array

workDir

str – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory according to the OS type.

dtype

str – Data type of array

Parameters:
  • shape (tuple) – Shape of array
  • fill (int or float, optional) – Fill array with this value
  • dtype (str) – Data type of array
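
To show what the shape, fill and dtype parameters correspond to at the NumPy level, here is a minimal sketch that builds a filled memory-mapped array directly with numpy.memmap. The helper create_filled_memmap and the file name are hypothetical; this does not reproduce the MemoryMappedArray class itself.

```python
import os
import shutil
import tempfile

import numpy as np

def create_filled_memmap(path, shape, fill=None, dtype='float64'):
    """Hypothetical helper: create a disk-backed array and optionally fill it."""
    # mode='w+' creates (or overwrites) the backing file with the given shape.
    arr = np.memmap(path, dtype=dtype, mode='w+', shape=shape)
    if fill is not None:
        arr[:] = fill
    # Push pending changes to the file on disk.
    arr.flush()
    return arr

# Temporary working directory, analogous in spirit to workDir.
work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, 'example.dat')

arr = create_filled_memmap(path, shape=(4, 4), fill=1.5, dtype='float32')
total = float(arr.sum())            # 16 elements, each 1.5
file_size = os.path.getsize(path)   # 16 elements * 4 bytes per float32

# Release the mapping before removing the temporary directory.
del arr
shutil.rmtree(work_dir)
```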
copy

Copy this numpy memory mapped array and generate a new one

Returns: out – A new MemoryMappedArray instance with copied arrays
Return type: MemoryMappedArray
copy_from

Copy values from source MemoryMappedArray

Parameters: src (MemoryMappedArray) – Source memory mapped array for new values
Return type: None
Raises: ValueError – if src is not a MemoryMappedArray instance
copy_to

Copy values to destination MemoryMappedArray

Parameters: dest (MemoryMappedArray) – Destination memory mapped array
Return type: None
Raises: ValueError – if dest is not a MemoryMappedArray instance
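
Copying values between two memory-mapped arrays, with a type check that raises ValueError, can be sketched on raw numpy.memmap objects. The function copy_values is a hypothetical stand-in for the idea behind copy_from and copy_to, not their actual implementation.

```python
import os
import shutil
import tempfile

import numpy as np

def copy_values(src, dest):
    """Hypothetical helper: copy all values from one memmap into another."""
    if not isinstance(src, np.memmap) or not isinstance(dest, np.memmap):
        raise ValueError('src and dest must both be numpy.memmap instances')
    # Element-wise copy into the existing destination buffer.
    np.copyto(dest, src)
    dest.flush()

work_dir = tempfile.mkdtemp()
src = np.memmap(os.path.join(work_dir, 'src.dat'),
                dtype='float64', mode='w+', shape=(3, 3))
dest = np.memmap(os.path.join(work_dir, 'dest.dat'),
                 dtype='float64', mode='w+', shape=(3, 3))

src[:] = np.arange(9, dtype='float64').reshape(3, 3)
copy_values(src, dest)
values_match = bool(np.array_equal(src, dest))

# Passing anything other than a memmap is rejected.
try:
    copy_values([[1.0]], dest)
    raised = False
except ValueError:
    raised = True

# Release the mappings before removing the temporary directory.
del src, dest
shutil.rmtree(work_dir)
```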

KnightRuizNorm class

class KnightRuizNorm

A modified Knight-Ruiz algorithm for matrix balancing

The original ported Knight-Ruiz algorithm is modified to perform the normalization using either memory (RAM) or disk. This allows normalization of maps ranging from small Hi-C maps to huge maps that cannot be accommodated in RAM.

Parameters:
  • A (numpy.ndarray or MemoryMappedArray) –

    Input matrix.

    Note

    • Matrix should not contain any row or column with all zero values (missing data for row/column). This type of matrix can be obtained from remove_zeros().
    • If memory='HDD', A should be MemoryMappedArray
  • memory (str) –

    Accepted keywords are RAM and HDD:

    • RAM: All intermediate arrays are generated in memory (RAM). This version is faster; however, the amount of RAM required depends on the input matrix size.
    • HDD: All intermediate arrays are generated as memory mapped array files on hard-disk.
  • workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory according to the OS type.
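
To illustrate what matrix balancing means, and why rows or columns of all zeros must be removed first, the sketch below uses the simpler Sinkhorn-Knopp iteration as a stand-in. It is not the Knight-Ruiz algorithm and not this class's implementation; the function name and the example matrix are hypothetical.

```python
import numpy as np

def sinkhorn_knopp(a, tol=1e-10, max_iter=1000):
    """Balance a strictly positive matrix so that rows and columns sum to 1."""
    a = np.asarray(a, dtype=np.float64)
    r = np.ones(a.shape[0])
    c = np.ones(a.shape[1])
    balanced = a.copy()
    for _ in range(max_iter):
        # An all-zero row or column would make these denominators zero,
        # which is why such rows/columns have to be removed beforehand.
        r = 1.0 / (a @ c)
        c = 1.0 / (a.T @ r)
        balanced = r[:, None] * a * c[None, :]
        # Column sums are 1 by construction; stop once row sums agree too.
        if np.abs(balanced.sum(axis=1) - 1.0).max() < tol:
            break
    return balanced

contact = np.array([
    [4.0, 1.0, 2.0],
    [1.0, 3.0, 1.0],
    [2.0, 1.0, 5.0],
])

balanced = sinkhorn_knopp(contact)
row_sums = balanced.sum(axis=1)
col_sums = balanced.sum(axis=0)
```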
run

Perform Knight-Ruiz normalization

Parameters:
  • A (numpy.ndarray or MemoryMappedArray.arr) –

    Input matrix.

    Note

    • Matrix should not contain any row or column with all zero values (missing data for row/column). This type of matrix can be obtained from remove_zeros().

    Warning

    If A was a MemoryMappedArray in KnightRuizNorm, A here should be MemoryMappedArray.arr instead of the MemoryMappedArray itself.

  • fl (int) – Must be zero.
  • OutMatrix (gcMapExplorer.lib.ccmap.CCMAP.matrix) – Output matrix of Hi-C map to which normalized matrix is returned.
  • bNoData (numpy.ndarray[bool]) – Boolean numpy array showing whether rows/columns have missing data. It can be obtained from remove_zeros().
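
One plausible reading of why run takes both OutMatrix and bNoData is that the normalized, truncated matrix has to be placed back into a full-size matrix at the positions that had data. The sketch below shows that mapping with NumPy only. It is an assumption about the intent, not the documented behaviour, and expand_to_full is a hypothetical name.

```python
import numpy as np

def expand_to_full(normalized, b_no_data, fill_value=0.0):
    """Hypothetical helper: scatter a truncated matrix back to full size."""
    b_no_data = np.asarray(b_no_data, dtype=bool)
    keep = ~b_no_data
    size = b_no_data.size
    # Rows/columns flagged as missing data keep the fill value.
    full = np.full((size, size), fill_value, dtype=np.float64)
    # Write the truncated values at the rows/columns that had data.
    full[np.ix_(keep, keep)] = normalized
    return full

normalized = np.array([[1.0, 2.0],
                       [3.0, 4.0]])
b_no_data = np.array([False, True, False])

full = expand_to_full(normalized, b_no_data)
```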