normalizer module
normalizer.NormalizeKnightRuizOriginal(ccMapObj)
Original Knight-Ruiz algorithm for matrix balancing
normalizer.normalizeCCMapByKR(ccMap[, ...])
Normalize a ccmap using Knight-Ruiz matrix balancing method.
normalizer.normalizeGCMapByKR(...[, ...])
Normalize a gcmap using Knight-Ruiz matrix balancing method.
normalizer.normalizeCCMapByIC(ccMap[, tol, ...])
Normalize a ccmap by Iterative Correction method
normalizer.normalizeGCMapByIC(...[, vmin, ...])
Normalize a gcmap using Iterative Correction.
normalizer.normalizeCCMapByMCFS(ccMap[, ...])
Scale ccmap using Median Contact Frequency
normalizer.normalizeGCMapByMCFS(...[, ...])
Scale all maps in gcmap using Median Contact Frequency
normalizer.normalizeCCMapByVCNorm(ccMap[, ...])
Normalize ccmap using Vanilla-Coverage method
normalizer.normalizeGCMapByVCNorm(...[, ...])
Normalize all maps using Vanilla-Coverage method

NormalizeKnightRuizOriginal
(ccMapObj, tol=1e-12, x0=None, delta=0.1, Delta=3, fl=0)
Original Knight-Ruiz algorithm for matrix balancing
 Ported from a MATLAB script given in the supporting information of the following paper:
 P.A. Knight and D. Ruiz (2013). A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis, 33, 1029-1047.
 The matrix must be symmetric and non-negative.
 For an input matrix A, this function finds a vector X such that diag(X)*A*diag(X) is close to doubly stochastic.
Warning
 This is the original ported code, kept here for comparison and testing.
 Do not use it: for a large matrix it may end up consuming all available memory.
Parameters:
 ccMapObj (gcMapExplorer.lib.ccmap.CCMAP) – A CCMAP object containing observed contact frequency.
Returns: normCCMap – Normalized contact map.
Return type: CCMAP
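To illustrate what balancing means, the sketch below searches for a vector X such that diag(X)*A*diag(X) has unit row and column sums. It is not the Knight-Ruiz algorithm: it assumes a much simpler damped fixed-point iteration on plain Python lists, and the names balance_symmetric and scaled_row_sums are hypothetical.

```python
def scaled_row_sums(a, x):
    """Row sums of diag(x) * a * diag(x) for a square matrix a."""
    n = len(a)
    return [x[i] * sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]


def balance_symmetric(a, tol=1e-10, max_iter=10000):
    """Return x such that diag(x) * a * diag(x) is close to doubly stochastic.

    Assumes a is symmetric with strictly positive row sums.
    """
    x = [1.0] * len(a)
    for _ in range(max_iter):
        sums = scaled_row_sums(a, x)
        if max(abs(s - 1.0) for s in sums) < tol:
            break
        # Damped update: divide each x_i by the square root of its row sum.
        x = [xi / s ** 0.5 for xi, s in zip(x, sums)]
    return x


A = [[2.0, 1.0, 1.0],
     [1.0, 3.0, 1.0],
     [1.0, 1.0, 4.0]]
X = balance_symmetric(A)
```

Because A is symmetric, equal row sums imply equal column sums, so checking scaled_row_sums(A, X) is enough to confirm the balance.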

normalizeCCMapByIC
(ccMap, tol=0.0001, vmin=None, vmax=None, outFile=None, iteration=500, percentile_threshold_no_data=None, threshold_data_occup=None, workDir=None)
Normalize a ccmap by the Iterative Correction method
This method normalizes the raw contact map by removing biases from the experimental procedure. For more details, see this publication.
Parameters:
 ccMap (gcMapExplorer.lib.ccmap.CCMAP or ccmap file) – A CCMAP object containing observed contact frequency, or a ccmap file.
 tol (float) – Tolerance value. The relative increment in the results before declaring convergence.
 vmin (float) – Minimum threshold value for normalization. If contact frequency is less than or equal to this threshold value, this value is discarded during normalization.
 vmax (float) – Maximum threshold value for normalization. If contact frequency is greater than or equal to this threshold value, this value is discarded during normalization.
 outFile (str) – Name of the output ccmap file, used to save the normalized map directly as a ccmap file. When this option is used, None is returned.
 iteration (int) – Number of iterations after which the normalization stops.
 percentile_threshold_no_data (int) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The value should be between 1 and 100; rows and columns above this percentile are discarded. For example, if the value is 99, every row or column that contains more zeros (missing data) than the number of zeros at the 99th percentile is discarded.
  To calculate the percentile, all blank rows are removed and the number of zeros in each remaining row is counted. The number of zeros at the percentile_threshold_no_data percentile is then obtained. In the next step, if a row contains more zeros than this percentile value, the whole row and column are marked as missing data. This percentile value thus indicates the highest number of zeros (missing data) allowed in a row/column.
 threshold_data_occup (float) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The ratio is (number of bins with data) / (total number of bins in the given row/column). For example, if threshold_data_occup = 0.8, all rows containing more than 20% missing data are discarded.
  Note that this parameter is suitable for low-resolution data because such maps are likely to be much less sparse.
 workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory set in the main configuration.
Returns: normCCMap – Normalized contact map. When outFile is provided, None is returned. In case of any error, None is also returned.
Return type: gcMapExplorer.lib.ccmap.CCMAP or None
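The roles of tol and iteration can be seen in a minimal sketch of iterative correction. This is an assumption-laden illustration on plain Python lists, not the library's implementation: each pass divides the map by its row sums relative to their mean, and stops once the relative change drops below tol or after iteration passes. The function name iterative_correction is hypothetical.

```python
def iterative_correction(matrix, tol=1e-4, iteration=500):
    """Return (corrected map, accumulated bias) for a symmetric matrix."""
    n = len(matrix)
    work = [[float(v) for v in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(iteration):
        sums = [sum(row) for row in work]
        covered = [s for s in sums if s > 0]
        if not covered:
            break  # nothing to normalize in an empty map
        mean_sum = sum(covered) / len(covered)
        # Relative coverage of each row; empty rows are left untouched.
        step = [s / mean_sum if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                work[i][j] /= step[i] * step[j]
        bias = [b * s for b, s in zip(bias, step)]
        # Converged when no row changed by more than tol in this pass.
        if max(abs(s - 1.0) for s in step) < tol:
            break
    return work, bias


raw = [[10, 4, 2],
       [4, 8, 3],
       [2, 3, 6]]
corrected, bias = iterative_correction(raw)
```

After convergence every row of corrected has nearly the same sum, and multiplying corrected[i][j] by bias[i] * bias[j] recovers the raw value.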

normalizeCCMapByKR
(ccMap, memory='RAM', tol=1e-12, outFile=None, vmin=None, vmax=None, percentile_threshold_no_data=None, threshold_data_occup=None, workDir=None)
Normalize a ccmap using Knight-Ruiz matrix balancing method.
Note
 This function uses a modified version of the original ported code given in NormalizeKnightRuizOriginal().
 Please refer to: P.A. Knight and D. Ruiz (2013). A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis, 33, 1029-1047.
Parameters:
 ccMap (gcMapExplorer.lib.ccmap.CCMAP or ccmap file) – A CCMAP object containing observed contact frequency, or a ccmap file.
 memory (str) – Accepted keywords are RAM and HDD:
  RAM: All intermediate arrays are generated in memory (RAM). This version is faster; however, the amount of RAM it requires depends on the input matrix size.
  HDD: All intermediate arrays are generated as memory-mapped array files on the hard disk.
 tol (float) – Tolerance for matrix balancing. Smaller tolerance increases accuracy in sums of rows and columns.
 outFile (str) – Name of the output ccmap file, used to save the normalized map directly as a ccmap file. When this option is used, None is returned.
 vmin (float) – Minimum threshold value for normalization. If contact frequency is less than or equal to this threshold value, this value is discarded during normalization.
 vmax (float) – Maximum threshold value for normalization. If contact frequency is greater than or equal to this threshold value, this value is discarded during normalization.
 percentile_threshold_no_data (int) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The value should be between 1 and 100; rows and columns above this percentile are discarded. For example, if the value is 99, every row or column that contains more zeros (missing data) than the number of zeros at the 99th percentile is discarded.
  To calculate the percentile, all blank rows are removed and the number of zeros in each remaining row is counted. The number of zeros at the percentile_threshold_no_data percentile is then obtained. In the next step, if a row contains more zeros than this percentile value, the whole row and column are marked as missing data. This percentile value thus indicates the highest number of zeros (missing data) allowed in a row/column.
 threshold_data_occup (float) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The ratio is (number of bins with data) / (total number of bins in the given row/column). For example, if threshold_data_occup = 0.8, all rows containing more than 20% missing data are discarded.
  Note that this parameter is suitable for low-resolution data because such maps are likely to be much less sparse.
 workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory set in the main configuration.
Returns: ccMapObj – Normalized contact map. When outFile is provided, None is returned. In case of any error, None is also returned.
Return type: gcMapExplorer.lib.ccmap.CCMAP or None
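The percentile_threshold_no_data filter described above can be sketched as follows. This is only an illustration of the stated steps on a toy list-of-lists matrix; the nearest-rank percentile used here is an assumption (the library may interpolate differently), and the name rows_to_discard is hypothetical.

```python
import math


def rows_to_discard(matrix, percentile_threshold_no_data):
    """Return indices of rows holding more zeros than the percentile cutoff."""
    # Step 1: drop blank rows, then count zeros in every remaining row.
    zero_counts = {
        i: sum(1 for v in row if v == 0)
        for i, row in enumerate(matrix)
        if any(v != 0 for v in row)
    }
    if not zero_counts:
        return []
    # Step 2: number of zeros at the requested percentile (nearest rank).
    ordered = sorted(zero_counts.values())
    rank = max(1, math.ceil(percentile_threshold_no_data / 100 * len(ordered)))
    cutoff = ordered[rank - 1]
    # Step 3: rows with more zeros than the cutoff are treated as missing.
    return sorted(i for i, count in zero_counts.items() if count > cutoff)


toy = [[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 0],
       [1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0]]
discarded = rows_to_discard(toy, 75)
```

In the toy matrix the blank row 3 is ignored when computing the cutoff, and only row 4 exceeds the zero count found at the 75th percentile.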

normalizeCCMapByMCFS
(ccMap, stats='median', vmin=None, vmax=None, stype='o/e', outFile=None, scaleUpInput=False, percentile_threshold_no_data=None, threshold_data_occup=None, workDir=None)
Scale ccmap using Median Contact Frequency
This method can be used to normalize a contact map with expected values. These expected values can be either the median or the average contact values for a particular distance between two locations/coordinates. First, the median/average contact frequency for each distance is calculated. Subsequently, the observed contact frequency is either divided by (‘o/e’) or reduced by (‘o-e’) the median/average contact frequency obtained for the distance between the two locations.
Parameters:
 ccMap (gcMapExplorer.lib.ccmap.CCMAP or ccmap file) – A CCMAP object containing observed contact frequency, or a ccmap file.
 stats (str) – Statistics to be calculated along diagonals: It may be either “mean” or “median”. By default, it is “median”.
 vmin (float) – Minimum threshold value for normalization. If contact frequency is less than or equal to this threshold value, this value is discarded during normalization.
 vmax (float) – Maximum threshold value for normalization. If contact frequency is greater than or equal to this threshold value, this value is discarded during normalization.
 stype (str) – Type of scaling. It may be either ‘o/e’ or ‘o-e’. In case of ‘o/e’, Observed/Expected is calculated, while (Observed - Expected) is calculated for ‘o-e’.
 outFile (str) – Name of the output ccmap file, used to save the normalized map directly as a ccmap file. When this option is used, None is returned.
 scaleUpInput (bool) – Scale up the input map by multiplying it by a constant value. This constant is the precision of the minimum value multiplied by 10. The scale-up changes the minimum value to an integer value, and the whole map changes accordingly. It is beneficial when the input map contains very small values, such as those generated by KR normalization.
 percentile_threshold_no_data (int) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The value should be between 1 and 100; rows and columns above this percentile are discarded. For example, if the value is 99, every row or column that contains more zeros (missing data) than the number of zeros at the 99th percentile is discarded.
  To calculate the percentile, all blank rows are removed and the number of zeros in each remaining row is counted. The number of zeros at the percentile_threshold_no_data percentile is then obtained. In the next step, if a row contains more zeros than this percentile value, the whole row and column are marked as missing data. This percentile value thus indicates the highest number of zeros (missing data) allowed in a row/column.
 threshold_data_occup (float) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The ratio is (number of bins with data) / (total number of bins in the given row/column). For example, if threshold_data_occup = 0.8, all rows containing more than 20% missing data are discarded.
  Note that this parameter is suitable for low-resolution data because such maps are likely to be much less sparse.
 workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory set in the main configuration.
Returns: ccMapObj – Normalized contact map. When outFile is provided, None is returned. In case of any error, None is also returned.
Return type: gcMapExplorer.lib.ccmap.CCMAP or None
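The distance-based scaling described for this function can be sketched on a small symmetric matrix. This is a hedged illustration rather than the library code: it assumes that zeros mark missing data and are excluded from the expected value, and the name scale_by_distance is hypothetical.

```python
from statistics import mean, median


def scale_by_distance(matrix, stats="median", stype="o/e"):
    """Scale each contact by the median/mean contact at the same distance."""
    n = len(matrix)
    stat = median if stats == "median" else mean
    out = [[0.0] * n for _ in range(n)]
    for d in range(n):  # d is the distance |i - j|, i.e. one diagonal
        diagonal = [matrix[i][i + d] for i in range(n - d) if matrix[i][i + d] != 0]
        if not diagonal:
            continue  # no data at this distance
        expected = stat(diagonal)
        for i in range(n - d):
            observed = matrix[i][i + d]
            if observed == 0:
                continue  # assumed missing data, left as zero
            if stype == "o/e":
                value = observed / expected
            else:  # 'o-e'
                value = observed - expected
            out[i][i + d] = out[i + d][i] = value
    return out


raw = [[4, 2, 1],
       [2, 8, 3],
       [1, 3, 6]]
ratio = scale_by_distance(raw, stats="median", stype="o/e")
difference = scale_by_distance(raw, stats="median", stype="o-e")
```

With ‘o/e’ a value of 1 means the contact matches the expectation for its distance; with ‘o-e’ the same contact maps to 0.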

normalizeCCMapByVCNorm
(ccMap, sqroot=False, vmin=None, vmax=None, outFile=None, percentile_threshold_no_data=None, threshold_data_occup=None, workDir=None)
Normalize ccmap using Vanilla-Coverage method
This method was first used in Lieberman-Aiden et al., 2009 (http://dx.doi.org/10.1126/science.1181369) for inter-chromosomal maps. Later, it was used for intra-chromosomal maps by Rao et al., 2014.
Parameters:
 ccMap (gcMapExplorer.lib.ccmap.CCMAP or ccmap file) – A CCMAP object containing observed contact frequency, or a ccmap file.
 sqroot (bool) – If True, the square root of the normalized map is calculated.
 vmin (float) – Minimum threshold value for normalization. If contact frequency is less than or equal to this threshold value, this value is discarded during normalization.
 vmax (float) – Maximum threshold value for normalization. If contact frequency is greater than or equal to this threshold value, this value is discarded during normalization.
 outFile (str) – Name of the output ccmap file, used to save the normalized map directly as a ccmap file. When this option is used, None is returned.
 scaleUpInput (bool) – Scale up the input map by multiplying it by a constant value. This constant is the precision of the minimum value multiplied by 10. The scale-up changes the minimum value to an integer value, and the whole map changes accordingly. It is beneficial when the input map contains very small values, such as those generated by KR normalization.
 percentile_threshold_no_data (int) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The value should be between 1 and 100; rows and columns above this percentile are discarded. For example, if the value is 99, every row or column that contains more zeros (missing data) than the number of zeros at the 99th percentile is discarded.
  To calculate the percentile, all blank rows are removed and the number of zeros in each remaining row is counted. The number of zeros at the percentile_threshold_no_data percentile is then obtained. In the next step, if a row contains more zeros than this percentile value, the whole row and column are marked as missing data. This percentile value thus indicates the highest number of zeros (missing data) allowed in a row/column.
 threshold_data_occup (float) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The ratio is (number of bins with data) / (total number of bins in the given row/column). For example, if threshold_data_occup = 0.8, all rows containing more than 20% missing data are discarded.
  Note that this parameter is suitable for low-resolution data because such maps are likely to be much less sparse.
 workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory set in the main configuration.
Returns: ccMapObj – Normalized contact map. When outFile is provided, None is returned. In case of any error, None is also returned.
Return type: gcMapExplorer.lib.ccmap.CCMAP or None
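A minimal sketch of Vanilla-Coverage normalization follows, assuming the common formulation in which each contact is divided by the product of its row sum and column sum. The sqroot branch takes the square root of each normalized value, following the parameter description above literally; whether the library does exactly this is an assumption, and the name vanilla_coverage is hypothetical.

```python
import math


def vanilla_coverage(matrix, sqroot=False):
    """Divide every contact by (row sum * column sum); optionally take sqrt."""
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Zero contacts and empty rows/columns stay zero.
            if matrix[i][j] and row_sums[i] and col_sums[j]:
                value = matrix[i][j] / (row_sums[i] * col_sums[j])
                out[i][j] = math.sqrt(value) if sqroot else value
    return out


raw = [[2, 2],
       [2, 6]]
normalized = vanilla_coverage(raw)
normalized_sqrt = vanilla_coverage(raw, sqroot=True)
```

For the toy matrix the row and column sums are 4 and 8, so the off-diagonal contact becomes 2 / 32 without sqroot and its square root with sqroot.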

normalizeGCMapByIC
(gcMapInputFile, gcMapOutFile, vmin=None, vmax=None, tol=1e-12, iteration=500, percentile_threshold_no_data=None, threshold_data_occup=None, compression='lzf', workDir=None, logHandler=None)
Normalize a gcmap using Iterative Correction.
This method normalizes the raw contact map by removing biases from the experimental procedure. For more details, see this publication.
Parameters:
 gcMapInputFile (str) – Name of input gcmap file.
 gcMapOutFile (str) – Name of output gcmap file.
 vmin (float) – Minimum threshold value for normalization. If contact frequency is less than or equal to this threshold value, this value is discarded during normalization.
 vmax (float) – Maximum threshold value for normalization. If contact frequency is greater than or equal to this threshold value, this value is discarded during normalization.
 tol (float) – Tolerance value. The relative increment in the results before declaring convergence.
 iteration (int) – Number of iterations after which the normalization stops.
 percentile_threshold_no_data (int) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The value should be between 1 and 100; rows and columns above this percentile are discarded. For example, if the value is 99, every row or column that contains more zeros (missing data) than the number of zeros at the 99th percentile is discarded.
  To calculate the percentile, all blank rows are removed and the number of zeros in each remaining row is counted. The number of zeros at the percentile_threshold_no_data percentile is then obtained. In the next step, if a row contains more zeros than this percentile value, the whole row and column are marked as missing data. This percentile value thus indicates the highest number of zeros (missing data) allowed in a row/column.
 threshold_data_occup (float) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The ratio is (number of bins with data) / (total number of bins in the given row/column). For example, if threshold_data_occup = 0.8, all rows containing more than 20% missing data are discarded.
  Note that this parameter is suitable for low-resolution data because such maps are likely to be much less sparse.
 compression (str) – Compression method for the output gcmap file. Presently allowed: lzf for LZF compression and gzip for GZIP compression.
 workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory set in the main configuration.
Returns: Return type:
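The threshold_data_occup ratio can be illustrated with a short sketch. It assumes that a zero marks a bin without data and works on a toy list-of-lists matrix; the name rows_below_occupancy is hypothetical and the code is not taken from the library.

```python
def rows_below_occupancy(matrix, threshold_data_occup):
    """Return indices of rows whose data occupancy is below the threshold."""
    discarded = []
    for i, row in enumerate(matrix):
        # Ratio of bins with data to the total number of bins in the row.
        occupancy = sum(1 for v in row if v != 0) / len(row)
        if occupancy < threshold_data_occup:
            discarded.append(i)
    return discarded


toy = [[3, 1, 2, 5, 4],
       [1, 6, 2, 0, 3],
       [2, 2, 7, 0, 0],
       [0, 0, 0, 0, 0]]
discarded = rows_below_occupancy(toy, 0.8)
```

Row 1 has exactly 20% missing data and is kept, while rows 2 and 3 have more than 20% missing data and are discarded.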

normalizeGCMapByKR
(gcMapInputFile, gcMapOutFile, mapSizeCeilingForMemory=20000, vmin=None, vmax=None, tol=1e-12, percentile_threshold_no_data=None, threshold_data_occup=None, compression='lzf', workDir=None, logHandler=None)
Normalize a gcmap using Knight-Ruiz matrix balancing method.
Parameters:
 gcMapInputFile (str) – Name of input gcmap file.
 gcMapOutFile (str) – Name of output gcmap file.
 mapSizeCeilingForMemory (int) – Maximum size of contact map allowed for calculation using RAM. If map size or shape is larger than this value, normalization will be performed using disk (HDD).
 vmin (float) – Minimum threshold value for normalization. If contact frequency is less than or equal to this threshold value, this value is discarded during normalization.
 vmax (float) – Maximum threshold value for normalization. If contact frequency is greater than or equal to this threshold value, this value is discarded during normalization.
 tol (float) – Tolerance for matrix balancing. Smaller tolerance increases accuracy in sums of rows and columns.
 percentile_threshold_no_data (int) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The value should be between 1 and 100; rows and columns above this percentile are discarded. For example, if the value is 99, every row or column that contains more zeros (missing data) than the number of zeros at the 99th percentile is discarded.
  To calculate the percentile, all blank rows are removed and the number of zeros in each remaining row is counted. The number of zeros at the percentile_threshold_no_data percentile is then obtained. In the next step, if a row contains more zeros than this percentile value, the whole row and column are marked as missing data. This percentile value thus indicates the highest number of zeros (missing data) allowed in a row/column.
 threshold_data_occup (float) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The ratio is (number of bins with data) / (total number of bins in the given row/column). For example, if threshold_data_occup = 0.8, all rows containing more than 20% missing data are discarded.
  Note that this parameter is suitable for low-resolution data because such maps are likely to be much less sparse.
 compression (str) – Compression method for the output gcmap file. Presently allowed: lzf for LZF compression and gzip for GZIP compression.
 workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory set in the main configuration.
Returns: Return type:
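The effect of mapSizeCeilingForMemory can be pictured with the sketch below. It assumes that the ceiling is compared with the largest dimension of the map, which is only one reading of "map size or shape"; the function name choose_memory is hypothetical.

```python
def choose_memory(map_shape, mapSizeCeilingForMemory=20000):
    """Pick the memory keyword for a map of the given (rows, columns) shape."""
    # Assumption: the ceiling applies to the largest dimension of the map.
    if max(map_shape) > mapSizeCeilingForMemory:
        return "HDD"  # intermediate arrays as memory-mapped files on disk
    return "RAM"      # intermediate arrays kept in memory


small = choose_memory((5000, 5000))
large = choose_memory((25000, 25000))
```

Lowering the ceiling pushes more maps to the slower disk-backed mode, which trades speed for a smaller RAM footprint.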

normalizeGCMapByMCFS
(gcMapInputFile, gcMapOutFile, stats='median', vmin=None, vmax=None, stype='o/e', scaleUpInput=False, percentile_threshold_no_data=None, threshold_data_occup=None, compression='lzf', workDir=None, logHandler=None)
Scale all maps in gcmap using Median Contact Frequency
This method can be used to normalize a contact map with expected values. These expected values can be either the median or the average contact values for a particular distance between two locations/coordinates. First, the median/average contact frequency for each distance is calculated. Subsequently, the observed contact frequency is either divided by (‘o/e’) or reduced by (‘o-e’) the median/average contact frequency obtained for the distance between the two locations.
Parameters:
 gcMapInputFile (str) – Name of input gcmap file.
 gcMapOutFile (str) – Name of output gcmap file.
 stats (str) – Statistics to be calculated along diagonals: It may be either “mean” or “median”. By default, it is “median”.
 vmin (float) – Minimum threshold value for normalization. If contact frequency is less than or equal to this threshold value, this value is discarded during normalization.
 vmax (float) – Maximum threshold value for normalization. If contact frequency is greater than or equal to this threshold value, this value is discarded during normalization.
 stype (str) – Type of scaling. It may be either ‘o/e’ or ‘o-e’. In case of ‘o/e’, Observed/Expected is calculated, while (Observed - Expected) is calculated for ‘o-e’.
 scaleUpInput (bool) – Scale up the input map by multiplying it by a constant value. This constant is the precision of the minimum value multiplied by 10. The scale-up changes the minimum value to an integer value, and the whole map changes accordingly. It is beneficial when the input map contains very small values, such as those generated by KR normalization.
 percentile_threshold_no_data (int) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The value should be between 1 and 100; rows and columns above this percentile are discarded. For example, if the value is 99, every row or column that contains more zeros (missing data) than the number of zeros at the 99th percentile is discarded.
  To calculate the percentile, all blank rows are removed and the number of zeros in each remaining row is counted. The number of zeros at the percentile_threshold_no_data percentile is then obtained. In the next step, if a row contains more zeros than this percentile value, the whole row and column are marked as missing data. This percentile value thus indicates the highest number of zeros (missing data) allowed in a row/column.
 threshold_data_occup (float) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The ratio is (number of bins with data) / (total number of bins in the given row/column). For example, if threshold_data_occup = 0.8, all rows containing more than 20% missing data are discarded.
  Note that this parameter is suitable for low-resolution data because such maps are likely to be much less sparse.
 compression (str) – Compression method for the output gcmap file. Presently allowed: lzf for LZF compression and gzip for GZIP compression.
 workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory set in the main configuration.
Returns: Return type:
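The vmin and vmax thresholds shared by these functions can be sketched as a simple mask. The boundary behaviour (values equal to a threshold are discarded) follows the parameter descriptions; representing a discarded value as 0 is an assumption, and the names is_usable and apply_thresholds are hypothetical.

```python
def is_usable(value, vmin=None, vmax=None):
    """Return True if a contact frequency passes both thresholds."""
    if vmin is not None and value <= vmin:
        return False
    if vmax is not None and value >= vmax:
        return False
    return True


def apply_thresholds(matrix, vmin=None, vmax=None):
    """Replace discarded contact frequencies with 0 (assumed missing data)."""
    return [[v if is_usable(v, vmin, vmax) else 0 for v in row] for row in matrix]


raw = [[0, 1, 50],
       [1, 7, 3],
       [50, 3, 9]]
masked = apply_thresholds(raw, vmin=1, vmax=50)
```

With vmin=1 and vmax=50 the contacts equal to 1 and 50 are dropped, and only the values strictly between the thresholds remain.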

normalizeGCMapByVCNorm
(gcMapInputFile, gcMapOutFile, sqroot=False, vmin=None, vmax=None, percentile_threshold_no_data=None, threshold_data_occup=None, compression='lzf', workDir=None, logHandler=None)
Normalize all maps using Vanilla-Coverage method
This method was first used in Lieberman-Aiden et al., 2009 (http://dx.doi.org/10.1126/science.1181369) for inter-chromosomal maps. Later, it was used for intra-chromosomal maps by Rao et al., 2014.
Parameters:
 gcMapInputFile (str) – Name of input gcmap file.
 gcMapOutFile (str) – Name of output gcmap file.
 sqroot (bool) – If True, the square root of the normalized map is calculated.
 vmin (float) – Minimum threshold value for normalization. If contact frequency is less than or equal to this threshold value, this value is discarded during normalization.
 vmax (float) – Maximum threshold value for normalization. If contact frequency is greater than or equal to this threshold value, this value is discarded during normalization.
 scaleUpInput (bool) – Scale up the input map by multiplying it by a constant value. This constant is the precision of the minimum value multiplied by 10. The scale-up changes the minimum value to an integer value, and the whole map changes accordingly. It is beneficial when the input map contains very small values, such as those generated by KR normalization.
 percentile_threshold_no_data (int) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The value should be between 1 and 100; rows and columns above this percentile are discarded. For example, if the value is 99, every row or column that contains more zeros (missing data) than the number of zeros at the 99th percentile is discarded.
  To calculate the percentile, all blank rows are removed and the number of zeros in each remaining row is counted. The number of zeros at the percentile_threshold_no_data percentile is then obtained. In the next step, if a row contains more zeros than this percentile value, the whole row and column are marked as missing data. This percentile value thus indicates the highest number of zeros (missing data) allowed in a row/column.
 threshold_data_occup (float) – Filters the map by discarding the rows/columns with the largest numbers of missing data. The ratio is (number of bins with data) / (total number of bins in the given row/column). For example, if threshold_data_occup = 0.8, all rows containing more than 20% missing data are discarded.
  Note that this parameter is suitable for low-resolution data because such maps are likely to be much less sparse.
 compression (str) – Compression method for the output gcmap file. Presently allowed: lzf for LZF compression and gzip for GZIP compression.
 workDir (str) – Path to the directory where temporary intermediate files are generated. If None, files are generated in the temporary directory set in the main configuration.
Returns: Return type:
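A possible way to call this function is sketched below. The keyword names come from the signature above; the import path gcMapExplorer.lib.normalizer is an assumption based on the module name, the wrapper normalize_all_maps and the option values are hypothetical, and the import is deferred so that the sketch loads even where the package is not installed.

```python
# Keyword arguments taken from the documented signature; values are examples.
VC_OPTIONS = {
    "sqroot": False,
    "vmin": None,
    "vmax": None,
    "percentile_threshold_no_data": 99,
    "threshold_data_occup": None,
    "compression": "lzf",
    "workDir": None,
}


def normalize_all_maps(gcmap_in, gcmap_out, options=VC_OPTIONS):
    """Run Vanilla-Coverage normalization on every map in a gcmap file."""
    # Assumed import path, resolved only when the function is called.
    from gcMapExplorer.lib import normalizer

    normalizer.normalizeGCMapByVCNorm(gcmap_in, gcmap_out, **options)
```

A call such as normalize_all_maps("input.gcmap", "normalized.gcmap") would then write the normalized maps to the output file; both file names are placeholders.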