Using masked array with Hi-C map

Large Hi-C map data contains missing values. During analysis, sometimes we need to ignore these values. numpy.ma module can be used to perform mathematical operations after masking these missing values.

See also

Module numpy.ma


At first, we import modules:

In [1]:
import gcMapExplorer.ccmap as cmp
import numpy as np
import os
from numpy import ma
from scipy import stats

import matplotlib.pyplot as plt

# To show inline plots
%matplotlib inline
plt.style.use('ggplot')              # Theme for plotting

To calculate correlation between Hi-C maps

Load two Hi-C data

In [2]:
ccmapOne = cmp.load_ccmap('output/CooMatrix/normalized/chr15_100kb_normKR.ccmap', workDir=os.getcwd() )
ccmapOne.make_readable()

ccmapTwo = cmp.load_ccmap('output/CooMatrix/normalized/chr20_100kb_normKR.ccmap', workDir=os.getcwd())
ccmapTwo.make_readable()

Determine smallest shape

Matrix size can be different, therefore smallest size is determined to calculate element-wise correlation.

In [3]:
if ccmapOne.shape[0] <= ccmapTwo.shape[0]:
    smallest_shape = ccmapOne.shape[0]
else:
    smallest_shape = ccmapTwo.shape[0]

Generate masks

  • At first, generate mask from two matrices sepratealy, and then combine it.
  • Also, use slicing operations to derive subset of matrix with smallest shape.
  • During masking, lower-triangular part is also masked with diagonal offset of five. Because the values at Hi-C map diagnoal is usually large, correlation could be high due to these large values. Moreover, we are usually intereseted in off-diagonal regions of the Hi-C maps.
In [4]:
m1 = ccmapOne.matrix[:smallest_shape, :smallest_shape] <= ccmapOne.minvalue   # Mask all minimum values
m2 = ccmapTwo.matrix[:smallest_shape, :smallest_shape] <= ccmapTwo.minvalue   # Mask all minimum values
mask = ( m1 | m2 )                              # Combine both masks
mask[np.tril_indices_from(mask, k=5)] = True    # Also, Mask lower-triangle of matrix with five diagonal offset

Warning

For huge matrices, above operations could be memory/RAM consuming and script may crash.

Generate masked matrix

Now, using numpy.ma module, generate masked matrices

In [5]:
maskedMatrixOne = ma.array(ccmapOne.matrix[:smallest_shape, :smallest_shape], mask=mask)
maksedMatrixTwo = ma.array(ccmapTwo.matrix[:smallest_shape, :smallest_shape], mask=mask)

Calculate correlation coefficiant

In [6]:
corr, pvalue = stats.pearsonr(maskedMatrixOne.compressed(), maksedMatrixTwo.compressed())

print('Pearson Correlation: ', corr)

corr, pvalue = stats.spearmanr(maskedMatrixOne.compressed(), maksedMatrixTwo.compressed())

print('Spearman Correlation: ', corr)
Pearson Correlation:  0.50507
Spearman Correlation:  0.358783199081

Crorrelation using gcMapExplorer.cmstats module

Shown above is a simple step-by-step example to calculate correlation-coefficiant using masked array. However, above method may fail in case of huge matrices. Therefore, use the implemented function gcMapExplorer.cmstats.correlateCMaps function

See also

  • About gcMapExplorer.cmstats.correlateCMaps() in more details
In [7]:
from  gcMapExplorer import cmstats

corr, pvalue = cmstats.correlateCMaps(ccmapOne, ccmapTwo, diagonal_offset=5)

print('Pearson Correlation: ', corr)

corr, pvalue = cmstats.correlateCMaps(ccmapOne, ccmapTwo, corrType='spearman', diagonal_offset=5)
print('Spearman Correlation: ', corr)
Pearson Correlation:  0.50507
Spearman Correlation:  0.358783199081

Both above shown step-by-step example and implemented functions yielded similar correlation values.

Block-wise Crorrelation

To identify local difference between two maps, block-wise coorelation could be more helpful. A block is created and slided along the diagonal, and for each new position, correlation is calculated.

In [8]:
# Pearson correlation
pearson, p_centers = cmstats.correlateCMaps(ccmapOne, ccmapTwo, diagonal_offset=2, blockSize='1mb',
                                             slideStepSize=1, outFile='pearson.txt', workDir=os.getcwd())


# Spearman correlation
spearman, s_centers = cmstats.correlateCMaps(ccmapOne, ccmapTwo, corrType='spearman', diagonal_offset=2,
                                                 blockSize='1mb', slideStepSize=1, outFile='spearman.txt',
                                                 workDir=os.getcwd())
INFO:correlateCMaps: Block-wise correlation with [1mb] block-size
INFO:correlateCMaps: Number of Blocks: 63
INFO:correlateCMaps: Size of each Block in bins: 10
INFO:correlateCMaps: Number of Overlapping bins between sliding blocks:  9
INFO:correlateCMaps: Block-wise correlation with [1mb] block-size
INFO:correlateCMaps: Number of Blocks: 63
INFO:correlateCMaps: Size of each Block in bins: 10
INFO:correlateCMaps: Number of Overlapping bins between sliding blocks:  9
In [9]:
# Plot the values for visual representations
fig = plt.figure(figsize=(8,4))                               # Figure size

ax = fig.add_subplot(1,1,1)                                   # Axes first plot
ax.set_title('Correlation')                                   # Title first plot

ax.plot(p_centers, pearson, label='Pearson')                  # Plot for Pearson correlation
ax.plot(s_centers, spearman, label='Spearman')                # Plot for Spearman correlation

ax.get_xaxis().get_major_formatter().set_useOffset(False)     # Prevent ticks auto-formatting

plt.legend(loc=2)

plt.show()
../_images/modules_examples_masked_array_ccmap_21_0.png
In [10]:
path='/home/rajendra/workspace/genome_3d_organization/GM12878_CellLine/HiCmapDataPyObject/resolution_10kb/'
ccmapsOne, ccmapsTwo = [], []
for i in range(1, 23):
    ccmapsOne.append(path+'Chr{0}_NormKR.hicmap'.format(i))
for i in range(1, 23):
    ccmapsTwo.append(path+'Chr{0}_Rawobserved.hicmap'.format(i))

for i in range(len(ccmapsOne)):
    ccmapOne = cmp.load_ccmap(ccmapsOne[i], workDir=os.getcwd())
    ccmapOne.make_readable()

    ccmapTwo = cmp.load_ccmap(ccmapsTwo[i], workDir=os.getcwd())
    ccmapTwo.make_readable()

    print(cmstats.correlateCMaps(ccmapOne, ccmapTwo, corrType='spearman', diagonal_offset=5))
    del ccmapOne
    del ccmapTwo
(0.74487690967318498, 0.0)
(0.74063722421215261, 0.0)
(0.75637101395441264, 0.0)
(0.75799064378987224, 0.0)
(0.76385777207989702, 0.0)
(0.7581134503870115, 0.0)
(0.77085117905282485, 0.0)
(0.78712169997619941, 0.0)
(0.75681100168481319, 0.0)
(0.79607602070672256, 0.0)
(0.79340494862540256, 0.0)
(0.79611875779812202, 0.0)
(0.78246725829489128, 0.0)
(0.81078979072455271, 0.0)
(0.82796003802636897, 0.0)
(0.81646908774859528, 0.0)
(0.8513992946693909, 0.0)
(0.83321467022045526, 0.0)
(0.89497615144311282, 0.0)
(0.89713225129243968, 0.0)
(0.86860067840860633, 0.0)
(0.87945995663009324, 0.0)