How to access Hi-C map from .ccmap file?

.ccmap is a text file and associated *.npbin or *.npbin.gz is memory mapped matrix file.

Note

Following example needs *.ccmap file, generated in previous tutorial.

See also

About *.ccmap and *.npbin (compressed *.npbin.gz) files here.


At first, we import modules:

In [1]:
import gcMapExplorer.ccmap as cmp
import numpy as np
import matplotlib.pyplot as plt

# To show inline plots
%matplotlib inline
plt.style.use('ggplot')              # Theme for plotting

Load a .ccmap file

In [2]:
ccmap = cmp.load_ccmap('output/CooMatrix/normalized/chr15_100kb_normKR.ccmap')

Print some properties of Hi-C data

In [3]:
print('shape: ', ccmap.shape)             # Shape of matrix along X and Y axis
print('Minimum value: ', ccmap.minvalue)  # Maximum value in Hi-C data
print('Maximum value: ', ccmap.maxvalue)  # Minimum value in Hi.C data
print('data-type: ', ccmap.dtype)         #  Data type for memory mapped matrix file
print('path to matrix file:', ccmap.path2matrix)
shape:  (1026, 1026)
Minimum value:  4.66654819319956e-06
Maximum value:  0.8729197978973389
data-type:  float32
path to matrix file: /tmp/npBinary__e5gpj90.tmp

Reading *.npbin file

In [4]:
ccmap.make_readable()         # npbin file is now readable

Now, Hi-C matrix is available as ccmap.matrix.


Overview of Hi-C matrix

In [5]:
print(ccmap.matrix)
[[  4.66654819e-06   4.66654819e-06   4.66654819e-06 ...,   4.66654819e-06
    4.66654819e-06   4.66654819e-06]
 [  4.66654819e-06   4.66654819e-06   4.66654819e-06 ...,   4.66654819e-06
    4.66654819e-06   4.66654819e-06]
 [  4.66654819e-06   4.66654819e-06   4.66654819e-06 ...,   4.66654819e-06
    4.66654819e-06   4.66654819e-06]
 ...,
 [  4.66654819e-06   4.66654819e-06   4.66654819e-06 ...,   2.78881878e-01
    1.35136917e-01   7.62707740e-02]
 [  4.66654819e-06   4.66654819e-06   4.66654819e-06 ...,   1.35136917e-01
    4.56013411e-01   4.66654819e-06]
 [  4.66654819e-06   4.66654819e-06   4.66654819e-06 ...,   7.62707740e-02
    4.66654819e-06   4.66654819e-06]]

Using numpy module

  • We can use numpy module to compare properties from .ccmap file and from .npbin file.
  • To find maximum and minimum of matrix, numpy functions amin and amax can be used.
In [6]:
print('shape: ', ccmap.shape, ccmap.matrix.shape)                # Shape of matrix along X and Y axis
print('Minimum value: ', ccmap.minvalue, np.amin(ccmap.matrix))  # Minimum value in Hi-C data using numpy.amin
print('Maximum value: ', ccmap.maxvalue, np.amax(ccmap.matrix))  # Maximum value in Hi.C data using numpy.amax

shape:  (1026, 1026) (1026, 1026)
Minimum value:  4.66654819319956e-06 4.66654819319956e-06
Maximum value:  0.8729197978973389 0.8729197978973389

Remove rows/columns with missing data

In [7]:
bData = ~ccmap.bNoData                          # Stores whther rows/columns has missing data
new_matrix = (ccmap.matrix[bData,:])[:,bData]   # Getting new matrix after removing row/columns of missing data
index_bData = np.nonzero(bData)[0]               # Getting indices of original matrix after removing missing data

print('Original shape: ', ccmap.matrix.shape)   # Shape of original matrix
print('New shape: ', new_matrix.shape)           # Shape of new matrix
Original shape:  (1026, 1026)
New shape:  (822, 822)

Warning

Above operation cannot be performed on raw Hi-C map because CCMAP.bNoData is only avaialable after normalization.


Whether matrix is balanced?

If matrix is balanced, sum of rows and coloumns will be always one. Sum of rows and columns can be easily calculated using numpy.sum function.

In [8]:
r_sum = np.sum(new_matrix, axis = 0)             # Sum along row using numpy.sum
c_sum = np.sum(new_matrix, axis = 1)             # Sum along column using numpy.sum

# Plot the values for visual representations
fig = plt.figure(figsize=(14,3))                               # Figure size

ax1 = fig.add_subplot(1,2,1)                                   # Axes first plot
ax1.set_title('Sum along row')                                 # Title first plot
ax1.get_yaxis().get_major_formatter().set_useOffset(False)     # Prevent ticks auto-formatting

ax2 = fig.add_subplot(1,2,2)                                   # Axes second plot
ax2.set_title('Sum along column')
ax2.get_yaxis().get_major_formatter().set_useOffset(False)

ax1.plot(r_sum)                                                # Plot in first axes
ax2.plot(c_sum)                                                # Plot in second axes

plt.show()
../_images/modules_examples_access_ccmap_data_20_0.png

As can be seen in the above plot, sum of rows/columns are approximately one. It means that the matrix is balanced.


Using more numpy modules

Lets plot average and median of each rows using numpy.mean and numpy.median.

In [9]:
averages = np.mean(new_matrix, axis = 1)            # Calculating mean using numpy.mean
medians = np.median(new_matrix, axis = 0)           # Calculating median using numpy.median

# Plot the values for visual representations
fig = plt.figure(figsize=(14,3))                               # Figure size

ax1 = fig.add_subplot(1,2,1)                                   # Axes first plot
ax1.set_title('Average values')                                 # Title first plot
ax1.get_yaxis().get_major_formatter().set_useOffset(False)     # Prevent ticks auto-formatting

ax2 = fig.add_subplot(1,2,2)                                   # Axes second plot
ax2.set_title('Median values')
ax2.get_yaxis().get_major_formatter().set_useOffset(False)

# in below both plots, x-axis is index from original matrix to preserve original location
ax1.plot(index_bData, averages)   # Plot in first axes
ax2.plot(index_bData, medians)    # Plot in second axes

plt.show()



../_images/modules_examples_access_ccmap_data_23_0.png