Manual CNV analysis#

Objective

This vignette describes how copy number variation (CNV) can
be identified using Mosaic. Data filtering, normalization and
ploidy computation are described.

import missionbio.mosaic as ms

sample = ms.load_example_dataset('3 cell mix')

This is an analyzed h5 file. Hence the clones and clusters
are already labeled. It contains three cell lines KG-1,
Tom-1, and Jurkat. It also contains doublets of each pair
as seen in the heatmap.

For your h5 file, identify the clones using the DNA variants
before proceeding with this vignette.

fig = sample.dna.heatmap('NGT')
fig.show("jpg")
../_images/f4d3dbb500c30e2daa1dbbe06fc473a0da804557b55834aa7f2d04c7b315d7ac.jpg

Filtering amplicons#

We will only be working with the DNA and CNV assays.

The CNV object contains 138 amplicons. But not all
of them worked as expected. We will drop amplicons
which worked in less than half the total cells.

# Remove amplicons with missing information.

# completeness : float [0, 100]
#     The minimum percentage of barcodes for an amplicon that must have
#     reads greater than or equal to `read_depth` to keep that amplicon.
# read_depth : int
#     The minimum required depth for each amplicon-barcode to not be
#     considered as missing during the completness calculation.

amplicons = sample.cnv.filter_amplicons(
    completeness=50,  # 50% of the cells
    read_depth=0,  # Should have greater than 0 reads
)

sample.cnv = sample.cnv[:, amplicons]
# We are now left with 135 amplicons

sample.cnv.shape
(2476, 135)

Normalization and ploidy computation#

Finding the ploidy of the cells can be split into
two distinct calculations.

  • Normalization

    This step is required to correct for systemic artefacts.
    Cell to cell, and amplicon to amplicon variations are corrected.
    Each cell is normalized to it’s total reads and each amplicon
    to its median counts. This is performed using the normalize_reads
    method, which adds the normalized_counts layer to the CNV object.

  • Correcting the baseline

    Not all amplicons show copy loss or copy gain. In the normalization
    step all amplicons in a cell were normalized using the same value
    i.e. the total reads in the cell. This results in slight deviations
    from the true ploidy value of the cell for each amplicon.

    To correct for this, a clone which is known to be diploid is identified
    and all the other cells are normalized to the counts of this clone. This
    step is done using the compute_ploidy method, and providing it
    with the barcodes known to be diploid. This adds the ploidy layer
    to the CNV object.

# Reads are normalized to correct for systemic artefacts

sample.cnv.normalize_reads()
# Assuming TOM-1 cells are diploid for all the amplicons,
# we can compute the ploidy of the other cell lines as follows

sample.cnv.compute_ploidy(diploid_cells=sample.dna.barcodes('TOM-1'))

Visualizations#

The ploidy values are stored in the ploidy layer

These can be visualized using a heatmap or a ploidy line plot for each clone.

# Assign the DNA labels to the CNV object
# We want to ensure the labels are the same

sample.cnv.set_labels(sample.dna)

Heatmap#

# Heatmap with the features ordered by the chromosome position

fig = sample.cnv.heatmap('ploidy', features='positions')
fig.show("jpg")
../_images/9a0b4b504d2385916e87d3f313339fbf9a4e4ef72bdd2834981403bd165c2572.jpg
# Heatmap for a subset of the chromosomes

fig = sample.cnv.heatmap('ploidy', features=['4', '7', '20'])
fig.show("jpg")
../_images/c91674e676079fee74865607e79ab2bb32f8434fac42c42acae75987dbbd658c.jpg
# Heatmap with the features grouped by the genes
# The first time this runs, it fetches the gene names from ensembl
# The annotation can also be fetched using using sample.cnv.get_gene_names()

fig = sample.cnv.heatmap('ploidy', features='genes')
fig.show("jpg")
../_images/f2e9612a355351d28979231853790684f2c559b2f28085befa8e72423abd0f22.jpg
# Heatmap for a subset of the genes

fig = sample.cnv.heatmap('ploidy', features=['EZH2', 'TET2', 'TP53'])
fig.show("jpg")
../_images/43db90f9c13565c874d8855d0daa0d07568cf4f7497b7710c4694bf085dd79c2.jpg

Line plot#

As seen in the heatmap, KG-1 has a copy loss for some
of the amplicons in the panel. TOM-1 does not show much
deviation from ploidy 2.

This can also be visualized by the per amplicon ploidy plot.
This plot is generated for each clone separately.

# KG-1 shows copy loss at two different locations

fig = sample.cnv.plot_ploidy('KG-1')
fig.show("jpg")
../_images/94481f1acdd732d55fc0c1c878c24385970f2b0b21c660eb249e958188a8dab9.jpg
# KG-1 shows copy loss at two different locations

fig = sample.cnv.plot_ploidy('KG-1')
fig.show("jpg")
../_images/94481f1acdd732d55fc0c1c878c24385970f2b0b21c660eb249e958188a8dab9.jpg
# Jurkat is diploid for all the amplicons

fig = sample.cnv.plot_ploidy('Jurkat')
fig.show("jpg")
../_images/aaa365c35397fbd23cb02c60c7395c704edf85911cbd91b7d13317b77f3cdb90.jpg
# Since TOM-1 was used to set the baseline,
# it is perfectly diploid for all amplicons

fig = sample.cnv.plot_ploidy('TOM-1')
fig.show("jpg")
../_images/76894f7e8b35ae7ce5271f38013980d34677a6a29c4246cb8ed964d9247fa61b.jpg