Manual CNV analysis#

Objective

This vignette describes how copy number variation (CNV) can
be identified using Mosaic. Data filtering, normalization and
ploidy computation are described.

import missionbio.mosaic as ms

sample = ms.load_example_dataset('3 cell mix')
Loading, <_io.BytesIO object at 0x7fed89a167c0>
Loaded in 0.4s.

This is an analyzed h5 file. Hence the clones and clusters
are already labeled. It contains three cell lines KG-1,
Tom-1, and Jurkat. It also contains doublets of each pair
as seen in the heatmap.

For your h5 file, identify the clones using the DNA variants
before proceeding with this vignette.

fig = sample.dna.heatmap('NGT')
fig

Filtering amplicons#

We will only be working with the DNA and CNV assays.

The CNV object contains 138 amplicons. But not all
of them worked as expected. We will drop amplicons
which worked in less than half the total cells.

# This returns the reads in each cell and amplicon

reads = sample.cnv.get_attribute('read_counts', constraint='row+col')
reads
AMLV2_AMP1 AMLV2_AMP2 AMLV2_AMP3 AMLV2_AMP4 AMLV2_AMP5 AMLV2_AMP6 AMLV2_AMP7 AMLV2_AMP8 AMLV2_AMP9 AMLV2_AMP10 ... chr10_106721610 chr10_5554293 chr10_77210191 chr14_56969005 chr16_55770629 chr16_8569820 chr18_9750662 chr6_17076840 chr6_40116264 chr6_62094287
read_counts
AACAACCTAAACTTGTCG 139 161 105 219 92 157 152 127 74 178 ... 478 169 184 226 201 123 332 127 304 218
AACAACTGGTACGTTGGA 82 64 66 87 9 105 74 53 43 94 ... 255 132 100 134 124 31 155 126 173 102
AACAATGCAAGACCACGC 110 130 54 131 24 47 70 28 52 99 ... 419 32 37 181 139 98 170 322 306 102
AACAATGCACTGACGGAT 124 125 93 163 25 148 164 93 65 177 ... 495 58 22 124 233 78 195 197 307 214
AACAATGCAGAGCCATCC 50 59 31 39 14 51 49 16 29 82 ... 122 18 71 63 51 42 151 134 114 118
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TTGTATCACGCTGTTCGG 38 17 16 5 3 67 22 11 12 27 ... 75 28 63 14 69 7 80 99 72 29
TTGTCAACCGCTTGTCCT 104 71 48 82 0 61 64 30 17 60 ... 187 25 127 86 96 48 74 134 153 104
TTGTCAACCTACAACACC 162 88 22 39 14 128 44 22 36 71 ... 379 53 132 122 235 125 294 153 345 82
TTGTCAACCTAGTAACGG 140 107 42 157 52 149 89 46 34 115 ... 189 96 142 177 153 120 265 207 274 89
TTGTTAGAGATCAGGATG 60 57 39 76 13 32 69 27 35 57 ... 267 10 87 55 104 82 95 111 151 96

2476 rows × 135 columns

# Only amplicons found in more than half the cells are analyzed
# The other amplicons are dropped.

working_amplicons = (reads.median() > 0).values
sample.cnv = sample.cnv[:, working_amplicons]
# We are now left with 135 amplicons

sample.cnv.shape
(2476, 135)

Normalization and ploidy computation#

Finding the ploidy of the cells can be split into
two distinct calculations.

  • Normalization

    This step is required to correct for systemic artefacts.
    Cell to cell, and amplicon to amplicon variations are corrected.
    Each cell is normalized to it’s total reads and each amplicon
    to its median counts. This is performed using the normalize_reads
    method, which adds the normalized_counts layer to the CNV object.

  • Correcting the baseline

    Not all amplicons show copy loss or copy gain. In the normalization
    step all amplicons in a cell were normalized using the same value
    i.e. the total reads in the cell. This results in slight deviations
    from the true ploidy value of the cell for each amplicon.

    To correct for this, a clone which is known to be diploid is identified
    and all the other cells are normalized to the counts of this clone. This
    step is done using the compute_ploidy method, and providing it
    with the barcodes known to be diploid. This adds the ploidy layer
    to the CNV object.

# Reads are normalized to correct for systemic artefacts

sample.cnv.normalize_reads()
# Assuming TOM-1 cells are diploid for all the amplicons,
# we can compute the ploidy of the other cell lines as follows

sample.cnv.compute_ploidy(diploid_cells=sample.dna.barcodes('TOM-1'))

Visualizations#

The ploidy values are stored in the ploidy layer

These can be visualized using a heatmap or a ploidy line plot for each clone.

# Assign the DNA labels to the CNV object
# We want to ensure the labels are the same

sample.cnv.set_labels(sample.dna.get_labels())
sample.cnv.set_palette(sample.dna.get_palette())

Heatmap#

# Heatmap with the features ordered by the chromosome position

fig = sample.cnv.heatmap('ploidy', features='positions')
fig
# Heatmap for a subset of the chromosomes

fig = sample.cnv.heatmap('ploidy', features=['4', '7', '20'])
fig
# Heatmap with the features grouped by the genes
# The first time this runs, it fetches the gene names from ensembl
# The annotation can also be fetched using using sample.cnv.get_gene_names()

fig = sample.cnv.heatmap('ploidy', features='genes')
fig
# Heatmap for a subset of the genes

fig = sample.cnv.heatmap('ploidy', features=['EZH2', 'TET2', 'TP53'])
fig

Line plot#

As seen in the heatmap, KG-1 has a copy loss for some
of the amplicons in the panel. TOM-1 does not show much
deviation from ploidy 2.

This can also be visualized by the per amplicon ploidy plot.
This plot is generated for each clone separately.

# KG-1 shows copy loss at two different locations

fig = sample.cnv.plot_ploidy('KG-1')
fig
# KG-1 shows copy loss at two different locations

fig = sample.cnv.plot_ploidy('KG-1')
fig
# Jurkat is diploid for all the amplicons

fig = sample.cnv.plot_ploidy('Jurkat')
fig
# Since TOM-1 was used to set the baseline,
# it is perfectly diploid for all amplicons

fig = sample.cnv.plot_ploidy('TOM-1')
fig