Single Sample DNA-only notebook#

This notebook is set up to work with single-sample h5 files, generated with v2 or v3 chemistry. This notebook will only work with Mosaic versions 3.0.1 and higher.

Objective: To showcase the minimum number of steps required for tertiary analysis of DNA (single-cell genotyping and CNV) and explore different ways of visualizing the data.

Major questions answered:

  1. Can we identify DNA clones based on genotypes (SNVs/Indels)?

  2. Do we detect CNV events (e.g., copy number amplification, copy number loss)?

Notebook outline:


  1. Setup

  2. Data Structure

  3. DNA Analysis

  4. CNV Analysis

  5. Combined Visualizations

  6. Export and Save Data

  7. Appendix

Not shown: all available methods and options, which are covered in the Mosaic documentation.


Setup#

Topics covered

  1. Loading required packages and data.

  2. Structure and contents of data objects.

Load data
Note: importing dependencies can sometimes take a couple of minutes.

# Import mosaic libraries
import missionbio.mosaic as ms

# Import these to display entire dataframes
from IPython.display import display, HTML

# Import graph_objects from the plotly package to display figures when saving the notebook as an HTML
import plotly.express as px  # 'px' is the conventional alias for plotly.express
import plotly.graph_objects as go

# Import additional packages for specific visuals
import matplotlib.pyplot as plt
import plotly.offline as pyo
import numpy as np

# Import COMPASS for imputation
from missionbio.mosaic.algorithms.compass import COMPASS

# Note: when exporting the notebook as an HTML, plots that use the "go.Figure(fig)" command are saved
# This code is optional, but will make the notebook cells/figures display across the entire width of your browser
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Check version; this notebook is designed for Mosaic v3.0.1 or higher
print(ms.__version__)

# Any function's parameters and default values can be looked up via the 'help' function
# Here, the function is 'ms.load'
help(ms.load)

# Specify the h5 file to be used in this analysis: h5path = '/path/to/h5/file/test.h5'
# If working with Windows, you may need to add an 'r' before the path: h5path = r'/path/to/h5/file/test.h5'
h5path = '/Users/botero/Desktop/Sample data/4-cell-lines-AML-multiomics/4-cell-lines-AML-CNV.dna.h5'

# Load the data
sample = ms.load(h5path, raw=False, filter_variants=True, single=True, whitelist=[], filter_cells=False)

# Always set raw=False; if raw=True, ALL barcodes will be loaded (rather than cell-associated barcodes)
# Always set filter_variants=True unless you can't detect an expected (target) variant. Additional filtering options are included in the DNA section below
# The single=True option loads multi-sample h5 files as a single sample object (compatible with this notebook)
# The whitelist option loads any variant that is in the vcf.gz file (e.g. "chr1:179520511:C/G"); similar to whitelist feature in Tapestri Insights v2
# Always set filter_cells=False; if filter_cells=True, only genomically complete cells will be loaded 

Data Structure#

DNA, CNV, and Protein are sub-classes of the Assay class. The information is stored in four categories, and the user can modify each of those:

1. metadata (add_metadata / del_metadata):

  • dictionary containing metrics of the assay

2. row_attrs (add_row_attr / del_row_attr):

  • dictionary which contains ‘barcode’ as one of the keys (where the value is a list of all barcodes)

  • for all other keys, the values must be of the same length, i.e. match the number of barcodes

  • this is the attribute where ‘label’, ‘pca’, and ‘umap’ values are added

3. col_attrs (add_col_attr / del_col_attr):

  • dictionary which contains ‘id’ as one of the keys (where the value is a list of all ids)

  • for DNA assays, ‘ids’ are variants; for Protein assays, ‘ids’ are antibodies

  • for all other keys, the values must be of the same length, i.e. match the number of ids

4. layers (add_layer / del_layer):

  • dictionary which contains critically important assay metrics

  • all the values have the shape (num barcodes) x (num ids)

  • for DNA assays, this includes AF, GQ, DP, etc. (per cell, per variant)

  • for Protein assays, this includes read counts (per cell, per antibody)

  • this is the attribute where ‘normalized_counts’ will be added
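
The shape rules above can be illustrated with a toy stand-in. This is a minimal sketch using plain Python dictionaries, not the Mosaic Assay class; the names `toy_assay` and `check_shapes` are hypothetical and exist only for this illustration.

```python
# Toy stand-in for an assay. NOT the Mosaic Assay class, only its shape rules.
barcodes = ["cell-1", "cell-2", "cell-3"]
ids = ["chr1:100:A/T", "chr2:200:G/C"]

toy_assay = {
    "metadata": {"sample_name": "toy"},
    # Every row attribute has one value per barcode
    "row_attrs": {"barcode": barcodes, "label": ["A", "A", "B"]},
    # Every column attribute has one value per id
    "col_attrs": {"id": ids, "CHROM": ["chr1", "chr2"]},
    # Every layer has the shape (num barcodes) x (num ids)
    "layers": {"AF": [[0.0, 50.0], [100.0, 48.0], [2.0, 0.0]]},
}


def check_shapes(assay):
    """Return True if all attributes and layers match the barcode/id counts."""
    n_rows = len(assay["row_attrs"]["barcode"])
    n_cols = len(assay["col_attrs"]["id"])
    rows_ok = all(len(v) == n_rows for v in assay["row_attrs"].values())
    cols_ok = all(len(v) == n_cols for v in assay["col_attrs"].values())
    layers_ok = all(
        len(layer) == n_rows and all(len(row) == n_cols for row in layer)
        for layer in assay["layers"].values()
    )
    return rows_ok and cols_ok and layers_ok


print(check_shapes(toy_assay))
```

A row attribute with the wrong number of values (for example, a single label for three barcodes) makes the check fail, which mirrors the length requirement described above.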

# Summary of DNA assay 
print("\'sample.dna\':", sample.dna, '\n')
print("\'row_attrs\':", "\n\t", list(sample.dna.row_attrs.keys()), '\n')
print("\'col_attrs\':", "\n\t", list(sample.dna.col_attrs.keys()), '\n')
print("\'layers\':", "\n\t", list(sample.dna.layers.keys()), '\n')
print("\'metadata\':", "\n")
for i in list(sample.dna.metadata.keys()):
    print('\t', i, ': ', sample.dna.metadata[i], sep='')

# For DNA, ids are variants
# sample.dna.ids() is a shortcut for sample.dna.col_attrs['id']
print(sample.dna.ids())

DNA Analysis#

Topics covered

  1. Standard filtering of DNA variants.

  2. Subsetting dataset for variants of interest, including whitelisted variants.

  3. Addition of annotations to the variants.

  4. Manual variant selection and clone identification.

  5. OPTIONAL: Use Compass to impute missing data and infer a clonal phylogeny.

  6. Visualizations and customization options.

Basic filtering#

There are many options for filtering DNA variants. Use the help() function to understand the approach listed below.

# For additional information, run: help(sample.dna.filter_variants)
# Filter the variants, similar to the "Advanced Filters" in Tapestri Insights v2.2
# Variant allele frequency is filtered with 3 zygosity-specific filters: vaf_ref, vaf_hom, vaf_het
# In general, these additional filters remove additional false positives from the data.

# Adjust filters if needed by overwriting dna_vars
# Only the three VAF filters are listed here, at their default values; all other filters keep their defaults
dna_vars = sample.dna.filter_variants(vaf_ref=5, vaf_hom=95, vaf_het=30)

# Check the number of filtered variants. When using the default filters, the number of 
# variants is likely smaller compared to the originally loaded variants due to the more 
# stringent filtering criteria (e.g., vaf_ref=5, vaf_hom=95, vaf_het=30).
print('Number of variants:', len(dna_vars))
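
To make the zygosity-specific thresholds more concrete, here is a minimal sketch of how such filters could act on a single genotype call. The genotype codes (0 = reference, 1 = heterozygous, 2 = homozygous, 3 = missing) and the exact comparison rules are assumptions made for illustration; `apply_vaf_filters` is a hypothetical helper, not a Mosaic function, and the behavior of `filter_variants` itself should be checked with help().

```python
MISSING = 3  # assumed genotype code for a missing call


def apply_vaf_filters(ngt, vaf, vaf_ref=5, vaf_hom=95, vaf_het=30):
    """Return the genotype, or MISSING if the VAF contradicts the call.

    Assumed rules (illustrative only):
      - reference calls (0) should have VAF <= vaf_ref
      - heterozygous calls (1) should have VAF >= vaf_het
      - homozygous calls (2) should have VAF >= vaf_hom
    """
    if ngt == 0 and vaf > vaf_ref:
        return MISSING
    if ngt == 1 and vaf < vaf_het:
        return MISSING
    if ngt == 2 and vaf < vaf_hom:
        return MISSING
    return ngt


# (genotype, VAF in percent) pairs for one variant across six cells
calls = [(0, 2.0), (0, 12.0), (1, 45.0), (1, 20.0), (2, 99.0), (2, 80.0)]
filtered = [apply_vaf_filters(ngt, vaf) for ngt, vaf in calls]
print(filtered)  # [0, 3, 1, 3, 2, 3]
```

Under these assumptions, every second call is masked because its allele frequency is inconsistent with its genotype, which is how stricter thresholds can reduce false positives.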

First, specify variants of interest using one of the three options below:

  1. Use the filtered variants from the above section (dna_vars)

  2. Use specific variants of interest (whitelist)

  3. Combine 1 & 2: filtered variants plus whitelist

Then, this list (final_vars) is used to reduce the larger data set to only your variants of interest.

# Specify the whitelist; variants may be copy/pasted from Tapestri Insights v2.2,
# but ensure correct nomenclature, e.g., white_list = ["chr13:28589657:T/G","chrX:39921424:G/A"]
# If there are no whitelist variants, you can leave the list empty

white_list = []

# Combine whitelisted and filtered variants
final_vars = list(set(list(dna_vars) + white_list))

# Want to use your whitelist only?
# final_vars = white_list

# Check the length of your final list of variants
print(len(final_vars))

# Dimensionality of the original sample.dna dataframe
# First number = number of cells (rows); second number = number of variants (columns)
print(sample.dna.shape)

# Before subsetting, verify that all the chosen variants are in the current sample.dna ids (should return True)
print(set(final_vars).issubset(set(sample.dna.ids())))

# Subsetting sample.dna (columns) based on reduced variant list. Keeping all cells that passed filtering
sample.dna = sample.dna[:, final_vars]

# Check the shape of the final filtered DNA object, i.e. (number of barcodes/cells, number of ids/variants)
print(sample.dna.shape)

Annotation Addition#

# Fetch annotations using varsome
# Note: run this on a filtered DNA sample - too many variants (e.g., 100+) are not handled correctly by the method
ann = sample.dna.get_annotations()  

Variant selection and Subclone identification#

In this section of the notebook, all variants remaining in the data populate a variant table. This table is interactive: variants can be selected, and rows can be sorted by ascending or descending values. Clicking a variant name opens that variant's Varsome URL in your default browser.

  1. Variants selected in this table will populate in a subclone table below.

  2. The variants in the subclone table can be highlighted and assessed for Read Depth, Genotype Quality and Variant Allele Frequency.

  3. Subclones can be renamed by clicking on the pencil icon.

  4. ADO score: The ADO score threshold can be adjusted, but by default it is set to 1. Any clones with an ADO score higher than this threshold will be moved into the ADO subclones column. We recommend moving any clones with a score of 0.8 or higher into this column.

  5. Min clone size: The Min Clone Size can also be adjusted, but by default it is set to 1 (percent). Any clones that represent less than this percentage of the sample (1% by default) will be moved into the Small Subclones column.

  6. Cells with missing genotype information across any of the selected variants will be moved into the missing GT column.
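
The Min clone size rule can be sketched in a few lines of plain Python. This is an illustration of the thresholding idea only, not the workflow's implementation; `split_small_clones` and the label 'small' are hypothetical names chosen for this example.

```python
from collections import Counter


def split_small_clones(labels, min_clone_size=1.0):
    """Relabel clones below min_clone_size percent of all cells as 'small'."""
    counts = Counter(labels)
    total = len(labels)
    small = {
        clone for clone, n in counts.items()
        if 100.0 * n / total < min_clone_size
    }
    return ["small" if lab in small else lab for lab in labels]


# 200 cells: Clone 3 holds a single cell (0.5% of the sample)
labels = ["Clone 1"] * 150 + ["Clone 2"] * 49 + ["Clone 3"] * 1
relabeled = split_small_clones(labels, min_clone_size=1.0)
print(Counter(relabeled))
```

With the default threshold of 1%, only Clone 3 falls below the cutoff; raising `min_clone_size` to 25 would also move Clone 2 (24.5% of cells) into the small group.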

# Run the variant table workflow to select variants and begin clone identification
wfv = ms.workflows.VariantSubcloneTable(sample)
# Subsample the Variants to only the variants selected from the workflow
variants = wfv.selected_variants
sample.dna = sample.dna[:,variants]

# Save the full set of cells to a new variable that you can call on later
# Do this before renaming the variants
full_sample = sample[:]
# Rename your variants
# Any of these column values can be added to the id names
# Add annotation to the id names
sample.dna.set_ids_from_cols(['annotated Gene', 'id'])

# Another example:
# sample.dna.set_ids_from_cols(['annotated Gene', 'CHROM', 'POS', 'REF', 'ALT'])

# Annotations are now added to the variants

# Use sample.dna.reset_ids() to get the original ids
# Plot heatmap using NGT_FILTERED.
fig = sample.dna.heatmap(attribute='NGT_FILTERED')
# Clone removal
# Remove barcodes from the missing GT, small subclones, ADO subclones or FP labels
clones = ['missing', 'small', 'ADO', 'FP']
for c in clones:
    cells = sample.dna.barcodes({c})
    if len(cells) > 0:
        sample.dna = sample.dna.drop(cells)
# Redraw heatmap
fig = sample.dna.heatmap(attribute='NGT_FILTERED')
# You can use the following lines of code to add a title, rotate the tick labels,
# and change the color for heatmaps in the DNA, CNV or protein sections
fig = sample.dna.heatmap(attribute='NGT_FILTERED')
fig.update_layout(title='Heatmap title')  # Placeholder title; replace with your own
fig.update_xaxes(tickangle=-45)
fig.layout.coloraxis.colorscale = 'viridis'

# Evaluate new total number of cells after the above filtering
print(sample.dna.shape)

Adjusting subclone colors or heatmap colors#

If you want to change the color palette used for the subclones, you can use set_palette. For this you will need to provide a list of ALL subclone names, pointing to the hexcode/color you want to change that subclone to. Note: this function also works with the protein and CNV assays. See below for an example of this.

If you want to change the color palette used for any of the assays/layers, you can use ms.Config.Colorscale. Then you can list the assay and layer, and assign the plotly colorscale you would like to use for those graphics. You can visualize all available colorscales by using plotly.colors.sequential.swatches_continuous().show(). If you want to reset all color palettes back to their defaults, you can use ms.Config.Colorscale.reset().

# Example of changing the colors assigned to each clone
sample.dna.set_palette({'Clone 1': '#800080', 'Clone 2': '#FF69B4'})
# You can change colorScale to change the heatmap colors
# This will change the color of the dna.ngt layer to be viridis
ms.Config.Colorscale.Dna.NGT = 'viridis'
fig = sample.dna.heatmap('NGT')

# If you want to reset this, you can run the following to reset just that one modification
# Or you can run this to reset any/all modifications:
ms.Config.Colorscale.reset()
# You can change the colorscale for Dna, Cnv or Protein

If you ever want to return to the original population of cells, you can reset the data using sample.reset("dna"). The sample.reset command works on all assays, including CNV and protein.

CNV Analysis#

Topics covered

  1. Amplicon filtering and ploidy estimation.

  2. Visualization of ploidy across the subclones present.

# Get gene names for amplicons

CNV workflow#

This workflow will normalize all reads and filter amplicons/cells based on the settings set at the beginning of the workflow:

  1. Amplicon completeness: refers to the minimum percentage of barcodes for an amplicon that must have reads greater than or equal to the minimum read depth set. By default this is set to 50.

  2. Amplicon read depth: refers to the minimum read depth for each amplicon-barcode combination to not be considered missing. By default this is set to 10.

  3. Mean cell read depth: refers to the minimum mean read depth for a cell to be included in the analysis; cells below this value are removed. By default this is set to 40.

  4. Diploid clone in DNA: refers to the subclone you set as the true diploid population. All ploidy estimates are calculated relative to this diploid population. We recommend setting this to your ‘WT’ population or to the most ancestral (parent) clone present.

Note: These settings are pretty stringent. If you are expecting large copy number events, you may want to reduce Amplicon Completeness and Amplicon Read Depth, to recover these events.
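
The three numeric filters can be sketched on a small read-count table (rows are cells, columns are amplicons). This is an illustration under stated assumptions, not the workflow's code: `filter_cnv` is a hypothetical helper, and computing the mean cell depth over the retained amplicons only is an assumption about the order of operations.

```python
def filter_cnv(reads, amplicon_completeness=50, read_depth=10,
               mean_cell_read_depth=40):
    """Return (kept cell indices, kept amplicon indices).

    reads: list of rows, one row per cell, one value per amplicon.
    """
    n_cells = len(reads)
    n_amplicons = len(reads[0])

    # Keep amplicons where enough cells reach the minimum read depth
    keep_amplicons = []
    for j in range(n_amplicons):
        covered = sum(1 for i in range(n_cells) if reads[i][j] >= read_depth)
        if 100.0 * covered / n_cells >= amplicon_completeness:
            keep_amplicons.append(j)

    # Keep cells whose mean depth over the retained amplicons is high enough
    keep_cells = []
    for i in range(n_cells):
        depths = [reads[i][j] for j in keep_amplicons]
        if depths and sum(depths) / len(depths) >= mean_cell_read_depth:
            keep_cells.append(i)
    return keep_cells, keep_amplicons


reads = [
    [60, 55, 3],
    [50, 45, 0],
    [5, 8, 2],
    [70, 80, 12],
]
cells, amplicons = filter_cnv(reads)
print(cells, amplicons)  # [0, 1, 3] [0, 1]
```

In this toy table the third amplicon is dropped because only one of four cells (25%) reaches a depth of 10, and the third cell is dropped because its mean depth is far below 40.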

Once the above filters are set, the visualizations can be changed.

  1. Plot: Can be changed from Heatmap positions, to Heatmap genes, Line-plot positions, Line-plot genes

  2. Clone for line plot: If one of the line-plot visualizations is selected, only one clone can be shown at a time. This determines which one is plotted.

  3. X-axis features: If you would prefer to only plot a subset of the data (chromosomes or genes), you can select which chromosomes/genes you would like plotted with this function. Chromosomes can be selected for ‘positions’ type plots, and Genes can be selected for ‘genes’ type plots.

# CNV workflow to filter, normalize and estimate ploidy
wfc = ms.workflows.CopyNumber(sample)
# Amplicon completeness refers to the min percentage of barcodes for an amplicon that must have reads >= the read_depth
# Read depth is the min required depth for each amplicon_barcode to not be considered as missing
# Mean cell read depth refers to the min mean read depth for a cell to be included, otherwise it is removed
# Heatmap with the features ordered by the default amplicon order
# If you want to plot just a subset of chromosomes, you can put them in list format for features
fig = sample.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']

# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0

# Optionally, change the size of the figure:
#fig.layout.width = 1600
#fig.layout.height = 1500

# Heatmap with the features grouped by the genes
# If you want to plot just a subset of genes, you can put them in list format for features
fig = sample.cnv.heatmap('ploidy', features='genes', convolve=1) #features=["ASXL1", "EZH2",'TP53']

# Optionally, update the separating lines to be black
#for shape in fig.layout.shapes:
#    shape.line.color = '#000000'

# Show heatmap with convolve and subclustering turned off
bars = sample.cnv.clustered_barcodes('ploidy', subcluster=False)

# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to determine
fig = sample.cnv.heatmap('ploidy', bars_order=bars, convolve=20)
fig.layout.width = 900
# Signature will create a dataframe based on the layer you want to look at and which statistical value you want to view
# You can see: mean, median, mode, or std with this function
sample.cnv.signature('ploidy', 'median')
# signaturemap will plot a heatmap of the signature dataframe created above
# The labels list will control the order of the subclones along the y-axis
sample.cnv.signaturemap('ploidy') #labels=[]
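
To see why a convolved heatmap is easier to read, consider a simple moving average over the ploidy values of neighboring cells for one amplicon. This is only a sketch of the general smoothing idea; the way Mosaic applies `convolve` internally may differ, and `smooth_column` is a hypothetical helper.

```python
def smooth_column(values, window):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Ploidy estimates for one amplicon across seven ordered cells,
# with a single noisy cell at index 3
ploidy = [2.0, 2.2, 1.8, 4.0, 2.1, 1.9, 2.0]
smoothed = smooth_column(ploidy, window=3)
print([round(v, 2) for v in smoothed])
```

The isolated spike is pulled toward its neighbors, while a true copy number change shared by many adjacent cells would survive the averaging.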

Combined Visualizations#

In this section you will first subset the data to only retain barcodes with remaining DNA and CNV data. Then this data can be plotted together using sample.heatmap() and sample.signaturemap().
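
The idea of keeping only barcodes present in every assay can be sketched with plain lists. The barcode strings below are made up for illustration, and this is not how sample.common_barcodes() is implemented.

```python
# Hypothetical barcodes remaining in each assay after filtering
dna_barcodes = ["AAC-1", "AAG-1", "ACT-1", "AGG-1"]
cnv_barcodes = ["AAG-1", "AGG-1", "ATT-1", "AAC-1"]

cnv_set = set(cnv_barcodes)
# Keep the DNA order so downstream plots stay stable
common = [bc for bc in dna_barcodes if bc in cnv_set]
print(common)  # ['AAC-1', 'AAG-1', 'AGG-1']
```

Iterating over one list and testing membership in a set keeps a deterministic order, which a bare set intersection would not guarantee.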

# This will return the barcodes common to all assays in the sample.
common = sample.common_barcodes()

# Use that to filter the sample so that only the common barcodes are present in all assays
sample = sample[common]
# Check dimensionality for each subclass; the number of cells (first number) should be the same in each data set
print(sample.dna.shape, sample.cnv.shape)
# DNA + CNV heatmap
fig = sample.heatmap(
    clusterby=('dna', 'cnv'),  # The first assay is used for the labels
    attributes=['AF', 'ploidy'],
    features=[None, 'genes'],  # If None, then clustered_ids is used
    bars_order=None,  # The order of the barcodes
    order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
)
fig.layout.width = 1200
# Plot a combined signature heatmap, showing DNA and CNV signatures for all subclones
fig = sample.signaturemap(
    clusterby=('dna', 'cnv'),
    attributes=('NGT', 'ploidy'),
    features=[None, 'genes'],
    signature_kind=['median', 'median'],
    order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
)
fig.layout.width = 1200
# Clone vs analyte
# Visualize the CNV data stratified by clone
fig = sample.clone_vs_analyte('cnv')

Export and Save Data#

In this section you can export a filtered .h5 file, which will contain all new labels/layers, and contain only the filtered barcodes/cells remaining. You can also export all data (row attributes, column attributes and layers) for every assay (DNA and CNV) into easily parsable .csv tables.

# Save a new h5 file ('FilteredData.h5') that includes only the final, cleaned dataset
ms.save(sample, 'FilteredData.h5')

# Export data into csv formats
# DNA, CNV and metadata will be included in the zip
ms.to_zip(sample, 'filename')


Appendix#

DNA signature#

Using .signature() and .signaturemap() you can visualize different statistical metrics of your data, including: mean, median, mode and std.

# Signature will create a dataframe based on the layer you want to look at and which statistical value you want to view
# You can see: mean, median, mode, or std with this function
sample.dna.signature('AF', 'median')
# signaturemap will plot a heatmap of the signature dataframe created above
# The labels list will control the order of the subclones along the y-axis
sample.dna.signaturemap('AF') #labels=[]
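
A signature is a per-clone summary of a layer. The sketch below computes a median signature with the standard library to show the idea; it is not the Mosaic implementation, and the hypothetical `signature` function here returns a plain dictionary rather than a dataframe.

```python
from statistics import median


def signature(values, labels, stat=median):
    """Summarize each column of `values` per label with `stat`."""
    groups = {}
    for row, lab in zip(values, labels):
        groups.setdefault(lab, []).append(row)
    return {
        lab: [stat(col) for col in zip(*rows)]
        for lab, rows in groups.items()
    }


# AF values (percent) for five cells and two variants
af = [[0, 50], [2, 48], [100, 52], [98, 0], [100, 4]]
labels = ["Clone 1", "Clone 1", "Clone 2", "Clone 2", "Clone 2"]
sig = signature(af, labels)
print(sig)
```

Passing a different function as `stat` (for example `statistics.mean`) would produce the corresponding mean signature from the same grouping.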

Compass imputation#

This section of the notebook is OPTIONAL and is still a work in progress.

Compass can be used to impute the genotypes of cells with some missing data. It can also be used to infer the phylogeny of all subclones present in the sample. If a cell's identity cannot be determined, Compass labels the cell as Mixed or Ambiguous. Compass can take ~1-10 minutes to run, depending on the size of the data.

The Compass publication can be found here

Note: Compass can give different subclone composition than the variant subclone workflow.

# Use compass to infer the subclone architecture and phylogeny
# Depending on the size and complexity of the data, this step can take ~1-10 minutes
# Use full_sample instead of sample to include all previously removed cells and use the unannotated variant names
compass = COMPASS(full_sample, somatic_variants=variants)
# The phylogenetic tree predicted by COMPASS
fig = compass.plot_tree()
# dict of node names pointing to descriptions
# compass_labels_ is just the node names
# we want the compass_labels to be the node_descriptions
labs = compass.labels_  # Stores the labels
desc = compass.node_descriptions()  # Stores the dict mapping the label to the description
compass_labels = np.array([desc.get(lab, lab) for lab in labs])
# Store the compass label as a row_attr
full_sample.dna.add_row_attr('compass_labels', compass_labels)
# Store the index positions of the COMPASS ids for the heatmap
idx = np.isin(full_sample.dna.ids(), compass.somatic_variants)

# Visualize the assignment using the variants passed to COMPASS
fig = full_sample.dna.heatmap('NGT', features=full_sample.dna.ids()[idx], splitby='compass_labels')
# This will return a dataframe 
# Showing the overlap of variant workflow clones and compass identified clones
full_sample.dna.crosstab(compass_labels, normalize='columns')
# This will plot a heatmap of the crosstab dataframe
full_sample.dna.crosstabmap(compass_labels, normalize='columns').show()
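
For readers unfamiliar with column normalization, the sketch below builds a small cross-tabulation in which each column sums to 1. It assumes that normalize='columns' follows the usual convention of dividing each count by its column total; the labels are made up, and `crosstab_by_column` is a hypothetical helper rather than a Mosaic function.

```python
from collections import Counter


def crosstab_by_column(row_labels, col_labels):
    """Fraction of each column label's cells that carry each row label."""
    counts = Counter(zip(row_labels, col_labels))
    col_totals = Counter(col_labels)
    return {(r, c): n / col_totals[c] for (r, c), n in counts.items()}


# One entry per cell: workflow clone label and a second clustering's label
workflow = ["Clone 1", "Clone 1", "Clone 2", "Clone 2", "missing"]
second = ["Node A", "Node A", "Node B", "Node A", "Node B"]
table = crosstab_by_column(workflow, second)
for key in sorted(table):
    print(key, round(table[key], 2))
```

Reading down a column then answers the question "of the cells assigned to this node, what fraction came from each workflow clone?".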