Single Sample DNA-only#
This notebook is set up to work with single-sample h5 files generated with v2 or v3 chemistry, and requires Mosaic version 3.7 or higher.
Objective: To showcase the minimum number of steps required for tertiary analysis of DNA (single-cell genotyping and CNV) and explore different ways of visualizing the data.
Major questions answered:
Can we identify DNA clones based on genotypes (SNVs/Indels)?
Do we detect CNV events (e.g., copy number amplification, copy number loss)?
Sections:
Setup
Data Structure
DNA Analysis
CNV Analysis
Combined Visualizations
Export and Save Data
Appendix
Not shown:
All available methods and options - documented here
Setup#
Topics covered
Loading required packages.
Loading in data.
Load data
Note: importing dependencies can sometimes take a couple of minutes.
# Import mosaic libraries
import missionbio.mosaic as ms
# Import these to display entire dataframes
from IPython.display import display, HTML
# Import graph_objects from the plotly package to display figures when saving the notebook as an HTML
import plotly.express as px
import plotly.graph_objects as go
# Import additional packages for specific visuals
import matplotlib.pyplot as plt
import plotly.offline as pyo
pyo.init_notebook_mode()
import numpy as np
import seaborn as sns
from missionbio.plotting.multimap import MultiMap
from missionbio.plotting.heatmap import Heatmap
# Import COMPASS for imputation
from missionbio.mosaic.algorithms.compass import COMPASS
import pandas as pd
# Note: when exporting the notebook as an HTML, plots that use the "go.Figure(fig)" command are saved
# This code is optional, but will make the notebook cells/figures display across the entire width of the browser
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# Check version; this notebook is designed for Mosaic v3.7 or higher
print(ms.__version__)
# Any function's parameters and default values can be looked up via the 'help' function
# Here, the function is 'ms.load'
help(ms.load)
Loading the data#
In the ms.load statement, there are several arguments:
raw : always set raw=False; if raw=True, ALL barcodes will be loaded (rather than cell-associated barcodes)
filter_variants : always set filter_variants=True unless an expected (target) variant cannot be found. Additional filtering options are included in the DNA section below. NOTE: if loading with filter_variants=False, the whitelist may need to be set to None or removed from the command; otherwise the variants may not load.
single : always set single=False for multisample/merged h5 files
whitelist : The whitelist option loads any variant that is in the vcf.gz file (e.g. “chr1:179520511:C/G”). The whitelist argument accepts many variant formats; they are detailed here. A commented example of loading with a whitelist follows the load call below.
filter_cells : always set filter_cells=False; if filter_cells=True, only cells called by the completeness algorithm are loaded
# Specify the h5 file to be used in this analysis: h5path = '/path/to/h5/file/test.h5'
# If working with Windows, an 'r' may need to be added before the path: h5path = r'/path/to/h5/file/test.h5'
h5path = '/Users/botero/Desktop/Sample data/4-cell-lines-AML-multiomics/4-cell-lines-AML-multiomics.dna+protein.h5'
# Load the data
sample = ms.load(h5path, raw=False, filter_variants=True, single=True, whitelist = [], filter_cells=False)
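If target variants are known in advance, they can be passed through the whitelist argument as described above. A commented sketch using the variant format shown earlier (the variant string is a placeholder; replace it with variants expected in your sample):
# Optional: load with a whitelist of expected variants (placeholder variant shown)
# sample = ms.load(
#     h5path,
#     raw=False,
#     filter_variants=True,
#     single=True,
#     whitelist=['chr1:179520511:C/G'],
#     filter_cells=False,
# )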
Data Structure#
DNA, CNV, and Protein are sub-classes of the Assay class. The information is stored in four categories, each of which the user can modify (a short example of the add/del methods follows this list):
1. metadata (add_metadata / del_metadata):
dictionary containing assay parameters and basic performance metrics
2. layers (add_layer / del_layer):
dictionary containing matrices of assay metrics
all the matrices have the shape (num barcodes) x (num ids)
for DNA assays, this includes AF, GQ, DP, etc. (per cell, per variant)
for Protein assays, this includes read counts (per cell, per antibody)
for CNV assays, this includes read counts (per cell, per amplicon)
3. row_attrs (add_row_attr / del_row_attr):
dictionary containing ‘barcode’ as one of the keys (where the value is a list of all barcodes)
for all other keys, the values must be of the same length, i.e. match the number of barcodes
this is the attribute where ‘label’, ‘pca’, and ‘umap’ values are added
4. col_attrs (add_col_attr / del_col_attr):
dictionary containing ‘id’ as one of the keys (where the value is a list of all ids)
for DNA assays, ‘ids’ are variants; for Protein assays, ‘ids’ are antibodies; for CNV assays, ‘ids’ are amplicons
for all other keys, the values must be of the same length, i.e. match the number of ids
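As an illustration, here is a minimal sketch of the add/del methods on the DNA assay; the attribute and layer names ('my_flag', 'AF_rounded') are hypothetical and only shown to demonstrate the expected shapes:
# Illustrative only: add and remove a custom row attribute and layer
# n_cells, n_vars = sample.dna.shape
# sample.dna.add_row_attr('my_flag', np.zeros(n_cells))                # one value per barcode
# sample.dna.add_layer('AF_rounded', sample.dna.layers['AF'].round())  # (barcodes x ids) matrix
# sample.dna.del_row_attr('my_flag')
# sample.dna.del_layer('AF_rounded')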
# Summary of DNA assay
print("\'sample.dna\':", sample.dna, '\n')
print("\'row_attrs\':", "\n\t", list(sample.dna.row_attrs.keys()), '\n')
print("\'col_attrs\':", "\n\t", list(sample.dna.col_attrs.keys()), '\n')
print("\'layers\':", "\n\t", list(sample.dna.layers.keys()), '\n')
print("\'metadata\':", "\n")
for i in list(sample.dna.metadata.keys()):
    print("\t", i, ": ", sample.dna.metadata[i], sep="")
# For DNA, ids are variants
# sample.dna.ids() is a shortcut for sample.dna.col_attrs['id']
sample.dna.ids()
# Summary of CNV assay
print("\'sample.cnv\':", sample.cnv, '\n')
print("\'row_attrs\':", "\n\t", list(sample.cnv.row_attrs.keys()), '\n')
print("\'col_attrs\':", "\n\t", list(sample.cnv.col_attrs.keys()), '\n')
print("\'layers\':", "\n\t", list(sample.cnv.layers.keys()), '\n')
print("\'metadata\':", "\n")
for i in list(sample.cnv.metadata.keys()):
    print("\t", i, ": ", sample.cnv.metadata[i], sep="")
# For CNV, ids are amplicons
# sample.cnv.ids() is a shortcut for sample.cnv.col_attrs['id']
sample.cnv.ids()
DNA Analysis#
Topics covered
Standard filtering of DNA variants.
Subsetting dataset for variants of interest, including whitelisted variants.
Addition of annotations to the variants.
Manual variant selection and clone identification
Visualizations and customization options
Basic filtering#
There are many options for filtering DNA variants. Use the help() function to understand the approach listed below.
help(sample.dna.filter_variants)
# Adjust filters if needed by overwriting dna_vars
dna_vars = sample.dna.filter_variants(
    min_dp=10,
    min_gq=30,
    vaf_ref=5,
    vaf_hom=95,
    vaf_het=30,
    min_prct_cells=50,
    min_mut_prct_cells=1,
    iterations=10,
)
# Check the number of filtered variants. When using the default filters, the number of
# variants is likely smaller compared to the originally loaded variants due to the more
# stringent filtering criteria (e.g., vaf_ref=5, vaf_hom=95, vaf_het=30).
print('Number of variants:', len(dna_vars))
print(dna_vars)
Note: Variants that are whitelisted during sample loading may be removed at this step, but can be added back in below. Whitelisted variants can be added to the list of final variants, or used exclusively in the code below.
# Ensure correct nomenclature, e.g. white_list = ["chr13:28589657:T/G", "chrX:39921424:G/A"]
# While the load statement accepts many variant formats, this whitelist does not
# If there are no whitelist variants, this can be left blank
white_list = []
# Combine whitelisted and filtered variants
final_vars = list(set(list(dna_vars) + white_list))
# To use only white_list variants:
# final_vars = white_list
# Check the length of the final list of variants
len(final_vars)
# Dimensionality of the original sample.dna dataframe
# First number = number of cells (rows); second number = number of variants (columns)
sample.dna.shape
# Before subsetting, verify that all the chosen variants are in the current sample.dna ids (should return True)
print(set(final_vars).issubset(set(sample.dna.ids())))
# Subsetting sample.dna (columns) based on reduced variant list.
sample.dna = sample.dna[:, final_vars]
# Check the shape of the final filtered DNA object, i.e. (number of barcodes/cells, number of ids/variants)
sample.dna.shape
Annotation Addition#
help(sample.dna.get_annotations)
# Fetch annotations using varsome
# Note: run this on a filtered DNA sample - too many variants (e.g., 100+) are not handled correctly by the method
ann = sample.dna.get_annotations()
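To review what was fetched before selecting variants, the returned object and the current column attributes can be displayed directly in a notebook cell (a minimal check; the exact annotation fields depend on the annotation source and Mosaic version):
# Optional: inspect the fetched annotations and the available column attributes
# ann
# list(sample.dna.col_attrs.keys())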
Variant selection and Subclone identification#
In this section of the notebook, all variants remaining in the dataset will populate a variant table. This table is interactive: variants can be selected, and rows can be sorted by ascending/descending values. Clicking a variant name opens that variant’s varsome page in the default browser.
Variants selected in this table will populate a subclone table below.
The variants in the subclone table can be highlighted and assessed for Read Depth, Genotype Quality and Variant Allele Frequency.
Subclones can be renamed by clicking on the pencil icon.
ADO score: The ADO score threshold can be adjusted, but by default it is set to 1. Any clones with an ADO score higher than this threshold will be moved into the “ADO subclones” column. We recommend moving any clones with a score of 0.8 or higher into this column. The algorithm used to generate the score is detailed here.
Min clone size: The Min Clone Size can also be adjusted, but by default is set to 1. Any clones that represent less than 1% of the sample will be moved into the “Small Subclones” column.
Cells with missing genotype information across any of the selected variants will be moved into the “Missing GT” column.
# Run the variant table workflow to select variants and begin clone identification
wfv = ms.workflows.VariantSubcloneTable(sample)
wfv.run()
# The width can be adjusted if needed for long variant names
# wfv.run(width=3000)
NOTE: Once you continue to the next part of the notebook, the interactive workflow will stop working.
# Subsample the variants to only the variants selected from the workflow
variants = wfv.selected_variants
sample.dna = sample.dna[:,variants]
# Save the full set of cells to a new variable that can be called on later
# Do this before renaming the variants
full_sample = sample[:]
# Rename the variants
# Any of these column values can be added to the id names
sample.dna.col_attrs.keys()
# Add annotation to the id names
sample.dna.set_ids_from_cols(['annotated Gene', 'id'])
# Another example:
# sample.dna.set_ids_from_cols(['annotated Gene', 'CHROM', 'POS', 'REF', 'ALT'])
# Annotations are now added to the variants
sample.dna.ids()
# Use sample.dna.reset_ids() to get the original ids
# Plot heatmap using NGT_FILTERED
fig = sample.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Clone removal
# Remove barcodes from the missing GT, small subclones, ADO subclones or FP labels
clones = ['missing', 'small', 'ADO', 'FP']
for c in clones:
    cells = sample.dna.barcodes({c})
    if len(cells) > 0:
        sample.dna = sample.dna.drop(cells)
set(sample.dna.get_labels())
# If subclones need to be renamed outside of the workflow
# The following code can be used:
# NOTE: all clones need to be listed for this function to work
sample.dna.rename_labels(
    {
        'current_subclone_name': 'new_subclone_name',
        'WT': 'Wild Type',
        '3': 'three',
        '4': 'four',
    }
)
# Redraw heatmap
fig = sample.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Manually generate a phylogeny tree of the subclones in your sample
from missionbio.plotting.phylotree import PhyloTree
labels = ["1", "2", "3", "4"]
parents = [None, "1", "1", "2"]
descriptions = ["WT", "KIT HET", "KIT HET<br>TP53 HET", "KIT LOH"]
graph = PhyloTree(labels, parents, descriptions)
graph.draw(palette={"1": "#000000", "2": "crimson", "3": "skyblue", "4": "red"})
# Evaluate new total number of cells after the above filtering
sample.dna.shape
Adjusting subclone colors or heatmap colors#
To change the color palette used for the subclones, use set_palette. Provide a dictionary mapping ALL subclone names to the hexcode/color each should be changed to. Note: this function also works with the protein and CNV assays. See below for an example.
To change the color palette used for any of the assays/layers, use ms.Config.Colorscale. List the assay and layer, and assign the plotly colorscale to use for those graphics. Visualize all available colorscales by using plotly.colors.sequential.swatches_continuous().show(). To reset all color palettes back to their defaults, use ms.Config.Colorscale.reset().
# Example of changing the colors assigned to each clone
sample.dna.set_palette({'Clone 1': '#800080', 'Clone 2': '#FF69B4'})
# Change the Colorscale config to change the heatmap colors
# This will change the color of the dna.ngt layer to be viridis
ms.Config.Colorscale.Dna.NGT = 'viridis'
fig = sample.dna.heatmap('NGT')
fig.show()
# Run the following to reset the colorscale
#ms.Config.Colorscale.Dna.reset("NGT")
# Or run this to reset any/all modifications:
#ms.Config.Colorscale.reset()
# Colorscale can be changed for DNA, CNV and Protein assays
To return to the original population of cells, reset the data using: sample.reset("dna")
This command, sample.reset, works on all assays, including CNV and protein.
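A minimal, optional sketch of resetting the DNA assay (only run this if the filtering above should be undone):
# Optional: restore the DNA assay to the population of cells present at load time
# sample.reset("dna")
# sample.dna.shape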
CNV Analysis#
Topics covered
Amplicon and cell filtering and ploidy estimation.
Visualization of ploidy across subclones present.
# Get gene names for amplicons
sample.cnv.get_gene_names()
CNV workflow#
This workflow will normalize all reads and filter amplicons/cells based on the settings set at the beginning of the workflow:
Amplicon completeness: refers to the minimum percentage of cells that must have reads greater than or equal to the minimum read depth set. By default this is set to 50.
Amplicon read depth: refers to the minimum read depth for each amplicon-barcode combination to not be considered missing. By default this is set to 10.
Mean cell read depth: refers to the minimum mean read depth for a cell to be included in the analysis; otherwise the cell will be removed. By default this is set to 40.
Diploid clone in DNA: refers to which subclone is used as the true diploid population. All ploidy estimates will be calculated in relation to this diploid population. We recommend setting this to the ‘WT’ population or earliest clone present.
Note: If large copy number events are expected, Amplicon Completeness and Amplicon Read Depth may need to be reduced.
Once the above filters are set, the visualizations can be changed.
Plot: can be changed among Heatmap positions, Heatmap genes, Line-plot positions, and Line-plot genes.
Clone for line plot: if one of the line-plot visualizations is selected, only one clone can be shown at a time. This determines which clone is plotted.
X-axis features: to plot a subset of the data (chromosomes or genes), select which chromosomes/genes should be plotted with this option. Chromosomes can be selected for ‘positions’-type plots, and genes can be selected for ‘genes’-type plots.
# CNV workflow to filter, normalize and estimate ploidy
wfc = ms.workflows.CopyNumber(sample)
wfc.run()
NOTE: Once you continue to the next part of the notebook, the interactive workflow will stop working.
# Heatmap with the features ordered by the default amplicon order
# To plot a subset of chromosomes, put them in list format in the features argument
fig = sample.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']
# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0
# Optionally, change the size of the figure:
#fig.layout.width = 1600
#fig.layout.height = 1500
go.Figure(fig)
# Heatmap with the features grouped by the genes
# To plot just a subset of genes, put them in list format for features
fig = sample.cnv.heatmap('ploidy', features='genes', convolve=1) #features=["ASXL1", "EZH2",'TP53']
# Optionally, update the separating lines to be black
#for shape in fig.layout.shapes:
# shape.line.color = '#000000'
go.Figure(fig)
# Show heatmap with convolve and subclustering turned off
bars = sample.cnv.clustered_barcodes('ploidy', subcluster=False)
# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to visualize
fig = sample.cnv.heatmap('ploidy', bars_order=bars, convolve=20)
fig.layout.width = 900
fig.show()
Combined Visualizations#
In this section the dataset will be subset to only retain barcodes with remaining DNA and CNV data. Then this data can be plotted together using sample.heatmap() and sample.signaturemap().
# This will return the barcodes common to all assays in the sample.
sample.common_barcodes()
# Use that to filter the sample so that only the common barcodes are present in all assays
sample = sample[sample.common_barcodes()]
# Check dimensionality for each assay; the number of cells (first number) should be the same in each data set
print(sample.dna.shape, sample.cnv.shape)
# DNA + CNV heatmap
fig = sample.heatmap(
    clusterby=('dna', 'cnv'),  # The first assay is used for the labels
    attributes=['AF', 'ploidy'],
    features=[None, 'genes'],  # If None, then clustered_ids is used
    bars_order=None,  # The order of the barcodes
    widths=None,
    order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
)
fig.show()
help(sample.signaturemap)
# Plot a combined signature heatmap, showing DNA and CNV signatures for all subclones
fig = sample.signaturemap(
    clusterby=('dna', 'cnv'),
    attributes=('NGT', 'ploidy'),
    features=[None, 'genes'],
    signature_kind=['median', 'median'],
    widths=None,
    order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
)
fig.show()
help(sample.clone_vs_analyte)
# Clone vs analyte
# Visualize the CNV data stratified by clone
fig = sample.clone_vs_analyte('cnv')
plt.gcf().axes[1].texts[-1].set_text('')
fig
#fig.savefig('genotype_cnv.png')
Export and Save Data#
In this section a filtered .h5 file can be exported, which will contain all new labels/layers and only the remaining filtered barcodes/cells. All data (row attributes, column attributes, and layers) for every assay (DNA and CNV) can also be exported into easily parsable .csv tables.
# Save new h5 file that includes only the final, cleaned dataset
ms.save(sample, 'FilteredData.h5')
# Export data into csv formats
ms.to_zip(sample, 'filename')
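If only a single layer is needed as a standalone table, it can also be written out directly with pandas (an illustrative sketch; the attribute and filename here are arbitrary choices):
# Optional: export a single layer (e.g. NGT) as its own csv via pandas
# ngt = sample.dna.get_attribute('NGT', constraint='row+col')
# ngt.to_csv('NGT_per_cell.csv')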
Appendix#
Compass imputation#
This section of the notebook is OPTIONAL
Compass can be used to impute the genotypes of cells with some missing data. It can also be used to infer the phylogeny of all subclones present in the sample. If a cell cannot be unambiguously classified, Compass will label these cells as Mixed or Ambiguous. Compass can take ~1-10 minutes to run depending on the size of the data.
The Compass publication can be found here
Note: Compass may give a different subclone composition than the variant subclone workflow.
# Use COMPASS to infer the subclone architecture and phylogeny
# Depending on the size and complexity of the data, this step can take ~1-10 minutes
# Use full_sample instead of sample to include all previously removed cells and use the unannotated variant names
compass = COMPASS(full_sample, somatic_variants=variants)
compass.run()
# The phylogenetic tree predicted by COMPASS
fig = compass.plot_tree()
fig.show()
# Make the compass_labels the node descriptions
# These are the new subclone labels
labs = compass.labels_ # Stores the labels
desc = compass.node_descriptions() # Stores the dict mapping the label to the description
compass_labels = np.array([desc.get(lab, lab) for lab in labs])
# Store the compass labels as a row_attr
full_sample.dna.add_row_attr('compass_labels', compass_labels)
# Build a boolean mask of the variant ids that were passed to COMPASS (used to subset the heatmap)
idx = np.isin(full_sample.dna.ids(), compass.somatic_variants)
# Visualize the assignment using the variants passed to COMPASS
fig = full_sample.dna.heatmap('NGT', features=full_sample.dna.ids()[idx], splitby='compass_labels')
fig.show()
# Create a DNA heatmap to see both the variant workflow subclone labels and the compass assigned labels
# Parameters
attribute = 'NGT'
assay = full_sample.dna
lab_row1 = "label"
lab_row2 = "compass_labels"
lab2_title = "Compass"
# Graphing
data = assay.get_attribute(attribute, constraint="row+col")
ids = assay.clustered_ids(attribute)
labels = pd.concat([assay.get_attribute(lab_row1), assay.get_attribute(lab_row2).rename(columns={"1": "2"})], axis=1)
bars = assay.clustered_barcodes(attribute, splitby=labels)
data = data.loc[bars, ids]
labs1 = assay.get_attribute(lab_row1, constraint="row").loc[bars, :].values.flatten()
labs2 = assay.get_attribute(lab_row2, constraint="row").loc[bars, :].values.flatten()
fig = MultiMap([data], labs1, names=[attribute], extra_column_titles=[lab2_title], widths=[1 / 24, 1]).draw()
fig.add_trace(Heatmap(data, labs2).label_trace(), row=1, col=2)
fig.layout.xaxis2.tickvals = []
fig
# This will return a dataframe
# Showing the overlap of variant workflow clones and compass identified clones
full_sample.dna.crosstab(compass_labels, normalize='columns')
# This will plot a heatmap of the crosstab dataframe
full_sample.dna.crosstabmap(compass_labels, normalize='columns').show()
Export data to use in COMPASS
# Export compass data into 2 output files
# compass_input_regions.csv and compass_input_variants.csv
# These can be used with COMPASS outside of Mosaic
compass = COMPASS(sample, sample.dna.filter_somatic_variants())
compass.prepare_input(prefix="./compass_input")
DNA clustering#
This section of the notebook is optional
Below is an alternative way to cluster cells based on DNA data. This feature is useful for understanding the diversity within the cell population and may reduce the number of cells lost due to missing data.
# First: run PCA on filtered data frame (all cells, chosen variants) using the AF_MISSING layer, for instance
# Note: in the AF_MISSING layer, values that were filtered out (missing data) are set to -50
# The number of components should equal the number of pre-filtered variants
sample.dna.run_pca(components=100, attribute='AF_MISSING', show_plot=True)
# Rerun PCA with optimal number of components based on elbow plot analysis
sample.dna.run_pca(components=6, attribute='AF_MISSING')
# Run UMAP on top of the newly created PC dataframe
# See https://jlmelville.github.io/uwot/abparams.html for appropriate spread and min_dist values
sample.dna.run_umap(attribute='pca', random_state=42) #, min_dist=0.2, spread=1.5)
# Review other clustering methods
help(sample.dna.cluster)
# Visualize data and cluster it using different methods
sample.dna.cluster(attribute='umap', method='graph-community', k=400)
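Other methods listed by help(sample.dna.cluster) can be substituted here; a hedged sketch (method availability and parameters may vary by Mosaic version):
# Optional alternative to graph-community clustering (availability may vary by version)
# sample.dna.cluster(attribute='umap', method='dbscan')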
# Re-plot UMAP projection w/ alternative clustering results
fig = sample.dna.scatterplot(attribute='umap', colorby='label')
go.Figure(fig)
# Plot heatmap using NGT_FILTERED
fig = sample.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)