Single Sample DNA + Protein#
This notebook is set up to work with single-sample h5 files, generated with v2 or v3 chemistry. This notebook will only work with Mosaic versions 3.7 and higher.
Objective: To showcase the minimum number of steps required for tertiary analysis of DNA (single-cell genotyping and CNV) and Protein analytes and explore different ways of visualizing the data.
Major questions answered:
Can we identify DNA clones based on genotypes (SNVs/Indels)?
Do we detect CNV events (e.g., copy number amplification, copy number loss)?
Can we delineate different cell types from cell surface protein expression data?
Do DNA clones (genotypes) correlate with specific protein-defined cell types and/or CNV profiles?
Sections:
Setup
Data Structure
DNA Analysis
CNV Analysis
Protein Analysis
Combined Visualizations
Export and Save Data
Appendix
Not shown:
All available methods and options - documented here
Setup#
Topics covered
Loading required packages.
Loading in data.
Load data
Note: importing dependencies can sometimes take a couple of minutes.
# Import mosaic libraries
import missionbio.mosaic as ms
# Import these to display entire dataframes
from IPython.display import display, HTML
# Import graph_objects from the plotly package to display figures when saving the notebook as an HTML
import plotly as px
import plotly.graph_objects as go
# Import additional packages for specific visuals
import matplotlib.pyplot as plt
import plotly.offline as pyo
pyo.init_notebook_mode()
import numpy as np
import seaborn as sns
from missionbio.plotting.multimap import MultiMap
from missionbio.plotting.heatmap import Heatmap
# Import COMPASS for imputation
from missionbio.mosaic.algorithms.compass import COMPASS
import pandas as pd
# Import for protein assignment
from missionbio.demultiplex.protein.truth import Truth
# Note: when exporting the notebook as an HTML, plots that use the "go.Figure(fig)" command are saved
# This code is optional, but will make the notebook cells/figures display across the entire width of the browser
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# Check version; this notebook is designed for Mosaic v3.7 or higher
print(ms.__version__)
# Any function's parameters and default values can be looked up via the 'help' function
# Here, the function is 'ms.load'
help(ms.load)
Loading the data #
In the ms.load statement, there are several arguments:
raw : always set raw=False; if raw=True, ALL barcodes will be loaded (rather than cell-associated barcodes)
filter_variants : always set filter_variants=True unless an expected (target) variant cannot be found. Additional filtering options are included in the DNA section below. NOTE : if loading using filter_variants=False, the whitelist may need to set to None or removed from the command, otherwise it may not load variants in
single : always set single=False for multisample/merged h5 files
whitelist : The whitelist option loads any variant that is in the vcf.gz file (e.g. “chr1:179520511:C/G”). The whitelist argument takes many variant formats, they are detailed here.
filter_cells : Always set filter_cells=False; if filter_cells=True, only cells called by the completelyness algorithm are loaded
# Specify the h5 file to be used in this analysis: h5path = '/path/to/h5/file/test.h5'
# If working with Windows, an 'r' may need to be added before the path: h5path = r'/path/to/h5/file/test.h5'
h5path = '/Users/botero/Desktop/Sample data/4-cell-lines-AML-multiomics/4-cell-lines-AML-multiomics.dna+protein.h5'
# Load the data
sample = ms.load(h5path, raw=False, filter_variants=True, single=True, whitelist = [], filter_cells=False)
Data Structure#
DNA, CNV, and Protein are sub-classes of the Assay class. The information is stored in four categories, and the user can modify each of those:
1. metadata (add_metadata / del_metadata):
dictionary containing assay parameters and basic performance metrics
2. layers (add_layer / del_layer):
dictionary containing matrices of assay metrics
all the matrices have the shape (num barcodes) x (num ids)
for DNA assays, this includes AF, GQ, DP, etc. (per cell, per variant)
for Protein assays, this includes read counts (per cell, per antibody)
for CNV assays, this includes read counts (per cell, per amplicon)
3. row_attrs (add_row_attr / del_row_attr):
dictionary containing ‘barcode’ as one of the keys (where the value is a list of all barcodes)
for all other keys, the values must be of the same length, i.e. match the number of barcodes
this is the attribute where ‘label’, ‘pca’, and ‘umap’ values are added
4. col_attrs (add_col_attr / del_col_attr):
dictionary containing ‘id’ as one of the keys (where the value is a list of all ids)
for DNA assays, ‘ids’ are variants; for Protein assays, ‘ids’ are antibodies; for CNV assays, ‘ids’ are amplicons
for all other keys, the values must be of the same length, i.e. match the number ids
# Summary of DNA assay
print("\'sample.dna\':", sample.dna, '\n')
print("\'row_attrs\':", "\n\t", list(sample.dna.row_attrs.keys()), '\n')
print("\'col_attrs\':", "\n\t", list(sample.dna.col_attrs.keys()), '\n')
print("\'layers\':", "\n\t", list(sample.dna.layers.keys()), '\n')
print("\'metadata\':", "\n")
for i in list(sample.dna.metadata.keys()):
print("\t", i, ": ", sample.cnv.metadata[i], sep="")
# For DNA, ids are variants
# sample.dna.ids() is a shortcut for sample.dna.col_attrs['id']
sample.dna.ids()
# Summary of CNV assay
print("\'sample.cnv\':", sample.cnv, '\n')
print("\'row_attrs\':", "\n\t", list(sample.cnv.row_attrs.keys()), '\n')
print("\'col_attrs\':", "\n\t", list(sample.cnv.col_attrs.keys()), '\n')
print("\'layers\':", "\n\t", list(sample.cnv.layers.keys()), '\n')
print("\'metadata\':", "\n")
for i in list(sample.cnv.metadata.keys()):
print("\t", i, ": ", sample.cnv.metadata[i], sep="")
# For CNV, ids are amplicons
# sample.cnv.ids() is a shortcut for sample.cnv.col_attrs['id']
sample.cnv.ids()
# Summary of Protein assay
print("\'sample.protein\':", sample.protein, '\n')
print("\'row_attrs\':", "\n\t", list(sample.protein.row_attrs.keys()), '\n')
print("\'col_attrs\':", "\n\t", list(sample.protein.col_attrs.keys()), '\n')
print("\'layers\':", "\n\t", list(sample.protein.layers.keys()), '\n')
print("\'metadata\':", "\n")
for i in list(sample.protein.metadata.keys()):
print("\t", i, ": ", sample.protein.metadata[i], sep="")
# For Protein, ids are AOCs
# sample.protein.ids() is a shortcut for sample.protein.col_attrs['id']
sample.protein.ids()
DNA Analysis#
Topics covered
Standard filtering of DNA variants.
Subsetting dataset for variants of interest, including whitelisted variants.
Addition of annotations to the variants.
Manual variant selection and clone identification
Visualizations and customization options
Basic filtering#
There are many options for filtering DNA variants.
Use the help()
function to understand the approach listed below.
help(sample.dna.filter_variants)
# Adjust filters if needed by overwriting dna_vars
dna_vars = sample.dna.filter_variants(
min_dp=10,
min_gq=30,
vaf_ref=5,
vaf_hom=95,
vaf_het=30,
min_prct_cells=50,
min_mut_prct_cells=1,
iterations=10,
)
# Check the number of filtered variants. When using the default filters, the number of
# variants is likely smaller compared to the originally loaded variants due to the more
# stringent filtering criteria (e.g., vaf_ref=5, vaf_hom=95, vaf_het=30).
print('Number of variants:', len(dna_vars))
print(dna_vars)
Note: Variants that are whitelisted during sample loading may be removed at this step, but can be added back in below. Whitelisted variants can be added to the list of final variants, or used exclusively in the code below.
# Ensure correct nomenclature, ie whitelist = ["chr13:28589657:T/G","chrX:39921424:G/A"]
# While the load statement accepts many variant formats, this whitelist does not
# If there are no whitelist variants, this can be left blank
white_list = []
# Combine whitelisted and filtered variants
final_vars = list(set(list(dna_vars) + white_list))
# To use only white_list variants:
# final_vars = white_list
# Check the length of the final list of variants
len(final_vars)
# Dimensionality of the original sample.dna dataframe
# First number = number of cells (rows); second number = number of variants (columns)
sample.dna.shape
# Before subsetting, verify that all the chosen variants are in the current sample.dna ids (should return True)
print(set(final_vars).issubset(set(sample.dna.ids())))
# Subsetting sample.dna (columns) based on reduced variant list.
sample.dna = sample.dna[:, final_vars]
# Check the shape of the final filtered DNA object, i.e. (number of barcodes/cells, number of ids/variants)
sample.dna.shape
Annotation Addition#
help(sample.dna.get_annotations)
# Fetch annotations using varsome
# Note: run this on a filtered DNA sample - too many variants (e.g., 100+) are not handled correctly by the method
ann = sample.dna.get_annotations()
Variant selection and Subclone identification#
In this section of the notebook, all variants remaining in the dataset will populate in a variant table. This table is interactive, variants can be selected, and rows can be sorted by ascending/descending values. The variant name can be clicked on and will navigate to the variant’s varsome page in the default browser.
Variants selected in this table will populate in a subclone table below.
The variants in the subclone table can be highlighted and assessed for Read Depth, Genotype Quality and Variant Allele Frequency.
Subclones can be renamed by clicking on the pencil icon.
ADO score: The ADO score threshold can be adjusted, but by default is set to 1. Any clones with an ADO score higher this will be moved into the “ADO subclones” column. We recommend moving any clones with a score of .8 or higher into this column. The algorithm used to generate the score is detailed here.
Min clone size: The Min Clone Size can also be adjusted, but by default is set to 1. Any clones that represent less than 1% of the sample will be moved into the “Small Subclones” column.
Cells with missing genotype information across any of the selected variants will be moved into the “Missing GT” column.
# Run the variant table workflow to select variants and begin clone identification
wfv = ms.workflows.VariantSubcloneTable(sample)
wfv.run()
# The width can be adjusted if needed for long variant names
# wfv.run(width=3000)
NOTE: When continuing onto the next piece of the notebook, the workflow will stop working.
# Subsample the variants to only the variants selected from the workflow
variants = wfv.selected_variants
sample.dna = sample.dna[:,variants]
# Save the full set of cells to a new variable that can be called on later
# Do this before renaming the variants
full_sample = sample[:]
# Rename the variants
# Any of these column values can be added to the id names
sample.dna.col_attrs.keys()
# Add annotation to the id names
sample.dna.set_ids_from_cols(['annotated Gene', 'id'])
# Another example:
# sample.dna.set_ids_from_cols(['annotated Gene', 'CHROM', 'POS', 'REF', 'ALT'])
# Annotations are now added to the variants
sample.dna.ids()
# Use sample.dna.reset_ids() to get the original ids
# Plot heatmap using NGT_FILTERED
fig = sample.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Clone removal
# Remove barcodes from the missing GT, small subclones, ADO subclones or FP labels
clones = ['missing', 'small', 'ADO', 'FP']
for c in clones:
cells = sample.dna.barcodes({c})
if len(cells) > 0:
sample.dna = sample.dna.drop(cells)
set(sample.dna.get_labels())
# If subclones need to be renamed outside of the workflow
# The following code can be used:
# NOTE: all clones need to be listed for this function to work
sample.dna.rename_labels(
{
'current_subclone_name': 'new_subclone_name',
'WT': 'Wild Type',
'3': 'three',
'4': 'four',
}
)
# Redraw heatmap
fig = sample.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Manually generate a phylogeny tree of the subclones in your sample
from missionbio.plotting.phylotree import PhyloTree
labels = ["1", "2", "3", "4"]
parents = [None, "1", "1", "2"]
descriptions = ["WT", "KIT HET", "KIT HET<br>TP53 HET", "KIT LOH"]
graph = PhyloTree(labels, parents, descriptions)
graph.draw(palette={"1": "#000000", "2": "crimson", "3": "skyblue", "4": "red"})
# Evaluate new total number of cells after the above filtering
sample.dna.shape
Adjusting subclone colors or heatmap colors#
To change the color palette used for the subclones, use set_palette
. For this provide a list of ALL subclone names, pointing to the hexcode/color to change that subclone to. Note: this function also works with the protein and CNV assays. See below for an example of this.
To change the color palette used for any of the assays/layers, use ms.Config.Colorscale
. Then list the assay and layer, and assign the plotly colorscale to use for those graphics. Visualize all available colorscales by using plotly.colors.sequential.swatches_continuous().show()
. To reset all color palettes back to their defaults, use ms.Config.Colorscle.reset()
.
# Example of changing the colors assigned to each clone
sample.dna.set_palette({'Clone 1': '#800080', 'Clone 2': '#FF69B4'})
# Change colorScale to change the heatmap colors
# This will change the color of the dna.ngt layer to be viridis
ms.Config.Colorscale.Dna.NGT = 'viridis'
fig = sample.dna.heatmap('NGT')
fig.show()
# Run the following to reset the colorscale
#ms.Config.Colorscale.Dna.reset("NGT")
# Or run this to reset any/all modifications:
#ms.Config.Colorscale.reset()
# Colorscale can be changed for DNA, CNV and Protein assays
To return back to the original population of cells, reset the data using: sample.reset("dna")
This command ‘sample.reset’ works on all assays, including CNV and protein.
CNV Analysis#
Topics covered
Amplicon and cell filtering and ploidy estimation.
Visualization of ploidy across subclones present.
# Get gene names for amplicons
sample.cnv.get_gene_names()
CNV workflow#
This workflow will normalize all reads and filter amplicons/cells based on the settings set at the beginning of the workflow:
Amplicon completeness: refers to the minimum percentage of cells that must have reads greater than or equal to the minimum read depth set. By default this is set to 50.
Amplicon read depth: refers to the minimum read depth for each amplicon-barcode combination to not be considered missing. By default this is set to 10.
Mean cell read depth: refers the minimum mean read depth for a cell to be included in the analysis, otherwise the cell will be removed. By default this is set to 40.
Diploid clone in DNA: refers to which subclone is used as the true diploid population. All ploidy estimates will be calculated in relation to this diploid population. We recommend setting this to the ‘WT’ population or earliest clone present.
Note: If large copy number events are expected, Amplicon Completeness and Amplicon Read Depth may need to be reduced.
Once the above filters are set, the visualizations can be changed.
Plot: Can be changed from Heatmap positions, to Heatmap genes, Line-plot positions, Line-plot genes
Clone for line plot: If one of the line-plot visualizations is selected, only one clone can be shown at a time. This determines which one is plotted.
X-axis features: To plot a subset of the data (chromosomes or genes), select which chromosomes/genes should be plotted with this function. Chromosomes can be selected for ‘positions’ type plots, and Genes can be selected for ‘genes’ type plots.
# CNV workflow to filter, normalize and estimate ploidy
wfc = ms.workflows.CopyNumber(sample)
wfc.run()
NOTE: When continuing onto the next piece of the notebook, the workflow will stop working.
# Heatmap with the features ordered by the default amplicon order
# To plot a subset of chromosome, the chromosomes can be put in list format in the features argument
fig = sample.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']
# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0
# Optionally, change the size of the figure:
#fig.layout.width = 1600
#fig.layout.height = 1500
go.Figure(fig)
# Heatmap with the features grouped by the genes
# To plot just a subset of genes, put them in list format for features
fig = sample.cnv.heatmap('ploidy', features='genes', convolve=1) #features=["ASXL1", "EZH2",'TP53']
# Optionally, update the separating lines to be black
#for shape in fig.layout.shapes:
# shape.line.color = '#000000'
go.Figure(fig)
# Show heatmap with convolve and subclustering turned off
bars = sample.cnv.clustered_barcodes('ploidy', subcluster=False)
# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to visualize
fig = sample.cnv.heatmap('ploidy', bars_order=bars, convolve=20)
fig.layout.width = 900
fig.show()
Protein Analysis#
Topics covered
Normalization and assessment of AOC abundance
Clustering and visualization
Statistical significance analysis
Normalization And Data Inspection#
help(sample.protein.normalize_reads)
# Normalize reads
# Three different methods available: CLR = Centered Log Ratio, NSP = Noise corrected and Scaled, asinh = Inverse Hyperbolictrans formation
sample.protein.normalize_reads('NSP')
# Ridge plot: shows distribution of AOC counts across all cells (bimodal = positive and negative cell populations)
sample.protein.ridgeplot(attribute='normalized_counts',
features=sample.protein.ids())
# This plot is helpful when titrating a custom AOC panel
import pandas as pd
import plotly.express as plx
read_counts = sample.protein.layers["read_counts"]
ids = sample.protein.col_attrs["id"].copy()
cell_barcode = pd.DataFrame(read_counts, columns=ids).sum()
cell_barcode.index.rename("Antibody", inplace=True)
cell_barcode = cell_barcode.reset_index(name="Cell Barcode Count")
cell_barcode['percent'] = (cell_barcode['Cell Barcode Count'] / cell_barcode['Cell Barcode Count'].sum()) * 100
fig = plx.bar(cell_barcode, x='Antibody', y='percent')
fig.show()
# Or make a pie chart
#fig = plx.pie(cell_barcode, values='Cell Barcode Count',names='Antibody', title='Count of Cell Barcodes')
#fig.show()
# Plotting combinations of AOCs on a biaxial scatterplot, to mimic FACS data
fig = sample.protein.feature_scatter(layer='normalized_counts', ids=['CD19', 'CD45'])
go.Figure(fig)
# List all AOCs
sample.protein.ids()
# Optionally, remove AOCs from the data set prior to clustering
# For example, those that don't display any signal in any cells
sample.protein = sample.protein.drop(['Mouse IgG1k', 'CD25'])
Clustering And Visualization#
help(sample.protein.run_pca)
# Run PCA on the protein dataset to visualize how many PCs contribute the most to variation
# Number of components should equal the number of AOCs remaining in the dataset
sample.protein.run_pca(attribute='normalized_counts', components=45,show_plot=True, random_state=42)
# Rerun PCA with optimal number of components based on elbow plot analysis
# Typically, components = 10 is appropriate for 45-plex Biolegend panel
sample.protein.run_pca(attribute='normalized_counts', components=10, show_plot=False, random_state=42)
# Now run UMAP on the chosen PCs
# UMAPs rely on an initial randomization which leads to different projections every time
# To address this, pass random_state to the run_umap method
# See https://jlmelville.github.io/uwot/abparams.html for appropriate spread/min_dist values
sample.protein.run_umap(attribute='pca', random_state=42) #, spread=, min_dist=
###verbose=True
# Cluster the AOCs. Several options are available, see help info for more options
# Graph-community clustering: the higher the k value, the smaller the number of clusters --> adjust if necessary
sample.protein.cluster(attribute='umap', method='graph-community',k=300, random_state=42)
# Plot UMAP for visualization of clusters
fig = sample.protein.scatterplot(attribute='umap',colorby='label')
go.Figure(fig)
The protein clusters defined by the PCA + UMAP can be regrouped by utilizing the lasso tool. First, the UMAP needs to be replotted. On the new version of the plot, clusters can be encircled with the lasso tool, found in the upper right corner of the graph. Once this step is complete, the new cluster labels can be assigned in the following block.
# Plot UMAP for lasso cluster selection
# Be sure that selected_bars is clear before running for lasso selection
# Don't use go.Figure()
sample.protein.selected_bars = {}
sample.protein.scatterplot(attribute='umap', colorby='label')
# Assign new labels (run this right after selecting your new clusters)
sample.protein.set_selected_labels()
# Plot heatmap of normalized, clustered protein data
fig = sample.protein.heatmap(attribute='normalized_counts')
go.Figure(fig)
help(sample.protein.scatterplot)
# Plot UMAP for visualization of clusters
fig = sample.protein.scatterplot(attribute='umap',colorby='label')
go.Figure(fig)
# UMAP colored by the expression of each AOC
fig = sample.protein.scatterplot(attribute='umap',
colorby='normalized_counts',
features=sample.protein.ids())
go.Figure(fig)
# Relabel clone names acccording to biology, combine clones by assigned identical names
# Just an example! Need to fill this in based on knowledge of the sample and markers
sample.protein.rename_labels(
{
'1': 'Cell type A',
'2': 'Cell type B',
'3': 'Cell type C',
'4': 'Cell type D',
'5': 'Cell type E'
}
)
# Run heatmap again with new labels
fig = sample.protein.heatmap(attribute='normalized_counts')
go.Figure(fig)
# Drop cells (barcodes) that need to be removed
# Note: If no cells are dropped, this code will producs an error (ignore)
fp_barcodes2 = sample.protein.barcodes("FP")
sample.protein = sample.protein.drop(fp_barcodes2)
set(sample.protein.get_labels())
# Additionally, option to remove AOCs from the data
sample.protein = sample.protein.drop(['CD25'])
# Shape of cleaned up protein data (number of cells, number of AOCs)
sample.protein.shape
# Re-run heatmap with new filtered data
fig = sample.protein.heatmap(attribute='normalized_counts')
go.Figure(fig)
# Show heatmap with convolve and subclustering turned off
bars = sample.protein.clustered_barcodes('normalized_counts', subcluster=False)
# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to determine
fig = sample.protein.heatmap('normalized_counts', bars_order=bars, convolve=20)
fig.layout.width = 900
fig.show()
# Signature will create a dataframe based on the layer defined and statistical value listed
# This can be used for: mean, median, mode, or std
sample.protein.signature('normalized_counts', 'median')
# signaturemap will plot a heatmap of the signature dataframe created above
# The labels list will control the order of the clusters along the y-axis
sample.protein.signaturemap('normalized_counts') #labels=[]
Statistics#
This will run a t-test on the protein clusters defined above.
pval_protein, tstat_protein = sample.protein.test_signature("normalized_counts")
pval_protein
pval_protein = pval_protein + 10 ** -50
pvals_protein = -np.log10(pval_protein) * (tstat_protein > 0)
# Colored tiles indicate strong association of a protein with a particular cluster
plt.figure(figsize=(20, 10))
fig = sns.heatmap(pvals_protein.T, vmax=50, vmin=40)
Combined Visualizations#
In this section the dateset will be subset to only retain barcodes with remaining DNA, CNV and Protein data. Then this data can be plotted together using sample.heatmap()
, sample.signaturemap()
, sample.clone_vs_analyte()
and other functions.
Overlaps in the dataset across different assays can be further quantified and visualized, using sample.crosstabmap()
.
# This will return the barcodes common to all assays in the sample.
sample.common_barcodes()
# Use that to filter the sample so that only the common barcodes are present in all assays
sample = sample[sample.common_barcodes()]
# Check dimensionality for each assay; the number of cells (first number) should be the same in each data set
print(sample.dna.shape,
sample.cnv.shape,
sample.protein.shape)
# DNA + CNV + Protein heatmap
fig = sample.heatmap(
clusterby=('dna', 'cnv', 'protein'), # The first assay is used for the labels
attributes=['AF', 'ploidy', 'normalized_counts'],
features=[None, 'genes', None], # If None, then clustered_ids is used
bars_order=None, # The order of the barcodes
widths=None,
order=('dna', 'cnv', 'protein') # The order in which the heatmaps should be drawn
)
fig.show()
# DNA + CNV heatmap
fig = sample.heatmap(
clusterby=('dna', 'cnv'), # The first assay is used for the labels
attributes=['AF', 'ploidy'],
features=[None, 'genes'], # If None, then clustered_ids is used
bars_order=None, # The order of the barcodes
widths=None,
order=('dna', 'cnv') # The order in which the heatmaps should be drawn
)
fig.show()
# DNA + Protein heatmap
fig = sample.heatmap(
clusterby=('dna', 'protein'), # The first assay is used for the labels
attributes=['AF', 'normalized_counts'],
features=[None, None], # If None, then clustered_ids is used
bars_order=None, # The order of the barcodes
widths=None,
order=('dna', 'protein') # The order in which the heatmaps should be drawn
)
fig.show()
help(sample.signaturemap)
# Plot a combined signature heatmap, showing DNA, CNV and Protein signatures for all subclones
# To only plot 2 of the analytes
# Adjust the code below by removing the arguments for the analyte that should be removed
sample.signaturemap(
clusterby=('dna', 'cnv', 'protein'),
attributes=('NGT', 'ploidy', 'normalized_counts'),
features=[None, 'genes', None],
signature_kind=['median', 'median', 'median'],
widths=None,
order=('dna', 'cnv', 'protein') # The order in which the heatmaps should be drawn
)
help(sample.clone_vs_analyte)
# Clone vs analyte
# Visualize the CNV data stratified by clone
fig = sample.clone_vs_analyte('cnv')
plt.gcf().axes[1].texts[-1].set_text('')
fig
#fig.savefig('genotype_cnv.png')
# Visualize the protein data (violin plots) stratified by clone
fig = sample.clone_vs_analyte('protein')
plt.gcf().axes[1].texts[-1].set_text('')
fig
# To review a subset of features, first subset the assay itself
sample2.protein = sample.protein[:, ['CD11b', 'CD19', 'CD38', 'CD90']]
fig = sample2.clone_vs_analyte('protein')
plt.gcf().axes[1].texts[-1].set_text('')
fig
# Plot the protein heatmap with the DNA clone labels
fig = sample.protein.heatmap(attribute='normalized_counts',splitby=sample.dna.row_attrs['label'])
go.Figure(fig)
# Plot the protein UMAP and color with the DNA clone labels
protein_umap = sample.protein.row_attrs["umap"]
fig=sample.dna.scatterplot(attribute=protein_umap, colorby="label")
go.Figure(fig)
# Plot the protein UMAP and color with variant layers (e.g., VAFs) for specific variants
protein_umap = sample.protein.row_attrs["umap"]
feats = sample.dna.ids()[:3] # plot the first 3 variants
sample.dna.scatterplot(attribute=protein_umap, colorby="AF", features=feats)
# Plot the protein UMAP and color with ploidy for the defined gene
gene_ploidy = sample.cnv.signature("ploidy", splitby=None, features="gene_name")
tp53_ploidy = gene_ploidy["TP53"].values
sample.protein.scatterplot("umap", colorby=tp53_ploidy)
# Plot the AOC ridge plots split by DNA clone
sample.protein.ridgeplot(attribute='normalized_counts',
features=['CD11b', 'CD19', 'CD38', 'CD90'],
splitby=sample.dna.row_attrs['label'])
# Plot the AOC ridge plots split by protein cluster
sample.protein.ridgeplot(attribute='normalized_counts',
features=['CD11b', 'CD19', 'CD38', 'CD90'],
splitby=sample.protein.row_attrs['label'])
# Plot the biaxial AOC plot split by DNA clone
fig = sample.protein.feature_scatter(layer='normalized_counts', ids=['CD19', 'CD34'], colorby = sample.dna.row_attrs['label'])
go.Figure(fig)
Quantify overlaps between DNA clones and Protein clusters#
# Use crosstab() to calculate the overlap between DNA subclones and Protein Clusters
# This will generate the dataframe
# What percentage of each Protein cluster is defined by each Subclone
sample.dna.crosstab(sample.protein.get_labels(), normalize='columns')
# This will generate the heatmap of the dataframe generated above
# What percentage of each Protein cluster is defined by each Subclone
sample.dna.crosstabmap(sample.protein.get_labels(), normalize='columns').show()
# This will generate the dataframe
# What percentage of each Subclone is defined by each Protein cluster
sample.protein.crosstab(sample.dna.get_labels(), normalize='columns')
# This will generate the heatmap of the dataframe generated above
# What percentage of each Subclone is defined by each Protein cluster
sample.protein.crosstabmap(sample.dna.get_labels(), normalize='columns').show()
Export and Save Data#
In this section the data can export a filtered .h5 file, which will contain all new labels/layers, and contain only the filtered barcodes/cells remaining. All data can be exported (row attributes, column attributes and layers) for every assay (DNA and CNV) into easily parsable .csv tables.
# Save new h5 file that includes only the final, cleaned dataset
ms.save(sample, 'FilteredData.h5')
# Export data into csv formats
ms.to_zip(sample, 'filename')
Appendix#
Compass imputation#
This section of the notebook is OPTIONAL
Compass can be used to impute the genotypes of cells with some missing data. It can also be used to infer the phylogeny of all subclones present in the sample. If a cell cannot be unambiguously classified, Compass will label these cells as Mixed or Ambiguous. Compass can take ~1-10 minutes to run depending on the size of the data.
The Compass publication can be found here
Note: Compass can give different subclone composition than the variant subclone workflow.
# Use compass to assume the subclone architecture and phylogeny
# Depending on the size and complexity of the data, this step can take ~1-10 minutes
# Use full_sample instead of sample to include all previously removed cells and use the unannotated variant names
compass = COMPASS(full_sample, somatic_variants=variants)
compass.run()
# The phylogentic tree prediction by COMPASS
fig = compass.plot_tree()
fig.show()
# Make the compass_labels the node descriptions
# These are the new subclone labels
labs = compass.labels_ # Stores the labels
desc = compass.node_descriptions() # Stores the dict mapping the label to the description
compass_labels = np.array([desc.get(lab, lab) for lab in labs])
# Store the compass labels as a row_attr
full_sample.dna.add_row_attr('compass_labels', compass_labels)
# Store the index positions of the COMPASS ids for the heatmap
idx = np.isin(full_sample.dna.ids(), compass.somatic_variants)
# Visualize the assignment using the variants passed to COMPASS
fig = full_sample.dna.heatmap('NGT', features=full_sample.dna.ids()[idx], splitby='compass_labels')
fig.show()
# Create a DNA heatmap to see both the variant workflow subclone labels and the compass assigned labels
# Parameters
attribute = 'NGT'
assay = full_sample.dna
lab_row1 = "label"
lab_row2 = "compass_labels"
lab2_title = "Compass"
# Graphing
data = assay.get_attribute(attribute, constraint="row+col")
ids = assay.clustered_ids(attribute)
labels = pd.concat([assay.get_attribute(lab_row1), assay.get_attribute(lab_row2).rename(columns={"1": "2"})], axis=1)
bars = assay.clustered_barcodes(attribute, splitby=labels)
data = data.loc[bars, ids]
labs1 = assay.get_attribute(lab_row1, constraint="row").loc[bars, :].values.flatten()
labs2 = assay.get_attribute(lab_row2, constraint="row").loc[bars, :].values.flatten()
fig = MultiMap([data], labs1, names=[attribute], extra_column_titles=[lab2_title], widths=[1 / 24, 1]).draw()
fig.add_trace(Heatmap(data, labs2).label_trace(), row=1, col=2)
fig.layout.xaxis2.tickvals = []
fig
# This will return a dataframe
# Showing the overlap of variant workflow clones and compass identified clones
full_sample.dna.crosstab(compass_labels, normalize='columns')
# This will plot a heatmap of the crosstab dataframe
full_sample.dna.crosstabmap(compass_labels, normalize='columns').show()
Export data to use in COMPASS
# Export compass data into 2 output files
# compass_input_regions.csv and compass_input_variants.csv
# These can be used with COMPASS outside of Mosaic
compass = COMPASS(sample, sample.dna.filter_somatic_variants())
compass.prepare_input(prefix="./compass_input")
Protein assignment#
This section of the notebook is OPTIONAL
This portion of the notebook utilizes known truth about different cell types based on protein expression, to cluster cells into the defined cell types. The known truth built into the tool is based on our catalog Total Seq-D Heme protein panel, which contains 45 AOCs. The truth was defined from data on the cell ontology database and the European Bioinformatics Institute Website and works best for PBMCs. Other cell-types may benefit from custom truths.
More information about the algorithm can be found here.
Note: For protein normalization above, either NSP or ANSP need to be used. CLR or asinh are not supported.
The cluster label can be one of the following:
Sticky - these are positive for any of the IgG markers (defined using the
sticky_antibodies
parameter).Mixed - these are cells which have a profile of 2 clusters (see the
doublet_distance
parameter)Mixed like - these cells have the profile which satisfies the
Mixed
criteria, but their proportions are too large compared to their singlet proportions. (seemax_adjusted_mixing
)Labeled - Clusters that match the given truth with sufficient probability (see
min_prob_diff
)Unassiged-{n} - A cluster that couldn’t be assigned to any of the above types
# Visualize the structure of the built-in truth
# The names that are underlines are links to the source of truth
truth = Truth.builtin()
truth.plot()
# Cluster the cells based on protein data and label them
pace = sample.protein.cluster_and_label(
max_adjusted_mixing=0.3, # This parameter controls which mixed clusters would be labelled "Mixed Like"
min_distance_for_doublet=5, # Increase this if too many mixed cells are observed.
cluster=True, # If True, graph-community clustering is run, otherwise the existing cluster labels are used
)
# Visualize the clustering
bars = sample.protein.clustered_barcodes("normalized_counts", subcluster=False)
fig = sample.protein.heatmap("normalized_counts", bars_order=bars)
fig.layout.width = 1000
fig
# Visualize the complete truth that was used in cluster_and_label
specific_truth = pace.cluster_truth()
specific_truth.plot()
# Recluster the data using the complete truth
# It will not find new clusters, but it will categorize the cells into the known cell types
sample.protein.assign_from_truth(specific_truth)
# Plot the protein clusters with the complete truth
bars = sample.protein.clustered_barcodes("normalized_counts", subcluster=False)
fig = sample.protein.heatmap("normalized_counts", bars_order=bars)
fig.layout.width = 1000
fig.show("jpg")
Using predefined clusters or custom clustering
Clustering performed above in the notebook can be feed into protein assignment for the clusters to be labeled. Simply skip the previous cluster_and_label method and run the below code.
# To use clustering performed earler on
truth = sample.protein.truth()
sample.protein.assign_from_truth(truth)
sample.protein.heatmap("normalized_counts")
Modifying the truth
The truth can be modified in many ways. The following blocks of code show examples on how to modify the truth.
# Subset the truth to a few cell types
subset_truth = truth[["T cell", "B cell", "NK cell", "Monocyte"]]
subset_truth.plot()
# Remove cell types from the truth
trimmed_truth = truth.trim("T cell", include=False)
trimmed_truth.plot()
# Remove AOCs from the truth
trimmed_truth = truth.del_ab("CD3")
trimmed_truth.plot()
# Add a new cell type to the truth
source_link = "https://jdc.jefferson.edu/cgi/viewcontent.cgi?article=1083&context=pacbfp#:~:text=CD56%20is%20an%20adhesion%20molecule,tissue%20and%20in%20granular%20lmphocytes.&text=Aberrant%20expression%20of%20CD56%20has,leukemia%20and%20plasma%20cell%20myeloma."
updated_truth = truth.add_cell("CD56+ Monocyte", ["CD56+"], parent="Monocyte", source=source_link)
updated_truth.plot()
Exporting the truth
The defined truth can be exported and shared.
# Save the truth using the `to_yaml` function
updated_truth.to_yaml("./my-truth-file.yaml", version="My version")
# The yaml file can be read using the read_yaml function
Truth.read_yaml("./my-truth-file.yaml").plot()
# Set a yaml as the truth to be used in clustering
new_truth = Truth.read_yaml("./my-truth-file.yaml").plot()
# Run the cluster_and_label with the truth argument pointing to the yaml truth
pace = sample.protein.cluster_and_label(
truth=new_truth
max_adjusted_mixing=0.3, # This parameter controls which mixed clusters would be labelled "Mixed Like"
min_distance_for_doublet=5, # Increase this if too many mixed cells are observed.
cluster=True, # If True, graph-community clustering is run, otherwise the existing cluster labels are used
)