Multisample DNA + Protein notebook#

This notebook is set up to work with merged h5 files (containing more than one sample) generated with v2 or v3 chemistry. This notebook will only work with Mosaic versions 3.0.1 and higher.

Objective: To showcase the minimum number of steps required for tertiary analysis of DNA (single-cell genotyping and CNV) and Protein analytes and explore different ways of visualizing the data.

Major questions answered:

  1. Can we identify DNA clones based on genotypes (SNVs/Indels)?

  2. How is the clonal structure different between samples?

  3. Do we detect CNV events (e.g., copy number amplification, copy number loss)?

  4. Do DNA clones (genotypes) correlate with specific protein-defined cell types and/or CNV profiles?

  5. Is there a significant difference in protein-defined cell types between samples?

Sections:

  1. Setup

  2. Data Structure

  3. DNA Analysis

  4. CNV Analysis

  5. Protein Analysis

  6. Combined Visualizations

  7. Export and Save Data

  8. Appendix

Not shown: All available methods and options - documented here

Setup#

Topics covered

  1. Loading required packages and data.

  2. Structure and contents of data objects.

Load data
Note: importing dependencies can sometimes take a couple of minutes.

# Import mosaic libraries
import missionbio.mosaic as ms

# Import these to display entire dataframes
from IPython.display import display, HTML

# Import graph_objects from the plotly package to display figures when saving the notebook as an HTML
import plotly.express as px
import plotly.graph_objects as go

# Import additional packages for specific visuals
import matplotlib.pyplot as plt
import plotly.offline as pyo
pyo.init_notebook_mode()
import numpy as np
import seaborn as sns

# Import COMPASS for imputation
from missionbio.mosaic.algorithms.compass import COMPASS

# Note: when exporting the notebook as an HTML, plots that use the "go.Figure(fig)" command are saved
# This code is optional, but will make the notebook cells/figures display across the entire width of your browser
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# Check version; this notebook is designed for Mosaic v3.0.1 or higher
print(ms.__version__)
# Any function's parameters and default values can be looked up via the 'help' function
# Here, the function is 'ms.load'
help(ms.load)
# Specify the h5 file to be used in this analysis: h5path = '/path/to/h5/file/test.h5'
# If working with Windows, you may need to add an 'r' before the path: h5path = r'/path/to/h5/file/test.h5'
h5path = '/Users/botero/Desktop/Sample Data/pbmc.multisample.h5'

# Load the data
group = ms.load(h5path, raw=False, filter_variants=True, filter_cells=False, whitelist = [])
# Print the list of samples in the group object
[s.name for s in group.samples]

# Always set raw=False; if raw=True, ALL barcodes will be loaded (rather than cell-associated barcodes)
# Always set filter_variants=True unless you can't detect an expected (target) variant. Additional filtering options are included in the DNA section below
# Always set single=False for multisample/merged h5 files
# The whitelist option loads any variant that is in the vcf.gz file (e.g. "chr1:179520511:C/G"); similar to whitelist feature in Tapestri Insights v2
# Always set filter_cells=False; if filter_cells=True, only genomically complete cells will be loaded 
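
For example, a minimal variant of the load call above that also whitelists a specific variant is shown below (commented out); the variant string is just the format example from the comment above, not a variant expected in this dataset.

# group = ms.load(h5path, raw=False, filter_variants=True, filter_cells=False, whitelist=["chr1:179520511:C/G"])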

Data Structure#

DNA, CNV, and Protein are sub-classes of the Assay class. The information is stored in four categories, and the user can modify each of those:

1. metadata (add_metadata / del_metadata):

  • dictionary containing metrics of the assay

2. row_attrs (add_row_attr / del_row_attr):

  • dictionary which contains ‘barcode’ as one of the keys (where the value is a list of all barcodes)

  • for all other keys, the values must be of the same length, i.e. match the number of barcodes

  • this is the attribute where ‘label’, ‘pca’, and ‘umap’ values are added

3. col_attrs (add_col_attr / del_col_attr):

  • dictionary which contains ‘id’ as one of the keys (where the value is a list of all ids)

  • for DNA assays, ‘ids’ are variants; for Protein assays, ‘ids’ are antibodies

  • for all other keys, the values must be of the same length, i.e. match the number of ids

4. layers (add_layer / del_layer):

  • dictionary which contains critically important assay metrics

  • all the values have the shape (num barcodes) x (num ids)

  • for DNA assays, this includes AF, GQ, DP, etc. (per cell, per variant)

  • for Protein assays, this includes read counts (per cell, per antibody)

  • this is the attribute where ‘normalized_counts’ will be added
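
As a quick illustration of these four categories, the sketch below adds and then removes a placeholder row attribute and layer on the first sample's DNA assay; the names 'toy_label' and 'toy_layer' and their values are arbitrary and are not needed for the rest of the analysis.

# Minimal sketch: add and remove a placeholder row attribute and layer (not required downstream)
example = group.samples[0]
n_cells, n_vars = example.dna.shape

# row_attrs hold one value per barcode
example.dna.add_row_attr('toy_label', np.array(['unassigned'] * n_cells))

# layers hold one value per barcode per id
example.dna.add_layer('toy_layer', np.zeros((n_cells, n_vars)))

# remove them again so the data object is left unchanged
example.dna.del_row_attr('toy_label')
example.dna.del_layer('toy_layer')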

# Summary of DNA assay, per sample
for sample in group:
    print(sample.name)
    print("\'sample.dna\':", sample.dna, '\n')
    print("\'row_attrs\':", "\n\t", list(sample.dna.row_attrs.keys()), '\n')
    print("\'col_attrs\':", "\n\t", list(sample.dna.col_attrs.keys()), '\n')
    print("\'layers\':", "\n\t", list(sample.dna.layers.keys()), '\n')
    print("\'metadata\':", "\n")
    for i in list(sample.dna.metadata.keys()):
        print("\t", i, ": ", sample.dna.metadata[i], sep="")
    print("\n")
# For DNA, ids are variants
# sample.dna.ids() is a shortcut for sample.dna.col_attrs['id']
for sample in group:
    print(sample.name, "\n", sample.dna.ids(), "\n")
# Summary of Protein assay, per sample
for sample in group:
    print(sample.name)
    print("\'sample.protein\':", sample.protein, '\n')
    print("\'row_attrs\':", "\n\t", list(sample.protein.row_attrs.keys()), '\n')
    print("\'col_attrs\':", "\n\t", list(sample.protein.col_attrs.keys()), '\n')
    print("\'layers\':", "\n\t", list(sample.protein.layers.keys()), '\n')
    print("\'metadata\':", "\n")
    for i in list(sample.protein.metadata.keys()):
        print("\t", i, ": ", sample.protein.metadata[i], sep="")
    print("\n")
# For Protein, ids are AOCs
# sample.protein.ids() is a shortcut for sample.protein.col_attrs['id']
for sample in group:
    print(sample.name, "\n", sample.protein.ids(), "\n")

DNA Analysis#

Topics covered

  1. Standard filtering of DNA variants.

  2. Subsetting dataset for variants of interest, including whitelisted variants.

  3. Addition of annotations to the variants.

  4. Manual variant selection and clone identification.

  5. OPTIONAL: Use Compass to impute missing data and infer a clonal phylogeny.

  6. Visualizations and customization options.

Basic filtering#

There are many options for filtering DNA variants. Use the help() function to understand the approach listed below.

# For additional information visit: https://support.missionbio.com/hc/en-us/articles/360047303654-How-do-the-Advanced-Filters-work-
help(sample.dna.filter_variants)
# Filter the variants, similar to the "Advanced Filters" in Tapestri Insights v2.2

# One major difference: The filter "REMOVE GENOTYPE IN CELLS WITH ALTERNATE ALLELE FREQ" 
# is replaced with the following 3 zygosity-specific filters: vaf_ref, vaf_hom, vaf_het

# In general, these additional filters remove further false positives from the data.

# Adjust filters if needed by overwriting dna_vars
def custom_filt(sample):
    filt_vars = sample.dna.filter_variants(
        min_dp=10,
        min_gq=30,
        vaf_ref=5,
        vaf_hom=95,
        vaf_het=30,
        min_prct_cells=50,
        min_mut_prct_cells=1,
        iterations=10
    )
    return filt_vars

dna_vars = group.apply(custom_filt)

# List all filtered variants
# Check the number of filtered variants
for i in range(len(group.samples)):
    sample = group.samples[i]
    print(sample.name)
    print("Number of variants:", len(dna_vars[i]))
    print(dna_vars[i], "\n")
# Subset the same variants in all dna assays
# It is important to maintain the same variants across all dna assays

# Establish the number of variants for each sample
og_num_vars = [s.dna.shape[1] for s in group.samples]

#Subset all samples to look at the union of all variants
var_union = list(set().union(*dna_vars))
for sample in group:
    sample.dna = sample.dna[:, var_union]  # Subsets all samples with the same variants

# Establish the new number of variants for each sample
new_num_vars = [s.dna.shape[1] for s in group.samples]

print(og_num_vars, new_num_vars)  # The old and new number of variants for each sample in the group
# Merge the samples into one data set
combined = group.merge()

First, specify variants of interest using one of the three options below:

  1. Use the filtered variants from the above section (dna_vars)

  2. Use specific variants of interest (whitelist)

  3. Combine 1 & 2: filtered variants plus whitelist

Then, this list (final_vars) is used to reduce the larger data set to only your variants of interest.

# Specify the whitelist; variants may be copied and pasted from Tapestri Insights v2.2,
# but ensure correct nomenclature, i.e. whitelist = ["chr13:28589657:T/G","chrX:39921424:G/A"]
# If there are no whitelist variants, leave the whitelist empty

variants = combined.dna.ids()

white_list = []

# Combine whitelisted and filtered variants
final_vars = list(set(list(variants) + white_list))

# Want to use your whitelist only?
# final_vars = white_list
# Check the length of your final list of variants
len(final_vars)
# Dimensionality of the original sample.dna dataframe
# First number = number of cells (rows); second number = number of variants (columns)
combined.dna.shape
# Before subsetting, verify that all the chosen variants are in the current sample.dna ids (should return True)
print(set(final_vars).issubset(set(combined.dna.ids())))
# Subsetting sample.dna (columns) based on reduced variant list. Keeping all cells that passed filtering
combined.dna = combined.dna[:, final_vars]
# Check the shape of the final filtered DNA object, i.e. (number of barcodes/cells, number of ids/variants)
combined.dna.shape

Annotation Addition#

help(sample.dna.get_annotations)
# Fetch annotations using varsome
# Note: run this on a filtered DNA sample - too many variants (e.g., 100+) are not handled correctly by the method
ann = combined.dna.get_annotations()  

Variant selection and Subclone identification#

In this section of the notebook, all variants remaining in the data will populate a variant table. This table is interactive: variants can be selected, and rows can be sorted by ascending/descending values. Clicking a variant name opens that variant's VarSome URL in your default browser.

  1. Variants selected in this table will populate in a subclone table below.

  2. The variants in the subclone table can be highlighted and assessed for Read Depth, Genotype Quality and Variant Allele Frequency.

  3. Subclones can be renamed by clicking on the pencil icon (they can also be renamed programmatically; see the optional sketch after the clone-removal step below).

  4. ADO score: The ADO score threshold can be adjusted, but by default is set to 1. Any clones with an ADO score higher than this will be moved into the ADO subclones column. We recommend moving any clones with a score of 0.8 or higher into this column.

  5. Min clone size: The Min Clone Size can also be adjusted, but by default is set to 1. Any clones that represent less than 1% of the sample will be moved into the Small Subclones column.

  6. Cells with missing genotype information across any of the selected variants will be moved into the missing GT column.

#Run the variant table workflow to select variants and begin clone identification
wfv = ms.workflows.VariantSubcloneTable(combined)
wfv.run()
# Subsample the Variants to only the variants selected from the workflow
variants = wfv.selected_variants
combined.dna = combined.dna[:,variants]

# Save the full set of cells to a new variable that you can call on later
# Do this before renaming the variants
full_combined = combined[:]
# Rename your variants
# Any of these column values can be added to the id names
combined.dna.col_attrs.keys()
# Add annotation to the id names
combined.dna.set_ids_from_cols(['annotated Gene', 'id'])

# Another example:
# sample.dna.set_ids_from_cols(['annotated Gene', 'CHROM', 'POS', 'REF', 'ALT'])

# Annotations are now added to the variants
combined.dna.ids()

# Use sample.dna.reset_ids() to get the original ids
# Plot heatmap using NGT_FILTERED.
fig = combined.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Clone removal
# Remove barcodes from the missing GT, small subclones, ADO subclones or FP labels
clones = ['missing', 'ADO', 'small', 'FP']
for c in clones:
    cells = combined.dna.barcodes({c})
    if len(cells) > 0:
        combined.dna = combined.dna.drop(cells)
set(combined.dna.get_labels())
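
If you prefer to rename the remaining DNA subclones in code rather than with the pencil icon in the workflow table, the assay's rename_labels method (used later in the Protein section) can also be applied to the DNA assay. A commented-out sketch is shown below; 'Clone 1'/'Clone 2' and the new names are placeholders to be replaced based on your data.

# Optional: rename DNA subclones programmatically (placeholder names - adjust to your data)
#combined.dna.rename_labels(
#    {
#        'Clone 1': 'WT',
#        'Clone 2': 'Mutant A',
#    }
#)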
# Redraw heatmap
fig = combined.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Redraw heatmap, split by sample
fig = combined.dna.heatmap(attribute='NGT_FILTERED', splitby='sample_name')
go.Figure(fig)
# You can use the following line of code to change the color for heatmaps in the DNA, CNV or protein sections
# And rotate the tick labels
fig = combined.dna.heatmap(attribute='NGT_FILTERED')
fig.update_layout(title = combined.name)
fig.update_xaxes(tickangle = -45)
fig.layout.coloraxis.colorscale = 'viridis'
fig
# Evaluate new total number of cells after the above filtering
combined.dna.shape
# Subset the barcodes to match across all assays
# This will return the barcodes common to all assays in the sample.
combined.common_barcodes()

# Use that to filter the sample so that only the common barcodes are present in all assays
combined = combined[combined.common_barcodes()]
# Split merged object
# Need to first split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))
# Multisample DNA barplot
fig = group.barplot("dna", percentage=True)

#If you want to change the color of certain clones, you can try something like the following:
#fig.data[0].marker.color = "red"
#fig.data[5].marker.color = "orange"
fig.update_layout(title = 'Clone size as a percentage of total population')
fig.show()
# Fishplot
# Multisample DNA fishplot
# Order your samples in sample_order and fill in the list with their specific name
fig = group.fishplot("dna", sample_order=['Sample 1', 'Sample 2'])
fig.layout.height = 750
fig.show()
# You can further specify labels and parents to reorganize the fishplot
# Want to hide the phylogeny tree? draw_graph=False
# Documentation here: https://missionbio.github.io/mosaic/autosummary_pages/methods/missionbio.mosaic.samplegroup.SampleGroup.fishplot.html#missionbio.mosaic.samplegroup.SampleGroup.fishplot
# Heatmap per sample
for sample in group:
    fig = sample.dna.heatmap(attribute='NGT_FILTERED',splitby="label")
    fig.update_layout(title = sample.name)
    go.Figure(fig)
    fig.show()
# Shape of each sample
for sample in group:
    print(sample.name, sample.dna.shape)

Adjusting subclone colors or heatmap colors#

If you want to change the color palette used for the subclones, you can use set_palette. For this you will need to provide a dictionary mapping ALL subclone names to the hexcode/color you want each subclone changed to. Note: this function also works with the protein and CNV assays. See below for an example of this.

If you want to change the color palette used for any of the assays/layers, you can use ms.Config.Colorscale. Then you can list the assay and layer, and assign the plotly colorscale you would like to use for those graphics. You can visualize all available colorscales by using plotly.colors.sequential.swatches_continuous().show(). If you want to reset all color palettes back to their defaults, you can use ms.Config.Colorscale.reset().

# Example of changing the colors assigned to each clone
sample_one = group.samples[0]
sample_two = group.samples[1]
sample_one.dna.set_palette({"Clone 1": "#800080", "Clone 2": "#FF69B4"})
sample_two.dna.set_palette({"Clone 1": "#800080", "Clone 2": "#FF69B4"})
# Change the color of a subclone for the entire group object
# These changes will be applied to the barplots and fishplots
# Example
pal = group.apply(lambda s: s.dna.get_palette())
pal = {k: v for d in pal for k, v in d.items()}

pal["Clone 1"] = 'LightCyan'
pal["Clone 2"] = 'Lavender'
pal["Clone 3"] = 'PaleGreen'

group.apply(lambda s: s.dna.set_palette(pal))

# Built in color options can be found here: https://www.w3schools.com/cssref/css_colors.php
# You can change colorScale to change the heatmap colors
# This will change the color of the dna.ngt layer to be viridis
for sample in group:
    ms.Config.Colorscale.Dna.NGT = 'viridis'
    fig = sample.dna.heatmap('NGT')
    fig.update_layout(title = sample.name)
    fig.show()

# If you want to reset this, you can run the following to reset just that one modification
#ms.Config.Colorscale.Dna.reset("NGT")
# Or you can run this to reset any/all modifications:
#ms.Config.Colorscale.reset()
# You can change the colorscale for Dna, Cnv or Protein

If you ever want to return to the original population of cells, you can reset the data using sample.reset("dna"). This ‘sample.reset’ command works on all assays, including CNV and protein.
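
A minimal, commented-out sketch of this reset, applied to every sample in the group (only run it if you actually want to undo the filtering performed above):

# Optional: undo the DNA filtering above and restore the originally loaded cells
# The same call works for the other assays, e.g. sample.reset("cnv") or sample.reset("protein")
#for sample in group:
#    sample.reset("dna")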

CNV Analysis#

Topics covered

  1. Amplicon filtering and ploidy estimation.

  2. Visualization of ploidy across subclones present

# Get gene names for amplicons
for sample in group:
    sample.cnv.get_gene_names()
# Define the samples
# Go through the CNV workflow once per sample
sample_one = group.samples[0]
sample_two = group.samples[1]

CNV workflow#

The CNV workflow cannot yet handle multisample objects, so the analysis must be done on one sample at a time.

This workflow will normalize all reads and filter amplicons/cells based on the settings set at the beginning of the workflow:

  1. Amplicon completeness: refers to the minimum percentage of barcodes for an amplicon that must have reads greater than or equal to the minimum read depth set. By default this is set to 50.

  2. Amplicon read depth: refers to the minimum read depth for each amplicon-barcode combination to not be considered missing. By default this is set to 10.

  3. Mean cell read depth: refers to the minimum mean read depth for a cell to be included in the analysis, otherwise the cell will be removed. By default this is set to 40.

  4. Diploid clone in DNA: refers to which subclone you are setting as the true diploid population. All ploidy estimates will be calculated relative to this diploid population. We recommend setting this to your ‘WT’ population or the most ancestral clone present.

Note: These settings are fairly stringent. If you are expecting large copy number events, you may want to reduce Amplicon Completeness and Amplicon Read Depth to recover these events.

Once the above filters are set, the visualizations can be changed.

  1. Plot: can be set to Heatmap positions, Heatmap genes, Line-plot positions, or Line-plot genes

  2. Clone for line plot: If one of the line-plot visualizations is selected, only one clone can be shown at a time. This determines which one is plotted.

  3. X-axis features: If you would prefer to only plot a subset of the data (chromosomes or genes), you can select which chromosomes/genes you would like plotted with this function. Chromosomes can be selected for ‘positions’ type plots, and Genes can be selected for ‘genes’ type plots.

# CNV workflow to filter, normalize and estimate ploidy
wfc = ms.workflows.CopyNumber(sample_one)
wfc.run()
# Amplicon completeness refers to the min percentage of barcodes for an amplicon that must have reads >= the read_depth
# Read depth is the min required depth for each amplicon_barcode to not be considered as missing
# Mean cell read depth refers to the min mean read depth for a cell to be included, otherwise it is removed
# Heatmap with the features ordered by the default amplicon order
# If you want to plot just a subset of chromosomes, you can put them in list format for features
fig = sample_one.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']

# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0

# Optionally, change the size of the figure:
#fig.layout.width = 1600
#fig.layout.height = 1500
fig.update_layout(title = sample_one.name)
go.Figure(fig)
# Heatmap with the features grouped by the genes
# If you want to plot just a subset of genes, you can put them in list format for features
fig = sample_one.cnv.heatmap('ploidy', features='genes', convolve=1) #features=["ASXL1", "EZH2",'TP53']

# Optionally, update the separating lines to be black
#for shape in fig.layout.shapes:
#    shape.line.color = '#000000'
fig.update_layout(title = sample_one.name)
go.Figure(fig)
# Show heatmap with convolve and subclustering turned off
bars = sample_one.cnv.clustered_barcodes('ploidy', subcluster=False)

# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to determine
fig = sample_one.cnv.heatmap('ploidy', bars_order=bars, convolve=20)
fig.layout.width = 900
fig.update_layout(title = sample_one.name)
fig.show()
# Signature will create a dataframe based on the layer you want to look at and which statistical value you want to view
# You can see: mean, median, mode, or std with this function
sample_one.cnv.signature('ploidy', 'median')
# signaturemap will plot a heatmap of the signature dataframe created above
# The labels list will control the order of the subclones along the y-axis
fig = sample_one.cnv.signaturemap('ploidy') #labels=[]
fig.update_layout(title=sample_one.name)
fig.show()

Note: The following is a partial duplication of the code above, used to run the CNV workflow on the second (and any subsequent) sample.

# Go through the workflow for any additional samples
# CNV workflow to filter, normalize and estimate ploidy
wfc = ms.workflows.CopyNumber(sample_two)
wfc.run()
# Amplicon completeness refers to the min percentage of barcodes for an amplicon that must have reads >= the read_depth
# Read depth is the min required depth for each amplicon_barcode to not be considered as missing
# Mean cell read depth refers to the min mean read depth for a cell to be included, otherwise it is removed
# Heatmap with the features ordered by the default amplicon order
# If you want to plot just a subset of chromosomes, you can put them in list format for features
fig = sample_two.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']

# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0

# Optionally, change the size of the figure:
#fig.layout.width = 1600
#fig.layout.height = 1500
fig.update_layout(title = sample_two.name)
go.Figure(fig)
# Heatmap with the features grouped by the genes
# If you want to plot just a subset of genes, you can put them in list format for features
fig = sample_two.cnv.heatmap('ploidy', features='genes', convolve=1) #features=["ASXL1", "EZH2",'TP53']

# Optionally, update the separating lines to be black
#for shape in fig.layout.shapes:
#    shape.line.color = '#000000'
fig.update_layout(title = sample_two.name)
go.Figure(fig)
# Show heatmap with convolve and subclustering turned off
bars = sample_two.cnv.clustered_barcodes('ploidy', subcluster=False)

# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to determine
fig = sample_two.cnv.heatmap('ploidy', bars_order=bars, convolve=20)
fig.layout.width = 900
fig.update_layout(title = sample_two.name)
fig.show()
# Signature will create a dataframe based on the layer you want to look at and which statistical value you want to view
# You can see: mean, median, mode, or std with this function
sample_two.cnv.signature('ploidy', 'median')
# signaturemap will plot a heatmap of the signature dataframe created above
# The labels list will control the order of the subclones along the y-axis
fig = sample_two.cnv.signaturemap('ploidy') #labels=[]
fig.update_layout(title=sample_two.name)
fig.show()

Note: The next few cells are highly optional, and only need to be run if you want to plot cells from all samples together in one heatmap.

# Plot heatmaps of all samples together?
combined = group.merge()
combined.cnv.heatmap('ploidy', splitby='sample_name')
# Subset the barcodes to match across all assays
# This will return the barcodes common to all assays in the sample.
combined.common_barcodes()

# Use that to filter the sample so that only the common barcodes are present in all assays
combined = combined[combined.common_barcodes()]
# Split merged object
# Need to split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))

Protein Analysis#

Topics covered

  1. Normalization and assessment of AOC abundance

  2. Clustering and visualization

  3. Statistical significance analysis

Normalization and Data Inspection#

help(sample.protein.normalize_reads)
# Subset the protein barcodes to match the dna and cnv barcodes
for sample in group:
    sample.protein = sample.protein[sample.dna.barcodes(),:]
# See the new shape of our protein data frame
for sample in group:
    print(sample.protein.shape)
# Normalize reads
# Three different methods are available: CLR = Centered Log Ratio, NSP = Noise corrected and Scaled, asinh = inverse hyperbolic sine transformation
for sample in group:
    sample.protein.normalize_reads('NSP') 
# Ridge plot: shows distribution of AOC counts across all cells (bimodal = positive and negative cell populations)
for sample in group:
    sample.protein.ridgeplot(attribute='normalized_counts',
                         features=sample.protein.ids()).show()
# Plotting combinations of AOCs on a biaxial scatterplot, to mimic FACS data
for sample in group:
    sample.protein.feature_scatter(layer='normalized_counts', ids=['CD8', 'CD4']).show()
#List all AOCs, per sample
for sample in group:
    print(sample.name, '\n', sample.protein.ids(), '\n')
# Optionally, remove AOCs from the data set prior to clustering 
# For example, those that don't display any signal in any cells
# Do this for each sample being analyzed
for sample in group:
    sample.protein = sample.protein.drop(['IgG2a'])
# Combine both samples into 1 object to cluster them together post-normalization
combined = group.merge()

Clustering and Visualization#

help(sample.protein.run_pca)
# Number of components should equal the number of AOCs remaining in the dataset
combined.protein.run_pca(attribute='normalized_counts', components=45,show_plot=True)
# Rerun PCA with optimal number of components based on elbow plot analysis
# Typically, components = 10 is appropriate for 45-plex Biolegend panel
combined.protein.run_pca(attribute='normalized_counts', components=10, show_plot=False)
# Now run UMAP on the chosen PCs 
# UMAPs rely on an initial randomization which leads to different projections every time
# To address this, pass random_state to the run_umap method
# See https://jlmelville.github.io/uwot/abparams.html for appropriate spread/min_dist values
combined.protein.run_umap(attribute='pca', random_state=42) #, spread=, min_dist=
# Cluster the AOCs. Several options are available, see help info for more options
# Graph-community clustering: the higher the k value, the smaller the number of clusters --> adjust if necessary
combined.protein.cluster(attribute='umap', method='graph-community',k=300, random_state=42) 
help(sample.protein.heatmap)
# Plot heatmap of normalized, clustered protein data
fig = combined.protein.heatmap(attribute='normalized_counts')
go.Figure(fig)
# If analyzing multiple samples, order barcodes by sample and then by cluster
fig = combined.protein.heatmap(attribute='normalized_counts',splitby='sample_name')
go.Figure(fig)
help(sample.protein.scatterplot)
# Plot UMAP for visualization of clusters
fig = combined.protein.scatterplot(attribute='umap',colorby='label')
go.Figure(fig)
# Plot UMAP for visualization of clusters, by sample
fig = combined.protein.scatterplot(attribute='umap',colorby='sample_name')
go.Figure(fig)
# UMAP colored by the expression of each AOC
fig = combined.protein.scatterplot(attribute='umap',
                           colorby='normalized_counts',
                           features=combined.protein.ids())
go.Figure(fig)
# Relabel cluster names according to biology; clusters assigned identical names will be combined
# Just an example! Need to fill this in based on knowledge of the sample and markers
# For cell-type calling, view these resources: https://www.frontiersin.org/articles/10.3389/fimmu.2019.02434/full
# Biolegend: https://www.biolegend.com/fr-lu/products/totalseq-d-human-heme-oncology-cocktail-v10-20465

combined.protein.rename_labels(
    {
        '1': 'Cell type A', 
        '2': 'Cell type B', 
        '3': 'Cell type C',
        '4': 'Cell type D',
        '5': 'Cell type E'
    }
)
# Drop cells (barcodes) that need to be removed, e.g. those labeled "FP" (false positive) during clustering
# Note: if there are no cells to drop, this code will produce an error (which can be ignored)
fp_barcodes2 = combined.protein.barcodes("FP")
combined.protein = combined.protein.drop(fp_barcodes2)
set(combined.protein.get_labels())
# Run heatmap again with new labels
fig = combined.protein.heatmap(attribute='normalized_counts',splitby='label')
go.Figure(fig)
# Need to first split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))
# Fish plot
fig = group.fishplot("protein", draw_graph=False) #,sample_order=['Sample1', 'Sample2'])
fig.layout.width = 1000
fig.layout.height = 850
fig.show()
# Multisample barplot:
group.barplot("protein", percentage=True)
# Additionally, option to remove more AOCs from the data at this stage (skip this cell if the AOC was already dropped above)
for sample in group:
    sample.protein = sample.protein.drop(['IgG2a'])
# Shape of cleaned up protein data (number of cells, number of AOCs)
for sample in group:
    print(sample.name, sample.protein.shape)
# Re-run heatmap with new filtered data
for sample in group:
    fig = sample.protein.heatmap(attribute='normalized_counts')
    fig.update_layout(title=sample.name)
    fig.show()
# Show heatmap with convolve and subclustering turned off, per sample
for sample in group:
    bars = sample.protein.clustered_barcodes('normalized_counts', subcluster=False)
    # This is useful to create "convolved" heatmaps which are easier to interpret
    # With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to determine
    fig = sample.protein.heatmap('normalized_counts', bars_order=bars, convolve=20)
    fig.update_layout(title=sample.name)
    fig.layout.width = 900
    fig.show()
# Signature will create a dataframe based on the layer you want to look at and which statistical value you want to view
# You can see: mean, median, mode, or std with this function
for sample in group:
    display(sample.protein.signature('normalized_counts', 'median'))
# signaturemap will plot a heatmap of the signature dataframe created above
# The labels list will control the order of the clusters along the y-axis
for sample in group:
    fig = sample.protein.signaturemap('normalized_counts') #labels=[]
    fig.update_layout(title = sample.name)
    fig.show()

Statistics#

Statistics should be run separately per sample, similar to the CNV analysis above. Note that these tests only look for differential protein expression within each individual sample. There are not yet any built-in functions for cross-sample comparison, though this can easily be added with standard Python packages and a few custom code cells (see the sketch at the end of this section).

# First define the samples
sample_one = group.samples[0]
sample_two = group.samples[1]
pval_protein, tstat_protein = sample_one.protein.test_signature("normalized_counts")
pval_protein
pval_protein = pval_protein + 10 ** -50  # add a small offset to avoid log10(0)
pvals_protein = -np.log10(pval_protein) * (tstat_protein > 0)
# Colored tiles indicate strong association of a protein with a particular cluster
plt.figure(figsize=(20, 10))
fig = sns.heatmap(pvals_protein.T, vmax=50, vmin=40)

This is a partial duplication of the above code, for any additional samples that you would like to perform statistics on.

pval_protein, tstat_protein = sample_two.protein.test_signature("normalized_counts")
pval_protein
pval_protein = pval_protein + 10 ** -50  # add a small offset to avoid log10(0)
pvals_protein = -np.log10(pval_protein) * (tstat_protein > 0)
# Colored tiles indicate strong association of a protein with a particular cluster
plt.figure(figsize=(20, 10))
fig = sns.heatmap(pvals_protein.T, vmax=50, vmin=40)
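
Although Mosaic does not include built-in cross-sample tests, standard Python packages can fill the gap. The sketch below is a minimal example (not part of Mosaic): it compares the normalized counts of a single AOC between the two samples with a Mann-Whitney U test, assuming scipy is installed and that 'CD19' is present in your panel (substitute any AOC of interest).

# Minimal cross-sample comparison sketch (assumes scipy is available; 'CD19' is a placeholder AOC)
from scipy.stats import mannwhitneyu

aoc = 'CD19'
counts_one = sample_one.protein.layers['normalized_counts'][:, sample_one.protein.ids() == aoc].ravel()
counts_two = sample_two.protein.layers['normalized_counts'][:, sample_two.protein.ids() == aoc].ravel()
stat, pval = mannwhitneyu(counts_one, counts_two)
print(aoc, "Mann-Whitney U p-value:", pval)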

Combined Visualizations#

In this section you will first subset the data to retain only the barcodes common to all assays (DNA, CNV, and Protein). This data can then be plotted together using sample.heatmap() and sample.signaturemap().

You will also be able to quantify and visualize the overlap between the different analytes, using sample.crosstabmap().

The first few code cells will generate one plot per sample for the combined visualizations; the last portion of this section will combine the samples into one object and plot them all together.

# This will return the barcodes common to all assays in the sample, for each sample
# Use that to filter each sample so that only the common barcodes are present in all assays
# (reassign into group.samples so that the filtered samples are retained)
for i, sample in enumerate(group.samples):
    group.samples[i] = sample[sample.common_barcodes()]
# Check dimensionality for each subclass; the number of cells (first number) should be the same in each data set
# For each sample
for sample in group:
    print(sample.name, sample.dna.shape, sample.cnv.shape, sample.protein.shape)
# DNA + CNV + Protein heatmap
for sample in group:
    fig = sample.heatmap(
        clusterby=('dna', 'cnv', 'protein'),  # The first assay is used for the labels
        attributes=['AF', 'ploidy', 'normalized_counts'],
        features=[None, 'genes', None],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'cnv', 'protein')  # The order in which the heatmaps should be drawn
    )
    fig.update_layout(title = sample.name)
    fig.layout.width = 1500
    fig.show()
# DNA + CNV heatmap
# One per sample
for sample in group:
    fig = sample.heatmap(
        clusterby=('dna', 'cnv'),  # The first assay is used for the labels
        attributes=['AF', 'ploidy'],
        features=[None, 'genes'],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
    )
    fig.update_layout(title = sample.name)
    fig.layout.width = 1200
    fig.show()
# DNA + Protein heatmap
for sample in group:
    fig = sample.heatmap(
        clusterby=('dna', 'protein'),  # The first assay is used for the labels
        attributes=['AF', 'normalized_counts'],
        features=[None, None],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'protein')  # The order in which the heatmaps should be drawn
    )
    fig.update_layout(title = sample.name)
    fig.layout.width = 1200
    fig.show()
help(sample.signaturemap)
# Plot a combined signature heatmap, showing DNA, CNV and Protein signatures for all subclones
# One per sample
for sample in group:
    fig = sample.signaturemap(
        clusterby=('dna', 'cnv', 'protein'),
        attributes=('NGT', 'ploidy', 'normalized_counts'),
        features=[None, 'genes', None],
        signature_kind=['median', 'median', 'median'],
        widths=None,
        order=('dna', 'cnv', 'protein')  # The order in which the heatmaps should be drawn
    )
    fig.update_layout(title = sample.name)
    fig.layout.width = 1500
    fig.show()
# Ignore warnings raised when running clone_vs_analyte

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
help(sample.clone_vs_analyte)
# Clone vs analyte
# Visualize the CNV data stratified by clone
# One per sample
for sample in group:
    fig = sample.clone_vs_analyte('cnv')
    plt.gcf().axes[1].texts[-1].set_text("")
#fig.savefig('genotype_cnv.png')
# Visualize the protein data (violin plots) stratified by clone
# One per sample
for sample in group:
    fig = sample.clone_vs_analyte('protein')
    plt.gcf().axes[1].texts[-1].set_text("")
#fig.savefig('genotype_protein.png')
# To review a subset of features, first subset the assay itself
for sample in group:
    sample.protein = sample.protein[:, ['CD11b', 'CD19', 'CD38', 'CD90']]
    fig = sample.clone_vs_analyte('protein')
    plt.gcf().axes[1].texts[-1].set_text('')
# Reset the protein dataframe to include all AOCs again
for sample in group:
    sample.reset('protein')
# Plot the protein heatmap with the DNA clone labels
for sample in group:
    fig = sample.protein.heatmap(attribute='normalized_counts',splitby=sample.dna.row_attrs['label'])
    fig.update_layout(title = sample.name)
    fig.show()
# Plot the AOC ridge plots split by DNA clone
for sample in group:
    fig = sample.protein.ridgeplot(attribute='normalized_counts',
                         features=['CD11b', 'CD19', 'CD38', 'CD90'],
                         splitby=sample.dna.row_attrs['label'])
    fig.update_layout(title = sample.name)
    fig.show()
# Plot the AOC ridge plots split by protein cluster
for sample in group:
    fig = sample.protein.ridgeplot(attribute='normalized_counts',
                         features=['CD11b', 'CD19', 'CD38', 'CD90'],
                         splitby=sample.protein.row_attrs['label'])
    fig.update_layout(title = sample.name)
    fig.show()
# Plot the biaxial AOC plot split by DNA clone
for sample in group:
    fig = sample.protein.feature_scatter(layer='normalized_counts', ids=['CD19', 'CD34'], colorby = sample.dna.row_attrs['label'])
    fig.update_layout(title = sample.name)
    fig.show()
for sample in group:
    print(sample.name, '\n', sample.dna.ids(), '\n')
# First define your variants, proteins and clones of interest in these lists

variants_of_interest = ['chr17:29559932:C/A', 'chr4:55599436:T/C', 'chr7:148508833:A/G']
proteins_of_interest = ['CD19', "CD34", "CD30"]
clones_of_interest = ["0-1-1", "1-1-2"]

# Plot the above selected data in a multi-omic plot, per sample
for sample in group:
    s = sample[sample.dna.barcodes(clones_of_interest)]
    s.dna = s.dna[:, variants_of_interest]
    s.protein = s.protein[:, proteins_of_interest]
    s.clone_vs_analyte("protein")  
# Plot the biaxial AOC plot colored by genotype (or AF_MISSING, etc.) for specific variants
for sample in group:
    variants = ['EZH2:7:148508833:A:G', 'EZH2:7:148543525:A:G']
    layer = 'NGT_FILTERED'
    AOCs = ['CD8', 'CD34'] # only two!

    # Set the layout and size of the resulting plots (rows, columns) - may need to adjust based on number of variants
    fig, axs = plt.subplots(1, len(variants), figsize=(30, 10))

    # Generate plots for each variant
    counter = 0
    for variant in variants:
        color_variant = sample.dna.layers[layer][:,sample.dna.ids() == variant]
        fig = sample.protein.feature_scatter(layer='normalized_counts', ids=AOCs, colorby = color_variant, title=("\t" + variant))
        ms.utils.static_fig(fig, ax=axs[counter])
        counter = counter + 1

    # Show the plot
    for i in range(0, counter):
        axs[i].axis('off')

    plt.tight_layout()
    plt.show()    

Optionally: you can combine all samples present into one object and plot multi-omic visualizations of all samples as a whole. First you will need to use the merge option, and then you can plot on the merged object.

# Merge the samples into one data set
combined = group.merge()
# DNA + CNV heatmap
# For all samples together
fig = combined.heatmap(
        clusterby=('dna', 'cnv'),  # The first assay is used for the labels
        attributes=['AF', 'ploidy'],
        features=[None, 'genes'],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
    )
fig.layout.width = 1200
fig.show()
# DNA + CNV + Protein heatmap
# For all samples together
fig = combined.heatmap(
        clusterby=('dna', 'cnv', 'protein'),  # The first assay is used for the labels
        attributes=['AF', 'ploidy', 'normalized_counts'],
        features=[None, 'genes', None],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'cnv', 'protein')  # The order in which the heatmaps should be drawn
    )
fig.layout.width = 1200
fig.show()
# Plot a combined signature heatmap, showing DNA and CNV signatures for all subclones
# For all samples together
fig = combined.signaturemap(
        clusterby=('dna', 'cnv'),
        attributes=('NGT', 'ploidy'),
        features=[None, 'genes'],
        signature_kind=['median', 'median'],
        widths=None,
        order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
    )
fig.layout.width = 1200
fig.show()
# Plot a combined signature heatmap, showing DNA and Protein signatures for all subclones
# For all samples together
fig = combined.signaturemap(
        clusterby=('dna', 'protein'),
        attributes=('NGT', 'normalized_counts'),
        features=[None, None],
        signature_kind=['median', 'median'],
        widths=None,
        order=('dna', 'protein')  # The order in which the heatmaps should be drawn
    )
fig.layout.width = 1200
fig.show()
# Clone vs analyte
# Visualize the CNV data stratified by clone
# For all samples together
fig = combined.clone_vs_analyte('cnv')
plt.gcf().axes[1].texts[-1].set_text("")
# Plot the protein UMAP and color with the DNA clone labels
protein_umap = combined.protein.row_attrs["umap"]
fig = combined.dna.scatterplot(attribute=protein_umap, colorby="label")
go.Figure(fig)
# Plot the protein UMAP and color with variant layers (e.g., VAFs) for specific variants
protein_umap = combined.protein.row_attrs["umap"]
feats = combined.dna.ids()[:3] # plot the first 3 variants
combined.dna.scatterplot(attribute=protein_umap, colorby="AF", features=feats)
# Plot the protein heatmap with the DNA clone labels, of all samples together
fig = combined.protein.heatmap(attribute='normalized_counts',splitby=combined.dna.row_attrs['label'])
go.Figure(fig)

Quantify overlaps between DNA clones and protein clusters#

# Use crosstab to calculate the overlap between DNA subclones and Protein Clusters
# This will generate the dataframe
# What percentage of each Protein cluster is defined by each Subclone
# Per sample
for sample in group:
    data = sample.dna.crosstab(sample.protein.get_labels(), normalize='columns')
    print(sample.name, '\n', data, '\n')
# This will generate the heatmap of the dataframe generated above
# What percentage of each Protein cluster is defined by each Subclone
# Per sample
for sample in group:
    fig = sample.dna.crosstabmap(sample.protein.get_labels(), normalize='columns')
    fig.update_layout(title = sample.name)
    fig.show()
# Use crosstab to calculate the overlap between DNA subclones and Protein Clusters
# This will generate the dataframe
# What percentage of each Protein cluster is defined by each Subclone
# On all samples 
combined.dna.crosstab(combined.protein.get_labels(), normalize='columns')
# This will generate the heatmap of the dataframe generated above
# What percentage of each Protein cluster is defined by each Subclone
# On all samples
combined.dna.crosstabmap(combined.protein.get_labels(), normalize='columns').show()

Export and Save Data#

In this section you can export a filtered .h5 file, which will contain all new labels/layers, and contain only the filtered barcodes/cells remaining. You can also export all data (row attributes, column attributes and layers) for every assay (DNA and CNV) into easily parsable .csv tables.

# Save new h5 file that includes only the final, cleaned dataset
ms.save(combined, 'FilteredData.h5')
# With new code implementation:
# Export data into csv formats
# DNA, CNV and metadata will be included in the zip
ms.to_zip(combined, 'filename')

Appendix#

# Split merged object
# Need to first split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))

DNA signature#

Using .signature() and .signaturemap() you can visualize different statistical metrics of your data, including: mean, median, mode and std.

# Signature will create a dataframe based on the layer you want to look at and which statistical value you want to view
# You can see: mean, median, mode, or std with this function
for sample in group:
    display(sample.dna.signature('AF', 'median'))
# signaturemap will plot a heatmap of the signature dataframe created above
# The labels list will control the order of the subclones along the y-axis
for sample in group:
    fig = sample.dna.signaturemap('AF') #labels=[]
    fig.update_layout(title = sample.name)
    fig.show()

Compass imputation#

This section of the notebook is OPTIONAL and is still a work in progress

Compass can be used to impute the genotypes of cells with some missing data. It can also be used to infer the phylogeny of all subclones present in the sample. If a cell's nature cannot be determined, Compass will label these cells as Mixed or Ambiguous. Compass can take ~1-10 minutes to run depending on the size of the data.

The Compass publication can be found here

Note: Compass can give different subclone composition than the variant subclone workflow.

# Use Compass to infer the subclone architecture and phylogeny
# Depending on the size and complexity of the data, this step can take ~1-10 minutes
# Use the full_combined object to utilize all cells and unannotated variants
compass = COMPASS(full_combined, somatic_variants=variants)
compass.run()
# The phylogenetic tree predicted by COMPASS
fig = compass.plot_tree()
fig.show()
# node_descriptions() returns a dict of node names pointing to descriptions
# compass.labels_ contains just the node names
# we want compass_labels to hold the node descriptions
labs = compass.labels_  # Stores the labels
desc = compass.node_descriptions()  # Stores the dict mapping the label to the description
compass_labels = np.array([desc.get(lab, lab) for lab in labs])
# Store the compass label as a row_attr
full_combined.dna.add_row_attr('compass_labels', compass_labels)
# Store the index positions of the COMPASS ids for the heatmap
idx = np.isin(full_combined.dna.ids(), compass.somatic_variants)
# Visualize the assignment using the variants passed to COMPASS
fig = full_combined.dna.heatmap('NGT', features=full_combined.dna.ids()[idx], splitby='compass_labels')
fig.show()
# This will return a dataframe 
# Showing the overlap of variant workflow clones and compass identified clones
full_combined.dna.crosstab(compass_labels, normalize='columns')
# This will plot a heatmap of the crosstab dataframe
full_combined.dna.crosstabmap(compass_labels, normalize='columns').show()