Multisample DNA + Protein#

This notebook is set up to work with merged h5 files, containing more than one sample, generated with v2 or v3 chemistry. This notebook will only work with Mosaic versions 3.7 and higher.

Objective: To showcase the minimum number of steps required for tertiary analysis of DNA (single-cell genotyping and CNV) and Protein analytes and explore different ways of visualizing the data.

Major questions answered:

  1. Can we identify DNA clones based on genotypes (SNVs/Indels)?

  2. How is the clonal structure different between samples?

  3. Do we detect CNV events (e.g., copy number amplification, copy number loss)?

  4. Do DNA clones (genotypes) correlate with specific protein-defined cell types and/or CNV profiles?

  5. Is there a significant difference in protein-defined cell types between samples?

Sections:

  1. Setup

  2. Data Structure

  3. DNA Analysis

  4. CNV Analysis

  5. Protein Analysis

  6. Combined Visualizations

  7. Export and Save Data

  8. Appendix

Not shown: All available methods and options - documented here

Setup#

Topics covered

  1. Loading required packages.

  2. Loading in data.

Load data
Note: importing dependencies can sometimes take a couple of minutes.

# Import mosaic libraries
import missionbio.mosaic as ms

# Import these to display entire dataframes
from IPython.display import display, HTML

# Import graph_objects from the plotly package to display figures when saving the notebook as an HTML
import plotly as px
import plotly.graph_objects as go

# Import additional packages for specific visuals
import matplotlib.pyplot as plt
import plotly.offline as pyo
pyo.init_notebook_mode()
import numpy as np
import seaborn as sns
from missionbio.plotting.multimap import MultiMap
from missionbio.plotting.heatmap import Heatmap

# Import COMPASS for imputation
from missionbio.mosaic.algorithms.compass import COMPASS
import pandas as pd

# Note: when exporting the notebook as an HTML, plots that use the "go.Figure(fig)" command are saved
# This code is optional, but will make the notebook cells/figures display across the entire width of your browser
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# Check version; this notebook is designed for Mosaic v3.7 or higher
print(ms.__version__)
# Any function's parameters and default values can be looked up via the 'help' function
# Here, the function is 'ms.load'
help(ms.load)

Loading the data #

In the ms.load statement, there are several arguments:

  • raw : always set raw=False; if raw=True, ALL barcodes will be loaded (rather than cell-associated barcodes)

  • filter_variants : always set filter_variants=True unless an expected (target) variant cannot be found. Additional filtering options are included in the DNA section below. NOTE : if loading using filter_variants=False, the whitelist may need to set to None or removed from the command, otherwise it may not load variants in

  • single : always set single=False for multisample/merged h5 files

  • whitelist : The whitelist option loads any variant that is in the vcf.gz file (e.g. “chr1:179520511:C/G”). The whitelist argument takes many variant formats, they are detailed here.

  • filter_cells : Always set filter_cells=False; if filter_cells=True, only cells called by the completelyness algorithm are loaded

# Specify the h5 file to be used in this analysis: h5path = '/path/to/h5/file/test.h5'
# If working with Windows, an 'r' may need to be added before the path: h5path = r'/path/to/h5/file/test.h5'
h5path = '/Users/botero/Desktop/Sample Data/pbmc.multisample.h5'

# Load the data
group = ms.load(h5path, raw=False, filter_variants=True, filter_cells=False, whitelist = [])
# Print the list of samples in the group object
[s.name for s in group.samples]

Data Structure#

DNA, CNV, and Protein are sub-classes of the Assay class. The information is stored in four categories, and the user can modify each of those:

1. metadata (add_metadata / del_metadata):

  • dictionary containing assay parameters and basic performance metrics

2. layers (add_layer / del_layer):

  • dictionary containing matrices of assay metrics

  • all the matrices have the shape (num barcodes) x (num ids)

  • for DNA assays, this includes AF, GQ, DP, etc. (per cell, per variant)

  • for Protein assays, this includes read counts (per cell, per antibody)

  • for CNV assays, this includes read counts (per cell, per amplicon)

3. row_attrs (add_row_attr / del_row_attr):

  • dictionary containing ‘barcode’ as one of the keys (where the value is a list of all barcodes)

  • for all other keys, the values must be of the same length, i.e. match the number of barcodes

  • this is the attribute where ‘label’, ‘pca’, and ‘umap’ values are added

4. col_attrs (add_col_attr / del_col_attr):

  • dictionary containing ‘id’ as one of the keys (where the value is a list of all ids)

  • for DNA assays, ‘ids’ are variants; for Protein assays, ‘ids’ are antibodies; for CNV assays, ‘ids’ are amplicons

  • for all other keys, the values must be of the same length, i.e. match the number ids

# Summary of DNA assay, per sample
for sample in group:
    print(sample.name)
    print("\'sample.dna\':", sample.dna, '\n')
    print("\'row_attrs\':", "\n\t", list(sample.dna.row_attrs.keys()), '\n')
    print("\'col_attrs\':", "\n\t", list(sample.dna.col_attrs.keys()), '\n')
    print("\'layers\':", "\n\t", list(sample.dna.layers.keys()), '\n')
    print("\'metadata\':", "\n")
    for i in list(sample.dna.metadata.keys()):
        print("\t", i, ": ", sample.dna.metadata[i], sep="")
    print("\n")
# For DNA, ids are variants
# sample.dna.ids() is a shortcut for sample.dna.col_attrs['id']
for sample in group:
    print(sample.name, "\n", sample.dna.ids(), "\n")
# Summary of CNV assay, per sample
for sample in group:
    print("\'sample.cnv\':", sample.cnv, '\n')
    print("\'row_attrs\':", "\n\t", list(sample.cnv.row_attrs.keys()), '\n')
    print("\'col_attrs\':", "\n\t", list(sample.cnv.col_attrs.keys()), '\n')
    print("\'layers\':", "\n\t", list(sample.cnv.layers.keys()), '\n')
    print("\'metadata\':", "\n")
    for i in list(sample.cnv.metadata.keys()):
        print("\t", i, ": ", sample.cnv.metadata[i], sep="")
    print("\n")
# For CNV, ids are amplicons
# sample.cnv.ids() is a shortcut for sample.cnv.col_attrs['id']
for sample in group:
    print(sample.name, "\n", sample.cnv.ids(), "\n")
# Summary of Protein assay, per sample
for sample in group:
    print(sample.name)
    print("\'sample.protein\':", sample.protein, '\n')
    print("\'row_attrs\':", "\n\t", list(sample.protein.row_attrs.keys()), '\n')
    print("\'col_attrs\':", "\n\t", list(sample.protein.col_attrs.keys()), '\n')
    print("\'layers\':", "\n\t", list(sample.protein.layers.keys()), '\n')
    print("\'metadata\':", "\n")
    for i in list(sample.protein.metadata.keys()):
        print("\t", i, ": ", sample.protein.metadata[i], sep="")
    print("\n")
# For Protein, ids are AOCs
# sample.protein.ids() is a shortcut for sample.protein.col_attrs['id']
for sample in group:
    print(sample.name, "\n", sample.protein.ids(), "\n")

DNA Analysis#

Topics covered

  1. Standard filtering of DNA variants.

  2. Subsetting dataset for variants of interest, including whitelisted variants.

  3. Addition of annotations to the variants.

  4. Manual variant selection and clone identification

  5. Visualizations and customization options

Basic filtering#

There are many options for filtering DNA variants. Use the help() function to understand the approach listed below.

help(sample.dna.filter_variants)
# Adjust filters if needed by overwriting dna_vars
def custom_filt(sample):
    filt_vars = sample.dna.filter_variants(
        min_dp=10,
        min_gq=30,
        vaf_ref=5,
        vaf_hom=95,
        vaf_het=30,
        min_prct_cells=50,
        min_mut_prct_cells=1,
        iterations=10
)
    return filt_vars
    
dna_vars = group.apply(custom_filt)

# List all filtered variants
# Check the number of filtered variants
for i in range(len(group.samples)):
    sample = group.samples[i]
    print(sample.name)
    print("Number of variants:", len(dna_vars[i]))
    print(dna_vars[i], "\n")

Note: Variants that are whitelisted during sample loading may be removed at this step, but can be added back in below. Whitelisted variants can be added to the list of final variants, or used exclusively in the code below.

# Ensure correct nomenclature, ie whitelist = ["chr13:28589657:T/G","chrX:39921424:G/A"]
# While the load statement accepts many variant formats, this whitelist does not
# If there are no whitelist variants, this can be left blank
white_list = []
# Check that whitelist variants are included in current data
# Should return true before moving on
for sample in group:
    print(set(white_list).issubset(set(sample.dna.ids())))
# Subset the same variants in all dna assays
# It is important to maintain the same variants across all dna assays

# Establish the number of variants for each sample
og_num_vars = [s.dna.shape[1] for s in group.samples]

#Subset all samples to look at the union of all variants, and whitelist variants
var_union = list(set().union(*dna_vars,white_list))

for sample in group:
    sample.dna = sample.dna[:, var_union]  # Subsets all samples with the same variants

# Establish the new number of varaints for each sample
new_num_vars = [s.dna.shape[1] for s in group.samples]

print(og_num_vars, new_num_vars)  # The old and new number of variants for each sample in the group
# Merge the samples into one data frame, instead of having them as two separate ones
combined = group.merge()
# List all filtered variants in combined
combined.dna.ids()
# Dimensionality of the original sample.dna dataframe
# First number = number of cells (rows); second number = number of variants (columns)
combined.dna.shape

Annotation Addition#

help(sample.dna.get_annotations)
# Fetch annotations using varsome
# Note: run this on a filtered DNA sample - too many variants (e.g., 100+) are not handled correctly by the method
ann = combined.dna.get_annotations()  

Variant selection and Subclone identification#

In this section of the notebook, all variants remaining in the dataset will populate in a variant table. This table is interactive, variants can be selected, and rows can be sorted by ascending/descending values. The variant name can be clicked on and will navigate to the variant’s varsome page in the default browser.

  1. Variants selected in this table will populate in a subclone table below.

  2. The variants in the subclone table can be highlighted and assessed for Read Depth, Genotype Quality and Variant Allele Frequency.

  3. Subclones can be renamed by clicking on the pencil icon.

  4. ADO score: The ADO score threshold can be adjusted, but by default is set to 1. Any clones with an ADO score higher this will be moved into the “ADO subclones” column. We recommend moving any clones with a score of .8 or higher into this column. The algorithm used to generate the score is detailed here.

  5. Min clone size: The Min Clone Size can also be adjusted, but by default is set to 1. Any clones that represent less than 1% of the sample will be moved into the “Small Subclones” column.

  6. Cells with missing genotype information across any of the selected variants will be moved into the “Missing GT” column.

# Run the variant table workflow to select variants and begin clone identification
wfv = ms.workflows.VariantSubcloneTable(combined)
wfv.run()

NOTE: When continuing onto the next piece of the notebook, the workflow will stop working.

# Subsample the variants to only the variants selected from the workflow
variants = wfv.selected_variants
combined.dna = combined.dna[:,variants]

# Save the full set of cells to a new variable that can be called on later
# Do this before renaming the variants
full_combined = combined[:]
# Rename the variants
# Any of these column values can be added to the id names
combined.dna.col_attrs.keys()
# Add annotation to the id names
combined.dna.set_ids_from_cols(['annotated Gene', 'id'])

# Another example:
# sample.dna.set_ids_from_cols(['annotated Gene', 'CHROM', 'POS', 'REF', 'ALT'])

# Annotations are now added to the variants
combined.dna.ids()

# Use sample.dna.reset_ids() to get the original ids
# Plot heatmap using NGT_FILTERED.
fig = combined.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Clone removal
# Remove barcodes from the missing GT, small subclones, ADO subclones or FP labels
clones = ['missing', 'ADO', 'small', 'FP']
for c in clones:
    cells = combined.dna.barcodes({c})
    if len(cells) > 0:
        combined.dna = combined.dna.drop(cells)
set(combined.dna.get_labels())
# If subclones need to be renamed outside of the workflow
# The following code can be used:
# NOTE: all clones need to be listed for this function to work
combined.dna.rename_labels(
  {
    'current_subclone_name': 'new_subclone_name', 
    'WT': 'Wild Type',
    '3': 'three',
    '4': 'four',      
  }
)
# Redraw heatmap
fig = combined.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Redraw heatmap, split by sample
fig = combined.dna.heatmap(attribute='NGT_FILTERED', splitby='sample_name')
go.Figure(fig)
# Create a DNA heatmap to see both the sample names and the subclone labels
# Parameters
attribute = 'NGT_FILTERED'
assay = combined.dna
lab_row1 = "label"
lab_row2 = "sample_name"
lab2_title = "Sample"

# Graphing
data = assay.get_attribute(attribute, constraint="row+col")
ids = assay.clustered_ids(attribute)
labels = pd.concat([assay.get_attribute(lab_row1), assay.get_attribute(lab_row2).rename(columns={"1": "2"})], axis=1)
bars = assay.clustered_barcodes(attribute, splitby=labels)

data = data.loc[bars, ids]
labs1 = assay.get_attribute(lab_row1, constraint="row").loc[bars, :].values.flatten()
labs2 = assay.get_attribute(lab_row2, constraint="row").loc[bars, :].values.flatten()

fig = MultiMap([data], labs1, names=[attribute], extra_column_titles=[lab2_title], widths=[1 / 24, 1]).draw()
fig.add_trace(Heatmap(data, labs2).label_trace(), row=1, col=2)
fig.layout.xaxis2.tickvals = []

fig
# Evaluate new total number of cells after the above filtering
combined.dna.shape
# Subset the barcodes to match across all assays
# This will return the barcodes common to all assays in the sample.
combined.common_barcodes()

# Use that to filter the sample so that only the common barcodes are present in all assays
combined = combined[combined.common_barcodes()]
# Split merged object
# Need to first split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))
# Multisample DNA barplot
fig = group.barplot("dna", percentage=True)

#To change the color of certain clones, try something like the following:
#fig.data[0].marker.color = "red"
#fig.data[5].marker.color = "orange"
fig.update_layout(title = 'Clone size as a percentage of total population')
fig.show()
# Fishplot
# Multisample DNA fishplot
# Order samples in sample_order and fill in the list with their specific name
fig = group.fishplot("dna", sample_order=['Sample 1', 'Sample 2'])
fig.layout.height = 750
fig.show()
# Further specify labels and parents to reorganize the fishplot
# Want to hide the phylogeny tree? draw_graph=False
# Documentation here: https://missionbio.github.io/mosaic/autosummary_pages/methods/missionbio.mosaic.samplegroup.SampleGroup.fishplot.html#missionbio.mosaic.samplegroup.SampleGroup.fishplot
# Heatmap per sample
for sample in group:
    fig = sample.dna.heatmap(attribute='NGT_FILTERED',splitby="label")
    fig.update_layout(title = sample.name)
    go.Figure(fig)
    fig.show()
# Shape of each sample
for sample in group:
    print(sample.name, sample.dna.shape)

Adjusting subclone colors or heatmap colors#

To change the color palette used for the subclones, use set_palette. For this provide a list of ALL subclone names, pointing to the hexcode/color to change that subclone to. Note: this function also works with the protein and CNV assays. See below for an example of this.

To change the color palette used for any of the assays/layers, use ms.Config.Colorscale. Then list the assay and layer, and assign the plotly colorscale to use for those graphics. Visualize all available colorscales by using plotly.colors.sequential.swatches_continuous().show(). To reset all color palettes back to their defaults, use ms.Config.Colorscle.reset().

# Example of changing the colors assigned to each clone
sample_one = group.samples[0]
sample_two = group.samples[1]
sample_one.dna.set_palette({"Clone 1": "#800080", "Clone 2": "#FF69B4"})
sample_two.dna.set_palette({"Clone 1": "#800080", "Clone 2": "#FF69B4"})
# Change the color of a subclone for the entire group object
# These changes will be applied to the barplots and fishplots
# Example
pal = group.apply(lambda s: s.dna.get_palette())
pal = {k: v for d in pal for k, v in d.items()}

pal["Clone 1"] = 'LightCyan'
pal["Clone 2"] = 'Lavender'
pal["Clone 3"] = 'PaleGreen'

group.apply(lambda s: s.dna.set_palette(pal))

# Built in color options can be found here: https://www.w3schools.com/cssref/css_colors.php
# Change colorScale to change the heatmap colors
# This will change the color of the dna.ngt layer to be viridis
for sample in group:
    ms.Config.Colorscale.Dna.NGT = 'viridis'
    fig = sample.dna.heatmap('NGT')
    fig.update_layout(title = sample.name)
    fig.show()

# To reset this, run the following to reset just that one modification
#ms.Config.Colorscale.Dna.reset("NGT")
# Or run this to reset any/all modifications:
#ms.Config.Colorscale.reset()
# Colorscale cane be changed for DNA, CNV and Protein assays

To return back to the original population of cells, reset the data using: sample.reset("dna") This command ‘sample.reset’ works on all assays, including CNV and protein.

CNV Analysis#

Topics covered

  1. Amplicon and cell filtering and ploidy estimation.

  2. Visualization of ploidy across subclones present.

# Get gene names for amplicons
for sample in group:
    sample.cnv.get_gene_names()
# Define the samples
# Go through the CNV workflow once per sample
sample_one = group.samples[0]
sample_two = group.samples[1]

#Additional lines will be needed if there are more than two samples
#sample_three = group.samples[2]
# Establish the order of the samples in the group object
# Note: timepoint one will not automatically be the first sample
for sample in group:
    print(sample.name)

CNV workflow#

This workflow will normalize all reads and filter amplicons/cells based on the settings set at the beginning of the workflow:

  1. Amplicon completeness: refers to the minimum percentage of cells that must have reads greater than or equal to the minimum read depth set. By default this is set to 50.

  2. Amplicon read depth: refers to the minimum read depth for each amplicon-barcode combination to not be considered missing. By default this is set to 10.

  3. Mean cell read depth: refers the minimum mean read depth for a cell to be included in the analysis, otherwise the cell will be removed. By default this is set to 40.

  4. Diploid clone in DNA: refers to which subclone is used as the true diploid population. All ploidy estimates will be calculated in relation to this diploid population. We recommend setting this to the ‘WT’ population or earliest clone present.

Note: If large copy number events are expected, Amplicon Completeness and Amplicon Read Depth may need to be reduced.

Once the above filters are set, the visualizations can be changed.

  1. Plot: Can be changed from Heatmap positions, to Heatmap genes, Line-plot positions, Line-plot genes

  2. Clone for line plot: If one of the line-plot visualizations is selected, only one clone can be shown at a time. This determines which one is plotted.

  3. X-axis features: To plot a subset of the data (chromosomes or genes), select which chromosomes/genes should be plotted with this function. Chromosomes can be selected for ‘positions’ type plots, and Genes can be selected for ‘genes’ type plots.

# CNV workflow to filter, normalize and estimate ploidy
wfc = ms.workflows.CopyNumber(sample_one)
wfc.run()

NOTE: When continuing onto the next piece of the notebook, the workflow will stop working.

# Heatmap with the features ordered by the default amplicon order
# To plot just a subset of chromosomes, put them in list format in the features argument
fig = sample_one.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']

# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0

# Optionally, change the size of the figure:
#fig.layout.width = 1600
#fig.layout.height = 1500
fig.update_layout(title = sample_one.name)
go.Figure(fig)
# Heatmap with the features grouped by the genes
# To plot just a subset of genes, put them in list format in the features argument
fig = sample_one.cnv.heatmap('ploidy', features='genes', convolve=1) #features=["ASXL1", "EZH2",'TP53']

# Optionally, update the separating lines to be black
#for shape in fig.layout.shapes:
#    shape.line.color = '#000000'

fig.update_layout(title = sample_one.name)
go.Figure(fig)
# Show heatmap with convolve and subclustering turned off
bars = sample_one.cnv.clustered_barcodes('ploidy', subcluster=False)

# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to visualize
fig = sample_one.cnv.heatmap('ploidy', bars_order=bars, convolve=20)
fig.layout.width = 900
fig.update_layout(title = sample_one.name)
fig.show()

Note: The following is a partial duplication to run through the CNV workflow with the second/subsequent sample.

# Go through the workflow for any additional samples
# CNV workflow to filter, normalize and estimate ploidy
wfc = ms.workflows.CopyNumber(sample_two)
wfc.run()

Note: The following is a partial duplication to run through the CNV workflow with the second/subsequent sample.

# Heatmap with the features ordered by the default amplicon order
# To plot just a subset of chromosomes, put them in list format for the features argument
fig = sample_two.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']

# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0

# Optionally, change the size of the figure:
#fig.layout.width = 1600
#fig.layout.height = 1500
fig.update_layout(title = sample_two.name)
go.Figure(fig)
# Heatmap with the features grouped by the genes
# To plot just a subset of genes, put them in list format for the features argument
fig = sample_two.cnv.heatmap('ploidy', features='genes', convolve=1) #features=["ASXL1", "EZH2",'TP53']

# Optionally, update the separating lines to be black
#for shape in fig.layout.shapes:
#    shape.line.color = '#000000'
fig.update_layout(title = sample_two.name)
go.Figure(fig)
# Show heatmap with convolve and subclustering turned off
bars = sample_two.cnv.clustered_barcodes('ploidy', subcluster=False)

# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to visualize
fig = sample_two.cnv.heatmap('ploidy', bars_order=bars, convolve=20)
fig.layout.width = 900
fig.update_layout(title = sample_two.name)
fig.show()

Note: The next few cells are highly optional, and only need to be run to plot cells from all samples together in one heatmap.

# Plot heatmaps of all samples together
combined = group.merge()
combined.cnv.heatmap('ploidy', splitby='sample_name')
# Subset the barcodes to match across all assays
# This will return the barcodes common to all assays in the sample
combined.common_barcodes()

# Use that to filter the sample so that only the common barcodes are present in all assays
combined = combined[combined.common_barcodes()]
# Split merged object
# Need to split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))

Protein Analysis#

Topics covered

  1. Normalization and assessment of AOC abundance

  2. Clustering and visualization

  3. Statistical significance analysis

Normalization and Data Inspection#

help(sample.protein.normalize_reads)
# Subset the protein barcodes to match the dna and cnv barcodes
for sample in group:
    sample.protein = sample.protein[sample.dna.barcodes(),:]
# See the new shape of the protein data frame
for sample in group:
    print(sample.protein.shape)
# Normalize reads
# Three different methods available: CLR = Centered Log Ratio, NSP = Noise corrected and Scaled, asinh = Inverse Hyperbolic transformation
for sample in group:
    sample.protein.normalize_reads('NSP') 
# Ridge plot: shows distribution of AOC counts across all cells (bimodal = positive and negative cell populations)
for sample in group:
    sample.protein.ridgeplot(attribute='normalized_counts',
                         features=sample.protein.ids()).show()
# This plot is helpful when titrating a custom AOC panel
import pandas as pd
import plotly.express as plx

for sample in group:
    read_counts = sample.protein.layers["read_counts"]
    ids = sample.protein.col_attrs["id"].copy()
    cell_barcode = pd.DataFrame(read_counts, columns=ids).sum()
    cell_barcode.index.rename("Antibody", inplace=True)
    cell_barcode = cell_barcode.reset_index(name="Cell Barcode Count")
    cell_barcode['percent'] = (cell_barcode['Cell Barcode Count'] / cell_barcode['Cell Barcode Count'].sum()) * 100

    fig = plx.bar(cell_barcode, x='Antibody', y='percent', title=sample.name)
    fig.show()

    # Or make a pie chart
    #fig = plx.pie(cell_barcode, values='Cell Barcode Count',names='Antibody', title='Count of Cell Barcodes')
    #fig.show()
# Plotting combinations of AOCs on a biaxial scatterplot, to mimic FACS data
for sample in group:
    sample.protein.feature_scatter(layer='normalized_counts', ids=['CD8', 'CD4']).show()
# List all AOCs, per sample
for sample in group:
    print(sample.name, '\n', sample.protein.ids(), '\n')
# Optionally, remove AOCs from the data set prior to clustering 
# For example, those that don't display any signal in any cells
# Do this for each sample being analyzed
for sample in group:
    sample.protein = sample.protein.drop(['IgG2a'])
# Combine both samples into 1 object to cluster them together post-normalization
combined = group.merge()

Clustering and Visualization#

help(sample.protein.run_pca)
# Run PCA on the protein dataset to visualize how many PCs contribute the most to variation
# Number of components should equal the number of AOCs remaining in the dataset
combined.protein.run_pca(attribute='normalized_counts', components=45,show_plot=True, random_state=42)
# Rerun PCA with optimal number of components based on elbow plot analysis
# Typically, components = 10 is appropriate for 45-plex Biolegend panel
combined.protein.run_pca(attribute='normalized_counts', components=10, show_plot=False, random_state=42)
# Now run UMAP on the chosen PCs 
# UMAPs rely on an initial randomization which leads to different projections every time
# To address this, pass random_state to the run_umap method
# See https://jlmelville.github.io/uwot/abparams.html for appropriate spread/min_dist values
combined.protein.run_umap(attribute='pca', random_state=42) #, spread=, min_dist=
# Cluster the AOCs. Several options are available, see help info for more options
# Graph-community clustering: the higher the k value, the smaller the number of clusters --> adjust if necessary
combined.protein.cluster(attribute='umap', method='graph-community',k=300, random_state=42) 
# Plot UMAP for visualization of clusters
fig = sample.protein.scatterplot(attribute='umap',colorby='label')
go.Figure(fig)

The protein clusters defined by the PCA + UMAP can be regrouped by utilizing the lasso tool. First, the UMAP needs to be replotted. On the new version of the plot, clusters can be encircled with the lasso tool, found in the upper right corner of the graph. Once this step is complete, the new cluster labels can be assigned in the following block.

# Plot UMAP for lasso cluster selection
# Be sure that selected_bars is clear before running for lasso selection
# Don't use go.Figure()
combined.protein.selected_bars = {}
combined.protein.scatterplot(attribute='umap', colorby='label')
# Assign new labels (run this right after selecting your new clusters)
combined.protein.set_selected_labels()
# Plot heatmap of normalized, clustered protein data
fig = combined.protein.heatmap(attribute='normalized_counts')
go.Figure(fig)
# If analyzing multiple samples, order barcodes by sample and then by cluster
fig = combined.protein.heatmap(attribute='normalized_counts',splitby='sample_name')
go.Figure(fig)
help(sample.protein.scatterplot)
# Plot UMAP for visualization of clusters
fig = combined.protein.scatterplot(attribute='umap',colorby='label')
go.Figure(fig)
# Plot UMAP for visualization of clusters, by sample
fig = combined.protein.scatterplot(attribute='umap',colorby='sample_name')
go.Figure(fig)
# UMAP colored by the expression of each AOC
fig = combined.protein.scatterplot(attribute='umap',
                           colorby='normalized_counts',
                           features=sample.protein.ids())
go.Figure(fig)
# Relabel clone names acccording to biology, combine clones by assigned identical names
# Just an example! Need to fill this in based on knowledge of the sample and markers
# For cell-type calling, view these resources: https://www.frontiersin.org/articles/10.3389/fimmu.2019.02434/full
# Biolegend: https://www.biolegend.com/fr-lu/products/totalseq-d-human-heme-oncology-cocktail-v10-20465

combined.protein.rename_labels(
    {
        '1': 'Cell type A', 
        '2': 'Cell type B', 
        '3': 'Cell type C',
        '4': 'Cell type D',
        '5': 'Cell type E'
    }
)
# Drop cells (barcodes) that need to be removed
# Note: if no cells are dropped, this code will producs an error (ignore)
fp_barcodes2 = combined.protein.barcodes("FP")
combined.protein = combined.protein.drop(fp_barcodes2)
set(combined.protein.get_labels())
# Run heatmap again with new labels
fig = combined.protein.heatmap(attribute='normalized_counts',splitby='label')
go.Figure(fig)
# Need to first split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))
# Fish plot
fig = group.fishplot("protein", draw_graph=False) #,sample_order=['Sample1', 'Sample2'])
fig.layout.width = 1000
fig.layout.height = 850
fig.show()
# Multisample barplot
group.barplot("protein", percentage=True)
# Additionally, option to remove AOCs from the data 
for sample in group:
    sample.protein = sample.protein.drop(['IgG2a'])
# Shape of cleaned up protein data (number of cells, number of AOCs)
for sample in group:
    print(sample.name, sample.protein.shape)
# Re-run heatmap with new filtered data
for sample in group:
    fig = sample.protein.heatmap(attribute='normalized_counts')
    fig.update_layout(title=sample.name)
    fig.show()
# Show heatmap with convolve and subclustering turned off, per sample
for sample in group:
    bars = sample.protein.clustered_barcodes('normalized_counts', subcluster=False)
    # This is useful to create "convolved" heatmaps which are easier to interpret
    # With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to determine
    fig = sample.protein.heatmap('normalized_counts', bars_order=bars, convolve=20)
    fig.update_layout(title=sample.name)
    fig.layout.width = 900
    fig.show()
# Signature will create a dataframe based on the layer defined and which statistical value listed
# This can be used for: mean, median, mode, or std with this function
for sample in group:
    sample.protein.signature('normalized_counts', 'median')
# signaturemap will plot a heatmap of the signature dataframe created above
# The labels list will control the order of the clusters along the y-axis
for sample in group:
    fig = sample.protein.signaturemap('normalized_counts') #labels=[]
    fig.update_layout(title = sample.name)
    fig.show()

Statistics#

Statistics should be run separately per sample, similar to the CNV statistics section. Note, these tests just look for differential protein expression within the individual samples. There are not yet any built in functions for cross-sample comparison, though this can be easily added with python packages and additional custom kernels of code. This will run a t-test on the protein clusters defined above.

# First define the samples
sample_one = group.samples[0]
sample_two = group.samples[1]
pval_protein, tstat_protein = sample_one.protein.test_signature("normalized_counts")
pval_protein
pval_protein = pval_protein + 10 ** -50 + pval_protein
pvals_protein = -np.log10(pval_protein) * (tstat_protein > 0)
# Colored tiles indicate strong association of a protein with a particular cluster
plt.figure(figsize=(20, 10))
fig = sns.heatmap(pvals_protein.T, vmax=50, vmin=40)

This is a partial duplication of the above code, for any additional samples

pval_protein, tstat_protein = sample_two.protein.test_signature("normalized_counts")
pval_protein
pval_protein = pval_protein + 10 ** -50 + pval_protein
pvals_protein = -np.log10(pval_protein) * (tstat_protein > 0)
# Colored tiles indicate strong association of a protein with a particular cluster
plt.figure(figsize=(20, 10))
fig = sns.heatmap(pvals_protein.T, vmax=50, vmin=40)

Combined Visualizations#

In this section the dataset will be subset to only retain barcodes with remaining DNA, CNV and Protein data. Then this data can be plotted together using sample.heatmap() and sample.signaturemap().

Overlap between different analytes can be quantified and visualized, using sample.crosstabmap().

The first few kernels of code will generate one plot per sample for the combined visualizations, the last portion of this section will combine the samples into one dataframe and plot them all together.

# This will return the barcodes common to all assays in the sample, for each sample.
for sample in group:
    sample.common_barcodes()
    # Use that to filter the sample so that only the common barcodes are present in all assays
    sample = sample[sample.common_barcodes()]
# Check dimensionality for each assay; the number of cells (first number) should be the same in each data set
# For each sample
for sample in group:
    print(sample.name, sample.dna.shape, sample.cnv.shape, sample.protein.shape)
# DNA + CNV + Protein heatmap
for sample in group:
    fig = sample.heatmap(
        clusterby=('dna', 'cnv', 'protein'),  # The first assay is used for the labels
        attributes=['AF', 'ploidy', 'normalized_counts'],
        features=[None, 'genes', None],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'cnv', 'protein')  # The order in which the heatmaps should be drawn
    )
    fig.update_layout(title = sample.name)
    fig.show()
# DNA + CNV heatmap
# One per sample
for sample in group:
    fig = sample.heatmap(
        clusterby=('dna', 'cnv'),  # The first assay is used for the labels
        attributes=['AF', 'ploidy'],
        features=[None, 'genes'],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
    )
    fig.update_layout(title = sample.name)
    fig.show()
# DNA + Protein heatmap
for sample in group:
    fig = sample.heatmap(
        clusterby=('dna', 'protein'),  # The first assay is used for the labels
        attributes=['AF', 'normalized_counts'],
        features=[None, None],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'protein')  # The order in which the heatmaps should be drawn
    )
    fig.update_layout(title = sample.name)
    fig.show()
help(sample.signaturemap)
# Plot a combined signature heatmap, showing DNA, CNV and Protein signatures for all subclones
# One per sample
for sample in group:
    fig = sample.signaturemap(
        clusterby=('dna', 'cnv', 'protein'),
        attributes=('NGT', 'ploidy', 'normalized_counts'),
        features=[None, 'genes', None],
        signature_kind=['median', 'median', 'median'],
        widths=None,
        order=('dna', 'cnv', 'protein')  # The order in which the heatmaps should be drawn
    )
    fig.update_layout(title = sample.name)
    fig.show()
help(sample.clone_vs_analyte)
# Clone vs analyte
# Visualize the CNV data stratified by clone
# One per sample
for sample in group:
    fig = sample.clone_vs_analyte('cnv')
    plt.gcf().axes[1].texts[-1].set_text("")
#fig.savefig('genotype_cnv.png')
# Visualize the protein data (violin plots) stratified by clone
# One per sample
for sample in group:
    fig = sample.clone_vs_analyte('protein')
    plt.gcf().axes[1].texts[-1].set_text("")
#fig.savefig('genotype_cnv.png')
# To review a subset of features, first subset the assay itself
for sample in group:
    sample2.protein = sample.protein[:, ['CD11b', 'CD19', 'CD38', 'CD90']]
    fig = sample2.clone_vs_analyte('protein')
    plt.gcf().axes[1].texts[-1].set_text('')
# Plot the protein heatmap with the DNA clone labels
for sample in group:
    fig = sample.protein.heatmap(attribute='normalized_counts',splitby=sample.dna.row_attrs['label'])
    fig.update_layout(title = sample.name)
    fig.show()
# Plot the AOC ridge plots split by DNA clone
for sample in group:
    fig = sample.protein.ridgeplot(attribute='normalized_counts',
                         features=['CD11b', 'CD19', 'CD38', 'CD90'],
                         splitby=sample.dna.row_attrs['label'])
    fig.update_layout(title = sample.name)
    fig.show()
# Plot the AOC ridge plots split by protein cluster
for sample in group:
    fig = sample.protein.ridgeplot(attribute='normalized_counts',
                         features=['CD11b', 'CD19', 'CD38', 'CD90'],
                         splitby=sample.protein.row_attrs['label'])
    fig.update_layout(title = sample.name)
    fig.show()
# Plot the biaxial AOC plot split by DNA clone
for sample in group:
    fig = sample.protein.feature_scatter(layer='normalized_counts', ids=['CD19', 'CD34'], colorby = sample.dna.row_attrs['label'])
    fig.update_layout(title = sample.name)
    fig.show()
for sample in group:
    print(sample.name, '\n', sample.dna.ids(), '\n')

Optionally: Combine all samples present into one object and plot multi-omic visualizations of all samples as a whole. First merge the sample, and then plot on the merged object.

# Merge the samples into one data set
combined = group.merge()
# DNA + CNV heatmap
# For all samples together
fig = combined.heatmap(
        clusterby=('dna', 'cnv'),  # The first assay is used for the labels
        attributes=['AF', 'ploidy'],
        features=[None, 'genes'],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
    )
fig.show()
# DNA + CNV + Protein heatmap
# For all samples together
fig = combined.heatmap(
        clusterby=('dna', 'cnv', 'protein'),  # The first assay is used for the labels
        attributes=['AF', 'ploidy', 'normalized_counts'],
        features=[None, 'genes', None],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'cnv', 'protein')  # The order in which the heatmaps should be drawn
    )
fig.show()
# Plot a combined signature heatmap, showing DNA and CNV signatures for all subclones
# For all samples together
fig = combined.signaturemap(
        clusterby=('dna', 'cnv'),
        attributes=('NGT', 'ploidy'),
        features=[None, 'genes'],
        signature_kind=['median', 'median'],
        widths=None,
        order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
    )
fig.show()
# Plot a combined signature heatmap, showing DNA and Protein signatures for all subclones
# For all samples together
fig = combined.signaturemap(
        clusterby=('dna', 'protein'),
        attributes=('NGT', 'normalized_counts'),
        features=[None, None],
        signature_kind=['median', 'median'],
        widths=None,
        order=('dna', 'protein')  # The order in which the heatmaps should be drawn
    )
fig.show()
# Clone vs analyte
# Visualize the CNV data stratified by clone
# For all samples together
fig = combined.clone_vs_analyte('cnv')
plt.gcf().axes[1].texts[-1].set_text("")
# Plot the protein UMAP and color with the DNA clone labels
protein_umap = combined.protein.row_attrs["umap"]
fig = combined.dna.scatterplot(attribute=protein_umap, colorby="label")
go.Figure(fig)
# Plot the protein UMAP and color with variant layers (e.g., VAFs) for specific variants
protein_umap = combined.protein.row_attrs["umap"]
feats = combined.dna.ids()[:3] # plot the first 3 variants
combined.dna.scatterplot(attribute=protein_umap, colorby="AF", features=feats)
gene_ploidy = combined.cnv.signature("ploidy", splitby=None, features="gene_name")
tp53_ploidy = gene_ploidy["TP53"].values
combined.protein.scatterplot("umap", colorby=tp53_ploidy)
# Plot the protein heatmap with the DNA clone labels, of all samples together
fig = combined.protein.heatmap(attribute='normalized_counts',splitby=combined.dna.row_attrs['label'])
go.Figure(fig)

Quantify overlaps between DNA clones and protein clusters#

# Use crosstab to calculate the overlap between DNA subclones and Protein Clusters
# This will generate the dataframe
# What percentage of each Protein cluster is defined by each Subclone
# Per sample
for sample in group:
    data = sample.dna.crosstab(sample.protein.get_labels(), normalize='columns')
    print(sample.name, '\n', data, '\n')
# This will generate the heatmap of the dataframe generated above
# What percentage of each Protein cluster is defined by each Subclone
# Per sample
for sample in group:
    fig = sample.dna.crosstabmap(sample.protein.get_labels(), normalize='columns')
    fig.update_layout(title = sample.name)
    fig.show()
# Use crosstab to calculate the overlap between DNA subclones and Protein Clusters
# This will generate the dataframe
# What percentage of each Protein cluster is defined by each Subclone
# On all samples 
combined.dna.crosstab(combined.protein.get_labels(), normalize='columns')
# This will generate the heatmap of the dataframe generated above
# What percentage of each Protein cluster is defined by each Subclone
# On all samples
combined.dna.crosstabmap(combined.protein.get_labels(), normalize='columns').show()

Export and Save Data#

In this section the data can export a filtered .h5 file, which will contain all new labels/layers, and contain only the filtered barcodes/cells remaining. All data can be exported (row attributes, column attributes and layers) for every assay (DNA and CNV) into easily parsable .csv tables.

# Save new h5 file that includes only the final, cleaned dataset
ms.save(combined, 'FilteredData.h5')
# Export data into csv formats
ms.to_zip(combined, 'filename')

Appendix#

Compass imputation#

This section of the notebook is OPTIONAL

Compass can be used to impute the genotypes of cells with some missing data. It can also be used to infer the phylogeny of all subclones present in the sample. If a cell cannot be unambiguously classified, Compass will label these cells as Mixed or Ambiguous. Compass can take ~1-10 minutes to run depending on the size of the data.

The Compass publication can be found here

Note: Compass can give different subclone composition than the variant subclone workflow.

# Use compass to assume the subclone architecture and phylogeny
# Depending on the size and complexity of the data, this step can take ~1-10 minutes
# Use the full_combined object to utilize all cells and unannotated variants
compass = COMPASS(full_combined, somatic_variants=variants)
compass.run()
# The phylogentic tree prediction by COMPASS
fig = compass.plot_tree()
fig.show()
# Make the compass_labels the node descriptions
# These are the new subclone labels
labs = compass.labels_  # Stores the labels
desc = compass.node_descriptions()  # Stores the dict mapping the label to the description
compass_labels = np.array([desc.get(lab, lab) for lab in labs])
# Store the compass label as a row_attr
full_combined.dna.add_row_attr('compass_labels', compass_labels)
# Store the index positions of the COMPASS ids for the heatmap
idx = np.isin(full_combined.dna.ids(), compass.somatic_variants)
# Visualize the assignment using the variants passed to COMPASS
fig = full_combined.dna.heatmap('NGT', features=full_combined.dna.ids()[idx], splitby='compass_labels')
fig.show()
# Create a DNA heatmap to see both the variant workflow subclone labels and the compass assigned labels
# Parameters
attribute = 'NGT'
assay = full_combined.dna
lab_row1 = "label"
lab_row2 = "compass_labels"
lab2_title = "Compass"

# Graphing
data = assay.get_attribute(attribute, constraint="row+col")
ids = assay.clustered_ids(attribute)
labels = pd.concat([assay.get_attribute(lab_row1), assay.get_attribute(lab_row2).rename(columns={"1": "2"})], axis=1)
bars = assay.clustered_barcodes(attribute, splitby=labels)

data = data.loc[bars, ids]
labs1 = assay.get_attribute(lab_row1, constraint="row").loc[bars, :].values.flatten()
labs2 = assay.get_attribute(lab_row2, constraint="row").loc[bars, :].values.flatten()

fig = MultiMap([data], labs1, names=[attribute], extra_column_titles=[lab2_title], widths=[1 / 24, 1]).draw()
fig.add_trace(Heatmap(data, labs2).label_trace(), row=1, col=2)
fig.layout.xaxis2.tickvals = []

fig
# This will return a dataframe 
# Showing the overlap of variant workflow clones and compass identified clones
full_combined.dna.crosstab(compass_labels, normalize='columns')
# This will plot a heatmap of the crosstab dataframe
full_combined.dna.crosstabmap(compass_labels, normalize='columns').show()

Export data to use in COMPASS

# Export compass data into 2 output files
# compass_input_regions.csv and compass_input_variants.csv
# These can be used with COMPASS outside of Mosaic
compass = COMPASS(sample, sample.dna.filter_somatic_variants())
compass.prepare_input(prefix="./compass_input")

Protein assignment#

This section of the notebook is OPTIONAL

This portion of the notebook utilizes known truth about different cell types based on protein expression, to cluster cells into the defined cell types. The known truth built into the tool is based on our catalog Total Seq-D Heme protein panel, which contains 45 AOCs. The truth was defined from data on the cell ontology database and the European Bioinformatics Institute Website and works best for PBMCs. Other cell-types may benefit from custom truths.

More information about the algorithm can be found here.

Note: For protein normalization above, either NSP or ANSP need to be used. CLR or asinh are not supported.

The cluster label can be one of the following:

  1. Sticky - these are positive for any of the IgG markers (defined using the sticky_antibodies parameter).

  2. Mixed - these are cells which have a profile of 2 clusters (see the doublet_distance parameter)

  3. Mixed like - these cells have the profile which satisfies the Mixed criteria, but their proportions are too large compared to their singlet proportions. (see max_adjusted_mixing)

  4. Labeled - Clusters that match the given truth with sufficient probability (see min_prob_diff)

  5. Unassiged-{n} - A cluster that couldn’t be assigned to any of the above types

# Visualize the structure of the built-in truth
# The names that are underlines are links to the source of truth
truth = Truth.builtin()
truth.plot()
# Cluster the cells based on protein data and label them
pace = combined.protein.cluster_and_label(
    max_adjusted_mixing=0.3,  # This parameter controls which mixed clusters would be labelled "Mixed Like"
    min_distance_for_doublet=5,  # Increase this if too many mixed cells are observed.
    cluster=True,  # If True, graph-community clustering is run, otherwise the existing cluster labels are used
)
# Visualize the clustering
bars = combined.protein.clustered_barcodes("normalized_counts", subcluster=False)
fig = combined.protein.heatmap("normalized_counts", bars_order=bars)
fig.layout.width = 1000
fig
# Visualize the complete truth that was used in cluster_and_label
specific_truth = pace.cluster_truth()
specific_truth.plot()
# Recluster the data using the complete truth
# It will not find new clusters, but it will categorize the cells into the known cell types
combined.protein.assign_from_truth(specific_truth)
# Plot the protein clusters with the complete truth
bars = combined.protein.clustered_barcodes("normalized_counts", subcluster=False)
fig = combined.protein.heatmap("normalized_counts", bars_order=bars)
fig.layout.width = 1000
fig.show("jpg")

Using predefined clusters or custom clustering

Clustering performed above in the notebook can be feed into protein assignment for the clusters to be labeled. Simply skip the previous cluster_and_label method and run the below code.

# To use clustering performed earler on
truth = combined.protein.truth()
combined.protein.assign_from_truth(truth)
combined.protein.heatmap("normalized_counts")

Modifying the truth

The truth can be modified in many ways. The following blocks of code show examples on how to modify the truth.

# Subset the truth to a few cell types
subset_truth = truth[["T cell", "B cell", "NK cell", "Monocyte"]]
subset_truth.plot()
# Remove cell types from the truth
trimmed_truth = truth.trim("T cell", include=False)
trimmed_truth.plot()
# Remove AOCs from the truth
trimmed_truth = truth.del_ab("CD3")
trimmed_truth.plot()
# Add a new cell type to the truth
source_link = "https://jdc.jefferson.edu/cgi/viewcontent.cgi?article=1083&context=pacbfp#:~:text=CD56%20is%20an%20adhesion%20molecule,tissue%20and%20in%20granular%20lmphocytes.&text=Aberrant%20expression%20of%20CD56%20has,leukemia%20and%20plasma%20cell%20myeloma."
updated_truth = truth.add_cell("CD56+ Monocyte", ["CD56+"], parent="Monocyte", source=source_link)
updated_truth.plot()

Exporting the truth

The defined truth can be exported and shared.

# Save the truth using the `to_yaml` function
updated_truth.to_yaml("./my-truth-file.yaml", version="My version")
# The yaml file can be read using the read_yaml function
Truth.read_yaml("./my-truth-file.yaml").plot()
# Set a yaml as the truth to be used in clustering
new_truth = Truth.read_yaml("./my-truth-file.yaml").plot()
# Run the cluster_and_label with the truth argument pointing to the yaml truth
pace = combined.protein.cluster_and_label(
    truth=new_truth
    max_adjusted_mixing=0.3,  # This parameter controls which mixed clusters would be labelled "Mixed Like"
    min_distance_for_doublet=5,  # Increase this if too many mixed cells are observed.
    cluster=True,  # If True, graph-community clustering is run, otherwise the existing cluster labels are used
)