Mosaic analysis walk-through#

Objective: This notebook contains a detailed step by step process of analyzing Tapestri runs. It is not a vignette to demonstrate how to use mosaic, but rather a guided analysis. The following cells have not been exectued but the notebook can be downloaded and run on any sample.

Setup#

Topics covered

  1. Loading required packages and data.

  2. Structure and contents of data objects.

Load data

# Import mosaic libraries
import missionbio.mosaic as ms

# Import these to display entire dataframes
from IPython.display import display, HTML

# Import graph_objects from the plotly package to display figures when saving the notebook as an HTML
# Import numpy for statistics
import plotly.graph_objects as go
import numpy as np

# Import additional packages for specific visuals
import missionbio.mosaic.utils as mutils
import matplotlib.pyplot as plt

# Import the colors
from missionbio.mosaic.constants import COLORS
import seaborn as sns

# Note: when exporting the notebook as an HTML, plots that use the "go.Figure(fig)" command are saved
# Check version; this notebook is designed for Mosaic 2.0 or higher
print(ms.__version__)
# Any function's parameters and default values can be looked up via the 'help' function
# Here, the function is 'ms.load'
help(ms.load)
# Specify the h5 file to be used in this analysis: h5path = '/path/to/h5/file/test.h5'
# If working with Windows, you may need to add an 'r' before the path: h5path = r'/path/to/h5/file/test.h5'
h5path = '/Users/robert_durruthy/Desktop/Mosaic notebooks/4-cell-lines-AML-multiomics.dna+protein.h5'

# Load the data
# sample = ms.load(h5path, raw=False, apply_filter=True, single=True, whitelist = [])

sample = ms.load_example_dataset("3 cell mix")  # Use the above `ms.load` for custom h5 files

# Always set raw=False; if raw=True, ALL barcodes will be loaded (rather than cell-associated barcodes)
# Always set apply_filter=True unless you can't detect an expected (target) variant. Additional filtering options are included in the DNA section below
# The single=True option loads multi-sample h5 files as a single sample object (compatible with this notebook)
# The whitelist option loads any variant that is in the vcf.gz file (e.g. "chr1:179520511:C/G"); similar to whitelist feature in Tapestri Insights v2

All the interactive plotting functions return a plotly figure. If the layout or the color scheme is not suitable for your data type, they can be changed before creating the final figure.

The colors for the plots are stored either in the individual traces or the layout attributes of the plotly figure.

Mosaic also contains a list of colors that can be used to customize the plots.

# Explore the colors
# Additional color palettes: https://seaborn.pydata.org/tutorial/color_palettes.html
# Plot the first few colors
sns.palplot(COLORS[:40])
help(sns.color_palette)
# Alternatively plot another palette
sns.palplot(sns.color_palette("magma", n_colors=20))
# Print the corresponding hex codes
print(sns.color_palette("magma", n_colors=20).as_hex())

Data Structure

DNA, CNV, and Protein are sub-classes of the Assay class. The information is stored in four categories, and the user can modify each of those:

1. metadata (add_metadata / del_metadata):
    - dictionary containing metrics of the assay

2. row_attrs (add_row_attr / del_row_attr):
    - dictionary which contains 'barcode' as one of the keys 
      (where the value is a list of all barcodes)
    - for all other keys, the values must be of the same 
      length, i.e. match the number of barcodes
    - this is the attribute where 'label', 'pca', 
      and 'umap' values are added

3. col_attrs (add_col_attr / del_col_attr):
    - dictionary which contains 'id' as one of the keys 
      (where the value is a list of all ids)
    - for DNA assays, 'ids' are variants; for Protein assays, 
      'ids' are antibodies
    - for all other keys, the values must be of the same
      length, i.e. match the number ids

4. layers (add_layer / del_layer):
    - dictionary which contains critically important assay metrics
    - all the values have the shape (num barcodes) x (num ids) 
    - for DNA assays, this includes AF, GQ, DP, etc. 
      (per cell, per variant)
    - for Protein assays, this includes read counts 
      (per cell, per antibody)
    - this is the attribute where 'normalized_counts' will be added
# Summary of DNA assay 
print("\'sample.dna\':", sample.dna, '\n')
print("\'row_attrs\':", "\n\t", list(sample.dna.row_attrs.keys()), '\n')
print("\'col_attrs\':", "\n\t", list(sample.dna.col_attrs.keys()), '\n')
print("\'layers\':", "\n\t", list(sample.dna.layers.keys()), '\n')
print("\'metadata\':", "\n")
for i in list(sample.dna.metadata.keys()):
    print("\t", i, ": ", sample.dna.metadata[i], sep="")
# Summary of Protein assay 
print("\'sample.protein\':", sample.protein, '\n')
print("\'row_attrs\':", "\n\t", list(sample.protein.row_attrs.keys()), '\n')
print("\'col_attrs\':", "\n\t", list(sample.protein.col_attrs.keys()), '\n')
print("\'layers\':", "\n\t", list(sample.protein.layers.keys()), '\n')
print("\'metadata\':", "\n")
for i in list(sample.protein.metadata.keys()):
    print("\t", i, ": ", sample.protein.metadata[i], sep="")
# For DNA, ids are variants
# sample.dna.ids() is a shortcut for sample.dna.col_attrs['id']
sample.dna.ids()
# For Protein, ids are AOCs
# sample.protein.ids() is a shortcut for sample.protein.col_attrs['id']
sample.protein.ids()

DNA Analysis#

Topics covered

  1. Standard filtering of DNA variants.

  2. Subsetting dataset for variants of interest, including whitelisted variants.

  3. Addition of annotations to the variants.

  4. Manual variant selection. [OPTIONAL]

  5. Clustering sub-clones.

Basic filtering#

There are many options for filtering DNA variants. 
Use the help() function to understand the approach listed below.
# For additional information visit: https://support.missionbio.com/hc/en-us/articles/360047303654-How-do-the-Advanced-Filters-work-
help(sample.dna.filter_variants)
# Filter the variants, similar to the "Advanced Filters" in Tapestri Insights v2.2

# One major difference: The filter "REMOVE GENOTYPE IN CELLS WITH ALTERNATE ALLELE FREQ" 
# is replaced with the following 3 zygosity-specific filters: vaf_ref, vaf_hom, vaf_het

# In general, these additional filters remove additional false-positive from the data.

# Define dna_vars variable
dna_vars = sample.dna.filter_variants()
dna_vars
# Check the number of filtered variants. When using the default filters, the number of 
# variants is likely smaller compared to the originally loaded variants due to the more 
# stringent filtering criteria (e.g., vaf_ref=5, vaf_hom=95, vaf_het=35).
len(dna_vars)
# Adjust filters if needed by overwriting dna_vars
dna_vars = sample.dna.filter_variants(
    min_dp=10,
    min_gq=30,
    vaf_ref=5,
    vaf_hom=95,
    vaf_het=35,
    min_prct_cells=50,
    min_mut_prct_cells=1,
)

# List all filtered variants
dna_vars
# Re-check the number of filtered variants
len(dna_vars)

Subsetting Data for Variants of Interest#

First, specify variants of interest using one of the three options below:
    1. Use the filtered variants from the above section (dna_vars)
    2. Use specific variants of interest (whitelist) 
    3. Combine 1 & 2: filtered variants plus whitelist

Then, this list (final_vars) is used to reduce the larger data set to only your variants of interest.
## Option 1: filtered variants only (no whitelist)

final_vars=dna_vars
## Option 2: whitelisted variants only 

# Specify the whitelist; variants may be copy/pasted from Tapestri Insights v2.2,
# but ensure correct nomenclature, ie whitelist = ["chr13:28589657:T/G","chrX:39921424:G/A"]
final_vars = ["chr13:28597686:G/A","chr11:32421651:T/C","chr4:106158216:G/A","chr4:106196829:T/G"]
## Option 3: filtered variants plus whitelisted variants

# Specify the whitelist; variants may be copy/pasted from Tapestri Insights v2.2,
# but ensure correct nomenclature, ie whitelist = ["chr13:28589657:T/G","chrX:39921424:G/A"]
target_variants = ["chr13:28597686:G/A","chr11:32421651:T/C","chr4:106158216:G/A","chr4:106196829:T/G"]

# Combine whitelisted and filtered variants
final_vars = list(set(list(dna_vars) + target_variants))
# Check the lenght of your final list of variants
len(final_vars)
# Dimensionality of the original sample.dna dataframe
# First number = number of cells (rows); second number = number of variants (columns)
sample.dna.shape
# Before subsetting, verify that all the chosen variants are in the current sample.dna ids (should return True)
print(set(final_vars).issubset(set(sample.dna.ids())))
# Subsetting sample.dna (columns) based on reduced variant list. Keeping all cells that passed filtering
sample.dna = sample.dna[sample.dna.barcodes(), final_vars]
# Check the shape of the final filtered DNA object, i.e. (number of barcodes/cells, number of ids/variants)
sample.dna.shape

Annotation Addition#

help(sample.dna.get_annotations)
# Fetch annotations using varsome
# Note: run this on a filtered DNA sample - too many variants (e.g., 100+) are not handled correctly by the method
annotation = sample.dna.get_annotations()  

# Store the annotations in the dna assay as a new column attribute
for col, content in annotation.items():
    sample.dna.add_col_attr(col, content.values)
# Sort the values as desired and display a table of all variants (similar to Tapestri Insights)
ann = annotation.sort_values(by=["DANN", "Coding impact"], ascending=False)

display(HTML(ann.to_html()))
# The data can also be presented as an interactive table
# There are multiple pages to the table - check top right corner for page number
# Clicking will select a particular row and store it in table.selected_rows
table = ms.Table(ann)
table.draw()
# Check the selected variants
# If desired, this list can be used to subset the data further (similar to the subsetting done above with "final_vars")
table.selected_rows
# Any of these column values can be added to the id names
sample.dna.col_attrs.keys()
# Add annotation to the id names
sample.dna.set_ids_from_cols(["Gene", "CHROM", "POS", "REF", "ALT"])

# Annotations are now added to the variants
sample.dna.ids()

# Use sample.dna.reset_ids() to get the original ids

Manual Variant Selection#

If using only whitelisted variants (e.g., target variants are already known), this section can be skipped.

This section includes a variety of plots to help assess variant quality.
Heatmaps are interactive. Clicking on a column selects the corresponding id,
whose value is stored in the `selected_ids` attribute. Selected ids can then 
be removed from the data set. I.e., click on the low quality variants you wish to discard.
help(sample.dna.stripplot)
# First diagnostic plot to evaluate variant quality
# The 'attribute' or 'colorby' arguments may be changed
# Variants with the same genotype for every cell are likley germline
# Variants with a HET population distributed below AF=35 are likely false positives (expect HET cells around AF=50)
fig = sample.dna.stripplot(attribute='AF_MISSING', colorby='GQ')
go.Figure(fig)
# Second diagnostic plot: use this heatmap to (de)-select variants with little variance 
# For example, a germline mutation with identical genotype across all cells
# Additionally, variants with excessive missing data can be selected for removal
# Variants may be selected by clicking on the heatmap (variant name changes color from black to red)
# Don't use "fig =" as the heatmap won't become responsive for variant selection
sample.dna.heatmap(attribute='NGT_FILTERED')
# Array that lists all variants that were selected in the heatmap
sample.dna.selected_ids
# Subset the sample.dna variable to removing all selected variants

if len(sample.dna.selected_ids) != 0:
    sample.dna = sample.dna.drop(sample.dna.selected_ids)
# Redraw the heatmap with GQ values to select variants that display low quality in majority of cells
sample.dna.heatmap(attribute='GQ')
# Array that lists all variants that were selected in the heatmap
sample.dna.selected_ids
# Subset the sample.dna variable to removing all selected variants

if len(sample.dna.selected_ids) != 0:
    sample.dna = sample.dna.drop(sample.dna.selected_ids)
# Confirm new number of columns (variants)
sample.dna.shape
# Redraw heatmap with undesired variants removed
sample.dna.heatmap(attribute='AF_MISSING')

Clustering#

The DNA assay class comes with three different methods of clustering:
Method 1: f(x) = group_by_genotype (akin to Tapestri Insights v2.2)
Method 2: PCA + UMAP + clustering (customizable)

Clustering Method 1#

help(sample.dna.group_by_genotype)
# Clustering Method #1

# Recommended for 1-10 variants
# Cluster with Tapestri Insights v2.2 count-based method 

# The created table includes a 'score' row that intends to help with the identification of allelic dropout (ADO) clones
# Scores greater than 0.8 are typically considered artifacts (ADO) and may be labeled as such to be discarded
# Clones are ordered based on their size (1=largest clone, 2=second largest clone, etc.)

clone_data = sample.dna.group_by_genotype(
    features=sample.dna.ids()[4:6],
    group_missing=True,
    min_clone_size=1,
    layer="NGT_FILTERED",
    show_plot=False
)

# Display as a static html or interactive table using ms.Table.
display(HTML(clone_data.to_html()))
help(sample.dna.heatmap)
# Plot heatmap using NGT_FILTERED.
fig = sample.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Store the clustering in the row attributes

sample.dna.add_row_attr("clustering-1", sample.dna.get_labels())

Clustering Method 2 (PCA + UMAP + Clustering)#

# CLUSTERING METHOD #2
# An alternative clustering recommended for high-feature/dimensional space (e.g. 10+ variants)

# First: run PCA on filtered data frame (all cells, chosen variants) using the AF_MISSING layer, for instance
# Note: AF_MISSING includes -50 values for all 0 values for which the data was filtered and is missing
# The number of components should equal the number of pre-filtered variants
sample.dna.run_pca(components=100, attribute='AF_MISSING',show_plot=True) 

# Assess 'elbow' plot and determine number of PCs where the distribution starts to plateau
# Note that PC1 = 0 in the plot
# Rerun PCA with optimal number of components based on elbow plot analysis
sample.dna.run_pca(components=6, attribute='AF_MISSING')
# Run UMAP on top of the newly created PC dataframe
# See https://jlmelville.github.io/uwot/abparams.html for appropriate spread and min_dist values
sample.dna.run_umap(attribute='pca', random_state=42) #, min_dist=0.2, spread=1.5)
# Review other clustering methods
help(sample.dna.cluster)
# Visualize data and cluster it using different methods
sample.dna.cluster(attribute='umap', method='dbscan')
# Re-plot UMAP projection w/ alternative clustering results
fig = sample.dna.scatterplot(attribute='umap', colorby='label')
go.Figure(fig)
# Plot heatmap using NGT
fig = sample.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Store this clustering in the row attributes

sample.dna.add_row_attr("clustering-2", sample.dna.get_labels())

The next few commands allow you to customize the colors of the genotypes in the heatmap.

help(sns.color_palette)
sns.palplot(sns.color_palette("viridis", n_colors=20))
vir20 = sns.color_palette("viridis", n_colors=20).as_hex()
print(vir20)
vir20[0]
# The DNA heatmap and scatterplot colors are stored in the layout.coloraxis.colorscale attribute

# This value must be updated to customize the plot
fig.layout.coloraxis.colorscale
# Assuming these are new desired colors
# NGT=0 (WT) - blue
# NGT=1 (HET) - orange
# NGT=2 (HOM) - red
# NGT=3 (missing) - black

# Additional information re: color palettes here: https://seaborn.pydata.org/tutorial/color_palettes.html
wt_col = vir20[0]
het_col = vir20[10]
hom_col = vir20[19]
miss_col = COLORS[19]

sns.palplot([wt_col, het_col, hom_col, miss_col])
# Update the coloraxis to make a plot with the new colors
new_colors = [(0 / 4, wt_col), (1 / 4, wt_col),
              (1 / 4, het_col), (2 / 4, het_col),
              (2 / 4, hom_col), (3 / 4, hom_col),
              (3 / 4, miss_col), (4 / 4, miss_col)]

fig.layout.coloraxis.colorscale = new_colors
fig

Renaming Subclones (Labels)#

# Use clustering-1 for further analysis

sample.dna.set_labels(sample.dna.row_attrs["clustering-1"])
# Rename clusters based on genotypes
# Merge clusters by giving them the same name
# Rename clusters identically that are to be merged into one and discarded (e.g. "FP" for false-positive)
sample.dna.rename_labels(
  {
    "1": "Cell 1",
    "2": "Cell 2",
    "3": "Cell 3",
    "missing": "FP",
    "small": "FP",
  }
)
# Remove barcodes (clones) from data based on renamed labels
# The reduced data set will now be called 'sample.dna2'
if "FP" in sample.dna.get_labels():
    fp_barcodes = sample.dna.barcodes({"FP"})
    sample.dna2 = sample.dna.drop(fp_barcodes) 
else:
    sample.dna2 = sample.dna

set(sample.dna2.get_labels()) 
# Redraw heatmap
fig = sample.dna2.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Redraw again with new (previously defined) color palette
fig = sample.dna2.heatmap(attribute='NGT_FILTERED',splitby="label")
fig.layout.coloraxis.colorscale = new_colors
go.Figure(fig)
# If analyzing multiple samples, re-plot heatmap with barcodes ordered based on sample
fig = sample.dna2.heatmap(attribute='NGT_FILTERED',splitby="sample_name")
go.Figure(fig)
# Adding write_image function saves the picture in high-res
# See help() to change resolution or file type 
help(fig.write_image)
fig.write_image("heatmap_example.svg",scale=3)
# Evaluate new total number of cells after the above filtering
sample.dna2.shape
# Compare to original cell number
sample.dna.shape

CNV Analysis#

Topics covered

  1. Amplicon filtering and read normalization.

  2. Genotype-guided ploidy computation and CNV clustering.

  3. Read-based ploidy computation and CNV clustering.

  4. Statistical significance analysis.

Filtering And Normalization#

# Filtering Amplicons
# This returns a table of the reads for each amplicon in each cell
reads = sample.cnv.get_attribute('read_counts', constraint='row+col')
reads
# Only amplicons found in more than half the cells are analyzed 
# The other amplicons are dropped
# Note: optional for samples with expected biological missing data
working_amplicons = (reads.median() > 0).values
sample.cnv = sample.cnv[:, working_amplicons]
# Additionally, we keep only valid barcodes from the DNA analysis and are storing the data
# in a new variable sample.cnv2 (e.g., in order to not overwrite sample.cnv)
sample.cnv2 = sample.cnv[sample.dna2.barcodes(),:]
# Compare amplicon number and cell number original
sample.cnv.shape
# Compare amplicon number and cell number
sample.cnv2.shape
help(sample.cnv2.normalize_reads)
# Normalize the reads 
sample.cnv2.normalize_reads()

Genotype-Guided Ploidy Clustering#

# Assuming the presence of a cell population that is considered diploid across
# all the amplicons, we can compute the ploidy of all other cell populations (previously defined)

sample.cnv2.compute_ploidy(diploid_cells=sample.dna.barcodes('Cell 1'))
# Assign the DNA labels to the CNV assay class
# We want to ensure the labels (e.g., names of cell populations) are identical across assay classes
sample.cnv2.set_labels(sample.dna2.get_labels())
sample.cnv2.set_palette(sample.dna2.get_palette())
# Heatmap with the features ordered by the default amplicon order
fig = sample.cnv2.heatmap('ploidy', features='positions')
go.Figure(fig)
# Scale the figure width and plot as a static image
# Double click on the plot to zoom-in and improve the resolution
fig = sample.cnv2.heatmap('ploidy', features='positions')
fig.layout.width = 1600
mutils.static_fig(fig, figsize=(20, 20))
# Heatmap for a subset of the chromosomes
fig = sample.cnv2.heatmap('ploidy', features=['7', '17', '20'])

# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0

go.Figure(fig)
# Heatmap with the features grouped by the genes
# The first time this runs, it fetches the gene names from ensembl
# The annotation can also be fetched using sample.cnv.get_gene_names()
# The plots can also be smoothed using a moving average with the convolve parameter

fig = sample.cnv2.heatmap('ploidy', features='genes', convolve=1)

# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0

go.Figure(fig)
# Change the color scale to "magma" - other suitable options might be "viridis", "plasma", "blues", "blues_r" ...
fig.layout.coloraxis.colorscale = 'magma'

# Update the separating lines to be black
for shape in fig.layout.shapes:
    shape.line.color = '#000000'

# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0

go.Figure(fig)
# Heatmap for a subset of the genes
fig = sample.cnv2.heatmap('ploidy', features=["ASXL1", "EZH2",'TP53'])
fig.layout.coloraxis.colorscale = 'magma'

# Update the separating lines to be black
for shape in fig.layout.shapes:
    shape.line.color = '#000000'

# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 2
#fig.layout.coloraxis.cmin = 0

go.Figure(fig)
help(sample.cnv.plot_ploidy)
# Ploidy line plot across amplicons for all WT cells
# WT clone cells normalized to ploidy = 2 by default
sample.cnv2.plot_ploidy('Cell 1')
# Same as above, only using different feature groupings (genes+amplicons)
# Red dots represent the median per gene, blue dots represent the individual amplicons in each gene
sample.cnv2.plot_ploidy('Cell 1', features="genes+amplicons")
# Evaluate ploidy for other genotype-based defined clones by updating the code
sample.cnv2.plot_ploidy('Cell 2')
# Evaluate ploidy for other genotype-based defined clones by updating the code
sample.cnv2.plot_ploidy('Cell 2', features="genes+amplicons")

Read-Based Ploidy Clustering#

# Cluster the CNV data using the reads via PCA/UMAP
# Note: all cell barcodes are now used, including cells that were previously filtered out due to incomplete genotype data

# First normalize the reads and plot a heatmap (which also adds the annotations)
sample.cnv.set_labels(sample.dna.get_labels())
sample.cnv.normalize_reads()
sample.cnv.heatmap(attribute='normalized_counts',  features='genes')
## OPTIONAL

# Clustering with all amplicons can often result in low resolution of CNV events
# If genes or chromosomes of interest are known, you can use these features for clustering
amps_of_interest = np.isin(sample.cnv.col_attrs["gene_name"], ["TP53", "EZH2", "TET2"])
sample.cnv = sample.cnv[:, amps_of_interest]
# The appropriate number of components for dimensionality reduction can be determined using the elbow plot
# We start with components = the number of amplicons
amp_number = sample.cnv.shape[1]
sample.cnv.run_pca(components=amp_number, attribute='normalized_counts',show_plot=True)
# After assessing the elbow plot, re-run PCA with your chosen number of PCs
# Too many PCA components may result in merging of clusters
# Too few PCA components may result in splitting of clusters
sample.cnv.run_pca(components=10, attribute='normalized_counts',show_plot=False)
# Run UMAP on top of the newly created PC dataframe
# See https://jlmelville.github.io/uwot/abparams.html for appropriate spread/min_dist values
sample.cnv.run_umap(attribute='pca', random_state=42) #, min_dist=0.2, spread=1.5)
# Cluster data using different methods
sample.cnv.cluster(attribute='umap', method='graph-community', k=400)
# Plot the UMAP scatterplot colored by cluster
fig = sample.cnv.scatterplot(attribute='umap',colorby='label')
go.Figure(fig)
# If analyzing multiple samples, color the UMAP by sample
fig = sample.cnv.scatterplot(attribute='umap',colorby='sample_name')
go.Figure(fig)
help(sample.cnv.heatmap)
# Plot the heatmap of normalized counts, stratified by cluster
sample.cnv.heatmap(attribute='normalized_counts')

Note, two different versions of the CNV subclass exist now:

  • sample.cnv: contains data based on the read-based ploidy calculation

  • sample.cnv2: contains data based on the genotype-guided ploidy calculation

Statistics#

pval_cnv, tstat_cnv = sample.cnv2.test_signature("normalized_counts")
pval_cnv
pval_cnv = pval_cnv + 10 ** -50
pvals_cnv = -np.log10(pval_cnv) * (tstat_cnv > 0)
# Colored tiles indicate strong association of a ploidy with a particular cluster
plt.figure(figsize=(20, 10))
fig = sns.heatmap(pvals_cnv.T, vmax=50, vmin=0)

Protein Analysis#

Topics covered

  1. Normalization and assessment of AOC abundance

  2. Clustering and visualization

  3. Statistical significance analysis

Normalization And Data Inspection#

help(sample.protein.normalize_reads)
# Normalize reads
# Three different methods available: CLR = Centered Log Ratio
sample.protein.normalize_reads('CLR') 
# Ridge plot: shows distribution of AOC counts across all cells (bimodal = positive and negative cell populations)
sample.protein.ridgeplot(attribute='normalized_counts',
                         features=sample.protein.ids())
# Plotting combinations of AOCs on a biaxial scatterplot, to mimic FACS data
fig = sample.protein.feature_scatter(layer='normalized_counts', ids=['CD19', 'CD34'])
go.Figure(fig)
#List all AOCs
sample.protein.ids()
# Optionally, remove AOCs from the data set prior to clustering 
# For example, those that don't display any signal in any cells
sample.protein = sample.protein.drop(['Mouse IgG1k'])

Clustering And Visualization#

help(sample.protein.run_pca)
# Clustering similar to "CLUSTERING METHOD #2" of DNA clustering using genotypes
# Select fewer components when analyzing custom panels with 2-20 AOCs
sample.protein.run_pca(attribute='normalized_counts', components=45,show_plot=True) 
# Rerun PCA with optimal number of components based on elbow plot analysis
# Typically, components = 10 is appropriate for 45-plex Biolegend panel
sample.protein.run_pca(attribute='normalized_counts', components=10, show_plot=False)
# Now run UMAP on the chosen PCs 
# UMAPs rely on an initial randomization which leads to different projections every time
# To address this, pass random_state to the run_umap method
# See https://jlmelville.github.io/uwot/abparams.html for appropriate spread/min_dist values
sample.protein.run_umap(attribute='pca', random_state=42) #, spread=, min_dist=
# Cluster the AOCs. Several options are available, see help info for more options
# Graph-community clustering: the higher the k value, the smaller the number of clusters --> adjust if necessary
sample.protein.cluster(attribute='umap', method='graph-community',k=400) 
help(sample.protein.cluster)
# Plot heatmap of normalized, clustered protein data
fig = sample.protein.heatmap(attribute='normalized_counts')
go.Figure(fig)
# If analyzing multiple samples, order barcodes by sample and then by cluster
fig = sample.protein.heatmap(attribute='normalized_counts',splitby='sample_name')
go.Figure(fig)
help(sample.protein.scatterplot)
# Plot UMAP for visualization of clusters
fig = sample.protein.scatterplot(attribute='umap',colorby='label')
go.Figure(fig)
# UMAP colored by the expression of each AOC
fig = sample.protein.scatterplot(attribute='umap',
                           colorby='normalized_counts',
                           features=sample.protein.ids())
go.Figure(fig)
# Relabel clone names acccording to biology, combine clones by assigned identical names
# Just an example! Need to fill this in based on knowledge of the sample and markers
sample.protein.rename_labels(
    {
        '1': 'Cell type A', 
        '2': 'Cell type B', 
        '3': 'Cell type C',
        '4': 'Cell type D',
        '5': "FP"
    }
)
# Run heatmap again with new labels
fig = sample.protein.heatmap(attribute='normalized_counts')
go.Figure(fig)
# Drop cells (barcodes) that need to be removed
# Instead of overwriting the sample.protein variable, we define new variable called sample.protein2
# Note: if you do not wish to drop any cells, this code will producs an error (ignore)

fp_barcodes2 = sample.protein.barcodes("FP")
sample.protein2 = sample.protein.drop(fp_barcodes2)

set(sample.protein2.get_labels())
# Use only if no barcodes are removed by the above-listed command
sample.protein2 = sample.protein
# Additionally, option to remove AOCs from the data 
sample.protein2 = sample.protein2.drop(['CD3'])
# Shape of cleaned up protein data (number of cells, number of AOCs)
sample.protein2.shape
# Re-run heatmap with new filtered data
fig = sample.protein2.heatmap(attribute='normalized_counts')
go.Figure(fig)

Statistics#

pval_protein, tstat_protein = sample.protein2.test_signature("normalized_counts")
pval_protein
pval_protein = pval_protein + 10 ** -50
pvals_protein = -np.log10(pval_protein) * (tstat_protein > 0)
# Colored tiles indicate strong association of a protein with a particular cluster
plt.figure(figsize=(20, 10))
fig = sns.heatmap(pvals_protein.T, vmax=50, vmin=40)

Combined visualizations#

Topics covered

  1. Defining a new sample object by combining the cleaned-up versions of the DNA, CNV and protein data

  2. Built-in methods for combined visualizations

  3. Additional visualizations with analyte-specific color schemes

  4. Quantification of DNA clone and protein cluster overlaps

  5. Export and save final data set

Create a new sample object#

# First, we ensure that all individual analytes are using the same cells.
# Since no cells are removed during CNV analysis, we take the intersect of cells retained in the DNA and protein data.
# In this and the following lines of code, use the final version of each assay, 
# which may differ from the defaults listed here (e.g. sample.dna3, sample.protein4, etc.)
x = sample.protein2.barcodes()
y = sample.dna2.barcodes()

z = np.intersect1d(x, y)
len(z)
# Subset protein data to include only barcodes that were included in both analytes and store in sample.protein.3 variable
sample.protein3 = sample.protein2[z,sample.protein2.ids()]
# Subset dna data to include only barcodes that were included in both analytes and store in sample.dna.3 variable
sample.dna3 = sample.dna2[z,sample.dna2.ids()] 
#Update each sub-class (e.g., DNA, CNV, Protein) in order to visualize "clean" data only with relevant features and cells.
sample2 = sample[:]
#sub-assays
sample2.dna = sample.dna3 
sample2.cnv = sample.cnv2
sample2.protein = sample.protein3 
# Check dimensionality for each subclass; the number of cells (first number) should be the same in each data set
print(sample2.dna.shape,
sample2.cnv.shape,
sample2.protein.shape)

Built-in visualizations for combining analytes#

# A heatmap presenting the data for up to three analytes
# Note, the default layer in the DNA subclass is NGT, not NGT_FILTERED
# If you wish to use NGT_FILTERED, you must first run the following line (commented out)
# Note that this will erase the NGT layer and replace it with NGT_FILTERED (which includes missing data)
#sample2.dna.layers['NGT'] = sample2.dna.layers['NGT_FILTERED']

fig = sample2.heatmap(clusterby='dna', sortby='protein', flatten=False)
go.Figure(fig)

Often the number of amplicons in the CNV data will take over the heatmap, making the plot uninterpretable. Moreover, there might be certain non-differentiating variants and antibod in the panel. These can be dropped before making the final heatmap.

Note: when subsetting any of the analytes, the assay will “shrink” to contain only those features. To go back to the full data set, you must re-initialize your assay (see below).

# Filter the CNV with amplicons only from the relevant genes
genes = sample2.cnv.col_attrs['gene_name'].copy()
relevant_ids = np.isin(genes, ['ASXL1','EZH2','TP53'])

sample2.cnv = sample2.cnv[:, relevant_ids]
# Plot again w/ customization (e.g., color code)
fig = sample2.heatmap(clusterby='dna', sortby='cnv',drop='protein', flatten=False)

# Update the width of the plot [See the section on CNV heatmaps]
fig.layout.width = 3000

# Change the CNV colorscale [See the section on CNV heatmaps]
fig.data[2].zmax = 2
fig.data[2].zmin = 0
fig.data[2].colorscale = 'magma'

# Updating the ticktexts to show the gene names instead
fig.layout.xaxis3.ticktext = sample2.cnv.col_attrs['gene_name'].copy()

# Show as a static plot
mutils.static_fig(fig, figsize=(20, 20))
# To reset the CNV data (include all filtered amplicons instead of this new subset), simply re-initialize
sample2.cnv = sample.cnv2
# Alternatively, try the following combined heatmaps
sample2.heatmap(clusterby='dna', sortby='protein', drop='cnv', flatten=False)
# sample2.heatmap(clusterby='protein', sortby='dna', drop='cnv', flatten=True)
# sample2.heatmap(clusterby='dna', sortby='protein', flatten=False)
help(sample2.clone_vs_analyte)
# Visualize the CNV data stratified by clone
fig = sample2.clone_vs_analyte('cnv')
fig
#fig.savefig('genotype_cnv.png')
# Visualize the protein data (violin plots) stratified by clone
fig = sample2.clone_vs_analyte('protein')
fig
# To review a subset of features, first subset the assay itself 
sample2.protein = sample2.protein[:, ['CD11b', 'CD19', 'CD38', 'CD90']]
sample2.clone_vs_analyte('protein')
# To reset the Protein data (include all AOCs instead of this new subset), simply re-initialize
sample2.protein = sample.protein3

Additional visualizations#

In addition to the built-in methods above, the analytes can be intersected by re-plotting some of the single-analyte visualizations, but coloring them by the labels of a different analyte.

# Plot the protein heatmap with the DNA clone labels
fig = sample2.protein.heatmap(attribute='normalized_counts',splitby=sample2.dna.row_attrs['label'])
go.Figure(fig)
# Plot the protein UMAP and color with the DNA clone labels
protein_umap = sample2.protein.row_attrs["umap"]
fig=sample2.dna.scatterplot(attribute=protein_umap, colorby="label")
go.Figure(fig)
# Plot the protein UMAP and color with variant layers (e.g., VAFs) for specific variants
protein_umap = sample2.protein.row_attrs["umap"]
feats = sample2.dna.ids()[:3] # plot the first 3 variants
sample2.dna.scatterplot(attribute=protein_umap, colorby="AF", features=feats)
# Plot the CNV UMAP with the DNA NGT color code
# Note, this function is only applicable in the case of defining "CNV clusters" using the read-based approach (not the genotype-based approach)
# In this example, we have the read-based clustering saved in sample.cnv
# Verify that this object has the umap attribute

sample.cnv.row_attrs.keys()
# First, take only the barcodes retained in the dna data
sample.cnv = sample.cnv[sample2.dna.barcodes(),:]

# Then, plot the umap colored by NGT_FILTERED for certain variants
cnv_umap = sample.cnv.row_attrs["umap"]
feats = sample2.dna.ids()[:3] # plot the first 3 variants
sample2.dna.scatterplot(attribute=cnv_umap, colorby="NGT_FILTERED", features=feats)
sample2.protein.ids()
# Plot the AOC ridge plots split by DNA clone
sample2.protein.ridgeplot(
    attribute='normalized_counts',
    features=['CD38', 'CD45', 'CD110', 'CD19'],
    splitby=sample2.dna.get_labels(),
)
# Plot the AOC ridge plots split by protein cluster
sample2.protein.ridgeplot(
    attribute='normalized_counts',
    features=['CD38', 'CD45', 'CD110', 'CD19'],
    splitby=sample2.dna.get_labels(),
)
# Plot the biaxial AOC plot split by DNA clone
fig = sample2.protein.feature_scatter(layer='normalized_counts', ids=['CD19', 'CD34'], colorby = sample2.dna.row_attrs['label'])
go.Figure(fig)
sample2.dna.ids()
# Plot the biaxial AOC plot colored by genotype (or AF_MISSING, etc.) for specific variants
variants = ['DNMT3A:2:25458546:C:T', 'WT1:11:32417945:T:C']
layer = 'NGT_FILTERED'
AOCs = ['CD19', 'CD34'] # only two!

# Set the layout and size of the resulting plots (rows, columns) - may need to adjust based on number of variants
fig, axs = plt.subplots(1, len(variants), figsize=(30, 10))

# Generate plots for each variant
counter = 0
for variant in variants:
    color_variant = sample2.dna.layers[layer][:, sample2.dna.ids() == variant]
    fig = sample2.protein.feature_scatter(layer='normalized_counts', ids=AOCs, colorby=color_variant, title=("\t" + variant))
    mutils.static_fig(fig, ax=axs[counter])
    counter = counter + 1

# Show the plot
for i in range(0, counter):
    axs[i].axis('off')

plt.tight_layout()
plt.show()    

Quantify overlaps between DNA clones and protein clusters#

# Count total cells in each cluster (cell type) and stratify by clone
from collections import Counter
all_cells = sample2.protein.row_attrs['label']
print("Abundance in all clones")
for i in set(all_cells):
    x = Counter(all_cells)[i]
    y = round((x/len(all_cells))*100,2)
    print("    ", i, x, "  ",y, "%")
    
for i in set(sample2.dna.row_attrs['label']):
    names = sample2.dna.row_attrs['label']
    print("\nAbundance in", i)
    subset = [all_cells[a] for a in range(0, len(all_cells)) if names[a] == i]
    for j in set(subset):
        x = Counter(subset)[j]
        y = round((x/len(subset))*100,2)
        print("    ", j, x, "  ",y, "%")
# For each cell type, measure abundance of each DNA clone
all_cells = sample2.protein.row_attrs['label']
for i in set(all_cells):
    names = sample2.dna.row_attrs['label']
    print("\nClonal abundance in", i)
    subset = [names[a] for a in range(0, len(names)) if all_cells[a] == i]
    for j in set(subset):
        x = Counter(subset)[j]
        y = round((x/len(subset))*100,2)
        print("    ", j, x, "  ",y, "%")
# If analyzing multiple samples: count total cells in each cluster (cell type) and stratify by sample
all_cells = sample2.protein.row_attrs['label']
print("Abundance in merged data")
for i in set(all_cells):
    x = Counter(all_cells)[i]
    y = round((x/len(all_cells))*100,2)
    print("    ", i, x, "  ",y, "%")
    
for i in set(sample2.protein.row_attrs['sample_name']):
    names = sample2.protein.row_attrs['sample_name']
    print("\nAbundance in", i)
    subset = [all_cells[a] for a in range(0, len(all_cells)) if names[a] == i]
    for j in set(subset):
        x = Counter(subset)[j]
        y = round((x/len(subset))*100,2)
        print("    ", j, x, "  ",y, "%")

Export and Save Data#

# Save new h5 file that includes only the final, cleaned dataset
ms.save(sample2, "FilteredData.h5")
# Export data into csv format
import os

## set directory that does NOT exist
folder_to_save = './h5_extract/'

os.mkdir(folder_to_save)

for assay in [sample2.dna, sample2.cnv, sample2.protein]:
    if assay is not None:
        os.mkdir(f'{folder_to_save}/{assay.name}')
        os.mkdir(f'{folder_to_save}/{assay.name}/layers')
        os.mkdir(f'{folder_to_save}/{assay.name}/rows')

        for layer in assay.layers.keys():
            df = assay.get_attribute(layer, constraint='row+col')
            cols = list(df.columns.values)
            df.loc[:, 'label'] = assay.get_labels()
            df = df.loc[:, ['label'] + cols]
            df.to_csv(f'{folder_to_save}/{assay.name}/layers/{layer}.csv')
        
        for row in assay.row_attrs.keys():
            df = assay.get_attribute(row, constraint='row')
            cols = list(df.columns.values)
            df.loc[:, 'label'] = assay.get_labels()
            df = df.loc[:, ['label'] + cols]
            df.to_csv(f'{folder_to_save}/{assay.name}/rows/{row}.csv')