Multisample DNA-only#

This notebook is set up to work with merged h5 files, containing more than one sample, generated with v2 or v3 chemistry. This notebook will only work with Mosaic versions 3.7 and higher.

Objective: To showcase the minimum number of steps required for tertiary analysis of DNA (single-cell genotyping and CNV) and explore different ways of visualizing the data.

Major questions answered:

  1. Can we identify DNA clones based on genotypes (SNVs/Indels)?

  2. How is the clonal structure different between samples?

  3. Do we detect CNV events (e.g., copy number amplification, copy number loss)?

Sections:

  1. Setup

  2. Data Structure

  3. DNA Analysis

  4. CNV Analysis

  5. Combined Visualizations

  6. Export and Save Data

  7. Appendix

Not shown: All available methods and options - documented here

Setup#

Topics covered

  1. Loading required packages.

  2. Loading in data.

Load data
Note: importing dependencies can sometimes take a couple of minutes.

# Import mosaic libraries
import missionbio.mosaic as ms

# Import these to display entire dataframes
from IPython.display import display, HTML

# Import graph_objects from the plotly package to display figures when saving the notebook as an HTML
import plotly as px
import plotly.graph_objects as go

# Import additional packages for specific visuals
import matplotlib.pyplot as plt
import plotly.offline as pyo
pyo.init_notebook_mode()
import numpy as np
import seaborn as sns
from missionbio.plotting.multimap import MultiMap
from missionbio.plotting.heatmap import Heatmap

# Import COMPASS for imputation
from missionbio.mosaic.algorithms.compass import COMPASS
import pandas as pd

# Note: when exporting the notebook as an HTML, plots that use the "go.Figure(fig)" command are saved
# This code is optional, but will make the notebook cells/figures display across the entire width of the browser
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# Check version; this notebook is designed for Mosaic v3.7 or higher
print(ms.__version__)
# Any function's parameters and default values can be looked up via the 'help' function
# Here, the function is 'ms.load'
help(ms.load)

Loading the data #

In the ms.load statement, there are several arguments:

  • raw : always set raw=False; if raw=True, ALL barcodes will be loaded (rather than cell-associated barcodes)

  • filter_variants : always set filter_variants=True unless an expected (target) variant cannot be found. Additional filtering options are included in the DNA section below. NOTE : if loading using filter_variants=False, the whitelist may need to set to None or removed from the command, otherwise it may not load variants in

  • single : always set single=False for multisample/merged h5 files

  • whitelist : The whitelist option loads any variant that is in the vcf.gz file (e.g. “chr1:179520511:C/G”). The whitelist argument takes many variant formats, they are detailed here.

  • filter_cells : Always set filter_cells=False; if filter_cells=True, only cells called by the completelyness algorithm are loaded

# Specify the h5 file to be used in this analysis: h5path = '/path/to/h5/file/test.h5'
# If working with Windows, an 'r' may need to be added before the path: h5path = r'/path/to/h5/file/test.h5'
h5path = '/Users/botero/Desktop/Sample Data/pbmc.multisample.h5'

# Load the data
group = ms.load(h5path, raw=False, filter_variants=True, filter_cells=False, whitelist = [])
# Print the list of samples in the group object
[s.name for s in group.samples]

Data Structure#

DNA, CNV, and Protein are sub-classes of the Assay class. The information is stored in four categories, and the user can modify each of those:

1. metadata (add_metadata / del_metadata):

  • dictionary containing assay parameters and basic performance metrics

2. layers (add_layer / del_layer):

  • dictionary containing matrices of assay metrics

  • all the matrices have the shape (num barcodes) x (num ids)

  • for DNA assays, this includes AF, GQ, DP, etc. (per cell, per variant)

  • for Protein assays, this includes read counts (per cell, per antibody)

  • for CNV assays, this includes read counts (per cell, per amplicon)

3. row_attrs (add_row_attr / del_row_attr):

  • dictionary containing ‘barcode’ as one of the keys (where the value is a list of all barcodes)

  • for all other keys, the values must be of the same length, i.e. match the number of barcodes

  • this is the attribute where ‘label’, ‘pca’, and ‘umap’ values are added

4. col_attrs (add_col_attr / del_col_attr):

  • dictionary containing ‘id’ as one of the keys (where the value is a list of all ids)

  • for DNA assays, ‘ids’ are variants; for Protein assays, ‘ids’ are antibodies; for CNV assays, ‘ids’ are amplicons

  • for all other keys, the values must be of the same length, i.e. match the number ids

# Summary of DNA assay, per sample
for sample in group:
    print(sample.name)
    print("\'sample.dna\':", sample.dna, '\n')
    print("\'row_attrs\':", "\n\t", list(sample.dna.row_attrs.keys()), '\n')
    print("\'col_attrs\':", "\n\t", list(sample.dna.col_attrs.keys()), '\n')
    print("\'layers\':", "\n\t", list(sample.dna.layers.keys()), '\n')
    print("\'metadata\':", "\n")
    for i in list(sample.dna.metadata.keys()):
        print("\t", i, ": ", sample.dna.metadata[i], sep="")
    print("\n")
# For DNA, ids are variants
# sample.dna.ids() is a shortcut for sample.dna.col_attrs['id']
for sample in group:
    print(sample.name, "\n", sample.dna.ids(), "\n")
# Summary of CNV assay, per sample
for sample in group:
    print("\'sample.cnv\':", sample.cnv, '\n')
    print("\'row_attrs\':", "\n\t", list(sample.cnv.row_attrs.keys()), '\n')
    print("\'col_attrs\':", "\n\t", list(sample.cnv.col_attrs.keys()), '\n')
    print("\'layers\':", "\n\t", list(sample.cnv.layers.keys()), '\n')
    print("\'metadata\':", "\n")
    for i in list(sample.cnv.metadata.keys()):
        print("\t", i, ": ", sample.cnv.metadata[i], sep="")
    print("\n")
# For CNV, ids are amplicons
# sample.cnv.ids() is a shortcut for sample.cnv.col_attrs['id']
for sample in group:
    print(sample.name, "\n", sample.cnv.ids(), "\n")

DNA Analysis#

Topics covered

  1. Standard filtering of DNA variants.

  2. Subsetting dataset for variants of interest, including whitelisted variants.

  3. Addition of annotations to the variants.

  4. Manual variant selection and clone identification

  5. Visualizations and customization options

Basic filtering#

There are many options for filtering DNA variants. Use the help() function to understand the approach listed below.

help(sample.dna.filter_variants)
# Adjust filters if needed by overwriting dna_vars
def custom_filt(sample):
    filt_vars = sample.dna.filter_variants(
        min_dp=10,
        min_gq=30,
        vaf_ref=5,
        vaf_hom=95,
        vaf_het=30,
        min_prct_cells=50,
        min_mut_prct_cells=1,
        iterations=10
)
    return filt_vars
    
dna_vars = group.apply(custom_filt)

# List all filtered variants
# Check the number of filtered variants
for i in range(len(group.samples)):
    sample = group.samples[i]
    print(sample.name)
    print("Number of variants:", len(dna_vars[i]))
    print(dna_vars[i], "\n")

Note: Variants that are whitelisted during sample loading may be removed at this step, but can be added back in below. Whitelisted variants can be added to the list of final variants, or used exclusively in the code below.

# Ensure correct nomenclature, ie whitelist = ["chr13:28589657:T/G","chrX:39921424:G/A"]
# While the load statement accepts many variant formats, this whitelist does not
# If there are no whitelist variants, this can be left blank
white_list = []
# Check that whitelist variants are included in current data
# Should return true before moving on
for sample in group:
    print(set(white_list).issubset(set(sample.dna.ids())))
# Subset the same variants in all dna assays
# It is important to maintain the same variants across all dna assays

# Establish the number of variants for each sample
og_num_vars = [s.dna.shape[1] for s in group.samples]

#Subset all samples to look at the union of all variants, and whitelist variants
var_union = list(set().union(*dna_vars,white_list))

for sample in group:
    sample.dna = sample.dna[:, var_union]  # Subsets all samples with the same variants

# Establish the new number of varaints for each sample
new_num_vars = [s.dna.shape[1] for s in group.samples]

print(og_num_vars, new_num_vars)  # The old and new number of variants for each sample in the group
# Merge the samples into one data frame, instead of having them as two separate ones
combined = group.merge()
# List all filtered variants in combined
combined.dna.ids()
# Dimensionality of the original sample.dna dataframe
# First number = number of cells (rows); second number = number of variants (columns)
combined.dna.shape

Annotation Addition#

help(sample.dna.get_annotations)
# Fetch annotations using varsome
# Note: run this on a filtered DNA sample - too many variants (e.g., 100+) are not handled correctly by the method
ann = combined.dna.get_annotations()  

Variant selection and Subclone identification#

In this section of the notebook, all variants remaining in the dataset will populate in a variant table. This table is interactive, variants can be selected, and rows can be sorted by ascending/descending values. The variant name can be clicked on and will navigate to the variant’s varsome page in the default browser.

  1. Variants selected in this table will populate in a subclone table below.

  2. The variants in the subclone table can be highlighted and assessed for Read Depth, Genotype Quality and Variant Allele Frequency.

  3. Subclones can be renamed by clicking on the pencil icon.

  4. ADO score: The ADO score threshold can be adjusted, but by default is set to 1. Any clones with an ADO score higher this will be moved into the “ADO subclones” column. We recommend moving any clones with a score of .8 or higher into this column. The algorithm used to generate the score is detailed here.

  5. Min clone size: The Min Clone Size can also be adjusted, but by default is set to 1. Any clones that represent less than 1% of the sample will be moved into the “Small Subclones” column.

  6. Cells with missing genotype information across any of the selected variants will be moved into the “Missing GT” column.

# Run the variant table workflow to select variants and begin clone identification
wfv = ms.workflows.VariantSubcloneTable(combined)
wfv.run()
# The width can be adjusted if needed for long variant names
# wfv.run(width=3000)

NOTE: When continuing onto the next piece of the notebook, the workflow will stop working.

# Subsample the variants to only the variants selected from the workflow
variants = wfv.selected_variants
combined.dna = combined.dna[:,variants]

# Save the full set of cells to a new variable that can be called on later
# Do this before renaming the variants
full_combined = combined[:]
# Rename the variants
# Any of these column values can be added to the id names
combined.dna.col_attrs.keys()
# Add annotation to the id names
combined.dna.set_ids_from_cols(['annotated Gene', 'id'])

# Another example:
# sample.dna.set_ids_from_cols(['annotated Gene', 'CHROM', 'POS', 'REF', 'ALT'])

# Annotations are now added to the variants
combined.dna.ids()

# Use sample.dna.reset_ids() to get the original ids
# Plot heatmap using NGT_FILTERED
fig = combined.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Clone removal
# Remove barcodes from the missing GT, small subclones, ADO subclones or FP labels
clones = ['missing', 'small', 'ADO', 'FP']
for c in clones:
    cells = combined.dna.barcodes({c})
    if len(cells) > 0:
        combined.dna = combined.dna.drop(cells)
set(combined.dna.get_labels())
# If subclones need to be renamed outside of the workflow
# The following code can be used:
# NOTE: all clones need to be listed for this function to work
combined.dna.rename_labels(
  {
    'current_subclone_name': 'new_subclone_name', 
    'WT': 'Wild Type',
    '3': 'three',
    '4': 'four',      
  }
)
# Redraw heatmap
fig = combined.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Redraw heatmap, split by sample
fig = combined.dna.heatmap(attribute='NGT_FILTERED', splitby='sample_name')
go.Figure(fig)
# Create a DNA heatmap to see both the sample names and the subclone labels
# Parameters
attribute = 'NGT_FILTERED'
assay = combined.dna
lab_row1 = "label"
lab_row2 = "sample_name"
lab2_title = "Sample"

# Graphing
data = assay.get_attribute(attribute, constraint="row+col")
ids = assay.clustered_ids(attribute)
labels = pd.concat([assay.get_attribute(lab_row1), assay.get_attribute(lab_row2).rename(columns={"1": "2"})], axis=1)
bars = assay.clustered_barcodes(attribute, splitby=labels)

data = data.loc[bars, ids]
labs1 = assay.get_attribute(lab_row1, constraint="row").loc[bars, :].values.flatten()
labs2 = assay.get_attribute(lab_row2, constraint="row").loc[bars, :].values.flatten()

fig = MultiMap([data], labs1, names=[attribute], extra_column_titles=[lab2_title], widths=[1 / 24, 1]).draw()
fig.add_trace(Heatmap(data, labs2).label_trace(), row=1, col=2)
fig.layout.xaxis2.tickvals = []

fig
# Evaluate new total number of cells after the above filtering
combined.dna.shape
# Subset the barcodes to match across all assays
# This will return the barcodes common to all assays in the sample.
combined.common_barcodes()

# Use that to filter the sample so that only the common barcodes are present in all assays
combined = combined[combined.common_barcodes()]
# Split merged object, separate the data into two data frames again
# Need to first split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))
# Multisample DNA barplot
fig = group.barplot("dna", percentage=True)

#To change the color of certain clones, try something like the following:
#fig.data[0].marker.color = "red"
#fig.data[5].marker.color = "orange"
fig.update_layout(title = 'Clone size as a percentage of total population')
fig.show()
# Fishplot
# Multisample DNA fishplot
# Order samples in sample_order and fill in the list with their specific name
fig = group.fishplot("dna", sample_order=['Sample 1', 'Sample 2'])
fig.layout.height = 750
fig.show()
# Further specify labels and parents to reorganize the fishplot
# Want to hide the phylogeny tree? draw_graph=False
# Documentation here: https://missionbio.github.io/mosaic/autosummary_pages/methods/missionbio.mosaic.samplegroup.SampleGroup.fishplot.html#missionbio.mosaic.samplegroup.SampleGroup.fishplot
# Heatmap per sample
for sample in group:
    fig = sample.dna.heatmap(attribute='NGT_FILTERED',splitby="label")
    fig.update_layout(title = sample.name)
    go.Figure(fig)
    fig.show()
# Shape of each sample
for sample in group:
    print(sample.name, sample.dna.shape)

Adjusting subclone colors or heatmap colors#

To change the color palette used for the subclones, use set_palette. For this provide a list of ALL subclone names, pointing to the hexcode/color to change that subclone to. Note: this function also works with the protein and CNV assays. See below for an example of this.

To change the color palette used for any of the assays/layers, use ms.Config.Colorscale. Then list the assay and layer, and assign the plotly colorscale to use for those graphics. Visualize all available colorscales by using plotly.colors.sequential.swatches_continuous().show(). To reset all color palettes back to their defaults, use ms.Config.Colorscle.reset().

# Example of changing the colors assigned to each clone
sample_one = group.samples[0]
sample_two = group.samples[1]
sample_one.dna.set_palette({"Clone 1": "#800080", "Clone 2": "#FF69B4"})
sample_two.dna.set_palette({"Clone 1": "#800080", "Clone 2": "#FF69B4"})
# Change the color of a subclone for the entire group object
# These changes will be applied to the barplots and fishplots
# Example
pal = group.apply(lambda s: s.dna.get_palette())
pal = {k: v for d in pal for k, v in d.items()}

pal["Clone 1"] = 'LightCyan'
pal["Clone 2"] = 'Lavender'
pal["Clone 3"] = 'PaleGreen'

group.apply(lambda s: s.dna.set_palette(pal))

# Built in color options can be found here: https://www.w3schools.com/cssref/css_colors.php
# Change colorScale to change the heatmap colors
# This will change the color of the dna.ngt layer to be viridis
for sample in group:
    ms.Config.Colorscale.Dna.NGT = 'viridis'
    fig = sample.dna.heatmap('NGT')
    fig.update_layout(title = sample.name)
    fig.show()

# To reset this, run the following to reset just that one modification
#ms.Config.Colorscale.Dna.reset("NGT")
# Or run this to reset any/all modifications:
#ms.Config.Colorscale.reset()
# Colorscale cane be changed for DNA, CNV and Protein assays

To return back to the original population of cells, reset the data using: sample.reset("dna") This command ‘sample.reset’ works on all assays, including CNV and protein.

CNV Analysis#

Topics covered

  1. Amplicon and cell filtering and ploidy estimation.

  2. Visualization of ploidy across subclones present.

# Get gene names for amplicons
for sample in group:
    sample.cnv.get_gene_names()
# Define the samples
# Go through the CNV workflow once per sample
sample_one = group.samples[0]
sample_two = group.samples[1]

#Additional lines will be needed if there are more than two samples
#sample_three = group.samples[2]
# Establish the order of the samples in the group object
# Note: timepoint one will not automatically be the first sample
for sample in group:
    print(sample.name)

CNV workflow#

This workflow will normalize all reads and filter amplicons/cells based on the settings set at the beginning of the workflow:

  1. Amplicon completeness: refers to the minimum percentage of cells that must have reads greater than or equal to the minimum read depth set. By default this is set to 50.

  2. Amplicon read depth: refers to the minimum read depth for each amplicon-barcode combination to not be considered missing. By default this is set to 10.

  3. Mean cell read depth: refers the minimum mean read depth for a cell to be included in the analysis, otherwise the cell will be removed. By default this is set to 40.

  4. Diploid clone in DNA: refers to which subclone is used as the true diploid population. All ploidy estimates will be calculated in relation to this diploid population. We recommend setting this to the ‘WT’ population or earliest clone present.

Note: If large copy number events are expected, Amplicon Completeness and Amplicon Read Depth may need to be reduced.

Once the above filters are set, the visualizations can be changed.

  1. Plot: Can be changed from Heatmap positions, to Heatmap genes, Line-plot positions, Line-plot genes

  2. Clone for line plot: If one of the line-plot visualizations is selected, only one clone can be shown at a time. This determines which one is plotted.

  3. X-axis features: To plot a subset of the data (chromosomes or genes), select which chromosomes/genes should be plotted with this function. Chromosomes can be selected for ‘positions’ type plots, and Genes can be selected for ‘genes’ type plots.

# CNV workflow to filter, normalize and estimate ploidy
wfc = ms.workflows.CopyNumber(sample_one)
wfc.run()

NOTE: When continuing onto the next piece of the notebook, the workflow will stop working.

# Heatmap with the features ordered by the default amplicon order
# To plot just a subset of chromosomes, put them in list format in the features argument
fig = sample_one.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']

# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0

# Optionally, change the size of the figure:
#fig.layout.width = 1600
#fig.layout.height = 1500
fig.update_layout(title = sample_one.name)
go.Figure(fig)
# Heatmap with the features grouped by the genes
# To plot just a subset of genes, put them in list format in the features argument
fig = sample_one.cnv.heatmap('ploidy', features='genes', convolve=1) #features=["ASXL1", "EZH2",'TP53']

# Optionally, update the separating lines to be black
#for shape in fig.layout.shapes:
#    shape.line.color = '#000000'
fig.update_layout(title = sample_one.name)
go.Figure(fig)
# Show heatmap with convolve and subclustering turned off
bars = sample_one.cnv.clustered_barcodes('ploidy', subcluster=False)

# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to visualize
fig = sample_one.cnv.heatmap('ploidy', bars_order=bars, convolve=20)
fig.layout.width = 900
fig.update_layout(title = sample_one.name)
fig.show()

Note: The following is a partial duplication to run through the CNV workflow with the second/subsequent sample.

# Go through the workflow for any additional samples
# CNV workflow to filter, normalize and estimate ploidy
wfc = ms.workflows.CopyNumber(sample_two)
wfc.run()

NOTE: When continuing onto the next piece of the notebook, the workflow will stop working.

# Heatmap with the features ordered by the default amplicon order
# To plot just a subset of chromosomes, put them in list format in the features argument
fig = sample_two.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']

# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0

# Optionally, change the size of the figure:
#fig.layout.width = 1600
#fig.layout.height = 1500
fig.update_layout(title = sample_two.name)
go.Figure(fig)
# Heatmap with the features grouped by the genes
# To plot just a subset of genes, put them in list format in the features argument
fig = sample_two.cnv.heatmap('ploidy', features='genes', convolve=1) #features=["ASXL1", "EZH2",'TP53']

# Optionally, update the separating lines to be black
#for shape in fig.layout.shapes:
#    shape.line.color = '#000000'
fig.update_layout(title = sample_two.name)
go.Figure(fig)
# Show heatmap with convolve and subclustering turned off
bars = sample_two.cnv.clustered_barcodes('ploidy', subcluster=False)

# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to visualize
fig = sample_two.cnv.heatmap('ploidy', bars_order=bars, convolve=20)
fig.layout.width = 900
fig.update_layout(title = sample_two.name)
fig.show()

Note: The next few cells are highly optional, and only need to be run to plot cells from all samples together in one heatmap.

# Plot heatmaps of all samples together
combined = group.merge()
combined.cnv.heatmap('ploidy', splitby='sample_name')
# Subset the barcodes to match across all assays
# This will return the barcodes common to all assays in the sample
combined.common_barcodes()

# Use that to filter the sample so that only the common barcodes are present in all assays
combined = combined[combined.common_barcodes()]
# Split merged object
# Need to split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))

Combined Visualizations#

In this section the dateset will be subset to only retain barcodes with remaining DNA and CNV data. Then this data can be plotted together using sample.heatmap(), sample.signaturemap().

The first few kernels of code will generate one plot per sample for the combined visualizations, the last portion of this section will combine the samples into one dataframe and plot them all together.

# This will return the barcodes common to all assays in the sample, for each sample.
for sample in group:
    sample.common_barcodes()
    # Use that to filter the sample so that only the common barcodes are present in all assays
    sample = sample[sample.common_barcodes()]
# Check dimensionality for each assay; the number of cells (first number) should be the same in each data set
# For each sample
for sample in group:
    print(sample.name, sample.dna.shape, sample.cnv.shape)
# DNA + CNV heatmap
# One per sample
for sample in group:
    fig = sample.heatmap(
        clusterby=('dna', 'cnv'),  # The first assay is used for the labels
        attributes=['AF', 'ploidy'],
        features=[None, 'genes'],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
    )
    fig.update_layout(title = sample.name)
    fig.show()
help(sample.signaturemap)
# Plot a combined signature heatmap, showing DNA and CNV signatures for all subclones
# One per sample
for sample in group:
    fig = sample.signaturemap(
        clusterby=('dna', 'cnv'),
        attributes=('NGT', 'ploidy'),
        features=[None, 'genes'],
        signature_kind=['median', 'median'],
        widths=None,
        order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
    )
    fig.update_layout(title = sample.name)
    fig.show()
# Merge the samples into one data set
combined = group.merge()
# DNA + CNV heatmap
# For all samples together
fig = combined.heatmap(
        clusterby=('dna', 'cnv'),  # The first assay is used for the labels
        attributes=['AF', 'ploidy'],
        features=[None, 'genes'],  # If None, then clustered_ids is used
        bars_order=None,  # The order of the barcodes
        widths=None,
        order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
    )
fig.show()
# Plot a combined signature heatmap, showing DNA and CNV signatures for all subclones
# For all samples together
fig = combined.signaturemap(
        clusterby=('dna', 'cnv'),
        attributes=('NGT', 'ploidy'),
        features=[None, 'genes'],
        signature_kind=['median', 'median'],
        widths=None,
        order=('dna', 'cnv')  # The order in which the heatmaps should be drawn
    )
fig.show()
# Clone vs analyte
# Visualize the CNV data stratified by clone
# For all samples together
fig = combined.clone_vs_analyte('cnv')
plt.gcf().axes[1].texts[-1].set_text("")

Export and Save Data#

In this section the data can export a filtered .h5 file, which will contain all new labels/layers, and contain only the filtered barcodes/cells remaining. All data can be exported (row attributes, column attributes and layers) for every assay (DNA and CNV) into easily parsable .csv tables.

# Save new h5 file that includes only the final, cleaned dataset
ms.save(combined, 'FilteredData.h5')
# Export data into csv formats
ms.to_zip(combined, 'filename')

Appendix#

Compass imputation#

This section of the notebook is OPTIONAL

Compass can be used to impute the genotypes of cells with some missing data. It can also be used to infer the phylogeny of all subclones present in the sample. If a cell cannot be unambiguously classified, Compass will label these cells as Mixed or Ambiguous. Compass can take ~1-10 minutes to run depending on the size of the data.

The Compass publication can be found here

Note: Compass can give different subclone composition than the variant subclone workflow.

# Use compass to assume the subclone architecture and phylogeny
# Depending on the size and complexity of the data, this step can take ~1-10 minutes
# Use the full_combined object to utilize all cells and unannotated variants
compass = COMPASS(full_combined, somatic_variants=variants)
compass.run()
# The phylogentic tree prediction by COMPASS
fig = compass.plot_tree()
fig.show()
# Make the compass_labels the node descriptions
# These are the new subclone labels
labs = compass.labels_  # Stores the labels
desc = compass.node_descriptions()  # Stores the dict mapping the label to the description
compass_labels = np.array([desc.get(lab, lab) for lab in labs])
# Store the compass label as a row_attr
full_combined.dna.add_row_attr('compass_labels', compass_labels)
# Store the index positions of the COMPASS ids for the heatmap
idx = np.isin(full_combined.dna.ids(), compass.somatic_variants)
# Visualize the assignment using the variants passed to COMPASS
fig = full_combined.dna.heatmap('NGT', features=full_combined.dna.ids()[idx], splitby='compass_labels')
fig.show()
# Create a DNA heatmap to see both the variant workflow subclone labels and the compass assigned labels
# Parameters
attribute = 'NGT'
assay = full_combined.dna
lab_row1 = "label"
lab_row2 = "compass_labels"
lab2_title = "Compass"

# Graphing
data = assay.get_attribute(attribute, constraint="row+col")
ids = assay.clustered_ids(attribute)
labels = pd.concat([assay.get_attribute(lab_row1), assay.get_attribute(lab_row2).rename(columns={"1": "2"})], axis=1)
bars = assay.clustered_barcodes(attribute, splitby=labels)

data = data.loc[bars, ids]
labs1 = assay.get_attribute(lab_row1, constraint="row").loc[bars, :].values.flatten()
labs2 = assay.get_attribute(lab_row2, constraint="row").loc[bars, :].values.flatten()

fig = MultiMap([data], labs1, names=[attribute], extra_column_titles=[lab2_title], widths=[1 / 24, 1]).draw()
fig.add_trace(Heatmap(data, labs2).label_trace(), row=1, col=2)
fig.layout.xaxis2.tickvals = []

fig
# This will return a dataframe 
# Showing the overlap of variant workflow clones and compass identified clones
full_combined.dna.crosstab(compass_labels, normalize='columns')
# This will plot a heatmap of the crosstab dataframe
full_combined.dna.crosstabmap(compass_labels, normalize='columns').show()

Export data to use in COMPASS

# Export compass data into 2 output files
# compass_input_regions.csv and compass_input_variants.csv
# These can be used with COMPASS outside of Mosaic
compass = COMPASS(sample, sample.dna.filter_somatic_variants())
compass.prepare_input(prefix="./compass_input")

DNA clustering#

This section of the notebook is optional

Below is an alternative way to cluster cells based on DNA data. These feature is useful in understanding the diversity within the cell population and may reduce the amount of cells lost from missing data.

# First: run PCA on filtered data frame (all cells, chosen variants) using the AF_MISSING layer, for instance
# Note: AF_MISSING includes -50 values for all 0 values for which the data was filtered and is missing
# The number of components should equal the number of pre-filtered variants
combined.dna.run_pca(components=100, attribute='AF_MISSING',show_plot=True) 
# Rerun PCA with optimal number of components based on elbow plot analysis
combined.dna.run_pca(components=6, attribute='AF_MISSING')
# Run UMAP on top of the newly created PC dataframe
# See https://jlmelville.github.io/uwot/abparams.html for appropriate spread and min_dist values
combined.dna.run_umap(attribute='pca', random_state=42) #, min_dist=0.2, spread=1.5)
# Review other clustering methods
help(combined.dna.cluster)
# Visualize data and cluster it using different methods
combined.dna.cluster(attribute='umap', method='graph-community', k=400)
# Re-plot UMAP projection w/ alternative clustering results
fig = combined.dna.scatterplot(attribute='umap', colorby='label')
go.Figure(fig)
# Plot heatmap using NGT
fig = combined.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)