Multisample DNA-only notebook#
This notebook is set up to work with merged h5 files, containing more than one sample, generated with v2 or v3 chemistry. This notebook will only work with Mosaic versions 3.0.1 and higher.
Objective: To showcase the minimum number of steps required for tertiary analysis of DNA (single-cell genotyping and CNV) and explore different ways of visualizing the data.
Major questions answered:
Can we identify DNA clones based on genotypes (SNVs/Indels)?
How is the clonal structure different between samples?
Do we detect CNV events (e.g., copy number amplification, copy number loss)?
Sections:
Setup
Data Structure
DNA Analysis
CNV Analysis
Combined Visualizations
Export and Save Data
Appendix
Not shown:
All available methods and options - documented here
Setup#
Topics covered
Loading required packages and data.
Structure and contents of data objects.
Load data
Note: importing dependencies can sometimes take a couple of minutes.
# Import mosaic libraries
import missionbio.mosaic as ms
# Import these to display entire dataframes
from IPython.display import display, HTML
# Import graph_objects from the plotly package to display figures when saving the notebook as an HTML
import plotly as px
import plotly.graph_objects as go
# Import additional packages for specific visuals
import matplotlib.pyplot as plt
import plotly.offline as pyo
pyo.init_notebook_mode()
import numpy as np
# Import COMPASS for imputation
from missionbio.mosaic.algorithms.compass import COMPASS
# Note: when exporting the notebook as an HTML, plots that use the "go.Figure(fig)" command are saved
# This code is optional, but will make the notebook cells/figures display across the entire width of your browser
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# Check version; this notebook is designed for Mosaic v3.0.1 or higher
print(ms.__version__)
# Any function's parameters and default values can be looked up via the 'help' function
# Here, the function is 'ms.load'
help(ms.load)
# Specify the h5 file to be used in this analysis: h5path = '/path/to/h5/file/test.h5'
# If working with Windows, you may need to add an 'r' before the path: h5path = r'/path/to/h5/file/test.h5'
h5path = '/Users/botero/Desktop/Sample Data/pbmc.multisample.h5'
# Load the data
group = ms.load(h5path, raw=False, filter_variants=True, filter_cells=False, whitelist = [])
# Print the list of samples in the group object
[s.name for s in group.samples]
# Always set raw=False; if raw=True, ALL barcodes will be loaded (rather than cell-associated barcodes)
# Always set filter_variants=True unless you can't detect an expected (target) variant. Additional filtering options are included in the DNA section below
# Always set single=False for multisample/merged h5 files
# The whitelist option loads any variant that is in the vcf.gz file (e.g. "chr1:179520511:C/G"); similar to whitelist feature in Tapestri Insights v2
# Always set filter_cells=False; if filter_cells=True, only genomically complete cells will be loaded
Data Structure#
DNA, CNV, and Protein are sub-classes of the Assay class. The information is stored in four categories, and the user can modify each of those:
1. metadata (add_metadata / del_metadata):
dictionary containing metrics of the assay
2. row_attrs (add_row_attr / del_row_attr):
dictionary which contains ‘barcode’ as one of the keys (where the value is a list of all barcodes)
for all other keys, the values must be of the same length, i.e. match the number of barcodes
this is the attribute where ‘label’, ‘pca’, and ‘umap’ values are added
3. col_attrs (add_col_attr / del_col_attr):
dictionary which contains ‘id’ as one of the keys (where the value is a list of all ids)
for DNA assays, ‘ids’ are variants; for Protein assays, ‘ids’ are antibodies
for all other keys, the values must be of the same length, i.e. match the number ids
4. layers (add_layer / del_layer):
dictionary which contains critically important assay metrics
all the values have the shape (num barcodes) x (num ids)
for DNA assays, this includes AF, GQ, DP, etc. (per cell, per variant)
for Protein assays, this includes read counts (per cell, per antibody)
this is the attribute where ‘normalized_counts’ will be added
# Summary of DNA assay, per sample
for sample in group:
print(sample.name)
print("\'sample.dna\':", sample.dna, '\n')
print("\'row_attrs\':", "\n\t", list(sample.dna.row_attrs.keys()), '\n')
print("\'col_attrs\':", "\n\t", list(sample.dna.col_attrs.keys()), '\n')
print("\'layers\':", "\n\t", list(sample.dna.layers.keys()), '\n')
print("\'metadata\':", "\n")
for i in list(sample.dna.metadata.keys()):
print("\t", i, ": ", sample.dna.metadata[i], sep="")
print("\n")
# For DNA, ids are variants
# sample.dna.ids() is a shortcut for sample.dna.col_attrs['id']
for sample in group:
print(sample.name, "\n", sample.dna.ids(), "\n")
DNA Analysis#
Topics covered
Standard filtering of DNA variants.
Subsetting dataset for variants of interest, including whitelisted variants.
Addition of annotations to the variants.
Manual variant selection and clone identification
OPTIONAL: Use Compass to impute missing data and a clonal phylogeny
Visualizations and customization options
Basic filtering#
There are many options for filtering DNA variants.
Use the help()
function to understand the approach listed below.
# For additional information visit: https://support.missionbio.com/hc/en-us/articles/360047303654-How-do-the-Advanced-Filters-work-
help(sample.dna.filter_variants)
# Filter the variants, similar to the "Advanced Filters" in Tapestri Insights v2.2
# One major difference: The filter "REMOVE GENOTYPE IN CELLS WITH ALTERNATE ALLELE FREQ"
# is replaced with the following 3 zygosity-specific filters: vaf_ref, vaf_hom, vaf_het
# In general, these additional filters remove additional false-positive from the data.
# Adjust filters if needed by overwriting dna_vars
def custom_filt(sample):
filt_vars = sample.dna.filter_variants(
min_dp=10,
min_gq=30,
vaf_ref=5,
vaf_hom=95,
vaf_het=30,
min_prct_cells=50,
min_mut_prct_cells=1,
iterations=10
)
return filt_vars
dna_vars = group.apply(custom_filt)
# List all filtered variants
# Check the number of filtered variants
for i in range(len(group.samples)):
sample = group.samples[i]
print(sample.name)
print("Number of variants:", len(dna_vars[i]))
print(dna_vars[i], "\n")
# Subset the same variants in all dna assays
# It is important to maintain the same variants across all dna assays
# Establish the number of variants for each sample
og_num_vars = [s.dna.shape[1] for s in group.samples]
#Subset all samples to look at the union of all variants
var_union = list(set().union(*dna_vars))
for sample in group:
sample.dna = sample.dna[:, var_union] # Subsets all samples with the same variants
# Establish the new number of varaints for each sample
new_num_vars = [s.dna.shape[1] for s in group.samples]
print(og_num_vars, new_num_vars) # The old and new number of variants for each sample in the group
# Merge the samples into one data set
combined = group.merge()
First, specify variants of interest using one of the three options below:
Use the filtered variants from the above section (dna_vars)
Use specific variants of interest (whitelist)
Combine 1 & 2: filtered variants plus whitelist
Then, this list (final_vars) is used to reduce the larger data set to only your variants of interest.
# Specify the whitelist; variants may be copy/pasted from Tapestri Insights v2.2,
# but ensure correct nomenclature, ie whitelist = ["chr13:28589657:T/G","chrX:39921424:G/A"]
# If there are no whitelist variants, you can leave target variants blank
variants = combined.dna.ids()
white_list = []
# Combine whitelisted and filtered variants
final_vars = list(set(list(variants) + white_list))
# Want to use your whitelist only?
# final_vars = white_list
# Check the length of your final list of variants
len(final_vars)
# Dimensionality of the original sample.dna dataframe
# First number = number of cells (rows); second number = number of variants (columns)
combined.dna.shape
# Before subsetting, verify that all the chosen variants are in the current sample.dna ids (should return True)
print(set(final_vars).issubset(set(combined.dna.ids())))
# Subsetting sample.dna (columns) based on reduced variant list. Keeping all cells that passed filtering
combined.dna = combined.dna[:, final_vars]
# Check the shape of the final filtered DNA object, i.e. (number of barcodes/cells, number of ids/variants)
combined.dna.shape
Annotation Addition#
help(sample.dna.get_annotations)
# Fetch annotations using varsome
# Note: run this on a filtered DNA sample - too many variants (e.g., 100+) are not handled correctly by the method
ann = combined.dna.get_annotations()
Variant selection and Subclone identification#
In this section of the notebook, all variants remaining in the data will populate in a variant table. This table is interaction, variants can be selected, and rows can be sorted by ascending/descending values. The variant name can be clicked on and will navigate to the variants varsome url in your default browser.
Variants selected in this table will populate in a subclone table below.
The variants in the subclone table can be highlighted and assessed for Read Depth, Genotype Quality and Variant Allele Frequency.
Subclones can be renamed by clicking on the pencil icon.
ADO score: The ADO score can be adjusted, but by default is set to 1. Any clones with an ADO score higher this will be moved into the ADO subclones column. We recommend moving any clones with a score of .8 or higher into this column.
Min clone size: The Min Clone Size can also be adjusted, but by default is set to 1. Any clones that represent less than 1% of the sample will be moved into the Small Subclones column.
Cells with missing genotype information across any of the selected variants will be moved into the missing GT column.
#Run the variant table workflow to select variants and begin clone identification
wfv = ms.workflows.VariantSubcloneTable(combined)
wfv.run()
# Subsample the Variants to only the variants selected from the workflow
variants = wfv.selected_variants
combined.dna = combined.dna[:,variants]
# Save the full set of cells to a new variable that you can call on later
# Do this before renaming the variants
full_combined = combined[:]
# Rename your variants
# Any of these column values can be added to the id names
combined.dna.col_attrs.keys()
# Add annotation to the id names
combined.dna.set_ids_from_cols(['annotated Gene', 'id'])
# Another example:
# sample.dna.set_ids_from_cols(['annotated Gene', 'CHROM', 'POS', 'REF', 'ALT'])
# Annotations are now added to the variants
combined.dna.ids()
# Use sample.dna.reset_ids() to get the original ids
# Plot heatmap using NGT_FILTERED.
fig = combined.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Clone removal
# Remove barcodes from the missing GT, small subclones, ADO subclones or FP labels
clones = ['missing', 'small', 'ADO', 'FP']
for c in clones:
cells = combined.dna.barcodes({c})
if len(cells) > 0:
combined.dna = combined.dna.drop(cells)
set(combined.dna.get_labels())
# Redraw heatmap
fig = combined.dna.heatmap(attribute='NGT_FILTERED')
go.Figure(fig)
# Redraw heatmap, split by sample
fig = combined.dna.heatmap(attribute='NGT_FILTERED', splitby='sample_name')
go.Figure(fig)
# You can use the following line of code to change the color for heatmaps in the DNA, CNV or protein sections
# And rotate the tick labels
fig = combined.dna.heatmap(attribute='NGT_FILTERED')
fig.update_layout(title = combined.name)
fig.update_xaxes(tickangle = -45)
fig.layout.coloraxis.colorscale = 'viridis'
fig
# Evaluate new total number of cells after the above filtering
combined.dna.shape
# Subset the barcodes to match across all assays
# This will return the barcodes common to all assays in the sample.
combined.common_barcodes()
# Use that to filter the sample so that only the common barcodes are present in all assays
combined = combined[combined.common_barcodes()]
# Split merged object
# Need to first split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))
# Multisample DNA barplot
fig = group.barplot("dna", percentage=True)
#If you want to change the color of certain clones, you can try something like the following:
#fig.data[0].marker.color = "red"
#fig.data[5].marker.color = "orange"
fig.update_layout(title = 'Clone size as a percentage of total population')
fig.show()
# Fishplot
# Multisample DNA fishplot
# Order your samples in sample_order and fill in the list with their specific name
fig = group.fishplot("dna", sample_order=['Sample 1', 'Sample 2'])
fig.layout.height = 750
fig.show()
# You can further specific labels and parents to reorganize the fishplot
# Want to hide the phylogeny tree? draw_graph=False
# Documentation here: https://missionbio.github.io/mosaic/autosummary_pages/methods/missionbio.mosaic.samplegroup.SampleGroup.fishplot.html#missionbio.mosaic.samplegroup.SampleGroup.fishplot
# Heatmap per sample
for sample in group:
fig = sample.dna.heatmap(attribute='NGT_FILTERED',splitby="label")
fig.update_layout(title = sample.name)
go.Figure(fig)
fig.show()
# Shape of each sample
for sample in group:
print(sample.name, sample.dna.shape)
Adjusting subclone colors or heatmap colors#
If you want to change the color palette used for the subclones, you can use set_palette
. For this you will need to provide a list of ALL subclone names, pointing to the hexcode/color you want to change that subclone to. Note: this function also works with the protein and CNV assays. See below for an example of this.
If you want to change the color palette used for any of the assays/layers, you can use ms.Config.Colorscale
. Then you can list the assay and layer, and assign the plotly colorscale you would like to use for those graphics. You can visualized all available colorscales by using plotly.colors.sequential.swatches_continuous().show()
. If you want to reset all color palettes back to their defaults, you can use ms.Config.Colorscle.reset()
.
# Example of changing the colors assigned to each clone
sample_one = group.samples[0]
sample_two = group.samples[1]
sample_one.dna.set_palette({"Clone 1": "#800080", "Clone 2": "#FF69B4"})
sample_two.dna.set_palette({"Clone 1": "#800080", "Clone 2": "#FF69B4"})
# Change the color of a subclone for the entire group object
# These changes will be applied to the barplots and fishplots
# Example
pal = group.apply(lambda s: s.dna.get_palette())
pal = {k: v for d in pal for k, v in d.items()}
pal["Clone 1"] = 'LightCyan'
pal["Clone 2"] = 'Lavender'
pal["Clone 3"] = 'PaleGreen'
group.apply(lambda s: s.dna.set_palette(pal))
# Built in color options can be found here: https://www.w3schools.com/cssref/css_colors.php
# You can change colorScale to change the heatmap colors
# This will change the color of the dna.ngt layer to be viridis
for sample in group:
ms.Config.Colorscale.Dna.NGT = 'viridis'
fig = sample.dna.heatmap('NGT')
fig.update_layout(title = sample.name)
fig.show()
# If you want to reset this, you can run the following to reset just that one modification
#ms.Config.Colorscale.Dna.reset("NGT")
# Or you can run this to reset any/all modifications:
#ms.Config.Colorscale.reset()
# You can change the colorScale for Dna, Cns or Protein
If you ever want to return back to the original population of cells, you can reset the data using: sample.reset("dna")
This command ‘sample.reset’ works on all assays, including CNV and protein.
CNV Analysis#
Topics covered
Amplicon filtering and ploidy estimation.
Visualization of ploidy across subclones present
# Get gene names for amplicons
for sample in group:
sample.cnv.get_gene_names()
# Define the samples
# Go through the CNV workflow once per sample
sample_one = group.samples[0]
sample_two = group.samples[1]
CNV workflow#
The CNV workflow cannot yet handle multisample objects, so the analysis must be done on one sample at a time.
This workflow will normalize all reads and filter amplicons/cells based on the settings set at the beginning of the workflow:
Amplicon completeness: refers to the minimum percentage of barcodes for an amplicon that must have reads greater than or equal to the minimum read depth set. By default this is set to 50.
Amplicon read depth: refers to the minimum read depth for each amplicon-barcode combination to not be considered missing. By default this is set to 10.
Mean cell read depth: refers the minimum mean read depth for a cell to be included in the analysis, otherwise the cell will be removed. By default this is set to 40.
Diploid clone in DNA: refers to which subclone you are setting as the true diploid population. All ploidy estimates will be calculated in relation to this diploid population. We recommend setting this to your ‘WT’ population or most parent clone present.
Note: These settings are pretty stringent. If you are expecting large copy number events, you may want to reduce Amplicon Completeness and Amplicon Read Depth, to recover these events.
Once the above filters are set, the visualizations can be changed.
Plot: Can be changed from Heatmap positions, to Heatmap genes, Line-plot positions, Line-plot genes
Clone for line plot: If one of the line-plot visualizations is selected, only one clone can be shown at a time. This determines which one is plotted.
X-axis features: If you would prefer to only plot a subset of the data (chromosomes or genes), you can select which chromosomes/genes you would like plotted with this function. Chromosomes can be selected for ‘positions’ type plots, and Genes can be selected for ‘genes’ type plots.
# CNV workflow to filter, normalize and estimate ploidy
wfc = ms.workflows.CopyNumber(sample_one)
wfc.run()
# Amplicon completeness refers to the min percentage of barcodes for an amplicon that must have reads >= the read_depth
# Read depth is the min required depth for each amplicon_barcode to not be considered as missing
# Mean cell read depth refers to the min mean read depth for a cell to be included, otherwise it is removed
# Heatmap with the features ordered by the default amplicon order
# If you want to plot just a subset of chromosomes, you can put them in list format for features
fig = sample_one.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']
# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0
# Optionally, change the size of the figure:
#fig.layout.width = 1600
#fig.layout.height = 1500
fig.update_layout(title = sample_one.name)
go.Figure(fig)
# Heatmap with the features grouped by the genes
# If you want to plot just a subset of genes, you can put them in list format for features
fig = sample_one.cnv.heatmap('ploidy', features='genes', convolve=1) #features=["ASXL1", "EZH2",'TP53']
# Optionally, update the separating lines to be black
#for shape in fig.layout.shapes:
# shape.line.color = '#000000'
fig.update_layout(title = sample_one.name)
go.Figure(fig)
# Show heatmap with convolve and subclustering turned off
bars = sample_one.cnv.clustered_barcodes('ploidy', subcluster=False)
# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to determine
fig = sample_one.cnv.heatmap('ploidy', bars_order=bars, convolve=20)
fig.layout.width = 900
fig.update_layout(title = sample_one.name)
fig.show()
# Signature will create a dataframe based on the layer you want to look at and which statistical value you want to view
# You can see: mean, median, mode, or std with this function
sample_one.cnv.signature('ploidy', 'median')
# signaturemap will plot a heatmap of the signature dataframe created above
# The labels list will control the order of the subclones along the y-axis
fig = sample_one.cnv.signaturemap('ploidy') #labels=[]
fig.update_layout(title=sample_one.name)
fig.show()
Note: The following is a partial duplication to run through the CNV workflow with the second/subsequent sample.
# Go through the workflow for any additional samples
# CNV workflow to filter, normalize and estimate ploidy
wfc = ms.workflows.CopyNumber(sample_two)
wfc.run()
# Amplicon completeness refers to the min percentage of barcodes for an amplicon that must have reads >= the read_depth
# Read depth is the min required depth for each amplicon_barcode to not be considered as missing
# Mean cell read depth refers to the min mean read depth for a cell to be included, otherwise it is removed
# Heatmap with the features ordered by the default amplicon order
# If you want to plot just a subset of chromosomes, you can put them in list format for features
fig = sample_two.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']
# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)
#fig.layout.coloraxis.cmax = 4
#fig.layout.coloraxis.cmin = 0
# Optionally, change the size of the figure:
#fig.layout.width = 1600
#fig.layout.height = 1500
fig.update_layout(title = sample_two.name)
go.Figure(fig)
# Heatmap with the features grouped by the genes
# If you want to plot just a subset of genes, you can put them in list format for features
fig = sample_two.cnv.heatmap('ploidy', features='genes', convolve=1) #features=["ASXL1", "EZH2",'TP53']
# Optionally, update the separating lines to be black
#for shape in fig.layout.shapes:
# shape.line.color = '#000000'
fig.update_layout(title = sample_two.name)
go.Figure(fig)
# Show heatmap with convolve and subclustering turned off
bars = sample_two.cnv.clustered_barcodes('ploidy', subcluster=False)
# This is useful to create "convolved" heatmaps which are easier to interpret
# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to determine
fig = sample_two.cnv.heatmap('ploidy', bars_order=bars, convolve=20)
fig.layout.width = 900
fig.update_layout(title = sample_two.name)
fig.show()
# Signature will create a dataframe based on the layer you want to look at and which statistical value you want to view
# You can see: mean, median, mode, or std with this function
sample_two.cnv.signature('ploidy', 'median')
# signaturemap will plot a heatmap of the signature dataframe created above
# The labels list will control the order of the subclones along the y-axis
fig = sample_two.cnv.signaturemap('ploidy') #labels=[]
fig.update_layout(title=sample_two.name)
fig.show()
Note: The next few cells are highly optional, and only need to be run if you want to plot cells from all samples together in one heatmap.
# Plot heatmaps of all samples together?
combined = group.merge()
combined.cnv.heatmap('ploidy', splitby='sample_name')
# Subset the barcodes to match across all assays
# This will return the barcodes common to all assays in the sample.
combined.common_barcodes()
# Use that to filter the sample so that only the common barcodes are present in all assays
combined = combined[combined.common_barcodes()]
# Split merged object
# Need to split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))
Combined Visualizations#
In this section you will first subset the data to only retain barcodes with remaining DNA and CNV data. Then this data can be plotted together using sample.heatmap()
and sample.signaturemap()
.
The first few kernels of code will generate one plot per sample for the combined visualizations, the last portion of this section will combine the samples into one dataframe and plot them all together.
# This will return the barcodes common to all assays in the sample, for each sample.
for sample in group:
sample.common_barcodes()
# Use that to filter the sample so that only the common barcodes are present in all assays
sample = sample[sample.common_barcodes()]
# Check dimensionality for each subclass; the number of cells (first number) should be the same in each data set
# For each sample
for sample in group:
print(sample.name, sample.dna.shape, sample.cnv.shape)
# DNA + CNV heatmap
# One per sample
for sample in group:
fig = sample.heatmap(
clusterby=('dna', 'cnv'), # The first assay is used for the labels
attributes=['AF', 'ploidy'],
features=[None, 'genes'], # If None, then clustered_ids is used
bars_order=None, # The order of the barcodes
widths=None,
order=('dna', 'cnv') # The order in which the heatmaps should be drawn
)
fig.update_layout(title = sample.name)
fig.layout.width = 1200
fig.show()
help(sample.signaturemap)
# Plot a combined signature heatmap, showing DNA and CNV signatures for all subclones
# One per sample
for sample in group:
fig = sample.signaturemap(
clusterby=('dna', 'cnv'),
attributes=('NGT', 'ploidy'),
features=[None, 'genes'],
signature_kind=['median', 'median'],
widths=None,
order=('dna', 'cnv') # The order in which the heatmaps should be drawn
)
fig.update_layout(title = sample.name)
fig.layout.width = 1200
fig.show()
# Ignore warnings raised when running clone_vs_analyte
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
help(sample.clone_vs_analyte)
# Clone vs analyte
# Visualize the CNV data stratified by clone
# One per sample
for sample in group:
fig = sample.clone_vs_analyte('cnv')
plt.gcf().axes[1].texts[-1].set_text("")
#fig.savefig('genotype_cnv.png')
# Merge the samples into one data set
combined = group.merge()
# DNA + CNV heatmap
# For all samples together
fig = combined.heatmap(
clusterby=('dna', 'cnv'), # The first assay is used for the labels
attributes=['AF', 'ploidy'],
features=[None, 'genes'], # If None, then clustered_ids is used
bars_order=None, # The order of the barcodes
widths=None,
order=('dna', 'cnv') # The order in which the heatmaps should be drawn
)
fig.layout.width = 1200
fig.show()
# Plot a combined signature heatmap, showing DNA and CNV signatures for all subclones
# For all samples together
fig = combined.signaturemap(
clusterby=('dna', 'cnv'),
attributes=('NGT', 'ploidy'),
features=[None, 'genes'],
signature_kind=['median', 'median'],
widths=None,
order=('dna', 'cnv') # The order in which the heatmaps should be drawn
)
fig.layout.width = 1200
fig.show()
# Clone vs analyte
# Visualize the CNV data stratified by clone
# For all samples together
fig = combined.clone_vs_analyte('cnv')
plt.gcf().axes[1].texts[-1].set_text("")
Export and Save Data#
In this section you can export a filtered .h5 file, which will contain all new labels/layers, and contain only the filtered barcodes/cells remaining. You can also export all data (row attributes, column attributes and layers) for every assay (DNA and CNV) into easily parsable .csv tables.
# Save new h5 file that includes only the final, cleaned dataset
ms.save(combined, 'FilteredData.h5')
# With new code implementation:
# Export data into csv formats
# DNA, CNV and metadata will be included in the zip
ms.to_zip(combined, 'filename')
Appendix#
# Split merged object
# Need to first split the samples: they will go back from combined to group
group = ms.SampleGroup(combined.split("sample_name"))
DNA signature#
Using .signature()
and .signaturemap()
you can visualize different statistical metrics of your data, including: mean, median, mode and std.
# Signature will create a dataframe based on the layer you want to look at and which statistical value you want to view
# You can see: mean, median, mode, or std with this function
for sample in group:
display(sample.dna.signature('AF', 'median'))
# signaturemap will plot a heatmap of the signature dataframe created above
# The labels list will control the order of the subclones along the y-axis
for sample in group:
sample.dna.signaturemap('AF').show() #labels=[]
Compass imputation#
This section of the notebook is OPTIONAL and is still a work in progress
Compass can be used to impute the genotypes of cells with some missing data. It can also be used to infer the phylogeny of all subclones present in the sample. If a cells nature cannot be determined, Compass will label these cells as Mixed or Ambiguous. Compass can take ~1-10 minutes to run depending on the size of the data.
The Compass publication can be found here
Note: Compass can give different subclone composition than the variant subclone workflow.
# Use compass to assume the subclone architecture and phylogeny
# Depending on the size and complexity of the data, this step can take ~1-10 minutes
# Use the full_combined object to utilize all cells and unannotated variants
compass = COMPASS(full_combined, somatic_variants=variants)
compass.run()
# The phylogentic tree prediction by COMPASS
fig = compass.plot_tree()
fig.show()
# dict of node names pointing to descriptions
# compass_labels_ is just the node names
# we want the compass_labels to be the node_descriptions
labs = compass.labels_ # Stores the labels
desc = compass.node_descriptions() # Stores the dict mapping the label to the description
compass_labels = np.array([desc.get(lab, lab) for lab in labs])
# Store the compass label as a row_attr
full_combined.dna.add_row_attr('compass_labels', compass_labels)
# Store the index positions of the COMPASS ids for the heatmap
idx = np.isin(full_combined.dna.ids(), compass.somatic_variants)
# Visualize the assignment using the variants passed to COMPASS
fig = full_combined.dna.heatmap('NGT', features=full_combined.dna.ids()[idx], splitby='compass_labels')
fig.show()
# This will return a dataframe
# Showing the overlap of variant workflow clones and compass identified clones
full_combined.dna.crosstab(compass_labels, normalize='columns')
# This will plot a heatmap of the crosstab dataframe
full_combined.dna.crosstabmap(compass_labels, normalize='columns').show()