Additional plots and customizations#

Objective

This vignette contains a wide variety of code snippets for additional plots and plot customizations that are not currently included in our latest version of the jupyter notebook.

Load in dependencies#

# Import mosaic libraries
import missionbio.mosaic as ms

# Import these to display entire dataframes
from IPython.display import display, HTML

# Import numpy for statistics
import numpy as np

# Import additional packages for specific visuals
import matplotlib.pyplot as plt

# Import the colors
import seaborn as sns

# Import the Heatmap class to create heatmaps of custom data
from missionbio.plotting.heatmap import Heatmap

Load in data#

sample = ms.load_example_dataset("Multisample PBMC", single=True)

/Users/casp/Documents/code/mosaic/src/missionbio/mosaic/assay.py:360: ChangeAppliedWarning:

Multiple samples detected. Appending the sample name to the barcodes to avoid duplicate barcodes. The original barcodes are stored in the original_barcode row attribute

Subset the DNA data to look for high quality variants of interest.

# Adjust filters if needed by overwriting dna_vars
dna_vars = sample.dna.filter_variants(
    min_dp=10,
    min_gq=50,
    vaf_ref=5,
    vaf_hom=95,
    vaf_het=40,
    min_prct_cells=80,
    min_mut_prct_cells=1,
)

sample.dna = sample.dna[:, dna_vars]

GQ violin plots#

Plot GQ violin plots for all variants of interest to assess genotype quality. It is recommended that this is added before the GQ heatmap, so poor quality variants seen in the violin plots can be subsequently de-selected in the heatmap.

fig = sample.dna.violinplot("GQ")
fig.layout.width = 900
fig.show("jpg")

../_images/7f7de574e5bc6bd4a723e6a846cb585a8c3c0684ec74e389a82d6ab3ad7c718a.jpg

Identify the subclones present in your sample.

# Plot a heatmap of the variant allele frequenices being called across your variants of interest
fig = sample.dna.heatmap(attribute='AF_MISSING')
fig.show("jpg")

../_images/ca4f78d28c2440d57f4f5c790089b27e1320af1a3d85b046d706b2fd5794a5ca.jpg

# Annotate the DNA variants
sample.dna.get_annotations()  
    
# Add annotation to the id names
sample.dna.set_ids_from_cols(["annotated Gene", "CHROM", "POS", "REF", "ALT"])

# Annotations are now added to the variants
sample.dna.ids()

array(['SF3B1:2:198266943:C:T', 'NPM1:5:170837457:A:G',
       'EZH2:7:148508833:A:G', 'WT1:11:32417945:T:C',
       'PHF6:X:133547814:T:C'], dtype=object)

# Cluster the data to identify the subclones present

clone_data, fig = sample.dna.group_by_genotype(
    features=sample.dna.ids(),
    layer="NGT_FILTERED",
    group_missing=True,
    min_clone_size=1,
    return_plot=True
)

display(HTML(clone_data.to_html()))
fig.show("jpg")

clone	1	2	Missing GT clones (127)	Small subclones (13)	ADO clones (0)
SF3B1:2:198266943:C:T	Hom (99.56%)	Hom (99.57%)	Missing in 13.54% of cells	Hom (99.1%)	NaN
NPM1:5:170837457:A:G	WT (0.45%)	Hom (99.64%)	Missing in 20.29% of cells	Hom (63.78%)	NaN
EZH2:7:148508833:A:G	WT (0.72%)	Het (59.88%)	Missing in 23.03% of cells	WT (33.67%)	NaN
WT1:11:32417945:T:C	Het (58.52%)	WT (0.48%)	Missing in 22.34% of cells	WT (16.79%)	NaN
PHF6:X:133547814:T:C	Hom (99.63%)	WT (0.36%)	Missing in 18.40% of cells	WT (36.63%)	NaN
Total Cell Number	620 (16.53%)	598 (15.94%)	2413 (64.33%)	120 (3.2%)	NaN
Sample 1 Cell Number	15 (0.77%)	582 (29.95%)	1276 (65.67%)	70 (3.6%)	NaN
Sample 2 Cell Number	605 (33.46%)	16 (0.88%)	1137 (62.89%)	50 (2.77%)	NaN
Parents	NaN	NaN	NaN	NaN	NaN
Sisters	NaN	NaN	NaN	NaN	NaN
ADO score	0	0	NaN	NaN	NaN

../_images/32d816bda88fcdcbb22b07c746336d11f1575287a05125b29dd2f497e374da03.jpg

# Rename the subclones and drop the clones that are not of interest by labelling them 'FP'

sample.dna.rename_labels(
  {
    '1': 'Clone 1', 
    '2': 'Clone 2',    
    'missing': 'FP', 
    'small': 'FP'
  }
)

# Remove barcodes (clones) from data based on renamed labels
# The reduced data set will now be called 'sample.dna'

fp_barcodes = sample.dna.barcodes({"FP"})
sample.dna = sample.dna.drop(fp_barcodes)

set(sample.dna.get_labels()) 

{'Clone 1', 'Clone 2'}

Stacked bar graph for multi-sample analysis#

Plot a stacked bar graph for clone frequency within each sample, where each sample is an individual bar. Note colors can be changed, as can labels. Graph is highly customizable, but large amounts of code may deter users.

# Create a subset of the sample by filtering the Cnv, and Protein assays with the common barcodes.
# This will raise a "ChangeApplieWarning" warning if all the assays don't have the same barcodes
sub_sample = sample[sample.common_barcodes()]

fig = ms.SampleGroup(sub_sample.split()).barplot(
    assay="dna",
    sample_order=["Sample 1", "Sample 2"],
    label_order=["Clone 1", "Clone 2"],
    percentage=True,
)
fig.show("jpg")

/Users/casp/Documents/code/mosaic/src/missionbio/mosaic/sample.py:356: ChangeAppliedWarning:

Some barcodes are missing from one or more assays. Returning only the barcodes present in all assays.

../_images/ab71902b9e8d9490d6e5312fc68de27a7c9db19b772f15fcb8c103ddddfd5c25.jpg

Reads per amplicon#

Plot a histogram of the reads per cell across any given amplicon, and print out the mean reads per cell of that amplicon.

reads = sample.dna.get_attribute('DP', constraint='row+col')
amplicons = sample.dna.col_attrs['amplicon']
amp_average = reads.groupby(amplicons, axis=1).mean()
#Insert your amplicon of interest below
DP = amp_average['MYE_SF3B1_198266730'] 
#Print the mean
print(np.mean(DP))
#Plot histogram
plt.hist(DP)

42.56896551724138

(array([ 59.,  72.,  86.,  83., 104., 143., 309., 326.,  33.,   3.]),
 array([17. , 21.7, 26.4, 31.1, 35.8, 40.5, 45.2, 49.9, 54.6, 59.3, 64. ]),
 <BarContainer object of 10 artists>)

../_images/7e2eebb991be762f9f9c8826f647e69ec3e3f17bbdfeccc40b724dd4b123d803.png

Count barcodes with at least one mutation#

Define a region of interest, and then call cell-barcodes that have a least one mutation in that region of interest. This code may be useful for gene-editing experiments where you are expecting mutations on a particular amplicon, or within a small region of the amplicon surrounding a cut-site.

# Define an amplicon of interest
amp = 'MYE_SF3B1_198266730'
print("Variants from amplicon present in dna?: ", amp in set(sample.dna.col_attrs['amplicon'])) # sanity check

# Filter the data to only look at variants called from this amplicon
amplicon_filter = sample.dna.col_attrs["amplicon"] == amp 

# THE FOLLOWING SECTION IS OPTIONAL
# # filtering variants by cut site
# # e.g. cut site at chr2:60722401, range 60722481 - 60722421
# cut_filter = (sample.dna.col_attrs["POS"] > 60722381) & (sample.dna.col_attrs["POS"] < 60722421)
# amplicon_filter = amplicon_filter & cut_filter

# counting barcodes with at least one mutation in region of interest(Het or Hom call)
# creating a dataframe with numeric genotypes
genotypes = sample.dna.get_attribute('NGT_FILTERED', constraint='row+col')

# Get the number of barcodes with at least one variant that is HET (1) or HOM (2)
num_mutated = genotypes.isin([1, 2]).any(axis=1).sum()

print(num_mutated)

Variants from amplicon present in dna?:  True
1218

Custom cluster order#

For sample.dna or sample.cnv grouped heatmaps in Mosaic notebook, this code will help you order the clones along the Y-axis in whatever order you prefer.

# create a list 'cell_clusters' with the desired order of the clusters
# In this example, clusters are labeled numerically, i.e. 1, 2, 3,...
# Use the labels you have previously created for the clones

cell_clusters = ['Clone 1', 'Clone 2']
ordered_cells = sample.dna.barcodes(label=['Clone 1', 'Clone 2'])[::-1]

fig = sample.dna.heatmap(attribute='AF_MISSING', bars_order=ordered_cells)
fig.show("jpg")

../_images/cf4f2a1cb9d75038e7e4cc51021d1d8290c7b6b404a8c2c7f0945c1601e57f21.jpg

Custom heatmap legend#

For grouped sample.dna heatmaps, after clones are identified and labeled, this helps to remove unnecessary categories from the legend, for example the ‘Missing’ category.

# Customize legend to show only three categories e.g. WT, HET, HOM
# NOTE: Only use this when there are no cells with NGT=3 ("Missing"), otherwise those will be colored as HOM

fig = sample.dna.heatmap(attribute = 'NGT_FILTERED')

# Define the categories you want to appear
fig.layout.coloraxis.colorbar.ticktext = ('WT', 'HET', 'HOM', "")
fig.layout.coloraxis.colorbar.tickvals = (0.375, 1.025, 1.675)
fig.layout.coloraxis.cmax = 2
fig.layout.coloraxis.colorscale = [[0.0, '#3b4d73'], [0.33, '#3b4d73'], [0.33, '#78a3bc'], [0.66, '#78a3bc'], [0.66, '#d7ecee'], [1.0, '#d7ecee']]
fig.show("jpg")

../_images/ec2aafd0013898051dcd0f415172b64d3ac99f79c1f4f348c117ad7026fa0f9b.jpg

Perform CNV analysis on your data#

# Copy the labels and palette from DNA
# Cells that were filtered in Dna but present in Cnv will be labeld `Unlabeled`

sample.cnv.set_labels(sample.dna, default_label='Unlabeled')

# Remove amplicons with missing information.

# completeness : float [0, 100]
#    The minimum percentage of barcodes for an amplicon that must have
#    reads greater than or equal to `read_depth` to keep that amplicon.
# read_depth : int
#   The minimum required depth for each amplicon-barcode to not be
#    considered as missing during the completness calculation.

sample.cnv = sample.cnv[:, sample.cnv.filter_amplicons(completeness=50, read_depth=0)]
sample.cnv.normalize_reads()

# Normalize your amplicons to a WT/known diploid population
sample.cnv.compute_ploidy(diploid_cells=sample.dna.barcodes('Clone 1'))

P-value chart per gene#

For CNV statistics, this code will allow you to show the pval_cnv chart with gene name labels, instead of amplicon names. Note: if there are multiple amplicons per gene, the gene name will appear for each amplicon, and this is not an aggregate p-val across the entire gene.

# Fetch the gene names for the amplicons
sample.cnv.get_gene_names()

pval_cnv, tstat_cnv = sample.cnv.test_signature("normalized_counts")
pval_cnv = pval_cnv + 10 ** -50 + pval_cnv
pval_cnv = -np.log10(pval_cnv) * (tstat_cnv > 0)

ids, genes = sample.cnv._get_ids_from_features("genes")
pval_cnv = pval_cnv.loc[:, ids]

hm = Heatmap(pval_cnv, y_groups=pval_cnv.index.values, x_groups=genes)
fig = hm.draw()
fig.show("jpg")

../_images/0862b6d146eb3247d397239a0c87510a614ef4aac5e0a892e9aaa7d53b2d7e88.jpg

Calculate an average P-value per gene#

For CNV statistics, this code will allow you to show the average p-value per gene in a heatmap format, instead of at an amplicon level.

# statistics
pval_cnv, tstat_cnv = sample.cnv.test_signature("normalized_counts")

# average the p-values by gene_id and transpose
pval_cnv = pval_cnv.groupby(sample.cnv.col_attrs['gene_name'], axis=1).mean()
pval_cnv = pval_cnv + 10 ** -50

# plot it
pval_cnv = -np.log10(pval_cnv)
hm = Heatmap(pval_cnv, y_groups=pval_cnv.index.values)
fig = hm.draw()
fig.show("jpg")

../_images/07772127c42e3e75953ef1e4ee2fc6b900a6da71d993d83038880511f478196f.jpg

Protein read distribution tool#

This code will read in your h5 file, and spit out a table of your antibodies, the raw read counts and what percentage of total counts the antibody accounts for, as well as a bar chart of antibody percentages.

import pandas as pd
import plotly.express as px

read_counts= sample.protein.layers["read_counts"]
ids = sample.protein.col_attrs["id"].copy()
raw = pd.DataFrame(read_counts, columns=ids)

total_ab = pd.DataFrame(raw.sum(axis = 0, skipna = True))
total_ab.columns =['Raw Count']
total_ab.index.name = 'Antibody'
total_ab.reset_index(inplace=True)
total_ab['Percentage'] = (total_ab['Raw Count'] / total_ab['Raw Count'].sum())

# Bar chart
fig = px.bar(total_ab, x='Antibody', y='Percentage')
fig.update_layout(yaxis_tickformat="0.0%", template="gridon")
fig.show("jpg")

display(total_ab)

../_images/d24eced4a3cb1413310b22e5ea10be5152818c66a3c636a97336055347fc3560.jpg

	Antibody	Raw Count	Percentage
0	CD10	216964	0.003790
1	CD117	343388	0.005998
2	CD11b	1976532	0.034523
3	CD11c	1448022	0.025291
4	CD123	1263044	0.022061
5	CD13	983932	0.017186
6	CD138	590295	0.010310
7	CD14	1280941	0.022373
8	CD141	692292	0.012092
9	CD16	2276106	0.039755
10	CD163	582052	0.010166
11	CD19	456095	0.007966
12	CD1c	332980	0.005816
13	CD2	1588429	0.027744
14	CD22	244551	0.004271
15	CD25	252888	0.004417
16	CD3	4431120	0.077395
17	CD30	494197	0.008632
18	CD303	207141	0.003618
19	CD304	125132	0.002186
20	CD33	1460364	0.025507
21	CD34	184545	0.003223
22	CD38	2048252	0.035775
23	CD4	3998551	0.069840
24	CD44	3832460	0.066939
25	CD45	4025459	0.070310
26	CD45RA	4077514	0.071219
27	CD45RO	643098	0.011233
28	CD49d	2033111	0.035511
29	CD5	1890299	0.033016
30	CD56	801676	0.014002
31	CD62L	1942536	0.033929
32	CD62P	147567	0.002577
33	CD64	502127	0.008770
34	CD69	325517	0.005686
35	CD7	2520817	0.044029
36	CD71	979444	0.017107
37	CD8	3539863	0.061828
38	CD83	721215	0.012597
39	CD90	137773	0.002406
40	FcεRIα	527347	0.009211
41	HLA-DR	828210	0.014466
42	IgG1	116941	0.002043
43	IgG2a	90132	0.001574
44	IgG2b	92397	0.001614

# Normalize protein counts
sample.protein.normalize_reads('NSP')

<missionbio.demultiplex.protein.nsp.NSP at 0x183d00c70>

Plot a joint scatterplot and histogram#

This will create a scatterplot of any two proteins, where additional histograms of the protein expression profile will be shown on the sides of the scatterplots.

# Plot a joint scatterplot with histograms of two proteins of interest
# for more: https://seaborn.pydata.org/generated/seaborn.jointplot.html

# create data frame with normalised protein counts ("protein_df")
protein_df = sample.protein.get_attribute('normalized_counts', constraint='row+col')

# specify the proteins of choice using the "x" and "y" arguments
sns.jointplot(x="CD3", y="CD19", data=protein_df)

<seaborn.axisgrid.JointGrid at 0x183db9c60>

../_images/53b3ebc4c5a68d0594708a3b6a67b8fc3bf088d2cbfc3f63fc5c6cde5df22660.png

CNV signature per chromosome#

This code is for the first heatmap plot in the multi-assay section of the Mosaic notebook (DNA cluster signature vs Analyte and Barcode). It allows you to show average ploidy per chromosome in the CNV section of the heatmap, instead of by amplicon or gene.

# Keep only the barcodes which are common across all assays
# This will raise a `ChangeAppliedWarning` if any of the assays has different barcodes from the others

sample = sample[sample.common_barcodes()]

/Users/casp/Documents/code/mosaic/src/missionbio/mosaic/sample.py:356: ChangeAppliedWarning:

Some barcodes are missing from one or more assays. Returning only the barcodes present in all assays.

# Calculate mean ploidy for each chromosome

amp_ploidy = sample.cnv.get_attribute("ploidy", "rc")
ids, chroms = sample.cnv._get_ids_from_features("positions")
chrom_ploidy = amp_ploidy.loc[:, ids].groupby(chroms, axis=1, sort=False).mean()

# Get the order of the barcodes to plot
# `subcluster=True` will enable hierarchical clustering of cells within each label
bars = sample.cnv.clustered_barcodes(chrom_ploidy.values, subcluster=False)

# Draw normal heatmap
hm = Heatmap(chrom_ploidy.loc[bars, :], y_groups=sample.cnv.get_labels(bars))
fig = hm.draw()
sample.cnv.update_coloraxis(fig, "ploidy")
fig.show("jpg")

../_images/56413dbd7aed820d2fea1f9824cc29cf454854706a8301da36ff48385dfcf291.jpg

# Draw multi-assay heatmap

fig = sample.heatmap(("dna", "protein"))
fig.show("jpg")

../_images/b0df894617d2e3bc8b59fa659d780670bd61e47470c3e579568816ad9f449032.jpg

Save output for one clone#

This code will output a filtered h5 file, containing only cells of the clone of interest, which can be used for further analysis or filtering.

# Save one clone (or one sample) as a separate .h5 file
# row attribute 'label' stores the group (e.g. clone) information
# set the `clone` to match the clone that you would like to extract from the others
clone = 'Clone 2'

# create a new sample with a subset of the cells
sub_sample = sample[sample.dna.barcodes(clone)]

#save sample2 to disk
ms.save(sub_sample, "Clone2_Filtered.h5", mode="w")

Additional plots and customizations

Contents