Filtering barcodes and ids

Objective

This vignette describes how data can be filtered in Mosaic.
Both barcode filtering and id filtering are covered.

The h5 file used in this notebook can be found here

# Import mosaic and load the data

import missionbio.mosaic as ms

sample = ms.load_example_dataset("3 cell mix")
Loading, <_io.BytesIO object at 0x7fbd9990e900>
Loaded in 0.4s.

This is an analyzed h5 file, so the clones and clusters
are already labeled. It contains three cell lines: KG-1,
TOM-1, and Jurkat. It also contains doublets of each pair,
as seen in the heatmap.

sample.dna.heatmap('NGT')

Filtering assays

Each assay and sample in Mosaic can be filtered
using Python slice notation. An Assay object also
has a drop function, which can be used to drop
specific barcodes or ids.
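The relationship between selecting and dropping can be illustrated with a minimal sketch in plain Python. This is not Mosaic code: the toy table, the `select` and `drop` helpers, and the `cell-*` and `var-*` names are all hypothetical stand-ins, assuming only that an assay behaves like a table with barcodes as rows and ids as columns.

```python
# Toy stand-in for an assay: rows are barcodes, columns are ids.
# This is NOT Mosaic code; every name below is hypothetical.
barcodes = ["cell-1", "cell-2", "cell-3", "cell-4"]
ids = ["var-A", "var-B", "var-C"]
ngt = [
    [0, 1, 2],
    [0, 0, 1],
    [2, 1, 0],
    [1, 1, 1],
]


def select(barcodes, ids, table, keep_barcodes=None, keep_ids=None):
    """Keep only the requested rows and columns; None means keep all."""
    rows = [i for i, b in enumerate(barcodes)
            if keep_barcodes is None or b in keep_barcodes]
    cols = [j for j, v in enumerate(ids)
            if keep_ids is None or v in keep_ids]
    return (
        [barcodes[i] for i in rows],
        [ids[j] for j in cols],
        [[table[i][j] for j in cols] for i in rows],
    )


def drop(barcodes, ids, table, names):
    """Remove any barcode or id listed in names; keep everything else."""
    names = set(names)
    return select(
        barcodes, ids, table,
        keep_barcodes=[b for b in barcodes if b not in names],
        keep_ids=[v for v in ids if v not in names],
    )


# Selecting: two barcodes, one id.
sel_b, sel_i, sel_t = select(
    barcodes, ids, ngt,
    keep_barcodes=["cell-1", "cell-3"], keep_ids=["var-B"],
)

# Dropping: one barcode and one id are removed, the rest are kept.
drop_b, drop_i, drop_t = drop(barcodes, ids, ngt, ["cell-2", "var-A"])
```

In this toy model, dropping is simply selecting the complement, which is why either style can produce the same subset.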

If a visualization is needed for only a subset of barcodes,
the unwanted barcodes can be dropped as follows.

Dropping barcodes and IDs

This method is useful when certain barcodes or ids are to be dropped.

# This shows all the labels present in assay
set(sample.dna.get_labels())
{'Jurkat', 'KG-1', 'Mixed', 'TOM-1'}
# Since this is an analyzed h5 file, we can
# retrieve all the barcodes labeled as Mixed

mixed_barcodes = sample.dna.barcodes('Mixed')
dna = sample.dna.drop(mixed_barcodes)
set(dna.get_labels())
{'Jurkat', 'KG-1', 'TOM-1'}
# Ids can be dropped in a similar fashion
# The id chosen was a poor quality variant in KG-1 as seen in the heatmap

dna = dna.drop(['chr2:25470426:C/T'])
# Once the cells and variants are dropped, we can make the plots as usual
# Here the mixed cells and that one id are dropped from the dna object.

dna.heatmap('NGT')
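The label-based workflow above (look up the barcodes carrying a label, then remove them) can be sketched without Mosaic. The barcodes, the label list, and the `barcodes_with_label` helper below are hypothetical; the sketch only assumes that each barcode carries exactly one label.

```python
# Hypothetical barcodes and their per-cell labels, in matching order.
barcodes = ["AAC", "AAG", "ACT", "AGG", "CCT"]
labels = ["Jurkat", "Mixed", "KG-1", "Mixed", "TOM-1"]


def barcodes_with_label(barcodes, labels, wanted):
    """Return barcodes whose label is in wanted (a string or a list of strings)."""
    if isinstance(wanted, str):
        wanted = [wanted]
    wanted = set(wanted)
    return [b for b, lab in zip(barcodes, labels) if lab in wanted]


# Step 1: find the barcodes labeled "Mixed".
mixed = barcodes_with_label(barcodes, labels, "Mixed")

# Step 2: drop them and check which labels remain.
mixed_set = set(mixed)
remaining = [(b, lab) for b, lab in zip(barcodes, labels) if b not in mixed_set]
remaining_labels = {lab for _, lab in remaining}
```

Checking the set of remaining labels after the drop mirrors the `set(dna.get_labels())` check used in the cells above.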

Selecting barcodes and IDs

This is useful when only certain barcodes or IDs are to be selected.

# This is an analyzed h5 file, hence Jurkat barcodes can be retrieved here

# In the slice notation, the `:` refers to all barcodes or all ids
# The first value is the subset of `barcodes` to be chosen,
# the second value is the subset of `ids` to be chosen.

jurkat_barcodes = sample.dna.barcodes('Jurkat')
dna = sample.dna[jurkat_barcodes, :]
set(dna.get_labels())  # This dna object only has Jurkat cells
{'Jurkat'}
# Let's say only two ids of interest are required for a plot
# We are reusing the dna object filtered earlier

id_of_interest = ['chr2:25458546:C/T', 'chr2:25469502:C/T']
dna = dna[:, id_of_interest]
dna.ids()  # Has only the two ids selected
array(['chr2:25458546:C/T', 'chr2:25469502:C/T'], dtype=object)
# Once the assay has been filtered, it can be plotted
# We are now looking at the two selected variants for Jurkat cells only

dna.heatmap('NGT')
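This vignette does not cover what happens when a requested id is absent from the assay, so a defensive check before slicing may be worthwhile. The sketch below is one possible approach, not part of Mosaic: `split_present_missing` is a hypothetical helper, and `chr2:00000000:A/G` is a made-up id used only to trigger the missing case.

```python
def split_present_missing(requested, available):
    """Split requested names into those found in available and those not found."""
    available = set(available)
    present = [name for name in requested if name in available]
    missing = [name for name in requested if name not in available]
    return present, missing


# Ids assumed to be in the assay (taken from the examples in this vignette).
available_ids = [
    "chr2:25458546:C/T",
    "chr2:25469502:C/T",
    "chr2:25470426:C/T",
]

# The last entry is a made-up id that is not in the list above.
requested = ["chr2:25458546:C/T", "chr2:25469502:C/T", "chr2:00000000:A/G"]

present, missing = split_present_missing(requested, available_ids)
if missing:
    print("Not found:", missing)
```

With a real assay, the available list would presumably come from `dna.ids()`, and only `present` would be passed to the slice.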

Filtering the entire sample

If some barcodes are to be removed from all assays (DNA, Protein, CNV),
the sample-level slice notation can be used.

# In this case, only a set of barcodes is required

# Choosing only the three cell lines
cells = sample.dna.barcodes(['KG-1', 'TOM-1', 'Jurkat'])
filtered_sample = sample[cells]
# This is showing all the variants, but only for the three cell lines

filtered_sample.clone_vs_analyte(analyte='protein')
# To remove these variants, the DNA object has to be filtered
# To remove some antibodies, the protein object has to be filtered

variants = ['chr2:25458546:C/T', 'chr2:25469502:C/T']
filtered_sample.dna = filtered_sample.dna[:, variants]  # Choosing all cells and two variants

abx = ['CD34', 'CD24', 'CD19', 'CD45', 'CD90']
filtered_sample.protein = filtered_sample.protein[:, abx]  # Choosing all cells and five antibodies
# Now the plot only contains the cells and ids of interest

filtered_sample.clone_vs_analyte(analyte='protein')
# The filtered sample object can be used for other plots as well

filtered_sample.heatmap(clusterby='dna', sortby='dna', drop='cnv', flatten=False)
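The two levels of filtering used in this section (barcodes across every assay, ids within a single assay) can be summarized with a toy model. This is a sketch under stated assumptions, not Mosaic's implementation: the sample is represented as a plain dictionary of assays, and `filter_barcodes`, `filter_ids`, and all values are hypothetical.

```python
# Toy sample: each assay maps barcode -> {id: value}. All values are made up.
toy_sample = {
    "dna": {
        "cell-1": {"var-A": 0, "var-B": 1},
        "cell-2": {"var-A": 2, "var-B": 1},
        "cell-3": {"var-A": 1, "var-B": 0},
    },
    "protein": {
        "cell-1": {"CD34": 12, "CD19": 3},
        "cell-2": {"CD34": 1, "CD19": 40},
        "cell-3": {"CD34": 7, "CD19": 9},
    },
}


def filter_barcodes(sample, keep):
    """Sample-level filter: keep the same barcodes in every assay."""
    keep = set(keep)
    return {
        name: {b: row for b, row in assay.items() if b in keep}
        for name, assay in sample.items()
    }


def filter_ids(assay, keep):
    """Assay-level filter: keep only the chosen ids, for all barcodes."""
    keep = set(keep)
    return {
        b: {i: v for i, v in row.items() if i in keep}
        for b, row in assay.items()
    }


# Barcodes are filtered once for the whole sample ...
filtered = filter_barcodes(toy_sample, ["cell-1", "cell-3"])
# ... while ids are filtered assay by assay.
filtered["protein"] = filter_ids(filtered["protein"], ["CD34"])
```

The barcode filter touches every assay because barcodes are shared across assays, whereas ids (variants, antibodies) belong to one assay each and therefore have to be filtered on that assay.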