Overview of mosaic#

Objective
To showcase the minimum number of steps
required for tertiary analysis of DNA + Protein data,
and some of the different ways to look at the data.

Major questions answered:

  1. Do we see DNA clones?

  2. Do we see protein cell types?

  3. Is the differential expression significant?

  4. Do the clones correlate with the cell types?

Things not shown:

  1. All available methods, e.g. filtering of nearby variants, variant annotation, plots

  2. Discussing all methods and their options - Documented here

  3. Systematic variations seen in protein data

Setup#

H5 files replace loom files. They are part of the DNA and protein pipeline output.

Note: This example h5 file is trimmed specifically for this analysis.

Here is the complete documentation on the load function

import missionbio.mosaic as ms

sample = ms.load_example_dataset("3 cell mix")  # Use ms.load(path_to_h5) for custom h5 files

Data Structure#

Dna, Cnv, and Protein are subclasses of the _Assay class.
The information is stored in four ways, and the user
can change each of them.

1. metadata (add_metadata / del_metadata):
    dictionary containing metrics of the assay

2. row_attrs (add_row_attr / del_row_attr):
    dictionary which contains 'barcode' as one of
    the keys. All the values must be of the same
    length i.e. match the number of barcodes
    This is the attribute where 'label', 'pca',
    and 'umap' values are added

3. col_attrs (add_col_attr / del_col_attr):
    dictionary which contains 'ids' as one of
    the keys. All the values must be of the same
    length i.e. match the number of ids
    'ids' contains variants for DNA assays
    and antibodies for Protein assays

4. layers (add_layer / del_layer):
    dictionary containing 'read_counts' as one of
    the keys. All the values have the shape
    (num barcodes) x (num ids). This is the attribute
    where 'normalized_counts' will be added
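
The shape rules listed above can be illustrated with a toy stand-in. This is only a sketch using plain dictionaries and NumPy, not the Mosaic `_Assay` class itself; `toy_assay` and `check_shapes` are hypothetical names introduced for illustration.

```python
import numpy as np

# Toy stand-in for an assay; the keys mirror the four containers described above.
barcodes = np.array(["AAAC", "AAAG", "AACT"])  # hypothetical barcodes
ids = np.array(["CD3", "CD19"])

toy_assay = {
    "metadata": {"sample_name": "toy"},
    "row_attrs": {"barcode": barcodes},
    "col_attrs": {"ids": ids},
    "layers": {"read_counts": np.array([[10, 0], [3, 7], [0, 25]])},
}


def check_shapes(assay):
    """Return True if every attribute matches the barcode/id dimensions."""
    n_barcodes = len(assay["row_attrs"]["barcode"])
    n_ids = len(assay["col_attrs"]["ids"])
    rows_ok = all(len(v) == n_barcodes for v in assay["row_attrs"].values())
    cols_ok = all(len(v) == n_ids for v in assay["col_attrs"].values())
    layers_ok = all(v.shape == (n_barcodes, n_ids) for v in assay["layers"].values())
    return rows_ok and cols_ok and layers_ok


# Adding one label per barcode keeps the structure valid ...
toy_assay["row_attrs"]["label"] = np.array(["A", "B", "A"])
valid_after_label = check_shapes(toy_assay)

# ... while a layer with the wrong shape does not.
toy_assay["layers"]["bad"] = np.zeros((2, 2))
valid_after_bad_layer = check_shapes(toy_assay)
```

Here `valid_after_label` is true and `valid_after_bad_layer` is false, which is the constraint the four descriptions above express in words.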

Sample holds the Dna and Protein information

sample.protein
<missionbio.mosaic.protein.Protein at 0x174d9ddf0>
sample.protein.metadata
{'sample_name': array([['3 cell mix']], dtype=object),
 '__mosaic_cluster_description': 'graph-community on pca with Neighbours set to 30',
 '__mosaic_clustered': 1,
 '__mosaic_data_prep_pca': 'scaled CLR',
 '__mosaic_data_prep_scale': 'CLR',
 '__mosaic_data_prep_umap': 'PCA of scaled CLR',
 '__mosaic_initialize': 0,
 '__mosaic_prepped': 1,
 '__mosaic_visual_type': array(['Plots', 'Heatmap'], dtype=object),
 'n_reads': 128914059,
 'n_reads_trimmed': 128590556,
 'n_reads_valid_ab_barcodes': 117037906,
 'n_reads_valid_cell_barcodes': 121712026,
 'pipeline_version': '1.1.0'}
sample.protein.row_attrs
{'barcode': array(['AACAACCTAAACTTGTCG', 'AACAACTGGTACGTTGGA', 'AACAATGCAAGACCACGC',
        ..., 'TTGTCAACCTACAACACC', 'TTGTCAACCTAGTAACGG',
        'TTGTTAGAGATCAGGATG'], dtype=object),
 'label': array(['Mixed', 'Jurkat', 'TOM-1', ..., 'KG-1', 'KG-1', 'KG-1'],
       dtype=object),
 'pca': array([[-0.00739676,  0.01677373, -0.03541206, ..., -0.01174839,
         -0.00960959,  0.00061406],
        [-0.02040745, -0.0168572 , -0.00686639, ..., -0.01645682,
         -0.01039511,  0.01420925],
        [ 0.00012047,  0.04579714, -0.00975074, ..., -0.00724615,
          0.0393313 , -0.03267846],
        ...,
        [ 0.0226351 , -0.01385157, -0.00756582, ...,  0.00323248,
          0.03955304,  0.02694122],
        [ 0.01691241, -0.00197575,  0.00163568, ..., -0.00587812,
          0.0005414 ,  0.01381873],
        [ 0.01485945, -0.00385416, -0.00495111, ...,  0.02740017,
          0.03469586, -0.01425199]]),
 'sample_name': array(['3 cell mix', '3 cell mix', '3 cell mix', ..., '3 cell mix',
        '3 cell mix', '3 cell mix'], dtype=object),
 'umap': array([[ 4.1881294,  2.1518862],
        [ 4.405163 ,  7.9102015],
        [ 5.964074 , -1.2190977],
        ...,
        [-4.759269 , -3.1054091],
        [-5.3096914, -1.1285425],
        [-5.050415 , -2.969383 ]], dtype=float32)}
sample.protein.ids()
array(['CD110', 'CD117', 'CD123', 'CD135', 'CD19', 'CD24', 'CD3', 'CD33',
       'CD34', 'CD38', 'CD44', 'CD45', 'CD56', 'CD90', 'HLA-DR',
       'Mouse IgG1k'], dtype=object)
sample.dna.layers
{'AF': array([[ 19.80676329,  28.57142857,   1.4084507 , ...,  27.65957447,
          25.0965251 ,  13.49693252],
        [ 38.55421687,   0.        ,   0.        , ...,  42.30769231,
          31.69014085,  50.        ],
        [  0.76335878, 100.        ,   0.        , ...,   0.        ,
           0.41322314,   0.41322314],
        ...,
        [  0.        ,   0.        ,  15.38461538, ...,   0.        ,
           0.48543689,   1.32890365],
        [  0.        ,   0.        ,  42.85714286, ...,   0.        ,
           0.70921986,   0.        ],
        [  0.        ,   0.        ,  50.        , ...,   0.        ,
           0.        ,   3.26086957]]),
 'AF_MISSING': array([[ 19.80676329,  28.57142857,   1.4084507 , ...,  27.65957447,
          25.0965251 ,  13.49693252],
        [ 38.55421687,   0.        ,   0.        , ...,  42.30769231,
          31.69014085,  50.        ],
        [  0.76335878, 100.        ,   0.        , ...,   0.        ,
           0.41322314,   0.41322314],
        ...,
        [  0.        ,   0.        ,  15.38461538, ...,   0.        ,
           0.48543689,   1.32890365],
        [  0.        ,   0.        ,  42.85714286, ...,   0.        ,
           0.70921986,   0.        ],
        [  0.        ,   0.        ,  50.        , ...,   0.        ,
           0.        ,   3.26086957]]),
 'DP': array([[207,  63,  71, ...,  47, 259, 326],
        [ 83,  36,  41, ...,  26, 142, 124],
        [131,   8,  17, ...,  19, 242, 242],
        ...,
        [ 39,  23,  13, ...,  31, 206, 301],
        [157,  21,  35, ...,  20, 141, 186],
        [ 76,   6,  20, ...,   7,  81, 184]], dtype=int16),
 'GQ': array([[99, 99, 99, ..., 99, 99,  0],
        [99, 54, 99, ..., 99, 99, 99],
        [99, 23, 51, ..., 57, 99, 99],
        ...,
        [99, 36, 38, ..., 93, 99, 99],
        [99, 33, 99, ..., 60, 99, 99],
        [99,  9, 99, ..., 21, 99, 99]], dtype=int8),
 'NGT': array([[1, 1, 0, ..., 1, 1, 0],
        [1, 0, 0, ..., 1, 1, 1],
        [0, 2, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 1, ..., 0, 0, 0],
        [0, 0, 1, ..., 0, 0, 0],
        [0, 0, 1, ..., 0, 0, 0]], dtype=int8),
 'NGT_FILTERED': array([[3, 1, 0, ..., 1, 1, 3],
        [1, 0, 0, ..., 1, 1, 1],
        [0, 3, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 3, ..., 0, 0, 0],
        [0, 0, 1, ..., 0, 0, 0],
        [0, 3, 1, ..., 3, 0, 0]], dtype=int8),
 'scaled_counts': array([[ 0.17521769,  0.77875598, -0.58027676, ...,  0.40544186,
          0.35280254, -0.1315704 ],
        [ 0.95461827, -0.33207324, -0.61815498, ...,  0.97618787,
          0.6251839 ,  1.37227722],
        [-0.61648665,  3.55582902, -0.61815498, ..., -0.67227966,
         -0.66686127, -0.670591  ],
        ...,
        [-0.64822228, -0.33207324, -0.20440833, ..., -0.67227966,
         -0.66387813, -0.63286694],
        [-0.64822228, -0.33207324,  0.53442497, ..., -0.67227966,
         -0.65463368, -0.6876149 ],
        [-0.64822228, -0.33207324,  0.72652162, ..., -0.67227966,
         -0.68393146, -0.55327411]])}

DNA Analysis#

Topics covered

  1. Whitelist of variants

  2. Manually selecting variants

Basic filtering#

Many filtering options are available.
Use the documentation shared earlier,
or the help() function, to get the same
information here.
help(sample.dna.filter_variants)
Help on method filter_variants in module missionbio.mosaic.dna:

filter_variants(min_dp: int = 10, min_gq: int = 30, vaf_ref: float = 5, vaf_hom: float = 95, vaf_het: float = 30, min_prct_cells: float = 50, min_mut_prct_cells: float = 1, iterations: int = 10, n_cores: int = 4) -> Sequence method of missionbio.mosaic.dna.Dna instance
    Find informative variants.
    
    This method also adds the `NGT_FILTERED` layer to the assay
    which is a copy of the NGT layer but with the NGT for the
    cell-variants not passing the filters set to 3 i.e. missing.
    
    Parameters
    ----------
    min_dp : int
        The minimum depth (DP) for the call to be considered.
        Variants with less than this DP in a given
        barcode are treated as no calls.
    min_gq : int
        The minimum genotype quality (GQ) for the call to be
        considered. Variants with less than this GQ
        in a given barcode are treated as no calls.
    vaf_ref : float [0, 100]
        All reference calls (NGT = 0) with VAF > vaf_ref
        are converted to no calls (NGT = 3) for each barcode
        and variant in the NGT matrix
    vaf_het : float [0, 100]
        All hetrozygous calls (NGT = 1) with VAF < vaf_het
        are converted to no calls (NGT = 3) for each barcode
        and variant in the NGT matrix
    vaf_hom : float [0, 100]
        All homozygous calls (NGT = 2) with VAF < vaf_hom
        are converted to no calls (NGT = 3) for each barcode
        and variant in the NGT matrix
    min_prct_cells : float [0, 100]
        The minimum percent of total cells in which the variant
        should be present (NGT ∈ {0, 1, 2}) after the
        filters are applied. This value is calculated not as a
        simple percentage of cells, but as a weighted percentage
        where the contribution of each cell to the numerator and
        denominator is weighted by a measure of cell completeness.
        This has the effect of reducing the contribution of cells
        with a lot of missing genotype calls, allowing more
        variants to be retained when there are many incomplete
        cells.
    min_mut_prct_cells : float [0, 100]
        The minimum percent of the total cells in which the
        variant should be mutated, (NGT ∈ {1, 2}) after the
        filters are applied. This value is weighted by cell
        completeness as described above for `min_prct_cells`.
    iterations : int > 0
        The number of iterations used to remove the variants and calculate cell completeness.
        Here the cell completeness refers to the fraction of variants that have a high quality
        call in a cell. "High quality" is defined by the `min_dp`, `min_gq`, and the 3 VAF
        parameters. When calculating `min_prct_cells` the cells are weighted by their
        completeness. In the first iteration the completeness values are calculated using the
        original NGT matrix. In this iteration variants with a fraction of no calls less than
        `min_prct_cells / iterations` are removed. In the second iteration the completeness
        values are calculated using the NGT matrix of the variants after the first iteration.
        In the second iteration the threshold is set to `2 * min_prct_cells / iterations`. This
        process is repeated until the number of `iterations` is reached. In the last iteration
        the threshold for the fraction of cells in which the variant is present is `min_prct_cells`.
        The variants that pass the filters in the last iteration are returned. Increasing the
        number of iterations will result in a more nuanced calculation of cell completeness.
        However, it is not recommended to modify this parameter. Rather, the `min_prct_cells`
        parameter should be modified to change the number of variants that pass the filters.
        The same is true for `min_mut_prct_cells`.
    n_cores : int
        The number of cores to use for parallel processing.
    
    Returns
    -------
    numpy.ndarray
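
The per-call part of the help text above (the `min_dp`, `min_gq` and three VAF parameters) can be sketched with NumPy. This is an illustration of the described rules only, not the library's implementation: `mask_low_quality_calls` is a hypothetical helper, and the strict inequalities are taken from the wording "less than" in the help text. The weighted, iterative variant selection is not reproduced here.

```python
import numpy as np


def mask_low_quality_calls(ngt, dp, gq, af, min_dp=10, min_gq=30,
                           vaf_ref=5, vaf_het=30, vaf_hom=95):
    """Return a copy of `ngt` with calls failing the filters set to 3 (missing)."""
    failed = (dp < min_dp) | (gq < min_gq)   # low depth or low genotype quality
    failed |= (ngt == 0) & (af > vaf_ref)    # reference call with too high a VAF
    failed |= (ngt == 1) & (af < vaf_het)    # heterozygous call with too low a VAF
    failed |= (ngt == 2) & (af < vaf_hom)    # homozygous call with too low a VAF
    filtered = ngt.copy()
    filtered[failed] = 3
    return filtered


# Two toy cells x three toy variants
ngt = np.array([[0, 1, 2], [0, 1, 2]])
dp = np.array([[50, 50, 50], [5, 50, 50]])
gq = np.array([[99, 99, 99], [99, 20, 99]])
af = np.array([[2.0, 45.0, 99.0], [0.0, 45.0, 80.0]])

ngt_filtered = mask_low_quality_calls(ngt, dp, gq, af)
```

In the toy data every call of the first cell passes, while the second cell fails once on depth, once on genotype quality and once on the homozygous VAF threshold.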
# Filter variants
# This is the default insights filtering method

dna_vars = sample.dna.filter_variants()
dna_vars
array(['chr2:25458546:C/T', 'chr2:25469502:C/T', 'chr2:25470426:C/T',
       'chr2:25470573:G/A', 'chr2:209113192:G/A', 'chr4:55599436:T/C',
       'chr4:106154990:TATAGATAG/T', 'chr4:106154990:T/TATAG',
       'chr4:106158216:G/A', 'chr4:106190862:T/C', 'chr4:106197469:G/A',
       'chr6:62094287:A/T', 'chr7:148506064:A/G', 'chr7:148529851:G/GA',
       'chr7:148543525:A/G', 'chr10:5554293:T/C', 'chr10:77210191:C/T',
       'chr10:106721610:G/A', 'chr11:32414333:G/T', 'chr11:32417945:T/C',
       'chr12:112888239:C/T', 'chr13:28597686:G/A', 'chr13:28610183:A/G',
       'chr14:56969005:C/T', 'chr17:7577427:G/A', 'chr17:7578176:C/T',
       'chr17:7578263:G/A', 'chr20:31023356:G/T', 'chr21:36252917:C/T'],
      dtype='<U26')
# Check the number of filtered variants

len(dna_vars)
29
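
The `iterations` parameter in the help text describes a threshold that grows linearly until it reaches `min_prct_cells`. A minimal sketch of that schedule, assuming only what the help text states (`threshold_schedule` is a hypothetical name, and the completeness weighting itself is not modelled):

```python
def threshold_schedule(min_prct_cells=50.0, iterations=10):
    """Thresholds used in iterations 1..iterations, per the help text."""
    return [k * min_prct_cells / iterations for k in range(1, iterations + 1)]


schedule = threshold_schedule()
```

With the defaults the first iteration uses 5 % and the last uses the full 50 %.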

Whitelist#

Simply appending the whitelist to the list of filtered
variants is sufficient to then select the variants
using the slice notation

i.e. sample.dna[{list of barcodes}, {list of ids}]
whitelist = ['chr1:115256513:G/A', 'chr21:44514718:C/T']
final_vars = whitelist + list(dna_vars)
len(final_vars)
31
# Selecting all cells and final variants

sample.dna = sample.dna[sample.dna.barcodes(), final_vars]
# Check the shape i.e. (Number of barcodes, number of ids)
# of the final filtered dna object

sample.dna.shape
(2476, 29)
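
Before slicing, it can be useful to see which whitelist entries actually exist among the assay ids. The sketch below uses plain Python lists; `merge_variants` is a hypothetical helper, and the toy `available_ids` stand in for the ids of a real assay.

```python
def merge_variants(whitelist, filtered, available_ids):
    """Merge two variant lists in order, drop duplicates, and
    separate out ids that are absent from the assay."""
    available = set(available_ids)
    merged, missing, seen = [], [], set()
    for var in list(whitelist) + list(filtered):
        if var in seen:
            continue
        seen.add(var)
        (merged if var in available else missing).append(var)
    return merged, missing


# Toy data: one whitelist entry is not among the available ids
available_ids = ["chr2:25458546:C/T", "chr4:55599436:T/C", "chr1:115256513:G/A"]
whitelist = ["chr1:115256513:G/A", "chr21:44514718:C/T"]
filtered = ["chr2:25458546:C/T", "chr1:115256513:G/A"]

merged, missing = merge_variants(whitelist, filtered, available_ids)
```

`merged` can then be used as the list of ids in the slice, and `missing` lists the requested variants that the toy assay does not contain.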

Manual variant selection#

Heatmaps are interactive. Clicking on it selects
the corresponding id whose value is stored in the
`selected_ids` attribute of the object

e.g. sample.dna.selected_ids
# Remove `.show("jpg")` to make the plot interactive
sample.dna.stripplot(attribute='AF', colorby='GQ').show("jpg")
../_images/5a26af8e6ca61afe97a4bd81102903ee178a61ba787fc6c782a3c1724b270415.jpg
sample.dna.heatmap(attribute='AF')
if len(sample.dna.selected_ids) > 0:
    sample.dna = sample.dna.drop(sample.dna.selected_ids)

Clustering#

DNA has a custom clustering method called `find_clones`

It projects the data on a UMAP and then performs
dbscan to identify unique clusters, which are then
merged in case they were formed due to missing
information
sample.dna.find_clones()
Unique clusters found - 6
Clusters after removing missing data - 5
/Users/casp/Documents/code/mosaic/src/missionbio/mosaic/dna.py:142: UserWarning:

Using the "umap" that is already present in the row attributes.
sample.dna.row_attrs
{'barcode': array(['AACAACCTAAACTTGTCG', 'AACAACTGGTACGTTGGA', 'AACAATGCAAGACCACGC',
        ..., 'TTGTCAACCTACAACACC', 'TTGTCAACCTAGTAACGG',
        'TTGTTAGAGATCAGGATG'], dtype=object),
 'label': array(['4', '2', '3', ..., '1', '1', '1'], dtype=object),
 'pca': array([[ 0.01696292,  0.00791857,  0.02525593, ...,  0.01081608,
         -0.01028663,  0.00942911],
        [ 0.01789271, -0.01653034, -0.00027796, ...,  0.00559107,
          0.00686503, -0.00267619],
        [ 0.0103838 ,  0.04182529, -0.01736327, ...,  0.04889085,
         -0.00238786, -0.01065905],
        ...,
        [-0.02319658, -0.00374513, -0.01804908, ...,  0.01024397,
          0.00164552,  0.01918958],
        [-0.02427855, -0.00700538, -0.00999176, ...,  0.00866599,
         -0.01266088,  0.01117126],
        [-0.02028216, -0.00607941,  0.02598467, ...,  0.01244461,
         -0.00450799,  0.00174288]]),
 'sample_name': array(['3 cell mix', '3 cell mix', '3 cell mix', ..., '3 cell mix',
        '3 cell mix', '3 cell mix'], dtype=object),
 'umap': array([[ 4.2597423,  3.5266023],
        [ 5.3297005,  5.5066705],
        [ 3.561163 , -8.171488 ],
        ...,
        [-5.4089503, -1.4769189],
        [-5.6915836, -1.3970627],
        [-5.5950303, -0.1773623]], dtype=float32),
 'filtered': array([False, False, False, ..., False, False, False])}
sample.dna.scatterplot(attribute='umap', colorby='label').show("jpg")
../_images/a56432aaad78ddd509668d4c5a09fafdafe7efdf02075ceb9b9e68ba9ba2f8c3.jpg
# AF_MISSING is the same as the AF layer except that it stores the missing values as -50 instead of 0
sample.dna.heatmap('AF_MISSING').show("jpg")
../_images/c1221707775abe35449bad5846a00e964f303a6f0c077870bb1c2c30380dc58f.jpg
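
The relationship between `AF` and `AF_MISSING` described in the comment above can be sketched as follows. This assumes that a missing call is one coded as 3 in a genotype layer; which genotype layer Mosaic uses for this is not stated here, so `ngt` is a generic toy array.

```python
import numpy as np

af = np.array([[19.8, 0.0], [100.0, 42.9]])
ngt = np.array([[1, 3], [2, 3]])  # 3 marks a missing call (assumption)

# Keep the allele frequency where a call exists, store -50 where it is missing
af_missing = np.where(ngt == 3, -50.0, af)
```

Using a value outside the 0 to 100 range lets a heatmap show missing calls in a colour distinct from a true allele frequency of 0.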

Conclusion#

1. Basic filtering of barcodes and ids demonstrated
2. Basic DNA filtering functionality showcased

CNV Analysis#

Preliminary heatmap of CNV shows that there could be two clusters

Topics covered

  1. Dimension reduction options and their effects

Observation#

sample.cnv.normalize_reads()
sample.cnv.heatmap(attribute='normalized_counts').show("jpg")
../_images/9d3c407baa30429399dd15aa4f80234233f570767862e39b11d1608296d900a2.jpg

PCA options#

Here the UMAP options are kept constant
The only parameter in PCA is the number of components

Here we see how to determine this value, and the effect
when we deviate from this value
sample.cnv.run_pca(attribute='normalized_counts', components=6, show_plot=True)
sample.cnv.run_umap(attribute='pca', min_dist=0, n_neighbors=100, random_state=42)
/Users/casp/miniconda3/envs/missionbio.mosaic/lib/python3.9/site-packages/umap/umap_.py:1943: UserWarning:

n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
../_images/a1993224f39bb0dc1e18b20bdc36e801c9d614c8ed190ce2cea8ce2467509b01.png

Visualization#

The result of the dimension reduction analysis is
visualized using a scatterplot of the umap
sample.cnv.cluster(attribute='umap', method='dbscan', eps=0.55)
sample.cnv.scatterplot(attribute='umap', colorby='label').show("jpg")
../_images/ed0c5958ea2b58b54ef3989d0c7f3b9dae086a90dbaf2fa3cf1d42457c7d7a24.jpg

CNV Conclusion#

Given all other variables are kept constant

1. Too many PCA components may result in merging of clusters
2. Too few PCA components may result in splitting of clusters
3. The appropriate number of components can be determined using the elbow plot
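
The quantity behind an elbow plot is the fraction of variance explained by each component. A minimal NumPy sketch on simulated data (not the example dataset, and not how `run_pca` computes it internally) looks like this; `explained_variance_ratio` is a hypothetical helper.

```python
import numpy as np


def explained_variance_ratio(data):
    """Fraction of total variance captured by each principal component."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


rng = np.random.default_rng(42)
# 200 toy cells x 8 features, with most variance in two latent directions
latent = rng.normal(size=(200, 2)) * np.array([5.0, 3.0])
loadings = rng.normal(size=(2, 8))
data = latent @ loadings + rng.normal(scale=0.1, size=(200, 8))

ratios = explained_variance_ratio(data)
cumulative = np.cumsum(ratios)
```

Plotting `ratios` against the component index gives the elbow: in this simulation the curve flattens after the second component, which is where the number of components would be set.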

Protein Analysis#

Topics covered

  1. Basic workflow

  2. Custom clustering eg. selection on biaxial plot

  3. Custom methods by adding layers

Basic workflow#

# Downsampling and clustering similar to CNV

sample.protein.normalize_reads('CLR')
sample.protein.run_pca(attribute='normalized_counts', components=5)
sample.protein.run_umap(attribute='pca')

sample.protein.cluster(attribute='pca', method='graph-community', k=100, random_state=42)
Creating the Shared Nearest Neighbors graph.
Identifying clusters using Louvain community detection.
Number of clusters found: 10.
Modularity: 0.758
sample.protein.heatmap(attribute='normalized_counts').show("jpg")
../_images/5d7facd08071953e52e4e601bb7fda60751b168c5fe17f95a3da61f2727262fa.jpg
sample.protein.scatterplot(attribute='umap', colorby='label').show("jpg")
../_images/4e701e1db55cf208776ecff95f9a8ba6e9d40a41015b3fa2176f09c67ce47cf7.jpg
# Re cluster based on the observations from the UMAP

sample.protein.cluster(attribute='umap', method='dbscan')
sample.protein.ids()[:1]
array(['CD110'], dtype=object)
# Preferred way to look at protein expression profiles

features = ["CD110"]

sample.protein.ridgeplot(
   attribute='normalized_counts',
   splitby='label',
   features=features,
).show("jpg")
../_images/219843a5af2b8dff732fa884eabda0d2272a37097d118d499e9df54a1899aac3.jpg
# UMAP with the expression for each of the selected proteins overlaid
# In case of error, make sure that ids have been selected on the heatmap and shown in sample.protein.selected_ids

sample.protein.scatterplot(
   attribute='umap',
   colorby='normalized_counts',
   features=['CD34', 'CD44', 'HLA-DR'],
).show("jpg")
../_images/5299180c3fb8942dfd901edac7c5378de994f2a9ad9c9efa39eda6044bca1a95.jpg
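
The CLR (centered log-ratio) normalization requested in the basic workflow above can be sketched with NumPy. This is the textbook form with a pseudocount of 1, which is an assumption; the exact formula used by `normalize_reads('CLR')` may differ.

```python
import numpy as np


def clr(counts):
    """Centered log-ratio per cell: log counts minus the cell's mean log count."""
    log_counts = np.log1p(counts)  # log(1 + x) avoids log(0); pseudocount is an assumption
    return log_counts - log_counts.mean(axis=1, keepdims=True)


# Two toy cells x three toy antibodies
counts = np.array([[100, 10, 0], [5, 5, 5]])
normalized = clr(counts)
```

Each row of `normalized` sums to zero, so a value is read relative to the other antibodies in the same cell rather than as an absolute count.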

Custom clustering#

When `colorby` is not provided for any scatterplot
the lasso tool can be used to cluster the cells
based on the selection made
# Selection on biaxial scatterplot
# The same can be done for the UMAP when labels=False is passed

sample.protein.feature_scatter(
    layer='normalized_counts',
    ids=['CD90', 'CD3']
)
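
A non-interactive analogue of the lasso selection is a simple threshold gate on the two features. The sketch below uses toy normalized values and hypothetical cut-offs of 1.0; it does not use the Mosaic API.

```python
import numpy as np

# Toy normalized expression for four cells
cd3 = np.array([2.1, 0.2, 0.1, 1.8])
cd90 = np.array([0.1, 1.9, 0.2, 0.3])

# Start with a default label, then overwrite the cells inside each gate
labels = np.full(cd3.shape, "other", dtype=object)
labels[(cd3 > 1.0) & (cd90 <= 1.0)] = "CD3+"
labels[(cd90 > 1.0) & (cd3 <= 1.0)] = "CD90+"
```

An array like `labels`, with one entry per barcode, has the form expected for the 'label' row attribute.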

Custom methods by adding layers#

If someone is interested in trying their own methods,
they can simply modify the appropriate layers, attributes
and metadata to plug in their step in this workflow
# Custom normalization by changing the `normalized_counts` layer

import numpy as np

log_reads = np.log10(10 + sample.protein.layers['read_counts'])
norm = np.divide(log_reads, log_reads.mean(axis=1).reshape(-1, 1))


sample.protein.add_layer('normalized_counts', norm)
Other examples include:

custom labels -> 'label' row_attr
custom palette -> 'palette' metadata   
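
To see what the custom normalization above does, the same two lines can be applied to a small toy matrix of read counts (the values are chosen so the logarithms are round numbers):

```python
import numpy as np

# Two toy cells x three toy antibodies
read_counts = np.array([[0, 90, 990], [10, 10, 10]])

log_reads = np.log10(10 + read_counts)  # first row becomes [1, 2, 3]
norm = np.divide(log_reads, log_reads.mean(axis=1).reshape(-1, 1))
```

Every row of `norm` has a mean of 1, so the values express each antibody relative to the cell's average log signal.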

Protein Conclusion#

1. Protein analysis workflow similar to CNV
2. Different clustering methods can result in
   different types of clusters being identified
3. It is possible to have custom clustering for
   any scatterplot by using the lasso tool
4. Custom analysis is possible by modifying appropriate
   layers, attributes and metadata

Statistical Significance#

The significance of differential expression
based on a t-test can be examined using
the `test_signature` method
pval, tstat = sample.protein.test_signature(attribute='normalized_counts')
pval
CD110 CD117 CD123 CD135 CD19 CD24 CD3 CD33 CD34 CD38 CD44 CD45 CD56 CD90 HLA-DR Mouse IgG1k
1 1.032004e-133 1.231391e-181 1.309454e-01 5.053298e-138 1.386557e-115 6.039634e-214 3.954765e-206 0.000000e+00 0.000000e+00 8.153836e-286 0.000000e+00 2.086838e-02 5.979655e-190 5.373200e-223 8.324096e-04 4.378789e-152
2 2.403739e-108 1.363348e-29 1.526118e-94 1.228098e-85 1.254303e-90 1.564962e-42 0.000000e+00 9.441482e-68 1.010007e-253 6.885586e-297 1.492394e-146 3.178453e-305 3.548999e-238 0.000000e+00 0.000000e+00 1.459707e-123
3 2.472998e-13 1.650288e-52 6.761382e-126 1.097060e-26 0.000000e+00 0.000000e+00 7.707852e-64 4.013994e-96 2.467303e-48 4.938844e-01 5.755930e-135 0.000000e+00 9.810248e-01 9.302884e-73 2.703924e-296 5.808461e-15
4 4.818229e-09 5.738795e-25 2.111025e-03 1.141661e-11 1.388346e-23 7.238226e-26 3.786053e-13 1.311617e-34 9.702871e-20 1.475976e-11 4.835089e-25 7.794311e-01 3.916305e-01 7.962307e-10 1.142897e-09 9.995305e-11
# Add a small constant so that p-values of 0 stay finite under log10
pval = pval + 10 ** -50
pvals = -np.log10(pval) * (tstat > 0)
from missionbio.plotting.heatmap import Heatmap

fig = Heatmap(pvals, y_groups=pvals.index.values).draw()
fig.show("jpg")
../_images/7678917ae4d2fcb477c05393f667d7d8163a7143bd9509b47edd7b07b03d7154.jpg
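
The transformation applied before drawing the heatmap can be followed on a toy example. This sketch uses plain NumPy arrays in place of the data frames returned by the library; the values are made up.

```python
import numpy as np

# Toy p-values and t-statistics for two clusters x two features
pval = np.array([[0.0, 1e-10], [0.5, 1e-3]])
tstat = np.array([[12.0, -8.0], [0.7, 3.5]])

# A small constant keeps p-values of exactly 0 finite under log10
floored = pval + 1e-50

# Keep the score only where the feature is higher in the cluster (t > 0)
signed_scores = -np.log10(floored) * (tstat > 0)
```

A p-value of 0 is capped at a score of 50, and significant but negative differences are zeroed out, so the heatmap highlights only features that are elevated in a cluster.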

Conclusion

Statistical significance of the differential expression
can be ascertained. Median values can be explored for DNA
to determine the difference between clusters.

Combined Visualizations#

Visualization for multiple assays at once

Clone vs Analyte#

CNV#

# Ignore warnings raised when running clone_vs_analyte

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
sample.clone_vs_analyte('cnv')
../_images/3e88b16f3311673b9849e6ef6344b46b7a71d48c0594fe2c88ec73a36089411d.png

Protein#

sample.clone_vs_analyte('protein')
../_images/09f269a2d933a30e157847b673d121cbed7f143a5e5d1b05a086dd4f9606d5ea.png
# Filtering protein and cnv to improve the visualization

sample.protein = sample.protein[:, ['CD3', 'CD90']]
sample.cnv = sample.cnv[:, 58:85]
sample.clone_vs_analyte('protein')
../_images/a9c42e63ea031d25d054a2d073ca19af2f4d54386c6e6c9d5150a3995c8daeff.png
# Certain clones can also be dropped, but they must be dropped from all assays
# Hence the sample object is sliced in this case
# In this case it is better to store the new sample in a separate variable

# This returns the dna barcodes with the given labels
select_bars = sample.dna.barcodes(['2', '3', '4'])

sample_subset = sample[select_bars]
sample_subset.clone_vs_analyte('protein')
../_images/5cb710b3c6464de71b505ed2aa29ab1deef23e53756284df2fcdf531e81fa45c.png
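
The reason the whole sample is sliced is that the same barcodes must be kept in every assay. A minimal NumPy sketch of that idea, with toy arrays standing in for the DNA labels and a protein layer:

```python
import numpy as np

# Toy DNA assay: one clone label per barcode
dna_barcodes = np.array(["b1", "b2", "b3", "b4"])
dna_labels = np.array(["1", "2", "3", "2"])

# Toy protein assay sharing the same barcodes
protein_barcodes = np.array(["b1", "b2", "b3", "b4"])
protein_counts = np.array([[5, 1], [0, 9], [7, 7], [2, 8]])

# Barcodes whose DNA clone label is '2' or '3'
keep = dna_barcodes[np.isin(dna_labels, ["2", "3"])]

# Apply the same barcode selection to the protein data
protein_rows = np.isin(protein_barcodes, keep)
protein_subset = protein_counts[protein_rows]
```

Selecting by barcode rather than by row position keeps the assays aligned even if their row orders differ.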
# The ids can also be reset to the entire set

sample.reset('cnv')
sample.reset('protein')
sample.clone_vs_analyte('protein')
../_images/09f269a2d933a30e157847b673d121cbed7f143a5e5d1b05a086dd4f9606d5ea.png

Multi assay heatmap#

sample.heatmap(("dna", "protein", "cnv")).show("jpg")

# Try the following
# sample.signaturemap(("dna", "protein", "cnv"))
# sample.signaturemap(("protein", "dna", "cnv"))
# sample.signaturemap(("dna", "protein"))
../_images/1f9f744d8d71d700d5086b46e15fce857b046c767b4dbbe14d713649f4864a02.jpg

Saving#

The analysis can be saved to an h5 file.
This final trimmed file will be much smaller than the original h5 file.
It can be opened in Insights, or reopened in Mosaic.
ms.save(sample, './basics.analyzed.h5', mode="w")
ms.to_zip(sample, "./basics.analyzed.zip")
Data from h5 files can be efficiently manipulated,
visualized, and analyzed using Mosaic.