Single sample GE+Protein#

Step-0: Add missing panel information to h5 files generated with v1.2#

Note: h5 generated by GE pipeline v1.1.1 or earlier will not work

This step is not needed for h5 files generated by GE pipeline v1.2.1 onwards#

import pandas as pd

from missionbio.h5.data.reader import H5Reader
from missionbio.h5.data.writer import H5Writer
h5_path = 'Path to the h5 from v1.2 pipeline'


## These are path to panel files which are within the GE panel ZIP file
## Extract the ZIP file and specify path to each file here
panel_design_summary_path = 'Path to designSummary.tab panel file'
panel_submitted_path = 'Path to submitted panel file'
panel_target_groups_path = 'Path to target_groups.csv panel file'
design_summary = pd.read_csv(panel_design_summary_path, sep='\t')
submitted_bed = pd.read_csv(panel_submitted_path, sep="\t", header=None, names=["chrom", "start", "stop", "target_name",],)
target_groups = pd.read_csv(panel_target_groups_path)

with H5Reader(h5_path) as reader:
    all_assays = reader.assays()
    all_samples = reader.samples()

with H5Writer(h5_path, mode="a") as writer:
    if "ge_dna_variants" in all_assays:
        for sample in all_samples:
            info = {
                sample: {
                    "design_summary": design_summary,
                    "submitted_bed": submitted_bed,
                    "target_groups": target_groups,
                }
            }

            writer.write_info("ge_dna_variants", info)

Step-1: Load the necessary libraries#

import numpy as np
import pandas as pd

import plotly.graph_objects as go

import missionbio.mosaic.io as mio

Step-2: Load the sample H5 file#

The example dataset can be downloaded from https://github.com/MissionBio/mosaic-jupyter/blob/master/exampleH5/edited-t-cells.h5

sample = mio.load('/Users/user_name/Documents/edited-t-cells.h5')

Step-3: Protein data analysis#

This is showing a standard protein analysis workflow including the following steps:

  • Normalizing protein reads

  • Prinicple Component analysis on the normalized read counts

  • Running UMAP on chosen PCs

  • Clustering the protein data using the UMAP dimensions

  • Visualizing the normalized read counts for the identified clusters (as a heatmap)

  • Renaming the clusters to more meaningful names (this step is not added for this example data)

For the latest protein analysis methods refer to the notebook here.

sample.protein.normalize_reads('NSP',random_state=42)
sample.protein.run_pca(attribute='normalized_counts', components=10, show_plot=True, random_state=42)
sample.protein.run_umap('normalized_counts', random_state=42)
sample.protein.cluster('umap', method='graph-community', k=5)
sample.protein.heatmap('normalized_counts')
# sample.protein.rename_labels({
#     "1": "Cell-type1",
#     "2": "Cell-type2",
#     "3": "Cell-type3",
#     "4": "Cell-type4",
#     "5": "Cell-type5",
# })

Step-4: Generating GE tables used for downstream plots#

The GE workflow starts with filtering low quality variants followed by identifying the edit status for each target in each cell. Variant filtering is done using 3 different filters:

  • min_gq: Remove variants which have genotyping quality or GQ (as calculated by GATK) lower than this threshold. Default value for this filter is 30.

  • min_dp: Remove variants which have read depth or DP (as calculated by GATK) lower than this threshold. Default value for this filter is 10 reads.

  • min_af: Remove variants which have Allele frequency or AF (as calculated by GATK) lower than this threshold. Default value for this filter is 10%.

The next command generates the edit_status dataframe for GE data which contains the targets as columns and cells as rows with the value for each combination being the edit status of the target in the particular cell.

More details on the function can be found in the function’s docstring (shown below)

For more information on the layers, column attributes and row attributes of the GE assay please refer to the GE DNA variants section of this article

?sample.ge.get_final_edit_status
sample.ge.filter_variants(
    min_gq=30,
    min_dp=10,
    min_af=10,
)
edit_status = sample.ge.get_final_edit_status(edit_mode='indel')
edit_status
# Look at the edit status distribution for a target

edit_status['TCRA_ONTARGET'].value_counts()
# Write out the edit_status to a CSV file

# Uncomment the following line to write out the edit_status matrix to a CSV file in the current working directory

# edit_status.to_csv(f"{sample.name}_edit_status.csv")

Step-5: Browsing GE tables and data#

1. All variants table#

Details on the data contained within this table can be found in the function’s documentation.

This table is generated for a specific edit mode, which can be one of three possible values:

  • indel : Only INDEL variants are consdiered as edits and shown in the table

  • snv : Only SNV variants are considered as edits and shown in the table

  • indel_snv : Both INDEL and SNV variants are considered as edits and shown in the table

The resulting table contains information for all targets in the panel, it can be subset to information for a particular target by filtering on the column target which contains the target names.

?sample.ge.get_all_variants_table
all_vars_table = sample.ge.get_all_variants_table(edit_mode='indel')

all_vars_table
# Write the all variants table to a file

# Uncomment the following line to write out the all_vairants_table (containing information for all targets) to a CSV file

# all_vars_table.to_csv(f"{sample.name}_all_variants.csv")

# Uncomment the following for-loop to write out 1 all_variants CSV file per target to the current working directory

# for target, mini_df in all_vars_table.groupby('target'):
#     mini_df.to_csv(f"{sample.name}_{target}_all_variants.csv")

2. Modifications list table (per target)#

Details on the data contained within this table can be found in the function’s documentation (shown below).

This table is generated for a specific target and contains detailed information about the INDELs present in that target.

This table can be used to identify which variants are present on allele-1 vs allele-2 for a particular cell barcode.

?sample.ge.get_target_modifications
sample.ge.get_target_modifications('TCRA_ONTARGET')
# Generate a file for each target in the panel

all_targets_in_panel = sample.ge.panel.targets['target_name'].tolist()

for target_name in all_targets_in_panel:
    target_df = sample.ge.get_target_modifications(target_name)

    ## Uncomment the following line to write out a CSV file for each target to the current working directory
    
    # target_df.to_csv(f'{sample.name}_{target_name}_modifications.csv')

Step-6: Add protein cell types to GE tables#

1. Final edit status table#

These statements add the column protein_label to the edit_status dataframe. These labels are the same as what was set in Step-1 for protein analysis.

edit_status = sample.ge.get_final_edit_status(edit_mode='indel')

edit_status['protein_label'] = dict(zip(sample.protein.barcodes(), sample.protein.get_labels()))

edit_status

2. Target modifications table#

These statements add the column protein_label to the target_modifications dataframe. These labels are the same as what was set in Step-1 for protein analysis.

target_modifications = sample.ge.get_target_modifications('TCRA_ONTARGET')

target_modifications['protein_label'] = dict(zip(sample.protein.barcodes(), sample.protein.get_labels()))

Step-7: Make some GE+Protein plots#

1. Protein UMAP plot colored by on-target GE editing status#

There are 2 plots shown below, the first shows the protein UMAP values colored by different antibodies showing the expression per UMAP cluster.

The second plot shows the same UMAP clusters but colored by the GE target editing statuses.

Both of these plots have attribute called “features” which can be used to control which antibodies or GE targets are shown in the respecitve plots.

This plot is useful to identify the different protein populations using the first UMAP plot (showing Ab-expression) and then seeing if any of these cell-types/clusters have a bias towards gene editing.

fig = sample.protein.scatterplot('umap',
    colorby=sample.protein.get_attribute('normalized_counts'),
    features=['CD3', 'CD4', 'CD8']
)
fig.show()
fig = sample.protein.scatterplot('umap', 
   colorby=edit_status, 
   features=['TCRA_ONTARGET', 'PDCD1_ONTARGET', 'TCRB_ONTARGET-1', 'TCRB_ONTARGET-2']
)
fig.show()

2. Protein expression violin plot split by on-target GE editing status#

This plot shows the expression of multiple antibodies simultaneously but splits the cells by the different GE editing statuses of a target. This is useful to compare different editing categories of a target and see if any of the protein’s expressions shows a correlation with the editing of the target.

target_to_plot = 'TCRA_ONTARGET'

fig = sample.protein.violinplot(
    'normalized_counts',
    splitby=edit_status[target_to_plot].values,
    features=['CD3', 'CD4', 'CD8'],
    title=f' - {target_to_plot}'
)
fig.update_layout(
    width=1000
)