Cohort Analysis Pipeline#

Overview#

The Tertiary Cohort Analysis Pipeline is designed to process and analyze cohorts of .h5 files generated from Mission Bio’s Tapestri platform. It supports variant detection, whole genome CNV analysis, phylogeny generation, and report creation for individual samples and the entire cohort.

This guide explains how to configure and run the pipeline.

Note

Download the example dataset to test the pipeline locally.

Prerequisites#

  1. Python Environment: Ensure mosaic 3.12.0 or above is installed and that the environment containing it is activated in your terminal (a quick check is sketched after this list).

  2. Input Files: Organize the relevant .h5 files in a folder.

  3. Configuration File: A YAML configuration file specifying pipeline parameters. (Explained below)
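To confirm the environment from step 1, a minimal check such as the one below can be run. This is a sketch that assumes the package is importable as missionbio.mosaic and exposes a standard __version__ attribute.

import missionbio.mosaic as ms

# Print the installed mosaic version; the pipeline expects 3.12.0 or above.
print(getattr(ms, "__version__", "unknown version"))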

Pipeline Steps#

The pipeline consists of the following steps:

  1. Demultiplexing: Splits multiplexed samples into per-patient .h5 files.

  2. Genome-wide CNV Analysis (Optional): Detects copy number variations and CNV subclones.

  3. Variant Detection: Identifies somatic variants for each patient.

  4. Phylogeny: Identifies clones using the detected variants and estimates the phylogeny.

  5. Postprocessing: Normalizes protein data and assigns final clone labels.

  6. Report Generation: Creates HTML reports for individual patients and the cohort.

Command-Line Usage#

The pipeline can be executed using the tapestri tertiary cohort command from the CLI.

tapestri tertiary cohort process --config <config_file> --output-folder <output_folder> [OPTIONS]
  • --config: Path to the YAML configuration file.

  • --output-folder: Directory where output files will be saved.

  • --overwrite: Optional flag to overwrite the output folder if it already exists.
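For example, a typical invocation might look like the line below; cohort.yaml and results are placeholders for your own configuration file and output directory.

tapestri tertiary cohort process --config cohort.yaml --output-folder results --overwrite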

Input#

H5 Files#

Any h5 file can be used as input. Most commonly, the h5 files generated by the Tapestri DNA or DNA+Protein pipeline are used for the cohort pipeline.

Note

All h5 files used in the analysis must be of the same type: DNA-only h5 files cannot be combined with DNA+Protein h5 files, and files from different DNA or protein panels cannot be combined in one analysis.
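One way to confirm that all inputs are of the same type before starting a run is to load each file with mosaic and check whether a protein assay is present. This is a minimal sketch, assuming the mosaic loader ms.load and a Sample.protein attribute that is None for DNA-only files; the paths are placeholders.

import missionbio.mosaic as ms

# Placeholder paths - replace with the h5 files planned for the cohort.
paths = ["/path/to/sample1.h5", "/path/to/sample2.h5"]
for path in paths:
    sample = ms.load(path)
    kind = "DNA+Protein" if sample.protein is not None else "DNA only"
    print(f"{path}: {kind}")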

Demultiplex CSV File (Optional)#

Only required when processing multiplexed samples.

The demultiplexing CSV file contains variant information used to identify each patient sample in a multiplexed run. Only one file is needed for the entire cohort, and it must contain information for all patients across all runs. The genotype values must be 0, 1, or 2, corresponding to homozygous reference (wildtype), heterozygous, and homozygous alternate, respectively. An example of the demultiplexing CSV file is shown below:

variant_id,sample_id,type,genotype
chr1:36933096:T/C,Patient1,germline,0
chr1:36933096:T/C,Patient2,germline,1
chr1:36933096:T/C,Patient3,germline,0
chr1:43805240:A/G,Patient1,germline,2
chr1:43805240:A/G,Patient2,germline,0

Note

Only variants that overlap with the panel will be used for demultiplexing. At least 10 variants per patient are required by the pipeline.
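Before launching a run, the demultiplexing file can be sanity-checked against these constraints (required columns, genotypes restricted to 0/1/2, and at least 10 variants per patient). A minimal sketch using pandas, with a placeholder file path:

import pandas as pd

df = pd.read_csv("/path/to/demultiplexing.csv")  # placeholder path

# Required columns and the allowed genotype values (0, 1, 2).
assert {"variant_id", "sample_id", "type", "genotype"} <= set(df.columns)
assert df["genotype"].isin([0, 1, 2]).all()

# The pipeline requires at least 10 variants per patient
# (only variants overlapping the panel are ultimately used).
print(df.groupby("sample_id")["variant_id"].nunique())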

Spike-in CSV File (Optional)#

Only required for genome-wide CNV samples.

This file is similar to the demultiplexing CSV file but is used to identify the genotypes for the diploid spike-in cell line. It must contain only one sample_id. An example of the spike-in CSV file is shown below:

variant_id,sample_id,type,genotype
chr1:36933096:T/C,SpikeIn,germline,1
chr1:43805240:A/G,SpikeIn,germline,2
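The same kind of check applies here, with the additional constraint that the file must contain exactly one sample_id. A short sketch with a placeholder path:

import pandas as pd

spike = pd.read_csv("/path/to/spike_in_variants.csv")  # placeholder path
# The spike-in file must contain exactly one sample_id.
assert spike["sample_id"].nunique() == 1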

Somatic Variants CSV File (Optional)#

Needed only if the variant selection has to be modified using a whitelist or blacklist.

This file lists variants that are whitelisted or blacklisted for phylogeny creation. Only a single file containing variants for all samples in the cohort is needed. The genotype column is optional for this file and has no effect on the output. An example of the somatic variants CSV file is shown below:

variant_id,sample_id,type
chr5:170837543:C/CTCTG,Patient1,blacklist
chr4:55599436:T/C,Patient2,whitelist

Note

To ensure quick processing, the pipeline only analyses variants that pass the default Tapestri pipeline filters, i.e. variants with mutation rates over 1% in the pre-demultiplexed h5 file. For timecourse samples, however, if any one timepoint has such a variant, it is also used for the other timepoints, irrespective of their mutation rates. Any variant with a lower mutation rate must be supplied through the whitelist.

YAML Configuration File#

The YAML configuration file specifies pipeline parameters and paths to input files. An example of a complete configuration file is shown below. Some of the parameters are optional and dependent on the type of analysis being performed.

spike_in:
    variants: /path/to/spike_in_variants.csv  # Optional - required for genome-wide CNV
demultiplex:
    variants: /path/to/demultiplexing.csv  # Optional - required for multiplexed samples
tertiary:
    variants: /path/to/somatic_variants.csv  # Optional - if whitelist/blacklist is needed
    report: hem-onc
    input:
        - h5: /path/to/sample1.dna+protein.h5
          name: Sample1
          patient: ["Patient1", "Patient2", "Patient3"]
          timepoint: [1, 1, 1]
        - h5: /path/to/sample2.dna+protein.h5
          name: Sample2
          patient: ["Patient1", "Patient2"]
          timepoint: [2, 2]

Some things to note about the configuration file:

  • name can be used to shorten the h5 file name. Names are shown in the report, and names that are too long can hamper the readability of some figures. The same applies to patient.

  • timepoint is optional and needed only if there are multiple timepoints for a patient. It must be a list of integers, not strings. Samples are shown in the reports in ascending order of timepoint. The actual values are not relevant; only their relative order matters.

  • patient is only required for multiplexed samples. It is the list of patients multiplexed together in each h5 file. The values in patient should match the sample_id values in the demultiplexing and somatic CSV files.

  • For single-sample runs, the name should match the sample_id in the somatic CSV file.
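The structural rules in this list can be checked before a run with a short script. This is a sketch, assuming PyYAML is installed and that the configuration follows the layout shown above; the file path is a placeholder, and the length check simply mirrors the example (one timepoint per patient).

import yaml

with open("/path/to/config.yaml") as fh:  # placeholder path
    config = yaml.safe_load(fh)

for entry in config["tertiary"]["input"]:
    patients = entry.get("patient", [])
    timepoints = entry.get("timepoint", [])
    # timepoint values must be integers, one per patient in the same h5 file.
    assert all(isinstance(t, int) for t in timepoints)
    if patients and timepoints:
        assert len(patients) == len(timepoints), entry.get("name", entry["h5"])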

Output#

The pipeline generates the following outputs:

  1. Processed h5 Files: Located in the h5 subdirectory. These contain all the information needed to generate the reports; no other files are required. They can be loaded in a Jupyter notebook to create custom figures that are not available in the standard output (a loading sketch is shown at the end of this section).

  2. Logs: Detailed logs of the pipeline execution in the tertiary.log file.

  3. Reports: HTML reports for individual samples and the cohort in the reports subdirectory. The cohort figures are interactive HTML files. All figures can be downloaded as PNG files by opening them in a web browser tab and using the download button at the top right of the figure. The figures are:

    1. Cohort map: It shows the summary of clonal architecture, mutations, and correlation with protein clusters for all samples. The protein cluster names can be renamed using the button at the bottom of the figure.

    2. Protein UMAP: This figure is used to check the validity of the protein clustering. It can be colored by cluster labels, sample names, point density, and sample density.

    3. Protein expression UMAP: It shows the expression of every antibody in the protein panel on the UMAP. It can be used to identify the cell type for each cluster.

    4. Variant clonality: A figure and table that help identify any false-positive variants that might have been called by the pipeline. The whitelist/blacklist option can be used when rerunning the pipeline to correct the variant calls.

Multiple other folders are also created; these contain intermediate files for each step of the pipeline. They are kept for debugging purposes and might be removed in a future version.
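As mentioned under item 1 above, the processed h5 files can be explored interactively in a Jupyter notebook. A minimal sketch, assuming the mosaic loader ms.load; the output path and file name are placeholders, and the available assays depend on the panel type (protein is absent for DNA-only runs).

import missionbio.mosaic as ms

# Load a processed per-patient h5 file from the h5 subdirectory (placeholder path).
sample = ms.load("/path/to/output/h5/Patient1.h5")

print(sample.dna.shape)          # (cells, variants)
if sample.protein is not None:
    print(sample.protein.shape)  # (cells, antibodies) for DNA+Protein runs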

Error Handling#

If the pipeline encounters an error, it will log the issue and terminate. Common issues include:

  • Missing or invalid input files.

  • Incorrect configuration parameters.

Refer to the error message in the last line of the log file to identify the issue.
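To surface that final line quickly, the end of the log can be printed directly. A small sketch, assuming the run used an output folder named results and that tertiary.log sits at its top level:

from pathlib import Path

# Print the last line of the pipeline log (the error message if the run failed).
print(Path("results/tertiary.log").read_text().splitlines()[-1])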