Cohort Analysis Pipeline#
Overview#
The Tertiary Cohort Analysis Pipeline is designed to process and analyze cohorts of .h5 files generated from Mission Bio’s Tapestri platform. It supports variant detection, whole genome CNV analysis, phylogeny generation, and report creation for individual samples and the entire cohort.
This guide explains how to configure and run the pipeline.
Note
Download the example dataset to test the pipeline locally.
Prerequisites#
Python Environment: Ensure you have mosaic 3.12.0 or above installed and activated in your terminal.
Input Files: Organize the relevant .h5 files in a folder.
Configuration File: A YAML configuration file specifying pipeline parameters (explained below).
Pipeline Steps#
The pipeline consists of the following steps:
Demultiplexing: Splits multiplexed samples into per-patient .h5 files.
Genome-wide CNV Analysis (Optional): Detects copy number variations and CNV subclones.
Variant Detection: Identifies somatic variants for each patient.
Phylogeny: Identifies clones using the detected variants and estimates the phylogeny.
Postprocessing: Normalizes protein data and assigns final clone labels.
Report Generation: Creates HTML reports for individual patients and the cohort.
Command-Line Usage#
The pipeline can be executed using the tapestri tertiary cohort command from the CLI.
tapestri tertiary cohort process --config <config_file> --output-folder <output_folder> [OPTIONS]
--config: Path to the YAML configuration file.
--output-folder: Directory where output files will be saved.
--overwrite: Optional flag to overwrite the output folder if it already exists.
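When the pipeline is launched from a script rather than typed by hand, the call above can be assembled programmatically. The following is a minimal sketch, not part of the Tapestri tooling: the helper names `build_cohort_command` and `run_cohort_pipeline` are hypothetical, and only the subcommand and the three flags documented above are used.

```python
import shlex
import subprocess


def build_cohort_command(config, output_folder, overwrite=False):
    """Assemble the documented CLI call as an argument list (hypothetical helper)."""
    cmd = [
        "tapestri", "tertiary", "cohort", "process",
        "--config", str(config),
        "--output-folder", str(output_folder),
    ]
    if overwrite:
        cmd.append("--overwrite")
    return cmd


def run_cohort_pipeline(config, output_folder, overwrite=False):
    """Run the command; check=True raises CalledProcessError on a non-zero exit."""
    cmd = build_cohort_command(config, output_folder, overwrite)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.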
Input#
H5 Files#
Any h5 file can be used as input. Most commonly, the h5 files generated by the Tapestri DNA or DNA+Protein pipeline are used for the cohort pipeline.
Note
All h5 files used in the analysis must be of the same type. DNA-only h5 files cannot be combined with DNA+Protein h5 files. Files from different DNA or protein panels also cannot be combined in one analysis.
Demultiplex CSV File (Optional)#
Only required when processing multiplexed samples.
The demultiplexing csv file contains variant information to identify each patient sample in a multiplexed run. Only one file is needed for the entire cohort. It must contain information for all patients across all the runs. The genotype values must be 0, 1, or 2 which correspond to homozygous reference (wildtype), heterozygous, and homozygous alternate respectively. An example of the demultiplexing CSV file is shown below:
variant_id,sample_id,type,genotype
chr1:36933096:T/C,Patient1,germline,0
chr1:36933096:T/C,Patient2,germline,1
chr1:36933096:T/C,Patient3,germline,0
chr1:43805240:A/G,Patient1,germline,2
chr1:43805240:A/G,Patient2,germline,0
Note
Only variants that overlap with the panel will be used for demultiplexing. At least 10 variants per patient are required by the pipeline.
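The format rules above can be checked before starting a run. The sketch below is a hypothetical pre-flight check, not part of the pipeline: it verifies the four columns, the 0/1/2 genotype values, and the minimum of 10 variants per patient. It cannot tell which variants overlap the panel, so a file that passes here may still have too few usable variants.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = {"variant_id", "sample_id", "type", "genotype"}
VALID_GENOTYPES = {"0", "1", "2"}
MIN_VARIANTS_PER_PATIENT = 10  # minimum stated in the note above


def check_demultiplex_csv(handle):
    """Return a list of problems found in a demultiplexing CSV (empty if none)."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    variants = defaultdict(set)  # sample_id -> distinct variant_ids
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if (row["genotype"] or "").strip() not in VALID_GENOTYPES:
            problems.append(f"line {line_no}: genotype must be 0, 1, or 2")
        variants[row["sample_id"]].add(row["variant_id"])

    for patient, ids in sorted(variants.items()):
        if len(ids) < MIN_VARIANTS_PER_PATIENT:
            problems.append(
                f"{patient}: only {len(ids)} variants, "
                f"at least {MIN_VARIANTS_PER_PATIENT} required"
            )
    return problems


# Two rows taken from the example above; Patient1 falls short of 10 variants.
example = io.StringIO(
    "variant_id,sample_id,type,genotype\n"
    "chr1:36933096:T/C,Patient1,germline,0\n"
    "chr1:43805240:A/G,Patient1,germline,2\n"
)
for problem in check_demultiplex_csv(example):
    print(problem)
```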
Spike-in CSV File (Optional)#
Only required for genome-wide CNV samples.
This file is similar to the demultiplexing CSV file but is used to identify the genotypes for the diploid spike-in cell line. It must contain only one sample_id. An example of the spike-in CSV file is shown below:
variant_id,sample_id,type,genotype
chr1:36933096:T/C,SpikeIn,germline,1
chr1:43805240:A/G,SpikeIn,germline,2
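The single-sample_id requirement is easy to verify up front. A minimal sketch, assuming the file has the header shown above; the function name `check_spike_in_csv` is hypothetical.

```python
import csv
import io


def check_spike_in_csv(handle):
    """Return the single sample_id of a spike-in CSV, or raise ValueError."""
    sample_ids = {row["sample_id"] for row in csv.DictReader(handle)}
    if len(sample_ids) != 1:
        raise ValueError(
            f"expected exactly one sample_id, found {sorted(sample_ids)}"
        )
    return next(iter(sample_ids))


# The example rows from above contain one sample_id, so the check passes.
example = io.StringIO(
    "variant_id,sample_id,type,genotype\n"
    "chr1:36933096:T/C,SpikeIn,germline,1\n"
    "chr1:43805240:A/G,SpikeIn,germline,2\n"
)
print(check_spike_in_csv(example))
```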
Somatic Variants CSV File (Optional)#
Needed if the variant selection has to be modified using a whitelist/blacklist.
This file lists variants that are whitelisted or blacklisted for the phylogeny creation. Only a single file containing variants for all samples in the cohort is needed. The genotype column is optional for this file and has no effect on the output. An example of the somatic variants CSV file is shown below:
variant_id,sample_id,type
chr5:170837543:C/CTCTG,Patient1,blacklist
chr4:55599436:T/C,Patient2,whitelist
Note
To ensure quick processing, the pipeline only analyses variants that pass the default Tapestri pipeline filters, that is, variants with a mutation rate over 1% in the pre-demultiplexed h5 file. For timecourse samples, a variant that passes this filter at any one timepoint is also used for the other timepoints, irrespective of its mutation rate there. Any variant with a lower mutation rate must be supplied through the whitelist.
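When the whitelist/blacklist is produced by a script, for example after reviewing a previous run, the file can be written with the standard library. This is a sketch under the column layout shown above; the helper `write_somatic_variants` is hypothetical, and it omits the optional genotype column.

```python
import csv
import io

VALID_TYPES = {"whitelist", "blacklist"}


def write_somatic_variants(rows, handle):
    """Write (variant_id, sample_id, type) tuples as a somatic variants CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["variant_id", "sample_id", "type"])
    for variant_id, sample_id, list_type in rows:
        if list_type not in VALID_TYPES:
            raise ValueError(f"type must be whitelist or blacklist, got {list_type!r}")
        writer.writerow([variant_id, sample_id, list_type])


# Reproduces the example file shown above.
buffer = io.StringIO()
write_somatic_variants(
    [
        ("chr5:170837543:C/CTCTG", "Patient1", "blacklist"),
        ("chr4:55599436:T/C", "Patient2", "whitelist"),
    ],
    buffer,
)
print(buffer.getvalue())
```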
YAML Configuration File#
The YAML configuration file specifies pipeline parameters and paths to input files. An example of a complete configuration file is shown below. Some of the parameters are optional and dependent on the type of analysis being performed.
spike_in:
  variants: /path/to/spike_in_variants.csv # Optional - required for genome-wide CNV
demultiplex:
  variants: /path/to/demultiplexing.csv # Optional - required for multiplexed samples
tertiary:
  variants: /path/to/somatic_variants.csv # Optional - if whitelist/blacklist is needed
  report: hem-onc
input:
  - h5: /path/to/sample1.dna+protein.h5
    name: Sample1
    patient: ["Patient1", "Patient2", "Patient3"]
    timepoint: [1, 1, 1]
  - h5: /path/to/sample2.dna+protein.h5
    name: Sample2
    patient: ["Patient1", "Patient2"]
    timepoint: [2, 2]
Some things to note about the configuration file:
name can be used to shorten the h5 file name. Names are shown in the report, and names that are too long can hamper the readability of some figures. The same is true for patient.
timepoint is optional. It is needed only if there are multiple timepoints for a patient. It must be a list of integers, not strings. The samples are shown in the reports in ascending order of timepoint. The value of a timepoint is not relevant; only the relative order of the timepoints matters.
patient is only required for multiplexed samples. It is the list of patients multiplexed together in each h5 file. The values in patient should match the sample_id in the demultiplexing and somatic CSV files.
For single-sample runs, the name should match the sample_id in the somatic CSV file.
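Some of the rules in these notes can be checked on a configuration that has already been loaded into a Python dict (for instance with a YAML parser). The sketch below is hypothetical and not the pipeline's own validation. The check that patient and timepoint have the same length is an assumption inferred from the example configuration, where each patient has one timepoint entry.

```python
def check_input_entries(config):
    """Return a list of problems in the `input` section of a loaded config dict."""
    problems = []
    for index, entry in enumerate(config.get("input", [])):
        label = entry.get("name", f"input[{index}]")
        if "h5" not in entry:
            problems.append(f"{label}: missing h5 path")

        timepoint = entry.get("timepoint")
        if timepoint is None:
            continue  # timepoint is optional

        # bool is a subclass of int in Python, so it is excluded explicitly.
        is_int_list = isinstance(timepoint, list) and all(
            isinstance(t, int) and not isinstance(t, bool) for t in timepoint
        )
        if not is_int_list:
            problems.append(f"{label}: timepoint must be a list of integers")
            continue

        # Assumption: one timepoint per patient, as in the example config.
        patient = entry.get("patient")
        if isinstance(patient, list) and len(patient) != len(timepoint):
            problems.append(f"{label}: patient and timepoint lengths differ")
    return problems


example = {
    "input": [
        {"h5": "/path/to/sample1.dna+protein.h5", "name": "Sample1",
         "patient": ["Patient1", "Patient2", "Patient3"], "timepoint": [1, 1, 1]},
        {"h5": "/path/to/sample2.dna+protein.h5", "name": "Sample2",
         "patient": ["Patient1", "Patient2"], "timepoint": ["2", "2"]},
    ]
}
for problem in check_input_entries(example):
    print(problem)
```

In this example the second entry is reported because its timepoints are strings.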
Output#
The pipeline generates the following outputs:
Processed h5 Files: Located in the h5 subdirectory. These contain all the relevant information to generate the reports. No other files are needed. These can be loaded in a Jupyter notebook to create custom figures that are not available in the standard output.
Logs: Detailed logs of the pipeline execution in the tertiary.log file.
Reports: HTML reports for individual samples and the cohort in the reports subdirectory. The cohort figures are interactive HTML files. All figures can be downloaded as png files by opening them in a web browser tab and using the download button on the top right of the figure. The figures are:
Cohort map: It shows the summary of clonal architecture, mutations, and correlation with protein clusters for all samples. The protein cluster names can be renamed using the button at the bottom of the figure.
Protein UMAP: This figure is used to check the validity of the protein clustering. It can be colored by cluster labels, sample names, point density, and sample density.
Protein expression UMAP: It shows the expression of every antibody in the protein panel on the UMAP. It can be used to identify the celltype for each cluster.
Variant clonality: A figure and table that help identify any false positive variants that might have been called by the pipeline. The whitelist/blacklist option can be used when rerunning the pipeline to correct the variant calls.
Multiple other folders are also created. They contain intermediate files for each step of the pipeline, are kept for debugging purposes, and might be removed in a future version.
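After a run, the output folder can be inspected with a few lines of Python. This is a sketch under assumptions not confirmed by the text above: that tertiary.log sits directly in the output folder, that the processed files end in .h5, and that the reports end in .html. The function `summarize_outputs` is hypothetical.

```python
from pathlib import Path


def summarize_outputs(output_folder):
    """List processed h5 files and HTML reports and check for the log file."""
    root = Path(output_folder)
    return {
        "h5": sorted(p.name for p in (root / "h5").glob("*.h5")),
        "reports": sorted(p.name for p in (root / "reports").glob("*.html")),
        # Assumed location of the log; adjust if it is written elsewhere.
        "log_exists": (root / "tertiary.log").is_file(),
    }
```

An empty h5 or reports list after a run is a quick hint that the pipeline stopped before reaching that step.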
Error Handling#
If the pipeline encounters an error, it will log the issue and terminate. Common issues include:
Missing or invalid input files.
Incorrect configuration parameters.
Refer to the error message in the last line of the log file to identify the issue.
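Reading the last line of the log can be scripted, which is convenient when many runs are checked at once. A minimal sketch; the helper name `last_log_line` is hypothetical, and the path to tertiary.log has to be supplied by the caller.

```python
from pathlib import Path


def last_log_line(log_path):
    """Return the last non-empty line of a log file, or None if there is none."""
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    for line in reversed(text.splitlines()):
        if line.strip():
            return line.strip()
    return None
```

Trailing blank lines are skipped so that the returned line is the final message actually written to the log.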