SPARC#

missionbio.demultiplex.phylogeny.sparc.SPARC

class SPARC(ado_score: float = 0.9, adjusted_mixing: float = 0.2, branch_size: int = 10, min_snv_clone_size: float = 0.001, min_snv_clone_count: int = 6, min_loh_clone_size: float = 0.001, min_loh_clone_count: int = 25, min_loh_group_size: int = 2, min_cnv_depth: int = 8, min_cnv_completeness: float = 0.6, min_copy_gain_cells: int = 150, min_excess_correlated_freq: float = 1)#

SPARC: Single-cell Phylogeny And Reconstruction of Clones

A tool to infer the evolutionary history and clonal structure of a population, integrating information from SNVs, Loss of Heterozygosity (LoH), and Copy Number Variations (CNVs).

Snippets#

>>> import missionbio.mosaic as ms
>>> from missionbio.demultiplex.phylogeny.sparc import SPARC
>>> sample = ms.load_example_dataset("2 PBMC mix")
>>> sparc = SPARC()
>>> result = sparc.predict(sample.dna, variants=["chr4:55599436:T/C", "chr5:170837457:A/G"])

Note

The somatic variants must be present in the Dna assay and formatted as chrom:pos:ref/alt

result contains the inferred phylogenetic tree and clone assignments.:

>>> ordered_result = result.serialize_clones()
>>> sample.dna.set_labels(ordered_result.labels.values)
>>> fig = ordered_result.draw()  # Draws the phylogenetic tree
>>> fig.show()

Debugging the inference process can be done by inspecting the other attributes of the SPARCResult object.

Variables#

`dropped_variants_`	Variants dropped due to low quality
`variant_params_`	Estimated ADO and error rates params

Functions#

`predict`(dna[, cnv, variants, cnamps, ...])	Predict the phylogenetic tree structure from a DNA assay and a list of variants.
`model_cnv`(cnv[, mod_cn])	Fits the model to the CNV assay (or assays for multiple controls)
`search_path`(profile)	The path from the root to the given profile
`search_path_drops`(profile)	The clones dropped at each step while trimming the initial tree to the final tree

param ado_score:: Minimum ADO score to consider a clone valid.
param adjusted_mixing:: The maximum adjusted mixing rate to consider when looking for doublets.
param branch_size:: Controls the number of clones dropped in a given cycle. The run time is proportional to (branch_size ^ 3)
param min_snv_clone_size:: Minimum clone size as a fraction of the total for SNV clones to be considered. For rare variants, this limit might be ignored. It does not guarantee that all SNV clones will be larger than this value.
param min_snv_clone_count:: Minimum number of cells in a clone for the SNV clone to be considered. For rare variants, this limit might be ignored. It does not guarantee that all SNV clones will be larger than this value.
param min_loh_clone_size:: Minimum clone size as a fraction of the total for LoHFinder. It does not guarantee that all LoH clones will be larger than this value.
param min_loh_clone_count:: Minimum number of cells in a clone for LoHFinder. It does not guarantee that all LoH clones will be larger than this value.
param min_loh_group_size:: Minimum number of variants in a LoH variant group for it to be considered. Passed directly to LoHFinder.
param min_cnv_depth:: Minimum reads in each cell-amplicon for the completeness check
param min_cnv_completeness:: Minimum fraction of amplicons in a given cell with sufficient depth to be considered for amplicon modeling and copy gain prediction.
param min_copy_gain_cells:: The minimum number complete cells in a clone to consider it for copy gain clone identification
param min_excess_correlated_freq:: Minimum percentage of cells whose correlation is not due to random chance as per the chi2 expected frequencies. It is used to filter somatic variants that are not correlated with the CNV labels.

Overview#

SPARC (Single-cell Phylogeny And Reconstruction of Clones) is a computational method for inferring phylogenetic relationships from single-cell DNA sequencing data. The algorithm integrates somatic mutations, genome-wide loss of heterozygosity (LoH), and copy number variations (CNVs) in a single phylogeny.

SPARC identifies clusters of cells (clones) given a set of mutations and then creates a phylogeny with those clones. Mutations are categorized into three types:

Somatic variants: Single nucleotide variants and small insertions and deletions.
LoH variants: Groups of variants on neighbouring amplicons that have a correlated change from heterozygous to homozygous state.
CNVs: Group of consecutive amplicons showing distinct copy number changes supported by allele frequency (AF) changes.

Each mutation type is processed separately to create individual phylogenies, which are then combined to construct a single phylogeny which contains all the mutation types.

Phylogeny estimation for a single mutation type#

The phylogeny estimation for each mutation type follows a three-step process:

Variant selection and clone calling: Identify informative variants for phylogeny construction and cluster cells into an initial set of clones.
Tree construction: Build an initial phylogenetic tree for the identified clones using a distance-based method.
Iterative tree refinement: Refine the tree to ensure consistency with biological assumptions

Initial clone estimation#

Somatic variants are currently expected as an input to SPARC. LoH and CNV events are identified by SPARC on the amplicons marked as genome-wide CNV ampicons. Each mutation type is processed separately to obtain an initial set of clones. It assumed that this initial set is an extensive list of all possible clones in the sample. They are refined later using phylogenetic constraints.

Clones called by SPARC — Initial set of clones called for each mutation type#

Somatic variant clone estimation#

For each somatic variant there are at most three possible signatures - wildtype, heterozygous, and homozygous. The ADO score is applied on each variant separately to find the possible signatures for it. The variants are then sorted based on the product of their mutation rate and genotyping rate. Cells are clustered iteratively into clones based on the possible signatures of each variant. Clones that are too small or dropped in this step.

During this iterative clustering step, some variants might have too much missing data causing clones found until that point to be dropped resulting in all mutated cells from a better genotyped variant to be marked as ambiguous. Since the presence of such variants leads to loss of information from other variants, these variants are currently dropped by SPARC when creating the phylogeny.

LoH event identification and clone calling#

LoH events are those where a cell loses one of the two alleles at a heterozygous locus. These events and their clones are identified using the following process:

Candidate Selection: Variants are first selected as candidates if they have a sufficient number of cells in each of the three genotype states (WT, HET, HOM).
Correlation Analysis: To identify variants that are part of the same large-scale LoH event (e.g., affecting a whole chromosome arm), the correlation between candidate variants using two complementary methods is used:
- Chi-squared (χ²) Test: A contingency table is created for pairs of variants based on their NGT calls. A χ² test of independence is performed and variants with low p-values are considered correlated. This correlation matrix is partitioned to find groups of correlated variants. This test is sensitive and ends up finding variants correlated in very small populations of cells.
- Pearson Correlation: The NGT calls are used to compute the Pearson correlation coefficient between pairs of variants. The correlation matrix is partitioned to find groups of correlated variants. This test ends up finding larger groups of variants, but may miss events in small populations of cells.
Grouping: Groups of correlated variants found by the two tests are then combined to create one set of LoH variant groups. Each group is treated as a single likely-LoH event.
Phasing and Validation: For each group of likely-LoH variants, clusters cells to identify a HET clone and one or two HOM clones. A group is considered a valid LoH event if the number of cells in the topmost (or top two) phased HOM clone are significantly more than all other phased HOM clones found. Two HOM clones may be found if both alleles are lost in different populations of cells.

Similar to the somatic variant clustering, the variants groups are sorted based on the number of variants in the group and the cells are clustered iteratively to find an extensive set of clones.

CNV clustering#

Every amplicon is modeled using a read depth dependent negative binomial distribution, which captures the expected reads and deviation in read counts for a cell with a given read depth. Then, CNV events are identified by clustering cells based on their AF corrected ploidy smoothed across neigbouring genome-wide amplicons. These clusters are iteratively merged based on similarity until no more merges are possible. CNV events are defined as groups of contiguous amplicons with the same copy number supported by AF changes in a clone with respect to other clones.

Modeling errors in the variants#

For somatciv variants and LoH variants the AF distribution of each variant is fit against an expected distribution that models various error rates (Sequencing / PCR based), Allelic dropout (ADO) rates, and population frequencies to estimate the ADO and error rate for each variant separately.

Tree construction#

Once the initial set of clones are found for each mutation type, an initial phylogenetic tree is constructed using a distance-based method. This results in a tree that adheres to the maximum parsimony principle, however it may contain biologically implausible pathways.

Initial tree created by SPARC — The tree constructed using all the initial clones for each mutation type#

A distance matrix between all clones (and a hypothetical wildtype root) is computed. The distance between two clones is the sum of their differing genotypes. For LoH variants, the heterozygous state (NGT=1) is treated as the wildtype state, so the hypothetical wildtype root clone has NGT=1 for LoH variants, NGT=0 for somatic variants, and CN=2 for CNV amplicons. The distance d(i,j) between two clones i and j is defined as:

d(i,j) = Σₖ |NGTᵢₖ - NGTⱼₖ| + Σₗ |CNᵢₗ - CNⱼₗ|

Where NGT is in (0, 1, 2) and copy numbers CN in (0, 1, 2, 3) across variants k and amplicons l.

The initial tree contains just the hypothetical wildtype root clone. The tree is then built iteratively as follows:

Set the clone with the least distance to any clone in the current tree as a child of that clone.
If more than two pairs of clones have the same distance, weigh the assignment based on the clone sizes - larger clones are given a higher weight i.e. they are more likely to be closer to the root of the tree, and therefore parents of smaller clones. This also ensures that larger clones accumulate in one branch of the tree. Due to this probabilistic assignment, there may be multiple possible trees for a given set of clones (Note: inifite sites assumption not yet applied).
Repeat until all clones are added to the tree.

Combining phylogenies#

Each mutation type (Somatic variants, LoH, CNV) results in a separate phylogeny. These phylogenies are combined by simply crosstabulating the clones from each tree to create a combined set of clones. During the combination of clones, special considerations are given to the wildtype clone so as to not lose cells that may be wildtype due to the noisy nature of the combination process.

Combined tree created by SPARC — Tree obtained by combining the final clones from each mutation type#

The combined set of clones undergo the same iterative process of tree refinement to ensure that the final tree adheres to the phylogenetic constraints and maximizes the likelihood of cell assignments.

Final tree created by SPARC — The final tree created by SPARC after combining all mutation types#

Comparison with COMPASS#

The following table summarizes the key differences between SPARC and COMPASS

Aspect	SPARC	COMPASS
Clone calling modes	SNV, gwLoH, and gwCNV	SNV and focal CNV
Performance	Runtime depends on complexity of the phylogeny	Slower; runtime scales with cells and variants
Method	Clustering followed by distance-based tree construction and refinement	MCMC sampling of trees to maximize the cell assignment likelihood
Focal CNV clones	Not yet supported	Supported, but off by default in mosaic
Per variant error rates	Each variant is modeled separately	Has one error rate for SNV and one for INDELs
Per variant ADO rates	Modeled	Modeled
Doublet assignment	Supported	Supported

Genome-wide CNV support#

COMPASS requires grouping amplicons into “regions” with common CNVs. This assumption might be acceptable when grouping amplicons by genes, for which COMPASS was designed, but it does not work for genome-wide CNV because we do not know a priori the regions which have the same copy number. SPARC has a separate module for genome-wide CNV clustering that does not require grouping amplicons into regions.

SPARC is 30x faster#

The runtime of SPARC is dependent on the complexity of the phylogeny instead of the number of cells or variants. It searches through a larger set of possible phylogenies if the dataset is complex, unlike COMPASS which always runs a fixed number of iterations. This makes SPARC suitable for increased complexity in SNV+CNV analyses.

Additionally, the runtime of COMPASS increases linearly with the number of cells and the number of variants. Genome-wide CNV clustering usually involves hundreds of variants, which will cause COMPASS to run for hours for some samples, if its methods were repurposed for gwCNV.

Runtime comparison between SPARC and COMPASS#

Comparison of phylogenies#

The phylogenies obtained by SPARC and COMPASS are generally similar. However both methods have slightly different models and there could be subtle differences in the phylogeny. These are mostly explained by differences in the weights given to ADO rate, error rate, and dispersion of VAF. There are mainly two types of differences observed.

Empty clones: COMPASS sometimes has clones with no cells assigned to them. This is because COMPASS looks for trees that maximize the likelihood of cells belonging to clones in the tree. This method does not ensure that all clones in the tree are supported by at least one cell. SPARC first clusters the cells, and then creates the tree. Therefore it’s guranteed that every clone in the tree has at least one cell assigned to it.

Wildtype clones called by COMPASS have no cells assigned to them#
Order and merging of clones: Sometimes one method merges two clones that are separate in the other. This is due to differences in the weights given to ADO and error rates. If the model predicts that the differences between the clones are likely due to ADO, then it would merge two clones. If the model predicts that the some HET calls are likely to due to errors then those calls would be considered WT and that could change the order of the clones.

Merging of clones in one of the methods#

SPARC

Contents

SPARC#

Snippets#

Variables#

Functions#

Overview#

Phylogeny estimation for a single mutation type#

Initial clone estimation#

Somatic variant clone estimation#

LoH event identification and clone calling#

CNV clustering#

Modeling errors in the variants#

Tree construction#

Iterative tree refinement#

Combining phylogenies#

Comparison with COMPASS#

Genome-wide CNV support#

SPARC is 30x faster#

Comparison of phylogenies#