Protein.assign_from_truth

missionbio.mosaic.protein.Protein.assign_from_truth

Protein.assign_from_truth(truth: Optional[Union[str, Path, Truth]] = None, merge: str = 'mixed', min_prob_diff: float = 0.9, max_adjusted_mixing: float = 0.3, min_distance_for_doublet: int = 5, sticky_antibodies: Tuple = ('IgG1', 'IgG2a', 'IgG2b'), method: Optional[str] = None) -> Union[ProbabilityMethod, LikelihoodMethod]

Label each cell with the most likely cell type from the truth.

The accuracy of the assignment depends on the accuracy of the NSP normalization. If the scaling factor is not estimated correctly, the assignment will have a large number of unassigned cells. In such cases, try to estimate the scaling factor using the read_depth_dependence() method of the isotype control.

Parameters:
truth:

A Truth object or a path to a YAML file containing the truth. If None, the builtin() truth for PBMCs is used. It can be visualised using the plot() function on the truth object.

merge:

How to merge cell types with multiple clusters:

“none”: Keep all clusters. The doublets are identified by the “:” in the cluster name.

“mixed”: Merge all doublet clusters into a single “Mixed” cluster, e.g. “T cell:B cell” and “T cell:Monocytes” are merged into “Mixed”.

“labeled”: Merge multiple clusters of the same cell type into a single cluster, e.g. “T cell-1” and “T cell-2” are merged into “T cell”.

“all”: Merge all “Unassigned” clusters into a single cluster, e.g. “Unassigned-1” and “Unassigned-2” are merged into “Unassigned”.
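The effect of the merge options on cluster names can be sketched in plain Python. This is only an illustration and not the library's implementation: the helper name merge_label is hypothetical, and it assumes that the options are cumulative (each option also applies the merges of the options listed before it) and that cluster copies are marked with a numeric “-N” suffix, as in the examples above.

```python
import re

# Hypothetical ordering: assumes each option includes the merges of the previous ones.
MODES = ("none", "mixed", "labeled", "all")


def merge_label(label: str, merge: str = "mixed") -> str:
    """Return the merged name of one cluster label (illustrative only)."""
    if merge not in MODES:
        raise ValueError(f"merge must be one of {MODES}")
    level = MODES.index(merge)

    # "mixed" and above: any doublet name (contains ":") collapses to "Mixed".
    if level >= 1 and ":" in label:
        return "Mixed"

    # Strip a trailing "-<number>" to find the base name of the cluster.
    base = re.sub(r"-\d+$", "", label)

    # "all": numbered Unassigned clusters collapse to "Unassigned".
    if level >= 3 and base == "Unassigned":
        return "Unassigned"

    # "labeled" and above: numbered clusters of a cell type collapse to that type.
    if level >= 2 and base != "Unassigned":
        return base

    return label


labels = ["T cell-1", "T cell-2", "T cell:B cell", "Unassigned-1"]
merged = {mode: [merge_label(name, mode) for name in labels] for mode in MODES}
```

Under these assumptions, merged["none"] leaves every name unchanged, while merged["all"] reduces the four names to “T cell”, “T cell”, “Mixed”, and “Unassigned”.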

min_prob_diff:

Minimum difference in probability between the most likely and second most likely cell type for a cell to be assigned to a single cell type. This is also used for the assignment of doublet clusters. Here the probability refers to the probability of the cluster being a particular cell type given the normalized read counts. See the example.

max_adjusted_mixing:

Maximum adjusted mixing rate for a cluster with a mixed signature to be labelled as “Mixed”. If the adjusted mixing rate is higher, the doublet is assigned to the “Mixed Like” cluster.

min_distance_for_doublet:

The minimum number of antibodies that must be different between two clones for their doublet to be considered. Increase this if too many mixed cells are observed.

sticky_antibodies:

Antibodies that should be used to identify sticky cells

method: {None, “probability”, “likelihood”}

Method to use for the assignment. If None, the method is chosen based on the truth: if the truth has missing antibodies, the method is set to “probability”, otherwise it is set to “likelihood”. “probability” works best when only a few antibodies and cell types are present in the sample. This method can call “Unassigned” clones for anything that does not match the truth. “likelihood” works only when the same antibodies are present in the truth for all cell types. This method classifies each cell into one of the cell types in the truth. It cannot call “Unassigned” clones.

Returns:
PACE object:

It stores the labels for the cells with the most likely cell type from the truth.

Notes

This modifies the LABEL attribute of the assay.

Examples

Let there be three cell types in the truth structured as follows:

            Root
             |
       +-----+-----+
       |           |
    T-cell      B-cell
(CD3+, CD19-)
      |
  CD4+ T-cell

We are interested in finding the probability of a cell belonging to each of these cell types. NSP normalizes the reads such that the average signal is at 1 and the average noise (background) is at 0. The probability of a cell being a particular cell type is calculated using the NSP normalized read counts clipped at 0 and 1. The likelihood model is a beta distribution with the likelihood of signal equal to the likelihood of noise at NSP = 0.5. Using Bayes' theorem, the probability of a cell being a particular cell type given the normalized read counts (Data in the equations below) is calculated as follows:

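\[P(\text{T cell} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{T cell}) \cdot P(\text{T cell})}{P(\text{Data})}\]
\[P(\text{T cell} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{T cell}) \cdot P(\text{T cell})}{P(\text{Data} \mid \text{T cell}) \cdot P(\text{T cell}) + P(\text{Data} \mid \text{Not T cell}) \cdot P(\text{Not T cell})}\]

Since the prior is not known, we assume that \(P(\text{T cell}) = P(\text{Not T cell})\), so the priors cancel. Therefore,

\[P(\text{T cell} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{T cell})}{P(\text{Data} \mid \text{T cell}) + P(\text{Data} \mid \text{Not T cell})}\]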
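A minimal numeric sketch of this simplification is given below. The function name posterior and the likelihood values are hypothetical and chosen only for illustration; the point is that with a prior of 0.5 the general Bayes formula reduces to the ratio of likelihoods.

```python
def posterior(lik_type: float, lik_not_type: float, prior: float = 0.5) -> float:
    """P(type | Data) from the two likelihoods and the prior P(type)."""
    numerator = lik_type * prior
    denominator = numerator + lik_not_type * (1.0 - prior)
    if denominator == 0:
        raise ValueError("both likelihoods are zero; the posterior is undefined")
    return numerator / denominator


# Hypothetical likelihoods P(Data | T cell) and P(Data | Not T cell).
lik_t, lik_not_t = 0.9, 0.1

# With equal priors the result equals lik_t / (lik_t + lik_not_t).
p_equal_priors = posterior(lik_t, lik_not_t)
p_ratio = lik_t / (lik_t + lik_not_t)
```

Passing a different prior (for example prior=0.2) shows how much the result would shift if the cell type frequencies were known.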

A cell will be a T cell when it expresses all the antibodies as per the truth.

\[P(\text{Data} \mid \text{T cell}) = L(\text{CD3+ and CD19-}) = L(\text{CD3+}) \cdot L(\text{CD19-})\]

A cell is not a T cell if it expresses any one or more of the antibodies incorrectly.

\[P(\text{Data} \mid \text{Not T cell}) = L(\text{CD3- or CD19+}) = L(\text{CD3+ and CD19+}) + L(\text{CD3- and CD19-}) + L(\text{CD3- and CD19+})\]

If the cell is truly a T cell, then L(CD3- and CD19+) can be ignored because it will be much smaller than the first two terms. If the cell is not a T cell, then P(Data | T cell) will be much smaller than at least one of the first two terms in P(Data | Not T cell), and again L(CD3- and CD19+) can be ignored, as the probability will be close to zero even without including this term. Therefore, the probability of a cell not being a T cell can be approximated as the sum of the probabilities of the cell incorrectly expressing exactly one of the antibodies. This approximation avoids computing the exponentially increasing number of terms that would be present as the number of antibodies increases.
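The approximation can be compared with the full sum in a short sketch. The per-antibody likelihood values below are hypothetical placeholders (in the method described here they would come from the beta likelihood model), and the helper names are illustrative, not part of the library.

```python
from itertools import product
from math import prod


def lik_match(lik, signature):
    """Product of per-antibody likelihoods for one expression signature."""
    return prod(lik[ab][state] for ab, state in signature.items())


def lik_not_exact(lik, signature):
    """Sum over every signature that differs from the expected one (2^n - 1 terms)."""
    antibodies = list(signature)
    total = 0.0
    for states in product("+-", repeat=len(antibodies)):
        candidate = dict(zip(antibodies, states))
        if candidate != signature:
            total += lik_match(lik, candidate)
    return total


def lik_not_approx(lik, signature):
    """Sum over signatures with exactly one antibody flipped (n terms)."""
    flip = {"+": "-", "-": "+"}
    total = 0.0
    for ab in signature:
        candidate = dict(signature)
        candidate[ab] = flip[signature[ab]]
        total += lik_match(lik, candidate)
    return total


# Hypothetical likelihoods for one cell that looks like a T cell.
lik = {"CD3": {"+": 0.9, "-": 0.1}, "CD19": {"+": 0.05, "-": 0.95}}
t_cell = {"CD3": "+", "CD19": "-"}

exact = lik_not_exact(lik, t_cell)    # includes L(CD3- and CD19+)
approx = lik_not_approx(lik, t_cell)  # drops L(CD3- and CD19+)
```

With two antibodies the approximation drops a single term; with n antibodies it replaces 2^n - 1 terms with n terms.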

Note

Based on this formulation, \(P(T cell) + P(B cell) + P(CD4+ T cell)\) does not have to add up to 1.

Suppose \(P(T cell) = 0.99\), \(P(B cell) = 0.1\), and \(P(CD4+ T cell) = 0.1\).

For min_prob_diff=0.8, the cell will be assigned to T cell, since P(T cell) - P(B cell) = 0.89 > 0.8.

For min_prob_diff=0.05, the cell will be assigned to CD4+ T cell. This happens because the assignment is performed from the root to the leaves: the cell is first assigned to T cell and then to CD4+ T cell. Because all CD4+ T cells are also T cells, CD4+ T cell is evaluated only after the cell has been assigned to T cell, and since min_prob_diff is 0.05, which is lower than P(CD4+ T cell), the cell is assigned to CD4+ T cell. If there was also a CD8+ T cell in the tree, min_prob_diff would have been compared to P(CD4+ T cell) - P(CD8+ T cell).
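The root-to-leaves walk in this example can be sketched as follows. This is an illustration of the description above, not the library's code: the function name assign is hypothetical, a node with no sibling is assumed to be compared against a runner-up probability of 0, and a cell that fails the threshold at the root is assumed to be reported as “Unassigned”.

```python
def assign(tree, probs, min_prob_diff, root="Root"):
    """Walk from the root towards the leaves while the best child wins clearly."""
    label = "Unassigned"  # assumption: nothing passes the threshold at the root
    node = root
    while tree.get(node):
        ranked = sorted(tree[node], key=lambda child: probs[child], reverse=True)
        best = ranked[0]
        # Assumption: with no sibling, the runner-up probability is taken as 0.
        runner_up = probs[ranked[1]] if len(ranked) > 1 else 0.0
        if probs[best] - runner_up <= min_prob_diff:
            break  # not distinct enough; stay at the current level
        label = node = best
    return label


tree = {"Root": ["T cell", "B cell"], "T cell": ["CD4+ T cell"]}
probs = {"T cell": 0.99, "B cell": 0.1, "CD4+ T cell": 0.1}

strict = assign(tree, probs, min_prob_diff=0.8)    # "T cell"
lenient = assign(tree, probs, min_prob_diff=0.05)  # "CD4+ T cell"
```

In this sketch the strict threshold stops at T cell because P(CD4+ T cell) = 0.1 does not exceed 0.8, while the lenient threshold lets the walk continue one level down.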

