Protein.assign_from_truth
missionbio.mosaic.protein.Protein.assign_from_truth
- Protein.assign_from_truth(truth: Optional[Union[str, Path, Truth]] = None, merge: str = 'mixed', min_prob_diff: float = 0.9, max_adjusted_mixing: float = 0.3, min_distance_for_doublet: int = 5, sticky_antibodies: Tuple = ('IgG1', 'IgG2a', 'IgG2b'), method: Optional[str] = None) → Union[ProbabilityMethod, LikelihoodMethod]
Label each cell with the most likely cell type from the truth.
The accuracy of the assignment depends on the accuracy of the NSP normalization. If the scaling factor is not estimated correctly, the assignment will leave a large number of cells unassigned. In such cases, try estimating the scaling factor using the read_depth_dependence() method of the isotype control.
- Parameters:
- truth:
A Truth object or a path to a YAML file containing the truth. If None, the builtin() truth for PBMCs is used. It can be visualised using the plot() function on the truth object.
- merge:
How to merge cell types with multiple clusters:
- “none”: Keep all clusters. The doublets are identified by the “:” in the cluster name.
- “mixed”: Merge all doublet clusters into a single “Mixed” cluster, e.g. “T cell:B cell” and “T cell:Monocytes” are merged into “Mixed”.
- “labeled”: Merge multiple clusters of the same cell type into a single cluster, e.g. “T cell-1” and “T cell-2” are merged into “T cell”.
- “all”: Merge all “Unassigned” clusters into a single cluster, e.g. “Unassigned-1” and “Unassigned-2” are merged into “Unassigned”.
- min_prob_diff:
Minimum difference in probability between the most likely and second most likely cell type for a cell to be assigned to a single cell type. This is also used for the assignment of doublet clusters. Here the probability refers to the probability of the cluster being a particular cell type given the normalized read counts. See the example.
- max_adjusted_mixing:
Maximum adjusted mixing rate for a cluster with a mixed signature to be labelled as “Mixed”. If the adjusted mixing rate is higher, the doublet is assigned to the “Mixed Like” cluster.
- min_distance_for_doublet:
The minimum number of antibodies that must be different between two clones for their doublet to be considered. Increase this if too many mixed cells are observed.
- sticky_antibodies:
Antibodies that should be used to identify sticky cells
- method: {None, “probability”, “likelihood”}
Method to use for the assignment. If None, the method is chosen based on the truth: if the truth has missing antibodies, the method is set to “probability”, otherwise it is set to “likelihood”. “probability” works best when a few antibodies and cell types are present in the sample. This method can call ‘Unassigned’ clones for anything that does not match the Truth. “likelihood” works only when the same antibodies are present in the truth for all cell types. This method classifies the cells into one of the Truth cell types. It cannot call ‘Unassigned’ clones.
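The default choice of method described above can be illustrated with a minimal sketch. It is not the library's code: the truth is represented here by a hypothetical plain dictionary (cell type to antibody signature), and "missing antibodies" is read as "not every cell type defines every antibody", which is an assumption based on the description of “likelihood”.

```python
from typing import Dict

# Hypothetical stand-in for a truth: cell type -> {antibody: expected positive?}
TruthSignatures = Dict[str, Dict[str, bool]]


def choose_method(signatures: TruthSignatures) -> str:
    """Pick "likelihood" only if all cell types define the same antibodies."""
    antibody_sets = [set(sig) for sig in signatures.values()]
    all_antibodies = set().union(*antibody_sets)
    complete = all(s == all_antibodies for s in antibody_sets)
    return "likelihood" if complete else "probability"


# Every cell type defines both antibodies.
complete_truth = {
    "T cell": {"CD3": True, "CD19": False},
    "B cell": {"CD3": False, "CD19": True},
}

# "B cell" does not say anything about CD3.
partial_truth = {
    "T cell": {"CD3": True, "CD19": False},
    "B cell": {"CD19": True},
}

method_complete = choose_method(complete_truth)
method_partial = choose_method(partial_truth)
```

Passing method explicitly bypasses this kind of automatic choice.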
- Returns:
- PACE object:
It stores the labels for the cells with the most likely cell type from the truth.
Notes
This modifies the LABEL attribute of the assay.
Examples
Let there be three cell types in the truth structured as follows:
          Root
            |
      +-----+-----+
      |           |
   T-cell       B-cell
(CD3+, CD19-)
      |
 CD4+ T-cell
We are interested in finding the probability of a cell belonging to each of these cell types. NSP normalizes the reads such that the average signal is at 1 and the average noise (background) is at 0. The probability of a cell being a particular cell type is calculated using the NSP normalized read counts clipped at 0 and 1. The likelihood model is a beta distribution with the likelihood of signal equal to the likelihood of noise at NSP = 0.5. Using Bayes' theorem, the probability of a cell being a particular cell type given the normalized read counts (Data in the equations below) is calculated as follows:
\[P(T cell | Data) = \frac{P(Data | T cell) * P(T cell)}{P(Data)}\]
\[P(T cell | Data) = \frac{P(Data | T cell) * P(T cell)} {P(Data | T cell) * P(T cell) + P(Data | Not T cell) * P(Not T cell)}\]
Since the prior is not known, we assume that \(P(T cell) = P(Not T cell)\). Therefore,
\[P(T cell | Data) = \frac{P(Data | T cell)}{P(Data | T cell) + P(Data | Not T cell)}\]
A cell will be a T cell when it expresses all the antibodies as per the truth.
\[P(Data | T cell) = L(CD3+ and CD19-) = L(CD3+) * L(CD19-)\]
A cell is not a T cell if it expresses any one or more of the antibodies incorrectly.
\[P(Data | Not T cell) = L(CD3- or CD19+) = L(CD3+ and CD19+) + L(CD3- and CD19-) + L(CD3- and CD19+)\]
If the cell is truly a T cell, then L(CD3- and CD19+) can be ignored because it will be much smaller than the first two terms. If the cell is not a T cell, then P(Data | T cell) will be much smaller than one of the first two terms in P(Data | Not T cell), and again L(CD3- and CD19+) can be ignored, as the probability will be close to zero even without including this term. Therefore, the probability of a cell not being a T cell can be approximated as the sum of the likelihoods of the cell expressing exactly one of the antibodies incorrectly. This approximation avoids computing the exponentially increasing number of terms that would be present as the number of antibodies increases.
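The calculation above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's implementation: the shape parameter of the beta distributions (SHAPE = 2) is made up, chosen only so that the signal and noise likelihoods are equal at NSP = 0.5, and all function names are hypothetical.

```python
from typing import Dict

SHAPE = 2.0  # assumed beta shape parameter; not taken from the library


def clip01(x: float) -> float:
    """Clip an NSP normalized value to the range [0, 1]."""
    return min(1.0, max(0.0, x))


def l_signal(x: float) -> float:
    """Beta(SHAPE, 1) density: likelihood that the antibody is expressed."""
    return SHAPE * clip01(x) ** (SHAPE - 1)


def l_noise(x: float) -> float:
    """Beta(1, SHAPE) density: likelihood that the antibody is background."""
    return SHAPE * (1.0 - clip01(x)) ** (SHAPE - 1)


def likelihood(nsp: Dict[str, float], signature: Dict[str, bool]) -> float:
    """Product of per-antibody likelihoods for a +/- signature."""
    out = 1.0
    for antibody, positive in signature.items():
        out *= l_signal(nsp[antibody]) if positive else l_noise(nsp[antibody])
    return out


def posterior(nsp: Dict[str, float], signature: Dict[str, bool]) -> float:
    """P(cell type | Data) with equal priors and the one-mismatch approximation."""
    match = likelihood(nsp, signature)
    # One term per antibody: that antibody flipped, all others as in the truth.
    mismatch = 0.0
    for antibody in signature:
        flipped = dict(signature)
        flipped[antibody] = not signature[antibody]
        mismatch += likelihood(nsp, flipped)
    total = match + mismatch
    return match / total if total > 0 else 0.0


t_cell = {"CD3": True, "CD19": False}
p_t_like = posterior({"CD3": 0.9, "CD19": 0.1}, t_cell)  # about 0.818
p_b_like = posterior({"CD3": 0.1, "CD19": 0.9}, t_cell)  # about 0.053
```

With two antibodies the approximation drops a single term, L(CD3- and CD19+). With n antibodies it keeps n mismatch terms instead of 2^n - 1.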
Note
Based on this formulation, \(P(T cell) + P(B cell) + P(CD4+ T cell)\) does not have to add up to 1.
Suppose \(P(T cell) = 0.99\), \(P(B cell) = 0.1\), and \(P(CD4+ T cell) = 0.1\). For min_prob_diff=0.8, the cell will be assigned to T cell, since P(T cell) - P(B cell) = 0.89 > 0.8. For min_prob_diff=0.05, the cell will be assigned to CD4+ T cell. This happens because the assignment is performed from the root to the leaves. The cell is first assigned to T cell and then to CD4+ T cell. Because all CD4+ T cells are also T cells, CD4+ T cell does not compete with B cell at this step; since min_prob_diff is 0.05, which is lower than P(CD4+ T cell) = 0.1, the cell is assigned to CD4+ T cell. If there were also a CD8+ T cell in the tree, min_prob_diff would be compared to P(CD4+ T cell) - P(CD8+ T cell).
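The root-to-leaf walk in this note can be sketched as follows. This is one reading of the worked example, not the library's implementation: the function name, the dictionary representation of the tree, and the choice to return the last node reached (including "Root" when no child clears the threshold) are all assumptions made for illustration.

```python
from typing import Dict, List


def assign(
    probs: Dict[str, float],
    children: Dict[str, List[str]],
    min_prob_diff: float,
    root: str = "Root",
) -> str:
    """Descend from the root while the best child beats its best sibling by
    more than min_prob_diff. A child without siblings is compared against 0."""
    node = root
    while children.get(node):
        ranked = sorted(children[node], key=lambda c: probs[c], reverse=True)
        best = probs[ranked[0]]
        runner_up = probs[ranked[1]] if len(ranked) > 1 else 0.0
        if best - runner_up <= min_prob_diff:
            break  # not confident enough to go deeper
        node = ranked[0]
    return node


# Tree and probabilities from the example above.
tree = {"Root": ["T cell", "B cell"], "T cell": ["CD4+ T cell"]}
probs = {"T cell": 0.99, "B cell": 0.1, "CD4+ T cell": 0.1}

strict = assign(probs, tree, min_prob_diff=0.8)
lenient = assign(probs, tree, min_prob_diff=0.05)
```

Here strict stops at "T cell" because 0.1 does not exceed 0.8, while lenient continues to "CD4+ T cell" because 0.1 exceeds 0.05.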