ProbabilityMethod.__init__
missionbio.demultiplex.protein.pace.probability.ProbabilityMethod.__init__
- ProbabilityMethod.__init__(protein: Assay, truth: Truth, min_prob_diff: float = 0.5, **kwargs: Any)
- Parameters:
protein – The protein assay
truth – A Truth object
min_prob_diff – Minimum difference in probability between the most likely and second most likely cell type for a cell to be assigned to a single cell type. This is also used for the assignment of doublet clusters. Here the probability refers to the probability of the cluster being a particular cell type given the normalized read counts. See the example.
kwargs – Other parameters passed to PACEModel
Example
Let there be three cell types in the truth structured as follows:
            Root
              |
        +-----+-----+
        |           |
      T-cell      B-cell
   (CD3+, CD19-)
        |
   CD4+ T-cell
We are interested in finding the probability of a cell belonging to each of these cell types. NSP normalizes the reads such that the average signal is at 1 and the average noise (background) is at 0. The probability of a cell being a particular cell type is calculated using the NSP normalized read counts clipped to the range 0 to 1. The likelihood model is a beta distribution in which the likelihood of signal equals the likelihood of noise at NSP = 0.5. Using Bayes' theorem, the probability of a cell being a particular cell type given the normalized read counts (Data in the equations below) is calculated as follows:
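The clipping and the likelihood model described above can be sketched as follows. This is only an illustration: the shape parameters Beta(2, 1) for signal and Beta(1, 2) for noise are assumptions chosen so that the two likelihoods are equal at NSP = 0.5, and the helper names `clip_nsp`, `beta_pdf`, `likelihood_signal` and `likelihood_noise` are hypothetical, not part of the library.

```python
import math


def clip_nsp(value: float) -> float:
    """Clip an NSP-normalized read count to the interval [0, 1]."""
    return min(1.0, max(0.0, value))


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of a Beta(a, b) distribution at x, for 0 <= x <= 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


# Assumed shape parameters (illustrative only, not taken from the library).
# Mirroring them makes the two densities cross at NSP = 0.5.
SHAPE_A, SHAPE_B = 2.0, 1.0


def likelihood_signal(nsp: float) -> float:
    """Likelihood that the antibody is expressed (e.g. CD3+)."""
    return beta_pdf(clip_nsp(nsp), SHAPE_A, SHAPE_B)


def likelihood_noise(nsp: float) -> float:
    """Likelihood that the antibody is not expressed (e.g. CD19-)."""
    return beta_pdf(clip_nsp(nsp), SHAPE_B, SHAPE_A)
```

With mirrored shape parameters, values above 0.5 favor signal and values below 0.5 favor noise, and any value outside [0, 1] is treated the same as the nearest bound.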
\[P(\text{T cell} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{T cell}) \cdot P(\text{T cell})}{P(\text{Data})}\]
\[P(\text{T cell} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{T cell}) \cdot P(\text{T cell})}{P(\text{Data} \mid \text{T cell}) \cdot P(\text{T cell}) + P(\text{Data} \mid \text{Not T cell}) \cdot P(\text{Not T cell})}\]
Assuming \(P(\text{T cell}) = P(\text{Not T cell})\), since the prior is not known, the priors cancel:
\[P(\text{T cell} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{T cell})}{P(\text{Data} \mid \text{T cell}) + P(\text{Data} \mid \text{Not T cell})}\]
A cell is a T cell when it expresses all the antibodies as specified in the truth.
\[P(\text{Data} \mid \text{T cell}) = L(\text{CD3+ and CD19-}) = L(\text{CD3+}) \cdot L(\text{CD19-})\]
A cell is not a T cell if it expresses any one of the antibodies incorrectly.
\[P(\text{Data} \mid \text{Not T cell}) = L(\text{CD3- or CD19+}) = L(\text{CD3+ and CD19+}) + L(\text{CD3- and CD19-}) + L(\text{CD3- and CD19+})\]
If the cell is truly a T cell, then \(L(\text{CD3- and CD19+})\) can be ignored because it will be much smaller than the first two terms. If the cell is not a T cell, then \(P(\text{Data} \mid \text{T cell})\) will be much smaller than one of the first two terms in \(P(\text{Data} \mid \text{Not T cell})\), and again \(L(\text{CD3- and CD19+})\) can be ignored, as the probability will be close to zero even without including this term. Therefore, the probability of a cell not being a T cell can be approximated as the sum of the probabilities of the cell incorrectly expressing exactly one of the antibodies. This approximation avoids computing the exponentially growing number of terms that would appear in the equation as the number of antibodies increases.
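The derivation above can be checked numerically with a minimal sketch that compares the approximation (one antibody flipped at a time) against the full sum over all incorrect expression patterns. The likelihood function `lik`, the function `posterior`, and the example NSP values are assumptions made for illustration; they are not the library's implementation.

```python
from itertools import product
from math import prod


def lik(nsp: float, positive: bool) -> float:
    """Assumed likelihood of one antibody being expressed (+) or not (-).

    Uses Beta(2, 1) for signal and Beta(1, 2) for noise, an illustrative
    choice in which both are equal at NSP = 0.5. NSP is clipped to [0, 1].
    """
    x = min(1.0, max(0.0, nsp))
    return 2.0 * x if positive else 2.0 * (1.0 - x)


def posterior(data: dict, truth: dict, exact: bool = False) -> float:
    """P(cell type | Data) under equal priors.

    data:  antibody name -> NSP value of the cell
    truth: antibody name -> True if the cell type expresses the antibody
    exact: sum over all 2**n - 1 incorrect patterns instead of the n
           patterns that flip exactly one antibody
    """
    antibodies = list(truth)
    expected = tuple(truth[ab] for ab in antibodies)

    def pattern_lik(signs):
        # Likelihood of the data under one expression pattern.
        return prod(lik(data[ab], s) for ab, s in zip(antibodies, signs))

    p_type = pattern_lik(expected)
    if exact:
        p_not = sum(
            pattern_lik(signs)
            for signs in product([True, False], repeat=len(antibodies))
            if signs != expected
        )
    else:
        p_not = 0.0
        for i in range(len(antibodies)):
            flipped = expected[:i] + (not expected[i],) + expected[i + 1:]
            p_not += pattern_lik(flipped)

    total = p_type + p_not
    # Guard: the approximation can make every term zero at the clip bounds.
    return p_type / total if total > 0 else 0.0


t_cell_truth = {"CD3": True, "CD19": False}
t_like_cell = {"CD3": 0.9, "CD19": 0.1}  # hypothetical NSP values
b_like_cell = {"CD3": 0.1, "CD19": 0.9}  # hypothetical NSP values
```

For the T-like cell the two variants agree closely, and for the B-like cell both stay near zero, which is the behavior the approximation argument relies on. The approximate variant evaluates n patterns rather than 2^n - 1.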
Based on this formulation, \(P(\text{T cell}) + P(\text{B cell}) + P(\text{CD4+ T cell})\) does not have to add up to 1.
Suppose \(P(\text{T cell}) = 0.99\), \(P(\text{B cell}) = 0.1\), and \(P(\text{CD4+ T cell}) = 0.1\).
For min_prob_diff=0.8, the cell will be assigned to T cell, since P(T cell) - P(B cell) = 0.89 > 0.8.
For min_prob_diff=0.05, the cell will be assigned to CD4+ T cell. This happens because the assignment is performed from the root to the leaves. The cell is first assigned to T cell and then to CD4+ T cell. Because all CD4+ T cells are also T cells, CD4+ T cell is not compared against B cell; since min_prob_diff is 0.05, which is lower than P(CD4+ T cell) = 0.1, the cell is assigned to CD4+ T cell. If there was also a CD8+ T cell in the tree, min_prob_diff would have been compared to P(CD4+ T cell) - P(CD8+ T cell).
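The root-to-leaf assignment in this example can be sketched as below. The function `assign_cell` and the dictionary layout are hypothetical and are not the library's API. Two behaviors are assumptions rather than facts from the documentation: a node with a single child compares that child's probability directly with min_prob_diff, and a cell that fails the test at a level stays at the parent node.

```python
def assign_cell(tree: dict, probs: dict, min_prob_diff: float, root: str = "Root") -> str:
    """Walk from the root towards the leaves, descending only while the
    most likely child beats the second most likely by more than
    min_prob_diff.

    tree:  node name -> list of child node names
    probs: node name -> probability of the cell being that cell type
    """
    node = root
    while tree.get(node):
        ranked = sorted(tree[node], key=lambda child: probs[child], reverse=True)
        best = probs[ranked[0]]
        # Assumption: with no sibling, the runner-up probability is 0.
        runner_up = probs[ranked[1]] if len(ranked) > 1 else 0.0
        if best - runner_up <= min_prob_diff:
            break  # ambiguous at this level, stay at the current node
        node = ranked[0]
    return node


tree = {"Root": ["T cell", "B cell"], "T cell": ["CD4+ T cell"]}
probs = {"T cell": 0.99, "B cell": 0.1, "CD4+ T cell": 0.1}

print(assign_cell(tree, probs, min_prob_diff=0.8))
print(assign_cell(tree, probs, min_prob_diff=0.05))
```

Under these assumptions the two calls reproduce the assignments described in the example: T cell for min_prob_diff=0.8 and CD4+ T cell for min_prob_diff=0.05.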