ProbabilityMethod

missionbio.demultiplex.protein.pace.probability.ProbabilityMethod

class ProbabilityMethod(protein: Assay, truth: Truth, min_prob_diff: float = 0.5, **kwargs: Any)

A PACE method that assigns each cell to its most probable category based on the given truth.

Parameters:
  • protein – The protein assay

  • truth – A Truth object

  • min_prob_diff – Minimum difference in probability between the most likely and second most likely cell type for a cell to be assigned to a single cell type. This is also used for the assignment of doublet clusters. Here the probability refers to the probability of the cluster being a particular cell type given the normalized read counts. See the example.

  • kwargs – Other parameters passed to PACEModel

Example

Let there be three cell types in the truth structured as follows:

            Root
             |
       +-----+-----+
       |           |
    T-cell      B-cell
(CD3+, CD19-)
      |
  CD4+ T-cell

We are interested in finding the probability of a cell belonging to each of these cell types. NSP normalizes the reads such that the average signal is at 1 and the average noise (background) is at 0. The probability of a cell being a particular cell type is calculated using the NSP normalized read counts clipped to the range [0, 1]. The likelihood model is a beta distribution in which the likelihood of signal equals the likelihood of noise at NSP = 0.5. Using Bayes' theorem, the probability of a cell being a particular cell type given the normalized read counts (Data in the equations below) is calculated as follows:

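\[P(\text{T cell} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{T cell}) \cdot P(\text{T cell})}{P(\text{Data})}\]
\[P(\text{T cell} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{T cell}) \cdot P(\text{T cell})}{P(\text{Data} \mid \text{T cell}) \cdot P(\text{T cell}) + P(\text{Data} \mid \text{Not T cell}) \cdot P(\text{Not T cell})}\]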

Since the prior is not known, we assume \(P(\text{T cell}) = P(\text{Not T cell})\). The priors then cancel, which gives:

\[P(\text{T cell} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{T cell})}{P(\text{Data} \mid \text{T cell}) + P(\text{Data} \mid \text{Not T cell})}\]
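The equal-prior posterior can be illustrated with a minimal Python sketch. The function name `posterior_equal_priors` and the likelihood values are hypothetical and chosen only for illustration; this is not the library's implementation.

```python
def posterior_equal_priors(likelihood_type: float, likelihood_not_type: float) -> float:
    """P(type | Data) when P(type) = P(not type), so the priors cancel."""
    total = likelihood_type + likelihood_not_type
    if total == 0:
        raise ValueError("both likelihoods are zero; the posterior is undefined")
    return likelihood_type / total


# Hypothetical likelihood values, not taken from real data.
p_t_cell = posterior_equal_priors(likelihood_type=1.8, likelihood_not_type=0.2)
print(round(p_t_cell, 3))
```

When the two likelihoods are equal, the function returns 0.5, which reflects that the data does not favor either hypothesis.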

A cell will be a T cell when it expresses all the antibodies as per the truth.

\[P(\text{Data} \mid \text{T cell}) = L(\text{CD3+ and CD19-}) = L(\text{CD3+}) \cdot L(\text{CD19-})\]
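As a rough sketch of how such a product of per-antibody likelihoods could be evaluated, the snippet below uses a pair of mirrored beta densities, so that signal and noise are equally likely at NSP = 0.5. The shape parameters `A` and `B` and all function names are assumptions made for illustration; the parameters actually used by the library are not stated here.

```python
import math


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of a Beta(a, b) distribution at x in [0, 1]."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


# Assumed shape parameters (hypothetical). Swapping them mirrors the density
# around 0.5, so the signal and noise likelihoods are equal at NSP = 0.5.
A, B = 3.0, 1.5


def likelihood(nsp: float, positive: bool) -> float:
    """Likelihood of a marker being positive (signal) or negative (noise)."""
    x = min(max(nsp, 0.0), 1.0)  # clip the NSP value to [0, 1]
    return beta_pdf(x, A, B) if positive else beta_pdf(x, B, A)


def t_cell_likelihood(cd3_nsp: float, cd19_nsp: float) -> float:
    """P(Data | T cell) = L(CD3+) * L(CD19-)."""
    return likelihood(cd3_nsp, positive=True) * likelihood(cd19_nsp, positive=False)


# A cell with high CD3 and low CD19 is more T-cell-like than the reverse.
print(t_cell_likelihood(0.9, 0.1) > t_cell_likelihood(0.1, 0.9))
```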

A cell is not a T cell if it expresses any one of the antibodies incorrectly.

\[P(\text{Data} \mid \text{Not T cell}) = L(\text{CD3- or CD19+}) = L(\text{CD3+ and CD19+}) + L(\text{CD3- and CD19-}) + L(\text{CD3- and CD19+})\]

If the cell is truly a T cell, then L(CD3- & CD19+) can be ignored because it will be much smaller than the first two terms. If the cell is not a T cell, then P(Data | T cell) will be much smaller than one of the first two terms in P(Data | Not T cell), and again L(CD3- & CD19+) can be ignored, as the probability will be close to zero even without including this term. Therefore, the likelihood of a cell not being a T cell can be approximated as the sum of the terms in which exactly one of the antibodies is expressed incorrectly. This approximation avoids computing the exponentially growing number of terms that would otherwise appear in the equation as the number of antibodies increases.
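To make the approximation concrete, the following sketch compares the full sum over all incorrect marker combinations with the sum that keeps only the terms where a single marker is wrong. The function names and the likelihood values are hypothetical; `lik_correct[i]` stands for the likelihood that marker i has the sign stated in the truth, and `lik_wrong[i]` for the opposite sign.

```python
from itertools import product
from math import prod


def exact_not_type(lik_correct, lik_wrong):
    """Sum over every marker combination except the all-correct one."""
    n = len(lik_correct)
    total = 0.0
    for flips in product([False, True], repeat=n):
        if not any(flips):
            continue  # all markers correct: this term is P(Data | type)
        total += prod(w if f else c for c, w, f in zip(lik_correct, lik_wrong, flips))
    return total


def approx_not_type(lik_correct, lik_wrong):
    """Sum of the terms in which exactly one marker is expressed incorrectly."""
    n = len(lik_correct)
    return sum(
        prod(lik_wrong[j] if j == i else lik_correct[j] for j in range(n))
        for i in range(n)
    )


# Hypothetical likelihoods for a T-cell-like cell: [CD3, CD19].
lik_correct = [1.8, 1.7]  # L(CD3+), L(CD19-)
lik_wrong = [0.2, 0.3]    # L(CD3-), L(CD19+)

exact = exact_not_type(lik_correct, lik_wrong)
approx = approx_not_type(lik_correct, lik_wrong)
print(exact, approx)
```

The exact sum has 2^n - 1 terms for n markers, while the approximation has only n, and the two differ here only by the small L(CD3-) * L(CD19+) term.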

Based on this formulation, \(P(\text{T cell}) + P(\text{B cell}) + P(\text{CD4+ T cell})\) does not have to add up to 1.

Suppose \(P(\text{T cell}) = 0.99\), \(P(\text{B cell}) = 0.1\), and \(P(\text{CD4+ T cell}) = 0.1\).

  • For min_prob_diff=0.8, the cell will be assigned to T cell, since P(T cell) - P(B cell) = 0.89 > 0.8.

  • For min_prob_diff=0.05, the cell will be assigned to CD4+ T cell.

The second outcome happens because the assignment is performed from the root to the leaves. The cell is first assigned to T cell and then to CD4+ T cell. Because all CD4+ T cells are also T cells, the probability of CD4+ T cell is higher than that of B cell, and since the min_prob_diff of 0.05 is lower than P(CD4+ T cell), the cell is assigned to CD4+ T cell. If there were also a CD8+ T cell in the tree, min_prob_diff would have been compared to P(CD4+ T cell) - P(CD8+ T cell).
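The root-to-leaf assignment described above can be sketched as follows. This is an illustrative reading of the example, not the library's code: the `TREE` dictionary, the `assign` function, the choice to compare an only child against 0, and the choice to stop at the current node when the difference is too small are all assumptions.

```python
from typing import Dict, List

# Children of each node in the example truth tree.
TREE: Dict[str, List[str]] = {
    "Root": ["T-cell", "B-cell"],
    "T-cell": ["CD4+ T-cell"],
    "B-cell": [],
    "CD4+ T-cell": [],
}


def assign(probs: Dict[str, float], min_prob_diff: float, node: str = "Root") -> str:
    """Descend from the root while the most probable child beats the
    second most probable child by more than min_prob_diff."""
    while TREE[node]:
        ranked = sorted(TREE[node], key=lambda child: probs[child], reverse=True)
        best = probs[ranked[0]]
        # Assumption: a child with no siblings is compared against 0.
        second = probs[ranked[1]] if len(ranked) > 1 else 0.0
        if best - second <= min_prob_diff:
            break  # assumption: stay at the current node if not confident
        node = ranked[0]
    return node


probs = {"T-cell": 0.99, "B-cell": 0.1, "CD4+ T-cell": 0.1}
print(assign(probs, min_prob_diff=0.8))   # T-cell
print(assign(probs, min_prob_diff=0.05))  # CD4+ T-cell
```

Under these assumptions, the sketch reproduces both outcomes of the example: with 0.8 the cell stops at T-cell, and with 0.05 it continues to CD4+ T-cell.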

Functions

__init__

protein – The protein assay

assignment_probabilities

Compute the probability of each cell type for each row in the expression matrix

classify_mixed_clusters

Assigns the doublets as "Mixed" if they have an adjusted mixing rate lower than the max_adjusted_mixing or "Mixed like" for higher adjusted mixing rates.

clipped_normalized_reads

Clip the NSP counts to the range [0, 1]

clipped_signature

The mean NSP values rounded to 0 or 1

cluster_truth

The truth that defines the clusters in the protein assay

get_normalized_reads

Returns the normalized counts in the protein assay as a dataframe

label_cells

Adds the LABEL row attribute based on the probabilistic method of assignment

label_sticky_cells

Assigns the "Sticky" labels to the cells

likelihoods

Create the probability density function of the likelihood of each expression type.

log_likelihood

Calculate the log-likelihood of the data.

palette

Create a palette for the cell types by giving each sub-celltype a unique color

Attributes