Dna.group_by_genotype

Dna.group_by_genotype(features: Sequence, layer: str = 'NGT', group_missing: bool = True, max_ado_score: float = 1, min_clone_size: float = 1, ignore_zygosity: bool = False, show_plot: Optional[Union[bool, str]] = None)

Clusters cells into clones based on the provided variants and returns a dataframe of per-clone and per-variant statistics. The identified clone labels are stored in the ‘LABEL’ layer. The algorithm also accounts for allele dropout (ADO) to identify potential false-positive clones.

Algorithm details

In an ADO event, HET goes to WT and HOM for a given variant in a subset of the cells. The HET clone is called the parent clone; the resulting HOM and WT clones are the ADO clones, which together are called the sister clones.

The parent and sister clones are np.nan if the score is zero. Otherwise, the parent clone is the name of the clone from which the subclone was obtained due to an ADO event.
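The parent/sister relationship can be illustrated with a small, self-contained sketch. It is not the library's implementation: the helper `find_ado_clones` is hypothetical, and the NGT encoding (0 = WT, 1 = HET, 2 = HOM) is an assumption. A clone is treated as a candidate ADO clone of a parent if it differs from the parent at exactly one variant, where the parent is HET and the clone is WT or HOM.

```python
WT, HET, HOM = 0, 1, 2  # assumed NGT encoding, not taken from this page


def find_ado_clones(parent, clones):
    """Return clones that could arise from `parent` through one ADO event.

    Hypothetical helper: a clone qualifies if it differs from the parent at
    exactly one variant, the parent is HET there, and the clone is WT or HOM.
    """
    candidates = []
    for clone in clones:
        diffs = [i for i, (p, c) in enumerate(zip(parent, clone)) if p != c]
        if len(diffs) != 1:
            continue
        i = diffs[0]
        if parent[i] == HET and clone[i] in (WT, HOM):
            candidates.append(clone)
    return candidates


# Each clone is a tuple of genotypes over two variants.
clones = [(HET, HET), (WT, HET), (HOM, HET), (HET, WT), (WT, WT)]
sisters = find_ado_clones((HET, HET), clones)
```

Here `(WT, HET)` and `(HOM, HET)` are sister clones for the first variant, and `(HET, WT)` is a candidate ADO clone for the second. Whether any of them is actually flagged depends on the score described next.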

The score for each subclone measures the possibility that it is a false-positive subclone obtained due to an ADO event. The score is 0 if it is unlikely to be a clone due to ADO and 1 if it is highly likely to be an ADO clone.

The score takes into account the following metrics.
  1. NGT values of the clones

  2. Relative proportions of the clones

  3. Absolute proportions of the clones (uses min_clone_size as a parameter)

  4. Mean GQ of the clones

  5. Mean DP of the clones

The score is calculated from four sub-scores.

score = (ss + ds + gs) * ps

  1. ss - sister score (0 - 0.8)

    It measures the proportion of the clone with respect to its sister clone. This score is closer to 0.8 when the sister clones have similar proportions and exactly 0.8 when their proportions are within the min_clone_size.

  2. ds - DP score (0 - 0.1)

    It measures the mean DP of the clone with respect to its parent clone. It is closer to 0.1 if the DP of the clone is lower than the parent’s DP.

  3. gs - GQ score (0 - 0.1)

    It measures the mean GQ of the clone with respect to its parent clone. It is closer to 0.1 if the GQ of the clone is lower than the parent’s GQ.

  4. ps - parent score (0 - 1)

    It measures the proportion of the clone with respect to the parent clone. This score is closer to 1 the larger the parent is compared to the clone, and closer to 0 the smaller the parent is compared to the clone.
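As a rough illustration of how the four sub-scores combine, the sketch below implements only the formula score = (ss + ds + gs) * ps and the ranges listed above. The function name `ado_score` is hypothetical, and how the library derives each sub-score from NGT, DP, GQ and clone proportions is not shown here.

```python
def ado_score(ss: float, ds: float, gs: float, ps: float) -> float:
    """Combine the sub-scores as score = (ss + ds + gs) * ps.

    Hypothetical helper: it checks each sub-score against its documented
    range and does not compute the sub-scores themselves.
    """
    bounds = {"ss": (ss, 0.8), "ds": (ds, 0.1), "gs": (gs, 0.1), "ps": (ps, 1.0)}
    for name, (value, upper) in bounds.items():
        if not 0.0 <= value <= upper:
            raise ValueError(f"{name}={value} is outside [0, {upper}]")
    return (ss + ds + gs) * ps


# Sisters of similar size, lower DP and GQ than a much larger parent.
likely_ado = ado_score(ss=0.8, ds=0.1, gs=0.1, ps=1.0)

# Same sub-scores, but the parent is much smaller than the clone.
unlikely_ado = ado_score(ss=0.8, ds=0.1, gs=0.1, ps=0.0)
```

Because ps multiplies the sum, a clone scores high only when the sister, DP and GQ evidence all point to ADO and the parent is large relative to the clone.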

Parameters
features: Sequence

The features (variants) to consider when grouping cells by genotype.

layer: str

Name of the layer used to count the cell types. Expected values are NGT or NGT_FILTERED as obtained from the Dna.filter_variants() method.

group_missing: bool

Whether clusters caused by missing values are merged into a single cluster named ‘Missing’.

max_ado_score: float [0, 1]

The maximum ADO score of a clone before it is grouped into the “ADO” clone category.

min_clone_size: float [0, 100]

The minimum proportion of total cells that must be present in a clone for it to count as a separate clone.

ignore_zygosity: bool

Whether HET and HOM are considered the same or not.

show_plot: bool | str

If True, a plot showing the ADO identification process is shown. The same plot is shown if the value is “all”; True is kept as an option for backwards compatibility. If the value is “collapsed”, only the non-small and non-ADO clones are shown separately.
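To make the effect of group_missing, min_clone_size and ignore_zygosity concrete, here is a toy sketch in plain Python. It is an illustration under stated assumptions, not the library's algorithm: the function `toy_group_by_genotype`, the “Small” label, and the NGT encoding (0 = WT, 1 = HET, 2 = HOM, 3 = missing) are all assumptions, and ADO scoring is left out entirely.

```python
from collections import Counter

WT, HET, HOM, MISSING = 0, 1, 2, 3  # assumed NGT encoding, not taken from this page


def toy_group_by_genotype(ngt, group_missing=True, min_clone_size=1.0,
                          ignore_zygosity=False):
    """Label each cell (one row of `ngt`) by its genotype over the variants.

    Hypothetical illustration only; ADO detection is not modelled.
    """
    labels = []
    for row in ngt:
        if group_missing and MISSING in row:
            labels.append("Missing")  # merge all cells with a missing call
            continue
        if ignore_zygosity:
            row = [HET if g == HOM else g for g in row]  # treat HOM as HET
        labels.append(tuple(row))

    counts = Counter(labels)
    total = len(labels)

    def is_small(label):
        # min_clone_size is a percentage of total cells, in [0, 100]
        return label != "Missing" and 100 * counts[label] / total < min_clone_size

    return ["Small" if is_small(label) else label for label in labels]


# Five cells genotyped at two variants.
ngt = [[HET, WT], [HET, WT], [HOM, WT], [HET, MISSING], [WT, WT]]
default_labels = toy_group_by_genotype(ngt)
strict_labels = toy_group_by_genotype(ngt, min_clone_size=30)
merged_labels = toy_group_by_genotype(ngt, ignore_zygosity=True, min_clone_size=30)
```

With a 30% threshold, the single-cell HOM and WT clones fall into the small category. With ignore_zygosity=True the HOM cell joins the HET clone instead.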

Returns
pd.DataFrame / None

None is returned if ignore_zygosity is True or group_missing is False; otherwise, a pandas dataframe is returned.

Columns:

  • The initial columns in the dataframe contain per-variant information. Each cell contains the genotype and average variant allele frequency for the respective clone.

  • The next set of columns show the number of cells (and %) in each clone, with one column for all samples combined together (Total Cell Number) and separate columns for each individual sample (e.g. Sample-A Cell Number).

  • The last three columns contain data related to identifying false positive ADO clones. The columns are: parent clone, sister ADO clone(s), and a score. The indices are the subclone names.

Additional rows:

  • The second-to-last row shows the percentage of cells where the variant is missing.

  • The last row displays statistics for small subclones (determined by the min_clone_size parameter): the number and percentage of total cells.