Dna.group_by_genotype

missionbio.mosaic.dna.Dna.group_by_genotype

Dna.group_by_genotype(features: Sequence, layer: str = 'NGT', group_missing: bool = True, max_ado_score: float = 1, min_clone_size: float = 1, ignore_zygosity: bool = False, return_plot: bool = False, plot_kind: str = 'collapsed')

Clusters cells into clones based on the provided variants and returns a dataframe of per-clone and per-variant statistics. The identified clone labels are stored in the ‘LABEL’ layer. The algorithm also takes allele dropout (ADO) into consideration to identify potential false-positive clones.

Algorithm details

When an ADO event occurs, HET calls for a given variant change to WT or HOM calls in a subset of the cells. The HET clone is called the parent clone. The HOM and WT clones are its child clones, and together they are called the sister clones. Here, the parent of a clone only indicates that the clone could have been formed by an ADO event in the parent; it does not imply that the clone is entirely due to ADO. To identify clones that are likely to exist only because of ADO, check the ADO score.

The parent and sister clones are np.nan if the score is zero. Otherwise, the parent is the name of the clone from which the subclone was obtained due to an ADO event.

The score for each subclone measures the possibility that it is a false-positive subclone obtained due to an ADO event. The score is 0 if the subclone is unlikely to be due to ADO and 1 if it is highly likely to be an ADO clone.

The score takes into account the following metrics.
  1. NGT values of the clones

  2. Relative proportions of the clones

  3. Absolute proportions of the clones (uses min_clone_size as a parameter)

  4. Mean GQ of the clones

  5. Mean DP of the clones

The score is calculated using four sub-scores.

score = (ss + ds + gs) * ps

  1. ss - sister score (0 - 0.8)

    It measures the proportion of the clone with respect to its sister clone. This score is closer to 0.8 when the sister clones have similar proportions and exactly 0.8 when their proportions are within the min_clone_size.

  2. ds - DP score (0 - 0.1)

    It measures the mean DP of the clone with respect to its parent clone. It is closer to 0.1 if the DP of the clone is lower than the parent’s DP.

  3. gs - GQ score (0 - 0.1)

    It measures the mean GQ of the clone with respect to its parent clone. It is closer to 0.1 if the GQ of the clone is lower than the parent’s GQ.

  4. ps - parent score (0 - 1)

    It measures the proportion of the clone with respect to the parent clone. This score is closer to 1 the larger the parent is compared to the clone, and closer to 0 the smaller the parent is compared to the clone.
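The way the four sub-scores combine can be sketched in a few lines of Python. The function name ado_score and its range checks are illustrative assumptions. Only the formula and the documented ranges come from the description above, and the sketch does not show how each sub-score is derived from the NGT, DP, and GQ values.

```python
import math


def ado_score(ss, ds, gs, ps):
    """Combine the four sub-scores as score = (ss + ds + gs) * ps.

    Hypothetical helper for illustration; it is not part of the library.
    """
    # Documented upper bounds: ss in [0, 0.8], ds and gs in [0, 0.1], ps in [0, 1].
    limits = {"ss": (ss, 0.8), "ds": (ds, 0.1), "gs": (gs, 0.1), "ps": (ps, 1.0)}
    for name, (value, upper) in limits.items():
        if not 0.0 <= value <= upper:
            raise ValueError(f"{name}={value} is outside [0, {upper}]")
    return (ss + ds + gs) * ps


# All sub-scores at their maximum give a score of (approximately) 1.
highest = ado_score(ss=0.8, ds=0.1, gs=0.1, ps=1.0)
assert math.isclose(highest, 1.0)

# A parent score of 0 forces the overall score to 0, whatever the other sub-scores are.
lowest = ado_score(ss=0.8, ds=0.1, gs=0.1, ps=0.0)
assert lowest == 0.0
```

Because ps multiplies the sum, a clone whose parent is small relative to it scores low even when its sister, DP, and GQ sub-scores are all high.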

Parameters:
features : Sequence

The variants (features) used to assign cells to genotype groups.

layer : str

Name of the layer used to count the cell types. Expected values are NGT or NGT_FILTERED as obtained from the filter_variants() method.

group_missing : bool

Whether the clusters caused by missing values are merged into a single cluster named ‘Missing’.

max_ado_score : float [0, 1]

The maximum ADO score of a clone before it is grouped into the “ADO” clone category.

min_clone_size : float [0, 100]

The minimum proportion of total cells that must be present in a clone for it to count as a separate clone.

ignore_zygosity : bool

Whether HET and HOM are treated as the same genotype.

return_plot : bool

If True, a plot showing the ADO identification process is returned along with the ADO data.

plot_kind : str

Which clones are shown in the plot. It should be one of {“all”, “collapsed”}.
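The interplay of min_clone_size (a percentage of total cells) and max_ado_score can be illustrated with a small standalone sketch. It is not the library's implementation: the function classify_clones, the label "small", the strict comparisons, and the order in which the two thresholds are applied are all assumptions made for illustration.

```python
def classify_clones(cell_counts, ado_scores, min_clone_size=1.0, max_ado_score=1.0):
    """Label each clone as itself, "small", or "ADO" (illustrative assumption only)."""
    total = sum(cell_counts.values())
    labels = {}
    for clone, count in cell_counts.items():
        percent = 100.0 * count / total  # min_clone_size is on a 0-100 scale
        if percent < min_clone_size:
            # Too few cells to count as a separate clone (assumed strict "<").
            labels[clone] = "small"
        elif ado_scores.get(clone, 0.0) > max_ado_score:
            # Score above the allowed maximum (assumed strict ">").
            labels[clone] = "ADO"
        else:
            labels[clone] = clone
    return labels


counts = {"1": 900, "2": 95, "3": 5}   # cells per clone (made-up numbers)
scores = {"2": 0.9}                    # ADO score per clone (made-up numbers)

labels = classify_clones(counts, scores, min_clone_size=1, max_ado_score=0.8)
assert labels == {"1": "1", "2": "ADO", "3": "small"}
```

Under these assumptions, the default max_ado_score of 1 never moves a clone into the “ADO” category, since no score can exceed 1.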

Returns:
pd.DataFrame:

None is returned if ignore_zygosity is True or group_missing is False; otherwise, a pandas dataframe is returned.

Columns:

  • The initial columns in the dataframe contain per-variant information. Each entry contains the genotype and average variant allele frequency for the respective clone.

  • The next set of columns show the number of cells (and %) in each clone, with one column for all samples combined (Total Cell Number) and separate columns for each individual sample (e.g. Sample-A Cell Number).

  • The last three columns contain data related to identifying false positive ADO clones. The columns are: parent clone, sister ADO clone(s), and a score. The indices are the subclone names.

Additional rows:

  • The second-to-last row shows the percentage of cells where the variant is missing.

  • The last row displays statistics for small subclones (determined by the min_clone_size parameter): the number and percentage of total cells.

go.Figure:

Also returns a plot if return_plot=True.
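A minimal sketch of calling the method and guarding against the documented None return follows. The wrapper clone_table is hypothetical, and `dna` stands for any object that exposes group_by_genotype with the signature shown above (for example, the Dna assay of a loaded sample). The wrapper leaves return_plot at False because the shape of the combined return value is not specified here.

```python
def clone_table(dna, variants, min_clone_size=1.0, max_ado_score=1.0):
    """Return the per-clone dataframe, or raise if none was produced.

    Hypothetical convenience wrapper; it is not part of the library.
    """
    table = dna.group_by_genotype(
        features=variants,
        layer="NGT",
        group_missing=True,      # False would make the method return None
        max_ado_score=max_ado_score,
        min_clone_size=min_clone_size,
        ignore_zygosity=False,   # True would make the method return None
        return_plot=False,
    )
    if table is None:
        raise ValueError("group_by_genotype returned no dataframe")
    return table
```

The identified clone labels are stored in the ‘LABEL’ layer whether or not the returned dataframe is used.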