Dna.filter_variants
missionbio.mosaic.dna.Dna.filter_variants
- Dna.filter_variants(min_dp: int = 10, min_gq: int = 30, vaf_ref: float = 5, vaf_hom: float = 95, vaf_het: float = 30, min_prct_cells: float = 50, min_mut_prct_cells: float = 1, iterations: int = 10, n_cores: int = 4) → Sequence
Find informative variants.
This method also adds the NGT_FILTERED layer to the assay. It is a copy of the NGT layer in which the genotype of every cell-variant pair that does not pass the filters is set to 3, i.e. missing.
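The per-call part of the filtering can be illustrated with plain NumPy. This is a minimal sketch under stated assumptions, not Mosaic's implementation: the function name `sketch_ngt_filtered` and the array arguments `ngt`, `dp`, `gq`, and `af` (cells × variants, with `af` holding the VAF in percent) are hypothetical, and the thresholds simply mirror the defaults in the signature above.

```python
import numpy as np


def sketch_ngt_filtered(ngt, dp, gq, af, min_dp=10, min_gq=30,
                        vaf_ref=5, vaf_het=30, vaf_hom=95):
    """Return a copy of `ngt` with calls failing the filters set to 3.

    Hypothetical helper for illustration only; all arrays are
    cells x variants and `af` is the VAF in percent (0-100).
    """
    ngt = np.asarray(ngt)
    dp, gq, af = np.asarray(dp), np.asarray(gq), np.asarray(af)

    # Depth and genotype-quality filters apply to every call.
    fail = (dp < min_dp) | (gq < min_gq)
    # Genotype-specific VAF filters.
    fail |= (ngt == 0) & (af > vaf_ref)   # reference calls
    fail |= (ngt == 1) & (af < vaf_het)   # heterozygous calls
    fail |= (ngt == 2) & (af < vaf_hom)   # homozygous calls

    filtered = ngt.copy()
    filtered[fail] = 3                    # 3 means missing / no call
    return filtered


# Toy data: 2 cells x 3 variants.
ngt = [[0, 1, 2], [0, 1, 2]]
dp = [[20, 20, 5], [20, 20, 20]]
gq = [[50, 50, 50], [50, 20, 50]]
af = [[2, 45, 99], [10, 45, 99]]

filtered = sketch_ngt_filtered(ngt, dp, gq, af)
print(filtered.tolist())  # [[0, 1, 3], [3, 3, 2]]
```

In the toy data, one call fails on depth, one on genotype quality, and one reference call fails because its VAF exceeds `vaf_ref`; the remaining three calls are unchanged.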
- Parameters:
- min_dp : int
The minimum depth (DP) for a call to be considered. Calls with a DP below this value in a given barcode are treated as no calls.
- min_gq : int
The minimum genotype quality (GQ) for a call to be considered. Calls with a GQ below this value in a given barcode are treated as no calls.
- vaf_ref : float [0, 100]
All reference calls (NGT = 0) with VAF > vaf_ref are converted to no calls (NGT = 3) for each barcode and variant in the NGT matrix.
- vaf_het : float [0, 100]
All heterozygous calls (NGT = 1) with VAF < vaf_het are converted to no calls (NGT = 3) for each barcode and variant in the NGT matrix.
- vaf_hom : float [0, 100]
All homozygous calls (NGT = 2) with VAF < vaf_hom are converted to no calls (NGT = 3) for each barcode and variant in the NGT matrix.
- min_prct_cells : float [0, 100]
The minimum percent of total cells in which the variant should be present (NGT ∈ {0, 1, 2}) after the filters are applied. This value is not a simple percentage of cells. It is a weighted percentage in which the contribution of each cell to the numerator and the denominator is weighted by a measure of cell completeness. This reduces the contribution of cells with many missing genotype calls, so more variants are retained when there are many incomplete cells.
- min_mut_prct_cells : float [0, 100]
The minimum percent of total cells in which the variant should be mutated (NGT ∈ {1, 2}) after the filters are applied. This value is weighted by cell completeness as described above for min_prct_cells.
- iterations : int > 0
The number of iterations used to remove variants and calculate cell completeness. Here, cell completeness refers to the fraction of variants that have a high-quality call in a cell. “High quality” is defined by min_dp, min_gq, and the three VAF parameters. When the percentage for min_prct_cells is calculated, cells are weighted by their completeness.
In the first iteration, the completeness values are calculated using the original NGT matrix, and variants present in less than min_prct_cells / iterations percent of cells are removed. In the second iteration, the completeness values are calculated using the NGT matrix of the variants that remain after the first iteration, and the threshold is set to 2 * min_prct_cells / iterations. This process is repeated until the number of iterations is reached. In the last iteration, the threshold for the percentage of cells in which the variant is present is min_prct_cells. The variants that pass the filters in the last iteration are returned.
Increasing the number of iterations results in a more nuanced calculation of cell completeness. However, modifying this parameter is not recommended. Instead, change min_prct_cells to adjust the number of variants that pass the filters. The same is true for min_mut_prct_cells.
- n_cores : int
The number of cores to use for parallel processing.
- Returns:
- numpy.ndarray
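The completeness-weighted percentage and the rising per-iteration threshold described for min_prct_cells and iterations can be sketched as follows. This is an illustrative reading of the description, not Mosaic's actual code: the helper names `weighted_percent` and `sketch_select_variants` are hypothetical, the input is assumed to be an already filtered NGT matrix (cells × variants, 3 = missing), and applying min_mut_prct_cells once after the last iteration is an assumption made to keep the example short.

```python
import numpy as np


def weighted_percent(mask, weights):
    """Percent of cells where `mask` is True, each cell weighted by `weights`."""
    total = weights.sum()
    if total == 0:
        return np.zeros(mask.shape[1])
    return 100.0 * (weights[:, None] * mask).sum(axis=0) / total


def sketch_select_variants(ngt_filtered, min_prct_cells=50,
                           min_mut_prct_cells=1, iterations=10):
    """Return the column indices of variants that survive the filters.

    Hypothetical helper for illustration only.
    """
    ngt = np.asarray(ngt_filtered)
    present = ngt != 3                   # NGT in {0, 1, 2}
    mutated = (ngt == 1) | (ngt == 2)    # NGT in {1, 2}
    keep = np.arange(ngt.shape[1])

    for k in range(1, iterations + 1):
        if keep.size == 0:
            break
        # Completeness: fraction of surviving variants called in each cell.
        completeness = present[:, keep].mean(axis=1)
        # Threshold rises linearly and reaches min_prct_cells at the end.
        threshold = k * min_prct_cells / iterations
        pct = weighted_percent(present[:, keep], completeness)
        keep = keep[pct >= threshold]

    if keep.size:
        # Assumption: the mutated-cell filter is applied once, at the end.
        completeness = present[:, keep].mean(axis=1)
        pct_mut = weighted_percent(mutated[:, keep], completeness)
        keep = keep[pct_mut >= min_mut_prct_cells]
    return keep


# Toy data: 4 cells x 3 variants, 3 = missing.
ngt_filtered = [[0, 1, 3],
                [0, 1, 3],
                [0, 3, 3],
                [3, 3, 1]]

kept = sketch_select_variants(ngt_filtered, iterations=2)
print(kept.tolist())  # [1]
```

In this toy matrix, variant 2 is dropped in the first iteration (weighted presence of about 16.7% against a threshold of 25%), and variant 0 is dropped by the mutated-cell filter because it is never mutated. Variant 1 is called in only 2 of 4 cells, but its weighted presence is 80% because the cells that lack it are incomplete and therefore carry less weight.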