load
missionbio.mosaic.io.load
- load(filepath: Any, filter_cells: bool = False, filter_variants: bool = True, whitelist: Optional[Sequence] = None, raw: bool = False, single: bool = False, variant_filter_config: Optional[Dict[str, float]] = None, assays: Optional[Sequence[str]] = None, raw_counts: Optional[Sequence[str]] = None) -> Union[Sample, SampleGroup]
Loads an .h5 file containing one or more assays.
This is the preferred way of loading .h5 files.
It returns a Sample object containing all the assays, or a SampleGroup for a multi-sample file loaded with single=False. Assays that are not present in the file are stored as None.
- Parameters:
- filepath:
The path to the .h5 multi-omics file.
- filter_cells:
If True, then only the cells called by the completeness algorithm are loaded. Complete cells are those with greater than 80% completeness. If False, then all the cells are loaded.
- filter_variants:
If False, then all the variants are loaded. If True, then only the filtered DNA variants are loaded. The filtered DNA variants are those that pass the
filter_variants()
function. This list can be obtained by loading all variants with filter_variants=False and then running filter_variants()
on the result. Information about the default filtered variants is stored in the filtered column attribute of the Dna
object.
- whitelist:
The specific DNA variants to load. The items in the whitelist can have three formats:
- Variant IDs - chr1:12345:A/C
These look for exact matches in the variants
- Positions - chr1:12345
These look for all the variants at that position in variants
- Regions - chr1:12345-12350
These look for all the variants in that region in variants. Both 12345 and 12350 are included.
The four cases for whitelist and filter_variants are:
- filter_variants - False, whitelist - None
Load all the variants
- filter_variants - True, whitelist - None
Only load the variants passing as per the filtered column attribute
- filter_variants - False, whitelist - Given
Only load the variants in the whitelist
- filter_variants - True, whitelist - Given
Load the variants passing as per the filtered column attribute and also those present in the whitelist
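The whitelist formats and the four cases above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's implementation; the helper names `parse_whitelist_item`, `in_whitelist` and `select_variants` are hypothetical, and the `passed_filter` list stands in for the filtered column attribute.

```python
def parse_whitelist_item(item):
    """Return (chrom, start, end, alleles) for one whitelist entry.

    alleles is None for positions and regions.
    """
    chrom, rest = item.split(":", 1)
    if ":" in rest:  # Variant ID, e.g. chr1:12345:A/C
        pos, alleles = rest.split(":", 1)
        return chrom, int(pos), int(pos), alleles
    if "-" in rest:  # Region, e.g. chr1:12345-12350 (both ends included)
        start, end = rest.split("-", 1)
        return chrom, int(start), int(end), None
    return chrom, int(rest), int(rest), None  # Position, e.g. chr1:12345


def in_whitelist(variant_id, whitelist):
    """Check a variant ID such as chr1:12345:A/C against every whitelist entry."""
    chrom, pos, alleles = variant_id.split(":", 2)
    pos = int(pos)
    for item in whitelist:
        w_chrom, start, end, w_alleles = parse_whitelist_item(item)
        if w_chrom != chrom or not (start <= pos <= end):
            continue
        if w_alleles is None or w_alleles == alleles:
            return True
    return False


def select_variants(variant_ids, passed_filter, filter_variants=True, whitelist=None):
    """Apply the four filter_variants / whitelist cases to a list of variant IDs."""
    if not filter_variants and whitelist is None:
        return list(variant_ids)  # load all the variants
    keep = []
    for vid, passed in zip(variant_ids, passed_filter):
        by_filter = filter_variants and passed
        by_whitelist = whitelist is not None and in_whitelist(vid, whitelist)
        if by_filter or by_whitelist:
            keep.append(vid)
    return keep
```

With `filter_variants=True` and a whitelist given, the result is the union of the variants that pass the filter and the whitelisted ones, which matches the last case in the list.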
- raw:
Whether the raw counts are to be loaded. This will load the cnv_raw and protein_raw attributes of the
Sample
class.
- single:
Whether to load as a single sample despite being a multi sample h5 file. If False then a
SampleGroup()
object is returned. This splits each sample into a different Sample
object. This helps with batch corrections when normalising the data, since each sample is treated separately. If single=True then a single Sample
object is returned. This makes interacting with the data easier, but care must be taken when normalising the data. The merge()
function and the split()
function can be used to switch between the two object types.
- variant_filter_config: Optional[Dict[str, float]]
The filters to apply to the variants. This is useful when the default filters are not suitable for the data. The default pre-filtered data is loaded when variant_filter_config=None. Only modify the filters that need to be changed, since steps that do not require recomputation are skipped. For example, if only min_mut_prct_cells is passed then the genotype mask creation is skipped. Passing iterations=1 results in significantly faster loading times at the expense of more stringent filtering criteria. See
filter_variants
for a detailed description of the options.
Parameters that result in the recomputation of the genotype mask:
- min_dp: float
The minimum depth of the variant call
- min_gq: float
The minimum genotype quality of the variant call
- vaf_ref: float
The maximum variant allele frequency of a reference call
- vaf_het: float
The minimum variant allele frequency of a heterozygous call
- vaf_hom: float
The minimum variant allele frequency of a homozygous call
Other parameters that only result in the recomputation of the variant mask:
- min_prct_cells: float
The minimum weighted percentage of cells in which the variant should be present.
- min_mut_prct_cells: float
The minimum weighted percentage of cells in which the variant should be mutated.
- iterations: int
The number of iterations to run the filter for.
Other parameters that can be passed are:
- n_cores: int
Number of cores to use for filtering. The memory consumption is proportional to the number of cores used.
- verbose: bool
Whether to print the logs showing the progress of the filtering.
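As a rough aid for choosing which keys to pass in variant_filter_config, the grouping above can be written down as a small lookup. This is a sketch based only on the parameter lists in this section; the function name `recomputed_step` and its return strings are hypothetical, and the example values below are placeholders, not recommended thresholds.

```python
# Keys grouped exactly as in the parameter lists above.
GENOTYPE_MASK_KEYS = {"min_dp", "min_gq", "vaf_ref", "vaf_het", "vaf_hom"}
VARIANT_MASK_KEYS = {"min_prct_cells", "min_mut_prct_cells", "iterations"}


def recomputed_step(variant_filter_config):
    """Report the most expensive step a given config would trigger."""
    if variant_filter_config is None:
        return "none (default pre-filtered data)"
    keys = set(variant_filter_config)
    if keys & GENOTYPE_MASK_KEYS:
        return "genotype mask"
    if keys & VARIANT_MASK_KEYS:
        return "variant mask only"
    # Only n_cores / verbose (or unknown keys): this section does not say.
    return "not stated"


# Placeholder values chosen for illustration only.
fast_config = {"min_mut_prct_cells": 1.0, "iterations": 1}
print(recomputed_step(fast_config))
```

A config such as `fast_config` touches only the variant mask, so, per the description above, the genotype mask creation would be skipped.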
- assays: Optional[Sequence[str]]
The assays to load. Use this option to load certain assays quickly without reading the whole file. If None then all the assays are loaded. Some of the available assays are:
DNA: “dna_variants”
CNV: “dna_read_counts”
Protein: “protein_read_counts”
- raw_counts: Optional[Sequence[str]]
The raw counts to load. If None then all the raw counts are loaded. Some of the available raw counts are:
CNV: “dna_read_counts”
Protein: “protein_read_counts”
- Returns:
- missionbio.mosaic.sample.Sample / missionbio.mosaic.samplegroup.SampleGroup
- Raises:
- Exception
When the h5 file format is not supported.
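A minimal usage sketch, using only the parameters documented above. The wrapper name `load_dna_sample` is hypothetical, the choice to re-raise as RuntimeError is an assumption made for illustration, and the import is placed inside the function so the sketch can be defined even where the package is not installed.

```python
def load_dna_sample(filepath):
    """Load only the DNA assay from an .h5 file as a single Sample."""
    # Requires the mosaic package; imported lazily on first call.
    from missionbio.mosaic.io import load

    try:
        return load(
            filepath,
            assays=["dna_variants"],  # read only the DNA assay
            filter_variants=True,     # keep the default filtered variants
            single=True,              # one Sample even for multi-sample files
        )
    except Exception as err:
        # Per the Raises entry, an unsupported h5 format raises a plain Exception.
        raise RuntimeError(f"Could not load {filepath!r}: {err}") from err
```

Passing single=False instead would return a SampleGroup for a multi-sample file, as described under the single parameter.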