load

missionbio.mosaic.io.load

load(filepath: Any, filter_cells: bool = False, filter_variants: bool = True, whitelist: Optional[Sequence] = None, raw: bool = False, single: bool = False, variant_filter_config: Optional[Dict[str, float]] = None, assays: Optional[Sequence[str]] = None, raw_counts: Optional[Sequence[str]] = None) → Union[Sample, SampleGroup]

Loads an .h5 file containing one or more assays.

This is the preferred way of loading .h5 files.

It directly returns a Sample object that contains all the assays, or a SampleGroup object for a multi-sample file when single=False. Assays that are not present in the file are stored as None.

Parameters:

filepath:

The path to the .h5 multi-omics file.

filter_cells:

If True, then only the cells called by the completeness algorithm are loaded. Complete cells are those with greater than 80% completeness. If False, then all the cells are loaded.

filter_variants:

If False, then all the variants are loaded. If True, then only the filtered DNA variants are loaded. The filtered DNA variants are those that pass the filter_variants() function. This list can be obtained by loading all variants with filter_variants=False and then running filter_variants() on the result. Information about the default filtered variants is stored in the filtered column attribute of the Dna object.

whitelist:

The specific DNA variants to load. The items in the whitelist can have three formats:

Variant IDs - chr1:12345:A/C
These look for exact matches in the variants
Positions - chr1:12345
These look for all the variants at that position in variants
Regions - chr1:12345-12350
These look for all the variants in that region in variants. Both 12345 and 12350 are included.
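The three whitelist formats can be told apart by their shape alone. The sketch below is an illustration only, not Mosaic's implementation: the helper names classify_whitelist_entry and matches are hypothetical, and the regular expressions are assumptions inferred from the three examples above.

```python
import re

# Hypothetical patterns inferred from the examples on this page.
_VARIANT = re.compile(r"^[^:]+:\d+:[^/:]+/[^/:]+$")   # chr1:12345:A/C
_REGION = re.compile(r"^[^:]+:\d+-\d+$")              # chr1:12345-12350
_POSITION = re.compile(r"^[^:]+:\d+$")                # chr1:12345


def classify_whitelist_entry(entry: str) -> str:
    """Return 'variant', 'region' or 'position' for a whitelist entry."""
    if _VARIANT.match(entry):
        return "variant"
    if _REGION.match(entry):
        return "region"
    if _POSITION.match(entry):
        return "position"
    raise ValueError(f"Unrecognised whitelist entry: {entry!r}")


def matches(entry: str, variant_id: str) -> bool:
    """Check whether a variant ID (chrom:pos:ref/alt) is selected by an entry."""
    chrom, pos_text, _alleles = variant_id.split(":", 2)
    pos = int(pos_text)
    kind = classify_whitelist_entry(entry)
    if kind == "variant":
        # Variant IDs require an exact match.
        return entry == variant_id
    entry_chrom, rest = entry.split(":", 1)
    if entry_chrom != chrom:
        return False
    if kind == "position":
        return int(rest) == pos
    # Regions are inclusive at both ends.
    start, end = (int(part) for part in rest.split("-"))
    return start <= pos <= end
```

Under these assumptions, matches("chr1:12345-12350", "chr1:12350:G/T") is True because the region end is inclusive, while a variant at position 12351 is not selected.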

The four cases for whitelist and filter_variants are:

filter_variants - False, whitelist - None
Load all the variants
filter_variants - True, whitelist - None
Only load the variants passing as per the filtered column attribute
filter_variants - False, whitelist - Given
Only load the variants in the whitelist
filter_variants - True, whitelist - Given
Load the variants passing as per the filtered column attribute and also those present in the whitelist
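The four cases amount to simple set logic. The following is a minimal sketch under stated assumptions, not the library's code: select_variants is a hypothetical helper, the whitelist is simplified to exact variant IDs, and passing stands in for the variants flagged by the filtered column attribute.

```python
from typing import Iterable, Optional, Set


def select_variants(
    all_variants: Iterable[str],
    passing: Iterable[str],
    filter_variants: bool,
    whitelist: Optional[Iterable[str]],
) -> Set[str]:
    """Return the variant IDs that would be loaded in each of the four cases."""
    available = set(all_variants)
    filtered = set(passing) & available
    if whitelist is None:
        # Cases 1 and 2: no whitelist, so only filter_variants matters.
        return filtered if filter_variants else available
    listed = set(whitelist) & available
    if filter_variants:
        # Case 4: union of the passing variants and the whitelisted ones.
        return filtered | listed
    # Case 3: whitelist only.
    return listed
```

For example, with variants a, b, c, d of which a and b pass the filter, a whitelist of c yields only c when filter_variants=False and a, b, c when filter_variants=True.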

raw:

Whether the raw counts are to be loaded. This will load cnv_raw and protein_raw attributes of the Sample class.

single:

Whether to load the file as a single sample even if it is a multi-sample .h5 file. If False, then a SampleGroup() object is returned, which splits each sample into a separate Sample object. This helps with batch correction when normalising the data, since each sample is treated separately. If single=True, then a single Sample object is returned. This makes interacting with the data easier, but care must be taken when normalising the data. The merge() and split() functions can be used to switch between the two object types.

variant_filter_config: Optional[Dict[str, float]]

The filters to apply to the variants. This is useful when the default filters are not suitable for the data. The default pre-filtered data is loaded when variant_filter_config=None. Only modify the filters that need to be changed, since steps that do not require recomputation are skipped. For example, if only min_mut_prct_cells is passed, then the genotype mask creation is skipped. Passing iterations=1 results in significantly faster loading times at the expense of more stringent filtering criteria. See filter_variants for a detailed description of the options.

Parameters that result in the recomputation of the genotype mask:

min_dp: float
The minimum depth of the variant call
min_gq: float
The minimum genotype quality of the variant call
vaf_ref: float
The maximum variant allele frequency of a reference call
vaf_het: float
The minimum variant allele frequency of a heterozygous call
vaf_hom: float
The minimum variant allele frequency of a homozygous call

Other parameters that only result in the recomputation of the variant mask:

min_prct_cells: float
The minimum weighted percentage of cells in which the variant should be present.
min_mut_prct_cells: float
The minimum weighted percentage of cells in which the variant should be mutated.
iterations: int
The number of iterations to run the filter for.

Other parameters that can be passed are:

n_cores: int
Number of cores to use for filtering. The memory consumption is proportional to the number of cores used.
verbose: bool
Whether to print the logs showing the progress of the filtering.
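To make the grouping of these keys concrete, the sketch below checks whether a given variant_filter_config would trigger the more expensive genotype mask recomputation. It is an assumption-laden illustration: recomputes_genotype_mask is a hypothetical helper, and the key sets simply mirror the lists on this page rather than anything read from the library.

```python
from typing import Any, Dict, Optional

# Key groups copied from the parameter lists on this page.
GENOTYPE_MASK_KEYS = {"min_dp", "min_gq", "vaf_ref", "vaf_het", "vaf_hom"}
VARIANT_MASK_KEYS = {"min_prct_cells", "min_mut_prct_cells", "iterations"}
OTHER_KEYS = {"n_cores", "verbose"}
KNOWN_KEYS = GENOTYPE_MASK_KEYS | VARIANT_MASK_KEYS | OTHER_KEYS


def recomputes_genotype_mask(config: Optional[Dict[str, Any]]) -> bool:
    """Return True if any key in the config forces genotype mask recomputation."""
    if config is None:
        # None means the default pre-filtered data is loaded as-is.
        return False
    unknown = set(config) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"Unknown filter options: {sorted(unknown)}")
    return bool(set(config) & GENOTYPE_MASK_KEYS)
```

A config such as {"min_mut_prct_cells": 1.0, "iterations": 1} touches only the variant mask, whereas adding "min_dp" would also require the genotype mask to be rebuilt. The numeric values here are placeholders, not recommended settings.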

assays: Optional[Sequence[str]]

The assays to load. Use this option to load certain assays quickly without reading the whole file. If None, then all the assays are loaded. Some of the available assays are:

DNA: "dna_variants"
CNV: "dna_read_counts"
Protein: "protein_read_counts"

raw_counts: Optional[Sequence[str]]

The raw counts to load. If None, then all the raw counts are loaded. Some of the available raw counts are:

CNV: "dna_read_counts"
Protein: "protein_read_counts"

Returns:

missionbio.mosaic.sample.Sample / missionbio.mosaic.samplegroup.SampleGroup

Raises:

Exception: When the h5 file format is not supported.
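A call that combines several of the parameters above might look like the following sketch. It uses only the documented load signature. The wrapper load_dna_only and the constant DNA_ONLY_KWARGS are hypothetical names, the whitelist entry is the placeholder ID from this page rather than a real variant, and catching a broad Exception is a simplification that also hides unrelated errors.

```python
from typing import Any, Dict

# Keyword arguments taken from the parameter descriptions on this page.
DNA_ONLY_KWARGS: Dict[str, Any] = {
    "assays": ["dna_variants"],        # read only the DNA assay
    "filter_variants": True,           # keep the default filtered variants
    "whitelist": ["chr1:12345:A/C"],   # placeholder ID, also kept if present
    "single": True,                    # return one Sample, not a SampleGroup
}


def load_dna_only(filepath: str):
    """Load only the DNA assay; return None if the file cannot be loaded."""
    # Imported lazily so this sketch can be imported without Mosaic installed.
    from missionbio.mosaic.io import load

    try:
        return load(filepath, **DNA_ONLY_KWARGS)
    except Exception as err:
        # The page documents a plain Exception for unsupported h5 formats.
        print(f"Could not load {filepath}: {err}")
        return None
```

It would be used as sample = load_dna_only("path/to/sample.h5"), where the path is a stand-in for an actual multi-omics file.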
