load


missionbio.mosaic.io.load

load(filepath: Any, filter_cells: bool = False, filter_variants: bool = True, whitelist: Optional[Sequence] = None, raw: bool = False, single: bool = False, variant_filter_config: Optional[Dict[str, float]] = None, assays: Optional[Sequence[str]] = None, raw_counts: Optional[Sequence[str]] = None) -> Union[Sample, SampleGroup]

Loads an .h5 file containing one or more assays.

This is the preferred way of loading .h5 files.

It directly returns a Sample object that contains all the assays. Assays that are not present in the file are stored as None.

Parameters:
filepath:

The path to the .h5 multi-omics file.

filter_cells:

If True, then only the cells called by the completeness algorithm are loaded. Complete cells are those with greater than 80% completeness. If False, then all the cells are loaded.

filter_variants:

If False, then all the variants are loaded. If True, then only the filtered DNA variants are loaded. The filtered DNA variants are those that pass the filter_variants() function. This list can be obtained by loading all the variants with filter_variants=False and then running filter_variants() on the result. Information about the default filtered variants is stored in the filtered column attribute of the Dna object.

whitelist:

The specific DNA variants to load. The items in the whitelist can have three formats:

  1. Variant IDs - chr1:12345:A/C

    These look for exact matches in the variants

  2. Positions - chr1:12345

    These look for all the variants at that position in variants

  3. Regions - chr1:12345-12350

    These look for all the variants in that region in variants. Both 12345 and 12350 are included.
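The three whitelist formats can be illustrated with a small matcher. This is only a sketch of the matching rules described above, not the code that load() runs; the helper name matches_whitelist and the regular expressions are assumptions made for illustration.

```python
import re

# Hypothetical helper, not part of missionbio.mosaic.
# Patterns for the three documented whitelist formats.
_REGION = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")
_POSITION = re.compile(r"^(?P<chrom>[^:]+):(?P<pos>\d+)$")
_VARIANT = re.compile(r"^(?P<chrom>[^:]+):(?P<pos>\d+):(?P<alleles>[^:]+)$")


def matches_whitelist(variant_id: str, entry: str) -> bool:
    """Return True if a variant ID such as 'chr1:12345:A/C' matches one whitelist entry."""
    variant = _VARIANT.match(variant_id)
    if variant is None:
        raise ValueError(f"not a variant ID: {variant_id!r}")
    chrom, pos = variant["chrom"], int(variant["pos"])

    region = _REGION.match(entry)
    if region:
        # Regions are inclusive at both ends.
        return chrom == region["chrom"] and int(region["start"]) <= pos <= int(region["end"])

    position = _POSITION.match(entry)
    if position:
        # A position matches every variant at that coordinate.
        return chrom == position["chrom"] and pos == int(position["pos"])

    if _VARIANT.match(entry):
        # A full variant ID requires an exact match.
        return variant_id == entry

    raise ValueError(f"unrecognised whitelist entry: {entry!r}")
```

For instance, matches_whitelist("chr1:12350:T/G", "chr1:12345-12350") is true because the end of the region is included.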

The four cases for whitelist and filter_variants are:

  1. filter_variants - False, whitelist - None

    Load all the variants

  2. filter_variants - True, whitelist - None

    Only load the variants passing as per the filtered column attribute

  3. filter_variants - False, whitelist - Given

    Only load the variants in the whitelist

  4. filter_variants - True, whitelist - Given

    Load the variants passing as per the filtered column attribute and also those present in the whitelist
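The four cases above amount to a small decision table. The following sketch shows that logic on plain Python lists; the function variants_to_load is hypothetical, and for brevity the whitelist here holds exact variant IDs only (positions and regions are not handled).

```python
from typing import List, Optional, Sequence


def variants_to_load(
    variant_ids: Sequence[str],
    filtered: Sequence[bool],
    filter_variants: bool = True,
    whitelist: Optional[Sequence[str]] = None,
) -> List[str]:
    """Hypothetical illustration of how filter_variants and whitelist combine.

    `filtered` plays the role of the filtered column attribute: True where a
    variant passes the default filters.
    """
    wanted = set(whitelist or [])
    selected = []
    for variant_id, passed in zip(variant_ids, filtered):
        if whitelist is None:
            # Cases 1 and 2: everything, or only the variants passing the filters.
            keep = passed if filter_variants else True
        elif filter_variants:
            # Case 4: passing variants plus anything in the whitelist.
            keep = passed or variant_id in wanted
        else:
            # Case 3: only the whitelist.
            keep = variant_id in wanted
        if keep:
            selected.append(variant_id)
    return selected
```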

raw:

Whether the raw counts are to be loaded. This will load cnv_raw and protein_raw attributes of the Sample class.

single:

Whether to load as a single sample despite being a multi sample h5 file. If False then a SampleGroup() object is returned. This splits each sample into a different Sample object. This helps with batch corrections when normalising the data, since each sample is treated separately. If single=True then a single Sample object is returned. This makes interacting with the data easier, but care must be taken when normalising the data. The merge() function and the split() function can be used to switch between the two object types.

variant_filter_config: Optional[Dict[str, float]]

The filters to apply to the variants. This is useful when the default filters are not suitable for the data. The default pre-filtered data is loaded when variant_filter_config=None. Only modify the filters that need to be changed, since steps that do not require recomputation are skipped. For example, if only min_mut_prct_cells is passed, then the genotype mask creation is skipped. Passing iterations=1 results in significantly faster loading times at the expense of more stringent filtering criteria. See filter_variants for a detailed description of the options.

Parameters that result in the recomputation of the genotype mask:

  1. min_dp: float

    The minimum depth of the variant call

  2. min_gq: float

    The minimum genotype quality of the variant call

  3. vaf_ref: float

    The maximum variant allele frequency of a reference call

  4. vaf_het: float

    The minimum variant allele frequency of a heterozygous call

  5. vaf_hom: float

    The minimum variant allele frequency of a homozygous call

    The minimum variant allele frequency of a homozygous call

Other parameters that only result in the recomputation of the variant mask:

  1. min_prct_cells: float

    The minimum weighted percentage of cells in which the variant should be present.

  2. min_mut_prct_cells: float

    The minimum weighted percentage of cells in which the variant should be mutated.

  3. iterations: int

    The number of iterations to run the filter for.

Other parameters that can be passed are:

  1. n_cores: int

    Number of cores to use for filtering. The memory consumption is proportional to the number of cores used.

  2. verbose: bool

    Whether to print the logs showing the progress of the filtering.
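To check which group a given variant_filter_config touches before loading, the option names listed above can be sorted into their groups. This is a sketch only: the helper names are hypothetical, the values passed in the usage are placeholders rather than recommended settings, and the real decision about what to recompute is made inside the library.

```python
from typing import Any, Dict, List, Optional

# Option names as documented for variant_filter_config.
GENOTYPE_MASK_OPTIONS = ("min_dp", "min_gq", "vaf_ref", "vaf_het", "vaf_hom")
VARIANT_MASK_OPTIONS = ("min_prct_cells", "min_mut_prct_cells", "iterations")
OTHER_OPTIONS = ("n_cores", "verbose")


def group_filter_options(config: Optional[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Hypothetical helper: sort the keys of a config into the documented groups."""
    groups: Dict[str, List[str]] = {"genotype_mask": [], "variant_mask": [], "other": []}
    for key in config or {}:
        if key in GENOTYPE_MASK_OPTIONS:
            groups["genotype_mask"].append(key)
        elif key in VARIANT_MASK_OPTIONS:
            groups["variant_mask"].append(key)
        elif key in OTHER_OPTIONS:
            groups["other"].append(key)
        else:
            raise KeyError(f"unknown filter option: {key}")
    return groups


def recomputes_genotype_mask(config: Optional[Dict[str, Any]]) -> bool:
    """True if the config contains any option from the genotype mask group."""
    return bool(group_filter_options(config)["genotype_mask"])
```

With this sketch, recomputes_genotype_mask({"min_mut_prct_cells": 1.0}) is False, which mirrors the statement that the genotype mask creation is skipped when only min_mut_prct_cells is passed.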

assays: Optional[Sequence[str]]

The assays to load. Use this option to load certain assays quickly without reading the whole file. If None, then all the assays are loaded. Some of the available assays are:

  1. DNA: “dna_variants”

  2. CNV: “dna_read_counts”

  3. Protein: “protein_read_counts”

raw_counts: Optional[Sequence[str]]

The raw counts to load. If None then all the raw counts are loaded. Some of the available raw counts are:

  1. CNV: “dna_read_counts”

  2. Protein: “protein_read_counts”

Returns:
missionbio.mosaic.sample.Sample / missionbio.mosaic.samplegroup.SampleGroup
Raises:
Exception

When the h5 file format is not supported.
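A minimal usage sketch follows, assuming the mosaic package is installed and that load can be imported from missionbio.mosaic.io as the module path above indicates. The wrapper names load_options and load_dna_only are hypothetical; the defaults are copied from the signature, and the import is done lazily so the option handling can be tried without the package.

```python
from typing import Any, Dict

# Defaults copied from the documented signature of load().
LOAD_DEFAULTS: Dict[str, Any] = {
    "filter_cells": False,
    "filter_variants": True,
    "whitelist": None,
    "raw": False,
    "single": False,
    "variant_filter_config": None,
    "assays": None,
    "raw_counts": None,
}


def load_options(**overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(LOAD_DEFAULTS)
    if unknown:
        raise TypeError(f"unexpected load() arguments: {sorted(unknown)}")
    return {**LOAD_DEFAULTS, **overrides}


def load_dna_only(filepath: str):
    """Hypothetical wrapper: load only the DNA assay from an .h5 file."""
    # Imported here so this module can be read and tested without mosaic installed.
    from missionbio.mosaic.io import load

    options = load_options(assays=["dna_variants"])
    try:
        return load(filepath, **options)
    except Exception as err:
        # load() is documented to raise a plain Exception for unsupported
        # h5 formats, so the catch is necessarily broad.
        raise RuntimeError(f"could not load {filepath!r}") from err
```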

