{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "49946ee1",
   "metadata": {},
   "source": [
    "## Single Sample DNA-only notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2c6f8c9",
   "metadata": {},
   "source": [
    "<b>This notebook is set up to work with single-sample h5 files, generated with v2 or v3 chemistry. This notebook will only work with Mosaic versions 3.0.1 and higher.</b>\n",
    "\n",
    "\n",
    "<b>Objective</b>: To showcase the minimum number of steps required for tertiary analysis of DNA (single-cell genotyping and CNV) and explore different ways of visualizing the data.<br>\n",
    "\n",
    "<b>Major questions answered:</b>\n",
    "1. Can we identify DNA clones based on genotypes (SNVs/Indels)?<br>\n",
    "2. Do we detect CNV events (e.g., copy number amplification, copy number loss)?<br>\n",
    "\n",
    "<b>Sections:</b>\n",
    "1. Setup\n",
    "2. Data Structure\n",
    "3. DNA Analysis\n",
    "4. CNV Analysis\n",
    "5. Combined Visualizations\n",
    "6. Export and Save Data\n",
    "7. Appendix\n",
    "\n",
    "<b>Not shown:</b>\n",
    "All available methods and options - documented [here](https://missionbio.github.io/mosaic/index.html)<br>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9907149d",
   "metadata": {},
   "source": [
    "## Setup "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34562cbf",
   "metadata": {},
   "source": [
    "<b>Topics covered</b><br>\n",
    "1. Loading required packages and data.\n",
    "2. Structure and contents of data objects.\n",
    "\n",
    "\n",
    "<b>Load data</b><br>\n",
    "<b>Note:</b> importing dependencies can sometimes take a couple of minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ad2faefb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import mosaic libraries\n",
    "import missionbio.mosaic as ms\n",
    "\n",
    "# Import these to display entire dataframes\n",
    "from IPython.display import display, HTML\n",
    "\n",
    "# Import graph_objects from the plotly package to display figures when saving the notebook as an HTML\n",
    "import plotly as px\n",
    "import plotly.graph_objects as go\n",
    "\n",
    "# Import additional packages for specific visuals\n",
    "import matplotlib.pyplot as plt\n",
    "import plotly.offline as pyo\n",
    "pyo.init_notebook_mode()\n",
    "import numpy as np\n",
    "\n",
    "# Import COMPASS for imputation\n",
    "from missionbio.mosaic.algorithms.compass import COMPASS\n",
    "\n",
    "# Note: when exporting the notebook as an HTML, plots that use the \"go.Figure(fig)\" command are saved"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41bae6eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This code is optional, but will make the notebook cells/figures display across the entire width of your browser\n",
    "from IPython.display import display, HTML\n",
    "display(HTML(\"<style>.container { width:100% !important; }</style>\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9d5c159",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check version; this notebook is designed for Mosaic v3.0.1 or higher\n",
    "print(ms.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0dd00edd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Any function's parameters and default values can be looked up via the 'help' function\n",
    "# Here, the function is 'ms.load'\n",
    "help(ms.load)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5b6249a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Specify the h5 file to be used in this analysis: h5path = '/path/to/h5/file/test.h5'\n",
    "# If working with Windows, you may need to add an 'r' before the path: h5path = r'/path/to/h5/file/test.h5'\n",
    "h5path = '/Users/botero/Desktop/Sample data/4-cell-lines-AML-multiomics/4-cell-lines-AML-CNV.dna.h5'\n",
    "\n",
    "# Load the data\n",
    "sample = ms.load(h5path, raw=False, filter_variants=True, single=True, whitelist = [], filter_cells=False) \n",
    "\n",
    "# Always set raw=False; if raw=True, ALL barcodes will be loaded (rather than cell-associated barcodes)\n",
    "# Always set filter_variants=True unless you can't detect an expected (target) variant. Additional filtering options are included in the DNA section below\n",
    "# The single=True option loads multi-sample h5 files as a single sample object (compatible with this notebook)\n",
    "# The whitelist option loads any variant that is in the vcf.gz file (e.g. \"chr1:179520511:C/G\"); similar to whitelist feature in Tapestri Insights v2\n",
    "# Always set filter_cells=False; if filter_cells=True, only genomically complete cells will be loaded "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "570e87f7",
   "metadata": {},
   "source": [
    "## Data Structure"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "001a0a2c",
   "metadata": {},
   "source": [
    "DNA, CNV, and Protein are sub-classes of the Assay class. The information is stored in four categories, and the user can modify each of those:\n",
    "\n",
    "<b>1. metadata (add_metadata / del_metadata):</b>\n",
    " - dictionary containing metrics of the assay\n",
    " \n",
    "<b>2. row_attrs (add_row_attr / del_row_attr):</b>\n",
    " - dictionary which contains 'barcode' as one of the keys \n",
    "   (where the value is a list of all barcodes)\n",
    " - for all other keys, the values must be of the same \n",
    "   length, i.e. match the number of barcodes\n",
    " - this is the attribute where 'label', 'pca', \n",
    "   and 'umap' values are added\n",
    "   \n",
    "<b>3. col_attrs (add_col_attr / del_col_attr):</b>\n",
    " - dictionary which contains 'id' as one of the keys \n",
    "   (where the value is a list of all ids)\n",
    " - for DNA assays, 'ids' are variants; for Protein assays, \n",
    "   'ids' are antibodies\n",
    " - for all other keys, the values must be of the same\n",
    "   length, i.e. match the number ids\n",
    "\n",
    "<b>4. layers (add_layer / del_layer):</b>\n",
    " - dictionary which contains critically important assay metrics\n",
    " - all the values have the shape (num barcodes) x (num ids) \n",
    " - for DNA assays, this includes AF, GQ, DP, etc. \n",
    "   (per cell, per variant)\n",
    " - for Protein assays, this includes read counts \n",
    "   (per cell, per antibody)\n",
    " - this is the attribute where 'normalized_counts' will be added\n",
    "\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c83db51c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary of DNA assay \n",
    "print(\"\\'sample.dna\\':\", sample.dna, '\\n')\n",
    "print(\"\\'row_attrs\\':\", \"\\n\\t\", list(sample.dna.row_attrs.keys()), '\\n')\n",
    "print(\"\\'col_attrs\\':\", \"\\n\\t\", list(sample.dna.col_attrs.keys()), '\\n')\n",
    "print(\"\\'layers\\':\", \"\\n\\t\", list(sample.dna.layers.keys()), '\\n')\n",
    "print(\"\\'metadata\\':\", \"\\n\")\n",
    "for i in list(sample.dna.metadata.keys()):\n",
    "    print('\\t', i, ': ', sample.dna.metadata[i], sep='')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02624da7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# For DNA, ids are variants\n",
    "# sample.dna.ids() is a shortcut for sample.dna.col_attrs['id']\n",
    "sample.dna.ids()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4125f252",
   "metadata": {},
   "source": [
    "## DNA Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce1a0b19",
   "metadata": {},
   "source": [
    "<b>Topics covered</b><br>\n",
    "1. Standard filtering of DNA variants.\n",
    "2. Subsetting dataset for variants of interest, including whitelisted variants.\n",
    "3. Addition of annotations to the variants.\n",
    "4. Manual variant selection and clone identification\n",
    "5. <b> OPTIONAL: </b> Use Compass to impute missing data and a clonal phylogeny\n",
    "6. Visualizations and customization options"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfc1ac78",
   "metadata": {},
   "source": [
    "### Basic filtering"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8777dc9",
   "metadata": {},
   "source": [
    "There are many options for filtering DNA variants. \n",
    "Use the `help()` function to understand the approach listed below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa064a76",
   "metadata": {},
   "outputs": [],
   "source": [
    "# For additional information visit: https://support.missionbio.com/hc/en-us/articles/360047303654-How-do-the-Advanced-Filters-work-\n",
    "help(sample.dna.filter_variants)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "017cc9e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter the variants, similar to the \"Advanced Filters\" in Tapestri Insights v2.2\n",
    "\n",
    "# One major difference: The filter \"REMOVE GENOTYPE IN CELLS WITH ALTERNATE ALLELE FREQ\" \n",
    "# is replaced with the following 3 zygosity-specific filters: vaf_ref, vaf_hom, vaf_het\n",
    "\n",
    "# In general, these additional filters remove additional false-positive from the data.\n",
    "\n",
    "# Adjust filters if needed by overwriting dna_vars\n",
    "dna_vars = sample.dna.filter_variants(\n",
    "    min_dp=10,\n",
    "    min_gq=30,\n",
    "    vaf_ref=5,\n",
    "    vaf_hom=95,\n",
    "    vaf_het=30,\n",
    "    min_prct_cells=50,\n",
    "    min_mut_prct_cells=1,\n",
    "    iterations=10,\n",
    ")\n",
    "\n",
    "# Check the number of filtered variants. When using the default filters, the number of \n",
    "# variants is likely smaller compared to the originally loaded variants due to the more \n",
    "# stringent filtering criteria (e.g., vaf_ref=5, vaf_hom=95, vaf_het=30).\n",
    "print('Number of variants:', len(dna_vars))\n",
    "print(dna_vars)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3c322f0",
   "metadata": {},
   "source": [
    "First, specify variants of interest using one of the three options below:\n",
    "1. Use the filtered variants from the above section (dna_vars)\n",
    "2. Use specific variants of interest (whitelist) \n",
    "3. Combine 1 & 2: filtered variants plus whitelist\n",
    "    \n",
    "Then, this list (final_vars) is used to reduce the larger data set to only your variants of interest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba758efd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Specify the whitelist; variants may be copy/pasted from Tapestri Insights v2.2,\n",
    "# but ensure correct nomenclature, ie whitelist = [\"chr13:28589657:T/G\",\"chrX:39921424:G/A\"]\n",
    "# If there are no whitelist variants, you can leave target variants blank\n",
    "\n",
    "white_list = []\n",
    "\n",
    "# Combine whitelisted and filtered variants\n",
    "final_vars = list(set(list(dna_vars) + white_list))\n",
    "\n",
    "# Want to use your whitelist only?\n",
    "# final_vars = white_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58ebf3a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the length of your final list of variants\n",
    "len(final_vars)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20e02ad6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dimensionality of the original sample.dna dataframe\n",
    "# First number = number of cells (rows); second number = number of variants (columns)\n",
    "sample.dna.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf558be6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Before subsetting, verify that all the chosen variants are in the current sample.dna ids (should return True)\n",
    "print(set(final_vars).issubset(set(sample.dna.ids())))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c5efddb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Subsetting sample.dna (columns) based on reduced variant list. Keeping all cells that passed filtering\n",
    "sample.dna = sample.dna[:, final_vars]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c6fa585",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the shape of the final filtered DNA object, i.e. (number of barcodes/cells, number of ids/variants)\n",
    "sample.dna.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab4ca422",
   "metadata": {},
   "source": [
    "### Annotation Addition\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69c07055",
   "metadata": {},
   "outputs": [],
   "source": [
    "help(sample.dna.get_annotations)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2fc52fbc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch annotations using varsome\n",
    "# Note: run this on a filtered DNA sample - too many variants (e.g., 100+) are not handled correctly by the method\n",
    "ann = sample.dna.get_annotations()  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df3623e5",
   "metadata": {},
   "source": [
    "### Variant selection and Subclone identification"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51a5c886",
   "metadata": {},
   "source": [
    "In this section of the notebook, all variants remaining in the data will populate in a variant table. This table is interaction, variants can be selected, and rows can be sorted by ascending/descending values. The variant name can be clicked on and will navigate to the variants varsome url in your default browser.\n",
    "1. Variants selected in this table will populate in a subclone table below. \n",
    "2. The variants in the subclone table can be highlighted and assessed for Read Depth, Genotype Quality and Variant Allele Frequency. \n",
    "3. Subclones can be renamed by clicking on the pencil icon. \n",
    "4. <b>ADO score:</b> The ADO score can be adjusted, but by default is set to 1. Any clones with an ADO score higher this will be moved into the ADO subclones column. We recommend moving any clones with a score of .8 or higher into this column. \n",
    "5. <b> Min clone size: </b> The Min Clone Size can also be adjusted, but by default is set to 1. Any clones that represent less than 1% of the sample will be moved into the Small Subclones column. \n",
    "6. Cells with missing genotype information across any of the selected variants will be moved into the missing GT column. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54815d1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "#Run the variant table workflow to select variants and begin clone identification\n",
    "wfv = ms.workflows.VariantSubcloneTable(sample)\n",
    "wfv.run()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1449ea81",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Subsample the Variants to only the variants selected from the workflow\n",
    "variants = wfv.selected_variants\n",
    "sample.dna = sample.dna[:,variants]\n",
    "\n",
    "# Save the full set of cells to a new variable that you can call on later\n",
    "# Do this before renaming the variants\n",
    "full_sample = sample[:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11f9e68a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rename your variants\n",
    "# Any of these column values can be added to the id names\n",
    "sample.dna.col_attrs.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2982e13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add annotation to the id names\n",
    "sample.dna.set_ids_from_cols(['annotated Gene', 'id'])\n",
    "\n",
    "# Another example:\n",
    "# sample.dna.set_ids_from_cols(['annotated Gene', 'CHROM', 'POS', 'REF', 'ALT'])\n",
    "\n",
    "# Annotations are now added to the variants\n",
    "sample.dna.ids()\n",
    "\n",
    "# Use sample.dna.reset_ids() to get the original ids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "701f2e71",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot heatmap using NGT_FILTERED.\n",
    "fig = sample.dna.heatmap(attribute='NGT_FILTERED')\n",
    "go.Figure(fig)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e18776b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clone removal\n",
    "# Remove barcodes from the missing GT, small subclones, ADO subclones or FP labels\n",
    "clones = ['missing', 'small', 'ADO', 'FP']\n",
    "for c in clones:\n",
    "    cells = sample.dna.barcodes({c})\n",
    "    if len(cells) > 0:\n",
    "        sample.dna = sample.dna.drop(cells)\n",
    "set(sample.dna.get_labels())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7ecc9b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Redraw heatmap\n",
    "fig = sample.dna.heatmap(attribute='NGT_FILTERED')\n",
    "go.Figure(fig)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2c104c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can use the following line of code to change the color for heatmaps in the DNA, CNV or protein sections\n",
    "# And rotate the tick labels\n",
    "fig = sample.dna.heatmap(attribute='NGT_FILTERED')\n",
    "fig.update_layout(title = sample.name)\n",
    "fig.update_xaxes(tickangle = -45)\n",
    "fig.layout.coloraxis.colorscale = 'viridis'\n",
    "fig"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "297c5dd5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluate new total number of cells after the above filtering\n",
    "sample.dna.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4b4f9c0",
   "metadata": {},
   "source": [
    "### Adjusting subclone colors or heatmap colors"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee566925",
   "metadata": {},
   "source": [
    "If you want to change the color palette used for the subclones, you can use `set_palette`. For this you will need to provide a list of ALL subclone names, pointing to the hexcode/color you want to change that subclone to. Note: this function also works with the protein and CNV assays. See below for an example of this.\n",
    "\n",
    "If you want to change the color palette used for any of the assays/layers, you can use `ms.Config.Colorscale`. Then you can list the assay and layer, and assign the plotly colorscale you would like to use for those graphics. You can visualized all available colorscales by using `plotly.colors.sequential.swatches_continuous().show()`. If you want to reset all color palettes back to their defaults, you can use `ms.Config.Colorscle.reset()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b58488e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example of changing the colors assigned to each clone\n",
    "sample.dna.set_palette({'Clone 1': '#800080', 'Clone 2': '#FF69B4'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc419d77",
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can change colorScale to change the heatmap colors\n",
    "# This will change the color of the dna.ngt layer to be viridis\n",
    "ms.Config.Colorscale.Dna.NGT = 'viridis'\n",
    "fig = sample.dna.heatmap('NGT')\n",
    "fig.show()\n",
    "\n",
    "# If you want to reset this, you can run the following to reset just that one modification\n",
    "#ms.Config.Colorscale.Dna.reset(\"NGT\")\n",
    "# Or you can run this to reset any/all modifications:\n",
    "#ms.Config.Colorscale.reset()\n",
    "# You can change the colorScale for Dna, Cns or Protein"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c1e8fa1",
   "metadata": {},
   "source": [
    "If you ever want to return back to the original population of cells, you can reset the data using: `sample.reset(\"dna\")` This command 'sample.reset' works on all assays, including CNV and protein."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb5b6016",
   "metadata": {},
   "source": [
    "## CNV Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59f01e2f",
   "metadata": {},
   "source": [
    "<b>Topics covered</b>\n",
    "1. Amplicon filtering and ploidy estimation.\n",
    "2. Visualization of ploidy across subclones present"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57773602",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get gene names for amplicons\n",
    "sample.cnv.get_gene_names()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14e155e7",
   "metadata": {},
   "source": [
    "### CNV workflow"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42114c9f",
   "metadata": {},
   "source": [
    "This workflow will normalize all reads and filter amplicons/cells based on the settings set at the beginning of the workflow: <br>\n",
    "1. <b> Amplicon completeness: </b> refers to the minimum percentage of barcodes for an amplicon that must have reads greater than or equal to the minimum read depth set. By default this is set to 50. <br>\n",
    "2. <b> Amplicon read depth: </b> refers to the minimum read depth for each amplicon-barcode combination to not be considered missing. By default this is set to 10. <br>\n",
    "3. <b>Mean cell read depth: </b> refers the minimum mean read depth for a cell to be included in the analysis, otherwise the cell will be removed. By default this is set to 40.<br>\n",
    "4. <b>Diploid clone in DNA: </b> refers to which subclone you are setting as the true diploid population. All ploidy estimates will be calculated in relation to this diploid population. We recommend setting this to your 'WT' population or most parent clone present. <br>\n",
    "    \n",
    "<b> Note: </b> These settings are pretty stringent. If you are expecting large copy number events, you may want to reduce Amplicon Completeness and Amplicon Read Depth, to recover these events.\n",
    "\n",
    "Once the above filters are set, the visualizations can be changed. <br>\n",
    "1. <b> Plot:</b> Can be changed from Heatmap positions, to Heatmap genes, Line-plot positions, Line-plot genes <br>\n",
    "2. <b> Clone for line plot:</b> If one of the line-plot visualizations is selected, only one clone can be shown at a time. This determines which one is plotted. <br>\n",
    "3. <b> X-axis features: </b> If you would prefer to only plot a subset of the data (chromosomes or genes), you can select which chromosomes/genes you would like plotted with this function. Chromosomes can be selected for 'positions' type plots, and Genes can be selected for 'genes' type plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "324a8311",
   "metadata": {},
   "outputs": [],
   "source": [
    "# CNV workflow to filter, normalize and estimate ploidy\n",
    "wfc = ms.workflows.CopyNumber(sample)\n",
    "wfc.run()\n",
    "# Amplicon completeness refers to the min percentage of barcodes for an amplicon that must have reads >= the read_depth\n",
    "# Read depth is the min required depth for each amplicon_barcode to not be considered as missing\n",
    "# Mean cell read depth refers to the min mean read depth for a cell to be included, otherwise it is removed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f960e2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Heatmap with the features ordered by the default amplicon order\n",
    "# If you want to plot just a subset of chromosomes, you can put them in list format for features\n",
    "fig = sample.cnv.heatmap('ploidy', features='positions') #features=['7', '17', '20']\n",
    "\n",
    "# Optionally, restrict the range of ploidy values based on observed/expected CNV events (commented out)\n",
    "#fig.layout.coloraxis.cmax = 4\n",
    "#fig.layout.coloraxis.cmin = 0\n",
    "\n",
    "# Optionally, change the size of the figure:\n",
    "#fig.layout.width = 1600\n",
    "#fig.layout.height = 1500\n",
    "\n",
    "go.Figure(fig)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f2e5222",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Heatmap with the features grouped by the genes\n",
    "# If you want to plot just a subset of genes, you can put them in list format for features\n",
    "fig = sample.cnv.heatmap('ploidy', features='genes', convolve=1) #features=[\"ASXL1\", \"EZH2\",'TP53']\n",
    "\n",
    "# Optionally, update the separating lines to be black\n",
    "#for shape in fig.layout.shapes:\n",
    "#    shape.line.color = '#000000'\n",
    "\n",
    "go.Figure(fig)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d525390d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show heatmap with convolve and subclustering turned off\n",
    "bars = sample.cnv.clustered_barcodes('ploidy', subcluster=False)\n",
    "\n",
    "# This is useful to create \"convolved\" heatmaps which are easier to interpret\n",
    "# With the subclustering off and convolve=20, the noise will be reduced and real signals will be easier to determine\n",
    "fig = sample.cnv.heatmap('ploidy', bars_order=bars, convolve=20)\n",
    "fig.layout.width = 900\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1048e83f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Signature will create a dataframe based on the layer you want to look at and which statistical value you want to view\n",
    "# You can see: mean, median, mode, or std with this function\n",
    "sample.cnv.signature('ploidy', 'median')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32faf1a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# signaturemap will plot a heatmap of the signature dataframe created above\n",
    "# The labels list will control the order of the subclones along the y-axis\n",
    "sample.cnv.signaturemap('ploidy') #labels=[]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da8bfcfb",
   "metadata": {},
   "source": [
    "## Combined Visualizations"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8926dab",
   "metadata": {},
   "source": [
    "In this section you will first subset the data to only retain barcodes with remaining DNA and CNV data. Then this data can be plotted together using `sample.heatmap()` and `sample.signaturemap()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3028946",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This will return the barcodes common to all assays in the sample.\n",
    "sample.common_barcodes()\n",
    "\n",
    "# Use that to filter the sample so that only the common barcodes are present in all assays\n",
    "sample = sample[sample.common_barcodes()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c35b4c21",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check dimensionality for each subclass; the number of cells (first number) should be the same in each data set\n",
    "print(sample.dna.shape, sample.cnv.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe50fa03",
   "metadata": {},
   "outputs": [],
   "source": [
    "# DNA + CNV heatmap\n",
    "fig = sample.heatmap(\n",
    "    clusterby=('dna', 'cnv'),  # The first assay is used for the labels\n",
    "    attributes=['AF', 'ploidy'],\n",
    "    features=[None, 'genes'],  # If None, then clustered_ids is used\n",
    "    bars_order=None,  # The order of the barcodes\n",
    "    widths=None,\n",
    "    order=('dna', 'cnv')  # The order in which the heatmaps should be drawn\n",
    ")\n",
    "fig.layout.width = 1200\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b65a707a",
   "metadata": {},
   "outputs": [],
   "source": [
    "help(sample.signaturemap)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2fd5de06",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot a combined signature heatmap, showing DNA and CNV signatures for all subclones\n",
    "fig = sample.signaturemap(\n",
    "    clusterby=('dna', 'cnv'),\n",
    "    attributes=('NGT', 'ploidy'),\n",
    "    features=[None, 'genes'],\n",
    "    signature_kind=['median', 'median'],\n",
    "    widths=None,\n",
    "    order=('dna', 'cnv')  # The order in which the heatmaps should be drawn\n",
    ")\n",
    "fig.layout.width = 1200\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0fde793b-ad62-4c4c-807c-a4a0d4e94521",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ignore warnings raised when running clone_vs_analyte\n",
    "\n",
    "import warnings\n",
    "warnings.simplefilter(action=\"ignore\", category=FutureWarning)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1269fa97",
   "metadata": {},
   "outputs": [],
   "source": [
    "help(sample.clone_vs_analyte)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39fe01ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clone vs analyte\n",
    "# Visualize the CNV data stratified by clone\n",
    "fig = sample.clone_vs_analyte('cnv')\n",
    "plt.gcf().axes[1].texts[-1].set_text(\"\")\n",
    "fig\n",
    "#fig.savefig('genotype_cnv.png')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0f3c84b",
   "metadata": {},
   "source": [
    "## Export and Save Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6e88dc0",
   "metadata": {},
   "source": [
    "In this section you can export a filtered .h5 file, which will contain all new labels/layers, and contain only the filtered barcodes/cells remaining. You can also export all data (row attributes, column attributes and layers) for every assay (DNA and CNV) into easily parsable .csv tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cca6cf34",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save new h5 file that includes only the final, cleaned dataset\n",
    "ms.save(sample, 'FilteredData.h5')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ab2cb0f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# With new code implementation:\n",
    "# Export data into csv formats\n",
    "# DNA, CNV and metadata will be included in the zip\n",
    "ms.to_zip(sample, 'filename')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03f55f63",
   "metadata": {},
   "source": [
    "## Appendix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9c41db9",
   "metadata": {},
   "source": [
    "### DNA signature"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5f6eff9",
   "metadata": {},
   "source": [
    "Using `.signature()` and `.signaturemap()` you can visualize different statistical metrics of your data, including: mean, median, mode and std."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3ede093",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Signature will create a dataframe based on the layer you want to look at and which statistical value you want to view\n",
    "# You can see: mean, median, mode, or std with this function\n",
    "sample.dna.signature('AF', 'median')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1aaa45b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# signaturemap will plot a heatmap of the signature dataframe created above\n",
    "# The labels list will control the order of the subclones along the y-axis\n",
    "sample.dna.signaturemap('AF') #labels=[]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27525ea8",
   "metadata": {},
   "source": [
    "### Compass imputation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70a8cb20",
   "metadata": {},
   "source": [
    "<b> This section of the notebook is OPTIONAL and is still a work in progress</b>\n",
    "    \n",
    "Compass can be used to impute the genotypes of cells with some missing data. It can also be used to infer the phylogeny of all subclones present in the sample. If a cells nature cannot be determined, Compass will label these cells as Mixed or Ambiguous.<b> Compass can take ~1-10 minutes to run depending on the size of the data.</b>\n",
    "    \n",
    "The Compass publication can be found [here](https://www.biorxiv.org/content/10.1101/2022.01.06.475205v3)<br>\n",
    "    \n",
    "<b>Note:</b> Compass can give different subclone composition than the variant subclone workflow. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb7e4bd2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use compass to assume the subclone architecture and phylogeny\n",
    "# Depending on the size and complexity of the data, this step can take ~1-10 minutes\n",
    "# Use full_sample instead of sample to include all previously removed cells and use the unannotated variant names\n",
    "compass = COMPASS(full_sample, somatic_variants=variants)\n",
    "compass.run()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3e4f2d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# The phylogentic tree prediction by COMPASS\n",
    "fig = compass.plot_tree()\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d3a7dff",
   "metadata": {},
   "outputs": [],
   "source": [
    "# dict of node names pointing to descriptions\n",
    "# compass_labels_ is just the node names\n",
    "# we want the compass_labels to be the node_descriptions\n",
    "labs = compass.labels_  # Stores the labels\n",
    "desc = compass.node_descriptions()  # Stores the dict mapping the label to the description\n",
    "compass_labels = np.array([desc.get(lab, lab) for lab in labs])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9cea8479",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Store the compass label as a row_attr\n",
    "full_sample.dna.add_row_attr('compass_labels', compass_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7087b458",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Store the index positions of the COMPASS ids for the heatmap\n",
    "idx = np.isin(full_sample.dna.ids(), compass.somatic_variants)\n",
    "\n",
    "# Visualize the assignment using the variants passed to COMPASS\n",
    "fig = full_sample.dna.heatmap('NGT', features=full_sample.dna.ids()[idx], splitby='compass_labels')\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2db20949",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This will return a dataframe \n",
    "# Showing the overlap of variant workflow clones and compass identified clones\n",
    "full_sample.dna.crosstab(compass_labels, normalize='columns')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a43091a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This will plot a heatmap of the crosstab dataframe\n",
    "full_sample.dna.crosstabmap(compass_labels, normalize='columns').show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}