{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ed3f1712-b787-4529-8233-f3f32ba248d2",
   "metadata": {},
   "source": [
    "# Basic functionalities\n",
    "\n",
    "The easiest way to get a prediction with CHAMOIS is to run the `chamois predict` command with a query BGC given as a GenBank record. \n",
    "For now, let's use [BGC0000703](https://mibig.secondarymetabolites.org/repository/BGC0000703.4/index.html#r1c1), the MIBiG BGC\n",
    "producing [kanamycin](https://pubchem.ncbi.nlm.nih.gov/compound/6032) in *Streptomyces kanamyceticus*. The record was pre-downloaded\n",
    "from MIBiG in GenBank format."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab1971ec-fb4b-4bf3-93ec-671200aefa0b",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "Note\n",
    "\n",
    "This notebook calls the CHAMOIS CLI with the `chamois.cli.run` function. This is equivalent to calling the `chamois` command line in your shell, it's only done here to integrate with the documentation generator. For instance, calling:\n",
    "```python\n",
    "chamois.cli.run([\"predict\"])\n",
    "```\n",
    "is equivalent to running\n",
    "```bash\n",
    "$ chamois predict\n",
    "```\n",
    "in the console.\n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27addcd4-1859-43bf-af81-6d32845a1c17",
   "metadata": {},
   "outputs": [],
   "source": [
    "import chamois.cli\n",
    "chamois.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6b630d7-a2cb-4e89-b54c-2f5319303524",
   "metadata": {},
   "source": [
    "## Running predictions\n",
    "\n",
    "Use the `chamois predict` command to run ChemOnt class predictions with CHAMOIS:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b77aace-1ecf-4f2d-b411-aeb8b8e8d1d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# $ chamois predict -i data/BGC0000703.4.gbk -o data/BGC0000703.4.hdf5\n",
    "chamois.cli.run([\"predict\", \"-i\", \"data/BGC0000703.4.gbk\", \"-o\", \"data/BGC0000703.4.hdf5\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae6102f8-8e1c-4709-aef0-a519ad00fd48",
   "metadata": {},
   "source": [
    "The resulting HDF5 file can be opened with the `anndata` package for further analysis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98cc3ab1-0905-4965-b381-a870f63311e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import anndata\n",
    "data = anndata.read_h5ad(\"data/BGC0000703.4.hdf5\")\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5cae750d-2bea-4027-ba05-a6d8e8b4d79e",
   "metadata": {},
   "source": [
    "The observations (`data.obs`) store the metadata about the query BGCs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ca235af-2031-4fb4-b89b-cacd541c2def",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.obs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0154c5c1-8e01-4bd3-9841-4c24fe60becc",
   "metadata": {},
   "source": [
    "The variables (`data.var`) store the metadata about the chemical classes predicted by the CHAMOIS predictor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6440e09b-0b3b-4722-9c5c-0c9376af749a",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.var"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c77f23f5-720f-49a0-a00e-aaf99665369b",
   "metadata": {},
   "source": [
    "## Visualizing results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ebfbdac-9424-4e63-99ee-bae9a8b3670a",
   "metadata": {},
   "source": [
    "The resulting file is a HDF5 format file contains the class probabilities for each of the records in the input GenBank file. The CLI can be used to quickly inspect the predicted classes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec7b2372-f5bb-4969-8102-9a37cd9cebdc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# $ chamois render -i data/BGC0000703.4.hdf5\n",
    "chamois.cli.run([\"render\", \"-i\", \"data/BGC0000703.4.hdf5\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3aa61bb1-5b4d-4863-ad7a-e39692c75d26",
   "metadata": {},
   "source": [
    "## Screening predictions\n",
    "\n",
    "Once predictions have been made, they can be screened with a particular query metabolite to see which BGC is the most likely to predict that metabolite. Let's try with the kanamycin as a sanity check. Molecules can be passed to `chamois compare` as either SMILES, InChi, or InChi Key.\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "Info\n",
    "\n",
    "Passing a SMILES or an InChi requires the additional Python dependency `rdkit` \n",
    "to handle conversion to InChi Key.\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb707c78-2456-459a-8a36-d41da98ae279",
   "metadata": {},
   "outputs": [],
   "source": [
    "# $ chamois compare -i data/BGC0000703.4.hdf5 -q SBUJHOSQTJFQJX-NOAMYHISSA-N --render\n",
    "chamois.cli.run([\"compare\", \"-i\", \"data/BGC0000703.4.hdf5\", \"-q\", 'SBUJHOSQTJFQJX-NOAMYHISSA-N', \"--render\" ])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "656abe7c-6732-4f50-90cb-46557760e19a",
   "metadata": {},
   "source": [
    "## Searching a catalog"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b3bb56f-d5ee-4cb4-a50a-ce39e485475c",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "Warning\n",
    "\n",
    "This feature is experimental and has not been properly evaluated. Use with caution.\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "908e14c0-4ad6-46a0-8378-e33d925b037f",
   "metadata": {},
   "source": [
    "The predictions can be used to search a catalog of compounds encoded as a `classes.hdf5` file, similar to what CHAMOIS uses for training. For instance, we can search which compound of MIBiG 3.1 is most similar to our prediction; hopefully we should get BGC0000703 among the top hits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32016efd-0863-43ea-bd0e-1d0756f59a3f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# $ chamois search -i data/BGC0000703.4.hdf5 --catalog ../../data/datasets/mibig3.1/classes.hdf5 --render\n",
    "chamois.cli.run([\"search\", \"-i\", \"data/BGC0000703.4.hdf5\", \"--catalog\", \"../../data/datasets/mibig3.1/classes.hdf5\", \"--render\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c3b4449-bf9d-47b9-8a56-86f950479797",
   "metadata": {},
   "source": [
    "## Interpreting a prediction\n",
    "\n",
    "The `chamois explain` command allows obtaining additional information about a prediction made by CHAMOIS. It must be passed the original sequences of the BGCs, will re-annotate the genes, and will inspect the model weights to break down the prediction made by CHAMOIS into individual contributions from each genes, making it easier to understand the functions of the individual genes of the BGC. We call the `chamois explain` command with the `--cds` argument to ensure that the gene coordinates and identifiers are those already defined in the GenBank record:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a2a07e3-f887-4c7d-8c0a-352709fd808d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# $ chamois explain --cds -i data/BGC0000703.4.gbk -o data/BGC0000703.4.tsv\n",
    "chamois.cli.run([\"explain\", \"cluster\", \"--cds\", \"-i\", \"data/BGC0000703.4.gbk\", \"-o\", \"data/BGC0000703.4.tsv\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e20cf8c-2b95-49ed-83f8-58e62a912c35",
   "metadata": {},
   "source": [
    "The output is a table that shows the contribution of the genes of the BGC to each of the predicted classes. It can be easily loaded with `pandas`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddbfab41-5ab2-4393-9488-386e8cbf9d95",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas\n",
    "table = pandas.read_table(\"data/BGC0000703.4.tsv\")\n",
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c0ea626-c786-4f4b-aa5c-b2b0ac0da49f",
   "metadata": {},
   "source": [
    "For instance, to see which genes contribute significantly to the prediction of the BGC compound to CHEMONTID:0000282 (Aminoglycosides), we can extract the relevant row from the table and filter for genes with weight greater than 2.0:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a2f25a3-adb4-4cd2-8550-d3795c8ad688",
   "metadata": {},
   "outputs": [],
   "source": [
    "w = table.set_index(\"class\").loc[\"CHEMONTID:0000282\"].drop([\"name\", \"probability\"])\n",
    "w[w >= 2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc517abd-be47-4547-81ee-1ab67d74d72e",
   "metadata": {},
   "source": [
    "These two genes are actually [DegT/DnrJ/EryC1/StrS-family aminotransferases](https://www.ebi.ac.uk/interpro/entry/InterPro/IPR000653/), which are also found in the biosynthesic pathways of [streptidine](https://pubchem.ncbi.nlm.nih.gov/compound/439323) (one of the aminoglycoside moieties of [streptomycin](https://pubchem.ncbi.nlm.nih.gov/compound/19649)) or [rifamycin B](https://pubchem.ncbi.nlm.nih.gov/compound/5459948)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}