{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "88ab0202-500e-4690-8f22-45c7359267fc",
   "metadata": {},
   "source": [
    "# Explain a prediction\n",
    "\n",
    "This model demonstrates the use of the CHAMOIS API to establish links between the genes of a query cluster and the ChemOnt classes of the putative metabolite as predicted by CHAMOIS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "726085a3-778b-4404-8623-b1e25708e05d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import chamois\n",
    "chamois.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51ffcc30-9f35-4b6e-a92b-8e81516d44c6",
   "metadata": {},
   "source": [
    "## Loading data\n",
    "\n",
    "Use `gb-io` to load the GenBank record for a cluster into a dedicated `ClusterSequence` object. Let's use [AB746937.1](https://www.ncbi.nlm.nih.gov/nuccore/AB746937.1), the biosynthetic gene cluster for [muraminomicin](https://pubchem.ncbi.nlm.nih.gov/compound/145720725) found in *Streptosporangium amethystogenes*. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4cef6233-6886-4da1-b75e-090764574728",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gb_io\n",
    "import chamois.model\n",
    "records = gb_io.load(\"data/AB746937.1.gbk\")\n",
    "clusters = [chamois.model.ClusterSequence(records[0])]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "272a8a41-ee8a-41e2-bf20-5d11b5e16d0b",
   "metadata": {},
   "source": [
    "## Calling genes\n",
    "\n",
    "You can use the `chamois.orf` module to call the genes inside one or more `ClusterSequence` objects. Since the source GenBank record has already gene called (in `CDS` features, with the gene name added in the `/protein_id` qualifier), we can skip gene calling and simply extract the already-present genes. For this, we use a `CDSFinder`: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9d6ce37-162f-4031-a532-ebe7265584d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from chamois.orf import CDSFinder\n",
    "orf_finder = CDSFinder(locus_tag=\"protein_id\")\n",
    "proteins = list(orf_finder.find_genes(clusters))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c51d185-cbe3-47b7-a196-9c6230a92f2a",
   "metadata": {},
   "source": [
    "## Extracting features\n",
    "\n",
    "Once we have a list of proteins, we need to annotate them with protein domains. CHAMOIS is distributed with the Pfam HMMs required by the CHAMOIS predictor, so we can simply use these and run the default annotation with a `PfamAnnotator` object: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fcdc0473-dad8-4f26-a133-df06ed1ad5a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from chamois.domains import PfamAnnotator\n",
    "annotator = PfamAnnotator()\n",
    "domains = list(annotator.annotate_domains(proteins))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d89e38f-8268-4c76-8791-51010d10e2f2",
   "metadata": {},
   "source": [
    "## Building compositional matrices\n",
    "\n",
    "We now have a list of domains, but we want to turn these domains into a matrix of presence/absence of each Pfam domain in each gene cluster. To do so, let's first load the trained CHAMOIS predictor, so we know which features we need to extract: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99e4bca2-25a6-40e5-9fea-bcf29b974072",
   "metadata": {},
   "outputs": [],
   "source": [
    "from chamois.predictor import ChemicalOntologyPredictor\n",
    "predictor = ChemicalOntologyPredictor.trained()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84bd922c-e9b5-448a-b4e9-e4ea0b4f9849",
   "metadata": {},
   "source": [
    "Then simply build the observations table (from the source clusters), and the actual compositional data matrix, returned as an `AnnData` object to preserve observation and feature metadata:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f430595-2a32-4b48-b51b-fab1b395214c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import chamois.compositions \n",
    "obs = chamois.compositions.build_observations(clusters)\n",
    "data = chamois.compositions.build_compositions(domains, obs, predictor.features_)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "356b1585-4870-4812-897c-cd40b256b7f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.var_vector(clusters[0].id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17c7838f-20b0-4aa9-94e5-fe4fe0bc053d",
   "metadata": {},
   "source": [
    "## Infer chemical classes\n",
    "\n",
    "With the compositional matrix ready, we can simply call the `predict_probas` method on the predictor to get the class probabilities predicted by CHAMOIS:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4a05be4-f31c-46f5-8515-6ffa2a460ce5",
   "metadata": {},
   "outputs": [],
   "source": [
    "probas = predictor.predict_probas(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f64cf4d-e04d-4cfb-9508-24e41cd78d74",
   "metadata": {},
   "source": [
    "`probas` is a NumPy array containing probabilities for each of the classes of the model. We can turn these predictions into a table retaining the metadata from the original predictor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c973a1fb-28e2-4463-8fd2-82b13a4405e9",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "classes = predictor.classes_.copy()\n",
    "classes['probability'] = probas[0]\n",
    "classes[classes['probability'] > 0.5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25e4838c-0b0f-4813-82ef-a5b590b0ac6e",
   "metadata": {},
   "source": [
    "## Build gene contribution table\n",
    "\n",
    "Now that we have the predictions, we can inspect the model to explain which genes of the cluster contributed to the prediction of each class. This can be done in the command line with the `chamois explain cluster` subcommand, or programmatically:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54012fa0-9e43-4d64-8eea-fca3794e4bbf",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from chamois.cli.explain import build_genetable\n",
    "genetable = build_genetable(proteins, domains, predictor, probas).set_index(\"class\")\n",
    "genetable"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4d2bb9d-ae9e-4126-9850-2018613e70c6",
   "metadata": {},
   "source": [
    "## Render cluster\n",
    "\n",
    "Now that we have a table summarizing the role of every cluster gene in the prediction of each ChemOnt class, we can render the genomic locus of the BGC with additional information about the function of each gene. Let's restrict to 5 specific classes with the lowest amount of training examples in MIBiG 3.1:  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8feb8f43-fb75-4729-aa24-7cb7c3afa89e",
   "metadata": {},
   "outputs": [],
   "source": [
    "top = predictor.classes_.loc[genetable.index].sort_values(\"n_positives\").head(5).index\n",
    "predictor.classes_.loc[top]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9413b2f-08bc-400c-82d0-81727407085d",
   "metadata": {},
   "source": [
    "We can now plot the cluster while colouring the genes according to which ChemOnt class they contribute the most, highlighting their function in the biosynthetic pathway. For the display, let's use the `dna-features-viewer` library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "643e6f67-c2f9-4e85-bb64-a1ec5e27138c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.lines import Line2D\n",
    "from matplotlib.patches import Patch\n",
    "from dna_features_viewer import GraphicFeature, GraphicRecord\n",
    "from palettable.cartocolors.qualitative import Vivid_10\n",
    "from palettable.cartocolors.sequential import *\n",
    "\n",
    "# build a palette\n",
    "palette = dict(zip(top, [Purp_2, Sunset_5, DarkMint_2, Magenta_2, Teal_5, BluGrn_2]))\n",
    "fig = plt.figure(figsize=(12, 6))\n",
    "\n",
    "# extract CDS features from the record\n",
    "features = []\n",
    "for feature in filter(lambda f: f.kind == \"CDS\", records[0].features):\n",
    "    # get the name and product of the gene\n",
    "    label = next(q.value for q in feature.qualifiers if q.key == \"protein_id\")\n",
    "    product = next((q.value for q in feature.qualifiers if q.key == \"product\"), None)\n",
    "    if product.startswith(\"putative\"):\n",
    "        product = product[9:]\n",
    "    # get the coordinates\n",
    "    start = feature.location.start\n",
    "    end = feature.location.end\n",
    "    if feature.location.strand == \"-\":\n",
    "        start, end = end, start\n",
    "    # get the colour of the gene based on contribution weight\n",
    "    weights = genetable[label].loc[top]\n",
    "    if any(weights >= 1):\n",
    "        best = weights.index[weights.argmax()]\n",
    "        color = palette[best].hex_colors[1]\n",
    "    else:\n",
    "        color = \"#c0c0c0\"\n",
    "    # record the feature\n",
    "    features.append(GraphicFeature(\n",
    "        start=start,\n",
    "        end=end,\n",
    "        strand=-1 if feature.location.strand == \"-\" else 1,\n",
    "        color=color,\n",
    "        label=None if color == \"#c0c0c0\" else product,\n",
    "    ))\n",
    "\n",
    "# render the feature records\n",
    "record = GraphicRecord(sequence=records[0].sequence, features=features)\n",
    "record.plot(ax=plt.gca())\n",
    "\n",
    "# add legend\n",
    "legend_elements = [\n",
    "    Patch(\n",
    "        facecolor=v.hex_colors[1], \n",
    "        edgecolor='black', \n",
    "        label=f\"{k} - {predictor.classes_.name.loc[k]}\"\n",
    "    )\n",
    "    for k,v in palette.items()\n",
    "]\n",
    "\n",
    "# create the figure\n",
    "plt.legend(handles=legend_elements, loc='upper left')\n",
    "plt.title(\"AB746937.1 - muraminomicin\")\n",
    "fig.tight_layout()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}