Explain a prediction#
This example demonstrates how to use the CHAMOIS API to link the genes of a query cluster to the ChemOnt classes that CHAMOIS predicts for its putative metabolite.
[1]:
import chamois
chamois.__version__
[1]:
'0.2.1'
Loading data#
Use gb-io to load the GenBank record for a cluster into a dedicated ClusterSequence object. Let’s use AB746937.1, the biosynthetic gene cluster for muraminomicin found in Streptosporangium amethystogenes.
[2]:
import gb_io
import chamois.model
records = gb_io.load("data/AB746937.1.gbk")
clusters = [chamois.model.ClusterSequence(records[0])]
Calling genes#
You can use the chamois.orf module to call the genes inside one or more ClusterSequence objects. Since the genes of the source GenBank record have already been called (as CDS features, with the gene name stored in the /protein_id qualifier), we can skip gene calling and simply extract the existing genes. For this, we use a CDSFinder:
[3]:
from chamois.orf import CDSFinder
orf_finder = CDSFinder(locus_tag="protein_id")
proteins = list(orf_finder.find_genes(clusters))
Extracting features#
Once we have a list of proteins, we need to annotate them with protein domains. CHAMOIS is distributed with the Pfam HMMs required by the CHAMOIS predictor, so we can simply use these and run the default annotation with a PfamAnnotator object:
[4]:
from chamois.domains import PfamAnnotator
annotator = PfamAnnotator()
domains = list(annotator.annotate_domains(proteins))
Building compositional matrices#
We now have a list of domains, but we want to turn these domains into a matrix of presence/absence of each Pfam domain in each gene cluster. To do so, let’s first load the trained CHAMOIS predictor, so we know which features we need to extract:
[5]:
from chamois.predictor import ChemicalOntologyPredictor
predictor = ChemicalOntologyPredictor.trained()
Then simply build the observations table (from the source clusters), and the actual compositional data matrix, returned as an AnnData object to preserve observation and feature metadata:
[6]:
import chamois.compositions
obs = chamois.compositions.build_observations(clusters)
data = chamois.compositions.build_compositions(domains, obs, predictor.features_)
data
[6]:
AnnData object with n_obs × n_vars = 1 × 896
obs: 'length'
var: 'name', 'description', 'kind'
[7]:
data.var_vector(clusters[0].id)
[7]:
array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
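To make the presence/absence encoding concrete, here is a minimal pure-Python sketch of the idea. It is not the implementation of build_compositions: the helper name build_presence_matrix, the cluster identifier, and the accessions used as input are all hypothetical and chosen only for illustration.

```python
def build_presence_matrix(hits, cluster_ids, feature_ids):
    """Return one row per cluster, with 1 where a feature was observed and 0 elsewhere."""
    column = {feature: j for j, feature in enumerate(feature_ids)}
    rows = {cluster: [0] * len(feature_ids) for cluster in cluster_ids}
    for cluster, feature in hits:
        # ignore hits for unknown clusters or for features the model does not use
        if cluster in rows and feature in column:
            rows[cluster][column[feature]] = 1
    return [rows[cluster] for cluster in cluster_ids]


# hypothetical input: (cluster id, domain accession) pairs
hits = [
    ("cluster_1", "PF00109"),
    ("cluster_1", "PF00668"),
    ("cluster_1", "PF00668"),  # a repeated hit still yields a single 1
    ("cluster_1", "PF99999"),  # not among the model features, so it is dropped
]
matrix = build_presence_matrix(
    hits,
    cluster_ids=["cluster_1"],
    feature_ids=["PF00109", "PF00501", "PF00668"],
)
print(matrix)  # [[1, 0, 1]]
```

Restricting the columns to the features known to the predictor is what keeps the matrix aligned with the trained model, which is why the predictor is loaded before the matrix is built.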
Infer chemical classes#
With the compositional matrix ready, we can simply call the predict_probas method on the predictor to get the class probabilities predicted by CHAMOIS:
[8]:
probas = predictor.predict_probas(data)
probas is a NumPy array containing probabilities for each of the classes of the model. We can turn these predictions into a table retaining the metadata from the original predictor:
[9]:
classes = predictor.classes_.copy()
classes['probability'] = probas[0]
classes[classes['probability'] > 0.5]
[9]:
| | name | description | n_positives | information_accretion | probability |
|---|---|---|---|---|---|
| CHEMONTID:0000002 | Organoheterocyclic compounds | Compounds containing a ring with least one car... | 1325 | 0.000000 | 0.999824 |
| CHEMONTID:0000011 | Carbohydrates and carbohydrate conjugates | Monosaccharides, disaccharides, oligosaccharid... | 341 | 2.203840 | 0.996191 |
| CHEMONTID:0000012 | Lipids and lipid-like molecules | Fatty acids and their derivatives, and substan... | 679 | 0.000000 | 0.991993 |
| CHEMONTID:0000013 | Amino acids, peptides, and analogues | Organic compounds containing an amino acid or ... | 851 | 0.575324 | 0.652714 |
| CHEMONTID:0000050 | Lactones | Cyclic esters of hydroxy carboxylic acids, con... | 410 | 1.692297 | 0.906615 |
| CHEMONTID:0000075 | Pyrimidines and pyrimidine derivatives | Compounds containing a pyrimidne ring, which i... | 79 | 0.340075 | 0.664085 |
| CHEMONTID:0000129 | Alcohols and polyols | Organic compounds in which at least one hydrox... | 1075 | 0.547347 | 0.996302 |
| CHEMONTID:0000160 | Lactams | Compounds containing a lactam ring, which is a... | 479 | 1.467895 | 0.953170 |
| CHEMONTID:0000254 | Ethers | Compounds bearing an ether group with the form... | 603 | 1.381453 | 0.998986 |
| CHEMONTID:0000261 | Phenylpropanoids and polyketides | Organic compounds that are synthesized either ... | 517 | 0.000000 | 0.762557 |
| CHEMONTID:0000264 | Organic acids and derivatives | Compounds an organic acid or a derivative ther... | 1429 | 0.000000 | 0.999968 |
| CHEMONTID:0000265 | Carboxylic acids and derivatives | Compounds containing a carboxylic acid group w... | 1268 | 0.172451 | 0.999968 |
| CHEMONTID:0000278 | Organonitrogen compounds | Organic compounds containing a nitrogen atom. | 1284 | -0.000000 | 0.999836 |
| CHEMONTID:0000291 | Pyrimidones | Compounds that contain a pyrimidine ring, whic... | 28 | 1.496426 | 0.549593 |
| CHEMONTID:0000323 | Organooxygen compounds | Organic compounds containing a bond between a ... | 1571 | 0.002752 | 0.999788 |
| CHEMONTID:0000324 | Fatty acid esters | Carboxylic ester derivatives of a fatty acid. | 116 | 2.341691 | 0.896289 |
| CHEMONTID:0000331 | Fatty amides | Carboxylic acid amide derivatives of fatty aci... | 398 | 0.563048 | 0.978401 |
| CHEMONTID:0000346 | Dicarboxylic acids and derivatives | Organic compounds containing exactly two carbo... | 205 | 2.628859 | 0.537920 |
| CHEMONTID:0000348 | Peptides | Compounds containing an amide derived from two... | 328 | 1.375463 | 0.652714 |
| CHEMONTID:0000364 | Organic carbonic acids and derivatives | Compounds comprising the organic carbonic acid... | 69 | 4.372266 | 0.885487 |
| CHEMONTID:0000469 | Monoalkylamines | Organic compounds containing an primary alipha... | 297 | 0.147625 | 0.953351 |
| CHEMONTID:0000475 | Carboxylic acid amides | Carboxylic acid derivatives containing a carbo... | 781 | 0.472970 | 0.995798 |
| CHEMONTID:0000517 | Ureas | Compounds containing two amine groups joined b... | 33 | 1.064130 | 0.871849 |
| CHEMONTID:0001093 | Carboxylic acid derivatives | Derivatives of carboxylic acid. | 1084 | 0.226190 | 0.999872 |
| CHEMONTID:0001096 | N-acyl amines | Compounds containing a fatty acid moiety linke... | 366 | 0.120925 | 0.731227 |
| CHEMONTID:0001167 | Dialkyl ethers | Organic compounds containing the dialkyl ether... | 333 | 0.856636 | 0.998986 |
| CHEMONTID:0001238 | Carboxylic acid esters | Carboxylic acid derivatives in which the carbo... | 547 | 0.986752 | 0.998772 |
| CHEMONTID:0001346 | Diazines | Organic compounds containing a five-member het... | 100 | 3.727920 | 0.664085 |
| CHEMONTID:0001542 | Disaccharides | Compounds containing two carbohydrate moieties... | 58 | 2.555647 | 0.591085 |
| CHEMONTID:0001656 | Acetals | Compounds having the structure R2C(OR')2 ( R' ... | 274 | 1.137982 | 0.989885 |
| CHEMONTID:0001661 | Secondary alcohols | Compounds containing a secondary alcohol funct... | 816 | 0.397696 | 0.996302 |
| CHEMONTID:0001664 | Tertiary carboxylic acid amides | Compounds containing an amide derivative of ca... | 301 | 1.375559 | 0.854960 |
| CHEMONTID:0001831 | Carbonyl compounds | Organic compounds containing a carbonyl group,... | 1255 | 0.323996 | 0.927188 |
| CHEMONTID:0002012 | Oxanes | Compounds containing an oxane (tetrahydropyran... | 365 | 1.860024 | 0.987434 |
| CHEMONTID:0002105 | Glycosyl compounds | Carbohydrate derivatives in which a sugar grou... | 239 | 0.512761 | 0.899443 |
| CHEMONTID:0002202 | Hydropyrimidines | Compounds containing a hydrogenated pyrimidine... | 35 | 1.174498 | 0.664085 |
| CHEMONTID:0002207 | O-glycosyl compounds | Glycoside in which a sugar group is bonded thr... | 186 | 0.361708 | 0.855456 |
| CHEMONTID:0002449 | Amines | Compounds formally derived from ammonia by rep... | 560 | 1.197146 | 0.953351 |
| CHEMONTID:0002450 | Primary amines | Amines having the nitrogen atom linked to exac... | 329 | 0.767339 | 0.953351 |
| CHEMONTID:0003890 | Vinylogous amides | Organic compounds containing an amine group, w... | 106 | 3.752870 | 0.938973 |
| CHEMONTID:0003909 | Fatty Acyls | Organic molecules synthesized by chain elongat... | 588 | 0.207595 | 0.991993 |
| CHEMONTID:0003940 | Organic oxides | Organic compounds containing an oxide group. | 1447 | 0.121371 | 0.997032 |
| CHEMONTID:0004139 | Azacyclic compounds | Organic compounds containing an heterocycle wi... | 916 | 0.532573 | 0.972626 |
| CHEMONTID:0004140 | Oxacyclic compounds | Organic compounds containing an heterocycle wi... | 860 | 0.623584 | 0.999824 |
| CHEMONTID:0004144 | Heteroaromatic compounds | Compounds containing an aromatic ring where a ... | 484 | 1.452913 | 0.929851 |
| CHEMONTID:0004150 | Hydrocarbon derivatives | Derivatives of hydrocarbons obtained by substi... | 1584 | 0.000000 | 0.998957 |
| CHEMONTID:0004557 | Organopnictogen compounds | Compounds containing a bond between carbon a p... | 1103 | 0.000000 | 0.989550 |
| CHEMONTID:0004603 | Organic oxygen compounds | Organic compounds that contain one or more oxy... | 1574 | 0.000000 | 0.999788 |
| CHEMONTID:0004707 | Organic nitrogen compounds | Organic compounds containing a nitrogen atom. | 1284 | 0.000000 | 0.999836 |
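The 0.5 cut-off used above is only one possible choice. As a rough sketch of how the threshold changes the selection, the snippet below filters a handful of rows copied from the table with plain Python; the helper name select_classes is hypothetical and not part of CHAMOIS.

```python
# (class id, name, probability) rows copied from the table above
predictions = [
    ("CHEMONTID:0000264", "Organic acids and derivatives", 0.999968),
    ("CHEMONTID:0000291", "Pyrimidones", 0.549593),
    ("CHEMONTID:0000346", "Dicarboxylic acids and derivatives", 0.537920),
    ("CHEMONTID:0000517", "Ureas", 0.871849),
    ("CHEMONTID:0001542", "Disaccharides", 0.591085),
]


def select_classes(rows, threshold=0.5):
    """Keep the rows whose probability exceeds the threshold, most probable first."""
    kept = [row for row in rows if row[2] > threshold]
    return sorted(kept, key=lambda row: row[2], reverse=True)


lenient = select_classes(predictions, threshold=0.5)
strict = select_classes(predictions, threshold=0.9)
print([name for _, name, _ in lenient])
print([name for _, name, _ in strict])  # ['Organic acids and derivatives']
```

A stricter threshold keeps only the confident predictions, at the cost of dropping the rarer classes that sit just above 0.5.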
Build gene contribution table#
Now that we have the predictions, we can inspect the model to explain which genes of the cluster contributed to the prediction of each class. This can be done from the command line with the chamois explain cluster subcommand, or programmatically:
[10]:
from chamois.cli.explain import build_genetable
genetable = build_genetable(proteins, domains, predictor, probas).set_index("class")
genetable
[10]:
| class | name | probability | BAM98946.1 | BAM98947.1 | BAM98948.1 | BAM98949.1 | BAM98950.1 | BAM98951.1 | BAM98952.1 | BAM98953.1 | ... | BAM98989.1 | BAM98990.1 | BAM98991.1 | BAM98992.1 | BAM98993.1 | BAM98994.1 | BAM98995.1 | BAM98996.1 | BAM98997.1 | BAM98998.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CHEMONTID:0000002 | Organoheterocyclic compounds | 0.999824 | 0.229691 | -0.996043 | 0.000000 | 0.000000 | 0.000000 | 0.415219 | 0.0 | 0.000000 | ... | 0.0 | 0.366459 | 1.001705 | 0.797974 | 0.757944 | -0.063791 | 0.000000 | -0.351779 | 0.000000 | 0.757944 |
| CHEMONTID:0000011 | Carbohydrates and carbohydrate conjugates | 0.996191 | 0.040977 | 0.000000 | 0.837321 | 0.000000 | 0.679754 | 1.548473 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | -0.322292 | -2.242671 | -0.122239 | 0.000000 | 0.507860 | 0.000000 | 0.000000 | -0.509055 |
| CHEMONTID:0000012 | Lipids and lipid-like molecules | 0.991993 | 0.517811 | 0.000000 | 0.394568 | 0.375581 | 1.676843 | 0.780424 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | -0.582604 | -0.282627 | 0.097871 | 0.000000 | -0.089691 | 0.000000 | -0.463023 | -0.210280 |
| CHEMONTID:0000013 | Amino acids, peptides, and analogues | 0.652714 | 0.000000 | 0.000000 | 0.258368 | -0.589437 | 0.000000 | -0.722850 | 0.0 | -1.546893 | ... | 0.0 | 0.000000 | -0.327656 | 5.791538 | -0.274870 | -2.012908 | -0.150229 | 0.000000 | -0.392624 | 0.074017 |
| CHEMONTID:0000050 | Lactones | 0.906615 | 0.275428 | 0.000000 | -0.094451 | 0.000000 | 0.452313 | 0.152143 | 0.0 | -0.160824 | ... | 0.0 | 0.084367 | 0.000000 | 5.938844 | -0.530828 | 0.000000 | 1.263104 | -0.063064 | 0.000000 | -0.530828 |
| CHEMONTID:0000075 | Pyrimidines and pyrimidine derivatives | 0.664085 | -0.904319 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.009511 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | -3.290392 | 0.681991 | 0.000000 | -0.322285 | 0.000000 | 0.000000 | 0.440233 |
| CHEMONTID:0000129 | Alcohols and polyols | 0.996302 | -1.514839 | 0.000000 | 0.021764 | 0.000000 | 0.012490 | 0.751500 | 0.0 | -1.009198 | ... | 0.0 | -0.338625 | 0.000000 | 0.440210 | -0.344414 | 0.000000 | 0.497811 | 0.000000 | -0.561654 | -0.663601 |
| CHEMONTID:0000160 | Lactams | 0.953170 | -0.045158 | 0.000000 | -0.126002 | -0.035427 | -0.643823 | -0.365436 | 0.0 | 0.000000 | ... | 0.0 | 1.883761 | 1.016720 | 1.418510 | -0.398273 | -0.042517 | -0.403283 | 0.000000 | -0.025076 | -0.398273 |
| CHEMONTID:0000254 | Ethers | 0.998986 | -0.344400 | 0.000000 | 1.235817 | -0.661423 | -1.314357 | 1.788580 | 0.0 | 0.000000 | ... | 0.0 | 0.717553 | -0.314751 | -1.197625 | 0.219880 | 0.279220 | 0.076783 | 0.000000 | 0.000000 | 0.219880 |
| CHEMONTID:0000261 | Phenylpropanoids and polyketides | 0.762557 | -0.308363 | 0.805470 | 0.000000 | -0.755150 | 0.818682 | -0.517657 | 0.0 | 0.000000 | ... | 0.0 | 0.190364 | 0.000000 | 2.299565 | -0.874032 | 0.000000 | 0.875051 | 0.000000 | 0.000000 | -0.874032 |
| CHEMONTID:0000264 | Organic acids and derivatives | 0.999968 | -0.127509 | -1.544090 | 0.265956 | 0.000000 | 0.000000 | -0.671037 | 0.0 | 0.000000 | ... | 0.0 | -0.193451 | 0.000000 | 6.378113 | -0.719504 | 0.000000 | 0.182064 | 0.000000 | 0.000000 | -0.082602 |
| CHEMONTID:0000265 | Carboxylic acids and derivatives | 0.999968 | -0.040775 | -0.802258 | 0.882372 | 0.181671 | 1.056387 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | -0.051538 | 4.543270 | -0.497623 | -0.087066 | 0.487317 | 0.000000 | 0.000000 | -0.401802 |
| CHEMONTID:0000278 | Organonitrogen compounds | 0.999836 | 0.000000 | 0.000000 | -0.067827 | 0.000000 | 0.354928 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 11.165793 | 0.724649 | 0.000000 | -0.638330 | 0.000000 | 0.000000 | 0.724649 |
| CHEMONTID:0000291 | Pyrimidones | 0.549593 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.589448 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | -2.669306 | 0.000000 | 0.000000 | -1.207655 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0000323 | Organooxygen compounds | 0.999788 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 5.162930 | 0.000000 | 0.000000 | 0.583948 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0000324 | Fatty acid esters | 0.896289 | 0.000000 | 0.000000 | 0.000000 | -0.897874 | 0.000000 | -0.199484 | 0.0 | 0.000000 | ... | 0.0 | 0.417052 | 0.000000 | 0.986235 | -0.795943 | -0.311974 | 0.378352 | 0.000000 | 0.000000 | -0.795943 |
| CHEMONTID:0000331 | Fatty amides | 0.978401 | 0.072648 | 0.000000 | 0.645894 | 1.090469 | 1.787849 | 0.952662 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 2.011304 | 0.114176 | 0.345940 | -0.024263 | 0.000000 | 0.000000 | -0.057195 |
| CHEMONTID:0000346 | Dicarboxylic acids and derivatives | 0.537920 | -0.093845 | 0.000000 | 0.092662 | 0.000000 | 0.201576 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.691972 | 0.000000 | 0.520992 | -0.543118 | 0.000000 | 0.621471 | 0.000000 | 0.000000 | 0.341106 |
| CHEMONTID:0000348 | Peptides | 0.652714 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | -0.160640 | 0.000000 | 1.855930 | -0.030794 | 0.000000 | -0.698450 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0000364 | Organic carbonic acids and derivatives | 0.885487 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | -0.252501 | 0.000000 | 1.875661 | 0.418598 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.120052 |
| CHEMONTID:0000469 | Monoalkylamines | 0.953351 | 0.012152 | 0.000000 | 0.000000 | -0.133988 | 0.637179 | -0.159208 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.041564 | 0.381263 | -0.517161 | -0.358018 | 0.618700 | 0.000000 | 0.000000 | -0.517161 |
| CHEMONTID:0000475 | Carboxylic acid amides | 0.995798 | 0.627142 | 0.000000 | 0.000000 | 0.000000 | 0.924916 | 1.211296 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 4.633758 | -0.254674 | 0.000000 | -0.237840 | 0.000000 | -0.586963 | -0.254674 |
| CHEMONTID:0000517 | Ureas | 0.871849 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 0.289862 | 0.937546 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.937546 |
| CHEMONTID:0001093 | Carboxylic acid derivatives | 0.999872 | 0.027135 | -0.573684 | 0.124985 | 0.000000 | 1.226345 | 1.137245 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | -0.555833 | 4.749889 | -0.734825 | -0.807260 | 0.626345 | 0.000000 | 0.000000 | -0.272960 |
| CHEMONTID:0001096 | N-acyl amines | 0.731227 | 0.224900 | 0.000000 | 0.482717 | 0.758824 | 0.758824 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 1.917322 | 0.364546 | 0.000000 | 0.054253 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0001167 | Dialkyl ethers | 0.998986 | -0.063950 | 0.000000 | 0.667023 | 0.000000 | -0.596466 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 0.065740 | 0.398470 | 0.000000 | 0.303991 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0001238 | Carboxylic acid esters | 0.998772 | 0.115818 | 0.000000 | 0.000000 | -0.083324 | 0.022849 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 4.431145 | -0.944297 | -0.385007 | 0.881229 | 0.000000 | 0.000000 | -0.071601 |
| CHEMONTID:0001346 | Diazines | 0.664085 | -0.416221 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | -1.708124 | 0.364499 | 0.000000 | -0.673794 | 0.000000 | 0.000000 | 0.364499 |
| CHEMONTID:0001542 | Disaccharides | 0.591085 | -0.320478 | 0.000000 | 0.182190 | 0.000000 | 0.000000 | 0.242513 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | -3.080885 | 0.591246 | 0.000000 | -0.065833 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0001656 | Acetals | 0.989885 | 0.000000 | 0.000000 | 0.855496 | 0.000000 | 0.000000 | 2.557986 | 0.0 | 0.368370 | ... | 0.0 | 0.000000 | -1.004444 | -2.619989 | -0.788560 | 0.000000 | 0.143450 | 0.000000 | 0.000000 | -0.788560 |
| CHEMONTID:0001661 | Secondary alcohols | 0.996302 | -0.155494 | 0.000000 | 0.000000 | 0.000000 | 0.220427 | 1.714529 | 0.0 | -1.542397 | ... | 0.0 | -0.349038 | 0.000000 | -0.351667 | 0.000000 | -0.256760 | 0.994932 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0001664 | Tertiary carboxylic acid amides | 0.854960 | 0.046829 | 0.000000 | 0.000000 | 0.000000 | -0.230712 | 0.744273 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 2.657154 | -0.034466 | -0.181689 | -0.465255 | 0.000000 | 0.000000 | -0.010358 |
| CHEMONTID:0001831 | Carbonyl compounds | 0.927188 | 0.000000 | 0.000000 | 0.096690 | 0.000000 | 0.000000 | -0.071391 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 1.881496 | 0.234562 | 0.000000 | 0.012705 | 0.000000 | 0.040340 | 0.234562 |
| CHEMONTID:0002012 | Oxanes | 0.987434 | -0.176122 | 0.000000 | 0.539720 | 0.000000 | 0.000000 | 2.134330 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | -2.633562 | -0.381221 | 0.373362 | 0.836099 | 0.000000 | 0.000000 | -0.424168 |
| CHEMONTID:0002105 | Glycosyl compounds | 0.899443 | -1.317901 | 0.000000 | 0.350854 | 0.000000 | 0.752530 | 1.316402 | 0.0 | -0.062650 | ... | 0.0 | 0.026500 | -0.533556 | -3.469333 | -0.059982 | 0.000000 | 1.132618 | 0.000000 | 0.000000 | -0.695161 |
| CHEMONTID:0002202 | Hydropyrimidines | 0.664085 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.157592 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | -3.265233 | 0.316258 | 0.000000 | -0.945530 | 0.000000 | 0.000000 | 0.316258 |
| CHEMONTID:0002207 | O-glycosyl compounds | 0.855456 | -0.726979 | 0.000000 | 0.021056 | 0.000000 | 0.000000 | 1.951098 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | -0.559821 | -3.406662 | -0.279949 | 0.000000 | 0.988348 | 0.000000 | 0.000000 | -0.279949 |
| CHEMONTID:0002449 | Amines | 0.953351 | -0.355821 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.750402 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | -0.246046 | -0.769980 | -0.067409 | -0.843751 | 0.194117 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0002450 | Primary amines | 0.953351 | -0.553246 | 0.000000 | -0.004045 | -0.156883 | 0.562893 | -0.624989 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | -0.546283 | -0.372410 | 0.000000 | 0.455299 | 0.000000 | 0.000000 | -0.372410 |
| CHEMONTID:0003890 | Vinylogous amides | 0.938973 | -0.072487 | 0.000000 | -0.218571 | -0.573828 | 0.420431 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.103827 | 0.852967 | 1.449454 | -0.024129 | -0.382517 | 0.000000 | 0.000000 | 1.449454 |
| CHEMONTID:0003909 | Fatty Acyls | 0.991993 | 0.265968 | 0.000000 | 0.396254 | 0.803797 | 1.981999 | 0.605907 | 0.0 | 0.000000 | ... | 0.0 | 0.059721 | 0.000000 | -0.395701 | 0.094307 | 0.098287 | -0.087122 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0003940 | Organic oxides | 0.997032 | 0.138076 | 0.000000 | 0.205293 | 0.000000 | 0.000000 | -0.365281 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.688342 | 2.588351 | -0.190295 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0004139 | Azacyclic compounds | 0.972626 | -0.445462 | -0.321943 | -0.716947 | 0.000000 | -0.413126 | 0.648036 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.561154 | 3.806599 | 1.120308 | -0.064772 | -0.638590 | 0.000000 | 0.000000 | 2.066128 |
| CHEMONTID:0004140 | Oxacyclic compounds | 0.999824 | 0.000000 | -0.086450 | 0.152213 | -0.060771 | -0.060771 | 1.168670 | 0.0 | 0.000000 | ... | 0.0 | 0.288354 | -0.271145 | 0.730471 | -0.139093 | 0.000000 | 1.042502 | -0.907340 | 0.000000 | -0.139093 |
| CHEMONTID:0004144 | Heteroaromatic compounds | 0.929851 | 0.000000 | 0.000000 | -0.601390 | 0.000000 | 0.000000 | -0.164061 | 0.0 | 0.000000 | ... | 0.0 | 1.218501 | 1.668306 | 0.047806 | 0.992305 | 0.000000 | -0.811462 | 0.000000 | 0.000000 | 0.992305 |
| CHEMONTID:0004150 | Hydrocarbon derivatives | 0.998957 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 3.549721 | 0.000000 | 0.000000 | 0.020515 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0004557 | Organopnictogen compounds | 0.989550 | 0.518851 | 0.000000 | 0.000000 | -0.316583 | -1.223855 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 3.915001 | 0.026611 | 0.000000 | 0.164451 | 0.000000 | 0.000000 | 0.098011 |
| CHEMONTID:0004603 | Organic oxygen compounds | 0.999788 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 3.871651 | 0.000000 | 0.000000 | 0.552077 | 0.000000 | 0.000000 | 0.000000 |
| CHEMONTID:0004707 | Organic nitrogen compounds | 0.999836 | 0.000000 | 0.000000 | -0.067827 | 0.000000 | 0.354928 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 11.165793 | 0.724649 | 0.000000 | -0.638330 | 0.000000 | 0.000000 | 0.724649 |
49 rows × 55 columns
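One way to read a row of this table is to rank the genes by their value for a given class. The sketch below does this in plain Python for the "Organonitrogen compounds" row, using only the columns visible in the truncated display above. It assumes that larger positive values mean a stronger contribution towards the class; the helper name top_genes is hypothetical.

```python
# visible columns of the CHEMONTID:0000278 (Organonitrogen compounds) row
organonitrogen = {
    "BAM98946.1": 0.000000,
    "BAM98948.1": -0.067827,
    "BAM98950.1": 0.354928,
    "BAM98992.1": 11.165793,
    "BAM98993.1": 0.724649,
    "BAM98995.1": -0.638330,
    "BAM98998.1": 0.724649,
}


def top_genes(contributions, n=3):
    """Return the n (gene, value) pairs with the largest values, largest first."""
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


best = top_genes(organonitrogen)
for gene, value in best:
    print(f"{gene}\t{value:.6f}")
```

For this row the ranking is dominated by a single gene, BAM98992.1, whose value is more than an order of magnitude above the others.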
Render cluster#
Now that we have a table summarizing the role of every cluster gene in the prediction of each ChemOnt class, we can render the genomic locus of the BGC with additional information about the function of each gene. Let’s restrict the display to the five predicted classes with the fewest training examples in MIBiG 3.1:
[11]:
top = predictor.classes_.loc[genetable.index].sort_values("n_positives").head(5).index
predictor.classes_.loc[top]
[11]:
| class | name | description | n_positives | information_accretion |
|---|---|---|---|---|
| CHEMONTID:0000291 | Pyrimidones | Compounds that contain a pyrimidine ring, whic... | 28 | 1.496426 |
| CHEMONTID:0000517 | Ureas | Compounds containing two amine groups joined b... | 33 | 1.064130 |
| CHEMONTID:0002202 | Hydropyrimidines | Compounds containing a hydrogenated pyrimidine... | 35 | 1.174498 |
| CHEMONTID:0001542 | Disaccharides | Compounds containing two carbohydrate moieties... | 58 | 2.555647 |
| CHEMONTID:0000364 | Organic carbonic acids and derivatives | Compounds comprising the organic carbonic acid... | 69 | 4.372266 |
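The selection above amounts to taking the five smallest n_positives values among the predicted classes. As an illustrative equivalent without pandas, the sketch below applies heapq.nsmallest to a few (class, n_positives) pairs copied from the predictions table; the variable names are arbitrary.

```python
import heapq

# (name, n_positives) pairs copied from the predictions table
n_positives = [
    ("Pyrimidines and pyrimidine derivatives", 79),
    ("Pyrimidones", 28),
    ("Organic carbonic acids and derivatives", 69),
    ("Ureas", 33),
    ("Diazines", 100),
    ("Disaccharides", 58),
    ("Hydropyrimidines", 35),
]

# keep the five classes with the fewest training examples, rarest first
rarest = heapq.nsmallest(5, n_positives, key=lambda item: item[1])
print([name for name, _ in rarest])
```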
We can now plot the cluster, colouring each gene according to the ChemOnt class it contributes to the most, which highlights its function in the biosynthetic pathway. For the display, let’s use the dna-features-viewer library.
[12]:
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from dna_features_viewer import GraphicFeature, GraphicRecord
from palettable.cartocolors.sequential import (
    Purp_2, Sunset_5, DarkMint_2, Magenta_2, Teal_5, BluGrn_2
)

# build a palette mapping each selected class to a colour scheme
# (zip stops at the shortest input, so the last scheme is unused with 5 classes)
palette = dict(zip(top, [Purp_2, Sunset_5, DarkMint_2, Magenta_2, Teal_5, BluGrn_2]))

# create the figure
fig = plt.figure(figsize=(12, 6))

# extract CDS features from the record
features = []
for feature in filter(lambda f: f.kind == "CDS", records[0].features):
    # get the name and product of the gene
    label = next(q.value for q in feature.qualifiers if q.key == "protein_id")
    product = next((q.value for q in feature.qualifiers if q.key == "product"), None)
    # strip the "putative " prefix (product is None if the qualifier is missing)
    if product is not None and product.startswith("putative "):
        product = product[len("putative "):]
    # get the coordinates
    start = feature.location.start
    end = feature.location.end
    if feature.location.strand == "-":
        start, end = end, start
    # get the colour of the gene based on contribution weight
    weights = genetable[label].loc[top]
    if any(weights >= 1):
        best = weights.index[weights.argmax()]
        color = palette[best].hex_colors[1]
    else:
        color = "#c0c0c0"
    # record the feature, labelling only the coloured genes
    features.append(GraphicFeature(
        start=start,
        end=end,
        strand=-1 if feature.location.strand == "-" else 1,
        color=color,
        label=None if color == "#c0c0c0" else product,
    ))

# render the feature records
record = GraphicRecord(sequence=records[0].sequence, features=features)
record.plot(ax=plt.gca())

# build the legend entries, one per selected class
legend_elements = [
    Patch(
        facecolor=v.hex_colors[1],
        edgecolor='black',
        label=f"{k} - {predictor.classes_.name.loc[k]}"
    )
    for k, v in palette.items()
]

# add the legend and title
plt.legend(handles=legend_elements, loc='upper left')
plt.title("AB746937.1 - muraminomicin")
fig.tight_layout()
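The colouring rule used in the loop above can be isolated as a small function: a gene receives the colour of the class with its largest weight, provided at least one weight reaches 1, and is drawn in grey otherwise. The sketch below restates that rule in plain Python, using weights copied from the gene contribution table for the five selected classes; the helper name pick_class is hypothetical.

```python
def pick_class(weights, threshold=1.0):
    """Return the class with the largest weight, or None if no weight reaches the threshold."""
    best = max(weights, key=weights.get)
    return best if weights[best] >= threshold else None


# weights of three genes for the five selected classes, copied from the table
genes = {
    "BAM98951.1": {
        "Pyrimidones": 0.589448,
        "Ureas": 0.000000,
        "Hydropyrimidines": 2.157592,
        "Disaccharides": 0.242513,
        "Organic carbonic acids and derivatives": 0.000000,
    },
    "BAM98992.1": {
        "Pyrimidones": -2.669306,
        "Ureas": 0.289862,
        "Hydropyrimidines": -3.265233,
        "Disaccharides": -3.080885,
        "Organic carbonic acids and derivatives": 1.875661,
    },
    "BAM98993.1": {
        "Pyrimidones": 0.000000,
        "Ureas": 0.937546,
        "Hydropyrimidines": 0.316258,
        "Disaccharides": 0.591246,
        "Organic carbonic acids and derivatives": 0.418598,
    },
}

assigned = {gene: pick_class(weights) for gene, weights in genes.items()}
for gene, cls in assigned.items():
    print(gene, "->", cls or "grey (no weight >= 1)")
```

The threshold of 1 comes from the plotting code; lowering it would colour more genes, at the risk of highlighting weak associations.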