Explain a prediction

This example demonstrates how to use the CHAMOIS API to link the genes of a query cluster to the ChemOnt classes that CHAMOIS predicts for its putative metabolite.

[1]:
import chamois
chamois.__version__
[1]:
'0.2.0'

Loading data

Use gb-io to load the GenBank record of a cluster, then wrap it in a dedicated ClusterSequence object. Let’s use AB746937.1, the biosynthetic gene cluster for muraminomicin found in Streptosporangium amethystogenes.

[2]:
import gb_io
import chamois.model
records = gb_io.load("data/AB746937.1.gbk")
clusters = [chamois.model.ClusterSequence(records[0])]

Calling genes

You can use the chamois.orf module to call the genes inside one or more ClusterSequence objects. Since the source GenBank record already has its genes called (as CDS features, with the gene name stored in the /protein_id qualifier), we can skip gene calling and simply extract the existing genes. For this, we use a CDSFinder:

[3]:
from chamois.orf import CDSFinder
orf_finder = CDSFinder(locus_tag="protein_id")
proteins = list(orf_finder.find_genes(clusters))
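
Before choosing the qualifier passed to CDSFinder, it can help to check which qualifiers the CDS features actually carry. The sketch below is not part of CHAMOIS: `cds_qualifier_values` is a hypothetical helper that relies only on the `features`, `kind`, `qualifiers`, `key` and `value` attributes of the loaded records.

```python
def cds_qualifier_values(record, key="protein_id"):
    """Return the value of qualifier `key` for every CDS feature that has it.

    `record` is expected to expose a `features` list, where each feature
    has a `kind` string and a list of `qualifiers` with `key` and `value`
    attributes (hypothetical helper, written for illustration only).
    """
    values = []
    for feature in record.features:
        if feature.kind != "CDS":
            continue  # skip source, gene and other non-CDS features
        for qualifier in feature.qualifiers:
            if qualifier.key == key:
                values.append(qualifier.value)
                break  # keep only the first match per feature
    return values
```

Calling `cds_qualifier_values(records[0])` should list one identifier per annotated CDS. An empty list would suggest using a different qualifier, or calling the genes from scratch.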

Extracting features

Once we have a list of proteins, we need to annotate them with protein domains. CHAMOIS is distributed with the Pfam HMMs required by the CHAMOIS predictor, so we can simply use these and run the default annotation with a PfamAnnotator object:

[4]:
from chamois.domains import PfamAnnotator
annotator = PfamAnnotator()
domains = list(annotator.annotate_domains(proteins))

Building compositional matrices

We now have a list of domains, but we want to turn these domains into a matrix of presence/absence of each Pfam domain in each gene cluster. To do so, let’s first load the trained CHAMOIS predictor, so we know which features we need to extract:

[5]:
from chamois.predictor import ChemicalOntologyPredictor
predictor = ChemicalOntologyPredictor.trained()

Then simply build the observations table (from the source clusters), and the actual compositional data matrix, returned as an AnnData object to preserve observation and feature metadata:

[6]:
import chamois.compositions
obs = chamois.compositions.build_observations(clusters)
data = chamois.compositions.build_compositions(domains, obs, predictor.features_)
data
[6]:
AnnData object with n_obs × n_vars = 1 × 896
    obs: 'length'
    var: 'name', 'description', 'kind'
[7]:
data.var_vector(clusters[0].id)
[7]:
array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
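
To make the presence/absence encoding concrete, the following plain-Python sketch builds one binary row per cluster from a list of domain hits. It only illustrates the idea and is not how `build_compositions` is implemented. The function name, the `(cluster_id, accession)` tuples, the cluster name and the hits themselves are placeholders invented for the example.

```python
def presence_absence(hits, cluster_ids, features):
    """Build a binary matrix (list of rows) of feature presence per cluster.

    hits: iterable of (cluster_id, feature_accession) pairs
    cluster_ids: row order
    features: column order (hits outside this list are ignored)
    """
    column = {accession: j for j, accession in enumerate(features)}
    row = {cluster_id: i for i, cluster_id in enumerate(cluster_ids)}
    matrix = [[0] * len(features) for _ in cluster_ids]
    for cluster_id, accession in hits:
        if cluster_id in row and accession in column:
            # repeated hits still yield 1: only presence matters, not counts
            matrix[row[cluster_id]][column[accession]] = 1
    return matrix


# made-up hits for a made-up cluster (not real annotations)
hits = [
    ("cluster_1", "PF00501"),
    ("cluster_1", "PF00501"),  # same domain found in a second gene
    ("cluster_1", "PF99999"),  # not among the model features: ignored
]
print(presence_absence(hits, ["cluster_1"], ["PF00109", "PF00501", "PF00550"]))
# prints [[0, 1, 0]]
```

Restricting the columns to a fixed feature list is what keeps the row aligned with the features the predictor was trained on, which is the role `predictor.features_` plays in the cell above.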

Infer chemical classes

With the compositional matrix ready, we can simply call the predict_probas method on the predictor to get the class probabilities predicted by CHAMOIS:

[8]:
probas = predictor.predict_probas(data)

probas is a NumPy array containing probabilities for each of the classes of the model. We can turn these predictions into a table retaining the metadata from the original predictor:

[9]:
classes = predictor.classes_.copy()
classes['probability'] = probas[0]
classes[classes['probability'] > 0.5]
[9]:
name description n_positives information_accretion probability
CHEMONTID:0000002 Organoheterocyclic compounds Compounds containing a ring with least one car... 1325 0.000000 0.999824
CHEMONTID:0000011 Carbohydrates and carbohydrate conjugates Monosaccharides, disaccharides, oligosaccharid... 341 2.203840 0.996191
CHEMONTID:0000012 Lipids and lipid-like molecules Fatty acids and their derivatives, and substan... 679 0.000000 0.991993
CHEMONTID:0000013 Amino acids, peptides, and analogues Organic compounds containing an amino acid or ... 851 0.575324 0.652714
CHEMONTID:0000050 Lactones Cyclic esters of hydroxy carboxylic acids, con... 410 1.692297 0.906615
CHEMONTID:0000075 Pyrimidines and pyrimidine derivatives Compounds containing a pyrimidne ring, which i... 79 0.340075 0.664085
CHEMONTID:0000129 Alcohols and polyols Organic compounds in which at least one hydrox... 1075 0.547347 0.996302
CHEMONTID:0000160 Lactams Compounds containing a lactam ring, which is a... 479 1.467895 0.953170
CHEMONTID:0000254 Ethers Compounds bearing an ether group with the form... 603 1.381453 0.998986
CHEMONTID:0000261 Phenylpropanoids and polyketides Organic compounds that are synthesized either ... 517 0.000000 0.762557
CHEMONTID:0000264 Organic acids and derivatives Compounds an organic acid or a derivative ther... 1429 0.000000 0.999968
CHEMONTID:0000265 Carboxylic acids and derivatives Compounds containing a carboxylic acid group w... 1268 0.172451 0.999968
CHEMONTID:0000278 Organonitrogen compounds Organic compounds containing a nitrogen atom. 1284 -0.000000 0.999836
CHEMONTID:0000291 Pyrimidones Compounds that contain a pyrimidine ring, whic... 28 1.496426 0.549593
CHEMONTID:0000323 Organooxygen compounds Organic compounds containing a bond between a ... 1571 0.002752 0.999788
CHEMONTID:0000324 Fatty acid esters Carboxylic ester derivatives of a fatty acid. 116 2.341691 0.896289
CHEMONTID:0000331 Fatty amides Carboxylic acid amide derivatives of fatty aci... 398 0.563048 0.978401
CHEMONTID:0000346 Dicarboxylic acids and derivatives Organic compounds containing exactly two carbo... 205 2.628859 0.537920
CHEMONTID:0000348 Peptides Compounds containing an amide derived from two... 328 1.375463 0.652714
CHEMONTID:0000364 Organic carbonic acids and derivatives Compounds comprising the organic carbonic acid... 69 4.372266 0.885487
CHEMONTID:0000469 Monoalkylamines Organic compounds containing an primary alipha... 297 0.147625 0.953351
CHEMONTID:0000475 Carboxylic acid amides Carboxylic acid derivatives containing a carbo... 781 0.472970 0.995798
CHEMONTID:0000517 Ureas Compounds containing two amine groups joined b... 33 1.064130 0.871849
CHEMONTID:0001093 Carboxylic acid derivatives Derivatives of carboxylic acid. 1084 0.226190 0.999872
CHEMONTID:0001096 N-acyl amines Compounds containing a fatty acid moiety linke... 366 0.120925 0.731227
CHEMONTID:0001167 Dialkyl ethers Organic compounds containing the dialkyl ether... 333 0.856636 0.998986
CHEMONTID:0001238 Carboxylic acid esters Carboxylic acid derivatives in which the carbo... 547 0.986752 0.998772
CHEMONTID:0001346 Diazines Organic compounds containing a five-member het... 100 3.727920 0.664085
CHEMONTID:0001542 Disaccharides Compounds containing two carbohydrate moieties... 58 2.555647 0.591085
CHEMONTID:0001656 Acetals Compounds having the structure R2C(OR')2 ( R' ... 274 1.137982 0.989885
CHEMONTID:0001661 Secondary alcohols Compounds containing a secondary alcohol funct... 816 0.397696 0.996302
CHEMONTID:0001664 Tertiary carboxylic acid amides Compounds containing an amide derivative of ca... 301 1.375559 0.854960
CHEMONTID:0001831 Carbonyl compounds Organic compounds containing a carbonyl group,... 1255 0.323996 0.927188
CHEMONTID:0002012 Oxanes Compounds containing an oxane (tetrahydropyran... 365 1.860024 0.987434
CHEMONTID:0002105 Glycosyl compounds Carbohydrate derivatives in which a sugar grou... 239 0.512761 0.899443
CHEMONTID:0002202 Hydropyrimidines Compounds containing a hydrogenated pyrimidine... 35 1.174498 0.664085
CHEMONTID:0002207 O-glycosyl compounds Glycoside in which a sugar group is bonded thr... 186 0.361708 0.855456
CHEMONTID:0002449 Amines Compounds formally derived from ammonia by rep... 560 1.197146 0.953351
CHEMONTID:0002450 Primary amines Amines having the nitrogen atom linked to exac... 329 0.767339 0.953351
CHEMONTID:0003890 Vinylogous amides Organic compounds containing an amine group, w... 106 3.752870 0.938973
CHEMONTID:0003909 Fatty Acyls Organic molecules synthesized by chain elongat... 588 0.207595 0.991993
CHEMONTID:0003940 Organic oxides Organic compounds containing an oxide group. 1447 0.121371 0.997032
CHEMONTID:0004139 Azacyclic compounds Organic compounds containing an heterocycle wi... 916 0.532573 0.972626
CHEMONTID:0004140 Oxacyclic compounds Organic compounds containing an heterocycle wi... 860 0.623584 0.999824
CHEMONTID:0004144 Heteroaromatic compounds Compounds containing an aromatic ring where a ... 484 1.452913 0.929851
CHEMONTID:0004150 Hydrocarbon derivatives Derivatives of hydrocarbons obtained by substi... 1584 0.000000 0.998957
CHEMONTID:0004557 Organopnictogen compounds Compounds containing a bond between carbon a p... 1103 0.000000 0.989550
CHEMONTID:0004603 Organic oxygen compounds Organic compounds that contain one or more oxy... 1574 0.000000 0.999788
CHEMONTID:0004707 Organic nitrogen compounds Organic compounds containing a nitrogen atom. 1284 0.000000 0.999836
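
The 0.5 cut-off used in the cell above is a choice, not a property of the model. As an illustration of how the selection changes with the threshold, here is a small standalone sketch using three probabilities copied from the table of predictions. `classes_above` is a hypothetical helper, not part of CHAMOIS.

```python
def classes_above(probabilities, threshold=0.5):
    """Return (class_id, probability) pairs above `threshold`, most probable first."""
    selected = [(k, p) for k, p in probabilities.items() if p > threshold]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# three values taken from the table of predictions
probabilities = {
    "CHEMONTID:0000291": 0.549593,  # Pyrimidones
    "CHEMONTID:0000517": 0.871849,  # Ureas
    "CHEMONTID:0000160": 0.953170,  # Lactams
}

# a stricter threshold drops the least certain class (Pyrimidones)
for class_id, probability in classes_above(probabilities, threshold=0.8):
    print(class_id, probability)
```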

Build gene contribution table

Now that we have the predictions, we can inspect the model to explain which genes of the cluster contributed to the prediction of each class. This can be done from the command line with the chamois explain cluster subcommand, or programmatically:

[10]:
from chamois.cli.explain import build_genetable
genetable = build_genetable(proteins, domains, predictor, probas).set_index("class")
genetable
[10]:
name probability BAM98946.1 BAM98947.1 BAM98948.1 BAM98949.1 BAM98950.1 BAM98951.1 BAM98952.1 BAM98953.1 ... BAM98989.1 BAM98990.1 BAM98991.1 BAM98992.1 BAM98993.1 BAM98994.1 BAM98995.1 BAM98996.1 BAM98997.1 BAM98998.1
class
CHEMONTID:0000002 Organoheterocyclic compounds 0.999824 0.229691 -0.996043 0.000000 0.000000 0.000000 0.415219 0.0 0.000000 ... 0.0 0.366459 1.001705 0.797974 0.757944 -0.063791 0.000000 -0.351779 0.000000 0.757944
CHEMONTID:0000011 Carbohydrates and carbohydrate conjugates 0.996191 0.040977 0.000000 0.837321 0.000000 0.679754 1.548473 0.0 0.000000 ... 0.0 0.000000 -0.322292 -2.242671 -0.122239 0.000000 0.507860 0.000000 0.000000 -0.509055
CHEMONTID:0000012 Lipids and lipid-like molecules 0.991993 0.517811 0.000000 0.394568 0.375581 1.676843 0.780424 0.0 0.000000 ... 0.0 0.000000 -0.582604 -0.282627 0.097871 0.000000 -0.089691 0.000000 -0.463023 -0.210280
CHEMONTID:0000013 Amino acids, peptides, and analogues 0.652714 0.000000 0.000000 0.258368 -0.589437 0.000000 -0.722850 0.0 -1.546893 ... 0.0 0.000000 -0.327656 5.791538 -0.274870 -2.012908 -0.150229 0.000000 -0.392624 0.074017
CHEMONTID:0000050 Lactones 0.906615 0.275428 0.000000 -0.094451 0.000000 0.452313 0.152143 0.0 -0.160824 ... 0.0 0.084367 0.000000 5.938844 -0.530828 0.000000 1.263104 -0.063064 0.000000 -0.530828
CHEMONTID:0000075 Pyrimidines and pyrimidine derivatives 0.664085 -0.904319 0.000000 0.000000 0.000000 0.000000 1.009511 0.0 0.000000 ... 0.0 0.000000 0.000000 -3.290392 0.681991 0.000000 -0.322285 0.000000 0.000000 0.440233
CHEMONTID:0000129 Alcohols and polyols 0.996302 -1.514839 0.000000 0.021764 0.000000 0.012490 0.751500 0.0 -1.009198 ... 0.0 -0.338625 0.000000 0.440210 -0.344414 0.000000 0.497811 0.000000 -0.561654 -0.663601
CHEMONTID:0000160 Lactams 0.953170 -0.045158 0.000000 -0.126002 -0.035427 -0.643823 -0.365436 0.0 0.000000 ... 0.0 1.883761 1.016720 1.418510 -0.398273 -0.042517 -0.403283 0.000000 -0.025076 -0.398273
CHEMONTID:0000254 Ethers 0.998986 -0.344400 0.000000 1.235817 -0.661423 -1.314357 1.788580 0.0 0.000000 ... 0.0 0.717553 -0.314751 -1.197625 0.219880 0.279220 0.076783 0.000000 0.000000 0.219880
CHEMONTID:0000261 Phenylpropanoids and polyketides 0.762557 -0.308363 0.805470 0.000000 -0.755150 0.818682 -0.517657 0.0 0.000000 ... 0.0 0.190364 0.000000 2.299565 -0.874032 0.000000 0.875051 0.000000 0.000000 -0.874032
CHEMONTID:0000264 Organic acids and derivatives 0.999968 -0.127509 -1.544090 0.265956 0.000000 0.000000 -0.671037 0.0 0.000000 ... 0.0 -0.193451 0.000000 6.378113 -0.719504 0.000000 0.182064 0.000000 0.000000 -0.082602
CHEMONTID:0000265 Carboxylic acids and derivatives 0.999968 -0.040775 -0.802258 0.882372 0.181671 1.056387 0.000000 0.0 0.000000 ... 0.0 0.000000 -0.051538 4.543270 -0.497623 -0.087066 0.487317 0.000000 0.000000 -0.401802
CHEMONTID:0000278 Organonitrogen compounds 0.999836 0.000000 0.000000 -0.067827 0.000000 0.354928 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 11.165793 0.724649 0.000000 -0.638330 0.000000 0.000000 0.724649
CHEMONTID:0000291 Pyrimidones 0.549593 0.000000 0.000000 0.000000 0.000000 0.000000 0.589448 0.0 0.000000 ... 0.0 0.000000 0.000000 -2.669306 0.000000 0.000000 -1.207655 0.000000 0.000000 0.000000
CHEMONTID:0000323 Organooxygen compounds 0.999788 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 5.162930 0.000000 0.000000 0.583948 0.000000 0.000000 0.000000
CHEMONTID:0000324 Fatty acid esters 0.896289 0.000000 0.000000 0.000000 -0.897874 0.000000 -0.199484 0.0 0.000000 ... 0.0 0.417052 0.000000 0.986235 -0.795943 -0.311974 0.378352 0.000000 0.000000 -0.795943
CHEMONTID:0000331 Fatty amides 0.978401 0.072648 0.000000 0.645894 1.090469 1.787849 0.952662 0.0 0.000000 ... 0.0 0.000000 0.000000 2.011304 0.114176 0.345940 -0.024263 0.000000 0.000000 -0.057195
CHEMONTID:0000346 Dicarboxylic acids and derivatives 0.537920 -0.093845 0.000000 0.092662 0.000000 0.201576 0.000000 0.0 0.000000 ... 0.0 0.691972 0.000000 0.520992 -0.543118 0.000000 0.621471 0.000000 0.000000 0.341106
CHEMONTID:0000348 Peptides 0.652714 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 ... 0.0 -0.160640 0.000000 1.855930 -0.030794 0.000000 -0.698450 0.000000 0.000000 0.000000
CHEMONTID:0000364 Organic carbonic acids and derivatives 0.885487 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 ... 0.0 -0.252501 0.000000 1.875661 0.418598 0.000000 0.000000 0.000000 0.000000 0.120052
CHEMONTID:0000469 Monoalkylamines 0.953351 0.012152 0.000000 0.000000 -0.133988 0.637179 -0.159208 0.0 0.000000 ... 0.0 0.000000 0.041564 0.381263 -0.517161 -0.358018 0.618700 0.000000 0.000000 -0.517161
CHEMONTID:0000475 Carboxylic acid amides 0.995798 0.627142 0.000000 0.000000 0.000000 0.924916 1.211296 0.0 0.000000 ... 0.0 0.000000 0.000000 4.633758 -0.254674 0.000000 -0.237840 0.000000 -0.586963 -0.254674
CHEMONTID:0000517 Ureas 0.871849 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.289862 0.937546 0.000000 0.000000 0.000000 0.000000 0.937546
CHEMONTID:0001093 Carboxylic acid derivatives 0.999872 0.027135 -0.573684 0.124985 0.000000 1.226345 1.137245 0.0 0.000000 ... 0.0 0.000000 -0.555833 4.749889 -0.734825 -0.807260 0.626345 0.000000 0.000000 -0.272960
CHEMONTID:0001096 N-acyl amines 0.731227 0.224900 0.000000 0.482717 0.758824 0.758824 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 1.917322 0.364546 0.000000 0.054253 0.000000 0.000000 0.000000
CHEMONTID:0001167 Dialkyl ethers 0.998986 -0.063950 0.000000 0.667023 0.000000 -0.596466 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.065740 0.398470 0.000000 0.303991 0.000000 0.000000 0.000000
CHEMONTID:0001238 Carboxylic acid esters 0.998772 0.115818 0.000000 0.000000 -0.083324 0.022849 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 4.431145 -0.944297 -0.385007 0.881229 0.000000 0.000000 -0.071601
CHEMONTID:0001346 Diazines 0.664085 -0.416221 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 -1.708124 0.364499 0.000000 -0.673794 0.000000 0.000000 0.364499
CHEMONTID:0001542 Disaccharides 0.591085 -0.320478 0.000000 0.182190 0.000000 0.000000 0.242513 0.0 0.000000 ... 0.0 0.000000 0.000000 -3.080885 0.591246 0.000000 -0.065833 0.000000 0.000000 0.000000
CHEMONTID:0001656 Acetals 0.989885 0.000000 0.000000 0.855496 0.000000 0.000000 2.557986 0.0 0.368370 ... 0.0 0.000000 -1.004444 -2.619989 -0.788560 0.000000 0.143450 0.000000 0.000000 -0.788560
CHEMONTID:0001661 Secondary alcohols 0.996302 -0.155494 0.000000 0.000000 0.000000 0.220427 1.714529 0.0 -1.542397 ... 0.0 -0.349038 0.000000 -0.351667 0.000000 -0.256760 0.994932 0.000000 0.000000 0.000000
CHEMONTID:0001664 Tertiary carboxylic acid amides 0.854960 0.046829 0.000000 0.000000 0.000000 -0.230712 0.744273 0.0 0.000000 ... 0.0 0.000000 0.000000 2.657154 -0.034466 -0.181689 -0.465255 0.000000 0.000000 -0.010358
CHEMONTID:0001831 Carbonyl compounds 0.927188 0.000000 0.000000 0.096690 0.000000 0.000000 -0.071391 0.0 0.000000 ... 0.0 0.000000 0.000000 1.881496 0.234562 0.000000 0.012705 0.000000 0.040340 0.234562
CHEMONTID:0002012 Oxanes 0.987434 -0.176122 0.000000 0.539720 0.000000 0.000000 2.134330 0.0 0.000000 ... 0.0 0.000000 0.000000 -2.633562 -0.381221 0.373362 0.836099 0.000000 0.000000 -0.424168
CHEMONTID:0002105 Glycosyl compounds 0.899443 -1.317901 0.000000 0.350854 0.000000 0.752530 1.316402 0.0 -0.062650 ... 0.0 0.026500 -0.533556 -3.469333 -0.059982 0.000000 1.132618 0.000000 0.000000 -0.695161
CHEMONTID:0002202 Hydropyrimidines 0.664085 0.000000 0.000000 0.000000 0.000000 0.000000 2.157592 0.0 0.000000 ... 0.0 0.000000 0.000000 -3.265233 0.316258 0.000000 -0.945530 0.000000 0.000000 0.316258
CHEMONTID:0002207 O-glycosyl compounds 0.855456 -0.726979 0.000000 0.021056 0.000000 0.000000 1.951098 0.0 0.000000 ... 0.0 0.000000 -0.559821 -3.406662 -0.279949 0.000000 0.988348 0.000000 0.000000 -0.279949
CHEMONTID:0002449 Amines 0.953351 -0.355821 0.000000 0.000000 0.000000 0.000000 -1.750402 0.0 0.000000 ... 0.0 0.000000 -0.246046 -0.769980 -0.067409 -0.843751 0.194117 0.000000 0.000000 0.000000
CHEMONTID:0002450 Primary amines 0.953351 -0.553246 0.000000 -0.004045 -0.156883 0.562893 -0.624989 0.0 0.000000 ... 0.0 0.000000 0.000000 -0.546283 -0.372410 0.000000 0.455299 0.000000 0.000000 -0.372410
CHEMONTID:0003890 Vinylogous amides 0.938973 -0.072487 0.000000 -0.218571 -0.573828 0.420431 0.000000 0.0 0.000000 ... 0.0 0.000000 0.103827 0.852967 1.449454 -0.024129 -0.382517 0.000000 0.000000 1.449454
CHEMONTID:0003909 Fatty Acyls 0.991993 0.265968 0.000000 0.396254 0.803797 1.981999 0.605907 0.0 0.000000 ... 0.0 0.059721 0.000000 -0.395701 0.094307 0.098287 -0.087122 0.000000 0.000000 0.000000
CHEMONTID:0003940 Organic oxides 0.997032 0.138076 0.000000 0.205293 0.000000 0.000000 -0.365281 0.0 0.000000 ... 0.0 0.000000 0.688342 2.588351 -0.190295 0.000000 0.000000 0.000000 0.000000 0.000000
CHEMONTID:0004139 Azacyclic compounds 0.972626 -0.445462 -0.321943 -0.716947 0.000000 -0.413126 0.648036 0.0 0.000000 ... 0.0 0.000000 0.561154 3.806599 1.120308 -0.064772 -0.638590 0.000000 0.000000 2.066128
CHEMONTID:0004140 Oxacyclic compounds 0.999824 0.000000 -0.086450 0.152213 -0.060771 -0.060771 1.168670 0.0 0.000000 ... 0.0 0.288354 -0.271145 0.730471 -0.139093 0.000000 1.042502 -0.907340 0.000000 -0.139093
CHEMONTID:0004144 Heteroaromatic compounds 0.929851 0.000000 0.000000 -0.601390 0.000000 0.000000 -0.164061 0.0 0.000000 ... 0.0 1.218501 1.668306 0.047806 0.992305 0.000000 -0.811462 0.000000 0.000000 0.992305
CHEMONTID:0004150 Hydrocarbon derivatives 0.998957 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 3.549721 0.000000 0.000000 0.020515 0.000000 0.000000 0.000000
CHEMONTID:0004557 Organopnictogen compounds 0.989550 0.518851 0.000000 0.000000 -0.316583 -1.223855 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 3.915001 0.026611 0.000000 0.164451 0.000000 0.000000 0.098011
CHEMONTID:0004603 Organic oxygen compounds 0.999788 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 3.871651 0.000000 0.000000 0.552077 0.000000 0.000000 0.000000
CHEMONTID:0004707 Organic nitrogen compounds 0.999836 0.000000 0.000000 -0.067827 0.000000 0.354928 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 11.165793 0.724649 0.000000 -0.638330 0.000000 0.000000 0.724649

49 rows × 55 columns
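
One way to read such a table, assuming each class is predicted by a linear model over Pfam domains followed by a logistic function (an assumption about the model family, not something this page states), is that a gene's column holds the summed weights of the domains it carries. The sketch below illustrates that reading with made-up weights. `gene_contributions`, `class_probability`, the domain names and the intercept are all hypothetical, and this is not the implementation of `build_genetable`.

```python
import math


def gene_contributions(gene_domains, weights):
    """Sum, for each gene, the weights of the distinct domains it carries."""
    return {
        gene: sum(weights.get(domain, 0.0) for domain in set(domains))
        for gene, domains in gene_domains.items()
    }


def class_probability(gene_domains, weights, intercept=0.0):
    """Logistic probability from the distinct domains present in the cluster.

    A domain found in several genes is counted once here, as in a
    presence/absence encoding, so per-gene contributions need not add up
    to the cluster-level score.
    """
    present = set().union(*gene_domains.values())
    score = intercept + sum(weights.get(domain, 0.0) for domain in present)
    return 1.0 / (1.0 + math.exp(-score))


# made-up weights and gene content, for illustration only
weights = {"domain_A": 2.0, "domain_B": -0.5}
gene_domains = {
    "gene_1": ["domain_A"],
    "gene_2": ["domain_A", "domain_B"],
    "gene_3": [],
}
print(gene_contributions(gene_domains, weights))
print(class_probability(gene_domains, weights))
```

Under this reading, a positive value means the gene pushes the prediction towards the class and a negative value pushes against it. Two genes sharing the same domain would show the same weight, which would be consistent with the identical values seen in some columns of the table.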

Render cluster

Now that we have a table summarizing the role of every cluster gene in the prediction of each ChemOnt class, we can render the genomic locus of the BGC with additional information about the function of each gene. Let’s restrict the display to the 5 predicted classes with the fewest training examples in MIBiG 3.1:

[11]:
top = predictor.classes_.loc[genetable.index].sort_values("n_positives").head(5).index
predictor.classes_.loc[top]
[11]:
name description n_positives information_accretion
class
CHEMONTID:0000291 Pyrimidones Compounds that contain a pyrimidine ring, whic... 28 1.496426
CHEMONTID:0000517 Ureas Compounds containing two amine groups joined b... 33 1.064130
CHEMONTID:0002202 Hydropyrimidines Compounds containing a hydrogenated pyrimidine... 35 1.174498
CHEMONTID:0001542 Disaccharides Compounds containing two carbohydrate moieties... 58 2.555647
CHEMONTID:0000364 Organic carbonic acids and derivatives Compounds comprising the organic carbonic acid... 69 4.372266

We can now plot the cluster, colouring each gene according to the ChemOnt class it contributes to the most, which highlights its function in the biosynthetic pathway. For the display, let’s use the dna-features-viewer library.

[12]:
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from dna_features_viewer import GraphicFeature, GraphicRecord
from palettable.cartocolors.sequential import (
    Purp_2, Sunset_5, DarkMint_2, Magenta_2, Teal_5,
)

# build a palette: one sequential colour map per selected class
palette = dict(zip(top, [Purp_2, Sunset_5, DarkMint_2, Magenta_2, Teal_5]))
fig = plt.figure(figsize=(12, 6))

# extract CDS features from the record
features = []
for feature in filter(lambda f: f.kind == "CDS", records[0].features):
    # get the name and product of the gene
    label = next(q.value for q in feature.qualifiers if q.key == "protein_id")
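    # fall back to the protein ID when the feature has no /product qualifier
    product = next((q.value for q in feature.qualifiers if q.key == "product"), label)
    # drop the leading "putative " (9 characters) from the product name
    if product.startswith("putative"):
        product = product[9:]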
    # get the coordinates
    start = feature.location.start
    end = feature.location.end
    if feature.location.strand == "-":
        start, end = end, start
    # get the colour of the gene based on contribution weight
    weights = genetable[label].loc[top]
    if any(weights >= 1):
        best = weights.index[weights.argmax()]
        color = palette[best].hex_colors[1]
    else:
        color = "#c0c0c0"
    # record the feature
    features.append(GraphicFeature(
        start=start,
        end=end,
        strand=-1 if feature.location.strand == "-" else 1,
        color=color,
        label=None if color == "#c0c0c0" else product,
    ))

# render the feature records
record = GraphicRecord(sequence=records[0].sequence, features=features)
record.plot(ax=plt.gca())

# add legend
legend_elements = [
    Patch(
        facecolor=v.hex_colors[1],
        edgecolor='black',
        label=f"{k} - {predictor.classes_.name.loc[k]}"
    )
    for k,v in palette.items()
]

# draw the legend and title, then finalize the layout
plt.legend(handles=legend_elements, loc='upper left')
plt.title("AB746937.1 - muraminomicin")
fig.tight_layout()
[Figure: genomic locus of AB746937.1 (muraminomicin), with genes coloured by the ChemOnt class they contribute to the most]
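
The colouring rule used in the plotting cell can be isolated into a small function, which makes the threshold and the fallback colour explicit. This is a sketch in plain Python: `pick_colour` and its parameters are names introduced here, and the hex colours in the example are arbitrary. The weights are those of gene BAM98951.1 for the five selected classes, copied from the gene contribution table.

```python
def pick_colour(weights, colours, threshold=1.0, default="#c0c0c0"):
    """Return the colour of the class with the largest weight.

    weights: mapping of class ID -> contribution weight of one gene
    colours: mapping of class ID -> hex colour
    Genes whose weights all fall below `threshold` get the `default` grey.
    """
    if not weights:
        return default
    best = max(weights, key=weights.get)
    if weights[best] < threshold:
        return default
    return colours[best]


# weights of BAM98951.1, taken from the gene contribution table
weights = {
    "CHEMONTID:0000291": 0.589448,  # Pyrimidones
    "CHEMONTID:0000517": 0.000000,  # Ureas
    "CHEMONTID:0002202": 2.157592,  # Hydropyrimidines
    "CHEMONTID:0001542": 0.242513,  # Disaccharides
    "CHEMONTID:0000364": 0.000000,  # Organic carbonic acids and derivatives
}
# arbitrary colours, one per class
colours = dict(zip(weights, ["#6a4c93", "#f4a261", "#2a9d8f", "#e76f9a", "#457b9d"]))

print(pick_colour(weights, colours))
# prints #2a9d8f (Hydropyrimidines is the only class with a weight >= 1)
```

Keeping the rule in one place makes it easy to try a different threshold without touching the plotting loop.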