Software Projects

When studying the data of life computationally, new software is often needed to answer new questions, or to answer old questions in new ways. Here is an overview of the software projects I have been involved in.

antiSMASH: the antibiotics and Secondary Metabolite Analysis SHell

Co-developer: Kai Blin

The secondary metabolism of bacteria and fungi constitutes a rich source of bioactive compounds of potential pharmaceutical value.

Interestingly, the genes encoding the biosynthetic pathway responsible for the production of such a secondary metabolite are very often clustered together at one position on the chromosome; such a group of genes is referred to as a 'biosynthetic gene cluster'.

This genetic architecture makes it possible to detect secondary metabolite biosynthesis pathways in a straightforward way, by locating their gene clusters. In recent years, the cost of sequencing bacterial and fungal genomes has dropped dramatically, and many genome sequences have become available. Based on profile Hidden Markov Models of genes that are specific for certain types of gene clusters, antiSMASH is able to accurately identify the gene clusters encoding secondary metabolites of all known broad chemical classes.

In addition, antiSMASH offers detailed functional and comparative analysis, including a preliminary prediction of the chemical structures of encoded compounds and MultiGeneBlast-powered gene cluster alignment.
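
The detection idea described above can be illustrated with a minimal Python sketch. It is not antiSMASH's actual implementation: it assumes the profile HMM search has already been run, and it simply groups the resulting signature gene hits into candidate clusters when they lie close together on the chromosome. The gene names, coordinates and the `max_gap` threshold are hypothetical values chosen for illustration.

```python
from typing import List, Tuple

# (gene name, start, end) of genes whose protein matched a signature profile HMM.
# In practice these hits would come from an HMM search; here they are made up.
Hit = Tuple[str, int, int]


def group_hits(hits: List[Hit], max_gap: int = 10_000) -> List[List[Hit]]:
    """Group hits into candidate clusters.

    A hit joins the current cluster if it starts at most max_gap bp
    after the end of the previously added hit; otherwise it opens a new cluster.
    """
    clusters: List[List[Hit]] = []
    for hit in sorted(hits, key=lambda h: h[1]):
        if clusters and hit[1] - clusters[-1][-1][2] <= max_gap:
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    return clusters


hits = [
    ("pksA", 1_000, 7_000),
    ("pksB", 9_000, 15_000),
    ("nrpsA", 250_000, 262_000),
]
clusters = group_hits(hits)
for cluster in clusters:
    names = [name for name, _, _ in cluster]
    print(names, cluster[0][1], cluster[-1][2])
```

A real pipeline would additionally apply class-specific rules to decide which combinations of hits define which cluster type; the sketch only shows the positional grouping step.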


MultiGeneBlast: effective detection of sequence homology at the gene cluster level

MultiGeneBlast is an open source tool for identifying homologs of multigene modules such as operons and gene clusters. It is based on a reformatting of the FASTA headers of NCBI GenBank protein entries, which allows it to trace each protein back to its source nucleotide entry and coordinates.

When studying such genetic loci, much can often be learned from their evolutionary context. Furthermore, MultiGeneBlast can aid in the detection of such multigene parts for synthetic biology projects: a synthetic library of operons can be created based on its output, to identify those operons whose function is closest to the one desired by the user.

The tool makes it possible to identify all homologous genomic regions by combining the results of single BlastP runs on each gene, and sorting genomic regions from any GenBank entry by the number of hits, synteny conservation and cumulative Blast bit score. The basic algorithm behind this was previously used in our antiSMASH software. Additionally, architecture searches can be performed to find any genomic regions with Blast hits to any user-specified combination of amino acid sequences.
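
The ranking step described above can be sketched in Python as follows. This is an illustrative simplification, not MultiGeneBlast's actual code: the `BlastHit` record, the definition of synteny as "neighbouring query genes whose hits are also neighbours", and the strict ordering of the three criteria are all assumptions made for the example, and the hit data are invented.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class BlastHit(NamedTuple):
    query_index: int   # position of the query gene within the query cluster
    region: str        # identifier of the genomic region containing the hit
    gene_index: int    # position of the hit gene within that region
    bit_score: float   # BlastP bit score of the hit


def rank_regions(hits: List[BlastHit]) -> List[Tuple[str, int, int, float]]:
    """Rank regions by number of hits, then synteny, then cumulative bit score."""
    by_region: Dict[str, List[BlastHit]] = defaultdict(list)
    for hit in hits:
        by_region[hit.region].append(hit)

    ranked = []
    for region, region_hits in by_region.items():
        # Keep only the best-scoring hit for each query gene.
        best: Dict[int, BlastHit] = {}
        for hit in region_hits:
            current = best.get(hit.query_index)
            if current is None or hit.bit_score > current.bit_score:
                best[hit.query_index] = hit
        n_hits = len(best)
        # Synteny: count neighbouring query genes whose hits are also neighbours.
        synteny = sum(
            1
            for q in best
            if q + 1 in best
            and abs(best[q + 1].gene_index - best[q].gene_index) == 1
        )
        total_bits = sum(h.bit_score for h in best.values())
        ranked.append((region, n_hits, synteny, total_bits))

    ranked.sort(key=lambda r: (r[1], r[2], r[3]), reverse=True)
    return ranked


hits = [
    BlastHit(0, "regionA", 4, 310.0),
    BlastHit(1, "regionA", 5, 280.0),
    BlastHit(2, "regionA", 6, 150.0),
    BlastHit(0, "regionB", 10, 400.0),
    BlastHit(2, "regionB", 2, 390.0),
]
ranking = rank_regions(hits)
for row in ranking:
    print(row)
```

In this toy input, regionA outranks regionB even though regionB has the higher cumulative bit score, because regionA contains hits to more query genes and preserves their order.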

The tool comes with a pre-configured database containing the most recent version of all relevant GenBank divisions. Moreover, you can easily make your own databases from local files or online GenBank entries or divisions.


MultiMetEval: comparative and multi-objective analysis of genome-scale metabolic models

Main code developer: Piotr Zakrzewski

Comparative metabolic modelling is emerging as a novel research field, supported by the development of reliable and standardized approaches for constructing genome-scale metabolic models in high throughput.

The Multi-Metabolic Evaluator (MultiMetEval) is a user-friendly software framework, built upon SurreyFBA, which allows the user to compose collections of metabolic models that can be subjected together to flux balance analysis.

Additionally, MultiMetEval implements functionalities for multi-objective analysis by calculating the Pareto front between two cellular objectives.
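
To make the notion of a Pareto front concrete, here is a minimal Python sketch. It does not reproduce how MultiMetEval computes the front (which involves flux balance analysis on the models themselves); it only filters a set of precomputed, made-up solutions down to the non-dominated ones, assuming both objectives are to be maximised. The objective names in the comments are hypothetical.

```python
from typing import List, Tuple

# (objective 1, objective 2), e.g. growth rate and product yield; both maximised.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, ordered by decreasing first objective."""
    front: List[Point] = []
    best_second = float("-inf")
    # Walk from the highest first objective downwards; a point is kept only if it
    # improves on the best second objective seen so far.
    for p in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if p[1] > best_second:
            front.append(p)
            best_second = p[1]
    return front


solutions = [
    (1.0, 0.0),
    (0.8, 0.5),
    (0.7, 0.4),
    (0.5, 0.9),
    (0.5, 0.6),
    (0.0, 1.0),
]
front = pareto_front(solutions)
print(front)
```

Each point on the resulting front represents a trade-off in which one objective cannot be improved without worsening the other.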


NRPSpredictor2: a web server for predicting NRPS adenylation domain specificity

Main developers: Christian Rausch & Marc Röttig

The products of many bacterial non-ribosomal peptide synthetases (NRPS) are highly important secondary metabolites, including vancomycin and other antibiotics.

The ability to predict the substrate specificity of NRPS adenylation (A-) domains newly detected by genome sequencing efforts is of great importance for identifying and annotating new gene clusters that produce secondary metabolites.

Prediction of A-domain specificity based on the sequence alone can be achieved through sequence signatures or, more accurately, through machine learning methods.

NRPSpredictor2, based on the original NRPSpredictor, predicts A-domain specificity using Support Vector Machines on four hierarchical levels, ranging from gross physicochemical properties of an A-domain’s substrates down to single amino acid substrates.