Submit a preprint

Latest recommendations

IdTitle * Authors * Abstract * Picture * Thematic fields * RecommenderReviewersSubmission date
24 Dec 2020
article picture

A linear time solution to the Labeled Robinson-Foulds Distance problem

Comparing reconciled gene trees in linear time

Recommended by ORCID_LOGO based on reviews by Barbara Holland, Gabriel Cardona, Jean-Baka Domelevo Entfellner and 1 anonymous reviewer

Unlike a species tree, a gene tree results not only from speciation events, but also from events acting at the gene level, such as duplications and losses of gene copies, and gene transfer events [1]. The reconciliation of phylogenetic trees consists in embedding a given gene tree into a known species tree and, doing so, determining the location of these gene-level events on the gene tree [2]. Reconciled gene trees can be seen as phylogenetic trees where internal node labels are used to discriminate between different gene-level events. Comparing them is of foremost importance in order to assess the performance of various reconciliation methods (e.g. [3]).
A paper describing an extension of the widely used Robinson-Foulds (RF) distance [4] to trees with labeled internal nodes was presented earlier this year [5]. This distance, called ELRF, is based on edge edits and coincides with the RF distance when all internal labels are identical; unfortunately, the ELRF distance is very costly to compute. In the present paper [6], the authors introduce a distance called LRF, which is inspired by the TED (Tree Edit Distance [7]) and is based on node edits. As the ELRF, the new distance coincides with the RF distance for identically-labeled internal nodes, but has the additional desirable features of being computable in linear time. Also, in the ELRF distance, an edge can be deleted if only it connects nodes with the same label. The new formulation does not have this restriction, and this is, in my opinion, an improvement since the restriction makes little sense in the comparison of reconciled gene trees.
The authors show the pertinence of this new distance by studying the impact of taxon sampling on reconciled gene trees when internal labels are computed via a method based on species overlap. The linear algorithm to compute the LRF distance presented in the paper has been implemented and the software —written in Python— is freely available for the community to use it. I bet that the LRF distance will be widely used in the coming years!

References

[1] Maddison, W. P. (1997). Gene trees in species trees. Systematic biology, 46(3), 523-536. doi: https://doi.org/10.1093/sysbio/46.3.523
[2] Boussau, B., and Scornavacca, C. (2020). Reconciling gene trees with species trees. Phylogenetics in the Genomic Era, p. 3.2:1–3.2:23. [3] Doyon, J. P., Chauve, C., and Hamel, S. (2009). Space of gene/species trees reconciliations and parsimonious models. Journal of Computational Biology, 16(10), 1399-1418. doi: https://doi.org/10.1089/cmb.2009.0095
[4] Robinson, D. F., and Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical biosciences, 53(1-2), 131-147. doi: https://doi.org/10.1016/0025-5564(81)90043-2
[5] Briand, B., Dessimoz, C., El-Mabrouk, N., Lafond, M. and Lobinska, G. (2020). A generalized Robinson-Foulds distance for labeled trees. BMC Genomics 21, 779. doi: https://doi.org/10.1186/s12864-020-07011-0
[6] Briand, S., Dessimoz, C., El-Mabrouk, N. and Nevers, Y. (2020) A linear time solution to the labeled Robinson-Foulds distance problem. bioRxiv, 2020.09.14.293522, ver. 4 peer-reviewed and recommended by PCI Mathematical and Computational Biology. doi: https://doi.org/10.1101/2020.09.14.293522
[7] Zhang, K., and Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM journal on computing, 18(6), 1245-1262. doi: https://doi.org/10.1137/0218082

A linear time solution to the Labeled Robinson-Foulds Distance problemSamuel Briand, Christophe Dessimoz, Nadia El-Mabrouk and Yannis Nevers <p>Motivation Comparing trees is a basic task for many purposes, and especially in phylogeny where different tree reconstruction tools may lead to different trees, likely representing contradictory evolutionary information. While a large variety o...Combinatorics, Design and analysis of algorithms, Evolutionary BiologyCéline Scornavacca2020-08-20 21:06:23 View
23 Jul 2024
article picture

Alignment-free detection and seed-based identification of multi-loci V(D)J recombinations in Vidjil-algo

An accelerated Vidjil algorithm: up to 30X faster identification of V(D)J recombinations via spaced seeds and Aho-Corasick pattern matching

Recommended by ORCID_LOGO based on reviews by Sven Rahmann and 1 anonymous reviewer

VDJ recombination is a crucial process in the immune system, where a V (variable) gene, a D (diversity) gene, and a J (joining) gene are randomly combined to create unique antigen receptor genes. This process generates a vast diversity of antibodies and T-cell receptors, essential for recognizing and combating a wide array of pathogens. By identifying and quantifying these VDJ recombinations, we can gain a deeper and more precise understanding of the immune response, enhancing our ability to monitor and manage immune-related conditions.

It is therefore important to develop efficient methods to identify and extract VDJ recombinations from large sequences (e.g., several millions/billions of nucleotides). The work by Borée, Giraud, and Salson [2] contributes one such algorithm. As in previous work, the proposed algorithm employs the Aho-Corasick automaton to simultaneously match several patterns against a string but, differently from other methods, it also combines the efficiency of spaced seeds. Working with seeds rather than the original string has the net benefit of speeding up the algorithm and reducing its memory usage, sometimes at the price of a modest loss in accuracy. Experiments conducted on five different datasets demonstrate that these features grant the proposed method excellent practical performance compared to the best previous methods, like Vidjil [3] (up to 5X faster) and MiXCR [1] (up to 30X faster), with no quality loss.

The method can also be considered an excellent example of a more general trend in scalable algorithmic design: adapt "classic" algorithms (in this case, the Aho-Corasick pattern matching algorithm) to work in sketch space (e.g., the spaced seeds used here), trading accuracy for efficiency. Sometimes, this compromise is necessary for the sake of scaling to very large datasets using modest computing power.

References

[1] D. A. Bolotin, S. Poslavsky, I. Mitrophanov, M. Shugay, I. Z. Mamedov, E. V. Putintseva, and D. M. Chudakov (2015). "MiXCR: software for comprehensive adaptive immunity profiling." Nature Methods 12, 380–381. ISSN: 1548-7091. https://doi.org/10.1038/nmeth.3364

[2] C. Borée, M. Giraud, M. Salson (2024) "Alignment-free detection and seed-based identification of multi-loci V(D)J recombinations in Vidjil-algo". https://hal.science/hal-04361907v2, version 2, peer-reviewed and recommended by Peer Community In Mathematical and Computational Biology.

[3] M. Giraud, M. Salson, M. Duez, C. Villenet, S. Quief, A. Caillault, N. Grardel, C. Roumier, C. Preudhomme, and M. Figeac (2014). "Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing". BMC Genomics 15, 409. https://doi.org/10.1186/1471-2164-15-409

Alignment-free detection and seed-based identification of multi-loci V(D)J recombinations in Vidjil-algoCyprien Borée, Mathieu Giraud, Mikaël Salson<p>The diversity of the immune repertoire is grounded on V(D)J recombinations in several loci. Many algorithms and software detect and designate these recombinations in high-throughput sequencing data. To improve their efficiency, we propose a mul...Combinatorics, Computational complexity, Design and analysis of algorithms, Genomics and Transcriptomics, ImmunologyGiulio Ermanno Pibiri2023-12-28 18:03:42 View
22 Apr 2025
article picture

A compact model of Escherichia coli core and biosynthetic metabolism

‘Goldilocks’-size extensively annotated model for Escherichia coli metabolism

Recommended by ORCID_LOGO based on reviews by Daan de Groot, Benjamin Luke Coltman and 1 anonymous reviewer

Metabolism is the driving force of life and thereby plays a key role in understanding microbial functioning in monoculture and in ecosystems, from natural habitats to biotechnological applications, from microbiomes related to human health to food production. However, the complexity of metabolic networks poses a major challenge for understanding how they are shaped by evolution and how we can manipulate them. Therefore, many network-based methods have been developed to study metabolism.
With the vast increase of genomic data, genome-scale metabolic networks have become popular use. For these stoichiometric models, metabolic enzymes are predicted using genome data and subsequently algorithms are used to add reactions to construct a complete (biomass producing) metabolic network (e.g., Henry et al., 2010; Machado et al., 2018; see for an overview Mendoza et al., 2019). Many tools are being developed to make predictions with these models, usually variations of FBA (Orth et al., 2010), but also methods for community predictions (Scott Jr et al., 2023) and simulations in time and space (Bauer et al., 2017; Dukovski et al., 2021). The vast amount of sequencing data combined with the high-throughput possibilities of this method make it appealing, but there is a drawback: Namely that the automated construction of networks lacks accuracy and often curation is necessary before these models produce realistic and useful results. This is exemplified by recent studies of microbial metabolism that are better predicted by genome content only than by actual metabolic models (Gralka et al., 2023; Li et al., 2023).

On the other end are well-curated small-scale models of metabolic pathways. For those, knowledge of the enzymes of a pathway, their kinetic properties and (optionally) regulation by metabolites is incorporated in usually a differential equation model. Standard methods for systems of differential equations can be used to study steady-states and the dynamics of these models, which can lead to accurate predictions (Flamholz et al., 2013; van Heerden et al., 2014). However, the downside is that the methods are difficult to scale up and, for many enzymes, the detailed information necessary for these models is not available. Combined with computational challenges, these models are limited to specific pathways and cannot be used for whole cells, nor even communities. Therefore, there is still a need for both methods and models to make accurate predictions on a scale beyond single pathways.

Corrao et al. (2025) aim for an intermediate size model that is both accurate and predictive, does not need an extensive set of enzyme parameters, but also encompasses most of the cell’s metabolic pathways. As they phrase it: a model in the ‘Goldilocks’ zone. Curation can improve genome-scale models substantially but requires additional experimental data. However, as the authors show, even the well-curated model of Escherichia coli can sometimes show unrealistic metabolic flux patterns. A smaller model can be better curated and therefore more predictive, and more methods can be applied, as for example EFM based approaches. The authors show an extensive set of methodologies that can be applied to this model and yield interpretable results. Additionally, the model contains a wealth of standardized annotation that could set a standard for the field.

This is a first model of its kind, and it is not surprising that E. coli is used as its metabolism is very well-studied. However, this could set the basis for similar models for other well-studied organisms. Because the model is well-annotated and characterized, it is very suitable for testing new methods that make predictions with such an intermediate-sized model and that can later be extended for larger models. In the future, such models for different species could aid the creation of methods for studying and predicting metabolism in communities, for which there is a large need for applications (e.g. bioremediation and human health).

The different layers of annotation and the available code with clear documentation make this model an ideal resource as teaching material as well. Methods can be explained on this model, which can still be visualized and interpreted because of its reduced size, while it is large enough to show the differences between methods.

Although it might be too much to expect models of this type for all species, the different layers of annotation can be used to inspire better annotation of genome-scale models and enhance their accuracy and predictability. Thus, this paper sets a standard that could benefit research on metabolic pathways from individual strains to natural communities to communities for biotechnology, bioremediation and human health.

References

Bauer, E., Zimmermann, J., Baldini, F., Thiele, I., Kaleta, C., 2017. BacArena: Individual-based metabolic modeling of heterogeneous microbes in complex communities. PLOS Comput. Biol. 13, e1005544. https://doi.org/10.1371/journal.pcbi.1005544

Corrao, M., He, H., Liebermeister, W., Noor, E., Bar-Even, A., 2025. A compact model of Escherichia coli core and biosynthetic metabolism. arXiv, ver.4, peer-reviewed and recommended by PCI Mathematical and Computational Biology. https://doi.org/10.48550/arXiv.2406.16596

Dukovski, I., Bajić, D., Chacón, J.M., Quintin, M., Vila, J.C.C., Sulheim, S., Pacheco, A.R., Bernstein, D.B., Riehl, W.J., Korolev, K.S., Sanchez, A., Harcombe, W.R., Segrè, D., 2021. A metabolic modeling platform for the computation of microbial ecosystems in time and space (COMETS). Nat. Protoc. 16, 5030–5082. https://doi.org/10.1038/s41596-021-00593-3

Flamholz, A., Noor, E., Bar-Even, A., Liebermeister, W., Milo, R., 2013. Glycolytic strategy as a tradeoff between energy yield and protein cost. Proc. Natl. Acad. Sci. 110, 10039–10044. https://doi.org/10.1073/pnas.1215283110

Gralka, M., Pollak, S., Cordero, O.X., 2023. Genome content predicts the carbon catabolic preferences of heterotrophic bacteria. Nat. Microbiol. 8, 1799–1808. https://doi.org/10.1038/s41564-023-01458-z

Henry, C.S., DeJongh, M., Best, A.A., Frybarger, P.M., Linsay, B., Stevens, R.L., 2010. High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat. Biotechnol. 28, 977–982. https://doi.org/10.1038/nbt.1672

Li, Z., Selim, A., Kuehn, S., 2023. Statistical prediction of microbial metabolic traits from genomes. PLOS Comput. Biol. 19, e1011705. https://doi.org/10.1371/journal.pcbi.1011705

Machado, D., Andrejev, S., Tramontano, M., Patil, K.R., 2018. Fast automated reconstruction of genome-scale metabolic models for microbial species and communities. Nucleic Acids Res. 46, 7542–7553. https://doi.org/10.1093/nar/gky537

Mendoza, S.N., Olivier, B.G., Molenaar, D., Teusink, B., 2019. A systematic assessment of current genome-scale metabolic reconstruction tools. Genome Biol. 20, 158. https://doi.org/10.1186/s13059-019-1769-1

Orth, J.D., Thiele, I., Palsson, B.Ø., 2010. What is flux balance analysis? Nat. Biotechnol. 28, 245–248. https://doi.org/10.1038/nbt.1614

Scott Jr, W.T., Benito-Vaquerizo, S., Zimmermann, J., Bajić, D., Heinken, A., Suarez-Diez, M., Schaap, P.J., 2023. A structured evaluation of genome-scale constraint-based modeling tools for microbial consortia. PLOS Comput. Biol. 19, e1011363. https://doi.org/10.1371/journal.pcbi.1011363

van Heerden, J.H., Wortel, M.T., Bruggeman, F.J., Heijnen, J.J., Bollen, Y.J.M., Planqué, R., Hulshof, J., O’Toole, T.G., Wahl, S.A., Teusink, B., 2014. Lost in Transition: Start-Up of Glycolysis Yields Subpopulations of Nongrowing Cells. Science 343, 1245114. https://doi.org/10.1126/science.1245114

A compact model of Escherichia coli core and biosynthetic metabolismMarco Corrao, Hai He, Wolfram Liebermeister, Elad Noor, Arren Bar-Even<p>Metabolic models condense biochemical knowledge about organisms in a structured and standardised way. As large-scale network reconstructions are readily available for many organisms, genome-scale models are being widely used among modellers and...Cell Biology, Systems biologyMeike Wortel2024-10-22 10:26:48 View