Recommendation

Faster method for estimating the openness of species

Leo van Iersel based on reviews by Guillaume Marçais, Abiola Akinnubi and 1 anonymous reviewer

A recommendation of:

Revisiting pangenome openness with k-mers

Luca Parmigiani, Roland Wittler, Jens Stoye (2024), bioRxiv, ver.4, peer-reviewed and recommended by PCI Mathematical and Computational Biology https://doi.org/10.1101/2022.11.15.516472

Now published in Peer Community Journal.

Abstract

Revisiting pangenome openness with k-mers

Pangenomics is the collective study of related genomes, usually from the same species or closely related taxa. Originally, pangenomes were defined for bacterial species. After the concept was extended to eukaryotic genomes, two definitions of pangenome evolved in parallel: the gene-based approach, which defines the pangenome as the union of all genes, and the sequence-based approach, which defines the pangenome as the set of all nonredundant genomic sequences. Estimating the total size of the pangenome for a given species has been a subject of study since the very first mention of pangenomes. Traditionally, this is done by predicting the rate at which new genes are discovered, referred to as the openness of the species. Here, we abstract each genome as a set of items, an abstraction that is entirely agnostic of the two approaches (gene-based, sequence-based). Genes are a viable option for items, but other possibilities are also feasible, e.g., genome sequence substrings of fixed length k (k-mers). In the present study, we investigate the use of k-mers to estimate the openness as an alternative to genes, and compare the results. An efficient implementation is also provided.

Bioinformatics, Pangenomics

Submission: posted 22 November 2022, validated 23 November 2022
Recommendation: posted 10 April 2024, validated 10 April 2024

Cite this recommendation as:
van Iersel, L. (2024) Faster method for estimating the openness of species. Peer Community in Mathematical and Computational Biology, 100185. https://doi.org/10.24072/pci.mcb.100185

Recommendation

When sequencing more and more genomes of a species (or a group of closely related species), a natural question to ask is how quickly the total number of distinct sequences grows as a function of the total number of sequenced genomes. A similar question can be asked about the number of distinct genes or the number of distinct k-mers (length-k substrings).

The paper “Revisiting pangenome openness with k-mers” [1] describes a general mathematical framework that can be applied to each of these versions. A genome is viewed abstractly as a set of “items” and a species as a set of genomes. The question then is how fast the function $f_{\text{tot}}$, the average size of the union of $m$ genomes of the species, grows as a function of $m$. Basically, the faster the growth, the more “open” the species is. More precisely, $f_{\text{tot}}$ can be described by a power law plus a constant, and the openness $\alpha$ is defined as one minus the exponent $\gamma$ of the power law (so a smaller $\alpha$ corresponds to a more open species).

With these definitions one can distinguish between “open” pangenomes ($\alpha < 1$), where the total size $f_{\text{tot}}$ tends to infinity, and “closed” pangenomes ($\alpha > 1$), where the total size $f_{\text{tot}}$ tends to a constant. However, performing this classification is difficult in practice and the relevance of such a dichotomy is debatable. Hence, the authors of the current paper focus on estimating the openness parameter $\alpha$.
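The two regimes can be written out as a short worked derivation. This is only a sketch of the description above: the symbols $\kappa$ (scale factor) and $c$ (additive constant) are names assumed here for illustration, and the sign of $\kappa$ is assumed to be such that $f_{\text{tot}}$ is non-decreasing in $m$.

```latex
f_{\mathrm{tot}}(m) \;\approx\; \kappa\, m^{\gamma} + c,
\qquad \alpha = 1 - \gamma

% Open:   \alpha < 1 \iff \gamma > 0, so m^{\gamma} grows without bound
\alpha < 1:\quad m^{\gamma} \to \infty
\;\Rightarrow\; f_{\mathrm{tot}}(m) \to \infty \quad (\kappa > 0)

% Closed: \alpha > 1 \iff \gamma < 0, so the power-law term vanishes
\alpha > 1:\quad m^{\gamma} \to 0
\;\Rightarrow\; f_{\mathrm{tot}}(m) \to c \quad (\kappa < 0)
```

Under the same assumed form, the number of new items contributed by the $m$-th genome, $f_{\mathrm{tot}}(m) - f_{\mathrm{tot}}(m-1)$, behaves roughly like $\kappa\gamma\, m^{-\alpha}$, which is why $\alpha$ can be read as the decay exponent of newly discovered items.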

The definition of openness given in the paper was suggested by one of the reviewers and fixes a problem with a previous definition (in which it was mathematically impossible for a pangenome to be closed).

While the framework is very general, the authors apply it by using k-mers to estimate pangenome openness. This is an innovative approach because, even though k-mers are used frequently in pangenomics, they had not been used before to estimate openness. One major advantage of k-mers is that they can be extracted directly from data consisting of sequencing reads, without the need for preprocessing. In addition, k-mers also cover non-coding regions of the genomes, which is particularly relevant when studying the openness of eukaryotic species.
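As a rough illustration of the "genome as a set of items" abstraction with k-mers as items, the following Python sketch builds one k-mer set per sequence and counts in how many genomes each k-mer occurs. The function names `kmer_set` and `occurrence_histogram` and the toy sequences are assumptions made for this example. The sketch also ignores reverse complements, sequencing errors and multi-contig inputs, all of which a real tool would have to handle.

```python
from collections import Counter


def kmer_set(sequence: str, k: int) -> set[str]:
    """Return the set of distinct length-k substrings of one sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def occurrence_histogram(genomes: list[set[str]]) -> list[int]:
    """h[i] = number of items present in exactly i of the genomes (h[0] is unused)."""
    presence = Counter()
    for items in genomes:
        presence.update(items)  # a set contributes each item at most once
    h = [0] * (len(genomes) + 1)
    for count in presence.values():
        h[count] += 1
    return h


# Toy example with three hypothetical "genomes" and k = 3.
genomes = [kmer_set(s, 3) for s in ("ACGTAC", "ACGTTT", "TTTACG")]
hist = occurrence_histogram(genomes)
print(hist)
```

Only the histogram of occurrence counts is needed to study how the union of items grows, which is one reason a set-based abstraction works equally well for genes and for k-mers.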

The method is evaluated on 12 bacterial pangenomes with impressive results. The estimated openness is very close to the results of several gene-based tools (Roary, Pantools and BPGA) but the running time is much better: it is one to three orders of magnitude faster than the other methods.

Another appealing aspect of the method is that it computes the function $f_{\text{tot}}$ exactly, using a method that was known in the ecology literature but had not been noticed in the pangenomics field. The openness is then estimated by fitting a power-law function.
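A minimal Python sketch of these two steps is given below. It assumes the standard combinatorial identity in which an item present in $i$ of $n$ genomes is missed by a random subset of $m$ genomes with probability $\binom{n-i}{m}/\binom{n}{m}$. The fitting step is a plain log-log least-squares fit on the increments of the curve, chosen here for simplicity; the paper's actual fitting procedure may differ. All function names are hypothetical.

```python
import math
from itertools import combinations


def f_tot_exact(h: list[int], m: int) -> float:
    """Average union size over all m-subsets of the n genomes.

    h[i] is the number of items occurring in exactly i of the n genomes.
    """
    n = len(h) - 1
    total = math.comb(n, m)
    # math.comb(n - i, m) is 0 when m > n - i, i.e. the item cannot be missed.
    return sum(h[i] * (1 - math.comb(n - i, m) / total) for i in range(1, n + 1))


def f_tot_bruteforce(genomes: list[set], m: int) -> float:
    """Same quantity by explicit enumeration (only feasible for tiny inputs)."""
    sizes = [len(set().union(*subset)) for subset in combinations(genomes, m)]
    return sum(sizes) / len(sizes)


def fit_alpha(f_tot: list[float]) -> float:
    """Estimate alpha as minus the least-squares slope of log f_new vs log m.

    f_tot[j] is the value for m = j + 1, and f_new(m) = f_tot(m) - f_tot(m - 1).
    """
    xs, ys = [], []
    for j in range(1, len(f_tot)):
        new = f_tot[j] - f_tot[j - 1]
        if new > 0:  # the logarithm is undefined for non-positive increments
            xs.append(math.log(j + 1))
            ys.append(math.log(new))
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return -sxy / sxx


# Toy item sets: items 1, 4, 5 occur once, item 2 twice, item 3 three times.
toy_genomes = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
toy_h = [0, 3, 1, 1]
toy_curve = [f_tot_exact(toy_h, m) for m in range(1, 4)]
print(toy_curve, fit_alpha(toy_curve))
```

The exact formula needs only the occurrence histogram, so its cost does not depend on the number of possible genome subsets, whereas the brute-force version is included purely as a cross-check.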

Finally, the paper [1] offers a clear presentation of the problem, the approach and the results, with nice examples using real data.

References

[1] Parmigiani, L., Wittler, R. and Stoye, J. (2024) "Revisiting pangenome openness with k-mers". bioRxiv, ver. 4, peer-reviewed and recommended by Peer Community in Mathematical and Computational Biology. https://doi.org/10.1101/2022.11.15.516472

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
This project received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 956229. It was also supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A)

Reviews

Reviewed by Guillaume Marçais, 22 Feb 2024

The authors have addressed my remaining major comments: 1) fixed the mathematical definition of open / closed genomes and 2) added a discussion on examples of closed genomes.

I have no further comments.

https://doi.org/10.24072/pci.mcb.100185.rev31

Evaluation round #2

DOI or URL of the preprint: https://doi.org/10.1101/2022.11.15.516472

Version of the preprint: 3

Author's Reply, 20 Feb 2024

Download author's reply https://doi.org/10.24072/pci.mcb.100185.ar2

Decision by Leo van Iersel, posted 12 Oct 2023, validated 12 Oct 2023

The authors did a very good job incorporating most comments of the reviewers. Unfortunately, the main comment has not yet been resolved sufficiently. Although I do believe that the added text has improved the paper, I agree with the reviewer that more work needs to be done.

For example, in the introduction you state that "one of the most outstanding discoveries at the time was that some species possess an open pangenome and others a closed pangenome" but then you use a model in which it is impossible for a pangenome to be closed.

The other remaining problem is that you test your method only on open pangenomes. Of course this makes sense because it cannot work for closed pangenomes, but then this weakness of the method should at least be stated clearly.

The reviewer gives a very nice suggestion for how to resolve these issues. I recommend following this suggestion if possible.

https://doi.org/10.24072/pci.mcb.100185.d2

Reviewed by Guillaume Marçais, 21 Sep 2023

Download the review https://doi.org/10.24072/pci.mcb.100185.rev21

Reviewed by anonymous reviewer 1, 23 Aug 2023

I support this article. Accept.

https://doi.org/10.24072/pci.mcb.100185.rev22

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2022.11.15.516472

Version of the preprint: 2

Author's Reply, 17 Aug 2023

Download author's reply https://doi.org/10.24072/pci.mcb.100185.ar1

Decision by Leo van Iersel, posted 10 May 2023, validated 15 May 2023

I find the problem and approach very interesting, but the reviewers have a number of important questions that need to be answered first. Most importantly, please:

- justify the definitions of open/closed genomes that are used and explain the practical relevance of the results based on these definitions;

- make the supplementary figures and tables available;

- explain why the blue line in Figure 1 does not fit the data;

- analyse how the practical running time depends on the number of samples;

- analyse the distribution of alpha values in the experiments;

- explain why the method was compared only to Roary and Pantools.

https://doi.org/10.24072/pci.mcb.100185.d1

Reviewed by Guillaume Marçais, 07 Feb 2023

In "Revisiting pangenome openness with k-mers" the authors give a computational
method and an implementation to estimate how "open" a pan-genome is, that is,
whether the genome of a species has many variant genes (open) or is more
constrained (closed). This is traditionally done by comparing the gene content of
different individual bacteria of a species, but is done here using k-mer content
instead.

Although the proposed computational method seems correct, the definition of
open/closed pan-genome raises questions. Consequently, the conclusions drawn from
the experiments are affected by the flaw in the definition.

Major comments
==============

* Page 4, line 153: it is stated that 0 <= \gamma <= 1 and \alpha = 1 - \gamma
(hence 0 <= \alpha <= 1 as well). Then on line 158, a closed genome is defined
by \alpha > 1, which cannot happen by definition. A closed genome
would imply \gamma < 0, that is, the number of k-mers seen would be a decreasing
function of m (m = number of genomes considered). This simply cannot be observed.

Unsurprisingly, all the values reported by the proposed method (see Fig. 4)
have an \alpha < 1 and are all declared to be open genomes. That is not an
empirical conclusion based on data, but a mathematical guarantee independent
of the data.

* The Supplementary material does not seem to be available, even though it
contains important figures.

* Fig 1 page 5: the fitting of the blue line does not seem to match the data.
The conclusion that \alpha = 0.98 for this data set is questionable. It seems
like this data does not follow Heaps' law. Maybe the fact that this data does
not follow Heaps' law is the signature of a closed genome?

Minor comments
==============

* The use of GMP to compute f_{tot} is not well justified. The ratio (n-i)^{m} /
n^{m} (where ^{m} is the falling factorial as in the text) probably doesn't
need an infinite precision library. It is the product of the ratios
(n-i-j)/(n-j) for 0 <= j < m. These ratios and their product can most likely
be stored in double floats without significant loss in precision (and is
likely cheaper to compute).

* There is no timing or memory usage information given for the bacterial
experiments.

* GNU should be capitalized (it is an acronym)

* Page 9, line 274: "making k-mers more suitable" is ambiguous. Are k-mers more
suitable for bacterial genomes or eukaryotic genomes?
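The point about double-precision floats in the first minor comment can be checked with a small Python sketch. It compares the running product of the ratios (n-i-j)/(n-j) against an exact rational computation of the same quantity. The function names are hypothetical and this is not the authors' implementation; it only illustrates the reviewer's suggestion on a few parameter choices and does not establish accuracy for all inputs.

```python
import math
from fractions import Fraction


def miss_probability_float(n: int, i: int, m: int) -> float:
    """Falling-factorial ratio (n-i)^{m} / n^{m} as a product of m float ratios."""
    p = 1.0
    for j in range(m):
        if n - i - j <= 0:  # fewer than m genomes lack the item
            return 0.0
        p *= (n - i - j) / (n - j)
    return p


def miss_probability_exact(n: int, i: int, m: int) -> Fraction:
    """Same ratio computed exactly, since it equals C(n-i, m) / C(n, m)."""
    return Fraction(math.comb(n - i, m), math.comb(n, m))


# Compare both versions for an item seen in i of n genomes and subsets of size m.
for n, i, m in [(1000, 10, 500), (50, 3, 20), (12, 5, 8)]:
    approx = miss_probability_float(n, i, m)
    exact = float(miss_probability_exact(n, i, m))
    print(n, i, m, approx, exact)
```

Each factor lies between 0 and 1, so the running product cannot overflow, and the rounding error grows only with the number of factors m.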

https://doi.org/10.24072/pci.mcb.100185.rev11

Reviewed by anonymous reviewer 1, 09 May 2023

Parmigiani et al. used k-mers to estimate pan-genome openness. It's a nice idea, but also challenging work. I have some small questions:

1) Based on the different numbers of samples (10, 20, 50, 100), what is the running time of this algorithm?

2) The authors tested twelve bacterial species; how many strains were tested for each species?

3) In addition to Roary and Pantools, should it be compared with other software?

4) The authors compare the sensitivity of different k-mers to the pan-genome openness estimation. Compared with other software, what is the distribution of α values for the twelve different species under different k-mers?

https://doi.org/10.24072/pci.mcb.100185.rev12

Reviewed by Abiola Akinnubi, 14 Apr 2023

The mathematical equations were well explained and well articulated. I endorse this paper and recommend that it be accepted.

Thank you

https://doi.org/10.24072/pci.mcb.100185.rev13
