Recommendation

An efficient implementation of legofit software to infer demographic histories from population genetic data

Matteo Fumagalli based on reviews by Fernando Racimo and 1 anonymous reviewer

A recommendation of:

An efficient algorithm for estimating population history from genetic data

Alan R. Rogers (2021), bioRxiv, 2021.01.23.427922, ver. 5 peer-reviewed and recommended by Peer community in Mathematical and Computational Biology https://doi.org/10.1101/2021.01.23.427922

Now published in Peer Community Journal

Abstract

An efficient algorithm for estimating population history from genetic data

The Legofit statistical package uses genetic data to estimate parameters describing population history. Previous versions used computer simulations to estimate probabilities, an approach that limited both speed and accuracy. This article describes a new deterministic algorithm, which makes Legofit faster and more accurate. The speed of this algorithm declines as model complexity increases. With very complex models, the deterministic algorithm is slower than the stochastic one. In an application to simulated data sets, the estimates produced by the deterministic and stochastic algorithms were essentially identical. Reanalysis of a human data set replicated the findings of a previous study and provided increased support for the hypotheses that (a) early modern humans contributed genes to Neanderthals, and (b) a “superarchaic” population (which separated from all other humans early in the Pleistocene) was either large or deeply subdivided.

population genetics, statistical inference, Markov chains, combinatorics, population history

Submission: posted 26 January 2021
Recommendation: posted 19 May 2021, validated 26 May 2021

Cite this recommendation as:
Fumagalli, M. (2021) An efficient implementation of legofit software to infer demographic histories from population genetic data. Peer Community in Mathematical and Computational Biology, 100003. https://doi.org/10.24072/pci.mcb.100003

Recommendation

The estimation of demographic parameters from population genetic data has been the subject of many scientific studies [1]. Among these efforts, legofit was first proposed in 2019 as a tool to infer size changes, subdivision and gene flow events from patterns of nucleotide variation [2]. The first release of legofit used a stochastic algorithm to fit population parameters to the observed data. Because it requires simulations to evaluate the fit of each model, it is computationally intensive and can only be deployed on high-performance computing clusters.
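The cost of estimating probabilities by simulation can be illustrated with a minimal sketch. It is not taken from legofit: the function names are hypothetical, and it assumes the standard coalescent result that, with time measured in units of 2N generations, the coalescence time of two lineages is exponentially distributed with rate 1. The simulated estimate only approaches the exact value as the number of replicates grows.

```python
import math
import random


def simulated_prob_no_coalescence(t, reps, seed=0):
    """Monte Carlo estimate of P(two lineages have not coalesced by time t).

    Assumption: time is in units of 2N generations, so the pairwise
    coalescence time is exponential with rate 1.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(reps) if rng.expovariate(1.0) > t)
    return hits / reps


def exact_prob_no_coalescence(t):
    """Closed-form value of the same probability under the same assumption."""
    return math.exp(-t)


if __name__ == "__main__":
    t = 0.5
    exact = exact_prob_no_coalescence(t)
    for reps in (100, 10_000, 100_000):
        est = simulated_prob_no_coalescence(t, reps)
        print(f"reps={reps:>7}  estimate={est:.5f}  abs error={abs(est - exact):.5f}")
```

The sampling error of such an estimate shrinks only with the square root of the number of replicates, which is one way to see why a simulation-based fit is expensive.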

To overcome this issue, Rogers proposes a new implementation of legofit based on a deterministic algorithm that makes the estimation of demographic histories computationally faster and more accurate [3]. The new algorithm employs a continuous-time Markov chain that traces the ancestry of each sample into the past. The calculations are now divided into two steps, the first one being solved numerically. To test whether the new implementation of legofit performs better, Rogers generated extensive simulations of genomes from African, European, Neanderthal and Denisovan populations with msprime [4]. Additionally, legofit was tested on real genetic data from samples of said populations, following a previously published study [5].
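As a rough illustration of the kind of continuous-time Markov chain involved, the sketch below computes the probability that n sampled lineages have k ancestors after time t in a single segment of constant population size. It is an assumption-laden toy, not the legofit implementation: the function names are hypothetical, time is assumed to be in units of 2N generations, k lineages are assumed to coalesce at rate k(k-1)/2, and the matrix exponential is computed by a plain Taylor series with scaling and squaring rather than by the two-step scheme described in the preprint.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=30, squarings=10):
    """Matrix exponential via Taylor series with scaling and squaring."""
    n = len(m)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def coalescent_generator(n):
    """Rate matrix on states 1..n lineages (state k is stored at index k-1)."""
    q = [[0.0] * n for _ in range(n)]
    for k in range(2, n + 1):
        rate = k * (k - 1) / 2.0
        q[k - 1][k - 1] = -rate   # leave state k
        q[k - 1][k - 2] = rate    # move to state k-1
    return q


def lineage_probs(n, t):
    """Return [P(1 lineage), ..., P(n lineages)] at time t, starting from n."""
    q = coalescent_generator(n)
    qt = [[x * t for x in row] for row in q]
    return mat_exp(qt)[n - 1]


if __name__ == "__main__":
    for k, p in enumerate(lineage_probs(4, 0.5), start=1):
        print(f"P({k} ancestral lineages) = {p:.6f}")
```

Because the probabilities come from a deterministic calculation rather than from repeated simulation, they carry no sampling noise, although the cost of a calculation of this kind grows with the number of lineages and the complexity of the model.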

Based on simulations, the new deterministic algorithm is more than 1600 times faster than the previous stochastic algorithm. Notably, the new version of legofit produces smaller residual errors, although its overall accuracy in estimating population parameters is comparable to that of the stochastic algorithm. When applied to real data, the new implementation of legofit recapitulated previous findings of a complex demographic model with early gene flow from humans to Neanderthals [5]. The new implementation also discriminates better between models, and therefore infers the population history with greater precision. Some parameters estimated from real data point towards unrealistic scenarios, suggesting that the initial model could be misspecified.

Further research is needed to fully explore the parameter range that can be evaluated by legofit, and to clarify the source of any associated bias. Additionally, the inclusion of data uncertainty in parameter estimation and model selection may be required to apply legofit to low-coverage high-throughput sequencing data [6]. Nevertheless, legofit is an efficient, accessible and user-friendly software package for inferring demographic parameters from genetic data and can be widely applied to test hypotheses in evolutionary biology. The new implementation of legofit is freely available at https://github.com/alanrogers/legofit.

References

[1] Spence JP, Steinrücken M, Terhorst J, Song YS (2018) Inference of population history using coalescent HMMs: review and outlook. Current Opinion in Genetics & Development, 53, 70–76. https://doi.org/10.1016/j.gde.2018.07.002

[2] Rogers AR (2019) Legofit: estimating population history from genetic data. BMC Bioinformatics, 20, 526. https://doi.org/10.1186/s12859-019-3154-1

[3] Rogers AR (2021) An Efficient Algorithm for Estimating Population History from Genetic Data. bioRxiv, 2021.01.23.427922, ver. 5 peer-reviewed and recommended by Peer community in Mathematical and Computational Biology. https://doi.org/10.1101/2021.01.23.427922

[4] Kelleher J, Etheridge AM, McVean G (2016) Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLOS Computational Biology, 12, e1004842. https://doi.org/10.1371/journal.pcbi.1004842

[5] Rogers AR, Harris NS, Achenbach AA (2020) Neanderthal-Denisovan ancestors interbred with a distantly related hominin. Science Advances, 6, eaay5483. https://doi.org/10.1126/sciadv.aay5483

[6] Soraggi S, Wiuf C, Albrechtsen A (2018) Powerful Inference with the D-Statistic on Low-Coverage Whole-Genome Data. G3 Genes|Genomes|Genetics, 8, 551–566. https://doi.org/10.1534/g3.117.300192

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Reviews

Evaluation round #1

DOI or URL of the preprint: https://www.biorxiv.org/content/10.1101/2021.01.23.427922v2

Author's Reply, 04 May 2021

https://doi.org/10.24072/pci.mcb.100036.ar1

Decision by Matteo Fumagalli, posted 02 Mar 2021

Dear Professor Rogers,

Your manuscript has been reviewed by two experts in the field. They both found your study worthy of merit, useful and of interest. I agree with their assessment. However, they raise some questions which should be fully addressed before recommending this paper. For instance, several clarifications are sought on the simulations performed. Additionally, either a small application to real data or an explicit comparison of run times between different versions of legofit would improve the appeal of this study. Please also fix any typos and inconsistencies (e.g. numbering of references). Additionally, I would encourage you to provide more details on the context of legofit for readers who are not familiar with the original paper, and to provide a more comprehensive abstract. Please provide a point-by-point letter addressing all reviewers' comments.

I am looking forward to reading a revised version of this manuscript.

Kind Regards,

Matteo Fumagalli

https://doi.org/10.24072/pci.mcb.100036.d1

Reviewed by anonymous reviewer 1, 20 Feb 2021

In this paper, the author presents a deterministic alternative to his previously developed approach to estimate historic demographic parameters from cross-population genetic samples. The original method was published and reviewed elsewhere (ref. 11). The most important feature of the deterministic algorithm presented here, which is based on exact combinatorics, is that it has a shorter runtime for non-complex models. The derivations of the exact deterministic computations are well described and the structure of this part of the manuscript makes it easy to read and understand. While I think the improved efficiency of the algorithm has been demonstrated sufficiently, some questions remain regarding the simulations as well as the utility of the method.

• It is stated that the simulation uses the gene genealogy of Fig. 1. Does this imply that individuals sampled from population Y can exclusively be migrants from populations N or D?
• What is the assumed sample size (i.e. n) in the simulations? It is mentioned that there are ten times as many points as free parameters (p. 7), implying a sample size of around ten based on a statement in ref. 11. Why was this sample size chosen?
• In Fig. 2, it is stated that the number of estimated parameters is 11, whereas, according to the supplemental file, the model has a total of 16 parameters that concern demography (i.e. population sizes, split times, migration rates). Are the five remaining parameters assumed to be known? Could they be estimated as well (e.g. time of Neanderthal admixture)? Or are they fixed to avoid unidentifiability of other parameters like e.g. the migration rates? Does the need to fix these important parameters (including also Nxy and Txynd) limit the applicability of the approach to relevant real-world scenarios?
• How does the accuracy of the estimation of population sizes and split times depend on the migration rates (i.e. how robust are they under variation of migration rates)?
• The described procedure to compute partitions among ancestors is carried out within segments of constant population size 2N. However, in the case of migration as depicted in Fig. 1, when looking backward in time a lineage leaves population Y and appears in population N. It has therefore evolved in a segment with population size 2NY until the time of admixture and "after" that it evolves in a segment of size 2NN. It is not clear to me how this dynamics is accounted for in the algorithm to compute the branch lengths of site patterns.
• On p. 9, it is discussed how the observed correlations of estimated parameters (leading to non-identifiability) can be attempted to be ameliorated using PCA. However, it is mentioned that PCA introduces further bias and that therefore it is omitted here. I do not understand how, then, the correlations are corrected for instead. Also, what are stages 4 and 5 of the algorithm doing, as apparently there is still some kind of dimensionality reduction carried out in them.
• In Fig. 3, since absolute differences in pattern frequencies are plotted, it is hard to assess the accuracy of the algorithm (besides the direct comparison between the two modes, stochastic vs. deterministic). I would recommend plotting relative errors instead.
• Why is 2N repeatedly named the "haploid population size" (instead of N)? Conversely, in the supplement N is confusingly called the "di[p]loid population size".
• What is the following parameter, mentioned in the supplemental file: "c = 1e-8 # rate per base pair per generation"?
• Note that migration rates are denoted differently in Figs. 1 and 2.
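The point about Fig. 3 in the list above can be made concrete with a small sketch. Everything in it is hypothetical: the site-pattern labels and frequencies are invented for illustration and do not come from the manuscript. It shows how identical absolute differences in site-pattern frequency can correspond to very different relative errors.

```python
def absolute_and_relative_errors(observed, fitted):
    """Return {pattern: (absolute error, relative error)} for each pattern."""
    errors = {}
    for pattern, obs in observed.items():
        diff = fitted[pattern] - obs
        errors[pattern] = (diff, diff / obs)
    return errors


# Hypothetical site-pattern frequencies, invented for illustration only.
observed = {"x": 0.400, "xy": 0.100, "xyn": 0.002}
fitted = {"x": 0.401, "xy": 0.101, "xyn": 0.003}

if __name__ == "__main__":
    for pattern, (abs_err, rel_err) in absolute_and_relative_errors(observed, fitted).items():
        print(f"{pattern:>4}  absolute={abs_err:+.4f}  relative={rel_err:+.2%}")
```

In this toy case each pattern is off by roughly the same absolute amount, but the rare pattern is off by a far larger fraction of its own frequency, which an absolute-difference plot would not reveal.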

https://doi.org/10.24072/pci.mcb.100036.rev11

Reviewed by Fernando Racimo, 28 Jan 2021

In this manuscript, Rogers provides a new procedure that makes one of his previously published methods for inferring demographic parameters from genomic data orders of magnitude faster. The algorithm relies on two improvements. One is the use of the matrix coalescent for deriving the probability that there are a certain number of ancestral lineages that subtend a given number of descendant lineages over a period of time, including a new way of factoring this model into 2 steps, which solves numerical issues associated with it. The other relies on the way descendant samples are partitioned among their possible ancestors. The new algorithm makes his previous legofit method more widely applicable to biologists interested in demographic history who don't have access to large computer clusters. The mathematical derivations about probabilities of different numbers of descendants might also be useful in other coalescent-based inference methods.

I only have minor comments:

While individual subsections of the method section were clearly explained, I feel the connections between the different sections were a bit hard to follow. For example, the transition from section 2.2 to 2.3 was very disjointed: it was unclear what role the matrix coalescent was playing in the calculation of the expected branch length, which was brought up right above it. I would recommend adding a schematic that describes how the different steps of the algorithm relate to each other, and where the key improvements are, perhaps using an example drawing of a segment with a certain number of descendant and ancestral lineages, embedded in a larger population graph. Another useful schematic could be used to describe the different steps in the data analysis pipeline.

What is the size (in physical distance) of each simulation? What are their mutation and recombination rates? Are these supposed to represent 50 uncorrelated windows from the same genome, which are used to infer the same demographic history? Does this imply that a real-world application would involve partitioning the genome a priori somehow? It was unclear why the author was using 50 simulations; it would be good to explain the reason for that, especially for readers who are not familiar with the original legofit program.

It would be nice to see a small application to a real-world data problem, to motivate the use of the algorithm, and a comparison of run times between the old and new versions under this scenario, for an end user to have a realistic assessment of the speed improvement. This could involve, for example, inference in a dataset involving a small number of admixture and population split times on a real human genome, like the super-archaic scenario from Prufer et al. (2014), used by the author in Rogers (2019).

Related to the above comments, are the results on the super-archaic scenario in any way affected by the improvements (in terms of greater parameter accuracy) of the new algorithm?

It would be helpful to add parameter names to the schematic in Figure 1: TA, TXYND, NXY, etc. to guide the reader in subsequent figures.

It was unclear why the author was focusing on both the sum and the difference in the 2 migration parameters. Are these sums and differences calculated post-hoc after inference of the individual migration parameters? If so, what is the information provided by the accuracy plots of different linear combinations of them?

Numerical references are out of order (e.g. first reference: 11-13)

https://doi.org/10.24072/pci.mcb.100036.rev12
