Recommendation

Aphid: A Novel Statistical Method for Dissecting Gene Flow and Lineage Sorting in Phylogenetic Conflict

Alan Rogers based on reviews by Richard Durbin and 2 anonymous reviewers

A recommendation of:

An approximate likelihood method reveals ancient gene flow between human, chimpanzee and gorilla

Nicolas Galtier (2023), bioRxiv, ver.3, peer-reviewed and recommended by PCI Mathematical and Computational Biology https://doi.org/10.1101/2023.07.06.547897

Read preprint in preprint server Now published in Peer Community Journal

Data used for results

Codes used in this study

Scripts used to obtain or analyze results

Abstract

An approximate likelihood method reveals ancient gene flow between human, chimpanzee and gorilla

Gene flow and incomplete lineage sorting are two distinct sources of phylogenetic conflict, i.e., gene trees that differ in topology from each other and from the species tree. Distinguishing between the two processes is a key objective of current evolutionary genomics. This is most often pursued via the so-called ABBA-BABA type of method, which relies on a prediction of symmetry of gene tree discordance made by the incomplete lineage sorting hypothesis. Gene flow, however, need not be asymmetric, and when it is not, ABBA-BABA approaches do not properly measure the prevalence of gene flow. I introduce Aphid, an approximate maximum-likelihood method aimed at quantifying the sources of phylogenetic conflict via topology and branch length analysis of three-species gene trees. Aphid draws information from the fact that gene trees affected by gene flow tend to have shorter branches, and gene trees affected by incomplete lineage sorting longer branches, than the average gene tree. Accounting for the among-loci variance in mutation rate and gene flow time, Aphid returns estimates of the speciation times and ancestral effective population size, and a posterior assessment of the contribution of gene flow and incomplete lineage sorting to the conflict. Simulations suggest that Aphid is reasonably robust to a wide range of conditions. Analysis of coding and non-coding data in primates illustrates the potential of the approach and reveals that a substantial fraction of the human/chimpanzee/gorilla phylogenetic conflict is due to ancient gene flow. Aphid also predicts older speciation times and a smaller estimated effective population size in this group, compared to existing analyses assuming no gene flow.

speciation, coalescence, ABBA-BABA, effective population size, apes

Submission: posted 06 July 2023, validated 11 July 2023
Recommendation: posted 08 January 2024, validated 10 January 2024

Cite this recommendation as:
Rogers, A. (2024) Aphid: A Novel Statistical Method for Dissecting Gene Flow and Lineage Sorting in Phylogenetic Conflict. Peer Community in Mathematical and Computational Biology, 100199. https://doi.org/10.24072/pci.mcb.100199

Recommendation

Galtier [1] introduces “Aphid,” a new statistical method that estimates the contributions of gene flow (GF) and incomplete lineage sorting (ILS) to phylogenetic conflict. Aphid is based on the observation that GF tends to make gene genealogies shorter, whereas ILS makes them longer. Rather than fitting the full likelihood, it models the distribution of gene genealogies as a mixture of several canonical gene genealogies in which coalescence times are set equal to their expectations under different models. This simplification makes Aphid far faster than competing methods. In addition, it deals gracefully with bidirectional gene flow—an impossibility under competing models. Because of these advantages, Aphid represents an important addition to the toolkit of evolutionary genetics.
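The mixture idea can be illustrated with a deliberately small sketch. It is not Aphid's likelihood: it ignores topology, among-locus rate variation and the distribution of gene flow times, and every name and number in it (the scenario labels, the expected branch lengths, the mixture weights) is a hypothetical choice made for illustration. It only shows how fixing each scenario's branch lengths at an expected value turns the per-locus calculation into a cheap finite mixture, from which a posterior probability for each scenario follows.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing k mutations when lam are expected."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def scenario_posteriors(obs_counts, scenarios, weights):
    """Posterior probability of each scenario for a single locus.

    obs_counts: observed mutation counts on each branch of the gene tree
    scenarios:  dict name -> expected mutation counts per branch (fixed values)
    weights:    dict name -> mixture weight (prior probability of the scenario)
    All inputs are hypothetical; this is a toy model, not Aphid itself.
    """
    log_joint = {}
    for name, expected in scenarios.items():
        loglik = sum(poisson_logpmf(k, lam) for k, lam in zip(obs_counts, expected))
        log_joint[name] = math.log(weights[name]) + loglik
    # Normalise in log space to avoid underflow.
    m = max(log_joint.values())
    total = sum(math.exp(v - m) for v in log_joint.values())
    return {name: math.exp(v - m) / total for name, v in log_joint.items()}


# Hypothetical expected branch lengths (in mutations) for three scenarios:
# gene flow shortens the genealogy, ILS lengthens it.
scenarios = {"no_event": (10.0, 10.0), "gene_flow": (5.0, 5.0), "ils": (14.0, 14.0)}
weights = {"no_event": 0.70, "gene_flow": 0.15, "ils": 0.15}

# A locus with short branches is attributed mostly to the gene flow scenario.
print(scenario_posteriors((4, 6), scenarios, weights))
```

Because each scenario contributes a single fixed tree rather than an integral over all possible coalescence times, the cost per locus is proportional to the number of scenarios, which is the source of the speed advantage described above.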

In the interest of speed, Aphid makes several simplifying assumptions. Yet even when these were violated, Aphid did well at estimating parameters from simulated data. It seems to be reasonably robust.

Aphid studies phylogenetic conflict, which occurs when some loci imply one phylogenetic tree and other loci imply another. This happens when the interval between successive speciation events is fairly short. If this interval is too short, however, Aphid’s approximations break down, and its estimates are biased. Galtier suggests caution when the fraction of discordant phylogenetic trees exceeds 50–55%. Thus, Aphid will be most useful when the interval between speciation events is short, but not too short.
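To give a rough sense of what "short, but not too short" means, the following sketch uses the standard three-species coalescent result that, with incomplete lineage sorting alone, the probability of a discordant gene tree is (2/3)·exp(−T), where T is the internal branch length in coalescent units (2N generations for diploids). This formula assumes no gene flow and one sampled lineage per species, so it is only a back-of-envelope guide and not part of Aphid; the function names are hypothetical.

```python
import math


def ils_discordance_fraction(t_internal):
    """Expected fraction of discordant gene trees under ILS only.

    t_internal: internal branch length of the species tree in coalescent
    units (assumption: no gene flow, one lineage sampled per species).
    """
    if t_internal < 0:
        raise ValueError("branch length must be non-negative")
    return (2.0 / 3.0) * math.exp(-t_internal)


def internal_branch_for_discordance(fraction):
    """Invert the formula: internal branch length giving a discordance fraction."""
    if not 0.0 < fraction <= 2.0 / 3.0:
        raise ValueError("under ILS alone the fraction lies in (0, 2/3]")
    return math.log(2.0 / (3.0 * fraction))


# Internal branch lengths corresponding to the 50% and 55% caution thresholds.
for frac in (0.50, 0.55):
    print(frac, round(internal_branch_for_discordance(frac), 3))
```

Under these assumptions the cautionary range corresponds to internal branches shorter than roughly 0.2 to 0.3 coalescent units; with gene flow also contributing to discordance, the mapping would differ.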

Galtier applies the new method to three sets of primate data. In two of these data sets (baboons and African apes), Aphid detects gene flow that would likely be missed by competing methods. These competing methods are primarily sensitive to gene flow that is asymmetric in two senses: (1) greater flow in one direction than the other, and (2) unequal gene flow connecting an outgroup to two sister species. Aphid finds evidence of symmetric gene flow in the ancestry of baboons and also in that of African apes. The data suggest that ancestral humans and chimpanzees both interbred with ancestral gorillas, and at about the same rate. Aphid’s ability to detect this signature sets it apart from competing methods.
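The blindness of ABBA-BABA tests to the second kind of symmetry can be seen from the definition of Patterson's D statistic, D = (nABBA − nBABA) / (nABBA + nBABA). The sketch below uses made-up site-pattern counts (assumptions, not data from the paper) to show that gene flow adding equally to both discordant patterns leaves D at zero.

```python
def patterson_d(n_abba, n_baba):
    """Patterson's D from counts of ABBA and BABA site patterns."""
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("need at least one ABBA or BABA site")
    return (n_abba - n_baba) / total


# Hypothetical counts: ILS alone contributes 1000 sites to each pattern.
ils_only = patterson_d(1000, 1000)

# Symmetric gene flow (outgroup exchanging equally with both sister species)
# adds the same number of sites to each pattern, so D stays at zero.
symmetric_flow = patterson_d(1000 + 300, 1000 + 300)

# Asymmetric gene flow inflates one pattern only, and D departs from zero.
asymmetric_flow = patterson_d(1000 + 300, 1000)

print(ils_only, symmetric_flow, asymmetric_flow)
```

In the symmetric case the excess discordance is real but invisible to D, which is why a method that also uses branch lengths can recover a signal that a count-based test misses.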

References

[1] Nicolas Galtier (2023) “An approximate likelihood method reveals ancient gene flow between human, chimpanzee and gorilla”. bioRxiv, ver. 3 peer-reviewed and recommended by Peer Community in Mathematical and Computational Biology. https://doi.org/10.1101/2023.07.06.547897

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
The author declares that they have received no specific funding for this study.

Reviews

Evaluation round #2

DOI or URL of the preprint: https://doi.org/10.1101/2023.07.06.547897

Version of the preprint: 2

Author's Reply, 19 Dec 2023

Download author's reply https://doi.org/10.24072/pci.mcb.100199.ar2

Decision by Alan Rogers, posted 12 Dec 2023, validated 15 Dec 2023

See uploaded file.

Download recommender's annotations

https://doi.org/10.24072/pci.mcb.100199.d2

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2023.07.06.547897

Version of the preprint: 1

Author's Reply, 23 Nov 2023

Download author's reply https://doi.org/10.24072/pci.mcb.100199.ar1

Decision by Alan Rogers, posted 02 Oct 2023, validated 04 Oct 2023

See attached

Download recommender's annotations

https://doi.org/10.24072/pci.mcb.100199.d1

Reviewed by anonymous reviewer 1, 04 Aug 2023

Download the review https://doi.org/10.24072/pci.mcb.100199.rev11

Reviewed by Richard Durbin, 22 Sep 2023

As more species have had their genomes sequenced, it has become increasingly clear that there is widespread phylogenetic incongruence between local gene trees and consensus species trees, particularly when the time intervals between successive speciations are relatively small. These incongruences can be caused either by Incomplete Lineage Sorting (ILS) or Gene Flow (GF), and many analytical methods address one or the other, but few both jointly. This paper introduces a new, computationally efficient approach called Aphid to modelling the contribution of both ILS and GF, and with them the times of species separation and basic population genetic parameters of the ancestral species. It appears to me to be novel and effective, giving interesting results, and I recommend publication after revision.

The key step is to greatly simplify the set of possible gene trees considered for explaining the observed data, allowing an efficient maximum likelihood approach to a mixture model over this set. This is very much an intentional heuristic, which results in an enormous reduction in state space, but it appears to be remarkably effective, with good results from reasonable simulations and a demonstration of application to real data. The ideas are nice and the exposition looks sound.

The analysis with Aphid of the human/chimp/gorilla relationships is a potentially important addition to the study of hominine evolution. It suggests that approximately half their genetic discordance is due to gene flow, and consequently that the H-C and HC-G main separation dates are older and ancestral population sizes smaller, which makes sense in a number of ways. There is potential for the author or others to build on this in the future in looking at the relationships of hominine species and subspecies, and to consider the relationship to other data in more depth than can be done in this short, more technical article.

Like many other phylogenetic approaches, Aphid takes as input a set of supposedly independent gene trees, each built under an assumption of no internal recombination. This referee has general concerns about the no-recombination assumption. For most mammals the average recombination rate is comparable to the mutation rate (e.g. 1e-8 compared to 1.25e-8 for humans) which means that there are on average as many recombination events as mutation events in the ancestral genealogy over a stretch of genome. Given that there have to be mutations present to enable the tree to be defined, then there should also be recombinations. There are many species with smaller mutation rates and much higher recombination rates (because they have smaller chromosomes), such as most invertebrates and many plants, for which the ratio of recombination events to mutations is much higher than one, often an order of magnitude higher. It is true that recombinations are clustered at hotspots, and that neighbouring trees separated by recombinations are correlated, but I would appreciate if you could explicitly discuss the issues for Aphid around the (almost certainly wrong) assumption of no recombination in gene trees.
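The arithmetic behind this concern is short. Along a genealogy of total length tau generations, a locus of L base pairs is expected to carry mu·L·tau mutations and r·L·tau recombination events, so their ratio is simply r/mu regardless of L and tau. The sketch below uses the human rates quoted above; the locus length and tree length are hypothetical values chosen only to make the counts concrete.

```python
def expected_events(rate_per_bp_per_generation, locus_length_bp, tree_length_generations):
    """Expected number of events on a genealogy of the given total length."""
    return rate_per_bp_per_generation * locus_length_bp * tree_length_generations


RECOMBINATION_RATE = 1.0e-8   # per bp per generation, as quoted for humans
MUTATION_RATE = 1.25e-8       # per bp per generation, as quoted for humans

# Hypothetical locus: 1 kb, with a total genealogy length of one million generations.
locus_length_bp = 1_000
tree_length_generations = 1_000_000

n_mut = expected_events(MUTATION_RATE, locus_length_bp, tree_length_generations)
n_rec = expected_events(RECOMBINATION_RATE, locus_length_bp, tree_length_generations)

# The ratio depends only on the two rates.
print(n_mut, n_rec, n_rec / n_mut)
```

With these rates there are about 0.8 recombination events per mutation, so any locus long enough to carry the mutations needed to resolve a tree is also expected to carry recombination events.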

Major points:

1. L95-101: It took me a while to realise that it was intentional that you are only considering trees with branch lengths at the expected values, rather than all possible trees. There is a good paragraph about this at the end in the discussion, but this modelling decision should be made much clearer earlier on because it is central to your method. First you should say early on that this is a heuristic approach involving major simplification – nothing wrong with that. Then, in the section from lines 95 and following, something more like “We model this set of observed gene trees as coming from a limited mixture of characteristic trees which we call scenarios, which have fixed branch lengths set to the expected values of the branch lengths for that scenario. These fall into three categories.” Then instead of “the coalescence times are assumed to equal” something more like “we model the coalescence times to be fixed at…”. This isn’t really an assumption, because it is flagrantly false – it is a modelling decision.

2. L176: 95% confidence intervals. Are these calculated by re-optimising the likelihood over all the parameters other than the one being investigated for each test value of the parameter under consideration? If so then say this. If not, then this is necessary. Otherwise, if parameters are coupled, it may be that the true confidence intervals are much wider.

a. Related to this, please say for your simulations for what fraction of simulations the true value of each parameter is within the confidence interval, and discuss as appropriate.

b. And you don’t show the confidence intervals in SuppTable2, nor the significance test results. Please add them. Sorry that this adds lots more columns. You could perhaps transpose the table and have columns per data set and rows per feature – your choice.

3. L188-191: how is the root defined for deciding whether trees are imbalanced? Do you need a super-outgroup to set the root, beyond the ones whose lengths from tip to root are being compared to those from? If so, say so. Else explain how this is done. (If you assume ultrametric behaviour to define the root then of course you underestimate root-to-tip variation.)

4. Related to the comments above about the assumption of no recombination within the loci, I would like to see for the real data (Supp Table 2) a new column giving the fraction of 2:2 SNPs involving A,B,C and an outgroup O that is incongruent in the test regions. This is what CoalHMM uses, and is independent of the no-recombination assumption. My memory is that for Human/Chimp/Gorilla/Orang this fraction is 30%, not 26% as you have. If you see such a difference it might help motivate discussion of the consequences of the no-recombination assumption. If you see no difference, then the explanation may be due to lower Ne in exons and/or my faulty memory. In any case, if there is no difference that is a nice validation that the model is behaving reasonably.

5. Your discussion of the real data focuses almost exclusively on macaques and hominins. I think you should at least provide a bit more overview of the other results, otherwise why do them? It looks to me that for horses and mice there is negligible evidence for either GF or ILS – please give the significance test results in the table. For the others, the GF is similar to or greater than ILS. Quite surprising to me and worth remarking on. Is there other literature on these cases?

6. For macaques Song et al. had two M. fascicularis and only one of those shows the strong gene flow signature (the one from Mauritius, where the Portuguese introduced M. fascicularis several hundred years ago from SE Asia). They interpret that as meaning that the gene flow has occurred within the last 330k years, since that is their estimated divergence time. However you estimate 63% p_a which is much more ancient. I wonder about the accuracy of your p_a estimate – all the other values are above 90%. You don’t discuss this in your section on simulation. Could you please address how accurately p_a is estimated on the simulation data, and comment on this discrepancy in the macaque analysis.

7. I note that Song et al. inferred bi-directional gene flow in this case, which is possible in principle with careful application of D-stats or 5-taxon tests. I realise that because your method only models symmetric bi-directional gene flow it does not selectively demonstrate bi-directional versus uni-directional gene flow. You should state this somewhere.

8. L343-345: you discuss not testing p_AB. Why not? This would be simple using the same scheme as for p_AC, p_BC. Maybe you have low power for this, but that would be good to report. It would not invalidate the paper at all from my perspective, just show the limits of the approach.

9. L357: why not distinguish N_e(AB) from N_e(ABC)? I would be interested in what happens if you add that to the model. But I realise that this is substantially more work, so I do not require this. If you tried it and it didn’t work well because of indeterminism, I would appreciate a statement to that effect as again being useful to understand the limits of the model, without requiring that you present the results to demonstrate this. In my view it is much more useful to describe things that didn’t work than to bury them. I hope the editor and the other referee(s) take the same view!
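The procedure asked about in major point 2, re-optimising the likelihood over all other parameters at each test value of the parameter of interest, is the profile likelihood. A minimal sketch follows, using a toy normal model in place of Aphid's likelihood (an assumption made purely so the example is self-contained): the mean is the parameter of interest, the variance is the nuisance parameter, and the 95% interval contains every value of the mean whose profile deviance is at most 3.841, the 95% quantile of a chi-square distribution with one degree of freedom. The data values and function names are hypothetical.

```python
import math

CHI2_1DF_95 = 3.841  # 95% quantile of the chi-square distribution with 1 df


def profile_loglik_mean(data, mu):
    """Log-likelihood at mu, with the variance re-optimised for that mu.

    For a normal model the optimal variance given mu has a closed form,
    which stands in for the numerical re-optimisation a real model needs.
    Requires data that are not all equal to mu.
    """
    n = len(data)
    sigma2_hat = sum((x - mu) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2_hat) + 1.0)


def profile_ci_mean(data, grid_step=1e-3, half_width=5.0):
    """Grid-based 95% profile-likelihood interval for the mean.

    The interval is truncated if it extends beyond mu_hat +/- half_width.
    """
    mu_hat = sum(data) / len(data)
    l_max = profile_loglik_mean(data, mu_hat)
    steps = int(half_width / grid_step)
    inside = []
    for i in range(-steps, steps + 1):
        mu = mu_hat + i * grid_step
        deviance = 2.0 * (l_max - profile_loglik_mean(data, mu))
        if deviance <= CHI2_1DF_95:
            inside.append(mu)
    return min(inside), max(inside)


# Hypothetical data set.
data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.4, 5.6, 5.2]
print(profile_ci_mean(data))
```

If the nuisance parameter were instead held at its global estimate while the parameter of interest is varied, the resulting interval would generally be too narrow whenever the two are coupled, which is the concern raised in that point.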

Minor points:

1. L24: “These problems are presumably minimized” – this is imprecise. They are presumably less of an issue, but not as small as possible, which is the meaning of minimized. Something more like “these potential problems are presumably much reduced” or “much less of a concern”

2. L76: “J & MR” ref should be Smith and Kronforst.

3. L79: “The ILS hypothesis predicts an exponential distribution for this variable regardless of tree topology” is incorrect. This needs to be “The ILS hypothesis predicts an exponential distribution for this variable for trees discordant with the species tree”, which is what the Edelman paper says.

4. L118: “as from” in place of “than from” is better English

5. L122: You need to state here that you assume at most one GF event.

6. L156: you talk about star topologies here, but later (L192) you rule them out in an indirect way, by saying that you ban trees with internal branches under 0.5 mutations – since the number of mutations is discrete this means with 0 mutations, i.e. star trees. I suggest just to say at L156 that you exclude star topologies with d = 0 (and again at L192).

7. L265: “lead” -> “led”

8. L269: capitalise “Indonesia”

9. Supp Table 2: why is the asymmetry_ILS < 0.5? By definition it should be greater than 0.5, as you say in L174. Also the table header is asymmetry_ILS while the text is imbalance_ILS.

10. L282,L292: you must change “neutral mutation rate” to “exon mutation rate” or even more correctly “exon accepted mutation rate”. By using exons you are clearly not considering neutral sequence.

11. L308: “appears” not “appear” in “appears to exist”

12. L317,L320: “departing from”

13. Figure 3: I find using the magnitude of the disks for the fraction explained hard to evaluate. Fine to leave the disks, but could you add next to them the actual number in text as a percentage (e.g. 12%, 4% etc. – no need for more precision here).

14. L350: “conditional” not “conditionally”

15. Supp Text: “do not coalesce” rather than “do not coalesced”

16. Reference list: Maybe this is an editorial rather than an author point, but for this style (name, year) surely the references should be in alphabetical order of first author.

https://doi.org/10.24072/pci.mcb.100199.rev12

Reviewed by anonymous reviewer 2, 12 Sep 2023

Download the review https://doi.org/10.24072/pci.mcb.100199.rev13
