Recommendation

Detecting variation in ploidy within and between genomes

Alan Rogers based on reviews by Barbara Holland, Benjamin Peter and Nicolas Galtier

A recommendation of:

HMMploidy: inference of ploidy levels from short-read sequencing data

Samuele Soraggi, Johanna Rhodes, Isin Altinkaya, Oliver Tarrant, Francois Balloux, Matthew C Fisher, Matteo Fumagalli (2022), bioRxiv, 2021.06.29.450340, ver. 6 peer-reviewed and recommended by Peer Community in Mathematical and Computational Biology https://doi.org/10.1101/2021.06.29.450340

Now published in Peer Community Journal

Abstract

HMMploidy: inference of ploidy levels from short-read sequencing data

The inference of ploidy levels from genomic data is important to understand molecular mechanisms underpinning genome evolution. However, current methods based on allele frequency and sequencing depth variation do not have power to infer ploidy levels at low- and mid-depth sequencing data, as they do not account for data uncertainty. Here we introduce HMMploidy, a novel tool that leverages the information from multiple samples and combines the information from sequencing depth and genotype likelihoods. We demonstrate that HMMploidy outperforms existing methods in most tested scenarios, especially at low-depth with large sample size. We apply HMMploidy to sequencing data from the pathogenic fungus Cryptococcus neoformans and retrieve pervasive patterns of aneuploidy, even when artificially downsampling the sequencing data. We envisage that HMMploidy will have wide applicability to low-depth sequencing data from polyploid and aneuploid species.

Keywords: high-throughput DNA sequencing, ploidy, polyploidy, aneuploidy, hidden Markov model, genotype likelihood

Submission: posted 01 July 2021
Recommendation: posted 12 September 2022, validated 19 September 2022

Cite this recommendation as:
Rogers, A. (2022) Detecting variation in ploidy within and between genomes. Peer Community in Mathematical and Computational Biology, 100010. https://doi.org/10.24072/pci.mcb.100010

Recommendation

Soraggi et al. [2] describe HMMploidy, a statistical method that takes DNA sequencing data as input and uses a hidden Markov model to estimate ploidy. The method allows ploidy to vary not only between individuals, but also between and even within chromosomes. This allows the method to detect aneuploidy and also chromosomal regions in which multiple paralogous loci have been mistakenly assembled on top of one another.

HMMploidy estimates genotypes and ploidy simultaneously, with a separate estimate for each genome. The genome is divided into a series of non-overlapping windows (typically 100), and HMMploidy provides a separate estimate of ploidy within each window of each genome. The method is thus estimating a large number of parameters, and one might assume that this would reduce its accuracy. However, it benefits from large samples of genomes. Large samples increase the accuracy of internal allele frequency estimates, and this improves the accuracy of genotype and ploidy estimates. In large samples of low-coverage genomes, HMMploidy outperforms all other estimators. It does not require a reference genome of known ploidy. The power of the method increases with coverage and sample size but decreases with ploidy. Consequently, high coverage or large samples may be needed if ploidy is high.
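
To make the window-based idea concrete, here is a minimal sketch of how a hidden Markov model can assign one ploidy state to each window by Viterbi decoding. It is not HMMploidy's implementation. The emission model is a plain Poisson on window depth only (HMMploidy also uses genotype likelihoods), and `depth_per_copy`, `stay_prob` and the function names are illustrative assumptions.

```python
import math

def poisson_logpmf(k, lam):
    """Log of the Poisson probability mass function."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def viterbi_ploidy(window_depths, ploidies, depth_per_copy, stay_prob=0.95):
    """Most likely ploidy per window under a toy depth-only HMM.

    Assumptions (not from the paper): depth in a window is Poisson with mean
    ploidy * depth_per_copy, and the chain keeps its state with probability
    stay_prob, otherwise moving uniformly to one of the other states.
    Requires at least two candidate ploidies.
    """
    n_states = len(ploidies)
    switch = (1.0 - stay_prob) / (n_states - 1)
    log_trans = [[math.log(stay_prob if i == j else switch)
                  for j in range(n_states)] for i in range(n_states)]

    def log_emit(state, depth):
        return poisson_logpmf(depth, ploidies[state] * depth_per_copy)

    # Initialise with a uniform prior over ploidy states.
    score = [math.log(1.0 / n_states) + log_emit(s, window_depths[0])
             for s in range(n_states)]
    back = []
    for depth in window_depths[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + log_emit(j, depth))
        score = new_score
        back.append(pointers)

    # Trace the best path back from the last window.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [ploidies[s] for s in path]

# Ten windows of one genome; the middle four have roughly tripled depth.
depths = [10, 9, 11, 10, 31, 29, 30, 28, 10, 11]
print(viterbi_ploidy(depths, ploidies=[1, 2, 3], depth_per_copy=10))
```

In this toy setting the decoded path switches to ploidy 3 only over the high-depth windows, because the cost of two state changes is smaller than the emission penalty of explaining tripled depth with a single copy.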

The method is slower than some alternative methods, but run time is not excessive. Run time increases with number of windows but isn't otherwise affected by genome size. It should be feasible even with large genomes, provided that the number of windows is not too large.

The authors apply their method and several alternatives to isolates of a pathogenic yeast, Cryptococcus neoformans, obtained from HIV-infected patients. With these data, HMMploidy replicated previous findings of polyploidy and aneuploidy. There were several surprises. For example, HMMploidy estimates the same ploidy in two isolates taken on different days from a single patient, even though sequencing coverage was three times as high on the later day as on the earlier one. These findings were replicated in data that were down-sampled to mimic low coverage.

Three alternative methods (ploidyNGS [1], nQuire, and nQuire.Den [3]) estimated the highest ploidy considered in all samples from each patient. The present authors suggest that these results are artifactual and reflect the wide variation in allele frequencies. Because of this variation, these methods seem to have preferred the model with the largest number of parameters.

HMMploidy represents a new and potentially useful tool for studying variation in ploidy. It will be of most use in studying the genetics of asexual organisms and cancers, where aneuploidy imposes little or no penalty on reproduction. It should also be useful for detecting assembly errors in de novo genome sequences from non-model organisms.

References

[1] Augusto Corrêa dos Santos R, Goldman GH, Riaño-Pachón DM (2017) ploidyNGS: visually exploring ploidy with Next Generation Sequencing data. Bioinformatics, 33, 2575–2576. https://doi.org/10.1093/bioinformatics/btx204

[2] Soraggi S, Rhodes J, Altinkaya I, Tarrant O, Balloux F, Fisher MC, Fumagalli M (2022) HMMploidy: inference of ploidy levels from short-read sequencing data. bioRxiv, 2021.06.29.450340, ver. 6 peer-reviewed and recommended by Peer Community in Mathematical and Computational Biology. https://doi.org/10.1101/2021.06.29.450340

[3] Weiß CL, Pais M, Cano LM, Kamoun S, Burbano HA (2018) nQuire: a statistical framework for ploidy estimation using next generation sequencing. BMC Bioinformatics, 19, 122. https://doi.org/10.1186/s12859-018-2128-z

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Reviews

Evaluation round #4

DOI or URL of the preprint: https://www.biorxiv.org/content/10.1101/2021.06.29.450340v4

Version of the preprint: 4

Author's Reply, 26 Aug 2022

Download author's reply https://doi.org/10.24072/pci.mcb.100109.ar4

Decision by Alan Rogers, posted 06 Aug 2022

I'm ready to recommend this preprint. But before doing so, I want to give the authors an opportunity to respond to one remaining issue. See attached.

Download recommender's annotations

https://doi.org/10.24072/pci.mcb.100109.d4

Evaluation round #3

DOI or URL of the preprint: https://www.biorxiv.org/content/10.1101/2021.06.29.450340v3

Version of the preprint: 3

Author's Reply, 14 Jul 2022

Download author's reply https://doi.org/10.24072/pci.mcb.100109.ar3

Decision by Alan Rogers, posted 06 May 2022

See attached.

Download recommender's annotations

https://doi.org/10.24072/pci.mcb.100109.d3

Evaluation round #2

DOI or URL of the preprint: https://doi.org/10.1101/2021.06.29.450340

Version of the preprint: 2

Author's Reply, 22 Apr 2022

Download author's reply https://doi.org/10.24072/pci.mcb.100109.ar2

Decision by Alan Rogers, posted 13 Sep 2022

See attached.

Download recommender's annotations

https://doi.org/10.24072/pci.mcb.100109.d2

Reviewed by Nicolas Galtier, 18 Jan 2022

I found the manuscript to be substantially improved in many respects, and would like to thank the authors for the hard work and willingness to address all the reviewers' remarks. I still have a couple of questions.

1. From the authors' response and corrections, it is my understanding that the HMMploidy method is intended to be applied to segments across which ploidy does not vary. This is perceptible from the modified introduction, in which the emphasis is put on aneuploidy (i.e., single-ploidy chromosomes), and the simulation part, in which constant ploidy is assumed. This is a perfectly valid goal, but one might then ask, why take an HMM approach? If ploidy is assumed to be constant then the likelihood can probably be calculated based on the provided equations without the HMM layer. The authors might like to clarify the choice of an HMM approach if ploidy is supposed not to change across the analyzed segments.

2. The section on the empirical analysis is still a bit unclear to me. In particular:
- do we have external knowledge on the real level of (aneu)ploidy in these samples?
- I don't quite understand the interpretation of the CCTP27 vs CCTP27-d121 discrepancy. In this genome the sequencing depth of chromosome 12 was tripled at day 121, compared to the reference at day 0, suggesting some major biological event. HMMploidy infers the same ploidy (of 1) for chromosome 12 in the two samples, thus missing this biological event as far as I understand it. Still, this is interpreted as a success of the method.

The authors might like to clarify their specific goals with this analysis, and what kind of biological pattern or structure they are targeting. If the idea is to identify polyploid segments having accumulated a certain amount of sequence variation, as seems implicit in the empirical analysis section, then this should probably be stated more explicitly and discussed.

https://doi.org/10.24072/pci.mcb.100109.rev21

Reviewed by Barbara Holland, 04 Feb 2022

Download the review https://doi.org/10.24072/pci.mcb.100109.rev22

Evaluation round #1

DOI or URL of the preprint: https://www.biorxiv.org/content/10.1101/2021.06.29.450340v1

Version of the preprint: 1

Author's Reply, 21 Dec 2021

Download author's reply https://doi.org/10.24072/pci.mcb.100109.ar1

Decision by Alan Rogers, posted 31 Jul 2021

My letter is attached.

Download recommender's annotations

https://doi.org/10.24072/pci.mcb.100109.d1

Reviewed by Benjamin Peter, 23 Jul 2021

In this paper, Soraggi et al. introduce a new model for inferring the ploidy of an organism from low-coverage sequence data using genotype likelihoods. This seems like a useful program, but the current manuscript requires substantial revising and editing to make it suitable for publication.

Major points:
1. Introduction: I think the authors should define better what they mean by ploidy, particularly when we talk about ploidy at the sub-chromosome level, and how the authors expect it to differ from structural variation. I.e. I can think of cases like the pseudo-autosomal region in humans and crazy systems like the platypus X-chromosome; but it would be nice to be explicit about this. I assume it has somehow to do with homologous recombination?

2. Why are coverage-based methods not considered in comparisons? In ancient DNA, sexing is often done by comparing the ploidy of the X-chromosome. This works well at coverages < 0.01x, so I don't understand why these approaches wouldn't work on sufficiently large ploidy-regions. I would imagine at least aneuploidies would be easy to discover with those approaches as well. This needs to be better justified.

3. Why does the probability on the rhs of equation 2 not depend on i? Also, why does one not have to correct for the abundance of alleles? I.e. if we have a tetraploid and the genotype is AAAG, why would the probability of seeing As and Gs be equal? I think Equation 2 as stated is simply wrong, and if not, needs to be much better motivated.

4. So is the only signal considered in G the heterozygosity? Could that be confounded with population structure?

5. I do not think that essentially copying half the paper to the supplement is a good idea. It just makes the manuscript unnecessarily bloated. Why not reduce the supplement to p 4 and 5, which do the heavy lifting? That little care has been given to this arrangement is also apparent from the fact that the main text refers to superfluous equations in the supplement.

6. Section 2.3: Is reference/sequencing bias an issue here?
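
The allele-abundance concern in major point 3 can be illustrated numerically. The sketch below assumes a simple symmetric error model, chosen for illustration and not necessarily the one used in the manuscript: a read samples one of the chromosome copies uniformly, reports the true base with probability 1 - e, and reports each of the three other bases with probability e/3. The function name `read_base_prob` is hypothetical.

```python
def read_base_prob(observed, genotype, error):
    """P(observed base | genotype), averaged over the copies in the genotype.

    genotype is a string with one base per chromosome copy, e.g. "AAAG".
    """
    total = 0.0
    for allele in genotype:
        total += (1.0 - error) if observed == allele else error / 3.0
    return total / len(genotype)

e = 0.01
p_a = read_base_prob("A", "AAAG", e)
p_g = read_base_prob("G", "AAAG", e)
print(p_a, p_g)
```

With three A copies and one G copy, a read is roughly three times as likely to show A as G under this model, so the two probabilities are equal only when the alleles are equally abundant in the genotype.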

Minor:
Fig 1: I am a bit confused by panel A. What do the little dots represent? Is the unit a window or a SNP?
p3. why would HWE lead to a negative binomial distribution?
p5. (eq 4) It would be good to label equation numbers in the supplement separately. Also, why can't one use the main text equations here?
p5. (m-th HMM) Should that mean the m-th hidden state? Otherwise I don't understand this section.
p6. The difference between EM and ECM should be explained. Also, in the Baum-Welch-algorithm I am familiar with, the Forward-Backward Algorithm is the E-step of the EM; so what exactly is the EM for each forward-backwards run calculating expectations over?
p6. why is overfitting sets of ploidy levels a concern? How is the number of ploidy levels defined/constrained in the first place?
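
On the p3 query about Hardy-Weinberg equilibrium: if the allele copies at a site are drawn independently with alternate-allele frequency F, the number of alternate alleles among them follows a binomial distribution, not a negative binomial one. A minimal sketch, with a hypothetical function name and made-up values:

```python
from math import comb

def hwe_genotype_probs(ploidy, alt_freq):
    """P(G = g) for g = 0..ploidy alternate alleles: Binomial(ploidy, alt_freq)."""
    return [comb(ploidy, g) * alt_freq**g * (1 - alt_freq)**(ploidy - g)
            for g in range(ploidy + 1)]

diploid = hwe_genotype_probs(2, 0.3)      # (1-F)^2, 2F(1-F), F^2
tetraploid = hwe_genotype_probs(4, 0.3)
print(diploid)
print(tetraploid)
```

For a diploid this reduces to the familiar Hardy-Weinberg proportions, and for higher ploidy it gives ploidy + 1 genotype classes whose probabilities sum to one.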

Typos:
p2 incorporates
p3 (lower case) letters
across reads
In general, the English is quite poor and requires further editing. Also line numbers would greatly help pointing out typos and issues more specifically. This is compounded by the issue that the paper is at times jargon heavy (e.g. Tower property, Markov matrix) and worse, the jargon is not explained and used inconsistently (Markov matrix vs Transition matrix).

https://doi.org/10.24072/pci.mcb.100109.rev11

Reviewed by Nicolas Galtier, 09 Jul 2021

This manuscript introduces a method for inferring ploidy and its variation across genomes and loci based on next-generation sequencing data. The main novelty is the introduction of a hidden Markov Model (HMM) in which ploidy is assumed to vary across genomic windows. Ploidy is an important aspect of genome structure, and underlies key technical challenges of genome assembly and analysis, so this manuscript, in my opinion, addresses an important problem. I very much like the idea of explicitly modelling ploidy variation and the resulting predictions on patterns of sequence coverage and base counts. I think that the HMMploidy approach has great potential to significantly advance the field. That said, I have a number of concerns regarding the manuscript, both content and form, which I detail below. Briefly, I do not think the approach is particularly well motivated or illustrated, I have technical issues with the maths and the way the method is presented, and a suggestion for improvement regarding sequence coverage modeling.

A. Awkward/insufficient justification of the method:

It is not totally intuitive why HMM would be appropriate to model ploidy, since ploidy is typically thought of as a constant, for a given species. In reality, the realized ploidy can vary across chromosomes or chromosomal regions and/or between individuals, making the HMM approach a promising one. The introduction very briefly mentions aneuploidy in cancer cells, and polyploidization in plants, as two possible instances of variable ploidy. The manuscript, however, does not develop on these examples, and rather presents (i) an analysis of data simulated in the absence of any variation in ploidy, and (ii) an analysis of a data set in Cryptococcus neoformans, introduced with very limited biological context. I did not find that the HMMploidy method performs particularly well in these two analyses. It was not obviously better than competing methods in the simulation benchmark, and failed to detect a conspicuous instance of triploidy in the real data analysis.

There are a number of reasons why ploidy is expected to vary among/across assembled genomes that are not mentioned or considered in the manuscript. The realized ploidy can be locally increased due to large-scale duplications, when several distinct regions of a genome are so similar that they are assembled as a single piece. Counting gene copy number is indeed a difficult problem (e.g. see papers by Schrider and Hahn). Another typical artefact with genome assembly is allele splitting, when heterozygosity is so high that assembling algorithms separate homologous alleles as if they were distinct loci (e.g. have a look at papers on the Ciona savignyi and Adineta vaga genomes, or the recent literature on haplotig detection and cleaning). The HMMploidy approach seems to be a promising way to identify, annotate and possibly filter out such anomalous genomic regions. Another example of varying ploidy that comes to my mind is sex chromosomes, which are haploid in the heterogametic sex (male in XY systems, female in ZW systems) and diploid in the homogametic sex (see for instance papers by Muyle, Kafer and Marais on how to annotate sex-chromosome-associated contigs). Please note that in many systems (e.g. mammals) the Y/W chromosome is actually a mosaic of ploidy, with so-called pseudo-autosomal regions being diploid while the sex-specific region is haploid. Each of the topics I'm mentioning in this paragraph is the subject of a voluminous literature.

I would suggest (i) strengthening the introduction by discussing in more detail why among-loci variation in ploidy is actually relevant, thus justifying the HMM approach, and (ii) identifying a couple of real data sets with clear expectations regarding ploidy variation, and demonstrating the applicability and added value of the newly introduced method.

B. Awkward/inaccurate presentation of the method:

I have several concerns with the way the method is presented, which I think mostly result from insufficient clarity. At any rate at the moment I can't say I totally understand what the method exactly does, and the manuscript apparently contains incorrect equations.

- 2.1 first sentence: "N polymorphic sites"; how do we know a site is polymorphic or not prior to the analysis? Should one perform SNP calling beforehand? Maybe remove "polymorphic"?

- 2.1: a genotype is described as the number of "alternate (or derived) alleles", suggesting that SNPs are assumed to be polarized. I do not think that the method presented here requires SNP polarization (which is good), so I would suggest clarifying.

- 2.1: "We assume Hardy-Weinberg equilibrium (HWE) and thus model the genotype probability with a negative binomial distribution" -> I would rather think a binomial distribution?

- 2.2: Equation 2 appears awkward. The summation variable i does not appear in the term right to the Sigma symbol, which is suggestive of a problem. Also a genotype G_mn was defined above as an integer taking value in {0, ..., Y_mn}, but here appears the idea that O_mnr (some observed nucleotide) can be "in G_mn" (second part of equation 2), which is inconsistent.

I guess one could re-define a genotype as a vector of nucleotide instead of an integer, then replace in equation 2
p(O_mnr|G_mn,Q_mnr,Y_mn)
with
p(O_mnr|G_mni,Q_mnr,Y_mn)
and replace in second line of equation 2
"if O_mnr in G_mn"
with
"if O_mnr = G_mni"

Alternatively one could keep the text definition of genotype, call A_n and a_n the two alleles at locus n (say), and replace in equation 2:
sum_i p(O_mnr|G_mn,Q_mnr,Y_mn)/Y_mn
with
((Y_mn - G_mn) p(O_mnr|A_n,Q_mnr,Y_mn) + G_mn p(O_mnr|a_n,Q_mnr,Y_mn))/Y_mn
and adjust second line of equation 2.

The above two options, which I think are equivalent (but different from the text), are what makes sense to me. In the rest of this review I'm assuming that the calculation that was actually made corresponds to the above modified equations.

- 2.3: equation 3 is a rather complex way of saying that the estimated alternate allele frequency is the observed alternate allele frequency across all reads from the pooled genome sample. Indeed Fhat_mn in equation 3 can be written as f_mn/C_mn, where f_mn is the observed number of alternate alleles in genome m, so C_mn cancels out and we get Fhat_n=sum(f_mn)/C_n.
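
The simplification described for equation 3 can be checked numerically. The sketch below uses made-up counts: `alt_counts` stands for f_mn and `depths` for C_mn at a single site n, and the depth-weighted form is this review's reading of equation 3 rather than a quotation of it. Depths are assumed to be non-zero.

```python
def weighted_mean_frequency(alt_counts, depths):
    """Depth-weighted mean of the per-genome frequencies f_mn / C_mn."""
    total_depth = sum(depths)
    return sum((f / c) * c for f, c in zip(alt_counts, depths)) / total_depth

def pooled_frequency(alt_counts, depths):
    """Alternate-allele reads divided by all reads in the pooled sample."""
    return sum(alt_counts) / sum(depths)

alt_counts = [2, 0, 5, 1]   # f_mn for four genomes at site n
depths = [8, 4, 10, 3]      # C_mn for the same genomes
print(weighted_mean_frequency(alt_counts, depths))
print(pooled_frequency(alt_counts, depths))
```

Both functions return the same value up to floating-point rounding, because each per-genome depth cancels inside the weighted sum.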

[now switching to Supplementary Material]

- 6.5: I am not sure what alpha and beta are. I guess these correspond to the shape and scale parameters of the Poisson-Gamma distribution of mean coverage across windows; this should be specified. Secondly, I do not understand why these parameters appear with a _k index, suggesting there is one alpha and one beta per window. The text and figure S1 instead suggest that there is one value of coverage per window, C_m(k), drawn from a unique Poisson-Gamma distribution, the parameters of which should be shared across windows?

C. Modeling scheme:

The way sequencing coverage is modeled lacks clarity and justification. Irrespective of ploidy, there might be differences in coverage among loci (e.g. GC-rich vs GC-poor regions) and among genomes (due to experimental setting or the experimental noise). It would appear natural to me to model the among-loci variation in coverage as suggested in the ms, to also model among-genomes variation in coverage (i.e., introduce genome-specific coverage parameters), and to define C_m(k) as the product of these two terms, thus assuming that the locus effect and the genome effect are independent. If one thinks this is too strong an assumption, maybe some (de)correlation parameter could be introduced. My understanding of the current method is that the across-loci distribution of coverage is assumed to be independent across genomes, i.e., the fact that one locus is highly covered in one genome says nothing about coverage at the same locus in another genome. This sounds like a highly, maybe overly, versatile model, which I think might induce some loss of signal. For instance, the analysis of chromosome 12 in the Cryptococcus CCTP27-d121 sample did not detect any change in ploidy even though coverage is consistently tripled across a large portion of the chromosome (fig 2). I suggest that if coverage was modeled in a more constrained way, i.e. as the product of a genome-specific and a locus-specific term, this abnormality could be interpreted by the method as a triplication. A clarification of how coverage is modeled across loci and genomes, a discussion of this question, and an attempt to adopt a less versatile scheme, would appear required.
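
The more constrained scheme suggested in this section can be sketched as follows. This illustrates the proposal only and is not a description of HMMploidy: expected depth is taken to be the product of a genome-specific term, a locus-specific term and the local ploidy, and all names and numbers are hypothetical.

```python
def expected_depth(genome_effect, locus_effect, ploidy):
    """Expected depth under independent genome and locus effects."""
    return genome_effect * locus_effect * ploidy

def implied_ploidy(observed_depth, genome_effect, locus_effect):
    """Ploidy that explains observed_depth once both effects are known."""
    return observed_depth / (genome_effect * locus_effect)

locus_effect = [1.0, 0.8, 1.2, 1.0]   # shared across genomes (e.g. GC content)
genome_effect = 12.0                  # per-copy depth of one genome
true_ploidy = [1, 1, 3, 3]            # last two windows triplicated

depths = [expected_depth(genome_effect, l, p)
          for l, p in zip(locus_effect, true_ploidy)]
recovered = [implied_ploidy(d, genome_effect, l)
             for d, l in zip(depths, locus_effect)]
print(depths)
print(recovered)
```

Because the locus effect is shared across genomes, a window whose depth is tripled in only one genome cannot be absorbed by that term and is left to be explained by ploidy, which is the behaviour argued for above.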

D. Minor

- section 3: "averaged by the polyploid genome size" -> "divided by genome size" ?

- Simulations: section 3 says that ploidy 1 to 20 have been simulated, but the result section and figure 2 only consider ploidy 1 to 5.

- Discussion: "On the former point, rescaling sequencing depth across genomes is not possible since HMMploidy models a distribution of read counts." -> I do not understand this sentence.

https://doi.org/10.24072/pci.mcb.100109.rev12
