Close printable page

Recommendation

Phylogenetic reconstruction from copy number aberration in large scale, low-depth genome-wide single-cell data.

Amaury Lambert based on reviews by 3 anonymous reviewers

A recommendation of:

Cancer phylogenetic tree inference at scale from 1000s of single cell genomes

Sohrab Salehi, Fatemeh Dorri, Kevin Chern, Farhia Kabeer, Nicole Rusk, Tyler Funnell, Marc J Williams, Daniel Lai, Mirela Andronescu, Kieran R. Campbell, Andrew McPherson, Samuel Aparicio, Andrew Roth, Sohrab Shah, and Alexandre Bouchard-Côté (2023), bioRxiv, ver.4, peer-reviewed and recommended by PCI Mathematical and Computational Biology https://doi.org/10.1101/2020.05.06.058180

Read preprint in preprint server Now published in Peer Community Journal

Codes used in this study

Abstract

EN

AR

ES

FR

HI

JA

PT

RU

ZH-CN

Cancer phylogenetic tree inference at scale from 1000s of single cell genomes

A new generation of scalable single cell whole genome sequencing (scWGS) methods allows unprecedented high resolution measurement of the evolutionary dynamics of cancer cell populations. Phylogenetic reconstruction is central to identifying sub-populations and distinguishing the mutational processes that gave rise to them.
Existing phylogenetic tree building models do not scale to the tens of thousands of high resolution genomes achievable with current scWGS methods.
We constructed a phylogenetic model and associated Bayesian inference procedure, Sitka, specifically for scWGS data.
The method is based on a novel phylogenetic encoding of copy number (CN) data, the Sitka transformation, that simplifies the site dependencies induced by rearrangements while still forming a sound foundation to phylogenetic inference. The Sitka transformation allows us to design novel scalable Markov chain Monte Carlo (MCMC) algorithms. Moreover, we introduce a novel point mutation calling method that incorporates the CN data and the underlying phylogenetic tree to overcome the low per-cell coverage of scWGS. We demonstrate our method on three single cell datasets, including a novel PDX series, and analyse the topological properties of the inferred trees. Sitka is freely available at \href{https://github.com/UBC-Stat-ML/sitkatree.git}{https://github.com/UBC-Stat-ML/sitkatree.git}.

Phylogenetics, Evolutionary dynamics, Single cell genomics, Cancer evolution, scalable Bayesian inference

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

استنتاج شجرة النشوء والتطور السرطاني على نطاق واسع من آلاف الجينومات ذات الخلية الواحدة

يسمح جيل جديد من أساليب تسلسل الجينوم الكامل للخلية الواحدة القابلة للتطوير (scWGS) بقياس عالي الدقة غير مسبوق للديناميكيات التطورية لمجموعات الخلايا السرطانية. تعد إعادة بناء شجرة النشوء والتطور أمرًا أساسيًا في تحديد المجموعات السكانية الفرعية وتمييز العمليات الطفرية التي أدت إلى ظهورها.
لا تصل نماذج بناء شجرة النشوء والتطور الحالية إلى عشرات الآلاف من الجينومات عالية الدقة التي يمكن تحقيقها باستخدام طرق scWGS الحالية.
نحن قام ببناء نموذج النشوء والتطور وإجراء الاستدلال البايزي المرتبط به، Sitka، خصيصًا لبيانات scWGS.
تعتمد الطريقة على ترميز النشوء والتطور الجديد لبيانات رقم النسخة (CN)، وتحويل Sitka، الذي يبسط تبعيات الموقع الناتجة عن عمليات إعادة الترتيب بينما لا يزال يشكل أساسًا سليمًا للاستدلال التطوري. يتيح لنا تحويل Sitka تصميم خوارزميات سلسلة ماركوف مونتي كارلو (MCMC) الجديدة القابلة للتطوير. علاوة على ذلك، نقدم طريقة جديدة لاستدعاء طفرة نقطية تتضمن بيانات CN وشجرة النشوء والتطور الأساسية للتغلب على التغطية المنخفضة لكل خلية لـ scWGS. نعرض طريقتنا على ثلاث مجموعات بيانات أحادية الخلية، بما في ذلك سلسلة PDX جديدة، ونحلل الخصائص الطوبولوجية للأشجار المستنتجة. سيتكا متاحة مجانًا على \href{https://github.com/UBC-Stat-ML/sitkatree.git}{https://github.com/UBC-Stat-ML/sitkatree.git}.

علم الوراثة، الديناميكيات التطورية، جينوم الخلية الواحدة، تطور السرطان، الاستدلال البايزي القابل للتطوير

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Inferencia del árbol filogenético del cáncer a escala a partir de miles de genomas unicelulares

Una nueva generación de métodos escalables de secuenciación del genoma completo unicelular (scWGS) permite una medición de alta resolución sin precedentes de la dinámica evolutiva de las poblaciones de células cancerosas. La reconstrucción filogenética es fundamental para identificar subpoblaciones y distinguir los procesos mutacionales que les dieron origen.
Los modelos filogenéticos de construcción de árboles existentes no se adaptan a las decenas de miles de genomas de alta resolución que se pueden lograr con los métodos scWGS actuales.
construyó un modelo filogenético y un procedimiento de inferencia bayesiano asociado, Sitka, específicamente para datos scWGS. El método se basa en una nueva codificación filogenética de datos de número de copias (CN), la transformación de Sitka, que simplifica las dependencias de sitio inducidas por reordenamientos todavía formando una base sólida para la inferencia filogenética. La transformación de Sitka nos permite diseñar novedosos algoritmos escalables de cadena de Markov Monte Carlo (MCMC). Además, introducimos un nuevo método de llamada de mutación puntual que incorpora los datos de CN y el árbol filogenético subyacente para superar la baja cobertura por célula de scWGS. Demostramos nuestro método en tres conjuntos de datos de una sola celda, incluida una nueva serie PDX, y analizamos las propiedades topológicas de los árboles inferidos. Sitka está disponible gratuitamente en \href{https://github.com/UBC-Stat-ML/sitkatree.git}{https://github.com/UBC-Stat-ML/sitkatree.git}.

Filogenética, Dinámica evolutiva, Genómica unicelular, Evolución del cáncer, inferencia bayesiana escalable

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Inférence d'arbre phylogénétique du cancer à l'échelle de milliers de génomes unicellulaires

Une nouvelle génération de méthodes évolutives de séquençage du génome entier d'une seule cellule (scWGS) permet une mesure à haute résolution sans précédent de la dynamique évolutive des populations de cellules cancéreuses. La reconstruction phylogénétique est essentielle pour identifier les sous-populations et distinguer les processus mutationnels qui leur ont donné naissance.
Les modèles de construction d'arbres phylogénétiques existants ne s'adaptent pas aux dizaines de milliers de génomes à haute résolution réalisables avec les méthodes scWGS actuelles.
Nous construit un modèle phylogénétique et une procédure d'inférence bayésienne associée, Sitka, spécifiquement pour les données scWGS.
La méthode est basée sur un nouveau codage phylogénétique des données de nombre de copies (CN), la transformation de Sitka, qui simplifie les dépendances de site induites par les réarrangements tout en formant toujours une base solide pour l’inférence phylogénétique. La transformation de Sitka nous permet de concevoir de nouveaux algorithmes évolutifs de chaîne de Markov Monte Carlo (MCMC). De plus, nous introduisons une nouvelle méthode d’appel de mutation ponctuelle qui intègre les données CN et l’arbre phylogénétique sous-jacent pour surmonter la faible couverture par cellule de scWGS. Nous démontrons notre méthode sur trois ensembles de données monocellulaires, y compris une nouvelle série PDX, et analysons les propriétés topologiques des arbres déduits. Sitka est disponible gratuitement sur \href{https://github.com/UBC-Stat-ML/sitkatree.git}{https://github.com/UBC-Stat-ML/sitkatree.git}.

Phylogénétique, dynamique évolutive, génomique unicellulaire, évolution du cancer, inférence bayésienne évolutive

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

1000 एकल कोशिका जीनोम से पैमाने पर कैंसर फ़ाइलोजेनेटिक वृक्ष का अनुमान

स्केलेबल सिंगल सेल होल जीनोम सीक्वेंसिंग (scWGS) विधियों की एक नई पीढ़ी कैंसर सेल आबादी के विकासवादी गतिशीलता के अभूतपूर्व उच्च रिज़ॉल्यूशन माप की अनुमति देती है। फाइलोजेनेटिक पुनर्निर्माण उप-आबादी की पहचान करने और उन्हें जन्म देने वाली उत्परिवर्तन प्रक्रियाओं को अलग करने के लिए केंद्रीय है। विशेष रूप से एससीडब्ल्यूजीएस डेटा के लिए एक फाइलोजेनेटिक मॉडल और संबंधित बायेसियन अनुमान प्रक्रिया, सीताका का निर्माण किया गया। फ़ाइलोजेनेटिक अनुमान के लिए अभी भी एक ठोस आधार तैयार हो रहा है। सीताका परिवर्तन हमें नवीन स्केलेबल मार्कोव श्रृंखला मोंटे कार्लो (एमसीएमसी) एल्गोरिदम डिजाइन करने की अनुमति देता है। इसके अलावा, हम एक उपन्यास बिंदु उत्परिवर्तन कॉलिंग विधि पेश करते हैं जो एससीडब्ल्यूजीएस के कम प्रति-सेल कवरेज को दूर करने के लिए सीएन डेटा और अंतर्निहित फ़ाइलोजेनेटिक पेड़ को शामिल करता है। हम एक उपन्यास पीडीएक्स श्रृंखला सहित तीन एकल सेल डेटासेट पर अपनी पद्धति का प्रदर्शन करते हैं, और अनुमानित पेड़ों के टोपोलॉजिकल गुणों का विश्लेषण करते हैं। सीताका \href{https://github.com/UBC-Stat-ML/sitkatree.git}{https://github.com/UBC-Stat-ML/sitkatree.git} पर निःशुल्क उपलब्ध है।

फ़ाइलोजेनेटिक्स, विकासवादी गतिशीलता, एकल कोशिका जीनोमिक्स, कैंसर विकास, स्केलेबल बायेसियन अनुमान

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

数千の単細胞ゲノムからの大規模ながん系統樹推定

新世代のスケーラブルな単一細胞全ゲノム配列決定 (scWGS) 手法により、がん細胞集団の進化ダイナミクスの前例のない高解像度測定が可能になります。系統発生の再構成は、部分集団を特定し、それらを生じさせた突然変異プロセスを区別する上で中心となります。
既存の系統樹構築モデルは、現在の scWGS 手法で達成可能な数万の高解像度ゲノムに対応できません。
私たちは、は、特に scWGS データ向けに、系統発生モデルとそれに関連するベイズ推論手順 Sitka を構築しました。
この方法は、コピー数 (CN) データの新しい系統発生エンコードである Sitka 変換に基づいており、再配列によって引き起こされる部位依存性を単純化します。系統推論への健全な基盤を形成し続けています。 Sitka 変換を使用すると、新しいスケーラブルなマルコフ連鎖モンテカルロ (MCMC) アルゴリズムを設計できます。さらに、scWGS のセルごとのカバー率の低さを克服するために、CN データと基礎となる系統樹を組み込んだ新しい点突然変異呼び出し方法を紹介します。新しい PDX シリーズを含む 3 つの単一セルデータセットで手法を実証し、推定されたツリーのトポロジカルプロパティを分析します。 Sitka は \href{https://github.com/UBC-Stat-ML/sitkatree.git}{https://github.com/UBC-Stat-ML/sitkatree.git} から無料で入手できます。

系統発生学、進化ダイナミクス、単一細胞ゲノミクス、癌進化、スケーラブルなベイズ推論

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Inferência da árvore filogenética do câncer em escala de milhares de genomas de células únicas

Uma nova geração de métodos escalonáveis de sequenciamento do genoma completo de célula única (scWGS) permite medições de alta resolução sem precedentes da dinâmica evolutiva das populações de células cancerígenas. A reconstrução filogenética é fundamental para identificar subpopulações e distinguir os processos mutacionais que lhes deram origem.
Os modelos existentes de construção de árvores filogenéticas não se adaptam às dezenas de milhares de genomas de alta resolução alcançáveis com os atuais métodos scWGS.
Nós construiu um modelo filogenético e procedimento de inferência Bayesiano associado, Sitka, especificamente para dados scWGS.
O método é baseado em uma nova codificação filogenética de dados de número de cópias (CN), a transformação Sitka, que simplifica as dependências de sítio induzidas por rearranjos enquanto ainda formando uma base sólida para a inferência filogenética. A transformação Sitka nos permite projetar novos algoritmos escaláveis de Monte Carlo em cadeia de Markov (MCMC). Além disso, introduzimos um novo método de chamada de mutação pontual que incorpora os dados CN e a árvore filogenética subjacente para superar a baixa cobertura por célula do scWGS. Demonstramos nosso método em três conjuntos de dados de células únicas, incluindo uma nova série PDX, e analisamos as propriedades topológicas das árvores inferidas. Sitka está disponível gratuitamente em \href{https://github.com/UBC-Stat-ML/sitkatree.git}{https://github.com/UBC-Stat-ML/sitkatree.git}.

Filogenética, Dinâmica evolutiva, Genômica unicelular, Evolução do câncer, Inferência Bayesiana escalável

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Вывод филогенетического дерева рака в масштабе тысяч геномов отдельных клеток

Новое поколение масштабируемых методов полногеномного секвенирования отдельных клеток (scWGS) позволяет измерять с беспрецедентным высоким разрешением эволюционную динамику популяций раковых клеток. Филогенетическая реконструкция играет центральную роль в идентификации субпопуляций и различении мутационных процессов, которые их породили.
Существующие модели построения филогенетических деревьев не масштабируются до десятков тысяч геномов с высоким разрешением, достижимых с помощью современных методов scWGS.
Мы построил филогенетическую модель и связанную с ней процедуру байесовского вывода Sitka специально для данных scWGS.
Метод основан на новом филогенетическом кодировании данных числа копий (CN), преобразовании Sitka, которое упрощает зависимости сайтов, вызванные перестановками, в то время как все еще формируя прочную основу для филогенетических выводов. Преобразование Ситки позволяет нам разрабатывать новые масштабируемые алгоритмы Монте-Карло для цепей Маркова (MCMC). Более того, мы представляем новый метод вызова точечных мутаций, который включает в себя данные CN и лежащее в основе филогенетическое дерево, чтобы преодолеть низкий охват scWGS на каждую клетку. Мы демонстрируем наш метод на трех наборах данных с одной ячейкой, включая новую серию PDX, и анализируем топологические свойства полученных деревьев. Sitka находится в свободном доступе по адресу \href{https://github.com/UBC-Stat-ML/sitkatree.git}{https://github.com/UBC-Stat-ML/sitkatree.git}.

Филогенетика, Эволюционная динамика, Геномика одиночных клеток, Эволюция рака, масштабируемый байесовский вывод

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

从数千个单细胞基因组中进行大规模癌症系统发育树推断

新一代可扩展单细胞全基因组测序 (scWGS) 方法可以对癌细胞群体的进化动态进行前所未有的高分辨率测量。系统发育重建对于识别亚群和区分产生它们的突变过程至关重要。
现有的系统发育树构建模型无法扩展到当前 scWGS 方法可实现的数万个高分辨率基因组。
我们构建了专门针对 scWGS 数据的系统发育模型和相关贝叶斯推理程序 Sitka。
该方法基于拷贝数 (CN) 数据的新型系统发育编码，即 Sitka 变换，它简化了重排引起的位点依赖性，同时仍然为系统发育推断奠定了坚实的基础。 Sitka 变换使我们能够设计新颖的可扩展马尔可夫链蒙特卡罗 (MCMC) 算法。此外，我们引入了一种新颖的点突变调用方法，该方法结合了 CN 数据和底层系统发育树，以克服 scWGS 的低每细胞覆盖率。我们在三个单细胞数据集（包括新颖的 PDX 系列）上演示了我们的方法，并分析了推断树的拓扑特性。 Sitka 可在 \href{https://github.com/UBC-Stat-ML/sitkatree.git}{https://github.com/UBC-Stat-ML/sitkatree.git} 免费获取。

系统发育学、进化动力学、单细胞基因组学、癌症进化、可扩展贝叶斯推理

Submission: posted 10 December 2021
Recommendation: posted 17 April 2023, validated 18 April 2023

Cite this recommendation as:
Lambert, A. (2023) Phylogenetic reconstruction from copy number aberration in large scale, low-depth genome-wide single-cell data.. Peer Community in Mathematical and Computational Biology, 100112. https://doi.org/10.24072/pci.mcb.100112

Recommendation

The paper [1] presents and applies a new Bayesian inference method of phylogenetic reconstruction for multiple sequence alignments in the case of low sequencing coverage but diverse copy number aberrations (CNA), with applications to single cell sequencing of tumors.

The idea is to take advantage of CNA to reconstruct the topology of the phylogenetic tree of sequenced cells in a first step (the `sitka' method), and in a second step to assign single nucleotide variants (SNV) to tree edges (and then calibrate their lengths) (the `sitka-snv' method).

The data are summarized into a binary-valued CxL matrix Y, where C is the number of cells and L is the number of loci (here, loci are segments of prescribed length called `bins'). The entry of Y at row i and column j is 1 (otherwise 0) iff in the ancestral lineage of cell i, at least one genomic rearrangement has occurred, and more specifically the gain or loss of a segment with at least one endpoint in locus j or in locus j+1. The authors expect the infinite-allele assumption to approximately hold (i.e., that at most one mutation occurs at any given marker and that 0 is the ancestral state). They refer to this assumption as the `perfect phylogeny assumption'. By only recording from CNA events the endpoints at which they occur, the authors lose the information on copy number, but they gain the assumption of independence of the mutational processes occurring at different sites, which approximately holds for CNA endpoints.

The goal of sitka is to produce a posterior distribution on phylogenetic trees conditional on the matrix Y , where here a phylogenetic tree is understood as containing the information on 1) the topology of the tree but not its edge lengths, and 2) for each edge, the identity of markers having undergone a mutation, in the sense of the previous paragraph.

The results of the method are tested against synthetic datasets simulated under various assumptions, including conditions violating the perfect phylogeny assumption and compared to results obtained under other baseline methods. The method is extended to assign SNV to edges of the tree inferred by sitka. It is also applied to real datasets of single cell genomes of tumors.

The manuscript is very well-written, with a high degree of detail. The method is novel, scalable, fast and appears to perform favorably compared to other approaches. It has been applied in independent publications, for example to multi-year time-series single-cell whole-genome sequencing of tumors, in order to infer the fitness landscape and its dynamics through time, see [2].

The reviewing process has taken too long, mainly because of other commitments I had during the period and to the difficulty of finding reviewers. Let me apologize to the authors and thank them for their patience as well as for the scientific rigor they brought to their revisions and answers to reviewers, who I also warmly thank for their quality work.

REFERENCES

[1] Sohrab Salehi, Fatemeh Dorri, Kevin Chern, Farhia Kabeer, Nicole Rusk, Tyler Funnell, Marc J Williams, Daniel Lai, Mirela Andronescu, Kieran R. Campbell, Andrew McPherson, Samuel Aparicio, Andrew Roth, Sohrab Shah, and Alexandre Bouchard-Côté. Cancer phylogenetic tree inference at scale from 1000s of single cell genomes (2023). bioRxiv, 2020.05.06.058180, ver. 4 peer-reviewed and recommended by Peer Community in Mathematical and Computational Biology.
https://doi.org/10.1101/2020.05.06.058180

[2] Sohrab Salehi, Farhia Kabeer, Nicholas Ceglia, Mirela Andronescu, Marc J. Williams, Kieran R. Campbell, Tehmina Masud, Beixi Wang, Justina Biele, Jazmine Brimhall, David Gee, Hakwoo Lee, Jerome Ting, Allen W. Zhang, Hoa Tran, Ciara O’Flanagan, Fatemeh Dorri, Nicole Rusk, Teresa Ruiz de Algara, So Ra Lee, Brian Yu Chieh Cheng, Peter Eirew, Takako Kono, Jenifer Pham, Diljot Grewal, Daniel Lai, Richard Moore, Andrew J. Mungall, Marco A. Marra, IMAXT Consortium, Andrew McPherson, Alexandre Bouchard-Côté, Samuel Aparicio & Sohrab P. Shah. Clonal fitness inferred from time-series modelling of single-cell cancer genomes (2021). Nature 595, 585–590. https://doi.org/10.1038/s41586-021-03648-3

PDF recommendation

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
This project was generously supported by the BC Cancer Foundation at BC Cancer and Cycle for Survival supporting Memorial Sloan Kettering Cancer Center. SPS holds the Nicholls Biondi Chair in Computational Oncology and is a Susan G. Komen Scholar (#GC233085). SA holds the Nan and Lorraine Robert son Chair in Breast Cancer and is a Canada Research Chair in Molecular Oncology (950-230610). Additional funding provided by the Terry Fox Research Institute Grant 1082, Canadian Cancer Society Research Institute Impact program Grant 705617, CIHRGrantFDN-148429, Breast Cancer Research Foundation award (BCRF-18-180, BCRF-19-180 and BCRF-20-180), MSKCancer Center Support Grant/Core Grant( P30CA008748), National Institutes of Health Grant (1RM1HG011014-01), CCSRIGrant(#705636), the Cancer Research UK Grand Challenge Program, Canada Foundation for Innovation (40044) to SA, SPS and ABC.

Reviews

Evaluation round #2

DOI or URL of the preprint: https://doi.org/10.1101/2020.05.06.058180

Version of the preprint: 3

Author's Reply, 14 Mar 2023

Download author's reply

Dear Reviewers,

Thank you for your feedback and comments. I have updated our bioRxiv submission to version 4 and attached additional comments in a PDF here.

Best,

Kevin Chern

https://doi.org/10.24072/pci.mcb.100112.ar2

Decision by Amaury Lambert, posted 17 Feb 2023, validated 20 Feb 2023

Dear Sohrab Salehi,

I would like to thank you and your co-authors for your work on revising your manuscript according to the reviewer's comments and to mine.

We are all satisfied with this new version, except for a few minor points that I encourage you to address before final recommendation. This should not take you too much time as a whole. Please address in particular the first point raised by the reviewer regarding SCARLET and either benchmark this method or argue better in favor of not doing it.

As far as I am concerned, I sincerely acknowledge your efforts in expanding explanations (best possible tree, proxy of violation rate + a wealth of terms and phrases) and adding new analyses: comparison with new methods and assessment of within-site pairwise dependencies - I appreciated your idea of getting rid of one of the two extremities of a segment that was gained or lost. However, here and in various other tests (violation of perfect phylogeny assumption, violation of infinite-site assumption), you seem to be reluctant to simulate the real biological process of CNA, as I had suggested in my report. Can you please explain me why?

Last point: line 610, it sounds a little weird to speak of the "three noise regimes" before explaining (in section 9.5.3) what they are.

I will try and write my recommendation soon after I receive your answer/revision.

Sincerely,

Amaury Lambert

https://doi.org/10.24072/pci.mcb.100112.d2

Reviewed by anonymous reviewer 2, 08 Dec 2022

My questions and comments have largely been addressed by the authors. There are a few remaining comments I would like to make in response:

1. In their response to my comment 5, in which I suggested to include SCARLET in the benchmark of methods, the authors wrote

"Rationale of choice of additional baselines: sitka is designed for shallow sequencing regimes where calling SNVs per cell would be difficult, but copy numbers can be called reliably. In such cases, most SNVs will not be called in most cells. However SCARLET, while correcting for CNAs, requires the same SNV to be called in all cells."

I am a bit surprised by this, since the SCARLET method explicitly accounts for allele drop-outs, i.e. missing SNV calls in some cells (cited from Satas et al., 2020):

"Data from scDNA-seq typically have high error rates in identifying SNVs, and particularly high rates of false negatives and missing data due to amplification bias and allele dropout (Gawad et al., 2016). SCARLET models these errors using a beta-binominal distribution (Singer et al., 2018) of the observed read counts."

Further, in the SCARLET paper, the model was applied to a dataset from Leung et al. (2017), which has the following properties (cited from Satas et al., 2020):

"This data-set included targeted sequencing of 1,000 genes in 141 cells from a primary colon tumor and 45 cells from a matched liver metastasis (Figure 4A). The authors identified 36 SNVs and used SCITE (Jahn et al., 2016) to derive a perfect phylogeny from these SNVs (Figure 4B)."

These properties sounds very comparable to one of the cohorts, Eirew et al. (2015), that were analyzed in the current study:

"DNA was prepared from 90 individual SA501 xenograft nuclei from passages X1, X2 and X4, and the variant allele ratios were determined by targeted ultra-deep sequencing at 45 somatic SNV and 10 germline SNV positions."

Hence, I somewhat fail to see why the comparison to SCARLET was not even tried.

2. I did not find a reference for the OVA dataset, even though it is not stated that that dataset was specifically generated for this study (as with the SA535 dataset).

3. The authors further argued that the differences in performance between methods might be driven in part by the fact that some algorithms do not converge in the "available computational budget (several days)." I think it would be necessary to define the criteria for the allowed algorithm runtime/computational budget very clearly if runtime is such a crucial factor for performance. In other words, even though of course algorithm runtime is an important practical feature, the benchmark is supposed to measure accuracy, ideally after convergence and independent of runtime. If this cannot be achieved, it should be pointed out that the other methods may have achieved higher accuracy with a longer runtime.

4. Regarding my comment 6, I apologize for having failed to formulate the question in a manner that would have allowed the authors to understand and answer it properly. Even though the figure caption may have been a bit spartan, I would argue it was reasonably obvious that the previous Fig. 1f showed the insets from panel 1e. My question about what is now Fig. 3f regarded the (extinct) leaves without cells (i.e. those that do not have blue circles at the end). There are two such leaves in the box denoting iteration 100 and one in the box of iteration 101. Are these the unseen wild-type states of the marker events? Why are there two such extinct lines associated with just one marker (chr12_1600) in the left box? The process that happens in the upper part of the plot exactly corresponds to the description of the edge insertion process in the Methods section. However, it is not clear why in addition both the topology of the tree in the bottom part (presumably unaffected by the edge insertion) and the marker ID itself are changed (assuming each marker/red diamond is associated with its nearest orange text descriptor).

References:

Eirew, P., Steif, A., Khattra, J., Ha, G., Yap, D., Farahani, H., ... & Aparicio, S. (2015). Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution. Nature, 518(7539), 422-426.

Leung, M. L., Davis, A., Gao, R., Casasent, A., Wang, Y., Sei, E., ... & Navin, N. E. (2017). Single-cell DNA sequencing reveals a late-dissemination model in metastatic colorectal cancer. Genome research, 27(8), 1287-1299.

Satas, G., Zaccaria, S., Mon, G., & Raphael, B. J. (2020). SCARLET: single-cell tumor phylogeny inference with copy-number constrained mutation losses. Cell Systems, 10(4), 323-332.

https://doi.org/10.24072/pci.mcb.100112.rev21

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2020.05.06.058180

Version of the preprint: 2

Author's Reply, 12 Nov 2022

Download author's reply https://doi.org/10.24072/pci.mcb.100112.ar1

Decision by Amaury Lambert, posted 25 Aug 2022

Dear Sohrab Salehi,

We have received 3 reviews, two of which are enthusiastic about your work but request revisions before final recommendation.

Please see my detailed decision in attachment,

Best wishes,

Amaury Lambert

Download recommender's annotations

https://doi.org/10.24072/pci.mcb.100112.d1

Reviewed by anonymous reviewer 2, 16 Jan 2022

In this manuscript, the authors introduce a novel, scalable method to infer phylogenies from single-cell whole-genome sequencing data based on copy number information. The algorithm is applied to three independent datasets and the goodness-of-fit compared to other methods. Possible violations of the model assumptions are discussed and put in context of real-world data. After tree inference, SNV data can be incorporated into the model prediction as well. The manuscript is very well written and the method appears to be fast and to perform favorably compared to other approaches.

I would first like to commend the authors on the clarity of the majority of their manuscript and the high degree of detail. It was a pleasure to read it. I am further providing my detailed feedback and questions below.

1. Based on Eq. 2, it appears like the sampling probability of a vertex v is both proportional to the likelihood of each sub-tree, expressed by p(y|x(t),theta), as well as the number of possible sub-trees. The latter implies that vertices with many children are more likely to be sampled than trees with fewer children. Is this correct? Is this desired?

2. How does the equation following l. 398, which posits that the probability p(y_{c,l}|x_{c,l},theta) of a vertex can be expressed as the product of probabilities of all its children, relate to the original definition of p(y_{c,l}|x_{c,l},theta) given after l. 322, according to which children and parent nodes are independent?

3. The transformation given in the equation after l. 413, which results in a product over k factors, contains all possibilities for edge insertions in sub-tree v, including the one in which no edge is inserted. Hence, vertices whose existing configuration already has a high likelihood are, counterintuitively, selected for edge insertion proportionately to this likelihood. I can imagine that, at best, this would slow the convergence of the algorithm, but there might be more deleterious consequences.

4. Regarding the inference of the consensus tree (section 9.4.5), I am not sure I understand well Eq. 6. It appears the authors are using a generalized Bayes estimator by minimizing the posterior expected loss, as the loss function is weighted with the posterior distribution that is given after l. 365. Is this correct? Second, it appears that then the parameter t in the argument of pi should be t'. Third, why did the authors choose to use the Bayes risk to determine the tree, especially since it appears that the priors for t and theta are anyway largely non-informative? Could they not just maximize the likelihood p(y|x(t), theta)?

5. In the benchmarking (Fig. 2d), I think the authors should compare their method also to other existing methods, in particular SCARLET from ref. [17] and MEDALT from ref. [18]. In this regard, I also find it interesting that a simple hierarchical clustering method, UPGMA, shows such good performance, which is even largely better than that of MrBayes, when using Youden's index as a measure. Do the authors know why this would be so? Could it be that the performance measure is suboptimal?

6. Fig. 1f shows leaves without cells or markers. Why and how are these generated in the MCMC tree exploration scheme?

7. Along similar lines, in l. 101 of the manuscript is stated that "Markers placed at the leaves are interpreted as outliers, for example measured CN change points that are false positives. We remove from the type I tree all marker nodes that are leaf nodes, i.e., markers that are not present in any cells." Would it not be possible to always place a marker that was observed in only one cell at the leaf corresponding to that cell? Also, why are such outliers still observed (e.g. CN change points that are false positives) if columns in y with relative density across cells less than 5% were removed (l. 314)?

8. In the computation of p(y|x,theta) on p. 10, all entries (c,l) are assumed independent. However, change points should be correlated with some auto-correlation function with a decay rate proportional to the typical CN lengths, i.e. Cov(y_{c,l}, y_{c,l'})!=0. Would it be feasible to incorporate this kind of information in the algorithm?

9. It appears like the FN and FP rates could be optimized instead of set to default values (0.5 and 0.1, respectively). Are these defaults informed or are they arbitrary?

10. Regarding the number of possible trees derived after l. 352, how come the second factor in the second line of the equation is (|L|+1)^|C|? Should there not be (L+1)!/(L+1-C)! possibilities to assign |C| cells to |L|+1 vertices?

11. In l. 356ff., it is stated "This simple prior has a useful property: if a collection of say two splits are supported by m1 and m2 traits, then the prior probability for an additional trait to support the first versus second split is proportional to (m1 + 1, m2 + 1). Therefore, there is a “rich gets richer” behaviour built-in into the prior". How is this compatible with the prior being a uniform prior (cf. l. 353 and formula)?

https://doi.org/10.24072/pci.mcb.100112.rev11

Reviewed by anonymous reviewer 1, 29 Mar 2022

PCI Math Comp Biol #112

Cancer phylogenetic tree inference at scale from 1000s of single cell genomes

by Salehi et al.

This paper develops a new method for phylogenetic modeling and Bayesian inference of cancer evolution that suggests being efficient when applied to tens of thousands of high-resolution genomes from single cell whole genome sequencing (scWGS).

I think that the method clearly shows utility, but that it is not entirely clear whether it outperforms alternative approaches. This, however, is mentioned in the manuscript. The study of synthetic experiments helps readers to navigate the method and evaluated its impact and future utility, especially in light of cell removal due to contamination. I am intersted in seeing future applications of this method.

https://doi.org/10.24072/pci.mcb.100112.rev12

Reviewed by anonymous reviewer 3, 04 Aug 2022

The authors of the manuscript present a new method for reconstructing single cell phylogenies from previously inferred CNV data. Specifically, they propose a data transformation for CNV counts into discretized coarse grained markers of changes, based on which the phylogenetic reconstruction is performed efficiently.
Importantly, the authors also propose a single point mutation calling method that conditions on the CNV based phylogenies to better resolve signal to noise problem.
The method is compared to other state of art methods on three single cell datasets.

The methods and algorithms are comprehensively presented. In general, the manuscript would benefit from introducing more explanatory comments and brief motivations for introducing steps of the analysis, especially in the Results section (e.g. why do we need type I and type II trees, what is the benefit of using change points, or even definition of perfect phylogeny) so that non-expert readers can follow.

However, I do have a major concern about the property of the sitka transformation and the effect it has on the phylogeny recontruction that the authors should adress. Each copy number variations comes with 2 breakpoints. The sitka transformation, to my understanding, ends up treating copy number changes along the chromosome as independent events, and, effectively, the markers of the beginning and the end of a copy number variation are not paired. What is the impact of this on the phylogenies? Are the pairs of breakpoints separated on the reconstructed phylogenies? If so, how distant they are? The authors should discuss this point and present the relevant statistics on empirical data.

https://doi.org/10.24072/pci.mcb.100112.rev13