Recommendation

An accelerated Vidjil algorithm: up to 30X faster identification of V(D)J recombinations via spaced seeds and Aho-Corasick pattern matching

Giulio Ermanno Pibiri based on reviews by Sven Rahmann and 1 anonymous reviewer

A recommendation of:

Alignment-free detection and seed-based identification of multi-loci V(D)J recombinations in Vidjil-algo

Cyprien Borée, Mathieu Giraud, Mikaël Salson (2024), HAL, ver.2, peer-reviewed and recommended by PCI Mathematical and Computational Biology https://hal.science/hal-04361907

Now published in Peer Community Journal

Abstract

Alignment-free detection and seed-based identification of multi-loci V(D)J recombinations in Vidjil-algo

The diversity of the immune repertoire is grounded on V(D)J recombinations in several loci. Many algorithms and software detect and designate these recombinations in high-throughput sequencing data. To improve their efficiency, we propose a multi-loci seed identification through an Aho-Corasick like automaton as well as a seed-based gene filtration. These algorithms were implemented into Vidjil-algo, used routinely by several labs for the analysis of hematologic malignancies. We benchmark the results of Vidjil-algo and of MiXCR on five datasets, evaluating the specificity and sensitivity of the detection, as well as the adequation of the designation to manually curated sequences. Compared to the previous algorithms, the new algorithms implemented in Vidjil-algo bring speedups between 5× and 30×, with a smaller memory footprint and without quality loss in results. They enable to precisely annotate in a few minutes millions of sequences coming from V(D)J recombinations, including incomplete V(D)J-like recombinations, improving our knowledge on immune repertoires.

Spaced seeds; Aho-Corasick automaton; Alignment-free algorithms; Immune repertoire; V(D)J recombinations; Adaptive Immune Receptor Repertoire (AIRR); Repertoire Sequencing (RepSeq)

Submission: posted 28 December 2023, validated 01 January 2024
Recommendation: posted 22 July 2024, validated 23 July 2024

Cite this recommendation as:
Pibiri, G. (2024) An accelerated Vidjil algorithm: up to 30X faster identification of V(D)J recombinations via spaced seeds and Aho-Corasick pattern matching. Peer Community in Mathematical and Computational Biology, 100268. https://doi.org/10.24072/pci.mcb.100268

Recommendation

VDJ recombination is a crucial process in the immune system, where a V (variable) gene, a D (diversity) gene, and a J (joining) gene are randomly combined to create unique antigen receptor genes. This process generates a vast diversity of antibodies and T-cell receptors, essential for recognizing and combating a wide array of pathogens. By identifying and quantifying these VDJ recombinations, we can gain a deeper and more precise understanding of the immune response, enhancing our ability to monitor and manage immune-related conditions.

It is therefore important to develop efficient methods to identify and extract VDJ recombinations from large sequences (e.g., several million or billion nucleotides). The work by Borée, Giraud, and Salson [2] contributes one such algorithm. As in previous work, the proposed algorithm employs the Aho-Corasick automaton to simultaneously match several patterns against a string but, unlike other methods, it also exploits the efficiency of spaced seeds. Working with seeds rather than the original string has the net benefit of speeding up the algorithm and reducing its memory usage, sometimes at the price of a modest loss in accuracy. Experiments conducted on five different datasets demonstrate that these features grant the proposed method excellent practical performance compared to the best previous methods, like Vidjil [3] (up to 5X faster) and MiXCR [1] (up to 30X faster), with no quality loss.
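To make the idea of matching several patterns in a single pass concrete, here is a minimal sketch of a textbook Aho-Corasick automaton in Python. It is not the Vidjil-algo implementation: the function names `build_automaton` and `find_all` and the toy patterns are assumptions chosen for illustration only.

```python
from collections import deque

def build_automaton(patterns):
    """Build a textbook Aho-Corasick automaton (illustrative, not Vidjil-algo's code)."""
    goto = [{}]   # goto[state][char] -> next state (trie edges)
    fail = [0]    # failure link of each state
    out = [[]]    # indices of the patterns recognized when reaching each state
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    # Breadth-first traversal: failure links of shallower states are set first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, patterns):
    """Return (start position, pattern) for every occurrence, scanning text once."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((pos - len(patterns[idx]) + 1, patterns[idx]))
    return hits
```

For example, `find_all("TACGTA", ["ACG", "CGT", "GT"])` reports all three patterns while reading each character of the text only once, which is what makes the automaton attractive when many gene-derived patterns must be searched in millions of reads.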

The method can also be considered an excellent example of a more general trend in scalable algorithmic design: adapt "classic" algorithms (in this case, the Aho-Corasick pattern matching algorithm) to work in sketch space (e.g., the spaced seeds used here), trading accuracy for efficiency. Sometimes, this compromise is necessary for the sake of scaling to very large datasets using modest computing power.
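The accuracy-for-efficiency trade-off of spaced seeds can be illustrated with a small Python sketch. The mask `##-##`, the helper names, and the toy sequences below are assumptions for illustration; this page does not state which seeds Vidjil-algo actually uses. A `#` marks a position that must match, and `-` marks a "don't care" position, so a substitution falling on a `-` still produces a hit.

```python
def apply_seed(kmer, mask):
    """Keep the characters at '#' positions and blank out the others with '-'."""
    return "".join(c if m == "#" else "-" for c, m in zip(kmer, mask))

def spaced_index(reference, mask):
    """Index every window of the reference by its masked form."""
    span = len(mask)
    index = {}
    for i in range(len(reference) - span + 1):
        key = apply_seed(reference[i:i + span], mask)
        index.setdefault(key, []).append(i)
    return index

def seed_hits(read, index, mask):
    """Return (read position, reference position) pairs whose masked windows agree."""
    span = len(mask)
    hits = []
    for i in range(len(read) - span + 1):
        key = apply_seed(read[i:i + span], mask)
        for ref_pos in index.get(key, []):
            hits.append((i, ref_pos))
    return hits

reference = "TTACGTAGG"   # toy sequence, not a real germline gene
read = "ACCTA"            # "ACGTA" with one substitution in the middle

spaced = seed_hits(read, spaced_index(reference, "##-##"), "##-##")
contiguous = seed_hits(read, spaced_index(reference, "#####"), "#####")
```

With the spaced mask the read still hits the reference at position 2, whereas the contiguous mask of the same span finds nothing. The price is that masked windows carry less information, so unrelated sequences can collide more easily.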

References

[1] D. A. Bolotin, S. Poslavsky, I. Mitrophanov, M. Shugay, I. Z. Mamedov, E. V. Putintseva, and D. M. Chudakov (2015). "MiXCR: software for comprehensive adaptive immunity profiling". Nature Methods 12, 380–381. ISSN: 1548-7091. https://doi.org/10.1038/nmeth.3364

[2] C. Borée, M. Giraud, and M. Salson (2024). "Alignment-free detection and seed-based identification of multi-loci V(D)J recombinations in Vidjil-algo". HAL, version 2, peer-reviewed and recommended by Peer Community In Mathematical and Computational Biology. https://hal.science/hal-04361907v2

[3] M. Giraud, M. Salson, M. Duez, C. Villenet, S. Quief, A. Caillault, N. Grardel, C. Roumier, C. Preudhomme, and M. Figeac (2014). "Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing". BMC Genomics 15, 409. https://doi.org/10.1186/1471-2164-15-409

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
VidjilNet consortium (Inria), Mésocentre de Lille (Université de Lille)

Reviews

Evaluation round #1

DOI or URL of the preprint: https://hal.science/hal-04361907

Version of the preprint: 1

Author's Reply, 08 Jul 2024

https://doi.org/10.24072/pci.mcb.100268.ar1

Decision by Giulio Ermanno Pibiri, posted 08 Mar 2024, validated 09 Mar 2024

This work contributes an efficient algorithm to extract the so-called "V(D)J junctions" from raw sequencing data, using an approach based on the Aho-Corasick automaton and spaced seeds.

Both reviewers found merit in the technical contribution of this paper and agreed that the paper is generally well-written and easy to follow, even for non-experts. Furthermore, the experiments are convincing and the code is open-source.

This preprint merits a revision to address the comments of the reviewers.

While one of them has minor comments and suggestions on how to improve some scripts in the software repository, the other reviewer asks for some clarifications regarding the experiments.
In particular, the authors are kindly requested to address the following points in the revised version of their work:
- A discussion on how the seeds are built.
- Inclusion of the method RTCR as another suitable baseline to compare against.
- A discussion on why the previous version of the algorithm performs better in some circumstances.

https://doi.org/10.24072/pci.mcb.100268.d1

Reviewed by Sven Rahmann, 05 Mar 2024

The authors present a novel engineered method to detect and designate V(D)J recombinations from raw sequence data.
The novelty of the approach lies not so much in new methods or data structures, but in their efficient combination (e.g., spaced seeds with Aho-Corasick automata).
The evaluation of the method is comprehensive, carefully done, and described in detail.
I found reading the paper, especially the introduction, very enjoyable. The introduction addresses also non-specialists of the field and concisely explains the challenges with just the right amount of detail.
Also, the remainder of the paper can be read without problems.

The evaluation gives the impression that the challenges concerning V(D)J recombination are now essentially solved; perhaps the authors can comment on whether this is true or not, or what else should be done in the future in this field (apart from further small optimizations).

Minor Suggestions and questions:

- It may be better to move Fig. 1 to the bottom of the page.
- 46: "Afzal et al. (2019) did a comparison of several of those software ." -> software tools.
- The term "affectation vector" does not sound right to me, but then I am not a native speaker. Maybe one could use "label sequence"?
- Notation: Please write p-value instead of $p$-value and E-value instead of $e$-value in running text.
- 175: the word p-value is not correct here. You mean a 99.9% confidence interval?
- Fig. 5: The titles of the subfigures are too far away from the actual figure. I first mis-interpreted the leftmost figure as the detection results and could not make sense of the statements on lines 291ff.
- Figures: It should also be said that vidjil-old is probably 2018.02 and -new is the "development" version.
- Fig. 8: I suggest to place the color legend to the right of the figures, not below.
- Fig. 8: IGK/vidjil-new: How/why is the designation bar higher than the detection bar?

Comments on the software:

I highly appreciate that the authors provide a Snakemake workflow that starts by installing the software(s). I have a few suggestions for improvement.

- The config file could have reasonable defaults for the directories (results/, benchmarks/, software/).
- MixCR can be skipped, e.g. by using --keep-going on Snakemake, but it would be nicer if there was an option to disable it.
- I do get warnings during compilation.
- The workflow outputs a lot of information to the screen. This could be written into log files (using a logs/ directory).
- There are many errors when running the Snakefile. Not all are related to MixCR. The software doesn't seem to build properly. Here is a list of the jobs still to be done, i.e. none of the jobs below completes successfully, in spite of --keep-going.

Job stats:
job                                count
-------------------------------  -------
afzal_summary                          1
afzal_summary_per_type                 2
all                                    1
gather_classic_results                13
gather_random_vdj_results              3
gather_should_vdj_results              4
install_mixcr_custom_library           1
install_vidjil_dev                     1
install_vidjil_old                     1
mixcr_to_vdj                           4
random_sequences                       1
random_vdj                             3
results_to_plot                       23
run_mixcr3                            49
run_mixcr3_align                       2
run_mixcr4                            49
run_mixcr4_align                       2
run_vidjil                            89
run_vidjil_align                       1
run_vidjil_align_old                   2
should_vdj_dataset                     1
shouldvdj_export_results               2
vdj_to_assign_detection_results        8
vidjil_detect_to_vdj                   4
vidjil_to_vdj                          4
total                                271

https://doi.org/10.24072/pci.mcb.100268.rev11

Reviewed by anonymous reviewer 1, 20 Feb 2024

In this work the authors present an improved version of the Vidjil tool, previously developed for processing high-throughput sequencing data in order to extract so-called V(D)J junctions. V(D)J recombinations in lymphocytes are essential for immunological diversity but could also serve as useful markers of pathologies. The authors propose a multi-loci seed-based method based on an Aho-Corasick-like automaton and spaced seeds extracted from V, D, and J genes. The algorithm was benchmarked against MiXCR on five datasets, evaluating both specificity and sensitivity of detection, as well as the correctness of designation. Results show the newly implemented algorithms bring significant speedups (up to 30x) along with a smaller memory footprint and comparable accuracy.

Overall the manuscript is very well written, extremely clear, and easy to follow in all of its sections. Results are generally convincing and the datasets and experimental benchmarks seem to have been adequately designed in order to evaluate the performance of the algorithms.

Nevertheless, I still have some remarks that I would like to address to the authors and that might improve the quality of the manuscript:

- One of the core components of the method is an automaton built from spaced seeds extracted from V, D, and J genes. However, it is not clear how these seeds were extracted. I think it is important to discuss it a bit more in depth.
- It seems that experiments are focused on data affected by substitution errors (dataset B). Are datasets C, D, and E affected by indels? I guess the method is mainly thought to be used with sequences that are primarily affected by substitution errors but I think it could be interesting to see how it performs in presence of indels.
- I am not fully convinced about the choice of MiXCR as the only tool to compare Vidjil-algo with. The authors justify this as MiXCR being the most balanced tool in terms of flexibility and accuracy, according to a previous systematic review. From the same review however it seems that MiXCR is not the only balanced tool. As a matter of fact RTCR also seems to offer a good balance but is also the one with the most consistent performance across datasets. For this reason, I would suggest also to include RTCR in the comparison, in particular to see how it compares against Vidjil-algo in the challenging dataset E.
- The authors showed the new version of Vidjil-algo provides better results in almost all datasets considered. However in the most challenging one (dataset E), the previous approach actually performs better in some cases. I wonder if the authors investigated this behavior and could discuss the possible reasons. It might be worthwhile also to mention the specific algorithmic differences between the old and new method in the introduction.

Minor remarks/suggestions:

- In the introduction it is mentioned that the newly designed algorithm for detection has a time complexity of O(n) compared to the previous version which has O(l·n) complexity. Since the designation algorithm has also been improved, it could also make sense to briefly discuss its complexity with respect to the previous implementation.
- The authors define as "designation" the problem of determining the specific germline V, D, and J genes that have undergone recombination, as well as identifying nucleotide deletions/insertions that may have occurred at the junctions of these genes. It is not clear to me if both of these problems are taken into account when evaluating the correct designations or if it is only based on the identification of the specific germline genes.
- Indexation: in the definition of $P(g,u)$ I would replace $g_{i…i+|u|-1}$ with $g[i,i+|u|-1]$ to be consistent with the "factor" definition introduced above.
- "Section 2 - Locus estimation and recombination detection" paragraph: I would replace $t$ with $q$ in $\delta$ and $\delta^{\prime}$ definitions to be consistent with Figure 3's pseudocode.
- It seems that the value $\delta$ in Figure 4 is not discussed anywhere. I would add some words about the value actually used by Vidjil-algo in section 3.
- Line 265: Supplementary figure 2 seems a bit redundant as it shows the same information as Tables 2 and 3.

Typos:

- Figure 3: $a_t$ in the pseudocode should be $a_q$
- Figure 4: (at the end of the caption) "previouly" -> "previously"

https://doi.org/10.24072/pci.mcb.100268.rev12