Recommendation

Minimal encodings of canonical k-mers for general alphabets and even k-mer sizes

Paul Medvedev based on reviews by 2 anonymous reviewers

A recommendation of:

General encoding of canonical k-mers

Roland Wittler (2023), bioRxiv, ver.2, peer-reviewed and recommended by PCI Mathematical and Computational Biology https://doi.org/10.1101/2023.03.09.531845

Read preprint in preprint server Now published in Peer Community Journal

Codes used in this study

Abstract

General encoding of canonical k-mers

To index or compare sequences efficiently, often k -mers, i.e., substrings of fixed length k , are used. For efficient indexing or storage, k -mers are encoded as integers, e.g., applying some bijective mapping between all possible σ^ k k -mers and the interval [0,σ^ k -1], where σ is the alphabet size.

In many applications, e.g., when the reading direction of a DNA-sequence is ambiguous, canonical k -mers are considered, i.e., the lexicographically smaller of a given k -mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k -mers are not evenly distributed within the interval [0,σ^ k -1].

We present a minimal encoding of canonical k -mers on alphabets of arbitrary size, i.e., a mapping to the interval [0,σ^ k /2-1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space and time efficient bit-based implementation for the DNA alphabet.

canonical k-mers, k-mers, q-grams, encoding

Submission: posted 13 March 2023, validated 27 March 2023
Recommendation: posted 18 September 2023, validated 18 September 2023

Cite this recommendation as:
Medvedev, P. (2023) Minimal encodings of canonical k-mers for general alphabets and even k-mer sizes. Peer Community in Mathematical and Computational Biology, 100188. https://doi.org/10.24072/pci.mcb.100188

Recommendation

As part of many bioinformatics tools, one encodes a k-mer, which is a string, into an integer. The natural encoding uses a bijective function to map the k-mers onto the interval [0, s^k - 1], where s is the alphabet size. This encoding is minimal, in the sense that the encoded integer ranges from 0 to the number of represented k-mers minus 1.
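To make the natural encoding concrete, here is a minimal Python sketch that reads a k-mer as a base-s number. The function names `encode_kmer` and `decode_kmer` and the character order of the alphabet are illustrative assumptions, not taken from the preprint.

```python
def encode_kmer(kmer: str, alphabet: str = "ACGT") -> int:
    """Read the k-mer as a number in base s, where s = len(alphabet)."""
    s = len(alphabet)
    rank = {ch: i for i, ch in enumerate(alphabet)}  # assumed character order
    value = 0
    for ch in kmer:
        value = value * s + rank[ch]
    return value


def decode_kmer(value: int, k: int, alphabet: str = "ACGT") -> str:
    """Invert encode_kmer for a known k."""
    s = len(alphabet)
    chars = []
    for _ in range(k):
        value, digit = divmod(value, s)
        chars.append(alphabet[digit])
    return "".join(reversed(chars))


def all_codes(k: int, alphabet: str = "ACGT") -> set:
    """Codes of all s^k k-mers, obtained by decoding and re-encoding."""
    total = len(alphabet) ** k
    return {encode_kmer(decode_kmer(v, k, alphabet), alphabet) for v in range(total)}
```

With the assumed order A < C < G < T, `encode_kmer("AAA")` is 0 and `encode_kmer("TTT")` is 4^3 - 1 = 63, so every integer in the interval is used exactly once.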

However, often one is only interested in encoding canonical k-mers. One common definition is that a k-mer is canonical if it is lexicographically not larger than its reverse complement. In this case, only about half the k-mers from the universe of k-mers are canonical, and the natural encoding is no longer minimal. For the special case of a DNA alphabet and odd k, there exists a "parity-based" encoding for canonical k-mers which is minimal.
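The "about half" figure can be checked by brute force for small k. The sketch below assumes the DNA alphabet with the order A < C < G < T and the definition of canonical given above; the helper names are hypothetical. For odd k no k-mer equals its own reverse complement, so exactly 4^k / 2 k-mers are canonical; for even k the 4^(k/2) self-reverse-complementary k-mers raise the count to (4^k + 4^(k/2)) / 2.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    return kmer.translate(COMPLEMENT)[::-1]


def is_canonical(kmer: str) -> bool:
    # Canonical: lexicographically not larger than the reverse complement.
    return kmer <= reverse_complement(kmer)


def count_canonical(k: int) -> int:
    """Enumerate all 4^k DNA k-mers and count the canonical ones."""
    return sum(is_canonical("".join(p)) for p in product("ACGT", repeat=k))


for k in range(1, 7):
    print(k, count_canonical(k), 4 ** k)
```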

In [1], the author presents a minimal encoding for canonical k-mers that works for general alphabets and both odd and even k. They also give an efficient bit-based representation for the DNA alphabet.

This paper fills a theoretically interesting and often overlooked gap in how to encode k-mers as integers. It is not yet clear what practical applications this encoding will have, as the author readily acknowledges in the manuscript. Neither the author nor the reviewers are aware of any practical situations where the lack of a minimal encoding "leads to serious limitations." However, even in an applied field like bioinformatics, it would be short-sighted to only value theoretical work that has an immediate application; often, the application is several hops away and not apparent at the time of the original work.

In fact, I would speculate that there may be significant benefits reaped if there was more theoretical attention paid to the fact that k-mers are often restricted to be canonical. Many papers in the field sweep under the rug the fact that k-mers are made canonical, leaving it as an implementation detail. This may indicate that the theory to describe and analyze this situation is underdeveloped. This paper makes a step forward to develop this theory, and I am hopeful that it may lead to substantial practical impact in the future.

References

[1] Roland Wittler (2023) "General encoding of canonical k-mers." bioRxiv, ver.2, peer-reviewed and recommended by Peer Community in Mathematical and Computational Biology. https://doi.org/10.1101/2023.03.09.531845

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
The authors declare that they have received no specific funding for this study

Reviews

Reviewed by anonymous reviewer 2, 17 Sep 2023

The author addressed all my concerns. In particular, he revisited the choice of the term "MPHF" and changed the name of the manuscript accordingly, as I suggested. He also removed any load balancing considerations that are already solved using standard approaches. He now correctly points out that there are no direct and important practical applications for the proposed method; rather, he aims at filling an interesting theoretical question.

Under this setting, I think the paper can be recommended as a nice theoretical contribution.

https://doi.org/10.24072/pci.mcb.100188.rev21

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2023.03.09.531845

Version of the preprint: 1

Author's Reply, 28 Aug 2023

Download author's reply https://doi.org/10.24072/pci.mcb.100188.ar1

Decision by Paul Medvedev, posted 17 May 2023, validated 18 May 2023

Dear Dr. Roland Wittler,

Your submission has now been reviewed by two reviewers. They found your preprint to be of interest; however, they raised a number of concerns. Two concerns in particular were raised by both reviewers. The first is the use of the term MPHF, which they found misleading. The second is the lack of convincing applications / usefulness of your encoding.

I am therefore asking for a revision which addresses the reviewers' concerns. Though all concerns are important, I believe that demonstrating usefulness will be critical for me to recommend the preprint.

Best,

Paul Medvedev

https://doi.org/10.24072/pci.mcb.100188.d1

Reviewed by anonymous reviewer 2, 30 Apr 2023

The paper proposes a bijective encoding for canonical k-mers motivated by the fact that
such k-mers are sparse in the universe of all k-mers, of size sigma^k where sigma is
the alphabet size, and thus, the trivial encoding using k log2(sigma) bits is wasteful.
In fact, there are only (sigma^k)/2 canonical k-mers among all the sigma^k k-mers
and their encodings are not evenly distributed in the interval [0,sigma^k-1].
In the practical case where sigma=4, e.g. the DNA alphabet, the proposed encoding takes
2k-1 bits per k-mer rather than the naive 2k-bit encoding.
It is also shown how to compute the encoding in O(1) time for DNA k-mers.

The paper is written exceptionally well; the prose is precise and concise, helping
to easily go through the technicalities. There are only some minor formatting issues
and typos (as noted below). I checked some of the math and found no errors.
I would suggest, however, to add more examples to help the reader to better grasp
the result of the various encoding procedures.
It is a very elegant paper overall.

My main criticism regards the usefulness of the proposed encoding. The application where
this encoding could be used, as cited in the paper, is to implement a direct-addressing
hash table where the encoded canonical k-mer is used as an index into the table.
Since the proposed encoding uses 1 bit less than the simple 2k-bit encoding, then
the table can be shrunk by a factor of 2, from 4^k entries to 4^k/2 entries.
This is nice, but I guess only for relatively small values of k, e.g., k < 16 or so,
for which we expect to observe all the different canonical k-mers from the input.
With larger values of k, instead, e.g. the "classic" k=31, we expect to only observe
a (small) subset of all possible k-mers
making the direct-address table a poor choice in practice.
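The table sizes discussed above can be tabulated with a few lines of Python. This is only an arithmetic sketch: the function name `table_entries` is hypothetical, and the minimal size uses the exact count of canonical DNA k-mers under reverse complementation, which is 4^k / 2 for odd k and (4^k + 4^(k/2)) / 2 for even k.

```python
def table_entries(k: int) -> tuple[int, int]:
    """Return (naive, minimal) direct-address table sizes for DNA k-mers."""
    naive = 4 ** k
    # Self-reverse-complementary k-mers exist only for even k.
    rc_palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    minimal = (naive + rc_palindromes) // 2
    return naive, minimal


for k in (11, 15, 31):
    naive, minimal = table_entries(k)
    print(f"k={k}: naive {naive:,} entries, minimal {minimal:,} entries")
```

At k = 15 the naive table has 2^30 entries and the minimal one 2^29. At k = 31 even the halved table has 2^61 entries, which illustrates why direct addressing is attractive only for small k.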

I am not sure what is meant by "a MPHF allows/ensures an even distribution of k-mers
to buckets of equal size". This appears multiple times in the paper, so I wanted to
note this down. Is it meant that all slots in the table are used, with no holes
(where a slot is a bucket of size 1)?
To my knowledge, MPHFs are not used to ensure load balancing, rather they are used
to address an array with (almost) no space overhead.
In fact, we can always hash the canonical k-mers using a pseudo random hash function
(like, Murmurhash or xxhash, etc.) and take the mod of the hash code to achieve
almost perfect balancing. This is also noted in the paper.
I do not know what the advantage of the presented method is compared to this;
better balancing and speed, perhaps?
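The hash-and-mod alternative mentioned above can be sketched as follows. As an assumption for illustration, `hashlib.blake2b` from the Python standard library stands in for Murmurhash or xxhash, and the helper names are hypothetical; the point is only that any well-mixing hash of the canonical k-mer, taken modulo the number of buckets, gives nearly even bucket loads.

```python
import hashlib
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def bucket_of(kmer: str, num_buckets: int) -> int:
    # blake2b is a stand-in for a fast non-cryptographic hash.
    digest = hashlib.blake2b(canonical(kmer).encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets


def bucket_loads(k: int, num_buckets: int = 16) -> Counter:
    """Number of canonical DNA k-mers falling into each bucket."""
    canon = {canonical("".join(p)) for p in product("ACGT", repeat=k)}
    return Counter(bucket_of(c, num_buckets) for c in canon)


loads = bucket_loads(6)
print(min(loads.values()), max(loads.values()))
```

Unlike a bijective encoding, this gives no guarantee of exactly equal bucket sizes, and two different k-mers may receive the same hash value.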

Lastly, I find the choice of the term "MPHF" a bit misleading. A MPHF is a data
structure implementing a bijection for the keys of a set S, which usually is a
subset of another, larger universe U. Clearly, S can be U but usually it is not
and the challenge is to design a data structure implementing the bijection
knowing that we will only map a subset of the keys of U.
(There is a large body of research on compressed MPHFs -- see, e.g. BBHash, FHC, CHD,
Recsplit, and the more recent PTHash and LPHash.)
My point is that the presented method does not work for a subset of canonical k-mers,
but for the whole universe C_k^{sigma}.
Mine is not a criticism -- it is stated that the work focuses on hashing the
entire universe (top of page 2) -- but I'm afraid most people have a different
notion of minimal perfect hash function in mind and would expect to be able
to build a MPHF for any wanted set of keys.
Therefore, I would suggest to use the term "bijective encoding" which seems more
appropriate to me.

Minor:

- There are several sentences terminating with a formula, for example: there
are two on page 4 (or on page 7).
Since the formula is part of a sentence, a point should grammatically terminate
the sentence. If you do not want a '.' after a formula, just rephrase in a way
that the formula does not appear at the end of the sentence. That's what I would do.

- End of page 4: "[0,|C_k|-1]" --> "[0,|C_k^{\sigma}|-1]" ?

- Page 5: notation "K(l,r)" hasn't been properly introduced. I guess "l" is a row index
and "r" is a column index into the table K.
Also T_{\sigma-1}...T_1 do not seem to have been introduced (but perhaps I missed the
point where they have!).

- Section 4.1: "|K_{16}^4|" and "|R_{16}^4|" should be "|K_{15}^4|" and "|R_{15}^4|"
since k=15?

https://doi.org/10.24072/pci.mcb.100188.rev11

Reviewed by anonymous reviewer 1, 19 Apr 2023

The paper proposes a function named enc_c that maps a sequence of length k over an arbitrary alphabet of size A to an integer range of size 1/2 (A^k + A^ceil(k/2)). The function is bijective. The size of the encoding range is lower than A^k because enc_c encodes only canonical kmers. (When adapted to DNA-like alphabets, taking into account the complement of characters, the function is called enc_c^r.)
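The size 1/2 (A^k + A^ceil(k/2)) is the number of k-mers that are not larger than their reverse, and it can be checked by enumeration for small parameters. The sketch below does not implement enc_c itself; it only compares a brute-force count with the closed form, using hypothetical function names and the integers 0..A-1 as an assumed ordered alphabet.

```python
from itertools import product


def count_canonical_reversal(a: int, k: int) -> int:
    """Count k-mers over {0, ..., a-1} that are <= their own reverse."""
    return sum(1 for p in product(range(a), repeat=k) if p <= p[::-1])


def closed_form(a: int, k: int) -> int:
    """(a^k + a^ceil(k/2)) / 2; the second term counts the palindromes."""
    return (a ** k + a ** ((k + 1) // 2)) // 2


for a in range(2, 6):
    for k in range(1, 7):
        assert count_canonical_reversal(a, k) == closed_form(a, k)
```

The second term arises because each of the A^ceil(k/2) palindromes is its own reverse and therefore forms a class of size one, while all other k-mers pair up with their reverse.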

A simple implementation is available on GitLab: https://gitlab.ub.uni-bielefeld.de/gi/mphfcan/. The code is perfectly clear and the README file provides the necessary information

Although the paper could be improved with more examples or figures, it is on the whole easy to follow and understand. Nevertheless, I'd say that the formalism may appear a bit heavy.

A major advantage offered by this approach is that it offers a uniform distribution of canonical kmer encodings, while this is not the case when considering the usually applied canonical definition based on alphabetic order (in which kmers starting with AAA are much more often canonical than those starting with TTT, for instance).
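The skew of the lexicographic definition can be seen by grouping canonical k-mers by their first character. This is an illustrative sketch for the DNA alphabet under reverse complementation with the assumed order A < C < G < T; the function name is hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_by_first_char(k: int) -> dict:
    """Count canonical DNA k-mers (kmer <= reverse complement) per first letter."""
    counts = Counter()
    for p in product("ACGT", repeat=k):
        kmer = "".join(p)
        if kmer <= kmer.translate(COMPLEMENT)[::-1]:
            counts[kmer[0]] += 1
    return dict(counts)


print(canonical_by_first_char(3))
```

For k = 3, 14 of the 16 k-mers starting with A are canonical, but only 2 of the 16 starting with T, so under the natural encoding the canonical k-mers cluster at the low end of the interval.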

I have several major remarks and questions that make me uncomfortable accepting this work for publication in PCI.

Terminology.
The title and the whole text use the term "MPHF". I agree that enc_c and enc_c^r are, technically speaking, MPHFs. However, they apply only to the full universe of ALL words of fixed size k over a given alphabet, so the encodable set is of size A^k. It is not possible to encode a set of, say, a million kmers of size 31 on the ACGT alphabet.
This does not devalue the work, but I really think that you should modify the name, which could lead disappointed readers to stop reading at the end of the abstract.
I would suggest something like "general kmer encoding".

Concrete usage.
Certainly I missed something. I tried to imagine practical use cases in which enc_c[^r] would be beneficial, but I failed.
In practice (on the DNA alphabet), when I want to build an MPHF on a set of kmers I use the following steps: 1/ transform canonical (based on lexicographic order) ASCII kmers to binary kmers using the "enc" function you describe in the manuscript, 2/ create an MPHF on the encoded set. Note that during step 1, the distribution of kmers is uneven, as many canonical kmers will appear among the smaller binary values, but this has no impact on the MPHF construction in step 2.

Using enc_c, step 1 would generate a uniform distribution of binary values (on a smaller space, roughly of size 1/2 A^k), but the final step would end up with the same result.
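The two-step workflow described above can be sketched in Python. As a loudly stated assumption, a plain dictionary that ranks the sorted keys stands in for a real MPHF library (it is minimal and perfect on the given set, but not compact), and `encode` is the natural base-4 encoding rather than the manuscript's enc_c; all names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
RANK = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed character order


def canonical(kmer: str) -> str:
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


def encode(kmer: str) -> int:
    """Step 1: natural base-4 encoding of the canonical k-mer."""
    value = 0
    for ch in canonical(kmer):
        value = value * 4 + RANK[ch]
    return value


def build_toy_mphf(keys) -> dict:
    """Step 2 stand-in: map each distinct key to a slot in [0, n-1]."""
    return {key: slot for slot, key in enumerate(sorted(set(keys)))}


kmers = ["ACGTAC", "GTACGT", "TTTTTT", "AAAAAA", "CCGGAA"]
mphf = build_toy_mphf(encode(x) for x in kmers)
print(len(mphf), mphf[encode("GTACGT")] == mphf[encode("ACGTAC")])
```

Whatever injective encoding is used in step 1, step 2 assigns the n distinct keys to the slots 0..n-1, which is the reviewer's point that the unevenness of the intermediate values does not affect the final table.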

Evaluation
The evaluation somehow reflects the previous remark. The only result is about the distribution of encoded kmers to 16 buckets. Why 16? And, more importantly, it would be nice to also compare with a bijective hash function (such as the reversible xorshift hash function provided by Heng Li, https://github.com/lh3/bfc/blob/master/kmer.h).

So what are the concrete advantages? Sorry if I missed this, but the message is not clear to me.

More minor remarks
* Fig1: you may give a few named squares (eg first one is AAAAA, last one from first line is AAATTA if I'm correct, ...)
* Some definitions are generic while they differ when applied to complemented sequences or not (eg canonical kmer definition in the preliminaries, number of palindromes, ...)
* C_k (end of page 4) is undefined
* Preliminaries: specify that sequences start at index 1 in your framework
* Page 5: avoid using the name 'K' for the function rank as this notation is already used.
* I did not understand the triangle numbers from which K is "derived" -> also what do you exactly mean by "derived"?
* Did you implement the rolling approach you propose for computing enc_c^r of successive kmers on DNA sequences?

https://doi.org/10.24072/pci.mcb.100188.rev12
