Minimal encodings of canonical k-mers for general alphabets and even k-mer sizes
General encoding of canonical k-mers
Recommendation: posted 18 September 2023, validated 18 September 2023
As part of many bioinformatics tools, one encodes a k-mer, which is a string, into an integer. The natural encoding uses a bijective function to map the k-mers onto the interval [0, s^k - ], where s is the alphabet size. This encoding is minimal, in the sense that the encoded integer ranges from 0 to the number of represented k-mers minus 1.
However, often one is only interested in encoding canonical k-mers. One common definition is that a k-mer is canonical if it is lexicographically not larger than its reverse complement. In this case, only about half the k-mers from the universe of k-mers are canonical, and the natural encoding is no longer minimal. For the special case of a DNA alphabet and odd k, there exists a "parity-based" encoding for canonical k-mers which is minimal.
In , the author presents a minimal encoding for canonical k-mers that works for general alphabets and both odd and even k. They also give an efficient bit-based representation for the DNA alphabet.
This paper fills a theoretically interesting and often overlooked gap in how to encode k-mers as integers. It is not yet clear what practical applications this encoding will have, as the author readily acknowledges in the manuscript. Neither the author nor the reviewers are aware of any practical situations where the lack of a minimal encoding "leads to serious limitations." However, even in an applied field like bioinformatics, it would be short-sighted to only value theoretical work that has an immediate application; often, the application is several hops away and not apparent at the time of the original work.
In fact, I would speculate that there may be significant benefits reaped if there was more theoretical attention paid to the fact that k-mers are often restricted to be canonical. Many papers in the field sweep under the rug the fact that k-mers are made canonical, leaving it as an implementation detail. This may indicate that the theory to describe and analyze this situation is underdeveloped. This paper makes a step forward to develop this theory, and I am hopeful that it may lead to substantial practical impact in the future.
 Roland Wittler (2023) "General encoding of canonical k-mers. bioRxiv, ver.2, peer-reviewed and recommended by Peer Community in Mathematical and Computational Biology https://doi.org/10.1101/2023.03.09.531845
Paul Medvedev (2023) Minimal encodings of canonical k-mers for general alphabets and even k-mer sizes. Peer Community in Mathematical and Computational Biology, 100188. 10.24072/pci.mcb.100188
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.
The authors declare that they have received no specific funding for this study
Reviewed by anonymous reviewer 2, 17 Sep 2023
Evaluation round #1
DOI or URL of the preprint: https://doi.org/10.1101/2023.03.09.531845
Version of the preprint: 1
Author's Reply, 28 Aug 2023
Decision by Paul Medvedev, posted 17 May 2023, validated 18 May 2023
Dear Dr. Roland Wittler,
Your submission has now been reviewed by two reviewers. They found your preprint to be of interest; however, they raised a number of concerns. Two concerns in particular were raised by both reviewers. The first is the use of the term MPHF, which they found misleading. The second is the lack of convincing applications / usefuleness of your encoding.
I am therefore asking for a revision which addresses the reviewer's concerns. Though all concerns are important, I believe that demonstrating usefuleness will be critical for me to recommend the preprint.