Roland WittlerPlease use the format "First name initials family name" as in "Marie S. Curie, Niels H. D. Bohr, Albert Einstein, John R. R. Tolkien, Donna T. Strickland"
<p style="text-align: justify;">To index or compare sequences efficiently, often <em>k</em>-mers, i.e., substrings of fixed length <em>k</em>, are used. For efficient indexing or storage, <em>k</em>-mers are encoded as integers, e.g., applying some bijective mapping between all possible σ^<em>k</em> <em>k</em>-mers and the interval [0,σ^<em>k</em>-1], where σ is the alphabet size.</p>
<p style="text-align: justify;">In many applications, e.g., when the reading direction of a DNA-sequence is ambiguous, <em>canonical</em> <em>k</em>-mers are considered, i.e., the lexicographically smaller of a given <em>k</em>-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical <em>k</em>-mers are not evenly distributed within the interval [0,σ^<em>k</em>-1].</p>
<p style="text-align: justify;">We present a minimal encoding of canonical <em>k</em>-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0,σ^<em>k</em>/2-1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space and time efficient bit-based implementation for the DNA alphabet.</p>
canonical k-mers, k-mers, q-grams, encoding