

General encoding of canonical *k*-mers
Roland Wittler
<p style="text-align: justify;">To index or compare sequences efficiently, <em>k</em>-mers, i.e., substrings of fixed length <em>k</em>, are often used. For efficient indexing or storage, <em>k</em>-mers are encoded as integers, e.g., by applying a bijective mapping between all σ^<em>k</em> possible <em>k</em>-mers and the interval [0,σ^<em>k</em>-1], where σ is the alphabet size.</p> <p style="text-align: justify;">In many applications, e.g., when the reading direction of a DNA sequence is ambiguous, <em>canonical</em> <em>k</em>-mers are considered, i.e., the lexicographically smaller of a given <em>k</em>-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical <em>k</em>-mers are not evenly distributed within the interval [0,σ^<em>k</em>-1].</p> <p style="text-align: justify;">We present a minimal encoding of canonical <em>k</em>-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0,σ^<em>k</em>/2-1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space- and time-efficient bit-based implementation for the DNA alphabet.</p>
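To make the starting point of the abstract concrete, the following is a minimal sketch of a naive integer encoding combined with canonicalization under reverse complementation. It is not the minimal encoding presented in the preprint. The DNA alphabet order A < C < G < T and all function names (`encode`, `reverse_complement`, `canonical`, `canonical_codes`) are assumptions chosen for illustration.

```python
from itertools import product

# Assumed DNA alphabet in lexicographic order, so sigma = 4.
ALPHABET = "ACGT"
RANK = {c: i for i, c in enumerate(ALPHABET)}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode(kmer):
    """Naive bijection: read the k-mer as a base-sigma number in [0, sigma^k - 1]."""
    code = 0
    for c in kmer:
        code = code * len(ALPHABET) + RANK[c]
    return code


def reverse_complement(kmer):
    """Reverse the k-mer and complement every character."""
    return "".join(COMPLEMENT[c] for c in reversed(kmer))


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_codes(k):
    """Sorted naive codes of all distinct canonical k-mers of length k."""
    kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return sorted({encode(canonical(kmer)) for kmer in kmers})


codes = canonical_codes(3)
print(len(codes), codes[-1])
```

For k = 3 there are 64 k-mers but only 32 canonical ones, and under this naive encoding some of their codes lie above 31. The canonical codes therefore do not fill a contiguous interval of half the size, which is the gap a minimal encoding is meant to close.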
canonical k-mers, k-mers, q-grams, encoding
None
Combinatorics, Computational complexity, Genomics and Transcriptomics
Pierre Peterlongo suggested: Maybe Riccardo Vicedomini could be interested.
2023-03-13 17:01:37
Paul Medvedev