PCI Math Comp Biol

188

Title *

General encoding of canonical *k*-mers

Authors *

Roland Wittler

Year *

2023

Abstract *

To index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are encoded as integers, e.g., applying some bijective mapping between all possible σ^k k-mers and the interval [0,σ^k-1], where σ is the alphabet size. In many applications, e.g., when the reading direction of a DNA-sequence is ambiguous, canonical k-mers are considered, i.e., the lexicographically smaller of a given k-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k-mers are not evenly distributed within the interval [0,σ^k-1]. We present a minimal encoding of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0,σ^k/2-1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space and time efficient bit-based implementation for the DNA alphabet.
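The naive setting described in the abstract can be illustrated with a minimal sketch. It assumes the DNA alphabet (σ = 4), a base-σ rank encoding, and canonicalization under reverse complementation; all function names are hypothetical. It shows only the naive encoding, not the minimal encoding the preprint proposes.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet, sigma = 4
RANK = {c: i for i, c in enumerate(ALPHABET)}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode(kmer):
    """Naive bijective encoding: read the k-mer as a base-sigma number."""
    code = 0
    for c in kmer:
        code = code * len(ALPHABET) + RANK[c]
    return code


def reverse_complement(kmer):
    """Reverse the k-mer and complement every character."""
    return "".join(COMPLEMENT[c] for c in reversed(kmer))


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_codes(k):
    """Sorted naive codes of all canonical k-mers of length k."""
    kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return sorted({encode(canonical(kmer)) for kmer in kmers})


if __name__ == "__main__":
    # The codes of canonical 2-mers leave gaps inside [0, 15].
    print(canonical_codes(2))
```

For k = 2 the canonical codes are 0 to 6, 8, 9 and 12, so the naive encoding leaves gaps in the interval, which is the uneven distribution the abstract refers to.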

Indicate the full web address (DOI, SWHID or URL) giving public access to these codes *

https://gitlab.ub.uni-bielefeld.de/gi/MinEncCanKmer/

Keywords (optional)

canonical k-mers, k-mers, q-grams, encoding

Methods that require specific expertise (optional)

None

Thematic fields *

Combinatorics, Computational complexity, Genomics and Transcriptomics

Suggested reviewers - Suggest up to 10 reviewers (provide names and Email addresses). (Optional)

Pierre Peterlongo suggested: Maybe Riccardo Vicedomini could be interested. Riccardo Vicedomini <riccardo.vicedomini@irisa.fr>

Submission date

2023-03-13 17:01:37

Recommender

Paul Medvedev

Reviewers

Anonymous
