Abstract
Background: Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hashtable, can efficiently store k-mers with a skewed distribution. Result: Here, we present the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading speed of insertions and queries for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are ~ tenfold faster than BBhash, with no need to re-analyze the original data for further insertions or deletions. Conclusions: The MQF is a flexible and efficient data structure that extends our ability to work with high throughput sequencing data.
Original language | English (US) |
---|---|
Article number | 71 |
Journal | BMC Bioinformatics |
Volume | 22 |
Issue number | 1 |
DOIs | |
State | Published - Dec 2021 |
Externally published | Yes |
Keywords
- Compact hash tables
- Debruijn graphs
- Inexact data structures
- k-mers
- NGS
ASJC Scopus subject areas
- Structural Biology
- Biochemistry
- Molecular Biology
- Computer Science Applications
- Applied Mathematics