These are not the K-mers you are looking for: Efficient online K-mer counting using a probabilistic data structure

Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, Charles Brown

Research output: Contribution to journal (Article)

44 Citations (Scopus)

Abstract

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets, this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
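The trade-off the abstract describes (online updates, counts that can only be too high, and no stored k-mers) can be illustrated with a minimal Count-Min Sketch in Python. This is a sketch under stated assumptions, not khmer's implementation or API: the names `CountMinSketch`, `kmers`, and `count_kmers`, the salted BLAKE2b hashing, and the default `width` and `depth` are all hypothetical choices made for illustration, and reverse-complement handling is omitted.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch (hypothetical; not khmer's code)."""

    def __init__(self, width, depth):
        self.width = width  # counters per row
        self.depth = depth  # number of rows, one hash function each
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # Derive one hash per row by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        # Online update: bump one counter in every row.
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions only ever inflate a counter, so the minimum across
        # rows is the tightest available estimate and never undercounts.
        return min(self.tables[row][col] for row, col in self._slots(item))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_kmers(reads, k, width=1024, depth=4):
    """Stream reads into a sketch; only counters are kept, not k-mers."""
    sketch = CountMinSketch(width, depth)
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return sketch


if __name__ == "__main__":
    sketch = count_kmers(["ACGTACGT"], k=4)
    print(sketch.count("ACGT"))  # at least 2, the true count
```

Because the sketch holds only counters, it can answer "how often was this k-mer seen?" but cannot list the k-mers it has seen. Shrinking `width` raises the collision rate and therefore the overcount; in the degenerate case of a single counter, every query returns the total number of k-mers inserted.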

Original language: English (US)
Article number: e101271
Journal: PLoS One
Volume: 9
Issue number: 7
DOI: 10.1371/journal.pone.0101271
State: Published - Jul 25 2014
Externally published: Yes

ASJC Scopus subject areas

  • Agricultural and Biological Sciences (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Medicine (all)

Cite this

These are not the K-mers you are looking for: Efficient online K-mer counting using a probabilistic data structure. / Zhang, Qingpeng; Pell, Jason; Canino-Koning, Rosangela; Howe, Adina Chuang; Brown, Charles.

In: PLoS One, Vol. 9, No. 7, e101271, 25.07.2014.

@article{d13c0451752f4c10965542ea0050c74e,
title = "These are not the K-mers you are looking for: Efficient online K-mer counting using a probabilistic data structure",
abstract = "K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets, this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.",
author = "Qingpeng Zhang and Jason Pell and Rosangela Canino-Koning and Howe, {Adina Chuang} and Charles Brown",
year = "2014",
month = "7",
day = "25",
doi = "10.1371/journal.pone.0101271",
language = "English (US)",
volume = "9",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "7",
}

TY  - JOUR
T1  - These are not the K-mers you are looking for
T2  - Efficient online K-mer counting using a probabilistic data structure
AU  - Zhang, Qingpeng
AU  - Pell, Jason
AU  - Canino-Koning, Rosangela
AU  - Howe, Adina Chuang
AU  - Brown, Charles
PY  - 2014/7/25
Y1  - 2014/7/25
AB  - K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets, this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
UR  - http://www.scopus.com/inward/record.url?scp=84904876565&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84904876565&partnerID=8YFLogxK
U2  - 10.1371/journal.pone.0101271
DO  - 10.1371/journal.pone.0101271
M3  - Article
C2  - 25062443
AN  - SCOPUS:84904876565
VL  - 9
JO  - PLoS One
JF  - PLoS One
SN  - 1932-6203
IS  - 7
M1  - e101271
ER  -