Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, Charles Brown

Research output: Contribution to journal › Article

128 Citations (Scopus)

Abstract

Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
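
The idea of storing k-mers in a Bloom filter and recovering graph edges by querying candidate neighbours can be illustrated with a minimal sketch. This is not the authors' implementation: the class `KmerBloomFilter`, the helpers `kmers`, `neighbors`, and `false_positive_rate`, the choice of BLAKE2b as the hash, and the example sequence are all assumptions made for illustration, and reverse complements are ignored for brevity. The last helper is the standard Bloom filter approximation (1 - e^(-h/b))^h for h hash functions and b bits per stored element.

```python
import hashlib
import math


class KmerBloomFilter:
    """Toy Bloom filter over k-mer strings (illustrative, not the paper's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive one bit position per hash function by salting BLAKE2b.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=seed.to_bytes(2, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer)
        )


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbors(bf, kmer):
    """Return the candidate neighbours (k-1 overlap) that the filter reports present."""
    found = []
    for base in "ACGT":
        for cand in (kmer[1:] + base, base + kmer[:-1]):
            if cand in bf:
                found.append(cand)
    return found


def false_positive_rate(bits_per_kmer, num_hashes):
    """Standard Bloom filter approximation of the false positive rate."""
    return (1 - math.exp(-num_hashes / bits_per_kmer)) ** num_hashes


# Load the k-mers of a short made-up read into a generously sized filter.
bf = KmerBloomFilter(num_bits=1 << 16, num_hashes=3)
for km in kmers("ACGTTGCA", 4):
    bf.add(km)

print(neighbors(bf, "ACGT"))
print(false_positive_rate(bits_per_kmer=4, num_hashes=3))
```

Because the graph is implicit, no edges are stored: connectivity is discovered by membership queries alone, so false positives can add spurious nodes or edges but inserted k-mers are never lost. Partitioning into components would then amount to a traversal that repeatedly calls a function like `neighbors`.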

Original language: English (US)
Pages (from-to): 13272-13277
Number of pages: 6
Journal: Proceedings of the National Academy of Sciences of the United States of America
Volume: 109
Issue number: 33
DOI: 10.1073/pnas.1121464109
State: Published - Aug 14 2012
Externally published: Yes

Fingerprint

Metagenome
Metagenomics
High-Throughput Nucleotide Sequencing
Firearms
Ecosystem
Soil
DNA
Population

Keywords

  • Compression
  • Metagenomics

ASJC Scopus subject areas

  • General

Cite this

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. / Pell, Jason; Hintze, Arend; Canino-Koning, Rosangela; Howe, Adina; Tiedje, James M.; Brown, Charles.

In: Proceedings of the National Academy of Sciences of the United States of America, Vol. 109, No. 33, 14.08.2012, p. 13272-13277.


Pell, Jason ; Hintze, Arend ; Canino-Koning, Rosangela ; Howe, Adina ; Tiedje, James M. ; Brown, Charles. / Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. In: Proceedings of the National Academy of Sciences of the United States of America. 2012 ; Vol. 109, No. 33. pp. 13272-13277.
@article{bfb77d6b52bf4c79a76639ffe3d112f0,
title = "Scaling metagenome sequence assembly with probabilistic de Bruijn graphs",
abstract = "Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.",
keywords = "Compression, Metagenomics",
author = "Jason Pell and Arend Hintze and Rosangela Canino-Koning and Adina Howe and Tiedje, {James M.} and Charles Brown",
year = "2012",
month = "8",
day = "14",
doi = "10.1073/pnas.1121464109",
language = "English (US)",
volume = "109",
pages = "13272--13277",
journal = "Proceedings of the National Academy of Sciences of the United States of America",
issn = "0027-8424",
number = "33",

}

TY - JOUR

T1 - Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

AU - Pell, Jason

AU - Hintze, Arend

AU - Canino-Koning, Rosangela

AU - Howe, Adina

AU - Tiedje, James M.

AU - Brown, Charles

PY - 2012/8/14

Y1 - 2012/8/14

N2 - Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.

AB - Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.

KW - Compression

KW - Metagenomics

UR - http://www.scopus.com/inward/record.url?scp=84865176493&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84865176493&partnerID=8YFLogxK

U2 - 10.1073/pnas.1121464109

DO - 10.1073/pnas.1121464109

M3 - Article

VL - 109

SP - 13272

EP - 13277

JO - Proceedings of the National Academy of Sciences of the United States of America

JF - Proceedings of the National Academy of Sciences of the United States of America

SN - 0027-8424

IS - 33

ER -