Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, Charles Brown

Research output: Contribution to journal › Article

135 Scopus citations

Abstract

Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
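The idea in the abstract, storing only k-mers in a Bloom filter and leaving the graph edges implicit, can be illustrated with a minimal sketch. This is a generic textbook Bloom filter, not the authors' implementation: the class `KmerBloomFilter`, the helpers `kmers`, `successors`, and `expected_false_positive_rate`, the choice of BLAKE2b as the hash, the forward-strand-only handling, and the toy read are all assumptions made for illustration.

```python
import hashlib
import math


class KmerBloomFilter:
    """Generic Bloom filter over k-mers (illustrative; not the paper's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # May report a k-mer that was never added (false positive),
        # but never misses one that was added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer)
        )


def kmers(seq, k):
    """All overlapping substrings of length k (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def successors(bloom, kmer):
    """Edges are implicit: test the four possible one-base extensions."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


def expected_false_positive_rate(bits_per_item, num_hashes):
    """Standard Bloom filter approximation (1 - exp(-h / b)) ** h."""
    return (1.0 - math.exp(-num_hashes / bits_per_item)) ** num_hashes


K = 5
read = "ACGTTGCATGGA"  # hypothetical toy read
bloom = KmerBloomFilter(num_bits=1024, num_hashes=2)
for kmer in kmers(read, K):
    bloom.add(kmer)

# Walk one step in the implicit graph from the first k-mer of the read.
next_kmers = successors(bloom, read[:K])
```

Because a Bloom filter can return false positives, `successors` may occasionally report an extension that is not in the data; this is the sense in which the graph is stored inexactly. Under the standard approximation used in `expected_false_positive_rate`, 4 bits per k-mer with 3 hash functions gives a false positive rate of roughly 15%; that figure comes from the textbook formula, not from the source.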

Original language: English (US)
Pages (from-to): 13272-13277
Number of pages: 6
Journal: Proceedings of the National Academy of Sciences of the United States of America
Volume: 109
Issue number: 33
State: Published - Aug 14 2012
Externally published: Yes

Keywords

  • Compression
  • Metagenomics

ASJC Scopus subject areas

  • General
