TY - JOUR
T1 - Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
AU - Pell, Jason
AU - Hintze, Arend
AU - Canino-Koning, Rosangela
AU - Howe, Adina
AU - Tiedje, James M.
AU - Brown, Charles
PY - 2012/8/14
Y1 - 2012/8/14
N2 - Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
AB - Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
KW - Compression
KW - Metagenomics
UR - http://www.scopus.com/inward/record.url?scp=84865176493&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84865176493&partnerID=8YFLogxK
U2 - 10.1073/pnas.1121464109
DO - 10.1073/pnas.1121464109
M3 - Article
C2 - 22847406
AN - SCOPUS:84865176493
VL - 109
SP - 13272
EP - 13277
JO - Proceedings of the National Academy of Sciences of the United States of America
JF - Proceedings of the National Academy of Sciences of the United States of America
SN - 0027-8424
IS - 33
ER -