An Integrated Pipeline for de Novo Assembly of Microbial Genomes

Andrew Tritt, Jonathan A Eisen, Marc T. Facciotti, Aaron E. Darling

Research output: Contribution to journal > Article

292 Citations (Scopus)

Abstract

Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron's Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning that SOAPdenovo requires. In particular, the assemblies produced by A5 exhibit a reduction of 50% or more in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics from those of mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.
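
Two quantities in the abstract can be made concrete with simple arithmetic: depth of coverage, and the relative reduction in broken protein coding sequences. The sketch below is illustrative only. It uses the common estimate of coverage as total sequenced bases divided by genome length; the function names and the example numbers are assumptions for illustration and do not come from the paper or from the A5 software.

```python
def depth_of_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Estimate average depth of coverage as total bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


def relative_reduction(baseline: int, improved: int) -> float:
    """Fraction by which `improved` is lower than `baseline` (0.5 means 50%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical run: 2 million reads of 100 bp over a 5 Mbp bacterial genome.
    print(depth_of_coverage(2_000_000, 100, 5_000_000))
    # Hypothetical counts of broken coding sequences in two assemblies.
    print(relative_reduction(40, 20))
```

With these made-up inputs, the first call corresponds to roughly 40-fold coverage and the second to a 50% reduction, the threshold the abstract cites.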

Original language: English (US)
Article number: e42304
Journal: PLoS One
Volume: 7
Issue number: 9
DOI: 10.1371/journal.pone.0042304
State: Published - Sep 13 2012

Fingerprint

Microbial Genome
Pipelines
Genes
Genome
genome assembly
Quality Control
Libraries
computer hardware
Bacterial Genomes
application coverage
DNA Sequence Analysis
transposons
cleaning
Software
sequence analysis
Technology
Laptop computers

ASJC Scopus subject areas

  • Agricultural and Biological Sciences(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Medicine(all)

Cite this

An Integrated Pipeline for de Novo Assembly of Microbial Genomes. / Tritt, Andrew; Eisen, Jonathan A; Facciotti, Marc T.; Darling, Aaron E.

In: PLoS One, Vol. 7, No. 9, e42304, 13.09.2012.

Tritt, Andrew ; Eisen, Jonathan A ; Facciotti, Marc T. ; Darling, Aaron E. / An Integrated Pipeline for de Novo Assembly of Microbial Genomes. In: PLoS One. 2012 ; Vol. 7, No. 9.
@article{a4d9e61270574a4083e1e292f68670cb,
title = "An Integrated Pipeline for de Novo Assembly of Microbial Genomes",
author = "Tritt, Andrew and Eisen, {Jonathan A} and Facciotti, {Marc T.} and Darling, {Aaron E.}",
year = "2012",
month = "9",
day = "13",
doi = "10.1371/journal.pone.0042304",
language = "English (US)",
volume = "7",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "9",

}

TY - JOUR

T1 - An Integrated Pipeline for de Novo Assembly of Microbial Genomes

AU - Tritt, Andrew

AU - Eisen, Jonathan A

AU - Facciotti, Marc T.

AU - Darling, Aaron E.

PY - 2012/9/13

Y1 - 2012/9/13

UR - http://www.scopus.com/inward/record.url?scp=84866391549&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84866391549&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0042304

DO - 10.1371/journal.pone.0042304

M3 - Article

C2 - 23028432

AN - SCOPUS:84866391549

VL - 7

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 9

M1 - e42304

ER -