Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes

Lisa K. Johnson, Harriet Alexander, Charles Brown

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

BACKGROUND: De novo transcriptome assemblies are required prior to analyzing RNA sequencing data from a species without an existing reference genome or transcriptome. Despite the prevalence of transcriptomic studies, the effects of using different workflows, or "pipelines," on the resulting assemblies are poorly understood. Here, a pipeline was programmatically automated and used to assemble and annotate raw transcriptomic short-read data collected as part of the Marine Microbial Eukaryotic Transcriptome Sequencing Project. The resulting transcriptome assemblies were evaluated and compared against assemblies that were previously generated with a different pipeline developed by the National Center for Genome Research.

RESULTS: New transcriptome assemblies contained the majority of previous contigs as well as new content. On average, 7.8% of the annotated contigs in the new assemblies were novel gene names not found in the previous assemblies. Taxonomic trends were observed in the assembly metrics. Assemblies from the Dinoflagellata showed a higher number of contigs and unique k-mers than transcriptomes from other phyla, while assemblies from Ciliophora had a lower percentage of open reading frames compared to other phyla.

CONCLUSIONS: Given current bioinformatics approaches, there is no single "best" reference transcriptome for a particular set of raw data. As the optimum transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable pipelines are invaluable for managing the computationally intensive tasks required for re-processing large sets of samples with revised pipelines and ensuring a common evaluation workflow is applied to all samples. Thus, re-assembling existing data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific trends across samples in addition to novel and useful products for the community.

Original language: English (US)
Journal: GigaScience
Volume: 8
Issue number: 4
DOI: 10.1093/gigascience/giy158
State: Published - Apr 1 2019

Fingerprint

Transcriptome
Pipelines
Workflow
Genes
Ciliophora
Genome
RNA Sequence Analysis
Computational Biology
Open Reading Frames
Names
Bioinformatics
RNA
Cross-Sectional Studies
Research
Processing

Keywords

  • automated pipeline
  • marine microbial eukaryote
  • re-analysis
  • transcriptome assembly

ASJC Scopus subject areas

  • Health Informatics
  • Computer Science Applications

Cite this

Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. / Johnson, Lisa K.; Alexander, Harriet; Brown, Charles.

In: GigaScience, Vol. 8, No. 4, 01.04.2019.

@article{cecf59b5a069449e98f890a540c51da5,
title = "Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes",
abstract = "BACKGROUND: De novo transcriptome assemblies are required prior to analyzing RNA sequencing data from a species without an existing reference genome or transcriptome. Despite the prevalence of transcriptomic studies, the effects of using different workflows, or {"}pipelines,{"} on the resulting assemblies are poorly understood. Here, a pipeline was programmatically automated and used to assemble and annotate raw transcriptomic short-read data collected as part of the Marine Microbial Eukaryotic Transcriptome Sequencing Project. The resulting transcriptome assemblies were evaluated and compared against assemblies that were previously generated with a different pipeline developed by the National Center for Genome Research. RESULTS: New transcriptome assemblies contained the majority of previous contigs as well as new content. On average, 7.8{\%} of the annotated contigs in the new assemblies were novel gene names not found in the previous assemblies. Taxonomic trends were observed in the assembly metrics. Assemblies from the Dinoflagellata showed a higher number of contigs and unique k-mers than transcriptomes from other phyla, while assemblies from Ciliophora had a lower percentage of open reading frames compared to other phyla. CONCLUSIONS: Given current bioinformatics approaches, there is no single {"}best{"} reference transcriptome for a particular set of raw data. As the optimum transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable pipelines are invaluable for managing the computationally intensive tasks required for re-processing large sets of samples with revised pipelines and ensuring a common evaluation workflow is applied to all samples. Thus, re-assembling existing data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific trends across samples in addition to novel and useful products for the community.",
keywords = "automated pipeline, marine microbial eukaryote, re-analysis, transcriptome assembly",
author = "Johnson, {Lisa K.} and Harriet Alexander and Charles Brown",
year = "2019",
month = "4",
day = "1",
doi = "10.1093/gigascience/giy158",
language = "English (US)",
volume = "8",
journal = "GigaScience",
issn = "2047-217X",
publisher = "BioMed Central",
number = "4",

}

TY - JOUR

T1 - Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes

AU - Johnson, Lisa K.

AU - Alexander, Harriet

AU - Brown, Charles

PY - 2019/4/1

Y1 - 2019/4/1

N2 - BACKGROUND: De novo transcriptome assemblies are required prior to analyzing RNA sequencing data from a species without an existing reference genome or transcriptome. Despite the prevalence of transcriptomic studies, the effects of using different workflows, or "pipelines," on the resulting assemblies are poorly understood. Here, a pipeline was programmatically automated and used to assemble and annotate raw transcriptomic short-read data collected as part of the Marine Microbial Eukaryotic Transcriptome Sequencing Project. The resulting transcriptome assemblies were evaluated and compared against assemblies that were previously generated with a different pipeline developed by the National Center for Genome Research. RESULTS: New transcriptome assemblies contained the majority of previous contigs as well as new content. On average, 7.8% of the annotated contigs in the new assemblies were novel gene names not found in the previous assemblies. Taxonomic trends were observed in the assembly metrics. Assemblies from the Dinoflagellata showed a higher number of contigs and unique k-mers than transcriptomes from other phyla, while assemblies from Ciliophora had a lower percentage of open reading frames compared to other phyla. CONCLUSIONS: Given current bioinformatics approaches, there is no single "best" reference transcriptome for a particular set of raw data. As the optimum transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable pipelines are invaluable for managing the computationally intensive tasks required for re-processing large sets of samples with revised pipelines and ensuring a common evaluation workflow is applied to all samples. Thus, re-assembling existing data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific trends across samples in addition to novel and useful products for the community.

KW - automated pipeline

KW - marine microbial eukaryote

KW - re-analysis

KW - transcriptome assembly

UR - http://www.scopus.com/inward/record.url?scp=85062710247&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85062710247&partnerID=8YFLogxK

U2 - 10.1093/gigascience/giy158

DO - 10.1093/gigascience/giy158

M3 - Article

C2 - 30544207

AN - SCOPUS:85062710247

VL - 8

JO - GigaScience

JF - GigaScience

SN - 2047-217X

IS - 4

ER -