Using disk based index and box queries for genome sequencing error correction

Yarong Gu, Qiang Zhu, Xianying Liu, Youchao Dong, Charles Brown, Sakti Pramanik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

The vast increase in DNA sequencing capacity over the last decade has quickly turned biology into a data-intensive science. Nevertheless, current sequencers such as Illumina HiSeq have high random per-base error rates, which makes sequencing error correction an indispensable requirement for many sequence analysis applications. Most existing error correction methods demand large expensive memory space, which limits their scalability for handling large datasets. In this paper, we present a new disk based method, called DiskBQcor, for sequencing error correction. DiskBQcor stores k-mers of sequencing genome data along with their associated metadata on inexpensive disk and utilizes a disk based index tree to efficiently process special box queries to obtain relevant k-mers and their occurring frequencies. It then applies a comprehensive voting mechanism and possibly an efficient binary encoding based assembly technique to verify and correct an erroneous base in a genome sequence under various conditions. Our experiments demonstrate that the proposed method is quite promising in error verification and correction for sequencing genome data on disk.
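As a rough illustration of the idea described in the abstract, the sketch below counts k-mers from a set of reads, runs a box query that fixes every position of a k-mer except the suspect one, and lets the matching k-mers vote on the base by frequency. It is not the DiskBQcor implementation: it assumes an in-memory `Counter` in place of the disk based index tree, answers the box query by a linear scan, and uses a plain frequency vote. The function names (`count_kmers`, `box_query`, `vote_base`, `correct_base`) and the representation of a box as a list of allowed-letter sets are hypothetical choices made for this example.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def box_query(counts, box):
    """Return the k-mers whose letter at each position lies in box[position].

    `box` is a list of sets, one set of allowed letters per position.
    A real disk based index would prune the search; this is a linear scan.
    """
    return {
        kmer: n
        for kmer, n in counts.items()
        if len(kmer) == len(box)
        and all(letter in allowed for letter, allowed in zip(kmer, box))
    }


def vote_base(counts, read, pos, k):
    """Tally frequency-weighted votes for the base at read[pos]."""
    votes = Counter()
    first = max(0, pos - k + 1)
    last = min(pos, len(read) - k)
    for start in range(first, last + 1):  # every k-mer window covering pos
        window = read[start:start + k]
        offset = pos - start
        box = [{letter} for letter in window]  # fix all positions...
        box[offset] = set(ALPHABET)            # ...except the suspect one
        for kmer, n in box_query(counts, box).items():
            votes[kmer[offset]] += n
    return votes


def correct_base(counts, read, pos, k):
    """Replace read[pos] with the base that received the most votes."""
    votes = vote_base(counts, read, pos, k)
    if not votes:
        return read
    best = votes.most_common(1)[0][0]
    return read[:pos] + best + read[pos + 1:]
```

For example, with five copies of the read `ACGTACGT` and one erroneous read `ACGAACGT`, calling `correct_base(count_kmers(reads, 4), "ACGAACGT", 3, 4)` restores the majority base at position 3. A production system would also need a threshold for deciding when a base is suspect at all, which this sketch leaves out.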

Original language: English (US)
Title of host publication: Proceedings of the 8th International Conference on Bioinformatics and Computational Biology, BICOB 2016
Publisher: The International Society for Computers and Their Applications (ISCA)
Pages: 69-76
Number of pages: 8
ISBN (Electronic): 9781943436033
State: Published - Jan 1 2016
Externally published: Yes
Event: 8th International Conference on Bioinformatics and Computational Biology, BICOB 2016 - Las Vegas, United States
Duration: Apr 4 2016 → Apr 6 2016

Other

Other: 8th International Conference on Bioinformatics and Computational Biology, BICOB 2016
Country: United States
City: Las Vegas
Period: 4/4/16 → 4/6/16

Fingerprint

Error correction
Genes
Genome
Politics
Metadata
DNA Sequence Analysis
Sequence Analysis
Scalability
DNA
Data storage equipment
Experiments

Keywords

  • Algorithm
  • Box query
  • DNA sequencing
  • Error correction
  • Index tree

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computational Theory and Mathematics
  • Information Systems
  • Biomedical Engineering
  • Electrical and Electronic Engineering
  • Health Informatics

Cite this

Gu, Y., Zhu, Q., Liu, X., Dong, Y., Brown, C., & Pramanik, S. (2016). Using disk based index and box queries for genome sequencing error correction. In Proceedings of the 8th International Conference on Bioinformatics and Computational Biology, BICOB 2016 (pp. 69-76). The International Society for Computers and Their Applications (ISCA).

@inproceedings{7752ec5ee1434aafa7b8aec70b11124e,
title = "Using disk based index and box queries for genome sequencing error correction",
abstract = "The vast increase in DNA sequencing capacity over the last decade has quickly turned biology into a data-intensive science. Nevertheless, current sequencers such as Illumina HiSeq have high random per-base error rates, which makes sequencing error correction an indispensable requirement for many sequence analysis applications. Most existing error correction methods demand large expensive memory space, which limits their scalability for handling large datasets. In this paper, we present a new disk based method, called DiskBQcor, for sequencing error correction. DiskBQcor stores k-mers of sequencing genome data along with their associated metadata on inexpensive disk and utilizes a disk based index tree to efficiently process special box queries to obtain relevant k-mers and their occurring frequencies. It then applies a comprehensive voting mechanism and possibly an efficient binary encoding based assembly technique to verify and correct an erroneous base in a genome sequence under various conditions. Our experiments demonstrate that the proposed method is quite promising in error verification and correction for sequencing genome data on disk.",
keywords = "Algorithm, Box query, DNA sequencing, Error correction, Index tree",
author = "Yarong Gu and Qiang Zhu and Xianying Liu and Youchao Dong and Charles Brown and Sakti Pramanik",
year = "2016",
month = "1",
day = "1",
language = "English (US)",
pages = "69--76",
booktitle = "Proceedings of the 8th International Conference on Bioinformatics and Computational Biology, BICOB 2016",
publisher = "The International Society for Computers and Their Applications (ISCA)",

}

TY - GEN
T1 - Using disk based index and box queries for genome sequencing error correction
AU - Gu, Yarong
AU - Zhu, Qiang
AU - Liu, Xianying
AU - Dong, Youchao
AU - Brown, Charles
AU - Pramanik, Sakti
PY - 2016/1/1
Y1 - 2016/1/1
N2 - The vast increase in DNA sequencing capacity over the last decade has quickly turned biology into a data-intensive science. Nevertheless, current sequencers such as Illumina HiSeq have high random per-base error rates, which makes sequencing error correction an indispensable requirement for many sequence analysis applications. Most existing error correction methods demand large expensive memory space, which limits their scalability for handling large datasets. In this paper, we present a new disk based method, called DiskBQcor, for sequencing error correction. DiskBQcor stores k-mers of sequencing genome data along with their associated metadata on inexpensive disk and utilizes a disk based index tree to efficiently process special box queries to obtain relevant k-mers and their occurring frequencies. It then applies a comprehensive voting mechanism and possibly an efficient binary encoding based assembly technique to verify and correct an erroneous base in a genome sequence under various conditions. Our experiments demonstrate that the proposed method is quite promising in error verification and correction for sequencing genome data on disk.
KW - Algorithm
KW - Box query
KW - DNA sequencing
KW - Error correction
KW - Index tree
UR - http://www.scopus.com/inward/record.url?scp=84973641934&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84973641934&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84973641934
SP - 69
EP - 76
BT - Proceedings of the 8th International Conference on Bioinformatics and Computational Biology, BICOB 2016
PB - The International Society for Computers and Their Applications (ISCA)
ER -