VA-Store: A Virtual Approximate Store Approach to Supporting Repetitive Big Data in Genome Sequence Analyses

Xianying Liu, Qiang Zhu, Sakti Pramanik, Charles Brown, Gang Qian

Research output: Contribution to journal > Article

Abstract

In recent years, we have witnessed an increasing demand to process big data in numerous applications. It is observed that there often exist substantial amounts of repetitive data in different portions of a big data repository/dataset for applications such as genome sequence analyses. In this paper, we present a novel method, called the VA-Store, to reduce the large space requirement for repetitive data in prevailing genome sequence analysis tasks using k-mers (i.e., subsequences of length k) with multiple k values. The VA-Store maintains a physical store for one portion of the input dataset (i.e., k0-mers) and supports multiple virtual stores for other portions of the dataset (i.e., k-mers with k ≠ k0). Utilizing important relationships among repetitive data, the VA-Store transforms a given query on a virtual store into one or more queries on the physical store for execution. Both precise and approximate transformations are considered. Accuracy estimation models for approximate solutions are derived. Query optimization strategies are suggested to improve query performance. Our experiments using real and synthetic datasets demonstrate that the VA-Store is quite promising in providing effective storage and efficient query processing for solving a kernel database problem on repetitive big data for genome sequence analysis applications.
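
To illustrate the general idea of answering a query on a virtual k-mer store from a single physical k0-mer store, the following is a minimal sketch and not the transformations defined in the paper. It assumes the query is an occurrence count and uses two simple, well-known bounds: a k-mer longer than k0 cannot occur more often than its rarest constituent k0-mer, and a k-mer shorter than k0 occurs at least as often as the k0-mers that start with it. All function names (kmer_counts, upper_bound_longer, lower_bound_shorter) and the sample reads are hypothetical.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every length-k substring across all input sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def upper_bound_longer(physical, k0, query):
    """Upper bound on the count of a query with len(query) >= k0.

    Every occurrence of the query contains each of its k0-mers, so the
    smallest k0-mer count caps the query count (assumed bound, not the
    paper's estimator).
    """
    if len(query) < k0:
        raise ValueError("query must be at least k0 characters long")
    return min(physical[query[i:i + k0]]
               for i in range(len(query) - k0 + 1))


def lower_bound_shorter(physical, k0, query):
    """Lower bound on the count of a query with len(query) <= k0.

    Sums the k0-mers that begin with the query. Occurrences too close to
    the end of a sequence to start a full k0-mer are not counted.
    """
    if len(query) > k0:
        raise ValueError("query must be at most k0 characters long")
    return sum(c for kmer, c in physical.items() if kmer.startswith(query))


# Hypothetical toy input: only the 3-mer store is materialized.
reads = ["ACGTACGT", "CGTACG"]
k0 = 3
physical = kmer_counts(reads, k0)

print(upper_bound_longer(physical, k0, "ACGTA"))  # bound for a 5-mer
print(lower_bound_shorter(physical, k0, "GT"))    # bound for a 2-mer
```

In this sketch only the 3-mer counts are stored; queries for other values of k are rewritten into lookups on that one store, at the cost of returning a bound rather than an exact count in some cases.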

Original language: English (US)
Journal: IEEE Transactions on Knowledge and Data Engineering
State: Accepted/In press - Jan 1 2018

Keywords

  • Big Data
  • Bioinformatics
  • E.2 Data storage representations
  • Genomics
  • H.2.4.h Query processing
  • H.2.8.a Bioinformatics (genome or protein) databases
  • I.1.2.b Algorithms for data and knowledge management
  • Query processing
  • Search problems
  • Sequences
  • Sequential analysis

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics