VA-Store: A Virtual Approximate Store Approach to Supporting Repetitive Big Data in Genome Sequence Analyses

Xianying Liu, Qiang Zhu, Sakti Pramanik, Charles Brown, Gang Qian

Research output: Contribution to journal (Article)

Abstract

In recent years, we have witnessed an increasing demand to process big data in numerous applications. It is observed that there often exist substantial amounts of repetitive data in different portions of a big data repository/dataset for applications such as genome sequence analyses. In this paper, we present a novel method, called the VA-Store, to reduce the large space requirement for repetitive data in prevailing genome sequence analysis tasks using k-mers (i.e., subsequences of length k) with multiple k values. The VA-Store maintains a physical store for one portion of the input dataset (i.e., k0-mers) and supports multiple virtual stores for other portions of the dataset (i.e., k-mers with k ≠ k0). Utilizing important relationships among repetitive data, the VA-Store transforms a given query on a virtual store into one or more queries on the physical store for execution. Both precise and approximate transformations are considered. Accuracy estimation models for approximate solutions are derived. Query optimization strategies are suggested to improve query performance. Our experiments using real and synthetic datasets demonstrate that the VA-Store is quite promising in providing effective storage and efficient query processing for solving a kernel database problem on repetitive big data for genome sequence analysis applications.
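To make the k-mer terminology concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes the "physical store" is a plain in-memory counter of k0-mers, and it answers a count query for a shorter k-mer (k ≤ k0) by summing the counts of the stored k0-mers that begin with it. The function names `kmer_counts` and `virtual_count` and the toy reads are hypothetical, chosen only for illustration.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every length-k substring (k-mer) across the input sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def virtual_count(physical, k0, query):
    """Answer a count query for a k-mer with k <= k0 using only the k0-mer store.

    Illustrative assumption: sum the counts of stored k0-mers that start with
    the query. Occurrences in the last (k0 - k) positions of a sequence have
    no k0-mer starting there, so the result can undercount.
    """
    if len(query) > k0:
        raise ValueError("this sketch only handles queries with k <= k0")
    return sum(c for kmer, c in physical.items() if kmer.startswith(query))


reads = ["ACGTACGT", "CGTAC"]      # toy input, not real genome data
k0 = 4
physical = kmer_counts(reads, k0)  # the only store that is materialized

exact = kmer_counts(reads, 3)["CGT"]            # needs a separate 3-mer store
estimate = virtual_count(physical, k0, "CGT")   # answered from the 4-mer store
print(exact, estimate)
```

On these toy reads the estimate is lower than the exact count because one occurrence of `CGT` sits at the very end of a read. Boundary effects of this kind are one reason a transformed query may be approximate rather than precise; how the VA-Store actually handles them, and how it treats k > k0, is described in the paper itself and not reproduced here.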

Original language: English (US)
Journal: IEEE Transactions on Knowledge and Data Engineering
DOI: 10.1109/TKDE.2018.2885952
State: Accepted/In press - Jan 1 2018

Fingerprint

Genes
Query processing
Big data
Experiments

Keywords

  • Big Data
  • Bioinformatics
  • E.2 Data storage representations
  • Genomics
  • H.2.4.h Query processing
  • H.2.8.a Bioinformatics (genome or protein) databases
  • I.1.2.b Algorithms for data and knowledge management
  • Query processing
  • Search problems
  • Sequences
  • Sequential analysis

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

VA-Store: A Virtual Approximate Store Approach to Supporting Repetitive Big Data in Genome Sequence Analyses. / Liu, Xianying; Zhu, Qiang; Pramanik, Sakti; Brown, Charles; Qian, Gang.

In: IEEE Transactions on Knowledge and Data Engineering, 01.01.2018.

@article{6d331ecebcda4627881d274c23ae1fa0,
title = "VA-Store: A Virtual Approximate Store Approach to Supporting Repetitive Big Data in Genome Sequence Analyses",
abstract = "In recent years, we have witnessed an increasing demand to process big data in numerous applications. It is observed that there often exist substantial amounts of repetitive data in different portions of a big data repository/dataset for applications such as genome sequence analyses. In this paper, we present a novel method, called the VA-Store, to reduce the large space requirement for repetitive data in prevailing genome sequence analysis tasks using k-mers (i.e., subsequences of length k) with multiple k values. The VA-Store maintains a physical store for one portion of the input dataset (i.e., k0-mers) and supports multiple virtual stores for other portions of the dataset (i.e., k-mers with k $\neq$ k0). Utilizing important relationships among repetitive data, the VA-Store transforms a given query on a virtual store into one or more queries on the physical store for execution. Both precise and approximate transformations are considered. Accuracy estimation models for approximate solutions are derived. Query optimization strategies are suggested to improve query performance. Our experiments using real and synthetic datasets demonstrate that the VA-Store is quite promising in providing effective storage and efficient query processing for solving a kernel database problem on repetitive big data for genome sequence analysis applications.",
keywords = "Big Data, Bioinformatics, E.2 Data storage representations, Genomics, H.2.4.h Query processing, H.2.8.a Bioinformatics (genome or protein) databases, I.1.2.b Algorithms for data and knowledge management, Query processing, Search problems, Sequences, Sequential analysis",
author = "Xianying Liu and Qiang Zhu and Sakti Pramanik and Charles Brown and Gang Qian",
year = "2018",
month = "1",
day = "1",
doi = "10.1109/TKDE.2018.2885952",
language = "English (US)",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",

}

TY - JOUR

T1 - VA-Store

T2 - A Virtual Approximate Store Approach to Supporting Repetitive Big Data in Genome Sequence Analyses

AU - Liu, Xianying

AU - Zhu, Qiang

AU - Pramanik, Sakti

AU - Brown, Charles

AU - Qian, Gang

PY - 2018/1/1

Y1 - 2018/1/1

N2 - In recent years, we have witnessed an increasing demand to process big data in numerous applications. It is observed that there often exist substantial amounts of repetitive data in different portions of a big data repository/dataset for applications such as genome sequence analyses. In this paper, we present a novel method, called the VA-Store, to reduce the large space requirement for repetitive data in prevailing genome sequence analysis tasks using k-mers (i.e., subsequences of length k) with multiple k values. The VA-Store maintains a physical store for one portion of the input dataset (i.e., k0-mers) and supports multiple virtual stores for other portions of the dataset (i.e., k-mers with k ≠ k0). Utilizing important relationships among repetitive data, the VA-Store transforms a given query on a virtual store into one or more queries on the physical store for execution. Both precise and approximate transformations are considered. Accuracy estimation models for approximate solutions are derived. Query optimization strategies are suggested to improve query performance. Our experiments using real and synthetic datasets demonstrate that the VA-Store is quite promising in providing effective storage and efficient query processing for solving a kernel database problem on repetitive big data for genome sequence analysis applications.

AB - In recent years, we have witnessed an increasing demand to process big data in numerous applications. It is observed that there often exist substantial amounts of repetitive data in different portions of a big data repository/dataset for applications such as genome sequence analyses. In this paper, we present a novel method, called the VA-Store, to reduce the large space requirement for repetitive data in prevailing genome sequence analysis tasks using k-mers (i.e., subsequences of length k) with multiple k values. The VA-Store maintains a physical store for one portion of the input dataset (i.e., k0-mers) and supports multiple virtual stores for other portions of the dataset (i.e., k-mers with k ≠ k0). Utilizing important relationships among repetitive data, the VA-Store transforms a given query on a virtual store into one or more queries on the physical store for execution. Both precise and approximate transformations are considered. Accuracy estimation models for approximate solutions are derived. Query optimization strategies are suggested to improve query performance. Our experiments using real and synthetic datasets demonstrate that the VA-Store is quite promising in providing effective storage and efficient query processing for solving a kernel database problem on repetitive big data for genome sequence analysis applications.

KW - Big Data

KW - Bioinformatics

KW - E.2 Data storage representations

KW - Genomics

KW - H.2.4.h Query processing

KW - H.2.8.a Bioinformatics (genome or protein) databases

KW - I.1.2.b Algorithms for data and knowledge management

KW - Query processing

KW - Search problems

KW - Sequences

KW - Sequential analysis

UR - http://www.scopus.com/inward/record.url?scp=85058666743&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85058666743&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2018.2885952

DO - 10.1109/TKDE.2018.2885952

M3 - Article

AN - SCOPUS:85058666743

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

ER -