A general approach to single-nucleotide polymorphism discovery

Gabor T. Marth, Ian F Korf, Mark D. Yandell, Raymond T. Yeh, Zhijie Gu, Hamideh Zakeri, Nathan O. Stitziel, LaDeana Hillier, Pui Yan Kwok, Warren R. Gish

Research output: Contribution to journal (Article)

374 Citations (Scopus)

Abstract

Single-nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and a resource for mapping complex genetic traits. The large volume of data produced by high-throughput sequencing projects is a rich and largely untapped source of SNPs (refs 2-5). We present here a unified approach to the discovery of variations in genetic sequence data of arbitrary DNA sources. We propose to use the rapidly emerging genomic sequence as a template on which to layer often unmapped, fragmentary sequence data and to use base quality values to discern true allelic variations from sequencing errors. By taking advantage of the genomic sequence we are able to use simpler yet more accurate methods for sequence organization: fragment clustering, paralogue identification and multiple alignment. We analyse these sequences with a novel, Bayesian inference engine, POLYBAYES, to calculate the probability that a given site is polymorphic. Rigorous treatment of base quality permits completely automated evaluation of the full length of all sequences, without limitations on alignment depth. We demonstrate this approach by accurate SNP predictions in human ESTs aligned to finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery.
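The role of base quality values in the abstract can be illustrated with a toy calculation. The sketch below is not the POLYBAYES algorithm; it is a deliberately simplified two-hypothesis model, assuming Phred-scaled qualities (error probability 10^(-Q/10)), errors spread evenly over the three other nucleotides, equal allele frequencies at a polymorphic site, and a hypothetical prior polymorphism rate of 0.001. The function names are illustrative only.

```python
from itertools import combinations
from math import prod

NUCLEOTIDES = "ACGT"


def phred_to_error(q):
    """Convert a Phred-scaled base quality into an error probability."""
    return 10.0 ** (-q / 10.0)


def base_likelihood(observed, true_allele, error):
    """P(observed base | true allele), with errors split evenly over 3 bases."""
    return 1.0 - error if observed == true_allele else error / 3.0


def polymorphism_posterior(bases, quals, prior_poly=0.001):
    """Toy posterior probability that an alignment column is polymorphic.

    bases: observed nucleotides in one column of a multiple alignment
    quals: matching Phred base quality values
    prior_poly: assumed prior probability of a polymorphic site (hypothetical)
    """
    errors = [phred_to_error(q) for q in quals]

    # Monomorphic hypotheses: every read was sampled from the same allele.
    mono = [
        prod(base_likelihood(b, a, e) for b, e in zip(bases, errors))
        for a in NUCLEOTIDES
    ]

    # Polymorphic hypotheses: each read comes from one of two alleles,
    # assumed equally frequent (a simplification made for this sketch).
    poly = [
        prod(
            0.5 * base_likelihood(b, a1, e) + 0.5 * base_likelihood(b, a2, e)
            for b, e in zip(bases, errors)
        )
        for a1, a2 in combinations(NUCLEOTIDES, 2)
    ]

    # Average within each hypothesis class (uniform prior over alleles).
    mono_likelihood = sum(mono) / len(mono)
    poly_likelihood = sum(poly) / len(poly)

    numerator = prior_poly * poly_likelihood
    return numerator / (numerator + (1.0 - prior_poly) * mono_likelihood)
```

Under these assumptions, a column such as "AAGG" with uniformly high qualities yields a posterior close to 1, whereas "AAAG" with a low-quality G stays close to 0, because the mismatch is more plausibly explained as a sequencing error than as a second allele.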

Original language: English (US)
Pages (from-to): 452-456
Number of pages: 5
Journal: Nature Genetics
Volume: 23
Issue number: 4
DOI: https://doi.org/10.1038/70570
State: Published - Dec 1999
Externally published: Yes

Fingerprint

Single Nucleotide Polymorphism
Expressed Sequence Tags
Medical Genetics
Sequence Analysis
Cluster Analysis
DNA

ASJC Scopus subject areas

  • Genetics (clinical)
  • Genetics

Cite this

Marth, G. T., Korf, I. F., Yandell, M. D., Yeh, R. T., Gu, Z., Zakeri, H., ... Gish, W. R. (1999). A general approach to single-nucleotide polymorphism discovery. Nature Genetics, 23(4), 452-456. https://doi.org/10.1038/70570

A general approach to single-nucleotide polymorphism discovery. / Marth, Gabor T.; Korf, Ian F; Yandell, Mark D.; Yeh, Raymond T.; Gu, Zhijie; Zakeri, Hamideh; Stitziel, Nathan O.; Hillier, LaDeana; Kwok, Pui Yan; Gish, Warren R.

In: Nature Genetics, Vol. 23, No. 4, 12.1999, p. 452-456.

Marth, GT, Korf, IF, Yandell, MD, Yeh, RT, Gu, Z, Zakeri, H, Stitziel, NO, Hillier, L, Kwok, PY & Gish, WR 1999, 'A general approach to single-nucleotide polymorphism discovery', Nature Genetics, vol. 23, no. 4, pp. 452-456. https://doi.org/10.1038/70570
Marth GT, Korf IF, Yandell MD, Yeh RT, Gu Z, Zakeri H et al. A general approach to single-nucleotide polymorphism discovery. Nature Genetics. 1999 Dec;23(4):452-456. https://doi.org/10.1038/70570
Marth, Gabor T. ; Korf, Ian F ; Yandell, Mark D. ; Yeh, Raymond T. ; Gu, Zhijie ; Zakeri, Hamideh ; Stitziel, Nathan O. ; Hillier, LaDeana ; Kwok, Pui Yan ; Gish, Warren R. / A general approach to single-nucleotide polymorphism discovery. In: Nature Genetics. 1999 ; Vol. 23, No. 4. pp. 452-456.
@article{cc1287533b014db3bce74735e0eda5dc,
title = "A general approach to single-nucleotide polymorphism discovery",
abstract = "Single-nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and a resource for mapping complex genetic traits. The large volume of data produced by high-throughput sequencing projects is a rich and largely untapped source of SNPs (refs 2-5). We present here a unified approach to the discovery of variations in genetic sequence data of arbitrary DNA sources. We propose to use the rapidly emerging genomic sequence as a template on which to layer often unmapped, fragmentary sequence data and to use base quality values to discern true allelic variations from sequencing errors. By taking advantage of the genomic sequence we are able to use simpler yet more accurate methods for sequence organization: fragment clustering, paralogue identification and multiple alignment. We analyse these sequences with a novel, Bayesian inference engine, POLYBAYES, to calculate the probability that a given site is polymorphic. Rigorous treatment of base quality permits completely automated evaluation of the full length of all sequences, without limitations on alignment depth. We demonstrate this approach by accurate SNP predictions in human ESTs aligned to finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery.",
author = "Marth, {Gabor T.} and Korf, {Ian F} and Yandell, {Mark D.} and Yeh, {Raymond T.} and Gu, Zhijie and Zakeri, Hamideh and Stitziel, {Nathan O.} and Hillier, LaDeana and Kwok, {Pui Yan} and Gish, {Warren R.}",
year = "1999",
month = "12",
doi = "10.1038/70570",
language = "English (US)",
volume = "23",
pages = "452--456",
journal = "Nature Genetics",
issn = "1061-4036",
publisher = "Nature Publishing Group",
number = "4",
}

TY - JOUR

T1 - A general approach to single-nucleotide polymorphism discovery

AU - Marth, Gabor T.

AU - Korf, Ian F

AU - Yandell, Mark D.

AU - Yeh, Raymond T.

AU - Gu, Zhijie

AU - Zakeri, Hamideh

AU - Stitziel, Nathan O.

AU - Hillier, LaDeana

AU - Kwok, Pui Yan

AU - Gish, Warren R.

PY - 1999/12

Y1 - 1999/12

AB - Single-nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and a resource for mapping complex genetic traits. The large volume of data produced by high-throughput sequencing projects is a rich and largely untapped source of SNPs (refs 2-5). We present here a unified approach to the discovery of variations in genetic sequence data of arbitrary DNA sources. We propose to use the rapidly emerging genomic sequence as a template on which to layer often unmapped, fragmentary sequence data and to use base quality values to discern true allelic variations from sequencing errors. By taking advantage of the genomic sequence we are able to use simpler yet more accurate methods for sequence organization: fragment clustering, paralogue identification and multiple alignment. We analyse these sequences with a novel, Bayesian inference engine, POLYBAYES, to calculate the probability that a given site is polymorphic. Rigorous treatment of base quality permits completely automated evaluation of the full length of all sequences, without limitations on alignment depth. We demonstrate this approach by accurate SNP predictions in human ESTs aligned to finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery.

UR - http://www.scopus.com/inward/record.url?scp=0032706623&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0032706623&partnerID=8YFLogxK

U2 - 10.1038/70570

DO - 10.1038/70570

M3 - Article

C2 - 10581034

AN - SCOPUS:0032706623

VL - 23

SP - 452

EP - 456

JO - Nature Genetics

JF - Nature Genetics

SN - 1061-4036

IS - 4

ER -