Discriminant models for high-throughput proteomics mass spectrometer data

Parul V. Purohit, David M. Rocke

Research output: Contribution to journal › Article

37 Citations (Scopus)

Abstract

We use several multivariate analysis methods to discriminate between diseased and healthy patients using protein mass spectrometer data provided by Duke University. The university presented two problems: one in which the responses (diseased or healthy) of the patients were not known, and a second in which they were known. In the latter case, the data can be used as a 'training' set. We attempted both problems. In particular, we use principal component analysis along with clustering methods to discriminate in the first problem, and partial least squares coupled with logistic and discriminant methods when the responses were known. In addition, we were able to detect regions of interest in the spectrum where the protein patterns differed between healthy and diseased patients. Preprocessing the data involved considerable effort. We used a binning approach, rather than peak heights or peak areas, to reduce the number of variables. We performed a square root transformation on the data to help stabilize the variance; this in turn significantly improved the clustering results.
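The preprocessing and projection steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bin width of 4, the toy spectra, the made-up labels, the use of power iteration for the first principal component, two-cluster k-means on that component, and the single-component PLS are all assumptions chosen to keep the example short and dependency-free. A logistic or discriminant model could then be fitted to PLS scores like these; that step is not shown.

```python
import math
import random


def bin_spectrum(intensities, bin_width):
    """Sum consecutive intensities into fixed-width bins (last bin may be short)."""
    return [sum(intensities[i:i + bin_width])
            for i in range(0, len(intensities), bin_width)]


def sqrt_transform(values):
    """Square-root transform; negative inputs are clipped to zero first."""
    return [math.sqrt(max(v, 0.0)) for v in values]


def centre_columns(rows):
    """Subtract each column's mean so every variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def project(rows, direction):
    """Score of each row along a direction vector."""
    return [sum(x * d for x, d in zip(row, direction)) for row in rows]


def first_pc_scores(rows, iterations=200, seed=0):
    """Scores on the first principal component, found by power iteration."""
    centred = centre_columns(rows)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in centred[0]]
    for _ in range(iterations):
        scores = project(centred, v)
        # w = X^T (X v): one multiplication by the scatter matrix
        w = [sum(s * x for s, x in zip(scores, col)) for col in zip(*centred)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return project(centred, v)


def two_means_1d(scores, iterations=50):
    """k-means with k = 2 on one-dimensional scores, started at the extremes."""
    centres = [min(scores), max(scores)]
    labels = [0] * len(scores)
    for _ in range(iterations):
        labels = [0 if abs(s - centres[0]) <= abs(s - centres[1]) else 1
                  for s in scores]
        for k in (0, 1):
            members = [s for s, lab in zip(scores, labels) if lab == k]
            if members:
                centres[k] = sum(members) / len(members)
    return labels


def pls1_scores(rows, responses):
    """Scores on the first PLS component for a single 0/1 response."""
    centred = centre_columns(rows)
    y_mean = sum(responses) / len(responses)
    y = [r - y_mean for r in responses]
    # The first PLS weight vector is proportional to X^T y
    w = [sum(yi * x for yi, x in zip(y, col)) for col in zip(*centred)]
    norm = math.sqrt(sum(x * x for x in w)) or 1.0
    return project(centred, [x / norm for x in w])


# Toy spectra (hypothetical): two samples peak early, two peak late.
spectra = [
    [9, 10, 11, 10, 1, 1, 1, 1, 1, 2, 1, 1],
    [10, 9, 10, 11, 1, 2, 1, 1, 1, 1, 2, 1],
    [1, 1, 2, 1, 1, 1, 1, 1, 10, 11, 9, 10],
    [1, 2, 1, 1, 1, 1, 2, 1, 9, 10, 10, 11],
]
known = [1, 1, 0, 0]  # made-up responses for the supervised case

features = [sqrt_transform(bin_spectrum(s, 4)) for s in spectra]
clusters = two_means_1d(first_pc_scores(features))  # responses unknown
pls = pls1_scores(features, known)                  # responses known
print(clusters)
print([round(t, 2) for t in pls])
```

The cluster labels are arbitrary (0 and 1 may swap), so only the grouping of samples is meaningful. In the supervised case, the sign of the PLS score separates the two made-up classes in this toy data.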

Original language: English (US)
Pages (from-to): 1699-1703
Number of pages: 5
Journal: Proteomics
Volume: 3
Issue number: 9
DOIs: 10.1002/pmic.200300518
State: Published - Sep 1 2003

Fingerprint

Mass spectrometers
Proteomics
Throughput
Cluster Analysis
Logistics
Proteins
Least-Squares Analysis
Multivariate Analysis

Keywords

  • Discriminant
  • Mass spectrometry
  • Multivariate

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics

Cite this

Discriminant models for high-throughput proteomics mass spectrometer data. / Purohit, Parul V.; Rocke, David M.

In: Proteomics, Vol. 3, No. 9, 01.09.2003, p. 1699-1703.

@article{5dbfd863ab3e45c898bec56db793269f,
title = "Discriminant models for high-throughput proteomics mass spectrometer data",
abstract = "We use several multivariate analysis methods to discriminate between diseased and healthy patients using protein mass spectrometer data provided by Duke University. The university presented two problems: one in which the responses (diseased or healthy) of the patients were not known, and a second in which they were known. In the latter case, the data can be used as a 'training' set. We attempted both problems. In particular, we use principal component analysis along with clustering methods to discriminate in the first problem, and partial least squares coupled with logistic and discriminant methods when the responses were known. In addition, we were able to detect regions of interest in the spectrum where the protein patterns differed between healthy and diseased patients. Preprocessing the data involved considerable effort. We used a binning approach, rather than peak heights or peak areas, to reduce the number of variables. We performed a square root transformation on the data to help stabilize the variance; this in turn significantly improved the clustering results.",
keywords = "Discriminant, Mass spectrometry, Multivariate",
author = "Purohit, {Parul V.} and Rocke, {David M}",
year = "2003",
month = "9",
day = "1",
doi = "10.1002/pmic.200300518",
language = "English (US)",
volume = "3",
pages = "1699--1703",
journal = "Proteomics",
issn = "1615-9853",
publisher = "Wiley-VCH Verlag",
number = "9",

}

TY - JOUR

T1 - Discriminant models for high-throughput proteomics mass spectrometer data

AU - Purohit, Parul V.

AU - Rocke, David M

PY - 2003/9/1

Y1 - 2003/9/1

AB - We use several multivariate analysis methods to discriminate between diseased and healthy patients using protein mass spectrometer data provided by Duke University. The university presented two problems: one in which the responses (diseased or healthy) of the patients were not known, and a second in which they were known. In the latter case, the data can be used as a 'training' set. We attempted both problems. In particular, we use principal component analysis along with clustering methods to discriminate in the first problem, and partial least squares coupled with logistic and discriminant methods when the responses were known. In addition, we were able to detect regions of interest in the spectrum where the protein patterns differed between healthy and diseased patients. Preprocessing the data involved considerable effort. We used a binning approach, rather than peak heights or peak areas, to reduce the number of variables. We performed a square root transformation on the data to help stabilize the variance; this in turn significantly improved the clustering results.

KW - Discriminant

KW - Mass spectrometry

KW - Multivariate

UR - http://www.scopus.com/inward/record.url?scp=0141743615&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0141743615&partnerID=8YFLogxK

U2 - 10.1002/pmic.200300518

DO - 10.1002/pmic.200300518

M3 - Article

C2 - 12973728

AN - SCOPUS:0141743615

VL - 3

SP - 1699

EP - 1703

JO - Proteomics

JF - Proteomics

SN - 1615-9853

IS - 9

ER -