Stable classification with applications to microarray data

Chin-Shang Li, Cheng Cheng

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

A stable classification method called minimum-error-distance threshold (MEDT) with variable selection is developed for the two-class prediction (classification) problem. First, a set of "significant" variables (genes) associated with the two classes is selected using the Wilcoxon rank-sum test, and then a data-driven cutoff point for a distance-based classification algorithm is determined by minimizing a combination of the rates of false positives and false negatives estimated by leave-one-out cross validation. This cutoff point is used to classify a given test set based on the selected variables. The proposed methodology is applied to the leukemia data set analyzed in Golub et al. (Science 286 (1999) 531). To compare the proposed methodology with the existing discrimination methods, the diagonal-linear-discriminant analysis and nearest-neighbor classifiers, 1000 cross validations are performed. The data set is randomly split into a training set consisting of 32 patients with acute lymphoblastic leukemia (ALL) and 16 with acute myeloid leukemia (AML) and a test set consisting of 15 patients with ALL and nine with AML. Performance summaries are calculated. A simulation study is conducted to demonstrate the superior stability of MEDT compared with that of the aforementioned existing methods. The stability measure used is the mean-to-standard deviation ratio of the number of correct predictions.
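The evaluation protocol in the abstract (a random split into 32 ALL + 16 AML training patients and 15 ALL + 9 AML test patients, with stability measured as the mean-to-standard-deviation ratio of the number of correct predictions) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `stratified_split` and `stability_ratio`, the use of the sample standard deviation, and the example counts are all assumptions made for illustration, and no classifier (MEDT or otherwise) is implemented here.

```python
import random
import statistics


def stratified_split(labels, n_train_per_class, rng):
    """Randomly split sample indices into training and test sets,
    drawing a fixed number of training samples from each class."""
    train, test = [], []
    for cls, n_train in n_train_per_class.items():
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return sorted(train), sorted(test)


def stability_ratio(correct_counts):
    """Mean-to-standard-deviation ratio of the number of correct predictions.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the abstract does not say which variant the authors chose.
    """
    sd = statistics.stdev(correct_counts)
    if sd == 0:
        return float("inf")  # identical counts in every repetition
    return statistics.mean(correct_counts) / sd


# Class sizes implied by the abstract: 32 + 15 = 47 ALL, 16 + 9 = 25 AML.
labels = ["ALL"] * 47 + ["AML"] * 25
rng = random.Random(0)
train_idx, test_idx = stratified_split(labels, {"ALL": 32, "AML": 16}, rng)
print(len(train_idx), len(test_idx))  # 48 24

# Hypothetical numbers of correct test-set predictions over five splits.
hypothetical_counts = [22, 23, 21, 24, 22]
print(stability_ratio(hypothetical_counts))
```

A larger ratio means the number of correct predictions is high on average and varies little across random splits, which is the sense of "stable" used in the abstract.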

Original language: English (US)
Pages (from-to): 599-609
Number of pages: 11
Journal: Computational Statistics and Data Analysis
Volume: 47
Issue number: 3
DOI: 10.1016/j.csda.2003.12.010
State: Published - Oct 1 2004
Externally published: Yes

Fingerprint

Leukemia
Microarrays
Microarray Data
Acute
Test Set
Cross-validation
Discriminant analysis
Wilcoxon rank-sum test
Classifiers
Genes
Methodology
Prediction
Variable Selection
Classification Algorithm
False Positive
Data-driven
Classification Problems
Standard deviation
Discrimination

Keywords

  • Diagonal-linear-discriminant analysis
  • Microarray
  • Minimum-error-distance threshold
  • Nearest-neighbor classifiers
  • Stable classification
  • Variable selection

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Statistics, Probability and Uncertainty
  • Electrical and Electronic Engineering
  • Computational Mathematics
  • Numerical Analysis
  • Statistics and Probability

Cite this

Stable classification with applications to microarray data. / Li, Chin-Shang; Cheng, Cheng.

In: Computational Statistics and Data Analysis, Vol. 47, No. 3, 01.10.2004, p. 599-609.

@article{ca2e62194a10424da3e945288d42180d,
title = "Stable classification with applications to microarray data",
abstract = "A stable classification method called minimum-error-distance threshold (MEDT) with variable selection is developed for the two-class prediction (classification) problem. First, a set of {"}significant{"} variables (genes) associated with the two classes is selected using the Wilcoxon rank-sum test, and then a data-driven cutoff point for a distance-based classification algorithm is determined by minimizing a combination of the rates of false positives and false negatives estimated by leave-one-out cross validation. This cutoff point is used to classify a given test set based on the selected variables. The proposed methodology is applied to the leukemia data set analyzed in Golub et al. (Science 286 (1999) 531). To compare the proposed methodology with the existing discrimination methods, the diagonal-linear-discriminant analysis and nearest-neighbor classifiers, 1000 cross validations are performed. The data set is randomly split into a training set consisting of 32 patients with acute lymphoblastic leukemia (ALL) and 16 with acute myeloid leukemia (AML) and a test set consisting of 15 patients with ALL and nine with AML. Performance summaries are calculated. A simulation study is conducted to demonstrate the superior stability of MEDT compared with that of the aforementioned existing methods. The stability measure used is the mean-to-standard deviation ratio of the number of correct predictions.",
keywords = "Diagonal-linear-discriminant analysis, Microarray, Minimum-error-distance threshold, Nearest-neighbor classifiers, Stable classification, Variable selection",
author = "Chin-Shang Li and Cheng Cheng",
year = "2004",
month = "10",
day = "1",
doi = "10.1016/j.csda.2003.12.010",
language = "English (US)",
volume = "47",
pages = "599--609",
journal = "Computational Statistics and Data Analysis",
issn = "0167-9473",
publisher = "Elsevier",
number = "3",

}

TY - JOUR

T1 - Stable classification with applications to microarray data

AU - Li, Chin-Shang

AU - Cheng, Cheng

PY - 2004/10/1

Y1 - 2004/10/1

AB - A stable classification method called minimum-error-distance threshold (MEDT) with variable selection is developed for the two-class prediction (classification) problem. First, a set of "significant" variables (genes) associated with the two classes is selected using the Wilcoxon rank-sum test, and then a data-driven cutoff point for a distance-based classification algorithm is determined by minimizing a combination of the rates of false positives and false negatives estimated by leave-one-out cross validation. This cutoff point is used to classify a given test set based on the selected variables. The proposed methodology is applied to the leukemia data set analyzed in Golub et al. (Science 286 (1999) 531). To compare the proposed methodology with the existing discrimination methods, the diagonal-linear-discriminant analysis and nearest-neighbor classifiers, 1000 cross validations are performed. The data set is randomly split into a training set consisting of 32 patients with acute lymphoblastic leukemia (ALL) and 16 with acute myeloid leukemia (AML) and a test set consisting of 15 patients with ALL and nine with AML. Performance summaries are calculated. A simulation study is conducted to demonstrate the superior stability of MEDT compared with that of the aforementioned existing methods. The stability measure used is the mean-to-standard deviation ratio of the number of correct predictions.

KW - Diagonal-linear-discriminant analysis

KW - Microarray

KW - Minimum-error-distance threshold

KW - Nearest-neighbor classifiers

KW - Stable classification

KW - Variable selection

UR - http://www.scopus.com/inward/record.url?scp=4944248550&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4944248550&partnerID=8YFLogxK

U2 - 10.1016/j.csda.2003.12.010

DO - 10.1016/j.csda.2003.12.010

M3 - Article

AN - SCOPUS:4944248550

VL - 47

SP - 599

EP - 609

JO - Computational Statistics and Data Analysis

JF - Computational Statistics and Data Analysis

SN - 0167-9473

IS - 3

ER -