Is cross-validation better than resubstitution for ranking genes?

Ulisses Braga-Neto, Ronaldo Hashimoto, Edward R. Dougherty, Danh V. Nguyen, Raymond J. Carroll

Research output: Contribution to journal, Article

51 Citations (Scopus)

Abstract

Motivation: Ranking gene feature sets is a key issue for both phenotype classification, for instance, tumor classification in a DNA microarray experiment, and prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data, and applies this classifier in turn to each data observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier, and then computes whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance.

Results: A model-based approach is used to compare the ranking performances of resubstitution and cross-validation for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs). For classification, a Gaussian model is considered, along with classification via linear discriminant analysis and the 3-nearest-neighbor classification rule. Prediction is examined in the steady-distribution of a PBN. Three metrics are proposed to compare feature-set ranking based on error estimation with ranking based on the true error, which is known owing to the model-based approach. In all cases, resubstitution is competitive with cross-validation relative to ranking accuracy. This is in addition to the enormous savings in computation time afforded by resubstitution.
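
The two error estimators described in the abstract can be illustrated with a minimal sketch. It assumes synthetic two-class Gaussian data and the 3-nearest-neighbor rule; the function names, sample size, class shifts, and candidate feature sets are hypothetical choices for illustration and are not the authors' experimental setup or their ranking metrics.

```python
import random

def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(
        train,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query)),
    )[:k]
    votes = sum(label for _, label in nearest)  # labels are 0 or 1
    return 1 if 2 * votes > k else 0

def resubstitution_error(data, k=3):
    # Design the classifier once on all data, then test on the same data.
    return sum(knn_predict(data, x, k) != y for x, y in data) / len(data)

def loo_error(data, k=3):
    # Remove each observation in turn, redesign, classify the deleted point.
    errors = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        errors += knn_predict(rest, x, k) != y
    return errors / len(data)

def make_data(n_per_class, shifts, rng):
    # Hypothetical model: unit-variance Gaussians; class 1 has its mean
    # shifted by shifts[j] in feature j, so larger shifts are more useful.
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = tuple(rng.gauss(label * s, 1.0) for s in shifts)
            data.append((x, label))
    return data

def project(data, features):
    # Restrict every observation to the chosen feature subset.
    return [(tuple(x[j] for j in features), y) for x, y in data]

rng = random.Random(0)
data = make_data(20, shifts=(2.0, 1.0, 0.0), rng=rng)
feature_sets = [(0,), (1,), (2,), (0, 1)]

resub = {fs: resubstitution_error(project(data, fs)) for fs in feature_sets}
loo = {fs: loo_error(project(data, fs)) for fs in feature_sets}

# Rank feature sets from lowest to highest estimated error.
rank_resub = sorted(feature_sets, key=lambda fs: resub[fs])
rank_loo = sorted(feature_sets, key=lambda fs: loo[fs])

for fs in feature_sets:
    print(fs, "resubstitution:", resub[fs], "leave-one-out:", loo[fs])
print("ranking by resubstitution:", rank_resub)
print("ranking by leave-one-out: ", rank_loo)
```

In this sketch the resubstitution estimate never exceeds the leave-one-out estimate, because each point counts itself among its own three nearest neighbors, which reflects the optimistic bias noted in the abstract. Leave-one-out redesigns the classifier once per observation, which is where the extra computation comes from.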

Original language: English (US)
Pages (from-to): 253-258
Number of pages: 6
Journal: Bioinformatics
Volume: 20
Issue number: 2
DOI: https://doi.org/10.1093/bioinformatics/btg399
State: Published - Jan 22 2004
Externally published: Yes

Fingerprint

Cross-validation
Ranking
Classifiers
Genes
Classifier
Gene
Misclassification Rate
Boolean Networks
Observation
Prediction
Classify
Model-based
Genetic Regulatory Networks
DNA Microarray
Classification Rules
Gaussian Model
Discriminant Analysis
Error Estimation
Microarrays

ASJC Scopus subject areas

  • Clinical Biochemistry
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

Braga-Neto, U., Hashimoto, R., Dougherty, E. R., Nguyen, D. V., & Carroll, R. J. (2004). Is cross-validation better than resubstitution for ranking genes? Bioinformatics, 20(2), 253-258. https://doi.org/10.1093/bioinformatics/btg399

@article{7efa099b34fe4861a7f8f61565555d6a,
title = "Is cross-validation better than resubstitution for ranking genes?",
author = "Ulisses Braga-Neto and Ronaldo Hashimoto and Dougherty, {Edward R.} and Nguyen, {Danh V.} and Carroll, {Raymond J.}",
year = "2004",
month = "1",
day = "22",
doi = "10.1093/bioinformatics/btg399",
language = "English (US)",
volume = "20",
pages = "253--258",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "2",

}

TY - JOUR

T1 - Is cross-validation better than resubstitution for ranking genes?

AU - Braga-Neto, Ulisses

AU - Hashimoto, Ronaldo

AU - Dougherty, Edward R.

AU - Nguyen, Danh V.

AU - Carroll, Raymond J.

PY - 2004/1/22

Y1 - 2004/1/22

UR - http://www.scopus.com/inward/record.url?scp=1042280954&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=1042280954&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btg399

DO - 10.1093/bioinformatics/btg399

M3 - Article

C2 - 14734317

AN - SCOPUS:1042280954

VL - 20

SP - 253

EP - 258

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 2

ER -