Motivation: Ranking gene feature sets is a key issue for both phenotype classification, for instance, tumor classification in a DNA microarray experiment, and prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data, and applies this classifier in turn to each data observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier, and then computes whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance. Results: A model-based approach is used to compare the ranking performances of resubstitution and cross-validation for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs). For classification, a Gaussian model is considered, along with classification via linear discriminant analysis and the 3-nearest-neighbor classification rule. Prediction is examined in the steady-distribution of a PBN.Three metrics are proposed to compare feature-set ranking based on error estimation with ranking based on the true error, which is known owing to the model-based approach. In all cases, resubstitution is competitive with cross-validation relative to ranking accuracy. This is in addition to the enormous savings in computation time afforded by resubstitution.
ASJC Scopus subject areas
- Clinical Biochemistry
- Computer Science Applications
- Computational Theory and Mathematics