Machine learning-based receiver operating characteristic (ROC) curves for crisp and fuzzy classification of DNA microarrays in cancer research

Leif E. Peterson, Matthew A Coleman

Research output: Contribution to journalArticle

25 Scopus citations

Abstract

Receiver operating characteristic (ROC) curves were generated to obtain classification area under the curve (AUC) as a function of feature standardization, fuzzification, and sample size from nine large sets of cancer-related DNA microarrays. Classifiers used included k-nearest neighbor (kNN), naïve Bayes classifier (NBC), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), learning vector quantization (LVQ1), logistic regression (LOG), polytomous logistic regression (PLOG), artificial neural networks (ANN), particle swarm optimization (PSO), constricted particle swarm optimization (CPSO), kernel regression (RBF), radial basis function networks (RBFN), gradient descent support vector machines (SVMGD), and least squares support vector machines (SVMLS). For each data set, AUC was determined for a number of combinations of sample size, total sum[-log(p)] of feature t-tests, with and without feature standardization and with (fuzzy) and without (crisp) fuzzification of features. Altogether, a total of 2,123,530 classification runs were made. At the greatest level of sample size, ANN resulted in a fitted AUC of 90%, while PSO resulted in the lowest fitted AUC of 72.1%. AUC values derived from 4NN were the most dependent on sample size, while PSO was the least. ANN depended the most on total statistical significance of features used based on sum[-log(p)], whereas PSO was the least dependent. Standardization of features increased AUC by 8.1% for PSO and -0.2% for QDA, while fuzzification increased AUC by 9.4% for PSO and reduced AUC by 3.8% for QDA. AUC determination in planned microarray experiments without standardization and fuzzification of features will benefit the most if CPSO is used for lower levels of feature significance (i.e., sum [- log (p)] ∼ 50) and ANN is used for greater levels of significance (i.e., sum [- log (p)] ∼ 500). When only standardization of features is performed, studies are likely to benefit most by using CPSO for low levels of feature statistical significance and LVQ1 for greater levels of significance. Studies involving only fuzzification of features should employ LVQ1 because of the substantial gain in AUC observed and low expense of LVQ1. Lastly, PSO resulted in significantly greater levels of AUC (89.5% average) when feature standardization and fuzzification were performed. In consideration of the data sets used and factors influencing AUC which were investigated, if low-expense computation is desired then LVQ1 is recommended. However, if computational expense is of less concern, then PSO or CPSO is recommended.

Original languageEnglish (US)
Pages (from-to)17-36
Number of pages20
JournalInternational Journal of Approximate Reasoning
Volume47
Issue number1
DOIs
StatePublished - Jan 2008
Externally publishedYes

Keywords

  • Area under the curve (AUC)
  • DNA microarrays
  • Fuzzy classification
  • Gene expression
  • Receiver operator characteristic (ROC) curve
  • Soft computing

ASJC Scopus subject areas

  • Statistics and Probability
  • Electrical and Electronic Engineering
  • Statistics, Probability and Uncertainty
  • Information Systems and Management
  • Information Systems
  • Computer Science Applications
  • Artificial Intelligence

Fingerprint Dive into the research topics of 'Machine learning-based receiver operating characteristic (ROC) curves for crisp and fuzzy classification of DNA microarrays in cancer research'. Together they form a unique fingerprint.

  • Cite this