Open Access

Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings

EURASIP Journal on Bioinformatics and Systems Biology 2007, 2007:38473

https://doi.org/10.1155/2007/38473

Received: 14 May 2007

Accepted: 27 August 2007

Published: 30 October 2007

Abstract

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular, the deterioration of cross-validation precision in high-dimensional settings, where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated; thus natural questions arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three commonly used error estimators (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We observe that the true and estimated errors tend to be much more correlated with a known feature set than with either feature selection or use of all features, with the relative correlation of the latter two showing no general trend, differing instead from model to model.
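The quantity at issue can be illustrated with a small simulation, not taken from the paper: for a simple nearest-centroid classifier (a hypothetical stand-in for the classification rules studied) designed on repeated synthetic two-class Gaussian samples, one can compute the sample correlation between the true error (measured on a large independent test set) and its leave-one-out cross-validation estimate. All parameter choices (dimension, sample size, class-mean shift) are illustrative assumptions.

```python
# Illustrative sketch: correlation between the true error and the
# leave-one-out (LOO) estimate for a nearest-centroid classifier on
# synthetic two-class Gaussian data. Parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
d, n_per_class, n_runs = 5, 10, 200
mu = np.zeros(d)
mu[0] = 1.0  # class-1 mean shifted in one coordinate

def nearest_centroid_error(c0, c1, X, y):
    # Classify each row of X by the nearer centroid; return error rate.
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    return np.mean((d1 < d0).astype(int) != y)

true_err, loo_err = [], []
for _ in range(n_runs):
    # Draw a small training sample (the random sample on which design is based).
    X0 = rng.normal(0.0, 1.0, (n_per_class, d))
    X1 = rng.normal(0.0, 1.0, (n_per_class, d)) + mu
    X = np.vstack([X0, X1])
    y = np.r_[np.zeros(n_per_class), np.ones(n_per_class)]

    # "True" error: evaluate the designed classifier on a large test set.
    T0 = rng.normal(0.0, 1.0, (5000, d))
    T1 = rng.normal(0.0, 1.0, (5000, d)) + mu
    Xt, yt = np.vstack([T0, T1]), np.r_[np.zeros(5000), np.ones(5000)]
    true_err.append(nearest_centroid_error(X0.mean(0), X1.mean(0), Xt, yt))

    # LOO estimate: redesign on each deleted-point sample, test on the point.
    errs = 0.0
    for i in range(2 * n_per_class):
        mask = np.ones(2 * n_per_class, bool)
        mask[i] = False
        Xi, yi = X[mask], y[mask]
        c0, c1 = Xi[yi == 0].mean(0), Xi[yi == 1].mean(0)
        errs += nearest_centroid_error(c0, c1, X[i:i + 1], y[i:i + 1])
    loo_err.append(errs / (2 * n_per_class))

rho = np.corrcoef(true_err, loo_err)[0, 1]
print(f"corr(true, LOO) = {rho:.2f}")
```

Repeating such a simulation with a feature-selection step inserted before design would let one observe the decorrelation the abstract describes; the sketch above covers only the fixed-feature case.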


Authors’ Affiliations

(1) Department of Electrical and Computer Engineering, Texas A&M University
(2) Laboratoire d'Informatique Médicale et Bio-informatique (Lim&Bio), Université Paris 13
(3) Computational Biology Division, Translational Genomics Research Institute


Copyright

© Blaise Hanczar et al. 2007

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.