- Research Article
- Open Access
Is Bagging Effective in the Classification of Small-Sample Genomic and Proteomic Data?
EURASIP Journal on Bioinformatics and Systems Biology volume 2009, Article number: 158368 (2009)
There has been considerable interest recently in the application of bagging in the classification of both gene-expression data and protein-abundance mass spectrometry data. The approach is often justified by the improvement it produces on the performance of unstable, overfitting classification rules under small-sample situations. However, the question of real practical interest is whether the ensemble scheme will improve performance of those classifiers sufficiently to beat the performance of single stable, nonoverfitting classifiers, in the case of small-sample genomic and proteomic data sets. To investigate that question, we conducted a detailed empirical study, using publicly-available data sets from published genomic and proteomic studies. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, nonoverfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, as expected, the ensemble method did not improve the performance of these classifiers significantly. Representative experimental results are presented and discussed in this work.
Randomized ensemble methods for classifier design combine the decision of an ensemble of classifiers designed on randomly perturbed versions of the available data [1–5]. The combination is often done by means of majority voting among the individual classifier decisions [4–6], whereas the data perturbation usually employs the bootstrap resampling approach, which corresponds to sampling uniformly with replacement from the original data. The combination of bootstrap resampling and majority voting is known as bootstrap aggregation or bagging.
There has been considerable interest recently in the application of bagging in the classification of both gene-expression data [9–12] and protein-abundance mass spectrometry data [13–18]. However, there is scant theoretical justification for the use of this heuristic, other than the expectation that combining the decision of several classifiers will regularize and improve the performance of unstable overfitting classification rules, such as unpruned decision trees, provided one uses a large enough number of classifiers in the ensemble. It is also claimed that ensemble rules "do not overfit," meaning that classification error converges as the number of component classifiers tends to infinity.
However, the main performance issue is not whether the ensemble scheme improves the classification error of a single unstable overfitting classifier, or whether its classification error converges to a fixed limit. These are important questions, which have been studied in the literature (in particular when the component classifiers are decision trees) [5, 19–23], but the question of main practical interest is whether the ensemble scheme will improve the performance of unstable overfitting classifiers sufficiently to beat the performance of single stable, nonoverfitting classifiers, particularly in small-sample settings. Therefore, there is a pressing need to examine rigorously the suitability and validity of the ensemble approach in the classification of small-sample genomic and proteomic data. In this paper, we present results from a comprehensive empirical study concerning the effect of bagging on the performance of several classification rules, including diagonal and plain linear discriminant analysis, 3-nearest neighbors, CART decision trees, and neural networks, using real data from published microarray and mass spectrometry studies. Here we are concerned exclusively with the performance in terms of the true classification error, and therefore we employ filter-based feature selection and holdout estimation based on large samples in order to allow accurate classification error estimation. Similar recently published studies rely on small-sample wrapper feature selection and small-sample error estimation methods, which obscure the issue of how bagging really affects the true classification error. In particular, there is evidence that filter-based feature selection outperforms wrapper feature selection in small-sample settings.
In our experiments, we employ the one-tailed paired t-test to assess whether the expected true classification error is significantly smaller for the bagged classifier than for the original base classifier, for different numbers of samples, dimensionalities, and numbers of classifiers in the ensemble. Clearly, the heuristic is beneficial for the particular classification rule if and only if there is a significant decrease in expected classification error; otherwise the procedure is to be avoided. However, the magnitude of improvement is also a factor: a small improvement in performance may not be worth the extra computation required, which grows roughly in proportion to the number of classifiers in the ensemble. The full results of the empirical study are available on a companion website, http://www.ece.tamu.edu/~ulisses/bagging/index.html.
2. Randomized Ensemble Classification Rules
Classification involves a feature vector $X$ in a feature space $\mathbb{R}^d$, a label $Y \in \{0,1\}$, and a classifier $\psi: \mathbb{R}^d \to \{0,1\}$, such that $\psi(X)$ attempts to predict the value of $Y$ for a given observation $X$. The joint feature-label distribution $F$ of the pair $(X,Y)$ completely characterizes the stochastic properties of the classification problem. In practice, a classification rule is used to design a classifier based on sample training data. Working formally, a classification rule is a mapping $\Psi_n$, which takes an i.i.d. sample $S_n = \{(X_1,Y_1),\ldots,(X_n,Y_n)\}$ of feature-label pairs drawn from the feature-label distribution to a designed classifier $\psi_n = \Psi_n(S_n)$. The classification error is the probability that classification is erroneous given the sample data, that is, $\varepsilon_n = P(\psi_n(X) \neq Y \mid S_n)$. Note that the classification error is random only through the training data $S_n$. The expected classification error $E[\varepsilon_n]$ is the average classification error over all possible sample data sets; it is a fixed parameter of the classification rule and feature-label distribution, and is used as the measure of performance of the former given the latter.
Randomization approaches based on resampling can be seen as drawing i.i.d. samples $S_n^*$ from a surrogate joint feature-label distribution $F^*$, which is a function of the original training data $S_n$. In the bootstrap resampling approach, $F^*$ is the empirical distribution of the data, and the randomized sample $S_n^*$ corresponds to sampling uniformly $n$ training points from $S_n$ with replacement. The empirical distribution assigns discrete probability mass $1/n$ at each observed data point in $S_n$. Some of the original training points may appear multiple times, whereas others may not appear at all in the bootstrap sample. Note that, given $S_n$, the bootstrap sample $S_n^*$ is conditionally independent from the original feature-label distribution $F$.
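As a minimal illustration of the bootstrap resampling step described above, the following Python sketch draws one bootstrap sample from a toy training set and reports which points were repeated and which were left out. The toy data, the helper name `bootstrap_sample`, and the fixed seed are assumptions made for illustration only; they are not taken from the study.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) points uniformly with replacement from data."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)  # fixed seed, for reproducibility of the illustration
training = [(f"x{i}", i % 2) for i in range(10)]  # toy (feature, label) pairs
sample = bootstrap_sample(training, rng)

counts = Counter(sample)
repeated = [p for p, c in counts.items() if c > 1]   # drawn more than once
missing = [p for p in training if p not in counts]   # never drawn
print(len(sample), len(repeated), len(missing))
```

The bootstrap sample always has the same size as the original sample, so every repeated point implies that some other point is missing.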
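In aggregation by majority voting, a classifier is obtained based on majority voting among individual classifiers designed on the randomized samples $S_n^*$ using the original classification rule $\Psi_n$. This leads to an ensemble classification rule $\Psi_n^E$, such that

$$\psi_n^E(x) = \Psi_n^E(S_n)(x) = \begin{cases} 1, & E\left[\Psi_n(S_n^*)(x) \mid S_n\right] > \tfrac{1}{2}, \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$

for $x \in \mathbb{R}^d$, where expectation is with respect to the random mechanism $F^*$, fixed at the observed value of $S_n$. For bootstrap majority voting, or bagging, the expectation in (1) usually has to be approximated by Monte Carlo sampling, which leads to the "bagged" classifier:

$$\psi_{n,m}^B(x) = \begin{cases} 1, & \frac{1}{m}\sum_{j=1}^{m} \Psi_n(S_{n,j}^*)(x) > \tfrac{1}{2}, \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

where the classifiers $\Psi_n(S_{n,j}^*)$ are designed by the original classification rule $\Psi_n$ on bootstrap samples $S_{n,j}^*$, for $j = 1,\ldots,m$, for large enough $m$ (notice the parallel with the development in , particularly equations (2.8)–(2.10), and accompanying discussion).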
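The Monte Carlo approximation in (2) can be sketched in a few lines of Python: design the same base rule on several bootstrap samples and take a majority vote. This is only a sketch under stated assumptions. The nearest-mean base rule, the function names, and the toy Gaussian data are hypothetical stand-ins chosen to keep the example self-contained; they are not the classification rules or data used in the study.

```python
import random

def nearest_mean_rule(sample):
    """Toy base classification rule (a stand-in, not one of the study's rules):
    returns a classifier assigning x to the class with the closer sample mean."""
    means = {}
    for label in (0, 1):
        pts = [x for x, y in sample if y == label]
        if pts:  # a bootstrap sample may miss a class entirely
            means[label] = [sum(col) / len(pts) for col in zip(*pts)]

    def classify(x):
        return min(means, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, means[c])))
    return classify

def bagged_classifier(rule, sample, m, rng):
    """Design `rule` on m bootstrap samples and combine by majority vote."""
    n = len(sample)
    members = [rule([sample[rng.randrange(n)] for _ in range(n)]) for _ in range(m)]

    def classify(x):
        votes = sum(clf(x) for clf in members)  # number of votes for label 1
        return 1 if votes > m / 2 else 0
    return classify

rng = random.Random(1)
train = ([([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(10)]
         + [([rng.gauss(2, 1), rng.gauss(2, 1)], 1) for _ in range(10)])
bagged = bagged_classifier(nearest_mean_rule, train, m=11, rng=rng)
print(bagged([0.0, 0.0]), bagged([2.0, 2.0]))
```

An odd ensemble size is used here so that the vote can never tie.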
The issue of how large $m$ has to be so that (2) is a good Monte Carlo approximation is critical in the application of bagging. Note that $m$ represents the number of classifiers that must be designed to be part of the ensemble, so that a computational problem may emerge if $m$ is made too large. In addition, even if a suitable $m$ is found, the performance of the ensemble must be compared to that of the base classification rule, to see if there is significant improvement. Even more importantly, the performance of the ensemble has to be compared to that of other classification rules; that the ensemble improves the performance of an unstable overfitting classifier is of small value if it can be bested by a single stable, nonoverfitting classifier. In the next section, we present a comprehensive empirical study that addresses these questions.
3. Experimental Study
In this section, we report the results obtained from a large simulation study based on publicly-available patient data from genomic and proteomic studies, which measured the performance of the bagging heuristic through the expected classification error, for varying number of component classifiers, sample size, and dimensionality.
We considered in our experiment several classification rules, listed here in order of complexity: diagonal linear discriminant analysis (DLDA), linear discriminant analysis (LDA), 3-nearest neighbors (3NN), decision trees (CART), and neural networks (NNET). DLDA is a variant of LDA where only the diagonal elements (the variances) of the covariance matrix are estimated, while the off-diagonal elements (the covariances) are assumed to be zero. Bagging is applied to each of these base classification rules and its performance recorded for varying numbers of individual classifiers. The neural network consists of one hidden layer with 4 nodes and standard sigmoids as nonlinearities. The network is trained by Levenberg-Marquardt optimization with a maximum of 30 iterations. CART is applied with a stopping criterion: splitting is stopped when there are fewer than 3 points in a given node. This is distinct from the approach advocated for random forests, where unpruned, fully grown trees are used instead. The reason is that we did not attempt to implement the random forest approach (which involves concepts such as random node splitting and is thus specific to decision trees), but rather to study the behavior of bagging, which is the centerpiece of such ensemble methods, across different classification rules. Resampling is done by means of balanced bootstrapping, where all samples are made to appear exactly the same number of times in the computation.
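Balanced bootstrapping, as described above, can be sketched as follows. One common construction (assumed here; the text does not say which implementation was used) concatenates as many copies of the index set as there are bootstrap samples, permutes the result at random, and cuts it into consecutive blocks. The function name and parameter values are illustrative.

```python
import random
from collections import Counter

def balanced_bootstrap_indices(n, m, rng):
    """Return m index lists of length n in which every index 0..n-1
    appears exactly m times overall (balanced bootstrap)."""
    pool = list(range(n)) * m   # m copies of every index
    rng.shuffle(pool)           # random permutation of the n*m entries
    return [pool[j * n:(j + 1) * n] for j in range(m)]

rng = random.Random(0)
samples = balanced_bootstrap_indices(n=20, m=5, rng=rng)
totals = Counter(i for s in samples for i in s)
print(sorted(set(totals.values())))  # prints [5]: each index is used 5 times in total
```

Individual bootstrap samples still contain repeats and omissions; only the total count across the whole ensemble is equalized.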
We selected data sets with a large number of samples (see below) in order to be able to estimate the true error accurately using held-out testing data. In each case, 1000 training data sets of size $n$ were drawn uniformly and independently from the total pool of $N$ samples. The training data are drawn in a stratified fashion, following the approximate proportion of each class in the original data. Based on the training data, a filter-based gene selection step is employed to select the top $d$ discriminating genes; several values of $d$ were considered in this study. The univariate feature selection methods used in the filter step are the Welch two-sample t-test and the RELIEF method; in the latter case, we employ the 1-nearest neighbor method when searching for hits and misses. After classifier design, the true classification error for each data set of size $n$ is approximated by a holdout estimator, whereby the $N - n$ sample points not drawn are used as the test set (a good approximation to the classification error, given that the test set is large). The expected classification error is then estimated as the sample mean of classification error over the 1000 training data sets. The sample size $n$ is kept small, as we are interested in the small-sample properties of bagging. Note also that $n$ must be much smaller than $N$ in order to provide for large enough testing sets, as well as to make sure that consecutive training sets do not significantly overlap, so that the expected classification error can be accurately approximated. As can be easily verified, the expected ratio of overlapping sample points between two samples of size $n$ from a population of size $N$ is given simply by $n/N$. In all cases considered here the expected overlap is around 20% or less, which we consider to be acceptable, except for the lung cancer data set at one of the sample sizes considered. This latter case is therefore not included in our results.
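The expected overlap between two training sets can be checked numerically. If two samples of size n are drawn independently and without replacement from a pool of N points, each point of the second sample belongs to the first with probability n/N, so the expected fraction of shared points is n/N. The sketch below compares this value with a simulation. The pool sizes are the class totals reported in the data set descriptions (295, 186, and 326 samples); the training sizes 20 and 40 are illustrative assumptions, not values taken from the text.

```python
import random

def expected_overlap_ratio(n, N):
    """Expected fraction of points shared by two size-n samples from N points."""
    return n / N

def simulated_overlap_ratio(n, N, trials, rng):
    """Monte Carlo estimate of the same quantity."""
    shared = 0
    population = range(N)
    for _ in range(trials):
        first = set(rng.sample(population, n))   # drawn without replacement
        second = rng.sample(population, n)
        shared += sum(1 for i in second if i in first)
    return shared / (trials * n)

rng = random.Random(0)
pools = {"breast": 115 + 180, "lung": 139 + 47, "prostate": 167 + 159}
for name, N in pools.items():
    for n in (20, 40):  # illustrative training sizes (assumption)
        exact = expected_overlap_ratio(n, N)
        approx = simulated_overlap_ratio(n, N, trials=2000, rng=rng)
        print(f"{name:9s} n={n:2d} N={N}: n/N={exact:.3f}, simulated={approx:.3f}")
```

Because the lung cancer pool is the smallest, its overlap grows fastest as the training size increases.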
The one-tailed paired t-test is employed to assess whether the ensemble classifier has an expected error that is significantly smaller than that of the corresponding individual classifier.
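A one-tailed paired test of this kind can be sketched with the Python standard library alone. The sketch computes the paired t statistic on the per-training-set error differences and, as a simplifying assumption, uses the normal approximation for the p-value, which is close to the t distribution when there are on the order of 1000 pairs. The function name and the synthetic error values are hypothetical and serve only to exercise the code.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def paired_one_tailed_test(base_errors, bagged_errors):
    """Test H0: mean(bagged - base) >= 0 against H1: mean(bagged - base) < 0.
    Returns the paired t statistic and an approximate one-sided p-value."""
    diffs = [b - a for a, b in zip(base_errors, bagged_errors)]
    k = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(k))  # requires stdev > 0
    p_value = NormalDist().cdf(t_stat)  # normal approximation to t with k-1 df
    return t_stat, p_value

rng = random.Random(0)
base = [0.30 + rng.gauss(0, 0.03) for _ in range(1000)]   # synthetic errors
bagged = [e - 0.02 + rng.gauss(0, 0.02) for e in base]     # synthetic errors
t_stat, p = paired_one_tailed_test(base, bagged)
print(t_stat < 0, p < 0.01)
```

Pairing matters because both classifiers are designed on the same training sets, so their errors are strongly correlated and the differences have much smaller variance than the errors themselves.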
3.2. Data Sets
We utilized the following publicly-available data sets from published studies in order to study the performance of bagging in the context of genomics and proteomics applications.
3.2.1. Breast Cancer Gene Expression Data
These data come from a published breast cancer classification study, which analyzed gene-expression microarrays containing a total of 25760 transcripts each. Filter-based feature selection was performed on a 70-gene prognosis profile previously published by the same authors. Classification is between the good-prognosis class (115 samples) and the poor-prognosis class (180 samples), where prognosis is determined retrospectively in terms of survivability.
3.2.2. Lung Cancer Gene Expression Data
We employed here data set "A" from a published study on non-small-cell lung carcinomas (NSCLC), which analyzed gene-expression microarrays containing a total of 12600 transcripts each. NSCLC is subclassified into adenocarcinomas, squamous cell carcinomas, and large-cell carcinomas; adenocarcinomas are the most common subtype, and it is of interest to distinguish them from the other subtypes of NSCLC. Classification is thus between adenocarcinomas (139 samples) and non-adenocarcinomas (47 samples).
3.2.3. Prostate Cancer Protein Abundance Data
Given the recent keen interest in deriving serum-based proteomic biomarkers for the diagnosis of cancer, we also included in this study data from a published proteomic study of prostate cancer. It consists of SELDI-TOF mass spectrometry of samples, which yields mass spectra for 45000 m/z (mass over charge) values. Filter-based feature selection is employed to find the top discriminatory m/z values to be used in the experiment. Classification is between prostate cancer patients (167 samples) and noncancer patients, including benign prostatic hyperplasia and healthy patients (159 samples). We use the raw spectra values, without baseline subtraction, as we found that this leads to better classification rates.
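The filter step used to pick the top discriminatory features can be sketched with the Welch two-sample t statistic. The sketch assumes that features are ranked by the absolute value of the statistic and that the highest-ranked ones are kept; the text does not spell out the ranking convention, so this is an assumption, as are the function names and the toy data. Each class needs at least two points and nonzero variance for the statistic to be defined.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch two-sample t statistic (unequal variances)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def top_features(X, y, d):
    """Rank features by |Welch t| between classes 0 and 1; return the top d indices."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((abs(welch_t(a, b)), j))
    return [j for _, j in sorted(scores, reverse=True)[:d]]

# Toy data: feature 1 separates the classes, features 0 and 2 barely do.
X = [[1.0, 0.1, 5.0], [1.2, 0.0, 4.0], [0.9, 0.2, 6.0],
     [1.1, 3.0, 5.5], [1.0, 3.2, 4.5], [1.3, 2.9, 5.2]]
y = [0, 0, 0, 1, 1, 1]
print(top_features(X, y, d=2))  # prints [1, 0]
```

In a holdout protocol the ranking should be computed on the training data only, so that the test points play no part in feature selection.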
3.3. Results and Discussion
We present results for two sample sizes and two values of the dimensionality, which are representative of the full set of results, available on the companion website http://www.ece.tamu.edu/~ulisses/bagging/index.html. One case is displayed in Tables 1, 2, and 3, each of which corresponds to a different data set. Each table displays the expected classification error as a function of the number of classifiers used in the ensemble, for different base classification rules, feature selection methods, and sample sizes. We used in all cases an odd number of classifiers in the ensembles, to avoid tie-breaking issues. Errors that are significantly smaller for the ensemble classifier than for a single classifier at the 1% significance level (99% confidence), according to the one-tailed paired t-test, are indicated by bold-face type. This allows one to immediately observe that bagging is able to improve the performance of the unstable overfitting CART and NNET classifiers; in most cases, a small ensemble is required, and the improvement in performance is substantial. In contrast, bagging does not improve the performance of the stable, nonoverfitting DLDA, LDA, and 3NN classifiers, except via a large ensemble; and even so the magnitude of the improvement is quite small, and certainly does not justify the extra computational cost (note that in the case of the simplest classification rule, DLDA, there is no improvement at all). This is in agreement with what is known about the ensemble approach.
However, of larger interest here is the performance of the ensemble against a single instance of the stable, nonoverfitting classifiers. This can be better visualized in the plots of Figures 1, 2, and 3, which display the expected classification errors as a function of the number of component classifiers in the ensemble for a representative case. The error of a single classifier is indicated by a horizontal dashed line. Marks indicate the values that are significantly smaller for the ensemble classifier than for a single component classifier at the 1% significance level (99% confidence), according to the one-tailed paired t-test. One observes that as ensemble size increases, classification error decreases and tends to converge to a fixed value, but we can also see that the error is usually larger at very small ensemble sizes, as compared to the error of the individual classifier. We can again observe that, in most cases, bagging is able to improve the performance of CART and NNET, but that it does not do so significantly, or at all, for DLDA, LDA, and 3NN. More importantly, we can see that the improvement in the performance of CART and NNET is not sufficient to beat the performance of single DLDA, LDA, or 3NN classifiers (with the exception of the prostate cancer data with RELIEF feature selection, which we comment on below).
As we can see in Figures 1–3, the breast cancer gene-expression data produce linear features that favor single DLDA and LDA classifiers (LDA does not perform so well at one of the sample sizes, due to the difficulty of estimating the entire covariance matrix at that sample size, which affects DLDA less), while the lung cancer gene-expression data produce nonlinear features, in which case, according to the results, the best option overall is to use a single 3NN classifier, followed closely by a bagged NNET under t-test feature selection and a bagged CART under RELIEF feature selection. The case of the prostate cancer proteomic data is peculiar in that it presents the only case where the best option was not a DLDA, LDA, or 3NN classifier, but in fact a single CART classifier, namely, an acute small-sample case under RELIEF feature selection (the results for t-test feature selection, on the other hand, are very similar to the ones obtained for the lung cancer data set). Note that, in this case, the best performance is achieved by a single CART classifier, rather than the ensemble CART scheme. We also point out that the classification errors obtained with t-test feature selection are smaller than the ones obtained with RELIEF feature selection, indicating that RELIEF is not a good option in this case due to the very small sample size (in fact, there is evidence that t-test filter-based feature selection may be the method of choice in small-sample cases). In another of the cases considered, the difference between 3NN and CART essentially disappears. It is also interesting that, in an acute small-sample case with RELIEF feature selection, bagging is able to improve the performance of LDA by a good margin on the prostate cancer data. This is due to the fact that the combination of LDA and RELIEF feature selection produces an unstable overfitting classification rule in this acute small-sample scenario.
The results obtained with t-test feature selection are consistent across all data sets. When using RELIEF feature selection, there is a degree of contrast between the results for the prostate cancer protein-abundance data set and the ones for the gene-expression data sets, which may be attributed to the differences in technology as well as the fact that we do not employ baseline subtraction for the proteomics data in order to achieve better classification rates.
In this paper we conducted a detailed empirical study of the ensemble approach to classification of small-sample genomic and proteomic data. The main performance issue is not whether the ensemble scheme improves the classification error of an unstable overfitting classifier (e.g., CART, NNET), or whether its classification error converges to a fixed limit, but rather whether the ensemble scheme will improve performance of the unstable overfitting classifier sufficiently to beat the performance of single stable, nonoverfitting classifiers (e.g., DLDA, LDA, and 3NN). We observed that this was never the case for any of the data sets and experimental conditions considered here, except in the case of the proteomics data set with RELIEF feature selection in acute small-sample cases, where nevertheless the performance of a single unstable overfitting classifier (in this case, CART) was better than or comparable to that of the corresponding ensemble classifier. We observed that in most cases bagging does a good (sometimes, admirable) job of improving the performance of unstable overfitting classifiers, but that improvement was not enough to beat the performance of single stable nonoverfitting classifiers.
The main message to be gleaned from this study by practitioners is that the use of bagging in classification of small-sample genomics and proteomics data increases computational cost, but is not likely to improve overall classification accuracy over other, simpler approaches. The solution we recommend is to use simple classification rules and avoid bagging in these scenarios. It is important to stress that we do not give a definitive recommendation on the use of the random forest method for small-sample genomics and proteomics data; however, we do think that this study provides a step in that direction, since the random forest method depends partly, if not significantly, for its success on the effectiveness of bagging. Further research is needed to investigate this question.
[1] Schapire RE: The strength of weak learnability. Machine Learning 1990, 5(2):197-227.
[2] Freund Y: Boosting a weak learning algorithm by majority. Proceedings of the 3rd Annual Workshop on Computational Learning Theory (COLT '90), Rochester, NY, USA, August 1990, 202-216.
[3] Xu L, Krzyzak A, Suen CY: Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics 1992, 22(3):418-435. 10.1109/21.155943
[4] Breiman L: Bagging predictors. Machine Learning 1996, 24(2):123-140.
[5] Breiman L: Random forests. Machine Learning 2001, 45(1):5-32. 10.1023/A:1010933404324
[6] Lam L, Suen CY: Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Transactions on Systems, Man and Cybernetics Part A 1997, 27(5):553-568. 10.1109/3468.618255
[7] Efron B: Bootstrap methods: another look at the jackknife. Annals of Statistics 1979, 7: 1-26. 10.1214/aos/1176344552
[8] Efron B: The Jackknife, the Bootstrap and Other Resampling Plans, CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 38. SIAM, Philadelphia, Pa, USA; 1982.
[9] Alvarez S, Diaz-Uriarte R, Osorio A, et al.: A predictor based on the somatic genomic changes of the BRCA1/BRCA2 breast cancer tumors identifies the non-BRCA1/BRCA2 tumors with BRCA1 promoter hypermethylation. Clinical Cancer Research 2005, 11(3):1146-1153.
[10] Gunther EC, Stone DJ, Gerwien RW, Bento P, Heyes MP: Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proceedings of the National Academy of Sciences of the United States of America 2003, 100(16):9608-9613. 10.1073/pnas.1632587100
[11] Díaz-Uriarte R, Alvarez de Andrés S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7: 1-13. article 3 10.1186/1471-2105-7-1
[12] Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008, 9: 1-10. article 319 10.1186/1471-2105-9-1
[13] Izmirlian G: Application of the random forest classification algorithm to a SELDI-TOF proteomics study in the setting of a cancer prevention trial. Annals of the New York Academy of Sciences 2004, 1020: 154-174. 10.1196/annals.1310.015
[14] Wu B, Abbott T, Fishman D, et al.: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 2003, 19(13):1636-1643. 10.1093/bioinformatics/btg210
[15] Geurts P, Fillet M, de Seny D, et al.: Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics 2005, 21(14):3138-3145. 10.1093/bioinformatics/bti494
[16] Zhang B, Pham TD, Zhang Y: Bagging support vector machine for classification of SELDI-TOF mass spectra of ovarian cancer serum samples. Proceedings of the 20th Australian Joint Conference on Artificial Intelligence (AI '07), Lecture Notes in Computer Science, vol. 4830, Gold Coast, Australia, December 2007, 820-826.
[17] Assareh A, Moradi M, Esmaeili V: A novel ensemble strategy for classification of prostate cancer protein mass spectra. Proceedings of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS '07), Lyon, France, August 2007, 5987-5990.
[18] Tong W, Xie Q, Hong H, et al.: Using decision forest to classify prostate cancer samples on the basis of SELDI-TOF MS data: assessing chance correlation and prediction confidence. Environmental Health Perspectives 2004, 112(16):1622-1627. 10.1289/ehp.7109
[19] Dietterich TG: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 2000, 40(2):139-157. 10.1023/A:1007607513941
[20] Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 2002, 97(457):77-87. 10.1198/016214502753479248
[21] Hansen L, Salamon P: Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 1990, 12(10):993-1001. 10.1109/34.58871
[22] Bauer E, Kohavi R: An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning 1999, 36(1):105-139. 10.1023/A:1007515423169
[23] Sohn SY, Shin HW: Experimental study for the comparison of classifier combination methods. Pattern Recognition 2007, 40(1):33-40. 10.1016/j.patcog.2006.06.027
[24] Hua J, Tembe WD, Dougherty ER: Performance of feature-selection methods in the classification of high-dimension data. Pattern Recognition 2009, 42(3):409-424. 10.1016/j.patcog.2008.08.001
[25] Efron B: Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 1983, 78(382):316-331. 10.2307/2288636
[26] Devroye L, Gyorfi L, Lugosi G: A Probabilistic Theory of Pattern Recognition. Springer, New York, NY, USA; 1996.
[27] Braga-Neto UM, Dougherty E: Classification. In Genomic Signal Processing and Statistics, EURASIP Book Series on Signal Processing and Communication. Edited by: Dougherty E, Shmulevich I, Chen J, Wang ZJ. Hindawi, New York, NY, USA; 2005, 93-128.
[28] Chernick M: Bootstrap Methods: A Practitioner's Guide. John Wiley & Sons, New York, NY, USA; 1999.
[29] Lehmann E, Romano J: Testing Statistical Hypotheses. Springer, New York, NY, USA; 2005.
[30] Kira K, Rendell LA: The feature selection problem: traditional methods and a new algorithm. Proceedings of the 10th National Conference on Artificial Intelligence (AAAI '92), San Jose, Calif, USA, July 1992, 129-134.
[31] van de Vijver MJ, He YD, van't Veer LJ, et al.: A gene-expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine 2002, 347(25):1999-2009. 10.1056/NEJMoa021967
[32] van't Veer LJ, Dai H, van de Vijver MJ, et al.: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530-536. 10.1038/415530a
[33] Bhattacharjee A, Richards WG, Staunton J, et al.: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences of the United States of America 2001, 98(24):13790-13795. 10.1073/pnas.191502998
[34] Issaq HJ, Veenstra TD, Conrads TP, Felschow D: The SELDI-TOF MS approach to proteomics: protein profiling and biomarker identification. Biochemical and Biophysical Research Communications 2002, 292(3):587-592. 10.1006/bbrc.2002.6678
[35] Adam B-L, Qu Y, Davis JW, et al.: Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Research 2002, 62(13):3609-3614.
Cite this article
Vu, T., Braga-Neto, U. Is Bagging Effective in the Classification of Small-Sample Genomic and Proteomic Data?. J Bioinform Sys Biology 2009, 158368 (2009) doi:10.1155/2009/158368
- Linear Discriminant Analysis
- Classification Error
- Classification Rule
- Ensemble Classifier
- Component Classifier