- Research
- Open Access
On optimal Bayesian classification and risk estimation under multiple classes
- Lori A. Dalton^{1, 2} and
- Mohammadmahdi R. Yousefi^{1}
https://doi.org/10.1186/s13637-015-0028-3
© Dalton and Yousefi. 2015
Received: 10 May 2015
Accepted: 21 October 2015
Published: 24 October 2015
Abstract
A recently proposed optimal Bayesian classification paradigm addresses optimal error rate analysis for small-sample discrimination, including optimal classifiers, optimal error estimators, and error estimation analysis tools with respect to the probability of misclassification under binary classes. Here, we address multi-class problems and optimal expected risk with respect to a given risk function, which are common settings in bioinformatics. We present Bayesian risk estimators (BRE) under arbitrary classifiers, the mean-square error (MSE) of arbitrary risk estimators under arbitrary classifiers, and optimal Bayesian risk classifiers (OBRC). We provide analytic expressions for these tools under several discrete and Gaussian models and present a new methodology to approximate the BRE and MSE when analytic expressions are not available. Of particular note, we present analytic forms for the MSE under Gaussian models with homoscedastic covariances, which are new even in binary classification.
1 Introduction
Classification in biomedicine is often constrained by small samples so that understanding properties of the error rate is critical to ensure the scientific validity of a designed classifier. While classifier performance is typically evaluated by employing distribution-free training-data error estimators such as cross-validation, leave-one-out, or bootstrap, a number of studies have demonstrated that these methods are highly problematic in small-sample settings [1]. Under real data and even under simple synthetic Gaussian models, cross-validation has been shown to suffer from a large variance [2] and often has nearly zero correlation, or even negative correlation, with the true error [3, 4]. Among other problems, this directly leads to severely optimistic reporting biases when selecting the best results among several datasets [5] or when selecting the best classification rule among several candidates [6] and difficulties with performance reproducibility [7].
where \(\mathcal {S}\) is a random sample, θ is a feature-label distribution, and n is the sample size. To guarantee an RMS less than 0.5 for all distributions, this bound indicates that a sample size of at least n=209 would be required. Typically, the error of a classifier should be between 0 and 0.5 so that an RMS of 0.5 is trivially guaranteed.
Rather than a distribution-free approach, recent work takes a Bayesian approach to address these problems. The idea is to assume the true distributions characterizing classes in the population are members of an uncertainty class of models. We also assume that members of the uncertainty class are weighted by a prior distribution, and after observing a sample, we update the prior to a posterior distribution. For a given classifier we may find an optimal MSE error estimator, called a Bayesian error estimator (BEE) [9, 10] and evaluate the MSE of any arbitrary error estimator [11, 12]. These quantities are found by conditioning on the sample in hand and averaging with respect to the unknown population distribution via the posterior, rather than by conditioning on the distribution and averaging over random samples as in (1). Not only does the Bayesian framework supply more powerful error estimators, but the sample-conditioned MSE allows us to evaluate the accuracy of error estimation. The Bayesian framework also facilitates optimal Bayesian classification (OBC), which provides decision boundaries to minimize the BEE [13, 14]. In this way, the Bayesian framework can be used to optimize both error estimation and classification.
Classifier design and analysis in the Bayesian framework have previously been developed for binary classification with respect to the probability of misclassification. However, it is common in small-sample classification problems to be faced with classification under multiple classes and for different types of error to be associated with different levels of risk or loss. A few classical classification algorithms naturally permit multiple classes and arbitrary loss functions; for example, a plug-in rule takes the functional form for an optimal Bayes decision rule under a given modeling assumption and substitutes sample estimates of model parameters in place of the true parameters. This can be done with linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) for multiple classes with arbitrary loss functions, which essentially assume that the underlying class-conditional densities are Gaussian with equal or unequal covariances, respectively. Most training-data error estimation methods, for instance, cross-validation, can also be generalized to handle multiple classes and arbitrary loss functions. However, it is expected that the same difficulties encountered under binary classes with simple zero-one loss functions (where the expected risk reduces to the probability of misclassification) will carry over to the more general setting, as they have in ROC curve estimation [15].
Support vector machines (SVM) are inherently binary but can be adapted to incorporate penalties that influence risk by implementing slack terms or applying a shrinkage or robustifying objective function [16, 17]. It is also common to construct multi-class classifiers from binary classifiers using the popular “one-versus-all” or “all-versus-all” strategies [18]. The former method builds several binary classifiers by discriminating one class, in turn, against all others, and at a given test point reports the class corresponding to the highest classification score. The latter discriminates between each combination of pairs of classes and reports a majority vote. However, it is unclear how one may assess the precise effect of these adaptations on the expected risk.
We are thus motivated to generalize the BEE, sample-conditioned MSE, and OBC to treat multiple classes with arbitrary loss functions. We will present analogous concepts of Bayesian risk estimation (BRE), the sample-conditioned MSE for risk estimators, and optimal Bayesian risk classification (OBRC). We will show that the BRE and OBRC can be represented in the same form as the expected risk and Bayes decision rule with unknown true densities replaced by effective densities. This approach is distinct from the simple plug-in rule discussed earlier, since the form of the effective densities may not be the same as the individual densities represented in the uncertainty class. We will also develop an interpretation of the conditional MSE based on an effective joint density, which is new even under binary classes with a zero-one loss function.
Furthermore, we will provide analytic solutions under several models: discrete spaces with Dirichlet priors (discrete models) and Gaussian distributions with known, independent scaled identity, independent arbitrary, homoscedastic scaled identity, and homoscedastic arbitrary covariance models, all with conjugate priors (Gaussian models). We provide expressions for the BRE and conditional MSE for arbitrary classification in the discrete model and binary linear classification in the Gaussian model. The analytic form that we provide for the MSE of arbitrary error estimators under homoscedastic models is completely new, with no analog in prior work even under binary classification and zero-one loss. For models in which analytic forms for the BRE and conditional MSE are unavailable, for instance, under multi-class or non-linear classification in the Gaussian model, we also discuss efficient methods to approximate these quantities. In particular, we present a new computationally efficient method to approximate the conditional MSE based on the effective joint density.
2 Notation
We denote random quantities with capital letters, e.g., Y; realizations of random variables with lowercase letters, e.g., y; and vectors in bold, e.g., X and x. Matrices will generally be in bold upper case, e.g., S. Spaces will be denoted by a stylized font, e.g., \(\mathcal {X}\). Distributions with conditioning will be made clear through the function arguments; for instance, we write the distribution of X given Y as f(x | y). The probability space of expectations will be made clear by denoting random quantities in the expectation and conditioning, e.g., the expectation of Y conditioned on the random variable X and the event C=c is denoted by E[Y | X,c]. When the region of integration is omitted from an integral, the region is the whole space. Any exceptions in notation will be defined as they arise.
3 Bayes decision theory
We next review concepts from classical Bayes decision theory. Consider a classification problem in which we are to predict one of M classes, y=0,…,M−1, from a sample drawn in feature space \(\mathcal {X}\). Let X and Y denote a random feature vector and its corresponding random label. Let f(y | c) be the probability mass function of Y, parameterized by a vector c, and for each y, let f(x | y,θ _{ y }) be the class-y-conditional density of X, parameterized by a vector θ _{ y }. The full feature-label distribution is parameterized by c and θ={θ _{0},…,θ _{ M−1}}.
is the probability that a class-y point will be assigned class i by the classifier ψ, and the \(\Gamma _{i} = \{\mathbf {x} \in \mathcal {X}: \psi (\mathbf {x}) = i\}\) partition the sample space into decision regions.
By convention, we break ties with the lowest index, i∈{0,…,M−1}, minimizing R(i,x,c,θ).
4 Optimal Bayesian risk classification
In practice, the feature-label distribution is unknown so that we must train a classifier and estimate risk or error with data. The Bayesian framework resolves this by assuming the true feature-label distribution is a member of a parameterized uncertainty class. In particular, assume that c is the probability mass function of Y, that is, c={c _{0},…,c _{ M−1}}∈Δ ^{ M−1}, where f(y | c)=c _{ y } and Δ ^{ M−1} is the standard M−1 simplex defined by c _{ y }∈[0,1] for y∈{0,…,M−1} and \(\sum _{y = 0}^{M-1} c_{y} = 1\). Also assume \(\theta _{y} \in \mathcal {T}_{y}\) for some parameter space \(\mathcal {T}_{y}\), and \(\theta \in \mathcal {T} = \mathcal {T}_{0} \times \ldots \times \mathcal {T}_{M-1}\). Let C and Θ denote random vectors for parameters c and θ, respectively. Finally, assume C and Θ are independent prior to observing data and assign prior probabilities, π(c) and π(θ).
Priors quantify uncertainty we have about the distribution before observing the data. Although non-informative priors may be used as long as the posterior is normalizable, informative priors can supplement the classification problem with information to improve performance when the sample size is small. This is key for problems with limited or expensive data. Under mild regularity conditions, as we observe sample points, this uncertainty converges to a certainty on the true distribution parameters, where more informative priors may lead to faster convergence [12]. For small samples, the performance of Bayesian methods depends heavily on the choice of prior. Performance tends to be modest but more robust with a non-informative or weakly informative prior. Conversely, informative priors offer the potential for great performance improvement, but if the true population distribution is not well represented in the prior, then performance may be poor. This trade-off is acceptable as long as the prior is an accurate reflection of available scientific knowledge so that one is reasonably sure that catastrophic results will not occur. If multiple models are scientifically reasonable but result in different inferences, and if it is not possible to determine which model is best from data or prior knowledge, then the range of inferences must be considered [19]. For the sake of illustration, in simulations, we will utilize either low-information priors or a simple prior construction method for microarray data, although modeling and prior construction remain important problems [20].
are marginal posteriors of C and Θ. Thus, independence between C and Θ is preserved in the posterior. Constants of proportionality are found by normalizing the integral of posteriors to 1. When the prior density is proper, this all follows from Bayes’ rule; otherwise, (7) and (8) are taken as definitions, where we require posteriors to be proper.
4.1 Bayesian risk estimation
The second equality follows from Fubini’s theorem, and in the last equality, X is a random vector drawn from the density in the integrand of (16). We also have f(y | S)=E[C _{ y } | S], which depends on the prior for C and is easily found, for instance, from (9) under Dirichlet posteriors. Comparing (3) and (15), observe that f(y | S) and f(x | y,S) play roles analogous to f(y | c) and f(x | y,θ _{ y }) in Bayes decision theory. We thus call f(x | y,S) the effective class-y conditional density or simply the effective density.
where \(f(\mathbf {x} \, | \, S) = \sum _{y = 0}^{M-1} f(y \, | \, S) f(\mathbf {x} \, | \, y, S)\) is the marginal distribution of x given S. Hence, the BRE of ψ is the mean of the BCRE across the sample space.
For binary classification, \(\widehat {\varepsilon }^{i, y}(\psi, S)\) has been solved in closed form as components of the BEE for both discrete models under arbitrary classifiers and Gaussian models under linear classifiers, so the BRE with an arbitrary loss function is available in closed form for both of these models. When closed-form solutions for \(\widehat {\varepsilon }^{i, y}(\psi, S)\) are not available, from (17), \(\widehat {\varepsilon }^{i, y}(\psi, S)\) may be approximated for all i and a given fixed y by drawing a large synthetic sample from f(x | y,S) and evaluating the proportion of points assigned class i. The final approximate BRE can be found by plugging the approximate \(\widehat {\varepsilon }^{i, y}(\psi, S)\) for each y and i into (15).
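The sampling approximation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callback `sample_effective`, the array layout `loss[i, y]` for λ(i,y), and the toy one-dimensional Gaussian "effective densities" in the demo are all assumptions made for the example, and the final sum assumes the BRE takes the form \(\sum _{y} \sum _{i} \lambda (i,y) \text {E}[C_{y} \, | \, S] \widehat {\varepsilon }^{i,y}(\psi, S)\).

```python
import numpy as np

def approx_bre(classify, sample_effective, class_post_mean, loss,
               n_draws=100_000, seed=0):
    """Monte Carlo approximation of the Bayesian risk estimate (BRE).

    classify(X) -> integer labels in {0, ..., M-1} for the rows of X.
    sample_effective(y, n, rng) -> n draws from the effective density
        f(x | y, S) (hypothetical callback; its form depends on the model).
    class_post_mean[y] = E[C_y | S];  loss[i, y] = lambda(i, y).
    """
    rng = np.random.default_rng(seed)
    loss = np.asarray(loss, dtype=float)
    M = loss.shape[0]
    eps_hat = np.zeros((M, M))  # eps_hat[i, y]: share of class-y draws assigned to i
    for y in range(M):
        labels = classify(sample_effective(y, n_draws, rng))
        eps_hat[:, y] = np.bincount(labels, minlength=M) / n_draws
    # Assumed form: sum over y and i of lambda(i, y) * E[C_y | S] * eps_hat[i, y].
    weights = np.asarray(class_post_mean, dtype=float)[None, :]
    return float(np.sum(loss * eps_hat * weights)), eps_hat

# Toy demo: stand-in effective densities N(0, 1) and N(2, 1), threshold classifier.
def sampler(y, n, rng):
    return rng.normal(loc=(0.0, 2.0)[y], scale=1.0, size=(n, 1))

def threshold(X):
    return (X[:, 0] > 1.0).astype(int)

bre, eps_hat = approx_bre(threshold, sampler, [0.5, 0.5],
                          [[0, 2], [1, 0]], n_draws=20_000)
```

Each column of `eps_hat` sums to one because every synthetic point is assigned to exactly one class; the accuracy of the approximation is governed by `n_draws`.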
A number of practical considerations for BEEs addressed under binary classification naturally carry over to multiple classes, including robustness to false modeling assumptions [9, 10] and a prior calibration method for microarray data analysis using features discarded by feature selection and a method-of-moments approach [21]. Furthermore, classical frequentist consistency holds for BREs on fixed distributions in the parameterized family owing to the convergence of posteriors in both the discrete and Gaussian models [12].
4.2 Optimal Bayesian risk classification
Analogously to the relationship between the BRE and expected risk, the OBRC has the same functional form as the BDR with f(y | S) substituted for the true class probability, f(y | c), and f(x | y,S) substituted for the true density, f(x | y,θ _{ y }), for all y. Closed-form OBRC are available for any model in which f(x | y,S) has been found, including discrete and Gaussian models [13]. A number of important properties also carry over, including invariance to invertible transformations, pointwise convergence to the Bayes classifier, and robustness to false modeling assumptions.
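As an illustration of this functional form, the sketch below labels a single test point by choosing the class i that minimizes \(\sum _{y} \lambda (i, y) \text {E}[C_{y} \, | \, S] f(\mathbf {x} \, | \, y, S)\). It assumes the effective densities have already been evaluated at the point; the function name and the `loss[i, y]` layout are choices made for the example only.

```python
import numpy as np

def obrc_label(eff_density_at_x, class_post_mean, loss):
    """Return the class i minimizing sum_y loss[i, y] * E[C_y | S] * f(x | y, S).

    eff_density_at_x[y] = f(x | y, S) evaluated at the test point x.
    class_post_mean[y]  = E[C_y | S].
    loss[i, y]          = loss of assigning class i to a class-y point.
    """
    weights = (np.asarray(class_post_mean, dtype=float)
               * np.asarray(eff_density_at_x, dtype=float))
    scores = np.asarray(loss, dtype=float) @ weights  # scores[i] = sum_y loss[i, y] * weights[y]
    return int(np.argmin(scores))  # np.argmin returns the lowest index on ties

zero_one = 1.0 - np.eye(2)
asymmetric = np.array([[0.0, 2.0], [1.0, 0.0]])

label_01 = obrc_label([0.3, 0.2], [0.5, 0.5], zero_one)      # most probable class
label_asym = obrc_label([0.3, 0.2], [0.5, 0.5], asymmetric)  # loss shifts the decision
label_tie = obrc_label([0.1, 0.1], [0.5, 0.5], zero_one)     # tie goes to lowest index
```

In the demo, the same point is assigned class 0 under zero-one loss but class 1 once wrongly assigning class 0 costs twice as much, which shows how the loss function moves the decision boundary.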
4.3 Sample-conditioned MSE of risk estimation
In this form, the optimality of the BRE is clear.
For binary classification with zero-one loss, the sample-conditioned MSE of the BRE converges to zero almost surely as sample size increases, for both discrete models under arbitrary classifiers and Gaussian models with independent covariances under linear classifiers [12]. Closed-form expressions for the MSE are available in these models. In this work, we extend this to multi-class discrimination under discrete models and binary linear classification under homoscedastic Gaussian models. For cases where closed-form solutions are unavailable, in the next section, we present a method to approximate the MSE.
4.4 Efficient computation
where we have used the fact that the fractional term in the integrand of the second equality is of the same form as the posterior defined in (8), updated with a new independent sample point with feature vector x and class y. Hence, the effective joint density may be easily found, once the effective density is known. Furthermore, from (29), we may approximate E[ε ^{ i,y }(ψ,Θ _{ y })ε ^{ j,z }(ψ,Θ _{ z }) | S] by drawing a large synthetic sample from f(x | y,S), drawing a single point, w, from the effective conditional density f(w | z,S∪{x,y}) for each x, and evaluating the proportion of pairs, (x,w), for which x∈Γ _{ i } and w∈Γ _{ j }. Additionally, since x is marginally governed by the effective density, from (17) we may approximate \(\widehat {\varepsilon }^{i,y}(\psi, S)\) by evaluating the proportion of x in Γ _{ i }.
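The pair-sampling procedure just described might be sketched as below. The two sampler callbacks are hypothetical and model dependent. The demo uses a toy stand-in that is not one of the models treated in this paper's appendices: a one-dimensional Gaussian with unit variance and a normal posterior on the mean with precision parameter `nu`, for which the predictive and updated predictive densities are both Gaussian.

```python
import numpy as np

def approx_second_moment(in_region_i, in_region_j, sample_effective,
                         sample_conditional, n_pairs=100_000, seed=0):
    """Approximate E[eps^{i,y} eps^{j,z} | S] by the share of pairs (x, w)
    with x in Gamma_i and w in Gamma_j.

    sample_effective(n, rng)   -> n draws of x from f(x | y, S)
    sample_conditional(X, rng) -> one draw of w per x from f(w | z, S U {x, y})
    Both samplers are hypothetical callbacks supplied by the model.
    """
    rng = np.random.default_rng(seed)
    X = sample_effective(n_pairs, rng)
    W = sample_conditional(X, rng)
    hit_x = in_region_i(X)
    first = hit_x.mean()                        # also approximates eps_hat^{i,y}
    second = (hit_x & in_region_j(W)).mean()    # share of pairs landing in both regions
    return float(second), float(first)

# Toy stand-in: x | mu ~ N(mu, 1), posterior mu ~ N(m, 1 / nu).
nu, m = 2.0, 0.0
draw_x = lambda n, rng: rng.normal(m, np.sqrt(1 + 1 / nu), size=n)
draw_w = lambda X, rng: rng.normal((nu * m + X) / (nu + 1),
                                   np.sqrt(1 + 1 / (nu + 1)))
region = lambda V: V > 0.5

second, first = approx_second_moment(region, region, draw_x, draw_w,
                                     n_pairs=50_000)
```

With i=j and y=z, the returned `second` approximates a second moment, so `second - first**2` approximates the posterior variance of the corresponding error contribution and should be non-negative up to Monte Carlo noise.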
Evaluating the OBRC, BRE, and conditional MSE requires obtaining E[C _{ y } | S], \(\text {E}[C_{y}^{2} \, | \, S]\) and E[C _{ y } C _{ z } | S] based on the posterior for C and finding the effective density, f(x | y,S), and the effective joint density, f(x,w | y,z,S), based on the posterior for Θ. At a fixed point, x, one may then evaluate the posterior probability of each class, f(y | x,S), from (19) and the BCRE from (20). The OBRC is then found from (22) or, equivalently, by choosing the class, i, that minimizes \(\sum _{y = 0}^{M-1} \lambda (i, y) \text {E}[C_{y} \, | \, S] f(\mathbf {x} \, | \, y, S)\). For any classifier, the BRE is given by (15) with \(\widehat {\varepsilon }^{i,y}(\psi, S)\) given by (16) (or equivalently (17)) using the effective density, f(x | y,S). The MSE of the BRE is then given by (24), where E[ε ^{ i,y }(Θ _{ y })ε ^{ j,z }(Θ _{ z }) | S] is given by (25) when Θ _{0},…,Θ _{ M−1} are pairwise independent and y≠z, and E[ε ^{ i,y }(Θ _{ y })ε ^{ j,z }(Θ _{ z }) | S] is otherwise found from (28) (or equivalently (29)) using the effective joint density, f(x,w | y,z,S). The MSE of an arbitrary risk estimator can also be found from (26) using the BRE and the MSE for the BRE. We summarize these tools for several discrete and Gaussian models in Appendix 1: Discrete models, Appendix 2: Gaussian models, and Appendix 3: Effective joint density lemma by providing the effective density, the effective joint density (or a related density), \(\widehat {\varepsilon }^{i,y}(\psi, S)\), and E[ε ^{ i,y }(Θ _{ y })ε ^{ j,z }(Θ _{ z }) | S].
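As a concrete instance of the three moments of C required above, suppose the posterior of C is Dirichlet. The hyperparameter symbols \(\alpha _{y}^{\ast }\) and \(\alpha _{0}^{\ast }\) are notation assumed for this illustration only. The standard Dirichlet moments then give:

```latex
\alpha_{0}^{\ast} = \sum_{y=0}^{M-1} \alpha_{y}^{\ast}, \qquad
\mathrm{E}[C_{y} \mid S] = \frac{\alpha_{y}^{\ast}}{\alpha_{0}^{\ast}}, \qquad
\mathrm{E}[C_{y}^{2} \mid S]
  = \frac{\alpha_{y}^{\ast}\,(\alpha_{y}^{\ast} + 1)}
         {\alpha_{0}^{\ast}\,(\alpha_{0}^{\ast} + 1)}, \qquad
\mathrm{E}[C_{y} C_{z} \mid S]
  = \frac{\alpha_{y}^{\ast}\,\alpha_{z}^{\ast}}
         {\alpha_{0}^{\ast}\,(\alpha_{0}^{\ast} + 1)}
  \quad (y \neq z).
```

Under any other posterior for C, these three expectations would have to be evaluated for that posterior instead.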
5 Simulation setup and results
In this section, we examine several synthetic data simulations, where random distributions and samples are generated from a low-information prior, and demonstrate the performance gain and optimality of Bayesian methods within the Bayesian framework. We also examine performance with informed priors in two real datasets.
5.1 Classification rules
We consider five classification rules: OBRC, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), linear support vector machine (L-SVM), and radial basis function SVM (RBF-SVM). We will implement OBRC under Gaussian models. We used built-in MATLAB functions to implement LDA and QDA. For a collection of binary-labeled training sample points, an SVM classifier finds a maximal margin hyperplane based on a well-behaved optimization objective function and a set of constraints. When the data are not perfectly linearly separable, introduction of slack variables in the optimization procedure leads to soft margin classifiers for which mislabeled sample points are allowed. The resulting hyperplane in the feature space is called L-SVM. Alternatively, the underlying feature space can be transformed to a higher dimensional space where the data becomes linearly separable. The equivalent classifier back in the original feature space will generally be non-linear [22, 23]. When the kernel function is a Gaussian radial basis function, we call the corresponding classifier RBF-SVM. We used the package LIBSVM, which, by default, implements a one-versus-one approach for multi-class classification [24]. Since SVM classifiers optimize relative to their own objective function (for example, hinge loss), rather than expected risk, we exclude them from our analysis when using a non-zero-one loss function.
For all classification rules, we calculate the true risk defined in (3) and (4). We find the exact value if a formula is available; otherwise, we use a test sample of at least 10,000 points generated from the true feature-label distributions, stratified relative to the true class prior probabilities. This will yield an approximation of the true risk with RMS \(\leq 1/\sqrt {4 \times 10,000} = 0.005\) [8].
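The stated bound can be checked with a short calculation. The sketch below makes simplifying assumptions: zero-one loss and independent, identically distributed test points, so that the holdout estimate \(\widehat {\varepsilon }_{\text {test}}\) is an unbiased binomial proportion with true error ε. The stratified test samples and general loss functions used here are not covered by this exact argument.

```latex
\mathrm{Var}\!\left[\widehat{\varepsilon}_{\text{test}}\right]
  = \frac{\varepsilon (1 - \varepsilon)}{n_{\text{test}}}
  \leq \frac{1}{4\, n_{\text{test}}},
\qquad
\mathrm{RMS} \leq \frac{1}{\sqrt{4\, n_{\text{test}}}}
  = \frac{1}{\sqrt{4 \times 10{,}000}}
  = \frac{1}{200} = 0.005 .
```

The inequality uses the fact that ε(1−ε) is maximized at ε=0.5, where it equals 1/4.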
5.2 Risk estimation rules
We consider four risk estimation methods: BRE, 10-fold cross-validation (CV), leave-one-out (LOO), and 0.632 bootstrap (boot). When we do not have closed-form formulae for calculating the BRE, we approximate it by drawing a sample of 1,000,000 points from the effective density of each class. In CV, the training data, S, is randomly partitioned into 10 stratified folds, S ^{(i)} for i=1,2,…,10. Each fold, in turn, is held out of the classifier design step as the test set, and a surrogate classifier is designed on the remaining folds, S∖S ^{(i)}, as the training set. The risk of each surrogate classifier is estimated using S ^{(i)}. The resulting risk values from all surrogate classifiers are then averaged to get the CV estimate. To reduce “internal variance” arising from random selection of the partitions, we average the CV estimates over 10 repetitions (10 randomly generated partitions over S). If the number of folds equals the sample size, n, then each fold consists of a single point and we get the LOO risk estimation.
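The repeated stratified cross-validation procedure can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the simulations: the `train(X, y) -> predict` interface, the nearest-mean classifier in the demo, and the choice to score each fold by the average loss over its held-out points are all assumptions of the example.

```python
import numpy as np

def stratified_folds(y, n_folds, rng):
    """Assign each point a fold id, spreading every class evenly over the folds."""
    fold = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        fold[idx] = np.arange(len(idx)) % n_folds
    return fold

def cv_risk(X, y, train, loss, n_folds=10, n_reps=10, seed=0):
    """Repeated stratified k-fold risk estimate; loss[i, y] = lambda(i, y)."""
    rng = np.random.default_rng(seed)
    loss = np.asarray(loss, dtype=float)
    rep_estimates = []
    for _ in range(n_reps):                      # repetitions reduce internal variance
        fold = stratified_folds(y, n_folds, rng)
        fold_risks = []
        for k in range(n_folds):
            test = fold == k
            if not test.any():
                continue
            predict = train(X[~test], y[~test])  # surrogate classifier
            fold_risks.append(loss[predict(X[test]), y[test]].mean())
        rep_estimates.append(np.mean(fold_risks))
    return float(np.mean(rep_estimates))

def train_nearest_mean(X, y):
    """Hypothetical stand-in classifier: assign the class with the closest mean."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    def predict(Z):
        d = ((Z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        return classes[np.argmin(d, axis=1)]
    return predict

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.repeat([0, 1], 20)
cv_estimate = cv_risk(X, y, train_nearest_mean, 1.0 - np.eye(2))
```

Any classification rule that fits the `train` interface can be substituted for the nearest-mean rule used in the demo.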
Bootstrap risk estimators are calculated using bootstrap samples of size n, where in each bootstrap sample, points are drawn, with replacement, from the original training dataset. A surrogate classifier is designed on the bootstrap sample and its risk estimated using sample points left out of the bootstrap sample. The basic bootstrap estimator is the expectation of this risk with respect to the bootstrap sampling distribution. The expectation is usually approximated by Monte Carlo repetitions (100 in our simulations) over a number of independent bootstrap samples. This estimate is known to be biased high. To reduce bias, the 0.632 bootstrap reports a linear combination of this estimate, with weight 0.632, and the resubstitution risk estimate, which is biased low, with weight 0.368 [25–27].
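A minimal sketch of the 0.632 bootstrap risk estimator follows. The `train(X, y) -> predict` interface, the nearest-mean classifier, and scoring each replicate by the average loss over left-out points are assumptions made for the example, not details taken from the simulations.

```python
import numpy as np

def boot632_risk(X, y, train, loss, n_boot=100, seed=0):
    """0.632 bootstrap: 0.632 * basic bootstrap + 0.368 * resubstitution.

    loss[i, y] is the loss of assigning class i to a class-y point.
    """
    rng = np.random.default_rng(seed)
    loss = np.asarray(loss, dtype=float)
    n = len(y)
    resub = loss[train(X, y)(X), y].mean()       # risk on the training data itself
    out_risks = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)         # draw n points with replacement
        left_out = np.setdiff1d(np.arange(n), idx)
        if left_out.size == 0:
            continue                             # no held-out points in this replicate
        predict = train(X[idx], y[idx])          # surrogate classifier
        out_risks.append(loss[predict(X[left_out]), y[left_out]].mean())
    basic = np.mean(out_risks)                   # Monte Carlo basic bootstrap estimate
    return float(0.632 * basic + 0.368 * resub)

def train_nearest_mean(X, y):
    """Hypothetical stand-in classifier: assign the class with the closest mean."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    def predict(Z):
        d = ((Z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        return classes[np.argmin(d, axis=1)]
    return predict

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.repeat([0, 1], 20)
boot_estimate = boot632_risk(X, y, train_nearest_mean, 1.0 - np.eye(2))
```

The default of 100 replicates mirrors the number of Monte Carlo repetitions quoted in the text.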
Under linear classification, the sample-conditioned MSE from (24) is found analytically by evaluating E[ε ^{ i,y }(Θ _{ y })ε ^{ j,y }(Θ _{ y }) | S] from (52), plugging in the appropriate values for k and γ ^{2} depending on the covariance model, and E[ε ^{ i,y }(Θ _{ y })ε ^{ j,z }(Θ _{ z }) | S] for z≠y are found via (25) for independent and (53) for homoscedastic covariance models, plugging in appropriate values for k and γ ^{2}. When analytic forms are not available, the sample-conditioned MSE is approximated as follows. In independent covariance models, for each sample point generated to approximate the BRE, we draw a single point from the effective conditional density with y=z, giving 1,000,000 sample point pairs to approximate E[ε ^{ i,y }(Θ _{ y })ε ^{ j,y }(Θ _{ y }) | S] for each y. In homoscedastic covariance models, to find the BRE, we have 1,000,000 points available from the effective density for each y. We generate an additional 1,000,000×(M−1) synthetic points for each y, thus allocating 1,000,000 synthetic points for each combination of y and z. For each of these points, we draw a single point from the effective conditional density of a class-z point given a class-y point. For each y and z, the corresponding 1,000,000 point pairs are used to approximate E[ε ^{ i,y }(Θ _{ y })ε ^{ j,z }(Θ _{ z }) | S].
5.3 Synthetic data
Synthetic data classification settings and prior models
Model | D | M | ν _{0},…,ν _{ M−1} | m _{0},…,m _{ M−1} | κ _{ y } (k _{ y }) | \(\frac {\mathbf {S}_{y}}{k_{y} - 2}\) | Prior (cov.) | λ |
---|---|---|---|---|---|---|---|---|
Model 1 | 2 | 2 | 12, 2 | \(\left [\begin {array}{l} 0 \\ 0 \end {array}\right ], \left [\begin {array}{l} 0.5 \\ 0.5 \end {array}\right ]\) | 6 (5) | 0.3 I _{2} | Indep. arbit. | \(\left [\begin {array}{ll} 0 & 2 \\ 1 & 0 \end {array}\right ]\) |
Model 2 | 2 | 2 | 12, 2 | \(\left [\begin {array}{l} 0 \\ 0 \end {array}\right ], \left [\begin {array}{l} 0.5 \\ 0.5 \end {array}\right ]\) | 6 (5) | 0.3 I _{2} | Homo. arbit. | \(\left [\begin {array}{ll} 0 & 2 \\ 1 & 0 \end {array}\right ]\) |
Model 3 | 2 | 5 | 12, 2, 2, 2, 2 | \(\left [\begin {array}{l} 0 \\ 0 \end {array}\right ], \left [\begin {array}{l} 1 \\ 1 \end {array}\right ], \left [\begin {array}{l} -1 \\ -1 \end {array}\right ], \left [\begin {array}{l} 1 \\ -1 \end {array}\right ], \left [\begin {array}{l} -1 \\ 1 \end {array}\right ]\) | 6 (5) | 0.3 I _{2} | Indep. arbit. | 0–1 loss |
Model 4 | 2 | 5 | 12, 2, 2, 2, 2 | \(\left [\begin {array}{l} 0 \\ 0 \end {array}\right ], \left [\begin {array}{l} 1 \\ 1 \end {array}\right ], \left [\begin {array}{l} -1 \\ -1 \end {array}\right ], \left [\begin {array}{l} 1 \\ -1 \end {array}\right ], \left [\begin {array}{l} -1 \\ 1 \end {array}\right ]\) | 6 (5) | 0.3 I _{2} | Homo. arbit. | 0–1 loss |
Model 5 | 20 | 2 | 12, 2 | 0_{20},(0.05)_{20} | −20.65 (5) | 0.3 I _{20} | Indep. iden. | \(\left [\begin {array}{ll} 0 & 2 \\ 1 & 0 \end {array}\right ]\) |
Model 6 | 20 | 2 | 20, 20 | 0_{20},0_{20} | −20.65 (5) | 0.3 I _{20} | Indep. iden. | \(\left [\begin {array}{ll} 0 & 2 \\ 1 & 0 \end {array}\right ]\) |
Model 7 | 20 | 5 | 12, 2, 2, 2, 2 | 0_{20},(0.1)_{20},(−0.1)_{20}, \(\left [\begin {array}{l} {(0.1)}_{10} \\ {(-0.1)}_{10} \end {array}\right ], \left [\begin {array}{l} {(-0.1)}_{10} \\ {(0.1)}_{10} \end {array}\right ]\) | −20.65 (5) | 0.3 I _{20} | Indep. iden. | 0–1 loss |
Model 8 | 20 | 5 | 20, 20, 20, 20, 20 | 0_{20},0_{20},0_{20},0_{20},0_{20} | −20.65 (5) | 0.3 I _{20} | Indep. iden. | 0–1 loss |
5.4 Real data
We consider two real datasets. The first is a breast cancer dataset containing 295 sample points [28], which will be used to demonstrate binary classification under a non-zero-one loss function. The second is composed of five different cancer types from The Cancer Genome Atlas (TCGA) project, which demonstrates multi-class classification under zero-one loss.
In all real-data simulations, we assume that c _{ y } is known and equal to the proportion of class-y sample points in the whole dataset. We form a Monte Carlo estimation loop to evaluate classification and risk estimation, where we iterate 1000 times with the breast cancer dataset and 10,000 times with the TCGA dataset. In each iteration, we obtain a stratified training sample of size n, i.e., we select a subset of the original dataset, keeping the proportion of points in class y as close as possible to c _{ y } for every y. We use these training points to design several classifiers, while the remaining sample points are used as holdout data to approximate the true risk of each designed classifier. For the breast cancer dataset, we also use the training data to estimate risk and find the sample-conditioned MSE of the BRE. We vary sample size and analyze its effect on performance.
To implement Bayesian methods, we assume Gaussian distributions with arbitrary independent covariances in all real-data simulations. We calibrate hyperparameters, defined in Appendix 2: Gaussian models, using a variant of the method-of-moments approach presented in [21]. In particular, we construct a calibration dataset from features not used to train the classifier and set ν _{ y }=s _{ y }/t _{ y }, \(\kappa _{y} = 2({s_{y}^{2}}/u_{y})\,+\,D\,+\,3\), m _{ y }=[m _{ y },…,m _{ y }], and S _{ y }=(κ _{ y }−D−1)s _{ y } I _{ D }, where m _{ y } is the mean of the means of features among class-y points of the calibration dataset, and s _{ y } is the mean of the variances of features in class y. t _{ y } is the variance of the means of features in class y, where the 10 % of the means with the largest absolute value are discarded. Likewise, u _{ y } is the variance of the variances of features in class y, where the 10 % of the variances with the largest value are discarded.
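The calibration formulas above can be sketched for a single class as follows. Details the text does not specify are assumptions of this example: sample variances use `ddof=1`, the number of discarded values is 10 % of the feature count rounded down, and the function and dictionary key names are illustrative.

```python
import numpy as np

def trimmed_var(values, by, frac=0.10):
    """Variance of `values` after discarding the `frac` share with the largest `by`."""
    n_drop = int(np.floor(frac * len(values)))   # assumed rounding rule
    keep = np.argsort(by)[: len(values) - n_drop]
    return np.var(values[keep], ddof=1)

def calibrate_class(calib, D):
    """Method-of-moments hyperparameters for one class.

    calib: (n_y, n_features) array holding the class-y points restricted to
           the calibration features (those not used to train the classifier).
    D:     number of features used for classification.
    """
    means = calib.mean(axis=0)                   # per-feature means
    variances = calib.var(axis=0, ddof=1)        # per-feature variances
    m = means.mean()                             # mean of the means
    s = variances.mean()                         # mean of the variances
    t = trimmed_var(means, np.abs(means))        # drop largest absolute means
    u = trimmed_var(variances, variances)        # drop largest variances
    nu = s / t
    kappa = 2 * s**2 / u + D + 3
    return {"nu": nu, "kappa": kappa,
            "m": np.full(D, m),
            "S": (kappa - D - 1) * s * np.eye(D)}

rng = np.random.default_rng(0)
hyper = calibrate_class(rng.normal(size=(30, 68)), D=2)
```

Because κ _{ y }−D−1 equals \(2{s_{y}^{2}}/u_{y} + 2\), which is positive, the resulting S _{ y } is always a positive multiple of the identity.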
In the breast cancer data, 180 patients are assigned to class 0 (good prognosis) and 115 to class 1 (bad prognosis) in a 70-feature prognosis profile. A correct prognosis is associated with 0 loss, wrongly declaring a good prognosis incurs a loss of 1, and wrongly declaring a bad prognosis incurs a loss of 2. We use pre-selected features for classifier training, originally published in [29]. When D=2, these features are CENPA and BBC3, and when D=5, we also add CFFM4, TGFB3, and DKFZP564D0462. Rather than discard the 70 − D features not used for classification, we use these features to calibrate priors using the method-of-moments approach described above.
For our second dataset, we downloaded level-3 microarray data from the TCGA data portal for five different kinds of cancers: breast invasive carcinoma (BRCA) with 593 sample points, colon adenocarcinoma (COAD) with 174 sample points, kidney renal clear cell carcinoma (KIRC) with 72 sample points, lung squamous cell carcinoma (LUSC) with 155 sample points, and ovarian serous cystadenocarcinoma (OV) with 562 sample points. We pooled all the sample points into a single dataset, removed features with missing values in any cancer type (17,016 features remained out of 17,814), and quantile-normalized the data with the median of the ranked values. We pre-select features for classifier training and prior calibration using the full dataset and one of two methods, which both operate in two phases: in phase 1, we pass D+100 features, and in phase 2, we select D features from those passing phase 1. The D features passing both phases are used for classifier training, and the features passing phase 1 but not phase 2 are used for prior calibration. The first feature selection method (FS-1) passes features that minimize a score evaluating separation between classes in phase 1 and selects features that minimize a score evaluating Gaussianity of the classes in phase 2. To evaluate separation between classes in phase 1, for each pair of classes, we obtain t-test p-values for each feature and rank these across all features, low p-values being assigned a lower rank, and finally, we report the rank product score for each feature over all 10 pairs of classes. To evaluate Gaussianity in phase 2, for each class, we rank Shapiro-Wilk test p-values across all features passing phase 1, high p-values being assigned a lower rank, and report the rank product score for each feature across all five classes. 
The second feature selection method (FS-2) passes features minimizing the rank product score from Shapiro-Wilk tests applied to all 17,016 features in phase 1, and in phase 2, we select D features from those passing phase 1 using sequential forward search (SFS) with LDA classification and resubstitution risk as the optimization criterion.
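The rank product scoring used in both feature selection methods might be sketched as below. The sketch takes a matrix of precomputed p-values as input (one row per test, one column per feature) and does not compute the t-tests or Shapiro-Wilk tests themselves. It returns the logarithm of the rank product, which orders features identically while avoiding overflow. Ties in p-values are broken arbitrarily, and the function names are illustrative.

```python
import numpy as np

def log_rank_product(pvalues, low_p_ranks_first=True):
    """Log of the rank product of each feature across the rows of `pvalues`.

    pvalues: (n_tests, n_features) array, e.g. one row per pair of classes
             (separation) or one row per class (Gaussianity).
    low_p_ranks_first: True gives low p-values a lower rank (separation);
                       False gives high p-values a lower rank (Gaussianity).
    """
    p = np.asarray(pvalues, dtype=float)
    key = p if low_p_ranks_first else -p
    ranks = np.argsort(np.argsort(key, axis=1), axis=1) + 1  # ranks start at 1
    return np.log(ranks).sum(axis=0)

def pass_features(pvalues, n_keep, low_p_ranks_first=True):
    """Indices of the n_keep features with the smallest rank product score."""
    score = log_rank_product(pvalues, low_p_ranks_first)
    return np.sort(np.argsort(score)[:n_keep])

p = np.array([[0.01, 0.50, 0.20],
              [0.03, 0.90, 0.04]])
separation_pick = pass_features(p, 2, low_p_ranks_first=True)
gaussianity_pick = pass_features(p, 1, low_p_ranks_first=False)
rank_products = np.exp(log_rank_product(p))
```

In the demo, feature 0 has the lowest p-value in both rows and so has the smallest score for separation, while feature 1 has the highest p-values and is preferred when high p-values receive the lower rank.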
5.5 Discussion
In real applications, data rarely satisfy modeling assumptions, for instance, Gaussianity, and there may be a concern that performance will suffer. Firstly, keep in mind the need to validate assumptions in the Bayesian model. For example, Gaussianity tests and homoscedasticity tests may be used to validate these underlying assumptions. Our real-data simulations demonstrate a few examples of how Gaussianity tests may be used in conjunction with Bayesian methods. Secondly, previous works have shown that Bayesian methods are relatively robust to deviations from a Gaussianity assumption [10, 14]. This is observed, for instance, in Figs. 9 and 10. Thirdly, inference from non-informative priors may serve as a reference. The OBRC under non-informative priors and an arbitrary homoscedastic covariance model behaves similarly to LDA and under an arbitrary independent covariance model behaves similarly to QDA [13, 14]. Thus, the OBRC can be seen as unifying and optimizing these classifiers. This applies in Fig. 11, where OBRC with an appropriate covariance model and non-informative prior performs indistinguishably from LDA. The conditional MSE is also an immensely useful tool to quantify the accuracy of a risk estimator. For instance, one may employ the MSE for censored sampling by collecting batches of sample points until the sample-conditioned MSE reaches an acceptable level, and either an acceptable risk has been achieved or it has been determined that an acceptable risk cannot be achieved. Lastly, although we provide analytic solutions under discrete and Gaussian models, the basic theory for this work does not require these assumptions. For instance, recent work in [30] develops a Bayesian Poisson model for RNA-Seq data, where Bayesian error estimators and optimal Bayesian classifiers are obtained using Markov chain Monte Carlo (MCMC) techniques.
6 Conclusion
We have extended optimal Bayesian classification theory to multiple classes and arbitrary loss functions, giving rise to Bayesian risk estimators, the sample-conditioned MSE for arbitrary risk estimators, and optimal Bayesian risk classifiers. We have developed a new interpretation of the conditional MSE based on effective joint densities, which is useful in developing analytic forms and approximations for the conditional MSE. We also provide new analytic solutions for the conditional MSE under homoscedastic covariance models. Simulations based on several synthetic Gaussian models and two real microarray datasets also demonstrate good performance relative to existing methods.
7 Appendix 1: Discrete models
When y≠z, \(\text {E}\left [ \varepsilon _{n}^{i,y}(\Theta _{y}) \varepsilon _{n}^{j,z}(\Theta _{z}) \, | \, S\right ]\) may be found from (25).
8 Appendix 2: Gaussian models
where g(x)=a ^{ T } x+b for some vector a and scalar b, and a superscript T denotes matrix transpose.
8.1 Known covariance
where \(\widehat {\mathbf {\mu } }_{y}\) is the usual sample mean of training points in class y. We require \(\nu _{y}^{\ast } > 0\) for a proper posterior.
where Φ(x) is the standard normal CDF. This result was also found in [10].
8.2 Homoscedastic arbitrary covariance
where \(\widehat {\mathbf {\Sigma } }_{y}\) is the usual sample covariance of training points in class y (\(\widehat {\mathbf {\Sigma }}_{y} = 0\) if n _{ y }≤1). The posteriors are proper if \(\nu _{y}^{\ast } >0\), κ ^{∗}>D−1 and S ^{∗}≻0.
This result was also found in [10].
where T(x,y,ρ,d) is the joint CDF of a standard bivariate student t random vector with correlation ρ and d degrees of freedom.
8.3 Independent arbitrary covariance
The posteriors are proper if \(\nu _{y}^{\ast } >0\), \(\kappa _{y}^{\ast } >D-1\) and \(\mathbf {S}_{y}^{\ast } \succ 0\).
The effective density for class y is multivariate student t as in (44) with \(k_{y} = \kappa _{y}^{\ast }-D+1\) and \(\mathbf {S}_{y}^{\ast }\) in place of k and S ^{∗}, respectively [13]. Further, (45) also holds with \(m_{\textit {iy}} = (-1)^{i} g(\mathbf {m}_{y}^{\ast })\) and with k _{ y } and \({\gamma _{y}^{2}} = \mathbf {a}^{T} \mathbf {S}_{y}^{\ast } \mathbf {a}\) in place of k and γ ^{2}, respectively. Under binary linear classification, \(\widehat {\varepsilon }^{i,y}(\psi, S)\) is given by (46) with k _{ y } and \({\gamma _{y}^{2}}\) in place of k and γ ^{2}. The same result was found in [10]. E[ε ^{ i,y }(Θ _{ y })ε ^{ j,y }(Θ _{ y }) | S] is solved similarly to before, resulting in (47), (50), (51), and ultimately (52), with k _{ y }, \(\mathbf {S}_{y}^{\ast }\) and \({\gamma _{y}^{2}}\) in place of k, S ^{∗}, and γ ^{2}, respectively. E[ε ^{ i,y }(Θ _{ y })ε ^{ j,z }(Θ _{ z }) | S] for y≠z is found from (25).
8.4 Homoscedastic scaled identity covariance
with hyperparameters \(\kappa \in \mathbb {R}\) and S, a symmetric D×D real matrix. When ν _{ y }>0, π(μ _{ y } | σ ^{2}) is a multivariate Gaussian distribution with mean m _{ y } and covariance Σ _{ y }/ν _{ y }, and when (κ+D+1)D>2 and S≻0, π(σ ^{2}) is a univariate inverse-Wishart distribution. If in addition (κ+D+1)D>4, then \(\text {E}[\sigma ^{2}] = \frac {\text {trace} (\mathbf {S})}{(\kappa +D+1)D - 4}\). The form of (57) has been designed so that the posterior is of the same form as the prior with the same hyperparameter update equations given in the arbitrary covariance models, (35) and (43). We require \(\nu _{y}^{\ast } > 0\), (κ ^{∗}+D+1)D>2, and S ^{∗}≻0 for a proper posterior.
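As a quick check of the stated mean, the derivation below assumes that (57), which is not reproduced in this section, has the univariate inverse-Wishart form \(\pi (\sigma ^{2}) \propto (\sigma ^{2})^{-(\kappa +D+1)D/2} \exp (-\text {trace}(\mathbf {S})/(2\sigma ^{2}))\). This assumed form is chosen because it agrees with both the properness condition and the expectation given above; the symbols α and β are introduced here for illustration only.

```latex
% Assumed form of (57); \alpha and \beta are illustrative shape and scale parameters.
\begin{align*}
\pi(\sigma^{2}) &\propto (\sigma^{2})^{-\alpha-1}
  \exp\left(-\frac{\beta}{\sigma^{2}}\right),
  \qquad \alpha = \frac{(\kappa+D+1)D-2}{2},
  \qquad \beta = \frac{\text{trace}(\mathbf{S})}{2}, \\
\text{E}[\sigma^{2}] &= \frac{\beta}{\alpha-1}
  = \frac{\text{trace}(\mathbf{S})}{(\kappa+D+1)D-4}
  \qquad \text{for } \alpha > 1, \text{ i.e., } (\kappa+D+1)D > 4.
\end{align*}
```

Under this reading, π(σ ^{2}) is an inverse-gamma density, which is normalizable when α>0 and β>0. These conditions match (κ+D+1)D>2 and S≻0, since S≻0 implies trace(S)>0.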
Let P=(−1)^{ i } g(X). Since P is an affine transformation of a multivariate student t random variable, again it has the same form as in (45) with k=(κ ^{∗}+D+1)D−2, \(m_{\textit {iy}} = (-1)^{i} g(\mathbf {m}_{y}^{\ast })\), and γ ^{2}=trace(S ^{∗})a ^{ T } a. Following the same steps as in the homoscedastic arbitrary covariance model, under binary linear classification, \(\widehat {\varepsilon }^{i,y}(\psi, S)\) is given by (46) with the appropriate choice of k, m _{ iy }, and γ ^{2}. This was found in [10].
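The claim that an affine function of a multivariate student t random variable is again student t can be checked numerically. The sketch below is an illustration under stated assumptions, not an implementation of (45) or (46): it takes a multivariate student t vector whose scale matrix is a scalar multiple of the identity, with the scalar called `scale_sq` here, and compares a Monte Carlo estimate of Pr(P ≤ 0) obtained by transforming the vector against one obtained by sampling a univariate student t directly with location (−1)^{ i } g(m) and squared scale `scale_sq` times a ^{ T } a. All function names and numeric values are hypothetical, and the constants that relate `scale_sq` to the posterior hyperparameters are deliberately left out.

```python
import math
import random


def sample_t_vector(rng, loc, scale_sq, dof):
    """Draw a multivariate student t vector with scale matrix scale_sq * I."""
    chi2 = rng.gammavariate(dof / 2.0, 2.0)  # chi-square with `dof` degrees of freedom
    stretch = math.sqrt(scale_sq * dof / chi2)
    return [m + stretch * rng.gauss(0.0, 1.0) for m in loc]


def g(x, a, b):
    """Linear discriminant g(x) = a^T x + b."""
    return sum(ai * xi for ai, xi in zip(a, x)) + b


def prob_multivariate(rng, i, a, b, loc, scale_sq, dof, n):
    """Estimate Pr(P <= 0) with P = (-1)^i g(X), sampling X directly."""
    sign = (-1) ** i
    hits = 0
    for _ in range(n):
        x = sample_t_vector(rng, loc, scale_sq, dof)
        hits += sign * g(x, a, b) <= 0.0
    return hits / n


def prob_univariate(rng, i, a, b, loc, scale_sq, dof, n):
    """Estimate Pr(P <= 0) by sampling P as a univariate student t."""
    m = (-1) ** i * g(loc, a, b)                  # location of P
    s2 = scale_sq * sum(ai * ai for ai in a)      # squared scale of P
    hits = 0
    for _ in range(n):
        chi2 = rng.gammavariate(dof / 2.0, 2.0)
        hits += m + math.sqrt(s2 * dof / chi2) * rng.gauss(0.0, 1.0) <= 0.0
    return hits / n


# Arbitrary illustrative values (assumptions, not taken from the paper).
rng = random.Random(0)
a, b = [1.0, 2.0, -1.0], 0.3
loc, scale_sq, dof = [0.5, -0.2, 0.1], 0.4, 5.0
p_mv = prob_multivariate(rng, 0, a, b, loc, scale_sq, dof, 20000)
p_uni = prob_univariate(rng, 0, a, b, loc, scale_sq, dof, 20000)
print(p_mv, p_uni)
```

The two estimates should agree up to Monte Carlo error, which is the property that allows the risk contribution to be written in terms of a univariate student t CDF.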
E[ε ^{ i,y }(Θ _{ y })ε ^{ j,z }(Θ _{ z }) | S] can be found from (29) by defining P=(−1)^{ i } g(X) and Q=(−1)^{ j } g(W). Following the same steps as in the homoscedastic arbitrary covariance model, one can show that E[ε ^{ i,y }(Θ _{ y })ε ^{ j,z }(Θ _{ z }) | S] is equivalent to (52) when y=z and (53) when y≠z, where we plug in appropriate values for k, m _{ iy } and γ ^{2}.
8.5 Independent scaled identity covariance
where \(\pi (\mathbf {\mu }_{y} \, | \, {\sigma _{y}^{2}})\) is of the same form as in (34) with hyperparameters \(\nu _{y} \in \mathbb {R}\) and \(\mathbf {m}_{y} \in \mathbb {R}^{D}\), and \(\pi ({\sigma _{y}^{2}})\) is of the same form as in (57) with hyperparameters \(\kappa _{y} \in \mathbb {R}\) and S _{ y }, a symmetric D×D real matrix. The posterior is of the same form as the prior with the same hyperparameter update equations in (35) and (55). We require \(\nu _{y}^{\ast } > 0\), \((\kappa _{y}^{\ast } +D+1)D > 2\) and \(\mathbf {S}_{y}^{\ast } \succ 0\) for a proper posterior.
The effective density for class y is multivariate student t, as in (58) with \(k_{y} = (\kappa _{y}^{\ast } + D + 1)D-2\) and \(\mathbf {S}_{y}^{\ast }\) in place of k and S ^{∗}, respectively [13]. Under binary linear classification, \(\widehat {\varepsilon }^{i,y}(\psi, S)\) is given by (46) with \(m_{\textit {iy}} = (-1)^{i} g(\mathbf {m}_{y}^{\ast })\) and with k _{ y } and \({\gamma _{y}^{2}} = \text {trace} (\mathbf {S}_{y}^{\ast }) \mathbf {a}^{T} \mathbf {a}\) in place of k and γ ^{2}. The effective joint density, f(x,w | y,y,S), is solved as before, resulting in (59) and (61) with k _{ y } and \(\mathbf {S}_{y}^{\ast }\) in place of k and S ^{∗}, respectively. Further, E[ε ^{ i,y }(Θ _{ y })ε ^{ j,y }(Θ _{ y }) | S] is solved from (51) resulting in (52), with k _{ y } and \({\gamma _{y}^{2}}\) in place of k and γ ^{2}, respectively. E[ε ^{ i,y }(Θ _{ y })ε ^{ j,z }(Θ _{ z }) | S] for y≠z is found from (25).
9 Appendix 3: Effective joint density lemma
The lemma below is used to derive the effective joint density of Gaussian models in Appendix 2: Gaussian models.
Lemma 1.
where \(K = \frac {\nu _{z}^{\ast } + 1}{\nu _{z}^{\ast }}\) when I=0 and \(K = \frac {\nu _{y}^{\ast } + 1}{\nu _{y}^{\ast }}\) when I=1.
Proof.
Declarations
Acknowledgements
The results published here are in part based upon data generated by The Cancer Genome Atlas (TCGA) established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at http://cancergenome.nih.gov. The work of LAD is supported by the National Science Foundation (CCF-1422631 and CCF-1453563).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- ER Dougherty, A Zollanvari, UM Braga-Neto, The illusion of distribution-free small-sample classification in genomics. Curr. Genomics. 12(5), 333–341 (2011).
- UM Braga-Neto, ER Dougherty, Is cross-validation valid for small-sample microarray classification? Bioinformatics. 20(3), 374–380 (2004).
- B Hanczar, J Hua, ER Dougherty, Decorrelation of the true and estimated classifier errors in high-dimensional settings. EURASIP J. Bioinforma. Syst. Biol. 2007(Article ID 38473), 12 (2007).
- UM Braga-Neto, ER Dougherty, Exact performance of error estimators for discrete classifiers. Pattern Recogn. 38(11), 1799–1814 (2005).
- MR Yousefi, J Hua, C Sima, ER Dougherty, Reporting bias when using real data sets to analyze classification performance. Bioinformatics. 26(1), 68 (2010).
- MR Yousefi, J Hua, ER Dougherty, Multiple-rule bias in the comparison of classification rules. Bioinformatics. 27(12), 1675–1683 (2011).
- MR Yousefi, ER Dougherty, Performance reproducibility index for classification. Bioinformatics. 28(21), 2824–2833 (2012).
- L Devroye, L Gyorfi, G Lugosi, A probabilistic theory of pattern recognition. Stochastic modelling and applied probability (Springer, New York, 1996).
- LA Dalton, ER Dougherty, Bayesian minimum mean-square error estimation for classification error–part I: definition and the Bayesian MMSE error estimator for discrete classification. IEEE Trans. Signal Process. 59(1), 115–129 (2011).
- LA Dalton, ER Dougherty, Bayesian minimum mean-square error estimation for classification error–part II: the Bayesian MMSE error estimator for linear classification of Gaussian distributions. IEEE Trans. Signal Process. 59(1), 130–144 (2011).
- LA Dalton, ER Dougherty, Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error–part I: representation. IEEE Trans. Signal Process. 60(5), 2575–2587 (2012).
- LA Dalton, ER Dougherty, Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error–part II: consistency and performance analysis. IEEE Trans. Signal Process. 60(5), 2588–2603 (2012).
- LA Dalton, ER Dougherty, Optimal classifiers with minimum expected error within a Bayesian framework–part I: discrete and Gaussian models. Pattern Recogn. 46(5), 1301–1314 (2013).
- LA Dalton, ER Dougherty, Optimal classifiers with minimum expected error within a Bayesian framework–part II: properties and performance analysis. Pattern Recogn. 46(5), 1288–1300 (2013).
- B Hanczar, J Hua, C Sima, J Weinstein, M Bittner, ER Dougherty, Small-sample precision of ROC-related estimates. Bioinformatics. 26, 822–830 (2010).
- H Xu, C Caramanis, S Mannor, S Yun, in Proceedings of the 48th IEEE Conference on Decision and Control, CDC 2009. Risk sensitive robust support vector machines (IEEE, New York, 2009), pp. 4655–4661.
- H Xu, C Caramanis, S Mannor, Robustness and regularization of support vector machines. J. Mach. Learn. Res. 10, 1485–1510 (2009).
- CM Bishop, Pattern recognition and machine learning, vol. 4 (Springer, New York, NY, 2006).
- A Gelman, JB Carlin, HS Stern, DB Rubin, Bayesian data analysis, vol. 2, 3rd edn. (2014).
- MS Esfahani, ER Dougherty, Incorporation of biological pathway knowledge in the construction of priors for optimal Bayesian classification. IEEE/ACM Trans. Comput. Biol. Bioinform. 11(1), 202–218 (2014).
- LA Dalton, ER Dougherty, Application of the Bayesian MMSE estimator for classification error to gene expression microarray data. Bioinformatics. 27(13), 1822–1831 (2011).
- BE Boser, IM Guyon, VN Vapnik, in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92. A training algorithm for optimal margin classifiers (ACM, New York, NY, USA, 1992), pp. 144–152.
- C Cortes, V Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995).
- C-C Chang, C-J Lin, LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011).
- B Efron, Bootstrap methods: another look at the jackknife. Ann. Stat. 7(1), 1–26 (1979).
- B Efron, RJ Tibshirani, An introduction to the bootstrap (CRC Press, Boca Raton, FL, 1994).
- B Efron, Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc. 78(382), 316–331 (1983).
- MJ van de Vijver, YD He, LJ van ’t Veer, H Dai, AAM Hart, DW Voskuil, GJ Schreiber, JL Peterse, C Roberts, MJ Marton, M Parrish, D Atsma, A Witteveen, A Glas, L Delahaye, T van der Velde, H Bartelink, S Rodenhuis, ET Rutgers, SH Friend, R Bernards, A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med. 347(25), 1999–2009 (2002).
- A Zollanvari, UM Braga-Neto, ER Dougherty, On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers. Pattern Recogn. 42(11), 2705–2723 (2009).
- JM Knight, I Ivanov, ER Dougherty, MCMC implementation of the optimal Bayesian classifier for non-Gaussian models: model-based RNA-Seq classification. BMC Bioinformatics. 15(1), 401 (2014).
- S Kotz, S Nadarajah, Multivariate T distributions and their applications (Cambridge University Press, New York, 2004).
- NL Johnson, S Kotz, N Balakrishnan, Continuous univariate distributions, vol. 2, 2nd edn. (John Wiley & Sons, Hoboken, NJ, 1995).