Bayesian inference for biomarker discovery in proteomics: an analytic solution

This paper addresses the question of biomarker discovery in proteomics. Given clinical data regarding a list of proteins for a set of individuals, the tackled problem is to extract a short subset of proteins the concentrations of which are an indicator of the biological status (healthy or pathological). In this paper, it is formulated as a specific instance of variable selection. The originality is that the proteins are not investigated one after the other but the best partition between discriminant and non-discriminant proteins is directly sought. In this way, correlations between the proteins are intrinsically taken into account in the decision. The developed strategy is derived in a Bayesian setting, and the decision is optimal in the sense that it minimizes a global mean error. It is finally based on the posterior probabilities of the partitions. The main difficulty is to calculate these probabilities since they are based on the so-called evidence that require marginalization of all the unknown model parameters. Two models are presented that relate the status to the protein concentrations, depending whether the latter are biomarkers or not. The first model accounts for biological variabilities by assuming that the concentrations are Gaussian distributed with a mean and a covariance matrix that depend on the status only for the biomarkers. The second one is an extension that also takes into account the technical variabilities that may significantly impact the observed concentrations. The main contributions of the paper are: (1) a new Bayesian formulation of the biomarker selection problem, (2) the closed-form expression of the posterior probabilities in the noiseless case, and (3) a suitable approximated solution in the noisy case. The methods are numerically assessed and compared to the state-of-the-art methods (t test, LASSO, Battacharyya distance, FOHSIC) on synthetic and real data from proteins quantified in human serum by mass spectrometry in selected reaction monitoring mode.


Introduction
It is now generally recognized that protein expression analysis is crucial in explaining the changes that occur as a part of disease pathogenesis [1,2]. In this context, recent advances in mass spectrometry (MS) technologies have facilitated the investigation of proteins over a wide range of molecular weights in small biological specimens from blood or urine samples, for instance. Notably, MS the same biological status [5], and (2) the technical variability, which originates from the imperfections of the measurement process used to obtain the concentrations. Failing to address both of these variabilities within a technique for biomarker identification may significantly impair its performance by resulting in erroneous decision.
Furthermore, since the complexity of a status is unlikely to be manifested through the changes in the characteristics of just one protein, it has generally been acknowledged that a set of proteins should be considered [5][6][7][8].
An additional difficulty is that they are possibly correlated, imposing the use of multivariate models to account for all the data simultaneously. These aforementioned issues pose significant challenges in developing efficient and robust statistical techniques for the identification of biomarkers.
The paper tackles the problem of biomarker identification by adopting a Bayesian approach to propose the selection of the optimal set of variables. By providing an elegant and mathematically rigorous framework for incorporating the data and the prior information within a joint probabilistic model, the Bayesian setting allows straightforward modeling of both the technical and the biological variabilities of the data.
The remainder of the paper is organized as follows. Section 2 summarizes the state-of-the-art variable selection methods, discusses their main challenges, and outlines our principal contributions. Section 3 presents the proposed formulation within the Bayesian framework, the proposed models for the data, and the decision strategy. Section 4 describes the data used in the numerical evaluations, together with the results and their analysis. Finally, conclusions are drawn in Section 5. A detailed description of the model and the derivation of the analytic solution is provided in Appendix.

Related work
The identification of biomarkers for diagnosis or prognosis can be classically formulated as a variable selection problem, and this problem has been paid a lot of attention as a specific instance of model choice. Various methodologies exist that can be broadly classified in two categories: the frequentist hypothesis testing and the Bayesian decision-making.
Frequentist hypothesis testing consists in deciding between two statements, classically referred to as the null and the alternative hypotheses, by comparing a function of the observed data to a threshold. The reader is invited to consult [9] for a comprehensive overview. Two closely related methods have been proposed. On the one hand, Neyman-Pearson tests are designed to ensure the so-called type I error. On the other hand, p values focus on how strongly the data reject the null hypothesis H 0 by evaluating the probability of obtaining a value as extreme as the observed one given H 0 is true. In biomarker discovery, a popular approach consists in testing a mean difference between the case and the negative controlled populations using the classical Students' t test or its variants [7]. The latter is a statistical hypothesis test which indicates whether the difference between two group means most likely reflects that they are samples of two different populations or, on the contrary, is merely explained by the sampling fluctuation. However, as the number of candidate biomarkers increases, multiple hypothesis testing is required resulting in a higher computational cost which may become prohibitive [10]. A first solution is to perform univariate tests for each protein.
This procedure requires an adapted control of the rate of type I errors in this particular setting where multiple hypothesis tests are conducted simultaneously. Two types of procedures were proposed for this purpose, namely the so-called family wise error rate or the false discovery rate [11,12]. A common criticism of frequentist approaches is that they fail to take into account prior information about the problem at hand such as interdependencies between the different variables.
For a large number of candidate variables, regression analysis [13] provides an alternative to the abovementioned methods. The principle is to assume that a given outcome is related to a linear combination of a set of explanatory variables called the predictors. In proteomics, logistic regression models are considered that express the probability to have a disease as a function of the protein abundances [14][15][16]. Then, variable selection is classically performed using stepwise procedures that consists in successively adding or removing predictors, estimating the regression coefficients, and evaluating the goodness of fit of the subsequent model. Different criteria can be considered such as the R-squared, the adjusted R-squared, or the Akaike Information Criterion [17,18]. Such techniques are referred to as backward elimination and forward selection, respectively [13]. However, these selection procedures are prone to overfitting and the variance of the parameter estimates becomes high in the presence of correlated predictors. Regularization methods alleviate these difficulties by considering the minimum of a penalized least squares error as estimate. Since the Ridge regression in 1970 [19], several algorithms have been proposed that differ between one another with respect to the considered penalization of the regression parameters. The well-known LASSO [20] considers a L 1 -norm penalty and has the advantage of directly removing irrelevant predictors by shrinking their coefficients to zero. More recently, the elastic net [21] which combines the advantages of the Ridge and LASSO regressions has been proposed. In the presence of correlated variables, it outperforms LASSO by favoring the selection of sets of variables. A comparison between these methods in application to genome selection is presented in [22]. Although widely used, regression analysis is based on an ad hoc model that may not reflect the physical nature of the observed data. Further, it does not explicitly accommodate correlations between the candidate biomarkers as well as measurement errors.
The Bayesian framework offers an alternative formulation of model selection. The candidate models are assigned prior probabilities that are combined with the likelihood function to yield the so-called posterior probability. The latter summarizes all the available information to make the decision. In this context, deciding in favor of the a posteriori most probable model is optimal in the sense that it minimizes the risk associated to the 0/1 costfunction. There have been a lot of debates over the use of Bayesian techniques in place of frequentist approaches, but they do not address exactly the same question. Frequentist methods are designed to test the departure of the data from a pre-defined null hypothesis. In contrast, Bayesian selection procedures evaluate the plausibility of a given hypothesis given a set of candidate hypotheses hence are conveniently well-suited to multiple hypotheses testing. Thus, non-nested models can be compared in a straightforward manner. Another fundamental difference is the treatment of unknown model parameters.
In the frequentist approach, they are classically replaced by estimates whereas in the Bayesian formulation, they are integrated. The latter procedure has the advantage of automatically penalizing complex models, as discussed in [23], but often leads to intractable calculations. An additional advantage is that correlations between the model variables can be easily accounted for in the design of the prior distributions. As for the integration over the unknown parameters, several solutions have been developed. The Laplace approximation of the integrand leads to the well-known Bayesian information criterion (BIC). As an alternative, numerical integrations can be performed based on stochastic sampling techniques such as Markov Chain Monte Carlo (MCMC) methods [24]. Either across or within model-based techniques can be considered. In the first case, the model index is sampled jointly with the parameters conditionally upon the observations. A well-known algorithm is the Reversible-Jump MCMC but moves between the different parameter spaces are difficult to design. In the second case, posterior samples of the parameters are generated conditionally upon each candidate model and then used to evaluate the integrated likelihood, also called evidence [25]. Nevertheless, the harmonic mean-based estimator exhibits instabilities [26]. Applications of the MCMC Bayesian model selection methods in genomics can be found in [27,28].
In this paper, a Bayesian setting is adopted to identify a set of protein biomarkers from experimental data consisting of measured protein concentrations and the associated biological statuses of a population of individuals. The novelty is that the decision is not made protein by protein. As an alternative, the problem is formulated as directly finding the best partition of the list of proteins into two subsets, namely discriminant and non-discriminant, in the sense that it yields the highest posterior probability. Regardless of their discriminative power, the proteins are assumed Gaussian distributed. However, for the subset of biomarkers, the parameters of the Gaussian distribution take different values depending on the biological status whereas this is not the case for the second subset of proteins. The preliminary version of this hierarchical model has been presented in [29]. Its advantages are threefold. First, it is not based on an ad hoc explanatory model unlike regression analysis. Second, the proteins within a given group are assumed a priori correlated and the dependency structure is integrated out along with the remainder of the unknown model parameters so that only the protein partition is estimated. Thus, our approach intrinsically takes into account correlations between the candidate biomarkers. Third, by choosing appropriate conjugate prior distributions for the parameters, the model evidences can be calculated in closed form and there is no need to resort to computationally extensive numerical techniques. Finally, we show that our hierarchical model can be easily extended to address errors in the measured concentrations.

Problem formulation, proposed models, and methods
To formulate the biomarker selection problem and construct its solution in the proposed framework, we first introduce the basic modeling for the relevant quantities/variables at hand: the biological status, protein concentrations, number of individuals,. . . including the descriptions of the considered observation models.

Distribution for status and concentration
Regarding the biological status, it is denoted by b and takes two values, H and P, for healthy and pathological. It is conveniently described by a Bernoulli random variable B with parameter p Regarding the proteins, let us note P is their number and x ∈ R P is the collection of their concentrations. Each protein can be discriminant or non-discriminant and then accordingly labeled by + or −. The vector x + and x − , with sizes P + and P − (we have P = P + + P − ), stand for the respective concentrations. Otherwise, within the P proteins, there are 2 P possible partitions referred to as δ ∈ {+, −} P since each protein can be discriminant and non-discriminant. Following the clinically observed behavior, from a probabilistic standpoint, the concentrations are described by normal distributions in order to account for biological variabilities within the populations. Specifically, for the non-discriminant proteins, the concentration vector x − is modeled by a unique multivariate normal distribution with common parameters (m C , C ) regardless the biological status: On the other hand, for discriminant ones, the concentration vector x + is modeled, conditionally on status b, by a multivariate normal distribution with mean and precision (m H , H ) and (m P , P ) for healthy and pathological, respectively: Marginally, the concentrations of discriminant proteins are distributed according to a mixture of two Gaussian distributions In addition, it is assumed that X + and X − are conditionally independent. The parameters of the distributions are collected in the vector θ = m P , P , m H , H , m C , C , p considered as unknown. It is important to keep in mind that the quantity of interest is the partition δ.
Distribution for the individuals The total number of individual is N and (x n , b n ) is the nth concentration vector and status. They are modeled as independent conditionally on θ . Let us denote x (size P × N) as the matrix of concentrations and b (size N) as the vector of biological statuses. Also, let I H and I P be the subsets of indices for healthy and pathological individuals, respectively, and N H , N P their cardinality. For notational convenience, we introduce: N C = N H + N P and I C = I P ∪ I H (where N C = N and I C = {1, 2, . . . , N}).
Given the models (1) for the status and (2)-(3) for the non-discriminant and the discriminant proteins, based on the assumptions that X + n and X − n are conditionally uncorrelated and that the individual concentrations are also conditionally independent, the distribution of the concentrations and status (x, b), given the unknown parameters θ and partition δ, is: It can be seen (see Appendix 1) that the exponential arguments of the Gaussian distributions in the first three factors can be reformulated based on the empirical means and covariances for each index set I P , I H , and I C . The result is shown here only for the first factor: wherex + P andR + P are the empirical mean and covariance of the x + n for the individuals n ∈ I P . Moreover, regarding the probability for the statuses, we have: that is only based on the size of each index set I P and I H .
Observations Given the previously described concentrations, the proposed developments include two cases for the observation model: 1. In the first one, the concentrations x n are directly observed. 2. The second one accounts for noise: observations write y n = x n + ε n , where ε n is modeled as a zero-mean Gaussian vector with precision ε .
Both of them account for biological variabilities and the latter also includes technological variabilities that arise from both the functioning of the measurement system itself and the post-processing of the spectra. These models are referred to as "noiseless model" and "noisy model". The corresponding variable selection methods are respectively presented in Sections 3.1 and 3.2.

Prior distributions
The choice of the prior distribution for the unknown parameters is important. First of all, it must allow us to account for available information (e.g., nominal values and uncertainties, strong uncertainties,. . . ). Second, it should also enable analytical calculations or numerical computations. To this end, the prior density is chosen as a separable prior distribution, i.e., for the noiseless case for (m P , P ), (m H , H ), (m C , C ), and p. For the noisy case, in addition, we have a factor π( ε ) for ε independent from the other variables. Regarding these five variables individually, the choice is driven by a conjugation principle [30]: • The probability p is assumed a Beta distributed variable with parameter (a, b).
In the subsequent developments, we proceed with the calculation of the posterior probability for the partitions δ in the two cases: noiseless concentrations in Section 3.1 and noisy concentrations in Section 3.2. One of the novelty is an explicit analytical result for the noiseless case and a precise approximation for the noisy case.

Selection using the noiseless data
Optimal decision-maker The question of the paper is the one of the identification of a set of discriminant proteins, and it amounts to making a decision regarding the partition δ. To build an optimal decision-maker, a binary loss is considered that assigns a null loss to the correct decision and a unitary loss to the incorrect decisions. The risk is the mean loss over the models δ, the data (x, b), and the unknown parameters θ . The optimal decisionmaker is defined as the risk minimizer, and it is known to be the one that selects the most a posteriori probable model. It should be noted that alternative loss functions could be chosen, for instance, one that would penalize differently erroneous partitions depending on the number of biomarkers properly identified. In this case, the decision would still be based on the posterior probabilities but with a different rule. However, our choice not only leads to a simple identification procedure but also naturally prevents overfitting.
Thus, the point is to calculate the posterior probability P |X,B (δ|x, b) for each candidate model δ. It is carried out using the Bayes rule as: and it crucially depends on the so-called evidence which can be rewritten by factorization This calculation is the main difficulty of the paper and more generally in variable and model selection.
In order to carry out this calculation, let us note that the likelihood f X,B| , (x, b|θ , δ) factorizes (see Eq. (4)) and that the prior π | (θ|δ) also factorizes (see Eq. (6)). So, we have: that can itself be factorized in four integrals: three w.r.t. the couple of variable regarding the concentrations (m P , P ), (m H , H ), (m C , C ) and one w.r.t. the prevalence p: where the integrals w.r.t. the m × , × read with × ∈ {P, H, C} and ∈ {+, −} and the integral w.r.t. p reads As far as the three integrals I × are concerned, the calculations are founded on the reduced form (5) for the likelihood (including empirical means and variances) and on the Normal-Wishart prior (23) in Appendix 2 for the (m × , × ). Practically, thanks to the conjugacy property, the integrand in (9) involves the posterior for (m × , × ) which is Normal-Wishart with parameters where "pst" stands for posterior. The finalization of the development relies on the fact that the Normal-Wishart density sums to one: without effective complicate calculus, this yields the result as the ratio of normaliation constants where KN W is the normalizing constant of the Normal-Wishart density given by (24) in Appendix 2 and where "pri/pst" stands for "prior/posterior". As a whole, the analytical calculation of the integral in (8) is possible and yields: rendering the usually complex calculations of the evidences straightforward.
Assuming that all candidate models are equally a priori probable, from Eq. (7), the posterior probability across the 2 P models can be inferred. The selected model is the one which maximizes this probability. It should be noted that if prior information is available such as protein-to-protein interactions (PPI's), it can be taken into account by assigning a higher probability to partitions wherein the related proteins are in the same subset (either discriminant or not). In Eq. (10), the normalizing constants for the posterior distributions depend on the empirical covariance matrices of the population of individuals for the discriminant proteins and the non-discriminant ones, respectively. Their computation is expensive. However, it suffices to compute once the full covariance matrix for all the proteins and then remove the appropriate raws and columns for the 2 P configurations to be tested.

Selection using noisy data
The model presented above assumes that the concentrations are directly observed. Although this assumption leads to closed-form expressions of the posterior probabilities, it may be too simplifying. In practice, the concentrations are known up to an uncertainty and this section extends the above-detailed developments to account for these uncertainties. However, this comes at the price of intractable calculations, and to overcome this difficulty, we propose a suitable approximation. As introduced above, the measured concentrations are modeled as y n = x n +ε n where ε n is a zero-mean Gaussian vector with precision ε , therefore y n |x n ∼ N (x n ; 0, ε ). Similarly to the previous section, the vectors of observed concentrations are stacked in a matrix y of dimension P × N. To select the most probable model, the evidence f Y ,B| (y, b|δ) must be calculated for each candidate model (it was f X,B| (x, b|δ) for the noiseless model). The difficulty is that the calculation of evidence requires not only the marginalization of the model parameters but also of the true concentrations. Furthermore, the precision ε is assumed unknown and must also be marginalized. For notational convenience, we state:θ = [θ , ε ] as an extended vector of unknown parameters.
By taking into account the conditional independencies, the evidence can be expressed as: This multiple integral can be handled in several manners. We propose to first perform integration with respect toθ and then to integrate the result with respect to x; hence, Eq. (11) can be rewritten as: and in the same manner as previously with × ∈ {P, H, C} and ∈ {+, −}. The integrals (12) and (13) can be calculated analytically.
On the one hand, the integrand of (12) can be rewritten, up to a proportionality constant, as the distribution of the precision matrix ε conditionally upon y and x.
This distribution is Wishart with parameters ν pst ε , pst ε expressed as: Then, (12) can be re-arranged as: On the other hand, the integrals I × can be computed in the same manner as in the previous section using the conjugation property for the couples (m × , × ). Thus, we have: with KN W pst × and KN W pri × the normalization constants of the prior and posterior Normal-Wishart distributions for (m × , × ), respectively.
In (11), the result of the first integration with respect tõ θ does not yield an expression that can be integrated analytically w.r.t. x. To address this issue, we propose to take advantage of the fact that a matrix variate t-distribution T P,N (T; q, M, , ) tends to a Gaussian distribution when the degrees of freedom parameter q tends to infinity.
In a first step, for a high enough value of ν pri ε , (14) can be approximated as: In a second step, the integrals I × can also be approximated by Gaussian distributions although not directly. For this purpose, we focus on the factor of (15) that depends on the true concentration vectors: where C × is a proportionality constant. Contrary to (14), the expression (16) does not correspond to a standard probability density function. To make the calculations tractable, we propose to replace the empirical means of the true concentrationsx × by their approximated values computed from the measured concentrationsȳ × . By developing the expression of the empirical covariance matrix in (16), it ensues: is a matrix of dimensions P × N × whose columns are all equal to the empirical meanȳ × . Then, provided ν pri × is high enough, we can also approximate this matrix variate t-distribution by a Gaussian distribution as for (12): The performed approximations allow us to express the integrand of the evidence (11) as a product of Gaussian distributions for the true concentration vectors. In this case, the integration can be carried out analytically. By treating separately pathological and healthy individuals, we finally obtain the following expression of the evidence: In this expression,R y P andR y H are the empirical covariance matrices of the measured concentration vectors for the pathological and healthy individuals, respectively. As for H and P , they are defined as: H} and blkdiag(A, B) the block-diagonal matrix with diagonal elements A and B.

Numerical evaluation
To assess the performance of the proposed method, we have performed extensive numerical experiments using both simulated and real data. They include comparisons with other methods for biomarker identification, namely: 1. The t test [31], which consists in comparing the means of each protein concentrations between the two cohorts, H and P. If the null hypothesis, standing for the mean equality, is rejected, then the protein is declared as a biomarker. The type I error, denoted as α, corresponds to the incorrect rejection of a true null hypothesis. Its value is used to set the t test decision threshold. In this paper, it is not necessary to adjust the type I errors to account for multivariate effects. The reason is that, for fair comparison purposes, we directly select the setting that leads to the best performance of the test regarding our criterion. This point is commented in Section 4.1 and Fig. 2. 2. The LASSO method [20], based on a linear regression model in which the explanatory variables are the protein concentrations x, while the response variables are the biological statuses b. The LASSO method estimates the coefficients of the model by minimizing the sum of the squared errors, with a L 1 -norm penalty. Then, a protein is selected as a biomarker if the value of the coefficient corresponding to its concentration is different from zero. This method introduces a regularization parameter denoted λ. 3. The Bhattacharyya distance [32] is a measure of similarity between two probability distributions and by extension between two populations of individuals [32]. For two multivariate normal distributions with respective mean and covariance matrix (μ 1 , 1 ) and (μ 2 , 2 ), it is given by: In the sequel, the Bhattacharyya distance is calculated for each protein by replacing the true mean and covariance matrix by their empirical estimates. The protein is declared as discriminant if the distance is greater than a fixed threshold denoted t. The algorithm is referred to as Bha-distance. 4. The FOHSIC algorithm as introduced in [33]. It performs feature selection based on the Hilbert-Schmidt Independence Criterion (HSIC). The authors propose an unbiased estimator of HSIC and then, assuming the number of significant features is set a priori, use a forward procedure to select them. In our context, the significant features are the biomarkers.
In this section, we refer to the method from Section 3.1 as the Bayesian Model Selection with Analytical Solution for Noiseless Data (BMS-AS-D) method, while to the method from Section 3.2 as the Bayesian Model Selection with Analytical Solution for Noisy data (BMS-AS-N) method.
Crucial to our approach is the choice of the parameters of the Normal-Wishart densities (ν × , η × , μ × , × ) for × ∈ {P, H, C}. They are referred to as the hyperparameters since the evidence (10) depends on them. In a non-informative case, the values of these hyperparameters are chosen to be (0, 0, 0, ∞), while the proportionality coefficient in (10) has an undetermined form. As an alternative, to tune the parameters, we propose to use a poorly informative prior based on expert knowledge 1 about the corresponding variables (e.g., the values in μg/ml). To this end, we take advantage of the expression of the prior mean and the covariance for (m × , × ) as a function of the hyperparameters: where the superscripts i, j denote the entry (i, j) of the matrices, E(·) and V (·) refer to the expectation and the covariance matrix of a vector, respectively, while cov(·, ·) stands for the covariance between two random variables. We also recall that * ∈ + , − , " " depending whether the discriminant/non-discriminant subsets of proteins are considered or the whole set. As a consequence, the prior parameters (ν × , η × , μ × , × ) can be calculated from (18) to (21) and substituted in (10). Although our choice of prior is not non-informative in the strict sense, it is vague enough so as not to impact biomarker detection. This issue is investigated in the next subsection. Finally, to calculate (17) in the noisy case, additional hyperparameters for the Wishart probability density function of the noise precision matrix have to be tuned. They are chosen such that E( ε ) = ν pri ε pri ε and that the elements of the covariance matrix satisfy cov . Therefore, by accounting for real-life orders of magnitudes of ε , the prior parameters ν pri ε , pri ε can be calculated and substituted in the probability (10).
In the next sections, we present the results of the numerical evaluations of the proposed methods using both simulated and real data.

Evaluation using simulated data 4.1.1 Description of the simulated data and performance index
We consider the concentrations of a list of P proteins for a group which comprises N H healthy and N P pathological individuals, respectively, with N H + N P = N. The possible partitions for discriminant/non-discriminant proteins thus amount to 2 P and they are referred to as true models. For each true model, N r = 10 5 data realizations are simulated, hence the total number of realizations equals N r 2 P . On the one hand, the noiseless data comprise the biological statuses b n and the actual protein concentrations x n of the N individuals and are generated as follows. The biological statuses are sampled from the Bernoulli distribution of parameter p, where p is assumed Beta distributed of parameters a = 1 and b = 1, which corresponds to a uniform distribution. The concentrations of the subset of discriminant proteins are generated from the Gaussian distributions, N x + n ; m H , H or N x + n ; m P , P , depending on the simulated biological status. The subset of non-discriminant proteins are sampled from The orders of magnitudes for (m × , × ) are specified as: E(m × ) = 10 3 1 P * , V (m × ) = 10 4 I P * , E( × ) = 10 3 I P * , V ( × ) = 10 4 I P * , where 1 P * denotes a vector of size P * whose elements are all equal to 1 and I P * is the identity matrix of size P * . The same order of magnitude is considered for healthy, pathological, and common parameters, that is to say × ∈ {H, P, C}. These a priori information are used to tune the hyperparameters as given by (18)- (21).
On the other hand, the noisy data include the biological statuses b n and the observed protein concentrations y n for n = 1, 2 . . . N. The protein concentrations x n are generated as in the case of the noiseless observations by using the same hyperparameter setting. As for the noise ε n , it is sampled as a zero-mean multivariate Gaussian random vector with precision matrix ε . The latter is generated from a Wishart density with parameters ν pri ε , pri ε . In order to determine these hyperparameters, the a priori information is specified as: E( ε ) = 10 −2 I P , V ( ε ) = 10 −5 I P . Note here that ε measures the precision (inverse variance), thus the lower ε is, the stronger the noise is.
For each data set, the posterior probability is computed for all possible partitions according to (10) or (17) for the BMS-AS-D and BMS-AS-N methods, respectively. Then, the most probable partition is selected.
The performance is measured in terms of the error rate τ , defined as: where E i is the number of realizations of the ith partition for which the selected model is different from the true one. It should be noted that the usual type I and type II errors apply when the biomarker identification is made protein after protein but they are not relevant here, since the proteins are addressed as a whole. Indeed, as underlined in Section 3, the proposed Bayesian formulation relies on a different paradigm than the existing methods: any wrong or correct decision regards the list of proteins as a whole (and not protein-wise). Furthermore, the proposed approach minimizes a global risk and cannot be directly related to a given false discovery rate. It ensues that τ , which encompasses all types of errors, appears as a suitable criterion to measure the performance.

Results for the noiseless model
First of all, we investigate the sensitivity of the error rate τ (%) to the hyperparameter tuning. In Fig. 1, we plot the latter as a function of the considered variance V (m × ) in Eqs. (18)-(21). We can observe that provided the variance is chosen superior to a given threshold, it has no impact on the biomarker detection performance. This is the case for our settings.
Before going further in analyzing the performance, it should be noted that the t test, the LASSO, and the Battacharyya distance all require the setting of a parameter: the type I error α, the regularization parameter λ, and the threshold t, respectively. So as not to favor our approach, we have run all the algorithms for different values of these parameters and we have selected the best one (in order to get the lowest error rate). Such a procedure cannot be applied on real data, but it allows us to compare the proposed method to the best version of the alternative approaches. The results are given in Fig. 2 for the t test and the LASSO.
The performance of the BMS-AS-D method has first been evaluated as a function of the number of proteins and results are shown in Table 1. The superior performance of the BMS-AS-D method with respect to the t test is expected since the BMS-AS-D makes a decision jointly across all the proteins while accounting for all possible correlations between them. This issue is not addressed within the t test which is univariate, i.e., each protein is tested separately. The same observation can be made for the Bhattacharyya distance. Indeed, the error rate of the t test and the Bhattacharyya distance are very close. As for the superiority of the BMS-AS-D method over the LASSO, it is due to the fact that the latter makes the assumption of an arbitrary linear regression relationship between the biological statuses and the protein concentrations. Moreover, the difference in performance increases with the number of proteins, since the possibility of correlation increases. Indeed, in the presence of correlated variables, the number of significant variables is known to be over-estimated with the LASSO algorithm. This fact confirms the relevance of the BMS-AS-D method which can accommodate correlations between the variables.
Regarding the number of individuals, Table 2 shows that, as expected, the performance of all the methods improves as the number of considered individuals increases. In particular, the better performance of the BMS-AS-D method is explained by the reduced variance of the protein concentration posterior distribution for a large number of individuals. Furthermore, the BMS-AS-D method outperforms the t test, the LASSO and the Bhattacharyya distance, regardless the number of individuals. Finally, even if the number of configurations to be tested increases exponentially with the number of proteins, the computational cost is kept reasonable owing to the analytical expression of the posterior probabilities for the different partitions. Last but not the least, our method can also be run for a fixed number of biomarkers as it is the case of many current feature identification algorithms such as the FOHSIC. Only the partitions with the proper number of biomarkers are studied, which amounts to assigning a null prior probability to the others. When the number of biomarkers is chosen as M ≤ P, the number of posterior probabilities to compute is limited to C M P instead of 2 P . Under this assumption, the performance of the BMS-AS-D is compared to that of the FOHSIC in the Tables 3, 4, 5 and 6. To run the algorithm with large P (≥ 8), the number of realizations is reduced to 10 3 .
As shown in Tables 3, 5, and 6, the BMS-AS-D algorithm outperforms the FOHSIC one. This is explained by the fact that the BMS-AS-D algorithm makes a multivariate decision on the whole set of proteins, while the FOHSIC uses a forward procedure which can lead to error accumulation. Indeed, if any detected biomarker in the sequence is false, then the final selected model is bound to be erroneous. Furthermore, the BMS-AS-D is also faster than the FOHSIC, as illustrated in Table 4. More precisely, Table 5 shows the error rate τ for the FOHSIC and the Bayesian algorithm for P = 8 and different number of biomarkers. Conversely, the results proposed in Table 6 are obtained with the number of biomarkers fixed to M = 4 while  the number of proteins P is varied. As expected, the performance of the FOHSIC algorithm is degraded when increasing the number of proteins while the opposite is observed for the BMS-AS-D. Thus, even for large P, the Bayesian algorithm outperforms the FOHSIC.

Results for the noisy model
The performance of the BMS-AS-N and the BMS-AS-D algorithms is first studied as a function of the number of proteins P and the number of individuals N, for a fixed noise level. Then, the BMS-AS-N and the BMS-AS-D methods are compared for different noise conditions. Table 7 reports the error rate τ (%) for the BMS-AS-N and the BMS-AS-D methods. The value of the error rate for the BMS-AS-D method is increased as compared to the results given in Table 1. This is due to the fact that the BMS-AS-D relies on a noiseless model, i.e., it processes the noisy data as if they were noiseless protein concentrations. That is why this result is expected, especially given the specified severe noise conditions E( ε ) = 10 −2 I P . As a consequence, the performance of the BMS-AS-N method is significantly better than the BMS-AS-D one. Also, the difference in performance increases with the number of proteins, since for large sets of proteins, the number of candidate models increases and the estimation becomes more difficult.
The performance of the BMS-AS-N method is also evaluated for several numbers of individuals: N = 100, 500, and 1000. Table 8 shows the results where it can be observed that for N = 100, the error rate is higher; however, it does not exceed 13%. For the BMS-AS-D method, the results are even poorer because of the impact of the noise on smaller sample sizes.
To assess the importance of taking into account the noise in the model, the respective performances of the BMS-AS-D and the BMS-AS-N methods are also compared for different noise levels. We recall that the former is designed from the noiseless data model, while the latter specifically addresses the noisy data model. The noise power is measured by the mean value of the noise variance E( ε ), which is varied in the simulations. Table 9   Table 4 Execution time for one simulation N = 1000 and P = 3  shows the error rate for both methods as a function of this parameter: The BMS-AS-N method always outperforms the BMS-AS-D one. In the absence of noise, the BMS-AS-N method becomes equivalent to the optimal noiseless method. These results confirm the relevance of the method, especially for high noise levels. As a conclusion, the results confirm the good performance of the proposed BMS-AS-N method which is also not too computationally intensive by means of the analytical approximation of the posterior probabilities.

Evaluation using the real data
The primary goal of this paper was to present a novel methodology for biomarker identification that relaxes classical simplifying assumptions on the data model and then to evaluate it on simulated data. Nevertheless, we had at our disposal a batch of real data 2 and we used it to carry out a preliminary study of the BMS-AS-N method. The data are composed of 206 samples: 105 with the status H (including 76 patients from blood donors and 29 with negative colonoscopy), 101 with malignant tumor 3 , i.e. with status P. The latter are structured as follows: 24 patients in the 'stage one' of the cancer, 26 patients in the 'stage two' , 23 patients in the 'stage three' , 25 patients in the 'stage four' , and three missing values. The protein concentrations are obtained using the Bayesian inversion method developed in [34] from measurements of SRM spectra according to the methodology described in [35]. For each sample, the concentrations of 21 proteins are measured (14-3-3 protein sigma; 78-kDa glucoseregulated protein; protein S100-A11; calmodulin; calreticulin; peptidyl-prolyl cis-trans isomerase A; defensin-5; defensin-6; heat shock cognate 71 kDa protein; fatty acidbinding protein, intestinal; fatty acid-binding protein, liver (LFABP); stress-70 protein, mitochondrial; protein disulfide-isomerase (PDI); protein disulfide-isomerase A6 (PDIA6); phosphoglycerate kinase 1; retinol-binding protein 4; peroxiredoxin 5, mitochondrial; protein S100-A14; triosephosphate isomerase; villin-1 (Villin); Vimentin). Only one of the proteins in the sample, named LFABP, was previously identified by SRM as a biomarker [36]. To Table 6 τ (%) M = 4, N r = 10 3 , and N = 500  calculate the hyperparameters (18)- (21), empirical orders of magnitudes for (m × , × ) (e.g., μg/ml) are used as specified: E(m × ) = 10 2 1 P * , V (m × ) = 10 3 I P * , E( × ) = 10 3 I P * , V ( × ) = 10 I P * . For this data set, the posterior probability is computed for each of the 2 21 possible partitions according to (17) for the BMS-AS-N methods. Table 10 presents the four most probable partitions, with their probabilities. By far, the most probable partition (probability 0.9986) is: LFAPB is discriminant and the remaining 20 proteins are non-discriminant. The second most probable partition is: the whole set of protein is non-discriminant (probability 0.001361). The third and the fourth ones select two discriminant proteins and the 19 other are non-discriminant: (LFABP , Villin) and (LFABP , PDIA6), with probability smaller than 10 −6 . As a conclusion, the proteins are all declared as non-discriminant, except the LFAPB. This study confirms that our method correctly identifies the valid biomarker.
Despite the large number of models to compare (about two millions candidate models), the computation time is just 1 h. This short computation time is made possible by the analytical calculation of the posterior probability, avoiding the use of extensive numerical integration methods such as for instance MCMC algorithms [24].

Synthesis and perspectives
Biomarker discovery is a challenging task of the utmost interest for the diagnosis and prognosis of diseases. This paper presents a statistical approach based on variable selection. It is developed in a Bayesian framework that relies on an optimal strategy, i.e., the minimization of an error risk. Given P candidate proteins, the proposed procedure compares the probability of the 2 P partitions (subset of discriminant versus subset of nondiscriminant proteins). The most a posteriori probable partition is finally retained and thus defines the selected variables. The main difficulty is the required integration with respect to all the unknown model parameters. An important contribution is to provide a closed-form expression of the probabilities for noiseless observations  and a sensible approximation for noisy observations. The proposed method proved to be well-suited for variable selection in a complex context. Its effectiveness is assessed by a theoretical characterization and numerical studies (on simulated and real data) which are in accordance with the theoretical optimality. Furthermore, the proposed method compares favorably with the t test, the LASSO, the Battacharrya distance, and the FOHSIC. From a methodological standpoint, several perspectives can be considered. Regarding the concentrations, non-Gaussian distributions, e.g., Gamma or Wishart models, could be relevant. Regarding the status, a possible development could account for possible errors in the given status. In this case, an additional level should be appended to the hierarchical Bayesian model. It would include a prior probability for an erroneous status.
As for the applicative perspectives, we plan to further take advantage of the performance of the method in other clinical data sets or in other biomedical fields (e.g., genomics, transcriptomics. . . ). In addition, we also intend to make use of the method in other domains, for instance, in astrophysics (identification of pertinent features in order to classify galaxies), or for complex structures and industrial processes (identification of indicators for detection and diagnosis of damages or faults, analysis of fatigue and aging prevention,. . . ). 1 The knowledge about orders of magnitudes of the concentration values is acquired from the real data set provided by bioMérieux (Technology Research Department), France. 2

Reduction of the concentration distribution
This section explains how the exponential arguments of the Gaussian distributions in (4) can be reformulated The idea is to re-arrange the sum in the exponential as a function of the empirical meanx + P and the empirical varianceR + P . To this end, for the sake of calculation simplicity, we can write: x n − m P = x n −x + P − x + P − m P and develop the sum of product. Then, using the fact that u t Mu = Tr u t Mu = Tr Muu t , the product is written as function of empirical mean and variance n∈I P N x + n ; m P , P = (2π) −P + N P /2 | P | N/2 exp − N P 2 Tr P R + P + x + P − m P x + P − m P t (22) which allows easier handling of the Normal-Wishart prior.

Wishart, Normal-Wishart, and matrix variate t-distribution Wishart
The Wishart density probability function for a P×P matrix is driven by two parameters: a degree of freedom ν (real and larger than P − 1) and a matrix (positive and definite) referred to as the scale matrix. It reads where γ P is the P-dimensional Gamma function.

Normal-Wishart
For a couple (m, ), where m is a P-dimensional vector and a P × P matrix, the Normal-Wishart density is controled by four parameters ν, η, μ, . It reads: and it does not depend on μ (that is a position parameter).

Matrix variate t-distribution
The random matrix T of dimension (P ×N) is said to have a matrix variate t-distribution with parameters M, , , and q if its probability density function is given by where T and M are P × N matrices, and are positivedefinite matrices with respective sizes N × N and P × P and q > 0. When q tends to infinity, the distribution of T tends to a Gaussian distribution with mean M and covariance ⊗ that is to say N (T ; M, ⊗ ), where ⊗ is the Kronecker product. Authors' contributions JFG proposed the initial idea of the method and provided the first theoritical developments. He also importantly contributed to the writing of the paper. AG proposed the extension to noisy data and importantly contributed to the writing of the paper. ND provided the largest part of the Matlab developments and numerical assessment. She also contributed to the writing of the paper. CT provided input in the proteomics fields. She contributed to the comparison with existing methods. She also contributed to the manuscript and proposed valuable comments. MH contributed to the Matlab development and the numerical assessment. She also contributed to the writing of the paper. JPC provided input in the SRM field and with the result interpretation. He acquired and provided real SRM spectra to test the algorithm. He has revised the manuscript.
LG has developed the BHI processing algorithms for the MRM mode. He has computed the protein concentration profiles for the MRM clinical data set used for the evaluation on real data. He contributed to the result interpretation. PD provided expertise regarding proteomics. BL contributed to the early developments of the BHI-PRO project. PG was the BHI-PRO project manager. He has coordinated the conception of the processing algorithms and the interpretation of the results. He has revised the manuscript. PR coordinated biostatistical developments and the interpretation of the results. He has revised the manuscript. All authors read and approved the final manuscript.

Competing interests
JPC and BL are employed by bioMérieux. The other authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.