Bayesian inference for biomarker discovery in proteomics: an analytic solution
 Noura Dridi^{1, 2},
 Audrey Giremus^{1},
 Jean-Francois Giovannelli†^{1},
 Caroline Truntzer^{3},
 Melita Hadzagic^{1, 4},
 Jean-Philippe Charrier^{5},
 Laurent Gerfault^{6, 7},
 Patrick Ducoroy^{3},
 Bruno Lacroix^{5},
 Pierre Grangeat^{6, 7} and
 Pascal Roy^{8, 9, 10, 11}
https://doi.org/10.1186/s1363701700624
© The Author(s) 2017
Received: 4 August 2016
Accepted: 21 June 2017
Published: 14 July 2017
Abstract
This paper addresses the question of biomarker discovery in proteomics. Given clinical data regarding a list of proteins for a set of individuals, the problem is to extract a short subset of proteins whose concentrations are an indicator of the biological status (healthy or pathological). In this paper, it is formulated as a specific instance of variable selection. The originality is that the proteins are not investigated one after the other; instead, the best partition between discriminant and nondiscriminant proteins is directly sought. In this way, correlations between the proteins are intrinsically taken into account in the decision. The developed strategy is derived in a Bayesian setting, and the decision is optimal in the sense that it minimizes a global mean error. It is ultimately based on the posterior probabilities of the partitions. The main difficulty is to calculate these probabilities, since they are based on the so-called evidence, which requires marginalization of all the unknown model parameters. Two models are presented that relate the status to the protein concentrations, depending on whether the latter are biomarkers or not. The first model accounts for biological variabilities by assuming that the concentrations are Gaussian distributed with a mean and a covariance matrix that depend on the status only for the biomarkers. The second one is an extension that also takes into account the technical variabilities that may significantly impact the observed concentrations. The main contributions of the paper are: (1) a new Bayesian formulation of the biomarker selection problem, (2) the closed-form expression of the posterior probabilities in the noiseless case, and (3) a suitable approximated solution in the noisy case.
The methods are numerically assessed and compared to state-of-the-art methods (t test, LASSO, Bhattacharyya distance, FOHSIC) on synthetic data and on real data from proteins quantified in human serum by mass spectrometry in selected reaction monitoring mode.
Keywords
Variable selection; Model selection; Optimal decision; Bayesian approach; Evidence; Hierarchical model; Proteomics; Biomarker
1 Introduction
It is now generally recognized that protein expression analysis is crucial in explaining the changes that occur as a part of disease pathogenesis [1, 2]. In this context, recent advances in mass spectrometry (MS) technologies have facilitated the investigation of proteins over a wide range of molecular weights in small biological specimens from blood or urine samples, for instance. Notably, MS in selected reaction monitoring (SRM) mode has demonstrated its ability to quantify clinical biomarkers in patient sera [3, 4]. Consequently, a large amount of research has been generated in proteomics based on data such as protein mass spectral intensities or protein concentrations obtained from the spectra. Specifically, the focus is on the selection (or discovery) of the “signature profiles,” the so-called biomarkers. They represent, for instance, indicators of normal versus pathogenic biological processes, or positive versus negative pharmacological responses to therapeutic intervention.
Critical to the identification of biomarkers are: (1) the biological variability, i.e., the random variations of the concentrations of proteins between individuals sharing the same biological status [5], and (2) the technical variability, which originates from the imperfections of the measurement process used to obtain the concentrations. Failing to address both of these variabilities within a technique for biomarker identification may significantly impair its performance by leading to erroneous decisions.
Furthermore, since the complexity of a status is unlikely to be manifested through the changes in the characteristics of just one protein, it has generally been acknowledged that a set of proteins should be considered [5–8]. An additional difficulty is that they are possibly correlated, imposing the use of multivariate models to account for all the data simultaneously. These aforementioned issues pose significant challenges in developing efficient and robust statistical techniques for the identification of biomarkers.
The paper tackles the problem of biomarker identification by adopting a Bayesian approach to propose the selection of the optimal set of variables. By providing an elegant and mathematically rigorous framework for incorporating the data and the prior information within a joint probabilistic model, the Bayesian setting allows straightforward modeling of both the technical and the biological variabilities of the data.
The remainder of the paper is organized as follows. Section 2 summarizes the state-of-the-art variable selection methods, discusses their main challenges, and outlines our principal contributions. Section 3 presents the proposed formulation within the Bayesian framework, the proposed models for the data, and the decision strategy. Section 4 describes the data used in the numerical evaluations, together with the results and their analysis. Finally, conclusions are drawn in Section 5. A detailed description of the model and the derivation of the analytic solution are provided in the Appendix.
2 Related work
The identification of biomarkers for diagnosis or prognosis can be classically formulated as a variable selection problem, and this problem has received considerable attention as a specific instance of model choice. Various methodologies exist that can be broadly classified in two categories: frequentist hypothesis testing and Bayesian decision-making.
Frequentist hypothesis testing consists in deciding between two statements, classically referred to as the null and the alternative hypotheses, by comparing a function of the observed data to a threshold. The reader is invited to consult [9] for a comprehensive overview. Two closely related methods have been proposed. On the one hand, Neyman-Pearson tests are designed to control the so-called type I error. On the other hand, p values focus on how strongly the data reject the null hypothesis H _{0} by evaluating the probability of obtaining a value at least as extreme as the observed one given that H _{0} is true. In biomarker discovery, a popular approach consists in testing a mean difference between the case and the control populations using the classical Student's t test or its variants [7]. The latter is a statistical hypothesis test which indicates whether the difference between two group means most likely reflects that they are samples of two different populations or, on the contrary, is merely explained by the sampling fluctuation. However, as the number of candidate biomarkers increases, multiple hypothesis testing is required, resulting in a higher computational cost which may become prohibitive [10]. A first solution is to perform univariate tests for each protein.
This procedure requires an adapted control of the rate of type I errors in this particular setting where multiple hypothesis tests are conducted simultaneously. Two types of procedures were proposed for this purpose, controlling either the so-called family-wise error rate or the false discovery rate [11, 12]. A common criticism of frequentist approaches is that they fail to take into account prior information about the problem at hand such as interdependencies between the different variables.
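As an illustration of these two kinds of error control, the sketch below applies a Bonferroni correction (a classical family-wise error rate procedure) and the Benjamini-Hochberg step-up rule (a classical false discovery rate procedure) to a list of per-protein p values. These two procedures are chosen here as textbook representatives and are not necessarily the ones meant by [11, 12]; the p values and the levels `alpha` and `q` are made up for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for p <= alpha / m, which controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up rule, which controls the false discovery rate."""
    m = len(p_values)
    # Indices of the p values sorted in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) such that the k-th smallest p value is <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p values, one per candidate protein.
p_values = [0.001, 0.012, 0.025, 0.045, 0.6]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On this toy input the false discovery rate rule rejects more hypotheses than the family-wise rule, which is the usual trade-off between the two criteria.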
For a large number of candidate variables, regression analysis [13] provides an alternative to the above-mentioned methods. The principle is to assume that a given outcome is related to a linear combination of a set of explanatory variables called the predictors. In proteomics, logistic regression models are considered that express the probability of having a disease as a function of the protein abundances [14–16]. Then, variable selection is classically performed using stepwise procedures that consist in successively adding or removing predictors, estimating the regression coefficients, and evaluating the goodness of fit of the resulting model. Different criteria can be considered such as the R-squared, the adjusted R-squared, or the Akaike Information Criterion [17, 18]. Adding and removing predictors are referred to as forward selection and backward elimination, respectively [13]. However, these selection procedures are prone to overfitting and the variance of the parameter estimates becomes high in the presence of correlated predictors. Regularization methods alleviate these difficulties by taking the minimizer of a penalized least squares error as the estimate. Since the Ridge regression in 1970 [19], several algorithms have been proposed that differ from one another in the penalization applied to the regression parameters. The well-known LASSO [20] considers an L _{1}-norm penalty and has the advantage of directly removing irrelevant predictors by shrinking their coefficients to zero. More recently, the elastic net [21], which combines the advantages of the Ridge and LASSO regressions, has been proposed. In the presence of correlated variables, it outperforms LASSO by favoring the selection of sets of variables. A comparison between these methods in application to genome selection is presented in [22]. Although widely used, regression analysis is based on an ad hoc model that may not reflect the physical nature of the observed data.
Further, it does not explicitly accommodate correlations between the candidate biomarkers or measurement errors.
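For reference, the three penalized estimators mentioned above can be written in a common form. The notation is chosen for this illustration only: \(b_{n}\) is the response of individual n, \(\boldsymbol{x}_{n}\) the vector of predictors, \(\boldsymbol{\beta}\) the coefficient vector, and λ≥0 and α∈[0,1] are tuning parameters; the intercept is omitted and the elastic net is shown in one common parameterization among several.

$$\begin{aligned} \hat{\boldsymbol{\beta}}&=\arg\min_{\boldsymbol{\beta}}\ \sum_{n=1}^{N}\left(b_{n}-\boldsymbol{x}_{n}^{T}\boldsymbol{\beta}\right)^{2}+\lambda\,\Omega(\boldsymbol{\beta}),\\ \Omega_{\text{Ridge}}(\boldsymbol{\beta})&=\Vert\boldsymbol{\beta}\Vert_{2}^{2},\qquad \Omega_{\text{LASSO}}(\boldsymbol{\beta})=\Vert\boldsymbol{\beta}\Vert_{1},\qquad \Omega_{\text{EN}}(\boldsymbol{\beta})=\alpha\Vert\boldsymbol{\beta}\Vert_{1}+(1-\alpha)\Vert\boldsymbol{\beta}\Vert_{2}^{2} \end{aligned} $$

Only penalties containing the L _{1} term can set coefficients exactly to zero, which is why LASSO and the elastic net perform variable selection whereas Ridge regression only shrinks the coefficients.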
The Bayesian framework offers an alternative formulation of model selection. The candidate models are assigned prior probabilities that are combined with the likelihood function to yield the so-called posterior probability. The latter summarizes all the available information to make the decision. In this context, deciding in favor of the a posteriori most probable model is optimal in the sense that it minimizes the risk associated with the 0/1 cost function. There has been a lot of debate over the use of Bayesian techniques in place of frequentist approaches, but they do not address exactly the same question. Frequentist methods are designed to test the departure of the data from a predefined null hypothesis. In contrast, Bayesian selection procedures evaluate the plausibility of a given hypothesis within a set of candidate hypotheses and hence are well-suited to multiple hypothesis testing. Thus, non-nested models can be compared in a straightforward manner. Another fundamental difference is the treatment of unknown model parameters. In the frequentist approach, they are classically replaced by estimates whereas in the Bayesian formulation, they are integrated out. The latter procedure has the advantage of automatically penalizing complex models, as discussed in [23], but often leads to intractable calculations. An additional advantage is that correlations between the model variables can be easily accounted for in the design of the prior distributions. As for the integration over the unknown parameters, several solutions have been developed. The Laplace approximation of the integrand leads to the well-known Bayesian information criterion (BIC). As an alternative, numerical integrations can be performed based on stochastic sampling techniques such as Markov Chain Monte Carlo (MCMC) methods [24]. Either across-model or within-model techniques can be considered.
In the first case, the model index is sampled jointly with the parameters conditionally upon the observations. A well-known algorithm is the Reversible-Jump MCMC, but moves between the different parameter spaces are difficult to design. In the second case, posterior samples of the parameters are generated conditionally upon each candidate model and then used to evaluate the integrated likelihood, also called evidence [25]. Nevertheless, the harmonic mean-based estimator exhibits instabilities [26]. Applications of the MCMC Bayesian model selection methods in genomics can be found in [27, 28].
In this paper, a Bayesian setting is adopted to identify a set of protein biomarkers from experimental data consisting of measured protein concentrations and the associated biological statuses of a population of individuals. The novelty is that the decision is not made protein by protein. Instead, the problem is formulated as directly finding the best partition of the list of proteins into two subsets, namely discriminant and nondiscriminant, in the sense that it yields the highest posterior probability. Regardless of their discriminative power, the proteins are assumed Gaussian distributed. However, for the subset of biomarkers, the parameters of the Gaussian distribution take different values depending on the biological status whereas this is not the case for the second subset of proteins. A preliminary version of this hierarchical model has been presented in [29]. Its advantages are threefold. First, it is not based on an ad hoc explanatory model unlike regression analysis. Second, the proteins within a given group are assumed a priori correlated and the dependency structure is integrated out along with the remainder of the unknown model parameters so that only the protein partition is estimated. Thus, our approach intrinsically takes into account correlations between the candidate biomarkers. Third, by choosing appropriate conjugate prior distributions for the parameters, the model evidences can be calculated in closed form and there is no need to resort to computationally intensive numerical techniques. Finally, we show that our hierarchical model can be easily extended to address errors in the measured concentrations.
3 Problem formulation, proposed models, and methods
To formulate the biomarker selection problem and construct its solution in the proposed framework, we first introduce the basic modeling for the relevant quantities/variables at hand: the biological status, protein concentrations, number of individuals,…including the descriptions of the considered observation models.
In addition, it is assumed that X ^{+} and X ^{−} are conditionally independent.
The parameters of the distributions are collected in the vector \(\boldsymbol {\theta } = \left [\boldsymbol {m}_{\mathcal {P}}, \boldsymbol {\Gamma }_{\mathcal {P}}, \boldsymbol {m}_{\mathcal {H}}, \boldsymbol {\Gamma }_{\mathcal {H}}, \boldsymbol {m}_{\mathcal {C}}, \boldsymbol {\Gamma }_{\mathcal {C}}, p\right ]\) considered as unknown. It is important to keep in mind that the quantity of interest is the partition δ.
Distribution for the individuals The total number of individuals is N and (x _{ n },b _{ n }) denotes the concentration vector and status of the nth individual. They are modeled as independent conditionally on θ. Let us denote \(\boldsymbol {\underline {x}}\) (size P×N) as the matrix of concentrations and b (size N) as the vector of biological statuses. Also, let \(\mathcal {I}_{\mathcal {H}}\) and \(\mathcal {I}_{\mathcal {P}}\) be the subsets of indices for healthy and pathological individuals, respectively, and \(N_{\mathcal {H}}, N_{\mathcal {P}}\) their cardinalities. For notational convenience, we introduce: \(N_{\mathcal {C}}=N_{\mathcal {H}} + N_{\mathcal {P}}\) and \(\mathcal {I}_{\mathcal {C}} = \mathcal {I}_{\mathcal {P}} \cup \mathcal {I}_{\mathcal {H}}\) (so that \(N_{\mathcal {C}}=N\) and \(\mathcal {I}_{\mathcal {C}}=\{1,2,\dots,N\}\)).
 1. In the first one, the concentrations x _{ n } are directly observed.
 2. The second one accounts for noise: observations write y _{ n }=x _{ n }+ε _{ n }, where ε _{ n } is modeled as a zero-mean Gaussian vector with precision Γ _{ ε }.
Both of them account for biological variabilities and the latter also includes technological variabilities that arise from both the functioning of the measurement system itself and the postprocessing of the spectra. These models are referred to as “noiseless model” and “noisy model”. The corresponding variable selection methods are respectively presented in Sections 3.1 and 3.2.
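To see how the noisy model inflates the spread of the observations, consider a single protein with all parameters held fixed. The scalar symbols below are introduced only for this illustration (mean m, biological precision γ, noise precision γ _{ ε }), the Gaussians are written with variances rather than precisions, and the noise is assumed independent of the true concentration, which the text above does not state explicitly.

$$\begin{aligned} x_{n}\sim\mathcal{N}\left(m,\gamma^{-1}\right),\quad \varepsilon_{n}\sim\mathcal{N}\left(0,\gamma_{\varepsilon}^{-1}\right),\quad y_{n}=x_{n}+\varepsilon_{n} \ \Longrightarrow\ y_{n}\sim\mathcal{N}\left(m,\ \gamma^{-1}+\gamma_{\varepsilon}^{-1}\right) \end{aligned} $$

The mean is unchanged while the variances add, so a small noise precision γ _{ ε } can dominate the biological variability and blur the difference between the healthy and pathological distributions.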

The probability p is assumed to be Beta distributed with parameters (a,b).

The (m _{×},Γ _{×}) are assumed to be Normal-Wishart \(\mathcal {NW}\) distributed with parameters (μ _{×},η _{×},Λ _{×},ν _{×}), for \(\times \in \{{\mathcal {P}},{\mathcal {H}},\mathcal {C}\}\). See Appendix 2.

The precision Γ _{ ε } is assumed to follow a Wishart distribution with parameters (Λ _{ ε },ν _{ ε }). See Appendix 2.
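Conjugate priors such as these are what make marginalization tractable. As a minimal sketch, the Beta prior on p can be integrated out of the Bernoulli likelihood of the statuses in closed form, giving a ratio of Beta functions. The function names are hypothetical, and it is assumed here that p is the probability of the pathological status (the excerpt does not say which status p refers to; with a=b the result is symmetric anyway).

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_evidence_statuses(n_patho, n_healthy, a=1.0, b=1.0):
    """Log marginal probability of a status vector with n_patho pathological
    and n_healthy healthy individuals, after integrating p ~ Beta(a, b)."""
    return log_beta(a + n_patho, b + n_healthy) - log_beta(a, b)


# One pathological and one healthy individual under a uniform prior (a = b = 1):
# the marginal probability of that ordered pair of statuses is B(2, 2) / B(1, 1).
print(exp(log_evidence_statuses(1, 1)))
```

Under the model as described, this factor involves only the statuses and not the partition δ, so it would be expected to cancel when partitions are compared.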
In the subsequent developments, we proceed with the calculation of the posterior probability for the partitions δ in the two cases: noiseless concentrations in Section 3.1 and noisy concentrations in Section 3.2. One of the novelties is an explicit analytical result for the noiseless case and a precise approximation for the noisy case.
3.1 Selection using the noiseless data
Optimal decision-maker The question addressed in the paper is the identification of a set of discriminant proteins, and it amounts to making a decision regarding the partition δ. To build an optimal decision-maker, a binary loss is considered that assigns a null loss to the correct decision and a unitary loss to the incorrect decisions. The risk is the mean loss over the models δ, the data (\(\boldsymbol {\underline {x}},\boldsymbol {b}\)), and the unknown parameters θ. The optimal decision-maker is defined as the risk minimizer, and it is known to be the one that selects the a posteriori most probable model. It should be noted that alternative loss functions could be chosen, for instance, one that would penalize erroneous partitions differently depending on the number of biomarkers properly identified. In this case, the decision would still be based on the posterior probabilities but with a different rule. However, our choice not only leads to a simple identification procedure but also naturally prevents overfitting.
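The link between the binary loss and the most probable partition can be made explicit in one line. Writing \(\boldsymbol{\delta}'\) for a candidate decision (a symbol introduced for this illustration), the posterior expected loss is

$$\begin{aligned} \sum_{\boldsymbol{\delta}}\mathbb{1}\left[\boldsymbol{\delta}\neq\boldsymbol{\delta}'\right]\Pr\left(\boldsymbol{\delta}\,\vert\,\boldsymbol{\underline{x}},\boldsymbol{b}\right)=1-\Pr\left(\boldsymbol{\delta}'\,\vert\,\boldsymbol{\underline{x}},\boldsymbol{b}\right), \qquad\text{hence}\qquad \hat{\boldsymbol{\delta}}=\arg\max_{\boldsymbol{\delta}}\ \Pr\left(\boldsymbol{\delta}\,\vert\,\boldsymbol{\underline{x}},\boldsymbol{b}\right) \end{aligned} $$

Minimizing this quantity for every data set also minimizes its average over the data, so the decision reduces to computing the posterior probability of each partition.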
This calculation is the main difficulty of the paper and more generally in variable and model selection.
rendering the usually complex calculations of the evidences straightforward.
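To make the role of conjugacy concrete, here is a one-protein sketch in which the Normal-Wishart prior reduces to a Normal-Gamma prior on the mean and precision of a Gaussian. The log evidence is then available in closed form, and the two hypotheses for that protein can be compared directly: separate parameters for the healthy and pathological groups (discriminant) versus common parameters for the pooled group (nondiscriminant). This is a univariate analog written with a standard Normal-Gamma parameterization, not the paper's multivariate expression; the hyperparameter names and values and the data are hypothetical.

```python
from math import lgamma, log, pi


def log_evidence_normal_gamma(xs, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Closed-form log marginal likelihood of i.i.d. Gaussian samples whose
    mean m and precision g have the prior g ~ Gamma(alpha0, rate=beta0),
    m | g ~ N(mu0, 1 / (kappa0 * g))."""
    n = len(xs)
    if n == 0:
        return 0.0
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (mean - mu0) ** 2 / (2 * kappa_n)
    return (lgamma(alpha_n) - lgamma(alpha0)
            + alpha0 * log(beta0) - alpha_n * log(beta_n)
            + 0.5 * (log(kappa0) - log(kappa_n))
            - 0.5 * n * log(2 * pi))


def log_bayes_factor_discriminant(x_healthy, x_patho, **prior):
    """Log evidence of 'one Gaussian per status' minus 'one common Gaussian'."""
    separate = (log_evidence_normal_gamma(x_healthy, **prior)
                + log_evidence_normal_gamma(x_patho, **prior))
    pooled = log_evidence_normal_gamma(x_healthy + x_patho, **prior)
    return separate - pooled


healthy = [0.0, 0.1, -0.1, 0.05, -0.05]
shifted = [5.0, 5.1, 4.9, 5.05, 4.95]      # clearly different mean
similar = [0.02, -0.03, 0.08, -0.06, 0.01]  # same distribution as `healthy`
print(log_bayes_factor_discriminant(healthy, shifted))  # positive
print(log_bayes_factor_discriminant(healthy, similar))  # negative
```

The second case shows the automatic penalty on the richer model: when the two groups look alike, the pooled hypothesis obtains the larger evidence even though the separate model fits at least as well.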
Assuming that all candidate models are equally probable a priori, from Eq. (7), the posterior probability across the 2^{ P } models can be inferred. The selected model is the one which maximizes this probability. It should be noted that if prior information is available, such as protein-to-protein interactions (PPIs), it can be taken into account by assigning a higher probability to partitions wherein the related proteins are in the same subset (either discriminant or not). In Eq. (10), the normalizing constants for the posterior distributions depend on the empirical covariance matrices of the population of individuals for the discriminant proteins and the nondiscriminant ones, respectively. Their computation is expensive. However, it suffices to compute the full covariance matrix for all the proteins once and then remove the appropriate rows and columns for the 2^{ P } configurations to be tested.
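The bookkeeping described in this paragraph can be sketched as follows: enumerate the 2^{ P } partitions, extract for each one the covariance blocks of the discriminant and nondiscriminant proteins from a covariance matrix computed once, and turn a list of log evidences into posterior probabilities under a uniform prior. All function names are hypothetical, the covariance values are made up, and the log evidences are placeholders standing in for Eq. (10), which is not reproduced here.

```python
from itertools import product
from math import exp, log


def all_partitions(P):
    """The 2**P partitions as tuples of flags (1 = discriminant, 0 = not)."""
    return list(product((0, 1), repeat=P))


def submatrix(cov, indices):
    """Keep only the rows and columns of `cov` listed in `indices`."""
    return [[cov[i][j] for j in indices] for i in indices]


def split_covariance(cov, delta):
    """Covariance blocks of the discriminant and nondiscriminant proteins."""
    plus = [i for i, d in enumerate(delta) if d == 1]
    minus = [i for i, d in enumerate(delta) if d == 0]
    return submatrix(cov, plus), submatrix(cov, minus)


def posterior_from_log_evidences(log_ev):
    """Posterior probabilities of the models under a uniform prior."""
    top = max(log_ev)
    weights = [exp(v - top) for v in log_ev]  # subtract the max for stability
    total = sum(weights)
    return [w / total for w in weights]


# Full covariance matrix computed once for P = 3 proteins (made-up values).
cov = [[4.0, 1.0, 0.5],
       [1.0, 3.0, 0.2],
       [0.5, 0.2, 2.0]]
print(len(all_partitions(3)))              # number of candidate models
print(split_covariance(cov, (1, 0, 1)))    # blocks for one partition
print(posterior_from_log_evidences([log(1.0), log(3.0)]))
```

The exhaustive enumeration grows as 2^{ P }, which is why reusing a single covariance matrix matters; subtracting the largest log evidence before exponentiating avoids underflow when the evidences are very small.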
3.2 Selection using noisy data
The model presented above assumes that the concentrations are directly observed. Although this assumption leads to closed-form expressions of the posterior probabilities, it may be too simplistic. In practice, the concentrations are known up to an uncertainty and this section extends the above developments to account for these uncertainties. However, this comes at the price of intractable calculations, and to overcome this difficulty, we propose a suitable approximation. As introduced above, the measured concentrations are modeled as y _{ n }=x _{ n }+ε _{ n } where ε _{ n } is a zero-mean Gaussian vector with precision Γ _{ ε }, therefore \(\boldsymbol {y}_{n}-\boldsymbol {x}_{n} \sim \mathcal {N}(\boldsymbol {y}_{n}-\boldsymbol {x}_{n}; \mathbf {0},\boldsymbol {\Gamma }_{\varepsilon })\). Similarly to the previous section, the vectors of observed concentrations are stacked in a matrix \(\boldsymbol {\underline {y}}\) of dimension P×N. To select the most probable model, the evidence \(f_{\boldsymbol {\underline {Y}},\boldsymbol {B}\vert \boldsymbol {\Delta }}(\boldsymbol {\underline {y}},\boldsymbol {b} \vert \boldsymbol {\delta })\) must be calculated for each candidate model (it was \(f_{\boldsymbol {\underline {X}},\boldsymbol {B}\vert \boldsymbol {\Delta }}(\boldsymbol {\underline {x}},\boldsymbol {b} \vert \boldsymbol {\delta })\) for the noiseless model). The difficulty is that the calculation of the evidence requires the marginalization not only of the model parameters but also of the true concentrations. Furthermore, the precision Γ _{ ε } is assumed unknown and must also be marginalized. For notational convenience, we define \(\tilde {\boldsymbol {\theta }} = \left [\boldsymbol {\theta }, \boldsymbol {\Gamma }_{\varepsilon } \right ]\) as an extended vector of unknown parameters.
with \(\times \in \{{\mathcal {P}},{\mathcal {H}},\mathcal {C}\}\) and ⋆∈{+,−}. The integrals (12) and (13) can be calculated analytically.
where \({\mathcal {KW}}_{\varepsilon }^{\star }\) is the normalization constant of the Wishart distribution and \(\mathcal {T}_{P,N}(\boldsymbol {T};q,\boldsymbol {M},\boldsymbol {\Sigma },\boldsymbol {\Omega })\) denotes the matrix t-distribution of parameters q, M, Σ, and Ω, for a matrix T of dimensions P×N. The expression is recalled in the “Matrix variate t-distribution” section of Appendix 2.
with \(\mathcal {KNW}^{\star \text {pst}}_{\times }\) and \(\mathcal {KNW}^{\star \text {pri}}_{\times }\) the normalization constants of the posterior and prior Normal-Wishart distributions for (m _{×},Γ _{×}), respectively.
In (11), the result of the first integration with respect to \(\tilde {\boldsymbol {\theta }}\) does not yield an expression that can be integrated analytically w.r.t. \(\boldsymbol {\underline {x}}\). To address this issue, we propose to take advantage of the fact that a matrix variate t-distribution \(\mathcal {T}_{P,N}(\boldsymbol {T};q,\boldsymbol {M},\boldsymbol {\Sigma },\boldsymbol {\Omega })\) tends to a Gaussian distribution when the degrees of freedom parameter q tends to infinity.
where \(C_{\times }^{\star }\) is a proportionality constant.
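The limiting behaviour invoked for this approximation can be checked numerically in the simplest setting. The sketch below uses the univariate Student t density as a stand-in for the matrix variate t-distribution (whose exact parameterization is given in the Appendix and is not reproduced here) and measures its largest pointwise distance to the standard Gaussian density on a grid, for increasing degrees of freedom q. The grid and the values of q are arbitrary choices.

```python
from math import exp, lgamma, log, pi, sqrt


def student_t_pdf(t, q):
    """Density of the standard univariate Student t with q degrees of freedom."""
    log_norm = lgamma((q + 1) / 2) - lgamma(q / 2) - 0.5 * log(q * pi)
    return exp(log_norm - (q + 1) / 2 * log(1 + t * t / q))


def normal_pdf(t):
    """Density of the standard Gaussian."""
    return exp(-0.5 * t * t) / sqrt(2 * pi)


def max_gap(q):
    """Largest absolute difference between the two densities on [-5, 5]."""
    grid = [i / 10 for i in range(-50, 51)]
    return max(abs(student_t_pdf(t, q) - normal_pdf(t)) for t in grid)


for q in (2, 10, 100, 1000):
    print(q, max_gap(q))
```

The gap shrinks steadily as q grows, which suggests that a Gaussian approximation is most accurate when the degrees of freedom are large; how large q is in the paper's setting depends on quantities defined in the omitted equations.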
4 Numerical evaluation
 1. The t test [31], which consists in comparing the means of each protein concentration between the two cohorts, \({\mathcal {H}}\) and \({\mathcal {P}}\). If the null hypothesis, standing for the mean equality, is rejected, then the protein is declared as a biomarker. The type I error, denoted as α, corresponds to the incorrect rejection of a true null hypothesis. Its value is used to set the t test decision threshold. In this paper, it is not necessary to adjust the type I errors to account for multivariate effects. The reason is that, for fair comparison purposes, we directly select the setting that leads to the best performance of the test regarding our criterion. This point is commented on in Section 4.1 and Fig. 2.
 2. The LASSO method [20], based on a linear regression model in which the explanatory variables are the protein concentrations \(\boldsymbol {\underline {x}}\), while the response variables are the biological statuses b. The LASSO method estimates the coefficients of the model by minimizing the sum of the squared errors, with an L _{1}-norm penalty. Then, a protein is selected as a biomarker if the value of the coefficient corresponding to its concentration is different from zero. This method introduces a regularization parameter denoted λ.
 3. The Bhattacharyya distance [32] is a measure of similarity between two probability distributions and by extension between two populations of individuals [32]. For two multivariate normal distributions with respective mean and covariance matrix (μ _{1},Σ _{1}) and (μ _{2},Σ _{2}), it is given by:
$$\begin{aligned} D_{b}&=\frac{1}{8}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{T}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})\\&\quad+\frac{1}{2}\log\left(\frac{\text{det}(\boldsymbol{\Sigma})}{\sqrt{\text{det}(\boldsymbol{\Sigma}_{1})\text{det}(\boldsymbol{\Sigma}_{2})}} \right) \end{aligned} $$
with Σ=(Σ _{1}+Σ _{2})/2. In the sequel, the Bhattacharyya distance is calculated for each protein by replacing the true mean and covariance matrix by their empirical estimates. The protein is declared as discriminant if the distance is greater than a fixed threshold denoted t. The algorithm is referred to as Bhadistance.
 4. The FOHSIC algorithm as introduced in [33]. It performs feature selection based on the Hilbert-Schmidt Independence Criterion (HSIC). The authors propose an unbiased estimator of HSIC and then, assuming the number of significant features is set a priori, use a forward procedure to select them. In our context, the significant features are the biomarkers.
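The per-protein Bhattacharyya rule of item 3 can be sketched for a single protein, where the means and covariance matrices reduce to scalar means and variances. The function names, the sample values, and the threshold are hypothetical, and the unbiased variance estimate is used as one possible reading of "empirical estimates".

```python
from math import log, sqrt
from statistics import mean, variance


def bhattacharyya_1d(sample_1, sample_2):
    """Bhattacharyya distance between two univariate Gaussians whose means
    and variances are replaced by empirical estimates."""
    mu1, mu2 = mean(sample_1), mean(sample_2)
    v1, v2 = variance(sample_1), variance(sample_2)  # unbiased estimates
    v = 0.5 * (v1 + v2)                              # averaged variance
    return 0.125 * (mu1 - mu2) ** 2 / v + 0.5 * log(v / sqrt(v1 * v2))


def is_discriminant(sample_healthy, sample_patho, threshold):
    """Declare the protein discriminant when the distance exceeds the threshold."""
    return bhattacharyya_1d(sample_healthy, sample_patho) > threshold


healthy = [1.0, 2.0, 3.0]
patho = [3.0, 4.0, 5.0]   # same spread, mean shifted by 2
print(bhattacharyya_1d(healthy, patho))
print(is_discriminant(healthy, patho, threshold=0.1))
```

With equal variances the logarithmic term vanishes and the distance reduces to the squared mean difference divided by eight times the variance, so the rule then behaves much like a test on the means.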
In this section, we refer to the method from Section 3.1 as the Bayesian Model Selection with Analytical Solution for Noiseless Data (BMSASD) method, and to the method from Section 3.2 as the Bayesian Model Selection with Analytical Solution for Noisy Data (BMSASN) method.
where the superscripts i,j denote the entry (i,j) of the matrices, E(·) and V(·) refer to the expectation and the covariance matrix of a vector, respectively, while cov(·,·) stands for the covariance between two random variables. We also recall that ∗∈{ ^{+}, ^{−}, ^{“ "}} depending on whether the discriminant/nondiscriminant subsets of proteins are considered or the whole set. As a consequence, the prior parameters (ν _{×},η _{×},μ _{×},Λ _{×}) can be calculated from (18) to (21) and substituted in (10). Although our choice of prior is not noninformative in the strict sense, it is vague enough so as not to impact biomarker detection. This issue is investigated in the next subsection.
Finally, to calculate (17) in the noisy case, additional hyperparameters for the Wishart probability density function of the noise precision matrix have to be tuned. They are chosen such that \(E(\boldsymbol {\Gamma }_{\varepsilon }) =\nu _{\varepsilon }^{\text {pri}}\boldsymbol {\Lambda }_{\varepsilon }^{\text {pri}}\) and that the elements of the covariance matrix satisfy \(\text {cov}\left (\Gamma _{\varepsilon }^{i,j},\Gamma _{\varepsilon }^{k,l}\right)= \nu _{\varepsilon }^{\text {pri}} \left (\Lambda _{\varepsilon }^{\text {pri},il}\Lambda _{\varepsilon }^{\text {pri},jk}+\Lambda _{\varepsilon }^{\text {pri},ik}\Lambda _{\varepsilon }^{\text {pri},jl}\right)\). Therefore, by accounting for real-life orders of magnitude of Γ _{ ε }, the prior parameters \(\left (\nu _{\varepsilon }^{\text {pri}}, \boldsymbol {\Lambda }_{\varepsilon }^{\text {pri}}\right)\) can be calculated and substituted in the probability (10).
In the next sections, we present the results of the numerical evaluations of the proposed methods using both simulated and real data.
4.1 Evaluation using simulated data
4.1.1 Description of the simulated data and performance index
We consider the concentrations of a list of P proteins for a group which comprises \(N_{\mathcal {H}}\) healthy and \(N_{\mathcal {P}}\) pathological individuals, respectively, with \(N_{\mathcal {H}}+N_{\mathcal {P}}=N\). The possible partitions for discriminant/nondiscriminant proteins thus amount to 2^{ P } and they are referred to as true models. For each true model, N _{r}=10^{5} data realizations are simulated, hence the total number of realizations equals N _{r} 2^{ P }.
On the one hand, the noiseless data comprise the biological statuses b _{ n } and the actual protein concentrations x _{ n } of the N individuals and are generated as follows. The biological statuses are sampled from the Bernoulli distribution of parameter p, where p is assumed Beta distributed with parameters a=1 and b=1, which corresponds to a uniform distribution. The concentrations of the subset of discriminant proteins are generated from the Gaussian distributions \(\mathcal {N} \left (\boldsymbol {x}_{n}^{+}; \boldsymbol {m}_{\mathcal {H}}, \boldsymbol {\Gamma }_{\mathcal {H}}\right)\) or \(\mathcal {N} \left (\boldsymbol {x}_{n}^{+}; \boldsymbol {m}_{\mathcal {P}}, \boldsymbol {\Gamma }_{\mathcal {P}}\right)\), depending on the simulated biological status. The concentrations of the subset of nondiscriminant proteins are sampled from \(\mathcal {N} \left (\boldsymbol {x}_{n}^{-}; \boldsymbol {m}_{\mathcal {C}}, \boldsymbol {\Gamma }_{\mathcal {C}}\right)\). The parameters (m _{×},Γ _{×}), where \(\times \in \{{\mathcal {P}},{\mathcal {H}},\mathcal {C}\}\), are distributed according to the Normal-Wishart distribution \(\mathcal {NW}(\nu _{\times },\eta _{\times },\boldsymbol {\mu }_{\times },\boldsymbol {\Lambda }_{\times })\). The orders of magnitude for (m _{×},Γ _{×}) are specified as: \(E(\boldsymbol {m}_{\times })=10^{3}\mathbf {1}_{P^{\ast }},\ V(\boldsymbol {m}_{\times })=10^{4}\,\mathbf {I}_{P^{\ast }},\ E(\boldsymbol {\Gamma }_{\times })=10^{3}\,\mathbf {I}_{P^{\ast }},\ V(\boldsymbol {\Gamma }_{\times })=10^{4}\,\mathbf {I}_{P^{\ast }}\), where \(\mathbf {1}_{P^{\ast }}\) denotes a vector of size P ^{∗} whose elements are all equal to 1 and \(\mathbf {I}_{P^{\ast }}\) is the identity matrix of size P ^{∗}. The same order of magnitude is considered for healthy, pathological, and common parameters, that is to say \(\times \in \lbrace {\mathcal {H}},{\mathcal {P}},\mathcal {C}\rbrace \).
This a priori information is used to tune the hyperparameters as given by (18)–(21).
On the other hand, the noisy data include the biological statuses b _{ n } and the observed protein concentrations y _{ n } for n=1,2,…,N. The protein concentrations x _{ n } are generated as in the case of the noiseless observations by using the same hyperparameter setting. As for the noise ε _{ n }, it is sampled as a zero-mean multivariate Gaussian random vector with precision matrix Γ _{ ε }. The latter is generated from a Wishart density with parameters \(\left (\nu _{\varepsilon }^{\text {pri}}, \boldsymbol {\Lambda }_{\varepsilon }^{\text {pri}}\right)\). In order to determine these hyperparameters, the a priori information is specified as: E(Γ _{ ε })=10^{−2} I _{ P }, V(Γ _{ ε })=10^{−5} I _{ P }. Note here that Γ _{ ε } measures the precision (inverse variance), thus the lower Γ _{ ε } is, the stronger the noise is.
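As a worked example of this tuning, the moment relations stated earlier for the Wishart prior, E(Γ _{ ε })=νΛ and cov(Γ ^{i,j},Γ ^{k,l})=ν(Λ ^{il}Λ ^{jk}+Λ ^{ik}Λ ^{jl}), give a variance of 2ν(Λ ^{ii})² for each diagonal entry. The sketch below solves these two equations for a scale matrix of the form λI. It assumes that V(Γ _{ ε })=10^{−5} I _{ P } specifies the variance of each diagonal entry, which is one possible reading of the text; the function name is hypothetical.

```python
def wishart_hyperparameters(mean_diag, var_diag):
    """Solve nu * lam = mean_diag and 2 * nu * lam**2 = var_diag for a
    Wishart prior with scale matrix lam * I and nu degrees of freedom."""
    lam = var_diag / (2.0 * mean_diag)
    nu = 2.0 * mean_diag ** 2 / var_diag
    return nu, lam


# Target moments taken from the text: E = 1e-2 and V = 1e-5 on the diagonal.
nu, lam = wishart_hyperparameters(1e-2, 1e-5)
print(nu, lam)   # approximately 20 and 5e-4
```

Under this reading, the degrees of freedom come out at about 20 and the scale at about 5×10^{−4}, and multiplying the two recovers the target mean of 10^{−2}.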
For each data set, the posterior probability is computed for all possible partitions according to (10) or (17) for the BMSASD and BMSASN methods, respectively. Then, the most probable partition is selected.
4.1.2 Results for the noiseless model
Before going further in analyzing the performance, it should be noted that the t test, the LASSO, and the Bhattacharyya distance all require the setting of a parameter: the type I error α, the regularization parameter λ, and the threshold t, respectively. So as not to favor our approach, we have run all the algorithms for different values of these parameters and we have selected the best one (in order to get the lowest error rate). Such a procedure cannot be applied on real data, but it allows us to compare the proposed method to the best version of the alternative approaches. The results are given in Fig. 2 for the t test and the LASSO.
Table 1 Noiseless data: τ (%) for different values of P, N=1000
| P | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Best t test | 0.2555 | 0.4025 | 0.471 | 0.5209 |
| Best LASSO | 0.3185 | 4.865 | 14.207 | 27.2655 |
| Best Bhadistance | 0.2245 | 0.3743 | 0.4496 | 0.5008 |
| BMSASD | 0.0935 | 0.1434 | 0.1563 | 0.1616 |
Table 2 Noiseless data: τ (%) for different values of N, P=3
| N | 100 | 500 | 1000 |
|---|---|---|---|
| t test | 4.041 | 0.8938 | 0.471 |
| LASSO | 20.5541 | 15.9870 | 14.207 |
| Best Bhadistance | 3.6970 | 0.8194 | 0.4496 |
| BMSASD | 1.71 | 0.32 | 0.1563 |
Table 3 τ (%) for N=1000 and P=3
| Number of biomarkers | 2 |
|---|---|
| BMSASD | 0.0723 |
| FOHSIC | 0.2907 |
Execution time for one simulation N=1000 and P=3
Number of biomarkers  2 

BMSASD  0.202 s 
FOHSIC  0.732 s 
τ (%) for P=8, N _{r}=10^{3}, and N=500
M:  4  6 

BMSASD  0.3014  0.2107 
FOHSIC  0.8271  0.9107 
τ (%) M=4, N _{r}=10^{3}, and N=500
P  8  12 

BMSASD  0.3014  0.1818 
FOHSIC  0.8271  0.8103 
As shown in Tables 3, 5, and 6, the BMSASD algorithm outperforms the FOHSIC. This is explained by the fact that the BMSASD makes a multivariate decision on the whole set of proteins, whereas the FOHSIC uses a forward procedure that can lead to error accumulation: if any biomarker detected along the sequence is false, then the final selected model is bound to be erroneous. Furthermore, the BMSASD is also faster than the FOHSIC, as illustrated in Table 4. More precisely, Table 5 shows the error rate τ of both algorithms for P=8 and different numbers of biomarkers M. Conversely, the results reported in Table 6 are obtained with the number of biomarkers fixed to M=4 while the number of proteins P is varied. In Table 5, the error rate of the FOHSIC increases with M while that of the BMSASD decreases. In Table 6, the error rate of the BMSASD decreases markedly when P goes from 8 to 12, whereas that of the FOHSIC changes only slightly (from 0.8271 to 0.8103). Thus, even for the larger values of P considered, the Bayesian algorithm outperforms the FOHSIC.
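The error-accumulation argument can be illustrated with a toy example. The sketch below is not the FOHSIC algorithm: it is a generic greedy forward selection compared with a joint search over subsets, and the subset scores are hypothetical values chosen so that the best single variable does not belong to the best pair.

```python
from itertools import combinations

# Hypothetical subset scores (higher is better) for three candidate proteins.
# "A" is the best singleton, but the best pair is {"B", "C"}.
SCORES = {
    frozenset(): 0.0,
    frozenset("A"): 0.6, frozenset("B"): 0.5, frozenset("C"): 0.4,
    frozenset("AB"): 0.7, frozenset("AC"): 0.65, frozenset("BC"): 0.9,
    frozenset("ABC"): 0.8,
}


def forward_selection(candidates, score, size):
    """Greedy forward selection: add the best single variable at each step."""
    selected = frozenset()
    for _ in range(size):
        remaining = [c for c in candidates if c not in selected]
        selected = max((selected | {c} for c in remaining), key=score)
    return selected


def exhaustive_selection(candidates, score, size):
    """Score every subset of the requested size and keep the best one."""
    subsets = (frozenset(s) for s in combinations(candidates, size))
    return max(subsets, key=score)


greedy = forward_selection("ABC", SCORES.get, 2)
joint = exhaustive_selection("ABC", SCORES.get, 2)
print(sorted(greedy), sorted(joint))
```

Once the greedy procedure has committed to "A" at the first step, it can no longer reach the pair {"B", "C"}, whereas the joint search compares all candidate subsets directly.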
4.1.3 Results for the noisy model
The performance of the BMSASN and the BMSASD algorithms is first studied as a function of the number of proteins P and the number of individuals N, for a fixed noise level. Then, the BMSASN and the BMSASD methods are compared for different noise conditions.
Table 7 Noisy data: τ (%) for different values of P, N=1000
P       1      2       3       4
BMSASD  16.82  32.784  48.740  63.686
BMSASN  1.38   2.682   4.158   5.698

Table 8 Noisy data: τ (%) for different values of N, P=3
N       100     500     1000
BMSASD  86.850  71.615  48.740
BMSASN  12.477  5.781   4.158

Table 9 τ (%) for N=1000, P=3
E(Γ _{ ε }) (×I _{ P })  10^{−2}  10^{−1}  1      10     10^{2}
BMSASD                   48.740   12.304   3.064  0.813  0.291
BMSASN                   4.158    1.395    0.567  0.303  0.220
In conclusion, the results confirm the good performance of the proposed BMSASN method, which also remains computationally tractable thanks to the analytical approximation of the posterior probabilities.
4.2 Evaluation using the real data
The primary goal of this paper was to present a novel methodology for biomarker identification that relaxes classical simplifying assumptions on the data model and then to evaluate it on simulated data. Nevertheless, we had at our disposal a batch of real data^{2} and we used it to carry out a preliminary study of the BMSASN method. The data are composed of 206 samples: 105 with the status \({\mathcal {H}}\) (76 blood donors and 29 patients with negative colonoscopy) and 101 with a malignant tumor^{3}, i.e., with status \({\mathcal {P}}\). The latter are distributed as follows: 24 patients in stage one of the cancer, 26 in stage two, 23 in stage three, 25 in stage four, and three with a missing stage. The protein concentrations are obtained using the Bayesian inversion method developed in [34] from measurements of SRM spectra according to the methodology described in [35]. For each sample, the concentrations of 21 proteins are measured (14-3-3 protein sigma; 78 kDa glucose-regulated protein; protein S100-A11; calmodulin; calreticulin; peptidyl-prolyl cis-trans isomerase A; defensin-5; defensin-6; heat shock cognate 71 kDa protein; fatty acid-binding protein, intestinal; fatty acid-binding protein, liver (LFABP); stress-70 protein, mitochondrial; protein disulfide-isomerase (PDI); protein disulfide-isomerase A6 (PDIA6); phosphoglycerate kinase 1; retinol-binding protein 4; peroxiredoxin-5, mitochondrial; protein S100-A14; triosephosphate isomerase; villin-1 (Villin); vimentin). Only one of these proteins, LFABP, was previously identified by SRM as a biomarker [36].
To calculate the hyperparameters (18)–(21), empirical orders of magnitude for (m _{×},Γ _{×}) (e.g., in μg/ml) are specified as follows: \(E(\boldsymbol {m}_{\times })=10^{2}\mathbf {1}_{P^{\ast }},\ V(\boldsymbol {m}_{\times })=10^{3}\,\mathbf {I}_{P^{\ast }},\ E(\boldsymbol {\Gamma }_{\times })=10^{3}\,\mathbf {I}_{P^{\ast }},\ V(\boldsymbol {\Gamma }_{\times })=10\,\mathbf {I}_{P^{\ast }}\).
Table 10 The four most probable partitions for the real data, P=21
Declared biomarker  LFABP          No biomarker   LFABP and Villin  LFABP and PDIA6
Probability         9.986×10^{−1}  1.361×10^{−3}  9.762×10^{−7}     8.297×10^{−7}
Despite the large number of models to compare (2^{21}, i.e., about two million candidate models), the computation time is just 1 h. This short computation time is made possible by the analytical calculation of the posterior probability, which avoids computationally intensive numerical integration methods such as MCMC algorithms [24].
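As a rough order-of-magnitude check of these figures, the short sketch below counts the partitions of P=21 proteins and derives the average time available per partition, assuming the reported run time of about 1 h is spread evenly over all candidates. The variable names are illustrative only.

```python
num_proteins = 21
num_partitions = 2 ** num_proteins   # each protein is either a biomarker or not
total_seconds = 60 * 60              # reported run time of about 1 h
ms_per_partition = 1000 * total_seconds / num_partitions

print(num_partitions, round(ms_per_partition, 2))  # 2097152 1.72
```

An average budget of under 2 ms per partition is what makes the exhaustive comparison feasible here; since the count doubles with each additional protein, the same approach becomes costly for much larger P.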
5 Synthesis and perspectives
Biomarker discovery is a challenging task of the utmost interest for the diagnosis and prognosis of diseases. This paper presents a statistical approach based on variable selection. It is developed in a Bayesian framework and relies on an optimal strategy, i.e., the minimization of an error risk. Given P candidate proteins, the proposed procedure compares the posterior probabilities of the 2^{ P } partitions (subset of discriminant versus subset of non-discriminant proteins). The partition with the highest posterior probability is retained and thus defines the selected variables. The main difficulty is the required integration with respect to all the unknown model parameters. An important contribution is to provide a closed-form expression of the probabilities for noiseless observations and a sensible approximation for noisy observations. The proposed method proved to be well-suited to variable selection in a complex context. Its effectiveness is assessed by a theoretical characterization and by numerical studies (on simulated and real data), which are in accordance with the theoretical optimality. Furthermore, the proposed method compares favorably with the t test, the LASSO, the Bhattacharyya distance, and the FOHSIC.
From a methodological standpoint, several perspectives can be considered. Regarding the concentrations, non-Gaussian distributions, e.g., Gamma or Wishart models, could be relevant. Regarding the status, a possible development would account for errors in the given status. In this case, an additional level would be added to the hierarchical Bayesian model, including a prior probability for an erroneous status.
As for the applicative perspectives, we plan to apply the method to other clinical data sets and to other biomedical fields (e.g., genomics, transcriptomics). In addition, we also intend to use the method in other domains, for instance, in astrophysics (identification of pertinent features to classify galaxies) or for complex structures and industrial processes (identification of indicators for the detection and diagnosis of damage or faults, analysis of fatigue, and aging prevention).
6 Endnotes
^{1} The knowledge about orders of magnitude of the concentration values is acquired from the real data set provided by bioMérieux (Technology Research Department), France.
^{2} SRM measurements provided by bioMérieux (Technology Research Department), France.
^{3} Colorectal cancer.
7 Appendix 1
7.1 Reduction of the concentration distribution
which allows easier handling of the Normal-Wishart prior.
8 Appendix 2
8.1 Wishart, Normal-Wishart, and matrix variate t-distribution
8.1.1 Wishart
8.1.2 Normal-Wishart
and it does not depend on μ (that is a position parameter).
8.1.3 Matrix variate t-distribution
where T and M are P×N matrices, Ω and Σ are positive-definite matrices with respective sizes N×N and P×P, and q>0.
When q tends to infinity, the distribution of T tends to a Gaussian distribution with mean M and covariance Σ⊗Ω, that is to say, \(\mathcal N(\boldsymbol {T}\, ; \,\boldsymbol {M},\boldsymbol {\Sigma }\otimes \boldsymbol {\Omega })\), where ⊗ is the Kronecker product.
Declarations
Funding
This work was supported by the French National Research Agency through the Bayesian Hierarchical Inversion in Proteomics Project under Contract ANR2010BLAN0313.
Authors’ contributions
JFG proposed the initial idea of the method and provided the first theoretical developments. He also contributed substantially to the writing of the paper. AG proposed the extension to noisy data and contributed substantially to the writing of the paper. ND provided the largest part of the Matlab developments and numerical assessment. She also contributed to the writing of the paper. CT provided input in the proteomics field. She contributed to the comparison with existing methods. She also contributed to the manuscript and provided valuable comments. MH contributed to the Matlab development and the numerical assessment. She also contributed to the writing of the paper. JPC provided input in the SRM field and in the interpretation of the results. He acquired and provided real SRM spectra to test the algorithm. He revised the manuscript. LG developed the BHI processing algorithms for the MRM mode. He computed the protein concentration profiles for the MRM clinical data set used for the evaluation on real data. He contributed to the interpretation of the results. PD provided expertise regarding proteomics. BL contributed to the early developments of the BHIPRO project. PG was the BHIPRO project manager. He coordinated the conception of the processing algorithms and the interpretation of the results. He revised the manuscript. PR coordinated biostatistical developments and the interpretation of the results. He revised the manuscript. All authors read and approved the final manuscript.
Competing interests
JPC and BL are employed by bioMérieux. The other authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. S Srivastava, Informatics in Proteomics, ser. Statistics: a series of textbooks and monographs (CRC Press, Boca Raton, 2005).
2. KA Do, P Muller, M Vannucci, Bayesian inference for gene expression and proteomics (Cambridge University Press, Cambridge, England, 2006).
3. T Fortin, A Salvador, JP Charrier, C Lenz, X Lacoux, A Morla, G Choquet-Kastylevsky, J Lemoine, Clinical quantitation of prostate-specific antigen biomarker in the low nanogram/milliliter range by conventional bore liquid chromatography-tandem mass spectrometry (multiple reaction monitoring) coupling and correlation with ELISA tests. Mol. Cell Proteomics. 8(5), 1006–1015 (2009).
4. C Huillet, A Adrait, D Lebert, G Picard, M Trauchessec, M Louwagie, A Dupuis, L Hittinger, B Ghaleh, P Le Corvoisier, M Jaquinod, J Garin, C Bruley, V Brun, Accurate quantification of cardiovascular biomarkers in serum using protein standard absolute quantification (PSAQ) and selected reaction monitoring. Mol. Cell Proteomics. 11(2) (2012).
5. K Harris, M Girolami, H Mischak, Definition of valid proteomic biomarkers: a Bayesian solution. Lect. Notes Comput. Sci. 5780, 137–149 (2009).
6. M Frantzi, A Bhat, A Latosinska, Clinical proteomic biomarkers: relevant issues on study design & technical considerations in biomarker development. Clin. Transl. Med. 3(7) (2014).
7. P Roy, C Truntzer, D Maucort-Boulch, T Jouve, N Molinari, Protein mass spectra data analysis for clinical biomarker discovery: a global review. Brief. Bioinform. 12(2), 176–186 (2011).
8. D Sidransky, S Srivastava, Changes in collagen metabolism in prostate cancer: a host response that may alter progression. Nat. Rev. Cancer. 18(3), 789–795 (2003).
9. H Hoijtink, I Klugkist, Comparison of hypothesis testing and Bayesian model selection. Qual. Quant. 41, 73–91 (2007).
10. S Dudoit, J Popper Shaffer, J Boldrick, Multiple hypothesis testing in microarray experiments. Statist. Sci. 18, 71–103 (2003).
11. Y Benjamini, Y Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B. 57(1), 289–300 (1995).
12. GK Smyth, Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3(3) (2004).
13. N Draper, H Smith, Applied regression analysis, 3rd ed (Wiley Series in Probability and Statistics, Chichester, New York, Singapore, Toronto, 1998).
14. M Bhattacharjee, C Botting, M Sillanpaa, Bayesian biomarker identification based on marker-expression proteomics data. ELSEVIER Genomics. 92, 37–55 (2008).
15. J Fan, R Li, Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001).
16. M Chen, DK Dey, Variable selection for multivariate logistic regression model. J. Stat. Plan. Infer. 111, 37–55 (2003).
17. Z Yuan, D Ghosh, Combining multiple biomarker models in logistic regression. Biometrics. 64 (2008).
18. H Akaike, Information theory and an extension of the maximum likelihood principle. Proc. Second Int. Symp. Inform, 261–281 (1973).
19. AE Hoerl, RW Kennard, Ridge regression: applications to nonorthogonal problems. Technometrics. 12(1), 69–82 (1970).
20. R Tibshirani, Regression shrinkage and selection via the LASSO. J. Royal Stat. Soc. Series B (Methodology). 1, 267–288 (1996).
21. H Zou, T Hastie, Regularization and variable selection via the elastic net. J. R. Statist. Soc. B. 67, 301–320 (2005).
22. J Ogutu, T Schultz-Streek, HP Piepho, Genomic selection using regularized linear regression models: ridge regression, LASSO, elastic net and their extensions. ser. Proc. of the 15th European workshop on QTL mapping and marker assisted selection (QTLMAS), Rennes, France, 2011, 37–55.
23. MI Ghahramani, A note on the evidence and Bayesian Occam’s razor. Gatsby Unit, University College London, Technical Report GCNUTR 2005003 (2005).
24. CP Robert, G Casella, Monte Carlo statistical methods, ser. Springer Texts in Statistics (Springer, New York, 2004).
25. B Carlin, T Louis, Bayesian methods for data analysis (CRC Press, Chapman & Hall, Boca Raton, London, New York, 2009).
26. A Raftery, M Newton, J Satagopan, P Krivitsky, in Bayesian Statistics, 8. Estimating the integrated likelihood via posterior simulation using the harmonic mean identity, (2007), pp. 1–45.
27. D Lee, N Chia, A particle algorithm for sequential Bayesian parameter estimation and model selection. IEEE Trans. on Sign. Proc. 50(2), 326–336 (2002).
28. H Mallick, N Yi, Bayesian methods for high dimensional linear models. J. Biom. Biostat. 1(5), 326–336 (2013).
29. F Adjed, JF Giovannelli, A Giremus, N Dridi, P Szacherski, in ser. Proc. of the IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2013). Variable selection for a mixed population applied in proteomics (Vancouver, 2013), pp. 1153–1157.
30. C Robert, in The Bayesian Choice. From decision-theoretic foundations to computational implementation. Springer Texts in Statistics (Springer-Verlag, New York, 2007).
31. G Saporta, Probabilités, analyse de données et statistique (Editions Technip, 1990).
32. A Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distributions. Indian J Stat. 7(4), 401–406 (1946).
33. L Song, A Smola, A Gretton, J Bedo, K Borgwardt, Feature selection via dependence maximization. J. Mach. Learn. Res. 13(1), 1393–1434 (2012).
34. P Szacherski, JF Giovannelli, L Gerfault, P Mahé, JP Charrier, A Giremus, B Lacroix, P Grangeat, Classification of proteomic MS data as Bayesian solution of an inverse problem. IEEE Access. 2, 1248–1262 (2014).
35. A Klich, C Mercier, L Gerfault, P Grangeat, C Beaulieu, E Degout-Charmette, T Fortin, P Mahé, JF Giovannelli, JP Charrier, A Giremus, D Maucort-Boulch, P Roy, Experimental design and statistical analysis for evaluation of quantification performance of two molecular profile reconstruction algorithms used in selected reaction monitoring-mass spectrometry. Service de Biostatistique, Hospices Civils and Laboratoire de Biométrie et Biologie Evolutive, Lyon, Technical Report (2016).
36. J Lemoine, T Fortin, A Salvador, A Jaffuel, JP Charrier, G Choquet-Kastylevsky, The current status of clinical proteomics and the use of MRM and MRM3 for biomarker validation. Pharmacogenomic Proteomic Metabolomic Appl. 12(4), 333–345 (2012).