Bayesian estimation of the discrete coefficient of determination

The discrete coefficient of determination (CoD) measures the nonlinear interaction between discrete predictor and target variables and has had far-reaching applications in Genomic Signal Processing. Previous work has addressed the inference of the discrete CoD using classical parametric and nonparametric approaches. In this paper, we introduce a Bayesian framework for the inference of the discrete CoD. We derive analytically the optimal minimum mean-square error (MMSE) CoD estimator, as well as a CoD estimator based on the Optimal Bayesian Predictor (OBP). For the latter estimator, exact expressions for its bias, variance, and root-mean-square (RMS) are given. The accuracy of both Bayesian CoD estimators with non-informative and informative priors, under fixed or random parameters, is studied via analytical and numerical approaches. We also demonstrate the application of the proposed Bayesian approach in the inference of gene regulatory networks, using gene-expression data from a previously published study on metastatic melanoma.


Introduction
DNA regulatory circuits can be often described by networks of Boolean logical gates updated and observed at discrete time intervals [1][2][3][4][5][6]. In a stochastic setting, the degree of association between Boolean predictors and targets can be quantified by means of the discrete coefficient of determination (CoD) [7]. As such, the CoD is a function of the joint probability of target and predictor variables, which, however, is usually unknown in practice. Hence, this requires the inference of the discrete CoD-given sample data. A larger sample-based CoD value indicates a tighter regulation between target and predictors.
The concept of CoD has far-reaching applications in genomics. The CoD was perhaps the first predictive paradigm utilized in the context of microarray data, the goal being to provide a measure of nonlinear interaction among genes [7]. The CoD has been used in the reconstruction or inference of gene regulatory networks using gene expression data quantized into discrete levels [8][9][10][11]. It has also been used in the definition of the intrinsically-multivariate prediction (IMP) criterion for variance, and root-mean-square (RMS) error of the OBP CoD estimator. The accuracy of both Bayesian CoD estimators is compared against that of several nonparametric CoD estimators by numerical simulations. The results indicate that the Bayesian MMSE CoD estimator is the best one when averaged over all distributions and samples, whereas the simpler OBP CoD estimator, though suboptimal in the MMSE sense, can be more accurate than the MMSE CoD estimator, in a frequentist sense, under low-variance informative priors around fixed parameters corresponding to a fixed distribution between target and predictors. It is also unsurprisingly found that priors with higher densities around true fixed distributions produce more accurate Bayesian estimators in a frequentist sense.
This paper is organized as follows. In Section 2, we introduce the discrete model for prediction and present the coefficient of determination in this model. In Section 3, we develop a Bayesian framework of the inference of the discrete CoD, define two Bayesian CoD estimators, one in the sense of minimum mean-square error (MMSE), and the other based on the optimal Bayesian classifier, and derive the analytical expressions for both Bayesian CoD estimators. In Section 4, we first present an exact formulation of accuracy metrics for the OBP CoD estimator. Afterwards, we discuss the accuracy of both Bayesian CoD estimators when averaged over all distributions and samples as well as under fixed distributions under varying priors, and their comparison with the nonparametric CoD estimators. Section 6 describes an approach to the inference of gene regulatory networks using the proposed Bayesian CoD estimators and illustrates the approach with gene expression data from a previously published study on metastatic melanoma. Finally, Section 7 presents concluding remarks.

The discrete coefficient of determination
The CoD, which was originally defined in classical regression analysis, gives the relative decrease in unexplained variability when entering a variable X into the regression of the dependent variable Y, in comparison with the total unexplained variability when entering no variables. Dougherty and collaborators extended the concept of CoD to discrete random variables [7]. Given a specified error criterion, such as the mean-square error or the mean-absolute error, the CoD was defined in [7] as where ε 0 is the minimum error of predicting Y by a constant (i.e., in the absence of observations) and ε is the minimum error of predicting Y based on the observation of X. Since ε ≤ ε 0 (all sensible error criteria satisfy this property), the CoD ranges from 0 to 1. The closer it is to one, the closer ε is to zero and the tighter the association between predictor and target variables, whereas the closer it is to zero, the closer ε is to ε 0 and the weaker the association is. By convention, CoD = 0 when ε 0 = 0. The CoD is a function only of the distribution of (X, Y ); in particular, it is not a function of sample data. This definition of the CoD reduces gracefully to the classical one in the case when (X, Y ) is jointly Gaussian [7]. We consider in this paper the case where X = (X 1 , X 2 , . . . , X d ) ∈ {0, 1} d is a binary vector of predicting variables and Y ∈ {0, 1} is a binary target random variable. For example, X and Y may consist of the active/inactive expression state of various genes. The probability distribution of the pair (X, Y ) is specified by the probability c = P(Y = 0), and the probabilities . , x 2 d be an arbitrary enumeration of the possible values of the predicting vector X. An optimal predictor of Y given X is well-known to be ψ * (X) = arg max k P(Y = k | X) [22]. The minimum error of predicting Y based on the observation of X is therefore where I A is the indicator function, which is equal to 1 if A is satisfied and zero, otherwise. On the other hand, an optimal predictor in the absence of observations is clearly given by ψ * = arg max k P(Y = k), so that the minimum error of predicting Y by a constant is given by Plugging (2) and (3) in (1) results in This formula gives the relationship between the CoD and the parameters of the distribution of (X, Y ).

Bayesian CoD estimators
In practice, the distributional parameters are generally unknown, and one would like to estimate the CoD from sample data. We present in this section the derivation of two Bayesian estimators for the CoD in (4). One approach is analogous to that followed by [17] in defining the Bayesian MMSE prediction error estimator, whereas the other one makes use of the optimal Bayesian predictor (OBP), a straightforward generalization of the optimal Bayesian classifier (OBC), introduced in [20].
We will assume that an i.i.d. sample S n = {(X 1 , Y 1 ), . . . , (X n , Y n )} from the distribution of (X, Y ) is available. Given S n , define U i as the number of sample points such that X = x i and Y = 0, and V i as the number of sample points such that X = x i and Y = 1, for i = 1, . . . , 2 d . Note that N 0 = 2 d i=1 U i and N 1 = 2 d i=1 V i are the (random) sample sizes corresponding to Y = 0 and Y = 1, respectively.
Let p = (p 1 , . . . , p 2 d ), q = (q 1 , . . . , q 2 d ), and θ = (c, p, q), where 0 ≤ c, p i , q i ≤ 1, and 2 d i=1 p i = 2 d i=1 q i = 1. As shown in the previous section, the distribution of (X, Y ) is completely specified by the parameter vector θ . The Bayesian approach treats θ as a random variable, the prior distribution of which can take advantage of a priori knowledge about the problem. We will assume that c, p, and q are independent, i.e., f (θ) = f (c)f (p)f (q). It is shown in [17] that this implies that the posterior distri- In this paper, we will employ the standard choice of priors for discrete distributions, namely, the Beta and Dirichlet distributions (c.f. Appendices A and B): where the hyperparameters α, β, α i , β i , i = 1, . . . , 2 d , are positive numbers. These distributions have bounded supports; the Beta distribution is defined over the interval [ 0, 1], while the Dirichlet distribution is defined over the simplex of 2 d nonnegative numbers that add up to one.
The shapes of the distributions are controlled by the concentration parameters c = α + β, p = 2 d j=1 α j , and q = 2 d j=1 β j , and the base measures Please refer to Appendices A and B for definitions and important facts about the Beta and Dirichlet distributions, which will be needed in the sequel.
A very important property for our purposes is that the Beta and Dirichlet priors are conjugate priors for the discrete multinomial distribution, i.e., they have the same form as the corresponding posteriors. Given the sample data S n , the posterior distributions are [17,18]: where n 0 and n 1 are the observed sample sizes corresponding to Y = 0 and Y = 1, respectively, while u i and v i are the observed sample values of the random variables U i and V i , respectively.

Minimum mean-square error CoD estimator
Given a CoD estimator CoD, consider the mean-square error The minimum MSE solution, as is well known, is given by the expectation of the CoD according to the posterior distribution of the parameters [23]. This defines the Bayesian MMSE CoD estimator: where the CoD is given by (4). It is well-known that the MMSE estimator CoD MMSE not only displays the least root mean-square error (RMS) over the distribution of (θ , S n ), but it is also an unbiased estimator (however, for a specific model with fixed θ , CoD MMSE might not be unbiased or have the least RMS).
In order to derive an expression for the Bayesian MMSE CoD estimator, first note that (4) can be rewritten as Applying (8) to (9) and using the previously mentioned fact that the posterior distribution factors allows one to write the Bayesian MMSE CoD estimator as Using (6) and the fact that the marginal distributions of a Dirichlet are Beta (c.f. Appendix B), we have that c | S n ∼ Beta(α s , β s ), Using the results in Appendix A and assuming that the hyperparameters are integers (if they are not, a simple adjustment to the derivation below can be made; see Appendix A), it follows that Likewise, we have and where the Beta function B(a, b) and the coefficients r i (a, b) are defined in Appendix A. Replacing (11)-(14) into (10) produces an exact expression for computing the MMSE CoD estimator in terms of sample sizes and model hyperparameters. Notice that for the previous expressions to make sense, one must have α > p − 1 and β > q − 1. In particular, if uniform priors are chosen for p or q, then the prior for c cannot be uniform (c.f. Appendix A).

Optimal Bayesian predictor CoD estimator
In this section, we derive a second Bayesian CoD estimator, using the optimal Bayesian predictor (OBP), a simple extension to the Boolean prediction problem of the "optimal Bayesian classifier" (OBC) proposed in [20]. Formally, let ε θ [ ψ] denote the error of a predictor ψ under parameter vector θ . The OBP predictor ψ OBP minimizes the average error over the family of (posterior) distributions indexed by the parameter Using the results of [20] for the OBC, one can verify that the OBP for the Beta-Dirichlet model considered here is given by On the other hand, the average errors of the the constant predictors ψ ≡ 0 and ψ ≡ 1 are respectively, so that the OBP error in the absence of observations iŝ We can then combine (17) and (19) to obtain the optimal Bayesian predictor (OBP) CoD estimator It is easy to show that 0 ≤ε OBP ≤ε 0,OBP , and thus 0 ≤ CoD OBP ≤ 1.
Execution time for computation of the OBP CoD estimator grows as O(2 d ). By comparison, the complexity for exact computation of the Bayesian MMSE CoD estimator introduced in the previous subsection is O(n 3 × 2 d ).
Neither n or d tends to be too large in Genomics applications, due to small sample sizes and the fact that the average number of predictor genes d per target gene must be small for a stable system, as remarked by S. Kauffman in [2]. However, if n and d become large, one could devise Monte Carlo approximation methods to compute both CoD estimators.
Therefore, the OBP CoD estimator, though suboptimal, is much more efficient computationally than the MMSE CoD estimator, especially at large sample sizes. In addition, we will see in the next section that the OBP CoD can be even more accurate than the MMSE CoD estimator, in frequentist sense, under a fixed value of the parameters.

Performance analysis
In this section, we investigate the accuracy of the Bayesian CoD estimators proposed in the previous section. We distinguish between two types of accuracy metrics: global metrics concern the average performance over all samples and all distributions of (X, Y ), weighted by the prior distribution of θ , whereas fixed-parameter metrics have to do with the average performance over all samples, but under a particular distribution of (X, Y ), corresponding to a fixed value of the parameter θ . Fixed-parameter metrics thus evaluate the proposed Bayesian estimators from a purely frequentist perspective.
For a given Bayesian CoD estimator, the fixedparameter accuracy metrics of interest are the bias the variance, and the root-mean-square (RMS) error, It becomes clear that the fixed-parameter bias, variance, and RMS of a Bayesian CoD estimator can be obtained with knowledge of the first and second moments The corresponding global accuracy metrics are obtained by taking expectation of the previous quantities with respect to the marginal (i.e., prior) distribution of θ .
As mentioned previously, the global bias of the Bayesian MMSE CoD estimator is zero and its global RMS is minimal among all CoD estimators. However, this does not imply that its fixed-parameter bias is zero or that its fixed-parameter RMS is minimum for all values of the parameter.
In what follows, we give exact expressions for the computation of E S n |θ ε ε 0 and E S n |θ ε 2 ε 2 0 for the OBP CoD estimator. As argued previously, this allows the exact computation of the fixed-parameter bias, variance, and RMS of that CoD estimator. Via simple numerical integration, it is possible then to obtain the global bias, variance, and RMS. It turns out that similar expressions for the MMSE CoD estimator are much harder to obtain; the performance of that estimator are studied via a numerical approach in the next section.
All the expectations and probabilities below are with respect to S n | θ (the subscript will be omitted for convenience). In the expressions below, c, p i , and q i , for i = 1, . . . , 2 d refer to the (deterministic) parameters in θ .
We will provide the derivation only in case (3); the other cases are similar, and lead to the exact same expressions. We assume that α − α = β − β but α = β. Without loss of generality, we assume that α > β, and let α = β +δ, where δ is a positive integer. Notice that it is easy to show in this case that n+β−α 2 + α = n+α−β 2 + β by considering the evenness and oddness of n and δ. Therefore, we have L 0 ⊂ L 1 . In the following, we discuss two cases when n + β − α is even and when n + β − α is odd. ( Now, we are going to use the fact that, for a random variable X and disjoint events A and B, one has 1 We can then write E ε OBP ε 0,OBP as: where P(n 0 = r) = n r c r (1 − c) n−r and with The expression for E ε OBP (n 1 +β)/(n+ c ) n 0 = n − r is obtained from (27), with r + α replaced by r + β. is not an integer. For the second moment, we have where M = (n + c )ε 0,OBP , as before. By using the same reasoning applied previously in the case of the first moment, we have where, letting with The expression for E ε 2 OBP (n 1 +β) 2 /(n+ c ) 2 n 0 = n − t is obtained from (30) with t + α replaced by t + β.

Global accuracy
In this section, we employ Monte Carlo sampling (with M = 10, 000 simulated data sets for each sample size) to compute global accuracy metrics of the two Bayesian CoD estimators. Following [17], we let α = 2 d +1 = β = 2 d +1, which produces a prior for c peaked around the value c = 0.5, and α i = β i = 1, for all i = 1, . . . , 2 d , i.e. flat (uniform) prior distributions for (p, q). In each iteration, the values of c and (p, q) are drawn from the respective priors, and then sample data is generated according to these probabilities. Given the sample data, we compute the exact Bayesian MMSE and OBP CoD estimates as expressed in Section 3, and compare them to the standard resubstitution CoD estimator, which is based on plugging in sample frequencies in the expression for the optimal CoD, and corresponds to the the original choice of CoD estimator in [7]. This estimator is also called the nonparametric maximum-likelihood CoD estimator in [15]. For further comparison, we also compute CoD estimators based on leave-one-out, 0.632 bootstrap and 10-repeated twofold cross-validation error estimators-for details on all these CoD estimators, please see [14,15]. Sample means and sample variances are employed to approximate the global accuracy metrics of each CoD estimator. Figure 1 displays the global bias, variance, and RMS as a function of varying sample size, for different numbers of binary predictive variables, d = 1 through d = 3. Several observations are evident. First, as expected, the Bayesian MMSE CoD estimator is unbiased and has the least RMS among all the estimators, and the gap in performance widens as dimensionality increases. Secondly, the OBP CoD estimator has the second-best performance, which indicates the benefits of using the Bayesian estimation approach. The accuracy of the OBP CoD estimator is quite close to that of the MMSE estimator for d = 1, but the gap widens as d increases. Thirdly, it is also observed that the OBP Bayesian CoD estimator is pessimistically biased. Incidentally, the 0.632 bootstrap CoD estimator displays the best accuracy among the four nonparametric ones according to global RMS, but it is matched by the resubstitution CoD estimator as sample size increases.

Fixed-parameter accuracy
In this section, we study the average accuracy of the two proposed Bayesian CoD estimators for a fixed parameter, that is, we evaluate the Bayesian estimators from a purely frequentist perspective. As in the previous subsection, we consider d = 1 through d = 3 binary predictive variables. We consider fixed values of the parameters, c * , p * , and q * . In order to examine the effect of prior belief on performance, we consider four scenarios regarding prior density around the true parameters: a flat ("non-informative") prior and three nonflat "matched, " "poorly matched, " and "mismatched" priors. This is done by assuming different base measures and concentration parameters for the priors (c.f. Section 3). As an illustration of the approach,    Table 1, and the true value of the parameter is indicated by a vertical line parameter: p = 5 (high-variance), p = 25 (medium variance), and p = 50 (low variance). Notice that each density is centered around the expected value. Note that, if the variance is high, even the matched prior becomes very diffuse around its expected value (which is the true value, in this case). Table 1 gives the values of the parameters used in the experiments. In all cases, the true value and base measure for c are the same, c * = c 0 = 0.5. In addition, in each case, the true value q * and base measure q 0 are obtained from p * and p 0 , respectively, by flipping the corresponding vector left to right; for example, when p 0 = (0.2, 0.1, 0.3, 0.4) then q 0 = (0.4, 0.3, 0.1, 0.2). Therefore, only the values for p * and p 0 are shown in Table 1. Figures 3, 4, and 5 show the results for d = 1 through d = 3 predictors, respectively. Each figure displays the bias, variance, and RMS as a function of the sample size for the Bayesian MMSE and OBP CoD estimators and the nonparametric CoD estimators. The Bayesian estimators assume a flat non-informative prior and three nonflat matched, poorly matched, and mismatched priors, specified by Table 1. For the non-flat priors only, three different variance groups are considered, corresponding to three different settings for the concentration parameters: high variance, medium variance, and low variance priors. Results for the OBP CoD estimator are computed exactly using the results of Section 4. For all other CoD estimators, bias, variance, and RMS are approximated by averaging results over 5000 Monte Carlo samples drawn from the fixed distribution. The curves for the nonparametric CoD estimators and the flat prior Bayesian CoD estimators are repeated across the columns, for comparison with the results for the nonflat prior Bayesian CoD estimators.
We can observe that, as expected, both Bayesian CoD estimators perform better when the prior is matched to the true value of the parameters than when the match is poor or nonexistent. In addition, for the matched prior, accuracy improves substantially as one moves from a diffuse (high-variance) to a peaked (low-variance) prior. This effect is especially visible in the case of the OBP CoD estimator. For example, with d = 1 the RMS is reduced by nearly 80 % between the high-variance and low-variance matched priors. In fact, the accuracy of the OBP CoD estimator beats that of the MMSE CoD estimator for peaked priors, while the opposite is true under diffuse priors. Both Bayesian CoD estimators outperform the nonparametric ones in cases d = 1 and d = 3, whereas, in the Table 1 True distributions and nonflat prior base measures for fixed-parameter experiments. In all cases, c * = c 0 = 0.5, and q * and q 0 are obtained from p * and p 0 , respectively, by flipping left to right (see text.)  Table 1. Plot key for CoD estimators: MMSE (red), OBP (blue), resubstitution (gold), leave-one-out (purple), 0.632 bootstrap (green), 10-repeated twofold cross-validation (black). Results for the OBP CoD estimator are exact while others are approximated by Monte Carlo sampling method. The curves for the nonparametric CoD estimators and the flat prior Bayesian CoD estimators are repeated across the columns, for comparison with the results for the nonflat prior Bayesian CoD estimators d = 2 case, the Bayesian estimators based on mismatched or poorly matched priors can perform worse than the nonparametric estimators, for larger sample size. It is also observed that, as the variance of priors decreases (i.e., for a larger value), the performance of both Bayesian estimators improves over the nonparametric ones. Moreover, it is interesting that the Bayesian MMSE CoD estimator performs better than the OBP CoD estimator, for a high-variance prior with matched prior, while the OBP CoD estimator beats the Bayesian MMSE CoD estimator for medium and high-variance matched priors. This indicates that the OBP CoD estimator is preferable due to its straightforward representation and superior performance with low-variance priors. Notice that the Bayesian MMSE CoD estimator has the least RMS only when averaged over all distributions and all possible samples, but  Table 1. Plot key for CoD estimators: MMSE (red), OBP (blue), resubstitution (gold), leave-one-out (purple), 0.632 bootstrap (green), 10-repeated twofold cross-validation (black). Results for the OBP CoD estimator are exact while others are approximated by Monte Carlo sampling method. The curves for the nonparametric CoD estimators and the flat prior Bayesian CoD estimators are repeated across the columns, for comparison with the results for the nonflat prior Bayesian CoD estimators its optimality does not apply to the settings with a fixed distribution. In addition, we observe that the Bayesian MMSE CoD estimator is less variant than the OBP CoD estimator. It can be seen that the Bayesian CoD estimators based on informative priors are less variant than those based on non-informative uniform priors. In the d = 1 and d = 3 cases, the OBP CoD estimator with uniform priors becomes more variant than even the crossvalidation estimator, for larger sample size. In addition, the OBP CoD estimator is less biased in magnitude than the MMSE estimator for low-variance matched priors. However, as the variance of priors increases, the Bayesian MMSE CoD estimator turns out to have less bias than the OBP estimator.  Table 1. Plot key for CoD estimators: MMSE (red), OBP (blue), resubstitution (gold), leave-one-out (purple), 0.632 bootstrap (green), 10-repeated twofold cross-validation (black). Results for the OBP CoD estimator are exact while others are approximated by Monte Carlo sampling method. The curves for the nonparametric CoD estimators and the flat prior Bayesian CoD estimators are repeated across the columns, for comparison with the results for the nonflat prior Bayesian CoD estimators 6 Gene regulatory network inference: a melanoma example We discuss in this section the application of the Bayesian CoD estimation approach discussed previously to the inference of gene regulatory networks. We apply the proposed inference procedure on data collected in a study of metastatic melanoma [24], containing 31 binarized sample expression profiles, which have been binarized, with 0 indicate no significant expression whereas 1 represents significant expression (either over-or underexpression). It was found in [24] that the WNT5A gene is a major driver of processes that lead to metastatic melanoma. We derive the logic relationships and wiring of a 7-gene WNT5A network consisting of genes selected using data analysis and prior biological knowledge: WNT5A, pirin, S100P, RET1, MART1, HADHB, and STC2; for more details about the selection of these genes, see [25,26]. We assume a model where the target binary gene expression Y ∈ {0, 1} is regulated by a binary predictor gene expression vector X = (X 1 , . . . , where f : {0, 1} d → {0, 1} is a Boolean function, the symbol "⊕" indicates modulo-2 addition, and N ∈ {0, 1} is a noise Bernoulli random variable, independent from X, such that P(N = 0) = p. The modulo-2 addition behaves as a XOR operation, which flips the state of the target Y when N = 1, and leaves it unaltered when N = 0. Hence, p quantifies the predictive power of the model: if p = 1, the system is noiseless and prediction is deterministic, while if p < 1, there is a degree of indeterminacy in the state of the target given the state of the predictors. This model is studied in detail in [15], where an inference procedure, based on a maximum-likelihood CoD estimator, is proposed to select the unknown Boolean function f, assuming that f is a member of a candidate model set F containing Boolean functions that depend on the same number k of essential variables. Each f in F is specified by (1) a Boolean function g : {0, 1} k → {0, 1} and (2) the indices for the predicting variable set {i 1 , . . . , i k } ⊂ {1, . . . , d}, or wiring, such that f (X) = g X i 1 , . . . , X i k . If the candidate boolean functions g belong to a model set G, then the total number of possible models is |G| × d l . Here, we modify the network inference in [15] to allow the use of the Bayesian CoD estimators described previously. For a given target Y and predictor set X, we assume Dirichlet prior distributions as in (5). Instead of adopting a non-informative choice of hyperparameters, we employ an "empirical Bayes" approach, where the hyperparameters are estimated in part from the sample data, as described next.
First, it follows from the model in (31) that the parameters p i = P(X = x i | Y = 0), q i = P(X = x i | Y = 1), and c = P(Y = 0) are given by: The unknown quantities here are the predictive power p and the distribution P(X) of the predictors. Given the sample data S n = {(X 1 , Y 1 ), . . . , (X n , Y n )}, and a fixed Boolean function g and wiring {i 1 , . . . , i k }, p can be very effectively estimated by means of the sample frequency [15]:p The distribution P(X) could in principle be also estimated from the data using sample frequencies; however, such an estimator can become very poor under small sample sizes and large dimensionality d. Therefore, we simply assume a flat distribution P(X = x i ) = 1/2 d , for i = 1, . . . , 2 d . Substituting this and (33) into (32) gives the values of the hyperparameters used in our experiment: Recall from Section 3 that the shape of the Dirichlet prior distribution is determined by the hyperparameters through a location parameter, called the base measure, and a concentration parameter. Our strategy is to set up the estimates in (34) as the base measure, so that the Dirichlet priors are concentrated around them, to a degree specified by the concentration parameter. Formally, the hyperparameters are set to: {α 1 , . . . , α 2 d } = { p 1 , . . . , p 2 d }, {β 1 , . . . , β 2 d } = { q 1 , . . . , q 2 d }, α = ĉ and β = (1−ĉ) , where x gives the smallest integer larger or equal to x. The value of is tuned by the experimenter, either manually or using a data-driven procedure.
We are now ready to state the procedure to select a function f in F, consisting of a k-predictor Boolean function g and its wiring.

Bayesian model selection procedure
1. For each of the Boolean functions g ∈ G, compute the prior hyperparameters as described earlier.
Obtain the MMSE Bayesian CoD / OBP CoD estimate under each of the d k possible wirings. Pick the wiring for g that produces the largest CoD estimate. Ties, if any, are broken randomly. 2. Among the |G| pairs of Boolean function g and wiring obtained in the previous step, select the one that produces the largest predictive power estimatê p. Ties, if any, are broken randomly.
In our experiment with the 7-gene WNT5A network, we consider in turn each gene as a target and the remaining six genes as predictors (so that a gene cannot be a predictor of itself ). Hence, d = 6. In addition, we assume that that each gene is predicted by three genes out of the six predictors. Therefore, k = 3 and there are 6 3 = 20 possible wirings for each target gene. The set G contains all 218 Boolean functions of exactly three essential variables (this is less than the full set of 2 2 3 = 256 3-input Boolean functions since those that are reducible to 0-, 1-, and 2-input logics are not considered). We set = 1.0 and apply the proposed Bayesian model selection procedure to infer a gene regulatory network for the MMSE and OBP CoD estimators. We also obtain the gene regulatory network produced by employing the standard model selection procedure, which picks the predictor set (among all 6 3 = 20 choices, in this case) with the largest estimated resubstitution CoD [25].
The results are presented in Fig. 6. The diagrams represent the predicted logic functions as binary strings (in the usual logic table order; e.g., AND = 00000001) and the predicted wirings as oriented edges, and, in addition, the estimated CoD in each case is displayed. We can see that the predicted logic functions and wirings for the three networks are similar, especially in the cases of the OBP and resubstitution CoD networks. If one considers only the three top predicted relationships according to CoD magnitude, one obtains the diagrams depicted in Fig. 7, which show that the same network is inferred by the OBP and resubstitution CoDs, which differ from the network obtained with the MMSE CoD by only a single arrow shift in the wirings (the inferred logics in all three cases are also very similar, differing by only a few bit shifts). The important difference between the Bayesian and standard approaches that can be observed from this experiment is in the estimated CoD magnitudes: those estimated with the standard resubstitution CoD tend to be much larger than the ones estimated with the Bayesian CoDs. This reflects the optimistic bias that tends to be displayed by resubstitution [27], a problem that is avoided by the Bayesian CoD estimators.

Conclusions
We introduced a Bayesian framework for the estimation of the CoD in a discrete prediction setting and analyzed the accuracy of the proposed Bayesian MMSE and OBP CoD estimators based on fixed and random parameters, using analytical and simulation methods. We also compared the accuracy of the two Bayesian CoD estimators against those of several classical CoD estimators, based on resubstitution, leave-one-out, bootstrap, and crossvalidation prediction error estimation. Our results indicated that the Bayesian MMSE CoD estimator has the best performance with zero bias and least RMS, when averaged over all distributions and sample data, whereas, for fixed distributions, we conclude that priors with higher densities around the fixed distributions present better accuracy with less RMS. It is also interesting to see that the OBP CoD estimator, one with very simple calculation, can beat the Bayesian MMSE CoD estimator when using low-variance priors with higher densities around the parameters of the fixed distributions. Furthermore, we proposed an approach for inference of gene regulatory networks based on the proposed Bayesian CoD estimators, and applied it to the inference of a 7-gene regulatory network using melanoma data. We observed that the inferred boolean functions and wirings were similar for both CoD Bayesian estimators. Interestingly, the network inferred with the OBP CoD estimator was very close to the network obtained with the standard inference method based on the resubstitution CoD estimator; however, the magnitude of the CoDs were larger in the latter case, which is consistent with the fact that resubstitution tends to be optimistic. We hope that this paper will provide a theoretical foundation for further work on Bayesian estimation methodologies for the inference of gene regulatory networks. The issue of obtaining informative priors based on established biological knowledge about regulatory relationships, which was not addressed in detail here, is one that deserves careful consideration in future work on this topic. Endnote 1 A proof of this fact is given in the Appendix of [14].   Beta(a, b), where a, b > 0, then the probability density function of X is given by where the normalizing term B(a, b) is known as the Beta function: Clearly, B(a, b) = B(b, a).
For k > −a and l > −b, For example, + 1, b)/B(a, b) = a/(a + b) (the second equality can be proved using the definition of the Beta function in terms of the Gamma function and the properties of the latter [28]).
The incomplete Beta function is defined as Notice that B(a, b) = IB (1; a, b).
It is easy to verify that Finally, for k > −a and l > −b, which follows easily from the definitions of the Beta density and the incomplete Beta function, and the fact that X ∈[0, 1]. In particular, if x ≥ 1, then Clearly, all the previous quantities can be computed in terms of the incomplete beta function, an expression of which is given by the next result.
where P = b − 1 and if b is an integer, or P = ∞ and otherwise.
Proof. When b is a positive non-integer (that is, b > 0), we have, by using the Taylor series expansion, Note that b denotes the largest integer that is less than b. Therefore, To interchange the integration and summation in (45), we need to construct a sequence of measurable functions g i (x), i = 0, 1, . . . , ∞, that satisfy the following three conditions: , for all k and almost all x; (ii) ∞ i=0 g i (x) converges for almost all x; (iii) ∞ For and thus the condition (ii) is satisfied.
Let i j=1 b j − 1 1 a+i = z i (i = 1, 2, . . . , ∞). Since ∞ i=0 1 0 g i (x)dx converges by Raabe's test [29]. Now we can interchange the integration and summation in (45), and we have When b is an integer, we have and it is easy to show that Notice that B(a, b) = P i=0 r i (a, b). Note also that the general case reduces to the special case if b is an integer. An equivalent expression can be derived where a appears in the binomial coefficient instead, which can then be used if a is an integer. If neither a nor b are integers, an approximation can be obtained by truncating the resulting infinite series, or by using a numerical software package.
If both a and b are integers, then IB(x; a, b) reduces to a polynomial in x. Otherwise, it is a simple matter to replace the finite summations by infinite series as specified in Theorem 1.
so that the base measure provides the "central" value around which X is distributed. In particular, large components in the base measure bias the distribution in their direction.
The concentration parameters, on the other hand, control the variance of the distribution around the base measure, with large values indicating smaller variance. In fact, it can be shown that [30] Var(X i ) = a i (1 − a i ) + 1 Cov(X i , X j ) = −a i a j + 1 , fori = j .
From the previous equations, one can see that, as approaches infinity, variances converge to zero and X becomes equal to the base measure with probability 1; in addition, covariances also go to zero, rendering the components of X uncorrelated. The special case a i = 1, for all i = 1, . . . , K corresponds to a uniform over S K−1 . This corresponds to a uniform base measure and concentration parameter = K. If the base measure is not uniform but = K, the distribution is approximately uniform. For approaching zero, the distribution becomes concentrated at the boundary of the simplex.
Summing up, large implies large probability density around the base measure, = K implies a nearly uniform distribution, whereas close to zero produces sparse sample vectors with most of the components close to zero.
The Dirichlet distribution is the multivariate generalization of the Beta distribution, in the sense that the components of a Dirichlet-distributed vector X = (X 1 , . . . , X K ) are Beta distributed: X i ∼ Beta(a i , − a i ), for i = 1, . . . , K. Notice that in the case K = 2 the Dirichlet distribution essentially reduces to the Beta distribution.