Using the minimum description length principle to reduce the rate of false positives of best-fit algorithms

The inference of gene regulatory networks is a core problem in systems biology. Many inference algorithms have been proposed and all suffer from false positives. In this paper, we use the minimum description length (MDL) principle to reduce the rate of false positives for best-fit algorithms. The performance of these algorithms is evaluated via two metrics: the normalized-edge Hamming distance and the steady-state distribution distance. Results for synthetic networks and a well-studied budding-yeast cell cycle network show that MDL-based filtering is more effective than filtering based on conditional mutual information (CMI). In addition, MDL-based filtering provides better inference than the MDL algorithm itself. Electronic supplementary material The online version of this article (doi:10.1186/s13637-014-0013-2) contains supplementary material, which is available to authorized users.


Introduction
A key goal in systems biology is to characterize the molecular mechanisms that govern specific cellular behavior and processes. Models of gene regulatory networks run the gamut from coarse-grained discrete networks to detailed descriptions of such networks by stochastic differential equations [1]. Boolean networks and the more general class of probabilistic Boolean networks are among the most popular approaches for modeling gene networks because they provide a structured way to study biological phenomena (e.g., the cell cycle) and diseases (e.g., cancer), ultimately leading to systems-based therapeutic strategies. The inference of gene networks from high-throughput genomic data is an ill-posed problem known as reverse engineering. It is particularly challenging when dealing with small sample sizes because the number of variables in the system (e.g., the number of genes) typically is much greater than the number of observations [2]. Many inference algorithms have been proposed to elucidate the regulatory relationships between genes, such as Reveal [3], ARACNE [4], the minimum description length principle (MDL) [5][6][7][8][9], the coefficient of determination (CoD) [10,11], and the best-fit extension [12,13].
False positives are a common problem in inference, especially when dealing with small sample sizes and noisy conditions. In fact, false positives are a kind of structural redundancy. Given three genes, x 1 , x 2 , and x 3 , they may interact in a chain-like manner, such as x 1 → x 2 → x 3 or x 1 ← x 2 ← x 3 ; or in a hub-based way, such as x 1 → x 2 ← x 3 or x 1 ← x 2 → x 3 . Indirect interactions between two genes may produce some correlation in their expression data, which can lead to a false regulation detection by inference algorithms. The data-processing inequality (DPI) was first used in ARACNE, which aims to reduce the false positives produced by chain interaction [4]. Later, conditional mutual information (CMI) was proposed to tackle the false positives produced by both the chain-like and hub-based interactions [14]. Because the conditioning gene, x 2 , is usually not known, a greedy search strategy was adopted to check if the CMI between x 1 and x 3 conditioned on some other genes was below a given threshold. To check the CMI on other unrelated genes is problematic. Not only is it computationally burdensome, it also suffers from an enormous multiplecomparisons problem. Moreover, since the interaction strength between genes generally varies a lot, their being both strong and weak interactions, how to set an appropriate threshold is a key problem.
A recent study shows that the best-fit algorithm appears to give the best results for recovering regulatory relationships in comparison to the aforementioned algorithms [15]. In the present paper, we propose to reduce the false positives of the best-fit algorithm by using the MDL principle. Simulation results show that it is more effective than the CMI-based method and can reduce the false positives in the MDL algorithm in [5]. In effect, the false-positive reducing procedure acts as a filter for removing false positives.
The aim of filtering in the present framework is to reduce the number of false positive connections. As with any false-positive reducing algorithm, this will invariably increase the number of false negatives, meaning more missing connections. Thus, two questions must be addressed. First, what benefits accrue from reducing the number of false positives? Second, does the increase in false negatives significantly impact inference performance?
A salient problem in translational genomics is the utilization of gene regulatory networks in determining therapeutic intervention strategies [2,16,17]. A big obstacle in deriving optimal treatment strategies from networks is the computational complexity arising directly from network complexity. Hence, significant effort has been focused on network reduction [18,19]. As with any compression scheme, reduction methods sacrifice information in return for computational tractability. Because genes are removed from the network based upon their regulatory relations with other genes, false positives are particularly troublesome. First, they increase the amount of reduction necessary and second, they compete with true positive connections for retention in the reduced network. While it is true that an increase in false negatives is not beneficial, a missing connection creates no additional computational burden (in fact, reduces computation) and plays no role in the reduction procedure. Now, for the caveat, all of this is fine, so long as the accuracy of the original inference algorithm is not adversely impacted. Practically, this means that, relative to some distance function between a ground-truth network and an inferred network (which quantifies inference accuracy), the distance is not increased when using the modified false-positive reducing algorithm in place of the original algorithm. In this paper, we will consider two distance functions, one based on the hamming distance between the ground-truth and inferred networks and the other based on the difference between the steady-state distributions of the ground-truth and inferred networks. This paper is organized as follows: Background information and necessary definitions are given in Section 2. The implementation of MDL, the best-fit algorithm, and CMI-and MDL-based filtering is then introduced in Section 3. Results from simulated networks and from the cell cycle model of budding yeast are presented in Section 4. Finally, concluding remarks are given in Section 5.

Boolean networks
A Boolean network G(V, F) is defined by a set of nodes V = {x 1 , …, x n }, x i ∈ {0, 1}, and a set of Boolean functions F = {f 1 , …, f n }, f i : 0; 1 f g k i → 0; 1 f g Each node x i represents the expression state of a gene, where x i = 0 means that the gene is off and x i = 1 means it is on. To update its value, each node x i is assigned a Boolean function f i x i1 ; …; x ik i ð Þ with k i specific input nodes. Under the synchronous updating scheme, all genes are updated simultaneously according to their corresponding update functions. The network's state at time t is represented by a binary vector x(t) = (x 1 (t), …, x n (t)). In the absence of noise, the state of the system at the next time step is The long-term behavior of a deterministic Boolean network depends on the initial state. The network will eventually settle down and cycle endlessly through a set of states called an attractor cycle. The set of all initial states that reach a particular attractor cycle forms the basin of attraction for the cycle. Following a random perturbation, the network may escape an attractor cycle, be reinitialized, and then begin its transition process anew. For a Boolean network with perturbation, its corresponding Markov chain possesses a steady-state distribution. It has been hypothesized that attractors or steady-state distributions in Boolean formalisms correspond to different cell types of an organism or to cell fates. In other words, the phenotypic traits are encoded in the attractors or steady-state distribution [1].

Best-fit extension
One approach to infer Boolean networks is to search a consistent rule from examples, the so-called consistency problem [20]. Owing to noise in gene-expression profiles, we relax it to the called best-fit extension problem, which has been extensively studied for many function classes [21]. We briefly introduce the best-fit extension problem for Boolean functions. A partially defined Boolean function (pdBf) is defined by two sets, T, F ⊆ {0, 1} n , where T and F represent the set of true and false vectors, respect- The best-fit extension aims to find two subsets T* and F* such that T* ∩ F* = ϕ and T* ∪ F* = T ∪ F, for which the function pdBf(T*, F*) has an extension in some class C of Boolean functions such that T* ∩ F + F * ∪ T is minimized. Clearly, any extension f ∈ C of pdBf (T*, F*) has minimum error magnitude [12,13].

Conditional mutual information
Mutual information (MI) is a general measurement that can detect nonlinear dependence between two random variables X and Y. For discrete-valued random variables, the one-time-lag MI from X t to Y t + 1 is given by where H(•) denotes entropy and X t and Y t + 1 are two equal-length vectors. The conditional mutual information (CMI) from X t to Y t + 1 given Z t is and quantifies the reduction in the uncertainty of Y t+1 due to knowledge of X t given Z t . In the chain-like or hub-based scenarios, genes X t and Y t+1 should be independent given the intermediate or hub gene Z t , which means that I(X t ; Y t + 1 |Z t ) = 0.

Minimum description length principle
A fundamental principle in model selection is the minimum description length (MDL) principle, which states that we should choose the model that gives the shortest description of the data. The 'two-part MDL' developed by Rissanen consists of writing the description length of a given model applied to a data set as the sum of the code length for describing the model and the code length for describing the data set fit by the model [22] L There are various ways to encode the model-coding length L M and the data-coding length L D . Given a time series of length m, Zhao et al. proposed to encode L M and L D as [5] where τ is a free parameter to balance the model-and data-coding lengths, n and m are the number of genes and time points. d i = ⌈ log 2 n⌉ and d f = ⌈ log 2 m⌉ denote the number of bits needed to code an integer and a floating-point number, respectively.

Implementation
Based on the common assumption that genetic regulatory networks are sparsely connected, we restrict simulated Boolean networks to a scale-free topology with maximal connectivity K = 4 and average connectivity k = 2. The best-fit algorithm searches for the best-fit function for each gene by exhaustively searching for all combinations of potential regulator sets. The search space grows exponentially with the number of genes. In practice, the limit k i ≤ 3 is generally applied to mitigate model complexity. In this paper, we restrict best-fit-algorithm searches to combinations of 1, 2, or 3 possible regulators. The combinatorial set with the smallest error is then selected as the regulatory set. We call this best-fit-I. In practice, the minimal error predictor set may not unique. We employ the heuristic that each of them can be viewed as fitting the target gene in a different way and if one gene occurs frequently in those sets, then it is highly likely to be a true regulatory gene. Thus, we can determine the regulatory set by applying the majority rule in these sets. Here, we refer to this algorithm as best-fit-II.
Then CMI and MDL criteria are used to filter falsepositive connections. For each regulatory connection, if the CMI for one of the remaining genes is less than 0.005, then the gene is deleted; otherwise, it remains. The MDL criterion is applied to each target gene x i . Given its parent set, Pa(x i ), we delete the regulatory gene x j ∈ Pa(x i ) that can maximally reduce its coding length L i for each point in time, repeating this process until the deletion of one regulatory gene causes L i to increase. We implement an MDL inference algorithm by directly searching the combination of 1, 2, or 3 possible regulators with minimal coding length L i . The free parameter τ in Equation 6 is set to 0.2.
We have analyzed CMI-and MDL-based filtering by using both synthetic networks as well as the well-studied cell-cycle model known as the budding-yeast network. We compare them with the ground-truth network according to the following two distances [15,23]: (1) The normalized-edge Hamming distance: where FN and FP represent the number of false-negative and false-positive wires, respectively, and P represents the total number of positive wires. This Hamming distance reflects the accuracy of the recovered regulatory relationships.
(2) The steady-state distribution distance: where π k and π ′ k are the steady-state probabilities state x k in the ground-truth and inferred network, respectively. The steady-state distribution distance reflects the degree to which an inferred network approximates the long-run behavior of the ground-truth network.

Simulation on synthetic networks
We generated 1,000 random n = 10 genes and for each network generated a random sample of m = 10, 20, 30, 40, and 50 time points. As it is hard to obtain one time series with required length, we adopt the following sampling strategy: (1) select several start states which are the farthest from their attractor; (2) run each start state to its attactor; (3) select one path as a time series, if its length is shorter than required, add another path in it until we have required length of time points. We added 5% and 10% noise to these samples to investigate the effect of noise. The perturbation probability to calculate the steady-state distribution was set to p = 0.0001. In Table 1, we list the average number of true-positive and false-positive connections for various noise intensities. Figure 1 shows the average performance of the MDL, best-fit-I, and best-fit-II filtered by CMI and MDL for 0%, 5%, and 10% noise. As a whole, the performance of these algorithms increases as sample size increases from 10 to 50. This result is easy to understand: the more data we have, the better the inferred results.
Examination of the table reveals several trends. First, MDL-based filtering (dashed lines in Figure 1) always performs better than CMI-based filtering (dotted lines in Figure 1). MDL-based filtering aims to reduce the redundancy of a model according to the MDL principle, whereas CMI-based filtering attains reduction by blindly checking if the CMI of a connection conditioned on all other genes is below a given threshold. The results indicate that the former approach is superior to the latter. According to Table 1, on the whole, MDL-based filtering retains more true connections and deletes more false connections than CMI-based filtering.
Second, the performances of MDL, best-fit-I, and bestfit-II are very similar when used with noiseless data. In this case, the MDL algorithm gives a model with L D = 0, which also corresponds to the zero-error model obtained by best-fit-I. In addition, MDL-based filtering results in little improvement over the best-fit algorithms. However, their performance is strongly related to sample size when the data are noisy. Specifically, for sample size less than 30, MDL performs better than best-fit-I and bestfit-II based on the average Hamming-edge distance μ e ham . But MDL performs worse than best-fit-I and best-fit-II for sample sizes lager than 30, because the structural regularization of MDL is beneficial only for small sample sizes whereas it leads to overfitting for large sample sizes. From Table 1, we see that, compared with best-fit-I and best-fit-II, the rate of false positives is relatively low for MDL with small sample sizes and relatively high for MDL with large sample sizes. Concerning the steadystate distribution distance μ ssd , MDL performs better than best-fit-I and best-fit-II for data with 5% noise, but the performance of these algorithms becomes equivalent for data with 10% noise. This result may be due to the noise not only deteriorating the inference of the regulatory relationships, but also deteriorating the interaction Boolean functions, which strongly influence μ ssd .
Third, for noisy situations, based on μ e ham and μ ssd , not only does MDL-based filtering not degrade performance, it improves the performance of best-fit-I and best-fit-II, with the performance for best-fit-II being slightly better than that of best-fit-I. One reason for this result may be that best-fit-II infers more true-positive connections and less false-positive connections in small-sample situations (see Table 1). It is interesting that, in noisy situations, MDL-based filtering can even outperform the MDL algorithm across all sample sizes. In essence, the two methods are totally different because the former aims to reduce the structural redundancy of the minimal-error model obtained by the best-fit algorithm, whereas the latter aims to search the model with the minimum coding length L. From the point of view of the MDL principle, the coding length L of MDL-based filtering may not be the minimum length. Because MDL-based filtering combines both the best-fit algorithm and the MDL principle, it reduces structural redundancy and overcomes the over-fitting in largesample-size situations.

Cell cycle model of budding yeast
The cell cycle is a vital biological process in which one cell grows and divides into two daughter cells. It consists of four phases, G1, S, G2, and M, and is regulated by a highly complex network that is highly conserved among the eukaryotes. From the 800 genes involved in the cell cycle process of budding yeast, Li et al. constructed a network of 11 key regulators: Cln3, MBF, SBF, Cln1, Cdh1, Swi5, Cdc20, Clb5, Sic1, Clb1, and Mcm1 [24]. This Boolean network model, shown in Figure 2A, has an attractor whose biggest basin corresponds to the biological G1 stationary state. The temporal sequence in Table 2 is a pathway from this basin that follows the biological trajectory of the cell cycle network.  Figure 1 Comparison of normalized-edge Hamming distance μ e ham and steady-state distribution distance μ ssd with 0%, 5%, and 10% noise for MDL, best-fit-I, and best-fit-II filtered by CMI and MDL.
We applied MDL, best-fit-I, and best-fit-II filtered by CMI and MDL to the artificial time-series data in Table 2. The inferred networks are shown in Figure 2. Figure 2B shows the network inferred by the MDL algorithm, which is the best network. Figure 2C,D has the same number of true-positive connections, with the latter having fewer false-positive connections. This result demonstrates that the method of selecting regulatory genes in best-fit-II is superior to using best-fit-I. Compared with Figure 2E,F, which was filtered by CMI from Figure 2C,D, Figure 2G,H  Furthermore, we can see that the networks resulting from CMI-based filtering have two disconnected subgraphs, whereas the network resulting from MDL is a connected graph. This result shows that MDL-based filtering is more effective than CMI-based filtering. In fact, Figure 2G shows the same result as in Figure 2B, which is the best result. We also ran 100 simulations with 5% and 10% noise for the pathway under consideration. Table 3 lists the average number of true positives and false positives, the normalized Hamming-edge distance μ e ham and the steady-state distribution distance μ ssd . The results are consistent with those of the simulated networks ( Figure 1) and they demonstrate that MDL-based filtering is effective for samples containing a small amount of noise.

Conclusion
Reducing the rate of false positives is an important issue in network inference. In this paper, we address this question by using the minimum description length (MDL) principle. Specifically, we apply the MDL measurement technique proposed by Zhao et al. to filter the model obtained by two best-fit algorithms (best-fit-I and best-fit-II). We compare the performance of MDL, best-fit-I, and best-fit-II filtered by CMI and MDL both on simulated networks and on an artificial model of budding yeast. The results show that, as determined by the distance metrics μ e ham and μ ssd , MDL-based filtering does not degrade inference performance, can improve inference performance, and is more effective than CMIbased filtering. Moreover, the combination of MDL filtering with the best-fit algorithm can even outperform the MDL algorithm alone. Additionally, applying MDLbased filtering is computationally less burdensome than using the MDL algorithm alone because calculating the data-coding length L D is more complex than calculating the error estimate of the best-fit algorithm, and the complexity of the calculation increases dramatically as the sample size m increases. Last but not the least, MDL-based filtering can also be applied to the results of other minimal error algorithms such as CoD. Table 3 Comparison of MDL, best-fit-I, and best-fit-II with CMI-and MDL-based filtering for yeast-pathway data