Using the minimum description length principle to reduce the rate of false positives of best-fit algorithms
EURASIP Journal on Bioinformatics and Systems Biology volume 2014, Article number: 13 (2014)
Abstract
The inference of gene regulatory networks is a core problem in systems biology. Many inference algorithms have been proposed and all suffer from false positives. In this paper, we use the minimum description length (MDL) principle to reduce the rate of false positives for best-fit algorithms. The performance of these algorithms is evaluated via two metrics: the normalized-edge Hamming distance and the steady-state distribution distance. Results for synthetic networks and a well-studied budding-yeast cell cycle network show that MDL-based filtering is more effective than filtering based on conditional mutual information (CMI). In addition, MDL-based filtering provides better inference than the MDL algorithm itself.
1 Introduction
A key goal in systems biology is to characterize the molecular mechanisms that govern specific cellular behavior and processes. Models of gene regulatory networks run the gamut from coarse-grained discrete networks to detailed descriptions of such networks by stochastic differential equations [1]. Boolean networks and the more general class of probabilistic Boolean networks are among the most popular approaches for modeling gene networks because they provide a structured way to study biological phenomena (e.g., the cell cycle) and diseases (e.g., cancer), ultimately leading to systems-based therapeutic strategies. The inference of gene networks from high-throughput genomic data is an ill-posed problem known as reverse engineering. It is particularly challenging when dealing with small sample sizes because the number of variables in the system (e.g., the number of genes) typically is much greater than the number of observations [2]. Many inference algorithms have been proposed to elucidate the regulatory relationships between genes, such as REVEAL [3], ARACNE [4], the minimum description length principle (MDL) [5]–[9], the coefficient of determination (CoD) [10],[11], and the best-fit extension [12],[13].
False positives are a common problem in inference, especially when dealing with small sample sizes and noisy conditions. In fact, false positives are a kind of structural redundancy. Three genes x_{1}, x_{2}, and x_{3} may interact in a chain-like manner, such as x_{1} → x_{2} → x_{3} or x_{1} ← x_{2} ← x_{3}, or in a hub-based way, such as x_{1} → x_{2} ← x_{3} or x_{1} ← x_{2} → x_{3}. Indirect interactions between two genes may produce some correlation in their expression data, which can lead inference algorithms to detect a regulation that does not exist. The data-processing inequality (DPI) was first used in ARACNE to reduce the false positives produced by chain-like interactions [4]. Later, conditional mutual information (CMI) was proposed to tackle the false positives produced by both chain-like and hub-based interactions [14]. Because the conditioning gene, x_{2}, is usually not known, a greedy search strategy was adopted to check whether the CMI between x_{1} and x_{3} conditioned on some other gene was below a given threshold. Checking the CMI against other, unrelated genes is problematic: not only is it computationally burdensome, it also suffers from an enormous multiple-comparisons problem. Moreover, since the interaction strength between genes generally varies a lot, there being both strong and weak interactions, setting an appropriate threshold is a key problem.
A recent study shows that the best-fit algorithm appears to give the best results for recovering regulatory relationships in comparison to the aforementioned algorithms [15]. In the present paper, we propose to reduce the false positives of the best-fit algorithm by using the MDL principle. Simulation results show that it is more effective than the CMI-based method and can reduce the false positives in the MDL algorithm in [5]. In effect, the false-positive-reducing procedure acts as a filter for removing false positives.
The aim of filtering in the present framework is to reduce the number of false-positive connections. As with any false-positive-reducing algorithm, this will invariably increase the number of false negatives, meaning more missing connections. Thus, two questions must be addressed. First, what benefits accrue from reducing the number of false positives? Second, does the increase in false negatives significantly impact inference performance?
A salient problem in translational genomics is the utilization of gene regulatory networks in determining therapeutic intervention strategies [2],[16],[17]. A big obstacle in deriving optimal treatment strategies from networks is the computational complexity arising directly from network complexity. Hence, significant effort has been focused on network reduction [18],[19]. As with any compression scheme, reduction methods sacrifice information in return for computational tractability. Because genes are removed from the network based upon their regulatory relations with other genes, false positives are particularly troublesome. First, they increase the amount of reduction necessary and second, they compete with true positive connections for retention in the reduced network. While it is true that an increase in false negatives is not beneficial, a missing connection creates no additional computational burden (in fact, reduces computation) and plays no role in the reduction procedure.
Now for the caveat: all of this is fine so long as the accuracy of the original inference algorithm is not adversely impacted. Practically, this means that, relative to some distance function between a ground-truth network and an inferred network (which quantifies inference accuracy), the distance is not increased when using the modified false-positive-reducing algorithm in place of the original algorithm. In this paper, we consider two distance functions, one based on the Hamming distance between the ground-truth and inferred networks and the other based on the difference between the steady-state distributions of the ground-truth and inferred networks.
This paper is organized as follows: Background information and necessary definitions are given in Section 2. The implementation of MDL, the best-fit algorithm, and CMI- and MDL-based filtering is then introduced in Section 3. Results from simulated networks and from the cell cycle model of budding yeast are presented in Section 4. Finally, concluding remarks are given in Section 5.
2 Background
2.1 Boolean networks
A Boolean network G(V, F) is defined by a set of nodes V = {x_{1}, …, x_{n}}, x_{i} ∈ {0, 1}, and a set of Boolean functions F = {f_{1}, …, f_{n}}, f_{i}: {0, 1}^{k_{i}} → {0, 1}. Each node x_{i} represents the expression state of a gene, where x_{i} = 0 means that the gene is off and x_{i} = 1 means that it is on. To update its value, each node x_{i} is assigned a Boolean function f_{i}(x_{i1}, …, x_{ik_{i}}) with k_{i} specific input nodes. Under the synchronous updating scheme, all genes are updated simultaneously according to their corresponding update functions. The network's state at time t is represented by a binary vector x(t) = (x_{1}(t), …, x_{n}(t)). In the absence of noise, the state of the system at the next time step is
$$x_{i}(t+1) = f_{i}\left(x_{i1}(t), \dots, x_{ik_{i}}(t)\right), \quad i = 1, \dots, n. \tag{1}$$
The long-term behavior of a deterministic Boolean network depends on the initial state. The network will eventually settle down and cycle endlessly through a set of states called an attractor cycle. The set of all initial states that reach a particular attractor cycle forms the basin of attraction for the cycle. Following a random perturbation, the network may escape an attractor cycle, be reinitialized, and then begin its transition process anew. For a Boolean network with perturbation, its corresponding Markov chain possesses a steady-state distribution. It has been hypothesized that attractors or steady-state distributions in Boolean formalisms correspond to different cell types of an organism or to cell fates. In other words, the phenotypic traits are encoded in the attractors or steady-state distribution [1].
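The synchronous update and the notion of an attractor cycle can be illustrated with a minimal Python sketch. The three-gene toy network, the truth-table representation, and the names `step` and `run_to_attractor` are assumptions made for illustration only; they are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

State = Tuple[int, ...]

def step(state: State, inputs: Sequence[Sequence[int]],
         tables: Sequence[Dict[Tuple[int, ...], int]]) -> State:
    """Synchronously update every gene from its own truth table."""
    return tuple(
        tables[i][tuple(state[j] for j in inputs[i])]
        for i in range(len(state))
    )

def run_to_attractor(state: State, inputs, tables) -> Tuple[List[State], List[State]]:
    """Follow the noise-free trajectory until a state repeats.

    Returns (transient states, attractor cycle)."""
    seen: Dict[State, int] = {}
    path: List[State] = []
    while state not in seen:
        seen[state] = len(path)
        path.append(state)
        state = step(state, inputs, tables)
    start = seen[state]          # index where the cycle begins
    return path[:start], path[start:]

# Hypothetical toy network: gene 0 copies gene 1, gene 1 copies gene 0,
# gene 2 is the AND of genes 0 and 1.
inputs = [[1], [0], [0, 1]]
tables = [
    {(0,): 0, (1,): 1},
    {(0,): 0, (1,): 1},
    {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1},
]

transient, cycle = run_to_attractor((1, 1, 0), inputs, tables)
```

In this toy network, the start state (1, 1, 0) reaches a fixed point after one step, whereas (1, 0, 0) lies on a cycle of length two, so the two start states belong to different basins of attraction.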
2.2 Bestfit extension
One approach to inferring Boolean networks is to search for a rule that is consistent with the examples, the so-called consistency problem [20]. Owing to noise in gene-expression profiles, we relax it to the so-called best-fit extension problem, which has been extensively studied for many function classes [21]. We briefly introduce the best-fit extension problem for Boolean functions. A partially defined Boolean function (pdBf) is defined by two sets, T, F ⊆ {0, 1}^{n}, where T and F represent the sets of true and false vectors, respectively. A function f is called an extension of pdBf(T, F) if T ⊆ T(f) = {x ∈ {0, 1}^{n} : f(x) = 1} and F ⊆ F(f) = {x ∈ {0, 1}^{n} : f(x) = 0}. The magnitude of the error of function f is
$$\varepsilon(f) = \left|T \cap F(f)\right| + \left|F \cap T(f)\right|. \tag{2}$$
The best-fit extension problem aims to find two subsets T* and F* such that T* ∩ F* = ∅ and T* ∪ F* = T ∪ F, for which pdBf(T*, F*) has an extension in some class C of Boolean functions and |T* ∩ F| + |F* ∩ T| is minimized. Clearly, any extension f ∈ C of pdBf(T*, F*) has minimum error magnitude [12],[13].
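As a rough illustration of the best-fit idea on time-series data, the following Python sketch counts, for a candidate regulator set, how many observed transitions a best-fitting Boolean function must misclassify. It assumes unit weights on all observations and the unrestricted class of Boolean functions; the function names, the tie-breaking toward output 0, and the toy series are hypothetical.

```python
from collections import Counter
from itertools import combinations

def best_fit_error(series, target, regulators):
    """Minimum number of misfit transitions for `target` given `regulators`.

    Each transition contributes one (input pattern at t, target value at t+1)
    observation; the best-fit function outputs the majority value per pattern,
    so the error is the sum of the minority counts."""
    counts = Counter()
    for now, nxt in zip(series, series[1:]):
        pattern = tuple(now[j] for j in regulators)
        counts[(pattern, nxt[target])] += 1
    patterns = {p for p, _ in counts}
    table = {p: int(counts[(p, 1)] > counts[(p, 0)]) for p in patterns}
    error = sum(min(counts[(p, 0)], counts[(p, 1)]) for p in patterns)
    return error, table

def best_fit_search(series, target, max_k=3):
    """Exhaustively score all regulator sets of size 1..max_k.

    Returns the minimal error and every set that attains it."""
    n = len(series[0])
    scored = [
        (best_fit_error(series, target, regs)[0], regs)
        for k in range(1, max_k + 1)
        for regs in combinations(range(n), k)
    ]
    best = min(err for err, _ in scored)
    return best, [regs for err, regs in scored if err == best]

# Toy series in which gene 2 at time t+1 equals (gene 0 AND gene 1) at time t.
series = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0), (0, 0, 1)]
best_error, tied_sets = best_fit_search(series, target=2)
```

On this toy series the pair (0, 1) fits gene 2 without error, but so does the superset (0, 1, 2), which shows how a minimal-error search by itself can return redundant regulators.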
2.3 Conditional mutual information
Mutual information (MI) is a general measurement that can detect nonlinear dependence between two random variables X and Y. For discrete-valued random variables, the one-time-lag MI from X_{t} to Y_{t+1} is given by
$$I(X_{t}; Y_{t+1}) = H(X_{t}) + H(Y_{t+1}) - H(X_{t}, Y_{t+1}), \tag{3}$$
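where H(·) denotes entropy and X_{t} and Y_{t+1} are two equal-length vectors. The conditional mutual information (CMI) from X_{t} to Y_{t+1} given Z_{t} is
$$I(X_{t}; Y_{t+1} \mid Z_{t}) = H(X_{t}, Z_{t}) + H(Y_{t+1}, Z_{t}) - H(Z_{t}) - H(X_{t}, Y_{t+1}, Z_{t}), \tag{4}$$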
and quantifies the reduction in the uncertainty of Y_{t+1} due to knowledge of X_{t} given Z_{t}. In the chain-like or hub-based scenarios, genes X_{t} and Y_{t+1} should be independent given the intermediate or hub gene Z_{t}, which means that I(X_{t}; Y_{t+1} | Z_{t}) = 0.
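A plug-in estimate of the one-time-lag MI and CMI can be sketched in Python from empirical entropies, using the standard identities I(X; Y) = H(X) + H(Y) − H(X, Y) and I(X; Y | Z) = H(X, Z) + H(Y, Z) − H(Z) − H(X, Y, Z). The function names and the toy sequences are assumptions for illustration; no bias correction for small samples is attempted.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Empirical joint entropy (in bits) of equal-length discrete sequences."""
    joint = Counter(zip(*columns))
    total = sum(joint.values())
    return -sum(c / total * log2(c / total) for c in joint.values())

def lagged_mi(x, y):
    """One-time-lag MI from x at time t to y at time t+1."""
    xt, y1 = x[:-1], y[1:]
    return entropy(xt) + entropy(y1) - entropy(xt, y1)

def lagged_cmi(x, y, z):
    """One-time-lag CMI from x at time t to y at time t+1, given z at time t."""
    xt, y1, zt = x[:-1], y[1:], z[:-1]
    return (entropy(xt, zt) + entropy(y1, zt)
            - entropy(zt) - entropy(xt, y1, zt))

# Toy data: y copies x with a one-step delay.
x = [0, 1, 0, 1, 1, 0, 0, 1]
y = [0] + x[:-1]

mi_xy = lagged_mi(x, y)          # equals the entropy of x[:-1]
cmi_xy_given_x = lagged_cmi(x, y, x)  # vanishes: x explains y completely
```

Conditioning on the true driver removes all remaining dependence, which is the behavior a CMI-based filter relies on when it compares the estimate against a small threshold.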
2.4 Minimum description length principle
A fundamental principle in model selection is the minimum description length (MDL) principle, which states that we should choose the model that gives the shortest description of the data. The two-part MDL developed by Rissanen consists of writing the description length of a given model applied to a data set as the sum of the code length for describing the model and the code length for describing the data set fit by the model [22]:
$$L = L_{M} + L_{D}. \tag{5}$$
There are various ways to encode the model-coding length L_{M} and the data-coding length L_{D}. Given a time series of length m, Zhao et al. proposed to encode L_{M} and L_{D} as [5]
where τ is a free parameter that balances the model- and data-coding lengths, n and m are the numbers of genes and time points, respectively, and d_{i} = ⌈log_{2} n⌉ and d_{f} = ⌈log_{2} m⌉ denote the number of bits needed to code an integer and a floating-point number, respectively.
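As a quick arithmetic check of the two bit widths, take n = 10 genes and m = 50 time points (the largest setting used in Section 4.1); the numbers are illustrative only:

```latex
\begin{align*}
d_{i} &= \lceil \log_{2} n \rceil = \lceil \log_{2} 10 \rceil = \lceil 3.32 \rceil = 4 \text{ bits},\\
d_{f} &= \lceil \log_{2} m \rceil = \lceil \log_{2} 50 \rceil = \lceil 5.64 \rceil = 6 \text{ bits}.
\end{align*}
```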
3 Implementation
Based on the common assumption that genetic regulatory networks are sparsely connected, we restrict simulated Boolean networks to a scale-free topology with maximal connectivity K = 4 and average connectivity k = 2. The best-fit algorithm searches for the best-fit function for each gene by exhaustively searching all combinations of potential regulator sets. The search space grows exponentially with the number of genes. In practice, the limit k_{i} ≤ 3 is generally applied to mitigate model complexity. In this paper, we restrict best-fit searches to combinations of 1, 2, or 3 possible regulators. The combinatorial set with the smallest error is then selected as the regulatory set. We call this best-fit-I. In practice, the minimal-error predictor set may not be unique. We employ the heuristic that each such set can be viewed as fitting the target gene in a different way and that, if one gene occurs frequently in those sets, then it is highly likely to be a true regulatory gene. Thus, we can determine the regulatory set by applying the majority rule to these sets. Here, we refer to this algorithm as best-fit-II.
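The majority rule of best-fit-II can be sketched as a simple vote over the tied minimal-error sets. The paper does not spell out the exact voting threshold, so the strict-majority criterion below (a gene must appear in more than half of the tied sets) and the name `majority_regulators` are assumptions.

```python
from collections import Counter

def majority_regulators(tied_sets):
    """Return genes that occur in more than half of the tied minimal-error sets."""
    votes = Counter(gene for regs in tied_sets for gene in set(regs))
    return sorted(g for g, v in votes.items() if 2 * v > len(tied_sets))

# Hypothetical tie between three regulator sets for one target gene.
tied = [(0, 1), (0, 1, 2), (0, 3)]
selected = majority_regulators(tied)
```

Here genes 0 and 1 are retained because they appear in at least two of the three tied sets, while genes 2 and 3 are dropped.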
Then CMI and MDL criteria are used to filter false-positive connections. For each regulatory connection, if the CMI conditioned on one of the remaining genes is less than 0.005, then the regulatory gene is deleted; otherwise, it remains. The MDL criterion is applied to each target gene x_{i}. Given its parent set, Pa(x_{i}), we delete the regulatory gene x_{j} ∈ Pa(x_{i}) whose removal maximally reduces the coding length L_{i} for each point in time, and we repeat this process until the deletion of one more regulatory gene causes L_{i} to increase. We also implement an MDL inference algorithm by directly searching for the combination of 1, 2, or 3 possible regulators with minimal coding length L_{i}. The free parameter τ in Equation 6 is set to 0.2.
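The greedy MDL-based filtering step can be sketched as a backward deletion loop. The coding length is passed in as a callable because its exact form is not reproduced here; `toy_length` is a made-up stand-in, not the encoding of Zhao et al., and stopping when no deletion strictly shortens the code is an assumption about how ties are handled.

```python
def mdl_filter(parents, coding_length):
    """Greedily delete regulators while the coding length strictly decreases.

    `coding_length` maps a frozenset of regulators to a number (hypothetical
    interface). Returns the filtered parent set and its coding length."""
    current = frozenset(parents)
    best_len = coding_length(current)
    while current:
        # Score every single-gene deletion and pick the shortest resulting code.
        new_len, gene = min(
            (coding_length(current - {g}), g) for g in sorted(current)
        )
        if new_len >= best_len:
            break  # any further deletion would not shorten the description
        current, best_len = current - {gene}, new_len
    return current, best_len

def toy_length(parents):
    """Made-up coding length: 3 bits per regulator plus a 10-bit penalty
    for each true regulator (genes 0 and 1) that is missing."""
    return 3 * len(parents) + 10 * len({0, 1} - parents)

filtered, length = mdl_filter({0, 1, 2}, toy_length)
```

With this toy length, the redundant regulator 2 is removed first, and the loop stops because deleting either true regulator would lengthen the description.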
We have analyzed CMI- and MDL-based filtering by using both synthetic networks and the well-studied cell-cycle model known as the budding-yeast network. We compare the inferred networks with the ground-truth network according to the following two distances [15],[23]:

(1) The normalized-edge Hamming distance:
$$\mu_{\mathrm{ham}}^{\mathrm{e}} = \frac{\mathrm{FN} + \mathrm{FP}}{\mathrm{P}}, \tag{7}$$
where FN and FP represent the number of false-negative and false-positive wires, respectively, and P represents the total number of positive wires. This Hamming distance reflects the accuracy of the recovered regulatory relationships.

(2) The steady-state distribution distance:
$$\mu^{\mathrm{ssd}} = \sum_{k=1}^{2^{n}} \left|\pi_{k} - \pi_{k}^{\prime}\right|, \tag{8}$$
where π_{k} and π′_{k} are the steady-state probabilities of state x_{k} in the ground-truth and inferred networks, respectively. The steady-state distribution distance reflects the degree to which an inferred network approximates the long-run behavior of the ground-truth network.
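Both distances are straightforward to compute once the wiring and the steady-state distributions are available. The sketch below assumes the usual forms, (FN + FP) / P for the normalized-edge Hamming distance and the sum of absolute differences for the steady-state distribution distance; edges are represented as (regulator, target) pairs, and all names and numbers are illustrative.

```python
def normalized_edge_hamming(true_edges, inferred_edges):
    """(FN + FP) / P, where P is the number of wires in the ground truth."""
    true_edges, inferred_edges = set(true_edges), set(inferred_edges)
    fn = len(true_edges - inferred_edges)   # missing wires
    fp = len(inferred_edges - true_edges)   # spurious wires
    return (fn + fp) / len(true_edges)

def steady_state_distance(pi_true, pi_inferred):
    """Sum of absolute differences between two steady-state distributions,
    both listed over the same ordering of network states."""
    return sum(abs(p - q) for p, q in zip(pi_true, pi_inferred))

# Toy ground truth with four wires; the inferred network misses two and adds one.
truth = {(0, 1), (1, 2), (2, 0), (0, 2)}
inferred = {(0, 1), (1, 2), (1, 0)}
mu_ham = normalized_edge_hamming(truth, inferred)

mu_ssd = steady_state_distance([0.5, 0.25, 0.25, 0.0],
                               [0.25, 0.25, 0.25, 0.25])
```

Note that the normalized-edge Hamming distance can exceed 1 when the inferred network contains many false-positive wires, which is one reason filtering can lower it even at the cost of a few extra false negatives.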
4 Results and discussion
4.1 Simulation on synthetic networks
We generated 1,000 random networks with n = 10 genes and, for each network, generated random samples of m = 10, 20, 30, 40, and 50 time points. Because it is hard to obtain a single time series of the required length, we adopted the following sampling strategy: (1) select several start states that are farthest from their attractor; (2) run each start state to its attractor; (3) select one path as a time series and, if it is shorter than required, append further paths until the required number of time points is reached. We added 5% and 10% noise to these samples to investigate the effect of noise. The perturbation probability used to calculate the steady-state distribution was set to p = 0.0001. In Table 1, we list the average number of true-positive and false-positive connections for various noise intensities. Figure 1 shows the average performance of MDL, best-fit-I, and best-fit-II filtered by CMI and MDL for 0%, 5%, and 10% noise. As a whole, the performance of these algorithms improves as the sample size increases from 10 to 50. This result is easy to understand: the more data we have, the better the inferred results.
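Step (3) of the sampling strategy and the noise injection can be sketched as follows. The paper does not state how noise is applied, so flipping each binary entry independently with the given probability is an assumption, as are the function names and the toy paths.

```python
import random

def build_series(paths, length):
    """Concatenate trajectories in the given order until `length`
    time points have been collected, then truncate."""
    series = []
    for path in paths:
        series.extend(path)
        if len(series) >= length:
            break
    return series[:length]

def add_noise(series, rate, rng):
    """Flip each binary entry independently with probability `rate`
    (assumed noise model)."""
    return [
        tuple(bit ^ int(rng.random() < rate) for bit in state)
        for state in series
    ]

# Two hypothetical paths from far-away start states to their attractors.
paths = [[(0, 0), (0, 1)], [(1, 0), (1, 1), (0, 1)]]
clean = build_series(paths, 4)

rng = random.Random(0)           # fixed seed for reproducibility
noisy = add_noise(clean, 0.05, rng)
```

Concatenating paths introduces one artificial transition at each joint, so an implementation may prefer to skip the pair of states that straddles a joint when collecting transitions for inference.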
Examination of these results reveals several trends. First, MDL-based filtering (dashed lines in Figure 1) always performs better than CMI-based filtering (dotted lines in Figure 1). MDL-based filtering aims to reduce the redundancy of a model according to the MDL principle, whereas CMI-based filtering attains reduction by blindly checking whether the CMI of a connection conditioned on all other genes is below a given threshold. The results indicate that the former approach is superior to the latter. According to Table 1, on the whole, MDL-based filtering retains more true connections and deletes more false connections than CMI-based filtering.
Second, the performances of MDL, best-fit-I, and best-fit-II are very similar when used with noiseless data. In this case, the MDL algorithm gives a model with L_{D} = 0, which also corresponds to the zero-error model obtained by best-fit-I. In addition, MDL-based filtering results in little improvement over the best-fit algorithms. However, their performance is strongly related to sample size when the data are noisy. Specifically, for sample sizes smaller than 30, MDL performs better than best-fit-I and best-fit-II in terms of the average normalized-edge Hamming distance μ_{ham}^{e}. For sample sizes larger than 30, however, MDL performs worse than best-fit-I and best-fit-II, because the structural regularization of MDL is beneficial only for small sample sizes, whereas it leads to overfitting for large sample sizes. From Table 1, we see that, compared with best-fit-I and best-fit-II, the rate of false positives is relatively low for MDL with small sample sizes and relatively high for MDL with large sample sizes. Concerning the steady-state distribution distance μ^{ssd}, MDL performs better than best-fit-I and best-fit-II for data with 5% noise, but the performance of these algorithms becomes equivalent for data with 10% noise. This result may be due to the noise not only deteriorating the inference of the regulatory relationships but also deteriorating the inferred Boolean functions, which strongly influence μ^{ssd}.
Third, for noisy situations, based on μ_{ham}^{e} and μ^{ssd}, not only does MDL-based filtering not degrade performance, it improves the performance of best-fit-I and best-fit-II, with the performance for best-fit-II being slightly better than that of best-fit-I. One reason for this result may be that best-fit-II infers more true-positive connections and fewer false-positive connections in small-sample situations (see Table 1). It is interesting that, in noisy situations, MDL-based filtering can even outperform the MDL algorithm across all sample sizes. In essence, the two methods are totally different because the former aims to reduce the structural redundancy of the minimal-error model obtained by the best-fit algorithm, whereas the latter aims to search for the model with the minimum coding length L. From the point of view of the MDL principle, the coding length L of MDL-based filtering may not be the minimum length. Because MDL-based filtering combines both the best-fit algorithm and the MDL principle, it reduces structural redundancy and overcomes the overfitting in large-sample-size situations.
4.2 Cell cycle model of budding yeast
The cell cycle is a vital biological process in which one cell grows and divides into two daughter cells. It consists of four phases, G1, S, G2, and M, and is regulated by a highly complex network that is highly conserved among the eukaryotes. From the 800 genes involved in the cell cycle process of budding yeast, Li et al. constructed a network of 11 key regulators: Cln3, MBF, SBF, Cln1, Cdh1, Swi5, Cdc20, Clb5, Sic1, Clb1, and Mcm1 [24]. This Boolean network model, shown in Figure 2A, has an attractor whose biggest basin corresponds to the biological G1 stationary state. The temporal sequence in Table 2 is a pathway from this basin that follows the biological trajectory of the cell cycle network.
We applied MDL, best-fit-I, and best-fit-II filtered by CMI and MDL to the artificial time-series data in Table 2. The inferred networks are shown in Figure 2. Figure 2B shows the network inferred by the MDL algorithm, which is the best network. The networks in Figure 2C,D have the same number of true-positive connections, with the latter having fewer false-positive connections. This result demonstrates that the method of selecting regulatory genes in best-fit-II is superior to that of best-fit-I. Compared with the networks in Figure 2E,F, which were obtained from those in Figure 2C,D by CMI-based filtering, the MDL-filtered networks in Figure 2G,H have more true connections, whereas the numbers of false-positive connections are about the same. Furthermore, we can see that the networks resulting from CMI-based filtering have two disconnected subgraphs, whereas the network resulting from MDL is a connected graph. This result shows that MDL-based filtering is more effective than CMI-based filtering. In fact, Figure 2G shows the same result as Figure 2B, which is the best result.
We also ran 100 simulations with 5% and 10% noise for the pathway under consideration. Table 3 lists the average number of true positives and false positives, the normalized-edge Hamming distance μ_{ham}^{e}, and the steady-state distribution distance μ^{ssd}. The results are consistent with those of the simulated networks (Figure 1), and they demonstrate that MDL-based filtering is effective for samples containing a small amount of noise.
5 Conclusion
Reducing the rate of false positives is an important issue in network inference. In this paper, we address this issue by using the minimum description length (MDL) principle. Specifically, we apply the MDL measurement technique proposed by Zhao et al. to filter the models obtained by two best-fit algorithms (best-fit-I and best-fit-II). We compare the performance of MDL, best-fit-I, and best-fit-II filtered by CMI and MDL both on simulated networks and on an artificial model of budding yeast. The results show that, as determined by the distance metrics μ_{ham}^{e} and μ^{ssd}, MDL-based filtering does not degrade inference performance, can improve inference performance, and is more effective than CMI-based filtering. Moreover, the combination of MDL filtering with the best-fit algorithm can even outperform the MDL algorithm alone. Additionally, applying MDL-based filtering is computationally less burdensome than using the MDL algorithm alone because calculating the data-coding length L_{D} is more complex than calculating the error estimate of the best-fit algorithm, and the complexity of the calculation increases dramatically as the sample size m increases. Last but not least, MDL-based filtering can also be applied to the results of other minimal-error algorithms such as CoD.
References
1. Shmulevich I, Dougherty ER: Genomic Signal Processing (Princeton Series in Applied Mathematics). Princeton University Press, Princeton; 2007.
2. Shmulevich I, Dougherty ER: Probabilistic Boolean Networks: The Modeling and Control of Gene Regulatory Networks. SIAM, Philadelphia; 2010.
3. Liang S, Fuhrman S, Somogyi R: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. In Pacific Symposium on Biocomputing. World Scientific, Singapore; 1998.
4. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 2006, 7: S7.
5. Zhao W, Serpedin E, Dougherty ER: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 2006, 22: 2129–2135. 10.1093/bioinformatics/btl364
6. Chaitankar V, Ghosh P, Perkins EJ, Gong P, Deng Y, Zhang C: A novel gene network inference algorithm using predictive minimum description length approach. BMC Syst. Biol. 2010, 4: S7. 10.1186/1752-0509-4-S1-S7
7. Chaitankar V, Zhang C, Ghosh P, Perkins EJ, Gong P, Deng Y: Gene regulatory network inference using predictive minimum description length principle and conditional mutual information. In International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS'09); 2009: 487–490.
8. Dougherty J, Tabus I, Astola J: Inference of gene regulatory networks based on a universal minimum description length. EURASIP J. Bioinform. Syst. Biol. 2008, 2008: 482090.
9. Tabus I, Astola J: On the use of MDL principle in gene expression prediction. EURASIP J. Appl. Signal Proc. 2001, 2001: 297–303. 10.1155/S1110865701000270
10. Dougherty ER, Kim S, Chen Y: Coefficient of determination in nonlinear signal processing. Signal Process. 2000, 80: 2219–2235. 10.1016/S0165-1684(00)00079-7
11. Kim S, Dougherty ER, Bittner ML, Chen Y, Sivakumar K, Meltzer P, Trent JM: General nonlinear framework for the analysis of gene interaction via multivariate expression arrays. J. Biomed. Opt. 2000, 5: 411–424. 10.1117/1.1289142
12. Shmulevich I, Saarinen A, Yli-Harja O, Astola J: Inference of genetic regulatory networks via best-fit extensions. In Computational and Statistical Approaches to Genomics. Springer, US; 2002.
13. Lähdesmäki H, Shmulevich I, Yli-Harja O: On learning gene regulatory networks under the Boolean network model. Mach. Learn. 2003, 52: 147–167. 10.1023/A:1023905711304
14. Zhao W, Serpedin E, Dougherty ER: Inferring connectivity of genetic regulatory networks using information-theoretic criteria. IEEE/ACM Trans. Comput. Biol. Bioinform. 2008, 5(2): 262–274. 10.1109/TCBB.2007.1067
15. Qian X, Dougherty ER: Validation of gene regulatory network inference based on controllability. Front. Genet. 2013, 4: 272. 10.3389/fgene.2013.00272
16. Dougherty ER, Pal R, Qian X, Bittner ML, Datta A: Stationary and structural control in gene regulatory networks: basic concepts. Int. J. Syst. Sci. 2010, 41(1): 5–16. 10.1080/00207720903144560
17. Yousefi MR, Dougherty ER: Intervention in gene regulatory networks with maximal phenotype alteration. Bioinformatics 2013, 29(14): 1758–1767. 10.1093/bioinformatics/btt242
18. Ivanov I, Simeonov P, Ghaffari N, Qian X, Dougherty ER: Selection policy induced reduction mappings for Boolean networks. IEEE Trans. Signal Process. 2010, 58(9): 4871–4882. 10.1109/TSP.2010.2050314
19. Ghaffari N, Ivanov I, Qian X, Dougherty ER: A CoD-based reduction algorithm for designing stationary control policies on Boolean networks. Bioinformatics 2010, 26: 1556–1563. 10.1093/bioinformatics/btq225
20. Akutsu T, Miyano S, Kuhara S: Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pac. Symp. Biocomput. 1999, 4: 17–28.
21. Boros E, Ibaraki T, Makino K: Error-free and best-fit extensions of partially defined Boolean functions. Inf. Comput. 1998, 140: 254–283. 10.1006/inco.1997.2687
22. Rissanen J: Modeling by shortest data description. Automatica 1978, 14: 465–471. 10.1016/0005-1098(78)90005-5
23. Dougherty ER: Validation of gene regulatory networks: scientific and inferential. Brief. Bioinform. 2011, 12: 245–252. 10.1093/bib/bbq078
24. Li F, Long T, Lu Y, Ouyang Q, Tang C: The yeast cell-cycle network is robustly designed. Proc. Natl. Acad. Sci. USA 2004, 101: 4781–4786. 10.1073/pnas.0305937101
Acknowledgements
This work was funded in part by the National Science Foundation of China (Grants No. 61272018, No. 60970065, and No. 61174162) and the Zhejiang Provincial Natural Science Foundation of China (Grants No. R1110261 and No. LY13F010007), with support from the China Scholarship Council.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Fang, J., Ouyang, H., Shen, L. et al. Using the minimum description length principle to reduce the rate of false positives of best-fit algorithms. J Bioinform Sys Biology 2014, 13 (2014). https://doi.org/10.1186/s13637-014-0013-2
DOI: https://doi.org/10.1186/s13637-014-0013-2