Inference of Gene Regulatory Networks Based on a Universal Minimum Description Length
© John Dougherty et al. 2008
Received: 24 August 2007
Accepted: 11 January 2008
Published: 12 February 2008
The Boolean network paradigm is a simple and effective way to interpret genomic systems, but discovering the structure of these networks remains a difficult task. The minimum description length (MDL) principle has already been used for inferring genetic regulatory networks from time-series expression data and has proven useful for recovering the directed connections in Boolean networks. However, the existing method uses an ad hoc measure of description length that necessitates a tuning parameter for artificially balancing the model and error costs and, as a result, directly conflicts with the MDL principle's implied universality. In order to surpass this difficulty, we propose a novel MDL-based method in which the description length is a theoretical measure derived from a universal normalized maximum likelihood model. The search space is reduced by applying an implementable analogue of Kolmogorov's structure function. The performance of the proposed method is demonstrated on random synthetic networks, for which it is shown to improve upon previously published network inference algorithms with respect to both speed and accuracy. Finally, it is applied to time-series Drosophila gene expression measurements.
The modeling of gene regulatory networks is a major focus of systems biology because, depending on the type of modeling, the networks can be used to model interdependencies between genes, to study the dynamics of the underlying genetic regulation, and to provide a basis for the derivation of optimal intervention strategies. In particular, Bayesian networks  and dynamic Bayesian networks  provide models to elucidate dependency relations; functional networks, such as Boolean networks  and probabilistic Boolean networks , provide the means to characterize steady-state behavior. All of these models are closely related .
When inferring a network from data, regardless of the type of network being considered, we are ultimately faced with the difficulty of finding the network configuration that best agrees with the data in question. Inference starts with some framework assumed to be sufficiently complex to capture a set of desired relations and sufficiently simple to be satisfactorily inferred from the data at hand. Many methods have been proposed, for instance, in the design of Bayesian networks  and probabilistic Boolean networks . Here we are concerned with Boolean networks, for which a number of methods have been proposed [10–14]. Among the first information-based design algorithms is the Reveal algorithm, which utilizes mutual information to design Boolean networks from time-course data . Information-theoretic design algorithms have also been proposed for non-time-course data .
Here we take an information-theoretic approach based on the minimum description length (MDL) principle . The MDL principle states that, given a set of data and class of models, one should choose the model providing the shortest encoding of the data. The coding amounts to storing both the network parameters and any deviations of the data from the model, a breakdown that strikes a balance between network precision and complexity. From the perspective of inference, the MDL principle represents a form of complexity regularization, where the intent is generally to measure the goodness of fit as a function of some error and some measure of complexity so as not to overfit the data, the latter being a critical issue when inferring gene networks from limited data. Basically, in addition to choosing an appropriate type, one wishes to select a model most suited for the amount of data. In essence, the MDL principle balances error (deviation from the data) and model complexity by using a cost function consisting of a sum of entropies, one relative to encoding the error and the other relative to encoding the model description . The situation is analogous to that of structural risk minimization in pattern recognition, where the cost function for the classifier is a sum of the resubstitution error of the empirical-error-rule classifier and a function of the VC dimension of the model family . The resubstitution error directly measures the deviation of the model from the data and the VC dimension term penalizes complex models. The difficulties are that one must determine a function of the VC dimension and that the VC dimension is often unknown, so that some approximation, say a bound, must be used. The MDL principle was among the first methods used for gene expression prediction using microarray data .
Recently, a time-course-data algorithm, henceforth referred to as Network MDL , was proposed based on the MDL principle. The Network MDL algorithm often yields good results, but it does so with an ad hoc coding scheme that requires a user-specified tuning parameter. We will avoid this drawback by achieving a codelength via a normalized maximum likelihood model. In addition, we will improve upon Network MDL's efficiency by applying an analogue of Kolmogorov's structure function .
2.1. Boolean Networks
The fundamental question we face is the estimation of and . Note that is usually not included as a parameter of because it can be absorbed into , but we choose to write it separately because, under the model we will specify, completely dictates , making our interest reside primarily in the structure parameter set .
where denotes modulo sum, acts independently on each column of , and is a vector of independent Bernoulli random variables with . We further assume that the errors for different nodes are independent. We allow to depend on because it can be interpreted as the probability that node disobeys the network rules, and we consider it natural for different nodes to have varying propensities for misbehaving.
We finalize the specification of our model by extending the parameter space for the error rates by replacing with where each corresponds to one of the possible values of . This allows the degree of reliability of the network function to vary based upon the state of a gene's predecessors. Note that is only an upper bound on the number of error rates because we will not necessarily observe all possible regressor values. This model is specified by the predecessor genes composing , the function , and the error rates in . Thus, adopting notation from Tabus et al. , we refer to the collection of all possible parameter settings as the model class
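The noise model described above can be sketched in a few lines. This is an illustration only, under stated assumptions: the function name `simulate_node`, the dictionary `theta` of per-state error rates, and the two-input OR rule are hypothetical choices, not notation from the paper. Each output is the Boolean function of the predecessor state, flipped (modulo-2 sum) by an independent Bernoulli error whose rate depends on that state.

```python
import random

def simulate_node(pred_states, f, theta, rng):
    """Return y_t = f(state_t) XOR eps_t, with eps_t ~ Bernoulli(theta[state_t]).

    pred_states: sequence of tuples, the predecessor values at each time step.
    f:           Boolean function mapping a state tuple to 0 or 1.
    theta:       dict mapping a state tuple to its error rate (default 0.0).
    rng:         a random.Random instance, so runs are reproducible.
    """
    out = []
    for s in pred_states:
        eps = 1 if rng.random() < theta.get(s, 0.0) else 0
        out.append(f(s) ^ eps)  # modulo-2 sum of the rule output and the error
    return out

# Hypothetical example: a two-input OR rule observed once in each state.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
f_or = lambda s: s[0] | s[1]
noise_free = simulate_node(states, f_or, {}, random.Random(0))
always_flipped = simulate_node(states, f_or, {s: 1.0 for s in states}, random.Random(0))
```

With all error rates at zero the node obeys its rule exactly, and with all rates at one it always disobeys it; realistic settings lie in between and may differ from state to state.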
2.2. The MDL Principle
Given the model formulation, we use the MDL principle as our metric for assessing the quality of the parameter estimates. As stated in Section 1, the MDL principle dictates that, given a dataset and some class of possible models, one should choose the model providing the shortest possible encoding of the data. In our case, the MDL principle is applied for selecting each node's predecessors. However, as we have noted, this technique is inherently problematic because no unique manner of codelength evaluation is specified by the principle. Letting when the node in question is predicted incorrectly and otherwise, basic coding theory gives us a residual codelength of , but the cost of storing the model parameters has no such standard. Thus, we can technically choose any applicable encoding scheme we like, an allowance that inevitably gives rise to infinitely many model codelengths and, as a result, no unique MDL-based solution.
We find that the two encoding methods can give different structure estimates because the shorter model codelength allows for a greater number of predecessors. Zhao et al. compensate for this nonuniqueness by adjusting the model codelength with a weight parameter, but, while necessary for ad hoc encodings such as the ones discussed so far, the presence of such tuning parameters is undesirable when compared with a more theoretically based method. Moreover, the MDL principle's notion of "the shortest possible codelength" implies a degree of generality that is violated if we rely upon a user-defined value.
2.3. Normalized Maximum Likelihood
is solved by the NML density function, defined as divided by the normalizing constant . Tabus et al.  provide the derivations of this NML distribution; the following is a brief outline of the major steps.
Of course, this means that our model does not explicitly estimate . However, considering that represents error rates, the obvious choice is to minimize each by taking whenever , and otherwise. In the event that , we set if the portion of corresponding to is less than in binary. Assuming independent errors, this removes any bias that would result from favoring a particular value for when . This effectively reduces the parameter space for each from to which, in turn, affects by halving every . However, we will later show that the algorithm does not change whether or not we actually specify , and we opt not to do so.
an approximation given in . For the sake of efficiency, we compute every prior to learning the network so that calculating the denominator of (10) takes at most operations.
2.3.2. Stochastic Complexity
where denotes the binary entropy function. Note that the previous and all future logarithms are base 2. Returning to the issue of picking values for , we recall that doing so halves each . This translates to a unit reduction in stochastic complexity for each , but we observe that it also requires bit to store . Regardless of whether or not we choose to specify , the total codelength remains the same.
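As a rough numerical illustration of the quantities discussed in this section, the sketch below assumes the standard Bernoulli NML form: each observed predecessor state with n occurrences and m prediction errors contributes n·h(m/n) + log2 C_n bits, where C_n is the exact Bernoulli normalizing constant. The function names are hypothetical, and the expressions may differ in detail from the paper's numbered equations, which are not reproduced in this text.

```python
from math import comb, log2

def nml_constant(n):
    """Exact Bernoulli NML normalizer: sum_k C(n,k) (k/n)^k ((n-k)/n)^(n-k)."""
    if n == 0:
        return 1.0
    # Python evaluates 0.0 ** 0 as 1.0, which is the convention needed here.
    return sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
               for k in range(n + 1))

def binary_entropy(p):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def stochastic_complexity(counts):
    """Total codelength in bits for (n, m) pairs, one per observed state.

    n is the number of times the predecessor state was observed and m the
    number of times the node disagreed with the fitted Boolean function.
    """
    return sum(n * binary_entropy(m / n) + log2(nml_constant(n))
               for n, m in counts if n > 0)

# The constants can be tabulated once before any structure search.
table = [nml_constant(n) for n in range(5)]
```

Because the constants depend only on the counts, tabulating them in advance means each codelength evaluation reduces to table lookups and a few entropy terms. The direct sum is adequate for the short time series considered here; much longer series would call for an approximation of the constant.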
2.4. Kolmogorov's Structure Function
If we compute for every possible , we can simply select the one that provides the shortest total codelength, thus satisfying the MDL principle; however, this requires computing codelengths. A standard remedy for this problem is assuming a maximum indegree , but, even with , a -gene network would still result in possible predecessor sets per gene. Moreover, a fixed introduces bias into the method so, while we obviously cannot afford to perform exhaustive searches, we prefer to refrain from limiting the number of predecessors considered.
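The size of the search space is easy to check numerically. The sketch below counts candidate predecessor sets per gene with and without an indegree bound; the values 20 and 3 are illustrative choices, and the count assumes that any gene, including the target itself, may act as a predecessor.

```python
from math import comb

def num_predecessor_sets(n_genes, max_indegree):
    """Number of predecessor sets of size 0..max_indegree drawn from n_genes."""
    return sum(comb(n_genes, k) for k in range(max_indegree + 1))

bounded = num_predecessor_sets(20, 3)      # sets of size at most 3
unbounded = num_predecessor_sets(20, 20)   # every subset of the 20 genes
```

For these illustrative values the bounded search covers 1351 sets per gene, against 1048576 for the exhaustive search.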
We refer to and as the model and noise codelengths, respectively, which together constitute a universal sufficient statistics decomposition of the total codelength. The summation of these values is clearly different from the stochastic complexity, but this is a result of partitioning the parameter space.
Of particular use in this scenario is the way in which the model codelength is somewhat stable for each , producing the distinct bands in Figure 1. The noise codelengths are still widely dispersed so we are required to compute all possible codelengths up to some total number of predecessors. We would like that number to be variable and not arbitrarily specified in advance, but this may not be feasible for highly connected networks. However, as mentioned earlier, the indegrees of genetic networks are generally assumed to be small (hence, the standard ), and, when looking for a single gene's predecessors in a 20-gene network, our method only takes 70 minutes to check every possible set up to size 6. Thus, we are still constrained by a maximum indegree, but we can now increase it well beyond the accepted number that we expect to encounter in practice without risking extreme computational repercussions. Additionally, choosing a makes a nondecreasing function of , meaning that we can also stop searching if ever becomes larger than the current value of . The method is summarized in Algorithm 1.
Algorithm 1: Structure-function search for a gene's near-optimal predecessor set, with codelengths computed using (11), (17), and (18).
Note that we termed the resulting predecessors "near-optimal." It is possible to encounter genes for which adding one predecessor does not warrant an increase in model codelength but adding two predecessors does. Nevertheless, these differences tend to be small for certain types of networks. Moreover, depending on the kind of error with which one is concerned, these near-optimal predecessor sets can even provide a better approximation of the true network in the sense that any differences will be in the direction of the SF finding fewer predecessors. Thus, assuming a maximum indegree , the false positive rate from using the SF can never be higher than that from checking all predecessor sets up to size .
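The size-by-size search with early stopping can be sketched as follows. This is a simplified stand-in, not a transcription of Algorithm 1: it scores each candidate set by a single total codelength (the Bernoulli NML form, assumed here) rather than by separate model and noise codelengths, and it assumes the search stops as soon as the best set of the next size fails to shorten the total. All function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import comb, log2

def nml_constant(n):
    """Exact Bernoulli NML normalizer for sample size n."""
    if n == 0:
        return 1.0
    return sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
               for k in range(n + 1))

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def codelength(X, target, preds):
    """Codelength of target's next values given predecessor genes `preds`.

    X is a list of binary state vectors, one per time point.
    """
    counts = {}
    for t in range(len(X) - 1):
        state = tuple(X[t][j] for j in preds)
        n, ones = counts.get(state, (0, 0))
        counts[state] = (n + 1, ones + X[t + 1][target])
    return sum(n * h(ones / n) + log2(nml_constant(n))
               for n, ones in counts.values())

def sf_search(X, target, max_indegree):
    """Grow the predecessor set one gene at a time; stop when nothing improves."""
    n_genes = len(X[0])
    best_set, best_len = (), codelength(X, target, ())
    for k in range(1, min(max_indegree, n_genes) + 1):
        scored = [(codelength(X, target, s), s)
                  for s in combinations(range(n_genes), k)]
        cand_len, cand = min(scored)
        if cand_len >= best_len:  # assumed stopping rule
            break
        best_set, best_len = cand, cand_len
    return best_set, best_len

# Toy data: gene 1 copies gene 0 with a one-step delay; gene 2 is unrelated.
g0 = [0, 1, 1, 0, 1, 0, 0, 1, 1]
g1 = [0, 0, 1, 1, 0, 1, 0, 0, 1]
g2 = [0, 0, 1, 0, 0, 1, 1, 1, 0]
X = [list(row) for row in zip(g0, g1, g2)]
found_set, found_len = sf_search(X, target=1, max_indegree=3)
```

On the toy data the search accepts gene 0 as the sole predecessor of gene 1 and stops at size 2, because no pair shortens the codelength further. The same stopping rule is what makes the result only near-optimal: a function whose inputs are uninformative one at a time can cause the search to halt before the informative combination is examined.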
3.1. Performance on Simulated Data
A critical issue in performance analysis concerns the class from which the random networks are to be generated. While it might first appear that one should generate networks using the class composed of all Boolean networks containing genes, this is not necessarily the case if one wishes to achieve simulated results that reflect algorithm performance on realistic networks. An obvious constraint is to limit the indegree, either for biological reasons  or for the sake of inference accuracy when data are limited. In this case, one can consider the class composed of all Boolean networks with indegrees bounded by . Other constraints might include realistic attractor structures , networks that are neither too sensitive nor too insensitive to perturbations , or networks that are neither too chaotic nor too ordered .
Here we consider a constraint on the functions that is known to prevent chaotic behavior. A canalizing function is one for which there exists a gene among its regulatory set such that, if the gene takes on a certain value, that value determines the value of the function irrespective of the values of the other regulatory genes. For example, the three-input OR function $f(x_1, x_2, x_3) = x_1 \lor x_2 \lor x_3$ is canalizing with respect to $x_1$ because $f(1, x_2, x_3) = 1$ for any values of $x_2$ and $x_3$. There is evidence that genetic networks under the Boolean model favor this kind of functionality. Corresponding to class is class , in which all functions are constrained to be canalizing.
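The definition of a canalizing function can be checked by brute force over the truth table. The sketch below is illustrative; the function name `canalizing_inputs` and the example rules are hypothetical. For every input and every value of that input, it tests whether the output is the same across all settings of the remaining inputs.

```python
from itertools import product

def canalizing_inputs(f, n_inputs):
    """Return (input index, canalizing value, forced output) triples for f.

    f takes a tuple of n_inputs binary values and returns 0 or 1.
    An empty list means f is not canalizing.
    """
    found = []
    for i in range(n_inputs):
        for a in (0, 1):
            outputs = {f(x) for x in product((0, 1), repeat=n_inputs)
                       if x[i] == a}
            if len(outputs) == 1:  # fixing x_i = a forces the output
                found.append((i, a, outputs.pop()))
    return found

or3 = lambda x: x[0] | x[1] | x[2]
xor2 = lambda x: x[0] ^ x[1]
or_result = canalizing_inputs(or3, 3)
xor_result = canalizing_inputs(xor2, 2)
```

OR is canalizing in each of its inputs, since setting any one of them to 1 forces the output to 1, whereas XOR has no canalizing input. Note that a constant function passes this check trivially for every input.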
To evaluate the performance of our model selection method, referred to as NML MDL, on synthetic Boolean networks, we consider sample sizes ranging from to , , and . We test each of the combinations on randomly generated networks from and . Note that is equivalent to .
We use the Reveal and Network MDL methods as benchmarks for comparison. As mentioned earlier, Network MDL requires a tuning parameter, which we set to since that paper uses 0.2–0.4 as the range for this parameter in its simulations. Also, its application in  limits the average indegree of the inferred network to 3 so we assume this as well. Reveal is run from a Matlab toolbox created by Kevin Murphy, available for download at http://bnt.sourceforge.net/, and requires a fixed , which we also set to 3. We implement our method with and without including the SF approach to show that the difference in accuracy is often small, especially in light of the reduction in computation time.
As performance metrics, we use the number of false positives and the Hamming distance between the estimated and true networks, both normalized over the total number of edges in the true network. False positives are defined as any time a proposed network includes an edge not existing in the real network, and Hamming distance is defined as the number of false positives plus the number of edges in the true network not included in the estimated network.
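The two metrics can be written down directly from their definitions. In this sketch, networks are represented as sets of directed edges given as (predecessor, target) pairs; that representation and the function name are assumptions made for illustration.

```python
def edge_metrics(true_edges, est_edges):
    """Return (false positive rate, Hamming distance), both normalized
    by the number of edges in the true network."""
    true_edges, est_edges = set(true_edges), set(est_edges)
    false_pos = len(est_edges - true_edges)   # proposed edges that do not exist
    false_neg = len(true_edges - est_edges)   # true edges that were missed
    n_true = len(true_edges)                  # must be nonzero
    return false_pos / n_true, (false_pos + false_neg) / n_true

# Hypothetical three-edge network and a two-edge estimate.
true_net = {(0, 2), (1, 2), (2, 0)}
est_net = {(0, 2), (1, 0)}
fp_rate, hamming = edge_metrics(true_net, est_net)
```

Here the estimate contains one spurious edge and misses two true edges, giving a normalized false positive count of 1/3 and a normalized Hamming distance of 1.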
3.1.1. Random Networks
With respect to false positives, NML MDL is uniformly the best, and there is at most a minor difference between the two modes. NML MDL is also the best overall method when looking at Hamming distances. Figures 2 and 3 show the cases for which it most definitively improves upon Network MDL and Reveal, both of which have . The way in which the two NML methods diverge as increases is a general trend, but both remain below Network MDL. Increasing to 0.2 narrows the margins between the methods, but the relationships only change significantly for . As shown in Figure 4, NML MDL with the SF loses its edge, but NML MDL with fixed remains the best choice. Raising to 0.3 is most detrimental to Reveal, pulling its accuracy well away from the other three methods. Figure 5 shows this for , but the plots for smaller values of look very similar, especially in how the two NML MDL approaches perform almost identically. We point out that this is the worst scenario for NML MDL, but, even then, it is still superior for small and only worse than Network MDL for .
In terms of computation time, Reveal was fairly constant for all of the simulation settings, taking an average of 6.35 seconds to find predecessors for gene using Matlab on a Pentium IV desktop computer with 1 GB of memory. NML MDL with increases slightly with in a linear fashion, but its most noticeable increase is with . For , this method took an average of 0.33 to 0.48 seconds per gene as goes from 20 to 100, but this range increased from 0.59 to 0.73 for . Alternatively, Network MDL's runtime is sporadic with respect to and decreases when is raised, taking an average of 2.50 seconds per gene for but needing only 0.33 second per gene when , the only case for which it was noticeably faster than NML MDL with fixed . However, NML MDL with the SF proved to be the most efficient algorithm in almost every scenario. For and 0.3 it was uniformly the fastest, taking an average of 0.06 and 0.02 seconds per gene, respectively. The runtime begins to increase more rapidly with for and , but the only observed case when it was not the fastest method was for and , and even then the needed time was still less than 1 second per gene.
3.1.2. Canalizing Networks
For example, consider OR . If is found to be the best predecessor set of size 1, adding may not give enough additional information to warrant the increased model codelength, in which case NML MDL will miss one connection. Alternatively, if XOR , either input tells almost nothing by itself, and the SF will probably stop the inference too soon. However, using both inputs will most likely result in the minimum total codelength, in which case NML MDL with fixed will find the correct predecessor set.
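The contrast between OR and XOR can be quantified with a small calculation. The sketch below computes the mutual information between a function's output and one of its inputs, assuming (for illustration only) that all input combinations are equally likely; the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(counter):
    """Shannon entropy in bits of the empirical distribution in `counter`."""
    total = sum(counter.values())
    return -sum(c / total * log2(c / total) for c in counter.values())

def info_from_input(f, n_inputs, i):
    """I(Y; X_i) = H(Y) - H(Y | X_i) for Y = f(X) under uniform inputs."""
    inputs = list(product((0, 1), repeat=n_inputs))
    h_y = entropy(Counter(f(x) for x in inputs))
    h_y_given = 0.0
    for a in (0, 1):
        subset = [x for x in inputs if x[i] == a]
        h_y_given += len(subset) / len(inputs) * entropy(Counter(f(x) for x in subset))
    return h_y - h_y_given

or2 = lambda x: x[0] | x[1]
xor2 = lambda x: x[0] ^ x[1]
or_info = info_from_input(or2, 2, 0)
xor_info = info_from_input(xor2, 2, 0)
```

Under this assumption a single input of OR carries about 0.31 bits about the output, while a single input of XOR carries none, which is why a search that grows the predecessor set one gene at a time can stall on XOR-like functions.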
For the same reason, we also see that Network MDL is better suited to canalizing functions, but Reveal does better without this constraint. Of particular interest is that, for these methods, the change can be so drastic that they comparatively switch their rankings depending on which network class we use, whereas NML MDL provides the most accurate inference either way. Similar results can be observed for the other cases in the supporting data. Based on these findings, we recommend using the SF primarily for networks composed of canalizing functions and networks too large to run NML MDL with fixed in a reasonable amount of time. We also suggest using the SF when is large because, as pointed out in Section 3.1.1, the performance of the two NML MDL varieties is no longer different when .
3.2. Application to Drosophila Data
4. Concluding Remarks
Using a universal codelength when applying the MDL principle eliminates the relativity of applying ad hoc codelengths and user-defined tuning parameters. In our case, this has resulted in improved accuracy of Boolean network estimation. Using the theoretically grounded stochastic complexity instead of ad hoc encodings genuinely reflects the intent of the MDL principle. In addition, the structure function makes the proposed method faster than other published methods. Computation time does not heavily rely on bounded indegrees and increases only slightly with .
This work was supported by the Academy of Finland (Application no. 213462, Finnish Programme for Centres of Excellence in Research 2006–2011) and the Tampere Graduate School in Information Science and Engineering. Partial support was also provided by the National Cancer Institute (Grant no. CA90301).
1. Pearl J: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, Calif, USA; 1988.
2. Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. Journal of Computational Biology 2000, 7(3-4):601-620. doi:10.1089/106652700750050961
3. Dean T, Kanazawa K: A model for reasoning about persistence and causation. Computational Intelligence 1989, 5(2):142-150. doi:10.1111/j.1467-8640.1989.tb00324.x
4. Murphy K: Dynamic Bayesian networks: representation, inference and learning, Ph.D. thesis. Computer Science Division, UC Berkeley, Berkeley, Calif, USA; 2002.
5. Kauffman SA: Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology 1969, 22(3):437-467. doi:10.1016/0022-5193(69)90015-0
6. Shmulevich I, Dougherty ER, Kim S, Zhang W: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 2002, 18(2):261-274. doi:10.1093/bioinformatics/18.2.261
7. Lähdesmäki H, Hautaniemi S, Shmulevich I, Yli-Harja O: Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks. Signal Processing 2006, 86(4):814-834. doi:10.1016/j.sigpro.2005.06.008
8. Pe'er D, Regev A, Elidan G, Friedman N: Inferring subnetworks from perturbed expression profiles. Bioinformatics 2001, 17(supplement 1):S215-S224.
9. Zhou X, Wang X, Pal R, Ivanov I, Bittner M, Dougherty ER: A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks. Bioinformatics 2004, 20(17):2918-2927. doi:10.1093/bioinformatics/bth318
10. Zhao W, Serpedin E, Dougherty ER: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 2006, 22(17):2129-2135. doi:10.1093/bioinformatics/btl364
11. Liang S, Fuhrman S, Somogyi R: Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing 1998, 3:18-29.
12. Akutsu T, Miyano S, Kuhara S: Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pacific Symposium on Biocomputing 1999, 3:17-28.
13. Shmulevich I, Saarinen A, Yli-Harja O, Astola J: Inference of genetic regulatory networks via best-fit extensions. In Computational and Statistical Approaches to Genomics. chapter 11, Kluwer Academic Publishers, New York, NY, USA; 2002:197-210.
14. Lähdesmäki H, Shmulevich I, Yli-Harja O: On learning gene regulatory networks under the Boolean network model. Machine Learning 2003, 52(1-2):147-167.
15. Margolin AA, Nemenman I, Basso K, et al.: ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 2006, 7(supplement 1):S7.
16. Nemenman I: Information theory, multivariate dependence, and genetic network inference. KITP, UCSB, Santa Barbara, Calif, USA; June 2004.
17. Rissanen J: Modeling by shortest data description. Automatica 1978, 14(5):465-471. doi:10.1016/0005-1098(78)90005-5
18. Rissanen J: Stochastic complexity and modeling. Annals of Statistics 1986, 14(3):1080-1100. doi:10.1214/aos/1176350051
19. Vapnik V: Estimation of Dependencies Based on Empirical Data. Springer, New York, NY, USA; 1982.
20. Tabus I, Astola J: On the use of MDL principle in gene expression prediction. EURASIP Journal on Applied Signal Processing 2001, 2001(4):297-303. doi:10.1155/S1110865701000270
21. Rissanen J: Information and Complexity in Statistical Modeling. Springer, New York, NY, USA; 2007.
22. Wuensche A: Genomic regulation modeled as a network with basins of attraction. Pacific Symposium on Biocomputing 1998, 3:89-102.
23. Tabus I, Rissanen J, Astola J: Normalized maximum likelihood models for Boolean regression with application to prediction and classification in genomics. In Computational and Statistical Approaches to Genomics. chapter 10, Kluwer Academic Publishers, New York, NY, USA; 2002:173-196.
24. Szpankowski W: On asymptotics of certain recurrences arising in universal coding. Problems of Information Transmission 1998, 34(2):55-61.
25. Thieffry D, Huerta AM, Pérez-Rueda E, Collado-Vides J: From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli. BioEssays 1998, 20(5):433-440. doi:10.1002/(SICI)1521-1878(199805)20:5<433::AID-BIES10>3.0.CO;2-2
26. Kauffman SA: The Origins of Order. Oxford University Press, Oxford, UK; 1993.
27. Pal R, Ivanov I, Datta A, Bittner ML, Dougherty ER: Generating Boolean networks with a prescribed attractor structure. Bioinformatics 2005, 21(21):4021-4025. doi:10.1093/bioinformatics/bti664
28. Shmulevich I, Kauffman SA: Activities and sensitivities in Boolean network models. Physical Review Letters 2004, 93(4):-4.
29. Derrida B, Pomeau Y: Random networks of automata: a simple annealed approximation. Europhysics Letters 1986, 1:45-49. doi:10.1209/0295-5075/1/2/001
30. Harris S, Sawhill B, Wuensche A, Kauffman SA: A model of transcriptional regulatory networks based on biases in the observed regulation rules. Complexity 2002, 7(4):23-40. doi:10.1002/cplx.10022
31. Arbeitman M, Furlong E, Imam F, et al.: Gene expression during the life cycle of Drosophila melanogaster. Science 2002, 297(5590):2270-2275. doi:10.1126/science.1072152
32. Bhojwani J, Shashidhara LS, Sinha P: Requirement of teashirt (tsh) function during cell fate specification in developing head structures in Drosophila. Development Genes and Evolution 1997, 207(3):137-146. doi:10.1007/s004270050101
33. Cimbora DM, Sakonju S: Drosophila midgut morphogenesis requires the function of the segmentation gene odd-paired. Developmental Biology 1995, 169(2):580-595. doi:10.1006/dbio.1995.1171
34. Fujioka M, Jaynes J, Goto T: Early even-skipped stripes act as morphogenetic gradients at the single cell level to establish engrailed expression. Development 1995, 121(12):4371-4382.
35. González-Gaitan M, Jäckle H: Invagination centers within the Drosophila stomatogastric nervous system anlage are positioned by Notch-mediated signaling which is spatially controlled through wingless. Development 1995, 121(8):2313-2325.
36. Mathies LD, Kerridge S, Scott MP: Role of the teashirt gene in Drosophila midgut morphogenesis: secreted proteins mediate the action of homeotic genes. Development 1994, 120(10):2799-2809.
37. Morimura S, Maves L, Chen Y, Hoffmann FM: Decapentaplegic overexpression affects Drosophila wing and leg imaginal disc development and wingless expression. Developmental Biology 1996, 177(1):136-151. doi:10.1006/dbio.1996.0151
38. Dréan BS-L, Nasiadka A, Dong J, Krause HM: Dynamic changes in the functions of Odd-skipped during early Drosophila embryogenesis. Development 1998, 125(23):4851-4861.
39. Schaeffer V, Killian D, Desplan C, Wimmer EA: High Bicoid levels render the terminal system dispensable for Drosophila head development. Development 2000, 127(18):3993-3999.
40. Steneberg P, Hemphälä J, Samakovlis C: Dpp and Notch specify the fusion cell fate in the dorsal branches of the Drosophila trachea. Mechanisms of Development 1999, 87(1-2):153-163. doi:10.1016/S0925-4773(99)00157-4
41. Torres IS, López-Schier H, Johnston DSt: A Notch/Delta-dependent relay mechanism establishes anterior-posterior polarity in Drosophila. Developmental Cell 2003, 5(4):547-558. doi:10.1016/S1534-5807(03)00272-7
42. Torres-Vazquez J, Park S, Warrior R, Arora K: The transcription factor Schnurri plays a dual role in mediating Dpp signaling during embryogenesis. Development 2001, 128(9):1657-1670.
43. Yin Z, Xu X-L, Frasch M: Regulation of the twist target gene tinman by modular cis-regulatory elements during early mesoderm development. Development 1997, 124(24):4971-4982.
44. Schroeder MD, Pearce M, Fak J, et al.: Transcriptional control in the segmentation gene network of Drosophila. PLoS Biology 2004, 2(9):e271. doi:10.1371/journal.pbio.0020271
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.