Learning restricted Boolean network model by time-series data
EURASIP Journal on Bioinformatics and Systems Biology volume 2014, Article number: 10 (2014)
Abstract
Restricted Boolean networks are simplified Boolean networks in which every regulation between genes is required to be either negative or positive. Higa et al. (BMC Proc 5:S5, 2011) proposed a three-rule algorithm to infer a restricted Boolean network from time-series data. However, the algorithm suffers from a major drawback: it is very sensitive to noise. In this paper, we systematically analyze the regulatory relationships between genes based on the state switch of the target gene and propose an algorithm with which restricted Boolean networks may be inferred from time-series data. We compare the proposed algorithm with the three-rule algorithm and the best-fit algorithm based on both synthetic networks and a well-studied budding yeast cell cycle network. The performance of the algorithms is evaluated by three distance metrics: the normalized-edge Hamming distance μ_{ham}^{e}, the normalized Hamming distance of state transition μ_{ham}^{st}, and the steady-state distribution distance μ^{ssd}. Results show that the proposed algorithm outperforms the others according to both μ_{ham}^{e} and μ_{ham}^{st}, whereas its performance according to μ^{ssd} is intermediate between the best-fit and the three-rule algorithms. Thus, our new algorithm is more appropriate for inferring interactions between genes from time-series data.
1 Introduction
A key goal in systems biology is to characterize the molecular mechanisms governing specific cellular behaviors and processes. This entails selecting a model class for representing the system structure and state dynamics, followed by the application of computational or statistical inference procedures to reveal the model structure from measurement data. The models of gene regulatory networks run the gamut from coarse-grained discrete networks to the detailed description of stochastic differential equations [[1]]. They provide a uniform way to study biological phenomena (e.g., cell cycle) and diseases (e.g., cancer) and ultimately lead to systems-based therapeutic strategies [[2]].
Boolean networks, and the more general class of probabilistic Boolean networks, are among the most popular approaches for modeling gene networks. The inference of gene networks from high-throughput genomic data is an ill-posed problem: more than one model can explain the data. The search space for potential regulator sets and their corresponding Boolean functions generally increases exponentially with the number of genes in the network and the number of regulatory genes. Inference is particularly challenging in the face of small sample sizes, because the number of genes typically is much greater than the number of observations. Thus, estimates of modeling errors, which themselves are determined from the measurement data, can be highly variable and untrustworthy. Many inference algorithms have been proposed to elucidate the regulatory relationships between genes. Mutual information (MI) is an information-theoretic measure that can capture the nonlinear dependence between random variables. REVEAL is the first information-based algorithm to infer the regulatory relationships between genes [[3]]. However, a small MI does not necessarily mean that no regulatory relationship exists between genes (false negative). Conversely, a large MI does not necessarily mean a real regulatory relationship. ‘False-positive’ relationships often result from indirect interactions between two genes. The data processing inequality (DPI) and conditional mutual information (CMI) are two methods used to reduce the problem of false positives [[4],[5]]. Another information-based method is the minimum description length (MDL) principle, which achieves a good trade-off between model complexity and fit to the data [[6]–[10]]. The coefficient of determination (CoD) selects a set of predictors whose expression levels can be used to better predict the expression of a target gene relative to the best possible prediction in the absence of observations [[11],[12]].
The best-fit extension incorporates inconsistencies generated from measurements or other unknown latent factors by constructing a network that makes as few misclassifications as possible [[13],[14]]. Any prior knowledge about the network structure or dynamics likely improves inference accuracy, especially for small sample sizes. Theoretical considerations and computational studies suggest that gene regulatory networks might operate close to a critical phase transition between ordered and disordered dynamical regimes [[15],[16]]. Liu et al. proposed a method to embed such a criticality assumption into the inference procedure. Such regularization of the sensitivity can both improve the inference and move the inferred networks closer to criticality [[17]].
A restricted Boolean network is a simplified Boolean model that has been used to study dynamical behavior of the yeast cell cycle [[18]–[24]]. In this model, the regulatory relationship between genes is either upregulation or downregulation. The output of the target gene is determined mainly by the summation of its input genes. When the input summation is zero, the target gene keeps its current state. The inference algorithms mentioned above generally cannot deal with this situation and thus may not be appropriate for inferring such network models. Recently, Higa et al. proposed a ‘three-rule algorithm’ to construct a restricted Boolean network from time-series data [[25]]. Their idea is that the consecutive state transitions of the system must be driven by some constraints, which can be induced from the small perturbations between two similar system states (detailed rules are provided in Section 3.1). However, the perturbations in microarray data may sometimes be caused by stochastic biological randomness or the measurement process instead of real changes in gene expression level. This inevitably leads the three-rule algorithm to some incorrect constraints. In this paper, we propose a systematic method to infer a restricted Boolean network based on the state transitions of the target gene. Results on simulated networks and a modeled yeast cell cycle show that the proposed algorithm is more robust to noise than the three-rule method.
This paper is organized as follows: Background information and definitions are given in Section 2. Section 3 presents a brief introduction to the three rules; after which, we systematically analyze the regulatory relationships between input genes and their target gene and propose an inference algorithm. Section 4 and Section 5 present results for the simulated networks and for the cell cycle model of budding yeast. Concluding remarks are given in Section 6.
2 Background
2.1 Boolean networks
A Boolean network G(V, F) is defined by a set of nodes V = {x_{1}, …, x_{n}}, x_{i} ∈ {0, 1}, and a set of Boolean functions F = {f_{1}, …, f_{n}}, with f_{i}: {0, 1}^{k_{i}} → {0, 1}. Each node x_{i} represents the expression state of gene x_{i}, where x_{i} = 0 means that the gene is off, and x_{i} = 1 means it is on. Each node x_{i} is assigned a Boolean function f_{i}(x_{1}, …, x_{k_{i}}) with k_{i} specific input nodes, which is used to update its value. Under the synchronous updating scheme, all genes are updated simultaneously according to their corresponding update functions. The network's state at time t is represented by a binary vector x(t) = (x_{1}(t), …, x_{n}(t)). In the absence of noise, the state of the system at the next time step is
x(t + 1) = F(x(t)) = \left(f_{1}(x(t)), \dots, f_{n}(x(t))\right). (1)
The long-run behavior of a deterministic Boolean network (BN) depends on the initial state, and the network will eventually settle down and cycle endlessly through a set of states called an attractor cycle. The set of all initial states that reach a particular attractor cycle forms the basin of attraction (BOA) for the cycle. Following a perturbation, the network in the long run may randomly escape an attractor cycle, be reinitialized, and then begin its transition process anew. For a BN with perturbation probability p, its corresponding Markov chain possesses a steady-state distribution. It has been hypothesized that attractors or steady-state distributions in Boolean formalisms correspond to different cell types of an organism or to cell fates. In other words, the phenotypic traits are encoded in the attractors [[1]]. There are two ways to define the perturbation probability p. One is that each gene can flip its state according to an i.i.d. random perturbation vector γ = (γ_{1}, ⋯, γ_{n}), where γ_{i} ∈ {0, 1}, the ith gene flips if and only if γ_{i} = 1, and p = P(γ_{i} = 1) for i = 1, 2, ⋯, n. The other is that each state x(t) can transit to any other state with the same probability p. In this situation, at each time step, state x(t) will transit to the next state according to F with probability 1 + p − 2^{n}p and to each of the other states with probability p. In this paper, we adopt the latter definition of the perturbation probability p.
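The second perturbation model can be turned into a Markov transition matrix directly. The following is a minimal sketch, assuming states are indexed 0 to 2^n − 1 and the deterministic map F is given as a lookup list; the function names, the 2-gene example map, and the use of plain power iteration are illustrative choices, not taken from the paper or the PBN Toolbox.

```python
import numpy as np

def transition_matrix(next_state, p):
    """Row-stochastic matrix for a BN with uniform state perturbation.

    next_state[s] is the index of F(s). From state s the chain moves to
    F(s) with probability 1 + p - 2**n * p and to every other state with
    probability p (2**n equals len(next_state)).
    """
    num_states = len(next_state)
    P = np.full((num_states, num_states), float(p))
    for s, t in enumerate(next_state):
        P[s, t] = 1.0 + p - num_states * p
    return P

def steady_state(P, iters=5000):
    """Approximate the stationary distribution by power iteration."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi

# Hypothetical 2-gene map: state 0 is a fixed point, states 2 and 3 form a cycle.
example_map = [0, 0, 3, 2]
P = transition_matrix(example_map, p=0.01)
print(steady_state(P))
```

Power iteration converges slowly when p is very small (for example p = 0.0001), so a direct eigenvector or linear solve may be preferable in that regime.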
2.2 Restricted Boolean networks
Restricted Boolean networks are simplified Boolean networks in which the regulatory relationships between genes obey the following convention: a_{ij} = 1 represents a positive regulation from gene x_{j} to x_{i} (activation); a_{ij} = −1 represents a negative regulation from gene x_{j} to x_{i} (inhibition); and a_{ij} = 0 means that x_{j} has no effect on x_{i}. The Boolean function f_{i}(x_{1}, …, x_{k_{i}}) is defined as [[18]]
x_{i}(t + 1) = \begin{cases} 1, & \sum_{j} a_{ij} x_{j}(t) > 0 \\ 0, & \sum_{j} a_{ij} x_{j}(t) < 0 \\ x_{i}(t), & \sum_{j} a_{ij} x_{j}(t) = 0 \end{cases} (2)
This model is ‘restricted’ in the sense that functions satisfying formula (2) constitute a subset of the class of all Boolean functions. The number of restricted functions decreases dramatically, relative to the number of all Boolean functions, as the input degree k_{i} increases. For example, there are 12 (< 2^{2^{2}} = 16) restricted functions for k_{i} = 2, and only 60 (≪ 2^{2^{3}} = 256) for k_{i} = 3. The restricted model significantly reduces the model space, which is beneficial for inference, given a limited amount of noisy high-throughput data.
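The threshold rule of the restricted model (a positive input sum switches the target on, a negative sum switches it off, and a zero sum keeps the current state) can be sketched in a few lines. The function name and the 3-gene matrix below are hypothetical and serve only to illustrate the rule.

```python
def restricted_update(state, A):
    """One synchronous step of a restricted Boolean network.

    state: list of 0/1 gene states at time t.
    A[i][j] in {-1, 0, 1}: effect of gene j on gene i.
    """
    nxt = []
    for i, row in enumerate(A):
        total = sum(a * x for a, x in zip(row, state))
        if total > 0:
            nxt.append(1)          # net activation
        elif total < 0:
            nxt.append(0)          # net inhibition
        else:
            nxt.append(state[i])   # zero input sum: keep current state
    return nxt

# Hypothetical 3-gene network: gene 2 inhibits gene 0,
# gene 0 activates gene 1, gene 1 activates gene 2.
A = [[0, 0, -1],
     [1, 0, 0],
     [0, 1, 0]]

state = [1, 0, 0]
for _ in range(3):
    state = restricted_update(state, A)
    print(state)
```

Note that gene 0 stays on in the first two steps only because its input sum is zero, which is exactly the case that ordinary Boolean-function inference cannot represent.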
3 Methods
3.1 Three-rule method
A time-series observation can be treated as a trajectory (or random walk) through the state space of the network used to model a real biological system. The three-rule method proposed by Higa et al. induces the constraints between genes from the small difference between two similar states and the difference between their next states [[25]]. Given an m-point time series S = {S(1), S(2), …, S(m)} of gene expression profiles, where S(t) ∈ {0, 1}^{n} for t = 1, 2, …, m, the three rules are as follows:
Rule 1: Let S(t − 1), S(t), and S(t + 1) be three consecutive states. If S(t − 1) and S(t) differ by a single gene x_{ k }, then for each gene x_{ i } such that x_{ i }(t) ≠ x_{ i }(t + 1), we have x_{ k } directly regulates x_{ i }; that is, a_{ ik } ≠ 0.
Rule 2: Only the active genes at time t can possibly regulate genes at time t + 1.
Rule 3: Given two similar states S(t_{1}) and S(t_{2}), the difference between S(t_{1} + 1) and S(t_{2} + 1) must result from the genes in their predecessors S(t_{1}) and S(t_{2}) that are expressed differently.
Both rules 1 and 3 can also be extended to situations where S(t − 1) and S(t) or S(t_{1}) and S(t_{2}) differ in more than one gene. Repeatedly applying these rules to pairs of states leads to a group of constraint inequalities between the variables a_{ij}. Many available constraint satisfaction problem (CSP) solvers [[26]] can be used to solve for the possible regulatory relationships of one gene to the target gene.
Rules 1 and 3 may give incorrect relationships if applied to noisy data; in other words, they are very sensitive to the noise inherent in data. We demonstrate this by using a small network that contains only four genes (see Figure 1). An arrow represents positive regulation, a line segment with a bar at the end represents negative regulation, and the dotted loop on x_{2} indicates that this gene downregulates itself. The time-series data at the right in Figure 1 are extracted from the network in Figure 1. Between S(1) and S(2), only x_{2} changes from 1 to 0, and only x_{3} flips from 0 to 1 in the successive states S(2) and S(3). By applying rule 1, we can conclude that x_{2} must inhibit x_{3}, which means a_{32} = −1, because turning off x_{2} turns on x_{3}. If S(2) becomes 1001 owing to noise, then we would also conclude that gene x_{4} inhibits x_{2}, which means a_{24} = −1.
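The single-gene form of rule 1 can be sketched as a scan over consecutive triples of states. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, genes are 0-indexed, and the three-point series is invented solely to reproduce the two transitions described above (it is not the data of Figure 1). The sketch reports only that a_{ik} ≠ 0, not the sign.

```python
def rule1_edges(series):
    """Return pairs (i, k) such that rule 1 implies a_ik != 0.

    Only the single-gene case is handled: S(t-1) and S(t) must differ
    in exactly one gene k; every gene i that changes between S(t) and
    S(t+1) is then reported as regulated by k.
    """
    edges = set()
    for t in range(1, len(series) - 1):
        prev, cur, nxt = series[t - 1], series[t], series[t + 1]
        changed = [k for k in range(len(cur)) if prev[k] != cur[k]]
        if len(changed) != 1:
            continue  # rule 1 (single-gene form) does not apply here
        k = changed[0]
        for i in range(len(cur)):
            if cur[i] != nxt[i]:
                edges.add((i, k))
    return edges

# Invented series: x2 switches off between S(1) and S(2),
# then x3 switches on between S(2) and S(3).
series = [(1, 1, 0, 0),
          (1, 0, 0, 0),
          (1, 0, 1, 0)]
print(rule1_edges(series))  # index pair (2, 1) stands for "x2 regulates x3"
```

Because each inferred edge rests on a single pair of transitions, one flipped bit in the series is enough to add or remove an edge, which is the noise sensitivity discussed in the text.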
3.2 Analysis of regulatory relationships based on constraints
In this section, we study the regulatory relationships based on the constraint inequalities in formula (2) and how the target gene switches from one state to another. The target gene can switch in one of four ways: 0 → 0, 0 → 1, 1 → 0, or 1 → 1. Given an input state, inactive genes have no effect on the target gene, which may help reduce the constraint inequalities of the summation ∑_{j} a_{ij}x_{j}(t) (1 ≤ j ≤ k_{i}). Because the null input provides no constraints between the a_{ij}, we only need to investigate the non-null input situations.
First, consider the simplest situation where there is only one regulatory gene x_{j_{1}}. If gene x_{j_{1}} is active and the target gene x_{i} switches from 0 to 1, then gene x_{j_{1}} must activate the target gene x_{i} (which means a_{ij_{1}} = 1). On the contrary, if the target gene x_{i} switches from 1 to 0, then it must be inhibited by x_{j_{1}} (which means a_{ij_{1}} = −1). When the target gene x_{i} remains in state 1, we have a_{ij_{1}}x_{j_{1}} ≥ 0 (which means a_{ij_{1}} = 1). When the target gene x_{i} remains in state 0, we have a_{ij_{1}}x_{j_{1}} ≤ 0 (which means a_{ij_{1}} = −1). We present the four possible regulatory relationships a_{ij_{1}} in Table 1.
When there are two regulatory genes x_{j_{1}} and x_{j_{2}}, we only consider the input states 01, 10, and 11. If only one input gene is active, such as x_{j_{1}}x_{j_{2}} = 01, then we can directly determine a_{ij_{2}} from Table 1, whereas a_{ij_{1}} remains completely undetermined because x_{j_{1}} has no effect on the target gene. If both gene x_{j_{1}} and gene x_{j_{2}} are active, then we need to know whether or not the target gene x_{i} switches its state. First, if x_{i} switches from 1 to 0, then we have a_{ij_{1}} = a_{ij_{2}} = −1 to satisfy the constraint a_{ij_{1}} + a_{ij_{2}} < 0. Similarly, if x_{i} switches from 0 to 1, then we have a_{ij_{1}} = a_{ij_{2}} = 1 to satisfy the constraint a_{ij_{1}} + a_{ij_{2}} > 0. Second, if x_{i} remains in state 0, then we have a_{ij_{1}} = a_{ij_{2}} = −1 or a_{ij_{1}} = −a_{ij_{2}} because a_{ij_{1}} + a_{ij_{2}} ≤ 0. Similarly, if x_{i} remains in state 1, then we have a_{ij_{1}} = a_{ij_{2}} = 1 or a_{ij_{1}} = −a_{ij_{2}} because a_{ij_{1}} + a_{ij_{2}} ≥ 0.
We call these latter cases ‘semidetermined’ because there are two possible combinations of a_{ij_{1}} and a_{ij_{2}} in each case. In Table 2, we present the 12 possible regulatory relationships of a_{ij_{1}} and a_{ij_{2}} for two input genes.
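The case analysis for one and two regulators can be reproduced by brute-force enumeration of the sign assignments that satisfy the constraint implied by a single observed transition. The sketch below assumes that every gene in the candidate set is a genuine regulator (a_{ij} ∈ {−1, 1}), as in the analysis above; the function name and the use of None for an inactive, hence unconstrained, regulator are illustrative choices.

```python
from itertools import product

def feasible_signs(inputs, before, after):
    """Sign assignments consistent with one transition of the target gene.

    inputs: tuple of 0/1 states of the candidate regulators at time t.
    before, after: target state at time t and t + 1.
    Returns tuples with -1/1 for active regulators and None for
    inactive ones, which this transition leaves unconstrained.
    """
    active = [j for j, x in enumerate(inputs) if x == 1]
    result = []
    for signs in product((-1, 1), repeat=len(active)):
        total = sum(signs)
        if before == 0 and after == 1:
            ok = total > 0
        elif before == 1 and after == 0:
            ok = total < 0
        elif before == 0:          # target stays 0
            ok = total <= 0
        else:                      # target stays 1
            ok = total >= 0
        if ok:
            full = [None] * len(inputs)
            for j, s in zip(active, signs):
                full[j] = s
            result.append(tuple(full))
    return result

print(feasible_signs((1, 1), 1, 0))  # determined case
print(feasible_signs((1, 1), 0, 0))  # semidetermined case
```

A transition is determined when exactly one assignment survives and semidetermined when several do; the same enumeration extends unchanged to three regulators.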
Analogously, the regulatory relationships for three input genes are shown in Table 3. There are 10 semidetermined cases, and most of them occur when the target gene x_{i} does not change. Some of the semidetermined cases in Tables 2 and 3 may become determined if some a_{ij} are determined. For example, given a_{ij_{1}} + a_{ij_{2}} ≤ 0 for (3) in Table 2, we can determine a_{ij_{2}} = −1 if a_{ij_{1}} is determined to be 1. However, a_{ij_{2}} still remains semidetermined (either 1 or −1) if a_{ij_{1}} is determined to be −1. As the number of regulatory genes increases, the proportion of semidetermined cases increases significantly. We will not extend the above analysis to situations of more than three input genes. In most reference studies, the limit k_{i} ≤ 3 is generally respected to mitigate model complexity, particularly for small sample sizes.
Given a target gene x_{i} and its predictor genes x_{j} (1 ≤ j ≤ k_{i}), we may determine the value of a_{ij} at each time point t (1 ≤ t ≤ m − 1) by searching Tables 1, 2, or 3 across the whole time series S = {S(1), S(2), …, S(m)}. Let N_{ij}^{−1}, N_{ij}^{1}, and N_{ij}^{−1,1} denote the number of time points at which a_{ij} = −1, a_{ij} = 1, and a_{ij} = −1 or 1 (semidetermined), respectively. The degree of determination d_{ij} of a regulatory relationship a_{ij} is defined from these counts (Equation (3)).
If N_{ij}^{−1} > N_{ij}^{1}, then a_{ij} is likely to be −1; otherwise, it is likely to be 1. The larger the value of d_{ij}, the greater the determination of a_{ij}. In order to reduce the semidetermined cases, we first find the relationship with the largest determination, say, a_{ij}, and determine its value by the majority rule. Then, we apply the value of a_{ij} to those inequalities that include it to solve other semidetermined a_{ip} (p ≠ j, 1 ≤ p, j ≤ k_{i}). By repeating this process, we can reduce the number of semidetermined cases and determine the values of other a_{ip} accordingly.
3.3 Error analysis
Given a predictor set for gene x_{i}, the basic inconsistency is the discrepancy in the determination of a_{ij}, and we define the error resulting from such an inconsistency by ε_{ij}^{−1,1} = min(N_{ij}^{−1}, N_{ij}^{1}). A second kind of inconsistency arises from the null input. Specifically, the target gene x_{i} cannot flip its state under null input situations. Moreover, if it is negatively self-regulated (self-degradation), it cannot be active when its input genes are null. The number of such inconsistencies defines the error ε_{i}^{null}, which is listed in Table 4 for self-degradation and no self-degradation, respectively. The total error of a predictor set is defined by ε = ε_{i}^{null} + ∑_{j} ε_{ij}^{−1,1}. Generally, a consistent predictor set should have the minimal error and the minimal number of regulatory genes simultaneously.
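As a small arithmetic illustration of the error definition, the sketch below applies the majority rule and sums the per-regulator errors with the null-input error. The function names and the vote counts in the example are hypothetical; the null-input error is passed in as a number because its exact rules are given in Table 4.

```python
def sign_and_error(n_neg, n_pos):
    """Majority-rule sign and inconsistency error for one regulator.

    n_neg / n_pos: number of time points voting for a_ij = -1 / a_ij = 1.
    The minority votes are counted as errors.
    """
    sign = -1 if n_neg > n_pos else 1
    return sign, min(n_neg, n_pos)

def total_error(null_error, counts):
    """Total error: null-input error plus the sum of per-regulator errors.

    counts: list of (n_neg, n_pos) pairs, one per regulator in the set.
    """
    return null_error + sum(min(n_neg, n_pos) for n_neg, n_pos in counts)

# Hypothetical counts for a two-gene predictor set.
print(sign_and_error(5, 2))
print(total_error(1, [(5, 2), (0, 4)]))
```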
3.4 A small example
We now apply the above analysis to infer the predictor set for gene x_{3} in Figure 1. Based on Tables 1, 2, 3, and 4, the results for all possible one- and two-input gene sets at each time point are presented in Tables 5, 6, 7, and 8, respectively. Among those six possible predictor sets, the minimal error is achieved by x_{1} and x_{2}, which are exactly the regulatory genes of x_{3}.
3.5 Inference algorithm
Given a time series S = {S(1), S(2), …, S(m)}, the minimal error predictor sets may not be unique. Each of them can be viewed as fitting the target gene in a different way. We employ the heuristic that if one gene occurs frequently in those sets, then it is highly likely to be a true regulatory gene. Combining them may give a more reliable prediction and can also help alleviate the constraint of using at most three input genes for a target gene. Given a target gene x_{i}, we propose the following algorithm to infer its regulatory gene set:

1. Calculate the total error of each candidate predictor set P(x_{i}) consisting of one, two, or three regulatory genes.
2. Sort the predictor sets in ascending order of their errors.
3. If a gene appears in the first l sets with a frequency greater than or equal to 50%, then it is selected as a regulatory gene.
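Steps 2 and 3 of the algorithm can be sketched as follows, assuming the total errors from step 1 have already been computed. The function name, the (error, gene tuple) input format, and the example values are hypothetical; ties in error are left in their input order, which the paper does not specify.

```python
from collections import Counter

def select_regulators(scored_sets, l=10, threshold=0.5):
    """Pick genes that occur in at least `threshold` of the l best sets.

    scored_sets: list of (total_error, tuple_of_gene_names).
    """
    # Step 2: ascending error; Python's sort is stable, so ties keep input order.
    ranked = sorted(scored_sets, key=lambda item: item[0])[:l]
    # Step 3: frequency of each gene among the first l sets.
    freq = Counter(gene for _, genes in ranked for gene in genes)
    return sorted(g for g, c in freq.items() if c / len(ranked) >= threshold)

# Hypothetical errors for a handful of candidate predictor sets.
scored = [(0, ("x1", "x2")), (0, ("x1",)), (1, ("x1", "x3")), (2, ("x4",))]
print(select_regulators(scored, l=3))
print(select_regulators(scored, l=2))
```

Because genes are pooled across several low-error sets, the selected regulator set may contain more than three genes even though each candidate set is limited to three.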
4 Implementation
As mentioned in the introduction, many algorithms have been proposed to infer gene regulatory networks. A recent study shows that, among REVEAL, BIC, MDL, uMDL, and best-fit, the best-fit algorithm appears to give the best results for the recovery of regulatory relationships [[27]]. In this paper, we compare the performance of the three-rule algorithm, the best-fit algorithm, and the proposed algorithm on both synthetic networks and a well-studied budding yeast cell cycle network.
We have implemented the three-rule algorithm and our proposed algorithm based on the PBN Toolbox (http://code.google.com/p/pbnmatlabtoolbox/), which includes an implementation of the best-fit algorithm, the calculation of the steady-state distribution, and other intervention modules for Boolean networks. Genetic regulatory networks are commonly believed to have a sparse connectivity topology. To evaluate the inference algorithms based on simulated time series of network states, we have restricted the random BNs to resemble this property of biological networks. Specifically, we have generated random BNs with a scale-free topology, and each gene has at most five predictors: K = max_{i=1}^{n} k_{i} ≤ 5. We uniformly assign each gene 1 to K regulators that upregulate (1) or downregulate (−1) it. The average connectivity of the random networks is (1 + K)/2.
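The assignment of regulators can be sketched as below. This is a simplified illustration under stated assumptions: regulators are drawn uniformly at random, so the sketch does not reproduce the scale-free topology used in the paper, and the function name and parameters are hypothetical rather than part of the PBN Toolbox.

```python
import random

def random_restricted_network(n, K, seed=None):
    """Random sign matrix A with 1..K regulators per gene.

    A[i][j] in {-1, 0, 1} is the effect of gene j on gene i.
    Regulators are sampled uniformly (NOT scale-free); requires K <= n.
    """
    rng = random.Random(seed)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        k_i = rng.randint(1, K)               # in-degree, uniform on 1..K
        for j in rng.sample(range(n), k_i):   # distinct regulators
            A[i][j] = rng.choice((-1, 1))     # up- or downregulation
    return A

A = random_restricted_network(n=10, K=3, seed=0)
mean_degree = sum(sum(1 for a in row if a != 0) for row in A) / len(A)
print(mean_degree)  # fluctuates around (1 + K) / 2 = 2 across seeds
```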
In order to compare the performance of the three algorithms with the ground-truth network, we use the following three distances [[28],[29]]:

(1) The normalized-edge Hamming distance,
\mu_{\mathrm{ham}}^{\mathrm{e}} = \frac{\mathrm{FN} + \mathrm{FP}}{P + N}, (4)
where FN and FP represent the number of false-negative and false-positive wires, respectively. P and N represent the total number of positive and negative wires, respectively. This Hamming distance reflects the accuracy of the recovered regulatory relationships.

(2) The normalized Hamming distance of state transitions,
\mu_{\mathrm{ham}}^{\mathrm{st}} = \frac{1}{n \cdot 2^{n}} \sum_{i=1}^{n} \sum_{k=1}^{2^{n}} \left[ f_{i}(\mathbf{x}_{k}) \oplus f_{i}'(\mathbf{x}_{k}) \right], (5)
where f_{i}(•) and f_{i}'(•) represent the Boolean function of gene i in the ground-truth network and the inferred network, respectively; x_{k} represents a binary state vector, and ⊕ denotes modulo-2 addition. This Hamming distance indicates the accuracy of the inferred network for predicting the next state of the ground-truth network.

(3) The steady-state distribution distance,
\mu^{\mathrm{ssd}} = \sum_{k=1}^{2^{n}} \left| \pi_{k} - \pi_{k}' \right|, (6)
where π_{k} and π_{k}' are the steady-state probabilities of state x_{k} in the ground-truth network and the inferred network, respectively. The steady-state distribution distance reflects how closely an inferred network approaches the long-run behavior of the ground-truth network.
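The three distances can be sketched as short functions. Two points are assumptions of this sketch rather than statements from the paper: for the edge distance, a wire is treated as present whenever a_{ij} ≠ 0 (signs are ignored), and P + N is taken to be the number of present plus absent wires in the ground-truth network, that is, all n² possible wires. All function names are hypothetical.

```python
from itertools import product

def edge_hamming(A_true, A_inf):
    """(FN + FP) / (P + N), assuming P + N counts all n*n possible wires."""
    fn = fp = total = 0
    for row_t, row_i in zip(A_true, A_inf):
        for a_t, a_i in zip(row_t, row_i):
            total += 1
            if a_t != 0 and a_i == 0:
                fn += 1    # true wire that was missed
            elif a_t == 0 and a_i != 0:
                fp += 1    # inferred wire that does not exist
    return (fn + fp) / total

def state_transition_hamming(step_true, step_inf, n):
    """Fraction of (gene, state) pairs whose next value differs."""
    mismatches = 0
    for state in product((0, 1), repeat=n):
        nxt_t, nxt_i = step_true(state), step_inf(state)
        mismatches += sum(x != y for x, y in zip(nxt_t, nxt_i))
    return mismatches / (n * 2 ** n)

def steady_state_distance(pi_true, pi_inf):
    """Sum of absolute differences between two steady-state distributions."""
    return sum(abs(p - q) for p, q in zip(pi_true, pi_inf))

# Tiny hypothetical example with n = 2.
identity = lambda s: s
flip_first = lambda s: (1 - s[0], s[1])
print(state_transition_hamming(identity, flip_first, n=2))
print(steady_state_distance([0.5, 0.5], [0.25, 0.75]))
```

The state-transition distance enumerates all 2^n states, so it is practical only for small networks such as the n = 10 case used in the simulations.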
5 Results and discussion
5.1 Simulated results
Owing to the computational complexity and the network state space, which grows exponentially with the number of genes, all our simulations are based on networks with n = 10 genes. We generate 300 random Boolean networks each for maximal input degree K = 3 and K = 5. For each simulated network, we generate about 4 time series so that the total number of time points adds up to 40. For each sample, noise is added by flipping the value of each bit with probability 0.05 or 0.10. The steady-state distribution is calculated with a perturbation parameter p = 0.0001. For the proposed algorithm, we selected the first l = 10 minimal error predictor sets. For best-fit, we selected the minimal error predictor sets from k = 1, 2, 3. In Table 9, we list the average number of true-positive and false-positive connections for K = 3 and K = 5 at different noise intensities.
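The bit-flip noise model is simple enough to state in code. This is a minimal sketch with a hypothetical function name; each bit of each state is flipped independently with probability η.

```python
import random

def add_noise(series, eta, seed=None):
    """Flip each bit of each state independently with probability eta."""
    rng = random.Random(seed)
    return [[bit ^ 1 if rng.random() < eta else bit for bit in state]
            for state in series]

clean = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 0, 1, 0]]
print(add_noise(clean, eta=0.05, seed=0))
```

With 40 time points and 10 genes, η = 0.05 corrupts about 20 of the 400 observed bits on average, which gives a sense of how many spurious perturbations the inference algorithms must tolerate.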
Figure 2 shows the performance of the three algorithms on networks with K = 3 under different noise intensities according to three distance metrics: the normalized-edge Hamming distance μ_{ham}^{e}, the normalized Hamming distance of state transition μ_{ham}^{st}, and the steady-state distribution distance μ^{ssd}. The performance of the three-rule algorithm and the proposed algorithm is very close when there is no noise. However, it differs dramatically on noisy data. Specifically, the performance of the proposed algorithm improves as the sample size increases, while that of the three-rule algorithm deteriorates. The main reason is that the proposed algorithm infers the regulatory relation based on the entire time series instead of on a small perturbation between two time points, which makes it more robust against noise than the three-rule algorithm. Given a specific noise intensity η, more samples contain more noise-perturbed bits, so more incorrect connections will be inferred by the three-rule algorithm. Table 9 shows that the number of false positives of the three-rule algorithm increases more quickly than the number of true positives as the sample size increases. This is the main factor that makes its performance deteriorate even though the sample size increases. Consequently, the three-rule algorithm is very sensitive to noise in the data, and increasing the sample size does not improve its performance.
Compared with the best-fit algorithm, the proposed algorithm performs better with respect to μ_{ham}^{e} and μ_{ham}^{st}. In a restricted Boolean network model, the output for states with ∑_{j} a_{ij}x_{j}(t) = 0 is determined by the current state of the target gene x_{i}. This means that given the same input state, x_{i} may be 1 at one time and 0 at another time. The best-fit algorithm does not allow such a situation, and it will treat such a case in the data as an error. If the target gene x_{i} has three regulators and one of them downregulates it, then there will be 3 such states out of the 8 possible input states. The influence of such cases on the performance of the best-fit algorithm cannot be neglected. Additionally, the best-fit algorithm cannot deal with the inconsistency listed in Figure 3. These two factors hurt its performance as compared to the proposed algorithm on μ_{ham}^{e} and μ_{ham}^{st}. Table 9 shows that the number of true positives of both algorithms is almost the same, but the number of false positives of the best-fit algorithm is larger than that of the proposed algorithm.
Concerning the steady-state distribution distance μ^{ssd}, the proposed algorithm does not perform as well as the best-fit algorithm. However, their difference decreases as the noise intensity increases. As pointed out in [[27]], inferred networks with relatively more connections can explain the observed data better with respect to the steady-state distribution distance μ^{ssd}, even though some connections are incorrect. Because the best-fit algorithm infers more connections than the proposed algorithm (see Table 9), it performs better on μ^{ssd} than the latter. On the other hand, the proposed algorithm is more robust than the best-fit algorithm because it combines the minimal error sets to determine the regulatory genes instead of selecting one set. When the noise intensity increases, the performance of the best-fit algorithm drops more quickly than that of the proposed algorithm, which causes their performance on μ^{ssd} to converge.
Figure 4 shows the performance of the three algorithms on networks with K = 5; the trends are analogous to those observed in Figure 2. The only difference is that the performance of all three algorithms decreases because the more complex networks are harder to infer. In summary, the proposed algorithm performs better than the three-rule algorithm on the three distance metrics in noisy situations, whereas it performs less well than the best-fit algorithm on the steady-state distribution distance. This suggests that the proposed algorithm is better suited to inferring the structure of a restricted Boolean network model than the three-rule algorithm and the best-fit algorithm.
5.2 Cell cycle model of budding yeast
The cell cycle is a vital biological process in which one cell grows and divides into two daughter cells. It consists of four phases, G1, S, G2, and M, and is regulated by a highly complex network that is highly conserved among the eukaryotes. From the 800 genes involved in the cell cycle process of budding yeast, Li et al. constructed a network of 11 key regulators: Cln3, MBF, SBF, Cln1, Cdh1, Swi5, Cdc20, Clb5, Sic1, Clb1, and Mcm1 [[18]]. This restricted Boolean network model (shown in Figure 4A) has an attractor whose biggest basin corresponds to the biological G1 stationary state. The temporal sequence in Table 10 is a pathway from this basin, which follows the biological trajectory of the cell cycle network.
We have applied the three algorithms to the above artificial time-series data and show the inferred networks in Figure 4. In the simplified model of the budding yeast cell cycle, there are a total of 34 regulatory relationships (or connections). The three-rule algorithm inferred 10 relationships, all correct (see Figure 4B). The best-fit algorithm inferred 15 correct and 5 incorrect relationships (see Figure 4C). The proposed algorithm inferred 15 correct and 4 incorrect relationships (see Figure 4D). Both the best-fit and the proposed algorithms inferred more true regulatory relationships than the three-rule algorithm, at the cost of some incorrect connections. For studying regulatory relationships, this may be more advantageous because more potential regulatory relationships are made available for biologists to check in the wet lab.
We also ran 100 simulations with 5% and 10% noise for this pathway. Even for the same pathway data, the results for the individual noisy realizations differ dramatically. This is not surprising because noise significantly influences the determination of regulatory relations for all algorithms. The performance of the three algorithms on μ_{ham}^{e}, μ_{ham}^{st}, and μ^{ssd} is listed in Table 11. The relative performance of the three algorithms for this pathway data is also consistent with the previous simulation results.
5.3 Computational issues
When inferring real networks of moderate size, the time complexity of the algorithms is a key issue. Almost all algorithms proposed to date possess exponential complexity. The time complexity of the proposed algorithm and the best-fit algorithm is O(n ⋅ C_{n}^{k} ⋅ m). The most time-consuming process for the three-rule algorithm is solving the constraint inequalities, and its time complexity is O(n ⋅ c^{n} ⋅ m^{2}) (1 < c < 2). From this point of view, the three-rule algorithm is more time consuming than the other two.
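The two growth rates can be compared numerically. The sketch below evaluates both expressions as rough operation-count proxies, ignoring constant factors; the values k = 3, m = 20, and c = 1.5 are illustrative assumptions rather than settings taken from the experiments, and the function names are hypothetical.

```python
from math import comb


def combinatorial_cost(n, k, m):
    """Operation-count proxy n * C(n, k) * m (constant factors ignored)."""
    return n * comb(n, k) * m


def three_rule_cost(n, m, c=1.5):
    """Operation-count proxy n * c**n * m**2 with 1 < c < 2 (c = 1.5 is assumed)."""
    return n * c ** n * m ** 2


K, M = 3, 20  # illustrative values only
for n in (11, 12, 13):
    a = combinatorial_cost(n, K, M)
    b = three_rule_cost(n, M)
    print(f"n={n}: n*C(n,k)*m = {a}, n*c^n*m^2 = {b:.0f}, ratio = {b / a:.1f}")
```

Under these assumed values the second expression is already roughly an order of magnitude larger at n = 11. The comparison depends on k and c, so it should be read as an illustration of the bounds and not as a timing prediction.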
The proposed algorithm is similar in workflow to the best-fit algorithm; however, additional computation time results from three factors: (1) determining the possible regulatory relationships, (2) determining, during error estimation, whether an output state is correct for a given model according to Equation (2), and (3) combining the first ten least-error models in the last step.
In practice, however, algorithm complexity is not the limiting factor. As shown in Table 12, for 11, 12, and 13 genes, and for N = 20 and N = 40, the computation time of the proposed algorithm lies between those of the best-fit and the three-rule algorithms. The overriding computational issue is the computation of the steady-state distribution, which is often required in applications. It is for this reason that interest has focused on reducing network complexity [[29]–[31]].
6 Conclusion
The model space of Boolean networks is huge, and from the point of view of evolution, it is unimaginable for nature to select its operational mechanisms from such a large space. Restricted Boolean networks, as a simplified model, have recently been used extensively to study the dynamical behavior of the yeast cell cycle process. In this paper, we propose a systematic method to infer a restricted Boolean network from time-series data. We compare the performance of the three-rule, best-fit, and proposed algorithms both on simulated networks and on an artificial model of budding yeast. Results show that our algorithm performs better than the three-rule and best-fit algorithms according to the distance metrics {\mathit{\mu}}_{\mathrm{ham}}^{\mathrm{e}} and {\mathit{\mu}}_{\mathrm{ham}}^{\mathrm{st}}, but slightly less well than the best-fit algorithm according to μ^{ssd}. This result indicates that the proposed algorithm may be more appropriate for recovering regulatory relationships between genes under the restricted Boolean network model.
The main advantage of the proposed algorithm is that it is more robust to noise than both the three-rule algorithm and the best-fit algorithm. The proposed algorithm infers the regulatory relationships from the consecutive state transitions of the target gene, rather than from the small perturbations between two similar states used by the three-rule algorithm. Simulation results show that noise in the data may cause the three-rule algorithm to generate many incorrect constraints, which hinders its application to noisy samples. Moreover, the proposed algorithm can capture the intrinsic state transition defined in Equation (2), whereas the best-fit algorithm cannot. Hence, although the inference processes of both algorithms try to find the minimal-error predictor set, the proposed algorithm can distinguish error in the data more accurately than the best-fit algorithm. Additionally, combining the minimal-error predictor sets in the proposed algorithm also improves its robustness.
In the Boolean formalism, a single time series (or trajectory) can be treated as a random walk across state space. No method can recover a complex biological system from just one short trajectory. Using heterogeneous data and some a priori knowledge is typically a necessity. A priori knowledge can be incorporated into the proposed algorithm and helps by reducing the search space. For instance, an algorithm might assume a prescribed attractor structure [[32]]. In our case, if we know that x regulates y, then we only consider those combinations containing x, thereby reducing the search space. Additionally, different methods may focus on different aspects of the inference process. For example, the best-fit algorithm and CoD are mainly concerned with the fitness of the data, whereas MDL-based methods aim to reduce structural risk. Future work will involve combining MDL with the proposed algorithm to reduce the rate of false positives.
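The effect of such prior knowledge on the search space can be sketched as follows. The example enumerates candidate predictor sets of size k for one target gene over the 11 regulators of the yeast model and then restricts the enumeration to sets containing a regulator assumed to be known. The helper `candidate_predictor_sets`, the choice k = 3, and the use of Cln3 as the known regulator are illustrative assumptions, not details of the proposed algorithm.

```python
from itertools import combinations

GENES = ["Cln3", "MBF", "SBF", "Cln1", "Cdh1", "Swi5",
         "Cdc20", "Clb5", "Sic1", "Clb1", "Mcm1"]


def candidate_predictor_sets(genes, k, required=()):
    """Yield k-gene predictor sets that contain every gene in `required`.

    Assumes len(required) <= k.
    """
    required = tuple(required)
    free = [g for g in genes if g not in required]
    for rest in combinations(free, k - len(required)):
        yield required + rest


K = 3  # illustrative predictor-set size
unrestricted = sum(1 for _ in candidate_predictor_sets(GENES, K))
restricted = sum(1 for _ in candidate_predictor_sets(GENES, K, required=("Cln3",)))
print(f"all {K}-gene sets: {unrestricted}, sets containing Cln3: {restricted}")
```

For 11 genes and k = 3, fixing one known regulator reduces the number of candidate sets from C(11, 3) = 165 to C(10, 2) = 45.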
References
Ilya S, Dougherty ER: Genomic Signal Processing (Princeton Series in Applied Mathematics). Princeton University Press, Princeton; 2007.
Ilya S, Dougherty ER: Probabilistic Boolean Networks: The Modeling and Control of Gene Regulatory Networks. Siam, Philadelphia; 2010.
Shoudan L, Stefanie F, Roland S: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures, in Pacific Symposium on Biocomputing. World Scientific, Hawaii; 1998.
Margolin AA, Ilya N, Katia B, Chris W, Gustavo S, Riccardo DF, Andrea C: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinforma 2006, 7: S7.
Zhao W, Serpedin E, Dougherty ER: Recovering genetic regulatory networks from chromatin immunoprecipitation and steady-state microarray data. EURASIP J. Bioinforma. Syst. Biol. 2008.
Vijender C, Preetam G, Edward P, Gong GP, Deng Y, Zhang C: A novel gene network inference algorithm using predictive minimum description length approach. BMC Syst. Biol. 2010, 4: S7.
Vijender C, Chaoyang Z, Preetam G, Perkins EJ, Gong P, Deng Y: Gene regulatory network inference using predictive minimum description length principle and conditional mutual information. In International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS). Edited by: Zhang J, Li G, Yang JY. IEEE Computer Society, Piscataway; 2009:487–490.
Dougherty J, Tabus I, Astola J: A universal minimum description length-based algorithm for inferring the structure of genetic networks. In IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS). Edited by: Huang Y. IEEE, Piscataway; 2007:1–2.
Tabus I, Astola J: On the use of MDL principle in gene expression prediction. EURASIP J Appl Signal Process 2001, 2001: 297–303.
Zhao W, Erchin S, Dougherty ER: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 2006, 22: 2129–2135.
Dougherty RE, Seungchan K, Yidong C: Coefficient of determination in nonlinear signal processing. Signal Process. 2000, 80: 2219–2235.
Kim S, Dougherty ER, Bittner ML, Chen Y, Sivakumar K, Meltzer P, Trent JM: General nonlinear framework for the analysis of gene interaction via multivariate expression arrays. J. Biomed. Opt. 2000, 5: 411–424.
Shmulevich I, Dougherty ER, Seungchan K, Zhang W: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 2002, 18: 261–274.
Lähdesmäki H, Shmulevich I, Yli-Harja O: Learning gene regulatory networks under the Boolean network model. Mach. Learn. 2003, 52: 147–167.
Shmulevich I, Kauffman SA, Maximino A: Eukaryotic cells are dynamically ordered or critical but not chaotic. Proc. Natl. Acad. Sci. U. S. A. 2005, 102: 13439–13444.
Nykter M, Price ND, Maximino A, et al.: Gene expression dynamics in the macrophage exhibit criticality. Proc. Natl. Acad. Sci. 2008, 105: 1897–1900.
Liu W, Lähdesmäki H, Dougherty ER, Shmulevich I: Inference of Boolean networks using sensitivity regularization. EURASIP J. Bioinforma. Syst. Biol. 2008. doi:10.1155/2008/780541
Li F, Long T, Ying L, Ouyang Q, Tang C: The yeast cell-cycle network is robustly designed. Proc. Natl. Acad. Sci. U. S. A. 2004, 101: 4781–4786.
Zhang Y, Qian M, Ouyang Q, Deng M, Li F, Tang C: Stochastic model of yeast cell-cycle network. Physica D: Nonlinear Phenomena 2006, 219: 35–39.
Kai-Yeung L, Surya G, Chao T: Function constrains network architecture and dynamics: a case study on the yeast cell cycle Boolean network. Phys. Rev. E. 2007, 75: 051907.
Bornholdt S: Boolean network models of cellular regulation: prospects and limitations. J. R. Soc. Interface 2008, 5: S85–S94.
Davidich MI, Stefan B: Boolean network model predicts cell cycle sequence of fission yeast. PLoS One 2008, 3: e1672.
Ronaldo Fumio H, Henrique S, Carlos HA H: Budding yeast cell cycle modeled by context-sensitive probabilistic Boolean network. In IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS). Edited by: Braga-Neto U. IEEE, Piscataway; 2009:1–4.
Todd RG, Tomáš H: Ergodic sets as cell phenotype of budding yeast cell cycle. PLoS One 2012, 7: e45780.
Higa CHA, Louzada VHP, Andrade TP, Hashimoto RF: Constraint-based analysis of gene interactions using restricted Boolean networks and time-series data. BMC Proc. 2011, 5(Suppl 2): S5.
Niklas E, Niklas S: An extensible SAT-solver. In Theory and Applications of Satisfiability Testing. Edited by: Giunchiglia E, Tacchella A. Springer, New York; 2004:502–518.
Dougherty ER: Validation of gene regulatory networks: scientific and inferential. Brief. Bioinform. 2011, 12: 245–252.
Xiaoning Q, Dougherty ER: Validation of gene regulatory network inference based on controllability. Front. Genet. 2013, 4: 272.
Ghaffari N, Ivanov I, Qian X, Dougherty ER: A CoD-based reduction algorithm for designing stationary control policies on Boolean networks. Bioinformatics 2010, 26: 1556–1563.
Ivanov I, Simeonov P, Ghaffari N, Xiaoning Q, Dougherty ER: Selection policy-induced reduction mappings for Boolean networks. Signal Process. IEEE Trans. 2010, 58: 4871–4882.
Qian X, Ghaffari N, Ivanov I, Dougherty ER: State reduction for network intervention in probabilistic Boolean networks. Bioinformatics 2010, 26: 3098–3104.
Pal R, Ivanov I, Datta A, Bittner ML, Dougherty ER: Generating Boolean networks with a prescribed attractor structure. Bioinformatics 2005, 21: 4021–4025.
Acknowledgements
This work was funded in part by the National Science Foundation of China (Grant No. 61272018, No. 60970065, and No. 61174162) and the Zhejiang Provincial Natural Science Foundation of China (Grant No. R1110261 and No. LY13F010007), and was supported by the China Scholarship Council.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ouyang, H., Fang, J., Shen, L. et al. Learning restricted Boolean network model by timeseries data. J Bioinform Sys Biology 2014, 10 (2014). https://doi.org/10.1186/s1363701400105