Learning restricted Boolean network model by time-series data
- Hongjia Ouyang^{1},
- Jie Fang^{1},
- Liangzhong Shen^{1},
- Edward R Dougherty^{2, 3} and
- Wenbin Liu^{1, 2}
https://doi.org/10.1186/s13637-014-0010-5
© Hongjia et al.; licensee Springer. 2014
Received: 15 December 2013
Accepted: 12 May 2014
Published: 15 July 2014
Abstract
Restricted Boolean networks are simplified Boolean networks in which every regulation between genes is required to be either negative or positive. Higa et al. (BMC Proc 5:S5, 2011) proposed a three-rule algorithm to infer a restricted Boolean network from time-series data. However, the algorithm suffers from a major drawback: it is very sensitive to noise. In this paper, we systematically analyze the regulatory relationships between genes based on the state switch of the target gene and propose an algorithm with which restricted Boolean networks may be inferred from time-series data. We compare the proposed algorithm with the three-rule algorithm and the best-fit algorithm on both synthetic networks and a well-studied budding yeast cell cycle network. The performance of the algorithms is evaluated by three distance metrics: the normalized-edge Hamming distance $\mu_{\mathrm{ham}}^{\mathrm{e}}$, the normalized Hamming distance of state transition $\mu_{\mathrm{ham}}^{\mathrm{st}}$, and the steady-state distribution distance μ^{ssd}. Results show that the proposed algorithm outperforms the others according to both $\mu_{\mathrm{ham}}^{\mathrm{e}}$ and $\mu_{\mathrm{ham}}^{\mathrm{st}}$, whereas its performance according to μ^{ssd} is intermediate between the best-fit and the three-rule algorithms. Thus, our new algorithm is more appropriate for inferring interactions between genes from time-series data.
1 Introduction
A key goal in systems biology is to characterize the molecular mechanisms governing specific cellular behaviors and processes. This entails selecting a model class for representing the system structure and state dynamics, followed by the application of computational or statistical inference procedures to reveal the model structure from measurement data. The models of gene regulatory networks run the gamut from coarse-grained discrete networks to the detailed description of stochastic differential equations [[1]]. They provide a uniform way to study biological phenomena (e.g., cell cycle) and diseases (e.g., cancer) and ultimately lead to systems-based therapeutic strategies [[2]].
Boolean networks, and the more general class of probabilistic Boolean networks, are among the most popular approaches for modeling gene networks. The inference of gene networks from high-throughput genomic data is an ill-posed problem: more than one model can explain the data. The search space for potential regulator sets and their corresponding Boolean functions generally increases exponentially with the number of genes in the network and the number of regulatory genes. Inference is particularly challenging in the face of small sample sizes, because the number of genes typically is much greater than the number of observations. Thus, estimates of modeling errors, which themselves are determined from the measurement data, can be highly variable and untrustworthy. Many inference algorithms have been proposed to elucidate the regulatory relationships between genes. Mutual information (MI) is an information-theoretic approach that can capture the nonlinear dependence between random variables. REVEAL is the first information-based algorithm to infer the regulatory relationships between genes [[3]]. However, a small MI does not necessarily mean that no regulatory relationship exists between genes (false negative). Conversely, a large MI does not necessarily mean a real regulatory relationship. ‘False-positive’ relationships often result from indirect interactions between two genes. The data processing inequality (DPI) and conditional mutual information (CMI) are two methods used to reduce the problem of false positives [[4],[5]]. Another information-based method is the minimum description length principle (MDL), which achieves a good trade-off between model complexity and fit to the data [[6]–[10]]. The coefficient of determination (CoD) selects a set of predictors whose expression levels can be used to better predict the expression of a target gene relative to the best possible prediction in the absence of observations [[11],[12]].
The best-fit extension incorporates inconsistencies generated from measurements or other unknown latent factors by constructing a network that makes as few misclassifications as possible [[13],[14]]. Any prior knowledge about the network structure or dynamics likely improves inference accuracy, especially for small sample sizes. Theoretical considerations and computational studies suggest that gene regulatory networks might operate close to a critical phase transition between ordered and disordered dynamical regimes [[15],[16]]. Liu et al. proposed a method to embed such a criticality assumption into the inference procedure. Such regularization of the sensitivity can both improve the inference and move the inferred networks closer to criticality [[17]].
A restricted Boolean network is a simplified Boolean model that has been used to study dynamical behavior of the yeast cell cycle [[18]–[24]]. In this model, the regulatory relationship between genes is either upregulation or downregulation. The output of the target gene is mainly dominated by the summation of its input genes. When the input summation is zero, the output state will remain as the current state of the target gene. The inference algorithms mentioned above generally cannot deal with this situation, and thus may not be appropriate to infer such network models. Recently, Higa et al. proposed a ‘three-rule algorithm’ to construct a restricted Boolean network from time-series data [[25]]. Their idea is that the consecutive state transitions of the system must be driven by some constraints, which can be induced from the small perturbations between two similar system states (detailed rules are provided in Section 3.1). However, the perturbations in microarray data may sometimes be caused by stochastic biological randomness or by the measurement process instead of real changes in gene expression level. This inevitably leads the three-rule algorithm to some incorrect constraints. In this paper, we propose a systematic method to infer a restricted Boolean network based on the state transitions of the target gene. Results of simulated networks and a modeled yeast cell cycle show that the proposed algorithm is more robust to noise than the three-rule method.
This paper is organized as follows: Background information and definitions are given in Section 2. Section 3 presents a brief introduction to the three rules, after which we systematically analyze the regulatory relationships between input genes and their target gene and propose an inference algorithm. Section 4 and Section 5 present results for the simulated networks and for the cell cycle model of budding yeast. Concluding remarks are given in Section 6.
2 Background
2.1 Boolean networks
The long-run behavior of a deterministic Boolean network (BN) depends on the initial state, and the network will eventually settle down and cycle endlessly through a set of states called an attractor cycle. The set of all initial states that reach a particular attractor cycle forms the basin of attraction (BOA) for the cycle. Following a perturbation, the network in the long run may randomly escape an attractor cycle, be reinitialized, and then begin its transition process anew. For a BN with perturbation probability p, its corresponding Markov chain possesses a steady-state distribution. It has been hypothesized that attractors or steady-state distributions in Boolean formalisms correspond to different cell types of an organism or to cell fates. In other words, the phenotypic traits are encoded in the attractors [[1]]. There are two ways to define the perturbation probability p. One is that each gene can flip its state according to an i.i.d. random perturbation vector γ = (γ_{1}, ⋯, γ_{n}), where γ_{i} ∈ {0, 1}, the i th gene flips if and only if γ_{i} = 1, and p = P(γ_{i} = 1) for i = 1, 2, ⋯, n. The other is that each state x(t) can transit to any other state with the same probability p. In this situation, at each time step, state x(t) will transit to the next state according to F with probability 1 + p − 2^{n} ∗ p and to each of the other states with probability p. In this paper, we adopt the latter definition of the perturbation probability p.
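As a rough illustration of the second definition, the following Python sketch builds the transition matrix of the perturbed chain and approximates its steady-state distribution. The function names, the encoding of states as integers in [0, 2^n), and the use of power iteration are assumptions made for this example only.

```python
def transition_matrix(next_state, n, p):
    """Row-stochastic matrix of a BN with uniform state perturbation.

    next_state maps a state index in [0, 2**n) to its successor under F.
    Every state other than F(x) is reached with probability p, and F(x)
    receives the remaining mass 1 + p - 2**n * p.
    """
    size = 2 ** n
    follow = 1 + p - size * p
    if follow < 0:
        raise ValueError("p is too large for this n")
    matrix = [[p] * size for _ in range(size)]
    for x in range(size):
        matrix[x][next_state(x)] = follow
    return matrix


def steady_state(matrix, iterations=200):
    """Approximate the stationary distribution by repeated multiplication."""
    size = len(matrix)
    pi = [1.0 / size] * size
    for _ in range(iterations):
        pi = [sum(pi[x] * matrix[x][y] for x in range(size))
              for y in range(size)]
    return pi


# Toy network with n = 2 in which F sends every state to state 0.
chain = transition_matrix(lambda x: 0, n=2, p=0.01)
distribution = steady_state(chain)
```

For p > 0 every entry of the matrix is positive, so the chain is ergodic and the iteration converges to a unique distribution.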
2.2 Restricted Boolean networks
This model is ‘restricted’ in the sense that functions satisfying formula (2) constitute a subset of the class of all Boolean functions. The number of restricted functions decreases dramatically, relative to the number of all Boolean functions, as the input degree k_{i} increases. For example, there are 12 ($<2^{2^{2}}=16$) restricted functions for k_{i} = 2, and only 60 functions ($\ll 2^{2^{3}}=256$) for k_{i} = 3. The restricted model significantly reduces the model space, which is beneficial for inference, given a limited amount of noisy high-throughput data.
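The update rule of a restricted network can be sketched in a few lines of Python. The sketch assumes the rule described in the Introduction: a gene becomes 1 when the weighted sum of its active inputs is positive, becomes 0 when the sum is negative, and keeps its current state when the sum is zero. The function name and the convention that a weight of 0 means "no regulation" are choices made for this example, and self-degradation is not modeled.

```python
def restricted_update(state, weights):
    """Return the next state of a restricted Boolean network.

    state   -- tuple of 0/1 values, one per gene
    weights -- weights[i][j] is a_ij: 1 (upregulation), -1 (downregulation),
               or 0 (assumed here to mean that gene j does not regulate gene i)
    """
    n = len(state)
    nxt = []
    for i in range(n):
        total = sum(weights[i][j] * state[j] for j in range(n))
        if total > 0:
            nxt.append(1)
        elif total < 0:
            nxt.append(0)
        else:
            nxt.append(state[i])  # zero input sum: keep the current state
    return tuple(nxt)


# Hypothetical 3-gene loop: gene 0 activates gene 1, gene 1 activates gene 2,
# and gene 2 represses gene 0.
a = [[0, 0, -1],
     [1, 0, 0],
     [0, 1, 0]]
first = restricted_update((1, 0, 0), a)
second = restricted_update(first, a)
```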
3 Methods
3.1 Three-rule method
A time-series observation can be treated as a trajectory (or random walk) of the state space of the network used to model a real biological system. The three-rule method proposed by Higa et al. is to induce the constraints between genes from the small difference between two similar states and the difference between their next states [[25]]. Given an m-point time series S = {S(1), S(2), …, S(m)} of gene expression profiles, where S(t) ∈ {0, 1}^{ n } for t = 1, 2, …, m, the three rules are as follows:
Rule 1: Let S(t − 1), S(t), and S(t + 1) be three consecutive states. If S(t − 1) and S(t) differ by a single gene x_{ k }, then for each gene x_{ i } such that x_{ i }(t) ≠ x_{ i }(t + 1), we have x_{ k } directly regulates x_{ i }; that is, a_{ ik } ≠ 0.
Rule 2: Only the active genes at time t can possibly regulate genes at time t + 1.
Rule 3: Given two similar states S(t_{1}) and S(t_{2}), the difference between S(t_{1} + 1) and S(t_{2} + 1) must result from the genes in their predecessors S(t_{1}) and S(t_{2}) that are expressed differently.
Both rules 1 and 3 can also be extended to situations where S(t − 1) and S(t) or S(t_{1}) and S(t_{2}) differ in more than one gene. Cyclically applying these rules to any two states may lead to a group of constraint inequalities between variables a_{ ij }. Many available constraint satisfaction problem solvers (CSPs) [[26]] can be used to solve the possible regulatory relationships of one gene to the target gene.
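Rule 1 in its basic single-gene form can be sketched as follows. This is only an illustration: the function name and the 0-based gene indices are assumptions, and the extension to states that differ in more than one gene, as well as Rules 2 and 3, are not covered.

```python
def rule1_constraints(series):
    """Collect pairs (i, k) for which Rule 1 implies a_ik != 0.

    series is a list of equal-length 0/1 tuples S(1), ..., S(m).
    """
    found = set()
    for t in range(1, len(series) - 1):
        prev, cur, nxt = series[t - 1], series[t], series[t + 1]
        changed = [k for k in range(len(cur)) if prev[k] != cur[k]]
        if len(changed) != 1:
            continue  # basic form: S(t-1) and S(t) must differ in one gene
        k = changed[0]
        for i in range(len(cur)):
            if cur[i] != nxt[i]:
                found.add((i, k))
    return found


trajectory = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
edges = rule1_constraints(trajectory)
```

A single flipped bit caused by noise changes which pairs are reported, which is the sensitivity discussed in the Introduction.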
3.2 Analysis of regulatory relationships based on constraints
In this section, we study the regulatory relationships based on the constraint inequalities in formula (2) and how the target gene switches from one state to another. The target gene can switch in one of four ways: 0 → 0, 0 → 1, 1 → 0, or 1 → 1. Given an input state, inactive genes have no effect on the target gene, which may help reduce the constraint inequalities of the summation ∑_{j}a_{ij}x_{j}(t) (1 ≤ j ≤ k_{i}). Because the null input provides no constraints between a_{ij}, we only need to investigate the non-null input situations.
Regulatory relationships for one input gene
Number | $x_{j_1}(t)$ | x_{i}(t) → x_{i}(t + 1) | $a_{ij_1}$ |
---|---|---|---|
1 | 1 | 0 → 0 | −1 |
2 | 1 | 0 → 1 | 1 |
3 | 1 | 1 → 0 | −1 |
4 | 1 | 1 → 1 | 1 |
Regulatory relationships for two input genes
Number | $x_{j_1}(t)$ | $x_{j_2}(t)$ | x_{i}(t) → x_{i}(t + 1) | $a_{ij_1}$ | $a_{ij_2}$ | Constraint |
---|---|---|---|---|---|---|
1 | 0 | 1 | 0 → 0 | No | −1 | |
2 | 1 | 0 | 0 → 0 | −1 | No | |
3 | 1 | 1 | 0 → 0 | −1 or 1 | −1 or 1 | $a_{ij_1}+a_{ij_2}\le 0$ |
4 | 0 | 1 | 0 → 1 | No | 1 | |
5 | 1 | 0 | 0 → 1 | 1 | No | |
6 | 1 | 1 | 0 → 1 | 1 | 1 | |
7 | 0 | 1 | 1 → 0 | No | −1 | |
8 | 1 | 0 | 1 → 0 | −1 | No | |
9 | 1 | 1 | 1 → 0 | −1 | −1 | |
10 | 0 | 1 | 1 → 1 | No | 1 | |
11 | 1 | 0 | 1 → 1 | 1 | No | |
12 | 1 | 1 | 1 → 1 | −1 or 1 | −1 or 1 | $a_{ij_1}+a_{ij_2}\ge 0$ |
Regulatory relationships for three input genes
Number | $x_{j_1}(t)$ | $x_{j_2}(t)$ | $x_{j_3}(t)$ | x_{i}(t) → x_{i}(t + 1) | $a_{ij_1}$ | $a_{ij_2}$ | $a_{ij_3}$ | Constraint |
---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 1 | 0 → 0 | No | No | −1 | |
2 | 0 | 1 | 0 | 0 → 0 | No | −1 | No | |
3 | 1 | 0 | 0 | 0 → 0 | −1 | No | No | |
4 | 0 | 1 | 1 | 0 → 0 | No | −1 or 1 | −1 or 1 | $a_{ij_2}+a_{ij_3}\le 0$ |
5 | 1 | 0 | 1 | 0 → 0 | −1 or 1 | No | −1 or 1 | $a_{ij_1}+a_{ij_3}\le 0$ |
6 | 1 | 1 | 0 | 0 → 0 | −1 or 1 | −1 or 1 | No | $a_{ij_1}+a_{ij_2}\le 0$ |
7 | 1 | 1 | 1 | 0 → 0 | −1 or 1 | −1 or 1 | −1 or 1 | $a_{ij_1}+a_{ij_2}+a_{ij_3}<0$ |
8 | 0 | 0 | 1 | 0 → 1 | No | No | 1 | |
9 | 0 | 1 | 0 | 0 → 1 | No | 1 | No | |
10 | 1 | 0 | 0 | 0 → 1 | 1 | No | No | |
11 | 0 | 1 | 1 | 0 → 1 | No | 1 | 1 | |
12 | 1 | 0 | 1 | 0 → 1 | 1 | No | 1 | |
13 | 1 | 1 | 0 | 0 → 1 | 1 | 1 | No | |
14 | 1 | 1 | 1 | 0 → 1 | −1 or 1 | −1 or 1 | −1 or 1 | $a_{ij_1}+a_{ij_2}+a_{ij_3}>0$ |
15 | 0 | 0 | 1 | 1 → 0 | No | No | −1 | |
16 | 0 | 1 | 0 | 1 → 0 | No | −1 | No | |
17 | 1 | 0 | 0 | 1 → 0 | −1 | No | No | |
18 | 0 | 1 | 1 | 1 → 0 | No | −1 | −1 | |
19 | 1 | 0 | 1 | 1 → 0 | −1 | No | −1 | |
20 | 1 | 1 | 0 | 1 → 0 | −1 | −1 | No | |
21 | 1 | 1 | 1 | 1 → 0 | −1 or 1 | −1 or 1 | −1 or 1 | $a_{ij_1}+a_{ij_2}+a_{ij_3}<0$ |
22 | 0 | 0 | 1 | 1 → 1 | No | No | 1 | |
23 | 0 | 1 | 0 | 1 → 1 | No | 1 | No | |
24 | 1 | 0 | 0 | 1 → 1 | 1 | No | No | |
25 | 0 | 1 | 1 | 1 → 1 | No | −1 or 1 | −1 or 1 | $a_{ij_2}+a_{ij_3}\ge 0$ |
26 | 1 | 0 | 1 | 1 → 1 | −1 or 1 | No | −1 or 1 | $a_{ij_1}+a_{ij_3}\ge 0$ |
27 | 1 | 1 | 0 | 1 → 1 | −1 or 1 | −1 or 1 | No | $a_{ij_1}+a_{ij_2}\ge 0$ |
28 | 1 | 1 | 1 | 1 → 1 | −1 or 1 | −1 or 1 | −1 or 1 | $a_{ij_1}+a_{ij_2}+a_{ij_3}>0$ |
If $N_{ij}^{-1}>N_{ij}^{1}$, then a_{ij} is likely to be −1; otherwise, it is likely to be 1. The larger the value of d_{ij}, the greater the determination of a_{ij}. In order to reduce the semi-determined cases, we first find the one with the largest determination, say a_{ij}, and determine its value by the majority rule. Then, we apply the value of a_{ij} to those inequalities that include it in order to solve for the other semi-determined a_{ip} (p ≠ j, 1 ≤ p, j ≤ k_{i}). By repeating this process, we can reduce the number of semi-determined cases and determine the values of the other a_{ip} accordingly.
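The case analysis in the tables of this section can be reproduced by brute force. The sketch below is an illustration rather than the authors' implementation: for one observed transition it enumerates all sign assignments in {−1, 1} for the active inputs and keeps those consistent with the transition, assuming the rule that a switch needs a strictly positive (0 → 1) or strictly negative (1 → 0) input sum, while 0 → 0 and 1 → 1 need a sum ≤ 0 and ≥ 0, respectively. The function name and return format are hypothetical.

```python
from itertools import product


def feasible_signs(inputs, before, after):
    """Map each active input index to the set of signs it may take.

    inputs -- tuple of 0/1 values of the candidate regulators at time t
    before -- target gene value at time t
    after  -- target gene value at time t + 1
    A singleton set means the sign is determined; {-1, 1} means it is
    only semi-determined. A null input yields an empty dict.
    """
    active = [j for j, x in enumerate(inputs) if x == 1]
    if not active:
        return {}

    def consistent(total):
        if before == 0 and after == 1:
            return total > 0
        if before == 1 and after == 0:
            return total < 0
        if before == 0:
            return total <= 0   # 0 -> 0
        return total >= 0       # 1 -> 1

    options = {j: set() for j in active}
    for signs in product((-1, 1), repeat=len(active)):
        if consistent(sum(signs)):
            for j, s in zip(active, signs):
                options[j].add(s)
    return options
```

Tallying, over all time steps, how often each input is determined as −1 or as 1 gives counts that could feed a majority rule of the kind described above.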
3.3 Error analysis
Errors in the null-input situations
Number | $x_{j_1}(t)=\cdots=x_{j_{k_i}}(t)$ | x_{i}(t) → x_{i}(t + 1) | $\epsilon_{i}^{\mathrm{null}}$ (self-degradation regulated) | $\epsilon_{i}^{\mathrm{null}}$ (no self-degradation) |
---|---|---|---|---|
1 | 0 | 0 → 0 | 0 | 0 |
2 | 0 | 0 → 1 | 1 | 1 |
3 | 0 | 1 → 0 | 0 | 1 |
4 | 0 | 1 → 1 | 1 | 0 |
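The null-input error table above translates directly into a small function. The name and the boolean flag are assumptions made for this sketch.

```python
def null_input_error(before, after, self_degradation):
    """Error (0 or 1) of a transition observed under a null input.

    With a null input, an inactive gene is expected to stay inactive.
    An active gene is expected to switch off if it is subject to
    self-degradation and to stay active otherwise.
    """
    if before == 0:
        expected = 0
    else:
        expected = 0 if self_degradation else 1
    return int(after != expected)
```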
3.4 A small example
Regulatory relationships a _{ 3 j } for one input x _{ 1 } (or x _{ 2 } or x _{ 4 } ) at each time step
t | x_{1}(t) | x_{2}(t) | x_{4}(t) | x_{3}(t) → x_{3}(t + 1) | a_{31} | $\epsilon_{3}^{\mathrm{null}}$ | a_{32} | $\epsilon_{3}^{\mathrm{null}}$ | a_{34} | $\epsilon_{3}^{\mathrm{null}}$ |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 0 | 0 → 0 | −1 | 0 | −1 | 0 | | 0 |
2 | 1 | 0 | 0 | 0 → 1 | 1 | 0 | | 1 | | 1 |
3 | 1 | 0 | 0 | 1 → 1 | 1 | 0 | | 0 | | 1 |
4 | 1 | 0 | 1 | 1 → 1 | 1 | 0 | | 0 | 1 | 0 |
Regulatory relationships a _{ 3 j } for two inputs x _{ 1 } and x _{ 2 } at each time step
t | x_{1}(t) | x_{2}(t) | x_{3}(t) → x_{3}(t + 1) | a_{31} | a_{32} | Constraint | $\epsilon_{3}^{\mathrm{null}}$ |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 0 → 0 | −1,1 | −1,1 | a_{31} + a_{32} ≤ 0 | 0 |
2 | 1 | 0 | 0 → 1 | 1 | No | | 0 |
3 | 1 | 0 | 1 → 1 | 1 | No | | 0 |
4 | 1 | 0 | 1 → 1 | 1 | No | | 0 |
Regulatory relationships a _{ 3 j } for two inputs x _{ 1 } and x _{ 4 } at each time step
t | x_{1}(t) | x_{4}(t) | x_{3}(t) → x_{3}(t + 1) | a_{31} | a_{34} | Constraint | $\epsilon_{3}^{\mathrm{null}}$ |
---|---|---|---|---|---|---|---|
1 | 1 | 0 | 0 → 0 | −1 | No | | 0 |
2 | 1 | 0 | 0 → 1 | 1 | No | | 0 |
3 | 1 | 0 | 1 → 1 | 1 | No | | 0 |
4 | 1 | 1 | 1 → 1 | −1,1 | −1,1 | a_{31} + a_{34} ≥ 0 | 0 |
Regulatory relationships a _{ 3 j } for two inputs x _{ 2 } and x _{ 4 } at each time step
t | x_{2}(t) | x_{4}(t) | x_{3}(t) → x_{3}(t + 1) | a_{32} | a_{34} | Constraint | $\epsilon_{3}^{\mathrm{null}}$ |
---|---|---|---|---|---|---|---|
1 | 1 | 0 | 0 → 0 | −1 | No | | 0 |
2 | 0 | 0 | 0 → 1 | | | | 1 |
3 | 0 | 0 | 1 → 1 | | | | 0 |
4 | 0 | 1 | 1 → 1 | No | 1 | | 0 |
3.5 Inference algorithm
- 1. Calculate the total error of each combination of one, two, or three regulatory gene sets P(x_{i}).
- 2. Sort the predictor sets in ascending order of their errors.
- 3. If a gene appears in the first l sets with a frequency greater than or equal to 50%, then it is selected as a regulatory gene.
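Steps 2 and 3 above can be sketched as follows, assuming the total error of each candidate predictor set has already been computed in step 1. The function name, the (gene set, error) pair format, and the gene labels in the example are hypothetical.

```python
def select_regulators(scored_sets, l):
    """Pick regulators by their frequency among the l least-error sets.

    scored_sets -- iterable of (genes, error) pairs, where genes is a tuple
                   of gene labels forming one candidate predictor set
    l           -- number of top-ranked sets to combine
    """
    ranked = sorted(scored_sets, key=lambda item: item[1])  # step 2
    top = [genes for genes, _ in ranked[:l]]
    counts = {}
    for genes in top:
        for g in genes:
            counts[g] = counts.get(g, 0) + 1
    # Step 3: keep genes that occur in at least 50% of the top sets.
    return sorted(g for g, c in counts.items() if 2 * c >= len(top))


candidates = [
    (("x1",), 1),
    (("x1", "x2"), 0),
    (("x1", "x4"), 0),
    (("x2", "x4"), 2),
]
chosen = select_regulators(candidates, l=3)
```

Combining several low-error sets, rather than taking only the single best one, is what distinguishes this step from a plain minimum-error selection.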
4 Implementation
As mentioned in the introduction, many algorithms have been proposed to infer gene regulatory networks. A recent study shows that the best-fit algorithm appears to give the best results for the recovery of regulatory relationships among REVEAL, BIC, MDL, uMDL, and Best-Fit [[27]]. In this paper, we compare the performance of the three-rule algorithm, the best-fit algorithm and the proposed algorithm based on both synthetic networks as well as on a well-studied budding yeast cell cycle network.
We have implemented the three-rule algorithm and our proposed algorithm based on the PBN Toolbox (http://code.google.com/p/pbn-matlab-toolbox/), which includes an implementation of the best-fit algorithm, the calculation of the steady-state distribution, and other intervention modules for Boolean networks. Genetic regulatory networks are commonly believed to have a sparse connectivity topology. To evaluate the inference algorithms based on simulated time series of network states, we have restricted the random BNs to resemble this property of biological networks. Specifically, we have generated random BNs with a scale-free topology, and each gene has at most five predictors: $K=\max_{i=1}^{n}k_{i}\le 5$. We uniformly assign each gene 1 to K regulators that upregulate (1) or downregulate (−1) it. The average connectivity of random networks is (1 + K)/2.
- (1) The normalized-edge Hamming distance, $\mu_{\mathrm{ham}}^{\mathrm{e}}=\frac{\mathrm{FN}+\mathrm{FP}}{P+N},$ (4) where FN and FP represent the number of false-negative and false-positive wires, respectively. P and N represent the total number of positive and negative wires, respectively. This Hamming distance reflects the accuracy of the recovered regulatory relationships.
- (2) The normalized Hamming distance of state transitions, $\mu_{\mathrm{ham}}^{\mathrm{st}}=\frac{1}{n\cdot 2^{n}}\sum_{i=1}^{n}\sum_{k=1}^{2^{n}}\left[f_{i}(\mathbf{x}_{k})\oplus f_{i}^{'}(\mathbf{x}_{k})\right],$ (5) where $f_{i}(\cdot)$ and $f_{i}^{'}(\cdot)$ represent the Boolean function of gene i in the ground-truth network and the inferred network, respectively; $\mathbf{x}_{k}$ represents a binary state vector, and ⊕ denotes modulo-2 addition. This Hamming distance indicates the accuracy of the inferred network for predicting the next state of the ground-truth network.
- (3) The steady-state distribution distance, $\mu^{\mathrm{ssd}}=\sum_{k=1}^{2^{n}}\left|\pi_{k}-\pi_{k}^{'}\right|,$ (6) where $\pi_{k}$ and $\pi_{k}^{'}$ are the steady-state probabilities of state $\mathbf{x}_{k}$ in the ground-truth network and the inferred network, respectively. The steady-state distribution distance reflects the degree to which an inferred network approaches the long-run behavior of the ground-truth network.
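The three metrics can be computed as in the following Python sketch. It is an illustration under stated assumptions: networks are given as n × n matrices whose nonzero entries mark wires, a "positive" wire is taken to mean one present in the ground truth and a "negative" wire one that is absent (so P + N = n²), and the sign of a wire is ignored. All function names are hypothetical.

```python
from itertools import product


def edge_hamming(truth, inferred):
    """Equation (4): (FN + FP) / (P + N), taking P + N = n * n."""
    n = len(truth)
    fn = fp = 0
    for i in range(n):
        for j in range(n):
            in_truth = truth[i][j] != 0
            in_inferred = inferred[i][j] != 0
            fn += in_truth and not in_inferred
            fp += in_inferred and not in_truth
    return (fn + fp) / (n * n)


def transition_hamming(f_true, f_inferred, n):
    """Equation (5): fraction of gene outputs that differ over all 2**n states.

    f_true and f_inferred map a state tuple to the next state tuple.
    """
    mismatches = 0
    for state in product((0, 1), repeat=n):
        nxt_true, nxt_inferred = f_true(state), f_inferred(state)
        mismatches += sum(x ^ y for x, y in zip(nxt_true, nxt_inferred))
    return mismatches / (n * 2 ** n)


def ssd_distance(pi_true, pi_inferred):
    """Equation (6): L1 distance between two steady-state distributions."""
    return sum(abs(p - q) for p, q in zip(pi_true, pi_inferred))
```

Note that enumerating all 2^n states limits the last two metrics to small networks.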
5 Results and discussion
5.1 Simulated results
Average number of true-positive and false-positive connections for three algorithms
K | Noise (%) | Algorithm | TP (m = 10) | FP (m = 10) | TP (m = 20) | FP (m = 20) | TP (m = 30) | FP (m = 30) | TP (m = 40) | FP (m = 40) |
---|---|---|---|---|---|---|---|---|---|---|
3 | 0 | Three-rule | 6.2 | 0 | 8.7 | 0.6 | 11.3 | 1.6 | 13.3 | 3.0 |
3 | 0 | New | 8.7 | 3.1 | 10.5 | 3.1 | 11.8 | 3.3 | 12.5 | 3.3 |
3 | 0 | Best-fit | 8.1 | 4.6 | 10.2 | 5.4 | 12.2 | 6.4 | 13.3 | 7.0 |
3 | 5 | Three-rule | 2.6 | 2.7 | 7.3 | 11.5 | 10.6 | 20.7 | 12.5 | 30.3 |
3 | 5 | New | 7.0 | 7.5 | 8.7 | 6.9 | 10.1 | 6.3 | 10.7 | 6.3 |
3 | 5 | Best-fit | 7.1 | 11.1 | 9.2 | 15.1 | 10.8 | 15.7 | 11.6 | 15.9 |
3 | 10 | Three-rule | 1.8 | 3.6 | 6.5 | 17.6 | 10.5 | 31.6 | 12.4 | 39.8 |
3 | 10 | New | 5.5 | 10.0 | 6.9 | 9.5 | 8.1 | 9.2 | 8.4 | 9.1 |
3 | 10 | Best-fit | 6.0 | 15.2 | 8.1 | 19.1 | 9.2 | 19.3 | 9.9 | 19.0 |
5 | 0 | Three-rule | 6.7 | 0.1 | 8.9 | 0.6 | 11.0 | 1.3 | 12.6 | 2.3 |
5 | 0 | New | 8.3 | 2.7 | 9.9 | 3.0 | 10.9 | 3.4 | 11.4 | 3.9 |
5 | 0 | Best-fit | 8.2 | 4.6 | 10.1 | 5.4 | 11.8 | 6.4 | 12.7 | 6.9 |
5 | 5 | Three-rule | 3.0 | 3.2 | 7.86 | 11.8 | 10.7 | 20.5 | 12.8 | 28.6 |
5 | 5 | New | 6.7 | 7.6 | 8.4 | 7.0 | 9.3 | 6.7 | 9.8 | 6.3 |
5 | 5 | Best-fit | 7.1 | 11.5 | 9.2 | 15.4 | 10.4 | 15.7 | 11.1 | 16.1 |
5 | 10 | Three-rule | 2.7 | 2.8 | 6.9 | 16.5 | 10.6 | 31.6 | 12.4 | 39.4 |
5 | 10 | New | 5.3 | 9.9 | 7.0 | 9.5 | 7.5 | 9.3 | 8.1 | 9.1 |
5 | 10 | Best-fit | 7.2 | 11.5 | 8.2 | 18.9 | 9.0 | 19.3 | 9.4 | 19.4 |
Concerning the steady-state distribution distance μ^{ssd}, the proposed algorithm does not perform as well as the best-fit algorithm. However, their difference decreases as the noise intensity increases. As pointed out in [[27]], inferred networks with relatively more connections can explain the observed data better with respect to the steady-state distribution distance μ^{ssd}, even though some of the connections are incorrect. Because the best-fit algorithm infers more connections than the proposed algorithm (see Table 9), it performs better on μ^{ssd} than the latter. On the other hand, the proposed algorithm is more robust than the best-fit algorithm because it combines the minimal-error sets to determine the regulatory genes instead of selecting a single set. When the noise intensity increases, the performance of the best-fit algorithm drops more quickly than that of the proposed algorithm, so their performance on μ^{ssd} converges.
5.2 Cell cycle model of budding yeast
Temporal evolution of state for cell cycle
Time | Cln3 | MBF | SBF | Cln1 | Cdh1 | Swi5 | Cdc20 | Clb5 | Sic1 | Clb1 | Mcm1 | Phase |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | Start |
2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | G1 |
3 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | G1 |
4 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | G1 |
5 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | S |
6 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | G2 |
7 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | M |
8 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | M |
9 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | M |
10 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | M |
11 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | M |
12 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | M |
13 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | G1 |
We have applied the three algorithms to the above artificial time-series data and show the inferred networks in Figure 4. In the simplified model of the budding yeast cell cycle, there are a total of 34 regulatory relationships (or connections). The three-rule algorithm inferred 10 relationships, all correct (see Figure 4B). The best-fit algorithm inferred 15 correct and 5 incorrect relationships (see Figure 4C). The proposed algorithm inferred 15 correct and 4 incorrect relationships (see Figure 4D). Both the best-fit and the proposed algorithms inferred more true regulatory relationships than the three-rule algorithm, at the cost of some incorrect connections. For studying regulatory relationships, this may be more advantageous because more potential regulatory relationships are made available for biologists to check in the wet lab.
The performance of the three algorithms for the yeast-pathway data
Algorithm | $\mu_{\mathrm{ham}}^{\mathrm{e}}$ (0% noise) | $\mu_{\mathrm{ham}}^{\mathrm{st}}$ (0% noise) | μ^{ssd} (0% noise) | $\mu_{\mathrm{ham}}^{\mathrm{e}}$ (5% noise) | $\mu_{\mathrm{ham}}^{\mathrm{st}}$ (5% noise) | μ^{ssd} (5% noise) | $\mu_{\mathrm{ham}}^{\mathrm{e}}$ (10% noise) | $\mu_{\mathrm{ham}}^{\mathrm{st}}$ (10% noise) | μ^{ssd} (10% noise) |
---|---|---|---|---|---|---|---|---|---|
Three-rule | 0.198 | 0.313 | 1.394 | 0.27 | 0.378 | 1.454 | 0.29 | 0.402 | 1.472 |
New algorithm | 0.19 | 0.250 | 1.372 | 0.252 | 0.304 | 1.386 | 0.292 | 0.334 | 1.438 |
Best-fit | 0.198 | 0.229 | 1.245 | 0.298 | 0.341 | 1.263 | 0.365 | 0.403 | 1.298 |
5.3 Computational issues
When inferring real networks of moderate size, the time complexity of the algorithms is a key issue. Almost all algorithms proposed to date possess exponential complexity. The time complexity of the proposed algorithm and the best-fit algorithm is $O\left(n\cdot C_{n}^{k}\cdot m\right)$. The most time-consuming process for the three-rule algorithm is solving the constraint inequalities, and its time complexity is O(n ⋅ c^{n} ⋅ m^{2}) (1 < c < 2). From this point of view, the three-rule algorithm is more time consuming than the other two.
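To give a feel for the combinatorial factor in this bound, the following sketch counts the candidate predictor sets of size one to three that must be scored for a single target gene. The function name is hypothetical, and the size limit of three follows the inference algorithm in Section 3.5.

```python
from math import comb


def candidate_sets(n, max_size=3):
    """Number of predictor sets of size 1..max_size drawn from n genes."""
    return sum(comb(n, k) for k in range(1, max_size + 1))


# Network sizes used in the timing experiments.
per_gene = {n: candidate_sets(n) for n in (11, 12, 13)}
```

For n = 11 this gives 231 sets per target gene, each of which is scored against the m observed transitions.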
The proposed algorithm is similar in workflow to the best-fit algorithm; however, additional computation time results from three factors: (1) determination of the possible regulatory relationships, (2) determination during error estimation if an output state is correct for a given model according to Equation (2), and (3) combination of the first ten least-error models in the last step.
Algorithm timings (seconds)
n | Three-rule (N = 20) | Best-fit (N = 20) | Proposed (N = 20) | Three-rule (N = 40) | Best-fit (N = 40) | Proposed (N = 40) | SSD |
---|---|---|---|---|---|---|---|
11 | 1.04 | 0.09 | 1.11 | 2.7 | 0.14 | 1.67 | 25 |
12 | 2.5 | 0.11 | 2.63 | 4.1 | 0.18 | 2.15 | 160 |
13 | 6.3 | 0.15 | 3.55 | 7.5 | 0.23 | 4.11 | 1,500 |
6 Conclusion
The model space of Boolean networks is huge and from the point of view of evolution, it is unimaginable for nature to select its operational mechanisms from such a large space. Restricted Boolean networks, as a simplified model, have recently been extensively used to study the dynamical behavior of the yeast cell cycle process. In this paper, we propose a systematic method to infer the restricted Boolean network from time-series data. We compare the performance of the three-rule, best-fit, and the proposed algorithms both on simulated networks and on an artificial model of budding yeast. Results show that our algorithm performs better than the three-rule and best-fit algorithms according to the distance metrics ${\mathit{\mu}}_{\mathrm{ham}}^{\mathrm{e}}$ and ${\mathit{\mu}}_{\mathrm{ham}}^{\mathrm{st}}$, but slightly less well than the best-fit algorithm according to μ^{ssd}. This result indicates that the proposed algorithm may be more appropriate for recovering regulatory relationships between genes under the restricted Boolean network model.
The main advantage of the proposed algorithm is that it is more robust to noise than both the three-rule algorithm and best-fit algorithm. The proposed algorithm infers the regulatory relationships according to the consecutive state transitions of the target gene, instead of the small perturbations between two similar states in the three-rule algorithm. Simulation results show that noise in the data may induce many incorrect constraints by the three-rule algorithm. This hinders its application to noisy samples. Moreover, the proposed algorithm can capture the intrinsic state transition defined in Equation 2, whereas the best-fit algorithm cannot. Hence, because the inference processes of both algorithms try to find the minimal-error predictor set, the proposed algorithm can distinguish error in the data more accurately than the best-fit algorithm. Additionally, combination of the minimal error predictor sets in the proposed algorithm also improves its robustness.
In the Boolean formalism, a single time series (or trajectory) can be treated as a random walk across state space. It is not possible to recover the complex biological system from just one short trajectory by any method. Using heterogeneous data and some a priori knowledge is typically a necessity. A priori knowledge can be incorporated into the proposed algorithm and helps by reducing the search space. For instance, an algorithm might assume a prescribed attractor structure [[32]]. In our case, if we know that x regulates y, then we only consider those combinations containing x, thereby reducing the search space. Additionally, different methods may focus on different aspects of the inference process. For example, the best-fit algorithm and CoD are mainly concerned with the fitness of the data, whereas MDL-based methods intend to reduce structural risks. Future work will involve combining MDL with the proposed algorithm to reduce the rate of false positives.
Declarations
Acknowledgements
This work was funded in part by the National Science Foundation of China (Grant No. 61272018, No. 60970065, and No. 61174162) and the Zhejiang Provincial Natural Science Foundation of China (Grant No. R1110261 and No. LY13F010007), with additional support from the China Scholarship Council.
References
- Ilya S, Dougherty ER: Genomic Signal Processing (Princeton Series in Applied Mathematics). Princeton University Press, Princeton; 2007.
- Ilya S, Dougherty ER: Probabilistic Boolean Networks: The Modeling and Control of Gene Regulatory Networks. SIAM, Philadelphia; 2010.
- Shoudan L, Stefanie F, Roland S: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. In Pacific Symposium on Biocomputing. World Scientific, Hawaii; 1998.
- Margolin AA, Ilya N, Katia B, Chris W, Gustavo S, Riccardo DF, Andrea C: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinforma. 2006, 7: S7.
- Zhao W, Serpedin E, Dougherty ER: Recovering genetic regulatory networks from chromatin immunoprecipitation and steady-state microarray data. EURASIP J. Bioinforma. Syst. Biol. 2008.
- Vijender C, Preetam G, Edward P, Gong GP, Deng Y, Zhang C: A novel gene network inference algorithm using predictive minimum description length approach. BMC Syst. Biol. 2010, 4: S7.
- Vijender C, Chaoyang Z, Preetam G, Perkins EJ, Gong P, Deng Y: Gene regulatory network inference using predictive minimum description length principle and conditional mutual information. In International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS). Edited by: Zhang J, Li G, Yang JY. IEEE Computer Society, Piscataway; 2009:487-490.
- Dougherty J, Tabus I, Astola J: A universal minimum description length-based algorithm for inferring the structure of genetic networks. In IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS). Edited by: Huang Y. IEEE, Piscataway; 2007:1-2.
- Tabus I, Astola J: On the use of MDL principle in gene expression prediction. EURASIP J. Appl. Signal Process. 2001, 2001: 297-303.
- Zhao W, Erchin S, Dougherty ER: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 2006, 22: 2129-2135.
- Dougherty RE, Seungchan K, Yidong C: Coefficient of determination in nonlinear signal processing. Signal Process. 2000, 80: 2219-2235.
- Kim S, Dougherty ER, Bittner ML, Chen Y, Sivakumar K, Meltzer P, Trent JM: General nonlinear framework for the analysis of gene interaction via multivariate expression arrays. J. Biomed. Opt. 2000, 5: 411-424.
- Shmulevich I, Dougherty ER, Seungchan K, Zhang W: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 2002, 18: 261-274.
- Lähdesmäki H, Shmulevich I, Yli-Harja O: Learning gene regulatory networks under the Boolean network model. Mach. Learn. 2003, 52: 147-167.
- Shmulevich I, Kauffman SA, Maximino A: Eukaryotic cells are dynamically ordered or critical but not chaotic. Proc. Natl. Acad. Sci. U. S. A. 2005, 102: 13439-13444.
- Nykter M, Price ND, Maximino A, et al.: Gene expression dynamics in the macrophage exhibit criticality. Proc. Natl. Acad. Sci. U. S. A. 2008, 105: 1897-1900.
- Liu W, Lähdesmäki H, Dougherty ER, Shmulevich I: Inference of Boolean networks using sensitivity regularization. EURASIP J. Bioinforma. Syst. Biol. 2008. doi:10.1155/2008/780541
- Li F, Long T, Ying L, Ouyang Q, Tang C: The yeast cell-cycle network is robustly designed. Proc. Natl. Acad. Sci. U. S. A. 2004, 101: 4781-4786.
- Zhang Y, Qian M, Ouyang Q, Deng M, Li F, Tang C: Stochastic model of yeast cell-cycle network. Physica D: Nonlinear Phenomena 2006, 219: 35-39.
- Kai-Yeung L, Surya G, Chao T: Function constrains network architecture and dynamics: a case study on the yeast cell cycle Boolean network. Phys. Rev. E 2007, 75: 051907.
- Bornholdt S: Boolean network models of cellular regulation: prospects and limitations. J. R. Soc. Interface 2008, 5: S85-S94.
- Davidich MI, Stefan B: Boolean network model predicts cell cycle sequence of fission yeast. PLoS One 2008, 3: e1672.
- Ronaldo Fumio H, Henrique S, Carlos HA H: Budding yeast cell cycle modeled by context-sensitive probabilistic Boolean network. In IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS). Edited by: Braga-Neto U. IEEE, Piscataway; 2009:1-4.
- Todd RG, Tomáš H: Ergodic sets as cell phenotype of budding yeast cell cycle. PLoS One 2012, 7: e45780.
- Higa CHA, Louzada VHP, Andrade TP, Hashimoto RF: Constraint-based analysis of gene interactions using restricted Boolean networks and time-series data. BMC Proc. 2011, 5(Suppl 2): S5.
- Niklas E, Niklas S: An extensible SAT-solver. In Theory and Applications of Satisfiability Testing. Edited by: Giunchiglia E, Tacchella A. Springer, New York; 2004:502-518.
- Dougherty ER: Validation of gene regulatory networks: scientific and inferential. Brief. Bioinform. 2011, 12: 245-252.
- Xiaoning Q, Dougherty ER: Validation of gene regulatory network inference based on controllability. Front. Genet. 2013, 4: 272.
- Ghaffari N, Ivanov I, Qian X, Dougherty ER: A CoD-based reduction algorithm for designing stationary control policies on Boolean networks. Bioinformatics 2010, 26: 1556-1563.
- Ivanov I, Simeonov P, Ghaffari N, Xiaoning Q, Dougherty ER: Selection policy-induced reduction mappings for Boolean networks. IEEE Trans. Signal Process. 2010, 58: 4871-4882.
- Qian X, Ghaffari N, Ivanov I, Dougherty ER: State reduction for network intervention in probabilistic Boolean networks. Bioinformatics 2010, 26: 3098-3104.
- Pal R, Ivanov I, Datta A, Bittner ML, Dougherty ER: Generating Boolean networks with a prescribed attractor structure. Bioinformatics 2005, 21: 4021-4025.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.