Gene regulatory network inference and validation using relative change ratio analysis and time-delayed dynamic Bayesian network

The Dialogue for Reverse Engineering Assessments and Methods (DREAM) project was initiated in 2006 as a community-wide effort for the development of network inference challenges for rigorous assessment of reverse engineering methods for biological networks. We participated in the in silico network inference challenge of DREAM3 in 2008. Here we report the details of our approach and its performance on the synthetic challenge datasets. In our methodology, we first developed a model called relative change ratio (RCR), which took advantage of the heterozygous knockdown data and null-mutant knockout data provided by the challenge, in order to identify the potential regulators for the genes. With this information, a time-delayed dynamic Bayesian network (TDBN) approach was then used to infer gene regulatory networks from time series trajectory datasets. Our approach considerably reduced the searching space of TDBN; hence, it gained a much higher efficiency and accuracy. The networks predicted using our approach were evaluated comparatively along with 29 other submissions by two metrics (area under the ROC curve and area under the precision-recall curve). The overall performance of our approach ranked the second among all participating teams. Electronic supplementary material The online version of this article (doi:10.1186/s13637-014-0012-3) contains supplementary material, which is available to authorized users.


Introduction
Recent development of high-throughput technologies such as DNA microarray and RNA-Seq (i.e., next-generation sequencing of RNA transcripts) has made it possible for biologists to simultaneously measure gene expression at a genome scale. High dimensional datasets generated using such technologies provide a system-wide overview of how genes interact with each other in a network context. However, reconstruction of complex networks of genetic interactions and unraveling of unknown relationships among genes based on such high-throughput datasets remain a very challenging computational problem.
Various mathematical methods and computational approaches have been proposed to infer gene regulatory networks (GRN) from DNA microarray data, including Boolean networks [1], information theory [2], differential equations [3], and Bayesian networks [4][5][6]. However, the relative performances among these algorithms are not well studied because computational biologists must repeatedly test them on large-scale and high-quality datasets obtained from different experimental conditions and derived from different networks. Unfortunately, experimental datasets of customized size and design are usually unavailable and most biological networks are unknown or incomplete. Since each of these methods uses different datasets and comparison strategies, it is difficult to systematically validate the interactions predicted by different computational approaches.
Due to limited knowledge of experimentally validated biological networks of gene interactions, simulated data generated artificially from in silico gene networks provide a 'gold' standard to systematically evaluate the performance of different genetic networks inferring algorithms [7]. In silico networks are composed of a known network topology that determines the structure and model for each of the interactions among the genes. In such simulated data, all aspects of the networks are under full control and different types of data and levels of noise are allowed. Many methods have been proposed for creating in silico genetic networks, including continuous [8], probabilistic [9], and dynamic [10] approaches.
The performance of network inference algorithms has rarely been assessed and compared in terms of their strength and weakness using rigorous metrics [11,12]. As a community effort to address the deficiency in GRN reconstruction methodology, a Dialogue for Reverse Engineering Assessments and Methods (DREAM) project was initiated in 2006 [11] to catalyze the interaction between experiment and theory, specifically in the area of cellular network inference and quantitative model building (http:// www.the-dream-project.org/). One of the key goals of DREAM is the development of community-wide challenges for objective assessment of reverse engineering methods for biological networks [13]. The in silico network inference challenge of DREAM3 was designed to explore the extent to which underlying gene networks of various sizes and connection densities can be inferred from simulated data [14]. In participation of this challenge, we developed a novel approach of combining relative change ratio (RCR) and time-delayed dynamic Bayesian network to deduce GRNs from synthetic datasets for Escherichia coli and Saccharomyces cerevisiae (budding yeast) provided by the challenge. Among 29 participating teams, the performance of our approach was second only to the best performing method in the 10-node and the 50node network sub-challenges [14]. Here we present the details of our approach and its performance on the challenge datasets.

Challenge datasets
The in silico network inference challenge was structured as three separate sub-challenges with networks of 10, 50, and 100 genes (nodes), respectively [13]. For each subchallenge, five in silico networks (two for E. coli and three for S. cerevisiae) were created as benchmark or gold standard networks. The rationale for this design was to evaluate the consistence of inference methods in predicting the topology of five independent networks of the same type and size. These benchmark networks were generated by Daniel Marbach of Ecole Polytechnique Fédérale de Lausanne through extracting sub-networks with a topology of connections from the currently accepted E. coli and S. cerevisiae GRNs and imbuing the networks with dynamics using a thermodynamic model of gene expression [8]. The in silico 'measurements' were generated by continuous differential equations which were deemed reasonable approximations of gene expression regulatory functions [8,14]. A small amount of Gaussian noise was added to these values to simulate measurement error [14].
For each sub-challenge network, three experimental gene expression datasets were simulated for both E. coli and S. cerevisiae: heterozygous knockdown, nullmutants, and time series trajectories. The heterozygous knockdown dataset contained the steady state gene expression levels for the wild-type and the heterozygous knockdown (a gene reduced by half ) strains for each gene. The null-mutant dataset contained the steady state levels for the wild-type and the null-mutant (expression of a gene set to zero) strains. Time series trajectories dataset contained time courses of the network recovering from several external perturbations. All of the datasets can be downloaded at the DREAM Project website: http://wiki.c2b2.columbia.edu/dream/ index.php/D3c4.

Relative change ratio
A GRN represents the interactions of all genes in the network. For a given GRN structure, the change of the expression level of one gene results in changes of the expression levels of all others genes regulated by this gene. If a gene plays an important role in the GRN, knockout or null-mutation of an important gene (key gene) leads to more significant changes of the expression levels of other genes that are directly interacted with the hub gene. Thus, the wild-type, knockout, and null-mutant datasets provide useful information (prior knowledge) that we can use for improving the accuracy of GRN inference. Here we introduce the RCR method to preprocess and analyze the given datasets to identify the key genes that can be used for further GRN inference. The RCR method can reveal the relationships between a knockout gene and the influenced genes so it can also be directly used for inference of a GRN.
For each gene in the given dataset, we took the gene expression value of the wild-type as reference and calculated the relative change ratios of gene expression levels compared to the change range of the gene, as defined in Equation 1.
where, R i,j represents the relative change ratio of gene j when gene i is knocked out. G i,j is the gene expression value of gene j when gene i is knocked out. W j is the wild-type value of gene j, and Max(G :,j ) − Min(G :,j ) means the change range of gene j for all knockout genes. If i = j, R i,j will be set as 0 since this gene has already been knocked out.
If the change ratio is more than a chosen threshold (e.g., 0.30), we select this gene as a potential key gene and assume that it plays an important role in the network. If the change in absolute gene expression value compared to the reference is less than a threshold (e.g., 0.05) which can be defined by the user, this gene is considered as noise and ignored from the potential regulatory genes list. For example, in Figure 1, when gene 1 has been knocked out (the expression value will be set as 0), the change ratios of genes 2, 4, 5, 7, and 8 are more than 30 %, then we consider these genes as genes potentially regulated by knockout gene 1.
If the absolute change of gene expression values compared to their own reference value is less than a chosen threshold (e.g., 0.05), even though the relative change ratio is more than 0.30, we still consider these genes as noise and remove them from the regulated genes list.

Dynamic Bayesian network
Dynamic Bayesian network (DBN) analysis is the temporal extension of Bayesian network analysis. It is a general model class that is capable of representing complex temporal stochastic processes. An example of basic DBN block is shown in Figure 2.
A DBN is defined as a pair (B0, B1) representing the joint probability distribution over all possible time series of variables X = {X 1 , X 2 ,…X n }, where X i (1 ≤ i ≤ n) represents the binary-valued random variables in the network. In addition, the lowercase x i (1 ≤ i ≤ n) denotes the values of variable X i . It is composed of initial states of a Bayesian network B 0 = (G 0 , Θ 0 ) and a transition Bayesian network B 1 = (G 1 , Θ 1 ), where B 0 specifies the joint distribution of the variables in X(0) and B 1 represents the transition probabilities Pr{X(t + 1)|X(t)} for all  t. In slice 0, the parents of X i (0) are assumed to be those specified in the prior network B 0 , which means Pa(X i (0)) ⊆ X(0) for all 1 ≤ i ≤ n; in slice t + 1, the parents of X i (t + 1) are nodes in slices t, Pa(X i (t + 1)) ⊆ X(t) for all 1 ≤ i ≤ n and t ≥ 0; the connections only exist between consecutive slices. The joint distribution over a finite list of random variables X(0) ∪ X(1) ∪ ⋯ ∪ X(T) can be expressed as [15,16] Pr Kevin Murphy and co-workers [17,18] implemented a Bayesian network toolbox (BNT), in which the actual structure learning was performed by calling one of the BNT functions learn_struct_dbn_reveal, which used the REVEAL algorithm [4].

Time-delayed dynamic Bayesian network
In the traditional DBN proposed by [17,18], the effectiveness is not sufficient for two main reasons. The first is the extremely high computational cost. In Murphy's implementation, all the genes in the dataset are considered as parents (regulators) of a given target gene, which makes it impossible to model large-scale gene networks because of exponentially increasing computational time when the algorithm tries to find all of the subsets of parent genes given a target gene. Usually, the number of genes is restricted to less than 30, and more genes will be too much time consuming according to our testing. The second is that biologically relevant transcriptional time lags cannot be determined in Murphy's BNT, which reduces the inference accuracy of gene regulatory networks.
To address the above limitations of traditional DBN, Zou and Conzen [9] introduced a time-delayed dynamic Bayesian network (TDBN)-based analysis method, which can reconstruct GRNs from time series gene expression data. The improved method can dramatically reduce computational time and significantly increased accuracy. According to [9,10], most transcriptional regulators exhibit either an earlier or simultaneous change in the expression level when compared to their targets. In this way, one can limit the potential parents of each target gene and thus dramatically decrease the computational cost. The other improvement by Zou and Conzen [9] is to perform an estimation of the transcriptional time lag between potential regulators and their target genes. The time difference between the initial expression change of a potential regulator and its target gene represents a biologically relevant time period.
The initial expression change of a potential regulator is expected to allow a more accurate estimation of the transcriptional time lag between potential regulators and their targets, because it takes into account variable expression relationships of different regulator-target pairs. These improvements in [9] are related to transcriptional time-delayed lags between regulators and target genes, so it can also be considered as a time-delayed DBN and directly used to predict networks from time series gene expression data, such as the trajectory time series data in the DREAM3 challenge.

Inferring networks using a method that combines RCR and TDBN
In this combined method, we first used the simple RCR model to find key genes from the given heterozygous knockdown data and null-mutant knockout data. These key genes have a higher potential than other genes to play critical roles in simulated GRNs. After the data was preprocessed, we constructed a gene interaction network that indicated potential regulation among the selected key genes. The TDBN method was then used to infer another GRN from time series trajectory datasets. If gene interactions exist in both networks inferred by RCR and TDBN methods, we choose these interactions as our predicted edges in our final inferred networks. The predicted networks were assessed against the benchmark networks [13,14].

Inferred networks as compared with the true networks
In this work, our approach was applied to inferring GRNs in three different ways: For in silico networks with 10 genes, the gene regulatory networks were inferred only by the RCR method from steady state data, in which we used mainly the gene knockout dataset; for networks with 50 genes, the networks inferred using RCR and TDBN separately were combined into the final networks; for networks with 100 genes, we used only TDBN to reconstruct gene networks from time series trajectory gene expression dataset. In doing this, we sought to determine which method had better performance in inferring gene regulatory networks.
Our approach successfully inferred networks using the synthetic datasets provided by Marbach and his colleagues [8,13,14]. For example, one of the inferred E. coli 10-node GRN is shown in Figure 3, where seven matching edges are correctly identified by our model, in comparison to the corresponding true network. Our model correctly identified directionality in each of the matching edges. One of the predicted 50-node yeast GRN is shown in Figure 4, and the matching network is shown in Figure 5. There are 52 edges correctly inferred by our method, out of a total of 77 edges in the true network.

Performance of network inference from synthetic datasets
The performance of each method was evaluated by two metrics: the area under the precision-recall (AUPR) curve and the area under the receiver operating characteristic  (AUROC) curve for the whole set of edge predictions for 15 networks [13,14]. Precision is a measure of fidelity, whereas recall is a measure of completeness. Recall (R) is defined as Ce CeþMe ð Þ and precision (P) as Ce CeþFe ð Þ , where Ce is the number of correct edges, Me is the total number of missed edges (missed errors), and Fe is the number of false alarm errors. A missed error is defined as the connection between genes that exists in true networks, but the inference algorithms miss or make wrong orientations. A false alarm error is the connection that the inference algorithms create but does not exist in true networks.
A P value is the probability that a given or larger area under the curve value is obtained by random ordering of the T potential network links. An overall P value is the geometric mean of the n individual P values, calculated as p 1 Â p 2 Â … Â p n À Á 1=n . An overall AUROC P value represents the geometric mean of the five AUROC P values (Ecoli1, Ecoli2, Yeast1, Yeast2, and Yeast3). An overall AUPR P value is the geometric mean of the five AUPR P values.
To calculate AUPR and AUROC, each predicted network was submitted in the form of ranked lists of predicted edges. The lists were ordered according to the confidence of the predictions so that the first entry corresponded to the edge predicted with the highest confidence. In other words, the edges at the top of the list were believed to be present in the network, and the edges at the bottom of the list were believed to be absent from the network [13]. The inferred GRNs of different sizes (10, 50, and 100 genes) for both E. coli and yeast were evaluated by the above metrics. The larger scores of AUPR and AUROC and the smaller P values of AUPR and AUROC indicate the greater statistical significance of the prediction ( Table 1). The metrics of RCR and TDBN inferred networks from the 10-and 50-gene datasets were ranked second among all 29 teams participating in the DREAM3 challenge. The RCR and TDBN inferred networks from the 100-gene dataset were ranked at the 15th place. The overall performance of our methods for all three-sized networks ranked second out of all participating teams in the DREAM3 challenge.

Role of RCR and TDBN in network inference
In general, our predictions of networks with 10 and 50 genes were better than those of 100-gene networks. In most cases, predictions of E. coli networks were better than those of the yeast networks, with the exception of Yeast1 (Table 2). Based on these results, RCR appears to increase the fidelity of network inference more than using TDBN alone. This might explain why the performance of inferred networks with 100 genes was not as good as with size 10 and size 50, because only TDBN was used to infer networks instead of combining prior knowledge which would be gained from preprocessing data by RCR.
To better understand the role of RCR in GRN inference, we used the networks with 10 genes as an example and compared the performance of all three methods: RCR (using only knockout data), TDBN (using time series data without four perturbations), and the combined method (using knockout results as prior knowledge and then running TDBN with time series data). The AUPR, AUROC, and overall score (−0.5 × log 10 (P_AUPR × P_AUROC)) results obtained for the five datasets in the networks with 10 genes are shown in Figure 6A,B,C, respectively. The three metrics demonstrate that for all five tested datasets, both RCR and the combined method had better performance than TDBN. The combined method was expected to have better performance than the RCR method because the RCR results could provide prior knowledge for TDBN. For three testing datasets (Ecoli2, Yeast1, and Yeast3), the combined method performed better than RCR. But the combined method did not perform as well for the other two datasets (Ecoli1 and Yeast2). Therefore, whether RCR or the combined method has better performance depends on specific datasets. Such an observation can be explained by examining the algorithm in the TDBN method. Even though we specified 'parent regulators' as prior knowledge in TDBN to narrow down the search space of regulators, TDBN still calculated its own 'parents' based on simultaneously altered time series genes and combined two sets of parents as one group. Thus, TDBN in the combined method always inferred more connections than RCR, which might result in higher false positive rates. How to take advantage of RCR-inferred prior knowledge in the method combining RCR and TDBN to improve  The network name consists of two parts: organism name and network set number (i.e., Ecoli1 or Yeast1) followed by network size (10, 50, or 100 genes). The two parts are separated by '_'.
the performance of GRN inference remains a challenging research topic that requires further investigations.

Impact of RCR threshold on network inference accuracy
In the above analyses, we chose an empirical value of 0.30 as the RCR threshold, which implies that a gene is a potential key gene and plays an important role in the network if its change ratio is greater than 0.30. However, it is noteworthy that different RCR thresholds may affect the accuracy of network inference. To investigate the impact of a chosen RCR threshold on the prediction accuracy, we used the networks with 10 genes as an example and calculated both AUPR and AUROC P values, denoted as P-AUPR and P-AUROC, corresponding to 14 different RCR thresholds ranging from 0.05 to 0.70. As shown in Figure 7A,B, both P-AUPR and P-AUROC values were small when a RCR threshold was between 0.15 and 0.40. Furthermore, we also calculated the overall score −0.5 × log 10 (P_AUPR × P_AUROC) to evaluate the impact of RCR values on the performance. This score was used by the DREAM3 challenges to assess the performance of all participating teams. As shown in Figure 7C, the RCR threshold of 0.25 gave the best performance and it was very close to the empirical RCR threshold we used for GRN inference for the DREAM3 challenges.

Conclusions
In this study, a novel relative change ratio method was proposed to preprocess the null-mutant steady state data in order to find the key genes and build GRNs, in which these selected key genes have a higher potential than other genes to play very critical roles. Then, TDBN was used to infer GRNs from time series trajectory data, which were combined with previous knowledge gained in the initial step. Finally, the inferred networks were evaluated by using AUPR and AUROC metrics for the whole edge predictions for a network. The overall prediction results suggest that our approach was able to infer gene regulatory networks from in silico DREAM challenge data very efficiently and accurately in comparison with other participating teams. We have confidence that the DREAM project will eventually lead the reverse engineering community to resolve technical problems and overcome barriers between research groups towards reliable and accurate GRN inference from high dimensional gene expression data.