Gene regulatory network inference and validation using relative change ratio analysis and time-delayed dynamic Bayesian network
 Peng Li^{1},
 Ping Gong^{2},
 Haoni Li^{3},
 Edward J Perkins^{4},
 Nan Wang^{3} and
 Chaoyang Zhang^{3}
DOI: 10.1186/s13637-014-0012-3
© Li et al.; licensee Springer. 2014
Received: 16 January 2014
Published: 16 July 2014
Abstract
The Dialogue for Reverse Engineering Assessments and Methods (DREAM) project was initiated in 2006 as a community-wide effort to develop network inference challenges for the rigorous assessment of reverse engineering methods for biological networks. We participated in the in silico network inference challenge of DREAM3 in 2008. Here we report the details of our approach and its performance on the synthetic challenge datasets. We first developed a model called relative change ratio (RCR), which took advantage of the heterozygous knockdown data and null-mutant knockout data provided by the challenge to identify potential regulators of the genes. With this information, a time-delayed dynamic Bayesian network (TDBN) approach was then used to infer gene regulatory networks from time series trajectory datasets. Our approach considerably reduced the search space of TDBN and hence achieved much higher efficiency and accuracy. The networks predicted using our approach were evaluated comparatively along with 29 other submissions by two metrics (area under the ROC curve and area under the precision-recall curve). The overall performance of our approach ranked second among all participating teams.
Keywords
Gene regulatory network (GRN); Dialogue for Reverse Engineering Assessments and Methods (DREAM); Relative change ratio (RCR); Time-delayed dynamic Bayesian network (TDBN)
Introduction
Recent development of high-throughput technologies such as DNA microarray and RNA-Seq (i.e., next-generation sequencing of RNA transcripts) has made it possible for biologists to simultaneously measure gene expression at a genome scale. High-dimensional datasets generated using such technologies provide a system-wide overview of how genes interact with each other in a network context. However, reconstructing complex networks of genetic interactions and unraveling unknown relationships among genes from such high-throughput datasets remains a very challenging computational problem.
Various mathematical methods and computational approaches have been proposed to infer gene regulatory networks (GRN) from DNA microarray data, including Boolean networks [1], information theory [2], differential equations [3], and Bayesian networks [4]-[6]. However, the relative performance of these algorithms is not well studied, because a fair comparison requires testing them repeatedly on large-scale, high-quality datasets obtained from different experimental conditions and derived from different networks. Unfortunately, experimental datasets of customized size and design are usually unavailable, and most biological networks are unknown or incomplete. Since each of these methods was evaluated with different datasets and comparison strategies, it is difficult to systematically validate the interactions predicted by different computational approaches.
Because few biological networks of gene interactions have been validated experimentally, simulated data generated from in silico gene networks provide a 'gold standard' for systematically evaluating the performance of different genetic network inference algorithms [7]. In silico networks are composed of a known network topology that determines the structure and a model for each of the interactions among the genes. In such simulated data, all aspects of the networks are under full control, and different types of data and levels of noise are allowed. Many methods have been proposed for creating in silico genetic networks, including continuous [8], probabilistic [9], and dynamic [10] approaches.
The performance of network inference algorithms has rarely been assessed and compared in terms of their strengths and weaknesses using rigorous metrics [11],[12]. As a community effort to address this deficiency in GRN reconstruction methodology, the Dialogue for Reverse Engineering Assessments and Methods (DREAM) project was initiated in 2006 [11] to catalyze the interaction between experiment and theory, specifically in the area of cellular network inference and quantitative model building (http://www.thedreamproject.org/). One of the key goals of DREAM is the development of community-wide challenges for objective assessment of reverse engineering methods for biological networks [13]. The in silico network inference challenge of DREAM3 was designed to explore the extent to which underlying gene networks of various sizes and connection densities can be inferred from simulated data [14]. To participate in this challenge, we developed a novel approach that combines relative change ratio (RCR) and time-delayed dynamic Bayesian network to deduce GRNs from synthetic datasets for Escherichia coli and Saccharomyces cerevisiae (budding yeast) provided by the challenge. Among 29 participating teams, the performance of our approach was second only to the best performing method in the 10-node and the 50-node network subchallenges [14]. Here we present the details of our approach and its performance on the challenge datasets.
Materials and methods
Challenge datasets
The in silico network inference challenge was structured as three separate subchallenges with networks of 10, 50, and 100 genes (nodes), respectively [13]. For each subchallenge, five in silico networks (two for E. coli and three for S. cerevisiae) were created as benchmark or gold standard networks. The rationale for this design was to evaluate the consistency of inference methods in predicting the topology of five independent networks of the same type and size. These benchmark networks were generated by Daniel Marbach of Ecole Polytechnique Fédérale de Lausanne by extracting subnetworks with a topology of connections from the currently accepted E. coli and S. cerevisiae GRNs and imbuing the networks with dynamics using a thermodynamic model of gene expression [8]. The in silico 'measurements' were generated by continuous differential equations, which were deemed reasonable approximations of gene expression regulatory functions [8],[14]. A small amount of Gaussian noise was added to these values to simulate measurement error [14].
For each subchallenge network, three experimental gene expression datasets were simulated for both E. coli and S. cerevisiae: heterozygous knockdown, null-mutants, and time series trajectories. The heterozygous knockdown dataset contained the steady state gene expression levels for the wild-type and the heterozygous knockdown (a gene reduced by half) strains for each gene. The null-mutant dataset contained the steady state levels for the wild-type and the null-mutant (expression of a gene set to zero) strains. The time series trajectories dataset contained time courses of the network recovering from several external perturbations. All of the datasets can be downloaded at the DREAM Project website: http://wiki.c2b2.columbia.edu/dream/index.php/D3c4.
Relative change ratio
A GRN represents the interactions of all genes in the network. For a given GRN structure, a change in the expression level of one gene results in changes in the expression levels of all other genes regulated by this gene. If a gene plays an important role in the GRN (a key gene), its knockout or null-mutation leads to more significant changes in the expression levels of the genes that interact directly with it. Thus, the wild-type, knockdown, and null-mutant datasets provide useful information (prior knowledge) that we can use to improve the accuracy of GRN inference. Here we introduce the RCR method to preprocess and analyze the given datasets to identify the key genes that can be used for further GRN inference. Because the RCR method can reveal the relationships between a knocked-out gene and the genes it influences, it can also be used directly for inference of a GRN.
If the absolute change of gene expression values compared to their own reference value is less than a chosen threshold (e.g., 0.05), even though the relative change ratio is more than 0.30, we still consider these genes as noise and remove them from the regulated genes list.
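The two-threshold filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the relative change ratio of a gene is the absolute difference between its mutant and wild-type steady-state levels divided by the wild-type level, and the function names and the toy expression values are hypothetical. Only the thresholds 0.30 and 0.05 come from the text.

```python
def relative_change_ratio(wild_type_level, mutant_level):
    """Assumed RCR definition: |mutant - wild type| / |wild type|."""
    return abs(mutant_level - wild_type_level) / abs(wild_type_level)


def regulated_genes(wild_type, mutant, perturbed_gene,
                    rcr_threshold=0.30, abs_threshold=0.05):
    """List genes that respond to the perturbation of `perturbed_gene`.

    wild_type, mutant: dicts mapping gene name -> steady-state expression.
    A gene is kept only if its RCR exceeds `rcr_threshold` AND its absolute
    change is at least `abs_threshold`; otherwise it is treated as noise.
    """
    targets = []
    for gene, wt_level in wild_type.items():
        if gene == perturbed_gene or wt_level == 0:
            continue  # skip the perturbed gene itself and undefined ratios
        abs_change = abs(mutant[gene] - wt_level)
        rcr = relative_change_ratio(wt_level, mutant[gene])
        if rcr > rcr_threshold and abs_change >= abs_threshold:
            targets.append(gene)
    return targets


# Hypothetical steady-state levels for a null mutant of gene G1.
wild_type = {"G1": 0.80, "G2": 0.50, "G3": 0.10, "G4": 0.60}
g1_null = {"G1": 0.00, "G2": 0.20, "G3": 0.14, "G4": 0.62}
print(regulated_genes(wild_type, g1_null, "G1"))
```

In this toy example G3 has a ratio above 0.30 but an absolute change below 0.05, so it is discarded as noise, which is the case the absolute threshold is meant to catch for weakly expressed genes.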
Dynamic Bayesian network
Kevin Murphy and coworkers [17],[18] implemented a Bayesian network toolbox (BNT), in which the actual structure learning was performed by calling one of the BNT functions learn_struct_dbn_reveal, which used the REVEAL algorithm [4].
Timedelayed dynamic Bayesian network
The traditional DBN proposed in [17],[18] is not sufficiently effective for two main reasons. The first is the extremely high computational cost. In Murphy's implementation, all the genes in the dataset are considered as potential parents (regulators) of a given target gene, which makes it impossible to model large-scale gene networks because the computational time increases exponentially when the algorithm tries to find all of the subsets of parent genes for a target gene. Usually, the number of genes is restricted to fewer than 30; in our testing, larger gene sets were too time-consuming. The second is that biologically relevant transcriptional time lags cannot be determined in Murphy's BNT, which reduces the inference accuracy of gene regulatory networks.
To address the above limitations of traditional DBN, Zou and Conzen [9] introduced a time-delayed dynamic Bayesian network (TDBN)-based analysis method, which can reconstruct GRNs from time series gene expression data. The improved method can dramatically reduce computational time and significantly increase accuracy. According to [9],[10], most transcriptional regulators exhibit either an earlier or a simultaneous change in expression level when compared to their targets. In this way, one can limit the potential parents of each target gene and thus dramatically decrease the computational cost. The other improvement by Zou and Conzen [9] is to estimate the transcriptional time lag between potential regulators and their target genes. The time difference between the initial expression change of a potential regulator and that of its target gene represents a biologically relevant time period.
Using the initial expression change of a potential regulator is expected to allow a more accurate estimation of the transcriptional time lag between potential regulators and their targets, because it takes into account the variable expression relationships of different regulator-target pairs. These improvements in [9] concern the transcriptional time lags between regulators and target genes, so the method can be considered a time-delayed DBN and used directly to predict networks from time series gene expression data, such as the trajectory time series data in the DREAM3 challenge.
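The parent-restriction and time-lag ideas described above can be illustrated with a short sketch. It is an assumption-laden simplification rather than the method of [9]: the initial expression change is taken here to be the first time point at which a gene deviates from its first measurement by more than a hypothetical `threshold`, and the function names and toy trajectories are invented for illustration.

```python
def initial_change_index(series, threshold=0.2):
    """Index of the first time point whose level differs from the baseline
    (series[0]) by more than `threshold`; None if the gene never changes."""
    baseline = series[0]
    for t, level in enumerate(series):
        if abs(level - baseline) > threshold:
            return t
    return None


def potential_parents(time_series, threshold=0.2):
    """Map each target gene to {candidate regulator: lag in sampling steps}.

    A gene is a candidate regulator only if its initial change occurs
    earlier than, or at the same time as, the target's initial change.
    """
    onset = {gene: initial_change_index(series, threshold)
             for gene, series in time_series.items()}
    parents = {}
    for target, t_target in onset.items():
        parents[target] = {}
        if t_target is None:
            continue  # a gene that never changes gets no candidate parents
        for regulator, t_reg in onset.items():
            if regulator == target or t_reg is None:
                continue
            if t_reg <= t_target:
                parents[target][regulator] = t_target - t_reg
    return parents


# Hypothetical trajectories sampled at four time points.
trajectories = {
    "A": [1.0, 1.5, 1.6, 1.6],
    "B": [1.0, 1.0, 1.4, 1.5],
    "C": [1.0, 1.0, 1.0, 1.4],
}
print(potential_parents(trajectories))
```

Restricting each target to the regulators returned here shrinks the set of parent subsets that a structure-learning step has to score, which is where the reduction in computational cost comes from.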
Inferring networks using a method that combines RCR and TDBN
In this combined method, we first used the simple RCR model to find key genes from the given heterozygous knockdown data and null-mutant knockout data. These key genes have a higher potential than other genes to play critical roles in simulated GRNs. After the data were preprocessed, we constructed a gene interaction network that indicated potential regulation among the selected key genes. The TDBN method was then used to infer another GRN from time series trajectory datasets. If a gene interaction existed in both the network inferred by RCR and the network inferred by TDBN, we chose it as a predicted edge in our final inferred network. The predicted networks were assessed against the benchmark networks [13],[14].
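The final combination step amounts to intersecting two edge sets. A minimal sketch, assuming each network is represented as a collection of directed (regulator, target) pairs; the function name and the example edges are hypothetical:

```python
def consensus_edges(rcr_edges, tdbn_edges):
    """Keep only the directed edges (regulator, target) found by both methods."""
    return set(rcr_edges) & set(tdbn_edges)


# Hypothetical edge lists from the two inference steps.
rcr_network = [("G1", "G2"), ("G1", "G3"), ("G4", "G2")]
tdbn_network = [("G1", "G2"), ("G3", "G1"), ("G4", "G2")]
print(sorted(consensus_edges(rcr_network, tdbn_network)))
```

Because the pairs are ordered, an interaction predicted with opposite directions by the two methods, such as G1 and G3 above, is not retained.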
Results and discussion
Inferred networks as compared with the true networks
In this work, our approach was applied to inferring GRNs in three different ways: For in silico networks with 10 genes, the gene regulatory networks were inferred only by the RCR method from steady state data, in which we used mainly the gene knockout dataset; for networks with 50 genes, the networks inferred using RCR and TDBN separately were combined into the final networks; for networks with 100 genes, we used only TDBN to reconstruct gene networks from time series trajectory gene expression dataset. In doing this, we sought to determine which method had better performance in inferring gene regulatory networks.
Performance of network inference from synthetic datasets
The performance of each method was evaluated by two metrics: the area under the precision-recall (AUPR) curve and the area under the receiver operating characteristic (AUROC) curve for the whole set of edge predictions for 15 networks [13],[14]. Precision is a measure of fidelity, whereas recall is a measure of completeness. Recall (R) is defined as $R = \mathrm{Ce}/(\mathrm{Ce}+\mathrm{Me})$ and precision (P) as $P = \mathrm{Ce}/(\mathrm{Ce}+\mathrm{Fe})$, where Ce is the number of correct edges, Me is the total number of missed edges (missed errors), and Fe is the number of false alarm errors. A missed error is a connection between genes that exists in the true network but that the inference algorithm misses or predicts with the wrong orientation. A false alarm error is a connection that the inference algorithm creates but that does not exist in the true network.
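As a worked illustration of these definitions, the sketch below counts Ce, Fe, and Me for a single predicted network. It assumes edges are directed (regulator, target) pairs, so a connection predicted with the wrong orientation is counted as missed (and, under this representation, also as a false alarm). The function name and example networks are hypothetical.

```python
def precision_recall(predicted_edges, true_edges):
    """Return (precision, recall) for directed edge sets."""
    predicted, truth = set(predicted_edges), set(true_edges)
    ce = len(predicted & truth)   # correct edges
    fe = len(predicted - truth)   # false alarm errors
    me = len(truth - predicted)   # missed errors (includes wrong orientations)
    precision = ce / (ce + fe) if predicted else 0.0
    recall = ce / (ce + me) if truth else 0.0
    return precision, recall


# Hypothetical gold standard and prediction.
gold = {("A", "B"), ("B", "C"), ("C", "D")}
pred = {("A", "B"), ("C", "B"), ("A", "D")}
p, r = precision_recall(pred, gold)
print(f"precision={p:.3f} recall={r:.3f}")
```

Here Ce = 1, Fe = 2, and Me = 2, so both precision and recall equal 1/3.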
A P value is the probability that a given or larger area under the curve value is obtained by random ordering of the T potential network links. An overall P value is the geometric mean of the n individual P values, calculated as $(p_1 \times p_2 \times \dots \times p_n)^{1/n}$. An overall AUROC P value represents the geometric mean of the five AUROC P values (Ecoli1, Ecoli2, Yeast1, Yeast2, and Yeast3). An overall AUPR P value is the geometric mean of the five AUPR P values.
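The geometric mean above can be computed as in the following sketch. Working in log space is a design choice of this example, made because the product of several very small P values can underflow in floating point; the function name and the input values are hypothetical.

```python
import math


def overall_p_value(p_values):
    """Geometric mean (p_1 * ... * p_n)^(1/n), computed via logarithms."""
    if not p_values or any(p <= 0 for p in p_values):
        raise ValueError("need a non-empty list of strictly positive P values")
    mean_log = sum(math.log(p) for p in p_values) / len(p_values)
    return math.exp(mean_log)


# Hypothetical individual P values for three networks.
print(overall_p_value([1e-2, 1e-4, 1e-6]))
```

For these inputs the product is 1e-12 and its cube root is 1e-4, up to floating-point rounding.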
To calculate AUPR and AUROC, each predicted network was submitted in the form of ranked lists of predicted edges. The lists were ordered according to the confidence of the predictions so that the first entry corresponded to the edge predicted with the highest confidence. In other words, the edges at the top of the list were believed to be present in the network, and the edges at the bottom of the list were believed to be absent from the network [13].
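To make the link between a ranked edge list and the two areas concrete, here is a minimal sketch. It is not the official DREAM3 scoring code: it assumes the list ranks every potential edge without ties, computes AUROC as the fraction of (true, false) edge pairs that are ordered correctly, and approximates AUPR by the step-wise sum known as average precision, which may differ from the interpolation used by the challenge organizers. All names and example edges are hypothetical.

```python
def auroc_aupr(ranked_edges, true_edges):
    """Return (AUROC, AUPR) for a confidence-ranked list of directed edges.

    ranked_edges: all candidate edges, most confident first, no ties.
    true_edges: edges present in the gold standard network.
    """
    truth = set(true_edges)
    labels = [edge in truth for edge in ranked_edges]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ranking must contain both true and false edges")

    tp = fp = 0
    auroc = aupr = 0.0
    for is_true in labels:
        if is_true:
            tp += 1
            # Recall rises by 1/n_pos; weight that step by the current precision.
            aupr += (tp / (tp + fp)) / n_pos
        else:
            fp += 1
            # Each false edge adds a ROC column of width 1/n_neg and height TPR.
            auroc += (tp / n_pos) / n_neg
    return auroc, aupr


# Hypothetical ranking of four candidate edges; two are in the gold standard.
ranking = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
gold_standard = {("A", "B"), ("B", "C")}
area_roc, area_pr = auroc_aupr(ranking, gold_standard)
print(f"AUROC={area_roc:.3f} AUPR={area_pr:.3f}")
```

A ranking that places all true edges first yields 1.0 for both areas, whereas a random ordering gives an AUROC near 0.5 and an AUPR near the fraction of candidate edges that are true.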
Assessment metrics for the first set of E. coli and yeast networks inferred using our approach
Metrics   Ecoli1_10  Yeast1_10  Ecoli1_50  Yeast1_50  Ecoli1_100  Yeast1_100
AUPR      5.43E-01   7.71E-01   6.71E-01   4.86E-01   1.45E-02    1.55E-02
AUROC     7.94E-01   9.44E-01   8.62E-01   8.35E-01   5.21E-01    4.61E-01
P_AUPR    1.34E-04   2.09E-06   8.57E-55   3.91E-39   2.27E-01    8.91E-01
P_AUROC   5.47E-04   1.29E-06   3.19E-20   4.64E-18   2.02E-01    9.60E-01

Metrics        10 nodes  50 nodes  100 nodes
Overall AUPR   1.09E-04  2.54E-46  4.83E-03
Overall AUROC  2.10E-04  8.19E-18  2.13E-02
Role of RCR and TDBN in network inference
Overall performance of our approach for predicting all five sets of networks of different sizes
Size  Metrics  Ecoli1  Ecoli2  Yeast1  Yeast2  Yeast3
10    AUPR     0.544   0.748   0.771   0.352   0.493
10    AUROC    0.794   0.856   0.944   0.590   0.715
50    AUPR     0.671   0.672   0.486   0.367   0.381
50    AUROC    0.862   0.842   0.836   0.688   0.728
100   AUPR     0.015   0.052   0.016   0.046   0.044
100   AUROC    0.521   0.544   0.461   0.576   0.428
Impact of RCR threshold on network inference accuracy
Conclusions
In this study, a novel relative change ratio method was proposed to preprocess the null-mutant steady state data in order to find the key genes and build GRNs, in which these selected key genes have a higher potential than other genes to play critical roles. Then, TDBN was used to infer GRNs from time series trajectory data, and the results were combined with the knowledge gained in the initial step. Finally, the inferred networks were evaluated using the AUPR and AUROC metrics over the whole set of edge predictions for each network. The overall prediction results suggest that our approach was able to infer gene regulatory networks from in silico DREAM challenge data efficiently and accurately in comparison with other participating teams. We are confident that the DREAM project will eventually lead the reverse engineering community to resolve technical problems and overcome barriers between research groups towards reliable and accurate GRN inference from high-dimensional gene expression data.
Abbreviations
 AUPR: area under the precision-recall curve
 AUROC: area under the receiver operating characteristic (ROC) curve
 DREAM: Dialogue for Reverse Engineering Assessments and Methods
 GRN: gene regulatory network
 RCR: relative change ratio
 TDBN: time-delayed dynamic Bayesian network
Declarations
Acknowledgements
We would like to thank Gustavo Stolovitzky for organizing the DREAM3 challenge and thank Daniel Marbach and his colleagues from the Laboratory of Intelligent Systems of the Swiss Federal Institute of Technology in Lausanne for providing the challenge datasets. This work was supported by the Environmental Quality and Installation Technologies Research Program of the US Army Corps of Engineers under contract #W912HZ05P0145. Permission was granted by the Chief of Engineers to publish this information.
Authors’ Affiliations
References
 1. Lähdesmäki H, Shmulevich I, Yli-Harja O: On learning gene regulatory networks under the Boolean network model. Mach. Learn. 2003, 52(1-2):147-167. 10.1023/A:1023905711304
 2. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS: Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007, 5(1):e8. 10.1371/journal.pbio.0050008
 3. Chen I, He HL, Church GM: Modeling gene expression with differential equations. Pac. Symp. Biocomput. 1999, 4:29-40.
 4. Liang S, Fuhrman S, Somogyi R: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pac. Symp. Biocomput. 1998, 3:18-29.
 5. Imoto S, Goto T, Miyano S: Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pac. Symp. Biocomput. 2002, 7:175-186.
 6. Stolovitzky G, Prill RJ, Califano A: Lessons from the DREAM2 challenges. Ann. N Y Acad. Sci. 2009, 1158(1):159-195. 10.1111/j.1749-6632.2009.04497.x
 7. Mendes P, Sha W, Ye K: Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics 2003, 19(2):122-129.
 8. Marbach D, Schaffter T, Mattiussi C, Floreano D: Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J. Comput. Biol. 2009, 16(2):229-239. 10.1089/cmb.2008.09TT
 9. Zou M, Conzen SD: A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005, 21(1):71-79. 10.1093/bioinformatics/bth463
 10. Yu H, Luscombe NM, Qian J, Gerstein M: Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet. 2003, 19:422-427. 10.1016/S0168-9525(03)00175-6
 11. Stolovitzky G, Monroe D, Califano A: Dialogue on reverse-engineering assessment and methods: the dream of high-throughput pathway inference. Ann. N Y Acad. Sci. 2007, 1115:1-22. 10.1196/annals.1407.021
 12. Cantone I, Marucci L, Iorio F, Ricci MA, Belcastro V, Bansal M, Santini S, Bernardo MD, Bernardo DD, Cosma MP: A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell 2009, 137:172-181. 10.1016/j.cell.2009.01.055
 13. Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G: Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl. Acad. Sci. U S A 2010, 107(14):6286-6291. 10.1073/pnas.0913357107
 14. Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G, Stolovitzky G: Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One 2010, 5(2):e9202. 10.1371/journal.pone.0009202
 15. Lähdesmäki H, Hautaniemi S, Shmulevich I, Yli-Harja O: Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks. Signal Process. 2006, 86(4):814-834. 10.1016/j.sigpro.2005.06.008
 16. Friedman N, Murphy K, Russell S: Learning the structure of dynamic probabilistic networks. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI) 1998, 139-147.
 17. Murphy K: Dynamic Bayesian networks: representation, inference and learning. PhD Dissertation, University of California, Berkeley; 2002.
 18. Murphy K, Mian S: Modeling gene expression data using dynamic Bayesian networks. Technical report, Computer Science Division, University of California, Berkeley, CA; 1999.