- Research Article
- Open Access

# Modelling Transcriptional Regulation with a Mixture of Factor Analyzers and Variational Bayesian Expectation Maximization

- Kuang Lin^{1} and
- Dirk Husmeier^{1}

**2009**:601068

https://doi.org/10.1155/2009/601068

© K. Lin and D. Husmeier. 2009

**Received:** 2 December 2008 **Accepted:** 27 February 2009 **Published:** 12 April 2009

## Abstract

Understanding the mechanisms of gene transcriptional regulation through analysis of high-throughput postgenomic data is one of the central problems of computational systems biology. Various approaches have been proposed, but most of them fail to address at least one of the following objectives: (1) allow for the fact that transcription factors are potentially subject to posttranscriptional regulation; (2) allow for the fact that transcription factors cooperate as a functional complex in regulating gene expression, and (3) provide a model and a learning algorithm with manageable computational complexity. The objective of the present study is to propose and test a method that addresses these three issues. The model we employ is a mixture of factor analyzers, in which the latent variables correspond to different transcription factors, grouped into complexes or modules. We pursue inference in a Bayesian framework, using the Variational Bayesian Expectation Maximization (VBEM) algorithm for approximate inference of the posterior distributions of the model parameters, and estimation of a lower bound on the marginal likelihood for model selection. We have evaluated the performance of the proposed method on three criteria: activity profile reconstruction, gene clustering, and network inference.

## Keywords

- Markov Chain Monte Carlo
- Transcriptional Regulatory Network
- Network Component Analysis
- Maximum Likelihood Factor Analysis
- Death Move

## 1. Introduction

Transcriptional gene regulation is a complex process that utilizes a network of interactions. This process is primarily controlled by diverse regulatory proteins called transcription factors (TFs), which bind to specific DNA sequences and thereby repress or initiate gene expression. Transcriptional regulatory networks control the expression levels of thousands of genes as part of diverse biological processes such as the cell cycle, embryogenesis, host-pathogen interactions, and circadian rhythms. Determining accurate models for TF-gene regulatory interactions is thus an important challenge of computational systems biology. Most recent studies of transcriptional regulation can be placed broadly in one of three categories.

Approaches in the first class attempt to build quantitative models to associate gene expression levels, as typically obtained from microarray experiments, with putative binding motifs on the gene promoter sequences. Bussemaker et al. [1] and Conlon et al. [2] propose a linear regression model for the dependence of the log gene expression ratio on the presence of regulatory sequence motifs. Beer and Tavazoie [3] cluster gene expression profiles in a preliminary data analysis based on correlation, and then apply a Bayesian network classifier to predict cluster membership from sequence motifs. Phuong et al. [4] use multivariate decision trees to find motif combinations that define homogeneous groups of genes with similar expression profiles. Segal et al. [5] cluster genes with a probabilistic generative model that systematically integrates gene expression profiles with regulatory sequence motifs.

A shortcoming of the methods in the first class is that the activities of the TFs are not included in the model. This limitation is addressed by models in the second class, which predict gene expression levels from both binding motifs on promoter sequences and the expression levels of putative regulators. Middendorf et al. [6, 7] approach this problem as a binary classification task to predict up- and down-regulation of a gene from a combination of a motif presence/absence indication and the discrete state of a putative regulator. The bidimensional regression trees of Ruan and Zhang [8] are based on a similar idea, but avoid the information loss inherent in the binary gene expression discretization.

What ultimately matters for gene regulation are the TF *activities*, that is, the concentration of the TF subpopulation capable of DNA binding. The methods in the second class approximate the activities of TFs by their gene expression levels. However, TFs are frequently subject to post-translational modifications, which may affect their DNA binding capability. Consequently, gene expression levels of TFs contain only limited information about their actual activities. The methods in the third class address this shortcoming by treating TFs as latent or hidden components. The regulatory system is modelled as a bipartite network, as shown in Figure 1(a), in which high-dimensional output data are driven by low-dimensional regulatory signals. The high-dimensional output data correspond to the expression levels of a large number of regulated genes. The regulators correspond to a comparatively small number of TFs, whose activities are unknown. Various authors have applied latent variable models like principal component analysis (PCA), factor analysis (FA), and independent component analysis (ICA) to determine a low-dimensional representation of high-dimensional gene expression profiles; for example, Raychaudhuri et al. [9] and Liebermeister [10]. However, these approaches provide only a phenomenological modelling of the observed data, and the hidden components do not correspond to identified TFs. Liao et al. [11] and Kao et al. [12] address this shortcoming by including partial prior knowledge about TF-gene interactions, as obtained from Chromatin Immunoprecipitation (ChIP) experiments [13] or binding motif finding algorithms (e.g., Bailey and Elkan [14]; Hughes et al. [15]). Their network component analysis (NCA) is equivalent to a constrained maximum likelihood procedure in the presence of Gaussian noise and independent hidden components; the latter represent the TF activities.
A major limitation of NCA is the fact that the constraints on the connectivity pattern of the bipartite network are rigid, which does not allow for the noise intrinsic to immunoprecipitation experiments or sequence motif detection. Sabatti and James [16] and Sanguinetti et al. [17] address this shortcoming by proposing an approach based on Bayesian factor analysis, in which prior knowledge about TF-gene interactions naturally enters the model in the form of a prior distribution on the elements of the loading matrix. Pournara and Wernisch [18] propose an alternative approach based on maximum likelihood, where the loading matrix is orthogonally rotated towards a target matrix of a priori known TF-gene interactions. All three approaches simultaneously reconstruct the structure of the bipartite regulatory network—represented by the loading matrix—and the TF activity profiles—represented by the hidden factors—from gene expression data and (noisy) prior knowledge about TF-gene interactions. In a recent generalization of these approaches, Shi et al. [19] have introduced a further latent variable to indicate whether a TF is transcriptionally or posttranscriptionally regulated.

Contrary to the methods in the first two classes, the methods in the third class do not incorporate interaction effects between TFs, though. This is a major limitation, since especially in higher eukaryotes transcription factors cooperate as a functional complex in regulating gene expression [20, 21]. Boulesteix and Strimmer [22] allow for this complex formation by proposing a latent variable model in which the latent components correspond to groups of TFs. However, their partial-least squares (PLS) approach does not provide a probabilistic model and hence, like NCA, does not allow for the noise inherent in TF binding profiles from immunoprecipitation experiments or sequence motif detection schemes.

In the present paper we aim to combine the advantages of the methods in the three classes summarized above. Like the approaches in the third class, our method is a latent variable model that allows for the fact that owing to post-translational modifications the true TF activities are unknown. Similar to the approaches of the first two classes, our model explicitly incorporates interactions among TFs. Inspired by Boulesteix and Strimmer [22], we aim to group individual TFs into TF modules, as illustrated in Figure 1(b). To allow for the noise inherent in both gene expression levels and TF binding profiles, we use a proper probabilistic generative model, like Sanguinetti et al. [17] and Sabatti and James [16]. Our work is based on the work of Beal [23]. We apply a mixture of factor analyzers model, in which each component of the mixture corresponds to a TF complex composed of several TFs. This approach allows for the fact that TFs are not independent. By explicitly including this in our model we would expect to end up with fewer parameters, and hence more stable inference. To further improve the robustness of this approach, we pursue inference in a Bayesian framework, which includes a model selection scheme for estimating the number of TF complexes. We systematically integrate gene expression data and TF binding profiles, and treat both as *data*. This appears methodologically more consistent than the approach in Sanguinetti et al. [17] and Sabatti and James [16], where TF binding data are treated as *prior knowledge*. Our paper is organized as follows. In Section 2 we review Bayesian factor analysis applied to modelling transcriptional regulation. In Section 3 we discuss how TF complexes and interaction effects among TFs can be modelled with a mixture of factor analyzers. The data used for the evaluation of the method are described in Section 4. 
Section 5 provides three types of results: the reconstruction of the unknown TF activity profiles is discussed in Section 5.1, gene clustering in Section 5.2, and the reconstruction of the transcriptional regulatory network in Section 5.3. We conclude our paper in Section 6 with a summary and a brief outlook on future work.

## 2. Background

In this section, we will briefly review the application of Bayesian factor analysis to transcriptional regulation. To keep the notation simple, we use the same letter, $p$, for every probability distribution, even though they might be of different functional forms. The form of $p(\cdot)$ will become clear from its argument, with $p(x)$ and $p(y)$ denoting different distributions (strictly speaking, this should be written as $p_x(x)$ and $p_y(y)$). Variational distributions will be written as $q(\cdot)$. We do not distinguish between random variables and their realization in our notation. However, we do distinguish between scalars and vectors/matrices, using bold-face letters for the latter, and using a superscript to denote transposition.

where the values of allow the inclusion of prior knowledge about TF-gene regulatory interactions, as obtained, for example, from immunoprecipitation experiments or sequence motif finding algorithms.

The objective of Bayesian inference is to learn the posterior distribution of the model parameters and latent variables. Since this distribution does not have a closed form, approximate procedures have to be adopted. Sabatti and James [16] follow a Markov chain Monte Carlo (MCMC) approach based on the collapsed Gibbs sampler. Here, each of the parameters and latent variables is sampled separately from a closed-form distribution that depends on sufficient statistics defined by the other parameters and latent variables, and the procedure is iterated until some convergence criterion is met. Sanguinetti et al. [17] follow an alternative approach based on Variational Bayesian Expectation Maximization (VBEM), where the joint posterior distribution of the parameters and latent variables is approximated by a product of model distributions for which closed-form solutions can be obtained; see Section A.1 of the appendix.
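
As a reminder of the generic idea behind VBEM, the bound that the variational approximation optimizes can be written as follows. This is a sketch in illustrative notation that is assumed here and is not necessarily the notation of Section A.1: $D$ denotes the data, $Z$ the latent variables, $\theta$ the parameters, and $q(Z)\,q(\theta)$ the factorized variational distribution.

```latex
\ln p(D)
  = \ln \int p(D, Z, \theta)\, dZ\, d\theta
  \;\ge\; \int q(Z)\, q(\theta)\,
      \ln \frac{p(D, Z, \theta)}{q(Z)\, q(\theta)}\, dZ\, d\theta
  \;=\; \mathcal{F}[q],
\qquad
\ln p(D) - \mathcal{F}[q]
  = \mathrm{KL}\!\left( q(Z)\, q(\theta) \,\middle\|\, p(Z, \theta \mid D) \right) \;\ge\; 0 .
```

The inequality follows from Jensen's inequality. Maximizing $\mathcal{F}$ alternately with respect to $q(Z)$ and $q(\theta)$ therefore tightens the bound and moves the product distribution closer, in the Kullback-Leibler sense, to the true joint posterior.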

## 3. Method

where and define the prior distributions on the parameters, as discussed below. The resulting model can be interpreted as follows: represents the composition of the th transcriptional module, that is, it indicates which TFs bind cooperatively to the promoters of the regulated genes. allows for perturbations that result, for example, from the temporary inaccessibility of certain binding sites or a variability of the binding affinities caused by external influences. is the "background" gene expression profile. represents the activity profile of the th transcriptional module, which modulates the expression levels of the regulated genes. describes the gene-specific susceptibility to transcriptional regulation, that is, to what extent the expression of the th gene is influenced by the binding of a transcriptional module to its promoter. Naturally, this information is contained in the expression profiles and TF binding profiles of the genes that are (softly) assigned to the th mixture component, while (12) and (13) provide a mechanism to allow for the noise in the data.

Here is an alternative interpretation of our model, which is based on the assumption that a variation of gene expression is brought about by different TFs binding in different proportions to the promoter. In the ideal case, genes with the same TFs binding in identical proportions to the promoter should have identical gene expression profiles; this is expressed in our model by (the proportions of TFs binding to the promoter), and (the "background" gene expression profile associated with the idealized binding profile of the TFs). Obviously, this model is oversimplified. There are two reasons why gene expression profiles might deviate from this idealized profile. The first reason is measurement errors and stochastic fluctuations unrelated to the TFs. These influences are incorporated in the additive term in (12). The second reason is variations in the TF binding affinities, their activities and binding capabilities. These variations are captured by the vector . The changes in the way TFs bind to the promoter will result in deviations of the gene expression profiles from the idealized "background" distribution; these deviations are defined by the vector . We assume that if the deviation of the TF binding profiles from the idealized binding profile is small, the deviation from the "background" gene expression profile will be small. Conversely, if the TFs show a considerable deviation from the idealized binding profile , then the gene expression profile will show a substantial deviation from the idealized expression profile . We therefore scale both and by the same gene-specific factor ; this enforces a hard association between the two effects described above. Weakening this association would be biologically more realistic, but at the expense of increased model complexity.

where all symbols are defined in Figure 2 and in the text; see [23, equation (4.29)]. The variational E- and M-steps of the VBEM algorithm are derived as in Section A.1 by setting to zero the functional derivatives of the lower bound with respect to the different (hyper-)parameters and latent variables under consideration of possible normalization constraints, along the lines of (A.4)–(A.7). The derivations can be found in Beal [23]. A summary of the update equations is provided in the appendix, Section A.2. The various (hyper-)parameters and latent variables are updated iteratively according to these equations, assuming the variational distributions for the other (hyper-)parameters and latent variables are fixed. The algorithm is iterated until a stationary point of the lower bound is reached.

The final issue to address is model selection, that is, selecting the number of mixture components. Following Beal [23], we have not placed a prior distribution on this number, but instead have placed a symmetric Dirichlet prior over the mixture proportions; see (11). Equation (22) provides a lower bound on the marginal likelihood, where the model is defined by the number of mixture components. In order to navigate in the space of different model complexities, we use the scheme of birth and death moves proposed in Beal [23]. This scheme can be seen as the VBEM equivalent of reversible jump MCMC [31]. Via a birth or a death move, a component is introduced into or removed from the mixture model, respectively. The VBEM algorithm, outlined in the present section and stated in more detail in the appendix, Section A.2, is then applied until a measure of convergence is reached. On convergence, the move is accepted if the lower bound of (22) has increased, and rejected otherwise. Another birth/death proposal is then made, and the procedure is repeated until no further proposals are accepted. Further details of this birth/death scheme can be found in Beal [23]. Note that these birth and death moves also help avoid local maxima of the lower bound, in a similar manner as discussed in Ueda et al. [32].
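
The accept/reject logic of the birth and death moves can be sketched as a greedy search over the number of mixture components. This is a simplified illustration under stated assumptions, not the scheme of Beal [23] itself: the function `bound(k)` is a hypothetical stand-in for running VBEM to convergence with `k` components and returning the lower bound of (22), and the way a newly born component is initialized is omitted entirely.

```python
from typing import Callable, Tuple


def birth_death_search(bound: Callable[[int], float],
                       k_init: int = 1,
                       k_max: int = 50) -> Tuple[int, float]:
    """Greedy search over the number of mixture components.

    `bound(k)` is a hypothetical placeholder: it stands for fitting the
    model with k components until convergence and returning the lower bound.
    """
    k, best = k_init, bound(k_init)
    improved = True
    while improved:
        improved = False
        # A birth move proposes k + 1 components, a death move proposes k - 1.
        for proposal in (k + 1, k - 1):
            if proposal < 1 or proposal > k_max:
                continue
            value = bound(proposal)
            if value > best:  # accept the move only if the bound has increased
                k, best = proposal, value
                improved = True
                break
    # Stop once neither a birth nor a death proposal is accepted.
    return k, best


def toy_bound(k: int) -> float:
    """Toy stand-in for the lower bound, peaking at k = 4 (illustration only)."""
    return -(k - 4) ** 2


print(birth_death_search(toy_bound))
```

In the real algorithm each call to `bound` is an expensive VBEM run, so the number of proposals matters far more than in this toy example.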

## 4. Data

We tested the performance of the proposed method on both simulated and real gene expression and TF binding data. The first approach has the advantage that the regulatory network structure and the activities of the TF complexes are known, which allows us to assess the prediction performance of the model against a known gold standard. However, the data generation mechanism is an idealized simplification of real biological processes. We therefore also tested the model on gene expression data and TF binding profiles from *Saccharomyces cerevisiae*. Although *S. cerevisiae* has been widely used as a model organism in computational biology, we still lack any reliable gold standard for the underlying regulatory network, and therefore need to use alternative evaluation criteria, based on out-of-sample performance. We will describe the data sets in the present section, and discuss the evaluation criteria together with the results in Section 5.

### 4.1. Synthetic Gene Expression and TF Binding Data

We generated synthetic data to simulate both the processes of transcriptional regulation as well as noisy data acquisition. We started from the activities of the TF protein complexes that regulate the genes by binding to their promoters. Note that owing to post-translational modifications these activities are usually not amenable to microarray experiments and therefore remain hidden. The advantage of the synthetic data is that we can assess to what extent these activities can be reconstructed from the gene expression profiles of the regulated genes.

*gene expression profile* we mean the vector of log gene expression ratios with respect to a control) were given by

Here we have assumed that each gene is regulated by a single TF complex. Note, however, that an individual TF can be involved in more than one TF module and therefore contribute to the regulation of different subsets of genes, as illustrated in Figure 1. Recall that TF modules are protein complexes composed of various TFs. In practice, we usually have only noisy indications about protein complex formations (e.g., from yeast 2-hybrid assays), and binding data are usually available for individual TFs (from binding motif similarity scores or immunoprecipitation experiments). In our simulation experiment we therefore assumed that the composition of the TF complexes was unknown, and that noisy binding data were available for individual TFs, as described shortly.

In the real world, TF binding data—whether obtained from gene upstream sequences via a motif search or from immunoprecipitation experiments—are not free of errors, and we therefore modelled two noise scenarios for two different data formats. In the first TF binding set, the non-binding elements were sampled from the beta distribution and the binding elements from . For the second TF binding set, we chose and correspondingly. The resulting TF binding patterns are shown in Figures 4(c), 4(d).
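
The noise model described above, in which binding and non-binding entries are drawn from two different beta distributions, can be sketched as follows. The shape parameters used here, (2, 4) and (4, 2), and the small ground-truth matrix are placeholders chosen for illustration only; they are assumptions and not the values used in the study.

```python
import random


def noisy_binding_matrix(truth, nonbinding_ab, binding_ab, seed=0):
    """Replace each 0/1 entry of a genes-by-TFs matrix with a beta-distributed score.

    `nonbinding_ab` and `binding_ab` are (alpha, beta) shape parameters for
    the non-binding and binding entries; both are hypothetical inputs.
    """
    rng = random.Random(seed)
    return [
        [rng.betavariate(*(binding_ab if is_bound else nonbinding_ab))
         for is_bound in row]
        for row in truth
    ]


# Hypothetical ground truth: 3 genes (rows) by 3 TFs (columns).
truth = [[1, 0, 0],
         [0, 1, 1],
         [1, 1, 0]]

noisy = noisy_binding_matrix(truth, nonbinding_ab=(2.0, 4.0), binding_ab=(4.0, 2.0))
for row in noisy:
    print(["%.2f" % value for value in row])
```

Moving the two beta distributions closer together makes binding and non-binding entries harder to distinguish, which is one way to realize different noise scenarios.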

### 4.2. Gene Expression and TF Binding Data from Yeast

For evaluating the inference of transcriptional regulation in real organisms, we chose gene expression and TF binding data from the widely used model organism *Saccharomyces cerevisiae* (baker's yeast). For the clustering experiments, we combined ChIP-chip binding data of 113 TFs from Lee et al. [34] with two different microarray gene expression data sets. From the Spellman set [35], the expression levels of 3638 genes at 24 time points were used. From the Gasch set [36], the expression values of 1993 genes at 173 time points were taken. For evaluating the regulatory network reconstruction, we used the gene expression data from Mnaimneh et al. [37] and the TF binding profiles from YeastTract [38]. YeastTract provides a comprehensive database of transcriptional regulatory associations in *S. cerevisiae*, and is publicly available from http://www.yeastract.com/. Our combined data set thus included the expression levels of 5464 genes under 214 experimental conditions and binary TF binding patterns associating these genes with 169 TFs.

## 5. Results and Discussion

Table 1: Overview of methods.

Method | Description |
---|---|
PLS | The partial least squares approach proposed by Boulesteix and Strimmer [22], using the software provided by the authors. Note that the method treats TF-gene interactions as fixed constants that cannot be changed in light of the gene expression data. Hence, this approach cannot be used for network reconstruction and was only applied for reconstructing the TF activity profiles. |
FA | Maximum likelihood factor analysis, effected with the EM algorithm of Ghahramani and Hinton [24] and a subsequent varimax rotation [39] of the loading matrix towards maximum sparsity, as proposed in Pournara and Wernisch [18]. |
BFA-Gibbs | Bayesian factor analysis of Sabatti and James [16], trained with Gibbs sampling. The TF regulatory network is obtained from the posterior expected loading matrix via (A.32) and (A.35). |
MFA-VBEM | The proposed mixture of factor analyzers model, shown in Figure 2 and discussed in Section 3, trained with variational Bayesian Expectation Maximization. The approach is based on the work of Beal [23], with the extension described in the text. The TF regulatory network is obtained from (24) and (25) for the curation and prediction tasks, respectively. |

### 5.1. Activity Profile Reconstruction

Since TF activity profiles are not available for real data, we used the synthetic data of Section 4.1 to evaluate the profile reconstruction performance of the model. We have compared the proposed MFA-VBEM model with the partial least-squares (PLS) approach of Boulesteix and Strimmer [22], and with the Bayesian factor analysis model using Gibbs sampling (BFA-Gibbs), as proposed in Sabatti and James [16].

The PLS approach of Boulesteix and Strimmer [22] is formally equivalent to the FA model of equation (A.3). However, the loading matrix, which linearly maps latent variables onto genes, is decomposed into two matrices: one describing the interactions between TFs and genes, and one defining how the TFs interact to form modules; see Figure 1(b). The elements of the first matrix are fixed, taken from TF binding data (e.g., immunoprecipitation experiments or binding motifs). In the present example, the binding matrices of Figures 4(c), 4(d) were used. The elements of the second matrix are optimized so as to minimize the sum-of-squares deviation between the measured and reconstructed gene expression profiles subject to an orthogonality constraint for the latent profiles. These latent profiles are the predicted activity profiles of the TF modules. A cross-validation approach can in principle be used to optimize the number of TF modules. However, for ease of comparability of the reconstructed activity profiles with those obtained with the other methods, we set this number to the correct number of TF modules. We carried out the evaluation using the software provided in Boulesteix and Strimmer [22], using the default parameters.

The BFA-Gibbs method of Sabatti and James [16] corresponds to a Bayesian FA model with a mixture prior on the elements of the loading matrix, which incorporates the information from immunoprecipitation experiments or binding motif search algorithms. In other words, the TF binding data, which in the present evaluation were the binding matrices of Figure 4, enter the model via the prior on the loading matrix, using (7)–(9). We sampled all parameters with the Gibbs sampling method of Sabatti and James [16], using the authors' programs, and applying standard diagnostic tools [41] to test for convergence of the Markov chains. The predicted activity profiles are the posterior averages of the latent factor profiles, computed from (4) in Sabatti and James [16].

For the proposed MFA-VBEM model, the activity profile of each TF module is given by a posterior average that involves the loading vector associated with that module; this posterior average is obtained with the VBEM algorithm, using (A.17). The birth and death moves of the VBEM scheme, explained in Section 3, allow an estimation of the marginal posterior probability of the number of TF modules, which was found to peak at the correct value. For a comparison with the alternative approaches, the simulations were repeated with the number of modules fixed at this value.

Table 2: Reconstruction of TF complex activity profiles.

Binding data | Profile length | Method | N1 | N2 | N3 |
---|---|---|---|---|---|
B1 | L1 | PLS | 0.52 | 0.53 | 0.52 |
B1 | L1 | BFA | 0.87 | 0.69 | 0.76 |
B1 | L1 | MFA | 0.77 | 0.80 | 0.73 |
B1 | L2 | PLS | 0.52 | 0.52 | 0.52 |
B1 | L2 | BFA | 0.84 | 0.68 | 0.59 |
B1 | L2 | MFA | 0.89 | 0.71 | 0.60 |
B1 | L3 | PLS | 0.53 | 0.52 | 0.52 |
B1 | L3 | BFA | 0.90 | 0.75 | 0.56 |
B1 | L3 | MFA | 0.94 | 0.87 | 0.40 |
B2 | L1 | PLS | 0.53 | 0.52 | 0.52 |
B2 | L1 | BFA | 0.92 | 0.89 | 0.78 |
B2 | L1 | MFA | 0.88 | 0.83 | 0.71 |
B2 | L2 | PLS | 0.52 | 0.51 | 0.52 |
B2 | L2 | BFA | 0.83 | 0.72 | 0.72 |
B2 | L2 | MFA | 0.95 | 0.85 | 0.71 |
B2 | L3 | PLS | 0.52 | 0.51 | 0.52 |
B2 | L3 | BFA | 0.90 | 0.73 | 0.67 |
B2 | L3 | MFA | 0.98 | 0.94 | 0.63 |

A comparison between BFA-Gibbs and MFA-VBEM shows that BFA-Gibbs tends to outperform MFA-VBEM when the expression profiles are short (length L1) or when the noise level is high (N3). This could be a consequence of the different inference schemes ("VBEM" versus "Gibbs"). Short expression profiles and high noise levels lead to diffuse posterior distributions of the parameters. Variational learning—as opposed to Gibbs sampling—is known to lead to a systematic underestimation of the posterior variation [42], which could be a disadvantage here. However, MFA-VBEM consistently outperforms BFA-Gibbs on the longer expression profiles with lengths L2 and L3, and the lower noise levels N1 and N2. We would argue that this improvement in the performance is a consequence of the more parsimonious model ("MFA") that results when allowing for the fact that TFs are non-independent, which leads to greater robustness of inference and reduced susceptibility to over-fitting.

### 5.2. Gene Clustering

Following up on the seminal work of Eisen et al. [45], there has been considerable interest in clustering genes based on their expression patterns. The premise is the guilt-by-association hypothesis, according to which similarity in the expression profiles might be indicative of related biological functions. Although the main purpose of the proposed MFA-VBEM method is not one of clustering, it is straightforward to apply it to this end by using the model mixture proportions, which are obtained from the VBEM scheme via (A.22), as indicators of class membership. A convenient feature of the MFA-VBEM scheme is the fact that the number of clusters is identical to the number of mixture components in the model. This number is automatically inferred from the data using the model selection scheme based on birth-death moves, as described in Section 3. MFA-VBEM also allows for a straightforward integration of gene expression profiles with TF binding data.

We applied the MFA-VBEM method to the gene expression and TF binding data of *S. cerevisiae*, described in Section 4.2. For comparison, we also applied two standard clustering algorithms: K-means and hierarchical agglomerative average linkage clustering (see, e.g., Hastie et al. [46]). We used the implementation of these two algorithms in the Bioinformatics Toolbox of MATLAB (version 7.3.0), using default parameters and the default distance measure of 1 minus the absolute Pearson correlation coefficient. Five random starting points were used for each application of K-means, and the most compact cluster formation found was recorded. For hierarchical clustering, we cut the dendrogram at such a distance from the root that the number of resulting clusters equalled the number of clusters used for MFA-VBEM and K-means. Note that unlike the proposed MFA-VBEM approach, K-means and average linkage clustering do not infer the number of clusters automatically from the data. To ensure comparability of the results we therefore set the number of clusters to be identical to the number of mixture components inferred with the MFA-VBEM method. We further included COSA [43] as a more advanced clustering algorithm in our comparison. The idea of clustering objects on subsets of attributes (COSA) is to detect subgroups of objects that preferentially cluster on subsets of the attribute variables rather than on all of them simultaneously. The relevant attribute subsets for each individual cluster can be different or partially overlap with other clusters. The attribute subsets are automatically selected by the algorithm via a weighting scheme that attempts to trade off two effects: (1) the objective to identify homogeneous and coherent clusters, and (2) the influence of an entropic regularization term that penalizes small subset sizes.
In our study, we used the R program written by the authors, which is available from http://www-stat.stanford.edu/~jhf/COSA.html, using the default settings of the parameters. Clusters were obtained from the dendrogram in the same way as for hierarchical agglomerative average linkage clustering, subject to the constraint of having at least three genes in a cluster. Finally, we included Plaid model clustering [44] in our comparative evaluation study. Plaid model clustering is a non-mutually exclusive clustering approach, which allows a gene to have different cluster memberships. For the practical computation we used the Plaid (TM) software copyrighted by Stanford University, which is freely available from the following website: http://www-stat.stanford.edu/~owen/plaid/.
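
The distance measure named above for K-means and average linkage clustering, 1 minus the absolute Pearson correlation coefficient, can be sketched in a few lines. This is a plain re-implementation for illustration, not the MATLAB toolbox code, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def abs_corr_distance(x, y):
    """Distance of 1 minus the absolute Pearson correlation."""
    return 1.0 - abs(pearson(x, y))


profile = [1.0, 2.0, 3.0]
mirrored = [-2.0, -4.0, -6.0]  # perfectly anticorrelated with `profile`
print(abs_corr_distance(profile, mirrored))
```

Because the absolute value is taken, perfectly correlated and perfectly anticorrelated expression profiles both have a distance of (numerically close to) zero, so the measure groups genes whose profiles mirror each other.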

In order to evaluate the predicted clusters with respect to their biological plausibility, we tested them for significant enrichment of gene ontology (GO) annotations. To this end, we used the GO terms from the Saccharomyces genome database (SGD), which are publicly available from http://www.yeastgenome.org/. We assessed the enrichment for annotated GO terms in a given gene cluster with the program Ontologizer [47], using the default parameters. Given a population of genes with associated GO terms, Ontologizer associates each GO term with a p-value. To correct for multiple testing, we controlled the family-wise type-I error conservatively with the Bonferroni correction, using a standard threshold at the 5% significance level. We called a gene cluster "biologically meaningful" if it contained at least one significantly enriched GO term. We restricted this analysis to *specific* GO terms, as generic and biologically uninformative GO terms often tend to show a statistically significant enrichment. Following a recommendation made by one of the referees, we defined GO terms that were four or fewer levels from the roots of the hierarchy defined in the gene ontology (version February 29, 2008) as generic, and discarded them from the subsequent analysis.
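
The decision rule described above can be sketched as follows. This is a minimal illustration of a Bonferroni-corrected test, not the Ontologizer implementation; the function name is hypothetical, and how the number of tests `n_tests` is counted is an assumption left to the caller.

```python
def is_biologically_meaningful(p_values, n_tests, alpha=0.05):
    """Return True if at least one GO term in a cluster is significantly enriched.

    `p_values` are the enrichment p-values of the specific GO terms tested for
    one cluster; `n_tests` is the number of tests corrected for (an assumption).
    """
    threshold = alpha / n_tests  # Bonferroni: divide the level by the number of tests
    return any(p < threshold for p in p_values)


# Hypothetical p-values for two clusters, corrected for 100 tests.
print(is_biologically_meaningful([0.0001, 0.2], n_tests=100))
print(is_biologically_meaningful([0.001, 0.2], n_tests=100))
```

With 100 tests the per-term threshold is 0.0005, so a p-value of 0.001 that would pass an uncorrected 5% test is no longer counted as significant.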

Table 3: Enrichment for GO terms in predicted gene clusters.

Method | Data | Clusters | Biologically meaningful clusters | Genes | Genes in biologically meaningful clusters |
---|---|---|---|---|---|
Average linkage | [35], E | 48 | 10 | 3638 | 1483 |
Average linkage | [36], E | 25 | 7 | 1993 | 1092 |
Average linkage | [35], E+B | 30 | 8 | 3638 | 1148 |
Average linkage | [36], E+B | 17 | 4 | 1993 | 703 |
K-means | [35], E | 48 | 18 | 3638 | 1847 |
K-means | [36], E | 25 | 12 | 1993 | 987 |
K-means | [35], E+B | 30 | 13 | 3638 | 1337 |
K-means | [36], E+B | 17 | 9 | 1993 | 884 |
COSA | [35], E | 48 | 7 | 3638 | 1155 |
COSA | [36], E | 25 | 8 | 1993 | 748 |
COSA | [35], E+B | 30 | 10 | 3638 | 240 |
COSA | [36], E+B | 17 | 4 | 1993 | 16 |
Plaid | [35], E | 48 | 19 | 3638 | 1812 |
Plaid | [36], E | 25 | 10 | 1993 | 770 |
Plaid | [35], E+B | 30 | 11 | 3638 | 626 |
Plaid | [36], E+B | 17 | 9 | 1993 | 636 |
MFA-VBEM | [35], E | 48 | 20 | 3638 | 2415 |
MFA-VBEM | [36], E | 25 | 16 | 1993 | 1278 |
MFA-VBEM | [35], E+B | 30 | 17 | 3638 | 2996 |
MFA-VBEM | [36], E+B | 17 | 14 | 1993 | 1645 |

Interestingly, COSA shows a particularly poor performance on the combined gene expression and TF binding data. This can be explained as follows. The TF binding profiles extracted from YeastTract [38] are binary vectors, and some TFs bind to several genes. The affected genes will have identical (or very similar) binary profiles when restricted to the respective TFs. With its inherent tendency to cluster on subsets of attributes, COSA will group together genes that happen to have similar binary entries for a small number of TFs. This leads to the formation of many small clusters. These clusters are not necessarily biologically meaningful, since complementary information from the expression profiles and other TFs has effectively been discarded.

It is also interesting to observe that the inclusion of binding data occasionally deteriorates the performance of K-means and hierarchical agglomerative clustering. This deterioration is a consequence of the different nature of the TF binding and gene expression profiles. While the former are binary and hence nonnegative, the log gene expression ratios may vary in sign. This renders the approach of combining them in a monolithic block suboptimal, as coregulated genes may have anticorrelated expression profiles and positively correlated TF binding patterns. Avoiding this potential conflict by taking the modulus of the expression profiles is no solution, as the resulting information loss was found to lead to a deterioration of the clustering results. The proposed MFA-VBEM model, on the other hand, uses the extra flexibility that the model provides via the factor loading vector and the factor mean vector (see Figure 2) to overcome this problem. This suggests that MFA-VBEM provides the right degree of flexibility as a compromise between the rigidity of K-means and hierarchical agglomerative average linkage clustering, and the over-flexible subset selection of COSA. The consequence is an improvement in the biological plausibility of the inferred gene clusters, as seen from Table 3.

### 5.3. Regulatory Network Reconstruction

A topic of interest in computational systems biology is the reconstruction of transcriptional regulatory networks, and it is this question that most of the methods reviewed in the Introduction section ultimately aim to address. Note that in the current setting the regulatory network has the form of a bipartite graph between TFs and potentially regulated genes. The successful solution of the reconstruction task therefore requires us to infer for each TF the correct binding profile, that is, the set of genes that it potentially binds to and regulates. For the synthetic data of Section 4.1, this is a straightforward task as the true regulatory network is known. For real data, however, the true regulatory network is unknown, rendering the assessment more difficult. We approached this problem from two different angles: noise reduction and test-set performance. In the first assessment scheme, we trained the different statistical models on noisy complete data, containing both gene expression profiles and TF binding affinities, and investigated whether the method succeeded in reducing the noise in the TF binding profiles, that is, whether it could predict a curated transcriptional regulatory network. (For descriptive convenience we use machine learning parlance, where the word "training" means inference of the posterior distribution of the model parameters, hyperparameters and latent variables from given data, the so-called training set.) In the second assessment scheme, the models were trained on 80% of the original data, and then evaluated on 20% of held-out test data, from which the binding profiles had been removed. We refer to these two network reconstruction tasks as network curation and network prediction, respectively. We compared the proposed MFA-VBEM scheme of Section 3 with the BFA Gibbs sampling approach of Sabatti and James [16] and with maximum likelihood FA. An overview of the methods compared in our study is shown in Table 1. Note that the PLS method of Boulesteix and Strimmer [22] was not applied to this task, as it provides no mechanism for inferring the TF-gene interaction strengths directly from gene expression data.

where is a gene expression profile of a new gene not included in the training set, and is computed by discarding from (A.22) all those terms that are related to the (nonexistent) TF binding profile. See the appendix for a derivation of (24) and (25). To obtain a regulatory network from the matrix of interaction strengths we choose a threshold and keep all those edges whose interaction strengths exceed this value. Note that by varying the threshold between the minimum and maximum interaction strength, we can obtain a receiver operating characteristic (ROC) curve when the true network is known.
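The threshold sweep described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the evaluation code used in the study: the function names are hypothetical, interaction strengths and true edges are given as flat lists, and the area under the curve is computed with the trapezoidal rule.

```python
def roc_curve(strengths, truth):
    """Sweep a threshold over the interaction strengths.
    strengths: predicted TF-gene interaction strengths (flat list).
    truth: 1 for a true edge, 0 otherwise (same order as strengths).
    Returns (FP rate, TP rate) points from the strictest threshold
    to the most permissive one. Requires at least one positive and
    one negative example."""
    pos = sum(truth)
    neg = len(truth) - pos
    points = [(0.0, 0.0)]
    # Using each distinct strength as a cut-off with ">=" is equivalent to
    # keeping all edges that exceed a threshold placed just below it.
    for thr in sorted(set(strengths), reverse=True):
        tp = sum(1 for s, t in zip(strengths, truth) if s >= thr and t == 1)
        fp = sum(1 for s, t in zip(strengths, truth) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical toy data: the two true edges have the largest strengths.
toy_points = roc_curve([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
toy_auc = auc(toy_points)
```

A random predictor would give an AUC close to 0.5, which is the baseline against which the curves in this section are judged.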

We carried out maximum likelihood FA with the EM algorithm, using the software implementation of Ghahramani and Hinton [24], and a subsequent varimax rotation towards maximum sparsity of the loading matrix, as proposed in Pournara and Wernisch [18]. Since this approach does not make use of the TF binding data, the distinction between network curation and out-of-sample prediction is obsolete. Further details about the application of this scheme can be found in the appendix.

The network reconstruction with BFA-Gibbs was carried out as described in Sabatti and James [16]. For the out-of-sample network prediction, the Gibbs sampling scheme of Sabatti and James [16] was modified so as to set the TF activity profiles to the posterior mean obtained from the training set. This approach corresponds to running the Gibbs sampling algorithm of Sabatti and James [16] with the latent variables fixed, that is, one of the interleaved Gibbs steps is omitted. Again, further details and a justification of this scheme can be found in the appendix.

The practical application of BFA-Gibbs faces a computational hurdle. Within the Gibbs sampling procedure, the vectors of binary latent variables are sampled from a multinomial distribution whose parameters have to be computed for all possible configurations of these vectors (Sabatti and James [16, (2)] or Pournara and Wernisch [18, (8)]). This is a combinatorial problem, and the computational costs increase exponentially with the number of non-zero entries in the prior probability matrix. For our simulations we used the software provided by Sabatti and James [16], which worked fine on the synthetic data of Section 4.1. However, the programs ran into memory overflow problems on the *S. cerevisiae* data when the number of non-zero entries in the prior probability matrix was unrestricted. This computational complexity, which inherently impedes the application of BFA-Gibbs to complex postgenomic data sets, required us to artificially limit the number of non-zero entries in the prior probability matrix to 11 connections per gene. Most of the *S. cerevisiae* genes were not affected by this intervention, as the number of TF binding connections reported in Teixeira et al. [38] is well below this threshold. However, for densely connected genes, TF binding connections had to be randomly discarded until the restriction was enforced. We note, though, that despite this pruning procedure, 88% of the interactions reported in [38] were still included in the prior probability matrix.
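To give a feeling for the exponential growth described above, the following sketch counts the on/off configurations for a single gene. It assumes that each of the k candidate TF connections of a gene (the non-zero prior entries) can independently be present or absent, giving 2^k configurations per gene; this reading and the function names are illustrative assumptions.

```python
from itertools import product

def enumerate_configurations(k):
    """All on/off patterns for the k candidate TF connections of one gene."""
    return list(product((0, 1), repeat=k))

def n_configurations(k):
    """Number of patterns, without enumerating them."""
    return 2 ** k

for k in (3, 11, 20):
    print(k, n_configurations(k))
# prints:
# 3 8
# 11 2048
# 20 1048576
```

Under this assumption, the limit of 11 connections per gene corresponds to at most 2048 configurations per gene, whereas a gene with 20 candidate connections would already require more than a million.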

#### 5.3.1. Network Reconstruction for the Synthetic Data

Interestingly, for the low noise in the TF binding data, from which the prior connectivity matrix of BFA-Gibbs is derived, the performance of BFA-Gibbs is relatively better when the gene expression profiles are noisy (the right column in the bottom right panel of Figure 7), or the gene expression profiles are short (top row in the bottom right panel of Figure 7). We have obtained similar results on the reconstruction of TF module activity profiles (Table 2). With larger, less noisy data sets, the Gibbs sampler can be easily trapped in some local optimum. This is partly related to MCMC sampling problems in general; compare with Figures 6 and 7 in Grzegorczyk and Husmeier [49]. More substantially, this is related to mixing problems inherent in Gibbs sampling. There are possibilities to assign a TF to the six groups of coexpressed genes in Figure 4, corresponding to modes in the posterior probability landscape. A study by Jasra et al. [50] has found that in such a scenario Gibbs sampling faces intrinsic mixing problems and tends to get trapped on a single mode. Note that both problems are avoided by the proposed MFA-VBEM scheme. First, by information sharing between TFs in the same module, MFA effectively constitutes a more parsimonious model than BFA, thereby reducing the complexity of the inference problem. Second, convergence problems are effectively addressed with the birth-death moves in a similar way as discussed in Ueda et al. [32].

A comparison of the original TF-binding data in Figure 4 and the predicted TF-gene interaction profiles in Figure 6 clearly demonstrates the efficiency of the network curation and noise reduction effected with MFA-VBEM. Note that the improved reconstruction accuracy is a consequence of the systematic integration of gene expression data into the modelling and inference process, and the nature of the MFA model. The latter allows for the fact that TFs act in modules and are non-independent, and that TFs in the same module show similar interaction patterns with downstream regulated genes. This leads to greater robustness of inference and reduced susceptibility to overfitting.

#### 5.3.2. Network Reconstruction for the Yeast Data

For the network curation task, 10% false-positive interactions were added to the TF binding data of Teixeira et al. [38]. All three models were trained using the complete data set, including both gene expression and (noisy) TF binding profiles. We then assessed the predicted binding profiles by taking the associations reported in Teixeira et al. [38] as the true gold standard. The resulting ROC curves are shown in the left panel of Figure 8.

BFA with Gibbs sampling recovered a very accurate but sparse connectivity matrix. Most of the predicted connections were correct according to the chosen criterion (agreement with Teixeira et al. [38]). However, only about 30% of the TF binding connections reported in Teixeira et al. [38] were recovered, and 20% of the genes were predicted to be not connected to any TF. Additionally, most genes were predicted to be connected to at most one TF, which suggests that BFA-Gibbs does not capture any effects related to TF complex formation and cooperativity between TFs. The proposed MFA-VBEM approach avoided this problem by predicting many genes to be connected to more than one TF. For very low FP rates MFA-VBEM obtained lower TP rates than BFA-Gibbs. However, its area under the ROC curve (AUC score) is substantially higher than that of BFA-Gibbs (0.82 versus 0.66), suggesting that the overall prediction performance has improved. The performance of maximum likelihood factor analysis (FA) was much poorer than that of the other two methods, and the corresponding ROC curve was only marginally better than the expected performance of a random predictor. Recall that FA as opposed to the other two models only uses the gene expression data but not the TF binding profiles. The poor performance of FA thus suggests that the TF regulatory network cannot be reliably reconstructed on the basis of gene expression data alone, and that the varimax rotation of the loading matrix towards maximum sparsity, as suggested in Pournara and Wernisch [18], is no substitute for the explicit inclusion of TF binding information.

For the network prediction task, we trained the models on only 80% of the *S. cerevisiae* genes, and used an independent test set containing a randomly chosen subset of 20% of the genes to estimate the out-of-sample network prediction accuracy. Note that for the genes in the test set, only the expression profiles were made available, while the corresponding TF binding connections were held back. The task was to predict these TF binding connections from the gene expression data, using the (average) TF activity profiles inferred from the training set. A more comprehensive description of the evaluation is provided in the appendix.

The results are shown in the right panel of Figure 8. This figure contains two ROC curves for BFA-Gibbs. The proper evaluation of the out-of-sample network prediction accuracy according to equation (A.35) requires an uninformative prior connectivity matrix for the genes in the test set, in which all the elements are set to the same value. However, the combinatorial complexity problem discussed above requires a restriction on the number of non-zero entries per gene. We therefore randomly selected a set of 11 non-zero entries per gene. This leads to the ROC curve shown by a dashed line in the right panel of Figure 8, which is hardly better than the expected ROC curve of a random predictor. This poor performance is not surprising, because BFA-Gibbs cannot recover false negative interactions, as discussed in Sabatti and James [16]. As an alternative test, we selected the true TF binding interactions, as reported in Teixeira et al. [38], subject to the constraint of not allowing more than 11 non-zero entries per gene. The corresponding ROC curve, shown by the dash-dotted line in the right panel of Figure 8, outperforms all other methods for low FP ratios. Note, though, that this approach violates the out-of-sample paradigm, in that it makes use of TF binding information that should have been held back for evaluation. Interestingly, even with this methodological violation, BFA-Gibbs is still outperformed by the proposed MFA-VBEM approach in terms of the global network reconstruction accuracy, as indicated by the overall AUC score. MFA-VBEM also significantly outperforms maximum likelihood FA (dotted graph in the right panel of Figure 8(b)), consistently achieving higher TP ratios across the whole spectrum of FP ratios. (It might seem peculiar that the out-of-sample performance of FA, as shown in Figure 8(b), is better than the training set performance, depicted on the left. This is a consequence of the global assignment of predicted TF binding profiles to true binding profiles with the Hungarian algorithm, as described in Section A.3, which works more efficiently on smaller data sets. As discussed before, this procedure uses information that should have been withheld, giving FA an unfair advantage over the other methods.)

While the previous study has pointed to a performance improvement of MFA-VBEM over BFA-Gibbs, this improvement is a combination of two effects: the actual model performance, and the computational complexity. In order to focus on the first effect and distinguish it from the latter, we repeated the analysis on the same data in a slightly different manner. Recall that the proper evaluation of the out-of-sample network prediction accuracy according to (A.35) requires an uninformative prior connectivity matrix for the genes in the test set, and that the combinatorial complexity problem discussed above requires a restriction on the number of non-zero entries per gene. We therefore randomly selected 2000 *S. cerevisiae* genes, then sorted the TFs according to the numbers of connections between them and the selected genes. The most densely connected 12 TFs were chosen. Then all 5464 genes were sorted according to the numbers of their connections to the chosen TFs, and the most densely connected 2000 genes were chosen. These sorting steps were iterated until convergence. We thus obtained a 12 TFs by 2000 genes connectivity matrix with dense connections for evaluating the different network reconstruction methods. This procedure, and the reduction in the number of TFs, allowed the application of BFA-Gibbs with an uninformative prior connectivity matrix, and hence ensured a fair comparison with the proposed MFA-VBEM method.
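The alternating selection of TFs and genes described above can be sketched as follows. This is an illustrative reconstruction under assumptions: the binary connectivity matrix is given as a list of rows (one per gene), ties are broken by index, and an iteration cap guards against the case where the alternation does not settle. The function name and these details are hypothetical and are not taken from the original study.

```python
import random

def select_dense_block(conn, n_tfs, n_genes, seed=0, max_iter=100):
    """conn[g][t] is 1 if TF t binds gene g, else 0.
    Alternately keep the n_tfs most connected TFs and the n_genes most
    connected genes until neither selection changes."""
    rng = random.Random(seed)
    all_genes = list(range(len(conn)))
    all_tfs = list(range(len(conn[0])))
    genes = set(rng.sample(all_genes, n_genes))  # random starting genes
    tfs = set()
    for _ in range(max_iter):
        # Rank TFs by their number of connections to the selected genes.
        tf_score = {t: sum(conn[g][t] for g in genes) for t in all_tfs}
        new_tfs = set(sorted(all_tfs, key=lambda t: (-tf_score[t], t))[:n_tfs])
        # Rank all genes by their number of connections to the chosen TFs.
        gene_score = {g: sum(conn[g][t] for t in new_tfs) for g in all_genes}
        new_genes = set(sorted(all_genes, key=lambda g: (-gene_score[g], g))[:n_genes])
        if new_tfs == tfs and new_genes == genes:
            break  # selections are stable
        tfs, genes = new_tfs, new_genes
    return sorted(tfs), sorted(genes)

# Hypothetical toy matrix: genes 0-2 are bound by TFs 0 and 1,
# gene 5 is bound by TF 3 only, genes 3 and 4 have no connections.
toy_conn = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
toy_tfs, toy_genes = select_dense_block(toy_conn, n_tfs=2, n_genes=3)
```

In the setting of the text, the call would use `n_tfs=12` and `n_genes=2000` on the full connectivity matrix.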

We trained the models on 40%, 60% or 80% of the *S. cerevisiae* genes, and used an independent test set containing the remaining 60%, 40% or 20% of the genes to estimate the out-of-sample network prediction accuracy. As before, for the genes in the test set only the expression profiles were made available, while the corresponding TF binding connections were held back. The task was to predict these TF binding connections from the gene expression data, using the (average) TF activity profiles inferred from the training set. The results are shown in the subfigures of Figure 9. It can be seen in all three cases that MFA with VBEM clearly outperforms both the BFA and FA methods, and that the performance slightly increases with increasing training set size. The corresponding AUC values are 0.64, 0.67 and 0.67.

The measured TF-gene binding patterns of these two TFs show a modest correlation (correlation coefficient = 0.60). When MFA-VBEM is applied to the network reconstruction task by integrating gene expression profiles, the predicted binding patterns of the two TFs involved in the complex show an increased correlation (correlation coefficient = 0.74). However, the cooperation of TFs was not detected by the BFA or the FA methods. Here, the corresponding correlation coefficients between the TF binding patterns predicted with BFA and FA are low, 0.15 and 0.14, respectively. Hence, BFA and FA fail to identify this TF complex.

## 6. Conclusion

We have investigated the application of Bayesian mixtures of factor analyzers (MFA-VBEM) to modelling transcriptional regulation in cells. Like recent approaches based on Bayesian factor analysis applied to the same problem [16, 17], MFA-VBEM allows for the fact that TFs are often subject to post-translational modifications and that their true activities are therefore usually unknown. A shortcoming of Bayesian factor analysis is the fact that it ignores interactions between TFs. This limitation is addressed by our approach: different from Bayesian factor analysis, the mixture of factor analyzers approach allows for the fact that transcription factors co-operate as a functional complex in regulating gene expression, which is particularly common in higher eukaryotes. Our approach systematically integrates gene expression data with TF binding data. As opposed to the partial least squares (PLS) approach of Boulesteix and Strimmer [22], MFA-VBEM is a probabilistic model that allows for the noise inherent in the TF binding data. This addresses a major shortcoming of the PLS approach, where the inability to deal with measurement errors has been found to adversely affect the activity profile reconstruction accuracy. The better performance of the MFA-VBEM method over the Bayesian factor analysis approaches is presumably a consequence of the more parsimonious model that results when allowing for the fact that TFs are non-independent. Take, for instance, a complex of 3 TFs that regulates 20 genes, as in Figure 4. MFA-VBEM can effectively model this with 23 parameters: 20 regulatory interaction strengths between the TF module and the regulated genes, and 3 membership indicators that assign the TFs to the respective module. A method based on the standard FA approach, like the one proposed by Sabatti and James [16], needs 3 × 20 = 60 parameters, corresponding to the interactions between each of the individual TFs and the regulated genes. There is nothing in the FA approach that would inform the model a priori that once a group of TFs are found to form a module, their interaction patterns with the regulated genes should be the same. Instead, these interaction strengths have to be learned separately for each TF. This leads to a less parsimonious and partially redundant model, which is less robust and more susceptible to over-fitting.

We have evaluated the proposed MFA-VBEM on three performance criteria: transcriptional activity profile reconstruction, gene clustering, and regulatory network inference. Using a synthetic data set, we found that MFA-VBEM reconstructed the hidden activity profiles of the TF complexes more accurately than PLS [22] and Bayesian factor analysis with Gibbs sampling [16]. Using gene expression profiles and TF binding profiles for *S. cerevisiae*, MFA-VBEM found biologically more plausible gene clusters than K-means, hierarchical agglomerative average linkage clustering and COSA [43], as indicated by the increased enrichment for known gene ontology terms. For the regulatory network reconstruction task, MFA-VBEM outperformed Bayesian and non-Bayesian factor analysis models on gene expression and TF binding profiles from both *S. cerevisiae* and a synthetic simulation. The better performance over the Gibbs sampling approach of Sabatti and James [16] on *S. cerevisiae* was partly a consequence of the computational complexity of the latter approach; this highlights the practical advantage of the proposed scheme in scaling up to complex postgenomic data sets.

We have pursued a variational approach to Bayesian inference, by which a lower bound on the marginal likelihood is obtained and used for model selection. This allows us to estimate the number of active transcriptional modules regulating the genes, and select the number most supported by the data. A straightforward extension would be to make the number of active transcriptional modules a random variable itself and estimate its posterior distribution. The question, then, is which prior distribution to place on it. The potential number of active transcriptional modules is large, owing to the combinatorial explosion inherent in TF cooperation. Moreover, biological regulatory networks are known to be scale-free [52], meaning that a few TF modules potentially regulate a large number of genes. These two properties suggest that a Dirichlet process prior (also called Chinese restaurant process) would provide the appropriate modelling framework [53]. This non-parametric approach to Bayesian modelling has become popular in the machine learning community, and has recently been applied to computational biology in the context of haplotype modelling [54]. The application of these ideas to the problem of transcriptional regulation, and the method discussed in the present paper in particular, will provide an interesting avenue for future research.

## Appendix

### A.

#### A.1. Variational Bayesian Expectation Maximization

From information theory it is known that the Kullback-Leibler divergence, which is a measure of the difference between two distributions, is non-negative; see, for instance, Papoulis [48]. This implies that the variational objective is a lower bound on the marginal likelihood, with a difference given by the Kullback-Leibler divergence between the approximating distribution and the true posterior distribution. The objective of variational Bayesian inference is to numerically maximize this lower bound. This gives the best approximation to the true posterior distribution from the chosen functional family, while the maximized bound simultaneously gives the best possible approximation to the marginal likelihood.
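The relation between the lower bound, the Kullback-Leibler divergence and the marginal likelihood can be checked numerically on a toy model. The sketch below assumes a single discrete latent variable with three states and a fixed observation, and it works with the logarithm of the marginal likelihood, which is the usual convention for this bound. All numbers and names are hypothetical and serve only to illustrate the identity.

```python
from math import log

def lower_bound(q, joint):
    """Variational lower bound: sum_z q(z) * log(p(x, z) / q(z))."""
    return sum(qz * log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

def kl_divergence(q, p):
    """Kullback-Leibler divergence KL(q || p) for discrete distributions."""
    return sum(qz * log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

# Hypothetical joint probabilities p(x, z) for z = 0, 1, 2 at the observed x.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                       # marginal likelihood p(x)
posterior = [pz / evidence for pz in joint]  # true posterior p(z | x)

q = [0.2, 0.5, 0.3]                          # an arbitrary approximating distribution
F = lower_bound(q, joint)
KL = kl_divergence(q, posterior)

# Identity: log p(x) = F + KL, so F <= log p(x) because KL >= 0.
gap = log(evidence) - (F + KL)
```

Setting `q` equal to `posterior` makes the divergence vanish, in which case the bound coincides with the log marginal likelihood.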

- (i) Variational E-step: given the current distribution of the parameters at a given iteration, obtain a new distribution of the latent variables by application of (A.5).
- (ii) Variational M-step: given this distribution of the latent variables, obtain a new distribution of the parameters by application of (A.7).

This procedure, called the Variational Bayesian Expectation Maximization (VBEM) algorithm, is repeated until a stationary point of the lower bound is reached.

#### A.2. The VBEM Algorithm Applied to the MFA Model

where denotes an expectation value with respect to , obtained from (A.10), denotes an expectation value with respect to , obtained from (A.19), is the th component of is the th component of is the th component of , and the index is related to a time point or experimental condition for which microarray and TF binding data have been obtained.

where denotes an expectation value with respect to , obtained from (A.12).

Here, denotes an expectation value with respect to the distributions and , obtained from the previous update steps in equations (A.12), (A.22) and (A.19), and was defined in (A.11).

where , and denotes an expectation value with respect to the distribution , which is obtained from (A.11) and (A.12). The remaining hyperparameters were fixed at , corresponding to fairly vague prior distributions. Each update equation is guaranteed to increase of (22), and the update steps are repeated in an iterative procedure until a stationary point of is reached. This update procedure involves birth and death moves to explore the model space and find the optimal model complexity , as described in Beal [23]. Note that these birth and death moves also help avoid local maxima in of (22), in a similar manner as discussed in Ueda et al. [32]. A MATLAB implementation of this method has been made available by Beal [23].

#### A.3. Details on the Regulatory Network Reconstruction

Network Reconstruction with the Proposed MFA-VBEM Scheme

where is given in (A.22), is given in (A.18), is given in (A.17), and is given in (A.21). The curated binding profile of gene , corresponding to (A.26), is then trivially obtained from by discarding the expression profile in . Note that (A.29) consists of two terms. The first term, , describes the potential binding of TF modules to the promoters of the regulated genes. This is the generic regulatory network that we want to predict, mediated via regulated elements in the gene upstream sequences. The second term, , describes the perturbations and transient modifications of the interactions that are specific to the experimental conditions for which the training data were obtained. This term allows for the fact that a potential binding site might not be accessible to a TF in a certain condition, and that the TF binding affinities vary with changing external conditions.

where is in principle obtained by application of (A.22) to obtain , and marginalization over : . Since the derivation of is involved, we approximate by discarding from (A.22) all those terms that are related to the TF binding profiles; the trace operator in (A.22) thus extends over contributions from the gene expression data only. Note that this approximation corresponds to the imputation with . Our results suggest that this approximation works sufficiently well in practice.

Having described how to approach the tasks of network curation and prediction with the proposed MFA-VBEM model, we will now briefly outline how to address these problems with the factor analysis models of Sabatti and James [16] and Ghahramani and Hinton [24].

Network Reconstruction with Maximum Likelihood Factor Analysis

Recall the definition of the FA model in (A.3) and (A.25). The EM approach proposed in Ghahramani and Hinton [24] consists of iterative adaptation steps for the latent factors (representing TF activity profiles), the parameters (representing regulatory connection strengths), and the noise parameters . To solve the identifiability problem inherent in FA we follow Pournara and Wernisch [18] and minimize the number of non-zero entries in the connection strength matrix with a varimax rotation [39]; this procedure incorporates our prior knowledge that biological regulatory networks are usually sparsely connected. Maximum likelihood FA works solely with the gene expression data and does not incorporate explicit information about TF binding profiles. Consequently, the distinction between network curation and network prediction is not essential, and is solely made for comparison with the competing models. In the "curation" task, the connectivity matrix inferred from the training set in the way described above is used as the prediction of the transcriptional regulatory network. In the "prediction" task, the TF activity profiles obtained from the training data are kept fixed, and the TF regulatory network, represented by , is estimated for a set of independent genes (the test data). This procedure, which is straightforwardly implemented by skipping the E-step in the EM algorithm, indirectly tests how accurately the TF activity profiles have been reconstructed.

For the practical application, we applied the EM algorithm as reported in Ghahramani and Hinton [24], using the MATLAB programs provided by the authors. Each EM optimization was repeated five times from different random initializations, and the result with the highest likelihood was kept for further analysis. Since standard FA does not use any information from the TF binding profiles, the hidden factors cannot be immediately associated with known TFs. In order to evaluate how accurately the estimated loading matrix predicts the transcriptional regulatory network, we mapped each hidden factor to the closest TF. This was effected by an application of the Hungarian algorithm [55] to assign the hidden factors to the known TFs in such a way that the global Euclidean distance between the corresponding rows in the estimated loading matrix and the TF binding profiles reported in Teixeira et al. [38] was minimized. (The Hungarian algorithm is a combinatorial optimization algorithm. The assignment problem is represented by a cost matrix, where each matrix element represents the cost of assigning a predicted TF profile to a real TF binding profile. The algorithm solves the assignment problem in polynomial time, finding the minimum edge weight matching for the bipartite graph.) Note that this procedure requires the TF binding profiles to be known beforehand, which would not be the case in practical applications, and that it therefore gives maximum likelihood FA a slight advantage over the other methods used in the comparison.
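The factor-to-TF assignment described above can be illustrated on a toy example. The sketch below is not the Hungarian algorithm itself: for brevity it solves the same assignment problem by brute force over all permutations, which is only feasible for a handful of factors. For realistic sizes a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the natural replacement. The summed row-wise Euclidean distance is used as one reading of the "global Euclidean distance", and all names and numbers are hypothetical.

```python
from itertools import permutations
from math import dist

def assign_factors(loadings, binding):
    """Map each hidden factor (row of loadings) to a known TF (row of binding)
    so that the summed Euclidean distance between matched rows is minimal.
    Returns a tuple perm where factor i is assigned to TF perm[i].
    Brute force over all permutations: a stand-in for the Hungarian algorithm."""
    n = len(loadings)
    # cost[i][j]: distance between predicted profile i and true binding profile j
    cost = [[dist(loadings[i], binding[j]) for j in range(n)] for i in range(n)]
    return min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))

# Hypothetical toy data: 2 factors, 3 genes.
toy_loadings = [[0.1, 0.9, 0.0],   # factor 0 resembles the profile of TF 1
                [1.0, 0.1, 0.8]]   # factor 1 resembles the profile of TF 0
toy_binding = [[1, 0, 1],          # binding profile of TF 0
               [0, 1, 0]]          # binding profile of TF 1
toy_assignment = assign_factors(toy_loadings, toy_binding)
```

Because the cost matrix is built from the true binding profiles, the example also makes visible why this matching step uses information that would be unavailable in a genuine prediction setting.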

Network Reconstruction with Bayesian Factor Analysis and Gibbs Sampling

This approach corresponds to running the Gibbs sampling algorithm of Sabatti and James [16] with the latent variables fixed, that is, one of the interleaved Gibbs steps can be omitted. The second approach is computationally cheaper than the first and also appears more in line with the concept that a held-out test set should not be used for parameter inference. It was therefore adopted in our study.

## Declarations

### Acknowledgments

This work was supported by the Scottish Government Rural and Environment Research and Analysis Directorate (RERAD).

## References

- Bussemaker HJ, Li H, Siggia ED: Regulatory element detection using correlation with expression. *Nature Genetics* 2001, 27(2):167-171. 10.1038/84792
- Conlon EM, Liu XS, Lieb JD, Liu JS: Integrating regulatory motif discovery and genome-wide expression analysis. *Proceedings of the National Academy of Sciences of the United States of America* 2003, 100(6):3339-3344. 10.1073/pnas.0630591100
- Beer MA, Tavazoie S: Predicting gene expression from sequence. *Cell* 2004, 117(2):185-198. 10.1016/S0092-8674(04)00304-6
- Phuong TM, Lee D, Lee KH: Regression trees for regulatory element identification. *Bioinformatics* 2004, 20(5):750-757. 10.1093/bioinformatics/btg480
- Segal E, Yelensky R, Koller D: Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. *Bioinformatics* 2003, 19(supplement 1):i273-i282. 10.1093/bioinformatics/btg1038
- Middendorf M, Kundaje A, Wiggins C, Freund Y, Leslie C: Predicting genetic regulatory response using classification. *Bioinformatics* 2004, 20(supplement 1):i232-i240. 10.1093/bioinformatics/bth923
- Middendorf M, Kundaje A, Shah M, Freund Y, Wiggins CH, Leslie C: Motif discovery through predictive modeling of gene regulation. *Proceedings of the 9th Annual International Conference on Research in Computational Molecular Biology (RECOMB '05), Cambridge, Mass, USA, May 2005* 538-552.
- Ruan J, Zhang W: A bi-dimensional regression tree approach to the modeling of gene expression regulation. *Bioinformatics* 2006, 22(3):332-340. 10.1093/bioinformatics/bti792
- Raychaudhuri S, Stuart JM, Altman RB: Principal components analysis to summarize microarray experiments: application to sporulation time series. *Pacific Symposium on Biocomputing* 2000, 5:455-466.
- Liebermeister W: Linear modes of gene expression determined by independent component analysis. *Bioinformatics* 2002, 18(1):51-60. 10.1093/bioinformatics/18.1.51
- Liao JC, Boscolo R, Yang Y-L, Tran LM, Sabatti C, Roychowdhury VP: Network component analysis: reconstruction of regulatory signals in biological systems. *Proceedings of the National Academy of Sciences of the United States of America* 2003, 100(26):15522-15527. 10.1073/pnas.2136632100
- Kao KC, Yang Y-L, Boscolo R, Sabatti C, Roychowdhury V, Liao JC: Transcriptome-based determination of multiple transcription regulator activities in *Escherichia coli* by using network component analysis. *Proceedings of the National Academy of Sciences of the United States of America* 2004, 101(2):641-646. 10.1073/pnas.0305287101
- Harbison CT, Gordon DB, Lee TI, *et al*.: Transcriptional regulatory code of a eukaryotic genome. *Nature* 2004, 430(7004):99-104. 10.1038/nature02800
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. *Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology (ISMB '94), Stanford, Calif, USA, August 1994* 28-36.
- Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of *cis*-regulatory elements associated with groups of functionally related genes in *Saccharomyces cerevisiae*. *Journal of Molecular Biology* 2000, 296(5):1205-1214. 10.1006/jmbi.2000.3519
- Sabatti C, James GM: Bayesian sparse hidden components analysis for transcription regulation networks. *Bioinformatics* 2006, 22(6):739-746. 10.1093/bioinformatics/btk017
- Sanguinetti G, Lawrence ND, Rattray M: Probabilistic inference of transcription factor concentrations and gene-specific regulatory activities. *Bioinformatics* 2006, 22(22):2775-2781. 10.1093/bioinformatics/btl473
- Pournara I, Wernisch L: Factor analysis for gene regulatory networks and transcription factor activity profiles. *BMC Bioinformatics* 2007, 8, article 61:1-20. 10.1186/1471-2105-8-61
- Shi Y, Simon I, Mitchell T, Bar-Joseph Z: A combined expression-interaction model for inferring the temporal activity of transcription factors. In *Proceedings of the 12th Annual International Conference on Research in Computational Molecular Biology (RECOMB '08), Lecture Notes in Computer Science, Singapore, March-April 2008*. *Volume 4955*. Edited by: Vingron M, Wong L. Springer; 82-97.
- Reményi A, Schöler HR, Wilmanns M: Combinatorial control of gene expression. *Nature Structural & Molecular Biology* 2004, 11(9):812-815. 10.1038/nsmb820
- Yu X, Lin J, Zack DJ, Qian J: Identification of tissue-specific *cis*-regulatory modules based on interactions between transcription factors. *BMC Bioinformatics* 2007, 8, article 437:1-13. 10.1186/1471-2105-8-1
- Boulesteix A-L, Strimmer K: Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. *Theoretical Biology & Medical Modelling* 2005, 2, article 23:1-12. 10.1186/1742-4682-2-1
- Beal MJ: *Variational algorithms for approximate Bayesian inference, Ph.D. thesis*. Gatsby Computational Neuroscience Unit, University College London, London, UK; 2003.
- Ghahramani Z, Hinton GE: *The EM algorithm for mixtures of factor analyzers*. Department of Computer Science, University of Toronto, Toronto, Canada; 1996.
- Nielsen FB: *Variational approach to factor analysis and related models, M.S. thesis*. Informatics and Mathematical Modelling, Technical University of Denmark, Lyngby, Denmark; 2004. http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=3182
- Ghahramani Z, Beal MJ: Variational inference for Bayesian mixtures of factor analysers. In *Advances in Neural Information Processing Systems*. Edited by: Solla SA, Leen TK, Müller K-R. The MIT Press, Cambridge, Mass, USA; 1999:449-455.
- West M: Bayesian factor regression models in the "large p, small n" paradigm. In *Bayesian Statistics*. *Volume 7*. Oxford University Press, Oxford, UK; 2003:733-742.
- McLachlan GJ, Bean RW, Ben-Tovim Jones L: Extension of the mixture of factor analyzers model to incorporate the multivariate t-distribution. *Computational Statistics & Data Analysis* 2007, 51(11):5327-5338. 10.1016/j.csda.2006.09.015
- Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. *Bioinformatics* 2002, 18(supplement 1):S96-S104. 10.1093/bioinformatics/18.suppl_1.S96
- Fokoué E, Titterington DM: Mixtures of factor analysers. Bayesian estimation and inference by stochastic simulation. *Machine Learning* 2003, 50(1-2):73-94. 10.1023/A:1020297828025
- Green PJ: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. *Biometrika* 1995, 82(4):711-732. 10.1093/biomet/82.4.711
- Ueda N, Nakano R, Ghahramani Z, Hinton GE: SMEM algorithm for mixture models. *Neural Computation* 2000, 12(9):2109-2128. 10.1162/089976600300015088
- Guelzim N, Bottani S, Bourgine P, Képès F: Topological and causal structure of the yeast transcriptional regulatory network. *Nature Genetics* 2002, 31(1):60-63. 10.1038/ng873
- Lee TI, Rinaldi NJ, Robert F, *et al*.: Transcriptional regulatory networks in *Saccharomyces cerevisiae*. *Science* 2002, 298(5594):799-804. 10.1126/science.1075090
- Spellman PT, Sherlock G, Zhang MQ, *et al*.: Comprehensive identification of cell cycle-regulated genes of the yeast *Saccharomyces cerevisiae* by microarray hybridization. *Molecular Biology of the Cell* 1998, 9(12):3273-3297.
- Gasch AP, Spellman PT, Kao CM, *et al*.: Genomic expression programs in the response of yeast cells to environmental changes. *Molecular Biology of the Cell* 2000, 11(12):4241-4257.
- Mnaimneh S, Davierwala AP, Haynes J, *et al*.: Exploration of essential gene functions via titratable promoter alleles. *Cell* 2004, 118(1):31-44. 10.1016/j.cell.2004.06.013
- Teixeira MC, Monteiro P, Jain P, *et al*.: The YEASTRACT database: a tool for the analysis of transcription regulatory associations in *Saccharomyces cerevisiae*. *Nucleic Acids Research* 2006, 34, database issue:D446-D451. 10.1093/nar/gkj013
- Kaiser HF: The varimax criterion for analytic rotation in factor analysis. *Psychometrika* 1958, 23(3):187-200. 10.1007/BF02289233
- Chang C, Ding Z, Hung YS, Fung PCW: Fast network component analysis (FastNCA) for gene regulatory network reconstruction from microarray data. *Bioinformatics* 2008, 24(11):1349-1358. 10.1093/bioinformatics/btn131
- Cowles MK, Carlin BP: Markov chain Monte Carlo convergence diagnostics: a comparative review. *Journal of the American Statistical Association* 1996, 91(434):883-904. 10.2307/2291683
- Bishop CM: *Pattern Recognition and Machine Learning*. Springer, Singapore; 2006.
- Friedman JH, Meulman JJ: Clustering objects on subsets of attributes (with discussion). *Journal of the Royal Statistical Society: Series B* 2004, 66(4):815-849. 10.1111/j.1467-9868.2004.02059.x
- Lazzeroni L, Owen A: Plaid models for gene expression data. *Statistica Sinica* 2002, 12(1):61-86.
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. *Proceedings of the National Academy of Sciences of the United States of America* 1998, 95(25):14863-14868. 10.1073/pnas.95.25.14863
- Hastie T, Tibshirani R, Friedman J: *The Elements of Statistical Learning*. Springer, Berlin, Germany; 2001.
- Grossmann S, Bauer S, Robinson PN, Vingron M: An improved statistic for detecting over-represented gene ontology annotations in gene sets. In *Proceedings of the 10th Annual International Conference on Research in Computational Molecular Biology (RECOMB '06), Lecture Notes in Computer Science, Venice, Italy, April 2006*. *Volume 3909*. Edited by: Apostolico A, Guerra C, Istrail S, Pevzner PA, Waterman MS. Springer; 85-98.
- Papoulis A: *Probability, Random Variables, and Stochastic Processes*. 3rd edition. McGraw-Hill, Singapore; 1991.
- Grzegorczyk M, Husmeier D: Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move. *Machine Learning* 2008, 71(2-3):265-305. 10.1007/s10994-008-5057-7
- Jasra A, Holmes CC, Stephens DA: Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. *Statistical Science* 2005, 20(1):50-67. 10.1214/088342305000000016
- Kim TS, Kim HY, Yoon JH, Kang HS: Recruitment of the Swi/Snf complex by Ste12-Tec1 promotes Flo8-Mss11-mediated activation of *STA1* expression. *Molecular and Cellular Biology* 2004, 24(21):9542-9556. 10.1128/MCB.24.21.9542-9556.2004
- Barabási A-L, Oltvai ZN: Network biology: understanding the cell's functional organization. *Nature Reviews Genetics* 2004, 5(2):101-113. 10.1038/nrg1272
- Teh YW, Jordan MI, Beal MJ, Blei DM: *Hierarchical dirichlet processes*. Department of Statistics, University of California, Berkeley, Calif, USA; 2004.
- Xing EP, Jordan MI, Sharan R: Bayesian haplotype inference via the dirichlet process. *Journal of Computational Biology* 2007, 14(3):267-284. 10.1089/cmb.2006.0102
- Kuhn HW: The Hungarian method for the assignment problem. *Naval Research Logistics* 1955, 2(1-2):83-97. 10.1002/nav.3800020109

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.