Bayesian module identification from multiple noisy networks

Background and motivations Module identification has been studied extensively in order to gain deeper understanding of complex systems, such as social networks as well as biological networks. Modules are often defined as groups of vertices in these networks that are topologically cohesive with similar interaction patterns with the rest of the vertices. Most of the existing module identification algorithms assume that the given networks are faithfully measured without errors. However, in many real-world applications, for example, when analyzing protein-protein interaction networks from high-throughput profiling techniques, there is significant noise with both false positive and missing links between vertices. In this paper, we propose a new model for more robust module identification by taking advantage of multiple observed networks with significant noise so that signals in multiple networks can be strengthened and help improve the solution quality by combining information from various sources. Methods We adopt a hierarchical Bayesian model to integrate multiple noisy snapshots that capture the underlying modular structure of the networks under study. By introducing a latent root assignment matrix and its relations to instantaneous module assignments in all the observed networks to capture the underlying modular structure and combine information across multiple networks, an efficient variational Bayes algorithm can be derived to accurately and robustly identify the underlying modules from multiple noisy networks. Results Experiments on synthetic and protein-protein interaction data sets show that our proposed model enhances both the accuracy and resolution in detecting cohesive modules, and it is less vulnerable to noise in the observed data. In addition, it shows higher power in predicting missing edges compared to individual-network methods.


Introduction
Identifying modular structures within large-scale networks has attracted significant attention in many research fields, including social science, biology, and information technology, just to name a few. For these applications, the ultimate goal is to group vertices in given networks into cohesive modules or communities, in which the vertices share similar properties, specifically their interaction patterns. Typically, densely connected sub-networks in given networks are considered desirable modular structures [1]. There have been many existing approaches proposed to study this problem in the literature, including spectral clustering algorithms based on graph cut [2,3], *Correspondence: xqian@ece.tamu.edu Department of Electrical and Computer Engineering, Texas A&M University, MS 3128, TAMU, College Station, TX, USA modularity-based algorithms [4,5], as well as matrix factorization algorithms for network clustering [6,7].
In addition to these optimization algorithms based on graph theory and mathematical programming, in statistical inference, stochastic block models (SBM) originally proposed by [8] adopt a multinomial-Bernoulli probabilistic model to capture the inherent modular structures in observed networks. Hofman and Wiggins [9] developed a Bayesian framework to find the module or community memberships of vertices in networks under study and took advantage of variational approximation to efficiently sample from the corresponding posterior distributions.
Extending the analysis to dynamic networks has attracted major attention recently. Authors in [10] studied community evolution in blogosphere based on graph characteristics such as in-degrees and out-degrees.
Chi et al. [11] used graph cut size as a measure of community evolution and proposed a dynamic version of spectral clustering. In [12], an algorithm called FacetNet was developed by extending the graph factorization method for analysis of evolutionary networks. A Markov model [13] was adopted to capture temporal community variation in stochastic block models with Gibbs sampling implemented for inference of unknown model parameters.
In this paper, we focus on module identification in biological networks. On one hand, it is often the case that biological networks in public databases [14,15], by either high-throughput profiling techniques or laborious manual curations, contain significant errors (both false positives and false negatives). On the other hand, usually, several independent such empirical networks are available for studying the species of interest, creating the opportunity to integrate information from different sources and gain higher accuracy and better reproducibility. With these noisy networks, we aim to develop an integrated stochastic model and solution methods to improve the accuracy of module identification by combining information from multiple observed networks. Figure 1 provides the schematic illustration of our basic idea. With multiple noisy observations in the top row of Fig. 1a, the proposed stochastic model assumes that there is a consistent virtual graph that captures the coherent root modular structure. As a specific application of our approach, we can think of multiple networks profiled with different techniques or characterized by multiple types of interactions between vertices. Figure 1b shows the graphical representation of our extended SBM from traditional SBM for analyzing multiple noisy networks. In our model, every observed network A (t) is associated with a latent modular structure z (t) . These instantaneous structures are considered as the results from stochastic transitions from a latent root modular structure z that is coherent in all networks. Note that this is in contrast to previous dynamic models that concentrate on the evolution of modular structures rather than embedding them. The probabilistic inference task is to simultaneously learn the root as well as instantaneous modular structures from multiple noisy networks. With such a probabilistic model, we are able to elicit the essential modular structure in all the observed networks. By combining information from these various sources, we can compensate for the perturbation effect from noisy observations. To infer this extended SBM for multiple-network clustering, a variational Bayes method is derived to efficiently quantify uncertainties over unknown model parameters. We apply our method to protein-protein interaction (PPI) data sets and show that by taking advantage of different sources of information, our method outperforms the existing SBM-based methods implemented on individual networks in predicting new protein complexes. Furthermore, the capability of predicting missing edges from our Bayesian modeling creates the opportunity for our method to be used in active learning scenarios, where the task is to efficiently infer Fig. 1 Schematic illustration. a An example of multiple noisy networks with a coherent modular structure. b Graphical representation of the proposed probabilistic model for module identification across multiple networks as an extension of the module identification model for individual networks protein-protein interactions from new sets of experiments and concurrently by taking advantage of prior knowledge from the existing experimental results, which the individual network SBM lacks.

Background
We first briefly review the stochastic block model for module identification in individual networks to study the modular structures [8], for which a Bayesian module identification algorithm has recently been proposed to efficiently solve the problem [9].
Given a network in a graph representation, G = (V , E), where V denotes the set of all N vertices in the given network and E is the set of edges connecting the corresponding vertices in the network G. Let A be the observed N × N adjacency matrix whose elements take the values 0 or 1: A ij = 1 indicating that there is a corresponding edge e ij ∈ E between vertices v i and v j ∈ V and A ij = 0 otherwise. We introduce the latent variable z i ∈ {1, 2, . . . , K} to represent the module assignment of vertices v i , and K is the total number of desirable modules. In SBM, the probability that an edge exists between two vertices depends on their module memberships. Conditioning on module assignments, the probabilities that corresponding vertices are linked follow Bernoulli distributions with the corresponding bias parameters θ c = p(A ij = 1|z i = z j ) and θ d = p(A ij = 1|z i = z j ), which are called within-and between-module edge probabilities, respectively. Also, SBM assumes a multinomial distribution over module assignment probabilities with parameters π k = p(z i = k| π). With these assumptions, the joint probability of an adjacency matrix A and the corresponding module assignment vector z can be written as in which c + = i>j A ij I z i = z j is the number of edges contained within potential modules; c − = i>j (1 − A ij )I z i = z j is the number of non-edges contained within modules; d + = i>j A ij I z i = z j is the number of edges between vertices across different modules; d − = i>j (1 − A ij )I z i = z j is the number of non-edges across potential modules; and n k denotes the number of vertices assigned to the kth potential module with k n k = N. I [x] denotes the indicator function, which equals to one if its argument x is a true logic statement and zero otherwise. The factorization of the joint probability follows from the fact that the probability of the observed adjacency matrix can be completely determined based on the given model parameters, including module assignment probabilities π and within-and between-module edge probabilities θ c , θ d .

Methods
We extend the above Bayesian framework for individual networks to more robust and accurate module identification across multiple networks. A variational Bayes approach is then derived to infer the unknown parameters of our extended model to identify significant modules across multiple noisy networks.

Multiple-network stochastic block model
Given multiple observed noisy networks with corresponding adjacency matrices A (1) , A (2) , . . . , A (T) , we aim to study the hidden modular structures across these networks. Without loss of generality, we assume that the set of vertices is fixed in all adjacency matrices. To infer the modular structures of these observed networks, we introduce a latent root module assignment z, which can be considered to determine the connectivity of a virtual image graph illustrated in Fig. 1. For T observed networks, the corresponding instantaneous module assignments z t for A (t) evolve under a transition probability matrix P (t) . This model allows an inherent modular structure to unify all other observations to borrow strengths from each other when inferring modules of a certain network and thereby compensates for the potential detrimental effect of noise mixed with observations.
With the underlying assumption that multiple observed networks have modular structures with similar withinand between-module edge densities, we fix the edge probabilities θ c and θ d to be the same for all the observed networks. To fully specify this new stochastic block model, we set the root assignment matrix z to be multinomial with assignment probabilities π. We can write the joint distribution of assignment matrices and observed adjacency matrices of this model as follows: where a concise index representation (1 : T) is adopted to denote the indices of the corresponding components in the model for multiple networks. For example, A (1:T) stands for T adjacency matrices A (1) , . . . , A (T) . The corresponding numbers of edges c + the tth network and in the rth module in the root assignment; n k is calculated from the root assignment. One immediate consequence of such modeling is that the edges that frequently appear in multiple observations have a higher chance of being true positives. Such an intuition is reflected in the likelihood function in our model. In addition, the model makes sure that the vertices connected by these edges are more likely to be assigned to the same modules in different observed networks by the proper choice of transition probabilities, which is clarified in the subsequent section.

Bayesian inference
To predict module assignments to assign memberships to all the vertices in the given networks, we resort to Bayesian inference to draw a joint posterior distribution of all the latent variables and unknown model parameters.
To facilitate the computation of the posterior, we prefer more efficient variational Bayes algorithms instead of directly implementing Monte Carlo (MC) simulations. In order to derive closed-form updates for variational Bayes algorithms, we adopt conjugate prior distributions in our multiple-network clustering model. The conjugate prior for the root assignment probability distribution π is a Dirichlet distribution with a hyper-parameter vector n 0 : Here n k,0 is the kth component of vector n 0 and (·) is the gamma function. The conjugate priors for edge weights θ c and θ d are beta distributions with hyperparameters (α c,0 , β c,0 ) and (α d,0 , β d,0 ), respectively, The underlying assumption here is that prior to observing the data, within-and between-module edge weights are independent, so their joint prior distribution factorizes. The transition probability matrices P (t) are stochastic, and therefore, their rows add up to 1. For each matrix P (t) , where t ∈ {1, 2, . . . , T}, we use Dirichlet prior distributions with a hyper-parameter vector η (0) k on rows where P (t) k is the kth row of the transition probability matrix P (t) , P (t) km is its mth element, and η (0) k,m is the mth element of η (0) k . The rows of transition probability matrices are assumed to be independent, and also, we set their hyper-parameter vectors to be identical.
To further ensure that our model captures the modular structure inherent in the observed networks, we set hyper-parameters of prior beta distributions over edge weights to bias towards edge weights with within-module edge weights being greater than between-module edge weights, and this is controlled through appropriate settings of hyper-parameters of prior beta distributions over edge weights. For the model to be capable of benefiting from the structural information inferred from other networks, we prefer that the diagonal entries of transition probability matrices P (t) to be higher than the off-diagonal entries of those matrices, which can be achieved by setting higher hyper-parameters in the corresponding Dirichlet distributions.
With these incorporated conjugate priors, their functional forms are preserved in the posterior, a variational Bayes algorithm with closed-form updates can be derived to infer the model parameters, and, more importantly, module memberships from the aforementioned model (2) in the subsequent section.

Variational Bayes solution
Variational Bayes method is an efficient alternative to Monte Carlo sampling methods [16,17] for statistical inference over complicated models as direct sampling is not tractable and computationally prohibitive. Under appropriate settings, variational Bayes algorithms can be derived to infer the desired posterior distributions with comparable accuracies at a greater speed, which is essential for the analysis of large-scale networks. The variational Bayes method seeks a restricted family of approximation distributions q(·), which minimize the Kullback-Leibler (KL) divergence between the joint probability distributions of unknown parameters and their approximate joint probability distributions [18]. For our proposed model, the quantity to be minimized takes the following form: To simplify this optimization problem of minimizing the free energy F q, A (1:T) , we follow the mean field approximation framework developed in physics [9]. To be specific, we factorize the variational or approximate distribution q(·) with respect to its arguments: (7) After this simplification, it can be shown that the optimal approximate distribution q z for the root module assignment z satisfies the following equation [18]: where E − z [ ·] denotes the expectation taken over all the parameters and latent variables except z. Similar equations can be derived for π, θ , and z (t) for t ∈ {1, 2, . . . , T}. Solving the above Eq. (8) for all the unknown parameters leads to the complete derivation of the approximate distributions. Particularly, these distributions belong to the same family as prior distributions, i.e., the approximate distributions of θ c , θ d , and π are respectively beta, beta, and Dirichlet distributions with hyper-parameters α c ,β c , α d ,β d , and˜ n. In order to calculate the posterior approximate distribution of module assignments, we factorize them as q(z i = k) = Q ik and q z t ik for i ∈ {1, 2, . . . , N}, t ∈ {1, 2, . . . , T}, and k ∈ {1, 2, . . . , K}. Q and Q (t) are N × K matrices, in which the ith row denotes the probability of assigning vertex v i to different potential modules.
The variational Bayes algorithm iterates between two stages. In the first step, the current distributions over the model parameters are used to evaluate the module assignment matrices Q and Q (t) ; and in the second step, these memberships are fixed and variational distributions over model parameters are updated. The resulting iterative algorithm then can be summarized as: (i) Update the following expected values: where ψ(·) is the digamma function. (ii) Update the variational distribution over the root module assignment: Normalize (iv) Update the posterior hyper-parameters of the Dirichlet distribution over the root module assignment of vertices: (v) Consider η (t) for t ∈ {1, 2, . . . , T} as a matrix whose elements are η (t) k,m . Then, update the matrix η (t) as follows: where Q is the transpose of the matrix Q and η (0) is the matrix of prior hyper-parameters of transition probability matrices. (vi) Update the hyper-parameters of beta distributions over edge weights: where u is a N × 1 vector of ones and v (t) is a vector with elements v ik . (vii) Calculate the updated free energy: where B(·) is a beta function with the vector argument.
The optimized free energy in (21) decreases in consecutive iterations, and thereby, this algorithm is guaranteed to converge to a local optimum. In the case where the posterior is multi-modal, several initializations should be tested to ensure the quality of the returned solutions.

Experimental results
In this section, we evaluate our Bayesian module identification across multiple networks by testing the derived variational Bayes algorithm on both synthetic and realworld PPI data sets. The obtained results also are compared with Bayesian module identification with individual networks and another state-of-the-art network clustering algorithm-ClusterOne [19]. For synthetic data, we have the ground-truth module memberships and therefore we use normalized mutual information to assess the performance of our model. Normalized mutual information (NMI) is defined as follows [3,13]: where C = {C 1 , C 2 , . . . , C K } denotes the true assignments of vertices to corresponding modules and C = C 1 , C 2 , . . . , C K denotes the inferred module memberships of vertices by the implemented algorithms. H(C) and H(C ) are the entropies of the ground truth and inferred modules.MI(C, C ) is the mutual information calculated For real-world data sets, we analyze two budding yeast (Saccharomyces cerevisiae) PPI networks obtained from the Database of Interaction Proteins (DIP) [14] and the Biological General Repository for Interaction Datasets (BioGRID) [15] to predict protein complexes. The predicted protein complexes as inferred modules by the selected algorithms are then verified against the Saccharomyces Genome Database (SGD) [20] and Munich Information Center for Protein Sequences (MIPS) [21] golden standards as the reference complexes. To validate the predicted protein complexes by the selected algorithms, we adopt the same performance metrics introduced in [3,19]. The first metric is the fraction of pairs between the predicted and reference complexes with an overlap score of larger than 0.25. We represent this metric in the results with Frac. The overlap score ω between two sets of vertices, proteins in this case, V 1 and V 2 , is defined as: where | · | denotes the cardinality of a set. The threshold 0.25 used for ω is achieved when two equally sized protein complexes have an intersection set with half of their size. The second performance metric is the geometric accuracy (Acc), which is the geometric mean of two measures: the module-wise sensitivity (Sn) and module-wise positive predictive value (PPV): Acc = √ PPV × Sn. Given n reference and m predicted protein complexes, t ij denotes the number of proteins that are the members of both the reference complex i: 1 ≤ i ≤ n and predicted complex j: 1 ≤ j ≤ m. Furthermore, let n i represent the total number of proteins in the reference complex i. The two measures Sn and PPV for computing the geometric accuracy are defined as: Since Sn can be maximized by putting every protein in the same module and PPV can be maximized by assigning each protein in a distinct module, the Acc is considered a better performance metric that we adopt.
When some proteins appear in either none of the predicted complexes or in more than one complexes, the value of PPV can be misleading [3,19]. To mitigate such deficiencies, the authors in [22] have introduced an additional metric called module-wise separation (Sep) for fair comparison. To define it, we first introduce the marginal row-wise and column-wise relative frequencies that can be computed as: The separation of the predicted complex i and reference complex j then equals to Sep ij = F r ij F c ij . The referencewise and inferred-module-wise scores are calculated for the whole set of the reference and predicted complexes as: The final separation score is computed from these two quantities as Sep = Sep ref Sep inf . Sep ij = 1 indicates that the reference complex j is a perfect match for predicted complex i and both of them contain identical proteins.

Synthetic networks
Using the same procedure as in [1], we generate a synthetic network with N = 128 vertices and K = 4 modules, each module containing 32 vertices. The average degree of vertices is set to 16, and the average number of between-module edges of each vertex is set to 6. To generate the network, we first assign vertices to different modules by following a multinomial distribution with the equal weights for all modules. Then, each pair of vertices are connected with the probabilities equal to θ c or θ d if they belong to the same or different modules, respectively.
To simulate multiple observed noisy networks, we implement the Sneppen and Maslov re-wiring method [23] to construct new networks and adjacency matrices with instilled noise based on the original network generated as described above. In this method, a pair of edges v i ↔ v j and v k ↔ v are randomly selected and then rewired such that v i becomes connected to v , while v j to v k , provided that none of these edges existed previously. This method preserves the degree of each vertex and thus global topological properties, including edge densities in perturbed networks, do not change significantly.
Following this procedure, we generate two different sets of simulated networks. In the first experiment, we generate T = 10 adjacency matrices where the number of randomly selected re-wirings increases linearly from 5 to 50 % of the total number of edges in nine steps gradually. The adjacency matrices for two extreme cases at t = 1 and t = 10 are shown in Fig. 2a respectively on top and bottom rows, which reflect different levels of introduced noise. In the second experiment, T = 10 adjacency matrices are generated from an original adjacency matrix by randomly re-wiring 25 % of the total number of edges. Thereby, here, the noise levels are consistent across all the networks. Note that the re-wirings are independent from each other. Intuitively, in the first set of networks, module identification becomes more difficult with increasing noise levels while in the second experiment, it is similarly difficult when identifying modules in respective networks. To demonstrate that our Bayesian module identification across multiple networks can better identify modules by borrowing strength across networks, we compare the results by our algorithm on the set of ten randomly perturbed networks with those of Hofman's algorithm [9] applied to individual networks. Since we assume no prior knowledge on module memberships of the vertices, the initial hyper-parameters for the Dirichlet distributions for module assignments are set to equal values for all K = 4 modules. Empirically, neither of the algorithms is sensitive to hyper-parameters for beta distributions over edge weights, provided that within-module edge weights are larger than the between-module counterpart. Based on our experiments, Hofman's implementation on individual networks may not converge to the global optimum, especially when we have high levels of introduced noise, for example, when we have 20 % random re-wirings (t > 4) in our experiments. On the contrast, we find that multiple random initializations may not be necessary for our multiple-network clustering algorithm. In order to have a fair comparison with the satisfactory solution quality, we take 100 random initializations for both algorithms. Figure 2b and c depict the average normalized mutual information in two experiments between module detection results of both algorithms and the true module membership, by which data has been generated, based on averaged 100 independent repeats with the aforementioned settings. As it can be seen in these figures, as the noise level increases, the difference between the performances of two algorithms gets more significant. For highly noisy adjacency matrices, Hofman's algorithm indeed fails to recover the module memberships accurately. Nonetheless, our algorithm by borrowing information from other observations returns satisfying results. In the second experiment, we can clearly observe the superiority of our model as there is an approximate 0.2 difference in the normalized mutual information measure in favor of our algorithm across all networks. Thus, aggregating information across networks has led to higher accuracy in predicting the module membership of vertices achieved by our method.

Edge prediction
To further validate our model, we simulate two networks from a single "ground truth" network with the specifications identical to the networks generated in the previous section, by instilling 25 % noisy edges based on the Sneppen and Maslov re-wiring method. To test the capability of our model to predict missing edges of the ground truth network, we randomly hold out different percentages of edges of the first network and use the remaining edges (and the other network in the case of our multiplenetwork model) to predict the probability of each missing edge to exist. Compared to previous experiments, we only need to slightly modify the inference by discarding the held-out edges from the likelihood. The probability of an edge existing between nodes i and j in the model can be calculated by We consider different training ratios (the percentage of remaining edges in one network) ranging from 20 to 80 % with 10 % increments. Figures 3 and 4 show the error bar plots of Area Under the Curves (AUC) of both the Receiver Operating Characteristic (ROC) and Precision-recall (PR) for different training ratios, respectively. As expected, our multiple-network method outperforms Hofman's model in terms of both ROC and PR as it takes advantage of the shared information across observations. As shown in Figs. 3 and 4, decreasing the number of held-out edges leads to the growing margin between the performances of the two approaches.

Protein complex prediction
We further have applied our Bayesian module identification to unweighted yeast PPI networks, extracted from DIP and BioGrid, to predict protein complexes. Each of these networks have 4540 proteins and the number of edges in DIP and BioGrid networks are 21,326 and 49,128, respectively. Besides our algorithm, ClusterOne [19] and Hofman's method [9] also have been applied to the networks for comparison. ClusterOne is a greedy algorithm that can be considered as an overlapping extension of normalized cut spectral clustering. For both Hofman's and our algorithm, we need to decide the value of K for the number of potential modules. However, we note that both algorithms are in the Bayesian framework and thereby the full probability of memberships of different modules can be determined. With the large enough K, model likelihoods for different Ks can be evaluated to determine the optimal K. In the current experiments, we also focus on non-overlapping module identification as done in [9] for fair comparison by assigning each vertex v i to the kth module that maximizes Q (t) ik in the tth network. Based on the average size of protein complexes given in yeast golden standards, which is approximately 5 in both SGD and MIPS, we set K =1000 considering the size of our PPI networks.
The results of these three methods are compared based on the parameters introduced earlier. Table 1 contains the performance comparison between these algorithms based on the SGD golden standard while Table 2 provides the performance comparison based on the MIPS standard. One advantage of Bayesian methods is that the identified complexes cover the whole set of proteins present in the data sets, while ClusterOne only discovers overlapping complexes that include 1372 and 2340 proteins for DIP  and BioGRID datasets respectively out of the total 4540 proteins in the original networks. As a consequence, complexes found by Bayesian methods have larger sizes and therefore possess lower fraction of matched proteins compared to the reference complexes, which are reflected as low PPV values in both tables. Nonetheless, the results with respect to the Acc have shown that our new Bayesian module identification with multiple networks can better predict potential protein complexes, in which proteins are densely connected; and thereby, these methods are more useful when the final objective is to predict new protein complexes. Another observation in the results is that combining information from different observed networks in our model enhances both the Acc and Sep metrics compared with ClusterOne and Hofman's method, especially for the DIP network, a relatively sparse network, for which detecting inherent modules is a difficult task. Table 3 depicts the number of predicted complexes for different data sets by different algorithms. It is clear that, compared to its individual-network counterpart-Hofman's method-our algorithm is able to find significantly more protein complexes and thus obtaining a better understanding of both networks' modular structures. One example is illustrated in Fig. 5. This figure demonstrates the interactions among a group of proteins in the DIP data set. All of these proteins are assigned to the same complex by Hofman's method. However, our multiple-network method identified a more densely connected subset of proteins. This discovered protein complex contains all of the proteins in the reference RNA polymerase II mediator protein complex in  5 One example of the predicted protein complexes by Hofman's and our multiple-network clustering algorithms. The whole set of proteins were considered as a single complex by Hofman's method, while the proteins colored in yellow and dark blue form the predicted complex returned by our new multiple-network clustering method. The colored proteins in yellow are the member proteins in the RNA polymerase II mediator protein complex in SGD golden standard. This figure is produced by [24] SGD golden standard. These proteins are colored in yellow in the figure, and other proteins that do not belong to this reference complex but are members of the predicted complex by our algorithm are colored in blue. Thus, we can observe that our method is capable of increasing the resolution in module identification. Note that the number of modules found by our algorithm is comparable to that of ClusterOne, though the modules in our method are disjoint. Since we have the posterior distribution of module assignments of all vertices, one can easily construct lots of overlapping complexes by allowing each vertex to belong to more than one modules.

Conclusions
In this paper, we generalize the variational Bayes algorithm for module identification in individual networks [9] to a new stochastic block model with the efficient accompanying variational Bayes algorithm for module identification across multiple noisy observed networks. The effectiveness and efficiency of our algorithm with improved accuracy and resolution have been demonstrated on both synthetic and real-world PPI networks. In our future work, we will focus on finding solution methods for module identification from multiple networks with more general noise models.