We extend the above Bayesian framework for individual networks to achieve more robust and accurate module identification across multiple networks. A variational Bayes approach is then derived to infer the unknown parameters of the extended model and thereby identify significant modules across multiple noisy networks.
Multiple-network stochastic block model
Given multiple observed noisy networks with corresponding adjacency matrices {A^{(1)}, A^{(2)}, …, A^{(T)}}, we aim to study the hidden modular structures across these networks. Without loss of generality, we assume that the set of vertices is fixed in all adjacency matrices. To infer the modular structures of these observed networks, we introduce a latent root module assignment \(\vec {z}\), which can be considered to determine the connectivity of a virtual image graph illustrated in Fig. 1. For T observed networks, the corresponding instantaneous module assignments \(\vec {z}^{(t)}\) for A^{(t)} evolve under a transition probability matrix P^{(t)}. This model allows an inherent modular structure to unify all the observations so that they borrow strength from each other when inferring the modules of a particular network, thereby compensating for the potential detrimental effect of noise mixed with the observations.
With the underlying assumption that multiple observed networks have modular structures with similar within- and between-module edge densities, we fix the edge probabilities θ_c and θ_d to be the same for all the observed networks. To fully specify this new stochastic block model, we set the root assignment matrix \(\vec {z}\) to be multinomial with assignment probabilities \(\vec {\pi }\). We can write the joint distribution of the assignment matrices and the observed adjacency matrices of this model as follows:
$$ \begin{aligned} &p\left(A^{(1:T)},\vec{z},\vec{z}^{(1:T)}|\vec{\theta},\vec{\pi},P^{(1:T)},K \right) \\ &\quad = \left[ \prod_{t=1}^{T} p\left(A^{(t)}|\vec{z}^{(t)},\vec{\theta}\right) p\left(\vec{z}^{(t)}|\vec{z},P^{(t)}\right) \right] p(\vec{z}|\vec{\pi}) \\ &\quad = \theta_{c}^{\sum_{t=1}^{T}c_{t}^{+}} (1-\theta_{c})^{\sum_{t=1}^{T}c_{t}^{-}} \theta_{d}^{\sum_{t=1}^{T}d_{t}^{+}} (1-\theta_{d})^{\sum_{t=1}^{T}d_{t}^{-}} \\ & \qquad\times \left[\prod_{t=1}^{T} \prod_{i=1}^{N} \prod_{r,s=1}^{K} P_{rs}^{I\left[z_{i}=r\vphantom{\dot{z_{K_{d}}}\!}\right]\cdot I\left[z_{i}^{(t)}=s\right]} \right] \prod_{k=1}^{K} \pi_{k}^{n_{k}}, \end{aligned} $$
((2))
where a concise index representation (1:T) is adopted to denote the indices of the corresponding components in the model for multiple networks. For example, A^{(1:T)} stands for the T adjacency matrices {A^{(1)}, …, A^{(T)}}. The corresponding edge counts \(c_{t}^{+}\), \(c_{t}^{-}\), \(d_{t}^{+}\), and \(d_{t}^{-}\) for the tth network are defined similarly to those in model (1) for individual networks, except that the adjacency matrix A is replaced with A^{(t)}. Similarly, \(I\left[z_{i}=r\right]\cdot I\left[z_{i}^{(t)}=s\right]\) counts vertex v_i when it is assigned to the sth module in the tth network and to the rth module in the root assignment; n_k is calculated from the root assignment. One immediate consequence of such modeling is that edges that frequently appear across multiple observations have a higher chance of being true positives. This intuition is reflected in the likelihood function of our model. In addition, with a proper choice of transition probabilities, clarified in the subsequent section, the model ensures that the vertices connected by these edges are more likely to be assigned to the same modules in different observed networks.
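To illustrate the generative process implied by model (2), the following is a minimal NumPy sketch that simulates T noisy networks from a shared root assignment. All sizes and parameter values (N, K, T, θ_c, θ_d, and the transition matrices) are illustrative choices, not values taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and parameters (not taken from the paper).
N, K, T = 60, 3, 4               # vertices, modules, observed networks
pi = np.full(K, 1.0 / K)         # root module assignment probabilities
theta_c, theta_d = 0.30, 0.05    # within- and between-module edge probabilities

# Transition matrices P^(t) with heavy diagonals: an instantaneous assignment
# usually agrees with the root assignment.
P = np.full((T, K, K), 0.05 / (K - 1))
for t in range(T):
    np.fill_diagonal(P[t], 0.95)

# Root assignment z and per-network instantaneous assignments z^(t).
z = rng.choice(K, size=N, p=pi)
z_t = np.array([[rng.choice(K, p=P[t, z[i]]) for i in range(N)] for t in range(T)])

# Observed adjacency matrices: edges are Bernoulli(theta_c) within modules and
# Bernoulli(theta_d) between modules; matrices are symmetric with no self-loops.
A = np.zeros((T, N, N), dtype=int)
for t in range(T):
    same = z_t[t][:, None] == z_t[t][None, :]
    prob = np.where(same, theta_c, theta_d)
    upper = np.triu(rng.random((N, N)) < prob, 1)
    A[t] = (upper | upper.T).astype(int)
```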
Bayesian inference
To assign module memberships to all the vertices in the given networks, we resort to Bayesian inference to derive the joint posterior distribution of all the latent variables and unknown model parameters. To facilitate the computation of the posterior, we prefer more efficient variational Bayes algorithms instead of directly implementing Monte Carlo (MC) simulations. In order to derive closed-form updates for the variational Bayes algorithm, we adopt conjugate prior distributions in our multiple-network clustering model. The conjugate prior for the root assignment probability distribution \(\vec {\pi }\) is a Dirichlet distribution with a hyper-parameter vector \(\vec{n}_{0}\):
$$\begin{array}{@{}rcl@{}} p\left(\vec{\pi}|\vec{n}_{0}\right)=\frac{\Gamma \left(\sum_{k=1}^{K} n_{k,0}\right)}{\prod_{k=1}^{K}\Gamma (n_{k,0})} \prod_{k=1}^{K} \pi_{k}^{n_{k,0}-1}. \end{array} $$
((3))
Here n_{k,0} is the kth component of the vector \(\vec{n}_{0}\) and Γ(·) is the gamma function. The conjugate priors for the edge weights θ_c and θ_d are beta distributions with hyper-parameters (α_{c,0}, β_{c,0}) and (α_{d,0}, β_{d,0}), respectively,
$$ {\fontsize{8.9}{6}\begin{aligned} &p\left(\vec{\theta}|\vec{\alpha}_{0},\vec{\beta}_{0}\right) = p(\theta_{c}|\alpha_{c,0},\beta_{c,0})p(\theta_{d}|\alpha_{d,0},\beta_{d,0})\\ &\quad= \frac{\Gamma (\alpha_{c,0} + \beta_{c,0})} {\Gamma (\alpha_{c,0}) \Gamma (\beta_{c,0})} \theta_{c}^{\alpha_{c,0}-1} (1-\theta_{c})^{\beta_{c,0} -1}\\ &\qquad\times \frac{\Gamma (\alpha_{d,0} + \beta_{d,0})} {\Gamma (\alpha_{d,0}) \Gamma (\beta_{d,0})} \theta_{d}^{\alpha_{d,0}-1} (1-\theta_{d})^{\beta_{d,0} -1}. \end{aligned}} $$
((4))
The underlying assumption here is that, prior to observing the data, the within- and between-module edge weights are independent, so their joint prior distribution factorizes. The transition probability matrices P^{(t)} are stochastic, and therefore their rows add up to 1. For each matrix P^{(t)}, where t∈{1,2,…,T}, we place Dirichlet prior distributions with hyper-parameter vectors \(\vec {\eta }_{k}^{(0)}\) on the rows
$$ {\fontsize{8.6}{6}\begin{aligned} p\left(P^{(t)}|\vec{\eta}_{1}^{(0)},\ldots,\vec{\eta}_{K}^{(0)}\right) &= \prod_{k=1}^{K} p\left(\vec{P}_{k}^{(t)}|\vec{\eta}_{k}^{(0)}\right) \\ &= \prod_{k=1}^{K} \frac{\Gamma \left(\sum_{m=1}^{K} \eta_{k,m}^{(0)}\right)} {\prod_{m=1}^{K}\Gamma \left(\eta_{k,m}^{(0)}\right)} \times \prod_{m=1}^{K} \left(P_{km}^{(t)}\right)^{\eta_{k,m}^{(0)}-1}, \end{aligned}} $$
((5))
where \(\vec {P}_{k}^{(t)}\) is the kth row of the transition probability matrix P^{(t)}, \(P_{km}^{(t)}\) is its mth element, and \(\eta _{k,m}^{(0)}\) is the mth element of \(\vec {\eta }_{k}^{(0)}\). The rows of the transition probability matrices are assumed to be independent, and we set their hyper-parameter vectors to be identical.
To further ensure that our model captures the modular structure inherent in the observed networks, we choose the hyper-parameters of the prior beta distributions over edge weights so that within-module edge weights are a priori greater than between-module edge weights. For the model to benefit from the structural information inferred from the other networks, we also prefer the diagonal entries of the transition probability matrices P^{(t)} to be higher than their off-diagonal entries, which can be achieved by setting higher hyper-parameters for the corresponding entries of the Dirichlet distributions.
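As an illustration of such prior settings, the short sketch below chooses beta hyper-parameters that place most prior mass on θ_c being larger than θ_d, and Dirichlet hyper-parameters with dominant diagonals. The specific numbers are arbitrary assumptions for illustration and would be tuned for a given application.

```python
import numpy as np

K = 3  # number of modules (illustrative)

# Beta priors on edge weights, biased so that the within-module weight theta_c
# is a priori larger than the between-module weight theta_d.
alpha_c0, beta_c0 = 5.0, 1.0   # prior mean E[theta_c] = 5/6
alpha_d0, beta_d0 = 1.0, 5.0   # prior mean E[theta_d] = 1/6

# Symmetric Dirichlet prior on the root assignment probabilities pi.
n0 = np.ones(K)

# Dirichlet priors on the rows of each transition matrix P^(t): a larger
# hyper-parameter on the diagonal makes the instantaneous assignments tend to
# agree with the root assignment.
eta0 = np.ones((K, K)) + 9.0 * np.eye(K)   # diagonal 10, off-diagonal 1
```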
Because these conjugate priors preserve their functional forms in the posterior, a variational Bayes algorithm with closed-form updates can be derived in the subsequent section to infer the model parameters and, more importantly, the module memberships under the aforementioned model (2).
Variational Bayes solution
The variational Bayes method is an efficient alternative to Monte Carlo sampling methods [16, 17] for statistical inference over complicated models in which direct sampling is intractable or computationally prohibitive. Under appropriate settings, variational Bayes algorithms can be derived to infer the desired posterior distributions with comparable accuracy at much greater speed, which is essential for the analysis of large-scale networks. The variational Bayes method seeks a restricted family of approximate distributions q(·) that minimize the Kullback-Leibler (KL) divergence between the approximate joint distribution and the true joint distribution of the unknown parameters [18]. For our proposed model, the quantity to be minimized takes the following form:
$$ { \begin{aligned} F\left\{q,A^{(1:T)}\right\} = &- \sum_{\vec{z},\vec{z}^{(1:T)}} \int \int \left[q\left(\vec{z},{\vphantom{{\sum_{0}^{0}}}}\vec{z}^{(1:T)},\vec{\theta},\vec{\pi}\right) \right.\\ & \times \left. \ln \frac{p\left(A^{(1:T)},\vec{z},\vec{z}^{(1:T)},\vec{\theta},\vec{\pi}|K\right)} {q\left(\vec{z},\vec{z}^{(1:T)},\vec{\theta},\vec{\pi}\right)} \right] d \vec{\theta} d \vec{\pi}. \end{aligned}} $$
((6))
To simplify this optimization problem of minimizing the free energy F{q, A^{(1:T)}}, we follow the mean-field approximation framework developed in physics [9]. To be specific, we factorize the variational or approximate distribution q(·) with respect to its arguments:
$$\begin{array}{@{}rcl@{}} q\left(\vec{z},\vec{z}^{\,\left(1:T\right)},\vec{\theta},\vec{\pi}\right) = q_{\vec{\theta}}(\vec{\theta})q_{\vec{\pi}}(\vec{\pi})q_{\vec{z}}(\vec{z}) \prod_{t=1}^{T} q_{\vec{z}^{(t)}}\left(\vec{z}^{(t)}\right). \end{array} $$
((7))
After this simplification, it can be shown that the optimal approximate distribution \(q_{\vec {z}}\) for the root module assignment \(\vec {z}\) satisfies the following equation [18]:
$$\begin{array}{@{}rcl@{}} \ln q_{\vec{z}}^{*}(\vec{z}) \propto E_{-\vec{z}} \left[\ln p\left(A^{(1:T)},\vec{z},\vec{z}^{(1:T)},\vec{\theta},\vec{\pi}|K\right) \right], \end{array} $$
((8))
where \(E_{-\vec {z}} [\cdot ]\) denotes the expectation taken over all the parameters and latent variables except \(\vec {z}\). Similar equations can be derived for \(\vec {\pi }\), \(\vec {\theta }\), and \(\vec {z}^{(t)}\) for t∈{1,2,…,T}. Solving the above Eq. (8) for all the unknown parameters leads to the complete derivation of the approximate distributions.
In particular, these distributions belong to the same family as the prior distributions; i.e., the approximate distributions of θ_c, θ_d, and \(\vec {\pi }\) are respectively beta, beta, and Dirichlet distributions with hyper-parameters \(\left(\tilde{\alpha}_{c},\tilde{\beta}_{c}\right)\), \(\left(\tilde{\alpha}_{d},\tilde{\beta}_{d}\right)\), and \(\tilde{\vec{n}}\). In order to calculate the posterior approximate distributions of the module assignments, we factorize them as q(z_i=k)=Q_{ik} and \(q\left(z_{i}^{(t)}=k\right)=Q_{ik}^{(t)}\) for i∈{1,2,…,N}, t∈{1,2,…,T}, and k∈{1,2,…,K}. Q and Q^{(t)} are N×K matrices, in which the ith row denotes the probability of assigning vertex v_i to the different potential modules.
The variational Bayes algorithm iterates between two stages: in the first stage, the current distributions over the model parameters are used to evaluate the module assignment matrices Q and Q^{(t)}; in the second stage, these memberships are fixed and the variational distributions over the model parameters are updated. The resulting iterative algorithm can then be summarized as follows (a code sketch of these updates is provided after the algorithm):
Initialization. Initialize the N×K matrices Q and Q^{(t)} for t∈{1,2,…,T} and set \(\tilde {\alpha }_{c}=\alpha _{c,0}\), \(\tilde {\beta }_{c}=\beta _{c,0}\), \(\tilde {\alpha }_{d}=\alpha _{d,0}\), \(\tilde {\beta }_{d}=\beta _{d,0}\), and \(\tilde {\vec {n}} = \vec {n}_{0}\).
- (i) Update the following expected values:
$$\begin{array}{@{}rcl@{}} E\left[\ln \pi_{k}\right] = \psi(\tilde{n}_{k}) - \psi\left(\sum_{k=1}^{K}\tilde{n}_{k}\right); \end{array} $$
((9))
$$\begin{array}{@{}rcl@{}} E\left[\ln P_{km}^{(t)}\right] = \psi\left(\tilde{\eta}_{k,m}^{(t)}\right) - \psi\left(\sum_{m=1}^{K} \tilde{\eta}_{k,m}^{(t)}\right); \end{array} $$
((10))
$$\begin{array}{@{}rcl@{}} E\left[\ln \frac{1-\theta_{d}}{1-\theta_{c}}\right] &=& \psi\left(\tilde{\beta}_{d}\right) - \psi\left(\tilde{\alpha}_{d}+\tilde{\beta}_{d}\right)- \psi\left(\tilde{\beta}_{c}\right)\\ &&+ \psi\left(\tilde{\alpha}_{c}+\tilde{\beta}_{c}\right); \end{array} $$
((11))
$$\begin{array}{@{}rcl@{}} E\left[ \ln \frac{1-\theta_{d}}{1-\theta_{c}} + \ln \frac{\theta_{c}}{\theta_{d}}\right] &=& \psi(\tilde{\alpha}_{c}) - \psi(\tilde{\beta}_{c})- \psi(\tilde{\alpha}_{d})\\ &&+ \psi(\tilde{\beta}_{d}), \end{array} $$
((12))
where ψ(·) is the digamma function.
- (ii) Update the variational distribution over the root module assignment:
$$\begin{array}{@{}rcl@{}} Q_{ik} \propto \exp \left\{E\left[\ln \pi_{k}\right] + \sum_{t=1}^{T} \sum_{m=1}^{K} Q_{im}^{(t)} E\left[\ln P_{km}^{(t)}\right]\right\}. \end{array} $$
((13))
Normalize Q such that \(\sum _{k=1}^{K} Q_{ik}=1\) for all vertices v_i.
- (iii) Update the variational distributions over the instantaneous module assignments for t∈{1,2,…,T}:
$$\begin{array}{@{}rcl@{}} Q_{ik}^{(t)} &\propto& \exp \left\{\sum_{j \neq i} \left(E\left[\ln \frac{1-\theta_{d}}{1-\theta_{c}} + \ln \frac{\theta_{c}}{\theta_{d}}\right] A_{ij}^{(t)} \right. \right.\\ &-& \!\!\left.\left. E\left[ \ln \frac{1-\theta_{d}}{1-\theta_{c}}\right]\right) Q_{jk}^{(t)} + \sum_{s=1}^{K} Q_{is} E\left[\ln P_{sk}^{(t)}\right]\right\}. \end{array} $$
((14))
Normalize Q^{(t)} such that \(\sum _{k=1}^{K} Q_{ik}^{(t)}=1\) for all vertices v_i.
- (iv) Update the posterior hyper-parameters of the Dirichlet distribution over the root module assignments of the vertices:
$$\begin{array}{@{}rcl@{}} \tilde{n}_{k}=\sum_{i=1}^{N} Q_{ik} + n_{k,0}. \end{array} $$
((15))
- (v) Consider \(\tilde{\eta}^{(t)}\) for t∈{1,2,…,T} as a matrix whose elements are \(\tilde{\eta}_{k,m}^{(t)}\). Then, update the matrix \(\tilde{\eta}^{(t)}\) as follows:
$$\begin{array}{@{}rcl@{}} \tilde{\eta}^{(t)} = Q^{\prime}Q^{(t)} + \eta^{(0)}, \end{array} $$
((16))
where Q′ is the transpose of the matrix Q and η^{(0)} is the matrix of prior hyper-parameters of the transition probability matrices.
- (vi) Update the hyper-parameters of the beta distributions over edge weights:
$$\begin{array}{@{}rcl@{}} \tilde{\alpha}_{c} = \frac{1} {2} \sum_{t=1}^{T} Tr\left(Q^{(t)'}A^{(t)}Q^{(t)}\right) + \alpha_{c,0}; \end{array} $$
((17))
$$\begin{array}{@{}rcl@{}} \tilde{\beta}_{c} &=& \frac{1} {2} \sum_{t=1}^{T} Tr\left(Q^{(t)'}\left(\vec{u} \vec{v}^{(t)'}-Q^{(t)}\right)\right) \\&&- \frac{1} {2} \sum_{t=1}^{T} Tr\left(Q^{(t)'}A^{(t)}Q^{(t)}\right)+ \beta_{c,0}; \end{array} $$
((18))
$$\begin{array}{@{}rcl@{}} \tilde{\alpha}_{d} = \sum_{t=1}^{T} \sum_{i>j} A_{ij}^{(t)} - \frac{1} {2} \sum_{t=1}^{T} Tr\left(Q^{(t)'}A^{(t)}Q^{(t)}\right)+ \alpha_{d,0}; \end{array} $$
((19))
$$ \begin{aligned} \tilde{\beta}_{d} &= \sum_{t=1}^{T} \sum_{i>j} \left(1-A_{ij}^{(t)}\right)- \frac{1} {2} \sum_{t=1}^{T} Tr\left(Q^{(t)'}\left(\vec{u} \vec{v}^{(t)'}-Q^{(t)}\right)\right)\\ &\quad + \frac{1} {2} \sum_{t=1}^{T} Tr\left(Q^{(t)'}A^{(t)}Q^{(t)}\right) + \beta_{d,0}, \end{aligned} $$
((20))
where \(\vec {u}\) is an N×1 vector of ones and \(\vec {v}^{(t)}\) is a vector with elements \(v_{k}^{(t)}=\sum _{i=1}^{N} Q_{ik}^{(t)}\).
- (vii) Calculate the updated free energy:
$$ {\fontsize{9}{6}\begin{aligned} & F\left\{q^{*},A^{(1:T)}\right\} = \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{k=1}^{K} Q_{ik}^{(t)} \ln Q_{ik}^{(t)} + \sum_{i=1}^{N} \sum_{k=1}^{K} Q_{ik} \ln Q_{ik}\\ &- \sum_{t=1}^{T} \sum_{k=1}^{K} \ln \frac {B\left(\tilde{\vec{\eta}}_{k}^{(t)}\right)} {B\left(\vec{\eta}_{k}^{(0)}\right)} - \ln \frac {B\left(\tilde{\alpha}_{c},\tilde{\beta}_{c}\right)B\left(\tilde{\alpha}_{d},\tilde{\beta}_{d}\right)B\left(\tilde{\vec{n}}\right)} {B\left(\alpha_{c,0},\beta_{c,0}\right)B\left(\alpha_{d,0},\beta_{d,0}\right)B(\vec{n}_{0})}, \end{aligned}} $$
((21))
where B(·) denotes the beta function, evaluated here with either two scalar arguments or a vector argument (in the latter case, the multivariate beta function).
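To make these updates concrete, the following is a minimal NumPy/SciPy sketch of the closed-form updates in Eqs. (9)-(20). It is a simplified illustration rather than the authors' implementation: it assumes the T symmetric, zero-diagonal adjacency matrices are stacked in an array `A` of shape (T, N, N), and that `Q` (N×K) and `Qt` (T×N×K) hold the root and instantaneous assignment probabilities; all variable names are illustrative.

```python
import numpy as np
from scipy.special import digamma

def expectations(n_tilde, eta_tilde, ac, bc, ad, bd):
    """Expected values of Eqs. (9)-(12) under the current variational posteriors."""
    E_ln_pi = digamma(n_tilde) - digamma(n_tilde.sum())                           # Eq. (9)
    E_ln_P = digamma(eta_tilde) - digamma(eta_tilde.sum(axis=2, keepdims=True))   # Eq. (10)
    E11 = digamma(bd) - digamma(ad + bd) - digamma(bc) + digamma(ac + bc)         # Eq. (11)
    E12 = digamma(ac) - digamma(bc) - digamma(ad) + digamma(bd)                   # Eq. (12)
    return E_ln_pi, E_ln_P, E11, E12

def normalize_rows(logits):
    """Exponentiate and normalize each row to sum to one (shifted for stability)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    Q = np.exp(logits)
    return Q / Q.sum(axis=1, keepdims=True)

def update_assignments(A, Qt, E_ln_pi, E_ln_P, E11, E12):
    """Eqs. (13)-(14): update the root and instantaneous assignment matrices."""
    T = A.shape[0]
    # Root assignments, Eq. (13).
    Q = normalize_rows(E_ln_pi + sum(Qt[t] @ E_ln_P[t].T for t in range(T)))
    # Instantaneous assignments, Eq. (14); A has zero diagonal, so the j = i
    # term only contributes -E11, handled by subtracting Qt[t] from the column sums.
    Qt_new = np.empty_like(Qt)
    for t in range(T):
        pair_term = (E12 * (A[t] @ Qt[t])
                     - E11 * (Qt[t].sum(axis=0, keepdims=True) - Qt[t]))
        Qt_new[t] = normalize_rows(pair_term + Q @ E_ln_P[t])
    return Q, Qt_new

def update_hyperparameters(A, Q, Qt, n0, eta0, ac0, bc0, ad0, bd0):
    """Eqs. (15)-(20): update the posterior hyper-parameters."""
    T, N, _ = A.shape
    n_tilde = Q.sum(axis=0) + n0                                       # Eq. (15)
    eta_tilde = np.array([Q.T @ Qt[t] for t in range(T)]) + eta0       # Eq. (16)
    tr_aq = sum(np.trace(Qt[t].T @ A[t] @ Qt[t]) for t in range(T))
    tr_uv = sum(np.trace(Qt[t].T @ (np.outer(np.ones(N), Qt[t].sum(axis=0)) - Qt[t]))
                for t in range(T))
    n_edges = sum(np.triu(A[t], 1).sum() for t in range(T))
    n_pairs = T * N * (N - 1) / 2
    ac = 0.5 * tr_aq + ac0                                             # Eq. (17)
    bc = 0.5 * tr_uv - 0.5 * tr_aq + bc0                               # Eq. (18)
    ad = n_edges - 0.5 * tr_aq + ad0                                   # Eq. (19)
    bd = (n_pairs - n_edges) - 0.5 * tr_uv + 0.5 * tr_aq + bd0         # Eq. (20)
    return n_tilde, eta_tilde, ac, bc, ad, bd
```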
The optimized free energy in (21) decreases monotonically over consecutive iterations, and thereby the algorithm is guaranteed to converge to a local optimum. When the posterior is multi-modal, several initializations should be tested to ensure the quality of the returned solutions.
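The free energy in Eq. (21) can be evaluated to monitor convergence and to compare runs started from different initializations. The sketch below, again with illustrative names and a small constant assumed to guard logarithms of zero, complements the update functions above; in a full run one would keep the solution with the smallest free energy.

```python
import numpy as np
from scipy.special import gammaln

def log_beta(v):
    """log B(v) for a vector argument: sum of log-gammas minus log-gamma of the sum."""
    v = np.asarray(v, dtype=float)
    return gammaln(v).sum() - gammaln(v.sum())

def free_energy(Q, Qt, n_tilde, eta_tilde, ac, bc, ad, bd,
                n0, eta0, ac0, bc0, ad0, bd0):
    """Eq. (21): variational free energy of the current solution (lower is better)."""
    eps = 1e-12   # guards log(0) when assignments become (near-)deterministic
    entropy = (Qt * np.log(Qt + eps)).sum() + (Q * np.log(Q + eps)).sum()
    eta_term = sum(log_beta(eta_tilde[t, k]) - log_beta(eta0[k])
                   for t in range(eta_tilde.shape[0]) for k in range(eta_tilde.shape[1]))
    theta_term = (log_beta([ac, bc]) - log_beta([ac0, bc0])
                  + log_beta([ad, bd]) - log_beta([ad0, bd0])
                  + log_beta(n_tilde) - log_beta(n0))
    return entropy - eta_term - theta_term

# A complete run would iterate the update functions sketched above until the
# free energy stops decreasing, repeat from several random initializations of
# Q and Qt, and return the run with the smallest free energy.
```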