Problem definition
Formally, a PPI network can be represented as a graph G=(V,E), where V is the set of nodes (proteins) and E is the set of edges (interactions). G is defined by its |V|×|V| adjacency matrix A:
$$ A_{i,j} = \left\{ \begin{array}{ll} 1, & \text{if}\ (i,j)\in E \\ 0, & \text{if}\ (i,j)\notin E \end{array} \right. $$
(1)
where i and j are two nodes in the node set V, and (i,j)∈E denotes an edge between i and j. The graph is called connected if any two nodes are joined by a path of edges. For supervised learning, we divide the network into three parts: a connected training network \(G_{tn}=(V,E_{tn})\), a validation set \(G_{vn}=(V_{vn},E_{vn})\), and a testing set \(G_{tt}=(V_{tt},E_{tt})\). \(G_{tn}\) consists of a minimum spanning tree augmented with a small set of randomly selected edges; because all edges are equally weighted, each newly built minimum spanning tree may differ from the previous one. \(G_{vn}\) and \(G_{tt}\) are two non-overlapping subsets of edges randomly chosen from the edges that are not in \(G_{tn}\).
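For illustration, the split described above might be implemented roughly as follows. This is a minimal sketch using networkx; the function name, the number of extra edges, and the validation fraction are our own assumptions rather than the authors' implementation.

```python
import random
import networkx as nx

def split_ppi_network(G, n_extra_edges=1000, val_fraction=0.5, seed=None):
    """Split a PPI graph G into a connected training network G_tn (a spanning
    tree plus a few random extra edges), a validation edge set E_vn, and a
    testing edge set E_tt (non-overlapping)."""
    rng = random.Random(seed)
    # All edges carry equal weight, so each run may produce a different tree.
    tree = nx.minimum_spanning_tree(G)
    in_tree = {frozenset(e) for e in tree.edges()}
    held_out = [e for e in G.edges() if frozenset(e) not in in_tree]
    rng.shuffle(held_out)
    # Augment the spanning tree with a small set of randomly selected edges.
    G_tn = nx.Graph(tree)
    G_tn.add_edges_from(held_out[:n_extra_edges])
    held_out = held_out[n_extra_edges:]
    # Split the remaining held-out edges into validation and testing sets.
    n_val = int(len(held_out) * val_fraction)
    return G_tn, held_out[:n_val], held_out[n_val:]
```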
A kernel is a symmetric positive definite matrix K whose elements are given by a real-valued function K(x, y) satisfying K(x, y)=K(y, x) for any two proteins x and y in the data set. Intuitively, the kernel for a given dataset can be regarded as a measure of similarity between protein pairs with respect to the biological property from which the kernel function takes its values. Treated as an adjacency matrix, a kernel can also be thought of as a complete network in which all the proteins are connected by weighted edges. Kernel fusion integrates multiple kernels from different data sources by a linear combination. For our task, this combination is made of the connected training network and various feature kernels \(K_{i}, i=1,2,\dots,n\) with optimized weights \(W_{i}, i=0,1,\dots,n\), as defined by Eq. (2):
$$ K_{fusion} = W_{0}G_{tn} + \sum\limits_{i=1}^{n} W_{i}K_{i} $$
(2)
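In code, Eq. (2) is simply a weighted sum of matrices. A minimal numpy sketch follows; the names are illustrative, and the training network is represented by its adjacency matrix.

```python
import numpy as np

def kernel_fusion(A_tn, feature_kernels, weights):
    """Compute K_fusion = W_0 * G_tn + sum_i W_i * K_i (Eq. 2).
    A_tn: adjacency matrix of the training network; feature_kernels: list of
    n kernel matrices; weights: length n+1 sequence [W_0, W_1, ..., W_n]."""
    K_fusion = weights[0] * A_tn
    for w, K in zip(weights[1:], feature_kernels):
        K_fusion = K_fusion + w * K
    return K_fusion
```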
Note that the training network is incomplete, i.e., many edges have been removed and reserved as testing examples. Our inference task is therefore to predict or recover the interactions in the testing set \(G_{tt}\) based on the kernel fusion.
How to infer the PPI network?
Once the kernel fusion is obtained, it is used to make PPI inference in the spirit of a random walk. However, instead of performing a random walk directly, we apply the regularized Laplacian (RL) kernel to the kernel fusion, which allows for PPI inference at the whole-network level. The regularized Laplacian kernel [28, 29] is also called the normalized random walk with restart kernel in Mantrach et al. [30] because of its underlying relation to the random walk with restart model [17, 31]. Formally, it is defined as Eq. (3):
$$ RL = \sum\limits_{k=0}^{\infty} \alpha^{k}{(-L)}^{k} = {(I+\alpha L)}^{-1} $$
(3)
where L=D−A is the Laplacian matrix built from the adjacency matrix A and the degree matrix D, and \(0<\alpha<\rho(L)^{-1}\), where ρ(L) is the spectral radius of L. Here, we use the kernel fusion in place of the adjacency matrix, so that the various feature kernels in Eq. (2) influence the random walk with restart on the weighted network [19]. With the regularized Laplacian matrix, no random walk is actually needed to measure how “close” two nodes are and thus infer whether the two corresponding proteins interact. Rather, \(RL_{K}\) is the inferred matrix, interpreted as a probability matrix P in which \(P_{i,j}\) indicates the probability of an interaction between proteins i and j. Algorithm 1 shows the general steps to infer a PPI network from an optimal kernel fusion. Figure 1 contains a toy example of the inference process, where both the kernel fusion and the regularized Laplacian are shown as heatmaps; the lighter a cell is, the more likely the corresponding proteins are to interact. However, to ensure good inference, it is important to learn optimal weights for \(G_{tn}\) and the various \(K_{i}\) when building the kernel fusion \(K_{fusion}\). Otherwise, given multiple heterogeneous kernels from different data sources, a kernel fusion without optimized weights is likely to generate erroneous PPI inference.
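A minimal sketch of this inference step follows, assuming the kernel fusion matrix has already been built. Choosing α as a fixed fraction of the admissible range \(0<\alpha<\rho(L)^{-1}\) is an illustrative choice, not necessarily the authors' setting.

```python
import numpy as np

def regularized_laplacian(K_fusion, alpha_fraction=0.5):
    """Compute RL = (I + alpha * L)^(-1) (Eq. 3), treating K_fusion as a
    weighted adjacency matrix. The result is read as a matrix P whose entry
    P[i, j] scores the likelihood of an interaction between proteins i and j."""
    D = np.diag(K_fusion.sum(axis=1))              # weighted degree matrix
    L = D - K_fusion                               # graph Laplacian
    rho = np.max(np.abs(np.linalg.eigvalsh(L)))    # spectral radius of the symmetric L
    alpha = alpha_fraction / rho                   # ensures 0 < alpha < 1/rho(L)
    n = K_fusion.shape[0]
    P = np.linalg.inv(np.eye(n) + alpha * L)
    return P
```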
ABC-DEP sampling method for learning weights
In this work, we revise the ABC-DEP sampling method [26] to optimize the weights of the kernels in Eq. (2). The ABC-DEP sampling method, based on approximate Bayesian computation with differential evolution and propagation, shows a strong capability of accurately estimating parameters for multiple models at one time. The parameter optimization task here is relatively easier than that in [26], as there is only one RL-based prediction model. Specifically, given the connected training network \(G_{tn}\) and N feature kernels in Eq. (2), the length of a particle in ABC-DEP is N+1, where a particle can also be seen as a sample comprising the N+1 weight values. As mentioned before, the PPI network is divided into three parts: the connected training network \(G_{tn}\), the validation set \(G_{vn}\) and the testing set \(G_{tt}\). To obtain the optimal particle(s), a population of particles of size \(N_{p}\) is initialized, and ABC-DEP sampling is run iteratively until a particle is found in the evolving population that maximizes the AUC of inference on the training network \(G_{tn}\) and validation set \(G_{vn}\). The validation set \(G_{vn}\) is used to avoid over-fitting as the algorithm converges. Algorithm 2 shows the detailed sampling process.
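The overall sampling loop might look roughly as follows. This is a simplified sketch rather than the authors' Algorithm 2: `particle_auc` relies on the `kernel_fusion` and `regularized_laplacian` sketches above, the negative pairs `neg_edges` (sampled non-interacting pairs) are an assumption needed to compute an AUC, and `evolve_population` stands in for the DEP step sketched after the next paragraph.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def particle_auc(weights, A_tn, kernels, pos_edges, neg_edges):
    """Fitness of one particle: AUC of the RL scores on held-out edges."""
    P = regularized_laplacian(kernel_fusion(A_tn, kernels, weights))
    pairs = pos_edges + neg_edges
    scores = [P[i, j] for (i, j) in pairs]
    labels = [1] * len(pos_edges) + [0] * len(neg_edges)
    return roc_auc_score(labels, scores)

def abc_dep_weights(A_tn, kernels, pos_edges, neg_edges,
                    n_particles=100, n_iters=50, seed=0):
    """Initialize N_p particles (each a length-(N+1) weight vector drawn from
    a uniform prior) and evolve them, keeping the best particle seen so far."""
    rng = np.random.default_rng(seed)
    fit = lambda p: particle_auc(p, A_tn, kernels, pos_edges, neg_edges)
    particles = rng.uniform(0.0, 1.0, size=(n_particles, len(kernels) + 1))
    best, best_auc = None, -np.inf
    for _ in range(n_iters):
        fitness = np.array([fit(p) for p in particles])
        if fitness.max() > best_auc:
            best_auc, best = fitness.max(), particles[fitness.argmax()].copy()
        # One evolution (DEP) step; sketched after the next paragraph.
        particles = evolve_population(particles, fitness, fit, rng)
    return best, best_auc
```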
Algorithm 2 is the main structure, in which a population of particles of equal importance is initialized and each particle consists of kernel weights randomly generated from a uniform prior. Given the particle population, Algorithm 3 samples through the parameter space for good particles and assigns them weights according to the predictive quality of their corresponding kernel fusion \(K_{fusion}\). Note that, unlike the ABC-DEP sampling method in [26], where the logarithm of the Boltzmann distribution is adopted, here we accept or reject a new candidate particle based on the Boltzmann distribution with a simulated annealing method [32]. Through the evolution process, bad particles are filtered out and good particles are kept for the next generation. We repeat this process until the algorithm converges. The optimal particle is used to build the kernel fusion \(K_{fusion}\) for PPI prediction.
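The acceptance rule can be sketched as a Metropolis-style step with a temperature parameter. The differential-evolution proposal, the scale factor F=0.5, and the fixed temperature below are illustrative choices (a full simulated-annealing schedule would lower the temperature over iterations); this is not the authors' Algorithm 3.

```python
import numpy as np

def evolve_population(particles, fitness, fitness_fn, rng, temperature=0.1):
    """One evolution step: perturb each particle with a differential-evolution
    move and accept or reject the candidate with a Boltzmann criterion."""
    n_particles = particles.shape[0]
    new_particles = particles.copy()
    for i in range(n_particles):
        # Differential-evolution proposal: x_i + F * (x_a - x_b), with F = 0.5.
        a, b = rng.choice(n_particles, size=2, replace=False)
        candidate = np.clip(particles[i] + 0.5 * (particles[a] - particles[b]), 0.0, None)
        cand_fit = fitness_fn(candidate)
        # Boltzmann acceptance at temperature T: always accept improvements,
        # otherwise accept with probability exp((cand_fit - fit_i) / T).
        accept_prob = np.exp(min(0.0, (cand_fit - fitness[i]) / temperature))
        if rng.random() < accept_prob:
            new_particles[i] = candidate
    return new_particles
```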
Data and kernels
We use the yeast PPI network downloaded from the DIP database (Release 20150101) [33] to test our algorithm. Notably, some interactions without a UniProtKB ID were filtered out in order to perform name mapping and make use of the genomic similarity kernels [27]. As a result, the PPI network contains 5093 proteins and 22,423 interactions, of which the largest connected component is used as the gold standard network. It consists of 5030 proteins and 22,394 interactions. Only tens of proteins and interactions are excluded from the largest connected component, which makes the gold standard data almost as complete as the original network. As mentioned before, the gold standard PPI network is divided into three parts: the connected training network \(G_{tn}\), the validation set \(G_{vn}\) and the testing set \(G_{tt}\), where the training network \(G_{tn}\) is included in the kernel fusion, the validation set \(G_{vn}\) is used to find optimal weights for the feature kernels, and the testing set \(G_{tt}\) is used to evaluate the inference capability of our method.
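Extracting the gold standard network from the filtered DIP graph amounts to taking the largest connected component; a one-function sketch with networkx follows, where `G_dip` is assumed to be the filtered interaction graph.

```python
import networkx as nx

def gold_standard_network(G_dip):
    """Return the largest connected component of the filtered DIP graph."""
    largest_cc = max(nx.connected_components(G_dip), key=len)
    return G_dip.subgraph(largest_cc).copy()
```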
Six feature kernels are obtained from http://noble.gs.washington.edu/proj/sdp-svm/ for this study; the following list gives detailed information on these kernels.
- \(G_{tn}\): the connected training network, which provides connectivity information. It can also be thought of as a base network on which to do the inference.
- \(K_{Jaccard}\) [34]: this kernel measures the similarity of a protein pair i,j in terms of \(\frac{|neighbors(i) \cap neighbors(j)|}{|neighbors(i) \cup neighbors(j)|}\) (see the sketch after this list).
- \(K_{SN}\): it measures the total number of neighbors of proteins i and j, \(K_{SN}=|neighbors(i)|+|neighbors(j)|\).
- \(K_{B}\) [27]: a sequence-based kernel matrix generated using BLAST [35].
- \(K_{E}\) [27]: a gene co-expression kernel matrix constructed entirely from microarray gene expression measurements.
- \(K_{Pfam}\) [27]: a generalization of the previous pairwise comparison-based matrices, in which the pairwise comparison scores are replaced by expectation values derived from hidden Markov models (HMMs) in the Pfam database [36].
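The two neighborhood-based kernels can be computed directly from the binary adjacency matrix of the network; a small numpy sketch with illustrative names is given below.

```python
import numpy as np

def neighborhood_kernels(A):
    """Compute K_Jaccard and K_SN from a binary adjacency matrix A."""
    A = (A > 0).astype(float)
    degrees = A.sum(axis=1)
    common = A @ A                                   # |neighbors(i) ∩ neighbors(j)|
    union = degrees[:, None] + degrees[None, :] - common
    with np.errstate(divide="ignore", invalid="ignore"):
        K_jaccard = np.where(union > 0, common / union, 0.0)
    K_sn = degrees[:, None] + degrees[None, :]       # |neighbors(i)| + |neighbors(j)|
    return K_jaccard, K_sn
```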
These kernels are positive semi-definite; please refer to [27] for a detailed analysis (or proof). Moreover, the fusion in Eq. (2) is guaranteed to be positive semi-definite, because basic algebraic operations such as addition with non-negative weights, multiplication, and exponentiation preserve positive semi-definiteness [37]. Finally, all these kernels are normalized to the range (0,1) in order to avoid bias.
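The normalization step might be realized as a simple min-max rescaling of each kernel matrix; this is one plausible reading, since the exact normalization formula is not specified here.

```python
import numpy as np

def normalize_kernel(K):
    """Rescale a kernel matrix to the [0, 1] range (one possible normalization;
    other schemes, e.g. K_ij / sqrt(K_ii * K_jj), are also common)."""
    k_min, k_max = K.min(), K.max()
    return (K - k_min) / (k_max - k_min) if k_max > k_min else np.zeros_like(K)
```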