Computing interaction probabilities in signaling networks

Biological networks inherently have uncertain topologies. This arises from many factors. For instance, interactions between molecules may or may not take place under varying conditions. Genetic or epigenetic mutations may also alter biological processes like transcription or translation. This uncertainty is often modeled by associating each interaction with a probability value. Studying biological networks under this probabilistic model has already been shown to yield accurate and insightful analysis of interaction data. However, the problem of assigning accurate probability values to interactions remains unresolved. In this paper, we present a novel method for computing interaction probabilities in signaling networks based on transcription levels of genes. The transcription levels define the signal reachability probability between membrane receptors and transcription factors. Our method computes the interaction probabilities that minimize the gap between the observed and the computed signal reachability probabilities. We evaluate our method on four signaling networks from the Kyoto Encyclopedia of Genes and Genomes (KEGG). For each network, we compute its edge probabilities using the gene expression profiles for seven major leukemia subtypes. We use these values to analyze how the stress induced by different leukemia subtypes affects signaling interactions.


Introduction
Biological networks describe how different molecules, such as proteins, interact with each other to carry out various cellular functions. Studying biological networks gives us deep insight into cellular mechanics and allows us to understand how biological processes are governed. Discovering signaling pathways [1], mapping transcription regulation [2], and identifying the reasons behind and the consequences of various disorders [3,4] are only a few examples to many applications which are possible through studying biological networks.
Biological networks are often modeled as graphs, where each node denotes a molecule and each edge denotes an interaction. One of the critical factors that affects our analysis of biological networks is that their topologies are often uncertain. This uncertainty follows from the fact that key biological processes governing these interactions, like DNA replication, gene transcription, and epigenetic For instance, in higher eukaryotes, DNA replication can start at different chromosome locations with different probabilities [5]. Also, different biological processes like replication timing, gene expression, and transcription regulation vary across different cell types [6][7][8][9], and also from healthy cases to different disorders [10,11]. Probabilistic networks model this uncertainty in a mathematically sound manner. Briefly, a probabilistic biological network associates each edge of the underlying network with a probability value indicating the chance that the corresponding interaction takes place.
Taking the edge probabilities into account is extremely important in studying biological networks as they improve the accuracy of analysis of these networks and can lead to biologically significant observations that are impossible to achieve otherwise. Signaling pathway detection [1], network topology characterization [12], signal reachability [13], node centrality, and network stability [14] are just a few examples to the applications that have already been benefiting from this knowledge. Therefore, having Contributions In this paper, we present a novel method for computing edge probabilities for a given signaling network topology. We use end-to-end signal reachability probabilities between pairs of genes to guide our computation. While it is hard to observe the probability of each individual interaction, target reachability values are much easier to observe experimentally. Moreover, they can be observed in different cell types or under different disorders, paving way for computing phenotype-specific edge probabilities.
Correlation between the transcription levels of genes has been widely used as the primary evidence for signaling and regulation [18][19][20][21][22]. Here, we also use gene expression correlation as the guide for signal reachability between gene pairs. For each pair of source (i.e., membrane receptor) and target (reporter, i.e., transcription factor) genes, we compute the normalized Pearson correlation value between their gene expression levels as the empirical signal reachability between that pair of genes. Our method computes the probability values for all the edges so that the resulting computed signal reachability probabilities for all source-target pairs are as close as possible to the input empirical reachability values. Given a network with n edges, reachability probability can be expressed as an nth degree function of n variables [23]. Optimizing this function in an exact manner requires solving a system of n simultaneous derivative equations.
The key challenge arises from the fact that computing the function itself has an exponential time complexity, equivalent to computing all combinations of n objects. This makes exact optimization impossible even for mediumsized networks. To address this challenge, we develop a two-phase strategy. The first phase is global optimization using a genetic algorithm, where we search the entire space of possible edge probability assignments to obtain a good initial probability assignment. The second phase is local optimization using hill climbing technique. Here, we optimize the initial solution we found in the first phase by gradually improving the edge probabilities one edge at a time, until no further improvement is possible. More specifically, instead of optimizing all n variables simultaneously, we seek to optimize the value of one variable at a time. That is, at each step, we consider only one edge probability for optimization, fixing the probability values of all other edges. We show that our method produces a result that is very close to the objective. Our experiments demonstrate that our method can compute edge probabilities with high accuracy. They also show that these probability values help in identifying specific genes and interactions that characterize major leukemia subtypes.
The rest of this paper is organized as follows. Section 2 describes the method in detail. Section 3 discusses our results. Section 4 concludes the paper.

Method
In this section, we explain our method for computing edge probability values of a given probabilistic signaling network. Our method consists of two phases: global optimization and local optimization. Section 2.1 describes the key notation needed to understand our method and formally defines the problem. Sections 2.2 and 2.3 discuss the global and local optimization phases respectively.

Preliminaries
Throughout the rest of this paper, we denote a probabilistic signaling network as a graph G = (V , E, P), where V denotes the set of nodes (i.e., genes), E denotes the set of directed edges (i.e., interactions), and P : E → R ∩ [0, 1] denotes the function that returns the existence probability of each edge in E. We also define the two sets S ⊆ V and T ⊆ V as the sets of source nodes (i.e., receptor genes) and target nodes (i.e., reporter genes). We define the |S| × |T| matrix C as the gene coexpression matrix, such that C[s, t] is the absolute value of the Pearson correlation coefficient between the expressions of genes s and t, for all s ∈ S and t ∈ T. Given a probability function P : E → R ∩ [0, 1], we define the |S| × |T| matrix R P as the signal reachability matrix, such that R P [s, t] is the probability of a signal propagating successfully from s to t, for all s ∈ S and t ∈ t using P. In these definitions, C represents the empirical reachability probability between receptor and reporter genes based on their transcriptional activities. This is motivated by evidence of a strong link between gene coexpression and signaling and regulation [18][19][20][21][22]. On the other hand, R P denotes the computed reachability probability between the same gene pairs, having P as the probability function. Thus, the Euclidean L2 norm C − R P 2 is the error introduced by the function P. Since we are only interested in the magnitude of the difference between C and R P , we used L2 norm to disregard the sign of this difference. Following from this observation, next we mathematically define the problem considered in this paper.
Problem definition. Given V, E, S, T, and C, find the function P : E → R ∩ [0, 1] such that C − R P 2 is minimum.
Notice that the problem above differs from the classical reachability problem. In the reachability problem, P is known and the goal is to find R P [23]. On the other hand, in the problem considered in this paper, P is not known. In fact, the goal is to compute P with the guidance of C. That said, in order to understand our method in this paper, it is essential to know the original reachability problem well. In the following, we take a brief detour to summarize the PReach method that solves the reachability problem. For further details, we refer the reader to Gabr et al. [23].
Let U = {1, . . . , n}, where n = |E|. Let X and Y be two sets of n variables, where X = {x 1 , . . . , x n } and Y = {y 1 , . . . , y n }. Let be a subset of U. Let S 1 , . . . , S k be k different subsets of . Let x S i = j∈S i x j and y S i = j∈S i y j , where i ∈ {1, . . . , k}. Let x * and y * be two free variables. Let a 1 , . . . , a k , b and c be real numbers. PReach defines an xy-polynomial over as F = k i=1 a i x S i y \S i + bx * + cy * . Except for the free variables, each term in the above summation contains each of the indices j ∈ either as a product term x j or y j .
PReach associates every edge e j ∈ E with a variable x j ∈ X and a variable y j ∈ Y , where j ∈ U. In this notation, x j and y j represent the cases where e j is present and absent, respectively. In the above summation, each of the non-free terms a i x S i y \S i corresponds to a combination where e j is present ∀j ∈ S i and absent ∀j ∈ \ S i , and a i is the probability of observing this specific combination. The free variable x * represents the case where T is reachable from S, and b designates its probability. Inversely, the free variable y * represents the case where T is unreachable from S, and c designates its probability.
Let p i = P(e i ) and q i = 1 − p i , ∀e i ∈ E. PReach starts by associating every edge e i ∈ E with a binomial p i x i + q i y i . It then proceeds by multiplying these binomials together into a growing xy-polynomial. After each multiplication step, PReach checks the polynomial for the non-free terms that can be collapsed into one of the two free terms as follows. For any of the non-free terms a i x S i y \S i , if the edges associated with S i form at least one path from S to T, it replaces those terms with a i x * . Inversely, if the edges associated with \ S i form at least one cut between S and T, it replaces those terms with a i y * . Any further multiplication of a new term p i x i with bx * results in bp i x * . Similarly, (p i x i )(cy * ) = cp i y * , (q i y i )(bx * ) = bq i x * , and (q i y i )(cy * ) = cq i y * . The reachability problem is a computationally hard problem; it belongs to the #P-complete class [24]. However, thanks to the repeated application of the collapsing operation, PReach tries to avoid exponential growth of the size of the xy-polynomial.

Phase 1: global optimization
We are now ready to describe the method developed in this paper. The first phase of our method is a genetic algorithm to find a population of probability functions as an initial candidate solution. Note that this is a best-faith solution that will be further optimized in the second phase of our method.
We represent a candidate solution as a vector with |E| entries and denote it with ψ, where the ith entry ψ[i] is the probability assigned to edge e i ∈ E. Let us denote the computed reachability matrix obtained using the solution ψ as R ψ . That is, ∀s ∈ S, ∀t ∈ T, R ψ [s, t] is the signal reachability probability between s and t, computed based on edge probabilities in ψ (see Section 2.1). We define the fitness F ψ of a candidate solution ψ as 1 − C−R ψ 2 |S×T| . In this formulation, the fitness F ψ takes a value in the [0, 1] interval. A larger value indicates a better solution. In the extreme case when the empirical and computed reachability probabilities are identical (i.e., C = R ψ ), F ψ is equal to 1, indicating that the solution is 100 % accurate.
Our genetic algorithm consists of four steps: initialization, crossover, mutation, and selection. We elaborate on these steps next.

Initialization
We start by generating a set of random candidate solutions. These solutions serve as the seed population of solutions. We generate each seed candidate solution ψ ∈ by assigning a random number between 0 and 1 to each entry in ψ. We then compute the fitness values F ψ , ∀ψ ∈ . In our experiments, we set the population size to | | = 50, thus generate 50 random seeds. 2. Crossover This step improves the solutions in the set by combining pairs of existing solutions, also known as the crossover operation. To do that, We define a gap value g ψ for every ψ ∈ as The value of g ψ shows how much the reachability R ψ , computed based on the solution ψ, deviates from the target C in total. A positive gap value indicates that the solution ψ underestimates the probability of some of the edges. Inversely, a negative gap value indicates that the solution ψ overestimates the probability of some of the edges. We then randomly select two solutions ψ 1 and ψ 2 from using biased sampling, where the chance of selecting a sample ψ i is directly proportional to its fitness F ψ i . We use ψ 1 and ψ 2 to generate a new candidate solution as follows. For each entry i ∈ {1, .., |E|}, we choose either entry ψ 1 [i] or ψ 2 [i] based on which is more likely to produce a candidate solution with a higher fitness. There are three possible scenarios: if both g ψ 1 and g ψ 2 are positive, both ψ 1 [i] and ψ 2 [i] are possibly underestimated, so we choose the higher. Inversely, if both g ψ 1 and g ψ 2 are negative, both ψ 1 [i] and ψ 2 [i] are possibly overestimated, so we choose the lower. If one of g ψ 1 and g ψ 2 is positive and the other is negative, then we randomly select between ψ 1 [i] and ψ 2 [i], where the chance of each is proportional to the fitness of its corresponding solution. We expect this strategy to produce a new solution that is better than both ψ 1 and ψ 2 , as we reduce the gap value while constructing it. We repeat the crossover step 50 times (i.e., | | times) and include the resulting solutions to . 3 where every solution has a chance of selection that is proportional to its fitness. We remove the non-selected 50 solutions from .
We repeat the crossover, mutation, and selection steps for a large number of iterations, updating the population each time. The number of iterations needed for convergence depends on the size and properties of the target network and is a matter of trial and error. We then select the solution which has the highest fitness in the final population as the output of this phase. We use it as an input to our next local optimization phase.

Phase 2: local optimization
At the end of the first phase, we have a solution ψ that has the highest fitness value in the entire population . Although ψ is expected to yield small errors in signal reachability, it is not necessarily optimal. In this phase, we develop a hill climbing algorithm, which gradually alters the probability assignment of each edge in the solution, one edge at a time. At each step, it ensures that the probability assignment ψ[e] of the edge e being altered becomes optimal (i.e., yields the highest possible fitness value) given the probability assignment of all other edges. We continue altering the solution until no edge probability value can be altered without increasing C − R ψ 2 . In the following, we describe in detail how at a given step we alter one probability assignment ψ[e], given all other values in ψ.
Optimizing a single edge probability Assume that for only one edge e ∈ E, the probability p e of this edge is unknown. Also, assume that the probability values of all the remaining edges in E − {e} are known. Here, we compute the value of p e that guarantees to minimize the reachability error C −R ψ 2 . For this purpose, we develop a new method which is built on the PReach method [23].
Unlike PReach, our method allows one of the edge probabilities p e to be a variable. This additional unknown alters the form of the xy-polynomial constructed by PReach (see Section 2.1) as the unknown p e can get multiplied by all the terms of the original xy-polynomial. This new variable can increase the polynomial size dramatically, depending on the combination of the terms in the polynomial. We avoid this problem through a simple observation that the final xy-polynomial is independent of the order in which we multiply individual edge binomials.  Since e is the last edge to multiply in the network, it is guaranteed that x e x S i and y e y \S i will collapse to x * and y * , respectively, for all i ∈ {1, . . . , l} [23]. Also, we already know that x e x * = x * and y e y * = y * for any edge e. Thus, we have The reachability probability is the final coefficient of x * . Therefore its value is p e l i=1 a i + b. i.e., α = b and β = l i=1 a i . This means that after multiplying the binomials of all the edges except e, α is the coefficient of x * , and β is the sum of the coefficients of all non-free terms.
Notice that the coefficient of x * (i.e., α + βp e ) is also a polynomial of first degree in p e . Using this observation, we solve for the objective of this paper (which is to minimize C − R ψ 2 ) by solving for p e as follows. We first compute R ψ [s, t] = α st + β st p e , ∀(s, t) ∈ S × T. We then derive the optimal value for p e as: To solve the minimization function above, we equate its derivative to zero. The formula above constitutes the optimal value for p e given all other probability values. However, there is no guarantee that optimal value of p e falls within the proper probability range of [0, 1]. This is because the derivation above gives the optimal result across all real numbers. However, by taking the second derivative of the objective function, one can easily see that the objective function is continuous, convex, and has only one solution to d dp e = 0. This implies that the closer the value of p e is to its unconstrained optimal value, the smaller the error is in the objective function. Therefore, if the optimal value of p e is above 1, we replace it with 1. If it is below 0, we replace it with 0. This way, we find the best possible value for p e in [0, 1].

Experimental results
In this section, we experimentally evaluate our method on four major signaling networks from the Kyoto Encyclopedia of Genes and Genomes (KEGG), including cell cycle, programmed cell death, and immune response regulation pathways (see Table 1 for dataset details). We use the gene expression samples for the leukemia subtypes from Zhang et al. [10]. This dataset contains gene expression values for 413 patients, each having one of seven different leukemia subtypes (six B-ALL subtypes plus T-ALL). We use this dataset as it provides a large number of samples for a wide spectrum of leukemia subtypes. We perform a comprehensive comparative analysis of how interaction probabilities vary across these leukemia subtypes (Section 3.2). Using the interaction probability values we find, we also compute gene centrality values for all network and leukemia subtype combinations (Section 3.3). We finally extract the genes which behave differently in specific combinations of network and leukemia subtype, and analyze the significance of these genes in these combinations (Section 3.4).

Comparison with logistic regression
In this section, we present comparative analysis of the results of our method against the logistic regression method presented by Sharan et al. [17]. Throughout this section, we refer to our method as PReach, and to the logistic regression method as LogReg. LogReg learns interaction probabilities through three features: available evidence, interactor small-world properties, and their gene expression. The latter is different across the leukemia subtypes, therefore LogReg produces different probability values for each subtype. In addition to the four networks described in Table 1, we also use four more networks to obtain more conclusive results. The additional four networks are ErbB, Wnt, NF-kappaB, and p53 from KEGG. First, we compare the edge probability values produced by both methods. For each network, we run both methods once for each leukemia subtype. For every pair of network and subtype, we compute the average edge probability for both methods. We then compute the log of the ratio between them for comparison. Figure 1a shows the results. We observe from the figure that PReach produces higher probabilities on average for all pairs of network and subtypes. The biggest gap between PReach and LogReg occurs in Wnt and complement and coagulation cascades (ccc).
Next, we compare the quality of the results of PReach vs LogReg. There exists no ground truth to compare the edge probability values against. However, we compare the outputs of the two methods with respect to two measures. First, we inspect how spread is the output probabilities across the  Figure 1b shows the results. We observe that the entropy in PReach is higher than LogReg in almost all cases. This means that PReach output probabilities are more spread across the [0, 1] spectrum, while those of LogReg tend to be more discrete. More detailed inspection reveals that this happens because LogReg assigns similar probability values to most of the interactions most of the time (results not shown due to space limit). Thus, it fails to provide fine-grained distinction between the likelihoods of interactions, while our method successfully provides such distinction.
To further investigate the results quality of both methods, we inspect how much each method differentiates leukemia subtypes with respect to their edge probability values. For every pair of network and subtype, we arrange the edge probability values produced by each method in a vector with the same ordering. Next, for a given network, we compute the Euclidean distance between the vectors of each pair of leukemia subtypes. Then for every subtype, we compute the average distance between its vector and those of all other subtypes. We compare this average distance when computed using PReach versus LogReg. Figure 1c shows the results. We observe that the distances computed based on PReach are higher in the vast majority of network-subtype pairs. This means that our method can differentiate leukemia subtypes while LogReg fails to do that.

Interaction probability in leukemia
In this experiment, we explore the differences on interaction probabilities of signaling networks in distinct leukemia subtypes. Our aim is to identify specific gene interaction differences between distinct leukemia subtypes. To achieve this, we use our method to compute interaction probability values for the KEGG signaling networks. For each network, we run our method seven times, once for each leukemia subtype. Before conducting detailed analysis, however, we first need to validate that our method is computationally feasible, that it scales to networks under consideration. To evaluate its performance, we measure the time our method takes to compute the probabilities of all the interactions for every network in every leukemia subtype. Also, based on our original optimization target (see Section 2.1), we need to know how accurate our results are (i.e., how close the computed reachability probability R ψ is to the input C). To do this, we measure the quality of the resulting interaction probabilities as 1 − C−R ψ |S×T| . The closer this value is to 100 %, the better the quality is. Table 1 shows the size of each network along with its average time and quality over the seven leukemia subtypes. Our results demonstrate that our method easily scales to the networks under consideration. It computes the interaction probabilities in about 5 min or less for all the networks we tested. More importantly, our method is highly accurate. The computed reachability values deviate from the empirical reachability values by less than 5 % for all the networks. These results are highly encouraging as they show that our method is both accurate and has practical running time. Thus it can be applied on real datasets to compute interaction probabilities.
Next, we analyze the differences in interaction probabilities across leukemia subtypes. For each network, we represent each leukemia subtype by a vector of the edge probabilities computed for it. We then compute a hierarchical clustering of these vectors. Figure 2 shows the results. From the figure, we observe that the probability value of some interactions vary significantly across All values are computed for both methods and compared using logarithm of the ratio (i.e., more than zero means PReach is higher). a Average edge probability. b Entropy of edge probability distribution. c Average distance between the edge probabilities of a each subtype and other subtypes different leukemia subtypes. For instance, CASP3 is the target in three different apoptosis interactions whose probability in a subtype is at least 2 standard deviations away from their mean values among other subtypes. These interactions are (CASP10 → CASP3) in hyperdiploid, (CASP12 → CASP3) in T-ALL, and (BIRC8 → CASP3) in Ph (circled in Fig. 2a). Similarly, CHEK1 is the source in two different cell cycle interactions whose probability in a subtype is at least 2 standard deviations away from their mean values among other subtypes. These interactions are (CHEK1 → CDC25A) in T-ALL, and (CHEK1 → TP53) in MLL (circled in Fig. 2b). CASP3 is already linked to B-cell lymphoma [25], lung [26], skin [27], breast [28], and other cancers. Defects in apoptosis signaling and cell cycle pathways play an essential role in leukemogenesis. CASP3 is an effector caspase that has been associated with B-cell lymphoma [25], lung [26], skin [27], breast [28], and other cancers. Moreover, regulation of CASP3 activation has been linked to the prognosis and remission in B-ALL [29]. Notably, in the three leukemia subtypes with different apoptotic signals targeting CASP3, the programmed cell death is inhibited; interactions of CASP3 with its activators are weaker in hyperdiploid and T-ALL (CASP10 → CASP3 and CASP12 → CASP3, respectively) while the interaction with its inhibitor is increased in B-ALL with Philadelphia chromosome (BIRC8 → CASP3). CHEK1 is a cell cycle checkpoint response protein that is linked to oral squamous cell carcinoma [30] and colorectal cancer [31]. Recently, increased levels of CHEK1 have been associated to B-ALL and T-ALL [32]. Our observation makes both CASP3 and CHEK1 strong candidates for investigation in their respective subtypes of leukemia.
Additionally, the hierarchy of the leukemia subtypes gives an insight about which subtypes have similar signaling behavior. T-ALL and TCF3-PBX1 are closest to each other in apoptosis, complement and coagulation, and chemokine, noticeably more distant in cell cycle. Hyperdiploid is very similar to TCF3-PBX1 in cell cycle, but more distant from it in the other three networks. In fact, hyperdiploid is the most distant from all other subtypes in both apoptosis and chemokine. This information can guide us to build on the existing knowledge about signaling behavior in a certain subtype, using appropriate experiments, to develop new findings about other subtypes with similar behavior.

Gene centrality for leukemia subtypes
Next, we use the interaction probability values we computed in Section 3.2 to compute centrality values of the genes in each network. Briefly, we compute the centrality of a gene as its contribution to signal reachability probability between all pairs of genes (see Gabr et al. [14] for details). We compute centrality values for the genes in each of the four networks for each of the seven leukemia subtypes. For each network, we represent each leukemia subtype by a vector of the node centrality values computed for it. We then compute a hierarchical clustering of these vectors. Figure 3 presents the results.
Interestingly, the figure shows variation of centrality value across different leukemia subtypes for only a small number of genes. Notice that this variation is not as diverse as that of interaction probability values (see Fig. 2). In apoptosis, BID is the top outstanding gene in hyperdiploid, with centrality values 2.8 standard deviations higher than their mean centrality in other subtypes (circled in Fig. 3a). We notice loss of centrality for other key regulators of apoptosis like RIPK1 and CASP7 in ETV6_RUNX1 and Ph, respectively (circled in Fig. 3a). Similarly, cell cycle regulators like CDK1 and PLK1 have an increased centrality in hypo, with centrality values 2.6 standard deviations higher than their mean centrality in other subtypes (circled in Fig. 3b). BID remains as key regulator in hyperdiploid but its centrality is lost in other samples, suggesting a disruption of the programmed cell death regulation in most of the leukemia subtypes. RIPK1 and CASP7 are linked to colorectal cancer [33]. CDK1 and PLK1 induce cell cycle progression and have been associated with distinct types of cancer [34][35][36] including leukemia and lymphoma [37][38][39]. Our results suggest that these genes are interesting targets for studying in the scope of their respective leukemia subtypes.

Enrichment analysis of outstanding genes
Following from the previous results, we want to know which network plays a key role in a certain leukemia subtype. In other words, we want to know which network's outstanding set of genes is highly enriched in a specific leukemia subtype. To achieve this, we first extract the set L of outstanding genes for every network in every subtype. For every edge e = (u, v) in a given network, we compute the mean μ e and the standard deviation σ e of its probability values in all leukemia subtypes. Then for every subtype, for every edge e, we check if the probability of e in this subtype was at least 2σ e away from μ e . If it is, we add u and v to L. We then perform gene set enrichment analysis (GSEA) [40] on L for every network in every  leukemia subtype. For every pair of network and subtype, we set the phenotype A as the subtype samples, and phenotype B as all samples from other subtypes. We then run GSEA on the network's outstanding gene set L to measure its differential significance from A to B. We consider gene sets whose p value is below 0.1 as highly enriched. Table 2 lists these gene sets and their p values in their respective leukemia subtypes. Figure 4 shows the gene set enrichment plots for the two highest enriched gene sets. We observe from Table 2 that apoptosis and cell cycle signaling networks are dominant in all gene sets that are highly enriched. This implies a fundamental role for these two networks in the listed subtypes. It also implies that these subtypes are either caused by or leading to a perturbation in their respective gene sets. Another noteworthy observation is that, although all the highly enriched gene sets belong to only two networks, there is little overlap between them. In apoptosis for instance, PPP3 genes are dominant in hyperdiploid, while BIRC genes are dominant in T-ALL, and PRKA genes are dominant in TCF3-PBX1. Additionally, from Fig. 4, we observe that, although apoptosis and cell cycle have the highest enriched gene sets for hyperdiploid and ETV6_RUNX1, respectively, their relations to their respective leukemia subtypes are not the same. All genes in the apoptosis set in hyperdiploid exhibit higher expression than in other subtypes, which implies up-regulation of these genes in hyperdiploid. On the other hand, most of the genes in the cell cycle set in ETV6_RUNX1 have lower expression than in other subtypes, which indicates down-regulation of these genes in ETV6_RUNX1.

Conclusions
In this paper, we presented a novel method for computing edge probability in signaling networks. Our method uses gene coexpression as input and computes the edge probabilities so that reachability between edge terminals is as close as possible to their empirical values obtained from gene transcription levels. We used our method to compute edge probabilities for four KEGG signaling networks, using gene expression data for seven leukemia subtypes. We also used the computed edge probabilities to compute a centrality value for every gene in every leukemia subtype. We analyzed the interactions and genes with outstanding probability and centrality in specific subtypes. We also analyzed similarities and differences among these subtypes based on their edge probabilities. We performed gene set enrichment analysis on the set of edges with outstanding probabilities in each subtype to study the significance of the results. Our analysis provided evidence that links specific gene sets to specific leukemia subtypes, which makes them strong candidates for investigation in the scope of their respective subtypes.