Learning directed acyclic graphs from large-scale genomics data
- Fabio Nikolay^{1},
- Marius Pesavento^{1},
- George Kritikos^{2} and
- Nassos Typas^{2}
https://doi.org/10.1186/s13637-017-0063-3
© The Author(s) 2017
Received: 7 September 2017
Accepted: 8 September 2017
Published: 20 September 2017
Abstract
In this paper, we consider the problem of learning the genetic interaction map, i.e., the topology of a directed acyclic graph (DAG) of genetic interactions from noisy double-knockout (DK) data. Based on a set of well-established biological interaction models, we detect and classify the interactions between genes. We propose a novel linear integer optimization program called the Genetic-Interactions-Detector (GENIE) to identify the complex biological dependencies among genes and to compute the DAG topology that matches the DK measurements best. Furthermore, we extend the GENIE program by incorporating genetic interaction profile (GI-profile) data to further enhance the detection performance. In addition, we propose a sequential scalability technique for large sets of genes under study, in order to provide statistically significant results for real measurement data. Finally, we show via numeric simulations that the GENIE program and the GI-profile data extended GENIE (GI-GENIE) program clearly outperform the conventional techniques and present real data results for our proposed sequential scalability technique.
1 Introduction
Genetic interaction analysis aims at uncovering the interactions among a set of genes with respect to a specified cell function of a biological system, e.g., the fitness of a specific bacterial colony. The interactions among the genes under study can be characterized by a directed acyclic graph (DAG) [1], where the hierarchical relationship between two genes of a DAG describes their hierarchical interaction type [2]. However, a DAG cannot be observed directly; only the specified cell function under study can be observed, through the phenotypes it yields. The term phenotype generally describes the specific observable manifestation of a biological attribute of an organism. For bacteria, for example, a common biological attribute is growth measured in colony size, and a specific colony size is a phenotype of this attribute.
The role of the studied genes in the cell machinery and the hierarchical interaction types of the genes, as well as the DAG, which describes the latter ones, can only be learned by means of knockout experiments where a gene or a set of genes is functionally switched off and the phenotype is observed. Traditionally, only single-knockout (SK) experiments have been conducted but those mainly provide evidence on the importance of a single gene for the investigated cell process and do not convey much information about the interaction among the genes under study.
Recently, with the technological advances in microarrays and the development of the synthetic genetic array technologies [3], new approaches have been taken that are based on large-scale knockout experiments of pairs of genes. Such double-knockout (DK) experiments are much more powerful for exploring genetic interactions since a DK phenotype of an arbitrary pair of genes generally differs considerably from the superposition of the corresponding SK phenotypes of this pair of genes. According to [2], the gene pairs can be classified into one out of five hierarchical relationship classes based on their SK and DK phenotypes. Further, based on the hierarchical relationship classes, the DAG underlying the observed SK and DK phenotypes can be inferred which directly reflects the genetic interactions among the genes.
In order to detect the DAG underlying the SK and DK phenotypes, a variety of statistical methods have been developed that are based on scoring the measurements or on thresholding the genetic interaction (GI)-profile data, which is commonly based on the Pearson correlation of the SK and DK phenotypes [4–9]. However, the methods presented in [4–9] have three considerable disadvantages: (D1) they show poor performance in detecting the DAG underlying the observed SK and DK phenotypes; (D2) they have no ability to combine different types of side information, e.g., GI-profile data with SK and DK phenotypes, to enhance the detection quality; and (D3) they cannot make use of prior knowledge in order to enhance the DAG detection quality. In particular, the ability to overcome the disadvantage in (D2) will become more important in the future, since a constantly increasing amount of different data types, e.g., SK and DK phenotypes, Pearson correlation-based GI-profile data, and other types of GI-profile data, is freely available. Furthermore, the ability to overcome the deficit in (D3), i.e., to incorporate a priori knowledge about existing results in genomics research into the DAG detection procedure, is also of great significance, since existing functional relationships among genes are increasingly better understood based on a variety of studies that constantly extend the knowledge on the cell machinery and molecular biology. Despite the abovementioned disadvantages (D1) to (D3), methods such as those presented in [4–9] are the most commonly used methods to detect the DAG underlying the measured SK and DK data. Therefore, we propose the Genetic-Interactions-Detector (GENIE) program, an approach based on the biological system model of [2] with which it is possible to overcome the abovementioned shortcomings of the most popular methods such as those reported in [4–9].
Since the hierarchical relationship classes are mutually dependent, classifying each pair of genes to a specific hierarchical relationship class corresponds to a multi-hypothesis test. Thus, we formulate this multi-hypothesis test as a linear integer optimization program [10–15] in order to find the set of hierarchical relationship classes that best matches the observed SK and DK phenotypes. Based on the detected set of hierarchical relationship classes, the set of edges of the DAG which reflects the interactions among the genes can be computed. Furthermore, we propose the GI-GENIE program where we advance the proposed GENIE program by incorporating GI-profile data, e.g., GI-profile data based on Pearson correlation of the observed SK and DK phenotypes, into the DAG detection procedure. Knowledge about the true dependencies is incomplete for the vast majority of gene sets, i.e., the true DAG of a set of genes with respect to a specific cell function is unknown or only partially known for almost all sets of genes, irrespective of the cell function under study. Hence, there is a strong interest in the genomics research community in statistically reliable statements about the topology of the DAGs underlying large sets of genes, i.e., about the empirical probability of a pair of genes to interact with each other. Towards this aim, we propose a sequential technique based on the GENIE/GI-GENIE algorithms that yields statistically significant statements about the interactions among genes from a large set of genes under study.
This paper is organized as follows. We first summarize the biological system model of [2] in Section 2, and then, we present in Section 3 the GENIE program for detecting the set of hierarchical relationship classes, which represents a valid DAG and matches the DK measurements best. In Section 4, we extend the GENIE program with GI-profile data (GI-GENIE). In Section 5, we present our scalability approach in order to obtain statistically significant results for large sets of genes. Following Section 5, we present results for simulated data which demonstrate the performance of the GENIE and the GI-GENIE methods in Section 6. Furthermore, in Section 6, we display real data results for the scalability approach described in Section 5. Finally, we summarize in Section 7 the key parts of this paper and give a brief outlook on future work.
2 System model
In this section, we provide a mathematical description of a DAG as well as its biological implications. Furthermore, we introduce the common biological terms and provide a compact description of the genetic interaction model of [2] including simple explanations on how to read and interpret a DAG.
2.1 Graph properties of a DAG
According to [16], a graph \(\mathcal{A} = \left(\mathrm{V}(\mathcal{A}), \mathrm{E}(\mathcal{A})\right)\) is well defined by a set of nodes \(\mathrm{V}(\mathcal{A}) = \left\{\mathrm{a}_{1}, \mathrm{a}_{2}, \ldots, \mathrm{a}_{A}\right\}\) and a set of edges \(\mathrm{E}(\mathcal{A}) = \left\{\left\{\mathrm{a}_{1}, \mathrm{a}_{A}\right\}, \left\{\mathrm{a}_{2}, \mathrm{a}_{A}\right\}, \ldots, \left\{\mathrm{a}_{A}, \mathrm{a}_{1}\right\}\right\}\), where \(\left\{\mathrm{a}_{i}, \mathrm{a}_{j}\right\}\) for \(\mathrm{a}_{i}, \mathrm{a}_{j} \in \mathrm{V}(\mathcal{A})\) denotes a directed edge from \(\mathrm{a}_{i}\) to \(\mathrm{a}_{j}\) and the cardinality \(\left|\mathrm{V}(\mathcal{A})\right| = A\) denotes the number of elements of set \(\mathrm{V}(\mathcal{A})\). The operators V(·) and E(·) applied to graph \(\mathcal{A}\) yield the set of nodes \(\mathrm{V}(\mathcal{A})\) and the set of edges \(\mathrm{E}(\mathcal{A})\), respectively. We mostly address sets \(\mathrm{V}(\mathcal{A})\) and \(\mathrm{E}(\mathcal{A})\) by \(\mathcal{G}_{\mathcal{A}}\) and \(\mathcal{E}_{\mathcal{A}}\), respectively, for the sake of notational convenience, i.e., \(\mathcal{A} = \left(\mathcal{G}_{\mathcal{A}}, \mathcal{E}_{\mathcal{A}}\right)\). Assume that there is a path P from node \(\mathrm{a}_{i} \in \mathcal{G}_{\mathcal{A}}\) to node \(\mathrm{a}_{j} \in \mathcal{G}_{\mathcal{A}}\) in graph \(\mathcal{A}\), i.e., a directed connection from node \(\mathrm{a}_{i}\) to node \(\mathrm{a}_{j}\). Then, path P is described by the concatenation of the nodes passed through on the way from node \(\mathrm{a}_{i}\) to node \(\mathrm{a}_{j}\), i.e., \(\mathrm{P} = \mathrm{a}_{i} \ldots \mathrm{a}_{j}\), and \(\mathrm{V}(\mathrm{P}) = \left\{\mathrm{a}_{i}, \ldots, \mathrm{a}_{j}\right\}\) denotes the set of nodes of path P [16].
The functional dependencies among a set of genes \(\mathcal{G} = \left\{g_{1}, \ldots, g_{G}\right\}\), with \(G = \left|\mathcal{G}\right|\) elements, for a given cell process and species can be characterized by a genetic interaction map (GI map, [17–20]), which is essentially a DAG with a common root node, i.e., the reporter level R [21]. In particular, an arbitrary DAG \(\mathcal{D}\) can be described as a graph \(\mathcal{D} = \left(\mathcal{G}_{\mathcal{D}}, \mathcal{E}_{\mathcal{D}}\right)\) with the set of nodes \(\mathcal{G}_{\mathcal{D}} = \mathcal{G} \cup \left\{R\right\}\) and the set of directed edges \(\mathcal{E}_{\mathcal{D}} = \left\{\left\{g_{i}, g_{j}\right\}, \ldots, \left\{g_{j}, g_{l}\right\}\right\}\). As the genetic interactions can only be observed through the reporter, all edges are oriented in such a way that each path starting from an arbitrary gene \(g_{i} \in \mathcal{G}\) terminates in the root node R and any gene appears on the path at most once, i.e., there exist no cycles in the graph. Hence, the DAG \(\mathcal{D}\) is always connected via its root node R. For the sake of notational convenience, in most cases, we write gene i when addressing gene \(g_{i}\) [21]. The reporter node R is an artificial node, i.e., not a gene, in the concept of a DAG and represents the measured phenotype of the specific cell process under study.
2.2 Biological interaction model
As stated in condition C_{1} in (1a), two genes i,j in DAG \(\mathcal {D}\) belong to the hierarchical relationship class k=1, if all paths from gene i to the reporter node R pass through gene j. Hence, gene j is always an element of the set of nodes of each path \(\mathrm {P}_{i,\tau } \in \mathcal {P}_{i}\) from gene i to the reporter node R, i.e., j∈V(P_{ i,τ }) for all paths P_{ i,τ } from gene i to the reporter node R. With the same line of argument as used in (1a), two genes i,j in DAG \(\mathcal {D}\) belong to the hierarchical relationship class k=2 if condition C_{2} in (1b) is satisfied. Two genes i,j in DAG \(\mathcal {D}\) belong to the hierarchical relationship class k=3 and are considered to be independent from each other if condition C_{3} in (1c) is satisfied which states that there is no path P_{ i,τ } from gene i to the reporter node R that passes through gene j as well as there is no path \(\mathrm {P}_{j,\tilde {\tau }}\) from gene j to the reporter node R that passes through gene i. As stated in (1d), two genes i,j in DAG \(\mathcal {D}\) belong to the hierarchical relationship class k=4 if there is at least one path P_{ i,τ } from gene i to the reporter node R which does not pass through gene j as well as for all paths \(\mathrm {P}_{j,\tilde {\tau }} \in \mathcal {P}_{j}\), there is always a path \(\mathrm {P}_{i,\tau } \in \mathcal {P}_{i}\) that is a super-path of the respective \(\mathrm {P}_{j,\tilde {\tau }} \in \mathcal {P}_{j}\). With the same line of argument as used in (1d), two genes i,j in DAG \(\mathcal {D}\) belong to the hierarchical relationship class k=5 if condition C_{5} in (1e) is satisfied.
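The path-based class definitions above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: conditions C_{2} and C_{5} are taken to be the mirror images of C_{1} and C_{4} (as the "same line of argument" wording suggests), every gene is assumed to have at least one path to R, and the function names and the toy DAG are hypothetical.

```python
def all_paths(edges, src, root="R"):
    """Return every directed path from src to root as a tuple of nodes."""
    if src == root:
        return [(root,)]
    paths = []
    for nxt in edges.get(src, ()):
        for tail in all_paths(edges, nxt, root):
            paths.append((src,) + tail)
    return paths


def relationship_class(edges, i, j, root="R"):
    """Classify the ordered gene pair (i, j) into class k in {1, ..., 5}.

    Assumes i != j and that both genes reach the reporter node.
    """
    i_via_j = [j in p for p in all_paths(edges, i, root)]
    j_via_i = [i in p for p in all_paths(edges, j, root)]
    if all(i_via_j):        # C1: every path from i to R passes through j
        return 1
    if all(j_via_i):        # C2 (assumed mirror of C1)
        return 2
    if not any(i_via_j) and not any(j_via_i):  # C3: independent genes
        return 3
    if any(i_via_j):        # C4: i reaches j, but also bypasses it
        return 4
    return 5                # C5 (assumed mirror of C4)


# Hypothetical toy DAG (not the paper's Fig. 1): node -> list of successors.
toy_edges = {"i": ["j"], "j": ["R"], "t": ["n", "R"], "n": ["R"]}
```

For this toy graph, the pair (i, j) lies on a linear pathway, (i, t) is independent, and t reaches R both through n and directly, so the pair (t, n) falls into class 4 under the reading above.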
2.3 Class coupling—example
To illustrate this, let us consider the example DAG \(\mathcal {D}_{0}\) of Fig. 1. All paths from gene i_{0} to node R pass through gene j_{0}, i.e., they are in a linear pathway with gene i_{0} upstream of gene j_{0}. Thus, the pair of genes i_{0},j_{0} belongs to class k=1. Note that with the same line of argument, we conclude that also genes i_{0} and l_{0} belong to relationship class k=1. Since all paths from gene i_{0} to the reporter level R do not pass through gene t_{0} and all paths from gene t_{0} to the reporter level do not pass through gene i_{0}, genes i_{0} and t_{0} belong to the hierarchical relationship class k=3 as given in Fig. 2, which states that genes i_{0} and t_{0} are independent of each other and the DK phenotype amounts to R(i_{0},t_{0})=μ_{3}(i_{0},t_{0}). Finally, let us investigate the hierarchical relation between genes t_{0} and n_{0} in DAG \(\mathcal {D}_{0}\). Obviously, gene t_{0} has (at least) one path to node R which does not pass through gene n_{0}, i.e., genes only having paths to R that do not pass through gene n_{0} do not affect the activity of gene n_{0}. Since there is (at least) one other path from gene t_{0} to R passing through gene n_{0}, we can reason that genes t_{0} and n_{0} belong to class k=4.
Generally, there are strong implications among the hierarchical relationship classes of [2], i.e., if some pairs belong to a specific class, then this has strong implications for all other pairs. Let us consider the case that DAG \(\mathcal {D}_{0}\) was not known and only the hierarchical relationship classes for genes i_{0} and j_{0}, i.e., genes i_{0} and j_{0} belong to class k=1, as well as the hierarchical relationship class for genes i_{0} and g_{0}, i.e., genes i_{0} and g_{0} belong to class k=1, were available. By definition of the hierarchical dependency graphs in Fig. 2 and the assumptions that genes i_{0} and j_{0} belong to class k=1 as well as that genes i_{0} and g_{0} belong to class k=1, we conclude that all paths from gene i_{0} to R pass through genes j_{0} and g_{0}. Thus, either all paths from gene g_{0} to R pass through gene j_{0} or all paths from gene j_{0} to R pass through gene g_{0}. Consequently, genes j_{0} and g_{0} belong to either the hierarchical relationship class k=1 or k=2.
As emphasized by the example above, if the hierarchical relationship class is known for two arbitrary genes i,j as well as for another pair \(i,l \in \mathcal {G}: l>i\), then this generally has strong logical implications for the hierarchical relationship classes to which genes \(j,l \in \mathcal {G}: l>j\) can belong. Since we can interpret the classification of the pairs of genes i,j, based on their observed SK and DK phenotypes R(i),R(j) and R(i,j), respectively, to exactly one out of the five hierarchical relationship classes as a coupled multi-hypothesis test, we address this problem in Section 3 by a linear integer optimization program. The proposed linear integer optimization program identifies the most consistent set of hierarchical relationship classes, i.e., the set of hierarchical relationship classes that represents a valid DAG and best matches the DK measurements with respect to the logical coupling between the classes. Furthermore, in Section 4, we extend the GENIE program proposed in Section 3 by incorporating GI-profile data in order to jointly detect the most consistent set of hierarchical relationship classes and the corresponding DAG topology.
3 GENIE algorithm
In this section, we formulate the problem of classifying the gene pairs i,j into the classes of hierarchical relationships based on the observed SK and DK phenotype values as a linear integer optimization program. Furthermore, we translate the logical implications among the hierarchical relationship classes into constraints that ensure that the detected set of hierarchical relationship classes represents a valid graph. That is, the detected set of hierarchical relationship classes represents a graph which is a DAG as defined in Section 2. Finally, we propose a policy to derive an estimate \(\hat {\mathcal {E}}_{\mathcal {D}}\) of the true set of edges \(\mathcal {E}_{\mathcal {D}}\) of DAG \(\mathcal {D}\) based on the detected set of hierarchical relationship classes.
3.1 Hierarchical relationship class detection
Definition 1
Given a non-empty set of edges \(\mathcal {E}_{\text {in}}\) and a non-empty set of edges \(\mathcal {E}_{\text {out}}\), graph \(\mathcal {S} = \left (\mathcal {G}_{\mathcal {S}}, \mathcal {E}_{\mathcal {S}} \right)\), with set of nodes \(\mathcal {G}_{\mathcal {S}}\) and set of edges \(\mathcal {E}_{\mathcal {S}}\), is a SMAP if the following conditions are fulfilled: (i) the graph \(\mathcal {S}\) is acyclic and directed and (ii) there exist edges \(e_{\text {in}} \in \mathcal {E}_{\text {in}}\) and \(e_{\text {out}} \in \mathcal {E}_{\text {out}}\) such that each path P through graph \(\mathcal {S}\) enters \(\mathcal {S}\) via edge e_{in} and leaves graph \(\mathcal {S}\) via edge e_{out}.
This logical implication is directly reflected by constraint (5a). Given α_{1}(i,j)=1 and α_{1}(i,l)=1, the right-hand side (RHS) of (5a) amounts to 1. In this case, the left-hand side (LHS) of (5a) must also become 1 to fulfill the inequality (5a). Thus, either α_{1}(j,l)=1 or α_{2}(j,l)=1. Conversely, assume that α_{1}(i,j)=1 and α_{1}(i,l)=1 does not hold. Then, the RHS of (5a) is less than 1, i.e., 0 or −1, while the LHS of (5a) is always greater than or equal to 0. Hence, constraint (5a) is fulfilled irrespective of the choice of α_{k}(j,l), i.e., constraint (5a) enforces no logical implications.
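The behavior described for constraint (5a) can be checked by brute force over the binary selection variables. The inequality used below, α_{1}(j,l) + α_{2}(j,l) ≥ α_{1}(i,j) + α_{1}(i,l) − 1, is an assumed form inferred from the LHS/RHS discussion above, since the displayed constraint is not reproduced here; the function names are hypothetical.

```python
from itertools import product


def constraint_5a(a1_ij, a1_il, a1_jl, a2_jl):
    """Assumed form of (5a): a1(j,l) + a2(j,l) >= a1(i,j) + a1(i,l) - 1."""
    return a1_jl + a2_jl >= a1_ij + a1_il - 1


def feasible_jl(a1_ij, a1_il):
    """List the (a1(j,l), a2(j,l)) choices allowed by the assumed (5a).

    A pair belongs to exactly one class, so a1(j,l) + a2(j,l) <= 1.
    """
    return [
        (a1, a2)
        for a1, a2 in product((0, 1), repeat=2)
        if a1 + a2 <= 1 and constraint_5a(a1_ij, a1_il, a1, a2)
    ]
```

With both premises active, `feasible_jl(1, 1)` leaves only the two choices that put the pair j,l into class 1 or class 2; with any premise inactive, the constraint excludes nothing.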
as a topology constraint to program O_{GENIE}. This property is very valuable since it allows the GENIE algorithm to take advantage of existing results in genetic interaction research to improve the reliability of the classification.
3.2 Edge computation
Assume that either condition E_{1} or condition E_{2} is fulfilled, then we conclude that there is an edge from gene i to gene j in DAG \(\mathcal {D}\). Given that either condition E_{3} or condition E_{4} is fulfilled, we conclude that there exists an edge from gene j to gene i in DAG \(\mathcal {D}\). We remark that there cannot be an edge between two genes i,j if they are independent of each other, i.e., \(\hat {\alpha }_{3}^{\mathcal {D}}(i,j)=1\).
We obtain an estimate \(\mathcal {E}_{\text {GENIE}}\) of the true set of edges \(\mathcal {E}_{\mathcal {D}}\) of DAG \(\mathcal {D}\) by setting \(\hat {\mathrm {A}}^{\mathcal {D}} = A^{\mathrm {O}_{\text {GENIE}}}\) and evaluating conditions E_{1} to E_{4} and condition E_{ R } as stated in Tables 1 and 2, respectively.
4 GI-GENIE algorithm
The quotient of \(\frac {\lambda _{c}}{\lambda _{p}}\) defines the threshold for reward of the GI-profile (GIP) term in Eq. (12), where setting the edge selection variable β(i,j)=1 is rewarded if the corresponding GI-profile measurement ρ(i,j) is above the quotient \(\frac {\lambda _{c}}{\lambda _{p}}\).
The auxiliary variables \(z_{l}(i,j) \ \forall i,j,l \in \mathcal {G}: j>i, l\neq i, l\neq j\) are generally necessary to ensure that the information about the topology of DAG \(\mathcal {D}\), which is encoded in the pattern of selection variables \(\mathrm {A}^{\mathrm {O}_{\text {GI-GENIE} }}\) detected by program O_{GI-GENIE}, does not contradict the set of edge selection variables \( \left \{ \hat {\beta }(i,j) \right \} \, \forall i,j \in \mathcal {G}:j>i\) detected by program O_{GI-GENIE}. In particular, given that the detected pattern of selection variables \(\mathrm {A}^{\mathrm {O}_{\text {GI-GENIE} }}\) enforces that there is an edge between genes i,j in DAG \(\mathcal {D}\), then the auxiliary variables ensure that the corresponding edge selection variable indicates that there is an edge between genes i,j, i.e., \(\hat {\beta }(i,j) = 1\). Furthermore, given that the detected pattern of selection variables \(\mathrm {A}^{\mathrm {O}_{\text {GI-GENIE} }}\) enforces that there is no edge between genes i,j in DAG \(\mathcal {D}\), then the auxiliary variables ensure that the corresponding edge selection variable indicates that there is no edge between genes i,j, i.e., \(\hat {\beta }(i,j) = 0\). Conversely, assume that the detected edge selection variables enforce that there is an edge between genes i,j in DAG \(\mathcal {D}\), i.e., \(\hat {\beta }(i,j) = 1\), then the z_{l}(i,j) ensure that the detected pattern of selection variables \(\mathrm {A}^{\mathrm {O}_{\text {GI-GENIE} }}\) must fulfill one of the conditions stated in Table 1. Likewise, given that the detected edge selection variables enforce that there is no edge between genes i,j in DAG \(\mathcal {D}\), i.e., \(\hat {\beta }(i,j) = 0\), then the z_{l}(i,j) ensure that the detected pattern of selection variables \(\mathrm {A}^{\mathrm {O}_{\text {GI-GENIE} }}\) does not fulfill any of the conditions stated in Table 1.
model the logical implications among the selection variables \(\alpha _{k}(i,j), \alpha _{k'}(i,l), \alpha _{k^{\prime \prime }}(j,l)\), and β(i,j) for \(k,k^{\prime }, k^{\prime \prime } \in \mathcal {K}, \forall i,j,l \in \mathcal {G}: l>j>i\). Together with (11g), constraints (14a)–(14b) model condition E_{1} of our detection policy taking into account the GI-profile information ρ(i,j) via selection variables β(i,j). Assume that based on the SK and DK phenotypes, it is most consistent that α_{1}(i,j)=α_{1}(i,l)=α_{2}(j,l)=1 for at least one gene l in DAG \(\mathcal {D}\), which corresponds to condition E_{1} being violated. Hence, there cannot exist an edge between genes i and j in DAG \(\mathcal {D}\). In this case, the RHS of (14a) amounts to 1 which enforces the LHS of (14a) to amount to 1 as well, i.e., β(i,j)=0. Note that for α_{1}(i,j)=α_{1}(i,l)=α_{2}(j,l)=1, (14b) makes no restrictions on z_{l}(i,j). Furthermore, assume that for genes i,j, based on the SK and DK phenotypes, it is most consistent that α_{1}(i,j)=1, but α_{1}(i,l) and α_{2}(j,l) are not jointly 1 for all other genes \(l \in \mathcal {G}: l>j>i \), i.e., α_{1}(i,l)+α_{2}(j,l)<2, then there is an edge between genes i,j in DAG \(\mathcal {D}\) according to condition E_{1}. In this case, it is obvious that (14a) is always fulfilled, i.e., there are no restrictions on β(i,j) by (14a). Since α_{1}(i,j)=1 and α_{1}(i,l)+α_{2}(j,l)≤1 for all \(l \in \mathcal {G}: l>j>i \), constraint (14b) can only be fulfilled if \(z_{l}(i,j) =1 \ \forall l \in \mathcal {G}: l>j>i\). Hence, given that \(z_{l}(i,j)=1 \ \forall l \in \mathcal {G}:l > j >i\), constraint (11g) sets β(i,j)=1.
Given that the GI-profile data strongly supports that there is no edge between genes i,j in DAG \(\mathcal {D}\), i.e., β(i,j)=0, and α _{1}(i,j)=1 is most consistent based on the SK and DK phenotypes measured, then it follows from (11g) that there must be at least one \(l \in \mathcal {G}: l > j>i\) for which z _{ l }(i,j)=0. In this case, with β(i,j)=0, α _{1}(i,j)=1, and z _{ l }(i,j)=0, the RHS of (14b) amounts to 1, forcing the LHS of (14b) to amount to 1 as well, i.e., α _{1}(i,l)=1 and α _{2}(j,l)=1, which is together with the assumption of α _{1}(i,j)=1 a combination that violates the existence of a direct edge between genes i and j. Furthermore, note that (14a) does not have any implications on the selection variables α _{1}(i,j),α _{1}(i,l), and α _{2}(j,l) for the case that β(i,j)=0 and z _{ l }(i,j)=0.
Assume that the GI-profile data strongly supports that there is an edge between genes i,j in DAG \(\mathcal {D}\), i.e., β(i,j)=1, and α _{1}(i,j)=1 is most consistent based on the SK and DK phenotypes measured, then according to (14a), there cannot be any gene \(l \in \mathcal {G}: l >j>i\) for which α _{1}(i,l)=1 and α _{2}(j,l)=1. Hence, Eq. (14b) can only be fulfilled if \(z_{l}(i,j)=1 \ \forall l \in \mathcal {G}:l>j>i\). Thus, (11g) is fulfilled with equality. We remark that given α _{1}(i,j)=1, constraints (14c) to (14i) are always fulfilled, i.e., they do not pose any implications among the selection variables α _{ k }(i,j) and β(i,j). Together with (11g), the two inequalities in (14c) model condition E_{3} where we can elucidate their functionality in the same fashion as before. Constraints (14d) to (14g) along with (11g) model a minor modification of condition E_{2} where we detect not only all necessary edges but also optional edges given that their existence is strongly supported by the GI-profile. Given that the existence of an edge between genes i,j in DAG \(\mathcal {D}\) is not strongly supported by the GI-profile, i.e., q(i,j)=0, constraints (14d) to (14e) along with (11g) model condition E_{2} which only allows necessary edges to be detected and we can elucidate their functionality in the same fashion as in (14a) to (14b). Note that (14f) to (14g) are always fulfilled for q(i,j)=0, i.e., no implications among the selection variables α _{ k }(i,j) and β(i,j) are posed. Assuming that the existence of an edge between genes i,j in DAG \(\mathcal {D}\) is strongly supported by the GI-profile, i.e., q(i,j)=1, then the constraints in (14d) and (14e) are always fulfilled, i.e., no implications among the selection variables α _{ k }(i,j) and β(i,j) are posed by (14d) and (14e). However, constraints (14f) and (14g) pose relaxed logical implications among the selection variables α _{ k }(i,j) and β(i,j) compared to constraints (14d) to (14e). 
Hence, given that q(i,j)=1 and α _{4}(i,j)=1, an edge between genes i,j in DAG \(\mathcal {D}\) is detected if it is allowed by the pattern of hierarchical relationship classes. Constraints (14h) to (14i) along with (11g) model a minor modification of condition E_{4} where we detect not only all necessary edges but also optional edges given that their existence is strongly supported by the GI-profile. Furthermore, the functionality of constraints (14h) to (14i) can be explained with the same line of argument as used to elucidate constraints (14d) to (14g).
where we again refer the interested reader to [30] for a detailed description of \(\mathcal {L}_{c}\). We obtain an estimate \(\mathcal {E}_{\text {GI}}\) of the true set of edges \(\mathcal {E}_{\mathcal {D}}\) of DAG \(\mathcal {D}\) based on the computed set of edge selection variables \( \left \{ \hat {\beta }(i,j) \right \}\) of program O_{GI-GENIE} where we infer the directionality of the edges according to \(\mathrm {A}^{\mathrm {O}_{\text {GI-GENIE}}}\). Note that all reporter node edges are computed according to our proposed reporter node edge detection policy as given in Table 2. Since the reporter node is an artificial node in the concept of a DAG, there is no GI-profile data \(\rho (i,R) \, \forall i \in \mathcal {G}\) and thus, no edge selection variable \(\beta (i,R) \, \forall i \in \mathcal {G}\) according to (10).
5 Sequential scalability technique
Due to the combinatorial nature of problems O_{GENIE} and O_{GI-GENIE}, the GENIE algorithm and GI-GENIE algorithm, respectively, cannot be applied to the data of large sets of genes \(\mathcal {G}\), since the number of candidate solutions to problems O_{GENIE} and O_{GI-GENIE}, respectively, grows exponentially with the number of genes. In order to nevertheless obtain statistically significant statements about the interactions among genes in a large set of genes \(\mathcal {G}\), we propose the sequential scalability (SEQSCA) technique which is based on the GENIE algorithm and the GI-GENIE algorithm, respectively.
In particular, we create a sequence of S subsets \(\left \{ \mathcal {G}_{s} \right \}_{1}^{S}\) of the full set of genes \(\mathcal {G}\), i.e., \(\mathcal {G}_{s} \subset \mathcal {G} \ \forall s \in \left \{1,\ldots,S \right \}\). For each subset of genes \(\mathcal {G}_{s}\), we estimate the topology \(\mathcal {E}_{\mathcal {D},s}\) of the DAG \(\mathcal {D}_{s}\) underlying its data by the GENIE or GI-GENIE algorithm, respectively, and translate the estimated topology \(\mathcal {E}_{\mathcal {D},s}\) into the corresponding adjacency matrix M_{s} for each s∈{1,…,S}. Based on the computed sequence of adjacency matrices \(\left \{ \boldsymbol {M}_{s} \right \}_{1}^{S}\), we iteratively compute the reliability matrix M∈[0,1]^{N×N} of the full set of genes \(\mathcal {G}\) in such a way that each entry \(\left [ \boldsymbol {M} \right ]_{i,j}\) with \(i,j \in \mathcal {G}\) describes the empirical probability that an edge exists between genes i and j, i.e., the empirical probability that genes i and j directly interact with each other. A value of 0 means that the considered pair of genes interacts with probability 0, and a value of 1 means that the considered pair of genes interacts with probability 1.
with M^{(s)} being the N×N reliability matrix at iteration s, \(\kappa _{i} \in \left \{1,\ldots, N_{S} \right \} \ \forall i \in \mathcal {G}_{s}\), ∪_{i} κ_{i}={1,…,N_{S}} and κ_{i}<κ_{j} for all i<j. Finally, we obtain the reliability matrix M of the full set of genes \(\mathcal {G}\) by normalizing each entry \( \left [ \boldsymbol {M}^{(S)} \right ]_{i,j}, \ i,j \in \mathcal {G}\), by n_{i,j}, i.e., the number of times the detection of an edge between genes i and j has been considered. Note that the proposed SEQSCA technique does not intend to yield valid DAGs but to provide statistical statements about the empirical probability with which two genes interact with each other.
Summary of the proposed SEQSCA algorithm
Initialization: \(\boldsymbol {M}^{(0)} = \boldsymbol {0}_{N \times N}\); \(\boldsymbol {M}_{s=0} = \boldsymbol {0}_{N_{S} \times N_{S}}\); frequency counter \(n^{(0)}_{i,j} = 0\)
Repeat:
1: Select subset \(\mathcal {G}_{s}\) of size N_{S} from \(\mathcal {G}\); draw each gene from \(\mathcal {G}\) with equal probability without replacement
2: Update: \(n^{(s+1)}_{i,j} = n^{(s)}_{i,j} + 1\) for all \(i,j \in \mathcal {G}_{s}\)
3: Estimate the DAG topology \(\mathcal {E}_{s}\) of set \(\mathcal {G}_{s}\) using GENIE or GI-GENIE, respectively; ⇒ M_{s}
4: Update reliability matrix M^{(s)} according to Eq. (16)
5: Update iteration number: s←s+1
Until: s=S
Set \( \left [ \boldsymbol {M} \right ]_{i,j} = \left [ \boldsymbol {M}^{(S)} \right ]_{i,j} / n^{(S)}_{i,j} \, \forall i,j \in \mathcal {G}\)
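The SEQSCA loop summarized above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the authors' code: the update of Eq. (16) is assumed to add each subset-level adjacency entry to the matching entry of the running matrix, pairs that are never drawn together are assigned 0.0, and `chain_estimator` is a hypothetical stand-in for the GENIE or GI-GENIE solver.

```python
import random


def seqsca(genes, subset_size, num_iterations, estimate_adjacency, seed=0):
    """Accumulate subset-level edge estimates into an empirical reliability matrix."""
    rng = random.Random(seed)
    position = {gene: k for k, gene in enumerate(genes)}
    n = len(genes)
    score = [[0.0] * n for _ in range(n)]  # running sum, plays the role of M^(s)
    count = [[0] * n for _ in range(n)]    # n_{i,j}: how often pair (i, j) was considered

    for _ in range(num_iterations):
        # Step 1: draw a subset without replacement, keeping the original gene order.
        subset = sorted(rng.sample(genes, subset_size), key=position.get)
        # Step 3: subset-level adjacency matrix M_s (GENIE or GI-GENIE in the paper).
        local = estimate_adjacency(subset)
        for a, gene_a in enumerate(subset):
            for b, gene_b in enumerate(subset):
                if a == b:
                    continue
                row, col = position[gene_a], position[gene_b]
                count[row][col] += 1            # Step 2: frequency counter
                score[row][col] += local[a][b]  # Step 4: assumed additive update

    # Final normalization; pairs never drawn together get 0.0 (an assumption).
    return [
        [score[r][c] / count[r][c] if count[r][c] else 0.0 for c in range(n)]
        for r in range(n)
    ]


def chain_estimator(subset):
    """Hypothetical stand-in for GENIE: an edge from each gene to the next one."""
    size = len(subset)
    return [[1 if b == a + 1 else 0 for b in range(size)] for a in range(size)]
```

Passing the estimator as a function keeps the sketch independent of any particular solver. Sorting the drawn subset by its position in the full gene list mirrors the ordering requirement κ_{i}<κ_{j} for i<j.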
6 Simulation results
In this section, we first compare the performance of the GENIE algorithm and the GI-GENIE algorithm with that of conventional techniques on simulated data. We then use the proposed SEQSCA technique to provide statistically significant statements on the interactions among the genes of a large set of genes based on real data. For the implementation of the proposed algorithms, we used the CVX interface [31] along with the MOSEK solver [32].
6.1 Synthetic data results
We have generated the ideal SK phenotypes \(R(i) \in \mathbb {R}\) for all \(i \in \mathcal {G}\) as well as the ideal DK phenotypes \(R(i,j) \in \mathbb {R}\) for all \(i,j \in \mathcal {G}: j>i\) according to the model of [2] as displayed in Fig. 2, and we have corrupted the ideal SK and DK phenotypes by independent and identically distributed zero-mean Gaussian noise with variance \(\sigma ^{2}\). Furthermore, the GI-profile data \(\rho (i,j) \ \forall i,j \in \mathcal {G}: j>i\) has been generated on the basis of the SK and DK phenotypes. We compare both the GENIE algorithm and the GI-GENIE algorithm with the well-known GI-profile approach [2, 33]. In this approach, the Pearson correlation between the GI-profiles of genes i and j is computed, and an edge in the DAG is detected if the Pearson correlation is above a pre-defined threshold \(t_{\text {corr}}\); the directionality is inferred from the selection variable \(\alpha _{k}(i,j)\) corresponding to the least mismatch model \(\mu _{k}(i,j)\). Furthermore, we compare our proposed methods with the solution of program O_{GENIE} without considering set \(\mathcal {L}\) as a constraint. This means simply assigning each pair i,j to the hierarchical relationship class with the least mismatch score based on the SK and DK phenotypes R(i) and R(i,j), respectively, without ensuring that the resulting pattern of hierarchical relationship classes represents a valid DAG.
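The thresholding step of the GI-profile baseline can be illustrated as follows. The sketch assumes that the GI-profiles are stored as the rows of an array `profiles` (a hypothetical layout, not specified in this section) and covers only the detection of undirected candidate pairs; the inference of the edge direction from the least mismatch model is not shown.

```python
import numpy as np

def pearson_edge_candidates(profiles, t_corr):
    """Detect candidate edges by thresholding Pearson correlations.

    profiles: (N, P) array; row i is ASSUMED to hold the GI-profile of gene i.
    t_corr:   pre-defined correlation threshold.
    Returns a boolean (N, N) matrix that is True at (i, j), j > i, when the
    correlation between the profiles of genes i and j exceeds t_corr.
    """
    corr = np.corrcoef(profiles)        # pairwise Pearson correlation of rows
    detected = corr > t_corr
    np.fill_diagonal(detected, False)   # a gene is not paired with itself
    return np.triu(detected, k=1)       # keep each unordered pair once (j > i)
```

The choice of `t_corr` trades false detections against missed edges, which is why the comparison in this section reports both error types.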
In order to limit the Monte Carlo simulation time, we consider a total of 10 genes, which amounts to 225 binary variables and 2670 constraints for the GENIE algorithm and to 630 binary variables and 9645 constraints for the GI-GENIE algorithm. For the GENIE method without the consistency constraints in \(\mathcal {L}\), we have 225 binary variables and 270 constraints. Since we infer the edge orientation for the Pearson correlation-based method from the least mismatch scoring model, i.e., from the GENIE method without the consistency constraints in \(\mathcal {L}\), we have 270 binary variables and 270 constraints.
Note that in multi-hypothesis testing problems, it is common to view the diagnostic plots in Figs. 7 and 8 jointly to assess the quality of the proposed algorithms. In Fig. 7, we observe that in the low SNR regime, the Pearson correlation-based method performs best in terms of the false detection percentage of edges \(P_{\text {ed}}\). However, its performance does not improve with increasing SNR, because it relies on the hierarchical relationship classes detected by program O_{GENIE} without considering \(\mathcal {L}\) to obtain the directionality of the edges. Especially in the high SNR regime, the proposed GENIE and GI-GENIE methods clearly outperform program O_{GENIE} without the topology rule set \(\mathcal {L}\), and they approach and reach, respectively, the performance of the Pearson correlation method. However, the very good performance of the Pearson correlation method in terms of the false detection percentage of edges \(P_{\text {ed}}\) according to Eq. (17) comes at the cost of a rather poor performance in terms of the percentage of undetected edges \(P_{\text {mis}}\) according to Eq. (18), as can be seen in Fig. 8. In particular, all of the proposed methods outperform the Pearson correlation method in terms of \(P_{\text {mis}}\). Note that in the high SNR regime, the GI-GENIE method, which combines SK, DK, and GI-profile data, yields the best of both worlds: it performs on par with the Pearson correlation method in terms of \(P_{\text {ed}}\) and improves on the already strong performance of the GENIE method in terms of \(P_{\text {mis}}\).
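To make the two error measures concrete, the following sketch computes a false detection percentage and a percentage of undetected edges from a true and an estimated adjacency matrix. Eqs. (17) and (18) are not reproduced in this section, so the normalizations used below (false detections relative to all detected edges, and missed edges relative to all true edges) are assumptions and may differ from the definitions in the paper; the function name is hypothetical.

```python
import numpy as np

def edge_error_percentages(A_true, A_est):
    """Return (p_ed, p_mis) in percent for directed 0/1 adjacency matrices.

    ASSUMED definitions (not taken from Eqs. (17) and (18)):
      p_ed  = 100 * (#detected edges that are not true edges) / (#detected edges)
      p_mis = 100 * (#true edges that were not detected)      / (#true edges)
    """
    A_true = np.asarray(A_true, dtype=bool)
    A_est = np.asarray(A_est, dtype=bool)
    n_false = np.sum(A_est & ~A_true)    # detected but not present
    n_missed = np.sum(A_true & ~A_est)   # present but not detected
    n_est = A_est.sum()
    n_true = A_true.sum()
    p_ed = 100.0 * n_false / n_est if n_est else 0.0
    p_mis = 100.0 * n_missed / n_true if n_true else 0.0
    return p_ed, p_mis
```

Under these definitions, a method that detects very few edges can reach a low false detection percentage while leaving many edges undetected, which matches the trade-off described for the Pearson correlation method.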
6.2 Real data results
Acceptance ratios; ε=0.05
| Method | Γ (%) |
|---|---|
| SEQSCA and GENIE | 53 |
| SEQSCA and GI-GENIE | 74 |
7 Conclusions
In this paper, we have considered the problem of learning the interactions between genes in a genetic network. We have proposed the GENIE algorithm and the GI-GENIE algorithm to reconstruct the DAG underlying the observed data. The GENIE method is based purely on SK and DK data, whereas the GI-GENIE method combines SK and DK data with GI-profile data in order to compute an estimate of the true DAG topology. In Section 5, we have presented the SEQSCA technique in order to obtain statistically significant statements about the interactions among a large set of genes under study. Furthermore, we have shown by simulations that the GI-GENIE algorithm outperforms the conventional techniques and the GENIE algorithm due to the combination of multiple data types, i.e., SK/DK and GI-profile data. Finally, based on the SEQSCA technique, we have presented real data results for both the GENIE and the GI-GENIE algorithm, which confirm that the GI-GENIE method outperforms the GENIE method.
8 Endnote
Declarations
