Selection of Statistical Thresholds in Graphical Models
© Anthony Almudevar. 2009
Received: 11 June 2009
Accepted: 20 November 2009
Published: 4 March 2010
Reconstruction of gene regulatory networks based on experimental data usually relies on statistical evidence, necessitating the choice of a statistical threshold which defines a significant biological effect. Approaches to this problem found in the literature range from rigorous multiple testing procedures to ad hoc P-value cut-off points. However, when the data implies graphical structure, it should be possible to exploit this feature in the threshold selection process. In this article we propose a procedure based on this principle. Using coding theory we devise a measure of graphical structure, for example, highly connected nodes or chain structure. The measure for a particular graph can be compared to that of a random graph and structure inferred on that basis. By varying the statistical threshold the maximum deviation from random structure can be estimated, and the threshold is then chosen on that basis. A global test for graph structure follows naturally.
The reconstruction of gene regulatory networks using gene expression data has become an important computational tool in systems biology. A relationship among a set of genes can be established either by measuring the effect of the experimental perturbation of one or more selected genes on the remaining genes or from the use of measures of coexpression from observational data. The data is then incorporated into a suitable mathematical model of gene regulation. Such models vary in level of detail, but most are based on a gene graph, in which nodes represent individual genes, while edges between nodes indicate a regulatory relationship.
One important issue that arises is the variability of the data due to biological and technological sources. This leads to imperfect resolution of gene relationships and the need for principled statistical methodology with which to assign statistical significance to any inferred feature.
In many models, the existence or absence of an edge in the gene graph is resolved by a statistical hypothesis test. A natural first step is the ranking of potential edges based on the strength of the statistical evidence for the existence of the implied regulatory relationship. The intuitive approach is to construct a graph consisting of the highest ranking edges, defined by a P-value threshold. The choice of threshold may be ad hoc, typically a conservative significance level such as 0.01. A more rigorous approach is to select the threshold using principles of multiple hypothesis testing (see, e.g., [1]), which may yield an estimate of the error rates of edge classification.
There is a fundamental drawback to this approach, in that the lack of statistical evidence of a regulatory relationship may be as much a consequence of small sample size as of biological fact. Under this scenario, we note that selection of a P-value threshold generates a graph whose number of edges increases with the threshold. Under a null hypothesis of no regulatory structure, P-values are randomly ranked, hence edges will be distributed uniformly, whereas the edges of a true regulatory network will possess structure unlikely to arise by chance. Formulated in terms of statistical hypothesis tests, it should be possible to exploit this evidence in order to make a more informative choice of threshold. This article proposes a method to accomplish this goal.
2. Problem Formulation
We assume in (S1)-(S2) that the matrix is balanced in the sense that each row and the column with the same index refer to the same gene. The methods proposed here do not rely on this assumption, although a formal treatment of the general case will be deferred to future work. Typically, each matrix element will be a P-value from a two-sample hypothesis test comparing the expression levels of one gene obtained from cells subject to an experimental perturbation of another gene to those obtained from control (unperturbed) cells. In this case small values are interpreted as evidence for the existence of a directed edge from the perturbed gene to the responding gene. We adopt this convention below.
It will be useful to introduce some definitions of directed gene graphs (see ). We say gene A regulates gene B if the expression level of A directly influences that of B. This is distinct from transitive regulation, in which expression levels of one gene affect another only through intermediary genes. For example, if A regulates B and B regulates C, then A and C are in a transitive regulatory relationship (that would not exist without B). In an accessibility graph an edge from A to B exists if A regulates or transitively regulates B. In contrast, in an adjacency graph an edge from A to B exists only if A regulates B. An adjacency graph can be constructed as a parsimonious representation of an accessibility graph [5–7]. It should be noted that a regulatory relationship implied by a graphical model is relative only to those genes included and does not rule out the existence of intermediary genes not observed.
Step (S3) will be based on the following idea. The data matrix can generate an estimated accessibility graph by constructing an edge if and only if the corresponding matrix element is no larger than a threshold. While this is a crude form of network model, we may still expect the resulting graph to contain interesting and measurable structure, provided that the threshold is efficiently chosen. Our intention is to use this structure to guide the choice of threshold. The set of edges in the resulting graph is then used to construct a more detailed model, as in step (S5).
Consider a hierarchical sequence of graphs obtained by successively adding edges in increasing order of their P-values. If the data is dominated by statistical noise, we may expect elements of the sequence to consist of random graphs generated by uniform distributions of a fixed number of edges, known as the Erdős-Rényi random graph model (see, e.g., [8]). Actual cellular networks are believed to conform more closely to the power-law model, where the likelihood that a randomly chosen node has a given number of interactions is proportional to a negative power of that number (see [9]). We may also expect more chain structure (longer paths) than would occur by chance. This would allow statistical identification of cellular network structure, which can provide auxiliary information for the selection of the threshold beyond what is normally available using standard multiple hypothesis testing methods.
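The thresholding idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix is a list of lists, `None` marks a void entry, an edge is identified by its matrix position `(row, column)`, and the comparison is taken to be "less than or equal to". The function name `threshold_graph` and the example values are hypothetical.

```python
def threshold_graph(pvals, alpha):
    """Return the set of matrix positions (i, j) whose P-value is at most alpha.

    Assumption: None marks a void entry (for example, the diagonal).
    """
    edges = set()
    for i, row in enumerate(pvals):
        for j, p in enumerate(row):
            if p is not None and p <= alpha:
                edges.add((i, j))
    return edges


# Hypothetical 3-gene P-value matrix with a void diagonal.
pvals = [
    [None, 0.001, 0.40],
    [0.03, None, 0.008],
    [0.75, 0.20, None],
]

g_strict = threshold_graph(pvals, 0.01)
g_loose = threshold_graph(pvals, 0.05)
```

Raising the threshold can only add edges, so the graphs obtained from increasing thresholds are nested.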
2.1. Conditional Hypothesis Tests
The required elements of our procedure are (i) a data matrix (steps (S1)-(S2)), (ii) a graph score which is sensitive to general graphical structure, and (iii) a distributional model for generating graphs under the null hypothesis of no regulatory relationships. In the following development smaller values of the graph score imply greater structure.
2.1.1. Notational Conventions
We will adopt the following notation. Assume that is fixed. First, let be the set of all increasing sequences of positive integers for which . Then let be the set of all -dimensional vectors of nonnegative integers (which we refer to as count vectors). Let denote the sum of the elements of any . A sequence of vectors from , written , is increasing if for all , , and if . Let be the set of all such increasing count vector sequences.
The set of all order labelled graphs is denoted by . Let be the subset of graphs with edges, and for any let be the subset of graphs containing edges with parent . Let , be the respective subsets which exclude all edges from edge set . A sequence of graphs from is called increasing if is a subgraph of , . We say that an increasing graph sequence conforms to index sequence and set if , for all .
2.1.2. Data Matrix
Suppose that we are given an data matrix of P-values as described in (S1)–(S5). An edge may be ruled out by setting . We will refer to such an edge as a void edge, with corresponding void matrix element. For example, this should occur when the data cannot predict self-regulation implied by edges . A missing value in may also represent a void edge.
Let be the sequence of all unique values represented as elements of . The value of varies according to the number of ties as well as the number of void elements. We need to define a system of counts generated by . Set
where when event occurs and is zero otherwise. We then define the sequence , where . This sequence is increasing, and consecutive graphs may increase by more than one edge. We refer to the system of counts and , which can be interpreted as random objects in sample spaces , , respectively. The sequence corresponds to the number of edges of the graphs in ; that is, is the number of edges in . Similarly, corresponds to the number of edges decomposed by parent node in ; that is, is the number of edges in with parent node .
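As a rough sketch of the counting system described above, the following Python builds one graph per unique nonvoid value and records the total edge count and the edge count by parent node. Several details are assumptions made for illustration: `None` marks a void element, the column index is treated as the parent node (chosen to match the within-column null model developed later), and the names `graph_sequence` and `edge_counts` are hypothetical.

```python
def graph_sequence(pvals):
    """Return the sorted unique nonvoid values and one graph per value.

    The graph for a given value contains every position (i, j) whose
    entry is nonvoid and no larger than that value.
    """
    levels = sorted({p for row in pvals for p in row if p is not None})
    graphs = []
    for level in levels:
        graphs.append({
            (i, j)
            for i, row in enumerate(pvals)
            for j, p in enumerate(row)
            if p is not None and p <= level
        })
    return levels, graphs


def edge_counts(graphs, n_nodes):
    """Total edges per graph, and edges per graph split by parent (column)."""
    totals = [len(g) for g in graphs]
    by_parent = [
        [sum(1 for (_, j) in g if j == parent) for parent in range(n_nodes)]
        for g in graphs
    ]
    return totals, by_parent


# Hypothetical matrix with tied values, so some steps add two edges at once.
pvals = [
    [None, 0.01, 0.20],
    [0.01, None, 0.05],
    [0.30, 0.05, None],
]
levels, graphs = graph_sequence(pvals)
totals, by_parent = edge_counts(graphs, 3)
```

With ties present, consecutive graphs differ by more than one edge, which is why the edge counts are read from the data and not assumed to increase by one.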
2.1.3. Conditional Inference
Under the simplest null hypothesis of no regulatory structure, perturbation conditions are indistinguishable from the control, in which case the P-values of the data matrix are uniformly distributed. A number of considerations then need to be made. The uniform distribution assumption depends on a correct characterization of the sampling distribution, which is often problematic in gene expression assays. In addition, when empirical methods (permutation or bootstrap methods) are used to estimate P-values, ties may result which affect graph ordering. Finally, the definition of a null model relies on the independence structure of the data, which must be carefully characterized. Conditional procedures permit the development of tests which do not depend on problematic model identification, and have been extensively used in other applications in statistical genetics.
We will now develop two null models. A conditional inference procedure is defined by data , a composite null hypothesis concerning , a test statistic , an ancillary statistic , such that the distribution of conditional on can be characterized, and is the same for all distributions described by .
2.1.4. Null Model 1 (Elementwise Exchangeability)
Recall that a multivariate distribution is exchangeable if it is invariant under any permutation of its coordinates. This includes i.i.d. distributions, but also those with identical marginal distributions and permutation invariant dependence structure.
This leads to the following lemma.
In the simplest case, the null hypothesis predicts uniformly distributed and independent P-values among nonvoid elements of . In this case by Lemma 1 conditioned on has distribution . If the marginal distributions are continuous, then the probability of ties is zero, and with probability 1 the elements of increment by one until the void elements are reached. When distributions are discrete, ties are possible and can be determined directly from the data. It is important to note that the actual marginal distribution of the elements is not important, which is a considerable advantage when null distributions are difficult to estimate accurately.
The testing procedure proposed here is based on simulated sampling from . There are two straightforward ways to do this. First, let be a random matrix obtained by a random permutation of the nonvoid elements of . We have already argued that . We also note that the distribution of nonvoid elements of is exchangeable, hence by Lemma 1 has distribution . Alternatively, suppose is any random matrix with continuously distributed nonvoid elements. Given any index sequence , we can define a sequence of graphs . It is easily verified that has distribution .
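The first of the two sampling schemes, a random permutation of the nonvoid elements, might be sketched as follows. The representation (`None` for void elements) and the function name `permute_nonvoid` are assumptions for illustration only.

```python
import random


def permute_nonvoid(pvals, rng):
    """Return a copy of the matrix with its nonvoid entries randomly permuted.

    Void entries (None) stay in place, and the multiset of nonvoid values
    is unchanged, so the pattern of ties is preserved.
    """
    positions = [
        (i, j)
        for i, row in enumerate(pvals)
        for j, p in enumerate(row)
        if p is not None
    ]
    values = [pvals[i][j] for i, j in positions]
    rng.shuffle(values)
    out = [list(row) for row in pvals]
    for (i, j), v in zip(positions, values):
        out[i][j] = v
    return out


rng = random.Random(0)
pvals = [
    [None, 0.001, 0.40],
    [0.03, None, 0.008],
    [0.75, 0.20, None],
]
shuffled = permute_nonvoid(pvals, rng)
```

Because only positions change, the number of edges at every threshold is the same in the permuted matrix as in the data, which is what conditioning on the edge counts requires.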
2.1.5. Null Model 2 (Within Column Exchangeability)
The use of as a null distribution rests on the assumption of elementwise exchangeability. A number of commonly encountered conditions may require alternative assumptions. For example, the columns of may be derived from data obtained from a single high throughput assay. In this case, the columns may be independent, but not identically distributed. Furthermore, normalization procedures and other slide specific factors may affect any independence assumptions within a column. We therefore develop an alternative null model based on within column exchangeability, which is accomplished by conditioning on .
Suppose that we are given void edges and an increasing count vector sequence , . A random sequence of graphs possesses a null distribution if is uniformly distributed on , and if conditional on is uniformly distributed on for all .
Then define our second null hypothesis:
This leads to the following lemma.
Following the permutation procedure used to simulate , we can simulate using independent within column permutations of , resulting in . By Lemma 2, possesses distribution . We note that by construction a graph sequence sampled from also conforms to .
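A within-column version of the permutation can be sketched in the same way. As before, `None` marks a void element and the name `permute_within_columns` is hypothetical; each column is shuffled independently of the others.

```python
import random


def permute_within_columns(pvals, rng):
    """Return a copy of the matrix with each column's nonvoid entries permuted.

    Columns are shuffled independently, so every column keeps its own
    multiset of values and its own void positions.
    """
    n_rows = len(pvals)
    n_cols = len(pvals[0])
    out = [list(row) for row in pvals]
    for j in range(n_cols):
        rows = [i for i in range(n_rows) if pvals[i][j] is not None]
        values = [pvals[i][j] for i in rows]
        rng.shuffle(values)
        for i, v in zip(rows, values):
            out[i][j] = v
    return out


rng = random.Random(0)
pvals = [
    [None, 0.001, 0.40, 0.10],
    [0.03, None, 0.008, 0.55],
    [0.75, 0.20, None, 0.02],
    [0.06, 0.90, 0.33, None],
]
shuffled = permute_within_columns(pvals, rng)
```

Since each column keeps its values, the number of entries below any threshold is preserved column by column, not just in total.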
2.2. Hypothesis Test Algorithm
Suppose that we have a sample of graphs from a distribution , which in turn defines a random variable with distribution , where is distributed as . If is a null distribution representing graphs with no significant structure, then the location of in the lower tail of is evidence of significant structure within .
We will assume that when null hypothesis or does not hold, this violation is due to the existence of a true graph . In this case, all elements of conform to the null hypothesis except for any for which , which are assumed to have smaller means than would be implied under the null distribution.
We therefore define statistics:
Now suppose that we are given , with , . We may generate a random sample from either or , say . Set , from which we extract sample so that when is a random sample from or , is a uniformly distributed random sample from or , respectively. This leads to the two sequences of statistics:
These sequences then form measures of the deviation of which can be used to accomplish two tasks. First, we conjecture that the minimum point of these sequences will define a useful threshold , that is, a point in the sequence below which most edges are true positives (a selected edge in true graph ), and above which additional edges are primarily false positives (a selected edge not in true graph ). Second, by generating further replications, we can estimate a global significance level for the presence of network structure. As will be discussed below, examining the entire range of the sequence may be problematic, and so it may be truncated. Let , where . Then consider the truncated sequences:
Thus, all graphs of order or less are considered. We first devise a statistic which measures statistical differences of from the sample . We then generate an additional set of null replications from the null distribution, denoted by . An empirical distribution is formed from the sample , from which a significance level for statistic is directly obtainable. This represents the desired global significance level. We consider the four choices:
We now summarize the proposed algorithm.
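A rough illustration of the threshold selection and global test described in this section follows; it is not a transcription of the article's Algorithm A. It assumes one particular choice of deviation measure (a z-score of the observed score against null replicates at the same position in the sequence), uses the minimum z-score as the global statistic, and adds one to the numerator and denominator of the empirical significance level. The function names and the synthetic scores are hypothetical.

```python
import random
import statistics


def z_sequence(scores, null_scores):
    """Standardize each score against the null replicates at the same position."""
    z = []
    for k, s in enumerate(scores):
        column = [rep[k] for rep in null_scores]
        z.append((s - statistics.mean(column)) / statistics.stdev(column))
    return z


def select_and_test(observed, null_ref, null_extra):
    """Pick the position of maximum deviation and estimate a global level.

    null_ref standardizes the scores; null_extra is a separate set of null
    replicates used to build the empirical distribution of the minimum z.
    """
    z_obs = z_sequence(observed, null_ref)
    k_best = min(range(len(z_obs)), key=z_obs.__getitem__)
    t_obs = z_obs[k_best]
    t_null = [min(z_sequence(rep, null_ref)) for rep in null_extra]
    p_global = (1 + sum(t <= t_obs for t in t_null)) / (1 + len(t_null))
    return k_best, t_obs, p_global


# Synthetic scores: null scores vary around 100, the observed sequence dips at index 2.
rng = random.Random(1)
null_ref = [[rng.gauss(100.0, 5.0) for _ in range(5)] for _ in range(200)]
null_extra = [[rng.gauss(100.0, 5.0) for _ in range(5)] for _ in range(200)]
observed = [100.0, 90.0, 70.0, 85.0, 99.0]
k_best, t_obs, p_global = select_and_test(observed, null_ref, null_extra)
```

Using a second, independent set of null replicates for the global level avoids reusing the same sample both to standardize the scores and to judge their extremity.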
3. Information-Based Scoring for Directed Graphs
Information theoretic methods are becoming increasingly important in bioinformatics (see, e.g., [10]) and have been recently used in various graphical modelling applications. Recent examples include [2–4, 11, 12]. This is generally done using the minimum description length (MDL) principle [13–15], which is a general method of inductive inference based on the idea that a model's goodness of fit can be objectively measured by estimating the amount of data compression that it permits. The work proposed here is not formally an application of these methods but does share an interest in coding techniques for graphs.
3.1. Coding-Directed Graphs
The present objective is to devise a coding algorithm for a directed graph using efficient coding principles [16]. The object to be coded is first reduced to a list of elements in a predetermined order (letters of a text or pixels of an image). Each element is coded separately into a codeword of binary digits, which are then concatenated to form one single binary string. It is important to ensure that each distinct object is converted to a unique code, and this may be done by ensuring that the codewords possess the prefix property; that is, no codeword is a prefix of another codeword. The simplest such code is the uniform code. If an element to be coded is one of $M$ types, then each type can be uniquely assigned a binary string of $\lceil \log_2 M \rceil$ bits, and any concatenation of uniform codes can be uniquely decoded. In the following development we will forgo the practice of rounding up to the next integer, since in the context of inference it is more intuitive for the code length to be a strictly increasing function of $M$.
In order to code a nonnegative integer using a uniform code we would have to specify an upper bound , giving types, and so a codeword length of for each integer. If we expect most integers to be significantly smaller than , this would be inefficient. We will therefore make use of a universal code proposed in . One segment of the code consists of a binary representation of the integer, with no leading 0's. The code is prefixed by a string consisting of 0's equal in length to the binary string followed by a 1. Thus, , , , and so on. In general, we will have code length when , and for . This code is a prefix code, with the advantage that no upper bound need be specified, and it will be more efficient when smaller integer values are expected to be most prevalent. In the work which follows, we omit the rounding operation, and so accept the approximate code length of as
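The verbal description of the universal integer code can be turned into a small encoder and decoder. This sketch follows one reading of that description: a run of 0's as long as the binary string, then a 1, then the binary string itself. The treatment of zero (an empty binary string, giving the single-bit codeword `1`) is an assumption, and the lengths here are exact integer lengths, not the unrounded approximation adopted in the text.

```python
def universal_code(n):
    """Encode a nonnegative integer as a prefix-free bit string.

    Layout: len(bits) zeros, a single 1, then bits, where bits is the
    binary representation of n with no leading zeros.
    Assumption: n == 0 is given the empty binary string.
    """
    if n < 0:
        raise ValueError("n must be nonnegative")
    bits = format(n, "b") if n > 0 else ""
    return "0" * len(bits) + "1" + bits


def decode_stream(stream):
    """Decode a concatenation of codewords back into a list of integers."""
    values = []
    pos = 0
    while pos < len(stream):
        k = 0
        while stream[pos] == "0":  # count the leading zeros
            k += 1
            pos += 1
        pos += 1  # skip the 1 that ends the length prefix
        body = stream[pos:pos + k]
        values.append(int(body, 2) if body else 0)
        pos += k
    return values


numbers = [0, 1, 2, 5, 17]
stream = "".join(universal_code(n) for n in numbers)
decoded = decode_stream(stream)
```

Under this reading the codeword for an integer with a `b`-bit binary representation has `2 * b + 1` bits, so small integers are cheap and no upper bound is needed.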
A directed order graph may be represented as an 0-1 adjacency graph (the class of such matrices is denoted by ). An edge from node to is indicated by a 1 entry for row and column . Such a matrix may be completely represented by an ordered list of subsets of , in which the th subset represents the entries of row equaling 1. The graph itself may therefore be coded as a concatenation of codewords representing the subsets. We assume that the value of is available to the decoder.
To code a subset, a uniform code may be used, so that any subset from $n$ labels would be coded using $n$ bits. However, in the applications considered here, it is often expected that the size of the subset is considerably smaller than $n$. An alternative strategy is to first specify the size $k$ of the subset and then apply a uniform code to represent all subsets of that size. This involves concatenating a codeword for $k$ (using the universal integer code) and a codeword for the subset (using a uniform code for the $\binom{n}{k}$ possible subsets). A subset of size $k$ from $n$ objects will then be assigned a code length of
where is the number of 1 entries in row . This code is similar to the one proposed in  but assumes that bits are used to code , as required by a uniform code on integers.
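To make the size-indexed subset code concrete, the sketch below computes a code length for each row of a 0-1 adjacency matrix and sums over rows. It is a sketch under assumptions, not the article's exact formula: the integer part uses the exact length `2 * b + 1` of a prefix code with a `b`-bit body (one reading of the universal code described earlier), and the subset part uses an unrounded uniform code, $\log_2 \binom{n}{k}$. The function names are hypothetical.

```python
import math


def integer_length(k):
    """Length in bits of a prefix code for k with a b-bit body: 2*b + 1 (assumed)."""
    return 2 * k.bit_length() + 1


def subset_length(n, k):
    """Bits to state the subset size k, then which of the C(n, k) subsets it is."""
    return integer_length(k) + math.log2(math.comb(n, k))


def graph_length(adj):
    """Sum of the subset code lengths over the rows of a 0-1 adjacency matrix."""
    n = len(adj)
    return sum(subset_length(n, sum(row)) for row in adj)


n = 6
# Four edges concentrated in one row versus four edges spread over four rows.
concentrated = [[0] * n for _ in range(n)]
concentrated[0][1:5] = [1, 1, 1, 1]
spread = [[0] * n for _ in range(n)]
for i in range(4):
    spread[i][(i + 1) % n] = 1

len_concentrated = graph_length(concentrated)
len_spread = graph_length(spread)
```

With the same number of edges, the concentrated graph receives the shorter code in this example, which matches the convention that smaller scores indicate more structure.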
There will be some advantage to considering a modification to . If only a relatively small subset of nodes possess edges, then we may instead code a submatrix of . Let be the set of nodes which are part of at least one edge. Possibly, , in which case it may be advantageous to code only the submatrix of . But we would also need to code itself. This object may be converted to codewords using the size indexed code and will appear in the code as a header, followed by the submatrix coded as described above. Thus, the code length for the submatrix is
3.2. Properties of Graph Codes
where is the number of 1 entries in (i.e., the number of edges in the graph). If we now let , assume that grows proportionally with , and that the subset sizes remain bounded by , then . This means that when comparing graphs with equal numbers of edges the dominant terms of are equal, since and the comparison will depend on the remaining dominant term
A mapping , is called stepwise monotone when the following holds. Let be any element of with at least two nonzero elements. Let be any two components of for which . Then let be equal to , except that and . Then , and is called strictly stepwise monotone when the inequality can be replaced with strict inequality.
Note that is a function of the vector of subset sizes . The stepwise operation described in Definition 3 generates a hierarchy of subset lists based on the tendency to concentrate larger subset sizes in fewer subsets. In terms of graphs, the ranking will be based on the tendency for a fixed number of edges to target a smaller number of nodes. We now show that is strictly stepwise monotone.
In this section we apply Algorithm A to a set of examples, first a synthetic network based on a typical pathway, then one based on yeast genome perturbation experiments.
4.1. Synthetic Network (MAP Kinase)
4.1.1. Model Simulation
Let $A$ be the adjacency matrix of the graph in Figure 2. Gene $i$ directly regulates gene $j$ if $A_{ij} = 1$. We also expect perturbation of $i$ to affect genes further downstream; so we say that $i$ and $j$ are in an order $k$ relationship if there is a path from $i$ to $j$ of $k$ edges, and no shorter path exists. This holds if the $(i,j)$th element of the product $A^m$ is nonzero for $m = k$ and zero for $m < k$.
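The matrix-power characterization of an order relationship can be checked with a short routine. This is an illustrative sketch: entry `[i][j] = 1` is taken to mean that gene `i` directly regulates gene `j`, the small network is made up, and the names `mat_mult` and `relationship_order` are hypothetical.

```python
def mat_mult(a, b):
    """Multiply two square integer matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def relationship_order(adj, max_order):
    """Return a matrix whose [i][j] entry is the smallest power, up to
    max_order, at which the (i, j) element of the adjacency matrix power
    is nonzero, or None if there is no such power."""
    n = len(adj)
    order = [[None] * n for _ in range(n)]
    power = [row[:] for row in adj]
    for k in range(1, max_order + 1):
        for i in range(n):
            for j in range(n):
                if power[i][j] != 0 and order[i][j] is None:
                    order[i][j] = k
        power = mat_mult(power, adj)
    return order


# Hypothetical network: 0 -> 1 -> 2 -> 3, plus a shortcut 0 -> 2.
adj = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
order = relationship_order(adj, 3)
```

In this example gene 0 reaches gene 3 in two steps through the shortcut, so the pair is in an order 2 relationship even though a three-step path also exists.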
If and are in an order relationship, simulate a normal random variable with mean and variance 1, then let be the P-value associated with a hypothesis test against . If and have no relationship, let be uniformly distributed.
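The simulation step might be sketched as follows. Several choices here are assumptions and not taken from the text: the test is a one-sided upper-tail test of a zero mean, the mean depends only on the order of the relationship through a hypothetical lookup `means`, and the diagonal is treated as void (`None`).

```python
import math
import random


def upper_tail_pvalue(x):
    """P(Z >= x) for a standard normal Z (assumed one-sided test of mean zero)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def simulate_pvalues(order, means, rng):
    """Simulate a P-value matrix from a matrix of relationship orders.

    order[i][j] is the order of the relationship between i and j, or None.
    means maps an order to the mean of the simulated normal variable.
    """
    n = len(order)
    pvals = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # void: self-regulation is not assessed
            k = order[i][j]
            if k is None:
                pvals[i][j] = rng.random()  # no relationship: uniform P-value
            else:
                pvals[i][j] = upper_tail_pvalue(rng.gauss(means[k], 1.0))
    return pvals


# Hypothetical orders for a 3-gene chain 0 -> 1 -> 2 and hypothetical means.
order = [
    [None, 1, 2],
    [None, None, 1],
    [None, None, None],
]
means = {1: 3.0, 2: 1.5}
pvals = simulate_pvalues(order, means, random.Random(0))
```

Larger means push the simulated P-values toward zero, so lower-order relationships with larger means tend to be ranked ahead of unrelated pairs.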
A model is defined by characteristics and . To study a given model the data matrix is replicated 2500 times as described above. For each replication we apply Algorithm A, setting , . The compound score of (19) is used. We use the elementwise exchangeable null hypothesis .
4.1.2. Algorithm Evaluation
A study of the algorithm must take into account its dual purpose. We may accept as an estimated accessibility graph which can be compared to the true graph. On the other hand, viewed as a multiple testing procedure, the objective is an efficient choice of along a type of error curve, giving the expected number of true edges as a function of the total number of edges within graphs of the sequence . The properties of the error curve define the accuracy with which a cellular network can be inferred. Ideally, the error curve increases with slope 1 until the graph is constructed and then remains constant. Statistical variation forces deviation from this ideal; so the goal in the selection of is to identify a position along the error curve such that below (or above) this position most new edges are true (or false) positives.
We now discuss the calculation of the error curve. It will be convenient to restrict attention to relationships up to an order . Suppose that is the true order graph, in the sense that it contains edge if and only if and have an order relationship. In our example, is equivalent to the graph in Figure 2. Let represent a simulated replicate from the given model, from which we construct sequence . Let be the number of edges of contained in element of . We will estimate two forms of the error curve. For the first, using replicates of we calculate the sample mean value of for each . For the second, for each replicate we use the edge value minimizing , thus identifying , then capturing the pairs to be displayed in the form of a scatter plot.
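For a single replicate, the error curve can be traced by ranking the nonvoid matrix entries and counting how many of the edges added so far belong to the true graph. This is a minimal sketch: `None` marks a void entry, tied values enter the graph together, and the name `error_curve` and the example data are hypothetical.

```python
def error_curve(pvals, true_edges):
    """Return (total edges, true edges) for each graph in the sequence.

    One point is recorded per unique value, after all entries tied at
    that value have been added.
    """
    ranked = sorted(
        (p, i, j)
        for i, row in enumerate(pvals)
        for j, p in enumerate(row)
        if p is not None
    )
    curve = []
    total = 0
    true = 0
    for idx, (p, i, j) in enumerate(ranked):
        total += 1
        if (i, j) in true_edges:
            true += 1
        last_of_level = idx + 1 == len(ranked) or ranked[idx + 1][0] != p
        if last_of_level:
            curve.append((total, true))
    return curve


pvals = [
    [None, 0.001, 0.30],
    [0.02, None, 0.004],
    [0.60, 0.02, None],
]
true_edges = {(0, 1), (1, 2)}
curve = error_curve(pvals, true_edges)
```

Here the curve rises with slope 1 for the first two edges and is flat afterwards, the ideal shape described in the text; averaging such curves over replicates gives the first form of the estimate.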
4.1.3. Model 1 (Direct Regulation Only)
An interesting feature of these plots is the increase in power with the increase in the number of spurious genes. This is the opposite of what is usually expected in gene discovery but follows from the use of graphical structure as statistical evidence. The existence of such structure implies higher connectivity of a smaller subset of genes than would occur at random. The existence of a larger pool of unconnected genes should, to some extent, contribute to the significance of graphical discovery, since the existing structure would be less likely to have occurred by chance. Of course, the competing effect of false positives usually associated with multiple hypothesis testing will also exist. The relative importance of these effects remains to be analyzed.
4.1.4. Model 2 (Including Transitive Regulation)
4.2. Yeast Genome Expression Data
In [19] a series of gene deletion and drug experiments are reported, resulting in a compendium of 300 microarray gene expression profiles on the yeast genome. We extracted 266 genes for which single deletion experiments were performed. By matching the responses for those genes a data matrix of perturbation effect P-values was constructed (the P-values used are those reported in [19]). Algorithm A was applied using a maximum of edges, using replications of a null matrix; then was calculated as above. These replications were supplemented by an application with settings , . We use the elementwise exchangeable null hypothesis .
If we accept the 190 edge graph as that resulting from the application of an MTP, we then note that the proposed graph-based method results in significantly more structural discovery. The global significance level for a graph with 1000 edges can be taken as extremely small from an estimated -score of −95.4. This significance level applies to subgraphs from the sequence . Similarly, additional structure in the 300 gene graph compared to the 190 gene graph can be clearly seen. In order to define a "highly connected gene," we simulate random graphs to estimate a distribution of a gene's edge order . For and edges among 266 nodes we have and . Thus, we define any gene with at least 4 and 5 edges as "highly connected" in the respective graphs. Under these criteria, the respective graphs contain 33 and 43 such genes. The most connected gene in the 190 gene graph is with 38 edges. This gene is also the most connected gene in the 300 gene graph (46 edges). In general, more highly connected genes are added between edges 190 and 300, while additional edges are added to already highly connected genes.
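The random-graph calibration of "highly connected" can be sketched by simulation. The details below are assumptions: edges are drawn uniformly without replacement among all ordered pairs of distinct nodes, a gene's edge order is taken to be the number of edges it participates in (incoming plus outgoing), and the function name `random_degree_tail` is hypothetical. The cutoff probability used in the article is not reproduced.

```python
import random
from collections import Counter


def random_degree_tail(n_nodes, n_edges, n_reps, rng):
    """Estimate P(degree >= d) for a node in a uniform random directed graph."""
    counts = Counter()
    n_pairs = n_nodes * (n_nodes - 1)
    for _ in range(n_reps):
        degree = [0] * n_nodes
        for code in rng.sample(range(n_pairs), n_edges):
            i, r = divmod(code, n_nodes - 1)
            j = r if r < i else r + 1  # skip the diagonal so that j != i
            degree[i] += 1
            degree[j] += 1
        counts.update(degree)
    total = n_nodes * n_reps
    tail = {}
    running = 0
    for d in range(max(counts), -1, -1):
        running += counts[d]
        tail[d] = running / total
    return tail


# Node and edge counts follow the yeast example; the replication count is arbitrary.
tail_190 = random_degree_tail(266, 190, 200, random.Random(0))
```

A gene could then be called highly connected when its degree falls where the estimated tail probability drops below a chosen level; the result depends on how edge order is defined, so it need not reproduce the cutoffs of 4 and 5 quoted above.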
A common problem in the statistical analysis of high-throughput data is the selection of a threshold for statistical evidence which controls false discovery. Such data is often used to construct graphical models of gene interactions. A threshold selection procedure was proposed which is based on the observed graphical structure implied by a given threshold. This procedure can be used both for threshold selection and to estimate a global significance level for graphical structure. The method was demonstrated on a small simulated network as well as on the "Rosetta Compendium" [19] of yeast genome expression profiles. The methodology proved to be accurate and computationally feasible.
Further investigation is warranted in a number of issues. The graphs investigated here were unconstrained directed graphs. Application to undirected graphs and directed acyclic graphs (DAGs) will require more sophisticated graph simulation algorithms. Additionally, the long range statistical behavior of the proposed graph code is complex. Such issues will need to be carefully examined before a general threshold selection technique can be proposed.
A software implementation of the proposed procedures is available from the author's web site, in the form of an R library at http://www.urmc.rochester.edu/biostat/people/faculty/almudevar.cfm.
This work was supported by NIH Grant GM075299.
- Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Statistical Science 2003, 18(1):71-103. 10.1214/ss/1056397487
- Ideker TE, Thorsson V, Karp RM: Discovery of regulatory interactions through perturbation: inference and experimental design. Pacific Symposium on Biocomputing 2000, 305-316.
- Zhao W, Serpedin E, Dougherty ER: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 2006, 22(17):2129-2135. 10.1093/bioinformatics/btl364
- Dougherty J, Tabus I, Astola J: Inference of gene regulatory networks based on a universal minimum description length. EURASIP Journal on Bioinformatics and Systems Biology 2008, Article ID 482090.
- Wagner A: Reconstructing pathways in large genetic networks from genetic perturbations. Journal of Computational Biology 2004, 11(1):53-60. 10.1089/106652704773416885
- Onami S, Kyoda KM, Morohashi M, Kitano H: The DBRF method for inferring a gene network from large-scale steady-state gene expression data. In Foundations of Systems Biology. Edited by: Kitano H. The MIT Press, Cambridge, Mass, USA; 2001:59-75.
- Wagner A: How to reconstruct a large genetic network from n gene perturbations in fewer than $n^2$ easy steps. Bioinformatics 2002, 17(12):1183-1197.
- Bollobas B: Random Graphs. Academic Press, London, UK; 1985.
- Wagner A: Estimating coarse gene network structure from large-scale gene perturbation data. Genome Research 2002, 12(2):309-315. 10.1101/gr.193902
- Rissanen J, Grünwald P, Heikkonen J, Myllymäki P, Roos T, Rousu J: Information theoretic methods for bioinformatics. EURASIP Journal on Bioinformatics and Systems Biology 2007, Article ID 79128.
- Friedman N, Goldszmidt M: Learning Bayesian networks with local structure. In Learning in Graphical Models. Edited by: Jordan MI. The MIT Press, Cambridge, Mass, USA; 1998:421-459.
- Almudevar A: A graphical approach to relatedness inference. Theoretical Population Biology 2007, 71(2):213-229. 10.1016/j.tpb.2006.10.005
- Rissanen J: Modeling by shortest data description. Automatica 1978, 14(5):465-471. 10.1016/0005-1098(78)90005-5
- Grünwald PD: The Minimum Description Length Principle. The MIT Press, Cambridge, Mass, USA; 2007.
- Rissanen J: Information and Complexity in Statistical Modeling. Springer, New York, NY, USA; 2007.
- Cover TM, Thomas JA: Elements of Information Theory. Wiley, New York, NY, USA; 1991.
- Rissanen J: A universal prior for integers and estimation by minimum description length. Annals of Statistics 1983, 11:416-431. 10.1214/aos/1176346150
- Almudevar A: Efficient coding of labelled graphs. Proceedings of IEEE Information Theory Workshop (ITW '07), Lake Tahoe, Calif, USA, September 2007, 523-528.
- Hughes TR, Marton MJ, Jones AR, et al.: Functional discovery via a compendium of expression profiles. Cell 2000, 102(1):109-126. 10.1016/S0092-8674(00)00015-5
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.