# Graph reconstruction using covariance-based methods

- Nurgazy Sulaimanov^{1, 2} and
- Heinz Koeppl^{1, 2} (corresponding author)

**2016**:19

https://doi.org/10.1186/s13637-016-0052-y

© The Author(s) 2016

**Received: **27 March 2016

**Accepted: **21 October 2016

**Published: **23 November 2016

## Abstract

Methods based on correlation and partial correlation are today employed in the reconstruction of a statistical interaction graph from high-throughput omics data. These dedicated methods work well even for the case when the number of variables exceeds the number of samples. In this study, we investigate how the graphs extracted from covariance and concentration matrix estimates are related by using Neumann series and transitive closure and through discussing concrete small examples. Considering the ideal case where the true graph is available, we also compare correlation and partial correlation methods for large realistic graphs. In particular, we perform the comparisons with optimally selected parameters based on the true underlying graph and with data-driven approaches where the parameters are directly estimated from the data.

### Keywords

High-dimensional graph reconstruction methods Concentration and covariance graphs

## 1 Introduction

Inference of biological networks, including gene regulatory, metabolic, and protein-protein interaction networks, has received much attention recently. With the development of high-throughput technologies, it became possible to measure a large number of genes and proteins at once, which created the challenge of inferring large-scale gene regulatory and protein-protein interaction networks from high-dimensional data [1, 2]. To address this challenge, a wide range of network inference methods have been developed, such as methods based on correlation or concentration matrices, mutual information, Bayesian networks, ordinary differential equations (ODEs), and Boolean logic [3, 4]. In addition, high-throughput experiments remain costly, and therefore, experiments are usually carried out in a setting with many more genes or proteins than samples. Traditional statistical methods are usually ill-posed in this small *n*, large *p* scenario, and novel methods from high-dimensional statistics that assume further structure, such as sparsity, are a good choice for graph reconstruction in this scenario [5]. Correlation methods that are based on covariance matrix estimation are widely used in reconstructing gene co-expression and module graphs, especially in large-scale biomedical applications [6–8]. However, the edges of the interaction graph resulting from correlation methods include indirect dependencies due to the transitive nature of interactions. The effect of indirect edges becomes more pronounced as the graph size grows, and this leads to an inaccurate graph reconstruction. In contrast, methods based on the concentration or partial correlation matrix allow one to infer only direct dependencies between variables. In this respect, one can differentiate two graph types resulting from correlation and partial correlation-based methods, which we will call covariance and concentration graphs in the following, respectively.
Despite the fact that the covariance graph includes indirect dependencies, it is widely used in applications to represent sparse biological graphs by performing simple hard-thresholding [6] or through estimating the covariance matrix with shrinkage methods [9].

The aim of the paper is to shed light on the relation between covariance and concentration graphs and how this relation can be exploited to study the performance of correlation and partial correlation-based methods. In this manuscript, we provide a practical guide for researchers when using correlation and partial correlation methods and we believe that understanding these two concepts allows for a better selection of methods for graph reconstruction problems from high-throughput biological data.

In particular, we use simple examples to discuss when it is possible to eliminate indirect dependencies in the covariance graph by hard-thresholding and when it is not. Furthermore, we review recent methods that address the problem of direct and indirect dependencies in reconstructed graphs [10, 11] and provide new insights into those methods, both analytically and numerically. Moreover, we perform an in silico comparison of two correlation-based and three partial correlation methods on different graph topologies in the high-dimensional setting where the number of variables *p* exceeds the sample size *n*. The selected methods are popular approaches that are widely used in reconstructing large-scale gene regulatory and protein-protein interaction graphs. The first correlation method is based on the sample covariance matrix estimation, where one applies hard-thresholding on the entries of the sample covariance matrix to eliminate indirect edges in the covariance graph [12]. The second method estimates a sparse version of the covariance matrix via a shrinkage approach [9]. The partial correlation methods that we consider are the nodewise regression method [13], where partial correlations are computed via linear regression; the graphical Lasso method [14], which reconstructs a concentration graph by directly solving for the sparse version of the concentration matrix; and an adaptive version of nodewise regression, which determines the concentration graph in a two-stage procedure.

## 2 Notation and preliminaries

Let *X* be a *p*-dimensional multivariate normally distributed random vector, and assume *n* i.i.d. observations of *X*, which are given in terms of the *n*×*p* matrix \(\mathbf{X}=(\mathbf{X}_{1},\ldots,\mathbf{X}_{p})\), where \(\mathbf{X}_{i}\) is an *n*×1 vector with *i*=1,…,*p*. Then, the sample covariance matrix reads

Reconstructed and true graphs are written in terms of an undirected graph *G*=(*Γ*,*E*), with *Γ*={1,…,*p*} the set of variables or nodes and *E*⊆*Γ*×*Γ* the set of edges. Sometimes, we will also deal with weighted graphs where we extend *G* to contain a weight function \(w\,: E \rightarrow \mathbb {R}\), such that \(w_{ij}\) denotes the weight of the edge (*i*,*j*)∈*E*. In this paper, we will consider two types of graphs.

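The first type is the covariance graph, where zero entries \(\Sigma_{ij}=0\) of the covariance matrix indicate that the nodes *i* and *j* are independent [15]. More generally, in terms of probability distributions, we have

\[ P(X_{i},X_{j}) = P(X_{i})\,P(X_{j}). \]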

We denote the covariance graph as \(\tilde {G}=(\Gamma,\tilde {E})\), accordingly. There is an edge between any two nodes *i* and *j* if \(\Sigma_{ij}\neq 0\) and no edge if \(\Sigma_{ij}=0\). This type of graph is popular in genomics (for more information, see [16]).

The second type is the concentration graph, which is obtained from the concentration matrix \(\boldsymbol{\Theta}=\boldsymbol{\Sigma}^{-1}\), and zero entries of the concentration matrix \(\Theta_{ij}=0\) indicate that the nodes *i* and *j* are conditionally independent given the other nodes. In terms of probability distributions, the joint distribution of \(X_{i}\) and \(X_{j}\) factorizes when conditioned on the remaining variables \(X_{k}\), \(k \in \mathcal {N}, k \neq i, j\).

The partial correlation coefficients \(\rho_{ij}\) are obtained from the concentration matrix through the relation

\[ \rho_{ij} = -\frac{\Theta_{ij}}{\sqrt{\Theta_{ii}\Theta_{jj}}} \]

for *i*≠*j* and \(\rho_{ij}=1\) for *i*=*j*. There is an edge in the concentration graph between nodes *i* and *j* if \(\rho_{ij}\neq 0\) and no edge if \(\rho_{ij}=0\) (equivalently for \(\Theta_{ij}\)). Hence, the concentration graph is equivalent in topology to the graph defining the probabilistic graphical model for the Gaussian case and coincides with the graph defining the associated Gaussian Markov random field. Throughout this paper, we will assume that the true interaction graph corresponds to the concentration graph and therefore refer to it as *G*=(*Γ*,*E*).
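The link between the concentration matrix, partial correlations, and the two graph types can be illustrated numerically. The following is a minimal sketch, assuming NumPy and the standard relation \(\rho_{ij} = -\Theta_{ij}/\sqrt{\Theta_{ii}\Theta_{jj}}\); the function name `partial_correlations` and the three-node example matrix are illustrative choices, not taken from the paper.

```python
import numpy as np

def partial_correlations(sigma):
    """Partial correlation matrix from a positive-definite covariance matrix."""
    theta = np.linalg.inv(sigma)               # concentration matrix
    d = np.sqrt(np.diag(theta))
    rho = -theta / np.outer(d, d)              # rho_ij = -theta_ij / sqrt(theta_ii * theta_jj)
    np.fill_diagonal(rho, 1.0)                 # rho_ii = 1 by convention
    return rho

# Hypothetical concentration matrix of a three-node chain 1 - 2 - 3
theta_true = np.array([[1.0, -0.4, 0.0],
                       [-0.4, 1.0, -0.3],
                       [0.0, -0.3, 1.0]])
sigma = np.linalg.inv(theta_true)
rho = partial_correlations(sigma)

# Edges of the two graphs: non-zero off-diagonal entries (up to round-off)
concentration_edges = np.abs(rho) > 1e-10
covariance_edges = np.abs(sigma) > 1e-10
```

In this example the pair (1, 3) has no edge in the concentration graph but does have an (indirect) edge in the covariance graph.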

In the following, we give a definition of direct and indirect edges in the covariance graph which will be convenient throughout the paper.

### Definition 1

Let us denote the sets of direct and indirect edges in the covariance graph \(\tilde {G}\) as \(\tilde {E}'\) and \(\tilde {E}''\), respectively, with \(\tilde {E}=\tilde {E}' \cup \tilde {E}''\). The set of direct edges is then defined as \(\tilde {E}'=E\) whereas the set of indirect edges is defined as \(\tilde {E}''=\tilde {E} \setminus E\).

## 3 How are covariance and concentration graphs related?

In this section, we will discuss the relationship between covariance and concentration graphs. In particular, we will discuss how to estimate the covariance graph, when the concentration graph is known. We first start by giving some facts about graphical Gaussian models [17].

Let \(X_{d}\), *d*=1,…,*n*, be independent samples of \(\mathcal {N}(\mu, \boldsymbol {\Sigma })\). The log-likelihood function of the observation \(X_{d}\) is given by

\[ \ell(\mu,\boldsymbol{\Sigma};X_{d}) = -\frac{1}{2}\log\det(2\pi\boldsymbol{\Sigma}) - \frac{1}{2}(X_{d}-\mu)^{T}\boldsymbol{\Sigma}^{-1}(X_{d}-\mu). \]

The likelihood is maximized with respect to the mean *μ* and the covariance matrix Σ using \(\Theta_{ij}=0\) as a constraint. Let *C*⊂*Γ* be a clique of the graph *G* that represents a maximal subset of nodes in the graph, such that every node of the set is connected to every other node. Denote \(\boldsymbol{S}_{C}\) as the submatrix of S corresponding to that clique. Then, we can recall the following theorem [17].

### Theorem 1

If *p*<*n*, then the maximum-likelihood estimator \((\hat {\mu },\hat {\Sigma })\) exists and is determined by

- (i) \(\hat {\mu } = \bar {X}\)
- (ii) (*i*,*j*)∉*E* ⇒ \(\Theta_{ij}=0\), ∀*i*,*j*∈*Γ*, *i*≠*j*
- (iii) \(\hat {\boldsymbol {\Sigma }}_{C} = \boldsymbol {S}_{C}\) for all cliques *C* in *G*

The solution to (i)−(iii) is unique if S is nonsingular.

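It remains to determine the entries \(\hat{\Sigma}_{ij}\) for the pairs (*i*,*j*) which are non-zero and satisfy the constraint \(\Theta_{ij}=0\). For example, let us consider a simple graph with three nodes, *p*=3, \(X=(X_{1},X_{2},X_{3})^{T}\), where \(X_{1} ⫫ X_{3} \mid X_{2}\), which implies \(\Theta_{13}=0\). In matrix form, this gives

\[ \hat{\boldsymbol{\Sigma}} = \begin{pmatrix} s_{11} & s_{12} & \hat{\Sigma}_{13} \\ s_{12} & s_{22} & s_{23} \\ \hat{\Sigma}_{13} & s_{23} & s_{33} \end{pmatrix}, \]

where the constraint \(\Theta_{13}=0\) determines the remaining entry as \(\hat{\Sigma}_{13}=\hat{\Sigma}_{31}=s_{12}s_{23}/s_{22}\).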

From this result, one can see that all elements of \(\hat {\boldsymbol {\Sigma }}\) are determined by entries of sample covariance matrix S. Except \(\hat {\Sigma }_{13}\) and \(\hat {\Sigma }_{31}\), all elements are the same as in S. This is a nice result from maximum likelihood estimation but it works only in the regime *p*<*n*, where the sample covariance matrix S is non-singular.
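The three-node example can be checked numerically. The sketch below is an illustration under stated assumptions: it uses NumPy, random placeholder data with *n*=20 > *p*=3, and `np.cov` (which normalizes by *n*−1; the normalization does not affect the zero pattern). It completes the (1,3) entry as \(s_{12}s_{23}/s_{22}\) and inspects the inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))           # placeholder data: n = 20 samples, p = 3 variables
S = np.cov(X, rowvar=False)                # sample covariance matrix

# Estimate under the constraint Theta_13 = 0:
# keep S on the cliques {1,2} and {2,3}, complete only the (1,3) entry
sigma_hat = S.copy()
sigma_hat[0, 2] = sigma_hat[2, 0] = S[0, 1] * S[1, 2] / S[1, 1]

theta_hat = np.linalg.inv(sigma_hat)       # the (1,3) entry vanishes up to round-off
```

All other entries of `sigma_hat` coincide with those of `S`, in line with condition (iii) of Theorem 1.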

The relationship between the concentration and covariance graphs can be understood by the transitive closure operation [18] which we define in the following way. First, we give a definition for a path.

### Definition 2

For a weighted graph *G*=(*Γ*,*E*,*w*) with weight function \(w:E \rightarrow \mathbb {R}\), a path *σ* between nodes *i* and *j* is an ordered sequence of 2-tuples of the form \(\sigma=((i,k_{1}),(k_{1},k_{2}),\ldots,(k_{m},j))\in P_{m}\subseteq E^{m}\). We call *m* the length of the path and define \(w^{\sigma }_{ij} = w_{ik_{1}}w_{k_{1}k_{2}} \cdots w_{k_{m} j}\) as the path weight.

With that, we define the transitive closure as follows.

### Definition 3

The transitive closure of a weighted graph *G*=(*Γ*,*E*,*w*) is a weighted graph \(G^{*}=(\Gamma,E^{*},w^{*})\), with \((i,j)\in E^{*}\) iff there exists a path \(\sigma\in P_{m}\) from *i* to *j* in *G* for some \(m\in \mathbb {N}\) and with edge weights \(w^{*}_{ij} = \sum _{\sigma \in P(i,j)}w^{\sigma }_{ij}\), where *P*(*i*,*j*) is the set of all distinct paths connecting (*i*,*j*) in *G* of any length \(m\in \mathbb {N}\).

We associate with *G* and \(G^{*}\) their weighted adjacency matrices, denoted A and \(\boldsymbol{A}^{*}\), respectively. Observe that \(G^{*}\) contains self-loops or cycles (e.g., for a node *i* with at least one edge, *i* is connected to *i* by a path of length two through *i*→*j*→*i*), and hence, \(\boldsymbol{A}^{*}\) will have non-zero diagonal entries. The transitive closure of the graph is depicted in Fig. 1 a for illustration.

Subsequently, we use the example graph depicted in Fig. 1 b, with the node set \(\Gamma=\{X_{1},X_{2},X_{3}\}\) and with the edge set \(E=\{(X_{1},X_{2}),(X_{1},X_{3})\}\). We assume that this graph is weighted and edge weights are given by \(A_{12}\) and \(A_{13}\) (Fig. 1 b (left)). The adjacency matrix of *G* then reads

\[ \boldsymbol{A} = \begin{pmatrix} 0 & A_{12} & A_{13} \\ A_{12} & 0 & 0 \\ A_{13} & 0 & 0 \end{pmatrix}. \tag{4} \]

We remark that the adjacency matrix (4) is not invertible and generally sparse.

Here, A denotes the weighted adjacency matrix of *G*. Moreover, we have the Neumann series

\[ (\boldsymbol{I}-\boldsymbol{A})^{-1} = \sum_{m=0}^{\infty}\boldsymbol{A}^{m}, \]

which converges if ||A||<1. If we denote by *σ*(A) the spectral radius of A, then through Gelfand’s theorem, by which there exists a *k*>0 such that \(||\boldsymbol{A}^{k}||<1\) if *σ*(A)<1, the series more generally converges for *σ*(A)<1. We now recall from graph theory that \(\boldsymbol{A}^{2}\) can be seen as an adjacency matrix of a new graph constructed from *G* by connecting nodes that can be reached by a path of length two in *G*. Generally, entry (*i*,*j*) in \(\boldsymbol{A}^{m}\) will be non-zero if there is a path of length *m* in *G* connecting (*i*,*j*), where we observe that the diagonal elements of \(\boldsymbol{A}^{m}\) need not be zero anymore, due to the presence of possible cycles of length *m* in *G*. The value at entry (*i*,*j*) of \(\boldsymbol{A}^{m}\), or the weight of edge (*i*,*j*), is then the product of weights along one path in *G*, summed over all the paths of length *m* connecting (*i*,*j*). Accordingly, the convergent infinite sum \(\sum_{m=1}^{\infty}\boldsymbol{A}^{m}\) has a non-zero entry (*i*,*j*) if there exists a path of any length connecting (*i*,*j*) in *G*. The graph associated with this infinite sum coincides with \(G^{*}\), the transitive closure of *G*, i.e., \(\boldsymbol {A}^{*} = \sum _{m=1}^{\infty }\boldsymbol {A}^{m}\) and hence

\[ \boldsymbol{A}^{*} = (\boldsymbol{I}-\boldsymbol{A})^{-1} - \boldsymbol{I}. \]

Consequently, not-connected components of *G* transform to not-connected components in the covariance graph. Moreover, taking aside potential cancelation of weights, the subgraphs in \(G^{*}\) are dense, i.e., are fully connected. Using this infinite sum, we show that for special graphs, it is easy to compute single entries of Σ from the adjacency matrix A without complete matrix inversion. Generally, the diagonal entries of the concentration matrix Θ are distinct, and therefore, we assume D in the example to be a diagonal matrix with distinct entries.

We first consider the entry \(\Sigma_{12}=\Sigma_{21}\) representing the direct edge in the covariance graph. It is possible to represent the corresponding entry in terms of infinite sums by

Similarly, the indirect edge \(\Sigma_{23}=\Sigma_{32}\) yields

The same approach holds for diagonal elements as all entries of the covariance matrix have the same denominator \((1-A_{12}^{2}-A_{13}^{2})\).

where \(Z =1-A_{12}^{2}-A_{13}^{2}\).

To sum up, the entries of the covariance matrix can be obtained by applying the transitive closure from Definition 3 on the concentration graph in addition to a general scaling through D. Interestingly, for particular graphs, as the example above, more structure of the concentration graph can be exploited for computing the transitive closure and hence the covariance matrix.
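The relation between the Neumann series and the transitive closure can be sketched numerically for the three-node example. This is an illustration only: the edge weights are hypothetical, the scaling matrix D is ignored (taken as the identity), and the closed form \(\boldsymbol{A}^{*}=(\boldsymbol{I}-\boldsymbol{A})^{-1}-\boldsymbol{I}\) is compared with a truncated sum of matrix powers.

```python
import numpy as np

A12, A13 = 0.5, 0.3                        # hypothetical edge weights
A = np.array([[0.0, A12, A13],
              [A12, 0.0, 0.0],
              [A13, 0.0, 0.0]])

spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))   # must be < 1 for convergence

# Truncated Neumann series A + A^2 + ... + A^M
M = 200
closure_series = np.zeros_like(A)
power = np.eye(3)
for _ in range(M):
    power = power @ A
    closure_series += power

# Closed form of the transitive closure
closure_exact = np.linalg.inv(np.eye(3) - A) - np.eye(3)

# All entries share the denominator Z = 1 - A12^2 - A13^2
Z = 1 - A12**2 - A13**2
```

For instance, the indirect entry between nodes 2 and 3 equals `A12 * A13 / Z`, a single path of length two rescaled by the common denominator.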

For instance, the following result provides the expressions of the transitive closure for a star graph (Fig. 1 c).

### Proposition 1

Consider a star graph with |*Γ*|=*p*, |*E*|=*p*−1 and adjacency matrix A. Denote the index of the hub node of the star by *k* and define \(c = 1-\sum _{l=1}^{p} A_{kl}A_{lk}\), then ∀*i*≠*k* and ∀*j*≠*k* we have \(A^{*}_{ij} = A_{ik}A_{kj}/c\), \(A^{*}_{ik} = A_{ik}/c\), and \(A^{*}_{kk} = 1/c-1\).
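The closed-form expressions of Proposition 1 can be compared with a direct computation of the transitive closure. The sketch below assumes NumPy, a symmetric adjacency matrix, hypothetical hub-to-leaf weights, and 0-based indexing for the hub index `k`.

```python
import numpy as np

p, k = 5, 0                                 # number of nodes and hub index (0-based)
weights = np.array([0.3, -0.2, 0.4, 0.1])   # hypothetical hub-to-leaf weights
leaves = [i for i in range(p) if i != k]

A = np.zeros((p, p))
A[k, leaves] = weights
A[leaves, k] = weights

c = 1.0 - np.sum(A[k, :] * A[:, k])
closure = np.linalg.inv(np.eye(p) - A) - np.eye(p)   # transitive closure A*

predicted = np.outer(A[:, k], A[k, :]) / c   # A*_ij = A_ik A_kj / c for i, j != k
predicted[:, k] = A[:, k] / c                # A*_ik = A_ik / c
predicted[k, :] = A[k, :] / c
predicted[k, k] = 1.0 / c - 1.0              # A*_kk = 1/c - 1
```

With these weights, `closure` and `predicted` agree up to round-off, including the diagonal entries of the leaf nodes.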

The proof of Proposition 1 is given in Additional file 1. The result moreover indicates that the entries of the transitive closure matrix \(\boldsymbol{A}^{*}\) could be related to each other. A simple relation can be obtained by considering the correlation matrix, i.e., the normalized version of the covariance matrix

\[ \boldsymbol{C}=\boldsymbol{\Lambda}^{-1}\boldsymbol{\Sigma}\boldsymbol{\Lambda}^{-1} \]

with diagonal scaling matrix Λ with elements \(\Lambda _{ii} = \sqrt {\Sigma _{ii}}\). In order to formalize the relation, we introduce the following variant of transitive closure.

### Definition 4

The minimal transitive closure *T* of a weighted graph *G*=(*Γ*,*E*,*w*), *G*↦*T*(*G*) is the weighted graph \(\tilde {G}=(\Gamma,\tilde {E},\tilde {w})\) with \((i,j) \in \tilde {E}\) iff there exists a path between (*i*,*j*) with edge weights \(\tilde {w}_{ij} = \sum _{\sigma \in \tilde {P}(i,j)}w^{\sigma }_{ij}\) where \(\tilde {P}(i,j)\) is the set of distinct paths \(\sigma_{ij}\) that are of minimal length.

With that, we have the following.

### Proposition 2

Consider a concentration graph that is a star graph *G*=(*Γ*,*E*,*w*) and denote its associated covariance graph as \(G'=(\Gamma',E',w')\), with weights \(w'\) corresponding to the correlation coefficients. Defining the graph \(\hat {G} = (\Gamma,E,\hat {w})\) with \(\hat {w}_{ij} = w'_{ij}\) for all (*i*,*j*)∈*E*, then it holds that \(T(\hat {G}) = G'\).

The proof of Proposition 2 is given in Additional file 1. This proposition indicates that the covariance graph with weights from the correlation matrix is the minimal transitive closure of the concentration graph with weights given by the correlation matrix, i.e., indirect edge weights can be obtained by closure on the direct edges.

We observe that the exact relation \(\tilde {A}_{3}=\tilde {A}_{1}\tilde {A}_{2}\) holds, and the covariance graph can be regarded as the transitive closure of the concentration graph with edge weights \(\tilde {A}_{1}\) and \(\tilde {A}_{2}\).

Further examples of graphs for which this relation holds are chain graphs and tree graphs, as we show numerically in our study.
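The product relation between direct and indirect correlation coefficients can be checked on a small tree. The following sketch is an assumption-laden illustration: the tree, its weights, and the choice of concentration matrix \(\boldsymbol{\Theta}=\boldsymbol{I}-\boldsymbol{A}\) (i.e., D equal to the identity) are hypothetical and only serve to show that correlations of indirect edges equal products of correlations along the connecting path.

```python
import numpy as np

# Hypothetical tree with edges 0-1, 1-2, 1-3, 3-4 and concentration weights
p = 5
A = np.zeros((p, p))
for (i, j), w in {(0, 1): 0.3, (1, 2): -0.25, (1, 3): 0.2, (3, 4): 0.35}.items():
    A[i, j] = A[j, i] = w

sigma = np.linalg.inv(np.eye(p) - A)       # covariance, assuming Theta = I - A
lam = np.sqrt(np.diag(sigma))
C = sigma / np.outer(lam, lam)             # correlation matrix C = Lambda^-1 Sigma Lambda^-1

# Indirect correlation = product of direct correlations along the unique path
c_02 = C[0, 1] * C[1, 2]                   # path 0-1-2
c_04 = C[0, 1] * C[1, 3] * C[3, 4]         # path 0-1-3-4
```

Here `C[0, 2]` matches `c_02` and `C[0, 4]` matches `c_04`, which is the minimal transitive closure of the direct correlation weights.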

### 3.1 Estimating sparse covariance graph via hard-thresholding the covariance matrix

After establishing a link between concentration and covariance graphs, we discuss how to obtain a sparse covariance graph by performing hard-thresholding on the entries of the covariance matrix with concrete examples that are given in Fig. 1 d, e. Here, our goal is to examine when it is possible to get the covariance graph which is similar to the concentration graph in terms of non-zero edges after hard-thresholding is applied. In particular, we give simple conditions on the entries of an adjacency matrix that allow the covariance graph to preserve a corresponding set of edges as in the concentration graph. A detailed description of this section is given in Additional file 1.

### 3.2 Graph reconstruction via network deconvolution

As we stated earlier, the concentration and covariance graphs can be related via the Neumann series. In the following, we briefly review a network deconvolution approach by Feizi et al. [10], which is based on a similar idea. A closely related method, called network silencing, is proposed in [11]. Strictly speaking, both methods are only applicable in the setting *p*<*n*.

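Network deconvolution assumes an observation matrix \(\boldsymbol{\Sigma}_{M}\) related to A through

\[ \boldsymbol{\Sigma}_{M} = \boldsymbol{A} + \boldsymbol{A}^{2} + \boldsymbol{A}^{3} + \cdots = \boldsymbol{A}(\boldsymbol{I}-\boldsymbol{A})^{-1}, \tag{18} \]

which coincides with our definition of a transitive closure of A in (8). For many applications considered in [10], the observation matrix is taken to be the covariance or correlation matrix computed from experimental data. Comparing (18) with (6) indicates that the assumed form of the observation matrix does not cover the general form for covariance or correlation matrices.

The direct edge weights are then recovered by inverting this relation,

\[ \boldsymbol{A} = \boldsymbol{\Sigma}_{M}(\boldsymbol{I}+\boldsymbol{\Sigma}_{M})^{-1}. \tag{19} \]

The use of *n*<*p* samples also implies a rank deficiency of \((\boldsymbol{I}+\boldsymbol{A}^{*})\), which is the matrix to be inverted in network deconvolution according to (19). Hence, deconvolution cannot be applied directly for *p*>*n* unless one applies regularization, for instance, through hard-thresholding [19]. In contrast to the definition (18) of \(\boldsymbol{\Sigma}_{M}\) given in [10], the authors finally use a modified version where the diagonal elements are set to zero, leading to an inconsistency in the definition of the deconvolution (19). As discussed earlier, the transitive closure (18) has indeed non-zero diagonal entries due to cyclic paths made possible through higher order terms. Consequently, redefining \(\boldsymbol{\Sigma}_{M}=\boldsymbol{A}^{*}-\boldsymbol{V}\), with a diagonal matrix \(\boldsymbol{V}=\mathrm{diag}(\boldsymbol{A}^{*})\), the exact network deconvolution for the adapted transitive closure would read

\[ \boldsymbol{A} = (\boldsymbol{\Sigma}_{M}+\boldsymbol{V})(\boldsymbol{I}+\boldsymbol{\Sigma}_{M}+\boldsymbol{V})^{-1}. \]

In [10], a scaled version of the observation matrix is deconvolved instead,

\[ \boldsymbol{\tilde{A}} = \alpha\boldsymbol{\Sigma}_{M}(\boldsymbol{I}+\alpha\boldsymbol{\Sigma}_{M})^{-1}, \tag{21} \]

where *α* is a scaling parameter that should control the convergence of the matrix inversion in (19).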

Although the expression (19) is general, the authors of [10] state that a necessary assumption of network deconvolution is that indirect edge weights encoded in \(\boldsymbol{\Sigma}_{M}\) can be expressed as a product of direct edge weights along the path according to A. However, it is not clear which types of graphs A give rise to such a weight relation in the observation matrix (e.g., see Proposition 2 and its discussion). In the following, we demonstrate that such a relation holds for chain graphs for any *α*.
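The deconvolution step and the effect of zeroing the diagonal can be sketched as follows. This is a minimal illustration, assuming that the observation matrix is the sum of all powers of A, \(\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{A})^{-1}\), and that deconvolution inverts it as \(\boldsymbol{\Sigma}_{M}(\boldsymbol{I}+\boldsymbol{\Sigma}_{M})^{-1}\); the function names and the four-node example weights are hypothetical.

```python
import numpy as np

def transitive_closure(A):
    """Sum of all powers A + A^2 + ... = A (I - A)^{-1}; needs spectral radius < 1."""
    return A @ np.linalg.inv(np.eye(len(A)) - A)

def deconvolve(obs, alpha=1.0):
    """Deconvolution of a scaled observation matrix: alpha*obs (I + alpha*obs)^{-1}."""
    M = alpha * obs
    return M @ np.linalg.inv(np.eye(len(M)) + M)

# Hypothetical weighted chain 1 - 2 - 3 - 4
A = np.array([[0.0, 0.4, 0.0, 0.0],
              [0.4, 0.0, 0.3, 0.0],
              [0.0, 0.3, 0.0, 0.2],
              [0.0, 0.0, 0.2, 0.0]])

obs = transitive_closure(A)
recovered = deconvolve(obs)                 # exact: equals A up to round-off

# Setting the diagonal of the observation to zero makes the inversion inexact
obs_zero_diag = obs - np.diag(np.diag(obs))
recovered_zero_diag = deconvolve(obs_zero_diag)
```

The first call recovers A exactly, whereas the version with a zeroed diagonal does not, which illustrates the inconsistency discussed above.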

#### 3.2.1 Network deconvolution for chain graphs

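We consider a four-node chain graph and assume that the direct edges are \(\theta=\Sigma_{12}=\Sigma_{13}=\Sigma_{24}\) and that second-order and third-order edges are \(s_{1}=\Sigma_{14}=\Sigma_{23}\) and \(s_{2}=\Sigma_{34}\), respectively. We then get the following observation matrix representing the covariance graph

\[ \boldsymbol{\Sigma}_{M} = \begin{pmatrix} 0 & \theta & \theta & s_{1} \\ \theta & 0 & s_{1} & \theta \\ \theta & s_{1} & 0 & s_{2} \\ s_{1} & \theta & s_{2} & 0 \end{pmatrix}. \]

We look for a scaling parameter *α* such that deconvolution is exact. Therefore, we compute (21) and determine when indirect weights in \(\boldsymbol {\tilde {A}}\) are zero. It corresponds to solving a system of two equations for the indirect edges \(s_{1}\) and \(s_{2}\).

For arbitrary \(s_{1}\) and \(s_{2}\), there exists no single scaling parameter *α* that satisfies both equations. Solving for \(s_{1}\) and \(s_{2}\), we then get the following solutions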

Considering the second solutions \(s_{1,2}=\alpha\theta^{2}\) and \(s_{2,2}=\alpha^{2}\theta^{3}\), one finds that indirect edge weights are indeed the product of direct edges along the path.

One can intuitively extend this relation to higher-order indirect edges as the network size grows, as \((\alpha^{3}\theta^{4},\alpha^{4}\theta^{5},\ldots,\alpha^{p-2}\theta^{p-1})\), where *p* is the number of variables.

where \(S_{k}\) represents indirect edges of *k*-th order.

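With these weights, network deconvolution applied to \(\alpha\boldsymbol{\Sigma}_{M}\) is exact, that is, all indirect entries of \(\boldsymbol{\tilde{A}}\) vanish and the direct entries are equal to \(\alpha\theta W\), where \(W=(1-\alpha^{2}\theta^{2})^{-1}\).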
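The exactness of deconvolution for chain graphs with weights of the form \(\alpha^{k-1}\theta^{k}\) can be verified numerically. The sketch below makes several assumptions: nodes are ordered along the chain 1-2-…-*p* (a relabeling of the four-node example), the observation matrix has a zero diagonal, deconvolution is computed as \(\alpha\boldsymbol{\Sigma}_{M}(\boldsymbol{I}+\alpha\boldsymbol{\Sigma}_{M})^{-1}\), and the values of `p`, `theta`, and `alpha` are hypothetical with `alpha * theta < 1`.

```python
import numpy as np

p, theta, alpha = 6, 0.4, 1.5               # hypothetical values with alpha * theta < 1
idx = np.arange(p)
order = np.abs(idx[:, None] - idx[None, :])  # edge order |i - j| along the chain

# Observation matrix: order-k edges carry weight alpha^(k-1) * theta^k, zero diagonal
obs = np.where(order > 0, alpha ** (order - 1.0) * theta ** order, 0.0)

scaled = alpha * obs
deconvolved = scaled @ np.linalg.inv(np.eye(p) + scaled)

W = 1.0 / (1.0 - alpha**2 * theta**2)
direct = order == 1                          # mask of direct edges
indirect = order > 1                         # mask of indirect edges
```

After deconvolution, the entries selected by `indirect` are zero up to round-off and the entries selected by `direct` equal `alpha * theta * W`.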

#### 3.2.2 Effect of scaling parameter on the output of network deconvolution

The scaling parameter *α* is introduced in [10] to improve network deconvolution. However, we show with simple examples that particular choices for *α* can lead to unwanted elimination of direct edges. We again consider the four-node graph that contains three direct and three indirect edges which are \(\theta_{1},\theta_{2},\theta_{3}\) and \(s_{1},s_{2},s_{3}\), respectively. The assignment of direct and indirect edges corresponds to a chain graph. The observation matrix is given by

We then determine *α* such that a particular direct edge, i.e., \(\theta_{1}\) in \(\boldsymbol {\tilde {A}}\) will be zero. In particular,

It is easy to derive the same for other direct edges. If the scaling parameter is chosen as in (28), then only the direct edge \(\theta_{1}\) will be zero, whereas other edges including indirect edges will be non-zero. In applications, it is difficult to choose the scaling parameter for which network deconvolution discriminates correctly between direct and indirect edges. The user needs to be aware of the fact that for some choices of *α*, network deconvolution can negatively affect the accuracy by removing direct edges instead of indirect ones.

In the following, we investigate how this scaling parameter affects indirect edges of different order with numerical simulations. For this purpose, we choose a six-node chain graph, generate synthetic data using the workflow illustrated in Fig. 4, and compute the correlation matrix. The covariance graph reconstructed from the correlation matrix is accordingly fully connected and has five direct and ten indirect edges, where edges of the same order were assigned the same weight.

where \(\langle A_{ij}^{\text {dir}}\rangle \) and \(\langle \Sigma _{M,ij}^{\text {dir}} \rangle \) are the average weights of direct edges in \(\boldsymbol {\tilde {A}}\) and \(\boldsymbol{\Sigma}_{M}\), whereas \(\langle A_{ij}^{\text {indir}}\rangle \) and \(\langle \Sigma _{M,ij}^{\text {indir}} \rangle \) represent the average weights of indirect edges in \(\boldsymbol {\tilde {A}}\) and \(\boldsymbol{\Sigma}_{M}\), respectively. The average is taken over all edges of the same order. We compute the discriminative ratio for each order separately.

Thus, the effect of *α* is not uniform for all indirect edges, which means that any improved discrimination after deconvolution is due to edges of some order. For example, for *α*∈(0.5,1.5), network deconvolution better discriminates the second, fourth, and fifth order edges, whereas it fails to discriminate the third order edge. For *α*∈(1.5,2), the method fails to better discriminate any edge. With simulations, we also show that both network deconvolution and network silencing approaches can help better discriminate direct and indirect edges if edges are already separable in the covariance graph, as shown in Fig. 2 c. If the absolute values of some indirect edges in the covariance graph are larger than the absolute values of direct edges, then both methods fail to discriminate them (Fig. 2 d).

## 4 Methods


### 4.1 Correlation-based methods

#### 4.1.1 Hard-thresholding of sample covariance matrix

However, a selection of the threshold is hard to tackle analytically. Recently, some methods have been developed to choose the threshold from the data [19, 23, 24]. However, these methods have been designed for the case *p*<*n* and do not perform well in the *p*>*n* setting.

Here, we consider a threshold selection method that exploits the scale-free property of graphs and can be used for *p*>*n*. In the following, we briefly review this method. Scale-free graphs are characterized by a power law degree distribution

\[ P(k) = b\,k^{-\gamma}, \tag{30} \]

where *k* is the node degree, *γ* is the degree exponent, and *b* is the normalization constant [26, 27]. Some biological graphs have been reported to exhibit power law degree distributions with 2<*γ*<3 [27].

Assume a sample covariance matrix S defined as in (2). We further define the thresholding operation \(T_{d}(S_{ij})\) yielding sample covariance matrix elements thresholded at *d*. To choose the threshold *d*, we fit an affine function \(f(k) = -\hat {\gamma }k + \hat {b}\) to the empirical degree distribution of a graph obtained by thresholding at *d* in the log domain and compute the \(R^{2}\) value of the fit (\(0<R^{2}<1\)) (Fig. 3 (left)). In addition, we also compute the mean degree \(\bar {k}=p^{-1}\sum _{i=1}^{p}\tilde {k}_{i}\), where \(\tilde {k}_{i}=\sum _{j=1}^{p}T_{d}(S_{ij})\) (Fig. 3 (right)). In particular, we are interested in high \(R^{2}\) values and, for sparsity, low mean degree values \(\bar {k}\). We also require \(\hat {\gamma } > 0\), so that the slope of the fitted linear function is negative. High \(R^{2}\), low mean degree \(\bar {k}\), and \(\hat {\gamma } > 0\) give rise to graphs with few connections in which a few nodes have more connections than the other nodes. This indicates that the graph obtained from \(T_{d}(\boldsymbol{S})\) is approximately scale-free. So far, we have introduced a sparse covariance estimation using hard-thresholding where hard-thresholding is performed after the estimation of the sample covariance matrix. In the following section, we discuss a direct estimation of the sparse covariance matrix in which no hard-thresholding is involved.
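The threshold selection procedure can be sketched as a scan over candidate thresholds. The code below is one possible reading of the description, not the authors' implementation: it assumes that thresholding acts on absolute values, that the fit is a least-squares line of \(\log_{10} P(k)\) against \(\log_{10} k\), and that nodes with degree zero are excluded from the fit. The function name `scale_free_fit` and the placeholder data are hypothetical.

```python
import numpy as np

def scale_free_fit(S, d):
    """Threshold |S_ij| at d and fit log P(k) = -gamma * log k + b to the degrees."""
    adj = np.abs(S) > d
    np.fill_diagonal(adj, False)               # no self-loops
    degrees = adj.sum(axis=1)
    mean_degree = degrees.mean()
    ks, counts = np.unique(degrees[degrees > 0], return_counts=True)
    if len(ks) < 3:                            # too few distinct degrees for a meaningful fit
        return np.nan, np.nan, mean_degree
    x, y = np.log10(ks), np.log10(counts / counts.sum())
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    r2 = 1.0 - residual @ residual / np.sum((y - y.mean()) ** 2)
    return -slope, r2, mean_degree             # gamma_hat, R^2, mean degree

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 50))              # placeholder data: n = 30, p = 50
S = np.cov(X, rowvar=False)
for d in np.linspace(0.1, 0.6, 6):
    gamma_hat, r2, k_bar = scale_free_fit(S, d)
    print(f"d={d:.2f}  gamma={gamma_hat:.2f}  R2={r2:.2f}  mean degree={k_bar:.2f}")
```

A threshold would then be picked among the candidates with high `r2`, positive `gamma_hat`, and low `k_bar`; how these criteria are traded off is left open here.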

#### 4.1.2 Covariance Lasso

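The second correlation-based method [9] is the *Covariance Lasso*. In contrast to hard-thresholding introduced in the previous section, the sparsity in the covariance matrix is achieved by minimizing a penalized negative log-likelihood function of the form

\[ L(\boldsymbol{\Sigma}|\boldsymbol{S}) = \log\det\boldsymbol{\Sigma} + \mathrm{tr}(\boldsymbol{\Sigma}^{-1}\boldsymbol{S}) + \lambda_{\text{cov}}\|\boldsymbol{P}\circ\boldsymbol{\Sigma}\|_{1}, \tag{31} \]

where \(\lambda_{\text{cov}}\) is the penalty parameter which induces sparsity in off-diagonal elements of Σ, whereas P is a matrix with nonnegative elements and ∘ denotes elementwise multiplication. The matrix P can be chosen as the matrix of ones, or with zeros on the diagonal to avoid shrinking diagonal elements of Σ. The objective function given in (31) is nonconvex, which is due to the term log detΣ, and has several local minima, which makes the optimization problem difficult. Since the objective function contains convex and concave terms, a majorization-minimization approach is used to solve the problem. This approach was successfully applied earlier on similar problems [28, 29]. The concave part of the objective function (31) is approximated by its tangent at \(\boldsymbol{\Sigma}_{0}\), where \(\boldsymbol{\Sigma}_{0}=\boldsymbol{S}\) or \(\boldsymbol{\Sigma}_{0}=\mathrm{diag}(\boldsymbol{S})\) and \(\boldsymbol {\Theta }_{0}=\boldsymbol {\Sigma }_{0}^{-1}\). So one needs to estimate the covariance matrix by

\[ \hat{\boldsymbol{\Sigma}} = \arg\min_{\boldsymbol{\Sigma}\succ 0}\; \mathrm{tr}(\boldsymbol{\Theta}_{0}\boldsymbol{\Sigma}) + \mathrm{tr}(\boldsymbol{\Sigma}^{-1}\boldsymbol{S}) + \lambda_{\text{cov}}\|\boldsymbol{P}\circ\boldsymbol{\Sigma}\|_{1}. \tag{34} \]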

In the case *p*>*n*, the sample covariance matrix S is not full rank. To avoid this, one replaces S by the regularized matrix \(\boldsymbol{S}+s\boldsymbol{I}\), for some small regularizing parameter *s*>0.
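The rank deficiency for *p*>*n* and the effect of the diagonal regularization can be illustrated in a few lines. This is a sketch with placeholder Gaussian data; the value of `s` is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 50
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)

rank_S = np.linalg.matrix_rank(S)           # at most n - 1 < p, so S is singular

s = 1e-2                                    # hypothetical small regularizing parameter
S_reg = S + s * np.eye(p)
min_eig = np.linalg.eigvalsh(S_reg).min()   # approximately s, hence S_reg is invertible
```

Adding `s` to the diagonal shifts all eigenvalues by `s`, so the zero eigenvalues of `S` become `s` and `S_reg` is positive definite.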

The penalty parameter \(\lambda_{\text{cov}}\) should be determined from the data, and *K*-fold cross-validation is used for this purpose. First, the samples (1,…,*n*), which correspond to the rows of the design matrix **X**, are partitioned into *K* subsets which are used as training and validation sets. Initially, the covariance matrix is estimated as in (34) using the training set. We denote it as \(\boldsymbol {\hat {\Sigma }}_{T}\). The validation set is used to compute the sample covariance matrix, which we denote as \(\boldsymbol{S}_{V}\). The penalty parameter is then computed via

where \(L(\boldsymbol {\hat {\Sigma }}_{T}|\boldsymbol {S}_{V})\) is defined in (31).

### 4.2 Partial correlation-based methods

#### 4.2.1 Nodewise regression Lasso

In nodewise regression, we take \(\mathbf{X}_{i}\), *i*∈*Γ*, to be a response variable and \(\mathbf{X}^{\setminus i}\) to be the matrix of predictor variables consisting of the remaining *p*−1 variables. In order to get an estimate for the node *i*∈*Γ*, one regresses this node on the remaining nodes *j*∈*Γ*∖{*i*} and gets a linear model of the form

\[ \mathbf{X}_{i} = \mathbf{X}^{\setminus i}\boldsymbol{\beta}^{i} + \boldsymbol{\epsilon}_{i}, \]

where \(\boldsymbol{\beta}^{i}\) is the set of *p*−1 regression coefficients associated to node *i* and \(\mathbb {E}[\boldsymbol {\epsilon }_{i}]=\mathbf {0}\). Denoting an element of vector \(\boldsymbol{\beta}^{i}\) as the regression coefficient \({\beta ^{i}_{j}}\), with *j*∈*Γ*∖{*i*}, this coefficient can be related to the concentration matrix as

\[ \beta^{i}_{j} = -\frac{\Theta_{ij}}{\Theta_{ii}}. \]

In the Lasso estimate (39), \(\lambda_{L}>0\) denotes the penalty parameter. In order to estimate a whole graph, this procedure is applied to all nodes, by regressing each node on the remaining nodes. Nodewise regression Lasso returns sparse estimates which are not symmetric. In particular, there are two different estimates for each edge between any two nodes, which are estimated from two different regression problems. To decide for the absence or presence of the corresponding edge in the concentration graph, AND and OR operations are proposed in [13], i.e., an edge (*i*,*j*) is present if \(\hat {\beta }^{i}_{j}\) and/or \(\hat {\beta }^{j}_{i}\) are non-zero.
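The relation between regression coefficients and the concentration matrix, and the AND/OR rules for combining the two estimates per edge, can be sketched as follows. This illustration works with population quantities (no Lasso fit is performed), assumes the standard identity \(\beta^{i}_{j}=-\Theta_{ij}/\Theta_{ii}\), and uses hypothetical function names and example matrices.

```python
import numpy as np

def regression_coefficients(sigma, i):
    """Population coefficients of regressing X_i on all remaining variables."""
    rest = [j for j in range(len(sigma)) if j != i]
    beta = np.linalg.solve(sigma[np.ix_(rest, rest)], sigma[rest, i])
    return rest, beta

def symmetrize(B, rule="and"):
    """Combine the two estimates per edge; B[i, j] holds beta^i_j (zero diagonal)."""
    nonzero = B != 0
    return nonzero & nonzero.T if rule == "and" else nonzero | nonzero.T

theta = np.array([[1.0, -0.4, 0.0],
                  [-0.4, 1.0, -0.3],
                  [0.0, -0.3, 1.0]])
sigma = np.linalg.inv(theta)
rest, beta = regression_coefficients(sigma, 0)
expected = -theta[0, rest] / theta[0, 0]       # beta^i_j = -Theta_ij / Theta_ii

# Hypothetical sparse estimates in which only one direction of edge (0, 2) is non-zero
B = np.array([[0.0, 0.4, 0.1],
              [0.3, 0.0, 0.2],
              [0.0, 0.25, 0.0]])
edges_and = symmetrize(B, "and")               # edge (0, 2) is dropped
edges_or = symmetrize(B, "or")                 # edge (0, 2) is kept
```

The AND rule yields the sparser graph, the OR rule the denser one; both return a symmetric edge indicator matrix.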

#### 4.2.2 Graphical Lasso

In the graphical Lasso objective (40), \(\lambda_{G}\) is the parameter which controls the size of the penalty. This optimization problem is convex and can be solved by a block coordinate descent method proposed in [31]. The estimated concentration matrix is symmetric, and there are no additional AND or OR operations needed.

#### 4.2.3 Adaptive Lasso

The penalty parameters \(\lambda_{L}\) in (39) and \(\lambda_{G}\) in (40) are chosen by cross-validation. However, a cross-validated choice of these penalty parameters does not lead to a consistent model selection and leads to overestimation [5, 13]. Therefore, it is suggested to apply cross-validation using the adaptive Lasso (adaptive version of nodewise regression), which gives a sparser solution compared to cross-validation with nodewise regression and graphical Lasso. Given the data where the underlying graph is not known, it is challenging to determine a good Lasso penalty from the data. One study showed that it is possible to assign different weights to different coefficients, thereby allowing the coefficients to be non-equally penalized in the \(L_{1}\) penalty [22]. This is achieved by the following estimator:

where \(\tilde {\boldsymbol {\beta }}^{i}\) are initial estimates from (39) and used as weights. It is suggested to estimate \(\tilde {\beta }^{i}\) with the penalty parameter computed through cross-validation. In the second step, it is suggested to select the penalty parameter again by cross-validation in the adaptive Lasso. The adaptive Lasso has the property that if the initial estimates \(\tilde {\beta }^{i}_{j}=0\), then the final estimates resulting from the adaptive Lasso are also \(\hat {\beta }^{i}_{j}=0\). If the initial estimates \(\tilde {\beta }^{i}_{j}\) are large, then the adaptive Lasso applies a small penalty for these estimates and vice versa. This way, the adaptive Lasso allows to reduce the number of false positives from the first step and yields a sparse solution.
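The two-stage procedure can be sketched for a single regression problem. The code below is an illustrative implementation under several assumptions: the Lasso objective is taken as \((2n)^{-1}\|y-X b\|_{2}^{2}+\lambda\|b\|_{1}\), it is solved by a plain coordinate descent written here for self-containedness, the adaptive weights are \(1/|\tilde{\beta}_{j}|\), and both penalty values are fixed by hand rather than cross-validated. All function names and data are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            partial_residual = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ partial_residual / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft-thresholding
    return b

def adaptive_lasso(X, y, lam_init, lam_adapt):
    """Two-stage estimate: initial Lasso, then Lasso with penalty weights 1/|b_init|."""
    b_init = lasso_cd(X, y, lam_init)
    active = np.flatnonzero(b_init)            # coefficients that are zero stay zero
    b = np.zeros(X.shape[1])
    if active.size:
        scale = np.abs(b_init[active])
        # rescaling column j by |b_init_j| is equivalent to weighting its penalty by 1/|b_init_j|
        b[active] = scale * lasso_cd(X[:, active] * scale, y, lam_adapt)
    return b_init, b

rng = np.random.default_rng(4)
n, p = 30, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[0, 3]] = [2.0, -1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

b_init, b_adapt = adaptive_lasso(X, y, lam_init=0.1, lam_adapt=0.1)
```

By construction, every coefficient that is zero in `b_init` remains zero in `b_adapt`, so the second stage can only remove edges found in the first stage.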

## 5 Comparison of correlation- and partial correlation-based methods

### 5.1 Generating synthetic data from different graph topologies

All graphs have *p* nodes and are generated from adjacency matrices of size *p*×*p*.

1. *Chain graph*. The graph corresponds to a tridiagonal adjacency matrix where each row and column consist of one or two non-zero entries, which corresponds to a graph with a maximum degree of 2. The graph consists of *p*−1 edges.
2. *Cluster graph*. The rows/columns of the adjacency matrix are evenly partitioned into *l* disjoint submatrices. Here, we denote them as \(U_{i}\), *i*=1,…,*l*. Since they are disjoint, we can write \(U_{1}\cup U_{2}\cup \ldots \cup U_{l}=\{1,\ldots,p\}\), and the corresponding graph contains *p*(*p*/*l*−1)*P*/2 edges, where *P* is the probability of an edge between any two nodes in a subgraph. If the probability *P*=1, then the disjoint subgraphs are fully connected. Decreasing *P* allows one to generate sparse subgraphs.
3. *Scale-free graph* (Barabasi-Albert model) [26, 27]. The degree of the graph follows a power law distribution (30). The graph generation is based on preferential attachment and starts with \(m_{0}\) nodes. The new nodes with \(m\leq m_{0}\) edges are added to the \(m_{0}\) existing nodes in the graph. A new node is added to the existing node *i* depending on the degree \(k_{i}\) with the probability \(P(k_{i}) = k_{i}/\sum _{j}^{}k_{j}\). The graph contains *p*−1 edges.
4. *Hub graph*. The rows/columns of the adjacency matrix are evenly partitioned into *l* disjoint groups as in the cluster graph, \(U_{1}\cup U_{2}\cup \ldots \cup U_{l}=\{1,\ldots,p\}\). In each disjoint subgraph, a hub node has more connections to other nodes, whereas the other nodes have only one connection. Since the partitioning is even, every subgraph contains the same number of nodes and edges.

All graphs are generated using R package *huge* [32].
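The chain, cluster, and hub topologies above can be sketched in a few lines of code, which also makes the stated edge counts easy to check. This is a minimal pure-Python illustration and not the *huge* implementation: the function names are hypothetical, *p* is assumed to be divisible by *l*, and each hub subgraph is assumed to be a star whose first node is the hub. The scale-free generator is omitted.

```python
import random


def chain_adjacency(p):
    """Tridiagonal 0/1 adjacency matrix of a chain with p nodes."""
    A = [[0] * p for _ in range(p)]
    for i in range(p - 1):
        A[i][i + 1] = A[i + 1][i] = 1
    return A


def cluster_adjacency(p, l, P, seed=0):
    """l equally sized disjoint blocks; every within-block edge is
    present independently with probability P (assumes l divides p)."""
    rng = random.Random(seed)
    size = p // l
    A = [[0] * p for _ in range(p)]
    for b in range(l):
        nodes = range(b * size, (b + 1) * size)
        for i in nodes:
            for j in nodes:
                if i < j and rng.random() < P:
                    A[i][j] = A[j][i] = 1
    return A


def hub_adjacency(p, l):
    """l equally sized disjoint stars (assumption: the first node of
    each block is the hub and is linked to all other block members)."""
    size = p // l
    A = [[0] * p for _ in range(p)]
    for b in range(l):
        hub = b * size
        for j in range(hub + 1, hub + size):
            A[hub][j] = A[j][hub] = 1
    return A


def edge_count(A):
    """Number of undirected edges in a symmetric 0/1 adjacency matrix."""
    return sum(sum(row) for row in A) // 2


def max_degree(A):
    """Largest node degree in the graph."""
    return max(sum(row) for row in A)
```

For *p*=50, `edge_count(chain_adjacency(50))` gives the *p*−1 edges of the chain, and `cluster_adjacency(50, 5, 1.0)` reproduces the fully connected case *P*=1 of the formula *p*(*p*/*l*−1)*P*/2.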

### 5.2 Comparison of methods based on optimal predictions

*p*=50 and generate the dataset with the sample size *n*=30. To account for uncertainty in the data generation, we resample the data 100 times and perform the graph reconstruction with 100 datasets each of size *p*=50. This allows us to assess the performance of the methods in the presence of noise. For better illustration, we plot the predicted edges on correctly predicted vs. total predicted axes (Fig. 6 (left)). In addition to the methods, we make predictions by random guessing, which serves as a quality control in our study. To assess the quality of the predictions produced by the different methods, we compute Euclidean distances from individual edge predictions to true edges as

where *T*_{R} denotes true edges in the true graph, *C*_{pred} and *T*_{pred} represent correctly predicted and total predicted edges, respectively. We then compute the cumulative distribution of *d*_{E} (Fig. 6 (middle)).
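The distance computation can be illustrated with a short sketch. It rests on an assumption that should be checked against the original equation: on the correctly predicted vs. total predicted axes, a perfect prediction is taken to be the point where both coordinates equal *T*_{R}, and *d*_{E} is the Euclidean distance to that point. The function names are hypothetical.

```python
import math


def distance_to_truth(correct_pred, total_pred, true_edges):
    """Euclidean distance from one prediction (T_pred, C_pred) to the
    assumed ideal point (T_R, T_R), i.e. all true edges found and no
    additional edges predicted."""
    return math.hypot(true_edges - correct_pred, true_edges - total_pred)


def empirical_cdf(values):
    """Sorted values paired with their cumulative relative frequency."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (k + 1) / n) for k, x in enumerate(xs)]


# Hypothetical predictions for a chain graph with 49 true edges,
# given as (correctly predicted, total predicted) pairs.
predictions = [(49, 49), (46, 53), (40, 49)]
distances = [distance_to_truth(c, t, 49) for c, t in predictions]
cdf = empirical_cdf(distances)
```

Under this assumption, a method whose cumulative distribution rises at small distances is one whose predictions cluster near the ideal point.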

*E*=49 edges which is regarded as simplest (Fig. 6 (first top panel)). The other methods predict about 35 to 40 edges correctly, whereas the nodewise regression Lasso produces almost perfect predictions. On the scale-free graph, the nodewise regression Lasso performs best among the four methods. Its prediction accuracy is slightly more than half of the true edges, compared with less than half for the three remaining methods. These three methods predict a similar number of edges, of which 10 to 20 are correct. From the ROC curves, one can see that initially all three methods perform similarly, but later the graphical Lasso starts outperforming the thresholded sample covariance and the covariance Lasso. Since the scale-free graph contains more highly connected nodes (maximum degree *k*_{max}=13) than the other graphs, the prediction accuracy of all methods is lower than for the chain and cluster graphs and comes close to that of random guessing. For the cluster graph, we set the probability of an edge between any two nodes to *P*=0.3, so that the resulting graph contains as few hub nodes as possible (*k*_{max}=4). The nodewise regression Lasso predicts on average 40 true edges out of 70, whereas the other methods predict 30. In the case of the hub graph, where we have 10 disjoint subgraphs with 10 hub nodes, the predictions of the nodewise regression Lasso are again the best, with about 40 true edges out of 50. In contrast, the remaining three methods predict only half of all true edges. We observe that the thresholded covariance, the covariance Lasso, and the graphical Lasso predict a similar number of true edges in all four graphs, whereas the nodewise regression Lasso performs best in all four graphs. Our comparison metrics are based on the control of false positive edges, and a similar observation was published earlier by Peng et al. [33], who showed that the nodewise regression Lasso performs better than the graphical Lasso when controlling for the false discovery rate.

## 6 Comparison of methods when underlying graph is not known

In this section, we discuss how the methods perform when the underlying graph is not given. This is the typical case in applications, where the challenge is to infer the graph from the data alone. We therefore discuss available methods for selecting the optimal threshold for the sample covariance matrix and the optimal regularization for the covariance Lasso and adaptive Lasso methods. Because a cross-validated choice of the penalty parameter in the nodewise regression and graphical Lasso methods leads to overestimation, we instead select the penalty for the adaptive Lasso by cross-validation, which gives sparser solutions than the former methods. We already introduced these methods in the previous sections and now discuss how they perform in practice. For comparison, we choose the same settings: *p*=50 and *n*=30.

### 6.1 Scale-free criteria-based thresholding of sample covariance matrix

*R*^{2} values and mean degree values \(\bar {k}\) for various thresholds uniformly selected from [0,1]. As a reference, we also compute the *R*^{2} value (green line) and the mean degree value \(\bar {k}\) (blue line) of the true graph. As illustrated in Fig. 7 a, high *R*^{2} values, comparable to that of the true graph (green line), are achieved for thresholds higher than 0.5. The corresponding mean degree value for thresholds higher than 0.5 is also close to that of the true graph (blue line). To assess how well the threshold is selected, we further perform hard-thresholding on the true covariance matrix and compute *R*^{2} and mean degree values (Fig. 7 b). Since the graph for the true covariance matrix is fully connected, it returns low *R*^{2} and high mean degree values without thresholding. High *R*^{2} values are achieved for thresholds higher than 0.5, as observed in the scale-free selection case (Fig. 7 a). In particular, mean degree values close to the true mean degree are attained at approximately the same threshold. In practical applications, when inferring a gene co-expression graph from microarray data, it is usually suggested to select a threshold with a high *R*^{2} value and a low mean degree value. In particular, for a high-dimensional case with a thousand genes, these two metrics show saturation for high *R*^{2} and low mean degree values. Although there is no saturation effect in our case, it is possible to select the threshold 0.6, for which the *R*^{2} value is high and the mean degree value is low. Furthermore, we perform simulations with this threshold and compute the number of true edges in the thresholded graph (Fig. 7 c). As the plot indicates, the selected threshold is nearly optimal, giving predictions close to the optimal ones. However, even the predictions at the best threshold are almost as bad as the results of random guessing. It is noteworthy that, in our simulations, this method worked well when the sample size was larger than the number of variables (*p*<*n*). Since we only consider the *p*>*n* case in our study, these results are not shown.

Theoretically, high *R*^{2} values can be achieved only for scale-free graphs, so the criterion is not applicable to other graph types. We also found that it is not possible to attain high *R*^{2} values with the other graph types used in our study (results not shown).
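The two quantities used in this section can be sketched as follows. This is a simplified illustration in the spirit of the scale-free criterion of [25], not the exact procedure used for Fig. 7: the degree frequencies are fitted without binning, the sign of the fitted slope is not checked, and all function names are hypothetical.

```python
import math
from collections import Counter


def threshold_graph(S, tau):
    """0/1 adjacency matrix obtained by hard-thresholding |S_ij| > tau
    on the off-diagonal entries of a (sample) correlation matrix S."""
    p = len(S)
    return [[1 if i != j and abs(S[i][j]) > tau else 0 for j in range(p)]
            for i in range(p)]


def mean_degree(A):
    """Average node degree of the thresholded graph."""
    return sum(sum(row) for row in A) / len(A)


def scale_free_r2(A):
    """R^2 of a straight-line fit of log10 p(k) against log10 k, where
    p(k) is the relative frequency of degree k among nodes with k > 0.
    Returns NaN when the fit is undefined."""
    counts = Counter(k for k in (sum(row) for row in A) if k > 0)
    if len(counts) < 2:
        return float("nan")
    n = sum(counts.values())
    x = [math.log10(k) for k in counts]
    y = [math.log10(c / n) for c in counts.values()]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if syy == 0:
        return float("nan")
    return sxy ** 2 / (sxx * syy)


def scan_thresholds(S, taus):
    """(tau, R^2, mean degree) for each candidate threshold."""
    rows = []
    for tau in taus:
        A = threshold_graph(S, tau)
        rows.append((tau, scale_free_r2(A), mean_degree(A)))
    return rows
```

Calling `scan_thresholds(S, [0.1 * t for t in range(10)])` on a sample correlation matrix `S` would yield the kind of threshold profile discussed above, from which a threshold with a high *R*^{2} value and a low mean degree can be read off.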

### 6.2 Cross-validation with covariance Lasso

*λ*_{cov} from the data, we compute it by a cross-validation procedure. We perform fivefold cross-validation and select the penalty parameter that maximizes the log-likelihood function in (31). Figure 8 depicts the computed likelihood values for penalty parameters selected from the range *λ*_{cov}∈[0,7]. The results show that the likelihood maxima for all graphs lie in a similar range of the penalty parameter. For the chain and cluster graphs, the maxima are attained between *λ*_{cov}=3 and *λ*_{cov}=5, whereas for the scale-free and hub graphs, they are attained between *λ*_{cov}=4 and *λ*_{cov}=6. We therefore chose the penalty parameters for further simulations from these ranges, where the maximum of the log-likelihood is attained, and performed the covariance graph estimation with them. Unfortunately, we observe that in all cases, these penalty values lead to overestimation of the graph. In particular, many false positive edges are selected in the estimated graph.

### 6.3 Cross-validation with adaptive Lasso

*p*>*n*. The other graphs used in the study contain fewer hub nodes, and the method performs well on these graphs. For example, the maximum degree is *k*_{max}=2 for the chain graph, *k*_{max}=4 for the cluster graph, *k*_{max}=9 for the hub graph, and *k*_{max}=13 for the scale-free graph. We therefore observe that the penalty selection under cross-validation with the adaptive Lasso is highly dependent on the number of hub nodes in the graph. We also note that the adaptive Lasso method does not use any prior information about the graph topology and applies a uniform penalty to all edges in the graph, which is a major drawback of the method when it is applied to graphs that contain more hub nodes. This observation was also reported in other studies [34–36].

## 7 Effect of correlation strength on the performance of methods

In this section, we discuss the role of correlation strength in the performance of the methods. It has been shown that the magnitude of the correlations should be bounded from below in order for a method to give consistent predictions [13]. It is known that if the data variability is low, then a large sample size is required to increase the estimation accuracy. If the sample size is limited, which is often the case in biomedical applications, then it is possible to increase the prediction accuracy by increasing the variability in the data so that the correlation information between variables is high. Here, we examine how the prediction accuracy of the methods is affected by changes in data variability. For this purpose, we generate several datasets from correlation matrices with different correlation magnitudes and then perform the graph reconstruction with the four methods on these datasets. To generate datasets with different degrees of correlation, we use the method introduced in [32].

*p*×*p* adjacency matrix which consists of binary values and represents a certain graph. To induce different correlation strengths in the data, we first multiply \(\boldsymbol {A}\) by some scalar *w*>0 and convert the resulting matrix into the positive definite matrix \(\,\boldsymbol {\hat {\!A}}\) using *γ*=|min(*λ*_{ i })|+*ε*, *i*=1,…,*p*, and *ε*>0. Here, *λ*_{ i } are the eigenvalues of the matrix *w*\(\boldsymbol {A}\). Then, we compute the correlation matrix \(\boldsymbol {C}\) by normalizing the covariance matrix \(\,\boldsymbol {\hat {\!A}}^{-1}\) with Λ, the matrix of its diagonal elements. As a measure of the correlation magnitude, we define \(\sigma =\sqrt {\text {var}(C_{ij})}, \ i, j = 1,\ldots,p\). Here, different values of *w* generate correlation matrices with different magnitudes. The correlation matrix is then used to generate datasets using the procedure described in Fig. 4.
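The construction of the correlation matrix can be sketched with NumPy. The sketch makes two assumptions that are not spelled out in the surviving text: the positive definite matrix is obtained by adding *γ* to the diagonal of *w***A**, and the correlation matrix is the usual normalization of the covariance matrix by the square roots of its diagonal elements. The function names and the default value of `eps` are hypothetical.

```python
import numpy as np


def correlation_from_adjacency(A, w, eps=0.1):
    """Correlation matrix induced by a binary adjacency matrix A.

    Assumed steps: A_hat = w*A + gamma*I with gamma = |min eig(w*A)| + eps,
    covariance = inv(A_hat), then rescaling to unit diagonal."""
    A = np.asarray(A, dtype=float)
    wA = w * A
    gamma = abs(np.linalg.eigvalsh(wA).min()) + eps
    A_hat = wA + gamma * np.eye(A.shape[0])   # positive definite by construction
    cov = np.linalg.inv(A_hat)                # covariance matrix A_hat^{-1}
    d = np.sqrt(np.diag(cov))                 # square roots of the diagonal
    return cov / np.outer(d, d)               # entries cov_ij / (d_i * d_j)


def correlation_magnitude(C):
    """sigma = sqrt(var(C_ij)) taken over all entries of C."""
    return float(np.sqrt(np.var(C)))


# Chain graph with five nodes: a larger w gives stronger correlations.
A_chain = np.diag(np.ones(4), 1)
A_chain = A_chain + A_chain.T
for w in (0.1, 1.0, 10.0):
    C = correlation_from_adjacency(A_chain, w)
    print(w, round(correlation_magnitude(C), 3))
```

Under these assumptions, only the ratio of `eps` to `w` matters for the resulting correlation matrix, so increasing `w` at fixed `eps` moves the concentration matrix closer to singularity and strengthens the correlations.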

*σ*≈0.15, colored in blue), the performance of the methods is relatively poor. In this regime, all methods predict about 1/4 of the edges correctly. Increasing the magnitude of correlation improves the performance of all methods (II, III, and IV). For instance, at *σ*≈0.19, the sensitivity of the thresholded sample covariance matrix predictions increases from 0.23 to 0.67. In this regime, the sensitivity of the covariance Lasso increases from 0.24 to 0.72 (from 12 to 30 edges), while the sensitivity of the nodewise regression Lasso and the graphical Lasso increases from 0.24 to 0.7 (from 13 to 35 edges). The accuracy of the covariance Lasso predictions changes little from (II) to (IV), indicating a saturation effect of the method. The saturation effect is also observed for the thresholded sample covariance matrix from (III) to (IV). In contrast, the sensitivity of the nodewise regression Lasso and the graphical Lasso predictions increases with increasing correlation strength. In regime (III), the sensitivity of the nodewise regression Lasso is about 0.83, whereas in (IV), it is almost 0.93. The sensitivity of the graphical Lasso increases from 0.75 (III) to 0.82 (IV).

Sensitivity of predictions computed by four methods calculated as the average ratio of correctly predicted to total predicted edges

Correlation strength | 0.15 (I) | 0.19 (II) | 0.22 (III) | 0.36 (IV) |
---|---|---|---|---|
Thresholded sample covariance | 0.23 | 0.67 | 0.73 | 0.73 |
Covariance Lasso | 0.24 | 0.72 | 0.8 | 0.77 |
Nodewise regression Lasso | 0.24 | 0.7 | 0.83 | 0.93 |
Graphical Lasso | 0.25 | 0.7 | 0.75 | 0.82 |

## 8 Conclusions

High-dimensional graph reconstruction methods have attracted much scientific interest in recent years and continue to be investigated. In this work, we analyze the relation between concentration and covariance graphs and conduct a detailed comparison of various graph reconstruction methods designed to infer concentration as well as covariance graphs. Our analytical study shows that it is possible to establish a link between these two graphs using the Neumann series. In particular, we show the entry-wise relation between the entries of the covariance matrix and the transitive closure matrix associated with the concentration graph. We analytically demonstrate this relation for a star graph. Moreover, we analytically demonstrate the graph property that the covariance graph associated with the correlation matrix can be expressed as the minimum transitive closure of the concentration graph. We also give a small-scale demonstration for a three-node graph. Eventually, this property can be exploited to infer the edge weights of the covariance graph directly from the edge weights of the concentration graph. So far, this has been shown for a star graph, but it can be extended to other graph types.

Furthermore, we performed analytical and numerical studies of the recently published network deconvolution and network silencing methods [10, 11]. In particular, we derived the analytical solution to the network deconvolution problem by exploiting properties of the Kac-Murdock-Szegő matrix. We also give more insight into the role of the scaling parameter, which was studied only numerically in the original study. Moreover, we conducted a detailed comparison of methods designed to reconstruct covariance and concentration graphs on different graph topologies. To resemble high-throughput experiments, we designed our simulation experiments with more variables than samples (*p*>*n*). We showed that the nodewise regression Lasso allows the selection of a consistent penalization that controls the number of false positives, compared with the thresholded sample covariance, the covariance Lasso, and the graphical Lasso. The adaptive version of the nodewise regression Lasso also controls the rate of false positives better than correlation-based methods when the penalty parameter is chosen via cross-validation.

## Declarations

### Acknowledgements

We would like to thank Sara Al-Sayed for useful comments and discussions. This work has been supported by the e:Bio project HostPathX funded by the Federal Ministry of Education and Research (BMBF). HK also acknowledges support from the LOEWE research priority program CompuGene and from the H2020 European project PrECISE.

### Authors’ contributions

NS and HK conceived and designed the experiments. NS performed the experiments. NS and HK wrote the paper. Both authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

1. D Marbach, JC Costello, R Küffner, NM Vega, R Prill, et al, Wisdom of crowds for robust gene network inference. Nat. Methods. **9**(8), 796–804 (2012).
2. SM Hill, LM Heiser, T Cokelaer, M Unger, NK Nesser, et al, Inferring causal molecular networks: empirical assessment through a community-based effort. Nat. Methods. **13**(4), 310–318 (2016).
3. W-P Lee, W-S Tzou, Computational methods for discovering gene networks from expression data. Brief. Bioinformatics. **10**(4), 408–423 (2009).
4. F Markowetz, R Spang, Inferring cellular networks – a review. BMC Bioinformatics. **8**(6), 1–17 (2007).
5. P Bühlmann, S van de Geer, *Statistics for high-dimensional data: methods, theory and applications*, 1st edn. (Springer, Heidelberg, 2011).
6. P Langfelder, S Horvath, WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. **9**(1), 559 (2008).
7. J Dong, S Horvath, Understanding network concepts in modules. BMC Syst. Biol. **1**(1), 1–20 (2007).
8. S Horvath, J Dong, Geometric interpretation of gene coexpression network analysis. PLoS Comput. Biol. **4**(8), 1000117 (2008).
9. J Bien, RJ Tibshirani, Sparse estimation of a covariance matrix. Biometrika. **98**(4), 807–820 (2011).
10. S Feizi, D Marbach, M Médard, M Kellis, Network deconvolution as a general method to distinguish direct dependencies in networks. Nat. Biotechnol. **31**(8), 726–733 (2013).
11. B Barzel, A-L Barabási, Network link prediction by global silencing of indirect correlations. Nat. Biotechnol. **31**(8), 720–725 (2013).
12. R Mazumder, T Hastie, Exact covariance thresholding into connected components for large-scale graphical lasso. J. Mach. Learn. Res. **13**(1), 781–794 (2012).
13. N Meinshausen, P Bühlmann, High-dimensional graphs and variable selection with the Lasso. Ann. Statist. **34**(3), 1436–1462 (2006).
14. J Friedman, T Hastie, R Tibshirani, Sparse inverse covariance estimation with the graphical lasso. Biostatistics. **9**(3), 432–441 (2008).
15. T Hastie, R Tibshirani, J Friedman, *The elements of statistical learning. Springer Series in Statistics* (Springer, New York, 2001).
16. AJ Butte, P Tamayo, D Slonim, TR Golub, IS Kohane, Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Nat. Acad. Sci. **97**(22), 12182–12186 (2000).
17. SL Lauritzen, *Graphical models* (Oxford University Press, Oxford, 1996).
18. TH Cormen, CE Leiserson, RL Rivest, C Stein, *Introduction to algorithms*, 3rd edn. (The MIT Press, Cambridge, 2009).
19. PJ Bickel, E Levina, Covariance regularization by thresholding. Ann. Statist. **36**(6), 2577–2604 (2008).
20. U Grenander, G Szegő, *Toeplitz forms and their applications* (Chelsea Pub. Co., New York, 1984).
21. M Dow, Explicit inverses of Toeplitz and associated matrices. ANZIAM J. **44**(E), 185–215 (2003).
22. H Zou, The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. **101**(476), 1418–1429 (2006).
23. N El Karoui, Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. **36**(6), 2717–2756 (2008).
24. PJ Bickel, E Levina, Regularized estimation of large covariance matrices. Ann. Statist. **36**(1), 199–227 (2008).
25. B Zhang, S Horvath, A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. **4**(1), 1128 (2005).
26. A-L Barabási, R Albert, Emergence of scaling in random networks. Science. **286**(5439), 509–512 (1999).
27. A-L Barabási, ZN Oltvai, Network biology: understanding the cell’s functional organization. Nat. Rev. Genet. **5**(2), 101–113 (2004).
28. DR Hunter, R Li, Variable selection using MM algorithms. Ann. Statist. **33**(4), 1617–1642 (2005).
29. K Lange, *Optimization. Springer Texts in Statistics* (Springer, Heidelberg, 2004).
30. R Tibshirani, Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Series B. **58**, 267–288 (1994).
31. O Banerjee, L El Ghaoui, A d’Aspremont, Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. **9**, 485–516 (2008).
32. T Zhao, H Liu, K Roeder, J Lafferty, L Wasserman, The huge package for high-dimensional undirected graph estimation in R. J. Mach. Learn. Res. **13**(1), 1059–1062 (2012).
33. J Peng, P Wang, N Zhou, J Zhu, Partial correlation estimation by joint sparse regression models. J. Am. Stat. Assoc. **104**(486), 735–746 (2009).
34. KM Tan, P London, K Mohan, S-I Lee, M Fazel, D Witten, Learning graphical models with hubs. J. Mach. Learn. Res. **15**(1), 3297–3331 (2014).
35. J Peng, P Wang, N Zhou, J Zhu, Partial correlation estimation by joint sparse regression models. J. Am. Stat. Assoc. **104**(486), 735–746 (2009).
36. Q Liu, AT Ihler, in *AISTATS. JMLR Proceedings*, 15, ed. by GJ Gordon, DB Dunson, and M Dudík. Learning scale free networks by reweighted l1 regularization (JMLR.org, Ft. Lauderdale, 2011), pp. 40–48.