Hierarchical Dirichlet process model for gene expression clustering

Wang, Liming; Wang, Xiaodong

doi:10.1186/1687-4153-2013-5

Research
Open access
Published: 12 April 2013


Liming Wang¹ &
Xiaodong Wang²

EURASIP Journal on Bioinformatics and Systems Biology volume 2013, Article number: 5 (2013)


Abstract

Clustering is an important data processing tool for interpreting microarray data and genomic network inference. In this article, we propose a clustering algorithm based on the hierarchical Dirichlet process (HDP). The HDP clustering introduces a hierarchical structure in the statistical model which captures the hierarchical features prevalent in biological data such as gene expression data. We develop a Gibbs sampling algorithm based on the Chinese restaurant metaphor for the HDP clustering. We apply the proposed HDP algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to outperform several popular clustering algorithms by revealing the underlying hierarchical structure of the data. For the yeast cell cycle data, we compare the HDP result to the standard result and show that the HDP algorithm provides more information and reduces unnecessary cluster fragments.

1 Introduction

Microarray technology has made it possible to monitor the expression levels of thousands of genes in parallel under various conditions [1]. Due to the high-volume nature of microarray data, one often needs algorithms to investigate gene functions, regulatory relations, etc. Clustering is considered to be an important tool for analyzing biological data [2–4]. The aim of clustering is to group the data into disjoint subsets such that the data in each subset show certain similarities to each other. In particular, for microarray data, genes in each clustered group exhibit correlated expression patterns under various experiments.

Several clustering methods have been proposed, most of which are distance-based algorithms. That is, a distance is first defined for clustering purposes and then the clusters are formed based on the distances between the data. Typical algorithms in this category include the K-means algorithm [5] and the self-organizing map (SOM) algorithm [6]. These algorithms are based on simple rules, and they often suffer from robustness issues, i.e., they are sensitive to noise, which is extensive in biological data [7]. For example, the SOM algorithm requires the user to provide the number of clusters in advance. Hence, an incorrect estimate of this parameter may lead to a wrong result.

Another important category of clustering methods is the model-based algorithms. These algorithms employ a statistical approach to model the structure of clusters. Specifically, data are assumed to be generated by some mixture distribution. Each component of the mixture corresponds to a cluster. Usually, the parameters of the mixture distribution are estimated by the EM algorithm [8]. The finite-mixture model [9–11] assumes that the number of mixture components is finite and the number can be estimated using the Bayesian information criterion [12] or the Akaike information criterion [13]. However, since the estimation of the number of clusters and the estimation of the mixture parameters are performed separately, the finite-mixture model may be sensitive to the different choices of the number of clusters [14].

The infinite-mixture model has been proposed to cope with the above sensitivity problem of the finite-mixture model. This model does not assume a specific number of components and is primarily based on the Dirichlet processes [15, 16]. The clustering process can equivalently be viewed as a Chinese restaurant process [17], where the data are considered as customers entering a restaurant. Each component corresponds to a table with infinite capacity. A new customer joins a table according to the current assignment of seats.

Hierarchical clustering (HC) is yet another, more advanced approach, especially suited to biological data [18], which groups together data with similar features based on the underlying hierarchical structure. Biological data often exhibit hierarchical structure, e.g., one cluster may overlap heavily with, or be embedded in, another cluster [19]. If such hierarchical structure is ignored, the clustering result may contain many fragmentary clusters that could have been combined. Hence, for biological data, HC has advantages over many traditional clustering algorithms. The performance of HC algorithms depends strongly on the quality of the data and on the specific agglomerative or divisive strategy the algorithm uses for combining clusters.

Traditional clustering algorithms for microarray data usually assign each gene a feature vector formed by its expressions in different experiments. The clustering is carried out on these vectors. It is well known that many genes share different levels of functionality [20]. The resemblances among genes are commonly represented at different levels, e.g., at the cluster level instead of the individual gene level. In other words, the relationships among different genes may vary across experiments. In Figure 1, we illustrate the hierarchical structure of genes for microarray data. Gene groups A and B may show a close relationship to gene group C in some experiments, while gene group D shows correlations to groups A, B, and C in other experiments. Group D thus has a hierarchical relationship to the other gene groups. In this case, we desire an HC algorithm that recognizes gene resemblances not at the single-gene level but at the higher cluster level, to avoid unnecessary fragmentary clusters that impede the proper interpretation of the biological information. Such an HC algorithm may also provide new information by taking the hierarchical similarities into account.

In this article, we propose a model-based clustering algorithm for gene expression data based on the hierarchical Dirichlet process (HDP) [21]. The HDP model incorporates the merits of both the infinite-mixture model and the HC. The hierarchical structure is introduced to allow sharing data among related clusters. On the other hand, the model uses the Dirichlet processes as the non-parametric Bayesian prior, which do not assume a fixed number of clusters a priori.

The remainder of the article is organized as follows. In Section 2, we introduce some necessary mathematical background and formulate the HC problem as a statistical inference problem. In Section 3, we derive a Gibbs sampler-based inference algorithm based on the Chinese restaurant metaphor of the HDP model. In Section 4, we provide experimental results of the proposed HDP algorithm for two applications, regulatory network segmentation and gene expression clustering. Finally, Section 5 concludes the article.

2 System model and problem formulation

As in any model-based clustering method, it is assumed that the gene expression data are random samples from some underlying distributions. All data in one cluster are generated by the same distribution. For most existing clustering algorithms, each gene is associated with a vector containing its expressions in all experiments. The clustering of the genes is based on these vectors. However, such an approach ignores the fact that genes may show different functionalities under various experimental conditions, i.e., different clusters may be formed under different experiments. In order to cope with this phenomenon, we treat each expression separately. More specifically, we allow different expressions of the same individual gene to be generated by different statistical models.

Suppose that for the microarray data, there are N genes in total. For each gene, we conduct M experiments. Let $g_{ji}$ denote the expression of the i th gene in the j th experiment, 1≤i≤N and 1≤j≤M. For each $g_{ji}$, we associate a latent membership variable $z_{ji}$, which indicates the cluster membership of $g_{ji}$. That is, if genes i and i^′ are in the same cluster under the conditions of experiments j and j^′, we have $z_{ji} = z_{j^{'} i^{'}}$ . Note that $z_{ji}$ is supported on a countable set such as $\mathbb{N}$ or $\mathbb{Z}$ . For each $g_{ji}$, we associate a coefficient $θ_{z_{ji}}$ , whose index is determined by its membership variable $z_{ji}$. In order to have a Bayesian approach, we also assume that each coefficient θ_k is drawn independently from a prior distribution G₀

\begin{array}{l} θ_{k} \sim G_{0}, \end{array}

(1)

where k is determined by $z_{ji}$.

The membership variable $z = \{z_{ji}\}_{j, i}$ has a discrete joint distribution

\begin{array}{l} z \sim Π. \end{array}

(2)

Note that in this article, the bold-face letter always refers to a set formed by the elements with specified indices.

We assume that each $g_{ji}$ is drawn independently from a distribution $F (θ_{z_{ji}})$

\begin{array}{l} g_{ji} \sim F (θ_{z_{ji}}), \end{array}

(3)

where $θ_{z_{ji}}$ is a coefficient associated with $g_{ji}$ and F is a distribution family such as the Gaussian distribution family. In summary, we have the following model for the expression data

\begin{array}{rcl} θ_{k} & \sim & G_{0} \\ z & \sim & Π \\ g_{ji} | z_{ji}, θ_{k} & \sim & F (θ_{z_{ji}}) . \end{array}

(4)

The above model is relatively general and subsumes many previous models. For example, in all Bayesian approaches, all variables are assigned proper priors. It is very popular to use the mixture model as the prior, which models the data as generated by a mixture of distributions, e.g., a linear combination of a family of distributions such as Gaussian distributions. Each cluster is generated by one component of the mixture distribution given the membership variable [14]. This approach corresponds to our model if we assume that Π is finitely supported and F is Gaussian.

The aim for clustering is to determine the posterior probability of the latent membership variables given the observed gene expressions

\begin{array}{l} P (z | g), \end{array}

(5)

where $g = \{g_{ji}\}_{j, i}$.

As a clustering algorithm, the final result is given in the form of clusters. Each gene has to be assigned to one and only one cluster. Once we have the inference result in (5), we can apply the maximum a posteriori criterion to obtain an estimate of the membership variable ${\hat{z}}_{\cdot i}$ for the i th gene as

\begin{array}{l} {\hat{z}}_{\cdot i} = \arg\max_{a} \sum_{j} P (z_{ji} = a | g) . \end{array}

(6)

We note that if one is interested in finding other related clusters for a gene, one can simply use the inferred distribution of the membership variable to obtain this information.
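The decision rule in (6) can be sketched in a few lines. This is only an illustration: the array layout `post[j, i, a]` for P(z_ji = a | g), the function name `map_cluster_per_gene`, and the toy numbers are assumptions made here and do not come from the article.

```python
import numpy as np

def map_cluster_per_gene(post):
    """post[j, i, a] holds P(z_ji = a | g); returns one cluster label per gene."""
    # Sum the membership probabilities over the experiments j (axis 0) ...
    scores = post.sum(axis=0)            # shape (N genes, A candidate clusters)
    # ... and pick, for each gene, the cluster index a with the largest sum.
    return scores.argmax(axis=1)

# Toy posterior (hypothetical): M = 2 experiments, N = 2 genes, 3 clusters.
post = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]],
])
z_hat = map_cluster_per_gene(post)
```

The same array also retains the full per-experiment distribution, which is what one would inspect to find other related clusters for a gene.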

2.1 Dirichlet processes and infinite mixture model

Instead of assuming a fixed number of clusters a priori, one can assume an infinite number of clusters to avoid the accuracy problem in estimating the number of clusters mentioned earlier. Correspondingly, in (4), the prior Π is an infinite discrete distribution. Again, in the Bayesian fashion, we introduce priors for all parameters. The Dirichlet process is one such prior. It can be viewed as a random measure [15], i.e., the domain of this process (viewed as a measure) is a collection of probability measures. In this section, we give a brief introduction to the Dirichlet process, which serves as the vital prior in our HDP model.

Recall that the Dirichlet distribution $D (u_{1}, \dots, u_{K})$ of order K on a (K−1)-simplex in $R^{K - 1}$ with parameter u₁,…,u_K is given by the following probability density function

D (x_{1}, \dots, x_{K - 1}; u_{1}, \dots, u_{K}) = \frac{Γ (\sum_{i = 1}^{K} u_{i})}{\prod_{i = 1}^{K} Γ (u_{i})} \prod_{i = 1}^{K} {x_{i}}^{u_{i} - 1}

(7)

where $\sum_{i = 1}^{K} x_{i} = 1, u_{i} > 0, i = 1, \dots, K,$ and Γ(·) is the Gamma function. Since every point in the domain is a discrete probability measure, the Dirichlet distribution is a random measure in the finite discrete probability space.
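As a small numerical check of (7), the density can be evaluated directly from its definition. The sketch below uses only the Python standard library; the function name `dirichlet_pdf` and the test points are illustrative choices, and the function takes all K coordinates of x (with the last one fixed by the sum-to-one constraint).

```python
import math

def dirichlet_pdf(x, u):
    """Evaluate the Dirichlet density (7) at x = (x_1, ..., x_K) with parameters u."""
    if len(x) != len(u) or abs(sum(x) - 1.0) > 1e-9:
        raise ValueError("x must lie on the simplex and match u in length")
    # Normalizing constant Gamma(sum u_i) / prod Gamma(u_i), computed in log space.
    log_norm = math.lgamma(sum(u)) - sum(math.lgamma(ui) for ui in u)
    # Kernel prod x_i^(u_i - 1), also in log space for numerical stability.
    log_kernel = sum((ui - 1.0) * math.log(xi) for xi, ui in zip(x, u))
    return math.exp(log_norm + log_kernel)

# With u = (1, 1, 1) the density is constant on the simplex: Gamma(3) = 2.
flat = dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])
# With u = (2, 2) and x = (0.5, 0.5): Gamma(4) / (Gamma(2) Gamma(2)) * 0.25 = 1.5.
peaked = dirichlet_pdf([0.5, 0.5], [2.0, 2.0])
```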

The Dirichlet processes are the generalization of the Dirichlet distribution into the continuous space. There are various constructive or non-constructive definitions of Dirichlet processes. For simplicity, we use the following non-constructive definition.

Let (X,σ,μ₀) be a probability space. A Dirichlet process D(α₀,μ₀) with parameter α₀>0 is defined as a random measure: for any non-trivial finite partition (χ₁,…,χ_r) of X with χ_i∈σ, we have the random variable

(G (χ_{1}), \dots, G (χ_{r})) \sim D (α_{0} μ_{0} (χ_{1}), \dots, α_{0} μ_{0} (χ_{r})),

(8)

where $G$ is drawn from D(α₀,μ₀).

The Dirichlet processes can be characterized in various ways [15] such as the stick-breaking construction [22] and the Chinese restaurant process [23]. The Chinese restaurant process serves as a visualized characterization of the Dirichlet process.

Let x₁,x₂,… be a sequence of random variables drawn from the Dirichlet process D(α₀,μ₀). Although we do not have an explicit formula for D, we would like to know the conditional probability of $x_{i}$ given $x_{1}, \dots, x_{i - 1}$. In the Chinese restaurant model, the data can be viewed as customers sequentially entering a restaurant with an infinite number of tables. Each table corresponds to a cluster with unlimited capacity. Each customer $x_{i}$ entering the restaurant joins an already occupied table with probability proportional to the number of customers seated at that table. In addition, the new customer may sit at a new table with probability proportional to α₀. Consequently, tables that have already been occupied by many customers tend to gain more and more customers.
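The seating rule can be simulated directly. The following sketch uses the standard form of the Chinese restaurant process, in which an occupied table is chosen with probability proportional to its current occupancy and a new table with probability proportional to α₀; the function name and the parameter values are illustrative assumptions.

```python
import random

def chinese_restaurant_process(n_customers, alpha0, seed=0):
    """Seat customers one by one; return (table index per customer, table sizes)."""
    rng = random.Random(seed)
    counts = []    # counts[t] = number of customers currently at table t
    seating = []   # seating[i] = table chosen by the i-th customer
    for _ in range(n_customers):
        # Occupied tables are weighted by occupancy, a new table by alpha0.
        weights = counts + [alpha0]
        t = rng.choices(range(len(weights)), weights=weights)[0]
        if t == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[t] += 1        # join an occupied table
        seating.append(t)
    return seating, counts

seating, counts = chinese_restaurant_process(n_customers=100, alpha0=1.0)
```

Running it with a larger `alpha0` typically opens more tables, which is how the concentration parameter controls the number of clusters.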

One remarkable property of the Dirichlet process is that although it is generated by a continuous process, it is discrete (countably many) almost surely [15]. In other words, almost every sample distribution drawn from the Dirichlet process is a discrete distribution. As a consequence, the Dirichlet process is suitable to serve as a non-parametric prior of the infinite mixture model.

The Dirichlet mixture model uses the Dirichlet process as a prior. The model in (4) can then be represented as follows:

\begin{array}{c} g_{ji} | z_{ji}, θ_{k} \sim F (θ_{z_{ji}}); \end{array}

(9)

θ_k is generated by the measure μ₀

\begin{array}{c} θ_{k} \sim μ_{0}; \end{array}

(10)

$\{z_{ji}\}$ is generated by a Dirichlet process D(α₀,μ₀)

\begin{array}{c} {z_{ji}} \sim D (α_{0}, μ_{0}) . \end{array}

(11)

Recall that D(α₀,μ₀) is discrete almost everywhere, which corresponds to the indices of the clusters.

2.2 HDP model

Biological data such as expression data often exhibit hierarchical structure. For example, although clusters can be formed based on similarities, some clusters may still share certain similarities among themselves at different levels. Within one cluster, the genes may share similar features. But at the level of clusters, one cluster may share some similar features with other clusters. Many traditional clustering algorithms fail to recognize such hierarchical information and are not able to group these similar clusters into a new cluster, producing many fragments in the final clustering result. As a consequence, it is difficult to interpret the functionalities and meanings of these fragments. Therefore, it is desirable to have an algorithm that is able to cluster among clusters. In other words, the algorithm should be able to cluster based on multiple features at different levels. In order to capture the hierarchical structure of the gene expressions, we now introduce the hierarchical model to allow clustering at different levels. The clustering algorithm based on the hierarchical model not only reduces the number of cluster fragments, but may also reveal more details about the unknown functionalities of certain genes, since clusters can share multiple features.

Recall that in the statistical model (11), the clustering effect is induced by the Dirichlet process D(α₀,μ₀). If we need to take into account different levels of clusters, it is natural to introduce a prior with a clustering effect on the base measure μ₀. Again, the Dirichlet process can serve as such a prior. The intuition is that, given the base measure, the clustering effect is represented through a Dirichlet process at the single-gene level. By the Dirichlet process assumption on the base measure, the base measure also exhibits the clustering effect, which leads to clustering at the cluster level. We simply set the prior on the base measure μ₀ as

\begin{array}{c} μ_{0} \sim D_{1} (α_{1}, μ_{1}), \end{array}

(12)

where D₁(α₁,μ₁) is another Dirichlet process. In this article, we use the same letter for the measure, the distribution it induces, and the corresponding density function as long as it is clear from the context. Moreover, we could extend the hierarchies to as many levels as we wish at the expense of complexity of the inference algorithm. The desired number of hierarchies can be determined by the prior biological knowledge. In this article, we focus on a two-level hierarchy.

As a remark, we would like to point out the connection and the difference between the “hierarchy” in the proposed HDP method and that in traditional HC [4]. Both the HDP and HC algorithms can provide hierarchical clustering results. The hierarchy in the HDP method is manifested by the Chinese restaurant process, which will be introduced later: the data sitting at the same table can be viewed as the first level, and all tables sharing the same dish can be viewed as the second level. The hierarchy in HC, in contrast, is obtained by merging existing clusters based on their distances. Its specific merging strategy is heuristic, and merges are irreversible. A hierarchy formed in this fashion may not reflect the true structure in the data, since various hierarchical structures can be formed by choosing different distance metrics. The HDP algorithm, however, captures the hierarchical structure at the model level. The merging is carried out automatically during the inference. Therefore, it naturally takes the hierarchy into consideration.

In summary, we have the following HDP model for the data:

\begin{array}{rcl} μ_{0} & \sim & D_{1} (α_{1}, μ_{1}) \\ {z_{ji}} | μ_{0}, α_{0} & \sim & D (α_{0}, μ_{0}) \\ α_{0}, α_{1} & \sim & Γ (a, b) \\ θ_{k} & \sim & μ_{1} \\ g_{ji} | z_{ji}, θ_{k} & \sim & F (θ_{z_{ji}}), \end{array}

(13)

where a and b are fixed constants. We assume that μ₁ and F form a conjugate pair, i.e., μ₁ is a conjugate prior for F. In this article, F is assumed to be the Gaussian distribution and μ₁ is the inverse Gamma distribution.

3 Inference algorithm

It is intractable to get the closed-form solution to the inference problem (5). In this section, we develop a Gibbs sampling algorithm for estimating the posterior distribution in (5). At each iteration l, we draw a sample $z_{ji}^{(l)}$ sequentially from the distribution:

P (z_{ji}^{(l)} | z_{11}^{(l)}, z_{12}^{(l)}, \dots, z_{j (i - 1)}^{(l)}, z_{j (i + 1)}^{(l - 1)}, \dots, z_{MN}^{(l - 1)}, g) .

(14)

Under regularity conditions, the distribution of $\{z_{ji}^{(l)}\}_{j, i}$ will converge to the true posterior distribution in (5) [24]. The proposed Gibbs sampling algorithm is similar to the HDP inference algorithm proposed in [21], since both Gibbs algorithms use the Chinese restaurant metaphor, which we will elaborate on later. However, because of the differences in modeling, we still need to provide details of the inference algorithm based on our model.

3.1 Chinese restaurant metaphor

The Chinese restaurant model [23] is a visualized characterization for interpreting the Dirichlet process. Because there is no explicit formula to describe the Dirichlet process, we will employ the Chinese restaurant model for HDP inference instead of directly computing the posterior distribution in (5). We refer to [23, 25] for the proof and other details of the equivalence between the Chinese restaurant metaphor and the Dirichlet processes.

In the Chinese restaurant metaphor for the HDP model (13), we view $\{z_{ji}\}$ as customers entering a restaurant sequentially. The restaurant has an infinite number of rows and columns of tables, which are labeled by $t_{ji}$. Each $z_{ji}$ is associated with one and only one table in the j th row. We use $ϕ (z_{ji})$ to denote the column index of the table in the j th row taken by $z_{ji}$, i.e., $z_{ji}$ will sit at table $t_{jϕ (z_{ji})}$ . If it is clear from the context, we will write $ϕ_{ji}$ as shorthand for $ϕ (z_{ji})$. The index of the random variable θ_k in (13) is characterized by a menu containing various dishes. Each table picks one and only one dish from the menu $\{m_{k}\}_{k = 1, 2, \dots}$, whose dishes are drawn independently from the base measure μ₁. $g_{ji}$ is drawn independently according to the dish it chooses through the distribution F(·) as in (13). We denote by $λ (t_{ji})$ the index of the dish taken by table $t_{ji}$, i.e., table $t_{ji}$ chooses dish $m_{λ (t_{ji})}$ . As before, we may write $λ_{ji}$ as shorthand for $λ (t_{ji})$. In summary, customer $z_{ji}$ will sit at table $t_{j ϕ_{ji}}$ and enjoy dish $m_{λ_{j ϕ_{ji}}}$ . The HDP is reflected in this metaphor in that the customers choose the tables as well as the dishes in a Dirichlet process fashion. The customers sitting at the same table are classified into one cluster. Moreover, the customers sitting at different tables but ordering the same dish will also be clustered into the same group. Hence, the clustering effect is performed at the cluster level, i.e., we allow “clustering among clusters”. In Figure 2, we show an illustration of the Chinese restaurant metaphor. The different patterns of shades represent different clusters. We also introduce two useful counter variables: $c_{ji}$ denotes the number of customers sitting at table $t_{ji}$; $d_{jk}$ counts the number of tables in row j serving dish m_k.
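The bookkeeping in this metaphor can be made concrete with a short sketch. It derives the two counters and the induced membership variable (customers at tables serving the same dish share a label) from given table and dish assignments. The data layout (`phi` as a list of rows, `lam` as one dictionary per row) and the function name are assumptions made for illustration; the sample values reuse the initialization of the numerical example in Section 3.3.

```python
from collections import Counter

def franchise_counters(phi, lam):
    """phi[j][i]: table column taken by customer z_ji in row j.
    lam[j][t]:  dish index served at table t of row j.
    Returns (c, d, z): customers per table, occupied tables per dish, labels."""
    c, d, z = [], [], []
    for row_phi, row_lam in zip(phi, lam):
        c_row = Counter(row_phi)                      # c_jt: customers at table t
        d_row = Counter(row_lam[t] for t in c_row)    # d_jk: occupied tables with dish k
        c.append(dict(c_row))
        d.append(dict(d_row))
        z.append([row_lam[t] for t in row_phi])       # label = dish of the chosen table
    return c, d, z

# Initialization of Section 3.3: phi_11..phi_22 = 1, 2, 3, 4 and dishes 1, 1, 2, 2.
phi = [[1, 2], [3, 4]]
lam = [{1: 1, 2: 1}, {3: 2, 4: 2}]
c, d, z = franchise_counters(phi, lam)
tables_total = sum(sum(row.values()) for row in d)    # number of occupied tables
```

For this configuration, two tables serve dish 1 and four tables are occupied in total, matching the counts used later in the numerical example.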

Using the Chinese restaurant metaphor, instead of inferring $z_{ji}$, we can directly infer $ϕ_{ji}$ and $λ_{ji}$. The membership variable $z_{ji}$ is completely determined by $λ (t_{jϕ (z_{ji})})$ . That is, $z_{ji} = z_{j^{'} i^{'}}$ if and only if $λ (t_{jϕ (z_{ji})}) = λ (t_{jϕ (z_{j^{'} i^{'}})})$ . As we pointed out before, the specific values of the membership variable $z_{ji}$ are not relevant to the clustering as long as $z_{ji}$ is supported on a countable set. Hence, we could simply let

\begin{array}{c} z_{ji} = λ (t_{jϕ (z_{ji})}) . \end{array}

(15)

According to [25], we have the following conditional probabilities for the HDP model

\begin{array}{rl} ϕ_{ji} | ϕ_{j 1}, \dots, ϕ_{j (i - 1)}, α_{0}, μ_{0} \sim & \sum_{m = 1}^{\sum_{k} d_{jk}} \frac{c_{jm}}{i - 1 + α_{0}} δ_{t_{j ϕ_{ji}}} \\ & + \frac{α_{0}}{i - 1 + α_{0}} μ_{0}, \end{array}

(16)

where $\sum_{k} d_{jk}$ is the number of tables taken in the j th row and δ_(·) is the Kronecker delta function. The interpretation of (16) is that customer $z_{ji}$ chooses a table already taken with probability proportional to the number of customers $c_{jm}$ seated at it. In addition, $z_{ji}$ may choose a new table with probability proportional to α₀.

By the hierarchical assumption, the distribution of the dish chosen at an occupied table is another Dirichlet process. We have the following conditional distribution of the dishes

\begin{array}{rl} λ_{j ϕ_{ji}} | λ_{1 ϕ_{11}}, \dots, λ_{j ϕ_{j (i - 1)}}, α_{1}, μ_{1} \sim & \sum_{k = 1}^{K_{ji}} \frac{\sum_{j} d_{jk}}{\sum_{jk} d_{jk} + α_{1}} δ_{m_{k}} \\ & + \frac{α_{1}}{\sum_{jk} d_{jk} + α_{1}} μ_{1}, \end{array}

(17)

where $\sum_{j} d_{jk}$ counts the number of tables serving dish m_k; $\sum_{jk} d_{jk}$ counts the total number of tables serving dishes; and $K_{ji}$ denotes the number of distinct dishes served before $λ_{ji}$ is drawn, counting each dish only once even if it has been served multiple times.

3.2 A Gibbs sampler for HDP inference

Instead of sampling the posterior probability in (5), we will sample ϕ={ϕ₁₁,ϕ₁₂,…} and λ={λ₁₁,λ₁₂,…} from the following posterior distribution

\begin{array}{l} P (ϕ, λ | g) . \end{array}

(18)

We can calculate the related conditional probabilities as follows.

If a is a value that has been taken before, the conditional probability of $ϕ_{ji} = a$ is given by

P (ϕ_{ji} = a | ϕ_{ji}^{c}, λ, θ, α_{1}, α_{0}, μ_{1}, g) \propto c_{ja} f_{λ_{ja}} (g_{ji} | g_{ji}^{c}),

(19)

where $θ = \{θ_{ji}\}_{j, i}$ and $λ = \{λ_{ji}\}_{j, i}$. The superscript c denotes the complement of the variables in its category, i.e., $g_{ji}^{c} = \{g_{j^{'} i^{'}}\}_{(j^{'}, i^{'}) \neq (j, i)}$ and $ϕ_{ji}^{c} = \{ϕ_{j^{'} i^{'}}\}_{(j^{'}, i^{'}) \neq (j, i)}$ . $f_{λ_{ja}} (g_{ji} | g_{ji}^{c})$ denotes the conditional density of $g_{ji}$ given all other data generated according to dish $m_{λ_{ja}}$ , which can be calculated as

f_{λ_{ja}} (g_{ji} | g_{ji}^{c}) = \frac{\int \prod_{λ_{j^{'} ϕ_{j^{'} i^{'}}} = λ_{ja}} F (g_{j^{'} i^{'}} | θ) μ_{1} (θ) dθ}{\int \prod_{j^{'} i^{'} \neq ji, λ_{j^{'} ϕ_{j^{'} i^{'}}} = λ_{ja}} F (g_{j^{'} i^{'}} | θ) μ_{1} (θ) dθ} .

(20)

The numerator of (20) is the joint density of the data which are generated by the same dish. By the assumption that the $g_{j^{'} i^{'}}$ are conditionally independent given the chosen dish, the conditional density of the data takes a product form. The denominator is the joint density excluding the specific $g_{ji}$ term. The integrals in (20) can be calculated either by numerical quadrature or by Monte Carlo integration. For example, in order to calculate the integral $\int_{a}^{b} f (x) p (x) dx$ , where p(x) is a density function supported on [a,b], we can draw samples x₁,x₂,…,x_n from p(x) and approximate the integral by $\int_{a}^{b} f (x) p (x) dx = E_{p (x)} [f (x)] \approx \frac{1}{n} \sum_{i = 1}^{n} f (x_{i})$ . To calculate (20), we view μ₁(·) as p(·) and the product of the $F (g_{j^{'} i^{'}} | \cdot)$ factors as f(·).
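The Monte Carlo recipe above can be sketched for the simplest integral that appears in the sampler, $\int F (g | θ) μ_{1} (θ) dθ$. For illustration only, the sketch assumes the Gaussian setting of the numerical example in Section 3.3, $μ_{1} = N (0, 1)$ and $F (g | θ) = N (θ, 1)$, for which the integral has the closed form N(g; 0, 2); the function names, sample size, and seed are arbitrary choices.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mc_marginal(g, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the integral of F(g | theta) mu_1(theta) d theta."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = rng.gauss(0.0, 1.0)          # draw theta from p = mu_1 = N(0, 1)
        total += normal_pdf(g, theta, 1.0)   # evaluate f = F(g | theta) = N(theta, 1)
    return total / n_samples                 # sample mean approximates the integral

estimate = mc_marginal(0.0)
exact = normal_pdf(0.0, 0.0, 2.0)            # closed form: 1 / sqrt(4 * pi)
```

The estimate agrees with the closed-form value of about 0.2821 up to Monte Carlo error, which shrinks at the usual rate of one over the square root of the number of samples.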

On the other hand, if a is a new value then we have

\begin{align} P (ϕ_{ji} = a | ϕ_{ji}^{c}, λ, θ, α_{1}, α_{0}, g) \propto α_{0} \\ [\sum_{k = 1}^{K_{ja}} \frac{\sum_{j} d_{jk}}{\sum_{jk} d_{jk} + α_{1}} f_{k} (g_{ji} | g_{ji}^{c}) \\ + \frac{α_{1}}{\sum_{jk} d_{jk} + α_{1}} \int F (g_{ji} | θ) μ_{1} (θ) dθ] . \end{align}

(21)

We also have the following conditional probabilities for $λ_{ji}$. If a has been used before, we have

P (λ_{j ϕ_{ji}} = a | ϕ, λ_{j ϕ_{ji}}^{c}, θ, α_{1}, α_{0}, g) \propto (\sum_{j} d_{ja}) f_{a} (g_{ji} | g_{ji}^{c});

(22)

otherwise we have

P (λ_{j ϕ_{ji}} = a | ϕ, λ_{j ϕ_{ji}}^{c}, θ, α_{1}, α_{0}, g) \propto α_{1} \int F (g_{ji} | θ) μ_{1} (θ) dθ.

(23)

The derivations of (19), (21), (22), and (23) are given in Appendix.

Before we present the Gibbs sampling algorithm, we recall the Metropolis–Hastings (M–H) algorithm [26] for drawing samples from a target distribution whose density function f(x) is only known up to a scaling factor, i.e., f(x)∝p(x). To draw samples from f(x), we make use of some fixed conditional distribution q(x₂|x₁) that satisfies q(x₂|x₁)=q(x₁|x₂), ∀x₁,x₂. The M–H algorithm proceeds as follows.

Start with an arbitrary value x₀ with p(x₀)>0.
For l=1,2,…

Given the previous sample $x_{l - 1}$, draw a candidate sample x^⋆ from $q (x^{⋆} | x_{l - 1})$.

Calculate $β = \frac{p (x^{⋆})}{p (x_{l - 1})}$ . If β≥1, accept the candidate and let $x_{l} = x^{⋆}$. Otherwise, accept it with probability β, or reject it and keep the previous sample, $x_{l} = x_{l - 1}$, with probability 1−β.

After a “burn-in” period, say l₀, the samples $\{x_{l}\}_{l > l_{0}}$ follow the distribution f(x).
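The steps above can be sketched as a generic routine for a discrete target. The example target (unnormalized weights on four states) and the symmetric random-walk proposal on a ring are hypothetical choices made only to exercise the algorithm; they are not the target or proposal used in the article.

```python
import random

def metropolis_hastings(p, x0, propose, n_steps, burn_in, seed=0):
    """Sample from a density known only up to scale, p, with a symmetric proposal."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for l in range(1, n_steps + 1):
        candidate = propose(x, rng)
        beta = p(candidate) / p(x)
        # Accept outright if beta >= 1, otherwise accept with probability beta.
        if beta >= 1.0 or rng.random() < beta:
            x = candidate
        if l > burn_in:                      # keep only post-burn-in samples
            samples.append(x)
    return samples

WEIGHTS = [1.0, 2.0, 3.0, 4.0]               # unnormalized target on states 0..3

def target(k):
    return WEIGHTS[k]

def ring_walk(k, rng):
    # Move one step left or right on a ring of 4 states: q(a | b) = q(b | a).
    return (k + rng.choice([-1, 1])) % 4

samples = metropolis_hastings(target, 0, ring_walk, n_steps=60_000, burn_in=1_000)
freq3 = samples.count(3) / len(samples)      # should approach 4 / 10
```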

We now summarize the Gibbs sampling algorithm for the HDP inference as follows.

Initialization: randomly assign the indices $ϕ^{(0)} = \{ϕ_{11}^{(0)}, ϕ_{12}^{(0)}, \dots\}$ and $λ^{(0)} = \{λ_{11}^{(0)}, λ_{12}^{(0)}, \dots\}$ . Note that once we have all the indices, the counters $\{c_{ji}\}$ and $\{d_{jk}\}$ are also determined.
For l=1,2,…,l₀+L,

Draw samples of $\{ϕ_{ji}^{(l)}\}$ from their posteriors

P (ϕ_{ji}^{(l)} = a | ϕ_{ji}^{(l - 1) c}, λ^{(l - 1)}, α_{1}^{(l - 1)}, α_{0}^{(l - 1)}, g)

(24)

given by (19) and (21) using the M–H algorithm. We view the probability in (24) as the target density and choose q(·|·) to be a distribution supported on $\mathbb{N}$ . For example, we can use $q (i | j) = \frac{j}{{(j + 1)}^{i}}$ , $i, j \in \mathbb{N}$ .

Draw samples of $\{λ_{j ϕ_{ji}^{(l)}}^{(l)}\}$ from their posteriors

P (λ_{j ϕ_{ji}^{(l)}}^{(l)} = a | ϕ^{(l)}, λ_{j ϕ_{ji}^{(l)}}^{(l - 1) c}, α_{1}^{(l - 1)}, α_{0}^{(l - 1)}, g)

(25)

given by (22) and (23) using M–H algorithm. We view the probability in (25) as the target density and use q(·|·) as specified in the previous step.

Since P(α₀|ϕ,λ,α₁,g)=P(α₀) and P(α₁|ϕ,λ,α₀,g)=P(α₁), simply draw samples of $α_{0}^{(l)}$ and $α_{1}^{(l)}$ from their prior Gamma distributions.

Use the samples after the “burn-in” period, ${\{ϕ^{(l)}, λ^{(l)}\}}_{l = l_{0} + 1}^{l_{0} + L}$ , to calculate $\hat{P} (ϕ, λ | g)$ , which is given by
$\hat{P} (ϕ_{ji} = a, λ_{j ϕ_{ji}} = b) = \frac{\sum_{l = l_{0} + 1}^{l_{0} + L} 1 \{ϕ_{ji}^{(l)} = a, λ_{j ϕ_{ji}^{(l)}}^{(l)} = b\}}{L},$
(26)
where 1{·} is the indicator function. Determine the membership distribution P(z|g) from the inferred joint distribution $\hat{P} (ϕ, λ | g)$ by $P (z_{ji} = a | g) = \sum_{b} \hat{P} (λ_{jb} = a | g, ϕ_{ji} = b) \hat{P} (ϕ_{ji} = b | g)$ .
Calculate the estimate of the clustering index ${\hat{z}}_{\cdot i}$ for the i th gene by ${\hat{z}}_{\cdot i} = \arg\max_{a} \sum_{j} P (z_{ji} = a | g)$ .
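The last two steps amount to counting. The sketch below shows the empirical frequency in (26) for a single customer (j, i) and the resulting membership distribution, obtained by summing the joint frequencies over tables that serve the same dish. The input format (one `(table, dish)` pair per Gibbs iteration) and the function names are assumptions made for illustration.

```python
from collections import Counter

def empirical_joint(samples, burn_in):
    """samples[l] = (table, dish) of one customer at Gibbs iteration l + 1.
    Returns the relative frequency of each pair after discarding burn-in."""
    kept = samples[burn_in:]
    counts = Counter(kept)
    return {pair: n / len(kept) for pair, n in counts.items()}

def membership_probs(joint):
    """Marginalize the table index: P(z = dish) = sum over tables of the joint."""
    probs = Counter()
    for (_table, dish), pr in joint.items():
        probs[dish] += pr
    return dict(probs)

# Hypothetical chain of six iterations with a burn-in of two iterations.
chain = [(1, 1), (1, 1), (2, 1), (3, 2), (3, 2), (1, 1)]
joint = empirical_joint(chain, burn_in=2)
membership = membership_probs(joint)
```

Summing `membership` over the experiments j and taking the arg max over dishes then gives the per-gene estimate in the final step.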

3.3 A numerical example

In this section, we provide a simple numerical example to illustrate the proposed Gibbs sampler. Let us consider the case N=M=2, i.e., there are 2 genes and 2 experiments. Assume that the expressions are g₁₁=0, g₁₂=1, g₂₁=−1, and g₂₂=2. We assume $μ_{1} (θ) \sim N (0, 1)$ and $F (g_{ji} | θ) \sim N (θ, 1)$ . For initialization, we set $ϕ_{11}^{(0)} = 1, ϕ_{12}^{(0)} = 2, ϕ_{21}^{(0)} = 3, ϕ_{22}^{(0)} = 4$ ; $λ_{1 ϕ_{11}^{(0)}}^{(0)} = 1, λ_{1 ϕ_{12}^{(0)}}^{(0)} = 1, λ_{2 ϕ_{21}^{(0)}}^{(0)} = 2, λ_{2 ϕ_{22}^{(0)}}^{(0)} = 2,$ and $α_{0}^{(0)} = α_{1}^{(0)} = 1$.

We first show how to draw a sample from $P (ϕ_{11}^{(1)} | ϕ_{11}^{(0) c}, λ^{(0)}, α_{1}^{(0)}, α_{0}^{(0)}, g)$ by the M–H algorithm. Given the initial value, assume that q(·|·) returns ϕ₁₁=3 as a candidate sample. By (19), we have $P (ϕ_{11}^{(1)} = 1 | ϕ_{11}^{(0) c}, λ^{(0)}, α_{1}^{(0)}, α_{0}^{(0)}, g) \propto c_{11} f_{λ_{11}} (g_{11} | g_{11}^{c})$ , where c₁₁=1 and λ₁₁=1. We also have

\begin{align} f_{1} (g_{11} | g_{11}^{c}) & = \frac{\int \prod_{λ_{j^{'} ϕ_{j^{'} i^{'}}} = 1} F (g_{j^{'} i^{'}} | θ) μ_{1} (θ) dθ}{\int \prod_{(j^{'}, i^{'}) \neq (1, 1), λ_{j^{'} ϕ_{j^{'} i^{'}}} = 1} F (g_{j^{'} i^{'}} | θ) μ_{1} (θ) dθ} \\ = \frac{\int F (g_{11} | θ) F (g_{12} | θ) μ_{1} (θ) dθ}{\int F (g_{12} | θ) μ_{1} (θ) dθ} \approx 0.22971 . \end{align}

(27)

Note that the above integral can be calculated either numerically or by using the Monte Carlo integration method.

By (21) and using the specific values of the variables, we obtain

\begin{align} P (ϕ_{11}^{(1)} = 3 | ϕ_{11}^{(0) c}, λ^{(0)}, α_{1}^{(0)}, α_{0}^{(0)}, g) \\ \propto α_{0} [\sum_{k = 1}^{K_{11}} \frac{\sum_{j} d_{jk}}{\sum_{jk} d_{jk} + α_{1}} f_{k} (g_{11} | g_{11}^{c}) \\ + \frac{α_{1}}{\sum_{jk} d_{jk} + α_{1}} \int F (g_{11} | θ) μ_{1} (θ) dθ] \end{align}

(28)

with K₁₁=1, $\sum_{j} d_{j 1} = 2$ , $\sum_{jk} d_{jk} = 4$ , α₀=α₁=1. Plugging in these values, we have

\begin{align} P (ϕ_{11}^{(1)} = 3 | ϕ_{11}^{(0) c}, λ^{(0)}, α_{1}^{(0)}, α_{0}^{(0)}, g) \\ \propto \frac{2}{5} f_{1} (g_{11} | g_{11}^{c}) + \frac{1}{5} \int F (g_{11} | θ) μ_{1} (θ) dθ \approx 0.1483 . \end{align}

(29)

Since $β = \frac{0.1483}{0.22971} \approx 0.6456 < 1$ , we should accept this candidate sample ϕ₁₁=3 with a probability of 0.6456. After the burn-in period, say the sample returned by the M–H algorithm is ϕ₁₁=4, then we update $ϕ_{11}^{(1)} = 4$ and move on to draw samples of the remaining variables ϕ₁₂, ϕ₂₁, and ϕ₂₂.

Assume that we obtain the samples of ϕ⁽¹⁾ as $ϕ_{11}^{(1)} = 4, ϕ_{12}^{(1)} = 1, ϕ_{21}^{(1)} = 1, ϕ_{22}^{(1)} = 2$ . We next draw the sample λ⁽¹⁾. Suppose that the initial value is $λ_{1 ϕ_{11}^{(1)}} = 1$ and that q(·|·) returns $λ_{1 ϕ_{11}^{(1)}} = 3$ as a candidate sample. By (22), we obtain $P (λ_{1 ϕ_{11}^{(1)}}^{(1)} = 1 | ϕ^{(1)}, λ_{1 ϕ_{11}^{(1)}}^{(0) c}, α_{1}^{(0)}, α_{0}^{(0)}, g) \propto (\sum_{j} d_{j 1}) f_{1} (g_{11} | g_{11}^{c})$ . Furthermore, we have $\sum_{j} d_{j 1} = 2$ and $f_{1} (g_{11} | g_{11}^{c}) \approx 0.22971$ as calculated before.

By (23), we obtain $P (λ_{1 ϕ_{11}}^{(1)} = 3 | ϕ^{(1)}, λ_{1 ϕ_{11}}^{(0) c}, α_{1}^{(0)}, α_{0}^{(0)}, g) \propto α_{1} \int F (g_{11} | θ) μ_{1} (θ) dθ$ . Moreover, we have α₁=1 and $\int F (g_{11} | θ) μ_{1} (θ) dθ \approx 0.28208$ as calculated before. So we have $β = \frac{0.28208}{2 \times 0.22971} \approx 0.614 < 1$ , and the candidate is accepted with probability 0.614. Assume that, after the burn-in period, the M–H algorithm returns the sample $λ_{1 ϕ_{11}^{(1)}} = 2$ ; we then set $λ_{1 ϕ_{11}^{(1)}}^{(1)} = 2$ and move on to sample the remaining λ variables as well as α₀ and α₁.
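To see what accepting with probability β achieves, the following sketch runs a Metropolis–Hastings chain restricted to the two λ values whose unnormalised weights are given above (the existing dish 1 and the new dish 3). Restricting the chain to two states and always proposing the other state are simplifying assumptions of this sketch, not the proposal q(·|·) of the article; under them, the visit frequencies should approach the normalised weights.

```python
import random


def mh_two_state(weights, start, n_steps, seed=0):
    """Metropolis-Hastings on exactly two states with a 'propose the other state' rule.

    weights: dict mapping each state to its unnormalised target weight.
    Returns the fraction of steps spent in each state.
    """
    rng = random.Random(seed)
    states = list(weights)
    current = start
    visits = {s: 0 for s in states}
    for _ in range(n_steps):
        candidate = states[1] if current == states[0] else states[0]
        beta = weights[candidate] / weights[current]   # symmetric proposal: plain weight ratio
        if rng.random() < min(1.0, beta):
            current = candidate
        visits[current] += 1
    return {s: visits[s] / n_steps for s in states}


# Unnormalised weights from the text: (sum_j d_j1) * f_1 for dish 1, alpha_1 * marginal for dish 3.
weights = {1: 2 * 0.22971, 3: 1.0 * 0.28208}
beta = weights[3] / weights[1]                         # about 0.614, as in the text
freqs = mh_two_state(weights, start=1, n_steps=50_000)
target = {s: w / sum(weights.values()) for s, w in weights.items()}

if __name__ == "__main__":
    print(round(beta, 3), freqs, target)
```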

After the burn-in period of the whole Gibbs sampler, we can calculate the posterior joint distribution P(ϕ,λ|g) from the samples and determine the clusters following the last two steps in the proposed Gibbs sampling algorithm.

4 Experimental results

The HDP clustering algorithm proposed in this article can be employed for gene expression analysis or as a segmentation algorithm for gene regulatory network inference. In this section, we first introduce two performance measures for clustering, the Rand Index (RI) [27] and the Silhouette Index (SI) [28]. We then compare the HDP algorithm to the support vector machine (SVM) algorithm for network segmentation on synthetic data. Finally, we conduct experiments on both synthetic and real datasets, including the AD400 dataset [29], the yeast galactose dataset [30], the yeast sporulation dataset [31], the human fibroblasts serum dataset [32], and the yeast cell cycle data [33]. We compare the HDP algorithm to latent Dirichlet allocation (LDA), MCLUST, SVM, K-means, Bayesian Infinite Mixture Clustering (BIMC), and HC [4, 14, 34–37] based on the performance measures and the functional relationships.

4.1 Performance measures

In order to evaluate the clustering results, we utilize two measures: the RI [27] and the SI [28]. The first index is used when the ground truth is known a priori, and the second measures the performance without any knowledge of the ground truth.

The RI is a measure of agreement between two clustering results. It takes a value between 0 and 1; the higher the score, the stronger the agreement.

Let A denote the dataset with a total of n elements, and let X={X₁,…,X_S} and Y={Y₁,…,Y_T} be two clustering results of A, i.e., $A = ⋃_{i = 1}^{S} X_{i} = ⋃_{j = 1}^{T} Y_{j}$ and $X_{i} ⋂ X_{j} = \emptyset$ , $Y_{i} ⋂ Y_{j} = \emptyset$ for i≠j. For any pair of elements (a,b) in A, we say they are in the same set under a clustering result if a and b are in the same cluster. Otherwise we say they are in different sets. Note that there are $\binom{n}{2}$ pairs of elements in total. We define the following four counts: Z₁ denotes the number of pairs that are in the same set in both X and Y; Z₂ denotes the number of pairs that are in different sets in both X and Y; Z₃ denotes the number of pairs that are in the same set in X and in different sets in Y; and Z₄ denotes the number of pairs that are in different sets in X and in the same set in Y. The RI is then given by

\begin{array}{c} RI = \frac{Z_{1} + Z_{2}}{Z_{1} + Z_{2} + Z_{3} + Z_{4}} . \end{array}

(30)
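The pair counting behind (30) can be written down directly. The following is a minimal sketch, assuming each clustering is given as a list of cluster labels in which position i holds the label of element i; the function name is ours.

```python
from itertools import combinations


def rand_index(labels_x, labels_y):
    """Rand Index (30) between two clusterings given as per-element label lists."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both clusterings must label the same elements")
    z1 = z2 = z3 = z4 = 0
    for a, b in combinations(range(len(labels_x)), 2):
        same_x = labels_x[a] == labels_x[b]
        same_y = labels_y[a] == labels_y[b]
        if same_x and same_y:
            z1 += 1          # same set in X and in Y
        elif not same_x and not same_y:
            z2 += 1          # different sets in X and in Y
        elif same_x:
            z3 += 1          # same set in X, different sets in Y
        else:
            z4 += 1          # different sets in X, same set in Y
    total = z1 + z2 + z3 + z4
    if total == 0:
        raise ValueError("need at least two elements")
    return (z1 + z2) / total
```

Because only co-membership of pairs is compared, the index does not depend on how the clusters are named: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` equals 1, whereas `rand_index([0, 0, 1, 1], [0, 1, 0, 1])` equals 1/3.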

Since the ground truth is unavailable in most real applications, we utilize the SI to evaluate the clustering performance in those cases. The SI is the average Silhouette width over all data points, which reflects the compactness of the clustering. Let x denote the average distance between a point p in a cluster and all other points within that cluster. Let y be the minimum, over all other clusters, of the average distance between p and the points of that cluster. The Silhouette distance for p is defined as

\begin{array}{l} s (p) = \frac{y - x}{\max \{x, y\}} . \end{array}

(31)

The SI is the average Silhouette distance over all data points. The value of the SI lies in [−1,1], and a higher score indicates better performance.
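A direct implementation of (31) is sketched below. The Euclidean distance and the convention s(p)=0 for a point that is alone in its cluster are assumptions of this sketch (the text does not fix either), and the function name is ours.

```python
import math


def silhouette_index(points, labels):
    """Average Silhouette distance (31) over all points.

    points: list of equal-length coordinate tuples; labels: cluster label per point.
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the Silhouette Index needs at least two clusters")

    scores = []
    for p, lab in enumerate(labels):
        own = [q for q in clusters[lab] if q != p]
        if not own:
            scores.append(0.0)   # assumed convention for singleton clusters
            continue
        # x: average distance to the other points of p's own cluster
        x = sum(math.dist(points[p], points[q]) for q in own) / len(own)
        # y: smallest average distance from p to the points of another cluster
        y = min(
            sum(math.dist(points[p], points[q]) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(x, y)
        scores.append((y - x) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

For example, the four one-dimensional points 0, 1, 10, 11 labelled `[0, 0, 1, 1]` give a score close to 0.9, while the interleaved labelling `[0, 1, 0, 1]` gives a negative score.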

4.2 Network segmentation on synthetic data

In regulatory network inference, due to the large size of the network, it is often useful to perform a network segmentation. The segmented sub-networks usually have far fewer nodes than the original network, leading to faster and more accurate analysis of the original network [38]. Clustering algorithms can be employed for this segmentation purpose. However, traditional clustering algorithms often provide segmentation results that are either too fine or too coarse, i.e., the resulting sub-networks contain either too few genes or too many genes. In addition, the hierarchical structure of the network cannot be discovered by those algorithms. Thanks to its hierarchical model assumption, the HDP algorithm can provide better segmentation results. We demonstrate the segmentation application of HDP on a synthetic network and compare it to the SVM algorithm, which is widely used for clustering and segmentation.

The network under consideration is shown in Figure 3. We assume that the distributions for all nodes are Gaussian. The directed links indicate that the parent nodes are the priors of the child nodes. Disconnected nodes are mutually independent. We generate the data in the following way. Nodes 1, 2, and 8 are generated independently by Gaussian distributions of unit variance with means 1, 2, and 3, respectively. Nodes 3, 4, 5, 6, 9, and 10 are generated independently by unit variance Gaussian distributions with means determined by their respective parent nodes. Node 7 is generated by a Gaussian distribution with mean determined by node 4 and variance determined by absolute value of node 5. The network contains two isolated segments with one segment containing nodes 1–7 and the other containing nodes 8–10. The HDP algorithm is applied to this network and segments the network into three clusters. Nodes 2, 4, 6 form one cluster; nodes 1, 3, 5, 7 form another cluster; and nodes 8, 9, 10 form the third one. The SVM algorithm on the other hand produces two clusters, one containing nodes 1–7 and the other containing nodes 8–10. As one can see, the network obviously contains two hierarchies in the left segment, i.e., nodes 1–7 of the network. The SVM fails to recognize the hierarchies and provides a result coarser than that given by the HDP algorithm.

4.3 AD400 data

The AD400 is a synthetic dataset proposed in [29], which is used to evaluate the performance of clustering algorithms. The dataset consists of 400 genes with 10 time points. As the ground truth, the AD400 dataset has 10 clusters, each containing 40 genes.

For the randomized algorithms LDA, BIMC, and HDP, we average the results over 20 runs. We compare the HDP algorithm to other widely used algorithms, namely LDA, SVM, MCLUST, K-means, BIMC, and HC. The results are presented in Table 1. The HDP algorithm performs similarly to the MCLUST algorithm and generally better than the other algorithms.

Table 1 Clustering performance of LDA, SVM, MCLUST, K-means, HC, and HDP on the AD400 data


4.4 Yeast galactose data

We conduct an experiment on the yeast galactose data, which consists of 205 genes. The true number of clusters based on the functional categories is 4 [39]. We calculate the RI between each clustering result and the result in [39], which is regarded as the standard benchmark. The LDA model is a generative probabilistic model for document classification [34], which also uses a Dirichlet distribution as a prior. We adapt the LDA model to the yeast galactose data for comparison with the proposed HDP algorithm. Since the LDA and HDP methods are randomized algorithms, we run each algorithm 20 times and use the average as the final score. In Figure 4, we illustrate the performance of each run of the HDP method. The performances of the algorithms under consideration are listed in Table 2.

Table 2 Clustering performance of LDA, MCLUST, SVM, and HDP on the yeast galactose data


The HDP algorithm performs the best among the algorithms compared. Unlike the MCLUST and LDA algorithms, which produce more than 4 clusters, the average number of clusters given by the HDP algorithm is very close to the “true” value 4. Compared to the SVM method, the HDP algorithm produces a result that is more similar to the “ground truth”, i.e., it has the highest RI value.

4.5 Yeast sporulation data

The yeast sporulation dataset consists of 6,118 genes with 7 time points, which were obtained during the sporulation process [31]. We pre-processed the dataset by applying a logarithmic transform and removing the genes whose expression levels did not change significantly. After pre-processing, 513 genes remain. In Table 3, we compare the HDP clustering result to LDA, MCLUST, K-means, BIMC, and HC. For randomized algorithms such as LDA, BIMC, and HDP, we average the scores over 20 runs of the algorithm.

Table 3 Clustering performance of LDA, MCLUST, K-means, HC, BIMC, and HDP on the yeast sporulation data


From Table 3, we can see that the HDP has the highest SI score. By the definition in (31), this suggests that the clusters provided by HDP are more compact and better separated than those from the other algorithms. The K-means and HC algorithms suggest a higher number of clusters. However, their SI scores indicate that their clusters are not as tight as those of the other algorithms.

4.6 Human fibroblasts serum data

The human fibroblasts serum data consists of 8,613 genes with 12 time points [32]. Again a logarithmic transform has been applied to the data and genes without significant changes have been removed. The remaining dataset has 532 genes.

In Table 4, we show the performance of the HDP algorithm and the other algorithms. The clustering results of the HDP algorithm are the most compact among those algorithms. The LDA algorithm suggests 9.4 clusters with the lowest SI score, which indicates that some of its clusters can be further tightened. HC provides a result consisting of five clusters. However, the SI score of the HC result is not the highest, which suggests that its clusters may not be well formed.

Table 4 Clustering performance of LDA, MCLUST, K-means, HC, BIMC, and HDP on the human fibroblasts serum data


4.7 Yeast cell cycle data

We next apply the proposed HDP clustering algorithm to the yeast (Saccharomyces cerevisiae) cell cycle dataset [2, 40]. The data are obtained by synchronizing and collecting the mRNAs from cells at 10-min intervals over the course of two cell cycles. The dataset has been widely used for testing the performance of clustering algorithms [2, 14, 41]. The expression data have been log-transformed and lie in the interval [−2,2]. We pre-processed the data to remove the genes whose expression did not change significantly over time. We also removed the genes whose mean expression is below a small threshold. After the pre-processing, there are 1,515 genes left. We then apply the HDP algorithm and obtain 10 clusters in total. The plots of the clusters are shown in Figure 5.
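The two filtering steps described above can be sketched as follows. The text does not state how "significant change" is measured or which threshold is used, so the range criterion (maximum minus minimum of a profile), the default values `min_range` and `min_mean`, and the function name are all hypothetical placeholders.

```python
def filter_genes(log_expr, min_range=0.5, min_mean=-1.0):
    """Keep genes whose log-expression profile varies enough and is not too low on average.

    log_expr: dict mapping a gene name to its list of log-transformed expression levels.
    min_range, min_mean: hypothetical thresholds, not taken from the article.
    """
    kept = {}
    for gene, profile in log_expr.items():
        change = max(profile) - min(profile)   # assumed measure of change over time
        mean = sum(profile) / len(profile)
        if change >= min_range and mean >= min_mean:
            kept[gene] = profile
    return kept


if __name__ == "__main__":
    toy = {
        "geneA": [-0.8, 0.1, 1.2, 0.3],      # varies over time: kept
        "geneB": [0.10, 0.12, 0.11, 0.10],   # nearly flat: removed
        "geneC": [-2.0, -1.9, -1.2, -1.8],   # low mean: removed
    }
    print(sorted(filter_genes(toy)))
```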

We resort to the MIPS database [42] to determine the functional categories for each cluster. The inferred functional category of a cluster is the category shared by the majority of its member genes. After applying the cell-cycle selection criterion in [2], we find that there are 126 genes identified by the proposed HDP algorithm but not discovered in [2]. We list in Table 5 the numbers of newly discovered genes in various functional categories. We also observe that some of the newly discovered unclassified genes belong to clusters with classified categories. Given the hierarchical characteristic of the HDP algorithm, this may suggest multiple descriptions of those genes that might have been overlooked before.
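The rule for inferring a cluster's functional category can be sketched as a simple vote. In the sketch below, the dictionaries, gene names, and category names are hypothetical; genes without an annotation cast no vote, and the most frequent category is returned, which coincides with the majority category whenever one exists.

```python
from collections import Counter


def infer_cluster_categories(cluster_of, category_of):
    """Assign to each cluster the most frequent functional category of its annotated genes.

    cluster_of: dict gene -> cluster id; category_of: dict gene -> category (may omit genes).
    """
    votes = {}
    for gene, cluster in cluster_of.items():
        category = category_of.get(gene)
        if category is None:
            continue                      # unclassified gene: no vote
        votes.setdefault(cluster, Counter())[category] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


if __name__ == "__main__":
    cluster_of = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
    category_of = {"g1": "cell cycle", "g2": "cell cycle", "g3": "metabolism", "g4": "transport"}
    print(infer_cluster_categories(cluster_of, category_of))
```

An unclassified gene such as `g5` in this toy input then inherits a candidate description from its cluster, which is how the observation about newly discovered unclassified genes can be read.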

Table 5 Numbers of newly discovered genes in various functional categories by the proposed HDP clustering algorithm


Note that in [14] a Bayesian model with an infinite number of clusters is proposed based on the Dirichlet process. The model in [14] is a special case of the HDP model proposed in this article when there is only one hierarchy. In terms of discovering new gene functionalities, we find that the performances of the two algorithms are similar, as the method in [14] discovered 106 new genes compared to the result in [2]. However, by taking the hierarchical structure into account, the total number of clusters found by the HDP algorithm (10) is significantly smaller than the 43 clusters given in [14]. The SI scores for BIMC and HDP are 0.321 and 0.392, respectively. The HDP clustering consolidates many fragmentary clusters, which may make the clustering results easier to interpret.

In Table 6, we list the new genes discovered by the HDP algorithm which are not found in [2].

Table 6 List of newly discovered genes in various functional categories


5 Conclusions

In this article, we have proposed a new clustering approach based on the HDP. The HDP clustering explicitly models the hierarchical structure in the data that is prevalent in biological data such as gene expressions. We have developed a statistical inference algorithm for the proposed HDP model based on the Chinese restaurant metaphor and the Gibbs sampler. We have applied the proposed HDP clustering algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to reveal more structural information of the data compared to popular algorithms such as SVM and MCLUST, by incorporating the hierarchical knowledge into the model.

Appendix

Derivation of formulas (19) and (21)

\begin{align} P (ϕ_{ji} = a | ϕ_{ji}^{c}, λ, θ, α_{1}, α_{0}, μ_{1}, g) \\ = \frac{P (g_{ji}, ϕ_{ji} = a | ϕ_{ji}^{c}, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c})}{P (g_{ji} | ϕ_{ji}^{c}, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c})} \end{align}

(32)

\begin{align} \propto P (g_{ji}, ϕ_{ji} = a | ϕ_{ji}^{c}, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) \end{align}

(33)

\begin{align} \propto P (g_{ji} | ϕ, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) \\ \times P (ϕ_{ji} = a | ϕ_{ji}^{c}, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) \end{align}

(34)

By (16), if a has appeared before, we have

P (ϕ_{ji} = a | ϕ_{ji}^{c}, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) \propto c_{ja} .

(35)

Otherwise we have

P (ϕ_{ji} = a | ϕ_{ji}^{c}, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) \propto α_{0} .

(36)

If a has appeared before, by the assumption that the data are conditionally independent, we also have

P (g_{ji} | ϕ, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) = f_{λ_{ja}} (g_{ji} | g_{ji}^{c}),

(37)

where $f_{λ_{ja}} (g_{ji} | g_{ji}^{c})$ can be calculated by Bayes’ formula:

f_{λ_{ja}} (g_{ji} | g_{ji}^{c}) = \frac{\int \prod_{λ_{j^{'} ϕ_{j^{'} i^{'}}} = λ_{ja}} F (g_{j^{'} i^{'}} | θ) μ_{1} (θ) dθ}{\int \prod_{(j^{'}, i^{'}) \neq (j, i), λ_{j^{'} ϕ_{j^{'} i^{'}}} = λ_{ja}} F (g_{j^{'} i^{'}} | θ) μ_{1} (θ) dθ} .

(38)

Combining (35) and (37), we have (19).

If a has not appeared before, by (17), we have

\begin{array}{l} P (g_{ji} | ϕ, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) \\ = \sum_{k = 1}^{K_{ja}} \frac{\sum_{j} d_{jk}}{\sum_{jk} d_{jk} + α_{1}} f_{k} (g_{ji} | g_{ji}^{c}) + \frac{α_{1}}{\sum_{jk} d_{jk} + α_{1}} \int F (g_{ji} | θ) μ_{1} (θ) dθ . \end{array}

(39)

Combining (36) and (39), we have (21).
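The two cases derived above can be combined into a single routine that returns the normalised conditional distribution of ϕ_ji over the existing tables and a new table. This is a minimal sketch under our own conventions: the argument names are ours, the predictive values f_k and the marginal integral are assumed to have been computed beforehand, and the sum in (39) is taken over all dishes currently in use.

```python
def seat_probabilities(table_counts, table_dish, dish_tables,
                       predictive, marginal, alpha0, alpha1):
    """Conditional distribution of phi_ji over existing tables and a new table.

    table_counts: dict table a -> c_ja (customers at table a, excluding g_ji)
    table_dish:   dict table a -> dish served at table a (lambda_ja)
    dish_tables:  dict dish k -> sum_j d_jk (tables serving dish k, all restaurants)
    predictive:   dict dish k -> f_k(g_ji | g_ji^c), precomputed
    marginal:     integral of F(g_ji | theta) mu_1(theta) d theta, precomputed
    """
    weights = {}
    for a, c in table_counts.items():
        weights[a] = c * predictive[table_dish[a]]          # (35) times (37)

    denom = sum(dish_tables.values()) + alpha1
    mixture = sum(d / denom * predictive[k] for k, d in dish_tables.items())
    mixture += alpha1 / denom * marginal                    # right-hand side of (39)
    weights["new"] = alpha0 * mixture                       # (36) times (39)

    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


if __name__ == "__main__":
    probs = seat_probabilities(
        table_counts={1: 2, 2: 1}, table_dish={1: 1, 2: 2},
        dish_tables={1: 1, 2: 1}, predictive={1: 0.5, 2: 0.2},
        marginal=0.3, alpha0=1.0, alpha1=1.0,
    )
    print(probs)
```

Sampling ϕ_ji directly from this distribution is one option when the number of tables is small; the M–H step of the numerical example instead compares the weights of two candidate values.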

Derivation of (22) and (23)

\begin{align} P (λ_{j ϕ_{ji}} = a | ϕ, λ_{j ϕ_{ji}}^{c}, θ, α_{1}, α_{0}, μ_{1}, g) \\ = \frac{P (g_{ji}, λ_{j ϕ_{ji}} = a | ϕ, λ_{j ϕ_{ji}}^{c}, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c})}{P (g_{ji} | ϕ, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c})} \end{align}

(40)

\begin{align} \propto P (g_{ji}, λ_{j ϕ_{ji}} = a | ϕ, λ_{j ϕ_{ji}}^{c}, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) \end{align}

(41)

\begin{align} \propto P (g_{ji} | ϕ, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) \\ \times P (λ_{j ϕ_{ji}} = a | ϕ, λ_{j ϕ_{ji}}^{c}, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) \end{align}

(42)

By (17), if a has appeared before, we have

P (λ_{j ϕ_{ji}} = a | ϕ, λ_{j ϕ_{ji}}^{c}, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) \propto \sum_{j} d_{ja} .

(43)

Otherwise we have

P (λ_{j ϕ_{ji}} = a | ϕ, λ_{j ϕ_{ji}}^{c}, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) \propto α_{1} .

(44)

If a has been used before, we have

P (g_{ji} | ϕ, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) = f_{a} (g_{ji} | g_{ji}^{c}) .

(45)

Otherwise, the customer chooses a new table. The data are generated from F based on a sample from μ₁. We have

P (g_{ji} | ϕ, λ, θ, α_{1}, α_{0}, μ_{1}, g_{ji}^{c}) = \int F (g_{ji} | θ) μ_{1} (θ) dθ.

(46)

Combining (43), (44), (45), and (46), we have (22) and (23).

References

1. Schena M, Shalon D, Davis R, Brown P: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270(5235):467-470. 10.1126/science.270.5235.467
2. Cho R, Campbell M, Winzeler E, Steinmetz L, Conway A, Wodicka L, Wolfsberg T, Gabrielian A, Landsman D, Lockhart D: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 1998, 2:65-73. 10.1016/S1097-2765(00)80114-8
3. Hughes J, Estep P, Tavazoie S, Church G: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol 2000, 296(5):1205-1214. 10.1006/jmbi.2000.3519
4. Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci 1998, 95(25):14863-14868. 10.1073/pnas.95.25.14863
5. MacQueen J: Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. California: University of California Press; 1967:281-297.
6. Kohonen T: Self-Organization and Associative Memory. New York: Springer; 1988.
7. Jiang D, Tang C, Zhang A: Cluster analysis for gene expression data: a survey. IEEE Trans. Knowledge Data Eng 2004, 16(11):1370-1386. 10.1109/TKDE.2004.68
8. Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodological) 1977, 39:1-38.
9. McLachlan G, Peel D: Finite Mixture Models. New York: Wiley-Interscience; 2000.
10. Fraley C, Raftery A: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc 2002, 97(458):611-631. 10.1198/016214502760047131
11. Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W: Model-based clustering and data transformations for gene expression data. Bioinformatics 2001, 17(10):977-987. 10.1093/bioinformatics/17.10.977
12. Schwarz G: Estimating the dimension of a model. Ann. Stat 1978, 6(2):461-464. 10.1214/aos/1176344136
13. Akaike H: A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19(6):716-723. 10.1109/TAC.1974.1100705
14. Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 2002, 18(9):1194-1206. 10.1093/bioinformatics/18.9.1194
15. Ferguson T: A Bayesian analysis of some nonparametric problems. Ann. Stat 1973, 1(2):209-230. 10.1214/aos/1176342360
16. Neal R: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat 2000, 9(2):249-265.
17. Pitman J: Some developments of the Blackwell-MacQueen urn scheme. Lecture Notes-Monograph Series 1996, 245-267.
18. Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Online Library; 1990.
19. Jiang D, Pei J, Zhang A: DHC: a density-based hierarchical clustering method for time series gene expression data. In Proceedings of Third IEEE Symposium on Bioinformatics and Bioengineering. Bethesda: IEEE; 2003:393-400.
20. Piatigorsky J: Gene Sharing and Evolution: The Diversity of Protein Functions. Cambridge: Harvard University Press; 2007.
21. Teh Y, Jordan M, Beal M, Blei D: Hierarchical Dirichlet processes. J. Am. Stat. Assoc 2006, 101(476):1566-1581. 10.1198/016214506000000302
22. Sethuraman J: A constructive definition of Dirichlet priors. Stat. Sinica 1991, 4:639-650.
23. Aldous D: Exchangeability and related topics. École d’Été de Probabilités de Saint-Flour XIII 1985, 1-198.
24. Casella G, George E: Explaining the Gibbs sampler. Am. Stat 1992, 46(3):167-174.
25. Blackwell D, MacQueen J: Ferguson distributions via Pólya urn schemes. Ann. Stat 1973, 1(2):353-355. 10.1214/aos/1176342372
26. Brooks S: Markov chain Monte Carlo method and its application. J. R. Stat. Soc. Ser. D (The Statistician) 1998, 47:69-100. 10.1111/1467-9884.00117
27. Hubert L, Arabie P: Comparing partitions. J. Classif 1985, 2:193-218. 10.1007/BF01908075
28. Rousseeuw PJ: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math 1987, 20:53-65.
29. Yeung KY, Ruzzo WL: Principal component analysis for clustering gene expression data. Bioinformatics 2001, 17(9):763-774. 10.1093/bioinformatics/17.9.763
30. Yeung K, Medvedovic M, Bumgarner R: Clustering gene-expression data with repeated measurements. Genome Biol 2003, 4(5):R34. 10.1186/gb-2003-4-5-r34
31. Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I: The transcriptional program of sporulation in budding yeast. Science 1998, 282(5389):699-705.
32. Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent JM, Staudt LM, Hudson J, Boguski MS: The transcriptional program in the response of human fibroblasts to serum. Science 1999, 283(5398):83-87. 10.1126/science.283.5398.83
33. Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998, 9(12):3273.
34. Blei D, Ng A, Jordan M: Latent Dirichlet allocation. J. Mach. Learn. Res 2003, 3:993-1022.
35. Fraley C, Raftery A: MCLUST: software for model-based cluster analysis. J. Classif 1999, 16(2):297-306. 10.1007/s003579900058
36. Furey T, Cristianini N, Duffy N, Bednarski D, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16(10):906-914. 10.1093/bioinformatics/16.10.906
37. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat. Genetics 1999, 22:281-285. 10.1038/10343
38. Chung F, Lu L: Complex Graphs and Networks. CBMS Lecture Series no. 107. Providence: American Mathematical Society; 2006.
39. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J: Gene ontology: tool for the unification of biology. Nat. Genet 2000, 25:25-29. 10.1038/75556
40. Stanford University: Yeast cell cycle datasets. http://genome-www.stanford.edu/cellcycle/data/rawdata
41. Lukashin A, Fuchs R: Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 2001, 17(5):405-414. 10.1093/bioinformatics/17.5.405
42. Mewes H, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2002, 30:31-34. 10.1093/nar/30.1.31


Author information

Authors and Affiliations

Department of Electrical & Computer Engineering, Duke University, Durham, NC, 27708, USA
Liming Wang
Department of Electrical Engineering, Columbia University, New York, NY, 10027, USA
Xiaodong Wang


Corresponding author

Correspondence to Xiaodong Wang.

Additional information

Competing interests

The authors declare that they have no competing interests.


Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Wang, L., Wang, X. Hierarchical Dirichlet process model for gene expression clustering. J Bioinform Sys Biology 2013, 5 (2013). https://doi.org/10.1186/1687-4153-2013-5


Received: 17 October 2012
Accepted: 11 March 2013
Published: 12 April 2013
DOI: https://doi.org/10.1186/1687-4153-2013-5
