Hierarchical Dirichlet process model for gene expression clustering

Clustering is an important data processing tool for interpreting microarray data and genomic network inference. In this article, we propose a clustering algorithm based on the hierarchical Dirichlet processes (HDP). The HDP clustering introduces a hierarchical structure in the statistical model which captures the hierarchical features prevalent in biological data such as the gene express data. We develop a Gibbs sampling algorithm based on the Chinese restaurant metaphor for the HDP clustering. We apply the proposed HDP algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to outperform several popular clustering algorithms by revealing the underlying hierarchical structure of the data. For the yeast cell cycle data, we compare the HDP result to the standard result and show that the HDP algorithm provides more information and reduces the unnecessary clustering fragments.


Introduction
The microarray technology has enabled the possibility to monitor the expression levels of thousands of genes in parallel under various conditions [1]. Due to the highvolume nature of the microarray data, one often needs certain algorithms to investigate the gene functions, regulation relations, etc. Clustering is considered to be an important tool for analyzing the biological data [2][3][4]. The aim of clustering is to group the data into disjoint subsets, where in each subset the data show certain similarities to each other. In particular, for microarray data, genes in each clustered group exhibit correlated expression patterns under various experiments.
Several clustering methods have been proposed, most of which are distance-based algorithms. That is, a distance is first defined for clustering purpose and then the clusters are formed based on the distances of the data. Typical algorithms in this category include the K-means algorithm [5] and the self-organizing map (SOM) algorithm [6]. These algorithms are based on simple rules, and they often suffer from robustness issue, i.e., they are sensitive to noise which is extensive in biological data [7]. For example, the SOM algorithm requires user to provide number *Correspondence: wangx@ee.columbia.edu 2 Department of Electrical Engineering, Columbia University, New York, NY 10027, USA Full list of author information is available at the end of the article of clusters in advance. Hence, incorrect estimation of the parameter may provide wrong result.
Another important category of clustering methods is the model-based algorithms. These algorithms employ a statistical approach to model the structure of clusters. Specifically, data are assumed to be generated by some mixture distribution. Each component of the mixture corresponds to a cluster. Usually, the parameters of the mixture distribution are estimated by the EM algorithm [8].
The finite-mixture model [9][10][11] assumes that the number of mixture components is finite and the number can be estimated using the Bayesian information criterion [12] or the Akaike information criterion [13]. However, since the estimation of the number of clusters and the estimation of the mixture parameters are performed separately, the finite-mixture model may be sensitive to the different choices of the number of clusters [14].
The infinite-mixture model has been proposed to cope with the above sensitivity problem of the finite-mixture model. This model does not assume a specific number of components and is primarily based on the Dirichlet processes [15,16]. The clustering process can equivalently be viewed as a Chinese restaurant process [17], where the data are considered as customers entering a restaurant. Each component corresponds to a table with infinite capacity. A new customer joins a table according to the current assignment of seats. http://bsb.eurasipjournals.com/content/2013/1 /5 Hierarchical clustering (HC) is yet another more advanced approach especially for biological data [18], which groups together the data with similar features based on the underlying hierarchical structure. The biological data often exhibit hierarchical structure, e.g., one cluster may highly be overlapped or could be embedded into another cluster [19]. If such hierarchical structure is ignored, the clustering result may contain many fragmental clusters which could have been combined together. Hence, for biological data, such HC has its advantages to many traditional clustering algorithms. The performances of such HC algorithms depend highly on the quality of the data and the specific agglomerative or divisive ways the algorithms use for combining clusters.
Traditional clustering algorithms for microarray data usually assign each gene with a feature vector formed by the expressions in different experiments. The clustering is carried out for these vectors. It is well known that many genes share different levels of functionalities [20]. The resemblances of different genes are commonly represented at different levels of perspectives, e.g., at the cluster level instead of individual gene level. In other words, The relationships among different genes may vary during different experiments. In Figure 1, we illustrate the gene hierarchical structures for microarray data. Genes group A and B may show close relationship to genes group C in some experiments. While the genes group D shows correlations to groups A, B, and C in other experiments. The group D obviously has a hierarchical relationships to other gene groups. In this case, we desire to have a HC algorithm recognizing the gene resemblances not at the single gene level but at the higher cluster level, to avoid unnecessary fragmental clusters that impede the proper interpretation of the biological information. Such a HC algorithm may also provide new information by taking the hierarchical similarities into account.
In this article, we propose a model-based clustering algorithm for gene expression data based on the hierarchical Dirichlet process (HDP) [21]. The HDP model incorporates the merits of both the infinite-mixture model and the HC. The hierarchical structure is introduced to allow sharing data among related clusters. On the other hand, the model uses the Dirichlet processes as the nonparametric Bayesian prior, which do not assume a fixed number of clusters a priori.
The remainder of the article is organized as follows. In Section 2, we introduce some necessary mathematical background and formulate the HC problem as a statistical inference problem. In Section 3, we derive a Gibbs sampler-based inference algorithm based on the Chinese restaurant metaphor of the HDP model. In Section 4, we provide experimental results of the proposed HDP algorithm for two applications, regulatory network segmentation and gene expression clustering. Finally, Section 5 concludes the article.

System model and problem formulation
As in any model-based clustering method, it is assumed that the gene expression data are random samples from some underlying distributions. All data in one cluster are generated by the same distribution. For most existing clustering algorithms, each gene is associated with a vector containing the expressions in all experiments. The clustering of the genes is based on their vectors. However, such approach ignores the fact that genes may show different functionalities under various experiment conditions, i.e., different clusters may be formed under different experiments. In order to cope with this phenomenon, we treat each expression separately. More specifically, we allow different expressions of the same individual gene to be generated by different statistical models.
Suppose that for the mircoarray data, there are N genes in total. For each gene, we conduct M experiments. Let g ji denote the expression of the ith gene in the jth experiment, 1 ≤ i ≤ N, and 1 ≤ j ≤ M. For each g ji , we associate a latent membership variable z ji , which indicates the cluster membership of g ji . That is, if genes i and i are in the same cluster under the conditions of experiments j and j , we have z ji = z j i . Note that z ji is supported on a countable set such as N or Z. For each g ji , we associate a coefficient θ z ji , whose index is determined by its membership variable z ji . In order to have a Bayesian approach, we also assume that each coefficient θ k is drawn independently from a prior distribution G 0 where k is determined by z ji . http://bsb.eurasipjournals.com/content/2013/1/5 The membership variable z = {z ji } j,i has a discrete joint distribution z ∼ π.
( 2 ) Note that in this article, the bold-face letter always refers to a set formed by the elements with specified indices. We assume that each g ji is drawn independently from a distribution F(θ z ji ) where θ z ji is a coefficient associated with g ji and F is a distribution family such as the Gaussian distribution family. In summary, we have the following model for the expression data The above model is a relatively general one which can induce many previous models. For example, in all Bayesian approaches, all variables are assigned with proper priors. It is very popular to use the mixture model as the prior, which models the data generated by a mixture of distributions, e.g., a linear combination of a family of distributions such as Gaussian distributions. Each cluster is generated by one component in the mixture distribution given the membership variable [14]. The above approach corresponds to our model if we assume that π is finitely supported and F is Gaussian.
The aim for clustering is to determine the posterior probability of the latent membership variables given the observed gene expressions where g = {g ji } j,i . As a clustering algorithm, the final result is given in the forms of clusters. Each gene has to be assigned to one and only one cluster. Once we have the inference result in (5), we can apply the maximum a posterior criterion to obtain an estimate of membership variableẑ ·i for the ith gene aŝ We note that in case one is interested in finding other related clusters for one gene, we can simply use the inferred distribution to membership variable to obtain this information.

Dirichlet processes and infinite mixture model
Instead of assuming a fixed number of clusters a priori, one can assume infinite number of clusters to avoid the estimation accuracy problem on the number of clusters as we mentioned earlier. Correspondingly in (4), the prior π is an infinite discrete distribution. Again as in the Bayesian fashion, we will introduce priors for all parameters. The Dirichlet process is one such prior. It can be viewed as a random measure [15], i.e., the domain of this process (viewed as a measure) is a collection of probability measures. In this section, we will give a brief introduction to the Dirichlet process which serves as the vital prior part in our HDP model. Recall that the Dirichlet distribution D(u 1 , . . . , u K ) of order K on a (K − 1)-simplex in R K−1 with parameter u 1 , . . . , u K is given by the following probability density function is the Gamma function. Since every point in the domain is a discrete probability measure, the Dirichlet distribution is a random measure in the finite discrete probability space.
The Dirichlet processes are the generalization of the Dirichlet distribution into the continuous space. There are various constructive or non-constructive definitions of Dirichlet processes. For simplicity, we use the following non-constructive definition.
The Dirichlet processes can be characterized in various ways [15] such as the stick-breaking construction [22] and the Chinese restaurant process [23]. The Chinese restaurant process serves as a visualized characterization of the Dirichlet process.
Let x 1 , x 2 , . . . be a sequence of random variables drawn from the Dirichlet process D(α 0 , μ 0 ). Although we do not have the explicit formula for D, we would like to know the conditional probability of x i given x 1 , . . . , x i−1 .
In the Chinese restaurant model, the data can be viewed as customers sequentially entering a restaurant with infinite number of tables. Each table corresponds to a cluster with unlimited capacity. Each customer x i entering the restaurant will join in the table already taken with equal probability. In addition, the new customer may sit in a new table with probability proportional to α 0 . Tables that have already been occupied by customers tend to gain more and more customers.
One remarkable property of the Dirichlet process is that although it is generated by a continuous process, it is discrete (countably many) almost surely [15]. In other words, http://bsb.eurasipjournals.com/content/2013/1/5 almost every sample distribution drawn from the Dirichlet process is a discrete distribution. As a consequence, the Dirichlet process is suitable to serve as a non-parametric prior of the infinite mixture model. The Dirichlet mixture model uses the Dirichlet process as a prior. The model in (4) can then be represented as follows: Recall that D(α 0 , μ 0 ) is discrete almost everywhere, which corresponds to the indices of the clusters.

HDP model
Biological data such as the expression data often exhibit hierarchical structures. For example, although clusters can be formed based on similarities, some clusters may still share certain similarities among themselves at different levels of perspectives. Within one cluster, the genes may share similar features. But on the level of clusters, one cluster may share some similar feature with some other clusters. Many traditional clustering algorithms typically fail to recognize such hierarchical information and are not able to group these similar clusters into a new cluster, producing many fragments in the final clustering result. As a consequence, it is difficult to interpret the functionalities and meanings of these fragments. Therefore, it is desirable to have an algorithm that is able to cluster among clusters. In other words, the algorithm should be able to cluster based on multiple features at different levels. In order to capture the hierarchical structure feature of the gene expressions, we now introduce the hierarchical model to allow clustering at different levels. The clustering algorithm based on the hierarchical model not only reduces the number of cluster fragments, but also may reveal more details about the unknown functionalities of certain genes as the clusters sharing multiple features.
Recall that in the statistical model (11), the clustering effect is induced by the Dirichlet process D(α 0 , μ 0 ). If we need to take into account different level of clusters, it is natural to introduce a prior with clustering effect to the base measure μ 0 . Again in this case, the Dirichlet process can serve as such prior. The intuition is that given the base measure, the clustering effect is represented through a Dirichlet process on the single gene level. By the Dirichlet process assumption on the base measure, the base measure also exhibits the clustering effect, which leads to clustering at cluster level. We simply set the prior to the base measure μ 0 as where D 1 (α 1 , μ 1 ) is another Dirichlet process. In this article, we use the same letter for the measure, the distribution it induces, and the corresponding density function as long as it is clear from the context. Moreover, we could extend the hierarchies to as many levels as we wish at the expense of complexity of the inference algorithm. The desired number of hierarchies can be determined by the prior biological knowledge. In this article, we focus on a two-level hierarchy.
As a remark, we would like to point out the connection and difference on the "hierarchy" in the proposed HDP method and traditional HC [4]. Both the HDP and HC algorithms can provide HC results. The hierarchy in the HDP method is manifested by the Chinese restaurant process which will be introduced later, where the data sit in the same table can be viewed as the first level and all tables sharing the same dish can be viewed as the second level. While the hierarchy in the HC is obtained by merging existing clusters based on their distances. However, its specific merging strategy is heuristic and is irreversible for those merged clusters. Hierarchy formed in this fashion often may not reflect the true structure in the data since various hierarchical structures can be formed by choosing different distance metrics. However, the HDP algorithm captures the hierarchical structure at the model level. The merging is carried out automatically during the inference. Therefore, it naturally takes the hierarchy into consideration.
In summary, we have the following HDP model for the data: where a and b are some fixed constants. We assume that F and μ 1 are conjugate priors. In this article, F is assumed to be the Gaussian distribution and μ 1 is the inverse Gamma distribution.

Inference algorithm
It is intractable to get the closed-form solution to the inference problem (5). In this section, we develop a Gibbs sampling algorithm for estimating the posterior distribution in (5). At each iteration l, we draw a sample z (l) ji sequentially from the distribution: ji } j,i will converge to the true posterior distribution in (5) [24]. The proposed Gibbs sampling algorithm is similar to the HDP inference algorithm proposed in [21], since both the Gibbs algorithms use the Chinese restaurant metaphor which we will elaborate later. However, because of the differences in modeling, we still need to provide details for the inference algorithm based on our model.

Chinese restaurant metaphor
The Chinese restaurant model [23] is a visualized characterization for interpreting the Dirichlet process. Because there is no explicit formula to describe the Dirichlet process, we will employ the Chinese restaurant model for HDP inference instead of directly computing the posterior distribution in (5). We refer to [23,25] for the proof and other details of the equivalence between the Chinese restaurant metaphor and the Dirichlet processes.
In the Chinese restaurant metaphor for the HDP model (13), we view {z ji } as customers entering a restaurant sequentially. The restaurant has infinite number of rows and columns of tables which are labeled by t ji . Each z ji will associate to one and only one table in the jth row. We use φ(z ji ) to denote the column index of the table in the jth row taken by z ji , i.e., z ji will sit at table t jφ(z ji ) . If it is clear from the context, we will use φ ji in short for φ(z ji ). The index of the random variable θ k in (13) is characterized by a menu containing various dishes. Each table picks one and only one dish from the menus {m k } k=1,2,... , which are drawn independently from the base measure μ 1 . g ji is drawn independently according to the dish it chooses through the distribution F(· ) as in (13). We denote λ(t ji ) as the index of the dish taken by table t ji , i.e., table t ji chooses dish m λ(t ji ) . As before, we may write λ ji in short of λ(t ji ). In summary, customer z ji will sit at table t jφ ji and enjoy dish m λ jφ ji . The HDP is reflected in this metaphor such that the customers choose the tables as well as the dishes in a Dirichlet process fashion. The customers sitting at the same table are classified into one cluster. Moreover, the customers sitting at different tables but ordering the same dish will also be clustered into the same group. Hence, the clustering effect is performed at the cluster level, i.e., we allow "clustering among clusters". In Figure 2, we show an illustration of the Chinese restaurant metaphor. The different patterns of shades represent different clusters. We also introduce two useful counter variables: c ji denotes the number of customers sitting at table t ji ; d jk counts the number of tables in row j serving dish m k .
Using the Chinese restaurant metaphor, instead of inferring z ji , we can directly infer φ ji and λ ji . The membership variable z ji is completely determined by λ(t jφ(z ji ) ). That is, z ji = z j i if and only if λ(t jφ(z ji ) ) = λ(t jφ(z j i ) ). As we pointed out before, the specific values of the membership variable z ji are not relevant to the clustering as long as z ji is supported on a countable set. Hence, we could simply let According to [25], we have the following conditional probabilities for the HDP model where k d jk calculates the number of tables taken in the rth row and δ (·) is the Kronecker delta function. The interpretation of (16) is that customer z ji chooses a table already taken with equal probability. In addition, z ji may choose a new table with probability proportional to α 0 . By the hierarchical assumption, the distribution of the dish chosen at an occupied table is another Dirichlet process. We have the following conditional distribution of the dishes where j d jk counts the number of tables serving dish m k ; jk d jk counts the number of tables serving dishes; K ji denotes the net number of dishes served till λ ji 's coming by counting only once each dish that has been served multiple times.
If a is a value that has been taken before, the conditional probability of φ ji = a is given by where θ = {θ ji } j,i and λ = {λ ji } j,i . The superscript c denotes the complement of the variables in its category, i.e., g c ji = {g j i } (j ,i ) =(j,i) and φ c ji = {φ j i } (j ,i ) =(j,i) . f λ ja g ji |g c ji denotes the conditional density of g ji given all other data generated according to menu m λ ja , which can be calculated as The numerator of (20) is the joint density of the data which are generated by the same dish. By the assumption that g j i are conditionally independent given the chosen dish, we have the conditional density of the data in the product form. The denominator is the joint density excluding the specific g ji term. The integrals in (20) can either be calculated using the numerical method or using the Monte Carlo integration. For example, in order to calculate the following integral b a f (x)p(x)dx, where p(x) is a density function, we can draw samples x 1 , x 2 , . . . , x n from p(x) and approximate the integral by (20), we view μ 1 (·) as p(·) and F(g j i |·) as f (·).
On the other hand, if a is a new value then we have We also have the following conditional probabilities for λ ji . If a is used before, we have otherwise we have P λ jφ ji = a|φ, λ c jφ ji , θ , α 1 , α 0 , g ∝ α 1 F(g ji |θ)μ 1 (θ)dθ.
Before we present the Gibbs sampling algorithm, we recall the Metropolis-Hastings (M-H) algorithm [26] for drawing samples from a target distribution whose density function f (x) is only known up to a scaling factor, i.e., f (x) ∝ p(x). To draw samples from f (x), we make use of some fixed conditional distribution q(x 2 |x 1 ) that satisfies q(x 2 |x 1 ) = q(x 1 |x 2 ), ∀x 1 , x 2 . The M-H algorithm proceeds as follows.
• Start with an arbitrary value x 0 with p(x 0 ) > 0.
-Given the previous sample x l−1 , draw a candidate sample x from q(x |x l−1 ).
After a "burn-in" period, say l 0 , the samples {x l } l>l 0 follow the distribution f (x).
We now summarize the Gibbs sampling algorithm for the HDP inference as follows.
• Calculate the estimation of clustering indexẑ ·i for the i th gene byẑ ·i = arg a max j P(z ji = a|g).
After the burn-in period of the whole Gibbs sampler, we can calculate the posterior joint distribution P(φ, λ|g) from the samples and determine the clusters following the last two steps in the proposed Gibbs sampling algorithm.

Experimental results
The HDP clustering algorithm proposed in this article can be employed for gene expression analysis or as a segmentation algorithm for gene regulatory network inference. In this section, we first introduce two performance measures for clustering, the Rand Index (RI) [27] and the Silhouette Index (SI) [28]. We compare the HDP algorithm to the support vector machine (SVM) algorithm for network segmentation on synthetic data. We then conduct various experiments on both synthetic and real datasets including the AD400 datasets [29], the yeast galactose datasets [30], yeast sporulation datasets [31], human fibroblasts serum datasets [32], and yeast cell cycle data [33]. We compare the HDP algorithm to the Latent Dirichlet allocation (LDA), MCLUST, SVM, K-means, Bayesian Infinite Mixture Clustering (BIMC) the HC [4,14,[34][35][36][37] based on the performance measures and the functional relationships.

Performance measures
In order to evaluate the clustering result, we utilize two measures: RI [27] and SI [28]. The first index is used when a ground truth is known in priori and the second index is to measure the performance without any knowledge of the ground truth.
The RI is a measure of agreement between two clustering results. It takes a value between 0 and 1. The higher is the score, the higher agreements it indicates.
Let A denote the datasets with a total number of n elements. Given two clustering results X = {X 1 , . . . , X S } and For any pair of   elements (a, b) in A, we say they are in the same set under a clustering result if a and b are in the same cluster. Otherwise we say they are in different sets. Note that there are totally n 2 pairs of elements. We define the following four counting numbers: Z 1 denotes the number of pairs that are both in the same set in X and Y ; Z 2 denotes the number of pairs that are both in different sets in X and Y ; Z 3 denotes the number of pairs that are in the same set in X and in different sets in Y ; and Z 4 denotes the number of pairs that are in different sets in X and in the same set in Y. The RI is then given by Due to the lack of the ground truth in most real applications, we utilize the SI to evaluate the clustering performance. The SI is a measure by calculating the average width of all data points, which reflects the compactness of the clustering. Let x denote the average distance between a point p in a cluster and all other points within that cluster. Let y be the minimum average distance between p and other clusters. The Silhouette distance for p is defined as The SI is the average Silhouette distance among all data points. The value of SI lies in [ −1, 1] and higher score indicates better performance.

Network segmentation on synthetic data
In regulatory network inference, due to the large size of the network, it is often useful to perform a network segmentation. The segmented sub-networks usually have much less number of nodes than the original network, leading to faster and more accurate analysis of the original network [38]. Clustering algorithms can be employed for such segmentation purpose. However, traditional clustering algorithms often provide segmentation results either too fine or too coarse, i.e., the resulting sub-networks  either contain too few genes or two many genes. In addition, the hierarchical structure of the network cannot be discovered by those algorithms. Thanks to its hierarchical model assumption, the HDP algorithm can provide better segmentation results. We demonstrate the segmentation application of HDP on a synthetic network and compare to the SVM algorithm which is widely used for clustering and segmentation. The network under consideration is shown in Figure 3. We assume that the distributions for all nodes are Gaussian. The directed links indicate that the parent nodes are the priors of the child nodes. Disconnected nodes are mutually independent. We generate the data in the following way. Nodes 1, 2, and 8 are generated independently by Gaussian distributions of unit variance with means 1, 2, and 3, respectively. Nodes 3, 4, 5, 6, 9, and 10 are generated independently by unit variance Gaussian distributions with means determined by their respective parent nodes. Node 7 is generated by a Gaussian distribution with mean determined by node 4 and variance determined by absolute value of node 5. The network contains two isolated segments with one segment containing nodes 1-7 and the other containing nodes 8-10. The HDP algorithm is applied to this network and segments the network into three clusters. Nodes 2, 4, 6 form one cluster; nodes 1, 3, one. The SVM algorithm on the other hand produces two clusters, one containing nodes 1-7 and the other containing nodes 8-10. As one can see, the network obviously contains two hierarchies in the left segment, i.e., nodes 1-7 of the network. The SVM fails to recognize the hierarchies and provides a result coarser than that given by the HDP algorithm.

AD400 data
The AD400 is a synthetic dataset proposed in [29], which is used to evaluate the clustering algorithm performance. The dataset is constituted by 400 genes with 10 time points. As the ground truth, the AD400 dataset has 10 clusters with each one containing 40 genes.
For randomized algorithms as LDA, BIMC, HDP, we average the results over 20 runs of the algorithms. We compare the HDP algorithm to other widely used algorithms such as LDA, SVM, MCLUST, K-means, BIMC, and HC. The results are presented in Table 1. As we can see, the HDP algorithm has the similar performance of the MCLUST algorithm. While the HDP generally performs better than other widely used algorithms.

Yeast galactose data
We conduct experiment on the yeast galactose data, which consists of 205 genes. The true number of clusters based on the functional categories is 4 [39]. We calculate the RI index between different clustering results to the result in [39], which is regarded as the standard benchmark. The LDA model is a generative probabilistic model for document classifications [34], which also uses Dirichlet distribution as a prior. We adapt the LDA model to the yeast galactose data to compare the proposed HDP algorithm. Since the LDA and HDP methods are randomized algorithms, we run the algorithms 20 times and use the average for the final score. In Figure 4, we illustrate the performances of each experiments for the HDP method. The performances of the algorithms under consideration are listed in Table 2.
It is seen that the HDP algorithm performs the best among the three algorithms. Unlike the MCLUST and LDA algorithms which produce more clusters than 4, the average number of clusters given by the HDP algorithm is very closed to the "true" value 4. Compared to the SVM  method, the HDP algorithm produces a result that is more similar to the "ground truth", i.e., with the highest RI value.

Yeast sporulation data
The yeast sporulation dataset consists of 6,118 genes with 7 times points which were obtained during the sporulation process [31]. We pre-processed the dataset by applying a logarithmic transform and removing the data whose expression levels did not have significant changes. After the pre-process, the data have 513 genes left. In Table 3, we compare the HDP clustering result to LDA, MCLUST, K-Means, BIMC, and HC. For randomized algorithms such as LDA, BIMC, and HDP, we average the scores by running the algorithm 20 times. From Table 3, we can see that the HDP has the highest SI score. It suggests that the clustering results provided by HDP are more compact and less separated than results from other algorithms. The K-means and HC algorithm suggest higher number of clusters. However, their SI scores indicate that their clusters are not as tight as other algorithms.

Human fibroblasts serum data
The human fibroblasts serum data consists of 8,613 genes with 12 time points [32]. Again a logarithmic transform has been applied to the data and genes without significant changes have been removed. The remaining dataset has 532 genes.
In Table 4, we show the performance of the HDP algorithm and other various algorithms. It has been shown

Yeast cell cycle data
We next apply the proposed HDP clustering algorithm on the yeast cell Saccharomyces cerevisiae cycle dataset [2,40]. The data are obtained by synchronizing and collecting the mRNAs from cells at 10-min intervals over the course of two cell cycles. It has been used widely for testing the performances of clustering algorithm [2,14,41]. The expression data have been taken logarithmic transform and lie in the interval [ −2, 2]. We pre-processed the data to remove those which did not change significantly over time. We also removed those data whose means are below a small threshold. After the pre-processing, there are 1,515 genes left. We then apply the HDP algorithm and obtain 10 clusters in total. The plots of the clusters are shown in Figure 5. We resort to the MIPS database [42] to determine the functional categories for each cluster. The inferred functional category of a cluster is the category shared by the majority of the member elements. After applying the cellcycle selection criterion in [2], we find that there are 126 genes identified by proposed HDP algorithm but not discovered in [2]. We list in Table 5 the numbers of newly discovered genes in various functional categories. We also observe that parts of the newly discovered unclassified genes belong to clusters with classified categories. Given the hierarchical characteristic of the HDP algorithm, it may suggest multiple descriptions of those genes that might have been overlooked before.
Note that in [14] a Bayesian model with infinite number of clusters is proposed based on the Dirichlet process. The model in [14] is a special case of the HDP model proposed in this article when there is only one hierarchy. In terms of discovering new gene functionalities, we find that the performances of the two algorithms are similar, as the method in [14] discovered 106 new genes compared to the result in [2]. However, by taking the hierarchical structure into account, the total number of clusters found by the HDP algorithm is significantly smaller than that given in [14] which is 43 clusters. The SI score for BIMC and HDP are 0.321 and 0.392, respectively. The HDP clustering consolidates many fragmental clusters, which may provide an easier way to interpret the clustering results.
In Table 6, we list the new genes discovered by the HDP algorithm which are not found in [2].

Conclusions
In this article, we have proposed a new clustering approach based on the HDP. The HDP clustering explicitly models the hierarchical structure in the data that is prevalent in biological data such as gene expressions. We have developed a statistical inference algorithm for the proposed HDP model based on the Chinese restaurant metaphor and the Gibbs sampler. We have applied the proposed HDP clustering algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to reveal more structural information of the data compared to popular algorithms such as SVM and MCLUST, by incorporating the hierarchical knowledge into the model.