 Research Article
 Open Access
Spectral Preprocessing for Clustering Time-Series Gene Expressions
EURASIP Journal on Bioinformatics and Systems Biology volume 2009, Article number: 713248 (2009)
Abstract
Based on gene expression profiles, genes can be partitioned into clusters, which might be associated with biological processes or functions, for example, cell cycle, circadian rhythm, and so forth. This paper proposes a novel clustering preprocessing strategy which combines clustering with spectral estimation techniques so that the time information present in time-series gene expressions is fully exploited. By comparing the clustering results with a set of biologically annotated yeast cell-cycle genes, the proposed clustering strategy is corroborated to yield significantly different clusters from those created by the traditional expression-based schemes. The proposed technique is especially helpful in grouping genes participating in time-regulated processes.
1. Introduction
A cell is the basic unit of life, and each cell contains the instructions necessary for its proper functioning. These instructions are encoded in the form of DNA, which is replicated and transmitted to the progeny when the cell divides. mRNAs are intermediate products in this process. They are transcribed from DNA segments (genes) and serve as the templates for protein translation. This conduit of information constitutes the central dogma of molecular biology. The fast-evolving gene microarray technology has enabled simultaneous measurement of genome-wide gene expressions in terms of mRNA concentrations. There are two types of microarray data: time series and steady state. Time-series data are obtained by sequential measurements in temporal experiments, while steady-state data are produced by recording gene expressions from independent sources, for example, different individuals, tissues, experiments, and so forth. High costs, ethical concerns, and implementation issues prevent the collection of large time-series data sets. Therefore, about 70% of the data sets are steady state [1], and most time-series data sets contain only a few time points, in general fewer than 20 samples.
Based on microarray measurements, clustering methods have been exploited to partition genes into subsets. Members of each subset are assumed to share a specific biological function or to participate in the same molecular-level process. They are termed coexpressed genes and are supposed to be located closely in the underlying genetic regulatory networks. Eisen et al. [2] applied hierarchical clustering to partition yeast genes, Tamayo et al. [3] exploited the self-organizing map (SOM), and Tavazoie et al. [4] employed K-means clustering to group gene expressions and then searched for upstream DNA sequence motifs that contribute to the coexpression of genes. Besides the above-mentioned successful applications, Zhou et al. [5] designed a clustering strategy by minimizing the mutual information between clusters, and bootstrap techniques were combined with heuristic search to solve the underlying optimization problem. Also, Giurcăneanu et al. [6] exploited the minimum description length (MDL) principle to determine the number of clusters. Whether technically advanced schemes represent better solutions for real biological data is still under debate. However, most of the schemes usually provide valuable alternatives and insights to each other. Therefore, it was recommended that several clustering schemes be performed to analyze the same real data set [7], so that the difference between clusterings would capture some patterns that would otherwise be neglected by running only one method.
A straightforward application of clustering schemes causes the loss of the temporal information inherent in time-series measurements. This shortcoming has been noticed in the literature. Ramoni et al. [8] designed a model-based Bayesian method to cluster the time-series data and specified the number of clusters intelligently, Tabus and Astola [9] proposed to fit the data by linear dynamic systems, and Ernst et al. [10] presented an algorithm especially for short time series. In these models, genes in the same cluster were assumed to share a similar time-domain profile. The temporal relationships were also explored via more complex models, that is, genetic regulatory networks, which can be constructed via more computationally demanding algorithms, for example, Zhao et al. [11] and Liang et al. [12]. However, in general, the network inference schemes deal only with relatively small-scale networks consisting of fewer than hundreds of genes. Genome-wide analysis is beyond the computational capability of these inference algorithms. Therefore, clustering methods are usually exploited to partition genes, and the obtained subsets of genes serve as further research targets from which more accurate maps of real biological processes are to be recovered.
Based on time-series data, modern spectral density estimation methods have been exploited to identify periodically expressed genes. Assuming the cell cycle signal to be a single sinusoid, Spellman et al. [13] and Whitfield et al. [14] performed a Fourier transformation on the data sampled with different synchronization methods, Wichert et al. [15] applied the traditional periodogram and Fisher's test, while Ahdesmäki et al. [16] implemented a robust periodicity test procedure assuming non-Gaussian noise. The majority of these works dealt with evenly sampled data, and missing data points were usually filled by interpolation in the time domain, or the genes were disregarded if there were too many vacancies.
Biological experiments generally output unequally spaced measurements. The change of sampling frequency is due to missing data and to the fact that the measurements are usually event driven, that is, more observations are taken when certain biological events occur, and the measurement process is slowed down when the cell remains quiet. Therefore, an analysis based on unevenly sampled data is practically desired and technically more challenging. The harmonics exploited in the discrete Fourier transform (DFT) are no longer orthogonal in the presence of uneven sampling. Lomb [17] and Scargle [18] demonstrated that a phase shift suffices to make the sine and cosine terms orthogonal again. The Lomb-Scargle scheme has been exploited in analyzing the budding yeast data set by Glynn et al. [19]. Stoica and Sandgren [20] updated the traditional Capon method to cope with irregularly sampled data. Notice also that Wang et al. [21] designed the missing-data amplitude and phase estimation (MAPES) approach, which estimates the missing data and the spectrum iteratively through the Expectation Maximization (EM) algorithm. Although the Capon and MAPES methods aim to achieve a better spectral resolution than the Lomb-Scargle periodogram, for small sample sizes the simpler Lomb-Scargle periodogram appears to possess higher accuracy on real biological data sets [22].
This paper proposes a novel clustering preprocessing procedure which combines power spectral density analysis with clustering schemes. Given a set of microarray measurements, the power spectral density of each gene is first computed, and then the spectral information is fed into the clustering schemes. The members within the same cluster share similar spectral information; therefore, they are supposed to participate in the same temporally regulated biological process. The assumptions underlying this statement rely on the following facts: if two genes X and Y are in the same cluster, their spectral densities are very close to each other; in the time domain, their gene expressions may just differ in their phases. The phases are usually modeled to correspond to different stages of the same biological process, for example, cell cycle or circadian rhythms. The proposed spectral-density-based clustering actually differentiates the following two cases.
(1) Gene X's expression and gene Y's expression are uncorrelated in both the time and frequency domains.
(2) Gene X's and gene Y's expressions are uncorrelated in the time domain, but gene X's expression is a time-shifted version of gene Y's expression.
In the traditional clustering schemes, the distances are the same for the above two cases (both assuming large values). However, in the proposed algorithm, the second case is favorable and presents a lower distance. Therefore, by exploiting the proposed algorithm, the genes participating in the same biological process are more likely to be grouped into the same cluster. The Lomb-Scargle periodogram serves as the spectral density estimation tool since it is computationally simple and possesses higher accuracy in the presence of unevenly measured and small-size gene expression data sets. The appropriate clustering method is determined based on extensive computer simulations. Three major clustering methods, namely hierarchical, K-means, and self-organizing map (SOM) schemes, are tested with different configurations. The spectra-based and expression-based clusterings are compared with respect to their ability to group cell-cycle genes that have been experimentally verified. The differences between clusterings are recorded and compared in terms of information theoretic quantities.
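The distinction between the two cases can be illustrated with a toy example. The following is a minimal sketch, not the paper's procedure: the two profiles are hypothetical, evenly sampled sinusoids, so a plain FFT is used in place of the Lomb-Scargle periodogram, and the names `gene_x` and `gene_y` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy profiles: 24 evenly spaced samples covering three periods.
t = np.arange(24)
period = 8.0
gene_x = np.sin(2 * np.pi * t / period)
gene_y = np.sin(2 * np.pi * (t - 2) / period)  # gene_x delayed by a quarter period

# Time-domain similarity: Pearson correlation of the two profiles.
time_corr = np.corrcoef(gene_x, gene_y)[0, 1]

# Frequency-domain similarity: Euclidean distance between power spectra.
power_x = np.abs(np.fft.rfft(gene_x)) ** 2
power_y = np.abs(np.fft.rfft(gene_y)) ** 2
spectral_dist = np.linalg.norm(power_x - power_y)

print(f"time-domain correlation: {time_corr:.3f}")
print(f"spectral Euclidean distance: {spectral_dist:.3e}")
```

In this constructed case the two profiles are essentially uncorrelated in the time domain, yet their power spectra coincide up to rounding error, which is the situation described in case (2).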
2. Methods
This section explains how to apply the Lomb-Scargle periodogram to time-series gene expressions. The three clustering schemes, namely hierarchical, K-means, and self-organizing map (SOM), are then briefly formulated. Afterward, we discuss how to validate the clusterings and make comparisons between them. The notational convention is as follows: matrices and vectors are in bold face, and scalars are represented in regular font.
2.1. Lomb-Scargle Periodogram
Most spectral analysis methods, for example, the Fourier transform and the traditional periodogram employed in Spellman et al. [13] and Wichert et al. [15], rely on evenly sampled data, which are projected on orthogonal sine and cosine harmonics. However, real microarray measurements are not evenly observed due to missing data points and changing sampling frequency. Uneven sampling destroys the orthogonality of this projection. Lomb [17] found that a phase shift of the sine and cosine functions would restore the orthogonality among harmonics. Scargle [18] complemented Lomb's periodogram by analyzing its statistical distribution. Since then the established Lomb-Scargle periodogram has been exploited in numerous fields and applications, including bioinformatics and genomics (see, e.g., Glynn et al. [19]).
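Given $N$ time-series observations $(t_j, x_j)$, $j = 1, \ldots, N$, where $t_j$ stands for the time tag and $x_j$ denotes the sampled expression of a specific gene, the normalized Lomb-Scargle periodogram for that gene expression at angular frequency $\omega$ is defined as
$$P(\omega) = \frac{1}{2\hat{\sigma}^2}\left\{\frac{\left[\sum_{j=1}^{N}(x_j-\bar{x})\cos\omega(t_j-\tau)\right]^2}{\sum_{j=1}^{N}\cos^2\omega(t_j-\tau)} + \frac{\left[\sum_{j=1}^{N}(x_j-\bar{x})\sin\omega(t_j-\tau)\right]^2}{\sum_{j=1}^{N}\sin^2\omega(t_j-\tau)}\right\}, \tag{1}$$
where $\bar{x}$ and $\hat{\sigma}^2$ stand for the mean and variance of the sampled data, respectively, and $\tau$ is defined as
$$\tau = \frac{1}{2\omega}\arctan\left(\frac{\sum_{j=1}^{N}\sin 2\omega t_j}{\sum_{j=1}^{N}\cos 2\omega t_j}\right). \tag{2}$$
Let $\delta$ be the greatest common divisor (gcd) of all intervals $t_k - t_l$ ($k \neq l$). Eyer and Bartholdi [23] proved that the highest frequency to be searched is given by
$$f_{\max} = \frac{\omega_{\max}}{2\pi} = \frac{1}{2\delta}. \tag{3}$$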
The number of probing frequencies is denoted by
and the frequency grid can be defined in terms of the following equation:
Notice further that the spectra at the front and rear halves of the frequency grid are symmetric since the microarray experiments output real values.
The Lomb-Scargle periodogram represents an efficient solution for estimating the spectra of unevenly sampled data sets. Simulation results also verify its superior performance for biological data with small sample size and various unevenly sampled patterns [22].
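As an illustration of this subsection, the normalized Lomb-Scargle periodogram can be sketched in a few lines of NumPy. This is a hedged sketch rather than the authors' implementation: the sampling pattern, the test signal, the function name `lomb_scargle`, and the probing grid (96 frequencies strictly below the highest frequency $1/(2\delta)$) are all assumptions chosen for the example.

```python
import numpy as np

def lomb_scargle(t, x, omegas):
    """Normalized Lomb-Scargle periodogram of samples x observed at times t."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    variance = x.var(ddof=1)
    power = np.empty(len(omegas))
    for i, w in enumerate(omegas):
        # Phase shift tau that makes the sine and cosine terms orthogonal
        # (arctan2 is used here as a quadrant-safe arctangent).
        tau = np.arctan2(np.sin(2 * w * t).sum(), np.cos(2 * w * t).sum()) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = ((centered @ c) ** 2 / (c @ c)
                    + (centered @ s) ** 2 / (s @ s)) / (2 * variance)
    return power

# Hypothetical uneven sampling: a grid with spacing delta and six missing points.
delta = 2.0
t = np.delete(np.arange(0.0, 60.0, delta), [3, 7, 8, 15, 21, 26])
x = np.sin(2 * np.pi * t / 10.0 + 0.7)      # true period: 10 time units

f_max = 1.0 / (2.0 * delta)                  # highest frequency to be searched
freqs = np.linspace(0.01, f_max, 96, endpoint=False)  # assumed probing grid
power = lomb_scargle(t, x, 2 * np.pi * freqs)
peak_period = 1.0 / freqs[np.argmax(power)]
print(f"estimated period: {peak_period:.2f}")
```

The vector `power` computed for each gene is what would be passed to the clustering schemes in place of the raw expression profile.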
2.2. Clustering
The obtained Lomb-Scargle power spectral density will be used as input to the clustering schemes as an alternative to the original gene expression measurements. Three clustering schemes, namely hierarchical, K-means, and self-organizing map (SOM), are used for testing this substitution.
2.2.1. Hierarchical Clustering
Hierarchical clustering represents the partitioning procedure in the form of a tree, also known as the dendrogram. The bottom-up algorithm starts by treating each gene as a cluster. Then, at each higher level, a new cluster is generated by joining the two closest clusters at the lower level. In order to quantify the distance between two gene profiles, different metrics have been proposed in the literature, as enumerated in Table 1.
The correlation is the most popular metric and was exploited in Eisen's work [2]. Based on distances between gene expressions, we can further define the distances between two gene clusters, that is, linkage methods, as illustrated by Table 2.
The single linkage method actually constructs a minimal spanning tree, and it sometimes builds an undesirable long chain. The complete linkage method discourages the chaining effect and in each step increases the cluster diameter as little as possible. However, it assumes that the true clusters are compact. Alternatively, the average linkage method makes a compromise and is usually the preferred method since it imposes no assumption on the structure of the clusters. The selection of the distance metric and linkage method depends on the nature of the real data, and it has been proposed that several clustering schemes be tested at the same time so that each can capture different aspects of the data. The hierarchical clustering scheme can be formulated in terms of the pseudocode depicted in Algorithm 1. If a specific number of clusters is desired, only the loop condition on line 3 needs to be changed, so that merging stops once the desired number of clusters remains.
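Algorithm 1: Hierarchical clustering algorithm.
1: Input $n$ genes with their expressions or spectral densities;
2: Initialize $C_i = \{i\}$ for $i = 1, \ldots, n$, and set $m = n$;
3: while $m > 1$ do
4: $(a, b) = \arg\min_{i \neq j} d(C_i, C_j)$;
5: Insert $C_a \cup C_b$, delete $C_a$ and $C_b$;
6: Label all existing clusters with integers $1, \ldots, m - 1$;
7: $m \leftarrow m - 1$;
8: end while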
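The bottom-up procedure of Algorithm 1 can be sketched as follows. This is a naive illustration under stated assumptions, not the configuration used in the simulations: it fixes the Euclidean metric, supports the single, complete, and average linkages, and the function name `agglomerative` and the toy data are hypothetical.

```python
import numpy as np

def agglomerative(profiles, n_clusters, linkage="average"):
    """Bottom-up clustering with Euclidean distance; returns one label per row."""
    profiles = np.asarray(profiles, dtype=float)
    # Pairwise Euclidean distances between all profiles.
    diff = profiles[:, None, :] - profiles[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    reduce_fn = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]

    clusters = [[i] for i in range(len(profiles))]  # every gene starts alone
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce_fn(dist[np.ix_(clusters[a], clusters[b])])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # insert the merged cluster
        del clusters[b]                          # delete the absorbed cluster

    labels = np.empty(len(profiles), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Hypothetical data: two well-separated groups of three profiles each.
data = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                 [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])
labels = agglomerative(data, n_clusters=2, linkage="complete")
print(labels)
```

Stopping the loop at `n_clusters` rather than at a single cluster corresponds to the change of the loop condition mentioned in the text. The nested search is quadratic in the number of clusters at every merge, so a library implementation would be preferable for thousands of genes.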
2.2.2. K-Means Clustering
K-means clustering divides the genes into a predetermined number of clusters. It iteratively updates the centroid of each cluster and reassigns each gene to the cluster with the nearest centroid. Different distance metrics, as listed in Table 1, can also be exploited in the K-means clustering scheme. In each iteration, the new centroid might be the median or the mean of the cluster members. K-means clustering can be formulated as Algorithm 2. One of the problems associated with K-means clustering is that the iterations may converge to a local, suboptimal solution. Therefore, in our simulation we ran the algorithm 5 times and reported the run with the best performance. The K-means clustering method was exploited by Tavazoie et al. [4], who combined the clustering with the motif finding problem.
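Algorithm 2: K-means clustering algorithm.
1: Input gene expressions or spectral densities $\mathbf{x}_1, \ldots, \mathbf{x}_n$, and the desired number of clusters $K$;
2: Randomly create centroids $\mathbf{c}_1, \ldots, \mathbf{c}_K$;
3: Assign each gene $i$ to the cluster $\arg\min_k d(\mathbf{x}_i, \mathbf{c}_k)$;
4: while members in some clusters change do
5: compute centroids $\mathbf{c}_k$ as the mean or median of the members of cluster $k$;
6: assign gene $i$ to cluster $\arg\min_k d(\mathbf{x}_i, \mathbf{c}_k)$;
7: end while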
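Algorithm 2 with repeated random restarts can be sketched as below. The sketch makes several assumptions that are not taken from the paper: it uses the Euclidean metric with mean centroids, draws the initial centroids from the data points, and selects the best restart by the within-cluster sum of squares (the simulations in this paper judge performance by mutual information instead). The function name `kmeans` and the toy data are hypothetical.

```python
import numpy as np

def kmeans(data, k, n_restarts=5, max_iter=100, seed=0):
    """Euclidean K-means; keeps the restart with the smallest within-cluster cost."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _restart in range(n_restarts):
        # Random initial centroids drawn from the data points themselves.
        centroids = data[rng.choice(len(data), size=k, replace=False)]
        labels = None
        for _step in range(max_iter):
            dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                              # no membership changed
            labels = new_labels
            for j in range(k):
                if np.any(labels == j):            # leave empty clusters untouched
                    centroids[j] = data[labels == j].mean(axis=0)
        cost = ((data - centroids[labels]) ** 2).sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost

# Hypothetical data: two well-separated groups of three profiles each.
data = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                 [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])
labels, cost = kmeans(data, k=2)
print(labels, round(cost, 4))
```

Running several restarts and keeping the best one mirrors the five repetitions described in the text and reduces the chance of reporting a poor local solution.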
2.2.3. Self-Organizing Map (SOM) Clustering
The self-organizing map method is in essence based on a one-layer neural network, and it was exploited in [3]. Each cluster centroid maps to a node in the two-dimensional lattice. The method iteratively updates the centroid of each cluster through competitive learning. At each iteration, a randomly selected gene's expression vector is fed to the learning system, and the centroid closest to the incoming expression vector is identified as the winning centroid. Then each centroid is updated via
where a neighborhood function defines the distance between two nodes in the two-dimensional lattice. It can be set to 1 if a node is within the neighborhood of the winning node, and 0 otherwise. The learning rate function is monotonically decreasing in its arguments. The SOM clustering algorithm can be formulated as Algorithm 3.
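Algorithm 3: SOM clustering algorithm.
1: Input gene expressions or spectral densities $\mathbf{x}_1, \ldots, \mathbf{x}_n$, the desired number of clusters $K$, and the maximum number of iterations $T$;
2: Randomly create centroids $\mathbf{c}_1, \ldots, \mathbf{c}_K$;
3: Assign each gene $i$ to the cluster $\arg\min_k d(\mathbf{x}_i, \mathbf{c}_k)$;
4: for $t = 1$ to $T$ do
5: Randomly select a gene expression $\mathbf{x}(t)$;
6: Find the point $k^* = \arg\min_k d(\mathbf{x}(t), \mathbf{c}_k)$;
7: Update centroids based on (6);
8: end for
9: Assign each gene $i$ to cluster $\arg\min_k d(\mathbf{x}_i, \mathbf{c}_k)$;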
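The competitive-learning loop of Algorithm 3 can be sketched as follows. The update rule used here is an assumption, namely a common SOM form with a 0/1 neighborhood indicator on the lattice and a learning rate that decays linearly with the iteration index; it is not claimed to reproduce the paper's update equation. The function name `som_cluster`, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def som_cluster(data, grid_shape=(2, 2), n_iter=2000, lr0=0.5, radius=1.0, seed=0):
    """Cluster rows of data with a small self-organizing map on a 2-D lattice."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Lattice coordinates of the nodes; one centroid per node.
    nodes = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    centroids = data[rng.choice(len(data), size=rows * cols, replace=True)].copy()
    for t in range(n_iter):
        x = data[rng.integers(len(data))]                # randomly selected profile
        winner = np.linalg.norm(centroids - x, axis=1).argmin()
        lr = lr0 * (1.0 - t / n_iter)                     # decreasing learning rate
        # Assumed 0/1 neighborhood: nodes within `radius` of the winner are updated.
        near = np.linalg.norm(nodes - nodes[winner], axis=1) <= radius
        centroids[near] += lr * (x - centroids[near])
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# Hypothetical data: two well-separated groups of three profiles each.
data = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                 [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])
labels, centroids = som_cluster(data)
print(labels)
```

Because neighboring lattice nodes are pulled toward the same inputs, nearby nodes tend to end up with similar centroids, which is what distinguishes the SOM from an online K-means update.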
2.3. Performance Evaluation Metric
The three clustering schemes with inputs of either gene expressions or spectral densities are to be evaluated in two different ways: how they group time-regulated genes, and whether they are significantly different from each other. Different criteria are defined based on information theoretic quantities.
2.3.1. Validation of Clustering Scheme
Given $n$ genes with their expression or spectral density information $\mathbf{x}_1, \ldots, \mathbf{x}_n$, suppose the clustering scheme creates a partition $\mathcal{C}$ of the genes containing $K$ clusters $C_1, \ldots, C_K$, such that any two clusters $C_i$ and $C_j$ are mutually exclusive ($C_i \cap C_j = \emptyset$ for $i \neq j$) and all clusters together constitute the measured genes ($\bigcup_{k} C_k$ contains all $n$ genes). Then the entropy of the clustering can be exploited to measure the information of the clustering:
$$H(\mathcal{C}) = -\sum_{k=1}^{K}\frac{|C_k|}{n}\log_2\frac{|C_k|}{n}, \tag{7}$$
where $|C_k|$ measures the size of cluster $C_k$. Genes cooperate by participating in the same biological processes; in other words, singleton clusters are not expected to occur frequently in the clustering. Therefore, for a given $K$ the sizes of the clusters should be balanced, and the higher the entropy of the clustering, the better the clustering scheme.
The clustering schemes can be validated by their ability to group genes that have been annotated to share similar biological functions or to participate in the same biological process. One of the most explored processes is the yeast cell cycle, for which most genes have been identified and their interactions have been recorded in a public database [24]. Assume that a set of genes, denoted as $S$, has been verified to participate in a specific process. The joint entropy of the clustering $\mathcal{C}$ and the known set $S$ can be represented by
$$H(\mathcal{C}, S) = -\sum_{k=1}^{K}\left[\frac{|C_k \cap S|}{n}\log_2\frac{|C_k \cap S|}{n} + \frac{|C_k \setminus S|}{n}\log_2\frac{|C_k \setminus S|}{n}\right], \tag{8}$$
with the convention $0 \log_2 0 = 0$.
It is desirable that genes with the same functions be integrated into as small a number of clusters as possible. Therefore, the smaller the joint entropy, the better the clustering.
A straightforward performance metric combining both the clustering entropy and the joint entropy is defined as the mutual information
$$I(\mathcal{C}; S) = H(\mathcal{C}) + H(S) - H(\mathcal{C}, S), \tag{9}$$
where $H(S)$ is defined similarly as in (7), and it is constant across different clustering schemes. This metric is consistent with that proposed in Gibbons and Roth [25], whereby multiple gene attributes were considered. Higher mutual information between the clustering and the prespecified set indicates a balanced clustering for all genes in which the genes of $S$ are more concentrated; in other words, it indicates better performance.
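The validation metric can be computed from a small contingency table. The sketch below assumes that the known gene set is treated as a binary membership attribute, so that the joint entropy is taken over pairs (cluster, inside or outside the set); the function names and the eight-gene toy example are hypothetical.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy in bits of the distribution given by non-negative counts."""
    p = np.asarray(counts, dtype=float).ravel()
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information_with_set(labels, in_set):
    """Mutual information between a clustering and membership in a known gene set."""
    labels = np.asarray(labels)
    in_set = np.asarray(in_set, dtype=bool)
    # Contingency table: rows are clusters, columns are (outside set, inside set).
    table = np.array([[np.sum((labels == c) & ~in_set), np.sum((labels == c) & in_set)]
                      for c in np.unique(labels)])
    h_clustering = entropy(table.sum(axis=1))
    h_set = entropy(table.sum(axis=0))
    h_joint = entropy(table)
    return h_clustering + h_set - h_joint

# Hypothetical example: 8 genes in two balanced clusters, 4 of them annotated.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
concentrated = [1, 1, 1, 1, 0, 0, 0, 0]   # annotated genes all in one cluster
scattered = [1, 1, 0, 0, 1, 1, 0, 0]      # annotated genes split evenly
mi_concentrated = mutual_information_with_set(labels, concentrated)
mi_scattered = mutual_information_with_set(labels, scattered)
print(mi_concentrated, mi_scattered)
```

With the same balanced clustering, the mutual information is high when the annotated genes fall into a single cluster and drops to zero when they are spread evenly, which is the behavior the metric is meant to reward.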
2.3.2. Difference between Two Clusterings
Two clustering schemes create two different partitions of all the observed genes. A measure of the distance between two clusterings is highly valuable when the two schemes do not show a significant difference in their performance. Various metrics have been proposed to evaluate the difference between two clusterings, for example, Fowlkes and Mallows [26], Rand [27], and more recently Meilă [28]. We adopt Meilă's variation of information (VI) metric because it is more discriminative, makes no assumption on the clustering structure, requires no rescaling, and does not depend on the sample size.
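Assume two different schemes produce two clusterings $\mathcal{C} = \{C_1, \ldots, C_K\}$ and $\mathcal{C}' = \{C'_1, \ldots, C'_{K'}\}$ of the same $n$ genes, respectively. Then the mutual information between these two clusterings is represented by
$$I(\mathcal{C}, \mathcal{C}') = \sum_{k=1}^{K}\sum_{k'=1}^{K'}\frac{|C_k \cap C'_{k'}|}{n}\log_2\frac{n\,|C_k \cap C'_{k'}|}{|C_k|\,|C'_{k'}|}. \tag{10}$$
Then, the variation of information (VI) is defined as
$$VI(\mathcal{C}, \mathcal{C}') = H(\mathcal{C}) + H(\mathcal{C}') - 2 I(\mathcal{C}, \mathcal{C}'), \tag{11}$$
where $H(\cdot)$ denotes the entropy of a clustering.
VI is upper bounded by $\log_2 n$. It is zero if and only if the two clusterings are exactly the same. The greater the variation of information, the larger the difference between the two clusterings.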
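The variation of information between two label vectors can be sketched as below. This is an illustrative implementation of Meilă's definition using base-2 logarithms; the function name `variation_of_information` and the four-gene examples are hypothetical.

```python
import numpy as np

def variation_of_information(labels_a, labels_b):
    """Variation of information (in bits) between two clusterings of the same genes."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    n = len(labels_a)
    _, a_idx = np.unique(labels_a, return_inverse=True)
    _, b_idx = np.unique(labels_b, return_inverse=True)
    # Joint distribution of cluster memberships under the two clusterings.
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1)
    joint /= n

    def h(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    h_a, h_b = h(joint.sum(axis=1)), h(joint.sum(axis=0))
    mutual_info = h_a + h_b - h(joint.ravel())
    return h_a + h_b - 2 * mutual_info

# Identical partitions (up to relabeling) versus two independent partitions.
vi_same = variation_of_information([0, 0, 1, 1], [5, 5, 9, 9])
vi_independent = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
print(vi_same, vi_independent)
```

The metric depends only on which genes are grouped together, not on the cluster labels themselves, so relabeled but identical partitions are at distance zero.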
3. Results
The performance of the proposed power-spectrum-based scheme is illustrated through comparisons with three traditional expression-based clustering schemes: hierarchical, K-means, and self-organizing map (SOM). The comparisons are divided into two parts. In the first part, we evaluate their ability to group the genes involved in the cell cycle, while the second part is devoted to illustrating the fact that the proposed schemes construct clusters that are significantly different from those created by the traditional schemes.
3.1. Clustering Performance Evaluation
These simulations were performed on the cdc15 data set published by Spellman et al. [13], which contained 24 time-series expression measurements of 6178 yeast genes. The hierarchical, K-means, and self-organizing map (SOM) clustering schemes were simulated having as inputs the computed spectral densities and the original expression data. The hierarchical and K-means clustering schemes were configured with different distance and linkage methods, which are defined in Tables 1 and 2, respectively. The simulations were executed until up to 200 clusters were created.
The cell cycle has served as a research target in molecular biology for a long time since it plays a crucial role in cell division, and medically it underlies the development of cancer. Experimentally, 109 genes have been verified to participate in the cell-cycle process, and their interactions were recorded in the public database KEGG [24]. Among them, 104 genes were reported in Spellman's data set. The simulations tested how these genes were clustered with other genes. Intuitively, the more integrated these 104 genes are, the better the clustering scheme is. On the other hand, it is hoped that the sizes of the clusters are relatively balanced, and there should not be many singleton clusters (clusters containing only one gene).
The clustering performance is represented by an information theoretic quantity, that is, mutual information, which is defined between the obtained partition of all measured genes and the set of 104 genes. Higher mutual information indicates that the 104 cell-cycle genes are closely integrated into only a few clusters, and most clusters are balanced in size. In other words, with the same number of clusters, the higher the mutual information, the better the performance.
The proposed strategy is not limited to detecting cell-cycle genes. However, we confine our discussion to the cell cycle here because the available data set was designed for cell-cycle research. Besides, the cell-cycle genes have been identified for a relatively long time with high confidence.
The simulation results for hierarchical clustering are illustrated in Figure 1. Each subplot is associated with a linkage method. Figure 1(a) demonstrates the performance for the single linkage method. The dotted curves represent schemes clustering spectral densities, while the solid curves denote schemes clustering original gene expressions. The mutual information goes up nearly linearly when the number of clusters increases. Actually, when we delved into the generated clusters, it was found that most clusters were singletons. The chaining effect took place, and the single linkage method is not a good candidate for the purpose of clustering gene expression measurements. Spectral-density-based methods were all better than their traditional counterparts, which performed clustering on the original gene expression data. Among all, the Euclidean method clustering spectral densities achieved the best performance.
Figure 1(b) shows the results for the complete linkage method of hierarchical clustering. Each cluster actually represents a complete subgraph. The complete linkage method discourages the chaining effect that occurs with the single linkage method. The performance of the spectral-density-based clusterings is lower bounded by the worst performance of the traditional gene-expression-based clusterings. For the gene-expression-based clustering, the correlation and cosine approaches are better than the Euclidean and city-block approaches, while for the spectral density clustering, the Euclidean and city-block approaches exhibit the best performance.
Figure 1(c) plots the results for the average linkage method of hierarchical clustering. The average linkage is the most widely deployed method since it makes a compromise between the single and the complete methods, and it does not assume any structure on the underlying data. However, in the presence of real gene expression data, it is not as good as the complete linkage method. Different distance metrics differ in terms of their ability to group the involved cell-cycle genes. For clustering expression data, the cosine and correlation approaches still achieve the best performance, but they exhibit poorer performance than the spectra-based Euclidean and city-block methods.
Configured also with various distance metrics, the K-means algorithm was applied to both the spectral and the original gene expression data. To avoid converging to local suboptimal solutions, all K-means clustering schemes were executed 5 times, and the best performance was reported. For clustering expression data, the correlation and cosine approaches are still the best choices, while for spectra-based schemes, the Euclidean and city-block approaches still exceed the other schemes (see Figure 2).
Figure 3 compares the performance of the hierarchical and K-means clustering schemes with that of SOM. Only the best schemes of hierarchical and K-means clustering are displayed. It turns out that SOM is the best performing scheme, K-means lies in the middle, and hierarchical clustering is the worst, although the discrepancy does not appear significant. Among all schemes, the spectral-density-based SOM achieves the best performance. Although the discrepancy between the best spectra-based clustering and the best gene-expression-based clustering is not obvious, they actually create significantly different clusters. This difference can be captured by the distance metric between clusterings.
The inferior performance of the correlation and cosine metrics with spectra input is partially due to the flat spectra of those genes with no time-regulated patterns. The flat spectrum in the denominator causes the distance metrics to be highly biased. It is also worthwhile to note that other distance metrics have been proposed in the literature, for example, coherence [29] and mutual information [30]. However, these metrics involve the estimation of a joint distribution, which usually requires large sample sizes. Such a requirement cannot be satisfied in general by microarray experiments. Extra normalization of the spectrum can be performed, but simulation shows that it does not provide a significant or consistent improvement.
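The effect of flat spectra on the correlation metric can be illustrated with a small numerical experiment. The sketch below is an assumption-laden toy, not a result from the paper: the two "spectra" are hypothetical flat profiles perturbed by small independent noise, and only the Euclidean and correlation distances are compared.

```python
import numpy as np

# Two hypothetical nearly flat spectra: a constant level plus small independent noise.
rng = np.random.default_rng(1)
flat_a = 1.0 + 0.01 * rng.standard_normal(50)
flat_b = 1.0 + 0.01 * rng.standard_normal(50)

# Euclidean distance sees two almost identical profiles.
euclidean = np.linalg.norm(flat_a - flat_b)

# Correlation distance removes the mean and divides by the (tiny) spread,
# so only the noise is compared and the two profiles look unrelated.
correlation_dist = 1.0 - np.corrcoef(flat_a, flat_b)[0, 1]

print(f"Euclidean distance:   {euclidean:.3f}")
print(f"correlation distance: {correlation_dist:.3f}")
```

Under these assumptions the Euclidean distance stays small while the correlation distance is far from zero, because the normalization amplifies whatever noise remains after the flat level is removed.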
3.2. Distance between Clusterings
Testing the distance between spectra-based and gene-expression-based clusterings also reveals the value of the proposed scheme. The variation of information metric, proposed by Meilă [28], is exploited to measure the difference between the two clusterings. The basic principle is that the higher the variation of information, the greater the difference.
Figure 4 demonstrates the distance between two clusterings with the same input, either computed using spectral densities or measured based on gene expressions. For hierarchical clustering, only the complete linkage method is considered since it possesses the best performance in terms of grouping the known cell-cycle genes. The complete set of distances between any two schemes is depicted in Additional File 1 [31]. Figure 4 retains only the salient general patterns for conciseness. For hierarchical clustering of gene expression data, the correlation and Euclidean schemes differ more, and the distance between these two is the highest curve when the number of clusters is greater than 120. The distance between the correlation and Euclidean hierarchical clusterings is even much larger than the distance between the clusterings created by the hierarchical scheme and K-means or SOM. However, when clustering spectral densities, all schemes display quite similar patterns and exhibit closely located performances. This means that clustering spectral densities is stable across different clustering schemes.
Figure 5 compares the same clustering methods assuming different inputs. Compared with the scale of Figure 4, the distance between different clusterings with the same input is much smaller than the distance between clusterings that assume different input types. The distance between any two schemes that assume the same input is below 7 bits when the number of clusters ranges from 0 to 200, as shown in Figure 4 or the dashed curve in Figure 5, while the distance between the clusterings created by the same scheme assuming two different input types is above 8 bits when the number of clusters ranges from 100 to 200. This shows that changing the input type from gene expression to spectral density produces a significantly different clustering. For the complete plots of the distance between clusterings produced by various schemes assuming different input types, please refer to Additional File 2 [31].
4. Conclusion
A novel clustering preprocessing strategy is proposed to combine the traditional clustering schemes with power spectral analysis of time-series gene expression measurements. The simulation results corroborate that the proposed approach achieves a better clustering for hierarchical, K-means, and self-organizing map (SOM) schemes in most cases. Besides, it constructs a significantly different partition relative to traditional clustering strategies. When deploying the hierarchical or K-means clustering methods based on the spectral density, the Euclidean and city-block distance metrics appear to be more appealing than the cosine or correlation distance metrics. The proposed algorithm is valuable since it provides additional information about temporally regulated genetic processes, for example, the cell cycle.
References
Simon I, Siegfried Z, Ernst J, BarJoseph Z: Combined static and dynamic analysis for determining the quality of timeseries expression profiles. Nature Biotechnology 2005, 23(12):15031508. 10.1038/nbt1164
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 1998, 95(25):1486314868. 10.1073/pnas.95.25.14863
Tamayo P, Slonim D, Mesirov J, et al.: Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences of the United States of America 1999, 96(6):29072912. 10.1073/pnas.96.6.2907
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nature Genetics 1999, 22(3):281285. 10.1038/10343
Zhou X, Wang X, Dougherty ER, Russ D, Suh E: Gene clustering based on clusterwide mutual information. Journal of Computational Biology 2004, 11(1):147161. 10.1089/106652704773416939
Giurcăneanu CD, Tăbuş I, Astola J, Ollila J, Vihinen M: Fast iterative gene clustering based on information theoretic criteria for selecting the cluster structure. Journal of Computational Biology 2004, 11(4):660682.
D'Haeseleer P: How does gene expression clustering work? Nature Biotechnology 2005, 23(12):14991501. 10.1038/nbt12051499
Ramoni MF, Sebastiani P, Kohane IS: Cluster analysis of gene expression dynamics. Proceedings of the National Academy of Sciences of the United States of America 2002, 99(14):91219126. 10.1073/pnas.132656399
Tabus I, Astola J: Clustering the nonuniformly sampled time series of gene expression data. Proceedings of the International Symposium on Signal Processing and Applications (ISSPA '03), Paris, France, July 2003 2: 6164.
Ernst J, Nau GJ, BarJoseph Z: Clustering short time series gene expression data. Bioinformatics 2005, 21(supplement 1):i159i168.
Zhao W, Serpedin E, Dougherty ER: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 2006, 22(17):21292135. 10.1093/bioinformatics/btl364
Liang S, Fuhrman S, Somogyi R: Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Proceedings of the Pacific Symposium on Biocomputing, Maui, Hawaii, USA, January 1998 3: 1829.
Spellman PT, Sherlock G, Zhang MQ, et al.: Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 1998, 9(12):32733297.
Whitfield ML, Sherlock G, Saldanha AJ, et al.: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Molecular Biology of the Cell 2002, 13(6):1977-2000. 10.1091/mbc.02-02-0030
Wichert S, Fokianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics 2004, 20(1):5-20. 10.1093/bioinformatics/btg364
Ahdesmäki M, Lähdesmäki H, Pearson R, Huttunen H, Yli-Harja O: Robust detection of periodic time series measured from biological systems. BMC Bioinformatics 2005, 6, article 117: 1-18.
Lomb NR: Least-squares frequency analysis of unequally spaced data. Astrophysics and Space Science 1976, 39(2):447-462. 10.1007/BF00648343
Scargle JD: Studies in astronomical time series analysis. II. Statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal 1982, 263(99):835-853.
Glynn EF, Chen J, Mushegian AR: Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms. Bioinformatics 2006, 22(3):310-316. 10.1093/bioinformatics/bti789
Stoica P, Sandgren N: Spectral analysis of irregularly-sampled data: paralleling the regularly-sampled data approaches. Digital Signal Processing 2006, 16(6):712-734. 10.1016/j.dsp.2006.08.012
Wang Y, Stoica P, Li J, Marzetta TL: Nonparametric spectral analysis with missing data via the EM algorithm. Digital Signal Processing 2005, 15(2):191-206. 10.1016/j.dsp.2004.10.004
Zhao W, Agyepong K, Serpedin E, Dougherty ER: Detecting periodic genes from irregularly sampled gene expressions: a comparison study. EURASIP Journal on Bioinformatics and Systems Biology 2008, 2008:8.
Eyer L, Bartholdi P: Variable stars: which Nyquist frequency? Astronomy and Astrophysics 1999, 135(1):1-3.
KEGG Yeast Cell Cycle Pathway http://www.genome.ad.jp/kegg/pathway/sce/sce04111.html
Gibbons FD, Roth FP: Judging the quality of gene expression-based clustering methods using gene annotation. Genome Research 2002, 12(10):1574-1581. 10.1101/gr.397002
Fowlkes E, Mallows C: A method for comparing two hierarchical clusterings. Journal of the American Statistical Association 1983, 78(383):553-569. 10.2307/2288117
Rand WM: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 1971, 66(336):846-850. 10.2307/2284239
Meilă M: Comparing clusterings: an information based distance. Journal of Multivariate Analysis 2007, 98(5):873-895. 10.1016/j.jmva.2006.11.013
Butte AJ, Bao L, Reis BY, Watkins TW, Kohane IS: Comparing the similarity of time-series gene expression using signal processing metrics. Journal of Biomedical Informatics 2001, 34(6):396-405. 10.1006/jbin.2002.1037
Brillinger DR: Second-order moments and mutual information in the analysis of time series. In Recent Advances in Statistical Methods. Imperial College Press, London, UK; 2002:64-76.
Supplementary Materials http://www.ece.tamu.edu/~wtzhao/EurasipBSBClutering.htm
Acknowledgments
This work was supported by the National Cancer Institute (CA90301) and the National Science Foundation (ECS0355227 and CCF0514644).
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Zhao, W., Serpedin, E. & Dougherty, E.R. Spectral Preprocessing for Clustering TimeSeries Gene Expressions. J Bioinform Sys Biology 2009, 713248 (2009). https://doi.org/10.1155/2009/713248
DOI: https://doi.org/10.1155/2009/713248
Keywords
 Spectral Density
 Mutual Information
 Distance Metrics
 Genetic Regulatory Network
 Complete Linkage Method