- Research Article
- Open Access
Identifying Genes Involved in Cyclic Processes by Combining Gene Expression Analysis and Prior Knowledge
© Wentao Zhao et al. 2009
- Received: 9 July 2008
- Accepted: 26 January 2009
- Published: 3 March 2009
Based on time series gene expression data, cyclic genes can be recognized via spectral analysis and statistical periodicity detection tests. These cyclic genes are usually associated with cyclic biological processes, for example, cell cycle and circadian rhythm. In practice, the power of a scheme is measured by comparing the detected periodically expressed genes with experimentally verified genes participating in a cyclic process. In this procedure, however, the valuable prior knowledge serves only as an evaluation benchmark and is not fully exploited in the implementation of the algorithm. In addition, some data sets are disregarded because of their nonstationarity. This paper proposes a novel algorithm to identify cyclic-process-involved genes by integrating the prior knowledge with the gene expression analysis. The proposed algorithm is applied to data sets corresponding to Saccharomyces cerevisiae and Drosophila melanogaster, respectively. Biological evidence is found to validate the roles of the discovered genes in cell cycle and circadian rhythm. Dendrograms are presented to cluster the identified genes and to reveal expression patterns. The results corroborate that the proposed identification scheme provides a valuable technique for unveiling pathways related to cyclic processes.
- Circadian Rhythm
- Ribosome Biogenesis
- Cell Cycle Gene
- Synchronization Method
- Cyclic Gene
The eukaryotic cell hosts several cyclic molecular processes, for example, cell cycle and circadian rhythm. The transcriptional events in these processes can be quantitatively observed by measuring the concentration of the messenger RNA (mRNA), which is transcribed from DNA and serves as the template for synthesizing the corresponding protein. To this end, microarray experiments exploit high-throughput gene chips to take snapshots of genome-wide gene expression at discrete time points. The sampled time series data present three main characteristics. First, most data sets have a small sample size, for example, no more than 50 data points. Large data sets are costly to obtain; moreover, the cell culture loses synchronization in the long run, so measurements sampled much later become meaningless. Second, the data might not be evenly sampled, and many time points could be missing. In order to capture critical events at minimal cost, biologists usually conduct microarray experiments and make measurements when these events happen. Third, the data are highly corrupted by experimental noise, so a robust stochastic analysis is desired.
Based on time series data, various approaches have been proposed to identify periodically expressed genes, which are sometimes believed to be involved in the cell cycle. Assuming the cell cycle signal to be a simple sinusoid, Spellman et al. and Whitfield et al. performed Fourier transformations on the data sampled with different synchronization methods, Wichert et al. applied the traditional periodogram and Fisher's test, while Ahdesmäki et al. implemented a robust periodicity test assuming non-Gaussian noise. Giurcǎneanu explored the stochastic complexity of detecting periodically expressed genes by means of generalized Gaussian distributions. Alternatively, Luan and Li employed guide genes and constructed cubic B-spline-based periodic functions for modeling, while Lu et al. employed up to third harmonics to fit the data and proposed a periodic normal mixture model. De Lichtenberg et al. compared the approaches [1, 6, 7] and proposed a new score combining the periodicity and regulation magnitude. Interestingly, the mathematically more advanced methods do not seem to perform better than Spellman's original method, which relies on the Fast Fourier Transform (FFT). Notably, the majority of these works deal only with evenly sampled data. When data points are missing, the adopted methods usually fill the vacancies by interpolation in the time domain for all genes, or disregard the genes that have more than 30% of their data samples missing.
Biological experiments generally output unevenly spaced measurements. The change of sampling frequency can be attributed to missing data. Besides, the measurements are usually event-driven, that is, more observations are recorded when certain biological events happen, and the observational process is slowed down when the cell remains quiet or no event of interest occurs. Therefore, the analysis based on unevenly sampled data sets is practically more desirable and technically more challenging. In the case of uneven sampling, the harmonics exploited in the discrete Fourier transform (DFT) are no longer orthogonal. Lomb and Scargle demonstrated that a phase shift suffices to make the sine and cosine terms orthogonal again, and consequently a spectral estimator can be designed in the presence of uneven sampling. The Lomb-Scargle scheme has been exploited by Glynn et al. in analyzing the budding yeast data set. A number of alternative schemes were also proposed recently to cope with missing and/or irregularly spaced data samples. Stoica and Sandgren updated the traditional Capon method to cope with irregularly sampled data. Wang et al. designed the missing-data amplitude and phase estimation (MAPES) approach, which estimates the missing data and spectra iteratively through the Expectation Maximization (EM) algorithm. Although the Capon and MAPES methods aim to achieve a better spectral resolution than the Lomb-Scargle periodogram, for small sample sizes the simpler Lomb-Scargle scheme appears to perform better on realistic biological data.
Most of the algorithms proposed in the literature identify cyclic genes by exploiting mathematical models to explain each gene's time series pattern. The periodically expressed genes are identified by means of these models and statistical tests. Finally, the detected genes are compared with the genes that had been experimentally discovered to participate in specific processes like cell cycle. These experimentally verified cycle-involved genes serve only as a gold-standard benchmark to evaluate the performance of the proposed identification algorithms. They are not fully exploited in the implementation of the identification algorithm. Most of the existing algorithms also fail to utilize all the available data information. For example, the elutriation data was usually discarded when performing the spectral analysis. In other experiments, some data sets were also disregarded due to either loss of synchronization or nonstationarity. Herein, we propose a novel algorithm to detect periodically expressed genes by integrating the gene expression analysis with the valuable prior knowledge offered by all available data. The prior knowledge can consist of two data sets, that is, the set of genes involved in a cyclic process and the set of noncycle-involved genes recognized in biological experiments. The cycle-involved genes are used to initialize the proposed algorithm, and the noncycle-involved genes are employed to control the false positives. The expression analysis is composed of the spectral estimation technique and the computation of gene expression distance. The underlying approach relies on the assumption that genes expressed similarly to the genes of a process of interest are also likely to participate in that process. The same assumption underlies the clustering schemes that are applied to microarray measurements in order to partition genes into different functional groups.
The proposed algorithm identifies potential cyclic-process-involved genes. It guarantees that the verified cycle genes are always included in the output gene set and that the verified noncycle-involved genes are always removed from it. Although most of the existing power-spectra-based algorithms can be incorporated into the proposed algorithm seamlessly, herein we use the Lomb-Scargle periodogram due to its simplicity and good performance. The proposed algorithm also lays the groundwork for subsequent research on cycle pathways.
The proposed algorithm is composed of a spectral density analysis and a gene distance computation based on the time series microarray data. Any existing spectral analysis scheme can be incorporated into the proposed algorithm. However, the Lomb-Scargle periodogram is recommended here because it is convenient to implement and performs well for small sample sizes. The nonparametric Spearman's correlation coefficient is adopted to construct the measure of distance between two genes.
2.1. Lomb-Scargle Periodogram and Periodicity Detection
Microarray measurements usually have a large portion of missing data points. Besides, the sampling frequency is tuned to adapt to nonuniformly occurring events. The Lomb-Scargle periodogram is therefore an excellent candidate for analyzing these irregular data.
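For reference, the normalized Lomb-Scargle periodogram in its commonly used form can be written as below. The notation is an assumption of this sketch and not necessarily that of the article: $y_j$ is the expression level measured at time $t_j$ for $j = 1, \dots, N$, $\bar{y}$ and $\sigma^2$ are the sample mean and variance, $\omega$ is the angular frequency, and $\tau$ is the phase shift.

```latex
P(\omega) = \frac{1}{2\sigma^2}\left\{
  \frac{\left[\sum_{j=1}^{N} (y_j - \bar{y}) \cos\omega(t_j - \tau)\right]^2}
       {\sum_{j=1}^{N} \cos^2\omega(t_j - \tau)}
  + \frac{\left[\sum_{j=1}^{N} (y_j - \bar{y}) \sin\omega(t_j - \tau)\right]^2}
         {\sum_{j=1}^{N} \sin^2\omega(t_j - \tau)}
\right\},
\qquad
\tan(2\omega\tau) = \frac{\sum_{j=1}^{N} \sin 2\omega t_j}{\sum_{j=1}^{N} \cos 2\omega t_j}.
```

With this choice of $\tau$, the cross term $\sum_j \sin 2\omega(t_j - \tau)$ vanishes, so the sine and cosine terms are orthogonal even when the $t_j$ are unevenly spaced.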
A rejection of the null hypothesis based on a p-value threshold implies that the power spectral density contains a frequency with magnitude substantially greater than the average value. This indicates that the time series data contain a periodic signal, and the corresponding gene is cyclic in expression.
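The periodicity detection step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy time series, and the candidate periods are hypothetical, and the p-value uses Scargle's false-alarm formula for the largest of several independent frequencies, which may differ from the test used in the article.

```python
import math


def lomb_scargle(times, values, ang_freqs):
    """Normalized Lomb-Scargle power at each angular frequency (rad per time unit)."""
    n = len(times)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance, assumed > 0
    powers = []
    for w in ang_freqs:
        # Phase shift tau that makes the sine and cosine terms orthogonal.
        s2 = sum(math.sin(2.0 * w * t) for t in times)
        c2 = sum(math.cos(2.0 * w * t) for t in times)
        tau = math.atan2(s2, c2) / (2.0 * w)
        cos_t = [math.cos(w * (t - tau)) for t in times]
        sin_t = [math.sin(w * (t - tau)) for t in times]
        yc = sum((v - mean) * c for v, c in zip(values, cos_t))
        ys = sum((v - mean) * s for v, s in zip(values, sin_t))
        cc = sum(c * c for c in cos_t)
        ss = sum(s * s for s in sin_t)
        # Skip a term whose denominator vanishes at a degenerate frequency.
        cos_part = yc * yc / cc if cc > 1e-12 else 0.0
        sin_part = ys * ys / ss if ss > 1e-12 else 0.0
        powers.append((cos_part + sin_part) / (2.0 * var))
    return powers


def peak_p_value(peak_power, n_freqs):
    """False-alarm probability of the largest peak among n_freqs independent
    frequencies (assumed test; not necessarily the one used in the article)."""
    return 1.0 - (1.0 - math.exp(-peak_power)) ** n_freqs


# Hypothetical, unevenly sampled expression profile with a 10-hour period.
times = [0.0, 1.0, 2.5, 3.0, 4.2, 5.0, 7.0, 8.1, 9.0, 11.0,
         12.5, 13.0, 15.0, 16.4, 18.0, 19.0, 21.0, 22.3, 24.0, 25.5]
values = [math.sin(2.0 * math.pi * t / 10.0) for t in times]
periods = [4.0, 6.0, 10.0, 16.0]  # candidate periods in hours
powers = lomb_scargle(times, values, [2.0 * math.pi / p for p in periods])
best_power, best_period = max(zip(powers, periods))
p_value = peak_p_value(best_power, len(periods))
```

A gene would be declared periodically expressed when `p_value` falls below the chosen threshold. No interpolation of missing time points is needed, because only the observed pairs of time and value enter the sums.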
2.2. Gene Distance Measure
The distance between two genes is constructed from Spearman's correlation coefficient, which is computed from the rank pairs of the two genes' measurements. Only the sampling points at which both genes present available observations are counted. This distance measure always assumes values between 0 and 1.
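A rank-based distance of this kind can be sketched as follows. The mapping from the Spearman coefficient $\rho$ onto the interval from 0 to 1 is an assumption of this sketch: it uses $(1 - \rho)/2$, which is one of several possible choices, and the function names and the use of `None` for missing observations are likewise hypothetical.

```python
import math


def average_ranks(xs):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and xs[order[end + 1]] == xs[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_distance(x, y):
    """Distance in [0, 1] from Spearman's rho over co-observed time points.

    Missing observations are encoded as None. The mapping (1 - rho) / 2 is
    an assumed choice, not necessarily the formula of the article.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three co-observed time points")
    rx = average_ranks([a for a, _ in pairs])
    ry = average_ranks([b for _, b in pairs])
    mean_rank = (n + 1) / 2.0  # mean of ranks 1..n, also with ties
    cov = sum((a - mean_rank) * (b - mean_rank) for a, b in zip(rx, ry))
    var_x = sum((a - mean_rank) ** 2 for a in rx)
    var_y = sum((b - mean_rank) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        rho = 0.0  # constant profile: treated as uncorrelated (assumption)
    else:
        rho = cov / math.sqrt(var_x * var_y)
    return (1.0 - rho) / 2.0


# Two hypothetical genes, each with one missing time point.
gene_a = [1.0, 2.0, None, 4.0, 5.0]
gene_b = [0.1, 0.4, 0.9, None, 2.5]
d_ab = spearman_distance(gene_a, gene_b)
```

Because only ranks are used, the distance is insensitive to monotone rescaling of the expression values, which suits noisy microarray measurements.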
2.3. Algorithm Formulation
The proposed algorithm is formulated as Algorithm 1. Lines 1 to 9 accept inputs and initialize the target cyclic gene set with the spectral analysis results and the prior cycle-involved genes. Within them, lines 4 to 8 exclude genes whose peak periodicity conflicts with the prior knowledge of the frequency range of the studied phenomenon. Lines 10 to 17 represent the iterative accumulation part. They iteratively insert into the potential cyclic gene set the genes that are expressed similarly to the genes already within that set. Lines 18 to 25 stand for the false positive control part, which constructs the control set iteratively to suppress the potential false positives by using the prior knowledge. Line 26 subtracts the control set from the established target set and finalizes the cyclic gene set. The simulation results on the yeast data set showed that the iterative accumulation part controls the false positives well.
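Algorithm 1: Identifying cyclic-process-involved genes.
1: Input gene expression measurements, the set of all sampled genes,
experimentally verified cycle-involved genes (denoted as G),
noncycle-involved genes (denoted as F), and the prior frequency range;
2: Perform power spectral analysis on gene expression data;
3: Perform statistical tests so that the periodically expressed genes are
recognized and stored in set C;
4: for each gene in C do
5: if its peak frequency lies outside the prior frequency range then
9: merge C into the target set G, and specify the distance threshold t;
12: for each gene outside the target set and each gene inside it do
13: if the distance between the two genes is below t then
20: for each gene outside the control set and each gene inside it do
21: if the distance between the two genes is below t then
27: Output G;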
The algorithm is guaranteed to converge. In each iteration of the accumulation part and of the false positive control part, either at least one new member is added to the set being built or the iteration stops. The set therefore grows monotonically, with the set of each iteration being a subset of the next, and its size is upper-bounded by the full gene set that contains all the measured genes. Therefore, both the iterative accumulation part and the false positive control part converge, and so does the proposed algorithm.
Usually some general idea about the phenomenon of interest can be used to determine the lower and upper bounds of the frequency range. For example, the circadian rhythm has a periodicity around 24 hours, which can be somewhat compressed or prolonged by experimental protocols. If no prior knowledge exists, the frequency range can be left unrestricted. Two other thresholds are to be specified. The first is the threshold for the periodicity test. To effectively control the false alarm rate, multiple testing correction can be applied and a threshold on the corrected significance values can be specified. In practice, this threshold can be chosen around 0.15. It can also be decided by comparing the spectral analysis results with the prior knowledge. Such an approach is more attractive when the proposed algorithm is combined with other periodicity detection methods. We are inclined to use a more stringent threshold, which also represents a trade-off between the number of conserved genes and the number of experimentally verified genes. The second threshold is the distance threshold t. It keeps decreasing over the iterations. For example, the initial value is set to 0.25, which means high correlation according to Cohen's rule of thumb. Each iteration decreases this threshold by 0.05 until it reaches 0.1, after which it remains constant at 0.1. In practice, this technique helps to prevent the amplification of false positives.
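The accumulation and false positive control parts, together with the decreasing distance threshold, can be sketched as below. This is a simplified illustration under stated assumptions rather than a transcription of Algorithm 1: the function names and the toy distances are hypothetical, the threshold schedule is assumed to restart for each part, the periodic set is assumed to be already filtered by the frequency range, and verified cycle-involved genes are assumed to be protected from the control set so that they always remain in the output.

```python
def threshold_at(k, start=0.25, step=0.05, floor=0.10):
    """Distance threshold for iteration k: 0.25, 0.20, 0.15, then 0.10."""
    return max(round(start - step * k, 10), floor)


def grow(seed, candidates, dist):
    """Repeatedly add candidates lying within the current threshold of any member."""
    members = set(seed)
    pool = set(candidates) - members
    k = 0
    while pool:
        t = threshold_at(k)
        new = {g for g in pool if any(dist(g, m) < t for m in members)}
        if not new:
            break  # nothing was added, so the set has converged
        members |= new
        pool -= new
        k += 1
    return members


def identify_cyclic_genes(all_genes, periodic, verified, non_cycle, dist):
    """Sketch of the overall procedure (assumptions listed in the lead-in)."""
    verified, non_cycle = set(verified), set(non_cycle)
    # Initialization: spectral analysis hits plus prior cycle-involved genes.
    target = grow(set(periodic) | verified, all_genes, dist)
    # False positive control: grow a control set from the noncycle-involved
    # genes; verified genes are never candidates for it (assumption).
    control = grow(non_cycle, target - verified, dist)
    return target - control


# Hypothetical pairwise distances; every unlisted pair has distance 1.0.
pair_dist = {
    frozenset(("p", "a")): 0.10,
    frozenset(("p", "c")): 0.22,
    frozenset(("c", "f")): 0.21,
}


def toy_dist(g, h):
    return pair_dist.get(frozenset((g, h)), 1.0)


result = identify_cyclic_genes(
    all_genes={"v", "p", "a", "c", "f", "z"},
    periodic={"p"},
    verified={"v"},
    non_cycle={"f"},
    dist=toy_dist,
)
# result == {"v", "p", "a"}: "c" joins the target set but is removed because
# it lies close to the noncycle-involved gene "f".
```

Each call to `grow` stops as soon as an iteration adds no gene, and the set can never exceed the candidate pool, which mirrors the convergence argument given above.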
The proposed algorithm was applied to data sets for the unicellular Saccharomyces cerevisiae (budding yeast) and the multicellular Drosophila melanogaster (fruit fly), respectively. The in silico results are discussed briefly here. The full list of identified potential cell cycle genes is presented in the additional files.
3.1. Case Study 1: Saccharomyces Cerevisiae
Although various time series data sets are available, including experiments on human cells, the yeast data set published by Spellman et al. is still among the most popular research targets and benchmarks in computational biology, since it excels in its large number of samples and the simplicity of the genome. The mRNA concentrations of nearly 6200 Open Reading Frames (ORFs) were measured for yeast strains synchronized by four different methods, that is, α factor, cdc15, cdc28, and elutriation. The data set contained 73 sampling points in total for all genes, although observations were missing in some experiments. The detected periodicity matched the yeast cell cycle. Our prior knowledge was derived from two sources: Spellman et al. listed 104 cell cycle genes that had been verified in previous biological experiments, while de Lichtenberg et al. summarized 105 genes that were not involved in the cell cycle.
In order to measure valid time series samples, the cell culture has to be synchronized. In other words, all cells within the culture should be homogeneous in all aspects, for example, cell size, DNA, RNA, protein, and other cellular contents. Cooper in [26, 27] argued that ideal synchronization is impossible because different dimensions, like cell size and DNA content, cannot be controlled at the same time. Therefore, the currently popular synchronization methods, like serum starvation and thymidine blocking, are only one-dimensional synchronization methods and fail to achieve a complete synchronization. It is entirely possible that the discovered periodicity arises purely by chance or from the specific synchronization method. Based on the spectral analysis of Spellman et al. with CDC scores, most of the experimentally verified cell cycle genes exhibit top CDC scores. Hence, the spectral analysis is still highly valuable. However, due to the loss of synchronization and nonstationarity, the threshold for the periodicity test has to be much more stringent in order to suppress false positives. When the cell culture is not ideally synchronized or stationary, the spectral analysis may fail for some data sets, such as the elutriation data set. However, the proposed algorithm is still capable of identifying a set of genes which are closely correlated with the verified cell cycle genes based on all the available data. The exploitation of the prior knowledge, consisting of experimentally verified cell cycle genes and noncell-cycle genes, can help to improve the detection accuracy and combat the negative effects induced by the loss of synchronization and nonstationarity.
3.2. Case Study 2: Drosophila Melanogaster
The multicellular Drosophila melanogaster serves as a good prototype for the research of mammalian diseases because it has only 4 pairs of chromosomes, on which are located abundant genes with mammalian analogs. Our in silico experiments are performed on the Drosophila melanogaster data set published by Arbeitman et al. Using cDNA microarrays, the RNA expression levels of 4028 genes were measured, which account for about one-third of all known fruit fly genes. The synchronization of the cell culture was achieved by the Cryonics method. In the experiments of Arbeitman et al., 75 sequential sampling points were observed, starting right after fertilization and continuing through the embryonic, larval, and pupal stages and the early days of adulthood. There were 134 experimentally verified cycling circadian genes. Among these 134 genes, 52 were measured in Arbeitman's experiment. We did not locate a set of noncell-cycle genes in the Drosophila literature. Therefore, the false positive control procedure was not performed. The shortest time interval between any two sampling points was 30 minutes, which was much larger than the Drosophila cell cycle period. However, the pupal data set had sufficient sampling points to provide insights into the circadian rhythm.
Of the two most extensively studied genes involved in the Drosophila circadian rhythm, one showed relatively prominent periodicity in the pupal stage in Arbeitman's experiment. However, its period was prolonged to more than 24 hours, because the synchronization method slowed down the biological process. Unfortunately, the other gene was not measured in the experiment. A large portion of the identified genes have been verified to participate in metabolism, a process closely controlled by circadian rhythm. Cross-species knowledge might be valuable. However, special precautions must be taken when the two organisms are too different, like the yeast and the fly. The yeast is a unicellular organism with closed mitosis, while the fly is multicellular with open mitosis. The difference between multicellular organisms is less prominent. Therefore, we hypothesize that the prior knowledge of Drosophila might be valuable for the identification of cyclic genes in more advanced species, for example, Homo sapiens. The complete list of identified genes is provided in the supplementary materials.
A novel algorithm is proposed to identify the cyclic-process-involved genes by combining microarray data analysis with the prior knowledge of genes participating in the cyclic process. The in silico experiments were conducted on the data sets corresponding to the unicellular Saccharomyces cerevisiae and the multicellular Drosophila melanogaster. The potential cell cycle and circadian rhythmic genes were identified and compared with the existing computational results. The results corroborate that the proposed algorithm is capable of exploiting all the available data and proposing potential cycle-involved genes.
- Spellman PT, Sherlock G, Zhang MQ, et al.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 1998, 9(12):3273-3297.
- Whitfield ML, Sherlock G, Saldanha AJ, et al.: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Molecular Biology of the Cell 2002, 13(6):1977-2000. 10.1091/mbc.02-02-0030.
- Wichert S, Fokianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics 2004, 20(1):5-20. 10.1093/bioinformatics/btg364
- Ahdesmäki M, Lähdesmäki H, Pearson R, Huttunen H, Yli-Harja O: Robust detection of periodic time series measured from biological systems. BMC Bioinformatics 2005, 6, article 117: 1-18.
- Giurcǎneanu CD: Stochastic complexity for the detection of periodically expressed genes. Proceedings of the 5th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '07), Tuusula, Finland, June 2007 1-4.
- Luan Y, Li H: Model-based methods for identifying periodically expressed genes based on time course microarray gene expression data. Bioinformatics 2004, 20(3):332-339. 10.1093/bioinformatics/btg413
- Lu X, Zhang W, Qin ZS, Kwast KE, Liu JS: Statistical resynchronization and Bayesian detection of periodically expressed genes. Nucleic Acids Research 2004, 32(2):447-455. 10.1093/nar/gkh205
- de Lichtenberg U, Jensen LJ, Fausbøll A, Jensen TS, Bork P, Brunak S: Comparison of computational methods for the identification of cell cycle-regulated genes. Bioinformatics 2005, 21(7):1164-1171. 10.1093/bioinformatics/bti093
- Lomb NR: Least-squares frequency analysis of unequally spaced data. Astrophysics and Space Science 1976, 39(2):447-462. 10.1007/BF00648343
- Scargle JD: Studies in astronomical time series analysis. II. Statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal 1982, 263: 835-853.
- Glynn EF, Chen J, Mushegian AR: Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms. Bioinformatics 2006, 22(3):310-316. 10.1093/bioinformatics/bti789
- Stoica P, Sandgren N: Spectral analysis of irregularly-sampled data: paralleling the regularly-sampled data approaches. Digital Signal Processing 2006, 16(6):712-734. 10.1016/j.dsp.2006.08.012
- Wang Y, Stoica P, Li J, Marzetta TL: Nonparametric spectral analysis with missing data via the EM algorithm. Digital Signal Processing 2005, 15(2):191-206. 10.1016/j.dsp.2004.10.004
- Zhao W, Agyepong K, Serpedin E, Dougherty ER: Detecting periodic genes from irregularly sampled gene expressions: a comparison study. EURASIP Journal on Bioinformatics and Systems Biology 2008, 2008:-8.
- Eyer L, Bartholdi P: Variable stars: which Nyquist frequency? Astronomy and Astrophysics Supplement Series 1999, 135(1):1-3. 10.1051/aas:1999102
- Schwarzenberg-Czerny A: The distribution of empirical periodograms: Lomb-Scargle and PDM spectra. Monthly Notices of the Royal Astronomical Society 1998, 301(3):831-840. 10.1046/j.1365-8711.1998.02086.x
- Cohen J: Statistical Power Analysis for the Behavioral Sciences. 2nd edition. Lawrence Erlbaum, Hillsdale, NJ, USA; 1988.
- de Lichtenberg U, Wernersson R, Jensen TS, et al.: New weakly expressed cell cycle-regulated genes in yeast. Yeast 2005, 22(15):1191-1201. 10.1002/yea.1302
- Caro LHP, Smits GJ, van Egmond P, Chapman JW, Klis FM: Transcription of multiple cell wall protein-encoding genes in Saccharomyces cerevisiae is differentially regulated during the cell cycle. FEMS Microbiology Letters 1998, 161(2):345-349. 10.1111/j.1574-6968.1998.tb12967.x
- Klis FM, Boorsma A, De Groot PWJ: Cell wall construction in Saccharomyces cerevisiae. Yeast 2006, 23(3):185-202. 10.1002/yea.1349
- Smits GJ, Kapteyn JC, van den Ende H, Klis FM: Cell wall dynamics in yeast. Current Opinion in Microbiology 1999, 2(4):348-352. 10.1016/S1369-5274(99)80061-7
- Tamayo P, Slonim D, Mesirov J, et al.: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences of the United States of America 1999, 96(6):2907-2912. 10.1073/pnas.96.6.2907
- Bernstein KA, Baserga SJ: The small subunit processome is required for cell cycle progression at G1. Molecular Biology of the Cell 2004, 15(11):5038-5046. 10.1091/mbc.E04-06-0515
- Bernstein KA, Bleichert F, Bean JM, Cross FR, Baserga SJ: Ribosome biogenesis is sensed at the start cell cycle checkpoint. Molecular Biology of the Cell 2007, 18(3):953-964. 10.1091/mbc.E06-06-0512
- Thomas G: An encore for ribosome biogenesis in the control of cell proliferation. Nature Cell Biology 2000, 2(5):E71-E72. 10.1038/35010581
- Cooper S: Rethinking synchronization of mammalian cells for cell cycle analysis. Cellular and Molecular Life Sciences 2003, 60(6):1099-1106.
- Cooper S: Rejoinder: whole-culture synchronization cannot, and does not, synchronize cells. Trends in Biotechnology 2004, 22(6):274-276. 10.1016/j.tibtech.2004.04.011
- Arbeitman MN, Furlong EEM, Imam F, et al.: Gene expression during the life cycle of Drosophila melanogaster. Science 2002, 297(5590):2270-2275. 10.1126/science.1072152
- McDonald MJ, Rosbash M: Microarray analysis and organization of circadian gene expression in Drosophila. Cell 2001, 107(5):567-578. 10.1016/S0092-8674(01)00545-1
- Supplementary Materials http://www.ece.tamu.edu/~wtzhao/FlyCellCycleGenes.xls
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.