A Hybrid Technique for the Periodicity Characterization of Genomic Sequence Data
© Julien Epps. 2009
Received: 29 May 2008
Accepted: 21 January 2009
Published: 2 March 2009
Many studies of biological sequence data have examined sequence structure in terms of periodicity, and various methods for measuring periodicity have been suggested for this purpose. This paper compares two such methods, autocorrelation and the Fourier transform, using synthetic periodic sequences, and explains the differences in periodicity estimates produced by each. A hybrid autocorrelation-integer period discrete Fourier transform is proposed that combines the advantages of both techniques. Collectively, this representation and a recently proposed variant on the discrete Fourier transform offer alternatives to the widely used autocorrelation for the periodicity characterization of sequence data. Finally, these methods are compared for various tetramers of interest in C. elegans chromosome I.
The detection of structure within the DNA sequence has long captivated the interest of the research community. Among the various statistical characterizations of sequence data, one measure of structure within sequences is the degree of correlation or periodicity at various displacements along the sequence. Periodicity characterization of sequence data provides a compact and informative representation that has been used in many studies of structure within genomic sequences, including DNA sequence analysis , gene and exon detection , tandem repeat detection , and DNA sequence search and retrieval .
To measure such periodicity, autocorrelation has been widely employed [1, 5–11]. Similarly, Fourier analysis and its variants have been used for periodicity characterization of sequences [4, 9, 12–24]. In some cases [25, 26], the Fourier transform of the autocorrelation sequence has also been computed; however, using existing symbolic-numeric mappings such as binary indicator sequences , this transform can also be calculated without first determining the autocorrelation. Other recent promising approaches to periodicity characterization for biological sequences include the periodicity transform , the exactly periodic subspace decomposition , and maximum-likelihood statistical periodicity ; however, these techniques have yet to be adopted by biologists for the purposes of sequence structure characterization.
Studies of structure within sequences, such as those referenced above, have tended to use either the autocorrelation or the Fourier transform, and to the author's knowledge, the limitations of each have not been compared in this context. In this paper, the limitations of both approaches are investigated using synthetic symbolic sequences, and caveats to their characterization of sequence data are discussed. A hybrid approach to periodicity characterization of symbolic sequence data is introduced, and its use is illustrated in a comparative manner on a study of tetramers in C. elegans.
2. Periodicity Measures for Symbolic Sequence Characterization
2.1. Definition of Periodicity
While this expression of in terms of a binary impulse train is perhaps not so common in signal processing of numerical sequences, the reverse is true for DNA sequences, which have been represented numerically using binary indicator sequences  in many studies (e.g., [13, 19, 23, 24, 30]).
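As an illustration of the binary indicator representation mentioned above, the following is a minimal sketch, assuming the common convention of one 0/1 sequence per nucleotide symbol. The function name `binary_indicators` and the example string are hypothetical and are not taken from the paper.

```python
import numpy as np

def binary_indicators(seq, alphabet="ACGT"):
    """Map a symbolic sequence to one 0/1 array per symbol (assumed convention)."""
    symbols = np.array(list(seq.upper()))
    # Each indicator is 1 where the symbol occurs and 0 elsewhere.
    return {s: (symbols == s).astype(int) for s in alphabet}

# Hypothetical toy sequence with an "AT" repeat followed by "CG".
indicators = binary_indicators("ATATCG")
```

Each indicator sequence can then be treated as an ordinary numerical signal, so that autocorrelation or Fourier analysis applies to it directly.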
2.3. Fourier Interpretation of Periodicity
where k is the discrete frequency index. Since the DFT has sinusoidal basis functions, the notion of periodicity in the Fourier sense is described in terms of the frequencies of those basis functions onto which the projections of are the largest in magnitude. That is, the magnitude of the DFT at a frequency k, , is often taken as an estimate of the relative amount of that frequency component occurring in [13, 19, 23, 24], from which the relative contribution of a particular period can be estimated.
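The Fourier view of periodicity can be illustrated with a short sketch that computes the DFT magnitude of a binary impulse train using NumPy's FFT. The length N = 60 and period p = 5 are arbitrary values chosen for illustration only, and the threshold used to pick out nonzero bins is likewise an assumption.

```python
import numpy as np

N, p = 60, 5                      # illustrative length and period (assumed values)
x = np.zeros(N)
x[::p] = 1.0                      # binary impulse train: a 1 every p samples

mag = np.abs(np.fft.fft(x))       # DFT magnitude, indexed by frequency k
peaks = np.flatnonzero(mag > 1e-9)

# Convert each nonzero frequency index k > 0 to its equivalent period N / k.
periods = [N / k for k in peaks if k > 0]
```

For this example the nonzero bins fall at k = 0, 12, 24, 36, and 48, that is, at multiples of N/p. The fundamental at k = 12 corresponds to the period 5, while the higher harmonics map to the non-integer periods 2.5, 1.67, and 1.25.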
Using a similar process to that described above in (10) and (11), the numerical representation of a symbolic sequence can also be transformed using the IPDFT to produce a spectrum that is linear in period (ρ) rather than in frequency (k). For the periodicity characterization of sequences, usually the magnitude is of greatest interest. Some care is needed in the interpretation of the IPDFT, since for a binary periodic sequence such as of fixed length N, will decrease for longer periods due to the fact that the energy of is .
where . That is, is relatively large for , and relatively small for . From this, we see that a shortcoming of Fourier transform approaches such as the IPDFT for sequence characterization by periodicity is that they produce not only a peak at , but also peaks at values of that are integer divisors of the period p (see example in Figure 1(b)). For the DFT, this effect is also seen, but instead for indices whose value is (i.e., harmonics of the frequency with integer frequency indices).
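The divisor effect described above can be reproduced with a small sketch. The IPDFT definition is not reproduced in this text, so the code assumes the form X[ρ] = Σ x[n] exp(-j2πn/ρ) evaluated at integer periods ρ. The function name `ipdft`, the length N = 120, the period p = 12, and the peak threshold are all illustrative assumptions.

```python
import numpy as np

def ipdft(x, max_period):
    """Assumed integer-period DFT: X[rho] = sum_n x[n] * exp(-2j*pi*n/rho)."""
    n = np.arange(len(x))
    periods = np.arange(1, max_period + 1)
    # One complex exponential basis function per candidate integer period.
    basis = np.exp(-2j * np.pi * np.outer(1.0 / periods, n))
    return periods, basis @ x

N, p = 120, 12                    # illustrative values, not from the paper
x = np.zeros(N)
x[::p] = 1.0                      # exactly periodic binary sequence

periods, X = ipdft(x, 40)
mag = np.abs(X)
# Periods whose magnitude matches the maximum (within a small tolerance).
strongest = periods[mag > 0.999 * mag.max()]
```

Under these assumptions the largest magnitudes occur at ρ = 1, 2, 3, 4, 6, and 12, which are exactly the integer divisors of p, so the true period cannot be identified from the IPDFT magnitude alone.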
2.4. Periodicity of a Synthetic Sequence Using Autocorrelation and DFT
To illustrate the shortcomings of the autocorrelation and DFT discussed in Sections 2.2 and 2.3, consider the periodicity characterization of an example signal (i.e., exact monomer periodicity ), where and . The autocorrelation and IPDFT are shown in Figures 1(a) and 1(b), respectively, from which the ambiguities in period estimate discussed in Sections 2.2 and 2.3 can be clearly seen.
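The complementary ambiguity of the autocorrelation can be sketched in the same way. The autocorrelation definition is not reproduced in this text, so the code assumes the unnormalised form r[ρ] = Σ x[n] x[n+ρ]. The function name `autocorr` and the values N = 120 and p = 12 are illustrative assumptions rather than the parameters used for Figure 1.

```python
import numpy as np

def autocorr(x, max_lag):
    """Assumed unnormalised autocorrelation r[lag] = sum_n x[n] * x[n + lag]."""
    lags = np.arange(1, max_lag + 1)
    r = np.array([np.dot(x[:-lag], x[lag:]) for lag in lags])
    return lags, r

N, p = 120, 12                    # illustrative values, not from the paper
x = np.zeros(N)
x[::p] = 1.0                      # exactly periodic binary sequence

lags, r = autocorr(x, 40)
nonzero_lags = lags[r > 0]
```

Here the autocorrelation is nonzero only at lags 12, 24, and 36, with values 9, 8, and 7, so peaks appear at integer multiples of the true period rather than at its divisors.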
3. Hybrid Autocorrelation-IPDFT Periodicity Estimation
3.1. Hybrid Autocorrelation-IPDFT
For the simple example signal from Section 2.4, the calculation of results in a single, unambiguous periodicity estimate, as seen in Figure 1(c).
where , which may be helpful for biologists who have conventionally used either the autocorrelation ( ) or the Fourier transform ( ). For the purpose of sequence periodicity visualization, for example, could be represented as a parameter available for real-time control, so that a biologist viewing a periodicity characterization of a sequence might subjectively assign a relative weight to each of the autocorrelation and Fourier transform components. Care is needed, however, with the application of (15), since is only well defined for for all . Note that this is satisfied by the autocorrelation defined in (8), in addition to a number of DNA numerical representations (several example representations are discussed in ). It is further noted that (14) and (15) do not have a straightforward physical interpretation, in contrast to and .
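The following sketch illustrates how such a hybrid might be computed. Equations (14) and (15) are not reproduced in this text, so two assumptions are made and should be read as such: the hybrid is taken to be the pointwise product of the autocorrelation and the IPDFT magnitude, and the weighted variant is taken to be r^α · |X|^(1-α) with α between 0 and 1. All function names and the values N = 120 and p = 12 are hypothetical.

```python
import numpy as np

def autocorr(x, lags):
    # Assumed unnormalised autocorrelation r[lag] = sum_n x[n] * x[n + lag].
    return np.array([np.dot(x[:-lag], x[lag:]) for lag in lags], dtype=float)

def ipdft_mag(x, periods):
    # Magnitude of an assumed integer-period DFT: |sum_n x[n] exp(-2j*pi*n/period)|.
    n = np.arange(len(x))
    return np.abs(np.exp(-2j * np.pi * np.outer(1.0 / periods, n)) @ x)

def hybrid(r, mag, alpha=None):
    # alpha=None: plain product (assumed hybrid form).
    # Otherwise: assumed weighted form r**alpha * mag**(1 - alpha), alpha in [0, 1].
    if alpha is None:
        return r * mag
    if np.any(r < 0):
        raise ValueError("fractional powers need a non-negative autocorrelation")
    return r ** alpha * mag ** (1.0 - alpha)

N, p = 120, 12                    # illustrative values, not from the paper
x = np.zeros(N)
x[::p] = 1.0                      # exactly periodic binary sequence

periods = np.arange(1, 41)
r = autocorr(x, periods)
mag = ipdft_mag(x, periods)
h = hybrid(r, mag)
best_period = int(periods[np.argmax(h)])
```

In this example the autocorrelation vanishes at the divisors of 12 and the IPDFT magnitude is small at its multiples, so the product is dominated by the value at the true period, 12. Under the assumed weighted form, α = 1 returns the autocorrelation and α = 0 returns the IPDFT magnitude, and the non-negativity check reflects the caveat that fractional powers are only well defined for non-negative inputs.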
3.2. Evaluation of Periodicity Estimation in Noise
3.3. Evaluation of Multiple Periodicity Estimation
It is noted that the signal processing literature includes examples of methods for detecting multiple periodic signal components, such as the MUSIC algorithm . For comparative purposes, the above experiment was repeated employing MUSIC to estimate the strengths of the periodic components. Results indicated that MUSIC was unable to consistently estimate either the periods or the relative strengths of the three components, returning no instances of all three periods correct and in the correct order. The dominant period estimate often contained the common factors of two or more of the true periodic components, an artifact attributable to the superposition of harmonic spectra reinforcing multiples of the individual component fundamentals that coincide in frequency. Two assumptions of MUSIC are not valid for this application: (i) the periodic components are not sinusoidal (although they can be represented as a harmonic series of sinusoids), (ii) the periodic components and noise may not be uncorrelated.
4. Application to DNA Sequence Data
Having discussed the differences between the autocorrelation and DFT for synthetic sequences, we now investigate the effect of using the IPDFT and hybrid autocorrelation-IPDFT in place of the autocorrelation on real sequence data. Numerous researchers have used autocorrelation [1, 5–10, 32]; here we compare with examples from the study of tetramer periodicity in the C. elegans genome using autocorrelation by Kumar et al. .
Note also that the IPDFT reveals a strong period-25 component that is not at all evident in the autocorrelation. This surprising result was verified by constructing a synthetic sequence with perfect periodic components at periods 2 and 25, and examining its autocorrelation and IPDFT. The autocorrelation of the sequence did not visibly display any significant peak at period 25 until the period-2 component had been eroded by at least 80%. In contrast, the IPDFT showed a clear peak at period 25 with no period-2 erosion at all. The period-25 component has rarely been noted in previous literature; however, in , a filtered distribution of distances between TA dinucleotides shows a strong peak at , which Salih et al. attribute to a 5-base periodicity associated with the period-10 consensus sequence structure for C. elegans.
This paper has made two contributions to the periodicity characterization of sequence data. Firstly, the origins of ambiguities in period estimates for symbolic sequences due to multiples or submultiples of the true period in the autocorrelation and Fourier transform methods, respectively, were explained. This is significant because these two methods account for perhaps the majority of the periodicity analysis seen in the biology literature, and yet, to the author's knowledge, their limitations have not been discussed in this context. Secondly, a hybrid autocorrelation-IPDFT technique for periodicity characterization of sequences has been proposed. This technique has been shown to provide improved accuracy relative to the autocorrelation and IPDFT for period estimation in noise and multiple periodicity estimation, for synthetic sequence data. Comparative results from a preliminary investigation of tetramers in C. elegans chromosome I suggest that the proposed approach yields estimates that are consistently less prone to attribute significance to integer multiples or divisors of the true period(s). Thus, the hybrid autocorrelation-IPDFT is putatively advanced as a useful tool for biologists in their quest to reveal and explain structure within biological sequences. Future work will include studies of different types of periodicity in sequence data from other organisms, using IPDFT-based and hybrid techniques.
The author would like to thank two anonymous reviewers for a number of helpful suggestions, which have certainly improved the quality of this paper. Thanks are also due to Professor Eliathamby Ambikairajah for helpful discussions. This research was supported by a University of New South Wales Faculty of Engineering Early Career Research Grant for genomic signal processing, 2009.
[1] Kumar L, Futschik M, Herzel H: DNA motifs and sequence periodicities. In Silico Biology 2006, 6(1-2):71-78.
[2] Trifonov EN: 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Physica A 1998, 249(1–4):511-516.
[3] Muresan DD, Parks TW: Orthogonal, exactly periodic subspace decomposition. IEEE Transactions on Signal Processing 2003, 51(9):2270-2279. doi:10.1109/TSP.2003.815381
[4] Santo E, Dimitrova N: Improvement of spectral analysis as a genomic analysis tool. Proceedings of the 5th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '07), Tuusula, Finland, June 2007.
[5] Bernaola-Galván P, Carpena P, Román-Roldán R, Oliver JL: Study of statistical correlations in DNA sequences. Gene 2002, 300(1-2):105-115. doi:10.1016/S0378-1119(02)01037-5
[6] Chakravarthy N, Spanias A, Iasemidis LD, Tsakalis K: Autoregressive modeling and feature analysis of DNA sequences. EURASIP Journal on Applied Signal Processing 2004, 2004(1):13-28. doi:10.1155/S111086570430925X
[7] Herzel H, Trifonov EN, Weiss O, Große I: Interpreting correlations in biosequences. Physica A 1998, 249(1–4):449-459.
[8] Li W: The study of correlation structures of DNA sequences: a critical review. Computers and Chemistry 1997, 21(4):257-271. doi:10.1016/S0097-8485(97)00022-3
[9] McLachlan AD: Multichannel Fourier analysis of patterns in protein sequences. The Journal of Physical Chemistry 1993, 97(12):3000-3006. doi:10.1021/j100114a028
[10] Peng C-K, Buldyrev SV, Goldberger AL, et al.: Long-range correlations in nucleotide sequences. Nature 1992, 356(6365):168-170. doi:10.1038/356168a0
[11] Salih F, Salih B, Trifonov EN: Sequence structure of hidden 10.4-base repeat in the nucleosomes of C. elegans. Journal of Biomolecular Structure and Dynamics 2008, 26(3):273-281.
[12] Afreixo V, Ferreira PJSG, Santos D: Fourier analysis of symbolic data: a brief review. Digital Signal Processing 2004, 14(6):523-530. doi:10.1016/j.dsp.2004.08.001
[13] Anastassiou D: Genomic signal processing. IEEE Signal Processing Magazine 2001, 18(4):8-20. doi:10.1109/79.939833
[14] Berger JA, Mitra SK, Astola J: Power spectrum analysis for DNA sequences. Proceedings of the 7th International Symposium on Signal Processing and Its Applications (ISSPA '03), Paris, France, July 2003 2: 29-32.
[15] Coward E: Equivalence of two Fourier methods for biological sequences. Journal of Mathematical Biology 1997, 36(1):64-70. doi:10.1007/s002850050090
[16] Datta S, Asif A: A fast DFT based gene prediction algorithm for identification of protein coding regions. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '05), Philadelphia, Pa, USA, March 2005 5: 653-656.
[17] Dodin G, Vandergheynst P, Levoir P, Cordier C, Marcourt L: Fourier and wavelet transform analysis, a tool for visualizing regular patterns in DNA sequences. Journal of Theoretical Biology 2000, 206(3):323-326. doi:10.1006/jtbi.2000.2127
[18] Emanuele VA II, Tran TT, Zhou GT: A Fourier product method for detecting approximate tandem repeats in DNA. Proceedings of the 13th IEEE/SP Workshop on Statistical Signal Processing (SSP '05), Bordeaux, France, July 2005 1390-1395.
[19] Epps J, Ambikairajah E, Akhtar M: An integer period DFT for biological sequence processing. Proceedings of the 6th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '08), Phoenix, Ariz, USA, June 2008 1-4.
[20] Issac B, Singh H, Kaur H, Raghava GPS: Locating probable genes using Fourier transform approach. Bioinformatics 2002, 18(1):196-197. doi:10.1093/bioinformatics/18.1.196
[21] Makeev VJu, Tumanyan VG: Search of periodicities in primary structure of biopolymers: a general Fourier approach. Computer Applications in the Biosciences 1996, 12(1):49-54.
[22] Silverman BD, Linsker R: A measure of DNA periodicity. Journal of Theoretical Biology 1986, 118(3):295-300. doi:10.1016/S0022-5193(86)80060-1
[23] Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R: Prediction of probable genes by Fourier analysis of genomic sequences. Computer Applications in the Biosciences 1997, 13(3):263-270.
[24] Wang W, Johnson DH: Computing linear transforms of symbolic signals. IEEE Transactions on Signal Processing 2002, 50(3):628-634. doi:10.1109/78.984752
[25] Hosid S, Trifonov EN, Bolshoy A: Sequence periodicity of Escherichia coli is concentrated in intergenic regions. BMC Molecular Biology 2004, 5, article 14: 1-7.
[26] Worning P, Jensen LJ, Nelson KE, Brunak S, Ussery DW: Structural analysis of DNA sequence: evidence for lateral gene transfer in Thermotoga maritima. Nucleic Acids Research 2000, 28(3):706-709. doi:10.1093/nar/28.3.706
[27] Voss RF: Evolution of long-range fractal correlations and noise in DNA base sequences. Physical Review Letters 1992, 68(25):3805-3808. doi:10.1103/PhysRevLett.68.3805
[28] Sethares WA, Staley TW: Periodicity transforms. IEEE Transactions on Signal Processing 1999, 47(11):2953-2964. doi:10.1109/78.796431
[29] Arora R, Sethares WA: Detection of periodicities in gene sequences: a maximum likelihood approach. Proceedings of the 5th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '07), Tuusula, Finland, June 2007.
[30] Akhtar M, Epps J, Ambikairajah E: Signal processing in sequence analysis: advances in eukaryotic gene prediction. IEEE Journal on Selected Topics in Signal Processing 2008, 2(3):310-321.
[31] Schmidt RO: Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation 1986, 34(3):276-280. doi:10.1109/TAP.1986.1143830
[32] Li W, Marr TG, Kaneko K: Understanding long-range correlations in DNA sequences. Physica D 1994, 75(1–3):392-416.
[33] Fire A, Alcazar R, Tan F: Unusual DNA structures associated with germline genetic activity in Caenorhabditis elegans. Genetics 2006, 173(3):1259-1273. doi:10.1534/genetics.106.057364
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.