Wavelet analysis of frequency chaos game signal: a time-frequency signature of the C. elegans DNA

Challenging tasks are encountered in the field of bioinformatics. The choice of the genomic sequence’s mapping technique is one of the most demanding. A judicious choice helps in examining the distribution of periodic patterns that agree with the underlying structure of genomes. Despite that, the search for a coding technique that can highlight all the information contained in the DNA has not yet attracted the attention it deserves. In this paper, we propose a new mapping technique based on the chaos game theory that we call the frequency chaos game signal (FCGS). The particularity of the FCGS coding resides in exploiting the statistical properties of the genomic sequence itself. This may reflect important structural and organizational features of DNA. To prove the usefulness of the FCGS approach in the detection of different local periodic patterns, we use wavelet analysis because it provides access to local information that can be obscured by other methods such as Fourier analysis. Thus, we apply the continuous wavelet transform (CWT) with the complex Morlet wavelet as the mother wavelet function. Scalograms that relate to the organism Caenorhabditis elegans (C. elegans) exhibit a multitude of periodic organizations of specific DNA sequences.
Electronic supplementary material: The online version of this article (doi:10.1186/s13637-014-0016-z) contains supplementary material, which is available to authorized users.


Introduction
The fundamental information for a living being resides essentially in its nucleic material, the DNA. This molecule contains all the instructions needed to produce proteins and enzymes for all of the metabolic pathways. Thus, revealing the structural and organizational features in DNA sequences is a very interesting topic. However, the search for relevant information along the genomic sequences is not an easy task. In fact, although several programs have been created to detect valuable information concerning the DNA, much work remains to be done. In order to better understand the role and structure of genomic sequences, several signal processing approaches have been investigated. To be able to apply such techniques, it is imperative to convert DNA characters into numerical sequences. This operation is the so-called coding technique. Various approaches for DNA character coding have been reported, including the binary coding [1,2], the inter-distance signals [3], coding with the entropy measure [4], the electron-ion interaction pseudo-potential (EIIP) mapping [5], the structural bending trinucleotide coding (PNUC) [2], etc.
*Correspondence: imen.messaoudi@enit.rnu.tn. 1 Université de Tunis El Manar, Ecole Nationale d'Ingénieurs de Tunis, LR Signal, Images et Technologies de l'Information, BP 37, le Belvédère, 1002 Tunis, Tunisia. Full list of author information is available at the end of the article.
The choice of the most appropriate coding technique for a desired analysis represents a basic problem. It turns out that coding techniques based on physical, chemical and structural DNA characteristics are efficient in revealing specific structures, as is the case with the EIIP and PNUC coding approaches.
Here, we propose a new mapping technique inspired by the chaos game theory, which we name 'frequency chaos game signals' (FCGS). The FCGS approach relies on assigning to each sub-pattern its frequency value, which gives us the opportunity to produce several signals for the same input sequence, depending on the size of the considered sub-patterns. The specificity of our coding consists in exploiting the statistical properties of the genomic sequence itself, which may serve in detecting interesting structures within the DNA sequences.
http://bsb.eurasipjournals.com/content/2014/1/16
The efficiency of our method in detecting different biological events is demonstrated through application of the continuous wavelet transform (CWT). The choice of the CWT is justified by the need for a time-frequency approach that provides local frequency information, which is not guaranteed by other transforms such as the Fourier transform. In fact, the classical Fourier transform does not contain local information. Thus, it appears that the short-time Fourier transform (STFT) is better suited to predict sites with biological relevance in the genomic signals. Nevertheless, this method requires a good choice of the analysis window's size, which must balance the frequency and temporal resolutions. The short-time Fourier transform induces interferences and loss of information [6]. With the advent of the wavelet transform (WT), one can get a more precise and more adequate analysis, especially concerning the location of hotspots in signals of complex nature, which is the case of genomic signals [5,7-10].
In this paper, we investigate the role of the CWT in displaying the frequency-dependent structure of genomic signals by using the complex Morlet wavelet scalogram. The purpose of this analysis is to reveal spectral features that might be of biological significance in the Caenorhabditis elegans (C. elegans) genome. The particularity of this study is that it introduces a new coding technique that is efficient for DNA characterization.
This paper is divided into five sections: First, we describe the steps required to generate the frequency chaos game signals in section 2. In section 3, we deal with the complex wavelet analysis in which we give an overview on the continuous wavelet transform as well as a brief description of the complex Morlet wavelet. In section 4, we analyze the DNA sequences by the Morlet wavelet, and then we expose and discuss the results in section 5. Finally, in section 6, we conclude this paper.

Introduction to the frequency chaos game signals
Starting from the pioneering work of Jeffrey in 1990, representing DNA sequences by the chaos game representation (CGR) has met with resounding success. In fact, for more than two decades, the chaos game representation has been used as a platform for pattern recognition [11,12], a generalization of Markov transition tables [13], a tool for statistical characterization of genomic sequences [11,14,15], as well as a basis for alignment comparisons [16] and establishment of phylogenetic trees [17]. The CGR is an iterative algorithm that provides a unique scatter picture of fractal nature. It consists in mapping a nucleotide sequence into a unit square, where each vertex is assigned to a DNA character (nucleotides: A, C, G and T). Let us consider a given DNA sequence composed of N nucleotides, $S = \{S_1, S_2, \ldots, S_N\}$. An element occupying the ith position in S is represented in the square by a point $x_i$. The point $x_i$ is placed halfway along the segment joining the previously plotted point $x_{i-1}$ to the vertex corresponding to the read letter $S_i$ [18]. Writing $v_{S_i}$ for that vertex, the iterative function of the CGR is given by
$$x_i = x_{i-1} + \frac{1}{2}\left(v_{S_i} - x_{i-1}\right), \qquad i = 1, \ldots, N \qquad (1)$$
Usually, the starting point $x_0$ is placed at the center of the square, while the choice of the corners is arbitrary and can be assigned in any other way. Figure 1 shows the procedure to draw the sequence 'TTAGC'.
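The midpoint rule described above can be sketched in a few lines of Python. This is only an illustration: the corner layout in `CORNERS` and the function name `cgr_points` are assumptions made here (the text states that the corner assignment is arbitrary), not the layout used in Figure 1.

```python
# Assumed corner layout; the text notes that the assignment is arbitrary.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR points x_1..x_N of a DNA string as (x, y) tuples."""
    x, y = start  # x_0 at the center of the unit square
    points = []
    for base in seq.upper():
        vx, vy = CORNERS[base]
        # Midpoint rule: move halfway from the previous point to the vertex.
        x, y = x + 0.5 * (vx - x), y + 0.5 * (vy - y)
        points.append((x, y))
    return points

pts = cgr_points("TTAGC")
for base, p in zip("TTAGC", pts):
    print(base, p)
```

With this layout, the first 'T' moves the point from the center to (0.75, 0.25), and each further letter halves the remaining distance to its corner.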
The usefulness of the chaos game representation goes beyond the convenience of genome representation and visualization. In addition, it provides a unique image which is specific to the considered genome [19,20] and thus forms an outstanding genomic signature [21].
The CGR technique reveals several hidden patterns that arise from distinct k-tuple compositions in DNA sequences. The frequency of occurrence of these patterns can be estimated by the use of the frequency chaos game representation (FCGR) [22]. The latter approach consists in dividing the CGR image into $4^k$ small squares, where each sub-square is associated with a sub-pattern and has a side of $1/2^k$. The number of points in each sub-square thus created is then counted. This procedure allows extraction of the frequency of occurrence of k-length words by dividing the number of dots in the corresponding sub-square by the complete length of the DNA sequence. To visualize the frequencies of occurrence of the associated patterns, a normalized colour scheme is used. The darker pixels in the FCGR images represent the most frequently used words, whereas the lightest ones represent the most avoided words [23]. Figure 2 is divided into two blocks: the first block illustrates the arrangement of oligomers in the FCGR's sub-squares for k = {1, 2, 3}, and the second one shows the frequency chaos game representations calculated for chromosome I of the organism C. elegans.
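As a hedged sketch of this counting procedure, the snippet below builds a $2^k \times 2^k$ frequency matrix directly from k-mer counts instead of binning plotted points, which gives the same cell for every word. The `QUADRANT` layout, the row orientation (row 0 at the bottom edge of the square) and the function names are assumptions for illustration; the sequence is assumed to contain only A, C, G and T.

```python
from collections import Counter

# Assumed corner layout: nucleotide -> (column bit, row bit) of its quadrant.
QUADRANT = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def kmer_cell(word):
    """Row and column of a k-mer's sub-square in the 2^k x 2^k grid."""
    row = col = 0
    # The last letter read selects the coarsest quadrant (most significant bit).
    for base in reversed(word):
        cbit, rbit = QUADRANT[base]
        col = (col << 1) | cbit
        row = (row << 1) | rbit
    return row, col

def fcgr(seq, k):
    """k-th order FCGR: k-mer counts divided by the sequence length."""
    seq = seq.upper()
    n = len(seq)
    size = 2 ** k
    counts = Counter(seq[i:i + k] for i in range(n - k + 1))
    matrix = [[0.0] * size for _ in range(size)]
    for word, count in counts.items():
        r, c = kmer_cell(word)
        matrix[r][c] = count / n  # normalized by the complete sequence length
    return matrix

for row in fcgr("TTTTAGTGAAGCTTCTAGAT", 1):
    print(row)
```

Working from integer bit patterns rather than floating-point coordinates avoids points drifting onto a cell boundary in long sequences.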
Although representations based on the chaos game theory (namely CGR and FCGR) have been successfully applied to a wide range of problems, their capacity to follow the evolution of frequencies along DNA sequences remains, so far, totally unexplored. This motivates us to exploit the FCGR method to build signals that let us follow the frequency evolution of oligomers through a given sequence. We give these signals a particular name: the FCGSs. This new mapping technique is based on assigning the frequency of occurrence of each oligomer to the same sub-pattern wherever it occurs in the sequence. For this purpose, two steps are required:
• The first step consists in generating the kth-order FCGR for the entire sequence. The FCGR matrix is expressed as follows:
$$\mathrm{FCGR}_k = \begin{pmatrix} f_{1,1} & \cdots & f_{1,2^k} \\ \vdots & \ddots & \vdots \\ f_{2^k,1} & \cdots & f_{2^k,2^k} \end{pmatrix} \qquad (2)$$
where $f_{i,j}$ is the frequency value of the word situated at the intersection of the ith row and the jth column in the k-mer matrix.
• The second step consists in reading the input sequence by groups of k successive nucleotides and replacing each group by the corresponding frequency already calculated in the FCGR$_k$ matrix.
In this sense, an FCGS$_k$ can be generated by
$$\mathrm{FCGS}_k(n) = \mathrm{FCGR}_{k,i,j}, \qquad n = 1, \ldots, N-k+1 \qquad (3)$$
where $(i, j)$ is the cell of the FCGR$_k$ matrix associated with the k-mer $S_n S_{n+1} \cdots S_{n+k-1}$. Here, k is the frequency chaos game representation's order and FCGR$_{k,i,j}$ refers to the FCGR$_k$ element placed at the intersection of the ith row and the jth column. As an illustrative example of the FCGS technique, we consider the sequence S = {TTTTAGTGAAGCTTCTAGAT}. To encode S by FCGS$_1$, FCGS$_2$ and FCGS$_3$, we must calculate the FCGR matrices for orders 1, 2 and 3. Then, we extract all the oligomers of length 1, 2 and 3, and we attribute to each monomer, dimer and trimer its occurrence frequency from the corresponding frequency matrix. In this case, we enumerate 20 monomers, 19 dimers and 18 trimers. For illustration, we only consider 18 oligomers of each length. At the end, we obtain three different signals, which are illustrated in Figure 3.
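The two steps can be sketched for the example sequence as follows. This is a minimal illustration, not the authors' implementation: the function name `fcgs` is hypothetical, a dictionary of k-mer counts stands in for the FCGR$_k$ matrix lookup, and frequencies are normalized by the sequence length as described for the FCGR.

```python
from collections import Counter

def fcgs(seq, k):
    """Replace each overlapping k-mer by its frequency in the whole sequence."""
    seq = seq.upper()
    n = len(seq)
    words = [seq[i:i + k] for i in range(n - k + 1)]  # step 2: sliding k-mers
    counts = Counter(words)                           # step 1: k-mer counts
    return [counts[w] / n for w in words]

S = "TTTTAGTGAAGCTTCTAGAT"
signals = {k: fcgs(S, k) for k in (1, 2, 3)}
for k, sig in signals.items():
    print(k, len(sig), sig[:5])
```

The three signals have 20, 19 and 18 samples, matching the numbers of monomers, dimers and trimers quoted above.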
Note that increasing the FCGS order induces a more smoothed signal which is useful in capturing the important underlying patterns [24]. The smoothing is often used in enhancing the long-term trends that can be hidden in the original signal. This makes our coding technique suitable for fine studies. To demonstrate the effectiveness and usefulness of our coding, we chose to apply the complex Morlet wavelet analysis. By such application, we will note the smoothing effect in determining the characteristic patterns of certain areas of the DNA.

The wavelet transform analysis
The wavelet transform (WT) was introduced by Morlet in 1983 to study seismic signals. The proposed processing was then formalized in 1984 with contributions from Grossman [25]. Since then, wavelet theory has been the subject of diverse theoretical developments and practical applications. In this section, we focus on the application of the wavelet transform to the C. elegans genome with the aim of exploring its composition.

The continuous wavelet transform
The CWT of an arbitrary signal is a linear operation that consists in projecting the signal $x(t)$ onto a wavelet basis. Mathematically, the CWT is given by Equation 4:
$$W_{(a,b)} = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt \qquad (4)$$
where $a$ ($a > 0$) and $b$ ($b \in \mathbb{R}$) are respectively the scale and the time-shift parameters, and $^{*}$ denotes complex conjugation. Here, $\psi\left(\frac{t-b}{a}\right)$ is a scaled and shifted version of the so-called mother wavelet function $\psi(t)$. The mother wavelet $\psi(t)$, which is a wave-like oscillation, can be extended to its daughter wavelets in terms of the shift parameter $b$ and the scale parameter $a$:
$$\psi_{(a,b)}(t) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{t-b}{a}\right) \qquad (5)$$
At fixed scale and translation parameters ($a$ and $b$), the wavelet transform coefficient, denoted by $W_{(a,b)}$, represents the inner product of the daughter wavelet and the signal; this operation measures the degree of their resemblance at the concerned point. If $x(t)$ is equal to $\psi_{(a,b)}(t)$, the wavelet coefficient equals 1. Hence, the closer to 1 the coefficient is, the stronger the similarity will be.
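A minimal sketch of this inner product for a sampled signal is given below. It assumes a unit sampling step and uses the unit-energy complex Morlet form $\pi^{-1/4} e^{i\omega_0 t} e^{-t^2/2}$ (the wavelet family introduced in the next subsection) as the mother wavelet; the function names are illustrative, and the direct summation is chosen for readability rather than speed.

```python
import cmath
import math

def morlet(t, w0=5.4285):
    """Complex Morlet mother wavelet (assumed unit-energy form)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)

def cwt_coefficient(x, a, b, psi=morlet):
    """Discretized W(a, b) = a^(-1/2) * sum_t x[t] * conj(psi((t - b) / a))."""
    total = sum(x[t] * psi((t - b) / a).conjugate() for t in range(len(x)))
    return total / math.sqrt(a)

# Toy signal with a 3-sample period.
period = 3
x = [math.cos(2 * math.pi * t / period) for t in range(120)]

a_match = 5.4285 * period / (2 * math.pi)  # scale whose oscillation has period 3
b = 60
matched = abs(cwt_coefficient(x, a_match, b))
mismatched = abs(cwt_coefficient(x, 2.5 * a_match, b))
print(matched, mismatched)
```

The coefficient is large when the daughter wavelet oscillates at the same period as the signal and close to zero at a poorly matched scale, which is the resemblance measure described above.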
Mother wavelets are band-pass filters that oscillate in the time domain; the wavelet expands or compresses depending on the scale value. When $a$ is large, the mother wavelet is stretched and serves for the detection of low frequencies. In this case, the resolution in the time domain is low. On the contrary, when $a$ is small, the mother wavelet is compressed, i.e. the frequency domain's resolution becomes low in favor of the time domain's resolution. Mathematically, the dilated and normalized mother wavelet function $\frac{1}{\sqrt{a}}\,\psi\left(\frac{t}{a}\right)$ admits $\sqrt{a}\,\hat{\psi}(a\omega)$ as its Fourier transform, which explains the fact that an expansion in time induces a contraction in the frequency domain and conversely. This property makes wavelet analysis a relevant tool for the characterization of signals as well as for the detection and identification of special spectral features. The mother wavelet function can be real or complex, as in the case of the complex Morlet wavelet, which is briefly described in the following.

The complex Morlet wavelet
The effectiveness of the wavelet transform in analyzing signals of complex nature (such as genomic signals) depends on the choice of the basis function. In this study, we chose the complex Morlet wavelet. The advantage of this mother wavelet is that it admits a parametrized bandwidth. This provides extra flexibility, which ensures a good time-frequency resolution. The complex Morlet wavelet is a plane wave modulated by a Gaussian envelope and presents a quick attenuation [26]. Its mother wavelet function is expressed as
$$\psi(t) = \pi^{-1/4}\, e^{i\omega_0 t}\, e^{-t^2/2} \qquad (6)$$
where $\omega_0$ corresponds to the number of oscillations of the wavelet. Strictly speaking, $\omega_0$ must be greater than 5 for the admissibility criterion to hold to a good approximation. This admissibility condition is required of all mother wavelets for the continuous wavelet transform to be invertible. The admissibility condition implies that the Fourier transform of the mother wavelet is 0 at frequency 0 [27]. This ensures that the mother wavelet oscillates, which means that it acts as a band-pass filter. The Fourier transform of the complex Morlet wavelet function is given, up to a constant factor that depends on the Fourier convention, by
$$\hat{\psi}(\omega) = \pi^{-1/4}\, e^{-(\omega-\omega_0)^2/2} \qquad (7)$$
At a fixed scale $a$, the complex Morlet wavelet and its Fourier transform are given by
$$\psi_a(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t}{a}\right) = \frac{\pi^{-1/4}}{\sqrt{a}}\, e^{i\omega_0 t/a}\, e^{-t^2/(2a^2)} \qquad (8)$$
$$\hat{\psi}_a(\omega) = \sqrt{a}\,\hat{\psi}(a\omega) = \pi^{-1/4}\sqrt{a}\, e^{-(a\omega-\omega_0)^2/2} \qquad (9)$$
In the frequency domain, the wavelet at each scale acts as a filter characterized by a constant Q factor [28], that is, a constant ratio of center frequency to bandwidth across scales. The central frequency of the mother wavelet, denoted by $f_c$, is the position of the global maximum of $\hat{\psi}(\omega)$, which is given by $f_c = \omega_0 / (2\pi)$. As for the bandwidth, denoted by $f_b$, it is centered around $f_c$ and controls the wavelet window [29]. In terms of these two parameters, the complex Morlet wavelet can be expressed by the following equation:
$$\psi(t) = \frac{1}{\sqrt{\pi f_b}}\, e^{2i\pi f_c t}\, e^{-t^2/f_b} \qquad (10)$$
To allow easy graphical interpretation, it is preferred to display the modulus of the CWT coefficients, $|W_{(a,b)}|$. This representation is called a scalogram and it represents the amplitude information of the signal at each scale $a$ and position $b$.
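The admissibility remark can be checked numerically. The sketch below assumes the unit-energy form $\psi(t) = \pi^{-1/4} e^{i\omega_0 t} e^{-t^2/2}$ and the value $\omega_0 = 5.4285$ quoted in the Results section; the integration grid and variable names are choices made here for illustration.

```python
import cmath
import math

W0 = 5.4285  # value of omega_0 quoted in the Results section

def morlet(t, w0=W0):
    """Complex Morlet mother wavelet (assumed unit-energy form)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)

# Riemann sums on [-8, 8]; the Gaussian envelope is negligible beyond that range.
dt = 0.001
grid = [-8.0 + i * dt for i in range(16001)]
mean = sum(morlet(t) for t in grid) * dt               # integral of psi
energy = sum(abs(morlet(t)) ** 2 for t in grid) * dt   # integral of |psi|^2

# Closed form of the integral of psi for this wavelet.
analytic_mean = math.sqrt(2 * math.pi) * math.pi ** -0.25 * math.exp(-0.5 * W0 ** 2)
print(abs(mean), analytic_mean, energy)
```

The integral of the wavelet, which is proportional to its Fourier transform at frequency 0, is small but not exactly zero for this value of $\omega_0$, while the energy is close to 1. This illustrates why the zero-mean condition is only met approximately and improves as $\omega_0$ grows.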
The scalogram can also be depicted in the time-frequency domain instead of the time-scale domain by converting the scales to frequencies using the formula
$$f = \frac{f_c}{a} \qquad (11)$$
where $f$ is expressed in cycles per sample (here, per nucleotide). Thus, a scalogram is a 2D plot where time is on the horizontal axis, frequency is on the vertical axis, and the amplitude of the CWT coefficients is colored according to a defined code. In the following section of this paper, we will focus on analyzing the Morlet scalogram.
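A small sketch of the scale-to-frequency conversion follows. It assumes $f_c = \omega_0 / (2\pi)$ with $\omega_0 = 5.4285$ and a sampling period of one nucleotide; the helper names are hypothetical, and the periods printed are ones discussed elsewhere in the paper (3, 6.5 and 10 bp).

```python
import math

W0 = 5.4285               # omega_0 quoted in the Results section
FC = W0 / (2 * math.pi)   # assumed center frequency f_c = omega_0 / (2*pi)

def scale_to_frequency(a, sampling_period=1.0):
    """Pseudo-frequency (cycles per base) associated with scale a."""
    return FC / (a * sampling_period)

def period_to_scale(period_bp, sampling_period=1.0):
    """Scale at which the wavelet oscillates with the given period (in bases)."""
    return FC * period_bp / sampling_period

for p in (3, 6.5, 10):
    a = period_to_scale(p)
    print(f"period {p} bp -> scale {a:.3f} -> frequency {scale_to_frequency(a):.4f}")
```

Under these assumptions, a 3-bp periodicity corresponds to a scale of about 2.59 and a 6.5-bp periodicity to a scale of about 5.62.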

Results and discussion
In this work, we focus our study on the analysis of DNA sequences within the C. elegans genome. The genomic sequences are extracted from the NCBI database [30]. As for the mapping technique, we choose the FCGS algorithm with the first three levels. Thus, the generated signals are FCGS$_1$, FCGS$_2$ and FCGS$_3$ of the whole chromosomes. Concerning the wavelet analysis, we use the complex Morlet wavelet with a support size of 1,420. The continuous wavelet transform is applied to the sequences along 64 scales, using a mother wavelet centered on $\omega_0 = 5.4285$ (radian units).
Close inspection of the resulting scalograms shows the role played by this analysis in the characterization of different sites along the DNA sequences. In fact, we offer a standard way to represent genomes and reveal the biological hotspots, regardless of their nature or their length. Through a simple zoom of $10^3$ bp, we are able to observe different features with great precision. Even the finer details are easily discerned. Several regions are visually distinguished by typical motifs which include prominent periodicities. We look these regions up in the NCBI database [30] to ascertain their nature. It is important to note that not all revealed stretches are identified; there are some regions whose biological significance we have not yet understood. For example, in Figure 4, we provide a series of scalograms which represent a sequence taken from chromosome III of C. elegans. This example illustrates the presence of different DNA structures which are easily observed due to their specific behaviors (the red brackets delimit the boundaries of these elements). According to the NCBI database, the prominent signatures relate to the elements CeRep59 (37,899 bp), CeRep55 (3,797 bp), CeRep59 (1,091 bp) and CeRep59 (2,844 bp). Among the structures that possess particular signatures, we selected some elements of the C. elegans chromosome I for study, namely intron, STS and Cerp3 elements.

Intron signature
It is well known that genomic sequences present a strong three-base periodicity. This periodicity is an interesting feature of the protein-coding regions (exons). Several signal processing approaches and computational algorithms have been developed based on this periodicity for predicting exons. Most of the coding region prediction methods use discrete Fourier transform (DFT)-based algorithms, in which exons correspond to the maximum of the Fourier power spectrum at frequency 1/3 [31-35]. In the same context, performing the DFT on the wavelet coefficient of the correlation function at frequency 1/3 has improved the peaks that mark exons in the Fourier spectrum [36].
On the other hand, for the identification of protein-coding regions, the use of the CWT based on the modified Morlet wavelet has provided more accurate results [7,37]. All of these works revolve around exon prediction, whereas intron prediction has not yet drawn the attention it deserves (an intron is a non-coding region of a eukaryotic gene).
The novelty in our work consists in providing an efficient way to represent the main characteristics of intronic sequences. Indeed, the FCGS coding highlights motifs having different forms with a high level of energy around specific frequency values. In our work, we found that most introns in the C. elegans genome present high energy around the frequency 1/6.5. The example in Figure 5 (panels a, b and c) exposes the behavior of a typical intron, which is characterized by the presence of specific motifs with high energy around the frequency 1/6.5 (as shown by the red arrow; P denotes periodicity) [38,39]. Other periodic motifs are also apparent at the level of harmonics, which are marked by a lower intensity line. We note that the intensity of the lower harmonics (indicated by the yellow arrows) increases with the order of the FCGS coding. Conversely, the intensity of the upper harmonics (see the black arrows) decreases as the order of the FCGS coding increases. In this example, we can see that this intron presents a remarkable behavior within the three levels of FCGS despite the smoothing effect of higher-order FCGSs (especially noted when we code with FCGS$_3$).

STS signature
Traditional gene mapping techniques are slow and painstaking. The discovery of sequence-tagged sites (STS) has opened a new way for geneticists to speed up the establishment of genetic and physical maps of genes along chromosomes. An STS is a specific region of DNA which can be uniquely identified through its sequence. In addition, it is an easily PCR-amplified sequence which can contain repetitive elements such as microsatellites. For the analysis of this abundant class of DNA, we choose the example of Figure 6. By examining the FCGS$_1$ result (Figure 6a), we can note the presence of periodic patterns with high energy at the top of the scalogram (indicated by the red arrow). These patterns are located within a considerable frequency band. If we consider the FCGS$_2$ result, we can see that the energy level of the frequency band is weakened (Figure 6b). This is due to the smoothing property of the FCGS coding. The smoothing effect of the FCGS$_3$ is also noticed in Figure 6c.

Cerp3 signature
The last example that we study here is part of the Cerp3 repetitive family. The Cerp3 DNA consists of dispersed repeated elements with a length of about 1,000 bp and is present in 50 to 100 copies in the C. elegans genome. Such a nematode segment hides specific periodicities that we disclose in the related scalograms (Figure 7).
All the scalograms, strikingly, display a long chain of motifs consisting of seven- and six-base periodicities. Figure 7a (related to the FCGS$_1$ coding) shows other patterns, including strong periodicities at the top of the scalogram. As for the FCGS$_2$ coding (Figure 7b), it enhances periodicities of 5 bp and 3 bp and brings out other periodicities corresponding to the 15-, 12- and six-base repetitive elements. Finally, Figure 7c underlines the contribution of the FCGS$_3$ scheme in the enhancement of periodicities like 15, five and four bases.

FCGS and the local signatures in C. elegans
In this work, we have investigated the important role of color scalograms, which offer an easy visual navigation through genomic sequences. Thus, we have exposed the behavior adopted by some DNA sequences in the time-frequency plane, which turns out to be easily characterized by the presence of different periodic patterns within the FCGS scalograms. These behaviors appear as strong local signatures within the genome. As we have seen, some signatures appear strongly only when we code with FCGS$_1$, while other signatures appear similarly within the three levels of FCGS. To study the role of the FCGS order in the enhancement of the DNA signature, we consider the percentage contribution, in terms of energy, of the frequency band that specifies the DNA signature. This choice is motivated by the fact that the energy of the characteristic sub-band is one of the main statistical features that can be extracted from the wavelet domain as a texture descriptor [40]. The study is performed with three examples of each of the intron, STS and Cerp3 sequences (see Table 1). These sequences are coded by the frequency chaos game signals of order 1, 2 and 3.
To be able to evaluate the energy contribution of the different periodic patterns in these sequences, we have to fix the frequency band limit in such a way that it includes all the periodic motifs (see Table 1).
The choice of the frequency boundaries is justified by the contour and the 3D plots given in Figures 8, 9 and 10. The dashed red lines in these figures delimit the characteristic frequency band. Figure 8 refers to the third intron when it is coded by FCGS$_2$.
In Figure 9, we provide the pattern distribution of the STS 2 sequence (coded by FCGS$_1$) through the contour and the 3D plots.
Finally, Figure 10 shows the contour and the 3D plots of the second Cerp3 sequence (coded by FCGS$_2$).
The second part of this study consists in measuring the energy distribution of the strongest motifs for the intron, STS and Cerp3 sequences coded by the frequency chaos game signals of order 1, 2 and 3. Thus, we calculate the total energy of the scalogram (designated by $E_t$) and the energy of the prominent frequency sub-band (designated by $E_p$). The contribution of this sub-band is then quantified by the percentage ratio $E_p / E_t$. In Figure 11, we provide the energy values, which are calculated over a portion of 800 bp for the three introns. Based on the histogram plots, we deduce that the partial energy is very close to the total energy for all introns. In addition, FCGS$_1$, FCGS$_2$ and FCGS$_3$ yield close percentage values, which confirms the fact that they characterize introns similarly.
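The percentage ratio can be sketched as follows. The paper does not spell out its energy formula, so this example assumes that energy is the sum of squared scalogram moduli; the function name, the toy scalogram and the band limits are all hypothetical.

```python
def band_energy_percentage(scalogram, freqs, f_low, f_high):
    """Percentage of scalogram energy inside the band [f_low, f_high].

    scalogram: list of rows, one row of |W(a, b)| values per frequency.
    freqs:     frequency associated with each row.
    Energy is assumed here to be the sum of squared moduli.
    """
    e_total = sum(v * v for row in scalogram for v in row)
    e_band = sum(
        v * v
        for f, row in zip(freqs, scalogram)
        if f_low <= f <= f_high
        for v in row
    )
    return 100.0 * e_band / e_total

# Toy scalogram: three frequency rows, two positions each (made-up values).
toy = [[1.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
toy_freqs = [0.10, 0.15, 0.30]
ratio = band_energy_percentage(toy, toy_freqs, 0.12, 0.20)
print(f"{ratio:.1f}% of the energy lies in the band")
```

In this toy case the middle row carries 8 of the 12 energy units, so the band accounts for about two thirds of the total.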
As for the STS sequences, the scalograms show that FCGS$_1$ is better suited to study this DNA type. To prove this, we consider the contribution of the characteristic patterns relating to the first three levels of FCGS. In terms of energy percentage, we provide the contribution of the characteristic patterns relating to the FCGS scalograms in Figure 12. The energy values are calculated over a portion of 1,134 bp.
Note that the energy values decline considerably when the FCGS order increases for all the STS sequences. The ratio values indicate, in addition, that FCGS$_1$ is the only one of the three codings that characterizes STS sequences.
Finally, the energy values of the Cerp3 sequences (over a portion of 445 bp) are provided in Figure 13. From these histograms, we can deduce that the FCGS of order 1, 2 and 3 all allow the Cerp3 characterization, as shown by the close energy values.
Aside from the characterization of these sequences by a specific signature, there are many DNA classes that are easily distinguished by relevant motifs in the scalograms. Therefore, based on the study of significant homology between signatures, we can establish efficient algorithms for DNA recognition and classification.

Conclusion
DNA coding methods play a major role in revealing information about significant biological sequences. However, the choice of such methods depends on the features that they can reflect. It appears that the available mapping techniques rely mostly on the 3-bp or 10-bp behaviors and are not well adapted to examine all periodic structures contained in the complex nature of DNA. In this context, we introduced a new mapping technique, aiming to characterize a wealth of DNA sequences. The proposed method is based on the chaos game theory and we refer to it as FCGS. The FCGS coding consists in assigning the frequency of occurrence of each sub-pattern to the same group of nucleotides that exist in the DNA sequence. Such a mapping has the advantage of providing a multitude of signals which offer the possibility to treat the DNA sequence from different views, taking into account the statistical properties of resident oligomers.
The performance of the FCGS scheme in terms of information revelation from DNA sequences was tested by the continuous wavelet transform. The complex Morlet wavelet was employed to create color scalograms for the C. elegans' FCGSs (order 1 to 3).
By reviewing the resulting scalograms, we found that the selected wavelet transform readily identifies different DNA structures. Several hidden periodicities and features which cannot be revealed by classical DNA analysis methods (such as the STFT) were sharply identified. Simulation results show a pronounced 6.5-base period in non-coding regions, more specifically in intronic ones. However, there are other introns which include periodicities like 5 bp and 3 bp. These periodicities derive from a specific organization of periodic patterns, thus forming a local signature. Through this study, it is shown that the variable patterns observed in the intron DNA are all exhibited by the FCGS$_1$, FCGS$_2$ and FCGS$_3$ codings. Besides introns, we have shed light on another type of DNA sequence: the STS. The STS are particular DNA sequences recently used in gene mapping procedures. When coding with the FCGS of order 1, we found a special signature of this DNA class that derives from the microsatellite repetitive elements that it contains.
Overall, in the mapping efforts for the nematode C. elegans, various classes of repetitive DNA were annotated. Among them, we considered a particular class of C. elegans dispersed repeats: the Cerp3. The related scalograms provide clear periodical motifs of seven- and eight-base repeats. This time-frequency signature is illustrated when the coding schemes FCGS$_1$, FCGS$_2$ and FCGS$_3$ are used.
In conclusion, the results stemming from the complex Morlet wavelet analysis of the FCGSs have shown its accuracy in the detection of variable DNA structures. Moreover, this analysis could serve in discovering unknown domains with potential biological significance in genomes.