Since the pioneering work of Jeffrey in 1990, the chaos game representation (CGR) of DNA sequences has enjoyed considerable success. For more than two decades, it has been used as a platform for pattern recognition [11],[12], a generalization of Markov transition tables [13], a tool for statistical characterization of genomic sequences [11],[14],[15], as well as a basis for alignment comparisons [16] and the establishment of phylogenetic trees [17]. The CGR is an iterative algorithm that produces a unique scatter plot of fractal nature. It consists of mapping a nucleotide sequence into a unit square in which each vertex is assigned to a DNA character (the nucleotides A, C, G and T). Consider a DNA sequence composed of N nucleotides, $S = \{S_1, S_2, \dots, S_N\}$. The element occupying the ith position in S is represented in the square by a point $x_i$. At each iteration, the point $x_i$ is placed halfway between the previously plotted point $x_{i-1}$ and the vertex corresponding to the letter read, $S_i$ [18]. The iterative function of the CGR is given by
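$$x_i = x_{i-1} + \frac{1}{2}\left(v_{S_i} - x_{i-1}\right), \qquad i = 1, \dots, N, \tag{1}$$
where $v_{S_i}$ denotes here the vertex of the square assigned to the nucleotide $S_i$.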
Usually, the starting point $x_0$ is placed at the center of the square, while the assignment of nucleotides to the corners is arbitrary. Figure 1 shows the procedure for plotting the sequence ‘TTAGC’.
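The iteration described above can be sketched in a few lines of Python. This is only a minimal illustration: the corner layout in `CORNERS` is an assumption (the assignment is arbitrary, as noted above), and the function name `cgr_points` is hypothetical.

```python
# Assumed corner layout; any other assignment of nucleotides to vertices works.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the CGR points x_1..x_N of a DNA sequence made of A, C, G, T."""
    x, y = start  # x_0, placed at the center of the unit square
    points = []
    for base in sequence.upper():
        cx, cy = CORNERS[base]
        # Move halfway from the previous point towards the vertex of the base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


points = cgr_points("TTAGC")
for base, point in zip("TTAGC", points):
    print(base, point)
```

With this layout, the first point for ‘TTAGC’ is (0.75, 0.25), halfway between the center and the T vertex, and each subsequent point halves the distance to the vertex of the next nucleotide.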
The usefulness of the chaos game representation goes beyond the convenience of genome representation and visualization. It also provides an image that is specific to the genome considered [19],[20] and thus forms a genomic signature [21].
The CGR technique reveals hidden patterns that arise from distinct k-tuple compositions in DNA sequences. The frequency of occurrence of these patterns can be estimated with the frequency chaos game representation (FCGR) [22]. This approach consists of dividing the CGR image into $4^k$ small squares, each with a side of $1/2^k$ and each associated with one sub-pattern. The number of points in each sub-square is then counted. The frequency of occurrence of each k-length word is obtained by dividing the number of dots in the corresponding sub-square by the total length of the DNA sequence. To visualize these frequencies, a normalized colour scheme is used: the darkest pixels in the FCGR images represent the most frequently used words, whereas the lightest ones represent the most avoided words [23]. Figure 2 is divided into two blocks: the first illustrates the arrangement of oligomers in the FCGR sub-squares for k = {1, 2, 3}, and the second shows the frequency chaos game representations calculated for chromosome I of the organism C. elegans.
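A minimal Python sketch of the FCGR construction is given below. It rests on several assumptions that are not taken from the text: the corner layout in `CORNERS`, the convention that row 0 is the bottom row of the square, and a normalization by the number of k-mers, N − k + 1 (which equals the sequence length for k = 1). The helper names are hypothetical. Instead of binning the CGR points, the sketch counts k-mers and places each count in the sub-square that the CGR iteration assigns to that k-mer, which gives the same cells for every point from the kth one onward.

```python
from collections import Counter

# Assumed corner layout (arbitrary); coordinates are 0/1 bits of the vertex.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_frequencies(sequence, k):
    """Relative frequency of every k-mer observed in the sequence."""
    seq = sequence.upper()
    total = len(seq) - k + 1  # number of overlapping k-mers (assumed normalization)
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: n / total for word, n in counts.items()}


def kmer_cell(word):
    """(row, col) of the sub-square reached after reading the k-mer."""
    row = col = 0
    for m, base in enumerate(word):
        cx, cy = CORNERS[base]
        # Later nucleotides decide coarser subdivisions, so they weigh more.
        col += cx << m
        row += cy << m
    return row, col


def fcgr(sequence, k):
    """2^k x 2^k matrix of k-mer frequencies (row 0 = bottom of the square)."""
    size = 2 ** k
    grid = [[0.0] * size for _ in range(size)]
    for word, freq in kmer_frequencies(sequence, k).items():
        row, col = kmer_cell(word)
        grid[row][col] = freq
    return grid


grid = fcgr("TTAGC", 2)
for row in reversed(grid):  # print the top row of the square first
    print(row)
```

For the short sequence ‘TTAGC’ and k = 2, the four observed dimers (TT, TA, AG, GC) each receive a frequency of 0.25 and all other cells stay at zero.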
Although representations based on the chaos game (namely CGR and FCGR) have been successfully applied to a wide range of problems, their capacity to follow the evolution of frequencies along DNA sequences remains, so far, unexplored. This motivates us to exploit the FCGR method to build signals that track the frequency evolution of oligomers along a given sequence. We call these signals FCGSs. This new mapping technique assigns to each oligomer occurring in the sequence its frequency of occurrence. For this purpose, two steps are required:
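• The first step consists in generating the kth-order FCGR for the entire sequence. The FCGR matrix is expressed as follows:
$$\mathrm{FCGR}_k = \begin{pmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,2^k} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,2^k} \\ \vdots & \vdots & \ddots & \vdots \\ f_{2^k,1} & f_{2^k,2} & \cdots & f_{2^k,2^k} \end{pmatrix} \tag{2}$$
where $f_{i,j}$ is the frequency of the word situated at the intersection of the ith row and the jth column of the k-mer matrix.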
• The second step consists in reading the input sequence in successive groups of k nucleotides and replacing each group by the corresponding frequency already calculated in the $\mathrm{FCGR}_k$ matrix.
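In this sense, an $\mathrm{FCGS}_k$ can be generated by
$$\mathrm{FCGS}_k(n) = \mathrm{FCGR}_{k,i,j}, \qquad n = 1, \dots, N-k+1, \tag{3}$$
where $(i, j)$ is the position in the matrix of the k-mer $S_n S_{n+1} \cdots S_{n+k-1}$.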
Here, k is the order of the frequency chaos game representation and $\mathrm{FCGR}_{k,i,j}$ refers to the element of $\mathrm{FCGR}_k$ located at the intersection of the ith row and the jth column.
As an illustrative example of the FCGS technique, consider the sequence S = {TTTTAGTGAAGCTTCTAGAT}. To encode S by $\mathrm{FCGS}_1$, $\mathrm{FCGS}_2$ and $\mathrm{FCGS}_3$, we must calculate the FCGR matrices of orders 1, 2 and 3. Then, we extract all the oligomers of lengths 1, 2 and 3, and we assign to each monomer, dimer and trimer its frequency of occurrence from the corresponding frequency matrix. In this case, we enumerate 20 monomers, 19 dimers and 18 trimers. For illustration, we only consider the first 18 oligomers of each length, which are:
• Monomers = {T, T, T, T, A, G, T, G, A, A, G, C, T, T, C, T, A and G}
• Dimers = {TT, TT, TT, TA, AG, GT, TG, GA, AA, AG, GC, CT, TT, TC, CT, TA, AG and GA}
• Trimers = {TTT, TTT, TTA, TAG, AGT, GTG, TGA, GAA, AAG, AGC, GCT, CTT, TTC, TCT, CTA, TAG, AGA and GAT}
The associated frequencies are:
• Monomer frequencies = {0.45, 0.45, 0.45, 0.45, 0.25, 0.2, 0.45, 0.2, 0.25, 0.25, 0.2, 0.1, 0.45, 0.45, 0.1, 0.45, 0.25, 0.2}
• Dimer frequencies = {0.2632, 0.2632, 0.2632, 0.1579, 0.2105, 0.1053, 0.1053, 0.1579, 0.1053, 0.2105, 0.1053, 0.1579, 0.2632, 0.1053, 0.1579, 0.1579, 0.2105, 0.1579}
• Trimer frequencies = {0.1667, 0.1667, 0.1111, 0.1667, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1667, 0.1111, 0.1111}.
At the end, we obtain three different signals, which are illustrated in Figure 3.
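The two-step encoding can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `fcgs` is hypothetical, the window advances one nucleotide at a time, and frequencies are normalized by the number of k-mers, N − k + 1. The frequencies are looked up from a k-mer count table rather than from the FCGR matrix itself, which is equivalent because each matrix cell holds the frequency of exactly one k-mer.

```python
from collections import Counter


def fcgs(sequence, k):
    """Replace each overlapping k-mer by its relative frequency in the sequence."""
    seq = sequence.upper()
    n_words = len(seq) - k + 1  # number of overlapping k-mers
    words = [seq[i:i + k] for i in range(n_words)]
    counts = Counter(words)  # step 1: k-mer counts (the FCGR content)
    # Step 2: read the sequence and emit the frequency of each k-mer in order.
    return [counts[word] / n_words for word in words]


S = "TTTTAGTGAAGCTTCTAGAT"
signal_1 = fcgs(S, 1)
signal_2 = fcgs(S, 2)
signal_3 = fcgs(S, 3)

print(len(signal_1), len(signal_2), len(signal_3))  # 20 19 18
print([round(v, 4) for v in signal_1[:6]])  # [0.45, 0.45, 0.45, 0.45, 0.25, 0.2]
```

With this convention the sketch recovers the monomer frequencies listed above. The dimer and trimer values in the text appear to follow a slightly different counting convention, so the sketch is not expected to match them digit for digit.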
Note that increasing the FCGS order yields a smoother signal, which is useful for capturing the important underlying patterns [24]. Smoothing is often used to enhance long-term trends that can be hidden in the original signal. This makes our coding technique suitable for detailed studies. To demonstrate the effectiveness and usefulness of our coding, we chose to apply the complex Morlet wavelet analysis. Through this application, we examine the effect of smoothing on the determination of the characteristic patterns of certain areas of the DNA.