 Research
 Open access
 Published:
Identification of CpG islands in DNA sequences using statistically optimal null filters
EURASIP Journal on Bioinformatics and Systems Biology volume 2012, Article number: 12 (2012)
Abstract
CpG dinucleotide clusters also referred to as CpG islands (CGIs) are usually located in the promoter regions of genes in a deoxyribonucleic acid (DNA) sequence. CGIs play a crucial role in gene expression and cell differentiation, as such, they are normally used as gene markers. The earlier CGI identification methods used the rich CpG dinucleotide content in CGIs, as a characteristic measure to identify the locations of CGIs. The fact, that the probability of nucleotide G following nucleotide C in a CGI is greater as compared to a nonCGI, is employed by some of the recent methods. These methods use the difference in transition probabilities between subsequent nucleotides to distinguish between a CGI from a nonCGI. These transition probabilities vary with the data being analyzed and several of them have been reported in the literature sometimes leading to contradictory results. In this article, we propose a new and efficient scheme for identification of CGIs using statistically optimal null filters. We formulate a new CGI identification characteristic to reliably and efficiently identify CGIs in a given DNA sequence which is devoid of any ambiguities. Our proposed scheme combines maximum signaltonoise ratio and least squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. The proposed scheme is tested on a number of DNA sequences taken from human chromosomes 21 and 22, and proved to be highly reliable as well as efficient in identifying the CGIs.
Introduction
In the recent years, computational methods for processing and interpreting vast amount of genomic data, generated from genome sequencing, have gained a lot of scientific interest. Genomic sequences such as deoxyribonucleic acid (DNA) consist of biological instructions which are crucial for the development and normal functioning of almost all living organisms[1]. A DNA molecule has a complex double helix structure that involves two strands, consisting of alternating sugar and phosphate groups. Attached to these sugar groups of each DNA strand are one of the four chemical bases, namely, adenine (A), thymine (T), guanine (G), and cytosine (C). A unit comprising of base, sugar, and phosphate is referred to as a nucleotide. Hydrogen bonds between the nucleotides A and T (similarly between nucleotides G and C) from the opposite strands not only stabilize the DNA molecule, but also make the two strands complimentary. Nucleotides in a DNA strand exhibit short, recurring patterns (also called sequence motifs) that are presumed to have a biological function. Identification of these patterns helps in understanding the biological information hidden in a DNA sequence. A human DNA consists of about 3 billion nucleotides and completion of genome sequencing of numerous model organisms has further proliferated genomic databases. To completely decipher, the biological information in a DNA sequence is a daunting task and development of fast, efficient, and cost effective computational techniques for the same is a big challenge.
A sequence pattern that plays a crucial role in the analysis of genomes is CpG Island (CGI). A typical CGI consists highfrequency of CpG dinucleoetides, where ‘p’ refers to the phosphodiester bond between the adjacent nucleotides[1, 2]. This bond is different from the hydrogen bond that exists between C and G across two strands in a DNA double helix. The length of a CGI varies from a few hundred to a few thousand base pairs (bp), but rarely exceeds 5000 bp. It is known that CpG Islands (CGIs) occur in and around the promoter regions of (50–60)% of human genes, including most housekeeping genes (the genes which are essential for general cell functions)[3]. Gene is a stretch of DNA sequence which has biological information for the synthesis of a protein. The promoter region in a gene regulates its functionality[4–7]. Due to the association of CGIs with promoters, CGIs play an important role in promoter prediction and consequently in the prediction of genes[8, 9]. CGIs also contribute significantly in discovering the epigenetic causes of cancer. CGIs located in the promoter regions of certain tumor suppressor genes are normally unmethylated in healthy cells. DNA methylation is a biochemical modification resulting from addition of a methyl group to cytosine nucleotide (C). In cancer cells, CGIs usually undergo a dense hypermethylation leading to gene silencing as shown in Figure 1. Owing to this, they can be used as candidate regions for aberrant DNA methylation, for early detection of cancer[10–14]. For these reasons, identification of CGIs has become indispensable for genome analysis and annotation.
Despite their accuracy, experimental methods employed by biologists for identification of CGIs are extremely timeconsuming, simply because of the enormity of genomic data. On the other hand, computational methods can be much more attractive for the identification of possible CGIs. The results obtained from computational methods can be used by biologists to validate and further enhance the accuracy of identified CGI locations. There are several computational methods[15–26] reported in the literature for identification of CGIs in DNA sequences. In one of the first computational attempts[15], a CGI is defined as a DNA segment fulfilling the following three conditions: (i) length of segment is at least 200 bp, (ii) G and C contents are ≥ 50%, and (iii) observed CpG to expected CpG ratio (o/e) is ≥ 0.6. Observed CpG is the number of CpG dinucleoetides in a segment and expected CpG is calculated by multiplying the number of ‘C’s and the number of ‘G’s in a segment and then dividing the product by length of the segment. This method however falsely identifies the other G and C rich motifs, e.g., Alu repeats, as CGIs. In subsequent methods, these three conditions were made more stringent in order to reduce false identification at the expense of missing some true CGIs[24]. Sophisticated methods utilizing two Markov chain models[27, 28], one for CGIs and the other for nonCGIs, are proposed[2, 25, 26]. These two Markov models differ in their respective model parameters which characterize the difference in transition probabilities between successive nucleotides in CGIs and nonCGIs, respectively. In these methods, a DNA segment is defined as CGI, if the logscore[2] computed using Markov model for a CGI is greater than that computed using Markov model for a nonCGI. Consequently, the model parameters used for CGIs and nonCGIs play a crucial role in identifying the CGIs. However, different methods employing such models from timetotime produce inconsistent results. Another criterion based on the physical distance distribution of CpG dinucleoetides in a DNA segment has also been proposed[23]. Methods based on this criterion are dependent on nucleotide composition of a DNA sequence being analyzed and suffer from low identification specificity.
Recently, digital signal processing (DSP)based algorithms have gained popularity for the analysis of genomic sequences since they can be mapped to numerical sequences. Digital filters have successfully been employed for identification of protein coding regions (exons) in DNA sequences and hotspots in protein sequences[29–33]. Digital filters have also been used for identification of CGIs with considerable success[25, 26]. These methods are similar to Markov chain methods but use digital filters to compute weighted logscore to identify CGIs. The method proposed in[25] employs a bank of IIR lowpass filters (about 40 filters, each with different bandwidth) to identify the CGIs by looking at the weighted logscores of all the filters together. The CGI identification sensitivity of this method is affected by the tradeoff between responsiveness of filter and stability of the output. Moreover, this method may become computationally demanding as it makes use of a large number of filters in the bank. Another DSP based algorithm in[26] employs an underlying multinomial statistical model[34] to estimate its Markov chain parameters followed by an FIR filter with Blackman window to compute the weighted logscore.
It is evident from above discussion that the CGI identification methods and more importantly the criteria used therein play a crucial role in identifying CGIs. As such, development of fast and efficient computational methods with highly reliable CGI identification criteria is a necessity. Statistically optimal null filters (SONF) have been proven for their ability to efficiently estimate shortduration signals embedded in noise[35]. In this article, we propose a new DSP algorithm for identification of CGIs using SONF which combines maximum signaltonoise ratio and least squares optimization criteria to estimate the message signal, characterizing the CGI, embedded in noise. Normally, the CGI identification accuracy is a lot dependent on the Markov models used and sometimes produces contrasting results. Also, one of the main objectives of the article is to find a uniform yet effective alternative CGI identification measure replacing the current measure based on transition probabilities. In the proposed scheme, we have formulated a simple basis function to be used in SONF which characterizes the CGI. Our criterion is devoid of any ambiguities associated with the choice of transition probabilities used in some of the algorithms. The proposed scheme is tested on a large number of already annotated DNA sequences obtained from human chromosomes 21 and 22. It is shown that our scheme is simple to implement and yet able to identify CGIs reliably and efficiently.
The rest of the article is organized as follows: the following section briefly describes a few existing DSPbased algorithms for the identification of CGIs. In Section “Proposed scheme”, the proposed SONFbased scheme for identifying CGIs in DNA sequences is explained. Results obtained from the proposed scheme are depicted as well as tabulated in Section “Results and discussion”. Finally, “Conclusion” section concludes the article describing some of the significant features of the proposed scheme.
Related study
In this section, we give a brief review of some of the existing CGI identification methods as a preparatory groundwork for the method to be proposed in Section “Proposed scheme”.
Markov chain approach
In this method, a DNA sequence of length N, represented as X = {x(n),x(n + 1),…,x(n + N−1)} where each symbol x(n)∈{A C T G} is considered as a firstorder Markov chain[27] due to its conditional independence property, i.e., the nucleotide occurring at the location (n−1) does not offer any information over and above that at n to predict the nucleotide occurring at (n + 1). In a CpG island, the probability of transition from the nucleotide base C to the base G is higher in comparison with that in a nonCGI. Let the probability of transition from a nucleotide β to a nucleotide γ in a CGI and a nonCGI be denoted as{p}_{\beta \gamma}^{+} and{p}_{\beta \gamma}^{} respectively. Tables1 and2 taken from[2] show the transition probabilities for CGI and nonCGI Markov models. These tables are derived from 48 putative CGIs in human DNA sequences. Each row in the tables contains transition probabilities from a specific nucleotide base to each of the four bases. These transition probabilities{p}_{\beta \gamma}^{\pm} are calculated using
where{n}_{\beta \gamma}^{\pm} is the number of dinucleoetides βγ in a DNA sequence. Naturally, every row in the tables adds up to unity. As expected, in Table1, which corresponds to the CGI Markov model, the probability that a C is followed by a G is very high as compared with that in Table2.
The CGIs in the DNA sequence X are identified by analyzing the windowed sequence X_{ n }= {x(n),x(n + 1),…,x(n + L−1)} of length L, and those obtained by shifting the window by one position at a time. The probability of observing a windowed sequence assuming that it belongs to a CGI is given by
Similarly, the probability of observing this sequence assuming it belongs to a nonCpG island region is
If the value of P(X_{ n }CGI) > P(X_{ n }nonCGI), then, it is concluded that the DNA sequence X_{ n }belongs to a CGI. Otherwise, it is more likely to be a nonCGI island. Alternatively, by formulating a loglikelihood ratio, given by
If S(n) > 0, the given DNA sequence is more likely to belong to a CGI, and if S(n) < 0 the sequence probably belongs to a nonCGI region.
IIR lowpass filter approach
Yoon and Vaidyanathan[25] have noted that the loglikelihood ratio given in (4) can be expressed as:
where y(n) is a sequence representing the loglikelihood ratio of a single transition given by
and, h_{ ave }(n) is a simple averaging filter defined as
Then, they proposed using a bank of M filters each having different bandwidth, instead of using simply one lowpass filter h_{ave}(n). Specifically, the filter used in the k th (k = 0,…,M−1) channel has a transfer function given by
where 0 < α_{0} < α_{1} < ⋯ < α_{M−1} < 1. Since impulse response of a filter in the bank is{h}_{\text{ave}}\left(k\right)=\left(1,,{\alpha}_{k}\right){\alpha}_{k}^{k}u\left(n\right) more recent inputs are given larger weights than the past ones in the averaging process of y(n). The filter bank consists of 40 channels (M = 40), and the filter parameter α_{ k }is chosen from 0.95 to 0.99 with an increment of 0.001. The loglikelihood ratio obtained from the output of the k th channel is given by
The values of S_{ k }(n) obtained for all k and n are then used to obtain a twolevel contour plot. The bands corresponding to S_{ k }(n) > 0 determine the locations of CGIs.
In this method, the use of filter bank increases the computational overhead considerably. For fair comparison, instead of a bank on M filters, we have used one pole filter with optimized parameter α = 0.99 to compare with other methods (this reduces the number of computations considerably).
Multinomial statistical model
This method by Rushdi and Tuqan[26] differs from the previous method by the way the transition tables are obtained and the type of digital filter used to calculate the loglikelihood ratio. Instead of using (1) to obtain the transition probability tables, they are generated by comparing the frequency of each dinucleotide with the one expected under a multinomial model[34]. Transition probabilities{p}_{\beta \gamma}^{\pm} for the windowed sequence X_{ n }are calculated using
where
This method uses a FIR digital filter with variable coefficients generated by Blackman window to calculate the loglikelihood ratio S(n) given in (4). The locations with S(n) greater than zero are the probable locations of CGIs.
All of the abovementioned methods rely on the transition probability tables to calculate loglikelihood ratio used to identify CGIs. The methods[25, 26] specifically vary by the way y(n), obtained from the respective transition tables, are averaged. It is shown later in Section “Results and discussion” that the choice of the transition tables may produces contrasting results. Hence, a more reliable and efficient scheme that is devoid of these transition tables is necessary for identifying CGIs.
Proposed scheme
In this study, we adopt the SONF approach, proposed in[35], to efficiently identify CGIs in DNA sequences. SONF is used for estimation of short duration signal, S_{ n }= {s(m)}, embedded in noise R_{ n }= {r(m)} by combining maximum signaltonoise ratio and least squares optimization criteria. The implementation of the twofold optimization in SONF is shown in Figure 2, where an instantaneous matched filter (IMF) is first used to detect the presence of a short duration signal embedded in noise by maximizing the signaltonoise ratio over variabletime observation interval m. The IMF output, I_{ n }, is then scaled by a locally generated function ⋀_{ n }, by least squares (LS) optimization procedure, to obtain the signal Y_{ n }, an estimate of S_{ n }. It has been shown that the SONF is equivalent to a Kalman filter with a much simpler implementation[35]. Also, SONF has the ability to track rapidly changing signals leading to more practical processing of shortduration signals[36, 37]. Therefore, the proposed scheme is expected to perform better in situations even if the CGIs are of very short length of the order of 200 bp.
To be able to apply SONF approach to identify CGIs, the DNA sequence X, of length N, is first mapped to an appropriate binary numerical sequence X_{ CG }= {x_{ CG }(n)}. A sliding window of length L is used to evaluate if each of the windowed sequences, X_{ n }= {x_{ CG }(m)}, where n = 1,2,…,N−L + 1 and m = n,n + 1,…,n + L−1, belong to a CGI or not. Each of the windowed sequence X_{ n }can be expressed as
where S_{ n }= {s(m)} is a message signal corresponding to a CGI and R_{ n }= {r(m)} is a residual signal. S_{ n }and R_{ n } are each of length L. Let Φ = {ϕ(m)} be a fixed binary basis sequence of length L having some characteristic property of CGI.
Now, the message signal corresponding to a CGI can be expressed as S_{ n }= V_{ n }Φ, where V_{ n }= {v(m)} and Φ are sequences each of length L. The sequence V_{ n }Φ is obtained by multiplying the corresponding elements of V_{ n }and Φ. The sequence V_{ n } is determined by minimizing R_{ n } in least square sense. Let the message signal be denoted as S_{ n }= {s(m)}. The objective of the proposed method is to choose the basis sequence such that V_{ n } resulting from the optimization process has some discriminating feature of indicating whether the associated sequence X_{ n } belongs to a CGI. The following subsections explain in detail the steps involved in identification of CGIs in a DNA sequence using SONF.
Numerical mapping of DNA sequences
As DNA sequences are alphabetical in nature, they need to be mapped to numerical sequences in order to employ the DSP techniques for DNA sequence analysis. There are several mapping techniques reported in the literature. One of the earliest and a popular mapping is that of Voss’s binary indicator sequences[38]. A DNA sequence X can be mapped to a set of four digital signals by forming four binary indicator sequences, namely, X_{ A }, X_{ T }, X_{ G }, and X_{ C }. In each of these binary indicator sequences, ’1’ represents the presence and ’0’ absence of the corresponding bases A, T, G, and C in X. For instance, considering a DNA sequence X = {ATCCGAAGTATAACGAA}, the binary indicator sequence corresponding to G, i.e., X_{ G }can be expressed as X_{ G }= {00001001000000100}. Indicator sequences for the remaining three nucleotides can be represented in a similar fashion.
The problem of CGI identification deals with G and C content in a DNA sequence. Hence, we define a new indicator sequence X_{ CG }= {x_{ CG }(n)}, which indicates the presence of the nucleotides C and G in the DNA sequence. For example, the binary indicator sequence X_{ CG } of the DNA sequence above is X_{ CG }= {00111001000001100}.
Choosing the basis sequence
In this study, we have noticed that the dinucleotides CC, CG, GC, and GG occur more frequently in a CGI as compared to a nonCGI. For this study, we have calculated the occurrence of these four dinucleotides in the sequence L44140 taken from the chromosome X of Homo sapiens. The sequence L44140 is of length 219447 bp and has 17 CGIs whose locations are obtained from[39]. Figure 3 depicts the relative occurrence of the above four dinucleotides as compared to the remaining dinucleotides (AA, AC, AG, AT, CA, CT, GA, GT, TC, TG, TT, and TA) in CGIs and nonCGIs of L44140. Here, the relative occurrence of a particular dinucleotide is equal to the number of times that dinucleotide occurs in the sequence divided by the sequence length. It is evident that the dinucleotides CC, CG, GC, and GG occur more frequently in CGIs whereas the other dinucleotides occur more frequently in nonCGIs. This observation can also be inferred from the transition probability tables (Tables1 and2) as the values of{p}_{\beta \gamma}^{+} are greater than{p}_{\beta \gamma}^{}, where β and γ are either G or C. In Figure 3, the darker bars corresponding to the dinucleotides CC, CG, GC, and GG are taller in CGIs, whereas the darker bars corresponding to the other dinucleotides are shorter. Hence, instead of just considering the difference in relative occurrence of CG, it is more productive to consider the relative occurrence of the dinucleotides CC, CG, GC, and GG to distinguish between a CGI and a nonCGI.
Moreover, we have studied the difference in gap sizes between the dinucleotides CC, CG, GC, and GG in CGIs and nonCGIs of L44140. The shortest possible gap is of size 0 when the dinucleotides are adjacent to each other. Figure 4 shows the relative occurrence of gaps of various sizes in a CGI and a nonCGI. Here, relative occurrence of a particular gap size is equal to the number of times that gap size occurs in the sequence divided by the sequence length. Obviously, the gap of size 0 occurs more frequently in a CGI as compared to that of a nonCGI. And, it is found that the gap size in a nonCGI can go up to 40 where as in CGIs the maximum gap size was found to be 19. It can also be seen that the gaps of sizes 0, 1, and 2 occur more frequently in a CGI and the gap sizes of 3 and greater occur more frequently in a nonCGI. A gap of size 2 is the largest gap which can distinguish between a CGI and a nonCGI.
Based on the above observations, the basis sequence which characterizes a CGI can be formulated as Φ = {1100110011…001100}. The 1’s in Φ represent either the nucleotide C or G. The 1’s always appear in pairs where each pair representing one of the dinucleotide CC, CG, GC, or GG. The 0’s in Φ form the gap between the dinucleotides. A gap size of 2 is chosen between the dinucleotides. This choice of Φ is also satisfies the basic criteria of a CGI, i.e., at least 50% of the nucleotide content in a CGI is due to C and G.
Now, in order to obtain the length of Φ (window size), we have analyzed CGIs and nonCGIs of different lengths for the relative occurrence of various gap sizes. Figure 5 shows the plot of Δ versus window size for various gap sizes. Here, Δ is the difference of relative occurrence of a particular gap in a CGI and a nonCGI for a fixed window length. It can be seen that Δ is maximum for gap size 0. As the window size increases Δ also increases before it reaches a steady value. Δ is negative for gap sizes of 3 and greater signifying that the gap sizes of 3 and higher are more probable in nonCGIs compared to CGIs. For the gap size 2, Δ stabilizes for window sizes greater than 200. Larger the window size, larger the number of computations, and hence in the proposed method we have used the length of Φ (window size) to be equal to 200.
IMF
The objective of IMF, which is the first stage of SONF shown in Figure 2, is to detect the presence of the waveform Φ in the input sequence X_{ n }. IMF is an improvement over a matched filter, the difference being, in IMF optimal SNR is repeatedly calculated at every sample m, over an observation interval m∈[n,n + L−1]. IMF takes X_{ n } and Φ as inputs and produces an output sequence I_{ n }= {ι(m)} where
for m = n,n + 1,…,n + L−1. It can be seen that at each sample m, ι(m) is calculated over a varying interval i∈[n,m]. Note that, assuming ι(n−1) = 0, ι(m) can also be calculated using the recursive relation given by
The output ι(m) leads to an optimal detection of Φ at each sample m, and can be expressed as
where{r}_{0}^{\prime}\left(m\right) is the residual signal in IMF output, and c(m) is given by
The v(m)∈V_{ n }in (15) is an unknown gain.
Least square optimization of the IMF output
The objective of the second stage in SONF is to determine a sequence ⋀ = {λ(m)}, which when used to scale the IMF output I_{ n }, produces the SONF output, Y_{ n }, such that{Y}_{n}\to {V}_{n}\Phi. Here, Y_{ n } is an element wise product of V_{ n }and Φ. Y_{ n } is an estimate of S_{ n }, which is the message signal corresponding to CGI.
Let us consider the suboptimal case in which a sample of the IMF output ι(m) in (15), when scaled by λ(m)=ϕ(m)/c(m), generates
where y(m) is an element of the SONF output, Y_{ n }. As we desire optimal null filtering, i.e., y(m) = s(m), the residual element, r_{0}(m), needs to be entirely eliminated.
Before determining the optimal ⋀_{ n }, corresponding to ideal null filtering, we define the sequence Z_{ n }= {z(m)} such that,
Ideally, y(m) = s(m) and from (18), z_{ideal}(m) = r(m). Now, the optimal ⋀_{ n }={λ_{opt}(m)} is determined by minimizing the mean square error,E\left[{e}_{\lambda}^{2}\right(m\left)\right], with respect to λ(m) where
The optimal post IMF scaling sequence λ_{opt}(m) obtained by carrying out the above mean square minimization is[35]
where SNR is the input signaltonoise ratio (considering r(m) to be noise).
In order to implement SONF, the value of the input SNR should be known. To circumvent this problem, a suboptimal case, as shown in (17), is assumed considering c(m) >> 1/SNR, leading to
It can be shown that as m increases,{\lambda}_{\text{subopt}}\left(m\right)\to {\lambda}_{\text{opt}}\left(m\right) because the second term in the equation
approaches zero (as the value of c(m) progressively increases with m). So, the value of initial SNR in (20) will influence only the starting few samples in Y_{ n }.
The SONF can easily be implemented recursively using the following equations[35]
In this case of DNA analysis, one may choose the initial value of the gain P(0) to be 1 and ι(0) = ι(1).
The proposed SONFbased CGI identification algorithm for a DNA sequence of length N can now be summarized as follows:
Initialization: Set the base location index n = 0.

Step 1: Apply a rectangular window of length L = 200 starting at the base location n of the DNA sequence X to obtain the windowed sequence X_{ n }.

Step 2: Obtain the binary indicator sequence X_{ CG }for the windowed sequence, X_{ n }, from Step 1.

Step 3: X_{ CG }from Step 2, along with the binary basis sequence Φ, form the inputs to SONF. The corresponding SONF output sequence, Y_{ n }, is evaluated using the recursive relations given in (23), by assuming P(0) = 1 and ι(0) = ι(1).

Step 4: Compute the SNR power gain G(X_{ n }), which is the ratio of the variance of the SONF output, Y_{ n }, to the variance of the corresponding input X_{ n }.

Step 5: Increment the value of n by 1, i.e., n = n + 1. If n ≤ (N−L) go to Step 1, else go to Step 7.

Step 6: Plot G(X_{ n }) as a function of n + L and get its upper envelope. The peaks in the resulting plot which are above the threshold, η, indicate the locations of CGIs identified in X.

Step 7: Exit the algorithm.
Figure 6 shows the SONF implementation for better understanding of the proposed approach. Figure 6a,b shows an example of a CGI and a nonCGI with 80 bp. Naturally, in Figure 6a there are greater number of ones. Figure 6c,d shows the IMF output for a CGI and a nonCGI, respectively. It can be seen that the IMF output corresponding to a CGI progressively increases to a greater value of 35 as compared to 6 of that of a nonCGI. Figure 6e,f is the scaling functions for a CGI and a nonCGI, respectively. They are obtained using the relation λ(m) = P(m)ϕ(m) in (23). Finally, Figure 6g,h shows the estimated CGI characteristic in a CGI and a nonCGI, respectively. The SONF output corresponding to a CGI has greater amplitude as compared with that of a nonCGI.
Prediction measures
The identification of CGIs can have four possible outcomes; true positive (TP), true negative (TN), false positive (FP), or false negative (FN) as shown in Figure 7. Two basic measures of determining the accuracy of prediction are sensitivity (Sn) and specificity (Sp)[40]. Sensitivity, given by
and is defined as the proportion of CGIs that have been predicted correctly. Whereas, specificity given by
is defined as the proportion of the predicted CGIs that are real. Sensitivity and specificity can take on values from 0 to 1. For a perfect prediction, Sn = 1 and Sp = 1. Neither sensitivity nor specificity alone can provide a good measure of the global accuracy, because high sensitivity can be achieved with little specificity and vice versa. A measure that combines sensitivity and specificity values is called the correlation coefficient (CC) and is given by
The value of CC ranges from −1 to 1, where a value of 1 corresponds to a perfect prediction; a value of −1 indicates that every CGI has been predicted as nonCGI, and vice versa.
Another measure, called the performance accuracy (Acc), used in our analysis is given by
In this article, we have evaluated the performance of different CGI identification methods at the nucleotide level. For example, the value of TP is obtained by adding all the nucleotides predicted to to true positive, and the other outcomes are calculated in the similar manner. At the CGI level, even if one nucleotide (or a threshold of a minimum number of nucleotides) corresponding to a CGI is predicted to be true positive the entire CGI is assumed to be predicted correctly.
Results and discussion
The proposed CGI prediction scheme is tested on several genomic sequences of varying lengths taken from the human chromosomes 21 and 22. More precisely, we have used the three contigs, NT_113952.1, NT_113954.1, and NT_113958.2 from chromosome 21, and the contig NT_028395.3 from chromosome 22 for our analysis. All the sequence data considered for this study are obtained from the GenBank Database[39]. The performance of the proposed scheme is compared with the other popular DSPbased approaches such as Markov chain[2], IIR lowpass filters[25], and multinomial model[26].
First, a DNA sequence from human chromosome X with the GenBank accession number of L44140 is analyzed for illustrative purpose. The sequence is of length 219447 bp and is already annotated, i.e., the locations of its CGIs are already known and can be obtained from[39]. The sequence L44140 is also used to obtain the values of threshold, η, used by the DSPbased methods being compared in this article.
Figure 8 shows the comparative performance of CGI prediction by the abovementioned four approaches. Figure 8a shows the performance of Markov chain approach, where loglikelihood ratio S(n) is plotted against base index of the sequence. The transition probability tables given in Tables1 and2 are used to calculate S(n). All the base locations, n, with S(n) > 0 imply that they are very likely to be a part of a CGI. A window length of 200 bp is considered for the method. Markov chain method is able to detect most of the CGIs in the DNA sequence and it can be seen that the CGIs and nonCGIs can reasonably be differentiated by looking at the sign of S(n). However, one of the major drawbacks of this method is the presence of a lot of false positives that falsely categorize nonCGIs into CGIs.
Figure 8b shows the performance of IIR lowpass filter approach where the loglikelihood ratio, S(n), is plotted against base index of the sequence, n. The transition probability tables given in[25] are used to calculate S(n). For fair comparison, instead of a bank on M filters, we have used one pole filter with optimized parameter α = 0.99 for this method. All the base locations, n, with S(n) > 0 imply that they are very likely to be a part of a CGI. A window length of 200 bp is considered for the method. Similar to the Markov chain method, this method also produces a lot of false positives affecting the prediction accuracy.
Figure 8c shows the prediction of CGIs using the multinomial model in[26]. An underlying multinomial statistical model is employed to estimate the Markov chain model parameters that result in the transition probability tables given in[26]. A Blackman window of length 100 bp is employed for calculating the filtered loglikelihood ratio. The Blackman window gives larger weights for central samples of the window, thus reducing the edge effects. Windows with the positive filtered loglikelihood ratio are considered to be a part of a CGI. This method shows considerably high false positives making the CGI prediction unreliable.
Figure 8d shows performance of the proposed SONF scheme in predicting the CGIs. Unlike the abovementioned methods, our scheme utilizes the binary basis sequence, Φ, instead of the probability transition tables. The proposed scheme first maximizes SNR of the output at each time instant using IMF, then it further enhances the estimated signal using leastsquare optimization criterion, to estimate the presence of Φ in the input windowed DNA sequence. A window size of 200 is used for the proposed method. Effectiveness of the proposed scheme is clearly visible in Figure 8d, which depict more contrasting peaks as compared to the other three approaches. These contrasting peaks make the identification process comparatively easier resulting in less number of false positives.
It can be seen from Figure 8 that the default threshold on η = 0 produces a lot of false positives for the methods using transition probability tables. The optimal threshold values for the methods is obtained by calculating the prediction Acc for varying thresholds for each method (Figure 9). The optimal values of thresholds obtained for the Markov chain method, IIR filter method, and the proposed SONF approach are 0.1, 0.05, and 0.6, respectively. The actual locations of the CGIs, obtained from NCBI website, present in the sequence L44140 are represented by red horizontal spots in Figure 8. Figure 10 is receiver operating characteristic (ROC) curves plotted for the four methods. It can be seen that the proposed approach has better overall performance for the sequence L44140 with the area under the curve 0.7460. The Markov chain method is next with the area under the ROC curve 0.6072. The area under the curve for IIR filter method is 0.3106. It can be seen that the multinomial model method has the least area under the ROC curve. The dismal performance of the multinomial model does not indicate anything about the method in itself but merely implies that the transition probability tables used may not be appropriate for the example considered.
We have evaluated the time complexity of the proposed method using the tictoc function in MATLAB. Taking the necessary precautions (such as all applications except MATLAB were closed, a fresh session of MATLAB was started for each task, and MATLAB was warmed up with the code, i.e., the first run of the code was ignored), the CPU time for processing a fixed length of sequence, the Markov chain method was found to be the least followed by SONF, IIR and multinomial approaches with an additional CPU time of 1.29%, 1.78%, and 1.82%, respectively. This difference is not substantial considering today’s computing resources.
Figure 11 shows the performance of the four methods for the prediction of CGIs in the first 15000 bps of L44140. The red horizontal lines are the actual locations of CGIs. The blue binary decision curve depicts the locations of the predicted CGI by the methods. As can be seen from Figure 11c, the multinomialbased approach fails to detect the CGI located between base pairs 3095 and 3426 as opposed to other three methods implying that the probability transition parameters used for the CGI identification play a crucial role. Hence, it is important to have a CGI identification characteristic which is devoid of any ambiguity with the choice of different probability transition tables available. The binary basis sequence Φ in the proposed scheme successfully identifies the CGIs and can be reliably used as CPG identification characteristic.
Table3 presents the summary of performance measures Sn, Sp, CC, and Acc obtained for the analysis of four contigs NT_113952.1, NT_113954.1, NT_113958.2, and NT_028395.3. The performance of the proposed scheme is also compared with that of CpGCluster[23], which uses the distance between CpG dinucleotides (and not the transition probability tables) for identifying CGIs. The proposed approach has the highest values of Sn for all the contigs (shown in bold) and has the highest values of CC for the contigs NT_113954.1 and NT_113958.2. The performance accuracy is also quiet high, consistently above 97% which is a good sign. This shows that the proposed method is reliable and the proposed binary basis sequence Φ is an alternative CGI identification characteristic. The multinomial method did not identify any of the CGIs in the contig NT_028395.3 and hence its Sn and Sp values are zero. The corresponding Acc value is high because the method predicting most of the true negatives correctly. The contig NT_028395.3 has short CGIs of the order of 200 bps and the proposed approach with better sensitivity is capable of identifying them.
Conclusion
In this article, a new DSPbased technique using SONFs is proposed for the prediction of CGIs in DNA sequences. A novel CPG identification characteristic is presented in the form of a binary basis sequence which is shown to identify CGIs reliably. It has also been shown that the performance of the existing methods which use discriminating transition probability tables for CGIs/nonCGIs is not consistent. The prediction accuracy of these methods are highly dependent on the training data used to obtain the transition probabilities of CGIs and nonCGIs. The inability of finding a unique CGI identification characteristic has resulted in failure in predicting many of the CGIs. This article makes an attempt to present a unique CGI identification characteristic which does not require any training. Furthermore, the ability of SONF to track short duration signals is exploited in identifying the CGIs in DNA sequences. SONF combines maximum signaltonoise ratio and least squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. The performance of the proposed technique is tested on four randomly chosen contigs in chromosomes 21 and 22 of human beings. The simulation results comparing the performance of the proposed technique with the other three DSPbased CGI prediction techniques have shown that the proposed approach enjoys superior prediction accuracy in terms of sensitivity. The overall predicting accuracy of the proposed approach is also consistently above 97% and is comparable to that of the Markov chain method making it a reliable method.
References
Lodish H, Berk A, Zipursky S, Matsudaira P, Baltimore D, Darnell J: Molecular Cell biology. Scientific American, New York,; 1995.
Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Cambridge University Press, Cambridge,; 1998.
Antequera F, Bird A: Number of CpG islands and genes in human and mouse. Proc. Natl Acad. Sci. USA 1993, 90(24):1199511999. 10.1073/pnas.90.24.11995
Antequera F, Bird A: CpG islands as genomic footprints of promoters that are associated with replication origins. Curr. Biol 1999, 9: 661667. 10.1016/S09609822(99)802905
Ioshikhes I, Zhang M: Largescale human promoter mapping using CpG islands. Nat. Genet 2000, 26: 6163. 10.1038/79189
Antequera F: Structure, function, evolution of CpG island promoters. Cell. Mol. Life Sci 2003, 60(8):16471658. 10.1007/s0001800330886
Saxonov S, Berg P, Brutlag D: A genomewide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc. Natl Acad. Sci. USA 2006, 103(5):14121417. 10.1073/pnas.0510310103
Larsen F, Gundersen G, Lopez R, Prydz H: CpG islands as gene markers in the human genome. Genomics (San Diego, CA) 1992, 13(4):10951107.
Wang Y, Leung F: An evaluation of new criteria for CpG islands in the human genome as gene markers. Bioinformatics 2004, 20(7):1170. 10.1093/bioinformatics/bth059
Bird A: DNA methylation patterns and epigenetic memory. Genes Dev 2002, 16: 621. 10.1101/gad.947102
Herman J, Baylin S: Gene silencing in cancer in association with promoter hypermethylation. New Engl. J. Med 2003, 349(21):2042. 10.1056/NEJMra023075
Issa J: CpG island methylator phenotype in cancer. Nat. Rev. Cancer 2004, 4(12):988993. 10.1038/nrc1507
Illingworth R, Kerr A, DeSousa D, Jorgensen H, Ellis P, Stalker J, Jackson D, Clee C, Plumb R, Rogers J: A novel CpG island set identifies tissuespecific methylation at developmental gene loci. PLoS Biol 2008, 6: e22. 10.1371/journal.pbio.0060022
Heisler L, Torti D, Boutros P, Watson J, Chan C, Winegarden N, Takahashi M, Yau P, Huang T, Farnham P: CpG Island microarray probe sequences derived from a physical library are representative of CpG Islands annotated on the human genome. Nucleic Acids Res 2005, 33(9):2952. 10.1093/nar/gki582
GardinerGarden M, Frommer M: CpG islands in vertebrate genomes. J. Mol. Biol 1987, 196(2):261. 10.1016/00222836(87)906899
Rouchka E, Mazzarella R, States David J: Computational detection of CpG islands in DNA, Report: WUCS9739. 1997.
Rice P, Longden I, Bleasby A: EMBOSS: the European molecular biology open software suite. Trends Genetics 2000, 16(6):276277. 10.1016/S01689525(00)020242
Ponger L, Mouchiroud D: CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. Bioinformatics 2002, 18(4):631. 10.1093/bioinformatics/18.4.631
Dasgupta N, Lin S, Carin L: Sequential modeling for identifying CpG island locations in human genome. IEEE Signal Process. Lett 2002, 9(12):407409.
LuqueEscamilla P, MartínezAroza J, Oliver J, GómezLopera J, RománRoldán R: Compositional searching of CpG islands in the human genome. Phys. Rev. E 2005, 71(6):61925.
Bock C, Walter J, Paulsen M, Lengauer T: CpG island mapping by epigenome prediction. PLoS Comput. Biol 2007, 3(6):e110. 10.1371/journal.pcbi.0030110
Sujuan Y, Asaithambi A, Liu Y: CpGIF: an algorithm for the identification of CpG islands. Bioinformation 2008, 2(8):335338. 10.6026/97320630002335
Hackenberg M, Previti C, LuqueEscamilla P, Carpena P, MartínezAroza J, Oliver J: CpGcluster: a distancebased algorithm for CpGisland detection. BMC Bioinform 2006, 7: 446. 10.1186/147121057446
Takai D, Jones P: Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl Acad. Sci 2002, 99(6):37403745. 10.1073/pnas.052410099
Yoon B, Vaidyanathan P: Identification of CpG islands using a bank of IIR lowpass filters. In Proceedings of 11 th Digital Signal Processing Workshop. Taos Ski Valley, New Mexico; Aug. 2004.
Rushdi A, Tuqan J: A new DSPbased measure for CpG islands detection. In Digital Signal Processing Workshop, 12thSignal Processing Education Workshop, 4th. IEEE, Teton National Park, Wyoming; 2006.
Rabiner L: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77(2):257286. 10.1109/5.18626
Won K, PrugelBennett A, Krogh A: Evolving the structure of hidden Markov models. IEEE Trans. Evol. Comput 2006, 10: 3949.
Anastassiou D: Genomic signal processing. IEEE Signal Process. Mag 2001, 18(4):820. 10.1109/79.939833
Vaidyanathan P, Yoon B: The role of signalprocessing concepts in genomics and proteomics. J. Franklin Inst 2004, 341(1–2):111135.
Ramachandran P, Antoniou A: Identification of hotspot locations in proteins using digital filters. IEEE J. Sel. Topics Signal Process 2008, 2(3):378389.
Rao K, Swamy M: Analysis of genomics and proteomics using DSP techniques. IEEE Trans. Circuits Syst. 1: Regular Papers 2008, 55: 358.
Song N, Yan H: Short exon detection in DNA sequences based on multifeature spectral analysis. EURASIP J. Adv. Signal Process 2011, 2011: 2. 10.1186/1687618020112
Liu B: Statistical Genomics: Linkage, Mapping, and QTL Analysis. CRC Press, Boca Raton,; 1998.
Agarwal R, Plotkin E, Swamy M: Statistically optimal null filter based on instantaneous matched processing. Circuits Syst. Signal Process 2001, 20: 3761. 10.1007/BF01204921
Kakumani R, Devabhaktuni V, Ahmad M: Prediction of proteincoding regions in DNA sequences using a modelbased approach. In IEEE International Symposium on Circuits and Systems. Seattle; 2008.
Yadav R, Agarwal R, Swamy M: A new improved modelbased seizure detection using statistically optimal null filter. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEE. Minneapolis, Minnesota; 2009.
Voss R: Evolution of longrange fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett 1992, 68(25):38053808. 10.1103/PhysRevLett.68.3805
National Centre for Biotechnology Information http://www.ncbi.nlm.nih.gov
Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics 1996, 34(3):353367. 10.1006/geno.1996.0298
Acknowledgements
This study was supported in parts by the Natural Sciences and Engineering Research Council (NSERC) of Canada and in part by the Regroupement Strategic en Microelectronique du Quebec (ReSMiQ).
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Kakumani, R., Ahmad, O. & Devabhaktuni, V. Identification of CpG islands in DNA sequences using statistically optimal null filters. J Bioinform Sys Biology 2012, 12 (2012). https://doi.org/10.1186/16874153201212
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/16874153201212