Effective gene prediction by high resolution frequency estimator based on least-norm solution technique
EURASIP Journal on Bioinformatics and Systems Biology volume 2014, Article number: 2 (2014)
Abstract
The linear algebraic concept of a subspace plays a significant role in recent techniques of spectrum estimation. In this article, the authors use the noise subspace concept to find hidden periodicities in DNA sequences. With the vast growth of genomic sequence data, the demand for accurate identification of protein-coding regions in DNA is rising. Several DNA feature extraction techniques drawing on various cross fields have emerged in the recent past, among which the application of digital signal processing tools is of prime importance. It is known that coding segments have a 3-base periodicity, while non-coding regions do not have this unique feature. One of the most important spectrum analysis techniques based on the concept of subspace is the least-norm method. The least-norm estimator developed in this paper shows sharp period-3 peaks in coding regions, completely eliminating background noise. The proposed method is compared with the existing sliding discrete Fourier transform (SDFT) method, popularly known as the modified periodogram method, on several genes from various organisms, and the results show that the proposed method is a better and more effective approach to gene prediction. Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to establish the superiority of the least-norm gene prediction method over the existing method.
1 Introduction
It has been observed that the most significant scientific and technological endeavour of the 21st century is largely related to genomics. Researchers from various cross fields have therefore concentrated on genomic analysis in order to extract the vast information content hidden in it. Deoxyribonucleic acid (DNA) is the hereditary material present in all living organisms. In eukaryotic organisms, genes (sequences of DNA) consist of exons (coding segments) and introns (non-coding segments). It has been established that genetic information is stored in the particular order of four kinds of nucleotide bases, adenine (a), thymine (t), cytosine (c) and guanine (g), which together with a sugar-phosphate backbone comprise the DNA biomolecule. Exons are regarded as the most information-bearing part of a DNA sequence because only the exons take part in protein coding, while the introns are spliced off during protein synthesis. Gene prediction means detecting the locations of the protein-coding regions of genes in a long DNA chain. Since DNA encodes information of proteins, various statistical and computational techniques have been studied and explored to extract the information content carried by DNA and to distinguish exons from introns.
Genomic information is made up of a finite number of nucleotides in the form of alphabetical characters; hence, it is discrete in nature. As a result, digital signal processing (DSP) techniques can be used as effective tools to analyze DNA in order to capture its periodic characteristics. The main objective of spectrum estimation is determination of the power spectral density of a random process. Power spectral density (PSD) describes how the average power of a signal x[n] is distributed with frequency, where x[n] is a sequence of random variables defined for every integer n. The estimated PSD provides information about the structure of a random process which can be used for refined modeling, prediction, or filtering. Estimation of the power spectrum of discretely sampled processes is generally based on procedures employing the fast Fourier transform (FFT). This approach is computationally efficient and produces reasonable results, but in spite of these advantages, it has certain performance limitations. The most important limitation lies in its frequency resolution. Moreover, spectral estimation by the Fourier method generates various harmonics which often lead to false prediction of coding regions. Among the recently introduced techniques, the eigendecomposition-based noise subspace method known as the least-norm solution is of great interest. In the present paper, the authors address the problems posed by the standard FFT method and propose a least-norm algorithm based on the concept of subspace frequency estimation for effective and accurate prediction of coding regions in DNA sequences.
Application of DSP methods to find periodicities in DNA sequences has been studied by various researchers [1–4]. It is established that exon regions of DNA molecules exhibit a period-3 property because of the codon structure involved in the translation of nucleotide bases into amino acids [5–7]. Yin and Yau explained the phenomenon of three-base periodicity in the Fourier power spectrum of protein-coding regions as resulting from the non-uniform distribution of nucleotides in the three codon positions [8]. An improved algorithm for gene finding by period-3 periodicity using the nonlinear tracking differentiator is presented by Yin et al. [9]. Peng et al. discussed statistical properties of genes in their article [10]. A universal graphical representation method based on S.S.T. Yau’s technique, employing trigonometric functions to denote the four nucleotide bases in order to predict coding regions, is presented by Jiang et al. [11]. Application of digital filters to extract period-3 components and effectively eliminate background noise present in DNA sequences has given good results [12–14]. Yu et al. used probability distributions to study similarity in DNA sequences employing the symmetrized Kullback–Leibler divergence [15]. Kwan et al. introduced novel codes for one-sequence numerical representation for spectral analysis and compared them with existing mapping techniques [16]. Roy et al. introduced positional frequency distribution of nucleotides (PFDN), an algorithm for prediction of coding regions [17]. Parametric techniques of gene prediction, in which autoregressive all-pole models were used for identifying coding and non-coding regions, provided better results [18, 19]. Yu et al. proposed a novel method to construct moment vectors for DNA sequences using a two-dimensional graphical representation and proved that the two had one-to-one correspondence [20]. In another work, Deng et al. introduced a novel method of characterizing genetic sequences by defining a genome space with biological distance for subsequent applications in analyzing and annotating genomes [21]. An exclusive survey of various gene prediction techniques is presented by Pradhan et al. [22]. The fundamental theory of principal component analysis is explained by Shlens and its application is discussed by Ubeyli et al. [23, 24].
In this article, the authors have compared power spectral peaks obtained by the modified periodogram method with the pseudospectrum obtained by the least-norm solution method for detecting the presence of coding regions in DNA sequences and have established the superiority of the latter technique [25–28]. The algorithm has been successfully tested on several sample databases downloaded from NCBI GenBank [29].
2 Materials and methods
PSD estimation of a DNA sequence requires conversion of the DNA character string into numerical form. Different researchers have adopted different mapping methods to achieve this objective. The Voss representation is a very popular technique giving four binary indicator sequences x_{a}[n], x_{t}[n], x_{c}[n] and x_{g}[n], which take a value of either 1 or 0 at location n depending on whether the corresponding character exists at that location or not [7, 13, 14]. These indicator sequences show redundancy because
$$x_{a}[n] + x_{t}[n] + x_{c}[n] + x_{g}[n] = 1 \quad \text{for all } n$$
Therefore, three out of these four binary sequences would be enough to uniquely determine the DNA character string. There are several other techniques such as complex numbers [2], paired numeric [6], universal graphical representation [11], weak-strong hydrogen bonding [18], EIIP [30], quaternion [31] etc., each having a certain special feature of its kind. Rao and Shepherd [19] in their study found that complex mapping was one of the most effective and compact mapping rules. In a recent work, Kwan et al. [16] introduced several novel codes for single-sequence numerical representations for spectral analysis and studied their relative performances. They focused on direct and simple numerical representations which satisfied the following requirements:
(a). Single-sequence mapping for a nucleotide sequence
(b). Fixed value mapping for each nucleotide
(c). Accessible to digital signal processing analysis
Seven single-sequence complex-value numerical representations were derived by them in which each nucleotide of the sequence was mapped to a single real value element (+1 or −1) or a single imaginary value element (+j or −j). According to the main findings of their study, the K-Quaternary Code-I was most attractive whereas Rao and Shepherd found K-Quaternary Code-III to be more suitable. Details of these codes are furnished in Table 1. In this article, the authors have adopted a novel mapping rule in which K-Quaternary Code-III has been flipped about the Y-axis, assigning numerical values a = 1, c = −j, g = −1 and t = j to the nucleotide sequence x[n] as shown in the following example in order to provide location accuracy to predicted exons.
After mapping,
Once numerical conversion of DNA sequence is obtained, DSP technique can easily be applied to estimate its power spectrum. Spectral estimation by nonparametric method can be broadly classified as direct and indirect. These two methods are equivalent and are popularly known as the periodogram method. The direct method takes discrete Fourier transform (DFT) of the signal and then averages the square of its magnitude. The indirect method is based on the concept of first estimating the autocorrelation of data sequence and then taking its Fourier transform (FT).
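As an illustration of the numerical conversion step described earlier in this section, the sketch below builds the four Voss indicator sequences and a single complex-valued sequence from a short DNA string. It is a minimal sketch and not the authors' code: the function names and the example string are hypothetical, and the sign assignment a = 1, c = −j, g = −1, t = j is an assumption made for the example.

```python
# Illustrative sketch of two numerical mappings for a DNA string.
VOSS_BASES = "atcg"

# ASSUMPTION: sign assignment of the flipped K-Quaternary Code-III mapping.
COMPLEX_CODE = {"a": 1 + 0j, "c": -1j, "g": -1 + 0j, "t": 1j}


def voss_indicators(seq):
    """Return four binary indicator sequences, one per nucleotide."""
    seq = seq.lower()
    return {b: [1 if ch == b else 0 for ch in seq] for b in VOSS_BASES}


def complex_map(seq):
    """Map each nucleotide to a single complex value (one sequence only)."""
    return [COMPLEX_CODE[ch] for ch in seq.lower()]


if __name__ == "__main__":
    dna = "atgcca"  # hypothetical example string
    ind = voss_indicators(dna)
    # The four indicators sum to 1 at every position, which is the
    # redundancy that makes three of the four sequences sufficient.
    assert all(sum(ind[b][n] for b in VOSS_BASES) == 1 for n in range(len(dna)))
    print(complex_map(dna))
```

The single complex sequence carries the same information as the four indicator sequences, which is why only one spectrum needs to be computed per DNA segment.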
In the first part of this section, spectral analysis of DNA by the periodogram method is discussed in brief. The basics of eigendecomposition are given in the second subsection. The mathematical background of the least-norm solution is explained in the third subsection, followed by the algorithm of the least-norm solution technique. In the next section of this article, results and discussion are presented. In its first subsection, the performance of the proposed method is compared with the modified periodogram method. Model order selection by the eigenvalue-ratio technique is elaborated in the next subsection. In the final section of the article, the conclusion is drawn. MATLAB 7.1 software has been used to show the performance of the estimators.
2.1 Spectral analysis by modified periodogram method
In the direct method mentioned above, the periodogram P_{per}(f_{k}) for signal x(n) can be computed by DFT, or more efficiently by fast Fourier transform (FFT), for N data points as shown in Equation 4:
$$P_{per}(f_{k}) = \frac{1}{N}\left|\sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}\right|^{2}, \qquad f_{k} = \frac{k}{N}, \quad k = 0, 1, \ldots, N-1$$
To enhance the performance of the periodogram method, the N-point data sequence is first divided into K overlapping segments of length M each; the periodogram of each segment is then computed after applying the Bartlett window; finally, the K periodograms are averaged.
2.2 Spectral analysis by eigendecomposition
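A minimal sketch of the averaged, Bartlett-windowed periodogram evaluated at the period-3 frequency f = 1/3 is given below. The function names, the 50% default overlap and the window-power normalisation are assumptions for illustration; the paper's MATLAB implementation is not shown in the text.

```python
import cmath
import math


def bartlett(m):
    """Triangular (Bartlett) window of length m with zero end points."""
    if m == 1:
        return [1.0]
    return [1.0 - abs(2.0 * n / (m - 1) - 1.0) for n in range(m)]


def dft_power(x, k):
    """Squared magnitude of the k-th DFT coefficient of x, divided by len(x)."""
    n_pts = len(x)
    s = sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
    return abs(s) ** 2 / n_pts


def period3_power(x, seg_len, overlap=0.5):
    """Average the windowed periodogram at f = 1/3 over overlapping segments."""
    if seg_len % 3 != 0:
        raise ValueError("seg_len must be a multiple of 3 so that f = 1/3 is a DFT bin")
    if len(x) < seg_len:
        raise ValueError("sequence is shorter than one segment")
    step = max(1, int(seg_len * (1.0 - overlap)))
    win = bartlett(seg_len)
    norm = sum(w * w for w in win) / seg_len  # compensates for window power loss
    k = seg_len // 3                          # DFT bin corresponding to f = 1/3
    vals = []
    for start in range(0, len(x) - seg_len + 1, step):
        seg = [x[start + n] * win[n] for n in range(seg_len)]
        vals.append(dft_power(seg, k) / norm)
    return sum(vals) / len(vals)
```

A sequence that repeats every three samples concentrates its power at bin k = M/3, whereas a constant sequence contributes essentially nothing there, which is the contrast the period-3 test relies on.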
In this article, eigendecomposition of the autocorrelation matrix is used as an approach for frequency estimation of DNA sequences. Here, the signal x(n) is modeled as a sum of p complex exponentials in white noise w(n) as shown in the following equation:
$$x(n) = \sum_{i=1}^{p} A_{i}\, e^{j n \omega_{i}} + w(n)$$
where the amplitudes A_{i} are complex values given by A_{i} = |A_{i}| e^{jφ_{i}}, with φ_{i} being uncorrelated random variables that are uniformly distributed over the interval [−π, π]. The power spectrum of x(n) consists of a set of p impulses, one at each frequency ω_{i} for i = 1, 2, 3, …, p, with strength set by the component power |A_{i}|^{2}, plus the power spectrum of the white noise w(n) having variance σ_{n}^{2}.
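The autocorrelation sequence of the process, from which the M × M autocorrelation matrix for lag size M is formed, is given by
$$r_{xx}(k) = \sum_{i=1}^{p} P_{i}\, e^{j k \omega_{i}} + \sigma_{n}^{2}\, \delta(k)$$
where P_{i} = |A_{i}|^{2} is the power in the i th component. Therefore, the autocorrelation matrix R_{xx} is the sum of the autocorrelation matrix due to signal R_{s} and the autocorrelation matrix due to noise R_{n}, which may be written concisely as
$$R_{xx} = R_{s} + R_{n} = E P E^{H} + \sigma_{n}^{2} I$$
where E = [e_{1}, e_{2}, …, e_{p}] is an M × p matrix containing the p signal vectors e_{i}, E^{H} signifies its Hermitian transpose, and I is the M × M identity matrix. P = diag{P_{1}, P_{2}, …, P_{p}} is a diagonal matrix of signal powers. The eigenvalues of R_{xx} are λ_{i} = λ_{i}^{s} + σ_{n}^{2}, where λ_{i}^{s} are the eigenvalues of R_{s}. Since R_{s} has rank p, only its first p eigenvalues are nonzero; the first p eigenvalues of R_{xx} therefore correspond to the signal subspace, and the last (M − p) eigenvalues, approximately equal to σ_{n}^{2}, are the noise eigenvalues. Hence, the eigenvalues and eigenvectors of R_{xx} may be divided into two groups as shown below. Assuming that the eigenvectors have been normalized to have unit norm, we may use the spectral theorem to write R_{xx} as
$$R_{xx} = \sum_{i=1}^{p} \left(\lambda_{i}^{s} + \sigma_{n}^{2}\right) v_{i} v_{i}^{H} + \sum_{i=p+1}^{M} \sigma_{n}^{2}\, v_{i} v_{i}^{H}$$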
The set of eigenvectors {v_{1}, v_{2}, …, v_{p}} associated with the largest eigenvalues spans the signal subspace; these are called principal eigenvectors. The second subset of eigenvectors {v_{p+1}, v_{p+2}, …, v_{M}} spans the noise subspace and has σ_{n}^{2} as its eigenvalue. Since the signal and noise eigenvectors are orthogonal, it follows that the signal subspace and the noise subspace are also orthogonal. After eigendecomposition of the autocorrelation matrix, the eigenvalues are arranged in decreasing order λ_{1} ≥ λ_{2} ≥ λ_{3} ≥ … ≥ λ_{M} as depicted in Figure 1. From this plot of eigenvalues, one can distinguish an initial steep slope representing the signal and a more or less flat floor representing the noise level.
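The eigenvalue structure described above can be checked numerically on an idealised model. The sketch below, which assumes NumPy is available, builds the exact autocorrelation matrix of complex exponentials in white noise and splits its eigenvectors into signal and noise sets. The function names and the example frequencies, powers and noise variance are hypothetical choices for illustration only.

```python
import numpy as np


def model_autocorrelation(freqs, powers, noise_var, m):
    """Build R_xx = E P E^H + sigma^2 I for complex exponentials in white noise."""
    n = np.arange(m)[:, None]
    e_mat = np.exp(1j * n * np.asarray(freqs)[None, :])  # M x p signal vectors
    r_xx = e_mat @ np.diag(powers) @ e_mat.conj().T + noise_var * np.eye(m)
    return r_xx, e_mat


def split_subspaces(r_xx, p):
    """Eigendecompose R_xx and split eigenvectors into signal and noise sets."""
    lam, vec = np.linalg.eigh(r_xx)   # eigh returns eigenvalues in ascending order
    order = np.argsort(lam)[::-1]     # re-sort in decreasing order
    lam, vec = lam[order], vec[:, order]
    return lam, vec[:, :p], vec[:, p:]


if __name__ == "__main__":
    # Hypothetical example: two exponentials, one at the period-3 frequency.
    r_xx, e_mat = model_autocorrelation([2 * np.pi / 3, 1.0], [4.0, 1.0], 0.5, 8)
    lam, v_sig, v_noise = split_subspaces(r_xx, 2)
    print(np.round(lam, 3))                           # last M - p values equal 0.5
    print(np.abs(v_noise.conj().T @ e_mat).max())     # close to zero: orthogonality
```

With an exact autocorrelation matrix the M − p smallest eigenvalues equal the noise variance and the noise eigenvectors are orthogonal to every signal vector; with an estimated matrix both statements hold only approximately.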
An issue that is of central importance to successful implementation of principal-component analysis (PCA) is the selection of an appropriate model order p, since the accuracy of the estimated spectrum is critically dependent on this choice. In this article, the eigenvalue-ratio technique has been adopted for optimum model order selection. A plot of λ_{p}/λ_{p+1} vs integer values of p indicates a large eigenvalue gap at the threshold between the signal subspace and the noise subspace. This p value is chosen as the required model order, and eigenvalues λ_{p+1} to λ_{M} are assumed to be the noise eigenvalues corresponding to the noise subspace.
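The eigenvalue-ratio rule can be sketched in a few lines. The function name and the example eigenvalues below are hypothetical; the rule shown simply picks the p with the largest ratio λ_{p}/λ_{p+1}, which is one plain reading of the 'large gap' criterion.

```python
def model_order_by_eigenvalue_ratio(eigenvalues):
    """Return (p, ratios), where p (1-based) maximises lambda_p / lambda_{p+1}."""
    lam = sorted(eigenvalues, reverse=True)  # decreasing order, as in a scree plot
    if len(lam) < 2:
        raise ValueError("need at least two eigenvalues")
    if lam[-1] <= 0:
        raise ValueError("eigenvalues must be positive")
    ratios = [lam[i] / lam[i + 1] for i in range(len(lam) - 1)]
    best = max(range(len(ratios)), key=ratios.__getitem__)
    return best + 1, ratios


if __name__ == "__main__":
    # Hypothetical eigenvalues: three signal values above a flat noise floor.
    p, ratios = model_order_by_eigenvalue_ratio([9.1, 7.4, 5.2, 0.11, 0.10, 0.09])
    print(p)
```

In practice the search is restricted to a plausible range of p, since ratios between tiny noise eigenvalues can be erratic when the autocorrelation matrix is estimated from few samples.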
The pseudospectrum estimation by noise subspace method involves three generic steps:

1. Formation of autocorrelation matrix from data vector.
2. Derivation of noise subspace with the help of eigendecomposition.
3. Identification of signal components from noise subspace by frequency estimation function.
2.3 Frequency estimation by least-norm solution
Frequency estimation is the process of estimating the complex frequency components of a signal in the presence of noise [32]. The least-norm algorithm developed in this paper uses a single vector $\overrightarrow{a}$ that is constrained to lie in the noise subspace, and the complex exponential frequencies are estimated from the peaks of the frequency estimation function:
$${\widehat{P}}_{\mathrm{LN}}\left({e}^{j\omega}\right)=\frac{1}{{\left|{\overrightarrow{e}}^{\mathrm{H}}\overrightarrow{a}\right|}^{2}}$$
where $\overrightarrow{e}$ is an auxiliary vector given by
$$\overrightarrow{e}={\left[1,\ {e}^{j\omega},\ {e}^{j2\omega},\ \dots,\ {e}^{j(M-1)\omega}\right]}^{\mathrm{T}}$$
With $\overrightarrow{a}$ constrained to lie in the noise subspace, if the autocorrelation function is known exactly, then ${\left|{\overrightarrow{e}}^{\mathrm{H}}\overrightarrow{a}\right|}^{2}$ will have nulls at the frequencies of each of the complex exponentials. Therefore, the Z-transform of the coefficients of $\overrightarrow{a}$ may be factored as
$$A(z)=\sum_{k=0}^{M-1}a(k)\,{z}^{-k}=\prod_{k=1}^{p}\left(1-{e}^{j{\omega}_{k}}{z}^{-1}\right)\prod_{k=p+1}^{M-1}\left(1-{z}_{k}{z}^{-1}\right)$$
where z_{k} for k = (p + 1), …, (M − 1) are the spurious roots that in general do not lie on the unit circle. The least-norm method attempts to eliminate the effects of spurious zeros by pushing them inside the unit circle, leaving the desired zeros on the unit circle. The problem then is to determine which vector in the noise subspace minimizes the effects of spurious zeros on the peaks of ${\widehat{P}}_{\mathrm{LN}}\left({e}^{j\omega}\right)$.
The approach used in the least-norm algorithm is to find a vector $\overrightarrow{a}$ that satisfies the three following constraints:

1. The vector $\overrightarrow{a}$ lies in the noise subspace, ensuring that p roots of A(z) are on the unit circle.
2. The vector $\overrightarrow{a}$ has least Euclidean norm, ensuring that spurious roots of A(z) lie inside the unit circle.
3. The first element of $\overrightarrow{a}$ is unity, i.e. the least-norm solution is not the zero vector.
To solve this constrained minimization problem, we begin by noting that the constraint that $\overrightarrow{a}$ lies in the noise subspace is given by the following equation:
$$\overrightarrow{a}={P}_{n}\overrightarrow{v} \tag{12}$$
where ${P}_{n}={V}_{n}{V}_{n}^{\mathrm{H}}$ is the projection matrix that projects an arbitrary vector $\overrightarrow{v}$ onto the noise subspace, as shown in Figure 2 [25].
The least-norm method involves projection of the signal vector $\overrightarrow{v}$ onto the entire noise space.
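The third constraint is expressed as
$${\overrightarrow{a}}^{\mathrm{H}}{\overrightarrow{u}}_{1}=1$$
where
$${\overrightarrow{u}}_{1}={\left[1,\ 0,\ \dots,\ 0\right]}^{\mathrm{T}}$$
This may be combined with the constraint in Equation 12 giving
$${\overrightarrow{v}}^{\mathrm{H}}\left({P}_{n}^{\mathrm{H}}{\overrightarrow{u}}_{1}\right)=1$$
The norm of $\overrightarrow{a}$ may be written as
$${\left\Vert \overrightarrow{a}\right\Vert}^{2}={\left\Vert {P}_{n}\overrightarrow{v}\right\Vert}^{2}={\overrightarrow{v}}^{\mathrm{H}}\left({P}_{n}^{\mathrm{H}}{P}_{n}\right)\overrightarrow{v}$$
Since the projection matrix P_{n} is Hermitian, so that P_{n} = P_{n}^{H}, and also idempotent, so that P_{n}^{2} = P_{n}, we get
$${\left\Vert \overrightarrow{a}\right\Vert}^{2}={\overrightarrow{v}}^{\mathrm{H}}{P}_{n}\overrightarrow{v}$$
Minimizing the norm of $\overrightarrow{a}$ is therefore equivalent to finding the vector $\overrightarrow{v}$ that minimizes the quadratic form ${\overrightarrow{v}}^{\mathrm{H}}{P}_{n}\overrightarrow{v}$.
After reformulation, the constrained minimization problem becomes
$$\min_{\overrightarrow{v}}\ {\overrightarrow{v}}^{\mathrm{H}}{P}_{n}\overrightarrow{v}\quad \text{subject to}\quad {\overrightarrow{v}}^{\mathrm{H}}\left({P}_{n}^{\mathrm{H}}{\overrightarrow{u}}_{1}\right)=1 \tag{14}$$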
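Once the solution of Equation 14 is found, the least-norm solution is formed by projecting $\overrightarrow{v}$ onto the noise subspace using Equation 12. Using optimization theory, the least-norm solution is found to be
$$\overrightarrow{a}={P}_{n}\overrightarrow{v}=\lambda {P}_{n}{\overrightarrow{u}}_{1}$$
which is the projection of the unit vector onto the noise subspace, normalized such that the first coefficient is unity, and the Lagrange multiplier λ is given by
$$\lambda =\frac{1}{{\overrightarrow{u}}_{1}^{\mathrm{H}}{P}_{n}{\overrightarrow{u}}_{1}}$$
In terms of the eigenvectors of the autocorrelation matrix, the least-norm solution, which can be computed using QR factorization, is given by the following equation:
$$\overrightarrow{a}=\frac{\left({V}_{n}{V}_{n}^{\mathrm{H}}\right){\overrightarrow{u}}_{1}}{{\overrightarrow{u}}_{1}^{\mathrm{H}}\left({V}_{n}{V}_{n}^{\mathrm{H}}\right){\overrightarrow{u}}_{1}}$$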
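The closed-form least-norm solution can also be obtained without Lagrange multipliers. The short derivation below is a sketch under the stated model, writing the vector in an orthonormal basis of the noise subspace; the coefficient vector $\overrightarrow{c}$ and the vector $\overrightarrow{g}$ are auxiliary symbols introduced here for illustration, and ${\overrightarrow{u}}_{1}={\left[1,0,\dots,0\right]}^{\mathrm{T}}$ is the first unit vector.

$$\overrightarrow{a}={V}_{n}\overrightarrow{c},\qquad {\left\Vert \overrightarrow{a}\right\Vert}^{2}={\overrightarrow{c}}^{\mathrm{H}}{V}_{n}^{\mathrm{H}}{V}_{n}\overrightarrow{c}={\overrightarrow{c}}^{\mathrm{H}}\overrightarrow{c},\qquad {\overrightarrow{a}}^{\mathrm{H}}{\overrightarrow{u}}_{1}={\overrightarrow{c}}^{\mathrm{H}}\overrightarrow{g}=1\quad \text{with}\quad \overrightarrow{g}={V}_{n}^{\mathrm{H}}{\overrightarrow{u}}_{1}$$

$$1=\left|{\overrightarrow{c}}^{\mathrm{H}}\overrightarrow{g}\right|\le \left\Vert \overrightarrow{c}\right\Vert \left\Vert \overrightarrow{g}\right\Vert \quad \Rightarrow \quad \left\Vert \overrightarrow{a}\right\Vert =\left\Vert \overrightarrow{c}\right\Vert \ge \frac{1}{\left\Vert \overrightarrow{g}\right\Vert},\qquad \text{with equality for}\quad \overrightarrow{c}=\frac{\overrightarrow{g}}{{\overrightarrow{g}}^{\mathrm{H}}\overrightarrow{g}}$$

$$\overrightarrow{a}=\frac{{V}_{n}\overrightarrow{g}}{{\overrightarrow{g}}^{\mathrm{H}}\overrightarrow{g}}=\frac{{V}_{n}{V}_{n}^{\mathrm{H}}{\overrightarrow{u}}_{1}}{{\overrightarrow{u}}_{1}^{\mathrm{H}}{V}_{n}{V}_{n}^{\mathrm{H}}{\overrightarrow{u}}_{1}}=\frac{{P}_{n}{\overrightarrow{u}}_{1}}{{\overrightarrow{u}}_{1}^{\mathrm{H}}{P}_{n}{\overrightarrow{u}}_{1}}$$

The first line uses the orthonormality of the noise eigenvectors, the second is the Cauchy–Schwarz inequality, and the third substitutes the minimizing coefficient vector back, which reproduces the projection of the first unit vector onto the noise subspace scaled so that the first coefficient is unity.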
2.4 Algorithm of proposed least-norm solution technique for estimating period-3 peaks
Step 1 Convert the samples of data vectors to column vector.
Step 2 Compute autocorrelation matrix of data with predetermined lag size (M).
Step 3 Diagonalize the autocorrelation matrix. Produce a diagonal matrix D of eigenvalues and a full matrix V whose columns are the corresponding eigenvectors so that X*V = V*D, where X is the autocorrelation matrix.
Step 4 Sort the diagonal matrix D in ascending order for eigendecomposition. Take into account the noise subspace spanned by the eigenvectors corresponding to the non-significant eigenvalues.
Step 5 Project signal vector $\overrightarrow{v}$ onto the noise space using projection matrix.
Step 6 Find the least-norm vector $\overrightarrow{a}$ in the noise subspace with first element equal to unity using QR factorization and applying optimization theory.
Step 7 Estimate the pseudospectrum (in dB) by computing the absolute FFT of vector $\overrightarrow{a}$.
Step 8 Plot the result (in dB) to observe the period-3 spectral peaks.
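The steps above can be sketched as follows, assuming NumPy is available. This is an illustrative sketch rather than the authors' MATLAB code: the function names are hypothetical, the autocorrelation matrix is estimated by averaging length-M snapshots (one of several possible estimators), and the least-norm vector is obtained from the closed-form projection of the first unit vector instead of an explicit QR factorization.

```python
import numpy as np


def autocorrelation_matrix(x, m):
    """Steps 1-2: sample M x M autocorrelation matrix from length-M snapshots."""
    x = np.asarray(x, dtype=complex).ravel()
    snaps = np.array([x[i:i + m] for i in range(len(x) - m + 1)])
    return snaps.T @ snaps.conj() / snaps.shape[0]


def least_norm_vector(r_xx, p):
    """Steps 3-6: least-norm vector in the noise subspace, first element 1."""
    m = r_xx.shape[0]
    _, vec = np.linalg.eigh(r_xx)        # eigenvalues come out in ascending order
    v_n = vec[:, : m - p]                # eigenvectors of the M - p smallest ones
    proj = v_n @ v_n.conj().T            # projection matrix onto noise subspace
    u1 = np.zeros(m)
    u1[0] = 1.0
    a = proj @ u1                        # project the first unit vector
    return a / a[0]                      # scale so that the first element is 1


def pseudospectrum_db(a, n_fft=1200):
    """Step 7: 1 / |e^H a|^2 in dB on an n_fft-point frequency grid."""
    spec = np.abs(np.fft.fft(a, n_fft)) ** 2
    return -10.0 * np.log10(spec + 1e-300)


if __name__ == "__main__":
    # Hypothetical test signal: one period-3 complex exponential in noise.
    rng = np.random.default_rng(0)
    n = np.arange(300)
    noise = 0.1 * (rng.standard_normal(300) + 1j * rng.standard_normal(300))
    x = np.exp(2j * np.pi * n / 3) + noise
    a = least_norm_vector(autocorrelation_matrix(x, m=12), p=1)
    ps = pseudospectrum_db(a)
    print(np.argmax(ps) / len(ps))       # peak location as a fraction of 2*pi
```

The FFT of the coefficient vector evaluates e^H a on a uniform frequency grid, so the peaks of the returned curve mark the frequencies at which A(z) has roots near the unit circle.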
3 Results and discussion
The proposed algorithm has been tested on several eukaryotic genes to predict the locations of coding regions whose lengths vary from a few base pairs to thousands of base pairs, and the simulation results are compared with those of the modified periodogram on the same DNA data. The segments of test data used for analysis contain both exons and introns of fully constructed genes. According to the period-3 property of DNA, a prominent peak should be observed in the PSD plot of each exon segment. It is observed that the proposed method produces very sharp and well-defined period-3 peaks indicating the existence and number of protein-coding regions, from very short to long coding segments, present in the test data. Once the existence and locations of exons in the enormous length of DNA are confirmed, further statistical or computational methods may be applied to the DNA sequence to find the boundaries of the protein-coding regions. The statistical parameters and computation times of the modified periodogram and least-norm methods for genes F56F11.4a, T12B5.1, C30C11 and D13156 are given in Table 2.
It is observed that the proposed approach removes the entire noise and reveals the hidden periodicities prominently. A comparison has been drawn with the periodogram method applying a Bartlett (triangular) sliding window with 50% overlap and suitable segment length M and number of segments K. The window length M should be chosen subjectively based on a trade-off between spectral resolution and statistical variance. If M is very small, important features may be smoothed out, while if M is very large, the behavior becomes more like that of the unmodified periodogram, with erratic variation. Hence, a compromise value is selected in the range 1/25 < M/N < 1/3, where N is the nucleotide sequence length. The quality factor (Q.F.), which measures the ratio of the variance to the square of the mean of the PSD, has been used as the comparison metric between the two methods, as shown in Table 2. It is observed that the quality factor of the spectrum obtained by the least-norm method is much higher than that of the modified periodogram method. Figure 3 shows a bar plot of the percentage rise in quality factor for various genes. Table 2 also indicates that the computation time required by the least-norm method is greater than that of the modified periodogram method.
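The quality factor and the window-length rule quoted above are simple to compute. The sketch below uses hypothetical function names and assumes the population variance (division by the number of samples), which the text does not specify.

```python
def quality_factor(psd):
    """Ratio of the variance of the PSD samples to the square of their mean."""
    n = len(psd)
    mean = sum(psd) / n
    var = sum((v - mean) ** 2 for v in psd) / n  # ASSUMPTION: population variance
    return var / mean ** 2


def segment_length_ok(m, n):
    """Check the rule of thumb 1/25 < M/N < 1/3 for window length M."""
    return 1 / 25 < m / n < 1 / 3


if __name__ == "__main__":
    flat = [1.0, 1.0, 1.0, 1.0]    # no peak: quality factor 0
    peaky = [0.0, 0.0, 0.0, 4.0]   # one isolated peak: larger quality factor
    print(quality_factor(flat), quality_factor(peaky))
    print(segment_length_ok(351, 8060))
```

Under this definition a spectrum dominated by a few sharp peaks over a low floor scores higher than a flat or uniformly noisy one, which is why a sharper pseudospectrum yields a larger value.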
3.1 Performance comparison of proposed method with existing method
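The performance of both methods can be analyzed by prediction measures such as sensitivity (S_{n}), specificity (S_{p}), miss rate (M_{r}) and wrong rate (W_{r}). Their definitions are stated below:
$$S_{n} = \frac{T_{p}}{T_{p} + F_{n}}, \qquad S_{p} = \frac{T_{p}}{T_{p} + F_{p}}, \qquad M_{r} = \frac{M_{e}}{A_{e}}, \qquad W_{r} = \frac{W_{e}}{P_{e}}$$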
where M_{e} = missing exons, A_{e} = actual exons, W_{e} = wrong exons, P_{e} = predicted exons, T_{p} = true positive, F_{p} = false positive, and F_{n} = false negative. T_{p} corresponds to those genes that are accurately predicted by the algorithm and also exist in the GenBank annotation. F_{p} corresponds to the exon regions which are identified by the given algorithm but are not specified in the standard annotation. F_{n} is coding region that is present in the GenBank annotation but is not predicted as a coding segment by the algorithm. The average value of S_{n} and S_{p} gives the overall exon sensitivity and specificity. Table 3 summarizes the simulation results of the eight genes used as test data. It is evident from tabulated data that S_{n}, S_{p} and the average of S_{n} and S_{p} of the proposed method are significantly higher than existing method in all the cases whereas the miss rate and wrong rate are much lower indicating superior performance of the proposed algorithm over the existing technique [33].
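The four prediction measures can be computed directly from the exon counts. The sketch below assumes the conventional exon-level definitions S_n = T_p/(T_p + F_n), S_p = T_p/(T_p + F_p), M_r = M_e/A_e and W_r = W_e/P_e; the function name, argument names and the counts in the example are hypothetical.

```python
def prediction_measures(tp, fp, fn, missing_exons, actual_exons,
                        wrong_exons, predicted_exons):
    """Return sensitivity, specificity, their average, miss rate and wrong rate."""
    sn = tp / (tp + fn)   # fraction of annotated exons that were predicted
    sp = tp / (tp + fp)   # fraction of predicted exons that are annotated
    return {
        "Sn": sn,
        "Sp": sp,
        "avg": (sn + sp) / 2,
        "Mr": missing_exons / actual_exons,
        "Wr": wrong_exons / predicted_exons,
    }


if __name__ == "__main__":
    # Hypothetical gene: 5 annotated exons, 5 predictions, 4 of them correct.
    print(prediction_measures(tp=4, fp=1, fn=1, missing_exons=1,
                              actual_exons=5, wrong_exons=1, predicted_exons=5))
```

Note that S_p as defined here uses no true-negative count, so it measures the reliability of the predictions rather than the rejection of non-coding regions.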
At first, both the modified periodogram technique and the proposed least-norm algorithm are applied to the C. elegans cosmid F56F11.4a gene, using 8060-base-pair (bp) test data starting from location 7021 bp. It has five known exons at locations 7948 to 8059, 9548 to 9877, 11134 to 11397, 12485 to 12664 and 14275 to 14625 bp. The modified periodogram result is shown in Figure 4 and the result of the proposed algorithm is plotted in Figure 5. In the PSD plot shown in Figure 4, there are five visible exon peaks in the presence of background noise. In Figure 5, obtained by the proposed method, the five sharp period-3 spectral peaks visible in the specific coding regions are well defined, accurately positioned and without any noise component.
Figures 6 and 7 show the results of application of the conventional modified periodogram method and the proposed least-norm solution method to the 32488-bp C. elegans cosmid T12B5.1 DNA (Accession no. FO081674.1 AF100307). The plots indicate three exons in gene1 between locations 17332 to 17402, 17645 to 18266, and 18311 to 18505 bp. In Figure 6, the exon peaks are present along with other peaks; therefore, prediction becomes ambiguous. In Figure 7, obtained by the proposed algorithm, there are only three sharp period-3 peaks corresponding to the exons present in the gene. They are in proper location and are absolutely devoid of noise. Hence, there is no scope for any ambiguity. Similar results are seen in Figures 8 and 9 for gene2 with three exons between locations 18994 to 19064, 19349 to 19997 and 20059 to 20253 bp. The technique was applied to the remaining three genes of this DNA and was verified successfully.
Next, both methods were applied to DNA C30C11 (Accession no. FO080722.7 L09634) from C. elegans chromosome III having length 30866 bp. Figures 10 and 11 show spectral peaks obtained by the modified periodogram and the least-norm solution method, respectively, for gene1 with exons between locations 4874 to 4985, 5034 to 5408, 5452 to 6179 and 6227 to 6526 bp. In Figure 11 it is observed that peak2 is shifted to the right of its actual position. Figures 12 and 13 indicate accurate results for gene2 with exon segments between locations 7320 to 7503, 7555 to 7757 and 7804 to 7923 bp. All these plots, showing results of both the existing and proposed methods, reflect the superiority of the proposed technique over the conventional method because the peaks obtained with the proposed algorithm are sharp, well defined, unambiguous, and noise-free. The threshold values for performance analysis of the modified periodogram method have been chosen judiciously as 1.75 and 1.5, respectively. Table 3 lists the genes studied and summarizes the analysis of the modified periodogram and least-norm solution approaches. In all the above examples, the proposed method shows better results than the existing method, giving higher values of sensitivity, specificity and their average as well as lower values of miss rate and wrong rate.
Next, the least-norm algorithm has been applied to organisms with very short exon segments. It is known that prediction of exons shorter than 100 bp is difficult, but the proposed least-norm method is found to be very suitable for detecting the presence of exons as small as 28 bp in length. Table 4 shows details of the organisms with short exons used as test data. Spectral plots for DMPROTP1 and CALEGLOBIM are shown in Figures 14 and 15, respectively. The figures show very sharp, well-defined and noise-free peaks in exon regions even for very small exon segments. Similar tests were performed on other organisms too, giving satisfactory results. Hence, it is established that our method is robust and equally suitable for short as well as long exons.
Although the proposed least-norm algorithm offers high predictive accuracy compared to the existing SDFT method, it has certain limitations. Judicious selection of the model order is a key issue for accurate exon detection. In the least-norm method, the execution time is longer than that of the other existing methods since computation time depends on the autocorrelation lag size, which is determined by the length of the nucleotide sequence being tested. Estimation of periodicity requires the computation of many lags, which involves a great deal of arithmetic and increases the execution time of the proposed technique. It is desirable to exploit certain properties of the autocorrelation function that are known to reduce the computational load. This can be done by taking advantage of the special technique based on reduction in the number of multiplications given by Kendall [34]. Another method for speeding up the autocorrelation computation is the well-known FFT method, which can also help in reducing the computation time of the proposed least-norm technique [35].
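The FFT route to the autocorrelation sequence mentioned above can be sketched as follows, assuming NumPy is available. The function names are hypothetical and the biased estimator (division by the sequence length) is an assumption; Kendall's multiplication-saving technique is not shown.

```python
import numpy as np


def autocorr_direct(x, max_lag):
    """Biased autocorrelation estimate by direct summation, lags 0..max_lag."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    return np.array([np.sum(x[k:] * np.conj(x[: n - k])) / n
                     for k in range(max_lag + 1)])


def autocorr_fft(x, max_lag):
    """Same estimate via the FFT: inverse transform of the squared spectrum."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    n_fft = 1 << (2 * n - 1).bit_length()  # zero-pad to avoid circular wrap-around
    spec = np.fft.fft(x, n_fft)
    r = np.fft.ifft(np.abs(spec) ** 2)[: max_lag + 1]
    return r / n


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal(200) + 1j * rng.standard_normal(200)
    print(np.allclose(autocorr_direct(x, 20), autocorr_fft(x, 20)))
```

The direct sum costs on the order of N multiplications per lag, while the FFT version computes all lags together in roughly N log N operations, so the saving grows with the number of lags required.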
3.2 Eigenvalue-ratio-based model order selection approach
A key issue in developing the eigendecomposition-based model is proper selection of the model order p. In order to estimate the least-norm solution-based pseudospectrum, the dimension M − p of the noise subspace must be determined accurately. If the value of p is smaller than required, a few prominent peaks may go unnoticed. On the other hand, if the selected model order is larger than required, undesired peaks are introduced in the plot, leading to false prediction. The most common approach is to calculate and sort the eigenvalues of the correlation matrix R_{xx} of the noisy signal. The plot of eigenvalues sorted in decreasing order is termed the scree plot. The p largest eigenvalues, which lie on the steep slope, correspond to the signal subspace. The set of M − p smallest eigenvalues, with values approximately equal to the noise variance σ_{n}^{2}, is more or less flat (Figure 1). The change of the derivative from a large negative value to a small one is determined from the slope of tangents drawn to the scree plot. At first, two points are chosen carefully on the scree plot such that the first is on the steep slope and the second is on the less steep portion of the eigencurve. The values of model order p intercepted on the X-axis by the vertical projections dropped from these two tangent points on the eigencurve are identified. A ‘large gap’ or ‘elbow’ is then looked for within this segment by the eigenvalue-ratio technique and is treated as the threshold between the signal and noise subspaces (Figures 16 and 17).
A very simple method based on the eigenvalue ratio, adopted by the authors to find the model order p, is discussed in this subsection [32, 36]. As shown in Figures 18 and 19, the authors have plotted the eigenvalue ratio λ_{p}/λ_{p+1} vs model order p. It is noted that there exists an eigenvalue gap of high magnitude between orders p = 20 and 21 and between p = 16 and 17 in the two figures, respectively. Satisfactory estimates of the rank of R_{xx} by the suggested method were found to be 20 for the F56F11.4a gene, 16 for T12B5.1 gene2, and 7 for C30C11 gene1. Thus, eigenvalues from λ_{21}, λ_{17} and λ_{8} onwards can be treated as noise eigenvalues in the three respective cases.
In this article, a spectral content measure based on the sliding DFT was compared with the proposed least-norm technique. In an early work, Tiwari et al. (1997) employed the Fourier technique to analyze the three-base periodicity in order to recognize coding regions in genomic DNA. They observed that a few genes in Saccharomyces cerevisiae do not exhibit the period-3 property at all. Anastassiou (2000, 2001) was inspired by the work of Tiwari et al. and introduced computational and visual tools for analysis of biomolecular sequences. He developed an optimization procedure for improving the performance of the traditional Fourier technique. Later, Vaidyanathan and Yoon (2004) designed a multistage narrowband bandpass filter for reducing background 1/f noise. Recently, Sahu and Panda (2011) improved computational efficiency by employing SDFT with the help of the Goertzel algorithm, but the method is constrained by frequency resolution and spectral leakage effects.
The least-norm algorithm presented in this paper provides a novel approach. The first important feature of the proposed algorithm is that it produces very sharp and well-defined period-3 peaks in the protein-coding regions. The second significant feature is that it eliminates noise completely; hence, there is no requirement to set a threshold value. The third significant feature of this algorithm is that it is able to detect very short exons effectively as well. Moreover, this method offers very high sensitivity and specificity and very low miss rate and wrong rate compared to the modified periodogram technique.
4 Conclusion
DNA sequence analysis through power spectrum estimation by traditional nonparametric methods has long been in use. These methods are methodologically straightforward, computationally simple, and easy to understand, but due to low SNR, spectral features are difficult to distinguish as noise artifacts appear in the spectral estimates. Therefore, effective identification of protein-coding regions becomes difficult. The application of the least-norm frequency estimator to capture period-3 peaks in coding regions has been introduced here. We used a constrained vector that lies in the noise subspace, and the algorithm completely filters out the spurious peaks. Selection of the proper model order is a fundamental issue in application of the eigendecomposition approach. The eigenvalue-ratio ‘gap’ or ‘elbow’ located on the scree plot is treated as the threshold between the signal and noise spaces. Application of eigendecomposition-based methods to various DNA sequences has given markedly better results than standard classical methods in terms of resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate. It was observed that the high-resolution pseudospectrum estimator based on the least-norm solution could identify protein-coding regions in DNA accurately. Another important feature of the proposed technique is that it can detect the presence of extremely short exon segments, which is difficult for other existing methods. Unfortunately, the computational effort for this high-resolution method is significantly higher than for FFT processing. This limitation may be tackled by applying Kendall’s algorithm or incorporating the well-known FFT method to speed up the autocorrelation computation. Hence, it can be concluded that identification of protein-coding regions in DNA can be done more effectively by applying the least-norm solution technique.
References
 1.
Zhao L: Application of spectral analysis to DNA sequences. CSD, Purdue University, TR #06003; 2006.
 2.
Anastassiou D: Frequency-domain analysis of biomolecular sequences. Bioinformatics 2000,16(12):1073–1081. 10.1093/bioinformatics/16.12.1073
 3.
Anastassiou D: DSP in genomics: processing and frequency-domain analysis of character strings. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 2001. (ICASSP ’01), Salt Lake City, 7–11 May, vol. 2 (IEEE, Piscataway, 2001). pp 1053–1056, 0780370412001
 4.
Vaidyanathan PP, Yoon BJ: The role of signal-processing concepts in genomics and proteomics. J. Franklin Inst. 2004, 351: 111–135.
 5.
Fickett JW, Tung CS: Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 1982,10(17):5303–5318. 10.1093/nar/10.17.5303
 6.
Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R: Prediction of probable genes by Fourier analysis of genomic sequences. CABIOS 1997,3(3):263–270.
 7.
Yin C, Yau SST: Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J. Theor. Biol. 2007, 247: 687–694. 10.1016/j.jtbi.2007.03.038
 8.
Yin C, Yau SST: A Fourier characteristic of coding sequences: origins and a non-Fourier approximation. J. Comput. Biol. 2005,12(9):1153–1165. 10.1089/cmb.2005.12.1153
 9.
Yin C, Yoo D, Yau SS–T: Denoising the 3-base periodicity walk of DNA sequences in gene finding. J. Med. BioEng 2013,2(2):80–83.
 10.
Peng CK, Buldyrev SV, Goldberger AL, Havlin S, Mantegna RN, Simons M, Stanley HE: Statistical properties of DNA sequences. J. Physica. 1995, A221: 180–192.
 11.
Jiang X, Lavenier D, Yau SS–T: Coding region prediction based on a universal DNA sequence representative method. J. Comput. Biol. 2008,15(10):1237–1256. 10.1089/cmb.2008.0041
 12.
Nair AS, Sreenadhan S: An improved digital filtering technique using nucleotide frequency indicators for locating exons. J. CSI 2006,36(1):54–60.
 13.
Tuqan J, Rushdi A: A DSP approach for finding the codon bias in DNA sequences. IEEE J. Signal Process. 2008,2(3):345–355.
 14.
Sahu SS, Panda G: Identification of protein coding regions in DNA sequences using a time frequency filtering approach. Genomics Proteomics Bioinformatics 2011,9(1–2):45–55.
 15.
Yu C, Deng M, Yau SS–T: DNA sequence comparison by a novel probabilistic method. Information Sci. 2011, 181: 1484–1492. 10.1016/j.ins.2010.12.010
 16.
Kwan HK, Benjamin YM K, Jennifer YY K: Novel methodologies for spectral classification of exon and intron sequences. EURASIP J. Adv. Signal Process. 2012, 2012: 50. 10.1186/1687-6180-2012-50
 17.
Roy M, Biswas S, Barman (Mandal) S: Identification and analysis of coding and noncoding regions of a DNA sequence by positional frequency distribution of nucleotides (PFDN) algorithm. Kolkata, India: Paper presented at the international conference on computers and devices for communication CODEC09; 2009.
 18.
Roy M, Barman (Mandal) S: Spectral analysis of coding and noncoding regions of a DNA sequence by parametric and nonparametric methods: a comparative approach. Annals of Faculty Engineering Hunedoara. Int. J. Eng. Romania 2011, 3: 57–62.
 19.
Rao N, Shepherd SJ: Detection of 3-periodicity for small genomic sequences based on AR technique, International Conference on Communications. IAC and Systems 2004, 2: 1032–1036. 27–29 June
 20.
Yu C, Liang Q, Yin C, He RL, Yau SS–T: A novel construction of genome space with biological geometry. DNA Res 2010,18(6):435–449.
 21.
Deng M, Yu C, Liang Q, He RL, Yau SS–T: A novel method of characterizing genetic sequences: genome space with biological distance and applications. PLOS ONE 2011,6(3):e17293. 10.1371/journal.pone.0017293
 22.
Pradhan M, Sahu RK: An exclusive survey on gene prediction methodologies. Int. J. Comp. Sci. Info. Sec 2010,8(7):88–103.
 23.
Shlens J: A Tutorial on principal component analysis, derivation, discussion and singular value decomposition. Version I, pp. 1–16, 25 March 2003. http://www.cs.princeton.edu/picasso/mats/PCATutorialIntuition_jp.pdf
 24.
Ubeyli ED, Guler I: Comparison of eigenvector methods with classical and model-based methods in analysis of internal carotid arterial doppler signals. Comput. Biol. Med. 2003, 33: 473–493. 10.1016/S0010-4825(03)00021-0
 25.
Hayes MH: Statistical Digital Signal Processing and Modeling. New York: Wiley; 1996:393–474.
 26.
Haykin S: Adaptive Filter Theory. 4th edition. Upper Saddle River: Prentice Hall; 2002. pp. 809–822
 27.
Stoica P, Moses R: Spectral Analysis of Signals. New Delhi: PHI Learning Pvt. Ltd; 2011:23–67.
 28.
Proakis JG, Manolakis DG: Digital Signal Processing: Principles, Algorithms and Applications. 4th edition. New Delhi: PHI Learning Pvt. Ltd; 2008:960–985.
 29.
NCBI Database . Accessed 20 July 2012 http://www.ncbi.nlm.nih.gov
 30.
Nair AS, Sreenadhan SP: A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation 2006,1(6):197–202.
 31.
Brodzik AK, Peters O: Symbol-balanced quaternionic periodicity transform for latent pattern detection in DNA sequences. ICASSP 2005, 5: 373–376.
 32.
Lobos T, Leonowicz Z, Rezmer J, Koglin HJ: Harmonics and interharmonics estimation using advanced signal processing methods. In Proceedings of the 9th International Conference on Harmonics and Quality Power. Orlando; 1–4 October 2000, Vol. I, pp. 335–340
 33.
Meher J, Meher PK, Dash G: Improved comb filter based approach for effective prediction of protein coding regions in DNA sequences. J. Sig. Info. Proc 2011, 2: 88–99.
 34.
Kendall WB: A new algorithm for computing autocorrelations. IEEE Trans. Computers 1974,C-23(1):90–93.
 35.
Rabiner LR, Schafer RW: Digital Processing of Speech Signals. Noida: Dorling Kindersley (India) Pvt. Ltd.; 2013:178–180.
 36.
Liavas AP, Regalia PA: On the behavior of information theoretic criteria for model order selection. IEEE Trans. Signal. Process. 2001,49(8):1689–1695. 10.1109/78.934138
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Roy, M., Barman, S. Effective gene prediction by high resolution frequency estimator based on least-norm solution technique. J Bioinform Sys Biology 2014, 2 (2014) doi:10.1186/1687-4153-2014-2
Keywords
 Periodogram
 Deoxyribonucleic acid
 Least-norm solution
 Eigenvector
 Eigenvalue