Effective gene prediction by high resolution frequency estimator based on least-norm solution technique
- Manidipa Roy^{1} and
- Soma Barman^{2}
https://doi.org/10.1186/1687-4153-2014-2
© Roy and Barman; licensee Springer. 2014
Received: 20 August 2013
Accepted: 15 December 2013
Published: 4 January 2014
Abstract
The linear-algebraic concept of a subspace plays a significant role in recent techniques of spectrum estimation. In this article, the authors utilize the noise subspace concept to find hidden periodicities in DNA sequences. With the vast growth of genomic sequences, the demand to accurately identify the protein-coding regions in DNA is rising. Several techniques of DNA feature extraction that draw on various cross-disciplinary fields have emerged in the recent past, among which the application of digital signal processing tools is of prime importance. It is known that coding segments have a 3-base periodicity, while non-coding regions do not have this feature. One of the most important spectrum analysis techniques based on the concept of subspace is the least-norm method. The least-norm estimator developed in this paper shows sharp period-3 peaks in coding regions, completely eliminating background noise. The proposed method is compared with the existing sliding discrete Fourier transform (SDFT) method, popularly known as the modified periodogram method, on several genes from various organisms, and the results show that the proposed method is a better and more effective approach to gene prediction. Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to establish the superiority of the least-norm gene prediction method over the existing method.
Keywords
Periodogram, Deoxyribonucleic acid, Least-norm solution, Eigenvector, Eigenvalue
1 Introduction
It has been observed that the most significant scientific and technological endeavour of the 21st century is mostly related to genomics. Therefore, researchers from various cross fields have concentrated in the field of genomic analysis in order to extract the vast information content hidden in it. Deoxyribonucleic acid (DNA) is the hereditary material present in all living organisms. In eukaryotic organisms, genes (sequences of DNA) consist of exons (coding segments) and introns (non-coding segments). It has been established that genetic information is stored in the particular order of four kinds of nucleotide bases, Adenine (a), Thymine (t), Cytosine (c) and Guanine (g) which comprise the DNA biomolecule along with sugar-phosphate backbone. Exons of a DNA sequence are specified as the most information-bearing part because only the exons take part in protein coding while the introns are spliced off during protein synthesis process. Gene prediction means detecting locations of the protein-coding regions of genes in a long DNA chain. Since DNA encodes information of proteins, various statistical and computational techniques have been studied and explored to extract the information content carried by DNA and distinguish exons from introns.
Genomic information is made up of a finite number of nucleotides in the form of alphabetical characters; hence, it is discrete in nature. As a result, digital signal processing (DSP) techniques can be used as effective tools to analyze DNA in order to capture its periodic characteristics. The main objective of spectrum estimation is the determination of the power spectral density of a random process. Power spectral density (PSD) describes how the average power of a signal x[n] is distributed with frequency, where x[n] is a sequence of random variables defined for every integer n. The estimated PSD provides information about the structure of a random process, which can be used for refined modeling, prediction, or filtering. Estimation of the power spectrum of discretely sampled processes is generally based on procedures employing the fast Fourier transform (FFT). This approach is computationally efficient and produces reasonable results, but it has certain performance limitations. The most important limitation lies in its frequency resolution. Moreover, spectral estimation by the Fourier method generates various harmonics which often lead to false prediction of coding regions. Among the recently introduced techniques, the eigendecomposition-based noise subspace method known as the least-norm solution is of great interest. In the present paper, the authors address the problems posed by the standard FFT method and propose a least-norm algorithm based on the concept of subspace frequency estimation for effective and accurate prediction of coding regions in DNA sequences.
Application of DSP methods to find periodicities in DNA sequences has been studied by various researchers [1–4]. It is established that exon regions of DNA molecules exhibit a period-3 property because of the codon structure involved in the translation of nucleotide bases into amino acids [5–7]. Yin and Yau explained the phenomenon of three-base periodicity in the Fourier power spectrum of protein-coding regions as resulting from the nonuniform distribution of nucleotides in the three codon positions [8]. An improved algorithm for gene finding by period-3 periodicity using the nonlinear tracking differentiator is presented by Yin et al. [9]. Peng et al. discussed statistical properties of genes in their article [10]. A universal graphical representation method based on S.S.-T. Yau's technique, employing trigonometric functions that denote the four nucleotide bases to predict coding regions, is presented by Jiang et al. [11]. Application of digital filters to extract period-3 components and effectively eliminate background noise present in DNA sequences has given good results [12–14]. Yu et al. used probability distributions to study similarity in DNA sequences, employing the symmetrized Kullback–Leibler divergence [15]. Kwan et al. introduced novel codes for one-sequence numerical representation for spectral analysis and compared them with existing mapping techniques [16]. Roy et al. introduced positional frequency distribution of nucleotides (PFDN), an algorithm for prediction of coding regions [17]. Parametric techniques of gene prediction, where autoregressive all-pole models were used for identifying coding and non-coding regions, provided better results [18, 19]. Yu et al. proposed a novel method to construct moment vectors for DNA sequences using a two-dimensional graphical representation and proved that the two had one-to-one correspondence [20]. In another work, Deng et al. introduced a novel method of characterizing genetic sequences, defining a genome space with biological distance for subsequent applications in analyzing and annotating genomes [21]. An exclusive survey of various gene prediction techniques is presented by Pradhan et al. [22]. The fundamental theory of principal component analysis is explained by Shlens and its application is discussed by Ubeyli et al. [23, 24].
In this article, the authors compare the power spectral peaks obtained by the modified periodogram method with the pseudo-spectrum obtained by the least-norm solution method for detecting the presence of coding regions in DNA sequences, and establish the superiority of the latter technique [25–28]. The algorithm has been successfully tested on several sample databases downloaded from NCBI GenBank [29].
2 Materials and methods
Therefore, three out of these four binary sequences would be enough to uniquely determine the DNA character string. There are several other techniques such as complex numbers [2], paired numeric [6], universal graphical representation [11], weak-strong hydrogen bonding [18], EIIP [30], quaternion [31] etc. each having a certain special feature of its kind. Rao and Shepherd [19] in their study found that complex mapping was one of the most effective and compact mapping rules. In a recent work, Kwan et al. [16] introduced several novel codes for single-sequence numerical representations for spectral analysis and studied their relative performances. They focused on direct and simple numerical representations which satisfied the following requirements:
(a). Single-sequence mapping for a nucleotide sequence
(b). Fixed value mapping for each nucleotide
(c). Accessible to digital signal processing analysis
Numerical representations
Name | c | g | a | t | Remarks |
---|---|---|---|---|---|
K-Quaternary Code-III | -j | -1 | +1 | +j | Rao and Shepherd |
K-Quaternary Code-I | -1 | -j | +1 | +j | Kwan et al. |
Quaternary Code proposed | -j | +1 | -1 | +j | Proposed mapping |
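As a minimal illustration of the last table row, the sketch below maps a nucleotide string onto the proposed complex quaternary code. The names `PROPOSED_CODE` and `dna_to_complex` are hypothetical and introduced here only for illustration; the four symbol values are the ones listed in the table.

```python
# Symbol values taken from the table row "Quaternary Code proposed":
# c -> -j, g -> +1, a -> -1, t -> +j
PROPOSED_CODE = {"c": -1j, "g": 1 + 0j, "a": -1 + 0j, "t": 1j}


def dna_to_complex(sequence, code=PROPOSED_CODE):
    """Map a nucleotide string to a list of complex numbers, one per base."""
    seq = sequence.lower()
    unknown = set(seq) - set(code)
    if unknown:
        # Ambiguity codes such as 'n' are not covered by the table.
        raise ValueError(f"unsupported symbols: {sorted(unknown)}")
    return [code[base] for base in seq]
```

For example, `dna_to_complex("atgc")` yields the values -1, +j, +1, -j in that order. Passing one of the other table rows as `code` gives the corresponding alternative mapping.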
Once numerical conversion of DNA sequence is obtained, DSP technique can easily be applied to estimate its power spectrum. Spectral estimation by non-parametric method can be broadly classified as direct and indirect. These two methods are equivalent and are popularly known as the periodogram method. The direct method takes discrete Fourier transform (DFT) of the signal and then averages the square of its magnitude. The indirect method is based on the concept of first estimating the autocorrelation of data sequence and then taking its Fourier transform (FT).
In the first part of this section, spectral analysis of DNA by the periodogram method is discussed in brief. The basics of eigendecomposition are given in the second subsection. The mathematical background of the least-norm solution is explained in the third subsection, followed by the algorithm of the least-norm solution technique. Results and discussion are presented in the next section of this article: its first subsection compares the performance of the proposed method with the modified periodogram method, and its second subsection elaborates model order selection by the eigenvalue-ratio technique. The conclusion is drawn in the final section. MATLAB 7.1 software has been used to show the performance of the estimators.
2.1 Spectral analysis by modified periodogram method
To enhance performance of the periodogram method, at first, the N-point data sequence is divided into K overlapping segments of length M each, then the periodogram is computed applying the Bartlett window; finally, the average is computed from the result.
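The averaging procedure described above can be sketched as follows. This is an illustrative NumPy version, not the authors' MATLAB code; the 50% overlap, the window-power normalisation, and the function name `modified_periodogram` are assumptions made here, since the text does not specify them.

```python
import numpy as np


def modified_periodogram(x, M, overlap=0.5):
    """Average Bartlett-windowed periodograms over overlapping segments.

    x       : numerically mapped DNA sequence (real or complex)
    M       : segment length
    overlap : fraction of overlap between segments (assumed, not from the text)
    """
    x = np.asarray(x, dtype=complex)
    step = max(1, int(M * (1 - overlap)))
    window = np.bartlett(M)
    U = np.sum(window ** 2) / M          # window power, used for normalisation
    starts = range(0, len(x) - M + 1, step)
    segments = [x[s:s + M] for s in starts]
    if not segments:
        raise ValueError("sequence is shorter than the segment length M")
    psd = np.zeros(M)
    for seg in segments:
        psd += np.abs(np.fft.fft(seg * window)) ** 2 / (M * U)
    return psd / len(segments)           # average over the K segments
```

With a segment length M that is a multiple of 3, the period-3 component falls exactly on bin M/3 of the returned spectrum.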
2.2 Spectral analysis by eigendecomposition
where the amplitudes A_{i} are complex values given by A_{i} = |A_{i}| e^{jφ_{i}}, with φ_{i} being uncorrelated random variables that are uniformly distributed over the interval [-π, π]. The power spectrum of x(n) consists of a set of p impulses, with strength proportional to |A_{i}|^{2}, at frequencies w_{i} for i = 1,2,3,…,p, plus the power spectrum of the white noise w(n) having variance σ_{n}^{2}.
An issue that is of central importance to successful implementation of principal-component analysis (PCA) is the selection of appropriate model order p since the accuracy of estimated spectrum is critically dependent on this choice. In this article, the eigenvalue-ratio technique has been adopted for optimum model order selection. A plot of λ_{ p }/λ_{p+1} vs integer values p indicates a large eigenvalue gap at the threshold of signal subspace and noise subspace. This p value is chosen as the required model order and eigenvalues λ_{p+1} to λ_{M} are assumed to be the noise eigenvalues corresponding to the noise subspace.
1. Formation of autocorrelation matrix from data vector.
2. Derivation of noise subspace with the help of eigendecomposition.
3. Identification of signal components from noise subspace by frequency estimation function.
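The eigenvalue-ratio rule and the first two steps listed above can be sketched as below. This is a hedged illustration in NumPy: the snapshot-based autocorrelation estimate, the numerical floor on small eigenvalues, and the function names are assumptions introduced here, not details given in the text.

```python
import numpy as np


def autocorrelation_matrix(x, M):
    """Estimate the M x M autocorrelation matrix from length-M data snapshots."""
    x = np.asarray(x, dtype=complex)
    snapshots = np.array([x[i:i + M] for i in range(len(x) - M + 1)])
    # R = (1/n) * sum_i x_i x_i^H, where each row of `snapshots` is x_i^T
    return snapshots.T @ snapshots.conj() / snapshots.shape[0]


def model_order_by_eigenvalue_ratio(R):
    """Return (p, ratios), where p maximises lambda_p / lambda_{p+1}."""
    eigvals = np.linalg.eigvalsh(R)[::-1]            # descending order
    floor = np.finfo(float).eps * eigvals[0]         # guard against ~0 values
    eigvals = np.maximum(eigvals, floor)
    ratios = eigvals[:-1] / eigvals[1:]              # lambda_p / lambda_{p+1}
    p = int(np.argmax(ratios)) + 1                   # largest eigenvalue gap
    return p, ratios
```

Plotting `ratios` against p = 1, 2, …, M - 1 gives the eigenvalue-ratio curve; the eigenvectors belonging to the M - p smallest eigenvalues then span the estimated noise subspace.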
2.3 Frequency estimation by least-norm solution
where Z_{k} for k = (p + 1),…,(M - 1) are the spurious roots that in general do not lie on the unit circle. The least-norm method attempts to eliminate the effects of spurious zeros by pushing them inside the unit circle leaving the desired zeros on the unit circle. The problem then is to determine which vector in the noise subspace minimizes the effects of spurious zeros on the peaks of ${\widehat{P}}_{\mathrm{LN}}\left({e}^{\mathit{jw}}\right)$.
1. The vector $\overrightarrow{a}$ lies on the noise subspace ensuring that p roots of A(z) are on the unit circle.
2. The vector $\overrightarrow{a}$ has least Euclidean norm ensuring that spurious roots of A(z) lie inside unit circle.
3. The first element of $\overrightarrow{a}$ is unity, i.e. least-norm solution is not the zero vector.
The least-norm method involves projection of signal vector $\overrightarrow{v}$ onto the entire noise space.
Minimizing the norm of $\overrightarrow{a}$ is equivalent to finding the vector $\overrightarrow{v}$ that minimizes the quadratic form ${\overrightarrow{v}}^{\mathrm{H}}{P}_{n}\overrightarrow{v}$, where ${P}_{n}$ is the projection matrix onto the noise subspace.
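The constrained minimization stated above can be written out as follows. This is a sketch of the standard textbook derivation (see, e.g., [25]), not a reproduction of the equations lost from this section; the unit vector $\overrightarrow{u}_1 = [1, 0, \ldots, 0]^{T}$ and the frequency vector $\overrightarrow{e} = [1, e^{jw}, \ldots, e^{j(M-1)w}]^{T}$ are notation assumed here.

$$\overrightarrow{a} = P_n \overrightarrow{v}, \qquad \|\overrightarrow{a}\|^{2} = \overrightarrow{v}^{\mathrm{H}} P_n^{\mathrm{H}} P_n \overrightarrow{v} = \overrightarrow{v}^{\mathrm{H}} P_n \overrightarrow{v}$$

$$\min_{\overrightarrow{v}} \; \overrightarrow{v}^{\mathrm{H}} P_n \overrightarrow{v} \quad \text{subject to} \quad \overrightarrow{v}^{\mathrm{H}} P_n \overrightarrow{u}_1 = 1$$

$$\overrightarrow{a} = \frac{P_n \overrightarrow{u}_1}{\overrightarrow{u}_1^{\mathrm{H}} P_n \overrightarrow{u}_1}, \qquad \widehat{P}_{\mathrm{LN}}\left(e^{jw}\right) = \frac{1}{\left|\overrightarrow{e}^{\mathrm{H}} \overrightarrow{a}\right|^{2}}$$

The first line uses the fact that a projection matrix is Hermitian and idempotent; the constraint expresses the requirement that the first element of $\overrightarrow{a}$ be unity.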
2.4 Algorithm of proposed least-norm solution technique for estimating period-3 peaks
Step 1 Convert the samples of data vectors to column vector.
Step 2 Compute autocorrelation matrix of data with pre-determined lag size (M).
Step 3 Diagonalize the autocorrelation matrix. Produce diagonal matrix D of eigenvalues and a full matrix V whose columns are the corresponding eigenvectors so that X*V = V*D, where X is the signal matrix.
Step 4 Sort diagonal matrix D in ascending order for eigendecomposition. Take into account noise subspace spanned by the eigenvectors corresponding to nonsignificant eigenvalues.
Step 5 Project signal vector $\overrightarrow{v}$ onto the noise space using projection matrix.
Step 6 Find Least Norm vector $\overrightarrow{a}$ on noise subspace with first element equal to unity using QR factorization and applying the Optimization Theory.
Step 8 Plot the result (in dB) to observe period-3 spectral peaks.
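The steps above can be sketched in NumPy as follows. This is an illustrative version under stated assumptions rather than the authors' MATLAB implementation: it uses the closed-form textbook expression for the least-norm vector (projection of the first unit vector onto the noise subspace, scaled so its first element is unity) in place of the QR factorization mentioned in Step 6, and the function name, the snapshot-based autocorrelation estimate, and the frequency grid are choices made here.

```python
import numpy as np


def least_norm_pseudospectrum(x, M, p, n_freq=512):
    """Least-norm pseudo-spectrum (in dB) of x for lag size M and model order p."""
    if not 0 < p < M:
        raise ValueError("model order p must satisfy 0 < p < M")
    x = np.asarray(x, dtype=complex).ravel()             # Step 1: column data
    snaps = np.array([x[i:i + M] for i in range(len(x) - M + 1)])
    R = snaps.T @ snaps.conj() / snaps.shape[0]          # Step 2: autocorrelation
    eigvals, V = np.linalg.eigh(R)                       # Steps 3-4: ascending order
    Vn = V[:, :M - p]                                    # noise-subspace eigenvectors
    Pn = Vn @ Vn.conj().T                                # projection matrix
    a = Pn[:, 0]                                         # Step 5: Pn applied to u1
    a = a / a[0]                                         # Step 6: first element = 1
    freqs = np.arange(n_freq) / n_freq                   # cycles per sample
    E = np.exp(2j * np.pi * np.outer(freqs, np.arange(M)))
    denom = np.abs(E.conj() @ a) ** 2                    # |e^H a|^2 on the grid
    denom = np.maximum(denom, np.finfo(float).tiny)      # avoid division by zero
    return freqs, 10.0 * np.log10(1.0 / denom)           # pseudo-spectrum in dB
```

A period-3 component shows up as a peak near the normalised frequency 1/3; the returned arrays can be passed directly to any plotting routine for Step 8.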
3 Results and discussion
Summary of statistical parameters and computation time of modified periodogram and least-norm methods for various genes
Gene | SDFT Q.F. ((mean)^{2}/var) | SDFT CPU time (s) | SDFT window length M | SDFT no. of segments K | Least-norm Q.F. ((mean)^{2}/var) | Least-norm CPU time (s) | Least-norm model order p | Percent rise in Q.F. |
---|---|---|---|---|---|---|---|---|
F56F11.4a | 4.83 | 0.24 | 351 | 23 | 121.89 | 104.86 | 20 | 2.42e+003 |
T12B5.1G-1 | 6.32 | 0.14 | 252 | 07 | 347.96 | 48.72 | 08 | 5.41e+003 |
T12B5.1G-2 | 5.58 | 0.14 | 252 | 08 | 305.51 | 50.37 | 16 | 5.37e+003 |
T12B5.1G-3 | 3.54 | 0.09 | 252 | 04 | 742.96 | 06.68 | 02 | 2.09e+004 |
T12B5.1G-4 | 8.38 | 0.15 | 252 | 09 | 221.09 | 54.15 | 17 | 2.54e+003 |
T12B5.1G-5 | 5.88 | 0.13 | 252 | 06 | 227.29 | 07.76 | 17 | 3.76e+003 |
C30C11G-1 | 10.43 | 0.18 | 252 | 12 | 498.41 | 11.37 | 07 | 4.68e+003 |
C30C11G-2 | 3.92 | 0.10 | 210 | 04 | 107.79 | 06.21 | 17 | 2.65e+003 |
D13156 | 4.84 | 0.15 | 351 | 05 | 246.08 | 37.38 | 17 | 4.98e+003 |
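The last column of the table is consistent with a relative increase of the least-norm quality factor over the SDFT one, expressed in percent. The sketch below shows that arithmetic together with the (mean)^{2}/var quality factor from the column headers. The function names are hypothetical, the use of the population variance is an assumption, and the text does not state over which spectral values the quality factor is evaluated.

```python
def quality_factor(values):
    """Quality factor (mean)^2 / var, using the population variance (assumed)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean ** 2 / var


def percent_rise(qf_sdft, qf_least_norm):
    """Relative increase of the least-norm Q.F. over the SDFT Q.F., in percent."""
    return 100.0 * (qf_least_norm - qf_sdft) / qf_sdft
```

For the first row of the table, `percent_rise(4.83, 121.89)` is about 2.42e+003, which matches the tabulated value.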
3.1 Performance comparison of proposed method with existing method
Summary of performance analysis of data for least-norm and modified periodogram methods
Gene | DSP method | Threshold value | S_{n} | S_{p} | (S_{n} + S_{p})/2 | M_{r} | W_{r} |
---|---|---|---|---|---|---|---|
F56F11.4a | Periodogram | 1.75 | 0.4 | 1.0 | 0.70 | 0.6 | 0.0 |
F56F11.4a | Periodogram | 1.50 | 0.8 | 0.66 | 0.73 | 0.2 | 0.4 |
F56F11.4a | Least-norm | * | 1.0 | 1.00 | 1.00 | 0.0 | 0.00 |
T12B5 Gene-1 | Periodogram | 1.75 | 1.0 | 0.43 | 0.71 | 0.0 | 0.55 |
T12B5 Gene-1 | Periodogram | 1.50 | 1.0 | 0.33 | 0.66 | 0.0 | 0.66 |
T12B5 Gene-1 | Least-norm | * | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
T12B5 Gene-2 | Periodogram | 1.75 | 1.0 | 0.6 | 0.8 | 0.0 | 0.4 |
T12B5 Gene-2 | Periodogram | 1.50 | 1.0 | 0.5 | 0.75 | 0.0 | 0.5 |
T12B5 Gene-2 | Least-norm | * | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
T12B5 Gene-3 | Periodogram | 1.75 | 1.0 | 0.15 | 0.57 | 0.0 | 0.84 |
T12B5 Gene-3 | Periodogram | 1.50 | 1.0 | 0.12 | 0.56 | 0.0 | 0.87 |
T12B5 Gene-3 | Least-norm | * | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
T12B5 Gene-4 | Periodogram | 1.75 | 0.5 | 0.4 | 0.45 | 0.5 | 0.6 |
T12B5 Gene-4 | Periodogram | 1.50 | 0.75 | 0.33 | 0.54 | 0.25 | 0.66 |
T12B5 Gene-4 | Least-norm | * | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
T12B5 Gene-5 | Periodogram | 1.75 | 0.66 | 0.22 | 0.44 | 0.33 | 0.77 |
T12B5 Gene-5 | Periodogram | 1.50 | 1.0 | 0.25 | 0.62 | 0.0 | 0.75 |
T12B5 Gene-5 | Least-norm | * | 1.0 | 1.00 | 1.00 | 0.0 | 0.0 |
C30C11 Gene-1 | Periodogram | 1.75 | 0.5 | 0.4 | 0.45 | 0.5 | 0.6 |
C30C11 Gene-1 | Periodogram | 1.50 | 1.0 | 0.4 | 0.7 | 0.0 | 0.6 |
C30C11 Gene-1 | Least-norm | * | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
C30C11 Gene-2 | Periodogram | 1.75 | 1.0 | 0.33 | 0.66 | 0.0 | 0.66 |
C30C11 Gene-2 | Periodogram | 1.50 | 1.0 | 0.21 | 0.60 | 0.0 | 0.78 |
C30C11 Gene-2 | Least-norm | * | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
D13156 | Periodogram | 1.75 | 1.0 | 0.22 | 0.61 | 0.0 | 0.77 |
D13156 | Periodogram | 1.50 | 1.0 | 0.15 | 0.57 | 0.0 | 0.86 |
D13156 | Least-norm | * | 1.0 | 0.5 | 0.75 | 0.0 | 0.5 |
Details of organisms with short exons
Gene ID | GenBank accession no. | DNA length in bp | Length of exons in bp | Source |
---|---|---|---|---|
DMPROTP1 | L17007.1 | 624 | 177 (122 to 248, 376 to 425); Exon1-127 and Exon2-50 | Didelphis marsupialis (Southern opossum) |
OAMTTI | X07975.1 | 2055 | 186 (995 to 1022, 1312 to 1377, 1697 to 1788); Exon1-28, Exon2-66, Exon3-92 | Ovis aries (sheep) |
CALEGLOBIM | L25363.1 | 1698 | 444 (144 to 235, 364 to 586, 1399 to 1527); Exon1-92, Exon2-223, Exon3-129 | Callithrix jacchus (white tufted ear marmoset) |
PIGAPAI | L00626.1 | 3333 | 798 (751 to 793, 975 to 1128, 1770 to 2370); Exon1-43, Exon2-154, Exon3-601 | Sus scrofa (pig) |
Although the proposed least-norm algorithm offers higher predictive accuracy than the existing SDFT method, it has certain limitations. Judicious selection of the model order is a key issue for accurate exon detection. The execution time of the least-norm method is longer than that of the other existing methods, since computation time depends on the autocorrelation lag size, which in turn is chosen according to the length of the nucleotide sequence being tested. Estimating periodicity requires the computation of many lags, which involves a great deal of arithmetic and increases the execution time of the proposed technique. It is therefore desirable to exploit properties of the autocorrelation function that are known to reduce the computational load. This can be done by taking advantage of the technique given by Kendall, which reduces the number of multiplications [34]. Another way of speeding up the autocorrelation computation is the well-known FFT method, which can also help in reducing the computation time of the proposed least-norm technique [35].
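The FFT-based speed-up mentioned above can be illustrated with the following sketch, which computes the same biased autocorrelation estimate directly and via the FFT. It is a generic NumPy illustration with hypothetical function names, not the procedure of [35] verbatim and not the Kendall algorithm of [34].

```python
import numpy as np


def autocorr_direct(x, max_lag):
    """Biased estimate r[k] = (1/n) * sum_m x[m+k] * conj(x[m]), O(n * max_lag)."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    return np.array([np.sum(x[k:] * np.conj(x[:n - k])) / n
                     for k in range(max_lag + 1)])


def autocorr_fft(x, max_lag):
    """Same estimate computed through the FFT, O(n log n)."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    nfft = 1 << (2 * n - 1).bit_length()   # zero-pad to avoid circular wrap-around
    X = np.fft.fft(x, nfft)
    r = np.fft.ifft(np.abs(X) ** 2)[:max_lag + 1]
    return r / n
```

Both functions return lags 0 to `max_lag`; the FFT route becomes attractive when the number of lags grows with the sequence length, which is the situation described above.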
3.2 Eigenvalue-ratio based model order selection approach
In this article, spectral content measures based on the sliding DFT were compared with the proposed least-norm technique. In an early work, Tiwari et al. (1997) employed the Fourier technique to analyze the three-base periodicity in order to recognize coding regions in genomic DNA. They observed that a few genes in Saccharomyces cerevisiae do not exhibit the period-3 property at all. Anastassiou (2000, 2001), inspired by the work of Tiwari et al., introduced computational and visual tools for the analysis of biomolecular sequences and developed an optimization procedure for improving the performance of the traditional Fourier technique. Later, Vaidyanathan and Yoon (2004) designed a multistage narrowband band-pass filter for reducing background 1/f noise. Recently, Sahu and Panda (2011) improved computational efficiency by employing the SDFT with the help of the Goertzel algorithm, but the method is constrained by frequency resolution and spectral leakage effects.
The least-norm algorithm presented in this paper provides a novel approach. The first important feature of the proposed algorithm is that it produces very sharp and well-defined period-3 peaks in the protein-coding regions. The second significant feature is that it eliminates noise completely; hence, there is no requirement to set a threshold value. The third significant feature is that it is able to detect very short exons effectively as well. Moreover, this method offers very high sensitivity and specificity and very low miss rate and wrong rate compared to the other available techniques.
4 Conclusion
DNA sequence analysis through power spectrum estimation by traditional non-parametric methods has long been in use. These methods are methodologically straightforward, computationally simple, and easy to understand, but due to low SNR, spectral features are difficult to distinguish because noise artifacts appear in the spectral estimates. Effective identification of protein-coding regions therefore becomes difficult. The application of the least-norm frequency estimator to capture period-3 peaks in coding regions has been introduced here. We used a constrained vector that lies on the noise subspace, and the algorithm completely filters out the spurious peaks. Selection of the proper model order is a fundamental issue in the application of the eigendecomposition approach. The eigenvalue-ratio 'gap' or 'elbow' located on the scree plot is treated as the threshold between signal and noise spaces. Application of eigendecomposition-based methods to various DNA sequences has given markedly better results than standard classical methods in terms of resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate. It was observed that the high-resolution pseudo-spectrum estimator based on the least-norm solution could identify protein-coding regions in DNA accurately. Another important feature of the proposed technique is that it can detect the presence of extremely short exon segments, which is difficult for other existing methods. The computational effort for this high-resolution method is, however, significantly higher than that of FFT processing. This limitation may be tackled by applying Kendall's algorithm or by incorporating the well-known FFT method to speed up the autocorrelation computation. Hence, it can be concluded that identification of protein-coding regions in DNA can be done more effectively by applying the least-norm solution technique.
References
1. Zhao L: Application of spectral analysis to DNA sequences. CSD, Purdue University, TR #06-003; 2006.
2. Anastassiou D: Frequency-domain analysis of biomolecular sequences. Bioinformatics 2000, 16(12):1073-1081. doi:10.1093/bioinformatics/16.12.1073
3. Anastassiou D: DSP in genomics: processing and frequency-domain analysis of character strings. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 2001 (ICASSP '01), Salt Lake City, 7-11 May, vol. 2 (IEEE, Piscataway, 2001). pp 1053-1056, 0-7803-7041-2001
4. Vaidyanathan PP, Yoon BJ: The role of signal-processing concepts in genomics and proteomics. J. Franklin Inst. 2004, 351: 111-135.
5. Fickett JW, Tung CS: Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 1982, 10(17):5303-5318. doi:10.1093/nar/10.17.5303
6. Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R: Prediction of probable genes by Fourier analysis of genomic sequences. CABIOS 1997, 3(3):263-270.
7. Yin C, Yau SS-T: Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J. Theor. Biol. 2007, 247: 687-694. doi:10.1016/j.jtbi.2007.03.038
8. Yin C, Yau SS-T: A Fourier characteristic of coding sequences: origins and a non-Fourier approximation. J. Comput. Biol. 2005, 12(9):1153-1165. doi:10.1089/cmb.2005.12.1153
9. Yin C, Yoo D, Yau SS-T: Denoising the 3-base periodicity walk of DNA sequences in gene finding. J. Med. Bio-Eng 2013, 2(2):80-83.
10. Peng CK, Buldyrev SV, Goldberger AL, Havlin S, Mantegna RN, Simons M, Stanley HE: Statistical properties of DNA sequences. J. Physica. 1995, A-221: 180-192.
11. Jiang X, Lavenier D, Yau SS-T: Coding region prediction based on a universal DNA sequence representative method. J. Comput. Biol. 2008, 15(10):1237-1256. doi:10.1089/cmb.2008.0041
12. Nair AS, Sreenadhan S: An improved digital filtering technique using nucleotide frequency indicators for locating exons. J. CSI 2006, 36(1):54-60.
13. Tuqan J, Rushdi A: A DSP approach for finding the codon bias in DNA sequences. IEEE J. Signal Process. 2008, 2(3):345-355.
14. Sahu SS, Panda G: Identification of protein coding regions in DNA sequences using a time frequency filtering approach. Genomics Proteomics Bioinformatics 2011, 9(1-2):45-55.
15. Yu C, Deng M, Yau SS-T: DNA sequence comparison by a novel probabilistic method. Information Sci. 2011, 181: 1484-1492. doi:10.1016/j.ins.2010.12.010
16. Kwan HK, Benjamin YM K, Jennifer YY K: Novel methodologies for spectral classification of exon and intron sequences. EURASIP J. Adv. Signal Process. 2012, 2012: 50. doi:10.1186/1687-6180-2012-50
17. Roy M, Biswas S, Barman (Mandal) S: Identification and analysis of coding and non-coding regions of a DNA sequence by positional frequency distribution of nucleotides (PFDN) algorithm. Kolkata, India: Paper presented at the international conference on computers and devices for communication CODEC-09; 2009.
18. Roy M, Barman (Mandal) S: Spectral analysis of coding and non-coding regions of a DNA sequence by parametric and non-parametric methods: a comparative approach. Annals of Faculty Engineering Hunedoara. Int. J. Eng. Romania 2011, 3: 57-62.
19. Rao N, Shepherd SJ: Detection of 3-periodicity for small genomic sequences based on AR technique, International Conference on Communications. IAC and Systems 2004, 2: 1032-1036. 27-29 June
20. Yu C, Liang Q, Yin C, He RL, Yau SS-T: A novel construction of genome space with biological geometry. DNA Res 2010, 18(6):435-449.
21. Deng M, Yu C, Liang Q, He RL, Yau SS-T: A novel method of characterizing genetic sequences: genome space with biological distance and applications. PLOS ONE 2011, 6(3):e17293. doi:10.1371/journal.pone.0017293
22. Pradhan M, Sahu RK: An exclusive survey on gene prediction methodologies. Int. J. Comp. Sci. Info. Sec 2010, 8(7):88-103.
23. Shlens J: A Tutorial on principal component analysis, derivation, discussion and singular value decomposition. Version-I, pp. 1-16, 25 March (2003), http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
24. Ubeyli ED, Guler I: Comparison of eigenvector methods with classical and model-based methods in analysis of internal carotid arterial doppler signals. Comput. Biol. Med. 2003, 33: 473-493. doi:10.1016/S0010-4825(03)00021-0
25. Hayes MH: Statistical Digital Signal Processing and Modeling. New York: Wiley; 1996:393-474.
26. Haykin S: Adaptive Filter Theory. 4th edition. Upper Saddle River: Prentice Hall; 2002:809-822.
27. Stoica P, Moses R: Spectral Analysis of Signals. New Delhi: PHI Learning Pvt. Ltd; 2011:23-67.
28. Proakis JG, Manolakis DG: Digital Signal Processing: Principles, Algorithms and Applications. 4th edition. New Delhi: PHI Learning Pvt. Ltd; 2008:960-985.
29. NCBI Database. http://www.ncbi.nlm.nih.gov. Accessed 20 July 2012
30. Nair AS, Sreenadhan SP: A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation 2006, 1(6):197-202.
31. Brodzik AK, Peters O: Symbol-balanced quaternionic periodicity transform for latent pattern detection in DNA sequences. ICASSP 2005, 5: 373-376.
32. Lobos T, Leonowicz Z, Rezmer J, Koglin H-J: Harmonics and interharmonics estimation using advanced signal processing methods. In Proceedings of the 9th International Conference on Harmonics and Quality Power. Orlando; 1-4 October 2000, Vol-I, pp. 335-340
33. Meher J, Meher PK, Dash G: Improved comb filter based approach for effective prediction of protein coding regions in DNA sequences. J. Sig. Info. Proc 2011, 2: 88-99.
34. Kendall WB: A New algorithm for computing autocorrelations. IEEE Trans. Computers 1974, C-23(1):90-93.
35. Rabiner LR, Schafer RW: Digital Processing of Speech Signals. Noida: Dorling Kindersley (India) Pvt. Ltd.; 2013:178-180.
36. Liavas AP, Regalia PA: On the behavior of information theoretic criteria for model order selection. IEEE Trans. Signal. Process. 2001, 49(8):1689-1695. doi:10.1109/78.934138
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.