Application of discrete Fourier inter-coefficient difference for assessing genetic sequence similarity

King, Brian R; Aburdene, Maurice; Thompson, Alex; Warres, Zach

doi:10.1186/1687-4153-2014-8

Research
Open access
Published: 28 May 2014

Application of discrete Fourier inter-coefficient difference for assessing genetic sequence similarity

Brian R King¹,
Maurice Aburdene²,
Alex Thompson² &
…
Zach Warres²

EURASIP Journal on Bioinformatics and Systems Biology volume 2014, Article number: 8 (2014) Cite this article

2727 Accesses
6 Citations
Metrics details

Abstract

Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to the inherent digital nature of these sequences. DSP methods have demonstrated early success for detection of coding regions in a gene. Recently, these methods are being used to establish DNA gene similarity. We present the inter-coefficient difference (ICD) transformation, a novel extension of the discrete Fourier transformation, which can be applied to any DNA sequence. The ICD method is a mathematical, alignment-free DNA comparison method that generates a genetic signature for any DNA sequence that is used to generate relative measures of similarity among DNA sequences. We demonstrate our method on a set of insulin genes obtained from an evolutionarily wide range of species, and on a set of avian influenza viral sequences, which represents a set of highly similar sequences. We compare phylogenetic trees generated using our technique against trees generated using traditional alignment techniques for similarity and demonstrate that the ICD method produces a highly accurate tree without requiring an alignment prior to establishing sequence similarity.

Introduction

Substantial technological advances continue to be made in modern DNA sequencing instrumentation. Next-generation sequencing (NGS) systems generate genetic and genomic data at unprecedented rates. Methods that can be used to help us understand these data are being researched in earnest. In general, the most common, biologically meaningful approach to understand new sequence data are based on methods that can compare new data against a large set of data that is well understood.

When a new biological sequence with unknown function has been identified, researchers search for the most ‘similar’ sequence in a database of annotated sequence data, under the premise that similar sequences imply similar biological functionality, and in the case of proteins, similar structural characteristics. Similarity between two biological sequences forms the basis for determining whether the sequences are homologous, i.e., there is shared ancestry between them [1]. Phylogenetics, the study of evolutionary relationships between organisms, relies on methods that can quantitatively measure differences between these organisms, with the premise that larger differences between organisms imply a larger span of time before the organisms split from a common ancestor. Phylogenies are most commonly inferred from pairwise comparisons performed on the underlying genetic sequence data obtained from the organisms being analyzed [2, 3]. For these and many other reasons, sequence analysis methods are among the most researched and sought after methods in bioinformatics. We encourage the reader to consult a text on biological sequence analysis to learn about existing methods [1, 4].

Generally speaking, the predominant methods for biological sequence comparison are based on sequence alignments, such as the popular BLAST and the ClustalW series of methods [5, 6]. Alignment methods have represented the de facto standard for sequence analysis, comparison, and retrieval. However, the advent of NGS sequencing has pushed traditional alignment methods to their limits. There are numerous user-defined parameters for dealing with gaps and mismatches between sequences, and it is difficult to determine the ideal parameters to achieve an optimal alignment. The computational resources required for these methods can increase quadratically or more with respect to the length of the sequences and the number of sequences being aligned [1]. Moreover, there is an increased risk of errors being introduced with multiple sequence alignments as the average pairwise sequence identity of the data being aligned decreases. Another source of alignment error arises if the order of significant regions in sequences is not conserved [7]. If an optimal alignment has been found, it is difficult to determine an accurate metric of distance between sequences [8]. Despite these challenges, alignment methods continue to be used. With appropriate parameter selection, they excel at visually indicating regions that are highly conserved among many sequences.

To overcome these challenges, there has been increased interest in techniques that can compare sequences without an alignment, referred to as alignment-free methods [7–9]. The most popular alignment-free methods are based on computing various transformations of fixed-length words of length n (or n-mers, n-grams), with common approaches involving computing a frequency vector over all possible n-mers for each sequence [2]. Other methods search for a shared set of the longest common subsequences [10]. These methods tend to be among the most efficient, as their computational complexity is linear [9]. However, they may lose valuable information with respect to positioning of important subsequences within the whole sequence. Moreover, like alignment-based methods, they often require multiple runs to select the most ideal parameters.

Digital signal processing (DSP) techniques have been used effectively for efficient searching and comparison of sequential data [11, 12]. They are emerging as another alternative alignment-free approach used to analyze both genomic and proteomic data. In order for these data to be processed using DSP techniques, they must be converted to a numeric sequence. There are several numeric representations available, each with their own strengths and weaknesses [13–15]. In the case of DNA, there are a limited number of numeric transformations available. DNA encodes the genetic blueprint of every organism as a sequence over four possible nucleotides, represented as A, C, G, or T. Encoded in DNA are genes, which contain the instructions to make proteins, and intergenic regions, which fill in the large gaps between genes. Within each gene are coding regions (exons) and noncoding regions (introns). The information content, which is critical to understanding the biological function of the gene, is hidden in the coding regions in the gene. Coding regions are comprised of codons, nucleotide triplets that code for individual amino acids, and represent a very small portion of the entire genome. In the human genome, only about 5% of it contains coding instructions. These complexities make the process of identifying genes and coding regions within these genes a daunting task.

Proteins have more choices of possible numeric transformations available, owed in part to the physicochemical properties of amino acids. Proteins themselves are long polypeptide chains of amino acids. There are 20 possible amino acids that exist in proteins, each having many physicochemical properties, such as hydropathy, charge, and solubility. These properties provide useful numeric representations for protein sequences, making a translation to a numeric sequence a relatively easy process. For example, the MAFFT method is a protein sequence alignment method that converts converted proteins into numeric sequences that represent the polarity and volume values of each amino acid residue in the proteins being aligned [16].

Regardless of the numeric transformation chosen, preservation of information content in the sequence is critical. This is perhaps one reason for the most common representation of a DNA, the binary indicator sequence, also commonly known as the Voss representation [17]. In this representation, each DNA sequence is transformed into a sequence of binary occurrence vectors. (This is the transformation used in our research, and is described in detail in the Methods and materials section). Some methods use variations of the binary indicator sequence. For example, Afrexio et al. introduced a variant of the Voss representation that converts the occurrence vector into a vector of inter-nucleotide distances [18]. There is a wide range of transformations available [13]. Hota et al. analyzed the performance of several common DNA to numerical mapping techniques. They provide a good description of each transformation method used in practice [19].

DSP based methods have continued to emerge in recent years for the purpose of genomic analysis. The most prevalent use has been to locate reading frames in DNA, as well as different regions in the genome, including genes and coding (or exon) regions within these genes [14, 20, 21]. Sharma et al. analyzed the performance of several DNA mapping schemes for detecting the coding region of genes [15]. DSP techniques have been used to address other problems in genomics and proteomics. For example, methods have been developed for splice site detection within the gene [20], the identification of active sites in a protein using Morlet wavelets [22] and identification of acceptor splicing sites and the visual identification of patterns and motifs in DNA through spectral analysis [14, 23].

Regardless of the domain, the field of digital signal processing has provided a plethora of methods for analyzing sequential data. Most methods use variations of the Fourier transform [24], with the discrete Fourier transform (DFT) being among the most popular signal processing technique [25, 26]. Typically, the fast Fourier transform (FFT) is used to compute the DFT, as it is among the most computationally efficient algorithms for this purpose [24]. These transforms have been successfully used for general sequential data comparison and retrieval [11, 12], and are readily suitable for biological sequence comparison, owed to the inherent discrete, symbolic nature of biological sequences [14, 27, 28]. In fact, FFTs have been used to analyze DNA data before [20, 29, 30]. In addition to some of the methods listed previously, Cheever et al. measured the cross correlation of two DNA sequences to explore significant regions of similarity between the DNA, where the cross correlation was computed using a FFT [31]. The FFT has also been used for protein sequence alignments in the MAFFT method [16].

There have been many DSP-based methods introduced in recent years for biological data analysis; however, very few were designed to report a biologically relevant measure of evolutionary distance between sequences being analyzed, particularly when a large number of sequences are being analyzed. Multiple sequence alignments have been used successfully for this purpose, but these methods can be computationally expensive and are prone to errors, particularly as the set of sequences being analyzed increase in size and diversity. We developed a novel signal processing technique that characterizes genetic sequence data through a simple transformation of the coefficients generated by the DFT of a specific numeric representation of the original DNA sequence. In our work, we compute a transformation on the set of coefficients generated that we call the inter-coefficient difference or ICD. We show that this characterization effectively produces a signature for a given sequence and can be used to compare genetic sequences among different species. The ICD method provides comparisons between genes from evolutionarily distant species, as well as subtle variants from identical genes from the same species. We demonstrate its effectiveness through analysis of datasets that have different levels of pairwise similarity. The method effectively generates a pairwise distance matrix representing the level of similarity between each genetic sequence with remarkable running times. The resulting matrix can be used to induce a dendrogram representing phylogenetic relationships between species from which the sequences were obtained. Our results show that we produce alignment-free dendrograms that are highly similar to those trees produced using alignment-based techniques and other alignment-free methods.

Methods and materials

Our method is based on the application of the DFT to four numeric sequences that are derived from the original DNA sequence. We use a binary indicator sequence representation of a DNA sequence, which is among the most popular numeric representation used in this area in literature [17, 20]; it allows for an easy transformation from the original sequence on which many DSP and other numeric transformations can be computed [18, 20, 27].

The inter-coefficient difference

Let S represent a set of DNA sequences, where s_i represents an arbitrary sequence in S. Each DNA sequence s_i is defined over the alphabet. Let N be the length of the longest sequence in S. Each sequence s_i in S goes through a series of transformations to produce the corresponding ICD vector. The first transformation computes a unique binary indicator sequence from s_i. Next, we apply the DFT on the indicator sequence, yielding a vector of coefficients. Basic mathematical transformations are applied to the coefficient vector, resulting in the ICD vector. The details of this algorithm are given below.

For a given sequence s_i, we define four binary indicator sequences x_A[n], x_C[n], x_G[n], and x_T[n], which indicate the presence (i.e., a 1) or absence (i.e., a 0) of a symbol in s_i at position n. Each indicator sequence is padded with zeros to ensure that every indicator sequence in S has an identical length of N. Zero padding is a common technique with FFT computations that can increase the spectral resolution and can increase the efficiency of the computation when the length of the original sequence is padded to a power of 2 [26]. For example, let s_i = GACGACTCAT, which has a length of 10. However, suppose that N, which is the length of the longest sequence in S, is 12. Then:

s_{i} = GACGACTCAT

x_{A} = 010010001000

x_{C} = 001001010000

x_{G} = 100100000000

x_{T} = 000000100100

For each indicator sequence, we compute the DFT, which converts the finite-length sequence x_A[n] into a series of coefficients X_A[k] resulting from the DFT computation, defined in Equation 1:

X_{A} [k] = \sum_{n = 0}^{N - 1} x_{A} [n] e^{- j (\frac{2 πnk}{N})} k = 0, 1, \dots, N - 1

(1)

The coefficients produced are complex, and thus the absolute value of each coefficient is computed, yielding a series of real valued numbers. X_A[0] represents the number of 1 s in the indicator sequence x_A. It is discarded because it is substantially larger than all other coefficients and is highly dependent on the length of the original unpadded sequence. We retain coefficients $X_{A} [1], X_{A} [2], \dots, X_{A} [\frac{N}{2}]$ , eliminating half of the coefficients because of the symmetric nature of the coefficients produced by the DFT [26]. The remaining coefficients are denoted as vector $X_{A}^{*}$ . We normalize $X_{A}^{*}$ by dividing by its Euclidean norm, $‖X_{A}^{*}‖$ , resulting in X_A. Equation 2 illustrates this transformation, introducing variable η for simplicity:

\begin{array}{l} η = [\frac{N}{2}] \\ X_{A}^{*} = [|X_{A} [1]|, |X_{A} [2]|, \dots, |X_{A} [η]|] \\ X_{A} = \frac{X_{A}^{*}}{‖X_{A}^{*}‖} \end{array}

(2)

For each vector X_A, we compute the inter-coefficient difference of X_A, denoted ICD(X_A), by computing the difference between each adjacent number in the sequence as shown in Equation 3:

ICD (X_{A}) = [X_{A} [2] - X_{A} [1], X_{A} [3] - X_{A} [2], \dots, X_{A} [η] - X_{A} [η - 1]]

(3)

The same computations are repeated for indicator sequences x_C, x_G, and x_T, yielding vectors X_C, X_G, and X_T separately.

For example, continuing from our previous example indicator sequence, x_A = 010010001000, and N = 12. We apply Equations 1 and 2 above on x_A, which computes the DFT on x_A and normalizes it, resulting in the vector of coefficients X_A:

X_{A}^{*} = [0.5176, 1.0000, 2.2361, 1.7321, 1.9319, 1.0000]

‖X_{A}^{*}‖ = 3.7417

X_{A} = [0.1383, 0.2673, 0.5976, 0.4629, 0.5163, 0.2673]

Then, the inter-coefficient difference of X_A is computed, resulting in:

ICD (X_{A}) = [0.1289, 0.3304, - 0.1347, 0.0534, - 0.2490]

The ICD of each coefficient vector resulting from vectors X_C, X_G, and X_T is concatenated to produce a single numeric vector, denoted X.

X = [ICD (X_{A}) ICD (X_{C}) ICD (X_{G}) ICD (X_{T})]

It is important to mention that all ICD vectors will have an equal length for every sequence in S, regardless of the length of the original sequence. Each indicator sequence transformation is padded to have a length of N, which is the length of the longest sequence in S. The final concatenated vector X will have a length of $4 ⌊\frac{N}{2}⌋ = 4 η$ .

Establishing distance between DNA sequences

Given two arbitrary DNA sequences, s₁ and s₂ in set S, we can compute the ICD transformation yielding numeric vectors X₁ and X₂, respectively. A single numeric value that represents a measure of biological distance is computed from these vectors by computing the correlation between the two vectors. We compute Dist (X₁, X₂), a single measure of distance between the ICD vectors, as follows:

Dist (X_{1}, X_{2}) = 1.0 - \frac{\sum_{i = 1}^{4 η} (X_{1} [i] - {\bar{X}}_{1}) (X_{2} [i] - {\bar{X}}_{2})}{\sqrt{\sum_{i = 1}^{4 η} {(X_{1} [i] - {\bar{X}}_{1})}^{2} \sum_{i = 1}^{4 η} {(X_{2} [i] - {\bar{X}}_{2})}^{2}}}

(4)

Equation 4 is 1.0 minus a standard correlation calculation between two sets of data. We know that a standard correlation falls in the range [−1.0, 1.0], where −1.0 is a perfect negative correlation and 1.0 is a positive correlation. Two vectors of identical values would have perfect positive correlation, and thus their Dist calculation would be 0.0, implying that there is no distance between them. A value of 2.0 is perfect negative correlation, implying opposing numerical trends around the means.

Data

To test the efficacy of this method, we assembled two sets of DNA data. Our first set consisted of mRNA insulin sequences from 19 different animals, called INS19 (Table 1). Insulin is an important hormone found throughout the animal kingdom for regulating carbohydrate and fat metabolism and for managing glucose levels in the blood. All sequences were downloaded from NCBI's RefSeq database (http://www.ncbi.nlm.nih.gov/refseq/). This dataset was chosen to measure the ability of the method to assess pairwise similarity over a set of sequences that have highly conserved regions in its genetic sequence owed to its similar function among all species while exhibiting substantial regions of low conservation in proportion to the evolutionary distance between species. The length of the sequences in the data ranged between 291 and 774 nucleotides in length.

Table 1 mRNA insulin sequences from 19 animal species in the INS19 dataset

Full size table

Our second set of data was chosen to test the ability of the method to accurately distinguish subtle differences among a large set of sequences from the same gene obtained from the same viral species. To this end, we selected 60 influenza type A sequences collected from the NCBI Influenza Virus Sequence Database (http://www.ncbi.nlm.nih.gov/genomes/FLU/). Influenza is an RNA virus that affects a wide range of mammals and birds; in extreme cases, it can lead to death. Influenza A viruses are broken down into different subtypes that are named based on two specific proteins that are on the surface of the virus: hemagglutinin (HA) and neuraminidase (NA). There are 17 types of the HA protein and 10 types of neuraminidase NA protein. Each virus receives a designation labeled Hx Ny, where x represents a specific subtype of the HA gene and y represents a subtype of the NA gene in the virus. Our dataset, denoted FLU60, contains 60 examples of avian influenza sequences (influenza sequences known to affect birds) for the HA gene only, collected from various locations in the United States between January and July of 2010. Avian flu strands were selected because all known subtypes of influenza A can affect birds. The length of all sequences in FLU60 ranged between 1,683 and 1,746 nucleotides in length. The frequency of influenza A subtypes in the dataset are detailed in Table 2. The most dominant variant in the data is H4N6 at 25 examples, with H3Nx variants coming in second. Because we collected only examples of the HA gene, only the Hx part of the subtype name should play a role in determining similarity. Additional file 1: Table S1 has detailed information about the dataset, including the accession number, subtype, date and place that specimen was acquired, and the length of each sequence [see Additional file 1].

Table 2 Avian influenza A subtype frequency in FLU60

Full size table

Results

To assess the capability of the ICD method to measure sequence similarity, we generated a dendrogram based on a hierarchical clustering using the unweighted pair group method average (UPGMA) method for constructing the tree. This was performed for both INS19 and FLU60 datasets. For comparative purposes, we computed an all-against-all pairwise global alignment using the standard Needleman-Wunsch algorithm for each set of sequences being tested [32], utilizing a uniform nucleotide substitution matrix (as defined by the nuc44 function in the Matlab® Bioinformatics Toolbox) for the purpose of finding the best alignment. Though computing a pairwise alignment for all possible pairs of sequences is computationally expensive, this will yield a superior alignment than any single multiple sequence alignment (MSA), as it significantly reduces the likelihood of introducing alignment errors that result from an MSA. The distance between each pair of aligned sequences was computed by measuring the proportion of sites in the alignment at which the two sequences are different, yielding a score of 1 for entirely dissimilar sequences and 0 if they were identical. This distance measure yields identical groupings to those that are generated directly from the alignment score itself but has a comparative advantage of producing numbers that are in an identical range to the distance values that are produced with the ICD method. ICD uses a correlation coefficient between coefficient differences and likewise always produces a distance value between 0 and 1. We also compared our results to an alignment-free sequence comparison method called feature frequency profile (FFP), which is a popular tool for phylogenetic analysis [2]. We used default parameters on all FFP tools to generate a tree, with the exception of word size; we evaluated word sizes between 6 and 20 and determined that a word size of 16 achieved results that produced the most biologically correct phylogenetic groupings. Finally, the Robinson-Foulds (RFdist) tree distance metric is computed on the INS19 test using the treedist function in the phangorn package in R [33, 34]. RFdist is computed between all combinations of pairs of trees to assist in measuring tree similarity.

ICD method on INS19 dataset

Our first test was conducted to measure the ability for the ICD method to accurately assess similarity between sequences that are relatively divergent, where the data was collected from a wide range of eukaryotic species. The INS19 dataset contains data from the insulin gene, taken from 19 species in the eukaryotic kingdom. The range of pairwise sequence identity after alignment ranged between 32% and 89% identity, with an average observed percent identity at 60% (see Figure 1). A dendrogram was built based on the pairwise similarity computed from the ICD method and is shown in Figure 2. For comparison purposes, an all-against-all pairwise global alignment (denoted AAP) was performed on all sequences, and a dendrogram was built revealing the relationships between the sequences based on the alignment. A dendrogram was also computed based on the alignment-free FFP method [2]. The resulting dendrograms from each of these comparative methods are shown in Figures 3 and 4, respectively.

All trees exhibit strong similarities within major groupings, closely resembling phylogenetic relationships observed in nature, with some subtle, yet biologically significant differences between each method. In particular, both ICD and AAP methods place monkey and chimp as the most similar among all species, whereas FFP places human and chimp as most similar. All methods suggest the African clawed frog as most distant from others species used in this study. The FFP method grouped the zebrafish with the clawed frog, whereas the ICD and AAP methods correctly cluster all three fish species. The AAP method grouped a hedgehog, a type of rodent, with a flycatcher, a type of bird. In contrast, the ICD and FFP methods correctly grouped the flycatcher with a chicken, which are both types of birds, and the hedgehog with other similar mammals. The AAP method grouped the hamster, a rodent, with the cow and pig, which are both even-toed ungulates; the FFP method fared a bit better, placing a hamster between a rabbit and hedgehog. In contrast, the ICD method correctly grouped the hamster with the guinea pig, which are both rodents.

The RFdist distance metric was computed between all pairs of trees. The RFdist between the ICD and FFP phylogenetic trees is 26, between ICD and AAP is 24, and between FFP and AAP is 22. These values suggest that, though the trees have similar groups, they have a relatively equal number of different partitions of data that are implied by each tree, with the final tree produced by the ICD method being only slightly more similar to the tree produced by the all-against-all pairwise alignment than the FFP method.

ICD method on FLU60 dataset

Our next test was conducted on the FLU60 dataset, which contains 60 DNA sequences of the HA gene from avian influenza A virus. Conducting an all-against-all pairwise alignment revealed a pairwise sequence identity range of 57% to 99.9%, with an average identity of 70.5%. Additional file 1: Figure S1 shows a histogram revealing the sequence identity over all pairs of sequences (see Additional file 1). We performed identical analyses on these data to the analyses performed with the INS19 data, resulting in dendrograms from each method. The dendrogram for the ICD method is shown in Figure 5. The dendrograms for the AAP and FFP methods are shown in Additional file 1: Figures S2 and S3 (see Additional file 1). The RFdist metric was not measured for this test.

Close evaluation of these dendrograms will reveal remarkably similar groupings among each individual subtype of influenza A. We were pleased to see that all influenza HA subtypes were grouped together correctly by all methods. In particular, in the case of H3 and H4 subtypes, all three methods indicated two very distinct strains. H3 is divided into a strain that hit Mississippi and one that hit Alaska. H4 was divided into three distinct strains, with all methods agreeing on the divisions. When looking at similarity between subtypes, all methods group together influenza A subtype H7 with H10, suggesting that each of these groups share a common ancestor. However, they differ slightly on the ancestry relationships between H9, H11, and H12. These findings, as well as most of the other relationships observed in this study, are confirmed by Air's work on sequence relationships in the hemagluttinin genes of 12 different variants of influenza A [35]. The methods differ on the divergence point of subtype H6; the AAP and FFP methods suggest that H6 and H1 have a common ancestor, whereas the ICD method suggests that H6 diverged much earlier from a subgroup consisting of H4, H9, and H12. The AAP and FFP method are closer to the similarity observed in Air's work. However, the level of similarity computed by the ICD method between H6 and subgroup H4, H9, and H12 is remarkably similar to the alternative group H1, H11, and H3, suggesting that the common ancestor could have been from either group.

The execution times were recorded for each of the methods we investigated. In addition, we included the timing results of ClustalW2 [36] and Clustal Omega [37], which are two other popular multiple sequence alignment methods widely used today. All methods were run on a laptop computer running Mac OS X 10.9 with a 2.2 GHz Intel Core i7 processor equipped with 16 GB of memory. The ICD and AAP methods were run as Matlab® applications, while FFP, ClustalW2, and Clustal Omega were compiled and installed as native applications. All methods were executed multiple times with similar loads. The first time was discarded to eliminate bias resulting from file system latency. All methods ran under one second of execution time on the INS19 dataset. This is not surprising given the small size of the dataset and the short length of each sequence. The FLU60 dataset provided a much more informative comparison. Table 3 shows the results for all five methods tested. The results clearly indicate the strength of alignment-free methods with respect to running times. Among the alignment-free methods, the ICD method outperformed FFP, despite the fact that it is running within the Matlab® framework. This suggests that even better execution times may be observed with the ICD method if it was redesigned as a native application.

Table 3 Observed execution time for FLU60

Full size table

Discussion

The binary indicator representation of DNA is a common representation to use on methods that treat DNA as a digital signal [19]. Some suggest that this representation is common because it inherently retains the important three-base periodicity which is important for detecting coding regions in DNA [15]. Some methods make interesting transformations to the original indicator sequence, such as the inter-nucleotide distance utilized by Afreixo et al. [18], with the goal of strengthening signals in the original digital signal that are discriminatory between different sequences. We applied the DFT on the indicator sequence. The DFT is a common DSP technique on digital signals and has been used in other methods for DNA sequence analysis. Each coefficient of the DFT represents a cross correlation of the entire input sequence and a complex sinusoid at a specific frequency, notably k/N[25]. As noted, DFTs have been successfully used in detecting coding regions of genes, where a strong peak is observed at frequency N/3. Our work was largely motivated by an interest in investigating how the differences of the magnitudes of the sinusoids between adjacent frequencies might improve sequence characterization in DNA. The ICD method presents a novel use of the DFT by computing the inter-coefficient difference from the resulting set of coefficients computed by the DFT transformation. The analysis presented here demonstrates its potential as a viable alternative approach toward DNA sequence analysis. In particular, the ability to distinguish differences between sequences having both low and high measures of homology, without computing an alignment, is particularly useful, compared to the challenges from computing multiple sequence alignments over large amounts of biological sequence data.

One may wonder about the likelihood of two different sequences producing an identical ICD vector. The critical part of the ICD method is the DFT. Two different DNA sequences will produce a unique set of coefficients, and likewise, our ICD transformation applied to these coefficients is thus also unique, except under one condition: when one sequence is a rotational shift of the other, and both of these sequences represent the longest sequences in the set of DNA being analyzed, meaning, there will be no zeros appended to either sequence. For example, if the sequences GACGACTCAT and TGACGACTCA (the second sequence is equivalent to the first sequence right-shifted by 1 with rotation) were both in the set of DNA being analyzed and were the longest sequences in the data, they will both yield the same $X_{A}^{*}, X_{C}^{*}, X_{G}^{*}$ , and $X_{T}^{*}$ vectors. However, the likelihood of two biologically meaningful DNA sequences being an entire rotational shift of the other is highly unlikely, particularly when analyzing entire genes. If this event were to actually occur in nature, then our method will yield these sequences as being identical, rightfully drawing the attention of the researcher.

Though we tested several datasets to determine the efficacy of the method, it does have a limitation worth noting. The ICD method works well when assessing similarity of identical genes over many different species, such as the INS19 dataset. It also works well when assessing similarity over many variants of the same gene from the same species, such as the FLU60 dataset. However, for evaluating the similarity over large, genomic regions or entire genomes from different species, the ICD method is limited. The reason for this is due to the requirement of padding zeros to the indicator sequences to ensure all sequences have equal length. Sequences that are significantly shorter than the longest sequence will likewise have a substantial vector of zeros padded and thus will yield comparatively poor ICD vectors.

The results of the ICD method compared favorably to other methods tested. In fact, we observed examples in the INS19 test where the ICD analysis yielded a more phylogenetically correct tree than those produced from other methods tests, backed up by simple phylogenetic relationships observed in any biological text. We opted to perform an all-against-all pairwise alignment over a multiple sequence alignment to ensure the highest degree of accuracy of the measure of similarity of alignments. In the FLU60 data, the ICD methods ability to detect the correct measure of similarity among even those sequences that had a high measure of pairwise sequence identity was remarkable.

A significant disadvantage of alignment-based sequence comparison methods is that they assume that important regions in the genetic sequence will follow the same order between similar species. However, as noted by Pinello et al., this is not always the case [7]. As a method based on the DFT, the ICD method capitalizes on recurrent patterns, regardless of the position of those patterns in the whole sequence. It is robust to possible reordering of conserved regions between genetic sequences.

The ICD method offers a significant advantage over alignment and alignment-free methods by eliminating the need for parameters. Other methods often require multiple runs to determine the best parameter set. In comparison, our ICD method is a pure mathematical, alignment-free transformation that requires no user-defined parameters prior to the analysis.

Depending on the alignment algorithm chosen, the running time to compare m sequences of length n and produce a tree based on alignment methods can vary between O(m²n²) for ClustalW [6] to as high as O(n^m) for dynamic programming approaches. More recently, Clustal Omega implemented substantial improvements over its predecessors in the Clustal family, improving the running time to O(nm log m), making it suitable for large-scale multiple sequence alignments. Alignment-free methods often have a performance advantage, particularly those that are based on k-mer frequencies. These methods can be run in O(knm) time, noting that selection of word size will have an effect on the final performance. This is particularly important for DNA, which requires longer word lengths for meaningful results. In contrast, the FFT runs in O(mn log n), suggesting that it is an efficient technique, comparable with other alignment-free methods. Our results in Table 3 confirm the theoretical running times, with the alignment-free methods having a superior advantage over the alignment methods.

Alignment-based methods have their advantages. In particular, an alignment will often produce a better absolute value of evolutionary distance between sequences by incorporating a substitution matrix such as BLOSUM62. In contrast, it is relatively difficult to infer a precise measure of evolutionary distance from alignment-free methods, and this is particularly true of the correlation computed from the ICD vector. This is not uncommon, as this is a limitation with any phylogenetic approach that involves computing a distance matrix based on sequence homology. Despite this limitation, most of the relative distances observed between species in the INS19 dataset and between different variants of avian flu in the FLU60 dataset were consistent with the alignments produced. More interestingly, we demonstrated a few differences between the results from the methods applied to the INS19 data, where the ICD approach produced evolutionary relationships that were more consistent with our biological understanding of evolution among species that the other approaches we evaluated failed to capture.

The use of a correlation coefficient for distance is part of the novel approach in this work. Even though the theoretical value of the Dist computation is [0.0, 2.0], all pairs of sequences analyzed had values between 0.0 and 1.0. In other words, sequences were either found to have a strong positive correlation, which is implied for Dist values near 0, or no correlation, for Dist values near 1.0. Our observations on all tests never observed Dist computations of more than 1.0. If this had happened, it would have implied that the two biological sequences being tested had a negative correlation with respect to their ICD vector. From a biological viewpoint, different species, genes, or even different variants within the same genes arise due to evolution; more specifically, due to selective pressures placed on the genome to become more ‘fit’ than its ancestors. The processes behind natural selection that are so important for breeding new species and genetic functions are not random. However, the underlying genetic mutations that occur over eons are generally considered to be random events [38]. The fact that we never observed a negative correlation might offer a metric to numerically confirm the random nature of evolution. This needs further investigation over a much larger set of genetic data to draw any conclusions.

Conclusions

In this paper, we present a novel use of the discrete Fourier transform to establish sequence similarity through incorporating a simple transform of the coefficient vector. We demonstrated its efficacy on two datasets designed to measure the method's capability on establishing similarity among datasets with different levels of sequence identity. The ICD approach produced a high quality dendrogram representing phylogenetic relationships of sequences with different levels of sequence identity. Our results were nearly identical with those obtained using traditional alignment-based approaches.

Abbreviations

AAP:: all-against-all pairwise global alignment
DFT:: discrete Fourier transform
DNA:: deoxyribonucleic acid
DSP:: digital signal processing
FFP:: feature frequency profile method
FFT:: fast Fourier transform
FLU60:: dataset of 60 variants of avian influenza
HA:: hemagglutinin (an influenza gene)
ICD:: inter-coefficient difference
INS19:: dataset of the insulin gene from 19 species
MSA:: multiple sequence alignment
NA:: neuraminidase (an influenza gene)
RFdist:: Robinson-Foulds tree distance metric
UPGMA:: unweighted pair group method average
⌊x⌋:: the floor of x
‖X‖:: the euclidean norm of vector X.

References

Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK; 1998:356.
Book Google Scholar
Sims GE, Jun S-R, Wu GA, Kim S-H: Whole-genome phylogeny of mammals: evolutionary information in genic and nongenic regions. Proc Natl Acad Sci U S A. 2009, 106: 17077-82. 10.1073/pnas.0909377106
Article Google Scholar
Phillips A, Janies D, Wheeler W: Multiple sequence alignment in phylogenetic analysis. Mol Phylogenet Evol. 2000, 16: 317-30. 10.1006/mpev.2000.0785
Article Google Scholar
Samuelsson T: Genomics and bioinformatics: an introduction to programming tools for life scientists. 1st edition. Cambridge University Press, Cambridge, UK; 2012:356.
Book Google Scholar
Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1016/S0022-2836(05)80360-2
Article Google Scholar
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673
Article Google Scholar
Pinello L, Lo Bosco G, Yuan G-C: Applications of alignment-free methods in epigenomics. Brief Bioinform. 2013, 15: 419-430.
Article Google Scholar
Vinga S, Almeida J: Alignment-free sequence comparison–a review. Bioinformatics. 2003, 19: 513-523. 10.1093/bioinformatics/btg005
Article Google Scholar
Bonham-Carter O, Steele J, Bastola D: Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform. online only, published July 31, 2013
Google Scholar
Domazet-Lošo M, Haubold B: Alignment-free detection of local similarity among viral and bacterial genomes. Bioinformatics. 2011, 27: 1466-72. 10.1093/bioinformatics/btr176
Article Google Scholar
Rafiei D, Mendelzon A: Efficient Retrieval of Similar Time Sequences Using DFT. In Proceedings of 5th International Conference of Foundations of Data Organization – FODO '98. Kobe, Japan; 1998:249-257.
Google Scholar
Wu Y-L, Agrawal D, El Abbadi A: A comparison of DFT and DWT based similarity search in time-series databases. In Proc. ninth Int. Conf. Inf. Knowl. Manag. - CIKM ’00. ACM Press, New York, USA; 2000:488-495.
Chapter Google Scholar
Cristea PD: Conversion of nucleotides sequences into genomic signals. J Cell Mol Med. 2002, 6: 279-303. 10.1111/j.1582-4934.2002.tb00196.x
Article Google Scholar
Anastassiou D: Genomic signal processing. IEEE Signal Process Mag. 2001, 18: 8-20.
Article Google Scholar
Sharma SD, Shakya K, Sharma SN: Evaluation of DNA mapping schemes for exon detection. Int Conf Comput Commun Electr Technol. 2011, 2011: 71-74.
Google Scholar
Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002, 30: 3059-3066. 10.1093/nar/gkf436
Article Google Scholar
Voss RF: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys Rev Lett. 1992, 68: 3805-3808. 10.1103/PhysRevLett.68.3805
Article Google Scholar
Afreixo V, Bastos CAC, Pinho AJ, Garcia SP, Ferreira PJSG: Genome analysis with inter-nucleotide distances. Bioinformatics. 2009, 25: 3064-70. 10.1093/bioinformatics/btp546
Article Google Scholar
Hota MK, Srivastava VK: Performance analysis of different DNA to numerical mapping techniques for identification of protein coding regions using tapered window based short-time discrete Fourier transform. Int Conf Power, Control Embed Syst. 2010, 3: 1-4.
Google Scholar
Akhtar M, Epps J, Ambikairajah E: Signal processing in sequence analysis: advances in eukaryotic gene prediction. IEEE J Sel Top Signal Process. 2008, 2: 310-321.
Article Google Scholar
Saberkari H, Shamsi M, Sedaaghi M, Golabi F: Prediction of protein coding regions in DNA sequences using signal processing methods. In Proc. 2012 IEEE Symp. Ind. Electron. Appl. (ISIEA2012). Bandung, Indonesia; 2012:355-360.
Google Scholar
Rao KD, Member S, Swamy MNS, Fellow L: Analysis of genomics and proteomics using DSP techniques. IEEE Trans Circuits Syst I Regul Pap. 2008, 55: 370-378.
Article MathSciNet Google Scholar
Marhon SA, Kremer SC: Gene prediction based on DNA spectral analysis: a literature review. J Comput Biol. 2011, 18: 639-76. 10.1089/cmb.2010.0184
Article MathSciNet Google Scholar
Brigham EO, Morrow RE: The fast Fourier transform. IEEE Spectr. 1967, 4: 63-70.
Article Google Scholar
Lyons RG: Understanding Digital Signal Processing. Pearson Education, Upper Saddle River, NJ; 2004.
Google Scholar
Oppenheim AV, Schafer RW: Discrete-Time Signal Processing. 3rd edition. Prentice Hall, Upper Saddle River, NJ, USA; 2010.
Google Scholar
Vaidyanathan P: The role of signal-processing concepts in genomics and proteomics. J Franklin Inst. 2004, 341: 111-135. 10.1016/j.jfranklin.2003.12.001
Article Google Scholar
Berger JA, Mitra SK, Carli M, Neri A: New approaches to genome sequence analysis based on digital signal processing. In IEEE Work. Genomic Signal Process. Stat. (GENSIPS). IEEE Press, Raleigh, North Carolina, USA; 2002.
Google Scholar
Tuqan J, Rushdi A, Member S: A DSP approach for finding the codon bias in DNA sequences. IEEE J Sel Top Signal Process. 2008, 2: 343-356.
Article Google Scholar
Anastassiou D: Frequency-domain analysis of biomolecular sequences. Bioinformatics. 2000, 16: 1073-1081. 10.1093/bioinformatics/16.12.1073
Article MathSciNet Google Scholar
Cheever EA, Searls DB, Karunaratne W, Overton GC: Using Signal Processing Techniques for DNA Sequence Comparison. In Proc. Fifteenth Annu. Northeast Bioeng. Conf. IEEE Press, Boston, MA; 1989:173-174.
Chapter Google Scholar
Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48: 443-453. 10.1016/0022-2836(70)90057-4
Article Google Scholar
Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci. 1981, 53: 131-147. 10.1016/0025-5564(81)90043-2
Article MathSciNet Google Scholar
Schliep KP: phangorn: phylogenetic analysis in R. Bioinformatics. 2011, 27: 592-3. 10.1093/bioinformatics/btq706
Article Google Scholar
Air GM: Sequence relationships among the hemagglutinin genes of 12 subtypes of influenza A virus. Proc Natl Acad Sci U S A. 1981, 78: 7639-43. 10.1073/pnas.78.12.7639
Article Google Scholar
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23: 2947-8. 10.1093/bioinformatics/btm404
Article Google Scholar
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011, 7: 539.
Article Google Scholar
Kaufmann S: The origins of order, vol. 209. Oxford University Press, Oxford, UK; 1993:709.
Google Scholar

Download references

Acknowledgements

Part of this work was funded through a grant awarded to Brian King from the Bucknell Geisinger Research Initiative. Additional funding was provided through the Bucknell Presidential Fellowship program. We wish to thank the reviewers for insightful comments that helped strengthen our work.

Author information

Authors and Affiliations

Department of Computer Science, Bucknell University, Lewisburg, PA, 17837, USA
Brian R King
Department of Electrical and Computer Engineering, Bucknell University, Lewisburg, PA, 17837, USA
Maurice Aburdene, Alex Thompson & Zach Warres

Authors

Brian R King
View author publications
You can also search for this author in PubMed Google Scholar
Maurice Aburdene
View author publications
You can also search for this author in PubMed Google Scholar
Alex Thompson
View author publications
You can also search for this author in PubMed Google Scholar
Zach Warres
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Brian R King.

Additional information

Competing interests

The authors declare that they have no competing interests.

Electronic supplementary material

13637_2013_50_MOESM1_ESM.pdf

Additional file 1:Application of the discrete Fourier transform on DNA for sequence similarity. Table S1. Avian Flu Sequences (FLU60). Figure S1. Histogram of % identity in FLU60. Figure S2. Alignment based dendogram for FLU60. Figure S3. FFP based dendogram for FLU60. (PDF 295 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

King, B.R., Aburdene, M., Thompson, A. et al. Application of discrete Fourier inter-coefficient difference for assessing genetic sequence similarity. J Bioinform Sys Biology 2014, 8 (2014). https://doi.org/10.1186/1687-4153-2014-8

Download citation

Received: 16 December 2013
Accepted: 01 May 2014
Published: 28 May 2014
DOI: https://doi.org/10.1186/1687-4153-2014-8

Application of discrete Fourier inter-coefficient difference for assessing genetic sequence similarity

Abstract

Introduction

Methods and materials

The inter-coefficient difference

Establishing distance between DNA sequences

Data

Results

ICD method on INS19 dataset

ICD method on FLU60 dataset

Discussion

Conclusions

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Electronic supplementary material

13637_2013_50_MOESM1_ESM.pdf

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

About this article

Cite this article

Share this article

Keywords