Open Access

A 2D graphical representation of the sequences of DNA based on triplets and its application

EURASIP Journal on Bioinformatics and Systems Biology 2014, 2014:1

https://doi.org/10.1186/1687-4153-2014-1

Received: 17 August 2013

Accepted: 10 December 2013

Published: 2 January 2014

Abstract

In this paper, we first present a new concept of ‘weight’ for the 64 triplets and define a different weight for each triplet. We then give a novel 2D graphical representation for DNA sequences, which transforms a DNA sequence into a plot set to facilitate quantitative comparison of DNA sequences. Next, using a newly designed measure of similarity, we introduce a novel approach to the similarity/dissimilarity analysis of DNA sequences. Finally, an application to the complete coding sequences of the β-globin genes of 11 species illustrates the utility of the proposed method.

Keywords

Graphical representation; Similarities/dissimilarities analysis; Triplet; DNA sequence

1. Introduction

In recent years, biologists have observed an exponential growth of sequence data in DNA databases; the importance of understanding genetic sequences, coupled with the difficulty of working with such immense volumes of data, underscores the urgent need for supportive visual tools. Graphical representation is well regarded because it allows visual inspection of data and provides a simple way to facilitate the similarity analysis and comparison of DNA sequences [1–5]. Because of their convenience and ease of use, methods based on graphical representation have been applied extensively in bioinformatics.

Many graphical representation methods have been proposed to characterize DNA sequences numerically in spaces of different dimensions. For example, Liao et al. [6–9], Randic et al. [10–13], Guo et al. [14, 15], Qi et al. [16], Dai et al. [17, 18], and Dorota et al. [19] proposed different 2D graphical representations of DNA sequences. Liao et al. [20–23], Randic et al. [24, 25], Qi et al. [26], Yu et al. [27], and Aram et al. [28] proposed different 3D graphical representations. Liao et al. [29], Tang et al. [30], and Chi et al. [31] proposed different 4D graphical representations. In addition, Liao et al. [32] proposed a 5D representation of DNA sequences.

Most of the approaches mentioned above adopt the leading eigenvalues of some matrices, such as L/L matrices, M/M matrices, E matrices, covariance matrices, and D/D matrices, to measure the similarities/dissimilarities among the complete coding sequences of β-globin genes of different species. Because matrix computation is needed to obtain the leading eigenvalues, these methods are usually computationally expensive for long DNA sequences. Furthermore, some of these approaches yield similarity/dissimilarity results that are not quite reasonable and do not accord with known facts [7, 9].

To reduce the computational complexity and obtain more reasonable results of similarity/dissimilarity analysis, in this article we propose a new 2D graphical representation of DNA sequences based on triplets. We introduce a new concept of ‘weight’ for the 64 triplets and a new concept of ‘weight deviation’ to measure the similarities/dissimilarities among the complete coding sequences of β-globin genes of different species. Compared with some existing graphical representations of DNA sequences, our scheme has the following advantages: (1) no matrix computation is needed, and (2) it characterizes the graphical representations of DNA sequences exactly and yields reasonable results of similarity/dissimilarity analysis.

2. Proposed 2D graphical representation of DNA sequence

A codon is a specific sequence of three adjacent nucleotides on the mRNA that specifies the genetic code information for synthesizing a particular amino acid. As illustrated in Table 1, there are 20 amino acids and 64 codons in total, and each codon has a specific meaning in protein synthesis: 61 codons encode amino acids, and the other 3 codons cause the termination of protein synthesis.
Table 1

Relationship between the 20 most common amino acids and the 64 mRNA codons

| Codons | Amino acid | Codons | Amino acid |
| --- | --- | --- | --- |
| GCU, GCC, GCA, GCG | Alanine | CUU, CUC, CUA, CUG, UUA, UUG | Leucine |
| CGU, CGC, CGA, CGG, AGA, AGG | Arginine | AAA, AAG | Lysine |
| GAU, GAC | Aspartic acid | AUG | Methionine |
| AAU, AAC | Asparagine | UUU, UUC | Phenylalanine |
| UGU, UGC | Cysteine | CCU, CCC, CCA, CCG | Proline |
| GAA, GAG | Glutamic acid | UCU, UCC, UCA, UCG, AGU, AGC | Serine |
| CAA, CAG | Glutamine | ACU, ACC, ACA, ACG | Threonine |
| GGU, GGC, GGA, GGG | Glycine | UGG | Tryptophan |
| CAU, CAC | Histidine | UAU, UAC | Tyrosine |
| AUU, AUC, AUA | Isoleucine | GUU, GUC, GUA, GUG | Valine |
| UAA, UAG, UGA | (stop codons) |  |  |

   
For the 64 codons illustrated in Table 1, their corresponding triplets of DNA are illustrated in Table 2.
Table 2

The corresponding triplets of the 64 codons

| Codons | Corresponding triplets | Codons | Corresponding triplets |
| --- | --- | --- | --- |
| GCU, GCC, GCA, GCG | GCT, GCC, GCA, GCG | CUU, CUC, CUA, CUG, UUA, UUG | CTT, CTC, CTA, CTG, TTA, TTG |
| CGU, CGC, CGA, CGG, AGA, AGG | CGT, CGC, CGA, CGG, AGA, AGG | AAA, AAG | AAA, AAG |
| GAU, GAC | GAT, GAC | AUG | ATG |
| AAU, AAC | AAT, AAC | UUU, UUC | TTT, TTC |
| UGU, UGC | TGT, TGC | CCU, CCC, CCA, CCG | CCT, CCC, CCA, CCG |
| GAA, GAG | GAA, GAG | UCU, UCC, UCA, UCG, AGU, AGC | TCT, TCC, TCA, TCG, AGT, AGC |
| CAA, CAG | CAA, CAG | ACU, ACC, ACA, ACG | ACT, ACC, ACA, ACG |
| GGU, GGC, GGA, GGG | GGT, GGC, GGA, GGG | UGG | TGG |
| CAU, CAC | CAT, CAC | UAU, UAC | TAT, TAC |
| AUU, AUC, AUA | ATT, ATC, ATA | GUU, GUC, GUA, GUG | GTT, GTC, GTA, GTG |
| UAA, UAG, UGA | TAA, TAG, TGA |  |  |
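The correspondence in Table 2 amounts to replacing uracil (U) with thymine (T) in each codon. A minimal Python sketch of that step follows; the function name `codon_to_triplet` is an illustrative choice and does not come from the original method.

```python
def codon_to_triplet(codon: str) -> str:
    """Return the DNA triplet for an mRNA codon by replacing U with T."""
    return codon.upper().replace("U", "T")


# Leucine codons as listed in Table 1.
leucine_codons = ["CUU", "CUC", "CUA", "CUG", "UUA", "UUG"]
leucine_triplets = [codon_to_triplet(c) for c in leucine_codons]
print(leucine_triplets)
```

Codons without U, such as GCC or AAG, are returned unchanged, which is why several rows of Table 2 are identical in both columns.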

  
Based on the 64 triplets of DNA listed in Table 2, we define a new mapping Ψ that maps each triplet to a different weight. The mapping Ψ shall satisfy the following rule: for any two pairs of triplets (X1, Y1) and (X2, Y2), if the codons corresponding to X1 and Y1 code for the same amino acid but the codons corresponding to X2 and Y2 code for two different amino acids, then |Ψ(X1) − Ψ(Y1)| < |Ψ(X2) − Ψ(Y2)|. To satisfy this rule conveniently, each weight consists of two parts: the integer part identifies the amino acid, and the fractional part identifies the codon within that amino acid. Alanine is assigned 1, arginine is assigned 2, and the remaining amino acids are numbered in the same manner. The codons of each amino acid are ordered, so that, for example, the first triplet of alanine (GCT) has the weight 1.1. The detailed mapping rules of Ψ are given in Table 3.
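Table 3

The mapping rules of Ψ

| Triplet | Corresponding weight | Triplet | Corresponding weight |
| --- | --- | --- | --- |
| GCT | 1.1 | CTT | 11.1 |
| GCC | 1.2 | CTC | 11.2 |
| GCA | 1.3 | CTA | 11.3 |
| GCG | 1.4 | CTG | 11.4 |
|  |  | TTA | 11.5 |
|  |  | TTG | 11.6 |
| CGT | 2.1 | AAA | 12.3 |
| CGC | 2.2 | AAG | 12.4 |
| CGA | 2.3 |  |  |
| CGG | 2.4 |  |  |
| AGA | 2.5 |  |  |
| AGG | 2.6 |  |  |
| GAT | 3.3 | TTT | 13.1 |
| GAC | 3.4 | TTC | 13.2 |
| AAT | 4.1 | CCT | 14.1 |
| AAC | 4.2 | CCC | 14.2 |
|  |  | CCA | 14.3 |
|  |  | CCG | 14.4 |
| TGT | 5.1 | TCT | 15.1 |
| TGC | 5.2 | TCC | 15.2 |
|  |  | TCA | 15.3 |
|  |  | TCG | 15.4 |
|  |  | AGT | 15.5 |
|  |  | AGC | 15.6 |
| GAA | 6.1 | ACT | 16.3 |
| GAG | 6.2 | ACC | 16.4 |
|  |  | ACA | 16.5 |
|  |  | ACG | 16.6 |
| CAA | 7.1 | TGG | 17.3 |
| CAG | 7.2 |  |  |
| GGT | 8.1 | TAT | 18.1 |
| GGC | 8.2 | TAC | 18.2 |
| GGA | 8.3 |  |  |
| GGG | 8.4 |  |  |
| CAT | 9.1 | GTT | 19.1 |
| CAC | 9.2 | GTC | 19.2 |
|  |  | GTA | 19.3 |
|  |  | GTG | 19.4 |
| ATT | 10.1 | ATG | 20.1 |
| ATC | 10.2 |  |  |
| ATA | 10.3 |  |  |
| TAA | 21.1 |  |  |
| TAG | 21.2 |  |  |
| TGA | 21.3 |  |  |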
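The mapping Ψ of Table 3 can be written down compactly and checked against the rule it is meant to satisfy. The sketch below is one possible encoding, not the authors' implementation: the names `GROUPS`, `PSI`, and `GROUP_OF` are illustrative, and each group is stored as (integer part, first fractional digit, triplets), with the values copied from Table 3.

```python
from itertools import combinations

# (integer part, first fractional digit, triplets in Table 3 order).
# Some groups in Table 3 start at .3 rather than .1; that is kept as published.
GROUPS = [
    (1, 1, "GCT GCC GCA GCG"),           # alanine
    (2, 1, "CGT CGC CGA CGG AGA AGG"),   # arginine
    (3, 3, "GAT GAC"),                   # aspartic acid
    (4, 1, "AAT AAC"),                   # asparagine
    (5, 1, "TGT TGC"),                   # cysteine
    (6, 1, "GAA GAG"),                   # glutamic acid
    (7, 1, "CAA CAG"),                   # glutamine
    (8, 1, "GGT GGC GGA GGG"),           # glycine
    (9, 1, "CAT CAC"),                   # histidine
    (10, 1, "ATT ATC ATA"),              # isoleucine
    (11, 1, "CTT CTC CTA CTG TTA TTG"),  # leucine
    (12, 3, "AAA AAG"),                  # lysine
    (13, 1, "TTT TTC"),                  # phenylalanine
    (14, 1, "CCT CCC CCA CCG"),          # proline
    (15, 1, "TCT TCC TCA TCG AGT AGC"),  # serine
    (16, 3, "ACT ACC ACA ACG"),          # threonine
    (17, 3, "TGG"),                      # tryptophan
    (18, 1, "TAT TAC"),                  # tyrosine
    (19, 1, "GTT GTC GTA GTG"),          # valine
    (20, 1, "ATG"),                      # methionine
    (21, 1, "TAA TAG TGA"),              # stop codons
]

PSI = {}       # triplet -> weight
GROUP_OF = {}  # triplet -> integer part (amino acid index)
for integer_part, first_digit, triplets in GROUPS:
    for k, triplet in enumerate(triplets.split()):
        PSI[triplet] = round(integer_part + (first_digit + k) / 10, 1)
        GROUP_OF[triplet] = integer_part

# Largest gap inside one group versus smallest gap between two groups.
pairs = list(combinations(PSI, 2))
within = max(abs(PSI[a] - PSI[b]) for a, b in pairs if GROUP_OF[a] == GROUP_OF[b])
between = min(abs(PSI[a] - PSI[b]) for a, b in pairs if GROUP_OF[a] != GROUP_OF[b])
print(len(PSI), round(within, 1), round(between, 1))
```

With the values of Table 3, the largest within-group gap is 0.5 and the smallest between-group gap is 0.7, so the rule |Ψ(X1) − Ψ(Y1)| < |Ψ(X2) − Ψ(Y2)| holds for every such pair of pairs.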

  

For example, from Table 3, we have Ψ(GCT) = 1.1, Ψ(GCC) = 1.2, Ψ(ATG) = 20.1, etc. In addition, we can propose a novel 2D graphical representation of DNA sequences as follows:

Let G = g1, g2, g3, …, gN be an arbitrary DNA primary sequence, where gi ∈ {A, T, G, C} for any i ∈ {1, 2, …, N}. We can then transform G into a sequence of triplets G = t1, t2, t3, …, tM, where M = ⌊N/3⌋ and ti is a triplet of DNA for any i ∈ {1, 2, …, M}. Thereafter, we define a new mapping Θ that maps G into a plot set, as given in formula (1).

$$\Theta(G) = \{(1, \Psi(t_1)), (2, \Psi(t_2)), \ldots, (M, \Psi(t_M))\} \tag{1}$$
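Formula (1) can be sketched in a few lines of Python. The function name `theta` and the dictionary `psi_subset` are illustrative assumptions; `psi_subset` holds only the Table 3 weights needed for this short example, and a full run would use all 64 weights.

```python
def theta(sequence, psi):
    """Map a DNA sequence to the plot set [(1, psi(t_1)), ..., (M, psi(t_M))]."""
    m = len(sequence) // 3  # M = floor(N / 3); leftover bases are ignored
    triplets = [sequence[3 * i:3 * i + 3] for i in range(m)]
    return [(i + 1, psi[t]) for i, t in enumerate(triplets)]


# Weights copied from Table 3 for the triplets used below.
psi_subset = {"ATG": 20.1, "GTG": 19.4, "CAC": 9.2, "CTG": 11.4}

# First 14 bases of the human sequence in Table 4: four complete triplets
# (ATG GTG CAC CTG) followed by two bases that do not form a triplet.
points = theta("ATGGTGCACCTGAC", psi_subset)
print(points)  # [(1, 20.1), (2, 19.4), (3, 9.2), (4, 11.4)]
```

Plotting the first coordinate against the second gives a 2D curve of the kind described for Figures 1 to 3.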
As for the complete coding sequences of the β-globin genes of the 11 species listed in Table 4, each can be mapped into a plot set by the mapping Θ; the 2D graphical representations corresponding to the complete coding sequences of the β-globin genes of human, chimpanzee, and opossum are shown in Figures 1, 2, and 3, respectively.
Figure 1

The 2D graphical representations of the complete coding sequences of β-globin genes of human.

Figure 2

The 2D graphical representations of the complete coding sequences of β-globin genes of chimpanzee.

Figure 3

The 2D graphical representations of the complete coding sequences of β-globin genes of opossum.

Table 4

The complete coding sequences of β-globin genes of 11 species

Species

Complete coding sequence

Human

ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAA

Chimpanzee

ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAG

Gorilla

ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAAGCTCCTGGGCAATGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAG

Black lemur

ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAGGCTGCTGGTCGTCTACCCATGGACCCAGAGGTTCTTCGAGTCCTTTGGGGACCTGTCCTCTCCTTCTGCTGTTATGGGGAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGCTGAGTGCCTTTAGTGAAGGTCTGCATCACCTGGACAACCTCAAGGGCACCTTTGCTCAACTGAGTGAGCTGCACTGTGACAAGTTGCACGTGGATCCTCAGAACTTCACTCTCCTGGGCAACGTGCTGGTGGTTGTGCTGGCTGAACACTTTGGCAATGCATTCAGCCCGGCGGTGCAGGCTGCCTTTCAGAAGGTGGTGGCTGGTGTGGCCAATGCTCTGGCTCACAAGTACCACTGA

Norway rat

ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAATGCTGATAATGTTGGCGCTGAGGCCCTGGGCAGGCTGCTGGTTGTCTACCCTTGGACCCAGAGGTACTTTTCTAAATTTGGGGACCTGTCCTCTGCCTCTGCTATCATGGGTAACCCCCAGGTGAAGGCCCATGGCAAGAAGGTGATAAATGCCTTCAATGATGGCCTGAAACACTTGGACAACCTCAAGGGCACCTTTGCTCATCTGAGTGAACTCCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAGGCTCCTGGGCAATATGATTGTGATTGTGTTGGGCCACCACCTGGGCAAGGAATTCACCCCCTGTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA

House mouse

ATGGTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTTGTCTACCCTTGGACCCAGCGGTACTTTGATAGCTTTGGAGACCTATCCTCTGCCTCTGCTATCATGGGTAATCCCAAGGTGAAGGCCCATGGCAAAAAGGTGATAACTGCCTTTAACGAGGGCCTGAAAAACCTGGACAACCTCAAGGGCACCTTTGCCAGCCTCAGTGAGCTCCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAGGCTCCTAGGCAATGCGATCGTGATTGTGCTGGGCCACCACCTGGGCAAGGATTTCACCCCTGCTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCACTGCCCTGGCTCACAAGTACCACTAA

Goat

ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAGGCTGCTGGTTGTCTACCCCTGGACTCAGAGGTTCTTTGAGCACTTTGGGGACTTGTCCTCTGCTGATGCTGTTATGAACAATGCTAAGGTGAAGGCCCATGGCAAGAAGGTGCTAGACTCCTTTAGTAACGGCATGAAGCATCTTGACGACCTCAAGGGCACCTTTGCTCAGCTGAGTGAGCTGCACTGTGATAAGCTGCACGTGGATCCTGAGAACTTCAAGCTCCTGGGCAACGTGCTGGTGGTTGTGCTGGCTCGCCACCATGGCAGTGAATTCACCCCGCTGCTGCAGGCTGAGTTTCAGAAGGTGGTGGCTGGTGTTGCCAATGCCCTGGCCCACAGATATCACTAA

Bovine

ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTTGTCTACCCCTGGACTCAGAGGTTCTTTGAGTCCTTTGGGGACTTGTCCACTGCTGATGCTGTTATGAACAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGCTAGATTCCTTTAGTAATGGCATGAAGCATCTCGATGACCTCAAGGGCACCTTTGCTGCGCTGAGTGAGCTGCACTGTGATAAGCTGCATGTGGATCCTGAGAACTTCAAGCTCCTGGGCAACGTGCTAGTGGTTGTGCTGGCTCGCAATTTTGGCAAGGAATTCACCCCGGTGCTGCAGGCTGACTTTCAGAAGGTGGTGGCTGGTGTGGCCAATGCCCTGGCCCACAGATATCATTAA

Rabbit

ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTTGTCTACCCATGGACCCAGAGGTTCTTCGAGTCCTTTGGGGACCTGTCCTCTGCAAATGCTGTTATGAACAATCCTAAGGTGAAGGCTCATGGCAAGAAGGTGCTGGCTGCCTTCAGTGAGGGTCTGAGTCACCTGGACAACCTCAAAGGCACCTTTGCTAAGCTGAGTGAACTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTTATTGTGCTGTCTCATCATTTTGGCAAAGAATTCACTCCTCAGGTGCAGGCTGCCTATCAGAAGGTGGTGGCTGGTGTGGCCAATGCCCTGGCTCACAAATACCACTGA

Opossum

ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAGGATGCTCGTTGTCTACCCCTGGACCACCAGGTTTTTTGGGAGCTTTGGTGATCTGTCCTCTCCTGGCGCTGTCATGTCAAATTCTAAGGTTCAAGCCCATGGTGCTAAGGTGTTGACCTCCTTCGGTGAAGCAGTCAAGCATTTGGACAACCTGAAGGGTACTTATGCCAAGTTGAGTGAGCTCCACTGTGACAAGCTGCATGTGGACCCTGAGAACTTCAAGATGCTGGGGAATATCATTGTGATCTGCCTGGCTGAGCACTTTGGCAAGGATTTTACTCCTGAATGTCAGGTTGCTTGGCAGAAGCTCGTGGCTGGAGTTGCCCATGCCCTGGCCCACAAGTACCACTAA

Gallus

ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAGGCTGCTGATCGTCTACCCCTGGACCCAGAGGTTCTTTGCGTCCTTTGGGAACCTCTCCAGCCCCACTGCCATCCTTGGCAACCCCATGGTCCGCGCCCACGGCAAGAAAGTGCTCACCTCCTTTGGGGATGCTGTGAAGAACCTGGACAACATCAAGAACACCTTCTCCCAACTGTCCGAACTGCATTGTGACAAGCTGCATGTGGACCCCGAGAACTTCAGGCTCCTGGGTGACATCCTCATCATTGTCCTGGCCGCCCACTTCAGCAAGGACTTCACTCCTGAATGCCAGGCTGCCTGGCAGAAGCTGGTCCGCGTGGTGGCCCATGCCCTGGCTCGCAAGTACCACTAA

3. Similarity analysis of DNA sequence

Let G = g1, g2, g3, …, gN be an arbitrary complete coding sequence, where gi ∈ {A, T, G, C} for any i ∈ {1, 2, …, N}, and let G = t1, t2, t3, …, tM be its corresponding sequence of triplets, where M = ⌊N/3⌋ and ti is a triplet of DNA for any i ∈ {1, 2, …, M}. Then, we define a function δ and let δ(ti) represent the total number of times that the triplet ti occurs in the sequence of triplets G = t1, t2, t3, …, tM for any i ∈ {1, 2, …, M}.

Let T1 = GCT, T2 = GCC, T3 = GCA, T4 = GCG, T5 = CGT, T6 = CGC, T7 = CGA, T8 = CGG, T9 = AGA, T10 = AGG, T11 = GAT, T12 = GAC, T13 = AAT, T14 = AAC, T15 = TGT, T16 = TGC, T17 = GAA, T18 = GAG, T19 = CAA, T20 = CAG, T21 = GGT, T22 = GGC, T23 = GGA, T24 = GGG, T25 = CAT, T26 = CAC, T27 = ATT, T28 = ATC, T29 = ATA, T30 = CTT, T31 = CTC, T32 = CTA, T33 = CTG, T34 = TTA, T35 = TTG, T36 = AAA, T37 = AAG, T38 = TTT, T39 = TTC, T40 = CCT, T41 = CCC, T42 = CCA, T43 = CCG, T44 = TCT, T45 = TCC, T46 = TCA, T47 = TCG, T48 = AGT, T49 = AGC, T50 = ACT, T51 = ACC, T52 = ACA, T53 = ACG, T54 = TGG, T55 = TAT, T56 = TAC, T57 = GTT, T58 = GTC, T59 = GTA, T60 = GTG, T61 = ATG, T62 = TAA, T63 = TAG, and T64 = TGA.

Thereafter, since there are a total of 64 triplets of DNA (Table 2), we can construct a set of 64 vectors {<T1, δ(T1)>, <T2, δ(T2)>, …, <T64, δ(T64)>} for the given sequence of triplets G = t1, t2, t3, …, tM as follows: if Ti = tj for some tj ∈ {t1, t2, t3, …, tM}, then δ(Ti) = δ(tj); otherwise, δ(Ti) = 0, for any i ∈ {1, 2, …, 64} and j ∈ {1, 2, …, M}.

For convenience, we call {<T1, δ(T1)>, <T2, δ(T2)>, …, <T64, δ(T64)>} the triplet-repeat model set of G.
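The triplet-repeat model set is a table of triplet counts with zeros for triplets that never occur. A minimal Python sketch follows, assuming the sequence is read in non-overlapping triplets from its first base; the names `ALL_TRIPLETS` and `triplet_repeat_model` are illustrative.

```python
from collections import Counter
from itertools import product

# All 64 triplets. This lexicographic order differs from the T1..T64 order
# used in the text; the counts themselves do not depend on the order.
ALL_TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def triplet_repeat_model(sequence):
    """Return {triplet: number of occurrences} over all 64 triplets."""
    m = len(sequence) // 3  # number of complete triplets
    counts = Counter(sequence[3 * i:3 * i + 3] for i in range(m))
    return {t: counts.get(t, 0) for t in ALL_TRIPLETS}


# Toy sequence ATG GTG GTG TAA.
model = triplet_repeat_model("ATGGTGGTGTAA")
print(model["ATG"], model["GTG"], model["TAA"], model["CCC"])  # 1 2 1 0
```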

For any two given complete coding sequences A and B, suppose that their triplet-repeat model sets are {<T1, X1>, <T2, X2>, …, <T64, X64>} and {<T1, Y1>, <T2, Y2>, …, <T64, Y64>}, respectively. Then, on the basis of the 2D graphical representation given in Section 2, we define the weight deviation between the two DNA sequences A and B by formula (2), which measures the similarity between A and B.
$$\mathrm{WD}(A, B) = \frac{\sum_{i=1}^{64} |X_i - Y_i| \, \Psi(T_i)}{64} \tag{2}$$
Formula (2) reflects the fact that the smaller the weight deviation between the two DNA sequences A and B, the higher the degree of similarity of A and B. The similarity/dissimilarity matrix obtained from formula (2) for the coding sequences listed in Table 4 is given in Table 5. Based on this similarity matrix, we construct a phylogenetic tree, which is shown in Figure 4.
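The weight deviation of formula (2) can be sketched as follows, reading it as the sum over the 64 triplets of |Xi − Yi| Ψ(Ti), divided by 64. The function names and the small dictionary `psi_subset` are illustrative assumptions; the weights in it are copied from Table 3.

```python
from collections import Counter


def triplet_counts(sequence):
    """Count non-overlapping triplets, ignoring leftover trailing bases."""
    m = len(sequence) // 3
    return Counter(sequence[3 * i:3 * i + 3] for i in range(m))


def weight_deviation(seq_a, seq_b, psi):
    """Weight deviation between two sequences under the weights psi."""
    x, y = triplet_counts(seq_a), triplet_counts(seq_b)
    # Triplets absent from both sequences contribute 0, so summing over the
    # triplets that occur equals summing over all 64 triplets.
    total = sum(abs(x[t] - y[t]) * psi[t] for t in set(x) | set(y))
    return total / 64


psi_subset = {"ATG": 20.1, "GTG": 19.4, "CAC": 9.2, "CTG": 11.4, "TAA": 21.1}
a = "ATGGTGCACTAA"  # ATG GTG CAC TAA
b = "ATGGTGCTGTAA"  # ATG GTG CTG TAA
wd = weight_deviation(a, b, psi_subset)
print(wd)  # (9.2 + 11.4) / 64
```

Only CAC and CTG differ between the two toy sequences, so the sum is 9.2 + 11.4 = 20.6. As a consistency check on this reading of formula (2), a hand calculation with the full Table 3 weights on the human and gorilla sequences of Table 4 gives 277.5/64 ≈ 4.3359, which agrees with the corresponding entry of Table 5.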
Figure 4

Phylogenetic tree based on the similarity matrix (Table 5).

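Table 5

The similarity/dissimilarity matrix for the coding sequences of Table 4 based on the weight deviation

|  | Human | Chimpanzee | Gorilla | Lemur | Rat | Mouse | Goat | Bovine | Rabbit | Opossum | Gallus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human | 0 | 5.2500 | 4.3359 | 8.5891 | 10.670 | 9.7047 | 8.2219 | 8.1438 | 7.8281 | 15.6078 | 16.7109 |
| Chimpanzee |  | 0 | 1.1266 | 8.0297 | 10.645 | 9.6016 | 8.4375 | 9.3219 | 9.6000 | 14.2578 | 15.8734 |
| Gorilla |  |  | 0 | 7.8688 | 9.9625 | 8.6063 | 7.6734 | 8.5578 | 8.5547 | 13.9719 | 14.8781 |
| Lemur |  |  |  | 0 | 8.7219 | 9.5500 | 7.1328 | 9.3891 | 5.6891 | 12.9281 | 15.2000 |
| Rat |  |  |  |  | 0 | 6.0750 | 7.0484 | 9.3641 | 9.6578 | 13.5906 | 14.1219 |
| Mouse |  |  |  |  |  | 0 | 9.4953 | 9.2641 | 10.7984 | 12.3406 | 12.3688 |
| Goat |  |  |  |  |  |  | 0 | 5.2625 | 8.7219 | 11.9703 | 14.5359 |
| Bovine |  |  |  |  |  |  |  | 0 | 9.2906 | 12.5922 | 15.0234 |
| Rabbit |  |  |  |  |  |  |  |  | 0 | 14.8984 | 15.6953 |
| Opossum |  |  |  |  |  |  |  |  |  | 0 | 14.2750 |
| Gallus |  |  |  |  |  |  |  |  |  |  | 0 |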

From Table 5, it is easy to see that human, gorilla, and chimpanzee are the most similar to each other: gorilla-chimpanzee (weight deviation 1.1266), human-gorilla (4.3359), and human-chimpanzee (5.2500) are the most similar species pairs. In contrast, Gallus and opossum are the most dissimilar to the others (with weight deviations larger than 11). This is consistent with the fact that Gallus is not a mammal, whereas the others are mammals, and that opossum is the most remote species from the remaining mammals. Similar results have been obtained in other papers by different approaches [2, 5, 7, 9, 33].

To test the validity of our method, Table 6 lists, for comparison, existing results on the degree of similarity/dissimilarity between the coding sequence of the human β-globin gene and those of several other species, obtained by approaches that use alternative DNA sequence descriptors [2, 5, 7, 9].
Table 6

The similarity/dissimilarity of the coding sequences

| Species | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| Chimpanzee | 5.2500 | 0.0144 | 14.00 | 0.005069 | 0.863 |
| Gorilla | 4.3359 | 0.0125 | 13.63 | 0.006611 | 0.339 |
| Lemur | 8.5891 | - | 31.75 | 0.030894 | 1.188 |
| Rat | 10.670 | 0.1377 | 41.65 | 0.015539 | 1.966 |
| Mouse | 9.7047 | 0.1427 | 30.27 | 0.015700 | 0.735 |
| Goat | 8.2219 | 0.1161 | 31.39 | 0.020980 | 0.311 |
| Bovine | 8.1438 | 0.0773 | 30.68 | 0.017700 | 2.489 |
| Rabbit | 7.8281 | 0.1332 | 35.575 | 0.015788 | 1.372 |
| Opossum | 15.6078 | - | 48.701 | 0.033363 | 6.322 |
| Gallus | 16.7109 | - | 70.46 | 0.025801 | 7.170 |

From Table 6, we find that human-gorilla and human-chimpanzee are the two most similar species pairs when adopting (A) the method of our work, (B) the method of [2], (C) the method of [5], and (D) the method of [7], which accords with the fact that gorilla and chimpanzee are the two species closest to human. When adopting (E) the method of [9], however, the most similar species pair is human-goat, which is obviously not correct. In addition, human-Gallus and human-opossum are the two most dissimilar species pairs when adopting (A) the method of our work, (C) the method of [5], and (E) the method of [9], which accords with the fact that Gallus is not a mammal, whereas the others are mammals, and that opossum is the most remote species from the remaining mammals. However, when adopting (D) the method of [7], the two most dissimilar species pairs are human-opossum and human-lemur, which is also unreasonable.
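The comparison drawn from Table 6 can be reproduced by ranking the species by distance to human under each method. The sketch below uses only the values of Table 6; the names `TABLE6`, `METHODS`, and `ranked` are illustrative, and entries not reported in the table ("-") are stored as `None` and skipped.

```python
# Distances to human from Table 6, in column order A, B, C, D, E.
TABLE6 = {
    "Chimpanzee": (5.2500, 0.0144, 14.00, 0.005069, 0.863),
    "Gorilla": (4.3359, 0.0125, 13.63, 0.006611, 0.339),
    "Lemur": (8.5891, None, 31.75, 0.030894, 1.188),
    "Rat": (10.670, 0.1377, 41.65, 0.015539, 1.966),
    "Mouse": (9.7047, 0.1427, 30.27, 0.015700, 0.735),
    "Goat": (8.2219, 0.1161, 31.39, 0.020980, 0.311),
    "Bovine": (8.1438, 0.0773, 30.68, 0.017700, 2.489),
    "Rabbit": (7.8281, 0.1332, 35.575, 0.015788, 1.372),
    "Opossum": (15.6078, None, 48.701, 0.033363, 6.322),
    "Gallus": (16.7109, None, 70.46, 0.025801, 7.170),
}
METHODS = "ABCDE"


def ranked(method):
    """Species sorted from most to least similar to human for one method."""
    col = METHODS.index(method)
    rows = [(v[col], s) for s, v in TABLE6.items() if v[col] is not None]
    return [s for _, s in sorted(rows)]


for m in METHODS:
    order = ranked(m)
    print(m, "closest:", order[:2], "farthest:", order[-2:])
```

Note that method B reports no value for lemur, opossum, or Gallus, so its "farthest" entries are not comparable with those of the other methods.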

4. Conclusion

In this paper, we propose a new 2D graphical representation for DNA sequences based on triplets. Combining a newly introduced concept of triplet weight with a newly designed measure of similarity named weight deviation, we propose a new method for the similarity analysis of DNA sequences. The method requires no matrix computation and offers both computational scientists and molecular biologists a reasonable and useful approach for analyzing DNA sequences effectively.

Declarations

Acknowledgements

This work is supported by the Chongqing Education Science Project of China in 2014, Chongqing “Twelfth Five Year plan” educational programming projects of China (2013-ZJ-077), program for university youth backbone teachers of Chongqing in 2014.

Authors’ Affiliations

(1)
School of Software Engineering, Chongqing College of Electronic Engineering

References

  1. Chen W, Liao B, Liu Y, Zhu W, Su Z: A numerical representation of DNA sequences and its applications. MATCH: Commun Math Comput Chem. 2008, 60: 291-300.
  2. Jafarzadeh N, Iranmanesh A: A novel graphical and numerical representation for analyzing DNA sequences based on codons. MATCH: Commun Math Comput Chem. 2012, 68: 611-620.
  3. Liao B, Liao BY, Sun XM, Zeng QG: A novel method for similarity analysis and protein sub-cellular localization prediction. Bioinformatics 2010, 26: 2678-2683. doi:10.1093/bioinformatics/btq521
  4. Qi XQ, Wu Q, Zhang Y, Fuller E, Zhang CQ: A novel model for DNA sequence similarity analysis based on graph theory. J Evol Bioinform 2011, 7: 149-158.
  5. Yu JF, Wang JH, Sun X: Analysis of similarities/dissimilarities of DNA sequences based on a novel graphical representation. MATCH: Commun Math Comput Chem. 2010, 63: 493-512.
  6. Li Y, Huang G, Liao B, Liu Z: H-L curve: a novel 2D graphical representation of protein sequences. MATCH: Commun Math Comput Chem. 2009, 61: 519-532.
  7. Liao B, Wang TM: New 2D graphical representation of DNA sequences. J. Comput. Chem. 2004, 25: 1364-1368. doi:10.1002/jcc.20060
  8. Liao B, Xiang XY, Zhu W: Coronavirus phylogeny based on 2D graphical representation of DNA sequence. J. Comput. Chem. 2006, 27: 1196-1202. doi:10.1002/jcc.20439
  9. Liu ZB, Liao B, Zhu W, Huang GH: A 2D graphical representation of DNA sequence based on dual nucleotides and its application. Int. J. Quantum Chem. 2009, 109: 948-958. doi:10.1002/qua.21919
  10. Randic M, Vracko M, Zupan J, Novic M: Compact 2D graphical representation of DNA. Chem. Phys. Lett. 2003, 373: 558-562. doi:10.1016/S0009-2614(03)00639-0
  11. Randic M, Vracko M, Lers N, Plavsic D: Analysis of similarity/dissimilarity of 2D graphical representation. Chem. Phys. Lett. 2003, 371: 202-207. doi:10.1016/S0009-2614(03)00244-6
  12. Randic M, Vracko M, Lers N, Plavsic D: Novel 2-D graphical representation of DNA sequences and their numerical characterization. Chem. Phys. Lett. 2003, 368: 1-6. doi:10.1016/S0009-2614(02)01784-0
  13. Randic M: Graphical representations of DNA as 2-D map. Chem. Phys. Lett. 2004, 386: 468-471. doi:10.1016/j.cplett.2004.01.088
  14. Guo XF, Randic M, Basak SC: A novel 2-D graphical representation of DNA sequences of low degeneracy. Chem. Phys. Lett. 2001, 350: 106-112. doi:10.1016/S0009-2614(01)01246-5
  15. Guo XF, Nandy A: Numerical characterization of DNA sequences in a 2-D graphical representation scheme of low degeneracy. Chem. Phys. Lett. 2003, 369: 361-366. doi:10.1016/S0009-2614(02)02029-8
  16. Qi ZH, Qi XQ: Novel 2D graphical representation of DNA sequence based on dual nucleotides. Chem Phys Lett. 2007, 440: 139-144. doi:10.1016/j.cplett.2007.03.107
  17. Dai Q, Xiu ZL, Wang TM: A novel 2D graphical representation of DNA sequences and its application. J Mol Graph Model. 2006, 25: 340-344. doi:10.1016/j.jmgm.2005.12.004
  18. Liu XQ, Dai Q, Xiu ZL, Wang TM: PNN–curve: a new 2D graphical representation of DNA sequences and its application. J. Theor. Biol. 2006, 243: 555-561. doi:10.1016/j.jtbi.2006.07.018
  19. Dorota BW, Timothy C, Piotr W: 2D-dynamic representation of DNA sequences. Chem. Phys. Lett. 2007, 442: 140-144. doi:10.1016/j.cplett.2007.05.050
  20. Yuan CX, Liao B, Wang TM: New 3D graphical representation of DNA sequences and their numerical characterization. Chem. Phys. Lett. 2003, 379: 412-417. doi:10.1016/j.cplett.2003.07.023
  21. Liao B, Wang TM: 3-D graphical representation of DNA sequences and their numerical characterization. J. Mol. Struct. (THEOCHEM) 2004, 681: 209-212. doi:10.1016/j.theochem.2004.05.020
  22. Liao B, Wang TM: A 3D graphical representation of RNA secondary structure. J Biomol Struct Dynam. 2004, 21: 827-832. doi:10.1080/07391102.2004.10506972
  23. Cao Z, Liao B, Li RF: A group of 3D graphical representation of DNA sequences based on dual nucleotides. Int. J. Quantum Chem. 2008, 108: 1485-1490. doi:10.1002/qua.21698
  24. Randic M, Vracko M, Nandy A, Basak SC: On 3D graphical representation of DNA primary sequences and their numerical characterization. J. Chem. Inf. Comput. Sci. 2000, 40: 1235-1244. doi:10.1021/ci000034q
  25. Randic M, Zupan J, Novic M: On 3D graphical representation of proteomics maps and their numerical characterization. J. Chem. Inf. Comput. Sci. 2001, 41: 1339-1344. doi:10.1021/ci0001684
  26. Qi XQ, Fan TR: PN-curve: a 3D graphical representation of DNA sequences and their numerical characterization. Chem. Phys. Lett. 2007, 442: 434-440. doi:10.1016/j.cplett.2007.06.029
  27. Yu JF, Sun X, Wang JH: TN curve: a novel 3D graphical representation of DNA sequence based on trinucleotides and its applications. J. Theor. Biol. 2009, 261: 459-468. doi:10.1016/j.jtbi.2009.08.005
  28. Aram V, Iranmanesh A: 3D-dynamic representation of DNA sequences. MATCH: Commun Math Comput Chem. 2012, 67: 809-816.
  29. Liao B, Tan MS, Ding KQ: A 4D representation of DNA sequences and its application. Chem. Phys. Lett. 2005, 402: 380-383. doi:10.1016/j.cplett.2004.12.062
  30. Tang XC, Zhou PP, Qiu WY: On the similarity/dissimilarity of DNA sequences based on 4D graphical representation. Chin. Sci. Bull. 2010, 55: 701-704. doi:10.1007/s11434-010-0045-2
  31. Chi R, Ding KQ: Novel 4D numerical representation of DNA sequences. Chem. Phys. Lett. 2005, 407: 63-67. doi:10.1016/j.cplett.2005.03.056
  32. Liao B, Xiang XY, Li RF, Zhu W: On the similarity of DNA primary sequences based on 5D representation. J. Math. Chem. 2007, 42: 47-57. doi:10.1007/s10910-006-9091-z
  33. He P, Wang J: Characteristic sequences for DNA primary sequence. J. Chem. Inf. Comput. Sci. 2002, 42: 1080-1085. doi:10.1021/ci010131z

Copyright

© Zou et al.; licensee Springer. 2014

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.