  • Research Article
  • Open Access

A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification

EURASIP Journal on Bioinformatics and Systems Biology 2007, 2007:87356

  • Received: 28 February 2007
  • Accepted: 31 July 2007
  • Published:


We investigate methods for estimating residue correlation within protein sequences. We begin with the mutual information (MI) of adjacent residues, then extend the approach by defining the mutual information vector (MIV), which estimates long-range correlations between nonadjacent residues. We also consider correlation based on residue hydropathy rather than protein-specific interactions. Finally, in family classification experiments, the modeling power of MIV proved significantly better than that of the classic MI method, reaching a level at which proteins can be classified without alignment information.
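As a rough illustration of the quantities named in the abstract, the sketch below computes the MI between residue pairs at a fixed separation and stacks those values into a vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `gap` and `max_gap` parameters, the default vector length, and the use of plug-in frequency estimates taken from the sequence itself are all choices made here for illustration.

```python
import math
from collections import Counter


def mutual_information(seq, gap=0):
    """Plug-in MI estimate (in bits) between residues with `gap` residues between them.

    gap=0 pairs adjacent residues. Probabilities are estimated from the
    observed pairs in `seq` itself (an assumption of this sketch).
    """
    step = gap + 1
    pairs = [(seq[i], seq[i + step]) for i in range(len(seq) - step)]
    if not pairs:
        return 0.0  # sequence too short for this separation
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (first[a] * second[b]))
    return mi


def mutual_information_vector(seq, max_gap=19):
    """Vector of MI values for gaps 0..max_gap (the default length is illustrative)."""
    return [mutual_information(seq, gap) for gap in range(max_gap + 1)]


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the call pattern.
    toy = "MKVLAAGIVGLLLAQWMKVLAAGIVG"
    print(mutual_information(toy))
    print(mutual_information_vector(toy, max_gap=4))
```

A hydropathy-based variant could map each residue to a hydropathy class before calling these functions; no particular partition is given here, so none is assumed. Vectors of this kind can then be compared by a distance-based classifier, which is one way sequences might be grouped without alignment.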


  • Protein Sequence
  • Mutual Information
  • Long Range
  • System Biology
  • Modeling Power


Authors’ Affiliations

Center for Genomics and Bioinformatics, Indiana University, 1001 E. 3rd Street, Bloomington, IN 47405-3700, USA
School of Informatics, Center for Genomics and Bioinformatics, Indiana University, 901 E. 10th Street, Bloomington, IN 47408-3912, USA




© C. Hemmerich and S. Kim. 2007

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.