
A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification


We investigate methods for estimating residue correlation within protein sequences. We begin with the mutual information (MI) of adjacent residues, then extend the approach by defining the mutual information vector (MIV), which estimates long-range correlations between nonadjacent residues. We also consider correlation based on residue hydropathy rather than protein-specific interactions. Finally, in family classification experiments, the modeling power of MIV was significantly better than that of the classic MI method, reaching a level at which proteins can be classified without alignment information.
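The idea of extending adjacent-residue MI to a vector over residue separations can be illustrated with a minimal sketch. It assumes a simple plug-in (frequency-count) estimate of MI taken from a single sequence, and it assumes the MIV is the list of MI values at separations 1 through `max_distance`. The function names, the default of 20 separations, and the choice of estimator are illustrative assumptions, not the authors' stated procedure.

```python
from collections import Counter
from math import log2


def mutual_information(seq, distance=1):
    """Plug-in MI estimate (in bits) between residues `distance` positions apart.

    Assumption: probabilities are estimated by counting residue pairs
    within this one sequence; distance=1 means adjacent residues.
    """
    pairs = [(seq[i], seq[i + distance]) for i in range(len(seq) - distance)]
    if not pairs:
        return 0.0  # sequence too short to form any pair at this distance
    n = len(pairs)
    joint = Counter(pairs)                 # counts of (first, second) pairs
    first = Counter(a for a, _ in pairs)   # marginal counts, first position
    second = Counter(b for _, b in pairs)  # marginal counts, second position
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((first[a] / n) * (second[b] / n)))
    return mi


def mutual_information_vector(seq, max_distance=20):
    """Hypothetical MIV: MI at each separation 1..max_distance."""
    return [mutual_information(seq, d) for d in range(1, max_distance + 1)]
```

A vector such as `mutual_information_vector(sequence)` could then serve as a fixed-length feature representation for a classifier, with no alignment step. To study hydropathy-based correlation under the same sketch, each residue would first be mapped to a hydropathy class before counting pairs. Plug-in estimates of this kind are biased upward on short sequences, so the raw values should be interpreted with care.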




Author information

Corresponding author

Correspondence to Chris Hemmerich.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Hemmerich, C., Kim, S. A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification. J Bioinform Sys Biology 2007, 87356 (2007).


  • Protein Sequence
  • Mutual Information
  • Long Range
  • System Biology
  • Modeling Power