A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification

We investigate methods of estimating residue correlation within protein sequences. We begin with the mutual information (MI) of adjacent residues, then extend the approach by defining the mutual information vector (MIV), which estimates long-range correlations between nonadjacent residues. We also consider correlation based on residue hydropathy rather than protein-specific interactions. Finally, in family classification experiments, MIV showed significantly better modeling power than the classic MI method, reaching a level at which proteins can be classified without alignment information.
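To make the two quantities in the abstract concrete, here is a minimal Python sketch. It rests on assumptions the abstract does not spell out: MI is estimated from observed residue-pair counts at a fixed separation d, the marginals are taken from those same pairs, and the MIV is simply the list of MI values for d = 1 up to a hypothetical `max_d`. The function names and the default `max_d=20` are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(seq, d=1):
    """Estimate MI (in bits) between residues d positions apart in seq.

    Assumption: probabilities are plain frequency estimates taken from
    the residue pairs (seq[i], seq[i + d]) observed in this one sequence.
    """
    pairs = [(seq[i], seq[i + d]) for i in range(len(seq) - d)]
    n = len(pairs)
    if n == 0:
        return 0.0  # sequence too short to contain any pair at distance d

    joint = Counter(pairs)                 # counts of (a, b) pairs
    left = Counter(a for a, _ in pairs)    # counts of the first residue
    right = Counter(b for _, b in pairs)   # counts of the second residue

    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (left[a] * right[b]))
    return mi


def mutual_information_vector(seq, max_d=20):
    """Hypothetical MIV: MI values for separations 1..max_d."""
    return [mutual_information(seq, d) for d in range(1, max_d + 1)]


if __name__ == "__main__":
    example = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary illustrative string
    print(mutual_information(example, d=1))
    print(mutual_information_vector(example, max_d=5))
```

With d = 1 this corresponds to the adjacent-residue MI; larger d gives the nonadjacent terms. A hydropathy-based variant could be sketched by first mapping each residue to a hydropathy class and passing the mapped sequence to the same functions. Estimates from a single short sequence are strongly affected by finite sample size, so the printed values are only illustrative.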




Author information

Correspondence to Chris Hemmerich.

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Hemmerich, C., Kim, S. A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification. J Bioinform Sys Biology 2007, 87356 (2007).

  • Protein Sequence
  • Mutual Information
  • Long Range
  • System Biology
  • Modeling Power