A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification

Hemmerich, Chris; Kim, Sun

doi:10.1155/2007/87356

Research Article
Open access
Published: 10 September 2007

EURASIP Journal on Bioinformatics and Systems Biology volume 2007, Article number: 87356 (2007)

Abstract

We investigate methods for estimating residue correlation within protein sequences. We begin with the mutual information (MI) of adjacent residues, then extend the approach by defining the mutual information vector (MIV), which estimates long-range correlations between nonadjacent residues. We also consider correlation based on residue hydropathy rather than protein-specific interactions. Finally, in family classification experiments, MIV modeled sequences significantly better than the classic MI method, reaching a level at which proteins can be classified without alignment information.

[1-15]
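The abstract describes MI between adjacent residues and its extension to a vector of MI values over nonadjacent residues. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the use of base-2 logarithms, the estimation of probabilities from the sequence itself, and the default range of gaps (0 to 19) are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(seq: str, gap: int = 0) -> float:
    """Estimate MI (in bits) between residues separated by `gap` positions.

    gap=0 pairs adjacent residues. Probabilities are estimated from the
    sequence itself, which is an assumption of this sketch.
    """
    step = gap + 1
    pairs = [(seq[i], seq[i + step]) for i in range(len(seq) - step)]
    if not pairs:
        return 0.0  # sequence too short to form any pair at this gap

    n = len(pairs)
    joint = Counter(pairs)                 # counts of (first, second) pairs
    first = Counter(a for a, _ in pairs)   # marginal counts, first position
    second = Counter(b for _, b in pairs)  # marginal counts, second position

    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) / (p(a) * p(b)) simplifies to count * n / (first * second)
        ratio = count * n / (first[a] * second[b])
        mi += (count / n) * math.log2(ratio)
    return mi


def mutual_information_vector(seq: str, max_gap: int = 19) -> list[float]:
    """Hypothetical MIV: one MI estimate per gap from 0 to max_gap."""
    return [mutual_information(seq, gap) for gap in range(max_gap + 1)]
```

Under these assumptions, each sequence maps to a fixed-length numeric vector regardless of its length, which is what would allow sequences to be compared by a standard classifier without first aligning them. Estimates from short sequences are biased by finite sample effects, so values from sequences of very different lengths should be compared with care.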

References

Weiss O, Jiménez-Montaño MA, Herzel H: Information content of protein sequences. Journal of Theoretical Biology 2000, 206(3):379-386. 10.1006/jtbi.2000.2138
Cline MS, Karplus K, Lathrop RH, Smith TF, Rogers RG Jr., Haussler D: Information-theoretic dissection of pairwise contact potentials. Proteins: Structure, Function and Genetics 2002, 49(1):7-14. 10.1002/prot.10198
Martin LC, Gloor GB, Dunn SD, Wahl LM: Using information theory to search for co-evolving residues in proteins. Bioinformatics 2005, 21(22):4116-4124. 10.1093/bioinformatics/bti671
Bateman A, Coin L, Durbin R, et al.: The Pfam protein families database. Nucleic Acids Research 2004, 32(Database):D138-D141.
Atchley WR, Terhalle W, Dress A: Positional dependence, cliques, and predictive motifs in the bHLH protein domain. Journal of Molecular Evolution 1999, 48(5):501-516. 10.1007/PL00006494
Weiss O, Herzel H: Correlations in protein sequences and property codes. Journal of Theoretical Biology 1998, 190(4):341-353. 10.1006/jtbi.1997.0560
Cover TM, Thomas JA: Elements of Information Theory. Wiley-Interscience, New York, NY, USA; 1991.
Grosse I, Herzel H, Buldyrev SV, Stanley HE: Species independence of mutual information in coding and noncoding DNA. Physical Review E 2000, 61(5):5624-5629. 10.1103/PhysRevE.61.5624
Jiménez-Montaño MA: On the syntactic structure of protein sequences and the concept of grammar complexity. Bulletin of Mathematical Biology 1984, 46(4):641-659.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215(3):403-410.
Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Series in Data Management Systems. 2nd edition. Morgan Kaufmann, San Francisco, Calif, USA; 2005.
Cover TM, Hart P: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 1967, 13(1):21-27. 10.1109/TIT.1967.1053964
Aha DW, Kibler D, Albert MK: Instance-based learning algorithms. Machine Learning 1991, 6(1):37-66.
Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), Montréal, Québec, Canada, August 1995 2: 1137-1145.
Herzel H, Schmitt AO, Ebeling W: Finite sample effects in sequence analysis. Chaos, Solitons & Fractals 1994, 4(1):97-113. 10.1016/0960-0779(94)90020-5

Author information

Authors and Affiliations

Center For Genomics and Bioinformatics, Indiana University, 1001 E. 3rd Street, Bloomington, Indiana, IN, 47405-3700, USA
Chris Hemmerich
School of Informatics, Center for Genomics and Bioinformatics, Indiana University, 901 E. 10th Street, Bloomington, Indiana, IN, 47408-3912, USA
Sun Kim

Corresponding author

Correspondence to Chris Hemmerich.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Hemmerich, C., Kim, S. A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification. J Bioinform Sys Biology 2007, 87356 (2007). https://doi.org/10.1155/2007/87356

Received: 28 February 2007
Revised: 22 June 2007
Accepted: 31 July 2007
Published: 10 September 2007
DOI: https://doi.org/10.1155/2007/87356
