Compressing Proteomes: The Relevance of Medium Range Correlations

Benedetto, Dario; Caglioti, Emanuele; Chica, Claudia

doi:10.1155/2007/60723

Research Article
Open access
Published: 30 October 2007

EURASIP Journal on Bioinformatics and Systems Biology volume 2007, Article number: 60723 (2007)

Abstract

We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at short and medium range, that is, between amino acids located 10 and 100 residues apart, respectively. We show that statistical models that consider these two types of correlation are more likely to capture the information contained in protein sequences and thus achieve good compression rates. Finally, we propose that the cause of this redundancy is related to the evolutionary origin of proteomes and protein sequences.
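One way to make "correlation between amino acids d residues apart" concrete is the mutual information between the residue at position i and the residue at position i + d. The sketch below is an illustration only, not the statistical model used in the article: the function name mutual_information_at_distance is hypothetical, and the estimator is the plain plug-in (empirical frequency) estimate, which is biased upward on short sequences.

```python
from collections import Counter
from math import log2


def mutual_information_at_distance(seq: str, d: int) -> float:
    """Plug-in estimate, in bits, of the mutual information between
    seq[i] and seq[i + d] over all valid positions i.

    Illustrative assumption: this is NOT the article's model, only a
    simple way to quantify correlation at a fixed separation d.
    """
    if d <= 0 or d >= len(seq):
        raise ValueError("d must satisfy 0 < d < len(seq)")

    # Joint counts of (residue at i, residue at i + d).
    pairs = Counter(zip(seq, seq[d:]))
    n = sum(pairs.values())  # equals len(seq) - d

    # Marginal counts over the same n positions.
    left = Counter(seq[:-d])
    right = Counter(seq[d:])

    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (left[a] * right[b]).
        mi += p_ab * log2(count * n / (left[a] * right[b]))
    return mi


if __name__ == "__main__":
    # Toy sequence with an exact period of 4 (not real proteome data).
    toy = "ACDE" * 50
    for d in (4, 10, 100):
        print(d, mutual_information_at_distance(toy, d))
```

On real data one would evaluate the function at d = 10 and d = 100 on a proteome and compare the values with those of a shuffled copy of the same sequence, since the plug-in estimate is positive even for uncorrelated residues.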

Author information

Authors and Affiliations

Dipartimento di Matematica, Università di Roma "La Sapienza", Piazzale Aldo Moro 5, Roma, 00185, Italy
Dario Benedetto & Emanuele Caglioti
Structural and Computational Biology Unit, EMBL Heidelberg, Meyerhofstraße 1, Heidelberg, 69117, Germany
Claudia Chica

Corresponding author

Correspondence to Claudia Chica.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Benedetto, D., Caglioti, E. & Chica, C. Compressing Proteomes: The Relevance of Medium Range Correlations. J Bioinform Sys Biology 2007, 60723 (2007). https://doi.org/10.1155/2007/60723

Received: 14 January 2007
Revised: 28 May 2007
Accepted: 10 September 2007
Published: 30 October 2007
DOI: https://doi.org/10.1155/2007/60723