Aligning Sequences by Minimum Description Length

EURASIP Journal on Bioinformatics and Systems Biology 2008, 2007:72936

https://doi.org/10.1155/2007/72936

Received: 26 February 2007

Accepted: 16 November 2007

Published: 2 January 2008

Abstract

This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from . A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.
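The code length idea in the abstract can be illustrated with a minimal sketch. It is not the paper's actual code length function: the two-letter alphabet and the `JOINT` probabilities are hypothetical, gaps are not handled, and the cost of encoding the regular expression itself is ignored. It only shows that encoding one string conditionally on another takes fewer bits when the aligned letters are likely substitutions for each other, and more bits when they are not.

```python
import math

# Hypothetical joint probabilities of aligned letter pairs over a toy
# two-letter alphabet. Real values would be derived from a substitution matrix.
JOINT = {
    ("A", "A"): 0.4, ("A", "B"): 0.1,
    ("B", "A"): 0.1, ("B", "B"): 0.4,
}


def marginal(letter):
    """Background probability of a letter, obtained by summing the joint table."""
    return sum(p for (a, _), p in JOINT.items() if a == letter)


def bits_independent(s):
    """Bits needed to encode s letter by letter using background probabilities."""
    return sum(-math.log2(marginal(c)) for c in s)


def bits_conditional(s, t):
    """Bits needed to encode t given s, aligned column by column (no gaps)."""
    if len(s) != len(t):
        raise ValueError("this sketch only handles ungapped, equal-length strings")
    return sum(-math.log2(JOINT[(a, b)] / marginal(a)) for a, b in zip(s, t))


def description_length(s, t, aligned):
    """Total bits for the pair: s alone, plus t either conditioned on s or alone."""
    cost_of_t = bits_conditional(s, t) if aligned else bits_independent(t)
    return bits_independent(s) + cost_of_t


if __name__ == "__main__":
    for s, t in [("AABB", "AABB"), ("AABB", "BBAA")]:
        unaligned = description_length(s, t, aligned=False)
        aligned = description_length(s, t, aligned=True)
        better = "align" if aligned < unaligned else "do not align"
        print(f"{s}/{t}: unaligned={unaligned:.2f} aligned={aligned:.2f} -> {better}")
```

Under these toy probabilities, aligning two similar strings shortens the description, while forcing an alignment of dissimilar strings lengthens it, which is the comparison a minimum description length search relies on.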

[1-40]

Authors’ Affiliations

(1) Department of Computer and Information Science, University of Oregon

References

  1. Myers EW: The fragment assembly string graph. Bioinformatics 2005, 21(suppl. 2):ii79-ii85.
  2. Altschul SF, Madden TL, Schaffer AA, et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25(17):3389-3402. 10.1093/nar/25.17.3389
  3. Phillips AJ: Homology assessment and molecular sequence alignment. Journal of Biomedical Informatics 2006, 39(1):18-33. 10.1016/j.jbi.2005.11.005
  4. Wrabl JO, Grishin NV: Gaps in structurally similar proteins: towards improvement of multiple sequence alignment. Proteins 2004, 54(1):71-87.
  5. Sjölander K: Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics 2004, 20(2):170-179. 10.1093/bioinformatics/bth021
  6. Webb B-JM, Liu JS, Lawrence CE: BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Research 2002, 30(5):1268-1277. 10.1093/nar/30.5.1268
  7. Rissanen J: Modelling by the shortest data description. Automatica 1978, 14(5):465-471. 10.1016/0005-1098(78)90005-5
  8. Grünwald P: A minimum description length approach to grammar inference. In Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Lecture Notes in Computer Science. Volume 1040. Springer, Berlin, Germany; 1996:203-216.
  9. Brazma A, Jonassen I, Vilo J, Ukkonen E: Pattern discovery in biosequences. In International Conference on Grammar Inference (ICGI '98), Lecture Notes in Artificial Intelligence. Volume 1433. Edited by: Honavar V, Slutski G. Springer, Ames, Iowa, USA; 1998:257-270.
  10. Cai L, Malmberg RL, Wu Y: Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics 2003, 19(suppl. 1):i66-i73.
  11. Searls DB: The computational linguistics of biological sequences. In Artificial Intelligence and Molecular Biology, Menlo Park, Calif, USA. American Association for Artificial Intelligence; 1993:47-120.
  12. Searls DB: Linguistic approaches to biological sequences. Computer Applications in the Biosciences 1997, 13(4):333-344.
  13. Bairoch A: PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Research 1992, 20: 2013-2018.
  14. Vingron M, Waterman MS: Sequence alignment and penalty choice. Review of concepts, case studies and implications. Journal of Molecular Biology 1994, 235(1):1-12. 10.1016/S0022-2836(05)80006-3
  15. Henikoff S: Scores for sequence searches and alignments. Current Opinion in Structural Biology 1996, 6(3):353-360. 10.1016/S0959-440X(96)80055-8
  16. Giribet G, Wheeler WC: On gaps. Molecular Phylogenetics and Evolution 1999, 13(1):132-143. 10.1006/mpev.1999.0643
  17. Nozaki Y, Bellgard M: Statistical evaluation and comparison of a pairwise alignment algorithm that a priori assigns the number of gaps rather than employing gap penalties. Bioinformatics 2005, 21(8):1421-1428. 10.1093/bioinformatics/bti198
  18. Reese JT, Pearson WR: Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 2002, 18(11):1500-1507. 10.1093/bioinformatics/18.11.1500
  19. Allison L, Wallace CS, Yee CN: Finite-state models in the alignment of macromolecules. Journal of Molecular Evolution 1992, 35(1):77-89. 10.1007/BF00160262
  20. Schmidt JP: An information theoretic view of gapped and other alignments. Proceedings of the 3rd Pacific Symposium on Biocomputing (PSB '98), Maui, Hawaii, USA, January 1998 561-572.
  21. Aynechi T, Kuntz ID: An information theoretic approach to macromolecular modeling: I. Sequence alignments. Biophysical Journal 2005, 89(5):2998-3007. 10.1529/biophysj.104.054072
  22. Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 1999, 15(3):211-218. 10.1093/bioinformatics/15.3.211
  23. Brudno M, Chapman M, Göttgens B, Batzoglou S, Morgenstern B: Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics 2003, 4: 66. 10.1186/1471-2105-4-66
  24. Schneider TD: Information content of individual genetic sequences. Journal of Theoretical Biology 1997, 189(4):427-441. 10.1006/jtbi.1997.0540
  25. Krasnogor N, Pelta DA: Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics 2004, 20(7):1015-1021. 10.1093/bioinformatics/bth031
  26. Conery JS: Realign: grammar-based sequence alignment. University of Oregon, http://teleost.cs.uoregon.edu/realign
  27. Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, Washington, DC, USA, 1978 5(suppl. 3):345-352.
  28. Mount DW: Bioinformatics: Sequence and Genome Analysis. 2nd edition. Cold Spring Harbor Laboratory Press, New York, NY, USA; 2004.
  29. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America 1992, 89(22):10915-10919. 10.1073/pnas.89.22.10915
  30. Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science 1992, 256(5062):1443-1445. 10.1126/science.1604319
  31. Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the United States of America 1990, 87(6):2264-2268. 10.1073/pnas.87.6.2264
  32. Eddy SR: Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology 2004, 22(8):1035-1036. 10.1038/nbt0804-1035
  33. Aurrecoechea C, Heiges M, Wang H, et al.: ApiDB: integrated resources for the apicomplexan bioinformatics resource center. Nucleic Acids Research 2007, 35: D427-D430. 10.1093/nar/gkl880
  34. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 1994, 22(22):4673-4680. 10.1093/nar/22.22.4673
  35. Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research 1999, 27(13):2682-2690. 10.1093/nar/27.13.2682
  36. Carter R: Speculations on the origins of Plasmodium vivax malaria. Trends in Parasitology 2003, 19(5):214-219. 10.1016/S1471-4922(03)00070-9
  37. Cline M, Hughey R, Karplus K: Predicting reliable regions in protein sequence alignments. Bioinformatics 2002, 18(2):306-314. 10.1093/bioinformatics/18.2.306
  38. Conery JS, Lynch M: Nucleotide substitutions and the evolution of duplicate genes. Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01), Big Island of Hawaii, Hawaii, USA, January 2001 167-178.
  39. Pearson WR: Comparison of methods for searching protein sequence databases. Protein Science 1995, 4(6):1145-1160.
  40. Hulo N, Bairoch A, Bulliard V, et al.: The PROSITE database. Nucleic Acids Research 2006, 34: D227-D230. 10.1093/nar/gkj063

Copyright

© John S. Conery. 2007

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.