Aligning Sequences by Minimum Description Length

This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from . A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.
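The code length idea in the abstract can be illustrated with a minimal sketch. It assumes a toy three-letter alphabet and a made-up table of joint probabilities standing in for a real substitution matrix; the names `JOINT`, `cost_aligned`, and `cost_unaligned` are hypothetical and are not taken from the paper. The sketch also leaves out the bits needed to describe the structure of the expression itself, which a full description length would have to include.

```python
import math

# Hypothetical joint probabilities of two letters appearing in the same
# aligned column, over a toy alphabet {A, B, C}. The values are made up for
# illustration only; the table is symmetric and sums to 1.
JOINT = {
    ("A", "A"): 0.30, ("A", "B"): 0.05, ("A", "C"): 0.02,
    ("B", "A"): 0.05, ("B", "B"): 0.25, ("B", "C"): 0.03,
    ("C", "A"): 0.02, ("C", "B"): 0.03, ("C", "C"): 0.25,
}
ALPHABET = "ABC"


def marginal(a):
    """Background probability of letter a, summed out of the joint table."""
    return sum(JOINT[(a, b)] for b in ALPHABET)


def independent_bits(s):
    """Bits to encode s one letter at a time using background probabilities."""
    return sum(-math.log2(marginal(x)) for x in s)


def conditional_bits(s, t):
    """Bits to encode t when the letter of s in the same column is known."""
    if len(s) != len(t):
        raise ValueError("this sketch only handles ungapped, equal-length substrings")
    return sum(-math.log2(JOINT[(a, b)] / marginal(a)) for a, b in zip(s, t))


def cost_unaligned(s, t):
    """Description length when the two substrings are sent separately."""
    return independent_bits(s) + independent_bits(t)


def cost_aligned(s, t):
    """Description length when t is encoded conditionally on s."""
    return independent_bits(s) + conditional_bits(s, t)


def prefer_alignment(s, t):
    """Minimum description length choice: align only if it saves bits."""
    return cost_aligned(s, t) < cost_unaligned(s, t)
```

With this toy table, `prefer_alignment("AABC", "AABC")` is true because every conditional probability exceeds the corresponding background probability, while `prefer_alignment("AAAA", "CCCC")` is false because encoding `C` given `A` costs more bits than encoding `C` on its own.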




Author information

Correspondence to John S Conery.

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Conery, J. Aligning Sequences by Minimum Description Length. J Bioinform Sys Biology 2007, 72936 (2008).



  • System Biology
  • Minimum Description Length