NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks

Kontkanen, Petri; Wettig, Hannes; Myllymäki, Petri

doi:10.1155/2007/90947

Published: 20 January 2008

EURASIP Journal on Bioinformatics and Systems Biology, volume 2007, Article number: 90947 (2008)

Abstract

Typical problems in bioinformatics involve large discrete datasets. Therefore, in order to apply statistical methods in such domains, it is important to develop efficient algorithms suitable for discrete data. The minimum description length (MDL) principle is a theoretically well-founded, general framework for performing statistical inference. The mathematical formalization of MDL is based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. In the case of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size, since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending these algorithms to more complex, tree-structured Bayesian networks.
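
To make the cost of the straightforward computation concrete, the following is a minimal Python sketch for the single multinomial variable case. It is an illustration under stated assumptions, not the authors' implementation. The function names are hypothetical. The brute-force routine sums the maximized likelihood over every possible sample of size n, which is the exponential sum the abstract refers to. The second routine uses a recurrence that is not stated in this abstract and is assumed here from the literature on multinomial stochastic complexity: C(K+2, n) = C(K+1, n) + (n / K) C(K, n), where C(K, n) denotes the normalizing sum for K values and sample size n.

```python
from collections import Counter
from itertools import product
from math import comb


def max_likelihood(seq):
    """Maximized multinomial likelihood of one sample: prod_k (h_k / n) ** h_k."""
    n = len(seq)
    value = 1.0
    for count in Counter(seq).values():
        value *= (count / n) ** count
    return value


def normalizer_brute_force(num_values, n):
    """NML normalizing sum by enumerating all num_values ** n samples."""
    samples = product(range(num_values), repeat=n)
    return sum(max_likelihood(seq) for seq in samples)


def normalizer_recurrence(num_values, n):
    """Same sum via the assumed recurrence C(K+2, n) = C(K+1, n) + (n / K) * C(K, n)."""
    if n == 0 or num_values == 1:
        return 1.0
    prev = 1.0  # C(1, n): a single value leaves only one possible sample
    # C(2, n): group the binary samples by the count h of the first value.
    curr = sum(
        comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
        for h in range(n + 1)
    )
    # Step from (C(k, n), C(k+1, n)) to (C(k+1, n), C(k+2, n)).
    for k in range(1, num_values - 1):
        prev, curr = curr, curr + (n / k) * prev
    return curr


def nml_probability(seq, num_values):
    """NML probability of a sample: maximized likelihood divided by the normalizer."""
    return max_likelihood(seq) / normalizer_recurrence(num_values, len(seq))
```

For example, `normalizer_brute_force(3, 6)` enumerates 729 samples, while `normalizer_recurrence(3, 6)` needs one pass over 7 terms plus a single recurrence step; both should agree up to floating-point rounding. The brute-force version is only usable for very small n and serves as a check on the faster one.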

Author information

Authors and Affiliations

Complex Systems Computation Group (CoSCo), Helsinki Institute for Information Technology (HIIT), (Department of Computer Science), FIN-00014 University of Helsinki, P.O.Box 68, Finland
Petri Kontkanen, Hannes Wettig & Petri Myllymäki

Corresponding author

Correspondence to Petri Kontkanen.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Kontkanen, P., Wettig, H. & Myllymäki, P. NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks. J Bioinform Sys Biology 2007, 90947 (2008). https://doi.org/10.1155/2007/90947

Received: 01 March 2007
Accepted: 30 July 2007