NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks
EURASIP Journal on Bioinformatics and Systems Biology volume 2007, Article number: 90947 (2008)
Typical problems in bioinformatics involve large discrete datasets. Therefore, in order to apply statistical methods in such domains, it is important to develop efficient algorithms suitable for discrete data. The minimum description length (MDL) principle is a theoretically well-founded, general framework for performing statistical inference. The mathematical formalization of MDL is based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. In the case of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size, since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending these algorithms to more complex, tree-structured Bayesian networks.
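To make the cost argument concrete, the following is a minimal sketch for the simplest case mentioned above, a single multinomial variable with K values and sample size n. It is not the tree-structured algorithm of this paper. The brute-force function sums the maximized likelihood over all K^n samples, as the definition requires. The second function uses the recurrence C(K+2, n) = C(K+1, n) + (n/K) C(K, n) reported by Kontkanen and Myllymäki (Information Processing Letters, 2007) for the multinomial normalizing sum. The function names are illustrative assumptions, and n >= 1 is assumed throughout.

```python
from itertools import product
from math import comb


def nml_normalizer_brute_force(K: int, n: int) -> float:
    """Sum the maximized likelihood over all K**n samples (exponential in n)."""
    total = 0.0
    for seq in product(range(K), repeat=n):
        likelihood = 1.0
        for value in range(K):
            h = seq.count(value)
            # Maximum-likelihood parameter for this value is h / n.
            # Python evaluates 0.0 ** 0 as 1.0, which is the convention needed here.
            likelihood *= (h / n) ** h
        total += likelihood
    return total


def nml_normalizer_recurrence(K: int, n: int) -> float:
    """Same sum via the recurrence C(k+2, n) = C(k+1, n) + (n / k) * C(k, n)."""
    prev = 1.0  # C(1, n): a single-valued variable has one sample of likelihood 1
    if K == 1:
        return prev
    # C(2, n): group the 2**n binary samples by the count h of the first value.
    curr = sum(
        comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
        for h in range(n + 1)
    )
    for k in range(1, K - 1):
        prev, curr = curr, curr + (n / k) * prev
    return curr


if __name__ == "__main__":
    K, n = 3, 4
    print(nml_normalizer_brute_force(K, n))
    print(nml_normalizer_recurrence(K, n))
```

The brute-force version is practical only for very small n, since the number of samples grows as K^n, whereas the recurrence needs O(n + K) arithmetic operations once the binary case has been summed.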
Cite this article
Kontkanen, P., Wettig, H. & Myllymäki, P. NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks. J Bioinform Sys Biology 2007, 90947 (2008). https://doi.org/10.1155/2007/90947
- Statistical Method
- Bayesian Network
- System Biology
- General Framework
- Mathematical Formalization