- Research Article
- Open Access
Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information
EURASIP Journal on Bioinformatics and Systems Biology volume 2007, Article number: 13853 (2007)
Motif discovery for the identification of functional regulatory elements underlying gene expression is a challenging problem. Sequence inspection often leads to the discovery of novel motifs (including transcription factor sites) with previously uncharacterized function in gene expression. Given the complexity of tissue-specific gene expression, several motifs may be putatively responsible for expression in a given cell type. This has important implications for understanding fundamental biological processes such as development and disease progression. In this work, we present an approach to the identification of motifs (not necessarily transcription factor sites) and examine its application to some questions in current bioinformatics research. These motifs are seen to discriminate tissue-specific gene promoter or regulatory regions from those that are not tissue-specific. There are two main contributions of this work. First, we propose the use of directed information for such classification-constrained motif discovery, and then use the selected features with a support vector machine (SVM) classifier to predict the tissue specificity of any sequence of interest. Such analysis yields several novel and interesting motifs that merit further experimental characterization. Second, this approach leads to a principled framework for prospectively examining whether any chosen motif is a discriminatory motif for a group of coexpressed/coregulated genes, thereby integrating sequence and expression perspectives. We hypothesize that the discovery of these motifs would enable large-scale investigation of the tissue-specific regulatory role of any conserved sequence element identified from genome-wide studies.
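For orientation, directed information between two length-N sequences is commonly written, following Massey's formulation, as shown below. This is the standard definition, given only as a sketch: the symbols X, Y, and N are introduced here for illustration, and the expression is not a statement of the exact estimator or feature construction used in this work.

```latex
% Standard (Massey) definition of directed information from X^N to Y^N.
% X^n denotes (X_1, ..., X_n); Y^{n-1} denotes (Y_1, ..., Y_{n-1}).
I(X^N \rightarrow Y^N) = \sum_{n=1}^{N} I\left(X^n ; Y_n \mid Y^{n-1}\right)
```

Unlike mutual information, this quantity is in general not symmetric in X and Y, which is what lets it express a direction of dependence between a candidate feature and a class label.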
About this article
Cite this article
Rao, A., Hero, A.O., States, D.J. et al. Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information. J Bioinform Sys Biology 2007, 13853 (2007) doi:10.1155/2007/13853
- Support Vector Machine
- Motif Discovery
- Include Transcription Factor
- Fundamental Biological Process
- Interesting Motif