
Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information


Motif discovery for the identification of functional regulatory elements underlying gene expression is a challenging problem. Sequence inspection often leads to the discovery of novel motifs (including transcription factor sites) with previously uncharacterized function in gene expression. Given the complexity underlying tissue-specific gene expression, several motifs may be putatively responsible for expression in a given cell type. This has important implications for understanding fundamental biological processes such as development and disease progression. In this work, we present an approach to the identification of motifs (not necessarily transcription factor sites) and examine its application to some questions in current bioinformatics research. These motifs are seen to discriminate tissue-specific gene promoter or regulatory regions from those that are not tissue-specific. This work makes two main contributions. First, we propose the use of directed information for such classification-constrained motif discovery, and then use the selected features with a support vector machine (SVM) classifier to determine the tissue specificity of any sequence of interest. This analysis yields several novel motifs that merit further experimental characterization. Second, the approach provides a principled framework for prospectively examining whether any chosen motif is discriminatory for a group of coexpressed/coregulated genes, thereby integrating sequence and expression perspectives. We hypothesize that the discovery of these motifs would enable large-scale investigation of the tissue-specific regulatory role of any conserved sequence element identified from genome-wide studies.
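As a rough illustration of the feature-ranking idea described above, the following sketch scores hexamers by how well their presence separates tissue-specific sequences from the rest. It is a minimal sketch under stated assumptions, not the authors' method: it substitutes plug-in mutual information for directed information, uses presence/absence rather than any count or position information, omits the SVM stage entirely, and runs on invented toy sequences. The function names (`kmer_presence`, `mutual_information`, `rank_kmers`) are hypothetical.

```python
from collections import Counter
from math import log2


def kmer_presence(seq, k=6):
    """Return the set of k-mers (hexamers by default) occurring in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) between two discrete lists.

    NOTE: stand-in for directed information, which the abstract proposes;
    mutual information is symmetric and is used here only for illustration.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def rank_kmers(sequences, labels, k=6):
    """Rank every observed k-mer by its dependence on the class label."""
    presence = [kmer_presence(s, k) for s in sequences]
    vocab = set().union(*presence)
    scores = {}
    for kmer in vocab:
        indicator = [int(kmer in p) for p in presence]
        scores[kmer] = mutual_information(indicator, labels)
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data (invented): label 1 = tissue-specific, label 0 = not.
seqs = ["AAGATAAGCC", "TTGATAAGGA", "CCCCCCCCCC", "GGGGTTTTGG"]
labels = [1, 1, 0, 0]
ranked = rank_kmers(seqs, labels)
print(ranked[:3])
```

In a fuller pipeline, the top-ranked hexamers would become the feature set passed to a classifier such as an SVM; that step is left out here to keep the sketch dependency-free.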

[1-42]


  1. MacIsaac KD, Fraenkel E: Practical strategies for discovering regulatory DNA sequence motifs. PLoS Computational Biology 2006, 2(4):e36. doi:10.1371/journal.pcbi.0020036

  2. Kreiman G: Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes. Nucleic Acids Research 2004, 32(9):2889-2900. doi:10.1093/nar/gkh614

  3. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 1997, 268(1):78-94. doi:10.1006/jmbi.1997.0951

  4. Li Q, Barkess G, Qian H: Chromatin looping and the probability of transcription. Trends in Genetics 2006, 22(4):197-202. doi:10.1016/j.tig.2006.02.004

  5. Kleinjan DA, van Heyningen V: Long-range control of gene expression: emerging mechanisms and disruption in disease. The American Journal of Human Genetics 2005, 76(1):8-32. doi:10.1086/426833

  6. Pennacchio LA, Loots GG, Nobrega MA, Ovcharenko I: Predicting tissue-specific enhancers in the human genome. Genome Research 2007, 17(2):201-211. doi:10.1101/gr.5972507

  7. King DC, Taylor J, Elnitski L, Chiaromonte F, Miller W, Hardison RC: Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome Research 2005, 15(8):1051-1060. doi:10.1101/gr.3642605

  8. Pennacchio LA, Ahituv N, Moses AM, et al.: In vivo enhancer analysis of human conserved non-coding sequences. Nature 2006, 444(7118):499-502. doi:10.1038/nature05295

  9. Kadota K, Ye J, Nakai Y, Terada T, Shimizu K: ROKU: a novel method for identification of tissue-specific genes. BMC Bioinformatics 2006, 7: 294. doi:10.1186/1471-2105-7-294

  10. Schug J, Schuller W-P, Kappen C, Salbaum JM, Bucan M, Stoeckert CJ Jr: Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biology 2005, 6(4):R33. doi:10.1186/gb-2005-6-4-r33

  11. Werner T: Regulatory networks: linking microarray data to systems biology. Mechanisms of Ageing and Development 2007, 128(1):168-172. doi:10.1016/j.mad.2006.11.022

  12. Aerts S, Van Loo P, Thijs G, et al.: TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysis. Nucleic Acids Research 2005, 33(Web Server):W393-W396. doi:10.1093/nar/gki354

  13. Chan BY, Kibler D: Using hexamers to predict cis-regulatory motifs in Drosophila. BMC Bioinformatics 2005, 6: 262. doi:10.1186/1471-2105-6-262

  14. Hutchinson GB: The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Computer Applications in the Biosciences 1996, 12(5):391-398.

  15. Sumazin P, Chen G, Hata N, Smith AD, Zhang T, Zhang MQ: DWE: discriminating word enumerator. Bioinformatics 2005, 21(1):31-38. doi:10.1093/bioinformatics/bth471

  16. Lakshmanan G, Lieuw KH, Lim K-C, et al.: Localization of distant urogenital system-, central nervous system-, and endocardium-specific transcriptional regulatory elements in the GATA-3 locus. Molecular and Cellular Biology 1999, 19(2):1558-1568.

  17. Khandekar M, Suzuki N, Lewton J, Yamamoto M, Engel JD: Multiple, distant Gata2 enhancers specify temporally and tissue-specific patterning in the developing urogenital system. Molecular and Cellular Biology 2004, 24(23):10263-10276. doi:10.1128/MCB.24.23.10263-10276.2004

  18. Peng H, Long F, Ding C: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 2005, 27(8):1226-1238.

  19. Proceedings of NIPS 2006 Workshop on Causality and Feature Selection.

  20. Guyon I, Elisseeff A: An introduction to variable and feature selection. The Journal of Machine Learning Research 2003, 3: 1157-1182.

  21. Marko H: The bidirectional communication theory—a generalization of information theory. IEEE Transactions on Communications 1973, COM-21(12):1345-1351.

  22. Massey J: Causality, feedback and directed information. Proceedings of the International Symposium on Information Theory and Its Applications (ISITA '90), Waikiki, Hawaii, USA, November 1990, 303-305.

  23. Venkataramanan R, Pradhan SS: Source coding with feed-forward: rate-distortion theorems and error exponents for a general source. IEEE Transactions on Information Theory 2007, 53(6):2154-2179.

  24. Cover TM, Thomas JA: Elements of Information Theory. John Wiley & Sons, New York, NY, USA; 1991.

  25. Miller EG: A new class of entropy estimators for multidimensional densities. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, April 2003, 3: 297-300.

  26. Willett RM, Nowak RD: Complexity-regularized multiresolution density estimation. Proceedings of the International Symposium on Information Theory (ISIT '04), Chicago, Ill, USA, June-July 2004, 303-305.

  27. Nemenman I, Shafee F, Bialek W: Entropy and inference, revisited. In Advances in Neural Information Processing Systems 14. Edited by: Dietterich TG, Becker S, Ghahramani Z. MIT Press, Cambridge, Mass, USA; 2002.

  28. Paninski L: Estimation of entropy and mutual information. Neural Computation 2003, 15(6):1191-1253. doi:10.1162/089976603321780272

  29. Joe H: Relative entropy measures of multivariate dependence. Journal of the American Statistical Association 1989, 84(405):157-164. doi:10.2307/2289859

  30. Efron B, Tibshirani RJ: An Introduction to the Bootstrap, Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, Fla, USA; 1994.

  31. Ramsay JO, Silverman BW: Functional Data Analysis, Springer Series in Statistics. Springer, New York, NY, USA; 1997.

  32. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B 1995, 57(1):289-300.

  33. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. Springer, New York, NY, USA; 2001.

  34. Kendall MG: A new measure of rank correlation. Biometrika 1938, 30(1/2):81-93. doi:10.2307/2332226

  35. NCBI PubMed URL

  36. Murphy AM, Thompson WR, Peng LF, Jones L II: Regulation of the rat cardiac troponin I gene by the transcription factor GATA-4. Biochemical Journal 1997, 322, part 2: 393-401.

  37. Azakie A, Fineman JR, He Y: Myocardial transcription factors are modulated during pathologic cardiac hypertrophy in vivo. The Journal of Thoracic and Cardiovascular Surgery 2006, 132(6):1262-1271.e4. doi:10.1016/j.jtcvs.2006.08.005

  38. Vanhoutte P, Nissen JL, Brugg B, et al.: Opposing roles of Elk-1 and its brain-specific isoform, short Elk-1, in nerve growth factor-induced PC12 differentiation. Journal of Biological Chemistry 2001, 276(7):5189-5196. doi:10.1074/jbc.M006678200

  39. Olson EN: Regulation of muscle transcription by the MyoD family: the heart of the matter. Circulation Research 1993, 72(1):1-6.

  40. Dressler GR, Douglass EC: Pax-2 is a DNA-binding protein expressed in embryonic kidney and Wilms tumor. Proceedings of the National Academy of Sciences of the United States of America 1992, 89(4):1179-1183. doi:10.1073/pnas.89.4.1179

  41. Grote D, Souabni A, Busslinger M, Bouchard M: Pax2/8-regulated Gata3 expression is necessary for morphogenesis and guidance of the nephric duct in the developing kidney. Development 2006, 133(1):53-61. doi:10.1242/dev.02184

  42. Rao A, Hero AO, States DJ, Engel JD: Inference of biologically relevant gene influence networks using the directed information criterion. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, May 2006, 2: 1028-1031.


Author information



Corresponding author

Correspondence to Arvind Rao.

Rights and permissions

Open Access: This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Rao, A., Hero, A.O., States, D.J. et al. Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information. J Bioinform Sys Biology 2007, 13853 (2007).



  • Support Vector Machine
  • Motif Discovery
  • Include Transcription Factor
  • Fundamental Biological Process
  • Interesting Motif