  • Research Article
  • Open access

Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates

Abstract

Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, they are used for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.
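As a rough illustration of the idea summarized above, the following Python sketch computes a plug-in (empirical) mutual information estimate between two aligned segments and compares it with a significance threshold. This is a minimal sketch under stated assumptions, not the method of the paper: the function names, the position-by-position pairing of the two segments, and the use of a permutation (shuffling) quantile as the threshold are all illustrative choices that do not come from the abstract.

```python
import math
import random
from collections import Counter


def empirical_mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from position-wise paired symbols."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of symbol pairs (x, y)
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def permutation_threshold(xs, ys, trials=1000, level=0.99, seed=0):
    """Hypothetical threshold: empirical quantile of the estimate after
    shuffling ys, which destroys any position-wise dependence on xs."""
    rng = random.Random(seed)
    shuffled = list(ys)
    null = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        null.append(empirical_mutual_information(xs, shuffled))
    null.sort()
    return null[min(trials - 1, int(level * trials))]


# Toy segments (made up for illustration): seg_b is a deterministic
# function of seg_a, so the two are strongly dependent.
seg_a = "ACGT" * 25
seg_b = "TGCA" * 25
mi = empirical_mutual_information(seg_a, seg_b)
threshold = permutation_threshold(seg_a, seg_b)
significant = mi > threshold
```

In this toy case the estimate is far above the shuffled baseline, so the pair would be flagged as dependent. A real analysis would need a threshold justified for the sample size and alphabet at hand.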

[1-22]

References

  1. Steuer R, Kurths J, Daub CO, Weise J, Selbig J: The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 2002, 18(supplement 2):S231-S240.

  2. Dawy Z, Goebel B, Hagenauer J, Andreoli C, Meitinger T, Mueller JC: Gene mapping and marker clustering using Shannon's mutual information. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2006, 3(1):47-56. 10.1109/TCBB.2006.9

  3. Segal E, Fondufe-Mittendorf Y, Chen L, et al.: A genomic code for nucleosome positioning. Nature 2006, 442(7104):772-778. 10.1038/nature04979

  4. Osada Y, Saito R, Tomita M: Comparative analysis of base correlations in untranslated regions of various species. Gene 2006, 375(1-2):80-86.

  5. Kozak M: Initiation of translation in prokaryotes and eukaryotes. Gene 1999, 234(2):187-208. 10.1016/S0378-1119(99)00210-3

  6. Reddy DA, Mitra CK: Comparative analysis of transcription start sites using mutual information. Genomics, Proteomics and Bioinformatics 2006, 4(3):189-195. 10.1016/S1672-0229(06)60032-6

  7. Reddy DA, Prasad BVLS, Mitra CK: Comparative analysis of core promoter region: information content from mono and dinucleotide substitution matrices. Computational Biology and Chemistry 2006, 30(1):58-62. 10.1016/j.compbiolchem.2005.10.004

  8. Shabalina SA, Ogurtsov AY, Rogozin IB, Koonin EV, Lipman DJ: Comparative analysis of orthologous eukaryotic mRNAs: potential hidden functional signals. Nucleic Acids Research 2004, 32(5):1774-1782. 10.1093/nar/gkh313

  9. Baldi P, Brunak S, Frasconi P, Soda G, Pollastri G: Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 1999, 15(11):937-946. 10.1093/bioinformatics/15.11.937

  10. Battail G: Should genetics get an information-theoretic education? Genomes as error-correcting codes. IEEE Engineering in Medicine and Biology Magazine 2006, 25(1):34-45.

  11. Gao H, Gordon-Kamm WJ, Lyznik LA: ASF/SF2-like maize pre-mRNA splicing factors affect splice site utilization and their transcripts are alternatively spliced. Gene 2004, 339(1-2):25-37.

  12. Cover TM, Thomas JA: Elements of Information Theory. John Wiley & Sons, New York, NY, USA; 1991.

  13. Good PI: Resampling Methods. Birkhäuser, Boston, Mass, USA; 2005.

  14. Manly B: Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall/CRC, Boca Raton, Fla, USA; 1977.

  15. Lehmann EL, Romano JP: Testing Statistical Hypotheses. 3rd edition. Springer, New York, NY, USA; 2005.

  16. Schervish MJ: Theory of Statistics. Springer, New York, NY, USA; 1995.

  17. Hagenauer J, Dawy Z, Göbel B, Hanus P, Mueller J: Genomic analysis using methods from information theory. Proceedings of IEEE Information Theory Workshop (ITW '04), San Antonio, Tex, USA, October 2004 55-59.

  18. Goebel B, Dawy Z, Hagenauer J, Mueller JC: An approximation to the distribution of finite sample size mutual information estimates. Proceedings of IEEE International Conference on Communications (ICC '05), Seoul, Korea, May 2005 2: 1102-1106.

  19. Hutter M: Distribution of mutual information. In Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, Mass, USA; 2002:399-406.

  20. Hughes TA: Regulation of gene expression by alternative untranslated regions. Trends in Genetics 2006, 22(3):119-122. 10.1016/j.tig.2006.01.001

  21. Åberg J, Shtarkov YuM, Smeets BJM: Multialphabet coding with separate alphabet description. Proceedings of the International Conference on Compression and Complexity of Sequences, Positano, Italy, June 1997 56-65.

  22. Orlitsky A, Santhanam NP, Viswanathan K, Zhang J: Limit results on pattern entropy. IEEE Transactions on Information Theory 2006, 52(7):2954-2964.

Author information

Corresponding author

Correspondence to Hasan Metin Aktulga.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Aktulga, H.M., Kontoyiannis, I., Lyznik, L.A. et al. Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates. J Bioinform Sys Biology 2007, 14741 (2007). https://doi.org/10.1155/2007/14741

  • DOI: https://doi.org/10.1155/2007/14741
