Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates

Abstract

Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, they are used for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.
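
As a rough illustration of the kind of estimate the abstract refers to, the sketch below computes the standard plug-in (empirical) mutual information between two aligned symbol sequences and derives a significance threshold from shuffled copies. This is not the authors' procedure: the function names, the permutation-based threshold, and the default parameters (`n_shuffles`, `quantile`, `seed`) are assumptions made for illustration, and the threshold function defined in the paper may differ.

```python
import math
import random
from collections import Counter


def empirical_mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits from two aligned sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))  # counts of symbol pairs (a, b)
    px = Counter(x)             # marginal counts for x
    py = Counter(y)             # marginal counts for y
    mi = 0.0
    for (a, b), c in joint.items():
        # (c/n) * log2( (c/n) / ((px/n) * (py/n)) ), simplified
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


def permutation_threshold(x, y, n_shuffles=200, quantile=0.95, seed=0):
    """Hypothetical threshold: a quantile of the estimate under shuffling.

    Shuffling y destroys any positional dependence on x while keeping
    its symbol frequencies, giving an empirical null distribution.
    """
    rng = random.Random(seed)
    y_shuffled = list(y)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(y_shuffled)
        null.append(empirical_mutual_information(x, y_shuffled))
    null.sort()
    index = min(len(null) - 1, int(quantile * len(null)))
    return null[index]


if __name__ == "__main__":
    seg_a = "ACGT" * 25
    seg_b = "ACGT" * 25  # identical segment: maximal dependence
    estimate = empirical_mutual_information(seg_a, seg_b)
    threshold = permutation_threshold(seg_a, seg_b)
    print(estimate, threshold, estimate > threshold)
```

A pair of segments would be flagged as dependent when the estimate exceeds the threshold. The permutation approach is chosen here only because it needs no distributional assumptions; an analytic threshold based on the asymptotic distribution of the estimator is an alternative.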

[1–22]

References

  1. Steuer R, Kurths J, Daub CO, Weise J, Selbig J: The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 2002, 18(supplement 2):S231-S240.
  2. Dawy Z, Goebel B, Hagenauer J, Andreoli C, Meitinger T, Mueller JC: Gene mapping and marker clustering using Shannon's mutual information. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2006, 3(1):47-56. 10.1109/TCBB.2006.9
  3. Segal E, Fondufe-Mittendorf Y, Chen L, et al.: A genomic code for nucleosome positioning. Nature 2006, 442(7104):772-778. 10.1038/nature04979
  4. Osada Y, Saito R, Tomita M: Comparative analysis of base correlations in untranslated regions of various species. Gene 2006, 375(1-2):80-86.
  5. Kozak M: Initiation of translation in prokaryotes and eukaryotes. Gene 1999, 234(2):187-208. 10.1016/S0378-1119(99)00210-3
  6. Reddy DA, Mitra CK: Comparative analysis of transcription start sites using mutual information. Genomics, Proteomics and Bioinformatics 2006, 4(3):189-195. 10.1016/S1672-0229(06)60032-6
  7. Reddy DA, Prasad BVLS, Mitra CK: Comparative analysis of core promoter region: information content from mono and dinucleotide substitution matrices. Computational Biology and Chemistry 2006, 30(1):58-62. 10.1016/j.compbiolchem.2005.10.004
  8. Shabalina SA, Ogurtsov AY, Rogozin IB, Koonin EV, Lipman DJ: Comparative analysis of orthologous eukaryotic mRNAs: potential hidden functional signals. Nucleic Acids Research 2004, 32(5):1774-1782. 10.1093/nar/gkh313
  9. Baldi P, Brunak S, Frasconi P, Soda G, Pollastri G: Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 1999, 15(11):937-946. 10.1093/bioinformatics/15.11.937
  10. Battail G: Should genetics get an information-theoretic education? Genomes as error-correcting codes. IEEE Engineering in Medicine and Biology Magazine 2006, 25(1):34-45.
  11. Gao H, Gordon-Kamm WJ, Lyznik LA: ASF/SF2-like maize pre-mRNA splicing factors affect splice site utilization and their transcripts are alternatively spliced. Gene 2004, 339(1-2):25-37.
  12. Cover TM, Thomas JA: Elements of Information Theory. John Wiley & Sons, New York, NY, USA; 1991.
  13. Good PI: Resampling Methods. Birkhäuser, Boston, Mass, USA; 2005.
  14. Manly B: Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall/CRC, Boca Raton, Fla, USA; 1977.
  15. Lehmann EL, Romano JP: Testing Statistical Hypotheses. 3rd edition. Springer, New York, NY, USA; 2005.
  16. Schervish MJ: Theory of Statistics. Springer, New York, NY, USA; 1995.
  17. Hagenauer J, Dawy Z, Göbel B, Hanus P, Mueller J: Genomic analysis using methods from information theory. Proceedings of IEEE Information Theory Workshop (ITW '04), San Antonio, Tex, USA, October 2004 55-59.
  18. Goebel B, Dawy Z, Hagenauer J, Mueller JC: An approximation to the distribution of finite sample size mutual information estimates. Proceedings of IEEE International Conference on Communications (ICC '05), Seoul, Korea, May 2005 2: 1102-1106.
  19. Hutter M: Distribution of mutual information. In Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, Mass, USA; 2002:399-406.
  20. Hughes TA: Regulation of gene expression by alternative untranslated regions. Trends in Genetics 2006, 22(3):119-122. 10.1016/j.tig.2006.01.001
  21. Åberg J, Shtarkov YuM, Smeets BJM: Multialphabet coding with separate alphabet description. Proceedings of the International Conference on Compression and Complexity of Sequences, Positano, Italy, June 1997 56-65.
  22. Orlitsky A, Santhanam NP, Viswanathan K, Zhang J: Limit results on pattern entropy. IEEE Transactions on Information Theory 2006, 52(7):2954-2964.

Author information

Correspondence to Hasan Metin Aktulga.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Aktulga, H.M., Kontoyiannis, I., Lyznik, L.A. et al. Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates. J Bioinform Sys Biology 2007, 14741 (2007) doi:10.1155/2007/14741

Keywords

  • Genomic Sequence
  • Mutual Information
  • System Biology
  • Statistical Dependence
  • Information Estimate