Identification of thresholds for dichotomizing DNA methylation data
© Liu et al.; licensee Springer. 2013
Received: 1 March 2013
Accepted: 23 May 2013
Published: 6 June 2013
DNA methylation plays an important role in many biological processes by regulating gene expression. It is commonly accepted that DNA methylation leads to silencing of the expression of the corresponding genes. While methylation is often described as a binary on-off signal, it is typically measured using beta values derived from either microarray or sequencing technologies, which take continuous values between 0 and 1. To interpret methylation in a binary fashion, appropriate thresholds are needed to dichotomize the continuous measurements. In this paper, we use data from The Cancer Genome Atlas project. For a total of 992 samples across five cancer types, both methylation and gene expression data are available. A bivariate extension of the StepMiner algorithm is used to identify thresholds for dichotomizing both methylation and expression data. A hypergeometric test is applied to identify CpG sites whose methylation status is significantly associated with silencing of the expression of their corresponding genes. The test is performed either on all five cancer types together or on individual cancer types separately. We find that the appropriate thresholds vary across CpG sites. In addition, the negative association between methylation and expression is highly tissue specific.
DNA methylation plays an important role in cancer through hypermethylation, which turns off tumor suppressors, and hypomethylation, which activates oncogenes [1, 2]. It is widely accepted that DNA methylation is associated with silencing of gene expression. With data from high-throughput array and sequencing technologies, several studies have analyzed the relationship between methylation and gene expression [4–6].
When the relationship between methylation and gene expression is discussed, both are often described as binary signals (i.e., on-off, high-low). Consider, for example, a gene whose expression can be controlled by the methylation of a CpG site in its promoter region: if the CpG site is methylated, the gene’s expression is typically low; if the CpG site is unmethylated, the expression of the gene can be either high or low, depending on other controlling mechanisms. On the other hand, measurements of methylation and expression obtained using microarrays and sequencing technologies are continuous. To interpret the relationship between methylation and gene expression data in this binary language, appropriate thresholds are needed to dichotomize the measurements.
To jointly analyze methylation and gene expression, an ideal dataset would be a large collection of samples for which both data types are available. The Cancer Genome Atlas (TCGA) project provides such data for a large number of cancer samples [8–11]. Moreover, the TCGA samples are derived from multiple cancer and tissue types. The diversity among the samples may enable us to see relationships that cannot be observed in individual tissue types.
In this paper, we downloaded DNA methylation and gene expression data from TCGA. Data for a total of 992 samples were available, covering five cancer types. We extended the StepMiner algorithm to identify thresholds to dichotomize methylation and expression measurements. A hypergeometric test was used to identify CpG sites whose methylation is significantly associated with silencing of the expression of their corresponding genes. We observed that appropriate thresholds are highly CpG site specific, and that the methylation-expression association for many genes is tissue-type specific.
Materials and methods
Methylation and expression data from TCGA
The TCGA data portal (https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp) provides three ways of accessing the data. Two of them, ‘data matrix’ and ‘bulk download,’ require investigators to manually select a subset of the data; the portal then automatically collects the relevant data files into a compressed .tar file for download. After that, additional effort is needed to parse and assemble the downloaded files into formats useful to programming environments such as Matlab or R. Since the TCGA data keep growing and the manual selection can be tedious when multiple data types and disease types are considered, it is difficult to keep track of the manual selections and guarantee reproducibility. Therefore, we chose the third way, ‘open-access http directory,’ which contains links for all individual data files in TCGA (http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/). We created Matlab scripts to programmatically fetch methylation and RNA-seq data files for each individual disease type, automatically parse them, and organize them into tab-delimited spreadsheets for subsequent analysis. Our scripts for automatically downloading TCGA data are available at http://odin.mdacc.tmc.edu/~pqiu/software/DownloadTCGA/.
Genome-wide methylation measurements were generated using the Illumina Infinium Human DNA Methylation27 array platform (Illumina, Inc., San Diego, CA, USA), which interrogates the methylation status of 27,578 CpG sites in proximal promoter regions of 14,475 genes in the human genome. As of 12 February 2013, methylation data for 2,796 samples across 12 cancer types were available. We downloaded the TCGA level 3 preprocessed data, which are the ratio $M_i/(U_i+M_i)$ for each CpG site $i$, where $M_i$ represents the signal intensity of the methylated probe for CpG site $i$ and $U_i$ is the signal intensity of the unmethylated probe. Therefore, the numerical range of the data is between 0 and 1. Zero (0) indicates unmethylated, whereas 1 indicates completely methylated. The data contain a small fraction of empty entries, because the corresponding probes either overlap with known single-nucleotide polymorphisms or other genomic variations, or their signal intensities are lower than the background.
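The beta value defined above can be illustrated with a minimal sketch. The function name `beta_value` and the use of `None` for an empty entry are assumptions made for this illustration and are not part of the TCGA level 3 format or of the authors' scripts.

```python
def beta_value(methylated, unmethylated):
    """Return M / (U + M) for one CpG site in one sample.

    `None` stands in for an empty entry (an assumed encoding); it is
    returned when either intensity is missing or the ratio is undefined.
    """
    if methylated is None or unmethylated is None:
        return None
    total = methylated + unmethylated
    if total <= 0:
        return None  # no usable signal, so the ratio is undefined
    return methylated / total
```

A site with methylated intensity 300 and unmethylated intensity 100 would, under this definition, receive a beta value of 0.75.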
TCGA uses several platforms to quantify gene expression, among which the Illumina GA II and HiSeq platforms profiled the largest number of samples. As of 12 February 2013, preprocessed RNA-seq data for 4,108 samples across 11 cancer types were available. The preprocessed data are the RPKM values for 20,532 genes in each sample. Roughly, the numerical range of the data is between 0 and $10^5$. For each gene, we replaced the zero entries with the minimal non-zero value of this gene across all samples and transformed the data to log scale.
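The zero-replacement and log transform for a single gene can be sketched as follows. The text does not state the base of the logarithm; base 2 is assumed here, and the function name `log_transform_gene` is hypothetical.

```python
import math


def log_transform_gene(rpkm_values):
    """Log-transform one gene's RPKM values across samples.

    Zero entries are first replaced by the smallest non-zero value of
    the same gene, as described in the text. Base 2 is an assumption.
    """
    nonzero = [v for v in rpkm_values if v > 0]
    if not nonzero:
        raise ValueError("gene has no non-zero RPKM value")
    floor = min(nonzero)
    return [math.log2(v if v > 0 else floor) for v in rpkm_values]
```

Replacing zeros per gene rather than with a global constant keeps the substituted value within that gene's own dynamic range.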
The total number of overlapping samples between the above methylation and expression data was 992. The overlap covered five different cancer types: breast cancer (BRCA, 313 samples), colon and rectal cancer (COAD/READ, 227 samples), kidney renal clear cell carcinoma (KIRC, 208 samples), squamous cell lung cancer (LUSC, 129 samples), and uterine corpus endometrioid carcinoma (UCEC, 115 samples). Our analysis was performed based on these 992 overlapping samples.
Extend StepMiner for dichotomizing methylation and expression data
where i and j both range from 1 to n.
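For orientation, the one-dimensional idea of fitting a step function to sorted values can be sketched as below. This is not the bivariate StepMiner2D procedure of this paper, and it is not claimed to reproduce the original StepMiner implementation. It is a minimal least-squares step fit, and placing the threshold halfway between the two values adjacent to the step is an assumption of this sketch.

```python
def step_threshold(values):
    """Fit a single step to the sorted values and return a threshold.

    Every split of the sorted data into a low and a high group is tried,
    and the split with the smallest total squared error around the two
    group means is kept.
    """
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")

    # Prefix sums let each candidate split be scored in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    best_sse, best_split = None, None
    for k in range(1, n):  # low group = xs[:k], high group = xs[k:]
        low_sum, low_sq = prefix[k], prefix_sq[k]
        high_sum = prefix[n] - prefix[k]
        high_sq = prefix_sq[n] - prefix_sq[k]
        sse = (low_sq - low_sum ** 2 / k) + (high_sq - high_sum ** 2 / (n - k))
        if best_sse is None or sse < best_sse:
            best_sse, best_split = sse, k

    # Assumed convention: threshold midway between the two values at the step.
    return (xs[best_split - 1] + xs[best_split]) / 2.0
```

Applied to beta values that cluster near 0 and near 1, such a fit places the threshold in the gap between the two clusters.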
Hypergeometric test for methylation controlled genes
The optimal SNR value in StepMiner2D measures the multi-modality of the joint distribution of X and Y, rather than the association between the two variables. For example, if X and Y independently follow two bi-modal distributions, the optimal SNR can be large even though there is no association between the two variables. Thus, SNR is not suitable for evaluating the association between methylation and expression. Here, we are interested in one particular kind of association: whether methylation of a CpG site leads to down-regulation of the expression of its corresponding gene. After dichotomizing the methylation and expression data, the sufficient statistics become the counts of points in the four quadrants in Figure 2a. The significance of a methylation-controlled gene can be intuitively explained as whether the observed count in the upper-right quadrant is significantly less than expected. Popular statistical tests for 2×2 contingency tables, such as the Fisher exact and chi-square tests, are designed to evaluate whether counts are significantly unbalanced, but not toward a specific direction. We therefore choose to use the hypergeometric test. Let N denote the total number of samples; R the total number of methylated samples (the sum of points in the upper-right and lower-right quadrants); and U the total number of samples with high gene expression (the total number of points in the two upper quadrants). Conditional on N, R, and U, if methylation and expression are independent, the number of samples in the upper-right quadrant, k, follows a hypergeometric distribution, $P(k) = \binom{R}{k}\binom{N-R}{U-k} / \binom{N}{U}$. To evaluate the significance of the observed count in the upper-right quadrant, K, we compute the probability of observing K or fewer points under the assumption that methylation and expression are independent, that is, the p value $p = \sum_{k=0}^{K} \binom{R}{k}\binom{N-R}{U-k} / \binom{N}{U}$. This is a hypergeometric test specifically for evaluating the significance of whether methylation turns off gene expression.
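The lower-tail hypergeometric p value described above can be computed directly from the four symbols N, R, U, and K. The following is a minimal sketch using exact integer arithmetic from the Python standard library (`math.comb`, Python 3.8 or later); the function name `hypergeom_lower_tail` is hypothetical and this is not the authors' code.

```python
from math import comb


def hypergeom_lower_tail(N, R, U, K):
    """Probability of K or fewer samples in the upper-right quadrant.

    N: total samples; R: methylated samples; U: samples with high
    expression; K: observed samples that are both methylated and high.
    Assumes independence of methylation and expression given N, R, U.
    """
    if not (0 <= R <= N and 0 <= U <= N):
        raise ValueError("R and U must lie between 0 and N")
    k_min = max(0, R + U - N)  # smallest count the margins allow
    k_max = min(R, U)          # largest count the margins allow
    upper = min(K, k_max)
    tail = sum(comb(R, k) * comb(N - R, U - k)
               for k in range(k_min, upper + 1))
    return tail / comb(N, U)


# Made-up counts for illustration only, not taken from the paper.
p_value = hypergeom_lower_tail(N=992, R=300, U=500, K=100)
```

A small p value indicates that methylated samples with high expression are rarer than independence would predict, which is the one-sided direction of interest here.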
We preprocessed the TCGA data by filtering out CpG sites with small variance or many missing data points and by matching methylation and expression data according to genes. The methylation data we downloaded from TCGA were generated by the Methylation27 array platform, which provided the methylation status of 27,578 CpG sites in 14,475 genes across 992 cancer samples. We excluded CpG sites whose annotated genes are not present in the expression data. We also excluded CpG sites with more than 1% missing data and those whose methylation beta value is smaller than 0.01 in more than 95% of the samples. After applying these filtering criteria, we obtained a total of 11,189 CpG sites annotated to 7,344 unique genes. For approximately half of the genes, only one CpG site is measured per gene; data for two CpG sites are available for the majority of the other half; for a very small number of genes, measurements of more than two CpG sites are available. In the subsequent subsections, for the methylation data of each of the 11,189 CpG sites, we extracted the expression data of its corresponding gene and focused our bivariate analysis on features paired according to genes. Preprocessed data and the code for our analysis are available at http://odin.mdacc.tmc.edu/~pqiu/projects/MethExpr/.
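The three filtering criteria can be expressed as a single predicate per CpG site. In this sketch, missing entries are encoded as `None` and both fractions are taken over all samples; the text does not specify either detail, so both are assumptions, as is the function name `keep_cpg_site`.

```python
def keep_cpg_site(betas, gene, expression_genes,
                  max_missing_frac=0.01, low_beta=0.01, max_low_frac=0.95):
    """Return True if a CpG site passes the filters described in the text.

    betas: beta values of one CpG site across samples (None = missing).
    gene: gene annotated to the CpG site.
    expression_genes: set of genes present in the expression data.
    """
    if gene not in expression_genes:
        return False
    n = len(betas)
    if n == 0:
        return False
    missing = sum(1 for b in betas if b is None)
    if missing / n > max_missing_frac:
        return False  # more than 1% missing data
    low = sum(1 for b in betas if b is not None and b < low_beta)
    if low / n > max_low_frac:
        return False  # essentially unmethylated in nearly every sample
    return True
```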
Identification of methylation on-off threshold
Tissue-specific association between methylation and expression
We performed an integrative analysis of methylation and gene expression data of five cancer types in TCGA. First, we pooled samples from all five cancer types together and applied StepMiner2D to identify thresholds for dichotomizing the methylation and expression data. In such a pan-cancer analysis strategy, the diversity and variation among samples allow us to observe positive and negative signals in a sufficient number of samples and empower the method to identify the appropriate thresholds. Then, we applied the hypergeometric test to identify CpG sites whose methylation is significantly associated with silencing of the expression of their corresponding genes, either using all five cancer types together or using individual cancer types separately. When all five cancer types were examined together, 2,976 CpG sites showed a significant negative association with gene expression. However, when samples in different cancer types were considered separately, a much smaller number of significant associations were observed in at least one cancer type. We speculate that the associations that are significant only in the pan-cancer analysis are likely to be induced by tissue differences, whereas significant associations observed in individual cancer types are more likely to reflect regulatory relationships between methylation and gene expression. For future work, there are a few possible extensions. The methylation data used here were generated by the Illumina Methylation 27k platform. TCGA also generates methylation data using the Illumina Methylation 450k platform, which measures roughly 20 times more CpG sites. We plan to redo the analysis using the 450k methylation data, which will enable us to identify more associations between methylation and expression. Moreover, the proposed analysis strategy can also be applied to examine associations among measurements made by other modalities, such as microRNA expression, DNA copy number variation, and protein expression.
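The pooled and per-cancer-type analyses differ only in which samples contribute to the quadrant counts N, R, U, and K that enter the hypergeometric test. A minimal sketch of that bookkeeping follows. Reusing the pooled thresholds within each cancer type is an assumption of this sketch, and the function names and the `None` encoding of missing entries are hypothetical.

```python
from collections import defaultdict


def quadrant_counts(meth, expr, meth_thr, expr_thr):
    """Return (N, R, U, K) for paired samples of one CpG site and its gene.

    N: usable samples; R: methylated; U: high expression;
    K: methylated and high expression (upper-right quadrant).
    Pairs with a missing entry (None) are skipped.
    """
    N = R = U = K = 0
    for m, e in zip(meth, expr):
        if m is None or e is None:
            continue
        methylated = m > meth_thr
        high = e > expr_thr
        N += 1
        R += methylated
        U += high
        K += methylated and high
    return N, R, U, K


def counts_by_cancer_type(meth, expr, cancer_types, meth_thr, expr_thr):
    """Compute the same counts separately for each cancer type label."""
    groups = defaultdict(lambda: ([], []))
    for m, e, c in zip(meth, expr, cancer_types):
        groups[c][0].append(m)
        groups[c][1].append(e)
    return {c: quadrant_counts(ms, es, meth_thr, expr_thr)
            for c, (ms, es) in groups.items()}
```

Each resulting tuple can then be passed to a lower-tail hypergeometric test, once for the pooled samples and once per cancer type.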
The authors would like to acknowledge The Cancer Genome Atlas Research Network for providing the methylation and expression data used in this paper. This work was partially supported by TCGA Genome Data Analysis Center grant at the University of Texas MD Anderson Cancer Center (U24 CA143883 02 S1), as well as NIH grants (R01CA163481 and R01CA174385) from the National Cancer Institute.
- Ballestar E: An introduction to epigenetics. Adv. Exp. Med. Biol 2011, 711: 1-11. 10.1007/978-1-4419-8216-2_1
- Jones P, Baylin S: The fundamental role of epigenetic events in cancer. Nat. Rev. Genet 2002, 3(6): 415-428.
- Laird P: Principles and challenges of genomewide DNA methylation analysis. Nat. Rev. Genet 2010, 11(3): 191-203.
- Li M, Balch C, Montgomery J, Jeong M, Chung J, Yan P, Huang T, Kim S, Nephew K: Integrated analysis of DNA methylation and gene expression reveals specific signaling pathways associated with platinum resistance in ovarian cancer. BMC Med. Genomics 2009, 2: 34. 10.1186/1755-8794-2-34
- Shaknovich R, Geng H, Johnson N, Tsikitas L, Cerchietti L, Greally J, Gascoyne R, Elemento O, Melnick A: DNA methylation signatures define molecular subtypes of diffuse large B-cell lymphoma. Blood 2010, 116(20): e81-89. 10.1182/blood-2010-05-285320
- Widschwendter M, Jiang G, Woods C, Muller H, Fiegl H, Goebel G, Marth C, Muller-Holzner E, Zeimet A, Laird P, Ehrlich M: DNA hypomethylation and ovarian cancer biology. Cancer Res 2004, 64(13): 4472-4480. 10.1158/0008-5472.CAN-04-0238
- Newell-Price J, Clark A, King P: DNA methylation and silencing of gene expression. Trends Endocrinol. Metab 2000, 11(4): 142-148. 10.1016/S1043-2760(00)00248-4
- Cancer Genome Atlas Research Network: Integrated genomic analyses of ovarian carcinoma. Nature 2011, 474(7353): 609-615. 10.1038/nature10166
- Cancer Genome Atlas Research Network: Comprehensive genomic characterization of squamous cell lung cancers. Nature 2012, 487(7417): 519-525.
- Cancer Genome Atlas Research Network: Comprehensive molecular characterization of human colon and rectal cancer. Nature 2012, 487(7407): 330-333. 10.1038/nature11252
- Cancer Genome Atlas Research Network: Comprehensive molecular portraits of human breast tumours. Nature 2012, 490(7418): 61-70. 10.1038/nature11412
- Sahoo D, Dill D, Gentles A, Tibshirani R, Plevritis S: Boolean implication networks derived from large scale, whole genome microarray datasets. Genome Biol 2008, 9(10): R157. 10.1186/gb-2008-9-10-r157
- Sahoo D, Dill D, Tibshirani R, Plevritis S: Extracting binary signals from microarray time-course data. Nucleic Acids Res 2007, 35(11): 3705-3712. 10.1093/nar/gkm284
- Hinoue T, Weisenberger D, Lange C, Shen H, Byun H, Van Den Berg D, Malik S, Pan F, Noushmehr H, van Dijk C, Tollenaar R, Laird P: Genome-scale analysis of aberrant DNA methylation in colorectal cancer. Genome Res 2012, 22(2): 271-282. 10.1101/gr.117523.110
- Qiu P, Zhang L: Identification of markers associated with global changes in DNA methylation regulation in cancers. BMC Bioinformatics 2012, 13(Suppl 13): S7. 10.1186/1471-2105-13-S13-S7
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.