Simultaneous identification of robust synergistic subnetwork markers for effective cancer prognosis
© Khunlertgit and Yoon; licensee Springer. 2014
Received: 10 July 2014
Accepted: 17 October 2014
Published: 6 November 2014
Accurate prediction of cancer prognosis based on gene expression data is generally difficult, and identifying robust prognostic markers for cancer remains a challenging problem. Recent studies have shown that modular markers, such as pathway markers and subnetwork markers, can provide better snapshots of the underlying biological mechanisms by incorporating additional biological information, thereby leading to more accurate cancer classification.
In this paper, we propose a novel method for simultaneously identifying robust synergistic subnetwork markers that can accurately predict cancer prognosis. The proposed method utilizes an efficient message-passing algorithm called affinity propagation, based on which we identify groups – or subnetworks – of discriminative and synergistic genes, whose protein products are closely located in the protein-protein interaction (PPI) network. Unlike other existing subnetwork marker identification methods, our proposed method can simultaneously identify multiple nonoverlapping subnetwork markers that can synergistically predict cancer prognosis.
Evaluation results based on multiple breast cancer datasets demonstrate that the proposed message-passing approach can identify robust subnetwork markers in the human PPI network, which have higher discriminative power and better reproducibility compared to those identified by previous methods. The identified subnetwork markers can lead to better cancer classifiers with improved overall performance and consistency across independent cancer datasets.
Identifying disease-related biological markers is an important problem in translational genomics, and there have been significant research efforts to find robust markers for disease diagnosis and prognosis from gene expression data obtained from microarrays or next-generation sequencing (NGS). However, the small sample size and the high dimensionality of typical genomic data make the prediction of such biomarkers very challenging. A large number of approaches have been proposed to deal with these issues, and it has recently been shown that ‘modular markers’ have the potential to yield better disease markers that are more robust and reproducible across independent datasets. In the past, it has been common practice to look for so-called ‘key genes’ that show significant differential expression under different conditions or between distinct phenotypes, in order to discover gene markers that may be used for discriminating between different classes of biological/clinical samples. Unlike these traditional gene markers, where each gene is viewed as a potential biomarker, a modular marker consists of multiple genes that belong to the same functional module and show coordinated behaviors to fulfill a common biological function. The utilization of modular markers allows us to interpret and analyze the gene expression data in a more system-oriented way, which may facilitate the prediction of system-level properties based on the markers.
Examples of such modular markers include pathway markers and subnetwork markers. A pathway marker consists of multiple genes that belong to the same functional pathway. In order to use a pathway marker in a classification task, we first need to infer the activity level of the pathway based on the expression levels of its member genes, after which the inferred pathway activity can be used as a feature in a classifier. So far, several different methods have been proposed for pathway activity inference, and it has been shown that pathway markers tend to be more effective and robust compared to traditional gene markers. Unfortunately, the usefulness of pathway markers is practically limited by our incomplete pathway knowledge. In fact, currently known pathways cover only a relatively small number of genes; hence, the reliance on pathway markers may result in excluding crucial genes that may play important roles in determining the phenotypes of interest.
The concept of subnetwork markers was originally proposed to address this weakness of pathway markers. The main idea is to overlay the protein-protein interaction (PPI) network with the gene expression data to identify potential ‘subnetwork markers,’ which consist of discriminative genes whose protein products interact with each other and are hence connected in the PPI network. Conceptually, we can find such subnetwork markers by identifying subnetwork regions that undergo significant differential expression across different phenotypes, and the detected subnetwork markers may potentially correspond to functional modules, such as signaling pathways or protein complexes, in the underlying biological network. PPI networks provide much better gene coverage compared to the set of currently known pathways; hence, this network-based approach can essentially overcome the major shortcoming of the pathway-based approach.
Until now, several different strategies have been proposed for identifying subnetwork markers. For example, Chuang et al. proposed an efficient algorithm for finding subnetwork markers, where they first identify highly discriminative seed genes and then greedily grow the subnetworks around the seed genes to maximize the mutual information between the average z-score of the member genes and the class label. More recently, Su et al. proposed a different strategy, where differentially expressed linear paths are found by dynamic programming and overlapping paths are combined to obtain discriminative subnetwork markers. Both studies have shown that subnetwork markers can lead to more accurate and robust classifiers, compared to pathway markers.
In this paper, we propose a novel method for identifying effective subnetwork markers for predicting cancer prognosis. The proposed method is based on an efficient message-passing algorithm, called affinity propagation, which can be used to efficiently identify clusters of discriminative and synergistic genes whose protein products are either connected or closely located in the PPI network. Unlike previous subnetwork marker identification methods, the proposed method can simultaneously predict multiple subnetwork markers, which are mutually exclusive and have the potential to accurately predict cancer prognosis in a synergistic manner. Based on several independent breast cancer datasets, we demonstrate that the proposed method can identify better prognostic markers that have improved reproducibility and higher discriminative power compared to the markers identified by previous methods.
2 Materials and methods
We obtained four independent breast cancer microarray gene expression datasets from previous studies, which we refer to as the USA dataset (GEO:GSE2034), the Netherlands dataset (NKI-295), the Belgium dataset (GEO:GSE7390), and the Sweden dataset (GEO:GSE1456), respectively. The USA, Belgium, and Sweden datasets were profiled on the Affymetrix U133a platform and downloaded from the Gene Expression Omnibus (GEO) website. The Netherlands dataset was profiled on a custom Agilent microarray platform, and it was downloaded from the Stanford website. The USA dataset contains the gene expression profiles of 286 breast cancer patients, the Netherlands dataset contains the profiles of 295 patients, the Belgium dataset contains the profiles of 198 patients, and the Sweden dataset contains the profiles of 159 patients. In this study, gene expression profiles of the patients for whom metastasis had been detected within 5 years of surgery were labeled as ‘metastatic’, while the remaining profiles were labeled as ‘non-metastatic’. The USA, Netherlands, Belgium, and Sweden datasets respectively contain 106, 78, 35, and 35 metastatic profiles. The human protein-protein interaction network used in this paper was obtained from a previous study on subnetwork marker identification by Chuang et al., and it consists of 11,203 proteins and 57,235 interactions. We overlaid the gene expression data in the four breast cancer datasets with this PPI network by mapping each gene to the corresponding protein in the network. After removing the proteins that do not have corresponding genes in all four datasets, we obtained an induced network with 26,150 interactions among 4,936 proteins.
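The construction of the induced network described above can be illustrated with a short sketch. This is only a minimal illustration under simplifying assumptions, not the code used in the study: the function name `induced_network` and the representation of the PPI network as a list of protein pairs are hypothetical, and we assume that genes have already been mapped to protein identifiers.

```python
def induced_network(interactions, gene_sets):
    """Keep only interactions whose two endpoints are measured in every dataset.

    interactions: iterable of (protein_u, protein_v) pairs (hypothetical format)
    gene_sets:    one collection of mapped gene/protein identifiers per dataset
    """
    # Proteins that have a corresponding gene in all datasets
    common = set.intersection(*(set(genes) for genes in gene_sets))
    # Induced subgraph: both endpoints must survive the filtering
    edges = {(u, v) for u, v in interactions if u in common and v in common}
    proteins = {p for edge in edges for p in edge}
    return proteins, edges
```

Note that, in this sketch, a protein that loses all of its interaction partners is dropped together with its interactions.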
2.2 The affinity propagation algorithm: a brief overview
The data point k that maximizes the sum a(i,k)+r(i,k) is chosen as the exemplar for the data point i, and the algorithm converges if the set of exemplars does not change further.
So far, affinity propagation has been applied to various applications, such as predicting genes from microarray data and clustering facial images, and it has been shown to effectively identify meaningful clusters of data points at a much lower computational cost than traditional clustering methods. One important advantage of affinity propagation is that the number of clusters need not be specified in advance. This is especially useful in our current application, since we know neither how many functional modules are embedded in the biological network at hand nor how many of them are relevant to cancer prognosis, which makes it practically difficult to determine how many subnetwork markers we should look for.
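For readers unfamiliar with the algorithm, the following sketch shows the standard responsibility and availability updates of affinity propagation as described by Frey and Dueck, followed by the exemplar selection rule stated above. It is a plain, unoptimized illustration rather than the implementation used in this study. The damping factor, the iteration limits, and the function name are assumptions. Pairs with similarity −∞ are assumed to be encoded by a large negative number (e.g., -1e9) to keep the arithmetic finite.

```python
def affinity_propagation(s, damping=0.5, max_iter=200, stable_iters=15):
    """Return the exemplar index chosen by each data point.

    s[i][k] is the similarity of point i to candidate exemplar k and
    s[k][k] is the self-similarity (preference). Requires at least 2 points.
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i,k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i,k)
    exemplars, stable = None, 0
    for _ in range(max_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        for i in range(n):
            vals = [a[i][k] + s[i][k] for k in range(n)]
            best = max(range(n), key=vals.__getitem__)
            runner_up = max(v for k, v in enumerate(vals) if k != best)
            for k in range(n):
                rival = runner_up if k == best else vals[best]
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # a(i,k) <- min{0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))}
        # a(k,k) <- sum_{i' != k} max(0, r(i',k))
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            pos[k] = r[k][k]  # the self-responsibility is not clipped
            total = sum(pos)
            for i in range(n):
                new = total - pos[k] if i == k else min(0.0, total - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
        # Exemplar of point i: the k that maximizes a(i,k) + r(i,k)
        current = [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]
        stable = stable + 1 if current == exemplars else 0
        exemplars = current
        if stable >= stable_iters:  # exemplar set no longer changes
            break
    return exemplars
```

As a usage example, one may feed the function negative squared distances between a few one-dimensional points that form two well-separated groups, with a common self-similarity; points in the same group are then expected to share an exemplar.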
2.3 Computing the similarity between genes
The proteins corresponding to the genes in the same cluster should have direct interaction or should be closely located in the PPI network.
Every gene in a potential subnetwork marker should have sufficient discriminative power to distinguish between the two class labels (metastatic vs. non-metastatic).
The discriminative power to distinguish between the two class labels should be increased by combining genes within the same cluster.
if the shortest distance d(i,k) between the protein products of the genes g_i and g_k in the PPI network satisfies d(i,k)≤2. Otherwise, we set the similarity to s(i,k)=−∞. The discriminative power of a given gene is measured in terms of the t-test statistics score of the log-likelihood ratio (LLR) between the two class labels, and t_i and t_k are the t-test scores of g_i and g_k, respectively. Similarly, t_ik is the t-test score of the combined LLRs of g_i and g_k, which is computed by summing up the LLRs of the two genes. This term, t_ik, reflects the discriminative power of the gene pair (g_i, g_k) after combining them. The self-similarity was set to s(k,k)=c for all k, where the constant c was chosen such that s(i,k)≥c for only 1% of all gene pairs (g_i, g_k). Uniform initialization of the self-similarity s(k,k)=c guarantees that every gene in the dataset gets an equal chance to be an exemplar at the beginning of the message-passing process.
if g_k has high discriminative power (first term);
if combining the two genes increases the overall discriminative power;
if both genes have similar discriminative power.
The main reason underlying the asymmetric definition of the similarity s(i,k) is to indicate the direction of similarity. Based on our asymmetric definition, the exemplars of the identified clusters tend to have higher discriminative power compared to other non-exemplars. Intuitively, the gene similarity defined in (4) will make the affinity propagation algorithm identify gene clusters that consist of highly discriminative genes that are synergistic to each other and whose protein products are closely located in the PPI network.
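Two mechanical ingredients of this similarity definition, namely the restriction to gene pairs with d(i,k)≤2 and the choice of the common self-similarity c, can be sketched as follows. The sketch deliberately does not implement the similarity score itself. The function names, the adjacency-dictionary representation of the PPI network, and the rounding used to pick the top 1% are assumptions made for illustration only.

```python
from collections import deque


def within_two_hops(adjacency, source):
    """Return {gene: distance} for genes at shortest-path distance 1 or 2.

    adjacency: dict mapping each gene to an iterable of its PPI neighbors
    (hypothetical representation). Pairs outside this set get s(i,k) = -inf.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == 2:  # do not expand beyond two hops
            continue
        for v in adjacency.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    dist.pop(source)
    return dist


def shared_preference(pair_similarities, top_fraction=0.01):
    """Pick c such that s(i,k) >= c holds for about top_fraction of all pairs.

    pair_similarities should contain one value per gene pair, including the
    pairs set to -inf, so that the fraction refers to all gene pairs.
    """
    ordered = sorted(pair_similarities, reverse=True)
    cutoff = max(1, int(len(ordered) * top_fraction))
    return ordered[cutoff - 1]
```

With 200 pair similarities, for example, `shared_preference` returns the second largest value, so that exactly two pairs (1%) satisfy s(i,k)≥c.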
2.4 Post-processing the identified gene subnetworks
Although the affinity propagation algorithm can effectively identify subnetworks that consist of discriminative and synergistic genes, the clustering process does not completely exclude genes with relatively low discriminative power from those subnetworks. As a result, the initial subnetworks predicted by affinity propagation may still contain genes whose discriminative power is lower than that of the other genes in the same subnetwork. In order to improve the overall discriminative power of the potential subnetwork markers, we post-processed the initial subnetworks as follows. First, we clustered the genes in a given subnetwork into k groups based on their t-test statistics scores using the k-means clustering algorithm, where k was chosen to be k=⌊log(number of genes in the subnetwork under consideration)+1⌋. After clustering, the genes in the group with the lowest average t-test score were removed from the subnetwork.
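The post-processing step can be sketched as below. Several details are assumptions, since the text does not fix them: the logarithm is taken to be the natural logarithm, the clustering is applied to absolute t-test scores, and a small deterministic one-dimensional k-means (initialized at evenly spaced order statistics) stands in for whichever k-means implementation was actually used. The function names are hypothetical.

```python
import math


def kmeans_1d(values, k, n_iter=100):
    """Lloyd's k-means on scalars with a deterministic initialization."""
    ordered = sorted(values)
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: nearest center
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: mean of each group (empty groups keep their center)
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return labels


def prune_subnetwork(genes, abs_t_scores):
    """Remove the group of genes with the lowest average |t| score."""
    k = math.floor(math.log(len(genes)) + 1)  # assumption: natural logarithm
    if k < 2:
        return list(genes)
    labels = kmeans_1d(abs_t_scores, k)
    groups = {}
    for score, lab in zip(abs_t_scores, labels):
        groups.setdefault(lab, []).append(score)
    if len(groups) < 2:  # nothing to separate
        return list(genes)
    weakest = min(groups, key=lambda lab: sum(groups[lab]) / len(groups[lab]))
    return [g for g, lab in zip(genes, labels) if lab != weakest]
```

For a subnetwork of six genes, this choice gives k=2, so the subnetwork is split into a stronger and a weaker group and the weaker one is discarded.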
2.5 Probabilistic inference of subnetwork activity
In this inference, each gene g_i is characterized by the conditional probability density function (PDF) of its expression level x_i under phenotype j. We assume that the gene expression level of g_i under phenotype j follows a Gaussian distribution.
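The following sketch illustrates how such a Gaussian LLR could be computed and turned into an activity score. It rests on explicit assumptions: the LLR of a gene is taken as the difference of the two class-conditional Gaussian log-densities, the activity of a subnetwork is taken as the sum of the LLRs of its member genes (by analogy with the pairwise combination described in Section 2.3), and Welch's two-sample t statistic is used as the t-test score. All function names and the data layout are hypothetical.

```python
import math
from statistics import mean, stdev


def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def gene_llr(values, labels):
    """LLR of one gene for every sample; labels are 1 (metastatic) or 0."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    mu1, s1 = mean(pos), stdev(pos)  # Gaussian fitted to class 1
    mu0, s0 = mean(neg), stdev(neg)  # Gaussian fitted to class 0
    return [gaussian_logpdf(x, mu1, s1) - gaussian_logpdf(x, mu0, s0) for x in values]


def subnetwork_activity(expression, genes, labels):
    """Sum the LLRs of the member genes, sample by sample (assumed form)."""
    llrs = [gene_llr(expression[g], labels) for g in genes]
    return [sum(column) for column in zip(*llrs)]


def t_score(values, labels):
    """Welch's t statistic between the two classes (assumed test variant)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    se = math.sqrt(stdev(pos) ** 2 / len(pos) + stdev(neg) ** 2 / len(neg))
    return (mean(pos) - mean(neg)) / se
```

Calling `subnetwork_activity` with two genes yields the combined LLR of a gene pair, whose t-test score plays the role of t_ik in Section 2.3.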
3.1 Statistics of the identified subnetwork markers
[Table: statistics of the identified subnetwork markers, reporting (i) the average size of the identified subnetwork markers, (ii) the total number of unique genes in the identified subnetwork markers, (iii) the total number of common genes between the top subnetwork markers identified using different α, and (iv) the overlap between the top subnetwork markers identified on different datasets, for the pairs USA-Netherlands, USA-Belgium, USA-Sweden, Netherlands-Belgium, Netherlands-Sweden, and Belgium-Sweden. Numerical entries are not available in this version.]
3.2 Computational cost for subnetwork marker identification
3.3 Discriminative power of the subnetwork markers
The horizontal axis in Figure 2 corresponds to K, and the vertical axis corresponds to the mean absolute t-test score of the top K subnetwork markers. We compared the discriminative power of the subnetwork markers predicted by the proposed method with that of the subnetworks predicted by the greedy method of Chuang et al. The activity level of the subnetworks identified by the greedy method was inferred using the same scheme that was originally used in that work. As we can see from Figure 2, the proposed method typically finds subnetwork markers with comparable or slightly higher discriminative power compared to the previous greedy method, although both methods work very well. In this experiment, the parameter α did not significantly affect the average discriminative power of the subnetwork markers identified by the proposed method.
We also investigated the impact of the post-processing step by comparing the discriminative power of the subnetwork markers before and after post-processing. Additional file 1: Figure S1 shows the results obtained using α=0.5. We can see that the discriminative power of the top 50 subnetwork markers improves as a result of the post-processing step, during which we remove the genes that have relatively lower discriminative power.
One interesting observation we can make from these figures is that a smaller α tends to yield subnetwork markers that retain their discriminative power relatively better across independent datasets. This observation makes intuitive sense, since a larger α tends to penalize genes with different discriminative power, thereby giving rise to relatively smaller subnetwork markers that mostly consist of a few highly discriminative genes that may not necessarily be synergistic. This increases the risk of overfitting the data, thereby degrading the effectiveness of the predicted markers on other independent datasets.
3.4 Evaluating the reproducibility of the predicted subnetwork markers
In order to evaluate the efficacy of the predicted subnetwork markers in cancer prognosis, we performed five-fold cross-validation experiments based on a similar set-up that has been commonly used in previous studies.
Considering that our ultimate goal is to identify effective subnetwork markers that can be used for building robust classifiers that can accurately predict breast cancer prognosis, it is important to verify whether the predicted markers can actually lead to better classifiers whose performance can be reproduced on independent datasets. For this purpose, we performed the following cross-dataset experiments.
First of all, we selected one of the four breast cancer datasets just for identifying the potential subnetwork markers and selecting the optimal feature set (i.e., the set of markers to be used for building the classifier). To select the optimal set of features, we randomly divided the chosen dataset into three folds, where two folds (marker-evaluation set) were used for evaluating the discriminative power of the subnetwork markers and the remaining fold (feature-selection set) was used for selecting the features to be used in the classifier. We used the entire set for estimating the class conditional probability density functions that are needed for the pathway activity inference.
We evaluated the discriminative power of all potential subnetwork markers based on the marker-evaluation set, selected the top 50 markers, and sorted them according to their absolute t-test score in descending order. Initially, we built a classifier based on linear discriminant analysis (LDA), where only the top subnetwork marker was included in the feature set. The classifier was trained on the marker-evaluation set, and its classification performance was assessed by measuring the area under the ROC curve (AUC) on the feature-selection set. Subsequently, we added the next best subnetwork marker to the feature set, re-trained and re-evaluated the classifier, and kept the subnetwork marker only if the AUC increased. We repeated this process for the top 50 subnetwork markers.
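The incremental feature selection described above can be sketched as a greedy forward pass over the ranked markers. To keep the example self-contained, the LDA classifier is not implemented here: the selection routine accepts any callable that trains a classifier on the given markers and returns its AUC on the feature-selection set, and the demonstration scorer (which simply sums the activities of the selected markers) is a hypothetical stand-in for LDA. All names are illustrative assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def forward_select(ranked_markers, fit_and_score):
    """Keep a marker only if adding it increases the AUC.

    ranked_markers: markers sorted by decreasing absolute t-test score
    fit_and_score:  callable(list_of_markers) -> AUC on the feature-selection set
    """
    selected = [ranked_markers[0]]
    best = fit_and_score(selected)
    for marker in ranked_markers[1:]:
        candidate = fit_and_score(selected + [marker])
        if candidate > best:
            selected.append(marker)
            best = candidate
    return selected, best


def make_sum_scorer(activity, labels):
    """Hypothetical stand-in for 'train LDA, then measure AUC'."""
    def fit_and_score(markers):
        scores = [sum(vals) for vals in zip(*(activity[m] for m in markers))]
        return auc(scores, labels)
    return fit_and_score
```

In practice, `fit_and_score` would train the LDA classifier on the marker-evaluation set and evaluate it on the feature-selection set, as described in the text.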
Finally, we also performed within-dataset experiments to investigate the performance of the proposed method and compare it with previous subnetwork and pathway-based methods. In these experiments, the classifiers were trained and evaluated on different folds of the same dataset, where a similar five-fold cross-validation set-up was used as before. We first selected a dataset and then randomly divided it into five folds. Four out of the five folds were used as a training set for building the classifier. The remaining fold was used as a test set for evaluating the classification performance. The subnetwork markers were identified using the entire dataset, and not just the four-fold training set, due to the high computational burden of re-identifying the subnetwork markers for a large number of random partitions. The results are depicted in Additional file 1: Figure S3. We can see that classifiers based on subnetwork markers performed significantly better compared to those based on pathway markers. The main reason for this significant performance improvement is the substantially increased coverage of genes, which was the main motivation for identifying subnetwork markers and using them for cancer classification. The proposed subnetwork marker identification method and the greedy method both performed well in the within-dataset experiments, although our proposed method outperformed the greedy method in terms of robustness and reproducibility across different datasets, as we have shown before.
In this paper, we proposed a novel method for identifying robust and synergistic subnetwork markers that can be used to accurately predict breast cancer prognosis. Our proposed method utilizes an efficient message-passing algorithm called affinity propagation to identify gene subnetworks that consist of discriminative and synergistic genes whose protein products are known to interact with each other or to be closely located in the protein-protein interaction network. The proposed method allows us to simultaneously identify multiple mutually exclusive subnetwork markers that have the potential to synergistically improve the prediction of breast cancer prognosis. Extensive evaluation based on four large-scale breast cancer datasets demonstrates that the proposed method can predict effective subnetwork markers with high discriminative power and reproducible performance across independent datasets. Furthermore, the predicted markers can be used to construct robust cancer classifiers that can yield more consistent classification performance across datasets compared to other existing methods.
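The random five-fold partitioning used in the within-dataset experiments can be sketched as follows. This is a generic illustration: the function names and the fixed random seed are assumptions, and the sketch does not stratify the folds by class label, since the text does not state whether stratification was used.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly split sample indices into k folds of nearly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[j::k] for j in range(k)]


def train_test_splits(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each fold serves once as the test set."""
    folds = k_fold_indices(n_samples, k, seed)
    for j, test in enumerate(folds):
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train, test
```

Repeating the procedure with different seeds gives the multiple random partitions over which the classification performance can be averaged.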
NK was supported by a scholarship from the Royal Thai Government. BJY was supported in part by the National Science Foundation, through NSF Award CCF-1149544.
- Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proc. Natl. Acad. Sci. U S A 2005, 102: 13544-13549. 10.1073/pnas.0506577102
- Guo Z, Zhang T, Li X, Wang Q, Xu J, Yu H, Zhu J, Wang H, Wang C, Topol EJ, Wang Q, Rao S: Towards precise classification of cancers based on robust gene functional expression profiles. BMC Bioinformatics 2005, 6: 58.
- Lee E, Chuang HY, Kim JW, Ideker T, Lee D: Inferring pathway activity toward precise disease classification. PLoS Comput. Biol. 2008, 4: 1000217.
- Su J, Yoon B-J, Dougherty ER: Accurate and reliable cancer classification based on probabilistic inference of pathway activity. PLoS ONE 2009, 4(12): 8161.
- Khunlertgit N, Yoon B-J: Identification of robust pathway markers for cancer through rank-based pathway activity inference. Adv. Bioinformatics 2013, 2013: 618461. 10.1155/2013/618461
- Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 2007, 3: 140.
- Su J, Yoon B-J, Dougherty ER: Identification of diagnostic subnetwork markers for cancer in human protein-protein interaction network. BMC Bioinformatics 2010, 11: 8.
- Auffray C: Protein subnetwork markers improve prediction of cancer outcome. Mol. Syst. Biol. 2007, 3: 141.
- Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Gelder MM-v, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365: 671-679. 10.1016/S0140-6736(05)17947-1
- van de Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002, 347(25): 1999-2009. 10.1056/NEJMoa021967
- Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d’Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JGM, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the transbig multicenter independent validation series. Clin. Cancer Res 2007, 13(11): 3207-3214. 10.1158/1078-0432.CCR-06-2765
- Pawitan Y, Bjohle J, Amler L, Borg A-L, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu E, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw P, Smeds J, Skoog L, Wedren S, Bergh J: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res. 2005, 7(6): 953-964. 10.1186/bcr1325
- Edgar R, Domrachev M, Lash AE: Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30(1): 207-210. 10.1093/nar/30.1.207
- Chang HY, Nuyten DSA, Sneddon JB, Hastie T, Tibshirani R, Sørlie T, Dai H, He YD, Bartelink H, Brown PO: Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc. Natl. Acad. Sci. U S A 2005, 102(10): 3738-3743. 10.1073/pnas.0409462102
- Frey BJ, Dueck D: Clustering by passing messages between data points. Science 2007, 315(5814): 972-976. 10.1126/science.1136800
- Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP: Molecular signatures database (msigdb) 3.0. Bioinformatics 2011, 27(12): 1739-1740. 10.1093/bioinformatics/btr260
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.