Exploring soybean metabolic pathways based on probabilistic graphical model and knowledge-based methods
© Hou et al. 2015
Received: 16 January 2015
Accepted: 9 June 2015
Published: 20 June 2015
Soybean (Glycine max) is a major source of vegetable oil and protein for both animal and human consumption. The completion of the soybean genome sequence led to a number of transcriptomic (RNA-seq) studies, which provide a resource for gene discovery and functional analysis. Several data-driven (e.g., based on gene expression data) and knowledge-based (e.g., predictions of molecular interactions) methods have been proposed and implemented. To better understand gene relationships and protein interactions, we applied a probabilistic graphical method, based on Bayesian networks and knowledgebase constraints, to gene expression data in order to reconstruct soybean metabolic pathways. The results show that this method can predict new relationships between genes, improving on traditional reference pathway maps.
Soybean (Glycine max L. Merr.) is recognized as an important food source for humans and animals because of its relatively high protein and oil content. As a major species of the legume family, soybean provides high-quality, complete protein that contains all the essential amino acids that people need. Soybean is also considered “heart healthy” since soybean protein intake can significantly decrease serum (blood) cholesterol and low-density lipoprotein (LDL) levels, contributing to a reduced risk of coronary heart disease [1, 2].
A remarkable achievement in soybean research was the completion of the genome sequence (http://www.phytozome.net/soybean), which provided the basis for a variety of detailed, genome-wide studies, including completion of a transcriptome atlas based on RNA-seq analysis of different tissues [3, 4]. The availability of these transcriptome data facilitates more detailed studies of soybean gene function.
In order to visualize and analyze large-scale experimental gene expression data, especially to elucidate gene–gene and protein–protein interactions, gene and protein expression data are commonly mapped to reference metabolic pathways, which provide a context for understanding the functional response of the plant to a given treatment. Metabolic pathways are designed to represent the chemical reactions among a set of small molecules in a cell within one organism. Therefore, reconstruction of metabolic pathways from protein and gene expression data can help researchers discover new, fundamental biological functions for a particular network. Although more and more plant genome sequences are becoming available, there is still a need for improved methods for metabolic pathway reconstruction to support functional studies.
In order to reconstruct a traditional metabolic pathway for a given species (e.g., those provided by the KEGG database [6, 7]), the annotated genes and their encoded protein products are integrated with the reference metabolic pathways in the KEGG database. The gene product sequences are mapped to the reference pathway using the KEGG Automatic Annotation Server (KAAS), based on sequence homology to similarly mapped sequences from well-annotated reference genomes. Each gene is assigned the single highest-ranking KEGG orthology number (KO number), based on the functional annotation in KAAS, which scores orthology groups by probability and heuristics. The associations between KO numbers lead to placement of the gene products into curated pathways. To improve on these methods and identify more potentially valuable interactions between genes and proteins, we applied Bayesian networks to construct probabilistic graphical networks. As an example, we used knowledgebase constraints to improve prediction efficiency and accuracy when reconstructing metabolic pathways for soybean.
2.1 Pathway construction workflow
2.2 Data preprocessing
2.2.1 Removal of non-expressed genes
The published RNA-seq gene expression data, representing 14 soybean tissue-specific conditions, were normalized to counts per million reads (CPM). The CPM normalization was implemented using the Bioconductor package edgeR within the R statistical programming language. Genes with a CPM value above one in at least one condition were kept for further analysis, while genes showing no apparent expression in any of the 14 conditions were removed from the dataset.
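The filtering rule above can be illustrated with a minimal Python sketch. It assumes plain library-size scaling (no additional normalization factors, which edgeR can also apply), and the gene names and counts are toy values invented for illustration, not data from this study.

```python
def cpm(counts):
    """Convert raw read counts to counts per million (CPM).

    counts: dict mapping gene ID -> list of raw counts, one per condition.
    Assumption: library size is the plain column sum of the counts.
    """
    n_conditions = len(next(iter(counts.values())))
    # Library size = total reads observed in each condition.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_conditions)]
    return {
        gene: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_expressed(cpm_table, threshold=1.0):
    """Keep genes whose CPM exceeds the threshold in at least one condition."""
    return {g: v for g, v in cpm_table.items() if any(x > threshold for x in v)}


# Toy data: three hypothetical genes measured in two conditions.
raw = {
    "geneA": [500_000, 250_000],
    "geneB": [499_999, 749_999],
    "geneC": [1, 1],
}
cpm_table = cpm(raw)
expressed = filter_expressed(cpm_table)
```

In this toy table, geneC reaches a CPM of exactly one in both conditions and is therefore dropped, because the rule requires a value strictly above one.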
2.2.2 Protein sequence generation
Accurate gene translation is essential for developing the initial reference pathway maps. Protein sequences available for the annotated soybean genome were extracted from the Soybean Knowledge Base (SoyKB) [11, 12]. The KEGG pathway database uses Entrez IDs for each soybean gene and, therefore, each SoyKB Glyma-format ID was converted to an EntrezGene ID (e.g., GLYMA01G00300.1 ↔ 100781438) using BioMart. The metabolic pathways were built by combining all of the knowledge above (i.e., protein annotation and associated gene expression values).
2.2.3 Gene clustering
Metabolic pathways are graphical representations of cellular processes in the KEGG database. Each reference pathway is composed of a network of enzymes and a set of genes that are functionally related in terms of predicted cellular and molecular functions based on experimental knowledge [6, 7]. One assumption of our metabolic pathway reconstruction method is that genes that share similar gene expression patterns over a set of experiments are more likely to be involved in the same reference metabolic pathway. Therefore, we applied an Expectation-Maximization (EM)-based clustering algorithm to the data sets. This algorithm fits a Gaussian mixture model, and the number of clusters is determined by cross-validation to improve the consistency and robustness of gene selection.
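For reference, a Gaussian mixture model with K clusters and one common way to choose K by cross-validation can be written as follows. The held-out log-likelihood criterion and the symbols (F folds, held-out gene sets V_f) are assumptions made for illustration; the text does not state which cross-validation criterion was used.

```latex
% Mixture density for the expression vector x_g of gene g across conditions
p(\mathbf{x}_g \mid \theta_K)
  = \sum_{k=1}^{K} \pi_k \,
    \mathcal{N}\!\left(\mathbf{x}_g \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
  \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0

% Assumed selection rule: maximize the average held-out log-likelihood,
% where \hat{\theta}_K^{(-f)} is fitted by EM without fold f
\hat{K} = \arg\max_{K} \; \frac{1}{F} \sum_{f=1}^{F} \sum_{g \in \mathcal{V}_f}
          \log p\!\left(\mathbf{x}_g \mid \hat{\theta}_K^{(-f)}\right)
```

Each gene is then assigned to the component with the highest posterior responsibility, which yields the gene clusters used in the later sampling step.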
2.2.4 Relation knowledgebase construction
Information on all the existing relationships among KO identifiers (ortholog groups) for all the mapped genes in all pathways can be collected from the KEGG database [6, 7]. Such a relational knowledgebase includes the relations and reactions among orthologous groups, such as the directional relations, the relationship types within protein–protein and protein–compound interactions, and the related chemical reactions.
In order to generate the relation knowledgebase, based on the gene-to-KO assignments and the list of pathway maps in KGML format generated by the method described in the “Initial pathway construction” section, we extracted all the sets of relations and reactions for the mapped KO numbers from the full set of KEGG reference pathways. The relationship between two genes can then be predicted from the relationship between their KO numbers.
In order to improve the efficiency and quality of the structure search for Bayesian network construction, two knowledgebase sets were generated. A gene whitelist was created for Bayesian network prediction, consisting of all the known relationships between genes. A blacklist was also constructed, consisting of all possible gene relations not supported by the knowledgebase and, therefore, excluded from the Bayesian network construction.
2.3 Initial pathway construction
The KEGG database [6, 7] provides 90 graphical diagrams for soybean reference metabolic pathways, which were computationally generated from manually curated pathways based on experimental knowledge of metabolism. Each pathway represents the network structure of chemical compounds, enzyme molecules and enzymatic reactions, where each enzyme is assigned one Enzyme Commission (EC) number to specify enzyme-catalyzed reactions. Each EC number is associated with a KEGG Orthology (KO) number in the KEGG database. The KO number is a unique identifier for matching the genomic information in the GENES database with the gene product (enzyme–enzyme interaction) information in the PATHWAY database. In each reference pathway, rectangular nodes are assigned KO identifiers to denote specific enzymes. Once the KO identifiers are assigned to genes in a specific genome, the related organism-specific pathways are generated automatically. A web-based server called the KEGG Automatic Annotation Server (KAAS) can automatically assign KO identifiers to genes based on protein sequence similarities, which enables the reconstruction of initial organism-specific pathways and BRITE hierarchies.
However, since the complete metabolic pathway is separated into a list of subpathways according to different cellular and molecular functions, genes mapped to a specific pathway can represent only a small part of the relationships in the whole pathway. A traditional mapping method such as KAAS can predict only a small subset of the relationships that exist in each reference pathway. In order to address this weakness, we applied the Bayesian probabilistic network method to expand the initially mapped pathways by adding more genes and relationships, taking into account all predicted KO relationships in KEGG. Based on the gene-to-KO assignment and pathway network structure information, the initial pathway can be constructed by matching the genes to the KO identifiers in each pathway.
2.4 Bayesian network pathway construction
After the initial pathways for soybean genes were constructed and the associated knowledgebase built, we expanded the pathways by adding more genes with similar gene expression patterns, together with new relationships and reactions. The new genes were derived from the predicted gene clusters, and the expanded metabolic pathways were predicted through the Bayesian network method, taking advantage of all existing relationships between KO numbers in the knowledgebase. A score-based heuristic approach was applied to learn all the possible local structures and enlarge the whole metabolic pathway step by step.
2.4.1 Gene sampling
In order to sample genes that are more likely to be involved in the same pathway, the gene cluster containing the most genes from the initial pathway (the pivot cluster) was assigned the highest sampling probability. The sampling probabilities for the remaining clusters were assigned automatically based on the Euclidean distance between each cluster and the pivot cluster. The assignment follows the criterion that a shorter distance receives a higher sampling weight, since a shorter distance means that the gene expression values of the genes in the two clusters are more similar.
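A minimal sketch of such distance-based weighting is given below. The exact weighting formula is not stated in the text, so the inverse-distance form 1 / (1 + d), the use of cluster centroids, and all names and values here are assumptions chosen only to illustrate the "shorter distance, higher weight" criterion.

```python
import math


def sampling_probabilities(centroids, pivot):
    """Assign each cluster a sampling probability that decreases with its
    Euclidean distance from the pivot cluster's centroid.

    centroids: dict mapping cluster name -> mean expression vector.
    pivot: name of the pivot cluster (distance 0, hence the largest weight).
    Assumption: weight = 1 / (1 + distance), normalized to sum to one.
    """
    dists = {c: math.dist(vec, centroids[pivot]) for c, vec in centroids.items()}
    weights = {c: 1.0 / (1.0 + d) for c, d in dists.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical two-condition centroids for three clusters.
centroids = {"pivot": (0.0, 0.0), "near": (3.0, 4.0), "far": (6.0, 8.0)}
probs = sampling_probabilities(centroids, "pivot")
```

A cluster can then be drawn with `random.choices(list(probs), weights=list(probs.values()))`, and genes sampled from within the chosen cluster.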
2.4.2 Bayesian network construction
The Bayesian network approach can be applied to discover causal relationships from gene expression data. It proposes a probabilistic model with a joint probability distribution to represent the gene expression patterns of the target genes across the different experimental conditions. Based on the predicted network structure, valuable biological information can be extracted to understand the regulation process among genes. A Bayesian network is represented as a directed acyclic graph (DAG), with genes/proteins as nodes and the reactions/relations between genes as directed edges in the graphical representation of metabolic pathways. A score function based on the Bayesian Information Criterion (BIC) is used to predict networks from gene expression data. The score is re-evaluated after adding, removing, or reversing a single edge at each local structure update during network learning. The greedy hill-climbing algorithm finds a network structure that is a local maximum of the score. After the locally optimal structure is determined, the new local network is treated as a new node and the sampling procedure is repeated to produce a larger pathway network.
Since network learning in a large search space is time-consuming, we used knowledge constraints to restrict the network search to a smaller space. The concepts of a whitelist and a blacklist in Bayesian networks were applied. The edges existing in the initial pathways were always present in the graph, serving as the whitelist for the Bayesian network. Based on the sampled genes and the existing relationships of orthologous groups (KO numbers) in KEGG, gene relations that do not exist in the knowledgebase are never present in the graph; these serve as the blacklist for the Bayesian network.
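For reference, one common form of the BIC score for a candidate DAG G given data D is shown below, written so that a higher score is better. Sign and scaling conventions vary between software packages, and the symbols are introduced here for illustration rather than taken from the text.

```latex
% \hat{L}: maximized likelihood of the data under structure G
% d_G: number of free parameters of G;  N: number of observations
\mathrm{BIC}(G \mid D) = \log \hat{L}(G; D) - \frac{d_G}{2} \log N

% The score decomposes over nodes X_i and their parent sets \mathrm{Pa}_G(X_i),
% so a single edge addition, removal, or reversal changes only the
% terms of the nodes whose parent sets are modified
\mathrm{BIC}(G \mid D) = \sum_{i} \mathrm{BIC}\!\left(X_i \mid \mathrm{Pa}_G(X_i); D\right)
```

This decomposition is what makes greedy hill-climbing practical: each candidate move can be scored by recomputing one or two local terms instead of the whole network score.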
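The whitelist and blacklist rules described above can be sketched in a few lines of Python. The function name, the gene names and the KO labels are hypothetical placeholders, and the sketch assumes each gene maps to a single KO number.

```python
from itertools import permutations


def build_constraints(gene_to_ko, initial_ko_edges, known_ko_relations):
    """Derive structure-learning constraints for a set of sampled genes.

    whitelist: directed gene pairs whose KO pair is an edge of the initial
               pathway (always kept in the learned graph).
    blacklist: directed gene pairs whose KO pair is absent from the
               knowledgebase (never allowed in the learned graph).
    Pairs in neither set are left for the score-based search to decide.
    """
    whitelist, blacklist = set(), set()
    for g1, g2 in permutations(sorted(gene_to_ko), 2):
        ko_pair = (gene_to_ko[g1], gene_to_ko[g2])
        if ko_pair in initial_ko_edges:
            whitelist.add((g1, g2))
        elif ko_pair not in known_ko_relations:
            blacklist.add((g1, g2))
    return whitelist, blacklist


# Hypothetical genes and KO labels, for illustration only.
gene_to_ko = {"g1": "KO_a", "g2": "KO_b", "g3": "KO_c"}
initial_ko_edges = {("KO_a", "KO_b")}
known_ko_relations = {("KO_a", "KO_b"), ("KO_b", "KO_c")}
whitelist, blacklist = build_constraints(gene_to_ko, initial_ko_edges, known_ko_relations)
```

Here the pair (g2, g3) is supported by the knowledgebase but is not part of the initial pathway, so it lands in neither list and remains a candidate edge for the search.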
2.4.3 Parsing and editing pathway information
There are several problems that need closer attention during the pathway processing for network construction.
Cycle detection and processing in pathway
A metabolic pathway reflects a series of reactions between enzymes, which are often feed-forward reactions in one direction. However, reversible reactions also exist in the pathway and lead to feedback loops among sets of enzymes. Before the genes sampled from a gene cluster, combined with the initial mapping pathway, are fed into Bayesian network reconstruction, cycles must be resolved, since a Bayesian network cannot contain loops or cycles in its graph. During the whitelist generation step, the directed gene–gene reactions from the initial pathway that exist in the KEGG knowledgebase were added to the whitelist iteratively, checking each time whether a cycle would exist in the current network. If a directed gene pair would cause a loop in the current whitelist network, the edge was not added. This step generates an initial, cycle-free gene network as the whitelist for Bayesian network prediction. In order to incorporate the reaction information from all species in the KEGG database, if the reaction for a gene pair belongs only to soybean, this edge was also excluded from the whitelist in order to make the initial mapping network more independent. This also helped to validate the performance of our network prediction. After the new network was predicted, the initial mapping network with cycles was added back to the predicted pathway to restore the existing feedback reactions.
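The iterative, cycle-checked whitelist construction can be sketched as follows. This is an illustrative implementation using a depth-first reachability check, with hypothetical function and gene names; the text does not specify how the cycle check was implemented.

```python
def creates_cycle(adj, src, dst):
    """Return True if adding the edge src -> dst would close a directed cycle.

    A cycle appears exactly when src is already reachable from dst.
    """
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj.get(node, ()))
    return False


def acyclic_whitelist(edges):
    """Add directed edges one at a time, skipping any that would form a cycle.

    Returns (kept, deferred): kept edges form a DAG usable as a whitelist;
    deferred edges can be added back after the network has been predicted.
    """
    adj, kept, deferred = {}, [], []
    for src, dst in edges:
        if creates_cycle(adj, src, dst):
            deferred.append((src, dst))
        else:
            adj.setdefault(src, set()).add(dst)
            kept.append((src, dst))
    return kept, deferred


# Hypothetical three-gene feedback loop: a -> b -> c -> a.
kept, deferred = acyclic_whitelist([("a", "b"), ("b", "c"), ("c", "a")])
```

Note that the result depends on the order in which edges are visited: whichever edge would close the loop last is the one deferred.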
Multi-molecule nodes in a pathway
In a metabolic pathway, multiple proteins may catalyze the same reaction and inhibit or activate the same substrate. Such a set of molecules is grouped together and labeled as one node in the pathway, sharing the same node identifier. In the KGML file for a KEGG pathway, each such node is composed of multiple different KO numbers. During the extraction of gene relationships from the initial mapping pathways, the relationship between two nodes in the pathway was assigned to each pair of KO numbers drawn from the two nodes to generate the gene–gene relationships. After the gene network was generated by Bayesian network construction, genes belonging to the same KO number were grouped together to simplify the network representation.
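The expansion of node-level relations into KO-number pairs can be sketched with the Python standard library. The embedded KGML fragment is a hand-written toy example that follows the general KGML layout (entry elements carrying space-separated `ko:` names, relation elements referring to entry IDs); the pathway name and KO identifiers are placeholders, not content extracted from KEGG.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Toy KGML fragment: node 1 groups two KO numbers, node 2 holds one.
KGML = """<pathway name="ko00000" org="ko" number="00000">
  <entry id="1" name="ko:K00001 ko:K00002" type="ortholog"/>
  <entry id="2" name="ko:K00003" type="ortholog"/>
  <relation entry1="1" entry2="2" type="ECrel"/>
</pathway>"""


def ko_pairs_from_kgml(kgml_text):
    """Expand each node-to-node relation into all directed KO-number pairs."""
    root = ET.fromstring(kgml_text)
    # Map each ortholog node ID to the list of KO numbers it groups.
    node_kos = {
        e.get("id"): [name.split(":", 1)[1] for name in e.get("name").split()]
        for e in root.findall("entry")
        if e.get("type") == "ortholog"
    }
    pairs = set()
    for rel in root.findall("relation"):
        src = node_kos.get(rel.get("entry1"), [])
        dst = node_kos.get(rel.get("entry2"), [])
        # Every KO in the source node is related to every KO in the target node.
        pairs.update(product(src, dst))
    return pairs


ko_pairs = ko_pairs_from_kgml(KGML)
```

The single relation between the two nodes yields two KO pairs here, one for each KO number grouped in node 1.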
2.5 Functional enrichment analysis
The protein function prediction software MULTICOM-PDCN [16, 17] was used for function prediction of gene sets from the Bayesian graphical network. A set of Gene Ontology (GO) terms associated with three functional categories (i.e., biological process, molecular function, and cellular component) was predicted for each gene. A Fisher exact test was conducted on each predicted pathway to identify over-represented GO terms, that is, GO terms significantly associated with the group of genes in the pathway. The significant GO terms identified in each Bayesian network served as a reference to validate the predicted edges among gene sets.
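As an illustration of the over-representation test, the sketch below computes a one-sided Fisher exact p-value from the hypergeometric distribution. The one-sided alternative, the choice of background gene set, and the function and parameter names are assumptions; the text does not specify these details, and no multiple-testing correction is shown.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact p-value for over-representation of a GO term.

    N: number of genes in the background set
    K: background genes annotated with the term
    n: genes in the predicted pathway
    k: pathway genes annotated with the term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy numbers: 3 of 10 background genes carry the term, and all 3 pathway genes do.
p_value = fisher_enrichment_p(k=3, n=3, K=3, N=10)
```

With these toy counts only one of the 120 possible three-gene draws contains all three annotated genes, so the p-value is 1/120.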
3 Results and discussion
3.1 Data description
Our initial input data were RNA-seq gene expression data representing 14 soybean tissue-specific conditions: 9 different soybean tissues (root hair cells isolated 84 and 120 h after sowing (HAS), root tip, root, mature nodules, leaves, shoot apical meristem (SAM), flower and green pods), as well as 5 additional tissues taken from Libault et al. This large-scale transcriptomic analysis provided a comprehensive compendium of soybean gene expression. We applied our pathway reconstruction pipeline to this full set of transcriptome data, which contains expression measurements on 69,077 putative annotated soybean genes and 7,314 unannotated genes. Of the putative annotated genes, 53,175 were expressed in at least one condition, while 15,902 showed no apparent expression. Genes that were not expressed in any condition and unannotated genes were removed from further analysis. Each gene identifier was labeled using the Glyma format, following the convention adopted by the Arabidopsis community.
Since one gene can have multiple transcripts or protein sequences due to alternative splicing, we extracted all the transcript variant IDs for the 53,175 expressed genes. For example, gene Glyma20g01000 had only one transcript variant, Glyma20g01000.1, while Glyma01g00300 had two transcript variants, Glyma01g00300.1 and Glyma01g00300.2. Protein sequence data were extracted from the Soybean Knowledge Base (SoyKB) [11, 12], which provided 35,505 protein sequences; protein sequence information for the remaining 17,670 of the 53,175 expressed genes was not available in SoyKB. During the gene ID conversion through BioMart, BLAST [21, 22] was used to align each transcript sequence against the EntrezGene database, and the EntrezGene identifier with a high degree of sequence similarity was assigned to the transcript ID. Because of this similarity-based mapping, the same transcript variant might have several different EntrezGene IDs, while several transcript variants might share an identical EntrezGene ID. In such situations, we downloaded the protein sequence information for all soybean genes in the KEGG pathway database, and the one-to-one mapping between Glyma-format ID and EntrezGene ID for each transcript was chosen based on sequence identity. This step removed transcripts whose sequences did not match the corresponding gene sequences in the KEGG database. Finally, the protein sequence information and gene expression values for a total of 26,873 protein-coding genes were assembled and used for metabolic pathway reconstruction.
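One simple way to resolve ambiguous ID mappings into a one-to-one assignment is a greedy pass over alignment hits sorted by sequence identity, sketched below. The greedy strategy, the function name, the identity values and the second EntrezGene ID are all assumptions for illustration; the text states only that the mapping was chosen based on sequence identity.

```python
def one_to_one_mapping(hits):
    """Choose a one-to-one transcript -> EntrezGene mapping from alignment hits.

    hits: iterable of (transcript_id, entrez_id, percent_identity).
    Assumption: hits are taken greedily from highest to lowest identity,
    and a hit is skipped if either of its IDs has already been assigned.
    """
    mapping, used_entrez = {}, set()
    for transcript, entrez, _identity in sorted(hits, key=lambda h: -h[2]):
        if transcript not in mapping and entrez not in used_entrez:
            mapping[transcript] = entrez
            used_entrez.add(entrez)
    return mapping


# Toy hits: the identity values and the ID "999999999" are invented.
hits = [
    ("Glyma01g00300.1", "100781438", 100.0),
    ("Glyma01g00300.2", "100781438", 97.5),
    ("Glyma01g00300.1", "999999999", 88.0),
]
mapping = one_to_one_mapping(hits)
```

In this toy case the second transcript variant is left unmapped because its only hit is already taken by a better-matching transcript, which mirrors the removal of non-matching transcripts described above.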
3.2 Metabolic pathway prediction
3.3 Function enrichment analysis
The top 10 enriched functions of 106 genes identified in the predicted network for the glyoxylate and dicarboxylate metabolism pathway (KO00630)
- Tricarboxylic acid cycle
- L-Malate dehydrogenase activity
- Glutamine biosynthetic process
- Glutamate-ammonia ligase activity
- Malate metabolic process
- L-Lactate dehydrogenase (cytochrome) activity
- Lactate metabolic process
In this study, we applied probabilistic graphical and knowledge-based methods to reconstruct soybean metabolic pathways based on a comprehensive transcriptome database. Based on the results, the method performed better than the traditional sequence-homology mapping method by predicting more real relationships in the pathways. Functional enrichment analysis of the predicted pathways also revealed that functionally related gene pairs were predicted successfully, enlarging the initial mapping network from KEGG. The good performance of this data- and knowledge-based probabilistic method provided fundamental, new biological information for soybean research and demonstrated that the method can be generally applicable to other genomes where similar starting data are available.
The work was partially supported by an NSF grant (IOS1025752), a grant from the US Department of Energy (DE-SC0004898), an NSF CAREER award (DBI1149224), and an NIH grant (R01GM093123).
- JW Anderson, BM Johnstone, ME Cook-Newell, Meta-analysis of the effects of soy protein intake on serum lipids. N. Engl. J. Med. 333(5), 276–282 (1995)
- X Zhang, XO Shu, Y-T Gao, G Yang, Q Li, H Li, F Jin, W Zheng, Soy food consumption is associated with lower risk of coronary heart disease in Chinese women. J. Nutr. 133(9), 2874–2878 (2003)
- M Libault, A Farmer, T Joshi, K Takahashi, RJ Langley, LD Franklin, J He, D Xu, G May, G Stacey, An integrated transcriptome atlas of the crop model Glycine max, and its use in comparative analyses in plants. Plant J. 63(1), 86–99 (2010)
- J Schmutz, SB Cannon, J Schlueter, J Ma, T Mitros, W Nelson, DL Hyten, Q Song, JJ Thelen, J Cheng, D Xu, U Hellsten, GD May, Y Yu, T Sakurai, T Umezawa, MK Bhattacharyya, D Sandhu, B Valliyodan, E Lindquist, M Peto, D Grant, S Shu, D Goodstein, K Barry, M Futrell-Griggs, B Abernathy, J Du, Z Tian, L Zhu, N Gill, T Joshi, M Libault, A Sethuraman, X-C Zhang, K Shinozaki, HT Nguyen, RA Wing, P Cregan, J Specht, J Grimwood, D Rokhsar, G Stacey, RC Shoemaker, SA Jackson, Genome sequence of the palaeopolyploid soybean. Nature 463(7278), 178–183 (2010)
- FF Aceituno, N Moseyko, SY Rhee, RA Gutiérrez, The rules of gene expression in plants: organ identity and gene body methylation are key factors for regulation of gene expression in Arabidopsis thaliana. BMC Genomics 9(1), 438 (2008)
- M Kanehisa, S Goto, KEGG: Kyoto encyclopedia of genes and genomes. Nucl. Acids Res. 28(1), 27–30 (2000)
- M Kanehisa, S Goto, Y Sato, M Furumichi, M Tanabe, KEGG for integration and interpretation of large-scale molecular data sets. Nucl. Acids Res. 40(1), D109–D114 (2011)
- Y Moriya, M Itoh, S Okuda, AC Yoshizawa, M Kanehisa, KAAS: an automatic genome annotation and pathway reconstruction server. Nucl. Acids Res. 35(Web Server issue), W182–5 (2007)
- Q Qi, J Li, J Cheng, Reconstruction of metabolic pathways by combining probabilistic graphical model-based and knowledge-based methods. BMC Proc. 8(6), S5 (2014)
- MD Robinson, DJ McCarthy, GK Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1), 139–140 (2010)
- T Joshi, K Patil, MR Fitzpatrick, LD Franklin, Q Yao, JR Cook, Z Wang, M Libault, L Brechenmacher, B Valliyodan, X Wu, J Cheng, G Stacey, HT Nguyen, D Xu, Soybean Knowledge Base (SoyKB): a web resource for soybean translational genomics. BMC Genomics 13(1), S15 (2012)
- T Joshi, MR Fitzpatrick, S Chen, Y Liu, H Zhang, RZ Endacott, EC Gaudiello, G Stacey, HT Nguyen, D Xu, Soybean knowledge base (SoyKB): a web resource for integration of soybean translational genomics and molecular breeding. Nucl. Acids Res. 42(Database issue), D1245–52 (2014)
- A Kasprzyk, BioMart: driving a paradigm change in biological data management. Database 2011, bar049 (2011)
- M Hall, E Frank, G Holmes, B Pfahringer, P Reutemann, IH Witten, The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1), 10–18 (2009)
- M Scutari, Learning Bayesian networks with the bnlearn R package, 2009
- Z Wang, X-C Zhang, MH Le, D Xu, G Stacey, J Cheng, A protein domain co-occurrence network approach for predicting protein function and inferring species phylogeny. PLoS ONE 6(3), e17906 (2011)
- Z Wang, R Cao, J Cheng, Three-level prediction of protein function by combining profile-sequence search, profile-profile search, and domain co-occurrence networks. BMC Bioinformatics 14, S3 (2013)
- M Ashburner, CA Ball, JA Blake, D Botstein, H Butler, JM Cherry, AP Davis, K Dolinski, SS Dwight, JT Eppig, MA Harris, DP Hill, L Issel-Tarver, A Kasarskis, S Lewis, JC Matese, JE Richardson, M Ringwald, GM Rubin, G Sherlock, Gene ontology: tool for the unification of biology. Nat. Genet. 25(1), 25–29 (2000)
- M Libault, A Farmer, L Brechenmacher, J Drnevich, RJ Langley, DD Bilgin, O Radwan, DJ Neece, SJ Clough, GD May, G Stacey, Complete transcriptome of the soybean root hair cell, a single-cell model, and its alteration in response to Bradyrhizobium japonicum infection. Plant Physiol. 152(2), 541–552 (2010)
- D Meinke, M Koornneef, Community standards for Arabidopsis genetics. Plant J. 12(2), 247–253 (1997)
- TA Tatusova, TL Madden, BLAST 2 sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett. 174(2), 247–50 (1999)
- SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, DJ Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–402 (1997)
- EJB Williams, DJ Bowles, Coexpression of neighboring genes in the genome of Arabidopsis thaliana. Genome Res. 14(6), 1060–1067 (2004)
- V Srinivasasainagendra, GP Page, T Mehta, I Coulibaly, AE Loraine, CressExpress: a tool for large-scale mining of expression data from Arabidopsis. Plant Physiol. 147(3), 1004–1016 (2008)
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.