- Research Article
- Open Access
How to Improve Postgenomic Knowledge Discovery Using Imputation
© Muhammad Shoaib B. Sehgal et al. 2009
- Received: 28 February 2008
- Accepted: 4 November 2008
- Published: 11 January 2009
While microarrays make it feasible to rapidly investigate many complex biological problems, their multistep fabrication has the proclivity for error at every stage. The standard tactic has been to either ignore or regard erroneous gene readings as missing values, though this assumption can exert a major influence upon postgenomic knowledge discovery methods like gene selection and gene regulatory network (GRN) reconstruction. This has been the catalyst for a raft of new flexible imputation algorithms including local least square impute and the recent heuristic collateral missing value imputation, which exploit the biological transactional behaviour of functionally correlated genes to afford accurate missing value estimation. This paper examines the influence of missing value imputation techniques upon postgenomic knowledge inference methods with results for various algorithms consistently corroborating that instead of ignoring missing values, recycling microarray data by flexible and robust imputation can provide substantial performance benefits for subsequent downstream procedures.
- Gene Regulatory Network
- Gene Selection
- Imputation Method
- Predictor Gene
- Breast Cancer Data
The study of genes and their transactional relationships with other genes can be modelled using machine learning algorithms in a diverse range of applications, from disease analysis and drug progression for target diseases through to evolutionary study and comparative genomics, all of which are characterised by their use of microarray gene expression data. The statistical analysis of microarray datasets depends highly upon the accuracy of the gene expression methods. Microarray production is a complex process, whereby samples are prepared for differential expression in a series of stages involving the laying of specimens on the slides by a robotic arm, imaging of the slides, and finally determining the numerical gene expression values. Each step inevitably exhibits a propensity for error, and a corollary of this is that the expression values recorded for certain genes are erroneous; these are popularly referred to as missing values. While microarray technology is continually being refined, there is an enormous amount of public domain gene expression data available that frequently contains at least 5% erroneous spots. Indeed, in many datasets, at least 60% of genes have one or more missing values, which can seriously impact on subsequent data analysis involving, for example, significant gene selection, gene regulatory network (GRN) reconstruction, and clustering algorithms [7, 8].
The simplest ways to address this problem are either to repeat the experiment, though this is often not feasible for economic reasons, or to ignore those samples containing missing values, though this is not recommended because of the limited number of available samples. Alternative strategies include row average/median imputation (substitution by the corresponding row average/median value) and the ubiquitous ZeroImpute, where missing values are replaced by zero. Both approaches produce high-variance estimates, as neither exploits the underlying data correlations, which can lead to higher estimation errors. The prevailing wisdom is to accurately estimate missing values by exploiting the latent correlation structure of the microarray data [8, 10], as manifested by the development of numerous microarray imputation techniques including collateral missing value estimation (CMVE), singular value decomposition impute (SVDImpute), K-nearest neighbour (KNN), least square impute (LSImpute), local LSImpute (LLSImpute), Bayesian principal component analysis (BPCA), a set-theoretic framework based on projection onto convex sets (POCSImpute), and most recently, heuristic collateral missing value imputation (HCMVI). In addition, other methods which use contextual information include gene ontology-based imputation (GOImpute) and a metadata-based imputation technique.
This paper will investigate the gene expression correlation assumption by empirically analysing different postgenomic knowledge discovery methods, including gene selection and GRN reconstruction techniques, in the presence of missing values, specifically for the breast and ovarian cancer datasets of Hedenfalk et al. and Jazaeri et al., respectively. The rationale for choosing these two datasets is that cancerous data generally lacks molecular homogeneity in tumour tissues, which makes missing value estimation far more challenging. Additionally, breast cancer is the second leading cause of cancer death in women today (following lung cancer), with 1 in 11 Australian women being diagnosed with the disease before the age of 75, and the number of breast cancer patients increasing every day as diagnosis methods improve. Ovarian cancer is the fourth most common cause of cancer-related deaths in American women of all ages, as well as being the most prevalent cause of death from gynaecologic malignancies in the United States.
Aside from the obvious numerical relevance of missing value estimation, another key driver is the biological significance of imputation, particularly algorithmic performance in estimating significant genes in microarray data that may be erroneously affected. Plakophilin 2 (PKP2), for example, is present in breast carcinoma cell lines and is significant as it serves as a marker for the identification and characterisation of carcinomas derived from, or corresponding to, simple and complex epithelia. As will be witnessed in Section 6, PKP2 is often not selected by gene selection methods when missing values are present and so would generally be either ignored or replaced when conventional estimation methods are applied. By judiciously employing a flexible imputation strategy such as HCMVI, however, the probability that these genes are correctly selected can be significantly enhanced. Similarly, the GRN reconstruction performance may be significantly influenced by missing values, with a substantial number of vital coregulation links being neglected when imputing by traditional and contemporary methods (Sections 3 and 4). The interaction in breast cancer data between ADP-ribosylation factor 3 and estrogen sulfotransferase (EST), which is similar to the NSAP1 protein, is, for instance, consistently overlooked when missing values are introduced, though it has been successfully reconstructed using flexible imputation methods (Section 5). In both scenarios, accurate imputation crucially eliminates the need for repeating an experiment, which can be costly and may be pragmatically infeasible.
This paper presents a treatise on existing imputation methods by examining their performance in managing microarray dataset missing values to improve postgenomic knowledge discovery. Concomitant with analysing the numerical accuracy of imputation, the biological significance for two proteins is analysed, namely, KIAA1025 and MHC, from the breast and ovarian cancer datasets, respectively, because of their acknowledged importance in diagnosing the different cancer types [27–29].
The remainder of the paper is organised as follows. After formally defining the nomenclature, Sections 3, 4, and 5 will, respectively, review the gamut of traditional, contemporary, and flexible microarray missing value imputation algorithms together with their particular epithets and limitations. A reflective analysis is then presented in Section 6 upon a series of experiments performed on various breast and ovarian cancer microarray datasets, including both statistical and biological significance interpretations, while some conclusions are provided in Section 7.
Imputation strategies have been broadly classified into three categories: traditional, contemporary, and flexible techniques. Original imputation approaches, which replace a missing value by either zero or row/column mean, are designated as traditional, as they are simple and computationally efficient, but do not take advantage of any latent correlation within the data. Contemporary techniques subsequently evolved to improve the estimation accuracy by using inherent data correlations, usually under the assumption that the causal correlation structure is either localised or globalised. They are also characterised by using a fixed number of predictor genes in the estimation which limits the flexibility to fully exploit any data correlations. This was the incentive for the most recent family of flexible imputation methods which are able to freely adapt to the data distribution by automatically determining the optimal number of predictor genes, thereby minimising the impact of missing values on subsequent biological analysis. In the following sections, these three imputation categories are, respectively, reviewed.
These are broadly characterised by replacing expression values of those genes that possess missing values by zero, their gene/sample mean or median, and in certain cases, by using the well-known KNN method. The advantages and disadvantages of these popular approaches are now discussed.
3.1. ZeroImpute and Mean/Median Imputation
In these methods, missing values are, respectively, replaced either by zero (ZeroImpute) or by the gene/sample average  and/or median. The attraction is their simplicity and computational efficiency, though none take advantage of the underlying correlation structure of the data, with the consequence that the data variance is generally high. This means that when there are a large number of missing values present in the microarray data, these imputation strategies can significantly compromise subsequent postgenomic analysis. The impact, however, can be reduced by adapting the estimation parameters to the underlying correlation structure of the data, with the following sections examining some well-established methods.
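The substitutions described above can be sketched in a few lines. This is a minimal illustration, not the code used in the paper: it assumes the expression matrix is a NumPy array with genes as rows and samples as columns, that missing values are encoded as NaN, and the function names `zero_impute` and `row_stat_impute` are hypothetical.

```python
import numpy as np

def zero_impute(Y):
    """ZeroImpute: replace every missing entry (NaN) with zero."""
    return np.where(np.isnan(Y), 0.0, Y)

def row_stat_impute(Y, stat="mean"):
    """Replace each missing entry with the mean or median of the
    observed values in the same row (assumed here to be one gene)."""
    fill = np.nanmean(Y, axis=1) if stat == "mean" else np.nanmedian(Y, axis=1)
    out = Y.copy()
    rows, cols = np.where(np.isnan(Y))
    out[rows, cols] = fill[rows]  # one fill value per row
    return out

# Toy matrix: 2 genes x 3 samples, one missing value per gene.
Y = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])
Y_zero = zero_impute(Y)
Y_mean = row_stat_impute(Y, "mean")
```

Neither function looks at any other gene, which is exactly why these estimates ignore the correlation structure of the data.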
3.2. Singular Value Decomposition-Based Imputation (SVDImpute)
This uses a combination of singular value decomposition (SVD) and expectation maximization (EM) to estimate the missing values by calculating mutually orthogonal expression patterns, often referred to as eigengenes. As SVD calculations require the entire matrix, missing values are first replaced by their row mean, after which the k most effective eigengenes are selected according to their corresponding eigenvalues. The imputed estimate for $Y_{ij}$ is then calculated by regressing gene $g_i$ against the k most effective eigengenes, with the expression values from sample j, which contains the missing value, being ignored. SVDImpute reduces imputation errors by iteratively re-estimating the missing values using the EM algorithm until the change in the matrices becomes less than an empirically determined threshold, nominally 0.01. The technique performs best when 20% of the eigengenes are used for estimation, and while it is a better strategy than high-variance approaches like ZeroImpute, it has the drawbacks of both being highly sensitive to noise and only considering global data correlations, which inevitably leads to higher estimation errors in locally correlated datasets.
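The iteration described above can be sketched as follows. This is an illustrative reading of the procedure rather than the reference implementation: the function name `svd_impute`, the NaN encoding of missing values, and the use of a relative Frobenius-norm change as the stopping measure are all assumptions.

```python
import numpy as np

def svd_impute(Y, k=2, tol=0.01, max_iter=100):
    """Sketch of SVD-based imputation on a genes x samples matrix."""
    missing = np.isnan(Y)
    X = Y.copy()
    # Initial fill: row means, because the SVD needs a complete matrix.
    row_means = np.nanmean(Y, axis=1)
    X[missing] = row_means[np.where(missing)[0]]
    for _ in range(max_iter):
        # Rows of Vt are the eigengenes, ordered by singular value.
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        eigengenes = Vt[:k]
        X_new = X.copy()
        for i in np.where(missing.any(axis=1))[0]:
            obs = ~missing[i]
            # Regress gene i on the eigengenes using observed samples only.
            coef, *_ = np.linalg.lstsq(eigengenes[:, obs].T, X[i, obs], rcond=None)
            X_new[i, ~obs] = coef @ eigengenes[:, ~obs]
        change = np.linalg.norm(X_new - X) / np.linalg.norm(X)
        X = X_new
        if change < tol:  # assumed stopping rule
            break
    return X
```

Because every gene is regressed on the same global eigengenes, the sketch makes visible why the method captures global rather than local correlation.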
3.3. K-Nearest Neighbour (KNN) Estimation
KNN estimates missing values by searching for the k nearest genes, normally by applying the Euclidean distance, and then taking the weighted average of these k genes. The k genes whose expression vectors are most similar to that of gene $g_i$ in all samples, except the sample j which contains the missing value, are selected. The similarity between $g_i$ and the other genes is determined by the Euclidean distance over the observed components, and the missing value is estimated as the weighted average of the corresponding entries in the selected k expression vectors, where the contribution of every gene is scaled by the similarity of its expression to $g_i$.
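A minimal sketch of this estimate for a single missing entry is given below. The function name `knn_impute_value` is hypothetical, missing values are assumed to be NaN, and inverse-distance weighting is one common way of scaling each neighbour by its similarity; the exact weighting used in the cited work is not given here.

```python
import numpy as np

def knn_impute_value(Y, i, j, k=3):
    """Estimate Y[i, j] from the k genes closest to gene i (Euclidean distance)."""
    # Samples where gene i is observed, excluding the target sample j.
    cols = [c for c in range(Y.shape[1]) if c != j and not np.isnan(Y[i, c])]
    candidates = []
    for g in range(Y.shape[0]):
        # A neighbour must be observed in sample j and in all compared samples.
        if g == i or np.isnan(Y[g, j]) or np.isnan(Y[g, cols]).any():
            continue
        dist = np.sqrt(np.sum((Y[g, cols] - Y[i, cols]) ** 2))
        candidates.append((dist, g))
    candidates.sort()
    nearest = candidates[:k]
    # Assumed weighting: inverse distance (small constant avoids division by zero).
    weights = np.array([1.0 / (d + 1e-12) for d, _ in nearest])
    values = np.array([Y[g, j] for _, g in nearest])
    return float(np.sum(weights * values) / np.sum(weights))
```

Note that k is an argument fixed by the caller, which is the limitation discussed in the remainder of this subsection.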
While KNN is flexible in terms of the choice of similarity measure, it does imply the performance of a specific metric is data dependent. Troyanskaya et al.  demonstrated that Euclidean distance performs better than other similarity measures for microarray data, and though it is highly sensitive to microarray data outliers, log-transforming the data can significantly reduce their effect in determining gene similarity.
The choice of an appropriate k value especially influences imputation performance. Experimental results have established a best choice of k for small datasets, while Troyanskaya et al. observed that KNN is insensitive to values of k in the range 10 to 20. The key point to emphasise is that, regardless of the underlying structure of the microarray data, a preset value of k is employed, which clearly does not fully harness the capability of an imputation method. A much more creative strategy is to endeavour to automatically determine the best k value from the data correlation structure, which is the fundamental premise of the two flexible imputation techniques described in Section 5.
Summarising, while traditional algorithms have been widely adopted, the inherently high data variance has a major impact on downstream analysis methods like significant gene selection, class prediction, and GRN reconstruction. To relax this restriction, more robust techniques have evolved in an attempt to garner superior performance in terms of estimation accuracy, although as will be witnessed, they still exhibit some limitations, most notably from a biological significance perspective. Section 4 focuses on some of the most well-established contemporary imputation approaches.
This category embraces those methods that implicitly attempt to lower the data variance of missing value estimates, by seeking to exploit the underlying localised or global correlation structure of the microarray data. Some of the most popular algorithms together with their relative merits and demerits will now be investigated.
4.1. Least Square Impute (LSImpute) Estimation
This is a regression-based method that exploits the correlation between genes. There are three variants of the LSImpute algorithm, namely, LSImpute-Gene, LSImpute-Array, and LSImpute-Adaptive. LSImpute-Gene estimates missing values using the correlation between the genes (intrasample), LSImpute-Array exploits intersample correlation, and LSImpute-Adaptive combines both techniques using a bootstrapping approach. The common features of all three LSImpute variants will now be delineated.
To estimate missing value $Y_{ij}$ in (1), the k most-correlated genes are first selected, namely those whose expression vectors are most similar to gene i from Y in all samples except j, and which do not themselves contain any missing values. As LSImpute-Gene is based upon regression, it mandates that the number of model parameters must be lower than the number of observations, though in general for microarray data, the number of genes is much greater than the number of samples. The algorithm therefore computes a separate regressive estimate from each selected gene, and the missing value estimate is obtained from their weighted average.
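The gene-wise regression step can be illustrated with the sketch below. It is a simplified reading of the description above, not the published algorithm: the function name `lsimpute_gene_value` is hypothetical, and weighting each single-gene estimate by the absolute Pearson correlation is an assumption made for illustration, since the weighting used by LSImpute is not given in this text.

```python
import numpy as np

def lsimpute_gene_value(Y, i, j, k=2):
    """Estimate Y[i, j] from single-gene regressions on the k most correlated genes."""
    # Samples where gene i is observed, excluding the target sample j.
    obs = [c for c in range(Y.shape[1]) if c != j and not np.isnan(Y[i, c])]
    y = Y[i, obs]
    scored = []
    for g in range(Y.shape[0]):
        if g == i or np.isnan(Y[g]).any():  # predictors must be complete
            continue
        r = np.corrcoef(Y[g, obs], y)[0, 1]
        scored.append((abs(r), g))
    scored.sort(reverse=True)
    estimates, weights = [], []
    for r_abs, g in scored[:k]:
        # One simple linear regression per predictor gene: y = slope * x + intercept.
        slope, intercept = np.polyfit(Y[g, obs], y, 1)
        estimates.append(slope * Y[g, j] + intercept)
        weights.append(r_abs)  # assumed weighting scheme
    return float(np.average(estimates, weights=weights))
```

Fitting one small regression per predictor gene keeps the number of parameters below the number of observed samples, which is the constraint noted above.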
While LSImpute-Gene affords greater accuracy than traditional imputation methods like KNN and SVDImpute (Section 3), it still has the same fundamental limitation of using a preset k value. Bø et al., for example, empirically determined the most suitable value of k for their particular dataset, though crucially this finding is data dependent and not generic. They also demonstrated that this imputation approach works better if missing values are initially approximated by LSImpute-Gene and then refined with LSImpute-Array. This lowers the imputation error, though it commensurately increases the computational overhead, and since it still employs LSImpute-Gene prior to any estimation, the value of k is always fixed.
LSImpute-Adaptive combines the strengths of both LSImpute-Gene and LSImpute-Array by fusing their respective imputation results. It modifies the weights for each imputation using a bootstrapping process, with empirical results endorsing that this strategy performs better than when either variant is applied separately.
With the flexibility to adjust the number of predictor genes in the regression, LSImpute performs best when data exhibits a strong local correlation structure, though the comparative prediction accuracy is still inferior to that achieved by the new flexible imputation algorithms, which dynamically determine k directly from the data (Section 5).
4.2. Bayesian Principal Component Analysis (BPCA) Estimation
BPCA estimates missing values using Bayesian estimation theory with a variational algorithm to calculate the model parameters and ultimately the imputed value $Y_{ij}$. The posterior distribution of the missing value and the posterior distribution of the model parameters are first computed from genes having no missing values, and since this calculation requires the complete matrix, missing values are replaced by their corresponding gene averages. The model parameters are then used to compute the current posterior distribution, with the maximum likelihood parameters being iteratively updated using the current posterior distribution of model parameters and missing values, until convergence is reached.
By considering only global correlations within a dataset, BPCA has a distinct advantage in terms of prediction speed compared with all the other imputation techniques analysed, but its performance is highly dependent on either a strong underlying global correlation within the data or having a very high number of samples. This advantage is offset by the likelihood of high imputation errors when the dataset either is locally correlated or comprises a small number of samples.
4.3. Collateral Missing Value Estimation (CMVE) 
CMVE forms its final estimate of a missing value as a weighted sum of three constituent imputation estimates, with a separate weight assigned to each constituent estimate.
CMVE uses LS regression of the k most correlated genes for the first missing value estimate, while nonnegative least squares (NNLS) and linear programming compute the other two estimates. The rationale for including NNLS is that unnormalised microarray data has only positive values, so NNLS can exploit the positive search space. If the data is either normalized or log-transformed then it will contain some negative values, so LS regression is used for this particular estimation. Since both the Pearson correlation and the covariance functions necessitate complete matrices, CMVE first replaces all missing values by gene averages. Once an initial missing value estimate is generated, the new estimated value is used in all future predictions, which is a distinctive feature of this particular imputation strategy.
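As a sketch of the fusion step, the final CMVE estimate can be written as a weighted sum of the three constituent estimates. The symbols below are placeholders chosen for illustration and are not taken from the source: $\Phi_1$ stands for the LS regression estimate, $\Phi_2$ and $\Phi_3$ for the NNLS and linear programming estimates, and $w_1, w_2, w_3$ for their weights.

```latex
\hat{Y}_{ij} = w_1\,\Phi_1 + w_2\,\Phi_2 + w_3\,\Phi_3
```

Written this way, it is clear that the quality of the final estimate depends on all three constituent estimators as well as on the preset number of predictor genes k that each of them uses.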
CMVE has been proven to perform best for locally correlated data, providing consistently superior imputation quality compared to all the aforementioned techniques, by virtue of the property of recycling estimated values in future predictions . It is also more robust as witnessed by its performance in the presence of high numbers of missing values. The main drawback of CMVE, just like all the other contemporary algorithms, is the preset value of k which means that it does not fully adapt to the correlation structure of the data and compromises performance when data has a global structure.
In summarising the imputation methods reviewed so far, the main assumption relates to the underlying correlation structure of the dataset, where KNN, LSImpute, and CMVE perform better when data is locally correlated, while SVDImpute and BPCA are more apposite for missing value estimation in globally correlated datasets. From a postgenomic knowledge inference viewpoint, however, any estimation strategy must be adapted to the correlation data structure so imputation performs equally well for both types of correlated data. The next section presents two recent flexible imputation methods that exhibit this propitious property, in automatically adapting to the data correlation structure to produce minimal imputation error.
Flexible imputation techniques use, to some extent, core building blocks developed for their contemporary estimation counterparts in Section 4, and are characterised by automatically selecting, a priori, the optimal number of estimator genes from the data correlation structure. This avoids the problem that, if the data is globally correlated, a small number of predictor genes (low k value) may ignore genes that are strongly correlated with the gene having the missing value. Conversely, when an unnecessarily large number of genes (high k value) is used, this can introduce genes for prediction which have little or no correlation with the gene with missing values. Two techniques are reviewed in this category.
5.1. Local Least Square Impute (LLSImpute)
This is similar to LSImpute in that it estimates missing values by constructing a linear combination of correlated genes using LS principles. The crucial difference is that in estimating $Y_{ij}$, the number of predictor genes k is heuristically determined directly from the dataset. To determine the optimum k, LLSImpute artificially removes a known value from the most correlated gene $g_i$ before iteratively estimating it over a range of k values, with the k that produces the minimum estimation error then being used for imputation.
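The k-selection heuristic can be sketched as below. This is a simplified illustration under stated assumptions: the function names `lls_estimate` and `choose_k` are hypothetical, neighbours are ranked by the L2 norm, only genes without missing values are used as predictors, and a single known entry is hidden and re-estimated.

```python
import numpy as np

def lls_estimate(Y, i, j, k):
    """Local least squares estimate of Y[i, j] from the k nearest complete genes."""
    # Samples where gene i is observed, excluding the target sample j.
    obs = [c for c in range(Y.shape[1]) if c != j and not np.isnan(Y[i, c])]
    complete = [g for g in range(Y.shape[0]) if g != i and not np.isnan(Y[g]).any()]
    complete.sort(key=lambda g: np.linalg.norm(Y[g, obs] - Y[i, obs]))  # L2 norm
    nearest = complete[:k]
    A = Y[np.ix_(nearest, obs)].T  # (observed samples) x k design matrix
    coef, *_ = np.linalg.lstsq(A, Y[i, obs], rcond=None)
    return float(coef @ Y[nearest, j])

def choose_k(Y, i, j_known, k_values):
    """Treat the known entry Y[i, j_known] as missing, estimate it for each k,
    and return the k with the smallest absolute error."""
    truth = Y[i, j_known]
    errors = {k: abs(lls_estimate(Y, i, j_known, k) - truth) for k in k_values}
    return min(errors, key=errors.get)
```

The selected k would then be passed to `lls_estimate` for the genuinely missing entries, so the number of predictor genes follows the data rather than a preset value.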
Kim et al. employed the $L_2$ norm as well as Pearson correlation to identify the most correlated genes, with the $L_2$ norm reported to perform slightly better than the Pearson correlation method for the chosen experimental data, although the difference in prediction accuracies between the two approaches was statistically insignificant.
In comparison with the various traditional and contemporary approaches, LLSImpute adapts to the underlying correlated data structure, with the corollary being superior imputation performance, and while it incurs a considerably higher computational cost, from a microarray data perspective, missing value estimation accuracy always has a greater priority than computational complexity.
5.2. Heuristic Collateral Missing Value Imputation (HCMVI)
This uses the multiestimate CMVE algorithm detailed in Section 4 as its kernel building block to formulate the final imputation of missing value $Y_{ij}$. It is analogous to LLSImpute in that it also automatically determines the optimal number of predictor genes k, in this case by using Monte Carlo (MC) simulation. It selects multiple matrices with known gene expression values, with each matrix having a selection probability in the MC simulation. HCMVI then identifies the most-correlated matrix from the Pearson correlation between each selected matrix and the gene expression matrix Y. These known values are then estimated by CMVE for a range of k values, with the optimal k being the one that generates the minimum estimation error.
HCMVI retains all the enhanced imputation performance characteristics and advantages of the original CMVE algorithm, while crucially automatically adapting to the underlying correlation structure of the microarray data, though as with LLSImpute, it incurs an additional computational overhead.
This section will rigorously examine the influence the aforementioned imputation strategies have in improving missing-value estimation accuracy for postgenomic knowledge discovery methods such as significant gene selection, allied with the biological significance of the imputation. Six different microarray datasets for breast and ovarian cancer tissues are used, with the data being log-transformed and normalized in order to remove experimental variations.
The breast cancer dataset contained 7, 7, and 8 samples of BRCA1, BRCA2, and sporadic mutations (neither BRCA1 nor BRCA2), respectively, while the ovarian cancer dataset contained 16, 16, and 18 samples, respectively, of BRCA1, BRCA2, and sporadic mutations. Each breast cancer data sample contained microarray data of 3226 genes, and there were 6445 genetic expressions per sample for the ovarian dataset. It is worth noting that the number of probes differs between the breast and ovarian cancer datasets. The data were generated by different labs under different experimental conditions and thus represent experimental variations.
To equitably evaluate the performance of the traditional and contemporary imputation algorithms on downstream biological analysis methods, the number of predictor genes k was fixed at the same preset value in all experiments. In contrast, the two flexible imputation methods (LLSImpute and HCMVI) automatically determine k by adapting to the correlation structure of the data. Also in this empirical analysis, the LLSImpute variant based upon the $L_2$ norm is applied due to its superior performance. In the next section, the influence of imputation on both significant gene selection and GRN reconstruction (STEP 4 in Figure 1) is investigated.
6.1. Imputation and Biological Significance of Selected Genes
To explore the impact of each estimation algorithm upon significant gene selection, a set of significant genes has been chosen from the original dataset using the between sum of squares to within sum of squares (BSS/WSS) method, which identifies genes that concomitantly have large interclass and small intraclass variations. The main reason for adopting this particular method is its proven superior capability to select significant genes compared with other popular methods such as the t-test. To assess the effect of missing values on gene selection, experiments were performed with missing value probabilities ranging from 0.01 to 0.2, with values being iteratively removed from the original gene expression matrix Y in (1). These were then estimated using ZeroImpute, KNN, LLSImpute, BPCA, CMVE, and HCMVI, respectively, to form an estimated matrix for each method, from which a set of p significant genes was selected using BSS/WSS. The genes selected from each estimated matrix were then compared with those selected from the original data to obtain the true positive percentage accuracy (%Accuracy) metric, which provides a dispassionate measure of the estimation performance of each algorithm.
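The evaluation loop described above can be sketched as follows. The code assumes the usual definition of the BSS/WSS ratio (between-class over within-class sum of squares, computed per gene) and treats %Accuracy as the percentage of originally selected genes that are still selected after imputation; the function names `bss_wss`, `top_genes`, and `percent_accuracy` are hypothetical.

```python
import numpy as np

def bss_wss(Y, labels):
    """BSS/WSS ratio per gene; Y is genes x samples, labels holds each sample's class."""
    labels = np.asarray(labels)
    overall = Y.mean(axis=1, keepdims=True)
    bss = np.zeros(Y.shape[0])
    wss = np.zeros(Y.shape[0])
    for c in np.unique(labels):
        block = Y[:, labels == c]
        centre = block.mean(axis=1, keepdims=True)
        bss += block.shape[1] * (centre - overall).ravel() ** 2  # between classes
        wss += ((block - centre) ** 2).sum(axis=1)               # within classes
    return bss / wss

def top_genes(Y, labels, p):
    """Indices of the p genes with the largest BSS/WSS ratio."""
    return set(np.argsort(bss_wss(Y, labels))[::-1][:p].tolist())

def percent_accuracy(selected_original, selected_imputed):
    """Share of the originally selected genes recovered after imputation."""
    return 100.0 * len(selected_original & selected_imputed) / len(selected_original)
```

In use, `top_genes` would be called once on the original matrix and once on each imputed matrix, and `percent_accuracy` would compare the two selections for each imputation method and missing value rate.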
Imputation algorithm performance has normally only been assessed numerically, with considerable debate within the research community of the suitability of standard evaluation measures, such as normalised RMS error (NRMSE). Interpreting the results from a biological significance perspective has not received the same attention, though the impact of missing values on selected genes in postgenomic knowledge discovery is clearly a major factor in algorithmic performance assessment.
6.2. Biological Significance of Imputation
While the primary focus is on the estimation accuracy of an imputation method, it is equally important to conduct an investigation into the biological significance of certain selected genes for the respective datasets when evaluating the impact of missing values on gene selection. Indeed, it is constructive to ascertain whether a particular imputation technique assists the gene selection methods in identifying known and novel genes for a given sample. This may provide not only valuable information for the design of basic mechanistic, diagnostic, and biomarker studies, but also valuable data for use in the construction of gene networks and pathways involved in processes like oncogenesis and resistance to tumour induction.
In examining the results for both the breast and ovarian cancer datasets, a number of genes were overlooked by traditional methods when missing values were introduced and processed, even though independent experiments have confirmed that these genes show altered expression in tumor lines and so can be very important in oncogenesis. This set of genes has not only been selected by the BSS/WSS algorithm, but has also been revalidated using the modified t-test with greedy pairs method, which minimizes the bias of the gene selection strategy towards either a particular imputation technique or a set of genes.
Table 1: KIAA1025 and plakophilin 2 selection in the breast cancer dataset across the range of missing values.
Table 2: MHC class II = DQ alpha and MHC class II = DQ beta selection in the ovarian cancer dataset across the range of missing values.
Interestingly, for both cancer datasets, across the full missing value range from 1% to 20%, these regulated genes have been correctly identified when gene selection has been preceded by HCMVI imputation as confirmed in Tables 1 and 2. It highlights that consideration of the biological significance of any imputation is extremely important and underscores the need for accurate estimation prior to gene selection, particularly in the presence of higher numbers of missing values.
As alluded to earlier, existing GRN reconstruction methods conventionally replace missing values by either ZeroImpute or gene average [30, 42], despite both inevitably impacting upon subsequent GRN reconstruction, as will now be more fully examined.
6.3. Impact of Missing Values on Gene Regulatory Network Reconstruction
To evaluate the influence of missing values, the algorithm for the reconstruction of accurate cellular networks (ARACNE) has been employed because it affords better performance than alternative approaches like Bayesian networks, and because it has been tested for mammalian gene network reconstruction and compared with other techniques that are normally applied to simple eukaryotes such as Saccharomyces cerevisiae.
ARACNE first computes the statistically significant gene-gene coregulation using mutual information before applying a data processing inequality to prune indirect relationships, that is, genes which are coregulated through one or more intermediate genes. To comparatively evaluate the respective imputation performances on GRN reconstruction, the number of conserved links is determined, that is, the number of coregulation links present in both the network reconstructed from the original data and the network reconstructed from the imputed data. The gene network is first constructed using ARACNE from the original data Y with no missing values. As in the previous experiments, up to 20% missing values have been randomly introduced and then, respectively, estimated using traditional, contemporary, and flexible imputation methods (Sections 3–5, respectively). The corresponding gene networks are then constructed from the imputed data and compared with the original network to ascertain the conserved links.
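The pruning and comparison steps can be illustrated with the sketch below. It is not the ARACNE implementation: the mutual information matrix `mi` is assumed to be supplied (its estimation is omitted), the function names `dpi_prune` and `conserved_links` are hypothetical, and the data processing inequality is applied in its simplest form, dropping the weakest edge of every fully connected gene triplet.

```python
import numpy as np
from itertools import combinations

def dpi_prune(mi, threshold=0.0):
    """Keep edges whose mutual information exceeds the threshold, then drop
    the weakest edge of each fully connected triplet (assumed indirect)."""
    n = mi.shape[0]
    edges = {frozenset((a, b)) for a, b in combinations(range(n), 2)
             if mi[a, b] > threshold}
    removed = set()
    for a, b, c in combinations(range(n), 3):
        trio = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(e in edges for e in trio):
            # mi is assumed symmetric, so the index order does not matter.
            removed.add(min(trio, key=lambda e: mi[tuple(e)]))
    return edges - removed

def conserved_links(original_edges, imputed_edges):
    """Undirected links present in both the original and the imputed network."""
    as_sets = lambda edges: {frozenset(e) for e in edges}
    return as_sets(original_edges) & as_sets(imputed_edges)
```

Counting `len(conserved_links(...))` for each imputation method gives a directly comparable measure of how much of the original network survives the introduction and estimation of missing values.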
In the ovarian cancer dataset, the interaction link between Ro ribonucleoprotein autoantigen (Ro/SS-A) = autoantigen calreticulin and Glutathione S-transferase theta 1 was not identified in the BRCA1 data when missing values were introduced, but was regenerated when these missing values were imputed using HCMVI. Similarly, coregulation between Inhibitor of DNA binding 3, dominant negative helix-loop-helix protein, and p53 in the BRCA2 ovarian cancer dataset was also missed, but the link was reconstructed when HCMVI imputation was applied across the range of missing values. In the sporadic ovarian cancer dataset, transcriptional links between CD97 and RAB-10 were again only successfully reconstructed using HCMVI, while they were overlooked by all other estimation methods, again underpinning the significance of accurate missing value imputation prior to GRN reconstruction.
6.4. Significance Test Results
6.5. Normalized Root Mean Square Error
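The NRMSE is computed as

$$\mathrm{NRMSE} = \frac{\mathrm{RMS}(Y - Y_{\mathrm{est}})}{\mathrm{RMS}(Y)},$$

where Y is the original data matrix and $Y_{\mathrm{est}}$ is the estimated matrix using HCMVI, CMVE, BPCA, LLSImpute, and KNN, respectively. This particular measure has been used by Sehgal et al., Ouyang et al., and Tuikkala et al. for error estimation because $\mathrm{NRMSE} = 1$ for zero imputation.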
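A minimal sketch of the error computation follows, assuming the common definition of NRMSE as the root mean square of the estimation error divided by the root mean square of the true values. The function name `nrmse` and the optional `missing_mask` argument, which restricts the comparison to the entries that were removed, are illustrative assumptions.

```python
import numpy as np

def nrmse(Y, Y_est, missing_mask=None):
    """RMS(Y - Y_est) / RMS(Y), optionally restricted to the imputed entries."""
    if missing_mask is None:
        missing_mask = np.ones(Y.shape, dtype=bool)
    diff = (Y - Y_est)[missing_mask]
    truth = Y[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2)) / np.sqrt(np.mean(truth ** 2)))

# Toy check: zero imputation of the removed entries gives an NRMSE of 1.
Y = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[False, True], [True, False]])
Y_zero = np.where(mask, 0.0, Y)
zero_score = nrmse(Y, Y_zero, mask)
```

Under this definition any method scoring below 1 on the removed entries is doing better than ZeroImpute, which makes the measure easy to interpret across datasets.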
This paper has pragmatically argued that imputation can be effectively applied to recycle microarray data and in doing so provide many potential benefits ranging from cost savings to performance enhancements in postgenomic knowledge discovery. While it is acknowledged that ZeroImpute and other traditional missing value imputation strategies are straightforward to implement, new flexible methods have been shown to exhibit much superior accuracy and performance from both statistical and biological significance perspectives, by virtue of their innate ability to exploit any underlying data correlation structures. A comprehensive study of missing values in microarray data has been presented, and their subsequent impact upon postgenomic knowledge discovery methods, including significant gene selection and gene regulatory network reconstruction, has been investigated. Empirical analysis has consistently shown that rather than merely ignoring missing values, which has been the preferred approach to resolve this problem, flexible and robust imputation algorithms afford considerable performance benefits and so should, wherever possible, be mandated prior to any knowledge inference process using microarray data.
- Sutphin PD, Raychaudhuri S, Denko NC, Altman RB, Giaccia AJ: Application of supervised machine learning to identify genes associated with the hypoxia response. Nature Genetics 2001, 27: 90.
- Schmatz D, Friend S: A simple recipe for drug interaction networks earns its stars. Nature Genetics 2006, 38(4):405-406. 10.1038/ng0406-405
- Joron M, Jiggins CD, Papanicolaou A, McMillan WO: Heliconius wing patterns: an evo-devo model for understanding phenotypic diversity. Heredity 2006, 97(3):157-167. 10.1038/sj.hdy.6800873
- Ioshikhes IP, Albert I, Zanton SJ, Pugh BF: Nucleosome positions predicted through comparative genomics. Nature Genetics 2006, 38(10):1210-1215. 10.1038/ng1878
- Brazma A, Hingamp P, Quackenbush J, et al.: Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nature Genetics 2001, 29(4):365-371. 10.1038/ng1201-365
- Tuikkala J, Elo L, Nevalainen OS, Aittokallio T: Improving missing value estimation in microarray data with gene ontology. Bioinformatics 2006, 22(5):566-572. 10.1093/bioinformatics/btk019
- Acuna E, Rodriguez C: The treatment of missing values and its effect in the classifier accuracy. In Classification, Clustering and Data Mining Applications. Springer, Berlin, Germany; 2004:639-648.
- Kim H, Golub GH, Park H: Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 2005, 21(2):187-198. 10.1093/bioinformatics/bth499
- Troyanskaya O, Cantor M, Sherlock G, et al.: Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17(6):520-525. 10.1093/bioinformatics/17.6.520
- Bø TH, Dysvik B, Jonassen I: LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Research 2004, 32(3):e34. 10.1093/nar/gnh026
- Sehgal MSB, Gondal I, Dooley LS: Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data. Bioinformatics 2005, 21(10):2417-2423. 10.1093/bioinformatics/bti345
- Oba S, Sato M-A, Takemasa I, Monden M, Matsubara K-I, Ishii S: A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 2003, 19(16):2088-2096. 10.1093/bioinformatics/btg287
- Gan X, Liew AW-C, Yan H: Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic Acids Research 2006, 34(5):1608-1619. 10.1093/nar/gkl047
- Sehgal MSB, Gondal I, Dooley LS, Coppel R: Heuristic non parametric collateral missing value imputation: a step towards robust post-genomic knowledge discovery. In Pattern Recognition in Bioinformatics, Lecture Notes in Computer Science. Springer, Berlin, Germany; 2008:373-387.
- Tuikkala J, Elo L, Nevalainen OS, Aittokallio T: Improving missing value estimation in microarray data with gene ontology. Bioinformatics 2006, 22(5):566-572. 10.1093/bioinformatics/btk019
- Jörnsten R, Ouyang M, Wang H-Y: A meta-data based method for DNA microarray imputation. BMC Bioinformatics 2007, 8: 1-10. article 109 10.1186/1471-2105-8-1
- Hedenfalk I, Duggan D, Chen Y, et al.: Gene-expression profiles in hereditary breast cancer. The New England Journal of Medicine 2001, 344(8):539-548. 10.1056/NEJM200102223440801
- Jazaeri AA, Yee CJ, Sotiriou C, Brantley KR, Boyd J, Liu ET: Gene expression profiles of BRCA1-linked, BRCA2-linked, and sporadic ovarian cancers. Journal of the National Cancer Institute 2002, 94(13):990-1000.
- Jörnsten R, Wang H-Y, Welsh WJ, Ouyang M: DNA microarray data imputation and significance analysis of differential expression. Bioinformatics 2005, 21(22):4155-4161. 10.1093/bioinformatics/bti638
- Laurier J: Alarming increase in cancer rates. WHO report, 2003
- Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16(10):906-914. 10.1093/bioinformatics/16.10.906
- Keedwell E, Narayanan A: Intelligent Bioinformatics: The Application of Artificial Intelligence Techniques to Bioinformatics. John Wiley & Sons, New York, NY, USA; 2005.
- Brás LP, Menezes JC: Improving cluster-based missing value estimation of DNA microarray data. Biomolecular Engineering 2007, 24(2):273-282. 10.1016/j.bioeng.2007.04.003
- Wong DSV, Wong FK, Wood GR: A multi-stage approach to clustering and imputation of gene expression profiles. Bioinformatics 2007, 23(8):998-1005. 10.1093/bioinformatics/btm053
- Mertens C, Kuhn C, Franke WW: Plakophilins 2a and 2b: constitutive proteins of dual location in the karyoplasm and the desmosomal plaque. Journal of Cell Biology 1996, 135(4):1009-1025. 10.1083/jcb.135.4.1009
- Mertens C, Kuhn C, Moll R, Schwetlick I, Franke WW: Desmosomal plakophilin 2 as a differentiation marker in normal and malignant tissues. Differentiation 1999, 64(5):277-290. 10.1046/j.1432-0436.1999.6450277.x
- Jansen E, Laven JSE, Dommerholt HBR, et al.: Abnormal gene expression profiles in human ovaries from polycystic ovary syndrome patients. Molecular Endocrinology 2004, 18(12):3050-3063. 10.1210/me.2004-0074
- Lu M, Thompson WA, Lawlor DA, Reveille JD, Lee JE: Rapid direct determination of in the whole blood of normal individuals and cancer patients by specific polymerase chain reaction amplification. Journal of Immunological Methods 1996, 199(1):61-68. 10.1016/S0022-1759(96)00167-6
- Harvell DME, Richer JK, Allred DC, Sartorius CA, Horwitz KB: Estradiol regulates different genes in human breast tumor xenografts compared with the identical cells in culture. Endocrinology 2006, 147(2):700-713.
- Xu H, Wu P, Wu CFJ, Tidwell C, Wang Y: A smooth response surface algorithm for constructing a gene regulatory network. Physiological Genomics 2003, 11(1):11-20.
- Whittle P: Probability via Expectation. 3rd edition. Springer, Berlin, Germany; 1992.
- Mooney CZ, Duval RD: Bootstrapping: A Nonparametric Approach to Statistical Inference. Sage, Thousand Oaks, Calif, USA; 1993.
- McLachlan G, Peel D: Finite Mixture Models, Wiley Series in Probability and Statistics. Wiley-Interscience, New York, NY, USA; 2000.
- Kim K-Y, Kim B-J, Yi G-S: Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics 2004, 5: 1-9. article 160 10.1186/1471-2105-5-1
- Casella G, Robert CP: Monte Carlo Statistical Methods. Springer, Berlin, Germany; 2005.
- Abelson RP: Statistics as Principled Argument. Lawrence Erlbaum, Mahwah, NJ, USA; 1995.
- Wilcox RR: Fundamentals of Modern Statistical Methods. Springer, Berlin, Germany; 2001.
- Scholkopf B, Tsuda K, Vert J-P: Kernel Methods in Computational Biology. MIT Press, Cambridge, Mass, USA; 2004.
- Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 2002, 97(457):77-86. 10.1198/016214502753479248
- Salceda S, Drumright C, DiEgidio A, et al.: Identification of differentially expressed genes in breast cancer. Nature Genetics 2001, 27: 83-84.
- Bø TH, Jonassen I: New feature subset selection procedures for classification of expression profiles. Genome Biology 2002, 3(4):research0017.1-research0017.11. 10.1186/gb-2002-3-4-research0017
- Choi JK, Yu U, Yoo OJ, Kim S: Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics 2005, 21(24):4348-4355. 10.1093/bioinformatics/bti722
- Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A: Reverse engineering of regulatory networks in human B cells. Nature Genetics 2005, 37(4):382-390. 10.1038/ng1532
- Jensen FV: Bayesian Networks and Decision Graphs. 2nd edition. Springer, Berlin, Germany; 2002.
- Ihmels J, Levy R, Barkai N: Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae. Nature Biotechnology 2004, 22(1):86-92. 10.1038/nbt918
- Margolin AA, Nemenman I, Basso K, et al.: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 2006, 7(supplement 1):1-15. article S7
- Bansal M, Belcastro V, Ambesi-Impiombato A, Di Bernardo D: How to infer gene networks from expression profiles. Molecular Systems Biology 2007, 3: 1-10. article 78
- Ouyang M, Welsh WJ, Georgopoulos P: Gaussian mixture clustering and imputation of microarray data. Bioinformatics 2004, 20(6):917-923. 10.1093/bioinformatics/bth007
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.