# Bayesian Hierarchical Model for Estimating Gene Expression Intensity Using Multiple Scanned Microarrays

Rashi Gupta^{1, 2}, Elja Arjas^{1, 3}, Sangita Kulathinal^{1}, Andrew Thomas^{1} and Petri Auvinen^{2}

**2008**:231950

https://doi.org/10.1155/2008/231950

© Rashi Gupta et al. 2008

**Received:** 7 July 2007

**Accepted:** 28 November 2007

**Published:** 12 December 2007

## Abstract

We propose a method for improving the quality of signal from DNA microarrays by using several scans at varying scanner sensitivities. A Bayesian latent intensity model is introduced for the analysis of such data. The method improves the accuracy with which expressions can be measured in all ranges and extends the dynamic range of measured gene expression at the high end. Our method is generic and can be applied to data from any organism, for imaging with any scanner that allows varying the laser power, and for extraction with any image analysis software. Results from a self-self hybridization data set illustrate an improved precision in the estimation of the expression of genes compared to what can be achieved by applying standard methods and using only a single scan.

## 1. Introduction

DNA microarray technology is used to study simultaneously the expression profile of a large number of distinct genes [1]. Several factors contribute to the accuracy with which these genes and their expressions (also referred to here as intensities) can be determined. In particular, very low or very high intensities may lead to poor estimation of the ratio between the two samples and thus to an incorrect identification of differentially expressed genes. Low intensities tend to be noisy and lead to highly variable ratio estimates, whereas very high intensities are saturated from above and hence give biased results.

One of the objectives of microarray experiments is to identify a subset of genes that are differentially expressed between the samples of interest. The relative intensity between the samples at a spot (also referred to here as a gene) is extracted by applying suitable image processing methods to the images produced by scanning the microarray slides on which the two samples have been hybridized. Errors occurring during image acquisition affect all further analyses and, therefore, the process of generating these digital images is crucial. The photomultiplier tube (PMT), laser power (LP), and analog-to-digital converter (ADC) are the main components of an acquisition device, the scanner, which controls the formation of digital images. Each spot on the hybridized slide has fluorescent molecules corresponding to the two labeled samples, and they emit photons upon excitation by a laser. The photons are converted into electrons by the PMT, and the amount of current that eventually flows is directly proportional to the amount of incident light at the photocathode, unless saturation occurs [2]. Saturation occurs when the signal from a pixel exceeds the scanner's upper threshold of detection ($2^{16} - 1 = 65\,535$, for a 16-bit computer storage system). This phenomenon is also called clipping, and it occurs when the ADC converts the electrons into a sequence of digital signals. This clipping effect renders the relation between the measured and the true intensities nonlinear in the upper range of intensities.

A single scanner setting will not be optimal for both weakly and highly expressed genes. The choice of the parameters (of a scanner) involves a trade-off between spots with a low intensity and spots that are saturated to some degree. Thus it seems reasonable to consider multiple scanning of the same microarray at different scanner sensitivities and estimate spot intensities from the combined data.

Not much work has been done so far on improving the data quality by using multiple scans and by adjusting for pixel censoring. Dudley et al. [3] increased the dynamic range of gene expression using a new method. They proposed to hybridize experimental and control samples against labeled oligos that would be complementary with respect to every microarray feature, rather than cohybridizing the samples. However, their method cannot be applied to experiments that follow the standard method of Schena et al. [1]. Khondoker et al. [4] presented a regression model based on a nonlinear relationship and involving both additive and multiplicative error terms to establish a link between the saturated and the true intensities, and used an approach based on maximum likelihood estimation to correct for saturation. Lyng et al. [5] recalculated the saturated signals using a set of unsaturated intensities from a second scan. Though they proposed a method to determine the unsaturated intensities, the main focus of their paper was on investigating the relationship between PMT voltages, spot intensities, and expression ratios for three commercial scanners of two different brands. Piepho et al. [6] suggested using a nonlinear latent regression model for correcting the bias caused by saturation and for combining data from multiple scans. Skibbe et al. [7] compared the number of differentially expressed genes when using approaches based on linear regression and when considering a union of sets of differentially expressed genes that had been identified by scans made by varying the PMT and laser power. They showed that the latter approach effectively identifies a subset of statistically significant genes that the former approach is unable to find.

Our approach towards improving the quality of intensity measurements is based on first producing three images with different scanner sensitivities, and obtaining three different sets of expression values. We then apply a novel Bayesian latent intensity model, in which these different sets of expression values are used to estimate suitably calibrated true expressions of genes. The resulting estimates, treated in the form of respective (posterior) distributions, can be used in a higher-level analysis for identifying differentially expressed genes. The proposed approach is applicable to standard microarray methodology and to cDNA arrays. The method, however, cannot be applied to Affymetrix gene chips as the current Affymetrix technology does not allow multiple scanning. The method is generic and can be applied to data from any organism, imaging with any scanner that allows scanning at different laser powers, and extraction with any image analysis software.

## 2. Method

### 2.1. Data

In this study, we used cDNA microarrays containing 16 000 individual fragments printed in duplicate (produced at the Turku Centre for Biotechnology, University of Turku, Finland). Our approach was tested on two experiments. The first experiment was designed to examine the effects of RhoG on HeLa cells by comparing the expression profile of RhoG-expressing cells with that of control cells. This experiment was performed on two arrays (here called and ). Each array had RNA from RhoG G12V labeled with Cy5, and control labeled with Cy3. The second experiment was a self-self hybridization experiment in which RNA samples from T-Rex-HeLa cells (Invitrogen) transfected with a pcDNA4/TO vector were used. Details about sample preparation, RNA extraction, sample labeling, and microarray hybridization for the two experiments can be requested from the authors.

### 2.2. Multiple Scanning

### 2.3. Quantification of Spot Intensities

Digital images were processed using GenePix Pro version 3.0 software (Axon Instruments, Inc., Foster City, Calif, USA; http://www.axon.com/GN_GenePix4000.html). The automatic spot-finding algorithm of GenePix was used to find spot boundaries and to calculate spot intensities. Spot foreground and background intensities for both channels were derived, and background-corrected intensities above 200 from all three scans were used for our study.
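The intensity filter described above could be sketched as follows. This is only an illustration: the text does not say whether the threshold of 200 is applied per scan or jointly, so the joint reading used here is an assumption, and all function names and numeric values are hypothetical.

```python
def background_corrected(foreground, background):
    """Return the background-corrected intensity for one spot and channel."""
    return foreground - background


def keep_spot(corrected_by_scan, threshold=200.0):
    """Keep a spot only if every scan's corrected intensity exceeds the threshold.

    Assumption: the filter is applied jointly across the three scans.
    """
    return all(value > threshold for value in corrected_by_scan)


# Hypothetical (foreground, background) pairs for three scans of two spots.
spots = {
    "spot_a": [(5200.0, 150.0), (2100.0, 140.0), (900.0, 130.0)],
    "spot_b": [(400.0, 180.0), (310.0, 170.0), (250.0, 160.0)],
}

kept = {
    name: keep_spot([background_corrected(f, b) for f, b in scans])
    for name, scans in spots.items()
}
```

Here `spot_b` is dropped because its corrected intensity falls below 200 in the lower-sensitivity scans.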

### 2.4. Latent Variable Model

We consider the *true latent intensity* of a gene and denote it on logarithmic (base e) scale by , where is the number of spots used in the experiment. In the present experiment, each microarray chip was scanned at three different scanner settings. Let index the scanner setting. For each array, the first scan (i.e., ) was made by using a scanner setting that would correspond to a situation in which only a single scan was performed. Therefore, these first scans form a natural basis for calibrating the corresponding true latent intensities . They can be expected to capture, without a downward bias caused by saturation, spots that do not have abundant levels of RNA. The second scan was then made by lowering the laser power from the level of the first scan, and the third scan by lowering it even more. The measured signals can then be expected to be correspondingly weaker, with the effect that the degree of saturation in the measured signals on the highly expressed genes would also be reduced, or even eliminated completely. Next, we assume that the second and the third scans are linked to by simple functional relationships, respectively, by and . Here and are unknown functions, which can naturally be assumed to be increasing and continuous, but which otherwise need to be estimated from the data. Since the range of gene expression data from the first scan was from to , we decided to break the whole range into shorter intervals while ensuring that there would be enough data points in each interval. We call these intervals and assume a simple linear form for and in each of these intervals. In other words, we set

where is the length of the interval.
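An increasing, continuous function that is linear within each interval, as described above, can be sketched in a few lines. This is a minimal illustration and not necessarily the authors' exact parameterization (the Appendix listing is the reference for that): the cut points are copied from the Appendix, while the lower end of the first interval, the slope values, and the function name are assumptions made for the example.

```python
from bisect import bisect_right

# Cut points copied from the Appendix listing (natural-log scale).
CUTS = [6.29, 7.29, 8.29, 9.29, 10.29]
LOWER = 5.29  # assumed lower end of the first interval (not stated in the text)


def piecewise_linear(x, slopes, cuts=CUTS, lower=LOWER):
    """Evaluate a continuous piecewise-linear function with one slope per interval.

    The function is zero at `lower`; on each interval it rises with the slope
    assigned to that interval, so positive slopes give an increasing function.
    """
    if len(slopes) != len(cuts) + 1:
        raise ValueError("need exactly one slope per interval")
    knots = [lower] + list(cuts)
    k = bisect_right(cuts, x)  # zero-based index of the interval containing x
    # Accumulate the full rise over every interval below the current one.
    total = 0.0
    for m in range(k):
        total += slopes[m] * (knots[m + 1] - knots[m])
    # Add the partial rise inside the current interval.
    return total + slopes[k] * (x - knots[k])


# Hypothetical slopes, one per interval (all positive, hence increasing).
slopes = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
value = piecewise_linear(7.0, slopes)
```

Because each interval starts where the previous one ended, the function has no jumps at the cut points, which matches the continuity assumption in the text.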

where is the error associated with spot and scan .
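Reading the sampling statements in the Appendix listing, the measurement model for the three scans can be sketched as below. The symbols are assumptions chosen to mirror the BUGS variable names and are not necessarily the notation of the original equations: $y_{ij}$ is the measured intensity of spot $i$ in scan $j$ (logye1, logye2, logye3 hold its logarithm), $m_{ij}$ is the corresponding mean (muYe1, muYe2, muYe3), $k(i)$ is the interval index of spot $i$ (class[i]), and $\tau_{j,k}$ is a precision, since the BUGS dnorm distribution is parameterized by mean and precision.

$$
\log y_{ij} = m_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim \mathcal{N}\!\left(0,\ \tau_{j,k(i)}^{-1}\right), \qquad j = 1, 2, 3 .
$$

For a censored reading only a lower bound is known, which the listing expresses with the I(lower, ) construct.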

Strictly speaking, right censoring at the scanner's upper threshold of detection (65 535) is only appropriate in a pixel level model. In spot intensities, some degree of clipping takes place already well below this value. Piepho et al. [6] showed in their paper that spot saturation begins between 15 and 16 on log (base 2) scale, that is, somewhere between 32 768 and 65 535 on natural scale. The reason is that the signal from a spot is obtained by averaging the readings over the pixels belonging to the spot, and some of these pixels may be already saturated. As a result, and unlike in pixel level data, in spot level data there is no sharp threshold value beyond which saturation has an effect. Gupta et al. [8] provided data where, as could be expected, with increasing observed spot intensity an increasing proportion of the pixel readings had also reached their maximal value of 65 535. At a spot summary value of 60 000, most of the pixels comprising the spot were already saturated. However, although a pixel level model can be said to give a more truthful description of the saturation phenomenon as such, it cannot be applied in practice for analyzing pixel level data from arrays which typically contain several thousand spots. The reason is simply the computational cost involved, as each spot consists of 80–100 pixels. Here, instead of attempting to model the effect of saturation on observed spot intensities, we treat the high intensity readings, which are most affected by saturation, as right censored observations. We then compensate for the resulting loss of information by utilizing the measurements obtained with a lower laser power, finally combining, within the Bayesian framework, the information from all three scan measurements to obtain the posterior distribution of the true latent intensity.

Right censored measurements are taken care of as a part of the same estimation process, by data augmentation, which effectively means that they are replaced by the corresponding conditional distributions. Applying such a process naturally still involves deciding on the level beyond which right censoring should take place. In the results reported here, we treated signals exceeding a threshold of approximately 11 on the logarithmic (base e) scale as right censored. Later, in Section 3, we consider the influence of the choice of the censoring threshold in some detail.
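The point that a spot summary can be biased well below 65 535 can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' pixel-level model: the spot is summarized by a plain mean, and the pixel count, the Gaussian spread of the true pixel signals, and all names are hypothetical.

```python
import random

PIXEL_MAX = 65535  # upper limit of a 16-bit scanner reading


def spot_mean(pixel_values, pixel_max=PIXEL_MAX):
    """Average the pixel readings after clipping each one at the scanner maximum."""
    clipped = [min(v, pixel_max) for v in pixel_values]
    return sum(clipped) / len(clipped)


def saturated_fraction(pixel_values, pixel_max=PIXEL_MAX):
    """Fraction of pixels whose true signal reaches or exceeds the scanner maximum."""
    return sum(v >= pixel_max for v in pixel_values) / len(pixel_values)


random.seed(0)
# Hypothetical spot: 90 pixels with true signals spread around 70 000.
true_pixels = [random.gauss(70000.0, 15000.0) for _ in range(90)]

true_mean = sum(true_pixels) / len(true_pixels)
observed_mean = spot_mean(true_pixels)
fraction = saturated_fraction(true_pixels)
```

The observed spot mean stays below 65 535 yet underestimates the true mean, because only part of the pixels are clipped. This is why no sharp spot-level threshold exists and why a censoring level below the pixel maximum is a reasonable choice.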

To complete the specification of the model, we assume that the errors are independent and identically distributed Normal random variables with mean and interval dependent variances , where ; . The interval dependent precision parameters (inverse of variances , , and , ) of errors , , were assigned a gamma prior with parameters ( ). The true underlying latent intensities are assigned a Uniform prior distribution over the interval [ , ], which is approximately [ ]. The parameters ( , ) are assigned a Uniform distribution over ( ).

### 2.5. Bayesian Analysis

Several authors suggested Bayesian methods for analyzing microarray data [9–16]. Under the Bayesian paradigm, once the model is defined, statistical inferences can be expressed directly in terms of the conditional posterior probabilities conditioned on the observed data.

A priori, the parameters , , and are assumed to be independent. The numerical computations were done using Markov chain Monte Carlo (MCMC), where the sampling algorithm can be summarized as follows.

*Step 1*.

Specify initial values of , , , and of the augmented variables to be sampled when considering right censored 's.

*Step 2*.

Sample the latent intensities from their conditional distribution.

*Step 3*.

Sample , from their conditional distribution.

*Step 4*.

Sample from its conditional distribution.

*Step 5*.

Sample augmented 's from their conditional distributions.

*Step 6*.

Repeat Steps 2 to 5 until sufficient samples have been generated.
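The interplay between sampling a latent quantity and sampling augmented values for right censored readings can be sketched on a toy problem. This is not the authors' hierarchical model: it assumes a single latent mean, a known standard deviation, and a flat prior, and every name and numeric value below is hypothetical. Censored readings are imputed from a Normal distribution truncated below at the censoring threshold, and the mean is then drawn from its conditional distribution given the completed data.

```python
import random
from statistics import NormalDist, mean


def sample_truncated_above(mu, sigma, lower):
    """Draw from Normal(mu, sigma) restricted to values above `lower` (inverse-CDF method)."""
    dist = NormalDist(mu, sigma)
    lo = dist.cdf(lower)
    p = lo + (1.0 - lo) * random.random()
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside its open domain
    return max(dist.inv_cdf(p), lower)


def gibbs_censored_mean(observed, censored, threshold, sigma, n_iter=4000, burn_in=1000):
    """Gibbs sampler for a Normal mean when some readings are right censored.

    observed: uncensored values; censored: number of readings known only to
    exceed `threshold`; sigma: known standard deviation (simplifying assumption).
    Returns the post-burn-in draws of the mean under a flat prior.
    """
    n = len(observed) + censored
    observed_sum = sum(observed)
    mu = mean(observed)  # initial value
    draws = []
    for it in range(n_iter):
        # Data augmentation: impute each censored reading above the threshold.
        imputed = [sample_truncated_above(mu, sigma, threshold) for _ in range(censored)]
        # Conditional distribution of the mean given the completed data.
        complete_mean = (observed_sum + sum(imputed)) / n
        mu = random.gauss(complete_mean, sigma / n ** 0.5)
        if it >= burn_in:
            draws.append(mu)
    return draws


random.seed(1)
TRUE_MU, SIGMA, THRESHOLD = 10.8, 0.3, 11.0
raw = [random.gauss(TRUE_MU, SIGMA) for _ in range(200)]
observed = [y for y in raw if y <= THRESHOLD]
n_censored = len(raw) - len(observed)

posterior = gibbs_censored_mean(observed, n_censored, THRESHOLD, SIGMA)
posterior_mean = mean(posterior)
naive_mean = mean([min(y, THRESHOLD) for y in raw])  # treats censored values as equal to the threshold
```

Treating censored readings as if they equalled the threshold pulls the estimate downward, whereas the augmented sampler recovers a mean close to the value used to simulate the data.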

The model was formulated in the BUGS language and parameter estimation was performed using WinBUGS [17]. WinBUGS is free software and its newer versions can also be run from within the statistical package R. The current model runs in OpenBUGS version 2.01 on an Intel Pentium 2.80 GHz processor with 1 GB RAM and takes approximately two hours to do 30 000 iterations using two chains in parallel. Convergence was monitored visually (i.e., by the mixing of two chains) and after a burn-in of 3000 iterations, two chains of 12 000 iterations each were generated to check the convergence of the parameter estimates under consideration. Thereafter, a sample of size 15 000 was generated to make inference.

## 3. Results

The approach described in this paper was tested on the two real data sets described in Section 2.1. For the first experiment, samples were hybridized on two arrays, and each array was scanned three times at different scanner sensitivities (see Table 1). The same samples were hybridized on both arrays, but the scanner settings chosen for the two arrays were different as the experiments were performed independently on the two arrays.

## 4. Conclusions

Here our focus has been on the systematic bias in the intensity measurements caused by intrinsic scanner noise at the lower end and pixel censoring at the upper end. These two problems cannot be handled under a single scanner setting. Moreover, guidelines are not available for choosing optimal scanner settings to address both of these issues. Therefore, it seems reasonable to do several scans on every array, some at relatively lower sensitivities (ensuring no censoring at the upper end) and others at higher sensitivity levels (to capture weakly expressed genes), and ultimately combine the information to get improved gene expression measurements at all ranges. More scans can easily be accommodated in the model but there are practical limitations like degradation of the dye and the time required for scanning. Keeping these points in mind, three scans seem to be a good compromise.

The proposed method has advantages at three levels. First, modeling under the Bayesian framework allows for missing data estimation by sampling randomly from the corresponding posterior predictive distribution. Second, it allows for joint estimation of a large number of model parameters and latent variables. Usually for analyzing microarray data, the statistical methods are applied in a sequential manner with the output of each step in the analysis serving as the input for the next. Under the sequential approach, the uncertainties in the conclusions from any earlier step make the subsequent steps dependent on the particular choice of the method and the resulting point estimate that is then used. In our model, such uncertainties are accounted for in a systematic manner as we work with distributions of all the unknown parameters, including the latent expression of the genes being considered. A third aspect of our method is that it opens up the possibility of extending the current model to accommodate features of normalization and identification of differentially expressed genes in an integrated model, which first improves the overall signal and then identifies differentially expressed genes by using such improved signals. Realization of the integrated model is in principle a straightforward modification to the model proposed here, by adding further layers to the present hierarchical model. Such additional layers would then account for between-array variations, within-array variations, and dye swap, and allow for comparing and combining data from multiple arrays. We are currently working towards such an integrated model.

## Appendix

See Algorithm 1.

**Algorithm 1:** Code written in BUGS language.

    # Cut points (natural-log scale) separating the six intensity intervals
    cut1 <- 6.29
    cut2 <- 7.29
    cut3 <- 8.29
    cut4 <- 9.29
    cut5 <- 10.29

    # Priors on muYe1[i], tau1, tau2 and tau3 (Section 2.4) are not part of this listing
    for (i in 1:N) {
      # Interval index (1 to 6) of spot i, determined from the first scan
      class[i] <- 1 + step(logye1[i] - cut1) + step(logye1[i] - cut2) + step(logye1[i] - cut3) +
                  step(logye1[i] - cut4) + step(logye1[i] - cut5)

      # Piecewise-linear mean of the second scan, slopes b[1..6]
      muYe2[i] <- A[i] + B[i] + C[i]
      A[i] <- (b[1] * muYe1[i]) * step(cut1 - logye1[i]) +
              (b[1] + (b[2] * (muYe1[i] - 1))) * step(cut2 - logye1[i]) * step(logye1[i] - cut1) +
              (b[1] + b[2] + (b[3] * (muYe1[i] - 2))) * step(cut3 - logye1[i]) * step(logye1[i] - cut2)
      B[i] <- (b[1] + b[2] + b[3] + (b[4] * (muYe1[i] - 3))) * step(cut4 - logye1[i]) * step(logye1[i] - cut3) +
              (b[1] + b[2] + b[3] + b[4] + (b[5] * (muYe1[i] - 4))) * step(cut5 - logye1[i]) * step(logye1[i] - cut4)
      C[i] <- (b[1] + b[2] + b[3] + b[4] + b[5] + (b[6] * (muYe1[i] - 5))) * step(logye1[i] - cut5)

      # Piecewise-linear mean of the third scan, slopes d[1..6]
      # (this assignment is absent from the printed listing and is restored by analogy with muYe2)
      muYe3[i] <- D[i] + E[i] + F[i]
      D[i] <- (d[1] * muYe1[i]) * step(cut1 - logye1[i]) +
              (d[1] + (d[2] * (muYe1[i] - 1))) * step(cut2 - logye1[i]) * step(logye1[i] - cut1) +
              (d[1] + d[2] + (d[3] * (muYe1[i] - 2))) * step(cut3 - logye1[i]) * step(logye1[i] - cut2)
      E[i] <- (d[1] + d[2] + d[3] + (d[4] * (muYe1[i] - 3))) * step(cut4 - logye1[i]) * step(logye1[i] - cut3) +
              (d[1] + d[2] + d[3] + d[4] + (d[5] * (muYe1[i] - 4))) * step(cut5 - logye1[i]) * step(logye1[i] - cut4)
      F[i] <- (d[1] + d[2] + d[3] + d[4] + d[5] + (d[6] * (muYe1[i] - 5))) * step(logye1[i] - cut5)

      # Measurements from the three scans; I(lower, ) gives the lower bound of censored values
      logye1[i] ~ dnorm(muYe1[i], tau1[class[i]]) I(logye1cen[i], )
      logye2[i] ~ dnorm(muYe2[i], tau2[class[i]]) I(logye2cen[i], )
      logye3[i] ~ dnorm(muYe3[i], tau3[class[i]]) I(logye3cen[i], )
    }

    # Uniform priors on the slopes of the six intervals
    for (j in 1:6) {
      b[j] ~ dunif(0, 5)
      d[j] ~ dunif(0, 5)
    }

Here N is the number of genes; logye1, logye2, and logye3 are the measurements from the three scans on the logarithmic scale; and I(logye1cen[i], ), I(logye2cen[i], ), and I(logye3cen[i], ) were used to specify the lower bound for the censored measurements from the three scans.
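The interval index computed by the class[i] line of Algorithm 1 can be reproduced outside BUGS for checking purposes. The sketch below is an illustration in Python rather than part of the model; the function names are hypothetical, and the only facts taken from the listing are the five cut points and the BUGS convention that step(x) equals 1 when x is greater than or equal to 0.

```python
from bisect import bisect_right

CUTS = [6.29, 7.29, 8.29, 9.29, 10.29]  # cut1..cut5 from Algorithm 1


def step(x):
    """BUGS-style step function: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def interval_class(log_intensity, cuts=CUTS):
    """Interval index (1 to 6), computed as a sum of step functions as in the listing."""
    return 1 + sum(step(log_intensity - c) for c in cuts)


def interval_class_fast(log_intensity, cuts=CUTS):
    """Same index obtained by binary search over the sorted cut points."""
    return 1 + bisect_right(cuts, log_intensity)


example_classes = [interval_class(x) for x in (5.5, 6.29, 8.0, 11.0)]
```

Both functions assign a reading that falls exactly on a cut point to the higher interval, which follows from step(0) being 1.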

## Declarations

### Acknowledgments

The authors thank Bob O'Hara for the careful reading of the manuscript and Mizanur R. Khondoker for having kindly provided the computer code for generating the data that led to our Figure 9. This work has been supported by the ComBi graduate school (RG), the Academy of Finland via its funding of the Centre of Population Genetic Analyses and via the SYSBIO Research Program (EA), and the Institute of Biotechnology (PA).

## References

1. Schena M, Shalon D, Davis RW, Brown PO: **Quantitative monitoring of gene expression patterns with a complementary DNA microarray.** *Science* 1995, **270**(5235):467-470. 10.1126/science.270.5235.467
2. Yang Y, Buckley M, Dudoit S, Speed T: **Comparison of methods for image analysis on cDNA microarray data.** *Journal of Computational and Graphical Statistics* 2001, **11**(1):108-136.
3. Dudley AM, Aach J, Steffen MA, Church GM: **Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range.** *Proceedings of the National Academy of Sciences of the United States of America* 2002, **99**(11):7554-7559. 10.1073/pnas.112683499
4. Khondoker MR, Glasbey CA, Worton BJ: **Statistical estimation of gene expression using multiple laser scans of microarrays.** *Bioinformatics* 2006, **22**(2):215-219. 10.1093/bioinformatics/bti790
5. Lyng H, Badiee A, Svendsrud DH, Hovig E, Myklebost O, Stokke T: **Profound influence of microarray scanner characteristics on gene expression ratios: analysis and procedure for correction.** *BMC Genomics* 2004, **5**:10. 10.1186/1471-2164-5-10
6. Piepho HP, Keller B, Hoecker N, Hochholdinger F: **Combining signals from spotted cDNA microarrays obtained at different scanning intensities.** *Bioinformatics* 2006, **22**(7):802-807. 10.1093/bioinformatics/btk047
7. Skibbe DS, Wang X, Zhao X, Borsuk LA, Nettleton D, Schnable PS: **Scanning microarrays at multiple intensities enhances discovery of differentially expressed genes.** *Bioinformatics* 2006, **22**(15):1863-1870. 10.1093/bioinformatics/btl270
8. Gupta R, Auvinen P, Thomas A, Arjas E: **Bayesian hierarchical model for correcting signal saturation in microarrays using pixel intensities.** *Statistical Applications in Genetics and Molecular Biology* 2006, **5**(1): Article 20.
9. Keller AD, Schummer M, Hood L, Ruzzo WL: **Bayesian classification of DNA array expression data.** In *Tech. Rep. UW-CSE-2000-08-01*. Department of Computer Science and Engineering, University of Washington, Seattle, Wash, USA; 2000.
10. Baldi P, Long AD: **A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes.** *Bioinformatics* 2001, **17**(6):509-519. 10.1093/bioinformatics/17.6.509
11. Dror RO, Murnick JG, Rinaldi NA: **A Bayesian approach to transcript estimation from gene array data: the beam technique.** In *Proceedings of the 6th Annual International Conference on Computational Molecular Biology (RECOMB '02), Washington, DC, USA, April 2002*. ACM Press; 137-143.
12. Parmigiani G, Garrett ES, Anbazhagan R, Gabrielson E: **A statistical framework for expression-based molecular classification in cancer.** *Journal of the Royal Statistical Society* 2002, **64**(4):717-736. 10.1111/1467-9868.00358
13. Ramoni MF, Sebastiani P: **Bayesian methods for microarray data analysis.** *Proceedings of the IMA Workshop 1: Statistical Methods for Gene Expression: Microarrays and Proteomics, Minneapolis, Minn, USA, September-October 2003*
14. Bhattacharjee M, Pritchard CC, Nelson PS, Arjas E: **Bayesian integrated functional analysis of microarray data.** *Bioinformatics* 2004, **20**(17):2943-2953. 10.1093/bioinformatics/bth338
15. Frigessi A, van de Wiel MA, Holden M, Svendsrud DH, Glad IK, Lyng H: **Genome-wide estimation of transcript concentrations from spotted cDNA microarray data.** *Nucleic Acids Research* 2005, **33**(17):e143. 10.1093/nar/gni141
16. Lewin A, Richardson S, Marshall C, Glazier A, Aitman T: **Bayesian modeling of differential gene expression.** *Biometrics* 2006, **62**(1):10-18.
17. Spiegelhalter DJ, Thomas A, Best NG: "WinBUGS" Version 1.2. User Manual, MRC Biostatistics Unit, 199.
18. Yang YH, Dudoit S, Luu P, Speed TP: **Normalization for cDNA microarray data.** *Microarrays: Optical Technologies and Informatics, Proceedings of SPIE, San Jose, Calif, USA, January 2001* **4266**:141-152.
19. Yang YH, Dudoit S, Luu P, *et al*.: **Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation.** *Nucleic Acids Research* 2002, **30**(4):e15. 10.1093/nar/30.4.e15
20. de La Nava JG, van Hijum S, Trelles O: **Saturation and quantization reduction in microarray experiments using two scans at different sensitivities.** *Statistical Applications in Genetics and Molecular Biology* 2004, **3**(1): Article 11.
21. Dodd LE, Korn EL, McShane LM, Chandramouli GVR, Chuang EY: **Correcting log ratios for signal saturation in cDNA microarrays.** *Bioinformatics* 2004, **20**(16):2685-2693. 10.1093/bioinformatics/bth309

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.