 Research Article
 Open Access
Compressive Sensing DNA Microarrays
 Wei Dai^{1},
 Mona A Sheikh^{2},
 Olgica Milenkovic^{1} and
 Richard G Baraniuk^{2}
https://doi.org/10.1155/2009/162824
© Wei Dai et al. 2009
 Received: 30 July 2008
 Accepted: 23 October 2008
 Published: 22 December 2008
Abstract
Compressive sensing microarrays (CSMs) are DNA-based sensors that operate using group testing and compressive sensing (CS) principles. In contrast to conventional DNA microarrays, in which each genetic sensor is designed to respond to a single target, in a CSM, each sensor responds to a set of targets. We study the problem of designing CSMs that simultaneously account for both the constraints from CS theory and the biochemistry of probe-target DNA hybridization. An appropriate cross-hybridization model is proposed for CSMs, and several methods are developed for probe design and CS signal recovery based on the new model. Lab experiments suggest that in order to achieve accurate hybridization profiling, consensus probe sequences are required to have sequence homology of at least 80% with all targets to be detected. Furthermore, out-of-equilibrium datasets are usually as accurate as those obtained from equilibrium conditions. Consequently, one can use CSMs in applications in which only short hybridization times are allowed.
Keywords
 Compressive Sense
 Current Base
 Restricted Isometry Property
 Base Choice
 Belief Propagation Algorithm
1. Introduction
Accurate identification of large numbers of genetic sequences in an environment is an important and challenging research problem. DNA microarrays are a frequently applied solution for microbe DNA detection and classification [1]. The array consists of genetic sensors or spots, containing a large number of single-stranded DNA sequences termed probes. A DNA strand in a test sample, referred to as a target, tends to bind or "hybridize" with its complementary probe on a microarray so as to form a stable duplex structure. The DNA samples to be identified are fluorescently tagged before being flushed against the microarray. The excess DNA strands are washed away and only the hybridized DNA strands are left on the array. The fluorescent illumination pattern of the array spots is then used to infer the genetic makeup in the test sample.
1.1. Concerns in Classical DNA Microarrays
In traditional microarray designs, each spot has a DNA subsequence that serves as a unique identifier of only one organism in the target set. However, there may be other probes in the array with similar base sequences for identifying other organisms. Due to the fact that the spots may have DNA probes with similar base sequences, both specific and nonspecific hybridization events occur; the latter effect leads to errors in the array readout.
Furthermore, the unique sequence design approach severely restricts the number of organisms that can be identified. In typical biosensing applications, an extremely large number of organisms must be identified. For example, there are more than known harmful microbes, many with significantly more than strains [2]. A large number of DNA targets requires microarrays with a large number of spots. The implementation cost and the speed of microarray data processing are directly related to the number of spots, which represents a significant problem for commercial deployment of handheld microarray-based biosensors.
1.2. Compressive Sensing
Compressive sensing (CS) is a recently developed sampling theory for sparse signals [3]. The main result of CS, introduced by Candès and Tao [3] and Donoho [4], is that a length-$N$ signal $x$ that is $K$-sparse in some basis can be recovered exactly in polynomial time from just $M = O(K \log(N/K))$ linear measurements of the signal. In this paper, we choose the canonical basis; hence $x$ has $K$ nonzero and $N - K$ zero entries.
In matrix notation, we measure $y = \Phi x$, where $x$ is the $N \times 1$ sparse signal vector we aim to sense, $y$ is an $M \times 1$ measurement vector, and the measurement matrix $\Phi$ is an $M \times N$ matrix. Since $M < N$, recovery of the signal $x$ from the measurements $y$ is ill posed in general. However, the additional assumption of signal sparsity makes recovery possible. In the presence of measurement noise, the model becomes $y = \Phi x + w$, where $w$ stands for i.i.d. additive white Gaussian noise with zero mean.
The two critical conditions to realize CS are that (i) the vector $x$ to be sensed is sufficiently sparse, and (ii) the rows of $\Phi$ are sufficiently incoherent with the signal sparsity basis. Incoherence is achieved if $\Phi$ satisfies the so-called restricted isometry property (RIP) [3]. For example, random matrices built from Gaussian and Bernoulli distributions satisfy the RIP with high probability. $\Phi$ can also be sparse, with only a few nonzero entries per row (the number can vary from row to row) [5].
Various methods have been developed to recover a sparse $x$ from the measurements $y$ [3, 5–7]. When $\Phi$ itself is sparse, belief propagation and related graphical inference algorithms can also be applied for fast signal reconstruction [5].
An important property of CS is its information scalability—CS measurements can be used for a wide range of statistical inference tasks besides signal reconstruction, including estimation, detection, and classification.
1.3. Compressive Sensing Meets Microarrays
The setting for microbial DNA sensing naturally lends itself to CS: although the number of potential agents that a hostile adversary can use is large, not all agents are expected to be present in a significant concentration at a given time and location, or even in an air/water/soil sample to be tested in a laboratory. In traditional microarrays, this results in many inactive probes during sensing. On the other hand, there will always be minute quantities of certain harmful biological agents that may be of interest to us. Therefore, it is important not just to detect the presence of agents in a sample, but also to estimate the concentrations with which they are present.
Mathematically, one can represent the DNA concentration of each organism as an element in a vector $x$. Therefore, as per the assumption of only a few agents being present, this vector is sparse, that is, it contains only a few significant entries. This suggests designing the microarray along the lines of the CS measurement process, where each measurement is a linear combination of the entries in the vector $x$, and where the sparse vector $x$ can be reconstructed from the measurement vector $y$ via CS decoding methods.
In our proposed microarrays, the readout of each probe represents a probabilistic combination of all the targets in the test sample. The probabilities reflect each probe's affinity to its targets, that is, how likely the target and the probe are to hybridize. We explain our model for probe-target hybridization in Section 2.2. In particular, the cross-hybridization property of a DNA probe with several targets, not just one, is the key for applying CS principles. The vector $y$ of probe readouts is modeled as
$$y = \Phi x + w. \quad (1)$$
Here, $x$ is the vector of target concentrations, $\Phi$ is the sensing matrix, and $w$ denotes a vector of i.i.d. additive white Gaussian noise samples with zero mean.
We note that this probabilistic combination is assumed to be linear for the purposes of microarray design. However, in reality, there is a nonlinear saturation effect when excessive targets are present (see Section 2.4 for details). We take this into account on the reconstruction side, as part of the CS decoding techniques to decipher the combinatorial sensor readout.
Therefore, by using the CS principle, the number of spots in the microarray can be made much smaller than the number of target organisms. With fewer "intelligently chosen" DNA probes, the microarray can also be more easily miniaturized [8–10]. We refer to a microarray designed this way as a CS microarray (CSM).
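The reduction in spot count can be illustrated with a minimal Python sketch. It assumes a binary 3 x 7 sensing matrix whose columns are the binary representations of 1 to 7 (the parity-check matrix of the [7,4] Hamming code), a single target present, and no noise or saturation. The matrix choice, the function names, and the pattern-matching decoder are illustrative assumptions, not the decoding method developed in this paper.

```python
# Hypothetical 3 x 7 binary sensing matrix: column j (1-based) is the
# binary representation of j, so every target has a distinct probe pattern.
PHI = [[(j >> bit) & 1 for j in range(1, 8)] for bit in (2, 1, 0)]


def measure(phi, x):
    """Noiseless linear readout y = Phi x (one value per probe spot)."""
    return [sum(p * c for p, c in zip(row, x)) for row in phi]


def decode_single_target(phi, y, tol=1e-9):
    """Recover (target index, concentration), assuming exactly one target."""
    active = [1 if abs(v) > tol else 0 for v in y]
    for j in range(len(phi[0])):
        column = [row[j] for row in phi]
        if column == active:
            return j, max(y)
    return None  # the readout pattern matches no single target


x = [0.0] * 7
x[4] = 2.5                      # only the fifth target (index 4) is present
y = measure(PHI, x)             # three spot readouts for seven targets
print(y, decode_single_target(PHI, y))
```

Three spots suffice here only because the signal is assumed to be 1-sparse; with more targets present or with noisy readouts, a proper CS decoder is needed.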
The CS principle is similar to the concept of group testing [8–11], which also relies on the sparsity observed in the DNA target signals. The chief advantage of a CS-based approach over direct group testing is its information scalability. With a reduced number of measurements, we are able not just to detect, but also to estimate the target signal. This is important because often pathogens in the environment are only harmful to us in large concentrations. Furthermore, we are able to use CS recovery methods such as belief propagation that decode while accounting for experimental noise and measurement nonlinearities due to excessive target molecules [12].
It is also worth pointing out the substantial difference between CSMs and the "composite microarrays" designed to reduce measurement variability [13]. In the latter approach, the microarray readouts are linear combinations of input signal components and therefore can be expressed in the form given by (1). However, the measurement matrix of [13] typically does not satisfy the CS design principles. As a result, the number of required measurements/spots is significantly larger than that of CSMs. On the other hand, the use of the CS principle allows both robust measurements and a significant reduction in the number of spots on the array [14].
1.4. Clusters of Orthologous Groups
The COGs database consists of groups of 192,987 proteins in 66 unicellular organisms classified into 4872 clusters. We use these clusters as a guideline to group targets together. Targets with similar DNA sequences belong to the same group, and can be more easily identified with a single probe. When designing probes, it is important to make sure that each chosen probe aligns minimally with organisms that do not belong to its group (the "nontargets"). We can use the COGs database with its exhaustive classification to this end, since DNA sequences of an organism whose proteins do not belong to a certain COG will have minimal alignment with DNA sequences of other organisms in that COG. This significantly reduces the computational complexity of the search for good probe sequences.
One limitation in using COGs is that it constrains the design of the matrix $\Phi$. For instance, if we were to choose a set of 10 organisms we are interested in for microarray detection, there are only a finite number of COGs (groups) that these 10 organisms will belong to. We would have to carefully sift through these groups to find the ones that best satisfy the CS requirements on $\Phi$, making sure for each choice that it is dissimilar enough from the other groups chosen. So on the one hand, using COGs guides our target grouping strategy; on the other hand, it is possible that we might not be able to find enough suitable COGs to identify all members of the group. Using only a COGs-based approach, we may have to resort to using a $\Phi$ that may not be the best from a CS perspective but simply what nature gives us. Here, however, we only consider an approach using COGs.
A second limitation of COGs is the fact that it is a classification of organisms based on alignments between the sections of their DNA that encode for proteins, not entire sequences. Therefore, a point for future exploration would be to work with values from alignments between entire DNA sequences of organisms. Probes selected using such an alignment would be better reflective of the actual probetarget hybridization that takes place in a biosensing device.
However, we are fortunate that prokaryotes such as unicellular bacteria typically have a larger proportion of coding DNA to noncoding DNA, and therefore as long as we are interested in the detection of unicellular bacteria, which are prokaryotes, using a COGs-based probe selection is not as much of an issue. On the other hand, eukaryotes have large amounts of noncoding regions in their DNA. This phenomenon is known as the C-value enigma [15]: more complex organisms often have more noncoding DNA in their genomes.
1.5. CSM Design Consideration
To design a CSM, we start with a given set of $N$ targets and a valid $M \times N$ CS matrix $\Phi$. The design goal is to find $M$ DNA probe sequences such that the hybridization affinity between the $i$th probe and the $j$th target can be approximated by the value of the entry $\varphi_{i,j}$ of $\Phi$. For this purpose, we need to go row-by-row in $\Phi$, and for each row find a probe sequence such that the hybridization affinities between the probe and the $N$ targets mimic the entries in this row. For simplicity, we assume that the CS matrix $\Phi$ is binary, that is, its entries have value zero or are equal to some positive constant. An entry of positive value refers to the case where the corresponding target and probe DNA strands bind together with a sufficient strength such that the fluorescence from the target strand adhered to the probe is visible during the microarray readout process. A zero-valued entry indicates that no such hybridization affinity exists. How to construct a binary CS matrix is discussed in many papers, including [16, 17], but is beyond the scope of this paper. Henceforth, we assume that we know the $\Phi$ we want to approximate.
The CSM design process is then reduced to answering two questions. Given a probe and target sequence pair, how does one predict the corresponding microarray readout intensity? Given targets and the desired binding pattern, how does one find a probe DNA sequence such that the binding pattern is satisfied?
The first question is answered by a two-step translation of a probe-target pair to the spot intensity. First, we need a hybridization model that uses features of the probe and target sequences to predict the cross-hybridization affinity between them. Since the CS matrix $\Phi$ that we want to approximate is binary, the desired hybridization affinities can be roughly categorized into two levels, "high" and "low," corresponding to one and zero entries in $\Phi$, respectively. The affinities in each category should be roughly uniform, while those belonging to different categories must differ significantly. With these design requirements in mind, we develop a simplified hybridization model in Section 2.2 and verify its accuracy via laboratory experiments, the results of which are presented in Section 2.3. As the second step, we need to translate the hybridization values to microarray spot intensities using a model that includes physical parameters of the experiment, such as background noise. This issue is discussed in Section 2.4.
To answer the second question, we propose a probe design algorithm that uses a "sequence voting mechanism" and a randomization mechanism. The algorithm is presented in Section 3.1. An example of the practical implementation of this algorithm is given in Section 3.2.
2. Hybridization Model
2.1. Classical Models
12 parameters used in [18] for predicting hybridization affinities between DNA sequence pairs.
Parameter  Description 

 Probe sequence length, Target sequence length 
 Probe GC content, target GC content 
 Smith-Waterman (SW) score: computed from the scoring system used in the SW alignment 
 value: probability that the SW score occurred by chance 
 Percent identity: percentage of matched bases in the aligned region after SW alignment 
 Length of the SW alignment 
 Gibbs free energy for probe DNA folding 
 Hamming distance between probe and target 
 Length of longest contiguous matched segment in a SW alignment 
 GC content in the longest contiguous segment 
Another important parameter for assessing hybridization affinity is the length of contiguous matched base pairs. It has been shown in [18, 20] that long runs of contiguous matched base pairs imply strong affinity between the probe and target. Usually, one requires at least 10 bases in oligo DNA probes to ensure sufficiently strong hybridization affinity.
Besides the large number of parameters that potentially influence hybridization affinity, there are many theories about which features most influence hybridization and how they affect the process [18, 21, 22]. A third-order polynomial model using percent identity as the single parameter was developed in [21]. More recently, three multivariate models, based on third-order polynomial regression, regression trees, and artificial neural networks, respectively, were studied in [18].
2.2. Our Model for CSM
In contrast to the above approaches, which aim at identifying the exact affinity value, the binary nature of our CS matrix permits simplifications. As discussed in Section 1.5, we only need to predict whether the affinity between a probe-target pair is "high" or "low." For this purpose, two sets of rules, designed for deciding "high" and "low" affinities, respectively, are developed in this section.
We propose the notion of the best matched substring pair, defined as follows, for our hybridization model.
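Definition 1. Let $X = x_1 x_2 \cdots x_n$ be a DNA sequence. A substring of $X$ is a sequence of the form $x_i x_{i+1} \cdots x_j$, where $1 \le i \le j \le n$. Consider a given sequence pair $X$ and $Y$, and let $L$ be a positive integer at most $\min(|X|, |Y|)$. A pair of substrings of length $L$, one of which is part of $X$ and the other part of $Y$, will be denoted by $X_{i,L}$ and $Y_{j,L}$, where $i$ and $j$ are the starting positions of the substrings. The substring percent identity of this pair is
$$\mathrm{PI}(X_{i,L}, Y_{j,L}) = \frac{\left|\{\, 0 \le k \le L-1 : x_{i+k} = \bar{y}_{j+k} \,\}\right|}{L},$$
where $\bar{y}$ denotes the Watson-Crick complement of $y$, and $|\cdot|$ denotes the cardinality of the underlying set.
The best matched substring pair of length $L$ is the substring pair with the largest substring percent identity among all possible substring pairs of length $L$ from the pair of $X$ and $Y$.
For a given $L$, the largest substring percent identity is the substring percent identity of the best matched substring pair of length $L$.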
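Definition 1 can be turned into a brute-force computation. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that both sequences are written so that aligned positions are compared directly, that is, position $k$ of the probe substring pairs with position $k$ of the target substring, with no reversal for strand orientation.

```python
WC = {"A": "T", "T": "A", "C": "G", "G": "C"}  # Watson-Crick complements


def substring_pi(x, y, i, j, length):
    """Fraction of positions k where x[i+k] is the complement of y[j+k]."""
    matches = sum(1 for k in range(length) if x[i + k] == WC[y[j + k]])
    return matches / length


def largest_substring_pi(x, y, length):
    """Largest substring percent identity over all pairs of the given length."""
    if not 1 <= length <= min(len(x), len(y)):
        raise ValueError("length must be between 1 and min(len(x), len(y))")
    return max(
        substring_pi(x, y, i, j, length)
        for i in range(len(x) - length + 1)
        for j in range(len(y) - length + 1)
    )


def pi_profile(x, y):
    """Map every admissible length L to its largest substring percent identity."""
    return {L: largest_substring_pi(x, y, L)
            for L in range(1, min(len(x), len(y)) + 1)}


profile = pi_profile("AAAA", "TTTA")
print(profile)
```

The exhaustive search costs on the order of $|X| \cdot |Y| \cdot L$ comparisons per length, which is acceptable for short oligonucleotide probes but would need a faster method for long sequences.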
Remark 1. For a given $L$, the best matched substring pair is not necessarily unique, while the largest substring percent identity is unique.
(1) For hybridization prediction, the percent identity parameter should be used together with the alignment length. Although the significance of the single-parameter model based on percent identity was demonstrated in [21], we observed that using percent identity as the sole affinity indicator is sometimes misleading. As an illustration, consider the SW alignments of the sequence pairs A and B in Figure 3. Though the pair B exhibits a smaller percent identity, it obviously has a stronger binding affinity than the pair A, for the aligned part of the pair A is merely a part of the aligned region of the pair B. The same principle holds for the sequence pairs B and C as well. This example shows that besides the percent identity, the alignment length is important.
(2) The pair of percent identity and alignment length is not sufficient to predict hybridization affinity. Consider the sequence pairs C and D in Figure 3. Both of them exhibit the same values for the percent identity and alignment length parameters. However, the hybridization affinities of these two pairs are different. To see this, let us refer to Figure 4, which depicts the largest substring percent identities of sequence pairs C and D for different lengths $L$. It can be observed that for any given $L$, the value for the sequence pair C is larger than that for the sequence pair D. In other words, the sequences in the former pair match with each other uniformly better than the sequences in the latter pair. The sequence pair C has a larger chance to hybridize than the pair D does. With the same values of percent identity and alignment length, the difference in hybridization affinity comes from the distribution of matched bases in the aligned region.
The advantage of using the largest substring percent identities for hybridization prediction is now apparent. They include the information contained in the previously discussed percent identity, alignment length, and contiguous matched length parameters; it can be verified, for example, that the longest contiguous matched length is the largest $L$ for which the largest substring percent identity equals one. Of course, the full list of largest substring percent identities provides more detailed information, since it gives both local and global matching information.
(C1) There exists a best matched substring pair whose length is at least a chosen minimum length and whose substring percent identity is at least a chosen high threshold. Both the minimum length and the threshold are judiciously chosen parameters.
(C2) Among all the best matched substring pairs with percent identity at least this high threshold, there should be no pair longer than a chosen maximum length. Again, the maximum length has to be chosen properly.
Criterion (C1) guarantees that there is a significantly long substring pair with high percent identity that ensures strong hybridization affinity. Although criterion (C2) may seem counterintuitive at first glance, it ensures that one single target cannot dominantly hybridize with the consensus probe, that is, the binding affinities between probe-target pairs are roughly uniform.
(C3) Among all the best matched substring pairs with percent identity at least a chosen lower threshold, there should be no pair longer than a second length limit.
(C4) Among all the substring pairs matched perfectly (with percent identity equal to one), there should be no pair longer than a chosen contiguous length limit.
Criterion (C3) asserts that there should be no substring pair that has both long length and high percent identity. The last criterion, (C4), prevents the existence of a long contiguous matched substring pair, which would suggest large binding affinity. Again, the threshold and the length limits have to be chosen appropriately.
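The four criteria can be sketched as a simple classifier over the list of largest substring percent identities. In the Python sketch below, the function and parameter names (`pi_hi`, `len_min`, `len_max`, `pi_lo`, `len_lo`, `len_contig`) are hypothetical, the default values are one reading of the calibrated values reported in Section 2.3, and the "undecided" outcome for pairs satisfying neither set of rules is an added assumption. Position-wise complementarity without strand reversal is assumed as well.

```python
WC = {"A": "T", "T": "A", "C": "G", "G": "C"}  # Watson-Crick complements


def pi_profile(probe, target):
    """Largest substring percent identity for every substring length L."""
    profile = {}
    for L in range(1, min(len(probe), len(target)) + 1):
        profile[L] = max(
            sum(probe[i + k] == WC[target[j + k]] for k in range(L)) / L
            for i in range(len(probe) - L + 1)
            for j in range(len(target) - L + 1)
        )
    return profile


def classify_affinity(profile, pi_hi=0.80, len_min=20, len_max=25,
                      pi_lo=0.75, len_lo=16, len_contig=7):
    """Return 'high', 'low', or 'undecided' following criteria (C1)-(C4)."""
    # (C1): some pair of length >= len_min reaches the high threshold.
    c1 = any(pi >= pi_hi for L, pi in profile.items() if L >= len_min)
    # (C2): no pair longer than len_max reaches the high threshold.
    c2 = all(pi < pi_hi for L, pi in profile.items() if L > len_max)
    # (C3): no pair longer than len_lo reaches the lower threshold.
    c3 = all(pi < pi_lo for L, pi in profile.items() if L > len_lo)
    # (C4): no perfectly matched pair is longer than len_contig.
    c4 = all(pi < 1.0 for L, pi in profile.items() if L > len_contig)
    if c1 and c2:
        return "high"
    if c3 and c4:
        return "low"
    return "undecided"


# Toy usage with short sequences and correspondingly small thresholds.
matched = pi_profile("ACGTACGTAC", "TGCATGCATG")
print(classify_affinity(matched, len_min=5, len_max=10, len_lo=4, len_contig=3))
```

Checking (C1) and (C2) first means that a pair satisfying both rule sets, which can only happen for unusual parameter choices, is reported as "high".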
This model may seem an oversimplification for accurate hybridization affinity prediction. However, in our practical experience with small binary CS matrices (Section 1.5), this model functions properly (see Section 2.3).
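Under this model, the hybridization affinity of a probe-target pair is written as the sum of two terms: the corresponding entry of the binary CS matrix, which is either zero-valued or equal to the positive constant, and an approximation error that is assumed to take small values only. The physical interpretation of these quantities is given in (9), and their values can be calibrated via lab experiments. Furthermore, the reconstruction algorithm can be designed to be robust to the approximation error.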
Remark 2. This model can be further refined by introducing weighting factors in the definition of the substring percent identity. More precisely, the number of positionally matched base pairs can be replaced by a weighted sum, where CG and AT pairs are assigned different values. A more accurate model, taking into account nearest-neighbor interaction, can be considered as well [23, 24]. These extensions will be considered elsewhere.
2.3. Experimental Calibration of Parameters
Lab experiments were performed to verify our translation criteria (C1)–(C4) and to choose appropriate values for the involved parameters.
The probe and target sequences were synthesized by Invitrogen, with the first three probes purified using the polyacrylamide gel electrophoresis (PAGE) method, while all other sequences were purified using the high-performance liquid chromatography (HPLC) method. The fluorescent tags of the targets are Alexa 532.
Best match substring data.
 (1)
For all sequence pairs exhibiting significant hybridization level, one must have .
 (2)
For all sequence pairs of which the microarray readout is weak, we have . (For the pair of probe A and target B, , but the corresponding microarray readout is weak.) Consequently, may be a critical parameter for deciding whether a probe-target pair hybridizes or not.
 (3)
Among all sequence pairs with weak microarray readouts, the length of the longest contiguous segment is (the pair of probe C and target A). This fact implies that the probe-target pair may not hybridize even when they have a contiguous matched substring of length .
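Chosen values of the parameters in the criteria (C1)–(C4).
Parameter  Value 
Percent identity threshold in (C1) and (C2)  0.80 
Minimum length in (C1)  20 
Maximum length in (C2)  25 
Percent identity threshold in (C3)  0.75 
Maximum length in (C3)  16 
Maximum perfectly matched length in (C4)  7 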
Interestingly, when we reduced the incubation time to four hours such that the full equilibrium has not been achieved, the microarray still gave an accurate readout (see Figure 5(d)). We expect that one can use CSMs in applications for which only short hybridization times are allowed.
2.4. Translating Hybridization Affinity into Microarray Spot Intensity
In this model, the actual spot intensity measured under given experimental conditions depends on two positive hybridization constants, the hybridization affinity, the target concentration, the mean background noise, and a measurement noise term that is often assumed to be Gaussian distributed with zero mean [25, 26]. This model mimics the well-known Langmuir model, with background noise taken into consideration [26, 27].
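A generic Langmuir-type response with background and noise can be sketched as follows. The functional form, the parameter names (`gain`, `half_sat`, `background`, `noise_std`), and their default values are assumptions made for illustration; they stand in for the two hybridization constants and noise terms of the model and are not taken from this paper.

```python
import random


def spot_intensity(affinity, concentration, gain=1000.0, half_sat=1.0,
                   background=50.0, noise_std=0.0, rng=None):
    """Langmuir-type saturating readout plus background and Gaussian noise."""
    drive = affinity * concentration             # effective hybridization load
    signal = gain * drive / (half_sat + drive)   # approaches `gain` as drive grows
    noise = (rng or random).gauss(0.0, noise_std) if noise_std > 0 else 0.0
    return signal + background + noise


# Noise-free response for increasing target concentration.
for c in (0.0, 0.5, 1.0, 10.0, 100.0):
    print(c, round(spot_intensity(1.0, c), 1))
```

For small concentrations the response is nearly linear in the product of affinity and concentration, which is the regime assumed during probe design; for large concentrations it flattens, which is the saturation effect handled on the reconstruction side.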
3. Search for Appropriate Probes
3.1. Probe Design Algorithm
We describe next an iterative algorithm for finding probe sequences satisfying a predefined set of binding patterns, that is, sequences that can serve as CS probes.
The design problem is illustrated by the following example. Suppose that we are dealing with three targets, labeled by $A$, $B$, and $C$, and that the binding pattern of the probe and targets is such that the probe is supposed to bind with targets $A$ and $B$, but not with target $C$. Assume next that the hybridization affinities between a candidate probe and targets $A$ and $B$ are too small, while the hybridization affinity between the probe and target $C$ is too large. In order to meet the desired binding pattern, we need to change some nucleotide bases of the probe sequence. For example, consider a particular aligned position of the probe and the targets, with the corresponding bases of the probe and of targets $A$, $B$, and $C$ equal to "T," "T," "A," and "A," respectively. In this case, from the perspective of target $A$, the base "T" of the probe should be changed to "A," while from the perspective of target $C$, this "T" base should be changed to any other base not equal to "T." On the other hand, for target $B$ to exhibit strong hybridization affinity with the probe, the identity of the corresponding probe base should be kept intact. As different preferences appear from the perspectives of different targets, it is not clear whether the base under consideration should be changed or not.
We address this problem by using a sequence voting mechanism. For each position in the probe sequence, one has four base choices—"A," "T," "C," and "G." Each target is allowed to "cast its vote" for its preferred base choice. The final decision is made based on counting all the votes from all targets. More specifically, we propose a design parameter, termed as preference value (PV), to implement our voting mechanism. For a given pair of probe and target sequences, a unique PV is assigned to each base choice at each position of the probe. We design four rules for PV assignment.
(1) If the target "prefers" the current probe base left unchanged, a positive PV is assigned to the corresponding base choice.
(2) From the perspective of the target, if the current probe base should be changed to another specific base, then the original base choice is assigned a negative PV while the intended base choice is assigned a positive PV.
(3) If the current base should be changed to any other base, then the current base choice is assigned a negative PV while the other base choices are assigned a zero PV.
(4) Finally, if a base choice is not covered by the above three rules, a zero PV is assigned to it.
The specific magnitude of the nonzero PVs is chosen according to the significance of the potential impact on the hybridization affinity between the considered target and probe. The details of this PV assignment are highly technical and therefore omitted. The interested reader is referred to our software tool [28] for a detailed implementation of the PV computation algorithm.
After PV assignment, we calculate the socalled Accumulated PV (APV). For a given base choice at a given position of the probe, the corresponding APV is the sum of all the PVs associated with this choice. The APV is used as an indicator of the influence of a base change in our algorithm; the bases associated with negative APVs are deemed undesirable and therefore should be changed; if the current base of the probe is associated with a positive APV, one would like to leave this base unchanged; if a base choice, different from the current base of the probe, has a positive APV value, one should change the current base to this new choice.
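The voting mechanism can be sketched for a single aligned position. The Python code below follows the four PV rules, but uses unit magnitudes (+1 and -1) for the nonzero PVs; the actual magnitudes used in [28] are not given in the text, so these values and all function names are assumptions. A probe base is treated as pairing with a target base when it is the Watson-Crick complement.

```python
WC = {"A": "T", "T": "A", "C": "G", "G": "C"}  # Watson-Crick complements
BASES = "ATCG"


def preference_values(probe_base, target_base, should_bind):
    """Return one PV per base choice for one target at one aligned position."""
    pv = {b: 0 for b in BASES}                  # rule 4: default PV is zero
    paired = probe_base == WC[target_base]
    if should_bind and paired:
        pv[probe_base] = 1                      # rule 1: keep the current base
    elif should_bind and not paired:
        pv[probe_base] = -1                     # rule 2: change to one
        pv[WC[target_base]] = 1                 #         specific base
    elif not should_bind and paired:
        pv[probe_base] = -1                     # rule 3: change to any other base
    return pv


def accumulated_pv(probe_base, votes):
    """Sum PVs over targets; votes is a list of (target_base, should_bind)."""
    apv = {b: 0 for b in BASES}
    for target_base, should_bind in votes:
        pv = preference_values(probe_base, target_base, should_bind)
        for base, value in pv.items():
            apv[base] += value
    return apv


# Example from Section 3.1: probe base T; three targets with bases T, A, A;
# the probe should bind the first two targets but not the third.
apv = accumulated_pv("T", [("T", True), ("A", True), ("A", False)])
print(apv)
```

In this example the current base "T" ends up with a negative APV and "A" with a positive one, so the vote favors changing the base to "A".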
It is worth pointing out the "partly" random nature of the algorithm. In step 5 of our algorithm, whether a current base at a given position is changed or not and which base the current base is changed to are randomly decided. The probabilities with which the current base is changed, and with which a specific base is selected to replace the current base, are related to the magnitudes of the associated APVs. The implementation details behind this randomization mechanism are omitted, but can be found in [28].
This random choice component helps in avoiding "dead traps" that may occur in deterministic algorithms. As an illustrative example, suppose that the intended binding pattern between a probe and all targets except target 1 is satisfied in a given iteration. From the perspective of target 1, the first base of the probe should be changed from "T" to "C." In a deterministic approach, a base replacement must be performed following this preference exactly. However, this base change breaks the desired hybridization pattern between the probe and target 2. In the next iteration, according to the perspective of target 2, the first base of the probe has to be changed back to "T." As a result, this probe base "oscillates" between these two choices of "T" and "C," and the algorithm falls into a "dead trap." In contrast, due to the randomization mechanism in our algorithm, there is a certain probability that the base change does not follow exactly what seems necessary. Dead traps can be prevented from happening or escaped from once they happen.
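One way to realize such a randomized choice is to sample the next base with probabilities that grow with the APV. The sketch below uses a softmax mapping with a hypothetical `temperature` parameter; this mapping is a stand-in chosen for illustration, since the text only states that the probabilities are related to the APV magnitudes and defers the details to [28].

```python
import math
import random


def replacement_probabilities(apv, temperature=1.0):
    """Turn APVs into selection probabilities (softmax; an assumed mapping)."""
    weights = {b: math.exp(v / temperature) for b, v in apv.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


def sample_base(apv, rng, temperature=1.0):
    """Draw the next base at one position; a larger APV is more likely."""
    probs = replacement_probabilities(apv, temperature)
    bases = list(probs)
    return rng.choices(bases, weights=[probs[b] for b in bases], k=1)[0]


apv = {"A": 1, "T": -1, "C": 0, "G": 0}   # APVs at one probe position
probs = replacement_probabilities(apv)
rng = random.Random(0)                     # seeded for repeatable draws
draws = [sample_base(apv, rng) for _ in range(5)]
print(probs, draws)
```

Because every base keeps a nonzero probability, the update can occasionally ignore the majority vote, which is what allows the search to leave an oscillating pair of states.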
The algorithm is repeated as many times as the number of probes.
3.2. Toy Probe Design Example for Three Probes and Seven Targets
The target nucleotide sequences.
Target  Source organism  Genome position 
Target 1  Methanothermobacter thermautotrophicus (Mth)  complement (142033 142087) 
Target 2  Methanococcus jannaschii (Mja)  (77481 77535) 
Target 3  Methanosarcina acetivorans str.C2A (Mac)  (59910 59964) 
Target 4  Pyrococcus horikoshii (Pab)  complement (1122252 1122306) 
Target 5  Archaeoglobus fulgidus (Afu)  complement (365030 365084) 
Target 6  Methanopyrus kandleri AV19 (Mka)  complement (1007480 1007534) 
Target 7  Thermoplasma volcanium (Tvo)  (636571 636625) 
The GC contents for these three probes are 50%, 51.4%, and 51.4%, respectively. The GC contents of the sequences should be of similar value to ensure similar melting temperatures for the duplexes. The secondary structures of these probes can be predicted by using the mfold package [29] and are depicted in Figure 6. As one can see, all folds have sufficiently long unmatched regions that can hybridize to the targets.
The best matched lengths of the probes and targets (rows: Probes 1–3; columns: Targets 1–7).
4. CSM Signal Recovery
The final step of a CSM process is to estimate the target concentration according to the microarray readout. Recalling the signal acquisition model in (5), a signal recovery algorithm specifically designed for CSMs has to take into account the measurement nonlinearity.
Compared to other CS signal recovery methods, belief propagation (BP) is the most amenable to incorporating nonlinear measurements. It has been shown that a CS measurement matrix $\Phi$ can be represented as a bipartite graph of signal coefficient nodes and measurement nodes [5, 12]. When $\Phi$ is sparse enough, BP can be applied, so we are able to approximate the marginal distributions of each of the coefficients conditioned on the observed data. (Note that the Hamming code matrix is not sparse. Still, one can use simple "sparsifying" techniques to modify $\Phi$ for decoding purposes only [30].) We can then compute the MLE, MMSE, and MAP estimates of the coefficients from their distributions (we refer to [5, 12] for details).
Algorithm 1: Probe design for CSMs.
Input: The target sequences, the row of the intended binding matrix corresponding to the chosen probe.
Initialization: Randomly generate multiple candidates for the probe under consideration. For each candidate, perform the following iterative sequence update procedure.
(1) Check the probe's GC content. If the GC content is too low, randomly change some "A" or "T" bases to "G" or "C" bases; if it is too high, do the reverse. The GC content after the base changes must satisfy the GC content requirement.
(2) Check whether the probe sequence satisfies the intended binding pattern. If yes, quit the iterations. If not, go to the next step.
(3) If an appropriate probe has not been found after a large number of iterations, report a failure, and quit the iterations.
(4) For each of the targets, calculate the PV associated with each base choice at each position of the probe. Then calculate the APV.
(5) Randomly change some bases of the probe sequence so that a potential change associated with a larger APV increment is made more probable.
(6) Go back to Step 1.
Completion: Check for loop information in the secondary structure of all the surviving probe candidates. Choose the probe with the fewest loops. If more than one such probe exists, randomly choose one of the probes with the shortest loop length.
Output: The probe sequence.
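Step 1 of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' Matlab code [28]: the function names, the acceptance window of 45% to 55% GC, and the one-base-at-a-time update are all assumptions made for the example.

```python
import random


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def adjust_gc(seq, low, high, rng):
    """Randomly flip single bases until the GC fraction lies in [low, high].

    Assumes 0 <= low <= high <= 1. Each flip moves the GC count by one
    toward the window, so at most len(seq) flips are ever needed.
    """
    bases = list(seq)
    for _ in range(len(bases) + 1):
        gc = gc_content(bases)
        if low <= gc <= high:
            return "".join(bases)
        if gc < low:
            # Too few G/C bases: turn a random A or T into G or C.
            idx = rng.choice([i for i, b in enumerate(bases) if b in "AT"])
            bases[idx] = rng.choice("GC")
        else:
            # Too many G/C bases: turn a random G or C into A or T.
            idx = rng.choice([i for i, b in enumerate(bases) if b in "GC"])
            bases[idx] = rng.choice("AT")
    raise ValueError("no attainable GC fraction inside [low, high]")


rng = random.Random(0)  # fixed seed so the example is repeatable
probe = adjust_gc("ATATATATATATATATATAT", 0.45, 0.55, rng)
```

In the full procedure this check would run at the top of every iteration, before the binding-pattern test in Step 2.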
4.1. Extracting the Signal from Nonlinear Measurements
Due to saturation effects in the intensity response of the microarray, the nonlinearity caps the recorded measurements, which never exceed a saturation level. Because measurement noise is present, the solution is not as simple as inverting the nonlinearity and then applying BP for CS reconstruction.
Our goal is to determine the probability distribution of the true signal values on a grid of sample points, using the measurement intensities as constraints. Instead of inverse-mapping the measurements through the nonlinearity, we calculate the equivalent probabilities of the transformed distribution by mapping the required sample points to transformed points. At each measurement node, the probability masses of the transformed distribution can then be picked out at the desired points. By construction, none of these evaluation points exceeds the saturation level, so the inverse function is well defined there and the probability masses of the untransformed variable can be calculated from those of the transformed one. The problem thus reduces to the regular BP solution for CS reconstruction [5]. This procedure is repeated at each constraint node.
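 (1) Transform the sample points by applying the nonlinearity to obtain the transformed points.
 (2) For each measurement node, obtain the probability distribution of the transformed variable, which is equivalent to the distribution of the untransformed variable.
 (3) Evaluate the probability masses of the transformed distribution at the sample grid points.
 (4) Calculate the probability masses of the untransformed variable from those of the transformed one by applying the inverse function.
 (5) Apply BP for CS decoding as in [5].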
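The idea of evaluating probabilities at transformed grid points, rather than inverting a noisy readout, can be sketched for a single measurement node. Everything specific here is an assumption made for illustration: the saturating function `saturate` (a simple ratio form with hypothetical parameters `alpha` and `beta`), the additive Gaussian noise with standard deviation `sigma`, and the grid itself. The model in (5) may differ.

```python
import math


def saturate(y, alpha=1.0, beta=1.0):
    """Hypothetical saturating response: increases with y, stays below alpha."""
    return alpha * y / (beta + y)


def grid_masses(z_obs, grid, sigma, alpha=1.0, beta=1.0):
    """Probability masses over `grid` given one noisy nonlinear readout.

    Each grid point is pushed through the nonlinearity, and the assumed
    Gaussian noise likelihood is evaluated at the transformed point. The
    readout itself is never inverted, so values above alpha cause no trouble.
    """
    transformed = [saturate(y, alpha, beta) for y in grid]
    weights = [math.exp(-0.5 * ((z_obs - z) / sigma) ** 2) for z in transformed]
    total = sum(weights)
    return [w / total for w in weights]


grid = [0.1 * k for k in range(101)]  # candidate values 0.0, 0.1, ..., 10.0
z_obs = saturate(2.0) + 0.01          # noisy readout of a true value of 2.0
masses = grid_masses(z_obs, grid, sigma=0.02)
best = grid[max(range(len(grid)), key=masses.__getitem__)]

# A readout above the saturation level still yields a valid distribution.
masses_high = grid_masses(1.05, grid, sigma=0.02)
```

The resulting masses live on the original grid, so they can be handed to a standard BP decoder for CS in place of the masses a linear measurement would have produced.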
4.2. Numerical Results
5. Conclusion
We study how to design a microarray suitable for compressive sensing. A hybridization model is proposed to predict whether given CS probes mimic the behavior of a binary CS matrix, and algorithms are designed, respectively, to find probe sequences satisfying the binding requirements and to compute the target concentrations from measurement intensities. Lab experimental calibration of the model and a small-scale CSM design result are presented.
Declarations
Acknowledgements
This work was supported by NSF Grants CCF 0821910 and CCF 0809895. The authors also gratefully acknowledge many useful discussions with Xiaorong Wu from the University of Colorado at Denver School of Medicine.
References
 1. Affymetrix microarrays. http://www.affymetrix.com/products/arrays/specific/cexpress.affx
 2. Taylor JW, Turner E, Townsend JP, Dettman JR, Jacobson D: Eukaryotic microbes, species recognition and the geographic limits of species: examples from the kingdom Fungi. Philosophical Transactions of the Royal Society B 2006, 361(1475):1947–1963. doi:10.1098/rstb.2006.1923
 3. Candès EJ, Tao T: Decoding by linear programming. IEEE Transactions on Information Theory 2005, 51(12):4203–4215. doi:10.1109/TIT.2005.858979
 4. Donoho DL: Compressed sensing. IEEE Transactions on Information Theory 2006, 52(4):1289–1306.
 5. Sarvotham S, Baron D, Baraniuk R: Compressed sensing reconstruction via belief propagation. Preprint, 2006. http://www.dsp.ece.rice.edu/cs/csbpTR07142006.pdf
 6. Tropp JA: Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory 2004, 50(10):2231–2242. doi:10.1109/TIT.2004.834793
 7. Dai W, Milenkovic O: Subspace pursuit for compressive sensing: closing the gap between performance and complexity. Submitted to IEEE Transactions on Information Theory. http://arxiv.org/abs/0803.0811
 8. Wang D, Urisman A, Liu YT, et al.: Viral discovery and sequence recovery using DNA microarrays. PLoS Biology 2003, 1(2, article e2):1–4. doi:10.1371/journal.pbio.0000041
 9. Schliep A, Torney DC, Rahmann S: Group testing with DNA chips: generating designs and decoding experiments. Proceedings of the Computational Systems Bioinformatics Conference (CSB '03), Stanford, Calif, USA, August 2003, 2: 84–91.
 10. Macula AJ, Schliep A, Bishop MA, Renz TE: New, improved, and practical k-stem sequence similarity measures for probe design. Journal of Computational Biology 2008, 15(5):525–534. doi:10.1089/cmb.2007.0208
 11. Du DZ, Hwang FK: Combinatorial Group Testing and Its Applications. World Scientific, Singapore; 2000.
 12. Sheikh MA, Sarvotham S, Milenkovic O, Baraniuk RG: DNA array decoding from nonlinear measurements by belief propagation. Proceedings of the 14th IEEE/SP Workshop on Statistical Signal Processing (SSP '07), Madison, Wis, USA, August 2007, 215–219.
 13. Shmulevich I, Astola J, Cogdell D, Hamilton SR, Zhang W: Data extraction from composite oligonucleotide microarrays. Nucleic Acids Research 2003, 31(7, article e36):1–5.
 14. Candès EJ, Romberg JK, Tao T: Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics 2006, 59(8):1207–1223. doi:10.1002/cpa.20124
 15. Gregory TR: Macroevolution, hierarchy theory, and the C-value enigma. Paleobiology 2004, 30(2):179–202. doi:10.1666/0094-8373(2004)030<0179:MHTATC>2.0.CO;2
 16. DeVore RA: Deterministic constructions of compressed sensing matrices. Journal of Complexity 2007, 23(4–6):918–925.
 17. Berinde R, Indyk P: Sparse recovery using sparse random matrices. Preprint, 2008. http://people.csail.mit.edu/indyk/report.pdf
 18. Chen YA, Chou CC, Lu X, et al.: A multivariate prediction model for microarray cross-hybridization. BMC Bioinformatics 2006, 7, article 101: 1–12.
 19. Smith TF, Waterman MS: Identification of common molecular subsequences. Journal of Molecular Biology 1981, 147(1):195–197. doi:10.1016/0022-2836(81)90087-5
 20. Matlab Bioinformatics Toolbox: Exploring Primer Design Demo. http://www.mathworks.com/applications/compbio/demos.html?file=/products/demos/shipping/bioinfo/primerdemo.html
 21. Xu W, Bak S, Decker A, Paquette SM, Feyereisen R, Galbraith DW: Microarray-based analysis of gene expression in very large gene families: the cytochrome P450 gene superfamily of Arabidopsis thaliana. Gene 2001, 272(1–2):61–74. doi:10.1016/S0378-1119(01)00516-9
 22. Khomyakova E, Livshits MA, Steinhauser MC, et al.: On-chip hybridization kinetics for optimization of gene expression experiments. BioTechniques 2008, 44(1):109–117. doi:10.2144/000112622
 23. Breslauer KJ, Frank R, Blocker H, Marky LA: Predicting DNA duplex stability from the base sequence. Proceedings of the National Academy of Sciences of the United States of America 1986, 83(11):3746–3750. doi:10.1073/pnas.83.11.3746
 24. Milenkovic O, Kashyap N: DNA codes that avoid secondary structures. Proceedings of the IEEE International Symposium on Information Theory (ISIT '05), Adelaide, Australia, September 2005, 288–292.
 25. Durbin BP, Hardin JS, Hawkins DM, Rocke DM: A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002, 18: S105–S110. doi:10.1093/bioinformatics/18.suppl_1.S105
 26. Hekstra D, Taussig AR, Magnasco M, Naef F: Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Research 2003, 31(7):1962–1968. doi:10.1093/nar/gkg283
 27. Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ: Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Research 2000, 28(22):4552–4557. doi:10.1093/nar/28.22.4552
 28. Matlab codes for probe design in CSMs.
 29. Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research 2003, 31(13):3406–3415. doi:10.1093/nar/gkg595
 30. Kumar V, Milenkovic O: On graphical representations of algebraic codes suitable for iterative decoding. IEEE Communications Letters 2005, 9(8):729–731. doi:10.1109/LCOMM.2005.1496597
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.