# Compressive Sensing DNA Microarrays

- Wei Dai^{1} (corresponding author)
- Mona A Sheikh^{2}
- Olgica Milenkovic^{1} (corresponding author)
- Richard G Baraniuk^{2}

**2009**:162824

https://doi.org/10.1155/2009/162824

© Wei Dai et al. 2009

**Received:** 30 July 2008

**Accepted:** 23 October 2008

**Published:** 22 December 2008

## Abstract

Compressive sensing microarrays (CSMs) are DNA-based sensors that operate using group testing and compressive sensing (CS) principles. In contrast to conventional DNA microarrays, in which each genetic sensor is designed to respond to a single target, in a CSM, each sensor responds to a set of targets. We study the problem of designing CSMs that simultaneously account for both the constraints from CS theory and the biochemistry of probe-target DNA hybridization. An appropriate cross-hybridization model is proposed for CSMs, and several methods are developed for probe design and CS signal recovery based on the new model. Lab experiments suggest that in order to achieve accurate hybridization profiling, consensus probe sequences are required to have sequence homology of at least 80% with all targets to be detected. Furthermore, out-of-equilibrium datasets are usually as accurate as those obtained from equilibrium conditions. Consequently, one can use CSMs in applications in which only short hybridization times are allowed.

## 1. Introduction

Accurate identification of large numbers of genetic sequences in an environment is an important and challenging research problem. DNA microarrays are a frequently applied solution for microbe DNA detection and classification [1]. The array consists of genetic sensors or *spots*, containing a large number of single-stranded DNA sequences termed *probes*. A DNA strand in a test sample, referred to as a *target*, tends to bind or "hybridize" with its complementary probe on a microarray so as to form a stable duplex structure. The DNA samples to be identified are fluorescently tagged before being flushed against the microarray. The excess DNA strands are washed away and only the hybridized DNA strands are left on the array. The fluorescent illumination pattern of the array spots is then used to infer the genetic makeup in the test sample.

### 1.1. Concerns in Classical DNA Microarrays

In traditional microarray designs, each spot has a DNA subsequence that serves as a unique identifier of only *one* organism in the target set. However, there may be other probes in the array with similar base sequences for identifying other organisms. Due to the fact that the spots may have DNA probes with similar base sequences, both specific and nonspecific hybridization events occur; the latter effect leads to errors in the array readout.

Furthermore, the unique sequence design approach severely restricts the number of organisms that can be identified. In typical biosensing applications, an extremely large number of organisms must be identified. For example, there are more than known harmful microbes, many with significantly more than strains [2]. A large number of DNA targets requires microarrays with a large number of spots. The implementation cost and the speed of microarray data processing are directly related to the number of spots, which represents a significant problem for commercial deployment of hand-held microarray-based biosensors.

### 1.2. Compressive Sensing

Compressive sensing (CS) is a recently developed sampling theory for sparse signals [3]. The main result of CS, introduced by Candès and Tao [3] and Donoho [4], is that a length-$N$ signal $x$ that is $K$-sparse in some basis can be recovered *exactly* in polynomial time from just $M = O(K \log(N/K))$ linear measurements of the signal. In this paper, we choose the canonical basis; hence $x$ has $K$ nonzero and $N-K$ zero entries.

In matrix notation, we measure $y = \Phi x$, where $x$ is the $N \times 1$ sparse signal vector we aim to sense, $y$ is an $M \times 1$ measurement vector, and the *measurement matrix* $\Phi$ is an $M \times N$ matrix. Since $M < N$, recovery of the signal $x$ from the measurements $y$ is ill posed in general. However, the additional assumption of signal *sparsity* makes recovery possible. In the presence of measurement noise, the model becomes $y = \Phi x + w$, where $w$ stands for i.i.d. additive white Gaussian noise with zero mean.
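The measurement model can be illustrated with a short NumPy sketch. The sizes `N`, `M`, `K`, the noise level, and the Gaussian choice of `Phi` are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): signal length, number of measurements, sparsity.
N, M, K = 100, 30, 3

# K-sparse signal in the canonical basis: K nonzero entries, N - K zeros.
x = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
x[support] = rng.uniform(1.0, 2.0, size=K)

# Random Gaussian measurement matrix (one standard choice for CS).
Phi = rng.standard_normal((M, N)) / np.sqrt(M)

# Noiseless measurements, then measurements with zero-mean Gaussian noise.
y_clean = Phi @ x
w = 0.01 * rng.standard_normal(M)
y = y_clean + w
```

Here `y` has only `M` entries although `x` has `N`, which is why recovery needs the sparsity assumption.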

The two critical conditions to realize CS are that (i) the vector $x$ to be sensed is sufficiently sparse, and (ii) the rows of $\Phi$ are sufficiently incoherent with the signal sparsity basis. Incoherence is achieved if $\Phi$ satisfies the so-called restricted isometry property (RIP) [3]. For example, random matrices built from Gaussian and Bernoulli distributions satisfy the RIP with high probability. $\Phi$ can also be sparse, with only a few nonzero entries per row (the number can vary from row to row) [5].

Various methods have been developed to recover a sparse $x$ from the measurements $y$ [3, 5–7]. When $\Phi$ itself is sparse, belief propagation and related graphical inference algorithms can also be applied for fast signal reconstruction [5].
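As a concrete instance of sparse recovery, the sketch below uses orthogonal matching pursuit (OMP). This is a swapped-in, generic greedy method chosen for brevity; it is not the belief propagation decoder of [5], and the function name `omp` and the test sizes are assumptions.

```python
import numpy as np

def omp(Phi, y, K):
    """Greedy sparse recovery: pick K columns of Phi that best explain y."""
    M, N = Phi.shape
    residual = np.asarray(y, dtype=float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(K):
        # Column most correlated with the current residual.
        correlations = np.abs(Phi.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same column twice
        support.append(int(np.argmax(correlations)))
        # Least-squares fit of y on the columns selected so far.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(N)
    x_hat[support] = coef
    return x_hat

# Small noiseless demonstration with assumed sizes.
rng = np.random.default_rng(1)
N, M, K = 60, 30, 2
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
x = np.zeros(N)
x[[5, 40]] = [1.5, 2.0]
y = Phi @ x
x_hat = omp(Phi, y, K)
```

The estimate `x_hat` has at most `K` nonzero entries, and its residual can never be larger than that of the all-zero estimate.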

An important property of CS is its *information scalability*—CS measurements can be used for a wide range of statistical inference tasks besides signal reconstruction, including estimation, detection, and classification.

### 1.3. Compressive Sensing Meets Microarrays

The setting for microbial DNA sensing naturally lends itself to CS: although the number of potential agents that a hostile adversary can use is large, *not all agents* are expected to be present in a significant concentration at a given time and location, or even in an air/water/soil sample to be tested in a laboratory. In traditional microarrays, this results in many inactive probes during sensing. On the other hand, there will always be minute quantities of certain harmful biological agents that may be of interest to us. Therefore, it is important not just to detect the presence of agents in a sample, but also to *estimate* the concentrations with which they are present.

Mathematically, one can represent the DNA concentration of each organism as an element in a vector $x$. Therefore, as per the assumption of only a few agents being present, this vector is sparse, that is, contains only a few significant entries. This suggests designing a microarray along the lines of the CS measurement process, where each measurement $y_i$ is a linear combination of the entries in the vector $x$, and where the sparse vector $x$ can be reconstructed from $y$ via CS decoding methods.

In our proposed microarrays, the readout of each probe represents a probabilistic combination of all the targets in the test sample. The probabilities are representatives of each probe affinity to its targets due to how much the target and probe are likely to hybridize together. We explain our model for probe-target hybridization in Section 2.2. In particular, the cross-hybridization property of a DNA probe with several targets, not just one, is the key for applying CS principles.

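Formally, with $x$ the vector of target concentrations and $y$ the vector of spot readouts, the measurement model is

$$y = \Phi x + w. \tag{1}$$

Here, $\Phi$ is the sensing matrix, and $w$ denotes a vector of i.i.d. additive white Gaussian noise samples with zero mean.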

We note that this probabilistic combination is assumed to be linear for the purposes of microarray design. However, in reality, there is a nonlinear saturation effect when excessive targets are present (see Section 2.4 for details). We take this into account on the reconstruction side, as part of the CS decoding techniques to decipher the combinatorial sensor readout.

Therefore, by using the CS principle, the number of spots in the microarray can be made much smaller than the number of target organisms. With fewer "intelligently chosen" DNA probes, the microarray can also be more easily miniaturized [8–10]. We refer to a microarray designed this way as a CS microarray (CSM).

The CS principle is similar to the concept of group testing [8–11], which also relies on the sparsity observed in the DNA target signals. The chief advantage of a CS-based approach over direct group testing is its information scalability. With a reduced number of measurements, we are able not just to detect, but also to *estimate* the target signal. This is important because often pathogens in the environment are only harmful to us in large concentrations. Furthermore, we are able to use CS recovery methods such as belief propagation that decode $x$ while accounting for experimental noise and measurement nonlinearities due to excessive target molecules [12].

It is also worth pointing out the substantial difference between CSMs and the "composite microarrays" designed to reduce measurement variability [13]. In the latter approach, the microarray readouts are linear combinations of input signal components and therefore can be expressed in the form given by (1). However, the matrix $\Phi$ of [13] typically does not satisfy the CS design principles. As a result, the number of required measurements/spots is significantly larger than that of CSMs. In contrast, the use of the CS principle provides both robust measurements and a significant reduction in the number of spots on the array [14].

### 1.4. Clusters of Orthologous Groups

The COGs database consists of groups of 192,987 proteins in 66 unicellular organisms classified into 4872 clusters. We use these clusters as a guideline to group targets together. Targets with similar DNA sequences belong to the same group, and can be more easily identified with a single probe. When designing probes, it is important to make sure that the chosen probes align minimally with organisms that do not belong to their group (the "nontargets"). We can use the COGs database with its exhaustive classification to this end, since DNA sequences of an organism whose proteins do not belong to a certain COG will have minimal alignment with DNA sequences of other organisms in that COG. This significantly reduces the computational complexity of the search for good probe sequences.

One limitation in using COGs is that it constrains the design of the matrix $\Phi$. For instance, if we were to choose a set of 10 organisms we are interested in for microarray detection, there are only a finite number of COGs (groups) that these 10 organisms will belong to. We would have to carefully sift through these groups to find the ones that best satisfy the CS requirements on $\Phi$, and for each choice make sure that it is dissimilar enough from the other groups chosen. So on the one hand, using COGs guides our target grouping strategy; on the other hand, it is possible that we might not be able to find enough $\Phi$-suitable COGs to identify all members of the group. Using only a COGs-based approach, we may have to resort to using a $\Phi$ that may not be the best from a CS perspective but simply what nature gives us. Here, however, we only consider an approach using COGs.

A second limitation of COGs is the fact that it is a classification of organisms based on alignments between the *sections* of their DNA that encode for proteins, not entire sequences. Therefore, a point for future exploration would be to work with values from alignments between entire DNA sequences of organisms. Probes selected using such an alignment would be better reflective of the actual probe-target hybridization that takes place in a biosensing device.

However, we are fortunate that prokaryotes such as unicellular bacteria typically have a larger percentage of coding than noncoding DNA. Therefore, as long as we are interested in the detection of unicellular bacteria, which are prokaryotes, using a COGs-based probe selection is not as much of an issue. On the other hand, eukaryotes have large amounts of noncoding regions in their DNA. This phenomenon is known as the C-value enigma [15]: more complex organisms often have more noncoding DNA in their genomes.

### 1.5. CSM Design Consideration

To design a CSM, we start with a given set of $N$ targets and a valid CS matrix $\Phi$ of size $M \times N$. The design goal is to find $M$ DNA probe sequences such that the hybridization affinity between the $i$th probe and the $j$th target can be *approximated* by the value of $\Phi_{i,j}$. For this purpose, we need to go row-by-row in $\Phi$, and for each row find a probe sequence such that the hybridization affinities between the probe and the $N$ targets mimic the entries in this row. For simplicity, we assume that the CS matrix $\Phi$ is binary, that is, its entries have value zero or are equal to some positive constant, say $c$. An entry of positive value refers to the case where the corresponding target and probe DNA strands bind together with sufficient strength that the fluorescence from the target strand adhered to the probe is visible during the microarray readout process. A zero-valued entry indicates that no such hybridization affinity exists. How to construct a binary CS matrix $\Phi$ is discussed in many papers, including [16, 17], but is beyond the scope of this paper. Henceforth, we assume that we know the $\Phi$ we want to approximate.
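To make the notion of a binary CS matrix concrete, the sketch below builds the $3 \times 7$ parity-check matrix of the (7,4) Hamming code, a matrix of the kind referred to later in the paper, and decodes a noiseless readout in which exactly one target is present. The constant `c`, the single-target restriction, and the helper `identify_single_target` are assumptions made for illustration.

```python
import numpy as np

# Parity-check matrix of the (7,4) Hamming code: column j holds the 3-bit
# binary expansion of j + 1, so all seven columns are distinct and nonzero.
H = np.array([[(j >> b) & 1 for j in range(1, 8)] for b in range(3)])

c = 1.0          # hypothetical "high affinity" constant
Phi = c * H      # 3 probes (rows) x 7 targets (columns)

def identify_single_target(y, Phi, tol=1e-9):
    """Return (target index, concentration) for a noiseless 1-sparse readout."""
    pattern = (y > tol).astype(int)
    for j in range(Phi.shape[1]):
        if np.array_equal(pattern, (Phi[:, j] > 0).astype(int)):
            return j, float(y.max() / Phi[:, j].max())
    return None

x = np.zeros(7)
x[4] = 2.5       # only target 4 (0-based) is present
y = Phi @ x
result = identify_single_target(y, Phi)
```

Because every column has a different on/off pattern, three spots suffice to tell seven targets apart in this single-target case.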

The CSM design process is then reduced to answering two questions. Given a probe and target sequence pair, how does one predict the corresponding microarray readout intensity? Given targets and the desired binding pattern, how does one find a probe DNA sequence such that the binding pattern is satisfied?

The first question is answered by a two-step translation of a probe-target pair to the spot intensity. First, we need a hybridization model that uses features of the probe and target sequences to predict the cross-hybridization affinity between them. Since the CS matrix that we want to approximate is binary, the desired hybridization affinities can be roughly categorized into two levels, "high" and "low," corresponding to one and zero entries in $\Phi$, respectively. The affinities in each category should be roughly uniform, while those belonging to different categories must differ significantly. With these design requirements in mind, we develop a simplified hybridization model in Section 2.2 and verify its accuracy via laboratory experiments, the results of which are presented in Section 2.3. As the second step, we need to translate the hybridization values to microarray spot intensities using a model that includes physical parameters of the experiment, such as background noise. This issue is discussed in Section 2.4.

To answer the second question, we propose a probe design algorithm that uses a "sequence voting mechanism" and a randomization mechanism. The algorithm is presented in Section 3.1. An example of the practical implementation of this algorithm is given in Section 3.2.

## 2. Hybridization Model

### 2.1. Classical Models

12 parameters used in [18] for predicting hybridization affinities between DNA sequence pairs.

| Description |
|---|
| Probe sequence length, target sequence length |
| Probe GC content, target GC content |
| Smith-Waterman score: computed from the scoring system used in the SW alignment |
| Percent identity: percentage of matched bases in the aligned region after SW alignment |
| Length of the SW alignment |
| Gibbs free energy for probe DNA folding |
| Hamming distance between probe and target |
| Length of longest contiguous matched segment in a SW alignment |
| GC content in the longest contiguous segment |

Several of these parameters are based on the *Smith-Waterman* (SW) local alignment, computed using dynamic programming techniques [19]. The SW alignment identifies the most similar local region between two nucleotide sequences. It compares segments of all possible lengths, calculates the corresponding sequence similarity according to some scoring system, and outputs the optimal local alignment and the optimal similarity score. For example, given the two sequences -CCCTGGCT- and -GTAAGGGA-, the SW alignment, which ignores prefix and suffix gaps, outputs the best local alignment between them.

Another important parameter for assessing hybridization affinity is the length of contiguous matched base pairs. It has been shown in [18, 20] that long contiguous base pairs imply strong affinity between the probe and target. Usually, one requires at least 10 bases in oligo DNA probes for ensuring sufficiently strong hybridization affinity.
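The SW computation can be sketched with a short dynamic program. The scoring system (match +1, mismatch -1, linear gap -1) is an assumption, as is the reading of the example: the strand orientation labels are not available here, so the second sequence is treated as written antiparallel to the first and is complemented base by base before alignment.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best local score, end index in a, end index in b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                     # start a new local alignment
                H[i - 1][j - 1] + s,   # extend with a match or mismatch
                H[i - 1][j] + gap,     # gap in b
                H[i][j - 1] + gap,     # gap in a
            )
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best

probe = "CCCTGGCT"
target = "GTAAGGGA"
# Bases of the target that would pair with the probe, position by position.
target_complement = "".join(COMPLEMENT[base] for base in target)
score, end_probe, end_target = smith_waterman(probe, target_complement)
aligned = probe[end_probe - score:end_probe]  # valid here: the best hit is gap-free
```

Under these assumptions the best local region is the four-base stretch CCCT of the probe pairing with GGGA of the target.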

Besides the large number of parameters that potentially influence hybridization affinity, there are many theories for which features most influence hybridization and how they affect the process [18, 21, 22]. A third-order polynomial model using percent identity as the single parameter was developed in [21]. More recently, three multivariate models, based on the third-order polynomial regression, regression trees, and artificial neural networks, respectively, were studied in [18].

### 2.2. Our Model for CSM

Unlike the above approaches, which aim to identify the exact affinity value, the binary nature of our CS matrix permits simplifications. As we have discussed in Section 1.5, we only need to predict whether the affinity between a probe-target pair is "high" or "low." For this purpose, two sets of rules, designed for deciding "high" and "low" affinities, respectively, are developed in this section.

We propose the notion of the best matched substring pair, defined as follows, for our hybridization model.

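*Definition 1*. Let $\mathbf{s} = s_1 s_2 \cdots s_n$ be a DNA sequence. A substring of $\mathbf{s}$ is a sequence of the form $s_i^j = s_i s_{i+1} \cdots s_j$, where $1 \le i \le j \le n$. Consider a given sequence pair $\mathbf{s}$ and $\mathbf{t}$ of lengths $n$ and $m$, respectively. Let $L$ be a positive integer at most $\min(n, m)$. A pair of substrings of length $L$, one of which is part of $\mathbf{s}$ and the other part of $\mathbf{t}$, will be denoted by $s_i^{i+L-1}$ and $t_j^{j+L-1}$, where $1 \le i \le n-L+1$ and $1 \le j \le m-L+1$.

For such a substring pair, the *substring percent identity* is defined as

$$\mathrm{PI}\left(s_i^{i+L-1}, t_j^{j+L-1}\right) = \frac{1}{L}\left|\left\{0 \le k \le L-1 : s_{i+k} = \bar{t}_{j+k}\right\}\right|,$$

where $\bar{t}_{j+k}$ denotes the Watson-Crick complement of $t_{j+k}$, and $|\cdot|$ denotes the cardinality of the underlying set.

*The best matched substring pair* of length $L$ is the substring pair with the largest $\mathrm{PI}$ among all possible substring pairs of length $L$ from the pair of $\mathbf{s}$ and $\mathbf{t}$.

For a given $L$, *the largest substring percent identity* $\mathrm{PI}^{*}(L)$ is the $\mathrm{PI}$ of the best matched substring pair of length $L$.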
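The largest substring percent identity can be computed by brute force over all substring pairs of a given length. The sketch below assumes that two bases count as matched when one is the Watson-Crick complement of the other at the same offset; the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def largest_substring_percent_identity(seq_a, seq_b, L):
    """Largest fraction of Watson-Crick matches over all length-L substring pairs."""
    if not 1 <= L <= min(len(seq_a), len(seq_b)):
        raise ValueError("L must lie between 1 and the shorter sequence length")
    best = 0.0
    for i in range(len(seq_a) - L + 1):
        for j in range(len(seq_b) - L + 1):
            matches = sum(
                seq_a[i + k] == COMPLEMENT[seq_b[j + k]] for k in range(L)
            )
            best = max(best, matches / L)
    return best

def identity_profile(seq_a, seq_b):
    """Largest substring percent identity for every admissible length L."""
    longest = min(len(seq_a), len(seq_b))
    return {
        L: largest_substring_percent_identity(seq_a, seq_b, L)
        for L in range(1, longest + 1)
    }

profile = identity_profile("AAAA", "TTTA")
```

For this toy pair every substring pair of length up to 3 can match perfectly, while the full length-4 pair has one mismatch, so the profile drops from 1.0 to 0.75 at the last length.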

*Remark 1.* For a given length $L$, the best matched substring pair is not necessarily unique, while the largest substring percent identity $\mathrm{PI}^{*}(L)$ is unique.

- (1)
For hybridization prediction, the percent identity should be used together with the alignment length. Although the significance of the single-parameter model based on percent identity was demonstrated in [21], we observed that using percent identity as the sole affinity indicator is sometimes misleading. As an illustration, consider the example in Figure 3. For the sequence pair A, the SW alignment gives and . For the sequence pair B, the SW alignment gives and . Though the pair B exhibits a smaller percent identity, it obviously has a stronger binding affinity than the pair A, for the aligned part of the pair A is merely a part of the aligned region of the pair B. The same principle holds for the sequence pairs B and C as well. This example shows that besides the percent identity, the *alignment length* is important.

- (2)
The pair of percent identity and alignment length is not sufficient to predict hybridization affinity. Consider the sequence pairs C and D in Figure 3. Both of them exhibit the same values for the percent identity and alignment length parameters. However, the hybridization affinities of these two pairs are different. To see this, let us refer to Figure 4, which depicts the largest substring percent identities of sequence pairs C and D for different lengths $L$. It can be observed that for any given $L$, the value for the sequence pair C is larger than that for the sequence pair D. In other words, the sequences in the former pair match with each other uniformly better than the sequences in the latter pair. The sequence pair C has a larger chance to hybridize than the pair D does. With the same percent identity and alignment length, the difference in hybridization affinity comes from the distribution of matched bases in the aligned region.

The advantage of using the largest substring percent identities for hybridization prediction is now apparent. The s include all the information contained in the previously discussed , and parameters; it can be verified that and that the is one of the values of s such that . Of course, a list of provides more detailed information, since it gives both local and global matching information.

- (C1)
There exists a best matched substring pair of length at least such that the corresponding substring percent identity satisfies . Alternatively, such that . Here, both and are judiciously chosen parameters.

- (C2)
Among all the best matched substring pairs with , there should be no pair of length longer than , that is, it should hold that for all . Again, has to be chosen properly.

Criterion (C1) guarantees that there is a significantly long substring pair with high-percent identity that ensures strong hybridization affinity. Although criterion (C2) may seem counterintuitive at first glance, it ensures that one single target cannot dominantly hybridize with the consensus probe, that is, the binding affinities between probe-target pairs are roughly uniform.

Criterion (C3) asserts that there should be no substring pair that has both long length and high-percentage identity. The last criterion, (C4), prevents the existence of a long contiguous matched substring pair which suggests large binding affinity. Again, and have to be chosen appropriately.

This model may seem an oversimplification for accurate hybridization affinity prediction. However, in our practical experience with small binary CS matrices (Section 1.5), this model functions properly (see Section 2.3).

where is either zero-valued or equal to , and is the approximation error that is assumed to take small values only. The physical interpretation of is given in (9). The values of s can be calibrated via lab experiments. Furthermore, the reconstruction algorithm can be designed to be robust to the approximation error.

*Remark 2.* This model can be further refined by introducing weighting factors in the definition of the substring percent identity. More precisely, the number of positionally matched base pairs can be replaced by a weighted sum, where C-G and A-T pairs are assigned different values. More accurate models, taking into account nearest-neighbor interactions, can be considered as well [23, 24]. These extensions will be considered elsewhere.

### 2.3. Experimental Calibration of Parameters

Lab experiments were performed to verify our translation criteria (C1)–(C4) and to choose appropriate values for the involved parameters.

The probe and target sequences were synthesized by *Invitrogen*, with the first three probes purified using the polyacrylamide gel electrophoresis (PAGE) method, while all other sequences were purified using the high-performance liquid chromatography (HPLC) method. The fluorescent tags of the targets are Alexa 532.

- (1)
- (2)
For all sequence pairs of which the microarray readout is weak, we have . (For the pair of probe A and target B, , but the corresponding microarray readout is weak.) Consequently, may be a critical parameter for deciding whether a probe-target pair hybridizes or not.

- (3)
Among all sequence pairs with weak microarray readouts, the length of the longest contiguous segment is (the pair of probe C and target A). This fact implies that the probe-target pair may not hybridize even when they have a contiguous matched substring of length .

Interestingly, when we reduced the incubation time to four hours, so that full equilibrium was not reached, the microarray still gave an accurate readout (see Figure 5(d)). We expect that one can use CSMs in applications for which only short hybridization times are allowed.

### 2.4. Translating Hybridization Affinity into Microarray Spot Intensity

where is the actual spot intensity we measure for given experimental conditions, and are positive hybridization constants, is the hybridization affinity, is the target concentration, represents the mean background noise, and denotes the measurement noise, which is often assumed to be Gaussian distributed with mean zero and variance [25, 26]. This model mimics the well-known Langmuir model, with background noise taken into consideration [26, 27].
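The saturation behavior of a Langmuir-type readout can be sketched as follows. The exact expression of the model is not reproduced here, so the functional form `alpha * a * c / (beta + a * c) + background + noise`, the parameter names, and the default values are all assumptions chosen only to illustrate saturation and background offset.

```python
import numpy as np

def spot_intensity(affinity, concentration, alpha=1000.0, beta=1.0,
                   background=50.0, noise_std=0.0, rng=None):
    """Assumed Langmuir-type readout: saturating signal + background + noise."""
    signal = np.asarray(affinity, dtype=float) * concentration
    saturating = alpha * signal / (beta + signal)   # never exceeds alpha
    noise = 0.0
    if noise_std > 0:
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.normal(0.0, noise_std, size=np.shape(signal))
    return saturating + background + noise

no_target = spot_intensity(1.0, 0.0)        # background only
half_saturated = spot_intensity(1.0, 1.0)   # signal equals beta
nearly_saturated = spot_intensity(1.0, 1e9) # approaches alpha + background
```

For small concentrations the response is close to linear, which is the regime assumed at design time; for large concentrations it flattens, which is the nonlinearity the recovery step has to handle.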

## 3. Search for Appropriate Probes

### 3.1. Probe Design Algorithm

We describe next an iterative algorithm for finding probe sequences satisfying a predefined set of binding patterns, that is, sequences that can serve as CS probes.

The design problem is illustrated by the following example. Suppose that we are dealing with three targets, labeled by , and , and that the binding pattern of the probe and targets is such that the probe is supposed to bind with targets and , but not with target . Assume next that the hybridization affinities between a candidate probe and targets and are too small, while the hybridization affinity between the probe and target is too large. In order to meet the desired binding pattern, we need to change some nucleotide bases of the probe sequence. For example, consider a particular aligned position of the probe and the targets, the corresponding probe and targets bases equal to "T," "T," "A," and "A," respectively. In this case, from the perspective of target , the base "T" of the probe should be changed to "A," while from the perspective of target , this "T" base should be changed to any other base not equal to "T." On the other hand, for target to exhibit strong hybridization affinity with the probe, the identity of the corresponding probe base should be kept intact. As different preferences appear from the perspectives of different targets, it is not clear whether the base under consideration should be changed or not.

We address this problem by using a *sequence voting mechanism*. For each position in the probe sequence, one has four base choices—"A," "T," "C," and "G." Each target is allowed to "cast its vote" for its preferred base choice. The final decision is made based on counting all the votes from all targets. More specifically, we propose a design parameter, termed as *preference value* (PV), to implement our voting mechanism. For a given pair of probe and target sequences, a unique PV is assigned to each base choice at each position of the probe. We design four rules for PV assignment.

- (1)
If the target "prefers" the current probe base left unchanged, a positive PV is assigned to the corresponding base choice.

- (2)
From the perspective of the target, if the current probe base should be changed to another *specific* base, then the original base choice is assigned a negative PV while the intended base choice is assigned a positive PV.

- (3)
If the current base should be changed to *any other* base, then the corresponding base choice is assigned a negative PV while other base choices are assigned a zero PV.

- (4)
Finally, if a base choice is not included in the above three rules, a zero PV is assigned to it.

The specific magnitude of the nonzero PVs is chosen according to the significance of the potential impact on the hybridization affinity between the considered target and probe. The details of this PV assignment are highly technical and therefore omitted. The interested reader is referred to our software tool [28] for a detailed implementation of the PV computation algorithm.

After PV assignment, we calculate the so-called *Accumulated PV* (APV). For a given base choice at a given position of the probe, the corresponding APV is the sum of all the PVs associated with this choice. The APV is used as an indicator of the influence of a base change in our algorithm; the bases associated with negative APVs are deemed undesirable and therefore should be changed; if the current base of the probe is associated with a positive APV, one would like to leave this base unchanged; if a base choice, different from the current base of the probe, has a positive APV value, one should change the current base to this new choice.

It is worth pointing out the "partly" random nature of the algorithm. In step 5 of our algorithm, whether a current base at a given position is changed or not and which base the current base is changed to are randomly decided. The probabilities with which the current base is changed, and with which a specific base is selected to replace the current base, are related to the magnitudes of the associated APVs. The implementation details behind this randomization mechanism are omitted, but can be found in [28].

This random choice component helps in avoiding "dead traps" that may occur in deterministic algorithms. As an illustrative example, suppose that the intended binding pattern between a probe and all targets except target 1 is satisfied in a given iteration. From the perspective of target 1, the first base of the probe should be changed from "T" to "C." In a deterministic approach, a base replacement must be performed following this preference exactly. However, this base change breaks the desired hybridization pattern between the probe and target 2. In the next iteration, according to the perspective of target 2, the first base of the probe has to be changed back to "T." As a result, this probe base "oscillates" between these two choices of "T" and "C," and the algorithm falls into a "dead trap." In contrast, due to the randomization mechanism in our algorithm, there is a certain probability that the base change does not follow exactly what seems necessary. Dead traps can be prevented from happening or escaped from once they happen.
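The voting and randomization steps can be sketched as follows. The PV magnitudes are made up, and the softmax mapping from APVs to selection probabilities (with a hypothetical `temperature` parameter) is an assumption standing in for the actual mechanism, whose details are in [28].

```python
import numpy as np

BASES = "ATCG"

def accumulated_pv(pv):
    """pv has shape (targets, positions, 4); the APV sums the votes over targets."""
    return np.asarray(pv, dtype=float).sum(axis=0)

def change_probabilities(apv_row, temperature=1.0):
    """Assumed mapping: softmax over the four APVs at one probe position."""
    z = (apv_row - apv_row.max()) / temperature
    p = np.exp(z)
    return p / p.sum()

def random_update(probe, apv, rng, temperature=1.0):
    """Resample every base; choices with larger APV are more probable."""
    new_bases = []
    for pos in range(len(probe)):
        p = change_probabilities(apv[pos], temperature)
        new_bases.append(BASES[rng.choice(4, p=p)])
    return "".join(new_bases)

# Two targets voting on a two-base probe "TA" (illustrative PVs).
pv = np.zeros((2, 2, 4))
pv[0, 0, BASES.index("T")] = 1.0    # target 1: keep "T" at position 0
pv[1, 0, BASES.index("T")] = -1.0   # target 2: change "T" ...
pv[1, 0, BASES.index("C")] = 2.0    # ... specifically to "C"

apv = accumulated_pv(pv)
probs = change_probabilities(apv[0])
new_probe = random_update("TA", apv, np.random.default_rng(0))
```

Because the update is sampled rather than forced, a base favored by one target is not always adopted, which is what lets the search leave an oscillation between two conflicting preferences.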

The algorithm is repeated as many times as the number of probes.

### 3.2. Toy Probe Design Example for

The target nucleotide sequences.

The GC contents for these three probes are 50%, 51.4%, and 51.4%, respectively. The GC contents of the sequences should be of similar value to ensure similar melting temperatures for the duplexes. The secondary structures of these probes can be predicted by using the m-fold package [29] and are depicted in Figure 6. As one can see, all folds have sufficiently long unmatched regions that can hybridize to the targets.

## 4. CSM Signal Recovery

The final step of a CSM process is to estimate the target concentration according to the microarray readout. Recalling the signal acquisition model in (5), a signal recovery algorithm specifically designed for CSMs has to take into account the measurement nonlinearity.

Compared to other CS signal recovery methods, *belief propagation* (BP) is the most amenable to incorporating nonlinear measurements. It has been shown that a CS measurement matrix $\Phi$ can be represented as a bipartite graph of signal coefficient nodes $x_j$ and measurement nodes $y_i$ [5, 12]. When $\Phi$ is sparse enough, BP can be applied, so we are able to approximate the marginal distributions of each of the $x_j$ coefficients conditioned on the observed data. (Note that the Hamming code matrix is not sparse. Still, one can use simple "sparsified" techniques to modify it for decoding purposes only [30].) We can then compute the MLE, MMSE, and MAP estimates of the coefficients from their distributions (we refer to [5, 12] for details).

**Algorithm 1:** Probe design for CSMs.

**Input:** The $N$ target sequences, the row of the intended binding matrix $\Phi$ corresponding to the chosen probe.

**Initialization:** Randomly generate multiple candidates for the probe under consideration. For each candidate, perform the following iterative sequence update procedure.

**Iteration:**

- (1)
Check the probe's GC content. If GC content is too low, randomly change some "A" or "T" bases to "G" or "C" bases, and vice versa. The GC content after base changes must satisfy the GC content requirement.

- (2)
Check whether the probe sequence satisfies the intended binding pattern. If yes, quit the iterations. If not, go to the next step.

- (3)
If an appropriate probe has not been found after a large number of iterations, report a failure, and quit the iterations.

- (4)
For each of the targets, calculate the PV associated with each of the base choices at each position of the probe. Then calculate the APV.

- (5)
Randomly change some bases of the probe sequence so that a potential change associated with a larger APV increment is made more probable.

- (6)
Go back to Step 1.

**Completion:** Check for loop information in the secondary structure of all the surviving probe candidates. Choose the probe with the fewest loops. If more than one such probe exists, randomly choose one of the probes with the shortest loop length.

**Output:** The probe sequence.
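Step 1 of Algorithm 1 can be sketched as below. The bounds `low` and `high` on the GC fraction and the helper names are assumptions; the paper does not state the GC requirement numerically in this section.

```python
import random

def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def adjust_gc(seq, low, high, rng):
    """Randomly swap bases until low <= GC content <= high."""
    n = len(seq)
    if not any(low <= k / n <= high for k in range(n + 1)):
        raise ValueError("no achievable GC content inside [low, high]")
    seq = list(seq)
    while gc_content(seq) < low:
        # Too little GC: turn a random A/T into G or C.
        idx = rng.choice([i for i, b in enumerate(seq) if b in "AT"])
        seq[idx] = rng.choice("GC")
    while gc_content(seq) > high:
        # Too much GC: turn a random G/C into A or T.
        idx = rng.choice([i for i, b in enumerate(seq) if b in "GC"])
        seq[idx] = rng.choice("AT")
    return "".join(seq)

adjusted = adjust_gc("AAAAAAAAAA", 0.4, 0.6, random.Random(0))
```

Each swap moves the GC fraction by exactly one base, so the loops stop at the first admissible value instead of overshooting the window.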

### 4.1. Extracting the Signal from Nonlinear Measurements

Due to saturation effects in the intensity response of the microarray, a nonlinearity acts on the measurements, so that the recorded intensities never exceed a fixed saturation level. Because of measurement noise, the solution is not as simple as inverting the nonlinearity and then applying BP for CS reconstruction.

Our goal is to determine the probability distribution of
at all possible values the true signal values
can take on a grid of sample points, using the measurement intensities
as constraints. The problem then reduces to solving the regular CS signal recovery problem using BP [5]. We note that instead of inverse-mapping
to find
, we can calculate the equivalent probabilities of the *transformed* distribution:
, by mapping the required sample points for the
distribution to transformed points
. At the
th measurement node
; the latter probability masses can be picked out at the desired
points. None of the values of
will be evaluated at
values that exceed
by construction. Now, the inverse function is well defined and we can calculate probability masses of
from those of
. The problem thus reduces to the regular BP solution for CS reconstruction. This procedure is repeated at each constraint node
.

- (1)
- (2)
For th measurement node , obtain the probability distribution of which is equivalent to the distribution of .

- (3)
- (4)
- (5)
Apply BP for CS decoding as in [5].
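The idea of mapping grid points forward through the nonlinearity, rather than inverting a noisy measurement, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the saturating response `saturate` (with hypothetical parameters `alpha` and `beta`), the additive Gaussian noise model with standard deviation `sigma`, and the function names. The source does not specify these in this section.

```python
import math


def saturate(y, alpha=1.0, beta=1.0):
    """Hypothetical saturating response: increases with y, stays below alpha."""
    return alpha * y / (beta + y)


def measurement_likelihood(z, y_grid, sigma=0.05, f=saturate):
    """Normalized probability masses over y_grid given one noisy reading z.

    Assumes z = f(y) + Gaussian noise. Each grid point is pushed forward
    through f and compared with z, so f is never inverted.
    """
    weights = [math.exp(-0.5 * ((z - f(y)) / sigma) ** 2) for y in y_grid]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all grid points have negligible likelihood")
    return [w / total for w in weights]


# Grid of candidate pre-saturation values and one noiseless reading at y = 1.
y_grid = [0.0, 0.5, 1.0, 2.0, 4.0]
z = saturate(1.0)
masses = measurement_likelihood(z, y_grid)
```

Because every transformed grid point lies below the saturation level by construction, the masses are always well defined, and they can be passed to a BP decoder as the message from that measurement node.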

### 4.2. Numerical Results

## 5. Conclusion

We studied how to design microarrays suitable for compressive sensing. We proposed a hybridization model that predicts whether a given set of CS probes mimics the behavior of a binary CS matrix, and we developed algorithms both for finding probe sequences that satisfy the binding requirements and for computing target concentrations from measurement intensities. We also presented a lab experimental calibration of the model and a small-scale CSM design.

## Declarations

### Acknowledgements

This work was supported by NSF Grants CCF 0821910 and CCF 0809895. The authors also gratefully acknowledge many useful discussions with Xiaorong Wu from the University of Colorado at Denver School of Medicine.

## References

- Affymetrix microarrays. http://www.affymetrix.com/products/arrays/specific/cexpress.affx
- Taylor JW, Turner E, Townsend JP, Dettman JR, Jacobson D: Eukaryotic microbes, species recognition and the geographic limits of species: examples from the kingdom Fungi. *Philosophical Transactions of the Royal Society B* 2006, 361(1475):1947-1963. 10.1098/rstb.2006.1923
- Candès EJ, Tao T: Decoding by linear programming. *IEEE Transactions on Information Theory* 2005, 51(12):4203-4215. 10.1109/TIT.2005.858979
- Donoho DL: Compressed sensing. *IEEE Transactions on Information Theory* 2006, 52(4):1289-1306.
- Sarvotham S, Baron D, Baraniuk R: Compressed sensing reconstruction via belief propagation. Preprint, 2006. http://www.dsp.ece.rice.edu/cs/csbpTR07142006.pdf
- Tropp JA: Greed is good: algorithmic results for sparse approximation. *IEEE Transactions on Information Theory* 2004, 50(10):2231-2242. 10.1109/TIT.2004.834793
- Dai W, Milenkovic O: Subspace pursuit for compressive sensing: closing the gap between performance and complexity. Submitted to *IEEE Transactions on Information Theory*. http://arxiv.org/abs/0803.0811
- Wang D, Urisman A, Liu Y-T, *et al*.: Viral discovery and sequence recovery using DNA microarrays. *PLoS Biology* 2003, 1(2, article e2):1-4. 10.1371/journal.pbio.0000041
- Schliep A, Torney DC, Rahmann S: Group testing with DNA chips: generating designs and decoding experiments. *Proceedings of the Computational Systems Bioinformatics Conference (CSB '03), Stanford, Calif, USA, August 2003* 2: 84-91.
- Macula AJ, Schliep A, Bishop MA, Renz TE: New, improved, and practical k-stem sequence similarity measures for probe design. *Journal of Computational Biology* 2008, 15(5):525-534. 10.1089/cmb.2007.0208
- Du DZ, Hwang FK: *Combinatorial Group Testing and Its Applications*. World Scientific, Singapore; 2000.
- Sheikh MA, Sarvotham S, Milenkovic O, Baraniuk RG: DNA array decoding from nonlinear measurements by belief propagation. *Proceedings of the 14th IEEE/SP Workshop on Statistical Signal Processing (SSP '07), Madison, Wis, USA, August 2007* 215-219.
- Shmulevich I, Astola J, Cogdell D, Hamilton SR, Zhang W: Data extraction from composite oligonucleotide microarrays. *Nucleic Acids Research* 2003, 31(7, article e36):1-5.
- Candès EJ, Romberg JK, Tao T: Stable signal recovery from incomplete and inaccurate measurements. *Communications on Pure and Applied Mathematics* 2006, 59(8):1207-1223. 10.1002/cpa.20124
- Gregory TR: Macroevolution, hierarchy theory, and the C-value enigma. *Paleobiology* 2004, 30(2):179-202. 10.1666/0094-8373(2004)030<0179:MHTATC>2.0.CO;2
- DeVore RA: Deterministic constructions of compressed sensing matrices. *Journal of Complexity* 2007, 23(4–6):918-925.
- Berinde R, Indyk P: Sparse recovery using sparse random matrices. Preprint, 2008. http://people.csail.mit.edu/indyk/report.pdf
- Chen YA, Chou C-C, Lu X, *et al*.: A multivariate prediction model for microarray cross-hybridization. *BMC Bioinformatics* 2006, 7, article 101: 1-12.
- Smith TF, Waterman MS: Identification of common molecular subsequences. *Journal of Molecular Biology* 1981, 147(1):195-197. 10.1016/0022-2836(81)90087-5
- *Matlab Bioinformatics Toolbox—Exploring Primer Design Demo*. http://www.mathworks.com/applications/compbio/demos.html?file=/products/demos/shipping/bioinfo/primerdemo.html
- Xu W, Bak S, Decker A, Paquette SM, Feyereisen R, Galbraith DW: Microarray-based analysis of gene expression in very large gene families: the cytochrome P450 gene superfamily of *Arabidopsis thaliana*. *Gene* 2001, 272(1-2):61-74. 10.1016/S0378-1119(01)00516-9
- Khomyakova E, Livshits MA, Steinhauser M-C, *et al*.: On-chip hybridization kinetics for optimization of gene expression experiments. *BioTechniques* 2008, 44(1):109-117. 10.2144/000112622
- Breslauer KJ, Frank R, Blocker H, Marky LA: Predicting DNA duplex stability from the base sequence. *Proceedings of the National Academy of Sciences of the United States of America* 1986, 83(11):3746-3750. 10.1073/pnas.83.11.3746
- Milenkovic O, Kashyap N: DNA codes that avoid secondary structures. *Proceedings of the IEEE International Symposium on Information Theory (ISIT '05), Adelaide, Australia, September 2005* 288-292.
- Durbin BP, Hardin JS, Hawkins DM, Rocke DM: A variance-stabilizing transformation for gene-expression microarray data. *Bioinformatics* 2002, 18: S105-S110. 10.1093/bioinformatics/18.suppl_1.S105
- Hekstra D, Taussig AR, Magnasco M, Naef F: Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. *Nucleic Acids Research* 2003, 31(7):1962-1968. 10.1093/nar/gkg283
- Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ: Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. *Nucleic Acids Research* 2000, 28(22):4552-4557. 10.1093/nar/28.22.4552
- Matlab codes for probe design in CSMs.
- Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. *Nucleic Acids Research* 2003, 31(13):3406-3415. 10.1093/nar/gkg595
- Kumar V, Milenkovic O: On graphical representations of algebraic codes suitable for iterative decoding. *IEEE Communications Letters* 2005, 9(8):729-731. 10.1109/LCOMM.2005.1496597

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.