Incorporating prior knowledge induced from stochastic differential equations in the classification of stochastic observations

In classification, prior knowledge is incorporated in a Bayesian framework by assuming that the feature-label distribution belongs to an uncertainty class of feature-label distributions governed by a prior distribution. A posterior distribution is then derived from the prior and the sample data. An optimal Bayesian classifier (OBC) minimizes the expected misclassification error relative to the posterior distribution. From an application perspective, prior construction is critical. The prior distribution is formed by mapping a set of mathematical relations among the features and labels, the prior knowledge, into a distribution governing the probability mass across the uncertainty class. In this paper, we consider prior knowledge in the form of stochastic differential equations (SDEs). We consider a vector SDE in integral form involving a drift vector and dispersion matrix. Having constructed the prior, we develop the optimal Bayesian classifier between two models and examine, via synthetic experiments, the effects of uncertainty in the drift vector and dispersion matrix. We apply the theory to a set of SDEs for the purpose of differentiating the evolutionary history between two species.

Electronic supplementary material: The online version of this article (doi:10.1186/s13637-016-0036-y) contains supplementary material, which is available to authorized users.

In order to make the manuscript self-contained, here we present the definition of quadratic discriminant analysis (QDA) in a classical setting in which both class-conditional densities are Gaussian, class k having mean vector µ_k ∈ R^p and covariance matrix Σ_k ∈ R^{p×p} for k = 0, 1. In this case it is well known that, assuming equal prior class probabilities, the Bayes classifier is given by ψ(x) = 1 if d_1(x) ≥ d_0(x), where x is a p-dimensional sample point and the discriminant d_k is defined by

d_k(x) = −(1/2) ln(det[Σ_k]) − (1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k)

for k = 0, 1. The form of this equation shows that the decision boundary d_1(x) = d_0(x) is quadratic.
In practice, when the means and covariance matrices are not known, they are replaced by the sample means and sample covariance matrices, and the resulting classification method is known as quadratic discriminant analysis (QDA).
In the special case where the covariance matrices are identical, Σ_0 = Σ_1 = Σ, the term ln(det[Σ_k]) can be dropped, and the quadratic term −(1/2) x^T Σ^{−1} x, being common to both classes, cancels in the comparison d_1(x) ≥ d_0(x). The discriminant therefore takes the form

d_k(x) = µ_k^T Σ^{−1} x − (1/2) µ_k^T Σ^{−1} µ_k,

which is a linear function of x and produces hyperplane decision boundaries. When the classifier is designed from sample data, the means are replaced by the sample means, the covariance matrix is replaced by the pooled sample covariance matrix, and the classification method is known as linear discriminant analysis (LDA).
QDA and LDA are derived under the Gaussian assumption, but in practice they can perform well so long as the class-conditional densities are not too far from Gaussian and there is sufficient data to obtain good estimates of the relevant means and covariance matrices, since the QDA and LDA classification rules reduce to computing sample covariance matrices and sample means. Owing to the greater number of parameters to be estimated for QDA as opposed to LDA, LDA can proceed with smaller samples than QDA.
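As an illustrative sketch (not the paper's implementation), the plug-in discriminants above can be written in a few lines of NumPy. The quadratic discriminant drops the constant term common to both classes and assumes equal prior class probabilities; in practice the means and covariances would be replaced by sample estimates (pooled, in the LDA case).

```python
import numpy as np

def qda_discriminant(x, mu, Sigma):
    """d_k(x) = -0.5 ln det(Sigma_k) - 0.5 (x - mu_k)^T Sigma_k^{-1} (x - mu_k)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma, diff)

def qda_classify(x, mu0, Sigma0, mu1, Sigma1):
    """psi(x) = 1 iff d_1(x) >= d_0(x)."""
    d0 = qda_discriminant(x, mu0, Sigma0)
    d1 = qda_discriminant(x, mu1, Sigma1)
    return 1 if d1 >= d0 else 0

def lda_classify(x, mu0, mu1, Sigma):
    """Identical covariances: linear discriminant
    d_k(x) = mu_k^T Sigma^{-1} x - 0.5 mu_k^T Sigma^{-1} mu_k."""
    d0 = mu0 @ np.linalg.solve(Sigma, x) - 0.5 * mu0 @ np.linalg.solve(Sigma, mu0)
    d1 = mu1 @ np.linalg.solve(Sigma, x) - 0.5 * mu1 @ np.linalg.solve(Sigma, mu1)
    return 1 if d1 >= d0 else 0
```

With well-separated class means, points near µ_0 receive label 0 and points near µ_1 receive label 1 under both rules.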

II. ERROR ESTIMATION ACCURACY
Given a feature-label distribution, error estimation accuracy is commonly measured by the mean-square error (MSE) between the error estimate and the true error.

III. BAYESIAN MMSE ERROR ESTIMATOR
Let π(c), π(θ_0), and π(θ_1) denote the marginal priors of c, θ_0, and θ_1, respectively, and suppose the data are used to find the corresponding posteriors, π*(c), π*(θ_0), and π*(θ_1). Independence is preserved, i.e., π*(c, θ_0, θ_1) = π*(c)π*(θ_0)π*(θ_1) [1]. If ψ_n is a trained classifier given by ψ_n(x) = 0 if x ∈ R_0 and ψ_n(x) = 1 if x ∈ R_1, where R_0 and R_1 are measurable sets partitioning the sample space, then the true error of ψ_n under the feature-label distribution parameterized by θ = (c, θ_0, θ_1) may be decomposed as

ε(ψ_n, θ) = c ε_0(ψ_n, θ_0) + (1 − c) ε_1(ψ_n, θ_1),

where

ε_y(ψ_n, θ_y) = ∫_{R_{1−y}} f_{θ_y}(x|y) dx

is the error contributed by class y and f_{θ_y}(x|y) is the class-y conditional density assuming parameter θ_y is true. Given the sample S_n and letting Θ_y be the parameter space of θ_y, the Bayesian MMSE error estimator can be expressed as [1]

ε̂(ψ_n, S_n) = E_{π*}[ε(ψ_n, θ)] = E_{π*}[c] E_{π*}[ε_0(ψ_n, θ_0)] + (1 − E_{π*}[c]) E_{π*}[ε_1(ψ_n, θ_1)],

where E_{π*}[ε_y(ψ_n, θ_y)] = ∫_{Θ_y} ε_y(ψ_n, θ_y) π*(θ_y) dθ_y. The Bayesian MMSE error estimator can also be found from the effective class-conditional densities, which are derived by taking the expectations of the individual class-conditional densities with respect to the posterior distribution:

f_{Θ_y}(x|y) = ∫_{Θ_y} f_{θ_y}(x|y) π*(θ_y) dθ_y.

Using these [2],

ε̂(ψ_n, S_n) = E_{π*}[c] ∫_{R_1} f_{Θ_0}(x|0) dx + (1 − E_{π*}[c]) ∫_{R_0} f_{Θ_1}(x|1) dx.

IV. REVIEW OF LITERATURE PERTAINING TO CLASSIFICATION OF STOCHASTIC PROCESSES

In the following we refer to "time" as a generic term indicating an ordered set of indices. Among the problems studied for time-series data is the use of a similarity measure to determine the "natural" grouping among a collection of time series (clustering). In the current manuscript, we are concerned with a classification application and, therefore, in the subsequent discussion we focus only on studies related to classification. Classification of time-series data has applications in various domains. For example, in seismology the classification of time-series data is used to distinguish earthquake data from data obtained from nuclear explosions [4], [5]. In engineering, an important application is to differentiate a signal generated by noise alone from a signal plus noise.
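The Bayesian MMSE error estimator of Section III can be illustrated with a small Monte Carlo sketch. Everything concrete here is an assumption for illustration, not the construction used in this paper: univariate features, a threshold classifier ψ(x) = 1 iff x > t, unit-variance Gaussian class-conditional densities N(θ_y, 1), and Gaussian posteriors on the class means.

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def bayesian_mmse_error(t, post0, post1, c_expect):
    """Monte Carlo estimate of the Bayesian MMSE error estimate for the
    threshold classifier psi(x) = 1 iff x > t, assuming unit-variance Gaussian
    class-conditional densities N(theta_y, 1).  post0/post1 are samples drawn
    from the posteriors pi*(theta_0) and pi*(theta_1)."""
    # E_pi*[eps_0]: class-0 mass in R_1 = (t, inf), averaged over the posterior
    eps0 = np.mean([1.0 - Phi(t - th) for th in post0])
    # E_pi*[eps_1]: class-1 mass in R_0 = (-inf, t], averaged over the posterior
    eps1 = np.mean([Phi(t - th) for th in post1])
    return c_expect * eps0 + (1.0 - c_expect) * eps1

# hypothetical Gaussian posteriors on the two class means
rng = np.random.default_rng(0)
post0 = rng.normal(0.0, 0.3, size=20000)
post1 = rng.normal(2.0, 0.3, size=20000)
est = bayesian_mmse_error(1.0, post0, post1, c_expect=0.5)
```

As a sanity check, E_{θ∼N(m,s²)}[Φ(t − θ)] = Φ((t − m)/√(1 + s²)) in this Gaussian toy model, so the Monte Carlo average can be compared against a closed form.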
In medicine, this kind of classification has been used to classify different stages of sleep by considering the contents of EEG signals [6], and to differentiate between levels of anesthesia that are sufficient for deep surgery [7]. There is quite a large body of work on discrimination of stochastic processes. In what follows we focus on several important achievements in the field and categorize them into several classes. The work in [11] used the same serially correlated structure and obtained a different asymptotic expansion of the LDA true error from the one that Tubbs had previously achieved in [10]. This type of asymptotic analysis was later used in [11], [12] to characterize the asymptotic expected true error of univariate LDA and Z-statistics assuming an autoregressive process of order p. In [13], we consider two general classes of Gaussian distributions under which we characterize the exact performance of LDA when the data are univariate. We show the application of the theory developed therein in situations where the data are generated from two autoregressive or two moving-average sources. Some work considers the problem of discrimination of time-series data via a Bayesian approach specific to univariate autoregressive processes [14], [15]. For example, Broemeling and Son [14] consider the problem of assigning time-series data to an autoregressive source with unknown parameters. By placing a vague prior on the unknown parameters, they train a model to assign an observed sample path to the class that maximizes the posterior mass function.

B. Spectral approach to discrimination of stochastic processes
Shumway and Unger [4] use the theory of discriminant analysis combined with spectral approximation and estimation for discriminating two Gaussian processes. In this regard, they derive the optimal discriminant in the sense of maximizing the Kullback-Leibler discrimination information rate, the J-divergence rate, and detection probabilities. They then use the Fourier transform to derive an approximation of the optimal discriminant based on the spectral contents of stationary processes. Finally, they replace the unknown parameters appearing in the approximate discriminant by their sample estimates. The authors conclude that the benefit of using the spectral approximation is twofold: 1) the matrix operations appearing in discrimination are replaced by simpler operations, including spectral estimation and the FFT; 2) more stable estimates are obtained from spectral estimation techniques than from covariance matrix estimation. For a more detailed discussion of this work and its extension to non-Gaussian processes and clustering, see [16].
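The exact discriminant of [4] is derived there; the following is only a generic sketch in the same spirit, comparing Whittle-type spectral scores of a sample path at the Fourier frequencies. The spectra f0 and f1 (flat white noise versus a hypothetical AR(1) spectrum with coefficient 0.8) are illustrative choices, not taken from the paper.

```python
import numpy as np

def periodogram(x):
    """Raw periodogram at the rfft Fourier frequencies (no tapering or smoothing)."""
    n = len(x)
    return np.abs(np.fft.rfft(x))**2 / n

def whittle_score(I, f):
    """Negative Whittle log-likelihood of periodogram I under spectrum f."""
    return np.sum(np.log(f) + I / f)

def spectral_classify(x, f0, f1):
    """Assign x to the class whose spectrum yields the smaller Whittle score."""
    I = periodogram(x)
    return 0 if whittle_score(I, f0) <= whittle_score(I, f1) else 1

# illustrative class spectra on the rfft frequency grid
n = 512
w = 2.0 * np.pi * np.arange(n // 2 + 1) / n
f0 = np.ones_like(w)                          # unit-variance white noise
f1 = 1.0 / (1.0 - 1.6 * np.cos(w) + 0.64)     # AR(1), phi = 0.8: 1/|1 - 0.8 e^{-iw}|^2
```

This replaces the covariance-matrix manipulations of time-domain discriminant analysis with an FFT and elementwise operations, which is the computational simplification noted above.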
The aforementioned work considers the problem of discrimination between stationary processes using parametric models. Yuan and Rao [17] consider discrimination between two stationary processes using a non-parametric approach. They use a smoothed periodogram to estimate the spectral contents of the signals. The classification rule is then defined to be the classifier that minimizes the discrepancy between the periodogram and the class spectrum. The problem of discrimination between non-stationary (locally stationary) and non-Gaussian processes is considered in [18]. In this work the authors use an approximation of the Gaussian Kullback-Leibler discrimination information rate as the classification statistic; that is, they compare the estimated rate to a pre-defined threshold and assign the labels.

C. Adaptive classification of stochastic processes
A more recent line of work focuses on constructing classifiers in an adaptive setting.
In this framework the classifier is updated on the arrival of new labeled data and, at the same time, adapts to changes in the population over time, a phenomenon commonly referred to as population drift. Bottcher et al. [19] construct and extrapolate a model of population drift in order to build a decision tree for classifying a future sample point. In [20], Adams et al. use the concept of so-called "adaptive forgetting". In short, a forgetting factor controls the contribution of historical data: generally, the more recent the data, the greater its contribution. This approach has been embedded in the theory of discriminant analysis (LDA and QDA) [20] to update the classifier using recursion formulas for the mean and covariance.
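The exact recursions are derived in [20]; the following is a minimal sketch, under assumed notation, of one common form of exponentially weighted ("forgetting") updates for a class mean and covariance, which could then be plugged into the LDA/QDA discriminants.

```python
import numpy as np

class ForgettingGaussian:
    """Exponentially weighted running estimates of a class mean and covariance.

    lam is the forgetting factor in (0, 1): values near 1 forget slowly, so
    historical data contribute more; smaller values adapt faster to drift.
    Illustrative recursion only; see [20] for the exact update rules.
    """
    def __init__(self, dim, lam=0.95):
        self.lam = lam
        self.mean = np.zeros(dim)
        self.cov = np.eye(dim)

    def update(self, x):
        # innovation relative to the current (pre-update) mean
        delta = x - self.mean
        # down-weight history, then fold in the new labeled point
        self.mean = self.mean + (1.0 - self.lam) * delta
        self.cov = self.lam * self.cov + (1.0 - self.lam) * np.outer(delta, delta)
```

Feeding a drifting stream of labeled points into one such object per class keeps the per-class estimates, and hence the discriminant, tracking the current population.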

D. Semi-supervised classification of time-series
In [21] the authors study the effect of incorporating a large amount of unlabeled data into the classification of time-series data, thereby constructing a semi-supervised scheme. The semi-supervised classification of time series is further studied in [22], [23].