Stochastic convex sparse principal component analysis

Principal component analysis (PCA) is a dimensionality reduction and data analysis tool commonly used in many areas. The main idea of PCA is to represent high-dimensional data with a few representative components that capture most of the variance present in the data. However, there is an obvious disadvantage of traditional PCA when it is applied to analyze data where interpretability is important. In applications, where the features have some physical meanings, we lose the ability to interpret the principal components extracted by conventional PCA because each principal component is a linear combination of all the original features. For this reason, sparse PCA has been proposed to improve the interpretability of traditional PCA by introducing sparsity to the loading vectors of principal components. The sparse PCA can be formulated as an ℓ 1 regularized optimization problem, which can be solved by proximal gradient methods. However, these methods do not scale well because computation of the exact gradient is generally required at each iteration. Stochastic gradient framework addresses this challenge by computing an expected gradient at each iteration. Nevertheless, stochastic approaches typically have low convergence rates due to the high variance. In this paper, we propose a convex sparse principal component analysis (Cvx-SPCA), which leverages a proximal variance reduced stochastic scheme to achieve a geometric convergence rate. We further show that the convergence analysis can be significantly simplified by using a weak condition which allows a broader class of objectives to be applied. The efficiency and effectiveness of the proposed method are demonstrated on a large-scale electronic medical record cohort.


Introduction
Principal component analysis (PCA) is a commonly used dimensionality reduction and data analysis tool in many areas such as computer vision [1,2], data mining [3,4], biomedical informatics [5,6], and many others. The goal of PCA is to learn a linear transformation such that the learned principal components are the dimensions retaining the most of the variance in the data. Principal components are obtained by computing the eigenvalue decomposition of the covariance matrix, and it can also be computed by the singular value decomposition of the data matrix. Let S = 1 n n i=1 x i x T i be the normalized covariance matrix for n training data points where each data point is in a d-dimensional feature space. The PCA of where Z is an orthogonal projection matrix. In many applications, we are only interested in a few top principal components. In this case, the principal components can be computed in an iterative fashion: the leading principal component is calculated at each iteration (e.g., using power methods), and we then deflate the computed component and the next principal component now becomes the leading one [7]. Therefore, we focus on finding the leading principal component in this paper. In spite of its advantages, there is an obvious disadvantage of PCA. In the solution of Eq. (1), the principal components are linear combinations of all input variables. This means that the columns of Z matrix, which are called loadings of principal components, are dense. One important implication of dense loadings is that we lose the ability to interpret the output dimensions of conventional PCA. PCA works well if we are not interested in the physical meanings of the features or if the interpretation of principal components is not crucial for the application. However, the intepretability is a significant factor when it comes to many applications such as biology, finance, and biomedical informatics.
In the domain of biomedical informatics, as more and more electronic medical records (EMR) [8] of patients are available, medical researchers are interested in applying various techniques to analyze the EMR data. Each feature of the EMR data is a record/event related to a certain diagnosis. When the traditional PCA is applied to the data, those medical features are projected to a low dimensional space, in which each new feature will be the linear combination of all the original features. In this case, it is hard to comprehend the meaning of the new features. Sparse PCA has been proposed to address this drawback. In sparse PCA, we learn sparse loading vectors which combine only few of the input variables allowing interpretation of the principal components. Sparse PCA was firstly proposed by Zou et al. in [9], where PCA was formulated as a regression problem and the sparse PCA was introduced by imposing the lasso (elastic net) constraint. Other common approaches to solve the sparse PCA problem are semi-definite programming [10,11] and inverse power method [12]. Moreover, a more recent study [13] investigated sparse PCA with oracle property. Aforementioned approaches are generally not scalable enough to work with large-scale datasets. One way to deal with large sample sizes is using stochastic methods. We can see an example of stochastic PCA in [7]. Authors described an algorithm with computationally cheap stochastic iterations and variance reduction which was suggested in [14].
In this study, sparse PCA is posed as an 1 regularized optimization problem. Standard approaches to solve such sparse learning problems are proximal gradient methods [15][16][17], which require computation of the full gradient at each iteration. These methods generally work with a composite function including a smooth part and a nonsmooth part. A large family of machine learning problems [18][19][20][21][22][23] can be expressed as composite functions. Traditionally, solving problems with objectives, which are not continuously differentiable, requires subgradient descent [24] which has very poor performance [25]. The recently developed proximal gradient methods can solve these composite problems with fast convergence rates [26,27]. However, these methods are hardly scalable to large-scale problems with large sample sizes because of the computation of full gradient. Therefore, stochastic gradient-based methods are preferred in such problems. One major disadvantage of the stochastic gradient descent is the low convergence due to high variance by random sampling. Johnson and Zhang proposed a solution for this drawback in [14]. Their solution reduced the variance by using a copy of the estimated optimal point and the full gradient at this point in the gradient step. This approach exploited the strong convexity property to obtain a geometric convergence rate under expectation. Xiao and Zhang similarly presented a multi-stage scheme to progressively reduce the variance of the proximal stochastic gradient (Prox-SVRG) with a geometric convergence rate under expectation in [15]. The fundamental assumptions were Lipschitz continuity of the gradient of smooth part and the strong convexity of the objective function.
To tackle the aforementioned challenges in this paper, we introduce a novel stochastic convex sparse PCA (Cvx-SPCA) method which is extremely efficient and can handle large-scale datasets. Specifically, we propose to adopt a convex formulation of PCA [28] which provides a strongly convex function. The problem structure in this design allows us to leverage efficient scheme of Prox-SVRG [15] which leads to an exponential (geometric) convergence rate. We also investigate the convergence analysis of Prox-SVRG and present a new proof of the convergence rate which significantly reduces the conditions and assumptions required. As such, we show that the optimization scheme can be applied to a much larger class of problems to obtain the geometric convergence rate. We conducted extensive experiments on both synthetic and real datasets to illustrate the efficiency of the proposed algorithm. Because of its efficiency, we were able to apply the proposed algorithm to analyze a real EMR cohort with a large number of patients, which is hardly possible to analyze by using traditional approaches.

Convex sparse principal component analysis
In this section, we introduce the problem formulation and optimization scheme of the proposed approach. The problem of finding a sparse loading vector is posed as the combination of 1 sparsity inducing norm and convexity from the convex principal component analysis, which allows us to utilize an extremely efficient stochastic proximal gradient approach.

Convex sparse PCA
The goal of sparse PCA is to learn sparse loading vectors such that the principal components will be linear combinations of a few key variables instead of all the variables. We propose the following convex optimization problem: where the convex PCA loss [28] is given by: F (z) = 1 2 z T (λI − S) z − w T z and the regularization term R (z) = γ z 1 is the 1 norm of the loading vector z, γ ∈ R is the regularization parameter controlling the sparsity of the loading vector, λ > λ 1 (S) is the convexity parameter, w ∈ R d is a random vector, and S = 1 Here, λ 1 (S) represents the largest eigenvalue of the covariance matrix S and w is a vector of normally distributed random numbers. An upper bound for the regularization term γ can be derived by using standard subgradient analysis [25]: if the regularization parameter γ is larger than the maximum of absolute value of the elements of the vector w, i.e., w ∞ , we will end up with trivial solutions (solutions with only zeros). This thus guides us to use a parameter range of γ ∈[ 0, w ∞ ].
In the above approach, we use a convex optimization formulation of finding the first principal component inspired by the work in [28]. Even though R (z) is not strongly convex, the overall cost function in Eq. (2) is a strongly convex function in which the strong convexity comes from F (z). The structure of the problem defined in Eq. (2) allows us to use gradient based algorithms to obtain the global solution. Moreover, the strong convexity usually ensures nice convergence properties for stochastic gradient schemes as well. Therefore, we can also benefit from the faster convergence rate of the proximal stochastic scheme proposed in [15]. We note that the objective function of traditional PCA as shown in Eq. (1) does not define a convex problem, and thus, the analysis in this paper cannot be applied to it.
The most common methods to solve problems such as Eq. (2), where the objective function is comprised of the average of smooth component functions and a nonsmooth function, are proximal gradient methods. In the next section, the method used to solve convex optimization problem given in Eq. (2) will be explained.

Optimization scheme
In this paper, we propose to use a proximal stochastic gradient method with progressive variance reduction approach [15] to solve the problem in Eq. (2). The function denoted by F (z) can also be written as the sum of n smooth functions: When n is very large, calculating the full gradient at each gradient descent iteration is an expensive operation. Hence, stochastic gradient methods are preferred to solve such problems. In stochastic approach, instead of calculating gradients for all of the data points, one data point is randomly sampled and the gradient at this point is calculated at each iteration. Therefore, the number of calculations decreases. However, the drawback of the stochastic gradient methods is the high variance introduced because of random sampling. As a result of the high variance, we suffer from poor convergence rates. As discussed previously, there are solutions to reduce the variance and increase the convergence rate. One of the studies which mitigates the high variance problem of stochastic gradient method is proximal stochastic gradient method with progressive variance reduction [15]. The study in [15] showed that the variance of the gradient can be upper bounded by using a multi-stage scheme which progressively reduces the variance. When the algorithm converges to optimal point, variance also converges to zero. Therefore, this approach can achieve better convergence rates than conventional stochastic gradient even with constant step sizes. We refer the readers to Section 3.1 in [15] for detailed proof of bounding the variance.
In this paper, we also follow the approach in [15]. The algorithm used in this study is given in Algorithm 1.

Algorithm 1
The proposed stochastic gradient descent with variance reduction algorithm for solving Cvx-SPCA end for 10:z s = 1 m m k=1 z k 11: end for 12: returnz T In the algorithm, z 0 is the initial value for loading vector z, η is the constant step size, γ is the regularization term to control sparsity of z, m is the number of iterations for each epoch s, and T is the maximum number of epochs. At each epoch, full gradient at the pointz is calculated periodically. The cost of calculating the full gradient is the product of a d×d matrix and a d dimensional vector. Therefore, the most time consuming part in our algorithm is the multiplications with covariance matrix, when the feature dimension is high.z is an estimate of the optimal point and it is updated at each epoch to be utilized in gradient calculations. During m stochastic gradient steps, we first sample a data point randomly and compute the gradient v k . If we take the expectation of the gradient calculated in Eq. (4), we can see that v k is also an estimate of the full gradient as in conventional stochastic gradient methods. This shows that v k given below is in the same direction as the full gradient under expectation.
where ∇F (z) is the average gradient of functions f i (z) , i = 1, . . . , n or the full gradient at pointz, ∇f ik z k−1 is the gradient of the function calculated by using the data point x ik sampled at the kth iteration andz is the average of z k , k = 1, .., m at the end of an epoch. After the gradient computation, we update z k by using the proximal mapping for 1 norm as follows.
In this algorithm, variance of the stochastic gradient v k is reduced progressively, while bothz and z k−1 are converging to the optimal point z * = arg min z P (z) [15]. Since the full gradient is utilized to modify stochastic gradients and function F is an average of smooth component functions, variance can be bounded. In the next section, we will give the convergence analysis of the aforementioned algorithm.

Convergence analysis
In this section, we present the convergence analysis of the proposed algorithm. The objective function used in this paper is suitable to follow the convergence analysis in [15]. Therefore, our analysis is mostly adapted from [15]. However, we use much weaker conditions which allow a broader family of objective functions to fit in this scheme and to enjoy the geometric convergence. We retain the following assumption used throughout in [15]:

Assumption 1 The function R (z) is lower semicontinuous and convex, and its effective domain, dom(R)
which also implies that the gradient of the average function F (z) is also Lipschitz continuous, i.e., there is an L > 0 such that for all z, y ∈ dom(R), In [15], convergence analysis was done for general F and R functions and both of them were assumed to be strongly convex. On the other hand, we only assume that functions F (z) and R (z) are convex, but not necessarily strongly convex. Thus, we are relaxing this strong assumption at this point. Strong convexity provides good properties and is relevant for faster convergence rates. However, objective functions are not always strongly convex in many cases. Therefore, a simplified version of the analysis will be preferable, when the objective functions do not have necessarily strong convexity property.
Although our overall objective function is strongly convex, R (z) is not strongly convex as it was mentioned in the previous section. Therefore, we drop the strong convexity assumption at two steps in the original analysis of [15] and obtain the convergence rate given in the following theorem.
Theorem 1 Under the assumption that Assumption 1 holds and 0 < η < 1/ 4L Q , where L Q = max i L i , the convergence rate is obtained as follows: where z * = arg min z P (z).
Proof The proof of Theorem 1 starts with investigating the distance between z k and z * ; z k − z * 2 . According to the stochastic gradient mapping definition in [15], z k can be written as z k−1 − ηg k .
The term −g k T z k−1 − z * + η 2 g k 2 can be bounded by using the definition of the proximal update as shown below.
According to the optimality condition, where ξ ∈ ∂R (z k ) is the subgradient of R (z) at z k . If we combine the stochastic gradient mapping definition with the optimality condition, we obtain the following expression.
By using the convexity of F (z) and R (z), we can write the following inequality.
Convergence analysis of [15] utilized strong convexity of F and R in 7. However, we will show that strong convexity is not required at this point. Since F (z) is assumed to be Lipschitz continuous with Lipschitz constant L, F z k−1 can also be bounded by using Theorem 2.1.5 in [29].
Here, we again use stochastic gradient mapping; z k − z k−1 = −ηg k to obtain the following inequality.
where = v k − ∇F z k−1 and z * corresponds to y. The term −2η T (z k − z * ) can further be bounded by using the proximal full gradient updatez k = prox ηR z k−1 − η∇F z k−1 , If Cauchy-Schwarz inequality and the non-expansiveness of the proximal mapping prox ηR (x) − prox ηR (y) ≤ x − y are utilized, the following expression can be derived.
If we insert the definitions of z k = z k−1 − ηv k and z k = z k−1 − η∇F z k−1 , we will have: If we combine the result shown above with Eq. (9): Now, expectations of both sides are taken with respect to z k .
Sincez k and z * are independent from the variable z k ; is upper bounded in Prox-SVRG algorithm and we will use the result of Corollary 3 in [15] Now, if we apply the inequality above repeatedly for k = 1, . . . , m and the expectation with respect to previous random variables z 1 , . . . , z m are taken, then we can obtain the following inequality.
Since 2η 1 − 4ηL Q < 2η, z 0 =z and P is convex, therefore, P (z s ) ≤ 1 m m k=1 P (z k ), and we can write the following inequality.
By using Lemma 1 which is a weaker condition then using the strong convexity and by applying the above inequality recursively, we derive the convergence rate as follows:

Lemma 1 Consider the problem of minimizing the sum of two convex functions:
A standard method for solving the above problem is the proximal gradient method. Given an initial point z 0 , using the proximal mapping, which is shown below, iteratively generates a sequence that will converge to the optimal solution. prox R (y) = arg min Since R (x) is a convex function, the optimal solution of above problem is also an optimal solution of the following problem using a tuning parameter μ [30] [Theorem 1].
By utilizing the optimal strong convexity condition which is a weaker condition than strong convexity [31] for a convex function R, we have the following inequality for all z ∈ : where the prox E is the Euclidean projection on to set E and is a positive parameter.
We have thus removed the strong convexity condition so that we are able to apply the algorithm in [15] to more generic convex objectives.

Results
In this section, we present the results of two types of experiments. First, the proposed algorithm was tested on synthetic datasets to investigate the convergence of the variance reduced proximal stochastic gradient compared to traditional proximal stochastic gradient descent. In addition, running times of the proposed stochastic Cvx-SPCA and other sparse PCA methods were compared to emphasize the advantage of using a stochastic approach, when there are large number of samples. In our experiments, step size η was chosen by the following heuristic according to 0 < η < 1/ 4L Q and L Q was taken as the largest eigenvalue of the covariance matrix. Iteration number m was chosen as L Q / (λ − λ 1 (S)) which is suggested in [15]. Secondly, we presented our experiments on an electronic medical records data.

Synthetic dataset
In this section, we present some results of the proposed stochastic Cvx-SPCA algorithm on synthetic datasets. Synthetic datasets used in this section were all randomly generated by normally distributed random numbers with N (0, 1). For this purpose, synthetic data with varying sample sizes were prepared by random sampling. First of all, we would like to compare the convergence of proximal stochastic gradient with variance reduction and traditional proximal stochastic gradient for our algorithm. In Fig. 1, objective versus number of epochs are plotted for using traditional proximal stochastic gradient (prox-SGD) and proximal stochastic variance reduced gradient (Prox-SVRG) methods.
In Fig. 1, convergence is observed when the maximum number of epochs is fixed to 50. We also would like to investigate how many epochs are necessary for both algorithms to converge. Therefore, we made another experiment to see how fast Cvx-SPCA with Prox-SVRG converges to a similar sparsity as Cvx-SPCA with prox-SGD. We generated another synthetic dataset with 100,000 instances and 10,000 dimensions. The result of the experiment is shown in Fig. 2. Cvx-SPCA with traditional SGD took 3646.94 s and Cvx-SPCA with SVRG took 644.60 s to converge to similar sparsity patterns.
Secondly, running times of other sparse PCA methods and the proposed method were compared in Table 1. In experiments, feature dimension was chosen as 1000. Algorithms ran until they reached similar sparsity patterns. The proposed Cvx-SPCA algorithm is more scalable, since only one gradient is computed at a time and there are no eigenvalue decomposition or SVD steps during iterations. For instance, [9] requires singular value decomposition at Fig. 1 Convergence for synthetic data. Convergence of the proposed stochastic Cvx-SPCA with (Prox-SVRG) and without variance reduction (prox-SGD). Proximal stochastic gradient with variance reduction has a faster convergence rate, since the variance caused by random sampling is bounded in Prox-SVRG each iteration, which is a bottleneck in terms of running time, [12] is an inverse power method based approach, and [11] uses semi-definite programming. Therefore, scalability with respect to sample size and dimension is an issue for the aforementioned methods.
We also investigate the regularization path for the proposed algorithm. Regularization path illustrates how the solution changes for different values of regularization parameters γ which specify the level of sparsity. In order to have a suitable level of sparsity, γ should be tuned. One common way of finding an appropriate γ is the regularization path. We first generated a random sample with ten features and applied the proposed Cvx-SPCA algorithm to obtain the principal component. Then, the covariance matrix was reconstructed by using the first principal component corresponding to the largest eigenvalue with a little random noise. Loading values of principal components were computed with varying regularization parameters γ by using the reconstructed covariance matrix. We started with small γ values, and the loading vector learned from the previous step is used as the initialization for each new Cvx-SPCA step. The result is given in Fig. 3. Since proposed Cvx-SPCA does not depend on eigenvalue decomposition or semi-definite programming, it is more scalable in terms of the sample size. It also requires less iterations to reach a desired sparsity

Large-scale healthcare dataset
We applied our Cvx-SPCA algorithm to analyze disease patterns in a general patient population. The dataset we used is a real world electronic medical record (EMR) warehouse including the records of 223,076 patients over 4 years. We used the diagnosis information (in terms of ICD9 codes [32]) in our investigation, which resulted in 11,982 features in total. In this dataset, we do not have demographic information of patients explicitly. However, we investigated patient groups with different gender and age by looking at the descriptions of the ICD9 codes. We draw histograms of the number of patients with respect to the number of diagnoses each patient has in different demographic groups and in the general population as in Figs. 4 and 5, from which we can observe that the majority of the patients just have very few records. In our experiments, we eliminated the patients who have less than five records, and this resulted in 177,856 patients. As it was mentioned earlier, some of the diseases are specifically related to gender and age that let us have an idea about the demographic information of the dataset. For instance, complications of pregnancy, female genital disorders, and abortion are some of the diagnoses which are explicitly about women. Similarly, maternal complications affecting newborn and diseases such as chickenpox and measles are related to children. There are also ICD9 codes which have terms indicating the age. For instance, some of the diagnoses have the term "senile" which points out patients at least above 60 years old. Thus, we sampled female, male, old, and child patients by taking the definitions of the ICD9 codes into account. The age range of child patients can be given as from babyhood to adolescence and age of old patients can be thought as above 60 years old. In Table 2, number of patients and number of features related to female, male, people above 60 years old and children groups are given. We should note that there may be female, male, old, and child patients who we did not include into these demographic groups. For example, there should be female/male patients with diagnoses which are not gender-or age-specific. It is not always possible to guess the gender or age from diagnosis such as hypertension or infectious diseases which can be encountered in both genders. Therefore, we are reporting the demographic groups whose ICD9 codes have clear terms indicating the demographic information.
As can be seen from Table 2, the number of femalespecific diseases and the number of female patients are more than other groups in the EMR dataset we used in this paper. Number of old patients is given less than other groups in the table. However, it may not mean that there are less number of old people in the whole patient population. We could not exactly extract age information of every diagnoses/diseases. For instance, hypertension or Alzheimer's were diseases commonly encountered among the people above a certain age in the past. However, these problems can be occurred in younger ages recently. For Table 2 We sample patients who have female, male, child, and old people related features. These samples may overlap with each other. For instance, a patient may have dementia and a prostate problem together. We did not include other problems such as hypertension or kidney problems which can be encountered in every age and both genders into these groups of patients  Table 2 are given in Fig. 4.
In our experiments, we further aggregated all diagnoses belong to the same ICD9 group together, so that each patient is represented by a 918 dimensional feature vector. The value on its ith dimension represents the frequency of the ith diagnosis code appearing in the EMR of the corresponding patient. Since every patient will have a limited number of diseases, patient vectors are very sparse.
We would like to emphasize that existing sparse PCA algorithms cannot be used to analyze a dataset at this scale. We carried out both quantitative and qualitative evaluations on this dataset. We studied the convergence of the algorithm with varying number of patients, and we observe that the proposed Cvx-SPCA can still achieve a good convergence even when the sample size is very large, as shown in Fig. 6.
Next, we conducted an experiment to show how the proposed algorithm helps us to analyze the EMR data. We applied our algorithm to the whole data set and got the output features which correspond to the nonzero loading values of the leading principal component. These output features are inferred as key medical features. One of the results is summarized in Table 3. Diseases shown in this table are the features which have non-zero loadings whose absolute values are greater than a heuristic threshold. In our experiments, we observed that the most frequently encountered output features were infectious diseases, problems related to pregnancy and labor, injuries, and cancer types. This result tells us that the proposed algorithm can provide insight about the diagnoses encountered in the patient population.
We further examined the data set and divided the features into groups in terms of gender and age. We sampled the patients who have gender-and age-related problems separately and applied our algorithm to those samples   Tables 4, 5, 6, and 7. We can see plausible results for the output features of each group in the tables. For example, diagnoses such as female genital disorders, perinatal problems, and anemia, which are more common among women, appeared in Table 4 where the algorithm was applied to the subset of patients who have female-related problems. Similarly, we can see from Table 5 that a subset of male patients generates prostate cancer along with other diagnoses which can be frequently seen in the general patient population as well. Cancer is a commonly encountered problem in nearly every age. We can come across cancer in the results of children and old patients as well. Another observation is that tuberculosis and bacterial infections are quite common among children.

Discussion
Throughout the paper, advantage of using a convex optimization approach for sparse PCA is emphasized. In this section, we would like to discuss about our conjuring of  the convergence of non-convex stochastic sparse PCA by using the same framework. One surprising finding we have is if we use this non-convex PCA to construct a nonconvex sparse PCA (by adding 1 -norm), we still benefit from a much faster convergence rate using the stochastic scheme studied in this paper. A similar result is also presented in [7], where the authors propose a stochastic PCA approach with an exponential convergence rate by using variance reduced stochastic gradient presented in [14]. These results lead us to ask the following question: Can we generalize the convergence analysis of proximal variance reduced stochastic gradient method further for non-convex settings? We will investigate this problem in the future work.

Conclusions
In this paper, a convex stochastic sparse PCA method is proposed. Since the problem of finding the leading eigenvector is formed as convex optimization, a welldefined convergence rate can be applied to the proposed algorithm. A proximal stochastic gradient method with variance reduction is preferred to avoid low convergence rates of traditional stochastic methods. Although strong convexity is usually required in literature, we simplify the convergence analysis of the existing Prox-SVRG algorithm  by using weaker conditions. According to the experiments on several synthetic data, the proposed algorithm is shown to be more scalable due to stochastic approach. In addition, an application of sparse PCA is presented to show how sparse PCA can help to interpret electronic medical records. In future work, we would like to investigate whether sparse PCA can be used to cluster patients with respect to their medical records. For instance, we propose to apply the proposed algorithm to analyze medical records and derive clinically meaningful and structural phenotypes, which can further be helpful for patient risk stratification and clustering.