# Stochastic convex sparse principal component analysis

Inci M. Baytas^{1}, Kaixiang Lin^{1}, Fei Wang^{2}, Anil K. Jain^{1} and Jiayu Zhou^{1}

**2016**:15

https://doi.org/10.1186/s13637-016-0045-x

© The Author(s) 2016

**Received:** 14 April 2016

**Accepted:** 4 August 2016

**Published:** 9 September 2016

## Abstract

Principal component analysis (PCA) is a dimensionality reduction and data analysis tool commonly used in many areas. The main idea of PCA is to represent high-dimensional data with a few representative components that capture most of the variance present in the data. However, traditional PCA has an obvious disadvantage when it is applied to data where interpretability is important. In applications where the features have physical meanings, we lose the ability to interpret the principal components extracted by conventional PCA because each principal component is a linear combination of all the original features. For this reason, sparse PCA has been proposed to improve the interpretability of traditional PCA by introducing sparsity to the loading vectors of the principal components. Sparse PCA can be formulated as an *ℓ*_{1} regularized optimization problem, which can be solved by proximal gradient methods. However, these methods do not scale well because computation of the exact gradient is generally required at each iteration. The stochastic gradient framework addresses this challenge by computing, at each iteration, a gradient that matches the exact gradient only in expectation. Nevertheless, stochastic approaches typically have low convergence rates due to the high variance. In this paper, we propose a convex sparse principal component analysis (Cvx-SPCA), which leverages a proximal variance reduced stochastic scheme to achieve a geometric convergence rate. We further show that the convergence analysis can be significantly simplified by using a weak condition which allows a broader class of objectives to be applied. The efficiency and effectiveness of the proposed method are demonstrated on a large-scale electronic medical record cohort.

## 1 Introduction

Consider *n* training data points, where each data point is in a *d*-dimensional feature space. The PCA problem of computing the top *p* components can be written as the following optimization problem:

where **Z** is an orthogonal projection matrix. In many applications, we are only interested in a few top principal components. In this case, the principal components can be computed iteratively: the leading principal component is calculated at each iteration (e.g., using power methods), the computed component is then deflated, and the next principal component becomes the leading one [7]. Therefore, we focus on finding the leading principal component in this paper.

In spite of its advantages, PCA has an obvious disadvantage. In the solution of Eq. (1), the principal components are linear combinations of all input variables. This means that the columns of the matrix **Z**, which are called the loadings of the principal components, are dense. One important implication of dense loadings is that we lose the ability to interpret the output dimensions of conventional PCA. PCA works well if we are not interested in the physical meanings of the features or if the interpretation of the principal components is not crucial for the application. However, interpretability is a significant factor in many application areas such as biology, finance, and biomedical informatics. In the domain of biomedical informatics, as more and more electronic medical records (EMR) [8] of patients become available, medical researchers are interested in applying various techniques to analyze the EMR data. Each feature of the EMR data is a record/event related to a certain diagnosis. When traditional PCA is applied to the data, those medical features are projected to a low-dimensional space, in which each new feature is a linear combination of all the original features. In this case, it is hard to comprehend the meaning of the new features.
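The iterative scheme mentioned above (find the leading component, deflate, repeat) can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names `leading_component` and `top_components`, the toy data, the iteration count, and the use of Hotelling deflation on the sample covariance are all choices made here, not details taken from the paper or from [7].

```python
import numpy as np

def leading_component(S, iters=500, seed=0):
    """Power iteration: return (eigenvalue, unit eigenvector) for symmetric PSD S."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(S.shape[0])
    z /= np.linalg.norm(z)
    for _ in range(iters):
        z = S @ z
        z /= np.linalg.norm(z)
    return float(z @ S @ z), z

def top_components(S, p):
    """Top-p components via repeated power iteration and Hotelling deflation."""
    S = S.copy()
    comps = []
    for _ in range(p):
        lam, z = leading_component(S)
        comps.append(z)
        S -= lam * np.outer(z, z)  # deflate: remove the direction just found
    return np.column_stack(comps)

# Toy data (assumed): n = 200 points in d = 5 dimensions with decaying scales.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
S = X.T @ X / X.shape[0]
Z = top_components(S, p=2)  # columns are dense loading vectors
```

Every entry of each column of `Z` is generally nonzero, which is the density issue discussed in this section.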

Sparse PCA has been proposed to address this drawback. In sparse PCA, we learn sparse loading vectors that combine only a few of the input variables, which allows interpretation of the principal components. Sparse PCA was first proposed by Zou et al. [9], where PCA was formulated as a regression problem and sparsity was introduced by imposing the lasso (elastic net) constraint. Other common approaches to solve the sparse PCA problem are semi-definite programming [10, 11] and the inverse power method [12]. Moreover, a more recent study [13] investigated sparse PCA with the oracle property. The aforementioned approaches are generally not scalable enough to work with large-scale datasets. One way to deal with large sample sizes is to use stochastic methods. An example of stochastic PCA is given in [7], whose author described an algorithm with computationally cheap stochastic iterations and the variance reduction technique suggested in [14].

In this study, sparse PCA is posed as an *ℓ*_{1} regularized optimization problem. Standard approaches to solve such sparse learning problems are proximal gradient methods [15–17], which require computation of the full gradient at each iteration. These methods generally work with a composite function including a smooth part and a non-smooth part. A large family of machine learning problems [18–23] can be expressed as composite functions. Traditionally, solving problems whose objectives are not continuously differentiable requires subgradient descent [24], which converges slowly [25]. The more recently developed proximal gradient methods can solve these composite problems with fast convergence rates [26, 27]. However, these methods scale poorly to problems with large sample sizes because of the computation of the full gradient. Therefore, stochastic gradient-based methods are preferred for such problems. One major disadvantage of stochastic gradient descent is its slow convergence due to the high variance introduced by random sampling. Johnson and Zhang proposed a solution for this drawback in [14]. Their solution reduced the variance by using a copy of the estimated optimal point and the full gradient at this point in the gradient step. This approach exploited the strong convexity property to obtain a geometric convergence rate under expectation. Xiao and Zhang similarly presented a multi-stage scheme to progressively reduce the variance of the proximal stochastic gradient (Prox-SVRG) with a geometric convergence rate under expectation in [15]. The fundamental assumptions were Lipschitz continuity of the gradient of the smooth part and strong convexity of the objective function.

To tackle the aforementioned challenges in this paper, we introduce a novel stochastic convex sparse PCA (Cvx-SPCA) method which is extremely efficient and can handle large-scale datasets. Specifically, we propose to adopt a convex formulation of PCA [28] which provides a strongly convex function. The problem structure in this design allows us to leverage efficient scheme of Prox-SVRG [15] which leads to an exponential (geometric) convergence rate. We also investigate the convergence analysis of Prox-SVRG and present a new proof of the convergence rate which significantly reduces the conditions and assumptions required. As such, we show that the optimization scheme can be applied to a much larger class of problems to obtain the geometric convergence rate. We conducted extensive experiments on both synthetic and real datasets to illustrate the efficiency of the proposed algorithm. Because of its efficiency, we were able to apply the proposed algorithm to analyze a real EMR cohort with a large number of patients, which is hardly possible to analyze by using traditional approaches.

## 2 Convex sparse principal component analysis

In this section, we introduce the problem formulation and optimization scheme of the proposed approach. The problem of finding a sparse loading vector is posed by combining the *ℓ*_{1} sparsity-inducing norm with the convexity of the convex principal component analysis formulation, which allows us to utilize an extremely efficient stochastic proximal gradient approach.

### 2.1 Convex sparse PCA

In Eq. (2), \(R(\mathbf{z}) = \gamma \|\mathbf{z}\|_{1}\) is the *ℓ*_{1} regularization term on the loading vector **z**, \(\gamma \in \mathbb {R}\) is the regularization parameter controlling the sparsity of the loading vector, \(\lambda > \lambda_{1}(\mathbf{S})\) is the convexity parameter, \(\mathbf {w} \in \mathbb {R}^{d}\) is a random vector, and \(\mathbf {S} = \frac {1}{n} \sum _{i = 1}^{n} \mathbf {x}_{i}\mathbf {x}_{i}^{T}\). Here, \(\lambda_{1}(\mathbf{S})\) represents the largest eigenvalue of the covariance matrix **S**, and **w** is a vector of normally distributed random numbers. An upper bound for the regularization parameter *γ* can be derived by using standard subgradient analysis [25]: if *γ* is larger than the maximum absolute value of the elements of the vector **w**, i.e., \(\|\mathbf{w}\|_{\infty}\), we end up with the trivial solution (a loading vector with only zeros). This guides us to use the parameter range \(\gamma \in [0, \|\mathbf{w}\|_{\infty}]\).
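The upper bound on *γ* can be checked numerically. The sketch below assumes that the objective has the form \(P(\mathbf{z}) = \frac{1}{2}\mathbf{z}^{T}(\lambda \mathbf{I} - \mathbf{S})\mathbf{z} - \mathbf{w}^{T}\mathbf{z} + \gamma\|\mathbf{z}\|_{1}\), which is consistent with the definitions above and with the convex PCA formulation of [28] but is an assumption here, since the display of Eq. (2) is not reproduced. The solver `solve_cvx_spca` is a plain (non-stochastic) proximal gradient loop written for illustration; the data, the factor 1.5 for *λ*, and the iteration count are arbitrary choices.

```python
import numpy as np

def soft_threshold(u, t):
    """Elementwise soft-thresholding, the proximal mapping of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def solve_cvx_spca(S, w, lam, gamma, iters=2000):
    """Proximal gradient on 0.5 z^T (lam I - S) z - w^T z + gamma ||z||_1 (assumed form)."""
    A = lam * np.eye(S.shape[0]) - S
    eta = 1.0 / lam  # valid step: eigenvalues of A lie in (0, lam] because S is PSD
    z = np.zeros(S.shape[0])
    for _ in range(iters):
        z = soft_threshold(z - eta * (A @ z - w), eta * gamma)
    return z

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
S = X.T @ X / X.shape[0]
w = rng.standard_normal(6)
lam = 1.5 * np.linalg.eigvalsh(S)[-1]  # any lam > lambda_1(S) keeps the problem convex

gamma_max = np.max(np.abs(w))                               # ||w||_inf
z_trivial = solve_cvx_spca(S, w, lam, gamma_max)            # all zeros
z_nontrivial = solve_cvx_spca(S, w, lam, 0.5 * gamma_max)   # at least one nonzero loading
```

Under this assumed form, the zero vector is optimal exactly when every entry of **w** is at most *γ* in absolute value, which matches the range given above.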

In the above approach, we use a convex optimization formulation of finding the first principal component inspired by the work in [28]. Even though *R*(**z**) is not strongly convex, the overall cost function in Eq. (2) is a strongly convex function in which the strong convexity comes from *F*(**z**). The structure of the problem defined in Eq. (2) allows us to use gradient based algorithms to obtain the global solution. Moreover, the strong convexity usually ensures nice convergence properties for stochastic gradient schemes as well. Therefore, we can also benefit from the faster convergence rate of the proximal stochastic scheme proposed in [15]. We note that the objective function of traditional PCA as shown in Eq. (1) does not define a convex problem, and thus, the analysis in this paper cannot be applied to it.

The most common methods to solve problems such as Eq. (2), where the objective function is comprised of the average of smooth component functions and a non-smooth function, are proximal gradient methods. In the next section, the method used to solve convex optimization problem given in Eq. (2) will be explained.

### 2.2 Optimization scheme

The function *F*(**z**) can also be written as the average of *n* smooth functions, \(F(\mathbf{z}) = \frac{1}{n}\sum_{i=1}^{n} f_{i}(\mathbf{z})\).

When *n* is very large, calculating the full gradient at each gradient descent iteration is an expensive operation. Hence, stochastic gradient methods are preferred to solve such problems. In stochastic approach, instead of calculating gradients for all of the data points, one data point is randomly sampled and the gradient at this point is calculated at each iteration. Therefore, the number of calculations decreases. However, the drawback of the stochastic gradient methods is the high variance introduced because of random sampling. As a result of the high variance, we suffer from poor convergence rates. As discussed previously, there are solutions to reduce the variance and increase the convergence rate. One of the studies which mitigates the high variance problem of stochastic gradient method is proximal stochastic gradient method with progressive variance reduction [15]. The study in [15] showed that the variance of the gradient can be upper bounded by using a multi-stage scheme which progressively reduces the variance. When the algorithm converges to optimal point, variance also converges to zero. Therefore, this approach can achieve better convergence rates than conventional stochastic gradient even with constant step sizes. We refer the readers to Section 3.1 in [15] for detailed proof of bounding the variance.
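The trade-off described above can be made concrete with a small numerical check. The sketch assumes component functions of the form \(f_{i}(\mathbf{z}) = \frac{1}{2}\mathbf{z}^{T}(\lambda \mathbf{I} - \mathbf{x}_{i}\mathbf{x}_{i}^{T})\mathbf{z} - \mathbf{w}^{T}\mathbf{z}\), chosen here so that their average matches a smooth part \(\frac{1}{2}\mathbf{z}^{T}(\lambda \mathbf{I} - \mathbf{S})\mathbf{z} - \mathbf{w}^{T}\mathbf{z}\); this form, the helper `grad_f`, and the toy data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
S = X.T @ X / n
lam = 1.1 * np.linalg.eigvalsh(S)[-1]
z = rng.standard_normal(d)

def grad_f(i, z):
    """Gradient of the assumed component f_i; costs O(d) for one data point."""
    x = X[i]
    return lam * z - x * (x @ z) - w

# Full gradient of the smooth part, which touches all n points (or the d x d matrix S).
full_grad = (lam * np.eye(d) - S) @ z - w

# Averaging the per-sample gradients recovers the full gradient, so a uniformly
# sampled grad_f(i, z) is an unbiased but noisy estimate of it.
avg_grad = np.mean([grad_f(i, z) for i in range(n)], axis=0)
```

A single call to `grad_f` is cheap, but individual samples can differ substantially from `full_grad`; this is the variance that the scheme of [15] reduces.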

In this paper, we also follow the approach in [15]. The algorithm used in this study is given in Algorithm 1.

Here, \(\mathbf{z}_{0}\) is the initial value for the loading vector **z**, *η* is the constant step size, *γ* is the regularization parameter that controls the sparsity of **z**, *m* is the number of iterations in each epoch *s*, and *T* is the maximum number of epochs. The full gradient at the point \(\tilde {\mathbf {z}}\) is calculated periodically, once per epoch. The cost of calculating the full gradient is the product of a *d*×*d* matrix and a *d*-dimensional vector. Therefore, the most time-consuming part of our algorithm is the multiplication with the covariance matrix when the feature dimension is high. \(\tilde {\mathbf {z}}\) is an estimate of the optimal point, and it is updated at each epoch to be utilized in the gradient calculations. During the *m* stochastic gradient steps, we first sample a data point randomly and compute the gradient \(\mathbf{v}_{k}\). If we take the expectation of the gradient calculated in Eq. (4), we can see that \(\mathbf{v}_{k}\) is also an estimate of the full gradient, as in conventional stochastic gradient methods. This shows that \(\mathbf{v}_{k}\) given below is in the same direction as the full gradient under expectation.

\[ \mathbf{v}_{k} = \nabla f_{i_{k}}\left(\mathbf{z}_{k-1}\right) - \nabla f_{i_{k}}\left(\tilde{\mathbf{z}}\right) + \nabla F\left(\tilde{\mathbf{z}}\right) \qquad (4) \]

where \(\nabla F\left (\tilde {\mathbf {z}}\right)\) is the average gradient of the functions \(f_{i}(\mathbf{z})\), \(i = 1, \ldots, n\), i.e., the full gradient at the point \(\tilde {\mathbf {z}}\); \(\nabla f_{i_{k}}(\mathbf{z}_{k-1})\) is the gradient of the component function evaluated with the data point \(\mathbf{x}_{i_{k}}\) sampled at the *k*th iteration; and \(\tilde {\mathbf {z}}\) is the average of \(\mathbf{z}_{k}\), \(k = 1, \ldots, m\), at the end of an epoch.
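The claim that the variance-reduced gradient points in the direction of the full gradient under expectation can be verified in one line. The derivation below assumes that the index \(i_{k}\) is drawn uniformly from \(\{1, \ldots, n\}\) and that \(\mathbf{v}_{k}\) has the standard Prox-SVRG form of [15], \(\mathbf{v}_{k} = \nabla f_{i_{k}}(\mathbf{z}_{k-1}) - \nabla f_{i_{k}}(\tilde{\mathbf{z}}) + \nabla F(\tilde{\mathbf{z}})\):

\[
\mathbb{E}_{i_{k}}\left\{\mathbf{v}_{k}\right\}
= \frac{1}{n}\sum_{i=1}^{n} \nabla f_{i}\left(\mathbf{z}_{k-1}\right)
- \frac{1}{n}\sum_{i=1}^{n} \nabla f_{i}\left(\tilde{\mathbf{z}}\right)
+ \nabla F\left(\tilde{\mathbf{z}}\right)
= \nabla F\left(\mathbf{z}_{k-1}\right) - \nabla F\left(\tilde{\mathbf{z}}\right) + \nabla F\left(\tilde{\mathbf{z}}\right)
= \nabla F\left(\mathbf{z}_{k-1}\right).
\]

The two correction terms cancel in expectation, so they change the variance of the estimate but not its mean.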

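We then update \(\mathbf{z}_{k}\) by using the proximal mapping for the *ℓ*_{1} norm as follows.

\[ \mathbf{z}_{k} = \text{prox}_{\eta R}\left(\mathbf{z}_{k-1} - \eta \mathbf{v}_{k}\right) \]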
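For \(R(\mathbf{z}) = \gamma\|\mathbf{z}\|_{1}\), the proximal mapping has the well-known closed form of elementwise soft-thresholding with threshold *ηγ*. A minimal sketch follows; the function name `prox_l1` and the example values are chosen here for illustration.

```python
import numpy as np

def prox_l1(u, t):
    """Proximal mapping of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

eta, gamma = 0.5, 0.5                  # illustrative step size and regularization
u = np.array([0.8, -0.05, 0.3, -1.2])  # stands in for z_{k-1} - eta * v_k
z = prox_l1(u, eta * gamma)            # threshold is eta * gamma = 0.25
```

Entries whose magnitude is below the threshold are set to zero and the remaining entries are shrunk by the threshold; this is what produces sparse loading vectors.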

In this algorithm, the variance of the stochastic gradient \(\mathbf{v}_{k}\) is reduced progressively, while both \(\tilde {\mathbf {z}}\) and \(\mathbf{z}_{k-1}\) converge to the optimal point \(\mathbf{z}_{*} = \arg\min_{\mathbf{z}} P(\mathbf{z})\) [15]. Since the full gradient is utilized to modify the stochastic gradients and the function *F* is an average of smooth component functions, the variance can be bounded. In the next section, we give the convergence analysis of the aforementioned algorithm.
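The procedure described in this section can be sketched end to end. This is a minimal illustration and not the authors' implementation: it assumes components \(f_{i}(\mathbf{z}) = \frac{1}{2}\mathbf{z}^{T}(\lambda \mathbf{I} - \mathbf{x}_{i}\mathbf{x}_{i}^{T})\mathbf{z} - \mathbf{w}^{T}\mathbf{z}\), computes the full gradient through **X** instead of an explicit covariance matrix, and uses toy data with unit-norm rows and *λ* = 1.5 so that every assumed component is convex with \(L_{i} = \lambda\). The function name, the step size 0.05/\(L_{Q}\), and the values of *m* and *T* are illustrative choices.

```python
import numpy as np

def prox_l1(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def cvx_spca_prox_svrg(X, w, lam, gamma, eta, m, T, seed=0):
    """Prox-SVRG sketch for the assumed Cvx-SPCA objective; returns the last snapshot."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z_tilde = np.zeros(d)
    for _ in range(T):
        # Full gradient at the snapshot: (lam I - S) z_tilde - w, without forming S.
        full_grad = lam * z_tilde - X.T @ (X @ z_tilde) / n - w
        z = z_tilde.copy()
        z_sum = np.zeros(d)
        for _ in range(m):
            x = X[rng.integers(n)]  # sample one data point uniformly
            diff = z - z_tilde
            # grad f_i(z) - grad f_i(z_tilde) + full gradient; the w terms cancel.
            v = lam * diff - x * (x @ diff) + full_grad
            z = prox_l1(z - eta * v, eta * gamma)
            z_sum += z
        z_tilde = z_sum / m         # average of the inner iterates
    return z_tilde

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 8)) * np.array([3.0, 1, 1, 1, 1, 1, 1, 1])
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm rows (toy choice)
n, d = X.shape
S = X.T @ X / n
lam = 1.5                                      # exceeds lambda_1(S) <= 1 for unit-norm rows
w = rng.standard_normal(d)
gamma = 0.3 * np.max(np.abs(w))
L_Q = lam                                      # max_i L_i under the assumptions above
z_hat = cvx_spca_prox_svrg(X, w, lam, gamma, eta=0.05 / L_Q, m=1000, T=30)
```

Each inner step costs O(*d*), and the O(*nd*) full gradient is computed only once per epoch.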

## 3 Convergence analysis

In this section, we present the convergence analysis of the proposed algorithm. The objective function used in this paper is suitable to follow the convergence analysis in [15]. Therefore, our analysis is mostly adapted from [15]. However, we use much weaker conditions which allow a broader family of objective functions to fit in this scheme and to enjoy the geometric convergence. We retain the following assumption used throughout in [15]:

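### Assumption 1

The function *R*(**z**) is lower semi-continuous and convex, and its effective domain, \(dom(R):=\left \{\mathbf {z}\in \mathbb {R}^{d} | R\left (\mathbf {z}\right)<+\infty \right \}\), is closed. Each \(f_{i}(\mathbf{z})\), for \(i = 1, \ldots, n\), is differentiable on an open set that contains *dom*(*R*), and their gradients are Lipschitz continuous. That is, there exist \(L_{i} > 0\) such that for all \(\mathbf{z}, \mathbf{y} \in dom(R)\),

\[ \left\|\nabla f_{i}(\mathbf{z}) - \nabla f_{i}(\mathbf{y})\right\| \leq L_{i} \left\|\mathbf{z} - \mathbf{y}\right\|. \]

The gradient of *F*(**z**) is then also Lipschitz continuous, i.e., there is an \(L > 0\) such that for all \(\mathbf{z}, \mathbf{y} \in dom(R)\),

\[ \left\|\nabla F(\mathbf{z}) - \nabla F(\mathbf{y})\right\| \leq L \left\|\mathbf{z} - \mathbf{y}\right\|. \]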
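As a worked instance of Assumption 1, suppose that each component function has the quadratic form \(f_{i}(\mathbf{z}) = \frac{1}{2}\mathbf{z}^{T}(\lambda \mathbf{I} - \mathbf{x}_{i}\mathbf{x}_{i}^{T})\mathbf{z} - \mathbf{w}^{T}\mathbf{z}\). This form is an assumption made here for illustration, in the spirit of the convex PCA formulation of [28]. The Lipschitz constants can then be read off the Hessians:

\[
\nabla^{2} f_{i}(\mathbf{z}) = \lambda \mathbf{I} - \mathbf{x}_{i}\mathbf{x}_{i}^{T},
\qquad
L_{i} = \left\|\lambda \mathbf{I} - \mathbf{x}_{i}\mathbf{x}_{i}^{T}\right\|_{2} = \max\left\{\lambda,\ \left|\lambda - \|\mathbf{x}_{i}\|^{2}\right|\right\},
\]

\[
\nabla^{2} F(\mathbf{z}) = \lambda \mathbf{I} - \mathbf{S},
\qquad
L = \lambda - \lambda_{\min}(\mathbf{S}) \leq \lambda .
\]

The matrix \(\lambda \mathbf{I} - \mathbf{x}_{i}\mathbf{x}_{i}^{T}\) has eigenvalue \(\lambda - \|\mathbf{x}_{i}\|^{2}\) along \(\mathbf{x}_{i}\) and eigenvalue *λ* on the orthogonal complement, which gives the expression for \(L_{i}\). Since \(R(\mathbf{z}) = \gamma\|\mathbf{z}\|_{1}\) is finite everywhere, \(dom(R) = \mathbb{R}^{d}\), which is closed.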

In [15], the convergence analysis was carried out for general functions *F* and *R*, and both of them were assumed to be strongly convex. In contrast, we only assume that the functions *F*(**z**) and *R*(**z**) are convex, but not necessarily strongly convex. Thus, we relax this strong assumption. Strong convexity provides good properties and is relevant for faster convergence rates. However, in many cases the objective functions are not strongly convex. Therefore, a simplified version of the analysis is preferable when the objective functions do not necessarily have the strong convexity property.

Although our overall objective function is strongly convex, *R*(**z**) is not strongly convex as it was mentioned in the previous section. Therefore, we drop the strong convexity assumption at two steps in the original analysis of [15] and obtain the convergence rate given in the following theorem.

### Theorem 1

For a step size \(\eta < 1/(4L_{Q})\), where \(L_{Q} = \max_{i} L_{i}\), the convergence rate is obtained as follows:

where \(\mathbf{z}_{*} = \arg\min_{\mathbf{z}} P(\mathbf{z})\).

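### Proof

We start by bounding the squared distance between \(\mathbf{z}_{k}\) and \(\mathbf{z}_{*}\), \(\|\mathbf{z}_{k} - \mathbf{z}_{*}\|^{2}\). According to the stochastic gradient mapping definition in [15], \(\mathbf{z}_{k}\) can be written as \(\mathbf{z}_{k-1} - \eta \mathbf{g}_{k}\). Expanding the squared distance gives

\[ \left\|\mathbf{z}_{k} - \mathbf{z}_{*}\right\|^{2} = \left\|\mathbf{z}_{k-1} - \mathbf{z}_{*}\right\|^{2} - 2\eta\, \mathbf{g}_{k}^{T}\left(\mathbf{z}_{k-1} - \mathbf{z}_{*}\right) + \eta^{2}\left\|\mathbf{g}_{k}\right\|^{2}. \qquad (6) \]

The optimality condition of the proximal mapping that defines \(\mathbf{z}_{k}\) reads

\[ \mathbf{z}_{k} - \left(\mathbf{z}_{k-1} - \eta \mathbf{v}_{k}\right) + \eta\, \xi = 0, \]

where \(\xi \in \partial R(\mathbf{z}_{k})\) is a subgradient of *R*(**z**) at \(\mathbf{z}_{k}\). If we combine the stochastic gradient mapping definition with the optimality condition, we obtain the following expression.

\[ \xi = \mathbf{g}_{k} - \mathbf{v}_{k} \]

By the convexity of *F*(**z**) and *R*(**z**), we can write the following inequality for any \(\mathbf{y} \in dom(R)\).

\[ P(\mathbf{y}) = F(\mathbf{y}) + R(\mathbf{y}) \geq F\left(\mathbf{z}_{k-1}\right) + \nabla F\left(\mathbf{z}_{k-1}\right)^{T}\left(\mathbf{y} - \mathbf{z}_{k-1}\right) + R\left(\mathbf{z}_{k}\right) + \xi^{T}\left(\mathbf{y} - \mathbf{z}_{k}\right) \]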

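The analysis in [15] uses the strong convexity of *F* and *R* in Eq. (7). However, we will show that strong convexity is not required at this point. Since the gradient of *F*(**z**) is assumed to be Lipschitz continuous with Lipschitz constant *L*, \(F(\mathbf{z}_{k-1})\) can also be bounded by using Theorem 2.1.5 in [29].

\[ F\left(\mathbf{z}_{k-1}\right) \geq F\left(\mathbf{z}_{k}\right) - \nabla F\left(\mathbf{z}_{k-1}\right)^{T}\left(\mathbf{z}_{k} - \mathbf{z}_{k-1}\right) - \frac{L}{2}\left\|\mathbf{z}_{k} - \mathbf{z}_{k-1}\right\|^{2} \]

We substitute this bound into the convexity inequality and use \(\mathbf{z}_{k} - \mathbf{z}_{k-1} = -\eta \mathbf{g}_{k}\) to obtain the following inequality.

\[ P(\mathbf{y}) \geq P\left(\mathbf{z}_{k}\right) + \nabla F\left(\mathbf{z}_{k-1}\right)^{T}\left(\mathbf{y} - \mathbf{z}_{k}\right) + \xi^{T}\left(\mathbf{y} - \mathbf{z}_{k}\right) - \frac{L}{2}\eta^{2}\left\|\mathbf{g}_{k}\right\|^{2} \]

We replace *ξ* with \(\mathbf{g}_{k} - \mathbf{v}_{k}\), then add and subtract \(\mathbf{z}_{k-1}\) in the term \((\mathbf{y} - \mathbf{z}_{k})\):

\[ P(\mathbf{y}) \geq P\left(\mathbf{z}_{k}\right) + \mathbf{g}_{k}^{T}\left(\mathbf{y} - \mathbf{z}_{k-1}\right) + \left(\eta - \frac{L}{2}\eta^{2}\right)\left\|\mathbf{g}_{k}\right\|^{2} + \left(\mathbf{v}_{k} - \nabla F\left(\mathbf{z}_{k-1}\right)\right)^{T}\left(\mathbf{z}_{k} - \mathbf{y}\right) \]

Since \(\eta < 1/(4L_{Q}) < 1/L\), the coefficient \(\left (\eta - \frac {L}{2}\eta ^{2}\right) = \frac {\eta }{2}\left (2 - L\eta \right)\) can be taken as *η*/2: the factor \((2 - L\eta)\) lies in (1,2) under this assumption, so replacing it by 1 does not change the direction of the inequality. Now we use the result derived above, with \(\mathbf{y} = \mathbf{z}_{*}\), for the term \(\left (-\mathbf {g}_{k}^{T} \left (\mathbf {z}_{k-1} - \mathbf{z}_{*}\right) + \frac {\eta }{2}\left \|\mathbf {g}_{k}\right \|^{2}\right)\) in Eq. (6).

\[ \left\|\mathbf{z}_{k} - \mathbf{z}_{*}\right\|^{2} \leq \left\|\mathbf{z}_{k-1} - \mathbf{z}_{*}\right\|^{2} - 2\eta\left[P\left(\mathbf{z}_{k}\right) - P\left(\mathbf{z}_{*}\right)\right] - 2\eta\, \Delta^{T}\left(\mathbf{z}_{k} - \mathbf{z}_{*}\right) \]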

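Here, \(\Delta = \mathbf{v}_{k} - \nabla F(\mathbf{z}_{k-1})\), and \(\mathbf{z}_{*}\) corresponds to **y**. The term \(-2\eta \Delta^{T}(\mathbf{z}_{k} - \mathbf{z}_{*})\) can further be bounded by using the proximal full gradient update \(\bar {\mathbf {z}}_{k} = \text {prox}_{\eta R}\left (\mathbf {z}_{k-1} - \eta \nabla F\left (\mathbf {z}_{k-1}\right)\right)\). If the Cauchy-Schwarz inequality and the non-expansiveness of the proximal mapping (\(\|\text{prox}_{\eta R}(\mathbf{x}) - \text{prox}_{\eta R}(\mathbf{y})\| \leq \|\mathbf{x} - \mathbf{y}\|\)) are utilized, the following expression can be derived.

\[ -2\eta\, \Delta^{T}\left(\mathbf{z}_{k} - \mathbf{z}_{*}\right) = -2\eta\, \Delta^{T}\left(\mathbf{z}_{k} - \bar{\mathbf{z}}_{k}\right) - 2\eta\, \Delta^{T}\left(\bar{\mathbf{z}}_{k} - \mathbf{z}_{*}\right) \leq 2\eta \left\|\Delta\right\| \left\|\mathbf{z}_{k} - \bar{\mathbf{z}}_{k}\right\| - 2\eta\, \Delta^{T}\left(\bar{\mathbf{z}}_{k} - \mathbf{z}_{*}\right) \]

Since \(\mathbf{z}_{k} = \text{prox}_{\eta R}\left(\mathbf{z}_{k-1} - \eta \mathbf{v}_{k}\right)\) and \(\bar {\mathbf {z}}_{k} = \text{prox}_{\eta R}\left (\mathbf {z}_{k-1} - \eta \nabla F\left (\mathbf {z}_{k-1}\right)\right)\), we have \(\|\mathbf{z}_{k} - \bar{\mathbf{z}}_{k}\| \leq \eta \|\Delta\|\), and therefore:

\[ \left\|\mathbf{z}_{k} - \mathbf{z}_{*}\right\|^{2} \leq \left\|\mathbf{z}_{k-1} - \mathbf{z}_{*}\right\|^{2} - 2\eta\left[P\left(\mathbf{z}_{k}\right) - P\left(\mathbf{z}_{*}\right)\right] + 2\eta^{2}\left\|\Delta\right\|^{2} - 2\eta\, \Delta^{T}\left(\bar{\mathbf{z}}_{k} - \mathbf{z}_{*}\right) \]

We now take the expectation of both sides with respect to \(\mathbf{z}_{k}\).

Since \(\bar{\mathbf{z}}_{k}\) and \(\mathbf{z}_{*}\) are independent of the random variable \(\mathbf{z}_{k}\), \(\mathbb {E} \left \{\Delta ^{T} \left (\bar {\mathbf {z}}_{k} - \mathbf {z}_{*}\right)\right \} = \mathbb {E} \left \{\Delta ^{T}\right \}\left (\bar {\mathbf {z}}_{k} - \mathbf {z}_{*}\right) = 0\), because \(\mathbb {E} \left \{\Delta\right \} = \mathbb {E} \left \{\mathbf {v}_{k} - \nabla F\left (\mathbf {z}_{k-1}\right)\right \} = \mathbb {E}\left \{\mathbf {v}_{k}\right \} - \nabla F\left (\mathbf {z}_{k-1}\right) = 0\). The variance of the gradient \(\mathbb {E}\left \{\left \|\Delta \right \|^{2}\right \}\) is upper bounded in the Prox-SVRG algorithm, and we use the result of Corollary 3 in [15], which is \(\mathbb {E}\left \{\left \|\Delta \right \|^{2}\right \} \leq 4L_{Q} \left [P\left (\mathbf {z}_{k-1}\right) - P\left (\mathbf {z}_{*}\right) + P\left (\tilde {\mathbf {z}}\right)-P\left (\mathbf {z}_{*}\right)\right ]\), where \(L_{Q} = \max_{i} L_{i}\), \(\tilde {\mathbf {z}}_{s} = \frac {1}{m}\sum _{k=1}^{m} \mathbf {z}_{k}\), and \(\tilde {\mathbf {z}} = \tilde {\mathbf {z}}_{s-1} = \mathbf {z}_{0}\) for a fixed epoch. After incorporating the bound of the variance of the gradient into the analysis, the following expression is obtained.

\[ \mathbb{E}\left\{\left\|\mathbf{z}_{k} - \mathbf{z}_{*}\right\|^{2}\right\} \leq \left\|\mathbf{z}_{k-1} - \mathbf{z}_{*}\right\|^{2} - 2\eta\left[\mathbb{E}\left\{P\left(\mathbf{z}_{k}\right)\right\} - P\left(\mathbf{z}_{*}\right)\right] + 8\eta^{2}L_{Q}\left[P\left(\mathbf{z}_{k-1}\right) - P\left(\mathbf{z}_{*}\right) + P\left(\tilde{\mathbf{z}}\right) - P\left(\mathbf{z}_{*}\right)\right] \]

If this inequality is summed over \(k = 1, \ldots, m\) and the expectation with respect to the previous random variables \(\mathbf{z}_{1}, \ldots, \mathbf{z}_{m}\) is taken, then we can obtain the following inequality.

\[ \mathbb{E}\left\{\left\|\mathbf{z}_{m} - \mathbf{z}_{*}\right\|^{2}\right\} + 2\eta\left[\mathbb{E}\left\{P\left(\mathbf{z}_{m}\right)\right\} - P\left(\mathbf{z}_{*}\right)\right] + 2\eta\left(1 - 4\eta L_{Q}\right)\sum_{k=1}^{m-1}\left[\mathbb{E}\left\{P\left(\mathbf{z}_{k}\right)\right\} - P\left(\mathbf{z}_{*}\right)\right] \leq \left\|\mathbf{z}_{0} - \mathbf{z}_{*}\right\|^{2} + 8\eta^{2}L_{Q}\left[P\left(\mathbf{z}_{0}\right) - P\left(\mathbf{z}_{*}\right) + m\left(P\left(\tilde{\mathbf{z}}\right) - P\left(\mathbf{z}_{*}\right)\right)\right] \]

Notice that \(2\eta(1 - 4\eta L_{Q}) < 2\eta\) and \(\mathbf {z}_{0} = \tilde {\mathbf {z}}\). Moreover, *P* is convex; therefore, \(P\left (\tilde {\mathbf {z}}_{s}\right) \leq \frac {1}{m}\sum _{k=1}^{m}P\left (\mathbf {z}_{k}\right)\), and we can write the following inequality.

\[ 2\eta\left(1 - 4\eta L_{Q}\right) m \left[\mathbb{E}\left\{P\left(\tilde{\mathbf{z}}_{s}\right)\right\} - P\left(\mathbf{z}_{*}\right)\right] \leq \left\|\tilde{\mathbf{z}}_{s-1} - \mathbf{z}_{*}\right\|^{2} + 8\eta^{2}L_{Q}\left(m + 1\right)\left[P\left(\tilde{\mathbf{z}}_{s-1}\right) - P\left(\mathbf{z}_{*}\right)\right] \]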

### Lemma 1

Starting from an initial point \(\mathbf{z}_{0}\), using the proximal mapping, which is shown below, iteratively generates a sequence that will converge to the optimal solution.

Since *R*(**x**) is a convex function, the optimal solution of the above problem is also an optimal solution of the following problem using a tuning parameter *μ* [30, Theorem 1].

According to [31], for a convex function *R*, we have the following inequality for all \(\mathbf{z} \in \Omega\):

Here, the operator with subscript *E* is the Euclidean projection onto the set *E*, and *ℓ* is a positive parameter.

We have thus removed the strong convexity condition so that we are able to apply the algorithm in [15] to more generic convex objectives.

## 4 Results

In this section, we present the results of two types of experiments. First, the proposed algorithm was tested on synthetic datasets to investigate the convergence of the variance-reduced proximal stochastic gradient compared to traditional proximal stochastic gradient descent. In addition, the running times of the proposed stochastic Cvx-SPCA and other sparse PCA methods were compared to emphasize the advantage of using a stochastic approach when there is a large number of samples. In our experiments, the step size *η* was chosen heuristically within the range \(0 < \eta < 1/(4L_{Q})\), and \(L_{Q}\) was taken as the largest eigenvalue of the covariance matrix. The number of iterations *m* was chosen as \(\Theta(L_{Q}/(\lambda - \lambda_{1}(\mathbf{S})))\), as suggested in [15]. Second, we present our experiments on electronic medical records data.
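The parameter heuristics above can be written down as a small helper. The helper name `choose_hyperparameters` and the constants `lam_factor`, `step_fraction`, and `m_const` are hypothetical: the text only fixes the range of *η*, the choice of \(L_{Q}\) as the largest eigenvalue of the covariance matrix, and the order of magnitude of *m*, not the constants.

```python
import math
import numpy as np

def choose_hyperparameters(X, lam_factor=1.5, step_fraction=0.5, m_const=10.0):
    """Return (lam, eta, m) following the heuristics in the text (constants assumed)."""
    n = X.shape[0]
    S = X.T @ X / n
    lam1 = float(np.linalg.eigvalsh(S)[-1])  # largest eigenvalue of the covariance
    L_Q = lam1                               # heuristic stated in the text
    lam = lam_factor * lam1                  # any lam > lambda_1(S)
    eta = step_fraction / (4.0 * L_Q)        # inside (0, 1 / (4 L_Q))
    m = math.ceil(m_const * L_Q / (lam - lam1))  # Theta(L_Q / (lam - lambda_1(S)))
    return lam, eta, m

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
lam, eta, m = choose_hyperparameters(X)
```

A value of *λ* closer to \(\lambda_{1}(\mathbf{S})\) makes the ratio \(L_{Q}/(\lambda - \lambda_{1}(\mathbf{S}))\) larger, so more inner iterations per epoch are needed.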

### 4.1 Synthetic dataset

The regularization parameter *γ* specifies the level of sparsity. In order to obtain a suitable level of sparsity, *γ* should be tuned. One common way of finding an appropriate *γ* is the regularization path. We first generated a random sample with ten features and applied the proposed Cvx-SPCA algorithm to obtain the principal component. Then, the covariance matrix was reconstructed by using the first principal component, corresponding to the largest eigenvalue, with a little random noise. The loading values of the principal components were computed with varying regularization parameters *γ* by using the reconstructed covariance matrix. We started with small *γ* values, and the loading vector learned in the previous step was used as the initialization for each new Cvx-SPCA step. The result is given in Fig. 3.
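The warm-started regularization path can be sketched as follows. The example is a simplified stand-in rather than the experiment reported here: it uses random data with ten features instead of the reconstructed covariance matrix, a deterministic proximal gradient solver instead of the stochastic scheme, and the assumed objective \(\frac{1}{2}\mathbf{z}^{T}(\lambda \mathbf{I} - \mathbf{S})\mathbf{z} - \mathbf{w}^{T}\mathbf{z} + \gamma\|\mathbf{z}\|_{1}\). The grid of *γ* values and all names are illustrative.

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def prox_gradient(S, w, lam, gamma, z0, iters=500):
    """Deterministic proximal gradient on the assumed Cvx-SPCA objective."""
    A = lam * np.eye(S.shape[0]) - S
    eta = 1.0 / lam
    z = z0.copy()
    for _ in range(iters):
        z = soft_threshold(z - eta * (A @ z - w), eta * gamma)
    return z

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
S = X.T @ X / X.shape[0]
w = rng.standard_normal(10)
lam = 2.0 * np.linalg.eigvalsh(S)[-1]

# Increasing grid of gamma values, ending slightly above ||w||_inf.
gammas = np.linspace(0.0, 1.1 * np.max(np.abs(w)), 12)
z = np.zeros(10)
path = []
for gamma in gammas:
    z = prox_gradient(S, w, lam, gamma, z0=z)  # warm start from the previous solution
    path.append(z.copy())
nonzeros = [int(np.count_nonzero(p)) for p in path]
```

Plotting each coordinate of `path` against `gammas` gives a regularization path; `nonzeros` shows how many loadings survive at each value of *γ*.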

### 4.2 Large-scale healthcare dataset

**Table 2** We sampled patients who have female-, male-, child-, and old age-related features. These samples may overlap with each other. For instance, a patient may have dementia and a prostate problem together. We did not include in these groups other problems, such as hypertension or kidney problems, which can be encountered at every age and in both genders.

Demographic | Number of features | Number of patients |
---|---|---|
Female | 1268 | 130,035 |
Male | 106 | 24,184 |
Old | 66 | 2060 |
Child | 596 | 38,434 |

As can be seen from Table 2, the number of female-specific diseases and the number of female patients are larger than those of the other groups in the EMR dataset used in this paper. The number of old patients in the table is smaller than that of the other groups. However, this does not necessarily mean that there are fewer old people in the whole patient population. We could not extract exact age information for every diagnosis/disease. For instance, hypertension and Alzheimer’s were, in the past, commonly encountered among people above a certain age. However, these problems can now occur at younger ages. For this reason, we used only diagnoses/diseases which carry explicit information about the demographic of the patient while sub-sampling the patients. The distributions of the different patient groups in Table 2 are given in Fig. 4.

In our experiments, we further aggregated all diagnoses belonging to the same ICD9 group, so that each patient is represented by a 918-dimensional feature vector. The value of its *i*th dimension represents the frequency with which the *i*th diagnosis code appears in the EMR of the corresponding patient. Since every patient has a limited number of diseases, the patient vectors are very sparse.
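The feature construction can be illustrated with a small sketch. Everything specific in it is hypothetical: the patient records are made up, and the grouping rule (keep the part of the code before the decimal point) is an assumption, since the exact aggregation rule is not given here. A dictionary is used as a simple sparse representation.

```python
from collections import Counter

def icd9_group(code):
    """Map a full ICD9 code such as '719.06' to its group '719' (assumed rule)."""
    return code.split(".")[0]

def patient_vector(codes, group_index):
    """Sparse frequency vector {dimension: count} over ICD9 groups."""
    counts = Counter(icd9_group(c) for c in codes)
    return {group_index[g]: n for g, n in counts.items() if g in group_index}

# Hypothetical records; codes and patients are made up for illustration.
records = {
    "patient_a": ["185", "185", "719.06", "800.01"],
    "patient_b": ["614.9", "281.0", "614.3"],
}
groups = sorted({icd9_group(c) for codes in records.values() for c in codes})
group_index = {g: i for i, g in enumerate(groups)}  # 918 groups in the real data
vectors = {p: patient_vector(c, group_index) for p, c in records.items()}
```

Only the groups that actually occur for a patient are stored, which reflects the sparsity of the patient vectors noted above.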

EMR data features that contribute to the output dimensions after the Cvx-SPCA algorithm was applied to the whole patient population. The most frequently observed problems are infections, injuries, pregnancy- and delivery-related problems, and cancer types.

ICD9 code | Description |
---|---|
7 | Balantidiasis/infectious |
72 | Mumps orchitis/infectious |
115 | Infection by histoplasma capsulatum |
266 | Ariboflavinosis/metabolic disorder |
507 | Pneumonitis/bacterial |
695 | Toxic erythema/dermatological |
697 | Lichen planus/dermatological |
761 | Incompetent cervix affecting fetus or newborn |
795 | Abnormal glandular papanicolaou smear of cervix |
924 | Contusion of thigh/injury |

EMR data features that contribute to the output dimensions after applying the proposed algorithm to the subset of patients who have female-related problems. We could observe female-specific problems and other common diseases such as heart problems and anemia.

ICD9 code | Description |
---|---|
281 | Pernicious anemia |
392 | Valvular and rheumatic heart disease |
614 | Female genital disorders |
778 | Serious perinatal problem affecting newborn |
905 | Major head injury |

EMR data features that contribute to the output dimensions after applying the proposed algorithm to the subset of patients who have male-related problems. We could observe a prostate problem, which is directly related to male patients. In addition, we can also see other common problems such as injuries.

ICD9 code | Description |
---|---|
185 | Malignant neoplasm of prostate |
298 | Depressive type psychosis |
719 | Effusion of joint |
800 | Closed fracture of vault of skull |
811 | Closed fracture of scapula |
860 | Traumatic pneumothorax |

EMR data features that contribute to the output dimensions after applying the proposed algorithm to the subset of patients who have old age-related problems. Cancer is a commonly encountered problem at nearly every age. In addition, we could observe disorders of the nervous system and visual problems in the results.

ICD9 code | Description |
---|---|
153 | Malignant neoplasm of colon |
173 | Other malignant neoplasm of skin |
337 | Disorders of the autonomic nervous system |
368 | Visual disturbance |

EMR data features that contribute to the output dimensions after applying the proposed algorithm to the subset of patients who have child-related problems. According to our observation, tuberculosis and bacterial infections are quite common among children. Unfortunately, leukemia is also a cancer type that is seen even in small children.

ICD9 code | Description |
---|---|
8 | Intestinal infection due to other organisms |
11 | Pulmonary tuberculosis |
78 | Other diseases due to viruses and Chlamydiae |
10 | Primary tuberculous infection |
204 | Lymphoid leukemia |

## 5 Discussion

Throughout the paper, the advantage of using a convex optimization approach for sparse PCA is emphasized. In this section, we would like to discuss our conjecture about the convergence of non-convex stochastic sparse PCA under the same framework. One surprising finding is that, if we use the non-convex PCA formulation to construct a non-convex sparse PCA (by adding the *ℓ*_{1}-norm), we still benefit from a much faster convergence rate using the stochastic scheme studied in this paper. A similar result is also presented in [7], where the author proposes a stochastic PCA approach with an exponential convergence rate by using the variance reduced stochastic gradient presented in [14]. These results lead us to ask the following question: *Can we generalize the convergence analysis of proximal variance reduced stochastic gradient method further for non-convex settings?* We will investigate this problem in future work.

## 6 Conclusions

In this paper, a convex stochastic sparse PCA method is proposed. Since the problem of finding the leading eigenvector is formulated as a convex optimization problem, a well-defined convergence rate can be established for the proposed algorithm. A proximal stochastic gradient method with variance reduction is preferred to avoid the low convergence rates of traditional stochastic methods. Although strong convexity is usually required in the literature, we simplify the convergence analysis of the existing Prox-SVRG algorithm by using weaker conditions. In experiments on several synthetic datasets, the proposed algorithm is shown to be more scalable owing to the stochastic approach. In addition, an application of sparse PCA is presented to show how sparse PCA can help to interpret electronic medical records. In future work, we would like to investigate whether sparse PCA can be used to cluster patients with respect to their medical records. For instance, we propose to apply the proposed algorithm to analyze medical records and derive clinically meaningful and structural phenotypes, which can further be helpful for patient risk stratification and clustering.

## Declarations

### Acknowledgements

This work is supported in part by the Office of Naval Research (ONR) under grant number N00014-14-1-0631 and National Science Foundation under grant numbers IIS-1565596 and IIS-1615597.

### Authors’ contributions

IMB and JZ developed the algorithm. KL contributed to dropping the strong convexity section. FW provided the EMR data and contributed to the interpretation of the experimental results. IMB wrote the paper, and JZ and AKJ edited the paper. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

1. FD la Torre, MJ Black, in *ICCV Eighth IEEE International Conference on Computer Vision, vol. 1*. Robust principal component analysis for computer vision (IEEE, Vancouver, 2001).
2. MW Manal Abdullah, S Bo-saeed, Optimizing face recognition using pca. Int. J. Artif. Intell. Appl. (IJAIA). **3**(2), 23–31 (2012).
3. C Gokulnath, MK Priyan, E Vishnu Balan, KP Rama Prabha, in *International Conference on Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM)*. Preservation of privacy in data mining by using pca based perturbation technique (IEEE, Chennai, 2015), pp. 202–206.
4. W-Y Wang, C-X Qu, in *Second International Symposium on Information Science and Engineering*. Application and research of data mining based on improved pca method (IEEE, Shanghai, 2009), pp. 140–143.
5. PP Alberto Landi, G Pioggia, in *Intelligent Systems Design and Applications, 9th International Conference on*. Backpropagation-based non linear pca for biomedical applications (IEEE, Pisa, 2009), pp. 635–640.
6. D Omucheni, K Kaduki, W Bulimo, H Angeyo, Application of principal component analysis to multispectral-multimodal optical image analysis for malaria diagnostics. Malar. J. **13**(1), 485 (2014). Springer Nature.
7. O Shamir, in *32nd International Conference on Machine Learning, vol. 37*. A stochastic pca and svd algorithm with an exponential convergence rate (Journal of Machine Learning Research (JMLR), Lille Grand Palais, 2015).
8. What Is an Electronic Medical Record (EMR)?. https://www.healthit.gov/providers-professionals/electronic-medical-records-emr.
9. TH Hui Zou, R Tibshirani, Sparse principal component analysis. J. Comput. Graph. Stat. **15**(2), 265–286 (2006).
10. A d’Aspremont, L El Ghaoui, M Jordan, G Lanckriet, A direct formulation for sparse pca using semidefinite programming. SIAM Rev. **49**(3), 434–448 (2007).
11. AY Nikhil Naikal, SS Sastry, Informative feature selection for object recognition via sparse pca. Int. Conf. Comput. Vision, IEEE, 818–825 (2011).
12. M Hein, T Buhler, An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse pca. Adv. Neural Inf. Process. Syst. **23**, 847–855 (2010).
13. Z Gu, Q Wang, H Liu, Sparse pca with oracle property. Adv. Neural Inf. Process. Syst. (NIPS). **27**, 1529–1537 (2014).
14. R Johnson, T Zhang, Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. **26**, 315–323 (2013).
15. L Xiao, T Zhang, A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. **24**(4), 2057–2075 (2014).
16. A Nitanda, Stochastic proximal gradient descent with acceleration techniques. Neural Inf. Process. Syst. **27**, 1574–1582 (2014).
17. S Shalev-Shwartz, T Zhang, in *31st International Conference on Machine Learning, JMLR, vol. 32*. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization (Journal of Machine Learning Research (JMLR), Beijing, 2014).
18. J Liu, J Chen, J Ye, in *Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*. Large-scale sparse logistic regression (ACM, Paris, 2009), pp. 547–556.
19. R Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol, 267–288 (1996).
20. S Ji, J Ye, in *Proceedings of the 26th Annual International Conference on Machine Learning*. An accelerated gradient method for trace norm minimization (ACM, Montreal, 2009), pp. 457–464.
21. J Zhou, L Yuan, J Liu, J Ye, in *Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*. A multi-task learning formulation for predicting disease progression (ACM, San Francisco, 2011), pp. 814–822.
22. L Jacob, G Obozinski, J-P Vert, in *Proceedings of the 26th Annual International Conference on Machine Learning*. Group lasso with overlap and graph lasso (ACM, Montreal, 2009), pp. 433–440.
23. J Zhou, J Chen, J Ye, *Malsar: Multi-task learning via structural regularization* (Arizona State University, 2011). http://www.public.asu.edu/~jye02/Software/MALSAR.
24. NZ Shor, *Minimization Methods for Non-differentiable Functions, vol. 3* (Springer, Berlin Heidelberg, 2012).
25. S Boyd, L Xiao, A Mutapcic, Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter. **2004**, 2004–2005 (2003).
26. A Beck, M Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. **2**(1), 183–202 (2009).
27. SJ Wright, RD Nowak, MA Figueiredo, Sparse reconstruction by separable approximation. Signal Process. IEEE Trans. **57**(7), 2479–2493 (2009).
28. D Garber, E Hazan, Fast and simple pca via convex optimization (2015). arXiv:1509.05647v4 [math.OC], https://arxiv.org/abs/1509.05647.
29. Y Nesterov, *Introductory Lectures On Convex Optimization: A Basic Course*, vol. 87 (Springer US, New York, 2004).
30. M Kloft, U Brefeld, P Laskov, K-R Müller, A Zien, S Sonnenburg, Efficient and accurate lp-norm multiple kernel learning. Adv. Neural Inf. Process. Syst, 997–1005 (2009).
31. J Liu, SJ Wright, Asynchronous stochastic coordinate descent: parallelism and convergence properties. SIAM J. Optim. **25**(1), 351–376 (2015).
32. International Classification of Diseases (ICD). http://www.who.int/classifications/icd/en/.