IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. X, SEPTEMBER 2016

Kernelized LRR on Grassmann Manifolds for Subspace Clustering

Boyue Wang, Yongli Hu, Member, IEEE, Junbin Gao, Yanfeng Sun, Member, IEEE, and Baocai Yin

Abstract—Low rank representation (LRR) has recently attracted great interest due to its pleasing efficacy in exploring low-dimensional subspace structures embedded in data. One of its successful applications is subspace clustering, by which data are clustered according to the subspaces they belong to. In this paper, at a higher level, we intend to cluster subspaces into classes of subspaces. This is naturally described as a clustering problem on the Grassmann manifold. The novelty of this paper is to generalize LRR on Euclidean space to an LRR model on the Grassmann manifold within a uniform kernelized LRR framework. The new method has many applications in data analysis for computer vision tasks. The proposed models have been evaluated on a number of practical data analysis applications. The experimental results show that the proposed models outperform a number of state-of-the-art subspace clustering methods.

Index Terms—Low Rank Representation, Subspace Clustering, Grassmann Manifold, Kernelized Method

(Boyue Wang, Yongli Hu, Yanfeng Sun and Baocai Yin are with the Beijing Municipal Key Lab of Multimedia and Intelligent Software Technology, College of Metropolitan Transportation, Beijing University of Technology, Beijing 100124, China. E-mail: [email protected], {huyongli,yfsun,ybc}@bjut.edu.cn. Junbin Gao is with the Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, Camperdown, NSW 2006, Australia. E-mail: [email protected].)

1 INTRODUCTION

In recent years, subspace clustering or segmentation has attracted great interest in image analysis, computer vision, pattern recognition and signal processing [1], [2]. The basic idea of subspace clustering is that most data have intrinsic subspace structures and can be regarded as samples from a union of multiple subspaces. Thus the main goal of subspace clustering is to group data into different clusters such that the data points in each cluster come from a single subspace. To investigate and represent the underlying subspace structure, many subspace methods have been proposed, such as the conventional iterative methods [3], the statistical methods [4]–[7], the factorization-based algebraic approaches [8], [9], and the spectral clustering-based methods [2], [10]–[13]. These methods have been successfully applied in many applications, such as image representation [14], motion segmentation [8], face classification [15] and saliency detection [13].

Among the subspace clustering methods mentioned above, the spectral clustering methods based on an affinity matrix are considered to have good prospects [2]: an affinity matrix is first learned from the given data, and the final clustering results are then obtained by a spectral clustering algorithm such as Normalized Cuts (NCut) [16] or simply K-means. The key ingredient in a spectral clustering method is to construct a proper affinity matrix for the data. In the typical method, Sparse Subspace Clustering (SSC) [2], one assumes that the data of the subspaces are independent and are sparsely represented under the so-called ℓ1 Subspace Detection Property [17], in which the within-class affinities are sparse and the between-class affinities are all zero. It has been proved that, under certain conditions, multiple subspace structures can be exactly recovered via ℓp (p ≤ 1) minimization [18].

Most current sparse subspace methods focus on an independent sparse representation for each data object. However, the relations among data objects, or the underlying global structure of the subspaces that generate the data to be grouped, are usually not well considered, although these intrinsic properties are very important for clustering applications. Some researchers therefore explore these intrinsic properties and relations among data objects and revise the sparse representation model accordingly by introducing extra constraints, such as the low rank constraint [11], the data Laplacian consistency regularization [19] and the data sequential property [20].
Among these constraints, holistic constraints such as the low rank or nuclear norm ‖·‖∗ are proposed in favour of structural sparsity. The Low Rank Representation (LRR) model [11] is one of the representatives. The LRR model tries to reveal the latent sparse property embedded in a data set in a high dimensional space. It has been proved that, when the high-dimensional data set actually comes from a union of several low-dimensional subspaces, the LRR model can reveal this structure through subspace clustering [11].

Although most current subspace clustering methods show good performance in various applications, the similarity among data objects is measured in the original data domain. For example, the current LRR method is based on the principle of data self-representation, and the representation error is measured by a Euclidean-like distance. However, this hypothesis may not always hold for high-dimensional data in practice, where corrupted data may not reside nicely in a linear space. In fact, it has been proved that many high-dimensional data are embedded in low-dimensional manifolds; for example, human face images are considered samples from a non-linear submanifold [21]. Generally, manifolds can be considered as low-dimensional smooth "surfaces" embedded in a higher-dimensional Euclidean space, and at each point a manifold is locally similar to Euclidean space. To effectively cluster such high-dimensional data, it is desirable to reveal the nonlinear manifold structure underlying the data and to obtain a proper representation for the data objects.

There are two types of manifold-related learning tasks. In so-called manifold learning, one has to respect the local geometry existing in the data, but the manifold itself is unknown to the learner. The classic representative algorithms for manifold learning include LLE (Locally Linear Embedding) [22], ISOMAP [23], LLP (Locally Linear Projection) [24], LE (Laplacian Embedding) [25], LTSA (Local Tangent Space Alignment) [26] and TKE (Twin Kernel Embedding) [27].

In the other type of learning tasks, we clearly know the manifold the data come from. For example, in image analysis people usually use covariance matrices of features as a region descriptor [28]; such a descriptor is a point on the manifold of symmetric positive definite matrices. More generally, in computer vision it is common to collect data on a known manifold. For example, it is common practice to use a subspace to represent a set of images [29], and such a subspace is actually a point on the Grassmann manifold [30]; thus an image set is regarded as a point on a known Grassmann manifold. This type of task, which incorporates manifold properties into learning, is called learning on manifolds. There are three major strategies for dealing with learning tasks on manifolds:

1) Intrinsic Strategy: The ideal but hardest strategy is to perform the learning task intrinsically on the manifold, based on its intrinsic geometry. Very few existing approaches adopt this strategy.
2) Extrinsic Strategy: The second strategy is to implement a learning algorithm within the tangent spaces of the manifold, where all the linear relations can be exploited. This is a first-order approximation to the intrinsic strategy, and most approaches fall into this category.
3) Embedding Strategy: The third strategy is to embed the manifold into a "larger" Euclidean space by an appropriate mapping, such as a kernel mapping, and to implement the learning algorithm in this "flattened" embedding space. For a practical learning task, however, how to incorporate the properties of the known manifold into the kernel mapping design is still a challenging problem.

Fig. 1. An overview of our proposed LRR on Grassmann manifolds. Three steps are involved in the proposed model: (a) the points on the Grassmann manifold are mapped onto symmetric matrices; (b) the LRR model is formulated in the symmetric matrix space; (c) the coefficients of the LRR model are used by NCut for clustering.
In this paper, we are concerned with points on a particular known manifold, the Grassmann manifold. We explore the use of the LRR model for clustering a set of data points on the Grassmann manifold by adopting the aforementioned third strategy. In fact, the Grassmann manifold has the nice property that it can be embedded into the linear space of symmetric matrices [31], [32]. In this way, all the abstract points (subspaces) on the Grassmann manifold can be embedded into a Euclidean space where the classic LRR model can be applied. An LRR model can then be constructed in the embedding space, where the error measure is simply taken as the Euclidean metric of that space. The main idea of our method is illustrated in Fig. 1.

The contributions of this work are as follows:
• constructing an extended LRR model on the Grassmann manifold based on our prior work in [33];
• giving the solutions and practical algorithms for the extended Grassmann LRR model under different noise models, particularly those defined by the Frobenius norm and the ℓ2/ℓ1 norm;
• presenting a new kernelized LRR model on the Grassmann manifold.

The rest of the paper is organized as follows. In Section 2, we review some related works. In Section 3, the proposed LRR on Grassmann Manifold (GLRR) is described, and the solutions to the GLRR models under different noise assumptions are given in detail. In Section 4, we introduce a general framework for the LRR model on the Grassmann manifold from the kernelization point of view. In Section 5, the performance of the proposed methods is evaluated on several public databases. Finally, conclusions and suggestions for future work are provided in Section 6.

2 RELATED WORKS

In this section, we briefly review the existing sparse subspace clustering methods, including the classic Sparse Subspace Clustering (SSC) and the Low Rank Representation (LRR), and then summarize the properties of the Grassmann manifold that are related to the work presented in this paper.

2.1 Sparse Subspace Clustering (SSC)

Given a set of data drawn from a union of unknown subspaces, the task of subspace clustering is to find the number of subspaces, their dimensions and bases, and then to segment the data set according to the subspaces. In recent years, sparse representation has been applied to subspace clustering; the resulting Sparse Subspace Clustering (SSC) aims to find a sparse representation of the data set using ℓ1 regularization [2]. The general SSC can be formulated as follows:

$$\min_{E,Z} \|E\|_{\ell} + \lambda \|Z\|_1 \quad \text{s.t. } Y = DZ + E,\ \operatorname{diag}(Z) = 0, \qquad (1)$$

where Y ∈ R^{d×N} is a set of N signals of dimension d, Z is the corresponding sparse representation of Y under the dictionary D, and E represents the error between the signals and their reconstructions, measured by the norm ‖·‖_ℓ: ℓ = 2 (or ℓ = F, the Frobenius norm) for Gaussian noise, ℓ = 1 (the Laplacian norm) for random gross corruptions, or ℓ = ℓ2/ℓ1 for sample-specific corruptions. Finally, λ > 0 is a penalty parameter balancing the sparsity term and the reconstruction error.

In the above sparse model, it is critical to use an appropriate dictionary D to represent the signals. Generally, a dictionary can be learned from training data by one of many dictionary learning methods, such as K-SVD [34]. However, a dictionary learning procedure is usually time-consuming and so has to be done offline. Many researchers therefore adopt a simple and direct alternative and use the original signals themselves as the dictionary, exploiting the so-called self-expressiveness property [2]: each data point in a union of subspaces can be efficiently reconstructed by a linear combination of the other points in the data set. More specifically, every point in the data set can be represented as a sparse linear combination of other points from the same subspace. Mathematically, this sparse formulation is

$$\min_{E,Z} \|E\|_{\ell} + \lambda \|Z\|_1 \quad \text{s.t. } Y = YZ + E,\ \operatorname{diag}(Z) = 0. \qquad (2)$$

From the sparse representation matrix Z, an affinity matrix can be constructed; one commonly used form is (|Z| + |Z^T|)/2. This affinity matrix is interpreted as a graph, upon which a clustering algorithm such as NCut is applied for the final segmentation. This is the typical approach used in modern subspace clustering techniques.
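To make the final segmentation step concrete, the following is a minimal sketch (not taken from the paper) of turning a learned representation matrix Z into cluster labels: it symmetrizes Z into the affinity (|Z| + |Z^T|)/2 and feeds it to an off-the-shelf spectral clustering routine; scikit-learn's SpectralClustering is used here as a stand-in for NCut.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def labels_from_representation(Z, n_clusters):
    """Cluster data given a (sparse or low-rank) representation matrix Z."""
    W = (np.abs(Z) + np.abs(Z.T)) / 2.0      # symmetric affinity matrix
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return sc.fit_predict(W)                 # spectral clustering (NCut-style)
```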
2.2 Low-Rank Representation (LRR)

LRR can be regarded as a special type of sparse representation in which, rather than computing the sparse representation of each data point individually, the global structure of the data is computed collectively through a low rank representation of the whole set of data points. The low rank measure has long been utilized in matrix completion from corrupted or missing data [35], [36]. Specifically for clustering applications, it has been proved that, when a high-dimensional data set is actually composed of data from a union of several low-dimensional subspaces, the LRR model can reveal the subspace structure underlying the data samples [11]. It has also been shown that LRR copes well with the main challenges in subspace clustering, such as data corrupted by noise or outliers, the absence of prior knowledge of the subspace parameters, and the lack of theoretical guarantees for the optimality of clustering methods [11], [13], [37].

The general LRR model can be formulated as the following optimization problem:

$$\min_{E,Z} \|E\|^2_{\ell} + \lambda \|Z\|_* \quad \text{s.t. } Y = YZ + E, \qquad (3)$$

where Z is the low rank representation of the data set Y by itself. Here the low rank constraint is imposed by approximating the rank with the nuclear norm ‖·‖∗, defined as the sum of the singular values of a matrix, which is the convex envelope of the rank function [11].

Although the current LRR method performs well in subspace clustering, it relies on the Euclidean distance to measure the similarity of the raw data. This measurement is not suitable for high-dimensional data with an embedded low-dimensional manifold structure. To characterize the local geometry of data on an unknown manifold, the LapLRR methods [19], [38] use the graph Laplacian matrix derived from the data objects as a regularization term for the LRR model in order to represent the nonlinear structure of high-dimensional data; however, the reconstruction error of the revised model is still computed in Euclidean space.
2.3 Grassmann Manifold

In recent years, the Grassmann manifold has attracted great interest in the computer vision research community. Although the Grassmann manifold itself is quite abstract, it can be well represented as a matrix quotient manifold, and its Riemannian geometry has been investigated for algorithmic computation [30].

The Grassmann manifold G(p, d) [30] is the space of all p-dimensional linear subspaces of R^d for 0 ≤ p ≤ d. A point on the Grassmann manifold is a p-dimensional subspace of R^d, which can be represented by any orthonormal basis X = [x_1, x_2, ..., x_p] ∈ R^{d×p}. The chosen orthonormal basis is called a representative of the subspace S = span(X). The Grassmann manifold G(p, d) is in one-to-one correspondence with a quotient manifold of the Stiefel manifold on R^{d×p}; see [30].

The Grassmann manifold has the nice property that it can be embedded into the space of symmetric matrices via a projection embedding: G(p, d) can be embedded into the space Sym_+(d) of d × d symmetric positive semi-definite matrices by the following mapping, see [32],

$$\Pi: G(p,d) \to \mathrm{Sym}_+(d), \quad \Pi(X) = XX^T. \qquad (4)$$

The embedding Π is a diffeomorphism [30] (a one-to-one continuous and differentiable mapping with a continuous and differentiable inverse). It is therefore reasonable to replace the distance on the Grassmann manifold with the following distance defined on the symmetric matrix space under this mapping:

$$d_g(X_1, X_2) = \|\Pi(X_1) - \Pi(X_2)\|_F. \qquad (5)$$

This property has been used in subspace analysis, learning and representation [39]–[41]. Sparse coding and dictionary learning within the space of symmetric positive definite matrices have been investigated by means of kernel methods [32]. For clustering applications, the mean shift method was discussed on Stiefel and Grassmann manifolds in [42]. Recently, a new version of the K-means method, constructed by a statistical modeling approach, was proposed to cluster Grassmann points [43]. These works extend clustering methods from Euclidean space to more practical situations on nonlinear spaces. Following this direction, we further explore subspace clustering problems on the Grassmann manifold and establish a novel and feasible LRR model on it.
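As a small illustration of the embedding (4) and the distance (5), the following sketch (assuming NumPy; not part of the paper) computes Π(X) = XX^T for two orthonormal bases and the embedding distance d_g between them.

```python
import numpy as np

def projection_embedding(X):
    """Map an orthonormal basis X (d x p) to the symmetric matrix X X^T, eq. (4)."""
    return X @ X.T

def grassmann_embedding_distance(X1, X2):
    """Distance (5): Frobenius distance between the embedded points."""
    return np.linalg.norm(projection_embedding(X1) - projection_embedding(X2), "fro")
```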
3 LRR ON GRASSMANN MANIFOLDS

3.1 LRR on Grassmann Manifolds

In the current LRR model (3), the data reconstruction error is generally computed in the original data domain. For example, a common choice of the reconstruction error is the Frobenius norm, i.e., the error term can be written as

$$\|E\|^2_F = \|Y - YZ\|^2_F = \sum_{i=1}^{N} \Big\| y_i - \sum_{j=1}^{N} z_{ji}\, y_j \Big\|^2_F, \qquad (6)$$

where the data matrix is Y = [y_1, y_2, ..., y_N] ∈ R^{D×N}.

As mentioned above, many high-dimensional data have intrinsic manifold structure. To extend an LRR model to manifold-valued data, two issues have to be resolved: (i) the model error should be measured in terms of the manifold geometry, and (ii) the linear relationship has to be re-interpreted, because the linear relation Y = YZ + E in (3) is no longer valid on a manifold.

With the extrinsic strategy mentioned in Section 1, one gets around this difficulty by using the Log map of the manifold to lift points on the manifold onto the tangent space at a given point. This idea has been applied to clustering and dimensionality reduction on manifolds in [44], [45], and recently to LRR on Stiefel and SPD manifolds [46], [47].

In this paper, instead of using the Log map, we extend the LRR model to the Grassmann manifold by using the embedding strategy. Given a set of Grassmann points {X_1, X_2, ..., X_N} on the Grassmann manifold G(p, d), we mimic the classical LRR defined in (3) and (6) as follows:

$$\min_{Z} \sum_{i=1}^{N} \Big\| X_i \ominus \Big( \biguplus_{j=1}^{N} X_j \odot z_{ji} \Big) \Big\|_{\mathcal{G}} + \lambda \|Z\|_*, \qquad (7)$$

where ⊖, ⊎ and ⊙ are dummy operators to be specified shortly, and ‖X_i ⊖ (⊎_{j=1}^N X_j ⊙ z_{ji})‖_G measures the error between the point X_i and its "reconstruction" ⊎_{j=1}^N X_j ⊙ z_{ji}. Thus, to obtain an LRR model on the Grassmann manifold, we must define a proper distance and proper operators for the manifold.

Based on the property (4) of the Grassmann manifold, an easy way is to use the distance of the embedded space in place of the manifold distance in the LRR model:

$$\Big\| X_i \ominus \Big( \biguplus_{j=1}^{N} X_j \odot z_{ji} \Big) \Big\|_{\mathcal{G}} = d_g\Big( X_i,\ \biguplus_{j=1}^{N} X_j \odot z_{ji} \Big).$$

This error measure not only avoids the Log map operator but also leads to simple computations with the Frobenius norm. Additionally, the mapping (4) sends a Grassmann point to a point in the space Sym_+(d) of d × d symmetric positive semi-definite matrices, in which a linear combination operation is available if the coefficients are restricted to be positive. So it is intuitive to replace the Grassmann points with their mapped points to implement the combination in (7), i.e.,

$$\biguplus_{j=1}^{N} X_j \odot z_{ji} = \sum_{j=1}^{N} z_{ji}\, (X_j X_j^T), \quad \text{for } i = 1, 2, \ldots, N.$$

Furthermore, we stack all the symmetric matrices X_i X_i^T as the frontal slices of a 3rd-order tensor X, i.e., X(:,:,i) = X_i X_i^T; then all the N linear relations above can be written simply as X ×_3 Z, where ×_3 denotes the mode-3 multiplication of a tensor and a matrix, see [48]. Thus the self-representation in (3) becomes

$$X = X \times_3 Z + E,$$

where E is the error tensor. The representation is illustrated in Fig. 2.

Fig. 2. The GLRR model. Under the mapping of the points on the Grassmann manifold, the tensor X, with each slice a symmetric matrix, is represented by a linear combination of itself; the element z_ij of Z represents the similarity between slices i and j.

In the following subsections, we give two LRR models on the Grassmann manifold for two types of noise.
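To make the tensor notation concrete, here is a small sketch (assuming NumPy; the array layout and the column-weighting convention for the mode-3 product follow the reconstruction formula above and are an illustrative choice, not the paper's code) that stacks the embedded points X_i X_i^T into a d × d × N array and evaluates X ×_3 Z and the slice-wise residuals.

```python
import numpy as np

def embed_to_tensor(bases):
    """Stack X_i X_i^T (d x d) as frontal slices of a d x d x N tensor."""
    return np.stack([X @ X.T for X in bases], axis=2)

def mode3_product(T, Z):
    """Slice i of the result is sum_j z_{ji} * T[:, :, j], as in the combination above."""
    return np.einsum("abj,ji->abi", T, Z)

def slice_residuals(T, Z):
    """Frobenius norm of each slice of E = T - T x_3 Z."""
    E = T - mode3_product(T, Z)
    return np.linalg.norm(E.reshape(-1, E.shape[2]), axis=0)
```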
3.1.1 LRR on the Grassmann Manifold with Gaussian Noise (GLRR-F) [33]

For completeness, we include our prior work reported in the conference paper [33]. This LRR model on the Grassmann manifold, based on the error measurement defined in (5), is

$$\min_{E,Z} \|E\|^2_F + \lambda \|Z\|_* \quad \text{s.t. } X = X \times_3 Z + E. \qquad (8)$$

The Frobenius norm is adopted here under the assumption that the model fits Gaussian noise. We call this model the Frobenius norm constrained GLRR (GLRR-F). In this case, the error term in (8) is

$$\|E\|^2_F = \sum_{i=1}^{N} \|E(:,:,i)\|^2_F, \qquad (9)$$

where E(:,:,i) = X_i X_i^T − Σ_{j=1}^{N} z_{ji}(X_j X_j^T) is the i-th slice of E, i.e., the error between the symmetric matrix X_i X_i^T and its reconstruction by the linear combination Σ_{j=1}^{N} z_{ji}(X_j X_j^T).

3.1.2 LRR on the Grassmann Manifold with ℓ2/ℓ1 Noise (GLRR-21)

When there are outliers in the data set, the Gaussian noise model is no longer a favored choice. We therefore propose using the so-called ‖·‖_{ℓ2/ℓ1} noise model, which copes with sample-oriented gross errors in LRR clustering applications [11]. Similar to the GLRR-F model above, we formulate the ‖·‖_{ℓ2/ℓ1} norm constrained GLRR model (GLRR-21) as

$$\min_{E,Z} \|E\|_{\ell_2/\ell_1} + \lambda \|Z\|_* \quad \text{s.t. } X = X \times_3 Z + E, \qquad (10)$$

where the tensor norm ‖E‖_{ℓ2/ℓ1} is defined as the sum of the Frobenius norms of the 3-mode slices:

$$\|E\|_{\ell_2/\ell_1} = \sum_{i=1}^{N} \|E(:,:,i)\|_F. \qquad (11)$$

Note that (11), without squares, is different from (9).

3.2 Algorithms for LRR on the Grassmann Manifold

The GLRR models (8) and (10) present two typical optimization problems. In this subsection, we propose appropriate algorithms to solve them. The GLRR-F model was proposed in our earlier ACCV paper [33], where an algorithm based on ADMM was given. In this paper, we provide an even faster closed-form solution to (8), and we further investigate the tensor structure in these models to obtain a practical solution to (10).

Intuitively, the tensor calculation can be converted to matrix operations by tensor matricization, see [48]. For example, we can matricize the tensor X ∈ R^{d×d×N} in mode 3 and obtain a matrix X_(3) ∈ R^{N×(d·d)} of N data points (as rows), so it may seem that the problem can be solved by the standard LRR method. However, as the dimension d·d is often too large in practical problems, the existing LRR algorithm could break down. To avoid this, we carefully analyze the representation of the reconstruction error terms and convert the optimization problems into equivalent, readily solvable models. The following two subsections give the details of these solutions.

3.2.1 Algorithm for the Frobenius Norm Constrained GLRR Model

We follow the notation used in [33]. By variable elimination, problem (8) can be converted into

$$\min_{Z} \|X - X \times_3 Z\|^2_F + \lambda \|Z\|_*. \qquad (12)$$

We note that X_j^T X_i has a small dimension p × p, which is easy to handle. Denote

$$\Delta_{ij} = \operatorname{tr}\big[ (X_j^T X_i)(X_i^T X_j) \big], \qquad (13)$$

and the N × N symmetric matrix

$$\Delta = [\Delta_{ij}]. \qquad (14)$$

Then we have the following lemma.

Lemma 1. Given a set of matrices {X_1, X_2, ..., X_N} with X_i ∈ R^{d×p} and X_i^T X_i = I, let Δ = [Δ_{ij}] ∈ R^{N×N} with elements Δ_{ij} = tr[(X_j^T X_i)(X_i^T X_j)]. Then the matrix Δ is positive semi-definite.

Proof: Please refer to [33].

From Lemma 1, we have the eigenvector decomposition Δ = UDU^T, where U^T U = I and D = diag(σ_i) with nonnegative eigenvalues σ_i. Denote the square root of Δ by Δ^{1/2} = U D^{1/2} U^T; then it is not hard to prove that problem (12) is equivalent to

$$\min_{Z} \|Z\Delta^{\frac{1}{2}} - \Delta^{\frac{1}{2}}\|^2_F + \lambda \|Z\|_*. \qquad (15)$$

Finally we have

Theorem 2. Given Δ = UDU^T as defined above, the solution to (15) is

$$Z^* = U D_\lambda U^T,$$

where D_λ is a diagonal matrix whose i-th diagonal element is

$$D_\lambda(i,i) = \begin{cases} 1 - \frac{\lambda}{\sigma_i} & \text{if } \sigma_i > \lambda, \\ 0 & \text{otherwise.} \end{cases}$$

Proof: Please refer to the proof of Lemma 1 in [49].

According to Theorem 2, the main cost of solving the LRR on Grassmann manifold problem (8) is (i) the computation of the symmetric matrix Δ and (ii) an SVD of Δ. This is a significant improvement over the algorithm presented in [33].
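The closed-form solution of Theorem 2 is simple to implement. The following sketch (assuming NumPy; written for illustration, not taken from the authors' MATLAB code) builds Δ from the orthonormal bases, eigendecomposes it, and shrinks the eigenvalues as in Theorem 2.

```python
import numpy as np

def glrr_f(bases, lam):
    """Closed-form GLRR-F solution Z* = U D_lambda U^T (Theorem 2)."""
    N = len(bases)
    Delta = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            # Delta_ij = tr[(X_j^T X_i)(X_i^T X_j)] = ||X_i^T X_j||_F^2, eq. (13)
            Delta[i, j] = np.linalg.norm(bases[i].T @ bases[j], "fro") ** 2
    sigma, U = np.linalg.eigh(Delta)              # Delta is symmetric PSD (Lemma 1)
    d_lam = np.where(sigma > lam, 1.0 - lam / np.maximum(sigma, 1e-12), 0.0)
    return (U * d_lam) @ U.T                      # U diag(d_lam) U^T
```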
Denote IEEETRANSACTIONSONKNOWLEDGEANDDATAENGINEERING,VOL.XX,NO.X,SEPTEMBER2016 6 thesquarerootof∆by∆12 =UD12UT,thenitisnothard The above ADM is appealing only if we can find toprovethatproblem (12)isequivalenttothefollowing closed form solutions to the subproblems (17) and (18). problem Consider problem (17) first. Denote Ck =X −X × Zk 3 mZin(cid:107)Z∆12 −∆12(cid:107)2F +λ(cid:107)Z(cid:107)∗. (15) ai-nthd fforornatnsyli3ce-orAd(e:r,:,tei)nsaolronAgwtheeu3se-mAo(di)etaosdaensohtoertthene notation.Thenweobservethat(17)isseparableinterms Finally we have of matrix variable E(i) as follows: Theorem 2. Given that ∆ = UDUT as defined above, the solution to (15) is given by Ek+1(i)=argmin(cid:107)E(i)(cid:107) +(cid:104)ξk(i),Ck(i)−E(i)(cid:105) F Z∗ =UD UT, E(i) λ µk where Dλ is a diagonal matrix with its i-th element defined + 2 (cid:107)Ck(i)−E(i)(cid:107)2F by µk 1 Dλ(i,i)=(cid:40)1− σλi if σi >λ, =arEg(mi)in(cid:107)E(i)(cid:107)F + 2 (cid:107)Ck(i)−E(i)+ µkξk(i)(cid:107)2F. 0 otherwise. (20) From [11], we know that the problem in (20) has a Proof: Please refer to the proof of Lemma 1 in [49]. closed form solution, given by According to Theorem 2, the main cost for solving (cid:40) the LRR on Grassmann manifold problem (8) is (i) 0 if M < 1 ; computation of the symmetric matrix ∆ and (ii) a SVD Ek+1(i)= µk (1− 1 )(Ck(i)+ 1 ξk(i)) otherwise. for∆.Thisisasignificantimprovementtothealgorithm Mµk µk (21) presented in [33]. where M =(cid:107)Ck(i)+ 1 ξk(i)(cid:107) . µk F 3.2.2 Algorithm for the (cid:96) /(cid:96) Norm Constrained GLRR Now consider problem (18). Denote 2 1 Model Now we turn to the GLRR-12 problem (10). Because f(Z)=(cid:104)ξk,X−X× Z−Ek+1(cid:105)+µk(cid:107)X−X× Z−Ek+1(cid:107)2, the existence of (cid:96) /(cid:96) norm in error term, the objective 3 2 3 F 2 1 function is not differentiable but convex. We propose using the alternating direction method (ADM) method then problem (18) becomes to solve this problem. Firstly, we construct the following augmented La- Zk+1 =argminλ(cid:107)Z(cid:107) +f(Z). (22) grangian function: ∗ Z L(E,Z,ξ)=(cid:107)E(cid:107) +λ(cid:107)Z(cid:107) +(cid:104)ξ,X −X × Z−E(cid:105) (cid:96)2/(cid:96)1 ∗ 3 µ We adopt the linearization method to solve the above + (cid:107)X −X × Z−E(cid:107)2, (16) 2 3 F problem. where (cid:104)·,·(cid:105) is the standard inner product of two tensors Forthispurpose,wefirstlyutilizethematricesineach in the same order, ξ is the Lagrange multiplier, and µ is slicetocomputethetensoroperationinthedefinitionof f(Z). For the i-th slice of the first term in f(Z), we have the penalty parameter. Specifically, the iteration of ADM for minimizing (16) goes as follows: N (cid:88) (cid:104)ξk(i),X XT − z X XT −Ek+1(i)(cid:105) Ek+1 =argminL(E,Zk,ξk) i i ji j j j=1 E N =argmin(cid:107)E(cid:107) +(cid:104)ξk,X −X × Zk−E(cid:105) (cid:88) E (cid:96)2/(cid:96)1 3 =− zjitr(ξk(i)TXjXjT)+tr(ξk(i)T(XiXiT −Ek+1(i))). µk j=1 + (cid:107)X −X × Zk−E(cid:107)2, (17) 2 3 F Zk+1 =argminL(Ek+1,Z,ξk) Define a new matrix by Z =argminλ(cid:107)Z(cid:107)∗+(cid:104)ξk,X −X ×3Z−Ek+1(cid:105) Φk =(cid:2)tr(ξk(i)TX XT)(cid:3) , Z j j i,j µk + (cid:107)X −X × Z−Ek+1(cid:107)2, (18) 2 3 F then the first term in f(Z) has the following representa- ξk+1 = ξk+µk[X −X ×3Zk+1−Ek+1], (19) tion: where we have used an adaptive parameter µk. The adaptive rule will be specified later in Algorithm 1. (cid:104)ξk,X −X × Z−Ek+1(cid:105)=−tr(ΦkZT)+const. 
Now consider problem (18). Denote

$$f(Z) = \langle \xi^k,\ X - X \times_3 Z - E^{k+1} \rangle + \frac{\mu^k}{2} \|X - X \times_3 Z - E^{k+1}\|^2_F;$$

then problem (18) becomes

$$Z^{k+1} = \arg\min_{Z}\ \lambda \|Z\|_* + f(Z). \qquad (22)$$

We adopt a linearization method to solve this problem. For this purpose, we first use the slice matrices to compute the tensor operations in the definition of f(Z). For the i-th slice of the first term in f(Z), we have

$$\Big\langle \xi^k(i),\ X_i X_i^T - \sum_{j=1}^{N} z_{ji} X_j X_j^T - E^{k+1}(i) \Big\rangle = -\sum_{j=1}^{N} z_{ji}\, \operatorname{tr}\big(\xi^k(i)^T X_j X_j^T\big) + \operatorname{tr}\big(\xi^k(i)^T (X_i X_i^T - E^{k+1}(i))\big).$$

Defining a new matrix

$$\Phi^k = \big[ \operatorname{tr}(\xi^k(i)^T X_j X_j^T) \big]_{i,j},$$

the first term in f(Z) has the representation

$$\langle \xi^k,\ X - X \times_3 Z - E^{k+1} \rangle = -\operatorname{tr}(\Phi^k Z^T) + \mathrm{const}. \qquad (23)$$

For the i-th slice of the second term of f(Z), we have

$$\Big\| X_i X_i^T - \sum_{j=1}^{N} z_{ji} X_j X_j^T - E^{k+1}(i) \Big\|^2_F = \operatorname{tr}\big((X_i X_i^T)^T X_i X_i^T\big) + \operatorname{tr}\big(E^{k+1}(i)^T E^{k+1}(i)\big) + \sum_{j_1=1}^{N}\sum_{j_2=1}^{N} z_{j_1 i}\, z_{j_2 i}\, \operatorname{tr}\big((X_{j_1} X_{j_1}^T)^T (X_{j_2} X_{j_2}^T)\big) - 2\,\operatorname{tr}\big((X_i X_i^T)^T E^{k+1}(i)\big) - 2\sum_{j=1}^{N} z_{ji}\, \operatorname{tr}\big((X_j X_j^T)^T (X_i X_i^T - E^{k+1}(i))\big).$$

Denoting a matrix

$$\Psi^k = \big[ \operatorname{tr}(E^{k+1}(i)^T X_j X_j^T) \big]_{i,j}$$

and noting (14), we obtain

$$\|X - X \times_3 Z - E^{k+1}\|^2_F = \operatorname{tr}(Z \Delta Z^T) - 2\,\operatorname{tr}\big((\Delta - \Psi^k) Z\big) + \mathrm{const}. \qquad (24)$$

Combining (23) and (24), we have

$$f(Z) = \frac{\mu^k}{2} \operatorname{tr}(Z \Delta Z^T) - \mu^k\, \operatorname{tr}\Big( \Big(\Delta - \Psi^k + \tfrac{1}{\mu^k}\Phi^k \Big) Z \Big) + \mathrm{const},$$

and thus

$$\partial f(Z) = \mu^k Z \Delta - \mu^k \Big( \Delta - \Psi^k + \frac{1}{\mu^k}\Phi^k \Big)^T.$$

Finally, we replace (22) by the following linearized proximal approximation:

$$Z^{k+1} = \arg\min_{Z}\ \lambda \|Z\|_* + \langle \partial f(Z^k),\ Z - Z^k \rangle + \frac{\eta \mu^k}{2} \|Z - Z^k\|^2_F = \arg\min_{Z}\ \lambda \|Z\|_* + \frac{\eta \mu^k}{2} \Big\| Z - Z^k + \frac{\partial f(Z^k)}{\eta \mu^k} \Big\|^2_F, \qquad (25)$$

with a constant η > ‖X‖², where ‖X‖ is the matrix norm of the mode-3 matricization of the tensor X. The new problem (25) has a closed-form solution given by (see [50])

$$Z^{k+1} = U_z\, S_{\frac{\lambda}{\eta\mu^k}}(\Sigma_z)\, V_z^T, \qquad (26)$$

where U_z Σ_z V_z^T is the SVD of Z^k − ∂f(Z^k)/(ημ^k) and S_τ(·) is the Singular Value Thresholding (SVT) operator defined by

$$S_\tau(\Sigma) = \operatorname{diag}\big( \operatorname{sgn}(\Sigma_{ii}) \max(|\Sigma_{ii}| - \tau,\ 0) \big).$$

The procedure for solving the ℓ2/ℓ1 norm constrained GLRR problem (10) is summarized in Algorithm 1. For completeness, we quote the convergence analysis for Algorithm 1 from [51] without proof.

Theorem 3. If μ^k is non-decreasing and upper bounded and η > ‖X‖², then the sequence {(Z^k, E^k, ξ^k)} generated by Algorithm 1 converges to a KKT point of problem (10).

Algorithm 1 Low-Rank Representation on the Grassmann Manifold with ℓ2/ℓ1 Norm Constraint
Input: the Grassmann sample set {X_i}_{i=1}^N, X_i ∈ G(p,d), the cluster number k and the balancing parameter λ.
Output: the low-rank representation Z.
1: Initialize: Z^0 = 0, E^0 = ξ^0 = 0, ρ^0 = 1.9, η > ‖X‖², μ^0 = 0.01, μ_max = 10^10, ε_1 = 10^-4 and ε_2 = 10^-4.
2: Prepare Δ according to (13);
3: while not converged do
4:   Update E^{k+1} according to (21);
5:   Update Z^{k+1} according to (26);
6:   Update ξ^{k+1} according to (19);
7:   Update μ^{k+1} ← min{ρ^k μ^k, μ_max}, where ρ^k = ρ^0 if (μ^k/‖X‖) max{√η ‖Z^{k+1} − Z^k‖_F, ‖E^{k+1} − E^k‖_F} ≤ ε_2, and ρ^k = 1 otherwise;
8:   Check the convergence conditions: ‖X − X ×_3 Z^{k+1} − E^{k+1}‖/‖X‖ ≤ ε_1 and (μ^k/‖X‖) max{√η ‖Z^{k+1} − Z^k‖_F, ‖E^{k+1} − E^k‖_F} ≤ ε_2;
9: end while
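The Z-update (26) is a proximal step for the nuclear norm. The sketch below (assuming NumPy; illustrative only) implements the SVT operator and the resulting update, given the gradient ∂f(Z^k) computed as in the text.

```python
import numpy as np

def svt(A, tau):
    """Singular Value Thresholding: prox of tau * nuclear norm at A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def update_Z(Z_k, grad_f, lam, eta, mu):
    """Linearized proximal update (25)-(26)."""
    return svt(Z_k - grad_f / (eta * mu), lam / (eta * mu))
```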
4 KERNELIZED LRR ON GRASSMANN MANIFOLD

4.1 Kernels on Grassmann Manifold

In this section, we consider the kernelization of the GLRR-F model. In fact, the LRR model on the Grassmann manifold (8) can be regarded as a kernelized LRR with the kernel feature mapping Π defined in (4); it is not surprising that Δ is positive semi-definite, since it serves as a kernel matrix. It is therefore natural to generalize GLRR-F further based on kernel functions on the Grassmann manifold.

A number of Grassmann kernel functions have been proposed in recent years in the computer vision and machine learning communities, see [31], [41], [52], [53]. For simplicity, we focus on the following kernels.

1. Projection Kernel: This kernel is defined in [41]. For any two Grassmann points X_i and X_j, the kernel value is

$$k_p(X_i, X_j) = \|X_i^T X_j\|^2_F = \operatorname{tr}\big((X_i X_i^T)^T (X_j X_j^T)\big).$$

The feature mapping of this kernel is exactly the mapping defined in (4).

2. Canonical Correlation Kernel: Referring to [41], this kernel is based on the cosines of the so-called principal angles between two subspaces, defined as

$$\cos(\theta_m) = \max_{u_m \in \operatorname{span}(X_i)}\ \max_{v_m \in \operatorname{span}(X_j)} u_m^T v_m, \quad \text{s.t. } \|u_m\|_2 = \|v_m\|_2 = 1;\ u_m^T u_k = 0,\ k = 1, 2, \ldots, m-1;\ v_m^T v_l = 0,\ l = 1, 2, \ldots, m-1.$$

One can use the largest canonical correlation value (the cosine of the first principal angle) as the kernel value, as done in [54], i.e.,

$$k_{cc}(X_i, X_j) = \max_{x_i \in \operatorname{span}(X_i)}\ \max_{x_j \in \operatorname{span}(X_j)} \frac{x_i^T x_j}{\|x_i\|_2 \|x_j\|_2}.$$

The cosines of the principal angles of two subspaces can be calculated by the SVD, as discussed in [55] (see Theorem 2.1 there). Consider two subspaces span(X_i) and span(X_j) as two Grassmann points, where X_i and X_j are given bases. If we take the SVD

$$X_i^T X_j = U \Sigma V^T,$$

then the values on the diagonal of Σ are the cosines of all the principal angles. The kernel k_cc(X_i, X_j) uses only partial information about the two subspaces. To improve its performance in our LRR, in this paper we use the sum of all the diagonal values of Σ as the kernel value between X_i and X_j. We still call this revised version the canonical correlation kernel.
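Both kernels can be computed from a single SVD (or just the Frobenius norm) of X_i^T X_j. The following sketch (assuming NumPy; not from the paper) computes the projection kernel and the revised canonical correlation kernel for two orthonormal bases.

```python
import numpy as np

def projection_kernel(Xi, Xj):
    """k_p(X_i, X_j) = ||X_i^T X_j||_F^2."""
    return np.linalg.norm(Xi.T @ Xj, "fro") ** 2

def canonical_correlation_kernel(Xi, Xj):
    """Revised k_cc: sum of the cosines of all principal angles (singular values of X_i^T X_j)."""
    return np.linalg.svd(Xi.T @ Xj, compute_uv=False).sum()
```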
4.2 Kernelized LRR on Grassmann Manifold

Let k be any kernel function on the Grassmann manifold. According to kernel theory [56], there exists a feature mapping φ such that

$$\varphi: G(p, n) \to \mathcal{F},$$

where F is the feature space associated with the given kernel k. Given a set of points {X_1, X_2, ..., X_N} on the Grassmann manifold G(p, n), we define the following LRR model:

$$\min_{Z} \|\varphi(X) - \varphi(X) Z\|^2_F + \lambda \|Z\|_*. \qquad (27)$$

We call this model the Kernelized LRR on the Grassmann manifold, denoted KGLRR, and write KGLRR-cc and KGLRR-p for k = k_cc and k = k_p, respectively. For KGLRR-p, the model (27) reduces to the LRR model (12).

Denote by K the N × N kernel matrix over all the data points X_i. By a derivation similar to that in [33], we can prove that the model (27) is equivalent to

$$\min_{Z} -2\,\operatorname{tr}(KZ) + \operatorname{tr}(Z K Z^T) + \lambda \|Z\|_*,$$

which is in turn equivalent to

$$\min_{Z} \|Z K^{\frac{1}{2}} - K^{\frac{1}{2}}\|^2_F + \lambda \|Z\|_*, \qquad (28)$$

where K^{1/2} is the square root of the kernel matrix K. So the kernelized model KGLRR-p is similar to the GLRR-F model in Section 3.

It has been shown that using multiple kernel functions improves performance in many application scenarios [56], owing to the complementary virtues of different kernel functions on complex data. So in practice we can employ different kernel functions in the model (27), and we can even adopt a combined kernel function. For example, in our experiments we use a combination of the two kernel functions k_cc and k_p above:

$$k_{ccp}(X_i, X_j) = \alpha\, k_{cc}(X_i, X_j) + (1 - \alpha)\, k_p(X_i, X_j),$$

where 0 < α < 1 is a hand-assigned combination coefficient. We denote the kernelized LRR model with k = k_ccp by KGLRR-ccp.

4.3 Algorithm for KGLRR

It is straightforward to use Theorem 2 to solve (28). For convenience, we present the algorithm below. Take the eigenvector decomposition of the kernel matrix K,

$$K = U D U^T,$$

where D = diag(σ_1, σ_2, ..., σ_N) is the diagonal matrix of the eigenvalues. Then the solution to (28) is

$$Z^* = U D_\lambda U^T,$$

where D_λ is the diagonal matrix with elements

$$D_\lambda(i,i) = \begin{cases} 1 - \frac{\lambda}{\sigma_i} & \text{if } \sigma_i > \lambda, \\ 0 & \text{otherwise.} \end{cases}$$

This algorithm is valid for any kernel function on the Grassmann manifold.
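Putting the pieces together, the sketch below (assuming NumPy; the explicit symmetrization of K is a numerical safeguard of mine, not a step stated in the paper) forms the combined kernel matrix k_ccp from the orthonormal bases and applies the closed-form solution of Section 4.3.

```python
import numpy as np

def kglrr_ccp(bases, lam, alpha=0.5):
    """KGLRR with the combined kernel k_ccp = alpha*k_cc + (1-alpha)*k_p."""
    N = len(bases)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            s = np.linalg.svd(bases[i].T @ bases[j], compute_uv=False)
            K[i, j] = alpha * s.sum() + (1.0 - alpha) * (s ** 2).sum()
    K = (K + K.T) / 2.0                          # enforce symmetry numerically
    sigma, U = np.linalg.eigh(K)
    d_lam = np.where(sigma > lam, 1.0 - lam / np.maximum(sigma, 1e-12), 0.0)
    return (U * d_lam) @ U.T                     # Z* = U D_lambda U^T
```

The returned Z can then be passed to the spectral clustering step sketched in Section 2.1 to obtain cluster labels.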
5 EXPERIMENTS

To evaluate the performance of the proposed methods, GLRR-21, GLRR-F/KGLRR-p and KGLRR-ccp, we select various public datasets of different types for the clustering experiments. These datasets are challenging for clustering applications. We divide them into four types:

• face or expression image sets, including the Extended Yale B face dataset (http://vision.ucsd.edu/content/yale-face-database) and the BU-3DEF expression dataset (http://www.cs.binghamton.edu/~lijun/Research/3DFE/3DFE Analysis.html);
• large scale object image sets, including the Caltech 101 dataset (http://www.vision.caltech.edu/feifeili/Datasets.htm) and the ImageNet 2012 dataset (http://www.image-net.org/download-images);
• human action datasets, including the Ballet dataset (https://www.cs.sfu.ca/research/groups/VML/semilatent/) and the SKIG dataset (http://lshao.staff.shef.ac.uk/data/SheffieldKinectGesture.htm);
• traffic scene video clip sets, including the Highway Traffic dataset (http://www.svcl.ucsd.edu/projects/traffic/) and a traffic road dataset we collected.

The proposed methods are compared with the benchmark spectral clustering methods Sparse Subspace Clustering (SSC) [2] and Low-Rank Representation (LRR) [11], and with several state-of-the-art clustering methods designed for manifolds, including Statistical Computations on Grassmann and Stiefel Manifolds (SCGSM) [43], Sparse Manifold Clustering and Embedding (SMCE) [57] and Latent Space Sparse Subspace Clustering (LS3C) [58]. In the sequel, we first describe the experiment setting and then report and analyze the clustering results on these datasets.

5.1 Experiment Setting

Our GLRR model is designed to cluster Grassmann points, which are subspaces rather than raw object/signal vectors. Thus, before running the main components of GLRR and the spectral clustering algorithm (we use NCut), we must form subspaces from the raw signals. A subspace can be represented by an orthonormal basis, so we use the samples drawn from the same subspace to construct such a basis. Similar to the work in [41], [59], we simply adopt the SVD to construct the subspace bases. Concretely, given a set of images {Y_i}_{i=1}^{P}, each of size m × n pixels, we construct the matrix Γ = [vec(Y_1), vec(Y_2), ..., vec(Y_P)] of size (m·n) × P by vectorizing all the images. Then Γ is decomposed by the SVD as Γ = UΣV. We pick the first p (≤ P) singular vectors X = [u_1, u_2, ..., u_p] of U to represent the entire image set as a point X on the Grassmann manifold G(p, m·n).

The setting of the model parameters affects the performance of our proposed methods. λ is the most important penalty parameter, balancing the error term and the low-rank term. Empirically, the best value of λ varies widely across applications and has to be chosen from a large range of values to obtain good performance in a particular application. From our experiments, we observed that as the cluster number increases, the best λ decreases; additionally, λ should be smaller when the noise level in the data is lower and larger when the noise level is higher. These observations are useful for selecting a proper λ for different datasets. The error tolerance ε is also an important parameter controlling the termination condition, as it bounds the allowed reconstruction error. We experimentally seek a proper value of ε so that the iterations stop at an appropriate level of reconstruction error.

Both SSC and LRR require inputs in vector form, so points given as subspaces on the Grassmann manifold cannot be used directly. To compare our method with SSC and LRR, we have to vectorize each image set to construct their inputs, i.e., we "vectorize" a set of images into a long vector by stacking all the vectors of the raw data in the image set in a particular order, e.g., frame order. However, in most of the experiments we cannot simply use these long vectors because of their high dimensionality for larger image sets. In this case, we apply PCA to reduce the vectors to a low dimension equal either to the subspace dimension of the Grassmann points or to the number of PCA components retaining 95% of the variance energy. The PCA-projected vectors are then used as inputs for SSC and LRR.

In our experiments, the performance of the different algorithms is measured by the clustering accuracy

$$\text{Accuracy} = \frac{\text{number of correctly classified points}}{\text{total number of points}} \times 100\%.$$

To present our experiments clearly, we denote by C the number of clusters, N the total number of image sets, P the number of images in each image set, and p the subspace dimension of a Grassmann point. All the algorithms are coded in Matlab 2014a and run on an Intel Core i7-4770K 3.5GHz CPU machine with 32G RAM.
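For reference, the image-set-to-Grassmann-point construction described at the beginning of this subsection can be sketched as follows (assuming NumPy; the function name is mine):

```python
import numpy as np

def image_set_to_grassmann_point(images, p):
    """Represent a list of (m x n) images as a point on G(p, m*n) via the thin SVD."""
    Gamma = np.stack([img.reshape(-1) for img in images], axis=1)   # (m*n) x P
    U, _, _ = np.linalg.svd(Gamma, full_matrices=False)
    return U[:, :p]                                                 # first p left singular vectors
```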
When the cluster number of the data is increas- ing, the performance of many state-of-the-art clustering methods drops dramatically. In this set of experiment, our intention is to test our methods on two large scale object image sets, the Caltech 101 dataset and the Ima- Fig.3. SomesamplesfromExtendedYaleBdataset(the genet 2012 dataset. firsttworows)andBU-3DEFdataset(thelasttworows). 1) Caltech 101 dataset Caltech 101 dataset contains pictures of 101 categories objects and each category has about 40 to 800 images. The experiment results are shown in Table 1. It shows The objects are generally centered in images. All images that most experimental results with P =8 are obviously aregrayedandrescaledtoasizeof20×20.Fig.4shows better than that with P = 4. In fact, the larger P is, some samples of Caltech 101 dataset. the better the performance. When more images in the In each category, we randomly select P = 4 images set, the impact of outlier images such as darkness faces to construct an image set. The image sets are then con- or special expressions will be decreased. However a vertedtoGrassmannpointsX ∈G(4,400),i.e.p=4.For larger P may also increase more variances to be fitted bothSSCandLRR,thesubspacevectorswithdimension by the subspace. Compared with other manifold based 20×20×4=1600 are reduced to 249 by PCA. methods, SCGSM, SMCE and LS3C, the excellent per- To evaluate the robustness of our proposed methods formance of our methods is due to the incorporation of with large cluster numbers, we test the cases of C = low rank constraint over the similarity matrix Z. Finally 10,20,30,40and50.Table3showstheexperimentresults we also note that the performance of LRR and SSC is for different methods. As can be seen from this table, greatly worse than all manifold based methods, which in all cases, our methods outperform other state-of-the- demonstratesincorporatingmanifoldpropertiescanalso art methods at least 10 percent and are stable with the improve the performance in clustering. increase of cluster numbers. It reveals that when the 2) BU-3DEF dataset number of clusters is higher, GLRR-F is slightly better BU-3DEF dataset collects 3D face models and face thanKGLRR-ccp.Itisworthnotingthatourmethodsare images from 100 subjects of different genders, races and alsorobusttocomplexbackgroundscontainedinCaltech ages. Each subject has 25 face images, one neutral im- 101 images. age and six expressions (happiness, disgust, fear, angry, 2) ImageNet 2012 dataset surprise and sadness), each of which is at four levels ImageNet2012datasetisawildlyusedobjectdatabase of intensity. In our experiment, we use these expression for image retrieve, which contains more than 1.2 million imagesforclustering.Theyarenormalizedandcentered object images from over 1000 categories. This database in a fixed size of 24×20. Fig. 3 shows some samples of is more difficult for clustering due to the large scale BU-3DEF dataset. of categories. In addition, most objects are small and Foreachexpression,werandomlyselectP =6images un-centered in images, even many objects have severe to construct an image set and totally obtain 100 image occlusions.Weextracttheregionofinterestfromimages sets. Then, a Grassmann point X ∈ G(4,400) is created by the bounding boxes defined in image-net.org and for each image set by using SVD. There are C = 6 resize the region to 20×20. Some samples of ImageNet classes of expressions for clustering. For SSC and LRR, are shown in Fig. 4. 
5.3 Clustering on Large Scale Object Image Sets

Large scale datasets are challenging for clustering methods: when the number of clusters increases, the performance of many state-of-the-art clustering methods drops dramatically. In this set of experiments, our intention is to test our methods on two large scale object image sets, the Caltech 101 dataset and the ImageNet 2012 dataset.

1) Caltech 101 dataset: The Caltech 101 dataset contains pictures of objects from 101 categories, and each category has about 40 to 800 images. The objects are generally centered in the images. All images are converted to gray scale and rescaled to a size of 20×20. Fig. 4 shows some samples of the Caltech 101 dataset.

In each category, we randomly select P = 4 images to construct an image set. The image sets are then converted to Grassmann points X ∈ G(4, 400), i.e., p = 4. For both SSC and LRR, the vectorized image sets of dimension 20×20×4 = 1600 are reduced to 249 by PCA.

To evaluate the robustness of our proposed methods with large cluster numbers, we test the cases C = 10, 20, 30, 40 and 50. Table 3 shows the experiment results for the different methods. As can be seen from the table, in all cases our methods outperform the other state-of-the-art methods by at least 10 percentage points and remain stable as the number of clusters increases. It also reveals that when the number of clusters is higher, GLRR-F is slightly better than KGLRR-ccp. It is worth noting that our methods are also robust to the complex backgrounds contained in the Caltech 101 images.

2) ImageNet 2012 dataset: The ImageNet 2012 dataset is a widely used object database for image retrieval, containing more than 1.2 million object images from over 1000 categories. This database is more difficult for clustering due to the large number of categories. In addition, most objects are small and not centered in the images, and many objects have severe occlusions. We extract the region of interest from the images using the bounding boxes defined by image-net.org and resize the region to 20×20. Some samples of ImageNet are shown in Fig. 4.

Many categories in this dataset are quite similar to each other, and the orientation and scale of the objects change greatly across images, so it is hard for an algorithm to achieve high accuracy for classification or clustering. To test the performance of our methods in the presence of these variations, we set a mild value of C = 10 in this experiment, i.e., we randomly select 10 classes of objects. In each class, Grassmann points on G(2, 400) are constructed by SVD for the image sets, each of which