Correlation Adaptive Subspace Segmentation by Trace Lasso

Canyi Lu¹, Jiashi Feng¹, Zhouchen Lin²·*, Shuicheng Yan¹
¹ Department of Electrical and Computer Engineering, National University of Singapore
² Key Laboratory of Machine Perception (MOE), School of EECS, Peking University
[email protected], [email protected], [email protected], [email protected]
* Corresponding author.

arXiv:1501.04276v1 [cs.CV] 18 Jan 2015

Abstract

This paper studies the subspace segmentation problem. Given a set of data points drawn from a union of subspaces, the goal is to partition them into the underlying subspaces they were drawn from. The spectral clustering method is used as the framework. It requires finding an affinity matrix which is close to block diagonal, with nonzero entries corresponding to data point pairs from the same subspace. In this work, we argue that both sparsity and the grouping effect are important for subspace segmentation. A sparse affinity matrix tends to be block diagonal, with fewer connections between data points from different subspaces. The grouping effect ensures that the highly correlated data, which are usually from the same subspace, can be grouped together. Sparse Subspace Clustering (SSC), by using $\ell_1$-minimization, encourages sparsity for data selection, but it lacks the grouping effect. On the contrary, Low-Rank Representation (LRR), by rank minimization, and Least Squares Regression (LSR), by $\ell_2$-regularization, exhibit a strong grouping effect, but they fall short in subset selection. Thus the obtained affinity matrix is usually very sparse by SSC, yet very dense by LRR and LSR.

In this work, we propose the Correlation Adaptive Subspace Segmentation (CASS) method by using trace Lasso. CASS is a data-correlation-dependent method which simultaneously performs automatic data selection and groups correlated data together. It can be regarded as a method which adaptively balances SSC and LSR. Both theoretical and experimental results show the effectiveness of CASS.

Figure 1. Example on a subset with 10 subjects of the Extended Yale B database. For a given data point $y$ and a data set $X$, $y$ can be approximately expressed as a linear representation of all the columns of $X$ by different methods. This figure shows the absolute values of the representation coefficients (normalized to $[0,1]$ for ease of display) derived by SSC, LRR, LSR and the proposed CASS. Here different columns in each subfigure indicate different subjects. The red coefficients correspond to the face images from the same subject as $y$. One can see that the coefficients derived by SSC are very sparse, and only a limited number of samples within the cluster are selected to represent $y$. Both LRR and LSR lead to dense representations: they group data not only within clusters but also between clusters. For CASS, most of the large coefficients concentrate on the data points within the cluster. Thus it approximately reveals the true segmentation of the data. Images in this paper are best viewed on screen.

1. Introduction

This paper focuses on subspace segmentation, the goal of which is to segment a given data set into clusters, ideally with each cluster corresponding to a subspace. Subspace segmentation is an important problem in both the computer vision and machine learning literature. It has numerous applications, such as motion segmentation [19], face clustering [12], and image segmentation [9], owing to the fact that real-world data often approximately lie in a mixture of subspaces. The problem is formally defined as follows [13]:

Definition 1 (Subspace Segmentation). Given a set of sufficiently sampled data vectors $X = [x_1, \cdots, x_n] \in \mathbb{R}^{d \times n}$, where $d$ is the feature dimension and $n$ is the number of data vectors, assume that the data are drawn from a union of $k$ subspaces $\{S_i\}_{i=1}^k$ of unknown dimensions $\{r_i\}_{i=1}^k$, respectively. The task is to segment the data according to the underlying subspaces they are drawn from.
1.1. Summary of notations

Some notations are used in this work. We use capital and lowercase symbols to represent matrices and vectors, respectively. In particular, $1_d \in \mathbb{R}^d$ denotes the vector of all ones, $e_i$ is the vector whose $i$-th entry is 1 and whose other entries are 0, and $I$ denotes the identity matrix. $\mathrm{Diag}(v)$ converts the vector $v$ into a diagonal matrix whose $i$-th diagonal entry is $v_i$. $\mathrm{diag}(A)$ is the vector whose $i$-th entry is $A_{ii}$ for a square matrix $A$. $\mathrm{tr}(A)$ is the trace of a square matrix $A$. $A_i$ denotes the $i$-th column of a matrix $A$. $\mathrm{sign}(x)$ is the sign function, defined as $\mathrm{sign}(x) = x/|x|$ if $x \neq 0$ and 0 otherwise. $v \rightarrow v_0$ denotes that $v$ converges to $v_0$.

Some vector and matrix norms will be used. $\|v\|_0$, $\|v\|_1$, $\|v\|_2$ and $\|v\|_\infty$ denote the $\ell_0$-norm (number of nonzero entries), $\ell_1$-norm (sum of the absolute values of the entries), $\ell_2$-norm and $\ell_\infty$-norm of a vector $v$. $\|A\|_1$, $\|A\|_F$, $\|A\|_{2,1}$, $\|A\|_\infty$ and $\|A\|_*$ denote the $\ell_1$-norm ($\sum_{i,j}|A_{ij}|$), Frobenius norm, $\ell_{2,1}$-norm ($\sum_j \|A_j\|_2$), $\ell_\infty$-norm ($\max_{i,j}|A_{ij}|$), and nuclear norm (the sum of all the singular values) of a matrix $A$, respectively.

1.2. Related work

There has been a large body of research on subspace segmentation [23, 3, 13, 24, 17, 5, 8]. Most recently, the Sparse Subspace Clustering (SSC) [3, 4], Low-Rank Representation (LRR) [13, 12, 2], and Least Squares Regression (LSR) [16] techniques have been proposed for subspace segmentation and have attracted much attention. These methods learn an affinity matrix whose entries measure the similarities among the data points and then perform spectral clustering on the affinity matrix to segment the data. Ideally, the affinity matrix should be block diagonal (or block sparse in vector form), with nonzero entries corresponding to data point pairs from the same subspace. A typical choice for the measure of similarity between $x_i$ and $x_j$ is $W_{ij} = \exp(-\|x_i - x_j\|/\sigma)$, where $\sigma > 0$. However, such a method is unable to utilize the underlying linear subspace structure of the data: the constructed affinity matrix is usually not block diagonal even under certain strong assumptions, e.g. independent subspaces^1. For a new point $y \in \mathbb{R}^d$ in the subspaces, SSC pursues a sparse representation:

\min_{w} \|w\|_1 \quad \text{s.t.} \quad y = Xw.    (1)

Problem (1) can be extended to handle data with noise, which leads to the popular Lasso [22] formulation:

\min_{w} \|y - Xw\|_2^2 + \lambda \|w\|_1,    (2)

where $\lambda > 0$ is a parameter. SSC solves problem (1) or (2) for each data point $y$ in the dataset, with all the other data points as the dictionary. Then it uses the derived representation coefficients to measure the similarities between data points and constructs the affinity matrix. It is shown that, if the subspaces are independent, the sparse representation is block sparse. However, if the data from the same subspace are highly correlated or clustered, the $\ell_1$-minimization will generally select a single representative at random and ignore the other correlated data. This leads to a sparse solution but misses the data correlation information. Thus SSC may produce a sparse affinity matrix yet lead to unsatisfactory performance.

^1 A collection of $k$ linear subspaces $\{S_i\}_{i=1}^k$ is independent if and only if $S_i \cap \sum_{j \neq i} S_j = \{0\}$ for all $i$ (or $\sum_{i=1}^k S_i = \oplus_{i=1}^k S_i$).

Low-Rank Representation (LRR) is a method which aims to group correlated data together. It solves the following convex optimization problem:

\min_{W} \|W\|_* \quad \text{s.t.} \quad X = XW.    (3)

The above problem can be extended to the noisy case:

\min_{W,E} \|W\|_* + \lambda \|E\|_{2,1} \quad \text{s.t.} \quad X = XW + E,    (4)

where $\lambda > 0$ is a parameter. Although LRR is guaranteed to produce a block diagonal solution when the data are noise free and drawn from independent subspaces, real data are usually contaminated with noise or outliers, so the solution to problem (4) is usually very dense and far from block diagonal. The reason is that nuclear norm minimization lacks the ability to perform subset selection. Thus LRR generally groups correlated data together, but sparsity cannot be achieved.

In the context of statistics, ridge regression ($\ell_2$-regularization) [10] has a similar behavior to LRR. The most recent work in this line uses Least Squares Regression (LSR) [16] for subspace segmentation:

\min_{W} \|X - XW\|_F^2 + \lambda \|W\|_F^2.    (5)
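As a concrete point of reference for the discussion that follows, problem (5) admits the closed-form solution $W = (X^T X + \lambda I)^{-1} X^T X$, which is easy to verify by setting the gradient to zero. The minimal numpy sketch below (the toy data, the value of $\lambda$, and the function name are illustrative assumptions, not code from the paper) computes this solution and the affinity matrix it induces:

```python
import numpy as np

def lsr_affinity(X, lam=0.1):
    """Closed-form LSR solution of problem (5) and the induced affinity matrix.

    X   : d x n data matrix, one sample per column.
    lam : regularization parameter lambda > 0 (illustrative value).
    """
    n = X.shape[1]
    G = X.T @ X                                   # n x n Gram matrix X^T X
    W = np.linalg.solve(G + lam * np.eye(n), G)   # W = (X^T X + lam*I)^{-1} X^T X
    return (np.abs(W) + np.abs(W.T)) / 2          # symmetrized affinity

# toy usage: 40 random points in R^10 with unit-norm columns
X = np.random.randn(10, 40)
X /= np.linalg.norm(X, axis=0, keepdims=True)
A = lsr_affinity(X)
print(A.shape)  # (40, 40)
```

The dense coefficient matrix produced here is exactly the behavior discussed next: strong grouping, but no subset selection.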
Both LRR and LSR encourage the grouping effect but lack sparsity. In fact, for subspace segmentation, both sparsity and the grouping effect are very important. Ideally, the affinity matrix should be sparse, with no connections between clusters. On the other hand, the affinity matrix should not be too sparse, i.e., the nonzero connections within a cluster should be sufficient for grouping the correlated data in the same subspace. Thus, it is desirable that the model can automatically group the correlated data within a cluster (like LRR and LSR) and eliminate the connections between clusters (like SSC). Trace Lasso [7], defined as $\|X \mathrm{Diag}(w)\|_*$, is such a newly established regularizer, which interpolates between the $\ell_1$-norm and the $\ell_2$-norm of $w$. It is adaptive and depends on the correlation among the samples in $X$, which can be encoded by $X^T X$. In particular, when the data are highly correlated ($X^T X$ is close to $11^T$), it is close to the $\ell_2$-norm, while when the data are almost uncorrelated ($X^T X$ is close to $I$), it behaves like the $\ell_1$-norm. We take advantage of the adaptivity of trace Lasso to regularize the representation coefficient matrix, and define an affinity matrix on which spectral clustering is performed via the normalized Laplacian. Such a model is called Correlation Adaptive Subspace Segmentation (CASS) in this work. CASS can be regarded as a method which adaptively interpolates between SSC and LSR. An intuitive comparison of the coefficient matrices derived by these four methods can be found in Figure 1. For CASS, we can see that most large representation coefficients cluster on the data points from the same subspace as $y$. In comparison, the connections within a cluster are very sparse for SSC, and the connections between clusters are very dense for LRR and LSR.

1.3. Contributions

We summarize the contributions of this paper as follows:

• We propose a new subspace segmentation method, called Correlation Adaptive Subspace Segmentation (CASS), by using trace Lasso [7]. CASS is the first method that takes the data correlation into account for subspace segmentation, so it is self-adaptive for different types of data.

• In theory, we show that if the data are drawn from independent subspaces and the objective function satisfies the proposed Enforced Block Sparse (EBS) conditions, then the obtained solution is block sparse. Trace Lasso is a special case which satisfies the EBS conditions.

• We theoretically prove that trace Lasso has the grouping effect, i.e., the coefficients of a group of correlated data are approximately equal.

2. Correlation Adaptive Subspace Segmentation by Trace Lasso

Trace Lasso [7] is a recently proposed norm which balances the $\ell_1$-norm and the $\ell_2$-norm. It is formally defined as

\Omega(w) = \|X \mathrm{Diag}(w)\|_*.

A main difference between trace Lasso and existing norms is that trace Lasso involves the data matrix $X$, which makes it adaptive to the correlation of the data. Actually, it only depends on the matrix $X^T X$, which encodes the correlation information among the data. In particular, if the norm of each column of $X$ is normalized to one, we have the following decomposition of $X \mathrm{Diag}(w)$:

X \mathrm{Diag}(w) = \sum_{i=1}^{n} |w_i| \, (\mathrm{sign}(w_i) x_i) \, e_i^T.

If the data are uncorrelated (the data points are orthogonal, $X^T X = I$), the above equation gives the singular value decomposition of $X \mathrm{Diag}(w)$. In this case, trace Lasso is equal to the $\ell_1$-norm:

\|X \mathrm{Diag}(w)\|_* = \|\mathrm{Diag}(w)\|_* = \sum_{i=1}^{n} |w_i| = \|w\|_1.

If the data are highly correlated (the data points are all the same, $X = x_1 1^T$, $X^T X = 11^T$), trace Lasso is equal to the $\ell_2$-norm:

\|X \mathrm{Diag}(w)\|_* = \|x_1 w^T\|_* = \|x_1\|_2 \|w\|_2 = \|w\|_2.

For the other cases, trace Lasso interpolates between the $\ell_2$-norm and the $\ell_1$-norm [7]:

\|w\|_2 \leq \|X \mathrm{Diag}(w)\|_* \leq \|w\|_1.

We use trace Lasso for adaptive subset selection from all the data, which leads to the Correlation Adaptive Subspace Segmentation (CASS) method. We first consider the subspace segmentation problem with clean data by CASS and then extend it to the noisy case.
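The two limiting cases and the interpolation bound above are easy to check numerically. A small numpy sketch (the toy matrices are chosen only so that $X^T X = I$ or $X^T X = 11^T$, and are purely illustrative):

```python
import numpy as np

def trace_lasso(X, w):
    """Omega(w) = ||X Diag(w)||_* (nuclear norm), for unit-norm columns of X."""
    return np.linalg.norm(X @ np.diag(w), ord='nuc')

rng = np.random.default_rng(0)
w = rng.standard_normal(5)

# Uncorrelated columns (X^T X = I): trace Lasso equals the l1-norm.
X_ortho = np.linalg.qr(rng.standard_normal((20, 5)))[0]
print(trace_lasso(X_ortho, w), np.linalg.norm(w, 1))

# Fully correlated columns (X = x1 1^T): trace Lasso equals the l2-norm.
x1 = rng.standard_normal(20)
x1 /= np.linalg.norm(x1)
X_corr = np.tile(x1[:, None], (1, 5))
print(trace_lasso(X_corr, w), np.linalg.norm(w, 2))

# In general: ||w||_2 <= ||X Diag(w)||_* <= ||w||_1.
X = rng.standard_normal((20, 5))
X /= np.linalg.norm(X, axis=0, keepdims=True)
val = trace_lasso(X, w)
print(np.linalg.norm(w, 2) - 1e-9 <= val <= np.linalg.norm(w, 1) + 1e-9)  # True
```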
2.1. CASS with clean data

Let $X = [x_1, \cdots, x_n] = [X_1, \cdots, X_k]\Gamma$ be a set of data drawn from $k$ subspaces $\{S_i\}_{i=1}^k$, where $X_i$ denotes a collection of $n_i$ data points from the $i$-th subspace $S_i$, $n = \sum_{i=1}^k n_i$, and $\Gamma$ is a hypothesized permutation matrix which rearranges the data to the true segmentation of the data. A given data point $y \in S_i$ can be represented as a linear combination of all the data points $X$. Different from the previous methods SSC, LRR and LSR, CASS uses trace Lasso as the objective function and solves the following problem:

\min_{w \in \mathbb{R}^n} \|X \mathrm{Diag}(w)\|_* \quad \text{s.t.} \quad y = Xw.    (6)
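For prototyping, problem (6) can be handed to a generic convex solver. Below is a sketch using cvxpy, which is an assumption of this illustration and not the optimization scheme of the paper (that is the ADM method of Section 2.4); the toy data and the function name are hypothetical:

```python
import cvxpy as cp
import numpy as np

def cass_clean(X, y):
    """Problem (6): min_w ||X Diag(w)||_*  s.t.  y = X w."""
    n = X.shape[1]
    w = cp.Variable(n)
    objective = cp.Minimize(cp.normNuc(X @ cp.diag(w)))
    constraints = [X @ w == y]
    cp.Problem(objective, constraints).solve()
    return w.value

# toy usage: two independent 10-dimensional subspaces in R^30
rng = np.random.default_rng(0)
X = np.hstack([rng.standard_normal((30, 10)),   # columns spanning subspace 1
               rng.standard_normal((30, 10))])  # columns spanning subspace 2
X /= np.linalg.norm(X, axis=0, keepdims=True)
y = X[:, :10] @ rng.standard_normal(10)          # y lies in subspace 1
w = cass_clean(X, y)
# the coefficient mass should concentrate on the first block (cf. Theorem 2 below)
print(np.abs(w[:10]).sum(), np.abs(w[10:]).sum())
```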
The analyses of SSC, LRR and LSR show that if the data are sufficiently sampled from independent subspaces, a block diagonal solution can be achieved. The work [16] further shows that it is easy to get a block diagonal solution if the objective function satisfies the Enforced Block Diagonal (EBD) conditions. But the EBD conditions cannot be applied to trace Lasso directly, since trace Lasso is a function of both the data $X$ and $w$. Here we extend the EBD conditions [16] to the Enforced Block Sparse (EBS) conditions and show that the obtained solution is block sparse when the objective function satisfies the EBS conditions. Trace Lasso is a special case which satisfies the EBS conditions and thus leads to a block sparse solution.

Enforced Block Sparse (EBS) Conditions. Assume $f$ is a function with regard to a matrix $X \in \mathbb{R}^{d \times n}$ and a vector $w = [w_a; w_b; w_c] \in \mathbb{R}^n$, $w_b \neq 0$. Let $w^B = [0; w_b; 0] \in \mathbb{R}^n$. The EBS conditions are:

(1) $f(X, w) = f(XP, P^{-1}w)$ for any permutation matrix $P \in \mathbb{R}^{n \times n}$;

(2) $f(X, w) \geq f(X, w^B)$, and the equality holds if and only if $w = w^B$.

For some cases, the EBS conditions can be regarded as extensions of the EBD conditions^2. The EBS conditions enforce the solution to the following problem

\min_{w} f(X, w) \quad \text{s.t.} \quad y = Xw    (7)

to be block sparse when the subspaces are independent.

^2 For example, $f(X, w) = \|w\|_p + 0 \times \|X\|_F = \|w\|_p = g(w)$, where $p \geq 0$. It is easy to see that $f(X, w)$ satisfies the EBS conditions and $g(w)$ satisfies the EBD conditions.

Theorem 1. Let $X = [x_1, \cdots, x_n] = [X_1, \cdots, X_k]\Gamma \in \mathbb{R}^{d \times n}$ be a data matrix whose column vectors are sufficiently^3 drawn from a union of $k$ independent subspaces $\{S_i\}_{i=1}^k$, with $x_j \neq 0$, $j = 1, \cdots, n$. For each $i$, $X_i \in \mathbb{R}^{d \times n_i}$ and $n = \sum_{i=1}^k n_i$. Let $y \in \mathbb{R}^d$ be a new point in $S_i$. Then the solution to problem (7), $w^* = \Gamma^{-1}[z_1^*; \cdots; z_k^*] \in \mathbb{R}^n$, is block sparse, i.e., $z_i^* \neq 0$ and $z_j^* = 0$ for all $j \neq i$.

^3 That the data sampling is sufficient makes sure that problem (7) has a feasible solution.

Proof. For $y \in S_i$, let $w^* = \Gamma^{-1}[z_1^*; \cdots; z_k^*]$ be the optimal solution to problem (7), where $z_i^* \in \mathbb{R}^{n_i}$ corresponds to $X_i$ for each $i = 1, \cdots, k$. We decompose $w^*$ into two parts, $w^* = u^* + v^*$, where $u^* = \Gamma^{-1}[0; \cdots; z_i^*; \cdots; 0]$ and $v^* = \Gamma^{-1}[z_1^*; \cdots; 0; \cdots; z_k^*]$. We have

y = Xw^* = Xu^* + Xv^* = X_i z_i^* + \sum_{j \neq i} X_j z_j^*.

Since $y \in S_i$ and $X_i z_i^* \in S_i$, we have $y - X_i z_i^* \in S_i$. Thus $\sum_{j \neq i} X_j z_j^* = y - X_i z_i^* \in S_i \cap \oplus_{j \neq i} S_j$. Considering that the subspaces $\{S_i\}_{i=1}^k$ are independent, $S_i \cap \oplus_{j \neq i} S_j = \{0\}$, so $y = X_i z_i^* = Xu^*$ and $X_j z_j^* = 0$ for $j \neq i$. Hence $u^*$ is feasible for problem (7). On the other hand, by the definition of $u^*$ and EBS condition (2), we have $f(X, w^*) \geq f(X, u^*)$. Noticing that $w^*$ is optimal to problem (7), $f(X, w^*) \leq f(X, u^*)$, so equality holds. By EBS condition (2), we get $w^* = u^*$. Therefore, $z_i^* \neq 0$ and $z_j^* = 0$ for all $j \neq i$. □

The EBS conditions greatly extend the family of objective functions which enjoy the block sparse property. It is easy to check that trace Lasso satisfies the EBS conditions. Let $f(X, w) = \|X \mathrm{Diag}(w)\|_*$. For any permutation matrix $P \in \mathbb{R}^{n \times n}$,

f(XP, P^{-1}w) = \|XP \mathrm{Diag}(P^{-1}w)\|_* = \|X \mathrm{Diag}(w) P\|_* = \|X \mathrm{Diag}(w)\|_* = f(X, w),

where the second equality uses $\mathrm{Diag}(P^{-1}w) = P^{-1}\mathrm{Diag}(w)P$ and the last one the invariance of the nuclear norm under right multiplication by a permutation matrix. Trace Lasso also satisfies EBS condition (2) by the following lemma:

Lemma 1 ([18, Lemma 11]). Let $A \in \mathbb{R}^{d \times n}$ be partitioned in the form $A = [A_1, A_2]$. Then $\|A\|_* \geq \|A_1\|_*$, and the equality holds if and only if $A_2 = 0$.

In a similar way, CASS enjoys the block sparse property:

Theorem 2. Let $X = [x_1, \cdots, x_n] = [X_1, \cdots, X_k]\Gamma \in \mathbb{R}^{d \times n}$ be a data matrix whose column vectors are sufficiently drawn from a union of $k$ independent subspaces $\{S_i\}_{i=1}^k$, with $x_j \neq 0$, $j = 1, \cdots, n$. For each $i$, $X_i \in \mathbb{R}^{d \times n_i}$ and $n = \sum_{i=1}^k n_i$. Let $y$ be a new point in $S_i$. Then the solution to problem (6), $w^* = \Gamma^{-1}[z_1^*; \cdots; z_k^*] \in \mathbb{R}^n$, is block sparse, i.e., $z_i^* \neq 0$ and $z_j^* = 0$ for all $j \neq i$. Furthermore, $z_i^*$ is also optimal to the following problem:

\min_{z_i \in \mathbb{R}^{n_i}} \|X_i \mathrm{Diag}(z_i)\|_* \quad \text{s.t.} \quad y = X_i z_i.    (8)

The block sparse property of CASS is the same as those of SSC, LRR and LSR when the data are drawn from independent subspaces. This is also the motivation for using trace Lasso for subspace segmentation. In the noisy case, different from the previous methods, CASS may still lead to a solution which is close to block sparse, and it also has the grouping effect (see Section 2.3).
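EBS condition (1) for trace Lasso is also easy to confirm numerically; a quick sanity check on toy data (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 6
X = rng.standard_normal((d, n))
w = rng.standard_normal(n)
P = np.eye(n)[rng.permutation(n)]           # a random permutation matrix

f = lambda A, v: np.linalg.norm(A @ np.diag(v), ord='nuc')
print(np.isclose(f(X, w), f(X @ P, np.linalg.inv(P) @ w)))  # True
```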
2.2. CASS with noisy data

The noise-free and independent-subspaces assumptions may be violated in real applications. Problem (6) can be extended to handle noise of different types. For dense noise of small magnitude (e.g. Gaussian), a reasonable strategy is to use the $\ell_2$-norm to model the noise:

\min_{w} \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|X \mathrm{Diag}(w)\|_*.    (9)

Here $\lambda > 0$ is a parameter balancing the effects of the two terms. For data with a small fraction of gross corruptions, the $\ell_1$-norm is a better choice:

\min_{w} \|y - Xw\|_1 + \lambda \|X \mathrm{Diag}(w)\|_*.    (10)

Namely, the choice of the norm depends on the noise. This choice is important for subspace segmentation, but it is not the main focus of this paper.

When the data are contaminated with noise, it is difficult to obtain a block sparse solution. Though the representation coefficients derived by SSC tend to be sparse, SSC is unable to group correlated data together. On the other hand, LRR and LSR lead to dense representations which lack the ability of subset selection. CASS, by using trace Lasso, takes the correlation of the data into account and thereby trades off sparsity against the grouping effect. Thus it can be regarded as a method which balances SSC and LSR.

For SSC, LRR, LSR and CASS, each data point is expressed as a linear combination of all the data with a coefficient vector. These coefficient vectors can be arranged as a matrix measuring the similarities between data points. Figure 2 illustrates the coefficient matrices derived by these four methods on the Extended Yale B database (see Section 3.1 for the detailed experimental setting). We can see that the coefficient matrix derived by SSC is so sparse that it is even difficult to identify how many groups there are. This phenomenon confirms that SSC loses the data correlation information; thus SSC does not perform well for data with strong correlation. On the contrary, the coefficient matrices derived by LRR and LSR are very dense. They group many data points together, but do not perform subset selection. There are many nonzero connections between clusters, and some are very large, so LRR and LSR may contain much erroneous information. Our proposed method CASS, by using trace Lasso, achieves a more accurate coefficient matrix, which is close to block diagonal while still grouping the data within each cluster. This intuition shows that CASS reveals the true data structure for subspace segmentation more accurately.

Figure 2. The affinity matrices derived by (a) SSC, (b) LRR, (c) LSR, and (d) CASS on the Extended Yale B database (10 subjects).

2.3. The grouping effect

It has been shown in [16] that the effectiveness of LSR with $\ell_2$-regularization comes from the grouping effect, i.e., the coefficients of a group of correlated data are approximately equal. In this work, we show that trace Lasso also has the grouping effect for correlated data.

Theorem 3. Given a data vector $y \in \mathbb{R}^d$, data points $X = [x_1, \cdots, x_n] \in \mathbb{R}^{d \times n}$ and a parameter $\lambda > 0$, let $w^* = [w_1^*, \cdots, w_n^*]^T \in \mathbb{R}^n$ be the optimal solution to problem (9). If $x_i \rightarrow x_j$, then $w_i^* \rightarrow w_j^*$.

The proof of Theorem 3 can be found in the supplementary material. If each column of $X$ is normalized, $x_i = x_j$ implies that the sample correlation $r = x_i^T x_j = 1$, i.e., $x_i$ and $x_j$ are highly correlated. Then these two data points will be grouped together by CASS due to the grouping effect. Illustrations of the grouping effect are shown in Figures 1 and 2. One can see that the connections within a cluster produced by CASS are dense, similar to LRR and LSR. The grouping effect of CASS may be weaker than that of LRR and LSR, since CASS also encourages sparsity between clusters, but it is sufficient for grouping correlated data together.
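A small numerical illustration of Theorem 3 (the setup is hypothetical, and cvxpy is again assumed only as a convenient solver for problem (9)): two nearly identical columns should receive nearly equal coefficients.

```python
import cvxpy as cp
import numpy as np

def cass_noisy(X, y, lam=0.1):
    """Problem (9): min_w 0.5*||y - Xw||_2^2 + lam*||X Diag(w)||_*."""
    n = X.shape[1]
    w = cp.Variable(n)
    obj = cp.Minimize(0.5 * cp.sum_squares(y - X @ w)
                      + lam * cp.normNuc(X @ cp.diag(w)))
    cp.Problem(obj).solve()
    return w.value

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 8))
X[:, 1] = X[:, 0] + 1e-3 * rng.standard_normal(20)   # make columns 0 and 1 highly correlated
X /= np.linalg.norm(X, axis=0, keepdims=True)
y = X @ rng.standard_normal(8) + 0.01 * rng.standard_normal(20)

w = cass_noisy(X, y)
print(w[0], w[1])   # approximately equal, as the grouping effect predicts
```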
2.4. Optimization

Performing CASS requires solving the convex optimization problem (9), which can be handled by off-the-shelf solvers. The work in [7] introduces an iteratively reweighted least squares method for solving problem (9), but its solution is not necessarily globally optimal, because a term is added as a trick to avoid non-invertibility. Motivated by the optimization methods used in low-rank minimization [1, 15], we adopt the Alternating Direction Method (ADM) to solve problem (9). We first convert it to the following equivalent problem:

\min_{J, w} \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|J\|_* \quad \text{s.t.} \quad J = X \mathrm{Diag}(w).    (11)

This problem can be solved by ADM, which operates on the following augmented Lagrangian function:

L(J, w) = \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|J\|_* + \mathrm{tr}\big(Y^T (J - X \mathrm{Diag}(w))\big) + \frac{\mu}{2}\|J - X \mathrm{Diag}(w)\|_F^2,    (12)

where $Y \in \mathbb{R}^{d \times n}$ is the Lagrange multiplier and $\mu > 0$ is the penalty parameter for violation of the linear constraint. We can see that $L(J, w)$ is separable, so it can be decomposed into two subproblems and minimized with respect to $J$ and $w$, respectively. The whole procedure for solving problem (9) is outlined in Algorithm 1. It iteratively solves two subproblems, each of which has a closed-form solution. By the theory of ADM and the convexity of problem (9), Algorithm 1 converges globally.

Algorithm 1: Solving Problem (9) by ADM
Input: data matrix $X$, parameter $\lambda$.
Initialize: $w^0$, $Y^0$, $\mu^0$, $\rho$, $\mu_{\max}$, $\varepsilon$, $t = 0$.
Output: coefficient $w^*$.
while not converged do
  1. Fix the others and update $J$ by
     $J^{t+1} = \arg\min_J \frac{\lambda}{\mu^t}\|J\|_* + \frac{1}{2}\big\|J - \big(X \mathrm{Diag}(w^t) - \frac{1}{\mu^t}Y^t\big)\big\|_F^2$.
  2. Fix the others and update $w$ by
     $w^{t+1} = A\big(X^T y + \mathrm{diag}(X^T (Y^t + \mu^t J^{t+1}))\big)$,
     where $A = \big(X^T X + \mu^t \mathrm{Diag}(\mathrm{diag}(X^T X))\big)^{-1}$.
  3. Update the multiplier: $Y^{t+1} = Y^t + \mu^t (J^{t+1} - X \mathrm{Diag}(w^{t+1}))$.
  4. Update the penalty parameter: $\mu^{t+1} = \min(\rho \mu^t, \mu_{\max})$.
  5. Check the convergence conditions:
     $\|J^{t+1} - J^t\|_\infty \leq \varepsilon$, $\|w^{t+1} - w^t\|_\infty \leq \varepsilon$, $\|J^{t+1} - X \mathrm{Diag}(w^{t+1})\|_\infty \leq \varepsilon$.
  6. $t = t + 1$.
end while
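A compact numpy sketch of Algorithm 1 is given below. The $J$-update is the standard singular value thresholding step (the proximal operator of the nuclear norm), and the $w$-update is the linear system of step 2; the default values of $\lambda$, $\mu^0$, $\rho$ and $\varepsilon$ are illustrative assumptions, since the paper leaves them to initialization.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of tau*||.||_* at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def cass_adm(X, y, lam=0.1, mu=1e-2, rho=1.1, mu_max=1e6, eps=1e-6, max_iter=1000):
    """ADM for problem (9): min_w 0.5*||y - Xw||_2^2 + lam*||X Diag(w)||_*."""
    d, n = X.shape
    w = np.zeros(n)
    J = np.zeros((d, n))
    Y = np.zeros((d, n))
    XtX = X.T @ X
    for _ in range(max_iter):
        # Step 1: J-update (singular value thresholding).
        J_new = svt(X @ np.diag(w) - Y / mu, lam / mu)
        # Step 2: w-update, w = A (X^T y + diag(X^T (Y + mu*J))).
        A = XtX + mu * np.diag(np.diag(XtX))
        b = X.T @ y + np.diag(X.T @ (Y + mu * J_new))
        w_new = np.linalg.solve(A, b)
        # Step 3: multiplier update.
        Y = Y + mu * (J_new - X @ np.diag(w_new))
        # Step 5: convergence checks (entrywise infinity norms).
        stop = (np.max(np.abs(J_new - J)) < eps and
                np.max(np.abs(w_new - w)) < eps and
                np.max(np.abs(J_new - X @ np.diag(w_new))) < eps)
        J, w = J_new, w_new
        # Step 4: penalty update.
        mu = min(rho * mu, mu_max)
        if stop:
            break
    return w
```

On small problems, its output can be cross-checked against a generic convex solver such as the cvxpy prototypes sketched above.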
2.5. The segmentation algorithm

To solve the subspace segmentation problem by trace Lasso, we first solve problem (9) for each data point $x_i$, using the dictionary $X_{\hat{i}} = [x_1, \cdots, x_{i-1}, x_{i+1}, \cdots, x_n]$, which excludes $x_i$ itself, and obtain the corresponding coefficients. These coefficients are then arranged as a matrix $W^*$. The affinity matrix is defined as $(|W^*| + |W^{*T}|)/2$. Finally, we use Normalized Cuts (NCuts) [20] to segment the data into $k$ groups. The whole procedure of the CASS algorithm is outlined in Algorithm 2.

Algorithm 2: Correlation Adaptive Subspace Segmentation
Input: data matrix $X$, number of subspaces $k$.
  1. Solve problem (9) for each data point $x_i$ in $X$ to obtain the coefficient matrix $W^*$, where $X$ in (9) is replaced by $X_{\hat{i}} = [x_1, \cdots, x_{i-1}, x_{i+1}, \cdots, x_n]$.
  2. Construct the affinity matrix as $(|W^*| + |W^{*T}|)/2$.
  3. Segment the data into $k$ groups by Normalized Cuts.
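A sketch of Algorithm 2 built on the `cass_adm` routine from the previous sketch. Scikit-learn's spectral clustering is used here as a stand-in for Normalized Cuts; this is an implementation substitution for illustration, not the authors' choice, and `cass_segment` is a hypothetical name.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cass_segment(X, k, lam=0.1):
    """Algorithm 2: per-point coefficients -> affinity matrix -> spectral segmentation."""
    n = X.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]          # exclude x_i from its own dictionary
        w_i = cass_adm(X[:, idx], X[:, i], lam=lam)    # solve problem (9) for x_i (sketch above)
        W[idx, i] = w_i
    A = (np.abs(W) + np.abs(W.T)) / 2                  # affinity matrix (|W*| + |W*^T|) / 2
    labels = SpectralClustering(n_clusters=k, affinity='precomputed',
                                assign_labels='discretize').fit_predict(A)
    return labels
```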
3. Experiments

In this section, we apply CASS to subspace segmentation on three databases: the Hopkins 155 motion database^4, the Extended Yale B database [6], and the MNIST database^5 of handwritten digits. CASS is compared with SSC, LRR and LSR, which are representative state-of-the-art methods for subspace segmentation. The affinity matrices derived by all algorithms are also evaluated on a semi-supervised learning task on the Extended Yale B database. For a fair comparison with previous works, we follow the experimental settings in [16]. The parameters of each method are tuned to achieve the best performance. The segmentation accuracy/error is used to evaluate subspace segmentation performance; the accuracy is calculated as the best matching rate between the predicted labels and the ground truth [3].

^4 http://www.vision.jhu.edu/data/hopkins155/
^5 http://yann.lecun.com/exdb/mnist/

3.1. Datasets and experimental settings

Hopkins 155 contains 156 sequences, each of which has 39 to 550 data points drawn from two or three motions (a motion corresponds to a subspace). Each sequence is a separate data set, so there are 156 subspace segmentation problems in total. We first use PCA to project the data into a 12-dimensional subspace. All the algorithms are run on each sequence, and the maximum, mean and standard deviation of the error rates are reported.

Extended Yale B is challenging for subspace segmentation due to large noise. It consists of 2,414 frontal face images of 38 subjects under various lighting, poses and illumination conditions, with 64 faces per subject. We construct three subspace segmentation tasks based on the first 5, 8 and 10 subjects of this database. The data are first projected into a 5×6, 8×6, and 10×6-dimensional subspace by PCA, respectively. The algorithms are then run on these three tasks and the accuracies are reported.

To further evaluate the effectiveness of CASS for other learning problems, we also use the derived affinity matrices for semi-supervised learning. The Markov random walks algorithm [21] is employed in this experiment. It performs a t-step Markov random walk on the graph, or affinity matrix; the influence of one example on another is proportional to the affinity between them. We test on the 10-subject face classification problem. For each subject, 4, 8, 16 and 32 face images are randomly selected to form the training set, and the remaining images are used for testing. Our goal is to predict the labels of the test data by Markov random walks [21] on the affinity matrices learnt by kNN, SSC, LRR, LSR and CASS. We experimentally select k = 6 neighbors. The experiment is repeated 20 times, and the mean accuracy and standard deviation are reported.

MNIST is also widely used in subspace learning and clustering [11]. It has 10 subjects, corresponding to the 10 handwritten digits 0-9. We select a subset of a size similar to the face clustering problem above, consisting of the first 50 samples of each subject. The accuracies of SSC, LRR, LSR and CASS are reported.

3.2. Experimental results

Table 1. The segmentation errors (%) on the Hopkins 155 database.

Comparison under the same setting:
        kNN     SSC     LRR     LSR     CASS
MAX     45.59   39.53   36.36   36.36   32.85
MEAN    13.44    4.02    3.23    2.50    2.42
STD     12.90   10.04    6.60    5.62    5.84

Comparison to state-of-the-art methods:
        SSC     LRR     LatLRR  CASS
MEAN    2.18    1.71    0.85    1.47

Table 1 tabulates the motion segmentation errors of the four methods on the Hopkins 155 database. It shows that CASS achieves a misclassification error of 2.42% over all 156 sequences, while the best previously reported result is 2.50% by LSR. The improvement of CASS on this database is limited for several reasons. First, previous methods already perform very well on data with only slight corruptions, so the room for improvement is small. Second, the reported error is the mean of 156 segmentation errors, most of which are zero; even large improvements on some challenging sequences therefore change the mean only slightly. Third, the correlation of the data is strong, as the dimension of each affine subspace is no more than three [3][16], so CASS tends to be close to LSR in this case. Due to the dimensionality reduction by PCA and the sufficient data sampling in each motion, CASS may behave like LSR with a strong grouping effect. Furthermore, to compare with the state-of-the-art methods, we follow the post-processing in [12], which may not be optimal for CASS, and the error of CASS is reduced to 1.47%. The best performance, 0.85%, is obtained by LatLRR [14], which is much better than the other methods. That is because LatLRR additionally employs unobserved hidden data as the dictionary and uses complex pre-processing and post-processing with several parameters. The idea of incorporating unobserved hidden data may also be considered for CASS; this will be our future work.

Table 2. The segmentation accuracies (%) on the Extended Yale B database.
              kNN     SSC     LRR     LSR     CASS
5 subjects    56.88   80.31   86.56   92.19   94.03
8 subjects    52.34   62.90   78.91   80.66   91.41
10 subjects   50.94   52.19   65.00   73.59   81.88

Table 2 shows the clustering results on the Extended Yale B database. We can see that CASS outperforms SSC, LRR and LSR on all three clustering tasks. In particular, CASS achieves accuracies of 94.03%, 91.41%, and 81.88% for face clustering with 5, 8, and 10 subjects, respectively, which outperforms the state-of-the-art method LSR. For the 5-subject face clustering problem, all four methods perform well, and no big improvement is made by CASS. But for the 8-subject and 10-subject face clustering problems, CASS achieves significant improvements. For these two tasks, both LRR and LSR perform much better than SSC, which can be attributed to their strong grouping effect. However, both methods lack the ability of subset selection and therefore may group some data points from different clusters together. CASS not only preserves the grouping effect within each cluster but also enhances the sparsity between clusters. The intuitive comparison of the four methods can be found in Figure 2. It confirms that CASS usually leads to an approximately block diagonal affinity matrix, which results in a more accurate segmentation. This phenomenon is also consistent with the analysis in Theorems 2 and 3.

Figure 3. Comparison of classification accuracy (%) and standard deviation of semi-supervised learning based on different affinity matrices on the Extended Yale B (10 subjects) database.

For semi-supervised learning, the comparison of the classification accuracies with different numbers of training data is shown in Figure 3. CASS achieves the best performance, and its accuracies in all these settings are above 90%. Notice that they are much higher than the clustering accuracies in Table 2. This is mainly due to the mechanism of semi-supervised learning, which makes use of both labeled and unlabeled data for training. Accurate graph construction is the key step of semi-supervised learning; this example shows that the affinity matrix obtained by trace Lasso is also effective for semi-supervised learning.

Table 3. The segmentation accuracies (%) on the MNIST database.
        kNN     SSC     LRR     LSR     CASS
ACC.    61.00   62.60   66.80   68.00   73.80

Table 3 shows the clustering accuracies of SSC, LRR, LSR, and CASS on the MNIST database, and Figure 4 compares the derived affinity matrices. We can see that CASS obtains an affinity matrix which is close to block diagonal while preserving the grouping effect. None of the four methods performs perfectly on this database; nonetheless, our proposed CASS method achieves the best accuracy, 73.80%. The main reason may lie in the fact that the handwritten digit data do not fit the subspace structure well. This is also the main challenge for real-world applications of subspace segmentation.

Figure 4. The affinity matrices derived by (a) SSC, (b) LRR, (c) LSR, and (d) CASS on the MNIST database.

4. Conclusions and Future Work

In this work, we propose the Correlation Adaptive Subspace Segmentation (CASS) method by using trace Lasso. Compared with the existing methods SSC, LRR, and LSR, CASS simultaneously encourages the grouping effect and sparsity. The adaptivity of CASS comes from the mechanism of trace Lasso, which balances between the $\ell_1$-norm and the $\ell_2$-norm. In theory, we show that CASS is able to reveal the true segmentation when the subspaces are independent. The grouping effect of trace Lasso is established for the first time in this work. Finally, the experimental results on the Hopkins 155, Extended Yale B, and MNIST databases show the effectiveness of CASS. Similar improvements are also observed in the semi-supervised learning setting on the Extended Yale B database. However, there remain many problems for future exploration. First, the data themselves, which may be noisy, are used as the dictionary for the linear representation; it may be better to learn a compact and discriminative dictionary for trace Lasso. Second, trace Lasso may have many other applications, e.g. classification, dimensionality reduction, and semi-supervised learning. Third, more scalable optimization algorithms should be developed for large-scale subspace segmentation.

Acknowledgements

This research is supported by the Singapore National Research Foundation under its International Research Centre @Singapore Funding Initiative and administered by the IDM Programme Office. Z. Lin is supported by the National Natural Science Foundation of China (Grant nos. 61272341, 61231002, and 61121002).
References

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[2] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. S. Huang. Learning with ℓ1-graph for image analysis. TIP, 19:858-866, 2010.
[3] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, pages 2790-2797, 2009.
[4] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. arXiv preprint arXiv:1203.1005, 2012.
[5] Y. Fang, R. Wang, and B. Dai. Graph-oriented learning via automatic group sparsity for data analysis. In ICDM, pages 251-259. IEEE, 2012.
[6] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. TPAMI, 23(6):643-660, 2001.
[7] E. Grave, G. Obozinski, and F. Bach. Trace Lasso: a trace norm regularization for correlated designs. In NIPS, 2011.
[8] J. Gui, Z. Sun, W. Jia, R. Hu, Y. Lei, and S. Ji. Discriminant sparse neighborhood preserving embedding for face recognition. Pattern Recognition, 45(8):2884-2893, 2012.
[9] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances of objects under varying illumination conditions. In CVPR, volume 1, 2003.
[10] A. Hoerl and R. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970.
[11] H. Huang, C. Ding, D. Luo, and T. Li. Simultaneous tensor subspace selection and clustering: the equivalence of high order SVD and k-means clustering. In KDD, pages 327-335, 2008.
[12] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. TPAMI, PP(99):1, 2012.
[13] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 663-670, 2010.
[14] G. Liu and S. Yan. Latent low-rank representation for subspace segmentation and feature extraction. In ICCV, pages 1615-1622, 2011.
[15] R. Liu, Z. Lin, and Z. Su. Linearized alternating direction method with parallel splitting and adaptive penalty for separable convex programs in machine learning. In ACML, 2013.
[16] C.-Y. Lu, H. Min, Z.-Q. Zhao, L. Zhu, D.-S. Huang, and S. Yan. Robust and efficient subspace segmentation via least squares regression. In ECCV, 2012.
[17] D. Luo, F. Nie, C. Ding, and H. Huang. Multi-subspace representation and discovery. In ECML PKDD, volume 6912 of LNAI, pages 405-420, 2011.
[18] Y. Ni, J. Sun, X. Yuan, S. Yan, and L.-F. Cheong. Robust low-rank subspace segmentation with semidefinite guarantees. In ICDM Workshops, pages 1179-1188, 2010.
[19] S. Rao, R. Tron, R. Vidal, and Y. Ma. Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories. TPAMI, 32(10):1832-1845, 2010.
[20] J. B. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8):888-905, 2000.
[21] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In NIPS, pages 945-952, 2001.
[22] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
[23] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). TPAMI, 27(12):1945-1959, 2005.
[24] S. Wang, X. Yuan, T. Yao, S. Yan, and J. Shen. Efficient subspace segmentation via quadratic programming. In AAAI, volume 1, pages 519-524, 2011.
