REPORT DOCUMENTATION PAGE — Form Approved OMB No. 0704-0188

The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that, notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number. PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.

1. REPORT DATE (DD-MM-YYYY):
2. REPORT TYPE: New Reprint
3. DATES COVERED (From - To):
4. TITLE AND SUBTITLE: Hyperspectral Image Classification via Kernel Sparse Representation
5a. CONTRACT NUMBER: W911NF-11-1-0245
5b. GRANT NUMBER:
5c. PROGRAM ELEMENT NUMBER: 611102
6. AUTHORS: Yi Chen, Nasser M. Nasrabadi, Trac D. Tran
5d. PROJECT NUMBER:
5e. TASK NUMBER:
5f. WORK UNIT NUMBER:
7. PERFORMING ORGANIZATION NAMES AND ADDRESSES: Johns Hopkins University, 3400 N Charles Street, Maryland Hall, Baltimore, MD 21218-2608
8. PERFORMING ORGANIZATION REPORT NUMBER:
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): U.S. Army Research Office, P.O. Box 12211, Research Triangle Park, NC 27709-2211
10. SPONSOR/MONITOR'S ACRONYM(S): ARO
11. SPONSOR/MONITOR'S REPORT NUMBER(S): 60291-MA.10
12. DISTRIBUTION AVAILABILITY STATEMENT: Approved for public release; distribution is unlimited.
13. SUPPLEMENTARY NOTES: The views, opinions and/or findings contained in this report are those of the author(s) and should not be construed as an official Department of the Army position, policy or decision, unless so designated by other documentation.
14. ABSTRACT: In this paper, a novel nonlinear technique for hyperspectral image (HSI) classification is proposed. Our approach relies on sparsely representing a test sample in terms of all of the training samples in a feature space induced by a kernel function. For each test pixel in the feature space, a sparse representation
15. SUBJECT TERMS: Classification, hyperspectral imagery, joint sparsity model, kernel methods, sparse representation
16. SECURITY CLASSIFICATION OF: a. REPORT: UU; b. ABSTRACT: UU; c. THIS PAGE: UU
17. LIMITATION OF ABSTRACT: UU
NUMBER OF PAGES:
19a. NAME OF RESPONSIBLE PERSON: Trac Tran
19b. TELEPHONE NUMBER: 410-516-7416
Standard Form 298 (Rev 8/98), Prescribed by ANSI Std. Z39.18

Report Title: Hyperspectral Image Classification via Kernel Sparse Representation

ABSTRACT

In this paper, a novel nonlinear technique for hyperspectral image (HSI) classification is proposed. Our approach relies on sparsely representing a test sample in terms of all of the training samples in a feature space induced by a kernel function. For each test pixel in the feature space, a sparse representation vector is obtained by decomposing the test pixel over a training dictionary, also in the same feature space, by using a kernel-based greedy pursuit algorithm. The recovered sparse representation vector is then used directly to determine the class label of the test pixel.
Projecting the samples into a high-dimensional feature space and kernelizing the sparse representation improve the data separability between different classes, providing a higher classification accuracy compared to the more conventional linear sparsity-based classification algorithms. Moreover, the spatial coherency across neighboring pixels is also incorporated through a kernelized joint sparsity model, where all of the pixels within a small neighborhood are jointly represented in the feature space by selecting a few common training samples. Kernel greedy optimization algorithms are suggested in this paper to solve the kernel versions of the single-pixel and multi-pixel joint sparsity-based recovery problems. Experimental results on several HSIs show that the proposed technique outperforms the linear sparsity-based classification technique, as well as the classical support vector machines and sparse kernel logistic regression classifiers.

REPORT DOCUMENTATION PAGE (SF298) (Continuation Sheet)

Continuation for Block 13: ARO Report Number 60291.10-MA, Hyperspectral Image Classification via Kernel Sp ...

Block 13: Supplementary Note
© 2013. Published in IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, Vol. Ed. 0 (2013), (Ed.). DoD Components reserve a royalty-free, nonexclusive and irrevocable right to reproduce, publish, or otherwise use the work for Federal purposes, and to authorize others to do so (DODGARS §32.36). The views, opinions and/or findings contained in this report are those of the author(s) and should not be construed as an official Department of the Army position, policy or decision, unless so designated by other documentation. Approved for public release; distribution is unlimited.

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 51, NO. 1, JANUARY 2013

Hyperspectral Image Classification via Kernel Sparse Representation
Yi Chen, Nasser M. Nasrabadi, Fellow, IEEE, and Trac D. Tran, Senior Member, IEEE

Abstract—In this paper, a novel nonlinear technique for hyperspectral image (HSI) classification is proposed. Our approach relies on sparsely representing a test sample in terms of all of the training samples in a feature space induced by a kernel function. For each test pixel in the feature space, a sparse representation vector is obtained by decomposing the test pixel over a training dictionary, also in the same feature space, by using a kernel-based greedy pursuit algorithm. The recovered sparse representation vector is then used directly to determine the class label of the test pixel. Projecting the samples into a high-dimensional feature space and kernelizing the sparse representation improve the data separability between different classes, providing a higher classification accuracy compared to the more conventional linear sparsity-based classification algorithms. Moreover, the spatial coherency across neighboring pixels is also incorporated through a kernelized joint sparsity model, where all of the pixels within a small neighborhood are jointly represented in the feature space by selecting a few common training samples. Kernel greedy optimization algorithms are suggested in this paper to solve the kernel versions of the single-pixel and multi-pixel joint sparsity-based recovery problems. Experimental results on several HSIs show that the proposed technique outperforms the linear sparsity-based classification technique, as well as the classical support vector machines and sparse kernel logistic regression classifiers.

Index Terms—Classification, hyperspectral imagery, joint sparsity model, kernel methods, sparse representation.

I. INTRODUCTION

HYPERSPECTRAL imaging sensors capture images in hundreds of continuous narrow spectral bands, spanning the visible to infrared spectrum. Each pixel in a hyperspectral image (HSI) is represented by a vector whose entries correspond to various spectral-band responses. Different materials usually reflect electromagnetic energy differently at specific wavelengths. This enables discrimination of materials based on their spectral characteristics. One of the most important applications of HSI is image classification, where pixels are labeled to one of the classes based on their spectral characteristics, given a small set of training data for each class.

Various techniques have been developed for HSI classification. Among the previous approaches, the support vector machine (SVM) [1], [2] has proven to be a powerful tool to solve many supervised classification problems and has shown good performance in hyperspectral classification as well [3]–[5]. Variations of SVM-based algorithms have also been proposed to improve the classification accuracy. These variations include the transductive SVM, which exploits both labeled and unlabeled samples [6], and SVM with composite kernels (CKs), which incorporates spatial information directly in the SVM kernels [7]. Multinomial logistic regression [8] is another widely used classifier, which uses the logistic function to provide the posterior probability. A fast algorithm for sparse multinomial logistic regression has been developed in [9] and successfully adopted for HSI segmentation in [10], [11]. Some of the other recent HSI classification techniques can be found in [12]–[17]. Among these recent methods, a feature extraction strategy is proposed in [12] for classification, which generalizes linear discriminative analysis and nonparametric discriminative analysis. In [13], the derivative information of the spectral signatures is exploited as features, and then decisions obtained from spectral reflectance and derivative information are fused for the final decisions. In [14], each image band is decomposed into intrinsic mode functions (IMFs) which adapt to local properties via empirical mode decomposition, and then SVM is applied to the lower order IMFs for classification. In [15], the k-nearest-neighbor classifier is applied to local manifolds to exploit the intrinsic nonlinear structure of HSIs. A semi-supervised classification algorithm is proposed in [16] in order to use a kernel machine which is iteratively updated by manifold regularization. In [17], the results from multiple classification/segmentation techniques are fused by postprocessing to generate the final spectral-spatial classification map. Most of the aforementioned HSI classification techniques do not directly incorporate the spatial or the contextual information into the classifier.
Manuscript received August 13, 2011; revised December 23, 2011 and March 22, 2012; accepted April 18, 2012. Date of publication July 10, 2012; date of current version December 19, 2012. This work was supported in part by the National Science Foundation under Grants CCF-1117545 and CCF-0728893, the Army Research Office under Grant 58110-MA-II and Grant 60219-MA, and the Office of Naval Research under Grant N102-183-0208.
Y. Chen and T. D. Tran are with the Department of Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: [email protected]; [email protected]).
N. M. Nasrabadi is with the US Army Research Laboratory, Adelphi, MD 20783 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TGRS.2012.2201730

Recently, sparse representation [18], [19] has also been proposed to solve many computer vision tasks [20]–[25], where the use of sparsity as a prior often leads to state-of-the-art performance. Sparse representation has also been applied to HSI target detection and classification [26]–[28], relying on the observation that hyperspectral pixels belonging to the same class approximately lie in the same low-dimensional subspace. Thus, an unknown test pixel can be sparsely represented by a few training samples (atoms) from a given dictionary, and the corresponding sparse representation vector will implicitly encode the class information. The sparse representation-based classifier differs from the conventional SVM classifier in the following aspects. SVM is a discriminative model, while the sparse representation method can be viewed as a generative model, where the signal (pixel) is expressed as a linear combination of atoms [19]. SVM is a binary classifier that finds the separating hyperplane between two classes (multi-class SVM requires a one-against-one or one-against-all strategy). The sparse representation-based classifier works from a reconstruction point of view: the sparse decomposition of the test pixel over the entire dictionary implicitly leads to a competition between the subspaces (classes), and thus the recovered sparse representation is discriminative. Moreover, in SVM there is an explicit training stage: the SVM classifier is trained only once, and then this classifier with its fixed sparse support vectors is used to classify all of the test data. On the other hand, in our proposed approach, a new sparse representation vector is extracted for each test pixel and is thus adaptive, representing the sparsely selected atoms which are adapted to reconstruct the current test pixel.

HSIs are usually smooth in the sense that the pixels in a small neighborhood represent the same material and have similar spectral characteristics. Various techniques have been proposed recently to exploit this contextual correlation within HSI and have notably improved the classification and segmentation performance. Post-processing procedures are used in [29], [30] on the individually labeled samples based on certain decision rules to impose smoothness. Markov random fields exploit the statistical dependency among neighboring pixels and are usually applied in Bayesian approaches [11]. The CK approach [7] is another way to incorporate the spatial information: it explicitly extracts spatial information for each spectral pixel and then combines the spectral and spatial information via kernel composition. The joint sparsity model (JSM) [31] is exploited in sparsity-based HSI target detection and classification [27], [28], where the neighboring pixels are simultaneously represented by a sparse linear combination of a few common training samples. Each pixel, although sharing the same common support, might have weighting coefficients taking on different values. In this way, the smoothness across neighboring spectral pixels is enforced directly in the classification stage, and no postprocessing steps are performed. The details of the CK and the JSM will be further discussed in the following sections.

For the classical HSI classification and target detection algorithms, the use of kernel methods yields a significant performance improvement [5], [32], because the kernel-based algorithms implicitly exploit the higher order structure of the given data which may not be captured by the linear models. Therefore, if the data set is not linearly separable, kernel methods [33]–[36] can be applied to project the data into a nonlinear feature space in which the data becomes more separable. In practical implementation, the kernel trick [37] is often used in order to avoid explicitly evaluating the data in the feature space.

In this paper, we propose a new HSI classification algorithm based on kernel sparse representation by assuming that a test pixel can be linearly represented by a few training samples in the feature space. The kernel sparse representation vector is then obtained by decomposing the test pixel, represented in a high-dimensional feature space, over a structured dictionary consisting of training samples from all of the classes in the same feature space. The recovered sparse vector is used directly for classification. Although the proposed approach has a formulation similar to previous kernel regression approaches with a sparse prior, such as kernel matching pursuit [33], kernel basis pursuit [34], and generalized LASSO [38], the underlying ideas are quite different. The objective of these previous approaches is to approximate a function as a linear combination of dictionary functions, which are the kernels centered at the training points, by minimizing a certain loss function evaluated at these training points and subject to a sparsity prior. Therefore, the target vector for fitting consists of the observations of the function value at the training points, and the dictionary consists of the dictionary functions evaluated at the training points, which turns out to be the kernel matrix. In our proposed approach, the target vector is the test pixel itself in the feature space. It is not the similarity measure between the test sample and training samples and may not have an explicit expression. The dictionary also consists of the training samples in the feature space and cannot assume an explicit expression either. The recovered sparse representation vector can be viewed as a discriminative feature extracted from the test pixel and is used directly for classification.

The contextual correlation between pixels within a small spatial neighborhood can be incorporated into the kernel sparse representation through the JSM [31], where all neighboring pixels are simultaneously represented by a linear combination of a few common training samples in the feature space. Furthermore, the CK approach [7] can also be used with the proposed kernel sparse representation model in order to combine spectral and spatial information. Efficient kernel-based optimization algorithms are discussed in this paper for the recovery of the kernel sparse representations for both single-pixel and multi-pixel JSMs.

Notation-wise, vectors and matrices are denoted by lower- and uppercase bold letters, respectively. For a vector α ∈ ℝ^N and an index set Λ ⊆ {1,...,N} with |Λ| = t, α_Λ ∈ ℝ^t is the portion of α indexed on Λ. For a matrix S ∈ ℝ^{N_1×N_2} and index sets Λ_1 ⊆ {1,...,N_1} with |Λ_1| = t_1 and Λ_2 ⊆ {1,...,N_2} with |Λ_2| = t_2, S_{Λ_1,:} ∈ ℝ^{t_1×N_2} is the submatrix of S consisting of the t_1 rows indexed on Λ_1, S_{:,Λ_2} ∈ ℝ^{N_1×t_2} consists of the t_2 columns indexed on Λ_2, and S_{Λ_1,Λ_2} ∈ ℝ^{t_1×t_2} is formed by the rows and columns of S indexed on Λ_1 and Λ_2, respectively.

The remainder of this paper is structured as follows. Section II briefly introduces the sparsity-based HSI classification technique. Section III defines the sparsity models in the feature space, then discusses how to incorporate spatial information, and describes the kernel sparse recovery algorithms. Experimental results are shown in Section IV, and conclusions are drawn in Section V.

II. SPARSITY-BASED HSI CLASSIFICATION

This section briefly introduces the sparsity-based algorithm for HSI classification; more details can be found in [26]–[28]. It is assumed that the spectral signatures of pixels belonging to the same class approximately lie in the same low-dimensional subspace. Thus, an unknown test sample x ∈ ℝ^B, where B is the number of spectral bands, can be written as a sparse linear combination of all of the training pixels as

    x = Aα    (1)

where A = [a_1 a_2 ... a_N] ∈ ℝ^{B×N} is a structured dictionary whose columns {a_i}_{i=1,2,...,N} are N training samples (referred to as atoms) from all classes, and α ∈ ℝ^N is an unknown sparse vector. The index set on which α has nonzero entries is the support of α, and the number of nonzero entries in α is called the sparsity level K of α, denoted by K = ‖α‖_0. Given the dictionary A, the sparse coefficient vector α is obtained by solving

    α̂ = arg min ‖x − Aα‖_2  subject to  ‖α‖_0 ≤ K_0    (2)

where K_0 is a preset upper bound on the sparsity level. The problem in (2) is NP-hard, but it can be approximately solved by greedy algorithms, such as orthogonal matching pursuit (OMP) [39] or subspace pursuit (SP) [40]. The class label of x is determined by the minimal residual between x and its approximation from each class subdictionary:

    Class(x) = arg min_{m=1,...,M} ‖x − A_{:,Ω_m} α̂_{Ω_m}‖_2    (3)

where Ω_m ⊂ {1,2,...,N} is the index set associated with the training samples belonging to the mth class. As pointed out in [25], the sparse representation-based classifier can be viewed as a generalization of the nearest neighbor classifier [41].
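As an illustration of the pixel-wise model in (1)–(3), the following minimal NumPy sketch recovers α̂ with a plain OMP loop and assigns the label by the minimal class-wise residual. The toy dictionary, labels, and test pixel are synthetic, and all function names are illustrative rather than taken from the authors' implementation.

```python
import numpy as np

def omp(A, x, K0):
    """Greedy OMP: approximately solve min ||x - A @ alpha||_2 s.t. ||alpha||_0 <= K0 (eq. (2))."""
    residual, support = x.copy(), []
    alpha = np.zeros(A.shape[1])
    for _ in range(K0):
        # Pick the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(A.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Orthogonal projection of x onto the span of the selected atoms.
        coeffs, *_ = np.linalg.lstsq(A[:, support], x, rcond=None)
        residual = x - A[:, support] @ coeffs
    alpha[support] = coeffs
    return alpha

def classify(A, labels, x, K0=5):
    """Assign x to the class whose sub-dictionary yields the smallest residual (eq. (3))."""
    alpha = omp(A, x, K0)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(x - A[:, labels == m] @ alpha[labels == m]) for m in classes]
    return classes[int(np.argmin(residuals))]

# Toy usage: B spectral bands, N unit-norm training atoms from M classes.
rng = np.random.default_rng(0)
B, N, M = 50, 60, 3
A = rng.normal(size=(B, N))
A /= np.linalg.norm(A, axis=0)
labels = np.repeat(np.arange(M), N // M)
x = A[:, 5] + 0.01 * rng.normal(size=B)   # test pixel close to an atom of class 0
print(classify(A, labels, x))             # expected output: 0
```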
In HSI, pixels within a small neighborhood usually consist of similar materials and, thus, their spectral characteristics are highly correlated. The spatial correlation between neighboring pixels can be incorporated through a JSM [27], [31] by assuming that the underlying sparse vectors associated with these pixels share a common sparsity pattern, as follows. Let {x_t}_{t=1,...,T} be T pixels in a spatial neighborhood centered at x_1. These pixels can be compactly represented as

    X = [x_1 x_2 ... x_T] = [Aα_1 Aα_2 ... Aα_T] = A [α_1 α_2 ... α_T] = AS    (4)

where S = [α_1 α_2 ... α_T]. In the JSM, the sparse vectors {α_t}_{t=1,...,T} share the same support Λ, and, thus, S is a sparse matrix with only |Λ| nonzero rows. The row-sparse matrix S can be recovered by solving the following optimization problem:

    Ŝ = arg min ‖X − AS‖_F  subject to  ‖S‖_{row,0} ≤ K_0    (5)

where ‖S‖_{row,0} denotes the number of nonzero rows of S and ‖·‖_F denotes the Frobenius norm. The problem in (5) can be approximately solved by the simultaneous versions of OMP (SOMP) [31] or SP (SSP) [28]. The label of the center pixel x_1 is then determined by the minimal total residual

    Class(x_1) = arg min_{m=1,...,M} ‖X − A_{:,Ω_m} Ŝ_{Ω_m,:}‖_F.    (6)

III. KERNEL SPARSE REPRESENTATION

If the classes in the data set are not linearly separable, kernel methods can be used to project the data into a feature space in which the classes become linearly separable [1]. The kernel function κ : ℝ^B × ℝ^B → ℝ is defined as the inner product

    κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.    (7)

Commonly used kernels include the radial basis function (RBF) kernel κ(x_i, x_j) = exp(−γ‖x_i − x_j‖^2), with γ > 0 controlling the width of the RBF, and the order-d homogeneous and inhomogeneous polynomial kernels κ(x_i, x_j) = (x_i · x_j)^d and κ(x_i, x_j) = (x_i · x_j + 1)^d, respectively. In this section, we describe how the sparsity models in Section II can be extended to a feature space induced by a kernel function.
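As a small illustration of (7), the dictionary kernel matrix K_A and the vector k_{A,x} used later by the kernelized pursuit algorithms can be computed as follows with the RBF kernel; the array shapes and the value of γ are arbitrary toy choices, not settings from the paper.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """RBF kernel matrix with entries exp(-gamma * ||x_i - y_j||^2) for the columns of X and Y."""
    d2 = (np.sum(X**2, axis=0)[:, None]
          + np.sum(Y**2, axis=0)[None, :]
          - 2.0 * X.T @ Y)
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 60))            # B x N training dictionary (toy data)
x = rng.normal(size=(50, 1))             # B x 1 test pixel
gamma = 0.5
K_A  = rbf_kernel(A, A, gamma)           # N x N matrix with entries kappa(a_i, a_j)
k_Ax = rbf_kernel(A, x, gamma)[:, 0]     # length-N vector with entries kappa(a_i, x)
```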
A. Pixel-Wise Sparsity in Feature Space

Let x ∈ ℝ^B be the data point of interest and φ(x) be its representation in the feature space. The kernel sparse representation of a sample x in terms of the training atoms a_i can be formulated as

    φ(x) = [φ(a_1) ... φ(a_N)] [α′_1 ... α′_N]^T = A_φ α′    (8)

where the columns of A_φ are the representations of the training samples in the feature space and α′ is assumed to be a sparse vector.

Similar to the linear sparse recovery problem in (2), α′ can be recovered by solving

    α̂′ = arg min ‖φ(x) − A_φ α′‖_2  subject to  ‖α′‖_0 ≤ K_0.    (9)

The problem in (9) can be approximately solved by kernelizing the OMP and SP algorithms (denoted by KOMP and KSP, respectively). Note that in the above problem formulation, we are solving for the sparse vector α′ directly in the feature space using the implicit feature vectors, not evaluating the kernel functions at the training points.

In KOMP and KSP, essentially, each dot product operation in OMP/SP is replaced by the kernel trick in (7). Let K_A ∈ ℝ^{N×N} be the kernel matrix whose (i,j)th entry is κ(a_i, a_j), and let k_{A,x} ∈ ℝ^N be the vector whose ith entry is κ(a_i, x). Using the feature representations, the correlation (dot product) between a pixel φ(x) and a dictionary atom φ(a_i) is computed by

    c_i = ⟨φ(x), φ(a_i)⟩ = κ(x, a_i) = (k_{A,x})_i    (10)

the orthogonal projection coefficient of φ(x) onto a set of selected dictionary atoms {φ(a_n)}_{n∈Λ} is given as

    p_Λ = ((K_A)_{Λ,Λ})^{−1} (k_{A,x})_Λ    (11)

and the residual vector between φ(x) and its approximation using the selected atoms {φ(a_n)}_{n∈Λ} = (A_φ)_{:,Λ} is expressed as

    φ(r) = φ(x) − (A_φ)_{:,Λ} ((K_A)_{Λ,Λ})^{−1} (k_{A,x})_Λ.    (12)

Note that the feature representation of the residual vector φ(r) in (12) cannot be evaluated explicitly. However, the correlation between φ(r) and an atom φ(a_i) can be computed by

    c_i = ⟨φ(r), φ(a_i)⟩ = (k_{A,x})_i − (K_A)_{i,Λ} ((K_A)_{Λ,Λ})^{−1} (k_{A,x})_Λ.    (13)

The KOMP and KSP greedy algorithms, similar to the linear OMP and SP algorithms, are used to locate the support Λ of the sparse vector α̂′. The KOMP algorithm augments the support set by one index per iteration, given by λ = arg max_{i=1,...,N} c_i with c_i defined in (13) and φ(r) being the residual vector from the previous iteration, until K_0 atoms are selected or the approximation error (i.e., the norm of the residual vector in (12)) falls below a preset threshold. The KSP algorithm maintains a set of K_0 indices with a backtracking mechanism. At each iteration, the index set is refined by adding K_0 new candidates, whose associated atoms have the K_0 highest correlations (13) with the residual vector from the previous iteration, to the current list and then discarding the K_0 least significant ones from the list of 2K_0 candidates. This process repeats until a certain stopping criterion is met. In both the KOMP and KSP algorithms, after the support set Λ of α̂′ is determined, the entries of α̂′ indexed on Λ are computed by the orthogonal projection of the test pixel onto the selected dictionary atoms using (11). The KOMP/KSP algorithms can be viewed as special cases, with T = 1, of the kernelized SOMP/SSP (KSOMP/KSSP) algorithms (Algorithms 1 and 2) proposed in the next section; their details are thus omitted herein.

Once the sparse vector α̂′ is recovered, the residual between the test sample and the mth-class reconstruction in the high-dimensional feature space is computed by

    r_m(x) = ‖φ(x) − (A_φ)_{:,Ω_m} α̂′_{Ω_m}‖
           = ⟨φ(x) − (A_φ)_{:,Ω_m} α̂′_{Ω_m}, φ(x) − (A_φ)_{:,Ω_m} α̂′_{Ω_m}⟩^{1/2}
           = (κ(x,x) − 2 α̂′^T_{Ω_m} (k_{A,x})_{Ω_m} + α̂′^T_{Ω_m} (K_A)_{Ω_m,Ω_m} α̂′_{Ω_m})^{1/2}    (14)

where k_{A,x} and K_A are as defined above, and Ω_m is the index set associated with the mth class. The class label of x is determined as

    Class(x) = arg min_{m=1,...,M} r_m(x).    (15)
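A compact sketch of KOMP along the lines of (10)–(15) is given below: every correlation and projection is computed through K_A, k_{A,x}, and κ(x,x) only, and a small ridge term mirrors the λI regularization introduced later for stable inversion. Function and variable names are illustrative and not taken from the authors' code.

```python
import numpy as np

def komp(K_A, k_Ax, K0, lam=1e-5):
    """Kernel OMP: locate the support of alpha' in (9) using kernel values only."""
    support = []
    for _ in range(K0):
        if support:
            G = K_A[np.ix_(support, support)] + lam * np.eye(len(support))
            p = np.linalg.solve(G, k_Ax[support])        # projection coefficients, eq. (11)
            corr = k_Ax - K_A[:, support] @ p            # residual correlations, eq. (13)
        else:
            corr = k_Ax.copy()                           # first iteration, eq. (10)
        corr[support] = 0.0                              # do not reselect chosen atoms
        support.append(int(np.argmax(np.abs(corr))))
    G = K_A[np.ix_(support, support)] + lam * np.eye(len(support))
    return support, np.linalg.solve(G, k_Ax[support])    # final coefficients via eq. (11)

def komp_classify(K_A, k_Ax, kxx, labels, K0=5):
    """Label a test pixel by the minimal kernel-domain class residual, eqs. (14)-(15)."""
    support, coeffs = komp(K_A, k_Ax, K0)
    alpha = np.zeros(K_A.shape[0])
    alpha[support] = coeffs
    best, best_r2 = None, np.inf
    for m in np.unique(labels):
        w = np.where(labels == m)[0]
        r2 = kxx - 2.0 * alpha[w] @ k_Ax[w] + alpha[w] @ K_A[np.ix_(w, w)] @ alpha[w]
        if r2 < best_r2:
            best, best_r2 = m, r2
    return best
```

With K_A and k_Ax computed as in the previous sketch and kxx = κ(x, x), komp_classify returns the predicted class of the test pixel.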
B. Joint Sparsity in Feature Space

The JSM in (4) can also be extended to the feature space as follows:

    X_φ = [φ(x_1) ... φ(x_T)] = [A_φ α′_1 ... A_φ α′_T] = A_φ [α′_1 ... α′_T] = A_φ S′    (16)

where the vectors {α′_t}_{t=1,...,T} share the same support. The row-sparse matrix S′ is recovered by solving

    Ŝ′ = arg min ‖X_φ − A_φ S′‖_F  subject to  ‖S′‖_{row,0} ≤ K_0.    (17)

In this paper, we propose the KSOMP and the KSSP algorithms in order to approximately solve the above joint sparse recovery problem in (17).

In KSOMP, at every iteration, the atom that simultaneously yields the best approximation to all of the T pixels (or residuals after initialization) is selected. Specifically, let C ∈ ℝ^{N×T} be the correlation matrix whose (i,j)th entry is the correlation between φ(a_i) and φ(r_j), where φ(r_j) is the residual vector of φ(x_j). The new atom is then selected as the one associated with the row of C that has the maximal ℓ_p-norm for some p ≥ 1. The KSOMP algorithm is summarized in Algorithm 1. Note that when computing the projection in (11) and the correlation in (13), a regularization term λI is added in order to stabilize the matrix inversion, where λ is typically a small scalar (e.g., λ = 10^{−5} is used in Section IV) and I is an identity matrix whose dimensionality should be clear from the context. The parameter λ does not seriously affect the classification performance; it is, however, necessary for stable matrix inversion. Although the kernel matrix K_A is usually invertible, (K_A)_{Λ,Λ} may be nearly singular for certain small index sets Λ.

Algorithm 1: KSOMP
Input: B×N dictionary A = [a_1 ... a_N], B×T data matrix X = [x_1 ... x_T], kernel function κ, and a stopping criterion.
Initialization: compute the kernel matrices K_A ∈ ℝ^{N×N}, whose (i,j)th entry is κ(a_i, a_j), and K_{A,X} ∈ ℝ^{N×T}, whose (i,j)th entry is κ(a_i, x_j). Set the index set Λ_0 = {arg max_{i=1,...,N} ‖(K_{A,X})_{i,:}‖_p}, with some p ≥ 1, and the iteration counter t = 1.
while the stopping criterion has not been met do
  1) Compute the correlation matrix
     C = K_{A,X} − (K_A)_{:,Λ_{t−1}} ((K_A)_{Λ_{t−1},Λ_{t−1}} + λI)^{−1} (K_{A,X})_{Λ_{t−1},:} ∈ ℝ^{N×T}
  2) Select the new index as λ_t = arg max_{i=1,...,N} ‖C_{i,:}‖_p, p ≥ 1
  3) Update the index set Λ_t = Λ_{t−1} ∪ {λ_t}
  4) t ← t + 1
end while
Output: Index set Λ = Λ_{t−1}; the sparse representation Ŝ′, whose nonzero rows indexed by Λ are Ŝ′_{Λ,:} = ((K_A)_{Λ,Λ} + λI)^{−1} (K_{A,X})_{Λ,:}

Similarly, KSSP is a simultaneous version of KSP, in which the K_0 atoms that best simultaneously approximate all of the T residuals in terms of the ℓ_p-norm are chosen. The KSSP algorithm is summarized in Algorithm 2. Note that the step for computing the residual vectors (12) is incorporated into the computation of the correlation matrix in Step 1) of both KSOMP and KSSP.

Algorithm 2: KSSP
Input: B×N dictionary A = [a_1 ... a_N], B×T data matrix X = [x_1 ... x_T], kernel function κ, and a stopping criterion.
Initialization: compute the kernel matrices K_A ∈ ℝ^{N×N}, whose (i,j)th entry is κ(a_i, a_j), and K_{A,X} ∈ ℝ^{N×T}, whose (i,j)th entry is κ(a_i, x_j). Set the index set Λ_0 = {K_0 indices corresponding to the K_0 largest numbers in ‖(K_{A,X})_{i,:}‖_p, p ≥ 1, i = 1,...,N}, and the iteration counter t = 1.
while the stopping criterion has not been met do
  1) Compute the correlation matrix
     C = K_{A,X} − (K_A)_{:,Λ_{t−1}} ((K_A)_{Λ_{t−1},Λ_{t−1}} + λI)^{−1} (K_{A,X})_{Λ_{t−1},:} ∈ ℝ^{N×T}
  2) Find the index set I = {K_0 indices corresponding to the K_0 largest numbers in ‖C_{i,:}‖_p, p ≥ 1, i = 1,...,N}
  3) Update the candidate index set Λ̃_t = Λ_{t−1} ∪ I
  4) Compute the projection coefficients P = ((K_A)_{Λ̃_t,Λ̃_t} + λI)^{−1} (K_{A,X})_{Λ̃_t,:} ∈ ℝ^{2K_0×T}
  5) Update the index set Λ_t = {K_0 indices in Λ̃_t corresponding to the K_0 largest numbers in ‖P_{i,:}‖_p, p ≥ 1}
  6) t ← t + 1
end while
Output: Index set Λ = Λ_{t−1}; the sparse representation Ŝ′, whose nonzero rows indexed by Λ are Ŝ′_{Λ,:} = ((K_A)_{Λ,Λ} + λI)^{−1} (K_{A,X})_{Λ,:}

Once the matrix Ŝ′ is recovered, the total residual between the T neighboring pixels and their approximations from the mth-class training samples is computed by

    r_m(x_1) = ( Σ_{i=1}^{T} [ κ(x_i, x_i) − 2 Ŝ′^T_{Ω_m,i} (K_{A,X})_{Ω_m,i} + Ŝ′^T_{Ω_m,i} (K_A)_{Ω_m,Ω_m} Ŝ′_{Ω_m,i} ] )^{1/2}    (18)

where Ω_m ⊂ {1,2,...,N} is the index set associated with the mth class, and K_{A,X} and K_A are as defined in Algorithms 1 and 2. The label for the center pixel x_1 is then determined by the total residual

    Class(x_1) = arg min_{m=1,...,M} r_m(x_1).    (19)
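The following is a minimal sketch of Algorithm 1 (KSOMP) with a fixed number of iterations K_0 in place of a generic stopping criterion; K_AX denotes the N×T matrix with entries κ(a_i, x_j), and p, lam, and the loop structure are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def ksomp(K_A, K_AX, K0, lam=1e-5, p=2):
    """KSOMP (Algorithm 1): greedily select K0 atoms shared by all T neighboring pixels."""
    # Initialization: atom whose correlation row has the largest l_p-norm.
    support = [int(np.argmax(np.linalg.norm(K_AX, ord=p, axis=1)))]
    for _ in range(K0 - 1):
        G = K_A[np.ix_(support, support)] + lam * np.eye(len(support))
        # Step 1: correlation matrix C between all atoms and the current residuals.
        C = K_AX - K_A[:, support] @ np.linalg.solve(G, K_AX[support, :])
        C[support, :] = 0.0
        # Steps 2-3: new atom = row of C with the maximal l_p-norm; augment the support.
        support.append(int(np.argmax(np.linalg.norm(C, ord=p, axis=1))))
    # Output: nonzero rows of the row-sparse coefficient matrix S'.
    G = K_A[np.ix_(support, support)] + lam * np.eye(len(support))
    return support, np.linalg.solve(G, K_AX[support, :])
```

The class of the center pixel would then follow from the total residual (18)–(19), evaluated from K_A, K_AX, and the returned rows of S′.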
C. Kernel Sparse Representation With a Composite Kernel

Another way to address the contextual correlation within HSI is through a CK [7], which takes into account the spatial correlation between neighboring pixels by combining kernels dedicated to the spectral and the spatial information. The CK approach has been shown to significantly outperform the spectral-only classifier in HSI classification [42]. This method, although originally proposed for SVM, can be readily incorporated into other classifiers which operate in the feature space, such as kernel logistic regression (KLR) and the kernel sparse representation-based classifier proposed in this paper. Specifically, let x_i^w be the spectral pixel at location i in an HSI and x_i^s be the spatial information extracted from a small neighborhood centered at location i, which is usually the mean and/or the standard deviation of the pixels within the neighborhood. The new pixel entity at this location can be redefined as x_i = {x_i^w, x_i^s}. Note that in the previous sections, x_i contains only spectral information (i.e., x_i = x_i^w). The spectral and spatial information can then be combined in a variety of ways, including stacking, direct summation, weighted summation, and cross-information kernels [7]. In this paper, we consider the weighted summation kernel, which has been shown to yield the best classification performance compared to other types of CKs [7]. The kernel function in this case is

    κ(x_i, x_j) = μ κ_s(x_i^s, x_j^s) + (1 − μ) κ_w(x_i^w, x_j^w)    (20)

where μ ∈ (0,1), and κ_s and κ_w are the kernel functions of the spatial and spectral features, respectively.

The CKs can be directly applied to the pixel-wise sparsity model in the feature space in (8). The sparse representation vector can be recovered using the KOMP or KSP algorithm, where the kernel matrix K_A is now a weighted summation of the spectral and spatial kernel matrices of the training dictionary A, and the vector k_{A,x} also needs to be modified accordingly.

It is worth noting that the CK approach differs from the kernel JSM discussed in Section III-B in terms of incorporating localized spatial features. The JSM involves only the spatial information of the test pixels, and no prior knowledge about the neighbors of the training pixels is needed. On the other hand, for the CKs, the local spatial information for both the training and test sets is necessary. Moreover, the JSM does not assume a sum or average of the samples but treats all pixels in a small neighborhood equally and finds the sparsity pattern that simultaneously represents these pixels. The CK approach, however, is not limited to local spatial features and can also be used to combine other types of features (e.g., the centroid of a cluster as a global feature) and information sources.
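A small sketch of the weighted summation CK in (20) is shown below, with the spatial feature taken to be a per-pixel neighborhood mean computed beforehand; μ, the γ values, and the toy shapes are illustrative assumptions. The resulting matrix can be used as a drop-in replacement for K_A in KOMP/KSP.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    d2 = np.sum(X**2, 0)[:, None] + np.sum(Y**2, 0)[None, :] - 2.0 * X.T @ Y
    return np.exp(-gamma * np.maximum(d2, 0.0))

def composite_kernel(Xw, Yw, Xs, Ys, mu=0.4, gamma_w=0.5, gamma_s=0.5):
    """Weighted summation CK (eq. (20)): mu*kappa_s(spatial) + (1-mu)*kappa_w(spectral)."""
    return mu * rbf_kernel(Xs, Ys, gamma_s) + (1.0 - mu) * rbf_kernel(Xw, Yw, gamma_w)

# Toy usage: columns are pixels; the spatial part is, e.g., the mean spectrum of a 9x9 window.
rng = np.random.default_rng(2)
Aw = rng.normal(size=(50, 60))                  # spectral part of the training dictionary
As = rng.normal(size=(50, 60))                  # spatial (neighborhood-mean) part
K_A = composite_kernel(Aw, Aw, As, As)          # composite dictionary kernel matrix
```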
IV. EXPERIMENTAL RESULTS

[Table I: The 16 ground-truth classes in the AVIRIS Indian Pines image.]
[Fig. 1: (a) Training and (b) test sets for the Indian Pines image.]

In this section, we show the effectiveness of the proposed algorithms on the classification of several hyperspectral data sets. For each image, we solve the sparse recovery problems in (2), (5), (9), and (17) for each test sample and then determine the class by the minimal residual (the results are denoted by OMP/SP, KOMP/KSP, SOMP/SSP, and KSOMP/KSSP, respectively). The results of KOMP and KSP with CKs, as discussed in Section III-C, are denoted by KOMPCK and KSPCK, respectively. The classification results are then compared visually and quantitatively to those obtained by the classical SVM classifier and sparse multinomial KLR. For the SVM and KLR classifiers, we use a spectral-only kernel (denoted by SVM/KLR), as well as a CK (denoted by SVMCK/KLRCK). In all classifiers with a CK, we use a weighted summation kernel, and the spatial information is the mean of the pixels in a small neighborhood. The parameters for KLR, KLRCK, SVM, and SVMCK are obtained by cross-validation.

The first HSI in our experiments is the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) image Indian Pines [43]. The AVIRIS sensor generates 220 bands across the spectral range from 0.2 to 2.4 μm. In the experiments, the number of bands is reduced to 200 by removing 20 water absorption bands. This image has a spatial resolution of 20 m per pixel and a spatial dimension of 145×145. It contains 16 ground-truth classes. For each class, we randomly choose around 10% of the labeled samples for training and use the remaining 90% for testing, as shown in Table I and Fig. 1. RBF kernels are used in all kernel-based classifiers (i.e., SVM, SVMCK, KLR, KLRCK, KOMP, KSP, KSOMP, KSSP, KOMPCK, and KSPCK). Since this image consists of large homogeneous regions, a large spatial window of size 9×9 (T = 81) is used in the classifiers with a CK and in the JSMs (4) and (16).

The classification performance for each of the 16 classes, the overall accuracy (OA), the average accuracy (AA), and the κ coefficient measure [44] on the test set are shown in Table II. The OA is the ratio between correctly classified test samples and the total number of test samples, the AA is the mean of the 16 class accuracies, and the κ coefficient is a robust measure of the degree of agreement.
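For reference, these three measures can be computed from the confusion matrix of the test set as in the short sketch below; the toy labels are illustrative, and this is a generic computation, not the authors' evaluation code.

```python
import numpy as np

def accuracy_measures(y_true, y_pred, M):
    """Overall accuracy (OA), average accuracy (AA), and the kappa coefficient."""
    cm = np.zeros((M, M))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                              # rows: true class, columns: predicted class
    n = cm.sum()
    oa = np.trace(cm) / n                          # correctly classified / total test samples
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))     # mean of the per-class accuracies
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2  # expected agreement by chance
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(accuracy_measures(y_true, y_pred, M=3))
```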
[Table II: Classification accuracy (%) for the Indian Pines image using 10% training samples, as shown in Fig. 1 and Table I.]
[Fig. 2: Classification maps and overall classification accuracy (OA) for the Indian Pines image on labeled pixels with 10% training samples.]

The classification maps on the labeled pixels are presented in Fig. 2, where the algorithm and OA are shown on top of each corresponding map. One can clearly see that incorporating the contextual correlation and operating in the feature space both significantly improve the classification accuracy. The KOMPCK and KSPCK algorithms outperform all other classifiers; the OA for both is greater than 98%. The KSOMP and KSSP algorithms also yield superior performance, with only about 1% lower OA than KOMPCK and KSPCK. Note that the kernel JSM used by KSOMP and KSSP does not assume any prior knowledge of the neighbors of the training samples, as the CK approach does.

The sparsity level K_0 and the RBF parameter γ used in the above experiments are obtained from a small validation set. An n-fold cross-validation would not be appropriate for finding the optimal sparsity level unless n is large (e.g., leave-one-out cross-validation). This is because the sparsity level K_0 is related to the size of the dictionary; therefore, the optimal K_0 for part of the dictionary may no longer be optimal for the entire dictionary. We now examine how these two parameters affect the classification performance on the Indian Pines image. We use a randomly selected 10% of all labeled samples as the training set and the remaining samples as the test set, then vary K_0 from 5 to 80 and γ from 2^{-3} to 2^{12} in KOMP, KSP, KSOMP, and KSSP. The experiment for each γ, each K_0, and each of the four algorithms is repeated five times using different randomly chosen training sets to avoid any bias induced by random sampling. The window size is fixed at 9×9 for KSOMP and KSSP due to the smoothness of the image. The OA on the test set, averaged over five independent realizations, is shown in Fig. 3. The bars indicate the maximal and minimal accuracies in the five runs at each point; the fluctuation is usually within 2%, and within 1% in a majority of cases. One can observe from Fig. 3(a) and (b) that, for the pixel-wise kernel sparsity model, γ = 512 leads to the highest OA at all sparsity levels. For a fixed γ, the performance of KOMP and KSP generally improves as K_0 increases and tends to saturate as K_0 reaches 30–50. For KSOMP and KSSP, as shown in Fig. 3(c) and (d), the same tendency cannot be observed. However, the kernel JSM is more stable than the pixel-wise model: for a large range of sparsity levels K_0 and sufficiently large γ, the OA is always around 96% with a small variance. The stable performance suggests that we could also use empirical parameters K_0 and γ.

The next two HSIs used in our experiments, the University of Pavia and the Center of Pavia images, are urban images acquired by the Reflective Optics System Imaging Spectrometer (ROSIS). The ROSIS sensor generates 115 spectral bands ranging from 0.43 to 0.86 μm and has a spatial resolution of 1.3 m per pixel [42]. The University of Pavia image consists of 610×340 pixels, each having 103 bands, with the 12 most noisy bands removed. There are nine ground-truth classes of interest, as shown in Table III. For this image, we follow the