EFFICIENT DIVIDE-AND-CONQUER CLASSIFICATION BASED ON FEATURE-SPACE DECOMPOSITION

Qi Guo$^{1}$, Bo-Wei Chen$^{2*}$, Feng Jiang$^{3}$, Xiangyang Ji$^{1}$ and Sun-Yuan Kung$^{2}$

$^{1}$ Tsinghua University, Beijing 100084, China
$^{2}$ Princeton University, Princeton, NJ 08544, USA (email: [email protected])
$^{3}$ Harbin Institute of Technology, Harbin 150001, China

arXiv:1501.07584v1 [cs.LG] 29 Jan 2015

ABSTRACT

This study presents a divide-and-conquer (DC) approach based on feature-space decomposition for classification. When large-scale datasets are present, typical approaches usually employ truncated kernel methods on the feature space or DC approaches on the sample space. However, this does not guarantee separability between classes, owing to overfitting. To overcome such problems, this work proposes a novel DC approach on feature spaces consisting of three steps. Firstly, we divide the feature space into several subspaces using the decomposition method proposed in this paper. Subsequently, these feature subspaces are sent into individual local classifiers for training. Finally, the outcomes of the local classifiers are fused together to generate the final classification results. Experiments on large-scale datasets are carried out for performance evaluation. The results show that the proposed DC method reduces error rates compared with state-of-the-art fast SVM solvers, e.g., by 10.53% and 7.53% on the RCV1 and covtype datasets, respectively.

Index Terms: Feature space decomposition, feature space division, fusion, divide-and-conquer, classification

1. INTRODUCTION

Typical kernel-based classification, such as Support Vector Machines (SVMs) [1] and Kernel Ridge Regression (KRR) [2], usually employs Radial Basis Functions (RBFs) as the kernel, for RBFs can effectively delineate the distribution of the data by using mixtures of Gaussian models. Furthermore, RBFs can map the input features into the intrinsic space [3] that is spanned by infinite-dimensional vectors. This correspondingly increases the opportunity of creating a discriminant hyperplane in the empirical space [3], subsequently enhancing discriminability. However, when input dimensions are sufficiently large, calculation of a kernel matrix becomes a burden. Moreover, RBFs may lead to overfitting due to infinite dimensions. To deal with such problems, rather than using conventional RBFs, Wu et al. [4] proposed using Truncated Radial Basis Functions (TRBFs) to avoid generating infinite dimensions in the intrinsic space. Furthermore, they also devised an intrinsic data matrix, which was derived from a finite-decomposable kernel, to replace calculation of kernel matrices in the empirical space. Therefore, the time complexity for KRR was reduced from the original $O(N^3)$ to $\min(N^3, J^2N + J^3)$, where $N$ is the number of instances, and $J$ is the number of feature dimensions expanded by TRBFs. Moreover, avoiding direct calculation of kernel matrices effectively resolved the need for matrix expansion.

The success of the TRBF-based method relies on dimensional reduction in the intrinsic space and the conversion from the empirical space to the intrinsic space. Although the computational load is relieved without losing too much accuracy, that method [4] did not improve discriminability and separability between features. Furthermore, the algorithmic architecture of that method did not support distributed processing, especially when mainstream toolboxes like Apache Hadoop (hadoop.apache.org) and Spark (spark.apache.org) adopt a divide-and-conquer strategy in their implementation. Proposing a new architecture that supports divide-and-conquer computation correspondingly becomes necessary.

In response to such a demand, several divide-and-conquer classifiers [5], [6] based on kernel tricks have been developed so far. Zhang et al. used divide-and-conquer KRR [5] to support computation of large-scale data. Firstly, their method randomly partitioned a dataset into subsets of equal size. Local solutions were subsequently computed by using KRR based on each subset. By averaging the local solutions, a global predictor was therefore obtained. Instead of using randomized data selection as Zhang et al. did, Hsieh et al. [6] focused on systematic data division before applying divide-and-conquer classifiers to the data. In their approach, kernel K-means clustering was performed to select the representatives of the entire input data. Next, the members of a subset were selected based on one representative. Their experimental results showed a favorable accuracy when systematic data division was used.

Although the above-mentioned approaches realized the divide-and-conquer concept in their algorithms, overfitting of the kernel space was not fully addressed and resolved. To deal with the aforementioned problems, this study proposes

1) A novel approach for feature-space decomposition, where the original feature space is converted to subspaces. Besides, the bases of each subspace are reranked according to their importance.

2) A divide-and-conquer structure that allows independent local classifiers to create discriminant hyperplanes based on subspaces rather than the entire empirical space. This lowers computational complexity while avoiding overfitting problems.

The rest of this paper is organized as follows. Section 2 introduces the overview of the proposed method. Section 3 then describes details of the proposed feature-space decomposition and fusion method. Next, Section 4 summarizes the performance of the proposed method and the analysis results. Conclusions are finally drawn in Section 5.

2. SYSTEM OVERVIEW

Given an $M \times N$ data matrix $X$ with $N$ instances and $M$ features and a $1 \times N$ label vector $y$, denote the feature space as $\Omega$, so that $X$ is the projection of the $N$ instances on $\Omega$. We first define the feature-space decomposition method $D = \{T, I\}$, where $T$ is a feature-space transform function, and $I$ is a set of feature index groups.

The decomposition method $D$ contains five sub-methods, which are discussed in Section 3.1, namely, Random Decomposition (RD), Principal Component Analysis (PCA), Discriminant Component Analysis (DCA), Block Cholesky Decomposition (BCD) and Approximate Block Decomposition (ABD). Furthermore, each has an $M \times M$ sub-transform matrix, denoted as $W_{RD}$, $W_{PCA}$, $W_{DCA}$, $W_{BCD}$ and $W_{ABD}$. Also, each contains a subset of feature index groups, e.g., $I^{RD} = \{I_i^{RD} \mid I_i^{RD} \subset \{1, 2, \cdots, M\}, i = 1, 2, \cdots, h^{RD}\}$, where $h^{RD}$ is the number of feature subspaces decomposed by the RD sub-method. As for $T$, we have

\[
\Omega^{*} = T(\Omega), \quad X^{*} = T(X) = WX =
\begin{bmatrix} W_{RD} \\ W_{PCA} \\ W_{DCA} \\ W_{BCD} \\ W_{ABD} \end{bmatrix} X \tag{1}
\]

where $W$ and $\Omega^{*}$ are respectively the transform matrix and the new feature space. As for $I$, we have $I = \{I^{RD}, I^{PCA}, \cdots, I^{ABD}\}$, and the total number of subspaces is $h = h^{RD} + h^{PCA} + \cdots + h^{ABD}$. Not all the sub-methods need to be used in real practice. If some are not applied, the corresponding $W_{Method}$ and $I^{Method}$ can just be empty.

The original feature space $\Omega$ is first transformed to $\Omega^{*}$ by $T$ and then decomposed into subspaces $\Omega_1^{*}, \Omega_2^{*}, \ldots, \Omega_h^{*}$ by $I$; all the instances are first projected to $X^{*}$ and subsequently decomposed into $X_1^{*}, X_2^{*}, \ldots, X_h^{*}$. Then, a local classifier $f_i$ ($i = 1, 2, \cdots, h$), e.g., an SVM or a KRR, is trained using data matrix $X_i^{*}$. Let row vectors $R_i = f_i(X_i^{*})$ denote the results of $f_i$ on $X_i^{*}$ and

\[
R = \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_h \end{bmatrix}.
\]

The elements of $R_i$ can be discrete labels or continuous prediction values. The system generates the output based on $R$ using the fusion methods discussed in Section 3.2.

3. PROPOSED DIVISION AND FUSION METHODS

3.1. Feature-Space Decomposition

Section 2 shows that the merit of the proposed method is that it can perform classification within the subspaces and ignore the dependence among subspaces. Theoretically, a decomposition method should reduce the dependence between any two feature subspaces as much as possible while retaining the dependence within each subspace. This is the reason for conducting a transformation on the feature space before division.

Among all the sub-methods in this study, the simplest idea is RD, which directly decomposes the feature space based on $I$. Its $W_{RD}$ is an $M \times M$ identity matrix.

As for PCA, we conduct PCA on the data matrix $X$ and split up the features according to $I$. Since PCA diagonalizes the feature covariance matrix $S$, this method eliminates the relevance of different features among and within subspaces. If the data obey a Gaussian distribution, PCA also eliminates the dependence of features among and within feature subspaces.

DCA also conducts an orthogonal transformation like PCA, while its discriminant matrix is $[S_w + \rho I]^{-1} S$, where $S_w$ is the within-class scatter matrix, and $\rho$ is the ridge parameter [3]. We have

\[
S_w = \sum_{l=1}^{L} \sum_{j=1}^{N_l} \left[x_j^{(l)} - \mu^{(l)}\right]\left[x_j^{(l)} - \mu^{(l)}\right]^{T} \tag{2}
\]

where $L$ is the number of classes, $N_l$ represents the number of samples in class $l$, and $\mu^{(l)}$ specifies the average point of class $l$. We conduct generalized eigenvector decomposition [7] to obtain the eigenvectors $\nu_1, \nu_2, \ldots, \nu_M$ and eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$, such that

\[
S \nu_k = \lambda_k [S_w + \rho I] \nu_k, \quad k = 1, 2, \ldots, M \tag{3}
\]

and the transform matrix is defined as $W_{DCA} = [\nu_1, \nu_2, \cdots, \nu_M]$. Computing $S$ and $S_w$ each takes $O(M^2 N)$ time. As $[S_w + \rho I]$ can hardly be singular, the complexity of the generalized eigenvalue decomposition equals that of $[S_w + \rho I]^{-1} S \nu_k = \lambda_k \nu_k$, $k = 1, 2, \ldots, M$, which is of $O(M^3)$ time complexity. Therefore, the total complexity of DCA is $O(2M^2 N + M^3)$.

BCD exploits a blocked Doolittle Algorithm, which is a form of Gaussian transformation rather than an orthogonal transformation, to eliminate the relevance among subspaces while retaining the relevance within subspaces. For a symmetric block matrix $A$, we eliminate the first row and column of blocks, as shown in Equation 4.

\[
\begin{bmatrix}
A_{11} & O_{12} & \cdots & O_{k1}^{T} \\
O_{21} & A_{22}^{\prime} & \cdots & A_{k2}^{\prime T} \\
\vdots & \vdots & \ddots & \vdots \\
O_{k1} & A_{k2}^{\prime} & \cdots & A_{kk}^{\prime}
\end{bmatrix}
= B_1
\begin{bmatrix}
A_{11} & \cdots & A_{k1}^{T} \\
\vdots & \ddots & \vdots \\
A_{k1} & \cdots & A_{kk}
\end{bmatrix}
(B_1)^{T} \tag{4}
\]

where

\[
B_1 = \begin{bmatrix}
I_{11} & O_{12} & \cdots & O_{1k} \\
-A_{21}A_{11}^{-1} & I_{22} & \cdots & O_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
-A_{k1}A_{11}^{-1} & O_{k2} & \cdots & I_{kk}
\end{bmatrix},
\]

and the division of blocks remains the same. The subscript of $B$ indicates the row and column it eliminates. Iteratively, we subsequently generate $B_2, \cdots, B_k$ to sequentially eliminate the remaining rows and columns of blocks. The main goal of BCD is to sequentially block-diagonalize the discriminant matrix based on the blocked Doolittle Algorithm. As shown in Algorithm 1, $X$ is firstly rearranged to generate $\tilde{X}$ according to $I^{BCD}$, in which MatrixSplit($X$, $I^{BCD}$) splits $X$ into $\tilde{X}_1, \tilde{X}_2, \cdots, \tilde{X}_{h^{BCD}}$ according to $I^{BCD}$. The discriminant matrix of BCD is $S$. Function BlockedDoolittle($\tilde{X}$, $I$) generates $B_i$ ($i = 1, 2, \cdots, h^{BCD}$) based on the idea of Equation 4 to eliminate the $i$th row and column of $S$. The BCD transform matrix is $W_{BCD} = B_{h^{BCD}} B_{h^{BCD}-1} \cdots B_1$. Compared with BCD, PCA needs an $M \times M$ matrix inversion, whose complexity is $O(M^3)$ on a non-sparse matrix, whereas BCD only uses an $\frac{M}{m} \times \frac{M}{m}$ matrix $\frac{m(m+1)}{2}$ times if the features are divided equally, which costs only $\frac{1}{m}$ of the time of PCA.

Algorithm 1 $[X^{*}, W_{BCD}]$ = BCD($X$, $I$)
    $\{\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_m\}$ = MatrixSplit($X$, $I$)
    $\tilde{X} = [\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_m]$
    $S = \tilde{X}\tilde{X}^{T} = \begin{bmatrix} S_{11} & \cdots & S_{1m} \\ \vdots & \ddots & \vdots \\ S_{m1} & \cdots & S_{mm} \end{bmatrix}$, where $S_{ij} = \tilde{X}_i \tilde{X}_j^{T}$
    for $i = 1$ to $h^{BCD}$ do
        $B_i$ = BlockedDoolittle($S$, $i$)
        $S = B_i S (B_i)^{T}$
    end for
    $W_{BCD} = B_{h^{BCD}} B_{h^{BCD}-1} \cdots B_1$
    $X^{*} = W_{BCD} \tilde{X}$

Besides the aforementioned orthogonal transformations of PCA and DCA, as well as the Gaussian transformation of BCD, we also propose an approximate orthogonal transformation on which the ABD method is based. First we define a new operator $\otimes$ in Definition 1.

Definition 1. For two block matrices $A = \{A_{ij}\}$ and $B = \{B_{ij}\}$ with the same size and division of blocks, define the operator $\otimes$ such that

\[
A \otimes B = \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mm} \end{bmatrix} \tag{5}
\]

where $x_{ij}$ equals the sum of all the elements of the element-wise product of $A_{ij}$ and $B_{ij}$.

We rewrite $X$ as

\[
\begin{bmatrix} X_{11} & \cdots & X_{1N} \\ \vdots & \ddots & \vdots \\ X_{h^{ABD}1} & \cdots & X_{h^{ABD}N} \end{bmatrix}
\]

where we divide each instance into $m$ vectors according to $I$. The discriminant matrix is $X \otimes X$ using this division. By conducting eigenvector decomposition on the discriminant matrix, we have

\[
X \otimes X = V^{T} \Lambda V \tag{6}
\]

where $V = \{v_{ij}\}$, and each column of $V$ is an eigenvector. The transform matrix is

\[
W_{ABD} = \begin{bmatrix} v_{11} I_{11} & \cdots & v_{1N} I_{1N} \\ \vdots & \ddots & \vdots \\ v_{m1} I_{m1} & \cdots & v_{mN} I_{mN} \end{bmatrix}. \tag{7}
\]

If there are approximately equal numbers of features in each subset, computing $X \otimes X$ yields $O(MmN)$ complexity and the eigenvalue decomposition costs $O(m^3)$. Therefore, the total complexity of ABD is $O(MmN + m^3)$.

Table 1 shows the time complexity and details of the aforementioned sub-methods. By combining the five methods together, $D$ includes both supervision (i.e., DCA) and unsupervision (i.e., RD, PCA, BCD and ABD) in transformation as well as four transformation methods.

Table 1. Time complexity of different transformation methods.

Method   Complexity             Details
RD       $O(N)$                 Unsupervised, identity transform
PCA      $O(M^2N + M^3)$        Unsupervised, orthogonal transform (OT)
DCA      $O(2M^2N + M^3)$       Supervised, OT
BCD      $O(M^2N + M^3/m)$      Unsupervised, Gaussian transform
ABD      $O(MmN + m^3)$         Unsupervised, approximate OT

3.2. Feature Subspace Fusion

After obtaining the classification result matrix $R$ from the local classifiers, we weight the outcome of each subspace by training a global classifier $f_{n+1}$ using $R$ as a data matrix and $y$ as labels. The output of $f_{n+1}$ is the final prediction result. Observations show that $m < 50 \ll N$ and that TRBFKRR [4] generates favorable results for $f_{n+1}$. As the training complexity of TRBFKRR is $\min(N^3, J^2N + J^3)$, where $J = \binom{m+p}{p}$ and $p$ is the TRBF order, it is efficient to train on a data matrix with a large number of instances and a small number of features like $R$.

4. EXPERIMENTAL RESULT

In this section, we use LibLinear [8] and DCSVM [6] as local classifiers $f_1, f_2, \cdots, f_h$ in our system respectively and test the results on large-scale datasets (i.e., either $M$ or $N$ is larger than $10^4$). Our methods are notated as "DC-classifier1-classifier2", where "classifier1" indicates the classifier used for $f_1, f_2, \cdots, f_m$, and "classifier2" is for $f_{n+1}$. All the experiments are conducted on a machine with an Intel Core i7 2.1GHz CPU and 8G RAM. The datasets tested in this paper are shown in Table 3 and can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ or the UCI Machine Learning Repository.

Table 3. Dataset statistics. "#" represents "number of". A random 0.9/0.1 split is applied to the news20 dataset. A random 0.8/0.2 split is applied to the covtype and census datasets.

Dataset   #Training Instances   #Testing Instances   #Features
news20    17,997                1,999                1,355,191
RCV1      20,242                677,399              47,236
covtype   464,810               116,202              54
census    159,619               39,904               409

Feature-Space Decomposition Setting: Table 2 shows the decomposition setting in our experiments. For data matrices with high feature dimensions, e.g., news20 and RCV1, we just use RD and ABD, which have relatively low computational complexity. For data matrices with low feature dimensions, all the transformation methods can be combined together to achieve a lower error rate. We use LibLinear as the global classifier when dealing with the census dataset, as its proportion of positive and negative instances is 0.06/0.94, which can cause bias when TRBFKRR is applied.

Table 2. Decomposition setting. $N_S$ and $N_F$ stand for the number of subspaces and the number of features in one subspace.

Dataset   Proposed Method          Settings   RD       PCA   DCA   BCD   ABD
news20    DC-Liblinear-TRBF2KRR    $N_S$      2        0     0     0     10
                                   $N_F$      677596   0     0     0     135519
RCV1      DC-Liblinear-TRBF3RR     $N_S$      4        0     0     0     4
                                   $N_F$      23618    0     0     0     23618
covtype   DC-DCSVM-TRBF2KRR        $N_S$      4        4     4     4     4
                                   $N_F$      40       40    40    27    27
census    DC-DCSVM-LibLinear       $N_S$      2        2     0     0     0
                                   $N_F$      300      300   0     0     0

Linear Classification: Linear classification is conducted on the news20 and RCV1 datasets. LibLinear is exploited as the local classifiers $f_1, f_2, \cdots, f_m$, and TRBFKRR is used as the global classifier $f_{n+1}$ in our system.

Table 4. Comparison of linear classification on real-world datasets.

             news20                        RCV1
Method       Time (s)    Error Rate (%)    Time (s)   Error Rate (%)
Proposed     32          2.87              3.1        3.06
LibLinear    2           3.26              0.3        3.84
SVMlight     467$^{1}$   2.74$^{1}$        15.5       3.42
BSVM         437$^{1}$   2.73$^{1}$        13.2       3.68
L2-SVM-MFN   98$^{1}$    2.86$^{1}$        0.5        3.53

Non-Linear Classification: DCSVM is set as the local classifier for nonlinear classification. Interestingly, in DC-DCSVM-TRBFKRR, the divide-and-conquer process is conducted on both the instance dimension and the feature dimension. We evaluate it on the covtype dataset and compare with the results of the other SVM methods by Hsieh et al. [6], as shown in Table 5. The proposed method achieves the lowest error rate with relatively low time complexity on both the covtype and census datasets.

Table 5. Comparison of nonlinear classification on real-world datasets.

                      covtype ($c = 32$, $\gamma = 32$)   census ($c = 512$, $\gamma = 2^{-9}$)
Method                Time (s)   Error Rate (%)           Time (s)   Error Rate (%)
Proposed              7537       3.56                     1459       5.0
DCSVM (early)$^{2}$   672        3.88                     261        5.1
DCSVM$^{2}$           11414      3.85                     1051       5.8
LibSVM$^{2}$          83631      3.85                     2920       5.8
LaSVM$^{2}$           102603     5.61                     3514       6.8
CascadeSVM$^{2}$      5600       10.49                    849        7.0
LLSVM$^{2}$           4451       15.79                    1212       7.2
FastFood$^{2}$        8550       19.9                     851        8.4
SpSVM$^{2}$           15113      16.63                    3121       9.6
LTPU$^{2}$            11532      16.75                    1695       8.0

Moreover, compared with directly training a TRBFKRR classifier using the whole data matrix, DC-TRBFKRR-TRBFKRR greatly reduces the training complexity from $\min(N^3, J^2N + J^3)$ to $\min(N^3, \frac{J^2N}{m} + \frac{J^3}{m^2})$, which enables TRBFKRR training on data matrices with large $N$ and $M$.

5. CONCLUSION

This paper presents a feature-space decomposition classification method including five sub-methods. The experimental results show that our divide-and-conquer classification scheme can reduce error rates (e.g., by 10.53% and 7.53% on the RCV1 and covtype datasets), compared with training directly on the whole datasets, and can outperform state-of-the-art fast SVM solvers by reducing the overfitting problem. The future work will focus on providing theoretical analysis
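
The system evaluated in Section 4 (local classifiers trained on feature subspaces, followed by a global classifier trained on their outputs) can be illustrated with a small sketch. This is not the authors' implementation: plain perceptrons stand in for LibLinear and TRBFKRR, only Random Decomposition is used, instances are stored as rows (the transpose of the paper's $M \times N$ convention, so the result matrix here is $R^{T}$), and every function name below is hypothetical.

```python
import random


def random_decomposition(num_features, num_subspaces, seed=0):
    """RD: shuffle the feature indices and deal them into index groups."""
    indices = list(range(num_features))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[i::num_subspaces]) for i in range(num_subspaces)]


def score(weights, x):
    """Linear score w.x + b; the bias b is stored as the last weight."""
    return sum(w * v for w, v in zip(weights, x)) + weights[-1]


def train_perceptron(rows, labels, epochs=50):
    """Stand-in linear classifier (hypothetical helper, labels in {-1, +1})."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            if y * score(weights, x) <= 0:  # misclassified: perceptron update
                for j, v in enumerate(x):
                    weights[j] += y * v
                weights[-1] += y
    return weights


def project(x, group):
    """Restrict one instance to the features of one subspace."""
    return [x[j] for j in group]


def local_outputs(x, groups, local_models):
    """Scores of all local classifiers for one instance (one column of R)."""
    return [score(w, project(x, g)) for w, g in zip(local_models, groups)]


def dc_fit(X, y, groups):
    """Train one local classifier per subspace, then a global one on their outputs."""
    local_models = [train_perceptron([project(x, g) for x in X], y) for g in groups]
    R = [local_outputs(x, groups, local_models) for x in X]
    global_model = train_perceptron(R, y)
    return local_models, global_model


def dc_predict(x, groups, local_models, global_model):
    """Fuse the local scores with the global classifier and return a label."""
    r = local_outputs(x, groups, local_models)
    return 1 if score(global_model, r) > 0 else -1


# Toy data: 4 instances (rows) with 4 features, labels in {-1, +1}.
X = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0],
     [-1.0, -1.0, -1.0, -1.0], [-2.0, -2.0, -2.0, -2.0]]
y = [1, 1, -1, -1]
groups = random_decomposition(num_features=4, num_subspaces=2)
local_models, global_model = dc_fit(X, y, groups)
predictions = [dc_predict(x, groups, local_models, global_model) for x in X]
print(predictions)
```

Because each local classifier only sees its own index group, the local training calls are independent of one another and could be dispatched to separate workers; only the much smaller matrix of local scores has to be gathered for the global classifier.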