© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

arXiv:1601.00925v1 [cs.LG] 5 Jan 2016

Complex Decomposition of the Negative Distance Kernel

Tim vor der Brück
CC Distributed Secure Software Systems
Lucerne University of Applied Sciences and Arts
[email protected]

Steffen Eger and Alexander Mehler
Text Technology Lab
Goethe University Frankfurt am Main
{steeger,amehler}@em.uni-frankfurt.de

Abstract—A Support Vector Machine (SVM) has become a very popular machine learning method for text classification. One reason for this relates to the range of existing kernels which allow for classifying data that is not linearly separable. The linear, polynomial and RBF (Gaussian Radial Basis Function) kernels are commonly used and serve as a basis of comparison in our study. We show how to derive the primal form of the quadratic Power Kernel (PK) – also called the Negative Euclidean Distance Kernel (NDK) – by means of complex numbers. We exemplify the NDK in the framework of text categorization using the Dewey Document Classification (DDC) as the target scheme. Our evaluation shows that the power kernel produces F-scores that are comparable to the reference kernels but is – except for the linear kernel – faster to compute. Finally, we show how to extend the NDK approach by including the Mahalanobis distance.

Keywords—SVM, kernel function, text categorization

INTRODUCTION

An SVM has become a very popular machine learning method for text classification [5]. One reason for its popularity relates to the availability of a wide range of kernels including the linear, polynomial and RBF (Gaussian radial basis function) kernel. This paper derives a decomposition of the quadratic Power Kernel (PK) using complex numbers and applies it in the area of text classification. Our evaluation shows that the NDK produces F-scores which are comparable to those produced by the reference kernels while being faster to compute – except for the linear kernel. The evaluation refers to the Dewey Document Classification (DDC) [11] as the target scheme and compares our NDK-based classifier with two competitors described in [6], [10] and [18], respectively.

An SVM is a method for supervised classification introduced by [17]. It determines a hyperplane that allows for the best possible separation of the input data. (Training) data on the same side of the hyperplane is required to be mapped to the same class label. Furthermore, the margin of the hyperplane that is defined by the vectors located closest to the hyperplane is maximized. The vectors on the margin are called Support Vectors (SVs). The decision function dc: R^n → {−1, 0, 1}, which maps a vector to its predicted class, is given by dc(x) := sgn(⟨w, x⟩ + b), where sgn: R → {−1, 0, 1} is the signum function; the vector w and the constant b are determined by means of the SV optimization. If dc(x) = 0, then the vector is located directly on the hyperplane and no decision regarding the two classes is possible.

This decision function is given in the primal form. It can be converted into the corresponding dual form

    dc(x) = \mathrm{sgn}\Big(\sum_{j=1}^{m} y_j \alpha_j \langle x, x_j \rangle + b\Big)

where the α_j and b are constants determined by the SV optimization and m is the number of SVs x_j (the input vector has to be compared only with these SVs). The vectors that are located on the wrong side of the hyperplane, that is, the vectors which prevent a perfect fit of the SV optimization, are also considered SVs. Thus, the number of SVs can be quite large. The scalar product, which is used to estimate the similarity of feature vectors, can be generalized to a kernel function. A kernel function K is a similarity function of two vectors such that the matrix of kernel values is symmetric and positive semidefinite. The kernel function only appears in the dual form, whose decision function is given as follows:

    dc(x) := \mathrm{sgn}\Big(\sum_{j=1}^{m} y_j \alpha_j K(x, x_j) + b\Big)

Let Φ: R^n → R^l be a function that transforms a vector into another vector, mostly of higher dimensionality, with l ∈ N ∪ {∞}. It is chosen in such a way that the kernel function can be represented by K(x_1, x_2) = ⟨Φ(x_1), Φ(x_2)⟩. Thus, the decision function in the primal form is given by

    dc(x) := \mathrm{sgn}(\langle w, \Phi(x) \rangle + b)    (1)

Note that Φ is not necessarily uniquely defined. Furthermore, Φ might convert the data into a very high dimensional space. In such cases, the dual form should be used for the optimization process as well as for the classification of previously unseen data.
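The relation between the two forms can be illustrated with a small sketch for the linear kernel, where the primal normal vector is simply w = Σ_j α_j y_j x_j. The code below is illustrative only and not part of the paper; the support vectors, coefficients and bias are made-up values.

    import numpy as np

    def dc_dual(x, svs, y, alpha, b, kernel):
        # dual form: compare x with every support vector
        return np.sign(sum(a_j * y_j * kernel(x, sv)
                           for sv, y_j, a_j in zip(svs, y, alpha)) + b)

    def dc_primal(x, w, b):
        # primal form: a single scalar product with the normal vector w
        return np.sign(np.dot(w, x) + b)

    linear = lambda u, v: np.dot(u, v)
    svs = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])   # made-up SVs
    y = np.array([1.0, -1.0, 1.0])                           # their labels
    alpha = np.array([0.4, 0.7, 0.3])                        # dual coefficients
    b = 0.1
    w = (alpha * y) @ svs                                    # w = sum_j alpha_j y_j x_j

    x = np.array([0.5, 1.5])
    assert dc_dual(x, svs, y, alpha, b, linear) == dc_primal(x, w, b)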
One may think that the primal form is not needed. However, it has one advantage: if the normal vector of the hyperplane is known, a previously unseen vector can be classified just by applying the signum function, Φ and a scalar multiplication. This is often much faster than computing the kernel function for the previously unseen vector and each SV, as required when using the dual form.

The most popular kernel is the scalar product, also called the linear kernel. In this case, the transformation function Φ is the identity function: K_lin(x_1, x_2) := ⟨x_1, x_2⟩. Another popular kernel function is the RBF (Gaussian Radial Basis Function), given by K_r(x_1, x_2) := e^{−γ||x_1 − x_2||²}, where γ ∈ R, γ > 0 is a constant that has to be specified manually. This kernel function can be represented by a function Φ_r that transforms the vector into an infinite dimensional space [19]. For reasons of simplicity, assume that x is a vector with only one component. Then the transformation function Φ_r is given by:

    \Phi_r(x) := e^{-\gamma x^2} \Big[1, \sqrt{\tfrac{2\gamma}{1!}}\,x, \sqrt{\tfrac{(2\gamma)^2}{2!}}\,x^2, \sqrt{\tfrac{(2\gamma)^3}{3!}}\,x^3, \ldots\Big]^\top    (2)

Another often used kernel is the polynomial kernel:

    K_p(x_1, x_2) := (a \langle x_1, x_2 \rangle + c)^d    (3)

For d := 2, a = 1, c = 1, and two vector components, the function Φ_p is given by [9]: Φ_p: R² → R⁶ with

    \Phi_p(x) = (1, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2, \sqrt{2}\,x_1, \sqrt{2}\,x_2)    (4)

In general, a vector of dimension n is transformed by Φ_p into a vector of dimension \binom{n+d}{d}. Thus, the polynomial kernel leads to a large number of dimensions in the target space. However, the number of dimensions is not infinite as in the case of the RBF kernel.
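For the parameters above, the explicit map in formula (4) reproduces the polynomial kernel exactly, which a short numerical check makes tangible. The following sketch is illustrative and not taken from the paper:

    import numpy as np

    # Check that <Phi_p(x1), Phi_p(x2)> = (<x1, x2> + 1)^2 for d=2, a=1, c=1.
    def phi_p(x):
        return np.array([1.0, x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2,
                         np.sqrt(2)*x[0], np.sqrt(2)*x[1]])

    x1, x2 = np.array([0.3, -1.2]), np.array([2.0, 0.5])
    assert np.isclose(np.dot(phi_p(x1), phi_p(x2)), (np.dot(x1, x2) + 1.0)**2)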
THE POWER KERNEL

The PK is a conditionally positive definite kernel given by

    K_s(x_1, x_2) := -||x_1 - x_2||^p    (5)

for some p ∈ R [16], [2], [12]. A kernel is called conditionally positive definite if it is symmetric and satisfies the condition [14]

    \sum_{j,k=1}^{n} c_j \bar{c}_k K(x_j, x_k) \ge 0 \quad \forall c_i \in \mathbb{K} \text{ with } \sum_{i=1}^{n} c_i = 0    (6)

where \bar{c}_j is the complex conjugate of c_j. We consider here a generalized form of the PK for p := 2 (also called NDK):

    K_{pow}(x_1, x_2) := -a\,||x_1 - x_2||^2 + c    (7)

with a, c ∈ R and a > 0. The expression −a||x_1 − x_2||² + c can also be written as

    -a\,\langle x_1 - x_2, x_1 - x_2 \rangle + c = -a(\langle x_1, x_1 \rangle - 2\langle x_1, x_2 \rangle + \langle x_2, x_2 \rangle) + c    (8)

For deciding which class a previously unseen vector belongs to, we can use the decision function in the dual form:

    dc(x) := \mathrm{sgn}\Big(\sum_{j=1}^{m} y_j \alpha_j K_{pow}(x, x_j) + b\Big) = \mathrm{sgn}\Big(\sum_{j=1}^{m} y_j \alpha_j (-a\,||x - x_j||^2 + c) + b\Big)    (9)

The decision function shown in formula (9) has the drawback that the previously unseen vector has to be compared with each SV, which can be quite time consuming. This can be avoided if we reformulate formula (9) to:

    dc(x) = \mathrm{sgn}\Big(\sum_{j=1}^{m} y_j \alpha_j (-a\langle x, x\rangle + 2a\langle x, x_j\rangle - a\langle x_j, x_j\rangle + c) + b\Big)
          = \mathrm{sgn}\Big(-a\sum_{j=1}^{m} y_j \alpha_j \langle x, x\rangle + 2a\sum_{j=1}^{m} y_j \alpha_j \langle x, x_j\rangle + \sum_{j=1}^{m} y_j \alpha_j (-a\langle x_j, x_j\rangle + c) + b\Big)    (10)
          = \mathrm{sgn}\Big(-a\langle x, x\rangle \sum_{j=1}^{m} y_j \alpha_j + 2\Big\langle x, a\sum_{j=1}^{m} y_j \alpha_j x_j\Big\rangle - a\sum_{j=1}^{m} y_j \alpha_j \langle x_j, x_j\rangle + c\sum_{j=1}^{m} y_j \alpha_j + b\Big)

With z := a Σ_{j=1}^m y_j α_j x_j, u := a Σ_{j=1}^m y_j α_j ⟨x_j, x_j⟩ and c′ := c Σ_{j=1}^m y_j α_j, formula (10) can be rewritten as:

    dc(x) = \mathrm{sgn}\Big(-a\langle x, x \rangle \sum_{j=1}^{m} y_j \alpha_j + 2\langle x, z \rangle - u + c' + b\Big)    (11)

The expressions u, z, Σ_{j=1}^m y_j α_j, and c′ are identical for every vector x and can be precomputed. Note that there exists no primal form for the NDK based on real number vectors, which is stated by the following proposition.

Proposition 1: Let a, c ∈ R with a > 0 and n, l ∈ N. Then there is no function Φ_re: R^n → R^l (re indicates that Φ operates on real numbers) with

    \forall x_1, x_2 \in \mathbb{R}^n: \langle \Phi_{re}(x_1), \Phi_{re}(x_2) \rangle = -a\,||x_1 - x_2||^2 + c

Proof: If such a function existed, then, for all x ∈ R^n,

    ||\Phi_{re}(x)||^2 = \langle \Phi_{re}(x), \Phi_{re}(x) \rangle = -a \cdot 0 + c = c

which requires that c ≥ 0, since the square of a real number cannot be negative. Now consider x, y ∈ R^n with ||x − y||² > 2c/a ≥ 0. On the one hand, we have, by the Cauchy-Schwarz inequality,

    |\langle \Phi_{re}(x), \Phi_{re}(y) \rangle| \le ||\Phi_{re}(x)|| \cdot ||\Phi_{re}(y)|| = \sqrt{c} \cdot \sqrt{c} = c

On the other hand, it holds that

    |-a\,||x - y||^2 + c| = |a\,||x - y||^2 - c| > 2c - c = c

a contradiction.

Although no primal form and therefore no function Φ_re exists for real number vectors, such a function can be given if complex number vectors are used instead. In this case, the function Φ_c is defined with a real domain and a complex co-domain, Φ_c: R^n → C^{4n+1}, and

    \Phi_c(x) := (\sqrt{a}(x_1^2 - 1), \sqrt{a}\,i, \sqrt{2a}\,x_1, \sqrt{a}\,i\,x_1^2, \ldots, \sqrt{a}(x_n^2 - 1), \sqrt{a}\,i, \sqrt{2a}\,x_n, \sqrt{a}\,i\,x_n^2, \sqrt{c})^\top    (12)

Note that no scalar product can be defined for complex numbers that fulfills the usual conditions of bilinearity and positive definiteness simultaneously¹. Thus, the bilinearity condition is dropped for the official definition and only sesquilinearity is required. The standard scalar product is defined as the sum of the products of the vector components with the associated complex conjugated vector components of the other vector. Let x_1, x_2 ∈ C^n; then the scalar product is given by [1]: ⟨x_1, x_2⟩ := Σ_{k=1}^n x_{1k} \bar{x}_{2k}. In contrast, we use a modified scalar product (marked by a "∗") where, analogously to the real vector definition, the products of the vector components are summed: ⟨∗x_1, x_2⟩ := Σ_{k=1}^n x_{1k} x_{2k}. This product (not a scalar product in the strict mathematical sense) is a bilinear form but no longer positive definite. For real number vectors this modified scalar product is identical to the usual definition. With this modified scalar product we get

    \langle{*}\Phi_c(x_1), \Phi_c(x_2)\rangle = -a x_{11}^2 - a x_{21}^2 + 2a x_{11} x_{21} - \cdots - a x_{1n}^2 - a x_{2n}^2 + 2a x_{1n} x_{2n} + c = -a\,||x_1 - x_2||^2 + c    (13)

which is just the result of the NDK. The optimization can be done with the dual form; thus, no complex number optimization is necessary.

¹This can be verified by a simple calculation: consider some vector x ≠ 0 with x ∈ C^n. Since ⟨·,·⟩ is positive definite, ⟨x, x⟩ > 0; by bilinearity we would get ⟨√−i x, √−i x⟩ = −i⟨x, x⟩, which is not > 0.
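The decomposition in formulas (12) and (13) can be checked numerically. The following sketch (illustrative, not part of the paper) builds Φ_c with numpy's complex arithmetic and uses a bilinear, non-conjugating product:

    import numpy as np

    # Verify <*Phi_c(x1), Phi_c(x2)> = -a*||x1 - x2||^2 + c  (formulas (12) and (13)).
    def phi_c(x, a, c):
        comps = []
        for xk in x:
            comps += [np.sqrt(a)*(xk**2 - 1), np.sqrt(a)*1j,
                      np.sqrt(2*a)*xk, np.sqrt(a)*1j*xk**2]
        return np.array(comps + [np.sqrt(c)], dtype=complex)

    def mod_dot(u, v):
        # modified "scalar product": bilinear, without complex conjugation
        return np.sum(u * v)

    a, c = 0.5, 2.0
    x1, x2 = np.array([1.0, -2.0, 0.3]), np.array([0.5, 0.0, 1.7])
    lhs = mod_dot(phi_c(x1, a, c), phi_c(x2, a, c))       # imaginary parts cancel
    rhs = -a*np.sum((x1 - x2)**2) + c
    assert np.isclose(lhs, rhs)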
For the decision on the class to which a data vector should be assigned, we switch to the primal form. The vector w ∈ C^{4n+1} is calculated by w := Σ_{j=1}^m α_j y_j Φ_c(x_j) over all SVs x_j ∈ R^n. The decision function is then given by dc(x) := sgn(⟨∗w, Φ_c(x)⟩ + b). Note that the modified scalar product ⟨∗w, Φ_c(x)⟩ must be a real number. This is stated in the following proposition.

Proposition 2: Let α_j ∈ R, y_j ∈ {−1, 1}, x, x_j ∈ R^n and w = Σ_{j=1}^m α_j y_j Φ_c(x_j), with Φ_c as defined in formula (12). Then ⟨∗w, Φ_c(x)⟩ is a real number.

Proof: ⟨∗w, Φ_c(x)⟩ is given by

    \langle{*}w, \Phi_c(x)\rangle = \Big\langle{*}\sum_{j=1}^{m} \alpha_j y_j \Phi_c(x_j), \Phi_c(x)\Big\rangle
    = \sum_{j=1}^{m} \langle{*}\alpha_j y_j \Phi_c(x_j), \Phi_c(x)\rangle \quad (\langle{*}\cdot,\cdot\rangle \text{ is bilinear})
    = \sum_{j=1}^{m} \alpha_j y_j \langle{*}\Phi_c(x_j), \Phi_c(x)\rangle
    = \sum_{j=1}^{m} \alpha_j y_j (-a\,||x - x_j||^2 + c) \quad (\text{see formula (13)})    (14)

which is clearly a real number.

The NDK is related to the polynomial kernel since it can also be represented by a polynomial. However, it has the advantage over the polynomial kernel that it is faster to compute, since the number of dimensions in the target space grows only linearly and not exponentially with the number of dimensions in the original space [16], [2]. It remains to show that the decision functions following the primal and the dual form are also equivalent for the modified form of the scalar product. This is stated in the following proposition.

Proposition 3: Let x, x_1, ..., x_m ∈ R^n, α ∈ R^m, y ∈ {−1, 1}^m, w := Σ_{j=1}^m α_j y_j Φ_c(x_j) and ⟨∗Φ_c(z_1), Φ_c(z_2)⟩ = K(z_1, z_2) for all z_1, z_2 ∈ R^n. Then sgn(⟨∗w, Φ_c(x)⟩ + b) = sgn(Σ_{j=1}^m α_j y_j K(x, x_j) + b).

Proof:

    \mathrm{sgn}(\langle{*}w, \Phi_c(x)\rangle + b)
    = \mathrm{sgn}\Big(\Big\langle{*}\sum_{j=1}^{m} \alpha_j y_j \Phi_c(x_j), \Phi_c(x)\Big\rangle + b\Big)
    = \mathrm{sgn}\Big(\sum_{j=1}^{m} \langle{*}\alpha_j y_j \Phi_c(x_j), \Phi_c(x)\rangle + b\Big) \quad (\langle{*}\cdot,\cdot\rangle \text{ is bilinear})
    = \mathrm{sgn}\Big(\sum_{j=1}^{m} \alpha_j y_j \langle{*}\Phi_c(x_j), \Phi_c(x)\rangle + b\Big)    (15)
    = \mathrm{sgn}\Big(\sum_{j=1}^{m} \alpha_j y_j K(x_j, x) + b\Big)
    = \mathrm{sgn}\Big(\sum_{j=1}^{m} \alpha_j y_j K(x, x_j) + b\Big) \quad (K \text{ is symmetric})    (16)

Normally, feature vectors for document classification represent the weighted occurrences of lemmas or word forms [15]. Thus, such a vector contains a large number of zeros and is therefore usually considered sparse. In this case, the computational complexity of the scalar product can be reduced from O(n) (where n is the number of vector components) to some constant runtime, which is the average number of non-zero vector components. Let I_1 be the set of indices of non-zero components of vector x_1 (I_1 = {k ∈ {1, ..., n}: x_{1k} ≠ 0}), and let I_2 be defined analogously for vector x_2. The scalar product of both vectors can then be computed by Σ_{k ∈ I_1 ∩ I_2} x_{1k} · x_{2k}.

Let us now consider the case that both vectors are transformed to complex numbers before the scalar multiplication. In this case, the modified scalar product

    \langle{*}\Phi_c(x_1), \Phi_c(x_2)\rangle = \langle{*}(\phi(x_{11}), \ldots, \phi(x_{1n})), (\phi(x_{21}), \ldots, \phi(x_{2n}))\rangle + c    (17)

is considered, where φ(x_k) denotes the transformation of a single real vector component to a complex number vector and is defined as

    \phi: \mathbb{R} \to \mathbb{C}^4, \quad \phi(x_k) := (\sqrt{a}(x_k^2 - 1), \sqrt{a}\,i, \sqrt{2a}\,x_k, \sqrt{a}\,i\,x_k^2)    (18)

Note that the partial modified scalar product ⟨∗φ(x_{1k}), φ(x_{2k})⟩ can be non-zero if at least one of the two vector components x_{1k} and x_{2k} is non-zero, which is easy to see:

    \langle{*}\phi(x_k), \phi(0)\rangle = \sqrt{a}(x_k^2 - 1) \cdot \sqrt{a}(-1) + (-1)a + \sqrt{2a}\,x_k \cdot 0 + \sqrt{a}\,i\,x_k^2 \cdot 0 = a - a x_k^2 - a = -a x_k^2    (19)

Only if both vector components are zero can one be sure that the result is also zero:

    \langle{*}\phi(0), \phi(0)\rangle = (-\sqrt{a})(-\sqrt{a}) + \sqrt{a}\,i \cdot \sqrt{a}\,i + 0 + 0 = a - a = 0    (20)

Thus, the sparse data handling of two transformed complex vectors has to be modified in such a way that the vector components associated with the union, and not the intersection, of the non-zero index sets are considered in the formula:

    \langle{*}\Phi_c(x_1), \Phi_c(x_2)\rangle = \sum_{k \in I_1 \cup I_2} \langle{*}\phi(x_{1k}), \phi(x_{2k})\rangle + c    (21)
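A small sketch of this sparse evaluation of formula (21) follows; it is illustrative only, with dictionaries keyed by the non-zero indices standing in for a real sparse vector format:

    import numpy as np

    # Evaluate the transformed product over the UNION of non-zero indices,
    # since <*phi(x1k), phi(x2k)> vanishes only if both components are zero.
    def phi(xk, a):
        return np.array([np.sqrt(a)*(xk**2 - 1), np.sqrt(a)*1j,
                         np.sqrt(2*a)*xk, np.sqrt(a)*1j*xk**2], dtype=complex)

    def ndk_sparse(s1, s2, a, c):
        # s1, s2: dicts {index: value} holding only the non-zero components
        out = 0.0 + 0.0j
        for k in set(s1) | set(s2):                    # union, not intersection
            out += np.sum(phi(s1.get(k, 0.0), a) * phi(s2.get(k, 0.0), a))
        return (out + c).real

    a, c = 1.0, 3.0
    s1, s2 = {0: 2.0, 3: -1.0}, {3: 0.5, 7: 4.0}
    x1, x2 = np.zeros(8), np.zeros(8)
    x1[[0, 3]] = [2.0, -1.0]
    x2[[3, 7]] = [0.5, 4.0]
    assert np.isclose(ndk_sparse(s1, s2, a, c), -a*np.sum((x1 - x2)**2) + c)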
A further advantage of the NDK is that the Mahalanobis distance [8] can easily be integrated, which is shown in the remainder of this section. Each symmetric matrix A can be represented by the product V⁻¹DV, where D is a diagonal matrix with the eigenvalues of A on its diagonal. The square root of a matrix A is then defined as √A := V⁻¹D^{0.5}V, where D^{0.5} is the matrix with the square roots of the eigenvalues of A on its diagonal. It is obvious that √A · √A = A.

Proposition 4: Let x, z ∈ R^n, a := 1, c := 0. Then

    \langle{*}\Phi_c(\sqrt{Cov^{-1}}\,x), \Phi_c(\sqrt{Cov^{-1}}\,z)\rangle = -MH(x, z)^2

where MH(x, z) denotes the Mahalanobis distance √((x − z)^⊤ Cov⁻¹ (x − z)) and Cov denotes the covariance matrix between the feature values of the entire training data set.

Proof: We have (with C := √Cov⁻¹):

    \langle{*}\Phi_c(Cx), \Phi_c(Cz)\rangle
    = -||Cx - Cz||^2 \quad (\text{see formula (13)})
    = -(Cx - Cz)^\top (Cx - Cz)
    = -(C(x - z))^\top (C(x - z))    (22)
    = -(x - z)^\top \sqrt{Cov^{-1}}^{\,\top} \sqrt{Cov^{-1}} (x - z)
    = -(x - z)^\top Cov^{-1} (x - z)

(since the covariance matrix is symmetric and the inverse and the square root of a symmetric matrix are also symmetric).

This proposition shows that the NDK is just the negative square of the Mahalanobis distance for some vectors x̂ and ẑ if x̂ is set to √Cov⁻¹ x (analogously for ẑ), a is set to 1 and c is set to 0. Furthermore, this proposition shows that we can easily extend our primal form to integrate the Mahalanobis distance. For that, we define a function τ: R^n → R^n as follows: τ(x) = √Cov⁻¹ x. With Φ_m := Φ_c ∘ τ we have

    \langle{*}\Phi_m(x), \Phi_m(z)\rangle = -(MH(x, z))^2    (23)

which shows that we have indeed derived a primal form.
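Proposition 4 can also be checked numerically. The sketch below (illustrative only; the covariance matrix of some made-up data is assumed to be non-singular) builds √Cov⁻¹ from an eigendecomposition and uses formula (13) for the left-hand side:

    import numpy as np

    # With a=1, c=0, formula (13) gives <*Phi_c(Cx), Phi_c(Cz)> = -||Cx - Cz||^2,
    # where C = sqrt(Cov^-1); this should equal -MH(x, z)^2.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 3))                 # made-up training data
    cov = np.cov(data, rowvar=False)
    evals, evecs = np.linalg.eigh(np.linalg.inv(cov))
    C = evecs @ np.diag(np.sqrt(evals)) @ evecs.T    # square root of Cov^-1

    x, z = rng.normal(size=3), rng.normal(size=3)
    lhs = -np.sum((C @ x - C @ z)**2)
    mh2 = (x - z) @ np.linalg.inv(cov) @ (x - z)     # squared Mahalanobis distance
    assert np.isclose(lhs, -mh2)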
We use the NDK for text classification where a text is automatically labeled regarding its main topics (i.e., categories), employing the DDC as one of the leading classification schemes in digital libraries [11]. Documents are classified regarding the 10 top-level categories of the DDC. To this end, we use training sets of documents for which the DDC categories have already been assigned. Lexical features are extracted from this training set and made input to an SVM library (i.e., libsvm [3]) in order to assign one or more DDC categories to previously unseen texts. Features of instance documents are computed by means of the geometric mean of the tfidf values and the GSS coefficients of their lexical constituents [4], [13].

EVALUATION

The evaluation and training were done using 4,000 German documents, containing in total 114,887,606 words and 9,643,022 sentences and requiring a storage space of 823.51 MB. There are 400 documents of each top-level DDC category in the corpus. 2,000 of the texts were used for training, 2,000 for evaluation. The corpus consists of texts annotated according to the OAI (Open Archives Initiative) and is presented in [7]. We tested the correctness of category assignment for the 10 top-level DDC categories. Each document is mapped to at least one DDC category; multiple assignments are possible. Precision, recall and F-scores were computed for each DDC category using the NDK, the square (polynomial kernel of degree 2), the cubic (polynomial kernel of degree 3), the RBF kernel and the linear kernel (see Tables I and II). The free parameters of the square, the cubic and the RBF kernel were determined by means of a grid search on an independent data set. On the same held-out data set, we adjusted the SVM threshold parameter b to optimize the F-scores. The time (on an Intel Core i7) required to obtain the classifications was determined as well (see Table III for the real/complex NDK, the RBF, square, and linear kernel). The time needed for building the model was not measured because it is required only once and is therefore irrelevant for online text classification.

The F-score for the NDK is higher than the F-scores of all other kernels, and the NDK is faster to compute except for the linear kernel – obviously, the classification using the primal form of the linear kernel is faster than the one using the NDK. Furthermore, the complex decomposition of the NDK leads to a considerable acceleration of the classification process compared to its dual form and should therefore be preferred.

We conducted a second experiment in which we compared the linear kernel with the NDK, now using text snippets (i.e., abstracts instead of full texts) taken once more from our OAI corpus. This application scenario is more realistic for digital libraries that mainly harvest documents by means of their metadata and should therefore also be addressed by text classifiers. The kernels were trained on a set of 19,000 abstracts². This time, the categories were not treated independently of each other. Instead, the classifier selected all categories with signed distance values from the SVM hyperplane that are greater than or equal to zero. If all distance values are negative, the category with the largest (smallest absolute) distance was chosen. In this experiment, the linear kernel performed better than the NDK. Using three samples of 500 texts, the F-scores of the linear kernel are (macro-averaged over all 10 main DDC categories): 0.753, 0.735, 0.743, and those of the NDK: 0.730, 0.731, 0.737. This result indicates that the heuristic of preferring the categories of highest signed distance to the hyperplane is not optimal for the NDK. We compared our classifier with two other systems, the DDC classifier of [18] and of [6]. [18] reaches an F-score of 0.502, 0.457 and 0.450 on these three samples. The F-scores of the DDC classifier of [6] are 0.615, 0.604, and 0.574. Obviously, our classifier outperforms these competitors.

Finally, we evaluated the NDK on the Reuters-21578 corpus³ (Lewis split). This test set is very challenging since 35 of all 93 categories occurring in this split have 10 or fewer training examples. Furthermore, several texts are intentionally assigned to none of the Reuters categories. We created an artificial category for all texts that are not assigned in this sense. The histogram of category assignments is displayed in Figure 1. The F-scores for the Reuters corpus are given in Table IV. We modified the training data by filling up the instances for every category, randomly selecting positive instances such that the ratio of positive examples to all instances is the same for all categories. Hence, in most category samples some training examples were used more than once. This approach prevents categories from being preferred merely due to their multitude of training examples. The F-score of the NDK is lower than the highest macro-averaged F-score, which is obtained by the linear kernel, but it outperforms the square and the cubic kernel. Again, the parameters of the NDK, the square, the cubic and the RBF kernels were determined by means of a grid search.

²Note that in this evaluation we employed the tfidf score for weighting only, since the use of the GSS coefficient did not lead to any improvement in F-score.

³URL: http://www.daviddlewis.com/resources/testcollections/reuters21578
TABLE I. PRECISION / RECALL OF DIFFERENT KERNELS.

             NDK            RBF            Linear         Square         Cubic
Cat.   Prec.   Rec.   Prec.   Rec.   Prec.   Rec.   Prec.   Rec.   Prec.   Rec.
0      0.845   0.777  0.867   0.731  0.747   0.853  0.831   0.772  0.789   0.817
1      0.768   0.740  0.775   0.682  0.726   0.745  0.722   0.771  0.712   0.734
2      0.909   0.833  0.892   0.857  0.905   0.847  0.873   0.882  0.935   0.635
3      0.764   0.497  0.679   0.571  0.545   0.640  0.716   0.508  0.708   0.487
4      0.852   0.529  0.870   0.388  0.863   0.490  0.441   0.146  0.000   0.029
5      0.700   0.783  0.794   0.626  0.825   0.675  0.744   0.660  0.697   0.793
6      0.682   0.685  0.718   0.625  0.699   0.685  0.670   0.700  0.744   0.595
7      0.680   0.741  0.675   0.652  0.602   0.751  0.722   0.697  0.843   0.532
8      0.657   0.720  0.586   0.785  0.626   0.774  0.670   0.677  0.752   0.457
9      0.701   0.752  0.680   0.670  0.668   0.723  0.775   0.636  0.831   0.335
All    0.756   0.706  0.754   0.659  0.721   0.718  0.716   0.645  0.801   0.542

TABLE II. F-SCORES OF DIFFERENT KERNELS EVALUATED BY MEANS OF THE OAI CORPUS OF [7].

Cat.   NDK    Square  Cubic  RBF    Linear
0      0.810  0.800   0.803  0.793  0.796
1      0.753  0.746   0.723  0.726  0.735
2      0.869  0.877   0.757  0.874  0.875
3      0.603  0.594   0.577  0.621  0.589
4      0.653  0.219   0.057  0.537  0.625
5      0.740  0.700   0.742  0.700  0.743
6      0.683  0.685   0.661  0.668  0.692
7      0.710  0.709   0.652  0.663  0.668
8      0.687  0.674   0.569  0.671  0.692
9      0.726  0.699   0.478  0.675  0.695
All    0.723  0.670   0.602  0.693  0.711

TABLE III. TIME (IN MILLISECONDS) REQUIRED FOR THE COMPUTATION OF THE KERNELS ON THE OAI CORPUS.

Cat.   NDK prim.  NDK dual  Square dual  Cubic dual  RBF dual  Linear dual
0      6426       46200     50256        50979       47246     44322
1      9339       98289     115992       132844      122751    93550
2      5822       48359     57603        67200       65508     46786
3      23774      134636    165604       177211      159897    111516
4      10103      118322    77634        93613       111153    103548
5      32501      104865    72772        135820      138107    96637
6      24375      105614    116892       120919      126498    95423
7      19206      113971    122128       143562      120371    100640
8      11236      99288     106215       118378      105942    84381
9      20822      125358    139004       168837      138395    111245
All    16361      99490     102410       120936      113587    88805

Fig. 1. Histogram of the distribution of categories for the Reuters corpus. [Figure not reproduced; the y-axis gives the number of categories (0 to 3000), the x-axis ticks range from -10 to 100.]

TABLE IV. MEAN F-SCORE, PRECISION, AND RECALL (MACRO-AVERAGING) OF DIFFERENT KERNELS EVALUATED BY MEANS OF THE REUTERS CORPUS.

Kernel   F-score  Precision  Recall
NDK      0.394    0.414      0.419
Square   0.324    0.392      0.304
Cubic    0.348    0.319      0.567
RBF      0.403    0.420      0.441
Linear   0.408    0.428      0.436

CONCLUSION AND FUTURE WORK

We derived a primal form of the NDK by means of complex numbers. In this way, we obtained a much simpler representation compared to the modified dual form. We showed that the primal form (and in principle also the modified dual form) can speed up text classification considerably. The reason is that it does not require comparing input vectors with all support vectors. Our evaluation showed that the F-scores of the NDK are competitive with all other kernels tested here, while the NDK consumes less time than the polynomial and the RBF kernel. We have also shown that the NDK performs better than the linear kernel when using full texts rather than text snippets. Whether this is due to problems of feature selection/expansion or a general characteristic of this kernel (in the sense of being negatively affected by ultra-sparse feature spaces) will be examined in future work. Additionally, we plan to examine the PK with exponents larger than two, to investigate under which prerequisites the PK performs well, and to evaluate the extension of the NDK that includes the Mahalanobis distance.

ACKNOWLEDGEMENT

We thank Vincent Esche and Tom Kollmar for valuable comments and suggestions for improving our paper.
REFERENCES

[1] Albrecht Beutelsbacher. Lineare Algebra. Vieweg, Braunschweig, Germany, 2010.
[2] Shri D. Boolchandani and Vineet Sahula. Exploring efficient kernel functions for support vector machine based feasibility models for analog circuits. International Journal of Design Analysis and Tools for Circuits and Systems, 1(1):1-8, 2011.
[3] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001.
[4] Thorsten Joachims. The Maximum Margin Approach to Learning Text Classifiers: Methods, Theory, and Algorithms. PhD thesis, Universität Dortmund, Informatik, LS VIII, 2000.
[5] Thorsten Joachims. Learning to classify text using support vector machines. Kluwer, Boston, 2002.
[6] Mathias Lösch. Talk: Automatische Sacherschließung elektronischer Dokumente nach DDC, 2011.
[7] Mathias Lösch, Ulli Waltinger, Wolfram Horstmann, and Alexander Mehler. Building a DDC-annotated corpus from OAI metadata. Journal of Digital Information, 12(2), 2011.
[8] Prasanta Chandra Mahalanobis. On the generalised distance in statistics. In Proceedings of the National Institute of Sciences of India, pages 49-55, 1936.
[9] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 2008.
[10] Alexander Mehler and Ulli Waltinger. Enhancing document modeling by means of open topic models: Crossing the frontier of classification schemes in digital libraries by example of the DDC. Library Hi Tech, 27(4):520-539, 2009.
[11] OCLC. Dewey decimal classification summaries. A brief introduction to the Dewey decimal classification, 2012. URL: http://www.oclc.org/dewey/resources/summaries/default.htm, last access: 8/17/2012.
[12] Hichem Sahbi and Francois Fleuret. Scale-invariance of support vector machines based on the triangular kernel. Technical Report 4601, Institut National de Recherche en Informatique et en Automatique, 2002.
[13] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 25(5):513-523, 1998.
[14] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels - Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, Massachusetts, 2002.
[15] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34:1-47, March 2002.
[16] César Souza. Kernel functions for machine learning applications, 2010. URL: http://crsouza.blogspot.de/2010/03/kernel-functions-for-machine-learning.html.
[17] Vladimir Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, New York, 1998.
[18] Ulli Waltinger, Alexander Mehler, Mathias Lösch, and Wolfram Horstmann. Hierarchical classification of OAI metadata using the DDC taxonomy. In Advanced Language Technologies for Digital Libraries, Lecture Notes in Computer Science, pages 29-40. Springer, Heidelberg, Germany, 2011.
[19] Shih-Hung Wu. Support vector machine tutorial, 2012. URL: www.csie.cyut.edu.tw/~shwu/PR_slide/SVM.pdf.
