Stochastic Dykstra Algorithms for Metric Learning on Positive Semi-Definite Cone

Tomoki Matsuzawa†, Raissa Relator⋄, Jun Sese⋄, Tsuyoshi Kato†,‡,∗

† Faculty of Science and Engineering, Gunma University, Kiryu-shi, Gunma, 326-0338, Japan.
‡ Center for Informational Biology, Ochanomizu University, Bunkyo-ku, Tokyo, 112-8610, Japan.
⋄ BRD, AIST, Koto-ku, Tokyo, 135-0064, Japan.

arXiv:1601.01422v1 [cs.CV] 7 Jan 2016

Abstract

Recently, covariance descriptors have received much attention as powerful representations of sets of points. In this research, we present a new metric learning algorithm for covariance descriptors based on the Dykstra algorithm, in which the current solution is projected onto a half-space at each iteration and each update runs in O(n^3) time. We empirically demonstrate that randomizing the order of half-spaces in our Dykstra-based algorithm significantly accelerates convergence to the optimal solution. Furthermore, we show that our approach yields promising experimental results on pattern recognition tasks.

1 Introduction

Learning with example objects characterized by a set of several points, instead of a single point, in a feature space is an important task in the computer vision and pattern recognition community. In the case of visual categorization of still images, many local image descriptors such as SIFT [17] are extracted from an input image to form a single vector such as a Bag-of-Visual-Words vector or a Fisher Vector [18, 20]. For image set classification, a surge of methods has been developed in the last decade, and probabilistic models [25] or kernels [22] have been introduced to describe the image set. Alternative descriptors are the covariance descriptors, which have received much attention as a powerful representation of a set of points.

The performance of categorizing covariance descriptors depends on the metric that is used to measure the distances between them. To compare covariance descriptors, a variety of distance measures, such as the affine invariant Riemannian metric [19], Stein metric [27], J-divergence [29], Frobenius distance [11], and Log-Frobenius distance [1], have been discussed in existing literature. Some of them are designed from geometrical properties, but some are not. Many of these distance measures are expressed in the form

    D_Φ(X_1, X_2) := ‖Φ(X_1) − Φ(X_2)‖_F^2,

with Φ : S^n_{++} → R^{n×m} for some m ∈ N. If Φ(X) := logm(X), where logm(X) takes the principal matrix logarithm of a strictly positive definite matrix X, the Log-Frobenius distance [1] is obtained. Setting Φ(X) := X^p gives the Power-Frobenius distance [11], while Φ(X) := chol(X), where chol : S^n_+ → R^{n×n} produces the Cholesky factor of X such that X = chol(X) chol(X)^⊤, yields the Cholesky-Frobenius distance [8]. These metrics are pre-defined before the employment of machine learning algorithms, and are not adaptive to the data to be analyzed. Meanwhile, for categorization of vectorial data, supervised learning for fitting metrics to the task has been proven to significantly increase the performance of distance-based classifiers [6, 13, 23].

In this paper, we introduce a parametric distance measure between covariance descriptors and present novel metric learning algorithms to determine the parameters of the distance measure function. The learning problem is formulated as the Bregman projection onto the intersection of half-spaces. This kind of problem can be solved by the Dykstra algorithm [4, 9], which chooses a single half-space in a cyclic order and projects the current solution onto that half-space. We developed an efficient technique for projection onto a single half-space.
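As a concrete illustration of the pre-defined Φ-based distances above, the following minimal sketch (ours, not code from the paper) computes the Log-Frobenius and Cholesky-Frobenius distances with NumPy/SciPy; the function names are illustrative.

```python
# Minimal sketch (not from the paper): two pre-defined Phi-based squared
# Frobenius distances between covariance descriptors.
import numpy as np
from scipy.linalg import logm

def log_frobenius(X1, X2):
    """Log-Frobenius distance: Phi(X) = principal matrix logarithm."""
    return np.linalg.norm(logm(X1) - logm(X2), 'fro') ** 2

def cholesky_frobenius(X1, X2):
    """Cholesky-Frobenius distance: Phi(X) = lower-triangular Cholesky factor."""
    return np.linalg.norm(np.linalg.cholesky(X1) - np.linalg.cholesky(X2), 'fro') ** 2

# Toy covariance descriptors (strictly positive definite).
rng = np.random.default_rng(0)
F1, F2 = rng.standard_normal((2, 5, 5))
X1 = F1 @ F1.T + 1e-3 * np.eye(5)
X2 = F2 @ F2.T + 1e-3 * np.eye(5)
print(log_frobenius(X1, X2), cholesky_frobenius(X1, X2))
```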
Furthermore, we empirically found that selecting the half-space stochastically, rather than in a cyclic order, dramatically increases the speed of convergence to an optimal solution.

1.1 Related work

To the best of our knowledge, Vemulapalli et al. (2015) [28] were the first to introduce a supervised metric learning approach for covariance descriptors. They vectorized the matrix logarithms of the covariance descriptors to apply existing metric learning methods to the vectorizations of matrices. The dimensionality of the vectorizations is n(n+1)/2 when the size of the covariance matrices is n×n. Thus, the size of the Mahalanobis matrix is n(n+1)/2 × n(n+1)/2, which is computationally prohibitive when n is large.

Our approach is an extension of the distance measure of Huang et al. [10], which is based on the Log-Euclidean metric, with their loss function being a special case of our formulation. They also adopted the cyclic Dykstra algorithm for learning the Mahalanobis-like matrix. However, they misused the Woodbury matrix inversion formula when deriving the projection onto a single half-space; therefore, their algorithm has no theoretical guarantee of converging to the optimal solution. In this paper, their update rule is corrected by presenting a new technique that projects the current solution onto a single half-space within O(n^3) computational time.

Yger and Sugiyama [30] devised a different formulation of metric learning. They introduced the congruent transform [2] and measured distances between the transformations of covariance descriptors. An objective function based on the kernel target alignment [5] is employed to determine the transformation parameters. Compared to their algorithm, our algorithm has the capability to monitor an upper bound of the objective gap, i.e. the difference between the current objective and the minimum. This implies that the resultant solution is ensured to be ε-suboptimal if the algorithm's convergence criterion is set such that the objective gap upper bound is less than a very small number ε. Since Yger and Sugiyama [30] employed a gradient method for learning the congruent transform, there is no way to know the objective gap.

1.2 Contributions

The contributions of this paper can be summarized as follows.

• For metric learning on the positive semidefinite cone, we developed a new algorithm based on the Dykstra algorithm, in which the current solution is projected onto a half-space at each iteration, and which runs in O(n^3) time per update.
• We present an upper bound on the objective gap which provides a stopping criterion and ensures the optimality of the solution.
• We empirically found that randomizing the order of half-spaces in our Dykstra-based algorithm significantly accelerates convergence to the optimal solution.
• We show that our approach yields promising experimental results on pattern recognition tasks.

1.3 Notation

We denote vectors by bold-faced lower-case letters and matrices by bold-faced upper-case letters. Entries of vectors and matrices are not bold-faced. The transposition of a matrix A is denoted by A^⊤, and the inverse of A by A^{−1}. The n×n identity matrix is denoted by I_n; the subscript is often omitted. The m×n zero matrix is denoted by O_{m×n}; the subscript is often omitted. The n-dimensional vector all of whose entries are one is denoted by 1_n. We use R and N to denote the sets of real and natural numbers, R^n and N^n to denote the sets of n-dimensional real and natural vectors, and R^{m×n} to denote the set of m×n real matrices. For any n ∈ N, we use N_n to denote the set of natural numbers less than or equal to n.
Let us define R_+ := {x ∈ R | x ≥ 0}, R_{++} := {x ∈ R | x > 0}, R^n_+ := {x ∈ R^n | x ≥ 0}, and R^n_{++} := {x ∈ R^n | x > 0}. The relational operator ≻ denotes the generalized inequality associated with the strictly positive definite cone. We use S^n to denote the set of symmetric n×n matrices, S^n_+ to denote the set of symmetric positive semi-definite n×n matrices, and S^n_{++} to denote the set of symmetric strictly positive definite n×n matrices. For any x = [x_1, ..., x_n]^⊤ ∈ R^n, diag(x) is defined as the n×n diagonal matrix whose diagonal entries are x_1, ..., x_n. For any n×n square matrix X, its trace is denoted by tr(X). For any x, y ∈ R^n, define ⟨x, y⟩ := Σ_{i=1}^n x_i y_i, where x_i and y_i are the i-th entries of x and y, respectively. For any X, Y ∈ R^{m×n}, define ⟨X, Y⟩ := Σ_{i=1}^m Σ_{j=1}^n X_{i,j} Y_{i,j}, where X_{i,j} and Y_{i,j} are the (i,j)-th entries of X and Y, respectively. O_n is used to denote the set of n×n orthonormal matrices, i.e. O_n := {A ∈ R^{n×n} | A^⊤ A = I_n}.

2 Our Metric Learning Problem

2.1 Parametric distance measure on S^n_+

We introduce the following distance measure for covariance descriptors X_1, X_2 ∈ S^n_+:

    D_Φ(X_1, X_2; W) := ⟨W, (Φ(X_1) − Φ(X_2))(Φ(X_1) − Φ(X_2))^⊤⟩,

where W ∈ S^n_+ is the parameter of this distance measure function. If W is strictly positive definite and Φ is bijective, then this distance measure D_Φ(·,·;W) : S^n_+ × S^n_+ → R is a metric because all of the following conditions are satisfied: (i) non-negativity: D_Φ(X_1,X_2;W) ≥ 0; (ii) identity of indiscernibles: D_Φ(X_1,X_2;W) = 0 iff X_1 = X_2; (iii) symmetry: D_Φ(X_1,X_2;W) = D_Φ(X_2,X_1;W); (iv) triangle inequality: D_Φ(X_1,X_3;W) ≤ D_Φ(X_1,X_2;W) + D_Φ(X_2,X_3;W).

If the parameter matrix W is singular, D_Φ(·,·;W) is a pseudometric, and the identity of indiscernibles is relaxed to the following property: for any X_1 ∈ S^n_{++}, D_Φ(X_1,X_1;W) = 0 holds, while D_Φ(X_1,X_2;W) = 0 may occur for some non-identical positive semi-definite matrices X_1 and X_2.

2.2 Formulations of the learning problems

To determine the value of the parameter matrix W, we pose a constrained optimization problem based on the idea of ITML [6]. We consider a multi-class categorization problem. Let n_c be the number of classes, and let the class labels be represented by natural numbers in N_{n_c}. Suppose we are given

    (X_1, ω_1), ..., (X_ℓ, ω_ℓ) ∈ S^n_+ × N_{n_c}

as a training dataset, where X_i is the covariance descriptor of the i-th example, and ω_i is its class label. From the ℓ examples, K pairs (i_1, j_1), ..., (i_K, j_K) ∈ N_ℓ × N_ℓ are picked to give, to each pair, the following constraint:

    D_Φ(X_{i_k}, X_{j_k}; W) ≤ b_ub ξ_k,  if ω_{i_k} = ω_{j_k},
    D_Φ(X_{i_k}, X_{j_k}; W) ≥ b_lb ξ_k,  if ω_{i_k} ≠ ω_{j_k},      (1)

where, when ξ_k = 1, the two constants b_ub and b_lb are, respectively, the upper bound of the distances between any two examples in the same class and the lower bound of the distances between any two examples in different classes. Now let us define, for k ∈ N_K,

    y_k := +1 if ω_{i_k} = ω_{j_k},  y_k := −1 if ω_{i_k} ≠ ω_{j_k},   and   b_k := b_ub if ω_{i_k} = ω_{j_k},  b_k := b_lb if ω_{i_k} ≠ ω_{j_k}.

Under the constraints (1), we wish to find W and ξ_k such that W does not deviate much from the identity matrix and each ξ_k is close to one. From this motivation, we pose the following problem:

    min  BD_ϕ((W, ξ), (I, 1))   wrt  W ∈ S^n_{++},  ξ = [ξ_1, ..., ξ_K]^⊤ ∈ R^K_{++},
    subject to  ∀k ∈ N_K,  y_k D_Φ(X_{i_k}, X_{j_k}; W) ≤ y_k b_k ξ_k,      (2)

where BD_ϕ(·,·) : (S^n_{++} × R^K_{++}) × (S^n_{++} × R^K_{++}) → R_+ is the Bregman divergence [14]. Only if (W, ξ) = (I, 1) does the divergence BD_ϕ((W, ξ), (I, 1)) become zero, and the value of the divergence becomes larger the more (W, ξ) deviates from (I, 1).
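The following sketch (illustrative, not the authors' code) shows how the parametric distance and the per-pair quantities that enter the constraints of problem (2) could be assembled. The helper names pair_data and parametric_distance are ours, and Φ is taken to be the matrix logarithm purely as an example.

```python
# Minimal sketch: the parametric distance D_Phi(X1, X2; W) = <W, A_k> and the
# per-pair quantities (A_k, y_k, b_k) used by constraint (1) / problem (2).
import numpy as np
from scipy.linalg import logm

def pair_data(X1, X2, same_class, b_ub, b_lb, phi=logm):
    """Return (A_k, y_k, b_k) for one training pair under a chosen Phi."""
    diff = phi(X1) - phi(X2)
    A_k = diff @ diff.T                  # positive semi-definite outer product
    y_k = +1 if same_class else -1
    b_k = b_ub if same_class else b_lb
    return A_k, y_k, b_k

def parametric_distance(W, A_k):
    """D_Phi(X1, X2; W) = <W, A_k> = tr(W A_k) for symmetric W and A_k."""
    return float(np.trace(W @ A_k))
```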
The definition of the Bregman divergence involves a seed function ϕ : S^n_{++} × R^K_{++} → R which is assumed to be continuously differentiable and strictly convex. For such a ϕ, the Bregman divergence is defined as

    BD_ϕ(Θ, Θ_0) = ϕ(Θ) − ϕ(Θ_0) − ⟨∇ϕ(Θ_0), Θ − Θ_0⟩,

for Θ, Θ_0 ∈ S^n_{++} × R^K_{++}. This implies that the quantities of the deviations of the solution (W, ξ) from (I, 1) depend on the definition of the seed function. In this study, the seed function is assumed to be the sum of two terms:

    ϕ(W, ξ) := ϕ_r(W) + Σ_{k=1}^K c_k ϕ_l(ξ_k),

where c_k is a positive constant. The first term ϕ_r : S^n_{++} → R is defined by ϕ_r(W) := −log det(W). As for the definition of the second term ϕ_l : R_{++} → R, we considered the following three functions:

    ϕ_is(ξ_k) := −log(ξ_k),    ϕ_l2(ξ_k) := (1/2) ξ_k^2,    ϕ_e(ξ_k) := (log ξ_k − 1) ξ_k.

The Bregman divergences generated from the three seed functions ϕ_is, ϕ_l2, and ϕ_e are referred to as the Itakura-Saito Bregman Divergence (ISBD), L2 Bregman Divergence (L2BD), and Relative Entropy Bregman Divergence (REBD), respectively, where ISBD is equal to the objective function employed by Huang et al. [10].

3 Stochastic Variants of the Dykstra Algorithm

We introduce the Dykstra algorithm [4, 9] to solve the optimization problem (2). The original Dykstra algorithm [9] was developed as a computational method that finds the Euclidean projection from a point onto the intersection of convex sets. Censor & Reich [4] extended the algorithm to finding the Bregman projection from a point x_0 onto a set C, defined by

    argmin_{x ∈ C} BD_ϕ(x, x_0).

In the literature on stochastic gradient descent methods and their variants [3, 12, 24, 26], which minimize a regularized loss averaged over a set of examples, it has been empirically shown that, rather than picking examples in a cyclic order, selecting them in a stochastic order dramatically speeds up convergence to the optimal solution. Alternatively, some literature has reported that randomly permuting the order of examples at the beginning of every epoch of the gradient method also accelerates convergence [7].

Motivated by these facts, this study proposes the use of stochastic orders for the selection of convex set components in the Dykstra algorithm. We term the stochastic version of the Dykstra algorithm the stochastic Dykstra algorithm. In our case, every convex set component is one of K half-spaces, as will be described in a later discussion. There are, then, three ways to select half-spaces:

• Cyclic: Pick a half-space in a cyclic order at each iteration.
• Rand: Pick a half-space randomly at each iteration.
• Perm: Permute the order of the K half-spaces randomly at the beginning of each epoch.

Hereinafter, we assume the "Rand" option is employed, although replacing this option with one of the remaining two is straightforward. If every convex set component is a half-space, and the k-th convex set component C_k is expressed as

    C_k := {x | ⟨a_k, x⟩ ≤ b_k},

then computing the Bregman projection from a point x_0 onto its boundary bd(C_k) is equivalent to solving the following saddle point problem:

    max_δ min_x  BD_ϕ(x, x_0) + δ(⟨a_k, x⟩ − b_k).

This fact enables us to rewrite the Dykstra algorithm with the Rand option, for finding the Bregman projection from a point x_0 onto the intersection of C_1, ..., C_K, as Algorithm 1, where ϕ* is the convex conjugate of the seed function ϕ.

Algorithm 1 Stochastic Dykstra Algorithm.
1: begin
2:   ∀k ∈ N_K: α_k := 0;
3:   for t = 1, 2, ... do
4:     Pick k randomly from {1, ..., K};
5:     Solve the following saddle point problem and let δ_{t−1/2} be the solution of δ:
           max_δ min_x  BD_ϕ(x, x_{t−1}) + δ(⟨a_k, x⟩ − b_k);      (3)
6:     δ_t := max(δ_{t−1/2}, −α_k);  α_k := α_k + δ_t;
7:     x_t := ∇ϕ*(∇ϕ(x_{t−1}) − δ_t a_k);
8:   end for
9: end.
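To make the control flow of Algorithm 1 concrete, here is a minimal generic sketch in Python (our rendering, not the authors' implementation). The mirror maps grad_phi and grad_phi_conj and the scalar solver solve_delta for step 5 are assumptions: they must be supplied for the chosen seed function (e.g. ISBD, L2BD, or REBD).

```python
# Generic sketch of Algorithm 1 with the "Rand" option.
import numpy as np

def stochastic_dykstra(x0, a_vecs, b_vals, grad_phi, grad_phi_conj, solve_delta,
                       n_epochs=30, rng=None):
    """Bregman-project x0 onto the intersection of half-spaces {x : <a_k, x> <= b_k}.

    a_vecs: list of vectors a_k; b_vals: list of scalars b_k.
    solve_delta(x, a_k, b_k) returns the solution delta of the saddle problem (3).
    """
    rng = rng or np.random.default_rng()
    K = len(a_vecs)
    alpha = np.zeros(K)                               # one dual variable per half-space
    x = x0.copy()
    for t in range(n_epochs * K):
        k = rng.integers(K)                           # step 4: pick a half-space at random
        delta_half = solve_delta(x, a_vecs[k], b_vals[k])   # step 5: project onto bd(C_k)
        delta = max(delta_half, -alpha[k])            # step 6: keep alpha_k >= 0
        alpha[k] += delta
        x = grad_phi_conj(grad_phi(x) - delta * a_vecs[k])  # step 7: mirror update
    return x, alpha
```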
4 Efficient Projection Technique

We now show that solving the optimization problem (2) is equivalent to finding the Bregman projection from the point (I, 1) ∈ S^n_{++} × R^K_{++} onto the intersection of multiple half-spaces.

Let A_k be the positive semidefinite matrix

    A_k := (Φ(X_{i_k}) − Φ(X_{j_k}))(Φ(X_{i_k}) − Φ(X_{j_k}))^⊤

for k ∈ N_K, and define the half-space

    C_k := {(W, ξ) ∈ S^n_{++} × R^K_{++} | y_k ⟨A_k, W⟩ − y_k b_k ξ_k ≤ 0}.

Then it can be seen that the intersection of the K half-spaces, ∩_{k=1}^K C_k, is the feasible region of the optimization problem (2). This implies that the Dykstra algorithm can be applied to solve problem (2).

Next we present an efficient technique that projects (W_{t−1}, ξ_{t−1}) ∈ S^n_{++} × R^K_{++} onto the k-th half-space C_k, where (W_{t−1}, ξ_{t−1}) is the model parameter after the (t−1)-th iteration. Let ξ_{k,t−1} be the k-th entry of the vector ξ_{t−1}. The value of the function J_t : R → R, defined by

    J_t(δ) := ⟨A_k, (W_{t−1}^{−1} + δ y_k A_k)^{−1}⟩ − b_k ∇ϕ_l*(∇ϕ_l(ξ_{k,t−1}) + δ y_k b_k / c_k),

is zero at the solution δ_{t−1/2} of the saddle point problem (3). The solution δ must satisfy strict positive definiteness:

    (Y(δ))^{−1} := W_{t−1}^{−1} + δ y_k A_k ≻ O,      (4)

and feasibility of the slack variable:

    ∃ ξ_{k,t−1/2}  s.t.  ∇ϕ_l(ξ_{k,t−1/2}) = ∇ϕ_l(ξ_{k,t−1}) + δ y_k b_k / c_k.      (5)

There is no closed-form solution for this projection problem. Hence, some numerical method such as the Newton-Raphson method is necessary for solving the nonlinear equation J_t(δ) = 0. If one computes the value of J_t(δ) naïvely, it requires an O(n^3) computational cost because J_t(·) involves the inverse of an n×n matrix. If we suppose the numerical method evaluates the scalar-valued function J_t(·) L times, the naïve approach takes O(Ln^3) computational time to find the solution of J_t(δ) = 0. Furthermore, the positive definiteness condition (4) and the feasibility condition (5) must be checked. We will show the following two claims:

• The solution of the equation J_t(δ) = 0 satisfying (4) and (5) can be computed within O(n^3 + Ln) time, where L is the number of times a numerical method evaluates the value of J_t(·).
• The solution exists and is unique.

Hereinafter, we assume A_k is strictly positive definite. By setting A_k ← A_k + εI, with ε a small positive constant, it is easy to satisfy this assumption. Since L ∈ O(n^2) in a typical setting, each update can be done in O(n^3) computation.

We define A_k^{1/2}, U, D, and d as follows. Let A_k^{1/2} ∈ S^n_{++} be such that A_k^{1/2} A_k^{1/2} = A_k, and denote by A_k^{−1/2} ∈ S^n_{++} the inverse of A_k^{1/2}. Introduce an orthonormal matrix U ∈ O_n and a diagonal matrix D = diag({d_1, ..., d_n}) that represent the spectral decomposition

    U D U^⊤ = A_k^{−1/2} W_{t−1}^{−1} A_k^{−1/2},

with d_1 ≥ ... ≥ d_n. Then we have

    Y(δ) = A_k^{−1/2} U (D + y_k δ I)^{−1} U^⊤ A_k^{−1/2},      (6)

which allows us to rewrite the first term of J_t(δ) as

    ⟨A_k, (W_{t−1}^{−1} + δ y_k A_k)^{−1}⟩ = Σ_{i=1}^n 1 / (d_i + y_k δ).      (7)

Evaluation of J_t(δ) can therefore be done within O(n) computational cost once d_1, ..., d_n are obtained. To get the n scalars d_1, ..., d_n, we need to find A_k^{−1/2} and the spectral decomposition of A_k^{−1/2} W_{t−1}^{−1} A_k^{−1/2}, each of which requires O(n^3) computation. The n×n matrix A_k^{−1/2} can be computed in the pre-processing stage of the Dykstra algorithm, while the spectral decomposition of A_k^{−1/2} W_{t−1}^{−1} A_k^{−1/2} is done once before invoking a numerical method to solve the nonlinear equation J_t(δ) = 0. This establishes the first claim.
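The following sketch shows how the O(n) evaluation of J_t(δ) could look once the spectrum is pre-computed, following equations (6) and (7). The slack term is written for the ISBD seed, for which ∇ϕ_is(ξ) = −1/ξ and ∇ϕ_is*(s) = −1/s (our derivation, not quoted from the paper); the function names are ours.

```python
# Sketch: O(n^3) pre-computation of the spectrum, then O(n) evaluation of J_t(delta).
import numpy as np

def precompute_spectrum(W_inv, A_k):
    """Eigenvalues d_1 >= ... >= d_n of A_k^{-1/2} W_{t-1}^{-1} A_k^{-1/2} (O(n^3))."""
    w, V = np.linalg.eigh(A_k)                     # A_k assumed strictly positive definite
    A_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    M = A_inv_half @ W_inv @ A_inv_half
    return np.linalg.eigvalsh(M)[::-1]             # descending order

def J_t(delta, d, xi_k, y_k, b_k, c_k):
    """Evaluate J_t(delta) in O(n) given the precomputed spectrum d (ISBD slack term)."""
    first = np.sum(1.0 / (d + y_k * delta))        # eq. (7)
    s = -1.0 / xi_k + delta * y_k * b_k / c_k      # grad(phi_is)(xi_k) + delta*y_k*b_k/c_k
    return first - b_k * (-1.0 / s)                # grad(phi_is)^*(s) = -1/s for s < 0
```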
Equation (6) suggests that the set of δ satisfying (4) is given by

    I_{r,t} := (−d_n, +∞)  for y_k = +1,   and   I_{r,t} := (−∞, d_n)  for y_k = −1.      (8)

The set of δ satisfying (5) is given as follows. In the case of ISBD, the δ ensuring (5) lie in the interval

    I_{is,t} := (−∞, δ_b)  for y_k = +1,   and   I_{is,t} := (−δ_b, +∞)  for y_k = −1,

where

    δ_b := c_k / (b_k ξ_{k,t−1}).

In the case of L2BD and REBD, there exists a feasible ξ_{k,t−1/2} whatever value ∇ϕ_l(ξ_{k,t−1}) + δ y_k b_k / c_k takes. Hence, if ISBD is employed, the solution δ_{t−1/2} can be searched for in the interval

    I_{r,t} ∩ I_{is,t} = (−d_n, δ_b)  for y_k = +1,   and   (−δ_b, d_n)  for y_k = −1.

If L2BD or REBD is employed, the solution δ_{t−1/2} can be searched for in I_{r,t}. In the remainder of this section, we use the notation I_t to denote the interval of δ satisfying (4) and (5) simultaneously.

We now show the uniqueness of the solution. The gradient of J_t : R → R is expressed as

    ∇J_t(δ) = −Σ_{i=1}^n y_k / (d_i + y_k δ)^2 − (y_k b_k^2 / c_k) ∇^2 ϕ_l*(∇ϕ_l(ξ_{k,t−1}) + δ y_k b_k / c_k),

for δ ∈ I_t. We first consider the case y_k = +1. Clearly, the first term is negative. The second term is non-positive because any convex conjugate function is convex. Therefore, ∇J_t(δ) < 0. In the case of y_k = −1, we get ∇J_t(δ) > 0 from a similar derivation. These observations imply that the solution is unique if a solution exists. The existence of the solution can be established by showing that the curve J_t(δ) crosses the horizontal axis.

We consider the cases of ISBD and of either L2BD or REBD separately. For the ISBD case, we have

    lim_{δ ↘ −d_n} J_t(δ) = +∞,    lim_{δ ↗ δ_b} J_t(δ) = −∞,

if y_k = +1, and

    lim_{δ ↘ −δ_b} J_t(δ) = −∞,    lim_{δ ↗ d_n} J_t(δ) = +∞,

if y_k = −1. On the other hand, when using either L2BD or REBD with y_k = +1 we get

    lim_{δ ↘ −d_n} J_t(δ) = +∞,    lim_{δ → +∞} J_t(δ) = −∞,

while we obtain

    lim_{δ → −∞} J_t(δ) = −∞,    lim_{δ ↗ d_n} J_t(δ) = +∞,

when y_k = −1. Hence, we conclude that

    ∃! δ ∈ I_t  s.t.  J_t(δ) = 0.
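Because J_t is strictly monotone on I_t, a safeguarded bisection is one simple way to locate the unique root; the sketch below is ours (the paper itself mentions the Newton-Raphson method), reuses the J_t signature from the previous sketch, and uses the interval endpoints derived above for the ISBD case.

```python
# Sketch: bisection for the unique root of J_t on the admissible interval I_t (ISBD).
import numpy as np

def solve_delta_isbd(d, xi_k, y_k, b_k, c_k, J_t, tol=1e-10, max_iter=200):
    """Find delta in I_t with J_t(delta) = 0 by bisection, ISBD seed function."""
    d_n = d[-1]                                   # smallest eigenvalue (d sorted descending)
    delta_b = c_k / (b_k * xi_k)
    if y_k == +1:
        lo, hi = -d_n, delta_b                    # J_t -> +inf at lo, -inf at hi
    else:
        lo, hi = -delta_b, d_n                    # J_t -> -inf at lo, +inf at hi
    eps = 1e-12 * max(1.0, abs(hi - lo))
    lo, hi = lo + eps, hi - eps                   # stay strictly inside the open interval
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        val = J_t(mid, d, xi_k, y_k, b_k, c_k)
        if abs(val) < tol:
            return mid
        # J_t is decreasing for y_k = +1 and increasing for y_k = -1
        if (val > 0) == (y_k == +1):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```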
4.1 Stopping Criterion

Here we discuss how to determine whether the solution is already optimal and when to terminate the algorithm. While running the algorithm, (W_t, ξ_t) may be infeasible for the primal problem. Denote the index set of the violated constraints by I_vio := {k ∈ N_K | (W_t, ξ_t) ∉ C_k}, and define ξ̄_t ∈ R^K_{++} so that its h-th entry is given by ξ̄_{h,t} := (1/b_h) ⟨W_t, A_h⟩ for h ∈ I_vio and ξ̄_{h,t} := ξ_{h,t} for h ∉ I_vio. Note that (W_t, ξ̄_t) is a feasible solution, and ξ̄_t = ξ_t when (W_t, ξ_t) is feasible. The objective gap after iteration t is bounded as follows:

    BD_ϕ((W_t, ξ̄_t), (I, 1)) − BD_⋆ ≤ Σ_{h ∈ I_vio} c_h (ϕ_l(ξ̄_{h,t}) − ϕ_l(ξ_{h,t}) − ∇ϕ_l(1)(ξ̄_{h,t} − ξ_{h,t})) − Σ_{h=1}^K α_h y_h (⟨A_h, W_t⟩ − b_h ξ_{h,t}),

where we have defined

    BD_⋆ := min_{(W, ξ) ∈ ∩_h C_h} BD_ϕ((W, ξ), (I, 1)).

This upper bound on the objective gap can be used as the stopping criterion of the Dykstra algorithm.

[Figure 1: Convergence behavior of the algorithms (Cyclic, Rand, Perm) under different settings: objective gap versus epoch for λ = 0.01, 0.001, 0.0001 and n = 10, 50, 100.]

5 Numerical Experiments

We conducted experiments to assess the convergence speed of our optimization algorithms and the generalization performance for pattern recognition.

5.1 Convergence behavior of optimization algorithms

We examined our algorithms to assess their convergence speed. We generated datasets artificially as follows. K = 50 matrices F_k ∈ R^{n×n} are generated in which each entry is drawn from the uniform distribution on the interval [−0.5, 0.5]. Then, we set A_k := F_k F_k^⊤. The values of the variables y_k are randomly chosen from {±1} with equal probabilities. We set b = 1 and c = 1/(λK). We exhaustively tested Cyclic, Perm, and Rand with the settings λ = 10^{−2}, 10^{−3}, 10^{−4} and n = 10, 50, 100.

Figure 1 shows the convergence behavior of the cyclic Dykstra algorithm and the two stochastic Dykstra algorithms for various λ and n. Here, one epoch refers to K projections onto single half-spaces. ISBD is employed as the objective function for learning the metric W. The objective gap is defined as the difference between the current objective value and the minimum. In most of the settings, the two stochastic Dykstra algorithms converged faster than the cyclic algorithm. Especially when λ = 10^{−4}, the cyclic algorithm was too slow to be used in practice.
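For reference, here is a sketch of the synthetic problem generator described above, as we read the setup (our illustration, not the authors' script).

```python
# Sketch of the synthetic dataset of Section 5.1: K random PSD constraint matrices.
import numpy as np

def make_synthetic_problem(n, K=50, lam=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    F = rng.uniform(-0.5, 0.5, size=(K, n, n))
    A = [F_k @ F_k.T for F_k in F]                 # A_k := F_k F_k^T
    y = rng.choice([-1, +1], size=K)               # y_k in {+1, -1} with equal probability
    b = np.ones(K)                                 # b = 1
    c = np.full(K, 1.0 / (lam * K))                # c = 1 / (lambda K)
    return A, y, b, c
```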
[Figure 2: Generalization performance for pattern recognition on (a) the Brodatz texture dataset and (b) the ETH-80 dataset.]

5.2 Generalization performance for pattern recognition

We used the Brodatz texture dataset [21], containing 111 different texture images, to examine the generalization performance for texture classification. Each image is gray-scale and has a size of 640×640. Images were individually divided into four sub-images of equal size. One of the four sub-images was picked randomly and used for testing, and the rest of the sub-images were used for training. For each training image and each testing image, 50 covariance descriptors were extracted from randomly chosen 128×128 patches. The covariance matrices are of five-dimensional feature vectors [I, |I_x|, |I_y|, |I_xx|, |I_yy|]^⊤. Thus, 11,100 (= 111×2×50) covariance descriptors are obtained for training and for testing, respectively. For evaluation of generalization performance, a k-nearest neighbor classifier is used, with the number of nearest neighbors set to three. We set K = 100×n_c, b_ub = 0.05, and b_lb = 0.95.

We also examined the generalization performance for generic visual categorization using the ETH-80 dataset [16], containing n_c = 8 classes. Each class has 10 objects, each of which includes 41 colored images. For every object, 20 images are randomly chosen and used for training, and the rest of the images are used for testing. One covariance matrix is obtained from each image. The features [x, y, R, G, B, |I_x|, |I_y|, |I_xx|, |I_yy|]^⊤ are obtained from each pixel in an image.

We tried four types of Φ — Id: Φ(X) = X, Log: Φ(X) = logm(X), Sqrt: Φ(X) = X^{1/2}, Chol: Φ(X) = chol(X). The parameter W is determined by the metric learning algorithms with ISBD, L2BD, and REBD, and compared with W = I, which we denote as Eye. Note that D_Φ(·,·;I) = D_Φ(·,·). Figure 2 gives the accuracy bar plots for the two multi-class classification problems. Whichever Φ is used, supervised metric learning improved the generalization performance both for texture classification and for generic visual categorization. For texture classification, the Cholesky-decomposition-based mapping chol(·) achieved the best accuracy, while the matrix-logarithm-based mapping logm(·) obtained the highest accuracy for generic image categorization.

6 Conclusions

In this paper, we have devised several objective functions for metric learning on the positive semidefinite cone, all of which can be minimized by the Dykstra algorithm. We have introduced a new technique that performs each update efficiently when the Dykstra algorithm is applied to the metric learning problems. We have empirically demonstrated that the stochastic versions of the Dykstra algorithm are much faster than the original algorithm.

Acknowledgment

This work was supported by JSPS KAKENHI Grant Numbers 26249075 and 40401236. The last author would like to thank Dr. Zhiwu Huang for fruitful discussions.

References

[1] Vincent Arsigny, Pierre Fillard, Xavier Pennec, and Nicholas Ayache. Log-Euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine, 56(2):411–421, 2006.
[2] Rajendra Bhatia. Positive Definite Matrices. Princeton University Press, 2009.
[3] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), pages 177–187, Paris, France, August 2010. Springer.
[4] Yair Censor and Simeon Reich. The Dykstra algorithm with Bregman projections. Communications in Applied Analysis, 2:407–419, 1998.
[5] Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz S. Kandola. On kernel-target alignment. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, NIPS, pages 367–373. MIT Press, 2001.
[6] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, pages 209–216. ACM, 2007.
[7] Aaron J. Defazio, Tibério S. Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. arXiv preprint arXiv:1407.2710, 2014.
[8] Ian L. Dryden, Alexey Koloydenko, and Diwei Zhou. Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. The Annals of Applied Statistics, pages 1102–1123, 2009.
[9] Richard L. Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, December 1983.
[10] Zhiwu Huang, Ruiping Wang, Shiguang Shan, Xianqiu Li, and Xilin Chen. Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pages 720–729, 2015.
[11] Sadeep Jayasumana, Richard Hartley, Mathieu Salzmann, Hongdong Li, and Mehrtash Tafazzoli Harandi. Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In CVPR, pages 73–80. IEEE, 2013.
[12] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26: Proceedings of a meeting held December 5–8, 2013, Lake Tahoe, Nevada, United States, pages 315–323, 2013.
[13] Tsuyoshi Kato and Nozomi Nagano. Metric learning for enzyme active-site search. Bioinformatics, 26(21):2698–2704, November 2010.
[14] Tsuyoshi Kato, Wataru Takei, and Shinichiro Omachi. A discriminative metric learning algorithm for face recognition. IPSJ Transactions on Computer Vision and Applications, 5:85–89, 2013.
[15] Tsuyoshi Kato, Koji Tsuda, and Kiyoshi Asai. Selective integration of multiple biological data for supervised network inference. Bioinformatics, 21:2488–2495, May 2005.
[16] Bastian Leibe and Bernt Schiele. Analyzing appearance and contour based methods for object categorization. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II–409. IEEE, 2003.
[17] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[18] Tomoki Matsuzawa, Raissa Relator, Wataru Takei, Shinichiro Omachi, and Tsuyoshi Kato. Mahalanobis encodings for visual categorization. IPSJ Transactions on Computer Vision and Applications, 7:69–73, July 2015. doi: 10.2197/ipsjtcva.7.1.
[19] Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A Riemannian framework for tensor computing. International Journal of Computer Vision, 66(1):41–66, 2006.
[20] Florent Perronnin, Jorge Sanchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In Computer Vision–ECCV 2010, pages 143–156. Springer, 2010.
[21] Trygve Randen and John Hakon Husoy. Filtering for texture classification: A comparative study. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(4):291–310, 1999.
[22] Raissa Relator, Yoshihiro Hirohashi, Eisuke Ito, and Tsuyoshi Kato. Mean polynomial kernel and its application to vector sequence recognition. IEICE Transactions on Information and Systems, E97-D(7):1855–1863, July 2014.
[23] Raissa Relator, Nozomi Nagano, and Tsuyoshi Kato. Using Bregman divergence regularized machine for comparison of molecular local structures. IEICE Transactions on Information & Systems, E99-D(1), January 2016.
[24] Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2663–2671. Curran Associates, Inc., 2012.
[25] Gregory Shakhnarovich, John W. Fisher, and Trevor Darrell. Face recognition from long-term observations. In ECCV 2002, pages 851–865. Springer Berlin Heidelberg, 2002.
[26] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
[27] Suvrit Sra. A new metric on the manifold of kernel matrices with application to matrix geometric means. In Advances in Neural Information Processing Systems, pages 144–152, 2012.
[28] Raviteja Vemulapalli and David W. Jacobs. Riemannian metric learning for symmetric positive definite matrices. arXiv preprint arXiv:1501.02393, 2015.
[29] Zhizhou Wang and Baba C. Vemuri. An affine invariant tensor dissimilarity measure and its applications to tensor-valued image segmentation. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 1, pages I–228. IEEE, 2004.
[30] Florian Yger and Masashi Sugiyama. Supervised LogEuclidean metric learning for symmetric positive definite matrices. arXiv preprint arXiv:1502.03505, 2015.
