An N log N Parallel Fast Direct Solver for Kernel Matrices

Chenhan D. Yu*, William B. March†, and George Biros‡
*Department of Computer Science, *†‡Institute for Computational Engineering and Sciences
The University of Texas at Austin, Austin, Texas, USA
*[email protected], †[email protected], ‡[email protected]

Abstract—Kernel matrices appear in machine learning and non-parametric statistics. Given N points in d dimensions and a kernel function that requires O(d) work to evaluate, we present an O(dN log N)-work algorithm for the approximate factorization of a regularized kernel matrix, a common computational bottleneck in the training phase of a learning task. With this factorization, solving a linear system with a kernel matrix can be done with O(N log N) work. Our algorithm only requires kernel evaluations and does not require that the kernel matrix admits an efficient global low-rank approximation. Instead, our factorization only assumes low-rank properties for the off-diagonal blocks under an appropriate row and column ordering. We also present a hybrid method that, when the factorization is prohibitively expensive, combines a partial factorization with iterative methods. As a highlight, we are able to approximately factorize a dense 11M x 11M kernel matrix in 2 minutes on 3,072 x86 "Haswell" cores and a 4.5M x 4.5M matrix in 1 minute using 4,352 "Knights Landing" cores.

I. INTRODUCTION

Let X be a set of N points in R^d and let K(x_i, x_j) : R^d x R^d -> R be a given kernel function. The kernel matrix is the N x N matrix whose entries are given by K_ij = K(x_i, x_j) for i, j = 1, ..., N, with x_i, x_j in X.

Kernel matrices appear in unsupervised and supervised statistical learning, Gaussian process regression, and non-parametric statistics [8], [12], [27], [33]. Solving linear systems with kernel matrices is an algebraic operation that is required in many kernel methods. The simplest example is ridge regression, in which we solve with λI + K, where λ > 0 is a regularization parameter that controls generalization accuracy and I is the identity matrix. This linear solve can be prohibitively expensive for large N because K is typically dense. For example, consider the Gaussian kernel,

    K(x_i, x_j) = exp( - ||x_i - x_j||_2^2 / (2 h^2) ),    (1)

where h is the kernel bandwidth. For small h, K approaches the identity matrix, whereas for large h, K approaches the rank-one constant matrix. The first regime suggests sparse approximations, while the second suggests global low-rank approximations. But for the majority of h values, K is neither sparse nor globally low rank. Direct factorization of λI + K requires O(N^3) work, whereas a Krylov iterative method costs O(N^2) work per iteration and may require thousands of iterations. This complexity barrier has limited the use of kernel methods for large-scale problems [6], [20].
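To make the complexity barrier concrete, the following NumPy sketch (ours, for illustration only; it is not part of ASKIT or INV-ASKIT) forms the dense Gaussian kernel matrix of (1) and solves the ridge-regression system (λI + K)w = u with a direct Cholesky factorization, i.e., the O(N^2)-storage, O(N^3)-work baseline that this paper avoids. All helper names are placeholders.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_kernel(X, Y, h):
    """Dense Gaussian kernel block K[i, j] = exp(-||x_i - y_j||^2 / (2 h^2)).
    X is d x n and Y is d x m, matching the paper's points-as-columns convention."""
    sq = (X**2).sum(0)[:, None] + (Y**2).sum(0)[None, :] - 2.0 * X.T @ Y
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * h**2))

def dense_ridge_solve(X, u, h, lam):
    """O(N^3) baseline: factorize lam*I + K and solve for the weights w."""
    N = X.shape[1]
    K = gaussian_kernel(X, X, h)
    c, low = cho_factor(lam * np.eye(N) + K)      # lam*I + K is symmetric positive definite
    return cho_solve((c, low), u)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 2000))            # d = 8, N = 2000 toy problem
    u = rng.standard_normal(2000)                 # labels / right-hand side
    w = dense_ridge_solve(X, u, h=1.0, lam=0.1)
    r = u - (0.1 * np.eye(2000) + gaussian_kernel(X, X, 1.0)) @ w
    print(np.linalg.norm(r) / np.linalg.norm(u))  # relative residual, near machine precision
```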
Contributions. We exploit hierarchically low-rank approximations in which we assume that K can be approximated well by D + UV, where D is block-diagonal and the U and V matrices have low rank. Using such a decomposition, we improve the factorization algorithm presented in [36]. In that paper the block-diagonal plus sparse decomposition was done using ASKIT(1), a method introduced in [21], [22] (see §II-A). ASKIT approximates K in O(dN log N) time, and the algorithm in [36] factorizes the ASKIT approximation in O(N log^2 N) time (see §II-B). Roughly speaking, ASKIT is based on the approximation of K as the sum of a block-diagonal matrix and a low-rank matrix, followed by recursion for each diagonal block. We refer to this process as the construction of the hierarchical representation of K. Once we have this representation, we can factorize K by applying the Sherman-Morrison-Woodbury (SMW) formula recursively. The factorization has to be done for different values of λ during cross-validation studies, so optimizing the factorization is crucial for the overall performance of a kernel method. In this paper, we extend the factorization scheme presented in [36] in several ways:

• We present an algorithm that factorizes the ASKIT approximation in O(N log N) time instead of O(N log^2 N), and we demonstrate its performance on several datasets (see §II-C and §V).
• We present a hybrid level-restricted factorization scheme that dramatically reduces the factorization time by using a Krylov iterative solver on a much smaller system than the original (see §II-C). The new method can be used with matrices for which [36] fails.
• We study performance on Intel's "Knights Landing" (KNL) architecture. We introduce an optimized matrix-free kernel summation that reduces the storage requirements of the factorization without a very significant impact on wall-clock time (see §II-D).

In our numerical experiments, we measure the performance of the method on several different datasets. ASKIT has been applied to polynomial, Matern, Laplacian, and Gaussian kernels in arbitrary dimensions. Due to space limitations we only present factorization results for the Gaussian kernel, which is also among the hardest to compress in high dimensions. We examine the performance of the method for different bandwidth ranges that are relevant to learning tasks, its sensitivity to the regularization parameter λ, its performance as we increase the number of points, and its numerical stability (see §III).

(1) Approximate Skeletonization Kernel-Independent Treecode. The ASKIT library is available at http://padas.ices.utexas.edu/libaskit.
Limitations. Not all kernel matrices admit a good hierarchical low-rank decomposition. Typically this is related to the intrinsic dimensionality of the dataset at different scales, so ASKIT, and subsequently our method, can fail. If ASKIT can compress the matrix, the second potential point of failure is the choice of the regularization parameter. If it is too small, our algorithm (as well as [36]) can become numerically unstable. We can numerically detect the instability, but it is not clear how to fix it while maintaining the log-linear complexity of the algorithm. However, small regularization often results in poor learning performance, so this corner case is not important in applications. We discuss this in more detail in §III and §V. Also, our methods cannot be applied to hierarchical decompositions in which D is sparse and not just block-diagonal. For such decompositions, our method can be used as a preconditioner, as discussed in [36].

Related work. Nystrom methods and their variants [7], [13], [28], [34] can be used to build fast factorizations. However, not all kernel matrices can be approximated well by Nystrom methods [15], [18], [23], [31]. Factorization methods based on hierarchical decompositions have been studied for kernel matrices from points in two or three dimensions [1], [2], [4], [9], but less so in high dimensions, with a few exceptions [15], [36]. Early works on parallel operations for hierarchical matrices on shared-memory systems include bulk-synchronous parallelization [16] and DAG-based task parallelism [17]. Distributed factorizations and operations were discussed in [14], [32]. The difficulties of generalizing low-dimensional factorizations to high dimensions are discussed in [21], [36].

II. METHODS

We begin with a sketch of hierarchical matrices and direct solvers in §II-A, and we briefly summarize the ASKIT algorithm, which we use as the basis for our new methods. We describe parallel factorization schemes in §II-B and highlight the novelty of our approach over [36]. We then introduce our hybrid iterative/direct solver in §II-C.

A. Hierarchical Matrices and Treecodes

Broadly speaking, we consider a matrix K in R^{N x N} to be hierarchical if it can be partitioned as

    K = [K_ll, K_lr; K_rl, K_rr] = [K_ll, 0; 0, K_rr] + [0, K_lr; K_rl, 0],    (2)

(block rows separated by semicolons), where the off-diagonal blocks K_lr and K_rl can be accurately approximated by a low-rank factorization and the on-diagonal blocks K_ll and K_rr are themselves hierarchical. Note that the low-rank structure is not invariant under permutations; it depends very strongly on the ordering of the columns (or rows, since the matrix is symmetric). For notational convenience we write K ≈ \tilde{K} = D + UV, where U and V have rank s and D is also hierarchical; we use \tilde{K} to indicate the approximate kernel matrix.

Inverting hierarchical matrices. When K admits this hierarchical low-rank approximation, we can efficiently approximate K^{-1} using the Sherman-Morrison-Woodbury formula along with recursion:

    \tilde{K} = D + UV = D(I + WV),  with  W = D^{-1}U,
    \tilde{K}^{-1} = (I + WV)^{-1} D^{-1}
                   = (I - W(I + VW)^{-1} V) D^{-1}
                   = (I - W Z^{-1} V) D^{-1},  where  Z = I + VW.    (3)

Recursion is used to invert D. After obtaining D^{-1}, we compute W = D^{-1}U for a rank-s matrix U and factorize the smaller reduced system Z = I + VW in R^{s x s}. The scheme is easily extended to invert λI + K.
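A minimal NumPy sketch of (3) (illustrative only, not the production code): given the diagonal blocks of D and the low-rank factors U and V, it applies the SMW identity by factorizing D blockwise, forming W = D^{-1}U, and solving the small reduced system Z = I + VW.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, block_diag

def smw_solve(D_blocks, U, V, b):
    """Solve (D + U @ V) x = b via the Sherman-Morrison-Woodbury formula, eq. (3).
    D_blocks: list of dense diagonal blocks of the block-diagonal matrix D;
    U: N x s,  V: s x N,  b: right-hand side of length N (floating point)."""
    # Factorize D block by block; in the paper this step is itself recursive,
    # here the blocks are simply dense LU factorizations.
    sizes = np.cumsum([0] + [Db.shape[0] for Db in D_blocks])
    facts = [lu_factor(Db) for Db in D_blocks]

    def D_inv(Y):
        out = np.empty_like(Y, dtype=float)
        for f, i, j in zip(facts, sizes[:-1], sizes[1:]):
            out[i:j] = lu_solve(f, Y[i:j])
        return out

    W = D_inv(U)                                  # W = D^{-1} U
    Z = np.eye(V.shape[0]) + V @ W                # Z = I + V W  (small, s x s)
    y = D_inv(b)
    return y - W @ np.linalg.solve(Z, V @ y)      # (I - W Z^{-1} V) D^{-1} b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blocks = [np.eye(100) + 0.1 * rng.standard_normal((100, 100)) for _ in range(4)]
    U = 0.1 * rng.standard_normal((400, 16))
    V = rng.standard_normal((16, 400))
    b = rng.standard_normal(400)
    K = block_diag(*blocks) + U @ V               # dense reference for the check
    print(np.linalg.norm(K @ smw_solve(blocks, U, V, b) - b))
```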
To turn this formulation into an algorithm, we need (1) a method to partition K so that the off-diagonal blocks have low rank, (2) an efficient way to compute the low-rank factors U and V, and (3) a scheme to construct the inverse. For the first two tasks we use ASKIT, a method we recently developed [21]-[23]. ASKIT uses geometric information (the input points) to permute K by partitioning the points recursively with a binary tree. Interactions between points in a tree node correspond to diagonal blocks of K, and in the recursion the children of a node define the block partitioning of the parent block, as in (2). Next, we summarize the ASKIT features that are necessary for this paper; see [23] for complete details.

Partitioning the matrix. We use a ball tree [26] to partition X. Starting with the root node (which contains the entire dataset), nodes are partitioned into two children (with an equal number of points) by a splitting hyperplane. This recursive splitting terminates when a node has fewer than m points, a user-specified parameter. The root has level l = 0 and the leaves have level l = D = log_2(N/m), the depth of the tree. In the following, we overload α, β to indicate both binary tree nodes and the indices of the points that belong to them; |α| is the number of points in α; and l, r denote the left and right children of a node. We define X in R^{d x N} to be the matrix of all points and X_α to be the points owned by tree node α, i.e., X_α = {x_i : i in α}.

Computing low-rank approximations. Let α be the points in a leaf node and let S = {1, ..., N} \ α. The skeletonization of a node α is a rank-s approximation of K_{Sα} using s columns of K_{Sα}. We refer to these columns as the skeleton of α, denoted by \tilde{α}. Skeletonization is done using the interpolative decomposition (ID) [11]. Using a pivoted rank-revealing QR factorization, the ID finds \tilde{α} and P_{\tilde{α}α} in R^{s x |α|} such that

    K_{Sα} ≈ K_{S\tilde{α}} P_{\tilde{α}α}.    (4)

The first s pivots of the QR define \tilde{α}. Using the QR, we can compute P_{\tilde{α}α} = K_{S\tilde{α}}^+ K_{Sα}. This scheme, however, results in O(dN^2 m) complexity for the overall factorization. We can turn it into an O(dNm log N) scheme by sampling a small subset S' of S and using it instead of S [23]. The approximation rank s is chosen such that σ_{s+1}(K_{S'α}) / σ_1(K_{S'α}) < τ, where τ is user-specified and the σ are the singular values estimated from the diagonal of the rank-revealing QR.
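The skeletonization step (4) can be sketched with SciPy's pivoted QR. This is a plain dense interpolative decomposition with a fixed rank s, purely for illustration; the ASKIT implementation additionally samples a subset S' of the rows and selects s adaptively from the tolerance τ.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def skeletonize(K_Salpha, s):
    """Rank-s interpolative decomposition of K_Salpha (|S| x |alpha|), eq. (4):
    returns skeleton column indices `skel` and P (s x |alpha|) such that
    K_Salpha is approximately K_Salpha[:, skel] @ P."""
    Q, R, piv = qr(K_Salpha, mode='economic', pivoting=True)
    skel = piv[:s]                               # first s pivot columns define the skeleton
    T = solve_triangular(R[:s, :s], R[:s, s:])   # interpolation coefficients for the rest
    P = np.empty((s, K_Salpha.shape[1]))
    P[:, piv[:s]] = np.eye(s)                    # skeleton columns reproduce themselves
    P[:, piv[s:]] = T                            # remaining columns are interpolated
    return skel, P

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((400, 30)) @ rng.standard_normal((30, 64))  # numerically low rank
    skel, P = skeletonize(A, s=30)
    print(np.linalg.norm(A - A[:, skel] @ P) / np.linalg.norm(A))       # small residual
```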
For a non-leaf α, we first compute the skeletons \tilde{l} and \tilde{r} of its children, and then the skeleton \tilde{α} ⊂ \tilde{l} ∪ \tilde{r} = [\tilde{l}\tilde{r}] and the projection matrix P_{\tilde{α}[\tilde{l}\tilde{r}]} using another ID (Algorithm II.1).

    Algorithm II.1  [\tilde{α}, P_{\tilde{α}α}] = Skeletonize(α)
      if α is leaf then return [\tilde{α}, P_{\tilde{α}α}] = ID(α);
      [\tilde{l}, P_{\tilde{l}l}] = Skeletonize(l);  [\tilde{r}, P_{\tilde{r}r}] = Skeletonize(r);
      return [\tilde{α}, P_{\tilde{α}[\tilde{l}\tilde{r}]}] = ID([\tilde{l}\tilde{r}]);

Once the skeletonization of every node (but the root) is computed, we can approximate the i-th entry of Kw by

    (Kw)_i ≈ K_{iα} w(α) + Σ_{l=0}^{D} K_{i\tilde{β}} P_{\tilde{β}β} w(β),

where i belongs to α and the β are the siblings of α and of its ancestors.

In ASKIT and [36], the U and V factors of a node α used in (3) are

    [0, K_lr; K_rl, 0] ≈ U_α V_α = [0, K_{l\tilde{r}}; K_{r\tilde{l}}, 0] [P_{\tilde{l}l}, 0; 0, P_{\tilde{r}r}].    (5)

In this work we use an equivalent form. Because K is symmetric, K_lr = K_rl^T, so K_lr can be approximated in two equivalent ways: K_{l\tilde{r}} P_{\tilde{r}r} or P_{l\tilde{l}} K_{\tilde{l}r} (with P_{l\tilde{l}} = P_{\tilde{l}l}^T). Here we use the second form and write

    [0, K_lr; K_rl, 0] ≈ U_α V_α = [P_{l\tilde{l}}, 0; 0, P_{r\tilde{r}}] [0, K_{\tilde{l}r}; K_{\tilde{r}l}, 0].    (6)

We will see that having the P terms on the left allows us to design an O(N log N) factorization algorithm.

We also rely on the level-restriction feature of ASKIT. As Algorithm II.1 shows, the skeletonization proceeds in a bottom-up traversal of the tree. As we traverse the tree, the off-diagonal blocks grow larger and, depending on the problem, the necessary rank s can increase to the extent that no compression takes place. To guarantee accuracy, skeletonization of α should terminate if \tilde{α} = \tilde{l} ∪ \tilde{r}. In some of our numerical experiments, instead of using this criterion, we use L to denote the level at which the skeletonization stops. With level restriction, the factorization described in [36] cannot be used; we introduce a hybrid iterative/direct scheme that addresses this shortcoming in §II-C.

B. Fast direct solver

We have sketched how we compute the UV approximations in \tilde{K} using ASKIT. We now discuss how to use them in the context of (3) to solve with λI + K directly. For simplicity, we describe the case λ = 0; all the algorithms trivially generalize to λ ≠ 0. We first consider the case in which no level restriction takes place and all internal nodes have been skeletonized.

The factorization of \tilde{K} proceeds in a bottom-up traversal of the tree. At the leaf level, we factorize K_{αα} in R^{m x m} using LAPACK's GETRF. For an internal node α, we need U_α, W_α, V_α, and Z_α^{-1}. Using (3), we can write

    \tilde{K}_{αα} = [\tilde{K}_ll, 0; 0, \tilde{K}_rr] ( I + [\hat{P}_{l\tilde{l}}, 0; 0, \hat{P}_{r\tilde{r}}] [0, \tilde{K}_lr; \tilde{K}_rl, 0] ),    (7)

where we define \hat{P}_{l\tilde{l}} = \tilde{K}_ll^{-1} P_{l\tilde{l}} and \hat{P}_{r\tilde{r}} = \tilde{K}_rr^{-1} P_{r\tilde{r}} (notice the "hat" notation). Therefore \hat{P}_{α\tilde{α}} requires "inverting" \tilde{K}_{αα}, which in turn requires traversing all the descendants of α (the subtree rooted at α) and recursively applying (7). (We introduced this scheme in [36]; it results in O(dN log^2 N) complexity. As we will see shortly, this subtree traversal is not necessary.) Once we have W_α, we use the SMW formula to invert (I + W_α V_α). This inverse requires W_α Z_α^{-1} V_α, which in terms of the block decomposition of α can be written as

    [\hat{P}_{l\tilde{l}}, 0; 0, \hat{P}_{r\tilde{r}}] [I, \tilde{K}_lr \hat{P}_{r\tilde{r}}; \tilde{K}_rl \hat{P}_{l\tilde{l}}, I]^{-1} [0, \tilde{K}_lr; \tilde{K}_rl, 0].    (8)

Since α is the parent of l and r, \hat{P}_{l\tilde{l}} and \hat{P}_{r\tilde{r}} have already been computed; \tilde{K}_lr \hat{P}_{r\tilde{r}} and \tilde{K}_rl \hat{P}_{l\tilde{l}} are computed by GEMM, and the reduced system is factorized by GETRF.

We can exploit a "telescoping" relation between \hat{P}_{l\tilde{l}}, \hat{P}_{r\tilde{r}}, and P_{α\tilde{α}}, because P_{α\tilde{α}} satisfies

    D_α^{-1} P_{α\tilde{α}} = [\tilde{K}_ll^{-1}, 0; 0, \tilde{K}_rr^{-1}] ( [P_{l\tilde{l}}, 0; 0, P_{r\tilde{r}}] P_{[\tilde{l}\tilde{r}]\tilde{α}} ).    (9)

The quantity in parentheses (the boxed term in the original layout) requires only GEMM operations on factors from the children l and r, not from all descendants. Since \hat{P}_{α\tilde{α}} = \tilde{K}_{αα}^{-1} P_{α\tilde{α}} = (I + W_α V_α)^{-1} D_α^{-1} P_{α\tilde{α}}, we can replace D_α^{-1} P_{α\tilde{α}} with (9). We then find that \hat{P}_{α\tilde{α}} can be telescoped from \hat{P}_{l\tilde{l}} and \hat{P}_{r\tilde{r}} as

    \hat{P}_{α\tilde{α}} = ( I + [\hat{P}_{l\tilde{l}}, 0; 0, \hat{P}_{r\tilde{r}}] V_α )^{-1} [\hat{P}_{l\tilde{l}}, 0; 0, \hat{P}_{r\tilde{r}}] P_{[\tilde{l}\tilde{r}]\tilde{α}}.    (10)

Notice that we no longer need to solve with \tilde{K}_ll^{-1} and \tilde{K}_rr^{-1} in (10); no tree traversal is required. At the leaf level (base case), \hat{P}_{α\tilde{α}} is computed directly from K_{αα}^{-1} P_{α\tilde{α}}.
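The identity (10) is the crux of the O(N log N) factorization: the parent projection \hat{P}_{α\tilde{α}} follows from the children's \hat{P}_{l\tilde{l}} and \hat{P}_{r\tilde{r}} without any further subtree solves. The following toy check (random stand-ins for the kernel blocks, so purely illustrative) compares (10) against the definition \hat{P}_{α\tilde{α}} = \tilde{K}_{αα}^{-1} P_{α\tilde{α}} for a single parent node.

```python
import numpy as np
rng = np.random.default_rng(2)

m, s, sa = 50, 8, 6            # child size, child skeleton size, parent skeleton size
Z = np.zeros

# Stand-ins for the factors of one parent alpha with children l and r.
K_ll = np.eye(m) + 0.1 * rng.standard_normal((m, m))      # \tilde K_{ll} (invertible)
K_rr = np.eye(m) + 0.1 * rng.standard_normal((m, m))      # \tilde K_{rr}
P_ll = rng.standard_normal((m, s))                        # P_{l \tilde l}
P_rr = rng.standard_normal((m, s))                        # P_{r \tilde r}
K_lr = 0.1 * rng.standard_normal((s, m))                  # K_{\tilde l r}
K_rl = 0.1 * rng.standard_normal((s, m))                  # K_{\tilde r l}
P_lra = rng.standard_normal((2 * s, sa))                  # P_{[\tilde l \tilde r] \tilde alpha}

# Eq. (6): off-diagonal approximation U_alpha V_alpha; D_alpha is block diagonal.
U = np.block([[P_ll, Z((m, s))], [Z((m, s)), P_rr]])
V = np.block([[Z((s, m)), K_lr], [K_rl, Z((s, m))]])
D = np.block([[K_ll, Z((m, m))], [Z((m, m)), K_rr]])
K_aa = D + U @ V                                          # \tilde K_{alpha alpha}

# Children's hatted projections, eq. (7): \hat P = \tilde K^{-1} P.
Ph = np.block([[np.linalg.solve(K_ll, P_ll), Z((m, s))],
               [Z((m, s)), np.linalg.solve(K_rr, P_rr)]])

# Definition: \hat P_{alpha} = \tilde K_{alpha alpha}^{-1} (U P_{[lr]alpha}).
lhs = np.linalg.solve(K_aa, U @ P_lra)
# Eq. (10): telescoped form, no further solves with K_ll or K_rr.
rhs = np.linalg.solve(np.eye(2 * m) + Ph @ V, Ph @ P_lra)

print(np.linalg.norm(lhs - rhs) / np.linalg.norm(lhs))    # ~ machine precision
```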
Given these formulas and the skeletonization computed in Algorithm II.1, we compute the factors needed for the direct solver in a postorder traversal of the tree (Algorithm II.2). If α is a leaf node, we factorize λI + K_{αα} using an LU factorization. Otherwise, we compute \tilde{K}_lr and \tilde{K}_rl. Since \hat{P}_{l\tilde{l}} and \hat{P}_{r\tilde{r}} were computed in the previous recursion, we can form and factorize the reduced system Z_α. Finally, \hat{P}_{α\tilde{α}} is telescoped using (10), so Solve(α, W_α P_{[\tilde{l}\tilde{r}]\tilde{α}}, false) (Algorithm II.3) does not invoke recursion.

    Algorithm II.2  Factorize(α)
      if α is leaf then
        LU factorize λI + K_{αα};  W_α = P_{α\tilde{α}};  P_{[\tilde{l}\tilde{r}]\tilde{α}} = I.
      else
        Factorize(l);  Factorize(r).
        Form W_α with \hat{P}_{l\tilde{l}}, \hat{P}_{r\tilde{r}}, and V_α with \tilde{K}_lr, \tilde{K}_rl.
        Form and factorize the reduced system Z_α using (8).
      \hat{P}_{α\tilde{α}} = Solve(α, W_α P_{[\tilde{l}\tilde{r}]\tilde{α}}, false), telescoped using (10).

    Algorithm II.3  w = Solve(α, u, do_recur)
      if α is leaf then
        w = (λI + K_{αα})^{-1} u  (LU solve)
      else
        if do_recur then u = [Solve(l, u_l, true); Solve(r, u_r, true)].
        Compute w = u - W_α Z_α^{-1} V_α u using (8).

Algorithm II.3 applies \tilde{K}_{αα}^{-1} to a vector u. If α is a leaf node, we directly invoke an LU solve to obtain w = K_{αα}^{-1} u. Otherwise, there are two situations. If Algorithm II.3 is called by Factorize (do_recur is false), we know that u = W_α P_{[\tilde{l}\tilde{r}]\tilde{α}}, so no recursion is required. When Solve is called on an arbitrary u, we solve with \tilde{K}_ll^{-1} and \tilde{K}_rr^{-1} recursively and compute (8) with GEMV on V_α and W_α and an LU solve (GETRS) on Z_α. Therefore, the complete algorithm consists of constructing the tree and calling Algorithm II.1, then Algorithm II.2, and then Algorithm II.3, each on the root of the tree.

This algorithm improves on the one in [36] by removing the extra subtree traversals that cause the O(N log^2 N) complexity. Instead, our algorithm exploits the nested structure of \hat{P}_{α\tilde{α}}, resulting in O(N log N) complexity for the factorization. In some of our largest runs this gave over a 3x speedup without any change in accuracy.

Parallel direct solver. The parallelization is essentially identical to the scheme proposed in [36]. Each subtree (a set of points {x}) is assigned to a distributed-memory process (a worker). Although we described recursive versions of our algorithms, we use level-by-level traversals combined with shared- or distributed-memory parallelism (depending on the level) across nodes in the same level. If the number of tree nodes is less than the number of physical cores, the OpenMP nested construct is enabled so that each thread invokes parallel BLAS or LAPACK routines. If we have p processes, then above level log p of the tree we must communicate to compute the factors, since the terms needed are distributed among processes. We use the Message Passing Interface (MPI) for distributed-memory communication.

[Figure 1: The top four levels of the tree and the corresponding blocks of the matrix \tilde{K}. The nodes belonging to each process are highlighted in a single color; each process factorizes its own portion of the tree independently. We also highlight the factors used in the direct solver construction and show which process owns which factor: each process owns a diagonal block and all factors in the same column and row. For example, the yellow process owns P_{β\tilde{β}} and K_{\tilde{α}β} at level 1; similarly it owns P_{β\tilde{θ}} and K_{\tilde{π}β} at level 0.]

In Figure 1, we summarize the distributed-memory algorithm. The four colors represent four different MPI ranks and the nodes they own. The tree on the right shows that level l = 2 tree nodes are uniquely assigned to ranks, while tree nodes with l < 2 are shared among ranks. To facilitate collective communication, each distributed tree node creates a local communicator, which equally divides the ranks of the parent. We use {i} to denote the i-th MPI rank in the local communicator. Consider the communicator of α, which involves q ranks, and let c denote the child of α that {i} owns: c = l if i < q/2, otherwise c = r. For a distributed node α, the data points {x}_i owned by {i} are never required by other MPI processes, and X_α = X_l ∪ X_r = (∪_{i<q/2} {x}_i) ∪ (∪_{i≥q/2} {x}_i). However, the skeletons \tilde{α} and P_{\tilde{α}[\tilde{l}\tilde{r}]} are only stored on {0}. When the sibling needs this information, we exchange it using a SendRecv between {0} and {q/2} on the parent communicator of α and its sibling. Once received, α and the sibling communicator can Bcast to every process in their groups.
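A minimal mpi4py sketch of this communicator layout (the paper's implementation is C++/MPI; this only illustrates the halving scheme): every distributed tree node splits its communicator in two, so the owned child c is l for ranks i < q/2 and r otherwise.

```python
from mpi4py import MPI

def descend(comm, level=0):
    """Walk down the distributed part of the tree, halving the communicator
    at every level, until each rank owns a local subtree (size-1 communicator)."""
    q = comm.Get_size()
    i = comm.Get_rank()
    if q == 1:
        return level                       # below this level the tree is purely local
    child = 'l' if i < q // 2 else 'r'     # c = l if {i < q/2}, else c = r
    child_comm = comm.Split(color=0 if child == 'l' else 1, key=i)
    # Rank {0} of each half holds the skeletons and projections of its node;
    # siblings exchange them with a SendRecv on the parent communicator.
    deepest = descend(child_comm, level + 1)
    child_comm.Free()
    return deepest

if __name__ == "__main__":
    world = MPI.COMM_WORLD
    print(f"rank {world.Get_rank()} enters its local tree at level {descend(world)}")
```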
Algorithm II.4 describes the recursive distributed factorization. At each distributed node α, the ranks {i < q/2} require the skeleton \tilde{r} owned by {q/2} to compute \tilde{K}_{\tilde{r}{x}}, and the ranks {i ≥ q/2} require \tilde{l} owned by {0} to compute \tilde{K}_{\tilde{l}{x}}. Assuming \hat{P}_{{x}\tilde{l}} and \hat{P}_{{x}\tilde{r}} have been computed, each rank computes \tilde{K}_{\tilde{r}{x}} \hat{P}_{{x}\tilde{l}} and \tilde{K}_{\tilde{l}{x}} \hat{P}_{{x}\tilde{r}}. {0} reduces all \tilde{K}_{\tilde{r}{x}} \hat{P}_{{x}\tilde{l}} to form \tilde{K}_{\tilde{r}l} \hat{P}_{l\tilde{l}}; {q/2} reduces all \tilde{K}_{\tilde{l}{x}} \hat{P}_{{x}\tilde{r}} and sends \tilde{K}_{\tilde{l}r} \hat{P}_{r\tilde{r}} to {0}. The LU factorization of Z_α is done and stored on {0}. Finally, {0} broadcasts P_{[\tilde{l}\tilde{r}]\tilde{α}} so that all processes can telescope \hat{P}_{{x}\tilde{α}} with the distributed solver.

    Algorithm II.4  DistFactorize(α, q)
      if α is at level log p then
        Factorize(α).
      else
        DistFactorize(c, q/2).
        {i < q/2}: SendRecv \tilde{r}.                                   {i ≥ q/2}: SendRecv \tilde{l}.
        {i < q/2}: Reduce \tilde{K}_{\tilde{r}{x}} \hat{P}_{{x}\tilde{l}} to {0}.   {i ≥ q/2}: Reduce \tilde{K}_{\tilde{l}{x}} \hat{P}_{{x}\tilde{r}} to {q/2}.
        {q/2}: Send \tilde{K}_{\tilde{l}r} \hat{P}_{r\tilde{r}};  {0}: Recv.
        {0}: Bcast P_{[\tilde{l}\tilde{r}]\tilde{α}} and LU factorize Z_α.
        \hat{P}_{{x}\tilde{α}} = DistSolve(α, \hat{P}_{{x}\tilde{c}} P_{[\tilde{c}]\tilde{α}}, q, false).

Algorithm II.5 is the recursive distributed solver, which is called in two situations. When it is called by DistFactorize (do_recur is false), the input u can be telescoped by (10) and no recursion takes place. When it is called to solve an arbitrary u, we need to traverse all the way down to the leaf level. Since the column blocks of V_α (\tilde{K}_{\tilde{l}{x}} and \tilde{K}_{\tilde{r}{x}}) are distributed, a MatVec with V_α requires reductions between processes to form u_{\tilde{l}} and u_{\tilde{r}}; the reduced system Z_α^{-1}[u_{\tilde{l}}; u_{\tilde{r}}] is then solved on {0}. The output is broadcast back to all processes, because the row blocks of W_α (\hat{P}_{{x}\tilde{l}} and \hat{P}_{{x}\tilde{r}}) are also distributed among processes. Once level log p is reached, Algorithm II.3 works on the local subtree.

    Algorithm II.5  w = DistSolve(α, u, q, do_recur)
      if α is at level log p then
        w = Solve(α, u, do_recur).
      else
        if do_recur then u = DistSolve(c, u, q/2, do_recur).
        {i < q/2}: Reduce u_{\tilde{r}} = \tilde{K}_{\tilde{r}{x}} u.        {i ≥ q/2}: Reduce u_{\tilde{l}} = \tilde{K}_{\tilde{l}{x}} u.
        {0} and {q/2}: SendRecv so that {0} holds [u_{\tilde{l}}; u_{\tilde{r}}].
        {0}: [u_{\tilde{l}}; u_{\tilde{r}}] = Z_α^{-1} [u_{\tilde{l}}; u_{\tilde{r}}] using (8);  Bcast u_{\tilde{l}};  {q/2}: Bcast u_{\tilde{r}}.
        {i < q/2}: w = u - \hat{P}_{{x}\tilde{l}} u_{\tilde{l}}.             {i ≥ q/2}: w = u - \hat{P}_{{x}\tilde{r}} u_{\tilde{r}}.
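To make the communication pattern of Algorithm II.5 concrete, here is a hedged mpi4py sketch of one distributed level. For brevity it collapses the two halves into a single reduction to rank {0}: partial products with the local column block of V are reduced, the small system Z is solved on {0}, and the result is broadcast so that every rank can apply its row block of W. Array shapes and variable names are placeholders, not the library's API.

```python
import numpy as np
from mpi4py import MPI

def distributed_matvec_level(comm, V_local, W_local, Z, u_local):
    """One distributed level of Algorithm II.5 (illustrative sketch).
    V_local : (2s x n_local) column block of V owned by this rank
    W_local : (n_local x 2s) row block of W owned by this rank
    Z       : dense reduced matrix I + V W, needed on rank 0 only
    u_local : local part of the input vector."""
    # Reduction: every rank contributes V_local @ u_local; {0} holds the sum.
    partial = V_local @ u_local
    reduced = np.zeros_like(partial)
    comm.Reduce(partial, reduced, op=MPI.SUM, root=0)

    # The reduced system is solved on {0} only, then broadcast to everyone,
    # because the row blocks of W are distributed as well.
    if comm.Get_rank() == 0:
        z = np.linalg.solve(Z, reduced)
    else:
        z = np.empty_like(reduced)
    comm.Bcast(z, root=0)

    # Local update with this rank's row block of W:  w = u - W Z^{-1} V u.
    return u_local - W_local @ z
```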
C. Fast hybrid solver

As discussed, level restriction is necessary when an off-diagonal block is no longer low rank. We refer to the set of nodes that are skeletonized but whose parent is not as the skeletonization frontier A. In Figure 2, the yellow nodes define the frontier; nodes "above" the frontier (i.e., closer to the root) cannot be skeletonized.

[Figure 2: The yellow nodes are the skeletonization frontier A. Their parents (green and blue nodes) are not skeletonized, because further compression may result in loss of accuracy. We also highlight the expanded matrix blocks that are necessary because of this lack of compression.]

In this case the SMW formulation (3) can still be used, but D, U, and V have as many blocks as the number of nodes above A, so the Z matrix will be quite large. For example, if the frontier A consists of all nodes at level L and a node in A has s skeletons, then the size of Z is 2^L s. If we compute the full factorization, the cost is O(2^{2L} s^2 N + 2^{3L} s^3) work and O(2^L s N) storage. The basic idea here is not to store and factorize the Z, W, and V factors of the unskeletonized nodes, but instead use a matrix-free Krylov method. We refer to this approach as the "hybrid method", since we factorize only up to the frontier.

Partial factorization. We still factorize skeletonized nodes bottom-up until reaching the frontier A. For the tree nodes in Figure 2, what we factorize are the diagonal blocks D (yellow). Then, conceptually, we coalesce all W (blue) and V (green) factors of the nodes above A. Notice that we can still apply SMW to this partial factorization in the form of Figure 3. Rather than continuing to factorize the unskeletonized nodes, we switch to an iterative solution of the reduced system with (I + VW).

[Figure 3: The SMW formula (I - W(I + VW)^{-1}V) D^{-1} for the partial factorization of Figure 2. D is factorized by Algorithm II.4 and solved by Algorithm II.5 on skeletonized nodes. Non-skeletonized nodes are collapsed into W and V factors of size 2^L s x N, so they can no longer be factorized efficiently. The reduced system (I + VW) is solved iteratively by GMRES.]

In Algorithm II.6, we show this hybrid algorithm. D^{-1} is applied by Algorithm II.5 on the skeletonized nodes. The iterative solver requires the ability to compute MatVecs with W and V. Algorithm II.7 (MatVecW) traverses downward from the root; since P_{\tilde{α}[\tilde{l}\tilde{r}]} = I for all α above the frontier, MatVecW only does work on the frontier (α in A). Algorithm II.8 (MatVecV) computes K_{\tilde{β}α} u or K_{\tilde{β}{x}} u for all nodes above (and including) the frontier. In the distributed tree, MatVecV performs a reduction over {x}; thus, an AllReduce is required at the end so that all MPI ranks obtain the same output. MatVecW, on the other hand, performs a scattering over {x}, so MatVecW always happens after MatVecV; its inputs are then the same on all MPI ranks.

    Algorithm II.6  w = HybridSolve(u, A)
      w(α) = DistSolve(α, u, true) for all α in A.
      w = MatVecV(root, w, A).
      Solve w = (I + VW)^{-1} w iteratively (GMRES).
      Compute w = u - MatVecW(root, w, A).

    Algorithm II.7  w = MatVecW(α, u, A)
      if α is at level log p then
        if α in A then w = \hat{P}_{α\tilde{α}} u
        else w = [MatVecW(l, u_l, A); MatVecW(r, u_r, A)].
      else
        if α in A then w = \hat{P}_{{x}\tilde{α}} u
        else w = MatVecW(c, u, A).

    Algorithm II.8  w = MatVecV(α, u, A)
      if α is at level log p then
        w = K_{\tilde{β}α} u  (β is the sibling of α).
        if α not in A then w = [MatVecV(l, u_l, A); MatVecV(r, u_r, A)].
      else
        w = K_{\tilde{β}{x}} u  (β is the sibling of α).
        if α not in A then w = MatVecV(c, u, A).
      if α is root then AllReduce w.

Level restriction reduces the size of the system that must be solved iteratively from N to 2^L s.
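The hybrid solver never forms I + VW explicitly; it only needs the action x -> x + V(Wx). A minimal SciPy sketch of that pattern, with generic callbacks standing in for MatVecW and MatVecV of Algorithms II.7-II.8 (the toy W and V below are dense placeholders):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def hybrid_reduced_solve(matvec_W, matvec_V, rhs):
    """Solve (I + V W) z = rhs with matrix-free GMRES.
    matvec_W(x): applies the coalesced W factor (cf. Algorithm II.7)
    matvec_V(y): applies the coalesced V factor (cf. Algorithm II.8)
    rhs: right-hand side of size 2^L * s (the reduced system)."""
    n = rhs.shape[0]
    op = LinearOperator((n, n), matvec=lambda x: x + matvec_V(matvec_W(x)), dtype=float)
    z, info = gmres(op, rhs)
    if info != 0:
        raise RuntimeError(f"GMRES did not converge (info={info})")
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    N, k = 5000, 64                          # N points; toy reduced size 2^L * s = 64
    W = 0.01 * rng.standard_normal((N, k))   # dense stand-ins for the coalesced factors
    V = 0.01 * rng.standard_normal((k, N))
    rhs = rng.standard_normal(k)
    z = hybrid_reduced_solve(lambda x: W @ x, lambda y: V @ y, rhs)
    print(np.linalg.norm(rhs - (z + V @ (W @ z))))   # residual of the reduced system
```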
D. Fast kernel summation

All the algorithms described in this paper rely on multiplying submatrices of K with vectors, which we refer to as "kernel summation". These matrices can be precomputed and stored, or they can be used in a matrix-free manner by computing K_ij = K(x_i, x_j) in O(d) time on the fly. For example, during the factorization of K the submatrices K_{\tilde{β}α} can be computed and stored; a MatVec with K_{\tilde{β}α} for all tree nodes α in the solving phase can then be done in O(sN log N) work with GEMV. However, storing all these submatrices requires O(sN log N) memory, and the memory requirements are even higher in the level-restricted version of our algorithm, so we never store these submatrices in the hybrid method. Alternatively, a matrix-free version only requires O(dN) storage (the coordinates of the points) and turns the GEMV into a GEMM; but since kernel evaluations are quite expensive, this can be significantly slower than storing the matrix and using GEMV. Our goal is to reduce the storage requirements without significantly sacrificing performance.

In [24], we presented GSKS (General Stride Kernel Summation), a matrix-free kernel summation that performs a fusing optimization. While the best-known method computes

    K_{\tilde{β}α} u = GEMV( K( GEMM(X_{\tilde{β}}^T, X_α) ), u ),    (11)

GSKS fuses K (the kernel function) and GEMV (the reduction) into GEMM (a semi-ring rank-d update). [35] uses the same idea to fuse nearest-neighbor search into GEMM. In a BLIS-like framework [30], matrix-matrix multiplication (C = AB) is divided into subproblems; a small subproblem that fits into registers is implemented in vectorized assembly or intrinsics to maximize FLOPS throughput. The idea is to perform the kernel evaluation and the GEMV on C while it is still in registers and only store back a vector w. In short, for a typical kernel summation that involves an m x n x d GEMM with O(md + nd + mn) MOPS (memory operations) in the best-known method, GSKS achieves O(mnd) FLOPS with only O(md + nd) MOPS. This makes the computation less memory bound even for small d. In this work, we implement this idea in AVX2 and AVX512 for the Haswell and KNL architectures. We present the performance of the two approaches in Table I. Due to the O(mn) memory saving, GSKS is about 3-30x faster than the best-known method on KNL for large problem sizes(2) and d < 68; GSKS significantly outperforms the standard approach.

    Table I. Gaussian kernel summation efficiency (GFLOPS) for 16K x 16K x d, 8K x 8K x d, and 4K x 4K x d.
    The reference implementation uses MKL DGEMM and VML VEXP; GSKS is available at https://github.com/ChenhanYu/ks.

    Size  Arch     Method    d=4   d=20  d=36  d=68  d=132  d=260
    16K   Haswell  MKL+VML    31    53    72   115    190    305
    16K   Haswell  GSKS      321   465   512   634    687    680
    16K   KNL      MKL+VML    12    93   132   416    636    916
    16K   KNL      GSKS      703   888  1067  1246   1334   1449
    8K    Haswell  MKL+VML    32    56    80   110    198    296
    8K    Haswell  GSKS      301   448   515   558    620    543
    8K    KNL      MKL+VML    11    93   103   166    506    753
    8K    KNL      GSKS      479   888   903   975   1220   1345
    4K    Haswell  MKL+VML    30    52    70   110    180    284
    4K    Haswell  GSKS      250   359   384   420    477    468
    4K    KNL      MKL+VML    11    56    76   116    370    578
    4K    KNL      GSKS      341   445   464   510    858   1015

(2) For small problem sizes, GEMM in LIBXSMM (https://github.com/hfp/libxsmm) may be slightly faster than MKL GEMM, but it is still memory bound when d is small.
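The GSKS micro-kernel itself is hand-written AVX2/AVX-512; as a purely illustrative stand-in, the NumPy sketch below evaluates w = K_{\tilde{β}α} u without ever storing the full kernel block, processing the target points tile by tile. It captures the O(md + nd) memory-traffic idea behind (11), not the register-level fusion.

```python
import numpy as np

def gaussian_tile(Xs, Xt, h):
    """Kernel tile K[i, j] = exp(-||xs_i - xt_j||^2 / (2 h^2)); Xs is d x s, Xt is d x b."""
    sq = (Xs**2).sum(0)[:, None] + (Xt**2).sum(0)[None, :] - 2.0 * Xs.T @ Xt
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * h**2))

def matrix_free_kernel_summation(X_skel, X_alpha, u, h, tile=1024):
    """w = K(X_skel, X_alpha) @ u, accumulated tile by tile.
    Only an s x tile block of kernel entries is live at any time,
    never the full s x |alpha| submatrix."""
    s = X_skel.shape[1]
    w = np.zeros(s)
    for j in range(0, X_alpha.shape[1], tile):
        Xt = X_alpha[:, j:j + tile]
        w += gaussian_tile(X_skel, Xt, h) @ u[j:j + tile]   # fuse evaluation and GEMV per tile
    return w
```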
III. THEORY

Here we present some theoretical complexity guarantees and discuss the stability of our direct solver.

Work. We present the complexity analysis of Algorithms II.2, II.3, and II.6. Throughout, we fix the leaf size m, the level restriction L, and the maximum skeleton size s. T_f(N) denotes the complexity of Algorithm II.2 and T_s(N) that of Algorithm II.3, each for N points. Since Solve does either an LU solve or a matrix-vector multiply at each step, we have

    T_s(N) = 2 T_s(N/2) + O(Ns + s^2) = O(sN log N).    (12)

Notice that solving \hat{P}_{α\tilde{α}} = \tilde{K}_{αα}^{-1} P_{α\tilde{α}} with (10) does not require traversing to the leaf level; (10) only takes O(s^2 N) work. Using the complexity above, we derive T_f(N) as

    T_f(N) = 2 T_f(N/2) + s^2 N + s^3 = O(s^2 N log N).    (13)

In the hybrid solver, each (I + VW)x operation requires O(2^L s N) work. To summarize, both Algorithm II.2 and Algorithm II.3 take O(N log N) work, and Algorithm II.6 takes O(N log N) with an additional O(N) per iteration if L is independent of N.

Communication. The communication cost of the solving phase is O(s log p) per level; traversing the whole tree requires O(s log^2 p) per right-hand side. However, during the factorization the solving phase does not recur. Thus, instead of O(s^2 log^2 p) for s right-hand sides, there is only O(s^2 log p) communication per level. Overall, the communication cost of the full factorization is O(s^2 log^2 p), since there are log p distributed levels.
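The recurrences (12)-(13) can be checked numerically. The sketch below (ours, with arbitrary leaf-cost constants) evaluates them directly and divides by s N log2(N/m) and s^2 N log2(N/m); the ratios approach constants as N grows, which is exactly the O(N log N) claim.

```python
import numpy as np

def T_solve(N, s, m):
    """Recurrence (12): T_s(N) = 2 T_s(N/2) + O(N s + s^2)."""
    if N <= m:
        return m**2                      # leaf LU solve cost (constants dropped)
    return 2 * T_solve(N // 2, s, m) + N * s + s**2

def T_factor(N, s, m):
    """Recurrence (13): T_f(N) = 2 T_f(N/2) + s^2 N + s^3."""
    if N <= m:
        return m**3                      # leaf LU factorization cost (constants dropped)
    return 2 * T_factor(N // 2, s, m) + s**2 * N + s**3

s, m = 256, 512
for N in (2**17, 2**20, 2**23):
    depth = np.log2(N / m)
    print(N,
          T_factor(N, s, m) / (s**2 * N * depth),   # -> constant: O(s^2 N log N)
          T_solve(N, s, m)  / (s * N * depth))      # -> constant: O(s N log N)
```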
Memory. The memory cost of our methods depends on the level restriction L and the maximum skeleton size s. These requirements are in addition to the cost of storing the coordinates and the skeleton information for ASKIT, reported in [21]. In our direct solver, we require the factors U, V, and I + WV for each level of the tree below the level restriction L at which skeletonization stops. This requires O(2sN + s^2) per level, so the overall memory required is

    O( (2sN + s^2) ( log(N/m) - L ) ).    (14)

Using GSKS reduces one sN log(N/m) term to O(1) by computing V on the fly; recomputing W with (10) reduces the other sN log(N/m) term to sN. Using both schemes yields O(s^2 (log(N/m) - L) + sN) storage with O((d + s^2) N log N) work (still O(N log N) asymptotically).

Stability. Overall, the stability of our method is related to the conditioning of λI + \tilde{K}, of D, and of the reduced system I + VW. We use κ = σ_1/σ_n to denote the 2-norm condition number of a matrix, where σ_1 and σ_n are its largest and smallest singular values. [10] suggests that when either U or V is orthonormal, κ(I + VW) ≤ κ(D) κ(\tilde{K}). Although our U and V are not orthonormal, in our experience κ(I + VW) does not cause conditioning problems when D and \tilde{K} are well conditioned. The relation between κ(λI + D) and κ(λI + K) is more interesting. In general, when h shrinks we expect K to become more diagonally dominant and thus better conditioned. However, counter to this intuition, it is possible for D to become more poorly conditioned as h shrinks. Since D is a submatrix of K, we have σ_1(D) ≤ σ_1(K) and σ_n(D) ≤ σ_n(K). When σ_n(K) < λ, then κ(λI + D) < κ(λI + K), since λ dominates the denominator. However, when σ_n(\tilde{K}) > λ, κ(λI + D) can grow even as κ(λI + K) remains small. If this happens on many levels of the factorization, the method is not stable. With narrow bandwidths, where K approaches a (block-)diagonal matrix, σ_n > λ may occur. Moreover, under the framework of hierarchical matrices, the pivot rows we can choose during the factorization of D are limited to the skeleton rows; thus, even if κ(λI + K) is not bad, λI + D can be unstable due to this restricted pivoting when λ is small. Our methods can detect this situation, but avoiding it entirely (or fixing it) is not straightforward.

IV. EXPERIMENTAL SETUP

We performed numerical experiments on the Haswell and KNL architectures with four different setups to examine the accuracy and efficiency of our methods. In particular, we want to demonstrate (1) the complexity improvement over [36], (2) FLOPS efficiency, (3) scalability, and (4) the advantages of our hybrid solver. We explore the task of kernel ridge regression for binary supervised classification [33], which requires approximating the solution of a system with λI + K during the training step. We use the Gaussian kernel with bandwidth h. The model weight w is chosen by solving w = (λI + \tilde{K})^{-1} u, where u is given (the labels). Once w is computed, the label of a point x not in X is sign(K(x, X) w). We apply our methods to train this model on real-world datasets employing up to 3,072 x86 cores and 4,352 KNL cores. The percentage of correct predictions (Acc) is reported in Table II, along with the optimal h and λ found using holdout cross-validation.

Implementation and hardware. Our experiments were conducted on the Lonestar5 (two 12-core, 2.6 GHz Xeon E5-2690 v3 "Haswell" per node) and Stampede (68-core, 1.4 GHz Xeon Phi 7250 "KNL" per node) clusters at the Texas Advanced Computing Center. The theoretical peak(3) performance is 998 GFLOPS per Haswell node and 3,046 GFLOPS per KNL node. Inv-ASKIT and GSKS are compiled with intel-16 -O3 -mavx on Lonestar5 and intel-17 -O3 -xMIC-AVX512 on Stampede. All iterative solvers employ a Krylov subspace method (GMRES) from the PETSc library [3]; specifically, we use modified Gram-Schmidt re-orthogonalization and GMRES CGS refinement. Unless specified otherwise, KNL experiments use the Cache-Quadrant configuration with OMP_PROC_BIND=spread. "T" refers to total runtime in seconds, and "GFs" refers to GFLOPS per node.

(3) We estimate the peak from the clock rate and the theoretical FMA throughput. For 24 Haswell cores, 998 = 2 x 12 x 2.6 x 16; for 68 KNL cores, 3046 = 68 x 1.4 x 32. As a reference, MKL GEMM achieves 87% of peak on the Haswell node and 69% on the KNL node. We assume two VPUs can dual-issue DFMAs [29]; however, Intel processors may run at a lower frequency while fully issuing FMAs (the clock rate may drop to 1.0 GHz), which may be why MKL GEMM only achieves 2.1 TFLOPS on KNL.

Datasets. We use the real-world datasets COVTYPE (forest cartographic variables), SUSY and HIGGS (high-energy physics) [19], MNIST (handwritten digit recognition) [5], and MRI (brain MRI) [25]. We also use a 64-dimensional synthetic dataset (NORMAL), drawn from a 6D normal distribution and embedded in 64D with additional noise; it has a high ambient but relatively small intrinsic dimension.

    Table II. Datasets used in the experiments. N is the size of the training set and d the dimensionality
    of the points. The testing sets are disjoint from the training sets; we sample 10K testing points and
    report the binary classification accuracy in the "Acc" column. h is the bandwidth of the Gaussian
    kernel. The regression results are produced with the parameters below; other combinations used during
    cross-validation are detailed in §V. MRI, MNIST8M and NORMAL are not used in regression tasks. For
    MNIST2M we perform one-vs-all binary classification for the digit '3'. All coordinates are normalized
    to zero mean and unit variance.

    Dataset    N          d    h     λ     Acc
    COVTYPE    0.1-0.5M   54   0.07  0.3   96%
    SUSY       4.5M       8    0.07  10    78%
    MNIST2M    1.6M       784  0.30  0     100%
    HIGGS      10.5M      28   0.90  0.01  73%
    MRI        3.2M       128  3.5   10    -
    MNIST8M    8.1M       784  1.0   1.0   -
    NORMAL     1-32M      64   0.19  1.0   -

Accuracy metrics and parameter selection. For the linear solve, we report the relative residual

    ε_r = || u - (λI + \tilde{K}) w ||_2 / || u ||_2.    (15)

The parameters h and λ of the Gaussian kernel were selected using cross-validation; Table II reports the parameters used, and other combinations in §V are cross-validation candidates. The level restriction L is chosen such that the relative error is controlled. In Tables IV and V we use L = 3; for the experiments in Figure 5 we use L = 5 or 7.
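Given a computed weight vector w, the classification rule and the reported residual (15) are one-liners. The sketch below assumes a dense kernel helper and an (approximate) MatVec callback, both placeholders for the actual treecode.

```python
import numpy as np

def predict_labels(X_test, X_train, w, h, kernel):
    """sign(K(x, X) w) for each test point; `kernel(X_test, X_train, h)` builds
    the dense test/train kernel block (placeholder for the fast summation)."""
    return np.sign(kernel(X_test, X_train, h) @ w)

def relative_residual(K_tilde_matvec, w, u, lam):
    """Relative residual of eq. (15): ||u - (lam*I + K~) w|| / ||u||.
    K_tilde_matvec(w) applies the approximate kernel matrix (e.g., ASKIT's MatVec)."""
    r = u - (lam * w + K_tilde_matvec(w))
    return np.linalg.norm(r) / np.linalg.norm(u)
```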
V. EMPIRICAL RESULTS

The experiments are labeled #1 to #39 in the tables and figures. We select representative parameter combinations and compare the runtime between [36] and our Algorithm II.4 in Table III. We present single-node performance on Haswell and KNL in Table IV. In Figure 4, we present strong scalability and verify the O(N log N) complexity of our method (Algorithm II.4). In Table V, we compare our hybrid solver (Algorithm II.6) with the direct method (Algorithm II.5). In Figure 5, we report the convergence behavior of the iterative solver and of Algorithm II.6 on λI + \tilde{K}. Throughout, τ is the relative tolerance of the approximation of the kernel matrix K, s_max is the maximum skeleton size, κ is the number of nearest neighbors used for skeletonization sampling in ASKIT, m is the leaf node size, and L is the level restriction.

Comparison with [36] (Table III). We compare the O(N log^2 N) algorithm of [36] and our O(N log N) algorithm using the same parameters. We only compare the case without level restriction, because [36] does not support this feature. The runtime is directly associated with the rank s; for example, the U and V matrices in #4 are much larger than those in #3, so the runtime is also much longer. The overall speedup is about 2-4x due to the log N term, and both methods construct exactly the same factorization (up to roundoff errors). Although the speedup is not exactly log N, due to the prefactors depending on s and d, we can expect the asymptotic speedup to be O(log N): COVTYPE only achieves 1.9x, but HIGGS achieves 3.8x because the problem size is 20x larger. Both methods have O(N log N) complexity for the "Solve" operation; for the experiments in Table III, the longest "Solve" (#8) takes less than 2 seconds.

    Table III. Factorization time (seconds) of [36] ("log2", O(N log^2 N)) versus our method ("log",
    O(N log N)), with adaptive ranks selected by τ and s_max. Experiments ran on Lonestar5 using 128
    nodes (3,072 cores). COVTYPE and SUSY used m = 2,048, κ = 2,048, s_max = 2,048; MNIST used κ = 256;
    HIGGS used m = 512, κ = 1,024; NORMAL used m = 512, κ = 128 and s_max = 256.

                              τ=1E-1        τ=1E-3        τ=1E-5
    #   dataset      h       log2   log    log2   log    log2   log
    1   COVTYPE      .35     <1     <1     5      3      15     8
    2   COVTYPE      .07     5      3      24     11     27     12
    3   SUSY         .50     <1     <1     1      <1     6      3
    4   SUSY         .05     63     23     125    37     125    37
    5   MNIST2M      1.0     24     11     38     15     40     16
    6   MNIST2M      .10     1      <1     1      <1     4      7
    7   MNIST8M      0.3     142    51     185    61     195    64
    8   HIGGS        2.0     75     29     324    84     326    85
    9   HIGGS        .90     216    66     317    83     323    85
    10  NORMAL32M    .19     18     10     46     22     84     28

Single-node performance (Table IV). We conduct single-node experiments to show the FLOP rates we achieve and to test some of the memory modes on KNL. On a Haswell node, #11 reaches 62% (623/998) of the theoretical peak. On a KNL node, #13 (Cache-Quadrant) is the fastest and achieves 45% (1356/3046). Using MPI on KNL (p > 1) is typically slower due to the extra memory operations. Our implementation does not perform well in flat memory mode (#15 and #16): the memory requirements usually exceed 16GB, so U and V cannot fit into MCDRAM; we tried manually swapping memory between MCDRAM and DDR4, but this was not as efficient as the cache memory mode.

    Table IV. Single-node performance on COVTYPE 100K with m = s = 2048 (fixed rank) and L = 3. p is the
    number of MPI processes and nthd the number of OpenMP threads per process. #11-#12 ran on Haswell;
    #13-#16 ran on KNL (#13-#14 cache-quadrant, #15 flat-quadrant, #16 flat-snc-4). #11 achieves the
    highest factorization performance on Haswell and #13 is the most efficient configuration on KNL. We
    report factorization time (Tf) and GFLOPS (GFf), and three solving times (Ts) whose schemes differ
    in storage requirements and memory operations.

    #                          11      12      13      14      15      16
    Config                     Haswell Haswell C-Quad  C-Quad  F-Quad  F-SNC4
    p                          1       4       1       4       1       4
    nthd                       24      6       68      17      68      17
    Tf                         89      94      41      52      62      77
    GFf                        623     592     1356    1136    895     723
    MatVecV with GEMV:   Ts    0.8     0.8     0.9     0.9     0.9     1.0
                         GFs   14      14      12      12      12      11
    Reevaluate V (GEMM): Ts    4.4     5.3     7.4     4.0     7.5     5.8
                         GFs   158     119     40      74      39      51
    MatVecV with GSKS:   Ts    1.1     1.8     1.1     1.2     1.2     1.3
                         GFs   269     164     269     243     243     228
Reducing storage (Table IV). In Table IV we report the three kernel-summation schemes of §II-D: GEMV, GEMM, and GSKS. The first scheme takes O(sN log N) time and space. The last two schemes evaluate K_{\tilde{β}α} u in O(dsN log N) work (d is 54 for this dataset); GEMM takes O(sN) storage for the matrix, whereas GSKS is matrix free with O(1) extra space. GSKS is only 1.2-1.6x slower than the GEMV approach while being 4-7x faster than the GEMM approach. Notice that 80% of Ts is dominated by GETRS (the triangular solve) when d is small; since GETRS only achieves 2 GFLOPS on a KNL node, the overall GFs is somewhat low.

Scaling (Figure 4). In #17, we use 128 nodes (3,072 cores) and increase N from 1M to 32M. Our implementation stays very close to the theoretical N log N scaling (yellow) and clearly below the N log^2 N scaling (purple). In #18, we fix the dataset (NORMAL 1M) and increase the number of cores. The green line is the ideal scaling (100%); our implementation reaches 62% efficiency on 3,072 Haswell cores and 70% on 4,352 KNL cores. This relatively small problem (1M points) cannot fully exploit all computing resources, so we see degradation when N is small or the number of cores is large (about 230 points per core for 64 KNL nodes).

[Figure 4: O(N log N) verification (#17) and strong scaling (#18). #17 uses NORMAL with m = 512, κ = 128, a fixed s = 256, and L = 1; the blue lines are experimental factorization times and the yellow lines are the theoretical (ideal) times. #18 uses NORMAL 1M with κ = 128, m = s = 2048, and L = 1; we increase the number of nodes up to 128 Haswell nodes (3,072 cores) and 64 KNL nodes (4,352 cores). The green lines represent ideal scaling.]

Hybrid and direct methods comparison (Table V). In Table V we set L = 3 (level restriction) and compare Algorithm II.6 (hybrid) with Algorithm II.2 (direct) on problems where Algorithm II.2 can still be applied (the full factorization requires 2^L sN + 2^{2L} s^2 memory for level restriction L), with adaptive s selection. #20, #23 and #26 use the direct factorization on Haswell, and #19, #22 and #25 on KNL; the remaining three runs use the hybrid method on Haswell. On Haswell, the factorization time Tf of the direct runs is about two times longer than that of #21, #24 and #27. In the factorization phase, neither Haswell nor KNL performs as well as in Table IV, because adaptive ranks s cause load imbalance. If we further increase L, as we must for Figure 5, the cost of the full factorization can grow by 1000x in runtime and 30x in storage. Applying the hybrid solver to a vector is slower than applying the direct solver because of the iteration; e.g., Ts in #21 is about 20x slower. Yet the overall runtime (Tf + Ts) of the hybrid method is still smaller than that of the direct one, and the advantage grows with L. For example, Figure 5 reports results that require L = 7, where Algorithm II.2 cannot be used at all: the memory for Z alone with s = 2048 exceeds 500GB.

[Table V: Hybrid and direct methods comparison (#19-#27) with level restriction L = 3. All experiments use adaptive ranks with tolerance τ = 0.00001 and s_max = 2048. We report the ASKIT building time, factorization time (Tf), solving time employing GSKS (Ts), and efficiency in GFLOPS; the three experiments with orange indices are the hybrid scheme and the remaining are the direct factorization. For the hybrid method we also report the relative residual ε_r and the number of Krylov iterations (KSP).]
Convergence behavior for solving λI + \tilde{K} (Figure 5). We report the convergence rate for four different bandwidths with two methods: (a) unpreconditioned GMRES using ASKIT's MatVec for λI + \tilde{K} (blue lines) and (b) our hybrid method, Algorithm II.6 (orange lines). Each row of the figure corresponds to a dataset with a specific h. These experiments resemble a cross-validation study in which we vary λ to improve learning: across columns we vary λ as [1E-2, 1E-3, 1E-5] x σ_1(\tilde{K}), where σ_1(\tilde{K}) is an estimate, so that the condition number κ of λI + \tilde{K} is roughly 1E+2, 1E+3 and 1E+5, respectively. We report the relative (to a zero initial guess) Krylov residual ε_r (vertical axis) over time (horizontal axis); the steeper the curve, the faster the convergence. The horizontal offset represents setup costs: e.g., in #28 the offset of the blue line (about 140 s) is the cost of building the tree and the skeletons in ASKIT, and the fixed cost of (b) is the fixed cost of (a) plus the factorization time. Most blue lines are flat when the condition number is around 1E+5, but the orange lines still decrease steadily, except for #30 (see §III for the stability issue): in #30 our solver detects numerical ill-conditioning of D, and both methods fail to converge. Overall we observe 10-1000x speedups on the "Solve" operations, and the hybrid scheme is faster and has more predictable behavior.

[Figure 5: Solving λI + \tilde{K} (#28-#39): convergence of the relative residual ε_r (vertical axis) over time in seconds (horizontal axis) for COVTYPE, SUSY, MRI (h = 3.5, λ = 10), MNIST2M (h = 1.0, λ = 1), and HIGGS at condition numbers κ ≈ 1E+2, 1E+3, 1E+5 (columns). Blue: unpreconditioned GMRES; orange: the hybrid method. All experiments used τ = 1E-5, with m, κ and s_max as in Table III. COVTYPE and HIGGS used L = 5; SUSY and MNIST used L = 7. These runs would be 100-1000x more expensive (or run out of memory) with the direct solver.]

VI. CONCLUSIONS

We have introduced new algorithms for approximately factorizing kernel matrices. We evaluated them on both real-world and synthetic datasets with different parameters, and we conducted analysis and experiments to study their complexity and scalability. These experiments scale up to 3,072 Haswell cores and 4,352 KNL cores and exhibit significant speedups over existing methods. The factorization can be very fast: for example, it takes only 10 seconds to factorize a kernel matrix with 32M points in 64 dimensions. Our future work will focus on further optimization of our implementation. In particular, we would like to introduce task parallelism in the tree traversal to address load imbalance: with adaptive ranks or adaptive level restriction, each tree node may have a different workload, and scheduling becomes important to avoid the critical path.
Additionally, we plan to address the stability issues mentioned in §III and to explore other possible variants (e.g., sparse off-diagonal blocks).

REFERENCES

[1] S. Ambikasaran and E. Darve, An O(n log n) fast direct solver for partial hierarchically semi-separable matrices, Journal of Scientific Computing, 57 (2013), pp. 477-501.
[2] S. Ambikasaran, D. Foreman-Mackey, L. Greengard, D. W. Hogg, and M. O'Neil, Fast direct methods for Gaussian processes and the analysis of NASA Kepler mission data, arXiv preprint arXiv:1403.6015, (2014).
[3] S. Balay et al., PETSc Web page, 2014.
[4] M. Bebendorf, Hierarchical Matrices, Springer, 2008.
[5] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST), 2 (2011), p. 27.
[6] B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M.-F. F. Balcan, and L. Song, Scalable kernel methods via doubly stochastic gradients, in Advances in Neural Information Processing Systems, 2014, pp. 3041-3049.
[7] A. Gittens and M. Mahoney, Revisiting the Nystrom method for improved large-scale machine learning, in Proceedings of ICML13, 2013, pp. 567-575.
[8] A. Gray and A. Moore, N-body problems in statistical learning, Advances in Neural Information Processing Systems, (2001), pp. 521-527.
[9] W. Hackbusch, B. N. Khoromskij, and E. E. Tyrtyshnikov, Approximate iterations for structured matrices, Numerische Mathematik, 109 (2008), pp. 365-383.
[10] W. W. Hager, Updating the inverse of a matrix, SIAM Review, 31 (1989), pp. 221-239.
[11] N. Halko, P.-G. Martinsson, and J. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review, 53 (2011), pp. 217-288.
[12] T. Hofmann, B. Schölkopf, and A. J. Smola, Kernel methods in machine learning, The Annals of Statistics, (2008), pp. 1171-1220.
[13] C.-J. Hsieh, S. Si, and I. S. Dhillon, Fast prediction for large-scale kernel machines, in Advances in Neural Information Processing Systems 27, 2014, pp. 3689-3697.
[14] M. Izadi Khaleghabadi et al., Parallel H-matrix arithmetic on distributed-memory systems, (2012).
[15] R. Kondor, N. Teneva, and V. Garg, Multiresolution matrix factorization, in Proceedings of ICML14, 2014, pp. 1620-1628.
[16] R. Kriemann, Parallele Algorithmen für H-Matrizen, dissertation, Universität Kiel, 2004.
[17] R. Kriemann, H-LU factorization on many-core systems, Preprint, Max Planck Institute for Mathematics in the Sciences, 2014.
[18] D. Lee, A. Gray, and A. Moore, Dual-tree fast Gauss transforms, Advances in Neural Information Processing Systems, 18 (2006), pp. 747-754.
[19] M. Lichman, UCI machine learning repository, 2013.
[20] Z. Lu et al., How to scale up kernel methods to be as good as deep neural nets, arXiv preprint arXiv:1411.4000, (2014).
[21] W. B. March, B. Xiao, and G. Biros, ASKIT: Approximate skeletonization kernel-independent treecode in high dimensions, SIAM Journal on Scientific Computing, 37 (2015), pp. 1089-1110.
[22] W. B. March, B. Xiao, S. Tharakan, C. D. Yu, and G. Biros, A kernel-independent FMM in general dimensions, in Proceedings of SC15, Austin, Texas, Nov. 2015.
[23] W. B. March, B. Xiao, S. Tharakan, C. D. Yu, and G. Biros, Robust treecode approximation for kernel machines, in Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Sydney, Australia, Aug. 2015, pp. 1-10.
[24] W. B. March, B. Xiao, C. D. Yu, and G. Biros, An algebraic parallel treecode in arbitrary dimensions, in Proceedings of IPDPS15, Hyderabad, India, May 2015.
[25] B. Menze et al., The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS), IEEE Transactions on Medical Imaging, (2014), p. 33.
[26] S. M. Omohundro, Five balltree construction algorithms, International Computer Science Institute, Berkeley, 1989.
[27] C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.
[28] S. Si, C.-J. Hsieh, and I. Dhillon, Memory efficient kernel approximation, in Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 701-709.
[29] A. Sodani et al., Knights Landing: Second-generation Intel Xeon Phi product, IEEE Micro, 36 (2016), pp. 34-46.
[30] F. G. Van Zee and R. A. van de Geijn, BLIS: A framework for rapidly instantiating BLAS functionality, ACM Transactions on Mathematical Software (TOMS), 41 (2015), p. 14.
[31] R. Wang, Y. Li, M. W. Mahoney, and E. Darve, Structured block basis factorization for scalable kernel matrix evaluation, arXiv preprint arXiv:1505.00398, (2015).
[32] S. Wang, X. S. Li, J. Xia, Y. Situ, and M. V. de Hoop, Efficient scalable algorithms for solving dense linear systems with hierarchically semiseparable structures, SIAM Journal on Scientific Computing, 35 (2013), pp. C519-C544.
[33] L. Wasserman, All of Statistics: A Concise Course in Statistical Inference, Springer, 2004.
[34] C. Williams and M. Seeger, Using the Nyström method to speed up kernel machines, in Proceedings of the 14th Annual Conference on Neural Information Processing Systems, 2001, pp. 682-688.
[35] C. D. Yu, J. Huang, W. Austin, B. Xiao, and G. Biros, Performance optimization for the k-nearest neighbors kernel on x86 architectures, in Proceedings of SC15, ACM, 2015, pp. 7:1-7:12.
[36] C. D. Yu, W. B. March, B. Xiao, and G. Biros, INV-ASKIT: A parallel fast direct solver for kernel matrices, in Proceedings of IPDPS16, Chicago, USA, May 2016.
