GPU-Based Conjugate Gradient Solver for Lattice QCD with Domain-Wall Fermions

Ting-Wai Chiu^{1,2}, Tung-Han Hsieh^{3}, Yao-Yuan Mao^{*,1}, Kenji Ogawa^{1} (for the TWQCD Collaboration)

1 Department of Physics, and Center for Theoretical Sciences, National Taiwan University, Taipei 10617, Taiwan
2 Center for Quantum Science and Engineering, National Taiwan University, Taipei 10617, Taiwan
3 Research Center for Applied Sciences, Academia Sinica, Taipei 115, Taiwan

arXiv:1101.0423v1 [hep-lat], 2 Jan 2011

We present the first GPU-based conjugate gradient (CG) solver for lattice QCD with domain-wall fermions (DWF). It is well known that CG is the most time-consuming part in the Hybrid Monte Carlo simulation of unquenched lattice QCD, which becomes even more computationally demanding for lattice QCD with exact chiral symmetry. We have designed a CG solver for the general 5-dimensional DWF operator on the NVIDIA CUDA architecture with mixed precision, using the defect correction as well as the reliable updates algorithms. We optimize our computation with even-odd preconditioning in the 4D space-time lattice, plus several innovative techniques for the CUDA kernels. On NVIDIA GeForce GTX 285/480, our CG solver attains 180/233 Gflops (sustained).

The XXVIII International Symposium on Lattice Field Theory, Lattice 2010
June 14-19, 2010
Villasimius, Italy

*Speaker.

1. Introduction

Simulation of unquenched lattice QCD with exact chiral symmetry is a grand challenge among all sciences. Even for a modest 16^3 x 32 lattice with lattice spacing a ~ 0.1 fm, it often requires a supercomputer with a peak computing power of more than 50 Teraflops (e.g., 10 racks of IBM BlueGene/L). Therefore, only 2-3 lattice QCD groups around the world could afford to perform the simulation of unquenched lattice QCD with the domain-wall fermion [1], or the overlap-Dirac fermion [2]. However, this scenario has been undergoing a dramatic change during the last 12 months. With the emergence of the low-cost and computationally powerful Graphics Processing Unit (GPU), plugging a graphics card with an NVIDIA GTX 285 (240 cores, one Teraflops peak) into a PC immediately turns the system into a powerful device, attaining a sustained 180 Gflops for the conjugate gradient (CG) with mixed precision.

Since 2009, the Taiwan Lattice QCD Collaboration (TWQCD) has been using a GPU cluster (currently consisting of 250 NVIDIA GPUs with 40 Teraflops (sustained)) to simulate unquenched lattice QCD with optimal domain-wall quarks [3, 4]. We have met the challenge of preserving the chiral symmetry to a high precision and sampling all topological sectors ergodically. For our recent physical results on two-flavor QCD, we refer the readers to [5] and [6].

In this paper, we present our design of the first CG solver for the general 5-dimensional DWF operator on the NVIDIA CUDA architecture with mixed precision, using the defect correction as well as the reliable updates algorithms. Our CG solver is optimized with even-odd preconditioning on the 4-dimensional space-time lattice, plus several innovative tuning techniques for the CUDA kernels. On NVIDIA GeForce GTX 285/480, our CG solver achieves 180/233 Gflops (sustained).

2. Conjugate Gradient Solver for Domain-Wall Fermions

2.1 Optimal Domain-Wall Fermions

Mathematically, for a given N_s (the number of sites in the fifth dimension), the maximal chiral symmetry can be attained by the optimal domain-wall fermion (ODWF) [3] with the operator

  [D(m_q)]_{xx';ss'} = (w_s D_w + 1)_{xx'} \delta_{ss'} + (s_s D_w - 1)_{xx'} L_{ss'},  (s_s = w_s).  (2.1)

Here D_w is the standard Wilson-Dirac operator plus a negative parameter -m_0 (0 < m_0 < 2),

  (D_w)_{xx'} = -\frac{1}{2} \sum_\mu \left[ (1 - \gamma_\mu) U_\mu(x) \delta_{x+\hat\mu,x'} + (1 + \gamma_\mu) U_\mu^\dagger(x') \delta_{x-\hat\mu,x'} \right] + (d - m_0),  (2.2)

where U_\mu(x) denotes the link variable, d is the dimension of the space-time (d = 4 for QCD),

  L = P_+ L_+ + P_- L_-,  P_\pm = (1 \pm \gamma_5)/2,  (2.3)

and

  (L_+)_{ss'} = \begin{cases} \delta_{s-1,s'}, & 1 < s \le N_s \\ -(m_q/2m_0)\, \delta_{N_s,s'}, & s = 1 \end{cases},  L_- = (L_+)^T.  (2.4)

The weights {w_s} along the fifth dimension are fixed according to the formula derived in [3] such that the maximal chiral symmetry is attained. In general, for other DWF with non-maximal chiral symmetry, the weights {s_s} and {w_s} take different values, e.g., for the conventional (Shamir) DWF, s_s = 0, w_s = 1 for all s, and for the Borici DWF [7], s_s = w_s = 1 for all s.
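To make the boundary structure of Eq. (2.4) concrete: in 0-based indexing, every row s > 0 of L_+ carries a single unit entry at column s-1, while row 0 carries the mass term at column N_s - 1. The following is a minimal sketch under those conventions (our illustration, not the paper's code; the function name and dense storage are assumptions):

```cpp
// Minimal sketch of the fifth-dimension hopping matrix L+ of Eq. (2.4),
// built as a dense Ns x Ns row-major array with 0-based indices;
// L- is its transpose.  Illustrative only.
#include <cstddef>
#include <vector>

std::vector<double> build_Lplus(int Ns, double m_q, double m_0) {
    std::vector<double> L(static_cast<std::size_t>(Ns) * Ns, 0.0);
    for (int s = 1; s < Ns; ++s)          // (L+)_{s,s'} = delta_{s-1,s'} for s > 1
        L[static_cast<std::size_t>(s) * Ns + (s - 1)] = 1.0;
    L[Ns - 1] = -m_q / (2.0 * m_0);       // s = 1 row: -(m_q/2m_0) delta_{Ns,s'}
    return L;
}
```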
2.2 Even-Odd Preconditioning

Since D_w commutes with (w)_{ss'} \equiv w_s \delta_{ss'} and (s)_{ss'} = s_s \delta_{ss'}, Eq. (2.1) becomes

  D(m_q) = D_w (w + s L) + (1 - L).  (2.5)

Separating the even and the odd sites on the 4D space-time lattice, Eq. (2.5) can be written as

  D(m_q) = \begin{pmatrix} d - m_0 & D_w^{EO} \\ D_w^{OE} & d - m_0 \end{pmatrix} (w + s L) + (1 - L) = \begin{pmatrix} X & D_w^{EO} Y \\ D_w^{OE} Y & X \end{pmatrix},  (2.6)

where

  X \equiv (d - m_0)\, w\, (1 + c L) + (1 - L),  Y \equiv w (1 + c L),  (c)_{ss'} \equiv (s_s / w_s)\, \delta_{ss'}.  (2.7)

Now we further rewrite it in a more symmetric form by defining

  M_5 \equiv \sqrt{w}^{-1} Y X^{-1} \sqrt{w} = \left[ (d - m_0) + \sqrt{w}^{-1} (1 - L)(1 + c L)^{-1} \sqrt{w}^{-1} \right]^{-1}  (2.8)

and

  S_1 \equiv \sqrt{w}^{-1} Y X^{-1} = M_5 \sqrt{w}^{-1},  S_2 \equiv Y^{-1} \sqrt{w};  (2.9)

then Eq. (2.6) becomes

  D(m_q) = S_1^{-1} \begin{pmatrix} 1 & M_5 D_w^{EO} \\ M_5 D_w^{OE} & 1 \end{pmatrix} S_2^{-1} = S_1^{-1} \begin{pmatrix} 1 & 0 \\ M_5 D_w^{OE} & 1 \end{pmatrix} \begin{pmatrix} 1 & M_5 D_w^{EO} \\ 0 & C \end{pmatrix} S_2^{-1},  (2.10)

  C \equiv 1 - M_5 D_w^{OE} M_5 D_w^{EO}.  (2.11)

Obviously, the most time-consuming task in the HMC is to solve the linear system C C^\dagger |x> = |b> by the conjugate gradient (CG), namely, in the computation of the fermion force in the molecular dynamics. In this work, we implement the entire conjugate gradient inside the NVIDIA GPU, which can be used for the HMC as well as for computing the valence quark propagators.

2.3 Algorithm

The conjugate gradient (CG) method [8] is a widely used numerical algorithm for iteratively solving a linear system Ax = b to a certain precision ε, where A is a positive-definite Hermitian matrix. With the CG algorithm (see Algorithm 1), the problem is turned into a task dominated by matrix-vector multiplication. In this work, we utilize CUDA to implement the matrix-vector multiplications of the 5D domain-wall fermion operator (2.10) for the CG on NVIDIA GPUs. On the GPU, single-precision operations are several times faster than double-precision ones; thus it is advantageous to use a mixed-precision CG. In the so-called defect correction algorithm, one solves for x in low precision, and updates the solution x̂ and the residue r̂ in high precision (see Algorithm 2, where the hatted symbols denote variables in high precision). In this fashion, most of the floating-point operations are performed in low precision, which suits the GPU.

Algorithm 1: Conjugate Gradient
    x_0 := initial guess
    r_0 := b - A x_0
    p_0 := r_0
    k := 0
    while |r_k| > ε |b| do
        α_k := (r_k, r_k) / (p_k, A p_k)
        r_{k+1} := r_k - α_k A p_k
        β_{k+1} := (r_{k+1}, r_{k+1}) / (r_k, r_k)
        x_{k+1} := x_k + α_k p_k
        p_{k+1} := r_{k+1} + β_{k+1} p_k
        k := k + 1
    end while

Algorithm 2: Mixed-Precision Conjugate Gradient (Defect Correction)
    x̂ := initial guess
    r̂ := b̂ - Â x̂
    while |r̂| > ε̂ |b̂| do
        r := r̂
        p := r̂
        x := 0
        use Algorithm 1 to solve x = A^{-1} r in low precision, to a precision ε
        x̂ := x̂ + x
        r̂ := b̂ - Â x̂
    end while

The defect correction algorithm is mathematically sound, and it always works in practice. However, a drawback is that one has to rebuild the Krylov space every time the CG restarts in low precision. On the other hand, if one does not reset the low-precision p vector inside the while loop of Algorithm 2 (i.e., skipping the step p := r̂ except at the first iteration), the "warm-up" time for rebuilding the Krylov space can be reduced. This so-called reliable updates algorithm [9, 10] runs faster than the defect correction. Although the reliable updates in this fashion may not converge in all cases, due to the non-orthogonality of p and r, in practice it works for most cases. We have implemented both algorithms in our CG solver, and the solver automatically switches to the defect correction if the reliable updates fails to converge in the first place.
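The control flow of Algorithms 1 and 2 can be summarized in code. The following is a minimal host-side sketch in plain C++ so the logic is explicit: in the GPU solver, every vector operation and the operator A become CUDA kernel launches, and A here is a matrix-free callback standing in for the CC† composition of Eq. (2.11). All names, types, and the inner tolerance are illustrative assumptions, not the actual TWQCD implementation.

```cpp
// Minimal sketch of mixed-precision CG with defect correction (Algorithm 2),
// with a plain single-precision CG (Algorithm 1) as the inner solver.
#include <cstddef>
#include <functional>
#include <vector>

using OpF = std::function<void(const std::vector<float>&,  std::vector<float>&)>;
using OpD = std::function<void(const std::vector<double>&, std::vector<double>&)>;

static double norm2(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x * x;
    return s;
}

// Algorithm 1 in single precision; x must be zero on entry, so r_0 = b.
static void cg_single(const OpF& A, const std::vector<float>& b,
                      std::vector<float>& x, float eps) {
    const std::size_t n = b.size();
    std::vector<float> r(b), p(b), Ap(n);
    float rr = 0.0f;
    for (float v : r) rr += v * v;
    const float target = eps * eps * rr;            // stop when |r| <= eps |b|
    while (rr > target) {
        A(p, Ap);
        float pAp = 0.0f;
        for (std::size_t i = 0; i < n; ++i) pAp += p[i] * Ap[i];
        const float alpha = rr / pAp;
        float rr_new = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            rr_new += r[i] * r[i];
        }
        const float beta = rr_new / rr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
}

// Algorithm 2: accumulate the solution and recompute the residue in double.
void cg_mixed(const OpD& Ad, const OpF& Af, const std::vector<double>& b,
              std::vector<double>& xhat, double eps_hat) {
    const std::size_t n = b.size();
    std::vector<double> rhat(n), Ax(n);
    Ad(xhat, Ax);                                    // r^ := b^ - A^ x^
    for (std::size_t i = 0; i < n; ++i) rhat[i] = b[i] - Ax[i];
    const double target = eps_hat * eps_hat * norm2(b);
    while (norm2(rhat) > target) {
        std::vector<float> r(n), x(n, 0.0f);         // r := r^, x := 0
        for (std::size_t i = 0; i < n; ++i) r[i] = static_cast<float>(rhat[i]);
        cg_single(Af, r, x, 1e-4f);                  // low-precision inner solve
        for (std::size_t i = 0; i < n; ++i) xhat[i] += x[i];  // x^ := x^ + x
        Ad(xhat, Ax);                                // high-precision residue
        for (std::size_t i = 0; i < n; ++i) rhat[i] = b[i] - Ax[i];
    }
}
```

In the reliable-updates variant, the search vector p would persist across restarts (hoisted out of the inner solver) instead of being rebuilt from r̂ at every outer iteration.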
3. CUDA Kernels and Optimization

The CUDA architecture developed by NVIDIA enables us to perform parallel computations on NVIDIA's GPUs. (For more detailed programming models and strategies, see the "CUDA Programming Guide" of the CUDA Toolkit [11].)

In CUDA, a thread is the smallest unit to execute a kernel, and a certain number of threads form a block. Each thread is assigned a 3-dimensional index, and each block a 2-dimensional index. Inside a block, a warp of threads executes the kernel concurrently. Each thread has its own register memory, and threads in the same block share a shared memory. The space of register and shared memory is very limited, but they have the fastest access speed. The global memory on the device can be accessed by any thread, but it has the slowest bandwidth.

To implement the mixed-precision CG for ODWF with CUDA, we perform all matrix-vector multiplications, vector reductions (inner products), and vector additions/subtractions on the GPU (device), while the CPU (host) is used for flow control, memory assignment, and device control. The CUDA kernels in our CG solver can be divided into five categories. We discuss each category and its optimization in the following subsections.

3.1 Vector Index Conversion

These kernels are used to change the indexing schemes between the device (GPU) and the host (CPU). To store the Dirac spinors in a one-dimensional array, we need to map the multi-dimensional indices to the 1D array index. One needs 4 x 3 x 2 = 24 real numbers to store one Dirac spinor at each site of the 5D lattice. On the CPU, this color-spinor index c, which runs from 0 to 23, is the inner-most (fastest-running) index, followed by the fifth-dimension index s, and then the x, y, z, t indices, where t is the outer-most (slowest-running) index. If i_host denotes the one-dimensional array index of the CPU scheme, then we have

  i_host = c + s \cdot 24 + x \cdot 24 N_s + \cdots + t \cdot 24 N_s N_x N_y N_z.  (3.1)

However, for computation on the GPU, we assign each thread a five-dimensional site index. This implies that adjacent threads have consecutive s indices. Thus we want to arrange the data such that optimal coalescence is attained when loading the vector from the device global memory to the registers and/or the shared memory of the threads. Since the GPU provides vector data types such as float4 and double2, which allow us to move 4 single-precision numbers (or 2 double-precision numbers) at one time, a simple way to map to the one-dimensional array index on the GPU (for single precision) is

  i_dev = (c \bmod 4) + s \cdot 4 + \lfloor c/4 \rfloor \cdot 4 N_s + x \cdot 24 N_s + \cdots + t \cdot 24 N_s N_x N_y N_z,  (3.2)

and similarly for double precision. So every time we transfer data between the host and the device, we convert the index accordingly.
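As an illustration, the two index maps (3.1) and (3.2) can be written as small helper functions. The lattice extents and function names below are assumptions for the sketch, not the paper's actual code; in the solver this remapping is performed by dedicated conversion kernels at host-device transfer time.

```cpp
// Sketch of the index maps of Eqs. (3.1) and (3.2) for the single-precision
// (float4) layout.  Extents and names are illustrative.
#include <cstddef>

constexpr int Ns = 16, Nx = 16, Ny = 16, Nz = 16;   // example lattice extents

// Eq. (3.1): host layout; c (0..23) runs fastest, then s, x, y, z, t.
inline std::size_t idx_host(int c, int s, int x, int y, int z, int t) {
    std::size_t site = x + (std::size_t)Nx * (y + (std::size_t)Ny * (z + (std::size_t)Nz * t));
    return c + 24u * (s + (std::size_t)Ns * site);
}

// Eq. (3.2): device layout; c mod 4 runs fastest, so one float4 load fetches
// four consecutive reals, and threads with consecutive s coalesce.
inline std::size_t idx_dev(int c, int s, int x, int y, int z, int t) {
    std::size_t site = x + (std::size_t)Nx * (y + (std::size_t)Ny * (z + (std::size_t)Nz * t));
    return (c % 4) + 4u * s + (std::size_t)(c / 4) * 4u * Ns + 24u * Ns * site;
}
```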
3.2 Matrix-Vector Multiplication for D_w^{OE} (D_w^{EO})

D_w^{OE} (D_w^{EO}) is similar to the usual Wilson-Dirac operator D_w without the mass term,

  [D_w^{OE}(D_w^{EO})]_{xx'} = -\frac{1}{2} \sum_\mu \left[ (1 - \gamma_\mu) U_\mu(x) \delta_{x+\hat\mu,x'} + (1 + \gamma_\mu) U_\mu^\dagger(x') \delta_{x-\hat\mu,x'} \right].  (3.3)

From this expression we see that the multiplication of D_w^{OE} (D_w^{EO}) with a vector involves the link variables U_\mu(x) and the γ-matrices. We have used the following tricks for optimization.

Firstly, since the γ-matrices in Eq. (3.3) appear in the combinations (1 ± γ_μ), the left-handed and the right-handed Dirac components are related to each other. Also, since the link variables do not have Dirac indices, we can just multiply U_\mu(x) with the left-handed components, and then restore the right-handed components.

Secondly, U_\mu(x) has no fifth-dimension dependence, so threads having the same x but different s can share the same U_\mu(x). We therefore put the link variables in the shared memory.

Thirdly, because GPU computation is memory-bandwidth bound, one must try to reuse the data. For example, in the hopping terms (δ_{x±\hat\mu,x'}) of D_w, all neighboring sites of x are involved in the calculation. If we assign each x to one thread, then there must be overlapping data loading for neighboring sites. To reduce this overlapping data transfer, we distribute each (x, y, z) to one thread, with a loop in the t-direction. Then the neighboring data in the t-direction can be reused, and the efficiency is enhanced.

Besides the above tuning techniques, we also expand (unroll) small loops, and use the texture memory for caching data; here textures are used for loading the vectors and the link variables. We use Python [12] to expand the small loops and to generate the set of kernels for the D_w multiplication.

3.3 Matrix-Vector Multiplication for M_5

The matrix M_5 is given in Eq. (2.8). One can see that M_5 is block-diagonal in the chiral basis, and it depends on neither the space-time nor the color indices. In fact, it can be divided into two constant matrices in the fifth dimension, i.e., the left-handed and the right-handed ones. So the multiplication of M_5 with a vector can be regarded as u_s = \sum_{s'} (M_5)_{ss'} v_{s'}. Here we use the shared memory to store the source vector v. Since M_5 only takes 2 N_s^2 real numbers, we can put M_5 into the registers of each thread (with different s and x, y, z). Again, a loop in t is inserted to reuse the M_5 data in the registers. We also use Python to generate these kernels.

3.4 Vector Reduction

To calculate the norm of a vector, we use the well-known parallel reduction with the shared memory. However, due to the limitation on the number of threads per block, it is inevitable to use the global memory when the size of a vector becomes very large. Our solution is to perform a block reduction in prior kernels, i.e., to sum up the vector elements (already stored in registers/shared memory) within each block. Then these partial sums are added with a parallel reduction.
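A minimal CUDA sketch of this two-stage pattern follows; the kernel name and the block size of 256 threads are illustrative assumptions. Each block folds its elements into shared memory and writes one partial sum; launching the kernel again on the partial sums (or summing them in a prior kernel, as described above) completes the reduction. For a squared norm, the first pass would be fed the elementwise squares.

```cuda
// Block-wise parallel reduction in shared memory (illustrative sketch).
__global__ void block_reduce(const float* __restrict__ in,
                             float* __restrict__ out, int n)
{
    __shared__ float sdata[256];                // one entry per thread
    const int tid = threadIdx.x;
    const int i = blockIdx.x * blockDim.x * 2 + tid;

    // Each thread loads up to two elements, halving the number of blocks.
    float v = 0.0f;
    if (i < n)              v += in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
}
```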
3.5 Vector Addition and Subtraction

We can combine the simple additions/subtractions with other kernels in which one vector has already been loaded. For example, to multiply C \equiv 1 - M_5 D_w^{OE} M_5 D_w^{EO} with a vector, we can combine the last subtraction with the last M_5 multiplication.

4. Performance

We present some benchmarks of our CG solver, using NVIDIA GeForce GTX 285, GeForce GTX 480, Tesla C1060, and Tesla C2050. Note that our code has not yet been well-tuned for the Fermi architecture (GTX 480 and C2050). From Table 1, we see that the bottleneck of our program is the single-precision D_w matrix-vector multiplication. Due to the mixed-precision CG, the time used in the double-precision operations is almost negligible. For the Fermi architecture, thanks to the larger L1 cache, there is a significant improvement in the single-precision D_w matrix-vector multiplication, and also in the double-precision M_5 matrix-vector multiplication.

Table 1: Benchmark of our CG solver for DWF on a 16^3 x 32 x 16 lattice (numbers in units of Gflops).

             D_w (single)   M_5 (single)   D_w (double)   M_5 (double)   CG (mixed)
  GTX 285        177            346             33             69            181
  GTX 480        248            331             32            116            233
  C1060          128            290             29             61            132
  C2050          160            239             22            100            156

5. Summary

We have implemented an efficient GPU-based CG solver for generalized domain-wall fermions in lattice QCD. Our CUDA kernels are tuned with several innovative techniques. On NVIDIA GeForce GTX 285/480, our CG solver achieves 180/233 Gflops (sustained). This efficient CG solver constitutes the most crucial part of TWQCD's HMC code for the simulation of unquenched lattice QCD with the optimal domain-wall fermion.

Acknowledgments

This work is supported in part by the National Science Council (Nos. NSC96-2112-M-002-020-MY3, NSC99-2112-M-002-012-MY3, NSC96-2112-M-001-017-MY3, NSC99-2112-M-001-014-MY3, NSC99-2119-M-002-001) and NTU-CQSE (Nos. 99R80869, 99R80873).

References

[1] D. B. Kaplan, Phys. Lett. B 288, 342 (1992)
[2] H. Neuberger, Phys. Lett. B 417, 141 (1998); R. Narayanan and H. Neuberger, Nucl. Phys. B 443, 305 (1995)
[3] T. W. Chiu, Phys. Rev. Lett. 90, 071601 (2003); Nucl. Phys. Proc. Suppl. 129, 135 (2004)
[4] T. W. Chiu et al. [TWQCD Collaboration], PoS LAT2009, 034 (2009) [arXiv:0911.5029 [hep-lat]]
[5] T. W. Chiu et al. [TWQCD Collaboration], PoS LAT2010, 099 (2010)
[6] T. H. Hsieh et al. [TWQCD Collaboration], PoS LAT2010, 085 (2010)
[7] A. Borici, Nucl. Phys. Proc. Suppl. 83, 771 (2000) [arXiv:hep-lat/9909057]
[8] M. R. Hestenes and E. Stiefel, Journal of Research of the National Bureau of Standards 49, 6 (1952)
[9] G. L. G. Sleijpen and H. A. van der Vorst, Computing 56, 141-164 (1996)
[10] M. A. Clark, R. Babich, K. Barros, R. C. Brower and C. Rebbi, Comput. Phys. Commun. 181, 1517 (2010) [arXiv:0911.3191 [hep-lat]]
[11] http://developer.nvidia.com/object/gpucomputing.html
[12] http://www.python.org
