Large-scale Ferrofluid Simulations on Graphics Processing Units
A. Yu. Polyakov(a,b), T. V. Lyutyy(a), S. Denisov(b), V. V. Reva(a), P. Hänggi(b)
(a) Sumy State University, Sumy, Ukraine
(b) Institute of Physics, University of Augsburg, Augsburg, Germany
Abstract

We present an approach to molecular-dynamics simulations of ferrofluids on graphics processing units (GPUs). Our numerical scheme is based on a GPU-oriented modification of the Barnes-Hut (BH) algorithm designed to increase the parallelism of computations. For an ensemble consisting of one million ferromagnetic particles, the performance of the proposed algorithm on a Tesla M2050 GPU demonstrated a computational-time speed-up of four orders of magnitude compared to the performance of the sequential All-Pairs (AP) algorithm on a single-core CPU, and of two orders of magnitude compared to the performance of the optimized AP algorithm on the GPU. The accuracy of the scheme is corroborated by comparing the results of numerical simulations with theoretical predictions.

Keywords: Ferrofluid, molecular dynamics simulation, the Barnes-Hut algorithm, GPUs, CUDA
1. Introduction
Ferrofluids are media composed of magnetic nanoparticles with diameters in the range 10–50 nm which are dispersed in a viscous fluid (for example, water or ethylene glycol) [1]. These physical systems combine the basic properties of liquids, i.e. viscosity and the presence of surface tension, with those of ferromagnetic solids, i.e. an internal magnetization and a high permeability to magnetic fields. This synergetic duality makes ferrofluids an attractive candidate for performing different tasks, ranging from the delivery of rocket fuel into spacecraft thrust chambers under zero-gravity conditions to high-precision drug delivery during cancer therapy [2]. Moreover, ferrofluids have already found their way into commercial applications as EM-controlled shock absorbers, dynamic seals in engines and computer hard drives, and as key elements of high-quality loudspeakers, to name but a few [3].
Being liquids, ferrofluids can be modeled by using different macroscopic, continuum-media approaches. The corresponding field, ferrohydrodynamics [1], is a well-established area that has produced a lot of fundamental results. However, some important phenomena, such as magnetoviscosity [4, 5], can be described properly neither on the hydrodynamical level nor within the single-particle picture [6]. The role of multiparticle aggregates – chains and clusters – living in the ferrofluid bulk is important there. The evaluation of other non-Newtonian features of ferrofluids, which can be controlled by external magnetic fields [5], also demands information about the structure of the aggregates and their dynamics. The key mechanism responsible for the formation of ferro-clusters is the dipole-dipole interaction acting between magnetic particles. Under certain conditions, the interaction effects can overcome the destructive effects of thermal fluctuations and contribute tangibly to the ensemble dynamics. Therefore, in order to get a deeper insight into the above-mentioned features of ferrofluids, the interaction effects should be explicitly included into the model.
Strictly speaking, the dipole-dipole interaction acts between all possible pairs of the N ferromagnetic particles. Therefore, the computational time of the straightforward sequential algorithm scales like N². This is a fundamental drawback of many-body simulations, and standard CPU-based computational resources often limit the scale of molecular-dynamics calculations. In order to run larger ensembles for longer times, researchers either (i) advance their models and numerical schemes and/or (ii) rely on more efficient computers. The first track has led to different methods developed to explore the equilibrium properties of bulk ferrofluids. Most prominent are cut-off sphere approximations [7] and the Ewald summation technique [8], see also Ref. [9] for different modifications of the technique. However, both methods are of limited use for computational studies of confined ferrofluids, a problem that attracts special attention nowadays due to its practical relevance [10, 11]. An alternative approach is the search for ways to increase the scalability and parallelism of computations, see Ref. [12]. Until recently, this practically meant the use of either large
computational clusters, consisting of many Central Processing Units (CPUs), or massively parallel supercomputers, like Blue Gene supercomputers. The prices and the maintenance costs of such devices are high. The advent of general-purpose computing on graphics processing units (GP2U) [13] has changed the situation drastically and boosted simulations of many-body systems to a new level [14]. GPUs, initially designed to serve as data pipelines for graphical information, are relatively small, much easier to maintain, and possess high computational capabilities allowing for parallel data processing. Nowadays, scientific GPU computing is used in many areas of computational physics, thanks to the Compute Unified Device Architecture (CUDA) developed by the NVIDIA Corporation [15]. CUDA significantly simplifies GPU-based calculations, so one can now use even a Sony PlayStation 3 as a multi-core computer by programming it with the C, C++ or Fortran [16] languages.
The typical scale of molecular-dynamics simulations in ferrofluid studies is within the range N = 10²–10³ [17, 18], while N = 10⁴ constitutes the current limit [19]. It is evident that increasing the ensemble size by several orders of magnitude would tangibly improve the statistical sampling and thus the quality of simulation results. Running 10⁵–10⁶ interacting magnetic particles numerically is a task not much different from running an artificial Universe [20]. Therefore, similarly to the case of Computational Cosmology [21, 22, 23], the GP2U seems to be very promising also in the context of ferrofluid simulations. In this paper we report the performance of a recently proposed GPU-oriented modification of the Barnes-Hut algorithm [24] used to simulate the dynamics of N = 10³–10⁶ interacting ferromagnetic particles moving in a viscous medium.
Although mentioned as a potentially promising approach (see, for example, Ref. [25]), the Barnes-Hut algorithm has, to the best of our knowledge, never been used in molecular-dynamics ferrofluid simulations. Our main goal here is to demonstrate the sizable speed-up one can reach when modeling such systems on a GPU – and by using GPU-oriented numerical algorithms – compared to the performance of conventional, CPU-based algorithms. We also demonstrate the high accuracy of the Barnes-Hut approximation by using several benchmarks. The paper is organized as follows: First, in Section 2 we specify the model. Then, in Section 3, we describe how both the All-Pairs and the Barnes-Hut algorithms can be efficiently implemented for GPU-based simulations. The results of the numerical tests are presented and discussed in Sections 4 and 5. Finally, Section 6 contains the conclusions.
2. The model

The model system represents an ensemble of N identical particles of radius R, made of a ferromagnetic material of density D and specific magnetization µ. Each particle occupies the volume V = (4/3)πR³, has a magnetic moment m = m(t) of constant magnitude |m| = m = Vµ, mass M = VD, and moment of inertia I = (2/5)MR². The ensemble is dispersed in a liquid of viscosity η. Based on the Langevin dynamics approach, the equations of motion for the k-th nanoparticle can be written in the following form [17]
I \ddot{\theta}_k = N_{kz} - G_r \dot{\theta}_k + \Xi^r_{\theta},    (1)

I \ddot{\varphi}_k = -N_{kx} \sin\varphi_k + N_{ky} \cos\varphi_k - G_r \dot{\varphi}_k + \Xi^r_{\varphi},    (2)

M \ddot{\vec{r}}_k = \mu_0 (\vec{m}_k \nabla_k) \cdot \vec{H}_k + \vec{F}^{sr}_k - G_d \dot{\vec{r}}_k + \vec{\Xi}^d_{\vec{r}},    (3)
where θ_k and ϕ_k are the polar and azimuthal angles of the magnetization vector m_k, respectively, N_kx = m_ky H_kz − m_kz H_ky, N_ky = m_kz H_kx − m_kx H_kz, N_kz = m_kx H_ky − m_ky H_kx, x, y, z denote the Cartesian coordinates, and dots over the variables denote derivatives with respect to time. r_k is the radius-vector defining the nanoparticle position, and the gradient is given by ∇_k = ∂/∂r_k = e_x ∂/∂x_k + e_y ∂/∂y_k + e_z ∂/∂z_k (e_x, e_y, e_z are the unit vectors of the Cartesian coordinates). The constants G_t = 6πηR and G_r = 8πηR³ specify the translational and rotational friction coefficients, and µ_0 = 4π·10⁻⁷ H/m is the magnetic constant.
The resulting field acting on the k-th particle, H_k, is the sum of the external field H^ext and the overall field exerted on the particle by the rest of the ensemble,

\vec{H}_k = \sum_{j=1,\, j \neq k}^{N} \vec{H}^{dip}_{kj} + \vec{H}^{ext},    (4)

\vec{H}^{dip}_{kj} = \frac{3 \vec{r}_{kj} (\vec{m}_j \vec{r}_{kj}) - \vec{m}_j \vec{r}^{\,2}_{kj}}{|\vec{r}_{kj}|^5},    (5)
where r_kj = r_k − r_j. F^sr_k in Eq. (3) denotes the force induced by a short-range interaction potential. In this paper we use the Lennard-Jones potential [17] (though hard-sphere [26], soft-sphere [27] and Yukawa-type [28] potentials can be used as alternatives), so that

\vec{F}^{sr}_k = 24 E \sum_{j=1,\, j \neq k}^{N} \frac{\vec{r}_{kj}}{r^2_{kj}} \left[ \left( \frac{s}{r_{kj}} \right)^{12} - \left( \frac{s}{r_{kj}} \right)^{6} \right].    (6)

Here E is the depth of the potential well and s is the equilibrium distance at which the inter-particle force vanishes. The interaction between a particle and the container walls is also modeled with a Lennard-Jones potential of the same type. The random-force vector, representing the interaction of a particle with the thermal bath, has standard white-noise components, ⟨Ξ^r_α(t)⟩ = 0 (α = ϕ, θ), ⟨Ξ^d_β(t)⟩ = 0 (β = x, y, z), and second moments satisfying ⟨Ξ_{rα}(t) Ξ_{rβ}(t′)⟩ = 3 k_B T G_r δ_αβ δ(t − t′) (α, β = ϕ, θ) and ⟨Ξ_{dα}(t) Ξ_{dβ}(t′)⟩ = 2 k_B T G_t δ_αβ δ(t − t′) (α, β = x, y, z) [17]. Here k_B is the Boltzmann constant and T is the temperature of the heat bath.
By rescaling the variables, τ = t/T_ch, T_ch = (R/µ)√(3D/4µ_0), u = m/|m|, ρ_k = r_k/R, ∇_{ρ_k} = (1/R) ∂/∂ρ_k, the equations of motion for the k-th particle can be rewritten in the reduced form,

\frac{2}{5} \frac{d^2 \theta_k}{d\tau^2} = (u_{kx} h_{ky} - u_{ky} h_{kx}) - \Gamma_r \frac{d\theta_k}{d\tau} + \frac{T_{ch}^2}{V R^2 D} \xi^r_{\theta},    (7)

\frac{2}{5} \frac{d^2 \varphi_k}{d\tau^2} = -(u_{ky} h_{kz} - u_{kz} h_{ky}) \sin\varphi_k + (u_{kz} h_{kx} - u_{kx} h_{kz}) \cos\varphi_k - \Gamma_r \frac{d\varphi_k}{d\tau} + \frac{T_{ch}^2}{V R^2 D} \xi^r_{\varphi},    (8)

\frac{d^2 \vec{\rho}_k}{d\tau^2} = \vec{f}^{dip}_k + \vec{f}^{sr}_k - \Gamma_d \frac{d\vec{\rho}_k}{d\tau} + \frac{T_{ch}^2}{V R D} \vec{\xi}^d_{\vec{\rho}},    (9)
where

\vec{h}_k = \sum_{j=1,\, j \neq k}^{N} \vec{h}^{dip}_{kj} + \vec{h}^{ext},    (10)

\vec{h}^{dip}_{kj} = \frac{3 \vec{\rho}_{kj} (\vec{u}_j \vec{\rho}_{kj}) - \vec{u}_j \vec{\rho}^{\,2}_{kj}}{\rho^5_{kj}},    (11)

\vec{f}^{dip}_k = \sum_{j=1,\, j \neq k}^{N} \left[ \frac{3 \left( \vec{\rho}_{kj} (\vec{u}_j \vec{u}_k) + \vec{u}_k (\vec{u}_j \vec{\rho}_{kj}) + \vec{u}_j (\vec{u}_k \vec{\rho}_{kj}) \right)}{\rho^5_{kj}} - \frac{15 \, \vec{\rho}_{kj} (\vec{u}_k \vec{\rho}_{kj}) (\vec{u}_j \vec{\rho}_{kj})}{\rho^7_{kj}} \right],    (12)

\vec{f}^{sr}_k = 24 \varepsilon \sum_{j=1,\, j \neq k}^{N} \frac{\vec{\rho}_{kj}}{\rho^2_{kj}} \left[ \left( \frac{\sigma}{\rho_{kj}} \right)^{12} - \left( \frac{\sigma}{\rho_{kj}} \right)^{6} \right].    (13)

Here h^ext = 3H^ext/(4πµ), Γ_t = G_t T_ch/(VD), Γ_r = G_r T_ch/(VR²D), σ = s/R, ε = E T_ch²/(VRD), and ρ_kj = ρ_k − ρ_j.
The random-force vector components are now given by white Gaussian noises with the second moments satisfying ⟨ξ^r_α(τ) ξ^r_β(τ′)⟩ = 3 k_B T Γ_r δ_αβ δ(τ − τ′)/T_ch (α, β = ϕ, θ) and ⟨ξ^d_α(τ) ξ^d_β(τ′)⟩ = 2 k_B T Γ_t δ_αβ δ(τ − τ′)/T_ch (α, β = x, y, z). In the general case the characteristic relaxation time of the particle magnetic moments to their equilibrium orientations is much smaller than T_ch, and one can neglect the magnetization dynamics by assuming that the direction of the vector m_k coincides with the easy axis of the k-th particle. We assume that this condition holds for our model.
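For readers who wish to implement the model directly, the pairwise quantities entering Eqs. (11)-(13) translate into a few lines of device code. The following CUDA sketch is our illustration only and is not part of the original implementation; it assumes the reduced units defined above, uses float3 vectors, and all function names are chosen by us:

// Illustrative CUDA device routines (not the authors' code): pairwise terms of
// Eqs. (11)-(13) in reduced units. rho = rho_k - rho_j; uk, uj are the unit
// magnetic moments of particles k and j.
#include <cuda_runtime.h>

__device__ __forceinline__ float dot3(float3 a, float3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Reduced dipole field h^dip_kj of particle j at the position of particle k, Eq. (11).
__device__ float3 dipoleField(float3 rho, float3 uj)
{
    float r2 = dot3(rho, rho);
    float inv_r5 = 1.0f / (r2 * r2 * sqrtf(r2));
    float ujr = dot3(uj, rho);
    return make_float3((3.0f * ujr * rho.x - uj.x * r2) * inv_r5,
                       (3.0f * ujr * rho.y - uj.y * r2) * inv_r5,
                       (3.0f * ujr * rho.z - uj.z * r2) * inv_r5);
}

// One pair term of the reduced dipole-dipole force f^dip_k, Eq. (12).
__device__ float3 dipoleForce(float3 rho, float3 uk, float3 uj)
{
    float r2 = dot3(rho, rho);
    float inv_r5 = 1.0f / (r2 * r2 * sqrtf(r2));
    float inv_r7 = inv_r5 / r2;
    float ukr = dot3(uk, rho), ujr = dot3(uj, rho), ukuj = dot3(uk, uj);
    float c5 = 3.0f * inv_r5;
    float c7 = 15.0f * ukr * ujr * inv_r7;
    return make_float3(c5 * (rho.x * ukuj + uk.x * ujr + uj.x * ukr) - c7 * rho.x,
                       c5 * (rho.y * ukuj + uk.y * ujr + uj.y * ukr) - c7 * rho.y,
                       c5 * (rho.z * ukuj + uk.z * ujr + uj.z * ukr) - c7 * rho.z);
}

// One pair term of the Lennard-Jones force f^sr_k, Eq. (13).
__device__ float3 ljForce(float3 rho, float eps, float sigma)
{
    float r2 = dot3(rho, rho);
    float s2 = sigma * sigma / r2;
    float s6 = s2 * s2 * s2;
    float scale = 24.0f * eps * (s6 * s6 - s6) / r2;  // 24*eps*[(sigma/rho)^12 - (sigma/rho)^6]/rho^2
    return make_float3(scale * rho.x, scale * rho.y, scale * rho.z);
}

Both the All-Pairs and the Barnes-Hut kernels discussed in Section 3 reduce, at the innermost level, to accumulating such pair terms.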
3. Two approaches to many-body simulations on GPUs: the All-Pairs and the Barnes-Hut algorithms

In this section we discuss two alternative approaches to the numerical propagation of the dynamical system given by Eqs. (7)-(9) on a GPU. It is assumed that the reader is familiar with the basics of GP2U computing; otherwise we refer the reader to Refs. [29, 30], which contain crash-course-like introductions to physically-oriented GPU computing.
3.1. All-Pairs Algorithm

The most straightforward approach to propagating a system of N interacting particles is to account for the interactions between all pairs. Although exact, the corresponding All-Pairs (AP) algorithm is slow when performed on a CPU, so it is usually used to propagate systems of N = 10²–10³ particles. However, even with this brute-force method one can tangibly benefit from GPU computations by noticing that the AP idea fits the CUDA architecture [31]. One integration step of the standard AP algorithm is performed in two stages:
1. Calculation of the increments of the particle positions and magnetic moment directions.
2. Update of the particle positions and magnetic moment directions.
This structure remains intact in the GPU version of the algorithm. Kernels responsible for the first stage compute the forces that act on the particles and calculate the corresponding increments for the particle positions and magnetic moments, according to Eqs. (7)-(9). The increments are then written into the global memory. Finally, second-stage kernels update the particle states with the obtained increments. There is, however, a need for a global synchronization of the threads that belong to different blocks after every stage, since the stages are performed by separate CUDA kernels and all the information about the state of the system is kept in the shared memory.
Each thread is responsible for one particle of the ensemble, and thus it should account for the forces exerted on that particle by the rest of the ensemble. To speed up the computational process we keep the data vector of the thread's particle in the shared memory, as well as the information on the other particles needed to compute the corresponding interaction forces. Thus we have two sets of arrays of data in the shared memory, namely

1. the data of the particles assigned to the threads of the block;
2. the data of the particles to compute the interaction with.
The necessary data are the coordinates of the particles and the projections of their magnetic moment vectors onto the x, y and z axes. The sizes of the arrays are equal to the number of threads per block. The first set of data is constant during one integration step, but the second set changes. So, at the beginning we upload the information on the particle coordinates and moments to the second set of arrays in the shared memory. After computing the forces acting on the block particles, the procedure is repeated, i.e. the information on another set of particles is written into the second set of arrays and the corresponding interactions are computed. The corresponding pseudo-code is presented below¹.
The advantage of the described approach is that it uses the global memory in the most optimal way: the access to global memory is coalesced and there are no shared-memory bank conflicts. For an ensemble of N = 10⁴ particles this leads to a GPU occupancy² of 97.7%.
3.2. Barnes-Hut Algorithm

The All-Pairs algorithm is simple and straightforward to implement on a GPU and perfectly fits CUDA. Yet this algorithm scales poorly: the corresponding computation time grows like O(N²), and its performance is already very slow for an ensemble of 10⁵ particles.

The Barnes-Hut (BH) approximation [32] exhibits a much better scalability, and its computational time grows like O(N log N). The key idea of the algorithm is to substitute a group of particles with a single pseudo-particle mimicking the action of the group. The force exerted by the group on the considered particle can then be replaced with the force exerted by the pseudo-particle.
¹ In the pseudo-code, ∆ρx, ∆ρy, ∆ρz, ∆θ, ∆ϕ denote the increments of a particle's coordinates x, y, z and of the direction angles of the particle's magnetic moment, θ and ϕ, needed to propagate the particle over one time step ∆τ.
² This parameter shows how well the GPU is kept busy; it is equal to the ratio of the number of active warps to the maximum number of warps supported on a multiprocessor of the GPU. It can be obtained with the CUDA Profiler.
Algorithm 1 A CUDA kernel that computes the increments.
1: cached k ← blockIdx.x · blockDim.x + threadIdx.x
2: for j = 0 to numberOfParticles step blockDim.x do
3:     upload particle positions and angles of the current tile to the shared memory,
4:     e.g. x_shared[threadIdx.x] = x_global[j + threadIdx.x], etc.
5:     __syncthreads()                        ▷ the tile must be fully loaded before use
6:     for i = 0 to blockDim.x − 1 do
7:         if j + i ≠ k then                  ▷ the condition avoids particle self-interaction
8:             calculate the force and dipole field exerted by particle j + i on the
9:             current k-th particle, and add the result to the cached totals
10:            f_dip, f_sr, h_dip
11:        end if
12:    end for
13:    __syncthreads()                        ▷ before the next tile overwrites the shared arrays
14: end for
15: Update the cached ∆ρx, ∆ρy, ∆ρz, ∆θ, ∆ϕ according to Eqs. (7)-(9).
16: Copy the increments to the global memory.
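For concreteness, a possible CUDA rendering of Algorithm 1 is sketched below. It is our simplified illustration rather than the production kernel: only the dipole-dipole force is accumulated (via the dipoleForce() routine sketched at the end of Section 2), the block size and the assumption that N is a multiple of it are ours, and the two orientation angles are omitted.

#define BLOCK_SIZE 256   // threads per block; for brevity, N is assumed to be a multiple of BLOCK_SIZE

// Simplified tiled All-Pairs kernel following Algorithm 1 (illustrative only).
__global__ void allPairsForces(const float3 *pos, const float3 *mom,
                               float3 *force, int nParticles)
{
    __shared__ float3 tilePos[BLOCK_SIZE];
    __shared__ float3 tileMom[BLOCK_SIZE];

    int    k     = blockIdx.x * blockDim.x + threadIdx.x;    // particle handled by this thread
    float3 myPos = pos[k];
    float3 myMom = mom[k];
    float3 acc   = make_float3(0.0f, 0.0f, 0.0f);

    for (int j = 0; j < nParticles; j += blockDim.x) {        // loop over tiles
        tilePos[threadIdx.x] = pos[j + threadIdx.x];           // stage one tile in shared memory
        tileMom[threadIdx.x] = mom[j + threadIdx.x];
        __syncthreads();                                       // tile fully loaded

        for (int i = 0; i < blockDim.x; ++i) {
            if (j + i != k) {                                  // avoid self-interaction
                float3 rho = make_float3(myPos.x - tilePos[i].x,
                                         myPos.y - tilePos[i].y,
                                         myPos.z - tilePos[i].z);
                float3 f = dipoleForce(rho, myMom, tileMom[i]);
                acc.x += f.x;  acc.y += f.y;  acc.z += f.z;
            }
        }
        __syncthreads();                                       // done with this tile
    }
    force[k] = acc;   // the increments of Eqs. (7)-(9) are formed in the second stage
}

As in the pseudo-code, one barrier guards the freshly loaded tile and a second one protects it from being overwritten while other threads of the block are still reading it.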
In order to illustrate the idea, assume that all the particles are located in a three-dimensional cube, which is named the 'main cell'. The main cell is then divided into eight sub-cells, each of which confines a subset of the particles. If there is more than one particle in a given sub-cell, the latter is again divided into eight sub-cells. The procedure is re-iterated until each sub-cell contains either a single particle or none. In this way we obtain an octree whose leaves are either empty or contain a single particle only. A simplified, two-dimensional realization of this algorithm is sketched in Fig. 1. By following this recipe, we can assign to every cell obtained during the decomposition a pseudo-particle, with magnetic moment u′ equal to the sum of the magnetization vectors u of all particles belonging to the cell, and position ρ′ which is the geometric center of this set,

\vec{\rho}^{\,\prime} = \frac{1}{N'} \sum_{i=1}^{N'} \vec{\rho}_i,    (14)

where N′ is the number of particles in the cell and ρ_i is the position of the i-th particle from the subset.
The forces acting on the k-th particle can be calculated by traversing the octree. If the distance from the particle to the pseudo-particle that corresponds to the root cell is large enough, the influence of this pseudo-particle on the k-th particle is calculated; otherwise the pseudo-particles of the next sub-cells are checked, and so on (sometimes this procedure finally leads to a leaf with only one particle left in the cell). The force calculated in this way is then added to the total force acting on the k-th particle.
The BH algorithm allows for a high degree of parallelism and is widely employed in computational astrophysics [33]. However, the implementation of the Barnes-Hut algorithm on GPUs remained a challenge until recently, because the procedure uses an irregular tree structure that does not fit the CUDA architecture well. This is the main reason why the BH scheme was not realized entirely on a GPU, but some part of the calculations was always delegated to and performed on a CPU [34, 35]. A realization of the algorithm solely on a GPU was proposed in 2011 [24]. Below we briefly outline the main idea and specify its differences from the standard, CPU-based realization.
To build an octree on a CPU, heap objects are usually used. These objects contain both child-pointer and data fields, and their children are allocated dynamically. To avoid time-consuming dynamic allocation and accesses to heap objects, an array-based data structure should be used. Since we have several arrays responsible for the variables, coalesced global memory access is possible. Particles and cells can have the same data fields, e.g. positions; in this case the same arrays are used.
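A hypothetical example of such a layout (our illustration, not the exact data structure of Ref. [24]) could look as follows; particles and cells share the same arrays and are distinguished only by their index range:

// A possible array-based (structure-of-arrays) octree layout; indices
// 0..nParticles-1 address particles, larger indices address cells (pseudo-particles).
struct OctreeArrays {
    float *posX, *posY, *posZ;   // positions of particles and geometric centers of cells
    float *momX, *momY, *momZ;   // particle moments and total moments of cells
    int   *child;                // 8 child indices per cell; -1 marks an empty slot
    int   *count;                // number of particles contained in each node (1 for a particle)
    int    nParticles;           // number of particles, N
    int    nNodes;               // N plus the number of allocated cells
};

With such a layout the kernels below can address real particles and pseudo-particles uniformly, and accesses to the coordinate arrays remain coalesced.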
Figure 1: Barnes-Hut hierarchical decomposition in two-dimensional space and the corresponding octree (empty cells, cells with particles, single particles, and the main cell are marked in the sketch). See Section 3.2 for more details.

In contrast to the All-Pairs algorithm, where only two kernels were involved, the original GPU-BH algorithm has six kernels [24]:

1. Bounding-box definition kernel.
2. Octree-building kernel.
3. Computation of the geometric center and the total magnetic moment of each cell.
4. Sorting of the particles with respect to their positions.
5. Computation of the forces and fields acting on each particle.
6. Integration kernel.
Kernel 1 defines the boundaries of the root cell. Though the ensemble is confined to a container and the particles cannot go outside, we keep this kernel: the size of the root cell can be significantly smaller than the characteristic size of the container. Moreover, the computation time of this kernel is very small, typically much less than 1% of the total time of one integration step. The idea of this kernel is to find the minimum and maximum values of the particle positions. Here we use atomic operations and the built-in min and max functions.
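Because CUDA provides atomicMin/atomicMax only for integer types, a floating-point minimum is usually emulated with atomicCAS on the bit pattern and combined with a per-block reduction. The sketch below is our illustration of this pattern, not the kernel of Ref. [24]; the block size, the grid-stride loop and the power-of-two blockDim are assumptions of the sketch.

#include <float.h>   // FLT_MAX

// Atomic minimum for float, emulated with atomicCAS on the bit pattern (illustrative).
__device__ void atomicMinFloat(float *addr, float value)
{
    float old = *addr;
    while (value < old) {
        int assumed = __float_as_int(old);
        int prev    = atomicCAS((int *)addr, assumed, __float_as_int(value));
        if (prev == assumed) break;        // our value has been written
        old = __int_as_float(prev);        // another thread won; re-check against its value
    }
}

// Bounding-box sketch for one coordinate: each block reduces its particles with fminf()
// in shared memory and issues a single atomic update of the global minimum (the maxima
// and the other coordinates are handled analogously). minX must be initialized to
// FLT_MAX on the host before the launch.
__global__ void boundingBoxMinX(const float *posX, float *minX, int nParticles)
{
    __shared__ float sMin[256];                       // blockDim.x <= 256, power of two, assumed
    int tid = threadIdx.x;

    float local = FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + tid; i < nParticles; i += gridDim.x * blockDim.x)
        local = fminf(local, posX[i]);                // grid-stride loop over the particles
    sMin[tid] = local;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // shared-memory tree reduction
        if (tid < s) sMin[tid] = fminf(sMin[tid], sMin[tid + s]);
        __syncthreads();
    }
    if (tid == 0) atomicMinFloat(minX, sMin[0]);      // one atomic operation per block
}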
Kernel 2 performs the hierarchical decomposition of the root cell and builds the octree in the three-dimensional case. As in the following kernels, the particles are assigned to the threads in a round-robin fashion. When a particle is assigned to a thread, the thread tries to lock the appropriate child pointer. In the case of success, the thread rewrites the child pointer and releases the lock. To perform the lightweight lock, which is used to prevent several threads from accessing the same part of the tree array, atomic operations have to be involved. To synchronize the tree-building process we use the syncthreads barrier.
Kernel 3 calculates the magnetic moments and positions of the pseudo-particles associated with the cells by traversing the unbalanced octree from the bottom up. A thread checks whether the magnetic moments and geometric centers of all the sub-cells assigned to its cell have already been computed. If not, the thread updates the contribution of the ready cells and waits for the rest of the sub-cells; otherwise, the impact of all sub-cells is computed.
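Consistent with Eq. (14), the moment of a cell is the plain sum of its children's moments, while its position is the particle-count-weighted mean of the children's centers. A stripped-down sketch of this summarization (our illustration; the ready-flag polling and memory fences of the actual kernel are left out) is:

// Summarize one cell from its already-computed children: the total moment and the
// geometric center of Eq. (14). Sequential-logic sketch; the actual kernel processes
// many cells in parallel, polling ready flags and using memory fences.
__device__ void summarizeCell(int cell, const int *child,
                              float3 *pos, float3 *mom, int *count)
{
    float3 m = make_float3(0.0f, 0.0f, 0.0f);
    float3 c = make_float3(0.0f, 0.0f, 0.0f);
    int    n = 0;

    for (int i = 0; i < 8; ++i) {
        int ch = child[8 * cell + i];
        if (ch < 0) continue;                           // empty slot
        int nc = count[ch];                             // particles in this child (1 for a leaf)
        m.x += mom[ch].x;  m.y += mom[ch].y;  m.z += mom[ch].z;
        c.x += nc * pos[ch].x;  c.y += nc * pos[ch].y;  c.z += nc * pos[ch].z;
        n   += nc;
    }
    mom[cell]   = m;                                               // total moment of the cell
    pos[cell]   = make_float3(c.x / n, c.y / n, c.z / n);           // count-weighted center, Eq. (14)
    count[cell] = n;
}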
Kernel 4 sorts the particles according to their locations. This step speeds up the performance of the next kernel due to a more optimal global memory access pattern.
Kernel 5 first calculates the forces acting on the particles and then calculates the corresponding increments. In order to compute the force and dipole field acting upon a particle, the octree is traversed. To minimize thread divergence, it is very important that spatially close particles belong to the same warp; in this case the threads within the warp traverse approximately the same tree branches. This has already been provided for by kernel 4. The data necessary to compute the interactions are fetched into the shared memory by the first thread of a warp, which allows the number of memory accesses to be reduced.
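Schematically, the per-particle tree walk of kernel 5 can be written with an explicit stack and a standard multipole-acceptance criterion. The sketch below is a simplified, single-thread view and an illustration only: the acceptance parameter theta and the cellSize array are our assumptions (the paper does not spell out the criterion), float3 arrays replace the separate coordinate arrays used for coalescing, the warp-level shared-memory staging is omitted, and dot3() and dipoleForce() refer to the routines sketched at the end of Section 2.

// Schematic force evaluation for particle k by a depth-first walk over the octree
// (illustrative only). Nodes with index < nParticles are particles (leaves); larger
// indices are cells. A cell is accepted as a pseudo-particle if it is small enough
// compared to its distance from particle k; otherwise its children are visited.
__device__ float3 bhForceOnParticle(int k, const float3 *pos, const float3 *mom,
                                    const int *child, const float *cellSize,
                                    int rootCell, int nParticles, float theta)
{
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);
    int stack[64];                                    // depth-limited traversal stack
    int top = 0;
    stack[top++] = rootCell;

    while (top > 0) {
        int node = stack[--top];
        float3 rho = make_float3(pos[k].x - pos[node].x,
                                 pos[k].y - pos[node].y,
                                 pos[k].z - pos[node].z);
        float dist   = sqrtf(dot3(rho, rho));
        bool  accept = (node < nParticles) || (cellSize[node] < theta * dist);

        if (accept) {
            if (node != k && dist > 0.0f) {           // pseudo-particle or a distinct real particle
                float3 f = dipoleForce(rho, mom[k], mom[node]);
                acc.x += f.x;  acc.y += f.y;  acc.z += f.z;
            }
        } else {                                      // the cell is too close: open it
            for (int i = 0; i < 8; ++i) {
                int c = child[8 * node + i];
                if (c >= 0) stack[top++] = c;
            }
        }
    }
    return acc;
}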
Finally, kernel 6 updates the states of the particles by using the position increments and re-orients the particle magnetic moments.
The above-described algorithm has many advantages, among them a minimal thread divergence, the complete absence of GPU/CPU data transfers (aside from the transfer of the final results), an optimal use of the global memory with a minimal number of accesses, data-field re-use, a minimal number of locks, etc. All this allows a tangible speed-up to be achieved. For a more detailed description we direct the interested reader to Ref. [24].
4. Results
We performed simulations on (i) a PC with an Intel Xeon X5670 @ 2.93 GHz CPU (48 GB of RAM) and (ii) a Tesla M2050 GPU. Though the CPU has six cores, only a single core was used in the simulations. The programs were compiled with the nvcc (version 4.0) and gcc (version 4.4.1) compilers. Since there was no need for high-precision calculations, we used single-precision variables (float) and compiled the programs with the -use_fast_math key. We also used the -O3 optimization flag to speed up our programs. Finally, the Euler-Maruyama method with the time step ∆τ = 0.001 was used to integrate Eqs. (7)-(9).

N          AP_CPU        BH_CPU     AP_GPU     BH_GPU     AP_CPU/AP_GPU    AP_CPU/BH_GPU    AP_GPU/BH_GPU
10³        34            9.8        0.7        2          49               17               0.35
10⁴        3470          137        20         6.5        174              534              3.1
10⁵        392000        2487       1830       54         214              7259             33.9
10⁶        39281250      73121      184330     621        213              63214            297

Table 1: Duration of a single integration step (in ms) for the optimized All-Pairs algorithm implemented on the CPU (AP_CPU) and on the GPU (AP_GPU), and for the CPU-based (BH_CPU) and GPU-oriented (BH_GPU) Barnes-Hut algorithm. The last three columns give the corresponding speed-up ratios.
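The Euler-Maruyama step itself is inexpensive. For the translational part of Eq. (9) it can be sketched as follows; this is our illustration of kernel 6 (or the AP update stage), not the authors' code, and the noise amplitude noiseAmp, which collects the prefactor of ξ in Eq. (9) together with the √∆τ factor of the discretization, as well as the cuRAND state handling, are assumptions of the sketch.

#include <curand_kernel.h>

// Minimal sketch of one Euler-Maruyama step for the translational part of Eq. (9).
// force[k] holds f_dip + f_sr computed by the force kernel; the curandState array is
// assumed to have been initialized elsewhere with curand_init().
__global__ void integrateTranslation(float3 *pos, float3 *vel, const float3 *force,
                                     curandState *rng, float gammaD, float dt,
                                     float noiseAmp, int nParticles)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nParticles) return;

    curandState local = rng[k];
    float3 xi = make_float3(curand_normal(&local),    // three independent N(0,1) variates
                            curand_normal(&local),
                            curand_normal(&local));
    rng[k] = local;

    float3 v = vel[k];
    pos[k].x += v.x * dt;                             // rho_{n+1} = rho_n + v_n * dt
    pos[k].y += v.y * dt;
    pos[k].z += v.z * dt;

    v.x += (force[k].x - gammaD * v.x) * dt + noiseAmp * xi.x;   // damping plus thermal noise
    v.y += (force[k].y - gammaD * v.y) * dt + noiseAmp * xi.y;
    v.z += (force[k].z - gammaD * v.z) * dt + noiseAmp * xi.z;
    vel[k] = v;
}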
We measured the computation time of one integration step for both algorithms as a function of N. The results are presented in Table 1. The benefits of GPU computing increase with the number of particles. For an ensemble of N = 10⁶ particles the speed-up gained from the use of the Barnes-Hut algorithm is almost 300 compared to the performance of the optimized All-Pairs algorithm on the same GPU. However, for N = 10³ the All-Pairs algorithm performs better: for such a small number of particles the computational expenses of the tree-building phase, sorting, etc. outweigh the speed-up effect of the approximation. Here we remind the reader that N = 10³ has been the typical scale of most ferrofluid simulations to date [17, 18].
Fig. 2 shows instantaneous configurations obtained during the simulations for a cubic confinement and for a mono-layer. To simulate a mono-layer of particles we use a rectangular parallelepiped of height 2.1R as the confinement. The parameters of the simulations correspond to the regime where the average dipole energy is much larger than the energy of thermal fluctuations. The formation of chain-like large-scale clusters [36] is clearly visible.
5. A benchmark test: average magnetization curves

In order to check the accuracy of the numerical schemes we calculated the reduced magnetization curve [36]. The reduced magnetization vector is given by the sum ⟨u⟩ = (1/N) Σ_{i=1}^{N} u_i. The main parameters that characterize the magnetic properties of a ferrofluid are the dipole coupling constant λ, which is the ratio of the dipole-dipole potential and the thermal energy, namely [37],

\lambda = \frac{\mu_0 m^2}{16 \pi R^3 k_B T},    (15)
and the volume fraction, which is the ratio between the volume occupied by the particles and the total volume occupied by the ferrofluid, V_f, i.e.,

\phi = \frac{(4/3) \pi R^3 N}{V_f} \cdot 100\%.    (16)

In the limit λ < 1 and for a small volume fraction, φ ≪ 100%, the projection of the reduced magnetization vector onto the direction of the applied field, ⟨u_H⟩, can be well approximated by the Langevin function [36]:

\langle u_H \rangle = L(\alpha) = \coth(\alpha) - \frac{1}{\alpha},    (17)

where α denotes the ratio between the magnetic energy and the thermal energy,

\alpha = \frac{m \mu_0 H}{k_B T}.    (18)
Figure 2: Snapshots of N-particle ensembles obtained with the Barnes-Hut algorithm: (a) N = 20000 (cubic confinement with the edge length L = 150R) and (b) a mono-layer of N = 300 particles. The parameters are: Γ_r = Γ_d = 0.1, T = 300 K, µ = 3.1·10⁵ A/m, D = 5000 kg/m³, R = 10 nm.
We simulated a system with parameters corresponding to maghemite (γ-Fe₂O₃), with a saturation magnetization of µ = 3.1·10⁵ A/m, density D = 5000 kg/m³, and carrier viscosity η = 0.89·10⁻³ Pa·s (the latter corresponds to the viscosity of water at T = 298 K). The volume fraction is set to φ = 1% and the particle radius to R = 3 nm. The external magnetic field H was applied along the z-axis. We initiated the system at time τ = 0 by randomly distributing the particles in a cubic container. The orientation angles of the particle magnetization vectors were obtained by drawing random values from the interval [0, 2π].
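For orientation, λ and α can be evaluated directly from Eqs. (15) and (18) with the parameters just listed. The short host-side snippet below is our illustration only (the swept field values are arbitrary and chosen merely to cover the α range of Fig. 3); for these parameters λ comes out well below unity, so the comparison with the Langevin function is legitimate.

#include <cmath>
#include <cstdio>

// Host-side evaluation of the dipole coupling constant (15), the Langevin parameter (18)
// and the Langevin function (17) for the maghemite parameters of this section.
int main()
{
    const double PI  = 3.14159265358979323846;
    const double kB  = 1.380649e-23;                  // Boltzmann constant, J/K
    const double mu0 = 4.0 * PI * 1e-7;               // magnetic constant, H/m
    const double R   = 3e-9;                          // particle radius, m
    const double mu  = 3.1e5;                         // specific magnetization, A/m
    const double T   = 298.0;                         // temperature, K

    const double V = 4.0 / 3.0 * PI * R * R * R;      // particle volume, m^3
    const double m = V * mu;                          // magnetic moment, A m^2

    const double lambda = mu0 * m * m / (16.0 * PI * R * R * R * kB * T);   // Eq. (15)
    std::printf("lambda = %.3f\n", lambda);           // below 1 for these parameters

    for (double H = 1e5; H <= 8e5; H += 1e5) {        // external field sweep, A/m
        double alpha = m * mu0 * H / (kB * T);        // Eq. (18)
        double L     = 1.0 / std::tanh(alpha) - 1.0 / alpha;    // Eq. (17)
        std::printf("H = %8.0f A/m  alpha = %6.3f  L(alpha) = %.3f\n", H, alpha, L);
    }
    return 0;
}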
Fig. 3 presents the results of the simulations. After a transient of τ_eq = 1000, given to the system of N = 10⁵ particles to equilibrate, the mean reduced magnetization has been calculated by averaging ⟨u_z⟩ over the time interval τ_calc = 1000. It is noteworthy that even single-run results are very close to the Langevin function, see Fig. 3(a). The contribution of the magnetostatic energy grows with α, so that the strength of the dipole-dipole interaction is also increasing, see Fig. 4. Since the average value of the dipole-field projection on the z axis is positive and increases with α, the dipole field amplifies the external magnetic field. This explains the discrepancy between the analytical and numerical results obtained for large values of α. For an ensemble of N = 10³ particles the results obtained with the two algorithms are nearly identical, Fig. 3(b).
Figure 3: Checking the Barnes-Hut approximation: (a) comparison between the single-run results obtained with N = 10⁵ particles and the Langevin function, Eq. (17); (b) comparison of the results obtained with the All-Pairs and the Barnes-Hut algorithms for an ensemble of N = 10³ particles.
It is also important to compare the average dipole fields, ⟨h^dip⟩, calculated with the Barnes-Hut and the All-Pairs algorithms. The outputs obtained for the above-given set of parameters are shown in Fig. 4. Again, the two algorithms produce almost identical results.
Figure 4: z-component of the reduced average dipole field as a function of α [18]. The parameters are the same as in Fig. 3(b). Each point was obtained by averaging over 10³ independent realizations.
6. Conclusions

With this work we demonstrate that the Barnes-Hut algorithm can be efficiently implemented for large-scale, GPU-based ferrofluid simulations. Overall, we achieved a speed-up of more than two orders of magnitude compared to the performance of a common GPU-oriented All-Pairs algorithm. The proposed approach allows the size of the ensembles to be increased by two orders of magnitude compared to the present-day scale of simulations [19]. The Barnes-Hut algorithm correctly accounts for the dipole-dipole interaction within an ensemble of N = 10⁶ particles and produces results that fit the theoretical predictions with high accuracy. Our findings open several interesting perspectives. First, they bring about the possibility to perform large-scale molecular-dynamics simulations of the time evolution of ferrofluids placed in confinements of complex shapes, like thin vessels or tangled pipes, where the boundary effects play an important role [38]. It is also possible to explore the non-equilibrium dynamics of ferrofluids, for example their response to different types of externally applied magnetic fields, such as periodically alternating fields [39] or gradient fields [40]. Another direction for further studies is the exploration of the relationship between the shape and topology of nano-clusters and different macroscopic properties of ferrofluids [41, 42]. Finally, GPU-based computational algorithms provide new possibilities to study heat-transport processes in ferrofluids [43], for example to investigate the performance of ferromagnetic particles as heat sources for magnetic fluid hyperthermia [44].
7. Acknowledgment

A.Yu.P. and T.V.L. acknowledge the support of the Cabinet of Ministers of Ukraine obtained within the Program of Studying and Training for Students, PhD Students and Professors' Staff Abroad, and the support of the Ministry of Education, Science, Youth, and Sport of Ukraine (Project No. 0112U001383). S.D. and P.H. acknowledge the support by the cluster of excellence Nanosystems Initiative Munich (NIM).
References
[1] R. Rosensweig, Ferrohydrodynamics, Cambridge University Press, 1985.
[2] Q. Pankhurst, J. Connolly, S. Jones, J. Dobson, Applications of magnetic nanoparticles in biomedicine, J. Phys. D: Appl. Phys. 36 (2003) R167.
[3] K. Raj, R. Moskowitz, Commercial applications of ferrofluids, J. Magn. Magn. Mater. 85 (1990) 233.
[4] J. P. McTague, Magnetoviscosity of magnetic colloids, J. Chem. Phys. 51 (1969) 133.
[5] S. Odenbach, Magnetoviscous Effects in Ferrofluids, Springer, 2002.
[6] S. Odenbach, H. Stoerk, Shear dependence of field-induced contributions to the viscosity of magnetic fluids at low shear rates, J. Magn. Magn. Mater. 183 (2000) 188.
[7] P. Ilg, E. Coquelle, S. Hess, Structure and rheology of ferrofluids: simulation results and kinetic models, J. Phys. Condens. Mat. 18 (2006) S2757.
[8] S. de Leeuw, J. Perram, E. Smith, Simulation of Electrostatic Systems in Periodic Boundary Conditions. I. Lattice Sums and Dielectric Constants, in: Proc. R. Soc. Lond. A Math. Phys. Sci., volume 373, 1980, p. 27.
[9] J. J. Cerdà, V. Ballenegger, O. Lenz, C. Holm, P3M algorithm for dipolar interactions, J. Chem. Phys. 129 (2008) 234104.
[10] H. Hartshorne, C. J. Backhouse, W. E. Lee, Ferrofluid-based microchip pump and valve, Sensor. Actuat. B Chem. 99 (2004) 592.
[11] N. Pamme, Magnetism and microfluidics, Lab Chip 6 (2006) 24.
[12] J. Phillips, R. Braun, W. Wang, et al., Scalable Molecular Dynamics with NAMD, J. Comput. Chem. 26 (2005) 1781.
[13] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, T. J. Purcell, A Survey of General-Purpose Computation on Graphics Hardware, Comput. Graph. Forum 26 (2007) 80.
[14] L. Nyland, M. Harris, J. Prins, Fast N-body simulation with CUDA, in: GPU Gems 3, edited by H. Nguyen, chap. 31 (2007).
[15] J. Sanders, E. Kandrot, CUDA by Example, Addison-Wesley, 2011.
[16] CUDA Fortran Programming Guide and Reference, 2012. URL: http://www.pgroup.com/lit/whitepapers/pgicudaforug.pdf.
[17] Z. Wang, C. Holm, H.-W. Müller, Molecular dynamics study on the equilibrium magnetization properties and structure of ferrofluids, Phys. Rev. E 66 (2002) 021405.
[18] A. O. Ivanov, S. S. Kantorovich, E. N. Reznikov, et al., Magnetic properties of polydisperse ferrofluids: A critical comparison between experiment, theory, and computer simulation, Phys. Rev. E 75 (2007) 061405.
[19] J. J. Cerda, E. Efimova, V. Ballenegger, E. Krutikova, A. Ivanov, C. Holm, Behavior of bulky ferrofluids in the diluted low-coupling regime: Theory and simulation, Phys. Rev. E 81 (2010) 011501.
[20] V. Springel, S. D. M. White, A. Jenkins, C. S. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, J. A. Peacock, S. Cole, P. Thomas, H. Couchman, A. Evrard, J. Colberg, F. Pearce, Simulations of the formation, evolution and clustering of galaxies and quasars, Nature 435 (2005) 629.
[21] S. F. P. Zwart, R. G. Belleman, P. M. Geldof, High-performance direct gravitational N-body simulations on graphics processing units, New Astron. 12 (2007) 641.
[22] R. G. Belleman, J. Bédorf, S. F. P. Zwart, High performance direct gravitational N-body simulations on graphics processing units II: An implementation in CUDA, New Astron. 13 (2008) 103.
[23] D. Aubert, Numerical Cosmology powered by GPUs, in: Proceedings of the International Astronomical Union, volume 6, 2010, p. 397.
[24] M. Burtscher, K. Pingali, An Efficient CUDA Implementation of the Tree-based Barnes Hut N-body Algorithm, in: GPU Computing Gems Emerald Edition, 2011.
[25] C. Holm, Efficient Methods for Long Range Interactions in Periodic Geometries Plus One Application, in: Computational Soft Matter: From Synthetic Polymers to Proteins, Research Centre Jülich, 2004.
[26] J. Weis, D. Levesque, Chain formation in low density dipolar hard spheres: A Monte Carlo study, Phys. Rev. Lett. 71 (1993) 2729–2732.
[27] D. Wei, G. Patey, Orientational order in simple dipolar liquids: Computer simulation of a ferroelectric nematic phase, Phys. Rev. Lett. 68 (1992) 2043–2045.
[28] G. Mériguet, M. Jardat, P. Turq, Brownian dynamics investigation of magnetization and birefringence relaxations in ferrofluids, J. Chem. Phys. 123 (2005) 144915.
[29] M. Januszewski, M. Kostur, Accelerating numerical solution of stochastic differential equations with CUDA, Comput. Phys. Commun. 181 (2010) 183.
[30] M. Weigel, Simulating spin models on GPU, Comput. Phys. Commun. 182 (2011) 1833.