Large-scale Ferrofluid Simulations on Graphics Processing Units
A. Yu. Polyakov(a,b), T. V. Lyutyy(a), S. Denisov(b), V. V. Reva(a), P. Hänggi(b)
(a) Sumy State University, Sumy, Ukraine
(b) Institute of Physics, University of Augsburg, Augsburg, Germany
Abstract

We present an approach to molecular-dynamics simulations of ferrofluids on graphics processing units (GPUs). Our numerical scheme is based on a GPU-oriented modification of the Barnes-Hut (BH) algorithm designed to increase the parallelism of computations. For an ensemble consisting of one million ferromagnetic particles, the performance of the proposed algorithm on a Tesla M2050 GPU demonstrated a computational-time speed-up of four orders of magnitude compared to the performance of the sequential All-Pairs (AP) algorithm on a single-core CPU, and of two orders of magnitude compared to the performance of the optimized AP algorithm on the GPU. The accuracy of the scheme is corroborated by comparing the results of numerical simulations with theoretical predictions.

Keywords: Ferrofluid, molecular dynamics simulation, the Barnes-Hut algorithm, GPUs, CUDA
1. Introduction
Ferrofluids are media composed of magnetic nanoparticles with diameters in the range 10–50 nm which are dispersed in a viscous fluid (for example, water or ethylene glycol) [1]. These physical systems combine the basic properties of liquids, i.e. viscosity and the presence of surface tension, with those of ferromagnetic solids, i.e. an internal magnetization and a high permeability to magnetic fields. This synergetic duality makes ferrofluids an attractive candidate for performing different tasks, ranging from the delivery of rocket fuel into spacecraft thrust chambers under zero-gravity conditions to high-precision drug delivery during cancer therapy [2]. Moreover, ferrofluids have already found their way into commercial applications as EM-controlled shock absorbers, dynamic seals in engines and computer hard drives, and as key elements of high-quality loudspeakers, to name but a few [3].
Being liquids, ferrofluids can be modeled by using different macroscopic, continuum-media approaches. The corresponding field, ferrohydrodynamics [1], is a well-established area that has produced a lot of fundamental results. However, some important phenomena, such as magnetoviscosity [4, 5], can be described properly neither on the hydrodynamical level nor within the single-particle picture [6]. The role of multiparticle aggregates – chains and clusters – living in the ferrofluid bulk is important there. The evaluation of other non-Newtonian features of ferrofluids, which can be controlled by external magnetic fields [5], also demands information about the structure of the aggregates and their dynamics. The key mechanism responsible for the formation of ferro-clusters is the dipole-dipole interaction acting between magnetic particles. Under certain conditions, the interaction effects can overcome the destructive effects of thermal fluctuations and contribute tangibly to the ensemble dynamics. Therefore, in order to get a deeper insight into the above-mentioned features of ferrofluids, the interaction effects should be explicitly included into the model.
Strictly speaking, the dipole-dipole interaction acts between all possible pairs of the N ferromagnetic particles. Therefore, the computational time of the straightforward sequential algorithm scales like N². This is a fundamental drawback of many-body simulations, and standard CPU-based computational resources often limit the scale of molecular-dynamics calculations. In order to run larger ensembles for longer times, researchers either (i) advance their models and numerical schemes and/or (ii) rely on more efficient computers. The first track has led to different methods developed to explore the equilibrium properties of bulk ferrofluids. Most prominent are cut-off sphere approximations [7] and the Ewald summation technique [8], see also Ref. [9] for different modifications of the technique. However, both methods are of limited use for computational studies of confined ferrofluids, a problem that attracts special attention nowadays due to its practical relevance [10, 11]. An alternative approach is the search for ways to increase the scalability and parallelism of computations, see Ref. [12]. Until recently, this practically meant the use of either large
computational clusters, consisting of many Central Processing Units (CPUs), or massively parallel supercomputers, like Blue Gene supercomputers. The prices and the maintenance costs of such devices are high. The advent of general-purpose computing on graphics processing units (GP2U) [13] has changed the situation drastically and boosted simulations of many-body systems to a new level [14]. GPUs, initially designed to serve as data pipelines for graphical information, are relatively small, much easier to maintain, and possess high computational capabilities allowing for parallel data processing. Nowadays, scientific GPU computing is used in many areas of computational physics, thanks to the Compute Unified Device Architecture (CUDA) developed by the NVIDIA Corporation [15]. CUDA significantly simplifies GPU-based calculations, so one can now use even a Sony PlayStation 3 as a multi-core computer by programming it with the C, C++ or Fortran [16] languages.
The typical scale of molecular-dynamics simulations in ferrofluid studies is within the range N = 10²–10³ [17, 18], while N = 10⁴ constitutes the current limit [19]. It is evident that increasing the ensemble size by several orders of magnitude would tangibly improve the statistical sampling and thus the quality of simulation results. Running 10⁵–10⁶ interacting magnetic particles numerically is a task not much different from running an artificial Universe [20]. Therefore, similarly to the case of Computational Cosmology [21, 22, 23], the GP2U seems to be very promising also in the context of ferrofluid simulations. In this paper we report the performance of a recently proposed GPU-oriented modification of the Barnes-Hut algorithm [24] used to simulate the dynamics of N = 10³–10⁶ interacting ferromagnetic particles moving in a viscous medium.
Although mentioned as a potentially promising approach (see, for example, Ref. [25]), the Barnes-Hut algorithm has, to the best of our knowledge, never been used in molecular-dynamics ferrofluid simulations. Our main goal here is to demonstrate the sizable speed-up one can reach when modeling such systems on a GPU – and by using GPU-oriented numerical algorithms – compared to the performance of conventional, CPU-based algorithms. We also demonstrate the high accuracy of the Barnes-Hut approximation by using several benchmarks. The paper is organized as follows: First, in Section 2 we specify the model. Then, in Section 3, we describe how both the All-Pairs and the Barnes-Hut algorithms can be efficiently implemented for GPU-based simulations. The results of the numerical tests are presented and discussed in Sections 4 and 5. Finally, Section 6 contains the conclusions.
2. The model

The model system represents an ensemble of N identical particles of radius R, made of a ferromagnetic material of density D and specific magnetization µ. Each particle occupies the volume V = (4/3)πR³, has a magnetic moment m = m(t) of constant magnitude |m| = m = Vµ, mass M = VD, and moment of inertia I = (2/5)MR². The ensemble is dispersed in a liquid of viscosity η. Based on the Langevin dynamics approach, the equations of motion for the k-th nanoparticle can be written in the following form [17]
I \ddot{\theta}_k = N_{kz} - G_r \dot{\theta}_k + \Xi^r_{\theta},    (1)

I \ddot{\varphi}_k = -N_{kx} \sin\varphi_k + N_{ky} \cos\varphi_k - G_r \dot{\varphi}_k + \Xi^r_{\varphi},    (2)

M \ddot{\vec{r}}_k = \mu_0 (\vec{m}_k \nabla_k) \cdot \vec{H}_k + \vec{F}^{sr}_k - G_d \dot{\vec{r}}_k + \vec{\Xi}^d_{\vec{r}},    (3)
where θ_k and ϕ_k are the polar and azimuthal angles of the magnetization vector m_k, respectively, N_kx = m_ky H_kz − m_kz H_ky, N_ky = m_kz H_kx − m_kx H_kz, N_kz = m_kx H_ky − m_ky H_kx, x, y, z denote the Cartesian coordinates, and dots over the variables denote derivatives with respect to time. r_k is the radius-vector defining the nanoparticle position, and the gradient is given by ∇_k = ∂/∂r_k = e_x ∂/∂x_k + e_y ∂/∂y_k + e_z ∂/∂z_k (e_x, e_y, e_z are the unit vectors of the Cartesian coordinates). The constants G_t = 6πηR and G_r = 8πηR³ specify the translational and rotational friction coefficients, and µ_0 = 4π·10⁻⁷ H/m is the magnetic constant.
The resulting field acting on the k-th particle, H_k, is the sum of the external field H^ext and the overall field exerted on the particle by the rest of the ensemble,

\vec{H}_k = \sum_{j=1,\, j \neq k}^{N} \vec{H}^{dip}_{kj} + \vec{H}^{ext},    (4)

\vec{H}^{dip}_{kj} = \frac{3 \vec{r}_{kj} (\vec{m}_j \vec{r}_{kj}) - \vec{m}_j \vec{r}^{\,2}_{kj}}{|\vec{r}_{kj}|^5},    (5)
where r_kj = r_k − r_j. F^sr_k in Eq. (3) denotes the force induced by a short-range interaction potential. In this paper we use the Lennard-Jones potential [17] (though hard-sphere [26], soft-sphere [27] and Yukawa-type [28] potentials can be used as alternatives), so that

\vec{F}^{sr}_k = 24 E \sum_{j=1,\, j \neq k}^{N} \frac{\vec{r}_{kj}}{r^2_{kj}} \left[ \left( \frac{s}{r_{kj}} \right)^{12} - \left( \frac{s}{r_{kj}} \right)^{6} \right].    (6)

Here E is the depth of the potential well and s is the equilibrium distance at which the inter-particle force vanishes. The interaction between a particle and the container walls is also modeled with a Lennard-Jones potential of the same type. The random-force vector, representing the interaction of a particle with the thermal bath, has standard white-noise components, ⟨Ξ^r_α(t)⟩ = 0 (α = ϕ, θ), ⟨Ξ^d_β(t)⟩ = 0 (β = x, y, z), and second moments satisfying ⟨Ξ_{rα}(t) Ξ_{rβ}(t′)⟩ = 3 k_B T G_r δ_αβ δ(t − t′) (α, β = ϕ, θ) and ⟨Ξ_{dα}(t) Ξ_{dβ}(t′)⟩ = 2 k_B T G_t δ_αβ δ(t − t′) (α, β = x, y, z) [17]. Here k_B is the Boltzmann constant and T is the temperature of the heat bath.
By rescaling the variables, τ = t/T_ch, T_ch = (R/µ)√(3D/4µ_0), u = m/|m|, ρ_k = r_k/R, ∇_{ρ_k} = (1/R) ∂/∂ρ_k, the equations of motion for the k-th particle can be rewritten in the reduced form,

\frac{2}{5} \frac{d^2 \theta_k}{d\tau^2} = (u_{kx} h_{ky} - u_{ky} h_{kx}) - \Gamma_r \frac{d\theta_k}{d\tau} + \frac{T_{ch}^2}{V R^2 D} \xi^r_{\theta},    (7)

\frac{2}{5} \frac{d^2 \varphi_k}{d\tau^2} = -(u_{ky} h_{kz} - u_{kz} h_{ky}) \sin\varphi_k + (u_{kz} h_{kx} - u_{kx} h_{kz}) \cos\varphi_k - \Gamma_r \frac{d\varphi_k}{d\tau} + \frac{T_{ch}^2}{V R^2 D} \xi^r_{\varphi},    (8)

\frac{d^2 \vec{\rho}_k}{d\tau^2} = \vec{f}^{dip}_k + \vec{f}^{sr}_k - \Gamma_d \frac{d\vec{\rho}_k}{d\tau} + \frac{T_{ch}^2}{V R D} \vec{\xi}^d_{\vec{\rho}},    (9)
where

\vec{h}_k = \sum_{j=1,\, j \neq k}^{N} \vec{h}^{dip}_{kj} + \vec{h}^{ext},    (10)

\vec{h}^{dip}_{kj} = \frac{3 \vec{\rho}_{kj} (\vec{u}_j \vec{\rho}_{kj}) - \vec{u}_j \vec{\rho}^{\,2}_{kj}}{\rho^5_{kj}},    (11)

\vec{f}^{dip}_k = \sum_{j=1,\, j \neq k}^{N} \left[ \frac{3 \left( \vec{\rho}_{kj} (\vec{u}_j \vec{u}_k) + \vec{u}_k (\vec{u}_j \vec{\rho}_{kj}) + \vec{u}_j (\vec{u}_k \vec{\rho}_{kj}) \right)}{\rho^5_{kj}} - \frac{15 \, \vec{\rho}_{kj} (\vec{u}_k \vec{\rho}_{kj}) (\vec{u}_j \vec{\rho}_{kj})}{\rho^7_{kj}} \right],    (12)

\vec{f}^{sr}_k = 24 \varepsilon \sum_{j=1,\, j \neq k}^{N} \frac{\vec{\rho}_{kj}}{\rho^2_{kj}} \left[ \left( \frac{\sigma}{\rho_{kj}} \right)^{12} - \left( \frac{\sigma}{\rho_{kj}} \right)^{6} \right].    (13)

Here h^ext = 3H^ext/(4πµ), Γ_t = G_t T_ch/(VD), Γ_r = G_r T_ch/(VR²D), σ = s/R, ε = E T_ch²/(VRD), and ρ_kj = ρ_k − ρ_j.
The random-force vector components are now given by white Gaussian noises with the second moments satisfying ⟨ξ^r_α(τ) ξ^r_β(τ′)⟩ = 3 k_B T Γ_r δ_αβ δ(τ − τ′)/T_ch (α, β = ϕ, θ) and ⟨ξ^d_α(τ) ξ^d_β(τ′)⟩ = 2 k_B T Γ_t δ_αβ δ(τ − τ′)/T_ch (α, β = x, y, z). In the general case the characteristic relaxation time of the particle magnetic moments to their equilibrium orientations is much smaller than T_ch, and one can neglect the magnetization dynamics by assuming that the direction of the vector m_k coincides with the easy axis of the k-th particle. We assume that this condition holds for our model.
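For readers who wish to implement the model directly, the pairwise quantities entering Eqs. (11)-(13) translate into a few lines of device code. The following CUDA sketch is our illustration only and is not part of the original implementation; it assumes the reduced units defined above, uses float3 vectors, and all function names are chosen by us:

// Illustrative CUDA device routines (not the authors' code): pairwise terms of
// Eqs. (11)-(13) in reduced units. rho = rho_k - rho_j; uk, uj are the unit
// magnetic moments of particles k and j.
#include <cuda_runtime.h>

__device__ __forceinline__ float dot3(float3 a, float3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Reduced dipole field h^dip_kj of particle j at the position of particle k, Eq. (11).
__device__ float3 dipoleField(float3 rho, float3 uj)
{
    float r2 = dot3(rho, rho);
    float inv_r5 = 1.0f / (r2 * r2 * sqrtf(r2));
    float ujr = dot3(uj, rho);
    return make_float3((3.0f * ujr * rho.x - uj.x * r2) * inv_r5,
                       (3.0f * ujr * rho.y - uj.y * r2) * inv_r5,
                       (3.0f * ujr * rho.z - uj.z * r2) * inv_r5);
}

// One pair term of the reduced dipole-dipole force f^dip_k, Eq. (12).
__device__ float3 dipoleForce(float3 rho, float3 uk, float3 uj)
{
    float r2 = dot3(rho, rho);
    float inv_r5 = 1.0f / (r2 * r2 * sqrtf(r2));
    float inv_r7 = inv_r5 / r2;
    float ukr = dot3(uk, rho), ujr = dot3(uj, rho), ukuj = dot3(uk, uj);
    float c5 = 3.0f * inv_r5;
    float c7 = 15.0f * ukr * ujr * inv_r7;
    return make_float3(c5 * (rho.x * ukuj + uk.x * ujr + uj.x * ukr) - c7 * rho.x,
                       c5 * (rho.y * ukuj + uk.y * ujr + uj.y * ukr) - c7 * rho.y,
                       c5 * (rho.z * ukuj + uk.z * ujr + uj.z * ukr) - c7 * rho.z);
}

// One pair term of the Lennard-Jones force f^sr_k, Eq. (13).
__device__ float3 ljForce(float3 rho, float eps, float sigma)
{
    float r2 = dot3(rho, rho);
    float s2 = sigma * sigma / r2;
    float s6 = s2 * s2 * s2;
    float scale = 24.0f * eps * (s6 * s6 - s6) / r2;  // 24*eps*[(sigma/rho)^12 - (sigma/rho)^6]/rho^2
    return make_float3(scale * rho.x, scale * rho.y, scale * rho.z);
}

Both the All-Pairs and the Barnes-Hut kernels discussed in Section 3 reduce, at the innermost level, to accumulating such pair terms.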
3. Two approaches to many-body simulations on GPUs: the All-Pairs and the Barnes-Hut algorithms

In this section we discuss two alternative approaches to the numerical propagation of the dynamical system given by Eqs. (7)-(9) on a GPU. It is assumed that the reader is familiar with the basics of GP2U computing; otherwise we refer the reader to Refs. [29, 30], which contain crash-course-like introductions to physically-oriented GPU computing.
3.1. All-Pairs Algorithm

The most straightforward approach to propagating a system of N interacting particles is to account for the interactions between all pairs. Although exact, the corresponding All-Pairs (AP) algorithm is slow when performed on a CPU, so it is usually used to propagate systems of N = 10²–10³ particles. However, even with this brute-force method one can tangibly benefit from GPU computations by noticing that the AP idea fits the CUDA architecture [31]. One integration step of the standard AP algorithm is performed in two stages:
1. Calculation of the increments of the particle positions and magnetic moment directions.
2. Update of the particle positions and magnetic moment directions.
This structure remains intact in the GPU version of the algorithm. Kernels responsible for the first stage compute the forces that act on the particles and calculate the corresponding increments for the particle positions and magnetic moments, according to Eqs. (7)-(9). The increments are then written into the global memory. Finally, second-stage kernels update the particle states with the obtained increments. There is, however, a need for a global synchronization of the threads that belong to different blocks after every stage, since the stages are performed by separate CUDA kernels and all the information about the state of the system is kept in the shared memory.
Each thread is responsible for one particle of the ensemble, and thus it should account for the forces exerted on that particle by the rest of the ensemble. To speed up the computational process we keep the data vector of the thread's particle in the shared memory, as well as the information on the other particles needed to compute the corresponding interaction forces. Thus we have two sets of arrays of data in the shared memory, namely

1. the data of the particles assigned to the threads of the block;
2. the data of the particles to compute the interaction with.
The necessary data are the coordinates of the particles and the projections of their magnetic moment vectors onto the x, y and z axes. The sizes of the arrays are equal to the number of threads per block. The first set of data is constant during one integration step, but the second set changes. So, at the beginning we upload the information on the particle coordinates and moments to the second set of arrays in the shared memory. After computing the forces acting on the block particles, the procedure is repeated, i.e. the information on another set of particles is written into the second set of arrays and the corresponding interactions are computed. The corresponding pseudo-code is presented below¹.
The advantage of the described approach is that it uses the global memory in the most optimal way: the access to global memory is coalesced and there are no shared-memory bank conflicts. For an ensemble of N = 10⁴ particles this leads to a GPU occupancy² of 97.7%.
3.2. Barnes-Hut Algorithm

The All-Pairs algorithm is simple and straightforward to implement on a GPU and perfectly fits CUDA. Yet this algorithm scales poorly: the corresponding computation time grows like O(N²), and its performance is already very slow for an ensemble of 10⁵ particles.

The Barnes-Hut (BH) approximation [32] exhibits a much better scalability, and its computational time grows like O(N log N). The key idea of the algorithm is to substitute a group of particles with a single pseudo-particle mimicking the action of the group. The force exerted by the group on the considered particle can then be replaced with the force exerted by the pseudo-particle.
¹ In the pseudo-code, ∆ρx, ∆ρy, ∆ρz, ∆θ, ∆ϕ denote the increments of a particle's coordinates x, y, z and of the direction angles of the particle's magnetic moment, θ and ϕ, needed to propagate the particle over one time step ∆τ.
² This parameter shows how well the GPU is kept busy; it is equal to the ratio of the number of active warps to the maximum number of warps supported on a multiprocessor of the GPU. It can be obtained with the CUDA Profiler.
Algorithm 1 A CUDA kernel that computes the increments.
1: cached k ← blockIdx.x · blockDim.x + threadIdx.x
2: for j = 0 to numberOfParticles step blockDim.x do
3:     upload particle positions and angles of the current tile to the shared memory,
4:     e.g. x_shared[threadIdx.x] = x_global[j + threadIdx.x], etc.
5:     __syncthreads()                        ▷ the tile must be fully loaded before use
6:     for i = 0 to blockDim.x − 1 do
7:         if j + i ≠ k then                  ▷ the condition avoids particle self-interaction
8:             calculate the force and dipole field exerted by particle j + i on the
9:             current k-th particle, and add the result to the cached totals
10:            f_dip, f_sr, h_dip
11:        end if
12:    end for
13:    __syncthreads()                        ▷ before the next tile overwrites the shared arrays
14: end for
15: Update the cached ∆ρx, ∆ρy, ∆ρz, ∆θ, ∆ϕ according to Eqs. (7)-(9).
16: Copy the increments to the global memory.
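For concreteness, a possible CUDA rendering of Algorithm 1 is sketched below. It is our simplified illustration rather than the production kernel: only the dipole-dipole force is accumulated (via the dipoleForce() routine sketched at the end of Section 2), the block size and the assumption that N is a multiple of it are ours, and the two orientation angles are omitted.

#define BLOCK_SIZE 256   // threads per block; for brevity, N is assumed to be a multiple of BLOCK_SIZE

// Simplified tiled All-Pairs kernel following Algorithm 1 (illustrative only).
__global__ void allPairsForces(const float3 *pos, const float3 *mom,
                               float3 *force, int nParticles)
{
    __shared__ float3 tilePos[BLOCK_SIZE];
    __shared__ float3 tileMom[BLOCK_SIZE];

    int    k     = blockIdx.x * blockDim.x + threadIdx.x;    // particle handled by this thread
    float3 myPos = pos[k];
    float3 myMom = mom[k];
    float3 acc   = make_float3(0.0f, 0.0f, 0.0f);

    for (int j = 0; j < nParticles; j += blockDim.x) {        // loop over tiles
        tilePos[threadIdx.x] = pos[j + threadIdx.x];           // stage one tile in shared memory
        tileMom[threadIdx.x] = mom[j + threadIdx.x];
        __syncthreads();                                       // tile fully loaded

        for (int i = 0; i < blockDim.x; ++i) {
            if (j + i != k) {                                  // avoid self-interaction
                float3 rho = make_float3(myPos.x - tilePos[i].x,
                                         myPos.y - tilePos[i].y,
                                         myPos.z - tilePos[i].z);
                float3 f = dipoleForce(rho, myMom, tileMom[i]);
                acc.x += f.x;  acc.y += f.y;  acc.z += f.z;
            }
        }
        __syncthreads();                                       // done with this tile
    }
    force[k] = acc;   // the increments of Eqs. (7)-(9) are formed in the second stage
}

As in the pseudo-code, one barrier guards the freshly loaded tile and a second one protects it from being overwritten while other threads of the block are still reading it.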
In order to illustrate the idea, assume that all the particles are located in a three-dimensional cube, which is named the 'main cell'. The main cell is then divided into eight sub-cells, each of which confines a subset of the particles. If there is more than one particle in a given sub-cell, the latter is again divided into eight sub-cells. The procedure is re-iterated until each sub-cell contains either a single particle or none. In this way we obtain an octree whose leaves are either empty or contain a single particle only. A simplified, two-dimensional realization of this algorithm is sketched in Fig. 1. By following this recipe, we can assign to every cell obtained during the decomposition a pseudo-particle, with magnetic moment u′ equal to the sum of the magnetization vectors u of all particles belonging to the cell, and position ρ′ which is the geometric center of this set,

\vec{\rho}^{\,\prime} = \frac{1}{N'} \sum_{i=1}^{N'} \vec{\rho}_i,    (14)

where N′ is the number of particles in the cell and ρ_i is the position of the i-th particle from the subset.
The forces acting on the k-th particle can be calculated by traversing the octree. If the distance from the particle to the pseudo-particle that corresponds to the root cell is large enough, the influence of this pseudo-particle on the k-th particle is calculated; otherwise the pseudo-particles of the next sub-cells are checked, and so on (sometimes this procedure finally leads to a leaf with only one particle left in the cell). The force calculated in this way is then added to the total force acting on the k-th particle.
The BH algorithm allows for a high degree of parallelism and is widely employed in computational astrophysics [33]. However, the implementation of the Barnes-Hut algorithm on GPUs remained a challenge until recently, because the procedure uses an irregular tree structure that does not fit the CUDA architecture well. This is the main reason why the BH scheme was not realized entirely on a GPU, but some part of the calculations was always delegated to and performed on a CPU [34, 35]. A realization of the algorithm solely on a GPU was proposed in 2011 [24]. Below we briefly outline the main idea and specify its differences from the standard, CPU-based realization.
To build an octree on a CPU, heap objects are usually used. These objects contain both child-pointer and data fields, and their children are allocated dynamically. To avoid time-consuming dynamic allocation and accesses to heap objects, an array-based data structure should be used. Since we have several arrays responsible for the variables, coalesced global memory access is possible. Particles and cells can have the same data fields, e.g. positions; in this case the same arrays are used.
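A hypothetical example of such a layout (our illustration, not the exact data structure of Ref. [24]) could look as follows; particles and cells share the same arrays and are distinguished only by their index range:

// A possible array-based (structure-of-arrays) octree layout; indices
// 0..nParticles-1 address particles, larger indices address cells (pseudo-particles).
struct OctreeArrays {
    float *posX, *posY, *posZ;   // positions of particles and geometric centers of cells
    float *momX, *momY, *momZ;   // particle moments and total moments of cells
    int   *child;                // 8 child indices per cell; -1 marks an empty slot
    int   *count;                // number of particles contained in each node (1 for a particle)
    int    nParticles;           // number of particles, N
    int    nNodes;               // N plus the number of allocated cells
};

With such a layout the kernels below can address real particles and pseudo-particles uniformly, and accesses to the coordinate arrays remain coalesced.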
Figure 1: Barnes-Hut hierarchical decomposition in two-dimensional space and the corresponding octree (empty cells, cells with particles, single particles, and the main cell are marked in the sketch). See Section 3.2 for more details.

In contrast to the All-Pairs algorithm, where only two kernels were involved, the original GPU-BH algorithm has six kernels [24]:

1. Bounding-box definition kernel.
2. Octree-building kernel.
3. Computation of the geometric center and the total magnetic moment of each cell.
4. Sorting of the particles with respect to their positions.
5. Computation of the forces and fields acting on each particle.
6. Integration kernel.
Kernel 1 defines the boundaries of the root cell. Though the ensemble is confined to a container and the particles cannot go outside, we keep this kernel: the size of the root cell can be significantly smaller than the characteristic size of the container. Moreover, the computation time of this kernel is very small, typically much less than 1% of the total time of one integration step. The idea of this kernel is to find the minimum and maximum values of the particle positions. Here we use atomic operations and the built-in min and max functions.
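Because CUDA provides atomicMin/atomicMax only for integer types, a floating-point minimum is usually emulated with atomicCAS on the bit pattern and combined with a per-block reduction. The sketch below is our illustration of this pattern, not the kernel of Ref. [24]; the block size, the grid-stride loop and the power-of-two blockDim are assumptions of the sketch.

#include <float.h>   // FLT_MAX

// Atomic minimum for float, emulated with atomicCAS on the bit pattern (illustrative).
__device__ void atomicMinFloat(float *addr, float value)
{
    float old = *addr;
    while (value < old) {
        int assumed = __float_as_int(old);
        int prev    = atomicCAS((int *)addr, assumed, __float_as_int(value));
        if (prev == assumed) break;        // our value has been written
        old = __int_as_float(prev);        // another thread won; re-check against its value
    }
}

// Bounding-box sketch for one coordinate: each block reduces its particles with fminf()
// in shared memory and issues a single atomic update of the global minimum (the maxima
// and the other coordinates are handled analogously). minX must be initialized to
// FLT_MAX on the host before the launch.
__global__ void boundingBoxMinX(const float *posX, float *minX, int nParticles)
{
    __shared__ float sMin[256];                       // blockDim.x <= 256, power of two, assumed
    int tid = threadIdx.x;

    float local = FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + tid; i < nParticles; i += gridDim.x * blockDim.x)
        local = fminf(local, posX[i]);                // grid-stride loop over the particles
    sMin[tid] = local;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // shared-memory tree reduction
        if (tid < s) sMin[tid] = fminf(sMin[tid], sMin[tid + s]);
        __syncthreads();
    }
    if (tid == 0) atomicMinFloat(minX, sMin[0]);      // one atomic operation per block
}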
Kernel 2 performs the hierarchical decomposition of the root cell and builds the octree in the three-dimensional case. As in the following kernels, the particles are assigned to the threads in a round-robin fashion. When a particle is assigned to a thread, the thread tries to lock the appropriate child pointer. In the case of success, the thread rewrites the child pointer and releases the lock. To perform the lightweight lock, which is used to prevent several threads from accessing the same part of the tree array, atomic operations have to be involved. To synchronize the tree-building process we use the syncthreads barrier.
Kernel 3 calculates the magnetic moments and positions of the pseudo-particles associated with the cells by traversing the unbalanced octree from the bottom up. A thread checks whether the magnetic moments and geometric centers of all the sub-cells assigned to its cell have already been computed. If not, the thread updates the contribution of the ready cells and waits for the rest of the sub-cells; otherwise, the impact of all sub-cells is computed.
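Consistent with Eq. (14), the moment of a cell is the plain sum of its children's moments, while its position is the particle-count-weighted mean of the children's centers. A stripped-down sketch of this summarization (our illustration; the ready-flag polling and memory fences of the actual kernel are left out) is:

// Summarize one cell from its already-computed children: the total moment and the
// geometric center of Eq. (14). Sequential-logic sketch; the actual kernel processes
// many cells in parallel, polling ready flags and using memory fences.
__device__ void summarizeCell(int cell, const int *child,
                              float3 *pos, float3 *mom, int *count)
{
    float3 m = make_float3(0.0f, 0.0f, 0.0f);
    float3 c = make_float3(0.0f, 0.0f, 0.0f);
    int    n = 0;

    for (int i = 0; i < 8; ++i) {
        int ch = child[8 * cell + i];
        if (ch < 0) continue;                           // empty slot
        int nc = count[ch];                             // particles in this child (1 for a leaf)
        m.x += mom[ch].x;  m.y += mom[ch].y;  m.z += mom[ch].z;
        c.x += nc * pos[ch].x;  c.y += nc * pos[ch].y;  c.z += nc * pos[ch].z;
        n   += nc;
    }
    mom[cell]   = m;                                               // total moment of the cell
    pos[cell]   = make_float3(c.x / n, c.y / n, c.z / n);           // count-weighted center, Eq. (14)
    count[cell] = n;
}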
Kernel 4 sorts the particles according to their locations. This step speeds up the performance of the next kernel due to a more optimal global memory access pattern.
Kernel 5 first calculates the forces acting on the particles and then calculates the corresponding increments. In order to compute the force and dipole field acting upon a particle, the octree is traversed. To minimize thread divergence, it is very important that spatially close particles belong to the same warp; in this case the threads within the warp traverse approximately the same tree branches. This has already been provided for by kernel 4. The data necessary to compute the interactions are fetched into the shared memory by the first thread of a warp, which allows the number of memory accesses to be reduced.
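Schematically, the per-particle tree walk of kernel 5 can be written with an explicit stack and a standard multipole-acceptance criterion. The sketch below is a simplified, single-thread view and an illustration only: the acceptance parameter theta and the cellSize array are our assumptions (the paper does not spell out the criterion), float3 arrays replace the separate coordinate arrays used for coalescing, the warp-level shared-memory staging is omitted, and dot3() and dipoleForce() refer to the routines sketched at the end of Section 2.

// Schematic force evaluation for particle k by a depth-first walk over the octree
// (illustrative only). Nodes with index < nParticles are particles (leaves); larger
// indices are cells. A cell is accepted as a pseudo-particle if it is small enough
// compared to its distance from particle k; otherwise its children are visited.
__device__ float3 bhForceOnParticle(int k, const float3 *pos, const float3 *mom,
                                    const int *child, const float *cellSize,
                                    int rootCell, int nParticles, float theta)
{
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);
    int stack[64];                                    // depth-limited traversal stack
    int top = 0;
    stack[top++] = rootCell;

    while (top > 0) {
        int node = stack[--top];
        float3 rho = make_float3(pos[k].x - pos[node].x,
                                 pos[k].y - pos[node].y,
                                 pos[k].z - pos[node].z);
        float dist   = sqrtf(dot3(rho, rho));
        bool  accept = (node < nParticles) || (cellSize[node] < theta * dist);

        if (accept) {
            if (node != k && dist > 0.0f) {           // pseudo-particle or a distinct real particle
                float3 f = dipoleForce(rho, mom[k], mom[node]);
                acc.x += f.x;  acc.y += f.y;  acc.z += f.z;
            }
        } else {                                      // the cell is too close: open it
            for (int i = 0; i < 8; ++i) {
                int c = child[8 * node + i];
                if (c >= 0) stack[top++] = c;
            }
        }
    }
    return acc;
}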
Finally, kernel 6 updates the states of the particles by using the position increments and re-orients the particle magnetic moments.
The above-described algorithm has many advantages, among them a minimal thread divergence, the complete absence of GPU/CPU data transfers (aside from the transfer of the final results), an optimal use of the global memory with a minimal number of accesses, data-field re-use, a minimal number of locks, etc. All this allows a tangible speed-up to be achieved. For a more detailed description we direct the interested reader to Ref. [24].
4. Results
We performed simulations on (i) a PC with an Intel Xeon X5670 @ 2.93 GHz CPU (48 GB of RAM) and (ii) a Tesla M2050 GPU. Though the CPU has six cores, only a single core was used in the simulations. The programs were compiled with the nvcc (version 4.0) and gcc (version 4.4.1) compilers. Since there was no need for high-precision calculations, we used single-precision variables (float) and compiled the programs with the -use_fast_math key. We also used the -O3 optimization flag to speed up our programs. Finally, the Euler-Maruyama method with the time step ∆τ = 0.001 was used to integrate Eqs. (7)-(9).

N          AP_CPU        BH_CPU     AP_GPU     BH_GPU     AP_CPU/AP_GPU    AP_CPU/BH_GPU    AP_GPU/BH_GPU
10³        34            9.8        0.7        2          49               17               0.35
10⁴        3470          137        20         6.5        174              534              3.1
10⁵        392000        2487       1830       54         214              7259             33.9
10⁶        39281250      73121      184330     621        213              63214            297

Table 1: Duration of a single integration step (in ms) for the optimized All-Pairs algorithm implemented on the CPU (AP_CPU) and on the GPU (AP_GPU), and for the CPU-based (BH_CPU) and GPU-oriented (BH_GPU) Barnes-Hut algorithm. The last three columns give the corresponding speed-up ratios.
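The Euler-Maruyama step itself is inexpensive. For the translational part of Eq. (9) it can be sketched as follows; this is our illustration of kernel 6 (or the AP update stage), not the authors' code, and the noise amplitude noiseAmp, which collects the prefactor of ξ in Eq. (9) together with the √∆τ factor of the discretization, as well as the cuRAND state handling, are assumptions of the sketch.

#include <curand_kernel.h>

// Minimal sketch of one Euler-Maruyama step for the translational part of Eq. (9).
// force[k] holds f_dip + f_sr computed by the force kernel; the curandState array is
// assumed to have been initialized elsewhere with curand_init().
__global__ void integrateTranslation(float3 *pos, float3 *vel, const float3 *force,
                                     curandState *rng, float gammaD, float dt,
                                     float noiseAmp, int nParticles)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nParticles) return;

    curandState local = rng[k];
    float3 xi = make_float3(curand_normal(&local),    // three independent N(0,1) variates
                            curand_normal(&local),
                            curand_normal(&local));
    rng[k] = local;

    float3 v = vel[k];
    pos[k].x += v.x * dt;                             // rho_{n+1} = rho_n + v_n * dt
    pos[k].y += v.y * dt;
    pos[k].z += v.z * dt;

    v.x += (force[k].x - gammaD * v.x) * dt + noiseAmp * xi.x;   // damping plus thermal noise
    v.y += (force[k].y - gammaD * v.y) * dt + noiseAmp * xi.y;
    v.z += (force[k].z - gammaD * v.z) * dt + noiseAmp * xi.z;
    vel[k] = v;
}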
We measured the computation time of one integration step for both algorithms as a function of N. The results are presented in Table 1. The benefits of GPU computing increase with the number of particles. For an ensemble of N = 10⁶ particles the speed-up gained from the use of the Barnes-Hut algorithm is almost 300 compared to the performance of the optimized All-Pairs algorithm on the same GPU. However, for N = 10³ the All-Pairs algorithm performs better: for such a small number of particles the computational expenses of the tree-building phase, sorting, etc. outweigh the speed-up effect of the approximation. Here we remind the reader that N = 10³ has been the typical scale of most ferrofluid simulations to date [17, 18].
Fig. 2 shows instantaneous configurations obtained during the simulations for a cubic confinement and for a mono-layer. To simulate a mono-layer of particles we use a rectangular parallelepiped of height 2.1R as the confinement. The parameters of the simulations correspond to the regime where the average dipole energy is much larger than the energy of thermal fluctuations. The formation of chain-like large-scale clusters [36] is clearly visible.
5. A benchmark test: average magnetization curves

In order to check the accuracy of the numerical schemes we calculated the reduced magnetization curve [36]. The reduced magnetization vector is given by the sum ⟨u⟩ = (1/N) Σ_{i=1}^{N} u_i. The main parameters that characterize the magnetic properties of a ferrofluid are the dipole coupling constant λ, which is the ratio of the dipole-dipole potential and the thermal energy, namely [37],

\lambda = \frac{\mu_0 m^2}{16 \pi R^3 k_B T},    (15)
and the volume fraction, which is the ratio between the volume occupied by the particles and the total volume occupied by the ferrofluid, V_f, i.e.,

\phi = \frac{(4/3) \pi R^3 N}{V_f} \cdot 100\%.    (16)

In the limit λ < 1 and for a small volume fraction, φ ≪ 100%, the projection of the reduced magnetization vector onto the direction of the applied field, ⟨u_H⟩, can be well approximated by the Langevin function [36]:

\langle u_H \rangle = L(\alpha) = \coth(\alpha) - \frac{1}{\alpha},    (17)

where α denotes the ratio between the magnetic energy and the thermal energy,

\alpha = \frac{m \mu_0 H}{k_B T}.    (18)
Figure 2: Snapshots of N-particle ensembles obtained with the Barnes-Hut algorithm: (a) N = 20000 (cubic confinement with the edge length L = 150R) and (b) a mono-layer of N = 300 particles. The parameters are: Γ_r = Γ_d = 0.1, T = 300 K, µ = 3.1·10⁵ A/m, D = 5000 kg/m³, R = 10 nm.
We simulated a system with parameters corresponding to maghemite (γ-Fe₂O₃), with a saturation magnetization of µ = 3.1·10⁵ A/m, density D = 5000 kg/m³, and carrier viscosity η = 0.89·10⁻³ Pa·s (the latter corresponds to the viscosity of water at T = 298 K). The volume fraction is set to φ = 1% and the particle radius to R = 3 nm. The external magnetic field H was applied along the z-axis. We initiated the system at time τ = 0 by randomly distributing the particles in a cubic container. The orientation angles of the particle magnetization vectors were obtained by drawing random values from the interval [0, 2π].
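For orientation, λ and α can be evaluated directly from Eqs. (15) and (18) with the parameters just listed. The short host-side snippet below is our illustration only (the swept field values are arbitrary and chosen merely to cover the α range of Fig. 3); for these parameters λ comes out well below unity, so the comparison with the Langevin function is legitimate.

#include <cmath>
#include <cstdio>

// Host-side evaluation of the dipole coupling constant (15), the Langevin parameter (18)
// and the Langevin function (17) for the maghemite parameters of this section.
int main()
{
    const double PI  = 3.14159265358979323846;
    const double kB  = 1.380649e-23;                  // Boltzmann constant, J/K
    const double mu0 = 4.0 * PI * 1e-7;               // magnetic constant, H/m
    const double R   = 3e-9;                          // particle radius, m
    const double mu  = 3.1e5;                         // specific magnetization, A/m
    const double T   = 298.0;                         // temperature, K

    const double V = 4.0 / 3.0 * PI * R * R * R;      // particle volume, m^3
    const double m = V * mu;                          // magnetic moment, A m^2

    const double lambda = mu0 * m * m / (16.0 * PI * R * R * R * kB * T);   // Eq. (15)
    std::printf("lambda = %.3f\n", lambda);           // below 1 for these parameters

    for (double H = 1e5; H <= 8e5; H += 1e5) {        // external field sweep, A/m
        double alpha = m * mu0 * H / (kB * T);        // Eq. (18)
        double L     = 1.0 / std::tanh(alpha) - 1.0 / alpha;    // Eq. (17)
        std::printf("H = %8.0f A/m  alpha = %6.3f  L(alpha) = %.3f\n", H, alpha, L);
    }
    return 0;
}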
Fig. 3 presents the results of the simulations. After a transient of τ_eq = 1000, given to the system of N = 10⁵ particles to equilibrate, the mean reduced magnetization has been calculated by averaging ⟨u_z⟩ over the time interval τ_calc = 1000. It is noteworthy that even single-run results are very close to the Langevin function, see Fig. 3(a). The contribution of the magnetostatic energy grows with α, so that the strength of the dipole-dipole interaction is also increasing, see Fig. 4. Since the average value of the dipole-field projection on the z axis is positive and increases with α, the dipole field amplifies the external magnetic field. This explains the discrepancy between the analytical and numerical results obtained for large values of α. For an ensemble of N = 10³ particles the results obtained with the two algorithms are nearly identical, Fig. 3(b).
Figure 3: Checking the Barnes-Hut approximation: (a) comparison between the single-run results obtained with N = 10⁵ particles and the Langevin function, Eq. (17); (b) comparison of the results obtained with the All-Pairs and the Barnes-Hut algorithms for an ensemble of N = 10³ particles.
It is also important to compare the average dipole fields, ⟨h^dip⟩, calculated with the Barnes-Hut and the All-Pairs algorithms. The outputs obtained for the above-given set of parameters are shown in Fig. 4. Again, the two algorithms produce almost identical results.
Figure 4: z-component of the reduced average dipole field as a function of α [18]. The parameters are the same as in Fig. 3(b). Each point was obtained by averaging over 10³ independent realizations.
6. Conclusions

With this work we demonstrate that the Barnes-Hut algorithm can be efficiently implemented for large-scale, GPU-based ferrofluid simulations. Overall, we achieved a speed-up of more than two orders of magnitude compared to the performance of a common GPU-oriented All-Pairs algorithm. The proposed approach allows the size of the ensembles to be increased by two orders of magnitude compared to the present-day scale of simulations [19]. The Barnes-Hut algorithm correctly accounts for the dipole-dipole interaction within an ensemble of N = 10⁶ particles and produces results that fit the theoretical predictions with high accuracy. Our findings open several interesting perspectives. First, they bring about the possibility to perform large-scale molecular-dynamics simulations of the time evolution of ferrofluids placed in confinements of complex shapes, like thin vessels or tangled pipes, where the boundary effects play an important role [38]. It is also possible to explore the non-equilibrium dynamics of ferrofluids, for example their response to different types of externally applied magnetic fields, such as periodically alternating fields [39] or gradient fields [40]. Another direction for further studies is the exploration of the relationship between the shape and topology of nano-clusters and different macroscopic properties of ferrofluids [41, 42]. Finally, GPU-based computational algorithms provide new possibilities to study heat-transport processes in ferrofluids [43], for example to investigate the performance of ferromagnetic particles as heat sources for magnetic fluid hyperthermia [44].
7. Acknowledgment

A.Yu.P. and T.V.L. acknowledge the support of the Cabinet of Ministers of Ukraine obtained within the Program of Studying and Training for Students, PhD Students and Professors' Staff Abroad, and the support of the Ministry of Education, Science, Youth, and Sport of Ukraine (Project No. 0112U001383). S.D. and P.H. acknowledge the support by the cluster of excellence Nanosystems Initiative Munich (NIM).
References
[1] R. Rosensweig, Ferrohydrodynamics, Cambridge University Press, 1985.
[2] Q. Pankhurst, J. Connolly, S. Jones, J. Dobson, Applications of magnetic nanoparticles in biomedicine, J. Phys. D: Appl. Phys. 36 (2003) R167.
[3] K. Raj, R. Moskowitz, Commercial applications of ferrofluids, J. Magn. Magn. Mater. 85 (1990) 233.
[4] J. P. McTague, Magnetoviscosity of magnetic colloids, J. Chem. Phys. 51 (1969) 133.
[5] S. Odenbach, Magnetoviscous Effects in Ferrofluids, Springer, 2002.
[6] S. Odenbach, H. Stoerk, Shear dependence of field-induced contributions to the viscosity of magnetic fluids at low shear rates, J. Magn. Magn. Mater. 183 (2000) 188.
[7] P. Ilg, E. Coquelle, S. Hess, Structure and rheology of ferrofluids: simulation results and kinetic models, J. Phys. Condens. Mat. 18 (2006) S2757.
[8] S. de Leeuw, J. Perram, E. Smith, Simulation of Electrostatic Systems in Periodic Boundary Conditions. I. Lattice Sums and Dielectric Constants, in: Proc. R. Soc. Lond. A Math. Phys. Sci., volume 373, 1980, p. 27.
[9] J. J. Cerdà, V. Ballenegger, O. Lenz, C. Holm, P3M algorithm for dipolar interactions, J. Chem. Phys. 129 (2008) 234104.
[10] H. Hartshorne, C. J. Backhouse, W. E. Lee, Ferrofluid-based microchip pump and valve, Sensor. Actuat. B Chem. 99 (2004) 592.
[11] N. Pamme, Magnetism and microfluidics, Lab Chip 6 (2006) 24.
[12] J. Phillips, R. Braun, W. Wang, et al., Scalable Molecular Dynamics with NAMD, J. Comput. Chem. 26 (2005) 1781.
[13] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, T. J. Purcell, A Survey of General-Purpose Computation on Graphics Hardware, Comput. Graph. Forum 26 (2007) 80.
[14] L. Nyland, M. Harris, J. Prins, Fast N-body simulation with CUDA, in: GPU Gems 3, edited by H. Nguyen, chap. 31 (2007).
[15] J. Sanders, E. Kandrot, CUDA by Example, Addison-Wesley, 2011.
[16] CUDA Fortran Programming Guide and Reference, 2012. URL: http://www.pgroup.com/lit/whitepapers/pgicudaforug.pdf.
[17] Z. Wang, C. Holm, H.-W. Müller, Molecular dynamics study on the equilibrium magnetization properties and structure of ferrofluids, Phys. Rev. E 66 (2002) 021405.
[18] A. O. Ivanov, S. S. Kantorovich, E. N. Reznikov, et al., Magnetic properties of polydisperse ferrofluids: A critical comparison between experiment, theory, and computer simulation, Phys. Rev. E 75 (2007) 061405.
[19] J. J. Cerda, E. Efimova, V. Ballenegger, E. Krutikova, A. Ivanov, C. Holm, Behavior of bulky ferrofluids in the diluted low-coupling regime: Theory and simulation, Phys. Rev. E 81 (2010) 011501.
[20] V. Springel, S. D. M. White, A. Jenkins, C. S. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, J. A. Peacock, S. Cole, P. Thomas, H. Couchman, A. Evrard, J. Colberg, F. Pearce, Simulations of the formation, evolution and clustering of galaxies and quasars, Nature 435 (2005) 629.
[21] S. F. P. Zwart, R. G. Belleman, P. M. Geldof, High-performance direct gravitational N-body simulations on graphics processing units, New Astron. 12 (2007) 641.
[22] R. G. Belleman, J. Bédorf, S. F. P. Zwart, High performance direct gravitational N-body simulations on graphics processing units II: An implementation in CUDA, New Astron. 13 (2008) 103.
[23] D. Aubert, Numerical Cosmology powered by GPUs, in: Proceedings of the International Astronomical Union, volume 6, 2010, p. 397.
[24] M. Burtscher, K. Pingali, An Efficient CUDA Implementation of the Tree-based Barnes Hut N-body Algorithm, in: GPU Computing Gems Emerald Edition, 2011.
[25] C. Holm, Efficient Methods for Long Range Interactions in Periodic Geometries Plus One Application, in: Computational Soft Matter: From Synthetic Polymers to Proteins, Research Centre Jülich, 2004.
[26] J. Weis, D. Levesque, Chain formation in low density dipolar hard spheres: A Monte Carlo study, Phys. Rev. Lett. 71 (1993) 2729–2732.
[27] D. Wei, G. Patey, Orientational order in simple dipolar liquids: Computer simulation of a ferroelectric nematic phase, Phys. Rev. Lett. 68 (1992) 2043–2045.
[28] G. Mériguet, M. Jardat, P. Turq, Brownian dynamics investigation of magnetization and birefringence relaxations in ferrofluids, J. Chem. Phys. 123 (2005) 144915.
[29] M. Januszewski, M. Kostur, Accelerating numerical solution of stochastic differential equations with CUDA, Comput. Phys. Commun. 181 (2010) 183.
[30] M. Weigel, Simulating spin models on GPU, Comput. Phys. Commun. 182 (2011) 1833.