Table Of ContentFast and Accurate Finite-Element Multigrid Solvers
for PDE Simulations on GPU Clusters
Dissertation
zur Erlangung des Grades eines
Doktors der Naturwissenschaften
Der Fakultät für Mathematik der
Technischen Universität Dortmund
vorgelegt von
Dominik Göddeke
FastandAccurateFiniteElementMultigridSolversforPDESimulationsonGPUClusters
DominikGöddeke
Dissertationeingereichtam: 25.2.2010
TagdermündlichenPrüfung: 10.5.2010
MitgliederderPrüfungskommission
Prof.Dr.StefanTurek(1. Gutachter,Betreuer)
Prof.Dr.HeinrichMüller(2. Gutachter)
Prof.Dr.JoachimStöckler
Prof.Dr.RudolfScharlau
Dr.MatthiasMöller
[Homer]D’oh,Ineedtodedicatethistosomeone.[/Homer]
v
Acknowledgements
Itis impossibletoacknowledge everyonewhodirectly orindirectlycontributed tothe realisation
of this thesis. I apologise to all those not mentioned by name: Your help and support is really
appreciated.
Inparticular,Iwouldliketoexpressmysinceregratitudetothefollowingpeople: Professors
StefanTurekandHeinrichMüllerguidedmyscientificcareerfromthefirstcontacttoNumerical
MathematicsandComputerGraphicsasanundergraduatestudenttothestronglyinterdisciplinary
field of scientific computing in which I finally settled with this thesis. I am deeply indebted to
themforgivingmetheopportunitytoworkinthefriendlyatmosphereattheInstituteofApplied
MathematicsandtheChairofComputerGraphicsatTUDortmund. Iwanttothankthemfortheir
ideas, fruitful discussions and encouraging criticisms, their belief in my work when I got lost,
and last but not least for taking the risk and having the vision to support this endeavour into an
emergingresearchareathatatthebeginningnoonereallyexpectedtohavesuchabigimpact.
I am very grateful to my fellow PhD students and friends in the FEAST group, Christian
Becker, Sven Buijssen, Matthias Grajewski and Hilmar Wobker. Developing such a large-scale
package like FEAST is always a group effort, and I believe we did extremely well sharing the
tedious task of implementing features for the common good even if it meant devoting the ‘occa-
sional’ week for non-thesis work, while still pursuing our separate research projects. Knowing
that I could always rely on our group for discussing design decisions, toying with ideas, going
through painful debugging sessions, solving hardware, cluster and scheduling issues and related
nightmares in parallel computing, and last but not least, sharing a beer if all else failed cannot
beappreciatedenough. Theconferenceproceedingsandjournalpaperswewrotetogetherclearly
do not tell the entire story. While not being technically members of the FEAST group, Matthias
Möller,ThomasRohkämperandMichaelKösterareacknowledgedforthesamereasons.
IamverygratefultoRobertStrzodkaforourcollaborationduringthepastyears. Whatstarted
out from his talk at my ‘inauguration workshop’ in early 2005 soon became a common vision
that we pursued relentlessly. Without Robert, I would never have met Patrick McCormick and
Jamaludin Mohd-Yusof from Los Alamos, and the research the four of us pursued together (and
thepaperswewrote) wouldnothavebeenpossible. Infact, Robertdeservesthesamecreditsfor
shapingmeintoascientistasStefanTurekandHeinrichMüllerdo.
Matthias Möller, Hilmar Wobker, Robert Strzodka and Jamaludin Mohd-Yusof deserve spe-
cial credits for proofreading sections of this thesis and discussing questions that arose. Their
constructive criticism helped a lot to improve clarity of this thesis and to make the presentation
moreconcise.
In my time at TU Dortmund, I helped supervising quite a few Diploma theses and student
projectgroups. Discussingideaswith‘my’studentscannotbeoverestimated,andIbelievethishas
alwaysbeenfruitfulforbothsides. AmongtheworksIsupervised,IwanttoacknowledgeHendrik
Becker, Markus Geveler, Dirk Ribbrock and Mirko Sykorra. One project group that I launched
was so successful that we published the results in Computer Physics Communications, and I am
vi
indebted to my co-authors Danny van Dyk, Markus Geveler, Sven Mallach, Dirk Ribbrock and
CarstenGutwengerfortheopportunitytoworktogetherproductivelyformorethantwoyears.
NVIDIA and AMD/ATI supported my research through generous hardware donations, and I
owe gratitude to Mark Harris, Mike Houston, David Luebke, Simon Green, Christian Sigg, Do-
minik Behr and Sumit Gupta for fruitful discussions and untiring assistance. The same is true
for collaborators and colleagues I discussed many scientific and non-scientific ideas with at con-
ferences or otherwise, in particular John Owens, Aaron Lefohn, Dimitri Komatitsch, Gordon Er-
lebacher,DavidMichéa,CliffWoolley,DanielNogradi,JonathanCohen,YaoZhang,JensKrüger
andRüdigerWestermann; andvirtuallyeveryoneIcollaboratedandexchangedideaswithinone
formoranother.
Finally, I am very grateful to my parents Michaela and Nikolaus, my sisters Maria and Anne
andmybrotherNiklasforencouragementandsupport. MygodchildLindaLübbers,andHannah,
JannisandPiaImöhlmadesuretokeepmesaneandfocusedthroughalltheseyears.
Support of this work by the DFG (Deutsche Forschungsgemeinschaft) through the grants
TU102/22-1andTU102/22-2isgratefullyacknowledged.
Dortmund,February25,2010
DominikGöddeke
vii
Contents
1 Introduction 1
1.1 IntroductionandMotivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Hardware-OrientedNumerics . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 TheMemoryWallProblem . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3 MulticoreArchitectures . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.4 EmergingMany-CoreArchitectures . . . . . . . . . . . . . . . . . . . . 9
1.2 ThesisContributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 MixedPrecisionIterativeRefinement . . . . . . . . . . . . . . . . . . . 10
1.2.2 Co-ProcessorAccelerationofExistingLarge-ScaleCodes . . . . . . . . 11
1.2.3 GPU-BasedMultigridSolversonGeneralisedTensorProductDomains . 12
1.2.4 AShortHistoryofGPUComputing . . . . . . . . . . . . . . . . . . . . 13
1.3 ListofPublications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 ThesisOutline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 MathematicalBackground 19
2.1 FiniteElementDiscretisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 VariationalFormulationofthePoissonProblem . . . . . . . . . . . . . . 19
2.1.2 ChoiceofBasisFunctionsandSubdivisionoftheDomainintoElements 20
2.1.3 MeshRefinementandErrorAnalysis . . . . . . . . . . . . . . . . . . . 21
2.2 IterativeSolversforSparseLinearSystems . . . . . . . . . . . . . . . . . . . . 23
2.2.1 BasicStationaryDefectCorrectionMethods . . . . . . . . . . . . . . . 23
2.2.2 KrylovSubspaceMethods . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.3 GeometricMultigridSolvers . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 FEAST–HighPerformanceFiniteElementMultigridToolkit . . . . . . . . . . 29
2.3.1 SeparationofStructuredandUnstructuredData . . . . . . . . . . . . . . 29
2.3.2 ParallelMultilevelSolvers . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.3 Two-LayerScaRCSolvers . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.4 Vector-ValuedProblems . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.5 LinearisedElasticityinFEAST . . . . . . . . . . . . . . . . . . . . . . 41
2.3.6 StokesandNavier-StokesinFEAST . . . . . . . . . . . . . . . . . . . . 43
3 AShortHistoryofGPUComputing 47
3.1 LegacyGPGPU:GraphicsAPIs . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1.1 TheGraphicsPipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1.2 EvolutionTowardsProgrammability . . . . . . . . . . . . . . . . . . . . 50
3.1.3 DirectX9GPUs–TheFirstWaveofGPGPU2003–2006 . . . . . . . . 51
3.1.4 LegacyGPGPUProgrammingModel . . . . . . . . . . . . . . . . . . . 53
3.1.5 BibliographyofRelevantPublications . . . . . . . . . . . . . . . . . . . 56
viii Contents
3.1.6 RestrictionsofSinglePrecision . . . . . . . . . . . . . . . . . . . . . . 59
3.1.7 ExampleArchitecture: NVIDIAGeForce6 . . . . . . . . . . . . . . . . 59
3.2 NVIDIACUDAandAMDStream . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2.1 DirectX10andUnifiedArchitectures . . . . . . . . . . . . . . . . . . . 62
3.2.2 NVIDIACUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.3 AMDCTMandStream . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2.4 BibliographyofRelevantPublications . . . . . . . . . . . . . . . . . . . 71
3.2.5 DoublePrecisionFloatingPointSupport . . . . . . . . . . . . . . . . . 73
3.3 GPUsandParallelProgrammingModels . . . . . . . . . . . . . . . . . . . . . . 74
3.3.1 GPUsasParallelRandom-AccessMachines . . . . . . . . . . . . . . . . 74
3.3.2 GPUsandVectorSIMDArchitectures . . . . . . . . . . . . . . . . . . . 75
3.3.3 OtherProgrammingModels . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4 FutureAPIsandArchitectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4.1 OpenCLandDirectX11 . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4.2 IntelLarrabeeandFutureGraphicsArchitectures . . . . . . . . . . . . . 78
3.4.3 CPUandGPUConvergence . . . . . . . . . . . . . . . . . . . . . . . . 79
4 MixedandEmulatedPrecisionMethods 81
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1.2 BenefitsofEmulatedandMixedPrecisionSchemes . . . . . . . . . . . 84
4.2 HighPrecisionEmulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2.1 ExactandRedundantEmulation . . . . . . . . . . . . . . . . . . . . . . 86
4.2.2 Double-SingleOperations . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2.3 Normalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.4 ExtensionstoHigherPrecisionandApplicabilityonGPUs . . . . . . . . 88
4.3 MixedPrecisionIterativeRefinement . . . . . . . . . . . . . . . . . . . . . . . 90
4.3.1 BackgroundandRelatedWork . . . . . . . . . . . . . . . . . . . . . . . 90
4.3.2 MixedPrecisionIterativeRefinementforLarge,SparseSystems . . . . . 91
4.4 HardwareEfficiencyonFPGAs. . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4.1 FieldProgrammableGateArrays . . . . . . . . . . . . . . . . . . . . . 94
4.4.2 MixedPrecisionSolversonFPGAs . . . . . . . . . . . . . . . . . . . . 94
4.5 PrecisionandAccuracyRequirementsofthePoissonProblem . . . . . . . . . . 95
4.5.1 InfluenceoftheConditionNumber . . . . . . . . . . . . . . . . . . . . 96
4.5.2 InfluenceoftheApproximationQuality . . . . . . . . . . . . . . . . . . 96
4.5.3 InfluenceontheConvergenceBehaviourofIterativeSolvers . . . . . . . 98
4.5.4 SummaryandConclusions . . . . . . . . . . . . . . . . . . . . . . . . . 104
5 EfficientLocalSolversonGraphicsHardware 107
5.1 IntroductionandOverview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1.1 ChapterOutline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1.2 ABriefOverviewoftheCodeBase . . . . . . . . . . . . . . . . . . . . 110
5.2 LegacyGPGPU:OpenGLImplementation . . . . . . . . . . . . . . . . . . . . . 112
5.2.1 ImplementationalSpecialisations . . . . . . . . . . . . . . . . . . . . . 112
5.2.2 ExampleOperations: ParallelProgrammingPrimitives . . . . . . . . . . 112
5.2.3 ImplementationofNumericalLinearAlgebraOperations . . . . . . . . . 114
5.2.4 PerformanceMicrobenchmarks . . . . . . . . . . . . . . . . . . . . . . 117
5.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3 ModernGPGPU:CUDAImplementation . . . . . . . . . . . . . . . . . . . . . 125
5.3.1 CUDAProgrammingTechniques . . . . . . . . . . . . . . . . . . . . . 125
ix
5.3.2 ImplementationofNumericalLinearAlgebraOperations . . . . . . . . . 129
5.3.3 PerformanceMicrobenchmarks . . . . . . . . . . . . . . . . . . . . . . 133
5.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.4 ParallelisationandImplementationofPreconditioners . . . . . . . . . . . . . . . 142
5.4.1 IntroductionandPreliminaries . . . . . . . . . . . . . . . . . . . . . . . 142
5.4.2 JacobiPreconditioner . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.4.3 PreconditionersofGauß-Seidel-Type . . . . . . . . . . . . . . . . . . . 143
5.4.4 TridiagonalPreconditioner . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.4.5 CombinationofGauß-SeidelandTridiagonalPreconditioners . . . . . . 153
5.4.6 AlternatingDirectionImplicitPreconditioners . . . . . . . . . . . . . . 154
5.4.7 PerformanceMicrobenchmarks . . . . . . . . . . . . . . . . . . . . . . 155
5.4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.5 NumericalEvaluationofParallelisedPreconditionersinIterativeSolvers . . . . . 158
5.5.1 OverviewandTestProcedure . . . . . . . . . . . . . . . . . . . . . . . 158
5.5.2 PureNumericalEfficiencyofMultigridSolvers . . . . . . . . . . . . . . 160
5.5.3 TotalRuntimeEfficiencyofMultigridSolvers . . . . . . . . . . . . . . . 164
5.5.4 KrylovSubspaceSolvers . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.6 AccuracyofEmulatedandMixedPrecisionSolvers . . . . . . . . . . . . . . . . 171
5.6.1 OverviewandTestProcedure . . . . . . . . . . . . . . . . . . . . . . . 171
5.6.2 AccuracyofMixedPrecisionIterativeRefinementSolvers . . . . . . . . 174
5.6.3 AccuracyoftheEmulatedDouble-SingleFormat . . . . . . . . . . . . . 174
5.6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.7 NumericalEvaluationofMixedPrecisionIterativeSolvers . . . . . . . . . . . . 177
5.7.1 OverviewandTestProcedure . . . . . . . . . . . . . . . . . . . . . . . 177
5.7.2 EfficiencyontheCPUandInfluenceoftheStoppingCriterion . . . . . . 179
5.7.3 EfficiencyontheGPU . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.7.4 SpeedupofGPU-basedSolvers . . . . . . . . . . . . . . . . . . . . . . 186
5.7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
5.8 InfluenceoftheMultigridCycleTypeandtheCoarseGridSolver . . . . . . . . 191
5.8.1 OverviewandTestProcedure . . . . . . . . . . . . . . . . . . . . . . . 191
5.8.2 BaselineExperiment–ProblemStatement. . . . . . . . . . . . . . . . . 191
5.8.3 TruncationofMultigridCycles. . . . . . . . . . . . . . . . . . . . . . . 194
5.8.4 BetterPreconditioningoftheCoarseGridSolver . . . . . . . . . . . . . 196
5.8.5 V CycleswithVariableSmoothing: AnAlternativetoF andW Cycles . . 197
5.8.6 DirectCoarseGridSolversontheCPU . . . . . . . . . . . . . . . . . . 198
5.8.7 ApproximateSolutionofCoarseGridProblems . . . . . . . . . . . . . . 199
5.8.8 ImplicationsforMixedPrecisionIterativeRefinementMultigrid . . . . . 202
5.8.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.9 GPUPerformanceGuidelines. . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.9.1 ControlFlowBetweenCPUandGPU . . . . . . . . . . . . . . . . . . . 209
5.9.2 ThePCIeBottleneck . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.9.3 GPUvs. CPUPerformanceCharacteristics . . . . . . . . . . . . . . . . 210
5.9.4 Locality,VectorisationandMemoryAccessPatterns . . . . . . . . . . . 211
5.9.5 ResourceAllocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
x Contents
6 MinimallyInvasiveCo-ProcessorIntegrationintoFEAST 213
6.1 IntroductionandMotivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.1.2 ChapterOverview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.2 MinimallyInvasiveCo-ProcessorIntegrationintoFEAST . . . . . . . . . . . . . 217
6.2.1 ScaRCSolverAnalysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.2.2 MinimallyInvasiveIntegration . . . . . . . . . . . . . . . . . . . . . . . 218
6.2.3 GPUIntegrationChallengesandImplementationalDetails . . . . . . . . 222
6.2.4 HybridSolversforHeterogeneousClusters . . . . . . . . . . . . . . . . 224
6.3 NumericalBuildingBlock: PoissonProblem . . . . . . . . . . . . . . . . . . . 225
6.3.1 IntroductionandTestProcedure . . . . . . . . . . . . . . . . . . . . . . 225
6.3.2 AccuracyandEfficiencyofMixedPrecisionTwo-LayerScaRCSolvers . 227
6.3.3 SpeedupandSpeedupAnalysis . . . . . . . . . . . . . . . . . . . . . . 230
6.3.4 WeakScalabilityonaLarge-ScaleCluster . . . . . . . . . . . . . . . . . 234
6.3.5 HybridCPU-GPUSolvers . . . . . . . . . . . . . . . . . . . . . . . . . 238
6.4 ApplicationAcceleration: LinearisedElasticity . . . . . . . . . . . . . . . . . . 241
6.4.1 Introduction,TestConfigurationsandSolutionStrategy . . . . . . . . . . 241
6.4.2 AccuracyStudies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
6.4.3 WeakScalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
6.4.4 PerformanceandSpeedup . . . . . . . . . . . . . . . . . . . . . . . . . 248
6.5 ApplicationAcceleration: StationaryLaminarFlow . . . . . . . . . . . . . . . . 251
6.5.1 Introduction,TestConfigurationsandSolutionStrategy . . . . . . . . . . 251
6.5.2 AnalysisoftheLinearSolver . . . . . . . . . . . . . . . . . . . . . . . 253
6.5.3 AccuracyoftheMixedPrecisionStationaryLaminarFlowSolver . . . . 255
6.5.4 PerformanceAnalysisoftheStationaryLaminarFlowSolver . . . . . . 256
6.6 Summary,CriticalDiscussion,ConclusionsandFutureWork . . . . . . . . . . . 261
6.6.1 SummaryoftheMinimallyInvasiveIntegrationApproach . . . . . . . . 261
6.6.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
6.6.3 WeakScalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
6.6.4 PerformanceofMixedPrecisionandAcceleratedTwo-LayerScaRCSolvers263
6.6.5 CriticalDiscussioninTermsofFutureScalability . . . . . . . . . . . . . 266
A Appendix 269
A.1 ListofSymbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Bibliography 271
Description:that I could always rely on our group for discussing design decisions, toying with [224] for more details on different numbering schemes, and to the textbook DDR2-667 memory, a machine that is used frequently throughout this thesis.