ebook img

Fast and Accurate Finite-Element Multigrid Solvers for PDE PDF

299 Pages·2010·4.58 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Fast and Accurate Finite-Element Multigrid Solvers for PDE

Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters Dissertation zur Erlangung des Grades eines Doktors der Naturwissenschaften Der Fakultät für Mathematik der Technischen Universität Dortmund vorgelegt von Dominik Göddeke FastandAccurateFiniteElementMultigridSolversforPDESimulationsonGPUClusters DominikGöddeke Dissertationeingereichtam: 25.2.2010 TagdermündlichenPrüfung: 10.5.2010 MitgliederderPrüfungskommission Prof.Dr.StefanTurek(1. Gutachter,Betreuer) Prof.Dr.HeinrichMüller(2. Gutachter) Prof.Dr.JoachimStöckler Prof.Dr.RudolfScharlau Dr.MatthiasMöller [Homer]D’oh,Ineedtodedicatethistosomeone.[/Homer] v Acknowledgements Itis impossibletoacknowledge everyonewhodirectly orindirectlycontributed tothe realisation of this thesis. I apologise to all those not mentioned by name: Your help and support is really appreciated. Inparticular,Iwouldliketoexpressmysinceregratitudetothefollowingpeople: Professors StefanTurekandHeinrichMüllerguidedmyscientificcareerfromthefirstcontacttoNumerical MathematicsandComputerGraphicsasanundergraduatestudenttothestronglyinterdisciplinary field of scientific computing in which I finally settled with this thesis. I am deeply indebted to themforgivingmetheopportunitytoworkinthefriendlyatmosphereattheInstituteofApplied MathematicsandtheChairofComputerGraphicsatTUDortmund. Iwanttothankthemfortheir ideas, fruitful discussions and encouraging criticisms, their belief in my work when I got lost, and last but not least for taking the risk and having the vision to support this endeavour into an emergingresearchareathatatthebeginningnoonereallyexpectedtohavesuchabigimpact. I am very grateful to my fellow PhD students and friends in the FEAST group, Christian Becker, Sven Buijssen, Matthias Grajewski and Hilmar Wobker. Developing such a large-scale package like FEAST is always a group effort, and I believe we did extremely well sharing the tedious task of implementing features for the common good even if it meant devoting the ‘occa- sional’ week for non-thesis work, while still pursuing our separate research projects. Knowing that I could always rely on our group for discussing design decisions, toying with ideas, going through painful debugging sessions, solving hardware, cluster and scheduling issues and related nightmares in parallel computing, and last but not least, sharing a beer if all else failed cannot beappreciatedenough. Theconferenceproceedingsandjournalpaperswewrotetogetherclearly do not tell the entire story. While not being technically members of the FEAST group, Matthias Möller,ThomasRohkämperandMichaelKösterareacknowledgedforthesamereasons. IamverygratefultoRobertStrzodkaforourcollaborationduringthepastyears. Whatstarted out from his talk at my ‘inauguration workshop’ in early 2005 soon became a common vision that we pursued relentlessly. Without Robert, I would never have met Patrick McCormick and Jamaludin Mohd-Yusof from Los Alamos, and the research the four of us pursued together (and thepaperswewrote) wouldnothavebeenpossible. Infact, Robertdeservesthesamecreditsfor shapingmeintoascientistasStefanTurekandHeinrichMüllerdo. Matthias Möller, Hilmar Wobker, Robert Strzodka and Jamaludin Mohd-Yusof deserve spe- cial credits for proofreading sections of this thesis and discussing questions that arose. Their constructive criticism helped a lot to improve clarity of this thesis and to make the presentation moreconcise. In my time at TU Dortmund, I helped supervising quite a few Diploma theses and student projectgroups. Discussingideaswith‘my’studentscannotbeoverestimated,andIbelievethishas alwaysbeenfruitfulforbothsides. AmongtheworksIsupervised,IwanttoacknowledgeHendrik Becker, Markus Geveler, Dirk Ribbrock and Mirko Sykorra. One project group that I launched was so successful that we published the results in Computer Physics Communications, and I am vi indebted to my co-authors Danny van Dyk, Markus Geveler, Sven Mallach, Dirk Ribbrock and CarstenGutwengerfortheopportunitytoworktogetherproductivelyformorethantwoyears. NVIDIA and AMD/ATI supported my research through generous hardware donations, and I owe gratitude to Mark Harris, Mike Houston, David Luebke, Simon Green, Christian Sigg, Do- minik Behr and Sumit Gupta for fruitful discussions and untiring assistance. The same is true for collaborators and colleagues I discussed many scientific and non-scientific ideas with at con- ferences or otherwise, in particular John Owens, Aaron Lefohn, Dimitri Komatitsch, Gordon Er- lebacher,DavidMichéa,CliffWoolley,DanielNogradi,JonathanCohen,YaoZhang,JensKrüger andRüdigerWestermann; andvirtuallyeveryoneIcollaboratedandexchangedideaswithinone formoranother. Finally, I am very grateful to my parents Michaela and Nikolaus, my sisters Maria and Anne andmybrotherNiklasforencouragementandsupport. MygodchildLindaLübbers,andHannah, JannisandPiaImöhlmadesuretokeepmesaneandfocusedthroughalltheseyears. Support of this work by the DFG (Deutsche Forschungsgemeinschaft) through the grants TU102/22-1andTU102/22-2isgratefullyacknowledged. Dortmund,February25,2010 DominikGöddeke vii Contents 1 Introduction 1 1.1 IntroductionandMotivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Hardware-OrientedNumerics . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.2 TheMemoryWallProblem . . . . . . . . . . . . . . . . . . . . . . . . 6 1.1.3 MulticoreArchitectures . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.1.4 EmergingMany-CoreArchitectures . . . . . . . . . . . . . . . . . . . . 9 1.2 ThesisContributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.2.1 MixedPrecisionIterativeRefinement . . . . . . . . . . . . . . . . . . . 10 1.2.2 Co-ProcessorAccelerationofExistingLarge-ScaleCodes . . . . . . . . 11 1.2.3 GPU-BasedMultigridSolversonGeneralisedTensorProductDomains . 12 1.2.4 AShortHistoryofGPUComputing . . . . . . . . . . . . . . . . . . . . 13 1.3 ListofPublications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.4 ThesisOutline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2 MathematicalBackground 19 2.1 FiniteElementDiscretisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.1.1 VariationalFormulationofthePoissonProblem . . . . . . . . . . . . . . 19 2.1.2 ChoiceofBasisFunctionsandSubdivisionoftheDomainintoElements 20 2.1.3 MeshRefinementandErrorAnalysis . . . . . . . . . . . . . . . . . . . 21 2.2 IterativeSolversforSparseLinearSystems . . . . . . . . . . . . . . . . . . . . 23 2.2.1 BasicStationaryDefectCorrectionMethods . . . . . . . . . . . . . . . 23 2.2.2 KrylovSubspaceMethods . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.3 GeometricMultigridSolvers . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3 FEAST–HighPerformanceFiniteElementMultigridToolkit . . . . . . . . . . 29 2.3.1 SeparationofStructuredandUnstructuredData . . . . . . . . . . . . . . 29 2.3.2 ParallelMultilevelSolvers . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3.3 Two-LayerScaRCSolvers . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.3.4 Vector-ValuedProblems . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.3.5 LinearisedElasticityinFEAST . . . . . . . . . . . . . . . . . . . . . . 41 2.3.6 StokesandNavier-StokesinFEAST . . . . . . . . . . . . . . . . . . . . 43 3 AShortHistoryofGPUComputing 47 3.1 LegacyGPGPU:GraphicsAPIs . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.1.1 TheGraphicsPipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.1.2 EvolutionTowardsProgrammability . . . . . . . . . . . . . . . . . . . . 50 3.1.3 DirectX9GPUs–TheFirstWaveofGPGPU2003–2006 . . . . . . . . 51 3.1.4 LegacyGPGPUProgrammingModel . . . . . . . . . . . . . . . . . . . 53 3.1.5 BibliographyofRelevantPublications . . . . . . . . . . . . . . . . . . . 56 viii Contents 3.1.6 RestrictionsofSinglePrecision . . . . . . . . . . . . . . . . . . . . . . 59 3.1.7 ExampleArchitecture: NVIDIAGeForce6 . . . . . . . . . . . . . . . . 59 3.2 NVIDIACUDAandAMDStream . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.2.1 DirectX10andUnifiedArchitectures . . . . . . . . . . . . . . . . . . . 62 3.2.2 NVIDIACUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.2.3 AMDCTMandStream . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.2.4 BibliographyofRelevantPublications . . . . . . . . . . . . . . . . . . . 71 3.2.5 DoublePrecisionFloatingPointSupport . . . . . . . . . . . . . . . . . 73 3.3 GPUsandParallelProgrammingModels . . . . . . . . . . . . . . . . . . . . . . 74 3.3.1 GPUsasParallelRandom-AccessMachines . . . . . . . . . . . . . . . . 74 3.3.2 GPUsandVectorSIMDArchitectures . . . . . . . . . . . . . . . . . . . 75 3.3.3 OtherProgrammingModels . . . . . . . . . . . . . . . . . . . . . . . . 76 3.4 FutureAPIsandArchitectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.4.1 OpenCLandDirectX11 . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.4.2 IntelLarrabeeandFutureGraphicsArchitectures . . . . . . . . . . . . . 78 3.4.3 CPUandGPUConvergence . . . . . . . . . . . . . . . . . . . . . . . . 79 4 MixedandEmulatedPrecisionMethods 81 4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.1.2 BenefitsofEmulatedandMixedPrecisionSchemes . . . . . . . . . . . 84 4.2 HighPrecisionEmulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.2.1 ExactandRedundantEmulation . . . . . . . . . . . . . . . . . . . . . . 86 4.2.2 Double-SingleOperations . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.2.3 Normalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.2.4 ExtensionstoHigherPrecisionandApplicabilityonGPUs . . . . . . . . 88 4.3 MixedPrecisionIterativeRefinement . . . . . . . . . . . . . . . . . . . . . . . 90 4.3.1 BackgroundandRelatedWork . . . . . . . . . . . . . . . . . . . . . . . 90 4.3.2 MixedPrecisionIterativeRefinementforLarge,SparseSystems . . . . . 91 4.4 HardwareEfficiencyonFPGAs. . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.4.1 FieldProgrammableGateArrays . . . . . . . . . . . . . . . . . . . . . 94 4.4.2 MixedPrecisionSolversonFPGAs . . . . . . . . . . . . . . . . . . . . 94 4.5 PrecisionandAccuracyRequirementsofthePoissonProblem . . . . . . . . . . 95 4.5.1 InfluenceoftheConditionNumber . . . . . . . . . . . . . . . . . . . . 96 4.5.2 InfluenceoftheApproximationQuality . . . . . . . . . . . . . . . . . . 96 4.5.3 InfluenceontheConvergenceBehaviourofIterativeSolvers . . . . . . . 98 4.5.4 SummaryandConclusions . . . . . . . . . . . . . . . . . . . . . . . . . 104 5 EfficientLocalSolversonGraphicsHardware 107 5.1 IntroductionandOverview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.1.1 ChapterOutline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.1.2 ABriefOverviewoftheCodeBase . . . . . . . . . . . . . . . . . . . . 110 5.2 LegacyGPGPU:OpenGLImplementation . . . . . . . . . . . . . . . . . . . . . 112 5.2.1 ImplementationalSpecialisations . . . . . . . . . . . . . . . . . . . . . 112 5.2.2 ExampleOperations: ParallelProgrammingPrimitives . . . . . . . . . . 112 5.2.3 ImplementationofNumericalLinearAlgebraOperations . . . . . . . . . 114 5.2.4 PerformanceMicrobenchmarks . . . . . . . . . . . . . . . . . . . . . . 117 5.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 5.3 ModernGPGPU:CUDAImplementation . . . . . . . . . . . . . . . . . . . . . 125 5.3.1 CUDAProgrammingTechniques . . . . . . . . . . . . . . . . . . . . . 125 ix 5.3.2 ImplementationofNumericalLinearAlgebraOperations . . . . . . . . . 129 5.3.3 PerformanceMicrobenchmarks . . . . . . . . . . . . . . . . . . . . . . 133 5.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 5.4 ParallelisationandImplementationofPreconditioners . . . . . . . . . . . . . . . 142 5.4.1 IntroductionandPreliminaries . . . . . . . . . . . . . . . . . . . . . . . 142 5.4.2 JacobiPreconditioner . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 5.4.3 PreconditionersofGauß-Seidel-Type . . . . . . . . . . . . . . . . . . . 143 5.4.4 TridiagonalPreconditioner . . . . . . . . . . . . . . . . . . . . . . . . . 146 5.4.5 CombinationofGauß-SeidelandTridiagonalPreconditioners . . . . . . 153 5.4.6 AlternatingDirectionImplicitPreconditioners . . . . . . . . . . . . . . 154 5.4.7 PerformanceMicrobenchmarks . . . . . . . . . . . . . . . . . . . . . . 155 5.4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 5.5 NumericalEvaluationofParallelisedPreconditionersinIterativeSolvers . . . . . 158 5.5.1 OverviewandTestProcedure . . . . . . . . . . . . . . . . . . . . . . . 158 5.5.2 PureNumericalEfficiencyofMultigridSolvers . . . . . . . . . . . . . . 160 5.5.3 TotalRuntimeEfficiencyofMultigridSolvers . . . . . . . . . . . . . . . 164 5.5.4 KrylovSubspaceSolvers . . . . . . . . . . . . . . . . . . . . . . . . . . 168 5.5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 5.6 AccuracyofEmulatedandMixedPrecisionSolvers . . . . . . . . . . . . . . . . 171 5.6.1 OverviewandTestProcedure . . . . . . . . . . . . . . . . . . . . . . . 171 5.6.2 AccuracyofMixedPrecisionIterativeRefinementSolvers . . . . . . . . 174 5.6.3 AccuracyoftheEmulatedDouble-SingleFormat . . . . . . . . . . . . . 174 5.6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 5.7 NumericalEvaluationofMixedPrecisionIterativeSolvers . . . . . . . . . . . . 177 5.7.1 OverviewandTestProcedure . . . . . . . . . . . . . . . . . . . . . . . 177 5.7.2 EfficiencyontheCPUandInfluenceoftheStoppingCriterion . . . . . . 179 5.7.3 EfficiencyontheGPU . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 5.7.4 SpeedupofGPU-basedSolvers . . . . . . . . . . . . . . . . . . . . . . 186 5.7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 5.8 InfluenceoftheMultigridCycleTypeandtheCoarseGridSolver . . . . . . . . 191 5.8.1 OverviewandTestProcedure . . . . . . . . . . . . . . . . . . . . . . . 191 5.8.2 BaselineExperiment–ProblemStatement. . . . . . . . . . . . . . . . . 191 5.8.3 TruncationofMultigridCycles. . . . . . . . . . . . . . . . . . . . . . . 194 5.8.4 BetterPreconditioningoftheCoarseGridSolver . . . . . . . . . . . . . 196 5.8.5 V CycleswithVariableSmoothing: AnAlternativetoF andW Cycles . . 197 5.8.6 DirectCoarseGridSolversontheCPU . . . . . . . . . . . . . . . . . . 198 5.8.7 ApproximateSolutionofCoarseGridProblems . . . . . . . . . . . . . . 199 5.8.8 ImplicationsforMixedPrecisionIterativeRefinementMultigrid . . . . . 202 5.8.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 5.9 GPUPerformanceGuidelines. . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 5.9.1 ControlFlowBetweenCPUandGPU . . . . . . . . . . . . . . . . . . . 209 5.9.2 ThePCIeBottleneck . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 5.9.3 GPUvs. CPUPerformanceCharacteristics . . . . . . . . . . . . . . . . 210 5.9.4 Locality,VectorisationandMemoryAccessPatterns . . . . . . . . . . . 211 5.9.5 ResourceAllocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 x Contents 6 MinimallyInvasiveCo-ProcessorIntegrationintoFEAST 213 6.1 IntroductionandMotivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 6.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 6.1.2 ChapterOverview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 6.2 MinimallyInvasiveCo-ProcessorIntegrationintoFEAST . . . . . . . . . . . . . 217 6.2.1 ScaRCSolverAnalysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 6.2.2 MinimallyInvasiveIntegration . . . . . . . . . . . . . . . . . . . . . . . 218 6.2.3 GPUIntegrationChallengesandImplementationalDetails . . . . . . . . 222 6.2.4 HybridSolversforHeterogeneousClusters . . . . . . . . . . . . . . . . 224 6.3 NumericalBuildingBlock: PoissonProblem . . . . . . . . . . . . . . . . . . . 225 6.3.1 IntroductionandTestProcedure . . . . . . . . . . . . . . . . . . . . . . 225 6.3.2 AccuracyandEfficiencyofMixedPrecisionTwo-LayerScaRCSolvers . 227 6.3.3 SpeedupandSpeedupAnalysis . . . . . . . . . . . . . . . . . . . . . . 230 6.3.4 WeakScalabilityonaLarge-ScaleCluster . . . . . . . . . . . . . . . . . 234 6.3.5 HybridCPU-GPUSolvers . . . . . . . . . . . . . . . . . . . . . . . . . 238 6.4 ApplicationAcceleration: LinearisedElasticity . . . . . . . . . . . . . . . . . . 241 6.4.1 Introduction,TestConfigurationsandSolutionStrategy . . . . . . . . . . 241 6.4.2 AccuracyStudies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244 6.4.3 WeakScalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 6.4.4 PerformanceandSpeedup . . . . . . . . . . . . . . . . . . . . . . . . . 248 6.5 ApplicationAcceleration: StationaryLaminarFlow . . . . . . . . . . . . . . . . 251 6.5.1 Introduction,TestConfigurationsandSolutionStrategy . . . . . . . . . . 251 6.5.2 AnalysisoftheLinearSolver . . . . . . . . . . . . . . . . . . . . . . . 253 6.5.3 AccuracyoftheMixedPrecisionStationaryLaminarFlowSolver . . . . 255 6.5.4 PerformanceAnalysisoftheStationaryLaminarFlowSolver . . . . . . 256 6.6 Summary,CriticalDiscussion,ConclusionsandFutureWork . . . . . . . . . . . 261 6.6.1 SummaryoftheMinimallyInvasiveIntegrationApproach . . . . . . . . 261 6.6.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 6.6.3 WeakScalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 6.6.4 PerformanceofMixedPrecisionandAcceleratedTwo-LayerScaRCSolvers263 6.6.5 CriticalDiscussioninTermsofFutureScalability . . . . . . . . . . . . . 266 A Appendix 269 A.1 ListofSymbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270 Bibliography 271

Description:
that I could always rely on our group for discussing design decisions, toying with [224] for more details on different numbering schemes, and to the textbook DDR2-667 memory, a machine that is used frequently throughout this thesis.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.