Western University — Scholarship@Western
Electronic Thesis and Dissertation Repository
3-24-2017 12:00 AM

MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators

Xiaohui Chen, The University of Western Ontario
Supervisor: Marc Moreno Maza, The University of Western Ontario
A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Computer Science
© Xiaohui Chen 2017

Follow this and additional works at: https://ir.lib.uwo.ca/etd
Part of the Algebra Commons, Computer and Systems Architecture Commons, Programming Languages and Compilers Commons, and the Software Engineering Commons

Recommended Citation
Chen, Xiaohui, "MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators" (2017). Electronic Thesis and Dissertation Repository. 4429. https://ir.lib.uwo.ca/etd/4429

This Dissertation/Thesis is brought to you for free and open access by Scholarship@Western. It has been accepted for inclusion in Electronic Thesis and Dissertation Repository by an authorized administrator of Scholarship@Western. For more information, please contact [email protected].

Abstract

Parallel programming is gaining ground in various domains due to the tremendous computational power that it brings; however, it also requires a substantial code crafting effort to achieve performance improvements. Unfortunately, in most cases, performance tuning has to be accomplished manually by programmers. We argue that automated tuning is necessary due to the combination of the following factors. First, code optimization is machine-dependent: an optimization preferred on one machine may not be suitable for another. Second, as the space of possible optimizations grows, manually finding an optimized configuration becomes hard. Therefore, developing new compiler techniques for optimizing applications is of considerable interest.

This thesis aims at developing new techniques that will help programmers develop efficient algorithms and code targeting hardware acceleration technologies in a more effective manner. Our work is organized around a compilation framework, called MetaFork, for concurrency platforms and its application to automatic parallelization. MetaFork is a high-level programming language extending C/C++ which combines several models of concurrency, including fork-join, SIMD and pipelining parallelism. MetaFork is also a compilation framework which aims at facilitating the design and implementation of concurrent programs through four key features that make MetaFork unique and novel:

(1) Perform automatic code translation between concurrency platforms targeting multi-core architectures.
(2) Provide a high-level language for expressing concurrency as in the fork-join model, the SIMD paradigm and pipelining parallelism.
(3) Generate parallel code from serial code with an emphasis on code depending on machine or program parameters (e.g. cache size, number of processors, number of threads per thread block).
(4) Optimize code depending on parameters that are unknown at compile time.

Keywords: source-to-source compiler, pipelining, comprehensive parametric CUDA kernel generation, concurrency platforms, high-level parallel programming.
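For orientation, the following is a minimal sketch of the fork-join style of programming that MetaFork supports, patterned after the parallel Fibonacci example listed later in the thesis (Figure 3.6). The meta_fork and meta_join keyword spellings and their placement are assumptions made here for illustration, not a verbatim excerpt from the thesis; erasing the two keywords should yield the serial C elision of the program.

    /* A minimal fork-join sketch in MetaFork-style C (assumed syntax,
       patterned after the parallel Fibonacci example of Figure 3.6).
       Erasing meta_fork and meta_join gives the serial C elision. */
    #include <stdio.h>

    long fib(int n)
    {
        if (n < 2)
            return n;
        long x, y;
        x = meta_fork fib(n - 1);  /* spawn: the callee may run concurrently
                                      with the continuation of the caller   */
        y = fib(n - 2);            /* the caller proceeds with fib(n - 2)   */
        meta_join;                 /* barrier: wait for the tasks spawned in
                                      this function before using x          */
        return x + y;
    }

    int main(void)
    {
        printf("fib(30) = %ld\n", fib(30));
        return 0;
    }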
Acknowledgments

The work discussed in this dissertation would not have been possible without the constant encouragement and insight of many people. I must express my deepest appreciation and thanks to my supervisor, Professor Marc Moreno Maza, for his enthusiasm and patience during the course of my PhD research. I also would like to thank my friends Ning Xie, Changbo Chen, Yuzhen Xie, Robert Moir and Colin Costello at the University of Western Ontario. My appreciation also goes to my IBM colleagues in the compiler group, Wang Chen, Abdoul-Kader Keita, Priya Unnikrishnan and Jeeva Paudel, for their discussions on developing compiler techniques.

Many thanks to the members of my supervisory committee, Professor John Barron and Professor Michael Bauer, for their valuable feedback. My sincere thanks and appreciation also go to the members of my examination committee, Professor Robert Mercer, Professor Robert Webber, Professor David Jeffrey and Professor Jeremy Johnson, for their comments and inspiration. Last but not least, I am greatly indebted to my family, who deserve more thanks than would fit on this page.

Contents

Abstract
Acknowledgments
List of Figures
List of Tables
List of Appendices

1 Introduction
  1.1 Dissertation Outline
  1.2 Contributions
  1.3 Thesis Statement

2 Background
  2.1 Concurrency Platforms
    2.1.1 CilkPlus
    2.1.2 OpenMP
    2.1.3 GPU
  2.2 Performance Measurement
    2.2.1 Occupancy and ILP
    2.2.2 Work-Span Model
    2.2.3 Master Theorem
  2.3 Compilation Theory
    2.3.1 The State of the Art of Compilers
    2.3.2 Source-to-Source Compiler
    2.3.3 Automatic Parallelization

3 A Metalanguage for Concurrency Platforms Based on the Fork-Join Model
  3.1 Basic Principles and Execution Model
  3.2 Core Parallel Constructs
  3.3 Variable Attribute Rules
  3.4 Semantics of the Parallel Constructs in MetaFork
  3.5 Supported Parallel APIs of Both CilkPlus and OpenMP
    3.5.1 OpenMP
    3.5.2 CilkPlus
  3.6 Translation
    3.6.1 Translation from CilkPlus to MetaFork
    3.6.2 Translation from MetaFork to CilkPlus
    3.6.3 Translation from OpenMP to MetaFork
    3.6.4 Translation from MetaFork to OpenMP
  3.7 Experimentation
    3.7.1 Experimentation Setup
    3.7.2 Correctness
    3.7.3 Comparative Implementation
    3.7.4 Interoperability
    3.7.5 Parallelism Overheads
  3.8 Summary

4 Applying MetaFork to the Generation of Parametric CUDA Kernels
  4.1 Optimizing CUDA Kernels Depending on Program Parameters
  4.2 Automatic Generation of Parametric CUDA Kernels
  4.3 Extending the MetaFork Language to Support Device Constructs
  4.4 The MetaFork Generator of Parametric CUDA Kernels
  4.5 Experimentation
  4.6 Summary

5 MetaFork: A Metalanguage for Concurrency Platforms Targeting Pipelining
  5.1 Execution Model of Pipelining
  5.2 Core Parallel Constructs
  5.3 Semantics of the Pipelining Constructs in MetaFork
  5.4 Summary

6 MetaFork: The Compilation Framework
  6.1 Goals
  6.2 MetaFork as a High-Level Parallel Programming Language
  6.3 Organization of the MetaFork Compilation Framework
  6.4 MetaFork Compiler
    6.4.1 Front-End of the MetaFork Compiler
    6.4.2 Analysis & Transformation of the MetaFork Compiler
    6.4.3 Back-End of the MetaFork Compiler
  6.5 User Defined Pragma Directives in MetaFork
    6.5.1 Registration of Pragma Directives in the MetaFork Preprocessor
    6.5.2 Parsing of Pragma Directives in the MetaFork Front-End
    6.5.3 Attaching Pragma Directives to the Clang AST
  6.6 Implementation
    6.6.1 Parsing MetaFork Constructs
    6.6.2 Parsing CilkPlus Constructs
  6.7 Summary

7 Towards Comprehensive Parametric CUDA Kernel Generation
  7.1 Comprehensive Optimization
    7.1.1 Hypotheses on the Input Code Fragment
    7.1.2 Hardware Resource Limits and Performance Measures
    7.1.3 Evaluation of Resource and Performance Counters
    7.1.4 Optimization Strategies
    7.1.5 Comprehensive Optimization
    7.1.6 Data-Structures
    7.1.7 The Algorithm
  7.2 Comprehensive Translation of an Annotated C Program into CUDA Kernels
    7.2.1 Input MetaFork Code Fragment
    7.2.2 Comprehensive Translation into Parametric CUDA Kernels
  7.3 Implementation Details
  7.4 Experimentation
  7.5 Conclusion

8 Concluding Remarks

Bibliography

A Code Translation Examples
B Examples Generated by PPCG
C The Implementation for Generating Comprehensive MetaFork Programs

Curriculum Vitae

List of Figures

2.1 GPU hardware architecture
2.2 Heterogeneous programming with GPU and CPU
3.1 Example of a MetaFork program with a function spawn
3.2 Example of a MetaFork program with a parallel region
3.3 Example of a meta_for loop
3.4 Various variable attributes in a parallel region
3.5 Example of shared and private variables with meta_for
3.6 Parallel fib code using a function spawn
3.7 Parallel fib code using a block spawn
3.8 OpenMP clauses supported in the current program translations
3.9 A code snippet of an OpenMP sections example
3.10 A code snippet showing how to exclude declarations which come from included header files
3.11 A code snippet showing how to translate a parallel for loop from CilkPlus to MetaFork
3.12 A code snippet showing how to insert a barrier from CilkPlus to MetaFork
3.13 A code snippet showing how to handle data attributes of variables in the process of outlining
3.14 A code snippet showing how to translate a parallel for loop from MetaFork to CilkPlus
3.15 A general form of an OpenMP construct
3.16 Translation of the OpenMP sections construct
3.17 An array type variable
3.18 A code snippet translated from the code in Figure 3.17
3.19 A scalar type variable
3.20 A code snippet translated from the code in Figure 3.19
3.21 A code snippet showing the variable m used as an accumulator
3.22 A MetaFork code snippet (right), translated from the OpenMP code on the left
3.23 A code snippet showing how to avoid adding redundant barriers when translating code from OpenMP to MetaFork
3.24 Example of translation from MetaFork to OpenMP
3.25 Example of translating a parallel function call from MetaFork to OpenMP
3.26 Example of translating a parallel for loop from MetaFork to OpenMP
3.27 A code snippet showing how to generate a new OpenMP main function
3.28 Parallel merge sort with input size 10^8
3.29 Matrix inversion of order 4096
3.30 Matrix transpose: n = 32768
3.31 Naive matrix multiplication: n = 4096
3.32 Speedup curve on Intel node
3.33 Speedup curve on Intel node
3.34 Speedup curve on Intel node
3.35 Speedup curve on Intel node
3.36 Running time of Mandelbrot set and Linear system solving
3.37 FFT test-cases: FSU version and BOTS version
3.38 Speedup curve of Protein alignment - 100 Proteins
3.39 Speedup curve of SparseLU and Strassen matrix multiplication
4.1 MetaFork examples
4.2 Using meta_schedule to define one-dimensional CUDA grid and thread block
4.3 Using meta_schedule to define two-dimensional CUDA grid and thread block
4.4 Sequential C code computing Jacobi
4.5 Generated MetaFork code from the code in Figure 4.4
4.6 Overview of the implementation of the MetaFork-to-CUDA code generator
4.7 Generated parametric CUDA kernel for 1D Jacobi
4.8 Generated host code for 1D Jacobi
4.9 Serial code, MetaFork code and generated parametric CUDA kernel for array reversal
4.10 Serial code, MetaFork code and generated parametric CUDA kernel for 2D Jacobi
4.11 Serial code, MetaFork code and generated parametric CUDA kernel for LU decomposition
4.12 Serial code, MetaFork code and generated parametric CUDA kernel for matrix transpose
4.13 Serial code, MetaFork code and generated parametric CUDA kernel for matrix addition
4.14 Serial code, MetaFork code and generated parametric CUDA kernel for matrix vector multiplication
4.15 Serial code, MetaFork code and generated parametric CUDA kernel for matrix matrix multiplication
5.1 Pipelining code with meta_pipe construct and its DAG
5.2 MetaFork pipelining code and its serial C-elision counterpart code
5.3 Computation DAG of algorithms in Figure 5.2
5.4 Stencil code using meta_pipe and its serial C-elision counterpart code
6.1 MetaFork Pragma directive syntax
6.2 A code snippet showing a MetaFork program with Pragma directives
6.3 A code snippet showing a MetaFork program with keywords
6.4 Overall work-flow of the MetaFork compilation framework
6.5 A code snippet showing how to create tools based on Clang's LibTooling
6.6 A general command-line interface of MetaFork tools
6.7 Overall work-flow of the MetaFork analysis and transformation chain
6.8 Overall work-flow of the Clang front-end
6.9 A code snippet showing how to create a MetaFork user defined PragmaHandler
6.10 A code snippet showing how to register a new user defined Pragma handler instance to the Clang preprocessor
6.11 Convert MetaFork Keywords to Pragma directives
6.12 A code snippet showing how to annotate a parallel for loop with MetaFork Pragma directive
6.13 A code snippet showing how to annotate a parallel for loop with MetaFork keyword
6.14 The code snippet after preprocessing the code in Figure 6.13
6.15 Setting the tok::eod token
6.16 A code snippet showing how to annotate a parallel region with MetaFork Pragma directive
6.17 A code snippet showing how to annotate a parallel region with MetaFork keyword
6.18 The code snippet after preprocessing the code of Figure 6.17
6.19 Consuming the shared clause
6.20 A code snippet showing how to annotate a join construct with MetaFork Pragma directive
6.21 A code snippet showing how to annotate a join construct with MetaFork keywords
6.22 The code snippet after preprocessing the code of Figure 6.21
6.23 A code snippet showing how to annotate device code with MetaFork Pragma directive
6.24 A code snippet showing how to annotate device code with MetaFork keywords
6.25 The code snippet after preprocessing the code of Figure 6.24
6.26 Convert CilkPlus Keywords to Pragma directives
6.27 A code snippet showing how to annotate a parallel for loop with CilkPlus Pragma directive
6.28 A code snippet showing how to annotate a parallel for loop with CilkPlus keyword
6.29 The code snippet after preprocessing the code of Figure 6.28
6.30 A code snippet showing how to annotate a parallel function call with CilkPlus Pragma directive
6.31 A code snippet showing how to annotate a parallel function call with CilkPlus keyword
6.32 The code snippet after preprocessing the code of Figure 6.31
6.33 A code snippet showing how to annotate a sync construct with CilkPlus Pragma directive
6.34 A code snippet showing how to annotate a sync construct with CilkPlus keyword
6.35 The code snippet after preprocessing the code of Figure 6.34
7.1 Matrix addition written in C (the left-hand portion) and in MetaFork (the right-hand portion) with a meta_for loop nest, respectively
7.2 Comprehensive translation of MetaFork code to two kernels for matrix addition
7.3 The decision tree for comprehensive parametric CUDA kernels of matrix addition
7.4 Matrix vector multiplication written in C (the left-hand portion) and in MetaFork (the right-hand portion), respectively
7.5 The decision subtree for resource or performance counters
7.6 The serial elision of the MetaFork program for matrix vector multiplication
7.7 The software tools involved for the implementation
7.8 Computing the amount of words required per thread-block for reversing a 1D array
7.9 The first case of the optimized MetaFork code for array reversal
7.10 The second case of the optimized MetaFork code for array reversal
7.11 The third case of the optimized MetaFork code for array reversal
7.12 The first case of the optimized MetaFork code for matrix vector multiplication
7.13 The second case of the optimized MetaFork code for matrix vector multiplication
7.14 The third case of the optimized MetaFork code for matrix vector multiplication
7.15 The MetaFork source code for 1D Jacobi
7.16 The first case of the optimized MetaFork code for 1D Jacobi
7.17 The second case of the optimized MetaFork code for 1D Jacobi
7.18 The third case of the optimized MetaFork code for 1D Jacobi
7.19 The first case of the optimized MetaFork code for matrix addition
7.20 The second case of the optimized MetaFork code for matrix addition
7.21 The third case of the optimized MetaFork code for matrix addition
7.22 The first case of the optimized MetaFork code for matrix transpose
7.23 The second case of the optimized MetaFork code for matrix transpose
7.24 The third case of the optimized MetaFork code for matrix transpose
7.25 The first case of the optimized MetaFork code for matrix matrix multiplication
7.26 The second case of the optimized MetaFork code for matrix matrix multiplication
7.27 The third case of the optimized MetaFork code for matrix matrix multiplication
B.1 PPCG code and generated CUDA kernel for array reversal
B.2 PPCG code and generated CUDA kernel for matrix addition
B.3 PPCG code and generated CUDA kernel for 1D Jacobi
B.4 PPCG code and generated CUDA kernel for 2D Jacobi
B.5 PPCG code and generated CUDA kernel for LU decomposition
B.6 PPCG code and generated CUDA kernel for matrix vector multiplication
B.7 PPCG code and generated CUDA kernel for matrix transpose
B.8 PPCG code and generated CUDA kernel for matrix matrix multiplication
