POLITECNICO DI MILANO Facoltà di Ingegneria Industriale e dell’Informazione Corso di Laurea Magistrale in Ingegneria Informatica On how to Accelerate Iterative Stencil Loops: A Scalable Streaming-based Approach Relatore: Prof. Marco Domenico SANTAMBROGIO Correlatore: Dott. Ing. Riccardo CATTANEO Tesi di Laurea di: Giuseppe Natale Carlo Sicignano Matricola n. 803783 Marticola n. 800774 Anno Accademico 2013–2014 Thispageintentionallyleftblank Contents 1 Introduction 1 1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 TheChallengesofExascaleComputing . . . . . . . . . . . . . . . . 3 1.2.1 TheEnergyandPowerChallenge . . . . . . . . . . . . . . . 4 1.2.2 TheMemoryChallenge . . . . . . . . . . . . . . . . . . . . . 5 1.2.3 TheConcurrencyandScalabilityChallenge . . . . . . . . . . 7 1.2.4 TheResiliencyChallenge . . . . . . . . . . . . . . . . . . . . 9 1.2.5 TheSoftwareChallenge . . . . . . . . . . . . . . . . . . . . . 9 1.2.6 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.3 MeetingtheChallenges:HeterogeneousSystems . . . . . . . . . . . 10 1.4 AContextualizedOverviewoftheProposedWork . . . . . . . . . . 13 1.5 ThesisOutline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2 BackgroundKnowledge 15 2.1 PolyhedralFramework . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1.1 PolyhedralModel. . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1.2 PolyhedralTransformations . . . . . . . . . . . . . . . . . . . 29 2.2 StreamingSystemsinFPGAs . . . . . . . . . . . . . . . . . . . . . . 37 2.2.1 StreamingArchitectures . . . . . . . . . . . . . . . . . . . . . 38 2.3 HighLevelSynthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.3.1 WhatisHLS? . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.3.2 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.3.3 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 iii CONTENTS iv 2.4 IterativeStencilLoops . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.4.2 MainCharacteristicsandImplementationChallenges . . . . 47 2.4.3 StateoftheArt . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3 AScalableHardwareAcceleratorforISLs 56 3.1 TheIssueofFindinganEfficientImplementation . . . . . . . . . . . 56 3.2 ThesisContribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.3 TheProposedSolution . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.3.1 FundamentalPrinciples . . . . . . . . . . . . . . . . . . . . . 62 3.3.2 AGeneralOverviewoftheProposedHardwareAccelerator 65 3.3.3 SomeConsiderationsontheInputCode . . . . . . . . . . . . 70 3.3.4 AComparisonwithExistingWorks . . . . . . . . . . . . . . 71 4 ProposedDesignFlow 77 4.1 DesignAutomationFlow . . . . . . . . . . . . . . . . . . . . . . . . 77 4.2 Pre-processingPhase . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.3 TheSSTMicroarchitectureDerivation . . . . . . . . . . . . . . . . . 81 4.3.1 Streaming-orientedGraphConstruction . . . . . . . . . . . . 82 4.3.2 ComputingSystemExtraction . . . . . . . . . . . . . . . . . 88 4.3.3 MemorySystemDerivation . . . . . . . . . . . . . . . . . . . 93 4.3.4 SSTIRandCodeGeneration . . . . . . . . . . . . . . . . . . 95 4.3.5 PipeliningtheSST . . . . . . . . . . . . . . . . . . . . . . . . 98 4.3.6 ScalingontheProblemSize . . . . . . . . . . . . . . . . . . . 100 4.4 TheSSTsQueuingTechnique . . . . . . . . . . . . . . . . . . . . . . 101 4.4.1 QueueLengthEstimation . . . . . . . . . . . . . . . . . . . . 104 4.4.2 HandlingMorethanOneInput . . . . . . . . . . . . . . . . . 105 5 Results 108 5.1 ExperimentalSetup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.2 TestCases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.2.1 Polybench/CJacobi2-D . . . . . . . . . . . . . . . . . . . . . 111 5.2.2 FASTERRTM3-D . . . . . . . . . . . . . . . . . . . . . . . . . 111 CONTENTS v 5.2.3 Polybench/CSeidel2-D . . . . . . . . . . . . . . . . . . . . . 112 5.3 ExperimentalResults . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.3.1 jacobi-2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.3.2 RTMdo_step . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.3.3 seidel-2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 6 ConclusionsandFutureWork 125 6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 6.2 FutureWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 Bibliography 139 List of Figures 1.1 An illustration of the von Neumann Bottleneck. The graph refers totheevolutionofcanonicalCPUs. . . . . . . . . . . . . . . . . . . . 6 1.2 Asimpleschemeofanheterogeneoussystem. . . . . . . . . . . . . 11 1.3 ThegeneralarchitectureofanFPGA. . . . . . . . . . . . . . . . . . . 12 2.1 IterationDomain(ID)Example. . . . . . . . . . . . . . . . . . . . . . 21 2.2 SubscriptFunctionExample.Thethreesubscriptfunctionsarerel- ativetothethreearrayaccessesfors,aandx.. . . . . . . . . . . . . 22 2.3 Anexampleofadependencepolyhedron.Inthisexamplethepoly- hedron over the iteration vectors (one for the first statement, two forthesecond)andthescalarpartarecondensedinasinglematrix (noticethe1afterthethreeiterationvectors).Ontherightthereis a visual representation of the dependencies among the instances ofthetwostatements. . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.4 A simple schedule example. In this picture, the statement on the lefthasanidentityschedule,asthestatementinstancesaretrivially thepoints(i,j)withinthestatementID. . . . . . . . . . . . . . . . . 27 2.5 Loop Skewing Example. The ID is “skewed” to allow inner loop parallelization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.6 LoopTilingExample.TheIDispartitionedintothesocalled“tiles”. 35 2.7 StreamingComputing:AGeneralPicture. . . . . . . . . . . . . . . . 38 2.8 GenericStreamingArchitecture. . . . . . . . . . . . . . . . . . . . . 39 2.9 Anillustrationofageneric5-point2-DimensionalISL. . . . . . . . 45 2.10 ISLsboundarytypes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 vi LISTOFFIGURES vii 2.11 SingleIterationTiling. . . . . . . . . . . . . . . . . . . . . . . . . . . 50 2.12 TimeSkewing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 2.13 WavefrontParallelization. . . . . . . . . . . . . . . . . . . . . . . . . 52 3.1 The high level scheme of the proposed hardware accelerator. The threedifferentversionsrepresentthethreedistinctdescribedcases: thefirst(a)isthestandardcase,thesecond(b)isthecaseinwhich thequeuelengthisnotanexactdivisorofthetotalnumberofISL time-steps,thethird(c)isthecaseinwhichthereareenoughavail- ableresourcestoenablequeuelooping. . . . . . . . . . . . . . . . . 66 3.2 AgeneralschemeofanStreamingStencilTime-step(SST). . . . . . 67 3.3 AnSSTforISLswithspatialdependencies. . . . . . . . . . . . . . . 69 3.4 AnexampleoftheacceleratorforanISLwithmultipleinputs.The green arrows represents the additional streams. As described in thetext,thelastSSThasonlytheactualoutputstream. . . . . . . . 69 4.1 TheProposedDesignAutomationFlow. . . . . . . . . . . . . . . . . 78 4.2 An example of a complete Data Dependency Graph (DDG). The graph is computed for sample2 in listing 4.4. On the right, there is the output of the state of the art tool for polyhedral dependency analisysCandl[1],whileontheleft,thereisthegraphinitsgraphic form. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.3 The DDG with the only Read After Write (RAW) dependencies, the only dependencies we take into account. As for image 4.2, on therightthereistheoutputofCandl,whileonthelefttheDDGis itsgraphicform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.4 TheDDGafterthepruningprocess,whoseconditionsaregivenin definition4.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.5 The expanded version of the DDG for both samples. In order to easilydistinguishthem,readandwritenodesarerepresentedwith differentshapes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 LISTOFFIGURES viii 4.6 Thedependencewithintheredcircleisanexampleofthesocalled “copy”dependency. . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.7 TheStreaming-orientedGraphofsample2,listing4.4. . . . . . . . . . . 87 4.8 Avisualrepresentationoftheinstantiationofademux. . . . . . . . 88 4.9 An illustration of the cyclic dependencies between the output of the cyclic-write node and the cyclic-read node (filter) A[i-1] of sam- ple1,listing4.3.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.10 Theresultingchainsforbothsamples. . . . . . . . . . . . . . . . . . 95 4.11 ArepresentationoftheresultingSSTsforbothsamples. . . . . . . . 97 4.12 AdeadlockconditionoccurredbecauseEC isfullypipelined. . . . 99 1 4.13 Acommunicationchannelisremoved,toreducethememoryspace requirements,andsubstitutedwithanotheroff-chipaccess. . . . . . 100 4.14 Avisualizationofthepipelinedexecutionwithinthequeue. . . . . 102 4.15 The deadlock condition that can occur when queuing pipelined SSTswithmultipleinputs. . . . . . . . . . . . . . . . . . . . . . . . . 106 5.1 TheVC707board.Imagetakenfromtheproductsite. . . . . . . . . 109 5.2 Performancemeasurementofjacobi-2Dwithoutpipeliningenabled withintheSST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.3 Performance measurement of jacobi-2D with pipelining enabled withintheSST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 5.4 Resourceusageoftheacceleratorforthejacobi-2Dbenchmark. . . 117 5.5 PerformancemeasurementofRTMdo_stepwithoutpipeliningen- abledwithintheSST. . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.6 Performance measurement of RTM do_step with pipelining en- abledwithintheSST. . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.7 ResourceusageoftheacceleratorfortheRTMdo_stepbenchmark. 121 5.8 Performancemeasurementofseidel-2Dwithoutpipeliningenabled withintheSST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.9 Resourceusageoftheacceleratorfortheseidel-2Dbenchmark. . . 124 List of Tables 5.1 jacobi-2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.2 RTMdo_step. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5.3 seidel-2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 ix List of Algorithms 1 DDGConstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2 DependencePolyhedraConstruction . . . . . . . . . . . . . . . . . . 31 3 GenericISLAlgorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4 IterativeReductionoftheStreaming-orientedgraph . . . . . . . . . . 87 5 sd-EquivalenceClassesExtraction . . . . . . . . . . . . . . . . . . . 92 x
Description: