Feasibility of Accelerator Generation to Alleviate Dark Silicon in a Novel Architecture

Antonio ARNONE

PhD

UNIVERSITY OF YORK
Computer Science

June 2017

Abstract

This thesis presents a novel approach to alleviating the Dark Silicon problem by reducing power density.

Decreasing the size of transistors has led to an increase in power consumption. To attempt to manage the power issue, processor design has shifted from a single core to many cores. Switching on fewer cores while the others are off helps the chip to cool down and spreads power more evenly over the chip. This means that some transistors are always idle while others are working. Therefore, scaling down the size of the chip, and increasing the amount of power to be dissipated, increases the number of inactive transistors. The result is Dark Silicon, which doubles every chip generation [63].

One of the most effective techniques to deal with Dark Silicon is to implement accelerators that execute the most energy-consuming software functions. In this way the CPU is able to dissipate more energy and reduce the dark silicon issue.

This work explores a novel accelerator design model which could be interfaced to a Stack CPU and so could optimise the transistor logic area and improve energy efficiency, tackling the dark silicon problem with heterogeneous multi-accelerators (co-processors) in a stack structure.

The contribution of this thesis is to develop a tool to generate coprocessors from software stack machine code. It also employs up-to-date code optimisation strategies to enhance the code at the input stage. The cores are analysed using key metrics, based on 65nm synthesis experiments and industry standard tool-sets. The thesis further introduces a novel architecture to decrease the power density of the accelerator.

In order to test these expectations, a large-scale synthesis translation experiment was conducted, covering widely recognised benchmarks, and generating a large number of cores (in the thousands).
These were evaluated for a range of key metrics: silicon die area, timing, power, instructions-per-clock, and power density, both with and without code optimisation applied.

The results obtained demonstrate that one of two competing core models, ‘Wave-core’ (which is proposed in this thesis), delivers superior power density to the standard approach (referred to here as Composite core), and that this is achieved without significant cost in terms of the critical metrics of overall power consumption and critical path delays.

Finally, to understand the benefit of these accelerators, the auto-generated cores are analysed in comparison to a standard stack-machine CPU executing the same code sequences. Both the core generation work and the benchmark CPU assume a 65nm CMOS process node, and are evaluated with industry standard design tools. It is demonstrated that the generated cores achieve power efficiency improvements relative to the CPU core.

Contents

Abstract
Acknowledgements
Declaration of Authorship

1 Introduction
  1.1 Hypothesis
  1.2 Novel Contributions
  1.3 Scaling vs Power
  1.4 ARM architecture, 35 years of RISC architecture
    1.4.1 Register File and Pipeline architecture
  1.5 The History of Stack Architecture
    1.5.1 Stack Architecture, a simple architecture
  1.6 Dark Silicon
    1.6.1 The Key approaches to improving dark silicon
  1.7 Summary:

2 Accelerator cores and Translator tool
  2.1 Accelerators to reduce power consumption
  2.2 Low Power Concept
  2.3 Design Flow
  2.4 Accelerator generation; design flow
  2.5 SW/HW Function translation
    2.5.1 The Translator tool in detail
    2.5.2 Testing the translator tool
  2.6 The Verilog test bench
  2.7 The Composite Cores Architecture
  2.8 The Wave-core Architecture
    2.8.1 The Wave-core architecture: An idea to reduce power
  2.9 Summary:

3 Standard Core Generation Experiment
  3.1 The Baseline core benchmarks Input Comparison
    3.1.1 Input operands per core (Smaller benchmarks, baseline case)
    3.1.2 Input operands per core (Larger benchmarks, baseline case)
    3.1.3 Input operands per core (All benchmarks, baseline case)
  3.2 The Baseline core benchmarks Output Comparison
    3.2.1 Output operands per core (Smaller benchmarks, baseline case)
    3.2.2 Output operands per core (Larger benchmarks, baseline case)
    3.2.3 Output operands per core (All benchmarks, baseline case)
  3.3 The number of states in the Finite State Machine
    3.3.1 Number of states per core (Smaller benchmarks, baseline case)
    3.3.2 Number of states per core (Larger benchmarks, baseline case)
    3.3.3 Number of states per core (All benchmarks, baseline case)
  3.4 The Number of Instructions per Clock Cycle
  3.5 Composite vs Wave-core Total Area
  3.6 Composite vs Wave-core Timing
  3.7 Composite vs Wave-core Leakage
  3.8 Composite vs Wave-core Dynamic power
  3.9 Composite vs Wave-core Total power
  3.10 Composite vs Wave-core Power Density
  3.11 Composite vs Wave-core Dynamic power for cores with Multiple States
  3.12 Summary:

4 Impact of Code Optimisation (Stack Scheduling Code)
  4.1 Comparison of Optimised Cores’ Input
    4.1.1 Input operands per core (Smaller benchmarks, with optimization)
    4.1.2 Input operands per core (Larger benchmarks, with optimization)
    4.1.3 Input operands per core (All benchmarks, with optimization)
  4.2 Comparison of Optimised Cores’ Output
    4.2.1 Output operands per core (Smaller benchmarks, with optimization)
    4.2.2 Output operands per core (Larger benchmarks, with optimization)
    4.2.3 Output operands per core (All benchmarks, with optimization)
  4.3 Comparison of Optimised Cores’ States
    4.3.1 Number of states per core (Smaller benchmarks, with optimization)
    4.3.2 Number of states per core (Larger benchmarks, with optimization)
    4.3.3 Number of states per core (All benchmarks, with optimization)
  4.4 Comparison of Optimised Cores’ IPC
    4.4.1 IPC per core (Smaller benchmarks, with optimization)
    4.4.2 IPC per core (Larger benchmarks, with optimization)
    4.4.3 IPC per core (All benchmarks, with optimization)
  4.5 Comparing the Area of the optimised benchmarks in Composite and Wave-core architecture
    4.5.1 Comparing the Area of the optimised benchmarks in Composite architecture
    4.5.2 Comparing the Area of the optimised benchmarks in Wave-core architecture
    4.5.3 Comparing Composite vs Wave-core architecture total Area for the optimized benchmarks
  4.6 Comparing the Timing of the optimised benchmarks in Composite and Wave-core architecture
    4.6.1 Comparing the Timing of the optimised benchmarks in Composite architecture
    4.6.2 Comparing the Timing of the optimised benchmarks in Wave-core architecture
    4.6.3 Comparing Composite vs Wave-core Timing of the optimised benchmarks
  4.7 Comparing the Static Power of the optimised benchmarks in Composite and Wave-core architecture
    4.7.1 Comparing the Leakage of the optimised benchmarks in Composite architecture
    4.7.2 Comparing the Leakage of the optimised benchmarks in Wave-core architecture
  4.8 Comparing the Dynamic Power of the optimised benchmarks in Composite and Wave-core architectures
    4.8.1 Comparing the Dynamic power of the optimised benchmarks in Composite architecture
      Dynamic power per state
    4.8.2 Comparing the Dynamic power of the optimised benchmarks in Wave-core architecture
    4.8.3 Composite vs Wave-core Dynamic power of the optimised benchmarks
  4.9 Comparing the Total Power of the optimised benchmarks in Composite and Wave-core architecture
    4.9.1 Composite total power for the optimized benchmarks
      Total power applying clock gating in Composite architecture
    4.9.2 Wave-core total power for the optimized benchmarks
    4.9.3 Power gating and Clock gating Composite vs Wave-core Total Power
  4.10 Comparing the Power Density of the optimised benchmarks in Composite and Wave-core architectures
    4.10.1 Composite Power density for the optimized benchmarks
      Power Density for the Composite architecture when the clock gating is applied
    4.10.2 Wave-core Power density for the optimized benchmarks
    4.10.3 Wave-core vs Composite Power density when power gating and clock gating are applied
  4.11 Pearson Correlation review of statistical significance of results in Composite and Wave-core architecture
  4.12 Summary

5 CPU vs Core Power Analysis
  5.1 NOMAD MACHINE ARCHITECTURE
    5.1.1 Evaluation of NOMAD in terms of selected cores power analysis is undertaken as follows
  5.2 Methodology of comparison (NOMAD versus Accelerator Cores)
    5.2.1 Results of selected cores analysis for both CPU and Composite and Wave-core architectures
  5.3 Analysis using average power for Lund CPU model
    5.3.1 Methodology of comparison IP cores vs Lund CPU model
    5.3.2 Comparing Composite and Wave-core architecture vs Lund CPU architecture
  5.4 Analysis using average power model for NOMAD CPU
  5.5 Summary of power models and results

6 Conclusion
  6.1 Main Contributions and Implications
  6.2 Future Work

A VHDL Code - Composite and Wave-core
B Hamming distance and Verilog test-bench
C Total Area
D Power report
E Timing report

Bibliography

List of Tables

3.1 Combined benchmarks Total Area differences quartile
3.2 Combined benchmarks Timing differences quartile
3.3 Combined benchmarks Leakage differences quartile
3.4 Combined benchmarks Dynamic Power
3.5 Combined benchmarks Total Power differences quartile
3.6 Combined benchmarks Power Density differences quartile
3.7 Combined benchmarks Dynamic Power multi states differences quartile
3.8 Combined benchmarks Power Density multi states differences quartile
4.1 Pearson correlation coefficient total area
4.2 Pearson correlation coefficient timing
4.3 Pearson correlation coefficient Static Power
4.4 Dynamic power per state
4.5 Pearson correlation coefficient Dynamic power
4.6 Pearson correlation coefficient for the four factors
5.1 Lund CPU vs IP cores