
PhD Feasibility of Accelerator Generation to Alleviate Dark Silicon in a Novel Architecture Antonio PDF

224 Pages·2017·8.19 MB·English


Feasibility of Accelerator Generation to Alleviate Dark Silicon in a Novel Architecture
Antonio ARNONE
PhD
UNIVERSITY OF YORK
Computer Science
June 2017

Abstract

This thesis presents a novel approach to alleviating the Dark Silicon problem by reducing power density.

Decreasing the size of transistors has led to an increase in power consumption. To attempt to manage the power issue, processor design has shifted from one single core to many cores. Switching on fewer cores while the others are off helps the chip to cool down and spreads power more evenly over the chip. This means that some transistors are always idle while others are working. Therefore, scaling down the size of the chip, and increasing the amount of power to be dissipated, increases the number of inactive transistors. As a result it generates Dark Silicon, which doubles every chip generation [63].

One of the most effective techniques to deal with Dark Silicon is to implement accelerators that execute the most energy-consuming software functions. In this way the CPU is able to dissipate more energy and reduce the dark silicon issue.

This work explores a novel accelerator design model which could be interfaced to a Stack CPU and so could optimise the transistor logic area and improve energy efficiency to tackle the dark silicon problem, based on heterogeneous multi-accelerators (co-processors) in a stack structure.

The contribution of this thesis is to develop a tool to generate coprocessors from software stack machine code. It also employs up-to-date code optimisation strategies to enhance the code at the input stage. The cores are analysed using key metrics, based on 65nm synthesis experiments and industry standard tool-sets. It further introduces a novel architecture to decrease the power density of the accelerator.

In order to test these expectations, a large-scale synthesis translation experiment was conducted, covering widely recognised benchmarks, and generating a large number of cores (in the thousands). These were evaluated for a range of key metrics: silicon die area, timing, power, instructions-per-clock, and power density, both with and without code optimisation applied.

The results obtained demonstrate that one of two competing core models, ‘Wave-core’ (which is proposed in this thesis), delivers superior power density to the standard approach (which it refers to as Composite core), and that this is achieved without significant cost in terms of critical metrics of overall power consumption and critical path delays.

Finally, to understand the benefit of these accelerators, these auto-generated cores are analysed in comparison to a standard stack-machine CPU executing the same code sequences. Both the cores generation work and the benchmark CPU assume a 65nm CMOS process node, and are evaluated with industry standard design tools. It is demonstrated that the generated cores achieve power efficiency improvements over the CPU core.

Contents

Abstract
Acknowledgements
Declaration of Authorship

1 Introduction
  1.1 Hypothesis
  1.2 Novel Contributions
  1.3 Scaling vs Power
  1.4 ARM architecture, 35 years of RISC architecture
    1.4.1 Register File and Pipeline architecture
  1.5 The History of Stack Architecture
    1.5.1 Stack Architecture, a simple architecture
  1.6 Dark Silicon
    1.6.1 The Key approaches to improving dark silicon
  1.7 Summary

2 Accelerator cores and Translator tool
  2.1 Accelerators to reduce power consumption
  2.2 Low Power Concept
  2.3 Design Flow
  2.4 Accelerator generation; design flow
  2.5 SW/HW Function translation
    2.5.1 The Translator tool in detail
    2.5.2 Testing the translator tool
  2.6 The Verilog testbench
  2.7 The Composite Cores Architecture
  2.8 The Wave-core Architecture
    2.8.1 The Wave-core architecture: An idea to reduce power
  2.9 Summary

3 Standard Core Generation Experiment
  3.1 The Baseline core benchmarks Input Comparison
    3.1.1 Input operands per core (Smaller benchmarks, baseline case)
    3.1.2 Input operands per core (Larger benchmarks, baseline case)
    3.1.3 Input operands per core (All benchmarks, baseline case)
  3.2 The Baseline core benchmarks Output Comparison
    3.2.1 Output operands per core (Smaller benchmarks, baseline case)
    3.2.2 Output operands per core (Larger benchmarks, baseline case)
    3.2.3 Output operands per core (All benchmarks, baseline case)
  3.3 The number of states in the Finite State Machine
    3.3.1 Number of states per core (Smaller benchmarks, baseline case)
    3.3.2 Number of states per core (Larger benchmarks, baseline case)
    3.3.3 Number of states per core (All benchmarks, baseline case)
  3.4 The Number of Instructions per Clock Cycle
  3.5 Composite vs Wave-core Total Area
  3.6 Composite vs Wave-core Timing
  3.7 Composite vs Wave-core Leakage
  3.8 Composite vs Wave-core Dynamic power
  3.9 Composite vs Wave-core Total power
  3.10 Composite vs Wave-core Power Density
  3.11 Composite vs Wave-core Dynamic power for cores with Multiple States
  3.12 Summary

4 Impact of Code Optimisation (Stack Scheduling Code)
  4.1 Comparison of Optimised Cores' Input
    4.1.1 Input operands per core (Smaller benchmarks, with optimization)
    4.1.2 Input operands per core (Larger benchmarks, with optimization)
    4.1.3 Input operands per core (all benchmarks, with optimization)
  4.2 Comparison of Optimised Cores' Output
    4.2.1 Output operands per core (Smaller benchmarks, with optimization)
    4.2.2 Output operands per core (Larger benchmarks, with optimization)
    4.2.3 Output operands per core (All benchmarks, with optimization)
  4.3 Comparison of Optimised Cores' States
    4.3.1 Number of states per core (Smaller benchmarks, with optimization)
    4.3.2 Number of states per core (Larger benchmarks, with optimization)
    4.3.3 Number of states per core (All benchmarks, with optimization)
  4.4 Comparison of Optimised Cores' IPC
    4.4.1 IPC per core (Smaller benchmarks, with optimization)
    4.4.2 IPC per core (Larger benchmarks, with optimization)
    4.4.3 IPC per core (All benchmarks, with optimization)
  4.5 Comparing the Area of the optimised benchmarks in Composite and Wave-core architecture
    4.5.1 Comparing the Area of the optimised benchmarks in Composite architecture
    4.5.2 Comparing the Area of the optimised benchmarks in Wave-core architecture
    4.5.3 Comparing Composite vs Wave-core architecture total Area for the optimized benchmarks
  4.6 Comparing the Timing of the optimised benchmarks in Composite and Wave-core architecture
    4.6.1 Comparing the Timing of the optimised benchmarks in Composite architecture
    4.6.2 Comparing the Timing of the optimised benchmarks in Wave-core architecture
    4.6.3 Comparing Composite vs Wave-core Timing of the optimised benchmarks
  4.7 Comparing the Static Power of the optimised benchmarks in Composite and Wave-core architecture
    4.7.1 Comparing the Leakage of the optimised benchmarks in Composite architecture
    4.7.2 Comparing the Leakage of the optimised benchmarks in Wave-core architecture
  4.8 Comparing the Dynamic Power of the optimised benchmarks in Composite and Wave-core architectures
    4.8.1 Comparing the Dynamic power of the optimised benchmarks in Composite architecture
      Dynamic power per state
    4.8.2 Comparing the Dynamic power of the optimised benchmarks in Wave-core architecture
    4.8.3 Composite vs Wave-core Dynamic power of the optimised benchmarks
  4.9 Comparing the Total Power of the optimised benchmarks in Composite and Wave-core architecture
    4.9.1 Composite total power for the optimized benchmarks
      Total power applying clock gating in Composite architecture
    4.9.2 Wave-core total power for the optimized benchmarks
    4.9.3 Power gating and Clock gating Composite vs Wave-core Total Power
  4.10 Comparing the Power Density of the optimised benchmarks in Composite and Wave-core architectures
    4.10.1 Composite Power density for the optimized benchmarks
      Power Density for the Composite architecture when the clock gating is applied
    4.10.2 Wave-core Power density for the optimized benchmarks
    4.10.3 Wave-core vs Composite Power density when power gating and clock gating are applied
  4.11 Pearson Correlation review of statistical significance of results in Composite and Wave-core architecture
  4.12 Summary

5 CPU vs Core Power Analysis
  5.1 NOMAD MACHINE ARCHITECTURE
    5.1.1 Evaluation of NOMAD in terms of selected cores power analysis is undertaken as follows
  5.2 Methodology of comparison (NOMAD versus Accelerator Cores)
    5.2.1 Results of selected cores analysis for both CPU and Composite and Wave-core architectures
  5.3 Analysis using average power for Lund CPU model
    5.3.1 Methodology of comparison IP cores vs Lund CPU model
    5.3.2 Comparing Composite and Wave-core architecture vs Lund CPU architecture
  5.4 Analysis using average power model for NOMAD CPU
  5.5 Summary of power models and results

6 Conclusion
  6.1 Main Contributions and Implications
  6.2 Future Work

A VHDL Code - Composite and Wave-core
B Hamming distance and Verilog test-bench
C Total Area
D Power report
E Timing report

Bibliography

List of Tables

3.1 Combined benchmarks Total Area differences quartile
3.2 Combined benchmarks Timing differences quartile
3.3 Combined benchmarks Leakage differences quartile
3.4 Combined benchmarks Dynamic Power
3.5 Combined benchmarks Total Power differences quartile
3.6 Combined benchmarks Power Density differences quartile
3.7 Combined benchmarks Dynamic Power multistates differences quartile
3.8 Combined benchmarks Power Density multistates differences quartile
4.1 Pearson correlation coefficient total area
4.2 Pearson correlation coefficient timing
4.3 Pearson correlation coefficient Static Power
4.4 Dynamic power per state
4.5 Pearson correlation coefficient Dynamic power
4.6 Pearson correlation coefficient for the four factors
5.1 Lund CPU vs IP cores

Description:
In order to test these expectations, a large-scale synthesis translation ... Crispin-Bailey, University of York (poster presentation) at NANO-TERA/ARTIS ... Analysis: Power, Timing and Area: Finally a script in Python extracts the value ... com/electronics-blogs/beginner-s-corner/4024632/Introduction-.
