Microprocessor Energy Characterization and Optimization through Fast, Accurate, and Flexible Simulation

by Ronny Krashinsky

B.S. Electrical Engineering and Computer Science, University of California at Berkeley, 1999

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2001

© Massachusetts Institute of Technology 2001. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 23, 2001
Certified by: Krste Asanović, Assistant Professor, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Abstract

Energy dissipation is emerging as a key constraint for both high-performance and embedded microprocessor designs, requiring computer architects to consider energy in addition to performance when evaluating design decisions. A major limitation is the general difficulty in analyzing the energy impact of architectural and microarchitectural features without constructing detailed implementations and running slow simulations. This thesis first describes the design of a fast, accurate, and flexible circuit simulation tool which enables transition-sensitive studies of microprocessor energy consumption that would otherwise be impossible or impractical.
With a simulation infrastructure in place, various optimizations are implemented that target the entire datapath and cache energy consumption. The individual energy optimizations are analyzed in detail, and the microprocessor design is characterized using various energy breakdowns and studies of the bit correlation between data values. This work shows that a few relatively simple energy-saving techniques can have a large impact in the implementation of an energy-efficient microprocessor. By fully characterizing the energy usage, this thesis establishes a coherent vision of microprocessor energy consumption, and serves as a basis and motivation for further energy optimizations.

Thesis Supervisor: Krste Asanović
Title: Assistant Professor

Acknowledgments

First and foremost, I would like to extend my deepest gratitude to Krste for being such a dedicated advisor and great teacher. I would also like to thank the SCALE group at MIT for their endless help and occasional comic relief. Finally, I want to thank Maggie and my family for their love and support.

Contents

1 Introduction
2 Energy Simulation
  2.1 SyCHOSys Simulation
    2.1.1 Structural Netlist
    2.1.2 Behavioral Models
    2.1.3 Cycle Scheduler
    2.1.4 SyCHOTick
  2.2 Energy Models
    2.2.1 Dynamic Switching Energy
  2.3 SyCHO Energy Analysis
    2.3.1 Transition Counting
  2.4 Energy-Performance Model Evaluation
  2.5 Processor Model Development
  2.6 Processor Energy Models
  2.7 Summary
3 Vanilla Pekoe Microprocessor Microarchitecture
  3.1 Processor
  3.2 Datapath
  3.3 Cache
  3.4 System Coprocessor
  3.5 Multiplier and Divider
  3.6 Control
    3.6.1 Control Hazards
    3.6.2 Data Hazards
    3.6.3 Exceptions and Interrupts
    3.6.4 Cache Misses
    3.6.5 Branches
    3.6.6 Implementing Stalls and Kills
  3.7 Benchmarks
  3.8 Performance
4 Energy Optimization
  4.1 General Clock Gating
  4.2 ALU
  4.3 Register File
  4.4 Program Counter
  4.5 Cache
  4.6 System Coprocessor
  4.7 Flip-flops and Latches
  4.8 Combined Energy Optimizations
  4.9 Summary
5 Energy Characterization and Analysis
  5.1 Energy Breakdowns
    5.1.1 Overall
    5.1.2 Datapath Units
    5.1.3 Datapath Component Types
    5.1.4 Instructions
  5.2 Bit Correlation
  5.3 Spatial Separation
  5.4 Sidecar Address Pipeline
    5.4.1 Design
    5.4.2 Performance
    5.4.3 Analysis
  5.5 Summary
6 Summary and Conclusion

List of Figures

2-1 SyCHOSys framework.
2-2 GCD circuit.
2-3 Euclid's greatest common divisor algorithm.
2-4 SyCHOSys netlist for GCD circuit.
2-5 Example SyCHOSys behavioral model: Mux2.
2-6 Scheduling graphs for the GCD circuit.
2-7 Scheduled component evaluation calls for the GCD circuit.
2-8 Transmission gate mux and latch designs.
2-9 Alternative memory layouts for counting the bit transitions of an n-bit bus.
3-1 Vanilla Pekoe block diagram.
3-2 Vanilla Pekoe pipeline diagram.
3-3 Cache diagram.
3-4 Branch mispredict rate.
3-5 Stall, kill, and alive control signals.
3-6 Behavioral C code for pipeline stall and kill control.
3-7 Vanilla Pekoe CPI.
4-1 Unoptimized energy per cycle.
4-2 Datapath subunits diagram.
4-3 Datapath clocking strategy.
4-4 Clock and data activities for flip-flops and latches with various levels of clock gating.
4-5 Clock gating energy savings.
4-6 Clock versus data energy with clock gating in flip-flops, latches, and branch target adder.
4-7 ALU adder and equality comparator clock gating.
4-8 Register file read reduction successively applying precise read control, bypass skip, and liveness gating.
4-9 Register file read reduction for various benchmarks.
4-10 Register file bit transition activity reduction using a modified storage cell.
4-11 Register file and bypass network energy reduction with successive optimizations.
4-12 Segmented PC latch clock and data activity.
4-13 Energy for PC components with segmentation.
4-14 Instruction cache energy consumption while eliminating intra-line-sequential tag searches.
4-15 Data cache hit rates including last-line and per-bank last-line hit rates.
4-16 Instruction cache hit rates including last-line and per-bank last-line hit rates.
4-17 Segmented Coprocessor-0 counter flip-flop clock and data activity.
4-18 Segmented counter and comparator components energy.
4-19 Flip-flop and latch structures.
4-20 Flip-flop and latch energy breakdowns for various structures.
4-21 Flip-flop and latch clock and data activity in optimized datapath.
4-22 Flip-flop and latch activity-sensitive selection energy savings.
4-23 Optimized energy per cycle.
4-24 Energy savings with optimizations.
4-25 Average energy savings breakdown.
5-1 Optimized energy per instruction.
5-2 Optimized average energy breakdown.
5-3 Datapath energy breakdown by component type.
5-4 Bit correlation for various nets.
5-5 Bit correlation for various nets in different benchmarks.
5-6 Vanilla Pekoe memory map.
5-7 Bit correlations for selected nets using two different stack base addresses (jpeg enc).
5-8 Energy using two different stack base addresses (jpeg enc).
5-9 Time-multiplexing versus spatial separation.
5-10 Bit transitions for selected nets split between: intrinsic transitions between data values, intrinsic transitions between stack addresses, intrinsic transitions between heap addresses, and time-multiplexing overhead.
5-11 Bit transitions for various benchmarks split as in Figure 5-10.
5-12 Split ALU and AA.
5-13 Advanced ALU pipeline stages.
5-14 Sidecar address pipeline diagram.
5-15 Sidecar pipeline CPI
5-16 Sidecar pipeline energy consumption compared to Vanilla Pekoe.
5-17 Bit correlations using a unified ALU or a split ALU and AA.