Microprocessor Energy Characterization and Optimization
through Fast, Accurate, and Flexible Simulation
by
Ronny Krashinsky
B.S. Electrical Engineering and Computer Science
University of California at Berkeley, 1999
Submitted to the Department of
Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2001
© Massachusetts Institute of Technology 2001. All rights reserved.
Author .............................................................
Department of Electrical Engineering and Computer Science
May 23, 2001
Certified by .........................................................
Krste Asanović
Assistant Professor
Thesis Supervisor
Accepted by .........................................................
Arthur C. Smith
Chairman, Department Committee on Graduate Students
Microprocessor Energy Characterization and Optimization through
Fast, Accurate, and Flexible Simulation
by
Ronny Krashinsky
Submitted to the Department of Electrical Engineering and Computer Science
on May 23, 2001, in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
Abstract
Energy dissipation is emerging as a key constraint for both high-performance and
embedded microprocessor designs, requiring computer architects to consider energy in
addition to performance when evaluating design decisions. A major limitation is the
general difficulty in analyzing the energy impact of architectural and microarchitectural
features without constructing detailed implementations and running slow simulations.
This thesis first describes the design of a fast, accurate, and flexible circuit simulation
tool which enables transition-sensitive studies of microprocessor energy consumption
that would otherwise be impossible or impractical. With a simulation infrastructure in
place, various optimizations are implemented that target the entire datapath and cache
energy consumption. The individual energy optimizations are analyzed in detail, and the
microprocessor design is characterized using various energy breakdowns and studies of
the bit correlation between data values. This work shows that a few relatively simple
energy-saving techniques can have a large impact in the implementation of an
energy-efficient microprocessor. By fully characterizing the energy usage, this thesis
establishes a coherent vision of microprocessor energy consumption, and serves as a
basis and motivation for further energy optimizations.
Thesis Supervisor: Krste Asanović
Title: Assistant Professor
Acknowledgments
First and foremost, I would like to extend my deepest gratitude to Krste for being such a
dedicated advisor and great teacher. I would also like to thank the SCALE group at MIT
for their endless help and occasional comic relief. Finally, I want to thank Maggie and my
family for their love and support.
Contents
1 Introduction 13
2 Energy Simulation 15
2.1 SyCHOSys Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Structural Netlist . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 Behavioral Models . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.3 Cycle Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.4 SyCHOTick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Energy Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Dynamic Switching Energy . . . . . . . . . . . . . . . . . . . . . 24
2.3 SyCHO Energy Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Transition Counting . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Energy-Performance Model Evaluation . . . . . . . . . . . . . . . . . . . 28
2.5 Processor Model Development . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 Processor Energy Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Vanilla Pekoe Microprocessor Microarchitecture 33
3.1 Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 System Coprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Multiplier and Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6.1 Control Hazards . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6.2 Data Hazards . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6.3 Exceptions and Interrupts . . . . . . . . . . . . . . . . . . . . . . 40
3.6.4 Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6.5 Branches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6.6 Implementing Stalls and Kills . . . . . . . . . . . . . . . . . . . 41
3.7 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.8 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Energy Optimization 45
4.1 General Clock Gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 ALU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Register File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Program Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.6 System Coprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.7 Flip-flops and Latches . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.8 Combined Energy Optimizations . . . . . . . . . . . . . . . . . . . . . . 65
4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5 Energy Characterization and Analysis 71
5.1 Energy Breakdowns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1.1 Overall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1.2 Datapath Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1.3 Datapath Component Types . . . . . . . . . . . . . . . . . . . . . 75
5.1.4 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Bit Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Spatial Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 Sidecar Address Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6 Summary and Conclusion 95
List of Figures
2-1 SyCHOSys framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2-2 GCD circuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2-3 Euclid's greatest common divisor algorithm. . . . . . . . . . . . . . . . . 18
2-4 SyCHOSys netlist for GCD circuit. . . . . . . . . . . . . . . . . . . . . . 18
2-5 Example SyCHOSys behavioral model: Mux2. . . . . . . . . . . . . . . . 19
2-6 Scheduling graphs for the GCD circuit. . . . . . . . . . . . . . . . . . . . 21
2-7 Scheduled component evaluation calls for the GCD circuit. . . . . . . . . 22
2-8 Transmission gate mux and latch designs. . . . . . . . . . . . . . . . . . 25
2-9 Alternative memory layouts for counting the bit transitions of an n-bit bus. 28
3-1 Vanilla Pekoe block diagram. . . . . . . . . . . . . . . . . . . . . . . . . 34
3-2 Vanilla Pekoe pipeline diagram. . . . . . . . . . . . . . . . . . . . . . . 35
3-3 Cache diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3-4 Branch mispredict rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3-5 Stall, kill, and alive control signals. . . . . . . . . . . . . . . . . . . . . . 41
3-6 Behavioral C code for pipeline stall and kill control. . . . . . . . . . . . . 43
3-7 Vanilla Pekoe CPI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4-1 Unoptimized energy per cycle. . . . . . . . . . . . . . . . . . . . . . . . 46
4-2 Datapath subunits diagram. . . . . . . . . . . . . . . . . . . . . . . . . . 47
4-3 Datapath clocking strategy. . . . . . . . . . . . . . . . . . . . . . . . . . 48
4-4 Clock and data activities for flip-flops and latches with various levels of
clock gating. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4-5 Clock gating energy savings. . . . . . . . . . . . . . . . . . . . . . . . . 51
4-6 Clock versus data energy with clock gating in flip-flops, latches, and branch
target adder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4-7 ALU adder and equality comparator clock gating. . . . . . . . . . . . . . 53
4-8 Register file read reduction successively applying precise read control,
bypass skip, and liveness gating. . . . . . . . . . . . . . . . . . . . . . . 54
4-9 Register file read reduction for various benchmarks. . . . . . . . . . . . . 54
4-10 Register file bit transition activity reduction using a modified storage cell. 55
4-11 Register file and bypass network energy reduction with successive
optimizations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4-12 Segmented PC latch clock and data activity. . . . . . . . . . . . . . . . . 57
4-13 Energy for PC components with segmentation. . . . . . . . . . . . . . . 58
4-14 Instruction cache energy consumption while eliminating intra-line-sequential
tag searches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4-15 Data cache hit rates including last-line and per-bank last-line hit rates. . . 61
4-16 Instruction cache hit rates including last-line and per-bank last-line hit rates. 61
4-17 Segmented Coprocessor-0 counter flip-flop clock and data activity. . . . . 63
4-18 Segmented counter and comparator components energy. . . . . . . . . . . 63
4-19 Flip-flop and latch structures. . . . . . . . . . . . . . . . . . . . . . . . . 64
4-20 Flip-flop and latch energy breakdowns for various structures. . . . . . . . 65
4-21 Flip-flop and latch clock and data activity in optimized datapath. . . . . . 66
4-22 Flip-flop and latch activity-sensitive selection energy savings. . . . . . . . 67
4-23 Optimized energy per cycle. . . . . . . . . . . . . . . . . . . . . . . . . 68
4-24 Energy savings with optimizations. . . . . . . . . . . . . . . . . . . . . . 68
4-25 Average energy savings breakdown. . . . . . . . . . . . . . . . . . . . . 69
5-1 Optimized energy per instruction. . . . . . . . . . . . . . . . . . . . . . 72
5-2 Optimized average energy breakdown. . . . . . . . . . . . . . . . . . . . 72
5-3 Datapath energy breakdown by component type. . . . . . . . . . . . . . . 75
5-4 Bit correlation for various nets. . . . . . . . . . . . . . . . . . . . . . . . 79
5-5 Bit correlation for various nets in different benchmarks. . . . . . . . . . . 80
5-6 Vanilla Pekoe memory map. . . . . . . . . . . . . . . . . . . . . . . . . 81
5-7 Bit correlations for selected nets using two different stack base addresses
(jpeg enc). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5-8 Energy using two different stack base addresses (jpeg enc). . . . . . . . . 83
5-9 Time-multiplexing versus spatial separation. . . . . . . . . . . . . . . . . 84
5-10 Bit transitions for selected nets split between: intrinsic transitions between
data values, intrinsic transitions between stack addresses, intrinsic
transitions between heap addresses, and time-multiplexing overhead. . . . 85
5-11 Bit transitions for various benchmarks split as in Figure 5-10. . . . . . . . 86
5-12 Split ALU and AA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5-13 Advanced ALU pipeline stages. . . . . . . . . . . . . . . . . . . . . . . 88
5-14 Sidecar address pipeline diagram. . . . . . . . . . . . . . . . . . . . . . 89
5-15 Sidecar pipeline CPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5-16 Sidecar pipeline energy consumption compared to Vanilla Pekoe. . . . . . 91
5-17 Bit correlations using a unified ALU or a split ALU and AA. . . . . . . . 92