Towards a More Principled Compiler: Register Allocation and Instruction Selection Revisited

David Ryan Koes

CMU-CS-09-157
October 2009

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Seth Copen Goldstein, Chair
Peter Lee
Anupam Gupta
Michael D. Smith, Harvard University

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2009 David Ryan Koes

This research was sponsored by the National Science Foundation under grant numbers CCF-0702640, CCR-0205523, EIA-0220214, and IIS-0117658; and Hewlett Packard under grant number 1010162. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.

Keywords: Compilers, Register Allocation, Instruction Selection, Backend Optimization

For Mary, Andrew, and Alex
But especially for Mary

Abstract

Backend optimizations are a critical part of an optimizing compiler. This thesis develops a principled approach for understanding, evaluating, and solving backend optimization problems. Our principled approach is to develop a comprehensive and expressive model of the backend optimization problem, and design solution techniques for this model that achieve or approach optimality. We apply our principled approach to the classical backend optimizations of register allocation and instruction selection.

We develop an expressive model of register allocation based on multi-commodity network flow. This model exactly represents the complexities of the target architecture. We design progressive solution techniques for our model. Progressive solution techniques quickly find an initial solution and then improve upon the solution as more time is allotted for compilation. Our progressive allocator allows the programmer to explicitly manage the trade-off between compile-time and code quality.
As more time is allowed for compilation, the resulting allocation approaches optimal, and substantial improvements in code quality are obtained.

We describe an expressive directed acyclic graph representation of the instruction selection problem and develop a near-optimal, linear-time algorithm that solves the instruction selection problem using this expressive model. Our principled approach to instruction selection results in significant improvements in code quality compared to traditional algorithms.

We evaluate our principled approaches to register allocation and instruction selection on a range of architectures and benchmarks. We achieve significant reductions in code size and increases in performance relative to previous approaches. Our results confirm that our principled approach is a major advance in the state of the art of backend optimization.

Contents

1 Introduction
  1.1 Problem Description
    1.1.1 Register Allocation
    1.1.2 Instruction Selection
  1.2 Contribution
2 Related Work
  2.1 Register Allocation
    2.1.1 Graph Coloring Register Allocation
    2.1.2 SSA Register Allocators
    2.1.3 Linear Scan Allocators
    2.1.4 Alternative Heuristic Allocators
    2.1.5 Optimal Register Allocation
    2.1.6 Limitations
    2.1.7 Summary
  2.2 Instruction Selection
3 Global MCNF Register Allocation Model
  3.1 Multi-commodity Network Flow
  3.2 Local Register Allocation Model
    3.2.1 Source Nodes
    3.2.2 Sink Nodes
    3.2.3 Allocation Class Nodes
    3.2.4 Crossbar Groups
    3.2.5 Instruction Groups
    3.2.6 Full Model
  3.3 Global Register Allocation Model
  3.4 Persistent Memory
  3.5 Modeling Costs
  3.6 Limitations
  3.7 Hardness of Single Global Flow
  3.8 Simplifications
  3.9 Summary
4 Evaluation Methodology
  4.1 Benchmarks
  4.2 Code Quality Metrics
    4.2.1 Code Size
    4.2.2 Code Performance
  4.3 Instruction Set Architectures
    4.3.1 x86-32
    4.3.2 x86-64
    4.3.3 ARM
    4.3.4 Thumb
  4.4 Microarchitectures
5 Heuristic Register Allocation
  5.1 Iterative Heuristic Allocator
    5.1.1 Algorithm
    5.1.2 Improvements
    5.1.3 Asymptotic Analysis
  5.2 Simultaneous Heuristic Allocator
    5.2.1 Algorithm
    5.2.2 Improvements
    5.2.3 Asymptotic Analysis
  5.3 Boundary Constraints
    5.3.1 Asymptotic Analysis
  5.4 Hybrid Allocator
  5.5 Compile Time
  5.6 Summary
6 Progressive Register Allocation
  6.1 Relaxation Techniques
    6.1.1 Linear Programming Relaxation
    6.1.2 Lagrangian Relaxation
  6.2 Subgradient Optimization
    6.2.1 Flow Calculation
    6.2.2 Step Update
    6.2.3 Price Update
    6.2.4 Price Initialization
    6.2.5 Summary
  6.3 Progressive Register Allocation
    6.3.1 Code Quality: Size
    6.3.2 Code Quality: Performance
    6.3.3 Optimality
    6.3.4 Compile Time
  6.4 Summary
7 Near-Optimal Linear-Time Instruction Selection
  7.1 Problem Description and Hardness
  7.2 NOLTIS
  7.3 0-1 Programming Solution
  7.4 Implementation
  7.5 Results
    7.5.1 Optimality
    7.5.2 Comparison of Algorithms
    7.5.3 Impact on Code Size
    7.5.4 Compile Time Performance
  7.6 Limitations and Future Work
  7.7 Interaction with Register Allocation
  7.8 Summary
8 Conclusion
Bibliography

List of Figures

1.1 The structure of a typical compiler.
1.2 Simple register allocation example.
1.3 An example of instruction selection on a tree-based IR.
2.1 The flow of a traditional graph coloring algorithm.
2.2 Live ranges and the corresponding interference graph.
2.3 An example of the simplify and select phases of a graph coloring allocator.
2.4 The linear ordering of basic blocks, live intervals, and lifetime holes.
2.5 Result of simple linear scan and second-chance binpacking linear scan.
2.6 Percent of functions which do not spill.
2.7 Decrease in code quality resulting from spill code and assignment heuristics.
2.8 The effect of various components of register allocation.
3.1 A simple example of a multi-commodity network flow problem.
3.2 A simple example of local register allocation.
3.3 Source nodes of a MCNF model of register allocation.
3.4 Sink nodes of a MCNF model of register allocation.
3.5 Crossbar groups for the local register allocation problem of Figure 3.2.
3.6 Two possible crossbar group network structures.
3.7 Instruction groups for the local register allocation problem of Figure 3.2.
3.8 The full MCNF model of the local register allocation problem of Figure 3.2.
3.9 A simple control flow graph.
3.10 The three types of flow nodes in the global MCNF model of register allocation.
3.11 Entry and exit groups of a global MCNF model of register allocation.
3.12 A crossbar group with nodes for anti-variables.
3.13 A network that demonstrates value modification, load remat. and anti-variables.
3.14 The accuracy of the code size global MCNF cost model.
3.15 Impact of single-execution costs on dynamic memory operations.
3.16 Impact on performance of varying single-execution costs.
3.17 Decrease in code quality when coalescing is separated from an optimal allocator.
3.18 An example of a reduction from global MCNF to minimum graph labeling.
3.19 Decrease in code quality when move insertion is restricted in an optimal allocator.
5.1 An example of the behavior of the iterative heuristic allocator.
5.2 A simple example of global variable usage.
5.3 The importance of block ordering in the iterative allocator.
5.4 The importance of tie breaking strategies in the iterative allocator.
5.5 Running time of iterative allocator for all benchmarked functions.
5.6 Example execution of the simultaneous heuristic allocator.
5.7 Example eviction decisions in the simultaneous heuristic allocator.
5.8 Effect of tie breaking heuristics on code quality in the simultaneous allocator.
5.9 An example control flow graph decomposed into traces.
5.10 Effect of trace decompositions on code quality in the simultaneous allocator.
5.11 Effect of trace update policy on code quality in the simultaneous allocator.
5.12 Running time of the simultaneous allocator for all benchmarked functions.
5.13 A CFG that illustrates the subtleties of setting boundary constraints.
5.14 Code size improvement of heuristic allocators.
5.15 Code size improvement of heuristic allocators.
5.16 Memory operation reduction of heuristic allocators.
5.17 Average code quality improvement of heuristic allocators.
5.18 Slowdown of various allocators relative to extended linear scan.
6.1 The percentage of functions that demonstrate an integrality gap.
6.2 Linear programming solution times of the global MCNF problem.
6.3 Convergence behavior of the basic subgradient optimization algorithm.
6.4 Convergence of subgradient optimization with different flow calculations.
6.5 Graphical depiction of five ratio step update rules.
6.6 Convergence of subgradient optimization with different step update rules.
6.7 Convergence of the subgradient optimization with Newton’s method step update.
6.8 Example price behavior using different price update strategies.
6.9 Convergence of subgradient optimization with different price update strategies.
6.10 Effect of price initialization on the initial lower bound.
6.11 Convergence of subgradient optimization with different price initializations.
6.12 Convergence of heuristic price initialization with different initial allocations.
6.13 The behavior of three heuristic allocators within a progressive allocator.
6.14 Average code size improvement of the progressive allocator.
6.15 Code size improvement of the progressive allocator.
6.16 Code size improvement of the progressive allocator.
6.17 Average memory operation reduction of the progressive allocator.
6.18 Average performance improvement of the progressive allocator.
6.19 Memory operation reduction of the progressive allocator.
6.20 Code performance improvement of the progressive allocator for x86-32.
6.21 Code performance improvement of the progressive allocator for x86-64.
6.22 Effect of block frequency estimator on code quality.
6.23 Code size optimality bounds of progressive allocator.
6.24 Code performance optimality bounds of progressive allocator.
6.25 Register allocation time breakdown of progressive allocator.
7.1 An example of instruction selection as a tiling problem.
7.2 Expressing Boolean satisfiability as an instruction selection problem.