DYNAMIC PROGRAM ANALYSIS ALGORITHMS TO ASSIST PARALLELIZATION

A Dissertation
Presented to
The Academic Faculty

by

Minjang Kim

In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy in
Computer Science

School of Computer Science
Georgia Institute of Technology
December 2012

Copyright © 2012 by Minjang Kim

Approved by:

Dr. Hyesoon Kim, Advisor
School of Computer Science
Georgia Institute of Technology

Dr. Santosh Pande
School of Computer Science
Georgia Institute of Technology

Dr. Chi-Keung Luk
Technology Pathfinding and Innovations
Intel Corporation

Dr. Richard Vuduc
School of Computational Science and Engineering
Georgia Institute of Technology

Dr. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

Date Approved: August 16th, 2012

ACKNOWLEDGEMENTS

My Ph.D. would not have been successfully finished without the support and encouragement of my wife, Kyung Im. Besides this dissertation, our two kids, Dowan and Jiu, are our big accomplishment and great pleasure of these graduate years. My father, my mother, my father-in-law, and my late mother-in-law have always supported me and my family.

This work literally could not have been completed without the guidance of my advisor, Prof. Hyesoon Kim. From the very first Prospector project to the last moment of writing this thesis, she always pointed me in the right direction and complemented what I missed. I have learned so many valuable lessons from her: not only how to find good research topics and solve them analytically, but also how to be a good researcher and mentor.

I am also very fortunate to have worked with Dr. Chi-Keung Luk, who was a mentor during my internship and a co-advisor who helped shape my thesis, besides attending my defense talk in person. He guided me through many technical challenges to improve my algorithms.

I am grateful to all the dissertation committee members: Prof. Hsien-Hsin Lee, Prof. Richard Vuduc, and Prof. Santosh Pande. I particularly thank Prof. Pande for giving me a view from the compiler perspective and for sharing time in Korea. I thank Prof. Milos Prvulovic for guiding my early graduate research. I am also thankful to Bevin Brett and John Pieper from my internship in Nashua.

I thank all our lab members: Sunpyo for facing the same challenges of finishing the degree and finding a job; Jaekyu and Nagesh for setting up our lab machines in the early days and helping with many Linux-related problems; Hyojong and Chayoung for helping with the parallelization experiments for the post-analyzer; Pranith for the Parallel Prophet work and struggles; and Puyan for the challenging and interesting LLVM-Prospector work and for many enjoyable talks. I also thank my friends: Changhee and Sangmin for being great colleagues with whom to discuss the details of my work and theirs; Myungcheol for sharing many tricky situations in Ph.D. life; Min for settling down together in the US; and Donghwan, whose dissertation preface can be shared with mine, for his big help during the job search.

I thank the many families who lived together in the Tenth and Home apartments for years, especially Chung Hyuk’s, Hyojun’s, Hyungie’s, Jin-Kook’s, Kihwan’s, Min’s, Minsung’s, Seung-Joon’s, Taesu’s, and Tonghoon’s families.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY
I INTRODUCTION
    1.1 The Problem: Parallelizing Serial Code
    1.2 The Solution and Contributions: Prospector
    1.3 Thesis Statement
    1.4 Organization of This Proposal Document
II MOTIVATIONS AND OVERVIEW OF PROSPECTOR
    2.1 Motivations for an Efficient Loop Profiler
    2.2 Motivations for Dynamic Data-Dependence Analysis
        2.2.1 Case Studies: Automatic Parallelization in C/C++ Compilers
    2.3 Motivations for an Efficient Data-Dependence Profiler
    2.4 Motivations for Correct Data-Dependence Profiling
        2.4.1 Experimentation Results of Simple Sampling Techniques
    2.5 Motivations for Dynamic Parallel Speedup Prediction
    2.6 Motivations for a New Speedup Prediction Algorithm
    2.7 Background on Scheduling Policies of OpenMP and Cilk Plus
        2.7.1 Scheduling Policies in OpenMP
        2.7.2 Recursive and Nested Parallelism in OpenMP and Cilk Plus
    2.8 Motivations for Post-Analyzer of Dependence Profiler
    2.9 Overview of Prospector
III RELATED WORK
    3.1 Related Research on Loop Profiling
    3.2 Related Research on Data-Dependence Profiling
        3.2.1 Data-Dependence Profiling for Speculative Parallelization
        3.2.2 Reducing Time Overhead of Dynamic Analysis
        3.2.3 Limitations of Previous Compression Techniques
    3.3 Related Research on Dynamic Speedup Prediction
    3.4 Related Research on Post Analysis of Dependence Profiling
        3.4.1 Parallelism Discovery Using Dynamic Analyses
        3.4.2 Code Transformation to Avoid Dependences
        3.4.3 Bug-Fixing Algorithms in Concurrent Programming
        3.4.4 Parallelism Visualization
    3.5 Tools for Assisting Parallelization
        3.5.1 Intel Parallel Advisor
        3.5.2 Vector Fabrics’ Pareon
        3.5.3 Rogue Wave’s ThreadSpotter
IV AN EFFICIENT LOOP PROFILER
    4.1 Introduction
    4.2 An Efficient Loop-Profiling Algorithm
        4.2.1 Reconstructing CFGs and Loop Structures from Binary
        4.2.2 Instrumenting Loop-Behavior Instructions
    4.3 Challenges: Case Studies and Solutions
    4.4 Implementation
    4.5 Experimental Results
    4.6 Summary of This Chapter
V AN EFFICIENT DYNAMIC DATA-DEPENDENCE PROFILER
    5.1 Introduction
    5.2 The Baseline Algorithm: The Pairwise Method
        5.2.1 Checking Data Dependences in a Loop Nest
        5.2.2 Handling Loop-Independent Dependences
        5.2.3 Dependences in Functions and Handling Function Calls
        5.2.4 Computing Data-Dependence Distance
        5.2.5 Summary of the Pairwise Method
        5.2.6 Optimizing the Merge Operation: PC-set Optimization
        5.2.7 Problems of the Pairwise Method
    5.3 A Memory-Efficient Algorithm in SD3
        5.3.1 Overview of the Algorithm
        5.3.2 Dynamic Detection of Strides
        5.3.3 Stride-Based Dependence Checking Algorithm
        5.3.4 Overview of the Memory-Efficient SD3 Algorithm
        5.3.5 Optimizing Stride-Based Dependence Checking
        5.3.6 Merging Stride Tables for Loop Nests
        5.3.7 Handling Killed Addresses in Strides
        5.3.8 Lossy Compression in Strides
    5.4 Reducing Time Overhead by Parallelization
        5.4.1 Overview of the Algorithm
        5.4.2 A Hybrid Parallelization Model of SD3
        5.4.3 Event Distribution for Parallel Processing
        5.4.4 Strides in Parallelized SD3
        5.4.5 Details of the Data-Parallel Model
    5.5 Implementation
        5.5.1 Basic Architecture
        5.5.2 Implementation of Analyzer
        5.5.3 Implementation of Tracers
    5.6 Experimentation Results
        5.6.1 Experimentation Methodology
        5.6.2 Memory Overhead of SD3
        5.6.3 Time Overhead of SD3
        5.6.4 Input Sensitivity of Data-Dependence Profiling
        5.6.5 Opportunities for Stride Compression
    5.7 Summary of This Chapter
VI AN EFFECTIVE SPEEDUP PREDICTOR FROM SERIAL CODE
    6.1 Introduction
    6.2 The Front-end of Parallel Prophet: Annotation and Profiling
        6.2.1 Annotating Serial Code
        6.2.2 Interval Profiling to Build a Program Tree
        6.2.3 Memory Profiling: Burden Factors
        6.2.4 Summary of the Profiling
    6.3 The Backend of Parallel Prophet: The Emulators
        6.3.1 Fast-Forwarding Emulation Algorithm
        6.3.2 Challenges and Limitations in Fast-Forwarding Method
        6.3.3 Program-Synthesis-Based Emulation Algorithm: Basic Idea
        6.3.4 Challenges in the Synthesizer
        6.3.5 Summary of the Emulations
    6.4 Lightweight Memory Performance Model
        6.4.1 Assumptions
        6.4.2 The Performance Model
        6.4.3 The Definition of the Burden Factor
        6.4.4 Memory Access Overhead, w_t Prediction
        6.4.5 Details of the Microbenchmark for w_t Prediction
        6.4.6 Summary of the Memory Performance Model
    6.5 Implementation
        6.5.1 Implementation of Annotation System
        6.5.2 Implementation of Interval Profiling
        6.5.3 Compression of the Program Tree
    6.6 Experimentation Results
        6.6.1 Experimentation Methodologies
        6.6.2 Validation of the Prediction Model
        6.6.3 Prediction Results with Memory Performance Model
        6.6.4 A Simple Verification for the Memory Performance Model
        6.6.5 Detailed Cache and Memory Behaviors of the Benchmarks
        6.6.6 The Overhead of Parallel Prophet
        6.6.7 Limitations of Parallel Prophet
    6.7 Summary of This Chapter
VII ALGORITHMS TO EXTRACT PARALLELISM AND TRANSFORMATION ADVICE
    7.1 Assumptions for the Post-Analyzer
    7.2 Overview of the Post-Analyzer
        7.2.1 False Dependences and Privatization
        7.2.2 Reductions and Mutexes
        7.2.3 Supporting Condition Variables and Barriers
        7.2.4 Pipelining
        7.2.5 Ordering of Transformations
    7.3 Case Studies of Prospector
        7.3.1 179.art in SPEC CPU2000
        7.3.2 Susan in MiBench
        7.3.3 Dijkstra in MiBench
        7.3.4 256.bzip2 in SPEC CPU2000
VIII CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS
    8.1 Conclusions
    8.2 Future Research Directions
        8.2.1 Future Work for SD3
        8.2.2 Future Work for Parallel Prophet
        8.2.3 Future Work for the Post-Analyzer
REFERENCES