Compile Time Task and Resource Allocation of Concurrent Applications to Multiprocessor Systems

Nadathur Rajagopalan Satish
Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2009-19
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-19.html

January 29, 2009

Copyright 2009, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Compile Time Task and Resource Allocation of Concurrent Applications to Multiprocessor Platforms

by
Nadathur Rajagopalan Satish

B.Tech. (IIT Kharagpur) 2003

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering and Computer Sciences in the Graduate Division of the University of California, Berkeley

Committee in charge:
Professor Kurt Keutzer, Chair
Professor John Wawrzynek
Professor Alper Atamtürk

Spring 2009

Abstract

Compile Time Task and Resource Allocation of Concurrent Applications to Multiprocessor Platforms

by Nadathur Rajagopalan Satish

Doctor of Philosophy in Engineering - Electrical Engineering and Computer Sciences
University of California, Berkeley
Professor Kurt Keutzer, Chair

Single-chip multiprocessors are now commonly present in both embedded and desktop systems. A key challenge before a programmer of modern systems is to productively program the multiprocessor devices present in these systems and utilize the parallelism available in them.
This motivates the development of automated tools for parallel application development. An important step in such an automated flow is allocating and scheduling the concurrent tasks to the processing and communication resources in the architecture. When the application workload and execution profiles are known or can be estimated at compile time, we can perform the allocation and scheduling statically at compile time. Many applications in signal processing and networking can be scheduled at compile time. Compile time scheduling incurs minimal overhead while running the application. It is also relevant to rapid design-space exploration of micro-architectures.

Scheduling problems that arise in realistic application deployment and design space exploration frameworks can encompass a variety of objectives and constraints. In order for scheduling techniques to be useful for realistic exploration frameworks, they must therefore be sufficiently extensible to be applied to a range of problems. At the same time, they must be capable of producing high quality solutions to different scheduling problems. Further, such techniques must be computationally efficient, especially when they are used to evaluate many micro-architectures in a design space exploration framework.

The focus of this dissertation is to provide guidance in choosing scheduling methods that best trade off flexibility against the solution time and quality of the resulting schedule. We investigate and evaluate representatives of three broad classes of scheduling methods: heuristics, evolutionary algorithms and constraint programming. In order to evaluate these techniques, we consider three practical task-level scheduling problems: task allocation and scheduling onto multiprocessors, resource allocation and scheduling data transfers between CPU and GPU memories, and scheduling applications with variable task execution times and dependencies onto multiprocessors. We use applications from the networking, media and machine learning domains to benchmark our techniques.
The above three scheduling problems, while all arising from practical mapping concerns, require different models, have differing constraints and optimize for different objectives. The diversity of these problems gives us a base for studying the extensibility of scheduling methods. It also helps provide a more holistic view of the merits of different scheduling approaches in terms of their efficiency and quality of solutions produced on general scheduling problems.

Professor Kurt Keutzer
Dissertation Committee Chair

Dedicated to my family.

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Challenges in the use of single-chip multiprocessors
  1.2 Bridging the implementation gap
  1.3 The mapping step
    1.3.1 Complexity of the scheduling problem
    1.3.2 The case for compile-time scheduling
    1.3.3 Methods for compile-time scheduling
  1.4 Application of compile-time methods to practical scheduling problems
  1.5 Contributions of the dissertation

2 Static Task Allocation and Scheduling
  2.1 Mapping streaming applications onto soft multiprocessor systems on FPGAs
    2.1.1 Examples of Streaming Applications
    2.1.2 The Mapping and Scheduling Problem
    2.1.3 Soft Multiprocessor Systems on FPGAs
    2.1.4 The need for automated mapping
  2.2 Automated Task Allocation and Scheduling
    2.2.1 Static Models
    2.2.2 Optimization Problem
  2.3 Techniques for Static Task Allocation and Scheduling
    2.3.1 Heuristic Methods
    2.3.2 Simulated Annealing
    2.3.3 Constraint Optimization Methods
    2.3.4 Lower bounds
  2.4 Results
    2.4.1 Benchmarks
    2.4.2 Comparisons on regular processor architectures
    2.4.3 Comparison on realistic architectural models
    2.4.4 Throughput estimation using makespan
  2.5 Comparing different optimization approaches

3 Resource allocation and communication scheduling on CPU/GPU systems
  3.1 Mapping applications with large data sets onto CPU-GPU systems
    3.1.1 Applications
    3.1.2 CPU/GPU systems
    3.1.3 The mapping step
  3.2 Task and Data transfer scheduling to minimize data transfers
    3.2.1 Static Models
    3.2.2 Optimization Problem
  3.3 Techniques for Static Optimization
    3.3.1 Previous Work
    3.3.2 Exact MILP formulation
    3.3.3 Decomposition-based Approaches
    3.3.4 Data transfer scheduling given a task order
    3.3.5 Finding a good task ordering
  3.4 Results
  3.5 Choice of Optimization Method

4 Statistical Models and Analysis for Task Scheduling
  4.1 Variability in application execution times
    4.1.1 IPv4 packet forwarding on a soft multiprocessor system
    4.1.2 H.264 video decoding on commercial multi-core platforms
  4.2 Statistical Models
    4.2.1 Application Task Graph
    4.2.2 Architecture Model
    4.2.3 Performance Model
    4.2.4 Optimization Problem
  4.3 Statistical Performance Analysis
    4.3.1 Valid allocation and schedule
    4.3.2 Formulation of the performance analysis problem
    4.3.3 Types of performance analysis
    4.3.4 Comparison of Statistical to Static Analysis
  4.4 The need for generalized statistical models and analysis
  4.5 Conclusions

5 Statistical Optimization
  5.1 Statistical task allocation and scheduling onto multiprocessors
  5.2 Techniques for Statistical Optimization
    5.2.1 Statistical Dynamic List Scheduling
    5.2.2 Simulated Annealing
    5.2.3 Deterministic Optimization Approaches
  5.3 Related Work
  5.4 Results
    5.4.1 Benchmarks