Online Adaptive Code Generation and Tuning

Ananta Tiwari, Jeffrey K. Hollingsworth
Department of Computer Science, University of Maryland, College Park, MD 20742
Email: [email protected], [email protected]

Abstract: In this paper, we present a runtime compilation and tuning framework for parallel programs. We extend our prior work on our auto-tuner, Active Harmony, for tunable parameters that require code generation (for example, different unroll factors). For such parameters, our auto-tuner generates and compiles new code on-the-fly. Effectively, we merge traditional feedback-directed optimization and just-in-time compilation. We show that our system can leverage available parallelism in today's HPC platforms by evaluating different code-variants on different nodes simultaneously. We evaluate our system on two parallel applications and show that our system can improve runtime execution by up to 46% compared to the original version of the program.

Keywords: Auto-tuning; Parallel Search; Active Harmony

I. INTRODUCTION

Today's complex and diverse architectural features require applying nontrivial optimization strategies to scientific codes to achieve high performance. As a result, programmers usually have to spend countless hours rewriting and tuning their codes. Furthermore, a code that performs well on one platform often faces bottlenecks on another; therefore, the tuning process must be largely repeated to port the code to a new computing architecture.

Our research focuses on automating this tedious and error-prone process of tuning and porting parallel applications. In our earlier work [23], we showed that for well-defined benchmark kernels (such as matrix multiplication), compiler-based offline auto-tuning can deliver significant improvements over the optimizations offered by native compilers. However, for full applications, it is not as effective. Depending on the input dataset, a given parallel application can have vastly different execution profiles. Input datasets can specify the physical domain, solver type(s), solver parameters, discretization order, and so on. Taking an offline tuning approach to tune for all possible computational bottlenecks is not a tractable goal. Instead, on-demand tuning during production execution is a desirable approach. This on-demand approach also benefits from the availability of real-time performance data, which can be linked back to specific code sections and architecture-specific features.

Development of an online auto-tuner, however, presents its own set of challenges. Managing the cost of the search process and the cost of generating and compiling code-variants on-the-fly are two daunting challenges that runtime auto-tuners face. Addressing these challenges and making online tuning practical is the topic of this paper. Our goal is to enable application developers to write applications once and have the auto-tuner adjust the application execution automatically when run on new systems. Our work can be viewed as the merger of traditional compiler techniques and just-in-time compilation.

We take a search-based collaborative approach to auto-tuning. Our system, Active Harmony, allows application programmers, compilers, library writers and performance modelers to describe and export a set of performance-related tunable parameters. These parameters define a tuning search-space. More often than not, this search-space is high-dimensional and exponential in size and thus cannot be explored manually. Our system monitors the program performance and makes adaptation decisions. The decisions are made by a central controller using a parallel search algorithm. The parallel search algorithm leverages parallel architectures to search across a set of optimization parameter values. Different nodes of a parallel system evaluate different configurations at each timestep.

In Active Harmony vocabulary, harmonization refers to the process of adding "hooks" to an application (using the Active Harmony API) to make it tunable. The process involves making fairly small changes to the application code to export tuning options to the server. The phrase harmonized application refers to an application that uses Active Harmony to adapt its execution.

To summarize, in this paper, we make the following contributions:
1) An online auto-tuner that can recompile programs during a single execution.
2) Support for tuning multiple code-sections simultaneously.
3) Refinements of our parallel search algorithm to include a penalization technique.
4) A system design to support runtime code generation and code replacement for large-scale parallel applications.
5) An empirical analysis showing that even when training runs are not available, our system can improve the performance of an application within a single execution.

II. CHALLENGES AND REQUIREMENTS

In this section, we review some key challenges that runtime auto-tuners face and the features of Active Harmony that address them.

A. Minimal overhead

The costs of using an online tuning system must be minimal. Otherwise, such costs can overshadow any benefit realized in application performance. If the performance of the harmonized code is better than (or at least not worse than) that of the untuned version of the code, the minimal-overhead objective is achieved.

B. Avoid "bad" regions in the search-space

Our earlier experience [8], [23] has shown that tuning search-spaces often exhibit a clear demarcation between "good" and "bad" regions. We observed that there are frequently many good points near the optimal point and that there is also often another region where the application performance is relatively worse. Such a topography argues for online auto-tuning systems that rapidly get applications out of bad regions of performance and into the good ones. Achieving optimal performance is a secondary goal once good performance is reached.

C. Need to coordinate tuning for multiple code-sections

Auto-tuning for full applications typically involves tuning multiple code-sections from several libraries simultaneously. Therefore, an important question is how to coordinate this process. Left uncoordinated (for example, if each library had an embedded auto-tuner), it would be impossible to tell which change (and on what code-section) was actually improving the overall performance of the program. In fact, it is likely that one change might improve performance while a second hurts performance, so that the net performance benefit is little or none. A variety of approaches are possible, ranging from simple arbitration that ensures only one code-section is tuned at a given instant, to a fully unified system that allows a coordinated simultaneous search of parameters originating from different code-sections.

In the Active Harmony project, we take the approach of a coordinated system that allows simultaneous tuning of multiple code-sections. We accomplish this by having each code-section expose its tunable parameters via a simple but expressive constraint language (discussed below). A central search engine then uses these constraints to manage the evaluation of possible auto-tuning steps.

D. Constraint language

An important aspect of searching parameters is to allow the precise specification of valid parameters. Such a specification could be as simple as expressing the minimum, maximum and initial values for a parameter. Sometimes not all parameter values within a range should be searched, so an option to specify a step function is useful. Likewise, for a parameter with a large range (e.g. a buffer that could be from 1K to 100 megabytes), it is useful to specify that the parameter should be searched based on the log of its value. Another critical factor is that not all parameters are independent. Frequently, there is a relationship between parameters (e.g. when considering tiling a two-dimensional array, it is often useful to have the tiles be rectangles with a definite aspect ratio). To meet the need for a fully expressive way for developers to define parameters, we have developed a simple and extensible language (Constraint Specification Language, CSL) that standardizes parameter-space representation for search-based auto-tuners. CSL allows tool developers to share information and search strategies with each other. Meanwhile, application developers can use CSL to export their tuning needs to auto-tuning tools.

CSL provides constructs to define tunable parameters and to express relationships between those parameters. Dependencies between different parameters can be easily specified using mathematical expressions. CSL supports a fairly comprehensive list of mathematical, logical and relational operators. Hints to the underlying search algorithm, in the form of initial points to start the search, default values for parameters, and simple constraints on parameters such as MPI message-sizes, number of OpenMP threads, etc., can be easily expressed using CSL. Information from performance models can be specified in the form of constraints to guide the search. Furthermore, parameters can also be grouped into different categories to allow the application of similar tuning strategies. This is particularly helpful when there are multiple code-sections that benefit from the same optimization. We provide a simple parameter specification example in Table I. In this example, the search space consists of three parameters: x, y and z. The relationships between these parameters are expressed using two constraint definitions, cone and ctwo. The full CSL grammar is provided in the Appendix.

TABLE I
A SIMPLE CSL SPECIFICATION

search space simple {
    # parameter definitions
    parameter x int {
        range [1:8:1];
    }
    parameter y int {
        range [1:8:1];
    }
    parameter z int {
        range [1:8:1];
    }
    # constraints
    constraint cone {
        x+z>=z;
    }
    constraint ctwo {
        y>z;
    }
    # putting everything together
    specification {
        cone && ctwo;
    }
}

III.
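To make concrete what a search-based tuner does with such a specification, the following Python sketch (our illustration, not Active Harmony code; `admissible_range` and `admissible_points` are hypothetical helpers) enumerates the Table I space and filters out points that violate the constraints:

```python
# Minimal sketch of expanding a CSL-like specification into the set of
# admissible configurations.  Parameter and constraint names mirror the
# Table I example; this is not the Active Harmony implementation.
from itertools import product

def admissible_range(lo, hi, step=1, log2=False):
    """Enumerate admissible values: [lo:hi:step], or powers of two when
    a log-scale search is requested (useful for very large ranges)."""
    if log2:
        v = lo
        while v <= hi:
            yield v
            v *= 2
    else:
        yield from range(lo, hi + 1, step)

def admissible_points(ranges, constraints):
    """Cross-product of the per-parameter ranges, filtered by constraints."""
    names = list(ranges)
    for values in product(*ranges.values()):
        point = dict(zip(names, values))
        if all(check(point) for check in constraints):
            yield point

# Table I: x, y, z each range over [1:8:1]; constraints cone and ctwo.
ranges = {name: list(admissible_range(1, 8)) for name in ("x", "y", "z")}
cone = lambda p: p["x"] + p["z"] >= p["z"]   # as written in Table I
ctwo = lambda p: p["y"] > p["z"]
space = list(admissible_points(ranges, [cone, ctwo]))
```

For a parameter with a very large range, such as the 1K to 100 megabyte buffer mentioned above, the log-scale option keeps the enumerated set tractably small (here, powers of two).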
PARAMETER TUNING ALGORITHM

A key to any auto-tuning system is how it goes about selecting the specific combinations of parameters to try. While a simple parameter space might be exhaustively searched, most applications contain too many combinations to try them all. Instead, an auto-tuning system must rely on efficient search algorithms to evaluate only a subset of the possible configurations while trying to find an optimal one (or at least as nearly optimal as practical).

The algorithm that we use is based on the Parallel Rank Order (PRO) algorithm proposed by Tabatabaee et al. [22]. For a function of N variables, PRO maintains a set of K points (where K is at least N+1 and is usually set to the number of cores the harmonized application is run on) forming the vertices of a simplex in an N-dimensional space. Each simplex transformation step of the algorithm generates up to (K-1) new vertices by reflecting, expanding, or shrinking the simplex around the best vertex. After each transformation step, the objective function value f associated with each of the newly generated points is calculated in parallel. The reflection step is considered successful if at least one of the (K-1) new points has a better f than the best point in the simplex. If the reflection step is not successful, the simplex is shrunk around the best point. A successful reflection step is followed by an expansion check step. If the expansion check step is successful, the expanded simplex is accepted. Otherwise, the reflected simplex is accepted and the search moves on to the next iteration. The search stops if the simplex converges to a point in the search space (or after a pre-defined number of search steps). A graphical illustration of the reflection, expansion and shrink steps is shown in Figure 1 for a 2-dimensional search space and a 4-point simplex.

Fig. 1. Original 4-point simplex in a 2-dimensional space, along with the simplex transformation steps.

Note that before computing all expansion points, we check the outcome of the expansion for only the most promising case first. The most promising point (shown in Figure 1-(c)) is the point in the original simplex whose reflection around the best point returns a better function value. This seems counter-intuitive at first glance, since we are not taking full advantage of the parallelism. However, in our experiments, we realized there are some expansion points with very poor performance that can slow down the algorithm (for example, setting a tile size to 0). Therefore, to avoid these time-consuming instances and to ensure good transient behavior of the search algorithm, we calculate the expansion point performance for the most promising case first and, only if it is successful, perform a full expansion of the simplex.

One of the unique features that distinguishes PRO from other algorithms used in auto-tuning systems is that the algorithm leverages parallel architectures to search across a set of optimization parameter values. Multiple, sometimes unrelated, points in the search space are evaluated at each timestep. For online tuning of SPMD-based parallel applications, different nodes participating in the tuning process can evaluate different versions of given loop-nests. We depict this graphically in Figure 2-(a); more description of the figure appears in the next section. [Footnote 1: There are, of course, some parameters that are global (such as data decomposition). These are searched sequentially.]

Parameter tuning is a constrained optimization problem. Therefore, after each simplex transformation step, we have to make sure that the points returned by the search algorithm for evaluation are admissible, i.e. that they satisfy the constraints. We consider two types of parameter constraints: internal discontinuity constraints and boundary constraints. Internal discontinuity constraints arise because some tuning parameters can only have discrete admissible values (e.g. integer parameters). For these constraint violations, the computed parameters are rounded to an admissible discrete value. To handle boundary constraints, we use a penalization method. We add a penalty factor to the performance metric associated with the points that violate the constraints. The idea is to discourage the simplex from moving towards illegal regions of the search space. This approach has been used previously in the context of constrained optimization using genetic algorithms [11]. In all of the experimental results presented in this paper we use this method because it is simple and light-weight. This feature of Active Harmony is new and has not been considered in our previous work.

IV. SYSTEM DESIGN

In this section, we describe our online auto-tuning approach for parameters that require new code and present our system design. Examples of parameters that require new code include loop unrolling factors, loop fusion and split parameters, and data-copy parameters. With the addition of this capability, Active Harmony supports the development of self-tuning applications that include runtime code transformation and replacement.

To make runtime auto-tuning practical, the key issue that needs to be addressed is the efficient runtime management of the process of generating, compiling, and maintaining a set of alternative implementations and searching among them. A given loop-nest generally requires more than one flavor of transformation strategy. As the number of transformations increases, the number of alternative code-variants grows exponentially. A brute-force approach of generating all possible combinations is thus not practical. Instead, our approach generates code variants on demand. This selective approach requires an efficient code generator that can produce correct code during runtime.

Developing an efficient code transformation framework that ensures correctness is a large area of research in and of itself. Recently, a variety of compiler-based code transformation frameworks [7], [10], [4], [17] have been developed to give programmers and compiler experts greater control over what optimization strategies to use for complex codes. These frameworks are designed to facilitate manual exploration of the large optimization space of possible compiler transformations and their parameter values. Each of these frameworks has its strengths and weaknesses and, as such, it is desirable to design a system that can easily select among the available tools based on our code transformation requirements.

Active Harmony relies on a standalone code-generation utility (or code-servers) for on-demand code generation. Here we describe the two most important features of this utility. First, the design of the code-servers allows users to easily select and switch between available code transformation tools. We separate the search-based navigation of the code-transformation parameter space from the code-generation process, which allows us to easily switch between different underlying code-generation tools (e.g. if we are tuning CUDA code, we can switch to a code-transformation framework that supports GPUs via CUDA or OpenCL). Second, our code-generation utility can take advantage of idle (possibly remote) machines for distributed code-generation and compilation. Users provide a set of available machines at the start of the tuning session. These machines do the actual code-generation work. Once all code-variants are generated, the compiled code-variants are transported to the scratch filesystem of the parallel machine where the application being tuned is executing. After the code-generation is complete, our code-generation utility notifies the Active Harmony server about the status.

Figure 2-(a) shows a schematic diagram of the workflow within our online tuning system. Figure 2-(b) shows the application-level view of the tuning process. At each search step, the Active Harmony server issues a request to the code-servers to generate code variants with a given set of parameters for loop transformations. The code-variants that are generated are compiled into a shared library (denoted as v_N.so in Figure 2-(b)). Once the code-generation is complete, the application receives a code-ready message from the Active Harmony server. The nodes allocated to the parallel application then load the new code using the dlopen-dlsym mechanism. The new code is executed, and the measured performance values (denoted as PM_N in Figure 2-(b)) are consumed by the Active Harmony server to make simplex transformation decisions. The timing of the actual loading of new code is determined by hooks (inserted using the Active Harmony API) in the application. For example, in most programs, we load new code only on timestep boundaries.

Fig. 2. Fig. 2-(a) shows the overall online tuning workflow. Fig. 2-(b) shows the application-level view of the auto-tuning workflow.

Preparing an application for auto-tuning starts with outlining the compute-intensive code-sections into separate functions. We then insert appropriate calls to the outlined functions using function pointers. These function pointers are updated when new codes become available. Currently, the code-sections are outlined manually. In the future, we intend to automate this process using the ROSE compiler framework [12]. Each node running the application keeps track of the best code-variant it has seen thus far in the tuning process. If the code-server fails to deliver new versions on time, the nodes continue their execution with the best version that they have discovered up to that point in the tuning process. The period when no new code is available is referred to as search_stall (see Figure 2-(b)). The non-blocking relationship between application execution and dynamic code-generation is important in minimizing the online tuning overhead. The application does not have to wait until the new code becomes available. Furthermore, this asynchronous relationship enables our auto-tuner to exercise control over which code-generation utility to use, how many parallel code-servers to run, and how many code-variants to generate in any given search iteration. The policy decisions about which code-variants to generate and evaluate at each iteration are made completely by the centralized tuning server.

V. EXPERIMENTS

In this section, we present an experimental evaluation of our framework. First, we conduct a study using a Poisson's equation solver program as a test application to determine the least number of parallel code-servers needed to ensure that the search_stall phase does not dominate the tuning workflow. Second, we use two parallel applications to demonstrate the effectiveness of our system on three different computing platforms. We compare the performance of harmonized applications with that of the original applications compiled with the vendor-suggested highest level of optimization flags turned on. Moreover, once the harmonized application is done with its execution, we take the best code-variants returned by the Active Harmony server and run the application using those code-variants. We call these runs post-harmony runs.

Active Harmony utilizes the first search iteration to generate uniformly distributed random configurations from the search space. These configurations are evaluated in parallel. We call this iteration the exploratory iteration. [Footnote 2: Note that this randomized method can sometimes select points in the search space that have poor performance. We pay this penalty for one iteration at the beginning of the search to gather some knowledge about the search space. The cost of this step is usually amortized within the first few PRO steps.] The best among these configurations then serves as the starting point for the initial simplex construction. For all our experiments, unroll factors and tile sizes are constrained by the storage capacity of their associated memory hierarchy levels.

To control for performance variability, we use the multiple sampling method. [Footnote 3: Besides the tunable parameters, there are many other factors affecting a program's performance. Therefore, even for a fixed set of tunable parameters, the application performance varies in time. Network contention, operating system jitter, and memory architecture complexity are common sources of performance variability.] Each configuration is evaluated twice (i.e. the performance of two consecutive timesteps is recorded) and the minimum of the two samples is sent to the Active Harmony server. We showed in our previous work [22] that even in the presence of 5% variability due to background noise, taking the minimum of two samples is enough to ensure the convergence of the search algorithm.

B. Code-generation utility

For all the experimental results presented in this paper, we use CHiLL [7], a polyhedra-based compiler framework, to generate code-variants. As we discussed in Section IV, we can use any available code-generation utility within our system. We chose CHiLL because it provides a convenient high-level script interface that allows compilers and application programmers to use a common interface to describe parameterized code transformations to be applied to a computation (see Tables III and IV). In CHiLL nomenclature, these scripts are called "recipes". Besides making it easy to interface with the code-generation utility, these code transformation recipes offer an additional advantage. Unlike traditional compiler optimizations, which must be coded into the compiler, these recipes can be evolved and reused over time. A recipe library, created by compiler experts and developers based on their experience working with real codes, can then be consulted by auto-tuners to tune arbitrary loop-nests.

For the experiments on the umd-cluster, the code-generation and compilation is delegated to idle local machines. For the experiments on Carver, the code-generation is out-sourced to an eight-core 64-bit x86 machine at UMD (i.e. code is generated just in time and shipped across the continental United States). This was done because the Carver scheduler does not yet permit synchronized (co-scheduled) jobs, which meant that we could not launch a code-generation job simultaneously with the application job. For the experiments on Hopper, the code-generation and compilation take place on the login nodes.

A.
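As a toy illustration of the code-variant generation step (this stands in for, and greatly simplifies, what a recipe-driven tool such as CHiLL produces; `emit_unrolled_sum` is our hypothetical helper, not part of any of these tools), the sketch below emits C source for one unroll factor of a reduction loop. A real code-server would compile such a variant into a shared library (e.g. v_N.so) for the application to load via dlopen/dlsym:

```python
# Toy stand-in for a code-server's variant generation: given an unroll
# factor, emit C source for an unrolled loop plus a remainder loop.
def emit_unrolled_sum(unroll: int) -> str:
    # One accumulation statement per unrolled iteration.
    body = "\n".join(f"        s += a[i + {k}];" for k in range(unroll))
    return f"""double sum(const double *a, int n) {{
    double s = 0.0;
    int i;
    for (i = 0; i + {unroll} <= n; i += {unroll}) {{
{body}
    }}
    for (; i < n; i++)   /* remainder loop */
        s += a[i];
    return s;
}}
"""

variant = emit_unrolled_sum(4)
```

The tuning server would request one such variant per candidate unroll factor at each search iteration, which is why generating variants on demand, rather than exhaustively, matters.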
Platforms The experiments were performed on three platforms. The C. Calculating the (cid:147)net(cid:148) speedup (cid:2)rst platform is a 64-node Linux cluster (henceforth referred Our runtime tuning strategy uses extra cores to generate to as umd-cluster). Each node is equipped with dual Intel and compile new code. Ideally, a fair comparison would be Xeon2.66GHz(SSE2)cores(withHyper-Threadingenabled). between the execution time of the harmonized application to Nodes are connected via a Myrinet interconnect. The second that of the original application run on N +C cores, where h platform (at NERSC [6]) is named Carver, which is an IBM N is the number of cores the harmonized application is run h iDataPlexsystemwith400computenodes.Eachnodecontains on and C is the number of cores used for code-generation. two quad-core Intel Nehalem 2.67 GHz cores (3,200 cores However, this is not always possible due to application’s total in the machine). Nodes are connected via a 4X QDR data distribution semantics (for example, the application may In(cid:2)nibandinterconnect.Thesearchitecturesaredifferentfrom require the number of cores to be a power of 2). Instead, each other not only in terms of the core architectures (cid:151) to account for these extra cores, we calculate a new metric Carver’s cores are several generations newer (cid:151) but also in (cid:151) net speedup. We de(cid:2)ne charge factor (equation 1) as the terms of the interconnect used to connect the nodes. Finally, ratio of the number of cores used to run the application and the third platform, which is named Hopper4, is a Cray XT5 the total number of cores used for both code-generation and machineatNERSC.Eachnodeconsistsoftwo2.4GHzAMD harmonized application execution. Opteron Shanghai quad-core cores (5,312 cores total in the N h machine). Nodes are connected via a Seastar2 interconnect. 
c:f:= (1) N +C h Wethenmultiplythespeedupofharmonizedapplicationsover 2Note that this randomized method can sometimes select points in the searchspacethathavepoorperformance.Wepaythispenaltyforoneiteration the original applicationby this chargefactor to derive the net at the beginning of the search to gather some knowledge about the search speedup. space. The cost of this step is usually amortized within the (cid:2)rst few PRO steps. D. Code-server sensitivity 3Besides the tunable parameters, there are many other factors affecting a program’sperformance.Therefore,evenfora(cid:2)xedsetoftunableparameters, With the experimental results presented in this section, the application performance varies in time. Network contention, operating we attempt to answer the following question (cid:151) how many system jitter, and memory architecture complexity are common sources of parallel code-servers are needed to ensure that the auto-tuner performance variability. 4HopperhassincebeenupgradedtoaCrayXE6with 153,408cores. does not have to wait for too long before the new code is 5 Sensitivity analysis − Comparison of the search Evolution for different number of code server 2 1 code−server Stalls 2 code−servers Stalls 1.5 4 code−servers s) 1 Stalls e ( m n ti0.5 atio 0 50 100 150 200 250 300 350 400 450 500 er −it 2 er 8 code−servers P 12 code−servers 1.5 16 code−servers 1 0.5 0 50 100 150 200 250 300 350 400 450 500 Iterations Fig.3. Sensitivity results demonstrating howthechange inthenumberofcode-servers affects thesearch evolution ready? A related question is (cid:151) How often is the system in pointin thesearchspace,theload oncode-serversgoesdown the search_stall phase? These questions are important (i.e., more points in the simplex become identical). 
The data becauseifthesearch_stallphaseislong,theapplication in column 4 of table II shows the number of unique code- canpossiblycontinuewithmediocreparametercon(cid:2)gurations variantsevaluatedindifferentexperiments.Wecanseethatas for extended periods of time. the average number of search iterations goes from 6 for runs The experiments were conducted on the umd-cluster using using 1 code-server to 17 for runs using 2 code-servers, the thePESapplication(describedinsectionV-E1).Wecontrolled averagenumberofuniquecode-variantsgoesupbyonly208. the input problem size (10243) and the number of cores Our end goal with this experiment was to set a min- runningthe application(128).All 128 cores participatein the imum number of code-servers required to ensure a short tuningprocess,whichmeansateachsearchstep,code-servers search_stall phase for the rest of the experiments. Hav- havetogenerateandcompileupto128code-variants.Thisisa ingsaid that, we acknowledgethat thereareotherfactorsthat typicalnumberofcode-variantsrequiredpersearchiterationin can play important roles in setting this minimum. In one of alltheexperimentsreportedinthispaper.Wevarythenumber our preliminary experiments, we discovered that the size of of code-servers running in parallel and record the average the search space can also dictate this minimum. When we numberoftime-stepsthattheapplicationhadtocontinuewith ran the experiments with small tuning space, the exploratory the old code. We call this metric (cid:147)stalled(cid:148) iterations. iteration (initial random search of points) was able to (cid:2)nd Figure 3 shows how per-iteration performance measure- good parameter con(cid:2)gurations. In this case, the number of ments change over time for auto-tuningconductedwith alter- stalled iterations did not matter because applications were natenumberofcode-servers.Longstretchesofconsecutiveap- already executing with good con(cid:2)gurations. 
Moreover, the plicationtimestepswithnoperformanceimprovement(marked number of minimum code-servers can (and most probably by arrows) in experiments conducted with 1, 2 and 4 code- will) change if we switch between different code-generation servers indicate that the application continued its execution tools. Currently, we are looking into more robust ways to with poor con(cid:2)gurations for several timesteps. The same is account for these issues and to derive the value for the not true for tuningconductedwith 8, 12 and 16 code-servers. minimum required parallel code-servers. For the rest of the Table II summarizes the results for this experiment. We auto-tuningexperiments we describe in this paper, we used 8 directthereaderstocolumns3and5.Asthenumberofcode- code-servers. serversisincreasedfrom1to4,theaveragenumberofstalled E. Subject applications, tuning strategies and results iterationsgoesdownsigni(cid:2)cantly.Thisistobeexpected.What is surprising is that the addition of extra code-servers from 8 1) Poisson’sEquationSolver(PES): Poisson’sequationisa to16doesnotsigni(cid:2)cantlychangetheapplicationspeedupor partial differential equation that is used to characterize many the number of stalled iterations. The reason for this is that processes in electrostatics, engineering, (cid:3)uid dynamics, and as the search algorithm evolves and starts converging to a theoreticalphysics.TosolveforPoisson’sequationonathree- 6 TABLEII PES Application (64 cores, umd−cluster) SENSITIVITYEXPERIMENTRESULTS 2 #ofcode Avg.#of Avg.#of Avg.#ofcode Avg. 
1.8 servers searchiters stallediters evaluated speedup 1 6* 46 502 0.75 2 17* 13 710 0.97 1.6 4 27 7.18 928 1.04 nt 8 23 4.48 818 1.23 e1.4 m 12 22 4.06 833 1.21 e v 16 26 3.59 931 1.24 pro1.2 *-search algorithmdidnotconverge Im 1 Aggregate plot of the worst per−iteration performance (PES, 128 cores, umd−cluster) post−harmony 0.8 harmonized net speedup 2 512 576 640 704 768 832 896 worst perf (108) (184) (254) (341) (428) (536) (680) Problem−domain (cubed) 1.5 e (sec) Original Perf (0.84 sec) pFoigs.t-h5a.rmoPneyrforurmnaonfcteheimsoplrvoevrem(6e4ntcoorfehruanrmoonniuzmedd-PclEuSst,ern)et speedup, and m 1 Ti using the dynamiccode-generationmethod. For this function, 0.5 wetileallthreeloopsandtheinnermostloopisunrolled.The searchspaceis,thus,six-dimensional(twotunableparameters for the relaxation function and four for the error function). 00 100 200 300 400 500 All cores allocated to the application participate in the tuning Application Iterations process. Thus, a 128-corerun of this application evaluates up to 128 tiling con(cid:2)gurations simultaneously for the relaxation Fig.4. Aplotforaggregate worsttiming ateach iteration. function and up to 128 loop-variants simultaneously for the errorfunctioninasinglesearchstep.Theoptimizationstrategy dimensional grid, we use a modi(cid:2)ed version of the parallel (expressed in terms of the CHiLL recipe) along with the implementationprovidedintheKeLP-1.4distribution[1].The constraintsforunboundtunableparametersisprovidedintable applicationiswritteninC++andFortran.Theimplementation III.Fora5123 problemsizerunon64-cores,thesearchspace uses the redblack successive over relaxation method to solve has approximately3:11(cid:2)107 possible con(cid:2)gurations. the equation. The core of the computational time is spent on We performed two sets of auto-tuning experiments (cid:151) one therelaxationfunction,whichusesa7-pointstenciloperation, using 64 cores and one using 128 cores. 
The original implementation of the relaxation operation uses the first half of one iteration to update "red" array points and the second half of the iteration to update the "black" points. In the adapted version of the application, we use the fused version of the relaxation operation. The fused version orders the loop iterations so that black points in each column are updated immediately after the red points in the next column and vice versa [20]. This fused version (see Table III) serves as the baseline for comparing the net speedup of harmonized PES.

Our tuning strategy for this application combines symbolic parameter tuning and tuning with dynamic code-generation. (Symbolic tuning refers to tuning for parameters that are symbolic, i.e. no new code is necessary to move between parameter values.) For the relaxation function, we use symbolic tuning: we tile the two outermost loops and use Active Harmony to determine the dimensions of the tiles. The error function is optimized using the dynamic code-generation method. For this function, we tile all three loops and the innermost loop is unrolled. The search space is, thus, six-dimensional (two tunable parameters for the relaxation function and four for the error function). All cores allocated to the application participate in the tuning process. Thus, a 128-core run of this application evaluates up to 128 tiling configurations simultaneously for the relaxation function and up to 128 loop-variants simultaneously for the error function in a single search step. The optimization strategy (expressed in terms of the CHiLL recipe) along with the constraints for unbound tunable parameters is provided in Table III. For a 512^3 problem size run on 64 cores, the search space has approximately 3.11 × 10^7 possible configurations.
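The size of such a constrained space is easy to bound by enumeration. The sketch below counts admissible tile-size pairs drawn from {0, 4, 8, ..., prob_size} under the ordering constraint TI >= TJ used in Table III. The full constraint set (cache-capacity bound, register-size bound on the unroll factor) is richer, so this illustrates the counting rather than reproducing the 3.11 × 10^7 figure.

```c
/* Count (TI, TJ) pairs with TI, TJ in {0, 4, 8, ..., max} and TI >= TJ.
 * With v = max/4 + 1 admissible values per parameter, the constrained
 * count is v*(v+1)/2 instead of v*v: ordering constraints like those in
 * Tables III and IV prune the space substantially. */
long count_tile_pairs(int max)
{
    long n = 0;
    for (int ti = 0; ti <= max; ti += 4)
        for (int tj = 0; tj <= max; tj += 4)
            if (ti >= tj)
                n++;
    return n;
}
```

For example, with max = 512 there are 129 values per parameter, so 129 * 130 / 2 = 8385 admissible pairs instead of 16641 unconstrained ones; the paper's full six-dimensional count arises from the product of such constrained per-kernel counts.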
TABLE III
PES AUTO-TUNING

Kernel: relaxation
Original code:
  do kk = wl2-1, wh2
    do k = kk+1, kk, -1
      if ((k.le.wh2).and.(k.ge.wl2)) then
        do j = wl1, wh1
          do i = wl0+mod(kk+j+1,2), wh0, 2
            u(i,j,k) = c * (u(i-1,j,k) + u(i+1,j,k)
                     + u(i,j-1,k) + u(i,j+1,k)
                     + u(i,j,k-1) + u(i,j,k+1) - c2*b(i,j,k))
Transformation strategy/recipe: manual tiling of the i and j loops (TI1, TJ1)
Search constraints: TI1 × TJ1 <= (cache size)/2; TI1 >= TJ1;
  TI1 in [0; 4; ...; prob_size]; TJ1 in [0; 4; ...; prob_size]

Kernel: error
Original code:
  do k = 1, N
    do j = 1, N
      do i = 1, N
        du = c*(u(i-1,j,k) + u(i+1,j,k)
              + u(i,j-1,k) + u(i,j+1,k)
              + u(i,j,k-1) + u(i,j,k+1) - c2*u(i,j,k))
        r = b(i,j,k) - du
        err = err + r*r
Transformation strategy/recipe: original()*; tile(0, 3, TI2)**; tile(0, 3, TJ2);
  tile(0, 3, TK2); unroll(0, 6, UI2)***
Search constraints: TI2 >= TJ2; TJ2 >= TK2;
  TI2, TJ2, TK2 in [0; 4; ...; prob_size]; UI2 in [1; 2; ...; register_size]

Search space dimension: 6. Parameters: [TI1, TJ1, TI2, TJ2, TK2, UI2].
Sample search space size: 3.11 × 10^7 possible configurations for a 64-core run
with 512^3 domain size.
* - keep original loop order. ** - tile(statement #, loop level to tile, tiling
factor). *** - unroll(statement #, loop level to unroll, unroll factor)

We performed two sets of auto-tuning experiments: one using 64 cores and one using 128 cores. Both sets of experiments were done on the umd-cluster. For each core count, we select multiple input domain sizes. Figure 4 shows how Active Harmony steers the per-iteration performance of the harmonized PES. This experiment uses a grid size of 1024^3. The figure plots the timing of the worst performing configuration for each application iteration. The running time of an SPMD-based application is bounded, at each timestep, by the slowest configuration. The per-iteration time for the original application is indicated by the horizontal line in the figure. The figure shows that Active Harmony-suggested configurations outperform the original application's per-iteration timing within the first few tens of iterations.
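The tile() operations in the recipe above have a straightforward loop-level reading. The sketch below hand-tiles both loops of a doubly nested traversal with tile sizes ti and tj, the same shape of transformation CHiLL generates for the relaxation kernel; the array name, extent M, and reduction are illustrative and make no attempt to model CHiLL's actual output.

```c
#define M 37  /* array extent; deliberately not a multiple of the tile sizes */

/* Sum a 2D array with the i and j loops tiled by (ti, tj), ti, tj >= 1.
 * The tile loops (ii, jj) walk tile origins; the intra-tile loops visit
 * each element exactly once.  Tiling only reorders the traversal, so the
 * result matches the untiled loop; the payoff is cache locality. */
long tiled_sum(int a[M][M], int ti, int tj)
{
    long s = 0;
    for (int ii = 0; ii < M; ii += ti)
        for (int jj = 0; jj < M; jj += tj)
            for (int i = ii; i < ii + ti && i < M; i++)
                for (int j = jj; j < jj + tj && j < M; j++)
                    s += a[i][j];
    return s;
}

/* Untiled reference version. */
long plain_sum(int a[M][M])
{
    long s = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            s += a[i][j];
    return s;
}
```

Because every (ti, tj) choice computes the same answer, the tile sizes are pure performance parameters, which is exactly what lets Active Harmony search over them freely.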
Figures 5 and 6 plot the net and harmonized speedups achieved within one full execution of the harmonized PES. The original application execution times (in seconds) are shown in parentheses below the labels on the x-axis. The application was run on 64 and 128 cores on the umd-cluster for varying input data sizes. As expected, as the size of the problem domain increases, the performance of the harmonized application increases as well. This is intuitive because with the increase in the problem size, the Active Harmony server gets more time to explore the search space before the application completes its execution. For the 512^3 problem size (see Figure 5), the program runtime is too short (108 seconds) and the harmonized application runs 28% slower than the original untuned version. The 28% slowdown does incorporate the charge factor for the 8 extra cores used for code-generation. The harmonized application is unable to overcome the penalty of using some poor configurations early in the short run of the program (132 seconds elapsed time).

On average, for both core counts and different problem domains, harmonized PES, in terms of the net speedup, performs 1.16 times faster than the original application. Thus, even after allowing for the code-servers, a single execution with no prior runs is, on average, 16% faster than the original application. The best net speedup for the harmonized application is 1.37. Post-harmony runs, which use Active Harmony-suggested parameter configurations and code-variants, on average perform 1.72 times faster than the original application. This indicates the performance gain if the program were run a second time on the same machine with similar inputs.

Fig. 6. Performance improvement of harmonized PES, net speedup, and post-harmony run of the solver (128 core run on umd-cluster).
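The net-speedup numbers above fold the code-server cost into the comparison. The paper does not spell out the exact formula, so the charging scheme below is an assumption for illustration: the harmonized run is billed in core-seconds, i.e. its elapsed time is scaled by total cores used over application cores.

```c
/* Net speedup of a harmonized run versus the original, charging the
 * harmonized run for its extra code-server cores.  ASSUMPTION: cost is
 * modeled as core-seconds, so elapsed time is scaled by
 * (app_cores + code_servers) / app_cores. */
double net_speedup(double t_orig, double t_harmonized,
                   int app_cores, int code_servers)
{
    double charge = (double)(app_cores + code_servers) / app_cores;
    return t_orig / (t_harmonized * charge);
}
```

Under this scheme the short 512^3 PES run (108 s original versus 132 s harmonized on 64 cores plus 8 code-servers) gives 108 / (132 * 72/64), roughly 0.73, i.e. close to the reported 28% slowdown, which suggests a charging model of this flavor.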
2) Parallel Multiblock Lattice Boltzmann (PMLB): The Lattice Boltzmann Method (LBM) is a widely used method for solving fluid dynamic systems. In contrast to the conventional methods in fluid dynamics, which are based on the discretization of macroscopic differential equations, the LBM has the ability to deal efficiently with complex geometries and topologies [25]. For our experiments, we use the parallel multiblock implementation (extended to 3D problems) of the LBM developed by Yu et al. [26]. The test case lattice model for our experiments is D3Q19 (19 velocities in 3D) with the collision and streaming operations. The application is written in C.

The PMLB code is divided into six main operations: initialization, collision, communication, streaming, physical and finalization. The collision, communication, streaming and physical operations are executed within a loop; the initialization and finalization operations are performed once. We focus our attention on the streaming operation, which accounts for more than 75% of the execution time. The streaming operation moves particles in motion to new locations along with their respective 19 velocities. This operation requires a significant number of memory copy operations.
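A flavor of the streaming kernel's structure, and of the small legality fix mentioned below for loop transformations, can be shown on a reduced example. Scalar temporaries computed at one loop level and consumed below it (like c1 through c7 in Table IV) couple the loop levels; substituting the index expressions inline makes every iteration self-contained, so tiling and unrolling become legal. The kernel below is a toy stand-in, not the PMLB code.

```c
/* Original flavor: a scalar temporary carried from the j level into the
 * k loop couples the loop levels and can block source-level reordering. */
void stream_with_temps(double *fi, int jmax, int kmax)
{
    for (int j = 1; j < jmax; j++) {
        int base = j * kmax;                     /* carried scalar        */
        for (int k = 0; k < kmax - 1; k++)
            fi[base + k] = fi[base + k + 1];     /* shift: a memory copy  */
    }
}

/* Transform-friendly form: the index expression is substituted inline, so
 * each (j, k) iteration depends only on its own indices. */
void stream_substituted(double *fi, int jmax, int kmax)
{
    for (int j = 1; j < jmax; j++)
        for (int k = 0; k < kmax - 1; k++)
            fi[j * kmax + k] = fi[j * kmax + k + 1];
}
```

Both forms compute the same result; the second simply exposes that fact to a source-to-source transformation tool.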
The streaming operation consists of five separate triply-nested kernels, which are tuned simultaneously. Our optimization strategy utilizes loop-fusion, loop-tiling and loop-unrolling. The tuning is done in two phases. The first few iterations of the LBM method are used to identify the best fusion configuration for the five triply nested loops within the streaming operation. For this stage, we use exhaustive search. Once we identify the best performing fusion configuration, the tuning moves to the second stage, which involves tiling the outermost two loops and unrolling the innermost loop. (Simple code modifications were required to remove scalar dependencies between different levels of the loop-nests to ensure the legality of these code transformations.) The second stage uses the parallel rank ordering algorithm to determine two sets of tiling and unrolling factors: one for the fused loop-nests and another for the remaining loop-nests. The search parameter space for all PMLB experiments is, thus, six-dimensional. The optimization strategy (expressed in terms of the CHiLL recipe) along with the constraints for unbound tunable parameters is provided in Table IV. Out of the five deeply nested kernels in the streaming operation, we show only one kernel in the table; the other kernels are similar in structure. For a 512^3 problem size run on 128 cores, the search space has approximately 8.92 × 10^6 possible configurations.

TABLE IV
PMLB AUTO-TUNING

Kernel: streaming 1
Original code:
  for (i=1; i<=imax; i++)
    for (j=1; j<=jmax; j++)
      for (k=1; k<=kmax; k++) {
        c1 = i*(ny_local);
        c2 = c1+(ny_local);
        c3 = (c1+j)*(nz_local);
        c4 = c3+(nz_local);
        c5 = (c2+j)*(nz_local);
        c6 = c5+(nz_local);
        c7 = (c3+k)*en;
        fi[ 6+c7] = fi[ 6+(k+1+c3)*en];
        fi[ 4+c7] = fi[ 4+(k+c4)*en];
        fi[18+c5] = fi[18+(k+1+c4)*en];
        fi[ 2+c7] = fi[ 2+(k+c5)*en];
        fi[14+c7] = fi[14+(k+1+c5)*en];
        fi[10+c7] = fi[10+(k+c6)*en];
      }
Transformation strategy/recipe: original()*; tile(0, 1, TI)**; tile(0, 3, TJ);
  unroll(0, 5, UK)***; known(imax > 1)****; known(jmax > 1); known(kmax > 1)
Search constraints: TI >= TJ; TI in [0; 4; ...; prob_size];
  TJ in [0; 4; ...; prob_size]; UK in [1; 2; 3; 4]
Search space dimension: 6; two sets of [TI; TJ; UK]: one for the fused kernels
and one for the non-fused kernels.
Sample search space size: 8.92 × 10^6 possible configurations for a 128-core
run with 512^3 problem size.
* - keep original loop order. ** - tile(statement #, loop level to tile, tiling
factor). *** - unroll(statement #, loop level to unroll, unroll factor).
**** - known predicate
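Phase one above is a small exhaustive search: with five adjacent loop nests there are 2^4 = 16 ways to decide, for each adjacent pair, whether to fuse across the boundary, so every candidate can be timed within the first few LBM iterations. The sketch below shows that enumeration with a synthetic cost function standing in for a real timing measurement; the real system, by contrast, times actual generated code-variants.

```c
/* Exhaustively evaluate all fusion configurations of `nests` adjacent loop
 * nests.  A configuration is a bitmask over the nests-1 boundaries: bit b
 * set means nests b and b+1 are fused.  `cost` maps a mask to a measured
 * time; any callback (e.g. one timed application iteration) can serve. */
unsigned best_fusion(int nests, double (*cost)(unsigned))
{
    unsigned best = 0;
    double best_t = cost(0);
    for (unsigned mask = 1; mask < (1u << (nests - 1)); mask++) {
        double t = cost(mask);
        if (t < best_t) { best_t = t; best = mask; }
    }
    return best;
}

/* Synthetic cost for illustration: pretend fusing boundaries 0 and 2
 * (mask 0b101 = 5) is the fastest configuration. */
double toy_cost(unsigned mask)
{
    return (mask == 5u) ? 1.0 : 2.0 + mask * 0.01;
}
```

Exhaustive search is affordable only because this phase-one space is tiny; the six-dimensional tiling/unrolling space of phase two is why the parallel rank ordering algorithm is used instead.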
PMLB tuning experiments were done on the umd-cluster, Carver and Hopper. Figures 7 and 8 plot speedup results for harmonized and post-harmony PMLB runs using 64 and 128 cores on the umd-cluster. We use multiple input datasets for the different core counts. Again, we see that the increase in the size of the problem domain leads to better performance for the harmonized PMLB. On average, harmonized PMLB performs 1.14 times faster than the original application, while post-harmony runs perform 1.38 times faster than the original application.

Fig. 7. Performance improvement of harmonized PMLB, net speedup, and post-harmony run of PMLB (64 core run on umd-cluster).

Fig. 8. Performance improvement of harmonized PMLB, net speedup, and post-harmony run of PMLB (128 core run on umd-cluster).

For the experiments on Carver, we use two different core counts: 256 and 512. We were limited in terms of core counts because 512 is the maximum core count a user can reserve on Carver.

Fig. 9. Performance improvement of harmonized PMLB, net speedup, and post-harmony run of PMLB (256 core run on Carver).

Fig. 10. Performance improvement of harmonized PMLB, net speedup, and post-harmony run of PMLB (512 core run on Carver).

Fig. 11. Performance improvement of harmonized PMLB, net speedup, and post-harmony run of PMLB (512 core run on Hopper).

Fig. 12. Performance improvement of harmonized PMLB, net speedup, and post-harmony run of PMLB (1024 core run on Hopper).
Figures 9 and 10 plot speedup results for harmonized and post-harmony runs using 256 and 512 cores of Carver. In terms of the net speedup, on average, harmonized PMLB performs 1.11 times faster than the original application. The best net speedup for a harmonized run is 1.46, i.e. even after factoring in the extra cores for code-generation, a single execution of harmonized PMLB is up to 46% faster than the original application. Post-harmony runs perform, on average, 1.37 times faster than the original application.

Experiments on Hopper were done using 512 and 1024 cores. Figures 11 and 12 plot speedup results for harmonized and post-harmony runs using 512 and 1024 cores of Hopper. In terms of the net speedup, on average, harmonized PMLB performs 1.14 times faster than the original application. The best net speedup for a harmonized run is 1.21. Post-harmony runs perform, on average, 1.28 times faster than the original application.

F. Cross-platform comparison

In the preceding section, we showed how we applied Active Harmony to auto-tune the execution of two parallel applications on three different platforms. A logical question is how the parameters that Active Harmony selected for the different platforms relate to each other. To answer this question, we conducted a controlled study using 64 cores on all three systems. We selected three problem sizes for the PMLB application. The selection of the problem sizes was based on whether or not Active Harmony's search converges to a solution within a single execution of the harmonized PMLB. This was done to ensure that Active Harmony gets a fair chance to select good configurations on all the systems. We then use the Active Harmony-suggested parameter configurations for a given problem size on one system to conduct post-harmony runs for the same problem size on the other systems.

The results are summarized in Table V. Post-harmony runs conducted on the umd-cluster using the configurations sug-