Reducing overheads of dynamic scheduling on heterogeneous chips

Francisco Corbera, Andrés Rodríguez, Rafael Asenjo, Angeles Navarro, Antonio Vilches
Universidad de Málaga, Andalucía Tech, Dept. of Computer Architecture, Spain
{corbera,andres,asenjo,angeles,avilches}@ac.uma.es

María J. Garzarán
Dept. Computer Science, UIUC
[email protected]

ABSTRACT
In recent processor development, we have witnessed the integration of GPU and CPUs into a single chip. The result of this integration is a reduction of the data communication overheads, which enables an efficient collaboration of both devices in the execution of parallel workloads.

In this work, we focus on the problem of efficiently scheduling chunks of iterations of parallel loops among the computing devices on the chip (the GPU and the CPU cores) in the context of irregular applications. In particular, we analyze the sources of overhead that the host thread experiences when a chunk of iterations is offloaded to the GPU while other threads are concurrently executing other chunks on the CPU cores. We carefully study these overheads on different processor architectures and operating systems using Barnes Hut as a case study representative of irregular applications. We also propose a set of optimizations to mitigate the overheads that arise in the presence of oversubscription and to take advantage of the different features of the heterogeneous architectures. Thanks to these optimizations we reduce the Energy-Delay Product (EDP) by 18% and 84% on the Intel Ivy Bridge and Haswell architectures, respectively, and by 57% on the Exynos big.LITTLE.

1. INTRODUCTION
Recently, we have seen a trend towards the integration of GPU and CPUs on the same die. Examples include recent Intel processors (Ivy Bridge, Haswell), the AMD APUs, or more power constrained processors targeted at mobile and embedded devices, like the Samsung Exynos 5 or the Qualcomm Snapdragon 800, among others. The integration allows the sharing of the memory system, which reduces the communication overheads between the devices and enables a more effective cooperation among all the computing devices. In contrast to discrete GPUs (connected through a slower PCI bus), integrated GPUs also enable the implementation of more dynamic strategies to distribute the workload among the cores and the GPU. In fact, achieving maximum performance and/or minimum energy consumption often requires the simultaneous use of both the GPU and the CPU cores [8].

In this context, one problem that has received attention lately is the efficient execution of the iterations of a parallel loop on both devices, the CPU cores and the integrated GPU. However, the optimal division of work between CPU and GPU is highly application and input data dependent, so a careful partitioning of the workload across the CPU cores and the GPU accelerator is required. In the case of regular applications, the main difficulty is to determine how to partition the load between the two devices to avoid load imbalance. This is usually accomplished by running a few chunks of iterations on both devices to determine the speed difference between them. The appropriate chunk for each device is then computed and scheduled to run [3]. In the case of irregular applications the challenge is bigger because, in this case, GPU performance can be suboptimal when the application workload is distributed among all the cores and the GPU without considering the size of the chunk assigned to the GPU, even when the load is balanced among the devices. For instance, using the irregular Barnes Hut benchmark [9] as our case study, we have found that the size of the chunk of iterations assigned to the GPU has a significant impact on its performance, as we will show in Section 2. Thus, for each application and input data, the scheduling strategy must be aware of the optimal chunk of iterations to offload to the GPU.
Since the workload assigned to the GPU and the CPU cores must also be balanced, the best approach is a dynamic scheduling strategy that assigns the optimal chunk of iterations to each device. This means that the cores and the GPU will repeatedly receive a chunk of iterations to be computed until the end of the iteration space. This scheduling strategy is described in Section 3.

The intensive use of the GPU offloading mechanism reveals several sources of overhead that have to be carefully considered. For instance, not only the traditionally studied overheads such as the data transfers (host-to-device and device-to-host) have to be taken into account, but also kernel launching and host thread dispatch overheads gain relevance. In this paper, we study the impact of each one of these overheads on different heterogeneous architectures with an integrated GPU. More precisely, in Section 4, we conduct our experiments on two Intel processors (Ivy Bridge and Haswell architectures) and one Samsung Exynos 5 featuring a big.LITTLE architecture. Two different operating systems (Windows and Linux) are considered.

With the information provided by these experiments we tackle the problem of reducing the impact of the most representative overheads. In Section 5, we propose and evaluate some optimizations that not only reduce the execution time but also the energy consumption. Finally, we present the related works in Section 6 and conclude that, in order to squeeze the last drop of performance out of these heterogeneous chips, it is mandatory to conduct a thorough analysis of the overheads and to study how the CPU cores, the GPU and the different layers of the software stack (OpenCL driver, OS, etc.) interplay (see Section 7).

Figure 1: GPU hardware metrics for Barnes Hut on Haswell: (a) EU Active, (b) EU Idle, (c) EU Stalled, (d) GPU L3 cache misses, and (e) throughput (iterations/ms) across the iteration space, for GPU chunk sizes 320, 640, 1280 and 2560. The legend of subfigure 1c applies to all subfigures.
2. MOTIVATION
We have found that, for irregular applications, offloading big chunk sizes to the GPU can hinder performance. This is illustrated in Figure 1, which shows the evolution of different GPU hardware metrics on the Haswell architecture (described in Section 4.1) through the iteration space of one time step of the irregular benchmark Barnes Hut. An input set of 100,000 bodies was used to collect these results. Each subfigure represents the evolution of the metric of interest for different chunk sizes assigned to the GPU (see the chunk size legend in subfigure 1c). We have used Intel VTune Amplifier 2015 [1] to trace the ratio of cycles when the EUs (Execution Units) are active (sum over all EUs of the cycles when the EU executes instructions / sum over all EUs of all cycles); the ratio of cycles when the EUs are idle (sum over all EUs of the cycles when no thread is scheduled on the EU / sum over all EUs of all cycles); the ratio of cycles when the EUs are stalled (sum over all EUs of the cycles when the EU does not execute instructions and at least one thread is scheduled on it / sum over all EUs of all cycles); and the L3 cache misses due to all the EUs' GPU memory requests. Subfigure 1e also shows the GPU effective throughput, measured as the number of iterations per ms, through the iteration space. Data transfer and kernel offloading overheads have been included in the computation of the throughput.

As we can see in subfigure 1e, for the time step studied, the optimal chunk size is 640 (see the red line in the figure). Increasing the chunk size beyond this value degrades the throughput. The hardware metrics indicate that small chunk sizes (e.g. 320) do not effectively fill the GPU computing units, as the EU Idle ratio indicates in subfigure 1b (see the blue line). However, when the chunk size is large enough to fill the EUs (EU Idle < 0.1), the EUs might stall when the chunk size increases, due to the increment in L3 cache misses (see the green and pink lines in subfigure 1d). This is what happens in our irregular benchmark, in which the majority of memory accesses are uncoalesced. In our case, chunk sizes larger than 1280 dramatically increase L3 misses, which in turn increases the EU Stalled ratio (> 0.9) and reduces the EU Active ratio (< 0.08), causing a reduction of the effective throughput.

Therefore, in the quest to find the optimal distribution of work between the GPU and the CPU, if we assign a big chunk of iterations to each device we can end up not exploiting the GPU EUs optimally. Thus, the partitioning strategy must be aware of the optimal chunk of iterations that must be offloaded to the GPU for the corresponding application. On the other hand, we have observed that the effective throughput of the CPU cores is not so sensitive to the chunk size. As long as the chunk size is bigger than a threshold value, the CPU throughput tends to be constant. For instance, when using the Threading Building Blocks library (TBB) [15], it is recommended to have CPU chunk sizes that take at least 100,000 clock cycles. By allowing a dynamic scheduling of the iteration space in such a way that each device gets an optimal chunk of iterations while balancing the workload among them, we can minimize the execution time. However, some overheads are involved in this type of scheduling, as we will see in the next sections. In any case, we explore here the applicability of a dynamic scheduling strategy on a heterogeneous architecture using Barnes Hut as an example.

Figure 2: Normalized time, energy and energy-delay-product (EDP) for Barnes Hut on Ivy Bridge, Haswell and Exynos. The lower the values, the better.

In Figure 2 we compare the time, energy and EDP (Energy-Delay-Product) [6] for Barnes Hut on three different heterogeneous architectures using 4 CPU cores and 1 integrated GPU: Ivy Bridge, Haswell and Exynos. The input set also has 100,000 bodies, but now 75 time steps were computed. For each platform, two scenarios are shown: 3+1, which corresponds to the case in which 4 threads are scheduled on the processor (3 CPU threads + 1 host thread dedicated to offloading the work to the GPU), and 4+1, which corresponds to 5 threads (4 CPU threads + 1 host thread). The latter represents the case in which 1 thread of oversubscription is allowed. We explore this case because we want to make the most of our computing resources: since the host thread is blocked most of the time while the GPU takes care of its chunk of iterations, we add an additional thread to avoid one core becoming idle.

In the figure, we call our dynamic scheduling strategy Dynamic (it works similarly to the OpenMP dynamic scheduling policy of the pragma omp parallel for). In this strategy, we first have to find the optimal chunk size for the GPU (the GPU chunk). This is done through an offline training phase, where we explore different chunk sizes and choose the value that maximizes the effective throughput of the application on the corresponding GPU. Table 1 shows the optimal size for each platform. This size is passed to our scheduler, which dynamically assigns a new chunk to the GPU each time it finishes the computation. Similarly, each CPU core gets a new chunk of iterations (the CPU chunk) every time it finishes the previous one. The size of the CPU chunk is selected to balance the load with the GPU computation (details are provided in the next section).

We compare this dynamic strategy with a static scheduling that assigns a single GPU chunk to the GPU and the rest of the iterations to the CPU cores. Previously, for the static scheduling, we carry out an exhaustive offline profiling that looks for the static partitioning of the iteration space between the CPU and GPU that minimizes the execution time. We vary the percentage of the iteration space offloaded to the GPU (between 0%, only CPU execution, and 100%, only GPU execution, using 10% steps). We call this strategy Bulk-Oracle.
Again, Table 1 shows the optimal percentage of iterations offloaded to the GPU for each platform under the static partition strategy. Notice that with the static approach the whole chunk is offloaded to the GPU at the beginning of each time step.

Table 1: Optimal GPU chunk size for Dynamic and optimal percentage of the iteration space offloaded to the GPU for Bulk-Oracle

                 Ivy           Haswell        Exynos
                 3+1    4+1    3+1    4+1     3+1    4+1
  Dynamic        1536   1536   2048   2048    2048   2048
  Bulk-Oracle    50%    40%    70%    70%     20%    20%

In Figure 2 each parameter (time, energy and EDP) has been normalized with respect to the value obtained for the Bulk-Oracle 3+1 execution on each architecture. As we see, the dynamic strategy outperforms the static one (Bulk-Oracle), except in the case of Haswell for 4+1, where the overheads associated with the host thread degrade performance (both in time and energy). This issue is discussed in the next section. Another interesting result is that oversubscription (4+1) improves the execution times on Ivy Bridge, for both the static and dynamic approaches, but it also increases the energy consumption. This is due to the fact that 4 threads compute a larger number of chunks on the CPU cores and, since the CPU is less energy efficient computing the chunks than the GPU, this results in higher energy consumption. For this architecture and this benchmark, the increment of the energy on the CPU is not amortized by the reduction of time. On the other hand, on Exynos, a static partitioning does not scale when going from 3+1 to 4+1, while the dynamic strategy reduces time and energy consumption.

3. SCHEDULING STRATEGY
In this section, we present in more detail our scheduler (Section 3.1), how the partitioner works (Section 3.2), and the potential sources of overhead (Section 3.3).
3.1 Scheduler description
Our scheduler, which we call Dynamic, considers loops with independent iterations (parallel_for) and features a work scheduling policy with a dynamic GPU and CPU chunk partitioning. Our approach dynamically partitions the whole iteration space into chunks or blocks of iterations. The goal of the partitioning strategy is to evenly balance the workload of the loop among the compute resources (GPU and CPU cores) as well as to assign to each device the chunk size that maximizes its throughput. This is key because, as shown in Figure 1, the chunk size can have a significant impact on the performance of heterogeneous architectures, especially when dealing with irregular codes.

Our scheduler builds on top of an extension of the TBB parallel_for template for heterogeneous GPU-CPU systems by Navarro et al. [13]. Figure 3 shows the pseudo-code to use the extended parallel_for construct in such heterogeneous systems.

 1  #include <HScheduler.h>
 2  class Body{
 3  public:
 4    void operatorCPU(int begin, int end) {
 5      for(int i=begin; i!=end; i++){ ... }
 6    }
 7    void operatorGPU(int begin, int end) {
 8      hostToDevice(begin, end){...}
 9      launchKernel(begin, end){...}
10      deviceToHost(begin, end){...}
11      clFinish();
12    }
13  };
14  ...
15  int main(int argc, char* argv[]){
16
17    Body body;
18
19    // Start task scheduler
20    task_scheduler_init init(nThreads);
21    ...
22    parallel_for(begin, end, body, Partitioner_H(G));
23    ...
24  }

Figure 3: Using the parallel_for template

As in any TBB program, the scheduler is initialized (line 20). In this step, the developer sets the number of OS threads, nThreads, that the TBB runtime will create, which can vary from 1 to the number of CPU cores plus one additional thread to host the GPU (the host thread). Then, the developer can invoke the parallel_for (line 22), which has the following parameters: the iteration space (the range begin, end), the body object of the loop (body), and the partitioner object (Partitioner_H). The latter argument effectively overloads the native TBB parallel_for function so that the heterogeneous version is invoked. Besides, the Partitioner_H method takes care of the dynamic partitioning strategy that gets the optimal chunk size for the GPU (parameter G provided by the user) and for the CPU cores, as described in Section 3.2.

The user also writes the class Body that processes the chunk on the CPU cores or on the GPU (lines 2-13 in Figure 3). Two methods (operators) must be coded, one for the CPU (lines 4-6) and one for the GPU (lines 7-10). For the GPU, the user has to define the functions that perform the asynchronous host-to-device (line 8) and device-to-host (line 10) memory transfers, as well as the kernel launching (line 9). Since these functions are all asynchronous, we finish the GPU part of the body with the synchronous clFinish() call, which does not return until all the previous steps have been completed.
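To make the role of these calls concrete, the following sketch shows one possible way to write the GPU operator directly on top of the OpenCL C API. This is only a sketch under our own assumptions: the handles queue, nbodyKernel, bodiesBuf and resultBuf, and the host arrays hostBodies and hostResults, are illustrative names created elsewhere during initialization, and in the real framework the function would be the operatorGPU member of the Body class.

  // Sketch of a possible operatorGPU() body using plain OpenCL calls.
  #include <CL/cl.h>

  extern cl_command_queue queue;        // created on the GPU device
  extern cl_kernel        nbodyKernel;  // illustrative kernel handle
  extern cl_mem           bodiesBuf, resultBuf;
  extern cl_float4       *hostBodies, *hostResults;

  void operatorGPU(int begin, int end) {
    size_t offset = begin * sizeof(cl_float4);
    size_t bytes  = (end - begin) * sizeof(cl_float4);

    // Asynchronous host-to-device copy of the chunk (line 8 in Figure 3).
    clEnqueueWriteBuffer(queue, bodiesBuf, CL_FALSE, offset, bytes,
                         hostBodies + begin, 0, NULL, NULL);

    // Asynchronous kernel launch covering only this chunk (line 9).
    clSetKernelArg(nbodyKernel, 0, sizeof(cl_mem), &bodiesBuf);
    clSetKernelArg(nbodyKernel, 1, sizeof(cl_mem), &resultBuf);
    clSetKernelArg(nbodyKernel, 2, sizeof(int),    &begin);
    size_t globalSize = (size_t)(end - begin);
    clEnqueueNDRangeKernel(queue, nbodyKernel, 1, NULL, &globalSize,
                           NULL, 0, NULL, NULL);

    // Asynchronous device-to-host copy of the results (line 10).
    clEnqueueReadBuffer(queue, resultBuf, CL_FALSE, offset, bytes,
                        hostResults + begin, 0, NULL, NULL);

    // Synchronous wait: returns once every enqueued command has finished.
    clFinish(queue);
  }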
Internally, our scheduler is implemented as a pipeline that consists of two stages or filters: Filter1, which selects the computing device and the chunk size (number of iterations) assigned, and Filter2, which processes the chunk on the corresponding device. Filter1 first checks whether the GPU device is available. In that case, a G token is created and initialized with the range of the GPU chunk. If there is no idle GPU device, then a CPU core is idle; thus, a C token is created and initialized with the range of the CPU chunk. In both cases, the partitioner extracts a chunk of iterations from the range of the remaining iteration space. Next, Filter2 processes the chunk on the corresponding device and records the time it takes to compute that chunk (note that for the GPU this time includes the kernel execution time as well as the hostToDevice and deviceToHost times). This is necessary to compute the device's throughput, which is used by the partitioner described next.

3.2 Partitioning strategy
We assume that the execution time can be seen as a sequence of scheduling intervals {tG0, tG1, ..., tGi-1, tGi, tGi+1, ...} for the GPU and {tC0, tC1, ..., tCi-1, tCi, tCi+1, ...} for each CPU core. Each computing device at the current interval, tGi or tCi, can get a chunk of iterations. The running time T(tGi), for each GPU chunk size G(tGi) = G, or the time T(tCi), for a CPU chunk size C(tCi), is recorded. This time is used to compute the throughput, λG(tGi) for the GPU or λC(tCi) for a CPU core, in the current scheduling interval as

    λG(tGi) = G / T(tGi)            (1)
    λC(tCi) = C(tCi) / T(tCi)       (2)

In order to compute the chunk size for the GPU, an offline training phase explores different chunk sizes and chooses the value that maximizes the effective throughput of the application for a given input data. To reduce the number of runs of this offline training phase, we set the initial GPU chunk size to the smallest number of iterations that fully occupies the GPU resources. For example, on the integrated GPU of the Intel Haswell we have 20 EUs (execution units), each one running a SIMD-thread (aka wavefront) at a time, and 8, 16 or 32 work-items per SIMD-thread (decided at compile time). In OpenCL these values can be queried by reading two variables, CL_DEVICE_MAX_COMPUTE_UNITS and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Thus, the product of these two values is used as the initial GPU chunk size. For Barnes Hut the compiler chooses a SIMD width of 16, and therefore we select an initial GPU chunk size of 20 × 16 = 320 iterations. Then, chunk sizes that are multiples of this initial chunk size are tried. We keep trying different chunk sizes while the throughput increases. Once the throughput decreases or remains stable for 2 or more chunk sizes, the training phase ends and the GPU chunk size that obtained the highest throughput is chosen. This approach requires only a few runs.

Regarding the policy to set the CPU core chunk size, our partitioner follows the heuristic by Navarro et al. [13]. Basically, this heuristic is based on a strategy to minimize the load imbalance among the CPU cores and the GPU, while maximizing the throughput of the system. To that end, the optimization model described there recommends that, each time a chunk is partitioned and assigned to a device, its size should be selected such that it is proportional to the device's effective throughput. Therefore, we implemented a greedy partitioning algorithm based on the following key observation: while there are sufficient remaining iterations, the chunk size assigned to the GPU at the scheduling interval tGi should be the optimal one for the GPU (G, as explained in the previous paragraph), whereas at the scheduling interval tCi the chunk size assigned to a CPU core, C(tCi), should verify

    C(tCi) / λC(tCi-1) = G / λG(tGi-1)        (3)

where λC(tCi-1) and λG(tGi-1) are the CPU and GPU throughputs in the previous scheduling intervals, respectively. So we have

    C(tCi) = G · λC(tCi-1) / λG(tGi-1)        (4)
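As an illustration, the following sketch shows how the initial GPU chunk described above and the CPU chunk of Eq. (4) could be computed. It is a minimal sketch under our own assumptions: the function names and the clamping of the CPU chunk to the remaining iterations are ours, not part of the published framework.

  #include <CL/cl.h>
  #include <algorithm>

  // Initial GPU chunk: number of EUs times the preferred SIMD width,
  // obtained from the two OpenCL parameters mentioned above
  // (20 x 16 = 320 for Barnes Hut on the Haswell integrated GPU).
  size_t initialGpuChunk(cl_device_id dev, cl_kernel kernel) {
    cl_uint eus = 0;
    size_t simd = 0;
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(eus), &eus, NULL);
    clGetKernelWorkGroupInfo(kernel, dev,
                    CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                    sizeof(simd), &simd, NULL);
    return (size_t)eus * simd;
  }

  // CPU chunk for the next scheduling interval, following Eq. (4):
  // C(tCi) = G * lambdaC(tCi-1) / lambdaG(tGi-1), clamped here so that
  // it never exceeds the remaining iterations (our assumption).
  size_t nextCpuChunk(size_t G, double lambdaC, double lambdaG,
                      size_t remaining) {
    size_t c = (size_t)(G * (lambdaC / lambdaG));
    if (c < 1) c = 1;
    return std::min(c, remaining);
  }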
Figure 4: Time diagram showing the different phases of offloading a chunk to the GPU (Scheduling and Partitioning, Host-to-Device, Kernel launching, Kernel execution, Device-to-Host and Thread dispatch), annotated with the CPU time stamps Tc1-Tc3 and the GPU time stamps Tg1-Tg5.

3.3 Sources of overhead
Figure 4 shows the different phases that our Dynamic framework has to perform to offload a chunk of iterations to the GPU.

As we explained in the previous section, our framework uses nThreads, one of which is called the host thread and just takes care of serving the GPU. This host thread runs on one of the available cores and first executes the code associated with the scheduler and the partitioner (the code of Filter1 explained earlier). Then, the same host thread executes the Filter2 stage that includes the function calls listed in Figure 3 (hostToDevice(), launchKernel(), deviceToHost(), and clFinish()). The first three calls just asynchronously enqueue the corresponding operation on the GPU's command queue, whereas the latter is a synchronous wait. In Figure 4 we can see that the enqueued operations are sequentially executed on the GPU, where we consider the times taken by the "Host-to-Device", "Kernel launching", "Kernel execution" and "Device-to-Host" phases. When this last operation is done, the host thread is notified, but some time may be taken by the OS to re-schedule the host thread. This time is illustrated in the figure with the label "Thread dispatch".

In order to measure the relevant overheads involved in the execution of the code, some time stamps are taken on the CPU (Tc1, Tc2 and Tc3) and on the GPU (Tg1 to Tg5), as depicted in Figure 4. To get the CPU time stamps we rely on TBB's tick_count class, whereas for the GPU we configure the OpenCL command queue in profiling mode so that we can read the "start" and "complete" time stamps of each of the enqueued commands.
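The sketch below illustrates one way in which these GPU time stamps could be read. It assumes a command queue created with CL_QUEUE_PROFILING_ENABLE and uses the standard CL_PROFILING_COMMAND_START/END counters; it is a sketch of the mechanism rather than the exact instrumentation used in the framework.

  #include <CL/cl.h>

  // The queue must be created in profiling mode, e.g.:
  //   cl_command_queue q = clCreateCommandQueue(ctx, dev,
  //                            CL_QUEUE_PROFILING_ENABLE, &err);
  // Each clEnqueue* call can then return an event (its last argument)
  // whose device-side time stamps are queried after completion.
  double commandTimeMs(cl_event evt) {
    cl_ulong start = 0, end = 0;
    clWaitForEvents(1, &evt);              // ensure the command completed
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    return (end - start) * 1e-6;           // time stamps are in nanoseconds
  }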
With this information we compute the overheads of Scheduling and Partitioning, Osp, the Host-to-Device operation, Ohd, Kernel Launching, Okl, Device-to-Host, Odh, and Thread Dispatch, Otd, as follows:

    Osp = Σ_{#GPUchunks} (Tc2 - Tc1) / TotalExecutionTime                  (5)
    Ohd = Σ_{#GPUchunks} (Tg2 - Tg1) / TotalExecutionTime                  (6)
    Okl = Σ_{#GPUchunks} (Tg3 - Tg2) / TotalExecutionTime                  (7)
    Odh = Σ_{#GPUchunks} (Tg5 - Tg4) / TotalExecutionTime                  (8)
    Otd = Σ_{#GPUchunks} ((Tc3 - Tc2) - (Tg5 - Tg1)) / TotalExecutionTime  (9)

4. ANALYSIS OF OVERHEADS
In this section we analyze the sub-optimal performance of the Dynamic scheduler for the 4+1 configuration, as shown in Figure 2. To better understand the underlying reasons for that performance degradation, we first describe our environment setup (Section 4.1) and then discuss the overheads that we measure (Section 4.2).

4.1 Experimental Settings
We run our experiments on three different platforms: two Intel-based desktops and an Odroid XU3 bare-board. More precisely, the desktops are based on two quad-core Intel processors: a Core i5-3450, 3.1 GHz, based on the Ivy Bridge architecture, and a Core i7-4770, 3.4 GHz, based on Haswell. Both processors feature an on-chip GPU, the HD-2500 and HD-4600, respectively. The Odroid has a Samsung Exynos 5 (5422) featuring a quad-core 2.0 GHz Cortex-A15 along with a quad-core Cortex-A7 CPU (ARM's big.LITTLE architecture). This platform features current/power monitors based on TI INA231 chips that enable readouts of the instant power consumed by the A15 cores, the A7 cores, main memory and the GPU. The Exynos 5 includes the GPU Mali-T628 MP6.

Regarding the software tools, on the desktops we rely on the Intel Performance Counter Monitor (PCM) tool [5] to access the HW counters (which also provides energy consumption in Joules). The GPU kernels are implemented with the Intel OpenCL SDK 2014, which is currently only available for Windows OS. The benchmarks are compiled using Intel C++ Compiler 14.0 with the -O3 optimization flag. The Odroid board runs Linux Ubuntu 14.04.1 LTS, and an in-house library has been developed to measure energy consumption, as described below. On this platform, the compiler used is gcc 4.8 with the -O3 flag. The OpenCL SDK v1.1.0 is used for the Mali GPU.

On all platforms, Intel TBB 4.2 provides the core template of the heterogeneous parallel_for. We measured time and energy in 10 runs of the applications and report the average.

4.1.1 Energy measurement on the Exynos 5
The Odroid XU3 platform is shipped with an integrated power analysis tool. It features four on-board INA231 current/power monitors (http://www.ti.com/product/ina231) to monitor the Cortex A15 cores, Cortex A7 cores, DRAM and GPU power dissipation. The readouts from these power sensors are accessible through the /sys file system from user space. The system does not provide cumulated energy consumption, so only instant power readings are available.

We have developed a library to measure the energy consumed by our executions. The library allows starting/stopping a dedicated thread to sample power readings and integrate them over time using the real-time system clock. Once the sampling thread is working, the library also provides functions to take partial energy measurements to profile the energy consumption through different algorithm stages. It is possible to monitor the energy (in Joules) consumed by the Cortex A15 cores, the Cortex A7 cores, DRAM and GPU separately. The energy figures on the Exynos 5 that we present in this paper are the sum of the four power monitors aforementioned.

The sample rate used is 10 Hz. Thus, one sampled power value is obtained every 100 milliseconds. The power value read is multiplied by the sampling period and the product is integrated into a cumulated energy value (in Joules). We have chosen this sample rate because the power sensors update their values approximately every 260 milliseconds, so a sample rate two times faster is good enough for getting accurate measurements (the sampling rate satisfies the Nyquist criterion).
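The following is a minimal sketch of the kind of sampling loop such a library could use. The sensor paths are placeholders: the actual sysfs entries of the INA231 monitors on the Odroid XU3 are not reproduced here, and the real library also pins this thread to an A7 core, which is omitted in the sketch.

  #include <atomic>
  #include <chrono>
  #include <fstream>
  #include <string>
  #include <thread>
  #include <vector>

  // Samples instant power (W) from a set of sysfs sensor files at 10 Hz
  // and integrates the readings into an accumulated energy value (J).
  class EnergyMeter {
  public:
    explicit EnergyMeter(std::vector<std::string> sensorPaths)
        : sensors_(std::move(sensorPaths)) {}

    void start() {
      running_ = true;
      sampler_ = std::thread([this] { loop(); });
    }

    double stop() {                    // returns the accumulated Joules
      running_ = false;
      if (sampler_.joinable()) sampler_.join();
      return joules_;
    }

  private:
    void loop() {
      const auto period = std::chrono::milliseconds(100);   // 10 Hz
      while (running_) {
        double watts = 0.0;
        for (const auto& path : sensors_) {   // sum the four monitors
          std::ifstream f(path);
          double w = 0.0;
          if (f >> w) watts += w;
        }
        joules_ += watts * 0.1;               // P x 100 ms sampling period
        std::this_thread::sleep_for(period);
      }
    }

    std::vector<std::string> sensors_;
    std::thread sampler_;
    std::atomic<bool> running_{false};
    double joules_ = 0.0;
  };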
Figure 5: Percentage of the different overheads (Osp, Ohd, Okl, Odh and Otd) for Barnes Hut on Ivy Bridge, Haswell and Exynos, for the 3+1 and 4+1 configurations of Dynamic and Dynamic Pri. The lower, the better.

4.2 Discussion
Figure 5 shows the overheads that we measured for the corresponding terms defined in the previous section (Eqs. 5-9) for the three platforms that we consider, Ivy Bridge, Haswell and Exynos, when running the Barnes Hut benchmark with 100,000 bodies and 15 time steps for the 3+1 and 4+1 scenarios. This section covers only the Dynamic results (shown on the left side of each of the three subfigures in Figure 5).

The smallest overhead, on average, is the one due to Scheduling and Partitioning, Osp, which represents 0.02% on Ivy Bridge, and less than 0.004% on Haswell and Exynos.

Regarding the data transfer overheads, Ohd and Odh, on Ivy Bridge and Haswell they are always below 0.3%. However, on Exynos these overheads are significantly larger, around an order of magnitude higher: on average Ohd = 2.8% and Odh = 1.6%. After testing our platforms with memory-bound micro-benchmarks, we found that Exynos exhibits an order of magnitude higher data transfer times than the Intel architectures. This explains the impact of the Host-to-Device and Device-to-Host overheads on the Exynos. More precisely, the hostToDevice() operation implicitly copies the host buffer onto a different region of memory that can be accessed by the GPU and that is non-pageable (pinned memory). Similarly, deviceToHost() does another memory-to-memory copy operation in the other direction. Therefore, the lower memory bandwidth on the Exynos results in more appreciable Host-to-Device and Device-to-Host overheads with respect to the Intel architectures. As future work we will study how Barnes Hut memory accesses could be reorganized so that the cores and the GPU could share the same buffer, avoiding the copy operations by using the zero-copy-buffer capability of OpenCL.

With respect to the Kernel Launch overhead, it goes from up to 0.4% on Haswell to up to 3% and 2% on Ivy Bridge and Exynos, respectively. The average times consumed by this operation are 1.8 ms, 1 ms, and 3.6 ms on Ivy Bridge, Haswell and Exynos, respectively.

However, on the desktop platforms one of the most noticeable overheads is Otd (Thread Dispatch), especially for the 4+1 scenario. For that case, it represents 22% and 33% of the total execution time on Ivy Bridge and Haswell, respectively. Notice that this happens only for the oversubscription cases and under Windows OS. Actually, on Exynos under the Linux OS, the Thread Dispatch overhead represents less than 0.09% in all cases. The overhead in Windows is explained because the OpenCL driver ends up blocking the host thread once it has offloaded the kernel. It does that to avoid wasting a core with a busy-waiting host thread. In the 4+1 scenario, 4 threads are already intensively using the 4 cores and, when the OpenCL driver notifies the OS about the GPU completion, the OS wakes up the host thread. However, if this host thread does not have enough priority, it is unlikely that it will be dispatched to the running state straightaway. The core of the Windows scheduling policy is Round Robin, so the just-awakened host thread has to wait on the ready queue for the next available time slice. This does not happen in the Linux scheduler, where a just-awakened thread gains more priority than the other CPU-bound threads that are intensively computing CPU chunks. This scheduling decision in Linux rewards interactive (IO-bound) threads. As a consequence, in Linux, the host thread will be immediately dispatched after being woken up. This result is corroborated by the experiment described in the next section.

5. OPTIMIZATIONS
After identifying the main sources of overhead of our Dynamic approach, in this section we discuss the optimizations that can be implemented to address them. The overarching goal is not only to reduce the impact of the overheads, but also to reduce the energy consumption. To that end, we propose one strategy for the Windows-based platforms (Section 5.1) and a different one for the Exynos architecture (Section 5.2).

5.1 Increasing the priority of the host thread
As shown in the previous section, on the Windows-based desktops the highest overhead appears for the 4+1 configuration, where we have measured that up to 22% and 33% of the execution time is wasted on the clFinish() operation on Ivy Bridge and Haswell, respectively. This can be solved by assigning a higher priority to the host thread so that another thread can be immediately preempted and the just-awoken host thread can take up a core and start feeding the GPU again. This is key when the GPU processes the chunks more efficiently than the CPU, as happens in our benchmark. To boost the host thread priority we rely on the SetThreadPriority() Windows API. The framework obtained with this optimization is called Dynamic Pri.
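A minimal sketch of this optimization follows. The paper only states that SetThreadPriority() is used; the specific priority level (THREAD_PRIORITY_HIGHEST here) is our assumption.

  #include <windows.h>

  // Boost the calling (host) thread before it starts serving the GPU.
  // The exact priority level used by Dynamic Pri is not specified;
  // THREAD_PRIORITY_HIGHEST is an illustrative choice.
  void boostHostThreadPriority() {
    HANDLE self = GetCurrentThread();   // pseudo-handle to this thread
    if (!SetThreadPriority(self, THREAD_PRIORITY_HIGHEST)) {
      // If the call fails the scheduler still works; only longer thread
      // dispatch times are to be expected under oversubscription.
    }
  }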
The right part of each of the three subfigures of Figure 5 shows the overheads incurred by Barnes Hut when using Dynamic Pri and the arguments described in the previous section. The figure shows that on Ivy Bridge and Haswell the Otd overhead for the 4+1 configuration has been reduced, as it is now similar to that of the 3+1 one. This confirms that increasing the priority of the host thread for the 4+1 configuration reduces this particular overhead. Note that on the Exynos platform increasing the host thread priority barely affects the measured overheads.

Figure 6 shows the normalized metrics (time, energy, and EDP) w.r.t. Bulk-Oracle 3+1, as shown in Figure 2, but now adding a new bar representing the results for Dynamic Pri. As expected, Figure 6 confirms that for Ivy Bridge and Haswell, which run Windows OS, boosting the priority of the host thread has almost no impact in reducing time, energy or EDP for the 3+1 configuration. However, with respect to Dynamic, Dynamic Pri has a significant impact on these metrics in the oversubscribed 4+1 scenario. For instance, on Ivy Bridge, Dynamic Pri reduces time, energy, and EDP by 10%, 7% and 18%, respectively, when compared with Dynamic. On Haswell, these reductions are even more significant: 37%, 33% and 84% for time, energy and EDP, respectively. One interesting result appears on Ivy Bridge when comparing the Dynamic 3+1 configuration versus the Dynamic 4+1, as we mentioned in Section 2. Recall that even though the 4+1 configuration is faster than the 3+1, the energy consumed by the 4+1 is higher; this is no longer true for Dynamic Pri. Now, Dynamic Pri uses the GPU more efficiently (it is served quicker and this results in the GPU processing more chunks). Therefore, the more energy consuming CPU cores end up doing less work (and consuming less energy), resulting in smaller total energy consumption figures.

Finally, notice that incrementing the priority of the host thread does not result in further reductions in time or energy on the Exynos platform. However, in the next section we explore a possible strategy to further reduce time and energy consumption on this architecture.

5.2 Exploiting the big.LITTLE architecture
One interesting feature of the Exynos 5422 is that it supports global task scheduling (GTS). This enables using the four A15 cores and the four A7 cores at the same time. The Linux scheduler automatically uses the more powerful A15 cores for compute intensive threads, whereas the power saving A7 cores are reserved for system and background tasks.

Figure 7 shows execution time (ms), energy (J) and EDP for the Exynos platform with configurations 3+1, 4+1, 7+1, and 8+1. The 7+1 and 8+1 configurations use both the A15 and the A7 cores for the computing threads, while the 3+1 and 4+1 configurations only use the A15 cores for the computing threads. In all the configurations, the mapping of the host thread to an A15 or an A7 core depends on the pinning strategy described below (a sketch of the pinning call we assume is shown after the list). Notice that the GPU also computes chunks of iterations, so there is now a larger degree of heterogeneity. Four alternatives are considered for each configuration:

• Dynamic is the baseline approach, where the chunk size assigned to the A15 and A7 cores depends on the relative throughput of each one. The host thread is pinned to one of the A15 cores.

• Dynamic Pri is based on the baseline, but it increases the priority of the host thread.

• Dynamic A7 is based on the baseline, but pins the host thread to one of the A7 cores.

• Dynamic Pri A7 is based on the baseline, but it pins the host thread to one of the A7 cores and increases its priority.
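On Linux, pinning the host thread to a particular core can be done with the standard affinity call sketched below. The mapping of the A7 cluster to logical CPU numbers (CPUs 0-3 here) is an assumption that depends on the board's kernel configuration, not a detail given in the paper.

  #include <pthread.h>
  #include <sched.h>

  // Pin the calling (host) thread to one core of the A7 cluster.
  // We assume here that the A7 cores are exposed as logical CPUs 0-3;
  // the actual numbering depends on the kernel configuration.
  bool pinHostThreadToA7(int a7Cpu /* e.g. 0 */) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(a7Cpu, &mask);
    return pthread_setaffinity_np(pthread_self(),
                                  sizeof(cpu_set_t), &mask) == 0;
  }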
Figure 6: Normalized time, energy and energy-delay-product (EDP) for Barnes Hut on Ivy Bridge, Haswell and Exynos when adding a higher priority to the host thread. The lower, the better.

Figure 7: Time (ms), energy (J) and energy-delay-product (EDP) for Barnes Hut on Exynos using A15 and A7 cores, for the Dynamic, Dynamic Pri, Dynamic A7 and Dynamic Pri A7 alternatives. The lower, the better.

The energy sampling thread (see Section 4.1.1) is always pinned to one of the A7 cores, so it only produces an overhead of around 0.5% of the total execution time when running Barnes Hut under the 8+1 scenarios (less than 0.02% otherwise). Regarding the breakdown of energy consumption, the A15 cores use between 72% and 76% of the total energy consumed, the GPU between 15% and 20%, and the memory subsystem around 4.5%. Interestingly, the A7 cores only take around 2.5% of the total energy when they are idle (3+1 and 4+1 scenarios) and around 7.3% when they are fully utilized (7+1 and 8+1 scenarios).
For our benchmark, the four A7 cores consume one order of magnitude less energy than the four A15 cores.

Results in Figure 7 show that, by increasing the number of threads to use the A7 cores (compare 3+1 or 4+1 with 7+1 or 8+1), execution time and energy are reduced significantly. In particular, going from Dynamic 4+1 to Dynamic 8+1, we reduce time, energy and EDP by 22%, 19% and 46%, respectively. Increasing the priority of the host thread or pinning it to an A7 core has a small impact. For instance, for the 3+1 and 4+1 configurations, increasing the priority of the host thread or pinning it to the A7 core has almost no impact. In the 4+1 scenario, where Dynamic A7 is using 5 cores (4 A15 and 1 A7 for the host thread), we appreciate a marginal energy saving of 3% and similar running times. Increasing the priority of the host thread or pinning it to the A7 core has a higher impact on the 7+1 and 8+1 configurations, although the impact is relatively small. We notice that for the 7+1 and 8+1 configurations, Dynamic Pri, Dynamic A7, and Dynamic Pri A7 obtain very similar results in terms of time and energy. Again, for the 7+1 and 8+1 scenarios and with respect to Dynamic, Dynamic Pri and Dynamic A7 can, on average, reduce the time, energy and EDP by 4.3%, 3.6% and 7.8%, respectively. All in all, w.r.t. Dynamic 4+1, Dynamic Pri 8+1 reduces EDP by 57%.

Overall, while we expected that pinning the host thread to the A7 would have a higher impact on time or energy, our experimental results show little impact on either of them.

6. RELATED WORK
The closest work to ours is that of Zhu et al. [16], which addresses the problem of performance degradation when several independent OpenCL programs run at the same time (co-run) on the CPU and on the GPU of an Ivy Bridge using the Windows OS. The programs running on the CPU use all the cores, so they are in a situation similar to our 4+1 configurations (oversubscription). To avoid degradation of the GPU kernel, they also propose increasing the priority of the thread that launches the GPU kernel. Our study differs from theirs because we do not run two different programs; instead, we partition the iteration space of a single program to exploit both the CPU and the GPU. Our study also shows that increasing the priority of the host thread is not necessary when there is no oversubscription (i.e. 3+1) or when the underlying OS is Linux. We also assess the use of a big.LITTLE architecture.

Other works, such as [12, 7], also address the overhead problems of offloading computation to GPUs. The work of Lustig and Martonosi [12] presents a GPU hardware extension coupled with a software API that aims at reducing two sources of overhead: data transfers and kernel launching. They use a Full/Empty Bits technique to improve data staging and synchronization in CPU-GPU communication. This technique allows subsets of the results to be transferred to the CPU proactively, rather than waiting for the entire kernel to finish. Grasso et al. [7] propose several host code optimizations (use of zero-copy buffers, global work size equal to a multiple of the number of EUs) in order to reduce the GPU's computation overheads on embedded GPUs. They present a comparison in terms of performance and energy consumption between an OpenCL legacy version and an OpenCL optimized one. Our work focuses on reducing the sources of overhead as well but, in contrast, we target CPU-GPU collaborative computation instead of only the integrated GPU.

Several previous works study the problem of automatically scheduling on heterogeneous platforms with a multicore and an integrated or discrete GPU [11, 14, 2, 10, 4, 3, 8]. Among those works, the only one that also uses chips with integrated GPUs is Concord [8]. However, Concord does not analyze the overheads incurred by offloading a chunk of iterations to the GPU.

7. CONCLUSIONS
In this work, we elaborate on the possibility of successfully implementing a dynamic scheduling strategy that automatically distributes the iteration space of a parallel loop among the cores and the GPU of a heterogeneous chip. To that end, it is key to guarantee optimal resource occupation and load balance while reducing the impact of the overheads. Our proposal is evaluated for the Barnes Hut benchmark on mid/low power heterogeneous architectures like Ivy Bridge, Haswell, and Exynos 5, where the first two run under Windows OS and the latter under Linux. We have studied the sources of overhead on these systems, finding that, under Windows, the overhead due to re-scheduling the host thread is prohibitive in oversubscribed scenarios (more threads than cores). We solve this issue by increasing the priority of the host thread. Our experimental results show that this reduces the Energy-Delay Product (EDP) by 18% and 84% on the Intel Ivy Bridge and Haswell architectures, respectively. On the Exynos platform, the Linux scheduler successfully deals with oversubscription when only the A15 cores are used. Therefore, for this platform we explore the benefits, in terms of time and energy, that can be achieved by pinning the host thread to a low power core or by the combined usage of the four A15 cores and the four A7 cores included in the Exynos 5. Our experimental results show that using the A7 cores reduces EDP by 46%. Increasing the priority of the host thread or pinning the thread to an A7 core reduces EDP by an additional 7.8% on the Exynos big.LITTLE architecture.
8. REFERENCES
[1] Intel VTune Amplifier 2015, 2014. https://software.intel.com/en-us/intel-vtune-amplifier-xe.
[2] C. Augonnet, J. Clet-Ortega, S. Thibault, and R. Namyst. Data-aware task scheduling on multi-accelerator based platforms. In Proc. of ICPADS, 2010.
[3] M. Belviranli, L. Bhuyan, and R. Gupta. A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim., 9(4), Jan. 2013.
[4] J. Bueno, J. Planas, A. Duran, R. Badia, X. Martorell, E. Ayguade, and J. Labarta. Productive programming of GPU clusters with OmpSs. In Proc. of IPDPS, 2012.
[5] R. Dementiev, T. Willhalm, O. Bruggeman, P. Fay, P. Ungerer, A. Ott, P. Lu, J. Harris, P. Kerly, and P. Konsor. Intel Performance Counter Monitor. www.intel.com/software/pcm, 2012.
[6] R. Gonzales and M. Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, 31(9), 1996.
[7] I. Grasso, P. Radojkovic, N. Rajovic, I. Gelado, and A. Ramirez. Energy efficient HPC on embedded SoCs: Optimization techniques for Mali GPU. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 123–132, May 2014.
[8] R. Kaleem et al. Adaptive heterogeneous scheduling for integrated GPUs. In Intl. Conf. on Parallel Architectures and Compilation, PACT '14, pages 151–162, 2014.
[9] M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali. Lonestar: A suite of parallel irregular programs. In ISPASS, Apr. 2009.
[10] J. Lima, T. Gautier, N. Maillard, and V. Danjean. Exploiting concurrent GPU operations for efficient work stealing on multi-GPUs. In SBAC-PAD'12, pages 75–82, 2012.
[11] C.-K. Luk, S. Hong, and H. Kim. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proc. of MICRO, pages 45–55, 2009.
[12] D. Lustig and M. Martonosi. Reducing GPU offload latency via fine-grained CPU-GPU synchronization. In High Performance Computer Architecture (HPCA), 2013 IEEE 19th International Symposium on, pages 354–365, Feb. 2013.
[13] A. Navarro, A. Vilches, F. Corbera, and R. Asenjo. Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures. The Journal of Supercomputing, 2014.
[14] V. Ravi and G. Agrawal. A dynamic scheduling framework for emerging heterogeneous systems. In Proc. of HiPC, pages 1–10, 2011.
[15] J. Reinders. Intel Threading Building Blocks: Multi-core parallelism for C++ programming. O'Reilly, 2007.
[16] Q. Zhu, B. Wu, X. Shen, L. Shen, and Z. Wang. Understanding co-run degradations on integrated heterogeneous processors. In The 27th International Workshop on Languages and Compilers for Parallel Computing, 2014.