Reducing overheads of dynamic scheduling on heterogeneous chips

Francisco Corbera, Andrés Rodríguez, Rafael Asenjo, Angeles Navarro, Antonio Vilches
Universidad de Málaga, Andalucía Tech, Dept. of Computer Architecture, Spain
{corbera,andres,asenjo,angeles,avilches}@ac.uma.es

María J. Garzarán
Dept. of Computer Science, UIUC
garzaran@illinois.edu
ABSTRACT

In recent processor development, we have witnessed the integration of GPU and CPUs into a single chip. The result of this integration is a reduction of the data communication overheads. This enables an efficient collaboration of both devices in the execution of parallel workloads.

In this work, we focus on the problem of efficiently scheduling chunks of iterations of parallel loops among the computing devices on the chip (the GPU and the CPU cores) in the context of irregular applications. In particular, we analyze the sources of overhead that the host thread experiences when a chunk of iterations is offloaded to the GPU while other threads are concurrently executing other chunks on the CPU cores. We carefully study these overheads on different processor architectures and operating systems, using Barnes Hut as a case study representative of irregular applications. We also propose a set of optimizations to mitigate the overheads that arise in the presence of oversubscription and to take advantage of the different features of the heterogeneous architectures. Thanks to these optimizations we reduce the Energy-Delay Product (EDP) by 18% and 84% on the Intel Ivy Bridge and Haswell architectures, respectively, and by 57% on the Exynos big.LITTLE.

1. INTRODUCTION

Recently, we have seen a trend towards the integration of GPU and CPUs on the same die. Examples include recent Intel processors (Ivy Bridge, Haswell), the AMD APUs, and more power-constrained processors targeted at mobile and embedded devices, like the Samsung Exynos 5 or the Qualcomm Snapdragon 800, among others. The integration allows the sharing of the memory system, which reduces the communication overheads between the devices and enables a more effective cooperation among all the computing devices. In contrast to discrete GPUs (connected through a slower PCI bus), integrated GPUs also enable the implementation of more dynamic strategies to distribute the workload among the cores and the GPU. In fact, achieving maximum performance and/or minimum energy consumption often requires the simultaneous use of both the GPU and the CPU cores [8].

In this context, one problem that has received attention lately is the efficient execution of the iterations of a parallel loop on both devices, the CPU cores and the integrated GPU. However, the optimal division of work between CPU and GPU is very application and input data dependent, so a careful partitioning of the workload across the CPU cores and the GPU accelerator is required. In the case of regular applications, the main difficulty is to determine how to partition the load between the two devices to avoid load imbalance. This is usually accomplished by running a few chunks of iterations on both devices to determine the speed difference between them. The appropriate chunk for each device is then computed and scheduled to run [3]. In the case of irregular applications the challenge is bigger because, in this case, GPU performance can be suboptimal when the application workload is distributed among all the cores and the GPU without considering the size of the chunk assigned to the GPU, even when the load is balanced among the devices. For instance, using the irregular Barnes Hut benchmark [9] as our case study, we have found that the size of the chunk of iterations assigned to the GPU has a significant impact on its performance, as we will show in Section 2. Thus, for each application and input data, the scheduling strategy must be aware of the optimal chunk of iterations to offload to the GPU. Since the workload assigned to the GPU and the CPU cores must also be balanced, the best approach is a dynamic scheduling strategy that assigns the optimal chunk of iterations to each device. This means that the cores and the GPU will repeatedly receive a chunk of iterations to be computed until the end of the iteration space. This scheduling strategy is described in Section 3.

The intensive use of the GPU offloading mechanism reveals several sources of overhead that have to be carefully considered. For instance, not only the traditionally studied overheads like the data transfers (host-to-device and device-to-host) have to be taken into account, but also kernel launching and host thread dispatch overheads gain relevance. In this paper, we study the impact of each one of these overheads on different heterogeneous architectures with an integrated GPU. More precisely, in Section 4, we conduct our experiments on two Intel processors (Ivy Bridge and Haswell architectures) and one Samsung Exynos 5 featuring a big.LITTLE architecture. Two different operating systems (Windows and Linux) are considered.

With the information provided by these experiments we tackle the problem of reducing the impact of the most representative overheads. In Section 5, we propose and evaluate some optimizations that not only reduce the execution time but also the energy consumption. Finally, we present the related work in Section 6 and conclude that, in order to squeeze the last drop of performance out of these heterogeneous chips, it is mandatory to conduct a thorough analysis of the overheads and to study how the CPU cores, the GPU and the different layers of the software stack (OpenCL driver, OS, etc.) interplay (see Section 7).

Figure 1: GPU hardware metrics for Barnes Hut on Haswell through the iteration space, for GPU chunk sizes 320, 640, 1280 and 2560: (a) EU Active, (b) EU Idle, (c) EU Stalled, (d) GPU L3 cache misses, (e) throughput (iterations/ms). The legend of subfigure 1c applies to all subfigures.

2. MOTIVATION

We have found that, for irregular applications, offloading big chunk sizes to the GPU can hinder performance. This is illustrated in Figure 1, which shows the evolution of different GPU hardware metrics on the Haswell architecture (described in Section 4.1) through the iteration space of one time step of the irregular benchmark Barnes Hut. An input set of 100,000 bodies was used to collect these results. Each subfigure represents the evolution of the metric of interest for different chunk sizes assigned to the GPU (see the chunk size legend in subfigure 1c). We have used Intel VTune Amplifier 2015 [1] to trace the ratio of cycles when the EUs (Execution Units) are active (sum over all EUs of cycles when the EU executes instructions / sum over all EUs of all cycles); the ratio of cycles when the EUs are idle (sum over all EUs of cycles when no thread is scheduled on the EU / sum over all EUs of all cycles); the ratio of cycles when the EUs are stalled (sum over all EUs of cycles when the EU does not execute instructions and at least one thread is scheduled on it / sum over all EUs of all cycles); and the L3 cache misses due to GPU memory requests. Subfigure 1e also shows the GPU effective throughput, measured as the number of iterations per ms, through the iteration space. Data transfer and kernel offloading overheads have been included in the computation of the throughput.

As we can see in subfigure 1e, for the time step studied, the optimal chunk size is 640 (see the red line in the figure). Increasing the chunk size beyond this value degrades the throughput. The hardware metrics indicate that small chunk sizes (e.g. 320) do not effectively fill the GPU computing units, as the EU Idle ratio indicates in subfigure 1b (see the blue line). However, when the chunk size is large enough to fill the EUs (EU Idle < 0.1), the EUs might stall as the chunk size increases, due to the increment in L3 cache misses (see the green and pink lines in subfigure 1d). This is what happens in our irregular benchmark, in which the majority of memory accesses are uncoalesced. In our case, chunk sizes larger than 1280 dramatically increase the L3 misses, which in turn increases the EU Stalled ratio (>0.9) and reduces the EU Active ratio (<0.08), causing a reduction of the effective throughput.

Therefore, in the quest of finding the optimal distribution of work between the GPU and the CPU, if we assign a big chunk of iterations to each device we can end up not exploiting the GPU EUs optimally. Thus, the partitioning strategy must be aware of the optimal chunk of iterations that must be offloaded to the GPU for the corresponding application. On the other hand, we have observed that the effective throughput of the CPU cores is not so sensitive to the chunk size. As long as the chunk size is bigger than a threshold value, the CPU throughput tends to be constant.
Figure 2: Normalized time, energy and energy-delay-product (EDP) for Barnes Hut on Ivy Bridge, Haswell and Exynos (Bulk-Oracle vs. Dynamic, 3+1 and 4+1 configurations). The lower the values, the better.
For instance, when using the Threading Building Blocks library (TBB) [15], it is recommended to have CPU chunk sizes that take at least 100,000 clock cycles. By allowing a dynamic scheduling of the iteration space in such a way that each device gets an optimal chunk of iterations while balancing the workload among them, we can minimize the execution time. However, some overheads are involved in this type of scheduling, as we will see in the next sections. In any case, we explore here the applicability of a dynamic scheduling strategy on a heterogeneous architecture using Barnes Hut as example.

In Figure 2 we compare the time, energy and EDP (Energy-Delay Product) [6] for Barnes Hut on three different heterogeneous architectures using 4 CPU cores and 1 integrated GPU: Ivy Bridge, Haswell and Exynos. The input set also has 100,000 bodies, but now 75 time steps were computed. For each platform, two scenarios are shown: 3+1, which corresponds to the case in which 4 threads are scheduled on the processor (3 CPU threads + 1 host thread dedicated to offloading the work to the GPU), and 4+1, which corresponds to 5 threads (4 CPU threads + 1 host thread). The latter represents the case in which 1 thread of oversubscription is allowed. We explore this case because we want to make the most of our computing resources: since the host thread is blocked most of the time while the GPU takes care of its chunk of iterations, we add an additional thread to avoid one core becoming idle.

In the figure, we call our dynamic scheduling strategy Dynamic (it works similarly to the OpenMP dynamic scheduling policy of the pragma omp parallel for). In this strategy, first we have to find the optimal chunk size for the GPU (the GPU chunk). This is done through an offline training phase, where we explore different chunk sizes and choose the value that maximizes the effective throughput of the application on the corresponding GPU. Table 1 shows the optimal size for each platform. This size is passed to our scheduler, which dynamically assigns a new chunk to the GPU each time it finishes the computation. Similarly, each CPU core gets a new chunk of iterations (the CPU chunk) every time it finishes the previous one. The size of the CPU chunk is selected to balance the load with the GPU computation (details are provided in the next section).

We compare this dynamic strategy with a static scheduling that assigns a single GPU chunk to the GPU and the rest of the iterations to the CPU cores. Previously, for the static scheduling, we carry out an exhaustive offline profiling that looks for the static partitioning of the iteration space between the CPU and GPU that minimizes the execution time. We vary the percentage of the iteration space offloaded to the GPU (between 0% -only CPU execution- and 100% -only GPU execution-, using 10% steps). We call this strategy Bulk-Oracle. Again, Table 1 shows the optimal percentage of iterations offloaded to the GPU for each platform under the static partition strategy. Notice that with the static approach the whole chunk is offloaded to the GPU at the beginning of each time step.

Table 1: Optimal GPU chunk size for Dynamic and optimal percentage of the iteration space offloaded to the GPU for Bulk-Oracle

                 Ivy            Haswell         Exynos
                 3+1    4+1     3+1    4+1      3+1    4+1
  Dynamic        1536   1536    2048   2048     2048   2048
  Bulk-Oracle    50%    40%     70%    70%      20%    20%

In Figure 2 each parameter (time, energy and EDP) has been normalized with respect to the value obtained for the Bulk-Oracle 3+1 execution on each architecture. As we see, the dynamic strategy outperforms the static one (Bulk-Oracle), except in the case of Haswell for 4+1, where the overheads associated with the host thread degrade performance (both in time and energy). This issue is discussed in the next section. Another interesting result is that oversubscription (4+1) improves the execution times on Ivy, for both the static and dynamic approaches, but it also increases the energy consumption. This is due to the fact that 4 threads compute a larger number of chunks on the CPU cores and, since the CPU is less energetically efficient computing the chunks than the GPU, this results in higher energy consumption. For this architecture and this benchmark, the increment of the energy on the CPU is not amortized by the reduction of time. On the other hand, on Exynos, a static partitioning does not scale when going from 3+1 to 4+1, while the dynamic strategy reduces time and energy consumption.

3. SCHEDULING STRATEGY

In this section, we present in more detail our scheduler (Section 3.1), how the partitioner works (Section 3.2), and the potential sources of overhead (Section 3.3).
3.1 Scheduler description

Our scheduler, which we call Dynamic, considers loops with independent iterations (parallel_for) and features a work scheduling policy with a dynamic GPU and CPU chunk partitioning. Our approach dynamically partitions the whole iteration space into chunks or blocks of iterations. The goal of the partitioning strategy is to evenly balance the workload of the loop among the compute resources (GPU and CPU cores) as well as to assign to each device the chunk size that maximizes its throughput. This is key because, as shown in Figure 1, the chunk size can have a significant impact on the performance of heterogeneous architectures, especially when dealing with irregular codes.

Our scheduler builds on top of an extension of the TBB parallel_for template for heterogeneous GPU-CPU systems by Navarro et al. [13]. Figure 3 shows the pseudo-code to use the extended parallel_for construct in such heterogeneous systems.

 1 #include <HScheduler.h>
 2 class Body{
 3 public:
 4   void operatorCPU(int begin, int end) {
 5     for(int i=begin; i!=end; i++){ ... }
 6   }
 7   void operatorGPU(int begin, int end) {
 8     hostToDevice(begin, end){...}
 9     launchKernel(begin, end){...}
10     deviceToHost(begin, end){...}
11     clFinish();
12   }
13 };
14 ...
15 int main(int argc, char* argv[]){
16
17   Body body;
18
19   // Start task scheduler
20   task_scheduler_init init(nThreads);
21   ...
22   parallel_for(begin, end, body, Partitioner_H(G));
23   ...
24 }

Figure 3: Using the parallel_for template

As in any TBB program, the scheduler is initialized (line 20). In this step, the developer sets the number of OS threads, nThreads, that the TBB runtime will create, which can vary from 1 to the number of CPU cores plus one additional thread to host the GPU (the host thread). Then, the developer can invoke the parallel_for (line 22), which has the following parameters: the iteration space (the range begin, end), the body object of the loop (body), and the partitioner object (Partitioner_H). The latter argument effectively overloads the native TBB parallel_for function so that the heterogeneous version is invoked. Besides, the Partitioner_H method takes care of the dynamic partitioning strategy that gets the optimal chunk size for the GPU (parameter G, provided by the user) and for the CPU cores, as described in Section 3.2.

The user also writes the class Body that processes the chunk on the CPU cores or on the GPU (lines 2-13 in Figure 3). Two methods (operators) must be coded: one for the CPU (lines 4-6) and one for the GPU (lines 7-12). For the GPU, the user has to define two functions to perform the asynchronous host-to-device (line 8) and device-to-host (line 10) memory transfers, as well as the kernel launching (line 9). Since these functions are all asynchronous, we finish the GPU part of the body with the synchronous clFinish() call, which does not return until all the previous steps have been completed.

Internally, our scheduler is implemented as a pipeline that consists of two stages or filters: Filter1, which selects the computing device and the chunk size (number of iterations) assigned, and Filter2, which processes the chunk on the corresponding device. Filter1 first checks if the GPU device is available. In that case, a G token is created and initialized with the range of the GPU chunk. If there is no idle GPU device, then a CPU core is idle; thus, a C token is created and initialized with the range of the CPU chunk. In both cases, the partitioner extracts a chunk of iterations from the range of the remaining iteration space. Next, Filter2 processes the chunk on the corresponding device and records the time it takes to compute the corresponding chunk (note that for the GPU, the time includes the kernel execution time as well as the hostToDevice and deviceToHost times). This is necessary to compute the device's throughput, which is used by the partitioner described next.
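For illustration purposes, the two-filter organization just described can be prototyped with the standard TBB parallel_pipeline. The following sketch is only illustrative and relies on assumptions: the PartitionerStub, Token, processOnGPU() and processOnCPU() names are placeholders, not the classes of our framework.

#include <tbb/pipeline.h>
#include <tbb/tick_count.h>
#include <algorithm>
#include <atomic>

// Hypothetical chunk descriptor and partitioner stub, shown only to
// illustrate the two-filter structure of the scheduler.
struct Token { bool onGPU; int begin, end; };

struct PartitionerStub {
  int next = 0, last = 100000;      // remaining iteration range
  int G = 640, C = 320;             // GPU and CPU chunk sizes
  bool empty() const { return next >= last; }
  Token extract(bool onGPU) {       // carve a chunk out of the remaining range
    int size = onGPU ? G : C;
    Token t{onGPU, next, std::min(next + size, last)};
    next = t.end;
    return t;
  }
  void record(const Token&, double /*ms*/) { /* update per-device throughput */ }
};

void processOnGPU(int, int) { /* hostToDevice + launchKernel + deviceToHost + clFinish */ }
void processOnCPU(int, int) { /* run the CPU operator on [begin, end) */ }

void run_two_filter_pipeline(PartitionerStub &part, int nTokens) {
  std::atomic<bool> gpu_idle(true);
  tbb::parallel_pipeline(nTokens,
    // Filter 1 (serial): choose the device and extract the next chunk.
    tbb::make_filter<void, Token>(tbb::filter::serial_in_order,
      [&](tbb::flow_control &fc) -> Token {
        if (part.empty()) { fc.stop(); return Token{false, 0, 0}; }
        bool useGPU = gpu_idle.exchange(false);   // "G token" if the GPU is free,
        return part.extract(useGPU);              // otherwise a "C token"
      }) &
    // Filter 2 (parallel): process the chunk and time it for the throughput model.
    tbb::make_filter<Token, void>(tbb::filter::parallel,
      [&](Token t) {
        tbb::tick_count t0 = tbb::tick_count::now();
        if (t.onGPU) { processOnGPU(t.begin, t.end); gpu_idle = true; }
        else         { processOnCPU(t.begin, t.end); }
        part.record(t, (tbb::tick_count::now() - t0).seconds() * 1e3);
      }));
}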
3.2 Partitioning strategy

We assume that the execution time can be seen as a sequence of scheduling intervals {tG0, tG1, ..., tGi-1, tGi, tGi+1, ...} for the GPU and {tC0, tC1, ..., tCi-1, tCi, tCi+1, ...} for each CPU core. Each computing device, at the current interval tGi or tCi, can get a chunk of iterations. The running time T(tGi), for each GPU chunk size G(tGi) = G, or the time T(tCi) for a CPU chunk size C(tCi), is recorded. This time is used to compute the throughput, λG(tGi) for the GPU or λC(tCi) for a CPU core, in the current scheduling interval as:

    λG(tGi) = G / T(tGi)                (1)
    λC(tCi) = C(tCi) / T(tCi)           (2)

In order to compute the chunk size for the GPU, an offline training phase explores different chunk sizes and chooses the value that maximizes the effective throughput of the application for a given input data. To reduce the number of runs of this offline training phase, we set the initial GPU chunk size to the smallest number of iterations that fully occupies the GPU resources. For example, the integrated GPU of the Intel Haswell has 20 EUs (execution units), each one running a SIMD-thread (aka wavefront) at a time, with 8, 16 or 32 work-items per SIMD-thread (decided at compile time). In OpenCL these two values can be queried by reading the variables CL_DEVICE_MAX_COMPUTE_UNITS and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Thus, the product of these two values is used as the initial GPU chunk size. For Barnes Hut the compiler chooses SIMD-16 and therefore we select an initial GPU chunk size of 20 x 16 = 320 iterations. Then, chunk sizes that are multiples of this initial chunk size are tried.
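The two OpenCL queries mentioned above can be combined into the initial GPU chunk size as follows. This is a minimal sketch that assumes a valid cl_device_id and an already compiled cl_kernel; error checking is omitted.

#include <CL/cl.h>

// Query the number of compute units (EUs) of the device and the preferred
// SIMD width of the compiled kernel, and use their product as the initial
// GPU chunk size (e.g., 20 x 16 = 320 on the Haswell integrated GPU).
size_t initial_gpu_chunk(cl_device_id dev, cl_kernel kernel) {
  cl_uint num_cu = 0;
  clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                  sizeof(num_cu), &num_cu, NULL);
  size_t simd_width = 0;
  clGetKernelWorkGroupInfo(kernel, dev,
                           CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                           sizeof(simd_width), &simd_width, NULL);
  return (size_t)num_cu * simd_width;   // smallest chunk that fills all EUs
}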
We keep trying different chunk sizes while the throughput increases. Once the throughput decreases or remains stable for 2 or more chunk sizes, the training phase ends and the GPU chunk size that obtained the highest throughput is chosen. This approach requires only a few runs.

Regarding the policy to set the CPU core chunk size, our partitioner follows the heuristic by Navarro et al. [13]. Basically, this heuristic is based on a strategy to minimize the load imbalance among the CPU cores and the GPU, while maximizing the throughput of the system. To that end, the optimization model described there recommends that, each time a chunk is partitioned and assigned to a device, its size should be selected such that it is proportional to the device's effective throughput. Therefore, we implemented a greedy partitioning algorithm based on the following key observation: while there are sufficient remaining iterations, the chunk size assigned to the GPU at the scheduling interval tGi should be the optimal one for the GPU (G, as explained in the previous paragraph), whereas at the scheduling interval tCi the chunk size assigned to a CPU core, C(tCi), should verify:

    C(tCi) / λC(tCi-1) = G / λG(tGi-1)            (3)

where λC(tCi-1) and λG(tGi-1) are the CPU and GPU throughputs in the previous scheduling intervals, respectively. So we have:

    C(tCi) = G · λC(tCi-1) / λG(tGi-1)            (4)
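A direct translation of Equations (1)-(4) into code could look like the sketch below. The Throughput struct and the way samples are fed to it are illustrative assumptions, not our actual implementation.

#include <algorithm>

// Per-device throughput estimate updated after every chunk (Eqs. 1-2).
struct Throughput {
  double items = 0.0, ms = 0.0;
  void   record(int chunk, double elapsed_ms) { items = chunk; ms = elapsed_ms; }
  double value() const { return ms > 0.0 ? items / ms : 0.0; }  // iterations per ms
};

// CPU chunk for the next scheduling interval (Eq. 4): proportional to the
// CPU/GPU throughput ratio, never larger than what remains, and never below
// a minimum size that keeps the per-chunk overhead amortized.
int next_cpu_chunk(int G, const Throughput &cpu, const Throughput &gpu,
                   int remaining, int min_chunk) {
  double lam_c = cpu.value(), lam_g = gpu.value();
  int c = (lam_g > 0.0) ? (int)(G * lam_c / lam_g) : min_chunk;
  return std::max(min_chunk, std::min(c, remaining));
}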
3.3 Sources of overhead

Figure 4 shows the different phases that our Dynamic framework has to perform to offload a chunk of iterations to the GPU.

Figure 4: Time diagram showing the different phases of offloading a chunk to the GPU: Scheduling and Partitioning (Tc1-Tc2), Host-to-Device (Tg1-Tg2), Kernel launching (Tg2-Tg3), Kernel execution (Tg3-Tg4), Device-to-Host (Tg4-Tg5) and Thread dispatch (Tg5-Tc3).

As we explained in the previous section, our framework uses nThreads threads, one of which is called the host thread and just takes care of serving the GPU. This host thread runs on one of the available cores and first executes the code associated with the scheduler and the partitioner (the code of Filter1 explained earlier). Then, the same host thread executes the Filter2 stage that includes the function calls listed in Figure 3 (hostToDevice(), launchKernel(), deviceToHost() and clFinish()). The first three calls just asynchronously enqueue the corresponding operation on the GPU's command queue, whereas the latter is a synchronous wait. In Figure 4 we can see that the enqueued operations are sequentially executed on the GPU, where we consider the times taken by the "Host-to-Device", "Kernel launching", "Kernel execution" and "Device-to-Host" phases. When this last operation is done, the host thread is notified, but some time may be taken by the OS to re-schedule the host thread. This time is illustrated in the figure with the label "Thread dispatch".

In order to measure the relevant overheads involved in the execution of the code, some time stamps are taken on the CPU (Tc1, Tc2 and Tc3) and on the GPU (Tg1 to Tg5), as depicted in Figure 4. To get the CPU time stamps we rely on TBB's tick_count class, whereas for the GPU we configure the OpenCL command queue in profiling mode so that we can read the "start" and "complete" timestamps of each of the enqueued commands.

With this information we compute the overheads of Scheduling and Partitioning, Osp, the Host-to-Device operation, Ohd, Kernel Launching, Okl, Device-to-Host, Odh, and Thread Dispatch, Otd, as follows:

    Osp = Σ_{GPU chunks} (Tc2 - Tc1) / TotalExecutionTime                       (5)
    Ohd = Σ_{GPU chunks} (Tg2 - Tg1) / TotalExecutionTime                       (6)
    Okl = Σ_{GPU chunks} (Tg3 - Tg2) / TotalExecutionTime                       (7)
    Odh = Σ_{GPU chunks} (Tg5 - Tg4) / TotalExecutionTime                       (8)
    Otd = Σ_{GPU chunks} ((Tc3 - Tc2) - (Tg5 - Tg1)) / TotalExecutionTime       (9)
ing”,“Kernel execution”and“Device-to-Host”. When this
In this section we analyze the sub-optimal performance of
last operation is done, the host thread is notified but some
the Dynamic scheduler for the 4+1 configuration, as shown
timemaybetakenbytheOStore-schedulethehostthread.
in Figure 2. To better understand the underlying reasons
This time is illustrated in the figure with the label“Thread
for that performance degration we first describe our envi-
dispatch”.
ronmentsetup(Section4.1)andthendiscusstheoverheads
that we measure (Section 4.2).
In order to measure the relevant overheads involved in the
execution of the code some time stamps are taken on the
CPU (Tc1, Tc2 and Tc3) and on the GPU (Tg1 to Tg5) 4.1 ExperimentalSettings
as depicted in Figure 5. To get the CPU time stamps we We run our experiments on three different platforms: two
rely on TBB’s tick_count class, whereas for the GPU we Intel-based desktops and an Odroid XU3 bare-board. More
Figure 5: Percentage of the different overheads (% Osp, % Ohd, % Okl, % Odh, % Otd) for Barnes Hut on Ivy Bridge, Haswell and Exynos, for the Dynamic and Dynamic Pri schedulers under the 3+1 and 4+1 configurations. The lower, the better.
More precisely, the desktops are based on two quad-core Intel processors: a Core i5-3450 (3.1 GHz, Ivy Bridge architecture) and a Core i7-4770 (3.4 GHz, Haswell). Both processors feature an on-chip GPU, the HD 2500 and HD 4600, respectively. The Odroid has a Samsung Exynos 5 (5422) featuring a Cortex-A15 2.0 GHz quad-core along with a Cortex-A7 quad-core CPU (ARM's big.LITTLE architecture). This platform features current/power monitors based on TI INA231 chips that enable readouts of the instant power consumed on the A15 cores, A7 cores, main memory and GPU. The Exynos 5 includes the Mali-T628 MP6 GPU.

Regarding the software tools, on the desktops we rely on the Intel Performance Counter Monitor (PCM) tool [5] to access the HW counters (which also provides energy consumption in Joules). The GPU kernels are implemented with the Intel OpenCL SDK 2014, which is currently only available for the Windows OS. The benchmarks are compiled using the Intel C++ Compiler 14.0 with the -O3 optimization flag. The Odroid board runs Linux Ubuntu 14.04.1 LTS, and an in-house library has been developed to measure energy consumption, as described below. On this platform, the compiler used is gcc 4.8 with the -O3 flag. The OpenCL SDK v1.1.0 is used for the Mali GPU.

On all platforms, Intel TBB 4.2 provides the core template of the heterogeneous parallel_for. We measured time and energy in 10 runs of the applications and report the average.

4.1.1 Energy measurement on the Exynos 5

The Odroid XU3 platform is shipped with an integrated power analysis tool. It features four on-board INA231 current/power monitors (http://www.ti.com/product/ina231) to monitor the Cortex A15 cores, Cortex A7 cores, DRAM and GPU power dissipation. The readouts from these power sensors are accessible through the /sys file system from user space. The system does not provide cumulated energy consumptions, so only instant power readings are available.

We have developed a library to measure the energy consumed by our executions. The library allows starting/stopping a dedicated thread that samples the power readings and integrates them through time using the real-time system clock. Once the sampling thread is working, the library also provides functions to take partial energy measurements to profile the energy consumption through the different algorithm stages. It is possible to monitor the energy (in Joules) consumed by the Cortex A15 cores, the Cortex A7 cores, DRAM and GPU separately. The energy figures on the Exynos 5 that we present in this paper are the sum of the four power monitors aforementioned.

The sample rate used is 10 Hz. Thus, one sampled power value is obtained every 100 milliseconds. The power value read is multiplied by the sampling period and the product is accumulated into a cumulative energy value (in Joules). We have chosen this sample rate because the power sensors update their values approximately every 260 milliseconds, so sampling roughly twice as fast is good enough for getting accurate measurements (the sampling period is below half the sensor update period, so the Nyquist criterion is satisfied).
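The in-house energy library is not public, but its sampling-and-integration idea can be sketched as follows. The sensor path passed as a parameter is a placeholder (the exact INA231 sysfs entries depend on the board's kernel), and the 100 ms period matches the 10 Hz rate mentioned above.

#include <atomic>
#include <chrono>
#include <fstream>
#include <thread>

// Sketch of a sampler that integrates instant power (Watts) into energy (Joules).
// sensor_path is a placeholder for one of the INA231 sysfs entries on the Odroid.
class EnergySampler {
  std::atomic<bool> running{false};
  std::thread worker;
  double joules = 0.0;

  static double read_power_watts(const char *path) {
    std::ifstream f(path);
    double w = 0.0;
    f >> w;                       // instant power reported by the sensor
    return w;
  }

public:
  void start(const char *sensor_path, int period_ms = 100) {
    running = true;
    worker = std::thread([=] {
      auto last = std::chrono::steady_clock::now();
      while (running) {
        std::this_thread::sleep_for(std::chrono::milliseconds(period_ms));
        auto now = std::chrono::steady_clock::now();
        double dt = std::chrono::duration<double>(now - last).count();
        joules += read_power_watts(sensor_path) * dt;   // P * dt integration
        last = now;
      }
    });
  }
  double stop() { running = false; worker.join(); return joules; }
};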
4.2 Discussion

Figure 5 shows the overheads that we measured for the corresponding terms defined in the previous section (Eqs. 5-9) for the three platforms that we consider, Ivy Bridge, Haswell and Exynos, when running the Barnes Hut benchmark with 100,000 bodies and 15 time steps for the 3+1 and 4+1 scenarios. This section covers only the Dynamic results (shown on the left side) of the three subfigures in Figure 5.

The smallest overhead, on average, is the one due to Scheduling and Partitioning, Osp, which represents 0.02% on Ivy Bridge, and less than 0.004% on Haswell and Exynos.

Regarding the data transfer overheads, Ohd and Odh, on Ivy Bridge and Haswell they are always below 0.3%. However, on Exynos these overheads are significantly larger, around an order of magnitude higher: on average Ohd = 2.8% and Odh = 1.6%. After testing our platforms with memory-bound micro-benchmarks, we found that Exynos exhibits an order of magnitude higher data transfer times than the Intel architectures. This explains the impact of the Host-To-Device and Device-To-Host overheads on the Exynos. More precisely, the hostToDevice() operation implicitly copies the host buffer onto a different region of memory that can be accessed by the GPU and that is non-pageable (pinned memory). Similarly, deviceToHost() does another memory-to-memory copy operation in the other direction. Therefore, the lower memory bandwidth on the Exynos results in more appreciable Host-To-Device and Device-To-Host overheads with respect to the Intel architectures. As future work we will study how Barnes Hut memory accesses could be reorganized so that the cores and the GPU could share the same buffer, avoiding the copy operations by using the zero-copy buffer capability of OpenCL.
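The zero-copy alternative mentioned as future work relies on standard OpenCL buffer flags. A minimal sketch, assuming the host allocation already satisfies the alignment the driver expects, could be:

#include <CL/cl.h>

// Create a buffer that wraps the host allocation (CL_MEM_USE_HOST_PTR) so the
// integrated GPU can work on the same memory, instead of issuing explicit
// hostToDevice()/deviceToHost() copies.
cl_mem make_zero_copy_buffer(cl_context ctx, void *host_data, size_t bytes) {
  cl_int err;
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                              bytes, host_data, &err);
  return (err == CL_SUCCESS) ? buf : NULL;
}

// After the kernel finishes, the host accesses the results through map/unmap,
// which on an integrated GPU typically does not copy the data.
void host_access(cl_command_queue q, cl_mem buf, size_t bytes) {
  cl_int err;
  void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ, 0, bytes,
                               0, NULL, NULL, &err);
  /* ... read the results through p ... */
  clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
}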
With respect to the Kernel Launch overhead, it goes from up to 0.4% on Haswell to up to 3% and 2% on Ivy Bridge and Exynos, respectively. The average times consumed by this operation are 1.8 ms, 1 ms, and 3.6 ms on Ivy Bridge, Haswell and Exynos, respectively.

However, on the desktop platforms one of the most noticeable overheads is Otd (Thread Dispatch), especially for the 4+1 scenario. For that case, it represents 22% and 33% of the total execution time on Ivy Bridge and Haswell, respectively. Notice that this happens only for the oversubscription cases and under the Windows OS. Actually, on Exynos under the Linux OS, the Thread Dispatch overhead represents less than 0.09% in all cases. The overhead in Windows is explained because the OpenCL driver ends up blocking the host thread once it has offloaded the kernel. It does that to avoid wasting a core on a busy-waiting host thread. In the 4+1 scenario, 4 threads are already intensively using the 4 cores and, when the OpenCL driver notifies the OS about the GPU completion, the OS wakes up the host thread. However, if this host thread does not have enough priority, it is unlikely that it will be dispatched to the running state straightaway. The core of the Windows scheduling policy is Round Robin, so the just-awakened host thread has to wait on the ready queue until the next available time slice. This does not happen in the Linux scheduler, where a just-awakened thread gains more priority than the other CPU-bound threads that are intensively computing CPU chunks. This scheduling decision in Linux rewards interactive (IO-bound) threads. As a consequence, in Linux, the host thread will be immediately dispatched after being woken up. This result is corroborated by the experiment described in the next section.

5. OPTIMIZATIONS

After identifying the main sources of overhead of our Dynamic approach, in this section we discuss the optimizations that can be implemented to address them. The overarching goal is not only to reduce the impact of the overheads, but also to reduce the energy consumption. To that end, we propose one strategy for the Windows-based platforms (Section 5.1) and a different one for the Exynos architecture (Section 5.2).

5.1 Increasing the priority of the host thread

As shown in the previous section, on the Windows-based desktops the highest overhead appears for the 4+1 configuration, where we have measured that up to 22% and 33% of the execution time is wasted on the clFinish() operation on the Ivy Bridge and Haswell, respectively. This can be solved by assigning a higher priority to the host thread, so that another thread can be immediately preempted and the just-awoken host thread can take up a core and start feeding the GPU again. This is key when the GPU processes the chunks more efficiently than the CPU, as happens in our benchmark. To boost the host thread priority we rely on the SetThreadPriority() Windows API. The framework obtained with this optimization is called Dynamic Pri.
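A minimal sketch of this optimization follows; it assumes the boost is applied from the host thread itself, right before it starts serving the GPU, which may differ from the exact point chosen in our framework.

#include <windows.h>

// Raise the priority of the calling thread (the host thread that feeds the
// GPU) so that, after clFinish() unblocks it, the Windows scheduler dispatches
// it ahead of the CPU-bound worker threads.
bool boost_host_thread_priority() {
  return SetThreadPriority(GetCurrentThread(),
                           THREAD_PRIORITY_HIGHEST) != 0;
}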
The right part of the three subfigures of Figure 5 shows the overheads incurred by Barnes Hut when using Dynamic Pri and the arguments described in the previous section. The figure shows that, on Ivy Bridge and Haswell, the Otd overhead for the 4+1 configuration has been reduced and is now similar to that of the 3+1 one. This confirms that increasing the priority of the host thread for the 4+1 configuration reduces this particular overhead. Note that on the Exynos platform increasing the host thread priority barely affects the measured overheads.

Figure 6 shows the normalized metrics (time, energy, and EDP) w.r.t. Bulk-Oracle 3+1, as shown in Figure 2, but now adding a new bar representing the results for Dynamic Pri. As expected, Figure 6 confirms that for Ivy and Haswell, which run the Windows OS, boosting the priority of the host thread has almost no impact in reducing time, energy or EDP for the 3+1 configuration. However, with respect to Dynamic, Dynamic Pri has a significant impact on these metrics in the oversubscribed 4+1 scenario. For instance, on the Ivy Bridge, Dynamic Pri reduces time, energy, and EDP by 10%, 7% and 18%, respectively, when compared with Dynamic. On Haswell, these reductions are even more significant: 37%, 33% and 84% for time, energy and EDP, respectively. One interesting result appears on Ivy Bridge when comparing the Dynamic 3+1 configuration versus the Dynamic 4+1, as we mentioned in Section 2. Recall that even though the 4+1 configuration is faster than the 3+1, the energy consumed by the 4+1 is higher; this is not true anymore for Dynamic Pri. Now, Dynamic Pri uses the GPU more efficiently (it is served quicker and this results in the GPU processing more chunks). Therefore, the more energy-consuming CPU cores end up doing less work (and consuming less energy), resulting in smaller total energy consumption figures.

Finally, notice that incrementing the priority of the host thread does not result in further reductions in time or energy on the Exynos platform. However, in the next section we explore a possible strategy to further reduce time and energy consumption on this architecture.

5.2 Exploiting the big.LITTLE architecture

One interesting feature of the Exynos 5422 is that it supports global task scheduling (GTS). This enables using the four A15 cores and the four A7 cores at the same time. The Linux scheduler automatically uses the more powerful A15 cores for compute-intensive threads, whereas the power-saving A7 cores are reserved for system and background tasks.

Figure 7 shows execution time (ms), energy (J) and EDP for the Exynos platform with configurations 3+1, 4+1, 7+1, and 8+1. The 7+1 and the 8+1 configurations use both the A15 and the A7 cores for the computing threads, while the 3+1 and the 4+1 only use the A15 cores for the computing threads. In all the configurations, the mapping of the host thread to an A15 or an A7 core depends on the pinning strategy described below. Notice that the GPU also computes chunks of iterations, so there is now a larger degree of heterogeneity.
Figure 6: Normalized time, energy and energy-delay-product (EDP) for Barnes Hut on Ivy Bridge, Haswell and Exynos (Bulk-Oracle, Dynamic and Dynamic Pri, 3+1 and 4+1 configurations), adding a higher priority to the host thread. The lower, the better.
Figure 7: Time (ms), energy (J) and energy-delay-product (EDP) for Barnes Hut on Exynos using A15 and A7 cores (Dynamic, Dynamic Pri, Dynamic A7 and Dynamic Pri A7 for the 3+1, 4+1, 7+1 and 8+1 configurations). The lower, the better.
Four alternatives are considered for each configuration (a sketch of the pinning mechanism is given after this list):

• Dynamic is the baseline approach, where the chunk size assigned to the A15 and A7 cores depends on the relative throughput of each one. The host thread is pinned to one of the A15 cores.

• Dynamic Pri is based on the baseline, but it increases the priority of the host thread.

• Dynamic A7 is based on the baseline, but pins the host thread to one of the A7 cores.

• Dynamic Pri A7 is based on the baseline, but it pins the host thread to one of the A7 cores and increases its priority.
although their impact is relatively small. We notice that
the host thread to one of the A7 cores and increases
for7+1and8+1configurations,DynamicPri,DynamicA7,
its priority.
andDynamicPriA7,obtainverysimilarresultsintermsof
timeandenergy. Again,for7+1and8+1scenariosandwith
The energy sampling thread (see Section 4.1.1) is always respect to Dynamic, Dynamic Pri and Dynamic A7 can, on
pinned to one of the A7 cores, so it only produces an over- theaverage,reducethetime,energyandEDPby4.3%,3.6%
head of around 0.5% of the total execution time when run- and 7.8%, respectively. All in all, w.r.t. the Dynamic 4+1,
ningBarnesHutunder 8+1scenarios(lessthan0.02%oth- Dynamic Pri 8+1 reduces EDP by 57%.
erwise). Regarding the breakdown of energy consumption,
A15coresusebetween72%and76%ofthetotalenergycon- Overall, while we expected that pinning the host thread to
sumed, the GPU between 15% and 20%, and the memory the A7 would have a higher impact on time or energy, our
subsystem around 4.5%. Interestingly, A7 cores only take experimental results show little impact on either of them.
around 2.5% of the total energy when they are idle (3+1
and 4+1 scenarios) and around 7.3% when they are fully 6. RELATEDWORK
utilized (7+1 and 8+1 scenarios). For our benchmark, the The closest work to ours is that of Zhu et al. [16], which
four A7 cores consume one order of magnitude less energy address the problem of performance degradation when sev-
than the four A15 cores. eral independent OpenCL programs run at the same time
(co-run) on the CPU and on the GPU of an Ivy Bridge us-
Results in Figure 7 show that by increasing the number of ing the Windows OS. The programs running on the CPU
threadstousetheA7cores(compare3+1or4+1with7+1 use all cores, so they are in a situation similar to our 4+1
or 8+1), execution time and energy reduce significantly. In configurations (oversubscription). To avoid degradation of
particular, going from Dynamic 4+1 to Dynamic 8+1, we the GPU kernel they also propose increasing the priority of
reducetime,energyandEDPby22%,19%and46%respec- thethreadthatlaunchestheGPUkernel. Ourstudydiffers
tively. Increasing the priority of the host thread or pinning from theirs because we do not run two different programs,
instead we partition the iteration space of a single program 8. REFERENCES
to exploit both, the CPU and the GPU. Our study also [1] Intel VTune Amplifier 2015, 2014.
shows that increasing the priority of the host thread is not https://software.intel.com/en-us/intel-vtune-amplifier-
necessary when there is no oversubscription (i.e. 3+1) or xe.
whentheunderlyingOSisLinux. Wealsoassesstheuseof [2] C. Augonnet, J. Clet-Ortega, S. Thibault, and
a big.LITTLE architecture. R. Namyst. Data-aware task scheduling on
multi-acceleratorbasedplatforms.InProc of ICPADS,
Other works as [12, 7] also address the overhead problems 2010.
while offloading computation to GPUs. The work of Lustig [3] M. Belviranli, L. Bhuyan, and R. Gupta. A dynamic
andMartonosi[12]presentsaGPUhardwareextensioncou- self-scheduling scheme for heterogeneous
pledwithasoftwareAPIthataimsatreducingtwosources multiprocessor architectures. ACM Trans. Archit.
of overhead: data transfers and kernel launching. They use Code Optim., 9(4), Jan. 2013.
a Full/Empty Bits technique to improve data staging and [4] J. Bueno, J. Planas, A. Duran, R. Badia,
synchronization in CPU-GPU communication. This tech- X. Martorell, E. Ayguade, and J. Labarta. Productive
niqueallowssubsetsofdataresultsbeingtransferredtothe programming of GPU clusters with OmpSs. In Proc.
CPUproactively,ratherthanwaitingfortheentirekernelto of IPDPS, 2012.
finish. Grasso et al. [7] propose several host code optimiza-
[5] R. Dementiev, T. Willhalm, O. Bruggeman, P. Fay,
tions (Use of Zero-copy Buffer, Global Work Size equal to
P. Ungerer, A. Ott, P. Lu, J. Harris, P. Kerly, and
multiples of #EUs) in order to reduce GPU’s computation
P. Konsor. Intel performance counter monitor.
overheads on embedded GPUs. They present a comparison
www.intel.com/software/pcm, 2012.
in terms of performance and energy consumption between
[6] R. Gonzales and M. Horowitz. Energy dissipation in
an OpenCL legacy version and an OpenCL optimized one.
general purpose microprocessors. IEEE Journal of
Ourworkfocusonreducingthesourcesofoverheadaswell,
Solid-State Circuits, 31(9), 1996.
but in contrast, we focus on CPU-GPU collaborative com-
[7] I. Grasso, P. Radojkovic, N. Rajovic, I. Gelado, and
putation instead of only targeting the integrated GPU.
A. Ramirez. Energy efficient hpc on embedded socs:
Optimization techniques for mali gpu. In Parallel and
Several previous works study the problem of automatically
Distributed Processing Symposium, 2014 IEEE 28th
schedulingonheterogeneousplatformswithamulticoreand
International, pages 123–132, May 2014.
anintegratedordiscreteGPU[11,14,2,10,4,3,8]. Among
[8] R. Kaleem et al. Adaptive heterogeneous scheduling
thoseworks,theonlyonethatalsouseschipswithintegrated
for integrated GPUs. In Intl. Conf. on Parallel
GPUs is Concord [8]. However, Concord does not analyze
Architectures and Compilation, PACT ’14, pages
theoverheadsincurredbyoffloadingachunkofiterationsto
151–162, 2014.
the GPU.
[9] M. Kulkarni, M. Burtscher, C. Cascaval, and
K. Pingali. Lonestar: A suite of parallel irregular
programs. In ISPASS, Apr 2009.
7. CONCLUSIONS
[10] J. Lima, T. Gautier, N. Maillard, and V. Danjean.
In this work, we elaborate on the possibility of successfully
Exploiting concurrent GPU operations for efficient
implementingadynamicschedulingstrategythatautomati-
work stealing on multi-GPUs. In SBAC-PAD’12,
callydistributestheiterationspaceofaparallelloopamong
pages 75–82, 2012.
the cores and the GPU of an heterogeneous chip. To that
[11] C.-K. Luk, S. Hong, and H. Kim. Qilin: Exploiting
goal it is key to guarantee optimal resource occupation and
parallelism on heterogeneous multiprocessors with
load balance while reducing the impact of the overhead.
adaptive mapping. In Proc. Micro, pages 45–55, 2009.
OurproposalisevaluatedfortheBarnesHutbenchmarkon
[12] D. Lustig and M. Martonosi. Reducing gpu offload
mid/low power heterogenous architectures like Ivy Bridge,
latency via fine-grained cpu-gpu synchronization. In
Haswell, and Exynos 5, where the first two run under Win-
High Performance Computer Architecture
dows OS and the latter under Linux. We have studied the
(HPCA2013), 2013 IEEE 19th International
sources of overhead on these systems, finding that, under
Symposium on, pages 354–365, Feb 2013.
Windows,theoverheadduetore-schedulingthehostthread
[13] A. Navarro, A. Vilches, F. Corbera, and R. Asenjo.
isprohibitiveinoversubscribedscenarios(morethreadsthan
Strategies for maximizing utilization on multi-CPU
cores). We solve this issue by increasing the priority of the
and multi-GPU heterogeneous architectures. The
host thread. Our experimental results show that this re-
Journal of Supercomputing, 2014.
duces Energy-Delay Product (EDP) by 18% and 84% on
Intel Ivy Bridge and Haswell architecture, respectively. On [14] V. Ravi and G. Agrawal. A dynamic scheduling
theExynosplatform,theLinuxschedulersuccessfullydeals framework for emerging heterogeneous systems. In
withoversubscriptionwhenonlyusingtheA15cores. There- Proc. of HiPC, pages 1–10, 2011.
fore, for this platform we explore the benefits, in terms of [15] J. Reinders. Intel Threading Building Blocks:
timeandenergy,thatcanbeachievedwhenpinningthehost Multi-core parallelism for C++ programming.
threadtoalowpowercoreorbythecombinedusageofthe O’Reilly, 2007.
fourA15coresandthefourA7coresincludedintheExynos [16] Q. Zhu, B. Wu, X. Shen, L. Shen, and Z. Wang.
5. Our experimental results show that using the A7 cores Understanding co-run degradations on integrated
reduces EDP by 46%. Increasing the priority of the host heterogeneous processors. In The 27th International
threadorpinningthethreadtoanA7reducesEDPbyand Workshop on Languages and Compilers for Parallel
additional 7.8% in the Exynos big.LITTLE architecture. Computing, 2014.