Scalable Task Parallelism for NUMA: A Uniform Abstraction for Coordinated Scheduling and Memory Management

Andi Drebes, The University of Manchester, School of Computer Science, Oxford Road, Manchester M13 9PL, United Kingdom
Antoniu Pop, The University of Manchester, School of Computer Science, Oxford Road, Manchester M13 9PL, United Kingdom
Karine Heydemann, Sorbonne Universités, UPMC Univ Paris 06, CNRS, UMR 7606, LIP6, 4 Place Jussieu, F-75252 Paris Cedex 05, France
Albert Cohen, INRIA and École Normale Supérieure, 45 rue d'Ulm, F-75005 Paris, France
Nathalie Drach, Sorbonne Universités, UPMC Univ Paris 06, CNRS, UMR 7606, LIP6, 4 Place Jussieu, F-75252 Paris Cedex 05, France

PACT '16, September 11-15, 2016, Haifa, Israel. DOI: http://dx.doi.org/10.1145/2967938.2967946

ABSTRACT

Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best-effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system and placement information from the operating system. We achieve 94% of local memory accesses on a 192-core system with 24 NUMA nodes, up to 5× higher performance than NUMA-aware hierarchical work-stealing, and even 5.6× compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot catch up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.

Keywords

Task-parallel programming; NUMA; Scheduling; Memory allocation; Data-flow programming.

1. INTRODUCTION

High-performance systems are composed of hundreds of general-purpose computing units and dozens of memory controllers to satisfy the ever-increasing need for computing power and memory bandwidth. Shared memory programming models with fine-grained concurrency have successfully harnessed the computational resources of such architectures [3, 33, 28, 30, 31, 10, 19, 8, 9, 7, 35]. In these models, the programmer exposes parallelism through the creation of fine-grained units of work, called tasks, and the specification of synchronization that constrains the order of their execution. A run-time system manages the execution of the task-parallel application and acts as an abstraction layer between the program and the underlying hardware and software environment. That is, the run-time is responsible for the bookkeeping activities necessary for the correctness of the execution (e.g., the creation and destruction of tasks and their synchronization), for interfacing with the operating system for resource management (e.g., allocation of data and meta-data for tasks, scheduling tasks to cores) and for efficient exploitation of the hardware.
This concept relieves the programmer from dealing with details of the target platform and thus greatly improves productivity. Yet it leaves issues related to efficient interaction with system software, efficient exploitation of the hardware, and performance portability to the run-time. On today's systems with non-uniform memory access (NUMA), memory latency depends on the distance between the requesting cores and the targeted memory controllers. Efficient resource usage through task scheduling therefore needs to go hand in hand with the optimization of memory accesses through the placement of physical pages. That is, memory accesses must be kept local in order to reduce latency, and data must be distributed across memory controllers to avoid contention.

The alternative of abstracting computing resources only and leaving NUMA-specific optimization to the application is far less attractive. The programmer would have to take into account the different characteristics of all target systems (e.g., the number of NUMA nodes, their associated amount of memory and access latencies), to partition application data properly and to place the data explicitly using operating-system-specific interfaces. For applications with dynamic, data-dependent behavior, the programmer would also have to provide mechanisms that constantly react to changes throughout the execution, as an initial placement with high data locality might have to be revised later on. Such changes would have to be coordinated with the run-time system to prevent destructive performance interference, introducing a tight and undesired coupling between the run-time and the application.

On the operating system side, optimizations are compelled to place tasks and data conservatively [13, 24], unless provided with detailed affinity information by the application [5, 6], high-level libraries [26] or domain-specific languages [20]. Furthermore, as task-parallel run-times operate in user space, a separate kernel component would add additional complexity to the solution; this advocates for a user-space approach.
This paper shows that it is possible to efficiently and portably exploit dynamic task parallelism on NUMA machines without exposing programmers to the complexity of these systems, preserving a simple, uniform abstract view for both memory and computations, yet achieving high locality of memory accesses. Our solution exploits the task-parallel data-flow programming style and its transparent privatization of task data. This allows the run-time to determine a task's working set, enabling transparent, fine-grained control over task and data placement.

Based on the properties of task-private data, we propose a dynamic task and data placement algorithm that ensures that input and output data are local and that interacts constructively with work-stealing to provide load balancing across both cores and memory controllers:

• Our memory allocation mechanism, called deferred allocation, avoids making early placement decisions that could later harm performance. In particular, the memory to store task output data is not allocated until the task placement is known. The mechanism hands over this responsibility to the producer task on its local NUMA node. This scheme guarantees that all accesses to task output data are local. Control over data placement is obtained through the privatization of task output data.

• To enhance the locality of read memory accesses, we build on earlier work [14] and propose enhanced work-pushing, a work-sharing mechanism that interacts constructively with deferred allocation. Since the inputs of a task are outputs of another task, the location of input data is determined by deferred allocation when the producer tasks execute. Enhanced work-pushing is a best-effort mechanism that places a task according to these locations before task execution and thus before allocating memory for the task's outputs.

This combination of enhanced work-pushing and deferred allocation is fully automatic, application-independent, portable across NUMA machines, and transparently adapts to dynamic changes at run time. The detailed information about the affinities between tasks and data required by these techniques is either readily available or can be obtained automatically in the run-times of task-parallel programming models with inter-task dependences, such as StarSs [30], OpenMP 4 [28], SWAN [33] and OpenStream [31], which allow the programmer to make inter-task data dependences explicit. While specifying the precise task-level data flow rather than synchronization constraints alone requires more initial work from programmers, this effort is more than offset by the resulting benefits in terms of performance and performance portability.

The paper is organized as follows. Section 2 presents the principles of enhanced work-pushing. For a more complete discussion of our solutions, we propose multiple heuristics taking into account the placement of input data, output data or both. Section 3 introduces deferred allocation, including a brief outline of the technical solutions employed for fine-grained data placement. Sections 4 and 5 present the experimental methodology and results. A comparison with dynamic page migration is presented in Section 6. Section 7 discusses the most closely related work, before we conclude in Section 8.

2. TASK SCHEDULING WITH ENHANCED WORK-PUSHING

Let us start with terminology and hypotheses about the programming and execution models.

2.1 An abstract model for task parallelism

Our solutions are based on shared-memory task-parallel programming models with data dependences. Each task is associated with a set of incoming data dependences and a set of outgoing data dependences, as illustrated by Figure 1. Each dependence is associated with a contiguous region of memory, called input buffer and output buffer for incoming and outgoing dependences, respectively. The addresses of these buffers are collected in the task's frame, akin to the activation frame storing a function's arguments and local variables in the call stack. While the frame is unique and allocated at the task's creation time, its input and output buffers may be placed on different NUMA nodes and allocated later in the life cycle of the task, but no later than the beginning of the execution of the task, which reads from input buffers and writes into output buffers. Buffer placement and allocation time have a direct influence on locality and task-data affinity. Since we ought to offer a uniform abstraction of NUMA resources, we assume input and output buffers are managed by the run-time system rather than explicitly allocated by the application. This is the case for programming models such as OpenStream [31] and CnC [8], but not for StarSs [30] and OpenMP 4.0; see Section 7 for further discussion. We say that a task t_c depends on another task t_p if t_p has an outgoing dependence associated to a buffer b and if t_c has an incoming dependence associated to b. In this scenario the task t_p is referred to as the producer and t_c as the consumer.

Although this is not a fundamental requirement of our work, we will assume for simplicity that a task executes from beginning to end without interruption. A task becomes ready for execution when all of its dependences have been satisfied, i.e., when its producers of input data have completed and when its consumers have been created with the addresses of their frames communicated to the task. The working set of a task is defined as the union of all memory addresses that are accessed during its execution. Note that the working set does not have to be identical to the union of a task's input and output buffers (e.g., a task may access globally shared data structures). However, since our algorithms require accurate communication volume information, we assume that the bulk of the working set of each task is constituted by its input and output data. This is the case for all of the benchmarks studied in the experimental evaluation.

A worker thread is responsible for the execution of tasks on its associated core. Each worker is provided with a queue of tasks ready for execution from which it pops and executes tasks. When the queue is empty, the worker may obtain a task from another one through work-stealing [4]. A task is pushed to the queue by the worker that satisfies its last remaining dependence. We say that this worker activates the task and becomes its owner.

The execution of a task-parallel program starts with a root task derived from the main function of a sequential process. New tasks are created dynamically. The part of the program involved in creating new tasks is called the control program. If only the root task creates other tasks we speak of a sequential control program, otherwise of a parallel control program.
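The paper describes the task frame only in prose; purely as an illustration, the C sketch below restates the abstract model with hypothetical names (buffer_desc_t and task_frame_t are ours, not taken from the OpenStream run-time): a frame collects descriptors of the task's input and output buffers, each recording an address, a size and the NUMA node the buffer was placed on, plus a counter of unsatisfied dependences.

    #include <stddef.h>

    /* Hypothetical illustration of the abstract task model of Section 2.1;
     * names and layout are not taken from the OpenStream run-time.        */
    #define NODE_UNKNOWN (-1)

    typedef struct buffer_desc {
        void   *addr;   /* contiguous memory region backing the dependence   */
        size_t  size;   /* number of bytes exchanged through this dependence */
        int     node;   /* NUMA node holding the buffer, or NODE_UNKNOWN     */
    } buffer_desc_t;

    typedef struct task_frame {
        int            n_in, n_out;    /* number of incoming/outgoing deps   */
        buffer_desc_t *in;             /* input buffers (read-only data)     */
        buffer_desc_t *out;            /* output buffers (write-only data)   */
        int            deps_remaining; /* task becomes ready when this is 0  */
        void         (*work)(struct task_frame *);  /* task body             */
    } task_frame_t;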
2.2 Weaknesses of task parallelism on NUMA systems

Whether memory accesses during the execution of a task target the local memory controller of the executing core or some remote memory controller depends on the placement of the input and output buffers and on the worker executing the task. These affinities are highly dynamic and can depend on many factors, such as:

• the order of task creations by the control program;
• the execution order of producers;
• the duration of each task;
• work-stealing events;
• or resource availability (e.g., available memory per node).

Figure 1: Most general case for a task t: n inputs of size δ_i^0, ..., δ_i^(n-1), m outputs of size δ_o^0, ..., δ_o^(m-1), data placed on n+m NUMA nodes N_i^0, ..., N_i^(n-1), N_o^0, ..., N_o^(m-1).

In earlier work [14], we showed that some of these issues can be mitigated by using work-pushing. Similar to the abstract model discussed above, the approach assumes that tasks communicate through task-private buffers. However, it also assumes that all input data of a task is stored in a single, contiguous memory region rather than multiple input buffers. As a consequence, the task's input data is entirely located on a single node. This property is used by the work-pushing technique, in which the worker activating a task only becomes its owner if the task's input data is stored on the worker's node. If the input data is located on another node, the task is transferred to a random worker associated to a core on that node. The approach remains limited, however: (1) as it assumes that all inputs are located on the same node, it is ill-suited for input data located on multiple nodes, and (2) it does not optimize for outgoing dependences.

Below, we first present enhanced work-pushing, a generalization of work-pushing capable of dealing with input data distributed over multiple input buffers potentially placed on different nodes. This technique serves as a basis for the complementary deferred allocation technique, presented in the next section, which allows the run-time to improve the placement of output buffers. We introduce three work-pushing heuristics that schedule a task according to the placement of its input or output data or both. This complements the study of the effects of work-pushing on data locality. Experimentally, in Section 5, it enables us to show the limitations of NUMA-aware scheduling alone, limited to passive reactions to a given data placement.

2.3 Enhanced work-pushing

The three heuristics for enhanced work-pushing are called input-only, output-only and weighted. The first two heuristics take into account only incoming and only outgoing dependences, respectively. The weighted heuristic takes into account all dependences, but associates different weights to incoming and outgoing dependences to account for the fact that read and write accesses usually do not have the same latency.
Algorithm 1 shows how the heuristics above are used by the function activate, which is called when a worker w activates a task t (i.e., when the task becomes ready). Lines 1 to 3 define the variables u_in and u_out, indicating which types of dependences should be taken into account according to the heuristic h. The variables used to determine whether the newly activated task needs to be transferred to a remote node are initialized in Lines 5 to 8: the data array stores the cumulated size of input and output buffers of t for each of the N nodes of the system, D_in stands for the incoming dependences of t and D_out for its outgoing dependences.

The for loop starting at Line 10 iterates over a list of triples, each consisting of a set of dependences D_cur, a variable u_cur indicating whether the set should be taken into account, and a weight w_cur associated to the corresponding type of dependence. During the first iteration, D_cur is identical with D_in and during the second iteration identical with D_out. For each dependence in D_cur, a second loop starting at Line 12 determines the buffer b used by the dependence, the size s_b of the buffer as well as the node n_b containing the buffer. The node on which a buffer is placed might be unknown if the buffer has not yet been placed by the operating system, e.g., when using the first-touch scheme and the buffer has been allocated but not yet written. If this is the case, its size is added to the total size s_tot, but not included in the per-node statistics. Otherwise, the data array is updated accordingly by multiplying s_b with the weight w_cur.

Once the total size and the weighted number of bytes per node have been determined, the procedure checks whether the task should be pushed to a remote node. Tasks whose overall size of dependences is below a threshold are added to the local queue (Line 24) to avoid cases in which the overhead of a remote push cannot be compensated by the improvement in task execution time. For tasks with larger dependences, the run-time determines the node n_min with the minimal overall access cost (Line 26). The access cost for a node N_i is estimated by summing up the access costs to each node N_j containing at least one of the buffers, which in turn can be estimated by multiplying the amount of data placed on N_j with the average latency between N_i and N_j.(1) If n_min is different from the local node n_lcl of the activating worker, the run-time tries to transfer t to a random worker on n_min. If this fails, e.g., if the data structure of the targeted worker receiving remotely pushed tasks is full, the task is added to the local queue (Line 32).

(1) We estimated the latencies based on the distance between each pair of nodes reported by the numactl tool provided by LibNUMA [21].
Algorithm 1: activate(w, t)

     1  if h = input_only then (u_in, u_out) ← (true, false)
     2  else if h = output_only then (u_in, u_out) ← (false, true)
     3  else if h = weighted then (u_in, u_out) ← (true, true)
     4
     5  data[0, ..., N-1] ← ⟨0, ..., 0⟩
     6  D_in  ← in_deps(t)
     7  D_out ← out_deps(t)
     8  s_tot ← 0
     9
    10  for (D_cur, u_cur, w_cur) in ⟨(D_in, u_in, w_in), (D_out, u_out, w_out)⟩ do
    11      if u_cur = true then
    12          for d ∈ D_cur do
    13              s_b ← size_of(buffer_of(d))
    14              n_b ← node_of(buffer_of(d))
    15              s_tot ← s_tot + s_b
    16              if n_b ≠ unknown then
    17                  data[n_b] ← data[n_b] + w_cur · s_b
    18              end
    19          end
    20      end
    21  end
    22
    23  if s_tot < threshold then
    24      add_to_local_queue(w, t)
    25  else
    26      n_min ← node_with_min_access_cost(data)
    27
    28      if n_min ≠ local_node_of_worker(w) then
    29          w_dst ← random_worker_on_node(n_min)
    30          res ← transfer_task(t, w_dst)
    31          if res = failure then
    32              add_to_local_queue(w, t)
    33          end
    34      else
    35          add_to_local_queue(w, t)
    36      end
    37  end
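The body of node_with_min_access_cost is not given in the paper; the C sketch below shows one plausible implementation of the cost estimate described above, assuming a node-to-node latency matrix derived from the numactl/libnuma distances (all identifiers are ours, not the run-time's):

    #include <stddef.h>

    #define MAX_NODES 24

    /* Average access latency between each pair of nodes, e.g. filled from the
     * distances reported by numactl / libnuma (assumption, not necessarily the
     * paper's exact implementation). */
    extern double latency[MAX_NODES][MAX_NODES];
    extern int    num_nodes;

    /* data[j] holds the weighted number of bytes of the task's buffers residing
     * on node j, as computed by Algorithm 1. The cost of executing the task on
     * node i is estimated as cost(i) = sum over j of data[j] * latency[i][j];
     * the node with the minimal cost is returned. */
    static int node_with_min_access_cost(const size_t data[MAX_NODES])
    {
        int    best_node = 0;
        double best_cost = -1.0;

        for (int i = 0; i < num_nodes; i++) {
            double cost = 0.0;
            for (int j = 0; j < num_nodes; j++)
                cost += (double)data[j] * latency[i][j];

            if (best_cost < 0.0 || cost < best_cost) {
                best_cost = cost;
                best_node = i;
            }
        }
        return best_node;
    }

On the two machines used later in the evaluation, such a matrix could be filled once at start-up from the node distances reported by numactl, as footnote (1) suggests.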
2.4 Limitations of enhanced work-pushing

The major limitation of enhanced work-pushing is that, regardless of the heuristic, it can only react passively to a given data placement. This implies that data must already be well distributed across memory controllers if all tasks are to take advantage of this scheduling strategy. For poorly distributed data, e.g., if all data is placed on a single node, a subset of the workers receives a significantly higher amount of tasks than others. Work-stealing redistributes tasks among the remaining workers and thus prevents load imbalance, but cannot improve overall data locality if the initial data distribution was poor.

Classical task-parallel run-times allocate buffers during task creation [8, 33, 31]; hence data distribution mainly depends on the control program. A sequential control program leads to poorly placed data, while a parallel control program lets work-stealing evenly distribute task creation and buffer allocation. However, writing a parallel control program is already challenging in itself, even for programs with regularly-structured task graphs. Additionally ensuring an equal distribution of data across NUMA nodes through the control program is even more challenging or infeasible, especially for applications with less regularly-structured task graphs (e.g., if the structure of the graph depends on input data). Such optimizations also push efficient exploitation of NUMA back to the programmer and are thus contrary to the idea of abstraction from the hardware by the run-time.

In the following section, we introduce a NUMA-aware allocator that complements the input-only work-pushing heuristic and that decouples data locality from the control program, leaving efficient exploitation to the run-time.

3. DEFERRED ALLOCATION

NUMA-aware allocation controls the placement of data on specific nodes. Our proposed scheme to make these decisions transparent relies on per-node memory pools to control the placement of task buffers.

3.1 Per-node memory pools

Per-node memory pools combine a mechanism for efficient reuse of blocks of memory with the ability to determine on which nodes blocks are placed. Each NUMA node has a memory pool that is composed of k free lists L_0 to L_(k-1), where L_i contains blocks of size 2^(S_min+i) bytes. When a worker allocates a block of size s, it determines the corresponding list L_j with 2^(S_min+j-1) < s ≤ 2^(S_min+j) and removes the first block of that list. If the list is empty, it allocates a larger chunk of memory from the operating system, removes the first block from the chunk and adds the remaining parts to the free list.

A common allocation strategy of operating systems is first-touch allocation, composed of two steps. The first step, referred to as logical allocation, is triggered by the system call used by the application to request additional memory and only extends the application's address space. The actual physical allocation is triggered upon the first write to the memory region and places the corresponding page on the same node as the writing core. Hence, a block that originates from a newly allocated chunk is not necessarily placed on any node.

However, when a block is freed, it has been written by a producer and it is thus safe to assume that the block has been placed through physical allocation. The identifier of the containing node can be obtained through a system call, which enables the run-time to return the block to the correct memory pool. To avoid the overhead of a system call each time a block is freed, information on the NUMA node containing a block is cached in a small meta-data section associated to the block. This memory pooling mechanism provides three fundamental properties for deferred allocation, presented below. First, it ensures that allocating a block from a memory pool always returns a block that either has not been placed yet or is known to be placed on the node associated to the memory pool. Second, data can be placed on a specific node with very low overhead. Finally, the granularity for data placement is decoupled from the usual page granularity, as a block may be sized arbitrarily.
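As an illustration of the free-list scheme described above (a sketch under assumed constants, not the run-time's allocator), the following C fragment shows the size-class computation and the reuse of a block from the pool of a given node; S_MIN and the number of size classes are placeholders.

    #include <stddef.h>
    #include <stdlib.h>

    #define S_MIN       6    /* smallest block size 2^6 = 64 bytes (assumption) */
    #define NUM_CLASSES 26   /* k free lists per node (assumption)              */

    typedef struct block {
        struct block *next;   /* free-list link                         */
        int           node;   /* cached NUMA node id, -1 if not placed  */
    } block_t;

    typedef struct mem_pool {
        block_t *free_list[NUM_CLASSES]; /* L_i holds blocks of 2^(S_MIN+i) bytes */
    } mem_pool_t;

    extern mem_pool_t pool[24];   /* one pool per NUMA node (24 on SGI-192) */

    /* Smallest class j such that a block of 2^(S_MIN+j) bytes can hold s bytes. */
    static int size_class(size_t s)
    {
        int j = 0;
        while (((size_t)1 << (S_MIN + j)) < s)
            j++;
        return j;
    }

    static void *pool_alloc(int node, size_t s)
    {
        int      j = size_class(s);
        block_t *b = pool[node].free_list[j];

        if (b) {                        /* reuse a block known to live on 'node' */
            pool[node].free_list[j] = b->next;
            return b;
        }
        /* Free list empty: the run-time would request a larger chunk from the
         * OS, split it and queue the remainder; this sketch simply falls back
         * to malloc(). A fresh block is not physically placed until its first
         * write (first-touch).                                                 */
        return malloc((size_t)1 << (S_MIN + j));
    }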
3.2 Principles of deferred allocation

The key idea of deferred allocation is to delay the allocation, and thus the placement, of each task buffer until the node executing the producer that writes to the buffer is known. This guarantees that accesses to output buffers are always local. The classical approach in run-times for dependent tasks is to allocate input buffers upon the creation of a task [8, 33, 31] or earlier [30]. Instead, we propose to let each input buffer of a consumer task t_c be allocated by the producer task t_p writing into it, immediately before t_p starts execution. Since the input buffer of t_c is an output buffer of t_p, the location of the input data of t_c is effectively determined by its producer(s). In the following, we use the term immediate allocation to distinguish the default allocation scheme, in which input buffers are allocated upon creation, from deferred allocation.

Figure 2a shows the implications of immediate allocation on data locality for a task t. All input buffers of t are allocated on the node N_c on which the creator of t operates. The same scheme applies to the creators t_c,o^0 to t_c,o^(m-1), causing the input buffers of the tasks t_o^0 to t_o^(m-1) to be allocated on nodes N_o^0 to N_o^(m-1), respectively. In the worst case for data locality, t is stolen by a worker operating on neither N_c nor N_o^0 to N_o^(m-1), and all memory accesses of t target memory on remote nodes.

When using deferred allocation, the input buffers of t are not allocated before its producers start execution and the output buffers of t are not allocated before t is activated (Figure 2b). When t becomes ready, all of its input buffers have received input data from the producers of t and have been placed on up to n different nodes N_i^0 to N_i^(n-1) (Figure 2c). The data locality impact of deferred allocation is illustrated in Figure 2d, showing the placement at the moment when the worker executing t has been determined. Regardless of any possible placement of t, all of its output buffers are placed on the same node as the worker executing the task. Hence, using deferred allocation, write accesses are guaranteed to target local memory. Furthermore, this property is independent from the placement of the creating tasks t_c and t_c,o^0 to t_c,o^(m-1), which effectively decouples data locality from the control program. Even for a sequential control program, data is distributed over the different nodes of the machine according to work-stealing. This way, work-stealing does not only take the role of a mechanism responsible for computational load balancing, but also that of a mechanism for load balancing across memory controllers.

Figure 2: Allocation schemes. (a) Immediate allocation; (b) Deferred allocation: after creation of t; (c) Deferred allocation: activation of t; (d) Deferred allocation: final placement.

An important side effect of deferred allocation is a significant reduction of the memory footprint. With a sequential control program, all tasks are created by a single "root" task. This causes a large number of input buffers to be allocated early on, while the actual working set of live buffers might be much smaller. A parallel control program can mitigate the effects of early allocation, e.g., by manually throttling task creation as shown in Figure 3b. However, this requires significant programmer effort and hurts the separation of concerns that led to the delegation of task management to the run-time.

Thanks to deferred allocation, buffers allocated for early tasks can be reused at a later stage. The difference is shown in Figures 3a and 3c. In the first case, all three buffers b_i, b_(i+1) and b_(i+2) are allocated before the dependent tasks t_i to t_(i+3) are executed. In the latter case, the buffer used by t_i and t_(i+1) can be reused as the input buffer of t_(i+3). Parallel control programs also benefit from deferred allocation, as the minimal number of buffers along a path of dependent tasks can be decreased by one (e.g., in Figure 3b only b_i and b_(i+1) are simultaneously live, and b_i can be reused for b_(i+2) when using deferred allocation).

Figure 3: Allocation and reuse. (a) Immediate allocation without buffer reuse; (b) Throttling of task creations; (c) Deferred allocation favoring buffer reuse.

3.3 Compatibility with work-pushing

Deferred allocation guarantees local write accesses, but it does not influence the locality of read accesses. By combining deferred allocation with the input-only heuristic of enhanced work-pushing, it is possible to optimize for both read and write accesses. It is important to note that neither the output-only heuristic nor the weighted heuristic can be used, since the output buffers of a task are not determined upon task activation, when the work-pushing decision is taken.
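Combining the two previous sketches, deferred allocation can be pictured as follows; this is hypothetical glue code reusing the task_frame_t and pool_alloc definitions introduced above, not the run-time's actual implementation. The worker that is about to execute a producer draws the producer's output buffers, which are the consumers' input buffers, from the pool of its own NUMA node, so the producer's writes are local by construction.

    /* Allocate the output buffers of task 't' from the memory pool of the
     * executing worker's NUMA node, right before running the task body.
     * Hypothetical sketch building on task_frame_t and pool_alloc() above. */
    static void run_task(task_frame_t *t, int local_node)
    {
        for (int i = 0; i < t->n_out; i++) {
            if (!t->out[i].addr) {                 /* deferred allocation       */
                t->out[i].addr = pool_alloc(local_node, t->out[i].size);
                t->out[i].node = local_node;       /* all writes will be local  */
            }
        }
        t->work(t);  /* reads from input buffers, writes to local output buffers */
    }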
4. EXPERIMENTAL SETUP

For the experimental evaluation we implemented enhanced work-pushing and deferred allocation in the run-time system of the OpenStream project [29]. We start with an overview of the software and hardware environment used in our experiments, followed by a presentation of the selected benchmarks.

4.1 Software environment

OpenStream [31] is a task-parallel, data-flow programming model implemented as an extension to OpenMP. Arbitrary dependence patterns can be used to exploit task, pipeline and data parallelism. Each data-flow dependence is semantically equivalent to a communication and synchronization event within an unbounded FIFO queue referred to as a stream. Pragmatically, this is implemented by compiling dependences as accesses to task buffers dynamically allocated at execution time: writes to streams result in writes into the buffers of the tasks consuming the data, while read accesses to streams by consumer tasks are translated to reads from their own, task-private buffers.

We implemented the optimizations presented in this paper in the publicly available run-time of OpenStream [29]. Crucially, we rely on the fact that OpenStream programs are written with programmer annotations explicitly describing the flow of data between tasks. This precise data-flow information is preserved during compilation and made available to the run-time library. We leverage this essential semantic information to determine, at run-time and before task execution, how much data is exchanged by any given task. OpenStream programs are dynamically load-balanced; worker threads use hierarchical work-stealing to acquire and execute tasks whose dependences have been satisfied. If work-pushing is enabled, workers can also receive tasks in a dedicated multi-producer single-consumer queue [14]. Our experiments use one worker thread per core.
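For readers unfamiliar with the model, the fragment below shows a minimal producer/consumer pipeline in the style of the examples from the OpenStream literature; compute_block and use_block are placeholders, and the exact pragma syntax may differ between OpenStream versions. Each write to the stream x is compiled into a write into the private input buffer of the consuming task, as described above.

    /* Producer/consumer pipeline in the style of OpenStream examples
     * (sketch only; pragma syntax may vary across OpenStream versions). */
    extern int  compute_block (int i);
    extern void use_block (int v);

    void pipeline (int n)
    {
      int x __attribute__ ((stream));  /* stream connecting producers to consumers */

      for (int i = 0; i < n; i++) {
        #pragma omp task output (x)    /* write compiled into the consumer's buffer */
        x = compute_block (i);

        #pragma omp task input (x)     /* read from the task-private input buffer   */
        use_block (x);
      }
    }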
4.2 Hardware environment

The experiments were conducted on two many-core systems.

Opteron-64 is a quad-socket system with four AMD Opteron 6282 SE processors running at 2.6 GHz, using Scientific Linux 6.2 with kernel 3.10.1. The machine is composed of 4 physical packages, with 2 dies per package, each die containing 8 cores organized in pairs. Each pair shares the first-level instruction cache as well as a 2 MiB L2 cache. An L3 cache of 6 MiB and the memory controller are shared by the 8 cores on the same die. The 16 KiB L1 cache is private to each core. Main memory is 64 GiB, equally divided into 8 NUMA domains. For each NUMA node, 4 neighbors are at a distance of 1 hop and 3 neighbors are at 2 hops.

SGI-192 is an SGI UV 2000 with 192 cores and 756 GiB RAM, distributed over 24 NUMA nodes, and running SUSE Linux Enterprise Server 11 SP3 with kernel 3.0.101-0.46-default. The system is organized in blades, each of which contains two Intel Xeon E5-4640 CPUs running at 2.4 GHz. Each CPU has 8 cores with direct access to a memory controller. The cache hierarchy consists of 3 levels: a core-private L1 with separate instruction and data caches, each with a capacity of 32 KiB; a core-private, unified L2 cache of 256 KiB; and a unified L3 cache of 20 MiB, shared among all 8 cores of the CPU. Hyperthreading was disabled for our experiments. Each blade has a direct connection to a set of other blades and indirect connections to the remaining ones. From a core's perspective, a memory controller can be either local if associated to the same CPU, at 1 hop if on the same blade, at 2 hops if on a different blade that is connected directly to the core's blade, or at 3 hops if on a remote blade with an indirect connection.

Latency of memory accesses and NUMA factors. We used a synthetic benchmark to measure the latency of memory accesses as a function of the distance in hops between a requesting core and the memory controller that satisfies the request. It allocates a buffer on a given node using LibNUMA, initializes it and measures the execution time of a sequence of memory accesses to this buffer from a core on a specific node. Each sequence traverses the whole buffer from beginning to end in steps of 64 bytes, such that each cache line is only accessed once. The buffer size was set to 1 GiB to ensure data is evicted from the cache before it is reused and thus to measure only memory accesses that are satisfied by the memory controller and not by the hierarchy of caches.

Tables 1 and 2 indicate the total execution time of the synthetic benchmark as a function of the number of hops and the access mode for both systems. The results show that latency increases with the distance between the requesting core and the targeted memory controller and that writes are significantly slower than reads. The rightmost column of each table shows the access time normalized to accesses targeting local memory. For reads on the Opteron-64 system, these values range from 1.81 for on-package accesses to a memory controller at a distance of one hop to a factor of 4.34 for off-package accesses at a distance of two hops. For writes, these values are lower (1.20 to 2.48) due to the higher latency of local writes. Not surprisingly, the factors for both reads and writes are significantly higher on the larger SGI-192 system (up to 7.48 for reads at three hops). This suggests that locality optimizations will have a higher impact on SGI-192 and that the locality of writes will have the greatest impact.

Table 1: Average latency of accesses on Opteron-64

                         Read accesses         Write accesses        Factor R/W
  Local                  1288.5 ± 1.22 ms      2256.9 ± 14.06 ms     1.00 / 1.00
  1 hop (on-package)     2328.4 ± 0.49 ms      2717.6 ± 12.16 ms     1.81 / 1.20
  1 hop (off-package)    2781.4 ± 0.56 ms      3934.6 ± 0.56 ms      2.16 / 1.74
  2 hops                 5601.6 ± 0.57 ms      5601.3 ± 0.55 ms      4.34 / 2.48

Table 2: Average latency of accesses on SGI-192

                         Read accesses         Write accesses        Factor R/W
  Local                   934.82 ± 5.74 ms     1307.4  ± 2.95 ms     1.00 / 1.00
  1 hop                  4563.1  ± 3.02 ms     5282.38 ± 1.56 ms     4.88 / 4.04
  2 hops                 5820.48 ± 2.11 ms     6473.38 ± 1.16 ms     6.23 / 4.95
  3 hops                 6991.24 ± 2.71 ms     7673.14 ± 0.92 ms     7.48 / 5.87
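The following C sketch reproduces the structure of this measurement (our code, not the original benchmark); numa_alloc_onnode and numa_run_on_node are standard libnuma calls, and the traversal touches each 64-byte cache line of a 1 GiB buffer exactly once.

    #include <numa.h>      /* link with -lnuma */
    #include <stdint.h>
    #include <time.h>

    #define BUF_SIZE   (1UL << 30)   /* 1 GiB, larger than the caches      */
    #define LINE_SIZE  64            /* touch each cache line exactly once */

    /* Traverse a buffer placed on 'mem_node' from a core on 'cpu_node' and
     * return the elapsed time in ms; 'write' selects the access mode.
     * Sketch of the synthetic benchmark described above.                  */
    static double measure(int mem_node, int cpu_node, int write)
    {
        volatile uint8_t *buf = numa_alloc_onnode(BUF_SIZE, mem_node);
        struct timespec t0, t1;

        for (size_t i = 0; i < BUF_SIZE; i += LINE_SIZE)  /* initialize and place pages */
            buf[i] = 1;

        numa_run_on_node(cpu_node);                       /* move to the requesting node */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < BUF_SIZE; i += LINE_SIZE) {
            if (write) buf[i] = (uint8_t)i;
            else       (void)buf[i];
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        numa_free((void *)buf, BUF_SIZE);
        return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }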
4.3 Benchmarks

We evaluate the impact of our techniques on nine benchmarks, each of which is available in an optimized sequential implementation and two tuned parallel implementations using OpenStream.

The first parallel implementation uses task-private input and output buffers as described in Section 2.1 and thus enables enhanced work-pushing and deferred allocation. Data from input buffers is only read and never written, while data in output buffers is only written and never read. Hence, tasks cannot perform in-place updates, and results are written to a different location than the input data. We refer to this implementation as DSA (dynamic single assignment).

The second parallel implementation, which we refer to as SHM, uses globally shared data structures and thus does not expose information on memory accesses to the run-time. However, the pages of the data structures are distributed across all NUMA nodes in a round-robin fashion using interleaved allocation. We use this implementation to compare our solutions to classical static NUMA-aware optimizations that require only minimal changes to the application. The benchmarks are the following.

• Jacobi-1d, jacobi-2d and jacobi-3d are the usual one-, two- and three-dimensional Jacobi stencils iterating over arrays of double-precision floating-point elements. At each iteration, the algorithm averages, for each matrix element, the values of the elements in its Von Neumann neighborhood using the values from the previous iteration (see the sketch after this list).

• Seidel-1d, seidel-2d and seidel-3d employ a similar stencil pattern but use values from the previous and the current iteration for updates.

• Kmeans is a data-mining benchmark that partitions a set of n d-dimensional points into k clusters using the K-means clustering algorithm. Each vector is represented by d single-precision floating-point values.

• Blur-roberts applies two successive image filters to double-precision floating-point elements [22]: a Gaussian blur filter on each pixel's Moore neighborhood followed by the Roberts cross operator for edge detection.

• Bitonic implements a bitonic sorting network [1], applied to a sequence of arbitrary 64-bit integers.
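As a reminder of the computational pattern only (this is not code from the benchmark suite), one time step of the two-dimensional Jacobi stencil can be written as:

    /* One jacobi-2d time step over an N x N grid (illustration only);
     * 'prev' holds iteration k, 'next' receives iteration k+1.          */
    static void jacobi2d_step(int N, const double *prev, double *next)
    {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                next[i * N + j] = 0.2 * (prev[i * N + j] +
                                         prev[(i - 1) * N + j] +
                                         prev[(i + 1) * N + j] +
                                         prev[i * N + j - 1] +
                                         prev[i * N + j + 1]);
    }

In the DSA versions, each block of such a grid is exchanged between iterations through task-private input and output buffers rather than updated in place.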
Table 3 summarizes the parameters for the different benchmarks and machines. The size of the input data was chosen to be significantly higher than the total amount of cache memory and low enough to prevent the system from swapping. This size is identical on both machines, except for blur-roberts, whose execution time for images of size 2^15 × 2^15 is too short on SGI-192 and which starts swapping for size 2^16 × 2^16 on Opteron-64. To amortize the execution of auxiliary tasks at the beginning and the end of the execution of the stencils, we set the number of iterations to 60.

Table 3: Benchmark parameters

  Benchmark     Matrix/Vector size               Block size (Opteron-64)  Block size (SGI-192)  Iterations
  Jacobi-1d     2^28                             2^16                     2^16                  60
  Jacobi-2d     2^14 × 2^14                      2^10 × 2^6               2^8 × 2^8             60
  Jacobi-3d     2^10 × 2^9 × 2^9                 2^5 × 2^6 × 2^5          2^4 × 2^6 × 2^6       60
  Seidel-1d     2^28                             2^16                     2^16                  60
  Seidel-2d     2^14 × 2^14                      2^8 × 2^8                2^7 × 2^9             60
  Seidel-3d     2^10 × 2^9 × 2^9                 2^6 × 2^5 × 2^5          2^4 × 2^8 × 2^4       60
  Blur-roberts  2^15 × 2^15 (Opteron-64),        2^9 × 2^6                2^10 × 2^6            -
                2^16 × 2^16 (SGI-192)
  Bitonic       2^28                             2^16                     2^17                  -
  K-means       40.96M pts, 10 dims., 11 clust.  10^4                     10^4                  -

The block size has been tuned to minimize the execution time of the parallel implementation with task-private input and output data (DSA) on each machine. To avoid any bias in favor of our optimizations, enhanced work-pushing and deferred allocation were disabled during this tuning phase. In this configuration, the run-time only relies on optimized work-stealing [23] extended with hierarchical work-stealing [14] for computational load balancing. We refer to this baseline for our experiments as DSA-BASE. Identical parameters for the block size and run-time have been used for the experiments with the shared-memory versions of the benchmarks (SHM), which we refer to as SHM-BASE.

All benchmarks were compiled using the OpenStream compiler based on GCC 4.7.0. The compilation flags for blur-roberts as well as the jacobi and seidel benchmarks were -O3 -ffast-math, while kmeans uses -O3 and bitonic uses -O2.

The parallel implementations are provided with a parallel control program to prevent sequential task creation from becoming a performance bottleneck. To avoid memory controller contention, the initial and final data are stored in global data structures allocated using interleaved allocation across all NUMA nodes.

Data dependence patterns. The relevant producer-consumer patterns, shown in Figure 4, can be divided into three groups with different implications for our optimizations: unbalanced dependences (e.g., one input buffer accounting for more than 90% of the input data) with long dependence paths (jacobi-1d, jacobi-2d, jacobi-3d, seidel-1d, seidel-2d, seidel-3d, kmeans), unbalanced dependences with short dependence paths (blur-roberts), and balanced dependences (bitonic). The behavior of our heuristics on these patterns is referenced in the experimental evaluation. All of the benchmarks have non-trivial, connected task graphs, i.e., none of the benchmarks represents an embarrassingly parallel workload.

Figure 4: Main tasks and types of dependences of the benchmarks: (a) jacobi-1d / seidel-1d; (b) jacobi-2d / seidel-2d; (c) jacobi-3d / seidel-3d; (d) kmeans; (e) blur-roberts; (f) bitonic. The amount of data exchanged between tasks is indicated by the width of the arrows. Symbols: S_b,x, S_b,y, S_b,z: number of elements per block in the x, y and z direction; S_b,p: number of points per block; d: number of dimensions; k: number of clusters in kmeans; S_b: number of elements per block in bitonic.

Characterization of memory accesses. The benchmarks were carefully tuned (block sizes and tiling) to take advantage of caches. However, the effectiveness of the cache hierarchy also depends on the pattern, the frequency and the timing of memory accesses during the execution of a benchmark, leading to more or fewer cache misses for a given block size. Figure 5 shows the cache miss rates at the last level of cache (LLC) on SGI-192 for DSA-BASE, which is a good proxy for the rate of requests to main memory for each benchmark. For all bar graphs in this paper, error bars indicate the standard deviation. As the focus of our optimizations is on the locality of memory accesses, we expect a higher impact for benchmarks exhibiting higher LLC miss rates. For this reason, seidel and blur-roberts are expected to benefit the most from our optimizations, followed by the jacobi benchmarks and bitonic. Kmeans has a very low LLC miss rate and is not expected to show significant improvement.

Figure 5: LLC misses per thousand instructions on SGI-192.

4.4 Experimental baseline

To demonstrate the effectiveness of our optimizations, our principal point of comparison is DSA-BASE. We validate the soundness of this baseline by comparing its performance on Cholesky factorization against the two state-of-the-art linear algebra libraries PLASMA/QUARK [35] and Intel's MKL [18]. Figure 6 shows the execution times of Cholesky factorization on a matrix of size 8192 × 8192 running on the Opteron-64 platform with four configurations: DSA-BASE, Intel MKL, PLASMA and finally optimized OpenStream (DSA-OPT), the latter using our run-time implementing the optimizations presented in this paper. This validates the soundness of our baseline, which achieves similar performance to Intel MKL, while also showcasing the effectiveness of our optimizations, automatically and transparently matching the performance of PLASMA without any change to the benchmark's source code.

Figure 6: Cholesky factorization on Opteron-64.

5. RESULTS

We now evaluate enhanced work-pushing and deferred allocation, starting with the impact on memory access locality and following on with the actual performance results.

5.1 Impact on the locality of memory accesses

On the Opteron platform, we use two hardware performance counters to count the requests to node-local memory(2) and to remote nodes,(3) respectively. We consider the locality metric R^HW_loc, defined as the ratio of local memory accesses to total memory accesses, shown in Figure 7. We could not provide the corresponding figures for the SGI system due to missing support in the kernel. However, the OpenStream run-time contains precise information on the working set of tasks and on the placement of input buffers, which can be used to provide a second locality metric, R^RT_loc, that precisely accounts for accesses to data managed by the run-time, i.e., associated to task dependences.

(2) CPU_IO_REQUESTS_TO_MEMORY_IO:LOCAL_CPU_TO_LOCAL_MEM
(3) CPU_IO_REQUESTS_TO_MEMORY_IO:LOCAL_CPU_TO_REMOTE_MEM
Figure 7 shows the locality of requests to main memory, R^HW_loc, on Opteron-64. The configurations input only, output only and weighted refer to DSA-BASE combined with the respective enhanced work-pushing heuristic, while dfa refers to DSA-BASE combined with deferred allocation, but without enhanced work-pushing. Data locality is consistently improved by our optimizations across all benchmarks. The combination of enhanced work-pushing and deferred allocation (DSA-OPT) is comparable to the output only and weighted heuristics of work-pushing, but yields significantly better results than enhanced work-pushing alone for benchmarks with balanced dependences. For all jacobi and seidel benchmarks as well as kmeans, the locality was improved above 88%, and for bitonic above 81%. Blur-roberts does not benefit as much from our optimizations as the other benchmarks. As the run-time system only manages the placement of privatized data associated with dependences, short dependence paths, such as in blur-roberts, only allow the run-time to optimize placement for a fraction of the execution. As a result, the overall impact is diluted proportionately. We further note that the input only heuristic of work-pushing, closest to the original heuristic in prior work [14], does not improve memory locality as much as weighted work-pushing or the combination of deferred allocation with work-pushing.

Figure 7: Locality R^HW_loc of requests to main memory on the Opteron-64 system for deferred allocation.

Figure 8 shows R^RT_loc on SGI-192, highlighting the effectiveness of our optimizations on data under the control of the run-time. Not accounting for unmanaged data, we achieve almost perfect locality (up to 99.8%) across all benchmarks, except for bitonic, where balanced dependences imply that whenever input data is on multiple nodes, only half of the input data can be accessed locally.

Figure 8: Locality R^RT_loc of data managed by the run-time on SGI-192.

5.2 Impact on performance

Figure 9 shows the speedup achieved over DSA-BASE. The best performance is achieved by combining work-pushing and deferred allocation, with a global maximum of 2.5× on Opteron-64 and 5.0× on SGI-192. Generally, the speedups are higher on SGI-192, showing that our optimizations have a higher impact on machines with higher penalties for remote accesses. These improvements result from better locality as well as from the memory footprint reduction induced by deferred allocation. Note that the input-only heuristic did not perform well with jacobi, highlighting the importance of considering both input and output flows, and of proactively distributing buffers through deferred allocation rather than only reactively adapting the schedule.

Figure 9: Speedup over the parallel baseline DSA-BASE: (a) Opteron-64 system; (b) SGI-192 system.

5.3 Comparison with interleaved allocation

Figure 10 shows the speedup of the DSA parallel baseline over the implementations using globally shared data structures distributed over all NUMA nodes using interleaved allocation (SHM-BASE). The optimizations achieve up to 3.1× speedup on Opteron-64 and 5.6× on SGI-192. The best performance is systematically obtained by the combined work-pushing and deferred allocation strategy. These results clearly indicate that taking advantage of the dynamic data-flow information provided in modern task-dependent languages allows for more precise control over the placement of data, leading to improved performance over static schemes unable to react to dynamic behavior at execution time. In the case of interleaved allocation, the uniform access pattern of the benchmarks evaluated in this work yields good load balancing across memory controllers, but poor data locality.

Figure 10: Speedup over the implementations with globally shared data structures and interleaving on all nodes (SHM-BASE): (a) Opteron-64; (b) SGI-192.
6. COMPARISON WITH DYNAMIC PAGE MIGRATION

While our study focused on the application and the run-time, the reader may wonder how kernel-level optimizations fare in our context. Recent versions of Linux comprise the balancenuma patch set [12] for dynamic page migration and a transparent policy migrating pages during the execution of a process based on its memory accesses. The kernel periodically scans the address space of the process and changes the protection flags of the scanned pages such that an access causes an exception. Upon an access to such a page, the exception handler checks whether the page is misplaced with respect to NUMA and migrates it towards the node of the accessing CPU.

We first evaluate the influence of page migration on a synthetic benchmark to determine under which conditions the mechanism is beneficial and show that these conditions do not match the requirements of task-parallel programs. We then study the impact of dynamic page migration on the SHM parallel baseline with globally shared data (SHM-BASE).

6.1 Parametrization of page migration

For all experiments, we used version 4.3.0 of the Linux kernel. As the SGI test platform requires specific kernel patches and is shared among many users, we conducted these experiments on Opteron-64 only. The migration mechanism is configured through the procfs pseudo filesystem, as follows:

• Migration can be globally enabled or disabled by setting numa_balancing to 1 or 0, respectively.

• The parameter numa_balancing_scan_delay_ms indicates the minimum execution time of a process before page migration starts. In our experiments, we have set this value to 0 to enable migration as soon as possible. Page migration during initialization is prevented using appropriate calls to mbind, temporarily imposing static placement.

• The minimum/maximum duration between two scans is controlled by numa_balancing_scan_period_min_ms / numa_balancing_scan_period_max_ms. We have set the minimal period to 0 and the maximum period to 100000 to allow for constant re-evaluation of the mapping.

• How much of the address space is examined in one scan is defined by numa_balancing_scan_size_mb. In the experiments, this parameter has been set to 100000 to prevent the system from scanning only a subset of the pages.

In the following evaluation, we calculate the ratio of the median wall-clock execution time with dynamic migration (numa_balancing set to 1) divided by the median time without migration (numa_balancing set to 0) over 10 runs of a synthetic benchmark.

6.2 Evaluation of a synthetic benchmark

The synthetic benchmark has been designed specifically to evaluate the potential of dynamic page migration in scenarios with clear relationships between data and computations and without interference. It is composed of the following steps (a sketch of the per-thread loop follows the list):

1. Allocate S sets of T 64 MiB buffers, distributed in a round-robin fashion over the machine's NUMA nodes. That is, the i-th buffer of each set is allocated on NUMA node (i mod N), with N being the total number of NUMA nodes.

2. Create T threads and pin the i-th thread on the i-th core.

3. Assign exactly one buffer of the current set to each thread, with thread i being the owner of the i-th buffer.

4. Synchronize all threads using a barrier and let each thread traverse its buffer I times linearly by adding a constant to the first 8-byte integer of every cache line of the buffer.

5. Change the affinity A times by repeating steps 3 and 4 a total of A times.

6. Synchronize all threads with a barrier and print the time elapsed between the moments in which the first thread passed the first and the last barrier, respectively.
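The pthread-based sketch below captures the core of steps 3 to 5 for one thread; buffer allocation and thread pinning (steps 1 and 2), as well as the timing around the barriers (step 6), are elided, and all identifiers are ours rather than the original benchmark's.

    #include <pthread.h>
    #include <stdint.h>
    #include <stddef.h>

    #define T          64              /* threads, one per core (Opteron-64)  */
    #define BUF_BYTES  (64UL << 20)    /* 64 MiB per buffer                   */
    #define LINE       64              /* cache-line size                     */

    /* Step 1 (not shown): S sets of T buffers, the i-th buffer of each set
     * placed on NUMA node (i mod N). */
    extern uint8_t *buffers[][T];
    extern pthread_barrier_t barrier;

    struct targ { int tid; int S; int I; int A; };

    /* Steps 3-5 for one pinned thread: at each affinity change the thread
     * owns the i-th buffer of the current set (allocated on node i mod N,
     * which generally differs from the thread's node), then traverses it
     * I times, touching the first 8-byte integer of every cache line.     */
    static void *worker(void *p)
    {
        struct targ *a = p;

        for (int change = 0; change < a->A; change++) {
            uint64_t *buf = (uint64_t *)buffers[change % a->S][a->tid];

            pthread_barrier_wait(&barrier);            /* step 4: synchronize */
            for (int it = 0; it < a->I; it++)
                for (size_t off = 0; off < BUF_BYTES / 8; off += LINE / 8)
                    buf[off] += 1;                     /* one write per line  */
        }
        pthread_barrier_wait(&barrier);                /* final barrier       */
        return NULL;
    }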
On Opteron-64, the number of cores per NUMA node (8) is equal to the total number of nodes. Thus, the allocation scheme above causes every NUMA node to access every node of the system at the beginning of each affinity change.
