Correct and Efficient Work-Stealing for Weak Memory Models

12 Pages·2013·0.97 MB·English

Nhat Minh Lê, Antoniu Pop, Albert Cohen, Francesco Zappa Nardelli
INRIA and ENS Paris

Abstract

Chase and Lev's concurrent deque is a key data structure in shared-memory parallel programming and plays an essential role in work-stealing schedulers. We provide the first correctness proof of an optimized implementation of Chase and Lev's deque on top of the POWER and ARM architectures: these provide very relaxed memory models, which we exploit to improve performance but which considerably complicate the reasoning. We also study an optimized x86 and a portable C11 implementation, conducting systematic experiments to evaluate the impact of memory barrier optimizations. Our results demonstrate the benefits of hand tuning the deque code when running on top of relaxed memory models.

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming; E.1 [Data Structures]: Lists, stacks, and queues

Keywords: lock-free algorithm, work-stealing, relaxed memory model, proof

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PPoPP'13, February 23–27, 2013, Shenzhen, China.
Copyright © 2013 ACM 978-1-4503-1922/13/02...$10.00

1. Introduction

Multicore POWER and ARM architectures are standard targets for server, consumer electronics, and embedded control applications. The difficulties of parallel programming are exacerbated by the relaxed memory model implemented by these architectures, which allows the processors to perform a wide range of optimizations, including thread-local reordering and non-atomic store propagation.

The safety-critical nature of many embedded applications calls for solid foundations for parallel programming. This paper shows that a high degree of confidence can be achieved for highly optimized, real-world, concurrent algorithms, running on top of weak memory models. A good test case is provided by the runtime scheduler of a task library. We thus focus on Chase and Lev's concurrent doubly-ended queue (deque) [3], the cornerstone of most work-stealing schedulers. Until now, no rigorous correctness proof has been provided for implementations of this algorithm running on top of a relaxed memory model. Furthermore, while work-stealing is widely used on the x86 architecture (an evaluation under a restrictive hypothesis of idempotence of the workload can be found in [10]), few experiments target weaker memory models.

Our first contribution is a correctness proof of this fundamental concurrent data structure running on top of a relaxed memory model. We provide a hand-tuned implementation of Chase and Lev's deque for the ARM architectures, and prove its correctness against the memory semantics defined in [12] and [7]. Our second contribution is a systematic study of the performance of several implementations of Chase–Lev on relaxed hardware. In detail, we compare our optimized ARM implementation against a standard implementation for the x86 architecture and two portable variants expressed in C11: a reference sequentially consistent translation of the algorithm, and an aggressively optimized version making full use of the release–acquire and relaxed semantics offered by C11 low-level atomics. These implementations of the Chase–Lev deque are evaluated in the context of a work-stealing scheduler. We consider diverse worker/thief configurations, including a synthetic benchmark with two different workloads and standard task-parallel kernels. Our experiments demonstrate the impact of the memory barrier optimization on the throughput of our work-stealing runtime. We also comment on how the ARM correctness proof can be tailored to these alternative implementations. As a side effect, we highlight that our optimized ARM implementation cannot be expressed using C11 low-level atomics, which invariably end up inserting one redundant synchronization instruction.

2. Chase–Lev deque

User-space runtime schedulers offer an excellent playground for studying low-level high-performance code. We focus on randomized work-stealing: it was originally designed as the scheduler of the Cilk language for shared-memory multiprocessors [4], but thanks to its merits [2] it has been adopted in a number of parallel libraries and parallel programming environments, including the Intel TBB and compiler suite. Work-stealing variants have also been proposed for distributed clusters [5] and heterogeneous platforms [1]. The scheduling strategy is intuitive:

• Each processor uses a dynamic array as a deque holding tasks ready to be scheduled.
• Each processor manages its own deque as a stack. It may only push and pop tasks from the bottom of its own deque.
• Other processors may not push or pop from that deque; instead, they steal tasks from the top when their own deque is empty. In most implementations, the stolen deque is selected at random.
• Initially, one processor starts with the "root" task of the parallel program in its deque, and all other deques are empty.

The state-of-the-art algorithm for the work-stealing deque is Chase and Lev's lock-free deque [3]. It uses an array with automatic, asynchronous growth. Assuming sequentially consistent memory, it involves only one atomic compare-and-swap (CAS) per steal, no CAS on push, and no CAS on take except when the deque has exactly one element left.

We implemented and tested four versions of the concurrent deque algorithm, with different barrier configurations: (1) a sequentially consistent version, written with C11 seq_cst atomics, following the original description in [3]; (2) an optimized version, which takes full advantage of the C11 relaxed memory model, reported in Figure 1; (3) a native version for ARMv7, reported in Figure 2; and (4) a native version for x86. These native versions rely on compiler intrinsics and inline assembly to leverage architecture-specific assumptions and thus reduce the number of barriers required.

In our implementations of Figure 1 and Figure 2, we assume that the Deque type is declared as:

    typedef struct {
      atomic_size_t size;
      atomic_int buffer[];
    } Array;

    typedef struct {
      atomic_size_t top, bottom;
      Atomic(Array *) array;
    } Deque;

In the code of Figure 1 the atomic_ and memory_order_ prefixes have been elided for clarity. The ARMv7 pseudo-code of Figure 2 uses the keywords R and W to denote reads and writes to shared variables, and atomic indicates a block that will be executed atomically, implemented via LL/SC instructions. The x86 version is based on prior work [10] and only requires a single mfence memory barrier in take, in place of the call to thread_fence in the C11 code.

2.1 Notions of correctness

The expected behavior of the work-stealing deque is intuitive: tasks pushed into the deque are then either taken in reverse order by the same thread, or stolen by another thread. We say that an implementation is correct if it satisfies four criteria, formalized and proven correct for our ARMv7 optimized code in Section 4:

1. tasks are taken in reverse order;
2. only tasks pushed are taken or stolen (well-defined reads);
3. a task pushed into a deque cannot be taken or stolen more than once (uniqueness);
4. given a finite number of push operations, all pushed values will eventually be either taken or stolen exactly once, if enough take and steal operations are attempted (existence).

These criteria hold because of the following assumptions and properties of the Chase–Lev algorithm:

• For any given deque, push and pop operations execute on a single thread. Concurrency can only occur between one execution of push or take in the owner thread, and one or more executions of steal in different threads.
• Newly pushed tasks are made visible to take and steal by the increment to bottom in push. As we shall see in Section 4, our ARMv7 implementation enforces this by placing a sync barrier before the update of bottom, guaranteeing that the pushed element cannot be stolen before bottom is updated.
• Taken tasks are reserved first by updating bottom; again, in our ARMv7 code, the sync barrier placed after the update to bottom will ensure that the task will not be concurrently stolen.
• Stolen tasks are reserved by updating top. The only situation where steal and take contend for the same task is when the deque has a single element left; this particular conflict is resolved through the CAS instructions in both take and steal. This scenario allowed Chase and Lev to make the CAS in take conditional upon the size of the deque being 1. The correctness of this optimization on a relaxed memory model depends on the presence of the two full barriers in take and steal, to ensure that at least one of the participants will have a consistent view of the size of the deque. Having just one take or steal seeing a consistent view of the size of the deque is enough: if it is take, that will force a CAS to be performed; if it is steal, the index reservation will ensure an empty return value.
• Finally, stolen tasks are protected from being concurrently stolen multiple times by the monotonic CAS update to top in steal. This CAS orders steal operations and makes them mutually exclusive. At the same time, steal operations that abort due to a failed CAS do not change the state of the deque.

2.2 Comparison of the C11 and ARM implementations

Our C11 implementation in Figure 1 is optimal in the sense that no C11 synchronization can be removed without breaking the algorithm. However, if low-level atomics are compiled using the mapping of McKenney and Silvera [9] on ARMv7/POWER or the mapping of Tehrekov [14] on x86, the generated code contains more barriers than the hand-optimized native versions on both x86 and ARMv7. We show in Section 5 that this happens because of the need for seq_cst atomics to simulate ARMv7/POWER cumulative semantics. Concretely, on ARMv7, an extra dmb instruction is inserted before each CAS operation [11], compared to the native version where a relaxed CAS (coherent and atomic only) is sufficient. On x86, an mfence instruction is added between the two reads in steal. The fully sequentially consistent C11 implementation inserts many more redundant barriers [11].

    int take(Deque *q) {
      size_t b = load_explicit(&q->bottom, relaxed) - 1;
      Array *a = load_explicit(&q->array, relaxed);
      store_explicit(&q->bottom, b, relaxed);
      thread_fence(seq_cst);
      size_t t = load_explicit(&q->top, relaxed);
      int x;
      if (t <= b) {
        /* Non-empty queue. */
        x = load_explicit(&a->buffer[b % a->size], relaxed);
        if (t == b) {
          /* Single last element in queue. */
          if (!compare_exchange_strong_explicit(&q->top, &t, t + 1,
                                                seq_cst, relaxed))
            /* Failed race. */
            x = EMPTY;
          store_explicit(&q->bottom, b + 1, relaxed);
        }
      } else { /* Empty queue. */
        x = EMPTY;
        store_explicit(&q->bottom, b + 1, relaxed);
      }
      return x;
    }

    void push(Deque *q, int x) {
      size_t b = load_explicit(&q->bottom, relaxed);
      size_t t = load_explicit(&q->top, acquire);
      Array *a = load_explicit(&q->array, relaxed);
      if (b - t > a->size - 1) { /* Full queue. */
        resize(q);
        a = load_explicit(&q->array, relaxed);
      }
      store_explicit(&a->buffer[b % a->size], x, relaxed);
      thread_fence(release);
      store_explicit(&q->bottom, b + 1, relaxed);
    }

    int steal(Deque *q) {
      size_t t = load_explicit(&q->top, acquire);
      thread_fence(seq_cst);
      size_t b = load_explicit(&q->bottom, acquire);
      int x = EMPTY;
      if (t < b) {
        /* Non-empty queue. */
        Array *a = load_explicit(&q->array, consume);
        x = load_explicit(&a->buffer[t % a->size], relaxed);
        if (!compare_exchange_strong_explicit(&q->top, &t, t + 1,
                                              seq_cst, relaxed))
          /* Failed race. */
          return ABORT;
      }
      return x;
    }

Figure 1. C11 code of Chase–Lev deque, with low-level atomics

    int take(Deque *q) {
      size_t b = R(q->bottom) - 1;                 (a)
      Array *a = R(q->array);                      (b)
      W(q->bottom, b);                             (c)
      sync;
      size_t t = R(q->top);                        (d)
      int x;
      if (t <= b) {
        x = R(a->buffer[b % a->size]);             (e)
        if (t == b) {
          bool success = false;
          atomic /* Implemented with LL/SC. */
            if (success = (R(q->top) == t))        (f)
              W(q->top, t + 1);                    (g)
          if (!success) x = EMPTY;
          W(q->bottom, b + 1);                     (h)
        }
      } else {
        x = EMPTY;
        W(q->bottom, b + 1);                       (i)
      }
      return x;
    }

    void push(Deque *q, int x) {
      size_t b = R(q->bottom);                     (a)
      size_t t = R(q->top);                        (b)
      Array *a = R(q->array);                      (c)
      if (b - t > a->size - 1) { /* Full queue. */
        resize(q);
        a = R(q->array);                           (d)
      }
      W(a->buffer[b % a->size], x);                (e)
      sync;
      W(q->bottom, b + 1);                         (f)
    }

    int steal(Deque *q) {
      size_t t = R(q->top);                        (a)
      sync;
      size_t b = R(q->bottom);                     (b)
      int x = EMPTY;
      if (t < b) {
        Array *a = R(q->array);                    (c)
        x = R(a->buffer[t % a->size]);             (d)
        ctrl_isync;
        bool success = false;
        atomic /* Implemented with LL/SC. */
          if (success = (R(q->top) == t))          (e)
            W(q->top, t + 1);                      (f)
        if (!success) return ABORT;
      }
      return x;
    }

Figure 2. ARMv7 pseudo-code of Chase–Lev deque

3. The memory model of ARMv7

The memory model of the ARMv7 architecture follows closely that of the POWER architecture, allowing a wide range of relaxed behaviors to be observable to the programmer:

1. The hardware threads can each perform reads and writes out-of-order, or even speculatively. Basically any local reordering is allowed unless there is a data/control dependence or synchronization instruction preventing it.
2. The memory system does not guarantee that a write becomes visible to all other hardware threads at the same time point. Writes performed by one thread are propagated to (and become visible from) any other thread in an arbitrary order, unless synchronization instructions are used.
3. A dmb barrier instruction guarantees that all the writes which have been observed by the thread issuing the barrier instruction are propagated to all the other threads before the thread can continue. Observed writes include all writes previously issued by the thread itself, as well as any write propagated to it from another thread prior to the barrier. This semantics of barrier instructions is referred to as cumulative.

We build on the axiomatic formalization of the POWER and ARMv7 memory model by Mador-Haim et al. [7], which has been proved equivalent to the operational semantics of Sarkar et al. [12]. A gentle introduction can be found in [8].

Axiomatic execution witnesses capture abstract memory events associated with memory-related instructions and internal transitions of the model. Unlike in stronger models such as x86, each memory access is represented at run-time by two distinct events: an issuing event (called sat for reads and ini for writes), eventually followed by a commit event when the speculative state of the instruction is resolved. Once a write instruction is committed, events that propagate it to other threads can be observed; propagation to thread A is denoted $\mathrm{pp}_A$. All the relations that are part of an execution witness are listed in Table 1.

The core of the axiomatic model builds on the evord relation, modeling the happens-before order between events. This satisfies the fundamental property:

$$\xrightarrow{\text{evord}} \;\supset\; \xrightarrow{\text{after}} \,\cup\, \xrightarrow{\text{before}} \,\cup\, \xrightarrow{\text{comm}} \,\cup\, \xrightarrow{\text{insn}} \,\cup\, \xrightarrow{\text{local}}$$

and must be acyclic for an execution to be consistent.

We assume that the atomic sections, used to represent CAS-like behaviors, are executed atomically and obey a total order. We model them either as a single instance of a read instruction (failed CAS) or an atomic read–write pair of instruction instances (successful CAS). The atomicity of these accesses is captured by the $\xrightarrow{\text{po-atom}}$ relation. We do not assume any other property on these atomic sections (e.g., cumulativity). In practice, atomic sections can be implemented with LL/SC instructions.

We use several notation shortcuts. We refer to the deque global variables top, bottom, and array as t, b, and a. Elements of the buffer are written $x_i$, where i is the virtual index in natural numbers before any wrap-around is applied. Barrier instructions are omitted for brevity when implied by the presence of a $\xrightarrow{\text{sync}}$ or $\xrightarrow{\text{ctrl-isync}}$ relation. Irrelevant values in reads and writes are replaced with the placeholder "_" (e.g., $\mathsf{R}x,\_$). We do not label instruction instances individually, but decorate them with a disambiguating execution prefix, identified by a dot. These prefixes not only distinguish between instruction instances, but also group related instruction instances within the same execution unit (usually an invocation of one of push, take or steal). For this, when no prefix is specified, the last prefix in left-to-right order is assumed.

4. Proof of correctness of the ARMv7 code

The proof is divided into five parts; it validates criteria 2 to 4 enumerated in Section 2.1. Since push and take never execute concurrently and b is only ever modified in one of these functions, the proof of Criterion 1 does not involve reasoning about concurrency and we omit it here.

The proof builds on a precise analysis of all the possible execution witnesses of arbitrary invocations of the algorithm. We recall that an execution witness, as defined by the ARMv7 axiomatic model, is a graph capturing all memory events occurring during an execution (vertices), as well as the relations that link them (edges). Individual lemmas strive to narrow down the set of possible execution witnesses, based on properties of the algorithm and the architecture. To that end, we pinpoint specific subgraphs of an execution witness (hereafter, execution graphs) that cannot occur together in the same consistent execution witness. We then show that all incorrect executions, such as those containing two instances of steal reading the same value added by a single instance of push, cannot have consistent execution witnesses and, as such, cannot happen.

The proof is structured as follows. In 4.1 we provide basic technical definitions and properties of the memory model, which are used throughout the proof. In 4.2 we describe all the possible execution graphs for each of the three operations (push, take and steal), following the control flow of the ARMv7 code in Figure 2. In 4.3 we show how the succession of dynamic arrays built by resizing can be abstracted as a single sequence of unique abstract values independent of resize operations, with strong coherence and consistency properties. Corollary 2 establishes Criterion 2 (well-defined reads). In 4.4 we build on the previous abstraction to prove Theorem 1, pertaining to the uniqueness of elements taken and stolen, which corresponds to Criterion 3 (uniqueness). Finally, in 4.5, we rely on all previous results to prove Theorem 2 establishing Criterion 4 (existence): the existence of matching take or steal operations for every pushed element, under the appropriate hypotheses.

4.1 Preliminary properties

Before delving into the details of the proof itself, we introduce some support definitions and related properties.
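As an informal complement to the proof outline, the four criteria of Section 2.1 can be spot-checked on a small sequential run. The following is a minimal sketch in standard C11, not the authors' code: it follows the control flow and memory orders of Figure 1 but assumes a fixed capacity `CAP` instead of `resize`, non-negative integer tasks with hypothetical sentinels `EMPTY` and `ABORT`, and signed indices so that `bottom - 1` on an empty deque does not wrap around. The names `deque_init` and `run_checks` are illustrative. Because everything runs on one thread, it exercises only the control flow; it says nothing about the relaxed-memory behaviors that the proof addresses.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define EMPTY (-1) /* assumption: tasks are non-negative ints */
#define ABORT (-2)
#define CAP 8      /* assumption: fixed capacity, no resize */

typedef struct {
    atomic_long top, bottom; /* signed in this sketch, see lead-in */
    atomic_int buffer[CAP];
} Deque;

static void deque_init(Deque *q) {
    atomic_init(&q->top, 0);
    atomic_init(&q->bottom, 0);
    for (int i = 0; i < CAP; i++)
        atomic_init(&q->buffer[i], EMPTY);
}

/* Owner side. Returns false when full (Figure 1 resizes instead). */
static bool push(Deque *q, int x) {
    long b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    long t = atomic_load_explicit(&q->top, memory_order_acquire);
    if (b - t > CAP - 1)
        return false;
    atomic_store_explicit(&q->buffer[b % CAP], x, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
    return true;
}

/* Owner side: pop from the bottom. */
static int take(Deque *q) {
    long b = atomic_load_explicit(&q->bottom, memory_order_relaxed) - 1;
    atomic_store_explicit(&q->bottom, b, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    long t = atomic_load_explicit(&q->top, memory_order_relaxed);
    int x = EMPTY;
    if (t <= b) {
        x = atomic_load_explicit(&q->buffer[b % CAP], memory_order_relaxed);
        if (t == b) { /* last element: race against thieves with a CAS */
            if (!atomic_compare_exchange_strong_explicit(
                    &q->top, &t, t + 1,
                    memory_order_seq_cst, memory_order_relaxed))
                x = EMPTY;
            atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
        }
    } else { /* empty: undo the reservation */
        atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
    }
    return x;
}

/* Thief side: remove from the top. */
static int steal(Deque *q) {
    long t = atomic_load_explicit(&q->top, memory_order_acquire);
    atomic_thread_fence(memory_order_seq_cst);
    long b = atomic_load_explicit(&q->bottom, memory_order_acquire);
    int x = EMPTY;
    if (t < b) {
        x = atomic_load_explicit(&q->buffer[t % CAP], memory_order_relaxed);
        if (!atomic_compare_exchange_strong_explicit(
                &q->top, &t, t + 1,
                memory_order_seq_cst, memory_order_relaxed))
            return ABORT;
    }
    return x;
}

/* Returns 0 if a sequential mixed schedule satisfies criteria 1 to 4. */
static int run_checks(void) {
    enum { N = 6 };
    int seen[N] = {0};
    Deque q;
    deque_init(&q);
    for (int i = 0; i < N; i++)
        if (!push(&q, i)) return 1;
    int last = N; /* values seen by take must strictly decrease */
    for (int round = 0; round < N; round++) {
        bool owner = (round % 2 == 0); /* alternate owner and thief */
        int x = owner ? take(&q) : steal(&q);
        if (x < 0 || x >= N) return 2; /* criterion 2: only pushed tasks */
        if (seen[x]++) return 3;       /* criterion 3: at most once */
        if (owner) {
            if (x >= last) return 4;   /* criterion 1: reverse order */
            last = x;
        }
    }
    for (int i = 0; i < N; i++)
        if (seen[i] != 1) return 5;    /* criterion 4: each exactly once */
    /* Single-element case: take goes through the CAS path. */
    if (!push(&q, 42) || take(&q) != 42) return 6;
    if (take(&q) != EMPTY || steal(&q) != EMPTY) return 7;
    return 0;
}
```

Calling `run_checks()` from a `main` function and testing for a zero result is enough to use the sketch. A meaningful stress test would run `steal` on separate threads, ideally on ARM or POWER hardware, and even then could only fail to find a violation rather than establish correctness.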
Rl,α readofvalueαfromlocationl(_standsforanyvalue) implies −p−p-−s→at , by definition of communication edges (if threads Wl,α writeofvalueαtolocationl(_standsforanyvalue) aredifferent)oruniprocessorconstraints(ifsamethread). sync memorybarrier(usuallyimpliedby −s−y→nc ) isync instructionbarrier(usuallyimpliedby −c−tr−l-−isy−→nc ) Lemma2. Thefollowingpropertiesinvolve −p−p-−s→at and −p−o-−l→oc : sat(X) satisfy(a.k.a.complete)eventofareadinstruction (i) A.Wx,_−→rf B.Rx,_−p−o-−l→oc B(cid:48).Rx,_ =⇒ A.Wx,_−p−p-−s→at B(cid:48).Rx,_ ini(X) initializeeventofawriteinstruction (ii) A.Wx,_−→co B.Wx,_−p−p-−s→at C.Rx,_ =⇒ A.Wx,_(cid:54)−→rf C.Rx,_ com(X) commiteventofanin-flightorspeculativeinstruction (iii) Wx,_−p−p-−s→at Ry,_−→dp Rz,_ =⇒ Wx,_−p−p-−s→at Rz,_ ppA(X) propagatetothreadofAevent (iv) ¬(cid:0)A.Wx,_−p−p-−s→at B.Ry(cid:48),_−→dp B.Wx(cid:48),_−p−p-−s→at A.Ry,_−→dp A.Wx,_(cid:1) −p→o programorder Proof. Weproveeachpointseparately: −p−o-−at−o→m atomicoperationinprogramorder(forCAS;seebelow) (i) Ifthewriteandthereadshappeninthesamethread,thenallin- −p−o-−lo→c same-locationaccessinprogramorder(definedin4.1) struction instances belong to that thread, and program order prevails. −→co writecoherence Otherwise, either A.Wx,_−→rf B(cid:48).Rx,_ and the result is immediate, or −→rf readfrom A.Wx,_(cid:54)−→rf B(cid:48).Rx,_ and B.Rx,_−p−o-−lo→c B(cid:48).Rx,_ implies the following: −r→(cid:27) readfromfar(definedin4.3) com(B.Ry,_)−lo−c→al sat(B(cid:48).Ry,_),bydefinitionof −lo−c→al .Hence: −→fr fromread −a−d→dr addressdependence(usuallyimplicit) ppB(A.Wx,_)→sat(B.Ry,_)−in−→sn com(Ry,_)−lo−c→al sat(B(cid:48).Ry,_) −c→trl controldependence(usuallyimplicit) (ii) Suppose A.Wx,_(cid:54)−→rf C.Rx,_. 
Then C.Rx,_−→fr B.Wx,_, and we −d−a→ta datadependence(usuallyimplicit) havethefollowingcycleintheeventhappens-beforeorder: −c−tr−l−d-→−ipsy−→nc onbosne-rcvuambuleladtievpeelnodceanlcoerd(deerfiinngedbainrr4ie.1r)(seebelow) sat(C.Rx,_)−c−om−→m ppZ(B.Wx,_)→sat(C.Rx,_) −s−y→nc cumulativefullbarrier(seebelow) (iii) FollowsfromLemma1. −p−p-−s→at write-to-readpropagation(definedin4.1) (iv) Assumethat: −b−a−e−ff−to→e→rre abfetfeorrebabrarrireireredegdege A.Wx,_−p−p-−s→at B.Ry(cid:48),_−d→p B.Wx(cid:48),_−p−p-−s→at A.Ry,_−d→p A.Wx,_ −c−om−→m communicationedge IfA∼Bthenthereisacyclein −p→o .Otherwise,byLemma1,wehavea −in−→sn intra-instructionorderedge cycleintheeventhappens-beforeorder: −−elov−−oc→→ardl leovceanltohradpepreendsg-beeforeorder(usuallytypesetas→) ppB(Wx,_)→sat(Ry(cid:48),_)→com(Wx(cid:48),_)−in−→sn ppA(Wx(cid:48),_) OnARMv7, −s−y→nc correspondstoadmbinstructionwhile −c−tr−l-−isy−→nc corre- →sat(Ry,_)→com(Wx,_)−in−→sn ppB(Wx,_) Lemma3. Thefollowingpropertiesinvolvingbarriersapply: spondstoadependentconditionalbranchfollowedbyanisbinstruction. Table 1. Summary of relations used in the ARMv7 axiomatic (i) (Wx,_−s−y→nc Wy,_−p−p-−s→at Rz,_ ∨ Wx,_−p−p-−s→at Ry,_−s−y→nc Rz,_) model =⇒ Wx,_−p−p-−s→at Rz,_ (ii) A.Wx,_−→rf B.Rx,_−s−y→nc B.Wy,_−p−p-−s→at C.Rx,_ =⇒ A.Wx,_−p−p-−s→at C.Rx,_ For convenience, we define the −p−o-−l→oc relation, which relates (iii) LetXstandforA.Wx,_−→rf B.Rx,_or(A∼B).Wx,_ local(same-thread)accessestothesamememorylocation; −p−o-−l→oc andY standforC.Wy,_−→rf D.Ry,_or(C ∼D).Wy,_ impliesan instruction-levelcommunicationedge −→co , −→rf or −→fr . thenthefollowingholds: Inparticular, −p−o-−l→oc implies −→co betweentwowrites. ¬(X−s−y→nc B.Ry,_−→fr C.Wy,_∧Y −s−y→nc D.Rx,_−→fr A.Wx,_) Wedefinethedependencerelation −→dp asfollows: Proof. 
Weproveeachpointseparately: Rx,_−→dp Ry,_ ⇐de⇒f Rx,_(−a−d→dr ∪ −c−trl−-is−y→nc )Ry,_ (i) If Wx,_ and Rz,_ occur in the same thread, then all instruction Rx,_−→dp Wy,_ ⇐de⇒f Rx,_(−a−d→dr ∪ −c→trl ∪ −d−a→ta )Wy,_ instances belong to that thread and program order prevails. Otherwise, supposeRz,_executesinA;wehavetwocases: Lemma1.RTxh,e_f−→doplloRwyi,n_g=p⇒ropesrattie(Rsixn,v_o)l→vingsa−t→d(pRya,p_p)ly: ppA(Wx,_)−b−ef−o→re ppA(sync)−b−ef−o→re ppA(Wy,_)→sat(Rz,_) Rx,_−→dp Wy,_ =⇒ sat(Rx,_)→com(Wy,_) Ortheotherwayaround: Proof. In the case the of an address or control dependence, the result is ppA(Wx,_)→sat(Ry,_)−in−→sn com(Ry,_)−lo−c→al com(sync)−lo−c→al sat(Rz,_) animmediateconsequenceofthedefinitionofintra-instructionandlocal Inbothcases,ppA(Wx,_)→sat(Rz,_). orders.Itremainstobeshownthattheresultholdsfor −c−tr−l-−isy−→nc :adepen- (ii) Suppose A ∼ C. If A ∼ B, then program order prevails: dentconditionalbranchinstruction,ctrl,followedbyanisyncbarrier.Sup- all the instruction instances belong to the same thread. If not, suppose poseRx,_−c−tr−l-−isy−→nc Ry,_.Thenwehave:sat(Rx,_)−in−→sn com(Rx,_)−lo−c→al C.Rx,_−p→o A.Wx,_;thentheeventhappens-beforeordercontainsthefol- com(ctrl)−lo−c→al com(isync)−lo−c→al sat(Ry,_). 
lowingcycle: We define the relation −p−p-−s→at between instruction instances, ppB(A.Wx,_)−c−om−→m sat(B.Rx,_)−in−→sn com(Rx,_)−lo−c→al com(sync) A.Wx,_−p−p-−s→at B.Ry,_,asfollows:1 −lo−c→al com(B.Wy,_)−in−→sn ppC(Wy,_)→sat(C.Rx,_)−in−→sn com(Rx,_) (cid:40)Wx,_−→po Ry,_ ifA∼B −lo−c→al com(A.Wx,_)−in−→sn ppB(Wx,_) ppB(Wx,_)→sat(Ry,_) ifA(cid:54)∼B Otherwise,supposeA(cid:54)∼C.IfA∼B,thenA.Wx,_−s−y→nc B.Wy,_andwe havetheresultfrom(i).Ifnot,wehave: wprheefirxeesAA∼andBBmbeelaonnsgtthoatthiensstarmucetitohnreainds.tances grouped under ppB(A.Wx,_)−c−om−→m sat(B.Rx,_)−in−→sn com(Rx,_)−lo−c→al com(sync) Intuitively, −p−p-−s→at represents a “known-to” relation in the fol- Thus,wehaveppC(A.Wx,_)−b−ef−o→re ppC(sync)−b−ef−o→re ppC(B.Wy,_)→ lowingsense:A.Wx,_−p−p-−s→at B.Ry,_meansthat,atthetimeofread- sat((Ciii.)RSxu,_p)p.osethecontrary.IfB∼D,then −→rf and −→fr formapaththat ingy,thatspecificwritetox(aswellasanywritethatiscoherence- goesagainst −p→o :thegraphisinvalidaccordingtouniprocessorconstraints. beforeit)isknowntothethreadexecutingB.Itisclearthat −→rf Otherwise,B(cid:54)∼Dandthefollowingholds(omittingintermediatesteps inelaborating −b−ef−o→re forconciseness): 1Notethat −p−p-−s→at doesnotimplyaneventhappens-beforeorder onthe • com(B.sync)−lo−c→al com(C.Wy,_)−lo−c→al com(D.sync)−in−→sn ppB(sync) eventsmakinguptherelatedinstructioninstances. ifB∼C. • com(B.sync)−lo−c→al sat(Ry,_)−c−om−→m ppB(C.Wy,_)−b−ef−o→re ppB(D.sync) relation. Initially β0 = 0. Since all push and take operations otherwise. occurinasinglethread,andstealoperationsneveralterthevalue Eafittehreredwgaey,bceotwme(eBn.tshyenct)wo→baprrpieBrs(:Dp.psyDn(cB)..sByyncd)e−fia−fnt→eitriocno,mw(eDh.asvyenca)n. oofrdbe,rtwheitheilnemtheenptsusohfa(nβdnt)akceororpesepraotniodntso.Swimritielasrltyo,bweindepfirongertahme Moreover,eitherA∼DorA(cid:54)∼D: sequence(τ )ofvaluestakenbythevariablet.Weassumeτ =0. 
• ppD(B.sync)−a−ft→er com(D.sync)−lo−c→al com(A.Wx,_)−in−→sn ppB(Wx,_) Furthermorem,sinceallwritestotarefromCASinstructions,0which ifA∼D. • optphDer(wBis.es.ync)−a−ft→er com(D.sync)−lo−c→al sat(Rx,_)−c−om−→m ppD(A.Wx,_) atrbeysoeqnue,en(τtimal)lyisomrdoenreodto,naincdalalyllisnuccrehaCsiAngS,ainnsdtrsu.ct.tiτomns=inmcre.ment Thus,inallcases,wehaveacycle: Foreachindexi,wedefinethesequence(ξiv)v∈Nofsuccessive −c−om−→m satc(Bom.R(Bx,._s)y−inn−→csn)−bc−eof−mo→re(Rpxp,B_)(A−lo−c.→Wal xco,_m)(B.sync) uvWanlxdueie,rs_lygoiinfvgeanaprtrouastyhh.eOoepnleelrymatteihnoetn,alatrsietngdsauerxdchlieswisnrtiohtfeetidhseecqaauldeldebdryesstsihge&nilxfiasctaownftrtihatees 4.2 Executionpaths it induces a new value in an (ξiv) sequence, while writes due to resizingdonot.Foralli,ξ0,thevaluebeforethefirstsignificant We consider the three operations of the work-stealing algorithm: write to x location, is undiefined: ξ0 = ⊥. Similarly, a read is take, push and steal. Each of them exhibits different execution i i significantifitoccursinasuccessfulinstanceoftakeorsteal. paths depending on control flow. Data and address dependences are implicit in the notations and are omitted for brevity. Control Lemma4. Foralli,(ξv)isgloballycoherent. i dependencesareimpliedbytheguardconditionsineachcaseand avraeriaablsleosocmarirttyeidn,gbtuhtewcoenetrxopllidceitpethnedecnocnes.trGairneteskolnetttehresbβ,anτd, ξt oPfrothoef.aGddivreensstwofotshiegnuinfidcearnltywinrgitaersrWay)x.iI,f_WanxdiW,_xa(cid:48)in,d_aWtixnd(cid:48)i,e_xbio(trhegwarridtelestos thesamememorylocation,thentheyareorderedbywritecoherence.Ifthey denotethememoryvaluesofb,t,andsomexi,respectively.Reads donot,thentheremustbearesizeoperationafterthefirstwriteandbefore andwritesareannotatedwiththecorrespondinglinefromFigure2. 
thesecond(allwriteshappeninthesamethread).Becauseofthecumulative For take and steal, we say that an instance of the operation barrierafteraresizeoperation,threadsthatseethesecondvaluemusthave is successful if it returns one element; otherwise (including if it seenthefirstbeforehand.Hence,thereisaglobalcoherenceorderonthe returnsempty)itisconsideredfailed. writes,whichcorrespondstotheorderofpushoperations. Wedefinetherelationreadfromfarasfollows:forsomemem- 4.2.1 Take orylocationsm0,...,mn andsomevaluev,Wm0,v−→r(cid:27) Rmn,v Twofailurecasesreturnnoelement(empty),andtwosuccesscases ifWm0,v−→rf Rmn,vorthereexistsasequenceofcopiescarrying returnoneelementfromthedeque.Allfourpathsstartwith: thevalueofthewritetotheread: (a)Rb,β−→po (b)Ra,&x−→po (c)Wb,β−1−s−y→nc (d)Rt,τ Wm0,v−→rf Rm0,v−d−a→ta Wm1,v−→rf ··· −d−a→ta Wmn,v−→rf Rmn,v. Forconciseness,wehereafteromitthevariablenamefromreads Specificcontinuationsforeachpatharelistedbelow. and writes whenever the variable can be inferred from the value: ReturnemptywithoutCAS, β−τ ≤0: ··· −→po (i)Wb,β e.g., Wβn stands for Wb,βn. Let Wξiv denote the vth significant Returnemptywith(failed)CAS, β−τ =1,τ (cid:54)=τ(cid:48): writeatindexi,andRξivasignificantreads.t.Wξiv−→r(cid:27) Rξiv. ··· −→po (e)Rxβ−1,ξ−→po (f)Rt,τ(cid:48)−→po (h)Wb,τ +1 Lemma5. GivenawriteWxi,_andareadRx(cid:48)j,_, RReettuurrnnoonneewwiitthho(usutcCcAesSsf,uβl)−CAτS>, β1:−τ··=· −→p1o: (e)Rxβ−1,ξ i(cid:54)=j =⇒ Wxi,_(cid:54)−→rf Rx(cid:48)j,_ ··· −→po (e)Rxβ−1,ξ−→po (f)Rt,τ−p−o-−at−o→m (g)Wt,τ+1−→po (h)Wb,β Proof. If the addresses of the underlying arrays differ, then the memory locationsreadandwrittenaredistinctandtherecanbenoreadfromrelation. 4.2.2 Push Otherwise,sinceoldarraysareneverreused,theaddressesarethesame There are two paths: a straight case, and a resizing case which andi ≡ j mod size(x)Rx(cid:48)j,_belongstoasuccessfulinstanceoftake, growstheunderlyingcircularbuffer. push(withresizing),orsteal.LetXbethatinstance. 
LetP betheinstanceofpushtowhichWxi,_belongs.InP,wehave Straight, β−τ <size(x)−1: thefollowingexecutiongraph: (aR)eRsbiz,iβng−→p,oβ(b−)Rτt,≥τ−s→pioze((cx)R)a−,&1:xw−→phoer(ee)xW(cid:48)rxeβfe,rξs−ts−oy→ncth(efn)eWwba,rβra+y 1 P.Rt,τP−c→trl Wxi,_−s−y→nc Wb,βP +1 (a)Rb,β−→po (b)Rt,τ−→po (c)Ra,&x−→po resize where τP ≤i≤βP and βP −τP <size(x)−1 −s−y→nc (d)Ra,&x(cid:48)−→po (e)Wx(cid:48)β,ξ−s−y→nc (f)Wb,β+1 Letusassumei(cid:54)=j∧Wxi,_−→rf Rx(cid:48)j,_andshowitisindeedimpossible. where resize =Rxτ,ξτ−→po Wx(cid:48)τ,ξτ−→po ··· Assume X is a successful instance of take or push. Since X and −→po Rxβ−1,ξβ−1−→po Wx(cid:48)β−1,ξβ−1−s−y→nc Wa,&x(cid:48) dPerbe(tlhoengortdoertheofsalomaedstharneadd,stPoremsutosttohcecusrambeefolroecaXtioninispropgrersaemrveodr-: 4.2.3 Steal P.Wxi,_−p−o-−lo→c X.Rx(cid:48)j,_). Ifj<i,thenj≤i−size(x).However,thefollowingmustholdinP: There are three paths: two failure cases and one success case. Failurereturnsnoelementandsuccessreturnsastolenelement. τP ≤i≤βP ∧βP −τP <size(x)−1 ReturnemptywithoutCAS, β−τ ≤0: (a)Rt,τ−s−y→nc (b)Rb,β hence j<i−size(x)+1≤βP −size(x)+1<τP Returnemptywith(failed)CAS, β−τ >0∧τ (cid:54)=τ(cid:48): Furthermore,ifX isatakeoperation,Rx(cid:48)j,_readsthelastelementof (a)Rt,τ−s−y→nc (b)Rb,β−c−trl−-is−y→nc (c)Ra,&x−→po (d)Rxτ,ξ−c−trl−-is−y→nc (e)Rt,τ(cid:48) the deque, and j = βX −1 ≥ τX; if X is a push operation, Rx(cid:48)j,_ Returnonewith(successful)CAS, β−τ >0: resultsfromacopyoperationoftheresizingcode,hencej ≥ τX.Since (a)Rt,τ−s−y→nc (b)Rb,β−c−tr−lc−-−tisr−yl−-→nis−cy→n(cc)(Re)aR,t&,τx−−p→p−oo-−a(t−od→m)R(xfτ)W,ξt,τ +1 XP.Rotc,cτuPrs−p−aof-−lto→ecrXP.Rint,pτrXogaranmdjo<rdeτrPan≤dτtXis≤mjo.nIomtopnoiscsaiblllye.increasing, Ifi < j,then,sincej ≥ βX,bmustincreasefromβP +1toj+1 4.3 Significantreadsandwrites betweenthewriteinP andthereadinX.Hence,theremustbeaninstance P(cid:48)ofpushbetweenPandX(inprogramorder)thatincrementsbtoj+1. 
We define the sequence (βn) of values taken by the variable b Indeed,theonlywritesthatincreasethevalueofboccurinpushandtake; overthecourseoftheprogram,accordingtothewritecoherence andtheeffectoftakeasawholeneverincreasesthevalueofbsinceitfirst decrementsthevariable.Wehave: Otherwise, 0 < v < u. Let W.Wξv be the significant write s.t. i henPc.eWxPi,.W_−p−xo-−ilo→,_c−→cPo(cid:48)P.W(cid:48).Wxjx,_j−p,−_o-−−plo−→pc-−s→aXt.XRx.R(cid:48)jx,_(cid:48)j,_ WcWar.(cid:48)rW.yWiξnivxgi−r→,t(cid:27)hξeivRv−→xrfail,uRξexivio.,fIξniξviv.oMtthooerrReowxvoie,rrdξ,siav,c.ctTohrehdraeitngesextoqisutthesneacdeesfieeqnnuidteisonncweoitfho(fξaivco)wparinietdes Thus,fromLemma2(ii),P.Wxi,_(cid:54)−→rf X.Rx(cid:48)j,_. Wtheξuseimnapnrtoigcsraomforredseizr.ing, W.Wξiv and W(cid:48).Wxi,ξiv must come before execNuotiwo,nagsrsaupmhefoXrXis:asuccessfulinstanceofsteal.Wehavethefollowing iWehavetwocases:eitherWξiuandRxi,ξivrefertothesamememory locationortheydonot. X.Rt,τX =j−s−y→nc Rb,βX−c−tr−l-−isy−→nc Ra,&x(cid:48)−p→o Rx(cid:48)j,_ Assumethattheyrefertothesamememorylocationxi.Thenitmust −c−tr−l-−isy−→nc Rt,τX−p−o-−at−o→m Wt,τX+1 bethatW(cid:48).Wxi,ξiv−p−o-−lo→c Wxi,ξiu,andwehave: Ifj<i,thenj≤i−size(x).However,thefollowingmustholdinP: W(cid:48).Wxi,ξiv−→co Wξiu−p−p-−s→at Ra,&x−a−d→dr Rxi,ξiv j<i−size(x)+1≤βP −size(x)+1<τP HenCceo,nfvreormselLye,masmsuam2e(tihi)a,tWthe(cid:48).yWdoxin,oξtivre(cid:54)−→rfferRtoxit,hξeivs.aImmepmosesmibolery. location. 
HenceτX =j<τP.Sincetincreasesmonotonically,itmustbethat: ThentheremustbearesizeoperationbetweenW(cid:48).Wxi,ξivandWξiu: X.Rx(cid:48)j,_−c−tr−l-−isy−→nc Rt,τX−p−o-−at−o→m Wt,τX+1 Wa,&x−s−y→nc W(cid:48).Wxi,ξiv−s−y→nc Wa,&x(cid:48)−s−y→nc Wx(cid:48)i,ξiu −→rf Rt,_−s−y→nc Wt,_−→rf ···−s−y→nc Wt,τP−→rf P.Rt,τP−c→trl Wxi,_ −p−p-−s→at Ra,&x−a−d→dr Rxi,ξiv HenceX.Rx(cid:48)j,_mustbecommittedbeforeWt,τX +1.SinceWt,τX + Hence,fromLemma3(i),Wa,&x−→co Wa,&x(cid:48)−p−p-−s→at Ra,&x.Andfrom 1 is (cumulatively) propagated to Wxi,_, X.Rx(cid:48)j,_ must be committed Lemma2(ii),Wa,&x(cid:54)−→rf Ra,&x.SincethereisonlyonewriteWa,&x 1be−pf−po-−rse→atWPx.Ri,t_,.τFPo.rImfaWllyx:i,i_t−→rffolRloxw(cid:48)js,_ftrhoemnWLexmim,_a−p−p3-−s→a(tii)Rtxh(cid:48)ja,t_.WWte,τgXet:+ tohfatc(ogiipi)vieeTsshtcehareerrveyaxilniusgets&thaxewtvroiatlaeu,eWwoe.Wfhξaξvvivetsao.tc.RoWnξtv.rWa.dTξichivtai−otr→(cid:27)ns.eRqξuiven,caendenadsseqwuietnhcae X.Wt,τX+1−p−p-−s→at P.Rt,τP−c→trl Wxi,_ writeW(cid:48).Wξv−→rf Rξv.Sinceu ≤i v,Wξiu−p→o W.Wξv bydefinitionof i i i i ∧P.Wxi,_−p−p-−s→at X.Rx(cid:48)j,_−c−tr−l-−isy−→nc Wt,τX+1 F(ξriovm).TLheamnmksat3ot(hi)e,bwaerrgieertaWfteξruW−p−pξ-−ius→atinRpξuvsh.,Wξiu−s−y→nc W(cid:48).Wξiv−→rf Rξiv. Lemma2(iv)tellsthatitisimpossible.ThusP.Wxi,_(cid:54)−→rf X.Rx(cid:48)j,_. i i If i < j, then i ≤ j −size(x), and there must be an instance P(cid:48) Corollary 2 (Well-defined significant reads). Given a significant ofpushs.t.P(cid:48).Wb,j+1−p−o-−lo→c Wb,βX−→rf X.Rb,βX (sothatindexjbe readRxi,ξ,ξ=ξivforsomev>0. accessibleinX).P(cid:48)cannotoccurbeforeP inprogramorderbecause,as above,wewouldhaveτP(cid:48) ≤ τP ≤ iontheonehand,andi ≤ j − ProoSfu.pLpeotseXξb(cid:54)=etξhve,stuhcecnesξsf=ul⊥insctaanncoenolyftbaekeanorusntdeaefilns.et.dRvxaliu,eξf∈romXt.he sinizcere(axs)es<inτsPiz(cid:48)eo,nsothteheotihneerquhaalnidty.TsthilelhuonlddesrliyfitnhgeasrirzaeysaolfsPomanodnoPto(cid:48)ndicifaflelry. 
uninitializedarray,ipriortocopying.Indeed,ifxiisnotaffectedbycopying, thenitmustbeoneofthenewslotsallocatedbytheresizing,henceitsinitial HenceP(cid:48)occursafterP.FurthermoreWx(cid:48)j(cid:48),_∈P(cid:48).IfxinP andx(cid:48)(cid:48)in valueisξ0.LetRbethepushoperationthatallocatesthearrayx.There P(cid:48)refertodifferentarrays,thenaresizeoperationRmustprecedeP(cid:48),s.t. existsaξuisuchthat: i Wa,&x−p−o-−lo→c P.Ra,&x−p−o-−lo→c R.Wa,&x(cid:48)(cid:48) Wxi,⊥−→co R.Wxi,ξiu−s−y→nc Wa,&x−→rf X.Ra,&x−a−d→dr Rxi,ξ −s−y→nc P(cid:48).Wx(cid:48)j(cid:48),_−s−y→nc Wb,j+1 It follows from Lemmas 2 (iii), 3 (i) and 2 (ii) that Wxi,⊥(cid:54)−→rf Rxi,ξ. −p−o-−lo→c Wb,βX−→rf X.Rb,βX−c−tr−l-−isy−→nc Ra,&x(cid:48)−a−d→dr Rx(cid:48)j,_ ImpHosesnicbele,.ξ = ξv.WehaveRb,β ∈ X andβ ≥ i+1 > 0,forX is hence Wa,&x−→co R.Wa,&x(cid:48)(cid:48)−s−y→nc Wb,βX−p−p-−s→at X.Rb,βX successful.Hence,ithereisaninstanceofpushP s.t.P.Wb,β−→rf X.Rb,β. FromLemma2(iii),Wb,βX−p−p-−s→at X.Ra,&x(cid:48);Lemma2(ii)concludes Sinceβ ≥ i+1,eitherβ = i+1andWξiu ∈ P,ortheremustbe that Wa,&x(cid:54)−→rf X.Ra,&x(cid:48). 
Since all resize operations allocate new ar- aninstanceofpushthatcontainsasignificantwriteWξiu andcomesbe- rays,&x(cid:48) (cid:54)= &x,whichcontradictsourpremises.Otherwise,xandx(cid:48)(cid:48) foreP inprogramorder.Inbothcases,Wξiubelongstoapushoperation, refertothesamearray,henceWxi,_−p−o-−lo→c Wx(cid:48)j(cid:48),_,andweget: phuenshce,Wuξ>u−s−0y→n.cMPo.rWeobv,eβr,.tIhfaXnkisstaontihnestbanarcreieorfataftkeer,aP.sWignbi,fiβca−p→notXwr.iRteξvin; P.Wxi,_−p−o-−lo→c−→rfPX(cid:48).W.Rxb,(cid:48)j(cid:48)β,_X−s−y−c→n−tcr−l-W−isy−→nbc,jR+x(cid:48)j1,_−p−o-−lo→c Wb,βX oPhte.hnWecrebw,,ibβsye−i,pL−pe-P−sm→a.tWmXabs,.Rβ3ξ−→(rivif).aXInn.dRb6bo,,t0hβ<c−c−atrs−ule-−iss≤y,−→nWcv.Rξiuξiv−s−y→nacndP.WLebm,mβa−p−p3-−s→at(iiX).gRivξiievs, ItfollowsfromLemmas3(i)and2(iii)that: P.Wxi,_−→co Wx(cid:48)j(cid:48),_−p−p-−s→at Rx(cid:48)j,_ 4.4 Uniquenessofsignificantreads Hence,fromLemma2(ii),Wxi,_(cid:54)−→rf Rx(cid:48)j,_. Trehaedsreastudltisffferroemnttihnedepxreesvicoaunsnsoetcrteiotrnieevsetathbelisshamtheateltewmoesnitgξnvifi.cTahnet i Corollary1. GivenasignificantwriteWξivandasignificantread onlypossiblecauseofduplicatesignificantreadsisthusreducedto Rx(cid:48)j,_: i(cid:54)=j =⇒ Wξiv(cid:54)−→r(cid:27) Rx(cid:48)j,_. thecasewherethereadsaccessthesameindexi. Proof. If i (cid:54)= j, we know that Wξv(cid:54)−→rf Rx(cid:48),_. Furthermore, all copies, Theorem1(Work-stealing:uniquenessofsignificantreads). Given i j whichhappenduringaresizeoperation,copyfromandtothesameindex. aworkerthreadexecutingasequenceof pushandtakeoperations, Sincetherearelesscopiesthanthesizeoftheexpandedarray,therecanbe andfinitenumbernumberofthiefthreadseachexecutingstealop- notwocopieswritingtothesamememorylocationinthenewarray.Hence, erations, all against a same deque. If X and Y are two distinct therecanbenosequenceofcopiesfromWξvtoRx(cid:48),_. i j successfulinstancesof stealortake, Lemma6. 
GivenasignificantwriteWξiu andasignificantread ∀Rξv ∈X,∀Rξv(cid:48) ∈Y,i(cid:54)=i(cid:48)∨v(cid:54)=v(cid:48) Rξv: i i(cid:48) i (i) Wξiu−p−p-−s→at Ra,&x−a−d→dr Rxi,ξiv =⇒ u≤v Lemma7. GivenS1andS2distinctsuccessfulinstancesof steal, (ii) 0<u≤v =⇒ Wξiu−p−p-−s→at Rxi,ξiv ∀Rξiv ∈S1,∀Rξiv(cid:48)(cid:48) ∈S2,i(cid:54)=i(cid:48) Proof. Weproveeachpointseparately: Proof. Allwritestotatomicallyincrementit(byatomicityofCAS).Hence (i) Supposev<u.WedefineW(cid:48).Wxi,ξivasfollows. twosuccessfulstealoperationscannotwrite(thusread)thesamevalueof Ifv =0,ξivisanundefinedvalue;letW(cid:48).Wxi,ξi0−→rf Rxi,ξivbethe t.Readsfromxinastealoperationaccesstheindexgivenbythevalueof initializationofxi.W(cid:48).Wxi,ξi0comesbeforeWξiuinprogramorder. thetvariable.HenceRt,i∈S1andRt,i(cid:48)∈S2implyi(cid:54)=i(cid:48). Lemma 8. Given T a successful instance of take and P an in- 4.5 Existenceofsignificantreads stanceof push.IfP comesafterT inprogramorder,then: Theorem 2 (Work-stealing: existence of significant reads). Con- ∀Rξv ∈T,∀Wξu ∈P,i(cid:54)=j∨v(cid:54)=u sideraworkerthreadexecutingasequenceof pushand takeop- i j erations,andafinitenumberofthiefthreadseachexecutingsteal Proof. Assume i = j ∧ v = u. We have Rξv−p→o Wξu; therefore i j operations,allagainstasamedeque.Ifthenumberof pushisfi- Wξu(cid:54)−p−p-−s→at Rξv.FromLemma6(ii),itfollowsthatu > v.Wehavea j i nite,thenallthreadsreachastationarystatewhereb=tinafinite contradiction. numberoftransitions,andthefollowingholdsglobally: Lemma9. GivenT andT distinctsuccessfulinstancesof take, 1 2 ∀ξv,v>0 =⇒ ∃!Rξvinsomethreadbeforethestationarypoint ∀Rξv ∈T ,∀Rξv(cid:48) ∈T ,i(cid:54)=i(cid:48)∨v(cid:54)=v(cid:48) i i i 1 i(cid:48) 2 Let P be the last instance of push in the worker thread, in F Proof. Wehavethefollowingexecutiongraphs: programorder.LetWβ ∈P andRτ ∈P .Wesaythatan T1.Rβn−p→o Ra,_−p→o Wb,βn−1−s−y→nc Rt,τ−p→o Rξβvn−1−p→o ··· instanceXoftakeorstneFalistraFilingifRmβFn≥nFF∈X. 
T2.Rβn(cid:48)−p→o Ra,_−p→o Wb,βn(cid:48)−1−s−y→nc Rt,τ(cid:48)−p→o Rξβvn(cid:48)(cid:48)−1−p→o ··· Lemma 11. Given X a successful trailing instance of take or Andβn−1=iandβn(cid:48)−1=i(cid:48). steal: Rτm ∈X =⇒ m≥mF. Sinceallinstancesoftakeoccurintheworkerthread,wehaveeither: Proof. Wehavetwocases: T1.Wb,βn−1−p−o-−lo→c T2.Rβn(cid:48) or T2.Wb,βn(cid:48)−1−p−o-−lo→c T1.Rβn • Assume X is an instance of take. X follows PF in program order: Letusassumethefirstcaseaswellasi =i(cid:48)∧v =v(cid:48)andshowitis PF.RτmF −p−o-−lo→c X.Rτm,andm≥mF byuniprocessorconstraints. • AssumeX isaninstanceofsteal.SinceX issuccessful,X contains iβamnnpi(cid:48)nHo−ssetsani1nbc,celeae(,nPβtdhneTo)f1omp.tWhuuessrbht,citniahsc−aper−toe-wb−alo→esrcieintefTgrs2osW.myRmβbim,ktoie−p+ti−or+i-−cl1o→a1c.l.bTWe2twe.Rehbea,nvien+βa1nn,d−s.nt1.(cid:48)n;=th<eirk=ee≤xi(cid:48)isn=ts(cid:48) fmaosululsoctwcyeiinsesglfduretlhaiednssXtaam.nRceet,v_oa,flauanedC.DAthuSeebitnaosrtrtrihueecrtbbioaenrfro,ireherePnbceFetw.tWheeeβntnwXFo.,Rrwebae,d_hsaafnvrdoe:mthet andβk−1 =iandβk =i+1(asnotedabove,takeasawholedoesnot WτmF −→rf PF.RτmF −s−y→nc WβnF −p−o-−lo→c Wβn−→rf X.Rβn−s−y→nc Rτm increasethevalueofb).Wegetthefollowinggraph: Rb,i−p→o P.Wξiu−s−y→nc Wb,βk=i+1−p−o-−lo→c T2.Rβn(cid:48)−p→o Ra,_−a−d→dr Rξiv(cid:48) FthreonmfoLlleomwmsafro3m(iiL),emwmeah3av(ei)Wthτamt FW−τp−pm-−sF→at−pX−p-−.s→RatβRnτ−sm−y→n.cTRoτtaml.orI-t LLeemmmmaa63((i)i)thyaiteluds≤Pv.W(cid:48)aξniud−fpr−po-−ms→atLRema,m_−aa−d8→drthRatξviv(cid:48)<. Iut.thImenpofosslliobwles. from dmeFr ,oWn τCkA−→cSo iWnsτtrmucFtio∧nsWanτdk(cid:54)−→LrfemRmτma.2Th(eiir)efgourea,ramnte≥emthFat.∀k < Lemma12. GivenX andY distinctsuccessfultrailinginstances Lemma10. GivenT asuccessfulinstanceof takeandS asuc- cessfulinstanceof steal, of takeorsteal,then: ∀Rξiv ∈X,∀Rξiv(cid:48)(cid:48) ∈Y,i(cid:54)=i(cid:48). ∀Rξv ∈T,∀Rξv(cid:48) ∈S,i(cid:54)=i(cid:48)∨v(cid:54)=v(cid:48) Proof. 
Assumei=i(cid:48).AccordingtoTheorem1,v(cid:54)=v(cid:48),hencethereexist i i(cid:48) twodistinctsignificantwritesWξv andWξv(cid:48).Withoutlossofgenerality, PTw−rs.i−yotRh→noβcfβ.nRnW−βp→−oen(cid:48)hR1a−ca−=vtr,−el_-−iit−sph→y−aoe→nncfdWoRτlmblao,,w(cid:48)β_i=n−pn→og−i(cid:48)Re.1xξe−τvscm−y(cid:48)u→n(cid:48)tci−oc−Rtnr−τl-g−misry−a→np−p→chosR:Rτmξβv(cid:48)n−p−−o-1−at−−po→→om ·W··t,−p→τomS(cid:48).+Rτ1m(cid:48) lwdeetrri.uteFs,uaarsntshdueWmbrmeeξfoiovv(cid:48)rr(cid:48)ee<−,s−Pyt→nhvFce(cid:48)r.;PeWPFisβF.anW.WFcβu.βmnSnFuiFnla−cpit−eooiv-−cXleo→ccubrraWesraradbifes,tir_e(cid:48)fri−→rrnofbmpoYuthP.sRhFwbar.,fWi_ttee−d→rsβp,eniaRFncξh,pivwrsoieggnrhaiafimvceao:nr-t Letusassumei=i(cid:48)∧v=v(cid:48).Thenτm(cid:48) =i(cid:48) =i=βn−1.ForS HencewehaveWξv(cid:48)−p−p-−s→at Y.RξvfromLemma3(i)andLemma2(iii).It tosucceed,wemusthaveτm(cid:48) <βn(cid:48).Hence,βn≤βn(cid:48). thenfollowsfromLie(cid:48)mma6thatv(cid:48)i≤v;thus,v<v(cid:48)≤v.Impossible. Also,forT tosucceed,wemusthaveτm<βn.Twocases: • If βn = τm + 1, then a successful CAS occurs in T. Moreover, Corollary3. Thecombinednumberofsuccessfultrailinginstances βn =τm+1impliesτm(cid:48) +1=βn =τm+1,henceτm(cid:48) =τm. of takeandstealislessthanorequaltoβ −τ . Impossible,sincetismonotonicallyincreasingandSmustalsocontain nF mF asuccessfulCASwiththesamevalueoft. Proof. LetXbeasuccessfultrailinginstanceoftakeorsteal,andRβn ∈ • Itofnβinca>llyτimncr+ea1s,etsh,etnhenroeCmAuSstobcectuwrsoinwTriteasndAm.W(cid:48)τ>mm−→co.SBin.cWeτtmm(cid:48)osn.ot.- X(froamndLRemτmma∈11X)..HWeneckenτomw≥thaτtmnF≥. nF (bydefinition)andm ≥ mF A.Wτm−→rf T.Rτm−→fr B.Wτm(cid:48)−→rf S.Rτm(cid:48)−s−y→nc Rβn(cid:48) Furthermore,atakeoperationalwayscontainsonedecrementingwrite tob(byone),whichmaybefollowedbyoneincrementingwritetob(by IfS.Rβn(cid:48)B−→fr.WT.τWm(cid:48)b−→,rfβnS.−Rτ1m,(cid:48)th−se−yn→ncwReβhnav(cid:48)e−→f:r T.Wb,βn−1 one)T.hHeerenfcoeren,≥XncFanimonpllyiesreβand≤atβannFin.dex i, s.t. τmF ≤ i < βnF. 
$\wedge\; T.W\,b,\beta_n{-}1 \xrightarrow{\text{sync}} R\tau_m \xrightarrow{\text{fr}} B.W\tau_{m'}$

Impossible according to Lemma 3 (iii). Therefore $W\beta_{n'}$, the source of $S.R\beta_{n'}$, must come before $W\beta_{n+1}$ (in coherence order, hence in program order as both occur in the same thread). Consequently, $(\beta_n)$ must increase from $\beta_n - 1 = i$ to $\beta_{n'}$ between $n+1$ and $n'$. Since $T$ does not increment the value of $b$ (execution without CAS), there must be an instance $P$ of push that writes $P.W\beta_k \xrightarrow{\text{po}} W\beta_{n'} \xrightarrow{\text{rf}} S.R\beta_{n'}$, s.t. $n < k \le n'$ and $\beta_{k-1} = i$ and $\beta_k = i+1$.

We get the following execution graph:

$P.W\xi_i^u \xrightarrow{\text{sync}} W\,b,i{+}1 \xrightarrow{\text{po}} W\beta_{n'} \xrightarrow{\text{rf}} S.R\beta_{n'} \xrightarrow{\text{ctrl-isync}} R\xi_i^{v'}$

Hence we have $W\xi_i^u \xrightarrow{\text{ppo-sat}} R\xi_i^{v'}$ from Lemma 3 (i) and Lemma 2 (iii).

Finally, it follows from Lemma 6 that $u \le v'$, and from Lemma 8 that $v < u \le v'$. We have a contradiction.

Theorem 1 follows directly from Lemmas 9, 10 and 7.

Lemma 12 tells us that there can be no more than $\beta_{n_F} - \tau_{m_F}$ such $X$.

Lemma 13. There is a finite number of successful (trailing or non-trailing) instances of take or steal.

Proof. It follows from Corollary 3 that there is a finite number of successful trailing instances of take or steal.
Furthermore, there must be a finite number of non-trailing take operations, which come before $P_F$ in program order.
Lastly, there is a finite number of push operations, thus $(\beta_n)$ has a maximum, $\beta_{max}$. Since two successful steal operations must read different values of $t$ less than some value of $b$, there can be no more than $\beta_{max}$ successful instances of steal.
Hence the finite number of successful instances of take or steal.

Lemma 14. In each thread, there exists $X$ a failed instance of take or steal s.t. $\forall R\beta_n \in X, \forall R\tau_m \in X, \beta_n \le \tau_m$. Furthermore, each thread makes no more than $1 + m_F + \beta_{n_F} - \tau_{m_F}$ attempts at take or steal that result in a failed CAS instruction.²

Proof. It follows from Lemma 13 that there is a finite number of successful instances, hence a finite number per thread. Thus, there must exist a failed instance of take or steal.
A failure can occur either because the deque is empty ($\beta_n \le \tau_m$) or because of a failed CAS instruction. Suppose there is no $X$ where $\beta_n \le \tau_m$; then all failures must be due to a failed CAS instruction. A failed CAS occurs if the two values of $t$ read during the instance $X$ differ. Let $Y_1$ and $Y_2$ be two such failed instances executing in a same thread; let us assume that $Y_2$ follows $Y_1$ in program order, $n_1 \neq n_1'$ and $n_2 \neq n_2'$:

$Y_1.R\tau_{n_1} \xrightarrow{\text{po-loc}} R\tau_{n_1'} \xrightarrow{\text{po-loc}} Y_2.R\tau_{n_2} \xrightarrow{\text{po-loc}} R\tau_{n_2'}$

There exists a write $W\tau_{n_1'} \xrightarrow{\text{rf}} R\tau_{n_1'} \xrightarrow{\text{po-loc}} Y_2.R\tau_{n_2}$. Due to Lemma 2 (i), we have $W\tau_{n_1'} \xrightarrow{\text{ppo-sat}} Y_2.R\tau_{n_2}$, and, as in the proof of Lemma 11, we deduce that $n_1' \le n_2$.
Since $n_1 \neq n_1' \wedge n_2 \neq n_2'$, and $t$ is monotonically increasing, it must be that $n_1 < n_1' \le n_2 < n_2'$. Hence successive CAS-failing instances in a same thread read increasing values of $t$. It follows from Corollary 3 that $t$ takes no more than $1 + m_F + \beta_{n_F} - \tau_{m_F}$ different values.
Therefore, there can be no more than $1 + m_F + \beta_{n_F} - \tau_{m_F}$ CAS-failing instances of take or steal per thread. Since there is also a finite number of successful such instances, any further take or steal operations must return empty, and the thread reaches its stationary point.

² Hence a thread eventually reaches a stationary state where $b = t$; it should be noted that the model does not guarantee progress; it is legal for a thread to end up looping on a non-final state where $b = t$ but $b \neq \beta_{n_F}$.

Corollary 4. The combined number of successful (trailing or not) instances of take and steal is equal to the number of push.

Proof. A successful instance of take either decreases the value of $b$ by one or increases the value of $t$ by one; a successful instance of steal increases the value of $t$ by one. An instance of push increases the value of $b$ by one.
It follows from the previous lemma that the worker thread reaches a stationary point where $b = t$. Clearly, this cannot occur before all push operations and all successful instances of take have occurred.
Since $b = t$ at the stationary point and all increases to $b$ precede, the sum of increases to $t$ and decreases to $b$ (the combined number of successful instances of take and steal) must be at least equal to the number of increases to $b$ (the number of push operations). It is exactly equal, as otherwise there would be more significant reads than significant writes, which is impossible according to Theorem 1.

One may finally prove Theorem 2. On the one hand, Corollary 4 tells us that the number of significant reads (from a successful instance of take or steal) is equal to the number of significant writes (from an instance of push). On the other hand, Theorem 1 states that significant reads uniquely map to significant writes. By injectivity, there exists a unique significant read for each significant write.

5. On the C11 implementation

The sequentially consistent implementation is a direct translation of the original algorithm using C11 seq_cst atomic variables for all shared accesses. It is obtained from the code in Figure 1 by replacing all memory order constants with seq_cst; doing so makes fences unnecessary, hence they should be removed.

The optimized C11 implementation improves upon the previous version by replacing sequential consistency with release-acquire pairs where appropriate. It establishes happens-before relations between reads and writes, as required by the proof. Unfortunately, without relying on seq_cst, using only release, acquire and consume operations, we were unable to reproduce the required memory ordering constraints needed on the POWER and ARMv7 architectures while adhering to C11 semantics.

Although designed for ARMv7/POWER, most of the arguments developed in the proof informally translate to the rules of C11 in a straightforward fashion. In all cases that do not involve cumulativity, the $\xrightarrow{\text{ppo-sat}}$ relation (defined in 4.1) combined with dependences, which form the core of the ARMv7/POWER proof, may be replaced with analogous properties pertaining to the C11 happens-before relation combined with release-acquire semantics. The one notable difference between the two models lies in the absence of cumulativity in the design of the C11 abstract machine: neither C11 fences nor C11 atomic accesses guarantee cumulativity. A similar effect can be achieved by chaining alternating release-acquire writes and reads, which form a happens-before path. But this device does not work in situations where propagation needs to be asserted between two reads, rather than a read followed by a write.³ This situation occurs in the steal operation. Informally (see Lemma 10 for the formal description), it must be that two concurrent steal and take do not read "old" values of both bottom and top, where "old" could be defined as "older than the value known to the other party in coherence order". The presence of the two cumulative barriers in steal and take on ARMv7 guarantees such a condition:
• if the take barrier is ordered before the barrier in steal, then the program-order-previous write to bottom will be propagated to the instance of steal;
• conversely, if the steal barrier is ordered before the barrier in take, then the value read by the program-order-previous read from top will be propagated to the instance of take.

In the second case, it is important to remark that the write that produced the value read in steal might belong to another thread, and thus not be sequenced before the barrier. In the absence of cumulativity, it need not be propagated to the instance of take.

To enforce this particular case of cumulativity in C11, we rely on the properties of sequential consistency. By making all writes (actually, CAS operations) to top sequentially consistent, we ensure that there is a total ordering between the two fences (in take and steal) and the write that produced the value of top read in the instance of steal. Furthermore, if that read uses acquire semantics, then there is a happens-before relation between it and the steal barrier. Hence, the write must come before said barrier in sequential consistency total order. Then, either the barrier in steal is ordered before the barrier in take, or the other way around:
• if the steal barrier is ordered before the barrier in take, then it follows from seq_cst barrier semantics that the value of top read by take cannot be older than the one read in steal;⁴
• conversely, if the take barrier is ordered before the one in steal, then the value of bottom read by steal cannot be older than the one written in take.⁵

³ C11 defines a happens-before relation, which does not fully encapsulate the notion of cumulativity. The only inter-thread edges in happens-before come from write-read pairs with release-acquire semantics (see [6] 5.1.2.4p11 and p16). In the absence of a write instruction, no fence or other operation can propagate accumulated information to another thread; in other words, it is not possible to establish a happens-before path between two reads in different threads without an intervening write. Hence the reliance on seq_cst primitives, enforcing a sequentially consistent total order.
⁴ See [6] 7.17.3p9.
⁵ See [6] 7.17.3p11.

6. Experimental results

We present experimental results on three current and widely used architectures: (1) a Tegra 3 ARMv7 processor rev 9 (v7l) with 4 cores at 1.3GHz and 1GB of RAM; (2) an Intel Core i7-2720QM machine with 4 cores (hyper-threading disabled) at 2.2GHz and 4GB of RAM; and (3) a dual-socket AMD Opteron Magny-Cours 6164HE machine with 2×12 cores at 1.7GHz and 16GB of RAM.

All tests were compiled with GCC 4.7.0, the first release of GCC to introduce built-in support for C11 atomics.

6.1 Synthetic benchmarks

We designed a synthetic benchmark to simulate the depth-first traversal of a balanced tree (with breadth b and depth d) of empty tasks by a main worker thread, reproducing the prototypical execution of a Cilk program. One or more thieves attempt to steal these tasks. For robustness and predictability, the worker always creates and pushes the same number of tasks in the deque, following the depth-first pattern, regardless of whether a specific continuation has been stolen by another thread (it is simply recorded as stolen, but subsequent tasks spawn normally and locally). The thieves perform steal actions at a configurable rate, and discard stolen tasks.

We have experimented with two different methods of steal distribution, the goal being to uniformly spread the contention over the entire life of the worker thread. The first method is based on the CPU clock of the core dedicated to each thief; with this technique, the clock is regularly sampled and the appropriate number of steal operations is performed accordingly. The second method relies on a random number generator, called in a busy loop, which allows steal operations with a set probability. While the clock-based approach produces more reliable results, it can only be used if a low-overhead CPU clock is available from user space, which is unfortunately not the case on our ARM-based system.⁶ Conversely, the second technique suffers from imprecision when targeting smaller ranges of frequencies, which is necessary on faster processors or when the number of cores increases.⁷ Hence, the former is used on x86 and the latter on ARMv7, with appropriate empirical tuning to gather results over a common representative range of steal throughput.

We selected two workloads: a reasonably broad tree (b = 3; d = 15) and, as a special case, a degenerate comb-shaped tree (b = 1; d = 10^7). The former is meant to reproduce normal contention with steal operations alongside both push and take, while the latter illustrates a case of contention between take and steal only.

We measure the time taken by the worker thread to complete the specified number of task creations and consumptions. This in turn serves to compute the push/take throughput, i.e., the combined number of push and take operations completed per unit of time, as well as the effective steal throughput, defined as follows: the test protocol strives to perform a number of steals over time, at a fixed, nominal steal throughput; the effective throughput is the real throughput as could be observed after the experiment, i.e., how many steals were actually performed during the lifetime of the worker thread. These metrics provide a measure of the efficiency of the algorithm on its critical path at various levels of contention.

In order to assess the impact of the added barriers on the different architectures, raw throughput values have been normalized by the near-ideal throughput on the same workload (see Table 2), obtained on a single thread with no contention and no synchronization: memory barriers are replaced with simple compiler fences, and CAS operations with a simple branch and conditional assignment. These numbers provide a good approximation of the upper bound on the achievable throughput on each machine, though other minor factors can contribute to higher observable values. In particular, it should be noted that counting the throughput in number of operations per second is, by design, a generalization: the execution time of each operation depends on its nature and the exact control path taken; for example, an invocation of take returning empty will be faster than one returning a task.

                         b = 1; d = 10^7      b = 3; d = 15
Core i7 (2 threads)      4.87862 × 10^8       3.60838 × 10^8
Opteron (2 threads)      2.55142 × 10^8       2.04978 × 10^8
Tegra 3 (2 threads)      5.47223 × 10^7       4.12112 × 10^7
Core i7 (4 threads)      4.88018 × 10^8       3.66404 × 10^8
Opteron (24 threads)     2.56214 × 10^8       2.03235 × 10^8
Tegra 3 (4 threads)      5.48473 × 10^7       4.11242 × 10^7

Table 2. Near-ideal throughput (s⁻¹)

In all diagrams, we have included a set of points labeled nofences, for comparison purposes. These correspond to the least common denominator among all the tested barrier placement strategies: only relaxed CAS operations are included, with otherwise no memory barriers. The nofences version violates the semantics of the work-stealing deque. Each of our proposed implementations of the algorithm can be seen as adding a different set of barriers to nofences, making it correct. Hence, results obtained with nofences should be taken as no more than a general baseline, as the complete lack of fences can lead to unexpected behavior. For instance, Figures 3 and 4 show greater throughput values at high contention for the comb-shaped workload on ARM. Those are the results of a long tail of fast empty take operations, an artifact due to the nature of the comb-shaped test and the absence of synchronization between empty take and steal (enabled by the lack of barriers in take).⁸

All the results based on the mixed push and take "tree" workload show a marked improvement of the hand-written native and c11 versions over the naive sequentially consistent translation of the original Chase-Lev algorithm, seqcst. While the relative gain remains stable at all levels of contention on the Core i7 and Tegra 3, it drops sharply on the Opteron, presumably because of the higher number of cores. Nevertheless, for low steal throughput, which more closely models realistic scenarios, the optimized implementations perform at least 1.5 times better than seqcst on both x86 and ARM.

Comparing x86 and ARM, we note that a higher relative throughput is achieved on ARM (peak at above 85%) than on x86 (peak at above 50%), indicating that the first serializing instruction introduced in the code is very costly on x86, especially if it is added to the critical path (as is the case in native, c11 and seqcst, but not in nofences). This could suggest either the stronger guarantees of the x86 memory model (a full memory fence is required to linearize history in order to maintain total store order [13]) or aggressive local optimizations for single-thread execution without communication.

From these observations, we can postulate that advanced ARM architectures such as the Tegra 3 benefit the most from a well-written concurrent program that takes full advantage of the flexibility allowed by their memory model, and conversely struggle relatively more with literal interpretations of algorithms designed with stricter, simpler hypotheses in mind.

⁶ The ARMv7 C15 cycle counter register can only be queried if first enabled from kernel mode, and is delegated to a monitoring co-processor, with unclear consequences for the bus, caches, and memory model as a whole.
⁷ On higher end processors with multiple cores acting as thieves, higher steal probabilities can yield many times more steal attempts than there are tasks created over a set period.
⁸ In the case where the deque is empty, neither take nor steal needs to execute a CAS instruction; furthermore, in the absence of barriers, the ARMv7 memory model does not require successive decrements and increments of bottom in take to propagate to the thieves.

[Figure 3. Synthetic single-thief benchmarks. Six panels: b=1, d=10000000 and b=3, d=15, each on ARM (Tegra 3), x86 (Core i7) and x86 (Opteron), 2 threads. x-axis: effective steal throughput (s⁻¹); y-axis: normalized push/take throughput; series: seqcst, c11, native, nofences.]

[Figure 4. Synthetic multi-thief benchmarks. Six panels: b=1, d=10000000 and b=3, d=15, each on ARM (Tegra 3) with 4 threads, x86 (Core i7) with 4 threads and x86 (Opteron) with 24 threads. x-axis: effective steal throughput (s⁻¹); y-axis: normalized push/take throughput; series: seqcst, c11, native, nofences.]

6.2 Task-parallel benchmarks

We further experiment on common task-parallel benchmarks, mostly extracted from the Cilk benchmark suite,⁹ to evaluate the impact of the memory barrier optimization on realistic workloads and load-balancing scenarios.

⁹ http://supertech.csail.mit.edu/cilk

[Figure 5. Task-parallel benchmark speedups against the C11 sequentially consistent baseline. Three panels: ARM (Tegra 3) with 4 threads, x86 (Core i7) with 4 threads, x86 (Opteron) with 24 threads. x-axis: Fibonacci, FFT-1D, Matmul, Strassen, Knapsack, Seidel; y-axis: speedup vs. Seq-Cst; series: seqcst, c11, native, nofences.]

Fibonacci is the tree-recursive computation of the 35th Fibonacci number; it illustrates the raw cost of the scheduling algorithm as each task only performs a single addition.
FFT-1D computes the Cooley-Tukey fast Fourier transform over a vector of 2^20 elements.
Matmul is the blocked matrix multiplication, of size 256×256 on the Tegra 3 and Core i7 platforms, and of size 384×384 on Opteron to ensure a sufficient computation time.
Strassen is an optimized matrix multiplication algorithm, running on matrices of size 512×512 on the Tegra 3 and Core i7 platforms, and of size 2048×2048 on Opteron.
Knapsack is the usual resource allocation problem. A set of objects, each with a given weight and value, must be picked from
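
As an illustration of the sequentially consistent translation described in Section 5, the following is a minimal sketch rather than the paper's code. It assumes a fixed-capacity buffer (the deque studied in the paper resizes its array), plain int tasks, and signed indices; the names deque_t, deque_push, deque_take, deque_steal, CAP, EMPTY and ABORT are hypothetical. Every shared access uses the default seq_cst ordering of <stdatomic.h>, so no explicit fence appears.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define CAP   64     /* assumed fixed capacity; this sketch never resizes */
#define EMPTY (-1)   /* assumed sentinel: the deque was observed empty */
#define ABORT (-2)   /* assumed sentinel: a steal lost the CAS race */

typedef struct {
    atomic_long top;      /* next index to steal; only ever incremented */
    atomic_long bottom;   /* next index to push; written by the worker only */
    atomic_int  buf[CAP]; /* task slots, indexed modulo CAP */
} deque_t;

static void deque_init(deque_t *q) {
    atomic_init(&q->top, 0);
    atomic_init(&q->bottom, 0);
    for (int i = 0; i < CAP; i++) atomic_init(&q->buf[i], 0);
}

/* Worker only. Returns false when the fixed buffer is full. */
static bool deque_push(deque_t *q, int task) {
    long b = atomic_load(&q->bottom);
    long t = atomic_load(&q->top);
    if (b - t >= CAP) return false;
    atomic_store(&q->buf[b % CAP], task);
    atomic_store(&q->bottom, b + 1);
    return true;
}

/* Worker only. Takes from the bottom end. */
static int deque_take(deque_t *q) {
    long b = atomic_load(&q->bottom) - 1;
    atomic_store(&q->bottom, b);          /* first decrement bottom */
    long t = atomic_load(&q->top);
    if (t > b) {                          /* empty: restore bottom */
        atomic_store(&q->bottom, b + 1);
        return EMPTY;
    }
    int task = atomic_load(&q->buf[b % CAP]);
    if (t == b) {                         /* last element: race with thieves */
        if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
            task = EMPTY;                 /* a thief won the race */
        atomic_store(&q->bottom, b + 1);
    }
    return task;
}

/* Any thief. Steals from the top end. */
static int deque_steal(deque_t *q) {
    long t = atomic_load(&q->top);
    long b = atomic_load(&q->bottom);
    if (t >= b) return EMPTY;
    int task = atomic_load(&q->buf[t % CAP]);
    if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
        return ABORT;                     /* top moved: discard the value read */
    return task;
}
```

Signed indices are a deliberate choice in this sketch: a take on a deque whose bottom is 0 temporarily stores -1, which an unsigned index type would wrap around.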

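
The cumulativity argument of Section 5 can be made concrete with a sketch of where the seq_cst operations would sit in an optimized variant. This is an assumption-laden illustration, not a reproduction of Figure 1: the buffer has a fixed capacity, all names (wsq_t, wsq_push, wsq_take, wsq_steal, WSQ_CAP, WSQ_EMPTY, WSQ_ABORT) are hypothetical, and only three choices come from the text: the CAS on top is seq_cst, take and steal each contain a seq_cst fence, and steal reads top with acquire semantics. The remaining memory orders are plausible guesses.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define WSQ_CAP   64     /* assumed fixed capacity; no resizing */
#define WSQ_EMPTY (-1)   /* assumed sentinel: deque observed empty */
#define WSQ_ABORT (-2)   /* assumed sentinel: steal lost the CAS race */

typedef struct {
    atomic_long top, bottom;
    atomic_int  buf[WSQ_CAP];
} wsq_t;

static void wsq_init(wsq_t *q) {
    atomic_init(&q->top, 0);
    atomic_init(&q->bottom, 0);
    for (int i = 0; i < WSQ_CAP; i++) atomic_init(&q->buf[i], 0);
}

/* Worker only. */
static bool wsq_push(wsq_t *q, int task) {
    long b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    long t = atomic_load_explicit(&q->top, memory_order_acquire);
    if (b - t >= WSQ_CAP) return false;
    atomic_store_explicit(&q->buf[b % WSQ_CAP], task, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* slot is written before bottom */
    atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
    return true;
}

/* Worker only. */
static int wsq_take(wsq_t *q) {
    long b = atomic_load_explicit(&q->bottom, memory_order_relaxed) - 1;
    atomic_store_explicit(&q->bottom, b, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* the take barrier */
    long t = atomic_load_explicit(&q->top, memory_order_relaxed);
    if (t > b) {                                /* empty: restore bottom */
        atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
        return WSQ_EMPTY;
    }
    int task = atomic_load_explicit(&q->buf[b % WSQ_CAP], memory_order_relaxed);
    if (t == b) {                               /* last element: seq_cst CAS on top */
        if (!atomic_compare_exchange_strong_explicit(
                &q->top, &t, t + 1,
                memory_order_seq_cst, memory_order_relaxed))
            task = WSQ_EMPTY;
        atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
    }
    return task;
}

/* Any thief. */
static int wsq_steal(wsq_t *q) {
    long t = atomic_load_explicit(&q->top, memory_order_acquire);
    atomic_thread_fence(memory_order_seq_cst);  /* the steal barrier */
    long b = atomic_load_explicit(&q->bottom, memory_order_acquire);
    if (t >= b) return WSQ_EMPTY;
    int task = atomic_load_explicit(&q->buf[t % WSQ_CAP], memory_order_relaxed);
    if (!atomic_compare_exchange_strong_explicit(
            &q->top, &t, t + 1,
            memory_order_seq_cst, memory_order_relaxed))
        return WSQ_ABORT;
    return task;
}
```

In this sketch, the two seq_cst fences and the seq_cst CAS are the only operations taking part in the sequentially consistent total order that the text relies on; a single-threaded run exercises the control flow but says nothing about the memory-ordering claims.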

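
The metrics of Section 6.1 reduce to simple ratios, sketched below under stated assumptions. The function names and any sample measurements are hypothetical; only the near-ideal constant is taken from Table 2 (Tegra 3, 2 threads, b = 3; d = 15).

```c
/* Near-ideal throughput from Table 2: Tegra 3, 2 threads, b = 3; d = 15 (s^-1). */
static const double TEGRA3_TREE_NEAR_IDEAL = 4.12112e7;

/* Combined number of push and take operations completed per second
   over the lifetime of the worker thread. */
static double push_take_throughput(double pushes, double takes, double seconds) {
    return (pushes + takes) / seconds;
}

/* Steals actually performed per second over the lifetime of the worker thread. */
static double effective_steal_throughput(double steals, double seconds) {
    return steals / seconds;
}

/* Raw throughput expressed as a fraction of the near-ideal throughput. */
static double normalized_throughput(double raw, double near_ideal) {
    return raw / near_ideal;
}
```

For instance, a hypothetical Tegra 3 run completing 2.06056 × 10^7 combined push and take operations in one second would be plotted at a normalized throughput of 0.5.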
