
LAC 2010 - Paper: Work Stealing Scheduler for Automatic Parallelization in Faust PDF

2010·0.28 MB·English


Work Stealing Scheduler for Automatic Parallelization in Faust

Stephane Letz and Yann Orlarey and Dominique Fober
GRAME
9 rue du Garet, BP 1185
69202 Lyon Cedex 01, France
{letz, orlarey, fober}@grame.fr

Abstract

Faust 0.9.10 introduces an alternative to OpenMP based parallel code generation using a Work Stealing Scheduler and explicit management of worker threads. This paper explains the new option and presents some benchmarks. (This work was partially supported by the ANR project ASTREE, ANR-08-CORD-003.)

Keywords

FAUST, Work Stealing Scheduler, Parallelism

1 Introduction

Multi-core and many-core machines are becoming common. The software community faces the challenge of developing adequate tools to take advantage of the new hardware platforms [3] [8]. Various libraries like Intel Threading Building Blocks (http://www.threadingbuildingblocks.org/) or the "extended C like" Cilk [5] can help, but they still require the programmer to precisely define the sequential and parallel sub-parts of the computation and to use the appropriate library calls or building blocks to implement the solution.

In the audio domain, work has been done to parallelize well-known algorithms [4], but very few systems aim to help with automatic parallelization. The problem is usually tricky with stateful systems and imperative approaches, where data has to be shared between several threads and concurrent accesses have to be accurately controlled.

On the contrary, the functional approach used in high-level specification languages generally helps in this area. It basically allows the programmer to define the problem in an abstract way, completely disconnected from implementation details, and lets the compiler and various backends do the hard job.

By exactly controlling when and how state is managed during the computation, the compiler can decide which multi-threading or parallel generation techniques can be used. More sophisticated methods to reorganize the code to fit specific architectures can then be tried and analyzed.

1.1 The Faust approach

Faust [2] is a programming language for real-time signal processing and synthesis designed from scratch to be a compiled language. Being efficiently compiled allows Faust to provide a viable high-level alternative to C/C++ to develop high-performance signal processing applications, libraries or audio plug-ins.

The Faust compiler is built as a stack of code generators. The scalar code generator produces a single computation loop. The vector generator rearranges the C++ code in a way that facilitates the autovectorization job of the C++ compiler. Instead of generating a single sample computation loop, it splits the computation into several simpler loops that communicate by vectors. The result is a Directed Acyclic Graph (DAG) in which each node is a computation loop.

Starting from the DAG of computation loops (called "tasks"), Faust was already able to generate parallel code using OpenMP directives [1]. This model has been completed with a new dynamic scheduler based on a Work Stealing model.

2 Scheduling strategies for parallel code

The role of scheduling is to decide how to organize the computation, in particular how to assign tasks to processors.

2.1 Static versus dynamic scheduling

The scheduling problem exists in two forms: static and dynamic. In static scheduling, usually done at compile time, the characteristics of a parallel program (such as task processing times, communication, data dependencies, and synchronization requirements) are known before program execution. Temporal and spatial assignment of tasks to processors is done by the scheduler to optimize the execution time. In dynamic scheduling, on the contrary, only a few assumptions about the parallel program can be made before execution. Scheduling decisions therefore have to be made on the fly, and tasks are assigned to processors at run time.

2.2 DSP specific context

We can list several specific elements of the problem:

- the graph describes a DSP processor which is going to read n input buffers containing p frames to produce m output buffers containing the same number p of frames
- the graph is completely known in advance, but the precise cost of each task and the communication times (memory access, cache effects...) are not known
- the graph can be computed in a single step (each task processes p frames in a single step), or buffers can be cut into slices and the graph executed several times on sub-slices to fill the output buffers (see the "Pipelining" section)

2.3 OpenMP mode

The OpenMP based parallel code generation is activated by passing the --openMP (or --omp) option to the Faust compiler. It implies the --vec option, as the parallel code generation is built on top of the vector code generation.

In OpenMP mode, a topological sort of the graph is done. Starting from the inputs, tasks are organized as a sequence of groups of parallel tasks. Then appropriate OpenMP directives are added to describe a fork/join model. Each group of tasks is executed in parallel and synchronization points are placed between the groups (Figure 1). This model gives good results with the Intel icc compiler. Unfortunately, until recently the OpenMP implementation in g++ was quite weak and inefficient, and even unusable in a real-time context on OSX.

Figure 1: Tasks graph with forward activations and explicit synchronization points

Moreover, the overall approach of organizing tasks as a sequence of groups of parallel tasks is not optimal, in the sense that synchronization points are required for all threads even when part of the graph could continue to run. In some sense the synchronization strategy is too strict and a data-flow model is more appropriate.

2.3.1 Activating Work Stealing Scheduler mode

The scheduler based parallel code generation is activated by passing the --scheduler (or --sch) option to the Faust compiler. It implies the --vec option, as the parallel code generation is built on top of the vector code generation.

With the --scheduler option, the Faust compiler uses a very different approach. A data-flow model for graph execution is used, to be executed by a dynamic scheduler. Parallel C++ code embedding a Work Stealing Scheduler and using a pool of worker threads is generated. Threads are created when the application starts and all participate in the computation of the graph of tasks. Since the graph topology is known at compilation time, the generated C++ code can precisely describe the execution flow (which tasks have to be executed when a given task is finished...). Ready tasks are activated at the beginning of the compute method and are executed by available threads in the pool. The control flow then circulates in the graph from input tasks to output tasks in the form of activations (ready task index called tasknum) until all output tasks have been executed.

2.4 Work Stealing Scheduler

In a Work Stealing Scheduler [7], idle threads take the initiative: they attempt to steal tasks from other threads. This is possible by having each thread own a Work Stealing Queue (WSQ), a special double-ended queue with a Push operation, a private LIFO Pop operation (which does not need to be multi-thread aware) and a public FIFO Pop operation (which has to be multi-thread aware using lock-free techniques and is thus more costly). The basic idea of work stealing is for each processor to place work in its local WSQ when it is discovered, greedily perform that work from its local WSQ, and steal work from the WSQ of other threads when the local WSQ is empty.

Starting from a ready task, each thread executes it and follows the data dependencies, possibly pushing ready output tasks into its own local WSQ. When no more tasks can be executed on a given computation path, the thread pops a task from its local WSQ using its private LIFO Pop operation. If the WSQ is empty, the thread is allowed to steal tasks from the WSQ of other threads using their public FIFO Pop operation.

The local LIFO Pop operation allows better cache locality, and the FIFO steal Pop allows a larger chunk of work to be taken. The reason for this is that many work stealing workloads are divide-and-conquer in nature: stealing one of the oldest tasks implicitly also steals a (potentially) large subtree of computations that will unfold once that piece of work is stolen and run.

For a given cycle, the whole number of frames is used in one graph execution cycle. So when finished, a given task will (possibly) activate its output tasks only, and activation goes forward.

2.4.1 Code generation

The compiler produces a computeThread method called by all threads:

- tasks are numbered and compiled as a big switch/case block
- a work stealing task, which aims to find the next ready task, is created
- an additional last task is created

2.5 Compilation of different types of nodes

For a given task in the graph, the compiled code depends on the topology of the graph at this stage.

2.5.1 One output and direct link

If the task has only one output and this output has only one input (so basically there is a single link between the two tasks), then a direct activation is compiled: the tasknum of the next task is the tasknum of the output task, and no additional step is required to find the next task to run.

2.5.2 Several outputs

If the task has several outputs, the code has to:

- init tasknum with the WORK_STEALING value
- if there is a direct link between the given task and one of the output tasks, then this output task will be the next to execute. All other tasks with a direct link are pushed on the current thread WSQ (the chosen task here is the first in the task output list; more sophisticated choice heuristics could be tested at this stage)
- otherwise, for output tasks with more than one input, the activation counter is atomically decremented (possibly returning the tasknum of the next task to execute)
- after execution of the activation code, tasknum will contain either the actual value of the next task to run or WORK_STEALING, so that the next ready task is found by running the work stealing task

2.5.3 Work Stealing task

The special work stealing task is executed when the current thread has no next task to run in its computation path and its WSQ is empty. The GetNextTask function aims to find a ready task by possibly stealing a task from any of the other threads. If no task is ready, GetNextTask returns the WORK_STEALING value and the thread loops until it finally finds a task or the whole computation ends.

2.5.4 Last task

Output tasks of the DAG are connected to and activate the special last task, which in turn quits the thread.

    void computeThread(int thread)
    {
        TaskQueue taskqueue;
        int tasknum = -1;
        int count = fFullCount;
        // Init input and output
        FAUSTFLOAT* input0 = &input[0][fIndex];
        FAUSTFLOAT* input1 = &input[1][fIndex];
        FAUSTFLOAT* output0 = &output[0][fIndex];
        // Init graph
        int task_list_size = 2;
        int task_list[2] = {2,3};
        taskqueue.InitTaskList(
            task_list_size, task_list,
            fDynamicNumThreads,
            thread, tasknum);
        // Graph execution code
        while (!fIsFinished) {
            switch (tasknum) {
                case WORK_STEALING: {
                    tasknum = GetNextTask(thread);
                    break;
                }
                case LAST_TASK: {
                    fIsFinished = true;
                    break;
                }
                case 2: {
                    // DSP code
                    PushHead(4);
                    tasknum = 3;
                    break;
                }
                case 3: {
                    // DSP code
                    PushHead(5);
                    tasknum = 4;
                    break;
                }
                case 4: {
                    // DSP code
                    tasknum = ActivateOutput(LAST_TASK);
                    break;
                }
                case 5: {
                    // DSP code
                    tasknum = ActivateOutput(LAST_TASK);
                    break;
                }
            }
        }
    }

Listing 1: example of computeThread method

2.6 Activation at each cycle

The following steps are done at each cycle:

- n-1 threads are resumed and start the work
- the calling thread also participates
- ready tasks (tasks that depend on the input audio buffers or tasks with no inputs) are dispatched between all threads, that is, each thread pushes sub-tasks in its private WSQ and takes one of them to be directly executed
- after having done its part of the work, the main thread waits for all worker threads to finish

This is done in the following compute method called by the master thread:

    void compute(int fullcount,
                 FAUSTFLOAT** input,
                 FAUSTFLOAT** output)
    {
        this->input = input;
        this->output = output;
        for (fIndex = 0; fIndex < fullcount; fIndex += 1024) {
            fFullCount = min(1024, fullcount-fIndex);
            TaskQueue::Init();
            // Initialize end task
            fGraph.InitTask(1,1);
            // Only initialize tasks with inputs
            fGraph.InitTask(3,1);
            // Activate worker threads
            fIsFinished = false;
            fThreadPool.SignalAll(fDynamicNumThreads-1);
            // Master thread participates
            computeThread(0);
            // Wait for end
            while (!fThreadPool.IsFinished()) {}
        }
    }

Listing 2: master thread compute method

2.7 Start time

At start time, n worker threads are created and put in sleep state. Thread scheduling properties and priorities are set according to the calling thread parameters.

3 Pipelining

Some graphs are sequential by nature. Pipelining techniques aim to create and exploit parallelism in this kind of graph to make better use of multi-core machines. The general idea is that for a sequence of two tasks A and B, instead of running A on the entire buffer and then running B, we split the computation into n sub-buffers: A runs on the first sub-buffer, then B can run on the output of A for the first sub-buffer while A runs on the second sub-buffer, and so on.

Figure 2: Very simple graph rewritten to express pipelining, A is a recursive task, B is a non recursive one

Different code generation techniques can be used to obtain the desired result. We choose to rewrite the initial graph of tasks, by duplicating n times each task to be run on a sub part of the buffer.
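The rewriting can be sketched for the two-task graph of Figure 2. This is a minimal sketch, not the Faust compiler's code: the Task structure, the rewriteForPipelining and sliceRange functions and the (task, slice) numbering are hypothetical names introduced for illustration. It assumes that the slices of a recursive task are chained in order (as Section 3.1 details) and that the buffer size is a multiple of n.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of a task in the initial graph.
struct Task {
    std::string name;
    bool recursive;            // frame n depends on previous frames
    std::vector<int> outputs;  // indices of the tasks fed by this one
};

// A sub-task is identified by (task index, slice index).
using SubTask = std::pair<int, int>;
// An activation edge goes from one sub-task to another.
using Edge = std::pair<SubTask, SubTask>;

// Duplicate every task n times and return the activation edges
// of the rewritten graph.
std::vector<Edge> rewriteForPipelining(const std::vector<Task>& graph, int n) {
    std::vector<Edge> edges;
    for (int t = 0; t < static_cast<int>(graph.size()); ++t) {
        for (int s = 0; s < n; ++s) {
            // Assumption: slices of a recursive task must run in order,
            // so slice s activates slice s + 1 of the same task.
            if (graph[t].recursive && s + 1 < n) {
                edges.emplace_back(SubTask(t, s), SubTask(t, s + 1));
            }
            // Slice s of a task feeds slice s of each of its outputs.
            for (int out : graph[t].outputs) {
                edges.emplace_back(SubTask(t, s), SubTask(out, s));
            }
        }
    }
    return edges;
}

// Frame range [begin, end) processed by slice s.
// Assumes frames is a multiple of n.
std::pair<int, int> sliceRange(int frames, int n, int s) {
    int len = frames / n;
    return std::make_pair(s * len, (s + 1) * len);
}
```

With A recursive, B non recursive and n = 4, the sketch yields the chain A0, A1, A2, A3 plus one edge from each Ai to Bi, so B0 can start as soon as A0 is done while A1 is still running.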
This way the previously described code generator can be directly used without major changes.

3.1 Code generation

The pipelining parallel code generation is activated by passing the -sch option as well as the --pipelining option (or -pipe) with a value, the factor by which the initial buffer will be divided.

The previously described code generation strategy has to be completed. The initial tasks graph is rewritten and each task is split into several sub-tasks:

- recursive tasks (those where the computation of frame n depends on the computation of previous frames) are rewritten as a list of n connected sub-tasks (that is, activation has to go from the first sub-task to the second one and so on). Each sub-task is then going to run on buffer-size/n frames. There is no real gain for the recursive task itself since it will still be computed in a sequential manner. But output sub-tasks can be activated more rapidly and possibly executed immediately if they are part of a non recursive task (Figure 2).
- non recursive tasks are rewritten as a list of n non connected sub-tasks, so that all sub-tasks can possibly be activated in parallel and thus run on several cores at the same time. Each sub-task is then going to run on buffer-size/n frames.

This strategy is used for all tasks in the DAG. The rewritten graph then enters the previously described code generator and the complete code is generated.

4 Benchmarks

To compare the performances of these various compilation schemes in a realistic situation, we have modified an existing architecture file (Alsa-GTK on Linux and CoreAudio-GTK on OSX) to continuously measure the duration of the compute method (600 measures per run). We give here the results for three real-life applications: Karplus32, a 32 strings simulator based on the Karplus-Strong algorithm (Figure 3), Sonik Cube (Figure 4), the audio part of the software of an audio-visual installation, and Mixer (Figure 5), a multi-voice mixer with pan and gain.

Instead of time, the results of the tests are expressed in MB/s of processed samples, because memory bandwidth is a strong limiting factor for today's processors (an audio application can never go faster than the memory bandwidth).

[Chart "Karplus32": throughput (MB/s) per run (1 to 4) for the Scalar, Vector, OpenMP and Scheduler modes]
Figure 3: Compared performances of the 4 compilation schemes on karplus32.dsp

As we can see, in both cases the parallelization introduces a real performance gain. The speedup for Karplus32 was x2.1 for OpenMP and x2.08 for the WS scheduler. For Sonik Cube the speedup with 8 threads was x4.17 for OpenMP and x5.29 for the WS scheduler. This is obviously not always the case. Simple applications, with limited demands in terms of computing power, usually tend to perform better in scalar mode. More demanding applications usually benefit from parallelization.

Figure 4: Compared performances of different generation modes from 1 to 8 threads

The efficiency of OpenMP and the WS scheduler are quite comparable, with an advantage to the WS scheduler with complex applications and more CPUs. Please note that not all implementations of OpenMP are equivalent. Unfortunately the GCC 4.4.1 version is still unusable for real-time audio applications. In this case the WS scheduler is the only choice. The efficiency also depends on the vector size used. Vector sizes of 512 or 1024 samples usually give the best results.

The pipelining mode does not show a clear speedup in most cases, but Mixer is an example where this new mode helps.

Figure 5: Compared performances of different generation modes from 1 to 4 threads on OSX with a buffer of 4096 frames

5 Open questions and conclusion

In general, the result can greatly depend on the DSP code to be parallelized, the chosen compiler (icc or g++) and the number of threads activated at run time. Simple DSP effects are still faster in scalar or vectorized mode, and parallelization is of no use. But with more complex code like Sonik Cube the speedup is quite impressive as more threads are added. There are still many open questions that will need more investigation:

- improving dynamic scheduling, since right now there is no special strategy to choose the task to wake up when a given task has several outputs to activate. Some ideas from static scheduling algorithms could be used in this context.
- the current compilation technique for pipelining actually duplicates the code of each task n times. This simple strategy will probably show its limits for more complex effects with a lot of tasks.
- there is no special strategy to deal with thread affinity, thus performance can degrade if tasks are switched between cores.
- the effects of memory bandwidth limits and caches have to be better understood.
- composing several pieces of parallel code without sacrificing performance can be quite tricky. Performance can severely degrade as soon as too many threads are competing for the limited number of physical cores. Using abstraction layers like Lithe [6] or OSX libdispatch (http://developer.apple.com/) could be of interest.

References

[1] Y. Orlarey, D. Fober, and S. Letz. Adding Automatic Parallelization to Faust. Linux Audio Conference 2009.
[2] Yann Orlarey, Dominique Fober, and Stephane Letz. Syntactical and semantical aspects of Faust. Soft Computing, 8(9):623-632, 2004.
[3] J. Ffitch, R. Dobson, and R. Bradford. The Imperative for High-Performance Audio Computing. Linux Audio Conference 2009.
[4] Fons Adriaensen. Design of a Convolution Engine optimised for Reverb. Linux Audio Conference 2006.
[5] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In PPOPP 95: Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207-216, New York, NY, USA, 1995. ACM.
[6] Heidi Pan, Benjamin Hindman, and Krste Asanovic. Lithe: Enabling Efficient Composition of Parallel Libraries. USENIX Workshop on Hot Topics in Parallelism (HotPar'09), Berkeley, CA, March 2009.
[7] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, volume 46, number 5, 1999, New York, NY, USA.
[8] David Wessel et al. Reinventing Audio and Music Computation for Many-Core Processors. International Computer Music Conference 2008.
