Adding Automatic Parallelization to Faust

Yann Orlarey, Stephane Letz and Dominique Fober
Grame, Centre National de Creation Musicale
9 rue du Garet, BP 1185, 69202 Lyon Cedex 01, France
{orlarey,letz,fober}@grame.fr

Abstract

Faust 0.9.9.5 introduces new compilation options to do automatic parallelization of code using OpenMP. This paper explains how the automatic parallelization is done and presents some benchmarks.

Keywords

Faust, OpenMP, Parallelism

1 Introduction

Faust is a programming language for real-time signal processing and synthesis designed from scratch to be a compiled language. Being efficiently compiled allows Faust to provide a viable high-level alternative to C/C++ for developing high-performance signal processing applications, libraries or audio plug-ins.

Until recently the computation code generated by the compiler was organized quite traditionally as a single sample processing loop. This scheme works very well but it doesn't take advantage of multicore architectures. Moreover it can generate code that exceeds the autovectorization capabilities of current C++ compilers.

We have recently extended the compiler with two new schemes: the vector and the parallel schemes. The vector scheme simplifies the autovectorization work of the C++ compiler by splitting the sample processing loop into several simpler loops. The parallel scheme analyzes the dependencies between these loops and adds OpenMP pragmas to indicate those that can be computed in parallel.

These new schemes can produce interesting performance improvements. The goal of the paper is to present these new compilation schemes and to provide some benchmarks comparing their performances. The paper is organized as follows: the next section gives a brief overview of the Faust language, the third section presents the three code generation schemes, and the last section introduces the benchmarks used and the results obtained.

2 Faust overview

In this section we give a brief overview of Faust with some examples of code.

A Faust program describes a signal processor, something that transforms some input signals and produces some output signals. The programming model used combines a functional programming approach with a block-diagram syntax. The functional programming approach provides a natural framework for signal processing. Digital signals are modeled as discrete functions of time, and signal processors as second order functions that operate on them. Moreover Faust's block-diagram composition operators, used to combine signal processors together, fit in the same picture as third order functions.

The Faust compiler translates Faust programs into equivalent C++ programs. It uses several optimization techniques in order to generate the most efficient code. The resulting code can usually compete with, and sometimes outperform, DSP code directly written in C/C++. It is also self-contained and doesn't depend on any DSP runtime library.

Thanks to specific architecture files, a single Faust program can be used to produce code for a variety of platforms and plug-in formats. These architecture files act as wrappers and describe the interactions with the host audio and GUI system. Currently more than 10 architectures are supported (see Table 1) and new ones can be easily added.

alsa-gtk.cpp        ALSA application + GTK
alsa-qt.cpp         ALSA application + QT4
jack-gtk.cpp        JACK application + GTK
jack-qt.cpp         JACK application + QT4
ca-qt.cpp           CoreAudio application + QT4
ladspa.cpp          LADSPA plug-in
max-msp.cpp         MaxMSP plug-in
supercollider.cpp   Supercollider plug-in
vst.cpp             VST plug-in
q.cpp               Q language plug-in

Table 1: Some architecture files available for Faust

In the following subsections we give a short and informal introduction to the language through the example of a simple noise generator. Interested readers can refer to [1] for a more complete description.

2.1 A simple noise generator

A Faust program describes a signal processor by combining primitive operations on signals (like +, -, *, /, sqrt, sin, cos, ...) using an algebra of high level composition operators [2] (see Table 2). You can think of these composition operators as a generalization of mathematical function composition f ◦ g.

f ~ g    recursive composition
f , g    parallel composition
f : g    sequential composition
f <: g   split composition
f :> g   merge composition

Table 2: The five high level block-diagram composition operators used in Faust

A Faust program is organized as a set of definitions with at least one for the keyword process (the equivalent of main in C).

Our noise generator example noise.dsp only involves three very simple definitions. But it also shows some specific aspects of the language:

random = +(12345) ~ *(1103515245);
noise = random/2147483647.0;
process = noise * vslider("noise", 0, 0, 100, 0.1)/100;

The first definition describes a (pseudo) random number generator. Each new random number is computed by multiplying the previous one by 1103515245 and adding 12345 to the result.

The expression +(12345) denotes the operation of adding 12345 to a signal. It is an example of a common technique in functional programming called partial application: the binary operation + is here provided with only one of its arguments. In the same way *(1103515245) denotes the multiplication of a signal by 1103515245.

The two resulting operations are recursively composed using the ~ operator. This operator connects in a feedback loop the output of +(12345) to the input of *(1103515245) (with an implicit 1-sample delay) and the output of *(1103515245) to the input of +(12345).
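To make the feedback loop concrete, here is a small C++ sketch of the signal this definition computes, one sample per call. This is our illustration, not Faust compiler output; the function name and the explicit wraparound arithmetic are assumptions:

#include <cstdint>

// Illustrative sketch (not Faust output): one step of the signal defined by
// random = +(12345) ~ *(1103515245). The variable 'previous' plays the role
// of the implicit 1-sample delay in the feedback loop.
int32_t next_random(int32_t& previous)
{
    // unsigned arithmetic makes the intended 32-bit wraparound explicit
    previous = int32_t(12345u + 1103515245u * uint32_t(previous));
    return previous;
}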
The second definition transforms the random signal into a noise signal by scaling it between -1.0 and +1.0.

Finally, the definition of process adds a simple user interface to control the production of the sound. The noise signal is multiplied by the value delivered by a slider to control its volume.

2.2 Invoking the compiler

The role of the compiler is to translate Faust programs into equivalent C++ programs. The key idea to generate efficient code is not to compile the block diagram itself, but what it computes.

Driven by the semantic rules of the language, the compiler starts by propagating symbolic signals into the block diagram, in order to discover how each output signal can be expressed as a function of the input signals.

These resulting signal expressions are then simplified and normalized, and common subexpressions are factorized. Finally these expressions are translated into a self-contained C++ class that implements all the required computation.

To compile our noise generator example we use the following command:

$ faust noise.dsp

This command generates the following C++ code on the standard output:

class mydsp : public dsp {
  private:
    int iRec0[2];
    float fslider0;
  public:
    static void metadata(Meta* m) { }
    virtual int getNumInputs() { return 0; }
    virtual int getNumOutputs() { return 1; }
    static void classInit(int samplingFreq) { }
    virtual void instanceInit(int samplingFreq) {
        fSamplingFreq = samplingFreq;
        for (int i=0; i<2; i++) iRec0[i] = 0;
        fslider0 = 0.0f;
    }
    virtual void init(int samplingFreq) {
        classInit(samplingFreq);
        instanceInit(samplingFreq);
    }
    virtual void buildUserInterface(UI* interface) {
        interface->openVerticalBox("noise");
        interface->declare(&fslider0, "style", "knob");
        interface->addVerticalSlider("noise", &fslider0,
                                     0.0f, 0.0f, 100.0f, 0.1f);
        interface->closeBox();
    }
    virtual void compute (int count, float** input, float** output) {
        float fSlow0 = (4.656613e-12f * fslider0);
        float* output0 = output[0];
        for (int i=0; i<count; i++) {
            iRec0[0] = 12345 + 1103515245 * iRec0[1];
            output0[i] = fSlow0 * iRec0[0];
            // post processing
            iRec0[1] = iRec0[0];
        }
    }
};

The generated class contains seven methods. Among these methods, getNumInputs() and getNumOutputs() return the number of input and output signals required by our signal processor. init() initializes the internal state of the signal processor. buildUserInterface() can be seen as a list of high level commands, independent of any toolkit, to build the user interface. The method compute() does the actual signal processing. It takes 3 arguments: the number of frames to compute, the addresses of the input buffers and the addresses of the output buffers, and computes the output samples according to the input samples.
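As an illustration of this calling convention, here is a minimal host fragment. This is our sketch, not one of the actual architecture files; the buffer size and variable names are assumptions:

// Hypothetical host fragment (not an actual Faust architecture file):
// driving the generated class for one block of 512 frames. noise.dsp has
// 0 inputs and 1 output, so no input buffers are needed.
mydsp dsp;
dsp.init(44100);                      // sampling rate in Hz

float outbuf[512];
float* outputs[1] = { outbuf };
dsp.compute(512, nullptr, outputs);   // fill outbuf with 512 samples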
2.3 Generating a full application

The faust command accepts several options to control the generated code. Two of them are widely used. The option -o outputfile specifies the output file to be used instead of the standard output. The option -a architecturefile defines the architecture file used to wrap the generated C++ class.

For example the command faust -a jack-qt.cpp -o noise.cpp noise.dsp generates a full JACK application using QT4.4 as a graphic toolkit. Figure 1 is a screenshot of our noise application running.

Figure 1: Screenshot of the noise example generated with the jack-qt.cpp architecture

2.4 Generating a block-diagram

Another interesting option is -svg, which generates one or more SVG graphic files that represent the block-diagram of the program as in Figure 2.

Figure 2: Graphic block-diagram of the noise generator produced with the -svg option

It is interesting to note the difference between the block diagram and the generated C++ code. The block diagram involves one addition, two multiplications and two divisions. The generated C++ program only involves one addition and two multiplications per sample. The compiler was able to optimize the code by factorizing and reorganizing the operations.

As already said, the key idea here is not to compile the block diagram itself, but what it computes.

3 Code generation

In this section we describe how the Faust compiler generates its code. We will first introduce the so-called scalar generation of code, which was the only one until version 0.9.9.5. Then we will present the vector generation of code, where the code is organized into several loops that operate on vectors, and finally the parallel generation of code, where these vector loops are parallelized using OpenMP directives.

3.1 Preliminary steps

Before reaching the stage of the C++ code generation, the Faust compiler has to carry out several steps that we describe briefly here.

3.1.1 Parsing source files

The first one is to recursively parse all the source files involved. Each source file contains a set of definitions and possibly some import directives for other source files. The result of this phase is a list of definitions: [(name1 = definition1), (name2 = definition2), ...]. This list is actually a set, as redefinitions of symbols are not allowed.

3.1.2 Evaluating block-diagrams

Among the names defined there must be process, the analog of main in C/C++. This definition has to be evaluated as Faust allows algorithmic block-diagram definitions. For example the algorithmic definition:

foo(n) = *(10+n);
process = par(i,3, foo(i));

Listing 1: example of algorithmic definition

will be translated into a flat block-diagram description that contains only primitive blocks:

process = (_,10:*),(_,11:*),(_,12:*);

This description is said to be in normal form.
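To see what this normal form computes, here is an illustrative C++ fragment in the style of the generated code. This is our illustration, not actual compiler output:

// Illustration (ours): the flattened diagram is just three independent
// channels, each scaled by its own constant 10+i.
for (int i = 0; i < count; i++) {
    output0[i] = 10 * input0[i];   // foo(0) = *(10)
    output1[i] = 11 * input1[i];   // foo(1) = *(11)
    output2[i] = 12 * input2[i];   // foo(2) = *(12)
}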
3.1.3 Discovering the mathematical equations

Faust doesn't compile a block-diagram directly. It uses a phase of symbolic propagation to first discover its mathematical semantics (what it computes). The principle is to propagate symbolic signals through the inputs of the block-diagram in order to get, at the other end, the mathematical equation of each output signal.

These equations are then normalized so that different block-diagrams, but computing mathematically equivalent signals, result in the same output equations.

Here is a very simple example where the input signal is divided by 2 and then delayed by 10 samples:

process = /(2) : @(10);

This is equivalent to having the input signal first multiplied by 2, then delayed by 7 samples, then divided by 4 and then delayed by 3 samples:

process = *(2) : @(7) : /(4) : @(3);

Both lead to the following signal equation:

Y(t) = 0.5 * X(t - 10)
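The equivalence can be checked by composing the four stages of the second diagram by hand (a short derivation of ours, in the paper's notation):

% Worked check of the equivalence:
%   after *(2) :  2 X(t)
%   after @(7) :  2 X(t-7)
%   after /(4) :  (1/2) X(t-7)
%   after @(3) :  (1/2) X(t-10) = Y(t)
\[
X(t) \xrightarrow{*(2)} 2X(t) \xrightarrow{@(7)} 2X(t-7)
     \xrightarrow{/(4)} \tfrac{1}{2}X(t-7)
     \xrightarrow{@(3)} \tfrac{1}{2}X(t-10) = Y(t)
\]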
Faust applies several rules to simplify and normalize output signal equations. For example, one of these rules says that it is better to multiply a signal by a constant after a delay than before: this gives the compiler more opportunities to share and reuse the same delay line. Another rule says that two consecutive delays can be combined into a single one.

3.1.4 Typing the mathematical equations

The next phase is to assign types to the resulting signal equations. This will not only help the compiler to detect errors but also to generate the most efficient code. Several aspects are considered:

1. the nature of the signal: integer or float;

2. the interval of values of the signal: the minimum and maximum values that a signal can take;

3. the computation time of the signal: the signal can be computed at compilation time, at initialization time or at execution time;

4. the speed of the signal: constant signals are computed only once, low speed user interface signals are computed once for every block of samples, high speed audio signals are computed every sample;

5. the parallelism of the signal: true if the samples of the signal can be computed in parallel, false when the signal has recursive dependencies requiring its samples to be computed sequentially.

3.1.5 Occurrence analysis

The role of this last preparation phase is to analyze in which context each subexpression is used and to discover common subexpressions. If an expensive common subexpression is discovered, an assignment to a cache variable, float fTemp = <common subexpression code>;, is generated, and the cache variable fTemp is used in its enclosing expressions. Otherwise the subexpression code is used inlined.

The occurrence analysis proceeds by a top-down visit of the signal expression. The first time a subexpression is visited, it is annotated with a counter. The next time, the counter will be increased and its visit skipped.

Subexpressions with several occurrences are candidates to be cached in variables. However, in some circumstances expressions with a single occurrence also need to be cached if they occur in a faster context. For example a constant expression occurring in a low speed user interface expression, or a user interface expression occurring in a high speed audio expression, will generally be required to be cached.

It is only after this phase that the generation of the C++ code can start.

3.2 Scalar code generation

The generation of the C++ code is made by populating a klass object (representing a C++ class) with strings representing C++ declarations and lines of code. In scalar mode these lines of code are organized in a single sample computation loop, while they can be split into several loops with the new vector and parallel schemes.

The code generation relies basically on two functions: a translation function [[ ]] that translates a signal expression into a string of C++ code, and a cache function C() that checks if a variable is needed. We don't have the space to go into too much detail, but here is the translation rule for the addition of two signal expressions:

[[E1]] -> S1
[[E2]] -> S2
------------------------------
[[E1 + E2]] -> C("(S1 + S2)")

It says that to compile the addition of two signals, we compile each of these signals and concatenate the resulting strings with a + sign in between. The string obtained is passed to the cache function that will check if the expression is shared or not.

Let's say that the string passed to the cache function C() is (input0[i] + input1[i]). If the expression is shared, the cache function will allocate a fresh variable name fTemp0, add the line of code float fTemp0 = (input0[i] + input1[i]); to the klass object, and return fTemp0 as a string to be used when compiling enclosing expressions. If the expression is not shared it will simply return the string (input0[i] + input1[i]) unmodified.
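A hypothetical sketch of this mechanism in C++ (our code, not the actual Faust sources; the Klass type and its members are invented for illustration):

#include <string>
#include <vector>

// Invented stand-in for the compiler's klass object: a bag of C++ lines.
struct Klass {
    std::vector<std::string> loopLines;  // lines of the sample loop
    int tempCount = 0;
};

// Sketch of the cache function C(): inline the code if the expression is
// not shared, otherwise bind it once to a fresh fTemp variable.
std::string cache(Klass& k, const std::string& code, bool shared)
{
    if (!shared) return code;
    std::string var = "fTemp" + std::to_string(k.tempCount++);
    k.loopLines.push_back("float " + var + " = " + code + ";");
    return var;  // enclosing expressions now refer to the variable
}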
To illustrate this, let's take two simple examples. The first one converts a stereo signal into a mono signal by adding the two input signals:

process = +;

In this case (input0[i] + input1[i]) is not shared and the generated C++ code is the following:

virtual void compute (int count, float** input, float** output)
{
    float* input0 = input[0];
    float* input1 = input[1];
    float* output0 = output[0];
    for (int i=0; i<count; i++) {
        output0[i] = (input0[i] + input1[i]);
    }
}

But when the sum of the two input signals is duplicated on two output signals as in:

process = + <: _,_;

then (input0[i] + input1[i]) will be cached in a temporary variable:

virtual void compute (int count, float** input, float** output)
{
    float* input0 = input[0];
    float* input1 = input[1];
    float* output0 = output[0];
    float* output1 = output[1];
    for (int i=0; i<count; i++) {
        float fTemp0 = (input0[i] + input1[i]);
        output0[i] = fTemp0;
        output1[i] = fTemp0;
    }
}

3.3 Vector code generation

Modern C++ compilers are able to do autovectorization, that is to use SIMD instructions to speed up the code. These instructions can typically operate in parallel on short vectors of 4 single precision floating point numbers, thus leading to a theoretical speedup of x4. Autovectorization of C/C++ programs is a difficult task. Current compilers are very sensitive to the way the code is arranged; in particular, too complex loops can prevent autovectorization. The goal of the new vector code generation is to rearrange the C++ code in a way that facilitates the autovectorization job of the C++ compiler. Instead of generating a single sample computation loop, it splits the computation into several simpler loops that communicate by vectors.

The result is a directed graph in which each node is a computation loop. This graph is stored in the klass object and a topological sort is applied to it before printing the code.

The vector code generation is built on top of the scalar code generation (see Figure 3). Every time an expression needs to be compiled, the compiler checks to see if it needs to be in a separate loop or not. It applies some simple rules for that. Expressions that are shared (and are complex enough) are good candidates to be compiled in a separate loop, as well as recursive expressions and expressions used in delay lines.

Figure 3: Faust's stack of code generators (the parallel code generator adds OpenMP directives on top of the vector code generator, which performs loop separation on top of the scalar code generator)

The vector code generation is activated by passing the --vectorize (or -vec) option to the Faust compiler. Two additional options are available: --vec-size <n> controls the size of the vector (by default 32 samples) and --loop-variant 0/1 gives some additional control on the loops.

To illustrate the difference between scalar code and vector code, let's take the computation of the RMS (Root Mean Square) value of a signal. Here is the Faust code that computes the Root Mean Square of a sliding window of 1000 samples:

// Root Mean Square of n consecutive samples
RMS(n) = square : mean(n) : sqrt ;

// Square of a signal
square(x) = x * x ;

// Mean of n consecutive samples of a signal
// (uses fixpoint to avoid the accumulation of
// rounding errors)
mean(n) = float2fix : integrate(n) : fix2float : /(n);

// Sliding sum of n consecutive samples
integrate(n,x) = x - x@n : +~_ ;

// Conversion between float and fix point
float2fix(x) = int(x*(1<<20));
fix2float(x) = float(x)/(1<<20);

// Root Mean Square of 1000 consecutive samples
process = RMS(1000) ;

The compute() method generated in scalar mode is the following:

virtual void compute (int count, float** input, float** output)
{
    float* input0 = input[0];
    float* output0 = output[0];
    for (int i=0; i<count; i++) {
        float fTemp0 = input0[i];
        int iTemp1 = int(1048576*fTemp0*fTemp0);
        iVec0[IOTA&1023] = iTemp1;
        iRec0[0] = ((iVec0[IOTA&1023] + iRec0[1])
                   - iVec0[(IOTA-1000)&1023]);
        output0[i] = sqrtf(9.536744e-10f * float(iRec0[0]));
        // post processing
        iRec0[1] = iRec0[0];
        IOTA = IOTA+1;
    }
}
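The iVec0[IOTA&1023] accesses above implement the 1000-sample delay line on a power-of-two ring buffer. Here is a small self-contained sketch of the same trick. This is our illustration of the assumed semantics, not compiler output:

#include <cstdint>

// Illustration (ours): a 1000-sample delay line on a 1024-entry ring
// buffer. Because 1024 is a power of two, 'index & 1023' is a cheap modulo.
struct Delay1000 {
    int32_t buf[1024] = {0};
    uint32_t iota = 0;              // free-running write counter

    int32_t process(int32_t x) {
        buf[iota & 1023] = x;                   // write current sample
        int32_t y = buf[(iota - 1000) & 1023];  // sample from 1000 steps ago
        iota++;
        return y;
    }
};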
The -vec option leads to the following reorganization of the code:

virtual void compute (int fullcount, float** input, float** output)
{
    int iRec0_tmp[32+4];
    int* iRec0 = &iRec0_tmp[4];
    for (int index=0; index<fullcount; index+=32)
    {
        int count = min(32, fullcount-index);
        float* input0 = &input[0][index];
        float* output0 = &output[0][index];
        for (int i=0; i<4; i++)
            iRec0_tmp[i] = iRec0_perm[i];
        // SECTION : 1
        for (int i=0; i<count; i++) {
            iYec0[(iYec0_idx+i)&2047] =
                int(1048576*input0[i]*input0[i]);
        }
        // SECTION : 2
        for (int i=0; i<count; i++) {
            iRec0[i] = ((iYec0[(iYec0_idx+i)&2047] + iRec0[i-1])
                       - iYec0[(iYec0_idx+i-1000)&2047]);
        }
        // SECTION : 3
        for (int i=0; i<count; i++) {
            output0[i] = sqrtf((9.536744e-10f * float(iRec0[i])));
        }
        // SECTION : 4
        iYec0_idx = (iYec0_idx+count)&2047;
        for (int i=0; i<4; i++)
            iRec0_perm[i] = iRec0_tmp[count+i];
    }
}

While the second version of the code is more complex, it turns out to be much easier to vectorize efficiently by the C++ compiler. Using Intel icc 11.0, with the exact same compilation options (-O3 -xHost -ftz -fno-alias -fp-model fast=2), the scalar version leads to a throughput performance of 129.144 MB/s, while the vector version achieves 359.548 MB/s, a speedup of x2.8!

3.4 Parallel code generation

The parallel code generation is activated by passing the --openMP (or -omp) option to the Faust compiler. It implies the -vec option, as the parallel code generation is built on top of the vector code generation by inserting appropriate OpenMP directives in the C++ code.

3.4.1 The OpenMP API

Figure 4: OpenMP is based on a fork-join model (a master thread forks a team of parallel threads at a #pragma omp parallel region and joins them at its end)

OpenMP (http://www.openmp.org) is a well established API that is used to explicitly define direct multi-threaded, shared memory parallelism. It is based on a fork-join model of parallelism (see Figure 4). Parallel regions are delimited by using the #pragma omp parallel construct. At the entrance of a parallel region a team of parallel threads is activated. The code within a parallel region is executed by each thread of the parallel team until the end of the region:

#pragma omp parallel
{
    // the code here is executed simultaneously by
    // every thread of the parallel team
    ...
}

In order not to have every thread redundantly doing the exact same work, OpenMP provides specific work-sharing directives. For example #pragma omp sections allows the work to be broken into separate, discrete sections, each section being executed by one thread:

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            // job 1
        }
        #pragma omp section
        {
            // job 2
        }
        ...
    }
    ...
}
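A minimal compilable illustration of this fork-join pattern (our example, not from the paper), to be built with e.g. g++ -fopenmp:

#include <cstdio>
#include <omp.h>

int main()
{
    #pragma omp parallel          // fork: activate a team of threads
    {
        #pragma omp sections      // work-sharing: one thread per section
        {
            #pragma omp section
            { std::printf("job 1 on thread %d\n", omp_get_thread_num()); }
            #pragma omp section
            { std::printf("job 2 on thread %d\n", omp_get_thread_num()); }
        }
    }                             // join: implicit barrier at region end
    return 0;
}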
3.4.2 Adding OpenMP directives

As already said, the parallel code generation is built on top of the vector code generation. The graph of loops produced by the vector code generator is topologically sorted in order to detect the loops that can be computed in parallel. The first set L0 contains the loops that don't depend on any other loops, the set L1 contains the loops that only depend on loops of L0, etc.

As all the loops of a given set Ln can be computed in parallel, the compiler will generate a sections construct with a section for each loop:

#pragma omp sections
{
    #pragma omp section
    for (...) {
        // Loop 1
    }
    #pragma omp section
    for (...) {
        // Loop 2
    }
    ...
}

If a given set contains only one loop, then the compiler checks to see if the loop can be parallelized (no recursive dependencies) or not. If it can be parallelized, it generates:

#pragma omp for
for (...) {
    // Loop code
}

otherwise it generates a single construct so that only one thread will execute the loop:

#pragma omp single
for (...) {
    // Loop code
}

3.4.3 Example of parallel code

To illustrate how Faust uses the OpenMP directives, here is a very simple example: two 1-pole filters in parallel connected to an adder (see Figure 5 for the corresponding block-diagram):

filter(c) = *(1-c) : + ~ *(c);
process = filter(0.9), filter(0.9) : +;

Figure 5: two filters in parallel connected to an adder

The corresponding compute() method obtained using the -omp option is the following:

virtual void compute (int fullcount, float** input, float** output)
{
    float fRec0_tmp[32+4];
    float fRec1_tmp[32+4];
    float* fRec0 = &fRec0_tmp[4];
    float* fRec1 = &fRec1_tmp[4];
    #pragma omp parallel firstprivate(fRec0,fRec1)
    {
        for (int index = 0; index < fullcount; index += 32)
        {
            int count = min(32, fullcount-index);
            float* input0 = &input[0][index];
            float* input1 = &input[1][index];
            float* output0 = &output[0][index];
            #pragma omp single
            {
                for (int i=0; i<4; i++)
                    fRec0_tmp[i] = fRec0_perm[i];
                for (int i=0; i<4; i++)
                    fRec1_tmp[i] = fRec1_perm[i];
            }
            // SECTION : 1
            #pragma omp sections
            {
                #pragma omp section
                for (int i=0; i<count; i++) {
                    fRec0[i] = ((0.1f * input1[i])
                               + (0.9f * fRec0[i-1]));
                }
                #pragma omp section
                for (int i=0; i<count; i++) {
                    fRec1[i] = ((0.1f * input0[i])
                               + (0.9f * fRec1[i-1]));
                }
            }
            // SECTION : 2
            #pragma omp for
            for (int i=0; i<count; i++) {
                output0[i] = (fRec1[i] + fRec0[i]);
            }
            // SECTION : 3
            #pragma omp single
            {
                for (int i=0; i<4; i++)
                    fRec0_perm[i] = fRec0_tmp[count+i];
                for (int i=0; i<4; i++)
                    fRec1_perm[i] = fRec1_tmp[count+i];
            }
        }
    }
}

This code calls for some comments:

1. The parallel construct #pragma omp parallel is the fundamental construct that starts parallel execution. The number of parallel threads is generally the number of CPU cores, but it can be controlled in several ways (see the sketch after this list).

2. Variables external to the parallel region are shared by default. The firstprivate(fRec0,fRec1) clause indicates that each thread should have its private copy of fRec0 and fRec1. The reason is that accessing shared variables requires an indirection and is very inefficient compared to private copies.

3. The top level loop for (int index = 0;...) is executed by all threads simultaneously. The subsequent work-sharing directives inside the loop will indicate how the work must be shared between the threads.

4. Please note that an implied barrier exists at the end of each work-sharing region. All threads must have executed the barrier before any of them can continue.

5. The work-sharing directive #pragma omp single indicates that this first section will be executed by only one thread (any of them).

6. The work-sharing directive #pragma omp sections indicates that each corresponding #pragma omp section, here our two filters, will be executed in parallel.

7. The loop construct #pragma omp for specifies that the iterations of the associated loop will be executed in parallel. The iterations of the loop are distributed across the parallel threads. For example, if we have two threads, the first one can compute the indices between 0 and count/2 and the other between count/2 and count.

8. Finally, #pragma omp single in section 3 indicates that this last section will be executed by only one thread (any of them).
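For instance, the team size mentioned in comment 1 can be set in several standard ways. This sketch is ours, not from the paper:

#include <omp.h>

int main()
{
    // 1. Environment variable, read when the program starts:
    //    $ OMP_NUM_THREADS=4 ./program
    // 2. Library call, before entering the parallel region:
    omp_set_num_threads(4);
    // 3. Clause directly on the construct:
    #pragma omp parallel num_threads(4)
    {
        // ... parallel work ...
    }
    return 0;
}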
4 Benchmarks

To compare the performances of these three types of code generation in a realistic situation, we have implemented a special alsa-gtk-bench.cpp architecture file that measures the duration of the compute() method. Here is a fragment of this architecture file:

while(running) {
    audio.read();
    STARTMESURE
    DSP.compute(audio.buffering(),
                audio.inputSoftChannels(),
                audio.outputSoftChannels());
    STOPMESURE
    audio.write();
    running = mesure <= (KMESURE + KSKIP);
}

The methodology is the following. The duration of the compute() method is measured by reading the TSC (Time Stamp Counter) register. A total of 128+2048 measures are made per run. The first 128 measures are considered a warm-up period and are skipped. The median value of the following 2048 measures is computed. This median value, expressed in processor cycles, is first converted into a duration, and then into a number of bytes produced per second, considering the audio buffer size (in our test 2048) and the number of output channels.

This throughput performance is a good indicator. The memory bandwidth is a strong limiting factor for today's processors, and it has to be shared among the processors. In other words, on an SMP machine a real-time audio program can never go faster than the memory bandwidth. And if a sequential program already utilises all the available memory bandwidth, there is no room for improvement. In this case a parallel version can only perform worse.
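Here is a minimal sketch of reading the TSC on x86 with GCC-style inline assembly, as the STARTMESURE/STOPMESURE macros presumably do. The paper does not show this code; it is our assumption of the mechanism:

#include <cstdint>

// Read the x86 Time Stamp Counter (GCC/ICC inline assembly).
static inline uint64_t rdtsc()
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return (uint64_t(hi) << 32) | lo;
}

// Usage sketch: cycles spent in one compute() call.
// uint64_t t0 = rdtsc();
// DSP.compute(count, inputs, outputs);
// uint64_t cycles = rdtsc() - t0;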
4.1 Machines and compilers used

In order to compare the scalar code generation with the new vector and parallel code generation, we have compiled with Faust 0.9.9.5b2 a series of tests in three different versions. The following commands were used:

- scal : faust -a alsa-gtk-bench.cpp test.dsp -o test.cpp

- vec : faust -a alsa-gtk-bench.cpp -vec -vs 3968 test.dsp -o test.cpp

- par : faust -a alsa-gtk-bench.cpp -omp -vs 3968 test.dsp -o test.cpp

We have also used two different C++ compilers, GNU GCC and Intel ICC:

- GCC version 4.3.2 with options: -O3 -march=native -mfpmath=sse -msse -msse2 -msse3 -ffast-math -ftree-vectorize (-fopenmp added for OpenMP).

- ICC version 11.0.074 with options: -O3 -xHost -ftz -fno-alias -fp-model fast=2 (-openmp added for OpenMP).

All the tests were run on three different machines:

- vaio : a Sony Vaio SZ3VP laptop, with an Intel T7400 dual core processor at 2167 MHz, 2GB of RAM, running an Ubuntu 7.10 distribution with a 2.6.22-15-generic kernel.

- xps : a Dell XPS machine with an Intel Q9300 quad core processor at 2500 MHz, 4GB of RAM, running an Ubuntu 8.10 distribution with a 2.6.22-15-generic kernel.

- macpro : an Apple Macpro with two Intel Xeon X5365 quad core processors at 3000 MHz, 2GB of RAM, running an Ubuntu 8.10 distribution with a 2.6.27-12-generic kernel.

4.2 Benchmark: copy1.dsp

The goal of this first test is to measure the memory bandwidth. We use a very simple Faust program copy1.dsp that simply copies the input signal to the output signal:

process = _;

The results we have obtained are summarized in Figure 6. The horizontal axis corresponds to the three Faust compilation schemes (scalar, vector and parallel) combined with the two C++ compilers (gcc and icc). The vertical axis is the throughput: how many bytes of samples each tested program is able to produce per second (higher values are better).

Figure 6: Copy1.dsp benchmark (throughput in MB/s on the vaio, xps and mac machines)

It is interesting to note how catastrophic the performances of the parallel versions are. The scalar and vector versions are quite similar, with a little advantage for the scalar version. The code generated by icc performs better. The memory bandwidth of the Macpro is disappointing, especially considering that it has to be shared by 8 cores.

How stable are these measures? Figure 7 compares the performances of copy1 (compiled with icc) on the Macpro over 5 different runs. As we can see, the stability is reasonably good.

Figure 7: Stability of measures (copy1 on macpro, icc version; throughput in MB/s of the scal, vec and par schemes over 5 runs)

4.3 Benchmark: freeverb.dsp

The second test is freeverb.dsp, a Faust implementation of the Freeverb (the source can be found in the Faust distribution).

The results are given in Figure 8. Here gcc gives very good results in scalar code and outperforms icc in 2 of the 3 cases. But the performances of gcc are still very poor on vector and parallel code.

Figure 8: Freeverb.dsp benchmark (throughput in MB/s on the vaio, xps and mac machines)

Despite the fact that freeverb has a limited amount of parallelism, icc gives quite convincing results, with a reasonable speedup on vector and parallel code on the Vaio and the XPS machines. It is also interesting to note that on the parallel version the 8 3 GHz cores of the macpro were slower than the 4 2.5 GHz cores of the XPS!

4.4 Benchmark: karplus32.dsp

Karplus32.dsp is a generalized version of the Karplus-Strong algorithm with 32 slightly detuned strings in parallel (the source can be found in the Faust distribution).

Figure 9 gives the results. Again, excellent performances of gcc in scalar mode. Good progression of the performances in vector mode as well as in parallel mode for icc.

Figure 9: Karplus32.dsp benchmark (throughput in MB/s on the vaio, xps and mac machines)

4.5 Benchmark: mixer.dsp

This is the implementation of a simple 8-channel mixer. Each channel has a mute button, a volume control in dB, a vumeter and a stereo pan control. The mixer also has a volume control for the stereo output.

import("music.lib");

smooth(c) = *(1-c) : +~*(c);

vol = *( vslider("fader", 0, -60, 4, 0.1)
       : db2linear : smooth(0.99) );

mute = *(1 - checkbox("mute"));

vumeter(x) = attach(x, env(x) : vbargraph("",0,1))
with {
    env = abs : min(0.99) : max ~ -(1.0/SR);
};

pan = _ <: *(sqrt(1-c)), *(sqrt(c))
with {
    c = (nentry("pan",0,-8,8,1)-8)/-16 : smooth(0.99);
};

voice(v) = vgroup("voice %v", mute : hgroup("", vol : vumeter) : pan);

stereo = hgroup("stereo out", vol, vol);

process = hgroup("mixer", par(i,8,voice(i)) :> stereo);

The results of Figure 10 show a real benefit for the vectorized version, with a speedup exceeding x2 on the 3 machines. There is also a positive impact of the parallelization, even if more limited. As usual, gcc delivers good scalar code but poor results on vectorized and OpenMP code.

Figure 10: Mixer.dsp benchmark (throughput in MB/s on the vaio, xps and mac machines)

4.6 Benchmark: fdelay8.dsp

This test implements an 8-channel fractional delay. Each channel has a volume control in dB as well as a delay control in fractions of samples. The interpolation is based on a fifth-order Lagrange interpolation from Julius Smith's Faust filter library.
