ESAIM: PROCEEDINGS AND SURVEYS, Vol. ?, 2017, 1-10
Editors: Will be set by the publisher

A TASK-DRIVEN IMPLEMENTATION OF A SIMPLE NUMERICAL SOLVER FOR HYPERBOLIC CONSERVATION LAWS

Mohamed Essadki1,2, Jonathan Jung3,4, Adam Larat1,5, Milan Pelletier1 and Vincent Perrier4,3

Abstract. This article describes the implementation of an all-in-one numerical procedure within the runtime StarPU. In order to limit the complexity of the method, and for the sake of clarity of the presentation of the non-classical task-driven programming environment, we have limited the numerics to first order in space and time. Results show that the task distribution is efficient if the tasks are numerous and individually large enough, so that the task heap can be saturated by tasks whose computational time covers the task management overhead. Next, we also see that even though they are mostly faster on graphic cards, not all the tasks are suitable for GPUs, which brings forward the importance of the task scheduler. Finally, we look at a more realistic system of conservation laws with an expensive source term, which allows us to conclude and open on future works involving higher local arithmetic intensity, obtained by increasing the order of the numerical method or by enriching the model (increased number of parameters and therefore of equations).

Résumé. This article deals with the implementation of an all-in-one numerical method within the runtime StarPU. In order to limit the complexity of the method, and so as to clarify the presentation of our method within the task-driven programming framework, we have limited the order of the numerical method to one in time and space. The results show that the task distribution is efficient if the tasks are sufficiently numerous and of sufficient size to cover the extra task management time. Next, we observe that even though some tasks are clearly faster on graphic cards, not all of them are eligible for porting to GPU, which highlights the importance of a smart task scheduler. Finally, we consider a system of conservation laws which is more application-oriented, notably including a source term that is expensive in terms of computational time. This allows us to conclude and to open on future work, in which the local arithmetic intensity will be increased by means of a higher-order numerical method or model.

1 Laboratoire EM2C, CNRS, CentraleSupélec, Université Paris Saclay, Grande Voie des Vignes, 92295 Châtenay-Malabry, France
2 IFP Énergies nouvelles, 1-4 avenue de Bois-Préau, 92852 Rueil-Malmaison Cedex, France
3 LMAP UMR 5142, UPPA, CNRS
4 Cagire Team, INRIA Bordeaux Sud-Ouest, Pau, France
5 Fédération de Mathématiques de l'École Centrale Paris, CNRS FR 3487, Grande Voie des Vignes, 92295 Châtenay-Malabry
© EDP Sciences, SMAI 2017

1. Introduction

Computations of turbulent flows have been undergoing two breakthroughs over the last three decades, concerning computational architectures and numerical methods. On the one hand, computing clusters are reaching a size of many hundreds of thousands of cores, which nowadays allows one to go beyond RANS modeling by considering unsteady LES simulations [2,5]. On the other hand, this new framework requires high-order numerical methods, which have also been much improved recently [2,3,11]. Nevertheless, these newly manufactured massively multicore computers rely on heterogeneous architectures.
In order to improve the efficiency of our methods on these architectures, their implementation and algorithmics must be redesigned. The aim of this project is to optimize the implementation of all-in-one numerical methods and models on emerging heterogeneous architectures by using runtime schedulers. In this publication, we focus on StarPU [1].

Runtime schedulers allow one to handle heterogeneous architectures in a transparent way. For this, the algorithm must be recast as a graph of tasks. Computing kernels must then be written in different languages, in order to be executed on the different possible devices. Then, the runtime scheduler is in charge of dynamically balancing the tasks over the different available computing resources (cores of the host, accelerators, etc.). Finite difference methods have already been successfully ported to hybrid architectures [7]. However, we aim at developing numerics on unstructured meshes, for example of the discontinuous Galerkin type, and our methods have a very different memory access pattern, which is one of the bottlenecks of task-driven programming. In order to minimize the complexity of our presentation, gathering both the numerical procedure and its implementation within the non-classical StarPU framework, we have decided to restrict its scope to first order in space and time.

2. A first order finite volume solver for two dimensional conservation laws

2.1. Hyperbolic conservation laws

Let Ω be a subset of R². We consider a two-dimensional system of conservation laws with source terms

    ∂_t W + ∇·F(W) = S(W),    (1)

where t is the time, ∇ = (∂_x, ∂_y)^T with (x,y) the spatial position, W ∈ R^m (m ∈ N) is the vector of conservative variables, F = (F_x, F_y)^T : Ω → R^m is the flux function and S(W) represents the source terms. Later, the number of conserved variables is denoted nVar = m. The unknowns W depend on the spatial position (x,y) and on the time t. We assume that system (1) is hyperbolic, meaning that the directional Jacobian ∂F/∂W · n is diagonalizable in every direction n ∈ S¹, and we denote by λ₁(W) ≤ ··· ≤ λ_m(W) its eigenvalues, the direction n being implicit. We aim at computing a solution of (1) with initial condition

    ∀(x,y) ∈ Ω,  W(x,y,0) = W⁰(x,y).    (2)

2.2. Finite volume discretization

In this section, we present a simple numerical scheme which solves system (1) in space and time: (x,y,t) ∈ Ω × R⁺. If Nx and Ny are two given positive integers, we define the sequences (x_{i+1/2})_{0≤i≤Nx} and (y_{j+1/2})_{0≤j≤Ny} by x_{i+1/2} = a + i∆x, i = 0,...,Nx and y_{j+1/2} = c + j∆y, j = 0,...,Ny, so that Ω is discretized into Nx×Ny cells C_{i,j} = ]x_{i−1/2}, x_{i+1/2}[ × ]y_{j−1/2}, y_{j+1/2}[. We also consider a sequence of times (t_n)_{n∈N} such that t₀ = 0, which defines the time steps (∆t)_n = t_{n+1} − t_n > 0.

2.2.1. Hyperbolic part

Next, for any cell C_{i,j} of the mesh, at any time t_n, we look for an approximation of the average value of the solution W(x,y,t) on the cell:

    W^n_{i,j} = (1/|C_{i,j}|) ∫_{C_{i,j}} W(x,y,t_n) dx dy.    (3)

By integrating system (1) over a cell C_{i,j}, one obtains:

    (d/dt) ∫_{C_{i,j}} W(x,y,t) dx dy + Σ_{Γ⊂∂C_{i,j}} ∫_Γ F(W)·n dS = 0,

where n = (n_x, n_y) is the outward unit normal to the boundary ∂C_{i,j}. Then, the first order explicit finite volume scheme writes [8]:

    W^{n+1,*}_{i,j} − W^n_{i,j} = −((∆t)_n/∆x) [ F̃(W^n_{i,j}, W^n_{i+1,j}, (1,0)) − F̃(W^n_{i−1,j}, W^n_{i,j}, (1,0)) ]
                                 − ((∆t)_n/∆y) [ F̃(W^n_{i,j}, W^n_{i,j+1}, (0,1)) − F̃(W^n_{i,j−1}, W^n_{i,j}, (0,1)) ],    (4)

where F̃(W_L, W_R, n) is a chosen numerical flux.
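To fix ideas, here is a minimal C sketch of update (4) on a uniform periodic grid. It is not taken from the authors' code: the Cell structure anticipates the one introduced in Section 4.1, and numerical_flux stands for any two-point flux F̃, such as the Lax-Friedrichs flux introduced just below.

    #define NVAR 4                      /* m conserved variables per cell */
    typedef struct { double w[NVAR]; } Cell;

    /* User-supplied two-point numerical flux F~(WL, WR, n); 'out' receives the flux. */
    void numerical_flux(const Cell *WL, const Cell *WR,
                        double nx, double ny, Cell *out);

    /* One explicit first order update (4) on an Nx x Ny periodic grid.
       u and unew are row-major arrays of Nx*Ny cells. */
    void fv_step(const Cell *u, Cell *unew, int Nx, int Ny,
                 double dt, double dx, double dy)
    {
        for (int j = 0; j < Ny; j++) {
            for (int i = 0; i < Nx; i++) {
                int ip = (i + 1) % Nx, im = (i + Nx - 1) % Nx;   /* periodic neighbors */
                int jp = (j + 1) % Ny, jm = (j + Ny - 1) % Ny;
                Cell fe, fw, fn, fs;
                numerical_flux(&u[i + j*Nx], &u[ip + j*Nx], 1.0, 0.0, &fe); /* east  */
                numerical_flux(&u[im + j*Nx], &u[i + j*Nx], 1.0, 0.0, &fw); /* west  */
                numerical_flux(&u[i + j*Nx], &u[i + jp*Nx], 0.0, 1.0, &fn); /* north */
                numerical_flux(&u[i + jm*Nx], &u[i + j*Nx], 0.0, 1.0, &fs); /* south */
                for (int k = 0; k < NVAR; k++)
                    unew[i + j*Nx].w[k] = u[i + j*Nx].w[k]
                        - dt/dx * (fe.w[k] - fw.w[k])
                        - dt/dy * (fn.w[k] - fs.w[k]);
            }
        }
    }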
In all our computations, the first order Lax-Friedrichs flux will be used:

    F̃(W_L, W_R, n) = ( F(W_L)·n + F(W_R)·n )/2 − (σ/2) (W_R − W_L),

with σ = max_{p∈⟦1,m⟧} ( |λ_p(W_L)|, |λ_p(W_R)| ).

The numerical scheme (4) is stable provided the following CFL condition is ensured:

    (∆t)_n × max_{(i,j)∈⟦1,Nx⟧×⟦1,Ny⟧} max_{p∈⟦1,m⟧} |λ_p(W^n_{i,j})| ≤ min(∆x, ∆y).    (5)

In practice, a constant time step (∆t)_n = ∆t is set at start, and therefore we just need to check, at the beginning of each iteration, that ∆t satisfies inequality (5).

2.2.2. Source term

Since we consider only a first order numerical scheme in space and time, the source term treatment can simply be added by use of a first order operator splitting technique [6]. Therefore, the additional source terms can possibly be taken into account by local cell-wise integration of the following ODE, which is simply discretized by means of a forward Euler time scheme:

    ∂_t W = S(W)  ⟼  W^{n+1}_{i,j} = W^{n+1,*}_{i,j} + ∆t S(W^{n+1,*}_{i,j}),    (6)

where W^{n+1,*}_{i,j} is the partial update given by the finite volume integration (4). Let us note that though the update (6) is simple and cheap, the computation of S(W^{n+1,*}_{i,j}) may be rather complex and computationally expensive.
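Before moving on, here is a hedged C sketch of the Lax-Friedrichs flux above, reusing the Cell type of the previous sketch. The functions physical_flux and max_wave_speed stand for the model-dependent quantities F(W)·n and max_p |λ_p(W)|; they are our illustrative names, with a concrete Euler instance given in Section 5.

    /* Model-dependent ingredients, to be provided for each system (1). */
    void   physical_flux(const Cell *W, double nx, double ny, Cell *out); /* F(W).n    */
    double max_wave_speed(const Cell *W, double nx, double ny);           /* max |λ_p| */

    /* First order Lax-Friedrichs two-point flux; usable as the numerical_flux
       of the previous sketch. */
    void lax_friedrichs_flux(const Cell *WL, const Cell *WR,
                             double nx, double ny, Cell *out)
    {
        Cell fL, fR;
        physical_flux(WL, nx, ny, &fL);
        physical_flux(WR, nx, ny, &fR);
        double sL = max_wave_speed(WL, nx, ny);
        double sR = max_wave_speed(WR, nx, ny);
        double sigma = sL > sR ? sL : sR;
        for (int k = 0; k < NVAR; k++)
            out->w[k] = 0.5 * (fL.w[k] + fR.w[k]) - 0.5 * sigma * (WR->w[k] - WL->w[k]);
    }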
3. StarPU: a task scheduler

Task scheduling offers a new approach to the implementation of numerical codes, which is particularly suitable when aiming at execution on heterogeneous architectures. The structure of the code is divided into two main layers, which we will call here bones and flesh. First, the code is parsed to build a task dependency graph, which can be viewed as its skeleton. Next, one enters this graph of tasks and starts executing the actions associated with each of these tasks. The code then acts on the memory layout thanks to what can be viewed as its muscles. During the execution, one may have different choices for placing the tasks on the available computational power, and this is where the scheduler plays its role: it aims at distributing the actions so as to minimize the global computational time.

Figure 1. Partitioning of an initial mesh of {Nx=30 × Ny=30} cells into {NPartX=2 × NPartY=2} domains of 225 cells each.

Figure 2. Partition (in blue) with its overlap (in red) in the East, North, West and South directions. The data of each partition is composed of these five handles.

Now, we detail the glossary attached to such a new implementation framework within the context of the StarPU runtime.

3.1. StarPU tasks

In StarPU, a task is made of the following: a codelet, associated kernels and data handles.
• The codelet is the task descriptor. It contains numerous pieces of information about the task, including the number of data handles, the list of available kernels implemented to execute this task (for CPU, GPU, or possibly something else) and the memory access mode for each data handle: "Read" (R), "Write" (W) or "Read and Write" (RW).
• The kernels are the functions that will be executed on a dedicated architecture: CPU, GPU, etc. A task may have the choice between different kernel implementations, and it is the role of the StarPU scheduler to distribute the tasks over the available heterogeneous architectures following a certain criterion (in general, minimizing the global execution time).
• The data handles are the memory managers. Each data handle can be viewed as the encapsulation of a memory allocation, which allows one to keep track of the actions (read and/or write) of the task kernels on the memory layout. In particular, this is what allows the task dependency graph to be built.

3.2. Construction of the task dependency graph

The domain of size Nx×Ny is partitioned into NPartX×NPartY balanced parts of size NxLoc×NyLoc, see Figure 1. One task deals with one partition at a time. In order to treat the interaction between the domains in an asynchronous manner, each domain is supplemented with copies of its border data, see Figure 2.
• Each memory allocation is associated with a data handle: one for each subdomain, four for its border copies, called overlaps in the following.
• The codelet of each task points toward a certain number of these data handles, tagged as "R", "W" or "RW", for Read and/or Write, following the access mode expected by the associated kernel.
• Finally, the code is parsed and the tasks are submitted to StarPU. When a task has to write on a certain data handle's memory, a dependency edge is drawn between this task and the latest tasks having read access to this data handle. Similarly, for each of its read-accessed data handles, a task will depend on the latest tasks having write access to them.
• Once the dependency graph is built, the tasks are distributed, in dependency order, by the scheduler over the available computational resources, following the available kernel implementations.

A sketch of this codelet/handle machinery is given below. Note that building the dependency graph and launching the tasks can be an intricate work, especially since the future of the graph may depend on the computation itself: think, for example, of a number of time steps that depends on the solution.
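As an illustration of these notions, here is a hedged C sketch of how a task can be declared and submitted with StarPU. Only the StarPU constructs (struct starpu_codelet, starpu_task_insert, the STARPU_R/STARPU_RW access modes, the vector accessor macros) belong to the actual API; the kernel body, which performs the update task of Section 4 on a subdomain, and all other names are ours, and the Cell type is the one from the sketches of Section 2.

    #include <starpu.h>

    /* CPU kernel: performs u += RHS on one subdomain (update relation (7) below). */
    static void update_cpu(void *buffers[], void *cl_arg)
    {
        (void)cl_arg;
        Cell  *u   = (Cell *)STARPU_VECTOR_GET_PTR(buffers[0]);
        Cell  *rhs = (Cell *)STARPU_VECTOR_GET_PTR(buffers[1]);
        size_t n   = STARPU_VECTOR_GET_NX(buffers[0]);
        for (size_t c = 0; c < n; c++)
            for (int k = 0; k < NVAR; k++)
                u[c].w[k] += rhs[c].w[k];
    }

    /* The codelet: task descriptor listing kernels, handle count and access modes. */
    static struct starpu_codelet update_cl = {
        .cpu_funcs = { update_cpu },   /* a .cuda_funcs entry could be added here */
        .nbuffers  = 2,
        .modes     = { STARPU_RW, STARPU_R },
    };

    /* Submission: StarPU infers the dependency edges from the access modes. */
    void submit_update(starpu_data_handle_t uLoc, starpu_data_handle_t RHS)
    {
        starpu_task_insert(&update_cl,
                           STARPU_RW, uLoc,
                           STARPU_R,  RHS,
                           0);
    }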
3.3. Task overhead

StarPU task management is not free in terms of CPU time. In fact, each task execution presents a latency called "overhead". We want to know when this additional time can be neglected. A StarPU benchmark allows one to measure the minimal duration of a task that ensures good scalability. We send 1000 short tasks of identical (short) duration, varying between 4 and 4096 µs, and we study the scalability. In Figure 3, we plot the results obtained on a 2 dodeca-core Haswell Intel Xeon E5-2680 and on a Xeon Phi KNL with the eager scheduler. On few cores, scalability is good even if the duration of the tasks is short (< 0.2 ms). On many cores, longer task durations are needed to obtain satisfying scaling (≈ 1 ms). To sum up, if the duration of the tasks is smaller than the millisecond, their overhead cannot be neglected anymore.

Figure 3. Task size overhead with the eager scheduler: scalability (speedup versus number of cores, for task durations from 4 to 4096 µs) obtained on two different machines: (a) 2 dodeca-core Haswell Intel Xeon E5-2680; (b) Xeon Phi KNL.

3.4. Schedulers

The purpose of the scheduler is to launch the tasks when they become ready to be executed. In StarPU, many different scheduling policies are available. In this paper, we consider only the two following:
• the eager scheduler: the default scheduler. It uses a central task queue. As soon as a worker has finished a task, the next task in the queue is submitted to this worker;
• the dmda scheduler: this scheduler takes into account the performance models and the data transfer times of the tasks (see Section 5.3.2). It schedules the tasks so as to minimize their completion time by carefully choosing the optimal executing architecture.

4. Practical Implementation

After a general presentation of the model and the numerical method in Section 2, and of the runtime environment with a particular focus on StarPU in Section 3, we now detail the way we have implemented this numerical resolution of a system of hyperbolic conservation laws within the task-driven framework StarPU.

4.1. Memory allocation

Since the task dependency graph is mainly based on the memory dependencies between successive tasks, memory allocation is a crucial part of the development of our application. This starts with the definition of the structure Cell, which contains the nVar = m conserved variables at a certain point in space and time.

4.1.1. Principal variables

For a first order finite volume resolution of a system of conservation laws, only two main variables are needed:
• u, the computed solution. In each mesh cell, it contains nDoF×nVar floats. However, since at first order nDoF = 1, u contains only one Cell structure of size nVar, corresponding to one local approximation (3) of the solution per mesh cell.
• RHS, the vector of residuals. It has exactly the same memory characteristics as u, since at each time step, the update is done by:

    u += RHS.    (7)

These two vectors of variables are allocated per subdomain. Typically, for each of the NPartX×NPartY subdomains, a vector uLoc of (NxLoc×NyLoc) Cells is created and encapsulated in a StarPU data handle, which allows the memory dependencies to be followed (a registration sketch is given at the end of this subsection).

4.1.2. Overlap additional memory buffers

In order to minimize the communications and the dependencies between subdomains, each subdomain comes with four additional one-dimensional buffers corresponding to the four possible overlap data needed at the subdomain boundaries (East, North, West and South). Two vectors of size NxLoc×sizeof(Cell) (ovlpS and ovlpN) and two vectors of size NyLoc×sizeof(Cell) (ovlpE and ovlpW) are always allocated, whether the overlap needs to be used or not. The reason is that the number of data handles passed to a StarPU kernel needs to be constant. Therefore, when copying the overlap data for example, all the overlap data handles are passed anyway, but nothing is done with those that are not needed, as in the case of a subdomain that is periodic in one direction. Of course, a small communication optimization is lost here when sending a task to another device, since some useless data is transferred. However, we expect the overlap tasks to be of negligible size compared to the tasks acting on the entire subdomains, and to be mainly executed on the host node.
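The following hedged sketch shows how such a per-subdomain allocation can be registered with StarPU. The Subdomain container and all names are our illustrative assumptions; starpu_vector_data_register is the actual StarPU call for encapsulating a contiguous buffer in a data handle.

    #include <stdlib.h>
    #include <starpu.h>

    /* Hypothetical container for one subdomain (names follow the text). */
    typedef struct {
        Cell *uLoc, *RHS, *ovlpE, *ovlpN, *ovlpW, *ovlpS;
        starpu_data_handle_t h_u, h_RHS, h_E, h_N, h_W, h_S;
    } Subdomain;

    /* Encapsulate a buffer of n Cells in a StarPU data handle. */
    static void reg(starpu_data_handle_t *h, Cell *ptr, size_t n)
    {
        starpu_vector_data_register(h, STARPU_MAIN_RAM, (uintptr_t)ptr, n, sizeof(Cell));
    }

    void allocate_subdomain(Subdomain *d, int NxLoc, int NyLoc)
    {
        d->uLoc  = malloc((size_t)NxLoc * NyLoc * sizeof(Cell));
        d->RHS   = malloc((size_t)NxLoc * NyLoc * sizeof(Cell));
        d->ovlpS = malloc((size_t)NxLoc * sizeof(Cell));  /* horizontal overlaps */
        d->ovlpN = malloc((size_t)NxLoc * sizeof(Cell));
        d->ovlpE = malloc((size_t)NyLoc * sizeof(Cell));  /* vertical overlaps   */
        d->ovlpW = malloc((size_t)NyLoc * sizeof(Cell));

        reg(&d->h_u,   d->uLoc,  (size_t)NxLoc * NyLoc);
        reg(&d->h_RHS, d->RHS,   (size_t)NxLoc * NyLoc);
        reg(&d->h_S,   d->ovlpS, NxLoc);
        reg(&d->h_N,   d->ovlpN, NxLoc);
        reg(&d->h_E,   d->ovlpE, NyLoc);
        reg(&d->h_W,   d->ovlpW, NyLoc);
    }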
4.2. Description of the tasks

In paragraph 2.2.1, we have seen that the first order finite volume method essentially consists in 1) checking the computational time step ∆t, 2) computing all the numerical fluxes at all the edges of the mesh, 3) gathering them in the RHS vector and finally 4) updating the numerical solution u. Based on this general decomposition, here is the list of tasks implemented in our application. Between brackets are specified the memory data handles accessed by the task, with their respective access rights given between parentheses:
• initialCondition [uLoc(W)]: fills each subdomain solution with the initial condition.
• checkTimeStep [uLoc(R)]: computes the largest characteristic speed within the subdomain. In order to avoid gathering these time constraints globally, we only check that the fixed time step ∆t initially given by the user locally respects the stability constraint (5).
• copyOverlaps [ovlpE(W), ovlpN(W), ovlpW(W), ovlpS(W), uLoc(R)]: each subdomain fills the corresponding neighbors' overlap data vectors, if needed. This is not the case if the subdomain is periodic in one or both directions, or if one of the boundary states is prescribed (Dirichlet boundary condition). The northernmost (resp. southernmost) line of the subdomain is copied into the ovlpS (resp. ovlpN) data handle of the northern (resp. southern) neighbor. The easternmost (resp. westernmost) column is copied into the ovlpW (resp. ovlpE) data handle of the eastern (resp. western) neighbor. Figure 4 depicts the copyOverlaps task for two domains.

Figure 4. Copy-to-overlaps tasks, example with two domains. The global computational domain is vertically divided into two parts, blue (left) and green (right). The green one is supplemented with a west overlap, whereas the blue one is supplemented with an east overlap. One residual computation includes two communication tasks: copying the rightmost column of the blue domain into the west overlap of the green domain, and copying the leftmost column of the green domain into the east overlap of the blue domain.

• internalResiduals [uLoc(R), RHS(W)]: computes the numerical fluxes at all the edges of the subdomain, except those of the boundaries, where an overlap data vector has to be used:

    RHS_{i,j} = (∆t/∆x) ( F*_{i+1/2,j} − F*_{i−1/2,j} ) + (∆t/∆y) ( F*_{i,j+1/2} − F*_{i,j−1/2} ).    (8)

• borderResiduals [ovlpE(R), ovlpN(R), ovlpW(R), ovlpS(R), uLoc(R), RHS(RW)]: computes the remaining numerical fluxes, namely those between the overlap vector states and the border cells of the subdomain. Therefore, the overlap data vectors passed as arguments here are those belonging to the subdomain.
• update [uLoc(RW), RHS(R)]: updates the numerical solution subdomain-wise, thanks to the update relation (7).

The corresponding task diagram for one time step and two subdomains is given in Figure 5. This graph is an output we can get from StarPU to verify the correct sequence of the tasks. A sketch of the corresponding task submissions for one time step is given below.

Figure 5. Task diagram built by StarPU for one forward Euler step of our first order finite volume task-driven implementation on two subdomains, hence the horizontal symmetry.
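For illustration, here is a hedged sketch of how the tasks of one time step could be submitted, reusing the Subdomain container sketched earlier. The codelets are assumed to be declared as in Section 3; the neighbor lookup functions and the exact handle wiring of copyOverlaps are our assumptions, not the authors' code.

    /* Hypothetical codelets for the tasks of Section 4.2, declared as in Section 3. */
    extern struct starpu_codelet copyOverlaps_cl, internalResiduals_cl,
                                 borderResiduals_cl, update_cl;
    /* Hypothetical neighbor lookup in the NPartX x NPartY partition grid. */
    extern int east(int d), north(int d), west(int d), south(int d);

    /* Submit the tasks of one forward Euler step for every subdomain. */
    void submit_time_step(Subdomain *dom, int nParts)
    {
        for (int d = 0; d < nParts; d++) {
            /* Fill the neighbors' overlaps from this subdomain's solution:
               handle order follows the task signature [ovlpE, ovlpN, ovlpW, ovlpS, uLoc]. */
            starpu_task_insert(&copyOverlaps_cl,
                               STARPU_W, dom[west(d)].h_E,   /* our west column  */
                               STARPU_W, dom[south(d)].h_N,  /* our south line   */
                               STARPU_W, dom[east(d)].h_W,   /* our east column  */
                               STARPU_W, dom[north(d)].h_S,  /* our north line   */
                               STARPU_R, dom[d].h_u, 0);
        }
        for (int d = 0; d < nParts; d++) {
            starpu_task_insert(&internalResiduals_cl,
                               STARPU_R, dom[d].h_u, STARPU_W, dom[d].h_RHS, 0);
            starpu_task_insert(&borderResiduals_cl,
                               STARPU_R, dom[d].h_E, STARPU_R, dom[d].h_N,
                               STARPU_R, dom[d].h_W, STARPU_R, dom[d].h_S,
                               STARPU_R, dom[d].h_u, STARPU_RW, dom[d].h_RHS, 0);
            starpu_task_insert(&update_cl,
                               STARPU_RW, dom[d].h_u, STARPU_R, dom[d].h_RHS, 0);
        }
        /* StarPU chains these tasks automatically from the R/W modes on the handles. */
    }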
4.3. Specific asynchronous treatment of the outputs

The output task consists in writing the global mesh solution u into a disk file at once. In order to avoid a global synchronization, a data handle dataForOutput of size Nx×Ny×nVar is used to temporarily store the data of each subdomain. Since this buffer will be entirely written into an output file by a single task outputTask, it needs to be contiguous in memory and to be encapsulated in a single global descriptor. However, when the solution needs to be written to disk, it is first copied into this dataForOutput buffer, and this should be done in an asynchronous, parallel manner, by the gatherForOutput tasks. This is only possible if the dataForOutput buffer is also described per subdomain, see Figure 6. In the end, we need one global data handle and a set of NPartX×NPartY per-subdomain data handles, both sharing the same memory allocation.

Since the dependency graph is built on the memory dependencies between the tasks, it is very dangerous to use data handles sharing the same memory slots. Fortunately, both descriptors can be declared in StarPU, and a specific procedure allows one to switch from one to the other, so that they are never used concurrently. First, the global buffer is allocated and encapsulated in a global StarPU data handle. Next, it is also described as a vector of NPartX×NPartY data handles, thanks to the command:

    starpu_matrix_data_register(starpu_data_handle_t *data_handle, int DEVICE,
                                void *data_ptr, int LD, int NxLoc, int NyLoc,
                                size_t elemsize /* here sizeof(Cell) */);

Here, the integer LD allows one to encapsulate the grid block per block, since the global vector is accessed through the formula:

    for (int j = 0; j < NyLoc; j++)
        for (int i = 0; i < NxLoc; i++)
            global_data[i + j*LD];

Then, if data_ptr points to the bottom-left corner of the subdomain, the corresponding block is encapsulated in the subdomain data handle data_handle. A usage sketch of this block-wise registration is given at the end of this section.

Eventually, when the solution needs to be written to disk, the subdomain-wise descriptor is activated and each subdomain copies its local solution uLoc to the appropriate dataForOutput sub-data-handle (gatherForOutput task). Next, the description of the dataForOutput buffer is switched to its global descriptor and the single write-to-disk task can be executed (outputToDisk task). Once the writing is finished, the description of the memory buffer is switched back to the partitioned sub-data-handles. Figure 7 illustrates how the simple task diagram for the numerics shown in Figure 5 is updated for outputs. In Figure 8, we plot the Gantt chart obtained for 60 time iterations of the finite volume scheme on two cores with an output every 20 iterations. Green color indicates working time and red color symbolizes sleeping time. This asynchronous treatment allows the outputs to be performed without blocking one core and without inducing sleeping time for the other cores.

Figure 6. Strategy for non-blocking IO. The computing buffers are depicted on the left. The data handles for output dataForOutput are depicted on the right. Each part of the domain is written on a dedicated part of dataForOutput using the task gatherForOutput. Once this copy is done, the computing buffers can be used for further computations. On the output handles' side, once all the copies have been completed, the buffer can be switched to a global buffer thanks to a switch task, which is followed by a starpu_data_invalidate_submit command on the local descriptor of dataForOutput. Once the unpartitioning is done, the output handle can be used as a single data handle, and one task is in charge of writing this buffer into an output file. Once the output to the file is finished, another switch task is in charge of re-partitioning the data: the global descriptor is invalidated, and dataForOutput is able to receive data from the computing buffers again.

Figure 7. Task diagram built by StarPU for one time step, including output of the initial and final solutions. The solution u is first copied subdomain per subdomain into a global buffer dataForOutput, after which this buffer is written into a file by a single task. Since two different descriptors of the buffer are needed, a switch task is launched between both actions.

Figure 8. Gantt chart for 60 time iterations of the finite volume scheme on two cores with an output every 20 iterations.
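Before turning to the results, here is the usage sketch announced above for the block-wise registration of dataForOutput. All names and the row-major layout are our illustrative assumptions; the includes are those of the previous sketches.

    /* Register each NxLoc x NyLoc block of the global dataForOutput buffer
       as its own matrix handle, using the global row length Nx as LD. */
    void register_output_blocks(Cell *dataForOutput, starpu_data_handle_t *out_handle,
                                int Nx, int NxLoc, int NyLoc,
                                int NPartX, int NPartY)
    {
        for (int pj = 0; pj < NPartY; pj++)
            for (int pi = 0; pi < NPartX; pi++) {
                /* Bottom-left corner of block (pi, pj) in the row-major global grid. */
                Cell *corner = dataForOutput + pi * NxLoc + (size_t)pj * NyLoc * Nx;
                starpu_matrix_data_register(&out_handle[pi + pj * NPartX],
                                            STARPU_MAIN_RAM, (uintptr_t)corner,
                                            Nx,            /* LD: global row length */
                                            NxLoc, NyLoc, sizeof(Cell));
            }
    }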
5. Results

In this last section, we study the performance of our task-driven implementation on test cases solving the two-dimensional Euler equations over the periodic domain (R/Z)². More precisely, the state vector and the associated normal flux are given by

    W = (ρ, ρu, ρv, ρE)^t,  F(W)·n = ( ρ u⃗·n, ρu u⃗·n + p n_x, ρv u⃗·n + p n_y, ρ u⃗·n H )^t,    (9)

where E is the total energy, p is the pressure and H = E + p/ρ is the total enthalpy. In the following, we assume that p satisfies a perfect gas pressure law p = (γ−1) ρ (E − (u²+v²)/2). The source term is set to zero: S(W) = 0. This system of equations (9) is known to be hyperbolic, with characteristic velocities given by u⃗·n − c, u⃗·n, u⃗·n and u⃗·n + c, c being the sound velocity, given by c² = γp/ρ.

5.1. Test cases

On this set of equations, we use two classical test cases, namely the linear advection of a density perturbation and an isentropic vortex. Both have the advantage of presenting an analytical solution.

5.1.1. Localized cosine perturbation on ρ

First, we consider an arbitrary perturbation of the density field ρ(x,y), which is simply advected on a constant field (ū, v̄, p̄). We choose:

    r(x,y) = √( (x−0.5)² + (y−0.5)² ),  ρ(x,y) = ρ(r(x,y)) = 1 + (r(x,y) < 0.25) · cos(4π r(x,y)),    (10)

ū = v̄ = 1, and p̄ = 1/γ, so that the sound velocity is everywhere smaller than one.
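As a concrete instance of the model-dependent ingredients assumed in the sketches of Section 2, here is a hedged C version of the Euler normal flux (9) and of its largest characteristic speed; the conservative-to-primitive conversion is ours and uses the perfect gas law above.

    #include <math.h>

    #define GAMMA 1.4   /* illustrative perfect gas constant */

    /* Euler normal flux (9): W = (rho, rho*u, rho*v, rho*E). */
    void physical_flux(const Cell *W, double nx, double ny, Cell *out)
    {
        double rho = W->w[0], u = W->w[1] / rho, v = W->w[2] / rho, E = W->w[3] / rho;
        double p   = (GAMMA - 1.0) * rho * (E - 0.5 * (u*u + v*v));
        double un  = u * nx + v * ny;          /* normal velocity u.n */
        out->w[0] = rho * un;
        out->w[1] = rho * u * un + p * nx;
        out->w[2] = rho * v * un + p * ny;
        out->w[3] = rho * un * (E + p / rho);  /* rho * (u.n) * H, with H = E + p/rho */
    }

    /* Largest characteristic speed |u.n| + c, with c^2 = gamma*p/rho. */
    double max_wave_speed(const Cell *W, double nx, double ny)
    {
        double rho = W->w[0], u = W->w[1] / rho, v = W->w[2] / rho, E = W->w[3] / rho;
        double p   = (GAMMA - 1.0) * rho * (E - 0.5 * (u*u + v*v));
        return fabs(u * nx + v * ny) + sqrt(GAMMA * p / rho);
    }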