Automatic Multilevel Parallelization Using OpenMP**

Haoqiang Jin, Gabriele Jost, Jerry Yan
NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000, USA
{hjin,gjost,yan}@nas.nasa.gov

Eduard Ayguade, Marc Gonzalez, Xavier Martorell
Centre Europeu de Parallelisme de Barcelona, Computer Architecture Department (UPC)
c/ Jordi Girona 1-3, Modul D6, 08034 Barcelona, Spain
{eduard,marc,xavier}@ac.upc.es

Abstract

In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and the definition of thread groups. We report some results for several benchmark codes and one full application that have been parallelized using our system.

1 Introduction

Parallel architectures are an instrumental tool for the execution of computationally intensive applications. Simple and powerful programming models and environments are required to develop and tune such parallel applications. Current programming models offer either library-based implementations (such as MPI [16]) or extensions to sequential languages (directives and language constructs) that express the available parallelism in the application, such as OpenMP [20].

OpenMP was introduced as an industrial standard for shared-memory programming with directives. Recently, it has gained significant popularity and wide compiler support. However, relevant performance issues must still be addressed which concern programming model design as well as implementation. In addition, extensions to the standard are being proposed and evaluated in order to widen the applicability of OpenMP to a broad class of parallel applications without sacrificing portability and simplicity.

What has not been clearly addressed in OpenMP is the exploitation of multiple levels of parallelism. The lack of compilers that are able to exploit further parallelism inside a parallel region has been the main cause of this problem, and it has favored the practice of combining several programming models to address the scalability of applications that exploit multiple levels of parallelism on a large number of processors. The nesting of parallel constructs in OpenMP is a feature that requires attention in future releases of OpenMP compilers. Some research platforms, such as the OpenMP NanosCompiler [9], have been developed to show the feasibility of exploiting nested parallelism in OpenMP and to serve as testbeds for new extensions in this direction. The OpenMP NanosCompiler accepts Fortran-77 code containing OpenMP directives and generates plain Fortran-77 code with calls to the NthLib thread library [17] (currently implemented for the SGI Origin). In contrast to the SGI MP library, NthLib allows for multilevel parallel execution such that inner parallel constructs are not serialized. The NanosCompiler programming model supports several extensions to the OpenMP standard that allow the user to control the allocation of work to the participating threads. By supporting nested OpenMP directives the NanosCompiler offers a convenient way to exploit multilevel parallelism.

In this study, we have extended the automatic parallelization tool CAPO to allow for the generation of nested OpenMP parallel constructs in order to support multilevel shared-memory parallelization. CAPO automates the insertion of OpenMP directives with nominal user interaction to facilitate parallel processing on shared-memory parallel machines. It is based on CAPTools [11], a semi-automatic parallelization tool for the generation of message-passing codes, developed at the University of Greenwich.

To this point there is little reported experience with shared-memory multilevel parallelism. By being able to generate nested directives automatically in a reasonable amount of time we hope to gain a better understanding of performance issues and of the needs of application programs when it comes to exploiting multilevel parallelism.
The paper is organized as follows: Section 2 summarizes the NanosCompiler extensions to the OpenMP standard. Section 3 discusses the extension of CAPO to generate multilevel parallel codes. Section 4 presents case studies on several benchmark codes and one full application.

* The author is an employee of Computer Sciences Corporation.
** A preliminary version of this paper was presented at the 3rd European Workshop on OpenMP (EWOMP01).

2 The NanosCompiler

OpenMP provides a fork-and-join execution model in which a program begins execution as a single process or thread. This thread executes sequentially until a PARALLEL construct is found. At this time, the thread creates a team of threads and becomes its master thread. All threads execute the statements lexically enclosed by the parallel construct. Work-sharing constructs (DO, SECTIONS and SINGLE) are provided to divide the execution of the enclosed code region among the members of a team. All threads are independent and may synchronize at the end of each work-sharing construct or at specific points (specified by the BARRIER directive). Exclusive execution mode is also possible through the definition of CRITICAL and ORDERED regions. If a thread in a team encounters a new PARALLEL construct, it creates a new team and becomes its master thread. OpenMP v2.0 provides the NUM_THREADS clause to restrict the number of threads that compose a team.
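As a point of reference for the discussion that follows, the fragment below is a minimal hand-written sketch (ours, not from the paper) of this execution model: an outer team is forked, and each thread that encounters the inner PARALLEL construct becomes the master of a new team. Whether the inner region actually executes in parallel depends on the compiler and runtime; as noted above, the SGI MP library serializes it while NthLib does not. The NUM_THREADS values are arbitrary.

      PROGRAM FORKJOIN
      INTEGER OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
      EXTERNAL OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
C     The initial thread forks an outer team of (at most) 4 threads.
!$OMP PARALLEL NUM_THREADS(4)
      PRINT *, 'outer thread', OMP_GET_THREAD_NUM(),
     &         ' of', OMP_GET_NUM_THREADS()
C     Each outer thread that reaches the inner construct becomes the
C     master of its own (possibly serialized) inner team.
!$OMP PARALLEL NUM_THREADS(2)
      PRINT *, '  inner thread', OMP_GET_THREAD_NUM()
!$OMP END PARALLEL
!$OMP END PARALLEL
      END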
The NanosCompiler extension to multilevel parallelization is based on the concept of thread groups. A group of threads is composed of a subset of the total number of threads available in the team to run a parallel construct. In a parallel construct, the programmer may define the number of groups and the composition of each one. When a thread in the current team encounters a PARALLEL construct defining groups, the thread creates a new team and becomes its master thread. The new team is composed of as many threads as there are groups. The rest of the threads are used to support the execution of nested parallel constructs. In other words, the definition of groups establishes an allocation strategy for the inner levels of parallelism. To define groups of threads, the NanosCompiler supports the GROUPS clause extension to the PARALLEL directive.

C$OMP PARALLEL GROUPS (gspec)
      ...
C$OMP END PARALLEL

Different formats for the GROUPS clause argument gspec are allowed [10]. The simplest specifies the number of groups and performs an equal partition of the total number of threads among the groups:

      gspec = ngroups

The argument ngroups specifies the number of groups to be defined. This format assumes that work is well balanced among groups and therefore all of them receive the same number of threads to exploit inner levels of parallelism. At runtime, the composition of each group is determined by equally distributing the available threads among the groups.

      gspec = ngroups, weight

In this case, the user specifies the number of groups (ngroups) and an integer vector (weight) indicating the relative weight of the computation that each group has to perform. From this information and the number of threads available in the team, the threads are allocated to the groups at runtime. The weight vector is allocated by the user and its values are computed from information available within the application itself (for instance iteration space size or computational complexity).

3 The CAPO Parallelization Support Tool

The main goal of developing parallelization support tools is to eliminate as much as possible of the tedious and sometimes error-prone work needed for the manual parallelization of serial applications. With this in mind, CAPO [13] was developed to automate the insertion of OpenMP compiler directives with nominal user interaction. This is achieved largely by use of the very accurate interprocedural analysis from CAPTools [11], and it also benefits from a directive browser that allows the user to examine and refine the directives automatically placed within the code. CAPTools provides a fully interprocedural and value-based dependence analysis engine [14] and has successfully been used to parallelize a number of mesh-based applications for distributed memory machines.

3.1 Single level parallelization

The single loop level parallelism automatically exploited in CAPO can be defined by the following three stages (see [13] for more details of these stages and their implementation). A sketch of the kind of code produced by these stages is given after the list.

1) Identification of parallel loops and parallel regions - this includes a comprehensive breakdown of the different loop types, such as serial, parallel including reductions, and pipelines. The outermost parallel loops are considered for parallelization as long as they provide sufficient granularity. Since the dependence analysis is interprocedural, the parallel regions can be defined as high up in the call tree as possible. This provides an efficient placement of the directives.

2) Optimization of parallel regions and parallel loops - the fork-and-join overhead (associated with starting a parallel region) and the synchronization cost are greatly lowered by reducing the number of parallel regions required. This is achieved by merging together parallel regions where there is no violation of data usage. In addition, the synchronization between successive parallel loops is removed if it can be proved that the loops can correctly execute asynchronously (using the NOWAIT clause).

3) Code transformation and insertion of OpenMP directives - this includes the search for and insertion of possible THREADPRIVATE common blocks. There is also special treatment for private variables in non-threadprivate common blocks. If there is a usage conflict then the routine is cloned and the common block variable is added to the argument list of the cloned routine. Finally, the call graph is traversed to place OpenMP directives within the code. This includes the identification of necessary variable types, such as SHARED, PRIVATE, and REDUCTION.
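The following self-contained fragment is a small hand-written sketch (ours, not actual CAPO output) of the kind of code these stages produce for a hypothetical pair of loops: one merged parallel region around both loops (stage 2), the barrier after the first loop dropped with NOWAIT, and PRIVATE and REDUCTION attributes identified as in stage 3.

      PROGRAM SKETCH1
      INTEGER N, I
      PARAMETER (N = 1000)
      REAL A(N), B(N), S
      DO I = 1, N
         A(I) = REAL(I)
      END DO
      S = 0.0
C     Two loops share one parallel region; the barrier between them is
C     removed (NOWAIT) because, under the default static schedule, each
C     thread reads back only the elements of B it wrote itself.
!$OMP PARALLEL PRIVATE(I)
!$OMP DO
      DO I = 1, N
         B(I) = 2.0*A(I) + 1.0
      END DO
!$OMP END DO NOWAIT
!$OMP DO REDUCTION(+:S)
      DO I = 1, N
         S = S + B(I)
      END DO
!$OMP END DO
!$OMP END PARALLEL
      PRINT *, 'sum = ', S
      END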
3.2 Extension to multilevel parallelization

Although the SGI Origin compiler does not support nested parallelism, the user can exploit parallelism across multiple loop nests in a limited manner. The SGI compiler accepts the NEST clause on the OMP DO directive [18]. The NEST clause requires at least two variables as arguments to identify the indices of subsequent DO-loops. The identified loops must be perfectly nested. No code is allowed between the identified DO statements and the corresponding END DO statements. The NEST clause on the OMP DO directive informs the compiler that the entire set of iterations across the identified loops can be executed in parallel. The compiler can then linearize the execution of the loop iterations and divide them among the available single level of threads. CAPO has the capability to identify suitable loop nests and generate the SGI NEST clause. We have extended this feature of CAPO to support true nested parallelism.

Our extension to OpenMP multilevel parallelism is based on parallelism at different loop nests and makes use of the extensions offered by the NanosCompiler. Currently, we limit our approach to two-level loop parallelism, which is of more practical use. The approach to automatically exploiting two-level parallelism is extended from the single level parallelization and is illustrated in Figure 1. Besides the data dependence analysis at the beginning, the approach can be summarized in the following four steps.

[Figure 1: Steps in multilevel parallelization: Serial Code -> Data Dependence Analysis -> First Level Loop Analysis -> Second Level Loop Analysis -> Second Level Directive Insertion -> First Level Directive Insertion -> Parallel Code. The second-level steps form the dashed box referred to below.]

1) First-level loop analysis. This is essentially the combination of the first two stages of the single level parallelization, where parallel loops and parallel regions are identified and optimized at the outermost loop level.

2) Second-level loop analysis. This step involves the identification of parallel loops and parallel regions nested inside the parallel loops that were identified in Step 1. These parallel loops and parallel regions are then optimized as before, but limited to the scope defined by the first level.

3) Second-level directive insertion. This includes code transformation and OpenMP directive insertion for the second level. A step performed before inserting any directives at the first level is to ensure that a consistent picture is maintained for any variables and code that may be changed or introduced during the code transformation.

4) First-level directive insertion. Lastly, code transformation and OpenMP directive insertion are performed for the outer level parallelization. All the transformations of the last stage of the single level parallelization are performed, with the exception that we disallow the THREADPRIVATE directive.

Compared to single level parallelization, the two-level parallelization process requires the additional steps indicated in the dashed box in Figure 1.

3.3 Implementation considerations

In order to maintain consistency during the code transformations that occur during the parallelization process, we need to update data dependences properly. Consider the example where CAPO transforms an array reduction into updates to a local variable, followed by an update to the global array in a CRITICAL section to work around the limitation on reductions in OpenMP v1.x. The data dependence graph needs to be updated to reflect the change due to this transformation, such as associating dependence edges related to the original variable with the local variable and adding new dependences for the local variable from the local updates to the global update. Performing a full data dependence analysis for the modified code block is another possibility, but this would not take advantage of the information already obtained from the earlier dependence analysis. A sketch of this reduction transformation is shown below.
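The following is a hand-written illustration of the transformation just described (the array and variable names Q, QLOC, W, IDX and the sizes are ours, not taken from any benchmark): the shared array Q is reduced through a private copy QLOC, which each thread folds into Q inside a CRITICAL section.

      PROGRAM REDSKETCH
      INTEGER N, NB, I, J
      PARAMETER (N = 1000, NB = 8)
      REAL Q(NB), QLOC(NB), W(N)
      INTEGER IDX(N)
      DO I = 1, N
         W(I)   = 1.0
         IDX(I) = MOD(I, NB) + 1
      END DO
      DO J = 1, NB
         Q(J) = 0.0
      END DO
!$OMP PARALLEL PRIVATE(I, J, QLOC)
C     Each thread accumulates partial sums into its private copy.
      DO J = 1, NB
         QLOC(J) = 0.0
      END DO
!$OMP DO
      DO I = 1, N
         QLOC(IDX(I)) = QLOC(IDX(I)) + W(I)
      END DO
!$OMP END DO NOWAIT
C     The global array is updated once per thread inside a CRITICAL
C     section, since OpenMP v1.x offers no REDUCTION on whole arrays.
!$OMP CRITICAL
      DO J = 1, NB
         Q(J) = Q(J) + QLOC(J)
      END DO
!$OMP END CRITICAL
!$OMP END PARALLEL
      PRINT *, 'Q(1) = ', Q(1)
      END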
When nested parallel regions are considered, the scope of the THREADPRIVATE directive is no longer clear, since a variable may be threadprivate for the outer nest of parallel regions but shared for the inner parallel regions, and the directive cannot be bound to a specific nest level. The OpenMP specification does not properly address this issue. Our solution is to disallow the THREADPRIVATE directive when nested parallelism is considered and to treat any private variables defined in common blocks by the special transformation mentioned in Section 3.1.

The scope of the synchronization directives has to be followed carefully. For example, the MASTER directive is not allowed in the extent of a PARALLEL DO. This changes the way a software pipeline (see [13] for further explanation) can be implemented if it is nested inside an outer parallel loop.

CAPO detects opportunities for software-pipelined execution of loops where data dependences prevent parallelization. Such loops are enclosed by a parallel region. The iteration space of the loops is divided up among the threads using the OMP DO directive. The threads then explicitly synchronize their execution with their neighbors. This is discussed in greater detail in Section 4.2, and an example of a one-dimensional pipeline is shown in Figure 5. The CAPO extensions to support nested parallelism include software pipelining. The following example shows how CAPO exploits two levels of parallelism in a loop nest where only the outer loop is truly parallel. Assume we have a nest containing two loops:

      DO K=1, NK
         DO J=2, NJ
            A(J,K) = A(J,K) + A(J-1,K)

The outer loop K is parallel and the inner loop J can be set up with a pipeline. After inserting directives at the second level to set up the pipeline, we have

!$OMP PARALLEL
      DO K=1, NK
!..point-to-point sync directive
!$OMP DO
         DO J=2, NJ
            A(J,K) = A(J,K) + A(J-1,K)

The implementation of the point-to-point synchronization with directives is illustrated in Section 4.2. In order to parallelize the K loop at the outer level, we need to first transform the loop into a form such that the outer-level directives can be added. This is achieved by explicitly calculating the K-loop bounds for each outer-level thread, as shown in the following code:

!$OMP PARALLEL DO GROUPS(ngroups)
      DO IT=1, omp_get_num_threads()
         CALL calc_bound(IT, 1, NK, low, high)
!$OMP PARALLEL
         DO K=low, high
!..point-to-point sync directive
!$OMP DO
            DO J=2, NJ
               A(J,K) = A(J,K) + A(J-1,K)

The function calc_bound calculates the K loop bounds (low, high) for a given IT (the thread number) from the original K loop limits. Only then are the first-level directives added to the IT loop (instead of the K loop).
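The paper does not show calc_bound itself; the subroutine below is our own sketch of a possible implementation, a simple block partition of the iterations K1..K2 among the threads of the current team, with thread numbers IT starting at 1 as in the loop above. Which team OMP_GET_NUM_THREADS refers to under the GROUPS extension is a runtime detail; passing the thread count explicitly would be an equally valid choice.

      SUBROUTINE CALC_BOUND(IT, K1, K2, LOW, HIGH)
      INTEGER IT, K1, K2, LOW, HIGH
      INTEGER NT, NITER, CHUNK, NREST
      INTEGER OMP_GET_NUM_THREADS
      EXTERNAL OMP_GET_NUM_THREADS
C     Block partition of K1..K2 among the NT threads of the team.
      NT    = OMP_GET_NUM_THREADS()
      NITER = K2 - K1 + 1
      CHUNK = NITER / NT
      NREST = MOD(NITER, NT)
C     The first NREST threads receive one extra iteration each.
      IF (IT .LE. NREST) THEN
         LOW  = K1 + (IT-1)*(CHUNK+1)
         HIGH = LOW + CHUNK
      ELSE
         LOW  = K1 + NREST*(CHUNK+1) + (IT-1-NREST)*CHUNK
         HIGH = LOW + CHUNK - 1
      END IF
      RETURN
      END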
The method is not as elegant as one would prefer, but it points to some of the limitations of the nested OpenMP directives. In particular, we would not be able to set up a two-dimensional pipeline, since it would involve synchronization of threads from two different nest levels. We will discuss the problem of two-dimensional pipelining in one of our case studies in Section 4.2.

One of the contributions of the NanosCompiler to supporting nested directives is the GROUPS clause, which can be used to define the number of thread groups to be created at the beginning of an outer-nest parallel region. In our implementation, a GROUPS clause containing a single shared variable 'ngroups' is generated for all the first-level parallel regions. The ngroups variable is placed in a common block and can be defined by the user at run time. Although it would be better to generate the GROUPS clause with a weight argument based on the different workloads of the parallel regions, this is not considered at the moment.

The nested loop:

      DO K=1, NK
         RHO = 1/NORMK(K)
         DO J=2, NJ
            A(J,K) = A(J,K) + RHO*B(J,K)
         END DO
      END DO

will be transformed by CAPO into:

!$OMP PARALLEL GROUPS(ngroups)
!$OMP& PRIVATE(RHO, K)
!$OMP DO
      DO K=1, NK
         RHO = 1/NORMK(K)
!$OMP PARALLEL DO PRIVATE(J)
         DO J=2, NJ
            A(J,K) = A(J,K) + RHO*B(J,K)
         END DO
!$OMP END PARALLEL DO
      END DO
!$OMP END DO NOWAIT
!$OMP END PARALLEL

Note that for this loop the SGI NEST clause is not applicable, since there is a statement between DO K and DO J.

4 Case Studies

In this section we show examples of successful and not so successful automatic multilevel parallelization. We have parallelized the three application benchmarks (BT, SP, and LU) from the NAS Parallel Benchmarks [4] and the ARC3D [22] application code using the CAPO multilevel parallelization feature and examine its effectiveness.

In each of our experiments we generate nested OpenMP directives and use the NanosCompiler for compilation and building of the executables. As discussed in Sections 2 and 3, the nested parallel code contains the GROUPS clause at the outer level. According to the OpenMP standard, the number of executing threads can be specified at runtime by the environment variable OMP_NUM_THREADS. We introduce the environment variable NANOS_GROUPS and modify the source code to have the main routine check the value of this variable and set the argument to the GROUPS clause accordingly (a sketch of such a check is shown just before Section 4.1). This allows us to run the same executable not only with different numbers of threads, but also with different numbers of groups. We compare the timings for different numbers of groups to each other. Note that single level parallelization of the outer loop corresponds to the case where the number of executing threads is equal to the number of groups, i.e. there is only one thread in each group. We compare these timings to those resulting from compilation with the native SGI compiler, which supports only single level OpenMP parallelization and serializes inner parallel loops. The timings were obtained on an SGI Origin 2000 with R12000 CPUs, 400 MHz clock, and 768 MB local memory per node.

A study of the effects of single level OpenMP parallelization on the NAS Parallel Benchmarks can be found in [12]. In our experiments we started out with the same serial implementation of the codes that was the basis for the single level OpenMP implementation described in [12]. We ran class A (64x64x64 grid points), class B (102x102x102 grid points), and class C (162x162x162 grid points) for the BT and SP benchmarks. As an example we show timings for problem class A for both benchmarks in Figure 2.

[Figure 2: Timing results for the class A benchmarks. Two charts show execution time versus number of threads (8 to 128) for BT and SP (problem size 64x64x64), comparing SGI OpenMP, SGI OpenMP+NEST, Nanos Outer, and Nanos Nested.]
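The sketch below illustrates the kind of start-up check described above for the NANOS_GROUPS variable; it is our own illustration, not the code generated by CAPO. GETENV is a widespread vendor extension of Fortran 77, and the common block name CAPGRP is ours.

      SUBROUTINE INIT_GROUPS
      INTEGER NGROUPS
      COMMON /CAPGRP/ NGROUPS
      INTEGER OMP_GET_MAX_THREADS
      EXTERNAL OMP_GET_MAX_THREADS
      CHARACTER*32 VAL
C     Read the group count requested through the environment.
      CALL GETENV('NANOS_GROUPS', VAL)
      NGROUPS = 0
      IF (VAL .NE. ' ') READ(VAL, *, ERR=10) NGROUPS
   10 CONTINUE
C     Default: one thread per group, i.e. pure outer-level parallelism.
      IF (NGROUPS .LE. 0) NGROUPS = OMP_GET_MAX_THREADS()
      RETURN
      END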
4.1 Successful multilevel parallelization: the BT and SP benchmarks

The NAS Parallel Benchmarks BT and SP are both simulated CFD applications with a similar structure. They use an implicit algorithm to solve the 3D compressible Navier-Stokes equations. The x, y, and z dimensions are decoupled by use of an Alternating Direction Implicit (ADI) factorization method. In BT, the resulting systems are block-tridiagonal with 5x5 blocks and are solved sequentially along each dimension. SP uses a diagonalization method that decouples each block-tridiagonal system into three independent scalar pentadiagonal systems that are solved sequentially along each dimension.

The programs compiled with the SGI OpenMP compiler scale reasonably well up to 64 threads, but do not show any further speed-up if more threads are used. For a small number of threads (up to 64), the outer level parallel code generated by the NanosCompiler runs somewhat slower than the code generated by the SGI compiler, but its relative performance improves with an increasing number of threads. When increasing from 64 to 128 threads, the multilevel parallel code still shows a speed-up, provided the number of groups is chosen in an optimal way. We observed a speed-up of up to 85% for 128 threads. In Figure 3 we show the speed-up resulting from nested parallelization for three problem classes of the SP and BT benchmarks. We denote by:

• SGI OpenMP: the time for outer loop parallelization using just the native SGI compiler,
• SGI OpenMP+NEST: the time for outer loop parallelization using the SGI NEST clause where applicable,
• Nanos Outer: the time for outer loop parallelization using the NanosCompiler,
• Nanos Nested: the minimal time for nested parallelization using the NanosCompiler.

The timings show that the SGI NEST clause is of limited benefit. It improves the performance of the BT benchmark slightly, but it does not help the SP benchmark. The time consuming routines in the two benchmarks are the three solvers in the x, y, and z directions and the computation of the right hand side. In the case of BT, CAPO parallelized 28 loops, 11 of which were suitable for the NEST clause. This includes the major loops in the three solver routines. The time consuming loops in the calculation of the right hand side are not suitable for the NEST clause, since they contain statements between the DO statements. The situation is a lot worse for the SP benchmark. CAPO parallelized 31 loops. The NEST clause could be generated for 11 of them. The main loops in the solver routines were not suitable for the NEST clause, because the inner loops are enclosed in subroutine calls. The computation of the right hand side contains nested loops that are not tightly nested, just as in the case of BT. The NEST clause could only be applied to loops with a very low workload. In this case, distributing the work in multiple dimensions leads to a slight decrease in performance for a small number of threads.

Neither the occurrence of code between the DO statements nor inner loops enclosed within subroutine calls poses an obstacle to the nested parallel regions supported by the NanosCompiler. For the BT benchmark CAPO parallelized 13 of the 28 parallel loops employing nested parallel regions and the GROUPS clause. For the SP benchmark CAPO identified 17 of the 31 parallel loops as suitable for nested parallelism. In both benchmarks the most time consuming loops are parallelized in two dimensions. All of the nested parallel loops are at least triply nested. The structure of the loops is such that the two outermost loops can be parallelized. The inner parallel loops enclose one or more inner loops and contain a reasonably large amount of computational work.

[Figure 3: Speed-up due to nested parallelism for BT and SP, problem classes A, B, and C, on 8 to 128 threads.]

The reason that multilevel parallelism has a positive effect on the performance of these loops is mainly that load balancing between the threads is improved. For class A, for example, the number of outer iterations is 62. If only the outer loop is parallelized, using more than 62 threads will not improve the performance any further. In the case of 64 threads, 2 of them will be idle. If, however, the second loop level is also parallelized, all 64 threads can be put to use. Our experiments show that choosing the number of groups too small will actually decrease performance. Setting the number of groups to 1 effectively moves the parallelism completely to the inner loop, which will in most cases be less efficient than parallelizing the outer loop.
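The following stand-alone sketch (ours) makes the trade-off concrete for the class A case discussed above: for 62 outer iterations and 64 threads it prints, for each group count used in Table 1, how many threads each group receives under the equal-partition form of the GROUPS clause and how many outer iterations the most and least loaded groups get under a static schedule. It only models the outer loop; Table 1 below reports the resulting inner-loop iteration counts per thread.

      PROGRAM GRPBAL
      INTEGER NITER, NTHRDS
      PARAMETER (NITER = 62, NTHRDS = 64)
      INTEGER NGRPS(5), NG, TPG, MAXIT, MINIT, I
      DATA NGRPS /64, 32, 16, 8, 4/
      DO I = 1, 5
         NG    = NGRPS(I)
C        Threads per group under equal partitioning.
         TPG   = NTHRDS / NG
C        Outer iterations assigned to the most/least loaded group.
         MAXIT = (NITER + NG - 1) / NG
         MINIT = NITER / NG
         PRINT *, NG, ' groups: ', TPG, ' threads/group, ',
     &            MINIT, ' to ', MAXIT, ' outer iterations/group'
      END DO
      END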
In Table 1 we show the maximal and minimal number of iterations (for class A) of the inner parallel loop that a thread has to execute, depending on the number of groups.

   # Groups   Max # Iters   Min # Iters
      64          62
      32          62            31
      16          64            45
       8          64            49
       4          64            45

   Table 1: Thread workload for the class A problems BT and SP.

To give a flavor of how the performance of the multilevel parallel code depends on the grouping of threads, we show timings for the BT benchmark on 64 threads and a varying number of groups in Figure 4. The timings indicate that good criteria for choosing the number of groups are:

• Efficient granularity of the parallelism, i.e., the number of groups has to be sufficiently small. In our experiments we observe that the number of groups should not be smaller than the number of threads within a group.
• The number of groups has to be large enough to ensure a good balance of work among the threads.

[Figure 4: Timings of BT with a varying number of thread groups on 64 threads, for benchmark classes W, A, B, and C.]

4.2 The need for OpenMP extensions: the LU benchmark

The LU application benchmark is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven band block-diagonal system resulting from finite-difference discretization of the 3D compressible Navier-Stokes equations by splitting it into block lower and block upper triangular systems.

As the starting point for our tests we chose the pipelined implementation of the parallel SSOR algorithm, as described in [12]. The example below shows the loop structure of the lower-triangular solver in SSOR. The lower-triangular and diagonal systems are formed in routine JACLD and solved in routine BLTS. The index K corresponds to the third coordinate direction.

      DO K = KST, KEND
         CALL JACLD (K)
         CALL BLTS (K)
      END DO

      SUBROUTINE BLTS (K)
      DO J = JST, JEND
         Loop_Body (J, K)
      END DO
      RETURN
      END

All of the loops involved carry data dependences that prevent straightforward parallelization. There is, however, the possibility to exploit a certain level of parallelism by using software pipelining as described in Section 3.3. To set up a pipeline for the outer loop, thread 0 starts to work on its first chunk of data in the K direction. Once thread 0 finishes, thread 1 can start working on its chunk for the same K and, in the meantime, thread 0 moves on to K+1. The directives generated by CAPO to implement the pipeline for the outer loop are shown in Figure 5.

The K loop is placed inside a parallel region. Two OpenMP library functions are called to obtain the current thread identifier (iam) and the total number of threads (numt). The shared array isync is used to indicate the availability of data from neighboring threads. Together with the FLUSH directive in a WHILE loop it is used to set up the point-to-point synchronization between threads. The first WHILE ensures that thread iam will not start with its slice of the J loop before the previous thread has updated its data. The second WHILE is used to signal data availability to the next thread.
!$OMP PARALLEL PRIVATE (K, iam, numt)
      iam  = omp_get_thread_num()
      numt = omp_get_num_threads()
      isync(iam) = 0
!$OMP BARRIER
      DO K = KST, KEND
         CALL JACLD (K)
         CALL BLTS (K)
      END DO
!$OMP END PARALLEL

      SUBROUTINE BLTS (K)
      if (iam .gt. 0 .and. iam .lt. numt) then
         do while (isync(iam-1) .eq. 0)
!$OMP FLUSH (isync)
         end do
         isync(iam-1) = 0
!$OMP FLUSH (isync)
      end if
!$OMP DO
      DO J = JST, JEND
         Loop_Body (J, K)
      END DO
!$OMP END DO nowait
      if (iam .lt. numt) then
         do while (isync(iam) .eq. 1)
!$OMP FLUSH (isync)
         end do
         isync(iam) = 1
!$OMP FLUSH (isync)
      endif
      RETURN
      END

Figure 5: The one-dimensional parallel pipeline implemented in LU.

The NanosCompiler team is currently defining and implementing OpenMP extensions to easily express the precedence relations that originate pipelined computations. These extensions are also valid in the scope of nested parallelism. They are based on two components:

• The ability to name work-sharing constructs (and therefore to reference any piece of work coming out of them).
• The ability to specify predecessor and successor relationships between named work-sharing constructs (PRED and SUCC clauses).

This avoids the manual transformation of the loop to access data slices and the manual insertion of synchronization calls. From the new directives and clauses, the compiler automatically builds synchronization data structures and inserts synchronization actions following the predecessor and successor relationships defined [8]. Figure 6 shows the pipelined loop from Figure 5 when using the new directives.

!$OMP PARALLEL PRIVATE (K, iam, numt)
      DO K = KST, KEND
         CALL JACLD (K)
         CALL BLTS (K)
      END DO
!$OMP END PARALLEL

      SUBROUTINE BLTS (K)
!$OMP DO NAME (inner_loop)
      DO J = JST, JEND
!$OMP PRED (inner_loop, J-1)
         Loop_Body (J, K)
!$OMP SUCC (inner_loop, J+1)
      END DO
!$OMP END DO nowait
      RETURN
      END

Figure 6: One-dimensional pipeline using directives.

In Figure 7 we show timings for the LU benchmark comparing the one-level pipelined implementation using the synchronization mechanism from Figure 5, the one-level pipelined implementation using the new NanosCompiler directives, and a two-dimensional pipelined implementation based on MPI. The compiler-directive based implementation shows about the same performance as the hand-coded synchronization.

[Figure 7: Timings for different implementations of LU (class A) on 4 to 64 threads: explicit synchronization, PRED/SUCC directives, and MPI.]

The performance of the pipelined parallel implementation of the LU benchmark is discussed in [12]. The timings in Figure 7 show that the directive based implementation does not scale as well as a message passing implementation of the same algorithm. The cost of pipelining results mainly from waiting during startup and finishing. The message-passing version employs a two-dimensional pipeline where the wait cost can be greatly reduced. The use of nested OpenMP directives offers the potential to achieve similar scalability to the message passing implementation.
There is, however, a problem in setting up a directive-based two-dimensional pipeline. The new directives allow synchronization of threads within one team and synchronization between different teams. The structure of the Loop_Body depicted in Figure 5 looks like:

      DO I = ILOW, IHIGH
         DO M = 1, 5
            TV(M,I,J) = V(M,I,J,K-1)
     >                + V(M,I,J-1,K)
     >                + V(M,I-1,J,K)
         END DO
         DO M = 1, 5
            V(M,I,J,K) = TV(M,I,J)
         END DO
      END DO

If both the J- and I-loops are to be parallelized employing pipelines, a thread would need to be able to synchronize with its neighbors in the J- and I-directions on different nesting levels. Parallelizing the I-loop with OpenMP directives introduces an inner parallel region, as shown below (see also the discussion in Section 3.3):

!$OMP PARALLEL
      synchronization1
!$OMP DO
      DO JT = ...
!$OMP PARALLEL
         DO J = JLOW, JHIGH
            synchronization2
!$OMP DO
            DO I = ILOW, IHIGH
               ...
            END DO
!$OMP END DO NOWAIT
            synchronization2
         END DO
!$OMP END PARALLEL
      END DO
!$OMP END DO NOWAIT
      synchronization1

The end of the inner parallel region forces the threads to join and destroys the multilevel pipeline mechanism. In order to set up a two-dimensional pipeline, two possibilities should be taken into account. The first is removing the implicit barrier at the end of the inner parallel region. Such a NOWAIT clause is not available in OpenMP but could be easily implemented in the compiler. The second alternative is the use of nested OMP DO directives within the same parallel region. This is an extension also proposed for OpenMP and available in the native SGI compiler. It uses a single level of parallelism but performs a two-dimensional distribution of work. The loop structure of many time consuming loops in the LU benchmark is suitable for the SGI NEST clause, but the SGI compiler does not provide extensions for explicit thread synchronization. As we have seen in Section 4.1, the restrictions on the application of the NEST clause greatly limit its usage for many time consuming loops. It would be desirable to have these restrictions removed. Code between the DO statements could be handled by having only part of the threads execute these statements. In case the inner loop is enclosed in a subroutine call, more complicated techniques, involving procedure inlining, are necessary. A sketch of the loop shape that the NEST clause accepts is shown below.
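For reference, the fragment below sketches the perfectly nested loop shape that the NEST clause applies to, written from the description in Section 3.2; the exact clause syntax is defined by the SGI compiler documentation [18], and the loop itself is a made-up example of ours.

      PROGRAM NESTSK
      INTEGER NK, NJ, K, J
      PARAMETER (NK = 64, NJ = 64)
      REAL A(NJ,NK), B(NJ,NK)
      DO K = 1, NK
         DO J = 1, NJ
            A(J,K) = 0.0
            B(J,K) = 1.0
         END DO
      END DO
C     The NEST clause (an SGI extension, see [18]) tells the compiler
C     that the K and J iterations together form one parallel iteration
C     space; the two loops must be perfectly nested, with no statements
C     between DO K and DO J or between the two END DOs.
!$OMP PARALLEL
!$OMP DO NEST(K, J) PRIVATE(J)
      DO K = 1, NK
         DO J = 1, NJ
            A(J,K) = A(J,K) + B(J,K)
         END DO
      END DO
!$OMP END DO
!$OMP END PARALLEL
      PRINT *, A(1,1)
      END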
4.3 Unsuitable loop structure in ARC3D

ARC3D uses an implicit scheme to solve the Euler and Navier-Stokes equations on a three-dimensional (3D) rectilinear grid. The main component is an ADI solver, which results from the approximate factorization of the finite difference equations. The actual implementation of the ADI solver (subroutine STEPF3D) in the serial ARC3D is illustrated in Figure 6. It is very similar to the SP benchmark.

[Figure 6: The schematic flowchart of the ADI solver in ARC3D: BC -> RHS -> FILTER3D -> TKINV -> XI solver (X) -> NPINV -> ETA solver (Y) -> NPINV -> TK -> ZETA solver (Z) -> update solution. In each sweep the LHS is formed plane by plane and solved by VPENTA3 (first 3 elements) and VPENTA (elements 4 and 5).]

For each time step, the solver first sets up boundary conditions (BC), forms the explicit right-hand-side (RHS) with artificial dissipation terms (FILTER3D), and then sweeps through three directions (X, Y and Z) to update the 5-element fields separately. Each sweep consists of forming and solving a series of scalar pentadiagonal systems in a two-dimensional plane, one plane at a time. Two-dimensional arrays are created from the 3D fields and are passed into the pentadiagonal solvers (VPENTA3 for the first 3 elements and VPENTA for the 4th and 5th elements, both originally written for vector machines), which perform Gaussian eliminations. The solutions are then copied back to the three-dimensional residual fields. Between sweeps there are routines (TKINV, NPINV and TK) to calculate and solve small, local 5x5 eigensystems. Finally the solution is updated for the current time step.

Even though the time consuming solver in ARC3D is similar to the one in the SP benchmark, our approach to automatic multilevel parallelization was not successful here. For ARC3D, CAPO identified 58 parallel loops, 35 of which were suitable for nested parallelization. 19 of the 35 nested parallel loops had very little work in the inner parallel loop and inefficient memory access. An example is shown below.

!$OMP PARALLEL DO GROUPS(ngroups)
!$OMP& PRIVATE(AR, BR, CR, DR, ER)
      DO K = KLOW, KUP
!$OMP PARALLEL DO
         DO L = 2, LM
            DO J = 2, JM
               AR(L,J) = AR(L,J)*V(J,K,L)
               BR(L,J) = BR(L,J)*V(J,K,L)
               CR(L,J) = CR(L,J)*V(J,K,L)
               DR(L,J) = DR(L,J)*V(J,K,L)
               ER(L,J) = ER(L,J)*V(J,K,L)
               CR(L,J) = CR(L,J) + 1.
            END DO
         END DO
      END DO

Parallelizing the L loop increases the execution time of this loop considerably due to a high number of cache invalidations. The occurrence of many such loops in the original ARC3D code nullifies the benefit of a better load balance, and we see no speed-up from multilevel parallelism. The NEST clause could be applied to the same 35 loops that were suitable for nested parallelization. However, just like the nested parallel regions, the NEST clause did not improve the performance of the code.

We ran ARC3D for two different problem sizes. In both cases the performance dropped by 10% to 70% when the number of groups was smaller than the number of threads, i.e. when multilevel parallelism was used. Example timings for both problem sizes and 64 threads are given in Figure 7. The timings for outer level parallelism are given in Figure 8.

[Figure 8: Timings from the outer level parallelization of ARC3D for problem sizes 64x64x64 and 194x194x194 on 4 to 128 threads, comparing SGI OpenMP, SGI OpenMP+NEST, and Nanos Outer.]

[Figure 7: Timings of ARC3D with a varying number of thread groups for a given total of 64 threads, for problem sizes 64x64x64 and 194x194x194.]
The example of ARC3D shows that parallelizing all loops in an application indiscriminately on two levels, with the same number of groups and the same weight for each group, may actually increase the execution time. At the least we will need to extend the CAPO directive browser to allow the user to inspect all multilevel parallel loops and possibly perform code transformations or disable nested directives.

5 Related work

There are a number of commercial and research parallelizing compilers and tools that have been developed over the years. Some of the more notable ones include Superb [24], Polaris [6], Suif [24], KAI's toolkit [15], VAST/Parallel [21], and FORGExplorer [1]. Regarding OpenMP directives, most current commercial and research compilers mainly support the exploitation of a single level of parallelism and special cases of nested parallelism (e.g. doubly perfectly nested loops, as in the SGI MIPSpro compiler). The KAI/Intel compiler offers, through a set of extensions to OpenMP, work queues and an interface for inserting application tasks before execution (the WorkQueue proposal [23]). The KAI/Intel proposal mainly targets dynamic work generation schemes (recursions and loops with unknown loop bounds). At the research level, the Illinois-Intel Multithreading library [7] provides a similar approach based on