
Purdue University, Purdue e-Pubs, Computer Science Technical Reports, Department of Computer Science, 1988. Report Number: 88-751.

Rego, Vernon J. ([email protected]) and Mathur, Aditya P. ([email protected]), "Exploiting Parallelism Across Program Execution: A Unification Technique and Its Analysis" (1988). Computer Science Technical Reports, Paper 647. http://docs.lib.purdue.edu/cstech/647

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information.

EXPLOITING PARALLELISM ACROSS PROGRAM EXECUTION: A UNIFICATION TECHNIQUE AND ITS ANALYSIS

Vernon Rego and Aditya P. Mathur
Department of Computer Sciences
Purdue University
West Lafayette, IN 47906
CSD TR-751, March 1988

Abstract: This paper describes a new technique for source-to-source transformation of sequential programs. We show that the transformed programs so generated provide significant speedups over the original program on vector processors and vector multiprocessors. We exploit the parallelism that arises when multiple instances of a program are executed on simultaneously available data sets. This is in contrast to the existing approaches that aim at detecting parallelism within a program. Analytic and simulation models of our technique clearly indicate the speedups that could be achieved when several data sets are available simultaneously, as is the case in many fields of interest.

Index terms: vector multiprocessors, program unification, multiple data sets, software testing, urn model, order statistic, simulation.

I. INTRODUCTION

In this paper we present and analyze a technique for source-to-source transformation of sequential programs. We claim that the transformed programs so generated can be parallelized more effectively by existing tools such as those reported in [4,13,15,16]. As shown in Fig. 1, a tool based on our technique will transform a source program for input to any one of the above cited parallelization tools or vectorizing compilers.

[Figure 1. A tool for program unification: a program P passes through the unifier and then to a parallelization/vectorization tool.]

Several attempts have been made at discovering parallelism in a sequential program for efficient scheduling of computations on vector processors (e.g., Cray 1S) and vector multiprocessors (e.g., Cray X/MP, Alliant FX/8 and Cedar). All these machines fit a shared memory parallel processor model [29] consisting of L homogeneous and autonomous processors interconnected by a network. Each memory module is accessible by all the processors.

DO-loops within a program have been the major targets for parallelization. In [28] techniques for processor assignment to parallel loops are described. Loop coalescing has been proposed in [27,29] as a means to restructure several types of nested loops into singly nested loops; a minimal sketch of the idea appears below. The same work also presents simulation results that show the speedups obtained as a result of parallelization when the number of processors is increased. The Doacross technique was introduced by Cytron [10] as a mechanism for modeling and scheduling sequential and vector loops on multiprocessors.

[Figure 2. An example of a program graph with four nodes.]
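Loop coalescing is easiest to see on a small example. The following is a minimal illustration of the general idea only, not the algorithm of [27,29]; the bounds and array names are made up for this sketch.

```python
# Illustrative sketch of loop coalescing; made-up bounds and arrays,
# showing the general idea rather than the algorithm of [27,29].
M, N = 4, 8
a = [[0] * N for _ in range(M)]

# Original doubly nested loop.
for i in range(M):
    for j in range(N):
        a[i][j] = i * N + j

# Coalesced form: one singly nested loop over the M * N iteration
# space; the original indices i and j are recovered from the single
# index k, leaving a flat loop of independent iterations.
b = [[0] * N for _ in range(M)]
for k in range(M * N):
    i, j = divmod(k, N)
    b[i][j] = i * N + j

assert a == b
```

A single loop of independent iterations is a simpler target for processor assignment and vectorization than the nested form, which is the motivation given in [27,29].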
In certain cases, the Doacross technique requires insertion of delays during successive execution of loop iterations in order to introduce synchronization. Examples of loops requiring synchronization, and different methods for achieving it, have been presented in [22,34]. In [6], it has been pointed out that certain programs perform poorly on a class of machines when they contain:

1. Linear recurrences of order > 1.
2. IF-loops performing non-iterative computations.
3. Nonlinear recurrences.
4. DO-loops with exits.
5. Short inner loops that depend on outer loops.

The benchmarks that we present in Section III show that the transformation technique developed in this paper improves the performance of programs even when they contain statement sequences of the kind cited above.

We note that the benchmarks reported in the literature to illustrate the advantages of using the program transformation tools cited above have not presented any results on programs that employ Monte Carlo simulation techniques. It is a well known fact that such programs perform poorly on vector machines unless special care is taken [31]. Notable attempts have been made, however, to develop vector codes for Monte Carlo simulations [8,9]. Techniques described in [9] alter the code structure in order to achieve vectorization. We are not aware of any tool that incorporates these techniques and performs an automatic transformation of the original Monte Carlo code to a vectorizable Monte Carlo code.

The attempts made so far have concentrated on discovering parallelism within one execution of a program on given input data. We refer to such an approach as local optimization. On the other hand, if a program P is to be executed over several simultaneously available input data sets, the exploitation of parallelism which exists across multiple executions of P is what we call global optimization. Global optimization may not be of much interest for parallel machines in which individual processors are inherently sequential in nature and capable of operating on multiple instruction streams [14]. The reason is that instances of P can be scheduled on different processors that work independently. On machines like the Alliant FX/8 or the Cray X/MP, this approach is acceptable if P vectorizes well and can therefore use the resources of a single processor efficiently. In general, however, this is not true. As shown in Section III, when P does not parallelize, multiple instances of P can be combined together into another program $\overline{P}$ to improve parallelism. In this paper we refer to $\overline{P}$ as a unified program (a small illustrative sketch appears at the end of this section).

The need for simultaneous execution of P on several data sets arises often in a variety of fields. As an example from software testing, there are several techniques that require a program to be executed on many test data sets. Regression analysis [7] and mutation based testing [1,20] are two such techniques that are computationally intensive. Some experiments done with mutation analysis based testing have shown that global optimization could prove to be of significant utility in many cases. Another area that could benefit from global optimization is neural modeling. As an example, computational models of the cochlea [3] are often exercised for several combinations of parameter values, which include parameters describing the properties of the external sound and non-linear active parameters such as the damping function. These and many other areas serve as the primary motivation for the development and investigation of the transformation presented in this paper.
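To make the notion of a unified program concrete before the formal development, the following is a minimal sketch, in Python rather than the Fortran setting of the paper, of what unification does for a one-block program. The block x = y * sin(z) is borrowed from the example in Section II; the names P and P_bar are ours, not the paper's.

```python
import math

# Original program P: a single block operating on one data set d = (y, z).
def P(d):
    y, z = d
    return y * math.sin(z)

# Hypothetical unified program P_bar: the same block rewritten to run
# over N simultaneously available data sets.  The element-wise loop is
# the kind of statement a vectorizing compiler can map onto vector
# hardware, even though each single execution of P has no parallelism.
def P_bar(data_sets):
    return [y * math.sin(z) for (y, z) in data_sets]

data_sets = [(1.0, 0.5), (2.0, 1.5), (0.5, 3.0)]   # N = 3 data sets
assert P_bar(data_sets) == [P(d) for d in data_sets]
```

The point of the sketch is the shift of granularity: parallelism is found across executions of P, not within one execution.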
The remainder of this paper is organized as follows. The next section presents some definitions and terminology for concepts and terms used in this paper. In Section III the source-to-source transformation technique, which relies on the global optimization referred to above, is presented with examples. Our transformation produces a program whose dynamic behavior can be analyzed using probability and simulation. An analysis is presented in Sections IV and V. Computational results showing the speedup produced by applying our transformation are presented in Section VI. We conclude in Section VII.

II. DEFINITIONS AND TERMINOLOGY

We shall use P to denote the program to be transformed using the technique described in this paper. We assume that P is to be executed on N simultaneously available data sets denoted by $d_1, d_2, \ldots, d_N$. We refer to P as $P_i$, an instance of P, when it executes on data set $d_i$.

A basic block is a sequence of consecutive statements in which the flow of control enters the first statement and leaves the last statement without the possibility of control leaving at any other statement [2]. Using this definition, P can be transformed [2] into a sequence of K basic blocks denoted by $B_1, B_2, \ldots, B_K$. We will frequently denote these simply as blocks 1, 2, 3, ..., K. We assume that, except for $B_K$, all basic blocks end with an assignment, an unconditional branch, or a conditional branch. Further, a conditional branch at the end of a block $B_j$ is always of the form

    if c then goto label

where c is a scalar logical variable evaluated within block $B_j$ and label is a statement label. We shall denote the label of the first statement of block $B_j$ as $S_j$.

A. The Program Graph

A given program P operates deterministically on its input. Since we are interested in the behaviour of a program on an arbitrary input, we require some means of defining P's behaviour nondeterministically. For the class of programs that we are interested in studying (e.g., Fortran, Pascal), it can trivially be shown [2] that a given program P with K blocks can be represented by a graph with K nodes. Each node, except for the Kth node, has outdegree at most two. The Kth node is a terminal node with outdegree zero.

Let $G_P$ be a nondeterministic program graph corresponding to a program P. The graph $G_P$ is obtained by assigning probabilities to the arcs in the deterministic graph of P. As in similar studies [29], we assume that these probabilities are obtained from user-supplied expected branching frequencies, or estimated from test data. An example of a program graph is shown in Fig. 2, and an example of a nondeterministic program graph can be seen in Fig. 7(a).

Define $R_m$ to be the set of nodes that can be reached from node m by traversing only a single arc, for $1 \le m \le K$. For each m, $1 \le m < K$, the number of elements in $R_m$ is at most two, and $R_K = \emptyset$. We use the Greek letter $\beta$ to denote a branching probability, with the convention that for a given m, $1 \le m < K$, the probabilities $(1 - \beta_j)$ and $\beta_j$ are assigned to arcs (m, i) and (m, j), respectively, where $i = \min\{i, j\}$ and $j = \max\{i, j\}$ for $R_m = \{i, j\}$. In case $|R_m| = 1$ or $|R_m| = 0$, no confusion arises because the branching probability is either 1 or 0, respectively. Without loss of generality, we assume that block K has outdegree zero, and that there is at least one path from block 1 to block K in the program graph.
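As an illustration of this machinery, the sketch below builds a small nondeterministic program graph, with a shape consistent with the four-node graph of Fig. 2 and the paths of Fig. 4(a), and samples one execution path from it. The branching probability is made up, and the representation (a map from each node m to $R_m$ with arc probabilities) is our choice, not notation from the paper.

```python
import random

# A hypothetical nondeterministic program graph G_P with K = 4 nodes:
# block 1 falls through to 2, 2 to 3, and block 3 either loops back
# to 2 or exits to the terminal block 4.  BETA_3 is made up.
BETA_3 = 0.4
G_P = {
    1: [(2, 1.0)],                        # |R_1| = 1, probability 1
    2: [(3, 1.0)],                        # |R_2| = 1, probability 1
    3: [(2, 1.0 - BETA_3), (4, BETA_3)],  # |R_3| = 2, split by beta
    4: [],                                # terminal block, R_4 empty
}

def sample_path(graph, start=1):
    """Sample one execution path (a block sequence) through the graph."""
    path, m = [start], start
    while graph[m]:                       # stop at the terminal block
        nodes, probs = zip(*graph[m])
        m = random.choices(nodes, weights=probs)[0]
        path.append(m)
    return path

print(sample_path(G_P))                   # e.g. [1, 2, 3, 2, 3, 4]
```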
The time to execute an entity (i.e., a block or a program) x will be denoted by t(x). The program obtained by transforming P using our technique is denoted by $\overline{P}$. The speedup obtained by concurrent execution of $\overline{P}$ over all the N data sets, against executing P serially over these data sets (i.e., first executing P on $d_1$, then executing it on $d_2$, and so on), is denoted by $\gamma$ and defined as

$$\gamma = \frac{\sum_{i=1}^{N} t(P_i)}{t(\overline{P})} \qquad (2.1)$$

A block $B_i$ in P gets transformed to block $\overline{B}_i$ in $\overline{P}$. One execution of $\overline{B}_i$ corresponds to N serial executions of $B_i$. We define the block speedup coefficient for block $B_i$, denoted by $\alpha_{i,N}$, as

$$\alpha_{i,N} = \frac{t(\overline{B}_i)}{N \cdot t(B_i)} \qquad (2.2)$$

and note that $\alpha_{i,N}$ is typically a decreasing function of N. For vector multiprocessors, it is a well known fact that $\alpha_{i,N} < 1$ for several types of blocks. We will use $\alpha_N$ to denote the value of the block speedup coefficient averaged over all program blocks of P.

Example: (Scalar computation) For a block in P containing only one scalar computation x = y * sin(z), the $\alpha$ values as a function of N are plotted in Fig. 3.† For increasing N, a decrease in the value of $\alpha$ implies that a single execution of the transformed block is more efficient than an N-step serial execution of the original block. □

† All benchmarks presented in this paper have been obtained on the Alliant FX/8 series with one computing element.

The overall speedup $\gamma$ depends, amongst other factors, on the block speedup coefficients for individual blocks of P [17]. This relationship is made more explicit in Section IV.

[Fig. 3. Empirical $\alpha$ values for the block x = y * sin(z), plotted against the number of program components N (10 to 490): empirical data with spline interpolation.]

III. THE PROGRAM TRANSFORMATION TECHNIQUE

We shall begin by elaborating the idea of global optimization. Suppose that a program P, with a flow graph as shown in Fig. 2, is to be executed on three data sets on a uniprocessor vector machine. Further, suppose that the paths followed by $P_1$, $P_2$ and $P_3$ are as shown in Fig. 4(a). Here, blocks which can be executed in parallel are enclosed within a box. Thus, if all three instances of P are to be executed concurrently, block $B_1$ in each instance can be executed in parallel, followed by block $B_2$, and then block $B_3$. At this point, $P_1$ needs to execute block $B_4$ and the other two instances of P need to execute block $B_2$. Assuming that our block selection algorithm selects block $B_2$ as the next one to be executed, $P_2$ and $P_3$ can execute this block in parallel. Reasoning along these lines, we can work out the complete execution schedule; Fig. 4(b) exhibits one such schedule, and a sketch of one possible selection rule follows below.

To show that the above example illustrates a practically viable technique, the remainder of this section is devoted to answering the following questions:

1. How can blocks of different program instances execute in parallel?
2. What mechanisms are needed to manage multiple paths that can arise, as in Fig. 4(a), during the execution of different program instances?
3. What speedup, if any, can be obtained as a result of concurrent execution of multiple instances of P as shown in Fig. 4(a)?

[Figure 4(a). Multiple paths in P: the paths followed by instances 1, 2, and 3 through blocks 1 through 4, with blocks executable in parallel enclosed in boxes.]

Figure 4(b). Concurrent execution of multiple instances of P:

    Block being executed:             1      2      3      2    3    2    3    4
    Instances executing the block:    1,2,3  1,2,3  1,2,3  2,3  2,3  3    3    1,2,3
    Instances waiting for CPU:        none   none   none   1    1    1,2  1,2  none
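The schedule of Fig. 4(b) can be reproduced by a very simple block selection rule. The sketch below is a hypothetical reconstruction, not the authors' implementation: each instance carries its remaining path, and at every step the lowest-numbered block needed by any unfinished instance is selected; instances needing that block execute it together while the rest wait. It also estimates the speedup $\gamma$ in the sense of (2.1) and (2.2), assuming for simplicity a single, made-up block speedup coefficient alpha for all blocks.

```python
# Hypothetical reconstruction of the schedule of Fig. 4(b); not the
# paper's block selection algorithm.  Paths follow Fig. 4(a).
paths = {
    1: [1, 2, 3, 4],
    2: [1, 2, 3, 2, 3, 4],
    3: [1, 2, 3, 2, 3, 2, 3, 4],
}
pos = {i: 0 for i in paths}        # index of each instance's next block

steps = 0
while any(pos[i] < len(paths[i]) for i in paths):
    active = [i for i in paths if pos[i] < len(paths[i])]
    # Block selection rule: the lowest-numbered block needed by any
    # unfinished instance (this choice reproduces Fig. 4(b)).
    block = min(paths[i][pos[i]] for i in active)
    running = [i for i in active if paths[i][pos[i]] == block]
    waiting = [i for i in active if i not in running]
    print(f"step {steps + 1}: block {block}, "
          f"executing {running}, waiting {waiting or 'none'}")
    for i in running:
        pos[i] += 1
    steps += 1

# Speedup in the sense of (2.1): with unit block times, serial
# execution costs the sum of the path lengths; by (2.2) each unified
# block execution costs alpha * N unit blocks (alpha assumed constant
# across blocks, which is a simplification).
N = len(paths)
alpha = 0.2                        # made-up coefficient, cf. Fig. 3
serial = sum(len(p) for p in paths)     # 18 serial block executions
gamma = serial / (steps * alpha * N)
print(f"estimated speedup gamma = {gamma:.2f}")
```

Running the sketch prints the eight steps of Fig. 4(b); with alpha = 0.2 the estimate is gamma = 18 / (8 * 0.2 * 3) = 3.75, illustrating how speedup can emerge even though the three paths diverge.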
