PREPRINT
arXiv:1201.3804v1 [cs.DC] 18 Jan 2012

Managing Communication Latency-Hiding at Runtime for Parallel Programming Languages and Libraries

Mads Ruben Burgdorff Kristensen, Niels Bohr Institute, Copenhagen, Denmark, [email protected]
Brian Vinter, Niels Bohr Institute, Copenhagen, Denmark, [email protected]

ABSTRACT
This work introduces a runtime model for managing communication with support for latency-hiding. The model enables non-computer science researchers to exploit communication latency-hiding techniques seamlessly. For compiled languages, it is often possible to create efficient schedules for communication, but this is not the case for interpreted languages. By maintaining data dependencies between scheduled operations, it is possible to aggressively initiate communication and lazily evaluate tasks to allow maximal time for the communication to finish before entering a wait state. We implement a heuristic of this model in DistNumPy, an auto-parallelizing version of numerical Python that allows sequential NumPy programs to run on distributed memory architectures. Furthermore, we present performance comparisons for eight benchmarks with and without automatic latency-hiding. The results show that our model reduces the time spent on waiting for communication by as much as a factor of 27, from a maximum of 54% to only 2% of the total execution time, in a stencil application.

1. INTRODUCTION
There are many ways to categorize scientific applications – in terms of scalability, communication pattern, IO and so forth. In the following, we wish to differentiate between large maintained codes, often commercial or belonging to a community, and smaller, less organized codes that are used by individual researchers or in a small research group. The large codes are often fairly static and each version of the code can be expected to be run many times by many users, thus justifying a large investment in writing the code. The small development codes, on the other hand, change frequently and may only be run a few times after each change, usually only by the one user who made the changes.

The consequence of these two patterns is that the large codes may be written in a compiled language with explicit message-passing, while the small codes have an inherent need to be written in a high-productivity programming language, where the development time is drastically reduced compared to a compiled language with explicit message-passing.

High-productivity languages such as Matlab and Python – popular languages for scientific computing – are generally accepted as being slower than compiled languages, but more importantly they are inherently sequential, and while introducing parallelism is possible in these languages [1][2][3], it limits the productivity. It has previously been shown that it is possible to parallelize matrix and vector-based data structures from Python, even on distributed memory architectures [4]. However, in such a scheme for automatic parallelization, the parallel execution speed is severely impeded by communication between nodes.

To obtain performance in manual parallelization, the programmer usually applies a technique known as latency-hiding, which is a well-known technique to improve the performance and scalability of communication-bound problems and is mandatory in many scientific computations.

In this paper, we introduce an abstract model to handle latency-hiding at runtime. The target is scientific applications that make use of vectorized computation. The model enables us to implement latency-hiding in high-productivity programming languages where the runtime system handles communication and parallelization exclusively.

In such high-productivity languages, a key feature is automatic distribution, parallelization and communication that are transparent to the user. Because of this transparency, the runtime system has to handle latency-hiding without any help. Furthermore, the runtime system has no knowledge of the communication pattern used by the user. A generic model for latency-hiding is therefore desirable.

The transparent latency-hiding enables a researcher who uses small self-maintained programs to use a high-productivity programming language, Python in our case, without sacrificing the possibility of utilizing scalable distributed memory platforms. The purpose of the work is not that the performance of an application written in a high-productivity language should compete with that of a manually parallelized compiled application. Rather, the purpose is to close the gap between high productivity on a single CPU and high performance on a parallel platform and thus have a high-productivity environment for scalable architectures.

The latency-hiding model proposed in this paper is tailored to parallel programming languages and libraries with the following properties:

• The programming language requires dynamic scheduling at runtime because it is interpreted.
• The programming language supports and utilizes a distributed memory environment.
• All parallel processes have a global knowledge of the data distribution and computation.
• The programming language makes use of data parallelism in a Single Instruction, Multiple Data (SIMD) fashion in the sense that data affinity dictates the distribution of the computation.

Distributed Numerical Python (DistNumPy) is an example of such a parallel library, and the first project that fully incorporates our latency-hiding model. The implementation of DistNumPy is open-source and freely available at http://code.google.com/p/DistNumPy.

The rest of the paper is organized as follows. In sections 2, 3 and 4, we go through the background and theory of our latency-hiding model. In section 5, we describe how we use our latency-hiding model in DistNumPy. In section 6, we present a performance study. Section 7 is future work, and finally in section 8 we conclude.
2. RELATED WORK
Libraries and programming languages that support parallelism in a highly productive manner are a well-known concept. In a perfect framework, all parallelism introduced by the framework is completely transparent to the user while the performance and scalability achieved is optimal. However, most frameworks require the user to specify some kind of parallelism – either explicitly by using parallel directives or implicitly by using parallel data structures.

In this paper we will focus on data parallel frameworks, in which parallelism is based on the exploitation of data locality. A classical example of such a framework is High Performance Fortran (HPF) [5], which is an extension of the Fortran-90 programming language. HPF introduces parallelism primarily with vector operations, which, in order to achieve good performance, must be aligned by the user to reduce communication. However, a lot of work has been put into eliminating this alignment issue either at compile-time or at run-time [6][7][8].

DistNumPy [4] is a library for doing numerical computation in Python that targets scalable distributed memory architectures. DistNumPy accomplishes this by extending the NumPy module [9], which combined with Python forms a popular framework for scientific computations. The parallelization introduced in DistNumPy focuses on parallel vector operations like HPF, but because of the latency-hiding we introduce in this paper, it is not a requirement to align vectors in order to achieve good performance.

Hardware architectures also exploit data parallelism to hide memory latency [10] or communication latency [11]. Likewise, parallel data dependency analysis is essential in order to efficiently schedule instructions and avoid pipeline interlocks [12][13].

3. LATENCY-HIDING
We define latency-hiding informally as in [14] – "a technique to increase processor utilization by transferring data via the network while continuing with the computation at the same time". When implementing latency-hiding, the overall performance depends on two issues: the ability of the communication layer to handle the communication asynchronously, and the amount of computation that can overlap the communication – in this work we will focus on the latter issue.

In order to maximize the amount of communication hidden behind computation when performing vectorized computations, our abstract latency-hiding model uses a greedy algorithm. The algorithm divides the arrays, and thereby the computation, into a number of fixed-size data blocks.

Since most numerical applications will work on identically dimensioned datasets, the distribution of the datasets will be identical. For many data blocks, the location will therefore be the same and these will be ready for execution without any data transfer. While the co-located data blocks are processed, the transfers of the data blocks from different locations can be carried out in the background, thus implementing latency-hiding. The performance of this algorithm relies on two properties:

• The number of data blocks must be significantly greater than the number of parallel processors.
• A significant number of data blocks must share location.

In order to obtain both properties we need a data structure that supports easy retrieval of dependencies between data blocks. Furthermore, the number of data blocks in a computation is proportional to the total problem size; thus efficiency is of the utmost importance.
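As an illustration of the greedy block model described above, the following minimal sketch divides a one-dimensional array into fixed-size blocks, initiates the transfer of all non-local blocks first and computes the co-located blocks while the transfers are in flight. The helper names (split_into_blocks, fetch_remote_block, compute_block) and the thread-pool simulation of asynchronous transfers are illustrative assumptions, not DistNumPy internals.

    # A sketch of the greedy block model; transfers are simulated with a
    # thread pool so the example runs stand-alone.
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    BLOCK = 1000  # fixed block size used to split the arrays

    def split_into_blocks(array, block_size=BLOCK):
        """Divide an array into fixed-size data blocks."""
        return [array[i:i + block_size] for i in range(0, len(array), block_size)]

    def is_local(block_id, rank, nprocs):
        """Assumed round-robin data affinity: block i lives on process i % nprocs."""
        return block_id % nprocs == rank

    def fetch_remote_block(block):
        """Stand-in for an asynchronous receive of a remote block."""
        return block.copy()

    def compute_block(block):
        return block * 2.0

    def greedy_apply(array, rank=0, nprocs=4):
        blocks = split_into_blocks(array)
        results = [None] * len(blocks)
        with ThreadPoolExecutor() as pool:
            # Aggressively initiate all "communication" first ...
            pending = {i: pool.submit(fetch_remote_block, b)
                       for i, b in enumerate(blocks) if not is_local(i, rank, nprocs)}
            # ... compute the co-located blocks while transfers are in flight ...
            for i, b in enumerate(blocks):
                if is_local(i, rank, nprocs):
                    results[i] = compute_block(b)
            # ... and finally compute the blocks whose transfers have completed.
            for i, fut in pending.items():
                results[i] = compute_block(fut.result())
        return np.concatenate(results)

    print(greedy_apply(np.arange(10000, dtype=float))[:5])

The more blocks there are per process, and the more of them that are co-located, the more of the transfer time disappears behind the local computation, which is exactly the two properties listed above.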
4. DIRECTED ACYCLIC GRAPH
It is well-known that a directed acyclic graph (DAG) can be used to express dependencies in parallel applications [15]. Nodes in the DAG represent operations and edges represent serialization dependencies between the operations, which in our case are due to conflicting data block accesses.

Scheduling operations in a DAG is a well-studied problem. The scheduling problem is NP-complete in its general forms [16], where operations are scheduled such that the overall computation time is minimized. There exist many heuristics for solving the scheduling problem [17], but none match our target.

The scheduling problem we solve in this paper is not NP-hard, because we are targeting programming frameworks that make use of data parallelism in a SIMD fashion. The parallel model we introduce statically orchestrates data distribution and parallelization based on predefined data affinity. Assignment of computation tasks is not part of our scheduling problem. Instead, our scheduling problem consists of maximizing the amount of communication that overlaps computation when moving data to the process that is predefined to perform the computation.

In [18] the authors demonstrate that it is possible to dynamically schedule operations in a distributed environment using local DAGs. That is, each process runs a private runtime system and communicates with other processes regarding data dependencies. Similarly, our scheduling problem is also dynamic, but in our case all processes have a global knowledge of the data distribution and computation. Hence, no communication regarding data dependencies is required at all.

The time complexity of inserting a node into a DAG, G = (V, E), is O(V) in the worst case. Building the complete DAG is therefore O(V²). Removing one node from the DAG is O(V), which means that in the case where we simply want to schedule all operations in a legal order, the time complexity is O(V²). This is without minimizing the overall computation or the amount of communication hidden behind computation. We therefore conclude that a complete DAG approach is inadequate for runtime control of latency-hiding in our case.

We address the shortcoming of the DAG approach through a heuristic that manages dependencies on individual blocks. Instead of having a complete DAG, we maintain a list of depending operations for each data block. Still, the time complexity of scheduling all operations is O(V²) in the worst case, but the heuristic exploits the observation that in the common case a scientific application spreads a vectorized operation evenly between the data blocks in the involved arrays. Thus the number of dependencies associated with a single data block is manageable by a simple linked list. In Section 5.7, we will present a practical implementation of the idea.
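To make the cost argument concrete, here is a minimal sketch, under assumed names (Op, conflicts, insert), of the straightforward DAG construction discussed above: every new operation is compared against all operations already in the graph, giving O(V) work per insertion and O(V²) for the complete DAG.

    # Illustrative only; not the DistNumPy data structures.
    class Op:
        def __init__(self, name, reads, writes):
            self.name = name
            self.reads = set(reads)      # ids of data blocks read
            self.writes = set(writes)    # ids of data blocks written
            self.succs = []              # outgoing dependency edges

    def conflicts(a, b):
        """True if the operations access a common block and at least one writes it."""
        return bool(a.writes & (b.reads | b.writes) or b.writes & a.reads)

    def insert(dag, new_op):
        """O(V) insertion: compare new_op with every node already in the DAG."""
        for op in dag:
            if conflicts(op, new_op):
                op.succs.append(new_op)
        dag.append(new_op)

    dag = []
    insert(dag, Op("op1", reads=[], writes=[0]))    # write block 0
    insert(dag, Op("op2", reads=[0], writes=[1]))   # read block 0 -> depends on op1
    print([op.name for op in dag[0].succs])         # ['op2']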
5. DISTRIBUTED NUMERICAL PYTHON
The programming language Python combined with the numerical library NumPy [9] has become a popular numerical framework amongst researchers. It offers a high-level programming language in which to implement new algorithms that support a broad range of high-level operations directly on vectors and matrices.

The idea in NumPy is to provide a numerical extension to the Python language that enables the Python language to be both highly productive and high performing. NumPy provides not only an API for standardized numerical solvers, but also the option to develop new numerical solvers that are both implemented and efficiently executed in Python, much like the idea behind the Matlab [19] framework.

DistNumPy is a new version of NumPy that parallelizes array operations in a manner completely transparent to the user – from the perspective of the user, the difference between NumPy and DistNumPy is minimal. DistNumPy can use multiple processors through the communication library Message Passing Interface (MPI) [20]. However, DistNumPy does not use the traditional single-program multiple-data (SPMD) parallel programming model that requires the user to differentiate between the MPI-processes. Instead, the MPI communication in DistNumPy is fully invisible and the user needs no knowledge of MPI or any parallel programming model. The only difference in the API of NumPy and DistNumPy is the array creation routines. DistNumPy allows both distributed and non-distributed arrays to co-exist; the user must specify, as an optional parameter, if the array should be distributed. The following illustrates the only difference between the creation of a standard array and a distributed array:

    #Non-Distributed
    A = numpy.array([1,2,3])
    #Distributed
    B = numpy.array([1,2,3], dist=True)

5.1 Views
NumPy and DistNumPy use identical array syntax, which is based on the Python list syntax. The arrays are indexed positionally, 0 through length−1, where negative indexes are used for indexing in the reversed order. Like the list syntax in Python, it is possible to index multiple elements. All indexing that represents more than one element returns a view of the elements rather than a new copy of the elements. This means that an array does not necessarily represent a complete, contiguous block of memory. It is possible to have a hierarchy of arrays where only one array represents a complete contiguous block of memory and the other arrays represent a subpart of that memory. DistNumPy implements an array hierarchy where distributed arrays are represented by the following two data structures.

Array-base is the base of an array and has direct access to the content of the array in main memory. An array-base is created with all related meta-data when the user allocates a new distributed array, but the user will never access the array directly through the array-base. The array-base always describes the whole array and its meta-data such as array size and data type.

Array-view is a view of an array-base. The view can represent the whole array-base or only a sub-part of the array-base. An array-view can even represent a non-contiguous sub-part of the array-base. An array-view contains its own meta-data that describes which part of the array-base is visible. The array-view is manipulated directly by the user and from the user's perspective the array-view is simply a normal contiguous array.

Array-views are not allowed to refer to each other, which means that the hierarchy is flat with only two levels: array-base below array-view. However, multiple array-views are allowed to refer to the same array-base. This two-tier hierarchy is illustrated in Figure 1.

Figure 1: Reference hierarchy between the two array data structures and the main memory. Only the three array-views at the top of the hierarchy are visible from the perspective of the user.
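The two data structures described above can be sketched in plain Python as below. The field names are illustrative and the real DistNumPy implementation is a C extension, so this only shows the flat two-level hierarchy in which several views may share one base.

    # A minimal sketch of the array-base / array-view hierarchy (assumed fields).
    import numpy as np

    class ArrayBase:
        """Owns the memory; describes the whole array (size, dtype)."""
        def __init__(self, shape, dtype=float):
            self.shape = shape
            self.dtype = np.dtype(dtype)
            self.data = np.empty(shape, dtype=dtype)   # main-memory content

    class ArrayView:
        """What the user manipulates; refers to exactly one ArrayBase."""
        def __init__(self, base, start, stop, step=1):
            self.base = base                 # views may never refer to other views
            self.slice = slice(start, stop, step)

        def materialize(self):
            return self.base.data[self.slice]

    base = ArrayBase((6,))
    A = ArrayView(base, 2, 6)     # several views may share the same base
    B = ArrayView(base, 0, 4)
    print(A.materialize().shape, B.materialize().shape)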
5.2 Data Layout
The data layout in DistNumPy consists of three kinds of data blocks: base-blocks, view-blocks and sub-view-blocks, which make up a three-level abstraction hierarchy (Fig. 2).

Base-block is a block of an array-base. It maps directly into one block of memory located on one node. The memory block is not necessarily contiguous, but only one MPI-process has exclusive access to the block. Furthermore, DistNumPy makes use of an N-Dimensional Block Cyclic Distribution inspired by High Performance Fortran [5], in which base-blocks are distributed across multiple MPI-processes in a round-robin fashion.

View-block is a block of an array-view; from the perspective of the user a view-block is a contiguous block of array elements. A view-block can span multiple base-blocks and consequently also multiple MPI-processes. For an MPI-process to access a whole view-block it would have to fetch data from possibly remote MPI-processes and put the pieces together before accessing the block. To avoid this process, which may cause some internal memory copying, we divide view-blocks into sub-view-blocks.

Sub-view-block is a block of data that is a part of a view-block but is located on only one MPI-process. The driving idea is that all array operations are translated into a number of sub-view-block operations.

We will define an aligned array as an array that has a direct, contiguous mapping through the block hierarchy. That is, a distributed array in which the base-blocks, view-blocks and sub-view-blocks are identical. A non-aligned array is then a distributed array without this property.

Figure 2: An illustration of the block hierarchy that represents a 2D distributed array. The array is divided into three block-types: Base-, View- and Sub-View-blocks. The 16 base-blocks make up the base-array, which may be distributed between multiple MPI-processes. The 9 view-blocks make up a view of the base-array and represent the elements that are visible to the user. Each view-block is furthermore divided into four sub-view-blocks, each located on a single MPI-process.
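A minimal sketch of the round-robin placement of base-blocks described above, reduced to one dimension; block_of and owner_of are hypothetical helpers, not DistNumPy API, and the real distribution is N-dimensional block cyclic.

    def block_of(index, block_size):
        """Which base-block a global element index falls into."""
        return index // block_size

    def owner_of(block_id, nprocs):
        """Base-blocks are dealt out to MPI-processes in round-robin fashion."""
        return block_id % nprocs

    block_size, nprocs = 3, 2
    for i in range(12):
        b = block_of(i, block_size)
        print(f"element {i:2d} -> base-block {b} on process {owner_of(b, nprocs)}")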
5.3 Universal Function
An important mechanism in DistNumPy is a concept called a Universal function. A universal function (ufunc) is a function that operates on all elements in an array-view independently. That is, a ufunc is a vectorized wrapper for a function that takes a fixed number of scalar inputs and produces a fixed number of scalar outputs. E.g., addition is a ufunc that takes three array-views as arguments: two input arrays and one output array. For each element, the ufunc adds the two input arrays together and writes the result into the output array. Using a ufunc can result in a significant performance boost compared to native Python because the computation-loop is implemented in C and is executed in parallel.

Applying a ufunc operation on a whole array-view is semantically equivalent to performing the ufunc operation on each array-view block individually. This property makes it possible to perform a distributed ufunc operation in parallel. A distributed ufunc operation consists of four steps (a short sketch follows the list):

1. All MPI-processes determine the distribution of the view-block computation, which is strictly based on the distribution of the output array-view.
2. All MPI-processes exchange array elements in such a manner that each MPI-process can perform its computation locally.
3. All MPI-processes perform their local computation.
4. All MPI-processes send the altered array elements back to the original locations.
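The four steps above can be sketched as follows, with the MPI transfers of steps 2 and 4 replaced by simple stand-in functions so the example runs stand-alone; distribute_by_output, fetch_inputs and write_back are illustrative names, not DistNumPy API.

    import numpy as np

    def distribute_by_output(out_blocks, rank, nprocs):
        """Step 1: the computation is placed where the output view-blocks live."""
        return [i for i in range(len(out_blocks)) if i % nprocs == rank]

    def fetch_inputs(block_id, inputs):
        """Step 2: exchange elements so the block can be computed locally."""
        return [inp[block_id] for inp in inputs]        # stand-in for MPI receives

    def write_back(block_id, result, out_blocks):
        """Step 4: send altered elements back to their original location."""
        out_blocks[block_id] = result                   # stand-in for MPI sends

    def distributed_ufunc(ufunc, inputs, out_blocks, rank=0, nprocs=2):
        for block_id in distribute_by_output(out_blocks, rank, nprocs):  # step 1
            local_inputs = fetch_inputs(block_id, inputs)                # step 2
            result = ufunc(*local_inputs)                                # step 3
            write_back(block_id, result, out_blocks)                     # step 4
        return out_blocks

    a = [np.arange(3), np.arange(3, 6)]     # two input "view-blocks" per array
    b = [np.ones(3), np.ones(3)]
    out = [None, None]
    print(distributed_ufunc(np.add, [a, b], out, rank=0, nprocs=1))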
5.4 Latency-Hiding
The standard approach to hiding communication latency behind computation in message-passing is a technique known as double buffering. The implementation of double buffering is straightforward when operating on a set of data blocks that all have identical sizes. The communication of one data block is overlapped with the computation of another, already communicated, data block, and since the sizes of all the data blocks are identical, all iterations are identical.

In DistNumPy, a straightforward double buffering approach works well for ufuncs that operate on aligned arrays, because it translates into communication and computation operations on whole view-blocks, which does not benefit from latency-hiding inside view-blocks. However, for ufuncs that operate on non-aligned arrays this is not the case, because the view-block is distributed between multiple MPI-processes. In order to achieve good scalable performance the DistNumPy implementation must therefore introduce latency-hiding inside view-blocks. For example, the computation of a view-block in Figure 2 can make use of latency-hiding by first initiating the communication of the three non-local sub-view-blocks, then computing the local sub-view-block and finally computing the three communicated sub-view-blocks.
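A hedged mpi4py sketch of this pattern for two processes is shown below: the transfer of the non-local sub-view-block is initiated first, the local sub-view-block is computed while the message is in flight, and the received data is processed last. The use of mpi4py, the buffer sizes and the tags are assumptions made for the sake of the example, not DistNumPy internals (run with, e.g., mpiexec -n 2).

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    peer = 1 - rank                   # two processes exchanging one sub-block each

    local_sub = np.full(1000, rank, dtype='d')   # the sub-view-block we own
    remote_sub = np.empty(1000, dtype='d')       # buffer for the peer's sub-block

    # Aggressively initiate communication ...
    recv_req = comm.Irecv(remote_sub, source=peer, tag=0)
    send_req = comm.Isend(local_sub, dest=peer, tag=0)

    # ... overlap it with the computation of the co-located sub-view-block ...
    local_result = np.sin(local_sub)

    # ... and only then wait and compute on the communicated sub-view-block.
    MPI.Request.Waitall([recv_req, send_req])
    remote_result = np.sin(remote_sub)

    print(rank, local_result[0], remote_result[0])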
5.5 The Dependency Graph
One of the key contributions in this paper is a latency-hiding model that, by maintaining data dependencies between scheduled operations, is able to aggressively initiate communication and lazily evaluate tasks, in order to allow maximal time for the communication to finish before entering a wait state. In this section, we will demonstrate the idea of the model by giving an example of a small 3-point stencil computation implemented in DistNumPy (Fig. 3). For now, we will use a traditional DAG for handling the data dependencies. Later we will describe the implementation of the heuristic that enables us to manage dependencies more efficiently.

    import numpy
    M = numpy.array([1,2,3,4,5,6], dist=True)
    N = numpy.empty((6), dist=True)
    A = M[2:]
    B = M[0:4]
    C = N[1:5]
    C = A + B

Figure 3: This is an example of a small 3-point stencil application.

Figure 4: The data layout of the two arrays M and N and the three array-views A, B and C in the 3-point stencil application (Fig. 3). The arrays are distributed between two nodes using a block-size of three.

Figure 5: Illustration of a DAG that represents the dependencies in a 3-point stencil application (Fig. 3). The DAG consists of 12 operations, op1 to op12, divided between two processes.

Additionally, it should be noted that the parallel processes do not need to exchange dependency information since they all have full knowledge of the data distribution.

Two processes are executing the stencil application and DistNumPy distributes the two arrays, M and N, using a block-size of three. This means that three contiguous array elements are located on each process (Fig. 4). Using a DAG as defined in section 4, figure 5 illustrates the dependencies between 12 operations that together constitute the execution. Initially the following six operations are ready:

    R := {op1, op2, op3, op4, op9, op10}

Afterwards, without the need of communication, two more operations, op5 and op8, may be executed. Thus, it is possible to introduce latency-hiding by initiating the communication, op6 and op7, before evaluating operations op5 and op8. The amount of communication latency hidden depends on the computation time of op5 and op8 and the communication time of op6 and op7.

We will strictly prioritize between operations based on whether they involve communication or computation – giving priority to communication over computation. Furthermore, we will assume that all operations take the same amount of time, which is a reasonable assumption in DistNumPy since it divides array operations into small blocks that often have the same computation or communication time.

5.6 Lazy Evaluation
Since Python is an interpreted dynamic programming language, it is not possible to schedule communication and computation operations at compile time. Instead, we introduce lazy evaluation as a technique to determine the communication and computation operations used in the program at runtime.

During the execution of a DistNumPy program all MPI-processes record the requested array operations rather than applying them immediately. The MPI-processes maintain the operations in a convenient data structure and at a later point all MPI-processes apply the operations. The idea is that by having a set of operations to carry out, it may be possible to schedule communication and computation operations that have no mutual dependencies in parallel.

We will only introduce lazy evaluation for Python operations that involve distributed arrays. If the Python interpreter encounters operations that do not include DistNumPy arrays, the interpreter will execute them immediately. At some point, the Python interpreter will trigger DistNumPy to execute all previously recorded operations. This mechanism of executing all recorded operations we will call an operation flush, and the following three conditions may trigger it (a small sketch follows the list):

• The Python interpreter issues a read from distributed data, e.g. when the interpreter reaches a branch statement.
• The number of delayed operations reaches a user-defined threshold.
• The Python interpreter reaches the end of the program.
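A minimal sketch of this recording mechanism is given below; LazyRuntime, record, read and the flush threshold are illustrative names, and in DistNumPy the corresponding logic lives inside the C extension.

    FLUSH_THRESHOLD = 1000       # user-defined limit on delayed operations

    class LazyRuntime:
        def __init__(self):
            self.delayed = []    # recorded but not yet executed operations

        def record(self, func, *args):
            """Record an operation that involves distributed arrays."""
            self.delayed.append((func, args))
            if len(self.delayed) >= FLUSH_THRESHOLD:
                self.flush()     # trigger 2: too many delayed operations

        def read(self, value):
            """Called when the interpreter needs actual data, e.g. for a branch."""
            self.flush()         # trigger 1: a read from distributed data
            return value

        def flush(self):
            """Apply every recorded operation; here simply executed in order."""
            # (trigger 3, end of program, could be hooked up via atexit)
            for func, args in self.delayed:
                func(*args)
            self.delayed.clear()

    rt = LazyRuntime()
    rt.record(print, "op 1 executed at flush time")
    rt.record(print, "op 2 executed at flush time")
    rt.read(None)                # forces the flush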
5.7 The Dependency System
The main challenge when introducing lazy evaluation is to implement a dependency system that schedules operations in a performance-efficient manner while the implementation keeps the overhead at an acceptable level.

Our first lazy evaluation approach makes use of a DAG-based data structure to contain all recorded operations. When an operation is recorded, it is split across the sub-view-blocks that are involved in the operation. For each such operation, a DAG node is created just as in figures 3 and 4. Beside the DAG, our dependency system also consists of a ready queue, which is a queue of recorded operations that do not have any dependencies. The ready queue makes it possible to find operations that are ready to be executed in a time complexity of O(1).

Operation Insertion. The recording of an operation triggers an insertion of a new node into the DAG. A straightforward approach will simply implement insertion by comparing the new node with all the nodes already located in the DAG. If a dependency is detected, the implementation adds an edge between the nodes. The time complexity of such an implementation is O(n), where n is the number of operations in the DAG, and the construction of the complete DAG is O(n²).

Operation Flush. To achieve good performance the operation flush implementation must maximize the amount of communication that is overlapped by computation. Therefore, the flush implementation initiates communication at the earliest point in time and only does computation when all communication has been initiated. Furthermore, to make sure that there is progress in the MPI layer, it checks for finished communication in between multiple computation operations. The following is the flow of our operation flush algorithm (a self-contained sketch follows the invariants below):

1. Initiate all communication operations in the ready queue.
2. Check in a non-blocking manner if some communication operations have finished, and remove finished communication operations from the ready queue and the DAG. Furthermore, register operations that now have no dependencies into the ready queue.
3. If there are only computation operations in the ready queue, execute one of them and remove it from the ready queue and the DAG.
4. Go back to step one if there are any operations left in the ready queue, else we are finished.

The algorithm maintains the following three invariants:

1. All operations that are ready are located in the ready queue.
2. We only start the execution of a computation node when there is no communication node in the ready queue.
3. We only wait for communication when the ready queue has no computation nodes.
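The following self-contained sketch mirrors the flush algorithm and its invariants; real MPI requests are replaced by a simple countdown so the example runs stand-alone, and the Op class and its fields are illustrative only.

    class Op:
        def __init__(self, name, is_comm, deps=()):
            self.name = name
            self.is_comm = is_comm
            self.deps = set(deps)              # names of operations this one waits for
            self.ticks = 2 if is_comm else 0   # fake transfer time for the example

    ops = [
        Op("recv_remote_block", True),
        Op("compute_local_block", False),
        Op("compute_remote_block", False, deps=["recv_remote_block"]),
    ]

    done, in_flight = set(), set()

    def ready():
        return [o for o in ops if not o.deps and o not in done and o not in in_flight]

    def finish(op):
        done.add(op)
        for other in ops:
            other.deps.discard(op.name)        # release operations that waited for op

    while len(done) < len(ops):
        # Step 1: initiate every communication operation in the ready queue.
        for op in [o for o in ready() if o.is_comm]:
            print("initiate", op.name)
            in_flight.add(op)
        # Step 2: non-blocking check for finished communication (MPI_Testsome analogue).
        for op in [o for o in in_flight if o.ticks == 0]:
            in_flight.discard(op)
            print("finished", op.name)
            finish(op)
        # Step 3: execute a computation only if no communication waits to be started.
        comps = [o for o in ready() if not o.is_comm]
        if comps:
            print("compute ", comps[0].name)
            finish(comps[0])
        elif in_flight:
            # Invariant 3: wait (here: let time pass) only when nothing can be computed.
            for op in in_flight:
                op.ticks -= 1

Running the sketch prints the communication being initiated before the local computation and the dependent computation only after the transfer has finished, which is the ordering the three invariants enforce.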
5.7.1 Deadlocks
To avoid deadlocks an MPI-process will only enter a blocking state when it has initiated all communication and finished all ready computation. This guarantees a deadlock-free execution, but it also reduces the flexibility of the execution order. Still, it is possible to check for finished communication using non-blocking functions such as MPI_Testsome().

The naïve approach to evaluating a DAG is simply to first evaluate all nodes that have no dependencies and then remove the evaluated nodes from the graph and start over – similar to the traditional BSP model. However, this approach may result in a deadlock, as illustrated in figure 6.

Figure 6: Illustration of the naïve evaluation approach. The result is a deadlock in the first iteration since both processes are waiting for the receive-node to finish, but that will never happen because the matching send-node is in the second iteration.

5.7.2 Dependency Heuristic
Experiments with lazy evaluation using the DAG-based data structure show that the overhead associated with the creation of the DAG is very time consuming and becomes the dominating performance factor. We therefore introduce a heuristic to speed up the common case. We base the heuristic on the following two observations:

• In the common case, a scientific DistNumPy application spreads a computation evenly between all sub-view-blocks in the involved arrays.
• Operation dependencies are only possible between sub-view-blocks that are part of the same base-block.

The heuristic is that instead of having a DAG, we introduce a prioritized operation list for each base-block. The assumption is that, in the common case, the number of operations associated with a base-block is manageable by a linked list.

We implement the heuristic using the following algorithm. A number of operation-nodes and access-nodes represent the operation dependencies. The operation-node contains all information needed to execute the operation on a set of sub-view-blocks and there is a pointer to an access-node for each sub-view-block. The access-node represents memory access to a sub-view-block, which can be either reading or writing. E.g., the representation of an addition operation on three sub-view-blocks is two read access-nodes and one write access-node (Fig. 7).

Our algorithm places all access-nodes in dependency-lists based on the base-block that they are accessing. When an operation-node is recorded, each associated access-node is inserted into the dependency-list of the sub-view-blocks they access. Additionally, the number of accumulated dependencies the access-nodes encounter is saved as the operation-node's reference counter.

All operation-nodes that are ready for execution have a reference count of zero and are in the ready queue. Still, they may have references to access-nodes in dependency-lists – only when we execute an operation-node will we remove the associated access-nodes from the dependency-lists. Following the removal of an access-node we traverse the dependency-list and for each depending access-node we reduce the associated reference counter by one. Because of this, the reference counter of another operation-node may be reduced to zero, in which case we move the operation-node to the ready queue and the algorithm starts all over.

Figure 7 goes through all the structures that make up the dependency system and figure 8 illustrates a snapshot of the dependency system when executing the 3-point stencil application.

Figure 7: The structures used in the dependency system:
• All access-nodes that access the same base-block are linked together in a dependency-list. The order of the list is simply based on the time of node insertion (descending order). Additionally, an access-node contains the information needed to determine whether it depends on other access-nodes.
• An operation-node has a pointer to all access-nodes that are involved in the operation. Attached to an operation is a reference counter that specifies the number of dependencies associated with the operation. When the counter reaches zero the operation is ready for execution. At some point, when the operation has been executed, the operation-node and all access-nodes are completely removed from the dependency system.
• A base-block simply contains a pointer to the first access-node in the dependency-list.

Figure 8: Illustration of the dependency system when executing the 3-point stencil in figures 3, 4 and 5. The illustration is a snapshot of the dependency system on node 0 after the creation of all the arrays. Note that since the block size is three, node 0 only has one block of each array.
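The heuristic described above can be sketched as follows: each base-block keeps a dependency-list of access-nodes and each operation-node carries a reference counter of unresolved dependencies. Class and field names are illustrative, not the actual DistNumPy structures, and Python lists stand in for the linked lists.

    class AccessNode:
        def __init__(self, op, base_block, write):
            self.op, self.base_block, self.write = op, base_block, write

    class OperationNode:
        def __init__(self, name):
            self.name = name
            self.accesses = []     # one access-node per sub-view-block
            self.refcount = 0      # unresolved dependencies

    dep_lists = {}                 # base-block id -> list of access-nodes
    ready_queue = []

    def conflicts(a, b):
        return a.write or b.write  # read/write or write/write on the same base-block

    def record(name, accesses):
        """Record an operation; accesses = [(base_block_id, is_write), ...]."""
        op = OperationNode(name)
        for block_id, write in accesses:
            node = AccessNode(op, block_id, write)
            lst = dep_lists.setdefault(block_id, [])
            op.refcount += sum(1 for other in lst if conflicts(node, other))
            lst.append(node)
            op.accesses.append(node)
        if op.refcount == 0:
            ready_queue.append(op)
        return op

    def execute(op):
        """Remove the operation's access-nodes and release dependent operations."""
        for node in op.accesses:
            lst = dep_lists[node.base_block]
            lst.remove(node)
            for other in lst:
                if conflicts(node, other):
                    other.op.refcount -= 1
                    if other.op.refcount == 0:
                        ready_queue.append(other.op)

    op1 = record("write block 0", [(0, True)])
    op2 = record("read block 0, write block 1", [(0, False), (1, True)])
    print([op.name for op in ready_queue])   # only op1 is ready
    execute(ready_queue.pop(0))
    print([op.name for op in ready_queue])   # now op2 is ready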
6. EXPERIMENTS
To evaluate the performance impact of the latency-hiding model introduced in this paper, we will conduct performance benchmarks using DistNumPy and NumPy (NumPy version 1.3.0). The benchmark is executed on an Intel Core 2 Quad cluster (Table 1) and for each application we calculate the speedup of DistNumPy compared to NumPy. The problem size is constant through all the executions, i.e. we are measuring strong scaling. To measure the performance impact of the latency-hiding, we use two different setups: one with latency-hiding and one that uses blocking communication. For both setups we measured the time spent on waiting for communication, i.e. the communication latency not hidden behind computation.

In this benchmark we utilize the cluster in a by-node fashion. That is, from one to sixteen CPU-cores we start one MPI-process per node (no SMP) and above sixteen CPU-cores we start multiple MPI-processes per node. The MPI library used throughout this benchmark is OpenMPI (version 1.5.1).

Table 1: Hardware specifications
    CPU                 Intel Xeon E5345
    CPU Frequency       2.33 GHz
    CPUs per node       2
    Cores per CPU       4
    Memory per node     16 GB
    Number of nodes     16
    Network             Gigabit Ethernet

The benchmark consists of the following eight Python applications.

Fractal: Computation of the Mandelbrot Set. From a NumPy tutorial written by Walt [21] (Fig. 11).

Black-Scholes: Computation of the Black-Scholes model [22] implemented in NumPy (Fig. 9 and 12).

Both Fractal and Black-Scholes are embarrassingly parallel applications and we expect that latency-hiding will not improve the performance.

N-body: A Newtonian N-body simulation that uses a naïve algorithm that computes all body-body interactions. The NumPy implementation is a translation of a MATLAB application by Casanova [23] (Fig. 13).

kNN: A naïve implementation of a k nearest neighbor search (Fig. 14).

The N-body and kNN applications have a computation complexity of O(n²). This indicates that the two applications should have good scalability even without latency-hiding.

Lattice Boltzmann 2D: Lattice Boltzmann model of channel flow in 2D using the D2Q9 model. It is a translation of a MATLAB application by Latt [24] (Fig. 15).

Lattice Boltzmann 3D: Lattice Boltzmann model of a fluid in 3D using the D3Q19 model. It is a translation of a MATLAB application by Haslam [25] (Fig. 16).

The two Lattice Boltzmann applications have a computation complexity of O(n). More communication is therefore needed and we expect that latency-hiding will improve the performance.

Jacobi: The Jacobi method is an iterative algorithm for determining the solutions of a system of linear equations (Fig. 17).

Jacobi Stencil: A Jacobi-style iterative computation implemented as a 5-point stencil operation (Fig. 10 and 18). In the Jacobi Stencil application four adjacent elements are required for each element update. We expect latency-hiding to be very important for good scalability.

    # Black Scholes Function
    # S: Stock price
    # X: Strike price
    # T: Years to maturity
    # r: Risk-free rate
    # v: Volatility
    def BlackScholes(CallPutFlag,S,X,T,r,v):
        d1 = (log(S/X)+(r+v*v/2.)*T)/(v*sqrt(T))
        d2 = d1-v*sqrt(T)
        if CallPutFlag=='c':
            return S*CND(d1)-X*exp(-r*T)*CND(d2)
        else:
            return X*exp(-r*T)*CND(-d2)-S*CND(-d1)

Figure 9: This is the Black-Scholes function in the Black-Scholes benchmark, where CND is the cumulative normal distribution. Note that there is no source code difference between a parallel and a sequential version – it is regular Python/NumPy source code.

    cells = full[1:-1, 1:-1]
    up    = full[0:-2, 1:-1]
    down  = full[2:  , 1:-1]
    left  = full[1:-1, 0:-2]
    right = full[1:-1, 2:  ]
    while epsilon < delta:
        work[:] = cells
        work += 0.2 * (up+down+left+right)
        diff = absolute(cells - work)
        delta = sum(diff)
        cells[:] = work

Figure 10: This is the kernel in the Jacobi Stencil benchmark. First, we define a view of the full array for each point in the stencil, five in this case, and then we apply the stencil until it converges. Note that there is no source code difference between a parallel and a sequential version – it is regular Python/NumPy source code.
Figure 11: Speedup of the Fractal application.
Figure 12: Speedup of the Black-Scholes application.
Figure 13: Speedup of the N-body application.
Figure 14: Speedup of the kNN application.
Figure 15: Speedup of the Lattice Boltzmann 2D application.
Figure 16: Speedup of the Lattice Boltzmann 3D application.
Figure 17: Speedup of the Jacobi application.
Figure 18: Speedup of the Jacobi Stencil application.
Figure 19: Speedup of the N-body application that compares by node, in which the maximum number of CPU-cores is used, and by core, in which the minimum number of nodes is used. Note that the by node graph is identical to the latency-hiding graph in figure 13.
(Each speedup figure plots the speedup of DistNumPy compared to NumPy and the relative waiting time in percentage of total time, for both the blocking and the latency-hiding setup, as a function of the number of CPU-cores.)

6.1 Discussion
Overall, the benchmarks show that DistNumPy has both good performance and scalability. However, the scalability worsens somewhat at 32 CPU-cores and above, which correlates with the use of multiple CPU-cores per node. Because of this distinct performance profile, we separate the following discussion into results executed on one to sixteen CPU-cores (one CPU-core per node) and the results executed on 32 CPU-cores to 128 CPU-cores (multiple CPU-cores per node).

6.1.1 One to Sixteen CPU-cores
The benchmarks clearly show that DistNumPy has both good performance and scalability. Actually, half of the Python applications achieve super-linear speedup at sixteen CPU-cores. This is possible because DistNumPy, as opposed to NumPy, will try to reuse memory allocations by lazily de-allocating arrays. DistNumPy uses a very naïve algorithm that simply checks if a new array allocation is identical to a just de-allocated array. If that is the case, one memory allocation and de-allocation is avoided.
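As an illustration of the lazy de-allocation idea, the sketch below keeps the most recently freed buffer and reuses it when the shape and dtype of a new allocation match; this is a guess at the mechanism described above, not the actual DistNumPy allocator, and dist_alloc/dist_free are hypothetical names.

    import numpy as np

    _last_freed = None           # (shape, dtype, buffer) of the last freed array

    def dist_free(array):
        global _last_freed
        _last_freed = (array.shape, array.dtype, array)   # de-allocate lazily

    def dist_alloc(shape, dtype=float):
        global _last_freed
        if _last_freed is not None:
            f_shape, f_dtype, buf = _last_freed
            if f_shape == tuple(shape) and f_dtype == np.dtype(dtype):
                _last_freed = None
                return buf       # reuse: one allocation/de-allocation pair avoided
        return np.empty(shape, dtype=dtype)

    a = dist_alloc((1000, 1000))
    dist_free(a)
    b = dist_alloc((1000, 1000))  # reuses the buffer of a
    print(b is a)                 # True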
In the two embarrassingly parallel applications, Fractal and Black-Scholes, we see very good speedup as expected. Since the use of communication in both applications is almost non-existent, the latency-hiding makes no difference. The speedup achieved at sixteen CPU-cores is 18.8 and 15.4, respectively.

The two naïve implementations of N-body and kNN do not benefit from latency-hiding. In N-body the dominating operations are matrix-multiplications, which is a native operation in NumPy and in DistNumPy is implemented as a specialized operation using the parallel algorithm SUMMA [26] and not as a combination of ufunc operations. Since both the latency-hiding and the blocking execution use the same SUMMA algorithm, the performance is very similar. However, because of the overhead associated with latency-hiding, the blocking execution performs a bit better. The speedup achieved at sixteen CPU-cores is 17.2 with latency-hiding and 17.8 with blocking execution. Similarly, the performance difference between latency-hiding and blocking in kNN is minimal – the speedup achieved at sixteen CPU-cores is 12.5 and 12.6, respectively. The relatively modest speedup in kNN is the result of poor load balancing. At eight and sixteen CPU-cores the chosen problem is not divided evenly between the processes.

Latency-hiding is somewhat beneficial to the two Lattice Boltzmann applications. The waiting time percentage on sixteen CPU-cores goes from 19% to 13% in Lattice Boltzmann 2D, and from 16% to 9% in Lattice Boltzmann 3D. However, the overall impact on the speedup is not that great, primarily because of the overhead associated with latency-hiding.

Finally, latency-hiding introduces a drastically improved performance to the two communication intensive applications, Jacobi and Jacobi Stencil. The waiting time percentage goes from 54% to 2% and from 62% to 9%, respectively. Latency-hiding also has a major impact on the overall speedup, which goes from 5.9 to 12.8 and from 7.7 to 18.4, respectively. In other words, latency-hiding more than doubles the overall speedup and CPU utilization.

6.1.2 Scaling above Sixteen CPU-cores
Overall, the performance worsens at 32 CPU-cores – particularly at 64 CPU-cores and above, where the CPU utilization is below 50%. One reason for this performance degradation is the characteristic of strong scaling. In order to have considerably more data blocks than MPI-processes, the size of the data distribution blocks decreases as the number of executing CPU-cores increases. Smaller data blocks reduce the performance since the overhead in DistNumPy is proportional to the size of a data block.

However, smaller data blocks are not enough to justify the observed performance degradation. The von Neumann bottleneck [27] associated with main memory also hinders scalability. This is evident when looking at Figure 19, which is a speedup graph of N-body that compares by node and by core execution.
