ebook img

DTIC AD1004530: Software Issues in High-Performance Computing and a Framework for the Development of HPC Applications PDF

13 Pages·0.17 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview DTIC AD1004530: Software Issues in High-Performance Computing and a Framework for the Development of HPC Applications

SOFTWARE ISSUES IN HIGH-PERFORMANCE COMPUTING AND A FRAMEWORK FOR THE DEVELOPMENT OF HPC APPLICATIONS PETER H. MILLS, LARS S. NYLAND, JAN F. PRINS, AND JOHN H. REIF Abstract. We identifythe following key problemsfaced by HPC software: (1) the large gap betweenHPC designand implementationmodelsin appli- cation development,(2) achieving high performancefor a single application ondi(cid:11)erentHPCplatforms,and(3)accommodatingconstantchangesinboth problemspeci(cid:12)cationandtargetarchitectureas computationalmethodsand architecturesevolve. Toattacktheseproblems,wesuggestanapplicationdevelopmentmethodol- ogyinwhichhigh-levelarchitecture-independentspeci(cid:12)cationsareelaborated, throughaniterativere(cid:12)nementprocesswhichintroducesarchitecturaldetail, intoaformwhichcanbetranslatedtoe(cid:14)cientlow-levelarchitecture-speci(cid:12)c programmingnotations. A tree-structureddevelopmentprocesspermitsmul- tiplearchitecturestobetargetedwithimplementationstrategiesappropriate to eacharchitecture,and also providesa systematicmeans to accommodate changesinspeci(cid:12)cationandtargetarchitecture. WedescribetheProteussystem,anapplicationdevelopmentsystembased onawide-spectrumprogrammingnotationcoupledwithanotionofprogram re(cid:12)nement.Thissystemsupportstheabovedevelopmentmethodologyvia: (1) theconstructionofthespeci(cid:12)cationandthesuccessivedesignsinauniformno- tation,whichcanbeinterpretedtoprovideearlyfeedbackonfunctionalityand performance,(2)migrationof thedesigntowardsspeci(cid:12)carchitecturesusing formalmethodsofprogramre(cid:12)nement,(3)techniquesforperformanceassess- ment in which the computationalmodelvaries with the level of re(cid:12)nement, and (4) the automatic translation of suitably re(cid:12)ned programs to low-level parallelvirtualmachinecodesfore(cid:14)cientexecution. 1. HPC Software Issues Whilelarge scale parallelprocessors have greatly increased the performance po- tential for HPC, they have also introduced substantial new software development problems. We identify three problems that we see as the largest obstacles to the development of HPC software. Currently, architecture-speci(cid:12)c notations are largely used to program parallel machines. These low-levelnotationsre(cid:13)ect speci(cid:12)c features ofatargetarchitecture such asshared vs. distributed memory,SIMDvs. MIMDcontrolorganization,and di(cid:11)erent formsof memoryand communicationlocality. This work was supportedby ARPA viaONR contractsN00014-91-J-1985and N00014-92-C- 0182,andbyRomeLabsContractF30602-94-C-0037. 1 2 MILLS, NYLAND, PRINS, AND REIF Although such low-level notations are needed to provide the detailed access to the machinery necessary to orchestrate high performance, they are not well suited tosustainingalargedesignanddevelopmentactivity. Theyareunabletogeneralize theexpressionofconcurrency withtheconsequence thateachsoftwaredevelopment step involves large and tedious low-level programming e(cid:11)ort whose e(cid:11)ect may be di(cid:14)cult to analyze. On the other hand, higher-level parallel computing models that might be more suitable for the development of parallel software tend to rely on generalizations of concurrency that are unrealistic or inaccurate with respect to actual machine performance. Such models can easily lead to designs that are completely impractical. Thusthe (cid:12)rstproblemisthatinordertoconstruct complexparallelapplications that achieve high performance, developers must bridge the gap between these two levels of expression in some fashion. The second problem has become more apparent as we enter the second or even third generation of parallel computers: di(cid:11)erent architectures and di(cid:11)erent-sized machines are now availableto researchers and hence existing applications need re- targetinginorder to \track"the HPC revolution. However,the low-levelnotations in which applications have been developed lack portability between architectures. It is not a matter of simple translation between notations: the e(cid:11)ect of the target architecture can be pervasive. At a high level, di(cid:11)erent architectures may require fundamentallydi(cid:11)erent algorithmsto achieve optimalperformance; at a low level, overallperformanceexhibitsgreatsensitivitytochangesincommunicationtopology and memoryhierarchy. The retargeting problemsinvolvedare su(cid:14)ciently complex that automatictranslation and optimizationare unlikely to o(cid:11)er a comprehensive solution. Consider, for example, a molecular dynamics simulation package with which we have experience (the Cedar system, developed by J. Hermans at UNC-CH). Molecular dynamics simulationis a grand-challenge problem, of great importance to areas such as drug-design. The molecular dynamics package in question was originally developed to run on Cray computers. It was subsequently adapted for use on a number of other HPC platforms including mini-supercomputers, high- performanceworkstations,theMasParComputerfamily,theKendallSquarefamily, and workstation clusters supporting the PVM services. Most of these versions remainimportantbecause users ofthe Cedar software at di(cid:11)erent sites haveaccess to di(cid:11)erent kinds and sizes of HPC machines. Fundamentallydi(cid:11)erent algorithms are involved in all of these versions to achieve the best performance. This is a result not just of architectural di(cid:11)erences but also of the degree of parallelism sought relative to the problemsize. The third problem emerges when we consider that a scienti(cid:12)c application such astheCedarsystemiscontinuouslyevolvingasnewscienti(cid:12)candalgorithmicideas need to be incorporated. In this case smallchanges in the functional speci(cid:12)cation can lead to large and very di(cid:11)erent program changes in each of the architecture- speci(cid:12)c implementations. The maintenance of all these implementations quickly becomesintractablewiththeresultthatsomearchitecture-speci(cid:12)c versionsbecome scienti(cid:12)callyobsolete while other scienti(cid:12)callycurrent versions are not ableto take advantage of the full range of HPC resources. We see similarproblems appearing in the development and maintenance commercialand militaryapplications. Thus,tosummarize,thekeyproblemsweseeinthedevelopmentofHPCsoftware SOFTWARE ISSUES AND DEVELOPMENT OF HPC APPLICATIONS 3 are: (cid:15) bridging the gap between high-level parallel design models and low-level execution models, (cid:15) targeting a single application to multipleHPC platforms,and (cid:15) managingevolution in both the application and the target architectures. We believe that high level programming models and notations are critical to the expression and exploration of complex designs. They also promote portability acrossarchitectures (oratleastmakeexplicitatahigherlevelthedesigndi(cid:11)erences for di(cid:11)erent architectures). However, for a programming methodology based on high-levelnotationsto be practical itmusteventually be connected to the kinds of detailed notations that access the lower-level performance issues. We believe that itisamistaketouse onlyhigh-leveloronlylow-levelmodelsandnotation;auseful frameworkmust accommodateboth views. Thegrowingcrisisinsoftwareversionmanagementashighperformanceapplica- tionsevolveandareportedtoavarietyofHPCarchitectures overtheirlifetimealso suggests that di(cid:11)erent views of the development are needed: a high-level view is preferred for changes in speci(cid:12)cation, while lower-level views are moreappropriate for architecture and machinechanges. Theremainderofthispaperisorganizedasfollows. InSection2weoutlineasoft- ware developmentmethodologythat addresses these problemsby providingfor the architectural specialization ofhigh-level designs couched in a single wide-spectrum notation, the exploitation of parallel virtual machines as translation targets, and theearlyassessmentofprototypesusingahierarchyofparallelcomputationalmod- els matched to the level of design. In section 3 we describe the Proteus system, an e(cid:11)ort under way within our group to support the software development methodol- ogy. Section 4 contains a brief overview of related work. We conclude the paper with a discussion of requirements for realizing the above framework. 2. A Refinement-Based Approach to HPC Software Development Toaddress theproblemsraised inthe previoussection,we proposeare(cid:12)nement- based development methodology. Informally,by re(cid:12)nement we mean the inclusion of additional detail. The approach starts with a speci(cid:12)cation that is initially re- (cid:12)ned into a high-levelarchitecture independent design and fromthere successively broughtclosertoatargetarchitecturethroughre(cid:12)nementstepswhichincrementally incorporate architectural details. The most re(cid:12)ned version corresponds directly to an implementationin a low-levelarchitecture-speci(cid:12)c notation. 2.1. Treestructuredprogramdevelopment. Werepresent thesuccessive ver- sions developed as nodes ina graph and add an edge between versions when one is developed fromanother viare(cid:12)nement, as shown inFigure 1. A set ofarchitecture dependent implementations of a single problem can be constructed using a tree structured graph. At internal nodes of the tree, each di(cid:11)erent re(cid:12)nement re(cid:13)ects a design decisions that essentially directs the development toward one or a group of speci(cid:12)c architectures (e.g. shared memoryor distributed memorymachines). If we express the re(cid:12)nement steps in a formalmanner, for example as program transformations, then there is the possibility of applying these transformations 4 MILLS, NYLAND, PRINS, AND REIF Specification Architecture-independent design Architecture-dependent design Implementation Figure 1. Programdevelopment tree automaticallyinthe formofdevelopment\tactics",asisdoneinprogramsynthesis systems such as KIDS [22], and the CIP system [19]. Specification P0 High-level Refinement P2 P1 Programs P4 P3 Low-level Translation C + CVL Parallel Programs C + PVM C + Posix Threads Figure 2. Development tree with translations to parallelimplementations An automated approach can be particularly important in this setting because therearemoreversionstodevelopfromaspeci(cid:12)cationthanistypicalinthesynthe- sisofaconventional(sequential)application. Targetinganewarchitecture requires the additionof a new sequence of re(cid:12)nements starting froman appropriate level of design to an architecture-speci(cid:12)c implementation. Generating a new set of imple- mentations followinga change to the high-level speci(cid:12)cation (for example, adding anew componentina simulation),in principlerequires thatwe \replay"allre(cid:12)ne- ment steps starting with the new speci(cid:12)cation to generate a new tree of versions that incorporates the changes for each of the targeted architectures. Of course, automated program synthesis technology and design capture are re- search areas that require a great deal of additional development to be generally applicable. A more pragmatic view is that a tree structured development is an organizational concept and that the actual synthesis, re(cid:12)nement and replay steps are conducted using a mixture of manual and automatic techniques. In particu- lar, it is likely that automated techniques may apply only near the leaves of the development tree. 2.2. Wide-spectrum concurrent programming notation. We believe that the best waytosupport the capabilitiesdescribed is torepresent allversions inthe development tree using a single wide-spectrum concurrent programminglanguage. SOFTWARE ISSUES AND DEVELOPMENT OF HPC APPLICATIONS 5 Such a language, by unifying the various parallel programmingparadigms, would bothbeabletocaptureconcurrency inanabstracthigh-levelfashionandtoprovide auniformvehicleforre(cid:12)nementtowardsparticulararchitecturesdemandingvarious paradigmssuch as data and process parallelism. Usingsuchalanguage,prototypescanbeconstructedatanarchitecture-independent level and evaluated using an interpreter and tools. Re(cid:12)nement of such prototypes consists of program modi(cid:12)cations or transformations that result in restrictions in theuse oftheconcurrency constructs. Such restrictions express the adaptationofa high-leveldesigntoconstructs e(cid:14)cientlysupportedonaspeci(cid:12)c architecture. Since theresultingprogramwouldstillbeinthesamenotation,theinterpreter canagain be used toassess its functionalityand someperformance measures. Programsthat are suitably re(cid:12)ned should be automatically translatable to e(cid:14)cient parallel pro- grams in low-level architecture-speci(cid:12)c notations and run directly on the targeted parallel machines. 2.3. Low-levelparallelvirtualmachines. The(cid:12)naltranslationtechniques can gain wider applicability by targeting low-level parallel virtual machines that are e(cid:14)ciently implemented on classes of parallel architectures, rather than machine- speci(cid:12)c languages. For example the C language along with libraries such as the vector library CVL [1], PVM [11] or POSIX threads mightbe appropriate as low- level parallel virtual machine targets. 2.4. Software developmentprocess. Figure 2 illustratesthe developmentpro- cess we have in mind. Starting with an initial speci(cid:12)cation, programs are succes- sively transformed to incorporate speci(cid:12)cation changes, to restrict the expression ofconcurrency andtotranslateversionstoarchitecture-speci(cid:12)c low-levelnotations. We thus di(cid:11)erentiate transformationsteps into (cid:15) elaborations, which alter the meaningof a speci(cid:12)cation, (cid:15) re(cid:12)nements,whichpreservethemeaningofthespeci(cid:12)cationbutnarrowthe choices for execution, for example by restricting the form of concurrency employed,and (cid:15) translations, which convert the program from the high level notation to a low-level notation. In Figure 2 an initial executable speci(cid:12)cation P0 is developed (after some elab- oration). This formulationmayor maynot include any explicit concurrency. Here we assume P0 includes only implicitdata parallelism. Severalre(cid:12)nementpathsareshown. Eachre(cid:12)nementissimplyarewritingofthe programtext that preserves meaningbut changes the detailedformofthe program and the constructs used. For example,the re(cid:12)nement fromP0 to P1 mightrestrict data-parallel expressions so that the resulting program is translatable to the CVL model. Alternatively,the re(cid:12)nement fromP0 to P2 introduces explicitconcurrency in the program. In version P3 this concurrency is expressed in a very general form from which it is not possible to determine communication points. For this version,aC programspawningPosixthreads andusingsemaphores tosynchronize would have to generated. P4 is an alternate re(cid:12)nement of P2 in which the form of the explicit concurrency is su(cid:14)ciently restricted that all communication can be statically identi(cid:12)ed and a PVM program can be generated from P4. If P4 also includesacomponentthatuses data-parallelism,thatcomponentcanbetranslated 6 MILLS, NYLAND, PRINS, AND REIF to CVL. In general, a program version can translate to any number of program segments in di(cid:11)erent virtual machine models. These re(cid:12)nement operations could be accomplished by manual rewriting or by a semi-automatic program transformation system. As one gains insight into the principles of these re(cid:12)nements, it may be possible to express them in the form of tactics or to incorporate them into the automated translations. 2.5. Prototypingand evaluationof designs. It iscriticalto thisdevelopment methodology that the intermediate versions in a development can be assessed in someform. Throughtheuseofaninterpreterforthehigh-levelnotation,allversions canbeexecuted,sothatsomeempiricalmeasurementoffunctionalandperformance characteristics is possible. In general it is unlikely that e(cid:14)cient parallel execution could be achieved for high-level speci(cid:12)cations, and in fact it may well be that the only way to execute such versions is by a sequential simulation performed by the interpreter. Even with this limitation important performance and functionality measurements could still be obtained, e.g. total work performed by a parallel algorithmand its load distribution under some particular data decomposition. The execution capability would support the rapid prototyping of designs and permit the exploration of a large and complex space of alternatives in which sig- ni(cid:12)cant design trade-o(cid:11)s exist. There is extensive evidence in many engineering domains,including both the software and hardware, that informationobtained by disciplined experimentation with prototypes reduces risk and improves productiv- ity. In the domain of parallel computation where design principles are not well understood, the knowledgeacquiredfromprototypingcanbe particularlyvaluable. By re(cid:12)ning prototypes toward speci(cid:12)c implementations rather than throwing them away, we improve the ability to carry information from the prototype into implementation. 2.6. Performance prediction. A signi(cid:12)cant aid in the design of e(cid:14)cient paral- lel programs is the ability to predict the performance on actual parallel machines, allowingthe early assessment of algorithmicvariations without the cost of fullim- plementation. Performanceanalysishereencompassesbothempiricalmeasurement ofperformance (e.g. through simulation)as wellas static estimationofcomplexity measures of time and resource utilization. For both these cases the measurements ofbehavior are de(cid:12)ned in termsof amathematicalmodelofa computingmachine. Indeed, parallel computation models which underlie both static and dynamic per- formance analysis give an operational semantics to programs which provides an intuitive framework that guides the very design of the algorithm. In this sense there islittletechnicaldistinctionbetween formalmodelsofcomputationandwhat aretypicallytermedprogrammingmodels. Thusitisnotsurprisingthatitremains di(cid:14)cult to assess parallel program performance for many of the same reasons it is di(cid:14)culttoconstruct e(cid:14)cientandportableparallelprogramsinthe (cid:12)rst place{the gapbetween high-levelmodelsanddiverse low-levelmachineshinders thethe accu- racy ofperformance estimation. Tocombatthis problemwe propose a re(cid:12)nement- based approach to performance prediction in which the computing model, chosen from a hierarchy of recently developed more detailed models, matches the level of programre(cid:12)nement. Historically, the PRAM is the most widely used parallel model. However, for current parallel machines, the PRAM is often inaccurate in predicting the actual SOFTWARE ISSUES AND DEVELOPMENT OF HPC APPLICATIONS 7 running time of programs since it hides details which impact performance such as the timerequired for network communicationand synchronizationas wellas issues ofasynchrony and memoryhierarchy. For example,this modeldoes not re(cid:13)ect the current trend toward larger-grained asynchronous MIMD machines whose proces- sors each mayhavetheir ownsophisticated memoryhierarchies and whichcommu- nicate over relatively slow networks. This necessitates (1) a pragmatic re(cid:12)nement of parallel machine models, that is, the development of models which incorporate realistic aspects such as communication costs and memory hierarchy while still remaining abstract enough to be machine-independent and amenable to reason- ing, and (2) the practical application of these theoretical models to performance analysis, that is, the development of better techniques and tools for performance prediction. Models and resource metrics for parallel computation. In response to the(cid:12)rstneedtherehavebeenproposedavarietyofmodelswhichextendthePRAM to incorporate realistic aspects such as asynchrony of processes (e.g., the APRAM [9]),communicationcosts,suchasnetworklatencyandbandwidthrestrictions(e.g., the LogP model [10]), and memory hierarchy, re(cid:13)ecting the e(cid:11)ects of multileveled memorysuch as di(cid:11)ering access times for registers, local cache, mainmemoryand diskI/O (e.g.,the P-HMM [23]). The mostprevalent andpromisingrecent models are parameterized(or generic) models,whichabstract the architectural details into severalgenericparameterswhichwecallresourcemetrics. Typicalresource metrics includethenumberofprocessors,communicationlatency,bandwidth,blocktransfer capability,networktopology,memoryhierarchy,memoryaccess methodanddegree ofasynchrony. Usingsuchaparameterizedmodelonecandesignbroadlyapplicable parameterizedalgorithmsthatcanbe tailoredtospeci(cid:12)c machinesbyinstantiating the parameters, such as latency and bandwidth, to matchmachine characteristics. Re(cid:12)nement of models. We argue for an approach to performance analysis that in answer to the second need { practical application of recent re(cid:12)ned models { allies performance assessment with the incremental re(cid:12)nement of design. Our approachforperformancepredictionisbasedon(1)the use ofincreasinglydetailed models as the program re(cid:12)nement progresses, gaining accuracy and con(cid:12)dence as developmentprogresses, (2)theuseofdi(cid:11)erentmodelsforanalysisofcodesegments following di(cid:11)erent paradigms, such as data-parallelism and message-passing, to support the assessment of multi-paradigmprograms, and (3) the extension of an alreadyemerginghierarchy ofre(cid:12)ned modelsasneeded tosupport theabovegoals, followingprinciples derived from a careful examinationof key issues in the design of models of parallel computation. The key notion in the (cid:12)rst point is that the computing model used for assess- ment varies with the level of re(cid:12)nement. At each point in the stepwise re(cid:12)nement the design can be assessed; the accuracy of assessment increases with the level of architectural detail incorporated into the design and the correspondingly more de- tailed model used for analysis. Moreover, in terms of resource metrics, the model should \(cid:12)t" the re(cid:12)ned programnot just in level of detail but also in the choice of resource metrics with which it approximates machines. As more detailed architec- turalcommitmentsare madeinthe speci(cid:12)c expression ofconcurrency (forexample incorporating notions of message-passing) models with appropriate resource met- rics (for example with notions of latency) can be attached to the program. Thus a hierarchy of models expressing increasingly detailed resource metrics are used 8 MILLS, NYLAND, PRINS, AND REIF which matches the tree-like structure of programre(cid:12)nement. Forexample,atthe coarsest levelperformancepredictionmaybe doneusingthe interpretertoderivesimpleapproximationsoftotalwork. Astheprogramisre(cid:12)ned into data-parallel code one might employ the VRAM vector model [2], or for a shared-memoryversionamodelakintotheAPRAMmightbeusedforperformance evaluation. As the program is further re(cid:12)ned, for example from a shared-memory program to a more sophisticated form which corresponds to message-passing, the LogP network model may be employed to gain more accurate assessment, with suitableinstrumentationthatidenti(cid:12)eslow-levelunitsofcommunication(andwork) in order to \attach" the modelto the program. At a further stage in the re(cid:12)nement process it becomes important to model the several layers of memorieswhich exist in manymachines,since di(cid:11)ering access timestolocalcacheanddiskmaystronglye(cid:11)ectperformance. Yettoaccommodate these more detailed performance measures, further re(cid:12)nements of parallel models mightbe required, since avoidexists inmodelsthat accurately treat both network communicationandparallelmulti-levelmemory. Asasimpleexampleoftheprocess ofdevelopingimprovedperformancemodels,consideranewhybridmodelofparallel computation, the LogP-HMM model [14], which extends a network model (the LogP) with a sequential hierarchical memory model (the HMM). Such a re(cid:12)ned model could be instrumented into parallel code through the use of annotations which incorporate explicit detailsofmemorylocality. A related approach has been used for cache-coherent shared-memory multiprocessors in the CICO project [13], whereannotationsservebothforperformancepredictionandtoguidemoree(cid:14)cient code generation. External Codes Repository Executable Machine Cray C90 Objects e e InterfacCVL, Etc. MasPar MP-1 elaborate execute MIF achinads, TMC CM-5 Me refine Current analyze ual x thr Version VirtPosi KSR-1 translate monitor allel VM, ParP Intel Paragon Modification Execution Figure 3. Re(cid:12)nement-based frameworkfor software development SOFTWARE ISSUES AND DEVELOPMENT OF HPC APPLICATIONS 9 3. The Proteus System for the Development of HPC Software Wenowbrie(cid:13)ydescribe Proteus, are(cid:12)nement-based systemforparallelsoftware development which embodies the principles earlier presented. The Proteus system isunder jointdevelopmentbyDukeUniversity,the UniversityofNorth Carolinaat Chapel Hill, and the Kestrel Institute. Its goal is to provide improved capabilities for exploring the design space of a parallel application using prototypes, and for evolvingaprototypeintoahighly-specializedande(cid:14)cientparallelimplementation. The Proteus system, illustrated in Figure 3, comprises: (cid:15) a wide-spectrum parallel programmingnotation that allows high-level ex- pression of speci(cid:12)cations, (cid:15) amethodologyfor(semi-automatic)re(cid:12)nementofarchitecture-independent prototypes tolower-levelprogramsoptimizedforspeci(cid:12)c architectures, fol- lowed by translation to portable intermediate languages, (cid:15) anexecution systemconsistingofaninterpreter, aModuleInterconnection Facility(MIF)allowinginteroperabilityofProteuswithotherprogramming languages and run-time analysis tools, and (cid:15) a methodologyfor prototype performance evaluation integrating both dy- namic(experimental)andstatic(analytical)techniqueswithmodelsmatched to the level of re(cid:12)nement. We believe that, in the absence of both standard modelsfor parallel computing and adequate compilers, this approach gives the greatest hope of producing useful applicationsfortoday’scomputers. It allowsthe programmerto balanceexecution speed against portability and ease of development. 3.1. The Proteus language. The Proteus language is an imperative language thatprovideshigh-levelnotationsforexpressingseveralfundamentalformsofparal- lelismincludingimplicitconcurrencyfoundindata-parallelexpressions,andexplicit concurrency in the formof tasks and controlled access to shared state. Dataparalleloperationsareexpressed usingthefamiliarmathematicalnotations of set, sequence, and map comprehension. The ability to specify irregular and nested dataparallelismisanaturalconsequence ofprovidingnested aggregatedata types such as sequences of sequences, sets of sets, etc. Like SETL, many of the powerful and (cid:13)exible mathematical types are prede(cid:12)ned in Proteus. Additional user-de(cid:12)ned data types may be speci(cid:12)ed algebraically in Proteus and packaged as parameterized theories { parameterization both generalizes polymorphismand enhances reusability. Process (or task) parallel computations can also be succinctly expressed with a smallsetofprocesscreationandsynchronizationprimitivessimilartothoseadopted in recent languages such as PCN [7], CC++ [6], and COOL [5]. In particular, communicationisthroughasharedobjectmodelinwhichtheaccesstosharedstate is controlled through object methods and class directives which constrain mutual exclusion of methods [16]. Prede(cid:12)ned classes such as for single-assignment objects which synchronize a producer with a consumer [12], together with provisions for privatestatewithbarriersynchronization[17],allowthe expression ofawiderange of parallel computingparadigms. 3.2. Transformationof data and process parallelism. Proteus data-parallel expressionsinafunctionalsubsetinvolvingnestedsequencedatatypescancurrently 10 MILLS, NYLAND, PRINS, AND REIF betransformedandtranslatedtoCwithvectoroperations(CVL)usingtheKestrel Data-Type Re(cid:12)nement System (DTRE3) [20]. This includes irregular and nested data-parallelexpressions (as found in the parallelapplicationof a function to each of a collection of argument sequences of di(cid:11)ering lengths), recursive parallel com- putations(as foundforexampleindivide-and-conquer algorithms),andhigh-order parallel function application (as found in the parallel reduction of a sequence of values using an arbitrary function). The resultant CVL program can be e(cid:14)ciently executed on diverse parallel ma- chines such as the CrayC-90,the TMCCM-5, andthe MasPar MP-2. Translation of process parallelismis under development. 3.3. Performance prediction and measurement in Proteus. The Proteus interpreter currently provides a rudimentaryper-process clock that measures com- putationalsteps. This,inconjunctionwithexplicitinstrumentationofProteuscode is,usedtodevelopresourcerequirementmeasuresandtopredictperformance. Sup- port for multipleperformance-prediction models is under investigation. 3.4. Applications of the Proteus system. Several small demonstrations and larger driving problems have been used to assess and validate our technical ap- proach, focusing on such aspects as the prototyping process and methodology,the expressiveness of the Proteus language, and the e(cid:11)ectiveness of the Proteus tools. One demonstrationprobleminvolves the prototyping and implementationof al- gorithmicvariants of the Fast MultipoleAlgorithm(FMA) for N-body simulation. These algorithmspromise performance and accuracy advantages for computation- allychallengingproblemssuch as moleculardynamicssimulations,yet arecomplex and time-consuming to implement. The FMA has many variants which generate a design space which is not well understood. The goal of our experiments with Proteus hasbeen toexplore this space. Our experiments haveidenti(cid:12)ed new adap- tiveproblemdecompositionsthat yieldgood performanceeven incomplexsettings where bodies are not uniformlydistributed [18]. Further descriptions of the language, implementation, and demonstrations are availablefrom the Proteus WWW informationserver at http://www.cs.unc.edu/proteus.html. 4. Related Work There are a variety of e(cid:11)orts which seek to address the problem of parallel software developmentthroughhigh-levellanguagescapable ofexpressing programs executable on a broad range of parallel architectures. These e(cid:11)orts maybe distin- guishedintheapproachtheytaketodealingwiththetradeo(cid:11)sofexpressiveness, ef- (cid:12)ciency,andsophisticationofcompilationstrategies. Forexample,someapproaches restrict the forms of concurrency usable in order to achieve good performance on allplatforms. Thisis the approach of High-PerformanceFortran (HPF), for exam- ple,whichis limitedto(cid:13)at data-parallelism,andmightbe saidtoalsocharacterize the current intent of HPC++. However, by restricting the notation to expressing only what is readily compilable, a measure of expressiveness, for example nested data-parallelism,maybe sacri(cid:12)ced. Otherlanguagesattempttoprovideahigherlevelofexpression,butthenfacedif- (cid:12)culty in achieving good performance because of very general programmingmodel

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.