Lecture Notes in Computer Science 1511
Edited by G. Goos, J. Hartmanis and J. van Leeuwen

David R. O'Hallaron (Ed.)

Languages, Compilers, and Run-Time Systems for Scalable Computers
4th International Workshop, LCR '98
Pittsburgh, PA, USA, May 28-30, 1998
Selected Papers

Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editor
David R. O'Hallaron
Computer Science and Electrical and Computer Engineering
School of Computer Science, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213-3891, USA
E-mail: [email protected]

Cataloging-in-Publication data applied for
Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Languages, compilers, and run-time systems for scalable computers : 4th international workshop ; selected papers / LCR '98, Pittsburgh, PA, USA, May 28-30, 1998. David R. O'Hallaron (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Budapest ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1998
(Lecture notes in computer science ; Vol. 1511)
ISBN 3-540-65172-1

CR Subject Classification (1991): F.2.2, D.1.3, D.4.4-5, C.2.2, D.3, F.1, C.2.4, C.3
ISSN 0302-9743
ISBN 3-540-65172-1 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1998
Printed in Germany
Typesetting: Camera-ready by author
SPIN 10692671 06/3142 - 5 4 3 2 1 0
Printed on acid-free paper

Preface

It is a great pleasure to present this collection of papers from LCR '98, the Fourth Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers. The LCR workshop is a biennial gathering of computer scientists who develop software systems for parallel and distributed computers. LCR is held in alternating years with the ACM Symposium on Principles and Practice of Parallel Programming (PPoPP) and draws from the same community.

This fourth meeting was held in cooperation with ACM SIGPLAN on the campus of Carnegie Mellon University, May 28-30, 1998. There were 60 registered attendees from 9 nations. A total of 47 six-page extended abstracts were submitted. There were a total of 134 reviews, for an average of 2.85 reviews per submission. Submissions were rank-ordered by average review score. The top 23 submissions were selected as full papers and the next 9 as short papers.

The program committee consisted of David Bakken (BBN), Ian Foster (Argonne), Thomas Gross (CMU and ETH Zurich), Charles Koelbel (Rice), Piyush Mehrotra (ICASE), David O'Hallaron, Chair (CMU), Joel Saltz (Maryland), Jaspal Subhlok (CMU), Boleslaw Szymanski (RPI), Katherine Yelick (Berkeley), and Hans Zima (Vienna).

In addition to the members of the committee, the following people chaired sessions: Lawrence Rauchwerger, Peter Brezany, Terry Pratt, Alan Sussman, Sandhya Dwarkadas, and Rajiv Gupta. Also, the following people generously provided additional reviews: Siegfried Benkner, Chen Ding, Guohua Jin, Erwin Laure, Pantona Mario, Eduard Mehofer, and Bernd Wender. We very much appreciate the efforts of these dedicated volunteers.

Barbara Grandillo did a wonderful job of chairing the organizing committee and handling the local arrangements. Thomas Gross and Jaspal Subhlok made numerous suggestions.
Peter Dinda, Peter Lieu, Nancy Miller, and Bwolen Yang also helped out during the workshop itself. We are very grateful for their help. Finally, I would like to thank Mary Lou Soffa at the Univ. of Pittsburgh for her help in getting SIGPLAN support; Bolek Szymanski at RPI, who chaired the previous meeting and was always helpful and encouraging; and Alfred Hofmann at Springer, whose advice and encouragement enabled us to produce this volume.

Carnegie Mellon University, August 1998
David O'Hallaron, LCR '98 General/Program Chair

Table of Contents

Expressing Irregular Computations in Modern Fortran Dialects ............. 1
Jan F. Prins, Siddhartha Chatterjee, and Martin Simons

Memory System Support for Irregular Applications ........................ 17
John Carter, Wilson Hsieh, Mark Swanson, Lixin Zhang, Erik Brunvand, Al Davis, Chen-Chi Kuo, Ravindra Kuramkote, Michael Parker, Lambert Schaelicke, Leigh Stoller, and Terry Tateyama

MENHIR: An Environment for High Performance Matlab .................. 27
Stéphane Chauveau and François Bodin

On the Automatic Parallelization of Sparse and Irregular Fortran Programs ... 41
Yuan Lin and David Padua

Loop Transformations for Hierarchical Parallelism and Locality ............ 57
Vivek Sarkar

Dataflow Analysis Driven Dynamic Data Partitioning ...................... 75
Jodi Tims, Rajiv Gupta, and Mary Lou Soffa

A Case for Combining Compile-Time and Run-Time Parallelization ........ 91
Sungdo Moon, Byoungro So, Mary W. Hall, and Brian Murphy

Compiler and Run-Time Support for Adaptive Load Balancing in Software Distributed Shared Memory Systems .......................... 107
Sotiris Ioannidis and Sandhya Dwarkadas

Efficient Interprocedural Data Placement Optimisation in a Parallel Library .. 123
Olav Beckmann and Paul H. J. Kelly

A Framework for Specializing Threads in Concurrent Run-Time Systems .. 139
Gregory D.
Benson and Ronald A. Olsson

Load Balancing with Migrant Lightweight Threads ........................ 153
David Cronk and Piyush Mehrotra

Integrated Task and Data Parallel Support for Dynamic Applications ..... 167
James M. Rehg, Kathleen Knobe, Umakishore Ramachandran, Rishiyur S. Nikhil, and Arun Chauhan

Supporting Self-Adaptivity for SPMD Message-Passing Applications ...... 181
M. Cermele, M. Colajanni, and S. Tucci

Evaluating the Effectiveness of a Parallelizing Compiler .................. 195
Dixie Hisley, Gagan Agrawal, and Lori Pollock

Comparing Reference Counting and Global Mark-and-Sweep on Parallel Computers .................................................... 205
Hirotaka Yamamoto, Kenjiro Taura, and Akinori Yonezawa

Design of the GODIVA Performance Measurement System ................ 219
Terrence W. Pratt

Instrumentation Database for Performance Analysis of Parallel Scientific Applications .................................................. 229
Jeffrey Nesheiwat and Boleslaw K. Szymanski

A Performance Prediction Framework for Data Intensive Applications on Large Scale Parallel Machines ......................................... 243
Mustafa Uysal, Tahsin M. Kurc, Alan Sussman, and Joel Saltz

MARS: A Distributed Memory Approach to Shared Memory Compilation .. 259
M.F.P. O'Boyle

More on Scheduling Block-Cyclic Array Redistribution .................... 275
Frédéric Desprez, Stéphane Domas, Jack Dongarra, Antoine Petitet, Cyril Randriamaro, and Yves Robert

Flexible and Optimized IDL Compilation for Distributed Applications .... 288
Eric Eide, Jay Lepreau, and James L. Simister

QoS Aspect Languages and Their Runtime Integration .................... 303
Joseph P. Loyall, David E. Bakken, Richard E. Schantz, John A. Zinky, David A. Karr, Rodrigo Vanegas, and Kenneth R. Anderson

Statistical Properties of Host Load ....................................... 319
Peter A.
Dinda

Locality Enhancement for Large-Scale Shared-Memory Multiprocessors ... 335
Tarik Abdelrahman, Naraig Manjikian, Gary Liu, and S. Tandri

Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors .................................. 343
Peter Brezany, Alok Choudhary, and Minh Dang

Detection of Races and Control-Flow Nondeterminism .................... 351
Mindong Feng and Chung Kwong Yuen

Improving Locality in Out-of-Core Computations Using Data Layout Transformations .................................................. 359
M. Kandemir, A. Choudhary, and J. Ramanujam

Optimizing Computational and Spatial Overheads in Complex Transformed Loops ....................................................... 367
Dattatraya Kulkarni and Michael Stumm

Building a Conservative Parallel Simulation with Existing Component Libraries ..................................................... 378
Chu-Cheow Lim and Yoke-Hean Low

A Coordination Layer for Exploiting Task Parallelism with HPF .......... 386
Salvatore Orlando and Raffaele Perego

InterAct: Virtual Sharing for Interactive Client-Server Applications ....... 394
Srinivasan Parthasarathy and Sandhya Dwarkadas

Standard Templates Adaptive Parallel Library (STAPL) .................. 402
Lawrence Rauchwerger, Francisco Arzu, and Koji Ouchi

Author Index ............................................................ 411

Expressing Irregular Computations in Modern Fortran Dialects

Jan F. Prins, Siddhartha Chatterjee, and Martin Simons
Department of Computer Science
The University of North Carolina
Chapel Hill, NC 27599-3175
fprins,sc,[email protected]

Abstract.
Modern dialects of Fortran enjoy wide use and good support on high-performance computers as performance-oriented programming languages. By providing the ability to express nested data parallelism in Fortran, we enable irregular computations to be incorporated into existing applications with minimal rewriting and without sacrificing performance within the regular portions of the application. Since performance of nested data-parallel computation is unpredictable and often poor using current compilers, we investigate source-to-source transformation techniques that yield Fortran 90 programs with improved performance and performance stability.

1 Introduction

Modern science and engineering disciplines make extensive use of computer simulations. As these simulations increase in size and detail, the computational costs of naive algorithms can overwhelm even the largest parallel computers available today. Fortunately, computational costs can be reduced using sophisticated modeling methods that vary model resolution as needed, coupled with sparse and adaptive solution techniques that vary computational effort in time and space as needed. Such techniques have been developed and are routinely employed in sequential computation, for example, in cosmological simulations (using adaptive n-body methods) and computational fluid dynamics (using adaptive meshing and sparse linear system solvers).

However, these so-called irregular or unstructured computations are problematic for parallel computation, where high performance requires equal distribution of work over processors and locality of reference within each processor.
For many irregular computations, the distribution of work and data cannot be characterized a priori, as these quantities are input-dependent and/or evolve with the computation itself. Further, irregular computations are difficult to express using performance-oriented languages such as Fortran, because there is an apparent mismatch between data types such as trees, graphs, and nested sequences characteristic of irregular computations and the statically analyzable rectangular multi-dimensional arrays that are the core data types in modern Fortran dialects such as Fortran 90/95 [19] and High Performance Fortran (HPF) [16]. Irregular data types can be introduced using the data abstraction facilities, with a representation exploiting pointers. Optimization of operations on such an abstract data type is currently beyond compile-time analysis, and compilers have difficulty generating high-performance parallel code for such programs. This paper primarily addresses the expression of irregular computations in Fortran 95, but does so with a particular view of the compilation and high-performance execution of such computations on parallel processors.

The modern Fortran dialects enjoy increasing use and good support as mainstream performance-oriented programming languages. By providing the ability to express irregular computations as Fortran modules, and by preprocessing these modules into a form that current Fortran compilers can successfully optimize, we enable irregular computations to be incorporated into existing applications with minimal rewriting and without sacrificing performance within the regular portions of the application.

(This research was supported in part by NSF Grants #CCR-9711438 and #INT-9726317. Chatterjee is supported in part by NSF CAREER Award #CCR-9501979. Simons is supported by a research scholarship from the German Research Foundation (DFG).)

D. O'Hallaron (Ed.): LCR '98, LNCS 1511, pp. 1-16, 1998. © Springer-Verlag Berlin Heidelberg 1998
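The mismatch can be made concrete. A nested (ragged) sequence, such as the adjacency lists of an irregular graph, is commonly mapped onto flat arrays by pairing one contiguous data vector with a segment descriptor recording the length of each sublist; whole-collection operations then become flat array operations of exactly the kind Fortran 90 compilers optimize well. The sketch below illustrates this standard flattened representation in Python; the names and data are illustrative, not taken from the paper:

```python
# Flattened representation of the ragged sequence [[3, 1], [4], [], [1, 5, 9]]:
# one contiguous data vector plus a segment descriptor of sublist lengths.
data = [3, 1, 4, 1, 5, 9]
seg_lengths = [2, 1, 0, 3]

# Starting offsets are the exclusive prefix sum of the lengths.
starts = [0]
for n in seg_lengths:
    starts.append(starts[-1] + n)

def sublist(i):
    """Recover the i-th sublist from the flat representation."""
    return data[starts[i]:starts[i + 1]]

assert sublist(0) == [3, 1]
assert sublist(2) == []  # an empty sublist occupies no space in the data vector
```

Applying an elementwise operation to every element of every sublist is then a single data-parallel pass over `data`, with no pointer chasing and no per-sublist dispatch.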
For example, consider the NAS CG (Conjugate Gradient) benchmark, which solves an unstructured sparse linear system using the method of conjugate gradients [2]. Within the distributed sample sequential Fortran solution, 79% of the lines of code are standard Fortran 77 concerned with problem construction and performance reporting. The next 16% consist of scalar and regular vector computations of the BLAS 2 variety [17], while the final 5% of the code is the irregular computation of the sparse matrix-vector product. Clearly we want to rewrite only this 5% of the code (which performs 97% of the work in the class B computation), while the remainder should be left intact for the Fortran compiler. This is not just for convenience. It is also critical for performance reasons: following Amdahl's Law, as the performance of the irregular computation improves, the performance of the regular component becomes increasingly critical for sustained high performance overall. Fortran compilers provide good compiler/annotation techniques to achieve high performance for the regular computations in the problem, and can thus provide an efficient and seamless interface between the regular and irregular portions of the computation.

We manually applied the implementation techniques described in Sect. 4 to the irregular computation in the NAS CG problem. The resultant Fortran program achieved a performance on the class B NAS CG 1.0 benchmark of 13.5 GFLOPS using a 32-processor NEC SX-4 [25]. We believe this to be the highest performance achieved for this benchmark to date. It exceeds, by a factor of 2.6, the highest performance reported in the last NPB 1.0 report [27], and is slightly faster than the 12.9 GFLOPS recently achieved using a 1024-processor Cray T3E-900 [18]. These encouraging initial results support the thesis that high-level expression and high performance for irregular computations can be supported simultaneously in a production Fortran programming environment.

2 Expressing Irregular Computations Using Nested Data Parallelism

We adopt the data-parallel programming model of Fortran as our starting point. The data-parallel programming model has proven to be popular because of its power and simplicity.
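As a concrete instance of the irregular kernel discussed above, the sparse matrix-vector product at the heart of CG can be sketched over the usual compressed sparse row (CSR) layout. This is a generic CSR formulation in Python for illustration, not the paper's Fortran implementation; the variable-length inner loop over each row is exactly the nested parallelism at issue:

```python
def csr_matvec(row_ptr, col_idx, vals, x):
    """Compute y = A*x for A stored in CSR form.

    Row i owns the nonzeros vals[row_ptr[i]:row_ptr[i+1]], whose column
    indices are col_idx[row_ptr[i]:row_ptr[i+1]]. The inner loop length
    varies per row, giving the ragged (nested) parallel structure.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# A = [[2, 0, 1],
#      [0, 3, 0],
#      [0, 0, 4]]
row_ptr = [0, 2, 3, 4]
col_idx = [0, 2, 1, 2]
vals = [2.0, 1.0, 3.0, 4.0]
assert csr_matvec(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]) == [3.0, 3.0, 4.0]
```

When row lengths are highly skewed, a naive parallelization over rows load-balances poorly, which is one reason flattening transformations of the kind the paper investigates matter.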
Data-parallel languages are founded on the concept of collections (such as arrays) and