
Algorithmic Differentiation of Pragma-Defined Parallel Regions: Differentiating Computer Programs Containing OpenMP

411 Pages·2014·4.419 MB·English

Preview Algorithmic Differentiation of Pragma-Defined Parallel Regions: Differentiating Computer Programs Containing OpenMP

Algorithmic Differentiation of Pragma-Defined Parallel Regions
Differentiating Computer Programs Containing OpenMP

Michael Förster
RWTH Aachen University
Aachen, Germany

D 82, Dissertation RWTH Aachen University, Aachen, Germany, 2014

ISBN 978-3-658-07596-5
ISBN 978-3-658-07597-2 (eBook)
DOI 10.1007/978-3-658-07597-2

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available in the Internet at http://dnb.d-nb.de.

Library of Congress Control Number: 2014951338

Springer Vieweg
© Springer Fachmedien Wiesbaden 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer Vieweg is a brand of Springer DE. Springer DE is part of Springer Science+Business Media.
www.springer-vieweg.de

Abstract

The goal of this dissertation is to develop a source code transformation that exploits the knowledge that a given input code is parallelizable in a way that it generates derivative code efficiently executable on a supercomputer environment.

There is barely a domain where optimization does not play a role. Not only in science and engineering, also in economics and industry it is important to find optimal solutions for a given problem. The size of these optimization problems often requires large-scale numerical techniques that are capable of running on a supercomputer architecture.

For continuous optimization problems the calculation of derivative values of a given function is crucial. If these functions are given as a computer code implementation Q then techniques known as algorithmic differentiation (AD) alias automatic differentiation can be used to obtain an implementation Q' that is capable of computing the derivative of a given output of Q with respect to a certain input. This thesis focuses on algorithmic differentiation by source transformation. The implementation Q is transformed into Q' such that Q' contains assignments for computing the derivative values.

On the one side, the size of optimization problems is rising.
On the other side, the number of cores per central processing unit (CPU) in modern computers is growing. A typical supercomputer node has up to 32 cores or even more in case that multiple physical nodes form a compound. In order to allow Q to compute its output values efficiently, the implementation of Q should exploit the underlying multicore computer architecture. An easy approach of using parallel programming is to declare a certain code region inside of Q as parallelizable. This declaration is done by setting a certain kind of pragma in front of the corresponding code region. The pragma is a compiler directive and in our case this special directive informs the compiler that the corresponding code region should be executed concurrently. This code region is denoted as a parallel region P and the parallel instances which execute P are called threads.

There are two fundamental modes in AD, the forward and the reverse mode. We present source transformation rules for a simplified programming language, called SPL. In addition, we show that these rules provide derivative code either in forward or in reverse mode. One crucial goal of this work is that the knowledge that the original code contains a parallel region P leads to a parallel region P' in the derivative code. This allows a concurrent computation of the derivative values. We exhibit a proof to ensure that the parallel execution of P' is correct. In case that the user of AD wants to achieve higher derivative code the possibility of reapplying the source transformation is important. Therefore, we exhibit that the source transformation is closed in the sense that the output code language is the same as the input language.

The reverse mode of AD builds the so called adjoint code. The term 'reverse' indicates that the adjoint code requires a data flow reversal of the execution of P. Suppose that P consists of code where a memory location is read by multiple threads.
The data flow reversal of P leads to the situation that the corresponding derivative component of this memory location is a target of multiple store operations from different threads during the execution of the adjoint code. These store operations must be synchronized. Otherwise, the adjoint code would have a race condition at runtime. Conservatively, one could assume that all memory locations in P are read by multiple threads, which leads to the result that the adjoint source transformation generates a lot of synchronization constructs to ensure a correct parallel execution. In the worst case the synchronization overhead leads to a concurrent runtime of the derivative code P' that is bigger than the sequential runtime. In order to avoid as much synchronization as possible, we develop a static program analysis that collects information about P at compile time whether or not a memory location is exclusively read by a thread. Is a memory location read exclusively, the adjoint source transformation does not need to emit a synchronization method for the corresponding derivative computation. This can make a major difference.

We demonstrate how the context-free grammar for the language SPL can be extended in order to recognize pragmas defined in the OpenMP standard. Beside the extension of the grammar we present source transformation rules for these OpenMP constructs. With the source transformation rules for constructs such as the barrier, the critical, or the worksharing loop construct, this work provides rules for generating derivative code for most of the occurring OpenMP parallel regions.

The approach of this work has been implemented in a tool, called simple parallel language compiler (SPLc). We give evidence that our approach is applicable through the implementation of two optimization problems. On the one hand, we use a first derivative code provided by SPLc to solve a nonlinear least-squares problem. On the other hand, a nonlinear constrained optimization problem has been solved with the second derivative code provided by SPLc as well.

Acknowledgments

First and foremost, I would like to thank my adviser Prof. Dr. Uwe Naumann for his thoughtful guidance and constant encouragement. He has been inspiring since my days as a computer science student and I am grateful that I did get the chance to make my PhD at his institute.

Special thanks goes to my co-supervisor Prof. Dr. Christian Bischof for his support and constructive suggestions. I also would like to thank Prof. Dr. Thomas Noll for the fruitful discussions about static program analysis.

Further thanks goes to all current and former employees of the Lehr- und Forschungsgebiet Computer Science 12 at the RWTH Aachen University. We had a lot of fun and many discussions mainly about algorithmic differentiation but besides about all sorts of things. I would especially like to thank Johannes Lotz, Lukas Razik, Michel Schanen, Arindam Sen, and Markus Towara for reading early drafts of this manuscript and for finding most of the contained errors and unintelligible explanations.

Many thanks to the IT center of the RWTH Aachen University. The HPC group provided a lot of useful material about parallel programming and in particular about OpenMP. At this point, a special thank goes to Sascha Bücken for providing us access to the SUN cluster that had served its time, but has never lost its capabilities for being a good environment for parallel programming.

I really appreciate all my family and I am grateful for all their support during all these years. I wish my father could read this but he passed away much too early. Anyway, thanks for all the fun during our successful motocross years. The ups and downs during my sports career have been a major influence on my further life.

Last, but not least, I would like to thank my wife Mirjam. Without her understanding and support during the past few years, I may not have finished this thesis. Many thanks for all the pleasure that we had together with our two little daughters Emmy and Matilda.

Contents

Abstract
Acknowledgments
1 Motivation and Introduction
  1.1 Numerical Optimization in the Multicore Era
    1.1.1 A Nonlinear Least-Squares Problem
  1.2 Algorithmic Differentiation
    1.2.1 Second Derivative Code
    1.2.2 dcc - A Derivative Code Compiler
    1.2.3 A Nonlinear Constrained Optimization Problem
  1.3 OpenMP Standard 3.1
  1.4 Related Work
  1.5 Contributions
  1.6 Outline of the Thesis
2 Transformation of Pure Parallel Regions
  2.1 Formalism and Notation
  2.2 SPL - A Simple Language for Parallel Regions
  2.3 AD Source Transformation of SPL Code
    2.3.1 Tangent-Linear Model of SPL - Transformation τ(P)
    2.3.2 Adjoint Model of SPL - Transformation σ(P)
    2.3.3 SPL Code Inside of C/C++ Code
  2.4 Closure of the Source Transformation
    2.4.1 Closure Property of τ(P)
    2.4.2 Closure Property of σ(P) and the Exclusive Read Property
  2.5 Summary
3 Exclusive Read Analysis
  3.1 Control Flow in SPL code
  3.2 Integer Interval Analysis
  3.3 Directed Acyclic Graphs and Partial Orders
  3.4 Intervals of Directed Acyclic Graphs
  3.5 Data Flow Analysis with DAG Intervals
    3.5.1 Widening and Narrowing Operators
    3.5.2 Data Flow Analysis of Conditional Branches
  3.6 Towards the Exclusive Read Property
  3.7 Summary
4 Transformation of OpenMP Constructs
  4.1 Stack Implementation for the Adjoint Code
  4.2 SPLOMP1 - Synchronization Constructs
    4.2.1 Synchronization with Barriers
    4.2.2 Synchronization per master Construct
    4.2.3 Synchronization per critical Construct
    4.2.4 Synchronization per atomic Construct
    4.2.5 Closure of SPLOMP1
  4.3 SPLOMP2 - Worksharing Constructs
    4.3.1 Loop Construct
    4.3.2 sections Construct
    4.3.3 single Construct
    4.3.4 Combined Parallel Constructs
  4.4 SPLOMP3 - Data-Sharing
    4.4.1 Global Data - threadprivate Directive
    4.4.2 Thread-Local Data - private Clause
    4.4.3 firstprivate Construct
    4.4.4 lastprivate Construct
    4.4.5 reduction Clause
  4.5 Summary
5 Experimental Results
  5.1 Test Suite
    5.1.1 Pure Parallel Region
    5.1.2 Parallel Region with a Barrier
    5.1.3 Parallel Region with a master Construct
    5.1.4 Parallel Region with a Critical Region
    5.1.5 Parallel Region with atomic Construct
  5.2 Second Derivative Codes
  5.3 Exclusive Read Analysis
  5.4 Least-Squares Problem
  5.5 Nonlinear Constrained Optimization Problem
  5.6 Summary
6 Conclusions
  6.1 Results
  6.2 Future Work
A SPLc - A SPL compiler
  A.1 Building SPLc
  A.2 User Guide
  A.3 Developer Guide
B Test Suite
Bibliography
Index
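
The abstract's central point, that a memory location read by several threads in P becomes a shared store target in the adjoint code while an exclusively read location does not, can be illustrated with a small hand-written sketch. This is not SPL code and not output of SPLc; the function names (region, region_tangent, region_adjoint) and the suffixes _t and _adj are hypothetical and chosen only for this illustration. The region computes y[i] = a * sin(x[i]), where each x[i] is read by one iteration only and the scalar a is read by all threads.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Original parallel region P: y[i] = a * sin(x[i]).
// Each x[i] is read by exactly one loop iteration (exclusive read),
// while the scalar a is read by every thread (shared read).
void region(const std::vector<double>& x, double a, std::vector<double>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y[i] = a * std::sin(x[i]);
    }
}

// Forward (tangent-linear) mode, written by hand for this sketch:
// propagates the directions x_t and a_t to y_t alongside the values.
// Every store goes to index i only, so no synchronization is needed.
void region_tangent(const std::vector<double>& x, const std::vector<double>& x_t,
                    double a, double a_t,
                    std::vector<double>& y, std::vector<double>& y_t) {
    assert(x.size() == y.size() && x_t.size() == y_t.size());
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y_t[i] = a_t * std::sin(x[i]) + a * std::cos(x[i]) * x_t[i];
        y[i] = a * std::sin(x[i]);
    }
}

// Reverse (adjoint) mode, written by hand for this sketch:
// increments x_adj and returns the adjoint of a.
double region_adjoint(const std::vector<double>& x, double a,
                      const std::vector<double>& y_adj,
                      std::vector<double>& x_adj) {
    assert(x.size() == y_adj.size() && x.size() == x_adj.size());
    const long n = static_cast<long>(x.size());
    double a_adj = 0.0;  // shared by all threads of the region
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        // x[i] was read exclusively, so x_adj[i] has a single writer:
        // no synchronization is required here.
        x_adj[i] += a * std::cos(x[i]) * y_adj[i];
        // a was read by all threads, so its adjoint receives stores from
        // all threads; without the atomic this would be a race condition.
        const double contribution = std::sin(x[i]) * y_adj[i];
        #pragma omp atomic
        a_adj += contribution;
    }
    return a_adj;
}
```

Compiled with OpenMP enabled (for example with GCC's -fopenmp flag), the loops run concurrently; without it the pragmas are ignored and the same results are computed sequentially. In hand-written code a reduction(+:a_adj) clause would be another way to protect the shared adjoint. The atomic construct is used here only because it makes the synchronization cost that the exclusive read analysis tries to avoid directly visible.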
