
CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming

316 pages · 2014 · English

CUDA Fortran for Scientists and Engineers
Best Practices for Efficient CUDA Fortran Programming

Gregory Ruetsch and Massimiliano Fatica
NVIDIA Corporation, Santa Clara, CA

Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA

Acquiring Editor: Todd Green
Development Editor: Lindsay Lawrence
Project Manager: Punithavathy Govindaradjane
Designer: Matthew Limbert

Copyright © 2014 Gregory Ruetsch/NVIDIA Corporation and Massimiliano Fatica/NVIDIA Corporation. Published by Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
Ruetsch, Gregory.
CUDA Fortran for scientists and engineers : best practices for efficient CUDA Fortran programming / Gregory Ruetsch, Massimiliano Fatica.
pages cm
Includes bibliographical references and index.
ISBN 978-0-12-416970-8 (alk. paper)
1. FORTRAN (Computer program language) I. Fatica, Massimiliano. II. Title. III. Title: Best practices for efficient CUDA Fortran programming.
QA76.73.F25R833 2013
005.13'1--dc23
2013022226

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ISBN: 978-0-12-416970-8

Printed and bound in the United States of America

For information on all MK publications visit our website at www.mkp.com

To Fortran programmers, who know a good thing when they see it.

Acknowledgments

Writing this book has been an enjoyable and rewarding experience for us, largely due to the interactions with the people who helped shape the book into the form you have before you. There are many people who have helped with this book, both directly and indirectly, and at the risk of leaving someone out we would like to thank the following people for their assistance.

Of course, a book on CUDA Fortran would not be possible without CUDA Fortran itself, and we would like to thank The Portland Group (PGI), especially Brent Leback and Michael Wolfe, for literally giving us something to write about. Working with PGI on CUDA Fortran has been a delightful experience.
The authors often reflect on how computations used in their theses, which required many, many hours on large-vector machines of the day, can now run on an NVIDIA graphics processing unit (GPU) in less time than it takes to get a cup of coffee. We would like to thank those at NVIDIA who helped enable this technological breakthrough. We would like to thank past and present members of the CUDA software team, especially Philip Cuadra, Mark Hairgrove, Stephen Jones, Tim Murray, and Joel Sherpelz for answering the many questions we asked them.

Much of the material in this book grew out of collaborative efforts in performance-tuning applications. We would like to thank our collaborators in such efforts, including Norbert Juffa, Patrick Legresley, Paulius Micikevicius, and Everett Phillips.

Many people reviewed the manuscript for this book at various stages in its development, and we would like to thank Roberto Gomperts, Mark Harris, Norbert Juffa, Brent Leback, and Everett Phillips for their comments and suggestions.

We would like to thank Ian Buck for allowing us to spend time at work on this endeavor, and we would like to thank our families for their understanding while we also worked at home.

Finally, we would like to thank all of our teachers. They enabled us to write this book, and we hope in some way that by doing so, we have continued the chain of helping others.

Preface

This document is intended for scientists and engineers who develop or maintain computer simulations and applications in Fortran and who would like to harness the parallel processing power of graphics processing units (GPUs) to accelerate their code. The goal here is to provide the reader with the fundamentals of GPU programming using CUDA Fortran as well as some typical examples, without having the task of developing CUDA Fortran code become an end in itself.
The CUDA architecture was developed by NVIDIA to allow use of the GPU for general-purpose computing without requiring the programmer to have a background in graphics. There are many ways to access the CUDA architecture from a programmer's perspective, including through C/C++ using CUDA C or through Fortran using The Portland Group's (PGI's) CUDA Fortran. This document pertains to the latter approach. PGI's CUDA Fortran should be distinguished from the PGI Accelerator and OpenACC Fortran interfaces to the CUDA architecture, which are directive-based approaches to using the GPU. CUDA Fortran is simply the Fortran analog to CUDA C.

The reader of this book should be familiar with Fortran 90 concepts, such as modules, derived types, and array operations. For those familiar with earlier versions of Fortran but looking to upgrade to a more recent version, there are several excellent books that cover this material (e.g., Metcalf, 2011). Some features introduced in Fortran 2003 are used in this book, but these concepts are explained in detail. Although this book does assume some familiarity with Fortran 90, no experience with parallel programming (on the GPU or otherwise) is required. Part of the appeal of parallel programming on GPUs using CUDA is that the programming model is simple and novices can get parallel code up and running very quickly.

Often one comes to CUDA Fortran with the goal of porting existing, sometimes rather lengthy, Fortran code to code that leverages the GPU. Because CUDA is a hybrid programming model, where both GPU and CPU are utilized, CPU code can be incrementally ported to the GPU. CUDA Fortran is also used by those porting applications to GPUs mainly using the directive-based OpenACC approach, but who want to improve the performance of a few critical sections of code by hand-coding CUDA Fortran. Both OpenACC and CUDA Fortran can coexist in the same code.

This book is divided into two main parts.
The first part is a tutorial on CUDA Fortran programming, from the basics of writing CUDA Fortran code to some tips on optimization. The second part is a collection of case studies that demonstrate how the principles in the first part are applied to real-world examples.

This book makes use of the PGI 13.x compilers, which can be obtained from http://pgroup.com. Although the examples can be compiled and run on any supported operating system in a variety of development environments, the examples included here are compiled from the command line as one would do under Linux or Mac OS X.

Companion Site

Supplementary materials for readers can be downloaded from Elsevier: http://store.elsevier.com/product.jsp?isbn=9780124169708.

CHAPTER 1 Introduction

Chapter Outline
1.1 A Brief History of GPU Computing
1.2 Parallel Computation
1.3 Basic Concepts
1.3.1 A First CUDA Fortran Program
1.3.2 Extending to Larger Arrays
1.3.3 Multidimensional Arrays
1.4 Determining CUDA Hardware Features and Limits
1.4.1 Single and Double Precision
1.4.1.1 Accommodating Variable Precision
1.5 Error Handling
1.6 Compiling CUDA Fortran Code
1.6.1 Separate Compilation
1.1 A brief history of GPU computing

Parallel computing has been around in one form or another for many decades. In the early stages it was generally confined to practitioners who had access to large and expensive machines. Today, things are very different. Almost all consumer desktop and laptop computers have central processing units, or CPUs, with multiple cores. Even most processors in cell phones and tablets have multiple cores. The principal reason for the nearly ubiquitous presence of multiple cores in CPUs is the inability of CPU manufacturers to increase performance in single-core designs by boosting the clock speed. As a result, since about 2005 CPU designs have "scaled out" to multiple cores rather than "scaled up" to higher clock rates. Although CPUs are available with a few to tens of cores, this amount of parallelism pales in comparison to the number of cores in a graphics processing unit (GPU). For example, the NVIDIA Tesla® K20X contains 2688 cores. GPUs were highly parallel architectures from their beginning, in the mid-1990s, since graphics processing is an inherently parallel task.

The use of GPUs for general-purpose computing, often referred to as GPGPU, was initially a challenging endeavor. One had to program to the graphics application programming interface (API), which proved to be very restrictive in the types of algorithms that could be mapped to the GPU. Even when such a mapping was possible, the programming required to make this happen was difficult and not intuitive for scientists and engineers outside the computer graphics vocation. As such, adoption of the GPU for scientific and engineering computations was slow.

Things changed for GPU computing with the advent of NVIDIA's CUDA® architecture in 2007. The CUDA architecture included both hardware components on NVIDIA's GPU and a software programming environment that eliminated the barriers to adoption that plagued GPGPU.
Since CUDA's first appearance in 2007, its adoption has been tremendous, to the point where, in November 2010, three of the top five supercomputers in the Top 500 list used GPUs. In the November 2012 Top 500 list, the fastest computer in the world was also GPU-powered. One of the reasons for this very fast adoption of CUDA is that the programming model was very simple. CUDA C, the first interface to the CUDA architecture, is essentially C with a few extensions that can offload portions of an algorithm to run on the GPU. It is a hybrid approach where both CPU and GPU are used, so porting computations to the GPU can be performed incrementally.

In late 2009, a joint effort between The Portland Group® (PGI®) and NVIDIA led to the CUDA Fortran compiler. Just as CUDA C is C with extensions, CUDA Fortran is essentially Fortran 90 with a few extensions that allow users to leverage the power of GPUs in their computations. Many books, articles, and other documents have been written to aid in the development of efficient CUDA C applications (e.g., Sanders and Kandrot, 2011; Kirk and Hwu, 2012; Wilt, 2013). Because it is newer, CUDA Fortran has relatively fewer aids for code development. Much of the material for writing efficient CUDA C translates easily to CUDA Fortran, since the underlying architecture is the same, but there is still a need for material that addresses how to write efficient code in CUDA Fortran. There are a couple of reasons for this. First, though CUDA C and CUDA Fortran are similar, there are some differences that will affect how code is written. This is not surprising, since CPU code written in C and Fortran will typically take on a different character as projects grow. Also, there are some features in CUDA C that are not present in CUDA Fortran, such as certain aspects of textures. Conversely, there are some features in CUDA Fortran, such as the device variable attribute used to denote data that resides on the GPU, that are not present in CUDA C.
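As a brief illustration of the device attribute just mentioned, here is a minimal CUDA Fortran sketch (the program and array names are placeholders, not taken from the text); in CUDA Fortran, data transfers between host and device can be expressed as simple assignments between a host array and a device array:

```fortran
! Sketch only: the device attribute marks data that resides in GPU memory.
! Requires a CUDA Fortran compiler (e.g., pgf90/nvfortran) and an NVIDIA GPU.
program deviceAttrSketch
  implicit none
  integer, parameter :: n = 256
  integer :: a(n)             ! ordinary host array
  integer, device :: a_d(n)   ! device array, lives in GPU memory

  a = 1
  a_d = a     ! host-to-device copy, written as an assignment
  a = a_d     ! device-to-host copy back
end program deviceAttrSketch
```

The assignment syntax hides explicit copy calls such as CUDA C's cudaMemcpy, which is one of the conveniences the device attribute provides.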
This book is written for those who want to use parallel computation as a tool in getting other work done rather than as an end in itself. The aim is to give the reader a basic set of skills necessary for them to write reasonably optimized CUDA Fortran code that takes advantage of the NVIDIA® computing hardware. The reason for taking this approach rather than attempting to teach how to extract every last ounce of performance from the hardware is the assumption that those using CUDA Fortran do so as a means rather than an end. Such users typically value clear and maintainable code that is simple to write and performs reasonably well across many generations of CUDA-enabled hardware and CUDA Fortran software.

But where is the line drawn in terms of the effort-performance tradeoff? In the end it is up to the developer to decide how much effort to put into optimizing code. In making this decision, we need to know what type of payoff we can expect when eliminating various bottlenecks and what effort is involved in doing so. One goal of this book is to help the reader develop an intuition needed to make such a return-on-investment assessment. To achieve this end, we discuss bottlenecks encountered in writing common algorithms in science and engineering applications in CUDA Fortran. Multiple workarounds are presented when possible, along with the performance impact of each optimization effort.

1.2 Parallel computation

Before jumping into writing CUDA Fortran code, we should say a few words about where CUDA fits in with other types of parallel programming models. Familiarity with and an understanding of other parallel programming models is not a prerequisite for this book, but for readers who do have some parallel programming experience, this section might be helpful in categorizing CUDA.

We have already mentioned that CUDA is a hybrid computing model, where both the CPU and GPU are used in an application. This is advantageous for development because sections of an existing CPU code can be ported to the GPU incrementally. It is possible to overlap computation on the CPU with computation on the GPU, so this is one aspect of parallelism.
A far greater degree of parallelism occurs within the GPU itself. Subroutines that run on the GPU are executed by many threads in parallel. Although all threads execute the same code, these threads typically operate on different data. This data parallelism is a fine-grained parallelism, where it is most efficient to have adjacent threads operate on adjacent data, such as elements of an array. This model of parallelism is very different from a model like Message Passing Interface, commonly known as MPI, which is a coarse-grained model. In MPI, data are typically divided into large segments or partitions, and each MPI process performs calculations on an entire data partition.

A few characteristics of the CUDA programming model are very different from CPU-based parallel programming models. One difference is that there is very little overhead associated with creating GPU threads. In addition to fast thread creation, context switches, where threads change from active to inactive and vice versa, are very fast for GPU threads compared to CPU threads. The reason context switching is essentially instantaneous on the GPU is that the GPU does not have to store state, as the CPU does when switching threads between being active and inactive. As a result of this fast context switching, it is advantageous to heavily oversubscribe GPU cores: that is, to have many more resident threads than GPU cores so that memory latencies can be hidden. It is not uncommon to have the number of resident threads on a GPU an order of magnitude larger than the number of cores on the GPU. In the CUDA programming model, we essentially write a serial code that is executed by many GPU threads in parallel. Each thread executing this code has a means of identifying itself in order to operate on different data, but the code that CUDA threads execute is very similar to what we would write for serial CPU code. On the other hand, the code of many parallel CPU programming models differs greatly from serial CPU code. We will revisit each of these aspects of the CUDA programming model and architecture as they arise in the following discussion.

1.3 Basic concepts

This section contains a progression of simple CUDA Fortran code examples used to demonstrate various basic concepts of programming in CUDA Fortran.
Before we start, we need to define a few terms. CUDA Fortran is a hybrid programming model, meaning that code sections can execute either on the CPU or the GPU, or more precisely, on the host or device. The term host is used to refer to the CPU and its memory, and the term device is used to refer to the GPU and its memory, both in the context of a CUDA Fortran program. Going forward, we use the term CPU code to refer to a CPU-only implementation. A subroutine that executes on the device but is called from the host is called a kernel.

1.3.1 A first CUDA Fortran program

As a reference, we start with a Fortran 90 code that increments an array. The code is arranged so that the incrementing is performed in a subroutine, which itself is in a Fortran 90 module. The subroutine loops over and increments each element of an array by the value of the parameter b that is passed into the subroutine.

```fortran
module simpleOps_m
contains
  subroutine increment(a, b)
    implicit none
    integer, intent(inout) :: a(:)
    integer, intent(in) :: b
    integer :: i, n

    n = size(a)
    do i = 1, n
       a(i) = a(i)+b
    enddo

  end subroutine increment
end module simpleOps_m


program incrementTestCPU
  use simpleOps_m
  implicit none
  integer, parameter :: n = 256
  integer :: a(n), b

  a = 1
  b = 3
  call increment(a, b)

  if (any(a /= 4)) then
     write(*,*) '**** Program Failed ****'
  else
     write(*,*) 'Program Passed'
  endif
end program incrementTestCPU
```
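For orientation, the CPU code above might be ported to the GPU along the following lines. This is a hedged sketch, not the book's own listing: the attributes(global) marker declares a kernel, the device attribute places the array in GPU memory, threadIdx identifies each thread, and the chevron syntax launches the kernel with one block of n threads.

```fortran
! Sketch of a possible CUDA Fortran analog of the CPU increment code.
! Names mirror the CPU version for illustration; compiling requires a
! CUDA Fortran compiler (e.g., pgf90/nvfortran) and an NVIDIA GPU.
module simpleOps_m
contains
  attributes(global) subroutine increment(a, b)
    implicit none
    integer, intent(inout) :: a(:)
    integer, value :: b
    integer :: i

    i = threadIdx%x        ! each thread operates on one array element
    a(i) = a(i) + b
  end subroutine increment
end module simpleOps_m


program incrementTestGPU
  use cudafor
  use simpleOps_m
  implicit none
  integer, parameter :: n = 256
  integer :: a(n), b
  integer, device :: a_d(n)   ! device copy of the array

  a = 1
  b = 3
  a_d = a                           ! copy input to the device
  call increment<<<1,n>>>(a_d, b)   ! launch kernel: 1 block of n threads
  a = a_d                           ! copy result back to the host

  if (any(a /= 4)) then
     write(*,*) '**** Program Failed ****'
  else
     write(*,*) 'Program Passed'
  endif
end program incrementTestGPU
```

Note how little the kernel body differs from the serial loop body: the loop over i is replaced by a thread index, which is the fine-grained data parallelism described in Section 1.2.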
