CUDA Fortran for
Scientists and Engineers
Best Practices for Efficient CUDA
Fortran Programming
Gregory Ruetsch and Massimiliano Fatica
NVIDIA Corporation, Santa Clara, CA
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier
Acquiring Editor: Todd Green
Development Editor: Lindsay Lawrence
Project Manager: Punithavathy Govindaradjane
Designer: Matthew Limbert
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
Copyright © 2014 Gregory Ruetsch/NVIDIA Corporation and Massimiliano Fatica/NVIDIA Corporation.
Published by Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechani-
cal, including photocopying, recording, or any information storage and retrieval system, without permission in
writing from the publisher. Details on how to seek permission, further information about the Publisher’s permis-
sions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright
Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other
than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any infor-
mation, methods, compounds, or experiments described herein. In using such information or methods they should be
mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability
for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or
from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
Ruetsch, Gregory.
CUDA Fortran for scientists and engineers : best practices for efficient
CUDA Fortran programming / Gregory Ruetsch, Massimiliano Fatica.
pages cm
Includes bibliographical references and index.
ISBN 978-0-12-416970-8 (alk. paper)
1. FORTRAN (Computer program language) I. Fatica, Massimiliano. II. Title.
III. Title: Best practices for efficient CUDA Fortran programming.
QA76.73.F25R833 2013
005.13’1--dc23
2013022226
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-12-416970-8
Printed and bound in the United States of America
14 15 16 17 18 10 9 8 7 6 5 4 3 2 1
For information on all MK publications visit our website at www.mkp.com
To Fortran programmers, who know a good thing when they see it.
Acknowledgments
Writing this book has been an enjoyable and rewarding experience for us, largely due to the interactions
with the people who helped shape the book into the form you have before you. There are many people
who have helped with this book, both directly and indirectly, and at the risk of leaving someone out we
would like to thank the following people for their assistance.
Of course, a book on CUDA Fortran would not be possible without CUDA Fortran itself, and we
would like to thank The Portland Group (PGI), especially Brent Leback and Michael Wolfe, for liter-
ally giving us something to write about. Working with PGI on CUDA Fortran has been a delightful
experience.
The authors often reflect on how computations used in their theses, which required many, many
hours on large-vector machines of the day, can now run on an NVIDIA graphics processing unit (GPU)
in less time than it takes to get a cup of coffee. We would like to thank those at NVIDIA who helped
enable this technological breakthrough. We would like to thank past and present members of the
CUDA software team, especially Philip Cuadra, Mark Hairgrove, Stephen Jones, Tim Murray, and Joel
Sherpelz for answering the many questions we asked them.
Much of the material in this book grew out of collaborative efforts in performance-tuning appli-
cations. We would like to thank our collaborators in such efforts, including Norbert Juffa, Patrick
Legresley, Paulius Micikevicius, and Everett Phillips.
Many people reviewed the manuscript for this book at various stages in its development, and we
would like to thank Roberto Gomperts, Mark Harris, Norbert Juffa, Brent Leback, and Everett Phillips
for their comments and suggestions.
We would like to thank Ian Buck for allowing us to spend time at work on this endeavor, and we
would like to thank our families for their understanding while we also worked at home.
Finally, we would like to thank all of our teachers. They enabled us to write this book, and we hope
in some way that by doing so, we have continued the chain of helping others.
Preface
This document is intended for scientists and engineers who develop or maintain computer simulations
and applications in Fortran and who would like to harness the parallel processing power of graphics
processing units (GPUs) to accelerate their code. The goal here is to provide the reader with the funda-
mentals of GPU programming using CUDA Fortran as well as some typical examples, without having
the task of developing CUDA Fortran code become an end in itself.
The CUDA architecture was developed by NVIDIA to allow use of the GPU for general-purpose
computing without requiring the programmer to have a background in graphics. There are many ways
to access the CUDA architecture from a programmer’s perspective, including through C/C++ from
CUDA C or through Fortran using The Portland Group’s (PGI’s) CUDA Fortran. This document per-
tains to the latter approach. PGI’s CUDA Fortran should be distinguished from the PGI Accelerator and
OpenACC Fortran interfaces to the CUDA architecture, which are directive-based approaches to using
the GPU. CUDA Fortran is simply the Fortran analog to CUDA C.
The reader of this book should be familiar with Fortran 90 concepts, such as modules, derived
types, and array operations. For those familiar with earlier versions of Fortran but looking to upgrade
to a more recent version, there are several excellent books that cover this material (e.g., Metcalf, 2011).
Some features introduced in Fortran 2003 are used in this book, but these concepts are explained in
detail. Although this book does assume some familiarity with Fortran 90, no experience with parallel
programming (on the GPU or otherwise) is required. Part of the appeal of parallel programming on
GPUs using CUDA is that the programming model is simple and novices can get parallel code up and
running very quickly.
Often one comes to CUDA Fortran with the goal of porting existing, sometimes rather lengthy,
Fortran code to code that leverages the GPU. Because CUDA is a hybrid programming model, where
both GPU and CPU are utilized, CPU code can be incrementally ported to the GPU. CUDA Fortran is
also used by those porting applications to GPUs mainly using the directive-based OpenACC approach,
but who want to improve the performance of a few critical sections of code by hand-coding CUDA
Fortran. Both OpenACC and CUDA Fortran can coexist in the same code.
This book is divided into two main parts. The first part is a tutorial on CUDA Fortran programming,
from the basics of writing CUDA Fortran code to some tips on optimization. The second part is a col-
lection of case studies that demonstrate how the principles in the first part are applied to real-world
examples.
This book makes use of the PGI 13.x compilers, which can be obtained from http://pgroup.com.
Although the examples can be compiled and run on any supported operating system in a variety of
development environments, the examples included here are compiled from the command line as one
would do under Linux or Mac OS X.
Companion Site
Supplementary materials for readers can be downloaded from Elsevier:
http://store.elsevier.com/product.jsp?isbn=9780124169708.
CHAPTER 1
Introduction
CHAPTER OUTLINE
1.1 A Brief History of GPU Computing . . . . 3
1.2 Parallel Computation . . . . 5
1.3 Basic Concepts . . . . 5
    1.3.1 A First CUDA Fortran Program . . . . 6
    1.3.2 Extending to Larger Arrays . . . . 9
    1.3.3 Multidimensional Arrays . . . . 12
1.4 Determining CUDA Hardware Features and Limits . . . . 13
    1.4.1 Single and Double Precision . . . . 21
        1.4.1.1 Accommodating Variable Precision . . . . 21
1.5 Error Handling . . . . 23
1.6 Compiling CUDA Fortran Code . . . . 24
    1.6.1 Separate Compilation . . . . 27
1.1 A brief history of GPU computing
Parallel computing has been around in one form or another for many decades. In the early stages it
was generally confined to practitioners who had access to large and expensive machines. Today, things
are very different. Almost all consumer desktop and laptop computers have central processing units, or
CPUs, with multiple cores. Even most processors in cell phones and tablets have multiple cores. The
principal reason for the nearly ubiquitous presence of multiple cores in CPUs is the inability of CPU
manufacturers to increase performance in single-core designs by boosting the clock speed. As a result,
since about 2005 CPU designs have “scaled out” to multiple cores rather than “scaled up” to higher
clock rates. Although CPUs are available with a few to tens of cores, this amount of parallelism pales
in comparison to the number of cores in a graphics processing unit (GPU). For example, the NVIDIA
Tesla® K20X contains 2688 cores. GPUs were highly parallel architectures from their beginning, in the
mid-1990s, since graphics processing is an inherently parallel task.
CUDA Fortran for Scientists and Engineers. http://dx.doi.org/10.1016/B978-0-12-416970-8.00001-8
© 2014 Elsevier Inc. All rights reserved.
The use of GPUs for general-purpose computing, often referred to as GPGPU, was initially a chal-
lenging endeavor. One had to program to the graphics application programming interface (API), which
proved to be very restrictive in the types of algorithms that could be mapped to the GPU. Even when
such a mapping was possible, the programming required to make this happen was difficult and not
intuitive for scientists and engineers outside the computer graphics vocation. As such, adoption of the
GPU for scientific and engineering computations was slow.
Things changed for GPU computing with the advent of NVIDIA’s CUDA® architecture in 2007. The
CUDA architecture included both hardware components on NVIDIA’s GPU and a software program-
ming environment that eliminated the barriers to adoption that plagued GPGPU. Since CUDA’s first
appearance in 2007, its adoption has been tremendous, to the point where, in November 2010, three
of the top five supercomputers in the Top 500 list used GPUs. In the November 2012 Top 500 list, the
fastest computer in the world was also GPU-powered. One of the reasons for this very fast adoption
of CUDA is that the programming model was very simple. CUDA C, the first interface to the CUDA
architecture, is essentially C with a few extensions that can offload portions of an algorithm to run on
the GPU. It is a hybrid approach where both CPU and GPU are used, so porting computations to the
GPU can be performed incrementally.
In late 2009, a joint effort between The Portland Group® (PGI®) and NVIDIA led to the CUDA For-
tran compiler. Just as CUDA C is C with extensions, CUDA Fortran is essentially Fortran 90 with a few
extensions that allow users to leverage the power of GPUs in their computations. Many books, articles,
and other documents have been written to aid in the development of efficient CUDA C applications
(e.g., Sanders and Kandrot, 2011; Kirk and Hwu, 2012; Wilt, 2013). Because it is newer, CUDA Fortran
has relatively fewer aids for code development. Much of the material for writing efficient CUDA C
translates easily to CUDA Fortran, since the underlying architecture is the same, but there is still a need
for material that addresses how to write efficient code in CUDA Fortran. There are a couple of reasons
for this. First, though CUDA C and CUDA Fortran are similar, there are some differences that will affect
how code is written. This is not surprising, since CPU code written in C and Fortran will typically take
on a different character as projects grow. Also, there are some features in CUDA C that are not present
in CUDA Fortran, such as certain aspects of textures. Conversely, there are some features in CUDA
Fortran, such as the device variable attribute used to denote data that resides on the GPU, that are not
present in CUDA C.
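As a minimal sketch of that last point, the device attribute marks an array as residing in GPU memory, and assignment between a host array and a device array implies a data transfer. The program and variable names below are illustrative, not taken from the text:

```fortran
program deviceAttribute
  implicit none
  integer, parameter :: n = 4
  real :: a(n)            ! ordinary host array
  real, device :: a_d(n)  ! the device attribute: a_d resides in GPU memory

  a = 1.0
  a_d = a                 ! assignment implies a host-to-device copy
  a = a_d                 ! ... and here a device-to-host copy
end program deviceAttribute
```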
This book is written for those who want to use parallel computation as a tool in getting other work
done rather than as an end in itself. The aim is to give the reader a basic set of skills necessary for them
to write reasonably optimized CUDA Fortran code that takes advantage of the NVIDIA® computing
hardware. The reason for taking this approach rather than attempting to teach how to extract every last
ounce of performance from the hardware is the assumption that those using CUDA Fortran do so as a
means rather than an end. Such users typically value clear and maintainable code that is simple to write
and performs reasonably well across many generations of CUDA-enabled hardware and CUDA Fortran
software.
But where is the line drawn in terms of the effort-performance tradeoff? In the end it is up to the
developer to decide how much effort to put into optimizing code. In making this decision, we need
to know what type of payoff we can expect when eliminating various bottlenecks and what effort is
involvedindoingso.Onegoalofthisbookistohelpthereaderdevelopanintuitionneededtomakesuch
areturn-on-investmentassessment.Toachievethisend,wediscussbottlenecksencounteredinwriting
common algorithms in science and engineering applications in CUDA Fortran. Multiple workarounds
are presented when possible, along with the performance impact of each optimization effort.
1.2 Parallel computation
Before jumping into writing CUDA Fortran code, we should say a few words about where CUDA fits
in with other types of parallel programming models. Familiarity with and an understanding of other
parallel programming models is not a prerequisite for this book, but for readers who do have some
parallel programming experience, this section might be helpful in categorizing CUDA.
We have already mentioned that CUDA is a hybrid computing model, where both the CPU and GPU
are used in an application. This is advantageous for development because sections of an existing CPU
code can be ported to the GPU incrementally. It is possible to overlap computation on the CPU with
computation on the GPU, so this is one aspect of parallelism.

A far greater degree of parallelism occurs within the GPU itself. Subroutines that run on the GPU
are executed by many threads in parallel. Although all threads execute the same code, these threads
typically operate on different data. This data parallelism is a fine-grained parallelism, where it is most
efficient to have adjacent threads operate on adjacent data, such as elements of an array. This model of
parallelism is very different from a model like Message Passing Interface, commonly known as MPI,
which is a coarse-grained model. In MPI, data are typically divided into large segments or partitions,
and each MPI process performs calculations on an entire data partition.
A few characteristics of the CUDA programming model are very different from CPU-based parallel
programming models. One difference is that there is very little overhead associated with creating GPU
threads. In addition to fast thread creation, context switches, where threads change from active to inactive
and vice versa, are very fast for GPU threads compared to CPU threads. The reason context switching
is essentially instantaneous on the GPU is that the GPU does not have to store state, as the CPU does
when switching threads between being active and inactive. As a result of this fast context switching,
it is advantageous to heavily oversubscribe GPU cores; that is, to have many more resident threads than
GPU cores so that memory latencies can be hidden. It is not uncommon to have the number of resident
threads on a GPU an order of magnitude larger than the number of cores on the GPU. In the CUDA
programming model, we essentially write a serial code that is executed by many GPU threads in parallel.
Each thread executing this code has a means of identifying itself in order to operate on different data,
but the code that CUDA threads execute is very similar to what we would write for serial CPU code.
On the other hand, the code of many parallel CPU programming models differs greatly from serial CPU
code. We will revisit each of these aspects of the CUDA programming model and architecture as they
arise in the following discussion.
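The mechanism just described can be sketched in a few lines of CUDA Fortran: every thread runs the same code, but each computes its own global index from the built-in blockIdx, blockDim, and threadIdx variables and so operates on its own array element. The kernel and argument names below are illustrative, not taken from the text:

```fortran
module sketch_m
contains
  ! Each thread handles one array element; threads whose index falls
  ! past the end of the array simply do nothing.
  attributes(global) subroutine scaleKernel(x, c)
    implicit none
    real :: x(:)       ! array argument resides in device memory
    real, value :: c   ! scalar passed by value from the host
    integer :: i

    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= size(x)) x(i) = c * x(i)
  end subroutine scaleKernel
end module sketch_m
```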
1.3 Basic concepts
This section contains a progression of simple CUDA Fortran code examples used to demonstrate various
basic concepts of programming in CUDA Fortran.
Before we start, we need to define a few terms. CUDA Fortran is a hybrid programming model,
meaning that code sections can execute either on the CPU or the GPU, or more precisely, on the host or
device. The term host is used to refer to the CPU and its memory, and the term device is used to refer
to the GPU and its memory, both in the context of a CUDA Fortran program. Going forward, we use the
term CPU code to refer to a CPU-only implementation. A subroutine that executes on the device but is
called from the host is called a kernel.
1.3.1 A first CUDA Fortran program
As a reference, we start with a Fortran 90 code that increments an array. The code is arranged so that
the incrementing is performed in a subroutine, which itself is in a Fortran 90 module. The subroutine
loops over and increments each element of an array by the value of the parameter b that is passed into
the subroutine.
module simpleOps_m
contains
  subroutine increment(a, b)
    implicit none
    integer, intent(inout) :: a(:)
    integer, intent(in) :: b
    integer :: i, n

    n = size(a)
    do i = 1, n
       a(i) = a(i)+b
    enddo

  end subroutine increment
end module simpleOps_m


program incrementTestCPU
  use simpleOps_m
  implicit none
  integer, parameter :: n = 256
  integer :: a(n), b

  a = 1
  b = 3
  call increment(a, b)

  if (any(a /= 4)) then
     write(*,*) '**** Program Failed ****'
  else
     write(*,*) 'Program Passed'