
Heterogeneous Computing with OpenCL 2.0 PDF

313 pages · 2015 · 13.602 MB · English

Preview: Heterogeneous Computing with OpenCL 2.0

Heterogeneous Computing with OpenCL 2.0
Third Edition

David Kaeli
Perhaad Mistry
Dana Schaa
Dong Ping Zhang

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann is an imprint of Elsevier

Acquiring Editor: Todd Green
Editorial Project Manager: Charlie Kent
Project Manager: Priya Kumaraguruparan
Cover Designer: Matthew Limbert

Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA

Copyright © 2015, 2013, 2012 Advanced Micro Devices, Inc. Published by Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-12-801414-1

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

For information on all MK publications visit our website at www.mkp.com

List of Figures

Fig. 1.1 (a) Simple sorting: a divide-and-conquer implementation, breaking the list into shorter lists, sorting them, and then merging the shorter sorted lists. (b) Vector-scalar multiply: scattering the multiplies and then gathering the results to be summed up in a series of steps.
Fig. 1.2 Multiplying elements in arrays A and B, and storing the result in an array C.
Fig. 1.3 Task parallelism present in a fast Fourier transform (FFT) application. Different input images are processed independently in the three independent tasks.
Fig. 1.4 Task-level parallelism, where multiple words can be compared concurrently. Also shown is finer-grained character-by-character parallelism present when characters within the words are compared with the search string.
Fig. 1.5 After all string comparisons in Figure 1.4 have been completed, we can sum up the number of matches in a combining network.
Fig. 1.6 The relationship between parallel and concurrent programs. Parallel and concurrent programs are subsets of all programs.
Fig. 2.1 Out-of-order execution of an instruction stream of simple assembly-like instructions. Note that in this syntax, the destination register is listed first. For example, add a,b,c is a = b + c.
Fig. 2.2 VLIW execution based on the out-of-order diagram in Figure 2.1.
Fig. 2.3 SIMD execution where a single instruction is scheduled in order, but executes over multiple ALUs at the same time.
Fig. 2.4 The out-of-order schedule seen in Figure 2.1 combined with a second thread and executed simultaneously.
Fig. 2.5 Two threads scheduled in a time-slice fashion.
Fig. 2.6 Taking temporal multithreading to an extreme as is done in throughput computing: a large number of threads interleave execution to keep the device busy, whereas each individual thread takes longer to execute than the theoretical minimum.
Fig. 2.7 The AMD Puma (left) and Steamroller (right) high-level designs (not shown to any shared scale). Puma is a low-power design that follows a traditional approach to mapping functional units to cores. Steamroller combines two cores within a module, sharing its floating-point (FP) units.
Fig. 2.8 The AMD Radeon HD 6970 GPU architecture. The device is divided into two halves, where instruction control (scheduling and dispatch) is performed by the wave scheduler for each half. The 24 16-lane SIMD cores execute four-way VLIW instructions on each SIMD lane and contain private level 1 (L1) caches and local data shares (scratchpad memory).
Fig. 2.9 The Niagara 2 CPU from Sun/Oracle. The design intends to make a high level of threading efficient. Note its relative similarity to the GPU design seen in Figure 2.8. Given enough threads, we can cover all memory access time with useful compute, without extracting instruction-level parallelism (ILP) through complicated hardware techniques.
Fig. 2.10 The AMD Radeon R9 290X architecture. The device has 44 cores in 11 clusters. Each core consists of a scalar execution unit that handles branches and basic integer operations, and four 16-lane SIMD ALUs. The clusters share instruction and scalar caches.
Fig. 2.11 The NVIDIA GeForce GTX 780 architecture. The device has 12 large cores that NVIDIA refers to as "streaming multiprocessors" (SMX). Each SMX has 12 SIMD units (with specialized double-precision and special function units), a single L1 cache, and a read-only data cache.
Fig. 2.12 The A10-7850K APU consists of two Steamroller-based CPU cores and eight Radeon R9 GPU cores (32 16-lane SIMD units in total). The APU includes a fast bus from the GPU to DDR3 memory, and a shared path that is optionally coherent with CPU caches.
Fig. 2.13 An Intel i7 processor with HD Graphics 4000 graphics. Although not termed "APU" by Intel, the concept is the same as for the devices in that category from AMD. Intel combines four Haswell x86 cores with its graphics processors, connected to a shared last-level cache (LLC) via a ring bus.
Fig. 3.1 An OpenCL platform with multiple compute devices. Each compute device contains one or more compute units. A compute unit is composed of one or more processing elements (PEs). A system could have multiple platforms present at the same time. For example, a system could have an AMD platform and an Intel platform present at the same time.
Fig. 3.2 Some of the output from the CLInfo program showing the characteristics of an OpenCL platform and devices. We see that the AMD platform has two devices (a CPU and a GPU). The output shown here can be queried using functions from the platform API.
Fig. 3.3 Vector addition algorithm showing how each element can be added independently.
Fig. 3.4 The hierarchical model used for creating an NDRange of work-items, grouped into work-groups.
Fig. 3.5 The OpenCL runtime shown denotes an OpenCL context with two compute devices (a CPU device and a GPU device). Each compute device has its own command-queues. Host-side and device-side command-queues are shown. The device-side queues are visible only from kernels executing on the compute device. The memory objects have been defined within the memory model.
Fig. 3.6 Memory regions and their scope in the OpenCL memory model.
Fig. 3.7 Mapping the OpenCL memory model to an AMD Radeon HD 7970 GPU.
Fig. 4.1 A histogram generated from an 8-bit image. Each bin corresponds to the frequency of the corresponding pixel value.
Fig. 4.2 An image rotated by 45°. Pixels that correspond to an out-of-bounds location in the input image are returned as black.
Fig. 4.3 Applying a convolution filter to a source image.
Fig. 4.4 The effect of different convolution filters applied to the same source image: (a) the original image; (b) blurring filter; and (c) embossing filter.
Fig. 4.5 The producer kernel will generate filtered pixels and send them via a pipe to the consumer kernel, which will then generate the histogram: (a) original image; (b) filtered image; and (c) histogram of filtered image.
Fig. 5.1 Multiple command-queues created for different devices declared within the same context. Two devices are shown, where one command-queue has been created for each device.
Fig. 5.2 Multiple devices working in a pipelined manner on the same data. The CPU queue will wait until the GPU kernel has finished.
Fig. 5.3 Multiple devices working in a parallel manner. In this scenario, both GPUs do not use the same buffers and will execute independently. The CPU queue will wait until both GPU devices have finished.
Fig. 5.4 Executing the simple kernel shown in Listing 5.5. The different work-items in the NDRange are shown.
Fig. 5.5 Within a single kernel dispatch, synchronization regarding execution order is supported only within work-groups using barriers. Global synchronization is maintained by completion of the kernel, and the guarantee that on a completion event all work is complete and memory content is as expected.
Fig. 5.6 Example showing OpenCL memory objects mapping to arguments for clEnqueueNativeKernel() in Listing 5.8.
Fig. 5.7 A single-level fork-join execution paradigm compared with nested parallelism thread execution.
Fig. 6.1 An example showing a scenario where a buffer is created and initialized on the host, used for computation on the device, and transferred back to the host. Note that the runtime could have also created and initialized the buffer directly on the device. (a) Creation and initialization of a buffer in host memory. (b) Implicit data transfer from the host to the device prior to kernel execution. (c) Explicit copying of data back from the device to the host pointer.
Fig. 6.2 Data movement using explicit read-write commands. (a) Creation of an uninitialized buffer in device memory. (b) Explicit data transfer from the host to the device prior to execution. (c) Explicit data transfer from the device to the host following execution.
Fig. 6.3 Data movement using map/unmap. (a) Creation of an uninitialized buffer in device memory. (b) The buffer is mapped into the host's address space. (c) The buffer is unmapped from the host's address space.
Fig. 7.1 The memory spaces available to an OpenCL device.
Fig. 7.2 Data race when incrementing a shared variable. The value stored depends on the ordering of operations between the threads.
Fig. 7.3 Applying Z-order mapping to a two-dimensional memory space.
Fig. 7.4 The pattern of data flow for the example shown in the localAccess kernel.
Fig. 8.1 High-level design of AMD's Piledriver-based FX-8350 CPU.
Fig. 8.2 OpenCL mapped onto an FX-8350 CPU. The FX-8350 CPU is both the OpenCL host and the device in this scenario.
Fig. 8.3 Implementation of work-group execution on an x86 architecture.
Fig. 8.4 Mapping the memory spaces for a work-group (work-group 0) onto a Piledriver CPU cache.
Fig. 8.5 High-level Radeon R9 290X diagram labeled with OpenCL execution and memory model terms.
Fig. 8.6 Memory bandwidths in the discrete system.
Fig. 8.7 Radeon R9 290X compute unit microarchitecture.
Fig. 8.8 Mapping OpenCL's memory model onto a Radeon R9 290X GPU.
Fig. 8.9 Using vector reads provides a better opportunity to return data efficiently through the memory system. When work-items access consecutive elements, GPU hardware can achieve the same result through coalescing.
Fig. 8.10 Accesses to nonconsecutive elements return smaller pieces of data less efficiently.
Fig. 8.11 Mapping the Radeon R9 290X address space onto memory channels and DRAM banks.
Fig. 8.12 Radeon R9 290X memory subsystem.
Fig. 8.13 The accumulation pass of the prefix sum shown in Listing 8.2 over a 16-element array in local memory using 8 work-items.
Fig. 8.14 Step 1 in Figure 8.13 showing the behavior of an LDS with eight banks.
Fig. 8.15 Step 1 in Figure 8.14 with padding added to the original data set to remove bank conflicts in the LDS.
Fig. 9.1 An image classification pipeline. An algorithm such as SURF is used to generate features. A clustering algorithm such as k-means then generates a set of centroid features that can serve as a set of visual words for the image. The generated features are assigned to each centroid by the histogram builder.
Fig. 9.2 Feature generation using the SURF algorithm. The SURF algorithm accepts an image as an input and generates an array of features. Each feature includes position information and a set of 64 values known as a descriptor.
Fig. 9.3 The data transformation kernel used to enable memory coalescing is the same as a matrix transpose kernel.
Fig. 9.4 A transpose illustrated on a one-dimensional array.
Fig. 10.1 The session explorer for CodeXL in profile mode. Two application timeline sessions and one GPU performance counter session are shown.
Fig. 10.2 The Timeline View of CodeXL in profile mode for the Nbody application. We see the time spent in data transfer and kernel execution.
Fig. 10.3 The API Trace View of CodeXL in profile mode for the Nbody application.
Fig. 10.4 CodeXL Profiler showing the different GPU kernel performance counters for the Nbody kernel.
Fig. 10.5 AMD CodeXL explorer in analysis mode. The NBody OpenCL kernel has been compiled and analyzed for a number of different graphics architectures.
Fig. 10.6 The ISA view of KernelAnalyzer. The NBody OpenCL kernel has been compiled for multiple graphics architectures. For each architecture, the AMD IL and the GPU ISA can be evaluated.
Fig. 10.7 The Statistics view for the Nbody kernel shown by KernelAnalyzer. We see that the number of concurrent wavefronts that can be scheduled is limited by the number of vector registers.
Fig. 10.8 The Analysis view of the Nbody kernel is shown. The execution duration calculated by emulation is shown for different graphics architectures.
Fig. 10.9 A high-level overview of how CodeXL interacts with an OpenCL application.
Fig. 10.10 CodeXL API trace showing the history of the OpenCL functions called.
Fig. 10.11 A kernel breakpoint set on the Nbody kernel.
Fig. 10.12 The Multi-Watch window showing the values of a global memory buffer in the Nbody example. The values can also be visualized as an image.
Fig. 11.1 C++ AMP code example: vector addition.
Fig. 11.2 Vector addition, conceptual view.
Fig. 11.3 Functor version for C++ AMP vector addition (conceptual code).
Fig. 11.4 Further expanded version for C++ AMP vector addition (conceptual code).
Fig. 11.5 Host code implementation of parallel_for_each (conceptual code).
Fig. 11.6 C++ AMP lambda: vector addition.
Fig. 11.7 Compiled OpenCL SPIR code: vector addition kernel.
Fig. 12.1 WebCL objects.
Fig. 12.2 Using multiple command-queues for overlapped data transfer.
Fig. 12.3 Typical runtime involving WebCL and WebGL.
Fig. 12.4 Two triangles in WebGL to draw a WebCL-generated image.

List of Tables

Table 4.1 The OpenCL Features Covered by Each Example
Table 6.1 Summary of Options for SVM
Table 9.1 The Time Taken for the Transpose Kernel
Table 9.2 Kernel Running Time (ms) for Different GPU Implementations
Table 10.1 The Command States That Can Be Used to Obtain Timestamps from OpenCL Events
Table 11.1 Mapping Key C++ AMP Constructs to OpenCL
Table 11.2 Conceptual Mapping of Data Members on the Host Side and on the Device Side
Table 11.3 Data Sharing Behavior and Implications of OpenCL 2.0 SVM Support
Table 12.1 Relationships Between C Types Used in Kernels and setArg()'s webcl.type

Foreword

In the last few years computing has entered the heterogeneous computing era, which aims to bring together in a single device the best of both central processing units (CPUs) and graphics processing units (GPUs). Designers are creating an increasingly wide range of heterogeneous machines, and hardware vendors are making them broadly available. This change in hardware offers great platforms for exciting new applications. But, because the designs are different, classical programming models do not work very well, and it is important to learn about new models such as those in OpenCL.

When the design of OpenCL started, the designers noticed that for a class of algorithms that were latency focused (spreadsheets), developers wrote code in C or C++ and ran it on a CPU, but for a second class of algorithms that were throughput focused (e.g. matrix multiply), developers often wrote in CUDA and used a GPU: two related approaches, but each worked on only one kind of processor. C++ did not run on a GPU, and CUDA did not run on a CPU. Developers had to specialize in one and ignore the other. But the real power of a heterogeneous device is that it can efficiently run applications that mix both classes of algorithms. The question was: how do you program such machines?

One solution is to add new features to the existing platforms; both C++ and CUDA are actively evolving to meet the challenge of new hardware. Another solution was to create a new set of programming abstractions specifically targeted at heterogeneous computing. Apple came up with an initial proposal for such a new paradigm. This proposal was refined by technical teams from many companies, and became OpenCL.

When the design started, I was privileged to be part of one of those teams. We had a lot of goals for the kernel language: (1) let developers write kernels in a single source language; (2) allow those kernels to be functionally portable over CPUs, GPUs, field-programmable gate arrays, and other sorts of devices; (3) be low level, so that developers could tease out all the performance of each device; (4) keep the model abstract enough that the same code would work correctly on machines being built by lots of companies. And, of course, as with any computer project, we wanted to do this fast. To speed up implementations, we chose to base the language on C99. In less than 6 months we produced the specification for OpenCL 1.0, and within 1 year the first implementations appeared. And then, time passed and OpenCL met real developers...

So what happened? First, C developers pointed out all the great C++ features (a real memory model, atomics, etc.) that made them more productive, and CUDA developers pointed out all the new features that NVIDIA added to CUDA (e.g. nested parallelism) that make programs both simpler and faster. Second, as hardware architects explored heterogeneous computing, they figured out how to remove the early restrictions requiring CPUs and GPUs to have separate memories. One great hardware change was the development of integrated devices, which provide both a

