ebook img

Predicting the Performance Impact of Different Fat-Tree Configurations PDF

13 Pages·2017·1.18 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Predicting the Performance Impact of Different Fat-Tree Configurations

Predicting the Performance Impact of Different Fat-Tree Configurations NikhilJain AbhinavBhatele LouisH.Howell LawrenceLivermoreNational LawrenceLivermoreNational LawrenceLivermoreNational Laboratory Laboratory Laboratory [email protected] [email protected] [email protected] DavidBöhme IanKarlin EdgarA.León LawrenceLivermoreNational LawrenceLivermoreNational LawrenceLivermoreNational Laboratory Laboratory Laboratory [email protected] [email protected] [email protected] MisbahMubarak NoahWolfe ToddGamblin ArgonneNationalLaboratory RensselaerPolytechnicInstitute LawrenceLivermoreNational [email protected] [email protected] Laboratory [email protected] MatthewL.Leininger LawrenceLivermoreNational Laboratory [email protected] ABSTRACT KEYWORDS Thefat-treetopologyisoneofthemostcommonlyusednetwork fat-treetopology,networksimulation,performanceprediction,pro- topologiesinHPCsystems.Vendorssupportseveraloptionsthat curement canbeconfiguredwhendeployingfat-treenetworksonproduc- ACMReferenceFormat: tionsystems,suchaslinkbandwidth,numberofrails,numberof NikhilJain,AbhinavBhatele,LouisH.Howell,DavidBöhme,IanKarlin, planes,andtapering.Thispapershowcasestheuseofsimulations EdgarA.León,MisbahMubarak,NoahWolfe,ToddGamblin,andMatthew tocomparetheimpactofthesedesignoptionsonrepresentative L.Leininger.2017.PredictingthePerformanceImpactofDifferentFat-Tree productionHPCapplications,libraries,andmulti-jobworkloads. Configurations.InProceedingsofSC17.ACM,NewYork,NY,USA,13pages. WepresentadvancesintheTraceR-CODESsimulationframework https://doi.org/10.1145/3126908.3126967 thatenablethisanalysisandevaluateitspredictionaccuracyagainst experimentsonaproductionfat-treenetwork.Inordertounder- 1 INTRODUCTION standtheimpactofdifferentnetworkconfigurationsonvarious Thefat-treetopology[31]isexpectedtobeusedforbuildingthe anticipatedscenarios,westudyworkloadswithdifferentcommuni- interconnectionnetworksofmanynext-generationhighperfor- cationpatterns,computation-to-communicationratios,andscaling mancecomputing(HPC)systems,e.g.SierraatLawrenceLiver- characteristics.Usingmulti-jobworkloads,wealsostudytheim- moreNationalLaboratory(LLNL)[8]andSummitatOakRidge pact of inter-job interference on performance and compare the NationalLaboratory[10].Itiscurrentlyusedinmanyproduction cost-performancetradeoffs. systems,rangingfromsmallclusterswithafewhundrednodesto multi-petaFlop/ssupercomputers[9].Suchextensiveuseoffat-tree CCSCONCEPTS networksispartlydrivenbytheireasilyextensibleconstruction •Networks→Networkperformancemodeling;Networksim- fromoff-the-shelfcommodityhardware.Thefat-treetopologyalso ulations;Networkperformanceanalysis; provideshighbisectionbandwidthandarelativelylowdiameter amongtheavailablealternativesforagivennodecount. Thecapabilitiesofnetworksaretypicallyimprovedcommensu- Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor ratelywiththecomputationalpowerofHPCsystemstopreventnet- classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitation worksfromnegativelyimpactingtheperformanceofapplications. onthefirstpage.CopyrightsforcomponentsofthisworkownedbyothersthanACM Forfat-treenetworks,thisiscurrentlybeingdonebyincreasing mustbehonored.Abstractingwithcreditispermitted.Tocopyotherwise,orrepublish, thebandwidthoflinksandbyusingmulti-railandmulti-planenet- topostonserversortoredistributetolists,requirespriorspecificpermissionand/ora [email protected]. works[20].However,networkimprovementsincreaseprocurement SC17,November12–17,2017,Denver,CO,USA andoperationalcosts. ©2017AssociationforComputingMachinery. Sincethetotalbudgetavailableforprocurementandoperation ACMISBN978-1-4503-5114-0/17/11...$15.00 https://doi.org/10.1145/3126908.3126967 ofsupercomputersistypicallyfixed,HPCcentersstrivetostrikea SC17,November12–17,2017,Denver,CO,USA N.Jainetal. a) Single rail single plane (full) b) Single rail single plane (tapered) c) Dual rail single plane d) Dual rail dual plane 8 nodes, 10 switches 12 nodes, 7 switches 4 nodes, 10 switches 8 nodes, 20 switches Figure1:Examplesofdesignoptionsforfat-treenetworks. balancebetweenthecomputationalandcommunicationcapabil- collectives,andfat-treenetworkoptionssuchastapered,multi-rail, itiesoftheirsystems.Theincreasingcostofnetworkspresentsa andmulti-plane.Wethenusetheextendedframeworktopredict challengeinthisregard.Ononehand,HPCcenterswanttomini- the performance impact of different network configurations on mizethenetworkingcostsothatmorecomputationalresourcescan thefollowingapplicationsandlibraries:Hypre[22],Atratus,Mer- bepurchasedandmadeavailablefortheusers.Ontheotherhand, cury[16],MILC[15],ParaDiS[13],pF3D[39],andQbox[23]. reducednetworkcapabilitiescanslowdownsomeapplicationsand Theprimarycontributionsofthisworkare: offsetthegainsprovidedbyadditionalcomputationalresources. • AdvancesintheTraceR-CODESframeworkthatenablelow- Thus,HPCcentersareinterestedinunderstandingtheimpactof effort,accuratesimulationsofproductionapplications. reducednetworkcapabilitiesonperformanceandmitigatingthe • ValidationoftheTraceR-CODESframeworkformanyparallel risksofreducingthosecapabilities. codesincludingproductionapplications. Asaresult,itisimperativetodeveloppredictionmethodologies • Performanceandinterferencepredictionsforproductionap- thatcanassistinevaluatingtheperformanceofapplicationsand plications,libraries,andmulti-jobworkloadsonarangeof multi-jobworkloadsonavailablenetworkalternatives.Fortypical potentialfuturenetworkdesigns. HPCdatacenters,asignificantfractionofsystemtimeisspent in running a few production applications (each using 5–30% of Wealsoassessthesuitabilityofdifferentfat-treeconfigurations totalnodesinthesystem[32]).Thus,inthispaper,wefocuson fortheapplicationsandworkloadssimulatedinthispaper.Notethat predictingtheimpactofalternativefat-treeconfigurationsonaset thechoiceofinputproblemscanpotentiallyalterthecharacteristics ofsuchrepresentativeapplicationsandlibraries. of many applications, and hence some of the results presented Severalruntimecharacteristicsareexpectedtohaveasignif- maynotapplytoinputsthatsignificantlyalterthebehaviorofan icanteffectontheperformanceofcurrentapplicationsonnext- application. generationsystems.Wefocusonpredictingtheimpactof:1)com- putationscalingofanapplicationonitsperformance,2)thecommu- 2 FAT-TREENETWORKS nicationpatternofanapplicationonitsscaling,and3)theimpact ofinter-jobinterferenceonperformance. Thefat-treetopologyisatree-basedtopologyinwhichbandwidth Arobust,reliablesimulationframeworkcanplayanimportant ofedgesincreasesnearthetop(root)ofthetree[30].Practicalde- roleinperformingstudiesthatcananswerthesequestions.The ploymentsoffat-treeinmostsupercomputersresemblethefolded- needforsimulationsarisesbecauseexistingsystemsofferlimited ClostopologyasshowninFigure1(a).Inthissetup,manyrouters optionsforstudyingfuturenetworks.Thisisbecausefuturenet- ofsameradixaregroupedtogethertoformcoreswitchesandpro- workswillbeinherentlydifferentfromcurrentnetworks.Wealso videhighbandwidth.Thefat-treeshowninFigure1(a)isafull cannotchangethehardwareand/orchangetheparametersofpro- fat-tree:thetotalbandwidthwithinaleveldoesnotdecreaseas ductionnetworkswithoutdisruptingexistingproductionwork- wemovefromnodesconnectedtotheleafswitchestowardhigher loads. levels. Thesimulationframeworkusedforpredictionstudiesshouldbe Inordertoreducethecostofthenetwork,taperingcanbede- abletosimulateproductionapplicationsandlibrariesrunningon ployedtoconnectmorenodesperleafswitch(Figure1(b)).This thousandsofMPIprocesses.Useofproductioncodesiscriticalbe- reducesthetotalbandwidthathigherlevelsbutalsolowersthe causecomplexcommunicationcharacteristicsofmanyproduction numberofswitchesandlinksrequiredtoconnectthesamenumber codesarehardtocaptureinsimplifiedproxyapplications.How- ofnodesincomparisontothefullfat-tree. ever,accuratelysimulatingproductionapplicationsischallenging Ontheotherhand,whenhigherbandwidthisdesired,eachnode becausetheeffectsofmillionsofMPIcommunicationcallsand canbeprovidedmultipleports(rails)toinjecttrafficatahigherrate low-leveltransientcontentionneedtobecaptured. intotheleafswitches.Themultipleportscaneitherbeconnected TraceR-CODES[27,35]isascalablesimulationframeworkfor toswitchesinthesameplaneasshowninFigure1(c)ortodisjoint predictingperformanceonHPCnetworks.Inthispaper,weextend planes as shown in Figure 1(d). In both cases, fewer nodes can theTraceR-CODESframeworkwithmultiplecapabilitiesrequired be connected using the same quantity of network resources in toenablelow-effortsimulationsofproductionapplicationsand comparisontothesinglerailfat-tree.Eitheroftheseconfigurations libraries on fat-tree alternatives. The capabilities added include canalsobetaperedtoretainhighinjectionbandwidthatthenodes, supportforOTF2traceformat[6],modelingMPIprotocolsand butreducecostbyreducingthebandwidthatthehigherlevels. PredictingthePerformanceImpactofDifferentFat-TreeConfigurations SC17,November12–17,2017,Denver,CO,USA Table1:Fat-treeconfigurationscurrentlyavailable. Table2:Problemsizesrunon16KMPIprocesses. Configuration Linkb/w #rails #planes Tapering App Descriptionofinputproblem SR-EDR 100Gb/s 1 1 1:1 Hypre Diffusionwithtiledpattern,34,320unknownsper DRP-T-EDR 100Gb/s 2 2 2:1 process DRP-EDR 100Gb/s 2 2 1:1 Atratus CrookedPipe,28,740unknownsperprocess SR-HDR 200Gb/s 1 1 1:1 Mercury 2500particlesperprocess DR-T-HDR 200Gb/s 2 1 2:1 MILC su_rmd,16,384gridpointsperprocess DR-HDR 200Gb/s 2 1 1:1 ParaDiS 5.2millionnodestotal pF3D 3.9Mgridpointsperprocess Qbox Goldbenchmark,512atomstotal Currently,alloftheaboveconfigurationsareofferedbymultiple vendorsincludingMellanoxandIntel.Inaddition,severaloptions areavailableforbandwidthofindividuallinks–FDR(56Gb/s), EDR(100Gb/s),HDR(200Gb/s).Thecombinatorialchoicesoffered bytheseoptionsmakethetaskoffindingthemostsuitableconfig- usingrealisticmaterialpropertiesandaloadbalancingscheme urationforHPCcentersandapplicationsdifficult.Weaddressthis basedondomainreplication.Thenumberofparticlesareweak- challengebyshowingthatsimulationsofapplicationsonpossible scaledwiththenumberofMPIprocessesbutthemeshgeometryis configurations(Table1)canprovidekeyinsightsandreliabledata heldconstant. pointscriticaltothisdecision-makingprocess. MILC[15]isawidelyusedparallelapplicationsuiteforstudying quantumchromodynamics(QCD),thetheoryofstronginteractions 3 APPLICATIONCHARACTERISTICS ofsubatomicphysics.Wehaveusedthesu3_rmdapplicationin Inthissection,webrieflydescribethecodesusedinthisstudyand whichquarkfieldsandMPIprocessesaredefinedasa4Dgridof analyzetheircommunicationcharacteristics.Themotivationfor spacetimepoints. thisanalysisistwo-fold.First,ithelpsunderstandtheperformance ParaDiS[13]isalarge-scaledislocationdynamicssimulationcode trendsobservedforvariouscodesondifferentnetworkconfigu- usedtostudythefundamentalmechanismsofplasticityatLLNL. rations. Second, it can be used to find generic trends in impact ParaDiScouplesdirectN2calculationofforcesbetweendisloca- ofnetworkconfigurationsbasedonspecificresultsobservedfor tionlinesegmentswithafastmultipolemethod(FMM)tocapture differentapplications. remoteinteractions.Itusesahierarchicaldecompositionscheme Thecodesusedinthisstudyincludeapplicationsandlibraries wherethesimulationvolumeisrecursivelydividedintoslabsalong thatareeitherruninproductionatHPCcenters(Hypre,Mercury, eachphysicaldirection. MILC,ParaDiS,pF3D,Qbox),orrepresentcodesruninproduction pF3D[39]isusedforsimulatinglaser-plasmainteractionsinex- (Atratus).Thesecodesspanawiderangeofphysicsandmathe- perimentsconductedattheNationalIgnitionFacilityatLLNL.It maticaldomainsincludingMonteCarlo,first-principlesmolecular solvescoupledwaveequationsforthelaserlight,thescatteredlight, dynamics,transport,plasmainteractions,structuredandunstruc- andtheplasma.pF3Dusesaregular3DCartesiangridandspatial turedgrids,andsparselinearalgebra. decompositionwith2DFFTsintheplanesandbeampropagation Hypre[22]isaparallellinearsolverslibrarydevelopedatLLNL acrosstheplanes. andisusedbymanyproductionapplications.Forthisstudy,weuse Qbox[23]isafirst-principlesmoleculardynamicscodethatisused theAlgebraicMultigrid(AMG)Solverona2Ddiffusionproblem tocomputetheelectronicstructureofatoms,molecules,solids,and usingstructuredAdaptiveMeshRefinement(AMR).Werunaweak liquids within the Density Functional Theory (DFT) formalism. scalingproblemsothenumberofmeshpointsisproportionalto ItwontheGordonBellawardatSC2006andiswidelyusedfor thenumberofMPIprocesses.Thetestsaresetupwithinalarger studyingatomicpropertiesofheavymetals.Qboxdividestheatoms applicationcodebuttracingislimitedtotheoperationstakingplace andtheirstatesamonga2DgridofMPIprocesses,andmakesheavy duringtheHypresetupandsolvephases. useofFFTsandlinearalgebralibrariessuchasScalapackandBLAS. AtratusextendsMULARD[5],ahighorder,finiteelementbased Foreachoftheabovecodes,weworkedwiththeirdevelopers multigroupradiationdiffusioncodethatusesa3Dunstructured and/orfrequentuserstoconstructrepresentativescientificinput mesh,byincludingmoreadvancedphysicsanddiscretizations.Itis problemsforweak-scaledexecutionson1Kto16KMPIprocesses. usedprimarilyasaresearchtooltoexplorefutureprogramming These problems were run either on an Infiniband-Xeon cluster paradigmswithdataflowandcomputationsimportanttoLLNL (Quartz)[7]with32corespernode,oronanIBMBlueGene/Q applications.AtratususesMFEM[4],whichinvokesHypresolvers system(rzUSeq,Vulcan)with16corespernode.Table2liststhe buttheyaremorespecializedthanthestandardAMGsolverwe inputproblemsthatwererunon16KMPIprocesses.Usingprofiling runfortheHypretests. datacollectedfromrunson16Kprocesses,wenowsummarizethe Mercury[16]isaproductionMonteCarloapplicationdeveloped communicationcharacteristicsoftheseapplications.Theresults atLLNLasageneral-purposeparticletransportcode.Forthese presentedheredonotincludeMPIcallsusedintheinitializationstep experiments,weruna3Dneutrontransporteigenvalueproblem ofapplications.Asmentionedbefore,thechoiceofinputproblems SC17,November12–17,2017,Denver,CO,USA N.Jainetal. (a) P2P send size (b) P2P send calls (c) P2P send neighbors 1E+08 1E+06 1E+04 Messagesize(bytes)1111111EEEEEEE+++++++00000001234567 Avg Max Numberofcalls11111EEEEE+++++0000012345 Small Medium Large Numberofneighbors111EEE+++000123 Small Medium Large 1E+00 1E+00 1E+00 Hypre AtratusMercuryMILC ParaDispF3D Qbox Hypre AtratusMercuryMILC ParaDispF3D Qbox Hypre AtratusMercuryMILC ParaDispF3D Qbox Figure2:Communicationcharacteristicsforpoint-to-pointoperationsacrossapplications. (a) MPI_Allreduce size (b) MPI_Allreduce calls (c) MPI_Allreduce neighbors 1E+07 1E+04 1E+05 Messagesize(bytes)111111EEEEEE++++++000000123456 Avg Max Numberofcalls111EEE+++000123 Small Medium Large Numberofneighbors1111EEEE++++00001234 Small Medium Large 1E+00 1E+00 1E+00 Hypre AtratusMercuryMILC ParaDispF3D Qbox Hypre AtratusMercuryMILC ParaDispF3D Qbox Hypre AtratusMercuryMILC ParaDispF3D Qbox Figure3:MPI_Allreducecharacteristicsacrossapplications. Table3:Collectiveoperationsusageacrossapplications.Eachcellentry(a/b/c)representsthreedatapoints:a-messagesize, b-numberofcalls,andc-sizeofcommunicator. Operation HypreandAtratus Mercury MILC ParaDiS pF3D Qbox Bcast Small/Medium/Large All/Large/Large × × Small/Medium/Large All/Large/Large Reduce × All/Large/Large × × × All/Large/Medium Alltoall × Small/Medium/Large × × Large/Large/Small Small/Large/Medium Allgather × Small/Large/Large × Small/Small/Large Small/Small/Medium × canpotentiallyalterthecharacteristicsofmanyoftheseapplica- forMILC,pF3D,andQbox,thenumberoflargemessagesishigh tions,andhencesomeoftheresultspresentedmaynotapplyfor andthenumberofneighborstowhichthesemessagesaresentis allproblemsrunontheapplications. small. Point-to-pointcommunication:Figure2presentsthesummary Insummary,wecandividetheapplicationsintotwosets:one forpoint-to-pointsendoperationsperformedbyalltheapplications. withpredominantlysmall/mediumsizedmessagesandalargenum- Resultsforpoint-to-pointreceiveoperationsaresimilartothere- berofneighbors(Hypre,Atratus,andMercury),andanotherwith sultsforsendoperations,andthusnotshown.Figure2(a)showsthe manylargemessagesendstoafewneighbors(MILC,pF3D,and minimum,average,andmaximumacrossprocessesoftheaverage Qbox).ParaDisdoesnotfallineithercategorysinceithasmany andmaximumsizeofmessagessentbyindividualprocesses.The small/mediumsizedmessagesandseverallargesizedmessages. maximumsizeformostprocessesacrossallapplicationsexcept Collectivecommunication:Figure3presentsthesummaryfor Hypreisgreaterthan100KB.However,theaveragesizeislessthan MPI_Allreduce,whichisthemostcommonlyusedcollectiveopera- 1KBforHypreandAtratus,andgreaterthan1MBforpF3Dand tionacrossapplications.Intheseresults,thecriterionforclassifying Qbox. theoperationsbasedontheirmessagesizeisthesameasthatfor In Figures 2(b,c), the distribution of message sends is shown point-to-pointmessages.Figure3showsthatalmostallapplica- basedontheirsize.Messageswithsizelessthan512bytesarecon- tionsmakehundredsof MPI_Allreducecallsofsmallsizeover sideredsmall,lessthan64KBaremarkedmedium,andrestare allMPIprocesses.Mercury,ParaDiS,andQboxalsoinvokemany termedlarge.Thesecutoffspointsarechosenastheyarecurrently largesizedMPI_Allreduces.Amongthem,Qbox’scallsareover therecommendedvaluesforswitchingfromshorttoeagerand medium-sizedcommunicators,whileMercury’sandParaDiS’scalls eagertorendezvousprotocolsonPerformanceScaledMessaging2 areoverlargecommunicators. (PSM2)[1].Forseveralcodes(Hypre,Atratus,Mercury,andPar- Sincemostothercollectivecallsaremadebyasubsetofappli- aDiS),alargefractionofmessagesendsareeithersmallormedium. cations,theyaresuccinctlypresentedinTable3.Notethat,for Thenumberofuniqueneighborswithwhichthesemessagesare MPI_AlltoallandMPI_Allgather,themessagesizeconsidered communicatedishigh,especiallyforsmallmessages.Incontrast, formarkingcallsassmall,medium,orlargeistheamountofdata PredictingthePerformanceImpactofDifferentFat-TreeConfigurations SC17,November12–17,2017,Denver,CO,USA Table 4: Simplified qualitative summary of applications’ canalsospecifythejobplacementandtaskmappingforeachap- communicationcharacteristics. plication.Forcomputeregionsoftheapplication,TraceRallows replacingandscalingtheexecutiontimerecordedinthetraces. App P2P Collectives Thisenablespredictionsoflikelyscenariosonfuturesystemswith differentcomputationalpower. Hypre light lightallreduce,bcast Low-efforttracecollection:Apracticallyusefulmethodforcol- Atratus light lightallreduce,bcast lectingtracesshouldrequireminimalchangestotheapplication, Mercury light light alltoall, allgather; heavy allre- incurlowoverhead,andbememoryefficient.WhileBigSimtrace duce,bcast,reduce collectionusingAMPI[25]incurslowoverhead,itrequiresrecom- MILC heavy lightallreduce pilationofalltheparallellibrariesusedbytheapplication,andthus ParaDiS medium lightallgather;heavyallreduce canresultinsignificanteffort. pF3D heavy lightallreduce,bcast,allgather;heavy RecentsupportfortheOpenTraceFormat(OTF2)[6]byMPI alltoall tracingtoolssuchasScoreP[2]andTAU[37]providesaneasy-to- Qbox heavy lightalltoall;heavyallreduce,bcast, usealternativeforcollectingapplicationtraces.Sincethesetools reduce usethePMPIinterfacetointerceptMPIcalls,theycanbeadded atlinktime.Thus,weaddedthecapabilitytouseOTF2tracesfor simulatingapplication’sflowinTraceR.Becauseofthisaddition, exchangedbetweenapairofMPIprocesses.Inaddition,ifonly weareabletoworkwithapplicationteamsandobtaintracesfor tens of calls are made or if only tens of processes are part of a largeapplicationswithloweffort. communicator,thosecallsaremarkedsmall.Ifthenumberofcalls orcommunicatorsizeisafewhundreds,wetermthecallsmedium, MPIcollectives:Section3showedthatcollectiveoperationsare andlargeotherwise. usedinmanyapplicationsforexchangingdataamongMPIpro- Table3showsthatMercuryandQboxmakeheavyuseofMPI_Bcast cessesandsynchronizingtheprocesses.Inthepast,TraceRhassup- andMPI_Reduceovermediumandlargecommunicatorsforallcat- portedtree-basedimplementationsofthreecollectives:MPI_Bcast, egoriesofmessagesizes.Incontrast,Hypre,Atratus,andpF3Din- MPI_Reduce,andMPI_Barrier. vokesmallsizedbroadcastoperations.ParaDiSonlyinvokessmall Wehaveaddedthefollowingcollectivesthatarealsousedin sizedMPI_Allgatherafewtimes.MPI_Allgatherisalsousedby manyapplications:MPI_Allreduce,MPI_Alltoall(v),andMPI_Allgather. MercuryandpF3Dforsmallmessages.pF3Dalsomakesheavyuse Foreachoftheseoperations,wehaveimplementedseveraldiffer- ofMPI_Alltoalloversmallcommunicators,whileMercuryand entalgorithmsthatareusedinMPIlibrariesbasedonthesizeof QboxinvokeMPI_Alltoallwithsmallmessagesovermediumto messagesandthatofthecommunicators[40].Forexample,the largecommunicators.Table4summarizesthesefindings. Bruckalgorithm,multi-pairsend-recv,andringalgorithmshave beenimplementedforMPI_Alltoallusingsmall,medium,andlarge 4 SIMULATIONFRAMEWORK messagesizes,respectively.Tothebestofourknowledge,such detailedmodelingofcollectivealgorithmsisnotsupportedinany TheTraceR-CODESframework[11,35]providestrace-drivenin- othersimulator. frastructure for packet-level simulations of traffic flow on HPC networks.ItisbuiltuponROSS[14],aparalleldiscreteeventsimu- Eager-Rendezvousprotocol:DifferentprotocolsareusedbyMPI lation(PDES)engine,andhasbeensuccessfullydeployedforstudy- toperformcommunicationamongindividualprocessesbasedon ingmulti-jobworkloadsonHPCnetworks[27,43].Theoptimistic thesizeofmessages.Thisnotonlyimpactstheamountofactive natureoftheROSSPDESenginedrivesthescalabilityoftheTraceR- communicationonthenetwork,butcanalsoaffectthewaydifferent CODESframeworkandenablesfastsimulationusinglargecore MPIprocessesaresynchronized[17].Tothisend,wehaveadded countsincomparisontoothersimulationframeworks. supportforeagerandrendezvousprotocolsinTraceR. CODES provides a unified API for simulating traffic flow on Theadditionofthesefeatures,alongwiththeexistingcapabilities manyHPCnetworks,e.g.dragonfly,torus,expressmesh,andfat- viz.parallelscalability,packet-leveltrafficflow,andsupportfor tree.Thenetworkbeingsimulatedcanbeselectedatruntimealong multi-jobworkloads,makestheTraceR-CODESframeworkhighly withotherparameterssuchasthetopologydimensions,linkband- suitableforapplicationsimulationswithloweffort. width,routingschemeetc.Thedefaultfat-treenetworkinCODES assumesasinglerail,singleplanetopology.Forthiswork,wehave 4.2 Validation augmentedthedefaultmodelwithparameter-basedinstantiation TheTraceR-CODESsimulationframeworkhasbeenverifiedand ofvariousconfigurationoptionsdiscussedinSection2.Inthenew validatedinseveralpastpublications[27,35].Inthisstudy,forall model,ausercanspecifythenumberofrailsandplanes.Likewise, configurationsoffat-tree,wehaveconductedcontrolledsimulations taperingcanbedoneatleafswitches. ofmicro-benchmarks,e.g.ping-pong,toverifythatthesimulations behaveasexpected.Wehavealsocomparedpredictedperformance 4.1 AdvancesinTraceR withobservedperformanceforseveralmicro-benchmarksandappli- TraceRisaparallelexecutiontracesimulatorthatreplaysthecon- cations.Duetolacktospace,wepresentresultsforarepresentative trolandcommunicationflowofHPCapplicationsontopofCODES. setinFigures4&5.Theseresultsspanawiderangeofmessagesizes Formulti-jobworkloads,eachapplicationisconcurrentlysimu- andscenariosofinterestviz.latency-boundandbandwidth-bound latedandsharesnetworkresourceswithotherapplications.Users executions. SC17,November12–17,2017,Denver,CO,USA N.Jainetal. (a) Ping-pong (b) Ping-pong relative error (c) All-to-all relative error (512 cores) (d) 3D Stencil 20 20 0.3 Time(microseconds) 1 10 010 PredPicSMtMePd2I %Error---- 43211 000000 vsv. sP. SMMP2I %ErrorvsMPI-- 211 0000 Time(seconds) 00 ..120 OCPCrobeomsdmeimrcmvt-ee-OddP 4 32 256 2K 16K 128K 1M 4 32 256 2K 16K 128K 1M 128 512 2K 8K 32K 128K512K 128 256 512 1K 2K Message size Message size Message size Core count Figure4:ComparingpredictionresultsfromTraceR-CODESframeworkwithresultsfromexecutionsonaproductionsystem, Quartz[7]. (a) Atratus (strong scaling) (b) pF3D (512 cores) Atratusisbetween11%and34%ofthetotalruntime,whileitis between31%and63%on512and1024cores. 120 Observed Predicted Time(seconds) 100 -0.O4%bse-r0v.e8d% -1.9% -0.8% -2.6% -5.3% 1 04680000 3.1% -0.5% 0.0% 0.0% -0.6% 3.5% uPcorreenRdfi5eig(csbuuti)rlo.tansIttsifosofpnoresrbntpahdnFsa3dt2Dwd0ii–adffr2teeh5r%w-ibnioothtufainisntdks4epm%xFae3opcDfputith(niSeogenocatbtniisomdenrnev3uei)nmdarcbtieoemmrsehomoffowuprnrnvoicaicnaretisoFisoiuegnss-. Predicted 10 20 pernode.Tothebestofourknowledge,validationresultsforsuch 32 64 C12o8re c o2u5n6t 512 1024 A DBifferenCt confDgurationEs F applicationshavenotbeenshownforanyothersimulator. Alongwithpaststudies[27,35],theseresultsdemonstratethat Figure5:Predictionaccuracyishighforlatency-boundAtra- theTraceR-CODESframeworkpredictstrendsoftheexecution tusandbandwidth-boundpF3D. timeofbenchmarksandproductionapplicationsaccuratelyforthe fat-treetopology. Inthevalidationexperiments,observedperformancewasrecorded 5 COMPARATIVEANALYSIS fromexecutionsusingMPIversionsofdifferentcodesonQuartz,a 2:1taperedfat-treesystemwith100Gb/slinkbandwidth[7].The Inthissection,weuseTraceRtounderstandtheperformancetrade- radixofeachswitchis48,with32nodesconnectedtoeachswitch offsofdifferentdesignoptions(Section2)thatrepresentfuture at the leaf level. Quartz was used for collecting traces of codes networks. In order to do so, we design prototype systems with withcomputation(3DStencil,Atratus,pF3D)topreventmispre- ∼4,500nodes,whichissimilartotheexpectednodecountofSum- dictionsduetodifferencesincomputetimewhileanothersystem mitandSierra. (Catalyst[3])wasusedfortracesofcommunication-onlymicro Thesixprototypesystemsusedinthisstudyarebasedonthe benchmarks. configurationsinTable1,whicharecurrentlyofferedbynetwork Figures4(a)-(c)presentthepredictionaccuracyforping-pong vendors:1)SR-EDR:singlerailsingleplaneEDR,2)DRP-T-EDR: andall-to-allmicro-benchmarks.Intheping-pongbenchmark,two dualraildualplanetaperedEDR,3)DRP-EDR:dualraildualplane MPIprocessesarerunontwonodes(onepernode)andmessages EDR,4)SR-HDR:singlerailsingleplaneHDR,5)DR-T-HDR:dual areexchangedamongthem.Intheall-to-allexperiments,512MPI railsingleplanetaperedHDR,6)DR-HDR:dualrailsingleplane processesrunningon16nodesinvokeMPI_Alltoallcallsrepeat- HDR.SRandDRrefertosinglerailanddualrailrespectively.If‘P’ edly.Forawiderangeofmessagesizes(4bytesto4MB),wefind appearsintheconfigurationname,itsuggestsdualplaneandif‘T’ thatthepredictedexecutiontimeissimilartotheobservedtime. appears,itreflectstaperingofthenetwork. HighererrorincomparisontoMPIisobservedatmediummessage Theapplicationtracesusedinthesesimulationsaregenerated sizeswheretheperformanceoftheMPIimplementationiserratic. on16,384MPIprocessesusingthesameinputproblemsthatwere However,forthesamedatapoints,predictionsaremuchcloserto usedintheresultsshowninSection3.PMPI-basedScorePlibraryis resultsobtainedforPSM2,thecommunicationlayerdirectlyused usedforcollectingOTF2traces,whichleadstolessthan5%tracing byMPIonQuartz. overhead.Inallsimulations,a3-levelfat-treeisconstructedusing In3DStencil,MPIprocessesarearrangedinathree-dimensional routerswith36portsand90nsrouterdelay.Thepacketsizeisfixed grid.Ineveryiteration,eachMPIprocessexchangesmessageswith at4KBandFTREEstaticroutingschemeisused.Theseparameter sixnearest-neighborsandperformsa7-pointstencilupdateonthe choiceshavebeenmadebecausetheyresemblethesettingsused dataitowns.For3DStencil,thepredictedexecutiontimeclosely oncurrentproductionsystems. followsthetrendoftheobservedtimeasthecorecountisincreased. Figure4(d)showsaccuracyresultsfortwoversionsof3DStencil– 5.1 Communication-onlySimulations withandwithoutcomputation. Thefirstsetofresultswepresentareforcommunication-onlysce- Figure5(a)showsthatforalatency-boundapplicationsuchas narios,i.e.applicationtracesaresimulatedwithallthecomputation Atratus(Section3),highpredictionaccuracyisobservedforastrong regionsignored(notsimulated).Theseresultshelpunderstandthe scalingrun.Upto256cores,MPItimefordifferentprocessesin PredictingthePerformanceImpactofDifferentFat-TreeConfigurations SC17,November12–17,2017,Denver,CO,USA ��� ����� ������� ������� ������� ���� ���� ���� �������������� �� ���� ���� ���� ������ ������ ����� ��������� ��������� ��������� ��������� ��������� �������� ������ ������ ������ ������ ������ ���� ���� �� ������ ��������� ������� ������ �������� ������ ������ ��������� ������� ������ �������� ������ ������ ��������� ������� ������ �������� ������ Figure6:Comparingcommunication-onlyscenarioforapplicationsusing16,384processesand1,024nodes.%performance gainsincomparisontoSR-EDRareshown.Table1liststhenetworkconfigurations. directeffectofnetworkconfigurationsonapplicationcommuni- sizedmessages,andthusdelayscausedbyroutersandlinkscan cationpatternsandprovideanupperboundontheperformance addup.Since,DRP-EDRhasmoreroutersandlinksincomparison benefitsofdifferentconfigurations.Intheseruns,16,384processes toSR-HDR,fewerpacketstraverseperrouter/linkandthusthe aresimulatedon1,024nodes(∼23%ofsystemsize)with16pro- impactofthosedelaysislower. cessespernodeallocatedusingalinearplacementpolicy.Figure6 ParaDiSgeneratesmanysmallandmediumsizedmessagesends, presentstheresultsforalltheapplicationsalongwithperformance butitalsoperformsmanylargesizedmessageexchanges,andcalls improvementsrelativetosinglerailsingleplaneEDR(SR-EDR). MPI_Allreduce.Asaresult,thenetworkconfigurationhasalarger HypreandAtratus:Fortheseapplicationsthatpredominantlyuse impactontheperformanceofParaDiSthanHypreandAtratus. smallandmediumsizedmessagesforpoint-to-pointandcollective pF3D and Qbox: These applications make heavy use of collec- communication,theperformanceimpactofchangingthenetwork tives and thus show up to 68% improvement with the best fat- configuration is low (< 17%). For Atratus, providing additional treeconfigurationasshowninFigure6(right).ForpF3D,boththe bandwidth at the lower levels of fat-tree either via use of dual MPI_Alltoallcallsandthepoint-to-pointcallsareamongpro- rail(e.g.SR-EDRvs.DRP-T-EDR)orviauseoflinkswithhigher cessesrelativelyclosetooneanotherintheMPIrankspace.Thus, bandwidth(e.g.DRP-T-EDRvs.DR-T-HDR)improvesperformance taperingthenetworkdoesnotimpactperformancesignificantly becauseofitsmediumsizemessages.However,sincethesemessages (e.g.DRP-T-EDRvs.DRP-EDR). aremostlycommunicatedwithprocessesthatarecloseinMPIrank ForQbox,whiletheMPI_Alltoallcommunicationisamong space,taperingdoesnotaffectperformancewhenlinearmapping “nearby”processes,othercollectiveswithlargemessagesizesoccur isused(e.g.DRP-T-EDRvs.DRP-EDR). overprocessesspreadallovertheMPIrankspace.Thus,itnotonly ForHypre,onlywhenHDRlinksareused,weobserveaper- benefitsfromhigherbandwidthatthelowerlevels,butalsoshows formanceimpactofnetworkconfigurations.Thisisbecausethe betterperformancewithfullfat-treeconfigurations.Aninteresting averageandmaximummessagesizesofHyprearelowestamongall datapointforQboxisthetransitionfromSR-HDRtoDR-T-HDR. theapplicationsandthusHypreisboundbypermessageoverheads SinceindualrailsingleplaneHDR,fewernodesareconnectedper onEDRnetworks. leafswitch(12nodesperswitch)incomparisontoSR-HDR(18 Overall,wefindthepotentialgainsofusinghighercapability nodesperswitch),moredatahastotraversethroughhigherlevels fat-treeconfigurationslimitedforrelativelylightcommunication inthetreewhenprocessescommunicatewithfarawayprocesses patternsinHypreandAtratus. inMPIrankspace.Thisnegativelyimpactsthegainsprovidedby Mercury,MILC,andParaDiS:Alltheseapplicationsareimpacted DR-T-HDR,andresultsinlowerperformance. inasimilarwaybyvaryingnetworkconfigurations:significant Insummary,withcommunicationpatternsthatincludelarge improvementisobtainedbyaddingmorebandwidthandtapering messagesizedexchangesamongevenfewpairsoffarawayneigh- leadstoanoticeablelossinperformance.ForMercury,thisisex- borsorincludelargesizedcollectives,significantgainsareobserved pectedsinceitmakesheavyuseofcollectivessuchasMPI_Allreduce, withhighernetworkbandwidth(dualrailand/orHDRlinks)and MPI_Bcast, and MPI_Reducewithlargemessagesizesoverlarge fullfat-treeconfigurations(DRP-EDR,DR-HDR). communicators.Inaddition,processesfarawayinMPIrankspace alsoexchangemanymediumsizemessages. 5.2 VaryingComputationScaling ForMILC,everyMPIprocesscommunicateswithveryfewMPI processes.However,theseexchangesaretypicallyovermediumto Intheprevioussection,westudiedtheimpactofnetworkconfig- largesizedmessages.Duetothe4DlayoutofMPIprocesses,someof urationsonexecutionswiththecomputationturnedoff.Wenow thecommunicatingpairsarefarapartintheMPIrankspace.These studycasesinwhichcomputationisalsoconsideredalongsidecom- reasonsaccountfortheperformanceimpactobservedforMILC. munication.Thesecasesareimportantbecausetheperformance Aninterestingobservationfortheseresultsismarginallylower ofanapplicationcanchangesignificantlyduetotheinterplayof performanceofsinglerailHDR(SR-HDR)incomparisontodual computationandcommunication.Hence,suchananalysiscanhelp raildualplaneEDR(DRP-EDR),althoughtheyoffersimilartotal findtherightbalancebetweencomputationalandcommunication bandwidth.WesuspectthisisbecauseMILCsendsveryfewlarge resources. SC17,November12–17,2017,Denver,CO,USA N.Jainetal. (a) Atratus (b) Mercury (c) MILC 1% 10 47% 66% 71% 72% 10 1% 10 2% Time(s) 1 4% 8% 1 1 21% 55% 65% 0.1 0.1 0.1 2x 10x 25x 50x 2x 10x 25x 50x 2x 10x 25x 50x SR-EDR DRP-T-EDR DRP-EDR SR-HDR DR-T-HDR DR-HDR (d) ParaDiS (e) pF3D (f) Qbox 1% 100 7% 9% 25% 51% 65% 67% me(s) 10 5% 11% 10 41% 49% 10 Ti 18% 1 1 1 2x 10x 25x 50x 2x 10x 25x 50x 2x 10x 25x 50x SR-EDR DRP-T-EDR DRP-EDR SR-HDR DR-T-HDR DR-HDR Figure7:Analyzingtheeffectofnetworkconfigurationsonscenarioswithdifferentcomputationscaling.%differencebetween thebestandtheworstconfigurationisshown. Inoursimulationframework,cores(orlogicallyequivalentcom- to21%.ForbothAtratusandpF3D,similarreductioninbenefits putationalresources)areseparateentitiesthatinteractwiththe fromabetternetworkconfigurationisobservedduetothepresence NIC/networkasisthecaseonrealsystems.Thus,ifcommunication- ofrelativelysignificantcomputationtime.Thistrendcontinuesfor computationoverlapispossible,itissimulatedappropriately.Note thecasewith25×scaling. thatTraceRdoesnotsimulateormodelcomputation,itonlyfast- Asthecomputationscalingisdecreasedfurtherto10×,formany forwardstheclockwhenacomputationregionisencountered. applicationsweobserveasignificantchangeintheimpactofnet- Intheseexperiments,thecomputationalcapabilityandexecution workconfigurations.ForHypre,Atratus,andParaDiS,thecompu- timeonanIntelXeonCPUE5-2695v4operatingat2.1GHzwith tationtimebeginstodominatethetotalexecutiontimeandthe onesocketistakenasthebasevalue.ThisCPUhasapeakflop/s impactofnetworkconfigurationislimited.ForMILCandpF3D, ratingof600GFlop/s.Theexecutiontimeofcomputeregionsis onlytheSR-EDRconfigurationisnowsignificantlyworsethan scaledby2×,10×,25×,and50×relativetothisCPUtoconductthe otherconfigurations,andtheperformanceimprovementduetoa analysisforfuturesystems.Thisimpliesthatthe2×scalingresults morecapablenetworkisatmost25%.Forboththeseapplications, assumethatonafuturenode,thecomputationcanbeperformed thechangeinabsolutetimebetweenvariousnetworkconfigura- twiceasfastcomparedtothisCPU. tions(fromDRP-T-EDRtoDR-HDR)issimilartothe50×scenario. Weusethe2×casetorepresentsystemssimilartocurrentclus- However,with10×scaling,sincethetimespentincomputationis tersbutwithadditionalCPUsockets,whilethe10×caseisusedas twentytimesthetimespentincommunication,theneteffectof representativeoffuturesystemswithaccelerators.Theremaining theseimprovementsismarginal. casesrepresenteithermorecompute-intensivesystemsorapplica- ForMercuryandQbox,evenat10×scaling,thetimespentin tionsthatcanexploitmostofthecomputationalpoweroffuture communicationisdominant.Theimpactofaddingmorebandwidth nodes. andusingfullfat-treesystemsinsteadofataperedoneissimilarto Figure7showsthepredictedimpactofnetworkconfigurations earliercases,andthegainsareproportionate.Finally,at2×scaling, withdifferentlevelsofcomputationscaling.ResultsonHypreare Mercuryistheonlyapplicationwhosetimespentincomputation omittedbecausetheimpactofnetworkconfigurationsonrelative isnotsignificantlygreaterthanthecommunicationtime. gainsissimilartothatonAtratus.With50×scaling,thepredicted resultsaresimilartotheresultsobtainedforcommunication-only 5.3 EffectonStrongScaling scenarios.HypreandAtratushaveminimalgains,pF3Dgainsfrom Intheselastsetsofresultsforsinglejobsimulations,westudythe increasedbandwidthatthelowerlevelofthefat-treebutisnot impactofnetworkconfigurationsonstrongscaling.Whilesystem impactedbytapering,andMercury,MILC,andQboximprovewith capabilities continue to grow, the size of input problems is not mostchangesinconfigurations.ForParaDiSthough,evenat50× increasingproportionatelyinsomedomains.Hence,strongscaling scaling,thepotentialimpactofnetworkconfigurationsismuch isanimportantcharacteristictoconsiderforfuturesystems. lowerincomparisontothecommunication-onlyscenario. In ourexperiments, strong scaling is achieved bysimulating The biggest difference for 50× scaling in comparison to the executionofjobswith16,384MPIprocesses(usedearlier)onthree communication-only scenario is for ParaDiS where load imbal- differentnodecounts:256,1,024,and4,096.Thisresultsinallo- anceamongprocessesandsimilartimespentincomputationand cationof64,16,and4MPIprocessespernode.Asbefore,wefol- communicationreducesthebestpredictedimprovementfrom51% lowasimplemodeloflinearlyscalingthecomputationassuming PredictingthePerformanceImpactofDifferentFat-TreeConfigurations SC17,November12–17,2017,Denver,CO,USA ������������������������������ ������������������������������ ����� ��� ����� ��� ����� ��� ����������������� ��������� ����� ���� ���� � ������ ������ ������ ������ ������ ������ ������ ������ ������ ������ ������ ������ ������ ������ ����� ������� ������� ���� ������� ���� ���� ����� ������� ������� ���� ������� ���� ���� ��������� ����������� ����������� �������������������������������������������������������������������������������������������������������������������������������������������������������� Figure8:Impactofnetworkconfigurationonstrongscalingaproblembysimulating16,384MPIprocessesondifferentnode counts.Foreachconfiguration,improvementshownforanapplicationisspeeduprelativetotheapplication’sperformance on256nodesusingthatnetworkconfiguration. within-nodeparallelization.Westudythesecasesfortwodifferent Themulti-jobworkloadsusedintheseexperimentsconsistof computationalscalingscenariosfromtheprevioussection:2×and 17jobsof4,096processeseachrunningon256nodes.Eachjobis 10×. arbitrarilyassignedoneoftheapplicationsdescribedearlier.Since Foranygivennodecount,wefindthattheimpactofnetwork thesimulationsofHypreandAtratushaveshownsimilarbehavior configurationsondifferentapplicationsissimilartotheresults sofar,wedonotuseHypreintheseexperimentsinordertofocus describedintheprevioussection.Thisisnotunexpectedbecause onadiverseset.TracegenerationwasdoneonQuartzbyrunning strongscalingaproblemwithlinearscalingofcomputationdoes weak-scaledversionsoftheinputproblems(Table2)on4KMPI notchangeitscomputation-to-communicationratio.Thus,inthis processes.Wesimulatethreesuchrandomlygeneratedworkloads, section,wefocusoncomparingpredictedspeedupswhenappli- inwhicheachapplicationisassignedto∼8jobsintotal. cationsarescaledfrom256to1,024nodesasshowninFigure8. Inthefirstsetofsimulationsofmulti-jobworkloads,weuse They-axisinthisfigureplotsthepredictedspeedupforindividual 2×computationscaling.Nodesareallocatedtojobsusinga“ran- applicationsrelativetotheirperformanceon256nodesforthesame domizedclusters”policy:jobsareassignednodesinsmallclusters networkconfiguration.Theheightofeachbarshowstheadditional spreadthroughoutthesystem.Fortheseexperiments,wefindthat speedupoverthedatapointrightbelowit. theaverageperformanceofmostapplications(exceptMercury) Forthesetwith2×scalingofcomputeregionsrelativetocur- acrossdifferentnetworkconfigurationsissimilar(<5%difference). rentsystems(Figure8(a)),weobservethattheimpactofnetwork Thisisexpectedsincethetimespentincomputationis>90%forall configurationsonspeedupformanyapplicationsisminimal.Only theseapplications.ForMercury,weobserveupto33%gains;these forpF3DandQbox,dualrail/planeandHDRconfigurationslead resultsaresimilartotheresultsobservedforMercuryintheprevi- tobetterscalingincomparisontothescalingonthesinglerail oussection.Wealsofindthattheimpactofinter-jobinterference singleplaneEDRsystem(upto27%gaininspeedup).Theseresults isminimalgiventhecomputation-heavynatureoftheworkloads. alsoshowthatcommunicationbottlenecksdonotpreventefficient Figure9(a)presentstheaveragetimespentinexecutingatime scalingofmostapplications. stepofdifferentapplicationswhenrunninginamulti-jobworkload. For the case with 10× computation scaling as shown in Fig- Fortheseresults,10×computationscalingisused,andthejobplace- ure8(b),theimpactofnetworkconfigurationsonscalingissig- mentpolicyisthesameasbefore.BothAtratusandParaDiSare nificantlyhigher.First,thespeedupsarerelativelylowerforall notsignificantlyimpactedbychangesinnetworkconfigurations, networkconfigurations,especiallySR-EDR.Second,manyappli- evenwhenpartofamulti-jobworkload. cations(Hypre,MILC,pF3D,andQbox)showbetterscalingwith MercuryandQboxshowperformancegainswitheveryimprove- morecapablenetworkconfigurations.Third,forQboxandMercury, mentinnetworkconfiguration.Thetrendsandimpactofindividual scalingisbetterforfullfat-treeconfigurationsincomparisonto configurationchangesaresimilartothesingle-jobscenariosstudied taperedconfigurations. earlierinFigure7.NotethatforQbox,DRP-T-HDRconfiguration Insummary,ourpredictionssuggestthatwith2×computation performs better than SR-HDR in this case because the random- scaling,networkperformancehaslimitedimpactonscalingofap- izedclustersjobplacementeliminatesthebenefitofhavingmore plications,butwith10×computationscaling,carefulconsideration nodesperleafswitch.Withnodesspreadthroughoutthesystem, ofnetworkconfigurationsisneededtoobtaingoodstrongscaling moretraffichastotraversetohigherlevellinksfortheSR-HDR behavior. configurationincomparisontothesinglejobscenariowithlinear placement. For MILC and pF3D, we observe higher gains (31% and 51%) 6 ANALYZINGMULTI-JOBWORKLOADS whenbetternetworkconfigurationsareusedincomparisontothe Simultaneousexecutionofmultiplelargejobsthatsharenetwork gainspredictedforsingle-jobscenarios(21%and25%).Theother resourcesisacommonproductionscenario.Inthissection,westudy differencefromsingle-jobresultsinFigure7isthatinaddition theimpactofsuchworkloadsonperformanceofindividualjobs totheuseofdualrailEDR,usingafullfat-treeconfigurationand ondifferentfat-treeconfigurations.Wealsocomparetheoverall increasinglinkbandwidth(EDRtoHDR)alsohelps.Thisisbecause performanceofworkloadsacrossdifferentconfigurations. SC17,November12–17,2017,Denver,CO,USA N.Jainetal. (a) Application performance in multi-job workloads (b) Impact of inter-job interference conds) 100 51% 66% n 5600 eperstep(se 10 8% 58% 31% 5% %slowdow 12340000 Tim 1 Atratus Mercury MILC ParaDis pF3D Qbox Atratus Mercury MILC ParaDis pF3D Qbox SR-EDR DRP-T-EDR DRP-EDR SR-HDR DR-T-HDR DR-HDR Figure9:Analyzingtheperformanceofapplicationsinamulti-jobworkloads.Eachworkloadconsistsof17jobs,eachallocated 256nodesandissimulatedwith10×computationscaling.For(b),thepercentageslowdownshowninrelativetopredictions withoutinterference. communicatingpairsthatarenearbyinthelinearplacementare spreadoutintherandomizedclustersplacementandhencethese ������������������������������ applicationsalsomakeuseoflinksathigherlevelsofthefat-tree. ����� ����� Figure9(b)presentstheaveragepercentageslowdownthatis ������ ����� ����� ����� observedforindividualapplicationsrelativetotheirperformance � ����� � for256nodesinterference-freesingle-jobruns.Forthetwoconfig- ����� � urationswithlowertotalbandwidth(SR-EDRandDRP-T-EDR),the � effectofinter-jobinterferenceishigher.WealsofindthatpF3Dand ��� �� � Qboxareimpactedmorebyinter-jobinterferencethanotherappli- ���� �� cations.Thisismostlikelyduetotheirheavyuseoftightly-coupled � � MPI_Alltoall(v)operations.Forsomeapplications,thetapered � �� configurationsshowhigherimpactofinter-jobinterference. ������������������������������������������ InFigure10,wecomparetheimpactofnetworkconfiguration ontotalperformanceofthemulti-jobworkloads.Thevaluesshown hereareanaverageoverthethreemulti-jobworkloadswehave Figure 10: Comparing the total normalized performance run.Thetotalperformanceofaworkloadwithnjobsiscomputed ofworkloadsexecutedondifferentnetworkconfigurations asPw =(cid:205)ni=−01Pinorm,wherePinormisthenormalizedexecutionrate with10×computationscaling. (numberofstepscompletedperunittime)foranapplicationacross differentnetworkconfigurations.Theratesarenormalizedbased Table5alsocombinesthecostestimateswiththeresultsinFig- ontherateobservedforthebestperformingnetworkconfiguration foragivenjob.ThisimpliesthatPnormforjobiisalwayslessthan ure10andcomputesthepredictedcosttoperformancevalueofdif- i ferentfat-treeconfigurations.Wefindthatthecost-to-performance orequaltoone,andthetotalperformanceofaworkloadisbounded ratiodropsrapidlyfromSR-EDRtoDRP-T-EDRtoDRP-EDR,i.e. bythenumberofjobsintheworkload. dualrailEDRconfigurationsprovidebetterreturnsfortheircost Figure10showsthatforthemulti-jobworkloadssimulatedus- incomparisontothesinglerailEDRconfiguration.Further,the ingtherandomizedclustersjoballocationpolicy,highergainsare cost-to-performanceratioofallHDRconfigurationsislower(i.e. observedwhentransitioningfromdualraildualplanetaperedEDR better)thantheEDRconfigurations.Inparticular,theSR-HDRcon- (DRP-T-EDR)todualraildualplaneEDR(DRP-EDR).Wenoticea figurationprovidessimilarperformanceasDRP-EDRatamuch similartransitionindualrailHDR-basedconfigurations.Wefind bettercosttoperformanceratio. improvementstobemarginalwhentransitioningfromsinglerail HDRtodualrailtaperedHDRfortheinputproblemssimulatedin 7 RELATEDWORK thesemulti-jobworkloads. SeveraltopologieshavebeenproposedandusedinHPCnetworks, Cost-performancetradeoffs:Fromaprocurementpointofview, e.g.hypercube[24],fat-tree[30],torus[19,33],dragonfly[21,29]. theperformanceimprovementsdiscussedaboveshouldbecom- Amongthesetopologies,variationsofthefat-treeanddragonfly binedwiththeexpecteddollarcoststoanalyzetheperformance- topologiesarecurrentlybeingusedinmanysystems.Thispaper costtradeoffs.Considerascenarioinwhichthenormalizedcosts focusesonfat-treebecauseofthemultiplicityofdesignoptions (w.r.t.SR-EDR)ofthenetworkconfigurationsandthetotalcost availableforit[20,34]. ofthesystemareasshowninTable5.Here,weassumenetwork ForanalysisandpredictionofperformanceonHPCnetworks, costis10%ofthetotalsystemcostforSR-EDR.Theseestimatesare severalapproachesrangingfromanalyticalmodeling[12,26,36,38] basedonpastprocurementsandthebestdatacurrentlyavailable toflitandpacket-levelsimulations[18,28,35,41]havebeenpro- tous. posed.Wedeploytrace-drivenpacket-levelsimulationsbecause theyallowforuseofexistingcodesforsimulatingtheircommuni- cationcomplexity.

Description:
Several runtime characteristics are expected to have a signif- icant effect on the support for OTF2 trace format [6], modeling MPI protocols and collectives .. 2:1 tapered fat-tree system with 100 Gb/s link bandwidth [7]. The radix of
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.