Predicting the Performance Impact of Different Fat-Tree Configurations NikhilJain AbhinavBhatele LouisH.Howell LawrenceLivermoreNational LawrenceLivermoreNational LawrenceLivermoreNational Laboratory Laboratory Laboratory [email protected] [email protected] [email protected] DavidBöhme IanKarlin EdgarA.León LawrenceLivermoreNational LawrenceLivermoreNational LawrenceLivermoreNational Laboratory Laboratory Laboratory [email protected] [email protected] [email protected] MisbahMubarak NoahWolfe ToddGamblin ArgonneNationalLaboratory RensselaerPolytechnicInstitute LawrenceLivermoreNational [email protected] [email protected] Laboratory [email protected] MatthewL.Leininger LawrenceLivermoreNational Laboratory [email protected] ABSTRACT KEYWORDS Thefat-treetopologyisoneofthemostcommonlyusednetwork fat-treetopology,networksimulation,performanceprediction,pro- topologiesinHPCsystems.Vendorssupportseveraloptionsthat curement canbeconfiguredwhendeployingfat-treenetworksonproduc- ACMReferenceFormat: tionsystems,suchaslinkbandwidth,numberofrails,numberof NikhilJain,AbhinavBhatele,LouisH.Howell,DavidBöhme,IanKarlin, planes,andtapering.Thispapershowcasestheuseofsimulations EdgarA.León,MisbahMubarak,NoahWolfe,ToddGamblin,andMatthew tocomparetheimpactofthesedesignoptionsonrepresentative L.Leininger.2017.PredictingthePerformanceImpactofDifferentFat-Tree productionHPCapplications,libraries,andmulti-jobworkloads. Configurations.InProceedingsofSC17.ACM,NewYork,NY,USA,13pages. WepresentadvancesintheTraceR-CODESsimulationframework https://doi.org/10.1145/3126908.3126967 thatenablethisanalysisandevaluateitspredictionaccuracyagainst experimentsonaproductionfat-treenetwork.Inordertounder- 1 INTRODUCTION standtheimpactofdifferentnetworkconfigurationsonvarious Thefat-treetopology[31]isexpectedtobeusedforbuildingthe anticipatedscenarios,westudyworkloadswithdifferentcommuni- interconnectionnetworksofmanynext-generationhighperfor- cationpatterns,computation-to-communicationratios,andscaling mancecomputing(HPC)systems,e.g.SierraatLawrenceLiver- characteristics.Usingmulti-jobworkloads,wealsostudytheim- moreNationalLaboratory(LLNL)[8]andSummitatOakRidge pact of inter-job interference on performance and compare the NationalLaboratory[10].Itiscurrentlyusedinmanyproduction cost-performancetradeoffs. systems,rangingfromsmallclusterswithafewhundrednodesto multi-petaFlop/ssupercomputers[9].Suchextensiveuseoffat-tree CCSCONCEPTS networksispartlydrivenbytheireasilyextensibleconstruction •Networks→Networkperformancemodeling;Networksim- fromoff-the-shelfcommodityhardware.Thefat-treetopologyalso ulations;Networkperformanceanalysis; provideshighbisectionbandwidthandarelativelylowdiameter amongtheavailablealternativesforagivennodecount. Thecapabilitiesofnetworksaretypicallyimprovedcommensu- Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor ratelywiththecomputationalpowerofHPCsystemstopreventnet- classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitation worksfromnegativelyimpactingtheperformanceofapplications. onthefirstpage.CopyrightsforcomponentsofthisworkownedbyothersthanACM Forfat-treenetworks,thisiscurrentlybeingdonebyincreasing mustbehonored.Abstractingwithcreditispermitted.Tocopyotherwise,orrepublish, thebandwidthoflinksandbyusingmulti-railandmulti-planenet- topostonserversortoredistributetolists,requirespriorspecificpermissionand/ora [email protected]. works[20].However,networkimprovementsincreaseprocurement SC17,November12–17,2017,Denver,CO,USA andoperationalcosts. ©2017AssociationforComputingMachinery. Sincethetotalbudgetavailableforprocurementandoperation ACMISBN978-1-4503-5114-0/17/11...$15.00 https://doi.org/10.1145/3126908.3126967 ofsupercomputersistypicallyfixed,HPCcentersstrivetostrikea SC17,November12–17,2017,Denver,CO,USA N.Jainetal. a) Single rail single plane (full) b) Single rail single plane (tapered) c) Dual rail single plane d) Dual rail dual plane 8 nodes, 10 switches 12 nodes, 7 switches 4 nodes, 10 switches 8 nodes, 20 switches Figure1:Examplesofdesignoptionsforfat-treenetworks. balancebetweenthecomputationalandcommunicationcapabil- collectives,andfat-treenetworkoptionssuchastapered,multi-rail, itiesoftheirsystems.Theincreasingcostofnetworkspresentsa andmulti-plane.Wethenusetheextendedframeworktopredict challengeinthisregard.Ononehand,HPCcenterswanttomini- the performance impact of different network configurations on mizethenetworkingcostsothatmorecomputationalresourcescan thefollowingapplicationsandlibraries:Hypre[22],Atratus,Mer- bepurchasedandmadeavailablefortheusers.Ontheotherhand, cury[16],MILC[15],ParaDiS[13],pF3D[39],andQbox[23]. reducednetworkcapabilitiescanslowdownsomeapplicationsand Theprimarycontributionsofthisworkare: offsetthegainsprovidedbyadditionalcomputationalresources. • AdvancesintheTraceR-CODESframeworkthatenablelow- Thus,HPCcentersareinterestedinunderstandingtheimpactof effort,accuratesimulationsofproductionapplications. reducednetworkcapabilitiesonperformanceandmitigatingthe • ValidationoftheTraceR-CODESframeworkformanyparallel risksofreducingthosecapabilities. codesincludingproductionapplications. Asaresult,itisimperativetodeveloppredictionmethodologies • Performanceandinterferencepredictionsforproductionap- thatcanassistinevaluatingtheperformanceofapplicationsand plications,libraries,andmulti-jobworkloadsonarangeof multi-jobworkloadsonavailablenetworkalternatives.Fortypical potentialfuturenetworkdesigns. HPCdatacenters,asignificantfractionofsystemtimeisspent in running a few production applications (each using 5–30% of Wealsoassessthesuitabilityofdifferentfat-treeconfigurations totalnodesinthesystem[32]).Thus,inthispaper,wefocuson fortheapplicationsandworkloadssimulatedinthispaper.Notethat predictingtheimpactofalternativefat-treeconfigurationsonaset thechoiceofinputproblemscanpotentiallyalterthecharacteristics ofsuchrepresentativeapplicationsandlibraries. of many applications, and hence some of the results presented Severalruntimecharacteristicsareexpectedtohaveasignif- maynotapplytoinputsthatsignificantlyalterthebehaviorofan icanteffectontheperformanceofcurrentapplicationsonnext- application. generationsystems.Wefocusonpredictingtheimpactof:1)com- putationscalingofanapplicationonitsperformance,2)thecommu- 2 FAT-TREENETWORKS nicationpatternofanapplicationonitsscaling,and3)theimpact ofinter-jobinterferenceonperformance. Thefat-treetopologyisatree-basedtopologyinwhichbandwidth Arobust,reliablesimulationframeworkcanplayanimportant ofedgesincreasesnearthetop(root)ofthetree[30].Practicalde- roleinperformingstudiesthatcananswerthesequestions.The ploymentsoffat-treeinmostsupercomputersresemblethefolded- needforsimulationsarisesbecauseexistingsystemsofferlimited ClostopologyasshowninFigure1(a).Inthissetup,manyrouters optionsforstudyingfuturenetworks.Thisisbecausefuturenet- ofsameradixaregroupedtogethertoformcoreswitchesandpro- workswillbeinherentlydifferentfromcurrentnetworks.Wealso videhighbandwidth.Thefat-treeshowninFigure1(a)isafull cannotchangethehardwareand/orchangetheparametersofpro- fat-tree:thetotalbandwidthwithinaleveldoesnotdecreaseas ductionnetworkswithoutdisruptingexistingproductionwork- wemovefromnodesconnectedtotheleafswitchestowardhigher loads. levels. Thesimulationframeworkusedforpredictionstudiesshouldbe Inordertoreducethecostofthenetwork,taperingcanbede- abletosimulateproductionapplicationsandlibrariesrunningon ployedtoconnectmorenodesperleafswitch(Figure1(b)).This thousandsofMPIprocesses.Useofproductioncodesiscriticalbe- reducesthetotalbandwidthathigherlevelsbutalsolowersthe causecomplexcommunicationcharacteristicsofmanyproduction numberofswitchesandlinksrequiredtoconnectthesamenumber codesarehardtocaptureinsimplifiedproxyapplications.How- ofnodesincomparisontothefullfat-tree. ever,accuratelysimulatingproductionapplicationsischallenging Ontheotherhand,whenhigherbandwidthisdesired,eachnode becausetheeffectsofmillionsofMPIcommunicationcallsand canbeprovidedmultipleports(rails)toinjecttrafficatahigherrate low-leveltransientcontentionneedtobecaptured. intotheleafswitches.Themultipleportscaneitherbeconnected TraceR-CODES[27,35]isascalablesimulationframeworkfor toswitchesinthesameplaneasshowninFigure1(c)ortodisjoint predictingperformanceonHPCnetworks.Inthispaper,weextend planes as shown in Figure 1(d). In both cases, fewer nodes can theTraceR-CODESframeworkwithmultiplecapabilitiesrequired be connected using the same quantity of network resources in toenablelow-effortsimulationsofproductionapplicationsand comparisontothesinglerailfat-tree.Eitheroftheseconfigurations libraries on fat-tree alternatives. The capabilities added include canalsobetaperedtoretainhighinjectionbandwidthatthenodes, supportforOTF2traceformat[6],modelingMPIprotocolsand butreducecostbyreducingthebandwidthatthehigherlevels. PredictingthePerformanceImpactofDifferentFat-TreeConfigurations SC17,November12–17,2017,Denver,CO,USA Table1:Fat-treeconfigurationscurrentlyavailable. Table2:Problemsizesrunon16KMPIprocesses. Configuration Linkb/w #rails #planes Tapering App Descriptionofinputproblem SR-EDR 100Gb/s 1 1 1:1 Hypre Diffusionwithtiledpattern,34,320unknownsper DRP-T-EDR 100Gb/s 2 2 2:1 process DRP-EDR 100Gb/s 2 2 1:1 Atratus CrookedPipe,28,740unknownsperprocess SR-HDR 200Gb/s 1 1 1:1 Mercury 2500particlesperprocess DR-T-HDR 200Gb/s 2 1 2:1 MILC su_rmd,16,384gridpointsperprocess DR-HDR 200Gb/s 2 1 1:1 ParaDiS 5.2millionnodestotal pF3D 3.9Mgridpointsperprocess Qbox Goldbenchmark,512atomstotal Currently,alloftheaboveconfigurationsareofferedbymultiple vendorsincludingMellanoxandIntel.Inaddition,severaloptions areavailableforbandwidthofindividuallinks–FDR(56Gb/s), EDR(100Gb/s),HDR(200Gb/s).Thecombinatorialchoicesoffered bytheseoptionsmakethetaskoffindingthemostsuitableconfig- usingrealisticmaterialpropertiesandaloadbalancingscheme urationforHPCcentersandapplicationsdifficult.Weaddressthis basedondomainreplication.Thenumberofparticlesareweak- challengebyshowingthatsimulationsofapplicationsonpossible scaledwiththenumberofMPIprocessesbutthemeshgeometryis configurations(Table1)canprovidekeyinsightsandreliabledata heldconstant. pointscriticaltothisdecision-makingprocess. MILC[15]isawidelyusedparallelapplicationsuiteforstudying quantumchromodynamics(QCD),thetheoryofstronginteractions 3 APPLICATIONCHARACTERISTICS ofsubatomicphysics.Wehaveusedthesu3_rmdapplicationin Inthissection,webrieflydescribethecodesusedinthisstudyand whichquarkfieldsandMPIprocessesaredefinedasa4Dgridof analyzetheircommunicationcharacteristics.Themotivationfor spacetimepoints. thisanalysisistwo-fold.First,ithelpsunderstandtheperformance ParaDiS[13]isalarge-scaledislocationdynamicssimulationcode trendsobservedforvariouscodesondifferentnetworkconfigu- usedtostudythefundamentalmechanismsofplasticityatLLNL. rations. Second, it can be used to find generic trends in impact ParaDiScouplesdirectN2calculationofforcesbetweendisloca- ofnetworkconfigurationsbasedonspecificresultsobservedfor tionlinesegmentswithafastmultipolemethod(FMM)tocapture differentapplications. remoteinteractions.Itusesahierarchicaldecompositionscheme Thecodesusedinthisstudyincludeapplicationsandlibraries wherethesimulationvolumeisrecursivelydividedintoslabsalong thatareeitherruninproductionatHPCcenters(Hypre,Mercury, eachphysicaldirection. MILC,ParaDiS,pF3D,Qbox),orrepresentcodesruninproduction pF3D[39]isusedforsimulatinglaser-plasmainteractionsinex- (Atratus).Thesecodesspanawiderangeofphysicsandmathe- perimentsconductedattheNationalIgnitionFacilityatLLNL.It maticaldomainsincludingMonteCarlo,first-principlesmolecular solvescoupledwaveequationsforthelaserlight,thescatteredlight, dynamics,transport,plasmainteractions,structuredandunstruc- andtheplasma.pF3Dusesaregular3DCartesiangridandspatial turedgrids,andsparselinearalgebra. decompositionwith2DFFTsintheplanesandbeampropagation Hypre[22]isaparallellinearsolverslibrarydevelopedatLLNL acrosstheplanes. andisusedbymanyproductionapplications.Forthisstudy,weuse Qbox[23]isafirst-principlesmoleculardynamicscodethatisused theAlgebraicMultigrid(AMG)Solverona2Ddiffusionproblem tocomputetheelectronicstructureofatoms,molecules,solids,and usingstructuredAdaptiveMeshRefinement(AMR).Werunaweak liquids within the Density Functional Theory (DFT) formalism. scalingproblemsothenumberofmeshpointsisproportionalto ItwontheGordonBellawardatSC2006andiswidelyusedfor thenumberofMPIprocesses.Thetestsaresetupwithinalarger studyingatomicpropertiesofheavymetals.Qboxdividestheatoms applicationcodebuttracingislimitedtotheoperationstakingplace andtheirstatesamonga2DgridofMPIprocesses,andmakesheavy duringtheHypresetupandsolvephases. useofFFTsandlinearalgebralibrariessuchasScalapackandBLAS. AtratusextendsMULARD[5],ahighorder,finiteelementbased Foreachoftheabovecodes,weworkedwiththeirdevelopers multigroupradiationdiffusioncodethatusesa3Dunstructured and/orfrequentuserstoconstructrepresentativescientificinput mesh,byincludingmoreadvancedphysicsanddiscretizations.Itis problemsforweak-scaledexecutionson1Kto16KMPIprocesses. usedprimarilyasaresearchtooltoexplorefutureprogramming These problems were run either on an Infiniband-Xeon cluster paradigmswithdataflowandcomputationsimportanttoLLNL (Quartz)[7]with32corespernode,oronanIBMBlueGene/Q applications.AtratususesMFEM[4],whichinvokesHypresolvers system(rzUSeq,Vulcan)with16corespernode.Table2liststhe buttheyaremorespecializedthanthestandardAMGsolverwe inputproblemsthatwererunon16KMPIprocesses.Usingprofiling runfortheHypretests. datacollectedfromrunson16Kprocesses,wenowsummarizethe Mercury[16]isaproductionMonteCarloapplicationdeveloped communicationcharacteristicsoftheseapplications.Theresults atLLNLasageneral-purposeparticletransportcode.Forthese presentedheredonotincludeMPIcallsusedintheinitializationstep experiments,weruna3Dneutrontransporteigenvalueproblem ofapplications.Asmentionedbefore,thechoiceofinputproblems SC17,November12–17,2017,Denver,CO,USA N.Jainetal. (a) P2P send size (b) P2P send calls (c) P2P send neighbors 1E+08 1E+06 1E+04 Messagesize(bytes)1111111EEEEEEE+++++++00000001234567 Avg Max Numberofcalls11111EEEEE+++++0000012345 Small Medium Large Numberofneighbors111EEE+++000123 Small Medium Large 1E+00 1E+00 1E+00 Hypre AtratusMercuryMILC ParaDispF3D Qbox Hypre AtratusMercuryMILC ParaDispF3D Qbox Hypre AtratusMercuryMILC ParaDispF3D Qbox Figure2:Communicationcharacteristicsforpoint-to-pointoperationsacrossapplications. (a) MPI_Allreduce size (b) MPI_Allreduce calls (c) MPI_Allreduce neighbors 1E+07 1E+04 1E+05 Messagesize(bytes)111111EEEEEE++++++000000123456 Avg Max Numberofcalls111EEE+++000123 Small Medium Large Numberofneighbors1111EEEE++++00001234 Small Medium Large 1E+00 1E+00 1E+00 Hypre AtratusMercuryMILC ParaDispF3D Qbox Hypre AtratusMercuryMILC ParaDispF3D Qbox Hypre AtratusMercuryMILC ParaDispF3D Qbox Figure3:MPI_Allreducecharacteristicsacrossapplications. Table3:Collectiveoperationsusageacrossapplications.Eachcellentry(a/b/c)representsthreedatapoints:a-messagesize, b-numberofcalls,andc-sizeofcommunicator. Operation HypreandAtratus Mercury MILC ParaDiS pF3D Qbox Bcast Small/Medium/Large All/Large/Large × × Small/Medium/Large All/Large/Large Reduce × All/Large/Large × × × All/Large/Medium Alltoall × Small/Medium/Large × × Large/Large/Small Small/Large/Medium Allgather × Small/Large/Large × Small/Small/Large Small/Small/Medium × canpotentiallyalterthecharacteristicsofmanyoftheseapplica- forMILC,pF3D,andQbox,thenumberoflargemessagesishigh tions,andhencesomeoftheresultspresentedmaynotapplyfor andthenumberofneighborstowhichthesemessagesaresentis allproblemsrunontheapplications. small. Point-to-pointcommunication:Figure2presentsthesummary Insummary,wecandividetheapplicationsintotwosets:one forpoint-to-pointsendoperationsperformedbyalltheapplications. withpredominantlysmall/mediumsizedmessagesandalargenum- Resultsforpoint-to-pointreceiveoperationsaresimilartothere- berofneighbors(Hypre,Atratus,andMercury),andanotherwith sultsforsendoperations,andthusnotshown.Figure2(a)showsthe manylargemessagesendstoafewneighbors(MILC,pF3D,and minimum,average,andmaximumacrossprocessesoftheaverage Qbox).ParaDisdoesnotfallineithercategorysinceithasmany andmaximumsizeofmessagessentbyindividualprocesses.The small/mediumsizedmessagesandseverallargesizedmessages. maximumsizeformostprocessesacrossallapplicationsexcept Collectivecommunication:Figure3presentsthesummaryfor Hypreisgreaterthan100KB.However,theaveragesizeislessthan MPI_Allreduce,whichisthemostcommonlyusedcollectiveopera- 1KBforHypreandAtratus,andgreaterthan1MBforpF3Dand tionacrossapplications.Intheseresults,thecriterionforclassifying Qbox. theoperationsbasedontheirmessagesizeisthesameasthatfor In Figures 2(b,c), the distribution of message sends is shown point-to-pointmessages.Figure3showsthatalmostallapplica- basedontheirsize.Messageswithsizelessthan512bytesarecon- tionsmakehundredsof MPI_Allreducecallsofsmallsizeover sideredsmall,lessthan64KBaremarkedmedium,andrestare allMPIprocesses.Mercury,ParaDiS,andQboxalsoinvokemany termedlarge.Thesecutoffspointsarechosenastheyarecurrently largesizedMPI_Allreduces.Amongthem,Qbox’scallsareover therecommendedvaluesforswitchingfromshorttoeagerand medium-sizedcommunicators,whileMercury’sandParaDiS’scalls eagertorendezvousprotocolsonPerformanceScaledMessaging2 areoverlargecommunicators. (PSM2)[1].Forseveralcodes(Hypre,Atratus,Mercury,andPar- Sincemostothercollectivecallsaremadebyasubsetofappli- aDiS),alargefractionofmessagesendsareeithersmallormedium. cations,theyaresuccinctlypresentedinTable3.Notethat,for Thenumberofuniqueneighborswithwhichthesemessagesare MPI_AlltoallandMPI_Allgather,themessagesizeconsidered communicatedishigh,especiallyforsmallmessages.Incontrast, formarkingcallsassmall,medium,orlargeistheamountofdata PredictingthePerformanceImpactofDifferentFat-TreeConfigurations SC17,November12–17,2017,Denver,CO,USA Table 4: Simplified qualitative summary of applications’ canalsospecifythejobplacementandtaskmappingforeachap- communicationcharacteristics. plication.Forcomputeregionsoftheapplication,TraceRallows replacingandscalingtheexecutiontimerecordedinthetraces. App P2P Collectives Thisenablespredictionsoflikelyscenariosonfuturesystemswith differentcomputationalpower. Hypre light lightallreduce,bcast Low-efforttracecollection:Apracticallyusefulmethodforcol- Atratus light lightallreduce,bcast lectingtracesshouldrequireminimalchangestotheapplication, Mercury light light alltoall, allgather; heavy allre- incurlowoverhead,andbememoryefficient.WhileBigSimtrace duce,bcast,reduce collectionusingAMPI[25]incurslowoverhead,itrequiresrecom- MILC heavy lightallreduce pilationofalltheparallellibrariesusedbytheapplication,andthus ParaDiS medium lightallgather;heavyallreduce canresultinsignificanteffort. pF3D heavy lightallreduce,bcast,allgather;heavy RecentsupportfortheOpenTraceFormat(OTF2)[6]byMPI alltoall tracingtoolssuchasScoreP[2]andTAU[37]providesaneasy-to- Qbox heavy lightalltoall;heavyallreduce,bcast, usealternativeforcollectingapplicationtraces.Sincethesetools reduce usethePMPIinterfacetointerceptMPIcalls,theycanbeadded atlinktime.Thus,weaddedthecapabilitytouseOTF2tracesfor simulatingapplication’sflowinTraceR.Becauseofthisaddition, exchangedbetweenapairofMPIprocesses.Inaddition,ifonly weareabletoworkwithapplicationteamsandobtaintracesfor tens of calls are made or if only tens of processes are part of a largeapplicationswithloweffort. communicator,thosecallsaremarkedsmall.Ifthenumberofcalls orcommunicatorsizeisafewhundreds,wetermthecallsmedium, MPIcollectives:Section3showedthatcollectiveoperationsare andlargeotherwise. usedinmanyapplicationsforexchangingdataamongMPIpro- Table3showsthatMercuryandQboxmakeheavyuseofMPI_Bcast cessesandsynchronizingtheprocesses.Inthepast,TraceRhassup- andMPI_Reduceovermediumandlargecommunicatorsforallcat- portedtree-basedimplementationsofthreecollectives:MPI_Bcast, egoriesofmessagesizes.Incontrast,Hypre,Atratus,andpF3Din- MPI_Reduce,andMPI_Barrier. vokesmallsizedbroadcastoperations.ParaDiSonlyinvokessmall Wehaveaddedthefollowingcollectivesthatarealsousedin sizedMPI_Allgatherafewtimes.MPI_Allgatherisalsousedby manyapplications:MPI_Allreduce,MPI_Alltoall(v),andMPI_Allgather. MercuryandpF3Dforsmallmessages.pF3Dalsomakesheavyuse Foreachoftheseoperations,wehaveimplementedseveraldiffer- ofMPI_Alltoalloversmallcommunicators,whileMercuryand entalgorithmsthatareusedinMPIlibrariesbasedonthesizeof QboxinvokeMPI_Alltoallwithsmallmessagesovermediumto messagesandthatofthecommunicators[40].Forexample,the largecommunicators.Table4summarizesthesefindings. Bruckalgorithm,multi-pairsend-recv,andringalgorithmshave beenimplementedforMPI_Alltoallusingsmall,medium,andlarge 4 SIMULATIONFRAMEWORK messagesizes,respectively.Tothebestofourknowledge,such detailedmodelingofcollectivealgorithmsisnotsupportedinany TheTraceR-CODESframework[11,35]providestrace-drivenin- othersimulator. frastructure for packet-level simulations of traffic flow on HPC networks.ItisbuiltuponROSS[14],aparalleldiscreteeventsimu- Eager-Rendezvousprotocol:DifferentprotocolsareusedbyMPI lation(PDES)engine,andhasbeensuccessfullydeployedforstudy- toperformcommunicationamongindividualprocessesbasedon ingmulti-jobworkloadsonHPCnetworks[27,43].Theoptimistic thesizeofmessages.Thisnotonlyimpactstheamountofactive natureoftheROSSPDESenginedrivesthescalabilityoftheTraceR- communicationonthenetwork,butcanalsoaffectthewaydifferent CODESframeworkandenablesfastsimulationusinglargecore MPIprocessesaresynchronized[17].Tothisend,wehaveadded countsincomparisontoothersimulationframeworks. supportforeagerandrendezvousprotocolsinTraceR. CODES provides a unified API for simulating traffic flow on Theadditionofthesefeatures,alongwiththeexistingcapabilities manyHPCnetworks,e.g.dragonfly,torus,expressmesh,andfat- viz.parallelscalability,packet-leveltrafficflow,andsupportfor tree.Thenetworkbeingsimulatedcanbeselectedatruntimealong multi-jobworkloads,makestheTraceR-CODESframeworkhighly withotherparameterssuchasthetopologydimensions,linkband- suitableforapplicationsimulationswithloweffort. width,routingschemeetc.Thedefaultfat-treenetworkinCODES assumesasinglerail,singleplanetopology.Forthiswork,wehave 4.2 Validation augmentedthedefaultmodelwithparameter-basedinstantiation TheTraceR-CODESsimulationframeworkhasbeenverifiedand ofvariousconfigurationoptionsdiscussedinSection2.Inthenew validatedinseveralpastpublications[27,35].Inthisstudy,forall model,ausercanspecifythenumberofrailsandplanes.Likewise, configurationsoffat-tree,wehaveconductedcontrolledsimulations taperingcanbedoneatleafswitches. ofmicro-benchmarks,e.g.ping-pong,toverifythatthesimulations behaveasexpected.Wehavealsocomparedpredictedperformance 4.1 AdvancesinTraceR withobservedperformanceforseveralmicro-benchmarksandappli- TraceRisaparallelexecutiontracesimulatorthatreplaysthecon- cations.Duetolacktospace,wepresentresultsforarepresentative trolandcommunicationflowofHPCapplicationsontopofCODES. setinFigures4&5.Theseresultsspanawiderangeofmessagesizes Formulti-jobworkloads,eachapplicationisconcurrentlysimu- andscenariosofinterestviz.latency-boundandbandwidth-bound latedandsharesnetworkresourceswithotherapplications.Users executions. SC17,November12–17,2017,Denver,CO,USA N.Jainetal. (a) Ping-pong (b) Ping-pong relative error (c) All-to-all relative error (512 cores) (d) 3D Stencil 20 20 0.3 Time(microseconds) 1 10 010 PredPicSMtMePd2I %Error---- 43211 000000 vsv. sP. SMMP2I %ErrorvsMPI-- 211 0000 Time(seconds) 00 ..120 OCPCrobeomsdmeimrcmvt-ee-OddP 4 32 256 2K 16K 128K 1M 4 32 256 2K 16K 128K 1M 128 512 2K 8K 32K 128K512K 128 256 512 1K 2K Message size Message size Message size Core count Figure4:ComparingpredictionresultsfromTraceR-CODESframeworkwithresultsfromexecutionsonaproductionsystem, Quartz[7]. (a) Atratus (strong scaling) (b) pF3D (512 cores) Atratusisbetween11%and34%ofthetotalruntime,whileitis between31%and63%on512and1024cores. 120 Observed Predicted Time(seconds) 100 -0.O4%bse-r0v.e8d% -1.9% -0.8% -2.6% -5.3% 1 04680000 3.1% -0.5% 0.0% 0.0% -0.6% 3.5% uPcorreenRdfi5eig(csbuuti)rlo.tansIttsifosofpnoresrbntpahdnFsa3dt2Dwd0ii–adffr2teeh5r%w-ibnioothtufainisntdks4epm%xFae3opcDfputith(niSeogenocatbtniisomdenrnev3uei)nmdarcbtieoemmrsehomoffowuprnrnvoicaicnaretisoFisoiuegnss-. Predicted 10 20 pernode.Tothebestofourknowledge,validationresultsforsuch 32 64 C12o8re c o2u5n6t 512 1024 A DBifferenCt confDgurationEs F applicationshavenotbeenshownforanyothersimulator. Alongwithpaststudies[27,35],theseresultsdemonstratethat Figure5:Predictionaccuracyishighforlatency-boundAtra- theTraceR-CODESframeworkpredictstrendsoftheexecution tusandbandwidth-boundpF3D. timeofbenchmarksandproductionapplicationsaccuratelyforthe fat-treetopology. Inthevalidationexperiments,observedperformancewasrecorded 5 COMPARATIVEANALYSIS fromexecutionsusingMPIversionsofdifferentcodesonQuartz,a 2:1taperedfat-treesystemwith100Gb/slinkbandwidth[7].The Inthissection,weuseTraceRtounderstandtheperformancetrade- radixofeachswitchis48,with32nodesconnectedtoeachswitch offsofdifferentdesignoptions(Section2)thatrepresentfuture at the leaf level. Quartz was used for collecting traces of codes networks. In order to do so, we design prototype systems with withcomputation(3DStencil,Atratus,pF3D)topreventmispre- ∼4,500nodes,whichissimilartotheexpectednodecountofSum- dictionsduetodifferencesincomputetimewhileanothersystem mitandSierra. (Catalyst[3])wasusedfortracesofcommunication-onlymicro Thesixprototypesystemsusedinthisstudyarebasedonthe benchmarks. configurationsinTable1,whicharecurrentlyofferedbynetwork Figures4(a)-(c)presentthepredictionaccuracyforping-pong vendors:1)SR-EDR:singlerailsingleplaneEDR,2)DRP-T-EDR: andall-to-allmicro-benchmarks.Intheping-pongbenchmark,two dualraildualplanetaperedEDR,3)DRP-EDR:dualraildualplane MPIprocessesarerunontwonodes(onepernode)andmessages EDR,4)SR-HDR:singlerailsingleplaneHDR,5)DR-T-HDR:dual areexchangedamongthem.Intheall-to-allexperiments,512MPI railsingleplanetaperedHDR,6)DR-HDR:dualrailsingleplane processesrunningon16nodesinvokeMPI_Alltoallcallsrepeat- HDR.SRandDRrefertosinglerailanddualrailrespectively.If‘P’ edly.Forawiderangeofmessagesizes(4bytesto4MB),wefind appearsintheconfigurationname,itsuggestsdualplaneandif‘T’ thatthepredictedexecutiontimeissimilartotheobservedtime. appears,itreflectstaperingofthenetwork. HighererrorincomparisontoMPIisobservedatmediummessage Theapplicationtracesusedinthesesimulationsaregenerated sizeswheretheperformanceoftheMPIimplementationiserratic. on16,384MPIprocessesusingthesameinputproblemsthatwere However,forthesamedatapoints,predictionsaremuchcloserto usedintheresultsshowninSection3.PMPI-basedScorePlibraryis resultsobtainedforPSM2,thecommunicationlayerdirectlyused usedforcollectingOTF2traces,whichleadstolessthan5%tracing byMPIonQuartz. overhead.Inallsimulations,a3-levelfat-treeisconstructedusing In3DStencil,MPIprocessesarearrangedinathree-dimensional routerswith36portsand90nsrouterdelay.Thepacketsizeisfixed grid.Ineveryiteration,eachMPIprocessexchangesmessageswith at4KBandFTREEstaticroutingschemeisused.Theseparameter sixnearest-neighborsandperformsa7-pointstencilupdateonthe choiceshavebeenmadebecausetheyresemblethesettingsused dataitowns.For3DStencil,thepredictedexecutiontimeclosely oncurrentproductionsystems. followsthetrendoftheobservedtimeasthecorecountisincreased. Figure4(d)showsaccuracyresultsfortwoversionsof3DStencil– 5.1 Communication-onlySimulations withandwithoutcomputation. Thefirstsetofresultswepresentareforcommunication-onlysce- Figure5(a)showsthatforalatency-boundapplicationsuchas narios,i.e.applicationtracesaresimulatedwithallthecomputation Atratus(Section3),highpredictionaccuracyisobservedforastrong regionsignored(notsimulated).Theseresultshelpunderstandthe scalingrun.Upto256cores,MPItimefordifferentprocessesin PredictingthePerformanceImpactofDifferentFat-TreeConfigurations SC17,November12–17,2017,Denver,CO,USA ��� ����� ������� ������� ������� ���� ���� ���� �������������� �� ���� ���� ���� ������ ������ ����� ��������� ��������� ��������� ��������� ��������� �������� ������ ������ ������ ������ ������ ���� ���� �� ������ ��������� ������� ������ �������� ������ ������ ��������� ������� ������ �������� ������ ������ ��������� ������� ������ �������� ������ Figure6:Comparingcommunication-onlyscenarioforapplicationsusing16,384processesand1,024nodes.%performance gainsincomparisontoSR-EDRareshown.Table1liststhenetworkconfigurations. directeffectofnetworkconfigurationsonapplicationcommuni- sizedmessages,andthusdelayscausedbyroutersandlinkscan cationpatternsandprovideanupperboundontheperformance addup.Since,DRP-EDRhasmoreroutersandlinksincomparison benefitsofdifferentconfigurations.Intheseruns,16,384processes toSR-HDR,fewerpacketstraverseperrouter/linkandthusthe aresimulatedon1,024nodes(∼23%ofsystemsize)with16pro- impactofthosedelaysislower. cessespernodeallocatedusingalinearplacementpolicy.Figure6 ParaDiSgeneratesmanysmallandmediumsizedmessagesends, presentstheresultsforalltheapplicationsalongwithperformance butitalsoperformsmanylargesizedmessageexchanges,andcalls improvementsrelativetosinglerailsingleplaneEDR(SR-EDR). MPI_Allreduce.Asaresult,thenetworkconfigurationhasalarger HypreandAtratus:Fortheseapplicationsthatpredominantlyuse impactontheperformanceofParaDiSthanHypreandAtratus. smallandmediumsizedmessagesforpoint-to-pointandcollective pF3D and Qbox: These applications make heavy use of collec- communication,theperformanceimpactofchangingthenetwork tives and thus show up to 68% improvement with the best fat- configuration is low (< 17%). For Atratus, providing additional treeconfigurationasshowninFigure6(right).ForpF3D,boththe bandwidth at the lower levels of fat-tree either via use of dual MPI_Alltoallcallsandthepoint-to-pointcallsareamongpro- rail(e.g.SR-EDRvs.DRP-T-EDR)orviauseoflinkswithhigher cessesrelativelyclosetooneanotherintheMPIrankspace.Thus, bandwidth(e.g.DRP-T-EDRvs.DR-T-HDR)improvesperformance taperingthenetworkdoesnotimpactperformancesignificantly becauseofitsmediumsizemessages.However,sincethesemessages (e.g.DRP-T-EDRvs.DRP-EDR). aremostlycommunicatedwithprocessesthatarecloseinMPIrank ForQbox,whiletheMPI_Alltoallcommunicationisamong space,taperingdoesnotaffectperformancewhenlinearmapping “nearby”processes,othercollectiveswithlargemessagesizesoccur isused(e.g.DRP-T-EDRvs.DRP-EDR). overprocessesspreadallovertheMPIrankspace.Thus,itnotonly ForHypre,onlywhenHDRlinksareused,weobserveaper- benefitsfromhigherbandwidthatthelowerlevels,butalsoshows formanceimpactofnetworkconfigurations.Thisisbecausethe betterperformancewithfullfat-treeconfigurations.Aninteresting averageandmaximummessagesizesofHyprearelowestamongall datapointforQboxisthetransitionfromSR-HDRtoDR-T-HDR. theapplicationsandthusHypreisboundbypermessageoverheads SinceindualrailsingleplaneHDR,fewernodesareconnectedper onEDRnetworks. leafswitch(12nodesperswitch)incomparisontoSR-HDR(18 Overall,wefindthepotentialgainsofusinghighercapability nodesperswitch),moredatahastotraversethroughhigherlevels fat-treeconfigurationslimitedforrelativelylightcommunication inthetreewhenprocessescommunicatewithfarawayprocesses patternsinHypreandAtratus. inMPIrankspace.Thisnegativelyimpactsthegainsprovidedby Mercury,MILC,andParaDiS:Alltheseapplicationsareimpacted DR-T-HDR,andresultsinlowerperformance. inasimilarwaybyvaryingnetworkconfigurations:significant Insummary,withcommunicationpatternsthatincludelarge improvementisobtainedbyaddingmorebandwidthandtapering messagesizedexchangesamongevenfewpairsoffarawayneigh- leadstoanoticeablelossinperformance.ForMercury,thisisex- borsorincludelargesizedcollectives,significantgainsareobserved pectedsinceitmakesheavyuseofcollectivessuchasMPI_Allreduce, withhighernetworkbandwidth(dualrailand/orHDRlinks)and MPI_Bcast, and MPI_Reducewithlargemessagesizesoverlarge fullfat-treeconfigurations(DRP-EDR,DR-HDR). communicators.Inaddition,processesfarawayinMPIrankspace alsoexchangemanymediumsizemessages. 5.2 VaryingComputationScaling ForMILC,everyMPIprocesscommunicateswithveryfewMPI processes.However,theseexchangesaretypicallyovermediumto Intheprevioussection,westudiedtheimpactofnetworkconfig- largesizedmessages.Duetothe4DlayoutofMPIprocesses,someof urationsonexecutionswiththecomputationturnedoff.Wenow thecommunicatingpairsarefarapartintheMPIrankspace.These studycasesinwhichcomputationisalsoconsideredalongsidecom- reasonsaccountfortheperformanceimpactobservedforMILC. munication.Thesecasesareimportantbecausetheperformance Aninterestingobservationfortheseresultsismarginallylower ofanapplicationcanchangesignificantlyduetotheinterplayof performanceofsinglerailHDR(SR-HDR)incomparisontodual computationandcommunication.Hence,suchananalysiscanhelp raildualplaneEDR(DRP-EDR),althoughtheyoffersimilartotal findtherightbalancebetweencomputationalandcommunication bandwidth.WesuspectthisisbecauseMILCsendsveryfewlarge resources. SC17,November12–17,2017,Denver,CO,USA N.Jainetal. (a) Atratus (b) Mercury (c) MILC 1% 10 47% 66% 71% 72% 10 1% 10 2% Time(s) 1 4% 8% 1 1 21% 55% 65% 0.1 0.1 0.1 2x 10x 25x 50x 2x 10x 25x 50x 2x 10x 25x 50x SR-EDR DRP-T-EDR DRP-EDR SR-HDR DR-T-HDR DR-HDR (d) ParaDiS (e) pF3D (f) Qbox 1% 100 7% 9% 25% 51% 65% 67% me(s) 10 5% 11% 10 41% 49% 10 Ti 18% 1 1 1 2x 10x 25x 50x 2x 10x 25x 50x 2x 10x 25x 50x SR-EDR DRP-T-EDR DRP-EDR SR-HDR DR-T-HDR DR-HDR Figure7:Analyzingtheeffectofnetworkconfigurationsonscenarioswithdifferentcomputationscaling.%differencebetween thebestandtheworstconfigurationisshown. Inoursimulationframework,cores(orlogicallyequivalentcom- to21%.ForbothAtratusandpF3D,similarreductioninbenefits putationalresources)areseparateentitiesthatinteractwiththe fromabetternetworkconfigurationisobservedduetothepresence NIC/networkasisthecaseonrealsystems.Thus,ifcommunication- ofrelativelysignificantcomputationtime.Thistrendcontinuesfor computationoverlapispossible,itissimulatedappropriately.Note thecasewith25×scaling. thatTraceRdoesnotsimulateormodelcomputation,itonlyfast- Asthecomputationscalingisdecreasedfurtherto10×,formany forwardstheclockwhenacomputationregionisencountered. applicationsweobserveasignificantchangeintheimpactofnet- Intheseexperiments,thecomputationalcapabilityandexecution workconfigurations.ForHypre,Atratus,andParaDiS,thecompu- timeonanIntelXeonCPUE5-2695v4operatingat2.1GHzwith tationtimebeginstodominatethetotalexecutiontimeandthe onesocketistakenasthebasevalue.ThisCPUhasapeakflop/s impactofnetworkconfigurationislimited.ForMILCandpF3D, ratingof600GFlop/s.Theexecutiontimeofcomputeregionsis onlytheSR-EDRconfigurationisnowsignificantlyworsethan scaledby2×,10×,25×,and50×relativetothisCPUtoconductthe otherconfigurations,andtheperformanceimprovementduetoa analysisforfuturesystems.Thisimpliesthatthe2×scalingresults morecapablenetworkisatmost25%.Forboththeseapplications, assumethatonafuturenode,thecomputationcanbeperformed thechangeinabsolutetimebetweenvariousnetworkconfigura- twiceasfastcomparedtothisCPU. tions(fromDRP-T-EDRtoDR-HDR)issimilartothe50×scenario. Weusethe2×casetorepresentsystemssimilartocurrentclus- However,with10×scaling,sincethetimespentincomputationis tersbutwithadditionalCPUsockets,whilethe10×caseisusedas twentytimesthetimespentincommunication,theneteffectof representativeoffuturesystemswithaccelerators.Theremaining theseimprovementsismarginal. casesrepresenteithermorecompute-intensivesystemsorapplica- ForMercuryandQbox,evenat10×scaling,thetimespentin tionsthatcanexploitmostofthecomputationalpoweroffuture communicationisdominant.Theimpactofaddingmorebandwidth nodes. andusingfullfat-treesystemsinsteadofataperedoneissimilarto Figure7showsthepredictedimpactofnetworkconfigurations earliercases,andthegainsareproportionate.Finally,at2×scaling, withdifferentlevelsofcomputationscaling.ResultsonHypreare Mercuryistheonlyapplicationwhosetimespentincomputation omittedbecausetheimpactofnetworkconfigurationsonrelative isnotsignificantlygreaterthanthecommunicationtime. gainsissimilartothatonAtratus.With50×scaling,thepredicted resultsaresimilartotheresultsobtainedforcommunication-only 5.3 EffectonStrongScaling scenarios.HypreandAtratushaveminimalgains,pF3Dgainsfrom Intheselastsetsofresultsforsinglejobsimulations,westudythe increasedbandwidthatthelowerlevelofthefat-treebutisnot impactofnetworkconfigurationsonstrongscaling.Whilesystem impactedbytapering,andMercury,MILC,andQboximprovewith capabilities continue to grow, the size of input problems is not mostchangesinconfigurations.ForParaDiSthough,evenat50× increasingproportionatelyinsomedomains.Hence,strongscaling scaling,thepotentialimpactofnetworkconfigurationsismuch isanimportantcharacteristictoconsiderforfuturesystems. lowerincomparisontothecommunication-onlyscenario. In ourexperiments, strong scaling is achieved bysimulating The biggest difference for 50× scaling in comparison to the executionofjobswith16,384MPIprocesses(usedearlier)onthree communication-only scenario is for ParaDiS where load imbal- differentnodecounts:256,1,024,and4,096.Thisresultsinallo- anceamongprocessesandsimilartimespentincomputationand cationof64,16,and4MPIprocessespernode.Asbefore,wefol- communicationreducesthebestpredictedimprovementfrom51% lowasimplemodeloflinearlyscalingthecomputationassuming PredictingthePerformanceImpactofDifferentFat-TreeConfigurations SC17,November12–17,2017,Denver,CO,USA ������������������������������ ������������������������������ ����� ��� ����� ��� ����� ��� ����������������� ��������� ����� ���� ���� � ������ ������ ������ ������ ������ ������ ������ ������ ������ ������ ������ ������ ������ ������ ����� ������� ������� ���� ������� ���� ���� ����� ������� ������� ���� ������� ���� ���� ��������� ����������� ����������� �������������������������������������������������������������������������������������������������������������������������������������������������������� Figure8:Impactofnetworkconfigurationonstrongscalingaproblembysimulating16,384MPIprocessesondifferentnode counts.Foreachconfiguration,improvementshownforanapplicationisspeeduprelativetotheapplication’sperformance on256nodesusingthatnetworkconfiguration. within-nodeparallelization.Westudythesecasesfortwodifferent Themulti-jobworkloadsusedintheseexperimentsconsistof computationalscalingscenariosfromtheprevioussection:2×and 17jobsof4,096processeseachrunningon256nodes.Eachjobis 10×. arbitrarilyassignedoneoftheapplicationsdescribedearlier.Since Foranygivennodecount,wefindthattheimpactofnetwork thesimulationsofHypreandAtratushaveshownsimilarbehavior configurationsondifferentapplicationsissimilartotheresults sofar,wedonotuseHypreintheseexperimentsinordertofocus describedintheprevioussection.Thisisnotunexpectedbecause onadiverseset.TracegenerationwasdoneonQuartzbyrunning strongscalingaproblemwithlinearscalingofcomputationdoes weak-scaledversionsoftheinputproblems(Table2)on4KMPI notchangeitscomputation-to-communicationratio.Thus,inthis processes.Wesimulatethreesuchrandomlygeneratedworkloads, section,wefocusoncomparingpredictedspeedupswhenappli- inwhicheachapplicationisassignedto∼8jobsintotal. cationsarescaledfrom256to1,024nodesasshowninFigure8. Inthefirstsetofsimulationsofmulti-jobworkloads,weuse They-axisinthisfigureplotsthepredictedspeedupforindividual 2×computationscaling.Nodesareallocatedtojobsusinga“ran- applicationsrelativetotheirperformanceon256nodesforthesame domizedclusters”policy:jobsareassignednodesinsmallclusters networkconfiguration.Theheightofeachbarshowstheadditional spreadthroughoutthesystem.Fortheseexperiments,wefindthat speedupoverthedatapointrightbelowit. theaverageperformanceofmostapplications(exceptMercury) Forthesetwith2×scalingofcomputeregionsrelativetocur- acrossdifferentnetworkconfigurationsissimilar(<5%difference). rentsystems(Figure8(a)),weobservethattheimpactofnetwork Thisisexpectedsincethetimespentincomputationis>90%forall configurationsonspeedupformanyapplicationsisminimal.Only theseapplications.ForMercury,weobserveupto33%gains;these forpF3DandQbox,dualrail/planeandHDRconfigurationslead resultsaresimilartotheresultsobservedforMercuryintheprevi- tobetterscalingincomparisontothescalingonthesinglerail oussection.Wealsofindthattheimpactofinter-jobinterference singleplaneEDRsystem(upto27%gaininspeedup).Theseresults isminimalgiventhecomputation-heavynatureoftheworkloads. alsoshowthatcommunicationbottlenecksdonotpreventefficient Figure9(a)presentstheaveragetimespentinexecutingatime scalingofmostapplications. stepofdifferentapplicationswhenrunninginamulti-jobworkload. For the case with 10× computation scaling as shown in Fig- Fortheseresults,10×computationscalingisused,andthejobplace- ure8(b),theimpactofnetworkconfigurationsonscalingissig- mentpolicyisthesameasbefore.BothAtratusandParaDiSare nificantlyhigher.First,thespeedupsarerelativelylowerforall notsignificantlyimpactedbychangesinnetworkconfigurations, networkconfigurations,especiallySR-EDR.Second,manyappli- evenwhenpartofamulti-jobworkload. cations(Hypre,MILC,pF3D,andQbox)showbetterscalingwith MercuryandQboxshowperformancegainswitheveryimprove- morecapablenetworkconfigurations.Third,forQboxandMercury, mentinnetworkconfiguration.Thetrendsandimpactofindividual scalingisbetterforfullfat-treeconfigurationsincomparisonto configurationchangesaresimilartothesingle-jobscenariosstudied taperedconfigurations. earlierinFigure7.NotethatforQbox,DRP-T-HDRconfiguration Insummary,ourpredictionssuggestthatwith2×computation performs better than SR-HDR in this case because the random- scaling,networkperformancehaslimitedimpactonscalingofap- izedclustersjobplacementeliminatesthebenefitofhavingmore plications,butwith10×computationscaling,carefulconsideration nodesperleafswitch.Withnodesspreadthroughoutthesystem, ofnetworkconfigurationsisneededtoobtaingoodstrongscaling moretraffichastotraversetohigherlevellinksfortheSR-HDR behavior. configurationincomparisontothesinglejobscenariowithlinear placement. For MILC and pF3D, we observe higher gains (31% and 51%) 6 ANALYZINGMULTI-JOBWORKLOADS whenbetternetworkconfigurationsareusedincomparisontothe Simultaneousexecutionofmultiplelargejobsthatsharenetwork gainspredictedforsingle-jobscenarios(21%and25%).Theother resourcesisacommonproductionscenario.Inthissection,westudy differencefromsingle-jobresultsinFigure7isthatinaddition theimpactofsuchworkloadsonperformanceofindividualjobs totheuseofdualrailEDR,usingafullfat-treeconfigurationand ondifferentfat-treeconfigurations.Wealsocomparetheoverall increasinglinkbandwidth(EDRtoHDR)alsohelps.Thisisbecause performanceofworkloadsacrossdifferentconfigurations. SC17,November12–17,2017,Denver,CO,USA N.Jainetal. (a) Application performance in multi-job workloads (b) Impact of inter-job interference conds) 100 51% 66% n 5600 eperstep(se 10 8% 58% 31% 5% %slowdow 12340000 Tim 1 Atratus Mercury MILC ParaDis pF3D Qbox Atratus Mercury MILC ParaDis pF3D Qbox SR-EDR DRP-T-EDR DRP-EDR SR-HDR DR-T-HDR DR-HDR Figure9:Analyzingtheperformanceofapplicationsinamulti-jobworkloads.Eachworkloadconsistsof17jobs,eachallocated 256nodesandissimulatedwith10×computationscaling.For(b),thepercentageslowdownshowninrelativetopredictions withoutinterference. communicatingpairsthatarenearbyinthelinearplacementare spreadoutintherandomizedclustersplacementandhencethese ������������������������������ applicationsalsomakeuseoflinksathigherlevelsofthefat-tree. ����� ����� Figure9(b)presentstheaveragepercentageslowdownthatis ������ ����� ����� ����� observedforindividualapplicationsrelativetotheirperformance � ����� � for256nodesinterference-freesingle-jobruns.Forthetwoconfig- ����� � urationswithlowertotalbandwidth(SR-EDRandDRP-T-EDR),the � effectofinter-jobinterferenceishigher.WealsofindthatpF3Dand ��� �� � Qboxareimpactedmorebyinter-jobinterferencethanotherappli- ���� �� cations.Thisismostlikelyduetotheirheavyuseoftightly-coupled � � MPI_Alltoall(v)operations.Forsomeapplications,thetapered � �� configurationsshowhigherimpactofinter-jobinterference. ������������������������������������������ InFigure10,wecomparetheimpactofnetworkconfiguration ontotalperformanceofthemulti-jobworkloads.Thevaluesshown hereareanaverageoverthethreemulti-jobworkloadswehave Figure 10: Comparing the total normalized performance run.Thetotalperformanceofaworkloadwithnjobsiscomputed ofworkloadsexecutedondifferentnetworkconfigurations asPw =(cid:205)ni=−01Pinorm,wherePinormisthenormalizedexecutionrate with10×computationscaling. (numberofstepscompletedperunittime)foranapplicationacross differentnetworkconfigurations.Theratesarenormalizedbased Table5alsocombinesthecostestimateswiththeresultsinFig- ontherateobservedforthebestperformingnetworkconfiguration foragivenjob.ThisimpliesthatPnormforjobiisalwayslessthan ure10andcomputesthepredictedcosttoperformancevalueofdif- i ferentfat-treeconfigurations.Wefindthatthecost-to-performance orequaltoone,andthetotalperformanceofaworkloadisbounded ratiodropsrapidlyfromSR-EDRtoDRP-T-EDRtoDRP-EDR,i.e. bythenumberofjobsintheworkload. dualrailEDRconfigurationsprovidebetterreturnsfortheircost Figure10showsthatforthemulti-jobworkloadssimulatedus- incomparisontothesinglerailEDRconfiguration.Further,the ingtherandomizedclustersjoballocationpolicy,highergainsare cost-to-performanceratioofallHDRconfigurationsislower(i.e. observedwhentransitioningfromdualraildualplanetaperedEDR better)thantheEDRconfigurations.Inparticular,theSR-HDRcon- (DRP-T-EDR)todualraildualplaneEDR(DRP-EDR).Wenoticea figurationprovidessimilarperformanceasDRP-EDRatamuch similartransitionindualrailHDR-basedconfigurations.Wefind bettercosttoperformanceratio. improvementstobemarginalwhentransitioningfromsinglerail HDRtodualrailtaperedHDRfortheinputproblemssimulatedin 7 RELATEDWORK thesemulti-jobworkloads. SeveraltopologieshavebeenproposedandusedinHPCnetworks, Cost-performancetradeoffs:Fromaprocurementpointofview, e.g.hypercube[24],fat-tree[30],torus[19,33],dragonfly[21,29]. theperformanceimprovementsdiscussedaboveshouldbecom- Amongthesetopologies,variationsofthefat-treeanddragonfly binedwiththeexpecteddollarcoststoanalyzetheperformance- topologiesarecurrentlybeingusedinmanysystems.Thispaper costtradeoffs.Considerascenarioinwhichthenormalizedcosts focusesonfat-treebecauseofthemultiplicityofdesignoptions (w.r.t.SR-EDR)ofthenetworkconfigurationsandthetotalcost availableforit[20,34]. ofthesystemareasshowninTable5.Here,weassumenetwork ForanalysisandpredictionofperformanceonHPCnetworks, costis10%ofthetotalsystemcostforSR-EDR.Theseestimatesare severalapproachesrangingfromanalyticalmodeling[12,26,36,38] basedonpastprocurementsandthebestdatacurrentlyavailable toflitandpacket-levelsimulations[18,28,35,41]havebeenpro- tous. posed.Wedeploytrace-drivenpacket-levelsimulationsbecause theyallowforuseofexistingcodesforsimulatingtheircommuni- cationcomplexity.
Description: