Building Multirail InfiniBand Clusters: MPI-Level Designs and Performance Evaluation*

Jiuxing Liu    Abhinav Vishnu    Dhabaleswar K. Panda
Computer Science and Engineering
The Ohio State University
Columbus, OH 43210

* This research is supported in part by Department of Energy's Grant #DE-FC02-01ER25506, National Science Foundation's grants #CCR-0204429 and #CCR-0311542, and a grant from Mellanox Technologies.

0-7695-2153-3/04 $20.00 (c) 2004 IEEE

Abstract

In the area of cluster computing, InfiniBand is becoming increasingly popular due to its open standard and high performance. However, even with InfiniBand, network bandwidth can still become the performance bottleneck for some of today's most demanding applications.

In this paper, we study the problem of how to overcome the bandwidth bottleneck by using multirail networks. We present different ways of setting up multirail networks with InfiniBand and propose a unified MPI design that can support all these approaches. We also discuss various important design issues and provide in-depth discussions of different policies for using multirail networks, including an adaptive striping scheme that can dynamically change the striping parameters based on the current system condition.

We have implemented our design and evaluated it using both microbenchmarks and applications. Our performance results show that multirail networks can significantly improve MPI communication performance. With a two-rail InfiniBand cluster, we have achieved almost twice the bandwidth and half the latency for large messages compared with the original MPI. At the application level, the multirail MPI can significantly reduce communication time as well as running time, depending on the communication pattern. We have also shown that the adaptive striping scheme can achieve excellent performance without a priori knowledge of the bandwidth of each rail.

1 Introduction

In the past few years, the computational power of commodity PCs has been doubling about every eighteen months. At the same time, network interconnects that provide low latency and high bandwidth are also emerging. This trend makes it very promising to build high performance computing environments through cluster computing, which combines the computational power of commodity PCs and the communication performance of high speed network interconnects. In this area, the Message Passing Interface (MPI) [10] has become the de facto standard for writing parallel applications.

Recently, the InfiniBand Architecture [11] has been proposed as the next generation interconnect for I/O and inter-process communication. Due to its open standard and high performance, InfiniBand is becoming increasingly popular for cluster computing. High performance MPI implementations over InfiniBand have also become available [19, 18]. One of the notable features of InfiniBand is its high bandwidth. Currently, InfiniBand 4x links support a peak bandwidth of 1 GB/s in each direction. (Note that unless otherwise stated, the unit MB in this paper is an abbreviation for 10^6 bytes and GB is an abbreviation for 10^9 bytes.) However, even with InfiniBand, network bandwidth can still become the performance bottleneck for some of today's most demanding applications. This is especially the case for clusters built with SMP machines, in which multiple processes may run on a single node and must share the node bandwidth.

One important way to overcome the bandwidth bottleneck is to use multirail networks [4]. The basic idea is to have multiple independent networks (rails) connecting the nodes in a cluster. With multirail networks, communication traffic can be distributed to different rails. There are two ways of distributing communication traffic. In multiplexing (also called reverse multiplexing in the networking community), messages are sent through different rails in a round robin fashion. In striping, messages are divided into several chunks and sent out simultaneously using multiple rails. By using these techniques, the bandwidth bottleneck can be avoided or alleviated.

In this paper, we present a detailed study of designing high performance multirail InfiniBand clusters. We discuss various ways of setting up multirail networks with InfiniBand and propose a unified MPI design that can support all these approaches. Our design achieves low overhead by taking advantage of RDMA operations in InfiniBand and integrating the multirail design with MPI communication protocols. Our design also features a very flexible architecture that supports different policies for using multiple rails. We provide in-depth discussions of different policies and also propose an adaptive striping policy that can dynamically change the striping parameters based on the current available bandwidth of different rails.

We have implemented our design and evaluated it using both microbenchmarks and applications on an 8-node InfiniBand testbed. Our performance results show that multirail networks can significantly improve MPI communication performance. With a two-rail InfiniBand network, we have achieved almost twice the bandwidth and half the latency for large messages compared with the original MPI. The peak unidirectional and bidirectional bandwidths we have achieved are 1723 MB/s and 1877 MB/s, respectively. Depending on the communication pattern, the multirail MPI can significantly reduce communication time as well as running time for certain applications. We have also shown that for rails with different bandwidths, the adaptive striping scheme can achieve excellent performance without a priori knowledge of the bandwidth of each rail. It can even outperform static schemes with a priori knowledge of rail bandwidth in certain cases.

The remaining part of the paper is organized as follows: In Section 2, we provide background information for this work. We discuss different ways of setting up InfiniBand multirail networks in Section 3. Our multirail MPI design is presented in Section 4. We discuss some of the detailed design issues in Section 5. In Section 6, we present performance results of our multirail MPI. Related work is described in Section 7. In Section 8, we conclude and discuss our future directions.

2 Background

In this section, we provide background information for our work. First, we provide a brief introduction to InfiniBand. Then, we discuss the internal communication protocols used by MPI and their implementation over InfiniBand.

2.1 Overview of InfiniBand

The InfiniBand Architecture (IBA) [11] defines a switched network fabric for interconnecting processing nodes and I/O nodes. It provides a communication and management infrastructure for inter-processor communication and I/O. In an InfiniBand network, processing nodes and I/O nodes are connected to the fabric by Channel Adapters (CAs). Host Channel Adapters (HCAs) sit on processing nodes.

The InfiniBand communication stack consists of different layers. The interface presented by Channel Adapters to consumers belongs to the transport layer. A queue-pair based model is used in this interface. A Queue Pair in the InfiniBand Architecture consists of two queues: a send queue and a receive queue. The send queue holds instructions to transmit data and the receive queue holds instructions that describe where received data is to be placed. Communication operations are described in Work Queue Requests (WQRs), or descriptors, and submitted to the work queue. The completion of WQRs is reported through Completion Queues (CQs). InfiniBand supports different classes of transport services. In this paper, we focus on the Reliable Connection (RC) service.

The InfiniBand Architecture supports both channel and memory semantics. In channel semantics, send/receive operations are used for communication. In memory semantics, InfiniBand supports Remote Direct Memory Access (RDMA) operations, including RDMA write and RDMA read. RDMA operations are one-sided and do not incur software overhead at the remote side. In these operations, the sender (initiator) can directly access remote memory by posting RDMA descriptors. The operation is transparent to the software layer at the receiver (target) side.

At the physical layer, InfiniBand supports different link speeds. Most HCAs in the current market support 4x links, which can potentially achieve a peak bandwidth of 1 GB/s. 12x links are also available. However, currently they are used to interconnect switches rather than end nodes.
2.2 Overview of MPI Protocols

MPI defines four different communication modes: Standard, Synchronous, Buffered, and Ready. Two internal protocols, Eager and Rendezvous, are usually used to implement these four communication modes. These protocols are handled by a component in the MPI implementation called the progress engine. In the Eager protocol, the message is pushed to the receiver side regardless of its state. In the Rendezvous protocol, a handshake happens between the sender and the receiver via control messages before the data is sent to the receiver side. Usually, the Eager protocol is used for small messages and the Rendezvous protocol is used for large messages. In Figure 1, we show examples of typical Eager and Rendezvous protocols.

Figure 1. MPI Eager and Rendezvous Protocols (Eager: Send, Eager Data, Receive; Rendezvous: Send, Rendezvous Start, Rendezvous Reply, Rendezvous Data, Rendezvous Finish, Receive)

When we are transferring large data buffers, it is beneficial to avoid extra data copies. A zero-copy Rendezvous protocol implementation can be achieved by using RDMA write. In this implementation, the buffers are pinned down in memory and the buffer addresses are exchanged via the control messages. After that, the data can be written directly from the source buffer to the destination buffer by doing RDMA write. This approach has been widely used for implementing MPI over different interconnects [19, 12, 2].

For small data transfers in the Eager protocol and for control messages, the overhead of data copies is small. Therefore, we need to push messages eagerly toward the other side to achieve better latency. This requirement matches well with the properties of InfiniBand send/receive operations. However, send/receive operations have their disadvantages such as lower performance and higher overhead. Therefore, our previous work in [13] proposed a scheme that uses RDMA operations also for small data and control messages. This scheme improves both latency and bandwidth of small message transfers in MPI.
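To make the descriptor-based RDMA write concrete, the sketch below posts a one-sided write on a Reliable Connection queue pair, which is the operation underlying the zero-copy Rendezvous data transfer. It is only an illustration: the implementation described in this paper used the Mellanox VAPI interface of that era, whereas the sketch assumes the present-day libibverbs API, and it assumes the queue pair, the local memory registration, and the exchange of the remote buffer address and rkey have already been performed elsewhere.

```c
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Post a one-sided RDMA write that places a local registered buffer
 * directly into a remote buffer.  The remote side posts no descriptor
 * and its software is not involved; completion is reported on the
 * sender's CQ.  Sketch only: qp, mr, remote_addr and rkey are assumed
 * to have been set up through the usual connection establishment and
 * control-message exchange. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local_buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) local_buf,   /* local virtual address  */
        .length = (uint32_t) len,          /* bytes to transfer      */
        .lkey   = mr->lkey,                /* local registration key */
    };

    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t) local_buf, /* echoed in the completion */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,     /* memory semantics         */
        .send_flags = IBV_SEND_SIGNALED,     /* request a CQ entry       */
        .wr.rdma    = {
            .remote_addr = remote_addr,      /* exchanged via Rendezvous
                                                control messages         */
            .rkey        = rkey,
        },
    };

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);  /* 0 on success */
}
```

Because the write is one-sided, the receiver learns that the data has arrived only through a separate notification, which is exactly the role the Rendezvous finish message plays later in the paper.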
3 InfiniBand Multirail Network Configurations

InfiniBand multirail networks can be set up in different ways. In this section, we discuss three types of possible multirail network configurations and their respective benefits. In the first approach, multiple HCAs are used in each node. The second approach exploits multiple ports in a single HCA. Finally, we describe how to set up virtual multirail networks with only a single port by using the LID Mask Control (LMC) mechanism in InfiniBand.

3.1 Multiple HCAs

Although the InfiniBand Architecture specifies 12x links, current InfiniBand HCAs in the market can support only up to 4x speed. A straightforward way to alleviate the bandwidth bottleneck is to use multiple HCAs in each node and connect them to the InfiniBand switch fabric. Through the support of communication software, users can take advantage of the aggregated bandwidth of all HCAs in each node without modifying applications. Another advantage of using multiple HCAs per node is that possible bandwidth bottlenecks in local I/O buses can also be avoided. For example, the PCI-X 133 MHz/64-bit bus (used by most 4x HCAs in the current market) can only support around 1 GB/s aggregated bandwidth. Although a 4x HCA has a peak aggregated bandwidth of 2 GB/s for both link directions, its performance is limited by the PCI-X bus. These problems can be alleviated by connecting multiple HCAs to different I/O buses in a system.

A multirail InfiniBand setup using multiple HCAs per node can connect each of the HCAs in a node to a separate switch. If a larger switch is available, all HCAs can also be connected to this single physical network. Through the use of appropriate switch configurations and routing algorithms, using a single network can be equivalent to a multirail setup.

3.2 Multiple Ports

Currently, many InfiniBand HCAs in the market have multiple ports. For example, InfiniHost HCAs [14] from Mellanox have two ports in each card. Therefore, multirail InfiniBand networks can also be constructed by taking advantage of multiple ports in a single HCA. This approach can be very attractive because, compared with using multiple HCAs, it only requires one HCA per node. Hence, the total cost of multirail networks can be significantly reduced.

However, as we have discussed, the local I/O bus can be the performance bottleneck in such a configuration because all ports of an HCA have to share the I/O bus. Hence, this approach will not achieve any performance benefit when 4x HCAs are used with PCI-X buses. However, benefits can be achieved by using future HCAs that support PCI-X Double Data Rate (DDR) or Quad Data Rate (QDR) interfaces. Recently, PCI Express [21] has been introduced as the next generation local I/O interconnect. PCI Express uses a serial, point-to-point interface. It can deliver scalable bandwidth by using multiple lanes in each point-to-point link. For example, an 8x PCI Express link can achieve 2 GB/s bandwidth in each direction (4 GB/s total). Multiple-port InfiniBand HCAs that support PCI Express are already available in the market [15]. Therefore, this approach can be very useful for constructing multirail networks using systems that have PCI Express interfaces.

3.3 Single Port with LID Mask Control (LMC)

In this subsection, we discuss another approach to setting up multirail InfiniBand networks which does not require multiple ports or HCAs for each node. The basic idea of this approach is to set up different paths between two ports on two nodes. By using appropriate routing algorithms, it is possible to make the paths independent of each other. Although a single network is used in this approach, we have multiple logical networks (or logical rails). If the logical networks are independent of each other, conceptually they are very similar to multirail networks. Therefore, we call this approach virtual multirail networks.

In InfiniBand, each port has a local identifier (LID). Usually, a path is determined by the destination LID. Therefore, multiple LIDs need to be used in order to have different paths. To address this issue, InfiniBand provides a mechanism called LID Mask Control (LMC). Basically, LMC provides a way to associate multiple logical LIDs with a single physical port. Hence, multiple paths can be constructed by using LMC.

It should be noted that in virtual multirail networks, a port is shared by all the logical rails. Hence, if the port link bandwidth or the local I/O bus is the performance bottleneck, this approach cannot bring any performance benefit. It can only be used for fault tolerance in this case. However, if the performance bottleneck is inside the network, virtual multirail networks can improve communication performance by utilizing multiple paths.
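As a concrete illustration of the LMC mechanism (not code from the paper): when a port's LMC value is n, the port answers to the 2^n consecutive LIDs starting at its base LID, and the fabric can route each of those destination LIDs along a different path. A multirail MPI layer could therefore derive one destination LID per logical rail as sketched below; the function name and its use are assumptions made purely for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* With LID Mask Control (LMC) set to lmc, the remote port owns the
 * 2^lmc consecutive LIDs [base_lid, base_lid + 2^lmc - 1].  Sending to
 * different destination LIDs lets the fabric route the traffic along
 * different paths, yielding "virtual rails" on a single physical port. */
static void enumerate_path_lids(uint16_t base_lid, unsigned lmc,
                                uint16_t *dlids, unsigned num_rails)
{
    unsigned num_paths = 1u << lmc;          /* paths offered by the port */
    for (unsigned i = 0; i < num_rails; i++)
        dlids[i] = (uint16_t)(base_lid + (i % num_paths));
}

int main(void)
{
    uint16_t dlids[4];
    enumerate_path_lids(0x20, 2, dlids, 4);  /* LMC = 2 -> 4 logical rails */
    for (unsigned i = 0; i < 4; i++)
        printf("subchannel %u -> destination LID 0x%x\n", i, dlids[i]);
    return 0;
}
```

Whether the resulting paths are actually independent depends on the routing tables programmed by the subnet manager, which is why the paper stresses the need for appropriate routing algorithms.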
4 Multirail MPI Design

In this section, we present various high level design issues involved in supporting multirail networks in MPI over InfiniBand. We first present the basic architecture of our design. After that, we discuss how we can have a unified design to support multirail networks using multiple HCAs, multiple ports, multiple connections for a single port, or any combination of the above. Then we describe how we can achieve low overhead by integrating our design with MPI and taking advantage of InfiniBand RDMA operations. One important component in our architecture is the Scheduling Policies component. In the last part of this section, we discuss several policies supported by our architecture and also present an adaptive striping scheme that can dynamically adjust striping parameters based on current system conditions.

4.1 Basic Architecture

The basic architecture of our design to support multirail networks is shown in Figure 2. We focus on the architecture of the sender side. In the figure, we can see that besides the MPI Protocol Layer and the InfiniBand Layer, our design consists of three important components: Communication Scheduler, Scheduling Policies, and Completion Filter.

Figure 2. Basic Architecture of Multirail MPI Design (Eager and Rendezvous protocol messages from the MPI Protocol Layer enter the Communication Scheduler, which consults the Scheduling Policies component and posts messages on the virtual subchannels of the InfiniBand Layer; the Completion Filter condenses completion notifications for the MPI Protocol Layer and feeds information back to the policies)

The Communication Scheduler is the central part of our design. Basically, it accepts protocol messages from the MPI Protocol Layer and stripes (or multiplexes) them across multiple virtual subchannels. (Details of virtual subchannels will be described later.) In order to decide how to do striping or multiplexing, the Communication Scheduler uses information provided by the Scheduling Policies component. Scheduling Policies can be static schemes that are determined at initialization time. They can also be dynamic schemes that adjust themselves based on input from other components of the system.

Since a single message may be striped and sent as multiple messages through the InfiniBand Layer, we use the Completion Filter to filter completion notifications and to inform the MPI Protocol Layer about completions only when necessary. The Completion Filter can also gather information based on the completion notifications and use it as input to adjust dynamic scheduling policies.
4.2 Virtual Subchannel Abstraction

As we have discussed, multirail networks can be built by using multiple HCAs in a single node, or by using multiple ports in a single HCA. We have also seen that even with a single port, it is possible to achieve performance benefits by allowing multiple paths to be set up between two endpoints. Therefore, it is desirable to have a single implementation that handles all these cases instead of dealing with them separately.

In MPI applications, every two processes can communicate with each other. This is implemented in many MPI designs by a data structure called a virtual channel (or virtual connection). A virtual channel can be regarded as an abstract communication channel between two processes. It does not have to correspond to a physical connection of the underlying communication layer.

In this paper, we use an enhanced virtual channel abstraction to provide a unified solution that supports multiple HCAs, multiple ports, and multiple paths for a single port. In our design, a virtual channel can consist of multiple virtual subchannels (called subchannels later). Since our MPI implementation mainly takes advantage of the InfiniBand Reliable Connection (RC) service, each subchannel corresponds to a reliable connection at the InfiniBand Layer. At the virtual channel level, we maintain various data structures to coordinate all the subchannels.

It is easy to see how this enhanced abstraction can deal with all the multirail configurations we have discussed. In the case of each node having multiple HCAs, the subchannels of a virtual channel correspond to connections that go through different HCAs. If we would like to use multiple ports of the HCAs, we can set up subchannels so that there is one connection for each port. Similarly, different subchannels/connections can be set up on a single port so that they follow different paths. Once all the connections are initialized, the same subchannel abstraction is used for communication in all cases. Therefore, there is essentially no difference among the configurations except for the initialization phase. The subchannel abstraction can also easily deal with cases in which we have a combination of multiple HCAs, multiple ports, and multiple paths for a single port. This idea is further illustrated in Figure 3.

Figure 3. Virtual Subchannel Abstraction (subchannels between two processes realized with multiple HCAs per node, with the two ports of a single HCA, or with multiple paths over a single port)
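One possible shape of this enhanced abstraction is sketched below: a virtual channel between two processes simply owns an array of subchannels, each bound to one RC connection, with the HCA, port, and destination LID recorded only because they matter during initialization. The type and field names are hypothetical and are not taken from MVAPICH.

```c
#include <stdint.h>

/* Hypothetical data layout for the enhanced virtual channel abstraction.
 * After initialization the Communication Scheduler sees only the array
 * of subchannels; whether they map to different HCAs, different ports of
 * one HCA, or different LMC paths over one port no longer matters. */
struct subchannel {
    int      hca_index;     /* HCA on which the RC connection was created */
    int      port;          /* port used on that HCA                      */
    uint16_t dlid;          /* destination LID (differs per LMC path)     */
    void    *qp;            /* underlying Reliable Connection queue pair  */
    int      weight;        /* striping weight used by the scheduler      */
    uint64_t bytes_posted;  /* bookkeeping for scheduling feedback        */
};

struct virtual_channel {
    int               peer_rank;        /* remote MPI process             */
    int               num_subchannels;  /* rails between the two peers    */
    struct subchannel sub[8];
    uint32_t          next_psn;         /* packet sequence number, Sec. 5 */
};
```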
4.3 Integration with MPI Protocols

In some MPI implementations, functionalities such as striping messages across multiple network interfaces are part of a messaging layer. This messaging layer provides an interface to upper layer software such as MPI. One advantage of this approach is high portability, as other upper layer software can also benefit from multirail networks. Our design is different because we have chosen to integrate these functionalities more tightly with the MPI communication protocols. Instead of focusing on portability, we aim to achieve high efficiency and flexibility in our implementation. Since multirail support is integrated with the MPI protocols, we can specifically tailor its design to MPI to reduce overhead. This tightly coupled structure also gives us more flexibility in controlling how messages are striped or multiplexed in different MPI protocols.

One key design decision we have made is to allow message striping only for RDMA messages, although all messages, including RDMA and send/receive, can use multiplexing. This is not a serious restriction for MPI because MPI implementations over InfiniBand usually use only RDMA operations to transfer large messages. Send/receive operations are often used only for transferring small messages. By using striping with RDMA, there is almost no overhead to reassemble messages because data is directly put into the destination buffer. Zero-copy protocols in MPI, which usually take advantage of RDMA, can be supported in a straightforward manner.

As an example, let us take a look at the Eager and Rendezvous protocols shown in Figure 1. In the Eager protocol, the data message can be sent using either RDMA or send/receive operations. However, since this message is small, striping is not necessary and only multiplexing is used. In the Rendezvous protocol, control messages are not striped. However, data messages can be striped since they can be very large.

4.4 Scheduling Policies

Different scheduling policies can be used by the Communication Scheduler to decide which subchannels to use for transferring messages. We categorize different policies into two classes: static schemes and dynamic schemes. In a static scheme, the policy and its parameters are determined at initialization time and stay unchanged during the execution of MPI applications. On the other hand, a dynamic scheme can switch between different policies or change its parameters.

In our design, scheduling policies can also be classified into multiplexing schemes and striping schemes. Multiplexing schemes are used for send/receive operations and RDMA operations with small data, in which messages are not striped. Striping schemes are used for large RDMA messages.

For multiplexing schemes, a simple solution is binding, in which only one subchannel is used for all messages. This scheme has the least overhead. It can take advantage of multiple subchannels if there are multiple processes in a single node. For utilizing multiple subchannels with a single process per node, schemes similar to Weighted Fair Queuing (WFQ) and Generalized Processor Scheduling (GPS) have been proposed in the networking area [1]. These schemes take into consideration the length of a message. In InfiniBand, the per operation cost usually dominates for small messages. Therefore, we choose to ignore the message size for small messages. As a result, simple round robin or weighted round robin schemes can be used for multiplexing. In some cases, different subchannels may have different latencies. This will result in many out-of-order messages for round robin schemes. A variation of round robin called window based round robin can be used to address this issue. In this scheme, a window size W is given and a subchannel is used to send W messages before the Communication Scheduler switches to another subchannel. Since W consecutive messages travel the same subchannel, the number of out-of-order messages can be greatly reduced for subchannels with different latencies.

For striping schemes, the most important factor we need to consider is the bandwidth of each subchannel. It should be noted that we should consider path bandwidth instead of link bandwidth, although they can sometimes be the same depending on the switch configuration and the communication pattern. Even striping can be used for subchannels with equal bandwidth, while weighted striping can be used for subchannels with different bandwidths. Similar to multiplexing, binding can be used when there are multiple processes in a single node.
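The two families of policies can be made concrete with a short sketch: window-based round robin picks one subchannel for W consecutive small messages before moving on, and weighted striping cuts a large RDMA message into chunks whose sizes are proportional to the subchannel weights. The structure and function names below are illustrative assumptions, not the MVAPICH source.

```c
#include <stddef.h>

#define MAX_RAILS 8

struct sched_state {
    int num_subchannels;
    int weight[MAX_RAILS];   /* striping weights; their sum is W_total     */
    int window;              /* W: messages sent on a subchannel before
                                the scheduler moves to the next one        */
    int current;             /* subchannel currently used for multiplexing */
    int sent_in_window;      /* small messages already sent on 'current'   */
};

/* Multiplexing policy: window-based round robin.  Returns the subchannel
 * to use for the next small (send/receive or small RDMA) message. */
static int next_subchannel(struct sched_state *s)
{
    if (s->sent_in_window >= s->window) {
        s->current = (s->current + 1) % s->num_subchannels;
        s->sent_in_window = 0;
    }
    s->sent_in_window++;
    return s->current;
}

/* Striping policy: split a large RDMA message of 'len' bytes into chunks
 * proportional to the subchannel weights.  chunk_len[i] receives the
 * number of bytes to post on subchannel i; any remainder from the integer
 * division is added to the last subchannel so the chunks always sum to
 * 'len'. */
static void stripe_message(const struct sched_state *s, size_t len,
                           size_t chunk_len[MAX_RAILS])
{
    size_t total_weight = 0, assigned = 0;
    int i;

    for (i = 0; i < s->num_subchannels; i++)
        total_weight += (size_t) s->weight[i];

    for (i = 0; i < s->num_subchannels; i++) {
        chunk_len[i] = len * (size_t) s->weight[i] / total_weight;
        assigned += chunk_len[i];
    }
    chunk_len[s->num_subchannels - 1] += len - assigned;
}
```

With all weights equal this reduces to even striping, and a window size of 1 reduces to plain round robin; the adaptive scheme of the next subsection changes only how the weight array is updated.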
4.5 Adaptive Striping

As we have discussed in the previous subsection, it is important to take path bandwidth into consideration for striping schemes. A simple solution is to use weighted striping and set the weights of the different subchannels to their respective link bandwidths. However, this method fails to address the following problems: First, sometimes information such as link bandwidth is not available directly to MPI implementations. Second, in some cases, bottlenecks in the network or switches may make the path bandwidth smaller than the link bandwidth. Finally, path bandwidth can also be affected by other ongoing communication. Therefore, it may change over time. A partial solution to these problems is to carry out small tests during the initialization phase of MPI applications to determine the path bandwidth. However, in addition to its high overhead (tests need to be done for every subchannel between every pair of nodes), it still fails to solve the last problem.

In this subsection, we propose a dynamic scheme for striping large messages. Our scheme, called the adaptive striping scheme, is based on the weighted striping scheme. However, instead of using a set of fixed weights that are set at initialization time, we constantly monitor the progress of the different stripes in each subchannel and exploit feedback information from the InfiniBand Layer to adjust the weights to their optimal values.

In designing the adaptive striping scheme, we assume the latencies of all subchannels are about the same and focus on their bandwidth. In order to achieve optimal performance for striping, a key insight is that the message must be striped in such a way that the transmission of each stripe finishes at about the same time. This results in perfect load balancing and minimum message delivering time. Therefore, our scheme constantly monitors the time each stripe spends in each subchannel and uses this information to adjust the weights so that the striping distribution becomes more and more balanced and eventually reaches the optimum. This feedback based control mechanism is illustrated in Figure 4.

Figure 4. Feedback Loop in Adaptive Striping (Rendezvous RDMA data messages are striped by the Communication Scheduler according to the weighted striping policy; completions of the different stripes are collected by the Completion Filter and fed back to the Scheduling Policies component as weight adjustments)

In InfiniBand, a completion notification will be generated after each message is delivered to the destination and an acknowledgment is received. With the help of the Completion Filter, the progress engine of our MPI implementation uses polling to check for new completion notifications and take appropriate actions. In order to calculate the delivering time of each stripe, we first record the start time of each stripe when it is handed over to the InfiniBand Layer for transmission. When the delivery is finished, a completion notification will be generated by the InfiniBand Layer. The Completion Filter component will then record the finish time and derive the delivering time by subtracting the start time from it. After the delivering times for all stripes of a message are collected, an adjustment of the weights is calculated and sent to the Scheduling Policies component to adjust the policy. Later, the Communication Scheduler will use the new policy for striping.

Next we discuss the details of weight adjustment. Our basic idea is to have a fixed total weight and redistribute it based on feedback information obtained from the different stripes of a single message. Suppose the total weight is W_{total}, the current weight of subchannel i is W_i, the path bandwidth of subchannel i is BW_i, the message size is S, and the stripe delivering time for subchannel i is t_i. We then have the following:

    BW_i = \frac{S \cdot W_i / W_{total}}{t_i} = \frac{S \cdot W_i}{t_i \cdot W_{total}}    (1)

Since W_{total} and S are the same for all subchannels, we have the following:

    BW_i \propto \frac{W_i}{t_i}    (2)

Therefore, new weight distributions can be computed based on Equation 2. Suppose W'_i is the new weight for subchannel i; the following can be used to calculate W'_i:

    W'_i = W_{total} \cdot \frac{W_i / t_i}{\sum_{k \in subchannels} W_k / t_k}    (3)

In Equation 3, weights are completely redistributed based on the feedback information. To make our scheme more robust to fluctuations in the system, we can preserve part of the historical information. Suppose α is a constant between 0 and 1; we can then use the following equation:

    W'_i = (1 - \alpha) \cdot W_i + \alpha \cdot W_{total} \cdot \frac{W_i / t_i}{\sum_{k \in subchannels} W_k / t_k}    (4)
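Equation 4 translates directly into a small update routine: each subchannel's share of the fixed total weight is recomputed from W_i/t_i and blended with its previous value through the smoothing constant α. The code below is a sketch of that computation under the assumption that floating-point weights are acceptable; the actual implementation would more likely keep small integer weights.

```c
/* Adaptive striping weight update (Equation 4).
 *   w[i]    : current weight of subchannel i; the weights sum to w_total
 *   t[i]    : measured delivering time of the last stripe on subchannel i
 *   alpha   : smoothing constant in (0, 1); larger values react faster
 * On return, w[] holds the new weights, still summing to w_total.
 * Illustrative sketch only. */
static void adapt_weights(double *w, const double *t, int n,
                          double w_total, double alpha)
{
    double sum = 0.0;
    int i;

    /* Equation 2: the path bandwidth BW_i is proportional to W_i / t_i. */
    for (i = 0; i < n; i++)
        sum += w[i] / t[i];

    for (i = 0; i < n; i++) {
        /* Equation 3: full redistribution based on the feedback. */
        double redistributed = w_total * (w[i] / t[i]) / sum;
        /* Equation 4: keep part of the history to damp fluctuations. */
        w[i] = (1.0 - alpha) * w[i] + alpha * redistributed;
    }
}
```

For example, with two subchannels of equal initial weight where the second stripe consistently takes about four times as long, repeated updates drive the weights toward a 4:1 split, which matches the 4x/1x configuration evaluated in Section 6.3.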
In our implementation, the start times of all stripes are almost the same and can be accurately measured. However, completion notifications are generated by the InfiniBand Layer asynchronously and we only record the finish time of a stripe once we have found its completion notification. Since MPI progress engine processing can be delayed due to application computation, we can only obtain an upper bound of the actual finish time, and the resulting delivering time t_i is also an upper bound. Therefore, one question is how accurately we can estimate the delivering time t_i for each subchannel. To address this question, we consider three cases:

1. The progress engine is not delayed. In this case, accurate delivering times can be obtained.

2. The progress engine is delayed and some of the delivering times are overestimated. Based on Equation 4, in this case, the weight redistribution will not be optimal, but it will still improve performance compared with the original weight distribution.

3. The progress engine is delayed for a long time and we find all completion notifications at about the same time. Based on Equation 4, this will essentially result in no change in the weight distribution.

We can see that in no case will the redistribution result in worse performance than the original distribution. In practice, case 1 is the most common and accurate estimation can be expected most of the time.

5 Detailed Design Issues

Our multirail MPI is based on MVAPICH [19, 13], our MPI implementation over InfiniBand. MVAPICH is derived from MPICH [9], which was developed at Argonne National Laboratory and is currently one of the most popular MPI implementations. MVAPICH is also derived from MVICH [12], which is an ADI2 implementation for VIA [5].

In this section, we discuss some of the detailed design issues in our multirail design. These issues include special cases for multiple HCAs, handling out-of-order messages, and RDMA completion notification.

5.1 Handling Multiple HCAs

In Section 4, we described how we can provide a unified design for multiple HCAs, multiple ports, and multiple connections in a single port. The key idea is to use the subchannel abstraction. Once subchannels are established, there is essentially no difference in dealing with all the different cases.

However, due to some restrictions in InfiniBand, there are two situations that must be handled differently for multiple HCAs: completion queue (CQ) polling and buffer registration.

Our MPI implementation mostly uses RDMA to transfer messages and we have designed special mechanisms at the receiver to detect incoming messages [13]. However, CQs are still used at the sender side. Although multiple connections can be associated with a single CQ, InfiniBand requires all these connections to be from a single HCA. Therefore, in the case of multiple HCAs, we need to use multiple CQs. This results in slightly higher overhead due to the extra polling of CQs.

Buffer registration also needs different handling for multiple HCAs. In InfiniBand, buffer registration serves two purposes. First, it ensures that the buffer is pinned down in physical memory so that it can be safely accessed by InfiniBand hardware using DMA. Second, it provides the InfiniBand HCA with address translation information so that buffers can be accessed through virtual addresses. Hence, if a buffer is to be sent through multiple HCAs, it must be registered with each of them. Currently, we use a simple approach of registering the whole buffer with all HCAs. Although this approach increases the registration overhead, the overhead can be largely avoided by using a registration cache. In the future, we plan to investigate schemes that register only part of the buffer with each HCA.

5.2 Out-of-Order Message Processing

Due to the requirements of MPI, it is desirable to process messages in order in our multirail implementation. Since we use the Reliable Connection (RC) service provided by InfiniBand for each subchannel, messages will not be lost and they are delivered in order within a single subchannel. However, there is no ordering guarantee across the multiple subchannels of the same virtual channel. To address this problem, we introduce a Packet Sequence Number (PSN) variable for each virtual channel. Every message sent through this virtual channel carries the current PSN and also increments it. Each receiver maintains an Expected Sequence Number (ESN) for every virtual channel. When an out-of-order message arrives, it is put into an out-of-order queue associated with this virtual channel and its processing is deferred. This queue is checked at proper times, when a message in the queue may be the next expected packet.

The basic operations on the out-of-order queue are enqueue, dequeue, and search. To improve performance, it is desirable to optimize these operations. In practice we have found that when appropriate communication scheduling policies are used, out-of-order messages are very rare. As a result, very little overhead is spent on out-of-order message handling.
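The sequence number mechanism can be sketched as follows: a message whose PSN matches the virtual channel's ESN is processed immediately, anything else is parked in the out-of-order queue, and the queue is drained whenever the ESN advances. The structure and function names are hypothetical; a real implementation would also bound the queue and handle PSN wrap-around.

```c
#include <stdint.h>

/* A received message carrying the Packet Sequence Number assigned by
 * the sender on the corresponding virtual channel. */
struct message {
    uint32_t        psn;
    struct message *next;    /* link for the out-of-order queue */
    /* ... payload ... */
};

struct vchannel_recv {
    uint32_t        esn;     /* Expected Sequence Number               */
    struct message *ooo;     /* out-of-order queue, kept sorted by PSN */
};

static void deliver_to_mpi(struct message *m) { (void) m; /* hand to MPI */ }

/* Called when a message arrives on any subchannel of this virtual channel. */
static void on_arrival(struct vchannel_recv *vc, struct message *m)
{
    if (m->psn != vc->esn) {
        /* Out of order: defer processing, inserting sorted by PSN. */
        struct message **p = &vc->ooo;
        while (*p && (*p)->psn < m->psn)
            p = &(*p)->next;
        m->next = *p;
        *p = m;
        return;
    }

    /* In order: process it, then drain queued messages that now match. */
    deliver_to_mpi(m);
    vc->esn++;
    while (vc->ooo && vc->ooo->psn == vc->esn) {
        struct message *head = vc->ooo;
        vc->ooo = head->next;
        deliver_to_mpi(head);
        vc->esn++;
    }
}
```

Since the scheduling policies keep out-of-order arrivals rare, the sorted-insertion path is almost never taken and the common case costs a single comparison.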
5.3 RDMA Completion Notification

In our design, large RDMA messages such as the data messages in the Rendezvous protocol can be striped into multiple smaller messages. Hence, multiple completion notifications may be generated for a single message at the sender side. The Completion Filter component in our design notifies the MPI Protocol Layer only after it has collected all the notifications.

At the receiver, the MPI Protocol Layer also needs to know when the data message has been put into the destination buffer. In our original design, this is achieved by using a Rendezvous finish control message. This message will only be received after the Rendezvous data messages, since ordering is guaranteed for a single InfiniBand connection. However, this scheme is not enough for multiple subchannels. In this case, we have to use multiple Rendezvous finish messages, one for each subchannel on which Rendezvous data is sent. The receiver notifies the MPI Protocol Layer only after it has received all the Rendezvous finish messages. It should be noted that these Rendezvous finish messages are sent in parallel and their transfer times are overlapped. Therefore, in general they add very little extra overhead.
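At the sender, the Completion Filter's job for a striped message therefore reduces to counting, as in the following sketch (hypothetical names): it knows how many stripes were posted and raises a single completion to the MPI Protocol Layer only when the last stripe's notification arrives. The receiver applies the same idea to the per-subchannel Rendezvous finish messages.

```c
#include <stdint.h>

/* Per-message bookkeeping kept by the Completion Filter on the sender. */
struct stripe_tracker {
    int      stripes_posted;     /* chunks handed to the InfiniBand Layer */
    int      stripes_completed;  /* completion notifications seen so far  */
    uint64_t start_time;         /* stripe start time, used by adaptive
                                    striping to derive delivering times   */
};

/* Called once per completion notification belonging to this message.
 * Returns 1 when the whole striped message is complete and the MPI
 * Protocol Layer should be notified, 0 otherwise. */
static int on_stripe_completion(struct stripe_tracker *t)
{
    t->stripes_completed++;
    return t->stripes_completed == t->stripes_posted;
}
```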
We first present performance comparisons using since ordering is guaranteed for a single InfiniBand con- micro-benchmarks, including latency, bandwidth and bi- nection. However, this scheme is not enough for multiple directional bandwidth. We then present results for collec- subchannels. In this case, we have to use multiple Ren- tivecommunicationbyusingPallasMPIbenchmarks[20]. dezvousfinish messages – oneper eachsubchannel where Finally, we carry out application levelevaluation by using Rendezvousdata is sent. The receiverwillnotify the MPI some of the NAS Parallel Benchmarks [17] and a visual- ProtocolLayeronlyafterithasreceivedalltheRendezvous ization application. In many of the experiments, we have finishmessages. It shouldbe noted thatthese Rendezvous considered two cases: UP mode (each node running one finishmessagesaresentinparallelandtheirtransfertimes process)andSMPmode(eachnoderunningtwoprocesses). areoverlapped. Therefore,ingeneraltheyhaveverysmall In Figures 5, 7 and 8, we show the latency, bandwidth extraoverhead. and bidirectional bandwidth results in UP mode. We also 8 350 600 1800 Striping Round Robin Striping 300 Original 500 Original 1600 Original 250 B/s) 400 B/s) 11240000 Time (us) 112050000 andwidth (M 230000 andwidth (M 1 068000000 B B 400 50 100 200 0 0 0 16 64 256 1K 4K 16K 64K 256K 2 4 8 16 32 641282565121K 2K 4 64 1K 16K 256K 4M Message Size (Bytes) Message Size (Bytes) Message Size (Bytes) Figure 5. MPI Latency (UP Figure6.MPIBandwidth(Small Figure 7. MPI Bandwidth (UP mode) Messages,UPmode) mode) 2000 1000 1000 Striping Striping Striping 1800 Original 900 Binding 900 Binding 1600 800 Original 800 Original B/s) 1400 B/s) 700 B/s) 700 dth (M 11020000 dth (M 560000 dth (M 560000 wi 800 wi 400 wi 400 d d d an 600 an 300 an 300 B 400 B 200 B 200 200 100 100 0 0 0 4 64 1K 16K 256K 4M 4 64 1K 16K 256K 4M 4 64 1K 16K 256K 4M Message Size (Bytes) Message Size (Bytes) Message Size (Bytes) Figure 8. MPI Bidirectional Figure9. MPIBandwidth(SMP Figure 10. MPI Bidirectional Bandwidth(UPmode) mode) Bandwidth(SMPmode) show bandwidth results for small messages in Figure 6. ure9showsthatbothstripingandbindingperformssignif- (Note that in the x axis of the figures, unit K is an abbre- icantlybetterthantheoriginaldesign. Wecanalsoseethat viationfor210andMisanabbreviationfor220.)FromFig- stripingdoesbetterthanbinding.Thereasonisthatstriping ure5wecanseethatforsmallmessages,theoriginaldesign canutilizebothHCAsinbothdirectionswhilebindingonly and the multirail design perform comparably. The small- usesonedirection in eachHCA.Since in the bidirectional est latency is around 6 (cid:22)s for both. However, as message bandwidth test in SMP mode, both HCAs are utilized for sizeincreases,themultiraildesignoutperformstheoriginal both directions, striping and binding perform comparably, design. For large messages, it achieves about half the la- ascanbeseenfromFigure10. tency of the original design. In Figure 7, we can observe In Figures 11, 12, 13 and 14 we show results for thatmultiraildesigncanachievesignificantlyhigherband- MPI BcastandMPI Alltoallfor8processes(UPmode)and width.Thepeakbandwidthfortheoriginaldesignisaround 16 processes (SMP mode) using Pallas Benchmarks. The 884MB/s.Withthemultiraildesign,wecanachievearound trendisverysimilartowhatwehaveobservedinprevious 1723MB/sbandwidth,whichisalmosttwicethebandwidth tests. Withmultiraildesign,wecanachievesignificantper- obtainedwiththeoriginaldesign. 
In Figures 11, 12, 13 and 14 we show results for MPI_Bcast and MPI_Alltoall for 8 processes (UP mode) and 16 processes (SMP mode) using the Pallas benchmarks. The trend is very similar to what we have observed in the previous tests. With the multirail design, we can achieve significant performance improvements for large messages compared with the original design.

Figure 11. MPI_Bcast Latency (UP mode)
Figure 12. MPI_Alltoall Latency (UP mode)
Figure 13. MPI_Bcast Latency (SMP mode)
Figure 14. MPI_Alltoall Latency (SMP mode)

In Figures 15 and 16 we show application results. We have chosen the IS and FT applications (Class A and Class B) in the NAS Parallel Benchmarks because, compared with other applications, they are more bandwidth-bound. We have also used a visualization application, which is a modified version of the program described in [7]. We show performance numbers for both UP and SMP modes. However, due to the large data set size of the visualization application, we can only run it in UP mode.

From the figures we can see that the multirail design results in a significant reduction in communication time for all applications in both UP and SMP modes. For FT, the communication time is reduced almost by half. For IS, the communication time is reduced by up to 38%, which results in up to a 22% reduction in application running time. For the visualization application, the communication time is reduced by 43% and the application running time is reduced by 16%. Overall, we can see that the multirail design can bring significant performance improvements to bandwidth-bound applications.

Figure 15. Application Results (8 processes, UP mode)
Figure 16. Application Results (16 processes, SMP mode)

6.3 Evaluating the Adaptive Striping Scheme

In this subsection, we show how our proposed adaptive striping scheme can provide good performance in cases where each rail has a different bandwidth. To simulate this environment, for most of our experiments we have forced the second HCA on each node to run at 1x speed with a peak bandwidth of 250 MB/s. The first HCA on each node still operates at the normal 4x speed (1 GB/s peak bandwidth). Without a priori knowledge of this environment, our multirail MPI implementation will use even striping. With this knowledge, it will use weighted striping and set the weights to 4 and 1 respectively for the two subchannels. We compare both of them with the adaptive striping scheme, which assigns equal weights to both subchannels initially. We focus on microbenchmarks and UP mode in this subsection.

Figures 17 and 18 show the latency and bandwidth results. We can see that the adaptive striping scheme significantly outperforms even striping and achieves performance comparable to weighted striping. In Figure 19, we show bidirectional bandwidth results for the three schemes. An important finding is that our adaptive scheme can significantly outperform weighted striping in this case.
This is because in the test, the communication traffic is assigned to the two subchannels as 4:1 based on the link speed (4x vs. 1x). With bidirectional traffic, the aggregate link speeds would be 8x and 2x respectively for each subchannel. How-
