Optimizing Network Virtualization in Xen

Aravind Menon (EPFL, Switzerland), Alan L. Cox (Rice University, Houston), Willy Zwaenepoel (EPFL, Switzerland)

Abstract

In this paper, we propose and evaluate three techniques for optimizing network performance in the Xen virtualized environment. Our techniques retain the basic Xen architecture of locating device drivers in a privileged 'driver' domain with access to I/O devices, and providing network access to unprivileged 'guest' domains through virtualized network interfaces.

First, we redefine the virtual network interfaces of guest domains to incorporate high-level network offload features available in most modern network cards. We demonstrate the performance benefits of high-level offload functionality in the virtual interface, even when such functionality is not supported in the underlying physical interface. Second, we optimize the implementation of the data transfer path between guest and driver domains. The optimization avoids expensive data remapping operations on the transmit path, and replaces page remapping by data copying on the receive path. Finally, we provide support for guest operating systems to effectively utilize advanced virtual memory features such as superpages and global page mappings.

The overall impact of these optimizations is an improvement in transmit performance of guest domains by a factor of 4.4. The receive performance of the driver domain is improved by 35% and reaches within 7% of native Linux performance. The receive performance in guest domains improves by 18%, but still trails the native Linux performance by 61%. We analyse the performance improvements in detail, and quantify the contribution of each optimization to the overall performance.

1 Introduction

In recent years, there has been a trend towards running network-intensive applications, such as Internet servers, in virtual machine (VM) environments, where multiple VMs running on the same machine share the machine's network resources. In such an environment, the virtual machine monitor (VMM) virtualizes the machine's network I/O devices to allow multiple operating systems running in different VMs to access the network concurrently.

Despite the advances in virtualization technology [15, 4], the overhead of network I/O virtualization can still significantly affect the performance of network-intensive applications. For instance, Sugerman et al. [15] report that the CPU utilization required to saturate a 100 Mbps network under Linux 2.2.17 running on VMware Workstation 2.0 was 5 to 6 times higher than the utilization under native Linux 2.2.17. Even in the paravirtualized Xen 2.0 VMM [4], Menon et al. [10] report significantly lower network performance under a Linux 2.6.10 guest domain compared to native Linux performance: a degradation by a factor of 2 to 3 for receive workloads, and by a factor of 5 for transmit workloads. The latter study, unlike previously reported results on Xen [4, 5], reports network performance under configurations leading to full CPU saturation.

In this paper, we propose and evaluate a number of optimizations for improving the networking performance under the Xen 2.0 VMM. These optimizations address many of the performance limitations identified by Menon et al. [10]. Starting with version 2.0, the Xen VMM adopted a network I/O architecture that is similar to the hosted virtual machine model [5]. In this architecture, a physical network interface is owned by a special, privileged VM called a driver domain that executes the native Linux network interface driver. In contrast, an ordinary VM called a guest domain is given access to a virtualized network interface. The virtualized network interface has a front-end device in the guest domain and a back-end device in the corresponding driver domain.
The front-end and back-end devices transfer network packets between their domains over an I/O channel that is provided by the Xen VMM. Within the driver domain, either Ethernet bridging or IP routing is used to demultiplex incoming network packets and to multiplex outgoing network packets between the physical network interface and the guest domain through its corresponding back-end device.

Our optimizations fall into the following three categories:

1. We add three capabilities to the virtualized network interface: scatter/gather I/O, TCP/IP checksum offload, and TCP segmentation offload (TSO). Scatter/gather I/O and checksum offload improve performance in the guest domain. Scatter/gather I/O eliminates the need for data copying by the Linux implementation of sendfile(). TSO improves performance throughout the system. In addition to its well-known effects on TCP performance [11, 9] benefiting the guest domain, it improves performance in the Xen VMM and driver domain by reducing the number of network packets that they must handle.

2. We introduce a faster I/O channel for transferring network packets between the guest and driver domains. The optimizations include transmit mechanisms that avoid a data remap or copy in the common case, and a receive mechanism optimized for small data receives.

3. We present VMM support mechanisms for allowing the use of efficient virtual memory primitives in guest operating systems. These mechanisms allow guest OSes to make use of superpage and global page mappings on the Intel x86 architecture, which significantly reduce the number of TLB misses incurred by guest domains.

Overall, our optimizations improve the transmit performance in guest domains by a factor of 4.4. Receive side network performance is improved by 35% for driver domains, and by 18% for guest domains. We also present a detailed breakdown of the performance benefits resulting from each individual optimization. Our evaluation demonstrates the performance benefits of TSO support in the virtual network interface even in the absence of TSO support in the physical network interface. In other words, emulating TSO in software in the driver domain results in higher network performance than performing TCP segmentation in the guest domain.

The outline for the rest of the paper is as follows. In section 2, we describe the Xen network I/O architecture and summarize its performance overheads as described in previous research. In section 3, we describe the design of the new virtual interface architecture. In section 4, we describe our optimizations to the I/O channel. Section 5 describes the new virtual memory optimization mechanisms added to Xen. Section 6 presents an evaluation of the different optimizations described. In section 7, we discuss related work, and we conclude in section 8.

2 Background

The network I/O virtualization architecture in Xen can be a significant source of overhead for networking performance in guest domains [10]. In this section, we describe the overheads associated with different aspects of the Xen network I/O architecture, and their overall impact on guest network performance. To understand these overheads better, we first describe the network virtualization architecture used in Xen.

The Xen VMM uses an I/O architecture which is similar to the hosted VMM architecture [5]. Privileged domains, called 'driver' domains, use their native device drivers to access I/O devices directly, and perform I/O operations on behalf of other unprivileged domains, called guest domains. Guest domains use virtual I/O devices controlled by paravirtualized drivers to request the driver domain for device access.

The network architecture used in Xen is shown in figure 1. Xen provides each guest domain with a number of virtual network interfaces, which are used by the guest domain for all its network communications. Corresponding to each virtual interface in a guest domain, a 'backend' interface is created in the driver domain, which acts as the proxy for that virtual interface in the driver domain. The virtual and backend interfaces are 'connected' to each other over an 'I/O channel'. The I/O channel implements a zero-copy data transfer mechanism for exchanging packets between the virtual interface and backend interfaces by remapping the physical page containing the packet into the target domain.
Figure 1: Xen Network I/O Architecture (the driver domain, containing the bridge, the backend interface and the native NIC driver, is connected to the guest domain's virtual interface over the I/O channel provided by the Xen VMM, and to the physical NIC)

All the backend interfaces in the driver domain (corresponding to the virtual interfaces) are connected to the physical NIC and to each other through a virtual network bridge. The combination of the I/O channel and network bridge thus establishes a communication path between the guest's virtual interfaces and the physical interface.

Table 1 compares the transmit and receive bandwidth achieved in a Linux guest domain using this virtual interface architecture with the performance achieved with the benchmark running in the driver domain and in native Linux. These results are obtained using a netperf-like [1] benchmark which used the zero-copy sendfile() for transmit.

    Configuration   Receive (Mb/s)   Transmit (Mb/s)
    Linux           2508 (100%)      3760 (100%)
    Xen driver      1738 (69.3%)     3760 (100%)
    Xen guest        820 (32.7%)      750 (19.9%)

    Table 1: Network performance under Xen

The driver domain configuration shows performance comparable to native Linux for the transmit case and a degradation of 30% for the receive case. However, this configuration uses native device drivers to directly access the network device, and thus the virtualization overhead is limited to some low-level functions such as interrupts.

In contrast, in the guest domain configuration, which uses virtualized network interfaces, the impact of network I/O virtualization is much more pronounced. The receive performance in guest domains suffers from a performance degradation of 67% relative to native Linux, and the transmit performance achieves only 20% of the throughput achievable under native Linux.

Menon et al. [10] report similar results. Moreover, they introduce Xenoprof, a variant of OProfile [2] for Xen, and apply it to break down the performance of a similar benchmark. Their study made the following observations about the overheads associated with different aspects of the Xen network virtualization architecture.

Since each packet transmitted or received on a guest domain's virtual interface had to pass through the I/O channel and the network bridge, a significant fraction of the network processing time was spent in the Xen VMM and the driver domain respectively. For instance, 70% of the execution time for receiving a packet in the guest domain was spent in transferring the packet through the driver domain and the Xen VMM from the physical interface. Similarly, 60% of the processing time for a transmit operation was spent in transferring the packet from the guest's virtual interface to the physical interface. The breakdown of this processing overhead was roughly 40% Xen, 30% driver domain for receive traffic, and 30% Xen, 30% driver domain for transmit traffic.

In general, both the guest domains and the driver domain were seen to suffer from a significantly higher TLB miss rate compared to execution in native Linux. Additionally, guest domains were seen to suffer from much higher L2 cache misses compared to native Linux.
3 Virtual Interface Optimizations

Network I/O is supported in guest domains by providing each guest domain with a set of virtual network interfaces, which are multiplexed onto the physical interfaces using the mechanisms described in section 2. The virtual network interface provides the abstraction of a simple, low-level network interface to the guest domain, which uses paravirtualized drivers to perform I/O on this interface. The network I/O operations supported on the virtual interface consist of simple network transmit and receive operations, which are easily mappable onto corresponding operations on the physical NIC.

Choosing a simple, low-level interface for virtual network interfaces allows the virtualized interface to be easily supported across the large number of physical interfaces available, each with different network processing capabilities. However, this also prevents the virtual interface from taking advantage of different network offload capabilities of the physical NIC, such as checksum offload, scatter/gather DMA support, and TCP segmentation offload (TSO).

3.1 Virtual Interface Architecture

We propose a new virtual interface architecture in which the virtual network interface always supports a fixed set of high-level network offload features, irrespective of whether these features are supported in the physical network interfaces. The architecture makes use of the offload features of the physical NIC itself if they match the offload requirements of the virtual network interface. If the required features are not supported by the physical interface, the new interface architecture provides support for these features in software.

Figure 2 shows the top-level design of the virtual interface architecture. The virtual interface in the guest domain supports a set of high-level offload features, which are reported to the guest OS by the front-end driver controlling the interface. In our implementation, the features supported by the virtual interface are checksum offloading, scatter/gather I/O and TCP segmentation offloading.

Figure 2: New I/O Architecture (as in figure 1, but the guest's virtual interface is high-level, and a software offload driver sits between the bridge/backend interface and the NIC driver in the driver domain)

Supporting high-level features like scatter/gather I/O and TSO allows the guest domain to transmit network packets that are much bigger than the network MTU and can consist of multiple fragments. These large, fragmented packets are transferred from the virtual to the physical interface over a modified I/O channel and network bridge (the I/O channel is modified to allow transfer of packets consisting of multiple fragments, and the network bridge is modified to support forwarding of packets larger than the MTU).

The architecture introduces a new component, a 'software offload' driver, which sits above the network device driver in the driver domain, and intercepts all packets to be transmitted on the network interface. When the guest domain's packet arrives at the physical interface, the offload driver determines whether the offload requirements of the packet are compatible with the capabilities of the NIC, and takes different actions accordingly.

If the NIC supports the offload features required by the packet, the offload driver simply forwards the packet to the network device driver for transmission.

In the absence of support from the physical NIC, the offload driver performs the necessary offload actions in software. Thus, if the NIC does not support TSO, the offload driver segments the large packet into appropriate MTU-sized packets before sending them for transmission. Similarly, it takes care of the other offload requirements, scatter/gather I/O and checksum offloading, if these are not supported by the NIC.
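The segmentation step performed in software can be illustrated with a small, self-contained sketch. This is not the Xen/XenoLinux code: the real offload driver operates on Linux socket buffers with per-page fragments and must also rewrite the IP/TCP headers and checksums of every segment, whereas the sketch below (with an assumed MSS value and a placeholder send_to_nic() function) only shows how one oversized virtual-interface packet is cut into MTU-sized units before it reaches the native NIC driver.

    #include <stdio.h>
    #include <stddef.h>

    /*
     * Minimal sketch of software TSO: one oversized packet from the guest's
     * virtual interface is cut into MSS-sized segments before it is handed
     * to the native NIC driver.  The flat buffer and send_to_nic() are
     * illustrative placeholders.
     */

    enum { MSS = 1448 };   /* assumed TCP payload carried by one MTU-sized frame */

    static void send_to_nic(const unsigned char *seg, size_t len, size_t offset)
    {
        printf("segment at offset %6zu: %4zu bytes\n", offset, len);
        (void)seg;
    }

    static void software_tso(const unsigned char *payload, size_t len)
    {
        size_t off = 0;
        while (off < len) {
            size_t chunk = (len - off < MSS) ? (len - off) : MSS;
            send_to_nic(payload + off, chunk, off);
            off += chunk;
        }
    }

    int main(void)
    {
        /* A 62 KB transmit request, roughly what a TSO-capable virtual
         * interface might hand down in a single operation. */
        static unsigned char big_packet[62 * 1024];
        software_tso(big_packet, sizeof big_packet);
        return 0;
    }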
3.2 Advantage of a high level interface

A high level virtual network interface can reduce the network processing overhead in the guest domain by allowing it to offload some network processing to the physical NIC. Support for scatter/gather I/O is especially useful for doing zero-copy network transmits, such as sendfile in Linux. Gather I/O allows the OS to construct network packets consisting of multiple fragments directly from the file system buffers, without having to first copy them out to a contiguous location in memory. Support for TSO allows the OS to do transmit processing on network packets of size much larger than the MTU, thus reducing the per-byte processing overhead by requiring fewer packets to transmit the same amount of data.

Apart from its benefits for the guest domain, a significant advantage of a high level interface comes from its impact on improving the efficiency of the network virtualization architecture in Xen.

As described in section 2, roughly 60% of the execution time for a transmit operation from a guest domain is spent in the VMM and driver domain for multiplexing the packet from the virtual to the physical network interface. Most of this overhead is a per-packet overhead incurred in transferring the packet over the I/O channel and the network bridge, and in packet processing overheads at the backend and physical network interfaces.

Using a high level virtual interface improves the efficiency of the virtualization architecture by reducing the number of packet transfers required over the I/O channel and network bridge for transmitting the same amount of data (by using TSO), and thus reducing the per-byte virtualization overhead incurred by the driver domain and the VMM.

For instance, in the absence of support for TSO in the virtual interface, each 1500 (MTU) byte packet transmitted by the guest domain requires one page remap operation over the I/O channel and one forwarding operation over the network bridge. In contrast, if the virtual interface supports TSO, the OS uses much bigger packets comprising multiple pages, and in this case each remap operation over the I/O channel can potentially transfer 4096 bytes of data. Similarly, with larger packets, far fewer packet forwarding operations are required over the network bridge.
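The savings implied by this example can be made concrete with a back-of-the-envelope calculation. The short program below is an illustration only: it assumes a 62 KB transmit request and one page-sized fragment per remap, and simply counts the I/O channel remaps and bridge forwarding operations needed with and without TSO in the virtual interface.

    #include <stdio.h>

    /*
     * Back-of-the-envelope model of transmit-path multiplexing cost.
     * Without TSO, every MTU-sized packet crosses the I/O channel (one page
     * remap) and the bridge (one forward).  With TSO, one large packet is
     * built from page-sized fragments, so each remap moves up to 4096 bytes
     * and one bridge forward covers the whole packet.  Numbers illustrative.
     */
    int main(void)
    {
        const long bytes      = 62 * 1024;   /* data to transmit            */
        const long mtu        = 1500;        /* bytes per non-TSO packet    */
        const long page       = 4096;        /* bytes per remapped fragment */
        const long tso_packet = 62 * 1024;   /* one large TSO packet        */

        long remaps_no_tso  = (bytes + mtu - 1) / mtu;        /* 1 per packet   */
        long bridges_no_tso = remaps_no_tso;                  /* 1 per packet   */

        long remaps_tso  = (bytes + page - 1) / page;         /* 1 per fragment */
        long bridges_tso = (bytes + tso_packet - 1) / tso_packet;

        printf("no TSO : %ld remaps, %ld bridge forwards\n",
               remaps_no_tso, bridges_no_tso);
        printf("   TSO : %ld remaps, %ld bridge forwards\n",
               remaps_tso, bridges_tso);
        return 0;
    }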
Supporting larger sized packets in the virtual network interface (using TSO) can thus significantly reduce the overheads incurred in network virtualization along the transmit path. It is for this reason that the software offload driver is situated in the driver domain at a point where the transmit packet would have already covered much of the multiplexing path, and is ready for transmission on the physical interface.

Thus, even in the case when the offload driver may have to perform network offloading for the packet in software, we expect the benefits of reduced virtualization overhead along the transmit path to show up in overall performance benefits. Our evaluation in section 6 shows that this is indeed the case.

4 I/O Channel Optimizations

Previous research [10] noted that 30-40% of the execution time for a network transmit or receive operation was spent in the Xen VMM, which included the time for page remapping and ownership transfers over the I/O channel and switching overheads between the guest and driver domain.

The I/O channel implements a zero-copy page remapping mechanism for transferring packets between the guest and driver domain. The physical page containing the packet is remapped into the address space of the target domain. In the case of a receive operation, the ownership of the physical page itself which contains the packet is transferred from the driver to the guest domain (both these operations require each network packet to be allocated on a separate physical page).

A study of the implementation of the I/O channel in Xen reveals that three address remaps and two memory allocation/deallocation operations are required for each packet receive operation, and two address remaps are required for each packet transmit operation. On the receive path, for each network packet (physical page) transferred from the driver to the guest domain, the guest domain releases ownership of a page to the hypervisor, and the driver domain acquires a replacement page from the hypervisor, to keep the overall memory allocation constant. Further, each page acquire or release operation requires a change in the virtual address mapping. Thus, overall, for each receive operation on a virtual interface, two page allocation/freeing operations and three address remapping operations are required. For the transmit path, the ownership transfer operation is avoided, as the packet can be directly mapped into the privileged driver domain. Each transmit operation thus incurs two address remapping operations.
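One way to keep this accounting straight is to write it down as an explicit tally. The sketch below merely restates the per-packet counts given above; the step-by-step attribution of the three receive-path remaps in the comments is one plausible reading of the description, not something taken from the Xen source.

    #include <stdio.h>

    /* Per-packet cost of the unoptimized I/O channel, as described above. */
    struct io_channel_cost { int remaps; int alloc_free; };

    static struct io_channel_cost receive_cost(void)
    {
        struct io_channel_cost c = { 0, 0 };
        c.remaps     += 1;   /* remap the packet page into the guest domain     */
        c.alloc_free += 1;   /* guest releases a page back to the hypervisor    */
        c.remaps     += 1;   /*   ...which changes a virtual address mapping    */
        c.alloc_free += 1;   /* driver domain acquires a replacement page       */
        c.remaps     += 1;   /*   ...which also changes an address mapping      */
        return c;
    }

    static struct io_channel_cost transmit_cost(void)
    {
        /* Packet page is mapped into, and later unmapped from, the driver
         * domain; no ownership transfer is needed. */
        struct io_channel_cost c = { 2, 0 };
        return c;
    }

    int main(void)
    {
        struct io_channel_cost rx = receive_cost(), tx = transmit_cost();
        printf("receive : %d remaps, %d alloc/free ops per packet\n",
               rx.remaps, rx.alloc_free);
        printf("transmit: %d remaps, %d alloc/free ops per packet\n",
               tx.remaps, tx.alloc_free);
        return 0;
    }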
We now describe alternate mechanisms for packet transfer on the transmit path and the receive path.

4.1 Transmit Path Optimization

We observe that for network transmit operations, the driver domain does not need to map in the entire packet from the guest domain if the destination of the packet is any host other than the driver domain itself. In order to forward the packet over the network bridge (from the guest domain's backend interface to its target interface), the driver domain only needs to examine the MAC header of the packet. Thus, if the packet header can be supplied to the driver domain separately, the rest of the network packet does not need to be mapped in.

The network packet needs to be mapped into the driver domain only when the destination of the packet is the driver domain itself, or when it is a broadcast packet.

We use this observation to avoid the page remapping operation over the I/O channel in the common transmit case. We augment the I/O channel with an out-of-band 'header' channel, which the guest domain uses to supply the header of the packet to be transmitted to the backend driver. The backend driver reads the header of the packet from this channel to determine if it needs to map in the entire packet (i.e., if the destination of the packet is the driver domain itself or the broadcast address). It then constructs a network packet from the packet header and the (possibly unmapped) packet fragments, and forwards this over the network bridge.

To ensure correct execution of the network driver (or the backend driver) for this packet, the backend ensures that the pseudo-physical to physical mappings of the packet fragments are set correctly (when a foreign page frame is mapped into the driver domain, the pseudo-physical to physical mappings for the page have to be updated). Since the network driver uses only the physical addresses of the packet fragments, and not their virtual addresses, it is safe to pass an unmapped page fragment to the driver.

The 'header' channel for transferring the packet header is implemented using a separate set of shared pages between the guest and driver domain, for each virtual interface of the guest domain. With this mechanism, the cost of two page remaps per transmit operation is replaced, in the common case, by the cost of copying a small header.

We note that this mechanism requires the physical network interface to support gather DMA, since the transmitted packet consists of a packet header and unmapped fragments. If the NIC does not support gather DMA, the entire packet needs to be mapped in before it can be copied out to a contiguous memory location. This check is performed by the software offload driver (described in the previous section), which performs the remap operation if necessary.
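The test that the backend applies to the copied header is essentially a destination-MAC check. The following sketch isolates that check over a plain Ethernet header; the driver-domain MAC address, the structure layout, and the function must_map_full_packet() are illustrative assumptions, and the real backend works on Xen ring descriptors and remap operations rather than on a flat byte array.

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>
    #include <stdio.h>

    /* 14-byte Ethernet header, as carried over the out-of-band header channel. */
    struct eth_hdr {
        uint8_t dst[6];
        uint8_t src[6];
        uint8_t ethertype[2];
    };

    /* Illustrative: MAC address of the driver domain's own interface. */
    static const uint8_t driver_mac[6] = { 0x00, 0x16, 0x3e, 0x00, 0x00, 0x01 };

    static bool is_broadcast(const uint8_t mac[6])
    {
        static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
        return memcmp(mac, bcast, 6) == 0;
    }

    /*
     * Transmit-path policy from section 4.1: the full packet only has to be
     * mapped into the driver domain when it is addressed to the driver domain
     * itself or to the broadcast address; otherwise the bridge forwards it
     * using just the copied header plus the unmapped page fragments.
     */
    static bool must_map_full_packet(const struct eth_hdr *h)
    {
        return is_broadcast(h->dst) || memcmp(h->dst, driver_mac, 6) == 0;
    }

    int main(void)
    {
        struct eth_hdr to_peer  = { { 0x00, 0x16, 0x3e, 0x00, 0x00, 0x99 } };
        struct eth_hdr to_bcast = { { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff } };

        printf("unicast to another host: map full packet? %s\n",
               must_map_full_packet(&to_peer) ? "yes" : "no");
        printf("broadcast:               map full packet? %s\n",
               must_map_full_packet(&to_bcast) ? "yes" : "no");
        return 0;
    }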
4.2 Receive Path Optimization

The Xen I/O channel uses page remapping on the receive path to avoid the cost of an extra data copy. Previous research [13] has also shown that avoiding data copy in network operations significantly improves system performance.

However, we note that the I/O channel mechanism can incur significant overhead if the size of the packet transferred is very small, for example a few hundred bytes. In this case, it may not be worthwhile to avoid the data copy.

Further, data transfer by page transfer from the driver domain to the guest domain incurs some additional overheads. Firstly, each network packet has to be allocated on a separate page, so that it can be remapped. Additionally, the driver domain has to ensure that there is no potential leakage of information by the remapping of a page from the driver to a guest domain. The driver domain ensures this by zeroing out, at initialization, all pages which can potentially be transferred out to guest domains. (The zeroing is done whenever new pages are added to the memory allocator in the driver domain used for allocating network socket buffers.)

We investigate the alternate mechanism, in which packets are transferred to the guest domain by data copy. As our evaluation shows, packet transfer by data copy instead of page remapping in fact results in a small improvement in the receive performance in the guest domain. In addition, using copying instead of remapping allows us to use regular MTU sized buffers for network packets, which avoids the overheads incurred from the need to zero out the pages in the driver domain.

The data copy in the I/O channel is implemented using a set of shared pages between the guest and driver domains, which is established at setup time. A single set of pages is shared for all the virtual network interfaces in the guest domain (in order to restrict the copying overhead for a large working set size). Only one extra data copy is involved, from the driver domain's network packets to the shared memory pool.

Sharing memory pages between the guest and the driver domain (for both the receive path, and for the header channel in the transmit path) does not introduce any new vulnerability for the driver domain. The guest domain cannot crash the driver domain by doing invalid writes to the shared memory pool since, on the receive path, the driver domain does not read from the shared pool, and on the transmit path, it would need to read the packet headers from the guest domain even in the original I/O channel implementation.
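The copy-based receive path can be modeled in a few lines of code. The sketch below is a user-space stand-in only: the slot ring, names and sizes are assumptions, and the real implementation reuses the existing I/O channel descriptors and event-channel notifications rather than the simple ring shown here. What it is meant to show is the single copy from the driver domain's packet into a pool of pages shared with the guest, with MTU-sized slots sufficing because no page is ever remapped.

    #include <stdio.h>
    #include <string.h>
    #include <stddef.h>

    /* Receive by data copy (section 4.2), modeled in user space. */
    enum { SLOT_SIZE = 1514, NUM_SLOTS = 64 };

    struct shared_pool {                    /* pages shared at setup time       */
        unsigned char slot[NUM_SLOTS][SLOT_SIZE];
        size_t        len[NUM_SLOTS];
        unsigned      head, tail;           /* driver produces, guest consumes  */
    };

    /* Driver-domain side: copy one received packet into the shared pool. */
    static int pool_deliver(struct shared_pool *p, const void *pkt, size_t len)
    {
        if (len > SLOT_SIZE || p->head - p->tail == NUM_SLOTS)
            return -1;                      /* oversized packet or full ring    */
        unsigned i = p->head % NUM_SLOTS;
        memcpy(p->slot[i], pkt, len);       /* the single extra data copy       */
        p->len[i] = len;
        p->head++;                          /* real code would notify the guest */
        return 0;
    }

    /* Guest side: consume the next packet, if any. */
    static size_t pool_receive(struct shared_pool *p, void *buf)
    {
        if (p->tail == p->head)
            return 0;
        unsigned i = p->tail % NUM_SLOTS;
        memcpy(buf, p->slot[i], p->len[i]);
        p->tail++;
        return p->len[i];
    }

    int main(void)
    {
        static struct shared_pool pool;
        unsigned char rx[SLOT_SIZE];
        pool_deliver(&pool, "hello, guest", 13);
        size_t n = pool_receive(&pool, rx);
        printf("guest received %zu bytes: %s\n", n, rx);
        return 0;
    }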
5 Virtual Memory Optimizations

It was noted in a previous study [10] that guest operating systems running on Xen (both driver and guest domains) incurred a significantly higher number of TLB misses for network workloads (more than an order of magnitude higher) relative to the TLB misses in native Linux execution. It was conjectured that this was due to the increase in working set size when running on Xen.

We show that it is the absence of support for certain virtual memory primitives, such as superpage mappings and global page table mappings, that leads to a marked increase in TLB miss rate for guest OSes running on Xen.

5.1 Virtual Memory features

Superpage and global page mappings are features introduced in the Intel x86 processor series starting with the Pentium and Pentium Pro processors respectively.

A superpage mapping allows the operating system to specify the virtual address translation for a large set of pages, instead of at the granularity of individual pages, as with regular paging. A superpage page table entry provides address translation from a set of contiguous virtual address pages to a set of contiguous physical address pages. On the x86 platform, one superpage entry covers 1024 pages of physical memory, which greatly increases the virtual memory coverage in the TLB and thus greatly reduces the capacity misses incurred in the TLB. Many operating systems use superpages to improve their overall TLB performance: Linux uses superpages to map the 'linear' (lowmem) part of the kernel address space; FreeBSD supports superpage mappings for both user-space and kernel-space translations [12].

The support for global page table mappings in the processor allows certain page table entries to be marked 'global', which are then kept persistent in the TLB across TLB flushes (for example, on a context switch). Linux uses global page mappings to map the kernel part of the address space of each process, since this part is common between all processes. By doing so, the kernel address mappings are not flushed from the TLB when the processor switches from one process to another.

5.2 Issues in supporting VM features

5.2.1 Superpage Mappings

A superpage mapping maps a contiguous virtual address range to a contiguous physical address range. Thus, in order to use a superpage mapping, the guest OS must be able to determine the physical contiguity of its page frames within a superpage block. This is not possible in a fully virtualized system like VMware ESX server [16], where the guest OS uses contiguous 'pseudo-physical' addresses, which are transparently mapped to discontiguous physical addresses.

A second issue with the use of superpages is the fact that all page frames within a superpage block must have identical memory protection permissions. This can be problematic in a VM environment because the VMM may want to set special protection bits for certain page frames. As an example, page frames containing page tables of the guest OS must be set read-only so that the guest OS cannot modify them without notifying the VMM. Similarly, the GDT and LDT pages on the x86 architecture must be set read-only.

These special permission pages interfere with the use of superpages. A superpage block containing such special pages must use regular granularity paging to support different permissions for different constituent pages.

5.2.2 Global Mappings

Xen does not allow guest OSes to use global mappings, since it needs to fully flush the TLB when switching between domains. Xen itself uses global page mappings to map its address range.

5.3 Support for Virtual Memory primitives in Xen

5.3.1 Superpage Mappings

In the Xen VMM, supporting superpages for guest OSes is simplified by the use of the paravirtualization approach. In the Xen approach, the guest OS is already aware of the physical layout of its memory pages. The Xen VMM provides the guest OS with a pseudo-physical to physical page translation table, which can be used by the guest OS to determine the physical contiguity of pages.

To allow the use of superpage mappings in the guest OS, we modify the Xen VMM and the guest OS to cooperate with each other over the allocation of physical memory and the use of superpage mappings.

The VMM is modified so that, for each guest OS, it tries to give the OS an initial memory allocation such that page frames within a superpage block are also physically contiguous (basically, the VMM tries to allocate memory to the guest OS in chunks of superpage size, i.e., 4 MB). Since this is not always possible, the VMM does not guarantee that the page allocation for each superpage range is physically contiguous. The guest OS is modified so that it uses superpage mappings for a virtual address range only if it determines that the underlying set of physical pages is also contiguous.
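The contiguity test that the modified guest performs before installing a superpage mapping is easy to state in code. In the sketch below the p2m[] array stands in for the pseudo-physical to physical translation table that Xen exposes to paravirtualized guests; the alignment check and the function itself are an illustration of the rule described above, not the XenoLinux implementation.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    enum { PAGES_PER_SUPERPAGE = 1024 };   /* 4 MB superpage of 4 KB pages (x86) */

    /* Stand-in for the pseudo-physical to physical (machine) translation
     * table the Xen VMM hands to a paravirtualized guest, indexed by
     * pseudo-physical frame number. */
    static uint32_t p2m[2 * PAGES_PER_SUPERPAGE];

    /*
     * A virtual range may be mapped with a superpage only if the machine
     * frames backing its pseudo-physical frames are contiguous (and start
     * on a 4 MB machine boundary, an x86 requirement for 4 MB mappings).
     * Otherwise the guest must fall back to regular 4 KB mappings.
     */
    static bool can_use_superpage(uint32_t first_pfn)
    {
        uint32_t base = p2m[first_pfn];
        if (base % PAGES_PER_SUPERPAGE != 0)
            return false;
        for (uint32_t i = 1; i < PAGES_PER_SUPERPAGE; i++)
            if (p2m[first_pfn + i] != base + i)   /* hole or reordering */
                return false;
        return true;
    }

    int main(void)
    {
        /* Block 0: the VMM handed out 4 MB of contiguous machine memory. */
        for (uint32_t i = 0; i < PAGES_PER_SUPERPAGE; i++)
            p2m[i] = 0x40000 + i;
        /* Block 1: same amount of memory, but two frames swapped. */
        for (uint32_t i = 0; i < PAGES_PER_SUPERPAGE; i++)
            p2m[PAGES_PER_SUPERPAGE + i] = 0x80000 + i;
        p2m[PAGES_PER_SUPERPAGE + 7] = 0x80008;
        p2m[PAGES_PER_SUPERPAGE + 8] = 0x80007;

        printf("block 0 superpage-eligible: %s\n",
               can_use_superpage(0) ? "yes" : "no");
        printf("block 1 superpage-eligible: %s\n",
               can_use_superpage(PAGES_PER_SUPERPAGE) ? "yes" : "no");
        return 0;
    }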
As noted above in section 5.2.1, the use of pages with restricted permissions, such as page table (PT) pages, prevents the guest OS from using a superpage to cover the physical pages in that address range.

As new processes are created in the system, the OS frequently needs to allocate (read-only) pages for the process's page tables. Each such allocation of a read-only page potentially forces the OS to convert the superpage mapping covering that page to a two-level regular page mapping. With the proliferation of such read-only pages over time, the OS would end up using regular paging to address much of its address space.

The basic problem here is that the PT page frames for new processes are allocated randomly from all over memory, without any locality, thus breaking multiple superpage mappings. We solve this problem by using a special memory allocator in the guest OS for allocating page frames with restricted permissions. This allocator tries to group together all memory pages with restricted permissions into a contiguous range within a superpage.

When the guest OS allocates a PT frame from this allocator, the allocator reserves the entire superpage containing this PT frame for future use. It then marks the entire superpage as read-only, and reserves the pages of this superpage for read-only use. On subsequent requests for PT frames from the guest OS, the allocator returns pages from the reserved set of pages. PT page frames freed by the guest OS are returned to this reserved pool. Thus, this mechanism collects all pages with the same permission into a different superpage, and avoids the breakup of superpages into regular pages.

Certain read-only pages in the Linux guest OS, such as the boot time GDT, LDT and initial page table (init_mm), are currently allocated at static locations in the OS binary. In order to allow the use of superpages over the entire kernel address range, the guest OS is modified to relocate these read-only pages to within a read-only superpage allocated by the special allocator.
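The special allocator amounts to a small reservation scheme, sketched below: the first request for a restricted-permission frame claims an entire superpage-sized block, marks it read-only, and later requests (and frees) are served from that block, so read-only pages never puncture other superpages. The helper functions and the pool layout are illustrative placeholders; the real allocator obtains the block from the guest's page allocator and must update the page tables through Xen.

    #include <stdio.h>
    #include <stddef.h>

    /* Sketch of the restricted-permission page allocator of section 5.3.1. */
    enum { PAGES_PER_SUPERPAGE = 1024 };

    struct ro_pool {
        unsigned long base_pfn;     /* first frame of the reserved superpage, 0 = none */
        unsigned      next;         /* next unused frame within that superpage         */
        unsigned long free_list[PAGES_PER_SUPERPAGE];
        unsigned      nfree;        /* frames returned by the guest OS                 */
    };

    /* Illustrative stand-ins for the guest OS's normal allocator and for
     * marking a whole superpage read-only in the page tables. */
    static unsigned long alloc_superpage_block(void) { return 0x12345UL * 1024; }
    static void mark_superpage_readonly(unsigned long base_pfn) { (void)base_pfn; }

    static unsigned long alloc_pt_frame(struct ro_pool *p)
    {
        if (p->nfree)                                /* reuse a frame freed earlier */
            return p->free_list[--p->nfree];
        if (p->base_pfn == 0 || p->next == PAGES_PER_SUPERPAGE) {
            p->base_pfn = alloc_superpage_block();   /* claim a whole 4 MB block    */
            mark_superpage_readonly(p->base_pfn);
            p->next = 0;
        }
        return p->base_pfn + p->next++;
    }

    static void free_pt_frame(struct ro_pool *p, unsigned long pfn)
    {
        p->free_list[p->nfree++] = pfn;              /* stays reserved for read-only use */
    }

    int main(void)
    {
        struct ro_pool pool = { 0 };
        unsigned long a = alloc_pt_frame(&pool);
        unsigned long b = alloc_pt_frame(&pool);
        free_pt_frame(&pool, a);
        unsigned long c = alloc_pt_frame(&pool);     /* comes back from the free list */
        printf("pt frames: %#lx %#lx %#lx (c == a: %s)\n",
               a, b, c, c == a ? "yes" : "no");
        return 0;
    }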
‘Xen-guest’ toaguest,thisinvalidatestheuseofsuperpagesforthat refers to the unoptimized, existing Xen guest domain. addressrange. ‘Xen-guest-opt’ refers to the optimized version of the guestdomain. A possible solution is to force memory alloca- tion/deallocation to be in units of superpage size for Figure 3 compares the transmit throughput achieved coarse grained ballooning operations, and to invalidate undertheabove5con(cid:2)gurations.Figure4showsreceive superpagemappingsonlyfor(cid:2)negrainedballooningop- performanceinthedifferentcon(cid:2)gurations. erations. This functionality is not implemented in the currentprototype. 5000 Transmit Throughput s) 4000 b/ M 6 Evaluation ut ( 3000 p gh 2000 The optimizations described in the previous sections ou have been implemented in Xen version 2.0.6, running Thr 1000 Linuxguestoperatingsystemsversion2.6.11. 0 6.1 ExperimentalSetup Linux Xen-driverXen-driver-Xoepnt-guestXen-guest-opt Weusetwomicro-benchmarks,atransmitandareceive benchmark, to evaluate the networking performance of guest and driver domains. These benchmarks are simi- Figure3:Transmitthroughputindifferentcon(cid:2)gurations lar to the netperf [1] TCP streaming benchmark, which measuresthemaximumTCPstreamingthroughputover For the transmit benchmark, the performance of the asingleTCPconnection. Ourbenchmarkismodi(cid:2)edto Linux, Xen-driver and Xen-driver-opt con(cid:2)gurations is usethezero-copysend(cid:2)lesystemcallfortransmitoper- limitedbythenetworkinterfacebandwidth,anddoesnot ations. fullysaturatetheCPU.All threecon(cid:2)gurations achieve The ‘server’ system for running the benchmark is a an aggregate link speed throughput of 3760 Mb/s. The DellPowerEdge1600SC,2.4GHzIntelXeonmachine. CPUutilizationvalues forsaturatingthe networkinthe ThismachinehasfourIntelPro-1000GigabitNICs.The three con(cid:2)gurations are, respectively, 40%, 46% and ‘clients’forrunninganexperimentconsistofIntelXeon 43%. 3500 4000 3000 Receive Throughput s) 3500 Transmit Throughput hroughput (Mb/s) 1122050500000000 Throughput (Mb/ 11223 505050000000000000 T 500 0 0 Linux Xen-driverXen-driver-Xoepnt-guestXen-guest-opt Guest-higGh-uioecs-t-shpigGh-uioecst-higGhuest-iocGuest-spGuest-none Figure4: Receivethroughputindifferentcon(cid:2)gurations Figure 5: Contribution of individual transmit optimiza- tions The optimized guest domain con(cid:2)guration, Xen- con(cid:2)guration to yield 3230 Mb/s (con(cid:2)guration Guest- guest-opt improves on the performance of the unopti- high-ioc).Thesuperpageoptimizationimprovesthisfur- mized Xen guest by a factor of 4.4, increasing trans- thertoachieve3310Mb/s(con(cid:2)gurationGuest-high-ioc- mit throughput from 750 Mb/s to 3310 Mb/s. How- sp). The impact of these two optimizations by them- ever, Xen-guest-opt gets CPU saturatedat this through- selvesintheabsenceofthehighlevelinterfaceoptimiza- put,whereastheLinuxcon(cid:2)gurationreachesaCPUuti- tionisinsigni(cid:2)cant. lizationofroughly40%tosaturate4NICs. For the receive benchmark, the unoptimized Xen driver domain con(cid:2)guration, Xen-driver, achieves a 6.3.1 HighlevelInterface throughputof1738Mb/s,whichisonly69% ofthena- Weexplaintheimprovedperformanceresultingfromthe tiveLinuxthroughput,2508Mb/s. TheoptimizedXen- useofahighlevelinterfacein(cid:2)gure6. The(cid:2)gurecom- driver version, Xen-driver-opt, improves upon this per- parestheexecutionoverhead(inmillionsofCPUcycles formance by 35%, and achieves a throughput of 2343 persecond)oftheguestdomain, driverdomainandthe Mb/s. 
5.4 Outstanding issues

There remain a few aspects of virtualization which are difficult to reconcile with the use of superpages in the guest OS. We briefly mention them here.

5.4.1 Transparent page sharing

Transparent page sharing between virtual machines is an effective mechanism to reduce the memory usage in the system when there are a large number of VMs [16]. Page sharing uses address translation from pseudo-physical to physical addresses to transparently share pseudo-physical pages which have the same content.

Page sharing potentially breaks the contiguity of physical page frames. With the use of superpages, either the entire superpage must be shared between the VMs, or no page within the superpage can be shared. This can significantly reduce the scope for memory sharing between VMs.

Although Xen does not make use of page sharing currently, this is a potentially important issue for superpages.

5.4.2 Ballooning driver

The ballooning driver [16] is a mechanism by which the VMM can efficiently vary the memory allocation of guest OSes. Since the use of a ballooning driver potentially breaks the contiguity of physical pages allocated to a guest, this invalidates the use of superpages for that address range.

A possible solution is to force memory allocation/deallocation to be in units of superpage size for coarse grained ballooning operations, and to invalidate superpage mappings only for fine grained ballooning operations. This functionality is not implemented in the current prototype.

6 Evaluation

The optimizations described in the previous sections have been implemented in Xen version 2.0.6, running Linux guest operating systems version 2.6.11.

6.1 Experimental Setup

We use two micro-benchmarks, a transmit and a receive benchmark, to evaluate the networking performance of guest and driver domains. These benchmarks are similar to the netperf [1] TCP streaming benchmark, which measures the maximum TCP streaming throughput over a single TCP connection. Our benchmark is modified to use the zero-copy sendfile system call for transmit operations.

The 'server' system for running the benchmark is a Dell PowerEdge 1600SC, 2.4 GHz Intel Xeon machine. This machine has four Intel Pro-1000 Gigabit NICs. The 'clients' for running an experiment consist of Intel Xeon machines with a similar CPU configuration, and having one Intel Pro-1000 Gigabit NIC per machine. All the NICs have support for TSO, scatter/gather I/O and checksum offload. The clients and server machines are connected over a Gigabit switch.

The experiments measure the maximum throughput achievable with the benchmark (either transmitter or receiver) running on the server machine. The server is connected to each client machine over a different network interface, and uses one TCP connection per client. We use as many clients as required to saturate the server CPU, and measure the throughput under different Xen and Linux configurations. All the profiling results presented in this section are obtained using the Xenoprof system-wide profiler in Xen [10].
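For concreteness, the transmit side of such a benchmark boils down to a loop around the sendfile() system call. The program below is a minimal illustrative stand-in, not the authors' benchmark: the client address, port, payload file and iteration count are placeholders, and the real benchmark additionally measures achieved throughput and drives one connection per client until the server CPU saturates.

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>

    /* Minimal netperf-like TCP transmitter using zero-copy sendfile(). */
    int main(void)
    {
        const char *host = "192.168.1.2";        /* placeholder client address */
        int fd = open("/tmp/payload", O_RDONLY); /* placeholder file to stream */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(5001);
        inet_pton(AF_INET, host, &addr.sin_addr);
        if (connect(s, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("connect");
            return 1;
        }

        for (int i = 0; i < 1000; i++) {         /* stream the file repeatedly */
            off_t off = 0;
            while (off < st.st_size) {
                ssize_t n = sendfile(s, fd, &off, st.st_size - off);
                if (n <= 0) { perror("sendfile"); return 1; }
            }
        }
        close(s);
        close(fd);
        return 0;
    }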
millio 1000 Guest domain n ( utio 800 c e 4000 ex 600 Transmit throughput of 3500 s e 400 b/s) 3000 cycl put (M 22050000 CPU 200 ough 1500 0 Guest-high Guest-high-ioc Thr 1000 Guest Configuration 500 0 Figure8: I/Ochanneloptimizationbene(cid:2)t TSO+SG+CSsGu+mCsumCsum None Guest-none 6.3.3 Superpagemappings Figure5shows that superpage mappings improve guest Figure7: Advantageofhigher-levelinterface domainperformance slightly, by 2.5%. Figure 9 shows dataandinstructionTLBmissesincurredforthetransmit benchmark for three sets of virtual memory optimiza- Even in the case when the physical NIC supports tions. ‘Base’ refers to a con(cid:2)guration which uses the no of(cid:3)oad feature (bar labeled ‘None’), the guest do- high-level interface and I/O channel optimizations, but mainwithahighlevelinterfaceperformsnearlytwiceas withregular4Ksizedpages.‘SP-GP’istheBasecon(cid:2)g- wellastheguestusingthedefaultinterface(barlabeled urationwiththesuperpageandglobalpageoptimizations Guest-none),viz. 1461Mb/svs.750Mb/s. in addition. ‘SP’ is the Base con(cid:2)guration with super- pagemappings. Thus, even in the absence of of(cid:3)oad features in the The TLB misses are grouped into three categories: NIC,byperformingtheof(cid:3)oadcomputationjustbefore data TLB misses (D-TLB), instruction TLB misses in- transmittingitovertheNIC(intheof(cid:3)oaddriver)instead curred in the guest and driver domains (Guest OS I- of performing it in the guest domain, we signi(cid:2)cantly TLB), and instruction TLB misses incurred in the Xen reducetheoverheadsincurredintheI/Ochannelandthe VMM(XenI-TLB). networkbridgeonthepacket’stransmitpath. Theuseofsuperpagesaloneissuf(cid:2)cienttobringdown Theperformanceoftheguestdomainwithjustcheck- thedataTLBmissesbyafactorof3.8.Theuseofglobal sum of(cid:3)oad support in the NIC is comparable to the mappingsdoesnothaveasigni(cid:2)cantimpactondataTLB performance without any of(cid:3)oad support. This is be- misses (con(cid:2)gurations SP-GP and SP), since frequent cause, in the absence of support for scatter/gather I/O, switchesbetweentheguestanddriverdomaincancelout data copying both with or without checksum computa- anybene(cid:2)tsofusingglobalpages. tionincureffectivelythesamecost. Withscatter/gather The use of global mappings, however, does have a I/Osupport,transmitthroughputincreasesto2007Mb/s. negative impact on the instruction TLB misses in the