Optimizing Network Virtualization in Xen

Aravind Menon (EPFL, Switzerland)    Alan L. Cox (Rice University, Houston)    Willy Zwaenepoel (EPFL, Switzerland)

Abstract

In this paper, we propose and evaluate three techniques for optimizing network performance in the Xen virtualized environment. Our techniques retain the basic Xen architecture of locating device drivers in a privileged 'driver' domain with access to I/O devices, and providing network access to unprivileged 'guest' domains through virtualized network interfaces.

First, we redefine the virtual network interfaces of guest domains to incorporate high-level network offload features available in most modern network cards. We demonstrate the performance benefits of high-level offload functionality in the virtual interface, even when such functionality is not supported in the underlying physical interface. Second, we optimize the implementation of the data transfer path between guest and driver domains. The optimization avoids expensive data remapping operations on the transmit path, and replaces page remapping by data copying on the receive path. Finally, we provide support for guest operating systems to effectively utilize advanced virtual memory features such as superpages and global page mappings.

The overall impact of these optimizations is an improvement in transmit performance of guest domains by a factor of 4.4. The receive performance of the driver domain is improved by 35% and reaches within 7% of native Linux performance. The receive performance in guest domains improves by 18%, but still trails the native Linux performance by 61%. We analyse the performance improvements in detail, and quantify the contribution of each optimization to the overall performance.

1 Introduction

In recent years, there has been a trend towards running network-intensive applications, such as Internet servers, in virtual machine (VM) environments, where multiple VMs running on the same machine share the machine's network resources. In such an environment, the virtual machine monitor (VMM) virtualizes the machine's network I/O devices to allow multiple operating systems running in different VMs to access the network concurrently.

Despite the advances in virtualization technology [15, 4], the overhead of network I/O virtualization can still significantly affect the performance of network-intensive applications. For instance, Sugerman et al. [15] report that the CPU utilization required to saturate a 100 Mbps network under Linux 2.2.17 running on VMware Workstation 2.0 was 5 to 6 times higher than the utilization under native Linux 2.2.17. Even in the paravirtualized Xen 2.0 VMM [4], Menon et al. [10] report significantly lower network performance under a Linux 2.6.10 guest domain compared to native Linux: a degradation by a factor of 2 to 3 for receive workloads, and by a factor of 5 for transmit workloads. The latter study, unlike previously reported results on Xen [4, 5], reports network performance under configurations leading to full CPU saturation.

In this paper, we propose and evaluate a number of optimizations for improving the networking performance under the Xen 2.0 VMM. These optimizations address many of the performance limitations identified by Menon et al. [10]. Starting with version 2.0, the Xen VMM adopted a network I/O architecture that is similar to the hosted virtual machine model [5]. In this architecture, a physical network interface is owned by a special, privileged VM called a driver domain that executes the native Linux network interface driver. In contrast, an ordinary VM called a guest domain is given access to a virtualized network interface. The virtualized network interface has a front-end device in the guest domain and a back-end device in the corresponding driver domain.
The front-end and back-end devices transfer network packets between their domains over an I/O channel provided by the Xen VMM. Within the driver domain, either Ethernet bridging or IP routing is used to demultiplex incoming network packets and to multiplex outgoing network packets between the physical network interface and the guest domain, through its corresponding back-end device.

Our optimizations fall into the following three categories:

1. We add three capabilities to the virtualized network interface: scatter/gather I/O, TCP/IP checksum offload, and TCP segmentation offload (TSO). Scatter/gather I/O and checksum offload improve performance in the guest domain. Scatter/gather I/O eliminates the need for data copying by the Linux implementation of sendfile(). TSO improves performance throughout the system. In addition to its well-known effects on TCP performance [11, 9] benefiting the guest domain, it improves performance in the Xen VMM and driver domain by reducing the number of network packets that they must handle.

2. We introduce a faster I/O channel for transferring network packets between the guest and driver domains. The optimizations include transmit mechanisms that avoid a data remap or copy in the common case, and a receive mechanism optimized for small data receives.

3. We present VMM support mechanisms for allowing the use of efficient virtual memory primitives in guest operating systems. These mechanisms allow guest OSes to make use of superpage and global page mappings on the Intel x86 architecture, which significantly reduce the number of TLB misses incurred by guest domains.

Overall, our optimizations improve the transmit performance in guest domains by a factor of 4.4. Receive-side network performance is improved by 35% for driver domains, and by 18% for guest domains. We also present a detailed breakdown of the performance benefits resulting from each individual optimization. Our evaluation demonstrates the performance benefits of TSO support in the virtual network interface even in the absence of TSO support in the physical network interface. In other words, emulating TSO in software in the driver domain results in higher network performance than performing TCP segmentation in the guest domain.

The outline for the rest of the paper is as follows. In section 2, we describe the Xen network I/O architecture and summarize its performance overheads as reported in previous research. In section 3, we describe the design of the new virtual interface architecture. In section 4, we describe our optimizations to the I/O channel. Section 5 describes the new virtual memory optimization mechanisms added to Xen. Section 6 presents an evaluation of the different optimizations described. In section 7, we discuss related work, and we conclude in section 8.
2 Background

The network I/O virtualization architecture in Xen can be a significant source of overhead for networking performance in guest domains [10]. In this section, we describe the overheads associated with different aspects of the Xen network I/O architecture, and their overall impact on guest network performance. To understand these overheads better, we first describe the network virtualization architecture used in Xen.

The Xen VMM uses an I/O architecture which is similar to the hosted VMM architecture [5]. Privileged domains, called 'driver' domains, use their native device drivers to access I/O devices directly, and perform I/O operations on behalf of other, unprivileged domains, called guest domains. Guest domains use virtual I/O devices, controlled by paravirtualized drivers, to request the driver domain for device access.

The network architecture used in Xen is shown in figure 1. Xen provides each guest domain with a number of virtual network interfaces, which are used by the guest domain for all its network communications. Corresponding to each virtual interface in a guest domain, a 'backend' interface is created in the driver domain, which acts as the proxy for that virtual interface in the driver domain. The virtual and backend interfaces are 'connected' to each other over an 'I/O channel'. The I/O channel implements a zero-copy data transfer mechanism for exchanging packets between the virtual and backend interfaces by remapping the physical page containing the packet into the target domain.

[Figure 1: Xen Network I/O Architecture — the guest domain's virtual interface is connected over the I/O channel to a backend interface in the driver domain, where a bridge and the native NIC driver connect it to the physical NIC; both domains run above the Xen VMM.]

All the backend interfaces in the driver domain (corresponding to the virtual interfaces) are connected to the physical NIC and to each other through a virtual network bridge. The combination of the I/O channel and network bridge thus establishes a communication path between the guest's virtual interfaces and the physical interface.

Table 1 compares the transmit and receive bandwidth achieved in a Linux guest domain using this virtual interface architecture with the performance achieved when the benchmark runs in the driver domain and in native Linux. These results are obtained using a netperf-like [1] benchmark which used the zero-copy sendfile() for transmit.

  Configuration   Receive (Mb/s)   Transmit (Mb/s)
  Linux           2508 (100%)      3760 (100%)
  Xen driver      1738 (69.3%)     3760 (100%)
  Xen guest        820 (32.7%)      750 (19.9%)

  Table 1: Network performance under Xen

The driver domain configuration shows performance comparable to native Linux for the transmit case and a degradation of 30% for the receive case. However, this configuration uses native device drivers to directly access the network device, and thus the virtualization overhead is limited to some low-level functions such as interrupts. In contrast, in the guest domain configuration, which uses virtualized network interfaces, the impact of network I/O virtualization is much more pronounced. The receive performance in guest domains suffers a degradation of 67% relative to native Linux, and the transmit performance achieves only 20% of the throughput achievable under native Linux.

Menon et al. [10] report similar results. Moreover, they introduce Xenoprof, a variant of Oprofile [2] for Xen, and apply it to break down the performance of a similar benchmark. Their study made the following observations about the overheads associated with different aspects of the Xen network virtualization architecture.

Since each packet transmitted or received on a guest domain's virtual interface had to pass through the I/O channel and the network bridge, a significant fraction of the network processing time was spent in the Xen VMM and the driver domain. For instance, 70% of the execution time for receiving a packet in the guest domain was spent in transferring the packet through the driver domain and the Xen VMM from the physical interface. Similarly, 60% of the processing time for a transmit operation was spent in transferring the packet from the guest's virtual interface to the physical interface. The breakdown of this processing overhead was roughly 40% Xen and 30% driver domain for receive traffic, and 30% Xen and 30% driver domain for transmit traffic.

In general, both the guest domains and the driver domain were seen to suffer from a significantly higher TLB miss rate compared to execution in native Linux. Additionally, guest domains were seen to suffer from much higher L2 cache misses compared to native Linux.
3 Virtual Interface Optimizations

Network I/O is supported in guest domains by providing each guest domain with a set of virtual network interfaces, which are multiplexed onto the physical interfaces using the mechanisms described in section 2. The virtual network interface provides the abstraction of a simple, low-level network interface to the guest domain, which uses paravirtualized drivers to perform I/O on this interface. The network I/O operations supported on the virtual interface consist of simple network transmit and receive operations, which are easily mappable onto corresponding operations on the physical NIC.

Choosing a simple, low-level interface for virtual network interfaces allows the virtualized interface to be easily supported across the large number of physical interfaces available, each with different network processing capabilities. However, this also prevents the virtual interface from taking advantage of the network offload capabilities of the physical NIC, such as checksum offload, scatter/gather DMA support, and TCP segmentation offload (TSO).

3.1 Virtual Interface Architecture

We propose a new virtual interface architecture in which the virtual network interface always supports a fixed set of high-level network offload features, irrespective of whether these features are supported in the physical network interfaces. The architecture makes use of the offload features of the physical NIC itself if they match the offload requirements of the virtual network interface. If the required features are not supported by the physical interface, the new interface architecture provides support for these features in software.

Figure 2 shows the top-level design of the virtual interface architecture. The virtual interface in the guest domain supports a set of high-level offload features, which are reported to the guest OS by the front-end driver controlling the interface. In our implementation, the features supported by the virtual interface are checksum offloading, scatter/gather I/O and TCP segmentation offloading.
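To make the front-end's role concrete, the sketch below shows how a front-end driver might probe the offload features advertised for its virtual interface and enable the corresponding Linux device capabilities. The VNIF_* flags, the vnif_features structure and vnif_query_features() are hypothetical illustrations rather than the actual Xen 2.0 front-end interface; only the NETIF_F_* device flags are standard Linux 2.6 definitions.

    #include <linux/netdevice.h>

    /* Hypothetical offload-capability flags advertised for a virtual
     * interface; names and the probing call are illustrative only. */
    #define VNIF_F_CSUM   0x1   /* TCP/IP checksum offload  */
    #define VNIF_F_SG     0x2   /* scatter/gather I/O       */
    #define VNIF_F_TSO    0x4   /* TCP segmentation offload */

    struct vnif_features {
        unsigned long flags;
    };

    /* Assumed helper: reads the feature flags published by the back-end
     * for this virtual interface. */
    extern int vnif_query_features(struct net_device *dev,
                                   struct vnif_features *feat);

    static void frontend_setup_offload(struct net_device *dev)
    {
        struct vnif_features feat;

        if (vnif_query_features(dev, &feat) < 0)
            return;                 /* fall back to a plain low-level interface */

        /* Map the advertised virtual-interface features onto the guest
         * kernel's device capabilities, so the stack builds large,
         * fragmented packets and defers checksum computation. */
        if (feat.flags & VNIF_F_CSUM)
            dev->features |= NETIF_F_IP_CSUM;
        if (feat.flags & VNIF_F_SG)
            dev->features |= NETIF_F_SG;
        if (feat.flags & VNIF_F_TSO)
            dev->features |= NETIF_F_TSO;
    }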
Supporting high-level features like scatter/gather I/O and TSO allows the guest domain to transmit network packets much bigger than the network MTU, which can consist of multiple fragments. These large, fragmented packets are transferred from the virtual to the physical interface over a modified I/O channel and network bridge (the I/O channel is modified to allow transfer of packets consisting of multiple fragments, and the network bridge is modified to support forwarding of packets larger than the MTU).

[Figure 2: New I/O Architecture — a high-level virtual interface in the guest domain connects over the I/O channel to its backend interface in the driver domain; a software offload driver sits between the bridge and the native NIC driver, above the physical NIC and the Xen VMM.]

The architecture introduces a new component, a 'software offload' driver, which sits above the network device driver in the driver domain and intercepts all packets to be transmitted on the network interface. When the guest domain's packet arrives at the physical interface, the offload driver determines whether the offload requirements of the packet are compatible with the capabilities of the NIC, and takes different actions accordingly.

If the NIC supports the offload features required by the packet, the offload driver simply forwards the packet to the network device driver for transmission.

In the absence of support from the physical NIC, the offload driver performs the necessary offload actions in software. Thus, if the NIC does not support TSO, the offload driver segments the large packet into MTU-sized packets before sending them for transmission. Similarly, it takes care of the other offload requirements, scatter/gather I/O and checksum offloading, if these are not supported by the NIC.
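The decision logic of the offload driver can be sketched as follows. This is a simplified illustration under our own assumptions: the NIC_CAP_* bits and the sw_* helpers are hypothetical placeholders for capability discovery and for the software segmentation, checksumming and linearization routines (which in practice would reuse the existing Linux implementations); the sk_buff fields follow the Linux 2.6 conventions of the time.

    #include <linux/skbuff.h>
    #include <linux/netdevice.h>

    /* Hypothetical capability bits for the physical NIC. */
    #define NIC_CAP_CSUM  0x1
    #define NIC_CAP_SG    0x2
    #define NIC_CAP_TSO   0x4

    /* Assumed helpers, declared only to make the control flow concrete. */
    extern unsigned nic_capabilities(struct net_device *nic);
    extern int sw_checksum(struct sk_buff *skb);          /* compute checksum in software */
    extern int sw_linearize(struct sk_buff *skb);         /* copy fragments into one buffer */
    extern int sw_segment_and_xmit(struct sk_buff *skb,   /* split into MTU-sized packets  */
                                   struct net_device *nic);

    /* Transmit hook of the software offload driver, sitting just above
     * the native NIC driver in the driver domain. */
    static int offload_driver_xmit(struct sk_buff *skb, struct net_device *nic)
    {
        unsigned caps = nic_capabilities(nic);

        /* Packet larger than the MTU: segment in software unless the
         * NIC can perform TSO itself. */
        if (skb->len > nic->mtu && !(caps & NIC_CAP_TSO))
            return sw_segment_and_xmit(skb, nic);

        /* Multi-fragment packet but no gather DMA: copy it out into a
         * single contiguous buffer first. */
        if (skb_shinfo(skb)->nr_frags && !(caps & NIC_CAP_SG))
            sw_linearize(skb);

        /* Checksum still pending and the NIC cannot compute it. */
        if (skb->ip_summed == CHECKSUM_HW && !(caps & NIC_CAP_CSUM))
            sw_checksum(skb);

        return nic->hard_start_xmit(skb, nic);  /* hand off to the native driver */
    }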
3.2 Advantage of a high-level interface

A high-level virtual network interface can reduce the network processing overhead in the guest domain by allowing it to offload some network processing to the physical NIC. Support for scatter/gather I/O is especially useful for doing zero-copy network transmits, such as sendfile in Linux. Gather I/O allows the OS to construct network packets consisting of multiple fragments directly from the file system buffers, without having to first copy them out to a contiguous location in memory. Support for TSO allows the OS to do transmit processing on network packets much larger than the MTU, thus reducing the per-byte processing overhead by requiring fewer packets to transmit the same amount of data.

Apart from its benefits for the guest domain, a significant advantage of a high-level interface comes from its impact on improving the efficiency of the network virtualization architecture in Xen.

As described in section 2, roughly 60% of the execution time for a transmit operation from a guest domain is spent in the VMM and driver domain for multiplexing the packet from the virtual to the physical network interface. Most of this overhead is a per-packet overhead, incurred in transferring the packet over the I/O channel and the network bridge, and in packet processing at the backend and physical network interfaces.

Using a high-level virtual interface improves the efficiency of the virtualization architecture by reducing the number of packet transfers required over the I/O channel and network bridge for transmitting the same amount of data (by using TSO), and thus reducing the per-byte virtualization overhead incurred by the driver domain and the VMM.

For instance, in the absence of support for TSO in the virtual interface, each 1500-byte (MTU-sized) packet transmitted by the guest domain requires one page remap operation over the I/O channel and one forwarding operation over the network bridge. In contrast, if the virtual interface supports TSO, the OS uses much bigger packets comprising multiple pages, and in this case each remap operation over the I/O channel can potentially transfer 4096 bytes of data. Similarly, with larger packets, far fewer packet forwarding operations are required over the network bridge.
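To make the savings concrete, consider a single large write handed to the virtual interface. The 60 KB (61440-byte) packet size below is our own illustrative choice, roughly the largest TSO packet the Linux stack would build, and not a figure taken from the measurements:

    \left\lceil \tfrac{61440\,\mathrm{B}}{1500\,\mathrm{B}} \right\rceil = 41
        \;\Rightarrow\; 41 \text{ remaps} + 41 \text{ bridge forwards (no TSO)}

    \left\lceil \tfrac{61440\,\mathrm{B}}{4096\,\mathrm{B}} \right\rceil = 15
        \;\Rightarrow\; 15 \text{ remaps} + 1 \text{ bridge forward (with TSO)}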
Supporting larger packets in the virtual network interface (using TSO) can thus significantly reduce the overheads incurred in network virtualization along the transmit path. It is for this reason that the software offload driver is situated in the driver domain at a point where the transmit packet has already covered much of the multiplexing path and is ready for transmission on the physical interface.

Thus, even in the case when the offload driver may have to perform network offloading for the packet in software, we expect the benefits of reduced virtualization overhead along the transmit path to show up as overall performance benefits. Our evaluation in section 6 shows that this is indeed the case.

4 I/O Channel Optimizations

Previous research [10] noted that 30-40% of the execution time for a network transmit or receive operation was spent in the Xen VMM, which included the time for page remapping and ownership transfers over the I/O channel, and switching overheads between the guest and driver domain.

The I/O channel implements a zero-copy page remapping mechanism for transferring packets between the guest and driver domain. The physical page containing the packet is remapped into the address space of the target domain. In the case of a receive operation, the ownership of the physical page itself which contains the packet is transferred from the driver to the guest domain. (Both these operations require each network packet to be allocated on a separate physical page.)

A study of the implementation of the I/O channel in Xen reveals that three address remaps and two memory allocation/deallocation operations are required for each packet receive operation, and two address remaps are required for each packet transmit operation. On the receive path, for each network packet (physical page) transferred from the driver to the guest domain, the guest domain releases ownership of a page to the hypervisor, and the driver domain acquires a replacement page from the hypervisor, to keep the overall memory allocation constant. Further, each page acquire or release operation requires a change in the virtual address mapping. Thus overall, for each receive operation on a virtual interface, two page allocation/freeing operations and three address remapping operations are required. For the transmit path, the ownership transfer operation is avoided, as the packet can be directly mapped into the privileged driver domain. Each transmit operation thus incurs two address remapping operations.

We now describe alternate mechanisms for packet transfer on the transmit path and the receive path.

4.1 Transmit Path Optimization

We observe that for network transmit operations, the driver domain does not need to map in the entire packet from the guest domain if the destination of the packet is any host other than the driver domain itself. In order to forward the packet over the network bridge (from the guest domain's backend interface to its target interface), the driver domain only needs to examine the MAC header of the packet. Thus, if the packet header can be supplied to the driver domain separately, the rest of the network packet does not need to be mapped in. The network packet needs to be mapped into the driver domain only when the destination of the packet is the driver domain itself, or when it is a broadcast packet.

We use this observation to avoid the page remapping operation over the I/O channel in the common transmit case. We augment the I/O channel with an out-of-band 'header' channel, which the guest domain uses to supply the header of the packet to be transmitted to the backend driver. The backend driver reads the header of the packet from this channel to determine if it needs to map in the entire packet (i.e., if the destination of the packet is the driver domain itself or the broadcast address). It then constructs a network packet from the packet header and the (possibly unmapped) packet fragments, and forwards this over the network bridge.

To ensure correct execution of the network driver (or the backend driver) for this packet, the backend ensures that the pseudo-physical to physical mappings of the packet fragments are set correctly (when a foreign page frame is mapped into the driver domain, the pseudo-physical to physical mappings for the page have to be updated). Since the network driver uses only the physical addresses of the packet fragments, and not their virtual addresses, it is safe to pass an unmapped page fragment to the driver.

The 'header' channel for transferring the packet header is implemented using a separate set of shared pages between the guest and driver domain, for each virtual interface of the guest domain. With this mechanism, the cost of two page remaps per transmit operation is replaced, in the common case, by the cost of copying a small header.

We note that this mechanism requires the physical network interface to support gather DMA, since the transmitted packet consists of a packet header and unmapped fragments. If the NIC does not support gather DMA, the entire packet needs to be mapped in before it can be copied out to a contiguous memory location. This check is performed by the software offload driver (described in the previous section), which performs the remap operation if necessary.
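A sketch of the resulting back-end transmit path is shown below. The descriptor layout and the helpers (dest_is_local_or_broadcast, map_guest_fragments, build_skb_from_desc, forward_over_bridge) are hypothetical names used only to illustrate the control flow; the key point is that the page remap happens only on the uncommon local-delivery and broadcast paths.

    #include <linux/if_ether.h>
    #include <linux/skbuff.h>

    struct frag_ref;    /* opaque reference to a guest page fragment */

    /* Hypothetical descriptor read from the out-of-band 'header' channel:
     * the copied packet header plus references to the (still unmapped)
     * guest page fragments. */
    struct tx_header_desc {
        unsigned char    header[ETH_HLEN + 64];  /* MAC header plus room for inspection */
        int              header_len;
        struct frag_ref *frags;                  /* guest frames, not yet mapped */
        int              nr_frags;
    };

    extern int  dest_is_local_or_broadcast(const unsigned char *dest_mac);
    extern void map_guest_fragments(struct tx_header_desc *d);   /* page remap (slow path) */
    extern struct sk_buff *build_skb_from_desc(struct tx_header_desc *d);
    extern void forward_over_bridge(struct sk_buff *skb);

    /* Back-end transmit path: in the common case only the copied header
     * is touched and no page remap is performed. */
    static void backend_transmit(struct tx_header_desc *d)
    {
        const struct ethhdr *eth = (const struct ethhdr *)d->header;

        /* Map the full packet only if the driver domain itself must read
         * the payload (local delivery or broadcast). */
        if (dest_is_local_or_broadcast(eth->h_dest))
            map_guest_fragments(d);

        forward_over_bridge(build_skb_from_desc(d));
    }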
4.2 Receive Path Optimization

The Xen I/O channel uses page remapping on the receive path to avoid the cost of an extra data copy. Previous research [13] has also shown that avoiding data copy in network operations significantly improves system performance.

However, we note that the I/O channel mechanism can incur significant overhead if the size of the packet transferred is very small, for example a few hundred bytes. In this case, it may not be worthwhile to avoid the data copy.

Further, data transfer by page transfer from the driver domain to the guest domain incurs some additional overheads. Firstly, each network packet has to be allocated on a separate page, so that it can be remapped. Additionally, the driver domain has to ensure that there is no potential leakage of information through the remapping of a page from the driver to a guest domain. The driver domain ensures this by zeroing out, at initialization, all pages which can potentially be transferred out to guest domains. (The zeroing is done whenever new pages are added to the memory allocator in the driver domain used for allocating network socket buffers.)

We investigate the alternate mechanism, in which packets are transferred to the guest domain by data copy. As our evaluation shows, packet transfer by data copy instead of page remapping in fact results in a small improvement in the receive performance in the guest domain. In addition, using copying instead of remapping allows us to use regular MTU-sized buffers for network packets, which avoids the overheads incurred from the need to zero out pages in the driver domain.

The data copy in the I/O channel is implemented using a set of shared pages between the guest and driver domains, which is established at setup time. A single set of pages is shared for all the virtual network interfaces in the guest domain (in order to restrict the copying overhead for a large working set size). Only one extra data copy is involved, from the driver domain's network packets to the shared memory pool.

Sharing memory pages between the guest and the driver domain (for both the receive path and for the header channel in the transmit path) does not introduce any new vulnerability for the driver domain. The guest domain cannot crash the driver domain by doing invalid writes to the shared memory pool since, on the receive path, the driver domain does not read from the shared pool, and on the transmit path, it would need to read the packet headers from the guest domain even in the original I/O channel implementation.
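A minimal sketch of the copy-based receive path follows, under the assumption of a simple circular pool of shared MTU-sized slots; the rx_pool structure and notify_guest_rx() are illustrative names, not the actual implementation.

    #include <linux/string.h>
    #include <linux/skbuff.h>

    /* Hypothetical per-guest shared receive pool, established at setup
     * time and shared across all of the guest's virtual interfaces. */
    struct rx_pool {
        void      *base;        /* start of the shared pages            */
        unsigned   slot_size;   /* MTU-sized receive buffer             */
        unsigned   nr_slots;
        unsigned   prod;        /* next slot the driver domain fills    */
    };

    extern void notify_guest_rx(struct rx_pool *pool, unsigned slot, unsigned len);

    /* Receive path in the driver domain: one copy into the shared pool
     * replaces the page flip (ownership transfer plus three remaps). */
    static int deliver_to_guest(struct rx_pool *pool, struct sk_buff *skb)
    {
        unsigned slot = pool->prod % pool->nr_slots;
        void *dst = (char *)pool->base + slot * pool->slot_size;

        if (skb->len > pool->slot_size)
            return -1;                     /* oversized; cannot happen with MTU buffers */

        memcpy(dst, skb->data, skb->len);  /* the single extra data copy */
        pool->prod++;
        notify_guest_rx(pool, slot, skb->len);
        return 0;
    }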
5 Virtual Memory Optimizations

It was noted in a previous study [10] that guest operating systems running on Xen (both driver and guest domains) incurred a significantly higher number of TLB misses for network workloads (more than an order of magnitude higher) relative to the TLB misses in native Linux execution. It was conjectured that this was due to the increase in working set size when running on Xen.

We show that it is the absence of support for certain virtual memory primitives, such as superpage mappings and global page table mappings, that leads to a marked increase in TLB miss rate for guest OSes running on Xen.

5.1 Virtual Memory features

Superpage and global page mappings are features introduced in the Intel x86 processor series starting with the Pentium and Pentium Pro processors, respectively.

A superpage mapping allows the operating system to specify the virtual address translation for a large set of pages, instead of at the granularity of individual pages, as with regular paging. A superpage page table entry provides address translation from a set of contiguous virtual address pages to a set of contiguous physical address pages. On the x86 platform, one superpage entry covers 1024 pages of physical memory, which greatly increases the virtual memory coverage in the TLB. Thus, this greatly reduces the capacity misses incurred in the TLB. Many operating systems use superpages to improve their overall TLB performance: Linux uses superpages to map the 'linear' (lowmem) part of the kernel address space; FreeBSD supports superpage mappings for both user-space and kernel-space translations [12].
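The difference in TLB coverage is easy to quantify. Assuming, purely for illustration, a 64-entry data TLB (actual sizes are processor-specific):

    64 \times 4\,\mathrm{KB} = 256\,\mathrm{KB} \quad\text{(regular pages)}
    \qquad\text{versus}\qquad
    64 \times 4\,\mathrm{MB} = 256\,\mathrm{MB} \quad\text{(superpages)}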
Our mapping. Withtheproliferationofsuchread-onlypages evaluation in section 6 shows that global mappings are overtime,theOSwouldendupusingregularpagingto beneficialonlywhenthereisasingledriverdomain,and addressmuchofitsaddressspace. notinthepresenceofguestdomains. The basic problem here is that the PT page frames We make one additional optimization to the domain for new processes are allocated randomly from all over switchingcodeintheVMM.Currently,theVMMflushes memory,withoutanylocality,thusbreakingmultiplesu- theTLBwheneveritswitchestoanon-idledomain. We perpage mappings. We solve this problem by using a notice that this incurs an unnecessary TLB flush when special memory allocator in the guest OS for allocating theVMMswitchesfromadomaindtotheidledomain page frames with restricted permissions. This allocator andthenbacktothedomaind. Wemakeamodification triestogrouptogether allmemorypageswithrestricted toavoidtheunnecessaryflushinthiscase. permissionsintoacontiguousrangewithinasuperpage. WhentheguestOSallocatesaPTframefromthisal- locator, the allocator reserves the entire superpage con- tainingthisPTframeforfutureuse.Itthenmarkstheen- 5.4 Outstanding issues tiresuperpageasread-only,andreservesthepagesofthis superpageforread-onlyuse. Onsubsequentrequestsfor There remain a few aspects of virtualization which are PTframesfromtheguestOS,theallocatorreturnspages difficult to reconcile with the use of superpages in the fromthereservedsetofpages. PTpageframesfreedby guestOS.Webrieflymentionthemhere. USENIX Association Annual Tech ’06: 2006 USENIX Annual Technical Conference 21 5.4.1 Transparentpagesharing machines with a similar CPU configuration, and hav- ing one Intel Pro-1000 Gigabit NIC per machine. All Transparent page sharing between virtual machines is the NICs have support for TSO, scatter/gather I/O and an effective mechanism to reduce the memory usage checksum offload. The clients and server machines are in the system when there are a large number of VMs connectedoveraGigabitswitch. [16]. Pagesharingusesaddresstranslationfrompseudo- The experiments measure the maximum throughput physical to physical addresses to transparently share achievablewiththebenchmark(eithertransmitterorre- pseudo-physicalpageswhichhavethesamecontent. ceiver) running on the server machine. The server is Pagesharingpotentiallybreaksthecontiguityofphys- connected to each client machine over a different net- icalpageframes. Withtheuseofsuperpages, eitherthe workinterface,andusesoneTCPconnectionperclient. entiresuperpagemustbesharedbetweentheVMs,orno Weuseasmanyclientsasrequiredtosaturatetheserver page within the superpage can be shared. This can sig- CPU, and measure the throughput under different Xen nificantlyreducethescopeformemorysharingbetween and Linux configurations. All the profiling results pre- VMs. sented in this section are obtained using the Xenoprof AlthoughXendoesnotmakeuseofpagesharingcur- systemwideprofilerinXen[10]. rently, this is a potentially important issue for super- pages. 6.2 Overall Results 5.4.2 Ballooningdriver Weevaluatethefollowingconfigurations: ‘Linux’refers The ballooning driver [16] is a mechanism by which to the baseline unmodified Linux version 2.6.11 run- theVMMcanefficientlyvarythe memoryallocationof ningnativemode. ‘Xen-driver’referstotheunoptimized guestOSes. Sincethe use of a ballooning driver poten- XenoLinuxdriverdomain. ‘Xen-driver-opt’referstothe tially breaks the contiguity of physical pages allocated Xen driver domain with our optimizations. 
‘Xen-guest’ toaguest,thisinvalidatestheuseofsuperpagesforthat refers to the unoptimized, existing Xen guest domain. addressrange. ‘Xen-guest-opt’ refers to the optimized version of the guestdomain. A possible solution is to force memory alloca- tion/deallocation to be in units of superpage size for Figure 3 compares the transmit throughput achieved coarse grained ballooning operations, and to invalidate undertheabove5configurations. Figure4showsreceive superpagemappingsonlyforfinegrainedballooningop- performanceinthedifferentconfigurations. erations. This functionality is not implemented in the currentprototype. 5000 Transmit Throughput s) 4000 b/ 6 Evaluation M ( 3000 ut p gh 2000 The optimizations described in the previous sections u o have been implemented in Xen version 2.0.6, running Thr 1000 Linuxguestoperatingsystemsversion2.6.11. 0 6.1 ExperimentalSetup Linux Xen-driverXen-driver-Xoepnt-guestXen-guest-opt Weusetwomicro-benchmarks,atransmitandareceive benchmark, to evaluate the networking performance of guest and driver domains. These benchmarks are simi- Figure3:Transmitthroughputindifferentconfigurations lar to the netperf [1] TCP streaming benchmark, which measuresthemaximumTCPstreamingthroughputover For the transmit benchmark, the performance of the asingleTCPconnection. Ourbenchmarkismodifiedto Linux, Xen-driver and Xen-driver-opt configurations is usethezero-copysendfilesystemcallfortransmitoper- limitedbythenetworkinterfacebandwidth,anddoesnot ations. fully saturate the CPU. All three configurations achieve The ‘server’ system for running the benchmark is a an aggregate link speed throughput of 3760 Mb/s. The DellPowerEdge1600SC,2.4GHzIntelXeonmachine. CPUutilization values for saturating the network in the ThismachinehasfourIntelPro-1000GigabitNICs.The three configurations are, respectively, 40%, 46% and ‘clients’forrunninganexperimentconsistofIntelXeon 43%. 22 Annual Tech ’06: 2006 USENIX Annual Technical Conference USENIX Association 3500 4000 3000 Receive Throughput s) 3500 Transmit Throughput Mb/s) 2500 (Mb/ 23500000 hroughput( 112050000000 Throughput 112505000000000 T 500 0 0 Linux Xen-driverXen-driver-Xoepnt-guestXen-guest-opt Guest-higGh-uioecs-t-shpigGh-uioecst-higGhuest-iocGuest-spGuest-none Figure4: Receivethroughputindifferentconfigurations Figure 5: Contribution of individual transmit optimiza- tions The optimized guest domain configuration, Xen- configuration to yield 3230 Mb/s (configuration Guest- guest-opt improves on the performance of the unopti- high-ioc). Thesuperpageoptimizationimprovesthisfur- mized Xen guest by a factor of 4.4, increasing trans- thertoachieve3310Mb/s(configurationGuest-high-ioc- mit throughput from 750 Mb/s to 3310 Mb/s. How- sp). The impact of these two optimizations by them- ever, Xen-guest-opt gets CPU saturated at this through- selvesintheabsenceofthehighlevelinterfaceoptimiza- put,whereastheLinuxconfigurationreachesaCPUuti- tionisinsignificant. lizationofroughly40%tosaturate4NICs. For the receive benchmark, the unoptimized Xen driver domain configuration, Xen-driver, achieves a 6.3.1 HighlevelInterface throughputof1738Mb/s, whichisonly 69% of the na- Weexplaintheimprovedperformanceresultingfromthe tive Linuxthroughput, 2508Mb/s. TheoptimizedXen- useofahighlevelinterfaceinfigure6. Thefigurecom- driver version, Xen-driver-opt, improves upon this per- parestheexecutionoverhead(inmillionsofCPUcycles formance by 35%, and achieves a throughput of 2343 per second) of the guestdomain, driver domain and the Mb/s. 
5.3.2 Global Page Mappings

Supporting global page mappings for guest domains running in a VM environment is quite simple on the x86 architecture. The x86 architecture provides mechanisms by which global page mappings can be invalidated from the TLB (this can be done, for example, by disabling and re-enabling the global paging flag in the processor control registers, specifically the PGE flag in the CR4 register).

We modify the Xen VMM to allow guest OSes to use global page mappings in their address space. On each domain switch, the VMM is modified to flush all TLB entries, using the mechanism described above. This has the additional side effect that the VMM's own global page mappings are also invalidated on a domain switch.

The use of global page mappings potentially improves TLB performance in the absence of domain switches. However, when multiple domains have to be switched frequently, the benefits of global mappings may be much reduced. In this case, the use of global mappings in the guest OS forces both the guest's and the VMM's TLB mappings to be invalidated on each switch. Our evaluation in section 6 shows that global mappings are beneficial only when there is a single driver domain, and not in the presence of guest domains.

We make one additional optimization to the domain switching code in the VMM. Currently, the VMM flushes the TLB whenever it switches to a non-idle domain. We notice that this incurs an unnecessary TLB flush when the VMM switches from a domain d to the idle domain and then back to the domain d. We make a modification to avoid the unnecessary flush in this case.
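On x86, the "flush everything, including global entries" operation reduces to toggling CR4.PGE. The sketch below shows this standard idiom for 32-bit x86; the surrounding function is illustrative rather than Xen's actual domain-switch code.

    /* Full TLB flush including 'global' entries, by clearing and then
     * restoring the PGE bit in CR4, as used on each domain switch. */
    #define X86_CR4_PGE 0x80

    static inline unsigned long read_cr4(void)
    {
        unsigned long cr4;
        __asm__ __volatile__("movl %%cr4, %0" : "=r"(cr4));
        return cr4;
    }

    static inline void write_cr4(unsigned long cr4)
    {
        __asm__ __volatile__("movl %0, %%cr4" : : "r"(cr4));
    }

    static void flush_tlb_all_including_global(void)
    {
        unsigned long cr4 = read_cr4();

        write_cr4(cr4 & ~X86_CR4_PGE);  /* clearing PGE flushes every TLB entry */
        write_cr4(cr4);                 /* re-enable global pages               */
    }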
5.4 Outstanding issues

There remain a few aspects of virtualization which are difficult to reconcile with the use of superpages in the guest OS. We briefly mention them here.

5.4.1 Transparent page sharing

Transparent page sharing between virtual machines is an effective mechanism to reduce the memory usage in the system when there are a large number of VMs [16]. Page sharing uses the address translation from pseudo-physical to physical addresses to transparently share pseudo-physical pages which have the same content.

Page sharing potentially breaks the contiguity of physical page frames. With the use of superpages, either the entire superpage must be shared between the VMs, or no page within the superpage can be shared. This can significantly reduce the scope for memory sharing between VMs. Although Xen does not make use of page sharing currently, this is a potentially important issue for superpages.

5.4.2 Ballooning driver

The ballooning driver [16] is a mechanism by which the VMM can efficiently vary the memory allocation of guest OSes. Since the use of a ballooning driver potentially breaks the contiguity of physical pages allocated to a guest, it invalidates the use of superpages for that address range.

A possible solution is to force memory allocation/deallocation to be in units of superpage size for coarse-grained ballooning operations, and to invalidate superpage mappings only for fine-grained ballooning operations. This functionality is not implemented in the current prototype.

6 Evaluation

The optimizations described in the previous sections have been implemented in Xen version 2.0.6, running Linux guest operating systems version 2.6.11.

6.1 Experimental Setup

We use two micro-benchmarks, a transmit and a receive benchmark, to evaluate the networking performance of guest and driver domains. These benchmarks are similar to the netperf [1] TCP streaming benchmark, which measures the maximum TCP streaming throughput over a single TCP connection. Our benchmark is modified to use the zero-copy sendfile system call for transmit operations.

The 'server' system for running the benchmark is a Dell PowerEdge 1600SC with a 2.4 GHz Intel Xeon CPU. This machine has four Intel Pro-1000 Gigabit NICs. The 'clients' for running an experiment consist of Intel Xeon machines with a similar CPU configuration, each having one Intel Pro-1000 Gigabit NIC. All the NICs have support for TSO, scatter/gather I/O and checksum offload. The client and server machines are connected over a Gigabit switch.

The experiments measure the maximum throughput achievable with the benchmark (either transmitter or receiver) running on the server machine. The server is connected to each client machine over a different network interface, and uses one TCP connection per client. We use as many clients as required to saturate the server CPU, and measure the throughput under different Xen and Linux configurations. All the profiling results presented in this section are obtained using the Xenoprof system-wide profiler in Xen [10].

6.2 Overall Results

We evaluate the following configurations: 'Linux' refers to the baseline unmodified Linux version 2.6.11 running in native mode. 'Xen-driver' refers to the unoptimized XenoLinux driver domain. 'Xen-driver-opt' refers to the Xen driver domain with our optimizations. 'Xen-guest' refers to the unoptimized, existing Xen guest domain. 'Xen-guest-opt' refers to the optimized version of the guest domain.

Figure 3 compares the transmit throughput achieved under the above 5 configurations. Figure 4 shows receive performance in the different configurations.

[Figure 3: Transmit throughput in different configurations — throughput in Mb/s for Linux, Xen-driver, Xen-driver-opt, Xen-guest and Xen-guest-opt.]

[Figure 4: Receive throughput in different configurations — throughput in Mb/s for Linux, Xen-driver, Xen-driver-opt, Xen-guest and Xen-guest-opt.]

For the transmit benchmark, the performance of the Linux, Xen-driver and Xen-driver-opt configurations is limited by the network interface bandwidth, and does not fully saturate the CPU. All three configurations achieve an aggregate link-speed throughput of 3760 Mb/s. The CPU utilization values for saturating the network in the three configurations are, respectively, 40%, 46% and 43%.

The optimized guest domain configuration, Xen-guest-opt, improves on the performance of the unoptimized Xen guest by a factor of 4.4, increasing transmit throughput from 750 Mb/s to 3310 Mb/s. However, Xen-guest-opt saturates the CPU at this throughput, whereas the Linux configuration reaches a CPU utilization of only roughly 40% to saturate 4 NICs.

For the receive benchmark, the unoptimized Xen driver domain configuration, Xen-driver, achieves a throughput of 1738 Mb/s, which is only 69% of the native Linux throughput of 2508 Mb/s. The optimized Xen-driver version, Xen-driver-opt, improves upon this performance by 35%, and achieves a throughput of 2343 Mb/s. The optimized guest configuration, Xen-guest-opt, improves the guest domain receive performance only slightly, from 820 Mb/s to 970 Mb/s.

6.3 Transmit Workload

We now examine the contribution of individual optimizations for the transmit workload. Figure 5 shows the transmit performance under different combinations of optimizations. 'Guest-none' is the guest domain configuration with no optimizations. 'Guest-sp' is the guest domain configuration using only the superpage optimization. 'Guest-ioc' uses only the I/O channel optimization. 'Guest-high' uses only the high-level virtual interface optimization. 'Guest-high-ioc' uses both high-level interfaces and the optimized I/O channel. 'Guest-high-ioc-sp' uses all the optimizations: high-level interface, I/O channel optimizations and superpages.

[Figure 5: Contribution of individual transmit optimizations — transmit throughput in Mb/s for Guest-high-ioc-sp, Guest-high-ioc, Guest-high, Guest-ioc, Guest-sp and Guest-none.]

The single biggest contribution to guest transmit performance comes from the use of a high-level virtual interface (the Guest-high configuration). This optimization improves guest performance by 272%, from 750 Mb/s to 2794 Mb/s. The I/O channel optimization yields an incremental improvement of 439 Mb/s (15.7%) over the Guest-high configuration, to yield 3230 Mb/s (configuration Guest-high-ioc). The superpage optimization improves this further to achieve 3310 Mb/s (configuration Guest-high-ioc-sp). The impact of these two optimizations by themselves, in the absence of the high-level interface optimization, is insignificant.

6.3.1 High-level Interface

We explain the improved performance resulting from the use of a high-level interface in figure 6. The figure compares the execution overhead (in millions of CPU cycles per second) incurred in the guest domain, the driver domain and the Xen VMM for running the transmit workload on a single NIC, under the high-level Guest-high configuration and the Guest-none configuration.

[Figure 6: Breakdown of execution cost in high-level and unoptimized virtual interface — CPU cycles of execution (millions) spent in Xen, the driver domain and the guest domain for the Guest-high and Guest-none configurations.]

The use of the high-level virtual interface reduces the execution cost of the guest domain by almost a factor of 4 compared to the execution cost with a low-level interface. Further, the high-level interface also reduces the Xen VMM execution overhead by a factor of 1.9, and the driver domain overhead by a factor of 2.1. These reductions can be explained by the reasoning given in section 3, namely the absence of data copying because of the support for scatter/gather, and the reduced per-byte processing overhead because of the use of larger packets (TSO).

The use of a high-level virtual interface gives performance improvements for the guest domain even when the offload features are not supported in the physical NIC. Figure 7 shows the performance of the guest domain using a high-level interface with the physical network interface supporting varying capabilities. The capabilities of the physical NIC form a spectrum: at one end the NIC supports TSO, SG I/O and checksum offload, at the other end it supports no offload feature, with intermediate configurations supporting partial offload capabilities.

[Figure 7: Advantage of higher-level interface — transmit throughput in Mb/s with NIC capabilities TSO+SG+Csum, SG+Csum, Csum and None, compared with the Guest-none configuration.]

Even in the case when the physical NIC supports no offload feature (bar labeled 'None'), the guest domain with a high-level interface performs nearly twice as well as the guest using the default interface (bar labeled Guest-none), viz. 1461 Mb/s vs. 750 Mb/s. Thus, even in the absence of offload features in the NIC, by performing the offload computation just before transmitting the packet over the NIC (in the offload driver), instead of performing it in the guest domain, we significantly reduce the overheads incurred in the I/O channel and the network bridge on the packet's transmit path.

The performance of the guest domain with just checksum offload support in the NIC is comparable to the performance without any offload support. This is because, in the absence of support for scatter/gather I/O, data copying with or without checksum computation incurs effectively the same cost. With scatter/gather I/O support, transmit throughput increases to 2007 Mb/s.

6.3.2 I/O Channel Optimizations

Figure 5 shows that using the I/O channel optimizations in conjunction with the high-level interface (configuration Guest-high-ioc) improves the transmit performance from 2794 Mb/s to 3239 Mb/s, an improvement of 15.7%. This can be explained by comparing the execution profile of the two guest configurations, as shown in figure 8. The I/O channel optimization reduces the execution overhead incurred in the Xen VMM by 38% (in the Guest-high-ioc configuration), and this accounts for the corresponding improvement in transmit throughput.

[Figure 8: I/O channel optimization benefit — CPU cycles of execution (millions) spent in Xen, the driver domain and the guest domain for the Guest-high and Guest-high-ioc configurations.]

6.3.3 Superpage mappings

Figure 5 shows that superpage mappings improve guest domain performance slightly, by 2.5%. Figure 9 shows the data and instruction TLB misses incurred for the transmit benchmark under three sets of virtual memory optimizations. 'Base' refers to a configuration which uses the high-level interface and I/O channel optimizations, but with regular 4 KB pages. 'SP-GP' is the Base configuration with the superpage and global page optimizations in addition. 'SP' is the Base configuration with superpage mappings only.

The TLB misses are grouped into three categories: data TLB misses (D-TLB), instruction TLB misses incurred in the guest and driver domains (Guest OS I-TLB), and instruction TLB misses incurred in the Xen VMM (Xen I-TLB).

The use of superpages alone is sufficient to bring down the data TLB misses by a factor of 3.8. The use of global mappings does not have a significant impact on data TLB misses (configurations SP-GP and SP), since frequent switches between the guest and driver domain cancel out any benefits of using global pages. The use of global mappings, however, does have a negative impact on the instruction TLB misses incurred in the Xen VMM.