Diagnosing Performance Overheads in the Xen Virtual Machine Environment

Aravind Menon* (EPFL, Lausanne; aravind.menon@epfl.ch)
Jose Renato Santos (HP Labs, Palo Alto)
Yoshio Turner (HP Labs, Palo Alto)
G. (John) Janakiraman (HP Labs, Palo Alto)
Willy Zwaenepoel (EPFL, Lausanne; willy.zwaenepoel@epfl.ch)

Internet Systems and Storage Laboratory, HP Laboratories Palo Alto
HPL-2005-80, May 3, 2005 (Internal Accession Date Only; Approved for External Publication)
Report keywords: Virtual Machine, performance analysis, statistical profiling, Xen, OProfile

* Work done during an internship at HP Labs and then at EPFL.

This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The original paper was published in and presented at the First ACM/USENIX Conference on Virtual Execution Environments (VEE'05), 11-12 June 2005, Chicago, Illinois, USA. Copyright 2005 ACM.

ABSTRACT

Virtual Machine (VM) environments (e.g., VMware and Xen) are experiencing a resurgence of interest for diverse uses including server consolidation and shared hosting. An application's performance in a virtual machine environment can differ markedly from its performance in a non-virtualized environment because of interactions with the underlying virtual machine monitor and other virtual machines. However, few tools are currently available to help debug performance problems in virtual machine environments. In this paper, we present Xenoprof, a system-wide statistical profiling toolkit implemented for the Xen virtual machine environment. The toolkit enables coordinated profiling of multiple VMs in a system to obtain the distribution of hardware events such as clock cycles and cache and TLB misses. We use our toolkit to analyze performance overheads incurred by networking applications running in Xen VMs. We focus on networking applications since virtualizing network I/O devices is relatively expensive. Our experimental results quantify Xen's performance overheads for network I/O device virtualization in uni- and multi-processor systems. Our results identify the main sources of this overhead, which should be the focus of Xen optimization efforts. We also show how our profiling toolkit was used to uncover and resolve performance bugs that we encountered in our experiments which caused unexpected application behavior.

Categories and Subject Descriptors: D.4.8 [Operating Systems]: Performance--Measurements; D.2.8 [Software Engineering]: Metrics--Performance measures

General Terms: Measurement, Performance

Keywords: Virtual machine monitors, performance analysis, statistical profiling
1. INTRODUCTION

Virtual Machine (VM) environments are experiencing a resurgence of interest for diverse uses including server consolidation and shared hosting. The emergence of commercial virtual machines for commodity platforms [7], paravirtualizing open-source virtual machine monitors (VMMs) such as Xen [4], and widespread support for virtualization in microprocessors [13] will boost this trend toward using virtual machines in production systems for server applications.

An application's performance in a virtual machine environment can differ markedly from its performance in a non-virtualized environment because of interactions with the underlying VMM and other virtual machines. The growing popularity of virtual machine environments motivates deeper investigation of the performance implications of virtualization. However, few tools are currently available to analyze performance problems in these environments.

In this paper, we present Xenoprof, a system-wide statistical profiling toolkit implemented for the Xen virtual machine environment. The Xenoprof toolkit supports system-wide coordinated profiling in a Xen environment to obtain the distribution of hardware events such as clock cycles, instruction execution, TLB and cache misses, etc. Xenoprof allows profiling of concurrently executing domains (Xen uses the term "domain" to refer to a virtual machine) and the Xen VMM itself at the fine granularity of individual processes and routines executing in either the domain or in the Xen VMM. Xenoprof will facilitate a better understanding of the performance characteristics of Xen's mechanisms and thereby help to advance efforts to optimize the implementation of Xen. Xenoprof is modeled on the OProfile [1] profiling tool available on Linux systems.
We report on the use of Xenoprof to analyze performance overheads incurred by networking applications running in Xen domains. We focus on networking applications because of the prevalence of important network intensive applications, such as web servers, and because virtualizing network I/O devices is relatively expensive. Previously published papers evaluating Xen's performance have used networking benchmarks in systems with limited network bandwidth and high CPU capacity [4, 10]. Since this results in the network, rather than the CPU, becoming the bottleneck resource, in these studies the overhead of network device virtualization does not manifest in reduced throughput. In contrast, our analysis examines cases where throughput degrades because CPU processing is the bottleneck instead of the network. Our experimental results quantify Xen's performance overheads for network I/O device virtualization in uni- and multi-processor systems. Our results identify the main sources of this overhead, which should be the focus of Xen optimization efforts. For example, Xen's current I/O virtualization implementation can degrade performance by preventing the use of network hardware offload capabilities such as TCP segmentation and checksum offload.

We additionally show a small case study that demonstrates the utility of our profiling toolkit for debugging real performance bugs we encountered during some of our experiments. The examples illustrate how unforeseen interactions between an application and the VMM can lead to strange performance anomalies.

The rest of the paper is organized as follows. We begin in Section 2 with our example case study motivating the need for performance analysis tools in virtual machine environments. Following this, in Section 3 we describe aspects of Xen and OProfile as background for our work. Section 4 describes the design of Xenoprof. Section 5 shows how the toolkit can be used for performance debugging of the motivating example of Section 2. In Section 6, we describe our analysis of the performance overheads for network I/O device virtualization. Section 7 describes related work, and we conclude with discussion in Section 8.

2. MOTIVATING EXAMPLE

Early on in our experimentation with Xen, we encountered a situation in which a simple networking application which we wrote exhibited puzzling behavior when running in a Xen domain. We present this example in part to illustrate how the behavior of an application running in a virtual machine environment can differ markedly and in surprising ways from its behavior in a non-virtualized system. We also show the example to demonstrate the utility of Xenoprof, which, as we describe later in Section 5, we used to help us diagnose and resolve this problem.

The example application consists of two processes, a sender and a receiver, running on two Pentium III machines connected through a gigabit network switch. The two processes are connected by a single TCP connection, with the sender continuously sending a 64 KB file (cached in the sender's memory) to the receiver using the zero-copy sendfile() system call. The receiver simply issues receives for the data as fast as possible.

In our experiment, we run the sender on Linux (non-virtualized), and we run the receiver either on Linux or in a Xen domain. We measure the receiver throughput as we vary the size of the user-level buffer which the application passes to the socket receive system call. Note that we are not varying the receiver socket buffer size, which limits kernel memory usage. Figure 1 shows that when the receiver is running on Linux, its throughput is nearly independent of the application receive buffer size. In contrast, throughput is highly sensitive to the application buffer size when the receiver runs in the Xen environment.

[Figure 1: XenoLinux performance anomaly: network receive throughput in XenoLinux shows erratic behavior with varying application buffer size. Receive throughput (Mb/s) versus application buffer size (KB), for Linux and XenoLinux receivers.]

This is one prime example of a scenario where we would like to have a performance analysis tool which can pinpoint the source of unexpected behavior in Xen. Although system-wide profiling tools such as OProfile exist for single OSes like Linux, no such tools are available for a virtual machine environment. Xenoprof is intended to fill this gap.
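For concreteness, the receiver half of this micro-benchmark can be sketched as below. This is a minimal illustrative reconstruction, not the authors' code: the port number and the lack of timing and error handling are assumptions, and the sender side would simply loop over sendfile() on the cached 64 KB file.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <sender-ip> <app-buffer-bytes>\n", argv[0]);
        return 1;
    }
    size_t bufsz = (size_t)strtoul(argv[2], NULL, 10);
    char *buf = malloc(bufsz);

    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5001);                 /* assumed port */
    inet_pton(AF_INET, argv[1], &addr.sin_addr);
    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* Only the size of this user-level buffer is varied in the experiment;
     * the kernel socket receive buffer (SO_RCVBUF) stays at its default. */
    long long total = 0;
    ssize_t n;
    while ((n = recv(s, buf, bufsz, 0)) > 0)
        total += n;

    fprintf(stderr, "received %lld bytes\n", total);
    close(s);
    free(buf);
    return 0;
}
```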
3. BACKGROUND

In this section, we briefly describe the Xen virtual machine monitor and the OProfile tool for statistical profiling on Linux. Our description of their architecture is limited to the aspects that are relevant to the discussion of our statistical profiling tool for Xen environments and the discussion of Xen's performance characteristics. We refer the reader to [4] and [1] for more comprehensive descriptions of Xen and OProfile, respectively.

3.1 Xen

Xen is an open-source x86 virtual machine monitor (VMM) that allows multiple operating system (OS) instances to run concurrently on a single physical machine. Xen uses paravirtualization [25], where the VMM exposes a virtual machine abstraction that is slightly different from the underlying hardware. In particular, Xen exposes a hypercall mechanism that virtual machines must use to perform privileged operations (e.g., installing a page table), an event notification mechanism to deliver virtual interrupts and other notifications to virtual machines, and a shared-memory based device channel for transferring I/O messages among virtual machines. While the OS must be ported to this virtual machine interface, this approach leads to better performance than an approach based on pure virtualization.

The latest version of the Xen architecture [10] introduces a new I/O model, where special privileged virtual machines called driver domains own specific hardware devices and run their I/O device drivers. All other domains (guest domains in Xen terminology) run a simple device driver that communicates via the device channel mechanism with the driver domain to access the real hardware devices. Driver domains directly access hardware devices that they own; however, interrupts from these devices are first handled by the VMM, which then notifies the corresponding driver domain through virtual interrupts delivered over the event mechanism. The guest domain exchanges service requests and responses with the driver domain over an I/O descriptor ring in the device channel. An asynchronous inter-domain event mechanism is used to send notification of queued messages. To support high-performance devices, references to page-sized buffers are transferred over the I/O descriptor ring rather than actual I/O data (the latter would require copying).
Whendata counteroneachsampleinthebu(cid:11)er. Thisisdeterminedby is sent by the guest domain, Xen usesa sharing mechanism consultingthevirtualmemorylayoutoftheprocessandthe where the guest domain permits the driver domain to map kernel. the page with the data and pin it for DMA by the device. When data is sent by the driver domain to the guest do- 4. STATISTICALPROFILINGINXEN main, Xen uses a page-remapping mechanism which maps In this section, we describe Xenoprof, a system-wide sta- the page with the data into the guest domain in exchange tistical pro(cid:12)ling toolkit we have developed for Xen virtual for an unused page provided by the guest domain. machine environments. The Xenoprof toolkit provides ca- pabilities similar to OPro(cid:12)le for the Xen environment (i.e., using performance monitoring hardware to collect periodic 3.2 OPro(cid:2)le samples of performance data). The performance of appli- OPro(cid:12)le [1] is a system-wide statistical pro(cid:12)ling tool for cations running on Xen depend on interactions among the Linux systems. OPro(cid:12)le can be used to pro(cid:12)le code exe- application’s processes, the operating system it is running cuting at any privilege level, including kernel code, kernel on,theXenVMM,andpotentiallyothervirtualvirtualma- modules, user level applications and user level libraries. chines (e.g., driver domain) running on the same system. OPro(cid:12)le uses performance monitoring hardware on the In order to study the costs of virtualization and the inter- processortocollectperiodicsamplesofvariousperformance actions among multiple domains, the performance pro(cid:12)ling data. Performance monitoring hardware on modern pro- tool must be able to determine the distribution of perfor- cessorarchitecturesincludecountersthattrackvariouspro- mance events across routines in the Xen VMM and all the cessoreventsincludingclockcycles,instructionretirements, domains running on it. TLB misses, cache misses, branch mispredictions, etc. The The Xenoprof toolkit consists of a VMM-level layer (we performance monitoring hardware can be con(cid:12)gured to no- hereon refer to this layer as Xenoprof) responsible for ser- tify the operating system when these counters reach spec- vicing counter over(cid:13)ow interrupts from the performance i(cid:12)ed values. OPro(cid:12)le collects the program counter (PC) monitoringhardwareandadomain-levellayerderivedfrom value at each noti(cid:12)cation to obtain a statistical sampling. OPro(cid:12)le responsible for attributing samples to speci(cid:12)c rou- For example, OPro(cid:12)le can collect the PC value whenever tines within the domain. The OPro(cid:12)le layer drives the per- the clock cycle counter reaches a speci(cid:12)c count to generate formance pro(cid:12)ling through hypercalls supported by Xeno- thedistributionoftimespentinvariousroutinesinthesys- profandXenoprofdeliverssamplestotheOPro(cid:12)lelayerus- tem. Likewise, OPro(cid:12)le can collect the PC value whenever ing Xen’s virtual interrupt mechanism. System-wide pro(cid:12)l- the TLB miss counter reaches a speci(cid:12)c count to generate ingisrealizedthroughthecoordinationofmultipledomain- thedistributionofTLBmissesacrossvariousroutinesinthe level pro(cid:12)lers. While our current implementation is based system. The accuracy and overhead of the pro(cid:12)ling can be on OPro(cid:12)le, other statistical pro(cid:12)lers (e.g., VTune [3]) can controlledbyadjustingthesamplinginterval(i.e.,thecount be ported to use the Xenoprof interface. 
4. STATISTICAL PROFILING IN XEN

In this section, we describe Xenoprof, a system-wide statistical profiling toolkit we have developed for Xen virtual machine environments. The Xenoprof toolkit provides capabilities similar to OProfile for the Xen environment (i.e., using performance monitoring hardware to collect periodic samples of performance data). The performance of applications running on Xen depends on interactions among the application's processes, the operating system it is running on, the Xen VMM, and potentially other virtual machines (e.g., the driver domain) running on the same system. In order to study the costs of virtualization and the interactions among multiple domains, the performance profiling tool must be able to determine the distribution of performance events across routines in the Xen VMM and all the domains running on it.

The Xenoprof toolkit consists of a VMM-level layer (we hereon refer to this layer as Xenoprof) responsible for servicing counter overflow interrupts from the performance monitoring hardware, and a domain-level layer derived from OProfile responsible for attributing samples to specific routines within the domain. The OProfile layer drives the performance profiling through hypercalls supported by Xenoprof, and Xenoprof delivers samples to the OProfile layer using Xen's virtual interrupt mechanism. System-wide profiling is realized through the coordination of multiple domain-level profilers. While our current implementation is based on OProfile, other statistical profilers (e.g., VTune [3]) can be ported to use the Xenoprof interface.

We first describe the issues that guided our design choices for Xenoprof, followed by a description of the Xenoprof framework itself. We then describe the modifications to standard domain level profilers such as OProfile that are necessary to enable them to be used with Xenoprof for providing system-wide profiling in Xen.

4.1 Design choice for profiling in Xen

The main issue with profiling virtual machine environments is that profiling cannot be centralized. System-wide profiling in any environment requires the knowledge of detailed system level information. As an example, we saw in Section 3.2 that in order to account a PC sample to the correct routine and process, OProfile needs to consult the virtual memory layout of the current process, or of the kernel. In a single-OS case like OProfile, all the required system level information is available in one centralized place, the kernel. Unfortunately, in a virtual machine environment like Xen, the information required for system-wide profiling is spread across multiple domains, and this domain-level information is not usually accessible from the hypervisor. For instance, Xen cannot determine the current process running in a domain, or determine its memory layout in order to find the routine corresponding to a PC sample collected for profiling. Thus, system-wide profiling in Xen is not possible with the profiler executing solely inside Xen. Even if the domain's kernel symbol map is available to Xen, it would still be difficult for Xen to determine the symbol map for processes running in the domain.

Thus, system-wide profiling in Xen must be "distributed" in nature, with help required from domain specific profilers for handling the profiling of the individual domains. Most of the domain specific profiling responsibilities are delegated to domain specific profilers, and a thin paravirtualized interface, Xenoprof, is provided for coordinating their execution and providing them access to the hardware counters.
4.2 Xenoprof Framework

Xenoprof helps realize system-wide profiling in Xen by the coordination of domain specific profilers executing in each domain. Xenoprof uses the same hardware counter based sampling mechanism as used in OProfile. It programs the hardware performance counters to generate sampling interrupts at regular event count intervals. Xenoprof hands over the PC samples collected on counter overflow to the profiler running in the current domain for further processing. For system-wide profiling, all domains run a Xenoprof compliant profiler, thus no PC samples are lost. Xenoprof also allows the selective profiling of a subset of domains.

Xenoprof allows domain level profilers to collect PC samples for performance counter overflows which occur in the context of that domain. Since these PC samples may also include samples from the Xen address space, the domain profilers must be extended to recognize and correctly attribute Xen's PC samples. Xenoprof also allows domain level profilers to optionally collect PC samples for other "passive" domains. A possible use of this feature is when the passive domain does not have a Xenoprof-compliant profiler, but we are still interested in estimating the aggregate execution cost accruing to the domain.

Domain level profilers in Xenoprof operate in a manner mostly similar to their operation in a non-virtualized setup. For low level operations, such as accessing and programming performance counters and collecting PC samples, the domain profilers are modified to interface with the Xenoprof framework. The high level operations of the profiler remain mostly unchanged. After collecting PC samples from Xenoprof, the profiler accounts the sample to the appropriate user or kernel routine by consulting the required memory layout maps. At the end of profiling, each domain profiler has a distribution of hardware event counts across routines in its applications, the kernel or Xen for its domain. The global system profile can be obtained by simply merging the individual profiles. The detailed profile for Xen can be obtained by merging the Xen components of the profiles from the individual domains.

The Xenoprof framework defines a paravirtualized interface to support domain level profilers in the following respects:

1. It defines a performance event interface which maps performance events to the physical hardware counters, and it defines a set of functions for domain level profilers to specify profiling parameters, and to start and stop profiling.

2. It allows domain-level profilers to register virtual interrupts that will be triggered whenever a new PC sample is available. Unlike sampling in OProfile, the domain level profilers in Xenoprof do not collect PC samples themselves. Instead, Xenoprof collects them on their behalf. This is because it may not be possible to call the domain level virtual interrupt handler synchronously from any context in Xen whenever a hardware counter overflows. Since PC samples must be collected synchronously to be correct, Xenoprof collects the samples on behalf of the domain and logs them into the per-domain sample buffer of the appropriate domain. The domain profiler is notified of additional samples through a virtual interrupt.

3. It coordinates the profiling of multiple domains. In every profiling session there is a distinguished domain, the initiator, which is responsible for configuring and starting a profiling session. The initiator domain has the privileges to decide the profiling setup, including the hardware events to profile, the domains to profile, etc. After the initiator selects the set of domains to be profiled, Xenoprof waits until all participant domains perform a hypercall indicating that they are ready to process samples. The profiling session starts only after this condition is satisfied and after a start command is issued by the initiator. Profiling stops when requested by the initiator. The sequencing of profiling operations across the initiator and other domains is orchestrated by the user (e.g., using a user-level script).

In the current architecture, only the privileged domain, domain0, can act as the initiator for a profiling session involving more than one domain. Thus, only the privileged domain0 can control system-wide profiling involving multiple domains. This restriction is necessary because allowing unprivileged domains to do system-wide profiling would violate the security and isolation guarantees provided by Xen. Xenoprof allows unprivileged domains to do single-domain profiling of the domain itself. In this scenario, the domain can start profiling immediately, and no coordination between domains is involved.

One restriction in the current architecture of Xenoprof is that it does not allow concurrent independent profiling of multiple domains. Concurrent independent profiling of different domains would require support for per-domain virtual counters in Xenoprof. This is currently not supported.
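The following sketch summarizes the shape of this interface from a domain-level profiler's point of view. All identifiers here (xenoprof_op, the operation codes, the sample layout) are hypothetical illustrations of the three points above; the actual hypercall interface may differ.

```c
#include <stdint.h>

/* Operations a domain-level profiler issues via hypercall (points 1 and 3). */
enum xenoprof_op_code {
    XOP_SET_EVENTS,      /* map requested events onto hardware counters   */
    XOP_SET_DOMAINS,     /* initiator only: choose the domains to profile */
    XOP_READY,           /* participant: ready to process samples         */
    XOP_START,           /* initiator only: begin the session             */
    XOP_STOP             /* initiator only: end the session               */
};

/* One entry in the per-domain sample buffer that the VMM-level layer fills
 * on counter overflow (point 2); the domain profiler learns of new entries
 * through a registered virtual interrupt and drains the buffer there. */
struct xenoprof_sample {
    uint64_t pc;         /* program counter at overflow                   */
    uint32_t event;      /* which configured event overflowed             */
    uint32_t in_xen;     /* nonzero if the PC lies in Xen's address range */
};

/* Hypothetical hypercall wrapper, stubbed so the sketch compiles. */
long xenoprof_op(enum xenoprof_op_code op, void *arg)
{
    (void)op; (void)arg;
    return 0;
}

/* Control flow of a system-wide session as seen by the initiator (domain0):
 * configure events and target domains, then start after every participant
 * has issued XOP_READY; a real session runs the workload before stopping. */
long run_initiator_session(void *event_spec, void *domain_list)
{
    long rc = xenoprof_op(XOP_SET_EVENTS, event_spec);
    if (rc == 0)
        rc = xenoprof_op(XOP_SET_DOMAINS, domain_list);
    if (rc == 0)
        rc = xenoprof_op(XOP_START, 0);
    return rc ? rc : xenoprof_op(XOP_STOP, 0);
}
```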
4.3 Porting OProfile to Xenoprof

Domain level profilers must be ported to the Xenoprof interface so that they can be used in the Xen environment. Porting a profiler to Xenoprof is fairly straightforward, and entails the following steps:

1. The profiler code for accessing and programming the hardware counters must be modified to use the Xenoprof virtual event interface.

2. Before starting profiling, the profiler queries Xenoprof to determine whether it is to take on the role of the initiator. Only the initiator domain performs the global profiling setup, such as deciding the events and the profiling duration. Both the initiator and the other profilers register their callback functions for collecting PC samples.

3. The profiler is extended with access to the Xen virtual memory layout, so that it can identify Xen routines corresponding to PC samples in Xen's virtual address range.

We ported OProfile to use the Xenoprof interface in the above respects.
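As an illustration of step 3, a ported profiler only needs a small classification step before symbol lookup, along the following lines. The address-range constants are placeholders; the real boundaries come from Xen's and the domain's memory layout maps.

```c
#include <stdint.h>
#include <stdio.h>

#define XEN_VIRT_START    0xFC000000UL   /* placeholder: start of Xen's range    */
#define KERNEL_VIRT_START 0xC0000000UL   /* placeholder: start of kernel's range */

enum sample_origin { SAMPLE_USER, SAMPLE_KERNEL, SAMPLE_XEN };

static enum sample_origin classify_pc(uintptr_t pc)
{
    if (pc >= XEN_VIRT_START)
        return SAMPLE_XEN;       /* resolved against Xen's symbol map          */
    if (pc >= KERNEL_VIRT_START)
        return SAMPLE_KERNEL;    /* resolved against the guest kernel's map    */
    return SAMPLE_USER;          /* resolved via the current process's layout  */
}

int main(void)
{
    uintptr_t pcs[] = { 0x080483f0UL, 0xC01a2b30UL, 0xFC0104a0UL };
    const char *names[] = { "user", "kernel", "xen" };
    for (unsigned i = 0; i < 3; i++)
        printf("pc=%#lx -> %s\n", (unsigned long)pcs[i], names[classify_pc(pcs[i])]);
    return 0;
}
```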
5. PERFORMANCE DEBUGGING USING XENOPROF

In this section, we analyze the performance anomaly described in Section 2 using our Xenoprof based profiling tool. To recall briefly, the performance bug consists of the erratic variation in TCP receive throughput as a function of the application level buffer size, when the application runs in a Xen driver domain on a Pentium III machine. We reproduce the graph of the receiver throughput versus application buffer size in Figure 2.

[Figure 2: Xenoprof network performance anomaly. Receive throughput (Mb/s) versus application buffer size, for Linux, XenoLinux, and the "fixed" XenoLinux configuration; points A and B mark a high- and a low-throughput buffer size, respectively.]

To understand the variation in receiver throughput running in Xen, we profile the receiver at two points that have widely different performance (points A and B in Figure 2, with high and low throughput, respectively). Table 1 shows the distribution of execution time for points A and B across the XenoLinux kernel (Linux ported to the Xen environment), the network driver and Xen. The table shows that when the throughput is low at point B, a significantly greater fraction of the execution time is spent in the XenoLinux kernel, compared to point A, which has higher throughput.

Table 1: Xen domain0 profile (% execution time)

                     Point A   Point B
  XenoLinux kernel      60        84
  network driver        10         5
  Xen                   30        11

The additional kernel execution overhead at B has a function-wise breakdown as shown in Table 2. Table 2 shows that point B suffers significantly lower throughput compared to A because of the additional time (roughly 40%) spent in the kernel routines skb_copy_bits, skbuff_ctor and tcp_collapse. The corresponding overheads for these functions at point A are nearly zero.

Table 2: XenoLinux profile (% execution time)

                  Point A   Point B
  skb_copy_bits     0.15       28
  skbuff_ctor      absent       9
  tcp_collapse      0.05        3
  other routines    99.8       60

The biggest overhead at point B comes from the function skb_copy_bits. skb_copy_bits is a simple data copying function in the Linux kernel used to copy data from a socket buffer to another region in kernel memory.

The second function, skbuff_ctor, is a XenoLinux-specific routine which is called whenever new pages are added to the socket buffer slab cache. This function zeroes out new pages before adding them to the slab cache, which can be an expensive operation. XenoLinux clears out pages added to its slab cache to make sure that there are no security breaches when these pages are exchanged with other domains.

tcp_collapse is a routine in Linux which is called for reducing the per-socket TCP memory pressure in the kernel. Whenever a socket's receive queue memory usage exceeds the per-socket memory limits, tcp_collapse is invoked to free up fragmented memory in the socket's receive queue. (Note that this deals with per-socket memory limits, not the entire kernel's TCP memory usage.) The tcp_collapse function tries to reduce internal fragmentation in the socket's receive queue by copying all the socket buffers in the receive queue into new compact buffers. It makes use of the skb_copy_bits function for compacting fragmented socket buffers. Frequent compaction of socket buffers can be an expensive operation.

The high execution cost of tcp_collapse and skb_copy_bits at point B indicates that the kernel spends a significant fraction of the time compacting fragmented socket buffers to reduce TCP memory pressure. This is suspicious, because in the normal scenario most of the socket buffers in a socket's receive queue have a size equal to the Maximum Transfer Unit (MTU) size and do not suffer much internal fragmentation.

To check if socket buffer fragmentation is indeed the issue here, we instrument the XenoLinux kernel to determine the average internal fragmentation in the socket buffers. Kernel instrumentation reveals that on average, each socket buffer is allocated 4 KB (i.e., one page), out of which it uses only 1500 bytes (MTU) for receiving data. This obviously leads to internal fragmentation, and causes the per-socket memory limit to be frequently exceeded, leading to significant time being wasted in defragmenting the buffers.

It turns out that for every 1500 byte socket buffer requested by the network driver, XenoLinux actually allocates an entire page, i.e., 4 KB. This is done to facilitate the transfer of a network packet between an I/O domain and a guest domain. A page sized socket buffer facilitates easy transfer of ownership of the page between the domains. Unfortunately, this also leads to significant wastage of per-socket memory, and as we see in the analysis above, this can result in unexpected performance problems.

The performance problem resulting from socket memory exhaustion can be artificially fixed by using the kernel parameter tcp_adv_window_scale. This kernel parameter determines the fraction of the per-socket receive memory which is set aside for user-space buffering. Thus, by reserving a large fraction of the socket's memory for user buffering, the effective TCP window size of the socket can be reduced, and thus the kernel starts dropping packets before socket memory exhaustion occurs.
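For reference, this parameter can be changed through procfs (equivalently, with the sysctl tool as net.ipv4.tcp_adv_window_scale). The sketch below is illustrative only; the specific value written is an assumption, since the paper does not state which setting was used.

```c
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_adv_window_scale", "w");
    if (!f) {
        perror("tcp_adv_window_scale");
        return 1;
    }
    /* The parameter controls what fraction of per-socket receive memory is
     * assumed to be needed for application buffering (see the discussion
     * above); the value 2 written here is purely illustrative. */
    fprintf(f, "%d\n", 2);
    fclose(f);
    return 0;
}
```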
With this configuration in place, the performance problem arising due to socket buffer compaction is eliminated and we get more consistent receiver performance. This improved scenario is shown in Figure 2 as the line labeled "XenoLinux Receive Throughput (fixed)". The overhead of virtualization is approximately constant across the range of buffer sizes, leading to similar throughput behavior for this configuration and the base Linux configuration.

6. PERFORMANCE OVERHEADS IN XEN

In this section, we compare the performance of three networking applications running in the Xen virtual environment with their performance in a non-virtualized Linux environment. The three applications provide a thorough coverage of various performance overheads arising in a virtualized network environment.

The three benchmarks we use stress different aspects of the network stack. The first two are simple microbenchmarks, similar to the TTCP [2] benchmark, and measure the throughput of TCP send and receive over a small number of connections (one per network interface card). The third benchmark consists of a full fledged web server running a simple synthetic workload. These benchmarks stress the TCP send half, the TCP receive half, and concurrent TCP connections, respectively.

We use Xenoprof to analyze and evaluate the performance of the three benchmarks in non-virtualized Linux (single-CPU) and also in three different Xen configurations:

1) Xen-domain0: The application runs in a single privileged driver domain.

2) Xen-guest0: A guest domain configuration in which an unprivileged guest domain and a privileged driver domain run on the same CPU. The application and a virtual device driver run in the guest domain, and the physical device driver (serving as the backend for the guest) runs in the driver domain and interacts directly with the I/O device.

3) Xen-guest1: Similar to the Xen-guest0 configuration except that the guest and driver domains each run on a separate CPU.

Our experimental setup is the following.
For experiments that do not include the Xen-guest1 configuration (Sections 6.1.1 and 6.2), we use a Dell PowerEdge 1600SC, 2393 MHz Intel Xeon server, with 1 GB of RAM and four Intel Pro-1000 gigabit ethernet cards. The server is connected over a gigabit switch to multiple client machines having similar configurations and one gigabit ethernet card each. For experiments that include Xen-guest1 (Sections 6.1.2 and 6.3) and for the web server benchmark, we use a Compaq Proliant DL360 dual-CPU 2800 MHz Intel Xeon server, with 4 GB of RAM and one Broadcom 'Tigon 3' gigabit ethernet card. The server is connected to one client machine with a similar configuration over a gigabit switch.

Xen version 2.0.3 is used for all experiments, and the domains run XenoLinux derived from stock Linux version 2.6.10. All domains are configured to use 256 MB of RAM.

In the following subsections, we describe our evaluation and analysis of Xen for the three benchmarks.

6.1 Receiver workload

The TCP receiver microbenchmark measures the TCP receive throughput achievable over a small number of TCP connections. This benchmark stresses the receive half of the TCP stack for large data transfers over a small number of connections.

The benchmark consists of a number of receiver processes, one per NIC (network interface card), running on the target machine. Each receiver is connected to a sender process running on a different machine over a separate NIC. Sender processes use sendfile() to send a cached 64 KB file to the corresponding receiver process continuously in a loop. We measure the maximum aggregate throughput achievable by the receivers. The sender processes each run on a different machine to make sure that they do not become a bottleneck.

6.1.1 Xen-domain0 configuration

We first compare the performance of the receiver in the Xen-domain0 configuration with its performance in a baseline Linux system. In the next subsection, we will evaluate its performance in Xen guest domain configurations. Figure 3 compares the Xen-domain0 and Linux performance for different numbers of NICs.

[Figure 3: Receive throughput on Xen-domain0 and Linux. Aggregate receive throughput (Mb/s) versus number of NICs (1 to 4).]

In Linux, the receivers reach an aggregate throughput of 2462 Mb/s with 3 NICs, whereas Xen-domain0 reaches 1878 Mb/s with 2 NICs.

This result contrasts with previous research [4], which claims that Xen-domain0 achieves performance nearly equal to the performance of a baseline Linux system. Those experiments used only a single NIC, which became the bottleneck resource. Since the system was not CPU-limited, higher CPU utilization with Xen was not reflected in reduced throughput. With more NICs, as the system becomes CPU-limited, performance differences crop up.

We profile the Xen-domain0 configuration and Linux for different hardware events in the 3 NIC run. We first compare the aggregate hardware event counts in the two configurations for the following hardware events: instruction counts, L2 cache misses, data TLB misses, and instruction TLB misses. Figure 4 shows the normalized values for these events relative to the Xen-domain0 numbers.

[Figure 4: Relative hardware event counts in Xen-domain0 and Linux for receive (3 NICs). Counts for instructions, L2 misses, I-TLB misses and D-TLB misses, normalized to Xen-domain0.]

Figure 4 shows that a primary cause for reduced throughput in Xen-domain0 is the significantly higher data and instruction TLB miss rates compared to Linux. We can compare configurations in terms of the ratio of their cache or TLB miss rates (we define miss rate as the ratio of the number of misses to the number of instructions executed). Compared to Linux, Xen-domain0 has a data TLB miss count roughly 13 times higher (which works out to a 19 times higher data TLB miss rate), and an instruction TLB miss count roughly 17 times higher (or a 25 times higher miss rate). Higher TLB misses lead to instruction stalling, and at full utilization this results in the instruction count for Xen-domain0 being only about 70% of the instruction count for Linux, leading to a similar decrease in achieved throughput.

TLB misses in Xen-domain0 are not concentrated in a few hotspot functions, but are spread across the entire TCP receive code path, as shown in Table 3. From preliminary investigations based on Xen code instrumentation, we suspect the increase in misses is caused by increases in working set size as opposed to TLB flushes, but further investigation is needed.

Table 3: Data TLB miss distribution in Xen-domain0 (3 NICs)

  % D-TLB misses   Function               Module
       9.48        e1000_intr             network driver
       7.69        e1000_clean_rx_irq     network driver
       5.81        alloc_skb_from_cache   XenoLinux
       4.45        ip_rcv                 XenoLinux
       3.66        free_block             XenoLinux
       3.49        kfree                  XenoLinux
       3.32        tcp_prequeue_process   XenoLinux
       3.01        tcp_rcv_established    XenoLinux

In addition to comparing two configurations in terms of miss rates, which determine the time required to execute each instruction, we can also evaluate the efficiency of two configurations in terms of their instruction cost, which we define as the number of instructions that must be executed to transfer and process a byte of data to or from the network. By dividing the ratio of instructions executed in the two configurations by the ratio of throughput achieved in the two configurations, we get the instruction cost ratio for the two configurations. Comparing Xen-domain0 to Linux, 0.6919 (the ratio of instructions executed, Figure 4) divided by 0.6251 (the throughput ratio, Figure 3) yields an instruction cost ratio of 1.11, indicating that Xen-domain0 requires 11% more instructions to process each byte compared to Linux. The receiver in the Xen-domain0 configuration must execute additional instructions within Xen and also Xen-specific routines in XenoLinux.
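Written out explicitly, the arithmetic quoted above (with both ratios taken as Xen-domain0 relative to Linux, from Figures 3 and 4) is:

\[
\text{instruction cost ratio} \;=\; \frac{\text{instruction ratio}}{\text{throughput ratio}} \;=\; \frac{0.6919}{0.6251} \;\approx\; 1.11 .
\]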
The execution of Xen accounts for roughly 14% of CPU cycles. Table 4 shows the cost of some specific routines in Xen and XenoLinux which contribute to execution overhead.

Table 4: Expensive routines in Xen-domain0 (3 NICs)

  % Execution time   Function           Module
        9.15         skbuff_ctor        XenoLinux
        6.01         mask_and_ack_irq   Xen
        2.81         unmask_irq         Xen

The skbuff_ctor routine, discussed in Section 5, is a XenoLinux-specific routine which can be quite expensive depending on the frequency of socket buffer allocation (9% of CPU time in this case). Masking and unmasking of interrupts in Xen contributes an additional 9% of CPU overhead not present in Linux. For each packet received, the network device raises a physical interrupt which is first handled in Xen, and then the appropriate "event" is delivered to the correct driver domain through a virtual interrupt mechanism. Xen masks level-triggered interrupts to avoid unbounded reentrancy, since the driver domain interrupt handler resets the interrupt source outside the scope of the physical interrupt handler [10]. The domain interrupt handler has to make an additional hypercall to Xen to unmask the interrupt line after servicing the interrupt.

In summary,
1. The maximum throughput achieved in the Xen-domain0 configuration for the receiver benchmark is roughly 75% of the maximum throughput in Linux.
2. Xen-domain0 suffers a significantly higher TLB miss rate compared to Linux, which is the primary cause of throughput degradation.
3. Other overheads arise from interrupt handling overheads in Xen, and from Xen-specific routines in XenoLinux.

6.1.2 Guest domain configurations

We next evaluate the TCP receiver benchmark in the guest domain configurations Xen-guest0 and Xen-guest1, and compare performance with the Xen-domain0 configuration. Our experiments show that receiver performance in Xen-guest0 is significantly lower than in the Xen-domain0 configuration.
Whereas the Xen-domain0 receiver achieves 1878 Mb/s and the Linux receiver achieves 2462 Mb/s, the receiver running in Xen-guest0 achieves 849 Mb/s, less than the capacity of a single NIC. Thus, for the remainder of this section, we focus our analysis for the guest domains only on the single NIC configuration, where Xen-domain0 and Xen-guest1 both deliver 941 Mb/s and Xen-guest0 achieves only 849 Mb/s.

The primary cause of throughput degradation in the guest domain configurations is the significantly higher instruction counts relative to Xen-domain0. Figure 5 shows normalized aggregate counts for various performance events relative to the Xen-domain0 values.

[Figure 5: Relative hardware event counts in Xen-domain0, Xen-guest0 (1 CPU) and Xen-guest1 (2 CPUs). Counts for instructions, L2 misses, I-TLB misses and D-TLB misses, normalized to Xen-domain0.]

We can see that compared to Xen-domain0, the Xen-guest0 and Xen-guest1 configurations have higher instruction cost ratios (2.25 and 2.52, respectively), higher L2 cache miss rates (1.97 times higher and 1.52 times higher), lower instruction TLB miss rates (ratio 0.31 and 0.72), and lower data TLB miss rates (ratio 0.54 and 0.87).

Table 5 gives the breakdown of instruction counts across the guest and driver domains and Xen for the three configurations. The table shows that compared to Xen-domain0, the guest domain configurations execute considerably more instructions in Xen and a large number of instructions in the driver domain, even though it only hosts the physical driver.

Table 5: Instruction execution samples in Xen configurations

                  Xen-domain0   Xen-guest0   Xen-guest1
  Guest domain        N/A          16453        20145
  Driver domain      19104         15327        18845
  Xen                 1939         11733        14975

Detailed profiling of Xen execution overhead, shown in Table 6, shows that most of the instruction overhead of Xen comes from the page remapping and page transfer between the driver and guest domains for network data received in the driver domain. For each page of network data received, the page is remapped into the guest domain and the driver domain acquires a replacement page from the guest.

Table 6: Instruction count distribution in Xen (Xen-guest0 configuration)

  % Instructions   Xen function
       3.94        copy_from_user_ll
       3.02        do_mmu_update
       2.59        copy_to_user_ll
       2.50        mod_l1_entry
       1.67        alloc_domheap_pages
       1.47        do_extended_command
       1.43        free_domheap_pages

The second source of overhead is within the driver domain. Most of the overhead of the driver domain, as shown in Table 7, comes from the network filtering and bridging code. Xen creates a number of virtual network interfaces for guest domains, which are bridged to the physical network interface in the driver domain using the standard Linux bridging code. (As an alternative to bridging, Xen recently added a routing-based solution. We have not yet evaluated this approach.)

Table 7: Instruction count distribution in the driver domain (Xen-guest0 configuration)

  % Instructions   Driver domain function
       5.14        nf_hook_slow
       3.59        nf_iterate
       1.76        br_nf_pre_routing
       1.31        fdb_insert
       1.17        net_rx_action
       1.00        br_handle_frame
Apart from these two major sources of overhead, Figure 5 shows that the guest configurations also suffer significantly higher L2 cache misses, roughly 4 times the misses in Xen-domain0. This may be explained by the fact that executing two domains concurrently increases the working set size and reduces locality, leading to more L2 misses. One surprising result from Figure 5 is that Xen-guest0 suffers lower TLB misses compared to Xen-guest1, even though Xen-guest0 involves switching between the guest and driver domains on the same CPU.

In summary,
1. The receiver benchmark in the guest domain configurations achieves less than half the throughput achieved in the Xen-domain0 configuration.
2. The main reasons for reduced throughput are the computational overheads of the hypervisor and the driver domain, which cause the instruction cost to increase by a factor of 2.2 to 2.5.
3. Increased working set size and reduced locality lead to higher L2 misses in the guest configurations.

6.2 Sender workload

We now evaluate the different Xen configurations using our second microbenchmark, the TCP sender benchmark. This is similar to the receiver benchmark, except that we measure the throughput on the sender side of the TCP connections. We run multiple sender processes, one per NIC, on the target machine, and run each receiver process on a separate machine. This benchmark stresses the send control path of the TCP stack for large data transfers.

We first compare the performance of the sender running in Xen-domain0 with its performance in Linux. We observe that, for our system configuration, the sender workload is very light on the CPU, and both Linux and Xen-domain0 are able to saturate up to 4 NICs, achieving an aggregate throughput of 3764 Mb/s, without saturating the CPU. Thus, we observe no throughput differences up to 4 NICs. However, as in the receiver benchmark, the Xen-domain0 configuration suffers significantly higher TLB miss rates compared to Linux, and shows greater CPU utilization for the same workload.

We next evaluate the TCP sender benchmark in the guest domain configuration, Xen-guest0. As TCP send is cheaper compared to TCP receive, we expected the sender in the guest configuration to perform better than the receiver. However, we found that the maximum throughput achievable by the sender in the Xen-guest0 configuration is only 706 Mb/s, whereas in the Xen-domain0 configuration the sender is able to achieve up to 3764 Mb/s.

Figure 6 compares the hardware event counts for Xen-guest0 and the event counts for the sender running in Xen-domain0 with one NIC.

[Figure 6: Relative hardware event counts in Xen-domain0 and Xen-guest0 for the sender benchmark. Counts for instructions, L2 misses, I-TLB misses and D-TLB misses, normalized to Xen-guest0.]

The figure demonstrates that the primary cause for throughput degradation in Xen-guest0 is higher instruction cost. The Xen-guest0 configuration incurs a factor of 6 higher instruction count (8.15 times higher instruction cost) than the Xen-domain0 configuration. The L2 miss rate is also a factor of 4 higher in the Xen-guest0 configuration than in the Xen-domain0 configuration, which can be attributed to contention for the L2 cache between the guest domain and the driver domain.

We study the distribution of instructions among the routines in the domain hosting the sender application, i.e., the guest domain of the Xen-guest0 configuration and the driver domain of the Xen-domain0 configuration. The instruction counts for these domains differ by a factor of 2.8 (as compared to the factor of 6 overall between the two configurations).

Table 8: Instruction execution samples in the Xen-guest0 guest and Xen-domain0

  Function            Xen-guest0 guest   Xen-domain0
  tcp_write_xmit            212               72
  tcp_transmit_skb          512               92
  ip_queue_xmit             577              114

Table 8 shows the cost of high-level TCP/IP protocol processing functions. The Xen-guest0 guest has significantly higher instruction counts for these functions, and although not shown, this leads to similar increases for all descendant functions. It appears as if the TCP stack in the guest configuration processes a larger number of packets compared to Xen-domain0 to transfer the same amount of data. Further analysis (from kernel instrumentation) reveals that this is indeed the case, and this in fact results from the absence of TCP segmentation offload (TSO) support in the guest domain's virtual NIC.

TSO support in a NIC allows the networking stack to bunch together a large number of socket buffers into one big packet. The networking stack can then do TCP send processing on this single packet, and can offload the segmentation of this packet into MTU sized packets to the NIC. For large data transfers, TSO support can significantly reduce the computation required in the kernel.
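As an aside, whether a physical NIC currently has TSO enabled can be checked with the ethtool utility or, programmatically, with the legacy SIOCETHTOOL ioctl, as in the sketch below. The interface name is an assumption, and this is not code from the paper; it says nothing about Xen's virtual NIC, which simply does not expose the capability.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ethtool_value ev = { .cmd = ETHTOOL_GTSO };  /* query TSO state */
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);        /* assumed NIC name */
    ifr.ifr_data = (char *)&ev;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
        perror("SIOCETHTOOL(ETHTOOL_GTSO)");
    else
        printf("%s: TSO is %s\n", ifr.ifr_name, ev.data ? "on" : "off");

    close(fd);
    return 0;
}
```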
For the sender workload, Xen-domain0 can directly make use of TSO support in the NIC to offload its computation. However, the guest domain is forced to perform more TCP processing for MTU sized packets, because the virtual NIC visible in the guest domain does not provide TSO support. This is because virtual I/O devices in Xen have to be generic across a wide range of physical devices, and so they provide a simple, consistent interface for each device class, which may not reflect the full capabilities of the physical device.

Profiling results show that in addition to performing large TCP packet segmentation in software, the guest domain uses the function csum_partial_copy_generic to perform TCP checksum computation in software. This is again because the virtual NIC does not support the checksum offload capabilities of the physical NIC. However, we have determined that disabling TSO support in the Xen-domain0 configuration increases the instructions executed by a factor of 2.7. Recall that the ratio of instructions executed between the guest domain in Xen-guest0 and the driver domain in Xen-domain0 is 2.8. Therefore, we conclude that the inability to use TSO in the guest domain is the primary cause of the higher instruction cost for the guest domain in Xen-guest0.

In summary,
1. The sender workload in the guest configuration achieves a throughput less than one-fifth of its throughput in Xen-domain0 or Linux.
2. Xen's I/O driver domain model prevents the guest domain from offloading TCP processing functions (e.g., TCP segmentation offload, checksum offload) to the physical interface.
3. Increased working set size and reduced locality in the guest configuration also lead to higher L2 miss rates.

6.3 Web server workload

Our third benchmark for evaluating network performance in Xen is the macro-level web-server benchmark running a simple synthetic workload. The web-server benchmark stresses the overall networking system, including data transfer paths and connection establishment and closing for a large number of connections.
We use the multithreaded knot web server from the Capriccio project [22], and compare the performance of the different domain configurations with a baseline Linux configuration. We run the server using the LinuxThreads threading library in all configurations, because the NPTL library currently cannot be used with Xen.

To generate the workload, we run multiple httperf [17] clients on different machines which generate requests for a ...
