Proceedings of the Linux Symposium

Volume One

June 27th–30th, 2007
Ottawa, Ontario, Canada

Contents

The Price of Safety: Evaluating IOMMU Performance
    Ben-Yehuda, Xenidis, Ostrowski, Rister, Bruemmer, Van Doorn
Linux on Cell Broadband Engine status update
    Arnd Bergmann
Linux Kernel Debugging on Google-sized clusters
    M. Bligh, M. Desnoyers, & R. Schultz
Ltrace Internals
    Rodrigo Rubira Branco
Evaluating effects of cache memory compression on embedded systems
    Anderson Briglia, Allan Bezerra, Leonid Moiseichuk, & Nitin Gupta
ACPI in Linux – Myths vs. Reality
    Len Brown
Cool Hand Linux – Handheld Thermal Extensions
    Len Brown
Asynchronous System Calls
    Zach Brown
Frysk 1, Kernel 0?
    Andrew Cagney
Keeping Kernel Performance from Regressions
    T. Chen, L. Ananiev, and A. Tikhonov
Breaking the Chains—Using LinuxBIOS to Liberate Embedded x86 Processors
    J. Crouse, M. Jones, & R. Minnich
GANESHA, a multi-usage with large cache NFSv4 server
    P. Deniel, T. Leibovici, & J.-C. Lafoucrière
Why Virtualization Fragmentation Sucks
    Justin M. Forbes
A New Network File System is Born: Comparison of SMB2, CIFS, and NFS
    Steven French
Supporting the Allocation of Large Contiguous Regions of Memory
    Mel Gorman
Kernel Scalability—Expanding the Horizon Beyond Fine Grain Locks
    Corey Gough, Suresh Siddha, & Ken Chen
Kdump: Smarter, Easier, Trustier
    Vivek Goyal
Using KVM to run Xen guests without Xen
    R.A. Harper, A.N. Aliguori & M.D. Day
Djprobe—Kernel probing with the smallest overhead
    M. Hiramatsu and S. Oshima
Desktop integration of Bluetooth
    Marcel Holtmann
How virtualization makes power management different
    Yu Ke
Ptrace, Utrace, Uprobes: Lightweight, Dynamic Tracing of User Apps
    J. Keniston, A. Mavinakayanahalli, P. Panchamukhi, & V. Prasad
kvm: the Linux Virtual Machine Monitor
    A. Kivity, Y. Kamay, D. Laor, U. Lublin, & A. Liguori
Linux Telephony
    Paul P. Komkoff, A. Anikina, & R. Zhnichkov
Linux Kernel Development
    Greg Kroah-Hartman
Implementing Democracy
    Christopher James Lahey
Extreme High Performance Computing or Why Microkernels Suck
    Christoph Lameter
Performance and Availability Characterization for Linux Servers
    Linkov Koryakovskiy
“Turning the Page” on Hugetlb Interfaces
    Adam G. Litke
Resource Management: Beancounters
    Pavel Emelianov, Denis Lunev and Kirill Korotaev
Manageable Virtual Appliances
    D. Lutterkort
Everything is a virtual filesystem: libferris
    Ben Martin

Conference Organizers

Andrew J. Hutton, Steamballoon, Inc., Linux Symposium, Thin Lines Mountaineering
C. Craig Ross, Linux Symposium

Review Committee

Andrew J. Hutton, Steamballoon, Inc., Linux Symposium, Thin Lines Mountaineering
Dirk Hohndel, Intel
Martin Bligh, Google
Gerrit Huizenga, IBM
Dave Jones, Red Hat, Inc.
C. Craig Ross, Linux Symposium

Proceedings Formatting Team

John W. Lockhart, Red Hat, Inc.
Gurhan Ozen, Red Hat, Inc.
John Feeney, Red Hat, Inc.
Len DiMaggio, Red Hat, Inc.
John Poelstra, Red Hat, Inc.

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights to all as a condition of submission.

The Price of Safety: Evaluating IOMMU Performance

Muli Ben-Yehuda, IBM Haifa Research Lab
Jimi Xenidis, IBM Research
Michal Ostrowski, IBM Research
Karl Rister, IBM LTC
Alexis Bruemmer, IBM LTC
Leendert Van Doorn, AMD

Abstract

IOMMUs, IO Memory Management Units, are hardware devices that translate device DMA addresses to machine addresses. An isolation capable IOMMU restricts a device so that it can only access parts of memory it has been explicitly granted access to. Isolation capable IOMMUs perform a valuable system service by preventing rogue devices from performing errant or malicious DMAs, thereby substantially increasing the system’s reliability and availability. Without an IOMMU a peripheral device could be programmed to overwrite any part of the system’s memory. Operating systems utilize IOMMUs to isolate device drivers; hypervisors utilize IOMMUs to grant secure direct hardware access to virtual machines. With the imminent publication of the PCI-SIG’s IO Virtualization standard, as well as Intel and AMD’s introduction of isolation capable IOMMUs in all new servers, IOMMUs will become ubiquitous.

Although they provide valuable services, IOMMUs can impose a performance penalty due to the extra memory accesses required to perform DMA operations. The exact performance degradation depends on the IOMMU design, its caching architecture, the way it is programmed and the workload. This paper presents the performance characteristics of the Calgary and DART IOMMUs in Linux, both on bare metal and in a hypervisor environment. The throughput and CPU utilization of several IO workloads, with and without an IOMMU, are measured and the results are analyzed. The potential strategies for mitigating the IOMMU’s costs are then discussed. In conclusion, a set of optimizations and resulting performance improvements are presented.

1 Introduction

An I/O Memory Management Unit (IOMMU) creates one or more unique address spaces which can be used to control how a DMA operation, initiated by a device, accesses host memory. This functionality was originally introduced to increase the addressability of a device or bus, particularly when 64-bit host CPUs were being introduced while most devices were designed to operate in a 32-bit world. The uses of IOMMUs were later extended to restrict the host memory pages that a device can actually access, thus providing an increased level of isolation, protecting the system from user-level device drivers and eventually virtual machines. Unfortunately, this additional logic does impose a performance penalty.

The widespread introduction of IOMMUs by Intel [1] and AMD [2] and the proliferation of virtual machines will make IOMMUs a part of nearly every computer system. There is no doubt with regard to the benefits IOMMUs bring... but how much do they cost? We seek to quantify, analyze, and eventually overcome the performance penalties inherent in the introduction of this new technology.

1.1 IOMMU design

A broad description of current and future IOMMU hardware and software designs from various companies can be found in the OLS ’06 paper entitled Utilizing IOMMUs for Virtualization in Linux and Xen [3]. The design of a system with an IOMMU can be broadly broken down into the following areas:

• IOMMU hardware architecture and design.

• Hardware ↔ software interfaces.

• Pure software interfaces (e.g., between userspace and kernelspace or between kernelspace and hypervisor).

It should be noted that these areas can and do affect each other: the hardware/software interface can dictate some aspects of the pure software interfaces, and the hardware design dictates certain aspects of the hardware/software interfaces.

This paper focuses on two different implementations of the same IOMMU architecture that revolves around the basic concept of a Translation Control Entry (TCE). TCEs are described in detail in Section 1.1.2.

1.1.1 IOMMU hardware architecture and design

Just as a CPU-MMU requires a TLB with a very high hit-rate in order to not impose an undue burden on the system, so does an IOMMU require a translation cache to avoid excessive memory lookups. These translation caches are commonly referred to as IOTLBs.

The performance of the system is affected by several cache-related factors:

• The cache size and associativity [13].

• The cache replacement policy.

• The cache invalidation mechanism and the frequency and cost of invalidations.

The optimal cache replacement policy for an IOTLB is probably significantly different from that for an MMU-TLB. MMU-TLBs rely on spatial and temporal locality to achieve a very high hit-rate. DMA addresses from devices, however, do not necessarily have temporal or spatial locality. Consider for example a NIC which DMAs received packets directly into application buffers: packets for many applications could arrive in any order and at any time, leading to DMAs to wildly disparate buffers. This is in sharp contrast with the way applications access their memory, where both spatial and temporal locality can be observed: memory accesses to nearby areas tend to occur closely together.

Cache invalidation can have an adverse effect on the performance of the system. For example, the Calgary IOMMU (which will be discussed later in detail) does not provide a software mechanism for invalidating a single cache entry; one must flush the entire cache to invalidate an entry. We present a related optimization in Section 4.

It should be mentioned that the PCI-SIG IOV (IO Virtualization) working group is working on an Address Translation Services (ATS) standard. ATS brings in another level of caching, by defining how I/O endpoints (i.e., adapters) inter-operate with the IOMMU to cache translations on the adapter and communicate invalidation requests from the IOMMU to the adapter. This adds another level of complexity to the system, which needs to be overcome in order to find the optimal caching strategy.

1.1.2 Hardware ↔ Software Interface

The main hardware/software interface in the TCE family of IOMMUs is the Translation Control Entry (TCE). TCEs are organized in TCE tables. TCE tables are analogous to page tables in an MMU, and TCEs are similar to page table entries (PTEs). Each TCE identifies a 4KB page of host memory and the access rights that the bus (or device) has to that page. The TCEs are arranged in a contiguous series of host memory pages that comprise the TCE table. The TCE table creates a single unique IO address space (DMA address space) for all the devices that share it.

The translation from a DMA address to a host memory address begins by extracting the page number from the DMA address and using it as an index into the TCE table. The index gives a direct offset into the TCE table, and the TCE at that offset translates that IO page. The access control bits are then used to validate both the translation and the access rights to the host memory page. Finally, the translation is used by the bus to direct a DMA transaction to a specific location in host memory. This process is illustrated in Figure 1.

The TCE architecture can be customized in several ways, resulting in different implementations that are optimized for a specific machine. This paper examines the performance of two TCE implementations. The first one is the Calgary family of IOMMUs, which can be found in IBM’s high-end System x (x86-64 based) servers, and the second one is the DMA Address Relocation Table (DART) IOMMU, which is often paired with PowerPC
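
The DMA-to-host translation described in Section 1.1.2 can be sketched as a small software model in C. This is only an illustration under stated assumptions, not the Calgary or DART entry format: the struct tce layout, the TCE_READ and TCE_WRITE bits, and the tce_translate() helper are hypothetical names introduced here. Only the 4KB page size and the lookup by page number come from the text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TCE_PAGE_SHIFT 12u  /* 4KB IO pages, as stated in the text */
#define TCE_PAGE_MASK  ((UINT64_C(1) << TCE_PAGE_SHIFT) - 1)

/* Hypothetical access-right bits; real TCE formats differ per IOMMU. */
#define TCE_READ  0x1u
#define TCE_WRITE 0x2u

/* Hypothetical TCE: a host page frame number plus access rights. */
struct tce {
    uint64_t host_pfn;
    uint32_t rights;        /* 0 is treated as "no valid translation" */
};

/* A TCE table modelled as a flat array indexed by IO page number. */
struct tce_table {
    const struct tce *entries;
    size_t nr_entries;
};

/*
 * Translate dma_addr for an access that needs the rights in `wanted`.
 * Returns true and stores the host address on success; returns false
 * if the address is outside the table or the access is not permitted.
 */
static bool tce_translate(const struct tce_table *tbl, uint64_t dma_addr,
                          uint32_t wanted, uint64_t *host_addr)
{
    /* The page number of the DMA address is the index into the table. */
    uint64_t index = dma_addr >> TCE_PAGE_SHIFT;

    if (index >= tbl->nr_entries)
        return false;

    const struct tce *e = &tbl->entries[index];

    /* Validate the translation and the requested access rights. */
    if (e->rights == 0 || (e->rights & wanted) != wanted)
        return false;

    /* Host page from the TCE, byte offset carried over from the DMA address. */
    *host_addr = (e->host_pfn << TCE_PAGE_SHIFT) | (dma_addr & TCE_PAGE_MASK);
    return true;
}
```

In this model, a write to DMA address 0x0abc through an entry with host_pfn 0x1234 and write permission resolves to host address 0x1234abc, while a write through a read-only entry is rejected. Real hardware performs the equivalent lookup on each DMA, which is why the IOTLB hit-rate discussed in Section 1.1.1 matters.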