What Every Programmer Should Know About Memory

Ulrich Drepper
Red Hat, Inc.
[email protected]

November 21, 2007

Abstract

As CPU cores become both faster and more numerous, the limiting factor for most programs is now, and will be for some time, memory access. Hardware designers have come up with ever more sophisticated memory handling and acceleration techniques, such as CPU caches, but these cannot work optimally without some help from the programmer. Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer or the caches on CPUs is well understood by most programmers. This paper explains the structure of memory subsystems in use on modern commodity hardware, illustrating why CPU caches were developed, how they work, and what programs should do to achieve optimal performance by utilizing them.

1 Introduction

In the early days computers were much simpler. The various components of a system, such as the CPU, memory, mass storage, and network interfaces, were developed together and, as a result, were quite balanced in their performance. For example, the memory and network interfaces were not (much) faster than the CPU at providing data.

This situation changed once the basic structure of computers stabilized and hardware developers concentrated on optimizing individual subsystems. Suddenly the performance of some components of the computer fell significantly behind and bottlenecks developed. This was especially true for mass storage and memory subsystems which, for cost reasons, improved more slowly relative to other components.

The slowness of mass storage has mostly been dealt with using software techniques: operating systems keep most often used (and most likely to be used) data in main memory, which can be accessed at a rate orders of magnitude faster than the hard disk. Cache storage was added to the storage devices themselves, which requires no changes in the operating system to increase performance.¹ For the purposes of this paper, we will not go into more details of software optimizations for the mass storage access.

¹ Changes are needed, however, to guarantee data integrity when using storage device caches.

Unlike storage subsystems, removing the main memory as a bottleneck has proven much more difficult and almost all solutions require changes to the hardware. Today these changes mainly come in the following forms:

• RAM hardware design (speed and parallelism).
• Memory controller designs.
• CPU caches.
• Direct memory access (DMA) for devices.

For the most part, this document will deal with CPU caches and some effects of memory controller design. In the process of exploring these topics, we will explore DMA and bring it into the larger picture. However, we will start with an overview of the design for today’s commodity hardware. This is a prerequisite to understanding the problems and the limitations of efficiently using memory subsystems. We will also learn about, in some detail, the different types of RAM and illustrate why these differences still exist.

This document is in no way all inclusive and final. It is limited to commodity hardware and further limited to a subset of that hardware. Also, many topics will be discussed in just enough detail for the goals of this paper. For such topics, readers are recommended to find more detailed documentation.

When it comes to operating-system-specific details and solutions, the text exclusively describes Linux. At no time will it contain any information about other OSes. The author has no interest in discussing the implications for other OSes. If the reader thinks s/he has to use a different OS they have to go to their vendors and demand they write documents similar to this one.

One last comment before the start. The text contains a number of occurrences of the term “usually” and other, similar qualifiers. The technology discussed here exists in many, many variations in the real world and this paper only addresses the most common, mainstream versions. It is rare that absolute statements can be made about this technology, thus the qualifiers.

Copyright © 2007 Ulrich Drepper. All rights reserved. No redistribution allowed.

Thanks

I would like to thank Johnray Fuller and the crew at LWN (especially Jonathan Corbet) for taking on the daunting task of transforming the author’s form of English into something more traditional. Markus Armbruster provided a lot of valuable input on problems and omissions in the text.

About this Document

The title of this paper is an homage to David Goldberg’s classic paper “What Every Computer Scientist Should Know About Floating-Point Arithmetic” [12]. This paper is still not widely known, although it should be a prerequisite for anybody daring to touch a keyboard for serious programming.

One word on the PDF: xpdf draws some of the diagrams rather poorly. It is recommended it be viewed with evince or, if really necessary, Adobe’s programs. If you use evince be advised that hyperlinks are used extensively throughout the document even though the viewer does not indicate them like others do.

Document Structure

This document is mostly for software developers. It does not go into enough technical details of the hardware to be useful for hardware-oriented readers. But before we can go into the practical information for developers a lot of groundwork must be laid.

To that end, the second section describes random-access memory (RAM) in technical detail. This section’s content is nice to know but not absolutely critical to be able to understand the later sections. Appropriate back references to the section are added in places where the content is required so that the anxious reader could skip most of this section at first.

The third section goes into a lot of details of CPU cache behavior. Graphs have been used to keep the text from being as dry as it would otherwise be. This content is essential for an understanding of the rest of the document. Section 4 describes briefly how virtual memory is implemented. This is also required groundwork for the rest.

Section 5 goes into a lot of detail about Non Uniform Memory Access (NUMA) systems.

Section 6 is the central section of this paper. It brings together all the previous sections’ information and gives programmers advice on how to write code which performs well in the various situations.
The very impatient reader could start with this section and, if necessary, go back to the earlier sections to freshen up the knowledge of the underlying technology.

Section 7 introduces tools which can help the programmer do a better job. Even with a complete understanding of the technology it is far from obvious where in a non-trivial software project the problems are. Some tools are necessary.

In section 8 we finally give an outlook of technology which can be expected in the near future or which might just simply be good to have.

Reporting Problems

The author intends to update this document for some time. This includes updates made necessary by advances in technology but also to correct mistakes. Readers willing to report problems are encouraged to send email to the author. They are asked to include exact version information in the report. The version information can be found on the last page of the document.

Version 1.0

2 Commodity Hardware Today

It is important to understand commodity hardware because specialized hardware is in retreat. Scaling these days is most often achieved horizontally instead of vertically, meaning today it is more cost-effective to use many smaller, connected commodity computers instead of a few really large and exceptionally fast (and expensive) systems. This is the case because fast and inexpensive network hardware is widely available. There are still situations where the large specialized systems have their place and these systems still provide a business opportunity, but the overall market is dwarfed by the commodity hardware market. Red Hat, as of 2007, expects that for future products, the “standard building blocks” for most data centers will be a computer with up to four sockets, each filled with a quad core CPU that, in the case of Intel CPUs, will be hyper-threaded.² This means the standard system in the data center will have up to 64 virtual processors. Bigger machines will be supported, but the quad socket, quad CPU core case is currently thought to be the sweet spot and most optimizations are targeted for such machines.

² Hyper-threading enables a single processor core to be used for two or more concurrent executions with just a little extra hardware.

Large differences exist in the structure of computers built of commodity parts. That said, we will cover more than 90% of such hardware by concentrating on the most important differences. Note that these technical details tend to change rapidly, so the reader is advised to take the date of this writing into account.

Over the years personal computers and smaller servers standardized on a chipset with two parts: the Northbridge and Southbridge. Figure 2.1 shows this structure.

[Figure 2.1: Structure with Northbridge and Southbridge]

All CPUs (two in the previous example, but there can be more) are connected via a common bus (the Front Side Bus, FSB) to the Northbridge. The Northbridge contains, among other things, the memory controller, and its implementation determines the type of RAM chips used for the computer. Different types of RAM, such as DRAM, Rambus, and SDRAM, require different memory controllers.

To reach all other system devices, the Northbridge must communicate with the Southbridge. The Southbridge, often referred to as the I/O bridge, handles communication with devices through a variety of different buses. Today the PCI, PCI Express, SATA, and USB buses are of most importance, but PATA, IEEE 1394, serial, and parallel ports are also supported by the Southbridge. Older systems had AGP slots which were attached to the Northbridge. This was done for performance reasons related to insufficiently fast connections between the Northbridge and Southbridge. However, today the PCI-E slots are all connected to the Southbridge.

Such a system structure has a number of noteworthy consequences:

• All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge.
• All communication with RAM must pass through the Northbridge.
• The RAM has only a single port.³
• Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge.

³ We will not discuss multi-port RAM in this document as this type of RAM is not found in commodity hardware, at least not in places where the programmer has access to it. It can be found in specialized hardware such as network routers which depend on utmost speed.

A couple of bottlenecks are immediately apparent in this design. One such bottleneck involves access to RAM for devices. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, negatively impacting overall system performance. To work around this problem some devices became capable of direct memory access (DMA). DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost). Today all high-performance devices attached to any of the buses can utilize DMA. While this greatly reduces the workload on the CPU, it also creates contention for the bandwidth of the Northbridge as DMA requests compete with RAM access from the CPUs. This problem, therefore, must be taken into account.

A second bottleneck involves the bus from the Northbridge to the RAM. The exact details of the bus depend on the memory types deployed. On older systems there is only one bus to all the RAM chips, so parallel access is not possible. Recent RAM types require two separate buses (or channels as they are called for DDR2, see page 8) which doubles the available bandwidth. The Northbridge interleaves memory access across the channels. More recent memory technologies (FB-DRAM, for instance) add more channels.

With limited bandwidth available, it is important for performance to schedule memory access in ways that minimize delays. As we will see, processors are much faster and must wait to access memory, despite the use of CPU caches. If multiple hyper-threads, cores, or processors access memory at the same time, the wait times for memory access are even longer. This is also true for DMA operations.

There is more to accessing memory than concurrency, however.
Access patterns themselves also greatly influence the performance of the memory subsystem, especially with multiple memory channels. In section 2.2 we will cover more details of RAM access patterns.

On some more expensive systems, the Northbridge does not actually contain the memory controller. Instead the Northbridge can be connected to a number of external memory controllers (in the following example, four of them).

[Figure 2.2: Northbridge with External Controllers]

The advantage of this architecture is that more than one memory bus exists and therefore total available bandwidth increases. This design also supports more memory. Concurrent memory access patterns reduce delays by simultaneously accessing different memory banks. This is especially true when multiple processors are directly connected to the Northbridge, as in Figure 2.2. For such a design, the primary limitation is the internal bandwidth of the Northbridge, which is phenomenal for this architecture (from Intel).⁴

⁴ For completeness it should be mentioned that such a memory controller arrangement can be used for other purposes such as “memory RAID” which is useful in combination with hotplug memory.

Using multiple external memory controllers is not the only way to increase memory bandwidth. One other increasingly popular way is to integrate memory controllers into the CPUs and attach memory to each CPU. This architecture is made popular by SMP systems based on AMD’s Opteron processor. Figure 2.3 shows such a system. Intel will have support for the Common System Interface (CSI) starting with the Nehalem processors; this is basically the same approach: an integrated memory controller with the possibility of local memory for each processor.

[Figure 2.3: Integrated Memory Controller]

With an architecture like this there are as many memory banks available as there are processors. On a quad-CPU machine the memory bandwidth is quadrupled without the need for a complicated Northbridge with enormous bandwidth. Having a memory controller integrated into the CPU has some additional advantages; we will not dig deeper into this technology here.

There are disadvantages to this architecture, too. First of all, because the machine still has to make all the memory of the system accessible to all processors, the memory is not uniform anymore (hence the name NUMA, Non-Uniform Memory Architecture, for such an architecture). Local memory (memory attached to a processor) can be accessed with the usual speed. The situation is different when memory attached to another processor is accessed. In this case the interconnects between the processors have to be used. To access memory attached to CPU2 from CPU1 requires communication across one interconnect. When the same CPU accesses memory attached to CPU4 two interconnects have to be crossed.

Each such communication has an associated cost. We talk about “NUMA factors” when we describe the extra time needed to access remote memory. The example architecture in Figure 2.3 has two levels for each CPU: immediately adjacent CPUs and one CPU which is two interconnects away. With more complicated machines the number of levels can grow significantly. There are also machine architectures (for instance IBM’s x445 and SGI’s Altix series) where there is more than one type of connection. CPUs are organized into nodes; within a node the time to access the memory might be uniform or have only small NUMA factors. The connection between nodes can be very expensive, though, and the NUMA factor can be quite high.

Commodity NUMA machines exist today and will likely play an even greater role in the future. It is expected that, from late 2008 on, every SMP machine will use NUMA. The costs associated with NUMA make it important to recognize when a program is running on a NUMA machine. In section 5 we will discuss more machine architectures and some technologies the Linux kernel provides for these programs.

Beyond the technical details described in the remainder of this section, there are several additional factors which influence the performance of RAM. They are not controllable by software, which is why they are not covered in this section. The interested reader can learn about some of these factors in section 2.1. They are really only needed to get a more complete picture of RAM technology and possibly to make better decisions when purchasing computers.

The following two sections discuss hardware details at the gate level and the access protocol between the memory controller and the DRAM chips. Programmers will likely find this information enlightening since these details explain why RAM access works the way it does. It is optional knowledge, though, and the reader anxious to get to topics with more immediate relevance for everyday life can jump ahead to section 2.2.5.

2.1 RAM Types

There have been many types of RAM over the years and each type varies, sometimes significantly, from the other. The older types are today really only interesting to the historians. We will not explore the details of those. Instead we will concentrate on modern RAM types; we will only scrape the surface, exploring some details which are visible to the kernel or application developer through their performance characteristics.

The first interesting details are centered around the question why there are different types of RAM in the same machine. More specifically, why are there both static RAM (SRAM⁵) and dynamic RAM (DRAM)? The former is much faster and provides the same functionality. Why is not all RAM in a machine SRAM? The answer is, as one might expect, cost. SRAM is much more expensive to produce and to use than DRAM. Both these cost factors are important, the second one increasing in importance more and more. To understand these differences we look at the implementation of a bit of storage for both SRAM and DRAM.

⁵ In other contexts SRAM might mean “synchronous RAM”.

In the remainder of this section we will discuss some low-level details of the implementation of RAM. We will keep the level of detail as low as possible. To that end, we will discuss the signals at a “logic level” and not at a level a hardware designer would have to use. That level of detail is unnecessary for our purpose here.

2.1.1 Static RAM

[Figure 2.4: 6-T Static RAM]

Figure 2.4 shows the structure of a 6 transistor SRAM cell. The core of this cell is formed by the four transistors M1 to M4 which form two cross-coupled inverters. They have two stable states, representing 0 and 1 respectively. The state is stable as long as power on Vdd is available.

If access to the state of the cell is needed the word access line WL is raised. This makes the state of the cell immediately available for reading on BL and $\overline{\text{BL}}$. If the cell state must be overwritten the BL and $\overline{\text{BL}}$ lines are first set to the desired values and then WL is raised. Since the outside drivers are stronger than the four transistors (M1 through M4) this allows the old state to be overwritten.

See [20] for a more detailed description of the way the cell works. For the following discussion it is important to note that

• one cell requires six transistors. There are variants with four transistors but they have disadvantages.
• maintaining the state of the cell requires constant power.
• the cell state is available for reading almost immediately once the word access line WL is raised. The signal is as rectangular (changing quickly between the two binary states) as other transistor-controlled signals.
• the cell state is stable, no refresh cycles are needed.

There are other, slower and less power-hungry, SRAM forms available, but those are not of interest here since we are looking at fast RAM. These slow variants are mainly interesting because they can be more easily used in a system than dynamic RAM because of their simpler interface.

2.1.2 Dynamic RAM

Dynamic RAM is, in its structure, much simpler than static RAM. Figure 2.5 shows the structure of a usual DRAM cell design. All it consists of is one transistor and one capacitor. This huge difference in complexity of course means that it functions very differently than static RAM.

[Figure 2.5: 1-T Dynamic RAM]

A dynamic RAM cell keeps its state in the capacitor C. The transistor M is used to guard the access to the state. To read the state of the cell the access line AL is raised; this either causes a current to flow on the data line DL or not, depending on the charge in the capacitor. To write to the cell the data line DL is appropriately set and then AL is raised for a time long enough to charge or drain the capacitor.

There are a number of complications with the design of dynamic RAM. The use of a capacitor means that reading the cell discharges the capacitor. The procedure cannot be repeated indefinitely; the capacitor must be recharged at some point. Even worse, to accommodate the huge number of cells (chips with $10^9$ or more cells are now common) the capacitance of the capacitor must be low (in the femto-farad range or lower). A fully charged capacitor holds a few 10’s of thousands of electrons. Even though the resistance of the capacitor is high (a couple of tera-ohms) it only takes a short time for the charge to dissipate. This problem is called “leakage”.

This leakage is why a DRAM cell must be constantly refreshed. For most DRAM chips these days this refresh must happen every 64ms. During the refresh cycle no access to the memory is possible since a refresh is simply a memory read operation where the result is discarded. For some workloads this overhead might stall up to 50% of the memory accesses (see [3]).

A second problem resulting from the tiny charge is that the information read from the cell is not directly usable. The data line must be connected to a sense amplifier which can distinguish between a stored 0 or 1 over the whole range of charges which still have to count as 1.

A third problem is that reading a cell causes the charge of the capacitor to be depleted. This means every read operation must be followed by an operation to recharge the capacitor. This is done automatically by feeding the output of the sense amplifier back into the capacitor. It does mean, though, that reading memory content requires additional energy and, more importantly, time.

A fourth problem is that charging and draining a capacitor is not instantaneous. The signals received by the sense amplifier are not rectangular, so a conservative estimate as to when the output of the cell is usable has to be used. The formulas for charging and discharging a capacitor are

$$Q_{\text{Charge}}(t) = Q_0 \left(1 - e^{-\frac{t}{RC}}\right)$$

$$Q_{\text{Discharge}}(t) = Q_0 \, e^{-\frac{t}{RC}}$$

This means it takes some time (determined by the capacitance C and resistance R) for the capacitor to be charged and discharged. It also means that the current which can be detected by the sense amplifiers is not immediately available. Figure 2.6 shows the charge and discharge curves. The X axis is measured in units of RC (resistance multiplied by capacitance) which is a unit of time.

[Figure 2.6: Capacitor Charge and Discharge Timing. X axis: time in units of RC (1RC to 9RC); Y axis: percentage charge (0 to 100); curves: Charge, Discharge]

Unlike the static RAM case where the output is immediately available when the word access line is raised, it will always take a bit of time until the capacitor discharges sufficiently. This delay severely limits how fast DRAM can be.

The simple approach has its advantages, too. The main advantage is size. The chip real estate needed for one DRAM cell is many times smaller than that of an SRAM cell. The SRAM cells also need individual power for the transistors maintaining the state. The structure of the DRAM cell is also simpler and more regular which means packing many of them close together on a die is simpler.

Overall, the (quite dramatic) difference in cost wins. Except in specialized hardware (network routers, for example) we have to live with main memory which is based on DRAM. This has huge implications on the programmer which we will discuss in the remainder of this paper. But first we need to look into a few more details of the actual use of DRAM cells.

2.1.3 DRAM Access

A program selects a memory location using a virtual address. The processor translates this into a physical address and finally the memory controller selects the RAM chip corresponding to that address. To select the individual memory cell on the RAM chip, parts of the physical address are passed on in the form of a number of address lines.

It would be completely impractical to address memory locations individually from the memory controller: 4GB of RAM would require $2^{32}$ address lines. Instead the address is passed encoded as a binary number using a smaller set of address lines. The address passed to the DRAM chip this way must be demultiplexed first. A demultiplexer with N address lines will have $2^N$ output lines. These output lines can be used to select the memory cell. Using this direct approach is no big problem for chips with small capacities.

But if the number of cells grows this approach is not suitable anymore. A chip with 1Gbit⁶ capacity would need 30 address lines and $2^{30}$ select lines. The size of a demultiplexer increases exponentially with the number of input lines when speed is not to be sacrificed. A demultiplexer for 30 address lines needs a whole lot of chip real estate in addition to the complexity (size and time) of the demultiplexer. Even more importantly, transmitting 30 impulses on the address lines synchronously is much harder than transmitting “only” 15 impulses. Fewer lines have to be laid out at exactly the same length or timed appropriately.⁷

⁶ I hate those SI prefixes. For me a giga-bit will always be $2^{30}$ and not $10^9$ bits.

⁷ Modern DRAM types like DDR3 can automatically adjust the timing but there is a limit as to what can be tolerated.

[Figure 2.7: Dynamic RAM Schematic. Address lines a0 and a1 feed the Row Address Selection, a2 and a3 feed the Column Address Selection, which connects to Data]

Figure 2.7 shows a DRAM chip at a very high level. The DRAM cells are organized in rows and columns. They could all be aligned in one row but then the DRAM chip would need a huge demultiplexer. With the array approach the design can get by with one demultiplexer and one multiplexer of half the size.⁸ This is a huge saving on all fronts. In the example the address lines a0 and a1 through the row address selection ($\overline{\text{RAS}}$)⁹ demultiplexer select the address lines of a whole row of cells. When reading, the content of all cells is thus made available to the column address selection ($\overline{\text{CAS}}$)⁹ multiplexer. Based on the address lines a2 and a3 the content of one column is then made available to the data pin of the DRAM chip. This happens many times in parallel on a number of DRAM chips to produce a total number of bits corresponding to the width of the data bus.

⁸ Multiplexers and demultiplexers are equivalent and the multiplexer here needs to work as a demultiplexer when writing. So we will drop the differentiation from now on.

⁹ The line over the name indicates that the signal is negated.

For writing, the new cell value is put on the data bus and, when the cell is selected using the RAS and CAS, it is stored in the cell. A pretty straightforward design. There are in reality, obviously, many more complications. There need to be specifications for how much delay there is after the signal before the data will be available on the data bus for reading. The capacitors do not unload instantaneously, as described in the previous section. The signal from the cells is so weak that it needs to be amplified. For writing it must be specified how long the data must be available on the bus after the RAS and CAS is done to successfully store the new value in the cell (again, capacitors do not fill or drain instantaneously). These timing constants are crucial for the performance of the DRAM chip. We will talk about this in the next section.

A secondary scalability problem is that having 30 address lines connected to every RAM chip is not feasible either. Pins of a chip are precious resources. It is “bad” enough that the data must be transferred as much as possible in parallel (e.g., in 64 bit batches). The memory controller must be able to address each RAM module (collection of RAM chips). If parallel access to multiple RAM modules is required for performance reasons and each RAM module requires its own set of 30 or more address lines, then the memory controller needs to have, for 8 RAM modules, a whopping 240+ pins only for the address handling.

To counter these secondary scalability problems DRAM chips have, for a long time, multiplexed the address itself. That means the address is transferred in two parts. The first part, consisting of address bits a0 and a1 in the example in Figure 2.7, selects the row. This selection remains active until revoked. Then the second part, address bits a2 and a3, selects the column. The crucial difference is that only two external address lines are needed. A few more lines are needed to indicate when the RAS and CAS signals are available but this is a small price to pay for cutting the number of address lines in half. This address multiplexing brings its own set of problems, though. We will discuss them in section 2.2.

2.1.4 Conclusions

Do not worry if the details in this section are a bit overwhelming. The important things to take away from this section are:

• there are reasons why not all memory is SRAM
• memory cells need to be individually selected to be used
• the number of address lines is directly responsible for the cost of the memory controller, motherboards, DRAM module, and DRAM chip
• it takes a while before the results of the read or write operation are available

The following section will go into more details about the actual process of accessing DRAM memory. We are not going into more details of accessing SRAM, which is usually directly addressed. This happens for speed and because the SRAM memory is limited in size. SRAM is currently used in CPU caches and on-die where the connections are small and fully under control of the CPU designer. CPU caches are a topic which we discuss later but all we need to know is that SRAM cells have a certain maximum speed which depends on the effort spent on the SRAM. The speed can vary from only slightly slower than the CPU core to one or two orders of magnitude slower.
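The address-line arithmetic from section 2.1.3 can be checked with a short Python sketch. It is only an illustration: the function names are hypothetical, and treating the upper half of the address bits as the row and the lower half as the column is an assumption made here, not something the text prescribes.

```python
def address_bits(cells):
    """Address bits needed to select one of `cells` cells (cells >= 1)."""
    return (cells - 1).bit_length()


def select_lines(bits):
    """Output lines of a demultiplexer with `bits` address inputs (2^N)."""
    return 1 << bits


def multiplexed_lines(bits):
    """External address lines when row and column share the same pins."""
    return (bits + 1) // 2


def split_address(addr, bits):
    """Split a cell address into (row, column).

    Assumption: the upper half of the bits selects the row and the
    lower half the column; a real chip may assign the bits differently.
    """
    col_bits = bits // 2
    return addr >> col_bits, addr & ((1 << col_bits) - 1)


if __name__ == "__main__":
    bits = address_bits(2**30)  # 1Gbit chip, counted as 2^30 cells
    print(bits, select_lines(bits), multiplexed_lines(bits))
    print(split_address(0b1011, 4))  # 4-bit address, as in the a0..a3 example
```

For the 1Gbit chip this gives 30 address bits, 2^30 select lines for a single flat demultiplexer, and 15 external lines once the address is sent in two parts, which matches the numbers quoted in the text.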
2.2 DRAM Access Technical Details

In the section introducing DRAM we saw that DRAM chips multiplex the addresses in order to save resources in the form of address pins. We also saw that accessing DRAM cells takes time since the capacitors in those cells do not discharge instantaneously to produce a stable signal; we also saw that DRAM cells must be refreshed. Now it is time to put this all together and see how all these factors determine how the DRAM access has to happen.

We will concentrate on current technology; we will not discuss asynchronous DRAM and its variants as they are simply not relevant anymore. Readers interested in this topic are referred to [3] and [19]. We will also not talk about Rambus DRAM (RDRAM) even though the technology is not obsolete. It is just not widely used for system memory. We will concentrate exclusively on Synchronous DRAM (SDRAM) and its successors Double Data Rate DRAM (DDR).

Synchronous DRAM, as the name suggests, works relative to a time source. The memory controller provides a clock, the frequency of which determines the speed of the Front Side Bus (FSB), the memory controller interface used by the DRAM chips. As of this writing, frequencies of 800MHz, 1,066MHz, or even 1,333MHz are available with higher frequencies (1,600MHz) being announced for the next generation. This does not mean the frequency used on the bus is actually this high. Instead, today’s buses are double- or quad-pumped, meaning that data is transported two or four times per cycle. Higher numbers sell so the manufacturers like to advertise a quad-pumped 200MHz bus as an “effective” 800MHz bus.

For SDRAM today each data transfer consists of 64 bits (8 bytes). The transfer rate of the FSB is therefore 8 bytes multiplied by the effective bus frequency (6.4GB/s for the quad-pumped 200MHz bus). That sounds like a lot but it is the burst speed, the maximum speed which will never be surpassed. As we will see now the protocol for talking to the RAM modules has a lot of downtime when no data can be transmitted. It is exactly this downtime which we must understand and minimize to achieve the best performance.

2.2.1 Read Access Protocol

[Figure 2.8: SDRAM Read Access Timing. Signals: CLK, RAS, CAS, Address (Row Addr, Col Addr), DQ (four Data Out words); intervals: tRCD, CL]

Figure 2.8 shows the activity on some of the connectors of a DRAM module which happens in three differently colored phases. As usual, time flows from left to right. A lot of details are left out. Here we only talk about the bus clock, RAS and CAS signals, and the address and data buses. A read cycle begins with the memory controller making the row address available on the address bus and lowering the RAS signal. All signals are read on the rising edge of the clock (CLK) so it does not matter if the signal is not completely square as long as it is stable at the time it is read. Setting the row address causes the RAM chip to start latching the addressed row.

The CAS signal can be sent after tRCD (RAS-to-CAS Delay) clock cycles. The column address is then transmitted by making it available on the address bus and lowering the CAS line. Here we can see how the two parts of the address (more or less halves, nothing else makes sense) can be transmitted over the same address bus.

Now the addressing is complete and the data can be transmitted. The RAM chip needs some time to prepare for this. The delay is usually called CAS Latency (CL). In Figure 2.8 the CAS latency is 2. It can be higher or lower, depending on the quality of the memory controller, motherboard, and DRAM module. The latency can also have half values. With CL=2.5 the first data would be available at the first falling flank in the blue area.

With all this preparation to get to the data it would be wasteful to only transfer one data word. This is why DRAM modules allow the memory controller to specify how much data is to be transmitted. Often the choice is between 2, 4, or 8 words. This allows filling entire lines in the caches without a new RAS/CAS sequence. It is also possible for the memory controller to send a new CAS signal without resetting the row selection. In this way, consecutive memory addresses can be read from or written to significantly faster because the RAS signal does not have to be sent and the row does not have to be deactivated (see below). Keeping the row “open” is something the memory controller has to decide. Speculatively leaving it open all the time has disadvantages with real-world applications (see [3]). Sending new CAS signals is only subject to the Command Rate of the RAM module (usually specified as Tx, where x is a value like 1 or 2; it will be 1 for high-performance DRAM modules which accept new commands every cycle).

In this example the SDRAM spits out one word per cycle. This is what the first generation does. DDR is able to transmit two words per cycle. This cuts down on the transfer time but does not change the latency. In principle, DDR2 works the same although in practice it looks different. There is no need to go into the details here. It is sufficient to note that DDR2 can be made faster, cheaper, more reliable, and is more energy efficient (see [6] for more information).

2.2.2 Precharge and Activation

Figure 2.8 does not cover the whole cycle. It only shows parts of the full cycle of accessing DRAM. Before a new RAS signal can be sent the currently latched row must be deactivated and the new row must be precharged. We can concentrate here on the case where this is done with an explicit command. There are improvements to the protocol which, in some situations, allow this extra step to be avoided. The delays introduced by precharging still affect the operation, though.

[Figure 2.9: SDRAM Precharge and Activation. Signals: CLK, WE, RAS, CAS, Address (Col Addr, Row Addr, Col Addr), DQ (two Data Out words); intervals: CL, tRP, tRCD]

Figure 2.9 shows the activity starting from one CAS signal to the CAS signal for another row. The data requested with the first CAS signal is available as before, after CL cycles. In the example two words are requested which, on a simple SDRAM, takes two cycles to transmit. Alternatively, imagine four words on a DDR chip.

Even on DRAM modules with a command rate of one the precharge command cannot be issued right away. It is necessary to wait as long as it takes to transmit the data. In this case it takes two cycles. This happens to be [...] only in use two cycles out of seven. Multiply this with the FSB speed and the theoretical 6.4GB/s for an 800MHz bus becomes 1.8GB/s. That is bad and must be avoided. The techniques described in section 6 help to raise this number. But the programmer usually has to do her share.

There is one more timing value for an SDRAM module which we have not discussed. In Figure 2.9 the precharge command was only limited by the data transfer time. Another constraint is that an SDRAM module needs time after a RAS signal before it can precharge another row (denoted as tRAS). This number is usually pretty high, in the order of two or three times the tRP value. This is a problem if, after a RAS signal, only one CAS signal follows and the data transfer is finished in a few cycles. Assume that in Figure 2.9 the initial CAS signal was preceded directly by a RAS signal and that tRAS is 8 cycles. Then the precharge command would have to be delayed by one additional cycle since the sum of tRCD, CL, and tRP (since it is larger than the data transfer time) is only 7 cycles.

DDR modules are often described using a special notation: w-x-y-z-T. For instance: 2-3-2-8-T1. This means:

w   2    CAS Latency (CL)
x   3    RAS-to-CAS delay (tRCD)
y   2    RAS Precharge (tRP)
z   8    Active to Precharge delay (tRAS)
T   T1   Command Rate

There are numerous other timing constants which affect the way commands can be issued and are handled. Those five constants are in practice sufficient to determine the performance of the module, though.

It is sometimes useful to know this information for the computers in use to be able to interpret certain measurements. It is definitely useful to know these details when buying computers since they, along with the FSB and SDRAM module speed, are among the most important factors determining a computer’s speed.

The very adventurous reader could also try to tweak a system.
Sometimes the BIOS allows changing some or the same as CL but that is just a coincidence. The pre- all these values. SDRAM modules have programmable chargesignalhasnodedicatedline;instead,someimple- registerswherethesevaluescanbeset.UsuallytheBIOS mentations issue it by lowering the Write Enable (WE) picks the best default value. If the quality of the RAM and RAS line simultaneously. This combination has no module is high it might be possible to reduce the one usefulmeaningbyitself(see[18]forencodingdetails). ortheotherlatencywithoutaffectingthestabilityofthe computer. Numerous overclocking websites all around OncetheprechargecommandisissuedittakestRP(Row the Internet provide ample of documentation for doing Prechargetime)cyclesuntiltherowcanbeselected. In this. Do it at your own risk, though and do not say you Figure 2.9 much of the time (indicated by the purplish havenotbeenwarned. color) overlaps with the memory transfer (light blue). This is good! But t is larger than the transfer time RP 2.2.3 Recharging andsothenextRASsignalisstalledforonecycle. If we were to continue the timeline in the diagram we Amostly-overlookedtopicwhenitcomestoDRAMac- would find that the next data transfer happens 5 cycles cessisrecharging. Asexplainedinsection2.1.2,DRAM afterthepreviousonestops. Thismeansthedatabusis cellsmustconstantlyberefreshed. Thisdoesnothappen UlrichDrepper Version1.0 9 completely transparently for the rest of the system. At f f f times when a row10 is recharged no access is possible. DRAM I/O Cell The study in [3] found that “[s]urprisingly, DRAM re- Buffer Array freshorganizationcanaffectperformancedramatically”. EachDRAMcellmustberefreshedevery64msaccord- ing to the JEDEC (Joint Electron Device Engineering Figure2.11: DDR1SDRAMOperation Council)specification. 
IfaDRAMarrayhas8,192rows this means the memory controller has to issue a refresh commandonaverageevery7.8125µs(refreshcommands The difference between SDR and DDR1 is, as can be seeninFigure2.11andguessedfromthename,thattwice can be queued so in practice the maximum interval be- the amount of data is transported per cycle. I.e., the tween two requests can be higher). It is the memory DDR1chiptransportsdataontherisingandfallingedge. controller’s responsibility to schedule the refresh com- This is sometimes called a “double-pumped” bus. To mands. The DRAM module keeps track of the address make this possible without increasing the frequency of ofthelastrefreshedrowandautomaticallyincreasesthe the cell array a buffer has to be introduced. This buffer addresscounterforeachnewrequest. holds two bits per data line. This in turn requires that, There is really not much the programmer can do about in the cell array in Figure 2.7, the data bus consists of therefreshandthepointsintimewhenthecommandsare two lines. Implementing this is trivial: one only has to issued. ButitisimportanttokeepthispartoftheDRAM use the same column address for two DRAM cells and lifecycleinmindwheninterpretingmeasurements. Ifa accesstheminparallel. Thechangestothecellarrayto critical word has to be retrieved from a row which cur- implementthisarealsominimal. rently is being refreshed the processor could be stalled TheSDRDRAMswereknownsimplybytheirfrequency for quite a long time. How long each refresh takes de- (e.g.,PC100for100MHzSDR).TomakeDDR1DRAM pendsontheDRAMmodule. sound better the marketers had to come up with a new schemesincethefrequencydidnotchange. Theycame 2.2.4 MemoryTypes with a name which contains the transfer rate in bytes a DDRmodule(theyhave64-bitbusses)cansustain: Itisworthspendingsometimeonthecurrentandsoon- to-be current memory types in use. We will start with SDR(SingleDataRate)SDRAMssincetheyaretheba- 100MHz×64bit×2=1,600MB/s sis of the DDR (Double Data Rate) SDRAMs. 
SDRs wereprettysimple.Thememorycellsandthedatatrans- HenceaDDRmodulewith100MHzfrequencyiscalled ferratewereidentical. PC1600. With 1600 > 100 all marketing requirements arefulfilled;itsoundsmuchbetteralthoughtheimprove- f f mentisreallyonlyafactoroftwo.12 DRAM Cell f 2f 2f Array DRAM I/O Cell Buffer Array Figure2.10: SDRSDRAMOperation Figure2.12: DDR2SDRAMOperation InFigure2.10theDRAMcellarraycanoutputthemem- ory content at the same rate it can be transported over thememorybus. IftheDRAMcellarraycanoperateat To get even more out of the memory technology DDR2 100MHz,thedatatransferrateofthebusofasinglecell includesabitmoreinnovation.Themostobviouschange isthus100Mb/s. Thefrequencyf forallcomponentsis that can be seen in Figure 2.12 is the doubling of the thesame. IncreasingthethroughputoftheDRAMchip frequency of the bus. Doubling the frequency means isexpensivesincetheenergyconsumptionriseswiththe doubling the bandwidth. Since this doubling of the fre- frequency. With a huge number of array cells this is quencyisnoteconomicalforthecellarrayitisnowre- prohibitively expensive.11 In reality it is even more of quiredthattheI/Obuffergetsfourbitsineachclockcy- a problem since increasing the frequency usually also cle which it then can send on the bus. This means the requires increasing the voltage to maintain stability of changestotheDDR2modulesconsistofmakingonlythe the system. DDR SDRAM (called DDR1 retroactively) I/O buffer component of the DIMM capable of running manages to improve the throughput without increasing at higher speeds. This is certainly possible and will not anyoftheinvolvedfrequencies. requiremeasurablymoreenergy,itisjustonetinycom- ponent and not the whole module. The names the mar- 10Rowsarethegranularitythishappenswithdespitewhat[3]and otherliteraturesays(see[18]). 12IwilltakethefactoroftwobutIdonothavetoliketheinflated 11Power=DynamicCapacity×Voltage2×Frequency. numbers. 10 Version1.0 WhatEveryProgrammerShouldKnowAboutMemory
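The transfer-rate arithmetic used in this section (the 6.4GB/s burst rate of the quad-pumped 200MHz FSB, the PC1600 name, and the drop to roughly 1.8GB/s when the data bus is busy only two cycles out of seven) can be reproduced with a few lines of Python. This is only a sketch for checking the numbers: the function names are made up for illustration, and 1MB is taken as 10^6 bytes, which is the convention the marketing names appear to use.

```python
def peak_transfer_rate_mb_s(clock_mhz, bus_width_bits=64, transfers_per_cycle=1):
    """Peak (burst) transfer rate in MB/s, assuming 1 MB = 10**6 bytes."""
    bytes_per_transfer = bus_width_bits // 8
    return clock_mhz * transfers_per_cycle * bytes_per_transfer


def sustained_rate_mb_s(peak_mb_s, busy_cycles, period_cycles):
    """Rate left over when the data bus carries data in only
    busy_cycles out of every period_cycles cycles."""
    return peak_mb_s * busy_cycles / period_cycles


# SDR at 100MHz (PC100): one transfer per cycle.
sdr = peak_transfer_rate_mb_s(100)
# DDR1 at 100MHz: two transfers per cycle, marketed as PC1600.
ddr1 = peak_transfer_rate_mb_s(100, transfers_per_cycle=2)
# Quad-pumped 200MHz FSB, advertised as an "effective" 800MHz bus.
fsb = peak_transfer_rate_mb_s(200, transfers_per_cycle=4)

print(sdr, ddr1, fsb)
# Data bus in use two cycles out of seven, as in the Figure 2.9 discussion.
print(round(sustained_rate_mb_s(fsb, 2, 7)))
```

The first line of output gives the burst rates in MB/s; the second shows how far the precharge and activation delays pull the sustained figure below the advertised one.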

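The w-x-y-z-T notation from section 2.2.2 is easy to take apart mechanically. The sketch below parses such a string and repeats the text's tRAS calculation for the single-access case (one RAS, one CAS, tRP longer than the data transfer). The class and function names are hypothetical, and the delay formula only mirrors the arithmetic given in the text; it is not a general DRAM timing model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timings:
    cl: int            # w: CAS Latency (CL)
    t_rcd: int         # x: RAS-to-CAS delay (tRCD)
    t_rp: int          # y: RAS Precharge (tRP)
    t_ras: int         # z: Active to Precharge delay (tRAS)
    command_rate: int  # T: 1 for "T1", 2 for "T2"


def parse_timings(spec):
    """Parse a string such as "2-3-2-8-T1" into its five constants."""
    w, x, y, z, rate = spec.split("-")
    if not rate.startswith("T"):
        raise ValueError("command rate must look like T1 or T2")
    return Timings(int(w), int(x), int(y), int(z), int(rate[1:]))


def extra_precharge_delay(timings):
    """Cycles by which the precharge is held back because tRAS has not
    yet elapsed, following the text: tRAS - (tRCD + CL + tRP), never
    less than zero."""
    elapsed = timings.t_rcd + timings.cl + timings.t_rp
    return max(0, timings.t_ras - elapsed)


module = parse_timings("2-3-2-8-T1")
print(module)
print(extra_precharge_delay(module))
```

For the example module the sum of tRCD, CL, and tRP is 7 cycles against a tRAS of 8, so the function reports the one additional cycle mentioned in the text.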

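The refresh interval quoted in section 2.2.3 follows directly from the 64ms refresh period and the row count. A minimal check, assuming one refresh command per row as the text does (the function names are illustrative only):

```python
def refresh_command_interval_us(refresh_period_ms=64.0, rows=8192):
    """Average spacing between refresh commands in microseconds,
    assuming each command refreshes exactly one row."""
    return refresh_period_ms * 1000.0 / rows


def refresh_commands_per_second(refresh_period_ms=64.0, rows=8192):
    """Number of refresh commands the memory controller must issue per second."""
    return rows * 1000.0 / refresh_period_ms


print(refresh_command_interval_us())
print(refresh_commands_per_second())
```

With the default values this yields the 7.8125µs average interval from the text. How long the row is unavailable during each refresh is a property of the DRAM module and is not modeled here.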