What Every Programmer Should Know About Memory

Ulrich Drepper
Red Hat, Inc.
drepper@redhat.com
November 21, 2007

Copyright © 2007 Ulrich Drepper
All rights reserved. No redistribution allowed.

Abstract

As CPU cores become both faster and more numerous, the limiting factor for most programs is now, and will be for some time, memory access. Hardware designers have come up with ever more sophisticated memory handling and acceleration techniques – such as CPU caches – but these cannot work optimally without some help from the programmer. Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer or the caches on CPUs is well understood by most programmers. This paper explains the structure of memory subsystems in use on modern commodity hardware, illustrating why CPU caches were developed, how they work, and what programs should do to achieve optimal performance by utilizing them.

1 Introduction

In the early days computers were much simpler. The various components of a system, such as the CPU, memory, mass storage, and network interfaces, were developed together and, as a result, were quite balanced in their performance. For example, the memory and network interfaces were not (much) faster than the CPU at providing data.

This situation changed once the basic structure of computers stabilized and hardware developers concentrated on optimizing individual subsystems. Suddenly the performance of some components of the computer fell significantly behind and bottlenecks developed. This was especially true for mass storage and memory subsystems which, for cost reasons, improved more slowly relative to other components.

The slowness of mass storage has mostly been dealt with using software techniques: operating systems keep most often used (and most likely to be used) data in main memory, which can be accessed at a rate orders of magnitude faster than the hard disk. Cache storage was added to the storage devices themselves, which requires no changes in the operating system to increase performance.¹ For the purposes of this paper, we will not go into more details of software optimizations for the mass storage access.

¹ Changes are needed, however, to guarantee data integrity when using storage device caches.

Unlike storage subsystems, removing the main memory as a bottleneck has proven much more difficult and almost all solutions require changes to the hardware. Today these changes mainly come in the following forms:

• RAM hardware design (speed and parallelism).
• Memory controller designs.
• CPU caches.
• Direct memory access (DMA) for devices.

For the most part, this document will deal with CPU caches and some effects of memory controller design. In the process of exploring these topics, we will explore DMA and bring it into the larger picture. However, we will start with an overview of the design for today’s commodity hardware. This is a prerequisite to understanding the problems and the limitations of efficiently using memory subsystems. We will also learn about, in some detail, the different types of RAM and illustrate why these differences still exist.

This document is in no way all inclusive and final. It is limited to commodity hardware and further limited to a subset of that hardware. Also, many topics will be discussed in just enough detail for the goals of this paper. For such topics, readers are recommended to find more detailed documentation.

When it comes to operating-system-specific details and solutions, the text exclusively describes Linux. At no time will it contain any information about other OSes. The author has no interest in discussing the implications for other OSes. If the reader thinks s/he has to use a different OS they have to go to their vendors and demand they write documents similar to this one.

One last comment before the start. The text contains a number of occurrences of the term “usually” and other, similar qualifiers. The technology discussed here exists
in many, many variations in the real world and this paper only addresses the most common, mainstream versions. It is rare that absolute statements can be made about this technology, thus the qualifiers.

Document Structure

This document is mostly for software developers. It does not go into enough technical details of the hardware to be useful for hardware-oriented readers. But before we can go into the practical information for developers a lot of groundwork must be laid.

To that end, the second section describes random-access memory (RAM) in technical detail. This section’s content is nice to know but not absolutely critical to be able to understand the later sections. Appropriate back references to the section are added in places where the content is required so that the anxious reader could skip most of this section at first.

The third section goes into a lot of details of CPU cache behavior. Graphs have been used to keep the text from being as dry as it would otherwise be. This content is essential for an understanding of the rest of the document.

Section 4 describes briefly how virtual memory is implemented. This is also required groundwork for the rest.

Section 5 goes into a lot of detail about Non Uniform Memory Access (NUMA) systems.

Section 6 is the central section of this paper. It brings together all the previous sections’ information and gives programmers advice on how to write code which performs well in the various situations. The very impatient reader could start with this section and, if necessary, go back to the earlier sections to freshen up the knowledge of the underlying technology.

Section 7 introduces tools which can help the programmer do a better job. Even with a complete understanding of the technology it is far from obvious where in a non-trivial software project the problems are. Some tools are necessary.

In section 8 we finally give an outlook of technology which can be expected in the near future or which might just simply be good to have.

Reporting Problems

The author intends to update this document for some time. This includes updates made necessary by advances in technology but also to correct mistakes. Readers willing to report problems are encouraged to send email to the author. They are asked to include exact version information in the report. The version information can be found on the last page of the document.

Thanks

I would like to thank Johnray Fuller and the crew at LWN (especially Jonathan Corbet) for taking on the daunting task of transforming the author’s form of English into something more traditional. Markus Armbruster provided a lot of valuable input on problems and omissions in the text.

About this Document

The title of this paper is an homage to David Goldberg’s classic paper “What Every Computer Scientist Should Know About Floating-Point Arithmetic” [12]. This paper is still not widely known, although it should be a prerequisite for anybody daring to touch a keyboard for serious programming.

One word on the PDF: xpdf draws some of the diagrams rather poorly. It is recommended it be viewed with evince or, if really necessary, Adobe’s programs. If you use evince be advised that hyperlinks are used extensively throughout the document even though the viewer does not indicate them like others do.
Version 1.0
2 Commodity Hardware Today

It is important to understand commodity hardware because specialized hardware is in retreat. Scaling these days is most often achieved horizontally instead of vertically, meaning today it is more cost-effective to use many smaller, connected commodity computers instead of a few really large and exceptionally fast (and expensive) systems. This is the case because fast and inexpensive network hardware is widely available. There are still situations where the large specialized systems have their place and these systems still provide a business opportunity, but the overall market is dwarfed by the commodity hardware market. Red Hat, as of 2007, expects that for future products, the “standard building blocks” for most data centers will be a computer with up to four sockets, each filled with a quad core CPU that, in the case of Intel CPUs, will be hyper-threaded.² This means the standard system in the data center will have up to 64 virtual processors. Bigger machines will be supported, but the quad socket, quad CPU core case is currently thought to be the sweet spot and most optimizations are targeted for such machines.

² Hyper-threading enables a single processor core to be used for two or more concurrent executions with just a little extra hardware.

Large differences exist in the structure of computers built of commodity parts. That said, we will cover more than 90% of such hardware by concentrating on the most important differences. Note that these technical details tend to change rapidly, so the reader is advised to take the date of this writing into account.

Over the years personal computers and smaller servers standardized on a chipset with two parts: the Northbridge and Southbridge. Figure 2.1 shows this structure.

[Figure 2.1: Structure with Northbridge and Southbridge. Diagram: CPU1 and CPU2 attach via the FSB to the Northbridge, which connects to RAM and to the Southbridge; the Southbridge connects SATA, PCI-E, and USB.]

All CPUs (two in the previous example, but there can be more) are connected via a common bus (the Front Side Bus, FSB) to the Northbridge. The Northbridge contains, among other things, the memory controller, and its implementation determines the type of RAM chips used for the computer. Different types of RAM, such as DRAM, Rambus, and SDRAM, require different memory controllers.

To reach all other system devices, the Northbridge must communicate with the Southbridge. The Southbridge, often referred to as the I/O bridge, handles communication with devices through a variety of different buses. Today the PCI, PCI Express, SATA, and USB buses are of most importance, but PATA, IEEE 1394, serial, and parallel ports are also supported by the Southbridge. Older systems had AGP slots which were attached to the Northbridge. This was done for performance reasons related to insufficiently fast connections between the Northbridge and Southbridge. However, today the PCI-E slots are all connected to the Southbridge.

Such a system structure has a number of noteworthy consequences:

• All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge.
• All communication with RAM must pass through the Northbridge.
• The RAM has only a single port.³
• Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge.

³ We will not discuss multi-port RAM in this document as this type of RAM is not found in commodity hardware, at least not in places where the programmer has access to it. It can be found in specialized hardware such as network routers which depend on utmost speed.

A couple of bottlenecks are immediately apparent in this design. One such bottleneck involves access to RAM for devices. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, negatively impacting overall system performance. To work around this problem some devices became capable of direct memory access (DMA). DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost). Today all high-performance devices attached to any of the buses can utilize DMA. While this greatly reduces the workload on the CPU, it also creates contention for the bandwidth of the Northbridge as DMA requests compete with RAM access from the CPUs. This problem, therefore, must be taken into account.

A second bottleneck involves the bus from the Northbridge to the RAM. The exact details of the bus depend on the memory types deployed. On older systems there is only one bus to all the RAM chips, so parallel access is not possible. Recent RAM types require two separate buses (or channels as they are called for DDR2, see page 8) which doubles the available bandwidth. The Northbridge interleaves memory access across the channels. More recent memory technologies (FB-DRAM, for instance) add more channels.

With limited bandwidth available, it is important for performance to schedule memory access in ways that minimize delays. As we will see, processors are much faster and must wait to access memory, despite the use of CPU caches. If multiple hyper-threads, cores, or processors access memory at the same time, the wait times for memory access are even longer. This is also true for DMA operations.
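The interleaving of memory access across channels mentioned above can be pictured with a small sketch. The text does not say how addresses are mapped to channels, so everything specific here is an assumption made for illustration: two channels, a 64-byte interleave granule, a plain modulo mapping, and the hypothetical helper names `channel_for_address` and `channel_histogram`. Real memory controllers may use different mappings.

```python
# Sketch of interleaving memory accesses across channels.
# ASSUMPTIONS (not from the text): 2 channels, a 64-byte interleave
# granule, and a simple modulo mapping. Real controllers may differ.

def channel_for_address(addr: int, num_channels: int = 2, granule: int = 64) -> int:
    """Return the (hypothetical) channel that serves physical address addr."""
    return (addr // granule) % num_channels


def channel_histogram(addresses, num_channels: int = 2, granule: int = 64):
    """Count how many of the given addresses land on each channel."""
    counts = [0] * num_channels
    for addr in addresses:
        counts[channel_for_address(addr, num_channels, granule)] += 1
    return counts


# A sequential stream touching consecutive 64-byte blocks alternates channels.
sequential = [i * 64 for i in range(8)]
# A stride of 128 bytes keeps hitting the same channel under this mapping.
strided = [i * 128 for i in range(8)]

print(channel_histogram(sequential))  # load is shared by both channels
print(channel_histogram(strided))     # one channel does all the work
```

Under these assumptions the sequential stream spreads evenly over both channels while the strided stream does not, which is one way access patterns can interact with multiple channels.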
There is more to accessing memory than concurrency, however. Access patterns themselves also greatly influence the performance of the memory subsystem, especially with multiple memory channels. In section 2.2 we will cover more details of RAM access patterns.

On some more expensive systems, the Northbridge does not actually contain the memory controller. Instead the Northbridge can be connected to a number of external memory controllers (in the following example, four of them).

[Figure 2.2: Northbridge with External Controllers. Diagram: CPU1 and CPU2 attach to the Northbridge, which connects to four memory controllers (MC1 to MC4), each with its own RAM, and to the Southbridge with SATA, PCI-E, and USB.]

The advantage of this architecture is that more than one memory bus exists and therefore total available bandwidth increases. This design also supports more memory. Concurrent memory access patterns reduce delays by simultaneously accessing different memory banks. This is especially true when multiple processors are directly connected to the Northbridge, as in Figure 2.2. For such a design, the primary limitation is the internal bandwidth of the Northbridge, which is phenomenal for this architecture (from Intel).⁴

⁴ For completeness it should be mentioned that such a memory controller arrangement can be used for other purposes such as “memory RAID” which is useful in combination with hotplug memory.

Using multiple external memory controllers is not the only way to increase memory bandwidth. One other increasingly popular way is to integrate memory controllers into the CPUs and attach memory to each CPU. This architecture is made popular by SMP systems based on AMD’s Opteron processor. Figure 2.3 shows such a system. Intel will have support for the Common System Interface (CSI) starting with the Nehalem processors; this is basically the same approach: an integrated memory controller with the possibility of local memory for each processor.

[Figure 2.3: Integrated Memory Controller. Diagram: CPU1 to CPU4, each with its own locally attached RAM; the Southbridge connects SATA, PCI-E, and USB.]

With an architecture like this there are as many memory banks available as there are processors. On a quad-CPU machine the memory bandwidth is quadrupled without the need for a complicated Northbridge with enormous bandwidth. Having a memory controller integrated into the CPU has some additional advantages; we will not dig deeper into this technology here.

There are disadvantages to this architecture, too. First of all, because the machine still has to make all the memory of the system accessible to all processors, the memory is not uniform anymore (hence the name NUMA - Non-Uniform Memory Architecture - for such an architecture). Local memory (memory attached to a processor) can be accessed with the usual speed. The situation is different when memory attached to another processor is accessed. In this case the interconnects between the processors have to be used. To access memory attached to CPU2 from CPU1 requires communication across one interconnect. When the same CPU accesses memory attached to CPU4 two interconnects have to be crossed.

Each such communication has an associated cost. We talk about “NUMA factors” when we describe the extra time needed to access remote memory. The example architecture in Figure 2.3 has two levels for each CPU: immediately adjacent CPUs and one CPU which is two interconnects away. With more complicated machines the number of levels can grow significantly. There are also machine architectures (for instance IBM’s x445 and SGI’s Altix series) where there is more than one type of connection. CPUs are organized into nodes; within a node the time to access the memory might be uniform or have only small NUMA factors. The connection between nodes can be very expensive, though, and the NUMA factor can be quite high.

Commodity NUMA machines exist today and will likely play an even greater role in the future. It is expected that, from late 2008 on, every SMP machine will use NUMA. The costs associated with NUMA make it important to recognize when a program is running on a NUMA machine. In section 5 we will discuss more machine architectures and some technologies the Linux kernel provides for these programs.

Beyond the technical details described in the remainder of this section, there are several additional factors which influence the performance of RAM. They are not controllable by software, which is why they are not covered in this section. The interested reader can learn about some of these factors in section 2.1. They are really only needed to get a more complete picture of RAM technology and possibly to make better decisions when purchasing computers.
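To make the notion of NUMA levels from the discussion of Figure 2.3 concrete, the following sketch counts interconnect hops between CPUs. The link list is an assumption: the four CPUs are taken to be connected in a square, which is consistent with the statement that CPU1 reaches CPU2 over one interconnect and CPU4 over two. The per-hop penalty in `relative_cost` is a made-up number used only to show how a NUMA factor could be modeled; it is not a measured value.

```python
from collections import deque

# ASSUMPTION: the four CPUs of Figure 2.3 are linked in a square
# (1-2, 1-3, 2-4, 3-4). This matches the text's statement that CPU1
# reaches CPU2 over one interconnect and CPU4 over two.
LINKS = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}


def hops(src: int, dst: int, links=LINKS) -> int:
    """Breadth-first search for the fewest interconnects between two CPUs."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("no path between CPUs")


def relative_cost(src: int, dst: int, per_hop_penalty: float = 0.5) -> float:
    """Hypothetical NUMA factor: 1.0 for local memory plus a penalty per hop."""
    return 1.0 + per_hop_penalty * hops(src, dst)


for cpu in (1, 2, 3, 4):
    print(f"CPU1 -> memory at CPU{cpu}: {hops(1, cpu)} hop(s), "
          f"relative cost {relative_cost(1, cpu):.1f}")
```

With this topology every CPU sees exactly the two remote levels described above: two neighbors one hop away and one CPU two hops away.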
The following two sections discuss hardware details at the gate level and the access protocol between the memory controller and the DRAM chips. Programmers will likely find this information enlightening since these details explain why RAM access works the way it does. It is optional knowledge, though, and the reader anxious to get to topics with more immediate relevance for everyday life can jump ahead to section 2.2.5.

2.1 RAM Types

There have been many types of RAM over the years and each type varies, sometimes significantly, from the other. The older types are today really only interesting to the historians. We will not explore the details of those. Instead we will concentrate on modern RAM types; we will only scrape the surface, exploring some details which are visible to the kernel or application developer through their performance characteristics.

The first interesting details are centered around the question why there are different types of RAM in the same machine. More specifically, why are there both static RAM (SRAM⁵) and dynamic RAM (DRAM)? The former is much faster and provides the same functionality. Why is not all RAM in a machine SRAM? The answer is, as one might expect, cost. SRAM is much more expensive to produce and to use than DRAM. Both these cost factors are important, the second one increasing in importance more and more. To understand these differences we look at the implementation of a bit of storage for both SRAM and DRAM.

⁵ In other contexts SRAM might mean “synchronous RAM”.

In the remainder of this section we will discuss some low-level details of the implementation of RAM. We will keep the level of detail as low as possible. To that end, we will discuss the signals at a “logic level” and not at a level a hardware designer would have to use. That level of detail is unnecessary for our purpose here.

2.1.1 Static RAM

[Figure 2.4: 6-T Static RAM. Diagram: transistors M1 to M6, word access line WL, bit lines BL and \overline{BL}, supply Vdd.]

Figure 2.4 shows the structure of a 6 transistor SRAM cell. The core of this cell is formed by the four transistors M1 to M4 which form two cross-coupled inverters. They have two stable states, representing 0 and 1 respectively. The state is stable as long as power on Vdd is available.

If access to the state of the cell is needed the word access line WL is raised. This makes the state of the cell immediately available for reading on BL and \overline{BL}. If the cell state must be overwritten the BL and \overline{BL} lines are first set to the desired values and then WL is raised. Since the outside drivers are stronger than the four transistors (M1 through M4) this allows the old state to be overwritten.

See [20] for a more detailed description of the way the cell works. For the following discussion it is important to note that

• one cell requires six transistors. There are variants with four transistors but they have disadvantages.
• maintaining the state of the cell requires constant power.
• the cell state is available for reading almost immediately once the word access line WL is raised. The signal is as rectangular (changing quickly between the two binary states) as other transistor-controlled signals.
• the cell state is stable, no refresh cycles are needed.

There are other, slower and less power-hungry, SRAM forms available, but those are not of interest here since we are looking at fast RAM. These slow variants are mainly interesting because they can be more easily used in a system than dynamic RAM because of their simpler interface.

2.1.2 Dynamic RAM

Dynamic RAM is, in its structure, much simpler than static RAM. Figure 2.5 shows the structure of a usual DRAM cell design. All it consists of is one transistor and one capacitor. This huge difference in complexity of course means that it functions very differently than static RAM.

[Figure 2.5: 1-T Dynamic RAM. Diagram: access line AL, data line DL, transistor M, capacitor C.]

A dynamic RAM cell keeps its state in the capacitor C. The transistor M is used to guard the access to the state. To read the state of the cell the access line AL is raised; this either causes a current to flow on the data line DL or not, depending on the charge in the capacitor. To write to the cell the data line DL is appropriately set and then AL is raised for a time long enough to charge or drain the capacitor.

There are a number of complications with the design of dynamic RAM. The use of a capacitor means that reading the cell discharges the capacitor. The procedure cannot be repeated indefinitely, the capacitor must be recharged at some point. Even worse, to accommodate the huge number of cells (chips with 10^9 or more cells are now common) the capacity of the capacitor must be low (in the femto-farad range or lower). A fully charged capacitor holds a few 10’s of thousands of electrons. Even though the resistance of the capacitor is high (a couple of tera-ohms) it only takes a short time for the charge to dissipate. This problem is called “leakage”.
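The orders of magnitude quoted above can be checked with a back-of-the-envelope calculation. The concrete values below are assumptions picked from the ranges the text mentions (5 fF capacitance, 2 tera-ohm resistance) plus an assumed cell voltage of 1.0 V, which the text does not state; real cells differ. The helper names are hypothetical.

```python
# Back-of-the-envelope numbers for a DRAM cell capacitor.
# ASSUMPTIONS (illustrative only): 5 fF capacitance, 1.0 V cell voltage,
# 2 tera-ohm leakage resistance. Only the ranges come from the text.

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs per electron


def stored_electrons(capacitance_f: float, voltage_v: float) -> float:
    """Number of electrons held by a capacitor: Q = C * V, divided by e."""
    return capacitance_f * voltage_v / ELEMENTARY_CHARGE


def rc_time_constant(resistance_ohm: float, capacitance_f: float) -> float:
    """The product R * C has units of seconds and sets the leakage time scale."""
    return resistance_ohm * capacitance_f


electrons = stored_electrons(5e-15, 1.0)
tau = rc_time_constant(2e12, 5e-15)

print(f"stored electrons: {electrons:,.0f}")  # tens of thousands
print(f"RC time constant: {tau * 1e3:.1f} ms")
```

Under these assumed values the cell holds a few tens of thousands of electrons and the leakage time scale is in the millisecond range, which is why the charge cannot simply be left alone.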
This leakage is why a DRAM cell must be constantly refreshed. For most DRAM chips these days this refresh must happen every 64ms. During the refresh cycle no access to the memory is possible since a refresh is simply a memory read operation where the result is discarded. For some workloads this overhead might stall up to 50% of the memory accesses (see [3]).

A second problem resulting from the tiny charge is that the information read from the cell is not directly usable. The data line must be connected to a sense amplifier which can distinguish between a stored 0 or 1 over the whole range of charges which still have to count as 1.

A third problem is that reading a cell causes the charge of the capacitor to be depleted. This means every read operation must be followed by an operation to recharge the capacitor. This is done automatically by feeding the output of the sense amplifier back into the capacitor. It does mean, though, that reading memory content requires additional energy and, more importantly, time.

A fourth problem is that charging and draining a capacitor is not instantaneous. The signals received by the sense amplifier are not rectangular, so a conservative estimate as to when the output of the cell is usable has to be used. The formulas for charging and discharging a capacitor are

    Q_{Charge}(t) = Q_0 (1 - e^{-t/(RC)})
    Q_{Discharge}(t) = Q_0 e^{-t/(RC)}

This means it takes some time (determined by the capacity C and resistance R) for the capacitor to be charged and discharged. It also means that the current which can be detected by the sense amplifiers is not immediately available. Figure 2.6 shows the charge and discharge curves. The X-axis is measured in units of RC (resistance multiplied by capacitance) which is a unit of time.

[Figure 2.6: Capacitor Charge and Discharge Timing]

Unlike the static RAM case where the output is immediately available when the word access line is raised, it will always take a bit of time until the capacitor discharges sufficiently. This delay severely limits how fast DRAM can be.

The simple approach has its advantages, too. The main advantage is size. The chip real estate needed for one DRAM cell is many times smaller than that of an SRAM cell. The SRAM cells also need individual power for the transistors maintaining the state. The structure of the DRAM cell is also simpler and more regular which means packing many of them close together on a die is simpler.

Overall, the (quite dramatic) difference in cost wins. Except in specialized hardware – network routers, for example – we have to live with main memory which is based on DRAM. This has huge implications on the programmer which we will discuss in the remainder of this paper. But first we need to look into a few more details of the actual use of DRAM cells.

2.1.3 DRAM Access

A program selects a memory location using a virtual address. The processor translates this into a physical address and finally the memory controller selects the RAM chip corresponding to that address. To select the individual memory cell on the RAM chip, parts of the physical address are passed on in the form of a number of address lines.

It would be completely impractical to address memory locations individually from the memory controller: 4GB of RAM would require 2^32 address lines. Instead the address is passed encoded as a binary number using a smaller set of address lines. The address passed to the DRAM chip this way must be demultiplexed first. A demultiplexer with N address lines will have 2^N output lines. These output lines can be used to select the memory cell. Using this direct approach is no big problem for chips with small capacities.

⁶ I hate those SI prefixes. For me a giga-bit will always be 2^30 and not 10^9 bits.

But if the number of cells grows this approach is not suitable anymore. A chip with 1Gbit⁶ capacity would need 30 address lines and 2^30 select lines. The size of a demultiplexer increases exponentially with the number of input lines when speed is not to be sacrificed. A demultiplexer for 30 address lines needs a whole lot of chip real estate in addition to the complexity (size and time) of the demultiplexer. Even more importantly, transmitting
30 impulses on the address lines synchronously is much harder than transmitting “only” 15 impulses. Fewer lines have to be laid out at exactly the same length or timed appropriately.⁷

⁷ Modern DRAM types like DDR3 can automatically adjust the timing but there is a limit as to what can be tolerated.

[Figure 2.7: Dynamic RAM Schematic. Diagram: address lines a0 and a1 feed the Row Address Selection, address lines a2 and a3 feed the Column Address Selection, which connects to Data.]

Figure 2.7 shows a DRAM chip at a very high level. The DRAM cells are organized in rows and columns. They could all be aligned in one row but then the DRAM chip would need a huge demultiplexer. With the array approach the design can get by with one demultiplexer and one multiplexer of half the size.⁸ This is a huge saving on all fronts. In the example the address lines a0 and a1 through the row address selection (RAS)⁹ demultiplexer select the address lines of a whole row of cells. When reading, the content of all cells is thus made available to the column address selection (CAS)⁹ multiplexer. Based on the address lines a2 and a3 the content of one column is then made available to the data pin of the DRAM chip. This happens many times in parallel on a number of DRAM chips to produce a total number of bits corresponding to the width of the data bus.

⁸ Multiplexers and demultiplexers are equivalent and the multiplexer here needs to work as a demultiplexer when writing. So we will drop the differentiation from now on.
⁹ The line over the name indicates that the signal is negated.

For writing, the new cell value is put on the data bus and, when the cell is selected using the RAS and CAS, it is stored in the cell. A pretty straightforward design. There are in reality – obviously – many more complications. There need to be specifications for how much delay there is after the signal before the data will be available on the data bus for reading. The capacitors do not unload instantaneously, as described in the previous section. The signal from the cells is so weak that it needs to be amplified. For writing it must be specified how long the data must be available on the bus after the RAS and CAS are done to successfully store the new value in the cell (again, capacitors do not fill or drain instantaneously). These timing constants are crucial for the performance of the DRAM chip. We will talk about this in the next section.

A secondary scalability problem is that having 30 address lines connected to every RAM chip is not feasible either. Pins of a chip are precious resources. It is “bad” enough that the data must be transferred as much as possible in parallel (e.g., in 64 bit batches). The memory controller must be able to address each RAM module (collection of RAM chips). If parallel access to multiple RAM modules is required for performance reasons and each RAM module requires its own set of 30 or more address lines, then the memory controller needs to have, for 8 RAM modules, a whopping 240+ pins only for the address handling.

To counter these secondary scalability problems DRAM chips have, for a long time, multiplexed the address itself. That means the address is transferred in two parts. The first part, consisting of address bits (a0 and a1 in the example in Figure 2.7), selects the row. This selection remains active until revoked. Then the second part, address bits a2 and a3, selects the column. The crucial difference is that only two external address lines are needed. A few more lines are needed to indicate when the RAS and CAS signals are available but this is a small price to pay for cutting the number of address lines in half. This address multiplexing brings its own set of problems, though. We will discuss them in section 2.2.

2.1.4 Conclusions

Do not worry if the details in this section are a bit overwhelming. The important things to take away from this section are:

• there are reasons why not all memory is SRAM
• memory cells need to be individually selected to be used
• the number of address lines is directly responsible for the cost of the memory controller, motherboards, DRAM module, and DRAM chip
• it takes a while before the results of the read or write operation are available

The following section will go into more details about the actual process of accessing DRAM memory. We are not going into more details of accessing SRAM, which is usually directly addressed. This happens for speed and because the SRAM memory is limited in size. SRAM is currently used in CPU caches and on-die where the connections are small and fully under control of the CPU designer. CPU caches are a topic which we discuss later but all we need to know is that SRAM cells have a certain maximum speed which depends on the effort spent on the SRAM. The speed can vary from only slightly slower than the CPU core to one or two orders of magnitude slower.
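The address multiplexing described in section 2.1.3 can be sketched in a few lines. The sketch follows the example of Figure 2.7, where a0 and a1 select the row and a2 and a3 select the column. Treating a0 as the least significant bit is an assumption, and the function names `split_address` and `address_pins` are hypothetical helpers for illustration, not part of any real interface.

```python
# Sketch of DRAM address multiplexing, after the example in Figure 2.7.
# ASSUMPTION: a0 is the least significant address bit, so the low bits
# form the row part (a0, a1) and the next bits the column part (a2, a3).

def split_address(addr: int, row_bits: int, col_bits: int):
    """Split a cell address into the row part and the column part."""
    if not 0 <= addr < (1 << (row_bits + col_bits)):
        raise ValueError("address out of range")
    row = addr & ((1 << row_bits) - 1)
    col = (addr >> row_bits) & ((1 << col_bits) - 1)
    return row, col


def address_pins(total_bits: int, multiplexed: bool) -> int:
    """External address lines needed with or without the two-step transfer."""
    return (total_bits + 1) // 2 if multiplexed else total_bits


# The 4-bit example: a3 a2 a1 a0 = 1 1 0 1
row, col = split_address(0b1101, row_bits=2, col_bits=2)
print(f"row={row}, col={col}")

# The 1Gbit chip from the text needs 30 address bits in total.
print(address_pins(30, multiplexed=False), "address lines without multiplexing")
print(address_pins(30, multiplexed=True), "address lines with multiplexing")
```

The second function only restates the arithmetic from the text: sending the address in two parts halves the number of external address lines, at the price of a few extra control lines that this sketch does not model.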
2.2 DRAM Access Technical Details

In the section introducing DRAM we saw that DRAM chips multiplex the addresses in order to save resources in the form of address pins. We also saw that accessing DRAM cells takes time since the capacitors in those cells do not discharge instantaneously to produce a stable signal; we also saw that DRAM cells must be refreshed. Now it is time to put this all together and see how all these factors determine how the DRAM access has to happen.

[Figure 2.8: SDRAM Read Access Timing. Timing diagram with the signals CLK, RAS, CAS, Address (Row Addr, then Col Addr), and DQ (four Data Out words), with the intervals tRCD and CL marked.]
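The point that capacitors do not charge or discharge instantaneously follows from the formulas given in section 2.1.2. As a minimal sketch, the two functions below evaluate those formulas as fractions of the full charge Q0, with time expressed in units of RC as on the X-axis of Figure 2.6. The function names are hypothetical.

```python
import math


def charge_fraction(t_over_rc: float) -> float:
    """Q_Charge(t) / Q0 = 1 - e^(-t/RC), with t given in units of RC."""
    return 1.0 - math.exp(-t_over_rc)


def discharge_fraction(t_over_rc: float) -> float:
    """Q_Discharge(t) / Q0 = e^(-t/RC), with t given in units of RC."""
    return math.exp(-t_over_rc)


# Tabulate both curves at 1RC .. 9RC, the range shown in Figure 2.6.
for n in range(1, 10):
    print(f"{n}RC: charged to {charge_fraction(n) * 100:5.1f}%, "
          f"discharged to {discharge_fraction(n) * 100:5.1f}%")
```

After one RC the capacitor is only about 63% charged, and it takes several multiples of RC to get close to the final value, which is the reason the access protocol has to include waiting periods.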
We will concentrate on current technology; we will not discuss asynchronous DRAM and its variants as they are simply not relevant anymore. Readers interested in this topic are referred to [3] and [19]. We will also not talk about Rambus DRAM (RDRAM) even though the technology is not obsolete. It is just not widely used for system memory. We will concentrate exclusively on Synchronous DRAM (SDRAM) and its successors Double Data Rate DRAM (DDR).

Synchronous DRAM, as the name suggests, works relative to a time source. The memory controller provides a clock, the frequency of which determines the speed of the Front Side Bus (FSB) – the memory controller interface used by the DRAM chips. As of this writing, frequencies of 800MHz, 1,066MHz, or even 1,333MHz are available with higher frequencies (1,600MHz) being announced for the next generation. This does not mean the frequency used on the bus is actually this high. Instead, today’s buses are double- or quad-pumped, meaning that data is transported two or four times per cycle. Higher numbers sell so the manufacturers like to advertise a quad-pumped 200MHz bus as an “effective” 800MHz bus.

For SDRAM today each data transfer consists of 64 bits – 8 bytes. The transfer rate of the FSB is therefore 8 bytes multiplied by the effective bus frequency (6.4GB/s for the quad-pumped 200MHz bus). That sounds like a lot but it is the burst speed, the maximum speed which will never be surpassed. As we will see now the protocol for talking to the RAM modules has a lot of downtime when no data can be transmitted. It is exactly this downtime which we must understand and minimize to achieve the best performance.

2.2.1 Read Access Protocol

Figure 2.8 shows the activity on some of the connectors of a DRAM module which happens in three differently colored phases. As usual, time flows from left to right. A lot of details are left out. Here we only talk about the bus clock, RAS and CAS signals, and the address and data buses. A read cycle begins with the memory controller making the row address available on the address bus and lowering the RAS signal. All signals are read on the rising edge of the clock (CLK) so it does not matter if the signal is not completely square as long as it is stable at the time it is read. Setting the row address causes the RAM chip to start latching the addressed row.

The CAS signal can be sent after tRCD (RAS-to-CAS Delay) clock cycles. The column address is then transmitted by making it available on the address bus and lowering the CAS line. Here we can see how the two parts of the address (more or less halves, nothing else makes sense) can be transmitted over the same address bus.

Now the addressing is complete and the data can be transmitted. The RAM chip needs some time to prepare for this. The delay is usually called CAS Latency (CL). In Figure 2.8 the CAS latency is 2. It can be higher or lower, depending on the quality of the memory controller, motherboard, and DRAM module. The latency can also have half values. With CL=2.5 the first data would be available at the first falling flank in the blue area.

With all this preparation to get to the data it would be wasteful to only transfer one data word. This is why DRAM modules allow the memory controller to specify how much data is to be transmitted. Often the choice is between 2, 4, or 8 words. This allows filling entire lines in the caches without a new RAS/CAS sequence. It is also possible for the memory controller to send a new CAS signal without resetting the row selection. In this way, consecutive memory addresses can be read from or written to significantly faster because the RAS signal does not have to be sent and the row does not have to be deactivated (see below). Keeping the row “open” is something the memory controller has to decide. Speculatively leaving it open all the time has disadvantages with real-world applications (see [3]). Sending new CAS signals is only subject to the Command Rate of the RAM module (usually specified as Tx, where x is a value like 1 or 2; it will be 1 for high-performance DRAM modules which accept new commands every cycle).

In this example the SDRAM spits out one word per cycle. This is what the first generation does. DDR is able to transmit two words per cycle. This cuts down on the transfer time but does not change the latency. In principle, DDR2 works the same although in practice it looks different. There is no need to go into the details here. It is sufficient to note that DDR2 can be made faster, cheaper, more reliable, and is more energy efficient (see [6] for

only in use two cycles out of seven. Multiply this with the FSB speed and the theoretical 6.4GB/s for an 800MHz bus become 1.8GB/s. That is bad and must be avoided. The techniques described in section 6 help to raise this
moreinformation). number. Buttheprogrammerusuallyhastodohershare.
There is one more timing value for a SDRAM module
2.2.2 PrechargeandActivation
whichwehavenotdiscussed.InFigure2.9theprecharge
commandwasonlylimitedbythedatatransfertime.An-
Figure2.8doesnotcoverthewholecycle. Itonlyshows other constraint is that an SDRAM module needs time
partsofthefullcycleofaccessingDRAM.Beforeanew after a RAS signal before it can precharge another row
RASsignalcanbesentthecurrentlylatchedrowmustbe (denoted as t ). This number is usually pretty high,
RAS
deactivatedandthenewrowmustbeprecharged.Wecan intheorderoftwoorthreetimesthet value. Thisis
RP
concentrate here on the case where this is done with an a problem if, after a RAS signal, only one CAS signal
explicit command. There are improvements to the pro- followsandthedatatransferisfinishedinafewcycles.
tocolwhich,insomesituations,allowsthisextrastepto AssumethatinFigure2.9theinitialCASsignalwaspre-
be avoided. The delays introduced by precharging still cededdirectlybyaRASsignalandthatt is8cycles.
RAS
affecttheoperation,though. Thentheprechargecommandwouldhavetobedelayed
byoneadditionalcyclesincethesumoft ,CL,and
RCD
t (sinceitislargerthanthedatatransfertime)isonly
CLK RP
7cycles.
WE tRP
DDR modules are often described using a special nota-
RAS tion: w-x-y-z-T.Forinstance: 2-3-2-8-T1. Thismeans:
CAS
w 2 CASLatency(CL)
Address Col Row Col x 3 RAS-to-CASdelay(tRCD)
Addr CL Addr tRCD Addr y 2 RASPrecharge(tRP)
z 8 ActivetoPrechargedelay(t )
RAS
DQ Data Data T T1 CommandRate
Out Out
Therearenumerousothertimingconstantswhichaffect
Figure2.9: SDRAMPrechargeandActivation
thewaycommandscanbeissuedandarehandled.Those
five constants are in practice sufficient to determine the
Figure2.9showstheactivitystartingfromoneCASsig- performanceofthemodule,though.
naltotheCASsignalforanotherrow.Thedatarequested
withthefirstCASsignalisavailableasbefore,afterCL It is sometimes useful to know this information for the
cycles. In the example two words are requested which, computersinusetobeabletointerpretcertainmeasure-
on a simple SDRAM, takes two cycles to transmit. Al- ments. Itisdefinitelyusefultoknowthesedetailswhen
ternatively,imaginefourwordsonaDDRchip. buying computers since they, along with the FSB and
SDRAM module speed, are among the most important
Even on DRAM modules with a command rate of one factorsdeterminingacomputer’sspeed.
the precharge command cannot be issued right away. It
is necessary to wait as long as it takes to transmit the The very adventurous reader could also try to tweak a
data. Inthiscaseittakestwocycles. Thishappenstobe system. Sometimes the BIOS allows changing some or
the same as CL but that is just a coincidence. The pre- all these values. SDRAM modules have programmable
chargesignalhasnodedicatedline;instead,someimple- registerswherethesevaluescanbeset.UsuallytheBIOS
mentations issue it by lowering the Write Enable (WE) picks the best default value. If the quality of the RAM
and RAS line simultaneously. This combination has no module is high it might be possible to reduce the one
usefulmeaningbyitself(see[18]forencodingdetails). ortheotherlatencywithoutaffectingthestabilityofthe
computer. Numerous overclocking websites all around
OncetheprechargecommandisissuedittakestRP(Row the Internet provide ample of documentation for doing
Prechargetime)cyclesuntiltherowcanbeselected. In this. Do it at your own risk, though and do not say you
Figure 2.9 much of the time (indicated by the purplish havenotbeenwarned.
color) overlaps with the memory transfer (light blue).
This is good! But t is larger than the transfer time
RP 2.2.3 Recharging
andsothenextRASsignalisstalledforonecycle.
If we were to continue the timeline in the diagram we Amostly-overlookedtopicwhenitcomestoDRAMac-
would find that the next data transfer happens 5 cycles cessisrecharging. Asexplainedinsection2.1.2,DRAM
afterthepreviousonestops. Thismeansthedatabusis cellsmustconstantlyberefreshed. Thisdoesnothappen
UlrichDrepper Version1.0 9
completely transparently for the rest of the system. At times when a row¹⁰ is recharged no access is possible. The study in [3] found that “[s]urprisingly, DRAM refresh organization can affect performance dramatically”.

Each DRAM cell must be refreshed every 64ms according to the JEDEC (Joint Electron Device Engineering Council) specification. If a DRAM array has 8,192 rows this means the memory controller has to issue a refresh command on average every 7.8125µs (refresh commands can be queued so in practice the maximum interval between two requests can be higher). It is the memory controller’s responsibility to schedule the refresh commands. The DRAM module keeps track of the address of the last refreshed row and automatically increases the address counter for each new request.

There is really not much the programmer can do about the refresh and the points in time when the commands are issued. But it is important to keep this part of the DRAM life cycle in mind when interpreting measurements. If a critical word has to be retrieved from a row which currently is being refreshed the processor could be stalled for quite a long time. How long each refresh takes depends on the DRAM module.

2.2.4 Memory Types

It is worth spending some time on the current and soon-to-be current memory types in use. We will start with SDR (Single Data Rate) SDRAMs since they are the basis of the DDR (Double Data Rate) SDRAMs. SDRs were pretty simple. The memory cells and the data transfer rate were identical.

[Block diagram: DRAM cell array (f) connected to the bus (f).]
Figure 2.10: SDR SDRAM Operation

In Figure 2.10 the DRAM cell array can output the memory content at the same rate it can be transported over the memory bus. If the DRAM cell array can operate at 100MHz, the data transfer rate of the bus of a single cell is thus 100Mb/s. The frequency f for all components is the same. Increasing the throughput of the DRAM chip is expensive since the energy consumption rises with the frequency. With a huge number of array cells this is prohibitively expensive.¹¹ In reality it is even more of a problem since increasing the frequency usually also requires increasing the voltage to maintain stability of the system. DDR SDRAM (called DDR1 retroactively) manages to improve the throughput without increasing any of the involved frequencies.

[Block diagram: DRAM cell array (f), I/O buffer (f), bus (f).]
Figure 2.11: DDR1 SDRAM Operation

The difference between SDR and DDR1 is, as can be seen in Figure 2.11 and guessed from the name, that twice the amount of data is transported per cycle. I.e., the DDR1 chip transports data on the rising and falling edge. This is sometimes called a “double-pumped” bus. To make this possible without increasing the frequency of the cell array a buffer has to be introduced. This buffer holds two bits per data line. This in turn requires that, in the cell array in Figure 2.7, the data bus consists of two lines. Implementing this is trivial: one only has to use the same column address for two DRAM cells and access them in parallel. The changes to the cell array to implement this are also minimal.

The SDR DRAMs were known simply by their frequency (e.g., PC100 for 100MHz SDR). To make DDR1 DRAM sound better the marketers had to come up with a new scheme since the frequency did not change. They came up with a name which contains the transfer rate in bytes a DDR module (they have 64-bit buses) can sustain:

    100MHz × 64bit × 2 = 1,600MB/s

Hence a DDR module with 100MHz frequency is called PC1600. With 1600 > 100 all marketing requirements are fulfilled; it sounds much better although the improvement is really only a factor of two.¹²

[Block diagram: DRAM cell array (f), I/O buffer (2f), bus (2f).]
Figure 2.12: DDR2 SDRAM Operation

To get even more out of the memory technology DDR2 includes a bit more innovation. The most obvious change that can be seen in Figure 2.12 is the doubling of the frequency of the bus. Doubling the frequency means doubling the bandwidth. Since this doubling of the frequency is not economical for the cell array it is now required that the I/O buffer gets four bits in each clock cycle which it then can send on the bus. This means the changes to the DDR2 modules consist of making only the I/O buffer component of the DIMM capable of running at higher speeds. This is certainly possible and will not require measurably more energy; it is just one tiny component and not the whole module. The names the mar-
¹⁰ Rows are the granularity this happens with despite what [3] and other literature says (see [18]).
¹¹ Power = Dynamic Capacity × Voltage² × Frequency.
¹² I will take the factor of two but I do not have to like the inflated numbers.
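The arithmetic quoted in this section, together with the relation in footnote 11, can be checked with a few lines of Python. This is only a sketch under stated assumptions: the function and parameter names are invented for illustration and do not come from the text or from any library. It reproduces the 6.4GB/s peak rate of a quad-pumped 200MHz bus, the roughly 1.8GB/s left when the data bus is busy two cycles out of seven, the 7.8125µs average refresh interval, the PC1600 module name, and the dependence of dynamic power on voltage and frequency.

```python
"""Back-of-the-envelope DRAM arithmetic. All names here are illustrative."""


def peak_bandwidth_mb_s(bus_mhz: int, pumps: int, width_bits: int = 64) -> int:
    """Burst rate in MB/s: transfers per microsecond times bytes per transfer."""
    return bus_mhz * pumps * width_bits // 8


def sustained_bandwidth_mb_s(peak_mb_s: float, busy_cycles: int,
                             total_cycles: int) -> float:
    """Scale the peak rate by the fraction of cycles the data bus is in use."""
    return peak_mb_s * busy_cycles / total_cycles


def refresh_interval_us(retention_ms: int, rows: int) -> float:
    """Average spacing of refresh commands so each row is refreshed in time."""
    return retention_ms * 1000 / rows


def ddr1_module_name(cell_mhz: int, width_bits: int = 64) -> str:
    """'PC' followed by the transfer rate in MB/s (two transfers per cycle)."""
    return f"PC{peak_bandwidth_mb_s(cell_mhz, 2, width_bits)}"


def dynamic_power(capacity: float, voltage: float, frequency: float) -> float:
    """Footnote 11: dynamic capacity x voltage^2 x frequency (arbitrary units)."""
    return capacity * voltage ** 2 * frequency


if __name__ == "__main__":
    peak = peak_bandwidth_mb_s(200, 4)             # quad-pumped 200MHz bus
    print(peak)                                    # MB/s, i.e. 6.4GB/s
    print(sustained_bandwidth_mb_s(peak, 2, 7))    # two busy cycles out of seven
    print(refresh_interval_us(64, 8192))           # 64ms spread over 8,192 rows
    print(ddr1_module_name(100))                   # 100MHz DDR1 module
    print(dynamic_power(1.0, 1.0, 2.0))            # doubling f doubles the power
```

Keeping the peak rate in integer MB/s avoids rounding noise in the module name. The sustained figure is left as a float because the two-in-seven ratio does not divide evenly.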