SAFARI Technical Report No. 2016-001 (January 26, 2016)

This is a summary of the original paper, entitled "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture", which appears in HPCA 2013 [37].

Tiered-Latency DRAM (TL-DRAM)

Donghyuk Lee    Yoongu Kim    Vivek Seshadri    Jamie Liu    Lavanya Subramanian    Onur Mutlu
Carnegie Mellon University

Abstract

This paper summarizes the idea of Tiered-Latency DRAM, which was published in HPCA 2013 [37]. The key goal of TL-DRAM is to provide low DRAM latency at low cost, a critical problem in modern memory systems [55]. To this end, TL-DRAM introduces heterogeneity into the design of a DRAM subarray by segmenting the bitlines, thereby creating a low-latency, low-energy, low-capacity portion in the subarray (called the near segment), which is close to the sense amplifiers, and a high-latency, high-energy, high-capacity portion, which is farther away from the sense amplifiers. Thus, DRAM becomes heterogeneous, with a small portion having lower latency and a large portion having higher latency. Various techniques can be employed to take advantage of the low-latency near segment and this new heterogeneous DRAM substrate, including hardware-based caching, software-based caching, and memory allocation of frequently used data in the near segment. Evaluations with simple such techniques show significant performance and energy-efficiency benefits [37].

1 Summary

1.1 The Problem: High DRAM Latency

Primarily due to its low cost-per-bit, DRAM has long been the substrate of choice for architecting main memory subsystems. In fact, DRAM's cost-per-bit has been decreasing at a rapid rate as DRAM process technology scales to integrate ever more cells into the same die area. As a result, each successive generation of DRAM has enabled increasingly large-capacity main memory subsystems at low cost.

In stark contrast to the continued scaling of cost-per-bit, the latency of DRAM has remained almost constant. During the same 11-year interval in which DRAM's cost-per-bit decreased by a factor of 16, DRAM latency (as measured by the tRCD and tRC timing constraints) decreased by only 30.5% and 26.3% [5, 25], as shown in Figure 1 of our paper [37]. From the perspective of the processor, an access to DRAM takes hundreds of cycles, time during which the processor may be stalled, waiting for DRAM. Such wasted time leads to large performance degradations.

1.2 Key Observations and Our Goal

Bitline: Dominant Source of Latency. In DRAM, each bit is represented as electrical charge in a capacitor-based cell. The small size of this capacitor necessitates the use of an auxiliary structure, called a sense-amplifier, to detect the small amount of charge held by the cell and amplify it to a full digital logic value. However, a sense-amplifier is approximately one hundred times larger than a cell [61]. To amortize their large size, each sense-amplifier is connected to many DRAM cells through a wire called a bitline.

Every bitline has an associated parasitic capacitance whose value is proportional to the length of the bitline. Unfortunately, such parasitic capacitance slows down DRAM operation for two reasons. First, it increases the latency of the sense-amplifiers: when the parasitic capacitance is large, a cell cannot quickly create a voltage perturbation on the bitline that could be easily detected by the sense-amplifier. Second, it increases the latency of charging and precharging the bitlines: the cell and the bitline must be restored to their quiescent voltages during and after an access to a cell, and this procedure takes much longer when the parasitic capacitance is large. Due to these reasons and a detailed latency breakdown (refer to our HPCA-19 paper [37]), we conclude that long bitlines are the dominant source of DRAM latency [22, 70, 51, 52].

Latency vs. Cost Trade-Off. The bitline length is a key design parameter that exposes the important trade-off between latency and die-size (cost). Short bitlines (few cells per bitline) constitute a small electrical load (parasitic capacitance), which leads to low latency. However, they require more sense-amplifiers for a given DRAM capacity (Figure 1a), which leads to a large die-size. In contrast, long bitlines have high latency and a small die-size (Figure 1b). As a result, neither of these two approaches can optimize for both latency and cost-per-bit.

[Figure 1. DRAM: Latency vs. Cost Optimized, Our Proposal. (a) Latency-optimized: short bitlines, many sense-amplifiers. (b) Cost-optimized: long bitlines, few sense-amplifiers. (c) Our proposal: a long bitline segmented by an isolation transistor.]

Figure 2 shows the trade-off between DRAM latency and die-size by plotting the latency (tRCD and tRC) and the die-size for different values of cells-per-bitline. Existing DRAM architectures are either optimized for die-size (commodity DDR3 [64, 50]) and are thus low cost but high latency, or optimized for latency (RLDRAM [49], FCRAM [65]) and are thus low latency but high cost.

[Figure 2. Bitline Length: Latency vs. Die-Size. Normalized die-size per chip (y-axis) vs. latency in ns (x-axis) for 16 to 512 cells-per-bitline; RLDRAM and FCRAM fall in the low-latency/large-die region, DDR3 in the low-cost/high-latency region.]

The goal of our paper [37] is to design a new DRAM architecture that approximates the best of both worlds (i.e., low latency and low cost), based on the key observation that long bitlines are the dominant source of DRAM latency.

1.3 Tiered-Latency DRAM

To achieve the latency advantage of short bitlines and the cost advantage of long bitlines, we propose the Tiered-Latency DRAM (TL-DRAM) architecture, which is shown in Figures 1c and 3a. The key idea of TL-DRAM is to divide the long bitline into two shorter segments using an isolation transistor: the near segment (connected directly to the sense-amplifier) and the far segment (connected through the isolation transistor).

[Figure 3. TL-DRAM: Near vs. Far Segments. (a) Organization. (b) Near segment access: isolation transistor off. (c) Far segment access: isolation transistor on.]

The primary role of the isolation transistor is to electrically decouple the two segments from each other. This changes the effective bitline length (and also the effective bitline capacitance) as seen by the cell and sense-amplifier. Correspondingly, the latency to access a cell also changes, albeit differently depending on whether the cell is in the near or the far segment.

When accessing a cell in the near segment, the isolation transistor is turned off, disconnecting the far segment (Figure 3b). Since the cell and the sense-amplifier see only the reduced bitline capacitance of the shortened near segment, they can drive the bitline voltage more easily. As a result, the bitline voltage is restored more quickly, so the latency (tRC) for the near segment is significantly reduced. On the other hand, when accessing a cell in the far segment, the isolation transistor is turned on to connect the entire length of the bitline to the sense-amplifier. In this case, the isolation transistor acts like a resistor inserted between the two segments (Figure 3c) and limits how quickly charge flows to the far segment. Because the far segment capacitance is charged more slowly, it takes longer for the far segment voltage to be restored, so the latency (tRC) is increased for cells in the far segment.

Latency, Power, and Die-Area. Table 1 summarizes the latency, power, and die-area characteristics of TL-DRAM compared to other DRAMs, estimated using circuit-level SPICE simulation [56] and power/area models from Rambus [61]. Compared to commodity DRAM (long bitlines), which incurs high latency (tRC) for all cells, TL-DRAM offers significantly reduced latency (tRC) for cells in the near segment, while increasing the latency for cells in the far segment due to the additional resistance of the isolation transistor. In DRAM, a large fraction of the power is consumed by the bitlines. Since the near segment in TL-DRAM has a lower capacitance, accesses to the near segment consume less power. On the other hand, accesses to the far segment require toggling the isolation transistors, leading to increased power consumption. Mainly due to the additional isolation transistors, TL-DRAM increases die-area by 3%. Our paper includes detailed circuit-level analyses of TL-DRAM (Section 4 of [37]).

                       Short Bitline    Long Bitline     Segmented Bitline (Fig. 1c)
                       (Fig. 1a)        (Fig. 1b)        Near           Far
                       Unsegmented      Unsegmented
Length (Cells)         32               512              32             480
Latency (tRC)          Low (23.1 ns)    High (52.5 ns)   Low (23.1 ns)  Higher (65.8 ns)
Normalized Power       Low (0.51)       High (1.00)      Low (0.51)     Higher (1.49)
Normalized Die-Size    High (3.76)      Lower (1.00)     Low (1.03)
(Cost)
Table 1. Latency, Power, and Die-Area Comparison

1.4 Leveraging TL-DRAM

TL-DRAM enables the design of many new memory management policies that exploit the asymmetric latency characteristics of the near and the far segments. Our HPCA-19 paper (in Section 5) describes four ways of taking advantage of TL-DRAM. Here, we describe two approaches in particular.

In the first approach, the memory controller uses the near segment as a hardware-managed cache for the far segment. In our HPCA-19 paper [37], we discuss three policies for managing the near segment cache. (The three policies differ in deciding when a row in the far segment is cached into the near segment and when it is evicted.) In addition, we propose a new data transfer mechanism (Inter-Segment Data Transfer) that efficiently migrates data between the segments by taking advantage of the fact that the bitline is a bus connected to the cells in both segments. Using this technique, the data in the source row can be transferred to the destination row over the bitlines at very low latency (an additional 4 ns over tRC). Furthermore, this Inter-Segment Data Transfer happens exclusively within the DRAM bank without utilizing the DRAM channel, allowing concurrent accesses to other banks.

In the second approach, the near segment capacity is exposed to the OS, enabling the OS to use the full DRAM capacity. We propose two concrete mechanisms: one where the memory controller uses an additional layer of indirection to map frequently accessed pages to the near segment, and another where the OS uses static/dynamic profiling to directly map frequently accessed pages to the near segment. In both mechanisms, accesses to pages that are mapped to the near segment are served faster and with lower power than in conventional DRAM, resulting in improved system performance and energy efficiency.
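The first approach can be sketched as a toy model. The sketch below is an illustrative simplification, not the paper's actual caching policies (which are described in [37]): it assumes a simple LRU policy over a 32-row near segment and uses the tRC values from Table 1 plus the 4 ns inter-segment transfer overhead quoted above; queuing delays and other timing parameters are ignored.

```python
# Toy model: the memory controller uses the 32-row near segment as an
# LRU cache for far-segment rows. Latency numbers (ns) are from Table 1;
# the +4 ns migration overhead is the Inter-Segment Data Transfer cost.
# This is an illustrative sketch, NOT the paper's Benefit-Based Caching.
from collections import OrderedDict

T_RC_NEAR = 23.1   # near-segment tRC (Table 1)
T_RC_FAR = 65.8    # far-segment tRC (Table 1)
T_MIGRATE = 4.0    # additional cost of an in-bank inter-segment transfer

def run(accesses, near_rows=32):
    """Return total latency (ns) for a sequence of row IDs."""
    cache = OrderedDict()              # rows currently in the near segment
    total = 0.0
    for row in accesses:
        if row in cache:               # hit: fast near-segment access
            cache.move_to_end(row)
            total += T_RC_NEAR
        else:                          # miss: far access, then migrate row in
            total += T_RC_FAR + T_MIGRATE
            cache[row] = True
            if len(cache) > near_rows:
                cache.popitem(last=False)   # evict least-recently-used row
    return total

# A reuse-heavy stream mostly pays near-segment latency after warm-up.
hot = [r for _ in range(100) for r in range(8)]
print(run(hot) / len(hot), "ns average")
```

For this reuse-heavy stream the average latency converges toward the 23.1 ns near-segment tRC, well below the 52.5 ns of commodity DRAM, which is the intuition behind using the near segment as a cache.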
1.5 Results: Performance and Power

Our HPCA-19 paper [37] provides extensive detail about both of the above approaches. However, due to space constraints, we present the evaluation results of only the first approach, in which the near segment is used as a hardware-managed cache under our best policy (Benefit-Based Caching), to show the advantage of our TL-DRAM substrate.

Performance & Power Analysis. Figure 4 shows the average performance improvement and power efficiency of our proposed mechanism over the baseline with conventional DRAM, on 1-, 2-, and 4-core systems. As described in Section 1.3, access latency and power consumption are significantly lower for near segment accesses, but higher for far segment accesses, compared to accesses in conventional DRAM. We observe that a large fraction (over 90% on average) of requests hit in the rows cached in the near segment, thereby accessing the near segment with low latency and low power consumption. As a result, TL-DRAM achieves significant performance improvements of 12.8%/12.3%/11.0% and power savings of 23.6%/26.4%/28.6% in 1-/2-/4-core systems, respectively.

[Figure 4. (a) IPC Improvement and (b) Power Consumption, for core counts of 1, 2, and 4 (with an equal number of channels).]
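A back-of-the-envelope check makes the reported hit rate's effect concrete. Using only the tRC values from Table 1 and a 90% near-segment hit rate, the expected access latency is well below that of commodity DRAM; this is an illustrative simplification that ignores migration overheads, queuing delays, and other timing parameters.

```python
# Expected tRC under a ~90% near-segment hit rate ("over 90% on average"),
# using Table 1's latency numbers. Illustrative only: migration costs,
# queuing, and other timing constraints (e.g., tRCD) are ignored.
T_NEAR = 23.1       # ns, near-segment tRC (Table 1)
T_FAR = 65.8        # ns, far-segment tRC (Table 1)
T_BASELINE = 52.5   # ns, commodity long-bitline DRAM tRC (Table 1)
hit_rate = 0.90

effective = hit_rate * T_NEAR + (1 - hit_rate) * T_FAR
reduction = 1 - effective / T_BASELINE
print(f"effective tRC = {effective:.2f} ns, {reduction:.0%} below commodity DRAM")
```

The resulting effective tRC of roughly 27 ns (about 48% below the 52.5 ns baseline) is consistent with the large performance and power improvements reported above.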
Sensitivity to Near Segment Capacity. The number of rows in the near segment presents a trade-off, since increasing the near segment's size increases its capacity but also increases its access latency. Figure 5 shows the performance improvement of our proposed mechanisms over the baseline as we vary the near segment size. Initially, performance improves as the number of rows in the near segment grows, since more data can be cached. However, increasing the number of rows in the near segment beyond 32 reduces the performance benefit due to the increased capacitance.

[Figure 5. Effect of Varying Near Segment Capacity]

Other Results. In our HPCA-19 paper, we provide a detailed analysis of how timing parameters and power consumption vary when varying the near segment length, in Sections 4 and 6.3, respectively. We also provide a comprehensive evaluation of the mechanisms we build on top of the TL-DRAM substrate for single- and multi-core systems in Section 8.

All of our results are gathered using an in-house version of Ramulator [31], an open-source DRAM simulator [30], which is integrated into an in-house processor simulator.

2 Significance

2.1 Novelty

To our knowledge, our HPCA-19 paper is the first to enable latency heterogeneity in DRAM without significantly increasing cost-per-bit and to propose hardware/software mechanisms that leverage this latency heterogeneity to improve system performance. We make the following major contributions.

A Cost-Efficient Low-Latency DRAM. Based on the key observation that long internal wires (bitlines) are the dominant source of DRAM latency, we propose a new DRAM architecture called Tiered-Latency DRAM (TL-DRAM). To our knowledge, this is the first work to enable low-latency DRAM without significantly increasing the cost-per-bit. By adding a single isolation transistor to each bitline, we carve out a region within a DRAM chip, called the near segment, that is fast and energy-efficient. This comes at a modest overhead of a 3% increase in DRAM die-area. While there are two prior approaches to reducing DRAM latency (using short bitlines [49, 65], or adding an SRAM cache in DRAM [20, 18, 16, 84]), both significantly increase die-area due to additional sense-amplifiers or additional area for the SRAM cache, as we evaluate in our paper [37]. Compared to these prior approaches, TL-DRAM is a much more cost-effective architecture for achieving low latency.

There are many works that reduce overall memory access latency by modifying DRAM, the DRAM-controller interface, and DRAM controllers. These works enable more parallelism and bandwidth [29, 10, 66, 40], reduce refresh counts [42, 43, 26, 79, 60], accelerate bulk operations [66, 68, 69, 11], accelerate computation in the logic layer of 3D-stacked DRAM [2, 1, 83, 17], enable better communication between the CPU and other devices through DRAM [39], leverage process variation and temperature dependency in DRAM [38], leverage DRAM access patterns [19], reduce write-related latencies by better designing DRAM and DRAM control policies [13, 36, 67], and reduce overall queuing latencies in DRAM by better scheduling memory requests [53, 54, 27, 28, 75, 73, 21, 78]. Our proposal is orthogonal to all of these approaches and can be applied in conjunction with them to achieve even higher latency and energy benefits.

Inter-Segment Data Transfer. By implementing latency heterogeneity within a DRAM subarray, TL-DRAM enables efficient data transfer between the fast and slow segments by utilizing the bitlines as a wide bus. This mechanism takes advantage of the fact that both the source and destination cells share the same bitlines. Furthermore, this inter-segment migration happens only within a DRAM bank and does not utilize the DRAM channel, thereby allowing concurrent accesses to other banks over the channel. This inter-segment data transfer enables fast and efficient movement of data within DRAM, which in turn enables efficient ways of taking advantage of latency heterogeneity.

Son et al. propose a low-latency DRAM architecture [71] that has fast (short bitline) and slow (long bitline) subarrays in DRAM. This approach provides the largest benefit when latency-critical data is allocated to the low-latency regions (the low-latency subarrays); therefore, overall memory system performance is sensitive to the page placement policy. However, our inter-segment data transfer enables efficient relocation of pages, leading to dynamic page placement based on the latency criticality of each page.
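The dynamic page placement that inter-segment transfer enables can be sketched as a periodic re-planning loop. The epoch structure, access counters, and promotion rule below are illustrative assumptions for exposition, not a mechanism specified in the paper; only the cheap in-bank transfer cost and the 32-row near segment come from the text above.

```python
# Sketch of dynamic row placement enabled by low-cost inter-segment
# transfer: each epoch, promote the most-accessed rows into the near
# segment. The counters, epoch, and ranking rule are illustrative
# assumptions; the 4 ns in-bank transfer cost is quoted in the text.
def replan(access_counts, near_capacity=32):
    """Choose which rows should occupy the near segment next epoch."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    return set(ranked[:near_capacity])

def migration_cost_ns(old_plan, new_plan, t_transfer_ns=4.0):
    """Each row entering the near segment pays one cheap in-bank transfer."""
    return len(new_plan - old_plan) * t_transfer_ns

# 480 far-segment rows; rows 0-39 are "hot" this epoch.
counts = {row: (1000 if row < 40 else 1) for row in range(480)}
plan = replan(counts)
print(sorted(plan)[:5], migration_cost_ns(set(), plan))  # → [0, 1, 2, 3, 4] 128.0
```

Because migrating a row costs only a few nanoseconds and no channel bandwidth, such re-planning can be frequent, which is why placement-sensitive designs like [71] benefit from the TL-DRAM-style transfer mechanism.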
2.2 Potential Long-Term Impact

Tolerating High DRAM Latency by Enabling New Layers in the Memory Hierarchy. Today, there is a large latency cliff between the on-chip last-level cache and off-chip DRAM, leading to a large performance fall-off when applications start missing in the last-level cache. By introducing an additional fast layer (the near segment) within DRAM itself, TL-DRAM smoothens this latency cliff.

Note that many recent works added a DRAM cache or created heterogeneous main memories [33, 35, 59, 47, 81, 62, 57, 48, 44, 12, 63, 41, 14] to smooth the latency cliff between the last-level cache and a longer-latency non-volatile main memory, e.g., Phase Change Memory [33, 35, 59], or to exploit the strengths of multiple different types of memories to optimize for multiple metrics. Our approach is similar at a high level (i.e., reduce the latency cliff at low cost by taking advantage of heterogeneity), yet we introduce the new low-latency layer within DRAM itself instead of adding a completely separate device.

Applicability to Future Memory Devices. We show the benefits of TL-DRAM's asymmetric latencies. Considering that most memory devices adopt a similar cell organization (i.e., a 2-dimensional cell array and row/column bus connections), our approach of reducing the electrical load of connecting to a bus (bitline) to achieve low access latency can be applicable to other memory devices.

Furthermore, the idea of performing inter-segment data transfer can also potentially be applied to other memory devices, regardless of the memory technology. For example, we believe it is promising to examine similar approaches for emerging memory technologies like Phase Change Memory [33, 59, 58, 46, 82, 34] or STT-MRAM [32, 80], as well as the NAND flash memory technology [45, 8, 9, 7, 6].

New Research Opportunities. The TL-DRAM substrate creates new opportunities by enabling mechanisms that can leverage the latency heterogeneity offered by the substrate. We briefly describe three directions, but we believe many new possibilities abound.

• New ways of leveraging TL-DRAM. TL-DRAM is a substrate that can be utilized for many applications. Although we describe two major ways of leveraging TL-DRAM in our HPCA-19 paper, we believe there are several more ways to leverage the TL-DRAM substrate both in hardware and software. For instance, new mechanisms could be devised to detect data that is latency-critical (e.g., data that causes many threads to become serialized [15, 77, 23, 76, 24], or data that belongs to threads that are more latency-sensitive [27, 28, 72, 78, 3, 4, 73, 75, 74]) or that could become latency-critical in the near future, and allocate/prefetch such data into the near segment.

• Opening up new design spaces with multiple tiers. TL-DRAM can be easily extended to have multiple latency tiers by adding more isolation transistors to the bitlines, providing more latency asymmetry. (Our HPCA-19 paper provides an analysis of the latency of a TL-DRAM design with three tiers, showing the spread in latency across the three tiers.) This enables new mechanisms, both in hardware and software, that can allocate data appropriately to different tiers based on access characteristics such as locality, criticality, etc.

• Inspiring new ways of architecting latency heterogeneity within DRAM. To our knowledge, TL-DRAM is the first to enable latency heterogeneity within DRAM without significantly modifying the existing DRAM architecture. We believe that this could inspire research on other possible ways of architecting latency heterogeneity within DRAM or other memory devices.

References

[1] J. Ahn et al. A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing. In ISCA, 2015.
[2] J. Ahn et al. PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture. In ISCA, 2015.
[3] R. Ausavarungnirun et al. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems. In ISCA, 2012.
[4] R. Ausavarungnirun et al. Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance. In PACT, 2015.
[5] S. Borkar and A. A. Chien. The Future of Microprocessors. In CACM, 2011.
[6] Y. Cai et al. Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation. In ICCD, 2013.
[7] Y. Cai et al. Neighbor-cell Assisted Error Correction for MLC NAND Flash Memories. In SIGMETRICS, 2014.
[8] Y. Cai et al. Data Retention in MLC NAND Flash Memory: Characterization, Optimization, and Recovery. In HPCA, 2015.
[9] Y. Cai et al. Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery. In DSN, 2015.
[10] K. K. Chang et al. Improving DRAM Performance by Parallelizing Refreshes with Accesses. In HPCA, 2014.
[11] K. K. Chang et al. Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM. In HPCA, 2016.
[12] N. Chatterjee et al. Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access. In MICRO, 2012.
[13] N. Chatterjee et al. Staged Reads: Mitigating the Impact of DRAM Writes on DRAM Reads. In HPCA, 2012.
[14] G. Dhiman et al. PDRAM: A Hybrid PRAM and DRAM Main Memory System. In DAC, 2009.
[15] E. Ebrahimi et al. Parallel Application Memory Scheduling. In MICRO, 2011.
[16] Enhanced Memory Systems. Enhanced SDRAM SM2604, 2002.
[17] Q. Guo et al. 3D-Stacked Memory-Side Acceleration: Accelerator and System Design. In WoNDP, 2013.
[18] C. A. Hart. CDRAM in a Unified Memory Architecture. In Compcon Spring '94, Digest of Papers, 1994.
[19] H. Hassan et al. ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality. In HPCA, 2016.
[20] H. Hidaka et al. The Cache DRAM Architecture: A DRAM with an On-Chip Cache Memory. In IEEE Micro, 1990.
[21] E. Ipek et al. Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. In ISCA, 2008.
[22] JEDEC. DDR3 SDRAM STANDARD. http://www.jedec.org/standards-documents/docs/jesd-79-3d, 2010.
[23] J. A. Joao et al. Bottleneck Identification and Scheduling in Multithreaded Applications. In ASPLOS, 2012.
[24] J. A. Joao et al. Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs. In ISCA, 2013.
[25] T. S. Jung. Memory Technology and Solutions Roadmap. http://www.sec.co.kr/images/corp/ir/irevent/techforum_01.pdf, 2005.
[26] S. Khan et al. The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study. In SIGMETRICS, 2014.
[27] Y. Kim et al. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers. In HPCA, 2010.
[28] Y. Kim et al. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In MICRO, 2010.
[29] Y. Kim et al. A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM. In ISCA, 2012.
[30] Y. Kim et al. Ramulator source code. https://github.com/CMU-SAFARI/ramulator, 2015.
[31] Y. Kim, W. Yang, and O. Mutlu. Ramulator: A Fast and Extensible DRAM Simulator. In IEEE CAL, 2015.
[32] E. Kultursay et al. Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative. In ISPASS, 2013.
[33] B. C. Lee et al. Architecting Phase Change Memory As a Scalable DRAM Alternative. In ISCA, 2009.
[34] B. C. Lee et al. Phase Change Memory Architecture and the Quest for Scalability. In CACM, 2010.
[35] B. C. Lee et al. Phase-Change Technology and the Future of Main Memory. In IEEE Micro, 2010.
[36] C. J. Lee et al. DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems. In UT Tech Report TR-HPS-2010-002, 2010.
[37] D. Lee et al. Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture. In HPCA, 2013.
[38] D. Lee et al. Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case. In HPCA, 2015.
[39] D. Lee et al. Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM. In PACT, 2015.
[40] D. Lee et al. Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost. In ACM TACO, 2016.
[41] Y. Li et al. Managing Hybrid Main Memories with a Page-Utility Driven Performance Model. In CoRR abs/1507.03303, 2015.
[42] J. Liu et al. RAIDR: Retention-Aware Intelligent DRAM Refresh. In ISCA, 2012.
[43] J. Liu et al. An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms. In ISCA, 2013.
[44] Y. Luo et al. Characterizing Application Memory Error Vulnerability to Optimize Data Center Cost via Heterogeneous-Reliability Memory. In DSN, 2014.
[45] Y. Luo et al. WARM: Improving NAND Flash Memory Lifetime with Write-Hotness Aware Retention Management. In MSST, 2015.
[46] J. Meza et al. A Case for Small Row Buffers in Non-Volatile Main Memories. In ICCD, 2012.
[47] J. Meza et al. Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management. In IEEE CAL, 2012.
[48] J. Meza et al. A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory. In WEED, 2013.
[49] Micron. RLDRAM 2 and 3 Specifications. http://www.micron.com/products/dram/rldram-memory.
[50] Y. Moon et al. 1.2V 1.6Gb/s 56nm 6F2 4Gb DDR3 SDRAM with Hybrid-I/O Sense Amplifier and Segmented Sub-Array Architecture. In ISSCC, 2009.
[51] O. Mutlu. Memory Scaling: A Systems Architecture Perspective. In IMW, 2013.
[52] O. Mutlu. Main Memory Scaling: Challenges and Solution Directions. In More than Moore Technologies for Next Generation Computer Design. Springer, 2015.
[53] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO, 2007.
[54] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems. In ISCA, 2008.
[55] O. Mutlu and L. Subramanian. Research Problems and Opportunities in Memory Systems. In SUPERFRI, 2015.
[56] S. Narasimha et al. High Performance 45-nm SOI Technology with Enhanced Strain, Porous Low-k BEOL, and Immersion Lithography. In IEDM, 2006.
[57] S. Phadke and S. Narayanasamy. MLP Aware Heterogeneous Memory System. In DATE, 2011.
[58] M. K. Qureshi et al. Enhancing Lifetime and Security of PCM-Based Main Memory with Start-Gap Wear Leveling. In MICRO, 2009.
[59] M. K. Qureshi et al. Scalable High Performance Main Memory System Using Phase-Change Memory Technology. In ISCA, 2009.
[60] M. K. Qureshi et al. AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems. In DSN, 2015.
[61] Rambus. DRAM Power Model. http://www.rambus.com/energy, 2010.
[62] L. E. Ramos et al. Page Placement in Hybrid Memory Systems. In ICS, 2011.
[63] J. Ren et al. ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems. In MICRO, 2015.
[64] Samsung. DRAM Data Sheet. http://www.samsung.com/global/business/semiconductor/product.
[65] Y. Sato et al. Fast Cycle RAM (FCRAM); a 20-ns Random Row Access, Pipe-Lined Operating DRAM. In Symposium on VLSI Circuits, 1998.
[66] V. Seshadri et al. RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization. In MICRO, 2013.
[67] V. Seshadri et al. The Dirty-Block Index. In ISCA, 2014.
[68] V. Seshadri et al. Fast Bulk Bitwise AND and OR in DRAM. In IEEE CAL, 2015.
[69] V. Seshadri et al. Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-Unit Strided Accesses. In MICRO, 2015.
[70] S. M. Sharroush et al. Dynamic Random-Access Memories Without Sense Amplifiers. In Elektrotechnik & Informationstechnik, 2012.
[71] Y. H. Son et al. Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations. In ISCA, 2013.
[72] L. Subramanian et al. MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems. In HPCA, 2013.
[73] L. Subramanian et al. The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost. In ICCD, 2014.
[74] L. Subramanian et al. The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory. In MICRO, 2015.
[75] L. Subramanian et al. The Blacklisting Memory Scheduler: Balancing Performance, Fairness and Complexity. In TPDS, 2016.
[76] M. A. Suleman et al. Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures. In ASPLOS, 2009.
[77] M. A. Suleman et al. Data Marshaling for Multi-Core Architectures. In ISCA, 2010.
[78] H. Usui et al. DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators. In ACM TACO, 2016.
[79] R. Venkatesan et al. Retention-Aware Placement in DRAM (RAPID): Software Methods for Quasi-Non-Volatile DRAM. In HPCA, 2006.
[80] J. Wang et al. Enabling High-Performance LPDDRx-Compatible MRAM. In ISLPED, 2014.
[81] H. Yoon et al. Row Buffer Locality Aware Caching Policies for Hybrid Memories. In ICCD, 2012.
[82] H. Yoon et al. Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories. In ACM TACO, 2014.
[83] D. Zhang et al. TOP-PIM: Throughput-Oriented Programmable Processing in Memory. In HPCA, 2014.
[84] Z. Zhang et al. Cached DRAM for ILP Processor Memory Access Latency Reduction. In IEEE Micro, 2001.