ebook img

MIMS: Towards a Message Interface based Memory System PDF

0.6 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview MIMS: Towards a Message Interface based Memory System

MIMS: Towards a Message Interface based Memory System LichengChen, TianyueLu,Yanan Wang, MingyuChen, Yuan Ruan, Zehan Cui, YongbingHuang,MingyangChen, JiutianZhang,YungangBao StateKeyLaboratoryofComputerArchitecture, InstituteofComputingTechnology,ChineseAcademy ofSciences {chenlicheng,lutianyue,wangyanan,cmy,ruanyuan}@ict.ac.cn {cuizehan,huangyongbing,chenmingyang,zhangjiutian,baoyg}@ict.ac.cn 3 1 0 2 Abstract themarket,suchas60-coreIntelXeonPhiCoprocessor[12], and100-coreTileraTILE-GX64-bitprocessor[17]. Thein- n Memory system is often the main bottleneck in chip- a creasing number of cores would result in severe bandwidth multiprocessor(CMP)systemsintermsoflatency,bandwidth J pressureonthememorysystem. Memoryrequestsfrommul- andefficiency,andrecentlyadditionallyfacingcapacityand 1 tiplecoreswouldalsointerferewitheachotherandresultin powerproblemsinaneraofbigdata.Alotofresearchworks ] have been done to address part of these problems, such as lowlocality. ThusfutureDRAMarchitecturesplacealower R photonicstechnologyfor bandwidth, 3D stacking for capac- priorityonlocality anda higherpriorityonparallelism[50]. On the other hand, the amount of data is predicted to grow A ity, and NVM for power as well as many micro-architecture witharateof40%peryear[14],ithasbecomeahottopicin . levelinnovations. Many ofthem need a modificationofcur- s both academic and industry communities recent years. Big c rentmemoryarchitecture,sincethedecades-oldsynchronous data processing requires more memory capacity and band- [ memory architecture (SDRAM) has become an obstacle to width. adoptthoseadvances.However,tothebestofourknowledge, 1 v noneofthemisabletoprovideauniversalmemoryinterface However main memory that acts as the bridge between 1 thatisscalableenoughtocoveralltheseproblems. highleveldataandlowlevelprocessorisfailedtoscale,lead- 5 In this paper, we argue that a message-based interface ingmemorysystemtobeamainbottleneck.Besidesthewell- 0 should be adopted to replace the traditional bus-based in- knownmemorywall problem[53], the memorysystem also 0 terface in memory system. A novelmessage interface based facesmanyotherchallenges(walls),whichareconcludedas . 1 memory system (MIMS) is proposed. The key innovation of followed: 0 3 MIMS is that processor and memory system communicate Memorywall(Latency): Theoriginal"memorywall"re- 1 through a universal and flexible message interface. Each ferred to memory access latency problem [53] and it was : message packet could contain multiple memory requests or the main problem in memory system until mid-2000swhen v i commands along with various semantic information. The the CPU frequency race slowed down. Then came the X memory system is more intelligent and active by equipping multi/many core age. The situation has changed a bit that r withalocalbufferscheduler,whichisresponsibletoprocess queuing delays have become a major bottleneck, and might a packet,schedulememoryrequests,andexecutespecificcom- contributemorethan70%ofmemorylatency[50]. Thusfor mandswiththehelpofsemanticinformation.Theexperimen- future memory architecture, it should place a higher prior- talresultsbysimulatorshowthat,withaccurategranularity ity to reduce queuing delays. Exploiting higher parallelism message, the MIMS would improve performanceby 53.21%, inmemorysystemcouldreducequeuingdelaysbecauseitis while reducing energy delay product (EDP) by 55.90%, the abletode-queuerequestsfaster[50]. effectivebandwidthutilizationisimprovingby62.42%. Fur- Bandwidth wall: The increasing number of concurrent thermore,combiningmultiplerequestsinapacketwouldre- memory requests along with the increasing amount of data, ducelinkoverheadandprovideopportunityforaddresscom- resultinheavybandwidthpressure. Howeverthebandwidth pression. ofmemoryisfailingtoscaleduetotherelativelyslowgrowth ofpin countsofprocessormodule(about10%peryear[2]). 1. Introduction This has been concluded as bandwidth wall [46]. The aver- The exponentialgrowthofboththenumberofcores/threads age memorybandwidthfor each core is actually decreasing. (computingresources)and the amountof data (workingset) In DDRx memory system, the memory controller (often in- demandshighmemorythroughputandcapacity.Thenumber tegrated on processor chip) connects directly with DRAM ofcoresintegratedintooneprocessorchipisexpectedtodou- devices through wide synchronous DDRx bus, each chan- bleevery18months[2],many-coreprocessorshavebeenon nel costs hundreds of processor pins. Using narrower and higher speed serial bus between processor and memory de- memory system, improve performance and decrease power vices could alleviate the pin count limitation, such as Full- consumption.Manyautonomousmemorysystemshavebeen buffered DIMM and its successors. Optical interconnection proposedbefore,such asProcessing in Memory(PIM)[29], and3Dstackingaresupposedtosolvethisproblemsubstan- ActiveMemoryOperation(AMO)[32,33],andSmartMem- tiallyinthefuture. ory [43]. These technologies are limited to proprietary de- Efficiency problem: Latency and bandwidth are only signs and never get into a standard memory interface with physicalfactorsof a memorysystem while the efficiencyof readandwriteoperationsonly. memoryaccessreallycountsforProcessor. Currentmemory Insummary,alotofworkshavebeendonetoalleviatevar- controller normally accesses a cache-block data (64B) in a ious memory system bottlenecks. However each of them is BL burst(BLis8forDDR3)andDRAM modulesactivates only focus on one or part of these walls. To the best of our awholerow(e.g.8KB)inabank.Theselargeandfixed-size knowledge,noneofthemisabletoaddressalltheseproblems. designshelptoincreasethepeakbandwidthwhenmemoryac- Table1listsdifferentapproachesandwhichproblemeachad- cesshasagoodspatiallocality.However,inmultiplecoresys- dressed (please refer to section 2 for detail). For example, tem,thelocalitybothinrowbufferandcachelineisdecreased theBOB[27]memorysystemisfocusonaddressbandwidth [50,55]. Ithasbeenshownthatlargedatacachehadalmost and capacity wall, and the simpler controller is responsible nobenefitforscale-outworkloadperformance[34, 41]. For to schedule requests, makes the memory system a little au- data access withoutspatial locality, coarse-graineddata unit tonomous. would waste activate power and reduce effective bandwidth LY BW CY EY PR AS sinceitmaymovedatathatneverbeused. Thisisknownas overfetchproblem. Memorysystemthatsupportsfinegranu- Sub-Access * * √ × × × larityaccess[55,56]couldimprovebandwidthefficiency. FGMS * √ √ × × × Buffer-Chip √ Capacitywall: Bigdatarequireshighermemorycapacity. × × × × × But the slowly growing of pin count limited the number of BOBMS √ √ * × × × memorychannelseachprocessorcouldsupport.Furthermore AMO/Smart √ √ × × × × thenumberofDIMM(dualin-linememorymodule)couldbe NVM √ √ * × × × supportedineachchannelislimitedduetosignalintegrityre- 3D-Stacked * √ √ * × × striction. Forinstance,onlyoneDIMMisallowedinDDR3- Photonics √ √ * * × × 1600whilefourDIMMsareallowedforDDR1[35]. Andthe Mobile-DRAM * √ * × × × capacity each DIMM could provide is growing slowly due Note:√-Yes, -No,*-Maybe. to the difficulties in decreasing the size of a DRAM cell’s LY:Latency,BW:Ban×dwidth,EY:Efficiency, capacitors[27]. ToincreasetheDIMMcountsinsinglechan- CY:Capacity,PR:Power,AS:Autonomous nel, registers and buffers are added to the memory interfac- Table1:Comparisonofdifferentapproachestoalleviatediffer- eRDIMM,LRDIMM.Therearealsovariousproposalstopro- entwalls. videbigmemories,suchasBOB(BufferOnBoard)Memory [27], HMC (Hybrid Memory Cube) [11], high density non- volatile memory (e.g. PCM) [40] and 3D-Stacked memory In this work, we argue that traditional synchronous bus- [49]. basedmemoryinterfaceshouldberedesignedtoincorporate Power wall: memorysystem has been reportedto be the futureinnovations.Incontrasttotraditionalreadorwritebus main powerconsumerin servers, contributingabout40%to transactionbasedmemoryinterface,aflexibleasynchronous thetotalsystempower[38,28].Capacitor-basedmemorycell messagebasedmemoryinterfacewill bringmoredesignop- contributesthemainpowertotheDRAMsystemwhichisnot portunity. We propose a uniform message interface based anarchitecturalissue. However,duetotheoverfetchproblem, memorysystem (MIMS):Memoryrequestandresponse are a largepart of the dynamicpower of DRAM is wasted. Im- sending via asynchronous messages. A local buffer sched- provingmemorysystemtosupportSub-Accessinrowbuffer uler is putbetweenmemorycontrollerand memorydevices. and fine granularitymemoryaccess couldalleviate the over- Device-specific schedulingis decoupledfromCPU memory ferchproblem. Toreducestaticpower,non-volatilememory controller, which is only responsible to compose memory (e.g. PCM)couldbeinvestigatedaspotentialalternativesfor requests for packet. The memory controller communicates existingmemorytechnologies. NV-memoryhasatotallydif- with buffer scheduler over high-speed serial point-to-point ferentaccessparameters,soitcannotworkunderamemory link with a flexible message packetprotocol. Each message interfacedesignedforDRAMsuchasDDRx. couldcontainmultiplememoryrequestsorresponsesaswell Besides all the abovewalls, thereis a longtrendto equip as semantic information such as granularity, thread id, pri- the memory system with some simple processing logic to ority, timeout. The buffer scheduler act as the traditional make memory more autonomous. Logic in memory would memorycontroller: it needsto track status of localmemory significantly reduce data movement between processor and devices, schedulerequests, generateandissue specific DDR 2 commands,meanwhilemeetingthe timing constraints(such logicbetweenprocessorandDRAMdevices,whichisrespon- asDDR3forDRAMdevices).Additionally,bufferscheduler sibletoreceivememoryrequestsfromlastlevelcache(LLC). can use varioussemantic informationfrom the CPU to help The memory controller needs to track the status of DRAM itsscheduling. devices (e.g., bank states) and generates DRAM commands MIMSwillbringatleastthefollowingadvantages: foreachselectedrequestsmeanwhilemeetingtheDDRxtim- 1. It provides a uniform, scalable message interface to ac- ing constraints. The integrated memory controller commu- cess different memory system. The status tracking and nicatesdirectlywithDRAM devicesoverwidesynchronous requestschedulingisdecouplingfrommemorycontroller DDRbuswithseparatedata,command,andaddressbus.This andpusheddowntobufferscheduler. Thustheintegrated directly-connecteddesignwouldresultinhighprocessorpin- memorycontrollerhasnotiminglimitationsandcouldeas- countcost,andthishasbecomeamainbottlenecktosupport ilyscaletootheremergingmemorytechnologies.Alsothe largememorycapacity,becausethegrowthofprocessorpin- memorycapacityisonlyrestrictedbythebufferscheduler, countisfailedtokeepupwithdemand. whichisdecoupledwithmemorycontroller. The memorysystem has a hierarchicalorganization,with 2. Itcouldnaturallysupportvariablegranularitymemoryre- different level parallelism. Each memory controller might quests. Eachmemoryrequestistransferredwiththeexact support multiple memory channels, while each channel has sizeofreallyusefuldata.Thiscouldsignificantlyimprove separateDDRbus.Thuseachchannelcouldbeaccessedinde- data/bandwidth effectiveness and reduce memory power pendently.Withinamemorychannel,theremightbemultiple consumption. DIMMs(DualInlineMemoryModule). EachDIMMmight 3. It enables inter-requestsoptimization duringthe memory comprisewithmultipleranks(1-4),andeachrankprovidesa accesssuchascombiningmultipleoperationsinapacket logical64-bitdata-path(bus)tomemorycontroller(72-bitin and compressing memory address for a sequence of re- ECC-DIMM).MultipleDRAMdeviceswithinarankneedto quests. beoperatedintandem.Thex8DRAMdevicesarecommonly 4. It is easy to add additional semantic information to the usedtodayandtheywillbeusedinthisworkbydefault. message to help the buffer scheduler to make decisions A prevalent DRAM device (chip) consists of multiple whendoinglocalscheduling. Localcomputationorintel- DRAM banks (8 in DDR3) which can be processed concur- ligentmemoryoperationrequestscanalsobeaddedaspart rently. There is a two dimensionalarray, consisting of rows ofthemessage. andcolumns,withineachDRAMbank. Arowbufferisded- To demonstrate the benefits of using a message interface, icatedtoeachbank,whichisusual4-16KB.Beforeeachcol- we have implemented a cycle-detailed memory system sim- umnaccess, arowneedstobeloadedintotherowbufferby ulator, MIMSim. Experiments on fine-granularity access, anactivecommand.Iflatterrequestsarehitintherowbuffer, trunk memory request and address compression are taken. itcouldbeaccesseddirectlybyread/writecommand. Other- The results provide elementary proof for the benefits of wise, the row bufferneeded firstly to be prechargedback to MIMS. arraybeforeissuinganewrequest. The rest of the paper is organized as follows. Section 2 2.2.Sub-AccessMemory givesbackgroundonmemorysystemandrelatedwork,while Section3presentstheMessageInterfacebasedMemorySys- Sub-Access memory refers to dividing a whole component tem, includes architecture, packet format, address compres- into multiple sub-components,so that each memoryrequest sionandchallenge.Section4presentstheexperimentalsetup, onlyneedstoaccessaportionofdata.Herecomponentcould and the results are presentedin section 5. Section 6 givesa berank,rowbufferandcacheline(FGMS). conclusionofthispaper. Subrankmemorydividesamemoryrankintomultiplelog- icalsub-ranks,whichcouldbeaccessedindependently. Data 2. Background andRelated Work layout might be adjusted so that each cache line is put in a In this section, we first give a brief description of the most sub-rank. Each memory access would only require a part commonly used JEDEC DDRx SDRAM memory systems, of memory devices (in the same sub-rank) to be involved. and then discuss some optimizations on memory architec- Manydifferentapproachescouldbeclassifiedassub-rank,in- ture,includingSub-Access,buffer-chipmemory,autonomous cludingRambus’s Modulethreading[52], Multicore DIMM memory,andsome aggressivememorysystem, suchas non- (MCDIMM)[20,19],andmini-rank[59,31],heterogeneous volatile memory (NVM), 3D-stacked memory, photonicsin- Multi-Channel [57]. Sub-rank technology could save mem- terconnectmemoryandmobile-DRAMinserver. orypowerbyalleviatingtheover-fetchproblemandimprove memorylevelparallelism (MLP).The downsideof it is that 2.1.DDRxMemorySystem thememoryaccesslatencywillincreasesincepartofthe to- The JEDEC standardized DDR (Double Data-Rate) [7] syn- tal device bandwidth can be utilized, because they still use chronousDRAMisdominatednowadays. InDDRxmemory coarsegranularitymemoryaccess. system,memorycontrollercouldbeconsideredasthebridge Udipiet.al[50]proposedSSAtoreducepower. Anentire 3 cachelineisfetchedfromasinglesubarraybyre-organizing ule was equipped with a Unified DIMM Interface Chip thedatalayoutinSSA.Cooper-BalisandJacob[26]proposed (UDIC).Thememorycontrollersendsread/writerequeststo a fine-grainedactivationapproachto reducememory power, UDIC through the unified interface without worrying about which only actives a smaller portion of row within the data anydevicesstatusortimingconstraints. arraybyutilizingposted-CAScommand. 2.4.AutonomousMemory FGMS: AGMS and DGMS [55, 56] adopted sub-rank memorysystemtoallowdynamicallyusefineorcoarsegran- There has been a long time effort to make memory au- ularitymemoryaccess. Theyalsoproposedsomesmartdata tonomousbyequippingmainmemorywithprocessinglogic layoutstosupporthighreliabilitywithlowoverhead.Convey- to support some local computations. Processor-in-memory designedScatter-GatherDIMMs[24]arepromisingtoallow (PIM) systems incorporate processing units on modified 8 bytes (fine granularity) access that could reduce the inef- DRAM chips [29, 37]. Active memory controller [33, 32] ficienciesin nonunitystrides orrandomlymemoryaccesses, adds an active memory unit to supportactive memory oper- howevertheimplementationdetailofSGDIMMislack. Cray ation (AMO), such as scalar operations (e.g. inc, dec) and BlackWidow[18]adoptedmany32-bitwidechannels,allow- stream operations (e.g. sum, max). Smart Memory [43] at- ingittosupport16Bminimumaccessgranularity. tachessimplecomputeandlockwithdata,thusreduceschip I/O bandwidth and achieves high performance and low la- 2.3.Buffer-ChipMemory tency. To alleviatememorycapacityproblem,a commonway isto 2.5.AggressiveMemorySystem put an intermediate buffer (logic) between the memory con- trollerandDRAMdevices,whichcouldreducetheelectrical Non-volatilememory(suchasPCM)[40,44,61,58]hasbeen loadonthememorycontrollerandimprovesignalintegrity. consideredasapotentialreplacementforDRAMchipinthe InRegisteredDIMM(RDIMM)[3],asimpleregisterisin- future.NVMdevicescouldremovestaticpowerconsumption tegratedonDIMMtobuffercontrolandaddresssignals.Load andpromisetoprovidehighercapacity. Recentresultshows Reduced DIMM (LRDIMM) [6] further buffers all signals thatsomeNVMhascomparablelatencytotheDRAM.How- thatgotoDRAMdevices,includingalldataandstrobes. De- ever NVM chip usually have a differenttiming requirement coupledDIMM [60] adoptsa synchronizationbufferto con- sothatitcannotbeincorporatedintotheDDRxmemoryin- vert between low speed memory devices and high data rate terfacedirectly. memorybus. Withasimilaridea,BOOM[54]addsabuffer On Chip optical interconnection is expected to provide chip between the fast DDR3 memorybus and wide internal enough bandwidth for memory system. Udipi et al. [49] bus, which enable the use of low-frequencymobile DRAM proposed a novel memory architecture uses photonics inter- devices,thusBOOMcouldsavememory. connectsamongmemorycontrollerwith3Dmemorystacked InFully-BufferedDIMM(FBDIMM)[36,4]memorymod- dies,theyalsoproposedanovelpacketbasedinterfacetore- ule,thereisanAMB(AdvancedMemoryBuffer)integrated linquish the memorycontroller and allow the memory mod- oneachDIMMmodule,multipleFBDIMMsareorganizedas ulestobemoreautonomous. Eachpacketonlycontainsone adaisychainwhichcouldsupporthighcapacity.Thememory memoryrequestsandisprocessedinaFCFSorder. controller communicates with AMB through point-to-point Hybrid Memory Cube (HMC) [11] Utilizes 3D intercon- narrow, high-speed channels with some simple packet pro- necttechnology. A smalllogic layer thatsits belowvertical tocol. Intel Scalable Memory Interface (SMI) [8] and IBM stacksofDRAMdieconnectedbythrough-siliconvia(TSV) power 7 memory system [39, 51] also place logic buffers bonds. Thislogiclayerisresponsibletocontrolmemoryde- betweenthe DRAM and the processor,whichcouldsupport vices. The memory controller communicate to HMC logic morememorychannels. chip via abstracted high-speedinterface, logic layer flexibil- Cooper-Baliset. al[27]proposedageneralizedBufferOn ityallowsHMCcubestobedesignedformultipleplatforms Board(BOB)memorysystem. InBOB,intermediatelogicis andapplicationswithoutchangingthehigh-volumeDRAM. placed(onmotherboard)betweenon-chipmemorycontroller 3. MessageInterface based Memory System and DRAM devices. The memory controllercommunicates withtheintermediatebufferthroughseriallink. Thememory 3.1.WhyuseaMessage-basedInterface? controlleris decouplingfromscheduling, and the intermedi- atebufferactuallyactsasatraditionalmemorycontroller: it Thecurrentbus-basedmemoryinterfacecanbedatedbackto tracksstatusofitslocalmemorydevicesandschedulesmem- the 1970s when the first DRAM chip in the world was pro- oryrequests,issuescorrespondingDRAMcommandsmean- duced. After 40 years the main characteristics remains un- whilemeetstimingconstraints. TheBOBmemorysystemis changed: separated data, address and control signals; fixed promisingtoalleviatethecapacityandbandwidthproblem. transfersizeandmemoryaccesstiming(latency);CPUbeing UniMA [30] aims to enable universal interoperability be- aware and take care of every bit of storage on the memory tweenprocessorsandmemorymodules. Eachmemorymod- chip;limitedon-goingoperationsontheinterface. Onemay 4 argue that a simple and raw interface for DRAM keeps the 3.2.MIMSArchitecture minimumlatencyformemoryaccess,butitalsobringsobsta- cles to improve memory performance as described in Intro- Figure 1 shows the architecture of the Message Interface ductionsection. Nowadayswith moreand moreparallelism basedMemorySystem(MIMS).As in theBuffer-On-Board incomputersystem,singlememoryaccesslatencyisnotthe (BOB) memory system [27], the memory controller in pro- main issue for overallperformanceanymore. Is it the right cessor does not directly communicatewith memorydevices timetochangethisdecades-oldinterface? (DRAM), instead, it communicates with the buffer sched- uler via serialized point-to-pointlink which is narrowerand Decouplingisthe commontrendofmanypreviousworks could work at a much higher frequency. Each memorycon- mentioned in section 2. That is to separate the data trans- trollercouldsupportmultiplebufferschedulers. Eachbuffer feranddataorganization. TheCPUshouldonlytakecareof scheduler consists of memory request buffer, packet genera- sendingrequestsandreceivingdatawhileabuffercontroller tor,packetdecoder,returnbufferandlinkbus. takes charge of scheduling and local DRAM chip organiza- Thememorycontrollerreceivesvariable-granularitymem- tion. Apacket-basedinterfacewillenablethisseparationby oryrequestsfrommultipleprocessorcores.Thememorycon- encapsulating data, address and control signals. If we just trollerfirstlychoosesthetargetbufferschedulerbasedonthe stop here, then packeting is only a low-level encapsulation address mapping scheme, and then put it into on-chip net- forbustransactions. work. The NOC routes each memory request into its target Wecangoafurtherstepfrompackettomessage.Heremes- requestbuffer. Foreachbufferscheduler,therequestbuffers sagemeansthecontentofapacketisnotpredefinedorfixed aredividedintoReadQueueandWriteQueue,whichisused but programmable. Message also meansCPU can putmore tobufferreadandwriterequestsrespectively. Readrequests semanticinformationonapacketotherthansimpleread/write have high priority when scheduling requests to pack until operations.Thenthebuffercontrollercanmakeuseofthisin- the numberof write requestsin the write queueexceedsthe formationtogetbetterperformance. Theinformationmaybe high water mark [48, 25]. Then write requests get high pri- size,sequence,priority,processid,etc.orevenarray,linkand ority, and write requests would be contiguously selected to lock.ItislikethatthebuffercontrollerisintegratedwithCPU bepackedandsenttothecorrespondingbufferschedulerun- virtually to get all those necessary information for memory til the number of write requests return below the low water optimization. A message-basedinterface will providemany mark. opportunitiestohelpsolvememorysystemissues. The Packet Generator is responsible to select multiple For latency problem, though a message interface may in- memoryrequeststoputintoapacket,sendittoSerDesBuffer, creasethelatencyofsingleoperation,ithelpstoincreasepar- construct the packet head, which containing meta data of a allelismanddobetterprefetchingandschedulingwithseman- packet. Notethat,thepackingoperationisnotinthecritical ticinformation,socontributestodecreasetheoveralllatency. path,becausethePacketGeneratorkeepstrackingthestatus oftheserializedlinkbus, andcouldstartpackingprocessin Forbandwidthproblem,messageinterfacewillsupportbet- advancebeforethelinkbusbecomeavailable(free).Afterthe ter memoryparallelism to full utilize the bandwidth; packet packet has been constructed and the link bus become avail- interfaceenablesnewinterconnectiontechnologies.Message able,thepacketwouldbesenttothetargetbufferscheduler. alsoenableseffectivecompression. After receiving a message packet, the packet decoder on For efficiency problem, exact data size information will the buffer scheduler would unpack the packet and retrieve helpreducewaste ofover-fetch;message also enablesexact all memory requests integrated in the packet, and then send prefetchingtoreduceunnecessaryoperations. them to the scheduler. The scheduler acts as a traditional For capacity problem, decoupling enables special design memory controller: it communicateswith the DRAM mem- forlargecapacitymemorysystems; messageevenenablesa orymodulethroughwideandrelativelowfrequencybuswith networkofextendedmemorysystems. thesynchronousDDR3protocolasinthetraditionalmemory system. The scheduler needs to track all the memory mod- For power problem, message enablesfine-grainedcontrol ulestates(e.g. bankstate,busstate)attachedtoit,schedules ofactiveDRAMregions.Decouplingalsoenableslowpower thememoryrequestsandgeneratesDRAMcommands(such NVRAMtobeincludedinmemorysystemtransparently. as ACTIVE,PRECHARGE, READ, WRITE etc.) based on Forautonomousoperation,messageprovidesanaturalsup- the memory states, and issues the DRAM commandsto the portwithsemanticinformation. memory module meanwhile fulfills the DDR3 protocolcon- To demonstrate the benefits of message interface, a draft straints. architecturedesignandevaluationaregivenasfollowing. It Sub-rank memory system is used to support variable- shouldbenotedthatthedesignandevaluationareelementary granularitymemoryaccess, which is similar as DGMS[56]. andincompletetocoveralltheadvantageofmessage-based We use x8 DRAM devices, and each rank is separated into interface. 8 sub-ranks, with one device per sub-rank. The data burst 5 Memory Controller Buffer Memory Module (On Chip) Scheduler (DRAM Channels) Link DDR3 Write Queue Bus Bus R S an Core 0 Read Queue che k... d u Data Buf le r Sub C RM ro Rank ... equestsemory ssbar Write Queue ... ... DBDuRs3 R S a Core C-1 Read Queue ched nk... u Data Buf le r Sub Rank represents Packet Generator, represents Packet Decoder Figure1:TheMessageInterfaceMemorySystemarchitecture. length (BL) is 8 in DDR3, thus the minimum granularity is (cid:47)(cid:46)(cid:50)(cid:43) (cid:51)(cid:46)(cid:43)(cid:39) (cid:53)(cid:55)(cid:48)(cid:54)(cid:42) 8B. (cid:258) (cid:258) 3.2.1. Packet Format Message packet is the essential and criticalcomponentinMIMS,andpacketshouldbedesigned (cid:39)(cid:37)(cid:54)(cid:44)(cid:39) (cid:51)(cid:55) (cid:38)(cid:49)(cid:55) (cid:53)(cid:57) (cid:36)(cid:39)(cid:39)(cid:53) (cid:42)(cid:60) (cid:55)(cid:50) (cid:55)(cid:44)(cid:39) to easily scale to support various memory optimizations. Each packetcouldsupportmultiple memoryrequests. Each Figure2:Readpacketformat. packetcontainssome LinkOverhead(LKOH)whichis gen- erated and processed at lower layer such as link layer and (cid:47)(cid:46)(cid:50)(cid:43) (cid:51)(cid:46)(cid:43)(cid:39) (cid:53)(cid:55)(cid:48)(cid:54)(cid:42) (cid:58)(cid:55)(cid:39)(cid:36) physicallayer. TheLKOHisnecessaryforserialbuscommu- nication protocol, which usually contains reliability-related data such as Start, End signal, Sequence ID, and checksum Figure3:Writepacketformat. codes(CRC). TheoverheadofLKOHwouldbehighifeachpacketonly with some other semantic message, it basically contains ad- containsa small amountof data (payload). Especially for a dress (ADDR), granularity (GY) for each memory request, readrequest,whichonlycontainsaddressandoperation,the andcouldsalabletocontainmoresemanticmessagesuchas overheadof the LKOH would be near 50%for a size of 8B request timeout (TO), thread id (TID). The timeout require LKOH (as in PCIe protocol). Combining multiple memory the longest acceptable latency (queue delay) that it must be requestsintoapacketwouldincreasethepayloadsize. scheduledand return, this is valuable to implementQoS for Tosupportmultiplememoryrequestsina packet,wepro- requests,othermessagethatisvaluableforschedulingcould pose a variable-lengthpacket formatfor MIMS. The packet alsobeintegratedintheRTMSG.AlltheRTMSGsinapacket has three basic types: Read Packet, Write Packet and Read- arein thesame formatandsame length,whichmakesto en- Return Packet, which might contain multiple read memory codeanddecodereadingpacketeasilyandeffectively. requests, write requests and return data respectively. Since Inawritepacket,asshowninfigure3,theformatisnearly eachpacketmightcontainvariablenumberofrequests,each inthesame,italsocontainsLKOH,apacketheadandmulti- packetisaddedapackethead(PKHD)whichcontainsMeta plewriterequests,wherethepacketheadisjustthesameas data of the packet, the detail format of packet is shown in in the read packet, exceptthatthe packettype(PT) is Write figure 4. We can see in figure 2 that a Read Packet has a Packet. BesidesaRTMSG,eachwriterequestneedalsocon- packet head and multiple Request Messages (RTMSG), and tainswritedata(WTDA).TheRTMSGisthesamewithread it should contain LKOH. The packet head contains Destina- request. Write datamightbe variable-length,andthelength tion Buffer Scheduler Identifier (DESID), Packet Type (PT, is determinedby the granularity(in RTMSG) of the request. suchas Read), the Count(CNT) ofrequestsandsome other For example, the length of data is 8B for a fine granularity Reserved(RSV)fields. Notethatalltherequestsinapacket write,anditis64Bforacoarsegranularitywrite. are sent to the same Destination Buffer Scheduler. After Read-returnpackethasthesameformatwithwritepacket. packet head, multiple request messages are closely aligned, The request address needs to be returned since memory re- eachrequestmessage(RTMSG)representsamemoryrequest quests are scheduled out-of-order both in packet encoding 6 (cid:51)(cid:68)(cid:85)(cid:68)(cid:79)(cid:79)(cid:72)(cid:79)(cid:3)(cid:71)(cid:72)(cid:70)(cid:82)(cid:71)(cid:76)(cid:81)(cid:74) addresscompressioncouldfurtherreducethesizeofpayload, (cid:53)(cid:55)(cid:48)(cid:54)(cid:42) thus reducethe demandto the bandwidthof link bus.Figure (cid:53)(cid:72)(cid:68)(cid:71)(cid:3)(cid:51)(cid:68)(cid:70)(cid:78)(cid:72)(cid:87) 5(a) shows the simplified FIFO one request per packet, we canseethatfor8memoryrequests,ittotallyinduces8packet (cid:37)(cid:68)(cid:87)(cid:70)(cid:75) overhead(PKT_OH).Andfigure5(b)showsifthepacketsup- (cid:48)(cid:36)(cid:54)(cid:46) portto involvemultiplerequests, such as 4 requestsin each packet, then there are 2 packetswith induced2 packetover- heads, it could save 6 PKT_OHs space. However since it still packet requests in FIFO, the addresses in each packet (cid:39)(cid:72)(cid:70)(cid:82)(cid:71)(cid:72)(cid:71)(cid:3)(cid:85)(cid:72)(cid:68)(cid:71)(cid:3)(cid:85)(cid:72)(cid:84)(cid:88)(cid:72)(cid:86)(cid:87)(cid:29)(cid:3) (cid:68)(cid:71)(cid:71)(cid:85)(cid:72)(cid:86)(cid:86)(cid:15)(cid:3)(cid:74)(cid:85)(cid:68)(cid:81)(cid:88)(cid:79)(cid:68)(cid:85)(cid:76)(cid:87)(cid:92)(cid:15)(cid:3)(cid:53)(cid:39) hasrelativelypoorlocality, whichis obstacle to performad- dress compression. Thus in figure 5(c), we selects memory Figure 4: Parallel decoding for Read Packet. Each batch (4 requestsinen-packetinanout-of-orderandcompress-aware inthefigure)ofRTMSGsaredecodedinparallel, eachRTMG manner, which firstly re-order memory requests and group couldbedecodedwithasimpleMASKoperator. multiple adjacent requests which is preferred to be selected andinbufferscheduler.Inordertoreducetheoverheadofre- inthesamepacket. Finally,figure5(d)showsthebase-delta turningaddress,eachreadrequestcouldbeassignedarequest address compression in each packet, we choose a base, and id,whichismuchsmaller(10bitsisenough1024requests). all the addressare then representedasthe difference(DIFF) 3.2.2.PacketDecodingAfterthebufferschedulerreceiveda tothebase,wheretheDIFFcouldhavevariablelength,such packet,the PacketDecoderfirstly readsthepacketheadand as2Binthefirstpacketand1Binthesecondpacket. getsthemetadataofthepacket,suchasDESID(Destination Buffer Scheduler ID), Packet Type, memory requests count 3.3.Thechallengeofdesigningamessageinterfacebased andotherreserveddata.TheDESIDisusedtocheckwhether memorysystem thepacketwasroutedcorrectly. Thenthetypeofthepacket andthecountofmemoryrequestsarechecked. Readpacket: Sinceeachreadrequestmessageshasfixed- Messagebasedinterfaceformemorysystemwillbringchal- length fields, including address, granularity etc., it could be lenges to all system levels that concerned with memory. processedinparalleleasily.Figure4showstheprocessofpar- Manychallengesremaintobesolved. Hereisanincomplete allel decodingfor read packet. In this example, 4 RTMSGs list. aredecodedonceatime.SincetheformatforeachRTMSGis same,theycouldbedecodedwithasinglemask,theaddress, 1. Complexity: Message processing is more complex than granularityandothermessageofeachreadrequestcouldbe simple packet. Both the memory controller and buffer extracted.Afterthat,thenextbatchRTMSGsarereadytobe schedulerneedmorecomplexlogictoaccomplishthetask, decoded. e.g. longerqueuemanagementandconsistencechecking. Write Packet: Each write request has the Request Mes- Althoughlogicisbecomingcheaper,itstillneedstoinves- sage along with variable length of data, where the length tigatewhetherthecost,powerconsumption,andincreased couldbedeterminedbythegranularityofthememoryrequest. latencycanbecontrolledwithinanacceptablelevel. Forexample,ifthegranularityis4,thenthelengthofdatais 2. ISAextension: Tofullutilizingtheflexibilityofmessage, 32B(8B*4). Thedecoderprocesstherequestsinserialdue CPU needs to provide more semantic information along tovariablesize:itextractsaddress,granularityandothermes- with read/write request. This may bring extensionsneed sageofthefirstwriterequest,thenitcalculatesthelengthof totheISA.Forexample,howtoprovidethesizeinforma- databaseonthegranularity,retrievesthewritedata,advances tionforvariablegranularitymemoryrequests;howtode- tothenextrequest,untilallthewriterequestsareretrieved. liverprocessinformationsuchasthreadid,priority,time- 3.2.3. Address Compression in a message packet Putting outandprefetch;howtogenerateActiveMemoryOpera- multiplememoryrequestsinapacketprovidesagoodoppor- tionrequesttothememorycontroller. tunitytoaddresscompression. Itisalsoenablestocompress 3. Cache Support: To better support variable granularity data in write packet and return packet. A lot of work had memory accesses, variable-sized cache line is preferred contributed to data compression. Motivated by that address though with difficulty. Sector cache for fine granularity wouldcontributeabout39%of packet(section 5.1formore and SPM (Scratchpad memory) for large granularity can detail), we focuson compressing multiple address in a read alsobeusedwitharedesign. packet which has not been investigated before, and we will 4. Programming:Thesemanticinformationmayalsobedis- showthatwithsomesimplecompressionalgorithms,thead- coveredand generatedby software andsent via message. dresscouldbecompressedefficiently. Application can be implemented with some hint API, or Theexampleinfigure5illustrateshowinvolvingmultiple withthehelpofanaggressivecompilertogenerateMIMS requests in a packet could reduce packet overhead and how specialinstructionsautomatically. 7 PKT_OH 0x2b334ecfb2b8 PKT_OH 0x46e44b60 0x2b334ecfdee8 0x46e44b18 0x2b334ecfb2b8 PKT_OH 0x46e44b18 PKT_OH 0x2b334ecfe7d0 0x46e44bf0 0x2b334ecfa9c0 0x46e44ba8 Saved Space:6 PKE_OHs PKT_OH 0x2b334ecfdee8 0x46e44bf0 0x46e44ba8 0x46e44b60 0x46e44b18 PKT_OH 0x46e44b60 PKT_OH 0x46e44ba8 0x2b334ecfe7d0 0x2b334ecfa9c0 0x2b334ecfdee8 0x2b334ecfb2b8 PKT_OH 0x2b334ecfa9c0 PKT_OH 0x46e44bf0 PKT_OH 0x46e44b00 0xf0 0xa8 0x60 0x18 Saved Space PKT_OH 0x2b334ecfe7d0 PKT_OH 0x2b334ecf0000 0xe7d0 0xa9c0 0xdee8 0xb2b8 Saved Space Figure 5: Conceptual example showing the benefit ofinvolving multiple requests ina packet: packet overhead reduction and addresscompression. (a)FIFOwithonerequestineachpacket. (b)FIFOwithmultiplerequestsineachpacket. (c)Out-of-order compressed-awarerequestsgrouping.(d)Addresscompressionineachpacket. 4. Experiments Setup granularityisabout72.85%forreadrequests,butitisabout 97.59% for write requests. And in the listrank benchmark, 4.1.SimulatorandWorkloads the rate of 2-granularity and 4-granularity is about 52.99% and31.54%respectivelyforreadrequests,buttheyareabout To evaluate MIMS, we have implemented a cycle-detailed 76.51%and0.90%respectivelyforwriterequests. MessageInterfaceMemorySystemsimulatorwhichisnamed MIMSim. We adopted DRAM modules (devices) based on Cate. Bench. RPKI RG WPKI WG R/T DRAMSim2 [47], which is a cycle accurate DDR2/3 mem- FINE GUPS 69.67 1.78 69.62 1.78 1.00 orysystemsimulator. TheDRAMSim2modelsallaspectsof FINE SSCA2 20.89 1.68 20.42 1.56 1.02 the memory controller and DRAM devices, including trans- FINE canl. 17.79 1.64 8.64 1.10 2.06 action queue, command queue and read-return queue, ad- FINE park. 9.76 2.42 6.14 2.74 1.59 dressmappingscheme,DDRdata/address/commandbuscon- MID lirk. 22.56 3.56 15.45 3.37 1.46 tention,DRAMdevicepowerandtiming,androwbufferman- MID BFS 22.36 3.10 2.44 3.49 9.16 agement. We addre-orderbuffer(ROB)to makesimulation COR STRM. 33.33 8.00 16.63 8.00 2.00 moreaccurate,theDRAMmoduleismodifiedtosupportsub- COR bt 7.68 7.98 7.63 7.98 1.01 rank.Channelinterleavingaddressmappingisadoptedasthe COR ft 31.85 8.00 31.72 8.00 1.00 default(baseline)configurationtomaximumMLP(Memory COR sp 8.04 7.98 7.89 7.98 1.02 LevelParallelism),andFRFCFS[45]schedulingpolicywith COR ua 4.42 7.19 3.80 7.92 1.16 closed-pagerowbuffermanagement. COR ScPC. 11.00 5.65 3.85 5.74 2.86 Pin[42]isusedtocollectmemoryaccesstracesfromvari- COR perM 2.62 6.28 2.40 6.12 1.09 ousofworkloadsrunningwith2-16threads. Wechoosesev- eralmulti-threadmemoryintensiveapplicationsfromBFSin Note,canl.:canneal,park.:pagerank, Graph500 [9], PARSEC [23], Listrank [21], Pagerank [10], lirk.:listrank,STRM.:STREAM,ScPC:ScaleParC. SSCA2 [22], GUPS [15], NAS [13], STREAM [16]. Table Table2:WorkloadsCharacteristics. 2 lists the main characteristicsof these workloads. We clas- sify the workloadsinto three categoriesbased on the access granularity: fine granularity (FINE: <=3), Middle granular- To collect granularity message for each memory request, ity(MID:3-6),andcoarsegranularity(COR:6-8). Memory we implement a 3-level cache simulator as a Pin-tool. The readandwriterequestsarereportedseparately,includingthe detail configuration is listed in table 3. We start the cache read memory requests per kilo instruction (RPKI), the aver- simulator after each application enters into a representative agereadgranularity(RG),thewritememoryrequestsperkilo region. Afterwarm-upthecachesimulatorwith100million instruction(WPKI),theaveragewritegranularity(WG),and memoryrequests,wecollectmemorytraceswithgranularity theread/writeratio(RD/WT).Thereasontoseparatetheread and cache access type message. For PARSEC benchmark, andwritecharacteristicsisthatwefindthegranularitydistri- we naturally choose the ROI (Region-of-Interest) codes as butionof read and write mightdifferentfor some FINE and the region for PARSEC benchmarks; and for all the other MID benchmarks. Figure6 showstheir granularitydistribu- benchmarks,wemanuallyskiptheinitializationphase(such tion. For example, in the canneal benchmark, the rate of 1- asgraph-generationinBFS)andcollectmemorytracesafter 8 100% ranks with 8 DRAM x8 chips each. Each DRAM chip has on90% 8banks. Wefastforward64millionmemorytracesforeach uti80% 8 core (thread),simulate untilall the threadshave executedat rib70% 4 least100millioninstructions. Dist5600%% 2 ToevaluatetheMIMS,weusethefollowingmemorysys- ty(cid:3)40% 1 temconfiguration: ari30% DDR: traditional DDRx (3) memory system with fixed nul20% • coarseaccessgranularity(cacheline:64B),thisisthebase- Gra10% line. 0% BOB: Buffer On Board memory system, fixed coarse ac- rd wt rd wt rd wt rd wt rd wt rd wt • cessgranularity,1memoryrequest(read/write)perpacket, BFS GUPS SSCA2 canneal listrank pagerank simplepacketformatwithoutanyextramessage. MIMS_one (MI_1): Message Interfaced based memory Figure 6: The read and write granularity distribution ofFINE • system, adopts sub-rank memory organization to support andMIDmemory-intensiveworkloads. variable-granularityaccess, 1requestperpacket, contains meaningfulwork. granularitymessageinpacket. MIMS_multiple (MI_mul): Message Interfaced based • 2.7GHz,256-entry, memorysystem,supportsvariable-granularityaccess,mul- ReorderBuffer maxfetch/retirepercycle:4/2, tiplerequestsinapacket. 5pipeline(latencyofnon-meminstr) DRAM and Controller Power: we evaluate memory Private,32KB,4-way,64Bcacheline, powerconsumptionwith DRAMsim2 [47] power calculator, L1Cache 9CPUcycleshit(4+5) which uses the power model developed by Micron Corpo- Private,256KB,8-way,64Bcacheline, ration based on the transitions of each bank. The DRAM L2Cache 15CPUcycleshit(10+5) powerisdividedinto4components:background,refresh,ac- Shared,16-way,64Bcacheline, tivation/precharge,andburst, where backgroundand refresh L3Cache 1MB/core,45CPUcycleshit(40+5) powerisoftenconcludedasstaticpower,activation/precharge Memory 2bufferschedulers/MC, and burst power is concluded as dynamic power. Besides Controller Read/WriteQueue:64/64 DRAM devices, we also take the memory controller power 2.7GHz,point-to-point, intoconsider,foritwouldcontributeasignificantamountto LinkBus read/writebuswidth:16/16 overallconsumption[28] (about20%). In BOB and MIMS, Buffer FRFCFS[45],closedpage thecontrollerpowerisactuallyreferredtosimplercontroller Scheduler Channel-interleavemapping andbufferschedulerpowerrespectively. ForDDR,weadopt DRAMParameters the MC power to 8.5W from [5]; for BOB and MIMS, we 264-bitChannels,2Ranks/Channel, adopt the intermediate controller power to 14W as in [27]. Memory 8devices/Rank,8sub-ranks/rank, Thecontrolleridlepowerissetto50%ofitspeakpower. x8-widthsub-rank,1device/sub-rank 5. Experimental Results DDR3-1333MHz,x8,8banks, 32768Rows/bank,1024Columns/Row In this section, We first present the performance and power DRAMDevice 8KBRowBufferperBank,BL=8, impactsofMIMSinSection5.1,andthenevaluatetheeffec- Timeandpowerparametersfrom tivenessofcombiningbig-granularitymemoryrequests. We Micron2GbSDRAM[1] present the effect of memory addresses compression in sec- tion5.3. Table3:SystemConfigurations 5.1.Performanceandpowerimpacts In this section, we presentsimulationresults of 16-coresys- 4.2.Systemconfigurations temsonFINEandMIDgranularityworkloads. Allthework- Table 3 lists the main parameter settings used in the cycle- loadsareinmultiple-threadmode,witheachcorerunningone detailed simulators. Note, the non-memory instruction la- thread. Weusethetotalnumberofsubmittedinstructionsas tency and cache hit latencies listed here are used as the la- themetricofperformance. tencyoftheinstructionneedtowaitintheROB(inMMAsim) Figure 7 shows the normalized performancespeedup and before it could be committed. For example, a L2 cache hit effective bandwidth utilization of different memory system, memory access instruction could be committable only after where the baseline is DDR. For these FINE or MID work- 15 CPU cycles when it is added in the ROB. The baseline loads, such as BFS, canneal, GUPS, fine granularity access memory system has 2 DDR3-1333MHZchannelswith dual would benefit. The BOB performancedegrades range from 9 Normalized Speedup Effective Bandwidth Utilization Background Refresh Act/Pre Burst Controller Normalized EDP 2.5 50% 30 2.5 Normalized Speedup01..55012 01234%0000%%%%Effective Bandwidth Utilization Memory Power (W)1122050505 00112..55Normalized EDP DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul DDRBOBMI_1MI_mul BFS canneal GUPS listrank pagerank SSCA2 GMEAN BFS canneal GUPS listrank pagerank SSCA2 GMEAN Figure 7: Normalized Speedup and Effective Bandwidth Uti- Figure8: MemorypowerbreakdownandthenormalizedEDP lization of different memory system in 16-core configuration, (Energy*DelayProduct,lowerisbetter) thebaselineisDDR latencyiscategorizedintotwomajorcategories:Queuingand 49.38% to 78.38%. This is because that the BOB still uses DRAM Core Access Latency. The Queuing Latency repre- coarse granularity access, and the intermediate controller sentslatencyofamemoryrequestwaitingtobescheduledin would introduce packet overhead and extra latency. On the theTransactionQueue,whichhasbeenprovedtobethemain other side, MIMS_1 and MIMS_mul could improve perfor- componentofmemorylatency[50].TheDRAMCoreAccess mancebecausetheysupportvariable-granularity,thenormal- LatencyrepresentsthelatencyofexecutingDDRcommands ized speedups of MIMS_1 range from 1.11 to 1.53, and of of a memory requestin DRAM devices. In MIMS memory MIMS_mul range from 1.29 to 2.08, this indicates that in- system,thereisanadditionalSchedulinglatency,whichrep- tegrating multiple requests in a packet could reduce packet resentstheextraprocessinglatencyinducedbyBufferSched- headoverheard,thusimprovememoryperformance. Theef- uler, it includes the SerDes latency, scheduling latency and fectivebandwidthutilizationsnearlyhavethesamevariation packetencoding/decodinglatency.TheQueuingLatencycon- trend with the speedups in different memory systems. For tainsbothinmemorycontroller(waitingtobepacked)andin DDR,theyrangefrom15.58%to31.55%;forBOB,theeffec- bufferscheduler(waitingto be issued to DRAM devices)in tivebandwidthutilizationisdecreased,sinceeachmemoryre- MIMS. questwouldintroduceapacketoverhead,alongwiththewast- Figure9showsthememorylatencybreakdownin16-core ing bandwidth for transferring useless data in a cache line. configuration. We can see that for these memory intensive TheMIMS_1couldeliminatewastingdatabutstillsuffersig- workloads, the Queuing Latency dominates the memory la- nificantpacketoverhead. TheMIMS_mulcouldachievethe tency, especially for GUPS and SSCA2 application, which bestefficiencybandwidthutilization,rangingfrom21.15%to couldachieveabout1185.67nsand933.0nsrespectivelyin 44.49%. DDR memory system, meanwhile the DRAM Core Access Figure8showsthememorypowerbreakdownandthenor- Latencyisonly22.22ns. Thereasonforitisthatthesetwo malized EDP in different memory systems. Here we also applications suffer high MPKI as shown in table 2 and the consider the power of controller. The average total power traditionalDDR memorysystem is failed to serve them due for DDR is about 23.38W, and the BOB has a little more to its limited MLP. However,the QueuingLatencycouldre- power(26.36W),sincetheintermediatesimplecontrollercon- ducesignificantlyinMIMS,forinstance,itreduceto234.81 sumes more power than the on chip MC, the DRAM power ns for GUPS and 147.41 ns for SSCA2, that is because the of themare nearlythesame. The MIMS_1andMIMS_mul MIMS adopted sub-rank and it could provide more MLP could effectively reduce the Activation/Prechargepower be- sinceeachnarrowaub-rankcouldbeaccessedindependently. cause each (fine) request only activate/prechare a sub-rank Eventhoughtheintermediatebufferschedulerwouldinduce (one DRAM device in our work) with smaller row, and re- extra Scheduling Latency, the whole memory latency is re- duce the Burst power because it only read/write the really ducedforallworkloads. useful part of data in a cache line (such as 8B data in 64B Figure10showsthepercentageofdifferentcomponentsin cacheline). ThusthepowerofMIMS_1andMIMS_mulre- packets in MIMS_multiple memory system. Here we only ducedto 16.90Wand17.13Wrespectively. Thenormalized show the packets in downside (from cpu) bus. For a read EDP(EnergyDelayProduct)ofBOBreducesabout1.78,this packet, it only containspacket overhead(PKT_OH) and ad- is mainly because the introducing latency. MIMS_one and dress;forawritepacket,itcontainsdataalso.Wecanseethat, MIMS_multiplecouldimproveEDPby0.53and0.44respec- the address contributes a large portion of packet, it ranges tively,thisisbecausesub-rankcouldimprovememoryparal- from 39.15% to 67.39%, and Data contributes range from lelismandthusreducequeuingdelays. 8.24%to 55.26%. The BFSbenchmarkcontributesthemax Memory Latency:In DDR memory system, the memory address portion, that is because it has a read/write ratio of 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.