Table Of Content

Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation Amin Ansari Shuguang Feng Shantanu Gupta Scott Mahlke AdvancedComputerArchitectureLaboratory UniversityofMichigan,AnnArbor,MI48109 {ansary, shoe, shangupt, mahlke}@umich.edu ABSTRACT bereducedbelowacertainthresholdwithoutdrasticallysacrificing clock frequency. Lowering thisminimum achievable voltage can Extreme technology integration in the sub-micron regime comes dramatically improve the lifetime, energy consumption, and bat- witharapidriseinheatdissipationandpowerdensityformodern processors. Dynamic voltage scaling is a widely used technique terylifeofmedicaldevices,laptops,andhandheldproducts. totacklethisproblemwhenhighperformanceisnotthemaincon- The minimum achievable voltage for DVS is set such that un- cern.However,theminimumachievablesupplyvoltageforthepro- der the worst-case process variation, the processor operates cor- cessor isoftenbounded bythelargeon-chipcachessinceSRAM rectly [11]. The motivation for our work comes from the obser- cellsfailatasignificantlyfasterratethanlogiccellswhenreduc- vationthatlargeSRAMstructuresarelimitingtheextenttowhich ingsupplyvoltage. Thisismainlyduetothehighersusceptibility operationalvoltagescanbereducedinmodernprocessors. Thisis of the SRAM structures to process-induced parameter variations. becauseSRAMdelayincreasesatahigherratethanCMOSlogic Inthiswork,weproposeahighlyflexiblefault-tolerantcachede- delay as thesupply voltage isdecreased [39]. Furthermore, with sign, Archipelago, that by reconfiguring its internal organization increasing systematic and random process variation in deep sub- can efficiently tolerate the large number of SRAM failures that micron technologies, the failure rate of SRAM structures rapidly arise when operating in the near-threshold region. Archipelago increases in the near-threshold regime. Ultimately, the minimum partitions the cache to multiple autonomous islands with various sustainable V of the entire cache structure – and consequently dd sizes which can operate correctly without borrowing redundancy the core as a whole – is determined by the one SRAM bit-cell fromeachother. Ourconfigurationalgorithm–anadaptedversion withintheentiresystemwiththehighestrequiredoperationalvolt- of minimum clique covering – exploits the high degree of flexi- age. Thisforcesdesignerstoappropriatealargevoltagemarginin bilityintheArchipelago architecturetoreduce thegranularity of ordertoavoidon-chipcachefailures. redundancyreplacementandminimizetheamountofspacelostin AnSRAMcellcanfailduetothefollowingreasons: areadsta- thecachewhenoperatinginnear-thresholdregion. Usingourap- bilityfailure, awritestabilityfailure, anaccess timefailure, or a proach, the operational voltage of a processor can be reduced to hold failure [3]. Figure 1 depicts the bit error rate (BER) of an 375mV,whichtranslatesto79%dynamicand51%leakagepower SRAM cell based on the operational voltage ina 90nm technol- savings(in90nm)foramicroprocessorsimilartotheAlpha21364. ogynode[30]. Theminimumoperationalvoltageof64KBL1and Thesepowersavingscomewitha4.6%performancedrop-offwhen 2MBL2cachesisselectedtoensureahighexpectedyield,99%in operatinginlowpowermodeand2%areaoverheadforthemicro- processor. 1. INTRODUCTION write-margin limit read-current limit 1.E+00 Withaggressivesiliconintegrationandclockfrequencyincrease, power consumption and heat dissipation have become key chal- 1.E-02 m) plpscvetooooanowwlggtlaeeeeignrrsfegcac,ioinoslencunlsatershuelcuiesmtnmrgi[pdpc3teiti(8tisoDyo]in.,gnVanoISnrfted)omdafdilusisahccoteiraagosahpcwfdefrieoepndcvceteetierslcrfsyoestarhoiumlrerissfcea,ecodnteonicxsmtdetpeielctoposihforintoantiinchniqnedegugsrestmeh[ox2tearpo5slfe.]aprd.ceaiDtGdtcetukyrshcaonaegewataimdintrnhylgiygce-, SRAM Bit Error Rate 1111....EEEE----01000846 Nominal ThresholdVoltage (90n 624 M KBB ((9999%% yyiieelldd)) 612 mV651 mV namic power quadratically scales with voltage and linearly with frequency.However,thesupplyvoltageofamicroprocessorcannot 1.E-12 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 Power Supply Voltage (Vdd) Figure1:BiterrorrateforanSRAMcellwithvaryingV val- dd uesin90nm.Forthistechnology,thewrite-marginisthedom- Permission tomake digital orhardcopies ofall orpartofthis workfor inant factor and limits the operational voltage of the SRAM personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare structure. Here,theY-axisislogarithmic,highlightingtheex- notmadeordistributedforprofitorcommercialadvantageandthatcopies tremelyfastgrowthinfailureratewithdecreasingV .Thetwo bearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,to dd horizontaldotted-linesmarkthefailureratesatwhichthemen- republish,topostonserversortoredistributetolists,requirespriorspecific permissionand/orafee. tionedSRAMstructures(64KBand2MB)canoperatewithat Copyright20XXACMX-XXXXX-XX-X/XX/XX...$10.00. least99%manufacturingyield. bit byte word (8B) block (128B) column (256B) word-line (1KB) High failurerate region Medium failurerate region Low failurerate region Failure-freeregion ( 10-3(cid:1095)(cid:3)(cid:17)(cid:28)(cid:90)(cid:3)(cid:895) ( 10-5(cid:1095)(cid:3)(cid:17)(cid:28)(cid:90)(cid:3)(cid:1092)(cid:3)(cid:1005)(cid:1004)-3) ( 10-9(cid:1095)(cid:3)(cid:17)(cid:28)(cid:90)(cid:3)(cid:1092)(cid:3)(cid:1005)(cid:1004)-5) ( BER < 10-9) Our Target ECC-2, 2D ECC, Bit-Fix Row/column redundancy, ECC No protection is required 100 10 Percentage Faulty 0.11 0.01 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 Power Supply Voltage (Vdd) Figure 2: Percentage of faulty bits, bytes, words, blocks, columns, and word-lines for a 2MB L2 cache whilevarying the supply voltage. Here, theY-axis islogarithmic, highlightingthe rapidgrowth in faultyunitswhen decreasingV . Thetop part of this dd figuredepictsourconceptualdivisionofthisV rangeintofourdifferentregionsbasedontheprotectiondifficulty.Foreachregion, dd correspondingbiterrorratesandalsoseveralapplicableprotectiontechniquesarealsoshown. Inordertooperatecorrectlyinthe failure-freeregion,noprotectionmechanismisrequired. However, ascanbeseen,ourtargetisthehighfailurerateregionwhich causesanavalancheoffailuresforon-chipcaches. Figure1. Duetothehigher sensitivityof abit-cell toparametric discussedearlierforV ≥651mV,almostnofailureisexpected dd variationsatlowerV s,thefailurerateofanSRAMcellincreases (i.e.,failure-freeregion). Ascanbeseen,fortheV ≤ 400mV, dd dd exponentiallywhenV getsdecreased. Thisexponentialincrease finergranularitiesofredundancy arerequiredtoallowahighuti- dd in thenumber of faultycellsmakes the protection of the on-chip lizationofthespareelementssincealargefractionofword-lines, cachesmuchmoredifficultwhenoperatinginthenear-thresholdre- columns, blocks, andevenwordswouldbefaulty. Thisincreases gion.Ascanbeseeninthisfigure,thewritemarginmostlydictates thecomplexityofthedesignsincesimpledisabling(e.g.,blockor theminimumV anditisexpectedtooperatewithV ≥651mV waydisabling)orcoarsegrainredundancy(e.g.,roworcolumnre- dd dd duetothedominatingfailurerateofthe2MBL2cache.Thismini- dundancy)techniquescannotbeefficientlyapplied. mumoperationalvoltageisconsistentwithpredictedandmeasured In this work, we propose Archipelago (AP), a cache capable values(∼0.7V)reportedin[8]. ofreconfiguringitsinternalorganizationtoefficientlytoleratethe In the literature, several techniques have been proposed to im- large number of SRAM failures that arise when operating in the provedynamicand/orleakagepowerofon-chipcachesaswellas near-thresholdregion. APallowsfault-freeoperationbypartition- the entire processor [39]. Section 4.1 summarizes some of these ingthecacheintomultipleautonomousislandswithvarioussizes. low-power design techniques. Most of these methods exploit the Each island is a group of physical cache word-lines that can op- structurespecificsleepmodesand/orpower-awareresourcealloca- eratecorrectlywithoutusinganyword-lineoutsideoftheirgroup. tion toavoidfacingwiththefailures. Consequently, for low V Eachgrouphasasacrificialword-linewhichisdivideduptomulti- dd values (e.g., ≤ 651mV), the amount of power saving for these pleredundancyunits. Thesespareunitsaredirectly/indirectlyem- methodsisrestrictedduetothearisingfailuresintheSRAMstruc- ployedtoachievefault-freeoperationoftheotherword-linesinthe tures. samegroup. Sacrificialword-linesdonot contributetotheeffec- Incontrast,theobjectiveofourworkistoenableDVStopush tivecache capacity since they do not store any independent data. thecore/processoroperatingvoltagedowntothenear-thresholdre- Theclusteringofthecachetotheseautonomousislandsisdoneby gion (e.g., low power mode) whilepreserving correct functional- aconfigurationalgorithmwhichisdescribedinSection2.3. This ityofon-chipcaches. Analternativetothisapproachisdual-V s adaptedversionoftheminimumcliquecoveringalgorithmtriesto dd wherecoreandcachesoperateatdifferentvoltages.However,dual- partitionthecache totheleast number of islands/groups tomini- V imposes a high cost in terms of area and design complexity. mize the number of sacrificial word-lines required for guarantee- dd Voltage-levelconvertersmustbeaddedtoallowsignalsharingbe- ingthefault-freeoperationofthecache. APenablesgreaterpower tween different voltageislands. Furthermore, high voltage mem- savingsthanpriorapproachesandrequiresonlyasinglepowersup- ory elements generate noise that victimizes the neighboring low ply. Theoverheadoftheapproachisasmallperformancepenalty voltagelogiccircuits,necessitatingeithershieldingorextranoise (4.6%) when operating in near-threshold mode and a negligible margins[40]. areaoverhead (2%) for themicroprocessor. WeapplyAP toL1- Inthiswork,wetargetultra-lowvoltageoperation D,L1-I,andL2cachestoevaluatetheachievablepowerreduction (V ≤ 400mV) in the near-threshold region which causes an fortheoverallmicroprocessorsystem. Athoroughcomparisonof dd extreme bit-cell failure rate (> 10−3) for the on-chip caches – APwithseveralwell-knownproposalsisdoneinSection4.3. presented as the high failure rate region in Figure 2. This figure The primary contributions of this paper are: 1) A flexible and showsthepercentageoffaulty(i.e.,containing atleastonefaulty highlyreconfigurablearchitecturethatcanbeleveragedtoprotect bit-cell) bits, bytes, words, blocks, columns, and word-lines for regularSRAMstructuresagainsthighfailurerates; 2)Inorderto a 2MB L2 cache for different V values. Figure 2 is generated minimize the amount of redundancy required for protecting the dd assuming a uniform failure distribution based on the relation be- cache,wemodelthecachefaultpatternwithapropergraphstruc- tween bit-cell failure rate and V in Figure 1. In this figure, at turetoguideaminimum cliquecovering configuration algorithm dd V = 350mV, almost all blocks are faulty while 30% of the fornearoptimalgroupformation;3)Toourknowledge,Archipelago dd wordsarefaultyforthe2MB L2cache. Ascanbeseenandalso isthefirstlowoverhead,fault-tolerantarchitecturaltechniquewhich Figure3: A dual-bank2-way set-associative Archipelago. Each cachebankconsistsof eightlines. Cacheblocksaredividedinto threeequallysizeddatachunks.Blackboxesrepresentchunksofdatathathaveatleastonefaultybit.Thememoryandfaultmaps, whichareessentialcomponentsoftheproposedscheme,arealsoshown. allowsthecachetooperatecorrectlywhenV ≤400mV;and4) ThefigurealsoshowsstructuresspecifictoAParchitecture(i.e., dd Adesignspaceexplorationin90nmtoshowtheactualprocessof thememorymap,thefaultmapandtheMUXinglayer).Thecache selectingthearchitectureparametersforbothL1andL2caches. accesscanbelogicallydividedintothreesteps. First,thememory mapisreadtodetermineanindexfor accessingthecache banks. 2. ARCHIPELAGO Next,thecachebanksandfaultmapareaccessedinparallel.Here, thecachebanksprovidetheactual(fromregularlines)andredun- Inthissection, wefirstdescribe theAParchitecturethat adap- dant data (from sacrificial line). The fault map indicates which tivelyreconfiguresitselftoabsorbfailingSRAMs.Next,wepresent partsoftheactualorredundantdatashouldbeusedtoconstructthe aconfigurationalgorithmthatexploitstheintrinsicflexibilityofthe output. Finally,theMUXinglayerleveragestheinformationfrom AParchitecturetoperformanearoptimalredundancyassignment. faultmaptoassembletherequesteddata. The coalescence of highly flexible hardware and intelligent con- InAP,amemorymapisusedtoprovidealevelofindirectionin figuration enables our proposed solution to minimize the impact cacheaccesses. Eachcacheaccessfirstindexesintothismemory ofoperatingatnear-thresholdregiononcachecharacteristics(e.g., map,whichsuppliesthelocationofthedatalineanditscorrespond- sizeandlatency)withminimaloverhead. ingsacrificialline. Afterthesetwolineshavebeenaccessedfrom 2.1 Baseline APArchitecture their respective banks (different ones), a MUXing layer2 is used tocomposeafault-freeblockbyselectingtheappropriatechunks InAP,thecacheispartitionedintoseveral autonomous islands fromeachline.Consideringthebasicdesignofafast-accesscache of varying sizes. Eachisland isa group of physical cache word- (e.g.,L1cache), basedonthetagcomparisonresults,columnde- lines that can operate correctly without using any word-line out- coders – which are placed before our MUXing layer – MUX the sideof theirgroup. Eachgrouphasasacrificialword-linewhich corresponding blocks out of read word-lines. On the other hand, isdividedupintomultipleredundancyunits.Thesespareunitsare itshould beclear thatfor energy-efficient caches (e.g., L2cache) directly/indirectly employed toallow afault-freeoperation of the design, each word-linemight only containasingleblock andac- other word-lines in the same group. Put simply, considering one cesstothetaganddataarrayissequential. Asaresult, indexing of these groups with n + 1 partiallyfunctional cache word-lines, intothe memory mapwillbe done after resolving thetagpart of APallowsthisgrouptobehaveasasetofnfullyfunctionalcache theaddress. word-lines.Intheremainderofthispaper,werefertoeveryphysi- TheMUXing layer receives itsinputs indirectly fromthe fault calcacheword-line,whichmaycontainmultiplecacheblocksasa map.Foragivendataline,thefaultmapdetermineswhichchunks line. arefaultyandshouldbereplacedwithcorrespondingchunksfrom Figure3isatoyexamplethatdepictsa2-wayset-associativeAP the sacrificial line. To aid in the encoding and decoding of this cache withtwobanks. Eachcachebank haseight linesandeach information,auniqueaddressisassignedtoalllineswithinagroup lineconsistsoftwoblocksofdatawhicharedividedinto3equally (groupaddress). Forinstance,inFigure3,lines10and15arethe sizeddatachunks. Inthisfigure,lines4,10,and15formagroup firstandsecondlinesofG3,respectively. Foreachdatachunkin (named G3) in the cache. Line 4 (labeled G3(S)) is the sacrifi- the sacrificial line, the fault map stores the group address of the ciallinethatfurnishestheredundancyneededtoaccommodatethe linetowhichthat datachunk isassigned. Ingeneral, for agiven faultychunksinlines10and15. Inordertominimizetheaccess group,ifthesacrificiallineconsistsofkdatachunks,thefaultmap latency overhead, thesacrificial line(4) and thedata lines (10 or 15)shouldbeindifferentbanks1 sothatthesacrificiallinecanbe requires to store (i1,i2,...,ik). In this notation, ij is the group addressofthelinetowhichthejthdatachunkofthesacrificialline accessedinparalleltotheoriginaldataline. isassigned.Inourexample,theentrywhichisassignedtothethird 1Inthiswork,ourapproachisdescribedforprotectingcacheswith onlytwobanks.However,itshouldbeclearthatourschemecanbe 2As a side note, since read and write are symmetric operations, naturallyextendedtocacheswithanynumberofbanks≥2without theonlymodificationinthehardwareimplementationwouldbeto lossofgenerality. replacetheMUXesintheMUXinglayerwithpasstransistors[40]. (a) Readingaline(G3(3))fromthesamebankasthesacrificialline(G3(S)) (b) Readingfromthesemi-sacrificialline(G3(2)) Figure4: Twospecial read-accessscenarios. Astandardread accessisillustratedinFigure3. Noticetheextrabitthathasbeen addedtoboththememorymapandeveryfaultmapentrytohandlescenario(b).Sincethe4thdatachunkofthesemi-sacrificialline isre-allocated,itismarkedasRAinscenario(b). group(G3)inthefaultmap,contains(1|2|-|2|-|1)forway0 sacrificiallinesacrosscache. Therefore, theprimaryobjectiveof andway1.Consideringonlyway1,thisindicatesthatthe4thchunk APistoformgroupssuchthattherearenocollisionsbetweenany ofthecorrespondingsacrificialline(G3(S))isdevotedtoG3(2),the twolineswithinagroup. 6thchunkisdedicatedtoG3(1),andthe5thchunkisnotassigned to any line. Finally, the MUXing layer gets its input from a set 2.2 AP withRelaxedGroup Formation ofcomparatorsthatcomparethegroupaddressofadatalinewith Sinceeverygrouprequiresasacrificiallinebededicatedsolely faultmapentriesforthesamegroup. Forexample,groupaddress forredundancy,APstrivestominimizethetotalnumberofgroups of line 15 is2 – read frommemory map – which gets compared thatmustbeformed.Giventhatthenumberoflinesisfixedwithin withG3’sfaultmapentries. acache,achievingthisobjectiveimpliesthatlargergroupsarepre- Inorderforlinesinacachetocometogetherandformagroup, ferred over smaller ones. In order to improve the likelihood of theyshouldnothaveanycollisions. Inourscheme,twolineshave forminglargegroups,weremovetheconstraintthatforcesthesac- acollisioniftheyhaveatleastonefaultydatachunk inthesame rificialline,fromaparticulargroup,tobeinadifferentbankthan position. Forexample,iftheseconddatachunkofthe3rdand6th alltheotherlines. Thisallowsanysetoflinesfromthetwobanks lines is faulty, lines 3 and 6 have a collision. Similarly, in Fig- toformagroup. However,inordertominimizetheaccesslatency, ure 3, lines10 and 15 are collision-free sincethey do not have a wedonotallowagrouptoderiveallitslinesfromthesamebank. faultyblockinthesameposition. Bydefinition,collisionsarein- Inotherwords,eachgroupshouldhaveatonelineineachbankto dependent of workload running on the system and data stored in allowaparallelaccesstotheoriginalandsparedata. Relaxingthe the cache. Two lines that have a collision cannot share a single mentionedconstraint, givesrisetotwonewreadaccessscenarios sacrificial line. This means as the number of collisions increase, inadditiontothestandardcasedescribedinFigure3. alargerfractionof cacheneedstobeturnedintosacrificiallines. HandlingDifferentTypesofAccesses: InFigure4(a),lines2, Groupformationisthekeycomponenttominimizethenumberof 4, 7, 10 and 15 areinthe samegroup, withline4serving asthe First Bank Second Bank 1 G1(1) 6 D 2 G2(1) 7 G2(3) 3 G2(S) 8 G1(3) 4 G1(2) 9 G1(S) 5 G2(2) 10 G2(4) 10 1 Group 2 7 2 9 4 Group 1 8 5 3 6 Disabled Figure5: Asimplifiedexampleoftheminimumcliquecoveringprocessforagivendistributionoffaultsinthecachebanks. Here, eachbankhasonly5lines.Thesolverdisablesthe6thlinesinceithasmanyfaultychunksand,isthereforeveryexpensivetorepair. Twocliquesareformedbythesolverandlines9and3aredesignatedassacrificiallinesforgroups1and2,respectively. Moreover, theconceptualpartitioningofthecacheintodistinctislandsisalsodemonstrated. sacrificial line. However, when line 2 or 7 is accessed as a data sacrificial line. As can be seen, the 4th data chunk of G3(2) has line,aparallelaccesstoline4cannotbeperformedsincetheyare beenre-allocatedtoG3(3). Consequently,weartificiallymarkthe inthesamebank. Tohandlesuchascenario,asmallandtranspar- 4thchunkofthesemi-sacrificiallineas“re-allocated”(RAinFig- ent modification tothe AP architecture isneeded. Wearbitrarily ure4(b)). WhenG3(2)isaccessed,itsfaultyandalsore-allocated select alinefromthebank not containing thesacrificialline(but datachunks(i.e.,the4thand5thchunks)aresuppliedasexpected still from the same group) and label it as a semi-sacrificial line bythecorrespondingchunksfromthesacrificialline(G3(S)).How- (line15inFigure4(a)).Thissemi-sacrificiallinecanbeusedtore- ever, this cannot be easily done using the unmodified fault-map placefaultychunksfromcachelineswhichareinthesamebankas sincethe4thentryofthefaultmappointstothefaultychunksof thesacrificialline. However,incontrasttothesacrificialline,the datafromG3(3). Asaresult, thesystemcannot identifythat the semi-sacrificial line still contributes to the effective cache space. 4thchunkofthesemi-sacrificiallineshouldbereplacedduringan Inotherwords, asemi-sacrificiallineonlyactsasalevelofindi- access. Inordertotacklethisproblem,weaddanextrabittoev- rectionforredundancysubstitutionbyborrowingredundancyfrom eryfaultmapentrywhichindicateswhetherthecorrespondingdata itscorresponding sacrificialline. Moreover, semi-sacrificial lines chunkshouldbereplacedwhenaccessingthesemi-sacrificialline. guarantee the parallel access to the original and spare data in all Forinstance,inFigure4(b),sincethe4thand5thchunksshouldbe possiblescenarios. Consequently, thefaultychunksofthelines2 replaced,theircorrespondingbitsinthefaultmaphavebeensetto and7arereplacedusingG3(2)insteadoftheG3(S).Withtheaddi- “1”. tionofthesemi-sacrificialline,accessestothecachecanbedivided intothreecategories: 1) Accessestodataand sacrificial linesthat resideindifferent 2.3 AP Configuration banks. ThisisthebasecasewhichisdemonstratedinFigure3and To maximize the number of functional lines in the cache, we doesnotrequireanyspecialconsiderationbeyondwhatisdescribed needtominimizethenumberofsacrificiallinesrequiredtoenable earlier. fault-freeoperation. Aspreviouslydiscussed,thereisasinglesac- 2)Accessestodatalinesthataretheinthesamebankasthecor- rificiallinedevotedtoeverygroupoflines. Thissacrificiallineis responding sacrificial line. ThiscaseisillustratedinFigure4(a). notaddressableasadatalinesinceitdoesnotstoreanyindependent Line G3(3) is the data line and line G3(2) is the semi-sacrificial data.Inotherwords,sacrificiallinesdonotcontributetotheusable linewhichsuppliesthereplacement chunks forthefaultyonesin capacityofthecache. Dependingonthenumberofcollision-free G3(3). Instead of accessing G3(S), the memory map remaps the groupsthatareformed,theeffectivecapacityofthecachecanvary addressofthesacrificiallinetotheaddressofG3(2). Thiscaseis dramatically.Thismotivatestheneedtominimizethetotalnumber particularlyinteresting,sincenootherpartsoftheproceduremust ofgroupsrequired. beadjustedtosupport thisaccessscenario. Thefaultmapisstill Problem Modeling: Here, we model the problem as a graph used,unmodified,toindicatethatthe4thdatachunkfromG3(3)is in which every group corresponds to a clique. To minimize the faultyandshouldbesubstituted.Therefore,APreplacesthisfaulty numberofgroupsinagivenfaultycache–upperpartofFigure5 chunkofG3(3)bythesemi-sacrificialline’s4thchunkinsteadof withtwobanks,wemodeltheproblemasaminimumcliquecover sacrificialline’s4thchunk. (MCC)problem[14]. Figure5showstheprocess offormingthe 3)Accessestoasemi-sacrificialline. Thiscaseisdemonstrated groups givenafault patternforthecache. Eachnodeinthecon- in Figure 4(b) for which two small modifications to the access structedgraph,teninall,isacacheline. Thereisanedgebetween procedure are necessary. An additional bit is added to memory two nodes if and only if there is no collision between the corre- mapentriesindicatingwhethertheaccesseddatalineisthesemi- spondinglines. Therefore,itispossibletoassignconnectednodes base 2nd non-fair 2nd fair 64-cap 12 average non-functional lines max non-functional disabled 10 100 y of Occurrence 68 ber of Word-lines 246800000 uenc 4 Num base 2nd non-fair 2nd fair 64-cap eq Different Versions of the Solver Fr 2 0 4 9 14 19 24 29 34 39 44 49 54 59 64 69 74 79 84 Clique Size Figure6: DistributionofthecliquesizefordifferentversionsofthesolverbasedonMonteCarlosimulation. Notethatfor64-cap, thesizeofallcliquesis≤ 64. Here,thenumberofnon-functionallinesisthesummationofthenumberofsacrificiallinesandthe numberofdisabledlines. Theplotintheinsertdepictstheaveragenumberofnon-functionalcachelines,themaximumnumberof non-functionallines,andthenumberofdisabledlineswhileachieving99%yield. tothesamegroup. Forexample, thereisanedgebetweenlines2 andallowsustotakecareoftheparallelaccessproblemdiscussed and3butnoedgebetweenlines1and10. above. Asmentionedearlier,agroupisasetoflinesforwhichthereis 2)AnartifactintheDSATURsolveralgorithmcansometimes nocollisionbetweenanypairoflines. Byconstructingthegraph causeittodisablealargefractionofcachelines.Morespecifically, this way, a collision free group forms a clique (i.e., there is an itpicksthenodesforcoloringonlybasedonthedegreeofsatura- edgebetweeneverypairofnodes).Asaresult,thetaskofforming tionwhichisproportional tothereciprocal of thenode degreein groupscanberepresentedasfindingthecliquesintheconstructed ourconstructedgraph[22]. Becauseofthisbias,allthelinesfrom graph.However,sinceweareinterestedtominimizethenumberof onebankmightbeselectedwhiletherearemanyunassignedlines sacrificial lines, thisproblem turnsto findingthe minimumnum- intheotherbank. Insuchascenario,theunassignedlineshaveto ber ofcliquesthatcover theentiregraph. InFigure5, lines1, 4, bedisabled which resultsinalargefractionof disabled lines. In 8, and 9formthe firstgroup (G1, cliquewithsize4) whilelines order tosolve thisproblem, wemodifytheoriginal DSATURal- 2,3,5,7,and10formthesecondgroup(G2,cliquewithsize5). gorithmtopickthelinesfrombothbanksmoreevenly. Thiswas Line6isdisabled(D)sinceitcontains4faultychunksandrepair- donebygivingartificialprioritytothelinesinabankthathasmore ing it is not cost effective. In general, a line gets disabled, if its unassignedlinesatthebeginningofeachassignmentphase. correspondingnodeinthegraphdoesnotbelongtoanycliquewith 3)AswillbediscussedinSection3,minimizingtheareaover- size≥ 2. Here,thefaultmaphas2lineswhichcorrespondtothe headofthefaultmapcarriesamajorsignificanceforourscheme. sacrificial lines 9 and 3. As can be seen in this figure, there are Thesizeofeachentryinthefaultmapisproportionaltothe 16 faulty chunks out of 60, therefore, at most 10−(⌈16⌉) = 7 log (max{cliquesize}). Therefore, to reduce the size of the 6 2 cachelinescanbekeptfunctional afterthegrouping. Ultimately, faultmap,anupperbound(e.g.,64)canbeplacedonthemaximum ourconfigurationalgorithmcanachievethisbestcase,highlighting cliquesize(MCS).Byaddingthisfeature,allthecliqueswhichare theeffectivenessofourapproach. foundbytheDSATURsolvercanbeforcedtohaveasizesmaller MCC Solver: We use the transformation described in [19] to thanorequaltoMCS. converttheMCCproblemtoaminimumchromaticnumberprob- ThemainplotinFigure6depictsthedistributionofcliquesizes lem. Afterapplyingthistransformation, thefinalgraphispassed for different versions of thesolver: base (base DSATURsolver), totheDSATURsolver[22]whichusesawell-knownandefficient 2ndnon-fair(basesolver+modification(1)),2ndfair(basesolver approximationalgorithm.Asshownin[22],theapproximationfac- + modifications (1,2)), and 64-cap (base solver + modifications torfortheDSATURsolveris≤ 6.1%forvariousgraphdensities. (1,2,3)). This datais generated using 1000 iterationsof a Monte For the problem size that we target in this work, the run time of Carlosimulationfora2MBL2cachewith2048lines.Itshouldbe the DSATURalgorithm is always less than 5ms. In contrast, the clearthatthesizeofthefaultmapisproportionaltothe fullbacktrack-basedsolver(e.g.,optimalsolver)takesseveraldays (Num.of Cliques)×log (MCS)whichimpliesasmallnum- 2 tosolveoneinstanceofourconfigurationproblemwhichmakesit beroflargecliquesispreferable.Furthermore,since infeasibletobeusedinpractice. (Num.ofCliques)×(AverageCliqueSize)isequaltotheto- Nevertheless,theoriginalsolver’sanswerisnotdirectlyapplica- talnumberofword-lines,aconstantvalue,MCSshouldbeasclose bletoourproblem. Sincethesolverisfreetoformagroupinany as possible to the AverageClique Size. As a result, the most possiblewaythatminimizesthenumberofcliques,itispossibleto desirabledistributionofcliquesizeswould beatight distribution haveacliquewithnodesinonlyonebank. Thislattercaseisnot around largecliquesizes. AscanbeseeninFigure6, amajority afeasiblesolutionbecausethesolutiondisallowsparallelaccessto ofcliquesfallintothenarrowregionof60to80nodes. Thistight thespareelements. Asaresult,weapplyasetofmodificationsto distributionshowstheefficiencyandproperbalancingofthegroup theoriginalsolverformakingitsuitabletoourapplication: sizesintheprocessofgroupformationbythedifferentversionsof 1)Weforcethesolvertopickthesecondlineofthegroupfrom the solver. The smaller plot in this figure demonstrates different adifferentbankfromthefirstlineofthegroup. Thissmallmodi- characteristics of these 4 versions of the solver. In this plot, the ficationassuresusthatatleastonelineisselectedfromeachbank numberofnon-functionallinesisthesummationofthenumberof sacrificiallinesandthenumberofdisabledlines. Ascanbeseen, Table1:Thetargetsystemconfiguration the second modification (i.e., fairness) can effectively reduce the Parameters Value averagenumberofdisabledlinesfrom25.5to6.1. Technology 90nm(1.2VnominalVdd) Clockfrequency 1.9GHz[23] AnotherobservationisthattheconstraintMCS =64increases L1Cache 2banks64KBdata,2banks64KBinstruction, the maximum number of non-functional lines by 9% while it re- split,2-way,4cycles,1port,64Bblocksize ducessizeofeachfaultmapentryby17%–from7bitsto6bits. L2Cache 2banks2MBUnified,8-way,10cycles latency,1port,LRU,128Bblocksize Foreachcacheinstance,thenumberoflinesinthefaultmaparray Registers 80integer,72floatingpoint isequaltothenumberofsacrificiallines(e.g.,2inFigure5).How- ROB(re-orderingbuffer) 128entries ever,duetothepresenceofprocessvariationinalargepopulation LSQ(load/storequeue) 64entries offabricatedchips,differentfaultpatternsshouldbeexpected. As Instructionfetchbuffer 32instructions Integer/FPissuequeue 32/32entries wedescribedearlier,inourevaluation,weemployaMonteCarlo FU(functionalunit) 4intALU,4intmult/div,2memorysystemports simulationtogenerateapopulationof1000cacheinstancesandthe FPU(floatingpointunit) 4FPALU,1FPmult/div totalnumber offault maplinesisdetermined basedonthemaxi- Mainmemory 225cycles(highpower),34cycles(lowpower) Branchpredictor combined(bimodaland2-level), mumnumberofsacrificiallineswhileachievinga99%yield. RAS(returnaddressstack)has32entries HardwareConfiguration: InordertoconfigureAP,themem- BTB(branchtargetbuffer) 2048entries,2-wayassociative, oryandfaultmapsneedtobefilled. Theinitialstepinvolvessolv- BHT(branchhistorytable)has4096entries ingtheMCCconfigurationproblemforagivencachefault-pattern. GiventheMCCsolution,eachline–anodeinthegraph–canbe classifiedas: 1)Dataline: Foreachdataline,anewmemorymap CacheAddressingMechanismafterCapacityReduction: In entry should be allocated. Each memory map entry has 5 fields ourdesign,thememorymapcanbeusedtoremaptheoriginalad- (Figure4(b)): Thefirstfieldisthedatalineaddress. Ifthislineis dressofthesacrificiallinestoother functional lines. Asaresult, inthesamebankasitsrespectivesacrificialline,thesecondfield asmallfractionofthesetswillhavethefunctionalityoftwosets. willsettotheaddressoftherespectivesemi-sacrificialline.Other- Inotherwords, twosetsdifferentsetswillbelocatedinthesame wise,itwillgettheaddressofthesacrificialline.Thethird,fourth, line.Forthosedual-setword-lines,associativityisreducedbyhalf. andfifthfieldsshouldbesettothedataline’sgroupnumber,group Thesedual-setlinesaredistributedevenlyacrossthecachetopre- address, and “0”, respectively. 2) Sacrificial line: Although no ventbiasedmissrateforanaddresssub-range. Thetagcompari- changeinthememorymapisrequired,afaultmapentryshouldbe sonandreplacementlogicneedsstraight-forwardmodificationsto allocatedforeachsacrificialline. Eachfaultmapentryhasafield makethiswork,detailsofwhichareomittedintheinterestofspace. perdatachunk–6fieldsinFigure4. Forfaultydatachunksofa High Power Mode Operation: In the high power mode, our sacrificiallineortheoneswhicharenotassignedtoanyfaultydata schemeisturnedoffinordertominimizetheunwantedoverheads: chunks,correspondingfieldsinthefaultmapshouldbesetto“-”. 1) All the cache lines are functional and there is no sacrifice of Forotherdatachunks,correspondingfaultmapfieldsshouldbeset the cache capacity. 2) Assuming clock gating, there isanegligi- to the group addresses of the data/semi-sacrificial lines to which ble overhead for the dynamic power due to the switching in the those data chunks are assigned. 3) Semi-sacrificial line: This is bypassMUXeswhichconsistsoftheMUXinglayerandanaddi- similar to the first case, except the last field of the memory map tionalMUXforbypassingthememorymap.Inotherwords,inhigh entryshouldbesetto“1”. 4)Disabledline: Nothingneedstobe powermode, theaccesspathofAPincludesanadditionalbypass doneinthiscase. MUXcomparedtoabaselinecache. 3)Leakagepoweroverhead LowPowerModeOperation:Aswesawinthissection, remainsthesame. However,powergatingtechniquescanbeused Archipelagoleveragesasimplearchitecturewhichmainlyconsists forgeneralleakagemitigation[18]. oftworelativelysmallarraystructures.However,inordertoachieve arobustsub-400mV operation,asophisticatedconfigurationalgo- 3. EVALUATION rithmisusedtocustomizethehardwarebasedontheexistingfault patterninaparticularcache. Thishardware/softwareco-designal- ThissectionevaluatesthepotentialofAPinreducingthepower lowstokeepthearchitecturesimplewhenmodelingtheconfigura- of the processor while keeping the overheads as low as possible. tion withwell-establishedalgorithmic problems. Thefirst timea Comparisonswithrelatedworkarepresentedinthenextsectionof processorswitchestolowpowermode,thebuilt-inselftest(BIST) thepaper. modulescansthecacheforthepotentialfaultycells. Afterdeter- mining the faulty chunks of each cache line in low power mode, 3.1 Methodology theprocessorswitchesbacktohighpowermodeandconstructsthe Forperformanceevaluation,weuseSimAlpha,avalidatedmicro- mentionedgraphandsolvestheMCCproblemusingthemodified architecturalsimulatorbasedontheSimpleScalarout-of-ordersim- DSATURsolver. Asmentionedbefore,thesolvertimefora2MB ulator[5]. TheprocessorisconfiguredasshowninTable1andis L2cacheislessthan5msonanIntelCoreTM2Duomachine. This modeled after the DEC Alpha 21364 at an ambient temperature solutioncontainstheinformationthatisrequiredtobestoredinthe of40◦C [35, 23]. Dynamicprocessor power consumptioniscal- memoryandthefaultmaps.Thisconfigurationinformationcanbe culated using Wattch [7] based on the activity factors of individ- storedonthehard-driveandiswrittentothememorymapandfault ualcorestructures,andleakagepoweriscomputedwithHotLeak- mapatthenextsystemboot-up.Inaddition,thememorymap,fault age[41].CACTI[31]isleveragedtoevaluatethedelay,power,and mapandthetagarraysneedtobeprotectedusing,forexample,the area of the base cache structures. To take into account the over- wellstudied10Tcell[8]whichhasabout 66%areaoverhead for headsofthememorymapandfaultmaparrays,weusetheSRAM theserelativelysmallarrays. These10Tcellsareabletomeetthe generator module provided by the 90nm Artisan Memory Com- target voltageinthiswork fortheaforementioned memory struc- piler. TheSynopsys standardindustrialtool-chain(withaTSMC tureswithoutfailing. However,aswewilldiscussinSection4.3, 90nmtechnologylibrary)isusedtoevaluatetheoverheadsofthe 10Tcellsarenotacost-effectiveoptionforprotectingtheentireL1 remaining miscellaneous logic(i.e, bypass MUXes, comparators, andL2caches. subsettagcomparisondisabling,etc.).Lastly,tofindthematching Percentage of non-functional cache lines Fault-map area-overhead 100 1b chunks 2b chunks 4b chunks 8b chunks 16b chunks 90 80 70 Percentage 456000 30 20 10 0 Power Supply Voltage (Vdd) (a) Percentageofnon-functionallinesandareaoverheadofthefault-mapforL1whilevaryingV anddatachunksize dd Percentage of non-functional cache lines Fault-map area-overhead 100 1b chunks 2b chunks 4b chunks 8b chunks 16b chunks 90 80 70 Percentage 456000 30 20 10 0 Power Supply Voltage (Vdd) (b) Percentageofnon-functionallinesandareaoverheadofthefault-mapforL2whilevaryingV anddatachunksize dd Figure7:ProcessofdeterminingtheminimumachievableV forL1andL2cacheswhilelimitingthefractionofthenon-functional dd cachelinesandalsotheareaoverhead ofthefaultmapstructureto≤ 10%. Moreover, inthese10sub-plots,vertical dottedlines showtheminimumachievableV whiledatachunksizevariesfrom1bitto16bits. dd frequencyforagivenV ,weusethealpha-powermodeldescribed fractionofthecachelinesthatgetdisabled asloweringV leads dd dd in[27]. toincreasingerrorratesandaprecipitousincreaseinfaultychunks. Foragivensetofcacheparameters(e.g.,V ,chunksize,MCS, Aswementioned earlier, for thesedisabledlines, no entryinthe dd etc.),aMonteCarlosimulationwith1000iterationsisperformed faultmaparrayisrequired. usingthemodifiedDSATURsolverdescribedinSection2toiden- Here,theverticaldottedlineshighlighttheminimumachievable tifytheportionofthecachethatshouldbedisabled. Asdiscussed V based on the aforementioned 10% limits on non-functional dd earlier,solutionsgeneratedbytheMCCsolvertargeta99%yield. linesand fault map size. It isnotable that for small chunk sizes, Inother words, only1%ofmanufacturedandconfigured on-chip area overhead of the fault map is the limiting factor, while for cachesareallowedtoexhibitfailureswhenoperatinginlow-power larger chunks, the number of non-functional lines becomes dom- mode. Ontheotherhand,thetargetyielddirectlyimpactsthesize inant. Ascanbeseeninthisfigure,theminimumachievableV dd ofthefaultmap. Itssizeissetbasedonthemaximumnumberof forL1islowerthanL2.ThiswasexpectedduetoL2’slongerlines cliquesformedacrossallcacheinstancesintheMonteCarlopro- andlargersizewhichmakesprotectionofL2harderthanL1[40]. cesswhenignoringasmanyworst-casesituationsasthetargetyield Therefore,L2protectioncostdictatestheminimumoperatingvolt- allows. ageof our system. Inaddition, basedonthetrendinFigure7, it shouldbeclearthatachunksizeoutsideofthepresentedrangewill 3.2 DesignSpace Exploration onlyresultinahigherminimumV .Asaresult,weselect375mV dd astheminimumV (i.e,low-powermodeoperatingvoltage)since Figure7showstheprocessofdeterminingtheminimumachiev- dd allotherlowervoltagesviolateour10%limits. ableV forourtargetsystem.Inthisfigure,chunksizevariesfrom dd 1bit to16bitsforbothL1andL2caches. Inhigh-power mode, ForVdd = 375mV,adesignspaceexplorationofL1-(D/I)and L2 caches is demonstrated in Figure 8. There are two important both fault and memory map arrays remain idle and leak power. parameters tonote: 1) MCS, themaximum allowable cliquesize It iscrucial to minimize the size of these structures. The size of (Section 2.3) and 2) chunk size, varying from 1b to 128b for L1 thememorymapisessentiallyfixedbythenumberoflinesinthe and from 1b to 32b for L2. From Figure 8, it should be clear cache. Thefaultmapsize,however,canvarysignificantlydepend- thatthechunksizesoutsideofthepresentedrangealwaysviolate ingonconfigurationparameters,motivatingacloserlookatthesize at least one of our aforementioned 10% limits. For every pair of ofthefaultmapasanimportant designfactor. Consequently, we (MCS,chunksize),aMonteCarlosimulation,targeting99%yield, limittheareaoverheadofthefaultmapto10%(ofthecachearea). is performed with 1000 iterations. This simulation identifies the Furthermore,sincecachesizehasastrongcorrelationwithsystem necessaryareaoverheadofthefaultmapandthefractionofnon- performance, we limit our scheme to setting at most 10% of the functional lines in the cache. Here, we still limit the fraction of cachelinestonon-functional. thenon-functionallinesto10%whiletryingtominimizethearea AsevidentinFigure7,decreasingV increasesthenon- dd overheadofthefaultmap. functional portionofthecacheandalsotheareaofthefaultmap InFigure8(a),withintheshadedregion,onlypointsontheblack array. However, beyond acertainpoint, the areaoverhead of the dottedline(Paretofrontier)areconsideredsincetheydominatethe fault map starts decreasing. This phenomena is due to the large (a) L1designspaceexploration (b) L2designspaceexploration Figure8: DesignpointsfordifferentMaximumCliqueSize(MCS)andchunksizepairsareshownthatcanachievea99%yield. ForeachMCSvalue, correspondingchunksizesfrom{2n |n ∈ {0,1,...,7}}forL1andfrom{2n |n ∈ {0,1,...,5}}forL2are chosen. Theshadedboxesrepresentstheregionofinterestwhereboththefault-mapoverhead andthefractionofnon-functional linesislimitedto≤10%.TheblackdottedlineistheParetofrontier. other design points. Note that making the chunk size larger de- worst-casecapacitylossduringourevaluation.Inordertoevaluate creasestheareaoverheadofthefaultmapsincethenumberofen- the worst-case performance penalty of our scheme in low-power triesin the fault map is reduced. However, this reduction isalso mode, we ran the cross-compiled Alpha binaries of SPEC-CPU- accompaniedbyanincreaseinthefractionofnon-functionalcache 2KbenchmarksuiteonSimAlphaafterfast-forwardingtoanearly lines,theresultoffeweredgesinthegraphdescribedinSection2.3. SimPoint[37].WeassumeoneextracyclelatencyforL1and2ex- ForL1,thedesignpointwithMCS = 32andchunksize = 16 tracyclesforL2.Cachesizeisalsoreducedbasedonthefractionof bits is selected, highlighting an interesting trade-off between the non-functionallinesinFigure8. Onaverage,a4.6%performance areaoverhead of thefaultmapand thefractionof non-functional penaltyisseeninlow-powermodefromwhich0.6%iscontributed lines. ByrepeatingthesameprocessforL2,asillustratedinFig- by the cache capacity loss due to the presence of non-functional ure8(b),thedesignpointwithMCS = 16andchunksize = 8 lines(Figure9(b)). Ascanbeseen,ourstrictlimitonthefraction bitsisselected. ofnon-functional linesresultsinminimalimpactonperformance becauseofcachecapacityloss.However,oneshouldnotethatlow- 3.3 Results powermodeperformanceisusuallynotamajorconcern. Inhigh Figure 9(a) summarizes the overheads of our scheme for both powermode,thereisnocapacitylosssincenofailuresneedtobe L1andL2caches. Asmentionedbefore, wealsoaccount forthe tolerated. Furthermore,basedonourCACTIdelayresultsandthe overheadsofusing10TSRAMcells[8]forprotectingthetag,fault frequencyofthesystem(Table1),thereisenoughslackontheac- map,andmemorymaparraysinlow-powermode. Inaddition,the cesstimeofourL1andL2cachestofitthesmallbypassMUXes faultmap,memorymap,andthesecondbankhavetheirownsep- (additional0.07nsdelay)withoutaddinganyextracyclestotheac- aratedecoderswhichareaccountedforinourevaluation. Leakage cesstimeofthesecaches. Inotherwords,thereisnoperformance overheadinhighpowermodecorrespondstothefaultmap,mem- lossinhigh-powermode.However,onemighthaveacachedesign ory map, miscellaneous logic, and extra leakage of 10T cells for withoutanyslackavailable. Inthatscenario,weaddanadditional tagarray.Note,thememorymapisafargreatercontributortoarea cycle inhigh-power mode for L1and L2, whichtranslatesintoa andleakagepoweroverheadintheL1thanintheL2. Thereason 3.6%performancelossforSPEC-2K. behindthisisthattheL1hasonly 1 thelinesoftheL2,whileits Summaryofbenefitsandoverheads:Figure10showsthesav- 4 overallsizeis 1.ForL2,thefaultmapisthemajorcomponentof ingsandoverheadsfortheAlpha21364microprocessorusingAP 32 overhead. Duetoitssignificantlylargersize, theL2cachedomi- forprotectingtheon-chipcaches. AscanbeseeninFigure10(b), natestheprocessor leakageandareaoverheads. Nevertheless, as theoverheadsoftheproposedmethodarealmostnegligible.These canbeseeninthisfigure,ourschemehasonlymodestoverheads overheadsareevaluatedin90nmusingthemethodologydescribed for theL2 cache. Dynamic power overhead inhigh-power mode in Section 3.1. On the other hand, Figure 10(a) depicts the per- canbemainlyattributedtobypassMUXessinceweassumeclock centagereductioninleakagepower,dynamicpower,andminimum gatingforthefaultmapandmemorymaparrays. InAP,whenin achievable supplyvoltagebyusingAPforprotectingtheon-chip low-power mode, thememorymapandMUXinglayerareonthe caches. These results are reported in the 45nm, 65nm, 90nm, critical path of the cache access. Based on our timing analysis, and 130nm technology nodes. The relation between the supply the access delay overhead of L1 and L2 are 0.41ns and 0.58ns, voltageandtheexpectedSRAMbit-cellfailurerateforthesefour respectively. Basedonoursystemclockfrequency(Table1),this technologynodesareextractedfrom[40,24,34,13,33,9,6].Con- translatesto1additionalcyclelatencyforL1and2additionalcy- sideringthe90nmtechnologynode,APenablesDVStosave79% clesforL2inlow-powermode. dynamicpowerand51%staticpowerinthenear-thresholdregion. Asmentionedbefore,thefractionofnon-functionallinesisbased Withtheaggressivetechnologyscaling,thesystematicandrandom ona99%yieldina1000-runforaMonteCarlosimulation. Asa variationsareexpectedtoincrease[33]. Thisresultsinhighersen- result, the fraction of non-functional lines would be smaller than sitivity/vulnerability of SRAM cells to power supply variations. whatispresentedinFigure8formanychips.Thisisbecauseacross Hence, the percentage of reduction in dynamic power/minimum allthefabricateddies,processinducedparametricvariationcauses achievableVddgraduallyreduceswhenheadingtowarddeepersub- differentcachefaultpatternstoappear. However,weconsiderthe microntechnologies. Nevertheless, a68%dynamicpower reduc- ffaauulltt(cid:3)(cid:3)mmaapp(cid:3)(cid:3)((1100TT)) mmiisscceellllaanneeoouuss(cid:3)(cid:3)llooggiicc mmeemmoorryy(cid:3)(cid:3)mmaapp(cid:3)(cid:3)((1100TT)) ttaagg(cid:3)(cid:3)oovveerrhheeaadd(cid:3)(cid:3)((1100TT)) 12 verheadPercentageofOverhead(cid:3)(cid:3) 1111112468024024 HighPower(cid:3)Mode Percentage of Performance Loss 1002468 extra access latency cache capacity loss 0 L1(cid:3)area L2(cid:3)area L1p(cid:3)loeawkearge(cid:3)L2p(cid:3)loeawkearge(cid:3) dpyonLwa1me(cid:3)ric(cid:3) dpyonLwa2me(cid:3)ric(cid:3) 164.gzip 175.vpr 176.gcc 181.mcf 186.crafty 197.parser 255.vortex 256.bzip2 300.twolf 171.swim 172.mgrid 173.applu 177.mesa 179.art 183.equake 188.ammp 200.sixtrack average (a) Area, leakage, and dynamic power overheads of our (b) Performance loss in low-power mode. Since the fraction of schemeforbothL1andL2caches. Here,10Tcellsareused non-functionallinesislimitedtolessthan10%,theaccesslatency forprotectingthefaultmap,memorymap,andtagarrays. overheadisthedominantfactorinperformancepenalty. Figure9:Area,power,andperformanceoverheadsofourscheme. Leakage Power Dynamic Power 130nm Minimum 90nm Vdd 65nm 45nm 0 10 20 30 40 50 60 70 80 90 100 Percentage Reduction (a) Percentageofreductioninleakagepower,dynamicpower,andmini- (b) Overheads of using AP for protecting on- mumachievableV forthebaselineprocessorusing4differenttechnol- chip caches of the mentioned microprocessor dd ogynodes. in90nm. Figure 10: Low-power mode benefits and overheads for an Alpha 21364 microprocessor system (Table 1) augmented with Archipelago. Here, we account for the dynamic power overhead of accessing the second bank in low power mode for handling failures. tioncanbeachievedin45nm. Incontrast,forleakagepower,the notlikelytobeaccessedinthenearfuture. Menget. al.[28]pro- percentage reductiongraduallyincreasesasthetechnologyscales posedamethodforminimizingleakageoverheadinthepresenceof down. Thisismainlyduetothedrasticdifferencesincorrelations manufacturingvariations.Inthisscheme,theyartificiallyprioritize betweentheleakagepowerandV acrosstechnologygenerations. cache ways with smaller leakage and resize the cache by avoid- dd Indeepertechnologynodes,areductioninV manifestsasamuch ingsub-arraysthathavehigherleakagefactors. Insteadofturning dd larger saving in leakage power [15]. Therefore, for deeper tech- off blocks, drowsy cache [12] is a state preserving approach that nologies,eventhoughweachievelessV reduction,thenetleak- has two different supply voltage modes. In order to save power, dd age power saving is larger. Due to the excessive increase in the recently inactive cache blocks periodically fall into a low power sub-thresholdtransistorleakage,itisexpectedthatleakagepower modeinwhichtheycannotbereadorwritten. willdominatethetotalpowerconsumptionofachipinthefuture ForlowV values(e.g., ≤ 651mV in90nm), theamount of dd technologies[15]. powersavingforschemessimilartodrowsycache[12]isrestricted duetofailuresinSRAMstructures[34]. Asdiscussedearlier,our objective is to enable DVS to push the processor/core operating 4. RELATED WORK voltage down to the near-threshold region while preserving cor- MotivationsfortheAPdesigncomefrompreviousworkintwo rect functionality of on-chip caches. Wilkerson et. al. [40] pro- major categories: low-power and fault-tolerant cache techniques. posedtwodifferentcacheprotectionschemesforL1andL2caches Here,SRAMcellstabilityeffortsareincludedinthefirstcategory that useseveral levelsof decoding/shifting totakethefaultydata sincetheytrytoproactivelyavoidfailures.Furthermore,acompar- chunksoutandreplacethemusingECCprotectedpatches. Dueto isonof APwithseveral oftheseconventional andstateof theart thestrictbindingbetweendataandredundancy,theirschemesneed proposalswillbepresented. todisable 50% of L1and 25% of L2caches whichresultsintoa considerableperformancedrop-offinlowpowermode. Recently, 4.1 Low-Power CacheTechniques Abella et. al. [1] proposed a cache protection scheme based on The usage of Vdd gating for leakage power reduction by turn- sub-block disabling which can provide a better performance pre- ingoffcachelinesisdescribedin[20]. Thisapproachreducesthe dictability than [40]. However, since this scheme relies on dis- leakagepowerofthecachebyturningoffthecachelinesthatare

Description:

Amin Ansari. Shuguang In this work, we propose a highly flexible fault-tolerant cache de- .. In AP, a memory map is used to provide a level of indirection in.

Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation PDF

12 Pages·2010·0.89 MB·English

Checking for file health...

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Download Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation PDF Free - Full Version

by Unknow| 2010| 12 pages| 0.89| English

Download Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation by in PDF format completely FREE. No registration required, no payment needed. Get instant access to this valuable resource on PDFdrive.to!

Free Download PDF

About Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation

Amin Ansari. Shuguang In this work, we propose a highly flexible fault-tolerant cache de- .. In AP, a memory map is used to provide a level of indirection in.

Detailed Information

Author:	Unknown
Publication Year:	2010
Pages:	12
Language:	English
File Size:	0.89
Format:	PDF
Price:	FREE

Download Free PDF

Safe & Secure Download - No registration required

Why Choose PDFdrive for Your Free Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation Download?

100% Free: No hidden fees or subscriptions required for one book every day.
No Registration: Immediate access is available without creating accounts for one book every day.
Safe and Secure: Clean downloads without malware or viruses
Multiple Formats: PDF, MOBI, Mpub,... optimized for all devices
Educational Resource: Supporting knowledge sharing and learning

Frequently Asked Questions

Is it really free to download Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation PDF?

Yes, on https://PDFdrive.to you can download Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation by completely free. We don't require any payment, subscription, or registration to access this PDF file. For 3 books every day.

How can I read Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation on my mobile device?

After downloading Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation PDF, you can open it with any PDF reader app on your phone or tablet. We recommend using Adobe Acrobat Reader, Apple Books, or Google Play Books for the best reading experience.

Is this the full version of Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation?

Yes, this is the complete PDF version of Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation by Unknow. You will be able to read the entire content as in the printed version without missing any pages.

Is it legal to download Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation PDF for free?

https://PDFdrive.to provides links to free educational resources available online. We do not store any files on our servers. Please be aware of copyright laws in your country before downloading.

The materials shared are intended for research, educational, and personal use in accordance with fair use principles.