Table Of ContentArchipelago: A Polymorphic Cache Design for Enabling
Robust Near-Threshold Operation
Amin Ansari Shuguang Feng Shantanu Gupta Scott Mahlke
AdvancedComputerArchitectureLaboratory
UniversityofMichigan,AnnArbor,MI48109
{ansary, shoe, shangupt, mahlke}@umich.edu
ABSTRACT bereducedbelowacertainthresholdwithoutdrasticallysacrificing
clock frequency. Lowering thisminimum achievable voltage can
Extreme technology integration in the sub-micron regime comes
dramatically improve the lifetime, energy consumption, and bat-
witharapidriseinheatdissipationandpowerdensityformodern
processors. Dynamic voltage scaling is a widely used technique terylifeofmedicaldevices,laptops,andhandheldproducts.
totacklethisproblemwhenhighperformanceisnotthemaincon- The minimum achievable voltage for DVS is set such that un-
cern.However,theminimumachievablesupplyvoltageforthepro- der the worst-case process variation, the processor operates cor-
cessor isoftenbounded bythelargeon-chipcachessinceSRAM rectly [11]. The motivation for our work comes from the obser-
cellsfailatasignificantlyfasterratethanlogiccellswhenreduc- vationthatlargeSRAMstructuresarelimitingtheextenttowhich
ingsupplyvoltage. Thisismainlyduetothehighersusceptibility operationalvoltagescanbereducedinmodernprocessors. Thisis
of the SRAM structures to process-induced parameter variations. becauseSRAMdelayincreasesatahigherratethanCMOSlogic
Inthiswork,weproposeahighlyflexiblefault-tolerantcachede- delay as thesupply voltage isdecreased [39]. Furthermore, with
sign, Archipelago, that by reconfiguring its internal organization increasing systematic and random process variation in deep sub-
can efficiently tolerate the large number of SRAM failures that micron technologies, the failure rate of SRAM structures rapidly
arise when operating in the near-threshold region. Archipelago increases in the near-threshold regime. Ultimately, the minimum
partitions the cache to multiple autonomous islands with various sustainable V of the entire cache structure – and consequently
dd
sizes which can operate correctly without borrowing redundancy the core as a whole – is determined by the one SRAM bit-cell
fromeachother. Ourconfigurationalgorithm–anadaptedversion withintheentiresystemwiththehighestrequiredoperationalvolt-
of minimum clique covering – exploits the high degree of flexi-
age. Thisforcesdesignerstoappropriatealargevoltagemarginin
bilityintheArchipelago architecturetoreduce thegranularity of
ordertoavoidon-chipcachefailures.
redundancyreplacementandminimizetheamountofspacelostin
AnSRAMcellcanfailduetothefollowingreasons: areadsta-
thecachewhenoperatinginnear-thresholdregion. Usingourap-
bilityfailure, awritestabilityfailure, anaccess timefailure, or a
proach, the operational voltage of a processor can be reduced to
hold failure [3]. Figure 1 depicts the bit error rate (BER) of an
375mV,whichtranslatesto79%dynamicand51%leakagepower
SRAM cell based on the operational voltage ina 90nm technol-
savings(in90nm)foramicroprocessorsimilartotheAlpha21364.
ogynode[30]. Theminimumoperationalvoltageof64KBL1and
Thesepowersavingscomewitha4.6%performancedrop-offwhen
2MBL2cachesisselectedtoensureahighexpectedyield,99%in
operatinginlowpowermodeand2%areaoverheadforthemicro-
processor.
1. INTRODUCTION write-margin limit read-current limit
1.E+00
Withaggressivesiliconintegrationandclockfrequencyincrease,
power consumption and heat dissipation have become key chal- 1.E-02 m)
plpscvetooooanowwlggtlaeeeeignrrsfegcac,ioinoslencunlsatershuelcuiesmtnmrgi[pdpc3teiti(8tisoDyo]in.,gnVanoISnrfted)omdafdilusisahccoteiraagosahpcwfdefrieoepndcvceteetierslcrfsyoestarhoiumlrerissfcea,ecodnteonicxsmtdetpeielctoposihforintoantiinchniqnedegugsrestmeh[ox2tearpo5slfe.]aprd.ceaiDtGdtcetukyrshcaonaegewataimdintrnhylgiygce-, SRAM Bit Error Rate 1111....EEEE----01000846 Nominal ThresholdVoltage (90n 624 M KBB ((9999%% yyiieelldd)) 612 mV651 mV
namic power quadratically scales with voltage and linearly with
frequency.However,thesupplyvoltageofamicroprocessorcannot 1.E-12
0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7
Power Supply Voltage (Vdd)
Figure1:BiterrorrateforanSRAMcellwithvaryingV val-
dd
uesin90nm.Forthistechnology,thewrite-marginisthedom-
Permission tomake digital orhardcopies ofall orpartofthis workfor inant factor and limits the operational voltage of the SRAM
personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare
structure. Here,theY-axisislogarithmic,highlightingtheex-
notmadeordistributedforprofitorcommercialadvantageandthatcopies
tremelyfastgrowthinfailureratewithdecreasingV .Thetwo
bearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,to dd
horizontaldotted-linesmarkthefailureratesatwhichthemen-
republish,topostonserversortoredistributetolists,requirespriorspecific
permissionand/orafee. tionedSRAMstructures(64KBand2MB)canoperatewithat
Copyright20XXACMX-XXXXX-XX-X/XX/XX...$10.00. least99%manufacturingyield.
bit byte word (8B) block (128B) column (256B) word-line (1KB)
High failurerate region Medium failurerate region Low failurerate region Failure-freeregion
( 10-3(cid:1095)(cid:3)(cid:17)(cid:28)(cid:90)(cid:3)(cid:895) ( 10-5(cid:1095)(cid:3)(cid:17)(cid:28)(cid:90)(cid:3)(cid:1092)(cid:3)(cid:1005)(cid:1004)-3) ( 10-9(cid:1095)(cid:3)(cid:17)(cid:28)(cid:90)(cid:3)(cid:1092)(cid:3)(cid:1005)(cid:1004)-5) ( BER < 10-9)
Our Target ECC-2, 2D ECC, Bit-Fix Row/column redundancy, ECC No protection is required
100
10
Percentage Faulty 0.11
0.01
0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7
Power Supply Voltage (Vdd)
Figure 2: Percentage of faulty bits, bytes, words, blocks, columns, and word-lines for a 2MB L2 cache whilevarying the supply
voltage. Here, theY-axis islogarithmic, highlightingthe rapidgrowth in faultyunitswhen decreasingV . Thetop part of this
dd
figuredepictsourconceptualdivisionofthisV rangeintofourdifferentregionsbasedontheprotectiondifficulty.Foreachregion,
dd
correspondingbiterrorratesandalsoseveralapplicableprotectiontechniquesarealsoshown. Inordertooperatecorrectlyinthe
failure-freeregion,noprotectionmechanismisrequired. However, ascanbeseen,ourtargetisthehighfailurerateregionwhich
causesanavalancheoffailuresforon-chipcaches.
Figure1. Duetothehigher sensitivityof abit-cell toparametric discussedearlierforV ≥651mV,almostnofailureisexpected
dd
variationsatlowerV s,thefailurerateofanSRAMcellincreases (i.e.,failure-freeregion). Ascanbeseen,fortheV ≤ 400mV,
dd dd
exponentiallywhenV getsdecreased. Thisexponentialincrease finergranularitiesofredundancy arerequiredtoallowahighuti-
dd
in thenumber of faultycellsmakes the protection of the on-chip lizationofthespareelementssincealargefractionofword-lines,
cachesmuchmoredifficultwhenoperatinginthenear-thresholdre- columns, blocks, andevenwordswouldbefaulty. Thisincreases
gion.Ascanbeseeninthisfigure,thewritemarginmostlydictates thecomplexityofthedesignsincesimpledisabling(e.g.,blockor
theminimumV anditisexpectedtooperatewithV ≥651mV waydisabling)orcoarsegrainredundancy(e.g.,roworcolumnre-
dd dd
duetothedominatingfailurerateofthe2MBL2cache.Thismini- dundancy)techniquescannotbeefficientlyapplied.
mumoperationalvoltageisconsistentwithpredictedandmeasured In this work, we propose Archipelago (AP), a cache capable
values(∼0.7V)reportedin[8]. ofreconfiguringitsinternalorganizationtoefficientlytoleratethe
In the literature, several techniques have been proposed to im- large number of SRAM failures that arise when operating in the
provedynamicand/orleakagepowerofon-chipcachesaswellas near-thresholdregion. APallowsfault-freeoperationbypartition-
the entire processor [39]. Section 4.1 summarizes some of these ingthecacheintomultipleautonomousislandswithvarioussizes.
low-power design techniques. Most of these methods exploit the Each island is a group of physical cache word-lines that can op-
structurespecificsleepmodesand/orpower-awareresourcealloca- eratecorrectlywithoutusinganyword-lineoutsideoftheirgroup.
tion toavoidfacingwiththefailures. Consequently, for low V Eachgrouphasasacrificialword-linewhichisdivideduptomulti-
dd
values (e.g., ≤ 651mV), the amount of power saving for these pleredundancyunits. Thesespareunitsaredirectly/indirectlyem-
methodsisrestrictedduetothearisingfailuresintheSRAMstruc- ployedtoachievefault-freeoperationoftheotherword-linesinthe
tures. samegroup. Sacrificialword-linesdonot contributetotheeffec-
Incontrast,theobjectiveofourworkistoenableDVStopush tivecache capacity since they do not store any independent data.
thecore/processoroperatingvoltagedowntothenear-thresholdre- Theclusteringofthecachetotheseautonomousislandsisdoneby
gion (e.g., low power mode) whilepreserving correct functional- aconfigurationalgorithmwhichisdescribedinSection2.3. This
ityofon-chipcaches. Analternativetothisapproachisdual-V s adaptedversionoftheminimumcliquecoveringalgorithmtriesto
dd
wherecoreandcachesoperateatdifferentvoltages.However,dual- partitionthecache totheleast number of islands/groups tomini-
V imposes a high cost in terms of area and design complexity. mize the number of sacrificial word-lines required for guarantee-
dd
Voltage-levelconvertersmustbeaddedtoallowsignalsharingbe- ingthefault-freeoperationofthecache. APenablesgreaterpower
tween different voltageislands. Furthermore, high voltage mem- savingsthanpriorapproachesandrequiresonlyasinglepowersup-
ory elements generate noise that victimizes the neighboring low ply. Theoverheadoftheapproachisasmallperformancepenalty
voltagelogiccircuits,necessitatingeithershieldingorextranoise (4.6%) when operating in near-threshold mode and a negligible
margins[40]. areaoverhead (2%) for themicroprocessor. WeapplyAP toL1-
Inthiswork,wetargetultra-lowvoltageoperation D,L1-I,andL2cachestoevaluatetheachievablepowerreduction
(V ≤ 400mV) in the near-threshold region which causes an fortheoverallmicroprocessorsystem. Athoroughcomparisonof
dd
extreme bit-cell failure rate (> 10−3) for the on-chip caches – APwithseveralwell-knownproposalsisdoneinSection4.3.
presented as the high failure rate region in Figure 2. This figure The primary contributions of this paper are: 1) A flexible and
showsthepercentageoffaulty(i.e.,containing atleastonefaulty highlyreconfigurablearchitecturethatcanbeleveragedtoprotect
bit-cell) bits, bytes, words, blocks, columns, and word-lines for regularSRAMstructuresagainsthighfailurerates; 2)Inorderto
a 2MB L2 cache for different V values. Figure 2 is generated minimize the amount of redundancy required for protecting the
dd
assuming a uniform failure distribution based on the relation be- cache,wemodelthecachefaultpatternwithapropergraphstruc-
tween bit-cell failure rate and V in Figure 1. In this figure, at turetoguideaminimum cliquecovering configuration algorithm
dd
V = 350mV, almost all blocks are faulty while 30% of the fornearoptimalgroupformation;3)Toourknowledge,Archipelago
dd
wordsarefaultyforthe2MB L2cache. Ascanbeseenandalso isthefirstlowoverhead,fault-tolerantarchitecturaltechniquewhich
Figure3: A dual-bank2-way set-associative Archipelago. Each cachebankconsistsof eightlines. Cacheblocksaredividedinto
threeequallysizeddatachunks.Blackboxesrepresentchunksofdatathathaveatleastonefaultybit.Thememoryandfaultmaps,
whichareessentialcomponentsoftheproposedscheme,arealsoshown.
allowsthecachetooperatecorrectlywhenV ≤400mV;and4) ThefigurealsoshowsstructuresspecifictoAParchitecture(i.e.,
dd
Adesignspaceexplorationin90nmtoshowtheactualprocessof thememorymap,thefaultmapandtheMUXinglayer).Thecache
selectingthearchitectureparametersforbothL1andL2caches. accesscanbelogicallydividedintothreesteps. First,thememory
mapisreadtodetermineanindexfor accessingthecache banks.
2. ARCHIPELAGO Next,thecachebanksandfaultmapareaccessedinparallel.Here,
thecachebanksprovidetheactual(fromregularlines)andredun-
Inthissection, wefirstdescribe theAParchitecturethat adap-
dant data (from sacrificial line). The fault map indicates which
tivelyreconfiguresitselftoabsorbfailingSRAMs.Next,wepresent
partsoftheactualorredundantdatashouldbeusedtoconstructthe
aconfigurationalgorithmthatexploitstheintrinsicflexibilityofthe
output. Finally,theMUXinglayerleveragestheinformationfrom
AParchitecturetoperformanearoptimalredundancyassignment.
faultmaptoassembletherequesteddata.
The coalescence of highly flexible hardware and intelligent con-
InAP,amemorymapisusedtoprovidealevelofindirectionin
figuration enables our proposed solution to minimize the impact
cacheaccesses. Eachcacheaccessfirstindexesintothismemory
ofoperatingatnear-thresholdregiononcachecharacteristics(e.g.,
map,whichsuppliesthelocationofthedatalineanditscorrespond-
sizeandlatency)withminimaloverhead.
ingsacrificialline. Afterthesetwolineshavebeenaccessedfrom
2.1 Baseline APArchitecture their respective banks (different ones), a MUXing layer2 is used
tocomposeafault-freeblockbyselectingtheappropriatechunks
InAP,thecacheispartitionedintoseveral autonomous islands
fromeachline.Consideringthebasicdesignofafast-accesscache
of varying sizes. Eachisland isa group of physical cache word-
(e.g.,L1cache), basedonthetagcomparisonresults,columnde-
lines that can operate correctly without using any word-line out-
coders – which are placed before our MUXing layer – MUX the
sideof theirgroup. Eachgrouphasasacrificialword-linewhich
corresponding blocks out of read word-lines. On the other hand,
isdividedupintomultipleredundancyunits.Thesespareunitsare
itshould beclear thatfor energy-efficient caches (e.g., L2cache)
directly/indirectly employed toallow afault-freeoperation of the
design, each word-linemight only containasingleblock andac-
other word-lines in the same group. Put simply, considering one
cesstothetaganddataarrayissequential. Asaresult, indexing
of these groups with n + 1 partiallyfunctional cache word-lines,
intothe memory mapwillbe done after resolving thetagpart of
APallowsthisgrouptobehaveasasetofnfullyfunctionalcache
theaddress.
word-lines.Intheremainderofthispaper,werefertoeveryphysi-
TheMUXing layer receives itsinputs indirectly fromthe fault
calcacheword-line,whichmaycontainmultiplecacheblocksasa
map.Foragivendataline,thefaultmapdetermineswhichchunks
line.
arefaultyandshouldbereplacedwithcorrespondingchunksfrom
Figure3isatoyexamplethatdepictsa2-wayset-associativeAP
the sacrificial line. To aid in the encoding and decoding of this
cache withtwobanks. Eachcachebank haseight linesandeach
information,auniqueaddressisassignedtoalllineswithinagroup
lineconsistsoftwoblocksofdatawhicharedividedinto3equally
(groupaddress). Forinstance,inFigure3,lines10and15arethe
sizeddatachunks. Inthisfigure,lines4,10,and15formagroup
firstandsecondlinesofG3,respectively. Foreachdatachunkin
(named G3) in the cache. Line 4 (labeled G3(S)) is the sacrifi-
the sacrificial line, the fault map stores the group address of the
ciallinethatfurnishestheredundancyneededtoaccommodatethe
linetowhichthat datachunk isassigned. Ingeneral, for agiven
faultychunksinlines10and15. Inordertominimizetheaccess
group,ifthesacrificiallineconsistsofkdatachunks,thefaultmap
latency overhead, thesacrificial line(4) and thedata lines (10 or
15)shouldbeindifferentbanks1 sothatthesacrificiallinecanbe requires to store (i1,i2,...,ik). In this notation, ij is the group
addressofthelinetowhichthejthdatachunkofthesacrificialline
accessedinparalleltotheoriginaldataline.
isassigned.Inourexample,theentrywhichisassignedtothethird
1Inthiswork,ourapproachisdescribedforprotectingcacheswith
onlytwobanks.However,itshouldbeclearthatourschemecanbe 2As a side note, since read and write are symmetric operations,
naturallyextendedtocacheswithanynumberofbanks≥2without theonlymodificationinthehardwareimplementationwouldbeto
lossofgenerality. replacetheMUXesintheMUXinglayerwithpasstransistors[40].
(a) Readingaline(G3(3))fromthesamebankasthesacrificialline(G3(S))
(b) Readingfromthesemi-sacrificialline(G3(2))
Figure4: Twospecial read-accessscenarios. Astandardread accessisillustratedinFigure3. Noticetheextrabitthathasbeen
addedtoboththememorymapandeveryfaultmapentrytohandlescenario(b).Sincethe4thdatachunkofthesemi-sacrificialline
isre-allocated,itismarkedasRAinscenario(b).
group(G3)inthefaultmap,contains(1|2|-|2|-|1)forway0 sacrificiallinesacrosscache. Therefore, theprimaryobjectiveof
andway1.Consideringonlyway1,thisindicatesthatthe4thchunk APistoformgroupssuchthattherearenocollisionsbetweenany
ofthecorrespondingsacrificialline(G3(S))isdevotedtoG3(2),the twolineswithinagroup.
6thchunkisdedicatedtoG3(1),andthe5thchunkisnotassigned
to any line. Finally, the MUXing layer gets its input from a set 2.2 AP withRelaxedGroup Formation
ofcomparatorsthatcomparethegroupaddressofadatalinewith
Sinceeverygrouprequiresasacrificiallinebededicatedsolely
faultmapentriesforthesamegroup. Forexample,groupaddress
forredundancy,APstrivestominimizethetotalnumberofgroups
of line 15 is2 – read frommemory map – which gets compared
thatmustbeformed.Giventhatthenumberoflinesisfixedwithin
withG3’sfaultmapentries.
acache,achievingthisobjectiveimpliesthatlargergroupsarepre-
Inorderforlinesinacachetocometogetherandformagroup,
ferred over smaller ones. In order to improve the likelihood of
theyshouldnothaveanycollisions. Inourscheme,twolineshave
forminglargegroups,weremovetheconstraintthatforcesthesac-
acollisioniftheyhaveatleastonefaultydatachunk inthesame
rificialline,fromaparticulargroup,tobeinadifferentbankthan
position. Forexample,iftheseconddatachunkofthe3rdand6th
alltheotherlines. Thisallowsanysetoflinesfromthetwobanks
lines is faulty, lines 3 and 6 have a collision. Similarly, in Fig-
toformagroup. However,inordertominimizetheaccesslatency,
ure 3, lines10 and 15 are collision-free sincethey do not have a
wedonotallowagrouptoderiveallitslinesfromthesamebank.
faultyblockinthesameposition. Bydefinition,collisionsarein-
Inotherwords,eachgroupshouldhaveatonelineineachbankto
dependent of workload running on the system and data stored in
allowaparallelaccesstotheoriginalandsparedata. Relaxingthe
the cache. Two lines that have a collision cannot share a single
mentionedconstraint, givesrisetotwonewreadaccessscenarios
sacrificial line. This means as the number of collisions increase,
inadditiontothestandardcasedescribedinFigure3.
alargerfractionof cacheneedstobeturnedintosacrificiallines.
HandlingDifferentTypesofAccesses: InFigure4(a),lines2,
Groupformationisthekeycomponenttominimizethenumberof
4, 7, 10 and 15 areinthe samegroup, withline4serving asthe
First Bank Second Bank
1 G1(1) 6 D
2 G2(1) 7 G2(3)
3 G2(S) 8 G1(3)
4 G1(2) 9 G1(S)
5 G2(2) 10 G2(4)
10
1 Group 2
7 2
9 4
Group 1
8
5 3
6
Disabled
Figure5: Asimplifiedexampleoftheminimumcliquecoveringprocessforagivendistributionoffaultsinthecachebanks. Here,
eachbankhasonly5lines.Thesolverdisablesthe6thlinesinceithasmanyfaultychunksand,isthereforeveryexpensivetorepair.
Twocliquesareformedbythesolverandlines9and3aredesignatedassacrificiallinesforgroups1and2,respectively. Moreover,
theconceptualpartitioningofthecacheintodistinctislandsisalsodemonstrated.
sacrificial line. However, when line 2 or 7 is accessed as a data sacrificial line. As can be seen, the 4th data chunk of G3(2) has
line,aparallelaccesstoline4cannotbeperformedsincetheyare beenre-allocatedtoG3(3). Consequently,weartificiallymarkthe
inthesamebank. Tohandlesuchascenario,asmallandtranspar- 4thchunkofthesemi-sacrificiallineas“re-allocated”(RAinFig-
ent modification tothe AP architecture isneeded. Wearbitrarily ure4(b)). WhenG3(2)isaccessed,itsfaultyandalsore-allocated
select alinefromthebank not containing thesacrificialline(but datachunks(i.e.,the4thand5thchunks)aresuppliedasexpected
still from the same group) and label it as a semi-sacrificial line bythecorrespondingchunksfromthesacrificialline(G3(S)).How-
(line15inFigure4(a)).Thissemi-sacrificiallinecanbeusedtore- ever, this cannot be easily done using the unmodified fault-map
placefaultychunksfromcachelineswhichareinthesamebankas sincethe4thentryofthefaultmappointstothefaultychunksof
thesacrificialline. However,incontrasttothesacrificialline,the datafromG3(3). Asaresult, thesystemcannot identifythat the
semi-sacrificial line still contributes to the effective cache space. 4thchunkofthesemi-sacrificiallineshouldbereplacedduringan
Inotherwords, asemi-sacrificiallineonlyactsasalevelofindi- access. Inordertotacklethisproblem,weaddanextrabittoev-
rectionforredundancysubstitutionbyborrowingredundancyfrom eryfaultmapentrywhichindicateswhetherthecorrespondingdata
itscorresponding sacrificialline. Moreover, semi-sacrificial lines chunkshouldbereplacedwhenaccessingthesemi-sacrificialline.
guarantee the parallel access to the original and spare data in all Forinstance,inFigure4(b),sincethe4thand5thchunksshouldbe
possiblescenarios. Consequently, thefaultychunksofthelines2 replaced,theircorrespondingbitsinthefaultmaphavebeensetto
and7arereplacedusingG3(2)insteadoftheG3(S).Withtheaddi- “1”.
tionofthesemi-sacrificialline,accessestothecachecanbedivided
intothreecategories:
1) Accessestodataand sacrificial linesthat resideindifferent 2.3 AP Configuration
banks. ThisisthebasecasewhichisdemonstratedinFigure3and
To maximize the number of functional lines in the cache, we
doesnotrequireanyspecialconsiderationbeyondwhatisdescribed
needtominimizethenumberofsacrificiallinesrequiredtoenable
earlier.
fault-freeoperation. Aspreviouslydiscussed,thereisasinglesac-
2)Accessestodatalinesthataretheinthesamebankasthecor-
rificiallinedevotedtoeverygroupoflines. Thissacrificiallineis
responding sacrificial line. ThiscaseisillustratedinFigure4(a).
notaddressableasadatalinesinceitdoesnotstoreanyindependent
Line G3(3) is the data line and line G3(2) is the semi-sacrificial
data.Inotherwords,sacrificiallinesdonotcontributetotheusable
linewhichsuppliesthereplacement chunks forthefaultyonesin
capacityofthecache. Dependingonthenumberofcollision-free
G3(3). Instead of accessing G3(S), the memory map remaps the
groupsthatareformed,theeffectivecapacityofthecachecanvary
addressofthesacrificiallinetotheaddressofG3(2). Thiscaseis
dramatically.Thismotivatestheneedtominimizethetotalnumber
particularlyinteresting,sincenootherpartsoftheproceduremust
ofgroupsrequired.
beadjustedtosupport thisaccessscenario. Thefaultmapisstill
Problem Modeling: Here, we model the problem as a graph
used,unmodified,toindicatethatthe4thdatachunkfromG3(3)is
in which every group corresponds to a clique. To minimize the
faultyandshouldbesubstituted.Therefore,APreplacesthisfaulty
numberofgroupsinagivenfaultycache–upperpartofFigure5
chunkofG3(3)bythesemi-sacrificialline’s4thchunkinsteadof
withtwobanks,wemodeltheproblemasaminimumcliquecover
sacrificialline’s4thchunk.
(MCC)problem[14]. Figure5showstheprocess offormingthe
3)Accessestoasemi-sacrificialline. Thiscaseisdemonstrated
groups givenafault patternforthecache. Eachnodeinthecon-
in Figure 4(b) for which two small modifications to the access
structedgraph,teninall,isacacheline. Thereisanedgebetween
procedure are necessary. An additional bit is added to memory
two nodes if and only if there is no collision between the corre-
mapentriesindicatingwhethertheaccesseddatalineisthesemi-
spondinglines. Therefore,itispossibletoassignconnectednodes
base 2nd non-fair 2nd fair 64-cap
12
average non-functional lines max non-functional disabled
10
100
y of Occurrence 68 ber of Word-lines 246800000
uenc 4 Num base 2nd non-fair 2nd fair 64-cap
eq Different Versions of the Solver
Fr
2
0
4 9 14 19 24 29 34 39 44 49 54 59 64 69 74 79 84
Clique Size
Figure6: DistributionofthecliquesizefordifferentversionsofthesolverbasedonMonteCarlosimulation. Notethatfor64-cap,
thesizeofallcliquesis≤ 64. Here,thenumberofnon-functionallinesisthesummationofthenumberofsacrificiallinesandthe
numberofdisabledlines. Theplotintheinsertdepictstheaveragenumberofnon-functionalcachelines,themaximumnumberof
non-functionallines,andthenumberofdisabledlineswhileachieving99%yield.
tothesamegroup. Forexample, thereisanedgebetweenlines2 andallowsustotakecareoftheparallelaccessproblemdiscussed
and3butnoedgebetweenlines1and10. above.
Asmentionedearlier,agroupisasetoflinesforwhichthereis 2)AnartifactintheDSATURsolveralgorithmcansometimes
nocollisionbetweenanypairoflines. Byconstructingthegraph causeittodisablealargefractionofcachelines.Morespecifically,
this way, a collision free group forms a clique (i.e., there is an itpicksthenodesforcoloringonlybasedonthedegreeofsatura-
edgebetweeneverypairofnodes).Asaresult,thetaskofforming tionwhichisproportional tothereciprocal of thenode degreein
groupscanberepresentedasfindingthecliquesintheconstructed ourconstructedgraph[22]. Becauseofthisbias,allthelinesfrom
graph.However,sinceweareinterestedtominimizethenumberof onebankmightbeselectedwhiletherearemanyunassignedlines
sacrificial lines, thisproblem turnsto findingthe minimumnum- intheotherbank. Insuchascenario,theunassignedlineshaveto
ber ofcliquesthatcover theentiregraph. InFigure5, lines1, 4, bedisabled which resultsinalargefractionof disabled lines. In
8, and 9formthe firstgroup (G1, cliquewithsize4) whilelines order tosolve thisproblem, wemodifytheoriginal DSATURal-
2,3,5,7,and10formthesecondgroup(G2,cliquewithsize5). gorithmtopickthelinesfrombothbanksmoreevenly. Thiswas
Line6isdisabled(D)sinceitcontains4faultychunksandrepair- donebygivingartificialprioritytothelinesinabankthathasmore
ing it is not cost effective. In general, a line gets disabled, if its unassignedlinesatthebeginningofeachassignmentphase.
correspondingnodeinthegraphdoesnotbelongtoanycliquewith 3)AswillbediscussedinSection3,minimizingtheareaover-
size≥ 2. Here,thefaultmaphas2lineswhichcorrespondtothe headofthefaultmapcarriesamajorsignificanceforourscheme.
sacrificial lines 9 and 3. As can be seen in this figure, there are Thesizeofeachentryinthefaultmapisproportionaltothe
16 faulty chunks out of 60, therefore, at most 10−(⌈16⌉) = 7 log (max{cliquesize}). Therefore, to reduce the size of the
6 2
cachelinescanbekeptfunctional afterthegrouping. Ultimately, faultmap,anupperbound(e.g.,64)canbeplacedonthemaximum
ourconfigurationalgorithmcanachievethisbestcase,highlighting cliquesize(MCS).Byaddingthisfeature,allthecliqueswhichare
theeffectivenessofourapproach. foundbytheDSATURsolvercanbeforcedtohaveasizesmaller
MCC Solver: We use the transformation described in [19] to thanorequaltoMCS.
converttheMCCproblemtoaminimumchromaticnumberprob- ThemainplotinFigure6depictsthedistributionofcliquesizes
lem. Afterapplyingthistransformation, thefinalgraphispassed for different versions of thesolver: base (base DSATURsolver),
totheDSATURsolver[22]whichusesawell-knownandefficient 2ndnon-fair(basesolver+modification(1)),2ndfair(basesolver
approximationalgorithm.Asshownin[22],theapproximationfac- + modifications (1,2)), and 64-cap (base solver + modifications
torfortheDSATURsolveris≤ 6.1%forvariousgraphdensities. (1,2,3)). This datais generated using 1000 iterationsof a Monte
For the problem size that we target in this work, the run time of Carlosimulationfora2MBL2cachewith2048lines.Itshouldbe
the DSATURalgorithm is always less than 5ms. In contrast, the clearthatthesizeofthefaultmapisproportionaltothe
fullbacktrack-basedsolver(e.g.,optimalsolver)takesseveraldays (Num.of Cliques)×log (MCS)whichimpliesasmallnum-
2
tosolveoneinstanceofourconfigurationproblemwhichmakesit beroflargecliquesispreferable.Furthermore,since
infeasibletobeusedinpractice. (Num.ofCliques)×(AverageCliqueSize)isequaltotheto-
Nevertheless,theoriginalsolver’sanswerisnotdirectlyapplica- talnumberofword-lines,aconstantvalue,MCSshouldbeasclose
bletoourproblem. Sincethesolverisfreetoformagroupinany as possible to the AverageClique Size. As a result, the most
possiblewaythatminimizesthenumberofcliques,itispossibleto desirabledistributionofcliquesizeswould beatight distribution
haveacliquewithnodesinonlyonebank. Thislattercaseisnot around largecliquesizes. AscanbeseeninFigure6, amajority
afeasiblesolutionbecausethesolutiondisallowsparallelaccessto ofcliquesfallintothenarrowregionof60to80nodes. Thistight
thespareelements. Asaresult,weapplyasetofmodificationsto distributionshowstheefficiencyandproperbalancingofthegroup
theoriginalsolverformakingitsuitabletoourapplication: sizesintheprocessofgroupformationbythedifferentversionsof
1)Weforcethesolvertopickthesecondlineofthegroupfrom the solver. The smaller plot in this figure demonstrates different
adifferentbankfromthefirstlineofthegroup. Thissmallmodi- characteristics of these 4 versions of the solver. In this plot, the
ficationassuresusthatatleastonelineisselectedfromeachbank numberofnon-functionallinesisthesummationofthenumberof
sacrificiallinesandthenumberofdisabledlines. Ascanbeseen, Table1:Thetargetsystemconfiguration
the second modification (i.e., fairness) can effectively reduce the Parameters Value
averagenumberofdisabledlinesfrom25.5to6.1. Technology 90nm(1.2VnominalVdd)
Clockfrequency 1.9GHz[23]
AnotherobservationisthattheconstraintMCS =64increases
L1Cache 2banks64KBdata,2banks64KBinstruction,
the maximum number of non-functional lines by 9% while it re- split,2-way,4cycles,1port,64Bblocksize
ducessizeofeachfaultmapentryby17%–from7bitsto6bits. L2Cache 2banks2MBUnified,8-way,10cycles
latency,1port,LRU,128Bblocksize
Foreachcacheinstance,thenumberoflinesinthefaultmaparray
Registers 80integer,72floatingpoint
isequaltothenumberofsacrificiallines(e.g.,2inFigure5).How- ROB(re-orderingbuffer) 128entries
ever,duetothepresenceofprocessvariationinalargepopulation LSQ(load/storequeue) 64entries
offabricatedchips,differentfaultpatternsshouldbeexpected. As Instructionfetchbuffer 32instructions
Integer/FPissuequeue 32/32entries
wedescribedearlier,inourevaluation,weemployaMonteCarlo
FU(functionalunit) 4intALU,4intmult/div,2memorysystemports
simulationtogenerateapopulationof1000cacheinstancesandthe FPU(floatingpointunit) 4FPALU,1FPmult/div
totalnumber offault maplinesisdetermined basedonthemaxi- Mainmemory 225cycles(highpower),34cycles(lowpower)
Branchpredictor combined(bimodaland2-level),
mumnumberofsacrificiallineswhileachievinga99%yield.
RAS(returnaddressstack)has32entries
HardwareConfiguration: InordertoconfigureAP,themem- BTB(branchtargetbuffer) 2048entries,2-wayassociative,
oryandfaultmapsneedtobefilled. Theinitialstepinvolvessolv- BHT(branchhistorytable)has4096entries
ingtheMCCconfigurationproblemforagivencachefault-pattern.
GiventheMCCsolution,eachline–anodeinthegraph–canbe
classifiedas: 1)Dataline: Foreachdataline,anewmemorymap CacheAddressingMechanismafterCapacityReduction: In
entry should be allocated. Each memory map entry has 5 fields ourdesign,thememorymapcanbeusedtoremaptheoriginalad-
(Figure4(b)): Thefirstfieldisthedatalineaddress. Ifthislineis dressofthesacrificiallinestoother functional lines. Asaresult,
inthesamebankasitsrespectivesacrificialline,thesecondfield asmallfractionofthesetswillhavethefunctionalityoftwosets.
willsettotheaddressoftherespectivesemi-sacrificialline.Other- Inotherwords, twosetsdifferentsetswillbelocatedinthesame
wise,itwillgettheaddressofthesacrificialline.Thethird,fourth, line.Forthosedual-setword-lines,associativityisreducedbyhalf.
andfifthfieldsshouldbesettothedataline’sgroupnumber,group Thesedual-setlinesaredistributedevenlyacrossthecachetopre-
address, and “0”, respectively. 2) Sacrificial line: Although no ventbiasedmissrateforanaddresssub-range. Thetagcompari-
changeinthememorymapisrequired,afaultmapentryshouldbe sonandreplacementlogicneedsstraight-forwardmodificationsto
allocatedforeachsacrificialline. Eachfaultmapentryhasafield makethiswork,detailsofwhichareomittedintheinterestofspace.
perdatachunk–6fieldsinFigure4. Forfaultydatachunksofa High Power Mode Operation: In the high power mode, our
sacrificiallineortheoneswhicharenotassignedtoanyfaultydata schemeisturnedoffinordertominimizetheunwantedoverheads:
chunks,correspondingfieldsinthefaultmapshouldbesetto“-”. 1) All the cache lines are functional and there is no sacrifice of
Forotherdatachunks,correspondingfaultmapfieldsshouldbeset the cache capacity. 2) Assuming clock gating, there isanegligi-
to the group addresses of the data/semi-sacrificial lines to which ble overhead for the dynamic power due to the switching in the
those data chunks are assigned. 3) Semi-sacrificial line: This is bypassMUXeswhichconsistsoftheMUXinglayerandanaddi-
similar to the first case, except the last field of the memory map tionalMUXforbypassingthememorymap.Inotherwords,inhigh
entryshouldbesetto“1”. 4)Disabledline: Nothingneedstobe powermode, theaccesspathofAPincludesanadditionalbypass
doneinthiscase. MUXcomparedtoabaselinecache. 3)Leakagepoweroverhead
LowPowerModeOperation:Aswesawinthissection, remainsthesame. However,powergatingtechniquescanbeused
Archipelagoleveragesasimplearchitecturewhichmainlyconsists forgeneralleakagemitigation[18].
oftworelativelysmallarraystructures.However,inordertoachieve
arobustsub-400mV operation,asophisticatedconfigurationalgo-
3. EVALUATION
rithmisusedtocustomizethehardwarebasedontheexistingfault
patterninaparticularcache. Thishardware/softwareco-designal- ThissectionevaluatesthepotentialofAPinreducingthepower
lowstokeepthearchitecturesimplewhenmodelingtheconfigura- of the processor while keeping the overheads as low as possible.
tion withwell-establishedalgorithmic problems. Thefirst timea Comparisonswithrelatedworkarepresentedinthenextsectionof
processorswitchestolowpowermode,thebuilt-inselftest(BIST) thepaper.
modulescansthecacheforthepotentialfaultycells. Afterdeter-
mining the faulty chunks of each cache line in low power mode, 3.1 Methodology
theprocessorswitchesbacktohighpowermodeandconstructsthe
Forperformanceevaluation,weuseSimAlpha,avalidatedmicro-
mentionedgraphandsolvestheMCCproblemusingthemodified
architecturalsimulatorbasedontheSimpleScalarout-of-ordersim-
DSATURsolver. Asmentionedbefore,thesolvertimefora2MB
ulator[5]. TheprocessorisconfiguredasshowninTable1andis
L2cacheislessthan5msonanIntelCoreTM2Duomachine. This
modeled after the DEC Alpha 21364 at an ambient temperature
solutioncontainstheinformationthatisrequiredtobestoredinthe of40◦C [35, 23]. Dynamicprocessor power consumptioniscal-
memoryandthefaultmaps.Thisconfigurationinformationcanbe
culated using Wattch [7] based on the activity factors of individ-
storedonthehard-driveandiswrittentothememorymapandfault
ualcorestructures,andleakagepoweriscomputedwithHotLeak-
mapatthenextsystemboot-up.Inaddition,thememorymap,fault
age[41].CACTI[31]isleveragedtoevaluatethedelay,power,and
mapandthetagarraysneedtobeprotectedusing,forexample,the
area of the base cache structures. To take into account the over-
wellstudied10Tcell[8]whichhasabout 66%areaoverhead for
headsofthememorymapandfaultmaparrays,weusetheSRAM
theserelativelysmallarrays. These10Tcellsareabletomeetthe
generator module provided by the 90nm Artisan Memory Com-
target voltageinthiswork fortheaforementioned memory struc-
piler. TheSynopsys standardindustrialtool-chain(withaTSMC
tureswithoutfailing. However,aswewilldiscussinSection4.3,
90nmtechnologylibrary)isusedtoevaluatetheoverheadsofthe
10Tcellsarenotacost-effectiveoptionforprotectingtheentireL1
remaining miscellaneous logic(i.e, bypass MUXes, comparators,
andL2caches.
subsettagcomparisondisabling,etc.).Lastly,tofindthematching
Percentage of non-functional cache lines Fault-map area-overhead
100
1b chunks 2b chunks 4b chunks 8b chunks 16b chunks
90
80
70
Percentage 456000
30
20
10
0
Power Supply Voltage (Vdd)
(a) Percentageofnon-functionallinesandareaoverheadofthefault-mapforL1whilevaryingV anddatachunksize
dd
Percentage of non-functional cache lines Fault-map area-overhead
100
1b chunks 2b chunks 4b chunks 8b chunks 16b chunks
90
80
70
Percentage 456000
30
20
10
0
Power Supply Voltage (Vdd)
(b) Percentageofnon-functionallinesandareaoverheadofthefault-mapforL2whilevaryingV anddatachunksize
dd
Figure7:ProcessofdeterminingtheminimumachievableV forL1andL2cacheswhilelimitingthefractionofthenon-functional
dd
cachelinesandalsotheareaoverhead ofthefaultmapstructureto≤ 10%. Moreover, inthese10sub-plots,vertical dottedlines
showtheminimumachievableV whiledatachunksizevariesfrom1bitto16bits.
dd
frequencyforagivenV ,weusethealpha-powermodeldescribed fractionofthecachelinesthatgetdisabled asloweringV leads
dd dd
in[27]. toincreasingerrorratesandaprecipitousincreaseinfaultychunks.
Foragivensetofcacheparameters(e.g.,V ,chunksize,MCS, Aswementioned earlier, for thesedisabledlines, no entryinthe
dd
etc.),aMonteCarlosimulationwith1000iterationsisperformed faultmaparrayisrequired.
usingthemodifiedDSATURsolverdescribedinSection2toiden- Here,theverticaldottedlineshighlighttheminimumachievable
tifytheportionofthecachethatshouldbedisabled. Asdiscussed V based on the aforementioned 10% limits on non-functional
dd
earlier,solutionsgeneratedbytheMCCsolvertargeta99%yield. linesand fault map size. It isnotable that for small chunk sizes,
Inother words, only1%ofmanufacturedandconfigured on-chip area overhead of the fault map is the limiting factor, while for
cachesareallowedtoexhibitfailureswhenoperatinginlow-power larger chunks, the number of non-functional lines becomes dom-
mode. Ontheotherhand,thetargetyielddirectlyimpactsthesize inant. Ascanbeseeninthisfigure,theminimumachievableV
dd
ofthefaultmap. Itssizeissetbasedonthemaximumnumberof forL1islowerthanL2.ThiswasexpectedduetoL2’slongerlines
cliquesformedacrossallcacheinstancesintheMonteCarlopro- andlargersizewhichmakesprotectionofL2harderthanL1[40].
cesswhenignoringasmanyworst-casesituationsasthetargetyield Therefore,L2protectioncostdictatestheminimumoperatingvolt-
allows. ageof our system. Inaddition, basedonthetrendinFigure7, it
shouldbeclearthatachunksizeoutsideofthepresentedrangewill
3.2 DesignSpace Exploration onlyresultinahigherminimumV .Asaresult,weselect375mV
dd
astheminimumV (i.e,low-powermodeoperatingvoltage)since
Figure7showstheprocessofdeterminingtheminimumachiev- dd
allotherlowervoltagesviolateour10%limits.
ableV forourtargetsystem.Inthisfigure,chunksizevariesfrom
dd
1bit to16bitsforbothL1andL2caches. Inhigh-power mode, ForVdd = 375mV,adesignspaceexplorationofL1-(D/I)and
L2 caches is demonstrated in Figure 8. There are two important
both fault and memory map arrays remain idle and leak power.
parameters tonote: 1) MCS, themaximum allowable cliquesize
It iscrucial to minimize the size of these structures. The size of
(Section 2.3) and 2) chunk size, varying from 1b to 128b for L1
thememorymapisessentiallyfixedbythenumberoflinesinthe
and from 1b to 32b for L2. From Figure 8, it should be clear
cache. Thefaultmapsize,however,canvarysignificantlydepend-
thatthechunksizesoutsideofthepresentedrangealwaysviolate
ingonconfigurationparameters,motivatingacloserlookatthesize
at least one of our aforementioned 10% limits. For every pair of
ofthefaultmapasanimportant designfactor. Consequently, we
(MCS,chunksize),aMonteCarlosimulation,targeting99%yield,
limittheareaoverheadofthefaultmapto10%(ofthecachearea).
is performed with 1000 iterations. This simulation identifies the
Furthermore,sincecachesizehasastrongcorrelationwithsystem
necessaryareaoverheadofthefaultmapandthefractionofnon-
performance, we limit our scheme to setting at most 10% of the
functional lines in the cache. Here, we still limit the fraction of
cachelinestonon-functional.
thenon-functionallinesto10%whiletryingtominimizethearea
AsevidentinFigure7,decreasingV increasesthenon-
dd
overheadofthefaultmap.
functional portionofthecacheandalsotheareaofthefaultmap
InFigure8(a),withintheshadedregion,onlypointsontheblack
array. However, beyond acertainpoint, the areaoverhead of the
dottedline(Paretofrontier)areconsideredsincetheydominatethe
fault map starts decreasing. This phenomena is due to the large
(a) L1designspaceexploration (b) L2designspaceexploration
Figure8: DesignpointsfordifferentMaximumCliqueSize(MCS)andchunksizepairsareshownthatcanachievea99%yield.
ForeachMCSvalue, correspondingchunksizesfrom{2n |n ∈ {0,1,...,7}}forL1andfrom{2n |n ∈ {0,1,...,5}}forL2are
chosen. Theshadedboxesrepresentstheregionofinterestwhereboththefault-mapoverhead andthefractionofnon-functional
linesislimitedto≤10%.TheblackdottedlineistheParetofrontier.
other design points. Note that making the chunk size larger de- worst-casecapacitylossduringourevaluation.Inordertoevaluate
creasestheareaoverheadofthefaultmapsincethenumberofen- the worst-case performance penalty of our scheme in low-power
triesin the fault map is reduced. However, this reduction isalso mode, we ran the cross-compiled Alpha binaries of SPEC-CPU-
accompaniedbyanincreaseinthefractionofnon-functionalcache 2KbenchmarksuiteonSimAlphaafterfast-forwardingtoanearly
lines,theresultoffeweredgesinthegraphdescribedinSection2.3. SimPoint[37].WeassumeoneextracyclelatencyforL1and2ex-
ForL1,thedesignpointwithMCS = 32andchunksize = 16 tracyclesforL2.Cachesizeisalsoreducedbasedonthefractionof
bits is selected, highlighting an interesting trade-off between the non-functionallinesinFigure8. Onaverage,a4.6%performance
areaoverhead of thefaultmapand thefractionof non-functional penaltyisseeninlow-powermodefromwhich0.6%iscontributed
lines. ByrepeatingthesameprocessforL2,asillustratedinFig- by the cache capacity loss due to the presence of non-functional
ure8(b),thedesignpointwithMCS = 16andchunksize = 8 lines(Figure9(b)). Ascanbeseen,ourstrictlimitonthefraction
bitsisselected. ofnon-functional linesresultsinminimalimpactonperformance
becauseofcachecapacityloss.However,oneshouldnotethatlow-
3.3 Results
powermodeperformanceisusuallynotamajorconcern. Inhigh
Figure 9(a) summarizes the overheads of our scheme for both powermode,thereisnocapacitylosssincenofailuresneedtobe
L1andL2caches. Asmentionedbefore, wealsoaccount forthe tolerated. Furthermore,basedonourCACTIdelayresultsandthe
overheadsofusing10TSRAMcells[8]forprotectingthetag,fault frequencyofthesystem(Table1),thereisenoughslackontheac-
map,andmemorymaparraysinlow-powermode. Inaddition,the cesstimeofourL1andL2cachestofitthesmallbypassMUXes
faultmap,memorymap,andthesecondbankhavetheirownsep- (additional0.07nsdelay)withoutaddinganyextracyclestotheac-
aratedecoderswhichareaccountedforinourevaluation. Leakage cesstimeofthesecaches. Inotherwords,thereisnoperformance
overheadinhighpowermodecorrespondstothefaultmap,mem- lossinhigh-powermode.However,onemighthaveacachedesign
ory map, miscellaneous logic, and extra leakage of 10T cells for withoutanyslackavailable. Inthatscenario,weaddanadditional
tagarray.Note,thememorymapisafargreatercontributortoarea cycle inhigh-power mode for L1and L2, whichtranslatesintoa
andleakagepoweroverheadintheL1thanintheL2. Thereason 3.6%performancelossforSPEC-2K.
behindthisisthattheL1hasonly 1 thelinesoftheL2,whileits Summaryofbenefitsandoverheads:Figure10showsthesav-
4
overallsizeis 1.ForL2,thefaultmapisthemajorcomponentof ingsandoverheadsfortheAlpha21364microprocessorusingAP
32
overhead. Duetoitssignificantlylargersize, theL2cachedomi- forprotectingtheon-chipcaches. AscanbeseeninFigure10(b),
natestheprocessor leakageandareaoverheads. Nevertheless, as theoverheadsoftheproposedmethodarealmostnegligible.These
canbeseeninthisfigure,ourschemehasonlymodestoverheads overheadsareevaluatedin90nmusingthemethodologydescribed
for theL2 cache. Dynamic power overhead inhigh-power mode in Section 3.1. On the other hand, Figure 10(a) depicts the per-
canbemainlyattributedtobypassMUXessinceweassumeclock centagereductioninleakagepower,dynamicpower,andminimum
gatingforthefaultmapandmemorymaparrays. InAP,whenin achievable supplyvoltagebyusingAPforprotectingtheon-chip
low-power mode, thememorymapandMUXinglayerareonthe caches. These results are reported in the 45nm, 65nm, 90nm,
critical path of the cache access. Based on our timing analysis, and 130nm technology nodes. The relation between the supply
the access delay overhead of L1 and L2 are 0.41ns and 0.58ns, voltageandtheexpectedSRAMbit-cellfailurerateforthesefour
respectively. Basedonoursystemclockfrequency(Table1),this technologynodesareextractedfrom[40,24,34,13,33,9,6].Con-
translatesto1additionalcyclelatencyforL1and2additionalcy- sideringthe90nmtechnologynode,APenablesDVStosave79%
clesforL2inlow-powermode. dynamicpowerand51%staticpowerinthenear-thresholdregion.
Asmentionedbefore,thefractionofnon-functionallinesisbased Withtheaggressivetechnologyscaling,thesystematicandrandom
ona99%yieldina1000-runforaMonteCarlosimulation. Asa variationsareexpectedtoincrease[33]. Thisresultsinhighersen-
result, the fraction of non-functional lines would be smaller than sitivity/vulnerability of SRAM cells to power supply variations.
whatispresentedinFigure8formanychips.Thisisbecauseacross Hence, the percentage of reduction in dynamic power/minimum
allthefabricateddies,processinducedparametricvariationcauses achievableVddgraduallyreduceswhenheadingtowarddeepersub-
differentcachefaultpatternstoappear. However,weconsiderthe microntechnologies. Nevertheless, a68%dynamicpower reduc-
ffaauulltt(cid:3)(cid:3)mmaapp(cid:3)(cid:3)((1100TT)) mmiisscceellllaanneeoouuss(cid:3)(cid:3)llooggiicc mmeemmoorryy(cid:3)(cid:3)mmaapp(cid:3)(cid:3)((1100TT)) ttaagg(cid:3)(cid:3)oovveerrhheeaadd(cid:3)(cid:3)((1100TT)) 12
verheadPercentageofOverhead(cid:3)(cid:3) 1111112468024024 HighPower(cid:3)Mode Percentage of Performance Loss 1002468 extra access latency cache capacity loss
0 L1(cid:3)area L2(cid:3)area L1p(cid:3)loeawkearge(cid:3)L2p(cid:3)loeawkearge(cid:3) dpyonLwa1me(cid:3)ric(cid:3) dpyonLwa2me(cid:3)ric(cid:3) 164.gzip 175.vpr 176.gcc 181.mcf 186.crafty 197.parser 255.vortex 256.bzip2 300.twolf 171.swim 172.mgrid 173.applu 177.mesa 179.art 183.equake 188.ammp 200.sixtrack average
(a) Area, leakage, and dynamic power overheads of our (b) Performance loss in low-power mode. Since the fraction of
schemeforbothL1andL2caches. Here,10Tcellsareused non-functionallinesislimitedtolessthan10%,theaccesslatency
forprotectingthefaultmap,memorymap,andtagarrays. overheadisthedominantfactorinperformancepenalty.
Figure9:Area,power,andperformanceoverheadsofourscheme.
Leakage
Power
Dynamic
Power
130nm
Minimum 90nm
Vdd 65nm
45nm
0 10 20 30 40 50 60 70 80 90 100
Percentage Reduction
(a) Percentageofreductioninleakagepower,dynamicpower,andmini- (b) Overheads of using AP for protecting on-
mumachievableV forthebaselineprocessorusing4differenttechnol- chip caches of the mentioned microprocessor
dd
ogynodes. in90nm.
Figure 10: Low-power mode benefits and overheads for an Alpha 21364 microprocessor system (Table 1) augmented with
Archipelago. Here, we account for the dynamic power overhead of accessing the second bank in low power mode for handling
failures.
tioncanbeachievedin45nm. Incontrast,forleakagepower,the notlikelytobeaccessedinthenearfuture. Menget. al.[28]pro-
percentage reductiongraduallyincreasesasthetechnologyscales posedamethodforminimizingleakageoverheadinthepresenceof
down. Thisismainlyduetothedrasticdifferencesincorrelations manufacturingvariations.Inthisscheme,theyartificiallyprioritize
betweentheleakagepowerandV acrosstechnologygenerations. cache ways with smaller leakage and resize the cache by avoid-
dd
Indeepertechnologynodes,areductioninV manifestsasamuch ingsub-arraysthathavehigherleakagefactors. Insteadofturning
dd
larger saving in leakage power [15]. Therefore, for deeper tech- off blocks, drowsy cache [12] is a state preserving approach that
nologies,eventhoughweachievelessV reduction,thenetleak- has two different supply voltage modes. In order to save power,
dd
age power saving is larger. Due to the excessive increase in the recently inactive cache blocks periodically fall into a low power
sub-thresholdtransistorleakage,itisexpectedthatleakagepower modeinwhichtheycannotbereadorwritten.
willdominatethetotalpowerconsumptionofachipinthefuture ForlowV values(e.g., ≤ 651mV in90nm), theamount of
dd
technologies[15]. powersavingforschemessimilartodrowsycache[12]isrestricted
duetofailuresinSRAMstructures[34]. Asdiscussedearlier,our
objective is to enable DVS to push the processor/core operating
4. RELATED WORK
voltage down to the near-threshold region while preserving cor-
MotivationsfortheAPdesigncomefrompreviousworkintwo rect functionality of on-chip caches. Wilkerson et. al. [40] pro-
major categories: low-power and fault-tolerant cache techniques. posedtwodifferentcacheprotectionschemesforL1andL2caches
Here,SRAMcellstabilityeffortsareincludedinthefirstcategory that useseveral levelsof decoding/shifting totakethefaultydata
sincetheytrytoproactivelyavoidfailures.Furthermore,acompar- chunksoutandreplacethemusingECCprotectedpatches. Dueto
isonof APwithseveral oftheseconventional andstateof theart thestrictbindingbetweendataandredundancy,theirschemesneed
proposalswillbepresented. todisable 50% of L1and 25% of L2caches whichresultsintoa
considerableperformancedrop-offinlowpowermode. Recently,
4.1 Low-Power CacheTechniques
Abella et. al. [1] proposed a cache protection scheme based on
The usage of Vdd gating for leakage power reduction by turn- sub-block disabling which can provide a better performance pre-
ingoffcachelinesisdescribedin[20]. Thisapproachreducesthe dictability than [40]. However, since this scheme relies on dis-
leakagepowerofthecachebyturningoffthecachelinesthatare
Description:Amin Ansari. Shuguang In this work, we propose a highly flexible fault-tolerant
cache de- .. In AP, a memory map is used to provide a level of indirection in.