Bespoke Processors for Applications with Ultra-low Area and Power Constraints

Hari Cherupalli (University of Minnesota), Henry Duwe (University of Illinois), Weidong Ye (University of Illinois), Rakesh Kumar (University of Illinois), John Sartori (University of Minnesota)

ABSTRACT

A large number of emerging applications such as implantables, wearables, printed electronics, and IoT have ultra-low area and power constraints. These applications rely on ultra-low-power general purpose microcontrollers and microprocessors, making them the most abundant type of processor produced and used today. While general purpose processors have several advantages, such as amortized development cost across many applications, they are significantly over-provisioned for many area- and power-constrained systems, which tend to run only one or a small number of applications over their lifetime. In this paper, we make a case for bespoke processor design, an automated approach that tailors a general purpose processor IP to a target application by removing all gates from the design that can never be used by the application. Since removed gates are never used by an application, bespoke processors can achieve significantly lower area and power than their general purpose counterparts without any performance degradation. Also, gate removal can expose additional timing slack that can be exploited to increase area and power savings or performance of a bespoke design. Bespoke processor design reduces area and power by 62% and 50%, on average, while exploiting exposed timing slack improves average power savings to 65%.

CCS CONCEPTS

• Computer systems organization → Special purpose systems; Embedded systems; • Hardware → Application specific processors;

KEYWORDS

ultra-low-power processors, application-specific processors, bespoke processors, hardware-software co-analysis, Internet of Things

ACM Reference format:
Hari Cherupalli, Henry Duwe, Weidong Ye, Rakesh Kumar, and John Sartori. 2017. Bespoke Processors for Applications with Ultra-low Area and Power Constraints. In Proceedings of ISCA '17, Toronto, ON, Canada, June 24-28, 2017, 14 pages. https://doi.org/10.1145/3079856.3080247

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISCA '17, June 24-28, 2017, Toronto, ON, Canada
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-4892-8/17/06...$15.00
https://doi.org/10.1145/3079856.3080247

1 INTRODUCTION

A large class of emerging applications is characterized by severe area and power constraints. For example, wearables [46, 51] and implantables [25, 50] are extremely area- and power-constrained. Several IoT applications, such as stick-on electronic labels [48], RFIDs [54], and sensors [30, 72], are also extremely area- and power-constrained. Area constraints are expected to be severe also for printed plastic [49] and organic [41] applications.

Cost concerns drive many of the above applications to use general purpose microprocessors and microcontrollers instead of much more area- and power-efficient ASICs, since, among other benefits, development cost of microprocessor IP cores can be amortized by the IP core licensor over a large number of chip makers and licensees. In fact, ultra-low-area- and power-constrained microprocessors and microcontrollers powering these applications are already the most widely used type of processing hardware in terms of production and usage [5, 23, 53], in spite of their well-known inefficiency compared to ASIC and FPGA-based solutions [31]. Given this mismatch between the extreme area and power constraints of emerging applications and the relative inefficiency of general purpose microprocessors and microcontrollers compared to their ASIC counterparts, there exists a considerable opportunity to make microprocessor-based solutions for these applications much more area- and power-efficient.

One big source of area inefficiency in a microprocessor is that a general purpose microprocessor is designed to target an arbitrary application and thus contains many more gates than what a specific application needs (Section 2). Also, these unused gates continue to consume power, resulting in significant power inefficiency. While adaptive power management techniques (e.g., power gating [58, 60]) help to reduce power consumed by unused gates, the effectiveness of such techniques is limited due to the coarse granularity at which they must be applied, as well as significant implementation overheads such as domain isolation and state retention (Section 7). These techniques also worsen area inefficiency.

One approach to significantly increase the area and power efficiency of a microprocessor for a given application is to eliminate all logic in the microprocessor IP core that will not be used by the application. Eliminating logic that is guaranteed to not be used by an application can produce a design tailored to the application (a bespoke processor) that has significantly lower area and power than the original microprocessor IP that targets an arbitrary application. As long as the approach to create a bespoke processor is automated, the resulting design retains the cost benefits of a microprocessor IP, since no additional hardware or software needs to be developed. Also, since no logic used by the application is eliminated, area and power benefits come at no performance cost. The resulting bespoke processor does not require programmer intervention or hardware support either, since the software application can still run, unmodified, on the bespoke processor.

In this paper, we present a methodology to automatically generate a bespoke processor for an application out of a general purpose processor/microcontroller IP core. Our methodology relies on gate-level symbolic simulation to identify gates in the microprocessor IP that cannot be toggled by the application, irrespective of the application inputs, and automatically eliminates them from the design to produce a significantly smaller and lower power design with the same performance. In many cases, reduction in the number of gates also introduces timing slack that can be exploited to improve performance or further reduce power and area. Since the original design is pruned at the granularity of gates, the resulting methodology is much more effective than any approach that relies on coarse-grained application-specific customization. The proposed methodology can be used either by IP licensors or IP licensees to produce bespoke designs for the application of interest (Figure 1). Simple extensions to our methodology can be used to generate bespoke processors that can support multiple applications or different degrees of in-field software programmability, debuggability, and updates (Section 3.5).

Figure 1: General purpose processors are overdesigned for a specific application (top). A bespoke processor design methodology allows a microprocessor IP licensor or licensee to target different applications efficiently without additional software or hardware development cost (bottom).

This paper makes the following contributions.

• We propose bespoke processors, a novel approach to reducing area and power by tailoring a processor to an application, such that a processor consists of only those gates that the application needs for any possible execution with any possible inputs. A bespoke processor still runs the unmodified application binary without any performance degradation.

• We present an automated methodology for generating bespoke processors. Our symbolic gate-level simulation-based methodology takes the original microprocessor IP and application binary as input to produce a design that is functionally-equivalent to the original processor from the perspective of the target application while consisting of the minimum number of gates needed for execution.

• We quantify the area and power benefits of bespoke processors for a suite of sensor and embedded benchmarks. Area reductions are up to 92% (46% minimum, 62% on average) and power reductions are up to 74% (37% minimum, 50% on average) compared to a general purpose microprocessor. When timing slack resulting from gate removal is exploited, power reductions increase to up to 91% (50% minimum, 65% on average).

• Finally, we present and analyze design approaches that can be used to support bespoke processors throughout the product life-cycle. These design approaches include procedures for verifying bespoke processors, techniques to design bespoke processors that support multiple known applications, and strategies to allow in-field updates in bespoke processors.

2 MOTIVATION

Area- and power-constrained microprocessors and microcontrollers are the most abundant type of processor produced and used today, with projected deployment growing exponentially in the near future [5, 23, 34, 53]. This explosive growth is fueled by emerging area- and power-constrained applications, such as the internet-of-things (IoT), wearables, implantables, and sensor networks. The microprocessors and microcontrollers used in these applications are designed to include a wide variety of functionalities in order to support a large number of diverse applications with different requirements. On the other hand, the embedded systems designed for these applications typically consist of one application or a small number of applications, running over and over on a general purpose processor for the lifetime of the system [1]. Given that a particular application may only use a small subset of the functionalities provided by a general purpose processor, there may be a considerable amount of logic in a general purpose processor that is not used by an application. Figure 2 illustrates this point, showing the fraction of gates in an openMSP430 [27] processor that are not toggled when a variety of applications (Table 1) are executed on the processor with many different input sets. The bar in the figure shows the intersection of all gates that were not exercised (toggled) by the application for any input, and the interval shows the range in fraction of unexercised gates across different inputs. For each application, a significant fraction (around 30%-60%) of the processor's gates were not toggled during any execution of the application. These results indicate that there may be an opportunity to reduce area and power significantly in area- and power-constrained systems by removing logic from the processor that cannot be exercised by the application(s) running on the processor, if it can be guaranteed that removed logic will never be needed for any possible execution of the application(s).

[Figure 2: bar chart. Y-axis: unused gates (%). X-axis: benchmarks binSearch, div, inSort, intAVG, intFilt, mult, rle, tHold, tea8, FFT, Viterbi, convEn, autocorr, irq, dbg.]
Figure 2: A significant fraction of gates in an openMSP430 processor are not toggled when an application executes on the processor. Each bar represents gates not toggled by any input for an application; the interval shows the range of unexercised gates for different inputs.

However, identifying all the logic that is guaranteed to never be used by an application is not straightforward. One possible approach is profiling, wherein an application is executed for many inputs and the set of gates that were never exercised is recorded, as in Figure 2. However, profiling cannot guarantee that the set of gates used by an application will not be different for a different input set. Indeed, profiling results in Figure 2 show considerable variations in exercised gates (up to 13%) for different executions of the same application with different inputs. Thus, an application might require different gates and execute incorrectly for an unprofiled input.

[Figure 4: two die plots, (a) intFilt and (b) scrambled intFilt. Labeled modules: DBG, ALU, WDG, MEM_BB, EXEC, FRONTEND, SFR, MULT, CLK, RF. Legend: common, unique.]
Figure 4: Gates not toggled by (a) intFilt and (b) scrambled intFilt for the same input set. Gray gates are not toggled by either application. Red gates are unique untoggled gates for each application. Even though the applications use the same set of instructions and control flow, the gates that they exercise are different.

Static application analysis represents another approach for determining unusable logic for an application. However, application analysis may not identify the maximum amount of logic that can be
removed, since unused logic does not correspond only to software-visible architectural functionalities (e.g., arithmetic units), but also to fine-grained and software-invisible microarchitectural functionalities (e.g., pipeline registers). For example, consider two different applications, FFT and binSearch. Figure 3 shows the gates in the processor that were not exercised during any profiling execution of the applications. Since the applications use different subsets of the functionalities provided by the processor, the parts of the processor that they do not exercise are different. However, a closer look reveals that while some of the differences correspond to coarse-grained software-visible functionalities (e.g., the multiplier is used by FFT but not by binSearch), other differences are fine-grained, software-invisible, and cannot be determined through application analysis (e.g., different gate-level activity profiles in modules like the processor frontend). As another example, Figure 4 shows the breakdown of instructions used by intFilt and scrambled-intFilt. The two applications use exactly the same instructions (scrambled-intFilt is a synthetic benchmark generated by scrambling instructions in intFilt); however, the die graphs in Figure 4 show that the sets of unexercised gates for the applications are different. This is due to the fact that even the sequence of instructions executed by an application can influence which logic the application can exercise in a processor depending on the microarchitectural details. Such interactions cannot be determined simply through application analysis.

[Figure 3: two die plots, (a) FFT and (b) binSearch. Labeled modules: DBG, ALU, WDG, MEM_BB, EXEC, FRONTEND, SFR, MULT, CLK, RF. Legend: common, unique.]
Figure 3: Gates not toggled by two applications, (a) FFT and (b) binSearch, for profiling inputs. Gray gates are not toggled by either application. Red gates are unique untoggled gates for each application.

Given that the fraction of logic in a processor that is not used by a given application can be substantial, and many area- and power-constrained systems only execute one or few applications for their entire lifetime, it may be possible to significantly reduce area and power in such systems by removing logic from the processor that cannot be used by the application(s). However, since different applications can exercise substantially different parts of a processor, and simply profiling or statically analyzing an application cannot guarantee which parts of the processor can and cannot be used by an application, tailoring a processor to an application requires a technique that can identify all the logic in a processor that is guaranteed to never be used by the application and remove unusable logic in a way that leaves the functionality of the processor unchanged for the application. In the next section, we describe a methodology that meets these requirements. We call general purpose processors that have been tailored to an individual application bespoke processors, reminiscent of bespoke clothing, in which a generic clothing item is tailored for an individual person.

3 TAILORING A BESPOKE PROCESSOR

A bespoke processor, tailored to a target application, must be functionally-equivalent to the original processor when executing the application. As such, the bespoke implementation of a processor design should retain all the gates from the original processor design that might be needed to execute the application. Any gate that could be toggled by the application and propagate its toggle to a state element or output port performs a necessary function and must be retained to maintain functional equivalence. Conversely, any gate that can never be toggled by the application can safely be removed, as long as each fanout location for the gate is fed with the gate's constant output value for the application. Removing constant (untoggled) gates for an application could result in significant area and power savings and, unlike conventional energy saving techniques, will introduce no performance degradation (indeed, no change at all in application behavior).

Figure 5 shows our process for tailoring a bespoke processor to a target application. The first step, input-independent gate activity analysis, performs a type of symbolic simulation, where unknown input values are represented as Xs, and gate-level activity of the processor is characterized for all possible executions of the application, for any possible inputs to the application. The second phase of our bespoke processor design technique, gate cutting and stitching, uses gate-level activity information gathered during gate activity analysis to prune away unnecessary gates and reconnect the cut connections between gates to maintain functional equivalence to the original design for the target application.

[Figure 5: flow diagram. Gate-level Netlist and Application Binary feed Gate Activity Analysis, which produces the List of Unused (Untoggled) Gates; Cutting & Stitching then produces the Bespoke Processor Tailored to Application.]
Figure 5: Our technique performs input-independent gate activity analysis to determine which gates of a processor cannot be toggled in any execution of the application. These gates are then cut from the design to form a custom, bespoke processor with reduced area and power.

3.1 Input-independent Gate Activity Analysis

The set of gates that an application toggles during execution can vary depending on application inputs. This is because inputs can change the control flow of execution through the code as well as the data paths exercised by the instructions. Since exhaustive profiling for all possible inputs is infeasible, and limited profiling may not identify all exercisable gates in a processor, we have implemented an analysis technique based on symbolic simulation [7] that is able to characterize the gate-level activity of a processor executing an application for all possible inputs with a single gate-level simulation. During this simulation, inputs are represented as unknown logic values (Xs), which are treated as both 1s and 0s when recording possible toggled gates.

Symbolic simulation has been applied in circuits for logic and timing verification, as well as sequential test generation [8, 24, 37, 42, 44]. More recently, it has been applied to determine application-specific Vmin [18]. Symbolic simulation has also been applied for software verification [74]. However, to the best of our knowledge, no existing technique has applied symbolic simulation to create bespoke processors tailored to the requirements of an application.

Algorithm 1 describes input-independent gate activity analysis. Initially, the values of all memory cells and gates are set to Xs. The application binary is loaded into program memory, providing the values that effectively constrain which gates can be toggled during execution. During simulation, our simulator sets all inputs to Xs, which propagate through the gate-level netlist during simulation [footnote 1]. After each cycle is simulated, the toggled gates are removed from the list of unexercisable gates. Gates where an X propagated are considered as toggled, since some input assignment could cause the gates to toggle. If an X propagates to the PC, indicating input-dependent control flow, our simulator branches the execution tree and simulates execution for all possible branch paths, following a depth-first ordering of the control flow graph. Since this naive simulation approach does not scale well for complex or infinite control structures which result in a large number of branches to explore, we employ a conservative approximation that allows our analysis to scale for arbitrarily-complex control structures while conservatively maintaining correctness in identifying exercisable gates. Our approximation works by tracking the most conservative gate-level state that has been observed for each PC-changing instruction (e.g., conditional branch). The most conservative state is the one where the most variables are assumed to be unknown (X). When a branch is re-encountered while simulating on a control flow path, simulation down that path can be terminated if the symbolic state being simulated is a substate of the most conservative state previously observed at the branch (i.e., the states match or the more conservative state has Xs in all differing variables), since the state (or a more conservative version) has already been explored. If the simulated state is not a substate of the most conservative observed state, the two states are merged to create a new conservative symbolic state by replacing differing state variables with Xs, and simulation continues from the conservative state. This conservative approximation technique allows gate activity analysis to complete in a small number of passes through the application code, even for applications with an exponentially-large or infinite number of execution paths [footnote 2].

Footnote 1: Any data or signals that can be written by external events (e.g., interrupt signals or DMA writes) are also considered unknown values (Xs) during our analysis. Firmware components of interrupt handling, e.g., the jump table and interrupt service handling routine, are considered to be part of the application binary (i.e., known values) during symbolic simulation. If an interrupt is enabled during an instruction's execution, then that instruction is considered as possibly modifying the PC.

Footnote 2: Some complex applications and processors might still require heuristics for exploration of a large number of execution paths [10, 32]; however, our approach is adequate for ULP systems, representative of an increasing number of future applications which tend to have simple processors and applications [36, 53]. For example, complete analysis of our most complex benchmark takes 3 hours.

The result of input-independent gate activity analysis for an application is a list of all gates that cannot be toggled in any execution of the application, along with their constant values. Since the logic functions performed by these gates are not necessary for the correct execution of the binary for any input, they may safely be cut from the netlist, as long as their constant output values are preserved. The following section describes how unusable gates can be cut from the processor without affecting the functionality of the processor for the target application.

3.2 Cutting and Stitching

[Figure 6: flow diagram. Inputs: List of Unused Gates, List of Constant Gate Values, Gate-level Netlist. Steps: Cut Unused Gates, Set Unconnected Gate Inputs to Constant Values, Synthesis (produces the Bespoke Gate-level Netlist), Place & Route (produces the Bespoke GDSII File).]
Figure 6: Tool flow for cutting and stitching.

Once gates that the target application cannot toggle have been identified, they are cut from the processor netlist for the bespoke design. After cutting out a gate, the netlist must be stitched back together to generate the final netlist and laid-out design for the bespoke processor. Figure 6 shows our method for cutting and stitching a bespoke processor. First, each gate on the list of unusable (untoggled) gates is removed from the gate-level netlist. After removing a gate, all fanout locations that were connected to the output net of the removed gate are tied to a static voltage ('1' or '0') corresponding to the constant output value of the gate observed during simulation. Since the logical structure of the netlist has changed, the netlist is re-synthesized after cutting all unusable gates to allow additional optimizations that
reduce area and power. Since some gates have constant inputs after cutting and stitching, they can be replaced by simpler gates. Also, toggled gates left with floating outputs after cutting can be removed, since their outputs can never propagate to a state element or output port. Since cutting can reduce the depth of logic paths, some paths may have extra timing slack after cutting, allowing faster, higher power cells to be replaced with smaller, lower power versions of the cells. Finally, the re-synthesized netlist is placed and routed to produce the bespoke processor layout, as well as a final gate-level netlist with necessary buffers, etc. introduced to meet timing constraints.

Algorithm 1 Input-independent Gate Activity Analysis

Procedure Annotate Gate-Level Netlist(app_binary, design_netlist)
Initialize all memory cells and all gates in design_netlist to X
Load app_binary into program memory
Propagate reset signal
s ← State at start of app_binary
Table of previously observed symbolic states, T.insert(s)
Stack of un-processed execution points, U.push(s)
mark_all_gates_untoggled(design_netlist)
while U ≠ ∅ do
    e ← U.pop()
    while e.PC_next != X and !e.END do
        e.set_inputs_X()   // set all peripheral port inputs to Xs
        e′ ← propagate_gate_values(e)   // simulate this cycle
        annotate_gate_activity(design_netlist, e, e′)   // unmark every gate toggled (or possibly toggled)
        if e′.modifies_PC then
            c ← T.get_conservative_state(e)
            if e′ ⊄ c then
                T.make_conservative_superstate(c, e′)
            else
                break
            end if
        end if
        e ← e′   // advance cycle state
    end while
    if e.PC_next == X then
        c ← T.get_conservative_state(e)
        if e ⊄ c then
            e′ ← T.make_conservative_superstate(c, e)
            for all a ∈ possible_PC_next_vals(e′) do
                e′′ ← e.update_PC_next(a)
                U.push(e′′)
            end for
        end if
    end if
end while
for all g ∈ design_netlist do
    if g.untoggled then
        annotate_constant_value(g, s)   // record the gate's initial (and final) value
    end if
end for

3.3 Illustrative Example

This section illustrates how bespoke processor design tailors a processor design to a particular application, as described in Sections 3.1 and 3.2. Figure 7 illustrates the bespoke design process. The left part of Figure 7 shows input-independent gate activity analysis for a simple example circuit (top right). During symbolic simulation of the target application, logical 1s, 0s, and unknown symbols (Xs) are propagated throughout the netlist. In cycle 0, A and B have known values that are propagated through gates a and b, driving tmp0 and tmp1 to '0'. The controlling value at gate c drives tmp2 to '1', despite input C being an unknown value (X). Inputs A and B are not changed by the simulation of the binary until after cycle 2, when an X was propagated to the PC (not shown) that requires two different execution paths to be explored. In the left path, input B becomes X in cycle 3, causing tmp1 to become X as well. However, since input C is a '0', tmp2 is still a '1'. In the right execution path, inputs A and B both have Xs and logic values that may toggle tmp1 in cycles 5-7, but for each of these cycles, input C is a '0', keeping tmp2 constant at '1'. Since tmp2 is never toggled during any of the possible executions of the application, gate c is marked for cutting, and its constant output value ('1') is stored for stitching. Although gate d is never toggled in cycles 0-2 or down the left execution path, it does toggle in the right execution path and thus cannot be marked for cutting. Gates a and b also toggle and thus are not marked for cutting.

Once gate activity analysis has generated a list of cuttable gates and their constant values, cutting and stitching begins. Since gate c was marked for cutting, it is removed from the netlist, leaving the input to its fanout (d) unconnected. During stitching, d's floating input is connected to c's known constant output value for the application ('1'). After stitching, the gate-level netlist is re-synthesized. Synthesis removes gates that are not driving any other gates (gates a and b), even though they toggled during symbolic simulation, since their work does not affect the state or output function of the processor for the application. Synthesis also performs optimizations, such as constant propagation, which replaces gate d with an inverter, since the constant controlling input of '1' to the XOR gate makes it function as an inverter. Finally, place and route produces a fully laid-out bespoke design.

[Figure 7, left: gate activity analysis tables. Right: circuit schematics labeled Original Circuit (inputs A, B, C, D; gates a, b, c, d; nets tmp0, tmp1, tmp2; output OUT), After Cutting, After Stitching, After Synthesis, and After Place & Route.]

Cycles 0-2 (before the branch):
Cycle  A  B  C  D  tmp0  tmp1  tmp2  OUT
0      1  0  X  1  0     0     1     0
1      1  0  1  1  0     0     1     0
2      1  0  0  1  0     0     1     0

Left execution path:
Cycle  A  B  C  D  tmp0  tmp1  tmp2  OUT
3      1  X  0  1  0     X     1     0
4      1  0  X  1  0     0     1     0
5      0  X  0  1  1     X     1     0

Right execution path:
Cycle  A  B  C  D  tmp0  tmp1  tmp2  OUT
3      1  0  1  0  0     0     1     1
4      1  0  X  1  0     0     1     0
5      0  X  0  1  1     X     1     0
6      X  0  0  1  X     X     1     0
7      X  0  0  0  X     X     1     1

Figure 7: An example of gate activity analysis and cutting and stitching.

3.4 Correctness

In this section, we show that the transformations we perform to create a bespoke processor implementation produce a design that is functionally equivalent to the original processor design for the target application. I.e., the bespoke design implements the same function and produces the same output as the original design for all possible executions of the application.

Theorem: A bespoke processor implementation ℬ_𝒜 of processor 𝒫 tailored to an application 𝒜 is functionally-equivalent to processor 𝒫 with respect to application 𝒜; ℬ_𝒜 produces the same output as 𝒫 for any possible execution of 𝒜.

Proof: The first step in creating ℬ_𝒜, input-independent gate activity analysis (Section 3.1), identifies the subset ℰ of all gates in
the processor that can possibly be exercised by 𝒜, for all possible inputs. The analysis also identifies the constant output values for all gates 𝒰 that can never be exercised by 𝒜. It follows that ℰ ∩ 𝒰 = ∅ and ℰ ∪ 𝒰 = 𝒢, where 𝒢 is the set of all gates in 𝒫. Cutting and stitching (Section 3.2) removes all gates in the set 𝒰 and ties their output nets to their known constant values, such that the functionality of all gates in 𝒰 is maintained in ℬ_𝒜. All gates in ℰ remain in the bespoke design, so all gates in ℰ have the same functionality and produce the same outputs in ℬ_𝒜 and 𝒫. Since ℰ ∪ 𝒰 = 𝒢, it follows that ℬ_𝒜 is functionally equivalent to 𝒫 for 𝒜 and produces the same output as 𝒫 for all possible inputs to 𝒜. ■

We also verified correctness through input-independent gate activity analysis and input-based simulations on both the original and bespoke processor for every application (Section 5.1), confirming that outputs of both processors were the same in each case [footnote 3].

Footnote 3: Industrial equivalence checking tools (e.g., Formality [63]) check static equivalence (i.e., their analysis is application-independent); however, the bespoke and original designs are only equivalent for the target application, not in general. Therefore, we rely on gate-level analysis and input-based simulations for verification.

3.5 Supporting Multiple Applications

While bespoke processor design involves tailoring a general purpose processor into an application-specific processor implementation, bespoke processors, which are descended from general purpose processors, still retain some programmability. In this section, we describe several approaches for creating bespoke processors that support multiple applications.

The first and most straightforward case is when the multiple target applications are known a priori at design time. For example, a licensor or licensee (see Figure 1) that wants to amortize the expense of designing and manufacturing a chip may choose to tailor a bespoke processor to support multiple applications. In this case, the bespoke processor design methodology (Figure 5) is simply expanded to support multiple target applications, as shown in Figure 8. Given a known set of binaries that need to be supported, gate activity analysis is performed for each application, and cutting and stitching is performed for the intersection of unused gates for the applications. The intersection of unused gates represents all the gates that are not used by any of the target applications. The resulting bespoke processor contains all the gates necessary to run any of the target applications. While there may be some area and power cost compared to a bespoke design for a single application due to having more gates, Section 5 shows that a bespoke processor supporting multiple applications still affords significant area and power benefits.

[Figure 8: flow diagram. The Gate-level Netlist and each Application Binary feed a separate Gate Activity Analysis, each producing a List of Unused (Untoggled) Gates; the lists are intersected (∩) and passed to Cutting & Stitching, which produces the Bespoke Processor Tailored to Applications.]
Figure 8: To support multiple programs, our bespoke design technique performs input-independent gate activity analysis on each program. Cutting and stitching is performed using the intersection of the untoggled gates lists from all supported programs.

There may also be cases where it is desirable for a bespoke processor to support an application that is not known at design time. For example, one advantage of using a programmable processor is the ability to update the target application in the field to roll out a new software version or to fix a software bug. Tailoring a bespoke processor to a specific application invariably reduces its ability to support in-field updates. However, even in the case when
an application was unknown at design time, it may be possible for a bespoke processor to support the application. For instance, it is always possible to check whether a new software version can be supported by a bespoke processor by checking whether the gates required by the new software version are a subset of the gates in the bespoke processor. It may be possible to increase coverage for in-field updates by anticipating them and explicitly designing the processor to support them. As an example, this may be used to support common bug fixes by automatically creating mutants of the original software by injecting common bugs [38], then creating a bespoke design that supports all the mutant versions of the program. This approach may increase the probability that a debugged version of the software is supported by the bespoke design.

Sometimes an in-field update represents a significant change that is beyond the scope of a simple code mutation. To support such updates, a bespoke processor may need to provide support for a small number of ISA features that can be used to implement arbitrary computations, e.g., a Turing-complete instruction (or set of instructions), in addition to the target application(s). Consider adding support for subneg, an example Turing-complete instruction [26]. Figure 9 shows the functionality and code implementation of a subneg pseudo-instruction created for MSP430. Since the memory operand addresses and the branch target are assumed to be unknown values (Xs), a binary that characterizes the behavior of subneg can be co-analyzed with the target application binary to tailor a Turing-complete bespoke processor that supports the target application natively and can handle arbitrary updates, possibly with some area, power, and performance overhead.

Figure 9: Functionality and MSP430 implementation of Turing-complete subneg pseudo-instruction.

  Functionality:
    Mem[b] = Mem[b] - Mem[a];
    if (Mem[b] < 0) goto c

  Implementation:
    PC  Instruction
    0   sub &a &b
    1   XXXX XXXX XXXX XXXX   (a)
    2   XXXX XXXX XXXX XXXX   (b)
    3   jnX XXXXX010          (c)
    4   sub &a &b
    5   xxxxxxxxxxxxxxxx
    6   xxxxxxxxxxxxxxxx
    7   jnx xxxxx010

4 METHODOLOGY

4.1 Simulation Infrastructure and Benchmarks

We verify our technique on a silicon-proven processor, openMSP430 [27], an open-source version of one of the most popular ULP processors [6, 70]. The processor is synthesized, placed, and routed in TSMC 65GP technology (65nm) for an operating point of 1V and 100 MHz using Synopsys Design Compiler [62] and Cadence EDI System [11]. Gate-level simulations are performed by running full benchmark applications on the placed and routed processor using a custom gate-level simulator that efficiently traverses the control flow graph of an application and captures input-independent activity profiles (Section 3.1). Table 1 lists our benchmark applications. We show results for all benchmarks from [73] and all EEMBC benchmarks [22] that fit in the program memory of the processor. We also added unit test benchmarks [27] corresponding to some functionalities in the processor that were not exercised by the other benchmarks. In addition to evaluating a bare-metal environment common in area- and power-constrained embedded systems [17, 56, 61, 66], we also evaluate bespoke processors that support our applications running on the processor with an operating system (FreeRTOS [55]). Benchmarks are chosen to be representative of emerging ultra-low-power application domains such as wearables, internet of things, and sensor networks [73]. Also, benchmarks were selected to represent a range of complexity in terms of control flow and execution length. Power analysis is performed using Synopsys Primetime [64]. Experiments were performed on a server housing two Intel Xeon E-2640 processors (8 cores each, 2GHz operating frequency, 64GB RAM).

Table 1: Benchmarks

  Benchmark    Description                   Max Execution Length (cycles)
  Embedded Sensors
  binSearch    Binary search                 2037
  div          Unsigned integer division     402
  inSort       In-place insertion sort       4781
  intAVG       Signed integer average        14512
  intFilt      4-tap signed FIR filter       113791
  mult         Unsigned multiplication       210
  rle          Run-length encoder            8283
  tHold        Digital threshold detector    18511
  tea8         TEA encryption algorithm      2228
  EEMBC
  FFT          Fast Fourier transform        406006
  Viterbi      Viterbi decoder               1167298
  convEn       Convolutional encoder         117789
  autocorr     Autocorrelation               8092
  Unit
  irq          Interrupt test                210
  dbg          Debug interface               14166

4.2 Baselines

For evaluations, we compare bespoke designs against two baseline processors. The first baseline is the openMSP430 processor, optimized to minimize area and power for operation at 100 MHz and 1V. For the second baseline, we create aggressively-optimized application-specific versions of openMSP430 for each benchmark application by rewriting the RTL to remove modules that are unused by the benchmark, before performing synthesis, placement, and routing. Such an approach is representative of an Xtensa-like approach [13], where the processor configuration is customized for a particular application. Note, however, that our baseline is significantly more aggressive than an Xtensa-like approach, since it requires our input-independent gate activity analysis technique (Section 3.1) to identify the modules that cannot be used by an application. Such module identification may not be possible through static analysis of application code alone.

5 RESULTS

In this section, we evaluate bespoke processors. We first consider area and power benefits of tailoring a processor to an application, then evaluate design approaches that can be used to support bespoke processors throughout the product life-cycle, including procedures for verifying bespoke processors, techniques to design bespoke processors that support multiple known applications, and strategies to allow in-field updates in bespoke processors.

Figure 10: The height of a bar represents the fraction of gates that can be toggled by a benchmark. Components within each bar represent each module's contribution to the fraction of gates toggled by the benchmark. [Stacked bar chart; y-axis: Fraction Usable Gates; legend: alu, multiplier, watchdog, clock tree, mem backbone, sfr, register file, frontend, execution unit glue, dbg, clock module glue.]
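The per-module breakdown that Figure 10 plots can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gate names, the gate-to-module map, and the set of toggleable gates are hypothetical stand-ins for what the input-independent gate activity analysis (Section 3.1) would report on a real netlist, and the function name is made up for this sketch.

```python
from collections import defaultdict


def module_toggle_fractions(gate_module, toggleable):
    """Split the fraction of toggleable gates by module.

    gate_module: dict mapping every gate in the design to its module.
    toggleable:  set of gates the benchmark can toggle (assumed to be
                 the output of a gate activity analysis).
    Returns (per-module fraction of ALL gates, total fraction).
    """
    total_gates = len(gate_module)
    counts = defaultdict(int)
    for gate in toggleable:
        counts[gate_module[gate]] += 1
    per_module = {module: n / total_gates for module, n in counts.items()}
    return per_module, sum(counts.values()) / total_gates


# Hypothetical toy netlist: 8 gates spread over three modules.
gate_module = {
    "g0": "alu", "g1": "alu", "g2": "alu",
    "g3": "multiplier", "g4": "multiplier", "g5": "multiplier",
    "g6": "frontend", "g7": "frontend",
}
# Hypothetical analysis result for one benchmark.
toggleable = {"g0", "g1", "g6", "g7"}

per_module, bar_height = module_toggle_fractions(gate_module, toggleable)

# Modules with no toggleable gate at all contribute nothing to the bar.
unused_modules = set(gate_module.values()) - set(per_module)

print(per_module, bar_height, unused_modules)
```

In this toy input the multiplier contributes nothing to the bar, which makes it the kind of wholly unused module that the coarse-grained, module-level baseline of Section 4.2 removes; the gate-level approach additionally drops the unused gate inside the alu.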
Figure 10 shows the fraction of gates in the original processor design that can be toggled by each benchmark (see footnote 4). [...] We observe that each benchmark can toggle only a relatively small fraction of the gates in the baseline design. At most, 57% of the gates in the baseline design can be toggled, and 11 benchmarks toggle less than half the gates. Even though a large fraction of the gates of the baseline processor cannot be toggled by each benchmark, each benchmark can toggle a different set of gates. For example, autocorr1, which uses the largest fraction of the gates in the baseline processor, does not exercise the clock_module, while tHold, which toggles the smallest fraction of the baseline gates, does exercise gates in the clock_module. Some modules, such as the multiplier, are used by some benchmarks and not others. However, module usage differs by application. For example, intFilt can never toggle about half of the multiplier gates due to constraints the binary places on filter coefficients, whereas mult toggles almost all the gates in the multiplier. Other modules, such as the frontend, are toggled by all applications, but each application can toggle a different subset of frontend gates. While these results show that a bespoke processor can have a significantly lower gate count than the general purpose processor it is derived from, they also confirm that hardware-software co-analysis is necessary to identify all the gates that can be eliminated in a bespoke design. Elimination of gates based on techniques such as profiling or static analysis will either fail to guarantee correctness or will miss opportunities to eliminate gates that an application can never use.

Footnote 4: Unlike Figure 2, which presents results from profiling, Figure 10 shows results from input-independent gate analysis.

Bespoke processors have fewer gates, lower area, and lower power than their general purpose counterparts. Figure 11 shows the reduction in gates, area, and power afforded by bespoke processors tailored to each benchmark. FFT, which has the smallest gate count reduction (44%) (see footnote 5), still reduces area by 47% and power by 37%, relative to the baseline design. Area savings are up to 92% (dbg), while power savings are up to 74% (dbg).

Figure 11: Reduction (%) in gate count, area, and power for a bespoke design, compared to the baseline processor. [Bar chart; y-axis: Percentage Savings; series: Gate Savings, Area Savings, Power Savings.]

Footnote 5: Note that gate count reduction reported in Figure 11 is different than fraction of toggled gates in Figure 10, since bespoke design also removes some toggled gates that cannot propagate their toggles to state elements or output ports.

Figure 10 shows that some modules could be wholly removed for specific benchmarks (e.g., the multiplier can be removed for binSearch, since it cannot use any gates in the multiplier). For such modules, it is possible to use an Xtensa-like approach [13], enabled by our input-independent gate activity analysis, where modules in which no gates are usable by an application are removed from the design. Figure 12 shows the benefits of bespoke processors relative to coarse-grained bespoke designs in which wholly-unusable modules have been removed from the processor. Note that compared to an Xtensa-like approach, a coarse-grained bespoke design does not need any knowledge of the microarchitecture, as the unusable gates are identified automatically by hardware-software co-analysis. The results show that the fine-grained gate-level bespoke design can provide up to 75% power reduction (22% minimum, 35% on average) over coarse-grained module-level bespoke design.

Figure 12: Reduction (%) in gate count, area, and power for bespoke designs, compared to application-specific coarse-grained module-level bespoke design. [Bar chart; y-axis: Percentage Savings; series: Gate Savings, Area Savings, Power Savings.]

Additional power savings may be possible when cutting, stitching, and re-synthesis removes gates from critical paths, exposing additional timing slack that can be exploited for energy savings. Table 2 shows timing slack exposed during bespoke processor tailoring for each benchmark. Exposed timing slack can be used to reduce the operating voltage of the processor without reducing the frequency (see footnote 6). Table 2 also shows the minimum safe operating voltage for each bespoke design (assuming worst-case PVT variations), the additional power savings afforded by exploiting timing slack in bespoke designs, and the total power savings achieved with respect to the baseline design from eliminating unusable logic and exploiting exposed timing slack for voltage reduction.

Table 2: Benefits of exploiting timing slack created by cutting, stitching, and re-synthesis.

  Benchmark   Timing Slack (%)   Vmin (V)   Addl. Power Savings from Slack (%)   Total Power Savings (%)
  binSearch   24.30              0.81       36.86                                72.1
  div         24.51              0.82       34.53                                69.9
  inSort      22.24              0.83       33.25                                67.6
  intAVG      23.37              0.83       33.35                                67.7
  intFilt     23.23              0.84       31.31                                58.5
  mult        20.45              0.91       20.33                                59.3
  rle         22.10              0.83       33.31                                66.8
  tHold       24.07              0.81       36.90                                73.5
  tea8        23.55              0.83       33.23                                65.7
  FFT         21.74              0.90       20.13                                50.0
  Viterbi     23.48              0.83       33.18                                64.9
  convEn      23.96              0.83       33.03                                63.9
  autocorr1   18.48              0.91       18.31                                50.3
  irq         17.91              0.92       16.24                                57.7
  dbg         45.70              0.60       67.73                                91.5

Footnote 6: Exposed timing slack could also be used to increase operating frequency (performance) of a bespoke design. On average, frequency can be increased by 13% in the bespoke designs.

5.1 Verification

We followed a two-pronged approach to verify our bespoke processor designs. First, we performed input-independent gate activity analysis on the bespoke processor design and compared the processor state between the original and bespoke processors in each cycle. At the end of activity analysis, we compared the contents of the data memory with that of the original design to ensure that both designs produced the same outputs. There were no discrepancies for any of the bespoke processors, indicating that the bespoke processors traverse the same states as the original processor and produce the same outputs for each benchmark. While this verification efficiently (see footnote 7) checks for functional correctness considering all possible inputs, it does not explicitly guarantee that the data memory contents of a bespoke processor are correct for any specific inputs. For an explicit proof of the correctness of bespoke processors, see Section 3.4.

Footnote 7: Table 3 shows that the runtime of X-based simulations is within an order of magnitude of a single input-based simulation.

The second method we used to verify bespoke processors involves performing input-based simulations on the original and bespoke processors to confirm that they produce the same outputs. Outputs produced during simulation are recorded, and the outputs and data memory from the original and bespoke processors are compared for equivalence. Since it is infeasible to simulate the application with all possible inputs, we used KLEE LLVM Execution Engine (KLEE) [9] to generate inputs that exercise as many control paths through the application as possible. Table 3 lists the number of inputs simulated for each benchmark and the corresponding coverage of the code. For most benchmarks, all lines and branch directions are covered. Where coverage is not 100%, the portion of the code that was not covered was not executable. The table also reports the fraction of gates in the bespoke designs that were exercised during the input-based simulations. We see that a majority of the gates (78%, on average) were toggled during the simulations, indicating that the majority of gates in a bespoke design are necessary (see footnote 8). Table 3 also shows the aggregate runtime of the input-based simulations, providing some quantification of verification effort (see footnote 9).

Table 3: Verification runtime and coverage.

              Sim. Time (s)                      Input-Based Coverage
  Benchmark   X-Based   per Input   Num Paths   Line %   Br. %   Br. Dir. %   Gate %
  binSearch   23        3           83          100      100     93           87
  div         7         22          1           100      -       -            93
  inSort      25        156         718         100      100     100          93
  intAVG      116       79          1           100      100     100          82
  intFilt     1365      625         1           100      100     100          28
  mult        20        20          1           100      -       -            64
  rle         20        32          1           74       100     75           92
  tHold       7         27          6239        100      100     100          88
  tea8        21        32          1           100      100     100          93
  FFT         10498     5669        1           100      100     100          47
  Viterbi     433       3637        8           100      100     100          86
  convEn      1401      168         1           100      100     100          89
  autocorr    90        13          1           38       14      14           71

Footnote 8: Note that gate coverage is not expected to be 100%, since KLEE aims to cover lines of code and execution paths, not gates. In particular, benchmarks that use the multiplier (intFilt, mult, FFT, and autocorr) see low gate coverage since the multiplier is a significant fraction of bespoke designs for such benchmarks and KLEE does not try to form inputs to the multiplier to increase data path coverage.

Footnote 9: Note that simulations for multiple inputs can easily be parallelized to reduce the verification time significantly.

Figure 13: Normalized gate count, area, and power ranges for all possible bespoke processor supporting multiple applications. [Range plot; x-axis: Number of Supported Benchmarks (0 to 16); y-axis: Normalized Values; series: Power, Area, Gate Count.]

5.2 Supporting Multiple Programs

Bespoke processors are able to support multiple programs by including the union of gates needed to support all of the programs. Figure 13 shows gate count, power, and area for bespoke processors
tailored to N programs, normalized to the baseline processor. For each value of N, the bars show the ranges of these metrics across bespoke processors tailored to all combinations of N programs. For many combinations of programs, even for up to ten programs, only 60% or less of the gates are needed. In fact, despite supporting ten programs, the area and power of a bespoke processor can be reduced by up to 41% and 20%, respectively. However, supporting multiple programs can limit the extent of gate cutting and the resulting area and power benefits when the applications exercise significantly different portions of the processor. For example, the two-program bespoke processor with the largest gate count is one tailored to dbg and irq. Each application uses components of the processor that are not exercised by the other program; dbg exercises the debug module, while irq exercises the interrupt handling logic. The resulting gate reduction of 18% still produces area and power benefits of 26% and 19%, respectively. Although supporting multiple programs reduces gate count, area, and power reduction benefits, the area and power will never increase with respect to the baseline design. In the worst case, the baseline processor can run any combination of program binaries.

5.3 Supporting In-field Updates

We consider two approaches to designing bespoke processors that can be updated in the field. First, we evaluate a method that allows a bespoke processor to handle common, minor programming bugs. Second, we evaluate a method that allows a bespoke processor to handle infrequent, arbitrary software updates.

In-field updates may often be deployed to fix minor correctness bugs (e.g., off-by-one errors, etc.) [33]. To emulate in-field updates to fix bugs, we use the Milu mutation testing tool [38] to generate "updates" corresponding to bug fixes. Table 4 lists the breakdown of mutants by type generated by Milu for the six benchmarks with the most mutants. If a benchmark has zero mutants for a particular type, no mutation sites of that type were found in that benchmark by Milu. Type I mutants are conditional operator mutants (e.g., A||B → A&&B). Type II mutants are computation operator mutants (e.g., A+B → A×B). Type III mutants are loop conditional operator mutants (e.g., i<32 → i≠32).

Table 4: Milu produces three types of mutants. Type I: Logical conditional operator mutants. Type II: Computation operator mutants. Type III: loop conditional operator mutants.

  Benchmark   Type I   Type II   Type III   Total
  binSearch   0        0         15         15
  inSort      8        0         15         23
  rle         0        20        25         45
  tea8        48       24        10         82
  Viterbi     24       24        35         83
  autocorr    12       0         10         22

Table 5 lists the percentage of mutants (i.e., in-field updates to fix bugs) that are supported by the original bespoke design (generated for a "buggy" application). Many minor bug fixes can be covered by a bespoke processor designed for the original application without any modification. I.e., the mutants representing many in-field updates only use a subset of the gates in the original bespoke processor. This means that these mutants will execute correctly on the original bespoke processor tailored to the original "buggy" application. We see that between 25% and 100% of various mutants are covered, and 70% of all mutants are covered. This shows that a bespoke processor will maintain some of the original general purpose processor's ability to support in-field updates. If a higher coverage of possible bugs is desired, the automatically-generated mutants can be considered as independent programs while tailoring the bespoke processor for the application (see Section 3.5). Figure 14 shows the increase in gate count, area, and power required to tailor a bespoke processor to the six benchmarks with the most mutants by including all possible mutants (i.e., bug fixes) generated by Milu during bespoke design. Providing support for simple in-field updates incurs a gate count overhead of between 1% and 40%. Despite this increase in gate count, total area benefits for the bespoke processors are between 23% and 66%, while total power benefits are between 13% and 53%. Therefore, simple in-field updates can be supported while still achieving substantial area and power benefits.

Table 5: Percentage of mutants (in-field updates) of different types that are supported by the bespoke design for the base software implementation. "-" denotes that a given benchmark did not have any mutants of that type.

  Benchmark   Type I %   Type II %   Type III %   Total %
  binSearch   -          -           73           73
  inSort      25         -           27           26
  rle         -          100         84           91
  tea8        58         75          100          68
  Viterbi     92         83          80           84
  autocorr    50         -           40           45

Figure 14: Normalized gate count, area, and power vs the baseline design for designs supporting all mutants (in-field updates). [Bar chart; x-axis: binSearch, inSort, rle, tea8, Viterbi, autocorr1; y-axis: Normalized Values; series: Gate Count, Area, Power, Overhead.]

A bespoke processor tailored to a specific application can be designed to support arbitrary software updates by designing it to support a Turing-complete instruction (e.g., subneg) or set of instructions, in addition to other programs it supports (Section 3.5). For our single-application bespoke processors, the average area and power overheads to support subneg are 8% and 10%, respectively. Average area and power benefits for subneg-enhanced bespoke processors are 56% and 43%, respectively.

Note that an instruction in a bespoke processor's target application is not guaranteed to be supported in a different application (e.g., an update), since the processor eliminates gates that are not needed to support the possible instruction sequences in the target application's execution tree; a different sequence of the same instructions
