Bespoke Processors for Applications with Ultra-low Area and Power Constraints HariCherupalli HenryDuwe WeidongYe UniversityofMinnesota UniversityofIllinois UniversityofIllinois [email protected] [email protected] [email protected] RakeshKumar JohnSartori UniversityofIllinois UniversityofMinnesota [email protected] [email protected] ABSTRACT 1 INTRODUCTION Alargenumberofemergingapplicationssuchasimplantables,wear- Alargeclassofemergingapplicationsischaracterizedbysevere ables,printedelectronics,andIoThaveultra-lowareaandpower areaandpowerconstraints.Forexample,wearables[46,51]and constraints.Theseapplicationsrelyonultra-low-powergeneralpur- implantables[25,50]areextremelyarea-andpower-constrained. posemicrocontrollersandmicroprocessors,makingthemthemost Several IoT applications, such as stick-on electronic labels [48], abundanttypeofprocessorproducedandusedtoday.Whilegen- RFIDs[54],andsensors[30,72],arealsoextremelyarea-andpower- eralpurposeprocessorshaveseveraladvantages,suchasamortized constrained. Area constraints are expected to be severe also for developmentcostacrossmanyapplications,theyaresignificantly printedplastic[49]andorganic[41]applications. over-provisioned for many area- and power-constrained systems, Costconcernsdrivemanyoftheaboveapplicationstousegeneral whichtendtorunonlyoneorasmallnumberofapplicationsover purposemicroprocessorsandmicrocontrollersinsteadofmuchmore theirlifetime.Inthispaper,wemakeacaseforbespokeprocessor area-andpower-efficientASICs,since,amongotherbenefits,de- design,anautomatedapproachthattailorsageneralpurposeproces- velopmentcostofmicroprocessorIPcorescanbeamortizedbythe sorIPtoatargetapplicationbyremovingallgatesfromthedesign IPcorelicensoroveralargenumberofchipmakersandlicensees. thatcanneverbeusedbytheapplication.Sinceremovedgatesare Infact,ultra-low-area-andpower-constrainedmicroprocessorsand neverusedbyanapplication,bespokeprocessorscanachievesignif- microcontrollerspoweringtheseapplicationsarealreadythemost icantlylowerareaandpowerthantheirgeneralpurposecounterparts widely used type of processing hardware in terms of production withoutanyperformancedegradation.Also,gateremovalcanex- andusage[5,23,53],inspiteoftheirwell-knowninefficiencycom- poseadditionaltimingslackthatcanbeexploitedtoincreasearea paredtoASICandFPGA-basedsolutions[31].Giventhismismatch andpowersavingsorperformanceofabespokedesign.Bespoke betweentheextremeareaandpowerconstraintsofemergingapplica- processordesignreducesareaandpowerby62%and50%,onaver- tionsandtherelativeinefficiencyofgeneralpurposemicroprocessors age,whileexploitingexposedtimingslackimprovesaveragepower andmicrocontrollerscomparedtotheirASICcounterparts,there savingsto65%. existsaconsiderableopportunitytomakemicroprocessor-basedso- lutionsfortheseapplicationsmuchmorearea-andpower-efficient. CCSCONCEPTS Onebigsourceofareainefficiencyinamicroprocessoristhat • Computer systems organization → Special purpose systems; ageneralpurposemicroprocessorisdesignedtotargetanarbitrary Embeddedsystems;•Hardware→Applicationspecificproces- applicationandthuscontainsmanymoregatesthanwhataspecific sors; applicationneeds(Section2).Also,theseunusedgatescontinueto KEYWORDS consumepower,resultinginsignificantpowerinefficiency.While ultra-low-powerprocessors,application-specificprocessors,bespoke adaptivepowermanagementtechniques(e.g.,powergating[58,60]) processors,hardware-softwareco-analysis,InternetofThings helptoreducepowerconsumedbyunusedgates,theeffectivenessof suchtechniquesislimitedduetothecoarsegranularityatwhichthey ACMReferenceformat: mustbeapplied,aswellassignificantimplementationoverheads HariCherupalli,HenryDuwe,WeidongYe,RakeshKumar,andJohnSartori. 2017.BespokeProcessorsforApplicationswithUltra-lowAreaandPower such as domain isolation and state retention (Section 7). These Constraints.InProceedingsofISCA’17,Toronto,ON,Canada,June24-28, techniquesalsoworsenareainefficiency. 2017,14pages.https://doi.org/10.1145/3079856.3080247 Oneapproachtosignificantlyincreasetheareaandpowereffi- ciencyofamicroprocessorforagivenapplicationistoeliminate Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor alllogicinthemicroprocessorIPcorethatwillnotbeusedbythe classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed application.Eliminatinglogicthatisguaranteedtonotbeusedby forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitation anapplicationcanproduceadesigntailoredtotheapplication–a onthefirstpage.CopyrightsforcomponentsofthisworkownedbyothersthanACM mustbehonored.Abstractingwithcreditispermitted.Tocopyotherwise,orrepublish, bespokeprocessor–thathassignificantlylowerareaandpowerthan topostonserversortoredistributetolists,requirespriorspecificpermissionand/ora theoriginalmicroprocessorIPthattargetsanarbitraryapplication. [email protected]. Aslongastheapproachtocreateabespokeprocessorisautomated, ISCA’17,June24-28,2017,Toronto,ON,Canada ©2017AssociationforComputingMachinery. theresultingdesignretainsthecostbenefitsofamicroprocessor ACMISBN978-1-4503-4892-8/17/06...$15.00 IP,sincenoadditionalhardwareorsoftwareneedstobedeveloped. https://doi.org/10.1145/3079856.3080247 ISCA’17,June24-28,2017,Toronto,ON,Canada H.Cherupallietal. upto92%(46%minimum,62%onaverage)andpowerreductions are up to 74% (37% minimum, 50% on average) compared to a generalpurposemicroprocessor.Whentimingslackresultingfrom gateremovalisexploited,powerreductionsincreasetoupto91% (50%minimum,65%onaverage). •Finally,wepresentandanalyzedesignapproachesthatcanbeused to support bespoke processors throughout the product life-cycle. Thesedesignapproachesincludeproceduresforverifyingbespoke processors,techniquestodesignbespokeprocessorsthatsupport multipleknownapplications,andstrategiestoallowin-fieldupdates inbespokeprocessors. Figure 1: General purpose processors are overdesigned for a 2 MOTIVATION specificapplication(top).Abespokeprocessordesignmethod- Area-andpower-constrainedmicroprocessorsandmicrocontrollers ologyallowsamicroprocessorIPlicensororlicenseetotarget arethemostabundanttypeofprocessorproducedandusedtoday, differentapplicationsefficientlywithoutadditionalsoftwareor with projected deployment growing exponentially in the near fu- hardwaredevelopmentcost(bottom). ture[5,23,34,53].Thisexplosivegrowthisfueledbyemerging area-andpower-constrainedapplications,suchastheinternet-of- things (IoT), wearables, implantables, and sensor networks. The Also,sincenologicusedbytheapplicationiseliminated,areaand microprocessors and microcontrollers used in these applications powerbenefitscomeatnoperformancecost.Theresultingbespoke aredesignedtoincludeawidevarietyoffunctionalitiesinorderto processordoesnotrequireprogrammerinterventionorhardwaresup- supportalargenumberofdiverseapplicationswithdifferentrequire- porteither,sincethesoftwareapplicationcanstillrun,unmodified, ments.Ontheotherhand,theembeddedsystemsdesignedforthese onthebespokeprocessor. applicationstypicallyconsistofoneapplicationorasmallnumberof Inthispaper,wepresentamethodologytoautomaticallygenerate applications,runningoverandoveronageneralpurposeprocessor a bespoke processor for an application out of a general purpose forthelifetimeofthesystem[1].Giventhataparticularapplication processor/microcontrollerIPcore.Ourmethodologyreliesongate- mayonlyuseasmallsubsetofthefunctionalitiesprovidedbya levelsymbolicsimulationtoidentifygatesinthemicroprocessorIP generalpurposeprocessor,theremaybeaconsiderableamountof thatcannotbetoggledbytheapplication,irrespectiveoftheappli- logicinageneralpurposeprocessorthatisnotusedbyanapplica- cationinputs,andautomaticallyeliminatesthemfromthedesign tion.Figure2illustratesthispoint,showingthefractionofgatesin toproduceasignificantlysmallerandlowerpowerdesignwiththe anopenMSP430[27]processorthatarenottoggledwhenavariety sameperformance.Inmanycases,reductioninthenumberofgates ofapplications(Table1)areexecutedontheprocessorwithmany alsointroducestimingslackthatcanbeexploitedtoimproveperfor- differentinputsets.Thebarinthefigureshowstheintersectionof manceorfurtherreducepowerandarea.Sincetheoriginaldesign allgatesthatwerenotexercised(toggled)bytheapplicationforany isprunedatthegranularityofgates,theresultingmethodologyis input,andtheintervalshowstherangeinfractionofunexercised muchmoreeffectivethananyapproachthatreliesoncoarse-grained gatesacrossdifferentinputs.Foreachapplication,asignificantfrac- application-specificcustomization.Theproposedmethodologycan tion(around30%-60%)oftheprocessor’sgateswerenottoggled beusedeitherbyIPlicensorsorIPlicenseestoproducebespoke duringanyexecutionoftheapplication.Theseresultsindicatethat designsfortheapplicationofinterest(Figure1).Simpleextensions theremaybeanopportunitytoreduceareaandpowersignificantly toourmethodologycanbeusedtogeneratebespokeprocessorsthat inarea-andpower-constrainedsystemsbyremovinglogicfromthe can support multiple applications or different degrees of in-field processorthatcannotbeexercisedbytheapplication(s)runningon softwareprogrammability,debuggability,andupdates(Section3.5). theprocessor,ifitcanbeguaranteedthatremovedlogicwillnever Thispapermakesthefollowingcontributions. beneededforanypossibleexecutionoftheapplication(s). •Weproposebespokeprocessors–anovelapproachtoreducing However,identifyingallthelogicthatisguaranteedtoneverbe areaandpowerbytailoringaprocessortoanapplication,suchthat usedbyanapplicationisnotstraightforward.Onepossibleapproach aprocessorconsistsofonlythosegatesthattheapplicationneeds isprofiling,whereinanapplicationisexecutedformanyinputsand for any possible execution with any possible inputs. A bespoke thesetofgatesthatwereneverexercisedisrecorded,asinFigure2. processorstillrunstheunmodifiedapplicationbinarywithoutany However,profilingcannotguaranteethatthesetofgatesusedby performancedegradation. anapplicationwillnotbedifferentforadifferentinputset.Indeed, •Wepresentanautomatedmethodologyforgeneratingbespoke profilingresultsinFigure2showconsiderablevariationsinexercised processors.Oursymbolicgate-levelsimulation-basedmethodology gates(upto13%)fordifferentexecutionsofthesameapplication takestheoriginalmicroprocessorIPandapplicationbinaryasinput withdifferentinputs.Thus,anapplicationmightrequiredifferent toproduceadesignthatisfunctionally-equivalenttotheoriginalpro- gatesandexecuteincorrectlyforanunprofiledinput. cessorfromtheperspectiveofthetargetapplicationwhileconsisting Staticapplicationanalysisrepresentsanotherapproachforde- oftheminimumnumberofgatesneededforexecution. terminingunusablelogicforanapplication.However,application •Wequantifytheareaandpowerbenefitsofbespokeprocessorsfor analysismaynotidentifythemaximumamountoflogicthatcanbe asuiteofsensorandembeddedbenchmarks.Areareductionsare BespokeProcessorsforApplicationswithUltra-lowAreaandPowerConstraints ISCA’17,June24-28,2017,Toronto,ON,Canada DBG DBG 60 )% ALU ALU ( s50 e ta WDG MEM_BB EXEC WDG MEM_BB EXEC g d40 FRONTEND FRONTEND e su SFR SFR n30 U common common 20 binSearch divinSortintAVGintFilt mult rletHold tea8 FFTViterbiconvEnautocorr irq dbg F MiUgLTure4:CLGK(aa)teinstRnFFoiltttoggleundiqueby (MaU)LTin(tbF)iCSLlKctraamnbdlReF(dbi)nstFcirltaumnbiquleed intFiltforthesameinputset.Graygatesarenottoggledbyei- Figure2:AsignificantfractionofgatesinanopenMSP430pro- therapplication.Redgatesareuniqueuntoggledgatesforeach cessorarenottoggledwhenanapplicationexecutesonthepro- application. Even though the applications use the same set of cessor.Eachbarrepresentsgatesnottoggledbyanyinputforan instructions and control flow, the gates that they exercise are application; theinterval shows therange of unexercisedgates different. fordifferentinputs. DBG DBG Giventhatthefractionoflogicinaprocessorthatisnotusedby ALU ALU agivenapplicationcanbesubstantial,andmanyarea-andpower- constrainedsystemsonlyexecuteoneorfewapplicationsfortheir WDG MEM_BB EXEC WDG MEM_BB EXEC entirelifetime,itmaybepossibletosignificantlyreduceareaand FRONTEND FRONTEND powerinsuchsystemsbyremovinglogicfromtheprocessorthat SFR SFR cannotbeusedbytheapplication(s).However,sincedifferentap- plicationscanexercisesubstantiallydifferentpartsofaprocessor, common common MULT CLK RF unique MULT CLK RF unique andsimplyprofilingorstaticallyanalyzinganapplicationcannot (a)FFT (b)binSearch guaranteewhichpartsoftheprocessorcanandcannotbeusedbyan Figure3:Gatesnottoggledbytwoapplications–(a)FFTand application,tailoringaprocessortoanapplicationrequiresatech- (b)binSearch–forprofilinginputs.Graygatesarenottoggled niquethatcanidentifyallthelogicinaprocessorthatisguaranteed byeitherapplication.Redgatesareuniqueuntoggledgatesfor toneverbeusedbytheapplicationandremoveunusablelogicin eachapplication. awaythatleavesthefunctionalityoftheprocessorunchangedfor theapplication.Inthenextsection,wedescribeamethodologythat meetstheserequirements.Wecallgeneralpurposeprocessorsthat havebeentailoredtoanindividualapplicationbespokeprocessors, removed,sinceunusedlogicdoesnotcorrespondonlytosoftware- reminiscentofbespokeclothing,inwhichagenericclothingitemis visiblearchitecturalfunctionalities(e.g.,arithmeticunits),butalso tailoredforanindividualperson. tofine-grainedandsoftware-invisiblemicroarchitecturalfunctional- ities(e.g.,pipelineregisters).Forexample,considertwodifferent 3 TAILORINGABESPOKEPROCESSOR applications–FFTandbinSearch.Figure3showsthegatesinthe processorthatwerenotexercisedduringanyprofilingexecutionof Abespokeprocessor,tailoredtoatargetapplication,mustbefunc- theapplications.Sincetheapplicationsusedifferentsubsetsofthe tionally-equivalent to the original processor when executing the functionalitiesprovidedbytheprocessor,thepartsoftheprocessor application.Assuch,thebespokeimplementationofaprocessor thattheydonotexercisearedifferent.However,acloserlookreveals designshouldretainallthegatesfromtheoriginalprocessordesign that while some of the differences correspond to coarse-grained thatmightbeneededtoexecutetheapplication.Anygatethatcould software-visiblefunctionalities(e.g.,themultiplierisusedbyFFT be toggled by the application and propagate its toggle to a state butnotbybinSearch),otherdifferencesarefine-grained,software- elementoroutputportperformsanecessaryfunctionandmustbe invisible,andcannotbedeterminedthroughapplicationanalysis retainedtomaintainfunctionalequivalence.Conversely,anygate (e.g.,differentgate-levelactivityprofilesinmodulesliketheproces- thatcanneverbetoggledbytheapplicationcansafelyberemoved, sorfrontend).Asanotherexample,Figure4showsthebreakdown aslongaseachfanoutlocationforthegateisfedwiththegate’s ofinstructionsusedbyintFiltandscrambled-intFilt.Thetwo constantoutputvaluefortheapplication.Removingconstant(un- applicationsuseexactlythesameinstructions(scrambled-intFilt toggled)gatesforanapplicationcouldresultinsignificantareaand isasyntheticbenchmarkgeneratedbyscramblinginstructionsin powersavingsand,unlikeconventionalenergysavingtechniques, intFilt);however,thediegraphsinFigure4showthatthesetsof willintroducenoperformancedegradation(indeed,nochangeatall unexercisedgatesfortheapplicationsaredifferent.Thisisduetothe inapplicationbehavior). factthateventhesequenceofinstructionsexecutedbyanapplication Figure5showsourprocessfortailoringabespokeprocessortoa caninfluencewhichlogictheapplicationcanexerciseinaproces- targetapplication.Thefirststep–input-independentgateactivity sordependingonthemicroarchitecturaldetails.Suchinteractions analysis–performsatypeofsymbolicsimulation,whereunknown cannotbedeterminedsimplythroughapplicationanalysis. input valuesare representedas Xs,and gate-levelactivityof the ISCA’17,June24-28,2017,Toronto,ON,Canada H.Cherupallietal. Gate-level Aftereachcycleissimulated,thetoggledgatesareremovedfrom Netlist thelistofunexercisablegates.GateswhereanXpropagatedare considered as toggled, since some input assignment could cause thegatestotoggle.IfanXpropagatestothePC,indicatinginput- Gate Ac.vity List of Unused Cu3ng & dependentcontrolflow,oursimulatorbranchestheexecutiontree andsimulatesexecutionforallpossiblebranchpaths,followinga Analysis (Untoggled) Gates S.tching depth-firstorderingofthecontrolflowgraph.Sincethisnaivesimu- lationapproachdoesnotscalewellforcomplexorinfinitecontrol Applica.on Bespoke Processor structureswhichresultinalargenumberofbranchestoexplore, Binary Tailored to weemployaconservativeapproximationthatallowsouranalysis Applica.on toscaleforarbitrarily-complexcontrolstructureswhileconserva- tivelymaintainingcorrectnessinidentifyingexercisablegates.Our Figure 5: Our technique performs input-independent gate ac- approximationworksbytrackingthemostconservativegate-level tivityanalysistodeterminewhichgatesofaprocessorcannot statethathasbeenobservedforeachPC-changinginstruction(e.g., betoggledinanyexecutionoftheapplication.Thesegatesare conditionalbranch).Themostconservativestateistheonewhere thencutfromthedesigntoformacustom,bespokeprocessor themostvariablesareassumedtobeunknown(X).Whenabranch withreducedareaandpower. isre-encounteredwhilesimulatingonacontrolflowpath,simula- tiondownthatpathcanbeterminatedifthesymbolicstatebeing processorischaracterizedforallpossibleexecutionsoftheapplica- simulatedisasubstateofthemostconservativestatepreviouslyob- tion,foranypossibleinputstotheapplication.Thesecondphaseof servedatthebranch(i.e.,thestatesmatchorthemoreconservative ourbespokeprocessordesigntechnique–gatecuttingandstitching statehasXsinalldifferingvariables),sincethestate(oramore –usesgate-levelactivityinformationgatheredduringgateactivity conservativeversion)hasalreadybeenexplored.Ifthesimulated analysis to prune away unnecessary gates and reconnect the cut stateisnotasubstateofthemostconservativeobservedstate,the connectionsbetweengatestomaintainfunctionalequivalencetothe twostatesaremergedtocreateanewconservativesymbolicstateby originaldesignforthetargetapplication. replacingdifferingstatevariableswithXs,andsimulationcontinues fromtheconservativestate.Thisconservativeapproximationtech- 3.1 Input-independentGateActivityAnalysis niqueallowsgateactivityanalysistocompleteinasmallnumberof Thesetofgatesthatanapplicationtogglesduringexecutioncan passesthroughtheapplicationcode,evenforapplicationswithan varydependingonapplicationinputs.Thisisbecauseinputscan exponentially-largeorinfinitenumberofexecutionpaths.2 changethecontrolflowofexecutionthroughthecodeaswellasthe Theresultofinput-independentgateactivityanalysisforanap- datapathsexercisedbytheinstructions.Sinceexhaustiveprofiling plicationisalistofallgatesthatcannotbetoggledinanyexecution forallpossibleinputsisinfeasible,andlimitedprofilingmaynot oftheapplication,alongwiththeirconstantvalues.Sincethelogic identifyallexercisablegatesinaprocessor,wehaveimplemented functionsperformedbythesegatesarenotnecessaryforthecorrect ananalysistechniquebasedonsymbolicsimulation[7],thatisable executionofthebinaryforanyinput,theymaysafelybecutfrom tocharacterizethegate-levelactivityofaprocessorexecutingan thenetlist,aslongastheirconstantoutputvaluesarepreserved.The applicationforallpossibleinputswithasinglegate-levelsimulation. followingsectiondescribeshowunusablegatescanbecutfromthe During this simulation, inputs are represented as unknown logic processorwithoutaffectingthefunctionalityoftheprocessorforthe values(Xs),whicharetreatedasboth1sand0swhenrecording targetapplication. possibletoggledgates. Symbolicsimulationhasbeenappliedincircuitsforlogicand 3.2 CuttingandStitching timingverification,aswellassequentialtestgeneration[8,24,37, Oncegatesthatthetargetapplicationcannottogglehavebeeniden- 42,44].Morerecently,ithasbeenappliedtodetermineapplication- tified,theyarecutfromtheprocessornetlistforthebespokedesign. specificVmin [18].Symbolicsimulationhasalsobeenappliedfor Aftercuttingoutagate,thenetlistmustbestitchedbacktogetherto softwareverification[74].However,tothebestofourknowledge,no generatethefinalnetlistandlaid-outdesignforthebespokeproces- existingtechniquehasappliedsymbolicsimulationtocreatebespoke sor.Figure6showsourmethodforcuttingandstitchingabespoke processorstailoredtotherequirementsofanapplication. processor.First,eachgateonthelistofunusable(untoggled)gatesis Algorithm1describesinput-independentgateactivityanalysis. removedfromthegate-levelnetlist.Afterremovingagate,allfanout Initially,thevaluesofallmemorycellsandgatesaresettoXs.The locationsthatwereconnectedtotheoutputnetoftheremovedgate applicationbinaryisloadedintoprogrammemory,providingthe aretiedtoastaticvoltage(‘1’or‘0’)correspondingtotheconstant valuesthateffectivelyconstrainwhichgatescanbetoggledduring outputvalueofthegateobservedduringsimulation.Sincethelogi- execution.Duringsimulation,oursimulatorsetsallinputstoXs, calstructureofthenetlisthaschanged,thenetlistisre-synthesized whichpropagatethroughthegate-levelnetlistduringsimulation.1 aftercuttingallunusablegatestoallowadditionaloptimizationsthat 1Anydataorsignalsthatcanbewrittenbyexternalevents(e.g.,interruptsignalsor DMAwrites)arealsoconsideredunknownvalues(Xs)duringouranalysis.Firmware 2Somecomplexapplicationsandprocessorsmightstillrequireheuristicsforexplo- componentsofinterrupthandling,e.g.,thejumptableandinterruptservicehandling rationofalargenumberofexecutionpaths[10,32];however,ourapproachisadequate routine,areconsideredtobepartoftheapplicationbinary(i.e.,knownvalues)during forULPsystems,representativeofanincreasingnumberoffutureapplicationswhich symbolicsimulation.Ifaninterruptisenabledduringaninstruction’sexecution,then tendtohavesimpleprocessorsandapplications[36,53].Forexample,completeanalysis thatinstructionisconsideredaspossiblymodifyingthePC. ofourmostcomplexbenchmarktakes3hours. BespokeProcessorsforApplicationswithUltra-lowAreaandPowerConstraints ISCA’17,June24-28,2017,Toronto,ON,Canada List of Unused List of Constant Gates Gate Values Bespoke Gate- Set Unconnected level Netlist Gate-level Cut Place Gate Inputs to Synthesis Netlist Unused Gates & Route Constant Values Bespoke GDSII File Figure6:Toolflowforcuttingandstitching. Algorithm1Input-independentGateActivityAnalysis application,logical1s,0s,andunknownsymbols(Xs)arepropa- 1.ProcedureAnnotateGate-LevelNetlist(app_binary,design_netlist) gatedthroughoutthenetlist.Incycle0,AandBhaveknownvalues 2.Initializeallmemorycellsandallgatesindesign_netlisttoX thatarepropagatedthroughgatesaandb,drivingtmp0andtmp1to 3.Loadapp_binaryintoprogrammemory 4.Propagateresetsignal ‘0’.Thecontrollingvalueatgatecdrivestmp2to‘1’,despiteinput 5.s←Stateatstartofapp_binary Cbeinganunknownvalue(X).InputsAandBarenotchangedby 6.Tableofpreviouslyobservedsymbolicstates,T.insert(s) 7.Stackofun-processedexecutionpoints,U.push(s) thesimulationofthebinaryuntilaftercycle2,whenanXwasprop- 8.mark_all_gates_untoggled(design_netlist) 9.whileU!=0/do agatedtothePC(notshown)thatrequirestwodifferentexecution 10. e←U.pop() pathstobeexplored.Intheleftpath,inputBbecomesXincycle3, 11. whilee.PC_next!=Xand!e.ENDdo 12. e.set_inputs_X()//setallperipheralportinputstoXs causingtmp1tobecomeXaswell.However,sinceinputCisa‘0’, 13. e′←propagate_gate_values(e)//simulatethiscycle tmp2isstilla‘1’.Intherightexecutionpath,inputsAandBboth 14. annotate_gate_activity(design_netlist,e,e′)//unmarkeverygatetoggled(orpossiblytog- gled) haveXsandlogicvaluesthatmaytoggletmp1incycles5-7,butfor 15. ife′.modifies_PCthen eachofthesecycles,inputCisa‘0’,keepingtmp2constantat‘1’. 16. c←T.get_conservative_state(e) 17. ife′(cid:49)cthen Sincetmp2isnevertoggledduringanyofthepossibleexecutionsof 18. T.make_conservative_superstate(c,e′) theapplication,gatecismarkedforcutting,anditsconstantoutput 19. else 20. break value(‘1’)isstoredforstitching.Althoughgatedisnevertoggled 21. endif 22. endif incycles0-2ordowntheleftexecutionpath,itdoestoggleinthe 23. e←e′//advancecyclestate rightexecutionpathandthuscannotbemarkedforcutting.Gatesa 24. endwhile 25. ife.PC_next==Xthen andbalsotoggleandthusarenotmarkedforcutting. 26. c←T.get_conservative_state(e) Oncegateactivityanalysishasgeneratedalistofcuttablegates 27. ife(cid:49)cthen 28. e′←T.make_conservative_superstate(c,e) andtheirconstantvalues,cuttingandstitchingbegins.Sincegatec 3209.. forea′l′l←a∈e.uppodssaitbel_eP_CPC_n_enxetx(at_)vals(e′)do wasmarkedforcutting,itisremovedfromthenetlist,leavingthe 31. U.push(e′′) inputtoitsfanout(d)unconnected.Duringstitching,d’sfloating 32. endfor inputisconnectedtoc’sknownconstantoutputvaluefortheappli- 33. endif 34. endif cation(‘1’).Afterstitching,thegate-levelnetlistisre-synthesized. 35.endwhile 36.forallg∈design_netlistdo Synthesisremovesgatesthatarenotdrivinganyothergates(gatesa 37. ifg.untoggledthen andb),eventhoughtheytoggledduringsymbolicsimulation,since 38. annotate_constant_value(g,s)//recordthegate’sinitial(andfinal)value 39. endif theirworkdoesnotaffectthestateoroutputfunctionofthepro- 40.endfor cessorfortheapplication.Synthesisalsoperformsoptimizations, suchasconstantpropagation,whichreplacesgatedwithaninverter, sincetheconstantcontrollinginputof‘1’totheXORgatemakes reduceareaandpower.Sincesomegateshaveconstantinputsafter itfunctionasaninverter.Finally,placeandrouteproducesafully cuttingandstitching,theycanbereplacedbysimplergates.Also, laid-outbespokedesign. toggledgatesleftwithfloatingoutputsaftercuttingcanberemoved, sincetheiroutputscanneverpropagatetoastateelementoroutput 3.4 Correctness port.Sincecuttingcanreducethedepthoflogicpaths,somepaths mayhaveextratimingslackaftercutting,allowingfaster,higher In this section, we show that the transformations we perform to powercellstobereplacedwithsmaller,lowerpowerversionsofthe createabespokeprocessorimplementationproduceadesignthatis cells.Finally,there-synthesizednetlistisplacedandroutedtopro- functionallyequivalenttotheoriginalprocessordesignforthetarget ducethebespokeprocessorlayout,aswellasafinalgate-levelnetlist application.I.e.,thebespokedesignimplementsthesamefunction withnecessarybuffers,etc.introducedtomeettimingconstraints. andproducesthesameoutputastheoriginaldesignforallpossible executionsoftheapplication. 3.3 IllustrativeExample Theorem:Abespokeprocessorimplementationℬ ofprocessor 𝒜 Thissectionillustrateshowbespokeprocessordesigntailorsaproces- 𝒫tailoredtoanapplication𝒜isfunctionally-equivalenttoprocessor sordesigntoaparticularapplication,asdescribedinSections3.1and3.2. 𝒫withrespecttoapplication𝒜;ℬ𝒜producesthesameoutputas𝒫 Figure7illustratesthebespokedesignprocess.TheleftpartofFig- foranypossibleexecutionof𝒜. ure7showsinput-independentgateactivityanalysisforasimple Proof:Thefirststepincreatingℬ𝒜 –input-independentgate examplecircuit(topright).Duringsymbolicsimulationofthetarget activityanalysis(Section3.1)–identifiesthesubsetℰ ofallgatesin ISCA’17,June24-28,2017,Toronto,ON,Canada H.Cherupallietal. Gate Ac>vity Analysis: Original Circuit: A a tmp0 tmp1 B b c tmp2 C d OUT Cycle A B C D tmp0 tmp1 tmp2 OUT D 0 1 0 X 1 0 0 1 0 A<er CuGng: A a 1 1 0 1 1 0 0 1 0 B b C d OUT 2 1 0 0 1 0 0 1 0 D A<er S>tching: A a B b 1 C d OUT Cycle A B C D tmp0 tmp1 tmp2 OUT Cycle A B C D tmp0 tmp1 tmp2 OUT D 3 1 X 0 1 0 X 1 0 3 1 0 1 0 0 0 1 1 A<er Synthesis: e OUT D 4 1 0 X 1 0 0 1 0 4 1 0 X 1 0 0 1 0 A<er Place & Route: 5 0 X 0 1 1 X 1 0 5 0 X 0 1 1 X 1 0 6 X 0 0 1 X X 1 0 D 7 X 0 0 0 X X 1 1 OUT Figure7:Anexampleofgateactivityanalysisandcuttingandstitching. theprocessorthatcanpossiblybeexercisedby𝒜,forallpossible Gate-level inputs.Theanalysisalsoidentifiestheconstantoutputvaluesforall Netlist gates𝒰 thatcanneverbeexercisedby𝒜.Itfollowsthatℰ∩𝒰=0/ andℰ∪𝒰 =𝒢,where𝒢 isthesetofallgatesin𝒫.Cuttingand stitching(Section3.2)removesallgatesintheset𝒰 andtiestheir GGaGatatete eA A cAc'c'v'vivtityity y LLiLsistist ot o fo fU fU nUnunususeseded d (cid:1) outputnetstotheirknownconstantvalues,suchthatthefunctionality AAnAnanalaylylsysissis is (U(U(nUntntotogogggglgelelded)d )G )G aGatateteses s ofallgatesin𝒰 ismaintainedinℬ .Allgatesinℰ remaininthe Cu<ng & 𝒜 bespokedesign,soallgatesinℰ havethesamefunctionalityand Applica'on S'tching Applica'on pthraotdℬuceitshefusnacmtieonoaultlpyuetqsuinivaℬl𝒜entatnod𝒫𝒫f.oSrin𝒜ceanℰd∪p𝒰rod=uc𝒢e,sitthfeolslaomwes ABpBiBpninilaniacraryary 'y o n Bespoke Processor 𝒜 outputas𝒫forallpossibleinputsto𝒜.■ Tailored to Applica'ons Wealsoverifiedcorrectnessthroughinput-independentgateac- tivityanalysisandinput-basedsimulationsonboththeoriginaland Figure 8: To support multiple programs, our bespoke design bespokeprocessorforeveryapplication(Section5.1),confirming techniqueperformsinput-independentgateactivityanalysison thatoutputsofbothprocessorswerethesameineachcase.3 eachprogram.Cuttingandstitchingisperformedusingthein- tersection of the untoggled gates lists from all supported pro- 3.5 SupportingMultipleApplications grams. Whilebespokeprocessordesigninvolvestailoringageneralpurpose processor into an application-specific processor implementation, isperformedforeachapplication,andcuttingandstitchingisper- bespoke processors, which are descended from general purpose formedfortheintersectionofunusedgatesfortheapplications.The processors,stillretainsomeprogrammability.Inthissection,we intersectionofunusedgatesrepresentsallthegatesthatarenotused describeseveralapproachesforcreatingbespokeprocessorsthat byanyofthetargetapplications.Theresultingbespokeprocessor supportmultipleapplications. containsallthegatesnecessarytorunanyofthetargetapplications. Thefirstandmoststraightforwardcaseiswhenthemultipletarget Whiletheremaybesomeareaandpowercostcomparedtoabespoke applicationsareknownaprioriatdesigntime.Forexample,alicen- designforasingleapplicationduetohavingmoregates,Section5 sororlicensee(seeFigure1)thatwantstoamortizetheexpenseof showsthatabespokeprocessorsupportingmultipleapplicationsstill designingandmanufacturingachipmaychoosetotailorabespoke affordssignificantareaandpowerbenefits. processortosupportmultipleapplications.Inthiscase,thebespoke There may also be cases where it is desirable for a bespoke processor design methodology (Figure 5) is simply expanded to processortosupportanapplicationthatisnotknownatdesigntime. supportmultipletargetapplications,asshowninFigure8.Givena For example, one advantage of using a programmable processor knownsetofbinariesthatneedtobesupported,gateactivityanalysis is the ability to update the target application in the field to roll out a new software version or to fix a software bug. Tailoring a 3Industrialequivalencecheckingtools(e.g.,Formality[63])checkstaticequiva- bespokeprocessortoaspecificapplicationinvariablyreducesits lence(i.e.,theiranalysisisapplication-independent);however,thebespokeandoriginal abilitytosupportin-fieldupdates.However,eveninthecasewhen designsareonlyequivalentforthetargetapplication,notingeneral.Therefore,werely ongate-levelanalysisandinput-basedsimulationsforverification. anapplicationwasunknownatdesigntime,itmaybepossiblefor BespokeProcessorsforApplicationswithUltra-lowAreaandPowerConstraints ISCA’17,June24-28,2017,Toronto,ON,Canada Functionality Implementation Table1:Benchmarks MaxExecution PC Instruction Benchmark Description Length(cycles) 0 sub &a &b 1 XXXX XXXX XXXX XXXX a binSearch Binarysearch 2037 Mem[b] = Mem[b] - Mem[a]; 2 XXXX XXXX XXXX XXXX b div Unsignedintegerdivision 402 if (Mem[b] < 0) goto c 3 jnX XXXXX010 c ors inSort In-placeinsertionsort 4781 s 4 sub &a &b en intAVG Signedintegeraverage 14512 5 xxxxxxxxxxxxxxxx S d intFilt 4-tapsignedFIRfilter 113791 6 xxxxxxxxxxxxxxxx e 7 jnx xxxxx010 dd mult Unsignedmultiplication 210 e b rle Run-lengthencoder 8283 m E tHold Digitalthresholddetector 18511 tea8 TEAencryptionalgorithm 2228 Figure9:FunctionalityandMSP430implementationofTuring- FFT FastFouriertransform 406006 completesubnegpseudo-instruction. C B Viterbi Viterbidecoder 1167298 M convEn Convolutionalencoder 117789 E abespokeprocessortosupporttheapplication.Forinstance,itis E autocorr Autocorrelation 8092 alwayspossibletocheckwhetheranewsoftwareversioncanbe nit irq Interrupttest 210 supportedbyabespokeprocessorbycheckingwhetherthegates U dbg Debuginterface 14166 requiredbythenewsoftwareversionareasubsetofthegatesin addedunittestbenchmarks[27]correspondingtosomefunctionali- thebespokeprocessor.Itmaybepossibletoincreasecoveragefor tiesintheprocessorthatwerenotexercisedbytheotherbenchmarks. in-fieldupdatesbyanticipatingthemandexplicitlydesigningthe Inadditiontoevaluatingabare-metalenvironmentcommoninarea- processor to support them. As an example, this may be used to andpower-constrainedembeddedsystems[17,56,61,66],wealso supportcommonbugfixesbyautomaticallycreatingmutantsofthe evaluatebespokeprocessorsthatsupportourapplicationsrunning originalsoftwarebyinjectingcommonbugs[38],thencreatinga ontheprocessorwithanoperatingsystem(FreeRTOS[55]).Bench- bespokedesignthatsupportsallthemutantversionsoftheprogram. marksarechosentoberepresentativeofemergingultra-low-power Thisapproachmayincreasetheprobabilitythatadebuggedversion applicationdomainssuchaswearables,internetofthings,andsen- ofthesoftwareissupportedbythebespokedesign. sornetworks[73].Also,benchmarkswereselectedtorepresenta Sometimes an in-field update represents a significant change rangeofcomplexityintermsofcontrolflowandexecutionlength. that is beyond the scope of a simple code mutation. To support PoweranalysisisperformedusingSynopsysPrimetime[64].Exper- such updates, a bespoke processor may need to provide support imentswereperformedonaserverhousingtwoIntelXeonE-2640 forasmallnumberofISAfeaturesthatcanbeusedtoimplement processors(8-coreseach,2GHzoperatingfrequency,64GBRAM). arbitrarycomputations-e.g.,aTuring-completeinstruction(orsetof instructions),inadditiontothetargetapplication(s).Consideradding 4.2 Baselines supportforsubneg,anexampleTuring-completeinstruction[26]. For evaluations, we compare bespoke designs against two base- Figure 9 shows the functionality and code implementation of a line processors. The first baseline is the openMSP430 processor, subnegpseudo-instructioncreatedforMSP430.Sincethememory optimizedtominimizeareaandpowerforoperationat100MHz operandaddressesandthebranchtargetareassumedtobeunknown and1V.Forthesecondbaseline,wecreateaggressively-optimized values(Xs),abinarythatcharacterizesthebehaviorofsubnegcan application-specificversionsofopenMSP430foreachbenchmark beco-analyzedwiththetargetapplicationbinarytotailoraTuring- applicationbyrewritingtheRTLtoremovemodulesthatareunused complete bespoke processor that supports the target application bythebenchmark,beforeperformingsynthesis,placement,androut- nativelyandcanhandlearbitraryupdates,possiblywithsomearea, ing.SuchanapproachisrepresentativeofanXtensa-likeapproach power,andperformanceoverhead. [13],wheretheprocessorconfigurationiscustomizedforapartic- ularapplication.Note,however,thatourbaselineissignificantly 4 METHODOLOGY moreaggressivethananXtensa-likeapproach,sinceitrequiresour 4.1 SimulationInfrastructureandBenchmarks input-independentgateactivityanalysistechnique(Section3.1)to identifythemodulesthatcannotbeusedbyanapplication.Such We verify our technique on a silicon-proven processor – open- moduleidentificationmaynotbepossiblethroughstaticanalysisof MSP430[27],anopen-sourceversionofoneofthemostpopular applicationcodealone. ULPprocessors[6,70].Theprocessorissynthesized,placed,and routedinTSMC65GPtechnology(65nm)foranoperatingpointof 5 RESULTS 1Vand100MHzusingSynopsysDesignCompiler[62]andCadence EDISystem[11].Gate-levelsimulationsareperformedbyrunning Inthissection,weevaluatebespokeprocessors.Wefirstconsider fullbenchmarkapplicationsontheplacedandroutedprocessorus- areaandpowerbenefitsoftailoringaprocessortoanapplication, ingacustomgate-levelsimulatorthatefficientlytraversesthecontrol thenevaluatedesignapproachesthatcanbeusedtosupportbespoke flowgraphofanapplicationandcapturesinput-independentactivity processorsthroughouttheproductlife-cycle,includingprocedures profiles(Section3.1).Table1listsourbenchmarkapplications.We forverifyingbespokeprocessors,techniquestodesignbespokepro- showresultsforallbenchmarksfrom[73]andallEEMBCbench- cessorsthatsupportmultipleknownapplications,andstrategiesto marks[22]thatfitintheprogrammemoryoftheprocessor.Wealso allowin-fieldupdatesinbespokeprocessors. ISCA’17,June24-28,2017,Toronto,ON,Canada H.Cherupallietal. 1 alu s0.9 multiplier e0.8 ta watchdog G0.7 elb0.6 cmloecmk tbraecekbone asU0.5 sfr no0.4 register file itc0.3 frontend arF0.2 execution unit glue 0.1 dbg 0 clock module glue Figure10:Theheightofabarrepresentsthefractionofgatesthatcanbetoggledbyabenchmark.Componentswithineachbar representeachmodule’scontributiontothefractionofgatestoggledbythebenchmark. 100 Figure10showsthefractionofgatesintheoriginalprocessor 90 wdofibefagistsughiergailentinensestehhatdoachhtweasctbsiocgauearnaln.drceWbhbpeeremettsooooebggdnsggutellleereevadd’cesbbhtcyyhmoatenhotatederciuahbbclueebhnt’eiscbnohcecnomhnntmcoathrramiktrbhk.uae.Trt4ikthooTencthaafitelnorcgstttooahmbtgeeagspfrlroeaiinnncoetnittnohhltnyees Percentage Savings 23456780000000 10 arelativelysmallfractionofthegatesinthebaselinedesign.At 0 most,57%ofthegatesinthebaselinedesigncanbetoggled,and 11benchmarkstogglelessthanhalfthegates.Eventhoughalarge Gate Savings Area Savings Power Savings fractionofthegatesofthebaselineprocessorcannotbetoggledby Figure11:Reduction(%)ingatecount,area,andpowerfora eachbenchmark,eachbenchmarkcantoggleadifferentsetofgates. bespokedesign,comparedtothebaselineprocessor. Forexample,autocorr1,whichusesthelargestfractionofthegates inthebaselineprocessor,doesnotexercisetheclock_module,while 100 tHold,whichtogglesthesmallestfractionofthebaselinegates,does 90 emxaeSrrkocsimsaeengdmantoeodstuiolnethsteh,ressu.ccHlhoocawks_etmvheeor,dmmuuloeld.tiuplleieurs,aagreeduisfefderbsybysoamppelibceanticohn-. ntage Savings 4567800000 For example, intFilt can never toggle about half of the multi- Perce 2300 pliergatesduetoconstraintsthebinaryplacesonfiltercoefficients, 10 whereasmulttogglesalmostallthegatesinthemultiplier.Other 0 modules,suchasthefrontend,aretoggledbyallapplications,but each application can toggle a different subset of frontend gates. Gate Savings Area Savings Power Savings Whiletheseresultsshowthatabespokeprocessorcanhaveasig- Figure12:Reduction(%)ingatecount,area,andpowerforbe- nificantlylowergatecountthanthegeneralpurposeprocessoritis spokedesigns,comparedtoapplication-specificcoarse-grained derivedfrom,theyalsoconfirmthathardware-softwareco-analysis module-levelbespokedesign. is necessary to identify all the gates that can be eliminated in a bespokedesign.Eliminationofgatesbasedontechniquessuchas thebaselinedesign.Areasavingsareupto92%(dbg),whilepower profilingorstaticanalysiswilleitherfailtoguaranteecorrectness savingsareupto74%(dbg). orwillmissopportunitiestoeliminategatesthatanapplicationcan Figure10showsthatsomemodulescouldbewhollyremoved neveruse. for specific benchmarks (e.g., the multiplier can be removed for Bespokeprocessorshavefewergates,lowerarea,andlowerpower binSearch,sinceitcannotuseanygatesinthemultiplier).Forsuch thantheirgeneralpurposecounterparts.Figure11showsthereduc- modules,itispossibletouseanXtensa-likeapproach[13],enabled tioningates,area,andpoweraffordedbybespokeprocessorstailored by our input-independent gate activity analysis, where modules toeachbenchmark.FFT,whichhasthesmallestgatecountreduction (44%)5,stillreducesareaby47%andpowerby37%,relativeto inwhichnogatesareusablebyanapplicationareremovedfrom the design. Figure 12 shows the benefits of bespoke processors relativetocoarse-grainedbespokedesignsinwhichwholly-unusable 4UnlikeFigure2,whichpresentsresultsfromprofiling,Figure10showsresults moduleshavebeenremovedfromtheprocessor.Notethatcompared frominput-independentgateanalysis. toanXtensa-likeapproach,acoarse-grainedbespokedesigndoes 5NotethatgatecountreductionreportedinFigure11isdifferentthanfractionof notneedanyknowledgeofthemicroarchitecture,astheunusable toggledgatesinFigure10,sincebespokedesignalsoremovessometoggledgatesthat cannotpropagatetheirtogglestostateelementsoroutputports. gatesareidentifiedautomaticallybyhardware-softwareco-analysis. BespokeProcessorsforApplicationswithUltra-lowAreaandPowerConstraints ISCA’17,June24-28,2017,Toronto,ON,Canada Table2:Benefitsofexploitingtimingslackcreatedbycutting, Table3:Verificationruntimeandcoverage. stitching,andre-synthesis. Sim.Time(s) Input-BasedCoverage X- per Num Line Br. Br. Gate Total Benchmark Addl.Power Based Input Paths % % Dir.% % Benchmark Timing Vmin Savingsfrom Power binSearch 23 3 83 100 100 93 87 Slack(%) (V) Savings Slack(%) div 7 22 1 100 - - 93 (%) inSort 25 156 718 100 100 100 93 binSearch 24.30 0.81 36.86 72.1 intAVG 116 79 1 100 100 100 82 intFilt 1365 625 1 100 100 100 28 div 24.51 0.82 34.53 69.9 mult 20 20 1 100 - - 64 inSort 22.24 0.83 33.25 67.6 rle 20 32 1 74 100 75 92 intAVG 23.37 0.83 33.35 67.7 tHold 7 27 6239 100 100 100 88 intFilt 23.23 0.84 31.31 58.5 tea8 21 32 1 100 100 100 93 FFT 10498 5669 1 100 100 100 47 mult 20.45 0.91 20.33 59.3 Viterbi 433 3637 8 100 100 100 86 rle 22.10 0.83 33.31 66.8 convEn 1401 168 1 100 100 100 89 tHold 24.07 0.81 36.90 73.5 autocorr 90 13 1 38 14 14 71 tea8 23.55 0.83 33.23 65.7 1 FFT 21.74 0.90 20.13 50.0 Viterbi 23.48 0.83 33.18 64.9 es0.8 u convEn 23.96 0.83 33.03 63.9 al V0.6 autocorr1 18.48 0.91 18.31 50.3 d e irq 17.91 0.92 16.24 57.7 aliz0.4 dbg 45.70 0.60 67.73 91.5 m Nor0.2 PAorewaer Gate Count The results show that the fine-grained gate-level bespoke design 00 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 canprovideupto75%powerreduction(22%minimum,35%on Number of Supported Benchmarks average)overcoarse-grainedmodule-levelbespokedesign. Figure13:Normalizedgatecount,area,andpowerrangesfor Additionalpowersavingsmaybepossiblewhencutting,stitch- allpossiblebespokeprocessorsupportingmultipleapplications. ing,andre-synthesisremovesgatesfromcriticalpaths,exposing additional timing slack that can be exploited for energy savings. Thesecondmethodweusedtoverifybespokeprocessorsinvolves Table2showstimingslackexposedduringbespokeprocessortai- performing input-based simulations on the original and bespoke loringforeachbenchmark.Exposedtimingslackcanbeusedto processorstoconfirmthattheyproducethesameoutputs.Outputs reducetheoperatingvoltageoftheprocessorwithoutreducingthe producedduringsimulationarerecorded,andtheoutputsanddata frequency.6Table2alsoshowstheminimumsafeoperatingvoltage memoryfromtheoriginalandbespokeprocessorsarecomparedfor foreachbespokedesign(assumingworst-casePVTvariations),the equivalence.Sinceitisinfeasibletosimulatetheapplicationwithall additionalpowersavingsaffordedbyexploitingtimingslackinbe- possibleinputs,weusedKLEELLVMExecutionEngine(KLEE)[9] spokedesigns,andthetotalpowersavingsachievedwithrespectto togenerateinputsthatexerciseasmanycontrolpathsthroughthe thebaselinedesignfromeliminatingunusablelogicandexploiting applicationaspossible.Table3liststhenumberofinputssimulated exposedtimingslackforvoltagereduction. foreachbenchmarkandthecorrespondingcoverageofthecode.For mostbenchmarks,alllinesandbranchdirectionsarecovered.Where 5.1 Verification coverageisnot100%,theportionofthecodethatwasnotcovered We followed a two-pronged approach to verify our bespoke pro- wasnotexecutable.Thetablealsoreportsthefractionofgatesin cessordesigns.First,weperformedinput-independentgateactivity the bespoke designs that were exercised during the input-based analysisonthebespokeprocessordesignandcomparedtheproces- simulations.Weseethatamajorityofthegates(78%,onaverage) sorstatebetweentheoriginalandbespokeprocessorsineachcycle. weretoggledduringthesimulations,indicatingthatthemajorityof Attheendofactivityanalysis,wecomparedthecontentsofthedata gatesinabespokedesignarenecessary.8 Table3alsoshowsthe memorywiththatoftheoriginaldesigntoensurethatbothdesigns aggregateruntimeoftheinput-basedsimulations,providingsome producedthesameoutputs.Therewerenodiscrepanciesforany quantificationofverificationeffort.9 ofthebespokeprocessors,indicatingthatthebespokeprocessors 5.2 SupportingMultiplePrograms traversethesamestatesastheoriginalprocessorandproducethe sameoutputsforeachbenchmark.Whilethisverificationefficiently7 Bespokeprocessorsareabletosupportmultipleprogramsbyin- checksforfunctionalcorrectnessconsideringallpossibleinputs,it cludingtheunionofgatesneededtosupportalloftheprograms. doesnotexplicitlyguaranteethatthedatamemorycontentsofa Figure13showsgatecount,power,andareaforbespokeprocessors bespokeprocessorarecorrectforanyspecificinputs.Foranexplicit proofofthecorrectnessofbespokeprocessors,seeSection3.4. 8Notethatgatecoverageisnotexpectedtobe100%,sinceKLEEaimstocover linesofcodeandexecutionpaths,notgates.Inparticular,benchmarksthatusethe 6Exposedtimingslackcouldalsobeusedtoincreaseoperatingfrequency(perfor- multiplier(intFilt,mult,FFT,andautocorr)seelowgatecoveragesincethemultiplieris mance)ofabespokedesign.Onaverage,frequencycanbeincreasedby13%inthe asignificantfractionofbespokedesignsforsuchbenchmarksandKLEEdoesnottryto bespokedesigns. forminputstothemultipliertoincreasedatapathcoverage. 7Table3showsthattheruntimeofX-basedsimulationsiswithinanorderof 9Notethatsimulationsformultipleinputscaneasilybeparallelizedtoreducethe magnitudeofasingleinput-basedsimulation. verificationtimesignificantly. ISCA’17,June24-28,2017,Toronto,ON,Canada H.Cherupallietal. Table4:Miluproducesthreetypesofmutants.TypeI:Logical Table 5: Percentage of mutants (in-field updates) of different conditionaloperatormutants.TypeII:Computationoperator typesthataresupportedbythebespokedesignforthebasesoft- mutants.TypeIII:loopconditionaloperatormutants. wareimplementation.“-”denotesthatagivenbenchmarkdid nothaveanymutantsofthattype. Benchmark TypeI TypeII TypeIII Total binSearch 0 0 15 15 TypeI TypeII TypeIII Total Benchmark inSort 8 0 15 23 % % % % rle 0 20 25 45 binSearch - - 73 73 tea8 48 24 10 82 inSort 25 - 27 26 Viterbi 24 24 35 83 rle - 100 84 91 autocorr 12 0 10 22 tea8 58 75 100 68 Viterbi 92 83 80 84 autocorr 50 - 40 45 tailoredtoN programs,normalizedtothebaselineprocessor.For eachvalueofN,thebarsshowtherangesofthesemetricsacross 1 bespokeprocessorstailoredtoallcombinationsofNprograms.For manycombinationsofprograms,evenforuptotenprograms,only es0.8 u 60%orlessofthegatesareneeded.Infact,despitesupportingten Val0.6 programs,theareaandpowerofabespokeprocessorcanbereduced ed byupto41%and20%,respectively.However,supportingmulti- aliz0.4 m pleprogramscanlimittheextentofgatecuttingandtheresulting or N0.2 areaandpowerbenefitswhentheapplicationsexercisesignificantly differentportionsoftheprocessor.Forexample,thetwo-program 0 bespokeprocessorwiththelargestgatecountisonetailoredtodbg binSearch inSort rle tea8 Viterbi autocorr1 andirq.Eachapplicationusescomponentsoftheprocessorthatare Gate Count Area Power Overhead notexercisedbytheotherprogram;dbgexercisesthedebugmodule, Figure14:Normalizedgatecount,area,andpowervsthebase- whileirqexercisestheinterrupthandlinglogic.Theresultinggate linedesignfordesignssupportingallmutants(in-fieldupdates). reductionof18%stillproducesareaandpowerbenefitsof26%and 19%,respectively.Althoughsupportingmultipleprogramsreduces bespokeprocessortailoredtotheoriginal“buggy”application.We gatecount,area,andpowerreductionbenefits,theareaandpower seethatbetween25%and100%ofvariousmutantsarecovered,and willneverincreasewithrespecttothebaselinedesign.Intheworst 70%ofallmutantsarecovered.Thisshowsthatabespokeprocessor case,thebaselineprocessorcanrunanycombinationofprogram willmaintainsomeoftheoriginalgeneralpurposeprocessor’sability binaries. tosupportin-fieldupdates.Ifahighercoverageofpossiblebugs isdesired,theautomatically-generatedmutantscanbeconsidered 5.3 SupportingIn-fieldUpdates asindependentprogramswhiletailoringthebespokeprocessorfor Weconsidertwoapproachestodesigningbespokeprocessorsthat theapplication(seeSection3.5).Figure14showstheincreasein canbeupdatedinthefield.First,weevaluateamethodthatallows gatecount,area,andpowerrequiredtotailorabespokeprocessorto abespokeprocessortohandlecommon,minorprogrammingbugs. thesixbenchmarkswiththemostmutantsbyincludingallpossible Second,weevaluateamethodthatallowsabespokeprocessorto mutants(i.e.,bugfixes)generatedbyMiluduringbespokedesign. handleinfrequent,arbitrarysoftwareupdates. Providingsupportforsimplein-fieldupdatesincursagatecount In-fieldupdatesmayoftenbedeployedtofixminorcorrectness overhead of between 1% and 40%. Despite this increase in gate bugs(e.g.,off-by-oneerrors,etc.)[33].Toemulatein-fieldupdates count,totalareabenefitsforthebespokeprocessorsarebetween tofixbugs,weusetheMilumutationtestingtool[38]togenerate 23% and 66%, while total power benefits are between 13% and “updates”correspondingtobugfixes.Table4liststhebreakdown 53%.Therefore,simplein-fieldupdatescanbesupportedwhilestill ofmutantsbytypegeneratedbyMiluforthesixbenchmarkswith achievingsubstantialareaandpowerbenefits. themostmutants.Ifabenchmarkhaszeromutantsforaparticular A bespoke processor tailored to a specific application can be type,nomutationsitesofthattypewerefoundinthatbenchmark designedtosupportarbitrarysoftwareupdatesbydesigningitto by Milu. Type I mutants are conditional operator mutants (e.g., supportaTuring-completeinstruction(e.g.,subneg)orsetofinstruc- A||B→A&&B).TypeIImutantsarecomputationoperatormutants tions,inadditiontootherprogramsitsupports(Section3.5).Forour (e.g.,A+B→A×B).TypeIIImutantsareloopconditionaloperator single-applicationbespokeprocessors,theaverageareaandpower mutants(e.g.,i<32→i(cid:44)32). overheadstosupportsubnegare8%and10%,respectively.Average Table5liststhepercentageofmutants(i.e.,in-fieldupdatestofix areaandpowerbenefitsforsubneg-enhancedbespokeprocessors bugs)thataresupportedbytheoriginalbespokedesign(generated are56%and43%,respectively. fora“buggy”application).Manyminorbugfixescanbecoveredby Notethataninstructioninabespokeprocessor’stargetapplica- abespokeprocessordesignedfortheoriginalapplicationwithoutany tionisnotguaranteedtobesupportedinadifferentapplication(e.g., modification.I.e.,themutantsrepresentingmanyin-fieldupdates anupdate),sincetheprocessoreliminatesgatesthatarenotneeded only use a subset of the gates in the original bespoke processor. tosupportthepossibleinstructionsequencesinthetargetapplica- Thismeansthatthesemutantswillexecutecorrectlyontheoriginal tion’sexecutiontree;adifferentsequenceofthesameinstructions
Description: