Scheduling and Optimal Register Placement for Synchronous Circuits Derived Using Software Pipelining Techniques

NOUREDDINE CHABINI
Royal Military College of Canada
EL MOSTAPHA ABOULHAMID
Université de Montréal
ISMAÏL CHABINI
Massachusetts Institute of Technology
and
YVON SAVARIA
École Polytechnique de Montréal

Data dependency constraints constitute a lower bound P on the minimal clock period of single-phase clocked sequential circuits. In contrast to methods based on basic retiming, clocked sequential circuits with clock period P can always be obtained using software pipelining techniques. Such circuits can be derived by any method that can be framed in the following four-step process: Step 1, determine P; Step 2, compute a valid periodic schedule of the computational elements; Step 3, place registers back into the circuit; Step 4, assign the clock signals to control registers.

Methods with polynomial run-time to implement this process are proposed in the literature. They implement these steps sequentially, starting with Step 1. These methods do not place registers optimally, which leads to an unnecessary number of registers. In this article, we address the problem of how to simultaneously implement Steps 2 and 3 in order to minimize the total number of registers. We conjecture that the problem is NP-hard in its general form. We formulate the problem for the first time in the literature, and devise a Mixed Integer Linear Program (MILP) to solve it. From this MILP, we derive a linear program to determine approximate

This research benefited from financial support from Le Fonds Nature et Technologies (Quebec, Canada), NSF (USA), NSERC (Canada).
Authors' addresses: N.
Chabini, Department of Electrical and Computer Engineering, Royal Military College of Canada, PO Box 17000, Station Forces, Kingston, On, Canada, K7K 7B4; email: [email protected]; E. M. Aboulhamid, DIRO, Université de Montréal, C.P. 6128, Suc. Centre-ville, Montréal, Qc, Canada, H3C 3J7; email: [email protected]; I. Chabini, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Room 1-263, Cambridge, MA, USA, 02139; email: [email protected]; Y. Savaria, Department of Electrical Engineering, École Polytechnique de Montréal, C.P. 6079, Suc. Centre-ville, Montréal, Qc, Canada, H3C 3A7; email: [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, [email protected].
© 2005 ACM 1084-4309/05/0400-0187 $5.00
ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005, Pages 187–204.

solutions to the problem for large general circuits. We show that the proposed approach can handle nonzero clock skew. Experimental results confirm the effectiveness of the approach and show that significant reductions of the number of registers can be obtained although register sharing is not used. When the schedule is given, the proposed approach provides solutions to the problem of how to place the minimal number of registers in Step 3.

Categories and Subject Descriptors: B. [Hardware]
General Terms: Algorithms, Performance
Additional Key Words and Phrases: Retiming, software pipelining, multiphase, clock, sequential circuit

1.
INTRODUCTION

Data dependency constraints constitute a lower bound on the clock period of synchronous sequential circuits. This lower bound, denoted here P, can be determined by solving an instance of the well-known Cost-to-Time Ratio Cycle Problem [Dasdan and Gupta 1998; Gerez et al. 1992; Lawler 1976] on the graph modeling the circuit.

Basic retiming has been proposed as an optimization technique for synchronous circuits [Leiserson and Saxe 1991]. This technique changes the location of registers in the circuit in order to achieve one of the following goals: i) minimizing the clock period, ii) minimizing the number of registers, or iii) minimizing the number of registers for a target clock period.

Basic retiming [Leiserson and Saxe 1991] may fail in transforming a given synchronous single-phase sequential circuit to another functionally equivalent clocked circuit with a clock period of value P. Indeed, as presented in Boyer et al. [2001a] and Lockyear and Ebeling [1994], basic retiming can transform the correlator circuit to another one with a minimal clock period of value 13, but a functionally equivalent circuit with a clock period of value P = 10 can be obtained as provided in those papers. Figure 1 presents two other circuits to show 1) that basic retiming can fail to produce circuits with a clock period of value P that is due to data dependency constraints only, and 2) how much the clock period can be reduced by using methods based on software pipelining techniques, like the method in Boyer et al. [2001a], instead of methods based on basic retiming. For the circuits in Figure 1, basic retiming gives a minimal clock period of value 60 for circuit #1, and 45x for circuit #2. Functionally equivalent circuits with clock period P = 45 can be obtained using, for instance, the method in Boyer et al. [2001a]. This reduces the clock period by 25% for circuit #1 and by (x − 1)/x for circuit #2 (for instance, when x = 2, this reduction is 50%).
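The two reductions quoted above follow directly from the stated periods. In the short check below, the symbols $P_{\mathrm{ret}}$ (the period reached by basic retiming) and $\rho$ (the relative reduction) are introduced only for this illustration and are not notation from the article:

```latex
\[
\rho = \frac{P_{\mathrm{ret}} - P}{P_{\mathrm{ret}}}, \qquad
\rho_{\#1} = \frac{60 - 45}{60} = 25\%, \qquad
\rho_{\#2} = \frac{45x - 45}{45x} = \frac{x - 1}{x}
\quad \left(\text{e.g., } x = 2 \Rightarrow \rho_{\#2} = 50\%\right).
\]
```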
When basic retiming fails to obtain a circuit with clock period of value P, then a functionally equivalent circuit with clock period of value P can be obtained with the penalty of increasing the number of clocks (phases), and such a circuit is then called a multiphase clocked sequential circuit. For instance, the correlator produced in Boyer et al. [2001a] and Lockyear and Ebeling [1994] is a two-phase circuit. Details on multiphase clocked sequential circuits can be found, for instance, in Ishii et al. [1997] and Lockyear and Ebeling [1994].

Fig. 1. Examples to show that basic retiming can fail in minimizing the clock period.

Methods to transform single-phase clocked sequential circuits to functionally equivalent ones with the clock period as close as possible to P are proposed in Legl et al. [1997], Ishii et al. [1997], Lockyear and Ebeling [1994], and Maheshwari and Sapatnekar [1999]. In Legl et al. [1997], basic retiming has been extended to deal with circuits whose registers are not enabled at the same time. The idea is that registers controlled by the same phase can be moved across computational elements.

It is known that with level-sensitive storage elements (latches), clocked circuits can be made faster and smaller [Ishii et al. 1997; Lockyear and Ebeling 1994] than with edge-triggered flip-flops. In Ishii et al. [1997], methods to minimize the clock period of multiphase level-sensitive clocked circuits are provided. Also, procedures to derive these kinds of circuits from edge-triggered ones are presented. In Lockyear and Ebeling [1994] and Maheshwari and Sapatnekar [1999], retiming with multiphase clocks is proposed. For methods in [Lockyear et al. 1994; Maheshwari et al. 1999], the phases are fixed before retiming, which can give a clock period of value P only if good phases are chosen.

Clock skew is defined as the maximum difference of the delays from the clock source to the clock-pins on storage elements [Tsay 1993]. Clock skew can cause malfunction of clocked circuits.
Methods to ensure zero-skew in the design are reported in Li and Jabori [1992] and Tsay [1993]. However, skews are sometimes used as a tool to improve the performance of clocked circuits [Fishburn 1990; Deokar and Sapatnekar 1995; Sapatnekar and Deokar 1996]. In Fishburn [1990], two linear programs are presented to solve the problem of finding skews to minimize the clock period and the problem of maximizing skews for a target clock period. The equivalence between clock skew and retiming was first reported in Fishburn [1990], and a formal proof is provided in Deokar and Sapatnekar [1995]. For the work in Deokar and Sapatnekar [1995], a clock skew optimization problem is first solved with the objective of minimizing the clock period. Then, the obtained skews are transformed to retiming by moving some flip-flops across combinational blocks. For single-phase clocked circuits, a mixed integer linear program to combine retiming and clock skew is devised in Friedman et al. [1999] and Liu et al. [2002]. As presented in Boyer et al. [2001b], the tolerance to the clock skew for clocked circuits can be improved by using latches instead of flip-flops. The paper shows that, for multiphase clocked circuits operating at the minimal clock period P, the maximum tolerance to clock skew is $(P - D_{max})/4$, where $D_{max}$ is the propagation delay of the slowest computational element in the circuit.

Software pipelining is a powerful technique for increasing the instruction-level parallelism for parallel processors. This method overlaps the execution of successive iterations in order to reduce the difference of their start execution times. For an introduction to software pipelining and to its related techniques, the reader is referred to Allan et al. [1995].
To the best of our knowledge, no method based on basic retiming can always transform single-phase clocked sequential circuits to functionally equivalent clocked sequential circuits with a minimal clock period P that is due to data dependency constraints only. Methods based on software pipelining techniques to obtain the latter circuits have been recently proposed [Boyer et al. 2001a, b; Chabini et al. 2001]. Neither the number of phases nor the kind of memory elements to be used are fixed in advance in Boyer et al. [2001a] and Chabini et al. [2001], compared to some published approaches like Lockyear and Ebeling [1994] and Maheshwari and Sapatnekar [1999] that we reviewed previously. As mentioned, the methods in Lockyear and Ebeling [1994] and Maheshwari and Sapatnekar [1999] can produce circuits with clock period equal to P only if good phases are chosen, while the methods in Boyer et al. [2001a] and Chabini et al. [2001] are always able to obtain circuits that operate at P. The latter methods can be framed in the following process.

Step 1: Determine the minimal value P of the clock period due to data dependency constraints only.
Step 2: Compute a valid periodic schedule of the computational elements with period P.
Step 3: Place registers in the circuit according to the computed schedule.
Step 4: Determine the phases to control registers.

The method in Boyer et al. [2001a] implements this process sequentially, starting from Step 1. For Step 2, only As Soon As Possible (ASAP) or As Late As Possible (ALAP) schedules are computed. As presented in Chabini et al. [2001], using ASAP or ALAP schedules can lead to circuits with an unnecessary number of registers or phases. For Step 3, this method uses a heuristic, which again can lead to an unnecessary number of registers or phases.

The paper Chabini et al. [2001] has provided two methods with polynomial run-time to determine schedules for reducing register requirements and the number of required phases. Compared to Boyer et al.
[2001a], these methods proved very efficient in reducing the number of registers and the number of required phases. Nevertheless, the problem of how to efficiently place registers in the circuit is not addressed in Chabini et al. [2001].

For software pipelining in the context of loops, methods for scheduling under register constraints to generate the code for parallel processors have been examined in the literature. However, it was assumed that processors are single-phase clocked. Circuits derived from the previously described process can be multiphase. Hence, these methods cannot be used to simultaneously implement Steps 2 and 3 of the process.

In this article, we address the problem of how to simultaneously implement Steps 2 and 3 of the process in order to minimize the number of registers. We propose the first formulation in the literature for this problem, from which we derive a mixed integer linear program (MILP). We conjecture that the problem is NP-hard in its general form. Linear Programs (LPs) are solvable in polynomial run-time. From this MILP, we derive an LP to determine approximate solutions to the problem for large general circuits. Furthermore, we present how the proposed approach can handle nonzero clock skew. To test the effectiveness of the approach in minimizing the number of registers, we apply the MILP and the LP on well-known benchmarks and show the superiority of that approach over the method in Boyer et al. [2001a]. The assessment of the approach is also done in the case of nonzero clock skew, and the obtained results show the superiority of the approach over the method in Boyer et al. [2001b]. We compare our experimental results to Boyer et al. [2001a, b] since, to the best of our knowledge, there are no other papers at this moment that are close to the issue we address here.

The next section gives some notations and definitions used in this article.
Section 3 introduces what register placement means, briefly reviews the register placement step in the method of Boyer et al. [2001a], presents how the phases to control registers are computed, and shows that the algorithm proposed in Boyer et al. [2001a] to place registers is not exact. The problem we address and its formulation are presented in Section 4. A linear program to determine approximate solutions for this problem is given in Section 5. Section 6 presents how the proposed approach can handle nonzero clock skew and gives a theoretical result. Experimental results are provided in Section 7, and Section 8 concludes the article.

2. PRELIMINARIES

2.1 Design Representation

The input to our approach in this article is a single-phase synchronous sequential circuit such as the one in Figure 2(a). As in Boyer et al. [2001a], Maheshwari and Sapatnekar [1997], Shenoy and Rudell [1994], and Leiserson and Saxe [1991], we model the input circuit by a directed cyclic graph G = (V, E, d, w), where V is the set of computational elements in the circuit, and E is the set of edges, which represent interconnections between vertices. Let N be the set of nonnegative integers. Each vertex v ∈ V has a propagation delay d(v) ∈ N which is assumed to be fixed in this article. Each edge $e_{u,v}$, from u to v, in E is weighted with a register count $w(e_{u,v})$ ∈ N, representing the number of registers on the wire between u and v.

As in Boyer et al. [2001a], Maheshwari and Sapatnekar [1997], Shenoy and Rudell [1994], and Leiserson and Saxe [1991], propagation delays of registers and wires are assumed to be equal to zero. We believe that this delay model is acceptable at the high-level abstraction of the design, but not when computational elements are, for instance, transistors. Even though we assume this delay model, the problem we address in the article is still complex.

Fig. 2. Sample circuit and its directed cyclic graph model.
Figures 2(a) and 2(b) present an example of a single-phase synchronous sequential circuit and its directed cyclic graph model, respectively. In Figure 2(a), large rectangles represent computational elements and small rectangles represent registers. Wires are oriented to show the propagation direction of the signals. The propagation delay of each computational element of this circuit is specified as a label on the left of each large rectangle. This example will be used throughout this article, and will serve to illustrate the initial specification of the problem to be solved. Without any optimization, the minimum clock period of the circuit in Figure 2 is 80, which is equal to $d(v_5) + d(v_1) + d(v_3)$.

2.2 Periodic Schedules

We define a schedule s [Bennour 1996; Boyer et al. 2001a] as a function s: N × V → Q, where $s_n(v) \equiv s(n, v)$ denotes the schedule time of the nth iteration of operation v. In multiphase flip-flop-based circuits, the schedule time of operation v is the start execution time of v. A schedule s is called periodic with period P, if:

$$\forall n \in N, \forall v \in V: \; s_{n+1}(v) = s_n(v) + P. \qquad (1)$$

When there is no resource constraint, a schedule s is said to be valid if and only if the operations terminate before their results are needed. In this case, we say that data dependencies are satisfied, which is equivalent to the following mathematical inequality:

$$\forall n \in N, \forall e_{u,v} \in E: \; s_{n+w(e_{u,v})}(v) \ge s_n(u) + d(u). \qquad (2)$$

2.3 Maximum Throughput of Synchronous Sequential Circuits

Let C be the set of directed cycles in the directed cyclic graph modeling the circuit. Based on data dependency constraints only, the maximum throughput, denoted T, is given by the following expression [Bennour 1996; Bennour and Aboulhamid 1995]:

$$T = \min_{c \in C} \left( \frac{\sum_{e_{u,v} \in c} w(e_{u,v})}{\sum_{\forall v \in V \text{ and } e_{u,v} \in c} d(u)} \right) \qquad (3)$$

Determining the maximum throughput is a Minimal Cost-to-Time Ratio Cycle Problem [Gerez et al. 1992; Lawler 1976].
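As a concrete reading of expression (3), the following minimal sketch computes P = 1/T by enumerating the simple cycles of a small graph and taking the largest ratio of total delay to total register count. The three-vertex graph, its delays and register counts, and the function names are hypothetical and chosen only for illustration (they are not the circuit of Figure 2). Brute-force enumeration is exponential in general, so this is a way to check the definition on toy inputs, not one of the polynomial methods cited in the text.

```python
from fractions import Fraction


def simple_cycles(vertices, edges):
    """Yield every simple directed cycle once, as a list of (u, v) edges."""
    order = {v: k for k, v in enumerate(vertices)}
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)

    def extend(start, path):
        for v in succ[path[-1]]:
            if v == start:
                # Close the cycle back to its smallest-ranked vertex.
                yield list(zip(path, path[1:] + [start]))
            elif order[v] > order[start] and v not in path:
                yield from extend(start, path + [v])

    for start in vertices:
        yield from extend(start, [start])


def min_period(vertices, d, w):
    """Return P = 1/T: the maximum over cycles of (sum of d) / (sum of w)."""
    best = None
    for cycle in simple_cycles(vertices, w):
        registers = sum(w[e] for e in cycle)
        delay = sum(d[u] for u, _ in cycle)
        if registers == 0:
            raise ValueError("cycle without registers (combinational loop)")
        ratio = Fraction(delay, registers)
        if best is None or ratio > best:
            best = ratio
    return best


# Hypothetical toy graph: delays d(v) and register counts w(e_{u,v}).
V = ["a", "b", "c"]
d = {"a": 10, "b": 20, "c": 30}
w = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1, ("b", "a"): 1}

P_min = min_period(V, d, w)
print(P_min)  # cycle a->b->c->a gives 60/2, cycle a->b->a gives 30/2
```

Using `Fraction` keeps the ratio exact, which matters because P need not be an integer even when all delays are.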
This problem can be solved in the general case with a run-time in $O(|V||E| \log(|V| d_{max}))$ [Dasdan and Gupta 1998; Lawler 1976], where $d_{max} = \max_{v \in V}(d(v))$. A possible method to solve this problem is to iteratively apply Bellman-Ford's algorithm [Cormen et al. 1990] for longest paths on the graph $G_P = (V, E, d, w_P)$ derived from G by letting:

$$w_P(e_{u,v}) = d(u) - P \cdot w(e_{u,v}), \qquad (4)$$

where $e_{u,v} \in E$ and P = 1/T. A binary search may be used to find the minimal value of P for which there is no positive cycle in $G_P$ [Bennour 1996; Bennour and Aboulhamid 1995]. Without loss of generality, for circuits that do not attempt to perform wave pipelining, we assume that P is greater than or equal to the propagation delay of each computational element in the circuit.

By applying expression (3) on the example circuit in Figure 2, the value of P is 60. This value corresponds to the cycle defined by vertices $v_1$, $v_2$, $v_4$, and $v_5$. Notice that applying basic retiming for minimal clock period on that circuit leads to a larger value of P. Indeed, it leads to P = 70.

2.4 Periodic Schedule for a Given Period

From equation (1) and inequality (2), we have that:

$$\forall e_{u,v} \in E, \; s_0(v) - s_0(u) \ge d(u) - P \cdot w(e_{u,v}). \qquad (5)$$

In the case of periodic schedules, determining a valid schedule of all the instances of each vertex v in V is equivalent to determining $s_0(v)$ for each v in V, which consists of finding a solution to the system of inequalities described by (5). To solve this system, the graph $G_P$ described in the previous section may be used. Note that ASAP and ALAP schedules are possible solutions to this system. To find an ASAP schedule, Bellman-Ford's algorithm [Cormen et al. 1990] for longest paths, from a chosen vertex $v_x$ to the other vertices, may be applied on the graph $G_P$. Finding an ALAP schedule may be done as follows. In Step 1, a graph $G'$ has to be derived from $G_P$ by inverting the direction of each edge in $G_P$.
In Step 2, Bellman-Ford's algorithm for longest paths, from the vertex $v_x$ to the other vertices, has to be applied on the graph $G'$, where the weights of its edges are defined by Equation (4). Finally, in Step 3, the ALAP schedule is obtained by multiplying each result in Step 2 by −1. Relative to $v_x = v_1$, the ASAP schedules of vertices $v_1$, $v_2$, $v_3$, $v_4$, $v_5$, and $v_6$ of the circuit in Figure 2 are 0, −30, 30, −10, −40, and −30, respectively. Their ALAP schedules are 0, −30, 40, −10, −40, and 10, respectively.

Fig. 3. Schedule graph.

2.5 Schedule Graph

A periodic schedule, with period P, is expressed by a schedule graph $G_s = (V, E, d, T_s, P)$ [Boyer et al. 2001a]. Here V, E, and d have the same definition given for the case of the graph G previously defined. $T_s: E \to Q$ is a weight function which associates to each edge $e_{u,v}$ in E the time distance between the schedule times of u and v. Mathematically, $T_s(e_{u,v})$ is defined as follows:

$$\forall e_{u,v} \in E, \; T_s(e_{u,v}) = s_{w(e_{u,v})}(v) - s_0(u). \qquad (6)$$

Because s is periodic with period P, Equation (6) may be rewritten as follows:

$$\forall e_{u,v} \in E, \; T_s(e_{u,v}) = s_0(v) - s_0(u) + P \cdot w(e_{u,v}). \qquad (7)$$

The graph $G_s$ is consistent if and only if for each edge $e_{u,v}$ in E, $T_s(e_{u,v}) \ge d(u)$. This is derived from Equation (2). Figure 3 shows a consistent schedule graph, where edges are labeled with $T_s$ values for the circuit in Figure 2, using the ASAP schedule determined in Section 2.4. The weight of each arc in the schedule graph is expressed in units of time.

3. REGISTER PLACEMENT AND ASSIGNMENT OF PHASES

For circuits optimized using basic retiming [Leiserson and Saxe 1991], registers are placed in the optimized circuit using the following formula:

$$\forall e_{u,v} \in E, \; w_r(e_{u,v}) = r(v) - r(u) + w(e_{u,v}),$$

where $w_r(e_{u,v})$ and $w(e_{u,v})$ are, respectively, the number of registers on the arc $e_{u,v}$ after and before retiming. r(u) is the value assigned by basic retiming to each computational element u in the circuit.
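The retiming formula above can be applied edge by edge. The minimal sketch below does so for a hypothetical three-vertex cycle with hypothetical retiming values r; none of these numbers come from the article. The nonnegativity check reflects the usual requirement in basic retiming that a retimed edge cannot carry a negative number of registers.

```python
def retimed_weights(w, r):
    """Apply w_r(e_{u,v}) = r(v) - r(u) + w(e_{u,v}) to every edge."""
    w_r = {}
    for (u, v), registers in w.items():
        new_count = r[v] - r[u] + registers
        if new_count < 0:
            raise ValueError(f"illegal retiming: edge {(u, v)} gets {new_count}")
        w_r[(u, v)] = new_count
    return w_r


# Hypothetical register counts before retiming and retiming values r(u).
w = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1}
r = {"a": 0, "b": -1, "c": 0}

w_r = retimed_weights(w, r)
print(w_r)  # the register on (a, b) moves across b onto (b, c)
```

Because the r terms cancel around any cycle, the total register count on a cycle is unchanged (2 before and after in this toy case), which is why basic retiming alone cannot lower the bound P set by expression (3).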
In the rest of this section, we show how registers can be placed and controlled in circuits derived by the process we presented in Section 1. To this end, we review the method in Boyer et al. [2001a], which is a possible implementation of that process. The approach we are proposing in this article leads to better implementations of the process.

For the method proposed in Boyer et al. [2001a], registers are placed back into the circuit by pipelining the schedule graph $G_s$ defined in Section 2.5. Every path in $G_s$ that is longer than the minimal clock period P is broken by inserting registers on it. For paths having a length (in terms of number of units of time) less than P, no register is required if operation chaining is assumed.

For synchronous single-phase sequential circuits, registers are controlled by the same signal, called the clock. When clock skew is not supported, registers in that case must receive the clock at the same moment. In synchronous multiphase sequential circuits, registers are not necessarily controlled by the same clock. In this case, the clocks can have the same period and be defined relatively to a global clock that can be one of those clocks. Each clock is then an offset of the global clock. That offset is called the phase in the literature.

Circuits derived by the process we presented in Section 1 can be multiphase, and all the clocks have the same period. In the case of the method in Boyer et al. [2001a], which is a possible implementation of the process, once registers are placed, the phases to control them are then computed as follows. The phase of a register on the input of a computational element v is ($s_0(v)$ modulo P), where $s_0(v)$ is the schedule of v, and P is the minimal clock period due to data dependency constraints only.

Fig. 4. Placement and phases of registers using algorithm in Boyer et al. [2001a].
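The phase rule above amounts to a modulo operation. The sketch below applies it to the ASAP schedule values listed in Section 2.4 with P = 60. It evaluates ($s_0(v)$ modulo P) for every vertex, without asserting which of these inputs actually carry a register in Figure 4; the function and variable names are illustrative only.

```python
def phase(s0, period):
    """Phase of a register on the input of an element scheduled at s0.

    Python's % returns a value in [0, period) for a positive period,
    so negative schedule times map to a nonnegative offset.
    """
    return s0 % period


P = 60
# ASAP schedule s_0(v) relative to v_1, as quoted in Section 2.4.
asap = {"v1": 0, "v2": -30, "v3": 30, "v4": -10, "v5": -40, "v6": -30}

phases = {v: phase(s0, P) for v, s0 in asap.items()}
print(phases)
print(sorted(set(phases.values())))  # distinct candidate phases
```

In a language whose remainder operator keeps the sign of the dividend (C or Java, for example), the same rule would need an explicit adjustment for negative schedule times.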
Figures 4(a) and 4(b) present the placement of registers and their phases obtained using the algorithm provided in Boyer et al. [2001a] to place registers using the schedule graph depicted in Figure 3. The latter graph corresponds to the circuit in Figure 2 and is obtained as explained in Section 2.5. Data in Figure 4(c) is provided to assist the reader interested in computing the phases given in Figure 4(b). The number of registers that are placed in the circuit is 6, and the number of phases to control them is 4.

The algorithm for register placement in Boyer et al. [2001a] is not optimal in the sense that it does not use a minimum number of registers. Indeed, for Figure 4(a), register $R_1$ can be omitted since there is no combinational path longer than P between $R_4$ and $R_5$.

4. PROBLEM FORMULATION AND APPROACHES FOR ITS RESOLUTION

Our focus is to simultaneously realize Steps 2 and 3 in the process presented in Section 1 in order to minimize the number of registers. The problem we address in this article is then to determine a schedule with the minimum register requirements, where the register placement is done during the schedule determination. We do not support register sharing as in the case when basic retiming is used, since, in our case, the obtained circuits can be multiphase clocked sequential circuits, and, in this case, registers on the output of a computational element can be shared only if they are controlled by the same phase. However, once the registers are placed, one can examine the phases of registers on the output of each computational element to decide whether to share them.

Let us present the problem in a way that makes it easier to understand our approach in solving it. As explained in Section 3, the placement of registers consists in pipelining the schedule graph to obtain a circuit that can operate with the minimal clock period P. Recall that in Boyer et al.
[2001a] the placement of registers is done once the schedule is computed. If the schedule is given, the problem transforms into a problem of pipelining the schedule graph, while using a minimal number of registers. The weight of each arc in the schedule graph is given by Equation (7) (i.e., $\forall e_{u,v} \in E$, $T_s(e_{u,v}) = s_0(v) - s_0(u) + P \cdot w(e_{u,v})$). Instead of fixing the schedule first, before pipelining the schedule graph, we want to make the schedule a variable in the problem and then to pipeline the resulting schedule graph.

We conjecture that the problem is NP-hard in its general form. We provide in this section a mathematical formulation (MF) of the problem. From this MF, we derive a mixed integer linear program (MILP) that can be used for solving the problem for special or small-size circuits. In Section 5, we derive from this MILP a linear program to determine approximate solutions to the problem for general large circuits.

Before presenting the details related to MF and MILP, let us first give some definitions and notations while introducing an informal formulation of the problem. Figure 5 gives a portion of the schedule graph to pipeline, where i and j are two computational elements. Unknown variables $x_{i,j}$ denote the number of registers that must be placed on the arc, $e_{i,j}$, to guarantee that the length, $l_{i,j}$, of every path that goes to j via i is less than or equal to the minimal clock period P. Variable $l_{i,j}$ will be defined in the following. Note that as in Boyer et al. [2001a], operation chaining is assumed, and hence no register is required if $l_{i,j} \le P$. Suppose that paths that go to j via i are already examined in order to determine if some registers must be placed on them or not. Let $m_i$ be a nonnegative real number greater than or equal to each remainder that is obtained by dividing the length of each one of those paths by P.
The length $l_{i,j}$ of every path that goes to j via i is the sum of $m_i$ and $T_s(e_{i,j})$, where $T_s(e_{i,j})$ is defined by Equation (7). Variable $y_{i,j}$ is the remainder of the division of $l_{i,j}$ by P. We require that $m_i \le (P - d(i))$, which guarantees that, if a register R is placed on the