Technology Mapping and Architecture Evalution for k/m-Macrocell-Based FPGAs JASONCONGandHUIHUANG UniversityofCalifornia,LosAngeles and XINYUAN IBMCorporation Inthisarticle,westudythetechnologymappingproblemforanovelfield-programmablegatearray (FPGA)architecturethatisbasedonk-inputsingle-outputprogrammablelogicarray-(PLA-)like cells,or,k/m-macrocells.Eachcellinthisarchitecturecanimplementasingleoutputfunctionofup tokinputsanduptomproductterms.Wedevelopaveryefficienttechnologymappingalgorithm, kmflow,forthisnewtypeofarchitecture.Theexperimentalresultsshowthatouralgorithmcan achievedepth-optimalityonalmostallthetestcasesinasetof16MicroelectronicsCenterofNorth Carolina(MCNC)benchmarks.Furthermoreitisshownthatonthissetofbenchmarks,withonlya relativelysmallnumberofproductterms(m≤k+3),thek/m-macrocell-basedFPGAscanachieve thesameorsimilarmappingdepthcomparedwiththetraditionalk-inputsingle-outputlookup table-(k-LUT-)basedFPGAs.Wealsoinvestigatethetotalareaanddelayofk/m-macrocell-based FPGAsandcomparethemwiththoseofthecommonlyused4-LUT-basedFPGAs.Theexperimental resultsshowthatk/m-macrocell-basedFPGAscanoutperform4-LUT-basedFPGAsintermsof bothdelayandareaafterplacementandroutingbyVPRonthissetofbenchmarks. CategoriesandSubjectDescriptors:B.6.3[LogicDesign]:DesignAids;B.7.1[IntegratedCir- cuits]:TypesandDesignStyles—Gatearrays GeneralTerms:Algorithms,Design AdditionalKeyWordsandPhrases:PLD,FPGA,CPLD,technologymapping 1. INTRODUCTION The programmable logic devices (PLDs) have been widely used to imple- ment small to medium sized digital circuits. There are two major types of ThisworkwaspartiallysupportedbyAlteraCorp.andVantisCorp.undertheCaliforniaMICRO program. X.YuanwasaffiliatedwiththeUniversityofCalifornia,LosAngeles,LosAngeles,CA,atthetime oftheresearchforthisarticle. Authors’ addresses: J. Cong, Department of Computer Science, University of California, Los Angeles,LosAngeles,CA90095;email:[email protected];X.Yuan,SystemandTechnologyGroup, IBMCorporation,1000RiverSt.,Maildrop862F,EssexJunction,VT05452;email:xinyuan@us. ibm.com. Permissiontomakedigitalorhardcopiesofpartorallofthisworkforpersonalorclassroomuseis grantedwithoutfeeprovidedthatcopiesarenotmadeordistributedforprofitordirectcommercial advantageandthatcopiesshowthisnoticeonthefirstpageorinitialscreenofadisplayalong withthefullcitation.CopyrightsforcomponentsofthisworkownedbyothersthanACMmustbe honored.Abstractingwithcreditispermitted.Tocopyotherwise,torepublish,topostonservers, toredistributetolists,ortouseanycomponentofthisworkinotherworksrequirespriorspecific permissionand/orafee.PermissionsmayberequestedfromPublicationsDept.,ACM,Inc.,1515 Broadway,NewYork,NY10036USA,fax:+1(212)869-0481,[email protected]. (cid:2)C 2005ACM1084-4309/05/0100-0003$5.00 ACMTransactionsonDesignAutomationofElectronicSystems,Vol.10,No.1,January2005,Pages3–23. 4 • X.Yuanetal. PLDs—fieldprogrammablegatearrays(FPGAs),whichusuallyconsistofsmall programmablelogiccells,suchask-inputsingle-outputlookuptables(k-LUTs), andcomplexprogrammablelogicdevices(CPLDs),whicharebasedonmultiple- input and multiple-output programmable logic array- (PLA-) like logic cells. BothFPGAsandCPLDshavebeenwidelyused. Most commonly used FPGAs are based on k-LUTs. Every k-LUT can im- plement any function with no more than k inputs. In practice, k is usually small, as the area of a k-LUT grows exponentially with large k. For example, LUTsoffourtosixinputsarewidelyusedincommercialFPGAs[AlteraCorp. 2001; Xilinx Inc. 2001]. On the other hand, CPLDs usually have large basic cells. Each PLA cell can have a large number of inputs (typically between 30 and40),productterms(typicallybetween50to100),andmultipleoutputs(16, for example) [Altera Corp. 2000; Cypress Semiconductor Corp. 2000; Lattice SemiconductorCorp.2000].Asaresult,asinglePLAcellisabletoimplement multiplefunctionswithwideinputs. Rose et al. [1990] showed that among k-LUT cells, the four-input single- output LUT cell yields the smallest FPGA area for a wide range of program- ming technologies and routing pitches. Most commercially available FPGAs indeed use LUTs with an input size of four or five. Kouloheris and El Gamal [1992]investigatedthebestgranularityforPLA-basedCPLDsandfoundthat the total CPLD area is smallest if each basic cell has 8 to 10 inputs, 3 to 4 outputs,and12to13productterms.Thenumberofproducttermsisrestricted togrowlinearlyasinputsizeincreases[Kouloheris1993].Inpractice,however, mostcommerciallyavailableCPLDsusemuchlargerPLA-likelogiccells.Since FPGAsusesmallprogrammablecells,theycanoftenofferhigherdensityand capacity,atapriceofpossiblylargerandsomewhatunpredictabledelay,asthe critical path often needs to go through multiple levels of programmable cells connected by the programmable interconnect. On the other hand, CPLDs are usuallyfasterastheprogrammablecellsaremuchlarger,whichresultsinfewer levels of the logic. (The worst-case delay in CPLD also tends to be more pre- dictableasthelevelofthelogicisusuallydeterminedbythearchitectureand canbeestimatedbythedesigner.)However,CPLDsusuallyofferconsiderably lowerlogicdensity.Webelievethatthisisduetotworeasons:(a)itisinherently difficult to map logic into multioutput PLA-like programmable cells, as most technology mapping techniques have been developed for single-output logic cells; and (b) the difficulty associated with synthesis/mapping for PLA-based CPLD devices in turn resulted in very limited studies on this topic—the only relatedworkswecanfindweretheDDMap[Kouloheris1993],afastheuristic partition method for PLA-based architecture proposed in Hasan et al. [1992], TEMPLA[AndersonandBrown1998],andPLAmap[Congetal.2001].Inthe later1990s,withtheintroductionofhybridFPGAfamilies[KavianiandBrown 1999],afewofmappingalgorithmsforahybridarchitectureofLUTsandPLAs havebeenreported[Kaviani1999;KrishnamoorthyandTessier2003].(Incom- parison, there have been much more extensive studies on LUT-based FPGAs. Acomprehensivesurveyupto1997wasreportedinCongandDing[1996].) The need to reduce the logic levels (and associated interconnects!) to im- prove circuit performance, the intention to avoid the mapping problem for ACMTransactionsonDesignAutomationofElectronicSystems,Vol.10,No.1,January2005. TechnologyMappingandArchitectureEvaluation • 5 multioutput functions, and the hope to leverage a large amount of research resultsonsynthesisandmappingforLUT-basedFPGAs,seemtosuggestthat weshouldconsiderFPGAswithLUTsofmuchlargernumbersofinputs.How- ever,theareaofak-LUTgrowsexpontentiallywithrespecttok.Usingk-LUTs with large k may considerably lower chip density. Therefore, we have to ex- plore other alternatives. It has been reported that the functions mapped into largeLUTsusuallyuseconsiderablyfewerproducttermsthanthecapacityof thelookuptable[KouloherisandGamal1992;Kouloheris1993].Forinstance, the utilization of a lookup table cell with K-inputs and N-outputs, which is defined as the ratio of the number of product terms used per cell to the cell capacity, N ·2K−1,islessthan0.1for K = 7,N = 1.Thisleadsustoconsider anFPGAarchitecturebasedonk-inputsingle-outputPLA-likelogiccells.Each cell can implement a single output function of up to m product terms and up to k inputs. Such a cell is called a k/m-macrocell throughout this article. A k/m-macrocelldiffersfromak-LUTinthateachmacrocellcanimplementonly a subset of all possible k-input functions. A k/m-macrocell is different from a generalPLA-likeblockusedinmostCPLDs,aseachk/m-macrocellhasasin- gleoutput.Ifwechoosemtobesmall,k/m-macrocellsaremuchsmallerthan k-LUTs. Therefore, it is possible to use k/m-macrocells with larger input size inordertogetasmallerlogicdepthandlessinterconnectwithoutconsiderably loweringthelogicdensity. In this article, we develop a very efficient technology mapping algorithm, named k m flow, for this new type of architecture. The experimental results show that our algorithm can achieve depth-optimality on almost all the test- casesinasetof16MicroelectronicsCenterofNorthCarolina(MCNC)bench- marks. Furthermore it is shown that on this set of benchmarks, with only a relativelysmallnumberofproductterms(m≤k+3),thek/m-macrocell-based FPGAs can achieve the same or a similar mapping depth compared with the traditional k-LUT based FPGAs. We also investigate the total area and delay ofk/m-macrocell-basedFPGAsandcomparethemwiththoseofthecommonly used4-LUT-basedFPGAs.Theexperimentalresultsshowthatk/m-macrocell- based FPGAs can outperform 4-LUT-based FPGAs in terms of both delay and area after placement and routing by versatile place route (VPR) on this set of benchmarks. Therestofthisarticleisorganizedasfollows.Section2formulatestheprob- lem. Section 3 introduces a technology mapping algorithm for k/m-macrocell- based FPGAs. Section 4 further investigates the area and delay of k/m- macrocell-basedarchitecture.Wedrawourconclusionsbasedonexperimental results in Section 5. An extended abstract of this work was presented at the FPGASymposiumin2000[Congetal.2000]. 2. DEFINITIONSANDPROBLEMFORMULATION WefirstreviewsometerminologiesdefinedinCongandDing[1994].ABoolean networkcanberepresentedasadirectedacyclicgraph(DAG)whereeachnode represents a logic gate and a directed edge (i, j) exists if the output of gate i is an input of gate j. A primary input (PI) node has no incoming edge and a ACMTransactionsonDesignAutomationofElectronicSystems,Vol.10,No.1,January2005. 6 • X.Yuanetal. primaryoutput(PO)nodehasnooutgoingedge.Weuseinput(v)todenotethe set of nodes which are fanins of gate v. We assume the network is k-bounded, that is, for each node v in the network, |input(v)| ≤ k.1 A cone at node v, denoted as C , is a subgraph consisting of v and its predecessors such that v any path connecting a node in C and v lies entirely in C . The notation of v v input(C ) is also used to represent the set of distinct nodes outside C which v v supply inputs to the gates in C . A maximum cone at v, also the fanin net- v work of v, denoted as N , is a cone consisting of v and all of its predecessors. v The level of a node v is the length of the longest path from any PI node to v. ThelevelofaPIiszero.Thedepthofanetworkisthelargestnodelevelinthe network. A coneC is said to be k-feasible if and only if |input(C )| ≤ k. Similarly,C v v v issaidtobem-packableifandonlyifitsfunctionhasasum-of-productrepre- sentation with no more than m product terms. C is said to be k/m-feasible if v itisbothk-feasibleandm-packable.Pleasenotethatthewordfeasibleusually referstothenumberofinputsofamacrocell,andpackablereferstothenumber ofproductterms.Theonlyexceptionisk/m-feasible,whichisashortenedver- sionofk-feasibleandm-packable.Itisobviousthatak/m-feasibleconecanbe implemented by a k/m-macrocell. A network is called m-packable if the func- tion of any node in the network has a sum-of-product representation with no morethanmproductterms. Severalconceptsaboutcutsinanetworkwillbeusedinourdiscussion.Given anetworkN withasourcesandasinkt,acut(X,X(cid:3))isapartitionofthenodes in the network such that s ∈ X,t ∈ X(cid:3), and no nodes in X(cid:3) provide input to any node in X. Clearly X(cid:3) may be considered a cone at t inside network N. Therefore we can apply the previous definitions on k/m-feasibility to cuts. A cut(X,X(cid:3))issaidtobek-feasibleifandonlyif X(cid:3) isak-feasiblecone.Thecut issaidtobem-packableifandonlyif X(cid:3) isanm-packablecone.Ak/m-feasible cutisonethatisbothk-feasibleandm-packable.Foreverynodevanditsfanin network N ,acut(X,X(cid:3))in N isapartitionofthenodessuchthatallthePI v v nodes belong to X and v belong to X(cid:3). It is clear that every cone rooted at v correspondstoacutin N . v Thetechnologymappingproblemfork/m-macrocell-basedFPGAsistocover agivenk-boundedm-packableBooleannetworkwithasetofk/m-feasiblecones. Notethatweallowtheseconestooverlap,thatis,nodesmaybeduplicatedin themappingsolutions.ThisproblemisNP-Hardasthetwolevelminimization problemisNP-Hard. Throughoutthediscussiononthetechnologymappingalgorithm(Section3), unitdelayandunitareamodelsareused.Thatis,variationofinterconnection delay and routing area is not directly considered during technology mapping of the original network. Each k/m-macrocell contributes a constant delay in- dependent of the function it implements. Each cell is counted as a unit when we evaluate the area; hence the total area of the mapping solution equals to 1Any network can be fully decomposed into a two-bounded network without deteriorating the mappingquality[CongandHwang1996]. ACMTransactionsonDesignAutomationofElectronicSystems,Vol.10,No.1,January2005. TechnologyMappingandArchitectureEvaluation • 7 the total number of macrocells. Such simplification is reasonable because the layoutinformationisnotavailableyet.ForarchitectureevaluationinSection4, however,wewillusemoreaccuratedelayandareamodelswithconsideration of the interconnect. We use a well-known FPGA placement and routing tool (VPR[Betzetal.1999])togetthetotalareaandcriticalpathdelayafterlayout forcomparison.Toavoidconfusion,weusedepthandnumberofmacrocellsin Section3torefertothedelayandareaundertheunitdelayandtheunitarea model. 3. TECHNOLOGYMAPPINGFORk/m-MACROCELLS 3.1 Overview A k/m-macrocell can be regarded as a k-LUT with an additional restriction thatitcanonlyimplementlogicfunctionswithnomorethanmproductterms. Therefore, it is natural to start with the k-LUT mapping problem since it has beenintensivelystudiedinthepastdecade. Currently,therearethreemajorapproachestoLUT-basedFPGAmapping: tree-basedmapping(e.g.,Chortle-crf[Francisetal.1991a],Chortle-d[Francis etal.1991b]),flow-basedmapping(e.g.,FlowMap[CongandDing1994]),and cut-enumeration-based mapping (e.g., PREATOR [Cong et al. 1999]). A more comprehensive survey is available in Cong and Ding [1996]. Tree-based map- ping algorithms partition the network into trees and handle each tree sepa- rately. Each individual tree can be mapped optimally but a prior tree parti- tioningoftencompromisesthemappingquality.Usuallytheyarefastheuristic algorithms.Flow-basedmappingalgorithmsarebasedonthetheoremofmax- flow-min-cutandthecomputationofnetworkflow.Itcangenerateadepthopti- malmappingsolutioninpolynomialtime.However,flow-basedalgorithmslack flexibility as they find only one or two depth-optimal min-cuts for every node. Ontheotherhand,cut-enumeration-basedapproachescanfindoutmany,ifnot all,possiblecutsforeverynode.Theyofferhighflexibilityandcanachieveopti- malitywithmoreconstraints,buttheyareconsiderablyslowerthantree-based orflow-basedmethods. Theapproachwepresenthere,calledk m flow,isahybridofflow-basedand cut enumeration-based methods. We try to find a k/m-feasible cut for every nodefirstbyflowcomputation.Ifthatfails,weturntocutenumeration. Thek m flowalgorithmconsistsoftwophases—(i)labelingthenetworkand (ii) mapping the network into macrocells. The labeling phase is trying to find a k/m-feasible cut for each node for depth minimization. The mapping phase generatesk/m-macrocellsinthemappingsolutionaccordingtothelabelsand cutscomputedinthelabelingphase. 3.2 NonmonotoneProperties For every node v, let N be the fanin network consisting of node v and all its v predecessors. We assume that there is a given label label(v) associated with ∗ eachnodev.Wealsodefinelabel (v),theoptimalmappingdepthofv,tobethe ACMTransactionsonDesignAutomationofElectronicSystems,Vol.10,No.1,January2005. 8 • X.Yuanetal. minimum depth of the k/m-macrocell mapping solution for N . The labeling v phasefork/m-macrocellmappingissimilartothatintheFlowMapalgorithm [CongandDing1994].Itfindsak/m-feasiblecutforeverynodevandcomputes alabelforvtominimizethedepthofthek/m-macrocellimplementingnodev inthemappingsolution.Ideally,wewouldlikethecomputedlabeltobeequal to the optimal mapping depth, that is, label(v) = label∗(v) for every node v in thenetwork,asisinthecasewiththeFlowMapalgorithmfork-LUTmapping. However,itismoredifficulttodosoforthek/m-macrocell-basedmappingdueto thenonmonotonepropertyoftheclusteringconstraintsandthenon-monotone propertyoftheoptimaldepths. 3.2.1 NonmonotonePropertyofClusteringConstraints. Thefundamental difficultyofk/m-macrocell-basedFPGAmappingisthattheconstraintsonthe number of inputs and the number of product terms of a k/m-macrocell are not monotone clustering constraints. That is, if a cone C is k-infeasible (or v m-unpackable), it does not guarantee that all its supercones (i.e., those cones includingC )arek-infeasible(orm-unpackable).Asaresult,ak-infeasible(or v m-unpackable) cone C could become k-feasible (or m-packable) by including v more nodes into it. Note that the LUT mapping problem also has this non- monotoneproperty. 3.2.2 Nonmonotone Property of Optimal Mapping Depths. In addition, the optimal k/m-macrocell mapping depth is not monotone either. The op- timal mapping depth is monotone if label∗(v) ≥ label∗(u) as long as u is an inputtov. An example of the nonmonotone properties is shown in Figure 1. It is a portion of a two-bounded network and we do not show the complete network (forexample,output f1, f2 canbedriverstoothergatesthatarenotshownin Figure1).Assumingk =4andm=4,coneC isnot4-packablewhilealarger f1 coneC isboth4-feasibleand4-packable.TheoptimaldepthtoimplementC f f1 is 2 with 3 4/4-macrocells (as shown in the shaded regions). However, C can f beimplementedwithonly14/4-macrocellandthereforetheoptimalmapping depthforC is1,thatis,fornode f anditsinputnode f ,label∗(f)=1<2= f 1 ∗ label (f ).NotethatfortheLUTmappingproblem,itwasshowninCongand 1 Ding[1994]thattheoptimalmappingdepthismonotone. 3.3 Depth-OptimalMappingAlgorithmfork/m-macrocell Nevertheless, we can have a depth-optimal mapping algorithm for k/m- macrocell.Givenacut(X,X(cid:3))in N ,theheightofthecut,denotedash(X,X(cid:3)), v isthemaximumlabelininput(X(cid:3)),thatis, h(X,X(cid:3))=max{label(v)|v∈input(X(cid:3))} It is assumed that every node in input(X(cid:3)) already has a label. A min-height k/m-feasiblecut(X,X(cid:3))inanetworkisak/m-feasiblecutsuchthath(X,X(cid:3))≤ h(Y,Y(cid:3)),where(Y,Y(cid:3))isanyotherk/m-feasiblecut.Basedonthedefinitionof optimalmappingdepthofvlabel∗(v),wehavelabel∗(v)=h(X,X(cid:3))+1,if(X,X(cid:3)) ACMTransactionsonDesignAutomationofElectronicSystems,Vol.10,No.1,January2005. TechnologyMappingandArchitectureEvaluation • 9 Fig.1. Anexampleshowingthenonmonotonepropertiesofthek/m-macrocellmappingproblem. is the min-height k/m-feasible cut in C and label(u) = label∗(u) for any node v uother than v in coneC . Therefore, following a similar argument as in Cong v andDing[1994],wecanconcludethat(i)amappingalgorithmcanlabelevery nodev suchthatlabel(v) = label∗(v)ifitcanfindthemin-heightk/m-feasible cut for every node. And (ii) a mapping algorithm can find the depth optimal mappingsolutionfork/m-macrocell-basedFPGAsifitcanfindthemin-height k/m-feasiblecutforeverynode. Basedontheaboveobservation,adepthoptimalmappingalgorithmworks as follows. It finds the min-height k/m-feasible cut for each node in the topological order from PIs to POs. It then can label each node v such that label(v) = h(X,X(cid:3)) + 1 = label∗(v), where (X,X(cid:3)) is the min-height k/m- feasible cut for v. After labeling the whole network, it can use the min-height k/m-feasiblecutstogeneratethek/m-macrocellsinthemappingsolution.The mappingresulthastheoptimaldepth. In order to find the min-height k/m-feasible cut for every node, we can exhaustively enumerate all k-feasible cuts and test if they are m-packable. Theenumerationalgorithmcanthenpickthek/m-feasiblecutwithminimum heightforeverynode.However,suchanalgorithmisimpracticaltousedueto thehighcomplexityofexhaustivecutenumerationforalargek.Intheory,the number of k-cuts of a node in a cone with n nodes can be as large as O(nk). Sinceweareinterestedinlargek withvaluesk =6to10,wemovetodevelop amoreefficientheuristicalgorithm. ACMTransactionsonDesignAutomationofElectronicSystems,Vol.10,No.1,January2005. 10 • X.Yuanetal. 3.4 Thek m flowAlgorithm—aHeuristicApproach 3.4.1 Labeling Phase. The k m flow algorithm also computes a label for eachnodefromPIstoPOsintopologicalorder.Atthebeginning,everyPInode will receive a label of 0. Then for every node v, suppose mlevel is the largest labelamongv’sfanins,itisassumedthatlabel(v)≥mlevel.Inordertotestifwe cansetlabel(v)tomlevel,wecollapseeachnodeuin N withlabel(u)=mlevel v intonodevtoformaninducednetworkN(cid:3) andtestifwecanfindak/m-feasible v cutin N(cid:3).Obviouslyweassumethatnodelabelswillincreasemonotonically. v Based on the max-flow-min-cut theorem, we can easily find two min-cuts in N(cid:3): the max-volume-min-cut (X,X(cid:3)), which is a min-cut with the largest v |X(cid:3)|, and the min-volume-min-cut (Y,Y(cid:3)), which is a min-cut with the small- est |Y(cid:3)|. Please note both (X,X(cid:3)) and (Y,Y(cid:3)) are min-cuts, which implies that |input(X(cid:3))| = |input(Y(cid:3))| ≤ |input(Z(cid:3))| where (Z,Z(cid:3)) is any other cut in N(cid:3). v Also, note that max-volume-min-cut and min-volume-min-cut are unique and Y(cid:3) ⊆ X(cid:3). The max-volume-min-cut and min-volume-min-cut can be found in O(ke)time,whereeisthenumberofedgesandk isthevalueofthemaximum flow. There are three cases on whether the max-volume and min-volume min- cuts are k/m-feasible. m-packable is checked after calling Espresso [Brayton etal.1984]onthefunctionsin X(cid:3) andY(cid:3) collapsedatv. Case1. Neithermax-volumenormin-volumemin-cutisk-feasible. Because any k/m-feasible cut must be k-feasible too, this condition implies thatnok/m-feasiblecutexistsin N(cid:3).Inthiscase,nodevcanbesimplylabeled v asmlevel+1. Case2. Eithermax-volumeormin-volumemin-cutisk/m-feasible. Suppose(X,X(cid:3))isthek/m-feasiblemin-cut(eitherthemax-volumeormin- volumeone).Wecancreateak/m-macrocellfornodev,denotedasmap node(v), to implement the function of X(cid:3). The depth of map node(v) in the mapping solutioncannotbelargerthanmlevel.Therefore,weassignlabel(v)=mlevel. Case 3. Both max-volume and min-volume min-cut are k-feasible but not m-packable. Notethatthisconditiondoesnotguaranteethatthereisnok/m-feasiblecut existinginN(cid:3).Therefore,wetrytodoalocalcutenumerationinhopeoffinding v ak/m-feasiblecut. Tosearchforak/m-feasiblecut,perhapsthemostnaturalwayistodoalo- calcutenumerationwithintheconedefinedbythemax-volume-k-feasible-cut. However, unlike a max-volume-min-cut, the max-volume-k-feasible-cut may not be unique. Moreover, there is no good algorithm to find the max-volume- k-feasible-cut. Furthermore, it is intuitive to think that if the max-volume- min-cut (X,X(cid:3)) is not m-packable, a cut outside or across it may not likely be k/m-feasible, since it will have more fanins, tending to require more product termsinitssum-of-productrepresentation.Therefore,inordertosearchfora k/m-feasiblecutunderCase3,weonlyenumeratecutsinside X(cid:3) toseeifthey arek/m-feasible. ACMTransactionsonDesignAutomationofElectronicSystems,Vol.10,No.1,January2005. TechnologyMappingandArchitectureEvaluation • 11 TodoalocalcutenumerationinaconeC ,firstwemarkallinputstoC as v v pseudo-PIs and then go through all nodes inside C in topological order from v pseudo-PIsandenumerateallk-feasiblecutsforeverynodevbythefollowing equation,where x and y arethefaninsofv2: (cid:1) Cut(v)=({(C −x,x)}∪Cut(x)) ({(C − y, y)}∪Cut(y))[Congetal.1999]. x k y Cut(x)isthesetofk-feasiblecutsforn(cid:1)ode x.Thenotation(Cx −x,x)refersto thecutthat(cid:1)cutsoffthesinglenodex. k isamergingoperatordefinedontwo cutsets; S S meansmergeeverycutcut in S witheverycutcut in S 1 k 2 1 1 2 2 andonlykeepthek-feasiblecutsintheresult. After the enumeration process, we check Cut(v) to see if there is an m- packablecut.Ifthereexistsak/m-feasiblecut,nodevcanbelabeledasmlevel; otherwise,itwillbelabeledasmlevel+1. ThepseudocodeforthelabelingphaseisshowninFigure2.Forallthenodes, werecordthek/m-feasiblecutswhichcorrespondtotheirlabels. 3.4.2 MappingPhase. Thesecondphaseofouralgorithmistogeneratethe k/m-macrocells in the mapping solution. For every node v, basd on the k/m- feasible cut (X,X(cid:3)) found in labeling phase, we can create a k/m-macrocell map node(v) for v to implement the function of X(cid:3) and input(map node(v)) = input(X(cid:3)). The mapping phase is very similar to the mapping phase in FlowMap [Cong and Ding 1994]. The detailed description is shown in Figure2. 3.4.3 Properties of the k m flow Algorithm. We can prove the following propertiesforthealgorithmdiscussedabove: (1) If a node v is labeled as label(v), it can then be implemented with a depth nomorethanlabel(v).Thatis,label(v)istheupperboundestimationofthe depthofvinthemappingsolution. (2) IfCase3neverhappenswhenmappingaspecificcircuit,thenthemapping solutionisdelayoptimal.Indeed,itisjustthesameask-LUTmapping. (3) Foranycertaincircuit,iftheoptimaldepthfork-LUT-basedmappingisd , 1 theoptimaldepthfork/m-macrocell-basedmappingisd andthedepthof 2 k m flowmappingresultisd ,thend ≤d ≤d . 3 1 2 3 3.5 AreaEnhancement Afterobtainingak/m-macrocellmappingsolution,wewanttofurtherreduce the number of k/m-macrocells in the mapping solution without increasing its depth. 2Although in problem formulation the given network is k-bounded, our algorithm is imple- mented for a two-bounded given network as it is proved in Cong and Hwang [1996] that any network can be fully decomposed into a two-bounded network without deteriorating the map- ping quality. Therefore the cut enumeration equation provided here is for a two-bounded net- work.Ofcourseitcanbegeneralizedtoak-boundednetworkatthecostofhighercomputation complexity. ACMTransactionsonDesignAutomationofElectronicSystems,Vol.10,No.1,January2005. 12 • X.Yuanetal. Fig.2. Pseudocodeofkmflowalgorithm. Foreveryk/m-macrocellv,wetrytopackasmanyofitspredecessorswith it as possible into a single k/m-macrocell. Clearly we need to guarantee the conditionthatthenewk/m-macrocellisstillk/m-feasible.Inordertodoso,we trytocombinevwithanyofitsfaninsandthencheckiftheycanbepackedinto onek/m-macrocell.Aftervandoneofitsfaninshavebeensuccessfullypacked together, a new node v(cid:3) will be formed to replace v in the mapping solution. It could be that some of the fanins of v(cid:3) (input(v(cid:3)) = input(v)∪input(fanin)) can stillbepackedtogetherwithv(cid:3).Therefore,theabovegreedypackingprocesswill berepeateduntilnomorenodescanbepacked.Thedetailedpackingalgorithm, k m pack,isshowninFigure3.Onaverage,thetotalnumberofmacrocellsin themappingsolutionmaybereducedbyafactorof6%aftertheabovepacking process. ACMTransactionsonDesignAutomationofElectronicSystems,Vol.10,No.1,January2005.