ebook img

Very fast Simulated Annealing for HW-SW partitioning PDF

32 Pages·2004·0.54 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Very fast Simulated Annealing for HW-SW partitioning

Very fast Simulated Annealing for HW-SW partitioning Sudarshan Banerjee Nikil Dutt Center for Embedded Computer Systems University of California, Irvine Irvine, CA 92697-3425, USA [email protected] [email protected] CECS TechnicalReport #04-#18 June, 2004 Abstract Hardware/software(HW-SW)partitioningisakeyprobleminthecodesignofembeddedsystems and has been studied extensively in the past. With the wide availability of commercial platforms suchastheVirtex-IIProseriesfromXilinxthatintegrateprocessorswithreconfigurablelogic,one major existing challengeis the lack of efficient algorithmsthat can generatevery high-qualityso- lutionsbyexploringahugeHW/SWexplorationspace-thekeycriterionistoobtainsuchsolutions at a speed suitable for integration into a compiler-based partitioner. In this report, we make two contributionsforHW-SW partitioningofapplicationsspecifiedas proceduralcall-graphs: 1) We prove that during partitioning, the execution time metric for moving a vertex needs to be updatedonlyfortheimmediateneighboursofthevertex, ratherthanforallancestorsalongpaths to the root vertex. This enables move-based partitioningalgorithms such as Simulated Annealing (SA)toexecutesignificantlyfaster,allowingcallgraphswiththousandsofverticestobeprocessed inless thanhalfasecond 2) Additionally,we devisea new cost function for SA that enables searching of spaces overlooked bytraditionalSAcostfunctionsforHW-SWpartitioning,allowingthediscoveryofadditionalpar- titioningsolutionsin a veryefficientmanner. Wepresentexperimentalevidenceonaverylargedesignspacewithover12000probleminstances. We generatethe probleminstances by varying the call-graphsizes from20 to 1000 vertices, inde- gree/outdegreeof vertices, communication-to-computationratios,and varyingthearea constraint on thehardware partition. Thousandsof probleminstancesareexplored in a matterofminutesas compared to several hours or days using a traditional SA formulation. Aggregate data collected 1 over this large set of experiments demonstrates that when compared to a KLFM algorithm start- ing with all vertices in software, our approach is 1) asymptoticallyfaster, with a run-time around 5 times faster for graphs with 1000 vertices, 2) is frequently able to locate better design points with over 10 % improvement in application execution time, and 3) the average improvement in application execution time is around 5%. We confirmed the solution quality of results generated by our approach by additional comparisons with a) set of KLFM runs starting from different ini- tial configurations for the same problem instance, and b) other cost-functions commonly used in SA-based approaches for HW-SW partitioning. Overall, our approach generates superior results andexecutes muchfaster. 2 Contents 1 Introduction 5 2 Related work 6 3 Problem description 7 3.1 Problemdescription . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2 Notationaldetails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4 Efficient computation ofexecution timechangemetric 9 5 Simulated annealing 12 5.1 AlgorithmDescription . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5.2 Costfunctionforsimulatedannealing . . . . . . . . . . . . . . . . . . . . . . . . 13 5.3 Keyparameters for SimulatedAnnealing . . . . . . . . . . . . . . . . . . . . . . . 18 6 Experiments 19 6.1 Experimentalsetup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 6.2 Experimentalresultsand keyobservations . . . . . . . . . . . . . . . . . . . . . . 20 6.2.1 Proposed SA (SA-new) VsKLFM . . . . . . . . . . . . . . . . . . . . . . 20 6.2.2 Proposed SA (SA-new) VsotherSA . . . . . . . . . . . . . . . . . . . . . 25 7 Conclusion 26 8 Acknowledgements 27 9 Appendix A:Aggregatedata forKLFMVsSA (proposed costfunction) 29 List of Figures 1 Target architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Simpleexample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3 simplecall graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4 Solutionspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 5 Neighbourhoodmove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 6 Costfunctions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 7 Set ofexperiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 8 v50,CCR 0.1: Performance Vs constraint . . . . . . . . . . . . . . . . . . . . . . 21 9 v50,CCR 0.3: Performance Vs constraint . . . . . . . . . . . . . . . . . . . . . . 21 10 v50,CCR 0.5: Performance Vs constraint . . . . . . . . . . . . . . . . . . . . . . 22 11 v50,CCR 0.7: Performance Vs constraint . . . . . . . . . . . . . . . . . . . . . . 22 3 12 Run-timeperformanceplotforalgorithms . . . . . . . . . . . . . . . . . . . . . . 23 13 v20,CCR 0.1: Performance Vs constraint . . . . . . . . . . . . . . . . . . . . . . 29 14 v20,CCR 0.3: Performance Vs constraint . . . . . . . . . . . . . . . . . . . . . . 29 15 v20,CCR 0.5: Performance Vs constraint . . . . . . . . . . . . . . . . . . . . . . 29 16 v20,CCR 0.7: Performance Vs constraint . . . . . . . . . . . . . . . . . . . . . . 29 17 v100,CCR 0.1: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 30 18 v100,CCR 0.3: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 30 19 v100,CCR 0.5: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 30 20 v100,CCR 0.7: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 30 21 v200,CCR 0.1: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 31 22 v200,CCR 0.3: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 31 23 v200,CCR 0.5: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 31 24 v200,CCR 0.7: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 31 25 v500,CCR 0.1: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 31 26 v500,CCR 0.3: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 31 27 v500,CCR 0.5: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 32 28 v500,CCR 0.7: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 32 29 v1000,CCR 0.1: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 32 30 v1000,CCR 0.3: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 32 31 v1000,CCR 0.5: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 32 32 v1000,CCR 0.7: Performance Vsconstraint . . . . . . . . . . . . . . . . . . . . . 32 4 1 Introduction Partitioning is an important problem in all aspects of design. HW-SW (hardware-software) partitioning, i.e the decision to partition an application onto hardware (HW) and software (SW) execution units, is possibly the most critical decision in HW-SW codesign. The effectiveness of a HW-SW design in terms of system execution time, area, power consumption, etc, are primarily influenced bypartitioningdecisions. The partitioning problem has a lot of related variations depending on the objective function being optimized. In this report, we consider the problem of minimizing execution time of an applicationforasystemwithhardareaconstraints. Anexampleofarelatedproblemistheproblem ofminimizingaggregateenergyconsumptionforasystemwithhardconstraintsonexecutiontime. We consider an application specified as a DAG (directed acyclic graph), extracted in the form of a callgraph from a sequential application written in ’C’. Our target system architecture has one microprocessor(SW)andonearea-constrainedhardwareunit(HW)suchasareconfigurablelogic fabric. AnexampleofsuchanarchitectureistheXilinxVirtex-IIProPlatformFPGAXC2VPX20 that integrates one PPC405 (PowerPC) processor with reconfigurable logic. In this work, we as- sume that the HW and SW units execute in mutual exclusion- this allows us to focus exclusively on thepartitioningproblem. Fora DAG representing a call graph, the executiontimeofa vertex needs to be computedfrom all descendants in the sub graph rooted at the vertex. In HW-SW partitioning, when a vertex is moved from SW to HW or vice-versa, the execution time of the program changes due to the differentexecutiontimesofaHWversusaSWimplementation,alongwithincreasedordecreased communicationcost across the cut. This change in execution timeis represented by the execution timechangemetric. In this report, we make two contributions to HW-SW partitioning. First we prove that for a callgraph representation, when a vertex is moved to a different partition, it is only necessary to update the execution time change metric [8] for its immediate parents and immediate children instead of all ancestors along the path to the root. This in general allows for a more efficient applicationofmove-basedalgorithmslikesimulatedannealing(SA). Secondly, we present a cost function for simulated annealing to search regions of the solution space often not thoroughly explored by traditional cost functions. This enables us to frequently generatemoreefficient designpoints. Our two contributions result in a very fast simulated annealing (SA) implementation that gen- erates partitioningssuch that theexecutiontimesare frequently betterby over10% compared to a KLFM(Kernighan-Lin/Fiduccia-Matheyes)algorithmforHW-SWpartitioningforgraphsranging from 20 vertices to 1000 vertices. Equally importantly, graphs with a thousand vertices are pro- cessedinmuchlessthanasecond,withthealgorithmrun-timeasymptoticallyfasterthanaKLFM implementationby around5 timesforgraphs witha1000vertices. Giventheknownpropensity ofa KLFM approach to get stuck at local minima,we additionally conducted experiments where the KLFM algorithm was allowed to start from different configu- rations. Our approach generates better quality results compared to this set of independent KLFM runs on the same problem space. Additionalcomparisons with commonlyused SA cost functions 5 inHW-SWpartitioningestablishthequalityofresultsgenerated by ourapproach. The rest of this report is organized as follows: in Section 2, we review related research in HW- SW partitioning. In Section 3 we review the problem description. In Section 4, we prove that the execution time change metric needs to be updated only for immediate neighbours of a vertex. In Section 5, we discuss the simulated annealing algorithm and present our proposed approach towardsa moreinterestingcost function. In Section 6, wepresent theexperimentsconducted. We concludewithSection7. 2 Related work Hardware-software partitioning is an extensively studied ”hard” problem with a plethora of approaches- dynamicprogramming[13], geneticalgorithms[4], greedy heuristics[12], to namea few. MostoftheinitialworkonHW-SWpartitioning,[13],[14]focussedontheproblemofmeet- ingtimingconstraintswithasecondarygoalofminimizingtheamountofhardware. Subsequently therehasbeenasignificantamountofworkonoptimizingperformanceunderareaconstraints,[1], [3], [8]. With the goal of searching a larger design space, techniques such as simulated annealing (SA) have been applied to HW-SW partitioning using fairly simple cost functions. While a lot of initial work such as [14] was based exclusively on simulated annealing, recent approaches com- monly measure their quality against a SA implementation. For example, [1] compares SA with a knowledge-basedapproach, and[3]compares SA withtabu search. Itiswell-knownthatSArequirescareful attentioninformulatingacostfunctionthatallowsthe search to ”hill-climb” over suboptimal solutions. However, much of the published work in HW- SW partitioning have not studied in detail the SA cost functions that permit a wider exploration of the search space. As an example, in [3], [9], the SA formulation considers only valid solutions satisfying constraints, thus restricting the ability of SA to ”hill-climb” over invalid solutions to reach avalidbettersolution. The two previous pieces of work in HW-SW partitioning that are most directly related to our work are [8], [5]. Ourmodel for HW-SWpartitioningis based on [8], a well-knownadaptation of theKLparadigmforHW-SWpartitioning;oureffortsinimprovingthequalityofthecostfunction are closelyrelated to[5]. Our partitioning granularity is similar to [8], effectively that of a loop-procedure call-graph; each partitioning object represents a function and the DAG edges are annotated with callcounts. [8] introduced the notion of execution time change metric for a DAG, and updating the metric potentially by evaluation of ancestors along the path to the root. The linear cost function in [8] ignorestheeffect ofHWarea aslong asthearea constraintissatisfied. [5]providesanin-depthdiscussionofcostfunctionsandthenotionofimprovingtheresultsob- tainedfromasimplelinearcostfunctionbydynamicallychangingtheweightsofthevariables. We differfrom[5]inthefollowingways: [5]addressestheproblemofchoosingasuitablegranularity for HW-SW partitioning that minimizes area while meeting timing constraints; since we consider the problem of minimizing execution time while satisfying HW area constraints, the proposed cost function in [5] needs significant adaptation for our problem. In [5], the dynamic weighting technique was applied towards the secondary objective of minimizing HW area once the primary 6 objective, the timing constraint, was almost satisfied. We however, apply a dynamic weighting factor to our cost functions in various regions of the search space to better guide the search. Last but not the least, since their primary focus was on the granularity selection problem, there was no quantitativecomparisonof theirapproach with otheralgorithms-wehavecompared ourapproach to the KLFM approach with an extensive set of test cases and demonstrated the effectiveness of ourapproach. 3 Problem description 3.1 Problem description The application specification methodology and architectural assumptions in our problem def- inition are similar to [8]. Here we provide a brief summary of the key aspects of the problem definition. memory HW local memory HW SW Figure1.Targetarchitecture We consider an application specified as a DAG (directed acyclic graph), extracted from a se- quential program written in C, or, any other procedural language. The target architecture for this application is a system with a single SW processor and a single HW unit connected by a system bus, as shown in Figure 1. An example of such an architecture is the widely available Xilinx XC2VPX20 with a singlePPC405 processor connected to reconfigurable logicby a PLB (proces- sor local bus). In this work, we assume mutually exclusive operation of the two units, i.e the two unitsmaynotbecomputingconcurrently. WeadditionallyassumethattheHWunitcanbeconfig- ured only once before the application starts execution and the HW functionality does not change once the application starts execution, i.e., we consider that the HW unit does not have dynamic RTR (run-time reconfiguration) capability 1. The problem considered in this report is to partition the application such that the execution time of the application is minimized while simultaneously satisfyingthehard area constraintsoftheHWunit. 1Thisisarelevantpracticalassumptioninlightofthesignificantreconfigurationpenaltiesincurredinsuchcom- monlyavailablesingle-contextRTRarchitectures 7 Each partitioning object corresponding to a vertex in the DAG is essentially a function that can be mapped to HW or SW. Each directed edge x y in the DAG represents a call or an access ( ; ) made by the caller function x to the callee function y. The SW executiontimes and callcounts are obtainedfromprofilingtheapplicationontheSWprocessor. Inthismodel,theHWexecutiontime and the HW area for the functions are estimated from synthesis of the functions on the given HW unit2. Communicationtimeestimatesaremadebysimplydividingthevolumeofdatatransferred by the bus speed. Since theexecution timemodel is sequential, bus contention is assumed to play an insignificantrole. v 0 v v 4 SW 6 v v 3 v 1 5 v HW 2 Figure2.Simpleexample We next motivate the first part of our contribution with the simple example shown in Figure 2. For the callgraph in the figure, the execution time of the program (same as the execution time for vertexv )obviouslydependsontheexecutiontimeofitsdescendantv . Letusassumeallvertices 0 2 were initiallyin SW. If we movethe vertex v to HW, the executiontime changes due to HW-SW 2 communicationon theedges v v , v v and changein executiontimeforvertexv . It would 3 2 1 2 2 ( ; ) ( ; ) appear that any execution time related metric for the vertices v , v , v , would need to be updated 0 6 4 when this move is made. In the next section, we show with a simple example that this is not true fortheexecutiontimechangemetricand followupwithaproof. Beforeproceedingfurther,weneedtointroduceaslightlymoreformalsetofnotationsrequired intherest ofthisreport. 3.2 Notationaldetails The input to the partitioning algorithm is a directed acyclic graph (DAG) representing a call- graph, CG = (V, E). V is the set of graph vertices where each vertex v represents a function. E i is the set of graph edges where each edge e represents a function call to the child function v by ij j 2With a large number of objects, fast, good quality estimators are of course more practical than detailed logic synthesis 8 the parent function v. The application representation assumes that there are no recursivefunction i calls. Eachedgeisassociatedwith2weights(cc , ct ). cc represents thecallcount,i.e,thenumber ij ij ij oftimesfunctionv is calledby itsparent v. ct representstheHW-SWcommunicationtime,i.e, j i ij if v is mapped to SW and its child v is mapped to HW (or vice-versa), ct represents the time i j ij taken to transfer data between theSW and the HW unit for each call. An important assumptionis made that vertices mapped onto the same computing unit have negligiblecommunicationlatency, i.eSW-SWorHW-HWcommunicationtimecan beconsideredto be0 forpractical purposes. Eachvertexv isassociatedwith3weights(ts,th,h ). ts representsthesoftware(SW)execution i i i i i time for a given vertex, i.e, the time required to execute the function corresponding to v on the i processor. th represents the hardware (HW) execution time, i.e, the time required to execute the i function on the HW unit. The hardware implementation of the function requires area h on the w HWunit. NotethatthisdefinitionworksoffasingleParetopoint-inthiswork,wedonotconsider compiler (synthesis) optimizations leading to multiple HW implementations with different area and timingcharacteristics. A partitioning of the vertices can be represented in the following manner: P = vs vh vs . 0 1 2 f ; ; :::::g This denotes that in partitioningP, vertex v is mapped to SW, v is mapped to HW, v is mapped 0 1 2 toSW, etc. Twokeyattributesofapartitioningare TP HP . TP denotestheexecutiontimeofthe ( ; ) application under the partitioning P, HP denotes the aggregate area of all components mapped to hardware underpartitioningP. 4 Efficient computation of execution time change metric Given a sequential execution paradigm and a call-graph specification, the execution time of a vertex v is computed as the sum of its self-execution time and the execution time of its children. i The execution time for v additionally includes HW-SW communication time for each child of v i i mapped to a different partition. Thus, if TP denotes the executiontimefor vertex v undera given i i partitioningP TiP =ti +(cid:229) jjCij1 ccij Tj (cid:229) jjCid1iffj ccij ctij ( (cid:3) )+ ( (cid:3) ) where t is either th=or ts depending o=n whether v is currently mapped to HW or SW in the i i i i diff partitioning P. C represents the set of all children of vertex v, C represents the set of all i i i childrenofv mappedtoadifferentpartition. Notethatifv correspondstomainina’C’program, i 0 TP equalsthecompleteprogramexecutiontimewhenpartitioningPspecifieswhetherverticesare 0 mappedto HWorSW. For a partitioning P, the execution time change metric D P, for a vertex v , is defined as the i i change in execution time when the vertex v is moved to a different partition. That is, D P denotes i i thechangeinTP when vertexv ismappedto adifferentpartition. 0 i From the definition of the execution time change metric, it would appear that when a vertex is moved to a different partition, the metric values for all its ancestors would need to be updated. Indeed, in previous work, [8], the change equations assumed it was necessary to update ancestors all the way to the root. This, however, is not the case: it is only necessary to update the metric 9 values for immediate neighbours. In the following, we first give an intuition for this through a simpleexample,and followupwithaproof. Considerthesimpleexamplein Figure3. h s {t , t } 0 0 v 0 {cc , ct } {cc , ct } 13 13 01 01 v h s h s v3 1 {t 1 , t1 } {t , t } 3 3 {cc , ct } {cc , ct } 32 32 12 12 v 2 h s {t , t } 2 2 Figure3.simplecallgraph For the graph shown in Figure 3, the program execution time corresponding to a partitioning P = vs vs vs vs isgivenby 0 1 2 3 f ; ; ; g TP =ts cc ts cc ts cc ts cc ts 0 0 01 1 12 2 03 3 32 2 + (cid:3)( + (cid:3) )+ (cid:3)( + (cid:3) ) Ifthevertexv ismovedtoHW, i.ewehaveanewpartitioningP = vh vs vs vs , 0 0 0 1 2 3 f ; ; ; g TP0 =th cc ts ct cc ts cc ts ct cc ts 0 0 01 1 01 12 2 03 3 03 32 2 + (cid:3)( + + (cid:3) )+ (cid:3)( + + (cid:3) ) Theexecution timechangemetricforvertexv is 0 D P =(T1 -T ) = th ts cc ct cc ct - Equation (A) 0 0 0 0 0 01 01 03 03 Wecan similarlycomput(eD (cid:0)P, D )P+,D P. (cid:3) + (cid:3) 1 2 3 Next,letus considerapartitioningP = vs vs vh vs wherethevertexvh is mappedto HW. 2 0 1 2 3 2 f ; ; ; g Forthispartitioning,executiontimeforvertexv isgivenby 0 TP2 =ts cc ts cc ct th cc ts cc ct th 0 0 01 1 12 12 2 03 3 32 32 2 + (cid:3)( + (cid:3)( + ))+ (cid:3)( + (cid:3)( + )) Ifthevertexv isnowmovedto HW, i.ewehaveanewpartitioningP = vh vs vh vs , 0 20 0 1 2 3 f ; ; ; g TP20 =th cc ts ct cc ct ts cc ts ct cc ct ts 0 0 01 1 01 12 12 2 03 3 03 32 32 2 + (cid:3)( + + (cid:3)( + ))+ (cid:3)( + + (cid:3)(( + )) Thus, D P2 = TP20 TP2 = th ts cc ct cc ct - Equation (B) 0 0 0 0 0 01 01 03 03 ( (cid:0) ) ( (cid:0) )+ (cid:3) + (cid:3) Equations(A)and(B)areidentical. Thatis,whenvertexv ismovedtoadifferentpartition,the 2 metricD P remainsunchanged foranon-immediateancestorv . 0 0 In order to prove that the above result holds in the generic case, the simple example gives us the insight that we need to represent the expression TP in a form slightly differently from the 0 originaldefinition. Wedefinetheaggregatecallcount,CC foravertexv inthefollowingrecursive i i (cid:229) P manner:CC = i cc CC . P representsthesetofallparentvertices(allfunctionsitiscalled i jj j1 ji j i ( (cid:3) ) from)fortheverte=xv . CC , i.etheaggregatecallcountfor theroot vertex,is 1. i 0 Armedwiththeabovedefinition,we can proceed toprovethefollowinglemma: 10

Description:
Sudarshan Banerjee. Nikil Dutt. Center for to the root vertex. This enables move-based partitioning algorithms such as Simulated Annealing.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.