1 Maintaining Virtual Areas on FPGAs using Strip Packing with Delays Josef Angermeier#1, Sa´ndor P. Fekete∗, Tom Kamphans∗2, Nils Schweer∗, Ju¨rgen Teich# #Department of Computer Science 12, University of Erlangen-Nuremberg Erlangen, Germany {angermeier, teich}@cs.fau.de ∗Department of Computer Science, Braunschweig University of Technology Braunschweig, Germany {s.fekete, n.schweer}@tu-bs.de, [email protected] 0 1 0 2 Abstract—The computing resources available on dynamically physical addresses in the memory. Thus, pages of different n partially reconfigurable devices increase every year enormously. applications may lie physically next to each other, but only a In the near future, we expect that many applications run on a the approved application may access them. Furthermore, the J single reconfigurable device. In this paper, we present a concept concept of the virtual address space also facilitated writing 5 formultitaskingondynamicallypartiallyreconfigurablesystems 2 called virtual area management. Weexplain itsadvantages, show software, as each application did not need to bother aboutthe its challenges, and discuss possible solutions. Furthermore, we positions of other applications but can assume to be the only ] investigate one problem in more detail: Packing modules with user of the processor and the memory. R time-varying resource requests. Thisproblem from thereconfig- In order to work out concepts for allowing multitasking on A urable computing field results in a completely new optimization reconfigurable devices, we oriented ourselves on concepts of problemnot tackledbefore. ILP-basedandheuristicapproaches s. are compared in an experimental study and the drawbacks and thesoftwareworldsuchasvirtualaddressspace.But,asthere c benefits discussed. are some major differences between hardware and software, [ one cannot just put one concept unchanged from software to 1 reconfigurable devices. One important difference is that on v I. INTRODUCTION reconfigurable devices, different modules may be executed in 3 Reconfigurable devices offer more and more space and parallel while software usually works more sequentially. 9 functionality over time and will probably continue to do so The basic idea of our concept which we called ”Virtual 4 4 in the future. Yet, huge hardware applications can already Area Management” is to partition the available FPGA area . be instantiated on the reconfigurable chips, or even arrays of into different regions. Each hardware application obtains one 1 processors. Furthermore, nowadays reconfigurable chips are contiguous region, which may grow or shrink depending 0 0 mostly used to instantiate a single application with multiple on the applications resource needs. Each running hardware 1 modules, which might not all be necessary at each instant application can reconfigure hardware modules in its region v: in time. But as reconfigurable devices evolve further, the and free or try to allocate more area. In this process, each i offered resources will be large enough to also run completely applicationdoesnotknowthephysicalpositionofitsmodules, X different applications simultaneously. This development also but just the relative positions. Thus, it cannot reconfigureany r took place in the software world: In the beginning of the regionthatbelongsto a differentapplicationandis concerned a nineties, most personal computers used operating systems onlywithitsownresources.Intermodulecommunicationtakes such as MS-DOS, which allowed just one single software placeonlyintheregionofoneapplication.Thus,themodules application to be executed, alone with some drivers. Since that must communicate with each other are automatically then, the increased computing resources have allowed us to grouped by position to each other. run multiple applications simultaneously. Communication between the partial modules and with the Similar to the software world, multitasking will lead re- input and output periphery is an important point. We assume configurable devices to higher efficiency, but will also raise that the application modules are provided with some means several challenges that must be solved. For example, in order to communicate with each other and to the FPGA’s I/O-ports to provide reliability and security, we must make sure that independently from their position (e.g., as described in [1]). no application can violate the execution of another one. In Thereareseveralpossiblewaystoachievethisgoal.Wesolved software this is solved by runningeach application in its own the communication problems by using our self-developed virtualaddressspace. Each applicationknowsonlythe virtual platformcalledESM(see[2],[3]).Itoffers—amongstothers— addresses of its own resources and parts of the operating asocalledcrossbardevice,whichdynamicallyroutestheinput system. The virtual addresses are translated at runtime into andoutputsignalstothepositionofthecorrespondingmodule. Moreover, the modules can communicate to each other using 1Supported by DFG grant TE 163/14-2, 14-3, project “ReCoNodes”, as the crossbar. Thus, we can assume that the modules can be partofthePriorityProgramme1148.“Reconfigurable Computing”. placed independent from the positions of the I/O periphery 2Supported byDFGgrant FE407/8-2, 8-3, project “ReCoNodes”, as part ofthePriorityProgramme1148,“Reconfigurable Computing”. and other the positions of other modules. 2 A. Related Work which nearby position to place it. The approach by Cardoso [16] also considers the topic of 1) Reconfigurable Computing: Brebner [4], [5] addressed resource virtualization on FPGA devices, achievable due to the problems involved in presenting to a software-oriented dynamic reconfiguration capabilities. Hereby, a new temporal user a larger virtual hardware resource that is implemented partitioning algorithm is proposed. The model by Banerjee using smaller physical FPGA hardware. Their approach is et al. [17] furthermore also supports HW/SW partitioning based on using swappable units, and a prototype operating of the tasks. However, both approaches are also based on system is described that demonstrates operational steps. In the assumption that the running times of each task can be contrastto that work,this paper doesnotaddressthe problem estimatedroughly,andthattheworst-caseexecutiontimedoes ofovercomingthephysicalconstraintsgivenbyasmallFPGA, notdiffer too much from the averagecase. In contrastto that, but tackles the problem, how to run multiple applications on our approachmay also be engagedfor the online case, where a single reconfigurable resource. this must not be the case. Bazargan et al. [6] present fast online placement methods 2) Packing: Itturnsoutthatplacinghardwaremoduleswith for dynamically reconfigurable systems, as well as offline growingandshrinkingarearesourcesamountstostrippacking. 3D placement algorithms. Hereby, partial modules are to be The classical strip packing problem was first considered by placed completely independent of each other, and the inter- Baker et al. [18]. In this problem a set of rectangles must be module communication problem is not addressed. Steiger et packed into a strip of semi-infinite height and width 1 such al. [7] and Diessel et al. [8] further improve scheduling thatthetotalheightofthepackingisminimized.Theyshowed methodsfor partially reconfigurablesystems. Our approachis that in the online case the bottom left heuristic does not basedonthedifferentiationbetweenapplicationsandmodules. guaranteea constantcompetitive ratio for packing a sequence Modulesbelongingto the same applicationare placednearby, ofrectangles.Fortheofflinecase theyprovedan upperbound such that inter-module communication can be as efficient of 3 for a sequence of rectangles and of 2 on the competitive as possible. Different scheduling subproblems have been ad- ratio for a sequence of squares; both analyses require the dressed meanwhile; for example, scheduling with respect to elements to be sorted. Later Kenyon and Re´mila designed a thereconfigurationoverhead[9].Banerjeeetal. [10] takeinto fully polynomial time approximation scheme for the offline account hardware-software partitioning decisions for a fast setting [19]. For the online case Baker and Schwarz [20] execution of an application. But in contrast to our paper, all introduced the so-called shelf algorithms with an competitive these approachesstill focus on executinga single application. ratio that can be made arbitrarily close to 1.7. Csirik and Some operating systems for reconfigurable embedded plat- Woeginger [21] showed a lower bound of 1.69103 for any forms were developed [11], [12], [13]. Such an operating shelfalgorithmandintroducedanalgorithmwhoseasymptotic systemprovidesaminimalprogrammingmodelandaruntime worst-case ratio comes arbitrarily close to this value. system.Theruntimesystemperformsonlinetaskandresource In the classical game of Tetris the aim is to find online management. Scheduling problems are formulated for the placements for a sequence of objects—not all having rectan- 1D and 2D resource models and developed heuristics are gular shape—such that space is utilized as well as possible. compared to each other. Resources of the operating system In this process, no item can ever move upward, no collisions and the user applications are clearly differentiated, but that is between objects must occur, an item will come to a stop if not the case for the resource access of multiple applications and only if it is supported from below, and each placement running on the FPGA. Our paper suggests a compromise for must be placed before the next item arrives. Obviously, there the inter-module communication problem: Modules belong- is a slight difference in the objective function, as Tetris aims ing to the same application should placed nearby to each at filling rows. In actual optimization scenarios, this is less other, such that they can exchange data efficiently. Modules interesting, as it is not critical whether a row is used to belonging to different applications are not necessarily placed precisely100%.Evenwhendisregardingthedifficultyofever- nearby. Additionally, in contrast to our application model, no increasing speed, Tetris is notoriously difficult: As shown by application can shrink and grow during runtime. The focus Breukelaar et al. [22], Tetris is PSPACE-hard, even for the in former works was put more on hardware and software original,limitedsetofdifferentobjects.AzarandEpstein[23] abstraction, here it is on securely running multiple, dynamic consideredTetris-like onlinepackingofrectanglesinto a strip hardware applications. whereeachitem mustbemovedona collision-freepath toits Many recent works focused on achieving optimal perfor- final position which does not have to supported from below. mance to put multiple tasks onto one FPGA within this JustlikeinTetris,theyconsideredthesituationwithorwithout context. Cordone et al. [14] and Redaelli et al. [15] specified rotationofobjects.Forthecase withoutrotation,theyshowed a new model for partitioning and scheduling on partially dy- that no constant competitive ratio is possible, unless there namically reconfigurable hardware. The different applications is a fixed-size lower bound of ε on the side length of the arerepresentedbyataskgraphandtheaimistoobtainatotal objects, in which case there is an upper bound of O(log 1). ε execution time near optimality by taking reconfiguration and For the case in which rotation is possible, they showed a 4- communication times into account. However, this approach competitivestrategy,basedonshelf-packingmethods,withall does not allow dynamic behavior, such that according to the rectangles being rotated to be placed on their narrow sides. current state of the resources, the tasks can select on their Coffmann,Downey,andWinkler[24]consideredprobabilistic own which module implementationto reconfigurenext and at aspects of online rectangle packing with Tetris constraint, 3 Reconfigurable Device MotherBoard Reconfigurable Area BabyBoard SRAM SRAM SRAM SRAM A G VA1 VA1 VA1 VA2 VA2 Flash FP … VSlot1 VSlot2 VSlot3 VSlot1 VSlot2 M1 M2 M3 Ms CP CP CP CP CP Reconfiguration Manager Control−Processor Crossbar PowerPC VA1−Scheduler VA2−Scheduler Peripherals Load A in VSlot1 Load X in VSlot2 Load Y in VSlot2 Load B Res(A)=Val1 Fig.1. ArchitectureoverviewourESMplatform(see[2],[3]formoredetails). in VSlot1 Res(Y)=Val2 Load D in VSlot2 withoutallowingrotations.Ifrectanglesidelengthsarechosen uniformlyat randomfromthe interval[0,1],theyshowedthat Load E in VSlot3 Load A in VSlot1 there is a lower bound of (0.31382733...)n on the expected height of the strip. Using another kind of level-type strategy, Fig. 2. Two different hardware applications running on a 1D partially dy- which arises from the bin-packing–inspired Next Fit Level, namicallyreconfigurabledevice:Thefirstthreeslotsformthefirstvirtualarea they established an upper bound of (0.36976421...)n on the (VA1),theremainingthesecondvirtualarea(VA2).ModulesinoneVAcan communicate witheach other, e.g.byusingtheir communication point(CP) expected height. Fekete et al. [25] considered an Tetris-like belonging toreconfigurable bus. Neither the firstnor the second application online packing with gravity; that is, every item must be knowstheexactpositionofitsVAonthereconfigurabledevice;onlythesize supported from below in its final position. For squares they oftheavailable areaandtherelative positions ofthereconfigurable modules loadedinitsvirtual areaareknown. gave an algorithm with competitive ratio 2.6154. Note that none of the previous works allows stretching the objects in any direction. the virtual areas. This secure operating system unit maps the virtual area units to the corresponding physical dynamically II. VIRTUAL AREA MANAGEMENT reconfigurable area units. The virtual area management unit A. Main Idea can be compared to the memory managementunit (MMU) in Ourconceptisaimedatpartiallydynamicallyreconfigurable the software world: both handle the translation of virtual to architectures, where modules loaded onto the reconfigurable physical memory positions. Furthermore, the corresponding device may be exchanged at runtime. A typical structure of application software threads do not know the actual physi- such a device is given in Fig. 1. Yet devices are available cal positions of the reconfigured modules, only the relative whichallowthe reconfigurationofcolumnsonly,whilenewer positions of each reconfigurable module to each other. Each platforms also allow reconfiguration of certain contiguous application simply requests to load a specified module to a cells. The former is called 1D reconfiguration, the latter 2D virtual position in the assigned virtual area. See the example reconfiguration.Our concepts are applicable to both architec- in Fig. 2: The second application with its virtual area VA2 tures. Furthermore,the platform shouldconsist of one control issues a request to load a specified bit-file called “X” to CPU, which may be placed externally or be included into the the virtual address (here called: VSlot) ‘2’. It does not know reconfigurable device. that this virtual address corresponds physically to the last To run different applications on the reconfigurable device, reconfigurableunitonthereconfigurabledevice.Itknowsonly we propose to partition the available reconfigurable area into that the module loaded there is on the right side of a module so called virtual areas (VAs). For performance reasons, the loaded to the virtual address (VSlot) ‘1’. As each application differentVAsshouldconsistofcontiguousreconfigurablearea is allowed to specify only a virtual address corresponding to units. As the required amount of resources of an application its virtual area in which to place the module, it cannot place changes over time, the size of the virtual area may change a module into an area belonging to a different application. dynamically over the running time of its application. The The conceptis transparentto the applied intermodulecom- mappingofvirtualareatophysicalreconfigurableareaisdone munication in each virtual area: Each application can choose by the control processor. Each hardware application being its preferred communication method (e.g., bus system, neigh- executed on the reconfigurable device has its own software bor to neighborcommunicationover bus macros). When only controlthread runningon this CPU, see Fig. 2. These threads contiguous virtual areas are used, the communicating mod- request the initially required area and transmit changes in the ules are grouped together automatically and communication requirements. An operating system service on the software overheadis minimized.Furthermore,communicationbetween side takes the requests and is in charge of the managementof different applications or heterogeneities of the reconfigurable 4 area is supported easily by extending the operating system. left or right, or at the bottom, and, depending where Yet there are many other approaches to schedule tasks of some unused area is available, it can decide to put a different applications, but in contrast to them our approach partial module there and instantiate some appropriate of virtual areas combinesa localization strategy, putting tasks communication module to transfer data there and back. whichbelongto one applicationtogether,andhighlydynamic Thus,ateachrun,anapplicationcanoperatedifferentlyin applications, which can individually take different module its amountof resource usage. Before, trade-off decisions selectionandplacementdecisionbasedonthecurrentcontext. where also possible for a single application, but the new idea is to let each application decide in its own control program its next steps depending on the behavior of the B. Advantages other applications. The idea of virtual area management offers the following advantages when running multiple applications on a single dynamically partially reconfigurable device. C. Challenges • Resource accounting: The concept can be easily used to First, there are some technical problems to be solved. For prevent applications from using too many resources at area position transparency, the placement of a module should runtime.Onemightwanttorestricttheresourcesgranted notbelimitedtoonesingleposition.Furthermore,theremight to an application for each execution or just in a certain be heterogeneitieson the reconfigurable device which require context in order. The goal is, for example, to reduce different implementation for different positions. A possible the power consumption or to accelerate more important solution to this problem is to generate implementations for applications in a system where certain applications have themoduleforeachpossiblepositionwherethecorresponding lower priorities than others. The concept of virtual area virtual area can be placed. The corresponding module imple- managementprovidesalimitationmechanismbygranting mentation bit-files can be compressed to save some space. justacertainamountof,forexample,reconfigurablearea Another option is to apply a single generated module bit-file to an application. and relocate this file; that is, adapt it to the corresponding • Resource protection: Each application controls only the position. We solved this technical issue in the following way: reconfiguration within its virtual area, as it can only Our experimental board is equipped with a reconfiguration specify those virtual positions that are valid in its own manager device that manipulates the corresponding bit-files VA. The virtual area management unit checks for each before the reconfiguration of the corresponding device. request if the virtual address is valid and translates it A further technical challenge lies in the communication intoaphysicalpositiononthereconfigurabledevice.This of the partial modules to the external periphery (e.g., video, way,noapplicationcanloadanymoduletotheareaofan- audio) over the input/output pins at the border of the FPGA. otherapplicationandtheapplicationsareprotected.Thus, Our experimental board has a crossbar device that routes I/O an error in one application cannot harm the execution signalsdynamicallyfromtheperipherytothe currentposition of all other applications and lead to a complete system of the partial module. breakdown. A larger challenge is to fulfill the changing resource re- • Support for differing scheduling and placement proce- quirements of the different applications. Every application dures: Different applications require differentscheduling has a different resource usage profile which depends on the and placement methods to increase performance.Instead inputs specified. An application called with a larger problem of designing complex scheduling algorithms that try to instance to solve needs also more resources. Additionally, meet all the requirements (e.g., periodic or aperiodic, the resource usage profile also depends on the execution withorwithoutdeadlines)byintroducingvariouspriority context. The application may behave differently depending classes, each application gets its own virtual area and on how many area resources are available at each position. specifies the best scheduling strategy. This may include Furthermore,theresourcerequirementdependingontheinputs a simpler and faster implementation of new hardware of an applicationmay or may not be knownin advance(or at applications within an existing system. least can be estimated, e.g., in a numerical application based • Area position transparency and programming dynamic ontherequiredprecisionofthesolutions).Theformeriscalled applications:Theabsolutepositions’independenceofthe the offline case, the latter the online case. executed modules, in the following called area position Intheonlinecase,adeadlockispossiblewhentwoapplica- transparency,offersanewprogrammingmodel.Itallows tions currently running on the reconfigurable device can only to write dynamic applications, which subject to the cur- proceed in their task when an resource increase is granted, rentresourcecontextdecideontheirownhowtoproceed. see Fig. 3. The blocks represent the occupied area of one Dependingontheassignedarea,anapplicationcaneither applicationatdifferentpointsintime.Theleftmostapplication use an implementation that offers more performance but needs two more area units, as does the application second needsmorearea,oranotheronethattakeslongerbutuses from the left. Furthermore, no application can give up its less area. Another new possibility is that the application accumulatedresources,assavingandrestoringstatesiswidely can decide how to increase and shrink depending on consideredtooexpensiveonFPGAs.Suchadeadlockmustbe the current area usage context. An application can ask prevented. A commonly used solution is to allow hardware the virtual area management unit, if it can grow to the task preemption: If not enough resources are available to 5 Fig. 3. Both the first and the second application (from the left), request Fig.5. Left:Scheduleoftheresourcedemandsonthereconfigurabledevice more area resources in order to proceed. A deadlock occurs, because no under the assumption that running applications cannot wait for a resource moreresources areavailable tomeetanyrequest. grant. Right: Schedule for the same demands, but with the option to delay resourcegrants. meetincreased resourcedemands,the state of one application option for the forth requestof the forth module,we achieve a is stored in external memory. At a later point in time, the better makespan. application is loaded again on the reconfigurable device and the state be restored againto continueits processing.Another approach is not to wait until a deadlock scenario actually III. PACKING APPLICATION MODULES happens, but to check beforehand that it cannot get this far: We consider the problem FPGATris: Scheduling modules Assumingthatthemaximalsizeofarequestisknown,anarea whose resource requests (i.e., space on an FPGA) varies over shared between two applicationsis grantedexclusivelyto just time. This may be, for example, a router module that needs one application, but not one part of it to the one application, more resources if the traffic increases. We assume that a and another part to the other application. module occupies a certain number of slots on the FPGA, but In the following, we consider the offline problem: The requests only complete slots. Thus, we model the FPGA as resource usage profiles for specific inputs of a series of a one-dimensionalarray.Furthermore,we assume that time is applications is known a priori, or can be estimated roughly. discrete;thatis,requestsaremultiplesofafixed-sizetimeslot. Anexampleofapplicationresourceprofilesisgivenin Fig.4. Now, scheduling a sequence of modules with time-varying Note that in the offline case deadlocks cannot occur: We resourcescorrespondsto strip packing:The width of the strip searchforfeasiblesolutions(i.e.,scheduleswherenoresource is the number of slots on the FPGA; the height corresponds requests overlap) only. to the time axis. Thus, we use heightandtime synonymously. Hardware task preemption can be used to increase the Moreover, we assume that every module occupies a base slot area usage. However, saving and restoring the states of the and extends to the left or to the right of the base slot.1. hardwareapplicationscanbeverycostlyin timeandmemory. We are allowed to delay a request; that is, we may stretch An approach that balances reconfiguration costs and efficient the modules along the time axis; see Fig. 6. Our goal is to resourceusageistoallowthatrequestsmaybedelayedbythe minimize the makespan (i.e., the time needed to fulfill all scheduler. The application keeps its currently occupied area, requests).Forthestrip-packingproblem,thisgoalcorresponds but remains idle until the request is fulfilled. Compare the to minimizing the height of the occupied part of the strip. two schedulesfor our example shown in Fig. 5: The schedule shown on the left hand side is a solution for scheduling the 1Forconvenience,weconsideronlythecaseofgrowingeithertotheleftor application modules without delaying requests. On the right totherightofthebaseslot.Thegeneralizationtobothsidesisstraightforward. hand side, we allow that requests are postponed. Using this time delayed 1 mi slot 1 N Fig.6. Wecanplacethemodulemi atthepositionmarkedby×(thebase Fig.4. Fiveapplicationsmodules,eachwithdifferentresourcedemandsover slot of mi), if we delay the third request. That is, the second request stays timefortheirspecificinputs. ontheFPGAuntilthethirdrequestcanbefulfilled. 6 A. An ILP b) BoundaryConstraints: Next,weensurethatarequest does not exceed the FPGA’s boundary by forcing all slot We are given a strip of width N (e.g., an FPGA with N assignmentvariablesthatwouldcauseaninfeasibleplacement slots and a time axis) and want to place M modules. Each to be zero. module, m , is given as a sequence of requests. Let ℓ denote i i the length of the request sequence for module m and (i,j) i the jth request of mi. The size of (i,j) is given by rij ∈ ∀i,j, s=slow,...,sup :xsi =0 (3) ZZ\{0},1≤j ≤ℓ . For r > 0 the module expands to the i ij right of the base slot, for r <0 to the left. where ij For the ILP, we introduce four kinds of variables: N −r +2, r >0 ij ij s := and • the slot assignment variables, xsi low (1, rij <0 • the time assignment variables, ytij • the occupancy variables, zstij s := N, rij >0 . up • the usage variables, ut (−rij −1, rij <0 The first two types specify when and where a request is scheduled. More precisely, setting x to 1 indicates that si c) Order Constraints: Now, we ensure that the process- module m is scheduled in slot s. Setting y to 1 indicates i tij ing order is maintained; that is, request (i,j) of m is not i that request (i,j) of module m is scheduled at time t. That i scheduledbeforerequest(i,j−1)is finished.For every(i,j) is, module m occupies the following slots at time t: i there is exactly one t such that y = 1. Thus, summing up tij t · y over t for fixed i and j yields the time step where s,...,s+r −1 for r >0, tij ij ij request (i,j) is scheduled. This yields: s+rij +1,...,s for rij <0, T T tytij − tytij−1 >0 ∀i,j >0. (4) where s is the base slot of module mi. For every i there is Xt=1 Xt=1 exactly one x with x = 1 and for every (i,j) there is si si exactly one y with y =1. tij tij d) Occupancy Constraints: If x = 1 and y = 1, si tij Usually (i.e., if |r | > 1) a request occupies more than ij the request (i,j) occupies r slots adjacent to s at time ij one slot when executed. Moreover, if the request (i,j+1) is t; see Fig. 7. The first step to prevent other modules from delayed, (i,j) remains on the FPGA for more than one time overlapping with m is to set the appropriate occupancy i unit (i.e., it occupies more the one time row). To keep track variables as follows. of the occupied slots, we set z to 1, if slot s is occupied stij ∀i=1,...,M, j =1,...,ℓ , s=1,...,N, t=1,...,T, i by request(i,j) at time t. The usage variablessimply specify ′ s =s ,...,s : which time steps are used. low up Clearly, the FPGA’s size, N, and the number of modules, xsi+ytij −zs′tij ≤1, (5) M, strongly determine the size of an ILP and, in turn, the time needed to solve it. In addition, we assume that an upper with bound, T, on the number of time steps is given. The closer this bound is to the optimum, the smaller the resulting ILP. s, r >0 ij This upper bound can be obtained, for example, using the slow := and (max{1,s+rij +1}, rij <0 tabu search in Sect. III-B4. min{N,s+r −1}, r >0 ij ij s := . up 1) Constraints: (s, rij <0 a) AssignmentConstraints: Eachrequestmustbesched- uled exactly once. That is, for every (i,j) we have to set exactlyonex to1(toassignaslotfor(i,j))andexactlyone si s =s s low up y (toassignastarttimefor(i,j)).Thefollowingconstraints tij r express these conditions: ij r >0 ij N s s=s low up x =1 ∀i=1,...,M, (1) s=1 si rij rij <0 X T ytij =1 ∀i=1,...,M, j =1,...,ℓi. (2) Fig.7. Occupancy Constraints: Ifxsi=1andytij =1,therequest(i,j) occupies rij slotsleftorrighttos—depending onsgn(rij). t=1 X 7 e) Exclusive Constraints: By setting the appropriate oc- S =(1,...,M) cupancy variables with the occupancy constraints, we can for i=0 to M/2 ensure that requests do not overlap. We allow at most one found =0 occupancy variable for a fixed slot and a fixed time to be 1. for j =1 to M ∀t=1,...,T, s=1,...,N : if (j,((j+i)modM)+1) are not in the tabu list Swap items at pos. j and ((j+i)modM)+1 in S M ℓi Calculate makespan of BestFit z ≤1 (6) stij if makespan is the best so far then i=1j=1 XX found =j Undo swapping f) DelayConstraints: Ifarequest(i,j+1)isdelayed,the if found >0 then preceding request (i,j) remains on the FPGA until (i,j+1) Swap pos. found and ((found +i)modM)+1 in S is scheduled. Thus, if zstij = 1 either zs(t+1)ij must be Store (found,((found+i)modM)+1) in the tabu list set to 1—because the module still occupies space on the FPGA—or (i,j + 1) is scheduled at time t + 1; that is, y = 1 holds. The following constraints keep track Fig.8. TabuSearch (t+1)i(j+1) of delayed requests. ∀i=1,...M, j =1,...,ℓ −1, 2) FirstFitwithdelays: Thismethodworksthesameasthe i s=1,...,N, t=1,...,T −1: method above, but allows the delaying of requests. That is, if fora certain startpositionthe requests0,...,j−1fit intothe zstij −zs(t+1)ij −y(t+1)i(j+1) ≤0 (7) stripwithoutoverlap,butrequestj doesnotfitintimestept′, we search for the largest j′ < j such that request (i,j′) fits in t′ and delay every request j′′ = j′+1,... (i.e., we move g) Usage Constraints: Finally, we introduce some con- them upwards in the strip); see Fig. 6. straints that define our usage variables.Let u be 1, if at least t 3) BestFit: Similar to FirstFit with delays, we try to find one y is 1 or if u is 1. tij t+1 a nonoverlapping position by testing every possible position. ∀t=1,...T, i=1,...,M, j =1,...,ℓ : i But now, we do not choose the first feasible position, but we u −y ≥0 (8) evaluate every position as follows: We separately count the t tij unoccupied cells left and right to the placed module and take and ∀t=2,...,T : the minimum of the these two values as a score for the given position.Forexample,forthe placementofm in Fig.6 there i ut−1−ut ≥0. (9) are4unoccupiedcellslefttomi and14unoccupiedcellsright to m , yielding a value of 4 for his placement. We choose i 2) ObjectiveFunction: To minimize the makespan,we use the position that yields the minimal score and break ties by the following ILP: preferring the (first) position with least number of delays. To avoid that every module is placed on the left or right T side (yieldinga scoreof 0),we maintainanupperlimit, t , max min iui subject to Eq. 1–9 (10) for the time. Before we place a new module,mi, we increase Xi=1 xsi ∈{0,1} tmax byℓi/2andtrytoplacemi withinthegiventimebound. If this is not possible, we increase t by ℓ and try again. ytij ∈{0,1} max i 4) TabuSearch: BestFit inserts the modules in the given z ∈{0,1} stij order(i.e.,m ,m ,...,m ). Obviously,theresultofBestFit 1 2 M ut ∈{0,1}. highly dependson the insertion order, so we may get a better resultif we permutethe insertion orderof the modules.Thus, weuseatabusearchtotryseveralBestFitruns,eachoneusing B. Heuristic Methods adifferentorderfortheinsertionofmodules.Startingwiththe We implementedseveralheuristicsforourproblem:A sim- sequenceS =(1,...,M),weswaptwoitemsofthesequence ple FirstFit with and withoutdelayingrequests,and two more and compute the makespan that is achieved by BestFit. More elaborated heuristics, BestFit and TabuSearch. Our methods precisely, we maintain a swapping distance, i, ranging from pack the given modules in a semi-infinite strip. The width of i=0 to M/2. For a fixed i, we swap the items at positions j the strip is given by the number of slots on the FPGA; the and ((j+i)modM)+1 for j = 1,...M, keeping track of height of the strip corresponds to the time axis. the best makespan achieved so far. We accept the swap that 1) FirstFit: Probably the simplest heuristic is to place the achieves the best makespan known so far. A tabu list ensures modules,onebyone,inafirst-fitwayintothestrip:beginning thatwedonotswapanalreadyacceptedpairagain;seeFig.8. withs=1andt=1,we testforeverypositionif themodule thatmustbeplacedoverlapswithalready-placedmodules.We C. Experimental Results choose the first position where no overlap occurs. Note that AnexampleinstanceandsolutionsareshowninFig.9.The we disregard the possibility of delaying requests. corresponding ILP was solved in approximately 6 hours on 8 m2 m1 m4 m3 m4 m4 m3 m2 m3 m4 m1 m2 m1 m1 m2 m1 m2 m3 m4 m3 Fig.9. Fromlefttoright:Exampleinputandpackingsgeneratedby(fromlefttoright)FirstFitwithoutdelays,FirstFitwithdelays,BestFit,andTabuSearch/ILP. Delayed modules are shown hatched. Note that m3 is packed by BestFit on top of m1, because this position has value 0 and fits into the strip of height tmax+ℓ3/2.TabuSearchswapsm3 andm4 intheinsertionorder. an Intel(R) Xeon(TM) 3.20GHz CPU running ILOG CPLEX IV. CONCLUSION 10.00 under Linux. Note that TabuSearch yields the same Reconfigurable devices increasingly offer enough comput- result in less than one second. ing resources, so that in the near future, multiple applications To test our heuristics, we conducted a set of experiments. maybeexecutedonthemconcurrently,insteadofjustasingle For upper limits on the size of a request, r , ranging from max application. However, until now there is a general lack of 10%to90%instepsof10,werandomlygeneratedsequences, research on how to successfully achieve secure and flexible each of 20 modules. For each value of r , we shuffled 20 max execution of multiple applications on dynamically partially sequencesasfollows:Foreverymodule,wechooseitslength, reconfigurabledevices. We present an approach called virtual ℓ , randomly,by normaldistribution with expected value µ= i areamanagement,whichisheavilyinfluencedbymultitasking 10 and variance σ2 = 5. The size for every request, r , was ij and operating system concepts of the software. Advantages chosen by normaldistribution, too, with an expected value of of our concept (e.g., support for accounting of resources, µ=r /2andvarianceσ2 =r /4.Wepresenttheresults max max resourceprotection,multipleschedulingandplacementstrate- for N =50, other FPGA widths showed similar results. gies, and a new programming model) are explained. Further- more,challengesposed bythisconceptandpossible solutions Heuristic Averagerunningtime are discussed. Afterwards, a specific approach to virtual area FirstFit 0.25s FFwithdelays 0.33s managementin the offline case is presented in more detail. It BestFit 2.09s is based on the assumption that most applications can handle TabuSearch 1125.06s alsoadelayedresourcegrant.Thecorrespondingoptimization TABLEI problem to minimize the total makespan is solved with an AVERAGERUNNINGTIMEFOROURHEURISTICS. ILP and heuristics. Both approaches are compared in an experimental study. The presented concept is a practicable solution to the TableIshowstheaveragerunningtimeoverallexperiments, considered important problem; the proposed methods can be Fig. 10showsthe meanvalueover20runsforeveryheuristic applied to nowadays available reconfigurable devices. We do and value of r . Fig. 11 shows mean-, maximal-, and max not rely on the assumption that saving and storing hardware minimal values for BestFit and TabuSearch. For comparison, taskstateswillnolongerbeconsideredastooexpensiveinthe Fig. 10 also shows an average lower bound computed as the future. Furthermore, programming models for reconfigurable smallest area needed to pack all requests; that is, architectures,resourceaccountingandprotection,bitstreamre- locationandpositionindependence,areallformidableresearch 1 M ℓi LB= r . problemsontheirown,howevertheconceptiscompatibleand ij N extendableto differentsolution approachesto these problems. i=1j=1 XX Choosingthebestsuitedstrategydependsonthescenario.For systemsthatrunthesamerequestsequenceonandonandthat REFERENCES are produced in a large number, it may even pay off to use [1] D. Koch, C. Haubelt, and J. Teich, “Efficient reconfigurable on-chip the ILP. Clearly, balancing computation time and quality, the busesforFPGAs,”inProc.16thAnnu.IEEESympos.Field-Programm. tabusearchisabetterchoice.Nevertheless,itrequiresthatthe CustomComput.Mach.,2008. requestsareknownbeforehand.Ifthisisnotgiven,BestFitcan [2] J.Angermeier,D.Go¨hringer,M.Majer,J.Teich,S.P.Fekete,andJ.V. der Veen, “The Erlangen Slot Machine — A platform for interdisci- be used, because it works in an offline scenario as well as in plinaryresearchindynamicallyreconfigurablecomputing,”Information an online setting. Technology, vol.49,pp.143–148,2007. 9 250 300 Lower bound Tabu search FirstFit Best Fit FF with delays BestFit Tabu Search 250 200 200 150 g g n n ki ki c c pa pa 150 of of ht ht g g ei 100 ei H H 100 50 50 0 0 10 20 30 40 50 60 70 80 90 0 10 20 30 40 50 60 70 80 90 100 Max. request size (%) Max. request size (%) Fig.10. Acomparisonofourheuristics—FirstFit (withoutdelays),FirstFit Fig. 11. Mean-, maximal-, and minimal value averaged over 100 runs per (with delays), BestFit, and TabuSearch—in settings with different densities densities forBestFitandTabuSearch. (i.e.,maximalvalue forarequest) averaged over100runsperdensities, For comparison,alowerbound(ratiooftotalareabynumberofslots)isshown. [3] C. Bobda, M. Majer, A. Ahmadinia, T. Haller, A. Linarth, J. Teich, tems:Fromdesignconceptstorealizations,”InProceedingsOfThe3rd S.Fekete,andJ.vanderVeen,“TheErlangenSlotMachine:AHighly International Conference On Engineering Of Reconfigurable Systems FlexibleFPGA-BasedReconfigurablePlatform,”IEEESymp.onFPGAs AndArchitectures (ERSA),pp.284–287,2003. andCustomComputing Machines (FCCM),pp.319–320,2005. [14] R. Cordone, F. Redaelli, M. A. Redaelli, M. D. Santambrogio, and [4] G. Brebner, “A virtual hardware operating system for the xilinx D. Sciuto, “Partitioning and scheduling of task graphs on partially XC6200,” in Field-Programmable Logic Smart Applications, New dynamically reconfigurable FPGAs,” IEEE Transact. Computer-Aided ParadigmsandCompilers, 1996,pp.327–336. DesignofIntegratedCircuitsandSystems,vol.28,pp.662–675,2009. [5] ——, “The swappable logic unit: a paradigm for virtual hardware,” in [15] F. Redaelli, M. D. Santambrogio, and D. Sciuto, “Task scheduling FPGAsfor Custom Computing Machines, 1997. Proceedings., The5th with configuration prefetching and anti-fragmentation techniques on AnnualIEEESymposium on,1997,pp.77–86. dynamically reconfigurable systems,” in Proc. 11th Conf. on Design, [6] K.Bazargan,R.Kastner,andM.Sarrafzadeh,“Fasttemplateplacement AutomationandTestinEurope,2008,pp.519–522. for reconfigurable computing systems,” Design & Test of Computers, [16] J. M. Cardoso, “On combining temporal partitioning and sharing of IEEE,vol.17,no.1,pp.68–83,2000. functional units in compilation for reconfigurable architectures,” IEEE [7] C. Steiger, H. Walder, M. Platzner, and L. Thiele, “Online scheduling Transact.Comput.,vol.52,pp.1362–1375, 2003. andplacementofreal-timetaskstopartiallyreconfigurable devices,”in [17] S. Banerjee, E. Bozorgzadeh, and N. D. Dutt, “Integrating physical Real-TimeSystemsSymposium,2003.RTSS2003.24thIEEE,2003,pp. constraintsinHW-SWpartitioningforarchitectureswithpartialdynamic 224–225. reconfiguration,” IEEETransact.VeryLargeScaleIntegration Systems, [8] O.Diessel,H.ElGindy,M.Middendorf, H.Schmeck,andB.Schmidt, vol.14,pp.1189–1202, 2006. “Dynamic scheduling of tasks on partially reconfigurable FPGAs,” [18] B.S.Baker,E.G.C.Jr.,andR.L.Rivest,“Orthogonalpackingsintwo ComputersandDigitalTechniques,IEEEProceedings-,vol.147,no.3, dimensions,”SIAMJ.Comput.,vol.9,pp.846–855,1980. pp.181–188, 2000. [19] C. Kenyon and E. Re´mila, “Approximate strip packing,” in Proc. 37th [9] J. Resano, D. Mozos, and F. Catthoor, “A hybrid prefetch scheduling Annu.IEEESympos.Found.Comput.Sci.,1996,pp.31–36. heuristictominimizeatrun-timethereconfigurationoverheadofdynam- [20] B. S. Baker and J. S. Schwarz, “Shelf algorithms for two-dimensional ically reconfigurable hardware [multimedia applications],” in Design, packingproblems,”SIAMJ.Comput.,vol.12,pp.508–525, 1983. AutomationandTestinEurope,2005.Proceedings,2005,pp.106–111. [21] J. Csirik and G. J. Woeginger, “Shelf algorithms for on-line strip [10] S. Banerjee, E. Bozorgzadeh, and N. Dutt, “Physically-aware HW- packing,” Inform.Process.Lett.,vol.63,pp.171–175, 1997. SW partitioning for reconfigurable architectures with partial dynamic [22] R.Breukelaar,E.D.Demaine,S.Hohenberger,H.J.Hoogeboom,W.A. reconfiguration,”inProceedingsofthe42ndannualDesignAutomation Kosters, and D. Liben-Nowell, “Tetris is hard, even to approximate,” Conference. ACM,2005,pp.335–340. Internat. J.Comput. Geom.Appl.,vol.14,pp.41–68,2004. [11] C.Steiger,H.Walder,andM.Platzner,“Operatingsystemsforreconfig- [23] Y.Azar andL.Epstein, “Ontwodimensional packing,” J. Algorithms, urableembeddedplatforms:onlineschedulingofreal-timetasks,”IEEE vol.25,pp.290–310,1997. Transactions onComputers, vol.53,no.11,pp.1393–1407, 2004. [24] E.Coffman,Jr.,P.J.Downey,andP.Winkler,“Packingrectangles ina [12] G.WigleyandD.Kearney,“Thedevelopmentofanoperatingsystemfor strip,”ActaInform.,vol.38,pp.673–693,2002. reconfigurable computing,” inField-Programmable Custom Computing [25] S. P. Fekete, T. Kamphans, and N. Schweer, “Online square packing,” Machines,2001.FCCM’01.The9thAnnualIEEESymposiumon,2001, inProc.20thAlgorithmsandDataStructureSymposium(WADS),2009, pp.249–250. pp.302–314. [13] H. Walder and M. Platzner, “Reconfigurable hardware operating sys-