UvA-DARE (Digital Academic Repository), University of Amsterdam. Published version: Pimentel, A. D. (2008). The Artemis workbench for system-level performance evaluation of embedded systems. International Journal of Embedded Systems, 3(3), 181-196. https://doi.org/10.1504/IJES.2008.020299

The Artemis Workbench for System-level Performance Evaluation of Embedded Systems
Andy D. Pimentel, Member, IEEE Computer Society

Abstract: In this article, we present an overview of the Artemis workbench, which provides modelling and simulation methods and tools for efficient performance evaluation and exploration of heterogeneous embedded multimedia systems. More specifically, we describe the Artemis system-level modelling methodology, including its support for gradual refinement of architecture performance models as well as for calibration of the system-level models. We show that this methodology allows for architectural exploration at different levels of abstraction while maintaining high-level and architecture-independent application specifications. Moreover, we illustrate these modelling aspects using a case study with a Motion-JPEG application.

Index Terms: System-level modelling and simulation, model refinement, performance evaluation, design space exploration, model calibration.

I. INTRODUCTION

Designers of modern embedded systems are faced with a number of emerging challenges. Because embedded systems are mostly targeted for mass production and often run on batteries, they should be cheap to realise and power efficient. In addition, these systems increasingly need to support multiple applications and standards, for which they should provide high, and sometimes real-time, performance. For example, digital televisions or mobile devices have to support different standards for communication and coding of digital contents. On top of this, modern embedded systems should be flexible so that they can easily be extended to support future applications and standards. Such flexible support for multiple applications calls for a high degree of programmability. However, performance requirements and constraints on cost and power consumption require substantial parts of these systems to be implemented in dedicated hardware blocks. As a result, modern embedded systems often have a heterogeneous system architecture, i.e., they consist of components ranging from fully programmable processor cores to fully dedicated hardware components for the time-critical application tasks. Increasingly, such heterogeneous systems are integrated on a single chip. This yields heterogeneous multi-processor Systems-on-Chip (SoCs) that exploit task-level parallelism in applications.
The heterogeneity of modern embedded systems and the varying demands of their target applications greatly complicate the system design. It is widely agreed upon that traditional design methods fall short for the design of these systems, as such methods cannot deal with the systems' complexity and flexibility. This has led to the notion of a new design methodology, namely system-level design. Below, we briefly describe three important ingredients of system-level design approaches.

1) Platform architectures. Platform-based design [46], [52] has become a popular design method as it stresses the re-use of IP (Intellectual Property) blocks. In this design approach, a single hardware platform is used as a "hardware denominator" that is shared across multiple applications in a given domain and is accompanied by a range of methods and tools for design and development. This increases production volume and reduces cost compared to customising a chip for every application.

2) Separation of concerns. To further improve the potential for re-use of IP and to allow for effective exploration of alternative design solutions, it is widely recognised that the "separation of concerns" is a crucial component in system-level design [22]. Two common types of separation in the design process are: i) separating computation from communication by connecting IP processing cores via a standard (message-passing) network interface [5], and ii) separating application (what is the system supposed to do) from architecture (how it does it) [3], [23].

3) High-level modelling and simulation early in the design. In system-level design, designers already start with modelling and simulating (possible) system components and their interactions in the early design stages [43]. More specifically, system-level models typically represent application behaviour, architecture characteristics, and the relation (e.g., mapping, hardware-software partitioning) between application(s) and architecture. These models do so at a high level of abstraction, thereby minimising the modelling effort and optimising the simulation speed needed for targeting the early design stages. This high-level modelling allows for early verification of a design and can provide estimations of the performance (e.g., [3], [43]), power consumption (e.g., [6]) or cost [13] of the design.

(A. D. Pimentel is with the Dept. of Computer Science of the University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands. He is project coordinator of the Artemis project. Email: [email protected])
Design space exploration plays a crucial role in system-level design of embedded system (platform) architectures. Due to the systems' complexity, it is imperative to have good performance evaluation tools for efficiently exploring a wide range of design choices during the early design stages. In this article, we present an overview of the Artemis workbench, which provides high-level modelling and simulation methods and tools for efficient performance evaluation and exploration of heterogeneous embedded multimedia systems [43]. More specifically, we describe the Artemis system-level modelling methodology, which deploys the aforementioned principle of separation of concerns [22], and particularly focus on the support for gradual refinement of architecture performance models as well as on the calibration of system-level models. We show that this methodology allows for architectural exploration at different levels of abstraction while maintaining high-level and architecture-independent application specifications. Furthermore, we illustrate these modelling aspects using a case study with a Motion-JPEG encoder application.

The remainder of the article is organised as follows. The next section provides a bird's eye view of the Artemis workbench, briefly discussing the different tool-sets that are integrated in the workbench. Section III focuses on Artemis' system-level modelling and simulation techniques by describing its prototype modelling and simulation environment, called Sesame. In Section IV, we explain how Artemis facilitates gradual refinement of system-level architecture models by applying dataflow graphs. Section V illustrates the discussed modelling and refinement techniques using a case study with a Motion-JPEG encoder application. This section also demonstrates how our system-level models can be calibrated with low-level implementation information using an automated component calibration approach. Finally, Section VI discusses related work, after which Section VII concludes the article.

II. THE ARTEMIS WORKBENCH

The Artemis workbench consists of a set of methods and tools conceived and integrated in a framework to allow designers to model applications and SoC-based (multiprocessor) architectures at a high level of abstraction, to map the former onto the latter, and to estimate performance numbers through co-simulation of application and architecture models. Figure 1 depicts the flow of operation for the Artemis workbench, where the grey parts refer to the various tool-sets that together embody the workbench. The point of departure is an application domain (being multimedia applications for Artemis), an experimental domain-specific platform architecture, and a domain-specific application specified as an executable sequential program. The platform architecture is instantiated in the architecture model layer of the workbench, while the application specification is converted to a functionally equivalent concurrent specification using a translator called Compaan [14], [49], [51]. More specifically, Compaan transforms the sequential application specification into a Kahn Process Network (KPN) [21]. In between the application and architecture layers there is a mapping layer. This mapping layer provides means to perform quantitative performance analysis on different levels of abstraction, and to refine application specification components between levels of abstraction. Such refinement is required to match application specifications to the level of detail of the underlying architecture models. Effectively, the mapping layer bridges the gap between the application and architecture (models), sometimes referred to as the implementation gap [35]. In the next sections, we will elaborate on all of the above modelling, mapping, and refinement aspects in more detail.

Fig. 1. The infrastructure of the Artemis workbench.

Because Artemis operates at a high level of abstraction, low-level component performance numbers can be used to calibrate the system-level architecture models. To this end, individual processes (i.e., code segments) of a KPN application specification are taken apart and implemented as individual low-level components (that appear in the current high-level instance of the platform architecture). This results in performance numbers, as well as in estimations of cost and power consumption, for the low-level components, which the system-level modelling framework needs to provide accurate performance estimations for the multi-processor system architecture as a whole.
For this calibration process, the Artemis workbench uses the Laura tool-set [49], [60] and the Molen calibration platform architecture [53]-[55]. Before presenting more details on Artemis' system-level modelling and simulation techniques, which form the bulk of this article, the remainder of this section first takes a closer look at the Compaan and Laura tool-sets as well as the Molen calibration platform.¹

¹ Here we should note that although a significant amount of work has been performed on Compaan, Laura and Molen in the context of the Artemis project, including the integration of these research efforts into a single framework, they do not have their origin in Artemis.

A. The Compaan and Laura tool-sets

Today, traditional imperative languages like C, C++ or Matlab are still dominant with respect to implementing applications for SoC-based (platform) architectures. It is, however, very difficult to map these imperative implementations, with typically a sequential model of computation, onto multi-processor SoC architectures that allow for exploiting task-level parallelism in applications. In contrast, models of computation that inherently express task-level parallelism in applications and make communications explicit, such as CSP [20] and Process Networks [21], [31], allow for easier mapping onto multi-processor SoC architectures. However, specifying applications using these models of computation usually requires more implementation effort in comparison to sequential imperative solutions.

In Artemis, we use an approach in which we start from a sequential imperative application specification, more specifically an application written in a subset of Matlab, which is then automatically converted into a Kahn Process Network (KPN) [21] using the Compaan tool-set [14], [49], [51]. This conversion is fast and correct by construction. In the KPN model of computation, parallel processes communicate with each other via unbounded FIFO channels. Reading from channels is done in a blocking manner, while writing to channels is non-blocking. We decided to use KPNs for application specifications because they nicely fit with the targeted media-processing application domain and they are deterministic. The latter implies that the same application input always results in the same application output, irrespective of the scheduling of the KPN processes. This provides us with a lot of scheduling freedom when, as will be discussed later on, mapping KPN processes onto SoC architecture models for quantitative performance analysis.
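To make these KPN semantics concrete, the following self-contained C++ sketch shows a producer and a consumer connected by an unbounded FIFO channel with a blocking read and a non-blocking write. It is only an illustration of the model of computation described above; the channel class and the process bodies are invented for this example and are not part of the Compaan or Sesame interfaces.

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    // Unbounded FIFO channel with KPN semantics: writes never block,
    // reads block until a token is available.
    template <typename T>
    class KahnChannel {
    public:
        void write(const T& token) {                  // non-blocking write
            std::lock_guard<std::mutex> lock(m_);
            fifo_.push(token);
            cv_.notify_one();
        }
        T read() {                                    // blocking read
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !fifo_.empty(); });
            T token = fifo_.front();
            fifo_.pop();
            return token;
        }
    private:
        std::queue<T> fifo_;
        std::mutex m_;
        std::condition_variable cv_;
    };

    int main() {
        KahnChannel<int> channel;

        // Producer process: emits eight tokens.
        std::thread producer([&] {
            for (int i = 0; i < 8; ++i) channel.write(i * i);
        });

        // Consumer process: blocks on read() until each token arrives.
        std::thread consumer([&] {
            for (int i = 0; i < 8; ++i) std::printf("received %d\n", channel.read());
        });

        producer.join();
        consumer.join();
        return 0;
    }

Because the consumer only ever blocks on an empty channel, the printed token sequence is the same for every possible thread interleaving, which is exactly the determinism property that provides the scheduling freedom mentioned above.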
The infrastructure of the Compaan tool-set is illustrated on the left-hand side of Figure 2. The grey parts refer to the separate tools that are part of Compaan, while the white parts refer to the (intermediate) formats of the application specification. The starting point is an application specification in Matlab, which needs to be specified as a parameterised static nested loop program. Recently, Compaan's scope has been extended to also include weakly-dynamic nested loop programs that allow for specifying data-dependent behaviour [47]. On these Matlab application specifications, various source-level transformations can be applied in order to, for example, increase or decrease the amount of parallelism in the final KPN [48]. In a next step, the Matlab code is transformed into single assignment code (SAC), which resembles the dependence graph (DG) of the original nested loop program. Hereafter, the SAC is converted to a Polyhedral Reduced Dependency Graph (PRDG) data structure, being a compact mathematical representation of a DG in terms of polyhedra. Finally, a PRDG is converted into a KPN by associating a KPN process with each node in the PRDG. The parallel KPN processes communicate with each other according to the data dependencies given in the DG.

Fig. 2. The Compaan (left) and Laura (right) tool-sets.

The Laura tool-set [49], [60], depicted on the right-hand side of Figure 2, takes a KPN as input and produces synthesizable VHDL code that implements the application specified by the KPN for a specific FPGA platform. To this end, the KPN specification is first converted into a functionally equivalent network of conceptual processors, called the hardware model.
This hardware model, which is platform independent as no information on the target FPGA platform is incorporated, defines the key components of the architecture and their attributes. It also defines the semantic model, i.e., how the various components interact with each other. Subsequently, platform-specific information is added to the hardware model. This includes the addition of IP cores that implement certain functions in the original application as well as setting attributes of components such as bit-width and buffer sizes. In the final step, the hardware model is converted into VHDL. To do so, Laura supplies a piece of VHDL code for each component in the hardware model that expresses how to represent that component in the target architecture. Using commercial tools, the VHDL code can then be synthesized and mapped onto an FPGA. As can be seen in Figure 2, the results from this automated implementation trajectory can be fed back to Compaan to explore different transformations that will, in the end, lead to different implementations.

B. The Molen calibration platform

Figure 3 depicts the platform architecture that is used for component calibration in Artemis. This platform architecture, called Molen [53]-[55], connects a programmable processor with a reconfigurable unit and uses microcode to incorporate architectural support for the reconfigurable unit. Instructions are fetched from the memory, after which the arbiter performs a partial decoding on the instructions to determine where they should be issued [27]. Those instructions that have been implemented in fixed hardware are issued to the core processing (CP) unit, which is one of the PowerPCs from a Xilinx Virtex II Pro platform in the Molen prototype implementation [25], while instructions for custom execution are redirected to the reconfigurable unit. The instructions entering the CP unit are further decoded and then issued to their corresponding functional units. The reconfigurable unit consists of a custom configured unit (CCU), currently implemented by the Xilinx Virtex II Pro FPGA, and an rμ-code unit. The reconfigurable unit performs operations that can be as simple as an instruction or as complex as a piece of code describing a certain function.

Fig. 3. The Molen calibration platform architecture.
Molen divides an operation into two distinct phases: set and execute. The set phase is responsible for reconfiguring the CCU hardware, enabling the execution of an operation. Such a phase may be subdivided into two sub-phases, namely partial-set (p-set) and complete-set (c-set). The p-set phase covers common functions of an application or set of applications. Subsequently, the c-set sub-phase only reconfigures those blocks in the CCU which are not covered in the p-set sub-phase, in order to complete the functionality of the CCU. To perform the actual reconfiguration of the CCU, reconfiguration microcode is loaded into the rμ-code unit and then executed (using p-set and c-set instructions) [26]. Hereafter, the execute phase is responsible for the operation execution on the CCU, performed by executing the execution microcode. Important in this respect is the fact that both the set and execute phases do not explicitly specify a certain operation to be performed. Instead, the p-set, c-set and execute instructions point to the memory location where the reconfiguration or execution microcode is stored.

The Compaan and Laura tool-sets in combination with the Molen platform architecture provide great opportunities for the previously discussed calibration of system-level architecture models. For this purpose, Laura maps a specific component from an application specification to a hardware implementation by converting the Compaan-generated KPN associated with the application component to a VHDL implementation. This VHDL code is subsequently used as reconfiguration microcode for Molen's CCU, while the remainder of the application specification (i.e., the code that has not been synthesized to a hardware implementation) is executed on Molen's core processor. As a result, the application component mapped onto the CCU provides low-level implementation numbers that can be used to calibrate the corresponding component in the system-level architecture model. In Section V, we present a case study in which this component calibration is illustrated for a DCT task in a Motion-JPEG encoder application.

III. THE SESAME MODELLING AND SIMULATION ENVIRONMENT

Artemis' system-level modelling and simulation environment, called Sesame [11], [44], builds upon the ground-laying work of the Spade framework [34]. This means that Sesame facilitates performance analysis of embedded systems architectures according to the Y-chart design approach [3], [23], recognising separate application and architecture models within a system simulation. An application model describes the functional behaviour of an application, including both computation and communication behaviour. An architecture model defines architecture resources and captures their performance constraints. After explicitly mapping an application model onto an architecture model, they are co-simulated via trace-driven simulation. This allows for evaluation of the system performance of a particular application, mapping, and underlying architecture. Essential in this modelling methodology is that an application model is independent from architectural specifics, assumptions on hardware/software partitioning, and timing characteristics. As a result, a single application model can be used to exercise different hardware/software partitionings and can be mapped onto a range of architecture models, possibly representing different system architectures or simply modelling the same system architecture at various levels of abstraction. The layered infrastructure of Sesame is shown in Figure 4.

Fig. 4. Sesame's application model layer, architecture model layer, and mapping layer which interfaces between application and architecture models.

A. Application modelling

For application modelling [12], Sesame uses KPN application specifications that are generated by the Compaan tool-set or that have been derived by hand. The computational behaviour of an application is captured by instrumenting the code of each Kahn process with annotations that describe the application's computational actions. The reading from or writing to Kahn channels represents the communication behaviour of a process within the application model. By executing the Kahn model, each process records its actions in order to generate its own trace of application events, which is necessary for driving an architecture model. These application events typically are coarse grained, such as execute(DCT) or read(pixel-block, channel_id).
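The fragment below sketches what such instrumentation can look like for a DCT-like process: alongside its (omitted) functional code, the process appends coarse-grained read, execute and write events to a trace. The TraceWriter class and the event strings are hypothetical and only mimic the kind of annotations described above; they are not Sesame's actual trace API.

    #include <cstdio>
    #include <string>
    #include <vector>

    // Hypothetical trace recorder: each Kahn process owns one and appends
    // its coarse-grained application events to it while executing.
    struct TraceWriter {
        std::vector<std::string> trace;
        void record(const std::string& event) { trace.push_back(event); }
    };

    // Sketch of an instrumented DCT process: functional code is omitted and
    // only the event annotations that later drive an architecture model remain.
    void dct_process(int num_blocks, TraceWriter& tw) {
        for (int b = 0; b < num_blocks; ++b) {
            tw.record("read(pixel_block, in_channel)");    // communication event
            // ... read an 8x8 block from the input Kahn channel ...
            tw.record("execute(DCT)");                      // computation event
            // ... perform the 2D-DCT on the block ...
            tw.record("write(dct_block, out_channel)");     // communication event
            // ... write the transformed block to the output channel ...
        }
    }

    int main() {
        TraceWriter tw;
        dct_process(2, tw);                                 // trace for two blocks
        for (const std::string& e : tw.trace) std::printf("%s\n", e.c_str());
        return 0;
    }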
To execute Kahn application models, and thereby generate the application events that represent the workload imposed on the architecture, Sesame features a process network execution engine supporting Kahn semantics. This execution engine runs the Kahn processes as separate threads using the Pthreads package. Currently, the Kahn processes need to be written in C++, but C and Java support is also planned for the future. To allow for rapid creation and modification of models, the structure of the application models (i.e., which processes are used in the model and how they are connected to each other) is not hard-coded in the C++ implementation of the processes, but instead, it is described in a language called YML (Y-chart Modelling Language) [11]. This is an XML-based language which is similar to Ptolemy's MoML [30] but is slightly less generic in the sense that YML only needs to support a few simulation domains. As a consequence, YML supports a subset of MoML's features. However, YML provides one additional feature in comparison to MoML as it contains built-in scripting support. This allows for loop-like constructs, mapping and connectivity functions, and so on, which facilitate the description of large and complex models. In addition, it enables the creation of libraries of parameterised YML component descriptions that can be instantiated with the appropriate parameters, thereby fostering re-use of component descriptions. To simplify the use of YML even further, a YML editor has also been developed to compose model descriptions using a GUI. Figure 5 gives an impression of the YML editor's GUI, showing its layered layout that corresponds to the three layers of Sesame (see Figure 4), namely the application model layer, mapping layer and architecture model layer.

Fig. 5. A screenshot of the YML editor, showing its layered layout that corresponds to the three layers of Sesame (see Figure 4).

B. Architecture modelling

Architecture models in Sesame, which typically operate at the so-called transaction level [10], [19], simulate the performance consequences of the computation and communication events generated by an application model. These architecture models solely account for architectural performance constraints and do not need to model functional behaviour. This is possible because the functional behaviour is already captured in the application models, which subsequently drive the architecture simulation. An architecture model is constructed from generic building blocks provided by a library, which contains template performance models for processing cores, communication media (like busses) and various types of memory. The structure of architecture models, specifying which building blocks are used from the library and the way they are connected, is also described in YML.
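The sketch below illustrates what "accounting only for performance consequences" can mean in practice: a processing component that maps incoming event names to latencies and merely advances a local simulated clock, without doing any functional work. The component and its latency values are invented for illustration; they do not correspond to Sesame's actual Pearl or SystemC building blocks.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>

    // A purely timing-oriented processor model: it computes nothing and only
    // advances simulated time according to a per-event latency table.
    class ProcessorModel {
    public:
        explicit ProcessorModel(std::map<std::string, long> latencies)
            : latency_(std::move(latencies)) {}

        // Simulate the timing consequences of one application event.
        void handle(const std::string& event) {
            long cycles = latency_.count(event) ? latency_.at(event) : 1;
            now_ += cycles;
            std::printf("t=%6ld  simulated %s (%ld cycles)\n",
                        now_, event.c_str(), cycles);
        }

        long now() const { return now_; }

    private:
        std::map<std::string, long> latency_;   // performance constraints only
        long now_ = 0;                          // local simulated time (cycles)
    };

    int main() {
        // Illustrative latency numbers; in practice these come from calibration.
        ProcessorModel cpu({{"execute(DCT)", 4096},
                            {"read(pixel_block)", 64},
                            {"write(dct_block)", 64}});

        for (const char* ev : {"read(pixel_block)", "execute(DCT)", "write(dct_block)"})
            cpu.handle(ev);

        std::printf("total simulated time: %ld cycles\n", cpu.now());
        return 0;
    }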
Sesame's architecture models are implemented using either Pearl [38] or SystemC [1], [19]. Pearl is a small but powerful discrete-event simulation language which provides easy construction of the models and fast simulation [44]. For our SystemC architecture models, we provide an add-on library to SystemC, called SCPEx (SystemC Pearl Extension) [50], which extends SystemC's programming model with Pearl's message-passing paradigm and which provides SystemC with YML support. SCPEx raises the abstraction level of SystemC models, thereby reducing the modelling effort required for developing transaction-level architecture models and making the modelling process less prone to programming errors.

C. Mapping

To map Kahn processes (i.e., their event traces) from an application model onto architecture model components, and to support the scheduling of application events from different event traces when multiple Kahn processes are mapped onto a single architecture component (e.g., a programmable processor), Sesame provides an intermediate mapping layer. This layer consists of virtual processor components and FIFO buffers for communication between the virtual processors. There is a one-to-one relationship between the Kahn processes in the application model and the virtual processors in the mapping layer. This is also true for the Kahn channels and the FIFO buffers in the mapping layer, except for the fact that the latter are limited in size. Their size is parameterised and dependent on the modelled architecture. As the structure of the mapping layer closely resembles the structure of the application model under investigation, Sesame provides a tool that is able to automatically generate the mapping layer from the YML description of an application model.

A virtual processor in the mapping layer reads in an application trace from a Kahn process via a trace event queue and dispatches the events to a processing component in the architecture model. The mapping of a virtual processor onto a processing component in the architecture model is freely adjustable, facilitated by the fact that the mapping layer and its mapping onto the architecture model are also described in YML (and manipulated using the YML editor, see Figure 5). Communication channels, i.e., the buffers in the mapping layer, are also mapped onto the architecture model. In Figure 4, for example, one buffer is placed in shared memory² while the other buffer is mapped onto a point-to-point FIFO channel between processors 1 and 2.

² The architecture model accounts for the modelling of bus activity (arbitration, transfers, etc.) when accessing this buffer.

The mechanism with which application events are dispatched from a virtual processor to an architecture model component guarantees deadlock-free scheduling of the application events from different event traces. In this mechanism, computation events are always directly dispatched by a virtual processor to the architecture component onto which it is mapped. The latter schedules incoming events that originate from different event queues according to a given policy and subsequently models their timing consequences. The event scheduling supports a range of predefined policies, like FCFS and round-robin, but can also easily be customised with alternative policies. For communication events, a virtual processor first consults the appropriate buffer at the mapping layer to check whether or not a communication is safe to take place so that no deadlock can occur. Only if it is found to be safe (i.e., for read events the data should be available and for write events there should be room in the target buffer) may communication events be dispatched to the processor component in the architecture model. As long as a communication event cannot be dispatched, the virtual processor blocks. This is possible because the mapping layer executes in the same simulation as the architecture model. Therefore, both the mapping layer and the architecture model share the same simulation-time domain. This also implies that each time a virtual processor dispatches an application event (either computation or communication) to a component in the architecture model, the virtual processor is blocked in simulated time until the event's latency has been simulated by the architecture model.
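A minimal sketch of this dispatch rule is given below, assuming bounded mapping-layer buffers with simple room/data queries. The class and function names are invented; the only point being illustrated is that computation events are forwarded immediately, whereas communication events are dispatched only when the buffer state makes them safe, the virtual processor blocking otherwise.

    #include <cstddef>
    #include <cstdio>
    #include <deque>

    // Bounded FIFO buffer in the mapping layer.
    class MappingBuffer {
    public:
        explicit MappingBuffer(std::size_t capacity) : capacity_(capacity) {}
        bool has_room() const { return fifo_.size() < capacity_; }
        bool has_data() const { return !fifo_.empty(); }
        void push(int token)  { fifo_.push_back(token); }
    private:
        std::deque<int> fifo_;
        std::size_t capacity_;
    };

    enum class Event { Compute, Read, Write };

    // Decide whether a virtual processor may dispatch an event to the
    // architecture model. In the real simulator "not safe" means the virtual
    // processor suspends in simulated time; here it is only reported.
    bool safe_to_dispatch(Event e, const MappingBuffer& in, const MappingBuffer& out) {
        switch (e) {
        case Event::Compute: return true;             // always dispatched directly
        case Event::Read:    return in.has_data();    // data must be available
        case Event::Write:   return out.has_room();   // room must be available
        }
        return false;
    }

    int main() {
        MappingBuffer input(4), output(1);
        output.push(0);                               // output buffer already full

        for (Event e : {Event::Compute, Event::Read, Event::Write})
            std::printf("%s\n", safe_to_dispatch(e, input, output)
                                    ? "dispatched to architecture model"
                                    : "virtual processor blocks");
        return 0;
    }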
When architecture model components are gradually refined to include more implementation details, the virtual processors at the mapping layer are also refined. The latter is done with dataflow graphs such that it allows us to perform architectural simulation at multiple levels of abstraction without modifying the application model. Figure 4 illustrates this dataflow-based refinement by refining the virtual processor for process B with a fictive dataflow graph. In this approach, the application event traces specify what a virtual processor executes and with whom it communicates, while the internal dataflow graph of a virtual processor specifies how the computations and communications take place at the architecture level. In the next section, we provide more insight on how this refinement approach works by explaining the relation between trace transformations for refinement and dataflow actors at the mapping layer.

In a closely related project, called Archer [58], an alternative mapping approach is studied. That is, while Archer and Sesame share a lot of the same application and architecture modelling techniques, Archer uses a different mapping strategy than Sesame. In Archer, Control Data Flow Graphs (CDFGs) [59] are taken as a basis. However, as the CDFG notation is too complex for design space exploration, the CDFGs are lifted to a higher abstraction level, called Symbolic Programs (SPs) [57]. The SPs, which in Archer are automatically derived from a KPN application specification, are CDFG-like representations of the Kahn processes. They contain control constructs like CDFGs, but unlike CDFGs, they are not directly executable, since SPs only contain symbolic instructions (i.e., application events) and no real code. Therefore, SPs need extra information for execution to determine the control flow within an SP, which is supplied in terms of control traces. These control traces are generated by running the application with a particular set of data. At the architecture layer, SPs are executed with the control traces to generate event traces which are subsequently used to drive the resources in the architecture model. Like Sesame, Archer also supports the refinement of architecture models. It does so by transforming application-level SPs into architecture-level SPs [56].

D. Mapping support using multi-objective optimisation

To facilitate effective design space exploration, Sesame provides some (initial) support for finding promising candidate application-to-architecture mappings to guide a designer during the system-level simulation stage. To this end, we have developed a mathematical model that captures several trade-offs faced during the process of mapping [15]. In this model, we take into account the computational and communication demands of an application as well as the properties of an architecture, in terms of computational and communication performance, power consumption, and cost. The resulting trade-offs with respect to performance, power consumption and cost are formulated as a multi-objective combinatorial optimisation problem. Using an optimisation software tool, which is based on a widely-known evolutionary algorithm [61], the mapping decision problem is solved by providing the designer with a set of approximated Pareto-optimal mapping solutions that can be further evaluated using system-level simulation. For a more detailed description of this mapping support, the interested reader is referred to [15].
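To make the multi-objective formulation slightly more tangible, the sketch below scores a few candidate mappings on three objectives and keeps only the non-dominated ones. The numbers are fabricated for illustration, and the exhaustive dominance filter shown here merely stands in for the evolutionary search used by the actual tool [61].

    #include <cstdio>
    #include <vector>

    // One candidate application-to-architecture mapping, scored on three
    // objectives that should all be minimised.
    struct Candidate {
        const char* name;
        double time, power, cost;
    };

    // a dominates b if it is no worse in every objective and better in at least one.
    bool dominates(const Candidate& a, const Candidate& b) {
        bool no_worse = a.time <= b.time && a.power <= b.power && a.cost <= b.cost;
        bool better   = a.time <  b.time || a.power <  b.power || a.cost <  b.cost;
        return no_worse && better;
    }

    int main() {
        std::vector<Candidate> mappings = {
            {"all tasks on the CPU",         9.0, 2.0, 1.0},
            {"DCT on the FPGA, rest on CPU", 4.2, 2.6, 1.8},
            {"all tasks on the FPGA",        3.9, 4.5, 3.5},
            {"DCT on the FPGA (variant)",    5.0, 3.0, 2.0},   // dominated
        };

        for (const Candidate& m : mappings) {
            bool dominated = false;
            for (const Candidate& other : mappings)
                if (dominates(other, m)) dominated = true;
            if (!dominated)
                std::printf("Pareto candidate: %s (t=%.1f, p=%.1f, c=%.1f)\n",
                            m.name, m.time, m.power, m.cost);
        }
        return 0;
    }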
IV. ARCHITECTURE MODEL REFINEMENT THROUGH TRACE TRANSFORMATIONS

Refining architecture model components in Sesame requires that the application events driving them should also be refined to match the architectural detail. Since we aim at a smooth transition between different abstraction levels, re-implementing or transforming (parts of) the application models for each abstraction level is undesirable. Instead, Sesame only maintains application models at a high level of abstraction (thereby optimising the potential for re-use of application models) and bridges the abstraction gap between application models and underlying architecture models at the mapping layer. As will be explained in this section, bridging the abstraction gap is accomplished by refining the virtual processors in the mapping layer with dataflow actors that transform coarse-grained application events into finer-grained events at the desired abstraction level, which are subsequently used to drive the architecture model components [16], [17], [42]. In other words, the dataflow actors consume external input (dataflow) tokens that represent high-level computational and communication application events and produce external output tokens that represent the refined architectural events associated with the application events.

Refinement of application events is denoted using trace transformations [33], in which the left-hand side contains the coarse-grained application events that need to be refined and the right-hand side the resulting architecture-level events. In these transformations, the '⟹' symbol denotes the refinement itself, while '→' symbols denote the "followed by" ordering relation. To give an example, the following trace transformations refine R(ead) and W(rite) application events such that the synchronisations are separated from actual data transfers [33]:

    R  ⟹  cd → ld → sr        (1)
    W  ⟹  cr → st → sd        (2)

Here, the refined architecture-level events check-data*, load-data†, signal-room*, check-room*, store-data†, and signal-data* are abbreviated as cd, ld, sr, cr, st, and sd, respectively. The events marked with * refer to synchronisations while those marked with † refer to data transmissions. The above refinements allow for, for example, moving synchronisation points or reducing their number when a pattern of application events is transformed [33], [42]. Consider, for example, an application process that reads a block of data from an input buffer, performs some computation on it, and writes the results to an output buffer. This would generate an "R → E → W" application-event pattern, in which the E(xecute) refers to the computation on the block of data. Assuming that this application process is mapped onto a processing component that does not have local storage but operates directly on its input and output buffers, we need the following trace transformation:

    R → E → W  ⟹  cd → cr → ld → E → st → sr → sd        (3)

In the refined event sequence, we check early, using the check-room (cr), if there is room in the output buffer before fetching the data (ld) from the input buffer, because the processing component cannot temporarily store results locally. In addition, the input buffer must remain available until the processing component has finished operating on it (i.e., after writing the results to the output buffer). Therefore, the signal-room (sr) is scheduled after the st.

In Sesame, Synchronous Data Flow (SDF) [29] actors are deployed to realise trace transformations. Integer-controlled DataFlow (IDF) [8] actors are subsequently utilised to model repetitions and branching conditions which may be present in the application code [16]. However, as illustrated in [42], they may also be used within static transformations to achieve less complicated (in terms of the number of actors and channels) dataflow graphs.

Refining application event traces by means of dataflow actors works as follows. For each Kahn process at the application layer, an IDF graph is synthesized at the mapping layer and embedded in the corresponding virtual processor. As a result, each virtual processor is equipped with an abstract representation of the application code from its corresponding Kahn process, similar to the concept of Symbolic Programs from [58]. Sesame's IDF graphs consist of static SDF actors (due to the fact that SDF is a subset of IDF) embodying the architecture events that are the, possibly transformed, representation of application events at the architecture level. In addition, to capture the control behaviour of the Kahn processes, the IDF graphs also contain dynamic actors for conditional jumps and repetitions. The IDF graphs are executable as the actors have an execution mechanism called firing rules, which specify when an actor can fire. When firing, an actor consumes the required tokens from its input token channels and produces a specified number of tokens on its output channels. A special characteristic of our IDF graphs is that the SDF actors are tightly coupled with the architecture model components. This means that a firing SDF actor may send a token to the architecture model to initiate the simulation of an event. The SDF actor in question is then blocked until it receives an acknowledgement token from the architecture model indicating that the performance consequences of the event have been simulated within the architecture model. To give an example, an SDF actor that embodies a write event will block after firing until the write has been simulated at the architecture level.
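The sketch below renders this firing/acknowledgement handshake in a sequential form: an actor fires by consuming its input tokens, forwards its event to the architecture model, and is considered blocked until the model returns the simulated latency, after which output tokens are produced. The interfaces are invented for illustration; in Sesame the exchange takes place between concurrently executing simulation processes rather than through a direct function call. The three actors correspond to the refined Read event of transformation (1).

    #include <cstdio>
    #include <deque>
    #include <string>

    // Stand-in for the architecture model side of the token exchange: it
    // "simulates" an event and answers with the simulated latency, which plays
    // the role of the acknowledgement token here.
    struct ArchitectureModel {
        long simulate(const std::string& event) {
            return (event == "ld") ? 16 : 1;          // illustrative latencies
        }
    };

    // An SDF actor embodying one architecture-level event, with at most one
    // input and one output token channel (enough for this sketch).
    struct SdfActor {
        std::string event;
        std::deque<int>* input;
        std::deque<int>* output;

        bool can_fire() const { return input == nullptr || !input->empty(); }

        void fire(ArchitectureModel& arch) {
            if (input) input->pop_front();            // consume required tokens
            long latency = arch.simulate(event);      // blocked until acknowledged
            std::printf("actor %s acknowledged after %ld cycles\n",
                        event.c_str(), latency);
            if (output) output->push_back(1);         // produce output tokens
        }
    };

    int main() {
        ArchitectureModel arch;
        std::deque<int> cd_to_ld, ld_to_sr;

        SdfActor cd{"cd", nullptr,   &cd_to_ld};      // check-data
        SdfActor ld{"ld", &cd_to_ld, &ld_to_sr};      // load-data
        SdfActor sr{"sr", &ld_to_sr, nullptr};        // signal-room

        for (SdfActor* actor : {&cd, &ld, &sr})
            if (actor->can_fire()) actor->fire(arch);
        return 0;
    }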
In IDF graphs, scheduling information of IDF actors is not incorporated into the graph definition but is explicitly supplied by a scheduler. This scheduler operates on the original application event traces in order to schedule our IDF actors. The actor scheduling can be done either in a semi-static or a dynamic manner. In dynamic scheduling, the application and architecture models are co-simulated using a UNIX IPC-based interface to communicate events from the application model to the scheduler. As a consequence, the scheduler only operates on a window of application events, which implies that the IDF graphs cannot be analysed at compile-time. This means that, for example, it is not possible to decide at compile-time whether an IDF graph will complete its execution in finite time, or whether the execution can be performed with bounded memory. Alternatively, we can also schedule the IDF actors in a semi-static manner. To do so, the application model should first generate the entire application traces and store them into trace files (if their size permits this) prior to the architectural simulation. This static scheduling mechanism is a well-known technique in Ptolemy [9] and has been proven to be very useful for system simulation [8]. However, in Sesame, it does not yield a fully static scheduling. This is because, as was previously explained, our SDF actors have a token exchange mechanism with the underlying architecture model, yielding some dynamic behaviour.

We also intend to investigate whether or not our IDF graphs can be specified as so-called well-behaved dataflow graphs [18]. In these well-behaved dataflow graphs, dynamic actors are only used as part of two predefined clusters of actors, known as schemas, that allow for modelling conditional and repetitive behaviour. The resulting graphs have, as opposed to regular IDF graphs, many of the same attractive properties with respect to static analysis as graphs composed only of SDF actors.

To illustrate how IDF graphs are constructed and applied for event refinement, we use an example taken from a Motion-JPEG encoder application we studied in [32] and [44]. Figure 6 shows the annotated C++ code for the Quality-Control (QC) Kahn process of the M-JPEG application. The QC process dynamically computes the tables for Huffman encoding as well as those required for quantising each frame in the video stream, according to the image statistics and the obtained compression bitrate of the previous video frame.
    while(1) {
        read(in_NumOfBlocks, NumOfBlocks);
        // code omitted
        write(out_TablesInfo, LumTablesInfo);
        write(out_TablesInfo, ChrTablesInfo);
        switch(TablesChangeFlag) {
            case HuffTablesChanged:
                write(out_HuffTables, LumHuffTables);
                write(out_HuffTables, ChrHuffTables);
                write(out_Command1, OldTables);
                write(out_Command2, NewTables);
                break;
            case QandHuffTablesChanged:
                // code omitted
            default:
                write(out_Command1, OldTables);
                write(out_Command2, OldTables);
                break;
        }
        // code omitted
        for(int i = 1; i < (NumOfBlocks/2); i++) {
            // code omitted
            read(in_Statistics, Statistics);
            execute("op_AccStatistics");
            // code omitted
        }
    }

Fig. 6. An annotated C++ code fragment taken from a Quality-Control (QC) task in a Motion-JPEG encoder application.

In Figure 7, an IDF graph for the QC process is given, realising a high-level (unrefined) simulation. That is, the architecture-level events embodied by the SDF actors (depicted as circles) directly represent the application-level R(ead), E(xecute) and W(rite) events. The SDF actors drive the architecture model components by the aforementioned token exchange mechanism, although Figure 7 depicts neither the architecture model nor the token exchange channels, for the sake of simplicity. Also not shown are the token channels to and from the IDF graphs of the neighbouring virtual processors with which the process communicates. For example, the R(ead) actors are in reality connected to a W(rite) actor from a remote virtual processor in order to signal when data is available and when room is available. The IDF actors CASE-BEGIN, CASE-END, REPEAT-BEGIN, and REPEAT-END model conditional and repetition structures that are present in the application code. The scheduler reads the application trace from the Kahn process in question and executes the IDF graph by scheduling the IDF actors accordingly, sending the appropriate control tokens. In Figure 7, there are (horizontal) dotted token channels between the SDF actors, denoting dependencies. Adding these token channels to the graph results in sequential execution of architecture-level events, while removing them will allow for exploiting parallelism by the underlying architecture model. Like all models in Sesame, the structure of our IDF graphs is also described using YML.

Fig. 7. IDF graph representing the QC task from Figure 6, realising high-level (unrefined) simulation at architecture level.
In Figure 8, an IDF graph for the QC process is shown that implements the aforementioned communication refinement, in which the application-level R(ead) and W(rite) events are refined such that the synchronisation and data-transfer parts become explicit. The computational E(xecute) events remain unrefined in this example. We again omitted the token channels to/from the IDF graphs of neighbouring virtual processors in Figure 8, but in reality cd actors have, for example, an incoming token channel from an sd actor of a remote IDF graph. By firing the refined SDF actors (cd, cr, etc.) in the IDF graph according to the order in which they appear on the right-hand side of a trace transformation (see for example transformation (3), noting that the right-hand side may also be specified as a partial ordering [17], [33]), this automatically yields a valid schedule for the IDF graph [16]. Here, we also recall that the level of parallelism between the architecture-level events is specified by the presence or absence of token channels between SDF actors. To conclude, communication refinement is accomplished by simply replacing SDF actors with refined ones, allowing us to evaluate the performance of different communication behaviours at architecture level while the application model remains unaffected. As shown in [17], and as we will demonstrate in the next section, this approach also allows for refining computational behaviour.

Fig. 8. IDF graph for the QC task realising communication refinement.

The IDF-based refinement approach also permits mixed-level simulations, in which only parts of the architecture model are refined while the other parts remain at the higher level of abstraction. This will be demonstrated in the next section too. These mixed-level simulations enable performance evaluation of a specific architecture component within the context of the behaviour of the whole system. They therefore avoid the need for building a completely refined architecture model during the early design stages. Moreover, mixed-level simulations do not suffer from deteriorated system evaluation efficiency caused by unnecessarily refined parts of the architecture model.

V. MOTION-JPEG CASE STUDY

This section presents an experiment that illustrates some of the important aspects of Artemis' flow of operation as depicted in Figure 1. More specifically, using the Motion-JPEG encoder application from the previous section, we demonstrate how component calibration can be performed by means of the Compaan/Laura tool-sets and the Molen platform. Furthermore, we describe the system-level modelling and simulation aspects of the M-JPEG experiment, emphasising the IDF-based architecture model refinement that was performed.
In the experiment, we selected the DCT task from the M-JPEG application for calibration. This means that the DCT task is taken "all the way down" to a hardware implementation in order to study its low-level performance aspects. To do so, the following steps were taken, which are integrally shown in Figure 9. The DCT was first isolated from the sequential M-JPEG code and used as input to the Compaan tool-set. Subsequently, Compaan generated a KPN application specification for the DCT task. This DCT KPN is internally specified at pixel level but has in- and output tasks that operate at the level of pixel blocks, because the original M-JPEG application specification also operates at this block level. Using the Laura tool-set, the KPN for the DCT task was translated into a VHDL implementation, in which for example the 2D-DCT component is implemented as a 92-stage pipelined IP block, which can be mapped onto the FPGA (i.e., CCU) of the Molen platform. By mapping the remainder of the M-JPEG code onto Molen's core processor (CP), we were able to study the hardware DCT implementation in the context of the M-JPEG application. As will be explained later, the results of this exercise have been used to calibrate our system-level architecture modelling. Although it is out of scope for this article, it might be worth mentioning that the M-JPEG encoder with FPGA-implemented DCT obtained a speedup of 2.14, out of a maximum attainable theoretical speedup of 2.5, in comparison to a full software implementation.

Fig. 9. Calibration of the DCT task using the Compaan/Laura frameworks and the Molen calibration platform.

For the system-level modelling and simulation part of the experiment, we decided to model the Molen calibration platform architecture itself. This gives us the opportunity to actually validate our performance estimations against the real numbers from the implementation. The resulting system-level Molen model contains two processing components (Molen's CP and CCU) which are bi-directionally connected using two uni-directional FIFO buffers. Like in the real Laura-to-Molen mapping, we mapped the DCT Kahn process from our M-JPEG application model onto the CCU component in the architecture model, whereas the remaining Kahn processes were mapped onto the CP component. We also decided to refine the CCU in our architecture model such that it models the pixel-level DCT implementation used in the Compaan/Laura implementation. The CP component in our architecture model was not refined, implying that it operates (i.e., models timing consequences) at the same (pixel-block) level as the application events it receives from the application model. Hence, this yields a mixed-level simulation. We would also like to stress that the M-JPEG application model was not changed for this experiment. This means that the application events for the CCU component, referring to DCT operations on entire pixel blocks, needed to be refined to pixel-level events. In addition, at the architecture model level, the execution of these pixel-level events needed to be modelled according to the pipelined execution semantics of the actual implementation. This is because the Preshift and 2D-DCT blocks in the Laura-generated implementation are pipelined units.

According to what was explained in the previous section, we accomplished the refinement of the CCU component in the architecture model by refining the virtual processor associated with the DCT Kahn process in the mapping layer, as this is the virtual processor that is mapped onto the CCU component.
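As a rough illustration of what such pipelined execution semantics imply for the refined CCU model, the snippet below estimates the latency of pushing one pixel block through a deep pipeline. It assumes 8x8 pixel blocks and an initiation interval of one pixel per cycle; both are simplifying assumptions made for this example rather than figures taken from the Laura-generated implementation.

    #include <cstdio>

    // First-order latency model of a pipelined unit: the first pixel needs
    // 'depth' cycles to traverse the pipeline, and every further pixel adds one
    // more cycle (an initiation interval of 1 is assumed).
    long pipelined_latency(long pixels, long depth) {
        return depth + (pixels - 1);
    }

    int main() {
        const long pixels_per_block = 8 * 8;   // assumed 8x8 pixel blocks
        const long dct_depth = 92;             // 92-stage 2D-DCT pipeline (Section V)

        std::printf("one block through the 2D-DCT:  %ld cycles\n",
                    pipelined_latency(pixels_per_block, dct_depth));
        std::printf("two back-to-back blocks:       %ld cycles\n",
                    pipelined_latency(2 * pixels_per_block, dct_depth));
        return 0;
    }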
