Power and Execution Time Measurement Methodology for SDF Applications on FPGA-based MPSoCs Christof Schlaak Maher Fakih Ralf Stemmer OFFISInstitutefor OFFISInstitutefor UniversityofOldenburg, InformationTechnology InformationTechnology Germany Oldenburg,Germany Oldenburg,Germany ralf.stemmer@uni- christof.schlaak@offis.de maher.fakih@offis.de oldenburg.de 7 1 ABSTRACT in modern designs and their power consumption may have 0 Timing and power consumption play an important role in a major impact on the overall system power consumption. 2 the design of embedded systems. Furthermore, both prop- Especially for mobile battery powered computers the main n ertiesaredirectlyrelatedtothesafetyrequirementsofmany fractionoftheoverallpowerconsumptionisduetocomplex a embeddedsystems. Withregardtoavailabilityrequirements, processing elements (radio, main processor, graphics accel- J powerconsiderationsareofuttermostimportanceforbattery- erator) [6]. operated systems. Validation of timing and power requires 3 Therearemainlytwoapproachesfortheestimationofpower observability of these properties. In many cases this is dif- 1 consumption of MPSoCs: Analytical and empirical meth- ficult, because the observability is either not possible or re- ods. By the analytical methods a mathematical model of quires big extra effort in the system validation process. In ] the device under test (DUT) is constructed where the com- C this paper, we present a measurement-based approach for ponents influencing the DUT power are modeled (e.g. cir- D thejointtimingandpoweranalysisofSynchronousDataflow cuit switching activity) and an analysis is made to obtain (SDF) applications running on a shared memory multipro- . estimates for average/peak power consumption. In empiri- s cessor systems-on-chip (MPSoC) architecture. As a proof- calmethods,powerconsumptionismeasureddirectlyonthe c of-concept, we implement an MPSoC system with config- [ urable power and timing measurement interfaces inside a hardwareforsingledevicese.g. fortheprocessor,orforthe whole MPSoC. FieldProgrammableGateArray(FPGA).Ourexperiments 1 demonstrate the viability of our approach being able of ac- PowerestimationofanMPSoCisnotaneasytaskduetothe v 9 curately analyzing different mappings of image processing lackofobservability. Inmanycases,powermeasurementcan 0 applications (SobelfilterandJPEGencoder)onanFPGA- onlybeperformedattheMPSoC’spowerrailinputsandno 7 based MPSoC implementation. direct relationship between the running software and mea- suredpowerconsumptioncanbeestablished. Themeasured 3 CCSConcepts powerconsumptionconsistsofastaticandadynamicpart. 0 •Computersystemsorganization→Systemonachip; . The static power contribution depends on parameters that 1 Embedded hardware; are fixed at MPSoC design time (chip area, used technol- 0 1 Introduction ogy and process variation/corner) and dynamic properties 7 that can be externally controlled (supply voltage, ambient 1 Low power consumption and meeting real-time require- temperature). In this paper, we assume static power to be v: mentsarekeyissuesinembeddedsystemsdesign. Withthe constant. Althoughthisisanoversimplification,thecontrol i growingcomputationaldemandofnowadaysapplicationsin of static power consumption is out of the scope of this pa- X the automotive, avionics and multimedia domain, the size per. The dynamic power contribution (i.e. switching activ- r andcomplexityofembeddedsystemsbasedonmultiproces- ity) is completely application and data dependent [13] and a sor platforms (MPSoCs) is increasing and thus leading to isaffectedbymanyfactors. E.g.,thesoftwarefunctionality, high power consumption. MPSoCs are used ubiquitously mappingofsoftwaretaskstoprocessors,softwarescheduling, communicationbetweentasksandtheresultingcommunica- ThisworkhasbeenpartiallysupportedbytheARAMISIIproject tionandcomputationresourceutilization. Inthispaper,we (01|S16025J), which is funded by the German Federal Ministry of ResearchandEducation(BMBF),andbytheSAFEPOWERproject will focus on the measurement of the dynamic power con- withfundingfromtheEuropeanUnion’sHorizon2020researchand sumption. innovationprogrammeundergrantagreementNo646531. Inthispaper,wepresentameasurement-basedapproachfor time and power analysis of multiple Synchronous Dataflow Permissiontomakedigitalorhardcopiesofallorpartofthisworkfor (SDF) applications running on an MPSoC implemented on personalorclassroomuseisgrantedwithoutfeeprovidedthatcopies an FPGA. As application model we use the SDF model of arenotmadeordistributedforprofitorcommercialadvantageandthat copiesbearthisnoticeandthefullcitationonthefirstpage. Tocopy computation (MoC) [12], because it offers a strict separa- otherwise, or republish, to post on servers or to redistribute to lists, tion of communication and computation. An SDF applica- requirespriorspecificpermissionand/orafee. tion consists of computation kernels called actors (see top HIP3ES’17,January25,2017,Stockholm,Sweden ofFig.1),andcommunicationchannelsfollowingtheFIFO- Copyright2017ACM978-1-4503-3738-0. concept. Theexecutionofanactorhasthreephases: a)The http://dx.doi.org/10.1145/0000000.0000000 read phase where all data are read from all incoming chan- nels; b) The computation phase, where data is processed; FPGAandasetofSDFGsareexecuted. Ahardwaretimer c) The write phase, where the actor writes the output data intheFPGAmeasurestheperiodsoftheSDFapplications, into the FIFO-buffers of all outgoing channels. This allows sothatthepredictionofthemeasurementcanbecompared. ustoanalyzethecommunicationandcomputationtimeand This approach has many similarities with the measurement power consumption of our application separately. Our ap- concept of this work. However, it only addresses the mea- proach comes with the following contributions: surement of time; a combined power measurement is not 1. We present a measurement concept which allows a considered. In addition, the measurement of the execution minimalinvasivetimingandpoweranalysisoftheSDF time refers only to periods, rather than more fine grained applications at different granularity levels e.g. for sin- levels (e.g. measuring the delay of the actor phases). gle phases (Write, Compute, Read) of an actor, for The work in [7] dealt with the effects of parameters such single SDFGs and for the whole system application, as the number of processors and the clock frequency on the 2. We integrate a low-cost measurement board (Mageec performance and the power consumption of FPGA-based [2])withacustomizedmeasurementcontroller(imple- MPSoCs. Forthecharacterizationofthevarioussystemde- mentedintheFPGA)torealizealow-costimplemen- signs,atimerIPthatisconnectedviaasharedbuswiththe tation of our flexible measurement concept, allowing soft processors (MicroBlazes) captured the execution time. power and time analysis of a given implementation in The power consumption is not measured but estimated via an automated way. Xilinx Power Estimator (XPE) [3] tool. The impact of the software is not covered and a detailed analysis of an SDF 2 RelatedWork application (for example, the actor phases) is not possible. Various approaches utilize measurement-based methods for In[13]acycle-by-cycleenergymeasurementinFPGAsbased the direct evaluation of FPGA power consumption or the on switched capacitor is presented. With the help of this validation of power models (estimations). approach, a high resolution of the measured energy values Schreineretal. [18]usedanFPGAboardwithanintegrated (every20ns)wasachieved. Anothermeasurementapproach measuringelectronicsforthispurpose. Itcapturesthepower presented in [21] also achieves a high resolution and is ca- consumptionbutwithalowsamplingrateof6,25Hz which pableofmeasuringSWapplicationswithdetailedgranular- istoosmalltoenableadetailedanalysisofshortsub-phases ities when running on FPGAs. In difference to the work of the software to be measured. above, our approach introduces a hardware component on Ahigh-levelpowermodelforFPGA-basedMPSoCswaspre- theFPGAthatflexiblytriggerthepower/timemeasurement sented in [16]. For the evaluation of this model, the power of the annotated running application modeled as SDFGs. consumption is measured and compared with the predicted Nevertheless, it is possible to use the measurement infras- values. The measurement was made with the help of the tructurefrom[21]combinedwithourmeasurementconcept Virtex-6 FPGA with its integrated electronics that stores to analyze SDF applications. the measured power values in internal measurement regis- To the best of our knowledge, no similar measurement ap- ters (with a sampling rate of 5 Khz). proach was found which enables measuring the execution The work in [8, 9] exploits the clean semantics of SDF ap- time and power consumption of multiple SDF applications plicationstoapplypower-gatingtoreducetheirenergycon- at different granularity levels running on an FPGA-based sumptionwhenrunningonFPGAs. Theyusetheon-board MPSoC. Especially the usage of a low-cost measurement measurementdevicesoftheXilinxZC702boardformeasur- board (in our case the Mageec-board with hardware costs ing the power consumption and evaluating the efficiency of around 50e) for measuring the power consumption of an their approach. FPGA-based MPSoCs is novel. Alsotheworkin[5]suggestsanaccuratepowerconsumption 3 MeasurementConcept measurement utilizing the on-board power monitors which can be found in some FPGAs for e.g. the ZC702 Xilinx When analyzing an SDF application in detail, the level of board. Typically, these on-board power monitors have a measurement granularity is important. Timing and power low sampling rate to perform detailed analysis of software analysis of both sub-phases of an application behavior, as applications. Fore.g. oneofthemostmodernXilinxFPGA wellasitsoverallbehaviorisrelevant. Wedefinefourgran- boards ZC702 [10] samples the power rails with the help of ularitylevels(fromcoarsetofine)thatshouldbesupported theintegratedpowercontrollerTexasInstrumentsUCD9248 by our approach: The System-level granularity which is in- [1] every 200 µs (i.e., at a frequency of 5kHz). These sam- dependent of time. At this level, all SDFGs are repeatedly pling rates are too low to measure the power consumption executed, while the (average) power is measured over a pe- of short phases of software execution on the MPSoC. riod of time. On the SDFG-level granularity, all SDFGs of In[15]alaboratorypowersupplywithbuilt-inmeasuringde- theapplicationareanalyzedoneafteranother,givinginfor- vice (Keithley SourceMeter 2400) is used. In [4, 11, 14, 20] mationaboutthetiming(throughputorend-to-endlatency) oscilloscopes connected to shunt resistors are used to mea- and power usage of every SDFG1. Next, the actors of each sure power consumption. Using oscilloscopes or specialized SDFG can be analyzed at the actor-level granularity. Here, powermeasurementdevicecanobtainaccurateresultswith the analysis results are useful to analyze optimal actors to high resolutions but having the disadvantage of high costs. tile mapping. Furthermore, the phase-level granularity pro- Schabbiret. al[19]presentedadesignflowtogeneratemul- 1IntermsofSDFGs,aniteration iscompletedwhentheini- tiprocessorplatformsformultipleSDFGs. Inthisflow,mod- tialtokensdistributiononallitschannelsisrestored. Having elsforperformancepredictionareusedtoobtainroughesti- theiterationinmind,measurementscanbeguidedtotrace mates of the periods of the SDF implementations. To eval- the activations of the actors leading to an iteration of the uate these predictions, the MPSoC is implemented on an SDFG. customized measurement controller (see Fig. 3). When the 9 1 9 GX 1 executionisfinished,theysendstopsignals. Thus,themea- surement controller is able to recognize the relevant phase get ABS and to trigger the measurement process at the right time. Pixel In general, the aim is to keep the invasiveness caused by 9 GY 1 the code instrumentation of our approach as little as pos- 9 1 sible. This can be done by keeping the delay time of the instructions, needed to be executed on a processor to trig- tile 1 tile 2 tile 3 tile 4 ger the measurement control unit, minimal. Another goal private private private private is to make the communication between processor and mea- memory memory memory memory surement controller as fast as possible to immediately start andstopthe(power)measurementandatthesametimeto system bus avoid contention between concurrently accessing processors when triggering the measurement controller. In order to achieve above goals, we instrument the original shared memory source code of the SDF application with start and stop statements that control the measurement with the help of aminimalsetofinstructions(whereeachcontrolstatement Figure1: User-specifiedexampleinputsfortheanal- costs merely 2 cycles of delay on a MicroBlaze processor). ysis: A Sobel filter modeled as a SDFG (top) repre- Depending on the current scenario, the placement of these senting the software application. An MPSoC (bot- statementsinthesourcecodevaries(seeSect.CodeInstru- tom)withfourprocessorsandasharedmemorycon- mentation). Of course, the execution time of the annotated nectedthroughasharedsystembus. Eachprocessor code, when run on a target processor, is now delayed in has its own private memory. A mapping of the ac- different ways for each scenario compared to the unmodi- tors on the processors (matching colors and dotted fied original application code. Consequently, the applica- lines) and of the channels on the shared memory tion’stimingbehaviorisnotthesameforthemeasurements (matching color green) and the real use-case, which is undesirable for a measure- . ment approach. We coped with these timing variations by creating delay statements (consisting of NOPs (No Opera- tion)), that take the same amount of time as the measure- videsthemostdetailedanalysisforeveryread,computeand ment control statements, when executed. After every mea- writephaseofeveryactor. Again,thesemeasurementshelp surement, these delay statements replace the measurement optimizingtheactortotilemapping,consideringtheactor’s controlstatementsautomatically,enforcingthesametiming computationandcommunicationdemands. Inordertosup- behavior in the target application as the annotated one in port the analysis of SDF applications at the above-defined themeasurementscenarios. Thiswaytheannotationaffects levels,ourtimingandpoweranalysismeasurementapproach theapplicationequallyduringmeasurementandinthereal requires the following inputs (see also Fig. 1): use case. Certainly, by doing this, the timing behavior of • a specified granularity level, in which the application the original application has been changed. However, as we will be explored, will in Sect. Evaluation these changes are of little account. • an MPSoC as a hardware platform implementation, There are many ways to notify the measurement controller • one or more SDFGs implemented as a software appli- about the beginning or ending code blocks to be measured. cation, Inourmeasurementconcept,wesuggestthattheprocessors • and a Mapping of the SDFG parts to the MPSoC re- use their peripheral buses to send triggers to the measure- sources. ment control unit. On an FPGA based hardware platform, OurmeasurementapproachanalyzestheDUTaccordingto each processor can be configured such that it has an ex- the chosengranularityandassigns timingandpower values clusive access to its own peripheral bus to avoid contention toeachoftheidentifiedcodeblocks(actorphases,wholeac- causedbyconcurrentaccessesofmultipleprocessors. When tors,iterations). Forthis,eachblockismeasuredseparately using COTS (Component-Of-The-Shelf) MPSoC as target and several times in a measurement scenario. By repeat- platform, we do not necessarily have the flexibility reserve ingeachmeasurement,wecantracebest,worstandaverage a peripheral bus uniquely for every processor. In that case, timing and power results. Obviously, the measurement of the worst-case delay, which is raised by simultaneous bus best- and worst-case values do not necessarily provide any accessesoftheparticipants(e.g. processors),mustbetaken guarantee on lower and upper bounds. Extensions towards into account. In this paper, we focused on dealing with a full measurement data collection (histogram) and calcu- FPGA based MPSoCs. Nevertheless, the concept can be lation of the measurement variance are possible, but not in used as well for COTS MPSoCs/ASICS, by applying some the scope of this work. modifications. 3.1 Controllingthemeasurement 3.2 CodeInstrumentation In order to tag the relevant phase of the application (e.g. theWritephaseofanactor),thecorrespondingapplication Whenpreparingthesourcecodeoftheapplicationforanaly- codeshouldbeinstrumentedwithmeasurementcontrolsig- sis,thefirststepistoinsertdelaystatementsaroundrelated nals. Before executing the relevant part of the application blocksofcode,asitcanbeseeninFig. 2. Dependingonthe code, the processors of the MPSoC send start signals to a granularity level, the size of these blocks varies. After that, original phase actor SDFG-level FPGA computer P measurement iw nchiotiSmleDp(rFuuGtnen(_)ig;negtP) i{xel(); iw nshittaiSlreDt((rF)u;Gn(n)i;ng) { iw nshittaiSlreDt((rF)u;Gnn()i;ng) { iswntahitriSlte(D)(;rFuGn(n)i;ng) { sy roc. 1 P-bus 1 system rPeoswuletsr } wPriteE_getP ix1el(); } cswdtoeormlipate(py)_(u;)gt;ee_tPgiextePli(x);el(); } cwstoormipte(p)_u;gtee_tPgiextePli(x)e;l(); } cwormitep_ugtee_tPgiextePli(x)e;l(); MPSoC stem bus .Proc.. P-bus n sctmooenmpatwersonualrtlteec-rh MBaogaeredc- DEixsTseicipmuatetiioonn initSDFG(); initSDFG(); initSDFG(); initSDFG(); . n while(running) { while(running) { while(running) { while(running) { read_GX(); delay(); delay(); read_GX(); compute_GX(); read_GX(); read_GX(); compute_GX(); write_GX(); delay(); compute_GX(); write_GX(); Figure 3: Measurement-based approach imple- } compute_GX(); write_GX(); } delay(); delay(); mentedonanFPGAconnectedtotheMageec power PE 2 write_GX(); } measuring board. The timing is measured on the delay(); } FPGA, whereas the power is analyzed with the ex- ternal Mageec-Board. Both parts send their results initSDFG(); initSDFG(); initSDFG(); initSDFG(); while(running) { while(running) { while(running) { while(running) { to the measurement host computer. read_GY(); delay(); delay(); read_GY(); compute_GY(); read_GY(); read_GY(); compute_GY(); write_GY(); delay(); compute_GY(); write_GY(); } compute_GY(); write_GY(); } delay(); delay(); The measurement must start, when the first source actor PE 3 write_GY(); } fires. This is achieved by making each processor send a delay(); } ’startmeasurement’signal,beforeitexecutesasourceactor initSDFG(); initSDFG(); initSDFG(); initSDFG(); of the considered SDFG. Whenever the measurement con- while(running) { while(running) { while(running) { while(running) { trollerreceivesastartsignal,themeasurementgetsstarted. read_ABS(); delay(); delay(); read_ABS(); compute_ABS(); read_ABS(); read_ABS(); compute_ABS(); Anyfurtherstartsignalsareignoreduntilthemeasurement } delay(); compute_ABS(); stop(); compute_ABS(); delay(); } is stopped and can restart again. The end of the measure- PE 4 delay(); } ment is recognized by stop signals. Processors send a stop } signal when they finish the execution of their last sink ac- tor of the considered SDFG. The measurement controller Figure 2: Original application code and annotated counts these incoming signals and stops the measurement code on different levels of granularity for a Sobel only when every processor that fires some sink actors has filter application on a hardware platform with four indicated the end of the sink actors’ computation. Due to processing elements (see Fig. 1). As an example for thefact,thatthenumberofrequiredstopsignalsvariesfrom a particular measurement scenario, two delay state- an SDFG to another, this parameter must be configured in ments have been replaced by start and stop signals themeasurementcontrollerbeforethemeasurementbegins. (bold) on each granularity level. For this purpose, we implemented a software API for con- figuring the measurement controller (auto-restart, number of stops etc., number of measurements etc.). Since the con- eachmeasurementscenarioiscreatedbyreplacingtwocon- figuration is done before the actual measurement starts, it secutive delay statements with start and stop statements. does not affect the measurement results (e.g. delay). Hence, the code block in between is measured. If multiple levels of granularity are required for a detailed For evaluation, we automated this code annotation process analysisofanSDFapplication,thecodemustbeannotated withtheaidofascript,whichtransformsanXML-basedde- onthelowestlevel(phasegranularity). Itisalwayspossible scriptionoftheDUT(includingthemappingoftheSDFGto toannotatethecodeonphasegranularitytohavetheoption the MPSoC) into an instrumented SDF compatible C-code to switch the granularity without affecting previous experi- ready to be directly deployed the target processors. This ments. However, this configuration has the highest impact automation significantly speeds up the measurement proce- on the (timing) behavior. dure and reduces errors due to manual implementations. 3.3 Implementation On the Phase granularity level, each code-block that per- forms a reading, computing or writing operation, is instru- Themeasurementoftimingandpowerconsumptionishan- mentedbydelaystatements. FortheActor granularitylevel dled in different ways (as shown in Fig. 3). The timing is a start measurement statement is put before every actor’s measuredbya’stopwatch’module,whichisincludedinthe readoperationandastopmeasurementstatementisputaf- FPGA design. Since the same clock drives the MPSoC and ter their write operations. When annotating on the level of the stopwatch, we can achieve a cycle-accurate time mea- SDFG (when tracing the end-to-end latency of an SDFG), surement. weneedtostartmeasuringwiththeexecutionoffirstsource Afteratimingmeasurementisperformed,theresultisstored actor and stop when the last sink actor has completed its in a buffer on the FPGA. Once enough values have been computation. Moreover, at the SDFG-level analysis, the obtainedorthebufferisfull,themeasuredvaluesaretrans- measurement must start directly after the previous one is mitted (via UART) to the computer for further analysis. finished, the next iteration follows directly. Dependingonhowmanyiterationsshouldbecoveredinthe IntheexampledepictedinFig. 1itiseasytodetectthebe- measurements, the buffer size needs to be properly chosen. ginningandendingofaniterationoftheSobelfilterSDFG. Because of this procedure, the power measurement is not Nevertheless,specialmechanismsarenecessarytodealwith influenced by timing value transmission. SDFGs that contain more than one source or sink actors. Forthepowermeasurement,wechosetheexternalMageec- trigger wire code instrumentation impact UART 10 RxD 1 on phases granularity 2 on actor granularity FPGA-Core DDR-RAM 1 or e err 0.1 relativ scδoA1 b ==e l10 f8,i42lt40e%;r (;3 δx23 -=m 0a,s2k2)%; 0.01 0.001 0 500 1000 1500 2000 2500 3000 average cycles per actor (c ) A 3 measurement UART-to-USB Converter connections points Figure 5: Invasiveness impact due to code instru- to shunts mentation with delay statements. The higher the complexity (execution cycles per actor), the lower Figure 4: Measurement system setup, consisting of the percentage impact. The Sobel filter with a 3x3- an FPGA (Digilent Nexys 4 DDR), and UART-to- mask (which is a rather simple application) needs USB converter for the transmission of the timing an average of 1820 cycles per execution of an actor, values and the Mageec-board for power measure- which is increased by 0.44% when the code is anno- ment. tated on phases granularity (δ1) and 0.22%, when it is annotated on the actor-level granularity (δ2). board, because FPGA-integrated power measurement de- vicescouldnotcopewithourhighsampleratedemandsfor Significant Bit) for the ADC). With the help of the corre- measuring short phases of the application. With a sample spondingdata-sheetsthelatteramountsto≈5%. Eachsig- rate of 84kHz the Mageec-board is qualified for this task. nal, when sent by the measurement controller and received However, higher sample rates are still desirable. After cus- bytheMageec-board,hasadelay25cyclesintheworst-case. tomizing the device’s firmware by installing a buffer for the Hence,thebeginningandtheendingpointofthepowermea- measuredvalues,theMageec-boardisabletomeasurequick surement may be 25 cycles (referring to a 100MHz design) successions of short code blocks. late and may include 25 cycles of the following (not ought Byapplyingtheshuntresistormethodforelectricalcurrent to be measured) code block. measurement [17], the Mageec-board is able to sample the power consumption of three devices. In our setup with the DuetothesamplerateoftheMageec-board,whichislower Digilent Nexys 4 DDR (which is shown in Fig. 4), we re- than the 100MHz clock of the MPSoC, the shortest code movedthree’dummy’(0Ω)resistors(R254,R246andR261) block noticeable for power measurement must be at least from the (on-board) wiring of the FPGA-Core, FPGA-IO 1200 cycles (on a 100MHz system). The Mageec-board is andtheDDR-RAM.Afterthat,wesolderedthreeshuntre- not capable of measuring shorter code blocks, which is the sistors (10mΩ, 20mΩ and 20mΩ) into these places, which caseinsomeofthemeasurementstakenintable1(see’n/a’). values were most appropriate regarding the right balance As mentioned in the concept description, the code annota- between high resolution (high values) and stable operation tion will permanently be present in the application, even oftheFPGA(lowvalues;lowvoltagereduction). Bothends after the analysis. The comparison between original and of these resistors are connected to the Mageec-board, pro- annotated code shows, that the delay depends on the mea- vidingtheabilitytomeasurethevoltageateachresistorand surementgranularityandthecomplexityoftheactorsinthe thus calculating the power consumption. SDFGs(asshowninFig. 5). IfweanalyzeaverysimpleSo- belfilterapplicationwitha3x3maskonphasesgranularity, 4 Evaluation the code annotation increases each actor’s execution time The timing measurement is cycle accurate by design. For by 0.44%. Considering a more complex Sobel filter with a verifying this claim, we executed and measured some code 9x9masktheannotationonphasesgranularityincreasesthe blockswiththeXilinxAXItimer. Afterthat, wecompared execution time by less than 0.1%. these results with our own timing measurement approach For demonstrating the benefits of our approach, we con- and validated their equality. structed two experiments. Both of them employ the hard- Evaluating the accuracy of the power measurement is more ware platform shown in Fig. 1, realized with the following challenging. On the one hand, the connection between the Xilinxmodules: MicroBlazesprocessors,AXI4interconnect measurement controller and the Mageec-board shows some (as system bus), AXI4 streaming interconnect (as periph- delay. On the other hand, the measured power values in- eral bus that is connected the measurement controller on clude some errors, because of imperfect manufacturing pro- the FPGA), AXI BRAM controller and block memory gen- cesses of the integrated components (i.e. shunt resistors in erator (for the shared memory). theFPGA;amplifiersandADCsoftheMageec-board)typ- Inthefirstexperiment,afulldetailedanalysisofaSobelfil- ically having some tolerance intervals (e.g. ±5 LSB (Least ter application on a quad-core MPSoC with the fixed map- Table 1: Analysis of a Sobel filter with a 9x9 mask Table 2: Different mappings of Sobel filter and on a quad-core MPSoC on phases granularity. JPEG encoder actors on a quad-core MPSoC for phase exec. time [cyc.] power [W] evaluation. best avg. worst best avg. worst mapping tile 1 tile 2 tile 3 tile 4 getP.comp. 7875 7948,5 8055 0.5536 0.5978 0.6347 map 1 getPixel GY getMB DCT write 1507915274.915460 0.5415 0.5643 0.5854 GX ABS CC VLC GXread 1766418356.922582 0.5407 0.5596 0.5790 map 2 getPixel GX GY ABS comp. 4575 4575.0 4575 0.5403 0.6162 0.6787 getMB CC DCT VLC write 282 285,1 299 n/a n/a n/a map 3 getPixel GX getMB CC GYread 1766418390.129939 0.5419 0.5611 0.5798 ABS GY VLC DCT comp. 4575 4575,0 4575 0.5387 0.6135 0.6780 map 4 getPixel GX getMB CC write 282 285,2 294 n/a n/a n/a GY ABS DCT VLC ABSread 2012623174.734855 0.5418 0.5556 0.5754 map 5 getPixel getMB GY DCT comp. 52 52.0 52 n/a n/a n/a CC GX VLC ABS map 6 getPixel getMB GX CC DCT GY VLC ABS pingasinFig. 1wasperformed. Eachscenariotakesaround map 7 getMB getPixel 1 minute to be measured. The entire measurement proce- CC GX dure took around 40 minutes, including the instrumenta- DCT GY tionofthesourcecodeandtheconfigurationofthesoftware VLC ABS projectsinXilinxSDK.Table1showstheresults. Contrary to our expectations, we notice that the actors GX and GY take more than nine times longer to read 9 tokens than to Weshowedtheviabilityofourapproachbeingabletomea- write 1 token. The reason for this may be the bus arbitra- surethetimingofSDFAsatcycleaccuracywiththehelpof tion and polling wait times for these actors. In general, the a customized measurement controller. Because the probes computation times are rather small compared to the com- used for triggering the measurement unit is replaced by munication (read and write) times. NOP-operations with equivalent timing impact, the behav- The second experiment uses the same hardware platform ior of the measured software will not change significantly but explores different mappings of two SDFGs (see table in the usage field. Our power consumption measurement 2): a Sobel filter with a 9x9 mask (actors getPixel, GX, GY comes with a very low cost solution and is easily applicable and ABS) and a JPEG encoder (see SDFG in [19]) (actors to any kind of hardware platform that allows accessing the getMB, CC, DCT and VLC). The actors on every processor are power supply interconnects. In the image-processing use- scheduled in static order. However, dynamic scheduling is case shown, we demonstrated the benefits of our approach also supported by our approach but was not tested. All allowingtheconstructionofaParetochartofdifferentmap- thechannels(FIFO-buffer)usedforcommunicationbetween pings of actors to the tiles (of the MPSoC) for obtaining actors were put into the shared memory of the MPSoC to power and timing optimal implementations. Due to lim- invoke contention on the shared bus. itations of the measurement device using low cost ADCs, TheresultsofthemeasurementsonSDFG-levelgranularity our methods works best with measurements at levels above under different mappings are shown in the Pareto chart in phasesgranularity. Thiscouldbeimprovedinthefutureby Fig. 6. We can distinguish two groups of mappings of the using alternative measurement devices with higher resolu- Sobel filter in Fig. 6. The first group (seen at the bottom- tion (such as the one in [21]). rightofSobelParetoinFig. 6)includesmappings2,5,and 6 Acknowledgments 6 where the faster SDFG (Sobel filter) actors have a lower priority on the processing elements or have to wait for the This work has been partially supported by the ARAMIS II slowerSDFG(JPEGencoder)actorstofinish,beforeitcan project (01|S16025J), which is funded by the German Fed- finish its own iteration. In the other mappings the SDFGs eral Ministry of Research and Education (BMBF), and by do not depend on each other and the Sobel filter can reach theSAFEPOWERprojectwithfundingfromtheEuropean better end-to-end latency times around 40000 cycles (map- Union’s Horizon 2020 research and innovation programme pings 1, 3, 4 and 7). Good timing results can be achieved under grant agreement No 646531. with mapping 1, where the Sobel filter iteration takes an References averageof35963cyclesandtheJPEGencodertakes294906 cycles. However,duetothehighprocessoractivityandonly [1] Datasheet: Ucd9248 - digital pwm system controller. few waiting times (for polling), the power usage is high in Technical Report SLVSA33A, Texas Instruments, Jan comparison to the other (slower) mappings. The 7th map- 2010. ping makes use of only two cores, reducing the power but [2] Mageec–energymeasurementinfrastructure.Technical increasing the cycles needed per iteration. report, 2013. [3] Xilinx power estimator (xpe). Technical report, 2016. 5 Conclusion [4] P.Albicocco,D.Papini,andA.Nannarelli.DirectMea- In this paper, we presented a methodology to measure the surement of Power Dissipated by Monte Carlo Simula- timingbehaviorandthepowerconsumptionofsynchronous tions on CPU and FPGA Platforms. Report 2012-18, data flow applications mapped to FPGA-based MPSoCs. Technical University of Denmark, 2012. Sobel filter iteration JPEG encoder iteration 0.565 0.565 map 1 map 1 map 2 map 2 map 3 map 3 0.56 map 4 0.56 map 4 map 5 map 5 map 6 map 6 0.555 map 7 0.555 map 7 watt) watt) wer ( 0.55 wer ( 0.55 o o p p 0.545 0.545 0.54 0.54 0.535 0.535 0 50000 100000 150000 200000 250000 300000 200000 250000 300000 350000 400000 450000 500000 550000 time (cycles) time (cycles) Figure 6: Pareto charts for the iterations of the Sobel filter (left) and JPEG encoder (right) executed on a quad-core MPSoC. Timing and power results vary due to 7 alternative mappings of the actors on the core. [5] A.F.BeldachiandJ.L.Nunez-Yanez. AccuratePower Logic (SPL), 2011 VII Southern Conference on, pages control and monitoring in ZYNQ boards. In 2014 24th 87–90. IEEE, 2011. International Conference on Field Programmable Logic [15] J.Ou.RapidEnergyEstimationforHardware-Software and Applications (FPL), pages 1–4. IEEE, 2014. CodesignUsingFPGAs. EURASIPJournalonEmbed- [6] A. Carroll and G. Heiser. An analysis of power con- ded Systems, 2006:1–11, 2006. sumption in a smartphone. In Proceedings of the 2010 [16] R. Piscitelli and A. D. Pimentel. A high-level power USENIX Conference on USENIX Annual Technical model for mpsoc on fpga. In Parallel and Distributed Conference, USENIXATC’10, pages 21–21, Berkeley, Processing Workshops and Phd Forum (IPDPSW), CA, USA, 2010. USENIX Association. 2011 IEEE International Symposium on, pages 128– [7] D. G¨ohringer, J. Obie, A. L. S. Braga, M. Hu¨bner, 135. IEEE, 2011. C.H.Llanos,andJ.Becker. Explorationofthepower- [17] S. Reda and A. Nowroz. Power Modeling and Char- performancetradeoffthroughparameterizationoffpga- acterization of Computing Devices: A Survey. Foun- based multiprocessor systems. Int. J. Reconfig. Com- dations and Trends in Electronic Design Automation, put., 2011:7:1–7:17, jan 2011. 6(2):121–216, 2012. [8] M. Hosseinabady and J. L. Nunez-Yanez. Run-time [18] S. Schreiner, K. Gru¨ttner, S. Rosinger, and W. Nebel. power gating in hybrid ARM-FPGA devices. In 2014 Ein verfahren zur bestimmung eines powermodells von 24th International Conference on Field Programmable xilinx microblaze mpsocs zur verwendung in virtuellen Logic and Applications (FPL), pages 1–6. IEEE, 2014. plattformen. In18. Workshop Methoden und Beschrei- [9] M. Hosseinabady and J. L. Nunez-Yanez. Energy op- bungssprachen zur Modellierung und Verifikation von timization of FPGA-based stream-oriented computing Schaltungen und Systemen (MBMV 2015), 03 2015. with power gating. In 2015 25th International Con- [19] A. Shabbir, A. Kumar, S. Stuijk, B. Mesman, and ference on Field Programmable Logic and Applications H. Corporaal. Ca-mpsoc: An automated design flow (FPL), pages 1–6. IEEE, 2015. for predictable multi-processor architectures for mul- [10] X.Inc. Zynq-7000APSoCLowPowerTechniquespart tiple applications. Journal of Systems Architecture, 2 - Measuring ZC702 Power using TI Fusion Power 56(7):265–277, 2010. Designer Tech Tip. http://www.wiki.xilinx.com/ [20] E. Sotiriou-Xanthopoulos, G. S. P. Delicia, P. Figuli, Zynq-7000+AP+SoC+Low+Power+Techniques+ K.Siozios,G.Economakos,andJ.Becker.Apoweresti- part+2+-+Measuring+ZC702+Power+using+TI+ mation technique for cycle-accurate higher-abstraction Fusion+Power+Designer+Tech+Ti(30.04.2015), Jan- systemc-based cpu models. In Embedded Computer uar 2014. Version 0.1. Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on, pages [11] R.JevticandC.Carreras.Powermeasurementmethod- 70–77, July 2015. ology for fpga devices. IEEE Transactions on Instru- [21] M. Weiland and N. Johnson. Benchmarking for power mentation and Measurement, 60(1):237–247, Jan 2011. consumptionmonitoring. ComputerScience-Research [12] E.A.LeeandD.G.Messerschmitt.Staticschedulingof and Development, 30(2):155–163, 2015. synchronous data flow programs for digital signal pro- cessing. IEEE Transactions on Computers, 1987. [13] H. G. Lee, K. Lee, Y. Choi, and N. Chang. Cycle- accurate energy measurement and characterization of FPGAs. AnalogIntegratedCircuitsandSignalProcess- ing, 42(3):239–251, 2005. [14] J.P.OliverandE.Boemo.Powerestimationsvs.power measurementsinCycloneIIIdevices. InProgrammable