IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 9, NO. 7, NOVEMBER 2007

Complexity Model Based Proactive Dynamic Voltage Scaling for Video Decoding Systems

Emrah Akyol and Mihaela van der Schaar, Senior Member, IEEE

Abstract: Significant power savings can be achieved on voltage/frequency configurable platforms by dynamically adapting the frequency and voltage according to the workload (complexity). Video decoding is one of the most complex tasks performed on such systems due to its computationally demanding operations like inverse filtering, interpolation, motion compensation and entropy decoding. Dynamically adapting the frequency and voltage for video decoding is attractive due to the time-varying workload and because the utility of decoding a frame is dependent only on decoding the frame before the display deadline. Our contribution in this paper is twofold. First, we adopt a complexity model that explicitly considers the video compression and platform specifics to accurately predict execution times. Second, based on this complexity model, we propose a dynamic voltage scaling algorithm that changes effective deadlines of frame decoding jobs. We pose our problem as a buffer-constrained optimization and show that significant improvements can be achieved over the state-of-the-art dynamic voltage scaling techniques without any performance degradation.

Index Terms: Complexity prediction, dynamic voltage scaling, video decoding.

Manuscript received June 6, 2006; revised April 17, 2007. This work was supported by the National Science Foundation under Grants CCF0541453 and CSR-EHS0509522. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Madjid Merabti.
The authors are with the Department of Electrical Engineering, Henry Samuel School of Engineering and Applied Science, University of California, Los Angeles, CA 90095-1594 USA (e-mail: [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2007.906563
1520-9210/$25.00 © 2007 IEEE

I. INTRODUCTION

POWER-FREQUENCY reconfigurable processors are already available in wireless and portable devices. Recently, hardware components are being designed to support multiple power modes that enable trading off execution time for energy savings. For example, some mobile processors can change the speed (i.e., frequency) and energy consumption at runtime [1], [2]. Significant power savings can be achieved by adapting the voltage and processing speed for the tasks where early completion does not provide any gains.

Dynamic voltage scaling (DVS) algorithms were proposed for dynamically adapting the operating frequency and voltage. In CMOS circuits, power consumption is described by the following relationship: $P \propto C_{\mathrm{eff}} V^2 f$, where $V$, $C_{\mathrm{eff}}$, $f$ denote the voltage, effective capacitance and the operating frequency, respectively. The energy spent on one task is proportional to the time spent for completing that task and since the time is inversely proportional to frequency, the energy is proportional to the square of the voltage, i.e., $E \propto V^2$ [3].

The energy spent on one process can be reduced by decreasing the voltage, which will correspondingly increase the delay. The main goal of all DVS algorithms is utilizing this energy-delay tradeoff for tasks whose jobs' completion times are immaterial as long as they are completed before their processing deadline [4]–[8]. An example is real-time video decoding, where an early completion of frame decoding does not provide any benefit as long as the display deadline for that frame is met. A DVS algorithm essentially assigns the operating level (i.e., power and frequency) for each job given the estimated cycle requirement (i.e., the required complexity) and the job deadline. Hence, the success of DVS algorithms depends on the accuracy of the complexity estimation for the various jobs. Overestimating the complexity results in early termination and idle time and hence, resources are unnecessarily wasted. Underestimating the complexity, however, results in insufficient time allocation for the job. In the specific case of video decoding, this can lead to frame drops or frame freezes. To avoid such job misses, current embedded systems often assume "worst-case" resource utilization for the design and implementation of compression techniques and standards, thereby neglecting the fact that multimedia compression algorithms require time-varying resources, which differ significantly from the "worst-case" requirements.

A. Dynamic Voltage Scaling-Related Works

DVS is an especially good fit for multimedia applications because these applications generally involve computationally complex tasks. Recently, several researchers have addressed the problem of efficient DVS for multimedia applications. An intra-job DVS is proposed by gradually increasing the frequency (speed) within the job, while monitoring the instantaneous cycle demands, similar to the approach in [5]. In [6], rather than using the worst case complexity, a worst case estimate satisfying a statistical guarantee determined based on the online profiled histogram of the previous frames is used for the frequency assignment of each job. Moreover, a cross layer optimization framework for different layers of the operating system is proposed to minimize the energy consumption for multimedia applications in [6].

Multimedia data consists of a sequence of data units (DU). The data unit could have different granularities, i.e., it can be a video frame, a part of a video frame, a video packet, a collection of video frames, macroblocks, subbands, etc. For video decoding, a job conventionally represents the decoding of a frame. Based on the frame rate, there is a worst-case design parameter, $T_{\max}$, that denotes the amount of time available for processing a job. Depending on the time-varying characteristics of the video content, the deployed compression algorithm and parameters

Fig. 1. DVS strategies: (top) no DVS, (middle) conventional reactive DVS and (bottom) proposed proactive DVS.
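The three strategies named in the Fig. 1 caption can be compared numerically. The sketch below is only an illustration under stated assumptions: a cubic power-frequency law ($P = f^3$, which follows from $P \propto V^2 f$ if voltage scales with frequency), a frame period `t_max`, and three hypothetical jobs whose cycle counts are 25%, 50% and 100% of what the processor can execute in one frame period at full speed. None of these numbers are taken from the paper's tables; the function names are likewise made up for this example.

```python
def job_energy(cycles, freq):
    """Energy of one job under the assumed cubic law P = freq**3.

    Time on the job is cycles / freq, so E = P * t = freq**2 * cycles.
    """
    return freq ** 2 * cycles


def compare_strategies(cycles, f_max=1.0, t_max=1.0):
    """Return total energy for no-DVS, conventional DVS and proactive DVS."""
    # No DVS: every job runs at the maximum frequency.
    no_dvs = sum(job_energy(c, f_max) for c in cycles)
    # Conventional (reactive) DVS: each job finishes exactly in its own slot t_max.
    conventional = sum(job_energy(c, c / t_max) for c in cycles)
    # Proactive DVS: one constant frequency that finishes all jobs in the total time.
    f_const = sum(cycles) / (len(cycles) * t_max)
    proactive = sum(job_energy(c, f_const) for c in cycles)
    return no_dvs, conventional, proactive


def meets_deadlines(cycles, freq, t_max=1.0, tol=1e-9):
    """Check that job k (0-based), run back to back at `freq`, ends by (k + 1) * t_max."""
    elapsed = 0.0
    for k, c in enumerate(cycles):
        elapsed += c / freq
        if elapsed > (k + 1) * t_max + tol:
            return False
    return True


# Hypothetical workloads, in units of f_max * t_max cycles (not values from the paper).
jobs = [0.25, 0.5, 1.0]
no_dvs, conventional, proactive = compare_strategies(jobs)
saving_conventional = 1.0 - conventional / no_dvs
saving_proactive = 1.0 - proactive / no_dvs
f_const = sum(jobs) / len(jobs)

print(f"conventional DVS saving: {saving_conventional:.1%}")
print(f"proactive DVS saving:    {saving_proactive:.1%}")
print("deadlines met in this order:", meets_deadlines(jobs, f_const))
print("deadlines met if jobs 2 and 3 swap:", meets_deadlines([0.25, 1.0, 0.5], f_const))
```

With these assumed inputs the savings come out at roughly 35% for conventional DVS and 66% for proactive DVS relative to no DVS. The last two lines show that a single constant frequency meets every per-frame deadline only when the heavier jobs come later, which is why relaxing hard deadlines matters for the proactive strategy.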
andencoding/transmissionbit-rate,noteveryjobneedstheen- assumeapower-frequencyrelationshipof [3],[8],then tire to complete its execution. Often, the actual comple- thepowerspentonthevariousjobswillbe , tiontimeofthejobislessthan .Asshowninthetoppanel , . For the “proactive op- ofFig.1,the firstjobneedsT1 timeunitswhereas thesecond timization” with modified deadlines, the total complexity jobneedsT2timeunits.Thedifferencebetween andthe is , and the frequencies actual completion time is called “slack.” The key idea behind are kept constant: . existingadaptivesystems [5]–[8]is toadapt the operatingfre- The total energy spent on the group of jobs equals quencysuchthatthejobisprocessedexactlyin timeunits . Hence, the total energy (see the middle panel of Fig. 1). In this way, the slack is min- spent for the “no-DVS” case is , for the imizedandthe requiredcriticalresources(suchasenergy)are “conventional DVS” equals and for the decreased. Thus, the existing approach to adaptive system de- “proactiveDVS”caseequals , sign is to build on-line models of the multimedia complexity where .TableIillustratestheparam- suchthattherealcomplexity(intermsofthenumberofexecu- eters associated with each method and each job. Clearly, this tioncycles)canbeaccuratelypredicted,thehardwareparame- example highlights the merit of the proposed proactive DVS ters can be adequately configured, or an optimized scheduling algorithm presented in this page since conventional DVS pro- strategy can be deployed to minimize the energy. Clearly, this vides 35% energy reduction with respect to no-DVS, whereas showsthereactiveandgreedynatureoftheexistingadaptation proactiveDVS provides 66% energy reduction withrespect to process—the execution times are predicted based on the time no-DVSforthissimpleexample.Notethat,forthesakeofthis usedbypreviousjobsand,accordingly,theresourcesareopti- example,weassumedthatthecomplexitiesarepreciselyknown mized within a job for a fixed time allocation . However, beforehand. 
In practice, it is not possible to obtain an exact theDVSgainscanbesubstantiallyimprovedwhentheallocated estimateofthecomplexityasthisdependsnotonlyonthevideo times (T1, T2, and T3) and operating levels (power-frequency content and compression algorithm, but also the state of the pairs)areoptimizedjointly(inter-joboptimization)asshownin decodingplatform.Hence,existingDVSalgorithmswithfixed thebottompanelofFig.1. timeallocationsareconservativeintheirfrequencyassignment in order to avoid job misses caused by complexity estimation mismatches. This conservative nature of the previous DVS B. DVS-AnIllustrativeExample methodsresultsinredundantslacksatjobboundaries. Assume we have jobs with complexities C. SystemAssumptions , , . From now on, we use the term complexity to represent the number of In this paper, we make similar assumptions to other DVS execution cycles. As shown in Fig. 1, with “no-DVS,” de- studies[5]–[8].Specifically,weassumethefollowing. coding is performed considering the worst case scenario: at A) The active power is the dominant power, where as the the maximum available frequency for each job and cor- leakage power caused by memory operations, including responding maximum power . For “conventional DVS,” buffer operations, is negligible. We note that, while a frequencies are adjusted to finish each job just-in-time i.e., lower CPU speed decreases the CPU energy, it may in- , , . If we creasetheenergyofotherresourcessuchasmemory[7]. AKYOLANDVANDERSCHAAR:COMPLEXITYMODELBASEDPROACTIVEDYNAMICVOLTAGESCALINGFORVIDEODECODINGSYSTEMS 1477 TABLEI COMPLEXITY, DEADLINES AND FOR EACH ALGORITHM (NO-DVS j CONVENTIONAL DVS j PROACTIVE DVS), ALLOCATEDTIME,FREQUENCY,POWER,ANDENERGYOFEACHJOB Thisassumptionisreasonableforacomputationallycom- cycles.Asanexample,inthestate-of-theartvideocompression plex task like video decoding. 
We focus on minimizing standard, H.264/AVC [12], a distinction of frame types does the CPU energy, although the proposed method can be not even exist. Instead, every frame type can include different extended to other resources such as the combined CPU typesof(I,B,orP)macroblocks,andeachmacroblockrequires andmemoryenergy.Also,forplatformswithsignificant different amount of processing. Summarizing, it is difficult to passive power, (e.g., sensor networks) passive power achieve the accurate estimates for the current frame decoding should be explicitly considered and thus, it should be complexitymerely from the previous frames (i.e., the conven- incorporated into the proposed solution. However, note tional approach reported in [6]). To overcome this difficulty, thatthis willalsoresult ina morecomplexoptimization we adopt a complexity estimation method for video decoding problem, since passive energy decreases with frequency based on complexity hints that can be easily generated during whereasactiveenergyincreaseswithit. encoding [13], [15]. We note that, an accurate model of the B) Theoperatingfrequencychangeswiththevoltageand,at complexitywillincreaseenergysavingsforboththeproposed onefrequency,thereisonlyonevoltageavailable.Hence, method,aswellasallotherDVSmethods. power-frequency tuples also implicitly imply voltage E. Post-DecodingBuffer changes,i.e.,foreachfrequencythereisonlyonepower consumptionlevel. 
Althoughtheutilizedcomplexityestimationmethodissignif- C) The number of execution cycles (i.e., complexity) re- icantly more accurate than previous methods based on merely quired for a job doesnot change with frequency;itonly thepreviousframes’complexity,itisimpossibletoaprioride- depends on the video, compression algorithm and the terminethepreciseutilizedcomplexity.Toavoidthejobmisses decodingplatform.Hence,thetimespentononejobcan causedbythecomplexityestimatemismatch(i.e.,withoutun- besimplywrittenas ,where isthecomplexity necessarily being conservative for DVS), we propose to use and istheoperatingfrequency. a post-decoding buffer between the decoding and display de- D) Anumberofthejobsareavailableforprocessingatany vicestostoredecoded,butnotyetdisplayedframes.Notethat, time,i.e.,wedonotassumeperiodicarrivalofjobsasin suchapost-decodingbufferisalreadyrequiredtostorethede- [5], [6], since in most real-time streaming applications, codedbutnotyetdisplayedframesforvideodecodersdeploying multipleframes/jobsarealreadywaitinginthedecoding predictionsfromthefutureframes(e.g.,P-framesaredecoded buffer.Hence,itisassumedthattheprocessorcanstartde- and store beforehand in order to be able to decode B-frames). codingaframewhenthedecodingofthepreviousframe Usingsuchbufferswasalsoproposedforotherpurposeslikede- iscompleted. codingveryhighcomplexityframesinrealtimebyintroducing E) The processor handles only one task—the video de- some delay [16]. Also, it is important to notice that using this coding, and there are no other concurrent tasks to be buffer will enable new DVS approaches like changing job ex- processed. ecutionorders,deadlinesandoperatingfrequencyaccordingto theirestimatedcycledemandsbymitigatingfixedharddeadline constraints. Consider the above example with complexities of D. ComplexityEstimation second and third jobs interchanged, i.e., and The previous works in complexity estimation for video , i.e., . 
For this case, the modi- decoding only consider the frame type and previous frames fiedsecondjobshouldborrowtimeslotsfromthethirdjobfor cycle requirements for estimating the current job (i.e., cur- the proposed DVS method. However, in that case, the second rent frame decoding) complexity [6], [9], [10]. However, our jobwillnotmeetits deadlinewhichis .Hence, previous studies [11] revealed that with the development of withoutanybufferstomitigatethedeadlinerequirements,effi- highlyadaptivevideocompressiontechniquesdeployingmac- cientproactiveDVSapproachesareonlypossiblewhenthejobs roblock-basedadaptivemultiframeprediction,contextadaptive areorderedinincreasingcomplexity.Thisisgenerallynotthe entropy coding, adaptive update steps in wavelet coding, etc., case for most coders and deployed prediction structures. It is modeling the current frame complexity based on previous importanttonoticethatutilizingthisbufferdoesnotintroduce framesofthesametypeisnolongeraccurate.Eachindividual anyadditionaldelayintothedecoding—everyjobisstillcom- frame now requires a significantly different amount of video pletedbeforeitsdeadline.Toconclude,usingthepost-decoding decoding operations yielding a differentamount ofprocessing buffer will serve two purposes: to avoid job misses caused by 1478 IEEETRANSACTIONSONMULTIMEDIA,VOL.9,NO.7,NOVEMBER2007 Fig.2. Basicmodulesoftheproposedgenericvideodecodingcomplexitymodelingframework.Thecomplexityfunctionsareindicatedineachmodule;the genericcomplexitymetricsofeachmoduleareindicatedontheconnectingarrows[13]. theinaccuracyofthecomplexityestimatesandtoperformeffi- coding,InverseTransform,MotionCompensationandInterpo- cient proactive DVS by changing the time allocations through lation. Different video coders might have additional modules borrowingtimeallocationsfromtheneighboringjobs. such as in-loop filters as critical components [14]. Also, other modules, such as inverse quantization, can be easily incorpo- F. 
ContributionsofThisPaper rated in already considered modules. The modules that we il- As mentioned before, none of the previous studies [4]–[8] lustrated in the paper only capture the most common modules consider proactively changing the time allocations and fre- for video decoding systems. For instance, many video coders quency; instead, they aim at adapting the frequency to fixed suchasMPEG-2,aswellasthecoderdeployedforexperimen- timeallocationsinagreedyfashion.WeproposeanovelDVS tationinourpaperdoesnotpossessanin-loopfilter. algorithm that adapts jobs deadlines by buffering the decoded frames before display. By utilizing this post-decoding buffer, B. BackgroundonGenericComplexityMetrics:AFramework we study the DVS problem into the context of the buffer con- forAbstractComplexityDescription strainedoptimizationproblem,similartowellstudiedproblems of rate-control with buffer constraints [17]. We also propose In order to represent the different decoding platforms in a anoptimalsolutionforthebuffer-constrainedpower/frequency generic manner, in our recent work [15], we have deployed a allocation problem based on dynamic programming. More- concept that has been successful in the area of computer sys- over, we present several low-complexity suboptimal methods. tems, namely, a virtual machine. We assume an abstract de- Summarizing, this paper presents two major contributions: i) coderreferredtoasGenericReferenceMachine(GRM),which a practical methodology for complexity estimation that con- represents the majority of video decoding architectures. The sidersthevideosource,videoencodingalgorithmandplatform keyideaoftheproposedparadigmwasthatthesamebitstream specifics to accurately predict execution times with the use of would require/involve different resources/complexities on var- offlinegeneratedcomplexityhints;ii)basedonthiscomplexity ious decoders. 
However, given the number of factors that in- model,anovelproactiveDVSmethodthatchangestheeffective fluence the complexityof the decoder such as implementation deadlines(timeallocations)consideringthebufferconstraints. details,decodingplatform,itisimpracticaltodetermineatthe This paper is organized as follows. Section II describes the encoder side the real complexity for every possible decoder. video decoding process, complexity estimation and job defi- Hence,weadoptagenericcomplexitymodelthatcapturesthe nitions. The proposed DVS algorithms are explained in Sec- abstract/genericcomplexitymetrics(GCM)oftheemployedde- tion III. Section IV presents the comparative results and Sec- codingorstreamingalgorithmdependingonthecontentcharac- tionVconcludesthepaper. teristicsandtransmissionbitrate[13],[15].GCMsarederived by computing the number of times the different GRM-opera- II. VIDEODECODINGMODEL tionsareexecuted. The GRM framework for the transform-based motion-com- A. MotionCompensatedVideoDecoding pensatedvideodecoderscanbesummarizedbythemodulesil- The basic operational framework for the majority of trans- lustrated inFig.2.Forany specificimplementationofa video form-based motion-compensated video decoders can be sum- decodingalgorithm,thedecodingcomplexitymaybeexpressed marized by the modules illustrated in Fig. 2. Every video de- intermsofgenericbasicoperationsandnumberofexecutioncy- coder with motion compensation starts with entropy decoding clesforeachbasicoperation.Followingtheapproachin[13],we to generate motion and residual information from compressed assumebasicoperationsforeachmodulethatareeasilymapped bits.Then,theinversetransform(DCTorwavelet)isperformed torealdecodingcomplexityandareindicatedbytheconnecting to generate reference orresidual frames inthe spatial domain. arrowsinFig.2.Thesebasicoperationsareasfollows. 
Motioncompensationwithinterpolationisemployedtoinverse • Entropy decoding : the number of iterations of the thetemporalpredictionsmadeattheencoder.Hence,thecom- “Read Symbol” (RS) function is considered. This func- plexity of every video decoding algorithm can be character- tionencapsulatesthe(context-based)entropy-decodingof ized in terms of four basic functional modules: Entropy De- asymbolfromthecompressedbitstream. AKYOLANDVANDERSCHAAR:COMPLEXITYMODELBASEDPROACTIVEDYNAMICVOLTAGESCALINGFORVIDEODECODINGSYSTEMS 1479 • Inversetransform :thenumberofmultiply-accumu- late(MAC)operationswithnonzeroentriesisconsidered, bycountingthenumberoftimesaMACoperationoccurs between a filter-coefficient (FC) and a nonzero decoded pixelortransformcoefficient. • Fractional-pixelinterpolation :thenumberofMAC operationscorrespondingtohorizontalorverticalinterpo- lation(IO)cancharacterizethecomplexityofthismodule. Fig.3. TheupdateofthepredictorforeachDUwithinajob,i=0;...;L (cid:0) • Motioncompensation :thebasicmotioncompensa- 1. tion(MC)operationperpixelisthefactorinthismodule’s complexity. beutilizedattheexpenseofanincreasedcomplexityoverhead C. Conversion of Generic Complexity Metrics to Real [18].Thesolutionto(3)canbeobtainedbydeployingasteepest ComplexityUsingAdaptiveFiltering descentalgorithmthatiterativelyapproximates : TheGCMsareconvertedonlinetothenumberofexecution cycles at the decoder, using a time-varying predictor for (4) eachoftheGCMfunctions(RS,FC,MC,IO).Wedenotethis where isapositivestepsize.Weusetheinstantaneousapprox- function set . 
Let , , imationsforthecovariance , denotethenumberofcyclesspent(termedhere as “real complexity”), their estimates, and the corresponding and toobtainthewellknownLMSre- GCMandlinearpredictors,respectively,forfunctions , cursionforeachDU : forDUindex (indexwithinthejob),temporallevelorframe type ,withinjob .Thevalueofeach maybedetermined whileencodingandtheseGCMscanbeattachedtothebitstream (5) withnegligibleencodingcomplexityorbitratecost[15].Similar tothelinearmodelsin[18],[20],wemodeltherealcomplexity Inthispaper,weusethenormalizedLMSsinceithasfastercon- as vergenceanditisindependentofthemagnitudeoftheregressor (GCMs),whichis (1) where is the estimation error and the linear prediction (complexityestimate)is (6) (2) AlthoughthepredictorisupdatedateachDU,weupdatethe We note that the linear mapping of the generic complexity predictoronlyatthejobboundariesinouroptimizationframe- metrics to execution times is a kind of least squares matching work(asshowninFig.3).Inotherwords,updatingthejobpa- problemthatiswellstudiedinthecontextofadaptivefiltering. rameters and assigning different operating levels for each DU Hence, we employ a widely used adaptive filter, normalized isnotrealizableinrealtime.Ifjob has DUsoffunction LMS,forconvertinggenericmetricstorealcomplexitysuchas type andtemporallevel(orframetype) ,weusethepredictor thenumberofcyclesspentontheprocessor.Forthispurpose, for the duration of the execution normalizedLMS(NLMS)isshowntobeagoodchoiceamong timeofjob ,althoughweiteratethispredictorinjob the family of adaptive filters since, besides its very low com- asin(4).Thus,weestimatethecomplexityjob as plexity, it replaces the required statistical information with in- stantaneousapproximations[21].Inourproblem,statisticsare time-varying due to the state of the decoding platform. 
This (7) makes LMS algorithms a better choice than rec ursive least squares (RLS) filters, which update the filter from the begin- and the total complexity of job is the sum of complexities ningassumingstationarystatistics. associatedwitheveryframetypeandfunctiontypewithinthat Ifwemodeltherealcomplexityandgenericcomplexitymet- job: rics, as scalar random variables ( and ), then the optimallinearestimator(inthemeansquareerrorsense)[21] (8) (3) where isthecardinalityofthefunctionset ,whichisinour case . isgivenby ,where denotestheco- We also determine an error estimate for the complexity variancematrix.Sinceweuseafirstorderpredictor, reduces estimationbasedontheerrorobtainedforthepreviousjob.We tothescalarautocorrelationvalue.Higherorderfilterscanalso assumethaterrorsforeachDUdecodingcanbecharacterized 1480 IEEETRANSACTIONSONMULTIMEDIA,VOL.9,NO.7,NOVEMBER2007 as zero mean independent identically distributed Gaussian random variables: . Given the set of realiza- tions ofthisrandomvariable,thevarianceofthezero mean complexity error estimate of every DU within the job can be found by the maximum likelihood method [21] whichhasclosedformexpression: (9) Fig. 4. Directed acyclic dependence graphs for (a) Dependencies between I-B1-B2-P1-B3-B4-P2framesand(b)HierarchicalBPictures,I-B1-B2-B3-P. SincewedonotusethepredictoratDUgranularity,butrather update it at job intervals, the estimation error associated with E. ModifiedJobDefinitionsforAdaptiveVideoCompression theconstantpredictorthroughoutthejobcanbewrittenas Allstateoftheartvideoencodersdeploycomplextemporal predictionsfrombothpastand/orfutureframes.Forexample,in thelatestvideocompressionstandard,H.264/AVC[12],BandP typemacroblocksarepredictedfromothermacroblocks,some ofwhichareinfutureframesandthus,mustbedecodedbefore theirdisplaydeadline.Hence,eachframehasadecodingdead- (10) line that is determined by the temporal decomposition struc- ture (temporal dependencies). 
This deadline is different from the play-back (display) deadline determined by the frame rate and using(6) .Let bethesetofframesforwhichframe isusedas areference.Then,thedisplayanddecodingdeadlinesforframe (11) canbewrittenas: Using (9) and (11), we obtain the estimation error variance of the nextjobas Unlikepreviousworkthatconsidersthedecodingofeachin- dividual frame as a task, we combine frames having the same (12) decoding deadline (i.e., frames that are dependently encoded) Wedefinetheworstcaseerror(ortherisk)oftheestimationfor into one job of the decoding task. In general, we define every job as jobbasedonthreeparameters: (13) where the parameter determines the amount of risk that can where be tolerated for a specific task. The worst case error estimate deadline decodingdeadlineofjob , ; isusedindeterminingthebufferoccupancythresholds( and )inbufferconstrainedoptimizationinSectionIII. complexity estimated number of cycles that job consumesonaspecificplatform, ; D. OverheadoftheComplexityEstimation size numberofdecodedoriginalframeswhen We measure the overhead of the estimation process by the job finishes, . exact count of floating point arithmetic operations (multipli- The amount and type of complexity vary significantly per cation, addition), herein called “flops” (floating points opera- job [11]. In the following, we illustrate how to calculate the tions). For normalized LMS, there are 5 flops for each DU. decoding deadlines and how to define jobs in both traditional Hence, there are flops required for job predictive coding and motion compensated temporal filtering .Fortheerrorestimationpart, flops (MCTF)basedcoding. are required, that can be seen from (9) and (12). Totally, 1) Example-1: Job Structure for Predictive Decoding flopsrequiredforthecomplexity Schemes: In predictive coding, frames are encoded with and error estimation of job . 
Note that, error estimation (i.e., interdependencies that can be represented by a directed the worst case estimate) is only used for determining the risk acyclic dependence graph (DAG) [25]. Examples are shown ofthebufferoverflowunderflow,which,similarto[17],canbe in Fig. 4 for two different GOP structures: the conventional heuristicallysettosomefixedvaluewhenthecomplexityesti- I-B-B-P-B-B-P GOPstructure and the hierarchical B pictures. mationoverheadbecomessignificant. Decoding frame I is a job and decoding frames P1 and B1 AKYOLANDVANDERSCHAAR:COMPLEXITYMODELBASEDPROACTIVEDYNAMICVOLTAGESCALINGFORVIDEODECODINGSYSTEMS 1481 Fig.5. Inverse5(cid:0)3MCTFdecompositionindecoding. jointlyrepresentsanotherjobasshowninFig.4(a).Prediction , , , ,and shouldbedecoded1.Thisimpliesthat structures using hierarchical B pictures as in the H.264/AVC theseframes( and )havethesamedecodingdeadlineand standard lead to the following sizes , complexities , and thus,theydefineajob.TableIIshowsthedeadline,complexity deadlines :thefirstjobrepresentsthedecodingoftheframe (comp_t, texture related complexity: entropy decoding and in- the second job consists of decoding versetransform; comp_m,motion related complexity:interpo- frames P,B1 and B2 and the lastjob lationandmotioncompensationtogenerateoriginalframesand is the decoding of frame B3 . Also, intermediatelow-passframes)andthesizeofeachjob. the first job (i.e., decoding I frame) involves only the texture NotethatthesizeofeachframeintheGOPisthesameforall relatedcomplexities,whereasthesecondandthirdjobsinclude thejobsinthisexample.However,itisrelatedtothetemporal several bi-directional MC operations. It is important to notice decomposition used (5/3 Daubechies filter) and can differ for that, both the second and the third job can be viewed from a anotherdecomposition[24]. highlevelperspectiveasdecodingaBframe.However,thejob parametersaresubstantiallydifferent,therebyhighlightingthe III. 
PROACTIVEDYNAMICVOLTAGESCALINGALGORITHM needforencoder-specificcomplexityestimation. A. ProblemSetup 2) Example-2:JobStructureforMCTFDecodingSchemes: InMCTF-basedvideocompressionschemes,videoframesare We note that, for each video decoding job, we have a complexity estimate, size, and deadline defined before the filtered into (low-frequency or average) and (high-fre- job is executed. Let us assume there is a discrete set of quencyordifference)frames.Theprocessisappliediteratively, firstontheoriginalframes( denotesthesequenceoforiginal operating levels with corresponding frequency and power frames), and then, subsequently, to the resulting frames. levels which can be used in our frequency/voltage adaptation A temporal pyramid of temporal levels is produced in this . Each level manner. We use the notation to indicate the -th frame has a different power consumption and different frequency, of temporal level , where and . Equivalently,thenotation isusedtoindicatetheremaining . Assume there are a total of jobs with the frame at the last level, after the completion of the temporal complexity estimates , the deadlines and the sizes decomposition. From the temporal dependencies depicted in Fig. 5, we can 1NotethatH ofthepreviousGOPshouldalreadybedecodedfortheframes see that to display the original frames and , the frames beforeO 1482 IEEETRANSACTIONSONMULTIMEDIA,VOL.9,NO.7,NOVEMBER2007 TABLEII DEADLINE,COMPLEXITYANDSIZEOFEACHJOBFORDECODINGAGOPWITHT =4LEVELUSINGA5/3DAUBECHIESFILTERBASEDMCTF TABLEIII NOMENCLATUREOFSYMBOLS Fig.6. ProposedbuffercontrolledDVS. .Tofacilitatethereadingofthepaper,in operatinglevelofthejob ( tuple)isdeterminedconsid- TableIII,wepresentasummaryofthemostimportantsymbols eringtheparametersof jobsandbufferoccupancy .For andtheirdescriptionsinthepaper. eachjob,thecomplexityestimatesareupdatedandbasedonthe Thedynamicvoltagescalingproblemattemptstofindtheset bufferoccupancy,anewoperatinglevelisassigned. 
of operating levels (power and frequency tuple) for each job Wedefinethebufferoccupancyforjob as recursively , as shown in (14) at the as bottomofthepage. We propose to use a post-decoding buffer between the dis- playdeviceandthedecodingplatformasshowninFig.6.The (15) (14) AKYOLANDVANDERSCHAAR:COMPLEXITYMODELBASEDPROACTIVEDYNAMICVOLTAGESCALINGFORVIDEODECODINGSYSTEMS 1483 where denotestheframerate, istheinitialstateofthe for the next jobs, we aim to have the buffer occu- buffer,anddependsontheinitialplaybackdelaywhichmaybe pancy at the equilibrium (the half size of the buffer 2) zero if no delay is tolerable. We define the buffer size as the . We rede- number of decoded frames. Even in the zero delay case, the finetheexpectedenergyspentandexpecteddelay(timespent) buffer can be occupied with early completions or as a result forjob as of a more aggressive (faster) decoding strategy for the initial frames.Then,theDVSproblemin(14)becomestheproblemof (17) minimizingthetotalenergyunderbufferconstraints;see(16)at thebottomofthepage. ThenDVSproblem forthe next jobsstartingfrom job Thebufferoverflow/underflowconstraint:Thebuffershould can be approximated as described below. See equation (18) at never underflow, to avoid any frame freezes. Also, the buffer thebottomofthepage. occupancycannotgrowindefinitelybecauseitisafinitephys- Althoughthereisafinitenumberofoperatinglevels,thein- icalbuffer.Ifweassumethemaximumbuffersizeis ,we termediateoperatinglevelsareachievablebychangingthefre- needtoguaranteethatthebufferoccupancyisalwayslowerthan quency/powerwithinthejobsimilartotheapproachin[4],[5]. . Fig.7showstheenergy-delaycurveforjobs and . 1) Proposition I: If we neglect the transition cost from B. 
Optimal Solution Based on Dynamic Programming

The optimal solution can be found by dynamic programming methods, which explicitly consider every possible frequency-buffer pair for each job and check the buffer state for overflow or underflow [22]. A trellis is created with given complexity estimates of the job and possible power-frequency assignments. At every stage, the paths reaching the same buffer occupancy at a higher energy cost are pruned. Note that, since the complexity prediction is updated at each job boundary, this optimization should be performed for each job.

The optimal solution does not assume any property of the power-frequency function such as convexity. However, the complexity of this approach is in the order of for a total of jobs. This overhead may not be practical for real-time decoding systems. Hence, in Section IV, we propose suboptimal approximations to this optimal solution.

C. Properties of Buffer Constrained Proactive DVS Problem

We approximate the buffer constraints for a look-ahead window of jobs: Starting from the job

(16)

(18)

one frequency to another [5], the frequency change within the job corresponds to piecewise bilinear interpolation of power-frequency points as shown in Fig. 7.

Proof: Assume the desired frequency for the job ( can be achieved by performing cycles at frequency and cycles at such that , i.e., for some such that and . Also, let and be the corresponding E-D points to and .

(19)

(20)

Equations (19) and (20) show the linear relationship for the points interpolated between E-D points.

Fig. 7. Energy-Delay curve of two different jobs with D, E pairs corresponding to different frequency-power levels of the processor. Intermediate levels are achieved by frequency changes within the job. Greater values of the slope mean greater energy spent with smaller delay, yielding a higher frequency-power choice. Every job has different E-D points but the slopes are identical for every job.

2 A more accurate equilibrium buffer size can be determined by considering complexity estimates in the long run, i.e., considering the relative amount of complexity of the look-ahead window with respect to total complexity of M jobs.

2) Proposition II: The slope between two E-D points only depends on the power and frequency values but not on complexity of the job.

Proof: The E-D slope between points , and , for the job is

(21)

Replacing (17) into (21), we obtain

(22)

Equation (21) shows that the slopes depend only on power-frequency tuples, hence they are identical for all jobs.

3) Proposition III: From a set of given power-frequency points, only the ones that generate a convex set of E-D points should be used for the proposed proactive DVS.

Proof: The proof is based on Jensen's inequality [26]. If there is an E-D point , which does not lie on the convex hull of E-D points, then an intermediate point , with the same delay can be achieved by bilinear interpolation of , and , such that if for some constant , then from Proposition I. However, since , does not lie on the convex hull, if there exists a constant , then [26]. Hence, E-D points that are not on the convex hull should not be used since there is an intermediate point with the same delay, leading to lower energy.

Generation of the Convex Hull of E-D Points: The convexity of power-frequency values does not guarantee the convexity of the E-D curve. For example, for is a convex function of frequency but does not provide convex E-D points. Hence, before any optimization, the power-frequency values which do not provide convex E-D points should be pruned, i.e., a convex hull of E-D points should be generated from the possible set of E-D points. Note that, since the slopes are identical for all jobs (Proposition II), this pruning is done only once for all jobs as shown in Table V.

4) Proposition IV: The optimal operating level assignment will result in equal slopes for every job:

Proof: The DVS problem (18) can be converted to an unconstrained problem by introducing a Lagrange multiplier into the cost function of the problem. Lagrangian solutions are based on the fact that for additive convex cost functions, the slope of the individual cost functions should be equal. This is a direct extension of Theorem 1 of [23]. Hence, to minimize energy, the E-D slope of all jobs should be identical.

From Propositions II and IV, the optimal frequency considering a look-ahead window of jobs can be found considering a total processing time and a total complexity . Then, the frequency can be found by:

(23)

This result is also intuitive because, by utilizing the buffer we consider a collection of jobs as one job. In other words, by utilizing the post-decoding buffer, we smooth the complexities of jobs within the group of jobs.

We note that, although we assumed that all intermediate points are achievable by continuously changing the frequency within the job, in practice there may only be a limited number of frequency transition opportunities within a job [6], i.e., the choice of and in Proposition I is not arbitrary but limited to some number determined by the used platform. Also, note an intermediate point ( , such that ) does not provide any energy savings beyond using the given power-frequency points ( , and , ) since the intermediate point is on the linear curve, not strictly convex. Let us consider an example where two jobs with identical complexities are to be processed at a frequency . We can achieve this frequency by processing half of the cycles at and the other half of the cycles at for both jobs. However, processing one job at frequency and the other job at will result in identical energy spent and identical total delay. The only difference created by using intermediate frequencies is decreasing the instantaneous delay (the delay caused by the first job in the given example), which is already tolerated by the buffer. Also considering the frequency transition cost within a job, we conclude that the intermediate frequencies should not be used in the proposed DVS method. In Section IV, we
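The properties stated in Propositions II and III, and the window-level frequency choice, can be illustrated with a short Python sketch. It is not the implementation used in this paper. It assumes that a job of a given complexity run at a power-frequency level has delay equal to cycles divided by frequency and energy equal to power times delay, and it reads (23) as total window complexity divided by total available processing time. The function names and the four power-frequency levels are hypothetical and are not the values of Table V.

```python
from typing import List, Sequence, Tuple

# (power in W, frequency in MHz); hypothetical levels, not taken from Table V
Level = Tuple[float, float]


def ed_point(level: Level, mcycles: float) -> Tuple[float, float]:
    """Delay (s) and energy (J) of a job of `mcycles` million cycles.

    Assumption: delay = cycles / frequency, energy = power * delay.
    """
    power, freq = level
    delay = mcycles / freq
    return delay, power * delay


def ed_slope(a: Level, b: Level, mcycles: float) -> float:
    """Slope of the line joining the E-D points of levels `a` and `b`."""
    d_a, e_a = ed_point(a, mcycles)
    d_b, e_b = ed_point(b, mcycles)
    return (e_b - e_a) / (d_b - d_a)


def prune_to_convex_hull(levels: Sequence[Level]) -> List[Level]:
    """Keep the levels whose E-D points lie on the lower convex hull.

    Both axes scale with the job complexity, so hull membership does not
    depend on it; a unit complexity is used and the pruning is done once.
    """
    # Decreasing frequency means increasing delay along the E-D curve.
    ordered = sorted(levels, key=lambda lv: lv[1], reverse=True)
    hull: List[Level] = []
    for lv in ordered:
        while len(hull) >= 2:
            d0, e0 = ed_point(hull[-2], 1.0)
            d1, e1 = ed_point(hull[-1], 1.0)
            d2, e2 = ed_point(lv, 1.0)
            cross = (d1 - d0) * (e2 - e0) - (e1 - e0) * (d2 - d0)
            if cross > 0:
                break
            hull.pop()  # middle point lies on or above the chord
        hull.append(lv)
    return hull


def window_frequency(mcycles_per_job: Sequence[float], total_time: float) -> float:
    """Single frequency (MHz) for a look-ahead window treated as one job.

    Assumption: total complexity divided by total processing time (s).
    """
    return sum(mcycles_per_job) / total_time


levels = [(0.1, 100.0), (0.3, 200.0), (0.9, 300.0), (1.0, 400.0)]
hull = prune_to_convex_hull(levels)
print(hull)  # [(1.0, 400.0), (0.3, 200.0), (0.1, 100.0)]

# The slope between two levels is the same for jobs of different complexity.
print(ed_slope(hull[0], hull[1], 1.0), ed_slope(hull[0], hull[1], 7.0))

# Three jobs of 3, 5 and 4 million cycles sharing 40 ms of processing time.
print(window_frequency([3.0, 5.0, 4.0], 0.04))
```

In this hypothetical set, the 300 MHz level is pruned because its E-D point lies above the line joining the 400 MHz and 200 MHz points. A window that asks for roughly 300 MHz would then be served by the two neighboring hull levels, in line with the discussion of running one job at each level instead of switching within a job.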