AnySP: Anytime Anywhere Anyway Signal Processing

Mark Woh1, Sangwon Seo1, Scott Mahlke1, Trevor Mudge1, Chaitali Chakrabarti2 and Krisztian Flautner3
1 Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor, MI
{mwoh,swseo,mahlke,tnm}@umich.edu
2 Department of Electrical Engineering, Arizona State University, Tempe, AZ
[email protected]
3 ARM, Ltd., Cambridge, United Kingdom
krisztian.fl[email protected]

ABSTRACT

In the past decade, the proliferation of mobile devices has increased at a spectacular rate. There are now more than 3.3 billion active cell phones in the world—a device that we now all depend on in our daily lives. The current generation of devices employs a combination of general-purpose processors, digital signal processors, and hardwired accelerators to provide giga-operations-per-second performance on milliWatt power budgets. Such heterogeneous organizations are inefficient to build and maintain, and also waste silicon area and power. Looking forward to the next generation of mobile computing, computation requirements will increase by one to three orders of magnitude due to higher data rates, increased complexity algorithms, and greater computation diversity, but the power requirements will be just as stringent. Scaling of existing approaches will not suffice; instead, the inherent computational efficiency, programmability, and adaptability of the hardware must change. To overcome these challenges, this paper proposes an example architecture, referred to as AnySP, for next generation mobile signal processing. AnySP uses a co-design approach where the next generation wireless signal processing and high-definition video algorithms are analyzed to create a domain-specific programmable architecture. At the heart of AnySP is a configurable single-instruction multiple-data datapath that is capable of processing wide vectors or multiple narrow vectors simultaneously. In addition, deeper computation subgraphs can be pipelined across the single-instruction multiple-data lanes. These three operating modes provide high throughput across varying application types. Results show that AnySP is capable of sustaining 4G wireless processing and high-definition video throughput rates, and will approach the 1000 Mops/mW efficiency barrier when scaled to 45nm.

Categories and Subject Descriptors

C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors); C.3 [Special-Purpose and Application-Based Systems]: Signal processing systems

General Terms

Algorithms, Design, Performance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISCA'09, June 20-24, 2009, Austin, Texas, USA.
Copyright 2009 ACM 978-1-60558-526-0/09/06 ...$5.00.

1. INTRODUCTION

In the coming years, the deployment of untethered computers will continue to increase rapidly. The prime example today is the cell phone, but in the near future we expect to see the emergence of new classes of such devices. These devices will improve on the mobile phone by moving advanced functionality such as high-bandwidth internet access, human-centric interfaces with voice recognition, high-definition video processing, and interactive conferencing onto the device. At the same time, we also expect the emergence of relatively simple, disposable devices that support the mobile computing infrastructure. The requirements of these low-end devices will grow in a manner similar to those for high-end devices, only delayed in time.

Untethered devices perform signal processing as one of their primary computational activities due to their heavy usage of wireless communication and their rendering of audio and video signals. Fourth generation wireless technology (4G) has been proposed by the International Telecommunications Union to increase the bandwidth to maximum data rates of 100 Mbps for high mobility situations and 1 Gbps for stationary and low mobility scenarios like internet hot spots [28]. This translates to an increase in the computational requirements of 10-1000x over previous third generation wireless technologies (3G), with a power envelope that can only increase by 2-5x [33]. Other forms of signal processing, such as high-definition video, are also 10-100x more compute intensive than current mobile video.

[Figure 1 is a log-scale plot of Performance (Gops) against Power (Watts), with 1, 10, 100, and 1000 Mops/mW efficiency contours; design points include the Pentium M, TI C6X, VIRAM, Imagine, IBM Cell, SODA (90nm and 65nm), 3G Wireless, Mobile HD Video, and 4G Wireless.]
Figure 1: Performance versus power requirements for various mobile computing applications.

Figure 1 presents the demands of the 3G and 4G protocols in terms of the peak processing throughput and power budget. Conventional processors cannot meet the power-throughput requirements of these protocols. 3G protocols, such as W-CDMA, require approximately 100 Mops/mW. Desktop processors, such as the Pentium M, operate below 1 Mop/mW, while digital signal processors, such as the TI C6x, operate around 10 Mops/mW. High performance systems, like the IBM Cell [24], can provide excellent throughput, but its power consumption makes it infeasible for mobile devices [24]. Research solutions, such as VIRAM [14] and Imagine [1], can achieve the performance requirements for 3G, but exceed the power budgets of mobile terminals. SODA improved upon these solutions and was able to meet both the power and throughput requirements for 3G wireless [17]. Companies such as Philips [32], Infineon [26], ARM [34], and Sandbridge [8] have also proposed domain-specific systems that meet the requirements for 3G wireless.

For 4G wireless protocols, the computation efficiency must be increased to greater than 1000 Mops/mW. Three central technologies are used in 4G: orthogonal frequency division multiplexing (OFDM), low density parity check (LDPC) codes, and multiple-input multiple-output (MIMO) techniques. OFDM is a modulation scheme that transmits a signal over many narrowband sub-carriers. Fast Fourier transforms (FFTs) are key as they are used to place signals from the baseband onto the sub-carriers. LDPC codes provide superior error correction capabilities and less computation complexity compared to the Turbo codes used in 3G. However, parallelizing the LDPC decoding algorithm is more challenging because of the large amount of data shuffling. Lastly, MIMO is based on the use of multiple antennae for both transmission and reception of signals, and requires complex signal detection algorithms.

The existing computational efficiency of 3G solutions is inadequate and must be increased by at least an order of magnitude. Programmability allows a single platform to support multiple applications and even multiple standards, provides faster time to market, and results in higher chip volumes as a single platform can support a family of mobile devices. Lastly, hardware adaptivity is necessary to maintain efficiency as the core computational characteristics of the applications change. 3G solutions rely heavily on the widespread amounts of vector parallelism in wireless signal processing algorithms, but lose most of their efficiency when vector parallelism is unavailable or constrained.

To address these challenges, this paper proposes AnySP—an example of an advanced signal processor architecture that targets next generation mobile computing. Our objective is to create a fully programmable architecture that supports 4G wireless communication and high-definition video coding at efficiency levels of 1000 Mops/mW. Programmability is recognized as a first class design constraint, thus no hardwired ASICs are employed. To address the issues of computational efficiency and adaptivity, we introduce a configurable single-instruction multiple-data (SIMD) datapath as the core execution engine of the system. The datapath supports three execution scenarios: wide vector computation (64 lanes), multiple independent narrow vector computation threads (8 threads × 8 lanes), and 2-deep subgraphs on moderately wide vector computation (32 lanes × depth-2 computations). This inherent flexibility allows the datapath to be customized to the application, but still retain high execution efficiency. AnySP also attacks the traditional inefficiencies of SIMD computation: register file power, data shuffling, and reduction operators.

The key contributions of this paper are as follows:

• An in-depth analysis of the computational characteristics of 4G wireless communication and high-definition video algorithms.

The need for higher bandwidth and
increases in computation complexity are the primary reasons for the two order-of-magnitude increase in processing requirements when going from 3G to 4G.

Mobile computing platforms are not limited to performing only wireless signal processing. High-definition video is also an important application that these platforms will need to support. Figure 1 shows that the performance requirements of video exceed those of 3G wireless, but are not as high as 4G wireless. Thus the power per Gops requirement is further constrained. Moreover, the data access complexity in video is much higher than in wireless. Wireless signal processing algorithms typically operate on single dimension vectors, whereas video algorithms operate on two or three dimensional blocks of data. Thus, video applications push designs to have more flexible, higher bandwidth memory systems. High-definition video is just one example of a growing class of applications with diverse computing and memory requirements that will have to be supported by the next generation of mobile devices.

The design of the next generation of mobile platforms must address three critical issues: efficiency, programmability, and adaptivity.

• The design, implementation, and evaluation of AnySP—an example programmable processor that targets next generation mobile computing.

• The design and organization of a configurable SIMD datapath for sustaining high throughput computing with both wide and narrow vector widths.

2. MOBILE SIGNAL PROCESSING

2.1 Overview of Benchmarks

2.1.1 4G Wireless Protocol

We model our 4G wireless system after the NTT DoCoMo test 4G wireless system [31], because there is no standard yet for 4G. The major components of the physical layer consist of three blocks: a modulator/demodulator, a MIMO encoder/decoder, and a channel encoder/decoder. These blocks constitute the majority of the 4G computations [33].

The role of the modulator is to map data sequences into symbols with certain amplitudes and phases, onto multiple orthogonal frequencies. This is done using the inverse FFT. The demodulator performs the operations in reverse order to reconstruct the original data sequences. The MIMO encoder multiplexes many data signals over multiple antennae. The MIMO decoder receives all the signals from the antennae and either decodes all the streams for increased data rates or combines all the signals in order to increase the signal strength. The algorithm used to increase data rate is the vertical Bell Laboratories layered space-time (V-BLAST) algorithm, and the algorithm used to increase the signal quality is space time block coding (STBC). Finally, the channel encoder and decoder perform forward error correction (FEC) that enables receivers to correct errors in the data sequence without retransmission. There are many FEC algorithms that are used in wireless systems.

Algorithm            SIMD         Scalar       Overhead     SIMD Width   Amount
                     Workload(%)  Workload(%)  Workload(%)  (Elements)   of TLP
FFT/IFFT             75           5            20           1024         Low
STBC                 81           5            14           4            High
LDPC                 49           18           33           96           Low
Deblocking Filter    72           13           15           8            Medium
Intra-Prediction     85           5            10           16           Medium
Inverse Transform    80           5            15           8            High
Motion Compensation  75           5            10           8            High

Table 1: Data level parallelism analysis for MSP algorithms. Overhead workload is the amount of instructions needed to aid the SIMD operations, like data shuffle and SIMD load/store.

2.2 Algorithm Analysis

To build a unified architecture for mobile signal processing (MSP), we performed a detailed analysis of the key kernel algorithms. These are FFT, STBC, and LDPC for the wireless workload, and deblocking, intra-prediction, inverse transform, and motion compensation for the video processing workload. Our goal was to explore the characteristics of each algorithm in order to find their similarities and differences. This section will discuss the studies that helped define our example architecture for MSP.

2.2.1 Multiple SIMD Widths

Many previous studies have analyzed and determined the best SIMD width that should be implemented for general-purpose and application-specific workloads.
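As an aside on the modulation step of Section 2.1.1, the mapping of symbols onto orthogonal sub-carriers is an inverse Fourier transform. The following toy direct-form sketch is our own illustration (the function names and QPSK constellation are assumptions); a real modem would use a fixed-point hardware FFT rather than this O(n^2) form:

```python
# Toy sketch of OFDM modulation/demodulation (illustrative only, not
# the DoCoMo test system): an inverse DFT places one complex symbol on
# each orthogonal sub-carrier; the forward DFT recovers the symbols.
import cmath

def idft(symbols):
    n = len(symbols)
    return [sum(s * cmath.exp(2j * cmath.pi * k * t / n)
                for k, s in enumerate(symbols)) / n for t in range(n)]

def dft(samples):
    n = len(samples)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples)) for k in range(n)]

symbols = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]   # e.g. QPSK points
recovered = dft(idft(symbols))
print(all(abs(a - b) < 1e-9 for a, b in zip(recovered, symbols)))
```

Because the sub-carriers are orthogonal, demodulation recovers each transmitted symbol exactly (up to floating-point error).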
LDPC is used because it supports the highest data rates. Our target for 4G wireless is the maximum data rate for high mobility, which is 100 Mbps. This 4G configuration utilizes the FFT, STBC, and LDPC kernels.

2.1.2 H.264 Video Standard

Video compression standards, such as MPEG-4 and H.264, are being actively considered for mobile communication systems because of the increasing demand for multimedia content on handheld devices. In this paper, H.264 is selected as the multimedia benchmark because it achieves better compression compared to previous standards and also contains most of the basic functional blocks (prediction, transform, quantization, and entropy decoding) of previous standards. Of the three profiles in H.264, we focused on the Baseline profile due to its potential application in video telephony and videoconferencing.

The H.264 decoder receives a compressed bitstream from the network abstraction layer (NAL). The first block is the entropy decoder, which is used to decode the bitstream. After reordering the stream, the quantized coefficients are scaled and their inverse transform is taken to generate the residual block data. Using header information in the NAL, the decoder selects prediction values for motion compensation in one of two ways: from a previously decoded frame or from the filtered current frame (intra-prediction). According to the power profile of H.264, about 75% of the decoder power consumption is attributed to three algorithms: the deblocking filter (34%), motion compensation (29%), and intra-prediction (10%) [16]. Therefore, these three algorithms are selected as the H.264 kernel algorithms in this study.

The optimal SIMD width is different in each study because the width depends on the algorithms that constitute the workload. Examples are the 2-8-wide Intel MMX extensions [23], the 16-wide NXP EVP [32], and the 32-wide SODA [17].

Table 1 presents our study analyzing the data level parallelism (DLP) of MSP algorithms. We calculate the available DLP within each of the algorithms and show the maximum natural vector width that is achievable. The instructions are broken down into 3 categories: SIMD, overhead, and scalar. The SIMD workload consists of all the raw SIMD computations that use traditional arithmetic and logical functional units. The overhead workload consists of all the instructions that assist SIMD computations, for example loads, stores and shuffle operations. The scalar workload consists of all the instructions that are not parallelizable and must be run on a scalar unit or on the address generation unit (AGU).

From Table 1, we see that many of the algorithms have different natural vector widths–4, 8, 16, 96, 1024. Also, the algorithms with smaller SIMD widths exhibit a high level of TLP, which means that we can process multiple threads that work on separate data on a wide SIMD machine. For instance, 8 instances of STBC that have a SIMD width of 4 can be processed on a 32-wide machine. Unlike most SIMD architectures that are designed with a fixed SIMD width to process all the algorithms, this study suggests that the best solution would be to support multiple SIMD widths and to exploit the available thread-level parallelism (TLP) when the SIMD width is small. By supporting multiple SIMD widths, the SIMD lane utilization can be maximized, as will be demonstrated in Section 3.1.1.

Though the scalar and overhead workloads are not the majority, they still contribute 20-30% of the total computation. In instances such as LDPC, comprised of mostly simple computations, data shuffling and memory operations dominate the majority of the workload. This suggests that we cannot simply improve the SIMD performance, but also must reduce the overhead workload.
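The lane-utilization argument above can be made concrete with a small back-of-the-envelope model (our own sketch, using the STBC width from Table 1; not a simulation of SODA or AnySP):

```python
# Toy model (not from the paper): SIMD lane utilization when a kernel's
# natural vector width is smaller than the machine width. A fixed-width
# machine runs one narrow thread at a time; a multi-width machine packs
# independent threads (TLP) across its lanes.

def utilization_fixed(machine_lanes, vector_width):
    """One thread at a time: only vector_width of the lanes do work."""
    return vector_width / machine_lanes

def utilization_packed(machine_lanes, vector_width):
    """Pack as many independent threads as fit across the lanes."""
    threads = machine_lanes // vector_width
    return threads * vector_width / machine_lanes

# STBC has a natural vector width of 4 (Table 1). On a 32-wide SIMD:
print(utilization_fixed(32, 4))   # 0.125 -> 12.5% of lanes busy
print(utilization_packed(32, 4))  # 1.0   -> 8 threads x 4 lanes
```

The packed case corresponds to running 8 independent STBC instances side by side, which is only possible if the algorithm exposes enough TLP.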
This can be accomplished by introducing better support for data reorganization or by increasing the scalar and AGU performance. Table 1 also shows how much TLP is available; high TLP means that many independent but identical threads can be spawned, and low means only one or a few can be spawned, mainly due to data dependencies or the absence of outer loops.

[Figure 2 shows dataflow subgraphs built from operations such as Move, Add, Sub, Mult, shifts, compares, sign, select, and XNOR: a) a deblocking filter subgraph over pixels p3..p0 and q0..q3; b) an FFT butterfly subgraph over tap and twiddle-factor values; c) a subgraph for the LDPC bit node and check node operation.]
Figure 2: Select subgraphs for the inner loops of MSP algorithms.

[Figure 3 is a stacked-bar chart (0-100%) of bypass write, bypass read, and register accesses for FFT, STBC, LDPC, deblocking filter, intra-prediction, and inverse transform.]
Figure 3: Register file access for the inner loops of MSP algorithms. Bypass write and read are the RF read and write accesses that do not need to use the main RF. Register accesses are the read and write accesses that do need to use the main RF.

2.2.2 Register Value Lifetimes

For many processors, the register file (RF) contributes a large percentage of the power consumption. Some architectures have RFs that consume 10-30% of the total processing element (PE) power [11][22]. This is due to the reading and writing of values for each instruction.
Many architectures, like very large instruction word (VLIW) machines, compound this problem by having multi-ported RFs that increase the energy per access. Recent works in [9] and [7] show that power can be reduced and performance improved by bypassing the RF for short-lived values and storing the value within a register bypass pipeline. The bypass network is a non-critical-path structure for holding values a fixed number of cycles before they are consumed. By bypassing the RF, access power and register pressure are reduced, allowing for more effective register allocation. Another technique used to reduce register file power is register file partitioning. Previous research found that for some embedded applications, a small set of registers accounted for the majority of the register file accesses [11]. By splitting the register file into a small and a large region, the high-access registers can be stored in the small partition, resulting in less energy consumed per access.

One focus of this paper is to study the potential for eliminating RF accesses in MSP applications. Figure 2 shows the subgraphs for several of the MSP inner loops.

The four-entry partition modification creates a total of 20 registers: 16 main registers and 4 temporary registers. Figure 3 shows the breakdown of accesses that can and cannot bypass the RF. Bypass write and bypass read correspond to the RF read and write accesses that do not need to use the main RF and can be eliminated with either data forwarding or by storing the data in a small RF partition. Register accesses are the read and write accesses that cannot be bypassed and need to use the main RF. For most of the algorithms, a large portion of register accesses can be bypassed. The exception to this is LDPC, where many values are calculated and stored for several cycles before one of them is chosen. Based on our analysis, we conclude that architectures that run MSP algorithms should support bypass mechanisms to help reduce the accesses to the main RF.
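The kind of lifetime analysis behind Figure 3 can be sketched in a few lines. This is a toy model over a hypothetical trace; the 4-cycle window and the trace format are our own assumptions, not the paper's methodology:

```python
# Illustrative sketch: classify register writes as "bypassable" when
# every read of the value occurs within a small window after the write,
# so the value could live in a bypass network or small RF partition
# instead of the main RF.

def bypassable_fraction(trace, window=4):
    # trace: list of (cycle, op, reg) with op in {"write", "read"}
    last_write = {}     # reg -> cycle of the live write
    reads_after = {}    # reg -> read cycles since that write
    writes = bypassable = 0

    def close(reg):
        nonlocal bypassable
        w, reads = last_write[reg], reads_after[reg]
        if reads and all(c - w <= window for c in reads):
            bypassable += 1

    for cycle, op, reg in trace:
        if op == "write":
            if reg in last_write:
                close(reg)
            last_write[reg] = cycle
            reads_after[reg] = []
            writes += 1
        else:
            reads_after[reg].append(cycle)
    for reg in last_write:
        close(reg)
    return bypassable / writes

trace = [(0, "write", "r1"), (1, "read", "r1"),   # short-lived value
         (2, "write", "r2"), (9, "read", "r2")]   # long-lived value
print(bypassable_fraction(trace))  # 0.5
```

In this toy trace, half of the written values never need to touch the main RF, mirroring the large bypassable portions reported for most MSP algorithms.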
We see that there exist large amounts of data locality in these subgraphs, which is illustrated by the number of single lines that go from one source to one destination. The MSP applications fit the streaming dataflow model, where data is produced and then read once or twice. The values are usually consumed by the instructions directly succeeding the producer. The number of temporary registers needed is small, but they are accessed often, suggesting that a register bypass network or a partitioned RF can reduce power and increase performance.

We further analyze the MSP algorithms to determine the extent to which RF accesses can be reduced. We use the SODA architecture [17] with a small four-entry RF partition.

2.2.3 Instruction Pair Frequency

In typical digital signal processing (DSP) codes, there are pairs of instructions that are frequently executed back to back. A prime example of this is the multiply followed by accumulate instruction. To improve performance and decrease energy consumption, DSP processors such as the Analog Devices ADSP and the TI C series introduced the multiply-accumulate (MAC) operation. The work in [4] has exploited this further by building specialized hardware to exploit long chains of instructions.

We performed a study on the MSP algorithms to find the most common producer-consumer instruction pairs.
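Such a pair study can be sketched as a simple scan over an instruction trace (toy example; the opcode names and trace format are our own, not SODA's ISA):

```python
# Sketch of a producer-consumer pair-frequency scan (illustrative
# only): count how often an instruction's result is consumed by the
# immediately following instruction, the candidates for fusion into
# pairs such as multiply-add (MAC).

from collections import Counter

def pair_frequencies(trace):
    # trace: list of (opcode, dest, sources)
    pairs = Counter()
    for prod, cons in zip(trace, trace[1:]):
        if prod[1] in cons[2]:          # consumer reads producer's dest
            pairs[(prod[0], cons[0])] += 1
    return pairs

trace = [("multiply", "t0", ("a", "b")),
         ("add", "acc", ("acc", "t0")),   # multiply-add pair
         ("shuffle", "t1", ("v0",)),
         ("add", "t2", ("t1", "v1"))]     # shuffle-add pair
freqs = pair_frequencies(trace)
print(freqs[("multiply", "add")])  # 1
print(freqs[("shuffle", "add")])   # 1
```

Sorting the resulting counter by frequency yields tables in the spirit of Figure 4.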
a) Intra-Prediction and Deblocking Filter Combined
    Instruction Pair        Frequency    Total
 1  multiply-add             26.71%      26.71%
 2  add-add                  13.74%      40.45%
 3  shuffle-add               8.54%      48.99%
 4  shift right-add           6.90%      55.89%
 5  add-shift right           6.94%      62.83%
 6  subtract-add              5.76%      68.59%
 7  multiply-subtract         4.14%      72.73%
 8  shift right-subtract      3.75%      76.48%
 9  add-subtract              3.07%      79.55%
10  Others                   20.45%     100.00%

b) LDPC
    Instruction Pair        Frequency    Total
 1  shuffle-move             32.07%      32.07%
 2  abs-subtract              8.54%      40.61%
 3  shuffle-subtract          8.54%      49.15%
 4  move-subtract             3.54%      52.69%
 5  add-shuffle               3.54%      56.23%
 6  Others                   43.77%     100.00%

c) FFT
    Instruction Pair        Frequency    Total
 1  shuffle-shuffle          16.67%      16.67%
 2  add-multiply             16.67%      33.33%
 3  multiply-subtract        16.67%      50.00%
 4  multiply-add             16.67%      66.67%
 5  subtract-multiply        16.67%      83.33%
 6  shuffle-add              16.67%     100.00%

Figure 4: Instruction pair frequencies for different MSP algorithms.

[Figure 5 diagrams a set of data reordering patterns as mappings between input and output vector elements, including replications such as A B C D -> A A A A B B B B C C C C D D D D, as well as fixed permutations and interleavings of the input elements.]

In each
Figure 4 shows the breakdown of themostfrequentinstructionpairsthatwerefoundinMSP algorithms programmed for SODA [17]. Among all the al- gorithms, the MAC instruction was the most frequent, be- causealargenumberofthesealgorithmshaveadotproduct as their inner loop. Other fused instruction pairs that were common were permute-add, add-add, and shift-add. The permute-addpairoccursinmanyofthealgorithmsbecause data must first be aligned to effectively utilize the SIMD lanes. The add-add and shift-add pairs also occur quite frequently in the video algorithms, because values are ac- cumulated together and then normalized by a power of 2 Figure 5: A set of swizzle operations that are re- division—a shift. quired for a subset of MSP algorithms. Basedonthisanalysis,weconcludethatthefusedinstruc- tion pairs with high frequency should be supported. This the swizzle operations based on a parity check matrix. In increases performance and decreases power because regis- SODA, the data shuffle network may require many cycles ter access between the producer and consumer instructions tocompletethesetaskbecomingabottleneckintheperfor- would no longer be needed. mance of LDPC. A network which can be easily configured andalsoperformtheoperationsinfewercyclesisdesirable. 2.2.4 AlgorithmDataReorderingPatterns Either a swizzle network or a more complex programmable MostcommercialDSPandgeneralpurposeprocessor(GPP) fixed pattern network would be the best solution. multimediaextensionssupportsomeformofdatareordering 2.2.5 DesignStudySummary operations. By data reordering we refer to both permuta- tion and replication of data. In graphics terminology this The result of these studies provided us with four key in- is referred to as swizzling. Swizzle operations are particu- sights that will be exploited in the next section to design larly important in SIMD machines where they are essential anefficienthigh-performancearchitectureforMSPapplica- to keep the functional units utilized. 
For example, Intel’s tions: Larrabee[29],whichsupports16-wideSIMDoperations,has a hardware“gather and scatter”instruction that allows 16 • Mostalgorithmshavesmallnaturalvectorwidthsand distinct memory addresses to build one SIMD vector. This large amounts of TLP isoneextremecasewherethegatherandscatteroperations can take multiple cycles and consume significant power if a • A large percentage of register values are short-lived number of distinct cache lines have to be accessed. At the andmanydonotneedtobewrittentotheregisterfile other end of the spectrum is the data“shuffle”network in • A small set of instruction pairs are used a large per- SODA [17] where the programmer needs to align the data centage of the time within32-wideSIMDblocksoftenrequiringmultipleshuffle instructions. ThisconsumeslesspowerthanIntel’sLarrabee • Eachalgorithmusesasmallsetofpredeterminedswiz- but can take multiple cycles to complete. zle patterns We performed a study to enumerate the number of swiz- zle operations required for the MSP algorithms. Figure 5 3. ANYSPARCHITECTURE shows a few of the common examples. The key finding was that all of the algorithms had a predefined set of swizzle Inthissection,wedescribetheAnySParchitecturewhich operations,typicallybelow10thatwereknownbeforehand. exploitstheMSPalgorithmfeaturesdescribedintheprevi- Because all the swizzle operations were known beforehand, ous section. sophisticatedgatherandscatterhardwaresupportlikethat 3.1 AnySPPEDesign of Larrabee is not needed. However, a shuffle network like thatofSODA,isnotcomplexenoughtosupporttheneeded The PE architecture is shown in Figure 6. It consists of swizzleoperations. AnexampleisLDPCwherethenumber SIMD and scalar datapaths. The SIMD datapath in turn of predetermined swizzle operations(in this case permuta- consists of 8-groups of 8-wide SIMD units, which can be tions)werelarge. InLDPCweneedamechanismtochange configuredtocreateSIMDwidthsof16,32and64. 
Each of the 8-wide SIMD units is composed of groups of Flexible Functional Units (FFUs). The FFUs contain the functional units of two lanes, which are connected together through a simple crossbar. The SIMD datapath is fed by eight SIMD register files (RFs). Each RF is 8 wide and has 16 entries. The swizzle network aligns data for the FFUs. It can support a fixed number of swizzle patterns on 8-, 16-, 32-, 64- and 128-wide elements. Finally, there is a multiple-output adder tree that can sum groups of 4, 8, or 16 elements and store the results into a temporary buffer.

The local memory consists of 16 memory banks; each bank is an 8-wide SIMD containing 256 16-bit entries, totalling 32KB of storage. Each 8-wide SIMD group has a dedicated AGU unit. When not in use, the AGU unit can run sequential code to assist the dedicated scalar pipeline. The AGU and scalar unit share the same memory space as the SIMD datapath. To accomplish this, a scalar memory buffer that can store 8-wide SIMD locations is used. Because many of the algorithms access data sequentially, the buffer acts as a small cache which helps to avoid multiple accesses to the vector banks. Details about each of these architectural features are discussed in the rest of this section.

[Figure 6 diagrams the AnySP PE: a multi-bank local memory (16 banks, with L1 program memory, a controller, DMA, and an inter-PE bus), eight 16-entry 8-wide SIMD RFs feeding eight 8-wide SIMD FFUs with 4-entry buffers, a multi-output adder tree, the swizzle network, per-group AGU pipelines, and a scalar pipeline with a scalar memory buffer.]
Figure 6: AnySP PE. It consists of SIMD and scalar datapaths. The SIMD datapath consists of 8 groups of 8-wide SIMD units, which can be configured to create SIMD widths of 16, 32 and 64. Each of the 8-wide SIMD units is composed of groups of Flexible Functional Units (FFUs). The local memory consists of 16 memory banks; each bank is an 8-wide SIMD containing 256 16-bit entries, totalling 32KB of storage.

3.1.1 Configurable Multi-SIMD Width Support

We analyzed the performance of MSP algorithms on existing SIMD architectures like [17][32] and found that the full SIMD widths could not always be utilized. In many of the algorithms the natural vector width is smaller than the processor's SIMD width, leading to wasted lanes and power or performance penalties. In order to improve SIMD utilization, code transform work, such as [18], tried to transform the algorithms to fit the SIMD width. This was achieved at the cost of increased total computation and complex code transformations. These techniques should be avoided because of their complexity and the increase in power consumption from additional preprocessing requirements.

In Section 2.2.1, we showed that many of the algorithms had different natural vector widths. While the small SIMD width kernels had large amounts of TLP, the ones with higher DLP showed little TLP.
An interesting insight was that independent threads were not the only contributors to TLP. Specifically, in H.264, the same task would be run many times for different macroblocks. Each task was independent of the others, ran the exact same code, and followed almost the same control path. The difference between the threads was that the data accesses from memory were different. This meant that for each thread, separate memory addresses would have to be computed. In SODA, the only way we were able to handle these threads was to execute them one at a time on the 32-wide SIMD. This is inefficient because these algorithms had widths smaller than the SIMD width. In order to support these types of kernel algorithms, we chose to design a multi-SIMD-width architecture. Each group of 8-wide SIMD units has its own AGU to access different data. The 8-wide groups can also be combined to create SIMD widths of 16, 32 or 64. Such a feature allows us to exploit the DLP and TLP together
for large and small SIMD width algorithms. Small SIMD width algorithms, like intra-prediction and motion compensation, can process multiple macroblocks at the same time while exploiting the 8-wide and 16-wide SIMD parallelism within the algorithms. Large SIMD width algorithms, like FFT and LDPC, can use the full 64-wide SIMD width and run multiple iterations.

[Figure 7 diagrams one 2-wide FFU slice (multiplier, ALU, adder, and swizzle sub-blocks with Reg A/Reg B, an output selection stage, and 4-entry buffers) and an 8-wide SIMD FFU built from four 2-wide FFUs connected to the swizzle network, a multi-output sum tree, and input/output vector selection.]
Figure 7: Flexible Functional Unit.

Figure 8: The swizzle operation in graph form. The rows of the selection matrix correspond to the outputs and the columns correspond to the inputs selected. Each row can only have a single 1, but columns can have multiple 1s, which allows for selective multi-casting of the data.

Ultimately, this allows two instructions per chained FFU to be in flight per cycle. Using instruction chaining, we are able to extract different levels of parallelism, like pipeline parallelism and DLP, within the same FFU unit.

3.1.2 Temporary Buffer and Bypass Network
As discussed in Section 2.2.2, the RF is one of the major As shown in Figure 7, each 8-wide SIMD group is built sources of power consumption in embedded processors. We from four 2-wide FFUs. Each functional unit consists of a implemented two different techniques in order to lower RF multiplier,ALU,adderandswizzlenetworksub-block. Such power: temporary register buffers and a bypass network. a structure benefits algorithms with SIMD widths that are Thetemporaryregisterbuffersareimplementedasapar- smaller than 64. In chained execution mode, the functional titionedRF.ThemainRFstillcontains16registers. Asec- units among two internal lanes can be connected together ond partition containing 4 registers is added to the design, through a crossbar network. Overall, FFUs improve per- making the total number of registers 20. This small par- formance and reduces power by adding more flexibility and titioned RF shields the main RF from accesses by storing exploiting both TLP and DLP. valueswhichhaveveryshortlifetimes. Thissavespowerbe- causethesmaller,lowerpowerRFisaccessedmultipletimes 3.1.4 SwizzleNetwork rather than the main, higher power, RF. Another benefit is Our analysis in Section 2.2.4 showed that the number of thatbyreducingaccessestothemainRF,registerfilepres- distinct swizzle patterns that are needed for a specific al- sure is lessened, thereby decreasing memory accesses. gorithm is small, fixed, and known ahead of time. Previous The bypass network is a modification to the writeback researchhasdiscussedbuildingapplicationspecificcrossbars stage and forwarding logic. Typically in processors, data for SIMD processors [25]. Such designs hardwired only the that is forwarded to another instruction to eliminate data swizzleoperationsneededtosupporttheapplication,yield- hazards is also written back to the RF. In the bypass net- ing a power efficient design. 
However, they lack flexibility work, the forwarding and writing to the RF is explicitly as they are unable to support new swizzle operations for managed. This is similar to TTA [5], which defines all in- different applications post fabrication. put and output ports of all modules as registers and the We propose using an SRAM-based swizzle network that programmer explicitly manages where the data should go. addsflexibilitywhilemaintainingtheperformanceofacus- However, in our design we only allow the bypass network tomizedcrossbar. Figure8illustrateshowtheswizzleoper- to be visible to the programmer. This means that the in- ationworks. Ifthe(i,j)thentryoftheselectionmatrixis1, structiondictateswhetherthedatashouldbeforwardedand then input i is routed to output j. whether the data should be written back to the RF. This TheSRAM-basedswizzlenetworkissimilartothelayout eliminates RF writes for values that are immediately con- of[10],wherethelayoutisanX-Ystylecrossbar,wherethe sumed thus reducing RF power. input buses are layed out horizontally and the outputs are layed out vertically. Each point of intersection between the 3.1.3 FlexibleFunctionalUnits input and output buses contains a pass transistor which is In typical SIMD architectures, power and performance is controlled by a flip flop. This allows for more flexibility in lostwhenthevectorsizeissmallerthantheSIMDwidthdue that configurations can be changed post fabrication. How- asaresultofunderutilizedhardware. Theproposedflexible ever, this design has three drawbacks. First, each output is functional units (FFU) allow for reconfiguration and flexi- directly driven by the input. As the crossbar size increases, bilityintheoperationofdifferenttypesofworkloads. When the capacitance of the output lines increase requiring input SIMD utilization is low, the FFUs are able to combine the driverstobelarger,resultinginahigherpowerconsumption. 
functionalunitsbetweentwolanes,whichcanspeedupmore Second,eachtimeadifferentpermutationisneeded,theflip sequential workloads that underutilize the SIMD width by flops have to be reprogrammed resulting in additional cy- exploiting pipeline parallelism. This effectively turns two cles. Third, a large number of control wires is required for lanesoftheSIMDintoa2-deepexecutionpipeline. Twodif- reprogramming the flip flops increasing power consumption ferentinstructionscanbechainedthroughthepipeline,and and complicating the layout. data can be passed between them without writing back to Tosolvetheseshortcomings,weleverageanSRAM-based 1.2 ower 1 AAAA1234,,,,1111AAAA1234,,,,2222AAAA1234,,,,3333AAAA1234,,,,4444 x BBBB1234,,,,1111BBBB1234,,,,2222BBBB1234,,,,3333BBBB1234,,,,4444 = CCCC1234,,,,1111CCCC1234,,,,2222CCCC1234,,,,3333CCCC1234,,,,4444 P 0.8 a) Matrix Multiply Operation d e aliz 0.6 AA11,,12 AA12,,11 BB11,,12 BB11,,11 AA12,,11 BB11,,11 CC11,,12 AA11,,12 AA12,,11 BB11,,12 BB11,,22 AA12,,11 BB11,,22 CC11,,12 m A1,3 A3,1 B1,3 B1,1 A3,1 B1,1 C1,3 A1,3 A3,1 B1,3 B1,2 A3,1 B1,2 C1,3 Nor 0.4 AA12,,41 AA41,,11 BB12,,41 BB21,,11 AA41,,11 BB21,,11 CC12,,41 AA12,,41 AA41,,11 BB12,,41 BB12,,22 AA41,,11 BB12,,22 CC12,,41 A2,2 A2,1 B2,2 B2,1 A2,1 B2,1 C2,2 A2,2 A2,1 B2,2 B2,2 A2,1 B2,2 C2,2 0.2 A2,3 A3,1 B2,3 B2,1 A3,1 B2,1 C2,3 A2,3 A3,1 B2,3 B2,2 A3,1 B2,2 C2,3 0 16x16 32x32 64x64 128x128 AAA233,,,412 AAA412,,,111 BBB233,,,412 BBB332,,,111 AAA412,,,111 x BBB332,,,111= CCC233,,,412 AAA233,,,412 AAA412,,,111 BBB233,,,412 BBB233,,,222 AAA412,,,111 +x BBB233,,,222 = CCC233,,,412 A3,3 A3,1 B3,3 B3,1 A3,1 B3,1 C3,3 A3,3 A3,1 B3,3 B3,2 A3,1 B3,2 C3,3 Crossbar Size A3,4 A4,1 B3,4 B3,1 A4,1 B3,1 C3,4 A3,4 A4,1 B3,4 B3,2 A4,1 B3,2 C3,4 SRAM Mux‐Based AA44,,12 AA12,,11 BB44,,12 BB44,,11 AA12,,11 BB44,,11 CC44,,12 AA44,,12 AA12,,11 BB44,,12 BB44,,22 AA12,,11 BB44,,22 CC44,,12 A4,3 A3,1 B4,3 B4,1 A3,1 B4,1 C4,3 A4,3 A3,1 B4,3 B4,2 A3,1 B4,2 C4,3 A4,4 A4,1 B4,4 B4,1 A4,1 B4,1 C4,4 A4,4 A4,1 
B4,4 B4,2 A4,1 B4,2 C4,4 Figure9: ComparisonofpowerbetweentheSRAM- +xSIMD MAC xSIMD MAC based swizzle network and MUX-based crossbar. b) Part of Matrix Multiply on Traditional SIMD Architectures The values are normalized to the MUX-based cross- B1,1 B1,1 A1,1 B1,1 C1,1 B1,1 B1,1 A1,1 B1,2 C1,1 bar for different crossbar widths with 16-bit data B1,2 B2,1 A1,2 B2,1 + B1,2 B2,1 A1,2 B2,2 + C1,2 inputs. These results correspond to a switching ac- B1,3 B3,1 A1,3 B3,1 B1,3 B3,1 A1,3 B3,2 B1,4 B4,1 A1,4 B4,1 B1,4 B4,1 A1,4 B4,2 tivity of 0.50. B2,1 B1,1 A2,1 B1,1 C2,1 B2,1 B1,1 A2,1 B1,2 C2,1 B2,2 B2,1 A2,2 B2,1 + B2,2 B2,1 A2,2 B2,2 + C2,2 B2,3 B3,1 A2,3 B3,1 B2,3 B3,1 A2,3 B3,2 B2,4 B4,1 A2,4 x B4,1 B2,4 B4,1 A2,4 x B4,2 swizzle network in which multiple SRAM cells replace the B3,1 B1,1 A3,1 B1,1 C3,1 B3,1 B1,1 A3,1 B1,2 C3,1 B3,2 B2,1 A3,2 B2,1 + B3,2 B2,1 A3,2 B2,2 + C3,2 single flip flop of [10]. First, by using circuit techniques, we B3,3 B3,1 A3,3 B3,1 B3,3 B3,1 A3,3 B3,2 decouple the input and output buses reducing the needed BB34,,41 BB41,,11 AA34,,41 BB41,,11 C4,1 BB34,,41 BB41,,11 AA34,,41 BB41,,22 C4,1 input driver strength and decreasing the overall power con- B4,2 B2,1 A4,2 B2,1 + B4,2 B2,1 A4,2 B2,2 + C4,2 B4,3 B3,1 A4,3 B3,1 B4,3 B3,1 A4,3 B3,2 sumptionwhileprovidingmorescalability. Second,multiple B4,4 B4,1 A4,4 B4,1 B4,4 B4,1 A4,4 B4,2 sets of swizzle configurations can be stored into the SRAM c) Equivalent part of Matrix Multiply on AnySP cells,allowingzerocycledelaysinchangingtheswizzlepat- tern. Third,bystoringmultipleconfigurationsweremovea Figure 10: Demonstration of the output adder tree large number of control wires necessary. for matrix multiplication Using the SRAM-based swizzle network, the area and power consumption of the network can be reduced while sultant sum into the temporary buffer unit is key, because still operating within a single clock cycle. 
Figure 9 shows manytimesthesummedvalueshaveshortlifetimesbutare the power difference between the SRAM-based swizzle net- updated and used very frequently within that period. work and MUX-based crossbar for different crossbar sizes Inthe4Gandvideodecodingalgorithms,matrixcalcula- using 16-bit inputs. As we can see, for crossbar sizes larger tions occur frequently. One of these calculations is matrix than 32x32, the power of the SRAM-based swizzle network multiplication, shown in Figure 10a, and an example of do- is dramatically lower than the MUX-based one and is able inga4x4matrixmultiplyonatypicalSIMDarchitectureis to run at almost twice the frequency. shown in Figure 10b. This calculation requires many swiz- This is one of the enabling technologies for very wide zleoperationstoalignbothmatrices. Figure10cshowshow SIMDmachinestobepowerefficientandfast. Thoughonly using an adder tree that adds only 4 numbers can reduce a certain number of swizzle patterns can be loaded with- the total operations required. Here, we see that only one out reconfiguration, it is a viable solution since only a lim- of the matrix values needs to be swizzled. Since the adder ited set of swizzle patterns need to be supported for each treeonlyproduces4ofthe16valuesperiteration,theadder algorithm. These swizzle patterns can be stored into the tree uses the temporary register buffer to store the partial SRAM-based swizzle network configuration memory at ini- SIMDvaluesinordertopreventwastedreadingandwriting tialization. Because it supports arbitrary swizzle patterns totheRF.After4iterations,16sumsareproducedforming withmulti-castingcapabilities,itfunctionsbetterthanpre- a complete SIMD vector that is then written back to the vious permutation networks found in [17][34]. RF. AnySP supports sums of 4, 8, 16, 32 or 64 elements. 3.1.5 MultipleOutputAdderTreeSupport 4. 
RESULTSANDANALYSIS SIMD architectures such as [17] have special SIMD sum- 4.1 Methodology mation hardware to perform “reduction to scalar” opera- tions. To compute this, adder trees sum up the values of The RTL Verilog model of the SODA processor [17] was all the lanes and store the result into the scalar RF. These synthesized in TSMC 180 nm technology, and the power techniques worked for 3G algorithms but do not support and arearesults for 90nm technology were estimatedusing videoapplicationswheresumsoflessthantheSIMDwidth a quadratic scaling factor based on Predictive Technology areneeded. Otherarchitecturescapableofprocessingvideo Model [12]. The main hardware components of the AnySP applications like [2][27] have sum trees, but their SIMD processorwereimplementedasRTLVerilogandsynthesized widths were only 4-8 elements wide. For wide-SIMD ma- inTSMC90nmusingSynopsysphysicalcompiler. Thetim- chines like AnySP, we designed the adder tree to allow for ing and power numbers were extracted from the synthesis partial summations of different SIMD widths, which then and used by our in-house architecture emulator tool to cal- writeback to the temporary buffer unit. Writing the re- culate the timing and power values for each of the kernels. Normalized Energy-Delay000001......024680 FFRTa d1i0x2-24pt FFRTa d1i0x2-44pt STBC SOLDDPAC AnyPSrPHeInd.2ticr6at4ion DebHFli.ol2tc6ek4ring TrIanHnv.2se6fros4rem ComMHpo.e2tn6ios4nation 146fpo1rre220046 edxai1cc61428thi 1ol u46nmx1868 4ma b oMld8158oBecsk i.e. aKXLJI = (mAaeIi+2XBbnfj+A+Cgock2) >Ddhpl> 2 L K J I X A U s B e 1 6 C S I M D D L a n e s ASSDhhuuDff ff2llee +++ RAAiDDghDDt_Shift 2 Prediction Modes Shuffle 0. Vertical Figure 12: Normalized Energy-Delay product for 12.. HDoCrizontal each kernel algorithm 34.. DDiiaaggoonnaall DDoowwnn L Refitght a b c d e f g h i j k l m n o p 5. Vertical Right 6. Horizontal Down AnySP’s PE area is 130% larger than SODA’s estimated 7. Vertical Left 8. Horizontal Up 90 nm PE area. 
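The adder-tree matrix-multiply scheme of Section 3.1.5 (Figure 10c) can be sketched behaviorally. The sketch below is illustrative only, not the hardware datapath: it models one 16-lane SIMD multiply per iteration whose products are reduced in groups of four by the output adder tree, with each iteration's four partial sums parked in the temporary buffer. The lane layout and function name are our own assumptions for the example.

```python
def simd_mac_with_adder_tree(A, B):
    """Behavioral sketch of a 4x4 matrix multiply using a 16-lane SIMD
    multiply followed by a 4-wide output adder tree (cf. Figure 10c)."""
    n = 4
    temp_buffer = []  # partial sums held in the temporary register buffer
    for j in range(n):                       # one output column per iteration
        # Lane 4*i+k holds A[i][k]; only B's column j is swizzled
        # (broadcast) across the lanes -- the single swizzle per iteration.
        a_vec = [A[i][k] for i in range(n) for k in range(n)]
        b_vec = [B[k][j] for _ in range(n) for k in range(n)]
        prod = [x * y for x, y in zip(a_vec, b_vec)]            # SIMD multiply
        sums = [sum(prod[4 * i:4 * i + 4]) for i in range(n)]   # 4-wide adder tree
        temp_buffer.append(sums)             # only 4 of the 16 results per iteration
    # After 4 iterations the 16 buffered sums form the complete vector C[i][j],
    # which the hardware would then write back to the RF in one go.
    return [[temp_buffer[j][i] for j in range(n)] for i in range(n)]
```

A conventional lane-by-lane SIMD version would repeatedly swizzle both operand matrices (Figure 10b); here only one operand moves per iteration, which is the saving the output adder tree buys.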
Unlike SODA, where the target frequency was 400 MHz, AnySP was targeted at 300 MHz. The major kernels of the 100 Mbps high mobility 4G wireless protocol and of high quality H.264 4CIF video are analyzed. The 4CIF video format is 704x576 pixels at 30 fps.

4.2 Algorithm-Level Results

In this section, we present the performance analysis of key algorithms in MSP, namely the FFT, STBC, and LDPC of 4G wireless protocols and the intra-prediction, deblocking filter, inverse transform, and motion compensation of an H.264 decoder.

The speedup of AnySP over SODA for each algorithm is shown in Figure 11. Each improvement is broken down by architectural enhancement: wider SIMD width (from 32 to 64), use of the single-cycle SRAM-based swizzle network, fused operations, and the temporary buffer with the bypass network. The energy-delay products for running the kernel algorithms on SODA and AnySP are presented in Figure 12. On average, AnySP achieves a 20% energy-delay improvement over SODA with more than a 2x speedup. A detailed analysis is presented next.

Figure 11: AnySP speedup over SODA for the key algorithms used in 4G and H.264 benchmarks. The speedup is broken down into the different architectural enhancements: wider SIMD width, single-cycle SRAM-based crossbar, fused operations, and buffer support.

4.2.1 4G Algorithm Analysis

Fast Fourier Transform. Figure 11 provides results for two different FFT configurations: 1024-point radix-2 and 1024-point radix-4. On average, AnySP achieves a 2.21x speedup over SODA. The wider SIMD width is the main contributor to the speedup because the FFT is a highly parallel algorithm. The radix-4 FFT algorithm has better performance than radix-2; 34% of its speedup is attributed to the fused operations, such as shuffle/add. The radix-4 FFT also takes more advantage of the single-cycle SRAM-based swizzle network than the radix-2 algorithm, because radix-4 algorithms require complex shuffle operations that cannot be done in a single cycle by other networks.

STBC. As shown in Table 1, STBC contains large amounts of DLP and TLP. Figure 11 shows that the wider SIMD width along with the multi-SIMD support sped up STBC by almost 100%. The rest of the speedup was obtained with the FFU, where the fused operations contributed another 20%. This was because fused pairs such as add-add, subtract-add, add-subtract, and subtract-subtract were used frequently in the algorithm. A combination of fused operations, register buffers, and the data-forwarding bypass network added an almost 20% decrease in the energy-delay product, making the algorithm run more efficiently than on the SODA architecture.

LDPC. Among the many decoding algorithms for LDPC, the min-sum algorithm was selected because of its simple operations and low memory requirements. As shown in Figure 11, AnySP's LDPC implementation has a speedup of 2.43x compared to SODA. This performance jump is because of the wider SIMD width and the versatile swizzle network, which significantly reduced the number of register reads/writes and cyclic shuffle operations when dealing with large block codes. In addition, the support of temporary buffers gives the LDPC min-sum decoding function storage for the minimum and the second minimum values, thereby accelerating the corresponding compare/update process [30].

4.2.2 H.264 Algorithm Analysis

Intra-Prediction. Figure 13 shows how the H.264 intra-prediction process is mapped onto AnySP. A 16x16 luma macroblock is broken into 16 4x4 sub-blocks, and each 4x4 sub-block has a different intra-prediction mode: Vertical, Horizontal, DC, Diagonal Down Left, Diagonal Down Right, Vertical Right, Horizontal Down, Vertical Left, or Horizontal Up. As can be seen in Figure 13, 16 SIMD lanes are used to generate 16 prediction values (a, b, ..., p) from neighboring pixel values (capital letters). Independence between the intra-prediction computations for 4x4 sub-blocks allows other SIMD units to execute on different 4x4 sub-blocks simultaneously; a factor of 2 increase in the SIMD width almost doubles the processing performance. In addition, fused operations (shuffle-add and add-shift) reduce unnecessary register read/write accesses, and the SRAM-based swizzle network supports single-cycle complex swizzle operations, together resulting in a 30% speedup.

Figure 13: Mapping the luma 16x16 MB intra-prediction process onto AnySP. The Diagonal Down Right intra prediction of a 4x4 sub-block (grey block) is shown with each cycle's operations listed; prediction values are computed with 3-tap filters of the form X = (A + 2B + C + 2) >> 2.

Deblocking Filter. The H.264 deblocking filter smoothens the block edges of decoded macroblocks to reduce blocking distortion. Based on data-dependent filtering conditions, the function of this filter varies dynamically (three-tap, four-tap, or five-tap filter). Figure 11 shows that AnySP achieves about a 2.18x performance increase compared to SODA. The wider SIMD width allows the deblocking filter to take twice as many pixel values to process, which results in a 2x speedup over SODA. As with H.264 intra-prediction, fused operations such as shuffle-add, shuffle-shift, and add-shift and the swizzle network boost performance by up to 27% and 16%, respectively.

Motion Compensation. H.264 adopts tree-structured motion compensation (MC). The size of an MC block can be one of 16x16, 16x8, 8x16, 8x8, 4x8, or 4x4, using integer-pixel, half-pixel, quarter-pixel, or eighth-pixel resolution. Because sub-sample positions do not exist in the reference frames, the fractional pixel data is created by interpolation. For example, half-pixel values are generated by the 6-tap filtering operation, which fits into an 8-wide SIMD partition; therefore, a wider SIMD engine supports processing of multiple filters at the same time. Also, buffer support helps store half-pixel values while sliding a filter window, which reduces the number of SIMD-scalar data exchanges. Overall, a 2.1x speedup is achieved by these enhancements.

4.3 System-Level Results

Figure 14 shows the power and area breakdown of AnySP running both 100 Mbps high mobility 4G wireless and high quality H.264 4CIF video. AnySP is able to meet the throughput requirement of 100 Mbps 4G wireless while consuming 1.3 W at 90 nm. This is just below the 1000 Mops/mW target, but close enough that the target can be met in 45 nm process technology. We show this in Figure 15, which plots performance versus power for various mobile computing applications. AnySP also sustains high quality H.264 4CIF video at 30 fps, though the roughly 60 mW power budget of mobile HD video remains out of reach at 90 nm.

Figure 14: PE Area and Power Summary for AnySP running 100 Mbps high mobility 4G wireless and H.264 4CIF video at 30 fps

Components | Units | Area (mm2) | Area (%) | Power (mW) | Power (%)
PE: SIMD Data Memory (32KB) | 4 | 9.76 | 38.78% | 102.88 | 7.24%
PE: SIMD Register File (16x1024bit) | 4 | 3.17 | 12.59% | 299.00 | 21.05%
PE: SIMD ALUs, Multipliers, and SSN | 4 | 4.50 | 17.88% | 448.51 | 31.58%
PE: SIMD Pipeline+Clock+Routing | 4 | 1.18 | 4.69% | 233.60 | 16.45%
PE: SIMD Buffer (128B) | 4 | 0.82 | 3.25% | 84.09 | 5.92%
PE: SIMD Adder Tree | 4 | 0.18 | <1% | 10.43 | <1%
PE: Intra-processor Interconnect | 4 | 0.94 | 3.73% | 93.44 | 6.58%
PE: Scalar/AGU Pipeline & Misc. | 4 | 1.22 | 4.85% | 134.32 | 9.46%
System: ARM (Cortex-M3) | 1 | 0.6 | 2.38% | 2.5 | <1%
System: Global Scratchpad Memory (128KB) | 1 | 1.8 | 7.15% | 10 | <1%
System: Inter-processor Bus with DMA | 1 | 1.0 | 3.97% | 1.5 | <1%
Total 90nm (1V @ 300MHz) | | 25.17 | 100% | 1347.03 | 100%
Total 65nm (0.9V @ 300MHz) | | 13.14 | | 1091.09 |
Total Est. 45nm (0.8V @ 300MHz) | | 6.86 | | 862.09 |

The power breakdown of AnySP shows that the SIMD functional units are the dominant power consumers. The RF and data memory draw comparatively less, which shows that our design is efficient: the largest portion of the power is spent doing actual computation. In SODA, by contrast, the RF and data memory were the dominant power consumers, which suggested that a lot of power was being wasted reading and writing data between the register file and memory. Our proactive approach to reducing register and memory accesses has helped improve the efficiency of the design.

Figure 15: Performance versus power requirements for various mobile computing applications.

5. RELATED WORK

Many architectures have tried to exploit DLP and TLP in order to increase power efficiency. For instance, the Vector-Thread (VT) architecture [15] can execute in multiple modes: