ebook img

Design of a Unified Transport Triggered Processor for LDPC/Turbo Decoder PDF

0.56 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Design of a Unified Transport Triggered Processor for LDPC/Turbo Decoder

Design of a Unified Transport Triggered Processor for LDPC/Turbo Decoder Shahriar Shahabuddin, Janne Janhunen, Muhammet Fatih Bayramoglu, ∗ ∗ Markku Juntti, Amanullah Ghazi, and Olli Silv´en Department of Communications Engineering and Centre for Wireless Communications, University of Oulu, Finland. ∗ Department of Computer Science and Engineering, University of Oulu, Finland. 5 1 0 Abstract—This paper summarizes the design of a pro- hardwareimplementationssuffersfrominflexibility.Therefore, 2 grammable processor with transport triggered architecture the hardware implementation of a multimode decoder might n (TTA)fordecodingLDPCandturbocodes.Theprocessorarchi- not be useful for other purposes. The design presented in this a tectureisdesignedinsuchamannerthatitcanbeprogrammed paper has the potential to be used as a detector or equalizer J for LDPC or turbodecoding for thepurposeof internetworking and roaming between different networks. The standard trellis running on factor graphs, for example. 1 based maximum a posteriori (MAP) algorithm is used for The software implementations provide the required flexi- 3 turbodecoding.Unlikemostotherimplementations,asupercode bility to support a multimode decoder, but requires a careful basedsum-productalgorithmisusedforthechecknodemessage ] design to achieve the target throughput. Programmable ac- T computation for LDPC decoding. This approach ensures the celerators, which enable software-hardwareco-design method I highesthardwareutilizationoftheprocessorarchitectureforthe . two different algorithms. Up to our knowledge, this is the first mightbe an attractive solutionto overcomethese bottlenecks. s c attempt to design a TTA processor for the LDPC decoder. The Several application-specific instruction-set processors [ processor isprogrammed witha highlevel language to meet the (ASIP) for multimode decoders with high throughput have time-to-marketrequirement.Theoptimizationtechniquesandthe been designed in [12], [13] and [14]. However, all of the 1 usage of the function units for both algorithms are explained in v detail. The processor achieves 22.64 Mbps throughputfor turbo ASIPs have been programmedwith low level languagewhich 6 decoding with a single iteration and 10.12 Mbps throughput for does not meet the time-to-market requirement. Besides, the 7 LDPC decodingwith five iterations for a clock frequencyof 200 utilization of the same function units for both decoding 0 MHz. algorithms has not been described explicitly. 0 The design of software and hardware together to grind out 0 I. INTRODUCTION the best performance and to ensure programmability is not . 2 The forward error correction (FEC) scheme is one of the a straightforward task. The designer needs a very efficient 0 integralpartsofthewirelesssystems.Theturbocodingscheme tool, which can be used to design the processor easily for 5 1 [1] has been adopted for the air interface standard called a particular application. : Long Term Evolution (LTE), that has been defined by the In this paper, we propose a design of a processor based v 3rd Generation Partnership Project (3GPP) [2]. Low-density on the transport triggered architecture (TTA) for turbo and i X parity-check codes (LDPC) [3] are also gaining popularity as LDPC decoder. TTA is a very good processor template for r it has been chosen for IEEE 802.11n WLAN systems [4], a programmableASIP. The TTA based codesign environment a IEEE 802.16e WiMAX systems [5] and DVB-S2 [6]. Due to (TCE) tool enables the designer to write an application with theirexcellentperformance,theturboandLDPCcodesarethe a high level language and design the target processor in a primary candidates for FEC scheme for the next generation graphical user interface at the same time [15]. communication systems. The roaming between WLAN and Up to our knowledge, this is the first attempt to design LTE systems requiresa multimodeFEC support. Therefore,a a TTA processor for the LDPC decoder. The turbo decoder decoder which is able to support LDPC and turbo would be with TTA has been designed by Salmela et al. [16] and beneficial. Shahabuddin et al. [17]. As a TTA processor can be best Thehardwaredesignsof turboandLDPC decodersare ata utilizedtosupportdifferentalgorithms,aunifiedprocessorfor maturedstage duetoextensiveeffortsofresearchers.Someof turbo and LDPC decoding is the natural research direction. the efficient hardware implementations of turbo decoder can The max-log-MAP algorithm is used for the component be found in [7] and [8] and some of the efficient hardware decoders of turbo decoding. The parity-check matrix of size implementations of LDPC decoders can be found in [9] and (M,N) for LDPC decoding is decomposed into M rows [10] etc. Sun and Cavallaro [11] even designed multimode of two state trellises or supercodes. The trellis based sum- decoders as pure hardware designs. The hardware implemen- productalgorithmis used on these supercodesfor check node tations provide high throughput, but the development time is calculation. The processor achieves 22.64 Mbps throughput not as rapid as processor based implementations.Besides, the for turbo decoding with a single iteration and 10.12 Mbps throughputfor LDPC decoding with five iterations. second SISO decoder is referred to as a full iteration. On the The rest of the paper is organized in the following way: otherhand,theoperationperformedbya singleSISOdecoder In Sections II and III, an overview of the turbo decoding and can be referred to as a half iteration. At the beginning of the LDPC decoding algorithms is presented. In Section IV, the first iteration, the a priori values are set at zero. Six to eight simplification techniques and the similarities between turbo full iterations are used to achieve sufficient performance [1]. and LDPC decoding algorithm are presented. The common B. MAP Algorithm for Component Decoder special function unit design is presented in Section V. The processordesignhas beenpresentedin Section VI. In Section TheMAPalgorithmforthecomponentdecoderappliedhere VII, the throughput results and comparison with other imple- hasbeenproposedbyBenedettoet al.[18].Thealgorithmcan mentationsaregiven.TheconclusionisgiveninSectionVIII. be stated like: 1.Initializethevaluesoftheforwardstatemetricasα0(s)= II. REVIEWOF TURBO DECODING 0 if s=S0 and α0(s)=−∞ otherwise. A. Turbo Decoding 2.Calculatealltheforwardstatemetricofthesamewindow through the forward recursion according to The turbo decoder consists of two soft-input soft-output (SISO)decoders,withinterleaversandde-interleaversbetween αk(s)= max∗(αk−1[sS(e)]+u(e)LuI[k−1] themasshowninFig.1.Theinputsoftheturbodecodercome e (1) from the soft demodulator,which producesthe log-likelihood +c1(e)LcI1[k −1]+c2(e)LcI2[k −1]). ratios (LLR) for the systematic bits and the parity bits. The 3. Initialize the values of the backward state metric as LLRs of the systematic bit, LuI and first parity bits, LcI1 β (s)=0 if s=S and β (s)=−∞ otherwise. n n n goes to the first SISO decoder. The SISO decoder produces 4. Calculate all the backward state metric of the same softoutputsbasedonthese LLRs.These softoutputsareused window through the backward recursion as inthesecondSISOdecoderastheadditionalinformation.The inputsofthesecondSISO decoderaretheLLRscomingfrom βk(s)= meax∗(βk+1[sE(e)]+u(e)LuI[k+1] (2) the systematic bits, second parity bits denoted by LcI2 and +c1(e)LcI1[k +1]+c2(e)LcI2[k +1]). output of the first SISO decoder. The LLRs of the systematic 5. The LLR values for the information and both parity bits bitsarescrambledthistimewiththesameinterleavingpattern can be calculated as following: used at the encoder. Similarly, the soft outputs coming from the first SISO decoder are scrambled also with the same LLR(.;O)= max∗(αk−1[sS(e)]+c1(e)LcI1[k −1] interleaving pattern, which are used as a priori values for e (3) the second SISO decoder. +c2(e)LcI2[k −1]+βk[sE(e)]). Formax-log-MAPalgorithm,max∗(x,y)≈max(x,y)[19]. Thedecodingisdoneinsmallerwindowssothatthedecoding process can be done in parallel and the decoder does not have to wait for the whole block to arrive before starting the decodingprocess.Thiswindowingissometimesreferredtoas a sliding window method. III. REVIEW OFLDPC DECODING A. Quasi-Cyclic LDPC Codes LDPC codes are linear block codes which consist of code- words satisfying the parity-check equation HxT =0, (4) where H is the parity-checkmatrix and x is a codeword.The parity-checkmatrixHis‘sparse’orconsistsofasmallnumber of non-zero entries in case of LDPC codes. The non-zero entries of the parity check matrix H are Fig.1. Blockdiagram oftheturbodecoder. usuallydistributedpseudo-randomlyaccordingto some distri- bution.Althoughthispseudo-randomdistributionleadstovery The heart of the turbo coding is the iterative decoding good FER performance, it makes the encoding and decoding procedure. The output of the second SISO decoder does not of LDPC codes difficult. Therefore,a structure is imposed on produce the hard outputs immediately, but the soft output H to ease the encoding and decoding by slightly sacrificing is used again in the first SISO decoder for more accurate the performance. A good trade-off between complexity and approximation. The process continues in a similar fashion in performanceisprovidedbythequasi-cyclic(QC)LDPCcodes aniterativemanner.Asingleiterationbyboththefirstandthe [20]. TheparitycheckmatrixoftheQC-LDPCconsistsofsquare where s and b are binary variables, ⊕ denotes the binary blocks which are either all zero matrix or cyclic shifts of the addition, and Li denotes the incoming message from the k identitymatrix.Thisstructureoftheparitycheckmatrixleads kth neighbor. to efficient encoding and decoding architectures. Due to their 3. Backward recursion metrics are computed for architecture aware construction, QC-LDPC codes have been k =Z−1,Z−2,...,1 as adopted by several wireless standards such as IEEE 802.11n, IEEE 802.16e and DVB-S2. βk(s)=max∗ βk+1(s⊕b)+(−1)bLik+1 . (6) b (cid:8) (cid:9) B. Decoding of LDPC Codes 4. Finally the outgoing message to the kth neighbor is given LDPC codescan be visualizedby a bipartite graphconsist- by ing of checkandvariablenodeswhichrepresentthe rowsand 1 columns of the parity check matrix respectively. ∗ The decoding algorithm of LDPC codes is described as a Lok = 2(cid:16)msax {αk−1(s)+βk(s)} message passing algorithm running on this graph. Alternative −max∗{αk−1(s)+βk(s⊕1)}(cid:17). (7) LDPC decoding algorithms differ basically in two aspects s which are message computation at the check nodes and mes- Noticethatthesteps3and4canbecarriedin thesamerun sage flow schedules. In LDPC decoding,messages are almost to get rid of storing β (.). k always represented in LLR domain to make the decoding Wepreferlayeredschedule[22]asthemessageflowsched- numerically stable. ule. This schedule can be formally described as follows. 1. Initialize A(n) to λ(n) for n= 1,2,...,N where λ(n) denotesthe LLRof the nth bitreceivedfrom the channeland N is the block length of the code. 2. For each row repeat the following. 2.a Assign Li = A(π (k)) − Lp where π (k) denotes k j j,k j location of the kth 1 on the jth row of H and Lp is j,k the outgoing message computed by jth row in the previous iteration for the π (k)th bit. For the first iteration take Lp j j,k as 0. 2.b. Compute the outgoing messages according to the algo- rithm above for the jth row. Fig.2. Supercodesfromtheparity-check matrixofLDPCcodes. 2.c. Update A(π (k)) as A(π (k)) ← A(π (k))+Lo and j j j k assign Lp =Lo to use in the next iteration. The sum-productalgorithmis employedat the checknodes j,k k 3. Goto Step 2 until a certain number of iteration. in exact or approximate fashion. An approximation, which is 4. A(n) holds the estimated LLR’s from the LDPC decoding. proposed in [21], computesoutgoingmessages by finding the minimum of the absolute values of a subset of the incoming IV. SHARED CALCULATIONSBETWEENTURBO AND messages. Although this approximation is very popular in LDPC DECODING the LDPC decoding literature, the hardware it imposes is not useful for turbo decoding. The hardware for computing Thereare16branchmetriccomputationsbetweentwostates the minimum and absolute values are redundant for turbo for forward metric, backward metric and LLR calculations in decoding. Therefore, using this approximation reduces the the trellis diagram of an eight state convolutionalcode. hardware reusability. Hence, we use the forward-backward From the trellis structure of the 3GPP turbo code in Fig. algorithm running on a binary trellis similar to [9] and [11]. 3, it can be seen that four calculations of branch metric are This algorithm is derived by decomposing (M,N) parity being repeated to result in total sixteen calculations. The four check matrix into M binary or two-state trellises, which can calculations can be expressed as also be called supercodes. The supercodes are shown in Fig. 2. The algorithmcan be describedfora checknodeof weight γ1=LuI+LcI1 +LcI2, Z as follows. γ2 =−LuI−LcI1 +LcI2, (8) 1. The forward and backward recursion metrics are initial- γ3=LuI+LcI1 −LcI2, ized as γ4 =−LuI−LcI1 −LcI2, α0(0) = βZ(0)=0. whereγ4 canberepresentedas−γ1andγ3 canberepresented α0(1) = βZ(1)=−∞. as −γ2. Therefore,it is sufficient to calculate γ1 and γ2 only. 2. Forward recursion metrics are computed for k = A. Forward Metric calculation 1,2,...,Z−1 as The trellis of the 3GPP turbocode can be dividedinto four αk(s)=max∗ αk−1(s⊕b)+(−1)bLik−1 , (5) butterfly pairs. Using the branch metric values given above, b (cid:8) (cid:9) and for LDPC, ∗ βk(1)=max (βk+1(0)−Lik+1,βk+1(1)+Lik+1). (12) ∗ βk(0)=max (βk+1(0)+Lik+1,βk+1(1)−Lik+1). The operations needed to calculate the forward and back- ward metric is similar. However,the outputLLR computation is different for the algorithms. The output LLR of turbo involves eight forward metric, eight backward metrics and all thebranchmetricinbetween.Thecalculationcanbepresented as LLRk =max(αk−1(1)+βk(1)+γ1(k),αk−1(2)+βk(5)+γ1(k), αk−1(7)+βk(8)+γ1(k),αk−1(8)+βk(4)+γ1(k), αk−1(3)+βk(6)+γ2(k),αk−1(4)+βk(2)+γ2(k), αk−1(5)+βk(3)+γ2(k),αk−1(6)+βk(7)+γ2(k)) −max(αk−1(1)+βk(5)−γ1(k),αk−1(2)+βk(1)−γ1(k), αk−1(5)+βk(7)−γ1(k),αk−1(6)+βk(3)−γ1(k), αk−1(3)+βk(2)−γ2(k),αk−1(4)+βk(6)+γ2(k), αk−1(7)+βk(4)−γ2(k),αk−1(8)+βk(8)+γ2(k)). (13) On the other hand, the LLR calculation of the super code in LDPC is simple. 1 LLRk = (max(αk−1(0)+βk(0),αk−1(1)+βk(1)) 2 (14) −max(αk−1(0)+βk(1),αk−1(1)+βk(0))). Fig.3. Trellisof3GPPturbocode. The output LLR calculation of turbo decoding needs to be divided into four parts to make the calculations similar. the forwardmetric calculation of a butterfly pair can be given V. SPECIAL FUNCTION UNIT DESIGN as A. ALPHA Special Function Unit ∗ αk(1)=max (αk−1(1)−γk−1(1),αk−1(2)+γk−1(1)). ∗ A function unit can be made with three inputs and two αk(5)=max (αk−1(1)+γk−1(1),αk−1(2)−γk−1(1)). outputstocomputetheforwardandbackwardmetric.Inturbo (9) case,theunitcanuseαk−1(1),αk−1(2)andγk−1(1)asinputs The forward metric calculation for a supercode in LDPC andcomputetheoutputsofαk(1)andαk(5).IncaseofLDPC, decoding is also similar. the same unit can use αk−1(0), αk−1(1) and LLR as inputs and compute α (0) and α (1) as outputs. The same function k k ∗ αk(1)=max (αk−1(0)−Lik−1,αk−1(1)+Lik−1). unit can be used for both of the cases because the operations (10) αk(0)=max∗(αk−1(0)+Lik−1,αk−1(1)−Lik−1). are same as can be seen from (9) and (10). One of the ALPHA special function units calculates two The value of the branch metric in LDPC decoding equals forwardmetricvaluesbasedontwoearlierstateforwardmetric to a single LLR value between two time instances. On the in the same butterfly pair. Therefore four ALPHA unit can other hand, the branch metric value in turbo decoding is a calculate all the necessary forward metric values for one time combination of three LLR values. instant. A block diagram of the ALPHA unit used for LDPC decoding is presented in Fig. 4. B. Backward Metric and LLR calculation On the other hand, the LDPC can utilize these four units by processing four supercodes parallely. The backward metric calculation is also similar for turbo and LDPC. For turbo decoding, the backward metric calcula- B. BetaLLR Special Function Unit tion of a butterfly pair can be presented as The backward metric and the LLR is computed together ∗ βk(1)=max (βk+1(1)−γk+1(1),βk+1(5)+γk+1(1)). to reduce memory requirement. For a single algorithm it is βk(2)=max∗(βk+1(1)+γk+1(1),βk+1(5)−γk+1(1)). easierto designa unitforbetaseparatelyandLLRseparately. (11) However, the LLR calcualtion of LDPC and turbo is not the Equation (13) can be expressed as LLRk =max(max(αk−1(1)+βk(1),αk−1(2)+βk(5))+γ1(k), max(αk−1(7)+βk(8),αk−1(8)+βk(4))+γ1(k), max(αk−1(3)+βk(6),αk−1(4)+βk(2))+γ2(k), max(αk−1(5)+βk(3),αk−1(6)+βk(7))+γ2(k)) −max(max(αk−1(1)+βk(5),αk−1(2)+βk(1))−γ1(k), max(αk−1(5)+βk(7),αk−1(6)+βk(3))−γ1(k), max(αk−1(3)+βk(2),αk−1(4)+βk(6))+γ2(k), max(αk−1(7)+βk(4),αk−1(8)+βk(8)+γ2(k)). (16) We can divide (16) in four parts to use the BetaLLR unit. For example, one of the parts is given here as output1=max(αk−1(1)+βk(1),αk−1(2)+βk(5)). (17) output2=max(αk−1(1)+βk(5),αk−1(2)+βk(1)). Fig.4. ALPHAunitforasinglebutterflypair. The branch metric γ needs to be added or subtracted with the left side of (17) and have to use maximization unit to get the final LLR output. A block diagram is given in Fig. 6 to same. It can be seen from (14) that we can calculate with a calculate a LLR in turbo mode. special function unit the two maximization properties as output1=max(αk−1(0)+βk(0),αk−1(1)+βk(1)). (15) output2=max(αk−1(0)+βk(1),αk−1(1)+βk(0)). The unit would calculate the earlier state backward metrics βk(0) and βk(1) based on βk+1(0) and βk+1(1). Therefore, the BetaLLR unit takes five inputs and produce four outputs. For example, the unit takes βk+1(0), βk+1(1), αk−1(0), αk−1(1) and Lik+1 as inputs and can produce the earlier state backward metrics β (0), β (1), output1 and k k output2 of (15). A block diagram of the unit is given in Fig. 5. Fig.6. LLRcomputation withfourBetaLLRunits inturbomode. Four BetaLLR unit can be utilized in LDPC mode by processing four supercodes parallely. VI. TRANSPORT TRIGGERED ARCHITECTURE PROCESSOR A. Top level architecture A part of the TTA processor designed for the LDPC and turbo decoding is illustrated in Fig. 7. For readability, the whole processor figure is not given. The blocks on the upper partofthefigurerepresentthefunctionunitsandregisterfiles of the processor. The black horizontal straight lines represent the buses of the processor. The vertical rectangular blocks represent the sockets. The connection between function units Fig.5. BetaLLRunitforasinglebutterflypair. and buses is illustrated by black dots in the sockets. The fixed point processor includes load/store unit (LSU), Equation (13) has to be divided in some similar form to arithmetic logic unit (ALU), global control unit (GCU) and utilize the BetaLLR unitto computethe backward metric and registerfiles.Basedontheresourcerequirementsinhighlevel output LLR for turbo decoding. language, function units and register files are added. Fig.7. Implemented processorwithreduced numberoffunction units. The ALU unit is used to perform the basic arithmetic op- with the BetaLLR unit and used immediately to calculate the erationslike addition, subtractionetc. Operationslike shifting corresponding LLR. right or left are also included in ALU. The forward metric and backward metrics increase in each For turbo decoding, the forward and backward metric step and that is why the forward and backward metrics is computations between two states need at least four of the normalized to avoid memory overflow. ALPHAandBetaLLRunits.Therefore,fourALPHAandfour Before processing the ALPHA and BetaLLR values, the BetaLLR units are used in the processor. γ values need to be calculated. As shown in figure 7, The LSU units are used to support the memory accesses. The output of the BetaLLR also needs to maximize and added or LSU units are used to read and write memory. The memory subtracted with γ values to find the output LLR. canbereadinthreeclockcyclesandcanbewritteninasingle In the LDPC mode, the processor is programmed for the cycle. LDPC code of IEEE WLAN 802.11n of code rate 1/2 and Several register files are used to save the intermediate outputblocksize of648.Due to the data dependencies,a sin- results. In terms of the power consumption, registers can gle special function unit is used several times to calculate the be more expensive than memory, but to meet the latency required forward and backward metric values of a supercode requirementregisterfilesareneeded.AsingleBooleanregister inserialfashion.Forexample,thefirstrowoftheHmatrixof file has been included in the processor design. thisparticularcodeconfigurationhassevennonzeroelements. Thirty buses have been used for the processor. The number Thetwo-statetrellisshouldbeamatrixof8×2tocalculatethe of buses is crucialto ensure the parallelprocessing.However, forward and backward metrices for this row. The initilization the complexity also increases with the increased number of values of the metrices are zero. Therefore, only one ALPHA buses. andoneBetaLLRunitsareusedseventimeseachtogetallthe The LLRoutputshavebeen writtenusing a first-in-first-out necessaryoutputLLR fromthis row.Fourof thisrowscan be (FIFO)bufferbyusingthefunctionunitcalledSTREAM.The processed in parallel as there are four ALPHA and BetaLLR STREAM units can write every output sample in three clock units available. cycle. ThevariablenodeupdateisdonebysimplyaddingtheLLR outputs of the super codes with the corresponding original B. Programming the processor LLRsofthesameposition.Shiftingoperationsarerequiredto The processor is programmed with high level language C. VII. RESULTS ANDDISCUSSION Several macros have been used to call the function units and use part of that code with that specific function unit. The designed processor takes 166,224 clock cycles to pro- The turbo decodingalgorithmuse three blocksof 6,144in- cess three blocks of 6,144 samples for a full iteration for the putLLRs.Theblockshavebeendividedinsmallerwindowsto turbo decoding. The processor takes 10,368 clock cycles to savethememoryrequirements.Onlytheforwardmetricshave decode a LDPC code for IEEE WLAN 802.11n of block size been saved in a window. The backward metric is calculated 648 and code rate 1/2 after five iterations. TABLEIII Thethroughputcanbe calculatedusingthefollowingequa- PROGRAMMABLELDPCPROCESSORS tion as Reference Architecture Algorithm Throughput Size of the code block×device clock frequency Throughput= . [26] TMS320C64xx min-sum 1.8 Mbps @ 10 latency×number of iterations DSP it. (18) proposed TTA proc. for supercode based 10.12 Mbps @ 5 The throughputachievedfor the turbomode is 22.64Mbps LDPC sum-product it. [27] SDRSODA min-sum 15.2 Mbps @ 10 fora singleiterationandforLDPCmode10.12Mbpsforfive it. iterations for a clock frequency of 200 MHz. [14] VLIWASIP offsetmin-sum 53Mbps@10it. The buses of the processor are perfectly utilized to achieve [12] VLIWASIP offsetmin-sum 16.32 - 128.5 Mbps @ 10-20 thebestpossibleresultduetotheperfectscheduling.Thenum- it. ber of some of the operations during the algorithm execution has been summarized in Table I. frequencyafter 20 iterationswas achievedwhen the coderate TABLEI NUMBEROFOPERATIONS and block size is the same as presented in this paper. The design of [14] achievedhigh throughputwith a differentcode Operation #ofOPSinturbo #ofOPSinLDPC configuration.Besides, alltheimplementationspresentedhere ADD 431,009 87,134 SUB 96,354 14,231 used the assembly language. MAX 43,008 0 ALPHA 24,576 2,376 VIII. CONCLUSION BetaLLR 24,576 2,376 STREAM 6,144 648 Thepaperdiscussedthedesignissues ofa turboandLDPC decoder on a TTA processor. The design shows the promise The number of addition operations does not only represent of the possibility of designingseveraldecodingtechniqueson the addition for the algorithm, but for several other purposes a single TTA processor. The processor designed in this paper like loop indexing for the code. The maximization units are can be used fortasks beyonddecoding,forinstance, it can be notusedincaseofLDPCdecodingbecausethemaximization programmedfordetectionandequalizationalgorithmsrunning operations are done inside the ALPHA and BetaLLR units. on factor graphs [28]. The target throughput of LTE can also Acomparisonwithdifferentotherprogrammableimplemen- bereachedbymulti-coreTTAprocessor.Theflexibilitygained tations of turbo decoder has been presented in Table II. The from that processor could provide very interesting results and throughput results are normalized for a clock frequency of would be a fruitful direction for future research. 200 MHz. Our proposed processor with turbo mode provides very good throughput compared to other programmable im- IX. ACKNOWLEDGEMENT plementations. The TTA processors of [16] and [17] provide ThisresearchwassupportedbytheFinnishFundingAgency higher throughputs but the designs were dedicated for only for Technology and Innovation (Tekes), Renesas Mobile Eu- turbo decoding. rope,NokiaSiemensNetworks,ElektrobitandXilinx.Special thanksareduetoDr. PerttuSalmelafromTampereUniversity TABLEII PROGRAMMABLETURBOPROCESSORS of Technology and Dr. Frederik Naessens from IMEC for sharing their insights on programmable turbo and LDPC Reference Architecture Algorithm Throughput decoder implementations. [23] TMS320C6201DSP max-log-MAP 2Mbps [24] VLIWASIP max-log-MAP 5Mbps proposed TTAproc.inturbomode max-log-MAP 22.64Mbps REFERENCES [17] TTAproc.forLTE max-log-MAP 31.21Mbps [25] NvidiaC1060 max-log-MAP 33.85Mbps [1] C. Berrou, A. Glavieux, P. Thitimajshima, “Near shannon limit error- [16] TTAproc. max-log-MAP 98Mbps correcting coding and decoding: turbo-codes,” in IEEE International ConferenceonCommunication,vol.2,pp.1064-1070,Geneva,Switzer- land,May1993. Acomparisonwithdifferentotherprogrammableimplemen- [2] Multiplexing andchannel coding, 3GPPTS36.212version10.5.0. tations of LDPC decoderhas been presentedin Table III. The [3] R.Gallager, ”Low-densityparity-check codes,”inIRETransactions on Information Theory,vol.8,pp.21-28,Jan.1962. throughput results are normalized for a clock frequency of [4] IEEEP802.11n/D1.04,Part11:WirelessLANMediumAccessControl 200MHz.OurproposedprocessorwithLDPCmodeprovides (MAC)andPhysicalLayer(PHY)specifications, Sep.2006. moderate throughput compared to most of the programmable [5] IEEEP802.16e/D5,Part16:AirInterfaceforFixedandMobileBroad- bandWireless Access Systems,Sep.2004. implementations. [6] DigitalVideoBroadcasting(DVB);Secondgenerationframingstructure, Alles et al. presentedan efficientimplementationof multi- channel coding and modulation systems for Broadcasting, Interactive mode decoder in [12]. The ASIP achieved 34.5 Mbps to 257 Services, News Gathering and other broadband satellite applications (DVB-S2),2009-2008. Mbps for LDPC codes of different code configurations and [7] Y.SunandJ.Cavallaro,”Efficienthardwareimplementationofahighly- block size of IEEE 802.11n when the clock frequency is 400 parallel 3GPP LTE/LTE-advance turbo decoder,” Integration, the VLSI MHz. Thelowestthroughputof 34.5Mbpsat400MHz clock Journal, 2010. [8] G.Prescher,T.Gemmeke,andT.G.Noll,”Aparametrizablelow-power high-throughput turbo-decoder,” in IEEE International Conference on Acoustics,Speech,andSignalProcessing(ICASSP),vol.5,pp.25-28, Mar.2005. [9] M. M. Mansour, and N. R. Shanbhag, ”High-throughput LDPC de- coders,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems,vol.11,pp976-996,Dec.2003. [10] C. Studer, N. Preyss, C. Roth, and A. Burg, ”Configurable high- throughput decoder architecture forquasi-cyclic LDPCcodes,” inPro- ceedings of the 42th Asilomar Conference on Signals, Systems, and Computers, PacificGrove,CA,USA,Oct.2008. [11] Y.Sun,andJ.Cavallaro,”AflexibleLDPC/turbodecoderarchitecture,” JournalofSignalProcessingSystems,pp1-16,2010. [12] M. Alles, T. Vogt, and N. Wehn, ”FlexiChap: A reconfigurable ASIP forconvolutional,turbo,andLDPCcodedecoding,”in5thInternational Symposium on Turbo Codes and Related Topics, pp. 84-89, Lausanne, Switzerland, Sep.2008. [13] F. Naessens, B. Bougard, S. Bressinck, L. Hollevoet, P. Raghavan, L. VanderPerreandF.Catthoor,”Aunifiedinstructionsetprogrammable architecture for multi-standard advanced forward error correction,” in IEEE Workshop on Signal Processing Systems, SIPS 2008, pp. 31-36, Washington, D.C.,USA,Oct.2008. [14] S.Kunze,E.Matus,andG.P.Fettweis,”ASIPdecoderarchitecture for convolutional and LDPCcodes,” in IEEEInternational Symposium on CircuitsandSystems,ISCAS,pp.2457-246,Taipei,Taiwan,May2009. [15] P. Ja¨a¨skela¨inen, V. Guzma, A. Cilio, T. Pitka¨nen, and J. Takala, “Codesigntoolsetforapplication-specific instruction-set processors,”in MultimediaonMobileDevices2007,vol.6507ofProceedingsofSPIE pp.1-11,SanJose,CA,USA,Jan.2007. [16] P.Salmela,H.Sorokin,andJ.Takala,”Aprogrammablemax-log-MAP turbo decoder implementation,” Hindawi VLSI Design, vol. 2008, pp 636-640,2008. [17] S. Shahabuddin, J. Janhunen, and M. Juntti, ”Design of a transport triggered architecture processor for a flexible iterative turbo decoder,” in Proceedings of Wireless Innovation Forum Conference on Wireless Communications Technologies and Software Radio (SDR Wincomm), Washington, D.C.,USA,Jan.2013. [18] S. Benedetto, G. Montorsi, D. Divsalar, and F. Pollara, “A soft-input soft-output maximum a posteriori (MAP) module to decoder parallel and serial concatenated codes,” in JPL TDA Progr. Rep., vol. 42-127, pp.1-20,JetPropulsionLab.,Pasadena, CA,1996. [19] P.Robertson,P.Hoeher,andE.Villebrun,“Optimalandsuboptimalmax- imumaposteriorialgorithmssuitableforturbodecoding,”inEuropean Trans.onTelecommun., vol.8,pp.119-125,Mar./Apr.1997. [20] M. P. C. Fossorier, ”Quasicyclic low-density parity-check codes from circulant permutation matrices,” in IEEE Transactions on Information Theory,vol.50,pp.1788-1793, Aug.2004. [21] J.Chen,A.Dholakia,E.Eleftheriou,M.P.C.Fossorier,andX.-Y.Hu ”Reduced-ComplexityDecodingofLDPCcodes,”inIEEETransactions onCommunications, vol.53,pp.1288-1299,Aug.2005. [22] E.Sharon,S.Litsyn,andJ.Goldberger, ”Anefficient message-passing scheduleforLDPCcoding,”in23rdIEEEConventionofElectricaland Electronics Engineers inIsrael,Sep.2004. [23] Y. Song, G. Liu and Huiyang, “The implementation of turbo decoder onDSPinW-CDMAsystem,”inInternational ConferenceonWireless Communications, Networking and Mobile Computing, pp. 1281-1283, Dec.2005. [24] P. Ituero and M. L´opez-Vallejo, “New schemes in clustered VLIW processors applied to turbo decoding,” in Proceedings of International Conference onApplication-Specific Systems, Architectures and Proces- sors(ASAP’06),pp.291-296,SteamboatSprings,CO,USA,Sep.2006. [25] M. Wu,Y. Sun, and J. R. Cavallaro, ”Implementation of a3GPP LTE turbodecoder accelerator onGPU,”in2010IEEEWorkshoponSignal ProcessingSystems(SIPS),pp.192-197,SanFrancisco,CA,USA,Oct. 2010. [26] G.Lehner,J.Sayir,andM.Rupp,”EfficientDSPimplementationofan LDPCdecoder,”inIEEEInternationalConferenceonAcoustics,Speech, andSignalProcessing,ICASSP,vol.4,pp.665-668,May2004. [27] S. Seo, T. Mudge, Y. Zhu, and C. Chakrabarti, ”Design and analysis of LDPC decoders for software defined radio,” in IEEE Workshop on SignalProcessingSystems,pp.210-215,Shanghai, China, Oct.2007. [28] M.H.Taghavi,andP.H.Siegel,”Graph-baseddecodinginthepresence ofISI,”inIEEETransactionsonInformationTheory,vol.57,pp.2188- 2202,Apr.2011.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.