Symbol-Decision Successive Cancellation List Decoder for Polar Codes

Chenrong Xiong, Jun Lin, Student Member, IEEE, and Zhiyuan Yan, Senior Member, IEEE

Abstract—Polar codes are of great interest because they provably achieve the capacity of both discrete and continuous memoryless channels while having an explicit construction. Most existing decoding algorithms of polar codes are based on bit-wise hard or soft decisions. In this paper, we propose symbol-decision successive cancellation (SC) and successive cancellation list (SCL) decoders for polar codes, which use symbol-wise hard or soft decisions for higher throughput or better error performance. First, we propose a recursive channel combination to calculate symbol-wise channel transition probabilities, which lead to symbol decisions. Our proposed recursive channel combination also has a lower complexity than simply combining bit-wise channel transition probabilities. The similarity between our proposed method and Arikan's channel transformations also helps to share hardware resources between calculating bit- and symbol-wise channel transition probabilities. Second, a two-stage list pruning network is proposed to provide a trade-off between the error performance and the complexity of the symbol-decision SCL decoder. Third, since memory is a significant part of SCL decoders, we propose a pre-computation memory-saving technique to reduce the memory requirement of an SCL decoder. Finally, to evaluate the throughput advantage of our symbol-decision decoders, we design an architecture based on a semi-parallel successive cancellation list decoder. In this architecture, different symbol sizes, sorting implementations, and message scheduling schemes are considered. Our synthesis results show that, in terms of area efficiency, our symbol-decision SCL decoders outperform existing bit-decision and multi-bit-decision SCL decoders.

Index Terms—Error control codes, polar codes, successive cancellation, list decoding algorithm, hardware implementation.
I. INTRODUCTION

Polar codes, a groundbreaking finding by Arikan [1] in 2009, have ignited a spark of research interest in the fields of communication and coding theory, because they provably achieve the capacity of both discrete [1] and continuous [2] memoryless channels. A second reason why polar codes are attractive is their low encoding and decoding complexity. For example, a polar code of length N can be decoded by the successive cancellation (SC) algorithm [1] with a complexity of $O(N \log N)$. However, this capacity-approaching behavior is achieved only when the code length is large enough ($N > 2^{20}$ [3]) if the SC algorithm is used. For short or moderate code lengths, in terms of error performance, polar codes with the SC algorithm are inferior to Turbo codes or low-density parity-check (LDPC) codes [4], [5].

Since the debut of polar codes, many efforts have been made to improve the error performance of short polar codes. Systematic polar codes [6] were proposed to reduce the bit error rate (BER) while guaranteeing the same frame error rate (FER) as their non-systematic counterparts. Although a Viterbi algorithm [7], a sphere decoding algorithm [8], and a stack sphere decoding algorithm [9] can provide maximum likelihood (ML) decoding of polar codes, they are considered infeasible, especially for long polar codes, due to their much higher complexity than the SC algorithm. Recently, an SC list algorithm for polar codes was proposed in [10] to bridge the performance gap between the SC algorithm and ML algorithms at the cost of a complexity of $O(LN \log N)$, where L is the list size. Moreover, the concatenation of polar codes with cyclic redundancy check (CRC) codes was introduced in [4], [11]. To decode CRC-concatenated polar codes, a CRC detector is used in the SCL algorithm to help select the output codeword. The combination of an SCL algorithm and a CRC detector is called the CRC-aided SCL (CA-SCL) algorithm. [11] shows that with the CA-SCL algorithm, the error performance of a (2048, 1024) CRC-concatenated polar code is better than that of a (2304, 1152) LDPC code used in the WiMAX standard [12].

Several architectures have been proposed for the SC algorithm. Arikan [1] showed that a fully parallel SC decoder has a latency of $2N-1$ clock cycles. A tree SC decoder and a line SC decoder with a complexity of $O(N)$ were proposed in [13]. These two decoders have the same latency as the fully parallel SC decoder. To reduce complexity further, Leroux et al. [3] proposed a semi-parallel SC decoder for polar codes that takes advantage of the recursive structure of polar codes to reuse processing resources. Assuming the number of processing elements (PEs) is P ($P = 2^p \leq N$), the latency of the semi-parallel SC decoder is $2N + \frac{N}{P}\log_2(\frac{N}{4P})$ clock cycles. To reduce the latency, a simplified SC (SSC) polar decoder was introduced in [14] and further analyzed in [15]. In the SSC polar decoder, a polar code is converted to a binary tree with three types of nodes: rate-one, rate-zero, and rate-R nodes. Based on the SSC polar decoder, the ML-SSC decoder in [16] applies the ML algorithm to part of the rate-R nodes. However, the SSC and ML-SSC polar decoders depend on the positions of information bits and frozen bits, and are consequently code-specific. In [17], a pre-computation look-ahead technique was proposed to reduce the latency of the tree SC decoder by half. For the SCL polar decoder, the semi-parallel architecture was adopted in [18]. In [19], Balatsoukas-Stimming et al. proposed an architecture with L = 4 that achieves a throughput of 124 Mbps and a latency of 8.25 µs when decoding a (1024, 512) polar code. In [20], Lin and Yan designed an SCL polar decoder with a throughput of 182 Mbps and a latency of 5.63 µs. To reduce the memory requirement, log-likelihood ratio (LLR) messages are used in [21]. The throughput of existing polar decoders is still not high enough for high-speed applications.

Since the low throughput (or long latency) of the SC decoder is due to its serial nature, several previous works attempt to improve the throughput (or latency). In [22], the data bits of a polar code are split into several streams, which are decoded simultaneously. This idea of parallel processing is extended in [23], where the SC decoder is transformed into a concatenated decoder in which all the inner SC decoders are carried out in parallel. Yuan and Parhi proposed a multi-bit SCL decoder [24].
In this paper, we address the throughput/latency issue by proposing symbol-decision SC and SCL decoders, which are based on symbol-wise hard or soft decisions. Since each symbol consists of M bits, when M > 1 the symbol-decision decoders achieve higher throughput as well as better error performance. The proposed symbol-decision decoders are natural generalizations of their bit-wise counterparts, and reduce to existing bit-wise decoders when the symbol size is one bit. The main contributions of this paper are:

• We propose a novel recursive channel combination to calculate the symbol-wise channel transition probabilities, which enables symbol decisions in SC and SCL algorithms. The proposed recursive channel combination also has a lower complexity than simply combining bit-wise channel transition probabilities. The similarity between Arikan's recursive channel transformation and our symbol-wise recursive channel combination helps to share hardware resources for calculating the bit- and symbol-based channel transition probabilities.

• An M-bit symbol-decision SCL decoder needs to find the L most reliable candidates out of $2^M L$ list candidates. We propose a two-stage list pruning network to perform this sorting function. This pruning network also provides a trade-off between performance and complexity.

• By adopting the pre-computation technique [25], we develop a pre-computation memory-saving (PCMS) technique to reduce the memory requirement of the SCL decoder. Specifically, the channel information memory can be eliminated when using the PCMS technique. Moreover, this technique also helps to improve throughput slightly.

• To evaluate the throughput of symbol-decision decoders, we propose an area-efficient architecture for symbol-decision SCL decoders¹. In our architecture, to save area, adders in processing units are reused to calculate the symbol-wise channel transition probability. We propose two scheduling schemes for sharing hardware resources. We also propose two list pruning networks for designs with different symbol sizes.

• We design two-, four-, and eight-bit symbol-decision SCL decoders for a (1024, 480) CRC32-concatenated polar code with a list size of four. Synthesis results show that in terms of area efficiency, our symbol-decision SCL decoder outperforms all existing state-of-the-art SCL decoders in [19]–[21], [24]. For example, the area efficiency of our four-bit symbol-decision SCL decoder is 259.2 Mb/s/mm², which is 1.51 times that of [21]. Our implementation results also demonstrate that the symbol-decision SCL decoder can provide a range of trade-offs between area, throughput, and area efficiency.

¹We focus on the SCL decoder because the SC decoder can be considered as an SCL decoder with a list size of one.

Our symbol-decision decoding algorithms assume that the underlying channel has a binary input, and our symbol-wise channel transformation is virtual and introduced for decoding only. Hence, our work is different from those assuming a q-ary (q > 2) channel (see, for example, [26]).

The decoding schedule (bit sequence) of our symbol-decision decoding algorithms is actually the same as those in [22]–[24], but our symbol-decision decoding algorithms differ from those in [22]–[24] in two aspects. First, our symbol-wise recursive channel combination is different from how transition probabilities are derived in [22]–[24]. Second, the symbol-decision perspective allows us to prove that the symbol-decision algorithms have better frame error rates (FERs) than their bit-decision counterparts [27], whereas only simulation results are provided in [22], [24] and error performance is not investigated in [23]. There are additional differences between our decoding algorithms/architectures and those in [22]–[24]. For instance, all the bits within a symbol are estimated jointly in our symbol-decision SC algorithm, whereas some bits are decoded independently in the decoder with parallelism two in [22]. Also, while our symbol-decision decoding is introduced at the algorithmic level, the multi-bit decoder is introduced at the level of decoding operations [24]. Finally, for our symbol-decision SCL decoders, we use the semi-parallel architecture because it is more area-efficient than the tree architecture and the line architecture [13].

The rest of our paper is organized as follows. Section II briefly reviews polar codes and existing decoding algorithms for polar codes. In Section III, the symbol-based recursive channel combination is proposed to calculate the symbol-based channel transition probability. Moreover, to simplify the selection of the list candidates, a two-stage list pruning network is proposed. In Section IV, we introduce a method to reduce the memory requirement of list decoders of polar codes by a pre-computation technique. In Section V, we present the hardware architecture for symbol-decision SCL decoders. Two scheduling schemes for hardware sharing are discussed.
We also propose two list pruning networks for different designs: a folded sorting implementation and a tree sorting implementation. A discussion on the latency of our architecture and synthesis results for our implementations are provided in this section as well. Finally, we draw some conclusions in Section VI.

II. POLAR CODES AND EXISTING DECODING ALGORITHMS

A. Preliminaries

We follow the notation for vectors in [1], namely $u_a^b = (u_a, u_{a+1}, \cdots, u_{b-1}, u_b)$; if $a > b$, $u_a^b$ is regarded as void. $u_{a,o}^b$ and $u_{a,e}^b$ denote the subvectors of $u_a^b$ with odd and even indices, respectively.

Let $W: \mathcal{X} \to \mathcal{Y}$ represent a generic B-DMC with binary input alphabet $\mathcal{X}$, arbitrary output alphabet $\mathcal{Y}$, and transition probabilities $W(y|x)$, $y \in \mathcal{Y}$, $x \in \{0,1\}$. Assume N is an arbitrary integer and M is an integer satisfying $M \mid N$. Let $W_{N,M}^{(j)}$ denote a set of $\frac{N}{M}$ coordinate channels $W_{N,M}^{(j)}: \mathcal{X}^M \to \mathcal{Y}^N \times \mathcal{X}^{(j-1)M}$, $0 < j \leq \frac{N}{M}$, with transition probabilities $W_{N,M}^{(j)}(y_1^N, x_1^{(j-1)M} \,|\, x_{(j-1)M+1}^{jM})$, where $(y_1^N, x_1^{(j-1)M})$ and $x_{(j-1)M+1}^{jM}$ denote the output and input of $W_{N,M}^{(j)}$, respectively.

B. Polar Codes

Polar codes are linear block codes whose block lengths are restricted to powers of two, denoted by $N = 2^n$ for $n \geq 2$. Assume $u = u_1^N = (u_1, u_2, \cdots, u_N)$ is the data bit sequence. Let $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. The corresponding encoded bit sequence $x = x_1^N = (x_1, x_2, \cdots, x_N)$ is generated by

$$x = u B_N F^{\otimes n}, \quad (1)$$

where $B_N$ is the $N \times N$ bit-reversal permutation matrix and $F^{\otimes n}$ denotes the $n$-th Kronecker power of $F$ [1].

For any index set $\mathcal{A} \subseteq \{1, 2, \cdots, N\}$, $u_{\mathcal{A}} = (u_i : 0 < i \leq N, i \in \mathcal{A})$ is the sub-sequence of $u$ restricted to $\mathcal{A}$. For an $(N, K)$ polar code, the data bit sequence is grouped into two parts: a $K$-element part $u_{\mathcal{A}}$ which carries information bits, and $u_{\mathcal{A}^c}$ whose elements are predefined frozen bits, where $\mathcal{A}^c$ is the complement of $\mathcal{A}$. For convenience, frozen bits are set to zero.

C. SC Algorithm for Polar Codes

Given a transmitted codeword x and the corresponding received word y, the SC algorithm for an (N, K) polar code estimates the encoded bit sequence u successively, as shown in Alg. 1. Here, $\hat{u} = (\hat{u}_1, \hat{u}_2, \cdots, \hat{u}_N)$ represents the estimated value of u.

Algorithm 1: SC Decoding Algorithm [1]
  for j = 1 : N do
    if j ∈ A^c then û_j = 0
    else if W_{N,1}^{(j)}(y, û_1^{j-1} | u_j = 1) / W_{N,1}^{(j)}(y, û_1^{j-1} | u_j = 0) ≥ 1 then û_j = 1
    else û_j = 0

To calculate $W_{N,1}^{(j)}(y, \hat{u}_1^{j-1}|u_j)$, Arikan's recursive channel transformation [1] is applied. A pair of binary channels $W_{2\Lambda,1}^{(2i-1)}$ and $W_{2\Lambda,1}^{(2i)}$ are obtained by a single-step transformation of two independent copies of a binary-input channel $W_{\Lambda,1}^{(i)}$: $(W_{\Lambda,1}^{(i)}, W_{\Lambda,1}^{(i)}) \mapsto (W_{2\Lambda,1}^{(2i-1)}, W_{2\Lambda,1}^{(2i)})$. The channel transition probabilities of $W_{2\Lambda,1}^{(2i-1)}$ and $W_{2\Lambda,1}^{(2i)}$ are given by

$$W_{2\Lambda,1}^{(2i-1)}(y_1^{2\Lambda}, u_1^{2i-2}|u_{2i-1}) = \frac{1}{2} \sum_{u_{2i}} \Big[ W_{\Lambda,1}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} \,|\, u_{2i-1} \oplus u_{2i}) \cdot W_{\Lambda,1}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2}|u_{2i}) \Big] \quad (2)$$

and

$$W_{2\Lambda,1}^{(2i)}(y_1^{2\Lambda}, u_1^{2i-1}|u_{2i}) = \frac{1}{2} W_{\Lambda,1}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} \,|\, u_{2i-1} \oplus u_{2i}) \cdot W_{\Lambda,1}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2}|u_{2i}), \quad (3)$$

where $0 < i \leq \Lambda = 2^{\lambda} < N$ and $0 \leq \lambda < n$.

Expressed in log-likelihoods (LLs), Eqs. (2) and (3) can be approximated as [4]:

$$LL_{2\Lambda}^{(2i-1)}(y_1^{2\Lambda}, u_1^{2i-2}|u_{2i-1}) \approx \max\Big\{ LL_{\Lambda}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2}|u_{2i-1} \oplus 0) + LL_{\Lambda}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2}|0),\; LL_{\Lambda}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2}|u_{2i-1} \oplus 1) + LL_{\Lambda}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2}|1) \Big\} - \log 2, \quad (4)$$

$$LL_{2\Lambda}^{(2i)}(y_1^{2\Lambda}, u_1^{2i-1}|u_{2i}) \approx LL_{\Lambda}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2}|u_{2i-1} \oplus u_{2i}) + LL_{\Lambda}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2}|u_{2i}) - \log 2. \quad (5)$$

To simplify the calculation, the constants in Eqs. (4) and (5) can be discarded, since this global offset for all LLs does not affect the decoding decision.
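As a concrete illustration, the following Python sketch (ours, not from the paper; function and variable names are assumptions) evaluates the max-approximated LL recursions of Eqs. (4) and (5) with the log 2 offsets dropped. Each message is stored as a pair: the LL conditioned on the bit being 0 and on the bit being 1.

```python
# Hedged sketch of the LL-domain recursions in Eqs. (4)-(5), constants dropped.
# a[u] is the LL of the first (upper) constituent channel conditioned on bit u,
# b[u] is the LL of the second (lower) constituent channel conditioned on bit u.

def ll_f(a, b):
    """Eq. (4) without the -log 2 offset: message for the odd-indexed bit u_{2i-1}."""
    return [max(a[0] + b[0], a[1] + b[1]),   # u_{2i-1} = 0
            max(a[1] + b[0], a[0] + b[1])]   # u_{2i-1} = 1

def ll_g(a, b, u_odd):
    """Eq. (5) without the -log 2 offset: message for the even-indexed bit u_{2i},
    given the earlier hard decision u_odd for u_{2i-1}."""
    return [a[u_odd ^ 0] + b[0],             # u_{2i} = 0
            a[u_odd ^ 1] + b[1]]             # u_{2i} = 1

# Toy usage with two hypothetical LL pairs.
if __name__ == "__main__":
    a = [-0.2, -1.7]   # LL(y1 | x = 0), LL(y1 | x = 1)
    b = [-1.1, -0.4]
    print(ll_f(a, b))           # message for u_1
    print(ll_g(a, b, u_odd=0))  # message for u_2, assuming u_1 = 0
```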
D. Parallel SC Algorithm for Polar Codes

Fig. 1. Decoding of (a) bit-decision vs. (b) M-bit symbol-decision.

The SC algorithm makes a hard decision for only one bit at a time, as shown in Fig. 1(a). We call it the bit-decision decoding algorithm. A parallel SC decoder [22]–[24] makes hard decisions for M bits instead of only one bit at a time, as shown in Fig. 1(b).

Without loss of generality, assume M is a power of two, i.e., $M = 2^m$ ($0 \leq m \leq n$). Define $\mathcal{I}_j^M \overset{\mathrm{def}}{=} \{jM - M + 1, jM - M + 2, \cdots, jM\}$ for $0 < j \leq \frac{N}{M}$, $\mathcal{A}_j^M \overset{\mathrm{def}}{=} \mathcal{I}_j^M \cap \mathcal{A}$, and $\mathcal{A}_j^{Mc} \overset{\mathrm{def}}{=} \mathcal{I}_j^M \cap \mathcal{A}^c$.

Given y and $\hat{u}_1^{jM-M}$, $u_{jM-M+1}^{jM}$ is determined by

$$\hat{u}_{jM-M+1}^{jM} = \underset{u_{\mathcal{A}_j^M} \in \{0,1\}^{|\mathcal{A}_j^M|},\; u_{\mathcal{A}_j^{Mc}} \in \{0\}^{|\mathcal{A}_j^{Mc}|}}{\arg\max} W_{N,M}^{(j)}(y, \hat{u}_1^{jM-M}|u_{jM-M+1}^{jM}), \quad (6)$$

where $|\mathcal{A}_j^M|$ represents the cardinality of $\mathcal{A}_j^M$. If M = N, this decoding algorithm is exactly a maximum-likelihood sequence decoding algorithm.

E. SCL and CA-SCL Algorithms for Polar Codes

Instead of making a hard decision for each information bit of u as in the SC algorithm, the SCL algorithm creates two paths in which the information bit is assumed to be 0 and 1, respectively. If the number of paths is greater than the list size L, the L most reliable paths are selected. At the end of the decoding procedure, the most reliable path is chosen as $\hat{u}$. The SCL algorithm is formally described in Alg. 2. Without loss of generality, we assume L to be a power of two, i.e., $L = 2^l$. We use $L_i = ((L_i)_1, (L_i)_2, \cdots, (L_i)_N)$ to represent the i-th list vector, where $0 < i \leq L$. S is a structure-type array of size 2L. Each element of S has three members: P, L, and U. The function sortPDecrement sorts the array S in decreasing order of P. c = conc(a, b) attaches a bit sequence b at the end of a bit sequence a, and the length of the output bit sequence c is the sum of the lengths of a and b.

Algorithm 2: SCL Decoding Algorithm [10]
  α = 1
  for j = 1 : N do
    if j ∈ A^c then
      for i = 1 : α do (L_i)_j = 0
    else if 2α ≤ L then
      for i = 1 : α do
        (L_i)_1^j = conc((L_i)_1^{j-1}, 0);  (L_{i+α})_1^j = conc((L_i)_1^{j-1}, 1)
      α = 2α
    else
      for i = 1 : L do
        S[i].P = W_{N,1}^{(j)}(y, (L_i)_1^{j-1} | 0);    S[i].L = (L_i)_1^{j-1};    S[i].U = 0
        S[i+L].P = W_{N,1}^{(j)}(y, (L_i)_1^{j-1} | 1);  S[i+L].L = (L_i)_1^{j-1};  S[i+L].U = 1
      sortPDecrement(S)
      for i = 1 : L do (L_i)_1^j = conc(S[i].L, S[i].U)
      α = L
  û = L_1

The CA-SCL algorithm is used for CRC-concatenated polar codes. The difference between the CA-SCL [11] and SCL algorithms is how the final decision for $\hat{u}$ is made. If there is at least one path satisfying the CRC constraint, the most reliable CRC-valid path is chosen for $\hat{u}$. Otherwise, the decision rule of the SCL algorithm is used for the CA-SCL algorithm.
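The list-management step of Alg. 2 can be re-expressed compactly in Python. The sketch below is our own illustration, not the authors' implementation; `trans_prob(path, j, u)` is a hypothetical placeholder for the transition probability $W_{N,1}^{(j)}(y, \text{path}\,|\,u)$ supplied by the recursions of the previous subsection.

```python
# Hedged sketch of one bit step of the (bit-decision) SCL algorithm of Alg. 2.
def scl_step(paths, j, frozen, L, trans_prob):
    """`paths` is a list of bit lists; returns the surviving paths after bit j."""
    if frozen:
        # Frozen bit: every path is extended by 0, no pruning is needed.
        return [bits + [0] for bits in paths]
    if 2 * len(paths) <= L:
        # Fewer than L candidates: expand every path by both bit values.
        return [bits + [u] for bits in paths for u in (0, 1)]
    # Otherwise expand, score, and keep only the L most reliable candidates.
    scored = [(trans_prob(bits, j, u), bits, u) for bits in paths for u in (0, 1)]
    scored.sort(key=lambda s: s[0], reverse=True)        # sortPDecrement(S)
    return [bits + [u] for _, bits, u in scored[:L]]
```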
III. M-BIT SYMBOL-DECISION DECODING ALGORITHMS FOR POLAR CODES

A. M-bit Symbol-Decision SC Algorithm

Here, we propose a symbol-decision SC algorithm, which treats M-bit data as a symbol and decodes one symbol at a time. Let $\mathcal{Z}$ represent the alphabet of all M-bit symbols. The symbol-decision SC algorithm deals with the virtual channel $W_N^{(j)}: \mathcal{Z} \to \mathcal{Y}^N \times \mathcal{Z}^{(j-1)}$, $0 < j \leq \frac{N}{M}$, with transition probabilities $W_N^{(j)}(y_1^N, z_1^{j-1}|z_j)$, where $(y_1^N, z_1^{j-1})$ and $z_j = (u_{jM-M+1}, \cdots, u_{jM})$ denote the output and input of $W_N^{(j)}$, respectively. Actually, $W_N^{(j)}$ is exactly equivalent to $W_{N,M}^{(j)}$ if we consider $\mathcal{X}^M$ as the binary vector representation of $\mathcal{Z}$. Therefore, the symbol-decision SC algorithm has the same schedule as the parallel SC algorithm in [22]–[24]. However, our symbol-decision SC algorithm has a different approach, called symbol-based recursive channel combination, to compute the symbol-based channel transition probabilities $W_{N,M}^{(j)}(y, \hat{u}_1^{jM-M}|u_{jM-M+1}^{jM})$, which is our main focus.

B. Symbol-Based Recursive Channel Combination

Assume $u_{iM-M+1}^{iM} = (w_i, w_{i+\frac{N}{M}}, \cdots, w_{i+N-\frac{N}{M}})\, B_M F^{\otimes m}$ for $1 \leq i \leq \frac{N}{M}$. In [22]–[24], the calculation of the symbol-based channel transition probability $W_{N,M}^{(i)}(y, \hat{u}_1^{iM-M}|u_{iM-M+1}^{iM})$ is based on the following equation, referred to as the direct-mapping calculation:

$$W_{N,M}^{(i)}(y, \hat{u}_1^{iM-M}|u_{iM-M+1}^{iM}) = \prod_{j=0}^{M-1} W_{\frac{N}{M},1}^{(i)}\Big(y_{j\frac{N}{M}+1}^{(j+1)\frac{N}{M}},\, \hat{w}_{1+j\frac{N}{M}}^{(i-1)+j\frac{N}{M}} \,\Big|\, w_{i+j\frac{N}{M}}\Big), \quad (7)$$

where $W_{\frac{N}{M},1}^{(i)}(y_{j\frac{N}{M}+1}^{(j+1)\frac{N}{M}}, \hat{w}_{1+j\frac{N}{M}}^{(i-1)+j\frac{N}{M}}|w_{i+j\frac{N}{M}})$ is calculated by Arikan's recursive channel transformations.

Actually, the symbol-based recursive channel combination described in Proposition 1 can be used to calculate $W_{N,M}^{(i)}(y, \hat{u}_1^{iM-M}|u_{iM-M+1}^{iM})$.

Proposition 1. Assume that all bits of u are independent and each bit has an equal probability of being 0 or 1. Given $0 < m \leq n$, $N = 2^n$, $M = 2^m$, for any $1 \leq \phi \leq m$, $0 \leq \lambda < n$, $\Lambda = 2^{\lambda}$, $\Phi = 2^{\phi}$, and $0 \leq i < \frac{2\Lambda}{\Phi}$, a $\Phi$-bit channel $W_{2\Lambda,\Phi}^{(i+1)}$ is obtained by a single-step combination of two independent copies of a $\frac{\Phi}{2}$-bit channel $W_{\Lambda,\Phi/2}^{(i+1)}$:

$$(W_{\Lambda,\Phi/2}^{(i+1)}, W_{\Lambda,\Phi/2}^{(i+1)}) \mapsto W_{2\Lambda,\Phi}^{(i+1)}, \quad (8)$$

where the channel transition probability satisfies

$$W_{2\Lambda,\Phi}^{(i+1)}(y_1^{2\Lambda}, u_1^{i\Phi}|u_{i\Phi+1}^{i\Phi+\Phi}) = W_{\Lambda,\Phi/2}^{(i+1)}(y_1^{\Lambda}, u_{1,o}^{i\Phi} \oplus u_{1,e}^{i\Phi}\,|\,u_{i\Phi+1,o}^{i\Phi+\Phi} \oplus u_{i\Phi+1,e}^{i\Phi+\Phi}) \cdot W_{\Lambda,\Phi/2}^{(i+1)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{i\Phi}\,|\,u_{i\Phi+1,e}^{i\Phi+\Phi}). \quad (9)$$

Similar to the SC algorithm, with the help of the symbol-based recursive channel combination, an M-bit symbol-decision SC algorithm can also be represented by a message flow graph (MFG), where a channel transition probability is referred to as a message for convenience. This MFG is referred to as the SR-MFG. If the code length of a polar code is N, the SR-MFG can be divided into (n + 1) stages $(S_0, S_1, \cdots, S_n)$ from right to left: one initial stage $S_0$ and n calculation stages. For the SC algorithm, all calculation stages carry out Arikan's recursive channel transformation. However, for the M-bit symbol-decision SC algorithm, the left-most m calculation stages $(S_n, \cdots, S_{n-m+1})$, called S-COMBS stages, carry out symbol-based channel combinations. The remaining (n − m) calculation stages $(S_{n-m}, \cdots, S_1)$, called B-TRANS stages, perform Arikan's recursive channel transformations. The S-COMBS stages use outputs of the B-TRANS stages to calculate symbol-based messages.

For [22]–[24], we refer to the MFG as the DM-MFG, which also consists of two parts: B-TRANS and DM-CAL. The B-TRANS part of the DM-MFG is the same as that of the SR-MFG. However, there is only one stage in the DM-CAL part of the DM-MFG, which performs the direct-mapping calculation.
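In the LL domain, the product in Eq. (9) becomes a single addition per output message, which is exactly what the S-COMBS stages exploit. The sketch below is our own illustration (names are assumptions, not from the paper); it performs one single-step combination for Φ-bit sub-symbols represented as bit tuples.

```python
from itertools import product

def combine_stage(ll_left, ll_right, phi):
    """One single-step symbol combination (Eq. (9)) in the LL domain.
    ll_left / ll_right map a (phi/2)-bit tuple to the LL of the two constituent
    channels; the result maps a phi-bit tuple u to
    ll_left[u_odd XOR u_even] + ll_right[u_even]."""
    out = {}
    for u in product((0, 1), repeat=phi):
        u_odd = u[0::2]                                    # u_1, u_3, ...
        u_even = u[1::2]                                   # u_2, u_4, ...
        xored = tuple(a ^ b for a, b in zip(u_odd, u_even))
        out[u] = ll_left[xored] + ll_right[u_even]         # one addition per message
    return out

# Toy usage for phi = 2: combining two 1-bit channels into a 2-bit symbol channel.
if __name__ == "__main__":
    left = {(0,): -0.3, (1,): -1.2}
    right = {(0,): -0.9, (1,): -0.5}
    print(combine_stage(left, right, phi=2))
```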
For example, as shown in Fig. 2, the SR-MFG of a four-bit symbol-decision SC algorithm for a polar code with N = 8 has four stages. Messages of the initial stage ($S_0$) come directly from the channel. Messages of the first stage ($S_1$) are calculated with Arikan's transformations. Messages of the second and third stages ($S_2$ and $S_3$) are calculated with Eq. (9). Stages in the left gray box are the S-COMBS stages; stages in the right gray box are the B-TRANS stages. Fig. 3 shows the DM-MFG when the direct-mapping calculation is used to compute the symbol-based channel transition probabilities $W_{8,4}^{(1)}(y_1^8|u_1^4)$ and $W_{8,4}^{(2)}(y_1^8, u_1^4|u_5^8)$. Here,

$v_1^4 = u_{1,o}^8 \oplus u_{1,e}^8$, $v_5^8 = u_{1,e}^8$,
$w_1 = v_1 \oplus v_2 = u_1 \oplus u_2 \oplus u_3 \oplus u_4$,
$w_2 = v_3 \oplus v_4 = u_5 \oplus u_6 \oplus u_7 \oplus u_8$,
$w_3 = v_2 = u_3 \oplus u_4$,
$w_4 = v_4 = u_7 \oplus u_8$,
$w_5 = v_5 \oplus v_6 = u_2 \oplus u_4$,
$w_6 = v_7 \oplus v_8 = u_6 \oplus u_8$,
$w_7 = v_6 = u_4$,
$w_8 = v_8 = u_8$.

Fig. 2. The message flow graph of a four-bit symbol-decision SC algorithm for a polar code with a code length of eight, using the proposed symbol-based recursive channel combination (S-COMBS and B-TRANS stages).

Fig. 3. The message flow graph of a four-bit symbol-decision SC algorithm for a polar code with a code length of eight, using the direct-mapping calculation [22]–[24] (DM-CAL and B-TRANS stages).

For the direct-mapping calculation, Eq. (7) needs (M − 1) additions per message. Therefore, a total of $2^{|\mathcal{A}_j^M|}(M-1)$ additions are needed to calculate all LL-based symbol-based channel transition probabilities for $u_{jM+1}^{jM+M}$. Now consider the recursive symbol-based channel combination. The S-COMBS stages of the SR-MFG are indexed 1 to m from left to right. There are $2^{n-i}$ ($0 < i \leq m$) nodes in the i-th S-COMBS stage, and each node contains $2^{M+i-n}$ messages. One addition is needed to compute each LL message according to Eq. (9). Hence, the number of additions needed by the S-COMBS stages to calculate $W_{N,M}^{(j)}(y, \hat{u}_1^{jM-M}|u_{jM-M+1}^{jM})$ is $\sum_{i=1}^{m-1} 2^i\, 2^{M/2^i} + 2^{|\mathcal{A}_j^M|}$.

Actually, for a hardware implementation, the worst case, in which all bits of a symbol are information bits, should be considered. Therefore, the recursive symbol-based channel combination can be taken advantage of to reduce the complexity of calculating the symbol-based channel transition probability.

For the example shown in Fig. 2, Eq. (7) needs $2^4(4-1) = 48$ additions to calculate $\log(W_{8,4}^{(1)}(y_1^8|u_1^4))$. With the symbol-based channel combination, 4, 4, and 16 additions are needed to calculate $\log(W_{4,2}^{(1)}(y_1^4|v_1^2))$, $\log(W_{4,2}^{(1)}(y_5^8|v_5^6))$, and $\log(W_{8,4}^{(1)}(y_1^8|u_1^4))$, respectively. Therefore, our method needs only $2^4 + 2\times 2^2 = 24$ additions, which is half of those needed by Eq. (7). Table I lists the numbers of additions needed by our recursive method and by the direct-mapping calculation [22]–[24] when all M bits of a symbol are information bits. When M = 8, the number of additions needed by our proposed method is 17% of that needed by the direct-mapping calculation.

TABLE I
Numbers of additions to calculate $W_{N,M}^{(j+1)}(y, \hat{u}_1^{jM}|u_{jM+1}^{jM+M})$ when the (j+1)-th symbol has no frozen bit.

          Proposed method    Direct-mapping calculation [22]–[24]
  M = 2          4                         4
  M = 4         24                        48
  M = 8        304                      1792

The other advantage of the proposed method is that it reveals the similarity between Arikan's recursive channel transformation and the symbol-based recursive channel combination. We will take advantage of this similarity to reuse adders and to save area when computing the bit- and symbol-based channel transition probabilities in our proposed architecture. In [24], additional dedicated adders are used to calculate the symbol-based channel transition probability, which is not area efficient.
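The entries of Table I follow directly from the two expressions above. The short script below (ours, for illustration only) evaluates both counts for the worst case in which all M bits are information bits.

```python
# Hedged sketch: worst-case addition counts (all M bits are information bits),
# reproducing Table I. Here m = log2(M).

def additions_recursive(M):
    """Proposed symbol-based recursive combination:
    sum_{i=1}^{m-1} 2^i * 2^(M / 2^i)  +  2^M."""
    m = M.bit_length() - 1
    return sum((2 ** i) * (2 ** (M // (2 ** i))) for i in range(1, m)) + 2 ** M

def additions_direct(M):
    """Direct-mapping calculation of Eq. (7): 2^M * (M - 1)."""
    return (2 ** M) * (M - 1)

for M in (2, 4, 8):
    print(M, additions_recursive(M), additions_direct(M))
# Matches Table I: (2, 4, 4), (4, 24, 48), (8, 304, 1792).
```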
In terms of error performance, the symbol-decision SC algorithm is not worse than the bit-decision SC algorithm [27]. Fig. 4 shows the BERs and FERs of symbol-decision SC algorithms for a (1024, 512) polar code. SDSC-i denotes the i-bit symbol-decision SC algorithm. When M = 2 and 4, the FER performance is the same as that of the bit-decision SC algorithm. When M = 8, the FER performance is slightly better.

Fig. 4. Error rates of symbol-decision SC algorithms for a (1024, 512) polar code.

C. Generalized Symbol-Decision SCL Decoding Algorithm

Similarly, the symbol-based recursive channel combination is also useful for the SCL algorithm. The symbol-decision SCL algorithm is more complicated than the SCL algorithm, since the path expansion coefficient is no longer a constant. In the SCL algorithm, the path expansion coefficient is two for each information bit. For the M-bit symbol-decision SCL algorithm, however, the path expansion coefficient is $2^{|\mathcal{A}_j^M|}$, which depends on the number of information bits in an M-bit symbol. The M-bit symbol-decision SCL algorithm is formally described in Alg. 3. Without any ambiguity, 0 represents a zero vector whose bit-width is determined by the left-hand operand. The function dec2bin(d, b) converts a decimal number d into a b-bit binary vector. Eq. (9) is used to calculate the symbol-based channel transition probability corresponding to each list, i.e., $W_{N,M}^{(j+1)}(y, (L_i)_1^{jM}|u_{jM+1}^{jM+M})$.

Algorithm 3: M-bit Symbol-Decision SCL Decoding Algorithm
  α = 1
  for j = 1 : N/M do
    β = 2^{|A_j^M|}
    if β == 1 then
      for i = 1 : α do (L_i)_{jM-M+1}^{jM} = 0
    else if αβ ≤ L then
      u_{A_j^{Mc}} = 0
      for k = 0 : β − 1 do
        u_{A_j^M} = dec2bin(k, |A_j^M|)
        for i = 1 : α do
          t = i + kα;  (L_t)_1^{jM} = conc((L_i)_1^{jM−M}, u_{jM−M+1}^{jM})
      α = αβ
    else
      u_{A_j^{Mc}} = 0
      for k = 0 : β − 1 do
        u_{A_j^M} = dec2bin(k, |A_j^M|)
        for i = 1 : L do
          t = i + kL
          S[t].P = W_{N,M}^{(j)}(y, (L_i)_1^{jM−M} | u_{jM−M+1}^{jM});  S[t].L = (L_i)_1^{jM−M};  S[t].U = u_{jM−M+1}^{jM}
      sortPDecrement(S)
      for i = 1 : L do (L_i)_1^{jM} = conc(S[i].L, S[i].U)
      α = L

Fig. 5 shows the BERs and FERs of symbol-decision SCL algorithms for a (1024, 480) CRC32-concatenated polar code with L = 4, where the generator polynomial of the CRC32 is 0x1EDC6F41. This CRC32 is also used in all the CRC-concatenated polar codes in the following sections. SDSCL-i denotes the i-bit symbol-decision SCL algorithm. The performances of the symbol-decision SCL algorithms with different symbol sizes are almost the same.

Fig. 5. Error rates of symbol-decision SCL algorithms for a (1024, 480) CRC32-concatenated polar code with L = 4.
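The symbol-level list management of Alg. 3 differs from Alg. 2 only in that each surviving path is expanded by β = 2^{|A_j^M|} candidate symbols rather than by two bit values. The Python sketch below is our own re-expression under that reading; `symbol_prob(path, symbol)` is a hypothetical stand-in for the symbol-based transition probability computed with Eq. (9).

```python
from itertools import product

def sdscl_step(paths, info_positions, M, L, symbol_prob):
    """One symbol step in the spirit of Alg. 3. `paths` is a list of symbol sequences,
    `info_positions` is a length-M list marking which bits of this symbol carry
    information, and `symbol_prob(path, symbol)` plays the role of W_{N,M}^{(j)}."""
    free = [t for t in range(M) if info_positions[t]]
    beta = 2 ** len(free)                          # path expansion coefficient
    symbols = []
    for bits in product((0, 1), repeat=len(free)): # enumerate candidate symbols,
        sym = [0] * M                              # frozen positions fixed to 0
        for t, b in zip(free, bits):
            sym[t] = b
        symbols.append(tuple(sym))
    if beta == 1:                                  # all-frozen symbol: no expansion
        return [p + [symbols[0]] for p in paths]
    if beta * len(paths) <= L:                     # expand without pruning
        return [p + [s] for p in paths for s in symbols]
    scored = [(symbol_prob(p, s), p, s) for p in paths for s in symbols]
    scored.sort(key=lambda x: x[0], reverse=True)  # sortPDecrement(S)
    return [p + [s] for _, p, s in scored[:L]]
```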
8, for a (2048,1401) CRC32-concatenated polar code, the two stage list-pruning network of q = 4 helps to reduce q ≤ 2M 2M-path the complexity of the SDSCL-4 decoder without observed Sorting performance loss when L = 8. When q = 2 and L = 8, Function theSDSCL-4decoderhasaperformancedegradationofabout Fig.6. Messageflowforatwo-stagelistpruningnetwork. 0.1 dB at an FER level of 10−3, compared with the SDSCL-4 decoderwithq =8andL=8.IfL=4,theerrorperformance If q ≥ L, the L paths found by the two-stage list pruning due to q =2 is very small. networkareexactlytheLmostreliablepathsamongthe2ML Therefore, the two-stage list pruning network uses an ad- paths. When q < L, the probability that the L paths found ditional parameter q to introduce different trade-offs between by the two-stage list pruning network are exactly the L most error performance and complexity. reliable paths among the 2ML paths decreases as well. This may cause some performance loss. But a smaller q leads to a IV. PRE-COMPUTATIONMEMORY-SAVINGTECHNIQUE two-stage list pruning network with lower complexity. Pre-computation technique was first proposed in [25] and can be used to improve processing rate when the number of possible outputs is finite. In [17], the pre-computation technique is used to improve the throughput of the line SC decoder with an additional cost of increased area. Here, our mainpurposeistousethepre-computationtechniquetoreduce the memory required by list decoders because the memory of an SCL decoder to store the channel transition probability becomes a big challenge as the list size and code length increase. Henceforth, this memory saving technique is called the pre-computation memory-saving (PCMS) technique. It is worthnotingthatthismemory-savingtechniqueisindependent of the decoder architecture and the message representation of SCL decoders. Let us take the MFG shown in Fig. 2 as an example. For stages S and S , the numbers of pairs of LLs stored 0 1 by the list decoder are 8 and 4L, respectively. Actually, the Fig. 7. Error rates of the SDSCL-8 decoder for a (1024, 480) CRC32- outgoing message W(1)(y2|w ) of the top black node in S 2,1 1 1 1 concatenatedpolarcodewithL=4. can only be either W(1)(y2|0) or W(1)(y2|1). The outgoing 2,1 1 2,1 1 Fig. 7 shows how different values of q affect the error message W2(,21)(y12,w1|w2) can only be one of W2(,21)(y12,0|0), performance of an SDSCL-8 algorithm for a (1024, 480) W(2)(y2,0|1), W(2)(y2,1|0), and W(2)(y2,1|1). Hence, no 2,1 1 2,1 1 2,1 1 CRC32-concatenatedpolarcodewithL=4.WhenL=4and matterwhatthelistsizeis,thetotalnumberofpossiblevalues q = 2, the SDSCL-8 algorithm shows an FER performance of outgoing messages of S is 2×4+4×4 = 24. These 1 loss of about 0.25 dB at an FER level of 10−3. As shown 24 values provide all information we need for calculations of 8 further stages. With knowledge of these 24 values, channel LLs are not needed any more. MBG Generallyspeaking,thePCMStechniquetakesadvantageof the relationship between messages of S (channel LLs), and outgoingmessagesofS .Bystoringonly0allpossibleoutgoing MPU0 OLG 1 messages of S , the PCMS technique helps list decoders save 1 G memory. CNTL MPU1 NS LPN CRCC M LetusevaluatethememorysavingofthePCMStechnique, assuming LL representation is used for the channel transition probability.WithoutPCMStechnique,alistdecoderforapolar code with the code length of N has a list size of L stores MPUM-1 (N −2)L+N LL pairs. 
Each pair contains two messages 0 frz_flag 1 0 which are associated with the conditional bit being zero or one. The total number of bits used for LL storage is Fig.9. ToparchitectureforanM-bitsymbol-decisionSCLdecoder. logN−1 (cid:16) (cid:88) (cid:17) B =2 NQ +L 2i(Q +logN −i) LL ch ch (10) An MPU block calculates messages for B-TRANS and S- i=1 COMBS messages and updates the partial-sum network by =2(L+1)NQ +4L(N −logN −Q −1), ch ch adopting blocks of the SCL decoder in [20]. The additions of whereQ denotesthenumberofbitsusedforthequantization S-COMBSstagesarecarriedoutbyreusingthesamehardware ch of the channel LLs. resource which is used to calculate messages of B-TRANS With the PCMS technique, the total number of LL pairs stages to reduce the area. Compared with the SCL decoder needed by a list decoder is NL+ 3N. The total number of in [20], the MPU has neither path pruning unit nor the CRC 2 2 bits needed for LL storage is: checker. The other improvement for the MPU is that PCMS technique is used here. The architecture of an MPU is shown N in Fig. 10. Channel messages are not needed any more due B =2( +N)(Q +1)+ PCMS 2 ch to the adoption of PCMS technique. L-MEM stores messages log(cid:88)N−2 corresponding to stages of the MFG. For the stage S1, MSEL 2L 2i(Q +logN −i) selectstheappropriatemessagesfromL-MEMbasedonpartial ch i=1 (11) sum values and/or the type of calculation nodes. PUs are =3N(Q +1)+LN(Q +3) processing units to calculate LL messages. PSUs is used to ch ch −4L(logN +Q +1) update partial-sums. ISel selects messages from LMEM or ch OSel module for the crossbar (CB) module which chooses =B −N(LQ +L−Q −3). LL ch ch proper messages for PUs. OSel outputs messages to L-MEM Therefore, when LL representation is used for messages, for intermediate stages and output symbol-based messages to the PCMS technique saves N(LQch+L−Qch−3) bits of MSNG. memory. The saving is linear with both N and L. Consider a polar code with N = 1024, a list decoder with L = 4 and Q =4. Without the PCMS technique, B =57104. With ch LL thePCMStechnique,B =43792.ThePCMStechnique PCMS helps to save 13312 bits of memory, which is 23% of BLL. PU0 The other advantage of the PCMS technique is that it MSEL ISel CB PU1 OSel improves the throughput slightly because the messages of S 1 are already in the memory and don’t need to be calculated PU L-1 from the channel messages. For example, for a bit-decision L-MEM semi-parallel SCL decoder with the list size of L, if the code length is N and the number of processing units is P, the latencysavingduetothePCMStechniqueis NPL clockcycles. USP0 USP1 USP1L- V. IMPLEMENTATIONOFSYMBOL-DECISIONSCL DECODERS Fig.10. ArchitectureofanMPU. A. Architecture of Symbol-Decision SCL Decoders We take the MFG of Fig. 2 as an example to illus- We propose an architecture of an M-bit symbol-decision trate the function of block MSEL. For node f of path SCL decoder shown in Fig. 9. It consists of M MPU blocks 21 (MPU0,MPU1,··· ,MPUM−1),alistpruningnetwork(LPN), l, {W2(,11)(y12|0),W2(,11)(y12|1)}l and {W2(,11)(y34|0),W2(,11)(y34|1)}l are selected from LMEM by MSEL and output to Isel. a mask bit generator (MBG), a message-screening block (MSNG), a control block (CNTL), an output-list generator For node g21 of path l, {W2(,21)(y12,w1l|0),W2(,21)(y12,w1l|1)}l (OLG) and a CRC checker (CRCC). and {W(2)(y4,w |0),W(2)(y4,w |1)} are selected from 2,1 3 3l 2,1 3 3l l 9 LMEM.Here,w1l andw3l arethepartialsumforw1 andw3, i1 respectively, belonging to path l. 
The detailed information of b j,1 other blocks in Fig. 10 can be found in [20] and will not be i discussed in this paper. 2 b j,2 The message-passing scheme in MFG of a polar code is Mask_bit in a serial way, which means that the calculation of a stage i depends on the output of its previous stage. The PUs in [20] i M only carry out the B-TRANS additions. On the other hand, bj,M the S-COMBS stages need only additions and a processing unit has four adders. Therefore, in order to save hardware Fig.12. Architectureforgeneratingamaskbit. resources, the adders in the processing units is reused to calculatethesymbol-basedchanneltransitionprobability,after these processing units finish calculations for the B-TRANS to as BS L. The folded sorting implementation needs 2M−1 stages. In other words, additions of both the B-TRANS and BS Ls (BS L ,BS L ,··· ,BS L ). The outputs of 0 1 2M−1−1 the S-COMBS stages are folded onto the same adders in the the BS L and the BS L (0≤i<2M−2) are connected 2i 2i+1 processing unites. As shown in Fig. 11, c[0] and c[1] are with inputs of BS L through registers and multiplexers. For i outputs for the B-TRANS stages; d[0],d[1],d[2], and d[3] are the tree sorting implementation with 2ML inputs, 2M − 1 outputs for the S-COMBS stages. BS Ls are needed. The tree sorting implementation can be divided into M layers. For 0 ≤ i < M, there are 2i BS Ls d[0] in the i-th layer. Inputs of the BS Ls of the i-th layer are d[1] connected with outputs of the BS Ls of the (i+1)-th layer. a[0] Fig. 13 and 14 show examples of the folded and tree sorting max implementations, respectively, for 2M =8. a[1] 0 0 c[0] 1 1 D D D D BS_L0 BS_L1 BS_L2 BS_L3 b[0] max MUX MUX MUX MUX b[1] 0 0 c[1] 1 u 1 d[2] mode d[3] Fig.13. Architectureforthefoldedsortingimplementationwhen2M =8. Fig.11. Architectureofaprocessingunit. Block MBG provides a mask bit for each path. If there BS_L 6 are f (fgeq0) frozen bits in the M-bit symbol, the number of expanded paths will be 2M−f. For hardware implementa- BS_L BS_L 4 5 tions, we need to consider the worst case and all messages corresponding to 2M possible paths are calculated. Each path BS_L BS_L BS_L BS_L 0 1 2 3 is associated with a mask bit. When some paths are not needed, due to frozen bits, they are turned off by mask bits. Fig. 12 shows how to generate the mask bit for path i, where Fig.14. Architectureforthetreesortingimplementationwhen2M =8. i = (i ,i ,··· ,i ) ∈ {1,0}M (0 ≤ i < 2M − 1) and 1 2 M b = (b ,b ,··· ,b ) is a frozen-bit indication vector j j,1 j,2 j,M The folded sorting implementation has a smaller area than for ujM+M. If u is a frozen bit, b = 1. Otherwise, jM+1 jM+t j,t the tree sorting implementation. However, the pipeline can b = 0. If b is an all-one vector, all bits of ujM+M are j,t j jM+1 be applied to the tree sorting implementation by inserting frozen bits, called an M-bit frozen vector. If Mask bit is 1, i registers between layers to improve the throughput of the tree ujM+M is impossible to be i and the message corresponding jM+1 sorting implementation. to ujM+M =i is set to 0 in block MSNG. For the two-stage list pruning network proposed in jM+1 Block LPN receives 2ML messages from block MSNG, Sec.III-D,eitherthefoldedsortingimplementationorthetree finds the most reliable L paths, and feeds decision results sorting implementation can be used for the 2M-to-q sorting back to the MPUs. Here, we use two different sorting im- function and the qL-to-L sorting function. 
plementations – a folded sorting implementation and a tree Block CNTL provides control signals to schedule the hard- sorting implementation – for different designs. The basic unit ware sharing for MPUs and decides when to start pruning for these two implementations is a bitonic sorter [28] , which paths. The signal frz flag is an indicator which is one when outputs the L max values out of 2L inputs. It is referred a frozen vector appears. When frz flag is one, all MPUs use 10 zero to update the partial-sums instead of outputs of the LPN. In this case, the LPN, the MSNG, and the calculation of L S-COMBS stages are bypassed. The OLG stores the output paths.TheCRCCchecksifapathsatisfiestheCRCconstraint. 4P SB D L _ Tree q L Sorting (L-q) B. Message Scheduling and Latency Analysis Network 0 To improve area efficiency, for different number of PUs, Fig. 17. A pipelined tree sorting implementation for the overlapping different scheduling schemes are needed. To reuse the adders scheduling. of the processing units, the additions of the S-COMBS stages in the MFG must be scheduled properly. Assume the number of the processing units is P. The total number of the adders S-COMBSstages,andthelatencyofthelistpruningnetwork. provided by processing units is 4P. If 2ML ≤ 4P, we use a T represents the overall number of clock cycles for the B serialscheduling,whichmeansthatthereisnooverlapforthe calculations of the B-TRANS stages. It is equivalent to the processing units and the LPN in terms of the operation time, latency of a bit-decision SCL decoder with a code length of as shown in Fig. 15. N and P processing units: M M TS TN N NL/M NL/M NL/M T =2 + log ( )− , Processing Units LPN B M P/M 2 4P/M P/M B-TRANS S-TRANS S1 S2 ... Sn-m Sn-m+1 Sn-m+2 ... Sn where the third term, −NL/M, is the latency saving by using P/M PCMS technique. T represent the number of clock cycles Fig.15. Serialscheduling(inclockcycles). S for the calculations of S-COMBS stages per symbol. T N Suppose each addition takes one clock cycle. Then each S- representsthenumberofextraclockcyclespersymbolneeded COMBS stage takes one clock cycle to compute messages. by the LPN to finish the list pruning after all messages of the Therefore, it takes m clock cycles for the S-COMBS stages stage Sn are calculated. If 2ML ≤ 4P, the number of clock to output messages to the LPN. To save the area, the folded cycles used to calculate messages for S-COMBS stages is sorting implementation is applied for the serial scheduling. TS =m. When 2ML>4P ≥2M/2L, TS =m−1+(cid:100)24MPL(cid:101). When 2ML > 4P ≥ 2M/2L, there are not enough More generally, T ≤(cid:80)m (cid:100)22iL(cid:101). T is determined by the S i=1 4P N adders to calculate all 2ML messages of the stage Sn in detailed implementation. Hence, the latency of the symbol- one clock cycle, but all 2M/2n−iL messages of the stage Si decision SCL decoder is: (n+m−1≤i≤n−1)canbecalculatedinoneclockcycle. N Without increasing the number of adders, 2ML cycles are T(M)=(1−γ) (T +T )+T 4P M S N B needed. In each cycle, 4P messages are calculated. To reduce N N NL NL the latency, the overlapping scheduling shown in Fig. 16 is =(1−γ) (T +T )+2 + log ( ), M S N M P 2 8P used. In clock cycle c0, the first 4P messages come out. In (12) clock cycle c , the LPN starts work. Therefore, the MPUs 1 and the LPN are working simultaneously for 2ML −1 clock where γ is a ratio of the number of frozen vectors to N. 4P M cycles. Here, the LPN works in a pipeline way. 
Hence, the Table II shows the latencies (in clock cycles) for different tree sorting implementation is deployed for the overlapping decoders to decode a (1024, 480) CRC32-concatenated polar scheduling and a BS L is connected at the end of the tree code with 64 processing units and L=4. We assume a BS L sorting implementation in a way shown in Fig. 17, where needs one clock cycle to find the four maximum values out the number on a line represents the number of messages of eight values. For M = 2 and M = 4, a folded sorting transmitted through the line. implementationandtheserialschedulingareused.ForM =8, a pipelined tree sorting implementation and the overlapped TS TN scheduling are applied. For M =8 and q =2, the basic unit Processing Units LPN inthetreesortingimplementationistofindthetwomaximum B-TRANS S-TRANS values out of eight values, which needs one clock cycles. S1 ... Sn-m Sn-m+1 Sn-m+2 ... Sn Therefore, T =4 when M =8 and q =2. N c c ... 0 1 : Clock cycles when the processing units are busy. TABLEII : Clock cycles when the LPN is busy. LATENCIESFORDIFFERENTDECODERSFORA(1024,480) : Clock cycles when both the processing units and LPN are busy. CRC32-CONCATENATEDPOLARCODEWITH64PROCESSINGUNITSAND L=4. Fig.16. Overlappingscheduling(inclockcycles). Decoder γ TS TN q Latency(#ofcycles) SDSCL-2 0.445 1 2 4 2069 The latency of an M-bit symbol-decision SCL decoder SDSCL-4 0.395 2 4 4 1634 SDSCL-8 0.344 6 7 4 1540 consists of: the latency for calculating messages of the B- SDSCL-8 0.344 6 4 2 1288 TRANS stages, the latency for calculating messages of the
