On Deep Learning-Based Channel Decoding

Tobias Gruber*, Sebastian Cammerer*, Jakob Hoydis†, and Stephan ten Brink*
* Institute of Telecommunications, Pfaffenwaldring 47, University of Stuttgart, 70659 Stuttgart, Germany, {gruber,cammerer,tenbrink}@inue.uni-stuttgart.de
† Nokia Bell Labs, Route de Villejust, 91620 Nozay, France, [email protected]

Abstract—We revisit the idea of using deep neural networks for one-shot decoding of random and structured codes, such as polar codes. Although it is possible to achieve maximum a posteriori (MAP) bit error rate (BER) performance for both code families and for short codeword lengths, we observe that (i) structured codes are easier to learn and (ii) the neural network is able to generalize to codewords that it has never seen during training for structured, but not for random codes. These results provide some evidence that neural networks can learn a form of decoding algorithm, rather than only a simple classifier. We introduce the metric normalized validation error (NVE) in order to further investigate the potential and limitations of deep learning-based decoding with respect to performance and complexity.

I. INTRODUCTION

Deep learning-based channel decoding is doomed by the curse of dimensionality [1]: for a short code of length N = 100 and rate r = 0.5, 2^50 different codewords exist, which are far too many to fully train any neural network (NN) in practice. The only way that a NN can be trained for practical blocklengths is if it learns some form of decoding algorithm which can infer the full codebook from training on a small fraction of codewords. However, to be able to learn a decoding algorithm, the code itself must have some structure which is based on a simple encoding rule, as in the case of convolutional or algebraic codes. The goal of this paper is to shed some light on the question whether structured codes are easier to "learn" than random codes, and whether a NN can decode codewords that it has never seen during training.

We want to emphasize that this work is based on very short blocklengths, i.e., N ≤ 64, which enables the comparison with maximum a posteriori (MAP) decoding, but also has an independent interest for practical applications such as the internet of things (IoT). We are currently restricted to short codes because of the exponential training complexity [1]. Thus, the neural network decoding (NND) concept is currently not competitive with state-of-the-art decoding algorithms, which have been highly optimized over the last decades and scale to arbitrary blocklengths.

Yet, there may be certain code structures which facilitate the learning process. One of our key findings is that structured codes are indeed easier to learn than random codes, i.e., fewer training epochs are required. Additionally, our results indicate that NNs may generalize or "interpolate" to the full codebook after having seen only a subset of examples, whenever the code has structure.

A. Related Work

In 1943, McCulloch and Pitts published the idea of a NN that models the architecture of the human brain in order to solve problems [2]. But it took about 45 years until the backpropagation algorithm [3] made useful applications such as handwritten ZIP code recognition possible [4]. One early form of a NN is the Hopfield net [5]. This concept was shown to be similar to maximum likelihood decoding (MLD) of linear block error-correcting codes (ECCs) [6]: an erroneous codeword will converge to the nearest stable state of the Hopfield net, which represents the most likely codeword. A naive implementation of MLD means correlating the received vector of modulated symbols with all possible codewords, which makes it infeasible for most practical codeword lengths, as the decoding complexity is O(2^k), with k denoting the number of information bits in the codeword. The parallel computing capabilities of NNs allow us to solve or, at least, approximate the MLD problem in polynomial time [7]. Moreover, the weights of the NN are precomputed during training, and the decoding step itself is then relatively simple.
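To make the complexity argument concrete, the following NumPy sketch implements the naive correlation-based MLD described above, assuming BPSK transmission over an AWGN channel; the tiny codebook is a hypothetical stand-in, not one of the codes studied later in this paper.

```python
import numpy as np

def mld_correlate(y, codebook_bits):
    """Naive maximum likelihood decoding by exhaustive correlation.

    y             : received real-valued vector of length N (BPSK over AWGN)
    codebook_bits : (2**k, N) array containing all valid codewords in {0, 1}

    For BPSK over AWGN, maximizing the correlation with every modulated
    codeword is equivalent to ML decoding, but the search runs over all
    2**k codewords, i.e., complexity O(2**k).
    """
    s = 1.0 - 2.0 * codebook_bits        # BPSK mapping: 0 -> +1, 1 -> -1
    correlations = s @ y                 # one correlation value per codeword
    return codebook_bits[np.argmax(correlations)]

# Illustrative toy example with a hypothetical codebook (k = 3, N = 6):
rng = np.random.default_rng(0)
codebook = rng.integers(0, 2, size=(2 ** 3, 6))
x = codebook[5]                                        # transmitted codeword
y = (1.0 - 2.0 * x) + 0.5 * rng.standard_normal(6)     # BPSK + AWGN
print(mld_correlate(y, codebook))
```

Since the correlation has to be evaluated against all 2^k codewords, the search cost grows exponentially in k, which is exactly the bottleneck that motivates the NN-based approach.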
Due to their low storage capacity, Hopfield nets were soon replaced by feed-forward NNs, which can learn an appropriate mapping between noisy input patterns and codewords. No assumption has to be made about the statistics of the channel noise because the NN is able to learn the mapping, or to extract the channel statistics, during the learning process [8].

Different ideas around the use of NNs for decoding emerged in the 90s. While in [8] the output nodes represent the bits of the codeword, it is also possible to use one output node per codeword (one-hot coding) [9]. For Hamming codes, another variation is to use only the syndrome as input of the NN in order to find the most likely error pattern [10]. Subsequently, NND for convolutional codes arose in 1996 when Wang and Wicker showed that NND matches the performance of an ideal Viterbi decoder [1]. But they also mentioned a very important drawback of NND: decoding problems have far more possibilities than conventional pattern recognition problems. This limits NND to short codes. However, the NN decoder for convolutional codes was further improved by using recurrent neural nets [11].

NND did not achieve any big breakthrough for either block or convolutional codes. With the standard training techniques of those times it was not possible to work with NNs employing a large number of neurons and layers, which rendered them unsuited for longer codewords. Hence, the interest in NNs dwindled, not only for machine learning applications but also for decoding purposes. Some slight improvements were made in the following years, e.g., by using random neural nets [12] or by reducing the number of weights [13].

In 2006, a new training technique, called layer-by-layer unsupervised pre-training followed by gradient descent fine-tuning [14], led to the renaissance of NNs because it made training of NNs with more layers feasible. NNs with many hidden layers are called deep. Nowadays, powerful new hardware such as graphical processing units (GPUs) is available to speed up learning as well as inference.
In this renaissance of NNs, new NND ideas emerge. Yet, compared to previous work, the NN learning techniques are only used to optimize well-known decoding schemes, which we denote as introduction of expert knowledge. For instance, in [15], weights are assigned to the Tanner graph of the belief propagation (BP) algorithm and learned by NN techniques in order to improve the BP algorithm. It still seems that the recent advances in the machine learning community have not yet been adapted to the pure idea of learning to decode.

[Fig. 1: Deep learning setup for channel coding. The k information bits are encoded ({0,1}^k -> {0,1}^N) into a codeword x_i, passed through a modulation layer, a noise layer, and an optional LLR layer (the abstract channel), and fed into a NN decoder with three hidden layers that outputs the estimated information bits.]

II. DEEP LEARNING FOR CHANNEL CODING

The theory of deep learning is comprehensively described in [16]. Nevertheless, for completeness, we will briefly explain the main ideas and concepts in order to introduce a NN for channel (de-)coding and its terminology. A NN consists of many connected neurons. In such a neuron, all of its weighted inputs are added up, a bias is optionally added, and the result is propagated through a nonlinear activation function, e.g., a sigmoid function or a rectified linear unit (ReLU), which are respectively defined as

    g_{\mathrm{sigmoid}}(z) = \frac{1}{1+e^{-z}}, \qquad g_{\mathrm{relu}}(z) = \max\{0, z\}.    (1)

If the neurons are arranged in layers without feedback connections, we speak of a feedforward NN because information flows through the net from left to right without feedback (see Fig. 1). Each layer i with n_i inputs and m_i outputs performs the mapping f^{(i)}: R^{n_i} -> R^{m_i}, with the weights and biases of the neurons as parameters. Denoting v as input and w as output of the NN, an input-output mapping is defined by a chain of functions depending on the set of parameters \theta:

    w = f(v; \theta) = f^{(L-1)}\left( f^{(L-2)}\left( \cdots f^{(0)}(v) \right) \right)    (2)

where L gives the number of layers and is also called the depth. It was shown in [17] that such a multilayer NN with L = 2 and nonlinear activation functions can theoretically approximate any continuous function on a bounded region arbitrarily closely, provided the number of neurons is large enough.

In order to find the optimal weights of the NN, a training set of known input-output mappings is required and a specific loss function has to be defined. By the use of gradient descent optimization methods and the backpropagation algorithm [3], weights of the NN can be found which minimize the loss function over the training set. The goal of training is to enable the NN to find the correct outputs for unseen inputs. This is called generalization. In order to quantify the generalization ability, the loss can be determined for a data set that has not been used for training, the so-called validation set.

In this work, we want to use a NN for decoding noisy codewords. At the transmitter, k information bits are encoded into a codeword of length N. The coded bits are modulated and transmitted over a noisy channel. At the receiver, a noisy version of the codeword is received, and the task of the decoder is to recover the corresponding information bits. In comparison to iterative decoding, the NN finds its estimate by passing through each layer only once. As this principle enables low-latency implementations, we term it one-shot decoding.

Obtaining labeled training data is usually a very hard and expensive task in the field of machine learning. But using NNs for channel coding is special because we deal with man-made signals. Therefore, we are able to generate as many training samples as we like. Moreover, the desired NN output, also denoted as label, is obtained for free: if noisy codewords are generated, the transmitted information bits are obviously known. For the sake of simplicity, binary phase shift keying (BPSK) modulation and an additive white Gaussian noise (AWGN) channel are used. Other channels can be adopted straightforwardly, and it is this flexibility that may be a particular advantage of NN-based decoding.
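The fact that labels come for free can be illustrated with a short sketch. Here, `encode` is a hypothetical placeholder for whichever encoder (random or polar) is used, and the noise standard deviation follows the usual BPSK relation sigma^2 = 1/(2 r E_b/N_0), which is a convention assumed by this sketch rather than something stated explicitly in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(u):
    """Hypothetical placeholder for the encoder {0,1}^k -> {0,1}^N.
    In the paper this is either a random code or a polar code."""
    raise NotImplementedError

def training_batch(batch_size, k, rate, ebn0_db):
    """Generate noisy channel values together with their labels on the fly."""
    sigma = np.sqrt(1.0 / (2.0 * rate * 10.0 ** (ebn0_db / 10.0)))  # assumed BPSK/AWGN convention
    u = rng.integers(0, 2, size=(batch_size, k))    # information bits double as labels
    x = np.array([encode(row) for row in u])        # codewords of length N
    s = 1.0 - 2.0 * x                               # BPSK: 0 -> +1, 1 -> -1
    y = s + sigma * rng.standard_normal(s.shape)    # AWGN channel
    return y, u                                     # labels obtained "for free"
```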
In order to keep the training set small, it is possible to extend the NN with additional layers for modulating and adding noise (see Fig. 1). These additional layers have no trainable parameters, i.e., they perform a certain action, such as adding noise, and propagate this value only to the node of the next layer with the same index. Instead of creating, and thus storing, many noisy versions of the same codeword, working on the noiseless codeword is sufficient. Thus, the training set X consists of all possible codewords x_i \in F_2^N with F_2 = {0,1} (the labels being the corresponding information bits) and is given by X = {x_0, ..., x_{2^k - 1}}.

As recommended in [16], each hidden layer employs a ReLU activation function because it is nonlinear and at the same time very close to linear, which helps during optimization. Since the output layer represents the information bits, a sigmoid function forces the output neurons to lie between zero and one, which can be interpreted as the probability that a "1" was transmitted. If this probability is close to the bit of the label, the loss should be incremented only slightly, whereas large errors should result in a very large loss. Examples of such loss functions are the mean squared error (MSE) and the binary cross-entropy (BCE), defined respectively as

    L_{\mathrm{MSE}} = \frac{1}{k} \sum_i \left( b_i - \hat{b}_i \right)^2    (3)

    L_{\mathrm{BCE}} = -\frac{1}{k} \sum_i \left[ b_i \ln \hat{b}_i + (1 - b_i) \ln\left(1 - \hat{b}_i\right) \right]    (4)

where b_i \in {0,1} is the i-th target information bit (label) and \hat{b}_i \in [0,1] the NN soft estimate.

There are some alternatives to this setup. First, log-likelihood ratio (LLR) values could be used instead of channel values. For BPSK modulation over an AWGN channel, these are obtained by

    \mathrm{LLR}(y) = \ln \frac{P(x=0 \mid y)}{P(x=1 \mid y)} = \frac{2}{\sigma^2} y    (5)

where \sigma^2 is the noise power and y the received channel value. This processing step can also be implemented as an additional layer without any trainable parameters. Note that the noise variance must be known in this case and provided as an additional input to the NN. (Inspired by the idea of spatial transformer networks [18], one could alternatively use a second NN to estimate \sigma^2 from the input and provide this estimate as an additional parameter to the LLR layer.) Representing the information bits in the output layer as a one-hot-coded vector of length 2^k is another variant. However, we refrain from this idea since it does not scale to large values of k.
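For reference, a direct NumPy rendering of the loss functions (3)-(4) and the LLR mapping (5) might look as follows; the epsilon clipping in the BCE is an implementation detail added here to avoid log(0) and is not part of the definitions above.

```python
import numpy as np

def mse_loss(b, b_hat):
    """Mean squared error, Eq. (3): b are labels in {0,1}, b_hat soft estimates in [0,1].
    Averaging over the k bits of one codeword matches the 1/k sum in the text."""
    return np.mean((np.asarray(b) - np.asarray(b_hat)) ** 2)

def bce_loss(b, b_hat, eps=1e-12):
    """Binary cross-entropy, Eq. (4); eps-clipping avoids log(0)."""
    b = np.asarray(b, dtype=float)
    b_hat = np.clip(np.asarray(b_hat, dtype=float), eps, 1.0 - eps)
    return -np.mean(b * np.log(b_hat) + (1.0 - b) * np.log(1.0 - b_hat))

def llr(y, sigma2):
    """Channel LLR for BPSK over AWGN, Eq. (5): LLR(y) = 2*y / sigma^2."""
    return 2.0 * np.asarray(y) / sigma2
```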
Freely available open-source machine learning libraries, such as Theano (https://github.com/Theano/Theano), help to implement and train complex NN models on fast concurrent GPU architectures. We use Keras (https://github.com/fchollet/keras) as a convenient high-level abstraction front-end for Theano. It allows one to quickly deploy NNs from a very abstract point of view in the Python programming language and hides away a lot of the underlying complexity. As we support reproducible research, we have made parts of the source code of this paper available at https://github.com/gruberto/DL-ChannelDecoding.

III. LEARN TO DECODE

In the sequel, we will consider two different code families: random codes and structured codes, namely polar codes [19]. Both have codeword length N = 16 and code rate r = 0.5. While random codes are generated by randomly picking codewords from the codeword space with a Hamming distance larger than two, the generator matrix of polar codes of block size N = 2^n is given by

    G_N = F^{\otimes n}, \qquad F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}    (6)

where F^{\otimes n} denotes the n-th Kronecker power of F. The codewords are obtained by x = u G_N, where u contains k information bits and N - k frozen positions; for details we refer to [19]. This way, polar codes are inherently structured.

A. Design parameters of NND

Our starting point is a NN as described before (see Fig. 1). We introduce the notation 128-64-32, which describes a NN decoder employing three hidden layers with 128, 64, and 32 nodes, respectively. However, there are other design parameters with a non-negligible performance impact:
1) What is the best training signal-to-noise ratio (SNR)?
2) How many training samples are necessary?
3) Is it easier to learn from LLR channel output values rather than from the direct channel output?
4) What is an appropriate loss function?
5) How many layers and nodes should the NN employ?
6) Which type of regularization should be used? (Regularization is any method that trades off a larger training error against a smaller validation error; an overview of such techniques is provided in [16, Ch. 7]. We do not use any regularization techniques in this work, but leave them as an interesting future investigation.)

The area of research dealing with the optimization of these parameters is called hyperparameter optimization [20]. In this work, we do not further consider this optimization and restrict ourselves to a fixed set of hyperparameters which we have found to achieve good results. Our focus is on the differences between random and structured codes.

Since the performance of NND depends not only on the SNR of the validation data set (for which the bit error rate (BER) is computed) but also on the SNR of the training data set (measured as E_b/N_0), we define the normalized validation error (NVE). Denoting by \rho_t and \rho_v the training and validation SNR, respectively, let BER_{NND}(\rho_t, \rho_v) be the BER achieved by a NN trained at \rho_t on data with SNR \rho_v. Similarly, let BER_{MAP}(\rho_v) be the BER of MAP decoding at SNR \rho_v. For a set of S different validation data sets with SNRs \rho_{v,1}, ..., \rho_{v,S}, the NVE is defined as

    \mathrm{NVE}(\rho_t) = \frac{1}{S} \sum_{s=1}^{S} \frac{\mathrm{BER}_{\mathrm{NND}}(\rho_t, \rho_{v,s})}{\mathrm{BER}_{\mathrm{MAP}}(\rho_{v,s})}.    (7)

The NVE measures how good a NND, trained at a particular SNR, is compared to MAP decoding over a range of different SNRs. Obviously, for NVE = 1 the NN achieves MAP performance; in general, the NVE is greater than one. In the sequel, we compute the NVE over S = 20 different SNR points from 0 dB to 5 dB with a validation set size of 20000 examples for each SNR. (It would also be possible to use a training data set which contains a mix of different SNR values, but we have not investigated this option here. Recently, the authors in [21] observed that starting at a high training SNR and then gradually reducing the SNR works well.)

We train our NN decoder in so-called "epochs". In each epoch, the gradient of the loss function is calculated over the entire training set X using Adam, a method for stochastic gradient descent optimization [22]. Since the noise layer in our architecture generates a new noise realization each time it is used, the NN decoder will never see the same input twice. For this reason, although the training set has a limited size of 2^k codewords, we can train on an essentially unlimited training set by simply increasing the number of epochs M_ep. However, this makes it impossible to distinguish whether the NN is improved by a larger amount of training samples or by more optimization iterations.
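A minimal sketch of such a setup, written against the current tf.keras API rather than the Theano-backed Keras used by the authors, could look as follows. The random linear placeholder encoder, the fixed noise standard deviation, the epoch count, and the choice of MSE loss are assumptions of this sketch; the authors' released code (see the repository above) should be consulted for the actual implementation.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
N, k = 16, 8                    # codeword length and number of information bits (r = 0.5)
r = k / N
ebn0_db = 1.0                   # training Eb/N0 in dB (a design parameter studied in the paper)
sigma = np.sqrt(1.0 / (2.0 * r * 10.0 ** (ebn0_db / 10.0)))

# Hypothetical random linear code as a stand-in encoder; the paper instead uses
# random codes with minimum distance larger than two and polar codes.
G = rng.integers(0, 2, size=(k, N))
U = np.array([[(i >> j) & 1 for j in range(k)] for i in range(2 ** k)])
X = (U @ G) % 2                 # training set: all 2^k noiseless codewords

# Decoder alone: three hidden ReLU layers (128-64-32) and a sigmoid output.
decoder = keras.Sequential([
    keras.Input(shape=(N,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(k, activation="sigmoid"),      # soft bit estimates in [0, 1]
])

# Training model: non-trainable modulation and noise layers emulate the
# abstract channel of Fig. 1 in front of the decoder.
train_model = keras.Sequential([
    keras.Input(shape=(N,)),
    layers.Lambda(lambda x: 1.0 - 2.0 * x),     # BPSK mapping, no trainable weights
    layers.GaussianNoise(sigma),                # AWGN layer, active only during training
    decoder,
])
train_model.compile(optimizer="adam", loss="mse")   # Eq. (3); "binary_crossentropy" gives Eq. (4)

# One epoch = one full-batch gradient step over all 2^k codewords; the noise
# layer draws a fresh realization every time, so no input is ever repeated.
train_model.fit(X, U, batch_size=2 ** k, epochs=2 ** 10, verbose=0)
# After training, `decoder` alone maps noisy received channel values to soft bits.
```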
Starting with a NN decoder architecture of 128-64-32 and M_ep = 2^22 learning epochs, we train the NN with data sets of different training SNRs and evaluate the resulting NVE. The result is shown in Fig. 2, from which it can be seen that there is an "optimal" training E_b/N_0. The occurrence of an optimum can be explained by the two extreme cases:
1) E_b/N_0 -> infinity: training without noise, so the NN is not trained to handle noise.
2) E_b/N_0 -> 0: training only with noise, so the NN cannot learn the code structure.
This clearly indicates an optimum somewhere in between these two cases. From now on, a training E_b/N_0 of 1 dB and 4 dB is chosen for polar and random codes, respectively.

[Fig. 2: NVE versus training E_b/N_0 for 16 bit-length codes for a 128-64-32 NN trained with M_ep = 2^16 training epochs; (a) polar code, (b) random code.]

Fig. 3 shows the BER achieved by a very small NN of dimensions 128-64-32 as a function of the number of training epochs, ranging from M_ep = 2^10 to 2^18. For the BER simulations, we use 1 million codewords per SNR point. For both code families, the larger the number of training epochs, the smaller is the gap between MAP and NND performance. However, for polar codes, close-to-MAP performance is already achieved for M_ep = 2^18 epochs, while we may need a larger NN or more training epochs for random codes.

[Fig. 3: Influence of the number of epochs M_ep on the BER of a 128-64-32 NN for 16 bit-length codes with code rate r = 0.5; (a) polar code, (b) random code.]

In Fig. 4, we illustrate the influence of direct channel values versus channel LLR values as decoder input, in combination with the two loss functions MSE and BCE. The NVE for all combinations is plotted as a function of the number of training epochs. Such a curve is also called a "learning curve" since it shows the process of learning. Although it is usually recommended to normalize the NN inputs to have zero mean and unit variance, we train the NN without any normalization, which seems to be sufficient for our setup. For a few training epochs, the LLR input improves the learning process; however, this advantage disappears for larger M_ep. The same holds for BCE compared with MSE. For polar codes with LLR values and BCE, the learning appears not to converge for the applied number of epochs. In summary, when training the NN with a large number of training epochs it does not matter whether LLR or channel values are used as inputs and which loss function is employed. Moreover, normalization is not required.

[Fig. 4: Learning curve for 16 bit-length codes with code rate r = 0.5 for a 128-64-32 NN (direct channel vs. channel LLR input, MSE vs. BCE loss).]

In order to answer the question how large the NN should be, we trained NNs of different sizes and structures. From Fig. 5, we can conclude that, for both polar and random codes, it is possible to achieve MAP performance. Moreover, and somewhat surprisingly, the larger the net, the fewer training epochs are necessary. In general, the larger the number of layers and neurons, the larger is the expressive power or capacity of the NN [16]. Contrary to what is common in classic machine learning tasks, increasing the network size does not lead to overfitting since the network never sees the same input twice.

[Fig. 5: Learning curve for different NN sizes (128-64-32, 256-128-64, 512-256-128, 1024-512-256) for 16 bit-length codes with code rate r = 0.5.]
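The NVE values shown in the learning curves above follow directly from Eq. (7). A small helper for computing it from simulated BER curves might look like this; the example numbers are made up purely to illustrate the call, not results from the paper.

```python
import numpy as np

def nve(ber_nnd, ber_map):
    """Normalized validation error, Eq. (7).

    ber_nnd : length-S array, BER of the NN decoder (trained at one fixed
              training SNR) evaluated at S validation SNR points
    ber_map : length-S array, MAP BER at the same validation SNR points

    NVE = 1 means the NND reaches MAP performance at every SNR point.
    """
    ber_nnd = np.asarray(ber_nnd, dtype=float)
    ber_map = np.asarray(ber_map, dtype=float)
    return np.mean(ber_nnd / ber_map)

# Hypothetical example with S = 4 validation SNR points:
print(nve([1e-2, 3e-3, 8e-4, 2e-4], [9e-3, 2.5e-3, 6e-4, 1.2e-4]))
```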
B. Scalability

Up to now, we have only considered 16 bit-length codes, which are of little practical importance. Therefore, the scalability of the NN decoder is investigated in Fig. 6. One can see that the length N is not crucial for learning a code by deep learning techniques. What matters, however, is the number of information bits k, which determines the number of different classes (2^k) that the NN has to distinguish. For this reason, the NVE increases exponentially for larger values of k for a NN of fixed size and a fixed number of training epochs. If a NN decoder is supposed to scale, it must be able to generalize from a few training examples. In other words, rather than learning to classify 2^k different codewords, the NN decoder should learn a decoding algorithm which provides the correct output for any possible codeword. In the next section, we investigate whether structure allows for some form of generalization.

[Fig. 6: Scalability shown by NVE versus the number of information bits k for a 1024-512-256 NN for 16/32/64 bit-length codes with different code rates and M_ep = 2^16 training epochs.]

IV. CAPABILITY OF GENERALIZATION

As Figs. 2-6 show, NNDs for polar codes always perform better than for random codes for a fixed NN design and number of training epochs. This provides a first indication that structured codes, such as polar codes, are easier to learn than random codes. In order to confirm this hypothesis, we train the NN on a subset X_p which covers only p% of the entire set of valid codewords. Then, the NN decoder is evaluated on the complementary set Xbar_p that covers the remaining (100-p)% of X. As a benchmark, we evaluate the NN decoder also on the set of all codewords X. Instead of the BER as in Fig. 3, we now use the block error rate (BLER) for evaluation (see Fig. 7). This way, we only consider whether an entire codeword is correctly detected or not, excluding side-effects of similarities between codewords which might lead to partially correct decoding. While for polar codes the NN is able to decode codewords that were not seen during training, the NN cannot decode any unseen codeword for random codes. Fig. 8 emphasizes this observation by showing the single-word BLER for the codewords x_i in Xbar_80 which were not used for training. Obviously, the NN fails for almost every unseen random codeword, which is plausible. But for a structured code, such as a polar code, the NN is able to generalize even to unseen codewords. Unfortunately, the NN architecture considered here is not able to achieve MAP performance if it is not trained on the entire codebook. However, finding a network architecture that generalizes best is a topic of our current investigations.

[Fig. 7: BLER for a 128-64-32 NN trained on X_p (p = 70, 80, 90, 100) with M_ep = 2^18 learning epochs; (a) 16 bit-length polar code (r = 0.5), (b) 16 bit-length random code (r = 0.5). Solid and dashed lines show the performance on Xbar_p and on X, respectively.]

[Fig. 8: Single-word BLER versus codeword index i for x_i in Xbar_80 at E_b/N_0 = 4.16 dB and M_ep = 2^18 learning epochs.]

In summary, we can distinguish two forms of generalization. First, as described in Section III, the NN can generalize from input channel values with a certain training SNR to input channel values with arbitrary SNR. Second, the NN is able to generalize from a subset X_p of codewords to an unseen subset Xbar_p. However, we observed that for larger NNs the capability of the second form of generalization vanishes.
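A sketch of the evaluation used in this section, under the assumption that a decoder like the one sketched in Section III has already been trained on the subset X_p: the held-out codewords are repeatedly sent over the BPSK/AWGN channel and the per-codeword block error rate is estimated, in the spirit of Figs. 7 and 8. The Monte-Carlo trial count and the 0.5 hard-decision threshold are choices of this sketch, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def bler_per_codeword(decoder, X_unseen, U_unseen, sigma, trials=1000):
    """Monte-Carlo estimate of the single-word BLER on held-out codewords.

    decoder  : trained model mapping noisy length-N channel vectors to k soft bits
               (e.g., the `decoder` sub-model from the earlier sketch)
    X_unseen : (M, N) codewords from the held-out set, i.e., the complement of X_p
    U_unseen : (M, k) corresponding information bits
    sigma    : AWGN standard deviation at the evaluation Eb/N0
    """
    errors = np.zeros(len(X_unseen))
    for _ in range(trials):
        s = 1.0 - 2.0 * X_unseen                               # BPSK modulation
        y = s + sigma * rng.standard_normal(s.shape)           # fresh noise realization
        u_hat = (np.asarray(decoder.predict(y, verbose=0)) > 0.5).astype(int)
        errors += np.any(u_hat != U_unseen, axis=1)            # block error indicator
    return errors / trials                                     # per-codeword BLER (cf. Fig. 8)

# Hypothetical 80/20 split of the codebook (p = 80):
# idx = rng.permutation(len(X)); cut = int(0.8 * len(X))
# X_train, U_train = X[idx[:cut]], U[idx[:cut]]    # used for training only
# X_held, U_held   = X[idx[cut:]], U[idx[cut:]]    # evaluated with bler_per_codeword
```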
V. OUTLOOK AND CONCLUSION

For small block lengths, we were able to decode both random codes and polar codes with MAP performance. But learning is limited by exponential complexity as the number of information bits in the codewords increases. The very surprising result is that the NN is able to generalize for structured codes, which gives hope that decoding algorithms can be learned. State-of-the-art polar decoding currently suffers from high decoding complexity, a lack of possible parallelization and, thus, critical decoding latency. NND inherently describes a highly parallelizable structure, enabling one-shot decoding. This renders deep learning-based decoding a promising alternative channel decoding approach as it avoids sequential algorithms. Future investigations will be based on the exploration of regularization techniques as well as recurrent and memory-augmented neural networks, which are known to be Turing complete [23] and have recently shown remarkable performance in algorithm learning.

REFERENCES

[1] X.-A. Wang and S. B. Wicker, "An artificial neural net Viterbi decoder," IEEE Trans. Commun., vol. 44, no. 2, pp. 165-171, Feb. 1996.
[2] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115-133, 1943.
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1." Cambridge, MA, USA: MIT Press, 1986, pp. 318-362.
[4] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, pp. 541-551, Dec. 1989.
[5] J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proc. Nat. Acad. Sci., vol. 79, pp. 2554-2558, 1982.
[6] J. Bruck and M. Blaum, "Neural networks, error-correcting codes, and polynomials over the binary n-cube," IEEE Trans. Inform. Theory, vol. 35, no. 5, pp. 976-987, Sept. 1989.
[7] G. Zeng, D. Hush, and N. Ahmed, "An application of neural net in decoding error-correcting codes," in IEEE Int. Symp. on Circuits and Systems, vol. 2, May 1989, pp. 782-785.
[8] W. R. Caid and R. W. Means, "Neural network error correcting decoders for block and convolutional codes," in Proc. IEEE Globecom Conf., vol. 2, Dec. 1990, pp. 1028-1031.
[9] A. D. Stefano, O. Mirabella, G. D. Cataldo, and G. Palumbo, "On the use of neural networks for Hamming coding," in IEEE Int. Symp. on Circuits and Systems, vol. 3, June 1991, pp. 1601-1604.
[10] L. G. Tallini and P. Cull, "Neural nets for decoding error-correcting codes," in Proc. IEEE Tech. Applicat. Conf. and Workshops Northcon95, Oct. 1995, pp. 89-.
[11] A. Hamalainen and J. Henriksson, "A recurrent neural decoder for convolutional codes," in Proc. IEEE Int. Conf. on Commun. (ICC), vol. 2, 1999, pp. 1305-1309.
[12] H. Abdelbaki, E. Gelenbe, and S. E. El-Khamy, "Random neural network decoder for error correcting codes," in Int. Joint Conf. on Neural Networks, vol. 5, 1999, pp. 3241-3245.
[13] J.-L. Wu, Y.-H. Tseng, and Y.-M. Huang, "Neural network decoders for linear block codes," Int. Journ. of Computational Engineering Science, vol. 3, no. 3, pp. 235-255, 2002.
[14] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, July 2006.
[15] E. Nachmani, Y. Be'ery, and D. Burshtein, "Learning to decode linear codes using deep learning," CoRR, 2016. [Online]. Available: http://arxiv.org/abs/1607.04793
[16] I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning," 2016, book in preparation for MIT Press. [Online]. Available: http://www.deeplearningbook.org
[17] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.
[18] M. Jaderberg, K. Simonyan, A. Zisserman et al., "Spatial transformer networks," in Advances in Neural Information Processing Systems, 2015, pp. 2017-2025.
[19] E. Arikan, "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels," IEEE Trans. Inform. Theory, vol. 55, no. 7, pp. 3051-3073, 2009.
[20] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, "Algorithms for hyper-parameter optimization," in Advances in Neural Information Processing Systems 24. Curran Associates, Inc., 2011, pp. 2546-2554.
[21] D. George and E. A. Huerta, "Deep Neural Networks to Enable Real-time Multimessenger Astrophysics," ArXiv e-prints, Dec. 2016.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[23] H. T. Siegelmann and E. D. Sontag, "On the computational power of neural nets," in Proc. of the Fifth Annual Workshop on Computational Learning Theory. ACM, 1992, pp. 440-449.
