Temporal Overdrive Recurrent Neural Network

Filippo Maria Bianchi¹, Michael Kampffmeyer¹, Enrico Maiorino², and Robert Jenssen¹
¹Machine Learning Group, Dept. of Physics and Technology, University of Tromsø, Norway
²Dept. of Information, Electronic and Communication Engineering, Sapienza University, Rome, Italy
fi[email protected] (corresponding author), [email protected], [email protected], [email protected]

Abstract—In this work we present a novel recurrent neural network architecture designed to model systems characterized by multiple timescales in their dynamics. The proposed network is composed of several recurrent groups of neurons that are trained to separately adapt to each timescale, in order to improve the system identification process. We test our framework on time series prediction tasks and we show some promising, preliminary results achieved on synthetic data. To evaluate the capabilities of our network, we compare its performance with several state-of-the-art recurrent architectures.

Keywords—Recurrent Neural Network; Multiple Timescales; Time Series Analysis.

I. INTRODUCTION

Time series analysis focuses on reconstructing the properties of a dynamical system from a sequence of values, produced by a noisy measurement process acting on the state space of the system. Correct modeling is necessary to understand and characterize the dynamics of the system and, consequently, to predict its future behavior. A Recurrent Neural Network (RNN) is a universal approximator of Lebesgue measurable dynamical systems and is a powerful tool for predicting time series [13]. In its basic formulation, the whole state of an RNN evolves according to precise timing. However, in order to model a system whose internal dynamic variables evolve at different timescales, an RNN is expected to learn both short- and long-term dependencies. The capability of modeling multiple timescales is particularly important in the prediction of real-world time series, since they are generally characterized by multiple seasonalities [10].

In this paper we propose a novel RNN architecture, specifically designed for modeling a system characterized by heterogeneous dynamics operating at different timescales. We combine the idea of using bandpass neurons in a randomly connected recurrent layer, an inherent feature of reservoir computing approaches [17], with the gradient-based optimization adopted in classic RNN architectures. Contrary to standard RNNs, where the internal state of the neurons is influenced by all the frequencies of the input signal, in our network specific groups of neurons update their states according to different frequencies in the input signal. Therefore, since the hidden neurons can operate at temporal resolutions that are slower than the fastest frequencies in the input, we named our network Temporal Overdrive RNN (TORNN), after the mechanism in vehicle engines that, under certain conditions, decouples the speed of the wheels from the revolutions of the engine.

TORNN is capable of modeling with high accuracy the timescales in the dynamics of the underlying system. In the proposed architecture, the hidden recurrent layer is partitioned into different groups, each one specialized in modeling the dynamics at a single timescale. This is achieved by organizing the recurrent layer in groups of bandpass neurons with shared parameters, which determine the frequency band to operate on. During training, while the neuron connection weights are fixed to their initial random values, the bandpass parameters are learned via backpropagation. In this way, the recurrent layer dynamics are smoothed out by the bandpass filtering in order to adapt to the relevant timescale. The main hyperparameter required by TORNN is the number of neuron groups composing the recurrent layer, which has to be chosen by considering the expected number of separate timescales characterizing the analyzed process. This information can be easily retrieved through an analysis in the frequency domain and/or by a priori knowledge of the input data. The network also depends on a few other structural hyperparameters, which are not critical, as the final performance of the model is not particularly sensitive to their tuning. Since the pair of bandpass parameters are the only weights trained in each group of neurons, the total number of parameters to be learned is considerably smaller than in other RNN architectures, with a consequent simplification of the training procedure.
In this work we focus on the prediction of real-valued time series, generated from a system whose internal dynamics operate at different timescales. In Sect. II we review previous works that dealt with the problem of modeling dependencies across different timescales in the data with RNNs. In Sect. III we provide the details of the proposed architecture. In Sect. IV we present some initial results on synthetic data obtained by the proposed architecture and other state-of-the-art alternatives. Finally, in Sect. V we draw our conclusions.

II. RELATED WORKS

Classic RNN architectures struggle to model different temporal resolutions, especially in the presence of long-term relations, which are poorly handled due to the issue of the vanishing gradient [1]. Several solutions have been proposed in the past to deal with long-term dependencies [19], and an important milestone has been reached with the introduction of Long Short-Term Memory (LSTM) [14]. This architecture has been shown to be robust to the vanishing gradient problem, thanks to its gated access mechanism to its neurons' state. This results in the LSTM networks' capability to handle long time lags between events and to automatically adapt to multiple time granularities.
In particular, the LSTM neurons, referred to as "cells", are composed of an internal state that can be accessed through three gates: the input gate, which controls whether the current input value should replace the existing state; the forget gate, which controls the reset of the cell's state; and the output gate, which determines whether the unit should output its current state. An additional component in several implementations of LSTM cells are the peephole connections, which allow gate layers to look directly at the cell state. A recent variant of the traditional LSTM architecture is the Gated Recurrent Unit (GRU) [5], which always exposes its internal state completely, without implementing mechanisms to control the degree of exposure. In GRU, both peephole connections and output activation functions are removed, while the input and the forget gate are coupled into an update gate. While LSTM and GRU excel in learning long and short time relationships at the same time, they are difficult to train due to an objective function landscape characterized by a high dimensionality and several saddle points [9]. Additionally, these networks depend on several hyperparameters, whose tuning is non-trivial, and the utility of their various computational components depends on the application at hand [12].

An initial attempt to model multiple dynamics and timescales was based on the idea that temporal relationships are structured hierarchically, hence the RNN should be organized accordingly [8]. The resulting architecture managed to improve the modeling of slow-changing contexts. An analogous hierarchical organization has been implemented by stacking multiple recurrent layers [11]. In the same spirit, a more complex stacked architecture called Gated Feedback Recurrent Neural Networks has been proposed in Ref. [7]. In this architecture, the recurrent layers are connected by gated-feedback connections which allow them to operate at different timescales. By referring to the unfolding in time of the recurrent network, the states of consecutive time steps are fully connected and the strength of the connections is trained by gradient descent. The main shortcoming of these layered architectures is the large number of parameters that must be learned in order to adapt the network to process the right timescales. This results in a long training time, with the possibility of overfitting the training data.

The issue of modeling multiple timescales has also been discussed within the framework of reservoir computing, a relatively recent approach to temporal signal processing that leverages a large, randomly connected and untrained recurrent layer called the reservoir. The desired output is computed by a linear, memory-less readout, which maps the instantaneous states of the network and is usually trained by linear regression. The original reservoir computing architectures, known as Echo State Networks [15] and Liquid State Machines [21], are characterized by a fading memory that prevents them from modeling slow periodicities that extend beyond the memory capacity of the reservoir. A solution proposed in Ref. [17] is to slow down the dynamics of the reservoir by using leaky integrator units as neurons. These units have individual state dynamics that can model different temporal granularities in the target task. Leaky integrators behave as lowpass filters, whose cutoff frequency is controlled by the leakage parameter. Precise timing phenomena emerge from the dynamics of a randomly connected network implementing these processing units, without the requirement of a dedicated timing mechanism, such as clocks or tapped delay lines. A different kind of unit, implementing a bandpass filter, has been proposed to encourage the decoupling of the dynamics within a single reservoir and to create richer and more diverse echoes, which increase the processing power of the reservoir [25]. Improved results have been achieved by means of neurons implementing more advanced digital bandpass filters with particular frequency-domain characteristics [26]. Differently from other strategies, where the optimal timescales for the task at hand are learned from data, reservoir computing follows a "brute force" approach. In fact, not only is a large, sparse and randomly connected reservoir generated in order to provide rich and heterogeneous dynamics, but a large number of filters are also initialized with random cutoff frequencies, to encode a wide range of timescales in the recurrent layer. By providing filters that cover the whole spectrum, the correct timings required to model the target dynamics are likely to be provided, at the cost of a considerably redundant representation of the system. An alternative approach has also been proposed, where the filters are tuned manually according to the information in the frequency spectrum of the desired response signal [26].
III. MODEL ARCHITECTURE

In this section we discuss the details of the TORNN architecture. Let us consider a time series x = {x[t]}, t = 1, ..., T, whose values are relative to the measurement of the states of a system characterized by slow and fast dynamics. Each dynamical component is modeled by a specialized group of neurons in the recurrent layer which operates at the same characteristic frequencies. The number of groups K is equal to the number of main dynamical components of the system and can be determined through a frequency analysis of the input signal x. In this work, by means of a power spectral density estimate, we identify K as the number of peaks in the power spectrum, but alternative approaches are possible.
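As a concrete illustration of this step, the sketch below estimates the power spectral density of a signal and counts its prominent peaks to choose K. This is only a minimal sketch of the procedure described above: the paper does not prescribe the estimator or the peak-detection criterion, so the use of Welch's method and the 20 dB relative-height threshold are assumptions.

```python
import numpy as np
from scipy.signal import welch, find_peaks

def estimate_num_groups(x, fs=1.0, rel_height_db=20.0):
    """Choose K as the number of peaks in the power spectrum of x.

    Welch's method and the relative-height threshold (in dB below the
    strongest peak) are illustrative choices, not prescribed by the paper.
    """
    freqs, psd = welch(x, fs=fs, nperseg=min(1024, len(x)))
    psd_db = 10.0 * np.log10(psd + 1e-12)              # work in dB for a scale-free threshold
    peaks, _ = find_peaks(psd_db, height=psd_db.max() - rel_height_db)
    return len(peaks), freqs[peaks]

# Example: two superimposed sinusoids -> K should come out as 2
t = np.arange(5000)
x = np.sin(0.05 * t) + np.sin(0.31 * t)
K, peak_freqs = estimate_num_groups(x)
```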
In order to operate only on the portion of the spectrum located around one of the maxima, each processing unit implements a bandpass filter, whose configuration is identical to that of the other units in the same group. As shown later in Sect. III-B, the transfer function of each bandpass filter depends on two parameters related to the two cutoff frequencies, but also on the connection weights of the input and recurrent layers. If the configuration of these connections is modified, the frequency response of each filter will change as well. In TORNN we implement a hybrid approach. Analogously to reservoir computing methodologies, input and recurrent connections are randomly initialized and kept unchanged, while the pair of parameters that define the bandpass in each group of neurons is trained via backpropagation, along with the output layer.

In the following, we first explain how the recurrent layer is structured. Then, we describe how the bandpass filters are implemented in the hidden processing units and, finally, we discuss the learning procedure adopted to train the parameters in the model.

A. Recurrent layer structure

The recurrent layer of the network is randomly generated, under the constraint that the connections form a structure with K groups. Each group C_k contains N neurons, which are strongly connected to the other neurons in C_k. Additionally, in order to allow for some weak coupling between the different dynamical components in the modeled system, we also generate a small number of connections between neurons belonging to different groups. This allows synchronization mechanisms between different groups, which are required if the different dynamics of the modeled system are coupled. For intra-group connections, we define a probability p that determines the presence of a link e_{i,j} between the neurons n_i and n_j of the same group C_k. A second probability q determines the presence of an inter-group connection e_{i,j} between the neurons n_i and n_j of different groups. To guarantee higher intra-group connectivity we set p, q such that p ≫ q.

To define the recurrent layer connections, we first generate a Boolean square matrix W_r ∈ R^{NK×NK} by drawing values from a binomial distribution B(1,p) for elements belonging to one of the K blocks on the diagonal, or from a binomial distribution B(1,q) for the remaining elements. The structure of W_r is depicted in Fig. 1. To obtain a higher degree of heterogeneity among neuron activations and to enrich the internal dynamics [23], each non-zero connection e_{i,j} is multiplied by a value drawn from a normal distribution N(0,1). Finally, to guarantee stability in the network [4], we rescale the spectral radius of W_r to a value lower than 1.

Fig. 1. Structure of the recurrent layer, encoded in the weight matrix W_r. Intra-group connections (the K diagonal blocks) are drawn from a binomial distribution B(1,p), while inter-group connections are drawn from B(1,q). To ensure that intra-group connections are denser than inter-group connections, we set p ≫ q.

Analogously to the connections in the recurrent layer, the input weights (stored in the matrix W_i ∈ R^{I×NK}, with I the input size) are randomly initialized with values drawn from a distribution N(0,1) and are kept untrained as well.
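A minimal NumPy sketch of this construction is given below. It follows the description above (diagonal blocks drawn from B(1,p), off-diagonal blocks from B(1,q), Gaussian weights, spectral-radius rescaling); the default values of N, p, q and the spectral radius are taken from the TORNN row of Tab. I, while the function name and everything else are illustrative assumptions, not the authors' code.

```python
import numpy as np

def build_recurrent_layer(K, N=20, p=0.4, q=0.1, rho=0.95, input_size=1, rng=None):
    """Build W_r (NK x NK) with K dense diagonal blocks and sparse inter-group
    links, plus untrained input weights W_i (input_size x NK)."""
    rng = np.random.default_rng(rng)
    size = N * K
    # Boolean connectivity mask: B(1, q) everywhere, overwritten by B(1, p) on the blocks.
    mask = rng.binomial(1, q, size=(size, size)).astype(float)
    for k in range(K):
        block = slice(k * N, (k + 1) * N)
        mask[block, block] = rng.binomial(1, p, size=(N, N))
    # Gaussian weights N(0, 1) on the existing connections.
    W_r = mask * rng.standard_normal((size, size))
    # Rescale the spectral radius below 1 for stability.
    spectral_radius = np.max(np.abs(np.linalg.eigvals(W_r)))
    if spectral_radius > 0:
        W_r *= rho / spectral_radius
    # Input weights, drawn from N(0, 1) and kept untrained.
    W_i = rng.standard_normal((input_size, size))
    return W_r, W_i
```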
B. Bandpass processing units

A leaky integrator neuron outputs a weighted average, controlled by a leakage rate γ, of the new activation of the neuron with the one from the previous time interval [2, 17]. A leaky neuron acts as a lowpass filter, whose cutoff frequency is proportional to γ. Its state-update equation is defined as

$$x[t+1] = (1-\gamma)\,x[t] + \gamma\, f\big(W_r\, x[t] + W_i\, u[t+1]\big) \tag{1}$$

or, by following an alternative definition [24], as

$$x[t+1] = f\big((1-\gamma)\,x[t] + \gamma\, W_r\, x[t] + W_i\, u[t+1]\big) \tag{2}$$

where f(·) is a tanh (or a sigmoid) function. If γ = 0, the new state at time t+1 maintains the value of the state at time t. If γ = 1, the state-update equation reduces to a classic nonlinear transfer function.

In the integrator of Eq. 2, the non-linearity f(·) is applied also to the leakage term. This guarantees stability, since the poles of the transfer function are constrained to the unit circle, and has the advantage of no computational overhead, since the integrators can be incorporated in W_r [24]. This integrator always leaks, due to the contracting property of the non-linear mapping of the hyperbolic tangent upon itself. Since this is not an issue in TORNN, these are the filters we chose to implement.

As previously discussed, we want the processing units in each group of the recurrent layer to act as bandpass filters, which allow signals between two specific frequencies to pass, but discriminate against signals at other frequencies. In particular, we want the neurons in the group C_k to reject the frequency components that are not close to a given peak in the spectrum. A bandpass filter can be generated by combining a lowpass filter with a highpass filter (implemented by negating the lowpass filter). Alternatively, a bandpass filter is obtained by combining two lowpass filters. In this latter case, according to Eq. 2, the state update equation of a bandpass neuron reads as

$$\begin{aligned}
x'[t+1] &= f\big((1-\gamma_1)\,x'[t] + \gamma_1\, W_r\, x[t] + W_i\, u[t+1]\big),\\
x''[t+1] &= (1-\gamma_2)\,x''[t] + \gamma_2\, x'[t+1],\\
x[t+1] &= x'[t+1] - x''[t+1].
\end{aligned} \tag{3}$$

The parameters γ1 and γ2 control respectively the high and low frequency cutoffs of the bandpass filter. Interestingly, the filter has a transfer function equivalent to the one of an analogue electronic RCL circuit [25].
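For clarity, the following sketch implements one step of the bandpass state update of Eq. 3 with NumPy. The function and variable names are illustrative, and the expansion of the group-wise shared (γ1, γ2) pairs to per-neuron vectors is an assumption consistent with the parameter sharing described in Sect. III-A.

```python
import numpy as np

def bandpass_step(x, x1, x2, u, W_r, W_i, gamma1, gamma2):
    """One update of Eq. 3 for a recurrent layer of N*K bandpass neurons.

    x, x1, x2      : states x[t], x'[t], x''[t], each of shape (N*K,)
    u              : input u[t+1], shape (I,)
    W_r            : recurrent weights, shape (N*K, N*K)
    W_i            : input weights, shape (I, N*K) as in the text
    gamma1, gamma2 : per-neuron cutoff parameters in (0, 1), shape (N*K,);
                     neurons of the same group share the same values
    """
    drive = gamma1 * (W_r @ x) + W_i.T @ u                  # gamma1 scales only the recurrent term, as in Eq. 2
    x1_new = np.tanh((1.0 - gamma1) * x1 + drive)           # first lowpass
    x2_new = (1.0 - gamma2) * x2 + gamma2 * x1_new          # second lowpass
    x_new = x1_new - x2_new                                 # bandpass = difference of the two lowpass outputs
    return x_new, x1_new, x2_new

def expand_group_params(gamma_per_group, N):
    """Repeat each group's shared parameter over its N neurons."""
    return np.repeat(np.asarray(gamma_per_group), N)
```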
C. Learning

The bandpass filters implemented by the units in the group C_k must specialize to pass only the frequencies in the neighborhood of one of the peaks in the power spectrum of x. According to Eq. 3, to control the band passed by the filters in each group, one must tune the parameters γ1 and γ2. However, in practical cases the bandpass width of each filter is not easy to determine in advance, as the neighborhood of each peak in the power spectrum could include either noise or useful information for the problem at hand. Most importantly, γ1 and γ2 are related to the effective cutoff frequencies in the spectrum through a highly non-linear dependency, and the desired response of the filter is difficult to determine. Therefore, we follow a data-driven approach to automatically learn the control parameters by means of a gradient descent optimization, which accounts for the nature of the task at hand.

The last component in the architecture is a dense layer, composed of a set of weights stored in the matrix W_o ∈ R^{H×O}, which combines the output of the neurons in the recurrent layer to produce the O-dimensional output. A schematic depiction of the whole architecture is reported in Fig. 2.

Fig. 2. Input connections (encoded in W_i) and recurrent layer connections (encoded in W_r) are initialized pseudo-randomly and they are kept fixed afterwards. In the recurrent layer, the connections between neurons in the same group, depicted as solid black lines, are denser than the connections between neurons in different groups (dashed gray lines). In red, we mark the elements that are trained by means of a gradient descent optimization, which are the parameters that determine the cutoff frequencies in each group C_k (γ1,k and γ2,k) and the weights connecting the recurrent layer with the network output (W_o).

Like the filter parameters, the connection weights in W_o are also learned through a gradient descent optimization. We opted for a standard backpropagation through time (BPTT), parametrized by a threshold τ_tnc which determines the truncation of the unfolding of the network in time. To constrain the parameters γ1 and γ2 to lie within the [0,1] interval (required by Eq. 3), during the optimization we apply a sigmoid function to the values of γ1 and γ2 learned by the gradient descent.
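One way to realize this constraint, sketched below under the assumption that the optimizer works on unconstrained variables, is to update auxiliary parameters θ and map them through a sigmoid before they enter Eq. 3. The variable names and the chain-rule note are illustrative; the paper does not prescribe a specific implementation.

```python
import numpy as np

K = 3                          # number of neuron groups (e.g., from the peak count in Sect. III)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The gradient step updates the unconstrained variables theta; the gammas that
# enter Eq. 3 are their sigmoid images, which always lie in (0, 1).
theta = np.zeros((2, K))       # row 0: theta_1 (high cutoff), row 1: theta_2 (low cutoff)
gamma1, gamma2 = sigmoid(theta)

# Chain-rule factor for backpropagating through the reparameterization:
# d gamma / d theta = gamma * (1 - gamma)
dgamma_dtheta = sigmoid(theta) * (1.0 - sigmoid(theta))
```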
IV. EXPERIMENTS

In this section, we evaluate the capability of TORNN to model multiple dynamics characterized by different timescales. To test our model, we consider the prediction of a time series generated by superimposed sine waves whose frequencies are incommensurable. Since the ratio of the frequencies is an irrational number, the periods of the sine waves do not cross each other and hence the period of the resulting signal is, in theory, infinite. This academic, yet important, task has been previously considered to test the memory capacity of a recurrent neural network [16]. Indeed, to accurately predict the unseen values of the time series, the network would require a huge amount of memory. Here, we tackle the problem from a different perspective. We show that if the network is capable of learning how to model the dynamics of each single oscillator separately, the prediction task can be solved accurately with a limited amount of computational resources.

In the following, we try to solve several time series prediction tasks of increasing difficulty. Each time series is obtained by superimposing sinusoids whose frequencies are pairwise incommensurable. The sinusoids are generated by multiplying a frequency φ by distinct integer powers of a transcendental number, such as e (Euler's number) or π. In our experiments we chose e as the base number and the considered time series have the form

$$x_K[t] = \sum_{k=1}^{K} \sin(e^{k}\varphi)[t], \tag{4}$$

where K is the number of superimposed sine waves. The difficulty of the prediction task is controlled by increasing the number of components K and by adding to the time series a white noise n[t] ∼ N(0,1). In the experiments, we set a noise-to-signal ratio of 0.2 and we evaluated time series with T = 5000 time steps.

To quantify the performance of TORNN on each task, we compare its prediction accuracy with those achieved by the Elman RNN (ERNN), LSTM, GRU and ESN. To initialize the trainable weights in TORNN, ERNN, LSTM and GRU, we draw their initial values from a Gaussian distribution N(0, 1/√d), d being the number of processing units in the successive layer of the network [9]. For each network we used the Adam algorithm as the optimizer of the gradient descent step [18]. ERNN, LSTM and GRU are configured to use a comparable number of parameters (approximately 8500). For TORNN and ESN, instead, we instantiate 20 neurons in the recurrent layer for each sinusoidal component in the signal. Therefore, the number of parameters trained in TORNN is (2×K)+(20×K)+1, which are, respectively, the filter parameters for all groups, the number of connections from the recurrent layer to the output, and the bias. While the number of neurons in ESN reservoirs is usually much larger, the reason for this experimental setup is to show that TORNN manages to handle the same computational resources of an ESN more effectively.

The details of the configuration of each network are reported in Tab. I. The model parameters of the ESN are not learned through gradient descent optimization. Indeed, the output weights in the readout are computed through a simple ridge regression, an optimization problem that can be expressed in closed form and solved in a time that grows as the cube of the number of neurons [3]. On the other hand, a correct setup of the hyperparameters of the ESN is in general more critical than in other architectures, and they are usually tuned through cross-validation procedures. In our experiments, we used a genetic algorithm to search for the optimal hyperparameter values. We followed the same approach described in [20], to which we refer the interested reader for further details. The bounds considered in the search of each hyperparameter are reported in Tab. I.

TABLE I
Summary of the hyperparameter configuration for each network. ERNN, LSTM, GRU: number of processing units (Nr), number of trainable parameters (#params), L2 regularization of Wo weights (λ), gradient truncation in BPTT (τtnc). ESN (admissible range of hyperparameters, optimized with a genetic algorithm [20]): neurons of the reservoir (Nr, not optimized), spectral radius (ρ), reservoir connectivity (r), noise in state update (ξ), input scaling (ωi), teacher scaling (ωo), feedback scaling (ωf), L2 ridge regression normalization (λ). TORNN: processing units per group (N), intra-group connection probability (p), inter-group connection probability (q), spectral radius of Wr (ρ), L2 regularization of Wo weights (λ) and gradient truncation in BPTT (τtnc).

ERNN:  Nr = 91, #params = 8555, λ = 1E-5, τtnc = 10
LSTM:  Nr = 45, #params = 8506, λ = 1E-5, τtnc = 10
GRU:   Nr = 52, #params = 8477, λ = 1E-5, τtnc = 10
ESN:   Nr = 20×K, ρ ∈ [0.1, 1.5], r ∈ [0.1, 0.5], ξ ∈ [0, 0.1], ωi ∈ [0.1, 1], ωo ∈ [0.1, 1], ωf ∈ [0.1, 1], λ ∈ [0, 0.5]
TORNN: N = 20, p = 0.4, q = 0.1, ρ = 0.95, λ = 1E-5, τtnc = 10

In each task we perform a prediction with a forecast horizon of 15 time steps. The prediction error is expressed in terms of the Normalized Root Mean Squared Error (NRMSE), computed as

$$\mathrm{NRMSE} = \sqrt{\big\langle \|\, y - y^{*} \,\|^{2} \big\rangle \,/\, \big\langle \|\, y - \langle y^{*} \rangle \,\|^{2} \big\rangle},$$

where y is the output predicted by the network and y* the ground truth. We consider 4 different time series x_K[t], generated according to Eq. 4, with K = 2, 3, 5, 7. The power spectral density estimates of the time series are depicted in Fig. 3. As can be seen, in the spectra of x_5[t] and x_7[t] there are two frequencies which are very close. This increases the difficulty of the problem, as these frequencies are harder to separate. For each time series, we also consider a version with superimposed white noise n[t].

Fig. 3. Power spectral densities of the four time series x2[t], x3[t], x5[t] and x7[t] considered in the experiments. Here, we report the spectrum of the versions without the superimposed white noise.
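The synthetic benchmark and the error measure can be reproduced with a few lines of NumPy, as in the sketch below. It follows Eq. 4 (interpreting sin(e^k φ)[t] as the sinusoid sin(e^k φ t) sampled at integer t) and the NRMSE definition above; the value of the base frequency φ and the interpretation of the noise-to-signal ratio (scaling the noise standard deviation relative to that of the clean signal) are assumptions, since the paper does not report them explicitly.

```python
import numpy as np

def make_series(K, T=5000, phi=0.002, noise_to_signal=0.0, rng=None):
    """Superimposed sinusoids with pairwise incommensurable frequencies (Eq. 4),
    optionally corrupted by white noise. phi is chosen here so that e**K * phi
    stays below the Nyquist rate pi for K up to 7 (an assumption)."""
    rng = np.random.default_rng(rng)
    t = np.arange(T)
    x = sum(np.sin(np.e ** k * phi * t) for k in range(1, K + 1))
    if noise_to_signal > 0:
        x = x + noise_to_signal * np.std(x) * rng.standard_normal(T)
    return x

def nrmse(y_pred, y_true):
    """Normalized root mean squared error, as defined in Sect. IV (1-D outputs)."""
    num = np.mean((y_pred - y_true) ** 2)
    den = np.mean((y_pred - np.mean(y_true)) ** 2)
    return np.sqrt(num / den)

x7_noisy = make_series(K=7, noise_to_signal=0.2)
```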
The prediction accuracies are reported in Fig. 4, where the colored boxes represent the average NRMSE (the shorter, the better) and the error bars are the standard deviations, computed over 10 different trials with independent random initializations. TORNN achieves better performance in every test.

Fig. 4. Forecast accuracy (NRMSE) achieved by TORNN and other RNN architectures on the prediction of different time series, relative to a forecast horizon of 15 time intervals. Colored boxes represent the average error on 10 independent trials, the error bars the standard deviation. Each time series xK[t] is the combination of K sinusoids with incommensurable frequencies, evaluated with and without a superimposed white noise n[t].

In order to provide a qualitative description of the forecast results, in Fig. 5 we show a portion of the prediction of the time series x7[t]+n[t] by TORNN and the other considered networks, namely ERNN, LSTM, GRU and ESN. In every plot, the gray line represents the ground truth, the black line the considered network's forecast, and the light red area the residual between the two. By looking at the residual areas of each time series, it is possible to notice an overall better prediction by TORNN with respect to the other RNNs.

Fig. 5. Prediction results for a fraction of the time series x7[t]+n[t] with a prediction step-ahead of 15 time steps. For each network, the gray line represents the ground truth, the black line the predicted time series, and the shaded red areas are the error residuals of the prediction.

These results provide empirical evidence of the effectiveness of the proposed framework, which is capable of modeling the different components and thus performing a more accurate prediction. As expected, when the time series are corrupted by noise, the prediction accuracy heavily decreases in every model.

The state-of-the-art architectures based on gradient optimization, which are ERNN, LSTM and GRU, obtain on average a larger prediction error with respect to ESN and TORNN. Furthermore, ERNN performs worse than LSTM and GRU in every test. This is a consequence of the simplicity of its architecture, which is unable to properly handle the long-term dependencies in the analyzed time series. The performance of LSTM and GRU is comparable and it is not possible to conclude which one performs best. This result is expected and is in agreement with previous studies, which evidenced that the selection of a specific gated architecture should depend on the dataset and the corresponding task at hand [6]. However, even if the GRU structure is simpler than the plain LSTM architecture (without peephole connections), the training time of GRU is higher, due to the larger number of operations required to compute the forward and the backward propagation steps [5].

The performance of TORNN is always better than, or at least equal to, that of ESN. It is worth noticing that TORNN can be considered a general case of an ESN, since its architecture reduces to an ESN when γ1 = 1 and γ2 = 0 in every group of units. Indeed, with such a configuration the bandpass filters reduce to regular neurons. Finally, we underline that in ESN the training of the model is faster than in the other architectures. In fact, rather than a slow gradient descent optimization, the weights of the linear readout in an ESN can be evaluated by performing a simple linear regression. However, ESNs trade the precision of gradient descent for the "brute force" redundancy of random reservoirs and this, inevitably, makes the models more sensitive to the selection of the hyperparameters (like the spectral radius). Hence, the computational resources required for an accurate search of the optimal hyperparameters should be accounted for in the comparison with other architectures.

V. CONCLUSIONS

In this work we presented the Temporal Overdrive Recurrent Neural Network, a novel RNN architecture designed to model multiple dynamics that are characterized by different timescales. The proposed model is easy to configure, as it only requires specifying the number of different timescales that should be accounted for, information that can be easily obtained from a rough frequency analysis. Each dynamical component is handled by a group of neurons implementing a bandpass filter, whose behavior is determined by two parameters that set the cutoff frequencies.

The proposed methodology follows the strategy of reservoir computing approaches, as the input and recurrent connections are randomly initialized and kept untrained. However, while reservoir computing implements a "brute force" approach to generate a high, yet redundant, number of internal dynamics, TORNN learns from data the optimal configuration of its dynamics through a gradient descent optimization. Furthermore, with respect to other gradient-based RNNs, the number of trainable parameters is significantly lower, with a consequent simplification of the learning procedure.
We performed some preliminary tests on synthetic data, which showed promising results and demonstrated that our network achieves superior prediction performance with respect to other state-of-the-art RNNs. TORNN can also be used for system identification [22]. In this case, the network must operate in a generative setup and the output of the network has to be fed back into the recurrent layer.

We are currently working on real-world data, relative to load forecasting, which will be presented in a future extension of this work. We also plan to explore the implementation of more efficient bandpass filters, with sharper frequency cutoffs. We believe that this will help in separating the dynamical components more effectively. In the future we also plan to investigate further the internal dynamics of the network.

REFERENCES

[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[2] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8624–8628. IEEE, 2013.
[3] F. M. Bianchi, S. Scardapane, A. Uncini, A. Rizzi, and A. Sadeghian. Prediction of telephone calls load using Echo State Network with exogenous variables. Neural Networks, 71:204–213, 2015.
[4] F. M. Bianchi, L. Livi, and C. Alippi. Investigating echo state networks dynamics by means of recurrence analysis. arXiv preprint arXiv:1601.07381, 2016.
[5] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[6] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
[7] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. CoRR, abs/1502.02367, 2015.
[8] S. El Hihi and Y. Bengio. Hierarchical recurrent neural networks for long-term dependencies. In NIPS, volume 400, page 409. Citeseer, 1995.
[9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pages 249–256, 2010.
[10] P. G. Gould, A. B. Koehler, J. K. Ord, R. D. Snyder, R. J. Hyndman, and F. Vahid-Araghi. Forecasting time series with multiple seasonal patterns. European Journal of Operational Research, 191(1):207–222, 2008.
[11] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[12] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. CoRR, abs/1503.04069, 2015.
[13] B. Hammer, A. Micheli, A. Sperduti, and M. Strickert. Recursive self-organizing network models. Neural Networks, 17(8-9):1061–1085, 2004.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] H. Jaeger. The "echo state" approach to analysing and training recurrent neural networks – with an erratum note. German National Research Center for Information Technology GMD Technical Report, 148:34, Bonn, Germany, 2001.
[16] H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.
[17] H. Jaeger, M. Lukoševičius, D. Popovici, and U. Siewert. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3):335–352, 2007.
[18] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[19] T. Lin, B. G. Horne, P. Tiňo, and C. L. Giles. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338, 1996.
[20] S. Løkse, F. M. Bianchi, and R. Jenssen. Training echo state networks with regularization through dimensionality reduction. CoRR, abs/1608.04622, 2016.
[21] W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002.
[22] O. Nelles. Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Springer Science & Business Media, 2013.
[23] M. C. Ozturk, D. Xu, and J. C. Príncipe. Analysis and design of echo state networks. Neural Computation, 19(1):111–138, 2007.
[24] B. Schrauwen, J. Defour, D. Verstraeten, and J. Van Campenhout. The introduction of time-scales in reservoir computing, applied to isolated digits recognition. Pages 471–479. Springer Berlin Heidelberg, 2007.
[25] U. Siewert and W. Wustlich. Echo-state networks with band-pass neurons: Towards generic time-scale-independent reservoir structures. Internal status report, PLANET intelligent systems GmbH, 2007.
[26] F. Wyffels, B. Schrauwen, D. Verstraeten, and D. Stroobandt. Band-pass reservoir computing. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pages 3204–3209, 2008.
