ebook img

An Empirical Study of the L2-Boost technique with Echo State Networks PDF

0.31 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview An Empirical Study of the L2-Boost technique with Echo State Networks

1 An Empirical Study of the L2-Boost technique with Echo State Networks Sebastián Basterrech IT4Innovations National Supercomputing Center VSB-Technical University of Ostrava Ostrava, Czech Republic [email protected] 5 1 0 2 Abstract—A particular case of Recurrent Neural Network encode the spatio-temporal information of the input patterns (RNN) was introduced at the beginning of the 2000s under the into a spatial representation. At each time this dynamical n nameofEchoStateNetworks(ESNs).TheESNmodelovercomes system is characterized by its state that is called reservoir a thelimitationsduringthetrainingoftheRNNswhileintroducing J in the RC literature. This non-linear transformation is most no significant disadvantages. Although the model presents some 2 well-identified drawbacks when the parameters are not well often realized by a Recurrent Neural Network (RNN) with a initialized. The performance of an ESN is highly dependent on large pool of interconnected neurons. A distinctive principle ] itsinternalparametersandpatternofconnectivityofthehidden- ofa RCmodelisthat theparametersof thedynamicalsystem G hiddenweightsOften,thetuningofthenetworkparameterscan (the RNN weights) do not participate in the training process. L be hard and can impact in the accuracy of the models. That is, once the reservoir parameters are initialized, they . In this work, we investigate the performance of a specific s boostingtechnique(calledL -Boost)withESNsassinglepredic- remain fixed during the training process. Another part of the c 2 tors. The L -Boost technique has been shown to be an effective modelisamemory-lesssupervisedlearningtoolcalledreadout [ 2 tool to combine “weak” predictors in regression problems. In structure. This part is designed to be robust and fast in the 1 this study, we use an ensemble of random initialized ESNs learning process. v (without control their parameters) as “weak” predictors of the The RC models has been applied in the neuroscience area 3 boostingprocedure.Weevaluateourapproachonfivewell-know 0 time-series benchmark problems. Additionally, we compare this forprocessing cognitiveinformationinthe neuralsystem.[6]. 5 techniquewithabaselineapproachthatconsistsofaveragingthe Furthermore, they have proven to be extremely effective tools 0 prediction of an ensemble of ESNs. fortime-seriesproblemsintheareaofMachineLearning.For 0 Keywords-L -boosting, Echo State Network, Time-series mod- instance,asfarasweknowoneofthemostpopularRCmodels . 2 1 eling, Reservoir Computing, Ensemble Methods the Echo State Networks (ESN) [7], has the best known learn- 0 ing performance on the Mackey-Glass time-series prediction 5 problem[8],[9].Inthisarticle,wewillconcentrateintheESN 1 I. INTRODUCTION model for solving time-series problems. The reservoir in the : v Boosting is a general procedure for improving the accuracy ESNmodeliscomposedbyaRNNwithsigmoidneurons,and i X of an ensemble of methods. It has been successful used thereadoutpartofthemodelisalinearregression.Theweight in supervised learning problems since its apparition in the connection between neurons in the reservoir are collected in r a 1990s[1]–[3].SeveralvariationsoftheoriginalBoostingidea a matrix that we will call reservoir matrix. The main global have been introduced over the years [4], [5], one of the most parameters of the ESN model are: the input scaling factor, popular is called AdaBoost [3]. At the beginning, Boosting the spectral radius of the reservoir matrix and the pattern of was used in problems where the output features were label connectivity among the reservoir units. The setting of these or discrete responses (classification problems). An analogy parameters often requires the human expertise and several between AdaBoost and additive models was studied in [4]. empirical trials [10]. As a consequence, the setting procedure This connection was essential for the extension of Boosting canbeexpensiveincomputationaltime.Forinstance,thetime forsolvingproblemswheretheoutputfeaturesarecontinuous complexity of an algorithm that computes the spectral radius variables (regression problems). Bühlmann et al. developed of a N ×N matrix is equal to O(N4) [11]. a variation of the Boosting technique called L -Boost that is The goal of this article is to investigate the performance 2 constructedfromanadditivemodelandthefunctionalgradient of an automatic procedure to combine single weak ESNs and descent method [5]. the L -Boost technique. We develop an automatic technique 2 Since the early 2000s, a computational paradigm called based on L -Boost, which combines the prediction of several 2 Reservoir Computing (RC) has gained prominence in the randominitializedESNsinordertoproduceahighlyaccurate NeuralComputationcommunity.InaRCmodeltherearetwo tool.WeusethetermweakESN foranESNwithoutchecking well-separated concepts: a dynamical system and a memory- and computing the spectral radius. We use the terminology less function. The purpose of the dynamical system is to weak due to the fact that this particular ESN is not optimal. Themainadvantagesoftheprocedurepresentedinthisarticle t.Theproblemiscalledregressionproblemwhenthelearning are: set has output numerical variables. Otherwise, it is named • Descend the computational effort. In order to gain in the classification problem. In the case of regression problems, it computational effort, the approach consists of combining is recommended to use the a quadratic distance [14]. Even weak ESNs. The procedure avoids to tune the reservoir though we can also use a quadratic distance in classification parameters, which can often have a high computational problems, it is recommendable to use the Kullback-Leibler cost. It uses only a uniform random initialization of the distance in this domain [14]. weights. Note that, there are not a control of the spectral AnESNmodelismainlyusedforsolvingsupervisedlearn- radius, then some single weak ESNs can have unstable ingtasks,whereinthedatasetpresentstemporaldependencies. dynamics. Although, it can be also used for non-temporal supervised • The procedure is automatic. The procedure does not learning problems [9]. In this article we will concentrate require external human expertise for setting the model only in temporal learning tasks with real output variables parameters, and for evaluating the model performance. (y(t) ∈RNY, for all t). In this work, we perform the models • The technique has a new parameter used for overfitting using a standard discrete time. We want to forecast some control. This parameter which we will that comes from aspectoftheoutputfeaturey attimet+k,usingsomeaspect the L -Boost technique. of the information available at current time t, that is given the 2 collection ((x(t),y(t)),(x(t−1),y(t−1)),(x(t−2),y(t−2)),...) We present empirical results of the procedure introduced we would like to predict the value y(t+k) (k > 0) [15]. In in this paper on a wide range of benchmark problems. We this case, the goal consists in estimating a mapping Fˆ(·) for compare these performances with the accuracy obtained by a predicting y(t+k) for some k > 0, such that some distance single ESN. Furthermore, we realize a comparison with the between Fˆ(·) and y is minimized. accuracy of a baseline approach that computes the average among single ESN models. Each single ESN is independently initialized and adjusted during the learning process. There are B. Additive Models empirical evidence in the Machine Learning literature that In[4]wasanalyzedtheBoostingmodelundertheformofan show that this baseline approach sometimes performs better Additivemodel.Givenasetoffunctionsf(m) :RNX →RNY, than other ensemble methods [12]. m = 1...,M characterized by a set of parameters θ and This work is a revised and expanded version of the arti- expansion coefficients β, cle [13]. The structure of this article is organized as follows. In f(m)(x)=β(m)h(x,θ(m)), Section II, we start with a specification of supervised learning an additive model has the following form problems with temporal data. Next, we present an overview about the family of additive models. Subsection II-C intro- M (cid:88) duces the L -Boost technique. In Section III, we present the F(x)= f(m)(x). (1) 2 Reservoir Computing paradigm. Particularly, we focus on the m=1 Echo State Network model in III-A. In Subsection III-B is The functions {h(x;θ)}M are named basis functions. They 1 presented a formalization of the procedure introduced in this are not fixed a priori and are selected depending of the cost article.SectionIVdescribestheempiricalresults.Thissection function used and the data set. An important parameter of the starts with a description of the benchmark problems. Next, model is the number of basis functions (M) considered in we present the reached results. Finally, last section provides the expression (1). This parameter controls the generalization conclusions and future work. errorofthemodel.Sincethemaingoalinalearningtaskisto find a predictor with low generalization error, the parameter II. BACKGROUND M hasanimportantroleintheaccuracyofanadditivemodel. In this Section, we start specifying the context where the ESNmodelandtheL2-Boosttechniqueareapplied.Next,we C. The L2-Boost Procedure present the additive models and we introduce a description of A relationship between the gradient descent technique and the L -Boost technique. 2 stage-wiseadditiveexpansionswasintroducedatthebeginning of the 2000s [16]. The introduction of the gradient descent A. Problem Specification algorithm using a boosting approach was an essential con- We begin specifying a supervised learning problem. Given tribution in the field of ensemble learning methods [16]. It a data set L = {(x(t),y(t)) : t = 1,...,T} where the allowed to start to use boosting in regression problems [5]. points x and y are either a class or a numerical response. A Boost method for regression problems with quadratic error We denote by N the dimension of the input vector x, distancewasintroducedunderthenameofL -Boostin[5].We x 2 and N the dimension of the output vector y. We suppose presenttheL -BoosttechniqueinAlgorithm1.Otherboosting y 2 that the mapping between the input x and the output y is variants were presented for other kind of distances, some of given by certain unknown function F(·). The goal consists them are described in [4], [5], [16]. in learning a parametric function Fˆ(x(t),L) such that certain We refer by epoch to the iteration of the training algorithm errordistancebetweenFˆ(x(t),L)andy(t)isminimizedforall through all the patterns in the training set [17]. At each epoch m+1, the basis function h(m+1)(·,θ) is fitted to the current totheopposite,whenthegradientnormlargeincreasesduring residuals: y(i) −Fˆ(m)(x(i)), for all i. Unlike other boosting the training process [21]. Recently, an effective algorithm techniques such as Adaboost, L -Boost does not present any to train RNN was introduced [19], the algorithm uses the 2 re-weighting. Another difference between L -Boost and other Hessian-free Optimization for setting the network parameters. 2 boostingmethodsisthatL -Boostpresentsatendencytoover- ReservoirComputing(RC)modelsappearasagoodalterna- 2 fit the data [5]. The model with contracting linear learners tive for RNNs.The two pioneering RC modelsare Echo State converge to the fully saturated model [5]. Each boosting Network(ESN)[7]andLiquidStateMachine(LSM)[22].This epoch contributes to additional overfitting, thus the selection computational paradigm covers the main limitations related to of the weak learners and the parameter M is an essential learningprocessesinRNNsobtainingacceptableperformance task for this device. In practice, few boosting iterations are in practical applications [9]. In a RC model there are at least enough to achieve good performances avoiding the overfitting two well-differentiated structures: a dynamical system called phenomena. reservoir and another one called readout. The readout is a supervised learning tool for training with non-temporal data. Algorithm 1 The L -Boost algorithm. For example: feedforward neural network, linear regression, 2 Require: L, M, h(x,θ) decision trees, etc. A main characteristic of a RC model is that the weights involved in circuits are deemed fixed during thelearningprocess.Thus,thematrixwiththeweightbetween Fit an initial model using a least squares fit (see [18]): Fˆ(0)(·)=h(·,θ); reservoir units (reservoir matrix) is initialized in an arbitrary way and it remains unchanged during the learning process. for (m=1,...M) do The training algorithm is restricted to update the weights in Compute the residuals for all pattern i: e(i) =y(i)−Fˆ(m)(x(i)); the readout structure. Over the last years several kinds of Fit the model fˆ(m+1)(·) parametrized as dynamicalsystemshavebeenusedforgeneratingthereservoir fˆ(m+1)(x)=h(x,θ) to the current residuals e using the state, models include: Backpropagation-decorrelation Recur- rent Learning [23], Leaky Integrator Echo State Networks least squares fit; Update: Fˆ(m+1)(·)=Fˆ(m)(·)+fˆ(m+1)(·); studied [24], Evolino [8], Intrinsic Plasticity [25], Echo State Queueing Networks [26], Reservoir Computing and Extreme end for Return the Fˆ(M)(·) function; Learning [27], and so on. A. Formalization of the Echo State Network Model III. MODELINGTIME-SERIESWITHECHOSTATE In this work related to the L2-Boost technique and the RC NETWORKS methods, we only study the L2-Boost with the ESN model. The Recurrent Neural Networks (RNNs) are powerful tools An ESN reservoir is a RNN from an input space RNx into for solving time-series benchmarks. They are computational a larger space RNs with Nx (cid:28) Ns. The connection between input and hidden neurons are collected in a N ×N weight methods that operate in time. Considering terminology of s x matrix win. The connections among the hidden neurons are graphs, in a RNN at least one circuit is presented in its representedbyaN ×N weightmatrixwr.AN ×N weight topology. The circuits of the network enable to store temporal s s y s matrixwout representsthereadoutweights.Atanytimet,the information, in order to learn and memorize the input his- information from the input pattern and the past is represented tory[9].Eachcircuitcreatesaninternalstatewhichmakesthe in a state vector recurrent network a discrete time state-space model. At each time, the RNN receives an input pattern. Next, the network s(t)=tanh(winx(t)+wrs(t−1)). (2) updates its hidden state via a non-linear activation function using the input pattern and the network state at the precedent At any time t, the output prediction y(t) ∈RNy is generated time [19]. There are a general consensus in the community using the input pattern and the reservoir state information. that considers the RNN as powerful tool for forecasting and Most often is computed using a linear regression: time-series prediction. y(t) =wout[x(t)|s(t)], (3) In spite of that, in practice the model presents some draw- backs. The most important is that is hard to train a RNN where ·|· is the vertical concatenation of the vectors. For using gradient descent methods [20]. The training methods the sake of the notation simplicity, we omit the bias term, that use the first differential information have often stability a constant term is included in all the regressions. problems and high numerical complexity. As a consequence, In [7] was analyzed the stability of the reservoir dynamics muchlongertrainingtimesarenecessarytoadjustthenetwork in the ESN model. Under certain algebraic conditions the weights. In [20] is analyzed the main limitations of the algo- reservoir state only depends (asymptotically) of the inputs rithms of the gradient descent type for training RNNs. These and the network topology. It becomes independent of its drawbacks are identified under the names of vanishing and initial conditions [9]. These conditions were summarized in the exploding gradient problems [20]. The vanishing gradient the Echo State Property (ESP) [7]. In practice, the stability phenomena occurs when the norm of the gradient decreases of the ESN is almost always ensured when the spectral arbitrarily fast to 0. The exploding gradient phenomena refers radius of the reservoir matrix is less than 1 [9], [28]. As a consequence, the reservoir weights are appropriately scaled A. Description of the Benchmark Problems in order to have a spectral radius less than 1. To scale the We use the following range of time-series benchmarks: parameters is necessary to compute the spectral radius of the • Fixed kth order NARMA. This data set presents a high reservoir matrix. The computation of the spectra requires an non-linearity and is widely used in the RC literature. importantcomputationaleffort[11].Someattemptstogenerate We generate the NARMA serie following the description a procedure for initializing the RC models were introduced in [28], [32], in [11], [25], [29]–[31]. k−1 (cid:88) b(t+1)=α (t)+α b(t) b(t−i) 1 2 B. L -BoostUsingtheESNModelforTime-seriesProcessing i=0 2 +α s(t−(k−1))s(t)+α , Information 3 4 where s(t) ∼ Unif[0,0.5] and the constants values are In this article, we investigate the performance of using L - 2 shown in Table I. In order to evaluate the memorization Boost in temporal learning tasks, and we consider as weak ability of the model, we consider two simulated series learner predictors a set of ESNs with random initialization. when k = 10 and k = 30. The task consists to predict Given an arbitrary parameter M the procedure is as follows. We initialize an ESN in a random way. The initialization k α α α α consists in selecting the size of the network as well as the 1 2 3 4 10 0.3 0.05 1.5 0.1 pattern of connectivity. We consider a reservoir with fixed 30 0.2 0.004 1.5 0.001 sparse connections. We do not control the spectrum norm of thereservoirweight.Aguideabouttheinitializationprocedure TableI:ParametersconsideredforthefixedkthorderNARMA can be seen from [10]. We expand the input information serie with k =10 and k =30. using the ESN reservoir given by the expression (2), thus we obtain s(t), ∀t. Next, we apply Algorithm 1. Finally, we thevaluey(t+1)basedonthehistoryofy(t)uptotime obtain predictor F(M)(·). The approach is summarized in t. We used the first 200 samples as initial washout in Algorithm2.Inourexperimentsweuseridgelinearregression both the training and test procedure. The regularization for computing the readout weights wout [28]. parameter used was 0.00001. • The Santa Fe Laser data set [33]. It is an experimental Algorithm 2 The L -Boost with the ESN model. 2 data that contains the intensity pulsations of a real laser Initialize an ESN following the comments in Subsec- recorded by a LeCroy oscilloscope. The data is a cross- tion III-B; cut through periodic to chaotic intensity laser pulsations. Compute the temporal expansion of L using (2); These pulsations more or less follow the theoretical Generate the set {(s(t),y(t)),∀t}; Lorenzmodelofatwolevelsystem[33].Inthisproblem, Apply the Algorithm 1; the task consists to predict the next measure y(t+1), Return Fˆ(M)(·); giventheprecedentvaluesuptot.Theoriginaldataonly consists of 1000 measurements, we used for training 499 samples and for testing 500 samples. We used a washout In order to evaluate the performance of this procedure, we compare the reached accuracy of the L -Boost technique of10samples.Theregularizationparameterγ was0.001. 2 with a simple baseline approach [17]. The baseline approach • Henon Map data set. It is a prototypical invertible map consists in combining K single predictors (in our case the with chaotic solutions proposed in [34]. The data is generated by learningpredictorsareESNs).Weconsiderrandominitialized reservoirs, without control of the reservoir spectrum norm. In y(t+1)=1−1.4(y(t))2+0.3y(t−1)+z(t+1), the baseline method, we train independently each of these single ESNs. The final prediction is the average among the where the noise is z(t) ∼ N(0,0.05). The data is single predictions. normalized in [0,1]. The goal is to predict the next value For statistical comparisons between the methods we con- y(t+1)withthepastinformationuptot.Weconsidered sider K = 30. We remark again that we do not scale the atrainingdatawith3995samplesandatestdatawith795 reservoir weights for obtaining the ESP. Even though some samples. The regularization parameter γ used was 0.001. ESN models can present good accuracy, other ones can be Weuseaninitialwashoutcomposedby100samples.The weak predictors. Additionally, we compare our performances network topology has 3 input units set with the last two withtheperformanceobtainedwhensingleESNswith“good” precedent y(t) values and the noise at current time. tuning of the reservoir parameters are used. For that, we use • Freedman’s non linear time data set [35]. The data is the results presented in the RC literature. generated by y(t+1)=g(y(t)), IV. EMPIRICALRESULTS where: (cid:26) 2x, ifx≤0.5, We begin this section describing the benchmark problems. g(x)= 2−2x, x>0.5. Next, we specify the experimental setup. We concludes this section with an analysis of our empirical results. We consider a very short data set. The length of the training data was 30 and the test size was 19. The initial columns 3 and 4 of Table IV are very similar. The model value is y(0)=0.23719. The initial washout considered with larger M performs better over the train data, but it has only was of 3 samples. The network topology has only problems of generalization. We found this characteristic in oneinputunit,severalreservoirunitsandoneoutputunit. all benchmarks. As a consequence, we can affirm that the The regularization parameter γ used was 0.001. parameter M has a relevant impact in the control of the overfitting phenomenon. We can found a similar remarks for the L -Boost technique in non-temporal learning tasks [5]. B. Experimental Setup 2 Figure 2 illustrates the NMSE reached according the reser- voirsizefordifferentM valuesforthe30thNARMAdataset. Initial Train Test This figure shows the training error, we can see the evolution DATA Washout γ samples samples of the NMSE versus the size of the reservoir. We present few 10th NARMA 200 0.00001 1400 2400 values of reservoir size between 6 till 11. Figures 1 and 3 30th NARMA 200 0.00001 1600 2600 show the NMSE of the test data versus the reservoir size for Santa Fe Laser 10 0.001 499 500 the 10th and 30th order NARMA data set, respectively. These Henon Map 100 0.001 3995 795 figures show 4 curves, the black one (with points represented Freedman’s 3 0.001 30 19 by dots) corresponds to the baseline method which combines several single ESNs. The other curves correspond to the L - 2 Table II: Parameter setting of the benchmark problems. In all Boost-ESN with different number of epochs M = 6,8 and cases, we initialize the input weights (win) using Uniform M = 10. We can not affirm that the procedure of L2-Boost distribution in [−0.2,0.2], and we initialize the reservoir with single weak ESNs performs better than optimal single weights with an Uniform distribution in [−0.8,0.8]. ESNs. The accuracy it is also of the same order that results presented in the RC literature using a single well-initialized ESN [28], [31]. We summarize the setting of the main parameters related Figure 4 illustrates the evolution of the NMSE for the to the benchmark problems in Table II. The table presents the initial washout period, the regularization parameter (γ) of reservoirsizeofthetestdataoftheSantaFeLaserbenchmark. The error was computed for the L -Boost with ESNs for the linear ridge regression, and the number of train and test 2 M =4,6 and M =8 and the baseline approach averaging 30 samples for each benchmark problem. ESNs.Figure5showstheaccuracyreachedforthemodelson The benchmarks selected have been widely used in the the Freedman test data set. The graphic shows the evolution RC literature [7], [28], [31], [36]. In all cases, we use of the L -Boost with ESNs for M = 4,5 and M = 5 and the Normalized Mean Square Error (NMSE) as measure of 2 the baseline approach. In all graphics, we can see that when accuracy model [9]. The learning method used for computing theoutputweightmatrixwout wastheofflineridgeregression. the reservoir increases its size the procedure L2-Boost with ESNs and the baseline approach decrease their test error. This Thisalgorithmhasaregularizationparameterγ thatweadjust behaviorabouttheimpactofthereservoirsizeontheaccuracy it for each benchmark problem. The pre-processing data step of the model, also happens with single ESNs [10], [28], [31]. consisted in normalizing the patterns in the interval [0,1] We investigated the algorithm performance for several reservoir sizes. The range of the reservoir size values is specified for DATA ACCURACY Nx REF. 10THNARMA 0.166(NMSE) 50 [28] each benchmark problem. The connection between the input 0.0425(NMSE) 200 [28] and reservoir layer is fully connected with random weights in 30THNARMA 0.4542(NRMSE) 100 [30] [−0.2,0.2]. The reservoir matrix is initialized using Uniform SANTAFELASER 0.0184(NMSE) 50 [28] distribution in [−0.8,0.8]. 0.00819(NMSE) 200 [28] HENONMAP 0.00975(NMSE) 50 [28] 0.00868(NMSE) 200 [28] C. Result Analysis FREEDMAN’S 0.0004302(MSE) 40 [31] TableIIIshowsresultsreportedintheRCliteratureforthese benchmarks when a single ESN model was used as model Table III: Accuracy of the ESN model for the benchmark predictor. Table IV presents the train set accuracy reached problems. Second column shows the accuracy reached by the on the Henon Map data set. The columns 2, 3 and 4 show single ESN, third column refers the reservoir size and the last the NMSE reached with L -Boost with ESNs for M epochs column shows a bibliographic reference. In the case of the 2 (M =3, 4 and 5), respectively. Column 5 of Table III shows Freedman’s non linear time data, the reservoir initialization the accuracy of the baseline approach, that it averaging the was done using the Scale Invariant Map method [31], and prediction of 30 ESNs. The columns of the table are written the Mean Square Error (MSE) was the error measure. In the using a scientific notation. caseof30thNARMA,theauthorsinitializethereservoirusing Table IV illustrates the accuracy of the models during the permutation matrices. The error measure was the Normalized training. The NMSE corresponds to the training data of the Root Square Error (NRMSE) [30]. In the other benchmarks HenonMapdataset.Wepresentthistableinordertoillustrate problems, the authors control some reservoir parameters such thetendencyofoverfittingofL -BoostwithESN.Theadditive as: spectral radius and reservoir matrix density. 2 model converge very fast to the solution, for this reason the M =3 M =4 M =5 BAS.ESN 30 order Narma Train data Nx (1.0E-12) (1.0E-13) (1.0E-13) (1.0E-12) 6 0.297367 0.131036 0.131036 0.288720 Baseline ESN Boost: M=6 7 0.191883 0.131036 0.131036 0.064412 0.63 Boost: M=8 8 0.142289 0.131036 0.131036 0.572728 Boost: M=10 9 0.160910 0.131036 0.131036 0.182942 0.62 10 0.203845 0.131036 0.131036 0.278237 0.61 11 0.272649 0.131036 0.131035 0.524017 12 0.488265 0.131036 0.131036 0.459002 MSE 0.6 N 0.59 Table IV: Train set performance of the Henon Map data set. First column indicates the number of neurons in the reservoir. 0.58 The columns 2, 3 and 4 show the NMSE obtained with L - 2 0.57 BoostwithM epochs(M =3,4and5),respectively.Column 5 shows the accuracy of the baseline approach. That is the 6 7 8 9 10 11 accuracy average among 30 ESN predictions. The columns Reservoir size are written using a scientific notation. Figure 2: The accuracy reached on the training set of the 30th order NARMA data. The vertical axis of the graph shows the 10 order Narma test data NMSE accuracy, and the horizontal axis presents some values 0.18 Baseline ESN of the reservoir size. We compare the accuracy of L2-Boost: Boost: M=4 0.17 Boost: M=5 M = 6,8 and 10 with the baseline approach averaging 30 Boost: M=6 ESNs. 0.16 0.15 E S 30 order Narma Test data M 0.6 N 0.14 Baseline ESN Boost: M=6 0.59 Boost: M=8 0.13 Boost: M=10 0.12 0.58 0.116 7 8 9 10 11 MSE0.57 Reservoir size N 0.56 Figure1:Testsetaccuracyreachedforthe10thorderNARMA data set. The vertical axis of the graph shows the NMSE 0.55 accuracy, and the horizontal axis presents some values of the reservoir size. We compare the accuracy of L -Boost: 2 0.54 M = 4,5 and 6 with the baseline approach averaging 30 6 7 8 9 10 11 Reservoir size ESNs. Figure 3: The accuracy reached for the 30th order NARMA data set. The vertical axis of the graph shows the NMSE accuracy, and the horizontal axis presents some values of V. CONCLUSIONSANDFUTUREWORK the reservoir size. We compare the accuracy of L -Boost: 2 At the beginning of the 2000s, an efficient technique to M = 6,8 and 10 with the baseline approach averaging 30 train and design a RNN was developed under the name ESNs. of Echo State Network (ESN). This approach overcome the limitations to train RNN using the gradient descent method. The performance of an ESN is highly dependent on its parameters and pattern of connectivity of the hidden-hidden In spite of the realization of numerous tests, we can not weights Besides, the network setting can be computational affirm that L -Boost with ESNs performs better than a single 2 expensive, in particular to compute the spectral radius of the well-initializedESN(accordingtheresultspresentedintheRC hidden-hidden weight matrix. literature).However,themainadvantageoftheL -Boostwith 2 Inthisarticle,weinvestigatedboostingideaswithESNs,in weak ESNs is that the procedure is automatic and does not order to built a robust new learning tool. In particular, we require the computational effort of computing the spectra of studied the utilization of L -Boost with random initialized the hidden-hidden weight matrix. Additionally, the procedure 2 ESNs. We merge a set of weak single ESNs. We call weak has a control parameter for the overfitting phenomena. ESNs because they are random initialized, and we do not use In a future work we will test the model using another extra computational effort for tuning the initial hidden-hidden technique for decrease the generalization error, as well as on weights. a more number of benchmark problems. Additionally, we can REFERENCES Laser test data 0.3 Baseline ESN 0.28 Boost: M=4 [1] R. E. Shapire, “The strength of weak learnability,” Machine Learning, Boost: M=6 vol.5,no.2,1990. 0.26 Boost: M=8 [2] Y. Freund, “Boosting a weak learning algorithm by majority,” AT&T 0.24 BellLaboratories,NewJersey,USA,Tech.Rep.,January1995. [3] Y. Freund and R. E. Shapire, “Experiments with a new Boosting 0.22 Algorithm,”MachineLearning:ProceedingsofThirteenthInternational MSE 0.2 Conference,1996. N [4] J.Friedman,T.Hastie,andR.Tibshirani,“AdditiveLogisticRegression: 0.18 aStatisticalViewofBoosting,”TheAnnalsofStatistics,vol.28,no.2, pp.337–407,2000. 0.16 [5] P. Bühlmann and B. Yu, “Boosting with the L2 loss: Regression and 0.14 Classification,”JournaloftheAmericanStatisticalAssociation,vol.98, 2003. 0.12 [6] W. Maass, “Liquid State Machines: Motivation, Theory, and Applica- 0.1 tions,” in In Computability in Context: Computation and Logic in the 6 7 8 9 10 11 Reservoir size RealWorld. ImperialCollegePress,2010,pp.275–296. [7] H.Jaeger,“The“echostate”approachtoanalysingandtrainingrecurrent neural networks,” German National Research Center for Information Figure 4: Test set accuracy of the Santa Fe Laser data. The Technology,Tech.Rep.148,2001. vertical axis of the graph shows the NMSE accuracy, and the [8] J. Schmidhuber, D. Wierstra, M. Gagliolo, and F. Gomez, “Training horizontal axis presents some values of the reservoir size. We Recurrent Networks by Evolino,” Neural Networks, vol. 19, pp. 757– 779,2007. compare the accuracy of L -Boost: M = 4,6 and 8 with a 2 [9] M. Lukos˘evic˘ius and H. Jaeger, “Reservoir Computing Approaches to baseline approach averaging 30 ESNs. Recurrent Neural Network Training,” Computer Science Review, pp. 127–149,2009. [10] M.Lukoševicˇius,“APracticalGuidetoApplyingEchoStateNetworks,” inNeuralNetworks:TricksoftheTrade,ser.LectureNotesinComputer Freedman test data Science, G. Montavon, G. Orr, and K.-R. Müller, Eds. Springer 0.24 Berlin Heidelberg, 2012, vol. 7700, pp. 659–686. [Online]. Available: Baseline ESN http://dx.doi.org/10.1007/978-3-642-35289-8_36 0.22 Boost: M=4 Boost: M=5 [11] A. A. Ferreira, T. B. Ludermir, and R. R. B. De Aquino, “An 0.2 Boost: M=6 Approach to Reservoir Computing Design and Training,” Expert Syst. Appl.,vol.40,no.10,pp.4172–4182,Aug.2013.[Online].Available: 0.18 http://dx.doi.org/10.1016/j.eswa.2013.01.029 [12] D. Opitz and R. Maclin, “Popular Ensemble Methods: an Empirical E0.16 study,”JournalofArtificialIntelligenceResearch,vol.11,pp.169–198, S NM0.14 1999. [13] S. Basterrech, “An Empirical Study of L2-Boost with Echo State 0.12 Networks,” IEEE Intelligent Systems Design and Applications (ISDA), December2013. 0.1 [14] T. Hastie, R. Tibshirani, and J. Friedman, The elements of Statistical Learning, ser. Spring series in statistics. New York, USA: Springer- 0.08 Verlag,2001. 0.06 [15] Y.BengioandN.Chapados,“Extensionstometric-basedmodelselec- 6 7 8 9 10 11 tion,” Journal of Machine Learning Research, vol. 3, pp. 1209–1227, Reservoir size 2003. [16] J.H.Friedman,“GreedyFunctionApproximation:AGradientBoosting Figure 5: Test set performance of the Freedman data set. Machine,”AnnalsofStatistics,vol.29,pp.1189–1232,2000. Size of the reservoir versus the NMSE accuracy. We compare [17] H. Schwenk and Y. Bengio, “Boosting Neural Networks,” Neural the accuracy of L -Boost: M = 4,5 and 6 with a baseline Computation,vol.12,no.8,pp.1869–1887,Aug.2000. 2 [18] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical approach averaging 30 ESNs. Recipes in C, 2nd ed. Cambridge, UK: Cambridge University Press, 1992. [19] J.MartensandI.Sutskever,“LearningRecurrentNeuralNetworkswith Hessian-Free Optimization,” in Proceeding of the 28th International ConferenceonMachineLearning,2011,pp.1033–1040. test the approach using another supervised learning tool for [20] Y.Bengio,P.Simard,andP.Frasconi,“Learninglong-termdependencies the readout structure. withgradientdescentisdifficult,”NeuralNetworks,IEEETransactions on,vol.5,no.2,pp.157–166,1994. [21] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrentneuralnetworks,”Proceedingsofthe30thInternationalCon- ACKNOWLEDGMENT ferenceonMachineLearning,vol.28,pp.37–48,2013. [22] W. Maass, T. Natschläger, and H. Markram, “Real-time computing This work was supported within the framework of withoutstablestates:anewframeworkforaneuralcomputationbased the IT4Innovations Centre of Excellence project, reg. no. onperturbations,”NeuralComputation,pp.2531–2560,november2002. CZ.1.05/1.1.00/02.0070 supported by Operational Programme [23] J. J. Steil, “Backpropagation-Decorrelation: online recurrent learning withO(N)complexity,”InProceedingsofIJCNN’04,vol.1,2004. ’ResearchandDevelopmentforInnovations’fundedbyStruc- [24] H.Jaeger,M.Lukos˘evic˘ius,D.Popovici,andU.Siewert,“Optimization tural Funds of the European Union and state budget of the andapplicationsofEchoStateNetworkswithleaky-integratorneurons,” Czech Republic. Additionally, this article has been elaborated NeuralNetworks,no.3,pp.335–352,2007. [25] B. Schrauwen, M. Wardermann, D. Verstraeten, J. J. Steil, and intheframeworkoftheprojectNewcreativeteamsinpriorities D.Stroobandt,“ImprovingReservoirsusingIntrinsicPlasticity,”Neuro- of scientific research, reg. no. CZ.1.07/2.3.00/30.0055. computing,vol.71,pp.1159–1171,March2007. [26] S. Basterrech and G. Rubino, “Echo State Queueing Network: a new ReservoirComputinglearningtool,”IEEEConsumerComunications& NetworkingConference(CCNC’13),January2013. [27] J. B. Butcher, D. Verstraeten, B. Schrauwen, C. R. Day, and P. W. Haycock, “Reservoir Computing and Extreme Learning Machines for Non-linear Time-series Data Analysis,” Neural Networks, vol. 38, pp. 76–89,feb2013. [28] A. Rodan and P. Tin˘o, “Minimum Complexity Echo State Network,” IEEETransactionsonNeuralNetworks,pp.131–144,2011. [29] M.Lukos˘evic˘ius,“Onself-organizingreservoirsandtheirhierarchies,” JacobsUniversity,Bremen,Tech.Rep.25,2010. [30] J. Boedecker, O. Obst, N. M. Mayer, and M. Asada, “Studies on ReservoirInitializationandDynamicsShapinginEchoStateNetworks,” in Proceedings of the 17th European Symposium On Artificial Neural Networks(ESANN’09),Evere,Belgium,Apr.2009,pp.227–232. [31] S. Basterrech, C. Fyfe, and G. Rubino, “Self-Organizing Maps and Scale-Invariant Maps in Echo State Networks,” in Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on,nov.2011,pp.94–99. [32] S.Dasgupta,F.Worgotter,andP.Manoonpong,“Informationdynamics basedself-adaptivereservoirfordelaytemporalmemorytasks,”Evolving Systems,pp.1–15,2013. [33] U.Huebner,N.B.Abraham,andC.O.Weiss,“TheSantaFeTimeSeries CompetitionData,”availableat:http://goo.gl/6IKBb9,dateofaccess:12 September2013. [34] M. Hénon, “A two dimensional mapping with a strange attractor,” Commun.Math.Phys.,vol.50,pp.69–77,1976. [35] R. Hyndman, “Time series data library,” available at: http://goo.gl/PZzReR,dateofaccess:12September2013. [36] D. Verstraeten, B. Schrauwen, M. D’Haene, and D. Stroobandt, “An experimentalunificationofreservoircomputingmethods,”NeuralNet- works,no.3,pp.287–289,2007.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.