Feature Selection for Value Function Approximation Using Bayesian Model Selection

Tobias Jung and Peter Stone
Department of Computer Sciences, University of Texas at Austin, USA
{tjung,pstone}@cs.utexas.edu

Abstract. Feature selection in reinforcement learning (RL), i.e. choosing basis functions such that useful approximations of the unknown value function can be obtained, is one of the main challenges in scaling RL to real-world applications. Here we consider the Gaussian process based framework GPTD for approximate policy evaluation, and propose feature selection through marginal likelihood optimization of the associated hyperparameters. Our approach has two appealing benefits: (1) given just sample transitions, we can solve the policy evaluation problem fully automatically (without looking at the learning task, and, in theory, independent of the dimensionality of the state space), and (2) model selection allows us to consider more sophisticated kernels, which in turn enable us to identify relevant subspaces and eliminate irrelevant state variables such that we can achieve substantial computational savings and improved prediction performance.

1 Introduction

In this paper, we address the problem of approximating the value function under a stationary policy π for a continuous state space X ⊂ R^D,

    V^π(x) = E_{x'|x,π(x)} { R(x, π(x), x') + γ V^π(x') }    (1)

using a linear approximation of the form Ṽ(·;w) = Σ_{i=1}^m w_i φ_i(x) to represent V^π. Here x denotes the state, R the scalar reward and γ the discount factor. Given a trajectory of states x_1,...,x_n and rewards r_1,...,r_{n-1} sampled under π, the goal is to determine weights w_i (and basis functions φ_i) such that Ṽ is a good approximation of V^π. This is the fundamental problem arising in the policy iteration framework of infinite-horizon dynamic programming and reinforcement learning (RL), e.g. see [21,3]. Unfortunately, this problem is also a very difficult problem that, at present, has no completely satisfying solution. In particular, deciding which features (basis functions φ_i) to use is rather challenging and, in general, needs to be done manually: thus it is tedious, prone to errors, and, most important of all, requires considerable insight into the domain. Hence, it would be far more desirable if a learning system could automatically choose its own representation. In particular, considering efficiency, we want to adapt to the actual difficulties faced, without wasting resources: often, there are many factors that can make a particular problem easier than it initially appears to be, for example, when only a few of the inputs are relevant, or when the input data lies on a low-dimensional submanifold of the input space.
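To make the linear architecture of Eq. (1) concrete, the following minimal sketch (not from the paper) evaluates Ṽ(x;w) = Σ_i w_i φ_i(x) for hand-picked Gaussian RBF basis functions; the centers, width, and weights are placeholder values, and in practice the weights (and the basis functions themselves) would be obtained by the machinery developed below.

```python
import numpy as np

def rbf_features(x, centers, width):
    """Gaussian RBF basis functions phi_1(x), ..., phi_m(x) with fixed centers."""
    return np.exp(-0.5 * np.sum((centers - x) ** 2, axis=1) / width ** 2)

def v_tilde(x, w, centers, width):
    """Linear value approximation V~(x; w) = sum_i w_i phi_i(x)."""
    return rbf_features(x, centers, width) @ w

# toy usage on a 1-D state space with m = 5 hand-picked centers
centers = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
w = np.zeros(5)          # weights; would normally be fitted (e.g. by GPTD, Section 3)
print(v_tilde(np.array([0.3]), w, centers, width=0.25))
```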
Recent work applying nonparametric function approximation to RL, such as Gaussian processes (GP) [6,16,18,5], or equivalently, regularization networks [8], is a very promising step in this direction. Instead of having to explicitly specify individual basis functions, we only have to specify a more general kernel that depends on just a very small number of hyperparameters. The key contribution of this paper is to demonstrate that feature selection in RL from sample transitions can be automated, using any of several possible model selection methods for these hyperparameters, such as marginal likelihood optimization in a Bayesian setting, or leave-one-out (LOO) error minimization in a frequentist setting. Here, we will focus on the Bayesian setting, and adapt marginal likelihood optimization for the GP-based approximate policy evaluation method GPTD, introduced without model selection in [6]. Overall, this will have the following benefits. First, only by automatic model selection (as opposed to a grid-based search or manual tweaking of kernel parameters) will we be able to use more sophisticated kernels, which will allow us to uncover the "hidden" properties of a given problem. For example, by choosing an RBF kernel with independent lengthscales for the individual dimensions of the state space, model selection will automatically drive to zero those components that correspond to state variables irrelevant (or redundant) to the task. This will allow us to concentrate our computational efforts on the parts of the input space that really matter and will improve computational efficiency. Second, because it is generally easier to learn in "smaller" spaces, it may also benefit generalization and thus help us to reduce sample complexity.

Despite its many promises, previous work with GPs in RL rarely explores the benefits of model selection: in [18], a variant of stochastic search was used to determine hyperparameters of the covariance for GPTD, using as score function the online performance of an agent. In [16], standard GPs with marginal likelihood based model selection were employed; however, since their approach was based on fitted value iteration, the task of value function approximation was reduced to ordinary regression. The remainder of the paper is structured as follows: Sections 2 and 3 contain background information and summarize the GPTD framework. As one of the benefits of model selection is the reduction of computational complexity, Section 4 describes how GPTD can be solved for large-scale problems using the SR-approximation. Section 5 introduces model selection for GPTD and derives in detail the associated gradient computation. Finally, Section 6 illustrates our approach by providing experimental results.

2 Related work

The overall goal of learning representations and feature selection for linearly parameterized Ṽ is not new within the context of RL. Roughly, past methods can be categorized along two dimensions: how the basis functions are represented (e.g. either by parameterized and predefined basis functions such as RBF, or by nonparameterized basis functions directly derived from the data) and what quantity/target function is considered to guide their construction process (e.g.
either supervised methods that consider the Bellman error and depend on the particular reward/goal, or unsupervised graph-based methods that consider connectivity properties of the state space). Conceptually closely related to our work is the approach described in [12], which adapts the hyperparameters of RBF basis functions (both their location and lengthscales) using either gradient descent or the cross-entropy method on the Bellman error. However, because basis functions are adapted individually (and their number is chosen in advance), the method is prone to overfitting, e.g. by placing basis functions with very small width near discontinuities. The problem is compounded when only few data points are available. In contrast, using a Bayesian approach, we can automatically trade off model fit and model complexity with the number of data points, always choosing the best complexity: e.g. for small data sets we will prefer larger lengthscales (less complex), for larger data sets we can afford smaller lengthscales (more complex).

Other alternative approaches do not rely on predefined basis functions. The method in [9] is an incremental approach that uses dimensionality reduction and state aggregation to create new basis functions such that in every step the remaining Bellman error for a trajectory of states is successively reduced. A related approach is given in [14], which incrementally constructs an orthogonal basis for the Bellman error. A graph-based unsupervised approach is presented in [11], which derives basis functions from the eigenvectors of the graph Laplacian induced by the underlying MDP.

3 Background: GPs for policy evaluation

In this section we briefly summarize how GPs [17] can be used for approximate policy evaluation; here we will follow the GPTD formulation of [6].

Suppose we have observed the sequence of states x_1, x_2,...,x_n and rewards r_1,...,r_{n-1}, where x_i ∼ p(·|x_{i-1}, π(x_{i-1})) and r_i = R(x_i, π(x_i), x_{i+1}). In practice, MDPs considered in RL will often be of an episodic nature with absorbing terminal states. Therefore we have to transform the problem such that the resulting Markov chain is still ergodic: this is done by introducing a zero-reward transition from the terminal state of one episode to the start state of the next episode. In addition to the sequence of states and rewards, our training data thus also includes a sequence γ_1,...,γ_{n-1}, where γ_i = γ (the discount factor in Eq. (1)) if x_{i+1} was a non-terminal state, and γ_i = 0 if x_i was a terminal state (in which case x_{i+1} is the start state of the next episode).

Assume that the function values V(x) of the unknown value function V : X ⊂ R^D → R from Eq. (1) form a zero-mean Gaussian process with covariance function k(x,x') for x, x' ∈ X; in short, V ∼ GP(0, k(x,x')). In consequence, the function values for the n observed states, v := (V(x_1),...,V(x_n))^T, will have a Gaussian distribution

    v | X, θ ∼ N(0, K),    (2)

where X := [x_1,...,x_n] and K is the n×n covariance matrix with entries [K]_{ij} = k(x_i, x_j). Note that the covariance k(·,·) alone fully specifies the GP; here we will assume that it is a simple (positive definite) function parameterized by a number of scalar parameters collected in vector θ (see Section 5).
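The data layout just described can be summarized in a short sketch (our own illustration, not code from the paper): given the observed states, one terminal flag per state, and a kernel function k supplied by the caller, it assembles the discount sequence γ_1,...,γ_{n-1} and the prior covariance K of Eq. (2).

```python
import numpy as np

def discount_sequence(is_terminal, gamma):
    """Builds gamma_1, ..., gamma_{n-1}: the usual discount gamma, except that the
    artificial zero-reward transition leaving a terminal state x_i gets gamma_i = 0."""
    is_terminal = np.asarray(is_terminal, dtype=bool)   # one flag per state x_1, ..., x_n
    return np.where(is_terminal[:-1], 0.0, gamma)

def prior_covariance(X, k):
    """K_ij = k(x_i, x_j), the covariance of the zero-mean GP prior over values, Eq. (2)."""
    return np.array([[k(xi, xj) for xj in X] for xi in X])
```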
However, unlike in ordinary regression, in RL we cannot observe samples from the target function V directly. Instead, the values can only be observed indirectly: from Eq. (1) we have that the value of one state is recursively defined through the value of the successor state(s) and the immediate reward. To this end, Engel et al. propose the following generative model[1]:

    R(x_i, x_{i+1}) = V(x_i) − γ_i V(x_{i+1}) + η_i,    (3)

where η_i is a noise term that may depend on the inputs[2]. Plugging in the observed training data, and defining r := (r_1,...,r_{n-1})^T, we obtain

    r = H v + η,    (4)

where the (n−1)×n matrix H is given by

          [ 1   −γ_1                  ]
    H :=  [        ⋱       ⋱          ]    (5)
          [                1   −γ_{n−1} ]

and the noise η := (η_1,...,η_{n-1})^T has distribution η ∼ N(0, Σ). One first choice for the noise covariance Σ would be Σ = σ_0² I, where σ_0² is an unknown hyperparameter (see Section 5). However, this model does not capture stochastic state transitions and hence would only be applicable to deterministic MDPs. If the environment is stochastic, the noise model Σ = σ_0² H H^T is more appropriate; see [6] for more detailed explanations. For the remainder we will solely consider the latter choice, i.e. Σ = σ_0² H H^T.

[1] Note that this model is just a linearly transformed version of the standard model in GP regression, i.e. y_i = f(x_i) + ε_i.
[2] Formally, in GPTD noise is modeled by a second zero-mean GP that is independent from the value GP. See [6] for details.

Let D := {X, γ_1,...,γ_{n-1}} be an abbreviation for the training inputs. Using Eq. (4), it can be shown that the joint distribution of the observed rewards r given inputs D is again a Gaussian,

    r | D, θ ∼ N(0, Q),    (6)

where the (n−1)×(n−1) covariance matrix Q is given by

    Q = ( H K H^T + σ_0² H H^T ).    (7)

To predict the function value V(x*) at a new state x*, we consider the joint distribution of r and V(x*),

    [ r     ]                      ( [ 0 ]   [ Q              H k(x*) ] )
    [ V(x*) ] | D, x*, θ  ∼  N     ( [ 0 ] , [ [H k(x*)]^T    k*      ] )

where the n×1 vector k(x*) is given by k(x*) := (k(x*, x_1),...,k(x*, x_n))^T and the scalar k* by k* := k(x*, x*). Conditioning on r, we then obtain

    V(x*) | D, r, x*, θ ∼ N(μ(x*), σ²(x*))    (8)

where

    μ(x*)  := k(x*)^T H^T Q^{-1} r    (9)
    σ²(x*) := k* − k(x*)^T H^T Q^{-1} H k(x*).    (10)

Thus, for any given single state x*, GPTD produces the distribution p(V(x*) | D, r, x*, θ) in Eq. (8) over function values.
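The prediction equations translate almost directly into code. The following is a minimal sketch (ours, not the authors' implementation): k is any covariance function, gammas is the sequence γ_1,...,γ_{n-1} from Section 3, and a Cholesky factorization stands in for the explicit inverse of Q.

```python
import numpy as np

def gptd_posterior(X, r, gammas, k, sigma0_sq):
    """Exact GPTD policy evaluation: builds H (Eq. 5) and Q (Eq. 7), then returns
    a predictor for the posterior mean/variance of Eqs. (9)-(10)."""
    n = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])      # prior covariance, Eq. (2)
    H = np.zeros((n - 1, n))
    for i in range(n - 1):
        H[i, i], H[i, i + 1] = 1.0, -gammas[i]
    Q = H @ K @ H.T + sigma0_sq * (H @ H.T)                  # Eq. (7)
    L = np.linalg.cholesky(Q)                                # factorize instead of inverting
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))      # alpha = Q^{-1} r

    def predict(x_star):
        k_star = np.array([k(x_star, xi) for xi in X])       # k(x*)
        Hk = H @ k_star
        mean = Hk @ alpha                                    # Eq. (9)
        v = np.linalg.solve(L, Hk)
        var = k(x_star, x_star) - v @ v                      # Eq. (10)
        return mean, var

    return predict
```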
4 Computational considerations

Regarding its implementation, GPTD for policy evaluation shares the same weakness that GPs have in traditional machine learning tasks: solving Eq. (8) requires the inversion[3] of a dense (n−1)×(n−1) matrix, which, when done exactly, requires O(n³) operations and is hence infeasible for anything but small-scale problems (say, anything with n < 5000).

[3] For numerical reasons we implement this step using the Cholesky decomposition, which has the same computational complexity.

4.1 Subset of regressors

In the subset of regressors (SR) approach, initially proposed for regularization networks [15,10], one chooses a subset {x̃_i}_{i=1}^m of the data, with m ≪ n, and approximates the covariance for arbitrary x, x' by

    k̃(x, x') = k_m(x)^T K_mm^{-1} k_m(x').    (11)

Here k_m(·) denotes k_m(·) := (k(x̃_1, ·),...,k(x̃_m, ·))^T, and K_mm is the submatrix [K_mm]_{ij} = k(x̃_i, x̃_j) of K. The approximation in Eq. (11) can be motivated, for example, from the Nyström approximation [22]. Let K_nm denote the submatrix [K_nm]_{ij} = k(x_i, x̃_j) corresponding to the m columns of the data points in the subset. We then have the rank-reduced approximations K ≈ K̃ = K_nm K_mm^{-1} K_nm^T and k(x) ≈ k̃(x) = K_nm K_mm^{-1} k_m(x). Plugging these into Eq. (8), we obtain for the mean

    μ(x*) ≈ k̃(x*)^T H^T ( H K̃ H^T + σ_0² H H^T )^{-1} r
          = k_m(x*)^T ( G^T W G + σ_0² K_mm )^{-1} G^T W r,    (12)

where we have defined G := H K_nm, W := (H H^T)^{-1} and applied the SMW identity[4] to show that

    K_mm^{-1} G^T ( G K_mm^{-1} G^T + σ_0² W^{-1} )^{-1} = ( G^T W G + σ_0² K_mm )^{-1} G^T W.    (13)

[4] (A + B D^{-1} C)^{-1} B D^{-1} = A^{-1} B (D + C A^{-1} B)^{-1}.

Similarly, we obtain for the predictive variance

    σ²(x*) ≈ k̃(x*, x*) − k̃(x*)^T H^T ( H K̃ H^T + σ_0² H H^T )^{-1} H k̃(x*)
           = σ_0² k_m(x*)^T ( G^T W G + σ_0² K_mm )^{-1} k_m(x*).    (14)

This yields a huge computational saving: solving the reduced problem in Eq. (12) costs O(m²n) for initialization, requires O(m²) storage, and every prediction costs O(m) (or O(m²) if we additionally evaluate the variance). This has to be compared with the complexity of the full problem: O(n³) initialization, O(n²) storage, and O(n) prediction. Thus the computational complexity now depends only linearly on n (for constant m).

Note that the SR-approximation produces a degenerate GP. As a consequence, the predictive variance in Eq. (14) will underestimate the true variance. In particular, it will be near zero when x is far from the subset {x̃_i}_{i=1}^m (which is exactly the opposite of what we want, as the predictive variance should be high for novel inputs). The situation can be remedied by considering the projected process approximation [4,19], which results in the same expression for the mean in Eq. (12), but adds the term

    k(x*, x*) − k_m(x*)^T K_mm^{-1} k_m(x*)    (15)

to the variance in Eq. (14).

4.2 Selecting the subset (unsupervised)

Selecting the best subset is a combinatorial problem that cannot be solved efficiently. Instead, we try to find a compact subset that summarizes the relevant information by incremental forward selection. In every step of the procedure, we add to the active set the element, from the set of remaining unselected elements, that performs best with respect to a given criterion. In general, we distinguish between supervised and unsupervised approaches, i.e. those that consider the target variable we regress on, and those that do not. Here we focus on the incomplete Cholesky decomposition (ICD) as an unsupervised approach [7,1,2].

ICD aims at reducing at each step the error incurred from approximating the covariance matrix, ‖K − K̃‖_F. Note that the ICD of K is the dual equivalent of performing partial Gram–Schmidt on the Mercer-induced feature representation: in every step, we add to the active set the element whose distance from the span of the currently selected elements is largest (in feature space). The procedure is stopped when the residual of the remaining (unselected) elements falls below a given threshold, or when a given maximum number of allowed elements is exceeded. In [4,8,6] online variants thereof are considered (where, instead of repeatedly inspecting all remaining elements, only one pass over the dataset is made and every element is examined only once). In general, the number of elements selected by ICD will depend on the effective rank of K (and thus its eigenspectrum).
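A sketch of this unsupervised selection step is given below (our own formulation, not the authors' code). It implements pivoted incomplete Cholesky: at each step the point whose residual diagonal element, i.e. its squared feature-space distance to the span of the already selected points, is largest is added; selection stops once that residual drops below a threshold tol or after max_m elements.

```python
import numpy as np

def icd_select(X, k, max_m, tol=1e-6):
    """Greedy unsupervised subset selection via pivoted incomplete Cholesky.
    Returns the selected indices and the partial Cholesky factor of K."""
    n = len(X)
    diag = np.array([k(x, x) for x in X])        # initial residuals k(x_i, x_i)
    L = np.zeros((n, max_m))
    selected = []
    for j in range(max_m):
        i = int(np.argmax(diag))                 # largest residual -> next pivot
        if diag[i] < tol:
            break
        selected.append(i)
        k_col = np.array([k(x, X[i]) for x in X])                 # column i of K
        L[:, j] = (k_col - L[:, :j] @ L[i, :j]) / np.sqrt(diag[i])
        diag = np.maximum(diag - L[:, j] ** 2, 0.0)               # update residuals
    return selected, L[:, :len(selected)]
```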
5 Model selection for GPTD

The major advantage of using GP-based function approximation (in contrast to, say, neural networks or tree-based approaches) is that both 'learning' of the weight vector and specification of the architecture/hyperparameters/basis functions can be handled in a principled and essentially automated way.

5.1 Optimizing the marginal likelihood

To determine hyperparameters for GPTD, we consider the marginal likelihood of the process, i.e. the probability of generating the rewards we have observed, given the sequence of states and a particular setting of the hyperparameters θ. We then maximize this function (its logarithm) with respect to θ. From Eq. (6) we see that for GPTD we have p(r | D, θ) = N(0, Q). Thus, plugging in the definition of a multivariate Gaussian and taking the logarithm, we obtain

    L(θ) = −(1/2) log det Q − (1/2) r^T Q^{-1} r − (n/2) log 2π.    (16)

Optimizing this function with respect to θ is a nonconvex problem and we have to resort to iterative gradient-based solvers (such as scaled conjugate gradients, e.g. see [13]). To do this we need to be able to evaluate the gradient of L. The partial derivatives of L with respect to each individual hyperparameter θ_i can be obtained in closed form as

    ∂L/∂θ_i = −(1/2) tr( Q^{-1} ∂Q/∂θ_i ) + (1/2) r^T Q^{-1} (∂Q/∂θ_i) Q^{-1} r.    (17)

Note that L automatically incorporates the trade-off between model fit (training error) and model complexity and can thus be regarded as an indicator of generalization capability, i.e. how well GPTD will predict the values of states not in its training set. The first term in Eq. (16) measures the complexity of the model, and will be large for 'flexible' and small for 'rigid' models.[5] The second term measures the model fit and can be shown to be the value of the error function of a penalized least-squares problem that would (in a frequentist setting) correspond to GPTD.

[5] A property that manifests itself in the eigenvalues of K (since the log-determinant equals the sum of the log-eigenvalues). In general, flexible models are achieved by smaller bandwidths in the covariance, meaning that K's effective rank will be large and its eigenvalues will fall off more slowly. On the other hand, more rigid models are achieved by larger bandwidths, meaning that K's effective rank will be low and its eigenvalues will fall off more quickly. Note that the effective rank of K is also important for the SR-approximation (see Section 4), since the effectiveness of SR depends on building a low-rank approximation of K while spending as few resources as possible.

5.2 Choosing the covariance

A common choice for k(·,·) is a (positive definite) function parameterized by a small number of scalar parameters, such as the stationary isotropic Gaussian (or squared exponential), which is parameterized by the lengthscale (bandwidth h). In the following we will consider three variants of the form [13,17]

    k(x, x') = v_0 exp{ −(1/2) (x − x')^T Ω (x − x') } + b    (18)

where hyperparameter v_0 > 0 denotes the vertical lengthscale, b > 0 the bias, and the symmetric positive semidefinite matrix Ω is given by
– Variant 1 (isotropic): Ω = h I with hyperparameter h > 0.
– Variant 2 (axis-aligned ARD): Ω = diag(a_1,...,a_D) with hyperparameters a_1,...,a_D > 0.
– Variant 3 (factor analysis): Ω = M_k M_k^T + diag(a_1,...,a_D), where the D×k matrix M_k is given by M_k := [m_1,...,m_k], k < D, and both the entries of M_k, i.e. m_{11},...,m_{1D},...,m_{k1},...,m_{kD}, and a_1,...,a_D > 0 are adjustable hyperparameters.[6]

[6] The number of directions k is also determined from model selection: we systematically try different values of k, find the corresponding remaining hyperparameters via scg-based likelihood optimization, and compare the final scores (likelihood) of the resulting models.
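For illustration, the three variants of Eq. (18) can be written down directly; the sketch below is ours (the parameter names v0, b, h, a, M mirror the symbols above) and is not taken from the paper.

```python
import numpy as np

def k_iso(x, x2, v0, b, h):
    """Variant 1 (isotropic): Omega = h * I."""
    d = x - x2
    return v0 * np.exp(-0.5 * h * (d @ d)) + b

def k_ard(x, x2, v0, b, a):
    """Variant 2 (axis-aligned ARD): Omega = diag(a_1, ..., a_D).
    Driving a_d towards zero switches the d-th state variable off."""
    d = x - x2
    return v0 * np.exp(-0.5 * np.sum(a * d * d)) + b

def k_fa(x, x2, v0, b, M, a):
    """Variant 3 (factor analysis): Omega = M M^T + diag(a), with M of size D x k."""
    d = x - x2
    return v0 * np.exp(-0.5 * (np.sum((M.T @ d) ** 2) + np.sum(a * d * d))) + b
```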
The first variant (see Figure 1) assumes that every coordinate of the input (i.e. state vector) is equally important for predicting its value. However, in particular for high-dimensional state vectors, this might be too simple: along some dimensions it will produce too much resolution where it will be wasted, along other dimensions it will produce too little resolution where it would otherwise be needed. The second variant is more powerful and includes a different parameter for every coordinate of the state vector, thus assigning a different scale to every state variable. This covariance implements automatic relevance determination (ARD): since the individual scaling factors are automatically adapted from the data via marginal likelihood optimization, they inform us about how relevant each state variable is for predicting the value. A large value of a_i means that the i-th state variable is important and even small variations along this coordinate are relevant. A small value of a_j means that the j-th state variable is less important and only large variations along this coordinate will impact the prediction (if at all). A value close to zero means that the corresponding coordinate is irrelevant and could be left out (i.e. the value function does not rely on that particular state variable). The benefit of removing irrelevant coordinates is that the complexity of the model will decrease while the fit of the model stays the same: thus the likelihood will increase. The third variant first identifies relevant directions in the input space (linear combinations of state variables) and performs a rotation of the coordinate system (the number of relevant directions is specified in advance by k). As in the second variant, different scaling factors are then applied along the rotated axes.

Fig. 1. Three variants of the stationary squared exponential covariance. The directions/scaling factors in the third case are derived from the eigendecomposition of Ω, i.e. U S U^T = M_k M_k^T + diag(a_1,...,a_D).

5.3 Example: gradient for ARD

As an example, we will now show how the gradient ∇_θ L of Eq. (16) is calculated for the ARD covariance. Note that since all hyperparameters in this model, i.e. {v_0, b, σ_0², a_1,...,a_D}, must be positive, it is more convenient to consider the hyperparameter vector θ in log space: θ = (log v_0, log b, log σ_0², log a_1,...,log a_D). We start by establishing some useful identities: for any n×n matrix A we have

    [H A H^T]_{ij} = a_{ij} − γ_i a_{i+1,j} − γ_j a_{i,j+1} + γ_i γ_j a_{i+1,j+1}.

Furthermore, we have

    [H H^T]_{ij} = 1 + γ_i²   if i = j,
                 = −γ_i       if i = j−1 or i = j+1,
                 = 0          otherwise.

Now write K as K = v_0 C + b 1_{n,n}, where [C]_{ij} = exp( −(1/2) Σ_{d=1}^D a_d (x_d^{(i)} − x_d^{(j)})² ) and 1_{n,n} is the n×n matrix of all ones. Computing the partial derivatives of K, we then obtain

    ∂K/∂v_0 = C,    ∂K/∂b = 1_{n,n},
    [∂K/∂a_ν]_{ij} = −(1/2) v_0 c_{ij} (x_ν^{(i)} − x_ν^{(j)})²,    ν = 1,...,D.

Next, we compute the partial derivatives of Q = ( H K H^T + σ_0² H H^T ), giving for b:

    ∂Q/∂log b = b ∂Q/∂b = b H [∂K/∂b] H^T = b H 1_{n,n} H^T
    ⇒ [∂Q/∂log b]_{ij} = b (1 − γ_i − γ_j + γ_i γ_j).

For v_0 we have

    ∂Q/∂log v_0 = v_0 ∂Q/∂v_0 = v_0 H [∂K/∂v_0] H^T = v_0 H C H^T
    ⇒ [∂Q/∂log v_0]_{ij} = v_0 ( c_{ij} − γ_i c_{i+1,j} − γ_j c_{i,j+1} + γ_i γ_j c_{i+1,j+1} ).

For σ_0² we have

    ∂Q/∂log σ_0² = σ_0² ∂Q/∂σ_0² = σ_0² ∂/∂σ_0² [ σ_0² H H^T ] = σ_0² H H^T
    ⇒ [∂Q/∂log σ_0²]_{ij} = σ_0² (1 + γ_i²)   if i = j,
                           = −σ_0² γ_i         if i = j−1 or i = j+1,
                           = 0                 otherwise.

Finally, for each of the a_ν, ν = 1,...,D, we get

    ∂Q/∂log a_ν = a_ν ∂Q/∂a_ν = a_ν H [∂K/∂a_ν] H^T
    ⇒ [∂Q/∂log a_ν]_{ij} = −(1/2) a_ν v_0 ( c_{ij} d^ν_{ij} − γ_i c_{i+1,j} d^ν_{i+1,j} − γ_j c_{i,j+1} d^ν_{i,j+1} + γ_i γ_j c_{i+1,j+1} d^ν_{i+1,j+1} ),

where we have defined d^ν_{ij} := (x_ν^{(i)} − x_ν^{(j)})². Thus, with w := Q^{-1} r, we have for Eq. (17)

    tr( Q^{-1} ∂Q/∂θ_ν ) = Σ_{i=1}^{n−1} Σ_{j=1}^{n−1} [Q^{-1}]_{ij} [∂Q/∂θ_ν]_{ji}
    w^T (∂Q/∂θ_ν) w      = Σ_{i=1}^{n−1} Σ_{j=1}^{n−1} [w]_i [w]_j [∂Q/∂θ_ν]_{ij},

which can be used to calculate the partial derivatives with computational complexity O(n²) each (except for σ_0², where the matrix of derivatives is tridiagonal).
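Collecting Eqs. (16)-(18) and the log-space derivatives above, a compact, unoptimized sketch of the objective and gradient for the ARD covariance might look as follows (ours, not the authors' implementation; it returns the negative log likelihood and gradient so that it can be plugged into a standard minimizer, whereas the paper uses scaled conjugate gradients).

```python
import numpy as np

def neg_log_marginal_likelihood_and_grad(theta_log, X, r, gammas):
    """L(theta) of Eq. (16) and its gradient, Eq. (17), for the ARD covariance
    (variant 2), with theta_log = (log v0, log b, log sigma0^2, log a_1..a_D)."""
    X = np.asarray(X)
    n, D = X.shape
    v0, b, s0 = np.exp(theta_log[:3])
    a = np.exp(theta_log[3:])

    diff = X[:, None, :] - X[None, :, :]                 # (n, n, D) pairwise differences
    sq = diff ** 2
    C = np.exp(-0.5 * np.tensordot(sq, a, axes=([2], [0])))
    K = v0 * C + b                                       # K = v0 C + b 1_{n,n}

    H = np.zeros((n - 1, n))                             # Eq. (5)
    H[np.arange(n - 1), np.arange(n - 1)] = 1.0
    H[np.arange(n - 1), np.arange(1, n)] = -np.asarray(gammas)

    Q = H @ K @ H.T + s0 * (H @ H.T)                     # Eq. (7)
    Qinv = np.linalg.inv(Q)                              # O(n^3); fine for a sketch
    w = Qinv @ r
    _, logdet = np.linalg.slogdet(Q)
    L = -0.5 * logdet - 0.5 * (r @ w) - 0.5 * n * np.log(2 * np.pi)   # Eq. (16)

    # dQ / d(log theta) for each hyperparameter, in the order of theta_log
    dQs = [v0 * (H @ C @ H.T),                           # log v0
           b * (H @ np.ones((n, n)) @ H.T),              # log b
           s0 * (H @ H.T)]                               # log sigma0^2
    for nu in range(D):                                  # log a_nu
        dK = -0.5 * a[nu] * v0 * C * sq[:, :, nu]
        dQs.append(H @ dK @ H.T)

    grad = np.array([-0.5 * np.trace(Qinv @ dQ) + 0.5 * (w @ dQ @ w) for dQ in dQs])
    return -L, -grad                                     # negated for a minimizer
```

For example, scipy.optimize.minimize with jac=True and method='L-BFGS-B' could be run on this function, exponentiating the optimized θ afterwards to recover v_0, b, σ_0², and the ARD weights a_ν.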
6 Experiments

This section demonstrates that our proposed model selection can be used to solve the approximate policy evaluation problem in a completely automated way, without any manual tweaking of hyperparameters. We will also show some of the additional benefits of model selection, which are improved accuracy and reduced complexity: because we automatically set the hyperparameters, we can use more sophisticated covariance functions (see Section 5.2) that depend on a
