Online Learning in Adversarial Markov Decision Processes PDF

113 Pages·2013·0.46 MB·English

Preview Online Learning in Adversarial Markov Decision Processes

Online Learning in Adversarial Markov Decision Processes: Motivation and State of the Art
Csaba Szepesvári, Department of Computing Science, University of Alberta
Based on joint work with Gergely Neu, András György, András Antos, Travis Dick
Liège, July 31, 2013

Contents
- Why should we care about online MDPs?
- Loop-free stochastic shortest path problems
- Online linear optimization (MD to MD2)
- MDPs with loops
- Conclusions

Online MDPs

The MDP Model
[Figure: the agent-environment loop; at time t the agent observes state x_t, takes action a_t, and receives reward r_t.]
Reward: r_t = r(x_t, a_t)
Goal: maximize the cumulative reward
    E[ \sum_{t=1}^T r(x_t, a_t) ].

The MDP Model with Adversarial Reward Functions
[Figure: the same loop, extended with an uncontrolled/adversarial component y_t that influences the reward.]
Reward: r_t(x, a) = r(x, a, y_t)
Goal: minimize the regret
    R_T = \max_\pi E[ \sum_{t=1}^T r_t(x_t^\pi, a_t^\pi) ] - E[ \sum_{t=1}^T r_t(x_t, a_t) ].

The MDP Model
- The world is too large.
- Part of the state is controlled, with well-understood dynamics.
- Part of the state is uncontrolled: complicated dynamics, unobserved state variables.
- In many applications only the reward is influenced by the uncontrolled component.
Examples: paging in computers, the k-server problem, stochastic routing, inventory problems, ...
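The regret defined above compares the learner's cumulative reward against the best fixed policy in hindsight, run on the same adversarial reward sequence. The following is a minimal illustrative sketch, not from the slides: a hypothetical two-state, two-action MDP with known deterministic dynamics, random adversarial rewards, and a naive uniformly-random learner standing in for the algorithms the talk discusses.

```python
import itertools
import random

# Toy adversarial MDP (illustrative only): 2 states, 2 actions,
# deterministic known transitions; the adversary draws a fresh
# reward function r_t at every round.
random.seed(0)
STATES, ACTIONS, T = [0, 1], [0, 1], 200

def step(x, a):
    # toy dynamics: action 1 flips the state, action 0 stays
    return x ^ a

# adversarial reward sequence: r_t(x, a) = r(x, a, y_t), here just
# independent uniform draws per round
rewards = [{(x, a): random.random() for x in STATES for a in ACTIONS}
           for _ in range(T)]

def run(policy):
    """Total reward of a fixed policy (state -> action) on the sequence."""
    x, total = 0, 0.0
    for t in range(T):
        a = policy[x]
        total += rewards[t][(x, a)]
        x = step(x, a)
    return total

# naive learner: uniformly random actions
x, learner_total = 0, 0.0
for t in range(T):
    a = random.choice(ACTIONS)
    learner_total += rewards[t][(x, a)]
    x = step(x, a)

# best fixed (stationary deterministic) policy in hindsight
best = max(run(dict(zip(STATES, p)))
           for p in itertools.product(ACTIONS, repeat=len(STATES)))
regret = best - learner_total
print(f"regret R_T = {regret:.2f}")
```

The point of the comparator is that it replays the *same* reward sequence: the adversary is oblivious here, so the best-in-hindsight policy is well defined and the learner's goal is to make R_T grow sublinearly in T.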

Description:
Online Learning in Adversarial Markov Decision Processes: Motivation and State of the Art. Csaba Szepesvári, Department of Computing Science, University of Alberta.

