Online Learning in Adversarial Markov Decision Processes: Motivation and State of the Art
Csaba Szepesvári, Department of Computing Science, University of Alberta
Based on joint work with Gergely Neu, András György, András Antos, Travis Dick
Liège, July 31, 2013

Contents
- Why should we care about online MDPs?
- Loop-free stochastic shortest path problems
- Online linear optimization (MD to MD2)
- MDPs with loops
- Conclusions

Online MDPs

The MDP Model
[Figure: agent-environment loop; the agent sends action a_t to the Markovian environment, which returns state x_t and reward r_t.]
Reward: r_t = r(x_t, a_t)
Goal: maximize the cumulative reward
  E[ \sum_{t=1}^T r(x_t, a_t) ].

The MDP Model with Adversarial Reward Functions
[Figure: as above, but an adversarial environment additionally emits y_t, which modulates the reward.]
Reward: r_t(x, a) = r(x, a, y_t)
Goal: minimize the regret
  R_T = \max_\pi E[ \sum_{t=1}^T r_t(x_t^\pi, a_t^\pi) ] - E[ \sum_{t=1}^T r_t(x_t, a_t) ].
(A small numerical sketch of this regret definition appears at the end of this section.)

The MDP Model: why adversarial rewards?
- The world is too large.
- Part of the state is controlled, with well-understood dynamics.
- Part of the state is uncontrolled, with complicated dynamics and unobserved state variables.
- In many applications only the reward is influenced by the uncontrolled component.
- Examples: paging in computers, the k-server problem, stochastic routing, inventory problems, ...
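To make the regret definition concrete, here is a minimal Python sketch that is not from the talk: it builds a small hypothetical MDP with a randomly drawn but fixed and known transition kernel, draws an arbitrary sequence of reward functions r_t to play the role of the adversary, runs a fixed placeholder policy in place of an actual online learning algorithm, and estimates R_T by comparing against the best stationary deterministic policy in hindsight, found by brute force. Single rollouts stand in for the expectations in the definition, so the estimate is noisy; all names and constants below are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Hypothetical toy setup: |X| states, |A| actions, horizon T.
rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 200

# Known Markovian dynamics: P[x, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# "Adversarial" reward functions r_t(x, a); here just an arbitrary fixed sequence.
rewards = rng.uniform(0.0, 1.0, size=(T, n_states, n_actions))


def rollout(policy, rewards, P, x0=0, seed=1):
    """Total reward collected by a stationary policy (state -> action) over the sequence."""
    local_rng = np.random.default_rng(seed)
    x, total = x0, 0.0
    for t in range(len(rewards)):
        a = policy[x]
        total += rewards[t, x, a]
        x = local_rng.choice(n_states, p=P[x, a])
    return total


# Placeholder learner: a fixed random policy standing in for a real online algorithm.
learner_policy = rng.integers(n_actions, size=n_states)
learner_return = rollout(learner_policy, rewards, P)

# Best stationary deterministic policy in hindsight (brute force over |A|^|X| policies).
best_return = max(
    rollout(pi, rewards, P) for pi in product(range(n_actions), repeat=n_states)
)

print("estimated regret R_T ~", best_return - learner_return)
```

A real online MDP algorithm would update its policy from round to round using the observed rewards; the point of the sketch is only the comparison against the best fixed policy in hindsight that defines R_T.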