Politecnico di Milano
Facoltà di Ingegneria
Scuola di Ingegneria Industriale e dell’Informazione
Dipartimento di Elettronica, Informazione e Bioingegneria
Master of Science in
Computer Science and Engineering
Transfer Learning for Actor-Critic Methods in Lipschitz Markov Decision Processes
Supervisor:
Prof. Marcello Restelli
Assistant Supervisors:
Dott. Matteo Pirotta
Ing. Andrea Tirinzoni
Master Graduation Thesis by:
Daniel Felipe Vacca Manrique
Student Id n. 852802
Academic Year 2016-2017
To Mis Increíbles
ACKNOWLEDGMENTS
I thank Prof. Marcello Restelli for giving me this first opportunity to get in touch with the world of research, and for his constant guidance and support throughout this journey, without which I would not have achieved this result. I thank Dott. Matteo Pirotta who, with his valuable experience, also contributed significantly to the success of this work, despite the physical distance that separated us. I also want to thank Andrea Tirinzoni who, although present only in the final months, was always willing to lend me a hand when I needed it. Thanks again to all three of you for this experience of professional growth.
I also thank Mis Increíbles, whose support has never wavered since I started this crazy dream of coming to Italy 5 years ago; more than the financial support, it was the moral encouragement they gave me throughout this time that carried me to the end. Thanks to the rest of my family and my friends in Colombia, who always believed I would reach this goal.
Thanks, finally, to everyone who, during these two years in Milan, took part in what has been the best experience of my life.
CONTENTS
Abstract
Estratto
1 introduction
   1.1 Motivation
   1.2 Goal
   1.3 Contribution
   1.4 Outline
2 reinforcement learning
   2.1 Theoretical framework: Markov Decision Processes
      2.1.1 The agent: Policies and Markov Reward Processes
      2.1.2 The goal: Cumulative rewards
      2.1.3 Value functions
      2.1.4 Bellman operators and Bellman equations
   2.2 Brief taxonomy of Reinforcement Learning algorithms
      2.2.1 Model requirements: Model-based vs. Model-free
      2.2.2 Policy-based sampling strategy: On-policy vs. Off-policy
      2.2.3 Solution strategy: Policy-based vs. Value-based
      2.2.4 Sample usage: Online vs. Offline
   2.3 Policy gradient
      2.3.1 Finite differences
      2.3.2 Trajectory-based policy gradient
      2.3.3 State-action-based policy gradient
      2.3.4 Natural gradient
   2.4 Policy evaluation
      2.4.1 Monte Carlo estimation
      2.4.2 Temporal Difference estimation
      2.4.3 Policy evaluation with function approximators
         2.4.3.1 The objective functions
         2.4.3.2 Optimization mechanisms
         2.4.3.3 Least Squares Temporal Difference
   2.5 The actor-critic approach
   2.6 Lipschitz Markov Decision Processes
3 transfer learning
   3.1 Transfer Learning concepts for Reinforcement Learning
      3.1.1 Transferable knowledge and a Transfer Learning-Reinforcement Learning taxonomy
      3.1.2 Performance measures for Transfer Learning-Reinforcement Learning algorithms
   3.2 Transfer Learning algorithms in Reinforcement Learning
4 transfer learning approaches for actor-critic algorithms
   4.1 The setting: Lipschitz continuous task environments
   4.2 The problem
   4.3 The actor-critic implementation
      4.3.1 The critic
      4.3.2 The actor
   4.4 Transfer with Importance Sampling
      4.4.1 The critic
      4.4.2 The actor
   4.5 Transfer with an optimistic approach
      4.5.1 The critic
      4.5.2 The actor
   4.6 Transfer with a pessimistic approach
      4.6.1 The critic
      4.6.2 The actor
5 experiments
   5.1 Task environment: Mountain Car
   5.2 Experimental instances
   5.3 Analysis of the results
6 conclusions and future work
bibliography
a importance sampling
   a.1 Mathematical formulation and properties
   a.2 Importance Sampling in Reinforcement Learning
b kantorovich distance and local information
c lipschitz continuity
   c.1 Lipschitz continuity of the tuples distribution
   c.2 Lipschitz continuity of the matrices
   c.3 Lipschitz continuity of the policy performance
   c.4 Lipschitz continuity of the performance gradient
d other derivations
   d.1 Local Lipschitz continuity and Kantorovich Lipschitz continuity
   d.2 On the objective functions
   d.3 On the proximity of the optimal parameters
LIST OF FIGURES
Figure 1.1 General agent-environment model
Figure 2.1 Agent-environment model in Reinforcement Learning
Figure 2.2 Geometrical relation between the MSBE and MSPBE
Figure 2.3 Actor-critic architecture
Figure 3.1 Transfer Learning framework
Figure 3.2 Transfer Learning metrics
Figure 3.3 Transfer Learning cost scenarios
Figure 5.1 The Mountain Car task
Figure 5.2 NoTransfer learning curve
Figure 5.3 Learning curves for the IS experiments
Figure 5.4 Learning curves for the Min experiments
Figure 5.5 Learning curves for the MinMax experiments
Figure 5.6 Effective sample size for transfer from the optimal policy
Figure 5.7 Effective sample size for transfer from the worst policy
Figure B.1 Kantorovich counterexample
LIST OF TABLES
Table 2.1 Model-free and Model-based algorithms
Table 2.2 On-policy and Off-policy algorithms
Table 2.3 Policy-based and Value-based algorithms
Table 2.4 Temporal difference algorithms
Table 5.1 List of experiments
LIST OF ALGORITHMS
Figure 4.1 Actor-critic algorithm in the no-transfer scenario
Figure 4.2 LSTD in the no-transfer scenario
Figure 4.3 Gradient estimation in the no-transfer scenario
Figure 4.4 LSTD in the Importance Sampling scenario
Figure 4.5 Gradient estimation in the Importance Sampling scenario
Figure 4.6 Actor-critic algorithm in the Importance Sampling scenario
ACRONYMS
RL Reinforcement Learning
MDP Markov Decision Process
POMDP Partially Observable Markov Decision Process
MRP Markov Reward Process
MC Monte Carlo
FIM Fisher Information Matrix
TD Temporal Difference
MSE Mean Squared Error
MSBE Mean Squared Bellman Error
MSTDE Mean Squared Temporal Difference Error
LSTD Least Squares Temporal Difference
MSPBE Mean Squared Projected Bellman Error
NEU Norm of Expected TD Update
OPE Operator Error
FPE Fixed-Point Error
SGD Stochastic Gradient Descent
SVD Singular Value Decomposition
IS Importance Sampling
ESS Effective Sample Size
TL Transfer Learning
PLC Pointwise Lipschitz Continuous