Deep Reinforcement Learning Algorithms for Industrial Applications

Master of Science in Mechatronic Engineering
Department of Control and Computer Engineering
Polytechnic University of Turin

Supervisor: Prof. Elio Piccolo
Author: Antonio Cappiello

Abstract

This thesis delves into recent developments in reinforcement learning methods, with a particular focus on industrial applications. The first part of the thesis reviews the general framework behind reinforcement learning theory, starting from the definition of the agent-environment interaction. The agent is the decision-making part that acts on the environment, which responds with an observation of the current state and a feedback signal, called reward or reinforcement. The agent acts like a controller and, in the same way, the environment can be seen as a plant. Moreover, the scalar reward can be seen as the feedback signal, within the interaction loop, that tells the agent how good its actions are.

Overall, the agent-environment interaction is an objective-function maximisation problem, and this allows real-world applications to be reformulated as problems that can be solved with reinforcement learning methods. The purpose of the agent is to maximise the cumulative discounted reward over the long run. If we are able to design a proper reward function for a certain task, and thus model the setting of the task properly as the environment, we can make the agent learn the desired policy. The agent is the decision-making part, while the environment can even be the body of a robot or anything that needs to be planned or controlled.

The agent-environment interaction is framed within the Markov Decision Process formalism, which allows the elements needed for learning to be treated within a mathematical model. The main elements needed to formulate the solution of a problem with reinforcement learning methods are the state space, the action space, the transition probabilities and a set of rewards for each state-action pair.

Then, the value-function concept is introduced in order to let the agent evaluate its actions and choose the best action with respect to its objective. The action-value function tells how good it is to take an action in a state, while the state-value function tells just how good it is to be in a certain state. The whole learning process consists of the agent learning a policy and iteratively improving it; this improvement is intended as a maximisation procedure of the value function.

The thesis proceeds with the exposition of some of the main methods that are used to compute the state-value function. Such methods descend from dynamic programming, more particularly from Bellman's recursive equations. The recursive relation is the basis of some of the iterative algorithms that are used to approximate the value function itself. Algorithms that solve the recursive equations using state-value functions are considered model-based, since they need the transition probabilities, i.e. a model of the environment, in order to be solved; from these, a class of model-free algorithms is derived. These algorithms make use of the action-value function, which can be evaluated without knowing the dynamics of the environment. Specifically, the SARSA and Q-learning algorithms are presented, since they form the basis of the classical model-free update rules that are used in iterative computations.
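To make the tabular representation of the action-value function and the model-free update rule concrete, the following minimal sketch implements tabular Q-learning with an epsilon-greedy policy. The environment interface (reset and step returning next state, reward and a termination flag) and the hyperparameters alpha, gamma and epsilon are assumptions made purely for illustration; they do not reflect the experimental setups used later in the thesis.

    # Illustrative sketch of tabular Q-learning on a discrete environment.
    # The env interface (reset/step) and the hyperparameters are assumptions
    # for illustration only, not the configurations used in the experiments.
    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.99, epsilon=0.1):
        # Q is the action-value table: one row per state, one column per action.
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # Epsilon-greedy exploration: mostly exploit, sometimes explore.
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(Q[s]))
                # The environment responds with next state, reward and done flag.
                s_next, r, done = env.step(a)
                # Off-policy TD update: bootstrap on the greedy action in s_next.
                target = r + gamma * np.max(Q[s_next]) * (not done)
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q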
The value functions can be described by matrices, or tables, where the rows are the states and the columns are the actions: in the high-dimensional case the tabular representation breaks down, and function approximators are needed in order to extend the application of reinforcement learning methods to high-dimensional state and action spaces, such as continuous control problems.

The application of Q-learning is extended with the introduction of function approximators, more particularly neural networks, that can deal with the high-dimensional state and action spaces that are often used to model real-world problems. The integration between neural networks and Q-learning has recently led to new algorithms that compose the class of deep reinforcement learning algorithms. These algorithms are value-based if they rely just on value-function approximation, policy-based if they use a dedicated network to approximate the policy, and actor-critic if they use approximations for both the value function and the policy.

In this thesis these new algorithms are studied from a theoretical point of view and then applied to the solution of robotic manipulation and lane-keeping for autonomous driving tasks. In particular, these tasks are solved via deep reinforcement learning in simulated physics-based environments, namely MuJoCo for robotic manipulation and TORCS for car driving simulation. Several experiments are conducted in order to achieve robust results, and the convergence properties are evaluated in different scenarios. The design of the reward functions is explained for these continuous control tasks, in order to make clear which considerations must be taken into account when tuning the reward function.

In conclusion, the characteristics, the capabilities and the development possibilities of this approach are discussed. Since reinforcement learning methods are proposed as an alternative to classical control theory methods, the characteristics of this approach are worth noting: adaptability to dynamic and unknown environments, integration with deep neural networks for end-to-end learning, and the adoption of higher levels of programming abstraction.

Future works are proposed at the end of the robotics and autonomous driving chapters; they address the virtual-to-real transfer learning paradigm, which can lay the basis for a faster way of programming robots, vehicles and general agents: agents that learn very complex tasks in a simulated environment and can then act robustly in a real-world environment on the same task. This avoids the risk that an agent learning directly in a real environment could cause dangerous situations in that environment.

Dedication

Contents

1 Introduction
  1.1 Introduction to Reinforcement Learning
    1.1.1 Comparison with SL and UL
  1.2 History
  1.3 Applications
    1.3.1 Games
    1.3.2 Industry
      Robotics and Manufacturing
      Autonomous Driving
  1.4 Purpose of the Thesis
  1.5 Contribution
2 Reinforcement Learning Concepts
  2.1 The Reinforcement Learning Problem
    2.1.1 Main Elements of the RL Problem
      Agent
      Environment
      States
      Actions
      Reward
      Policy
      Model
    2.1.2 The Agent-Environment Interaction
    2.1.3 Markov Decision Processes
      Representation of MDPs
  2.2 Value Functions and Optimal Policy
  2.3 Dynamic Programming
    2.3.1 Policy Iteration
    2.3.2 Value Iteration
    2.3.3 Generalised Policy Iteration
    2.3.4 Limits of Dynamic Programming
  2.4 Exploration and Exploitation
    2.4.1 On-Policy and Off-Policy
3 Reinforcement Learning Methods
  3.1 Mention to Monte-Carlo Methods
    3.1.1 Monte-Carlo Generalised Policy Iteration
    3.1.2 Drawbacks of Monte Carlo Methods
    3.1.3 Monte Carlo Generalised Policy Iteration
  3.2 Temporal-Difference Methods
    3.2.1 TD error and TD update
    3.2.2 SARSA algorithm (On-Policy TD Control)
    3.2.3 Q-Learning algorithm (Off-Policy TD Control)
      Mention to maximisation bias and double Q-learning
    3.2.4 Cliff walking: Difference between SARSA and Q-learning
    3.2.5 Advantages and Drawbacks
  3.3 Mention to N-step TD
4 Function Approximation
  4.1 General Overview
    Stochastic Gradient Descent
  4.2 Value-Functions Approximation
    4.2.1 Linear Approximators
      Features construction
    4.2.2 Artificial Neural Networks
    4.2.3 Learning with Function Approximation
    4.2.4 Monte Carlo with Function Approximation
    4.2.5 Temporal-Difference with Function Approximation
      Batch RL
  4.3 Policy Gradient Methods
    Policy objective function
    The Policy Gradient Theorem
    4.3.1 REINFORCE: Monte Carlo Policy Gradient
      REINFORCE with baseline
    4.3.2 Actor-Critic Approach
    4.3.3 Deterministic Policy Gradient
      Deterministic Policy Gradient Theorem
      On-Policy Deterministic Actor-Critic
5 Deep Reinforcement Learning
  5.1 Why Deep Reinforcement Learning?
  5.2 Deep Value Functions
    5.2.1 Deep Q-Network
      Experience Replay
      Freeze Target Network
      Strengths, Limitations and Drawbacks
  5.3 Deep Policies
    5.3.1 Deep Deterministic Policy Gradient (DDPG)
      Batch Normalisation
      Exploration
      Networks architecture
      Advantages and drawbacks of DDPG
  5.4 Asynchronous Methods
    5.4.1 Asynchronous Advantage Actor-Critic (A3C)
      Network Architecture
6 Reinforcement Learning in Robotics
  6.1 Reinforcement Learning and Classical Methods
    6.1.1 Motivation for Reinforcement Learning
  6.2 Related Works
    6.2.1 Model-Based Reinforcement Learning
    6.2.2 Other Relevant References
  6.3 Reward Design
  6.4 Double Inverted Pendulum
    6.4.1 Testbed
    6.4.2 Problem description
      Design of the Reward
    6.4.3 Simulation Setup
      Network architectures
      Exploration strategies
    6.4.4 Experimental Results
  6.5 Hopper
    6.5.1 Problem Description
      Design of the Reward
    6.5.2 Simulation results
  6.6 Comparison with PID Controllers
  6.7 Conclusions and Future Research Directions
7 Autonomous driving via reinforcement learning
  7.1 Introduction
  7.2 The Deep RL framework
  7.3 Lane Keeping problem
    7.3.1 PID and AI motion planners
    7.3.2 Testbed
    7.3.3 Problem description
      Design of the rewards
    7.3.4 Simulation setup
      Network Architectures
      Exploration strategies
    7.3.5 Experimental Results
  7.4 Conclusions and Future Research Directions

Chapter 1

Introduction

In this chapter we introduce a natural, non-mathematical and non-formal definition of the concept of reinforcement learning (RL). We focus on the historical context, the application fields and the development trends.

We also state the purpose of the thesis, focusing on the implementation of a reinforcement learning approach to industrial tasks, such as robotic manipulation and autonomous vehicle driving, and on the advantages of this approach in terms of safety, robustness, effectiveness, objective optimisation and cost reduction.

The organisation of the thesis is given in the following general overview, to give an idea of the structure of the work: chapters 2 and 3 describe the main elements and methods that are involved in the mathematical modelling of the reinforcement learning problem; chapter 4 is dedicated to the introduction of function approximators to deal with high-dimensional and more complex problems, while chapter 5 presents the state of the art of the deep learning techniques that are used to develop robust deep reinforcement learning algorithms and that contribute to very efficient techniques for programming robot agents; chapter 6 compares the results obtained by training deep reinforcement learning agents to solve robotic manipulation and locomotion tasks in a physics-based simulated environment, and discusses the advantages that these methods could bring as well as future research works; chapter 7 is developed like the previous chapter, and it proposes the deep reinforcement learning resolution of the autonomous driving lane-keeping problem in a simulated environment, with a rich set of conclusions about the obtained results and future research works that aim at very robust solutions built on a very powerful technique, i.e. the integration of deep learning and reinforcement learning in a unique framework.