Reinforcement Learning for Trading Dialogue Agents in Non-Cooperative Negotiations

Doctoral Dissertation by Ioannis Efstathiou
Submitted for the Degree of Doctor of Philosophy in Computer Science
Interaction Lab, School of Mathematical and Computer Sciences
Heriot-Watt University, 2016

The copyright in this thesis is owned by the author. Any quotation from the thesis or use of any of the information contained in it must acknowledge this thesis as the source of the quotation or information.

"You can discover more about a person in an hour of play than in a year of conversation."
PLATO, 429 - 345 B.C.
LINGARD, 1598 - 1670
ANONYMOUS

Declaration

I, Ioannis Efstathiou, hereby declare that the composition of this thesis submitted for examination has been made by myself and that the words are personally expressed. Any exceptions, including works taken from other authors, are all stated in my text and are also included in my references list. Furthermore, this document has not been submitted for any other qualification or degree.

Abstract

Recent advances in automating Dialogue Management have mainly been made in cooperative environments, where the dialogue system tries to help a human meet their goals. In non-cooperative environments, however, such as competitive trading, there is still much work to be done. The complexity of such environments is increased by the fact that there is usually imperfect information about the interlocutors' goals and states. This thesis shows that non-cooperative dialogue agents are capable of learning how to negotiate successfully in a variety of trading-game settings using Reinforcement Learning, and results are presented from testing the trained dialogue policies with humans. The agents learned when and how to manipulate using dialogue, how to judge the decisions of their rivals, how much information they should expose, and how to effectively map their adversaries' needs in order to predict and exploit their actions.

Initially the environment was a two-player trading game ("Taikun"). The agent learned how to use explicit linguistic manipulation, even with risks of exposure (detection) where severe penalties apply. A more complex opponent model for adversaries was also implemented, where we modelled all trading dialogue moves as implicitly manipulating the adversary's opponent model, and we worked in a more complex game ("Catan"). In that multi-agent environment we show that agents can learn to be legitimately persuasive or deceitful. Agents which learned how to manipulate opponents using dialogue are more successful than ones which do not manipulate. We also demonstrate that trading dialogues are more successful when the learning agent builds an estimate of the adversary's hidden goals and preferences. Furthermore, the thesis shows that policies trained in bilateral negotiations can be very effective in multilateral ones (i.e. the four-player version of Catan). The findings suggest that it is possible to train non-cooperative dialogue agents which successfully trade using linguistic manipulation. Such non-cooperative agents may have important future applications, for example in automated debating, police investigation, games, and education.

Acknowledgements

Initially I would like to thank my first Supervisor, Professor Oliver Lemon, who offered me the chance to work in the intriguing areas of Reinforcement Learning and Pragmatics.
He motivated me to study the strategic negotiations that can be learned by intelligent agents in Natural Language Dialogue systems, through the application of various interesting Reinforcement Learning algorithms and techniques. Without his expertise, patience, valuable guidance and experienced suggestions this document would lack content and aim. Each of our meetings was a unique experience which not only enhanced my current knowledge but significantly widened my perspective on academic thinking.

My second Supervisor, Professor David Corne, who as my second mentor now and my teacher in the past on several courses during my MSc in Artificial Intelligence, elegantly guided me through various fascinating paths of this area and kept reinforcing my passion towards it through his ingenuity and broad experience.

Heriot-Watt University and its Library department, for providing the required knowledge and the flexible means, respectively, to effectively seek resources and produce this work. The postgraduate students Wenshuo, Aimilios and Nida, who showed interest in our research by extending parts of our work towards their own unique goals. Furthermore, everyone involved in the Interaction Lab, and especially in the STAC project, for their suggestions in our meetings and valuable advice. Last but not least, my family, partner and friends, for their continuous support and deep understanding of my effort.

Contents

1 Introduction
  1.1 Motivation
    1.1.1 Example dialogues
  1.2 Robotic deception ethics
  1.3 Research Questions
  1.4 Requirements
  1.5 Contributions
  1.6 Publications
  1.7 Thesis outline

2 Background
  2.1 Reinforcement Learning
    2.1.1 Exploration and Exploitation
    2.1.2 Markov Decision Processes
    2.1.3 Policy Value Function
    2.1.4 Q-Learning
    2.1.5 Temporal-Difference Learning
    2.1.6 SARSA(0) and SARSA(λ)
  2.2 Non-Stationary MDPs
  2.3 Game Theory
    2.3.1 Pure and Mixed Strategy
    2.3.2 Cooperative and Non-cooperative games
    2.3.3 Nash Equilibrium
    2.3.4 Pareto Optimality
    2.3.5 Dominant Strategy
    2.3.6 Perfect and Imperfect Information Games
  2.4 RL in Dialogue Systems
    2.4.1 Reinforcement Learning in Non-Cooperative Games
    2.4.2 Reinforcement Learning in Negotiation Dialogue Management
  2.5 CP-NETs
    2.5.1 CP-NETs in Dialogue Acts
  2.6 Pragmatics
    2.6.1 Gricean Maxims and Implicature
    2.6.2 Gricean Maxims and Non-cooperative Dialogues
  2.7 Conclusion

3 Initial model: Taikun
  3.1 A simple game
    3.1.1 Game's characteristics
    3.1.2 Actions (Trading Proposals)
    3.1.3 Additional actions (Deception - Scalar Implicatures)
    3.1.4 The Learning Agent
    3.1.5 The Adversaries
    3.1.6 History log of the played games
  3.2 Algorithms
    3.2.1 Similarities between the LA's first (custom SARSA(0)) and second algorithm (SARSA(λ))
    3.2.2 Differences between the learning agent's first (custom SARSA(0)) and second algorithm (SARSA(λ))
    3.2.3 More details on the first algorithm's implementation (custom SARSA(0))
    3.2.4 Details and parameters of the second algorithm's implementation (SARSA(λ))
    3.2.5 Advantages and disadvantages of the two algorithms / Results
    3.2.6 Q-Learning and Value Iteration not suitable for Taikun
  3.3 Experiments background
    3.3.1 Adversary's strategy in Experiment 1 / Baseline strategy
    3.3.2 Adversary's strategy in Experiment 2 / Manipulated strategy
    3.3.3 Why is the adversary's manipulated behaviour based on sound reasoning?
    3.3.4 Restrictive adversaries
    3.3.5 Exposing (detective) adversaries
    3.3.6 Hidden Mode MDP triggered by manipulative actions
    3.3.7 Hybrid strategy
  3.4 Conclusion

4 Taikun: Manipulation
  4.1 Strict adversary
    4.1.1 Changing the exploration rate
  4.2 Manipulation
  4.3 Hybrid strategy
  4.4 Significance
  4.5 Summary

5 Manipulation detection
  5.1 Strict adversary
  5.2 Manipulation
  5.3 Restriction
  5.4 Exposure
    5.4.1 Refusal of trading
    5.4.2 Instant win
  5.5 Significance
  5.6 Parameters
  5.7 One manipulation
    5.7.1 Dual-mind cognition
    5.7.2 When to manipulate?
    5.7.3 Results
    5.7.4 Conclusion
  5.8 Deception detection
    5.8.1 Detection cases
    5.8.2 The adversaries and the LA
    5.8.3 Results
  5.9 Conclusion

6 Taikun and humans
  6.1 Human vs. Agent
    6.1.1 Game questionnaire
    6.1.2 Overall questionnaire
    6.1.3 Questionnaires' results and discussion
    6.1.4 Human comments on the manipulative agent
    6.1.5 Human comments on the non-manipulative agent (goal-oriented only)
    6.1.6 Discussion and conclusion
  6.2 Human vs. Human
    6.2.1 Questionnaire
    6.2.2 What did people say during trading?
    6.2.3 Conclusions on human game-play in Taikun

7 Main model: Catan
  7.1 RLAs and upgraded SARSA(λ)
  7.2 Design
    7.2.1 Actions (Trading Proposals)
    7.2.2 The RL Agents (RLA)
    7.2.3 Reward function
    7.2.4 Training parameters