Studies in Computational Intelligence 783

Lyndon White · Roberto Togneri · Wei Liu · Mohammed Bennamoun

Neural Representations of Natural Language

Studies in Computational Intelligence, Volume 783

Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: [email protected]

The series "Studies in Computational Intelligence" (SCI) publishes new developments and advances in the various areas of computational intelligence, quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output.

More information about this series at http://www.springer.com/series/7092

Lyndon White
Department of Electrical, Electronic and Computer Engineering, School of Engineering, Faculty of Engineering and Mathematical Sciences, The University of Western Australia, Perth, WA, Australia

Roberto Togneri
Department of Electrical, Electronic and Computer Engineering, School of Engineering, Faculty of Engineering and Mathematical Sciences, The University of Western Australia, Perth, WA, Australia

Wei Liu
Department of Computer Science and Software Engineering, School of Physics, Mathematics and Computing, Faculty of Engineering and Mathematical Sciences, The University of Western Australia, Perth, WA, Australia

Mohammed Bennamoun
Department of Computer Science and Software Engineering, School of Physics, Mathematics and Computing, Faculty of Engineering and Mathematical Sciences, The University of Western Australia, Perth, WA, Australia

ISSN 1860-949X    ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-981-13-0061-5    ISBN 978-981-13-0062-2 (eBook)
https://doi.org/10.1007/978-981-13-0062-2

Library of Congress Control Number: 2018938363

© Springer Nature Singapore Pte Ltd. 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

    We do, therefore, offer you our practical judgements wherever we can. As you gain experience, you will form your own opinion of how reliable our advice is. Be assured that it is not perfect!
    Press, Teukolsky, Vetterling and Flannery, Numerical Recipes, 3rd ed., 2007

Some might wonder why one would be concerned with finding good representations for natural language. To answer that simply, all problem-solving is significantly simplified by finding a good representation, whether that is reducing a real-world problem into a system of inequalities to be solved by a constrained optimizer or using domain-specific jargon to communicate with experts in the area. There is a truly immense amount of existing information, designed for consumption by humans. Much of it is in text form. People prefer to create and consume information in natural language (such as prose) format, rather than in some more computationally accessible format (such as a directed graph). For example, doctors prefer to dictate clinical notes over filling in forms for a database.

One of the core attractions of a deep learning system is that it functions by learning increasingly abstract representations of inputs. These abstract representations include useful features for the task. Importantly, they also include features applicable even for more distantly related tasks.

It is the general assumption of this book that it is not being read in isolation; that the reader is not bereft of all other sources of knowledge. We assume not only that the reader can access the primary sources for the papers we cite, but also that they are able to discover and access other suitable materials, in order to go deeper on related areas. There exists a wealth of blog posts, video tutorials, encyclopaedic articles, etc., on machine learning and on the mathematics involved.

    A wealth of other materials
    Wikipedia articles on ML and NLP tend to be of reasonable quality, Coursera offers several courses on the topics, Cross Validated Stack Exchange has thousands of Q&As, and the documentation for many machine learning libraries often contains high-quality tutorials. Finally, preprints of the majority of the papers in the fields are available on arXiv.

In general, we do assume the reader is familiar with matrix multiplication. In general, we will define networks in terms of matrices (rather than sums), as this is more concise and better reflects real-world applications, in terms of the code that a programmer would write. We also assume a basic knowledge of probability, and an even more basic knowledge of English linguistics. Though they are not the intended audience, very little of the book should be beyond someone with a high-school education.
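A minimal sketch of the matrix convention just described (this example is not from the book; the NumPy usage, layer sizes, and variable names are illustrative assumptions): it computes one fully connected layer first as explicit per-neuron weighted sums and then as a single matrix-vector product, the form the equations in this book favour.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # input vector with 4 features
W = rng.normal(size=(3, 4))   # weights: 3 output units, 4 inputs each
b = rng.normal(size=3)        # one bias per output unit

# Sum form: write out the weighted sum for each output unit explicitly.
h_sums = np.array([sum(W[j, i] * x[i] for i in range(4)) + b[j] for j in range(3)])

# Matrix form: the same layer as one matrix-vector product.
h_matrix = W @ x + b

assert np.allclose(h_sums, h_matrix)
print(np.tanh(h_matrix))      # e.g. follow with a tanh activation

Both forms give identical outputs; the matrix form is simply more concise and maps directly onto the optimised linear-algebra routines a programmer would call.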
The core focus of this book is to summarize the techniques that have been developed over the past 1–2 decades. In doing so, we detail the works of several dozen papers (and cite well over a hundred). Significant effort has gone into describing these works clearly and consistently. In doing so, the notation used does not normally line up with the original notations used in those papers; however, we ensure it is always mathematically equivalent. For some techniques, the explanation is short and simple. For other, more challenging ideas, our explanation may be much longer than the original very concise formulation that is common in some papers.

For brevity, we have had to limit discussion of some aspects of natural language. In particular, we have neglected all discussion of the notion that a word may be made up of multiple tokens, for example, "made up of". Phrases do receive some discussion in Chap. 6. Works such as the exploration by Yin and Schütze (2014) deserve more attention than we have space to give them.

Similarly, we do not have space to cover character-based models, mainly character RNNs, and other deep character models such as that of Zhang and LeCun (2015). These models are relevant both as sources from which word representations can be derived (Bojanowski et al. 2017) and, more generally, as the basis of end-to-end systems. Using a purely character-based approach forces the model to learn tokenizing, parsing, and any other feature extraction internally. We focus instead on models that work within larger systems that accomplish such preprocessing.

Finally, we omit all discussion of attention models in recurrent neural networks (Bahdanau et al. 2014). The principles of their application remain similar to those of normal recurrent neural networks in the applications discussed in this book. The additional attention features allow several improvements, making such models the state of the art for many sequential tasks. We trust that a reader with experience of such models will see how they may be applied to the uses of RNNs discussed here.

This book makes extensive use of text in grey boxes. These boxes provide reminders of definitions and comments on potentially unfamiliar notation, as well as highlighting non-technical aspects (such as societal concerns) and, conversely, overly technical aspects (such as implementation details) of the works discussed. They are also used to give the titles of citations (to save flicking to the large list of references at the back of the book) and for the captions of the figures, and more generally to provide non-essential but potentially helpful information without interrupting the flow of the main text. One could read the whole book without ever reading any of the text in grey boxes, as they are not required to understand the main text. However, one would be missing out on a portion of some very interesting content.

Perth, Australia

Lyndon White
Roberto Togneri
Wei Liu
Mohammed Bennamoun

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In CoRR abs/1409.0473. arXiv: 1409.0473.
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, vol. 5, 135–146.
Yin, Wenpeng and Hinrich Schütze. 2014. An exploration of embeddings for generalized phrases. In ACL 2014, p. 41.
Zhang, Xiang and Yann LeCun. 2015. Text understanding from scratch. In CoRR abs/1502.01710. arXiv: 1502.01710.

Contents

1 Introduction to Neural Networks for Machine Learning
  1.1 Neural Networks
  1.2 Function Approximation
  1.3 Network Hyper-Parameters
    1.3.1 Width and Depth
    1.3.2 Activation Functions
  1.4 Training
    1.4.1 Gradient Descent and Back-Propagation
  1.5 Some Examples of Common Neural Network Architectures
    1.5.1 Classifier
    1.5.2 Bottlenecking Autoencoder
  References

2 Recurrent Neural Networks for Sequential Processing
  2.1 Recurrent Neural Networks
  2.2 General RNN Structures
    2.2.1 Matched-Sequence
    2.2.2 Encoder
    2.2.3 Decoder
    2.2.4 Encoder-Decoder
  2.3 Inside the Recurrent Unit
    2.3.1 Basic Recurrent Unit
    2.3.2 Gated Recurrent Unit
    2.3.3 LSTM Recurrent Unit
  2.4 Further Variants
    2.4.1 Deep Variants
    2.4.2 Bidirectional RNNs
    2.4.3 Other RNNs
  References

3 Word Representations
  3.1 Representations for Language Modeling
    3.1.1 The Neural Probabilistic Language Model
    3.1.2 RNN Language Models
  3.2 Acausal Language Modeling
    3.2.1 Continuous Bag of Words
    3.2.2 Skip-Gram
    3.2.3 Analogy Tasks
  3.3 Co-location Factorisation
    3.3.1 GloVe
    3.3.2 Further Equivalence of Co-location Prediction to Factorisation
    3.3.3 Conclusion
  3.4 Hierarchical Softmax and Negative Sampling
    3.4.1 Hierarchical Softmax
    3.4.2 Negative Sampling
  3.5 Natural Language Applications – Beyond Language Modeling
    3.5.1 Using Word Embeddings as Features
  3.6 Aligning Vector Spaces Across Languages
  References

4 Word Sense Representations
  4.1 Word Senses
    4.1.1 Word Sense Disambiguation
  4.2 Word Sense Representation
    4.2.1 Directly Supervised Method
    4.2.2 Word Embedding-Based Disambiguation Method
  4.3 Word Sense Induction (WSI)
    4.3.1 Context Clustering-Based Approaches
    4.3.2 Co-location Prediction-Based Approaches
  4.4 Conclusion
  References

5 Sentence Representations and Beyond
  5.1 Unordered and Weakly Ordered Representations
    5.1.1 Sums of Word Embeddings
    5.1.2 Paragraph Vector Models (Defunct)
  5.2 Sequential Models
    5.2.1 VAE and Encoder-Decoder
    5.2.2 Skip-Thought
