Natural Language Processing with Spark NLP
Learning to Understand Text at Scale
Alex Thomas

Beijing Boston Farnham Sebastopol Tokyo

Natural Language Processing with Spark NLP
by Alex Thomas

Copyright © 2020 Alex Thomas. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Mike Loukides
Developmental Editors: Nicole Taché, Gary O’Brien
Production Editor: Beth Kelly
Copyeditor: Piper Editorial
Proofreader: Athena Lakri
Indexer: WordCo, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

July 2020: First Edition

Revision History for the First Edition
2020-06-24: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492047766 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Natural Language Processing with Spark NLP, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-04776-6
[LSI]

Table of Contents

Preface

Part I. Basics

1. Getting Started
    Introduction
    Other Tools
    Setting Up Your Environment
    Prerequisites
    Starting Apache Spark
    Checking Out the Code
    Getting Familiar with Apache Spark
    Starting Apache Spark with Spark NLP
    Loading and Viewing Data in Apache Spark
    Hello World with Spark NLP

2. Natural Language Basics
    What Is Natural Language?
    Origins of Language
    Spoken Language Versus Written Language
    Linguistics
    Phonetics and Phonology
    Morphology
    Syntax
    Semantics
    Sociolinguistics: Dialects, Registers, and Other Varieties
    Formality
    Context
    Pragmatics
    Roman Jakobson
    How To Use Pragmatics
    Writing Systems
    Origins
    Alphabets
    Abjads
    Abugidas
    Syllabaries
    Logographs
    Encodings
    ASCII
    Unicode
    UTF-8
    Exercises: Tokenizing
    Tokenize English
    Tokenize Greek
    Tokenize Ge’ez (Amharic)
    Resources

3. NLP on Apache Spark
    Parallelism, Concurrency, Distributing Computation
    Parallelization Before Apache Hadoop
    MapReduce and Apache Hadoop
    Apache Spark
    Architecture of Apache Spark
    Physical Architecture
    Logical Architecture
    Spark SQL and Spark MLlib
    Transformers
    Estimators and Models
    Evaluators
    NLP Libraries
    Functionality Libraries
    Annotation Libraries
    NLP in Other Libraries
    Spark NLP
    Annotation Library
    Stages
    Pretrained Pipelines
    Finisher
    Exercises: Build a Topic Model
    Resources

4. Deep Learning Basics
    Gradient Descent
    Backpropagation
    Convolutional Neural Networks
    Filters
    Pooling
    Recurrent Neural Networks
    Backpropagation Through Time
    Elman Nets
    LSTMs
    Exercise 1
    Exercise 2
    Resources

Part II. Building Blocks

5. Processing Words
    Tokenization
    Vocabulary Reduction
    Stemming
    Lemmatization
    Stemming Versus Lemmatization
    Spelling Correction
    Normalization
    Bag-of-Words
    CountVectorizer
    N-Gram
    Visualizing: Word and Document Distributions
    Exercises
    Resources

6. Information Retrieval
    Inverted Indices
    Building an Inverted Index
    Vector Space Model
    Stop-Word Removal
    Inverse Document Frequency
    In Spark
    Exercises
    Resources

7. Classification and Regression
    Bag-of-Words Features
    Regular Expression Features
    Feature Selection
    Modeling
    Naïve Bayes
    Linear Models
    Decision/Regression Trees
    Deep Learning Algorithms
    Iteration
    Exercises

8. Sequence Modeling with Keras
    Sentence Segmentation
    (Hidden) Markov Models
    Section Segmentation
    Part-of-Speech Tagging
    Conditional Random Field
    Chunking and Syntactic Parsing
    Language Models
    Recurrent Neural Networks
    Exercise: Character N-Grams
    Exercise: Word Language Model
    Resources

9. Information Extraction
    Named-Entity Recognition
    Coreference Resolution
    Assertion Status Detection
    Relationship Extraction
    Summary
    Exercises

10. Topic Modeling
    K-Means
    Latent Semantic Indexing
    Nonnegative Matrix Factorization
    Latent Dirichlet Allocation
    Exercises

11. Word Embeddings
    Word2vec
    GloVe
    fastText
    Transformers
    ELMo, BERT, and XLNet
    doc2vec
    Exercises

Part III. Applications

12. Sentiment Analysis and Emotion Detection
    Problem Statement and Constraints
    Plan the Project
    Design the Solution
    Implement the Solution
    Test and Measure the Solution
    Business Metrics
    Model-Centric Metrics
    Infrastructure Metrics
    Process Metrics
    Offline Versus Online Model Measurement
    Review
    Initial Deployment
    Fallback Plans
    Next Steps
    Conclusion

13. Building Knowledge Bases
    Problem Statement and Constraints
    Plan the Project
    Design the Solution
    Implement the Solution
    Test and Measure the Solution
    Business Metrics
    Model-Centric Metrics
    Infrastructure Metrics
    Process Metrics
    Review
    Conclusion

14. Search Engine
    Problem Statement and Constraints
    Plan the Project
    Design the Solution
    Implement the Solution
    Test and Measure the Solution
    Business Metrics
    Model-Centric Metrics
    Review
    Conclusion

15. Chatbot
    Problem Statement and Constraints
    Plan the Project
    Design the Solution
    Implement the Solution
    Test and Measure the Solution
    Business Metrics
    Model-Centric Metrics
    Review
    Conclusion

16. Object Character Recognition
    Kinds of OCR Tasks
    Images of Printed Text and PDFs to Text
    Images of Handwritten Text to Text
    Images of Text in Environment to Text
    Images of Text to Target
    Note on Different Writing Systems
    Problem Statement and Constraints
    Plan the Project
    Implement the Solution
    Test and Measure the Solution
    Model-Centric Metrics
    Review
    Conclusion

Part IV. Building NLP Systems

17. Supporting Multiple Languages
    Language Typology