Applied Text Analysis with Python
Enabling Language-Aware Data Products with Machine Learning

Benjamin Bengfort, Rebecca Bilbro, and Tony Ojeda

Beijing Boston Farnham Sebastopol Tokyo

Applied Text Analysis with Python
by Benjamin Bengfort, Rebecca Bilbro, and Tony Ojeda

Copyright © 2018 Benjamin Bengfort, Rebecca Bilbro, Tony Ojeda. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Nicole Tache
Production Editor: Nicholas Adams
Copyeditor: Jasmine Kwityn
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

June 2018: First Edition

Revision History for the First Edition
2018-06-08: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491963043 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Applied Text Analysis with Python, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96304-3

[LSI]

Table of Contents

Preface ix

1. Language and Computation 1
    The Data Science Paradigm 2
    Language-Aware Data Products 4
    The Data Product Pipeline 5
    Language as Data 8
    A Computational Model of Language 8
    Language Features 10
    Contextual Features 13
    Structural Features 15
    Conclusion 16

2. Building a Custom Corpus 19
    What Is a Corpus? 19
    Domain-Specific Corpora 20
    The Baleen Ingestion Engine 21
    Corpus Data Management 22
    Corpus Disk Structure 24
    Corpus Readers 27
    Streaming Data Access with NLTK 28
    Reading an HTML Corpus 31
    Reading a Corpus from a Database 34
    Conclusion 36

3. Corpus Preprocessing and Wrangling 37
    Breaking Down Documents 38
    Identifying and Extracting Core Content 38
    Deconstructing Documents into Paragraphs 39
    Segmentation: Breaking Out Sentences 42
    Tokenization: Identifying Individual Tokens 43
    Part-of-Speech Tagging 44
    Intermediate Corpus Analytics 45
    Corpus Transformation 47
    Intermediate Preprocessing and Storage 48
    Reading the Processed Corpus 51
    Conclusion 53

4. Text Vectorization and Transformation Pipelines 55
    Words in Space 56
    Frequency Vectors 57
    One-Hot Encoding 59
    Term Frequency–Inverse Document Frequency 62
    Distributed Representation 65
    The Scikit-Learn API 68
    The BaseEstimator Interface 68
    Extending TransformerMixin 70
    Pipelines 74
    Pipeline Basics 75
    Grid Search for Hyperparameter Optimization 76
    Enriching Feature Extraction with Feature Unions 77
    Conclusion 79

5. Classification for Text Analysis 81
    Text Classification 82
    Identifying Classification Problems 82
    Classifier Models 84
    Building a Text Classification Application 85
    Cross-Validation 86
    Model Construction 89
    Model Evaluation 91
    Model Operationalization 94
    Conclusion 95

6. Clustering for Text Similarity 97
    Unsupervised Learning on Text 97
    Clustering by Document Similarity 99
    Distance Metrics 99
    Partitive Clustering 102
    Hierarchical Clustering 107
    Modeling Document Topics 111
    Latent Dirichlet Allocation 111
    Latent Semantic Analysis 119
    Non-Negative Matrix Factorization 121
    Conclusion 123

7. Context-Aware Text Analysis 125
    Grammar-Based Feature Extraction 126
    Context-Free Grammars 126
    Syntactic Parsers 127
    Extracting Keyphrases 128
    Extracting Entities 131
    n-Gram Feature Extraction 132
    An n-Gram-Aware CorpusReader 133
    Choosing the Right n-Gram Window 135
    Significant Collocations 136
    n-Gram Language Models 139
    Frequency and Conditional Frequency 140
    Estimating Maximum Likelihood 143
    Unknown Words: Back-off and Smoothing 145
    Language Generation 147
    Conclusion 149

8. Text Visualization 151
    Visualizing Feature Space 152
    Visual Feature Analysis 152
    Guided Feature Engineering 162
    Model Diagnostics 170
    Visualizing Clusters 170
    Visualizing Classes 172
    Diagnosing Classification Error 173
    Visual Steering 177
    Silhouette Scores and Elbow Curves 177
    Conclusion 180

9. Graph Analysis of Text 183
    Graph Computation and Analysis 185
    Creating a Graph-Based Thesaurus 185
    Analyzing Graph Structure 186
    Visual Analysis of Graphs 187
    Extracting Graphs from Text 189
    Creating a Social Graph 189
    Insights from the Social Graph 192
    Entity Resolution 200
    Entity Resolution on a Graph 201
    Blocking with Structure 202
    Fuzzy Blocking 202
    Conclusion 205

10. Chatbots 207
    Fundamentals of Conversation 208
    Dialog: A Brief Exchange 210
    Maintaining a Conversation 213
    Rules for Polite Conversation 215
    Greetings and Salutations 216
    Handling Miscommunication 220
    Entertaining Questions 222
    Dependency Parsing 223
    Constituency Parsing 225
    Question Detection 227
    From Tablespoons to Grams 229
    Learning to Help 233
    Being Neighborly 235
    Offering Recommendations 238
    Conclusion 240

11. Scaling Text Analytics with Multiprocessing and Spark 241
    Python Multiprocessing 242
    Running Tasks in Parallel 244
    Process Pools and Queues 249
    Parallel Corpus Preprocessing 251
    Cluster Computing with Spark 253
    Anatomy of a Spark Job 254
    Distributing the Corpus 255
    RDD Operations 257
    NLP with Spark 259
    Conclusion 270

12. Deep Learning and Beyond 273
    Applied Neural Networks 274
    Neural Language Models 274
    Artificial Neural Networks 275
    Deep Learning Architectures 280
    Sentiment Analysis 284
    Deep Structure Analysis 286
    The Future Is (Almost) Here 291

Glossary 293

Index 303