Applied Text Analysis with Python
Enabling Language-Aware Data Products with
Machine Learning
Benjamin Bengfort, Rebecca Bilbro, and Tony Ojeda
Beijing Boston Farnham Sebastopol Tokyo
Applied Text Analysis with Python
by Benjamin Bengfort, Rebecca Bilbro, and Tony Ojeda
Copyright © 2018 Benjamin Bengfort, Rebecca Bilbro, Tony Ojeda. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional
sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: Nicholas Adams
Copyeditor: Jasmine Kwityn
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2018: First Edition
Revision History for the First Edition
2018-06-08: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491963043 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Applied Text Analysis with Python, the
cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher’s views.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-491-96304-3
[LSI]
Table of Contents
Preface ix
1. Language and Computation 1
The Data Science Paradigm 2
Language-Aware Data Products 4
The Data Product Pipeline 5
Language as Data 8
A Computational Model of Language 8
Language Features 10
Contextual Features 13
Structural Features 15
Conclusion 16
2. Building a Custom Corpus 19
What Is a Corpus? 19
Domain-Specific Corpora 20
The Baleen Ingestion Engine 21
Corpus Data Management 22
Corpus Disk Structure 24
Corpus Readers 27
Streaming Data Access with NLTK 28
Reading an HTML Corpus 31
Reading a Corpus from a Database 34
Conclusion 36
3. Corpus Preprocessing and Wrangling 37
Breaking Down Documents 38
Identifying and Extracting Core Content 38
Deconstructing Documents into Paragraphs 39
Segmentation: Breaking Out Sentences 42
Tokenization: Identifying Individual Tokens 43
Part-of-Speech Tagging 44
Intermediate Corpus Analytics 45
Corpus Transformation 47
Intermediate Preprocessing and Storage 48
Reading the Processed Corpus 51
Conclusion 53
4. Text Vectorization and Transformation Pipelines 55
Words in Space 56
Frequency Vectors 57
One-Hot Encoding 59
Term Frequency–Inverse Document Frequency 62
Distributed Representation 65
The Scikit-Learn API 68
The BaseEstimator Interface 68
Extending TransformerMixin 70
Pipelines 74
Pipeline Basics 75
Grid Search for Hyperparameter Optimization 76
Enriching Feature Extraction with Feature Unions 77
Conclusion 79
5. Classification for Text Analysis 81
Text Classification 82
Identifying Classification Problems 82
Classifier Models 84
Building a Text Classification Application 85
Cross-Validation 86
Model Construction 89
Model Evaluation 91
Model Operationalization 94
Conclusion 95
6. Clustering for Text Similarity 97
Unsupervised Learning on Text 97
Clustering by Document Similarity 99
Distance Metrics 99
Partitive Clustering 102
Hierarchical Clustering 107
Modeling Document Topics 111
Latent Dirichlet Allocation 111
Latent Semantic Analysis 119
Non-Negative Matrix Factorization 121
Conclusion 123
7. Context-Aware Text Analysis 125
Grammar-Based Feature Extraction 126
Context-Free Grammars 126
Syntactic Parsers 127
Extracting Keyphrases 128
Extracting Entities 131
n-Gram Feature Extraction 132
An n-Gram-Aware CorpusReader 133
Choosing the Right n-Gram Window 135
Significant Collocations 136
n-Gram Language Models 139
Frequency and Conditional Frequency 140
Estimating Maximum Likelihood 143
Unknown Words: Back-off and Smoothing 145
Language Generation 147
Conclusion 149
8. Text Visualization 151
Visualizing Feature Space 152
Visual Feature Analysis 152
Guided Feature Engineering 162
Model Diagnostics 170
Visualizing Clusters 170
Visualizing Classes 172
Diagnosing Classification Error 173
Visual Steering 177
Silhouette Scores and Elbow Curves 177
Conclusion 180
9. Graph Analysis of Text 183
Graph Computation and Analysis 185
Creating a Graph-Based Thesaurus 185
Analyzing Graph Structure 186
Visual Analysis of Graphs 187
Extracting Graphs from Text 189
Creating a Social Graph 189
Insights from the Social Graph 192
Entity Resolution 200
Entity Resolution on a Graph 201
Blocking with Structure 202
Fuzzy Blocking 202
Conclusion 205
10. Chatbots 207
Fundamentals of Conversation 208
Dialog: A Brief Exchange 210
Maintaining a Conversation 213
Rules for Polite Conversation 215
Greetings and Salutations 216
Handling Miscommunication 220
Entertaining Questions 222
Dependency Parsing 223
Constituency Parsing 225
Question Detection 227
From Tablespoons to Grams 229
Learning to Help 233
Being Neighborly 235
Offering Recommendations 238
Conclusion 240
11. Scaling Text Analytics with Multiprocessing and Spark 241
Python Multiprocessing 242
Running Tasks in Parallel 244
Process Pools and Queues 249
Parallel Corpus Preprocessing 251
Cluster Computing with Spark 253
Anatomy of a Spark Job 254
Distributing the Corpus 255
RDD Operations 257
NLP with Spark 259
Conclusion 270
12. Deep Learning and Beyond 273
Applied Neural Networks 274
Neural Language Models 274
Artificial Neural Networks 275
Deep Learning Architectures 280
Sentiment Analysis 284
Deep Structure Analysis 286
The Future Is (Almost) Here 291
Glossary 293
Index 303