Natural Language Annotation for
Machine Learning
James Pustejovsky and Amber Stubbs
Natural Language Annotation for Machine Learning
by James Pustejovsky and Amber Stubbs
Copyright © 2013 James Pustejovsky and Amber Stubbs. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Julie Steele and Meghan Blanchette
Production Editor: Kristen Borg
Copyeditor: Audrey Doyle
Proofreader: Linley Dolby
Indexer: WordCo Indexing Services
Cover Designer: Randy Comer
Interior Designer: David Futato
Illustrator: Rebecca Demarest
October 2012: First Edition
Revision History for the First Edition:
2012-10-10: First release
2013-02-22: Second release
2013-07-12: Third release
See http://oreilly.com/catalog/errata.csp?isbn=9781449306663 for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly
Media, Inc. Natural Language Annotation for Machine Learning, the image of a cockatiel, and related trade
dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a
trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and authors assume
no responsibility for errors or omissions, or for damages resulting from the use of the information contained
herein.
ISBN: 978-1-449-30666-3
[LSI]
Table of Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1. The Basics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
The Importance of Language Annotation 1
The Layers of Linguistic Description 3
What Is Natural Language Processing? 4
A Brief History of Corpus Linguistics 5
What Is a Corpus? 8
Early Use of Corpora 10
Corpora Today 13
Kinds of Annotation 14
Language Data and Machine Learning 20
Classification 22
Clustering 22
Structured Pattern Induction 22
The Annotation Development Cycle 23
Model the Phenomenon 24
Annotate with the Specification 27
Train and Test the Algorithms over the Corpus 29
Evaluate the Results 30
Revise the Model and Algorithms 31
Summary 31
2. Defining Your Goal and Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Defining Your Goal 33
The Statement of Purpose 34
Refining Your Goal: Informativity Versus Correctness 35
Background Research 40
Language Resources 41
Organizations and Conferences 42
NLP Challenges 42
Assembling Your Dataset 43
The Ideal Corpus: Representative and Balanced 44
Collecting Data from the Internet 45
Eliciting Data from People 46
The Size of Your Corpus 47
Existing Corpora 48
Distributions Within Corpora 49
Summary 51
3. Corpus Analytics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Basic Probability for Corpus Analytics 54
Joint Probability Distributions 55
Bayes Rule 57
Counting Occurrences 58
Zipf’s Law 61
N-grams 61
Language Models 63
Summary 65
4. Building Your Model and Specification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Some Example Models and Specs 68
Film Genre Classification 70
Adding Named Entities 71
Semantic Roles 72
Adopting (or Not Adopting) Existing Models 75
Creating Your Own Model and Specification: Generality Versus Specificity 76
Using Existing Models and Specifications 78
Using Models Without Specifications 79
Different Kinds of Standards 80
ISO Standards 80
Community-Driven Standards 83
Other Standards Affecting Annotation 83
Summary 84
5. Applying and Adopting Annotation Standards. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Metadata Annotation: Document Classification 88
Unique Labels: Movie Reviews 88
Multiple Labels: Film Genres 90
Text Extent Annotation: Named Entities 94
Inline Annotation 94
Stand-off Annotation by Tokens 96
Stand-off Annotation by Character Location 99
Linked Extent Annotation: Semantic Roles 101
ISO Standards and You 102
Summary 103
6. Annotation and Adjudication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
The Infrastructure of an Annotation Project 105
Specification Versus Guidelines 107
Be Prepared to Revise 109
Preparing Your Data for Annotation 109
Metadata 110
Preprocessed Data 110
Splitting Up the Files for Annotation 111
Writing the Annotation Guidelines 111
Example 1: Single Labels—Movie Reviews 113
Example 2: Multiple Labels—Film Genres 115
Example 3: Extent Annotations—Named Entities 118
Example 4: Link Tags—Semantic Roles 120
Annotators 121
Choosing an Annotation Environment 124
Evaluating the Annotations 126
Cohen’s Kappa (κ) 127
Fleiss’s Kappa (κ) 128
Interpreting Kappa Coefficients 131
Calculating κ in Other Contexts 132
Creating the Gold Standard (Adjudication) 135
Summary 136
7. Training: Machine Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
What Is Learning? 140
Defining Our Learning Task 142
Classifier Algorithms 144
Decision Tree Learning 145
Gender Identification 147
Naïve Bayes Learning 151
Maximum Entropy Classifiers 157
Other Classifiers to Know About 158
Sequence Induction Algorithms 160
Clustering and Unsupervised Learning 162
Semi-Supervised Learning 163
Matching Annotation to Algorithms 165
Summary 166
8. Testing and Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Testing Your Algorithm 170
Evaluating Your Algorithm 170
Confusion Matrices 171
Calculating Evaluation Scores 172
Interpreting Evaluation Scores 177
Problems That Can Affect Evaluation 178
Dataset Is Too Small 178
Algorithm Fits the Development Data Too Well 179
Too Much Information in the Annotation 180
Final Testing Scores 181
Summary 181
9. Revising and Reporting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Revising Your Project 184
Corpus Distributions and Content 184
Model and Specification 185
Annotation 186
Training and Testing 187
Reporting About Your Work 187
About Your Corpus 189
About Your Model and Specifications 190
About Your Annotation Task and Annotators 190
About Your ML Algorithm 191
About Your Revisions 192
Summary 192
10. Annotation: TimeML. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
The Goal of TimeML 196
Related Research 197
Building the Corpus 199
Model: Preliminary Specifications 199
Times 200
Signals 200
Events 200
Links 201
Annotation: First Attempts 201
Model: The TimeML Specification Used in TimeBank 202
Time Expressions 202
Events 203
Signals 204
Links 204
Confidence 206
Annotation: The Creation of TimeBank 206
TimeML Becomes ISO-TimeML 209
Modeling the Future: Directions for TimeML 211
Narrative Containers 211
Expanding TimeML to Other Domains 212
Event Structures 214
Summary 215
11. Automatic Annotation: Generating TimeML. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
The TARSQI Components 218
GUTime: Temporal Marker Identification 219
EVITA: Event Recognition and Classification 220
GUTenLINK 221
Slinket 222
SputLink 223
Machine Learning in the TARSQI Components 224
Improvements to the TTK 224
Structural Changes 225
Improvements to Temporal Entity Recognition: BTime 225
Temporal Relation Identification 226
Temporal Relation Validation 227
Temporal Relation Visualization 227
TimeML Challenges: TempEval-2 228
TempEval-2: System Summaries 229
Overview of Results 232
Future of the TTK 232
New Input Formats 232
Narrative Containers/Narrative Times 233
Medical Documents 234
Cross-Document Analysis 235
Summary 236
12. Afterword: The Future of Annotation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Crowdsourcing Annotation 237
Amazon’s Mechanical Turk 238
Games with a Purpose (GWAP) 239
User-Generated Content 240
Handling Big Data 241
Boosting 241
Active Learning 242
Semi-Supervised Learning 243
NLP Online and in the Cloud 244
Distributed Computing 244
Shared Language Resources 245
Shared Language Applications 245
And Finally... 246
A. List of Available Corpora and Specifications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
B. List of Software Resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
C. MAE User Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
D. MAI User Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
E. Bibliography. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315