Natural Language Annotation for Machine Learning

by James Pustejovsky and Amber Stubbs

Copyright © 2013 James Pustejovsky and Amber Stubbs. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editors: Julie Steele and Meghan Blanchette
Production Editor: Kristen Borg
Copyeditor: Audrey Doyle
Proofreader: Linley Dolby
Indexer: WordCo Indexing Services
Cover Designer: Randy Comer
Interior Designer: David Futato
Illustrator: Rebecca Demarest

October 2012: First Edition

Revision History for the First Edition:
2012-10-10: First release
2013-02-22: Second release
2013-07-12: Third release

See http://oreilly.com/catalog/errata.csp?isbn=9781449306663 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Natural Language Annotation for Machine Learning, the image of a cockatiel, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-30666-3

Table of Contents

Preface

1. The Basics
   The Importance of Language Annotation
   The Layers of Linguistic Description
   What Is Natural Language Processing?
   A Brief History of Corpus Linguistics
   What Is a Corpus?
   Early Use of Corpora
   Corpora Today
   Kinds of Annotation
   Language Data and Machine Learning
   Classification
   Clustering
   Structured Pattern Induction
   The Annotation Development Cycle
   Model the Phenomenon
   Annotate with the Specification
   Train and Test the Algorithms over the Corpus
   Evaluate the Results
   Revise the Model and Algorithms
   Summary

2. Defining Your Goal and Dataset
   Defining Your Goal
   The Statement of Purpose
   Refining Your Goal: Informativity Versus Correctness
   Background Research
   Language Resources
   Organizations and Conferences
   NLP Challenges
   Assembling Your Dataset
   The Ideal Corpus: Representative and Balanced
   Collecting Data from the Internet
   Eliciting Data from People
   The Size of Your Corpus
   Existing Corpora
   Distributions Within Corpora
   Summary

3. Corpus Analytics
   Basic Probability for Corpus Analytics
   Joint Probability Distributions
   Bayes Rule
   Counting Occurrences
   Zipf’s Law
   N-grams
   Language Models
   Summary

4. Building Your Model and Specification
   Some Example Models and Specs
   Film Genre Classification
   Adding Named Entities
   Semantic Roles
   Adopting (or Not Adopting) Existing Models
   Creating Your Own Model and Specification: Generality Versus Specificity
   Using Existing Models and Specifications
   Using Models Without Specifications
   Different Kinds of Standards
   ISO Standards
   Community-Driven Standards
   Other Standards Affecting Annotation
   Summary

5. Applying and Adopting Annotation Standards
   Metadata Annotation: Document Classification
   Unique Labels: Movie Reviews
   Multiple Labels: Film Genres
   Text Extent Annotation: Named Entities
   Inline Annotation
   Stand-off Annotation by Tokens
   Stand-off Annotation by Character Location
   Linked Extent Annotation: Semantic Roles
   ISO Standards and You
   Summary

6. Annotation and Adjudication
   The Infrastructure of an Annotation Project
   Specification Versus Guidelines
   Be Prepared to Revise
   Preparing Your Data for Annotation
   Metadata
   Preprocessed Data
   Splitting Up the Files for Annotation
   Writing the Annotation Guidelines
   Example 1: Single Labels—Movie Reviews
   Example 2: Multiple Labels—Film Genres
   Example 3: Extent Annotations—Named Entities
   Example 4: Link Tags—Semantic Roles
   Annotators
   Choosing an Annotation Environment
   Evaluating the Annotations
   Cohen’s Kappa (κ)
   Fleiss’s Kappa (κ)
   Interpreting Kappa Coefficients
   Calculating κ in Other Contexts
   Creating the Gold Standard (Adjudication)
   Summary

7. Training: Machine Learning
   What Is Learning?
   Defining Our Learning Task
   Classifier Algorithms
   Decision Tree Learning
   Gender Identification
   Naïve Bayes Learning
   Maximum Entropy Classifiers
   Other Classifiers to Know About
   Sequence Induction Algorithms
   Clustering and Unsupervised Learning
   Semi-Supervised Learning
   Matching Annotation to Algorithms
   Summary

8. Testing and Evaluation
   Testing Your Algorithm
   Evaluating Your Algorithm
   Confusion Matrices
   Calculating Evaluation Scores
   Interpreting Evaluation Scores
   Problems That Can Affect Evaluation
   Dataset Is Too Small
   Algorithm Fits the Development Data Too Well
   Too Much Information in the Annotation
   Final Testing Scores
   Summary

9. Revising and Reporting
   Revising Your Project
   Corpus Distributions and Content
   Model and Specification
   Annotation
   Training and Testing
   Reporting About Your Work
   About Your Corpus
   About Your Model and Specifications
   About Your Annotation Task and Annotators
   About Your ML Algorithm
   About Your Revisions
   Summary

10. Annotation: TimeML
   The Goal of TimeML
   Related Research
   Building the Corpus
   Model: Preliminary Specifications
   Times
   Signals
   Events
   Links
   Annotation: First Attempts
   Model: The TimeML Specification Used in TimeBank
   Time Expressions
   Events
   Signals
   Links
   Confidence
   Annotation: The Creation of TimeBank
   TimeML Becomes ISO-TimeML
   Modeling the Future: Directions for TimeML
   Narrative Containers
   Expanding TimeML to Other Domains
   Event Structures
   Summary

11. Automatic Annotation: Generating TimeML
   The TARSQI Components
   GUTime: Temporal Marker Identification
   EVITA: Event Recognition and Classification
   GUTenLINK
   Slinket
   SputLink
   Machine Learning in the TARSQI Components
   Improvements to the TTK
   Structural Changes
   Improvements to Temporal Entity Recognition: BTime
   Temporal Relation Identification
   Temporal Relation Validation
   Temporal Relation Visualization
   TimeML Challenges: TempEval-2
   TempEval-2: System Summaries
   Overview of Results
   Future of the TTK
   New Input Formats
   Narrative Containers/Narrative Times
   Medical Documents
   Cross-Document Analysis
   Summary

12. Afterword: The Future of Annotation
   Crowdsourcing Annotation
   Amazon’s Mechanical Turk
   Games with a Purpose (GWAP)
   User-Generated Content
   Handling Big Data
   Boosting
   Active Learning
   Semi-Supervised Learning
   NLP Online and in the Cloud
   Distributed Computing
   Shared Language Resources
   Shared Language Applications
   And Finally...

A. List of Available Corpora and Specifications
B. List of Software Resources
C. MAE User Guide
D. MAI User Guide
E. Bibliography

Index