FEATURE-DRIVEN QUESTION ANSWERING WITH NATURAL LANGUAGE ALIGNMENT

by
Xuchen Yao

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, Maryland
July, 2014

To My Mother's Father

Abstract

Question Answering (QA) is the task of automatically generating answers to natural language questions posed by humans, and it is one of the primary research areas in natural language human-computer interaction. This dissertation focuses on English fact-seeking (factoid) QA, for instance: when was Johns Hopkins founded?¹

The key challenge in QA is the generation and recognition of indicative signals for answer patterns. In this dissertation I propose the idea of feature-driven QA, a machine learning framework that automatically produces rich features from linguistic annotations of answer fragments and encodes them in compact log-linear models. These features are further enhanced by tightly coupling the question and answer snippets via monolingual alignment. In this work monolingual alignment helps question answering in two ways: aligning semantically similar words in QA sentence pairs (with the ability to recognize paraphrases and entailment) and aligning natural language words with knowledge base relations (via web-scale data mining). With the help of modern search engines, databases and machine learning tools, the proposed method is able to efficiently search through billions of facts in the web space and optimize over millions of linguistic signals in the feature space.

QA is often modeled as a pipeline of the form: question (input) → information retrieval ("search") → answer extraction (from either text or a knowledge base) → answer (output). This dissertation demonstrates the feature-driven approach applied throughout the QA pipeline: the search front end with structured information retrieval, and the answer extraction back end over both an unstructured data source (free text) and a structured data source (knowledge base). Error propagation in natural language processing (NLP) pipelines is contained and minimized. The final system achieves state-of-the-art performance in several NLP tasks, including answer sentence ranking and answer extraction on one QA dataset and monolingual alignment on two annotated datasets, and is competitive with the state of the art in answering web queries using Freebase. This dissertation shows the capability of a feature-driven framework serving as the statistical backbone of modern question answering systems.

¹ January 22, 1876

Primary Advisor: Benjamin Van Durme
Secondary Advisor: Chris Callison-Burch
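A note on notation: the log-linear models referred to above take the standard conditional form. The sketch below is generic and is not the dissertation's exact parameterization; the actual feature functions and candidate answer sets are defined in later chapters. Given a question x, a candidate answer y, feature functions f, and a weight vector θ:

    p_θ(y | x) = exp(θ · f(x, y)) / Σ_{y′ ∈ Y(x)} exp(θ · f(x, y′))

where Y(x) is the set of candidate answers for x. Training typically sets θ to maximize the regularized conditional likelihood of annotated question-answer pairs; the feature functions f are where the rich linguistic annotations enter.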
Acknowledgments

To my dissertation committee as a whole, Benjamin Van Durme, Chris Callison-Burch, David Yarowsky and Dekang Lin: thank you for your time and advice.

I am very grateful to the following people:

Benjamin Van Durme, my primary advisor: Ben admitted me to Hopkins and completely changed my life. He wrote me thousands of emails during the course of my study, met with me every week, advised me, helped me, encouraged me and never once blamed me for my wrongdoing. He was always there whenever I needed help, and he gave me a lot of freedom. Ben is a great person with extraordinary leadership, integrity, fairness, and management skills. I have learned more than enough from him. Thank you, Ben.

Chris Callison-Burch, my secondary advisor: Chris is extremely kind and generous with his students. He read the whole dissertation word by word, front to back, and marked every page with detailed comments. Chris has taught me things beyond research: networking, entrepreneurship, and artistic thinking. Thank you, Chris.

Peter Clark, who was my mentor when I interned at Vulcan (his group is now part of the Allen Institute for Artificial Intelligence). Pete is the one who inspired me to write a dissertation on question answering. His group also funded two and a half years of my PhD study. He is such a gentleman with an encouraging and supportive heart. Thank you, Pete.

Dekang Lin, who was my mentor when I interned at Google Research on their question answering project. Dekang reshaped how I approach problem solving and how I find a balanced point between research and industry. He will have a profound impact on how I work in the future, just as his research has influenced the entire community. Thank you, Dekang.

Jason Eisner, for whose Natural Language Processing class I was the teaching assistant at Hopkins for two years. Jason helped me reach a deep understanding of log-linear models, which are the statistical backbone of this entire dissertation. He is a great person, an intellectual giant, and he treats everyone equally. Thank you, Jason.

David Yarowsky, for whose Information Retrieval class I was the teaching assistant at Hopkins. David's focus on research novelty and originality heavily influenced this dissertation. He also set a good example by showing that finishing a PhD (with high quality) in less than four years is possible. Thank you, David.

Professors and researchers who taught me, mentored me or helped me in graduate school: John Wierman, Mark Dredze, Kyle Rawlins, David Chiang, Liang Huang, Adam Lopez, Matt Post, Sanjeev Khudanpur, Paul McNamee, Phil Harrison, Shane Bergsma, Veselin Stoyanov, and others. Thank you.

Colleagues and friends at the Center for Language and Speech Processing and JHU: Adam Teichert, Adithya Renduchintala, Andong Zhan, Ann Irvine, Aric Velbel, Brian Kjersten, Byung Gyu Ahn, Carl Pupa, Cathy Thornton, Chandler May, Chunxi Liu, Courtney Napoles, Da Zheng, Darcey Riley, Debbie Deford, Delip Rao, Ehsan Variani, Feipeng Li, Frank Ferraro, Hainan Xu, Hong Sun, Jason Smith, Jonathan Weese, Juri Ganitkevitch, Katharine Henry, Keisuke Sakaguchi, Matt Gormley, Michael Carlin, Michael Paul, Nanyun Peng, Naomi Saphra, Nicholas Andrews, Olivia Buzek, Omar Zaidan, Pegah Ghahremani, Peter Schulam, Pushpendre Rastogi, Rachel Rudinger, Ruth Scally, Ryan Cotterell, Samuel Thomas, Scott Novotney, Sixiang Chen, Svitlana Volkova, Tim Vieira, Travis Wolfe, Vijayaditya Peddinti, Xiaohui Zhang, and Yiming Wang. Thank you.

My very best Chinese friends at Hopkins: Cao Yuan, Chen Guoguo, Huang Shuai, Sun Ming, and Xu Puyang. Together we went through so much in grad school. Thank you.

Finally, thank you to my family. I wouldn't have been myself without your support.

Contents

Abstract
Acknowledgments
List of Acronyms
List of Tables
List of Figures

1. Introduction
   1.1. Motivation
   1.2. Main Idea: Feature-driven Question Answering
      1.2.1. QA on Text
      1.2.2. QA on KB
      1.2.3. With Hard Alignment on Text
      1.2.4. With Soft Alignment on KB
   1.3. Contributions
   1.4. How to Read this Dissertation
   1.5. Related Publications

2. 50 Years of Question Answering
   2.1. Overview
   2.2. Conferences and Evaluation
      2.2.1. TREC (Text REtrieval Conference) QA Track
      2.2.2. QA@CLEF (Cross Language Evaluation Forum)
      2.2.3. Evaluation Methods
         2.2.3.1. Precision, Recall, Accuracy, Fβ for IR/QA
         2.2.3.2. MAP, MRR for IR
         2.2.3.3. Precision-Recall Curve for IR/QA: Drawn Very Differently
         2.2.3.4. Micro F1 vs. Macro F1 vs. Averaged F1 for QA
         2.2.3.5. Permutation Test
   2.3. Significant Approaches
      2.3.1. IR QA: Document and Passage Retrieval
      2.3.2. NLP QA: Answer Extraction
         2.3.2.1. Terminology for Question Analysis
         2.3.2.2. Template Matching
         2.3.2.3. Answer Typing and Question Classification
         2.3.2.4. Web Redundancy
         2.3.2.5. Tree/Graph Matching
      2.3.3. IR4QA: Structured Retrieval
      2.3.4. KB QA: Database Queries
         2.3.4.1. Early Years: Baseball, Lunar and 15+ More
         2.3.4.2. Statistical Semantic Parsing
      2.3.5. Hybrid QA (IR+NLP+KB)
         2.3.5.1. IBM Watson
   2.4. A Different View: Linguistic Features vs. Machine Learning
      2.4.1. Linguistics: Word, POS, NER, Syntax, Semantics and Logic
      2.4.2. Learning: Ad-hoc, Small Scale and Large Scale
      2.4.3. Appendix: Publications Per Grid

3. Feature-driven QA from Unstructured Data: Text
   3.1. Introduction
   3.2. Tree Edit Distance Model
      3.2.1. Cost Design and Edit Search
      3.2.2. TED for Sentence Ranking
      3.2.3. QA Sentence Ranking Experiment
   3.3. Answer Extraction as Sequence Tagging
      3.3.1. Sequence Model
      3.3.2. Feature Design
      3.3.3. Overproduce-and-vote
      3.3.4. Experiments
         3.3.4.1. QA Results
         3.3.4.2. Ablation Test
      3.3.5. Summary
   3.4. Structured Information Retrieval for QA
      3.4.1. Motivation
      3.4.2. Background
      3.4.3. Method
      3.4.4. Experiments
         3.4.4.1. Data
         3.4.4.2. Document Retrieval
         3.4.4.3. Passage Retrieval
         3.4.4.4. Answer Extraction