FEATURE-DRIVEN QUESTION ANSWERING WITH NATURAL LANGUAGE ALIGNMENT

by Xuchen Yao

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, Maryland
July, 2014

To My Mother’s Father

Abstract

Question Answering (QA) is the task of automatically generating answers to natural language questions from humans, serving as one of the primary research areas in natural language human-computer interaction. This dissertation focuses on English fact-seeking (factoid) QA, for instance: when was Johns Hopkins founded?[1]

The key challenge in QA is the generation and recognition of indicative signals for answer patterns. In this dissertation I propose the idea of feature-driven QA, a machine learning framework that automatically produces rich features from linguistic annotations of answer fragments and encodes them in compact log-linear models. These features are further enhanced by tightly coupling the question and answer snippets via monolingual alignment. In this work monolingual alignment helps question answering in two aspects: aligning semantically similar words in QA sentence pairs (with the ability to recognize paraphrases and entailment) and aligning natural language words with knowledge base relations (via web-scale data mining). With the help of modern search engines, databases and machine learning tools, the proposed method is able to efficiently search through billions of facts in the web space and optimize over millions of linguistic signals in the feature space.

QA is often modeled as a pipeline of the form: question (input) → information retrieval (“search”) → answer extraction (from either text or knowledge base) → answer (output).
This dissertation demonstrates the feature-driven approach applied throughout the QA pipeline: the search front end with structured information retrieval, and the answer extraction back end drawing from both unstructured data sources (free text) and structured data sources (knowledge bases). Error propagation in natural language processing (NLP) pipelines is contained and minimized. The final system achieves state-of-the-art performance on several NLP tasks, including answer sentence ranking and answer extraction on one QA dataset, monolingual alignment on two annotated datasets, and question answering from Freebase with web queries. This dissertation shows the capability of a feature-driven framework to serve as the statistical backbone of modern question answering systems.

[1] January 22, 1876

Primary Advisor: Benjamin Van Durme
Secondary Advisor: Chris Callison-Burch

Acknowledgments

To my dissertation committee as a whole: Benjamin Van Durme, Chris Callison-Burch, David Yarowsky and Dekang Lin. Thank you for your time and advice. I am very grateful to the following people:

Benjamin Van Durme, my primary advisor: Ben admitted me to Hopkins and completely changed my life. He wrote me thousands of emails during the course of my study, met with me every week, advised me, helped me, encouraged me and never once blamed me for my wrongdoings. He was always there whenever I needed help, and he gave me a lot of freedom. Ben is a great person with extraordinary leadership, integrity, fairness, and management skills. I have learned more than enough from him. Thank you, Ben.

Chris Callison-Burch, my secondary advisor: Chris is extremely kind and generous with his students. He read the whole dissertation word by word, front to back, and marked every page with detailed comments. Chris has taught me things beyond research: networking, entrepreneurship, and artistic thinking. Thank you, Chris.
Peter Clark, who was my mentor when I interned at Vulcan (his group is now part of the Allen Institute for Artificial Intelligence). Pete is the one who inspired me to do a dissertation on question answering. His group also funded two and a half years of my PhD study. He is such a gentleman with an encouraging and supportive heart. Thank you, Pete.

Dekang Lin, who was my mentor when I interned at Google Research on their question answering project. Dekang reshaped how I approach problem solving and how I find the balance between research and industry. He will have a profound impact on how I work in the future, just as his research has influenced the entire community. Thank you, Dekang.

Jason Eisner, for whose Natural Language Processing class I was the teaching assistant at Hopkins for two years. Jason helped me reach a deep understanding of log-linear models, which are the statistical backbone of the entire dissertation. He is a great person, an intellectual giant, and he treats everyone equally. Thank you, Jason.

David Yarowsky, for whose Information Retrieval class I was the teaching assistant at Hopkins. David’s focus on research novelty and originality heavily influenced this dissertation. He also set a good example by showing that finishing a PhD (with high quality) in less than four years was possible. Thank you, David.

Professors and researchers who taught me, mentored me or helped me in graduate school: John Wierman, Mark Dredze, Kyle Rawlins, David Chiang, Liang Huang, Adam Lopez, Matt Post, Sanjeev Khudanpur, Paul McNamee, Phil Harrison, Shane Bergsma, Veselin Stoyanov, and others. Thank you.
Colleagues and friends at the Center for Language and Speech Processing and JHU: Adam Teichert, Adithya Renduchintala, Andong Zhan, Ann Irvine, Aric Velbel, Brian Kjersten, Byung Gyu Ahn, Carl Pupa, Cathy Thornton, Chandler May, Chunxi Liu, Courtney Napoles, Da Zheng, Darcey Riley, Debbie Deford, Delip Rao, Ehsan Variani, Feipeng Li, Frank Ferraro, Hainan Xu, Hong Sun, Jason Smith, Jonathan Weese, Juri Ganitkevitch, Katharine Henry, Keisuke Sakaguchi, Matt Gormley, Michael Carlin, Michael Paul, Nanyun Peng, Naomi Saphra, Nicholas Andrews, Olivia Buzek, Omar Zaidan, Pegah Ghahremani, Peter Schulam, Pushpendre Rastogi, Rachel Rudinger, Ruth Scally, Ryan Cotterell, Samuel Thomas, Scott Novotney, Sixiang Chen, Svitlana Volkova, Tim Vieira, Travis Wolfe, Vijayaditya Peddinti, Xiaohui Zhang, and Yiming Wang. Thank you.

My very best Chinese friends at Hopkins: Cao Yuan, Chen Guoguo, Huang Shuai, Sun Ming, and Xu Puyang. All together we went through so much in grad school. Thank you.

Finally, thank you to my family. I wouldn’t have been myself without your support.

Contents

Abstract
Acknowledgments
List of Tables
List of Figures

1. Introduction
   1.1. Motivation
   1.2. Main Idea: Feature-driven Question Answering
        1.2.1. QA on Text
        1.2.2. QA on KB
        1.2.3. With Hard Alignment on Text
        1.2.4. With Soft Alignment on KB
   1.3. Contributions
   1.4. How to Read this Dissertation
   1.5. Related Publications

2. 50 Years of Question Answering
   2.1. Overview
   2.2. Conferences and Evaluation
        2.2.1. TREC (Text REtrieval Conference) QA Track
        2.2.2. QA@CLEF (Cross Language Evaluation Forum)
        2.2.3. Evaluation Methods
               2.2.3.1. Precision, Recall, Accuracy, Fβ for IR/QA
               2.2.3.2. MAP, MRR for IR
               2.2.3.3. Precision-Recall Curve for IR/QA: Drawn Very Differently
               2.2.3.4. Micro F1 vs. Macro F1 vs. Averaged F1 for QA
               2.2.3.5. Permutation Test
   2.3. Significant Approaches
        2.3.1. IR QA: Document and Passage Retrieval
        2.3.2. NLP QA: Answer Extraction
               2.3.2.1. Terminology for Question Analysis
               2.3.2.2. Template Matching
               2.3.2.3. Answer Typing and Question Classification
               2.3.2.4. Web Redundancy
               2.3.2.5. Tree/Graph Matching
        2.3.3. IR4QA: Structured Retrieval
        2.3.4. KB QA: Database Queries
               2.3.4.1. Early Years: Baseball, Lunar and 15+ More
               2.3.4.2. Statistical Semantic Parsing
        2.3.5. Hybrid QA (IR+NLP+KB)
               2.3.5.1. IBM Watson
   2.4. A Different View: Linguistic Features vs. Machine Learning
        2.4.1. Linguistics: Word, POS, NER, Syntax, Semantics and Logic
        2.4.2. Learning: Ad-hoc, Small Scale and Large Scale
        2.4.3. Appendix: Publications Per Grid

3. Feature-driven QA from Unstructured Data: Text
   3.1. Introduction
   3.2. Tree Edit Distance Model
        3.2.1. Cost Design and Edit Search
        3.2.2. TED for Sentence Ranking
        3.2.3. QA Sentence Ranking Experiment
   3.3. Answer Extraction as Sequence Tagging
        3.3.1. Sequence Model
        3.3.2. Feature Design
        3.3.3. Overproduce-and-vote
        3.3.4. Experiments
               3.3.4.1. QA Results
               3.3.4.2. Ablation Test
        3.3.5. Summary
   3.4. Structured Information Retrieval for QA
        3.4.1. Motivation
        3.4.2. Background
        3.4.3. Method
        3.4.4. Experiments
               3.4.4.1. Data
               3.4.4.2. Document Retrieval
               3.4.4.3. Passage Retrieval
               3.4.4.4. Answer Extraction
        3.4.5. Summary
   3.5. Conclusion

4. Discriminative Models for Monolingual Alignment
   4.1. Introduction
   4.2. Related Work