FEATURE-DRIVEN QUESTION ANSWERING
WITH NATURAL LANGUAGE ALIGNMENT
by
Xuchen Yao
A dissertation submitted to Johns Hopkins University in conformity
with the requirements for the degree of Doctor of Philosophy
Baltimore, Maryland
July, 2014
To My Mother’s Father
Abstract

Question Answering (QA) is the task of automatically generating answers to natural language questions from humans, serving as one of the primary research areas in natural language human-computer interaction. This dissertation focuses on English fact-seeking (factoid) QA, for instance: when was Johns Hopkins founded? (answer: January 22, 1876).

The key challenge in QA is the generation and recognition of indicative signals for answer patterns. In this dissertation I propose the idea of feature-driven QA, a machine learning framework that automatically produces rich features from linguistic annotations of answer fragments and encodes them in compact log-linear models. These features are further enhanced by tightly coupling the question and answer snippets via monolingual alignment. In this work monolingual alignment helps question answering in two aspects: aligning semantically similar words in QA sentence pairs (with the ability to recognize paraphrases and entailment) and aligning natural language words with knowledge base relations (via web-scale data mining). With the help of modern search engines, databases, and machine learning tools, the proposed method is able to efficiently search through billions of facts in the web space and optimize over millions of linguistic signals in the feature space.

QA is often modeled as a pipeline of the form:

question (input) → information retrieval (“search”) → answer extraction (from either text or knowledge base) → answer (output).

This dissertation demonstrates the feature-driven approach applied throughout the QA pipeline: the search front end with structured information retrieval, and the answer extraction back end from both an unstructured data source (free text) and a structured data source (knowledge base). Error propagation in natural language processing (NLP) pipelines is contained and minimized. The final system achieves state-of-the-art performance in several NLP tasks, including answer sentence ranking and answer extraction on one QA dataset, monolingual alignment on two annotated datasets, and question answering from Freebase with web queries. This dissertation shows the capability of a feature-driven framework to serve as the statistical backbone of modern question answering systems.
Primary Advisor: Benjamin Van Durme
Secondary Advisor: Chris Callison-Burch
Acknowledgments
To my dissertation committee as a whole, Benjamin Van Durme, Chris Callison-
Burch, David Yarowsky and Dekang Lin. Thank you for your time and advice.
I am very grateful to the following people:
Benjamin Van Durme, my primary advisor: Ben admitted me to Hopkins and
completely changed my life. He wrote me thousands of emails during the course of my study, met with me every week, advised me, helped me, encouraged me, and never once blamed me for my mistakes. He was always there whenever I needed help, and he gave me a lot of freedom. Ben is a great person with extraordinary leadership, integrity, fairness, and management skills. I have learned so much from him. Thank you, Ben.
Chris Callison-Burch, my secondary advisor: Chris is extremely kind and gen-
erous with his students. He read the whole dissertation word by word, front to back, and marked every page with detailed comments. Chris has taught me things
beyond research: networking, entrepreneurship, and artistic thinking. Thank you,
Chris.
Peter Clark, who was my mentor when I interned at Vulcan (now his group is
part of the Allen Institute for Artificial Intelligence). Pete is the one who inspired
me to do a dissertation on question answering. His group also funded two and
a half years of my PhD study. He is such a gentleman with an encouraging and
supportive heart. Thank you, Pete.
Dekang Lin, who was my mentor when I interned at Google Research on their
question answering project. Dekang reshaped how I approach problem solving and how I find a balance between research and industry. He will have a profound impact on how I work in the future, just as his research has influenced the
entire community. Thank you, Dekang.
Jason Eisner, for whose Natural Language Processing class I was the teaching
assistant at Hopkins for two years. Jason helped me gain a deep understanding of log-linear models, which are the statistical backbone of the entire dissertation.
He is a great person, an intellectual giant, and he treats everyone equally. Thank
you, Jason.
David Yarowsky, for whose Information Retrieval class I was the teaching as-
sistant at Hopkins. David’s focus on research novelty and originality heavily in-
fluenced this dissertation. He also set a good example by showing that finishing a high-quality PhD in less than four years was possible. Thank you, David.
Professors and researchers who taught me, mentored me or helped me at grad-
uate school: John Wierman, Mark Dredze, Kyle Rawlins, David Chiang, Liang
Huang, Adam Lopez, Matt Post, Sanjeev Khudanpur, Paul McNamee, Phil Har-
rison, Shane Bergsma, Veselin Stoyanov, and others. Thank you.
Colleagues and friends at the Center for Language and Speech Processing and
JHU: Adam Teichert, Adithya Renduchintala, Andong Zhan, Ann Irvine, Aric
Velbel, Brian Kjersten, Byung Gyu Ahn, Carl Pupa, Cathy Thornton, Chan-
dler May, Chunxi Liu, Courtney Napoles, Da Zheng, Darcey Riley, Debbie De-
ford, Delip Rao, Ehsan Variani, Feipeng Li, Frank Ferraro, Hainan Xu, Hong
Sun, Jason Smith, Jonathan Weese, Juri Ganitkevitch, Katharine Henry, Keisuke
Sakaguchi, Matt Gormley, Michael Carlin, Michael Paul, Nanyun Peng, Naomi
Saphra, Nicholas Andrews, Olivia Buzek, Omar Zaidan, Pegah Ghahremani, Pe-
ter Schulam, Pushpendre Rastogi, Rachel Rudinger, Ruth Scally, Ryan Cotterell,
Samuel Thomas, Scott Novotney, Sixiang Chen, Svitlana Volkova, Tim Vieira,
Travis Wolfe, Vijayaditya Peddinti, Xiaohui Zhang, and Yiming Wang. Thank
you.
My very best Chinese friends at Hopkins: Cao Yuan, Chen Guoguo, Huang
Shuai, Sun Ming, and Xu Puyang. Together we went through so much in grad school. Thank you.
Finally, thank you to my family. I would not be who I am today without your support.
Contents
Abstract iii
Acknowledgments v
List of Tables xv
List of Figures xviii
1. Introduction 1
1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2. Main Idea: Feature-driven Question Answering . . . . . . . . . . . . 9
1.2.1. QA on Text . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.2. QA on KB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.3. With Hard Alignment on Text . . . . . . . . . . . . . . . . . 14
1.2.4. With Soft Alignment on KB . . . . . . . . . . . . . . . . . . 16
1.3. Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4. How to Read this Dissertation . . . . . . . . . . . . . . . . . . . . . 23
1.5. Related Publications . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2. 50 Years of Question Answering 27
2.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2. Conferences and Evaluation . . . . . . . . . . . . . . . . . . . . . . 31
2.2.1. TREC (Text REtrieval Conference) QA Track . . . . . . . . 31
2.2.2. QA@CLEF (Cross Language Evaluation Forum) . . . . . . . 35
2.2.3. Evaluation Methods . . . . . . . . . . . . . . . . . . . . . . 38
2.2.3.1. Precision, Recall, Accuracy, Fβ for IR/QA . . . . . 38
2.2.3.2. MAP, MRR for IR . . . . . . . . . . . . . . . . . . 40
2.2.3.3. Precision-Recall Curve for IR/QA: Drawn Very Differently . . . 41
2.2.3.4. Micro F1 vs. Macro F1 vs. Averaged F1 for QA . . 42
2.2.3.5. Permutation Test . . . . . . . . . . . . . . . . . . . 44
2.3. Significant Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.3.1. IR QA: Document and Passage Retrieval . . . . . . . . . . . 46
2.3.2. NLP QA: Answer Extraction . . . . . . . . . . . . . . . . . 48
2.3.2.1. Terminology for Question Analysis . . . . . . . . . 48
2.3.2.2. Template Matching . . . . . . . . . . . . . . . . . . 49
2.3.2.3. Answer Typing and Question Classification . . . . 51
2.3.2.4. Web Redundancy . . . . . . . . . . . . . . . . . . . 53
2.3.2.5. Tree/Graph Matching . . . . . . . . . . . . . . . . 54
2.3.3. IR4QA: Structured Retrieval . . . . . . . . . . . . . . . . . . 55
2.3.4. KB QA: Database Queries . . . . . . . . . . . . . . . . . . . 60
2.3.4.1. Early Years: Baseball, Lunar and 15+ More . . 60
2.3.4.2. Statistical Semantic Parsing . . . . . . . . . . . . . 63
2.3.5. Hybrid QA (IR+NLP+KB) . . . . . . . . . . . . . . . . . . 67
2.3.5.1. IBM Watson . . . . . . . . . . . . . . . . . . . . . 69
2.4. A Different View: Linguistic Features vs. Machine Learning . . . . 78
2.4.1. Linguistics: Word, POS, NER, Syntax, Semantics and Logic 80
2.4.2. Learning: Ad-hoc, Small Scale and Large Scale . . . . . . . 82
2.4.3. Appendix: Publications Per Grid . . . . . . . . . . . . . . . 84
3. Feature-driven QA from Unstructured Data: Text 86
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.2. Tree Edit Distance Model . . . . . . . . . . . . . . . . . . . . . . . 90
3.2.1. Cost Design and Edit Search . . . . . . . . . . . . . . . . . . 91
3.2.2. TED for Sentence Ranking . . . . . . . . . . . . . . . . . . . 94
3.2.3. QA Sentence Ranking Experiment . . . . . . . . . . . . . . 97
3.3. Answer Extraction as Sequence Tagging . . . . . . . . . . . . . . . 98
3.3.1. Sequence Model . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.3.2. Feature Design . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.3.3. Overproduce-and-vote . . . . . . . . . . . . . . . . . . . . . 104
3.3.4. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.3.4.1. QA Results . . . . . . . . . . . . . . . . . . . . . . 105
3.3.4.2. Ablation Test . . . . . . . . . . . . . . . . . . . . . 109
3.3.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.4. Structured Information Retrieval for QA . . . . . . . . . . . . . . . 110
3.4.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.4.2. Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.4.3. Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.4.4. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.4.4.1. Data . . . . . . . . . . . . . . . . . . . . . . . . . . 122
3.4.4.2. Document Retrieval . . . . . . . . . . . . . . . . . 125
3.4.4.3. Passage Retrieval . . . . . . . . . . . . . . . . . . . 125
3.4.4.4. Answer Extraction . . . . . . . . . . . . . . . . . . 128
3.4.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4. Discriminative Models for Monolingual Alignment 134
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139