Multi-Structured Models for Transforming and Aligning Text

Kapil Thadani

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2015

© 2015 Kapil Thadani
All Rights Reserved

ABSTRACT

Multi-Structured Models for Transforming and Aligning Text

Kapil Thadani

Structured representations are ubiquitous in natural language processing as both the product of text analysis tools and as a source of features for higher-level problems such as text generation. This dissertation explores the notion that different structured abstractions offer distinct but incomplete perspectives on the meaning encoded within a piece of text. We focus largely on monolingual text-to-text generation problems such as sentence compression and fusion, which present an opportunity to work toward general-purpose statistical models for text generation without strong assumptions on a domain or semantic representation. Systems that address these problems typically rely on a single structured representation of text to assemble a sentence; in contrast, we examine joint inference approaches which leverage the expressive power of heterogenous representations for these tasks.

These ideas are introduced in the context of supervised sentence compression through a compact integer program to simultaneously recover ordered n-grams and dependency trees that specify an output sentence. Our inference approach avoids cyclic and disconnected structures through flow networks, generalizing over several established compression techniques and yielding significant performance gains on standard corpora. We then consider the tradeoff between optimal solutions, model flexibility and runtime efficiency by targeting the same objective with approximate inference techniques as well as polynomial-time variants which rely on mildly constrained interpretations of the compression task.

While improving runtime is a matter of both theoretical and practical interest, the flexibility of our initial technique can be further exploited to examine the multi-structured hypothesis under new structured representations and tasks. We therefore investigate extensions to recover directed acyclic graphs which can represent various notions of predicate-argument structure and use this to experiment with frame-semantic formalisms in the context of sentence compression. In addition, we generalize the compression approach to accommodate multiple input sentences for the sentence fusion problem and construct a new dataset of natural sentence fusions which permits an examination of challenges in automated content selection. Finally, the notion of multi-structured inference is considered in a different context, that of monolingual phrase-based alignment, where we find additional support for a holistic approach to structured text representation.

Table of Contents

List of Figures
List of Tables
List of Symbols
Acknowledgments

1 Introduction
  1.1 Multi-Structured Inference
  1.2 Tasks
  1.3 Broader Impact
  1.4 Contributions
  1.5 Overview

2 Background on Tasks
  2.1 Sentence Compression
    2.1.1 Related work
  2.2 Sentence Fusion
    2.2.1 Related work
  2.3 Text Alignment
    2.3.1 Related work
  2.4 Other Related Tasks
    2.4.1 Paraphrase generation
    2.4.2 Sentence simplification
    2.4.3 Title generation
    2.4.4 Machine translation

3 Multi-Structured Compression
  3.1 Compression Corpora
    3.1.1 Corpus analysis
  3.2 Multi-Structured Compression
    3.2.1 Compression as linear optimization
    3.2.2 Multi-structured objective
  3.3 Compression via Integer Linear Programming
    3.3.1 Enforcing tree structure
    3.3.2 Assembling valid n-gram factorizations
    3.3.3 Enforcing a compression rate
  3.4 Features
    3.4.1 Feature categories
    3.4.2 Token features
    3.4.3 n-gram features
    3.4.4 Dependency features
  3.5 Parameter Estimation
    3.5.1 Structured perceptron
    3.5.2 Deriving features for reference compressions
  3.6 Experiments
    3.6.1 Joint inference
    3.6.2 Content-bearing words
    3.6.3 Example output
    3.6.4 Varying the compression rate
    3.6.5 Higher-order n-grams
    3.6.6 Subtree deletion
  3.7 Remarks

4 Approximation Strategies for Compression
  4.1 Compression via Lagrangian Relaxation
    4.1.1 Decomposing the inference task
    4.1.2 Bigram paths
    4.1.3 Dependency subtrees
    4.1.4 Scoring approximate solutions
  4.2 Experiments
    4.2.1 Tightness of approximations
    4.2.2 Tradeoff between structural solutions
    4.2.3 Compression quality
    4.2.4 Timing
  4.3 Remarks

5 Efficient Compression via Dynamic Programming
  5.1 Compressive Parsing
    5.1.1 Edge-factored parsing
    5.1.2 Bigram-factored compressions
    5.1.3 Second-order parsing
    5.1.4 Enforcing compression rates
  5.2 Features
    5.2.1 Second-order dependency features
  5.3 Experiments
    5.3.1 Compression quality
    5.3.2 Timing
    5.3.3 Second-order dependencies
  5.4 Remarks

6 Compression over Predicate-Argument Structures
  6.1 Structured Semantic Graphs
    6.1.1 Multi-structured objective
    6.1.2 Enforcing DAG structure
    6.1.3 Constraining concept lexicons
    6.1.4 Preserving frame semantics in compression
  6.2 Features
    6.2.1 Frame features
    6.2.2 FE features
  6.3 Experiments
    6.3.1 Compression quality
    6.3.2 Frame-semantic integrity
  6.4 Remarks

7 Multi-Structured Sentence Fusion
  7.1 Pyramid Fusion Corpus
  7.2 Multi-Structured Fusion
    7.2.1 ILP formulation
    7.2.2 Redundancy
    7.2.3 Dependency orientation
  7.3 Features
    7.3.1 Token features
    7.3.2 Bigram and dependency features
    7.3.3 Deriving features for reference fusions
  7.4 Experiments
    7.4.1 Fusion quality
    7.4.2 Example output
    7.4.3 Content selection
    7.4.4 Dependency orientation
  7.5 Remarks

8 Multi-Structured Monolingual Alignment
  8.1 Aligned Paraphrase Corpus
    8.1.1 Corpus analysis
  8.2 Multi-Structured Alignment
    8.2.1 Alignment as linear optimization
    8.2.2 Multi-structured objective
    8.2.3 Inference via ILP
  8.3 Features
    8.3.1 Phrase alignment features
    8.3.2 Edge matching features
  8.4 Experiments
    8.4.1 Confident alignments
    8.4.2 All alignments
  8.5 Remarks

9 Conclusions
  9.1 Limitations
    9.1.1 Datasets
    9.1.2 Features
    9.1.3 Representations
    9.1.4 Learning algorithms
    9.1.5 Evaluation
  9.2 Future Work
    9.2.1 Unifying text-to-text operations
    9.2.2 Direct applications
    9.2.3 Task-based evaluations
    9.2.4 Multi-task learning

Bibliography

List of Figures

2.1 An example of phrase-based monolingual alignment drawn from the aligned paraphrase corpus of Cohn et al. (2008). Solid lines indicate sure alignments while dashed lines indicate possible alignments.

3.1 Distribution of instances in the BN training dataset with respect to the number of tokens dropped from the input sentence to produce (a) the longest reference compression, (b) the reference compression of median length, and (c) the shortest reference compression.

3.2 Distribution of instances in the WN training dataset with respect to the number of tokens dropped from the input sentence to produce the reference compression.

3.3 Dependency commodity values for a flow network accompanying a tree-based compression solution. Dashed lines denote all non-zero flow variables $\gamma_{ij}$.

3.4 An illustrative flow network with edge weights indicating non-zero flow, featuring (a) consistent flow and no directed cycles, (b) a cycle that preserves flow but needs multiple incoming edges, and (c) a cycle with one incoming edge for each node but consequently inconsistent flow.

3.5 Adjacency commodity values for a flow network accompanying a path-based compression solution. Dashed lines denote all non-zero flow variables $\gamma'_{ij}$.

3.6 Variation in RASP $F_1$ with imposed compression rate for the BN corpus. All datapoints plotted at average output compression rates after rounding down to token counts.