
Parsing with Automatically Acquired, Wide-Coverage, Robust, Probabilistic LFG Approximations

211 Pages·2012·4.66 MB·English

Parsing with Automatically Acquired, Wide-Coverage, Robust, Probabilistic LFG Approximations

Aoife Cahill

A dissertation submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy to the Dublin City University School of Computing

Supervisors: Prof. Josef van Genabith, Dr. Andy Way

September 2004

Declaration

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Doctor of Philosophy is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

Signed: (Aoife Cahill)
Student ID: 97093246
Date: September 2004

Contents

Abstract
List of Tables
List of Figures
List of Acronyms

1 Introduction

2 Automatic F-Structure Annotation
  2.1 Background and Motivation
  2.2 Lexical Functional Grammar
    2.2.1 Unification (or Constraint-Based) Grammars
    2.2.2 LFG
    2.2.3 Why the LFG Framework?
  2.3 Previous Work on Automatic Annotation
    2.3.1 Annotation Algorithms
    2.3.2 Regular Expression-Based Annotation
    2.3.3 Set-Based Tree Description Rewriting
  2.4 Our Annotation Algorithm
    2.4.1 Proto F-structures
    2.4.2 Proper F-structures
    2.4.3 From Annotated Trees to LFG F-Structures
    2.4.4 Evaluation
    2.4.5 Implementation
  2.5 Summary

3 Tools and Infrastructure
  3.1 Background and Motivation
  3.2 TTS: Treebank Tool Suite
    3.2.1 List Rules by Frequency
    3.2.2 Search by Rule
  3.3 FSAT: F-Structure Annotation Tools
  3.4 Summary
    3.4.1 Design and Implementation
    3.4.2 The Advantages of Developing the Tool Suite

4 Probabilistic Context-Free Parsing Models
  4.1 Introduction
  4.2 Context-Free Parsing
    4.2.1 Context-Free Grammars
    4.2.2 Probabilistic Context-Free Grammars
    4.2.3 Parsing with Context-Free Grammars
  4.3 Improving Probabilistic Context-Free Parsing
    4.3.1 Grammar Transformations
    4.3.2 Lexicalisation
  4.4 Some more complex Approaches to Parsing
  4.5 Summary

5 Treebank-Based PCFG Extraction and Transformation Experiments
  5.1 Introduction
  5.2 Basic Treebank Pre-Processing Prior to Grammar Extraction
    5.2.1 Original Charniak Pre-Processing Steps
    5.2.2 Unary Productions
    5.2.3 Combining Pre-Processing Steps
  5.3 Grammar Transformations
    5.3.1 Parent/Grandparent Transformations
    5.3.2 F-Structure-Annotated Rules
    5.3.3 Combining Transformations
  5.4 Summary

6 Parsing into F-Structures
  6.1 Introduction
  6.2 Two Parsing Architectures
    6.2.1 The Pipeline Model
    6.2.2 The Integrated Model
  6.3 Parsing into Proto F-Structures
    6.3.1 Evaluation
    6.3.2 Fragmentation
  6.4 Parsing into Proper F-Structures
    6.4.1 Extraction of Semantic Forms
    6.4.2 Approximation of Functional Uncertainty Paths
    6.4.3 Long-Distance Dependency Resolution Algorithm
    6.4.4 Evaluation of F-Structures
    6.4.5 Evaluation of LDD Resolution
  6.5 Summary

7 Comparison of Our Approach with Other Approaches
  7.1 Introduction
  7.2 Other Deep Parsing Approaches
    7.2.1 Collins’ Model 3
    7.2.2 Johnson 2002
    7.2.3 The English ParGram Grammar
  7.3 Summary

8 Migrating Automatic Annotation-Based Grammar Acquisition and Parsing to German and the TIGER Treebank
  8.1 Introduction
  8.2 From TIGER to a German LFG
    8.2.1 From TIGER Graphs to Trees
    8.2.2 Annotation of Derived Trees
    8.2.3 Evaluation of the Automatic Annotation Algorithm
  8.3 Parsing Experiments
  8.4 Adding Morphological Information
    8.4.1 Automatically Simulating Morphological Information in TIGER Trees
    8.4.2 Experiments and Results
  8.5 Summary

9 Conclusions
  9.1 Future Work

Bibliography

Appendices
  A Non-punctuation tags in the Penn-II Treebank
  B DCU 105 Gold Standard Sentences
  C Parsing Results for Section 23 Trees
  D Parsing Results for F-Structures against the DCU 105 and PARC 700
  E Parsing Results for F-Structures against the 2,416 F-Structures automatically generated for Section 23

Abstract

Traditionally, rich, constraint-based grammatical resources have been hand-coded.
Scaling such resources beyond toy fragments to unrestricted, real text is knowledge-intensive, time-consuming and expensive. The work reported in this thesis is part of a larger project to automate as much as possible the construction of wide-coverage, deep, constraint-based grammatical resources from treebanks.

The Penn-II treebank is a large collection of parse-annotated newspaper text. We have designed a Lexical-Functional Grammar (LFG) (Kaplan and Bresnan, 1982) f-structure annotation algorithm to automatically annotate this treebank with f-structure information approximating to basic predicate-argument or dependency structures (Cahill et al., 2002c, 2004a). We then use the f-structure-annotated treebank resource to automatically extract grammars and lexical resources for parsing new text into f-structures.

We have designed and implemented the Treebank Tool Suite (TTS) to support the linguistic work that seeds the automatic f-structure annotation algorithm (Cahill and van Genabith, 2002) and the F-Structure Annotation Tool (FSAT) to validate and visualise the results of automatic f-structure annotation.

We have designed and implemented two PCFG-based probabilistic parsing architectures for parsing unseen text into f-structures: the pipeline and the integrated model. Both architectures parse raw text into basic, but possibly incomplete, predicate-argument structures (“proto f-structures”) with long distance dependencies (LDDs) unresolved (Cahill et al., 2002c).

We have designed and implemented a method for automatically resolving LDDs at f-structure level based on a finite approximation of functional uncertainty equations (Kaplan and Zaenen, 1989) automatically acquired from the f-structure-annotated treebank resource (Cahill et al., 2004b).

To date, the best result achieved by our own Penn-II induced grammars is a dependency f-score of 80.33% against the PARC 700, an improvement of 0.73% over the best hand-crafted grammar of Kaplan et al. (2004).
The processing architecture developed in this thesis is highly flexible: using external, state-of-the-art parsing technologies (Charniak, 2000) in our pipeline model, we achieve a dependency f-score of 81.79% against the PARC 700, an improvement of 2.19% over the results reported in Kaplan et al. (2004). We have also ported our grammar induction methodology to German and the TIGER treebank resource (Cahill et al., 2003a).

We have developed a method for treebank-based, wide-coverage, deep, constraint-based grammar acquisition. The resulting PCFG-based LFG approximations parse the Penn-II treebank with wider coverage (measured in terms of complete spanning parse) and parsing results comparable to or better than those achieved by the best hand-crafted grammars, with, we believe, considerably less grammar development effort. We believe that our approach successfully addresses the knowledge-acquisition bottleneck (familiar from rule-based approaches to AI and NLP) in wide-coverage, constraint-based grammar development. Our approach can provide an attractive, wide-coverage, multilingual, deep, constraint-based grammar acquisition paradigm.

Acknowledgements

I wish to acknowledge everyone who has helped me with this thesis. Firstly, I would like to express my deepest gratitude to Josef van Genabith, without whom this thesis would not exist. Josef has always been patient and understanding and I really appreciate that his door was always open to me. He has been a wonderful mentor, full of helpful and insightful comments and most of all encouragement and enthusiasm. In this respect, I would also like to thank Andy Way. Andy has always supported and reassured me whenever I had any doubts, and his pragmatic approach to everything has been an inspiration.

Thanks also to all the members of the National Centre for Language Technology, and the staff and post-graduate students of the School of Computing at DCU for all their helpful suggestions whenever I presented my work. Special thanks to Mary, Mick and Ruth for listening to my crazy half-baked ideas and discussing them with me. I would also like to thank Michelle, Nano and everybody else in CAPG for keeping me sane. I am grateful to Dalen and Tom for all their technical expertise; they have taught me so much!

My time at the University of Stuttgart in the summer of 2003 was thoroughly enjoyable and extremely rewarding. I would especially like to thank Helmut Schmid, Martin Forst, Christian Rohrer and Michael Schielen for all their help and support while I was there.

I would like to thank everybody at the Natural Language Theory and Technology group at the Palo Alto Research Center; I have always enjoyed my visits and come home with a plethora of ideas. I would especially like to thank Stefan, Tracy, Ron and Dick for the fruitful discussions we have had. Also, thanks to Dick for providing us with his evaluation software.

I would also like to thank all of my friends outside of DCU for their constant friendship and support over the course of this Ph.D. It is impossible to name everybody, but special thanks to Rachael, Nico, Paul, Jerry, Brian, Valerie, Frances, Claudia, Siobhan and Sarv for helping me to keep my head out of the clouds and reminding me that there is life outside of thesis-writing!

Finally, I would like to thank my parents Maria and Michael, my sisters Fiona and Roisin, and my best friend Barry for their unquestioning belief in me. Without their confidence in me, I would never have had the courage to go through with this. Thank you (especially Barry) for putting up with me, even in my worst humours, and helping me through the tough times.

I am grateful to Enterprise Ireland for their financial support of our project (Basic Research Grant SC/2001/186), enabling us to carry out the work reported in this thesis.
