
LEARNING PROBABILISTIC MODELS OF WORD SENSE DISAMBIGUATION (PDF)

195 Pages·2007·0.88 MB·English
by Ted Pedersen

Preview

LEARNING PROBABILISTIC MODELS OF WORD SENSE DISAMBIGUATION

Approved by: Dr. Dan Moldovan, Dr. Rebecca Bruce, Dr. Weidong Chen, Dr. Frank Coyle, Dr. Margaret Dunham, Dr. Mandyam Srinath

A Dissertation Presented to the Graduate Faculty of the School of Engineering and Applied Science, Southern Methodist University, in Partial Fulfillment of the Requirements for the degree of Doctor of Philosophy with a Major in Computer Science

by Ted Pedersen (B.A., Drake University; M.S., University of Arkansas)

May 16, 1998

ACKNOWLEDGMENTS

I am indebted to Dr. Rebecca Bruce for sharing freely of her time, knowledge, and insights throughout this research. Certainly none of this would have been possible without her.

Dr. Weidong Chen, Dr. Frank Coyle, Dr. Maggie Dunham, Dr. Dan Moldovan, and Dr. Mandyam Srinath have all made important contributions to this dissertation. They are also among the main reasons why my time at SMU has been both happy and productive.

I am also grateful to Dr. Janyce Wiebe, Lei Duan, Mehmet Kayaalp, Ken McKeever, and Tom O'Hara for many valuable comments and suggestions that influenced the direction of this research.

This work was supported by the Office of Naval Research under grant number N00014-95-1-0776.

Pedersen, Ted (B.A., Drake University; M.S., University of Arkansas)
Learning Probabilistic Models of Word Sense Disambiguation
Advisor: Professor Dan Moldovan
Doctor of Philosophy degree conferred May 16, 1998
Dissertation completed May 16, 1998

Selecting the most appropriate sense for an ambiguous word is a common problem in natural language processing. This dissertation pursues corpus-based approaches that learn probabilistic models of word sense disambiguation from large amounts of text. These models consist of a parametric form and parameter estimates. The parametric form characterizes the interactions among the contextual features and the sense of the ambiguous word. Parameter estimates describe the probability of observing different combinations of feature values. These models disambiguate by determining the most probable sense of an ambiguous word given the context in which it occurs.

This dissertation presents several enhancements to existing supervised methods of learning probabilistic models of disambiguation from sense-tagged text. A new search strategy, forward sequential search, guides the selection process through the space of possible models. Each model considered for selection is judged by a new class of evaluation metrics, the information criteria. The combination of forward sequential search and Akaike's Information Criterion is shown to consistently select highly accurate models of disambiguation. The same search strategy and evaluation criterion also serve as the basis of the Naive Mix, a new supervised learning algorithm that is shown to be competitive with leading machine learning methodologies. In these comparisons the Naive Bayesian classifier also fares well, which seems surprising since it is based on a model where the parametric form is simply assumed. However, an explanation for this success is presented in terms of learning rates and bias-variance decompositions of classification error.
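The supervised setting described above can be made concrete with a short sketch. The code below is not the dissertation's implementation: it fixes the Naive Bayes parametric form (one of the decomposable models the dissertation considers) rather than searching for a form, estimates its parameters from sense-tagged instances, and disambiguates by choosing the most probable sense given the observed context. The function names, feature names, and toy instances are hypothetical. For reference, the Akaike Information Criterion used by forward sequential search to compare candidate models is AIC = 2k - 2 ln L, where k is the number of model parameters and L is the maximized likelihood.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(instances):
    """Estimate p(sense) and p(feature value | sense) from sense-tagged
    instances, where each instance is (dict of feature -> value, sense).
    Plain maximum likelihood counts; a sketch only."""
    sense_counts = Counter()
    feat_counts = defaultdict(Counter)   # (feature, sense) -> Counter of values
    for features, sense in instances:
        sense_counts[sense] += 1
        for f, v in features.items():
            feat_counts[(f, sense)][v] += 1
    total = sum(sense_counts.values())
    p_sense = {s: n / total for s, n in sense_counts.items()}

    def p_value(f, v, s):
        c = feat_counts[(f, s)]
        n = sum(c.values())
        # tiny floor avoids log(0) for unseen feature values
        return c[v] / n if n and c[v] else 1e-6

    return p_sense, p_value

def disambiguate(features, p_sense, p_value):
    """Return the most probable sense given the observed context features."""
    best, best_score = None, float("-inf")
    for s, ps in p_sense.items():
        score = math.log(ps) + sum(
            math.log(p_value(f, v, s)) for f, v in features.items())
        if score > best_score:
            best, best_score = s, score
    return best

# Hypothetical sense-tagged instances for the noun "interest".
train = [
    ({"right_word": "rate", "pos_right": "NN"}, "money_paid"),
    ({"right_word": "rate", "pos_right": "NN"}, "money_paid"),
    ({"right_word": "in",   "pos_right": "IN"}, "attention"),
]
p_sense, p_value = train_naive_bayes(train)
print(disambiguate({"right_word": "rate", "pos_right": "NN"}, p_sense, p_value))
```

In the dissertation proper, forward sequential search moves from simpler toward more complex candidate forms, scoring each with an information criterion; the sketch above skips that search and simply assumes the final form.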
Unfortunately, sense-tagged text only exists in small quantities and is expensive to create. This substantially limits the portability of supervised learning approaches to word sense disambiguation. This bottleneck is addressed by developing unsupervised methods that learn probabilistic models from raw untagged text. However, such text does not contain enough information to automatically select a parametric form. Instead, one must simply be assumed. Given a form, the senses of ambiguous words are treated as missing data and their values are imputed via the Expectation Maximization algorithm and Gibbs Sampling. Here the parametric form of the Naive Bayesian classifier is employed. However, this methodology is appropriate for any parametric form in the class of decomposable models. Several local-context, frequency-based feature sets are also developed and shown to be appropriate for unsupervised learning of word senses from raw untagged text.
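The unsupervised case lends itself to a similarly small sketch. Under the assumptions stated above, the Naive Bayes parametric form is fixed in advance, the sense of each untagged instance is treated as missing data, and the EM algorithm alternates between re-estimating parameters from soft sense assignments (M-step) and recomputing each instance's sense posterior (E-step), as in the iteration figures listed below. Everything concrete here, the integer-coded two-feature instances, the function name, and the iteration count, is illustrative rather than taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

def em_naive_bayes(X, n_senses, n_iter=20):
    """Impute missing sense labels for untagged instances with EM, holding
    the Naive Bayes parametric form fixed.  X is an (instances, features)
    array of integer-coded feature values.  A sketch only."""
    n, m = X.shape
    n_vals = [X[:, j].max() + 1 for j in range(m)]
    # start from random soft sense assignments (the E-step quantities)
    post = rng.dirichlet(np.ones(n_senses), size=n)
    for _ in range(n_iter):
        # M-step: re-estimate p(S) and p(F_j | S) from expected counts
        p_s = post.sum(axis=0) / n
        p_f = []
        for j in range(m):
            counts = np.zeros((n_vals[j], n_senses))
            for v in range(n_vals[j]):
                counts[v] = post[X[:, j] == v].sum(axis=0)
            p_f.append(counts / counts.sum(axis=0, keepdims=True))
        # E-step: posterior probability of each sense for every instance
        log_post = np.log(p_s) + sum(np.log(p_f[j][X[:, j]] + 1e-12)
                                     for j in range(m))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post, p_s, p_f

# Hypothetical integer-coded local-context features for one ambiguous word.
X = np.array([[0, 1], [0, 1], [1, 0], [1, 0], [0, 0], [1, 1]])
post, p_s, p_f = em_naive_bayes(X, n_senses=2)
print(post.argmax(axis=1))   # imputed sense for each untagged instance
```

Because the senses are induced rather than supplied, the resulting groups carry arbitrary labels; Chapter 7 below addresses how accuracy is assessed in that situation.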
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF TABLES

CHAPTER

1. INTRODUCTION
   1.1. Word Sense Disambiguation
   1.2. Learning from Text
        1.2.1. Supervised Learning
        1.2.2. Unsupervised Learning
   1.3. Basic Assumptions
   1.4. Chapter Summaries

2. PROBABILISTIC MODELS
   2.1. Inferential Statistics
        2.1.1. Maximum Likelihood Estimation
        2.1.2. Bayesian Estimation
   2.2. Decomposable Models
        2.2.1. Examples
        2.2.2. Decomposable Models as Classifiers

3. SUPERVISED LEARNING FROM SENSE–TAGGED TEXT
   3.1. Sequential Model Selection
        3.1.1. Search Strategy
        3.1.2. Evaluation Criteria
               3.1.2.1. Significance Testing
               3.1.2.2. Information Criteria
        3.1.3. Examples
               3.1.3.1. FSS AIC
               3.1.3.2. BSS AIC
   3.2. Naive Mix
   3.3. Naive Bayes

4. UNSUPERVISED LEARNING FROM RAW TEXT
   4.1. Probabilistic Models
        4.1.1. EM Algorithm
               4.1.1.1. General Description
               4.1.1.2. Naive Bayes Description
               4.1.1.3. Naive Bayes Example
        4.1.2. Gibbs Sampling
               4.1.2.1. General Description
               4.1.2.2. Naive Bayes Description
               4.1.2.3. Naive Bayes Example
   4.2. Agglomerative Clustering
        4.2.1. Ward's Minimum–Variance Method
        4.2.2. McQuitty's Similarity Analysis

5. EXPERIMENTAL DATA
   5.1. Words
   5.2. Feature Sets
        5.2.1. Supervised Learning Feature Set
        5.2.2. Unsupervised Learning Feature Sets
        5.2.3. Feature Sets and Event Distributions

6. SUPERVISED LEARNING EXPERIMENTAL RESULTS
   6.1. Experiment 1: Sequential Model Selection
        6.1.1. Overall Accuracy
        6.1.2. Model Complexity
        6.1.3. Model Selection as a Robust Process
        6.1.4. Model Selection for Noun interest
   6.2. Experiment 2: Naive Mix
   6.3. Experiment 3: Learning Rate
   6.4. Experiment 4: Bias Variance Decomposition

7. UNSUPERVISED LEARNING EXPERIMENTAL RESULTS
   7.1. Assessing Accuracy in Unsupervised Learning
   7.2. Analysis 1: Probabilistic Models
        7.2.1. Methodological Comparison
        7.2.2. Feature Set Comparison
   7.3. Analysis 2: Agglomerative Clustering
        7.3.1. Methodological Comparison
        7.3.2. Feature Set Comparison
   7.4. Analysis 3: Gibbs Sampling and McQuitty's Similarity Analysis

8. RELATED WORK
   8.1. Semantic Networks
   8.2. Machine Readable Dictionaries
   8.3. Parallel Translations
   8.4. Sense–Tagged Corpora
   8.5. Raw Untagged Corpora

9. CONCLUSIONS
   9.1. Supervised Learning
        9.1.1. Contributions
        9.1.2. Future Work
   9.2. Unsupervised Learning
        9.2.1. Contributions
        9.2.2. Future Work

REFERENCES
LIST OF FIGURES

2.1.  Saturated Model (CVRTS)
2.2.  Decomposable Model (CSV)(RST)
2.3.  Model of Independence (C)(V)(R)(T)(S)
2.4.  Naive Bayes Model (CS)(RS)(TS)(VS)
4.1.  E–Step Iteration 1
4.2.  M–Step Iteration 1: p̂(S), p̂(F1|S), p̂(F2|S)
4.3.  E–Step Iteration 2
4.4.  E–Step Iteration 2
4.5.  M–Step Iteration 2: p̂(S), p̂(F1|S), p̂(F2|S)
4.6.  E–Step Iteration 3
4.7.  E–Step Iteration 3
4.8.  Stochastic E–Step Iteration 1
4.9.  Stochastic M–Step Iteration 1: p̂(S), p̂(F1|S), p̂(F2|S)
4.10. E–Step Iteration 2
4.11. Stochastic E–Step Iteration 2
4.12. Stochastic M–Step Iteration 2: p̂(S), p̂(F1|S), p̂(F2|S)
4.13. Stochastic E–Step Iteration 3
4.14. Stochastic E–Step Iteration 3
4.15. Matrix of Feature Values, Dissimilarity Matrix
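The figure titles above (Stochastic E–Step and Stochastic M–Step iterations) correspond to the Gibbs Sampling alternative mentioned in the abstract. A minimal sketch of that idea follows: instead of carrying soft posteriors, each instance's sense is drawn at random from its posterior under the current parameters, and the Naive Bayes parameters are then re-estimated from the sampled assignments. The add-one smoothing of counts, the burn-in length, and the toy data (reused from the EM sketch) are my own simplifications, not details taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_naive_bayes(X, n_senses, n_iter=200, burn_in=100):
    """Stochastic counterpart of the EM sketch: senses are sampled, one pass
    per iteration, from their posterior under the current Naive Bayes
    parameters.  Add-one smoothing of counts is a simplification here."""
    n, m = X.shape
    n_vals = [X[:, j].max() + 1 for j in range(m)]
    senses = rng.integers(n_senses, size=n)        # random initial imputation
    tally = np.zeros((n, n_senses))                # post-burn-in sample counts
    for it in range(n_iter):
        # "M-step": estimate parameters from the current hard assignments
        p_s = (np.bincount(senses, minlength=n_senses) + 1.0) / (n + n_senses)
        p_f = []
        for j in range(m):
            counts = np.ones((n_vals[j], n_senses))            # add-one
            np.add.at(counts, (X[:, j], senses), 1.0)
            p_f.append(counts / counts.sum(axis=0, keepdims=True))
        # stochastic "E-step": draw a sense for every instance from its posterior
        log_post = np.log(p_s) + sum(np.log(p_f[j][X[:, j]]) for j in range(m))
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        senses = np.array([rng.choice(n_senses, p=row) for row in post])
        if it >= burn_in:
            tally[np.arange(n), senses] += 1
    return tally / tally.sum(axis=1, keepdims=True)   # empirical sense posteriors

# Same hypothetical integer-coded features as in the EM sketch.
X = np.array([[0, 1], [0, 1], [1, 0], [1, 0], [0, 0], [1, 1]])
print(gibbs_naive_bayes(X, n_senses=2).argmax(axis=1))
```

Averaging the sampled assignments after burn-in gives an empirical sense distribution for each instance, analogous to the posteriors produced by EM.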

Description:
… are manually annotated with sense values by a human judge. These sense–tagged … accuracy of unsupervised learning algorithms; particular attention is paid to features …
