ADDIS ABABA UNIVERSITY
FACULTY OF INFORMATICS
DEPARTMENT OF INFORMATION SCIENCE

DEVELOPMENT OF STEMMING ALGORITHM FOR WOLAYTTA TEXT

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE DEGREE OF MASTER OF SCIENCE IN INFORMATION SCIENCE

BY
LEMMA LESSA FEREDE

JULY, 2003

(Name and Signature of Members of the Examining board)

______________________, Chairman, Examining board _______________________
Ato Mesfin Getachew, Advisor _______________________
W/t Atelach Alemu, Advisor _______________________
Dr Haile Eyesus Engdashet, Advisor _______________________
______________________, External Examiner _______________________

DECLARATION

This thesis is my original work and has not been submitted for a degree in any other University.

__________________________
Lemma Lessa
July, 2003

The thesis has been submitted for examination with our approval as University advisors.

____________________ Mesfin Getachew (Ato), July, 2003
____________________ Atelach Alemu (W/t), July, 2003
____________________ Haile Eyesus Engdashet (Dr), July, 2003

Dedicated to my son, Kibruyisfa Lemma, who was born during the first year of my study at the School of Graduate Studies, Addis Ababa University.

ACKNOWLEDGEMENTS

First of all, I would like to express my heartfelt gratitude to my research advisors, Ato Mesfin Getachew, W/t Atelach Alemu and Dr Haile Eyesus Engdashet, for their crucial advice since the conception of this research work. Had it not been for their invaluable advice, I surely would not have completed the work.
My deepest gratitude also goes to Dr Nega Alemayehu (University of Sheffield, UK) for his cooperation in providing professional comments throughout the research work. I also wish to express my respect and appreciation to Dr Martin Porter (author of the Porter Stemmer) for his comments on the code I have generated. Finally, my thanks go to my family (especially my wife, W/ro Serkalem Assefa), colleagues, and friends such as Chuba Chino and others who helped me in one way or another through the two hectic years of my study at the School of Graduate Studies, Addis Ababa University.

June, 2003
Lemma Lessa
Addis Ababa, Ethiopia

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF ABBREVIATIONS AND SYMBOLS USED
LIST OF TABLES
WOLAYTTA VOWELS
WOLAYTTA CONSONANTS
ABSTRACT

CHAPTER ONE
INTRODUCTION
1.1 BACKGROUND OF THE STUDY
1.2 STATEMENT OF THE PROBLEM AND JUSTIFICATION OF THE STUDY
1.3 OBJECTIVES
    1.3.1 GENERAL OBJECTIVE
    1.3.2 SPECIFIC OBJECTIVES
1.4 METHODOLOGY
    1.4.1 LITERATURE REVIEW
    1.4.2 DATA SOURCES
    1.4.3 DEVELOPING AND TRAINING THE STEMMER
    1.4.4 STEMMER TESTING
    1.4.5 IMPLEMENTATION OF THE ALGORITHM
1.5 APPLICATION OF THE RESULT
1.6 SCOPE AND LIMITATION OF THE STUDY
1.7 ORGANIZATION OF THE THESIS

CHAPTER TWO
REVIEW OF RELATED LITERATURE
2.1 INTRODUCTION
2.2 CONFLATION TECHNIQUES
2.3 STEMMING ALGORITHMS
2.4 CHAPTER SUMMARY

CHAPTER THREE
3.1 INTRODUCTION
3.2 MORPHOLOGY
    3.2.1 TYPES OF MORPHEMES OCCUR IN WOLAYTTA
    3.2.2 HOW WORDS ARE FORMED IN WOLAYTTA
3.3 INFLECTIONAL AFFIXES OF WOLAYTTA
    3.3.1 NOUNS
        3.3.1.1 INFLECTION OF NOUNS
        3.3.1.2 DERIVATION OF NOUNS
    3.3.2 ADJECTIVES
    3.3.3 VERBS
        3.3.3.1 VERB INFLECTION
        3.3.3.2 VERB DERIVATION
3.4 COMPOUNDING
3.5 CHAPTER SUMMARY

CHAPTER FOUR
DEVELOPMENT OF STEMMER FOR WOLAYTTA TEXT
4.1 INTRODUCTION
4.2 SAMPLE TEXT
    4.2.1 TEST DATA
    4.2.2 TRAINING SET
4.3 WORD DISTRIBUTION OF WOLAYTTA
4.4 COMPILATION OF STOPWORD LIST
4.5 WOLAYTTA AFFIX LIST COMPILATION
4.6 THE STEMMER
4.7 CONDITIONS/RULES CONSIDERED BY THE STEMMER
4.8 EVALUATION OF THE FIRST STEMMER
4.9 IMPROVED STEMMER
4.10 CHAPTER SUMMARY

CHAPTER FIVE
CONCLUSIONS AND RECOMMENDATIONS
5.1 CONCLUSIONS
5.2 RECOMMENDATIONS

BIBLIOGRAPHY

APPENDICES
APPENDIX I: Stopwords compiled from Wolaytta sample text
APPENDIX II: List of Wolaytta word endings (suffixes)
APPENDIX III: Examples of a stem and its different variants
APPENDIX IV: Comparison of Unstemmed and Stemmed text from the sample text
APPENDIX V: Test Set
APPENDIX VI: Training Set

ABBREVIATIONS AND SYMBOLS USED IN THE STUDY

Abbreviations/symbols    Meaning
==>                      becomes
' '                      gloss
1                        1st person singular
1pl                      1st person plural
2                        2nd person singular masculine/feminine
2pl                      2nd person plural masculine/feminine
3m                       3rd person singular masculine
3f                       3rd person singular feminine
3pl                      3rd person plural masculine/feminine