DTIC ADA460572: Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day PDF

Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day

Silviu Cucerzan and David Yarowsky
Department of Computer Science and Center for Language and Speech Processing
Johns Hopkins University
Baltimore, MD 21218 USA
{silviu,yarowsky}@cs.jhu.edu

Report Documentation Page: report date 2002; dates covered 00-00-2002 to 00-00-2002; performing organization Johns Hopkins University, Center for Language and Speech Processing, Department of Computer Science, Baltimore, MD 21218; distribution statement: approved for public release, distribution unlimited.

Abstract

This paper presents a method for bootstrapping a fine-grained, broad-coverage part-of-speech (POS) tagger in a new language using only one person-day of data acquisition effort. It requires only three resources, which are currently readily available in 60-100 world languages: (1) an online or hard-copy pocket-sized bilingual dictionary, (2) a basic library reference grammar, and (3) access to an existing monolingual text corpus in the language. The algorithm begins by inducing initial lexical POS distributions from English translations in a bilingual dictionary without POS tags. It handles irregular, regular and semi-regular morphology through a robust generative model using weighted Levenshtein alignments. Unsupervised induction of grammatical gender is performed via global modeling of context-window feature agreement. Using a combination of these and other evidence sources, interactive training of context and lexical prior models is accomplished for fine-grained POS tag spaces. Experiments show high accuracy, fine-grained tag resolution with minimal new human effort.

1 Introduction

Previous work in minimally supervised language learning has defined "minimal" using several different criteria. Some have assumed only partially tagged training corpora (Merialdo, 1994), while others have begun with small tagged seed wordlists (such as Collins and Singer (1999) and Cucerzan and Yarowsky (1999) for named-entity tagging). Others have exploited the automatic transfer of some already existing annotated resource in a different medium or language (such as the translingual projection of part-of-speech tags, syntactic bracketing and inflectional morphology in Yarowsky et al. (2001), requiring no direct supervision in the foreign language). Ngai and Yarowsky (2000) observed that an often more practical measure of the degree of supervision is not simply the quantity of annotated words, but the total weighted human labor and resource costs of different modes of supervision (allowing manual rule writing to be compared directly with active learning on a common cost-performance learning curve).

In this paper we observe that another useful measure of (minimal) supervision is the additional cost of obtaining a desired functionality from existing commonly available knowledge sources. In particular, we note that for a remarkably wide range of languages, academic libraries, many booksellers and websites offer a foundation of linguistic wisdom in reference grammars and dictionaries. Thus starting from this baseline, what is the marginal cost of distilling from and augmenting this existing knowledge to achieve a desired new task functionality?

2 Inducing POS Tag Candidates from Unlabeled Bilingual Dictionaries

A substantial percentage of foreign language dictionaries that are available on line or in smaller paperback format are simple bilingual word or phrase translation lists which fail to specify part of speech [1]. Thus one component question of this work is how one can extract preliminary part-of-speech distributions from untagged monolingual translation lists. Figure 1 illustrates such a bilingual dictionary, also specifying the true part of speech for each possible translation, which we do not assume to be generally available.

[1] In this section, we will use the term POS tag to denote only the main part-of-speech tags (noun, verb, adjective, adverb, preposition, etc.) and not the fine-grained tags (such as Noun-Genitive-fem-plur-def).

One approach is to take an unweighted mixture of the prior part-of-speech distributions for the English words given in the translation list (TL), as illustrated in Figure 2. These probabilities may be estimated from a large and preferably balanced corpus. In this work, we used statistics from the Brown and WSJ corpora combined.
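The unweighted mixture can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `induce_pos_distribution` and the dictionary `ENGLISH_POS_PRIORS` are hypothetical, and the three prior rows simply reuse the values shown for mandat in Figure 2 in place of real Brown/WSJ statistics.

```python
from collections import defaultdict

# Hypothetical English priors P(Pos_j | e_i). In the paper these come from
# Brown + WSJ counts; the rows below reuse the values shown in Figure 2.
ENGLISH_POS_PRIORS = {
    "warrant": {"N": 0.66, "V": 0.34, "A": 0.00},
    "proxy":   {"N": 0.55, "V": 0.00, "A": 0.45},
    "mandate": {"N": 0.80, "V": 0.20, "A": 0.00},
}


def induce_pos_distribution(translations, priors):
    """Average the POS priors of all translations that have a known prior."""
    known = [priors[e] for e in translations if e in priors]
    if not known:
        return {}  # no prediction possible (counts against coverage)
    mixture = defaultdict(float)
    for dist in known:
        for tag, p in dist.items():
            mixture[tag] += p / len(known)  # every translation weighs the same
    return dict(mixture)


mandat = induce_pos_distribution(["warrant", "proxy", "mandate"], ENGLISH_POS_PRIORS)
print({tag: round(p, 2) for tag, p in mandat.items()})
```

Translations without a known prior are skipped here, which is one possible reading of how unknown English words might be handled; an entry whose translations are all unknown yields an empty distribution.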
    Romanian    True POS   English translation list
    mandat      N          warrant; proxy; mandate; money order; power of attorney
    manechin    N          model, dummy
    manifesta   V          arise, express itself, show
    manual      Adj        manual
                N          manual; textbook; handbook
    mare        Adj        large; big; great; tall; old; important
                N          sea
    maro        Adj        brown, chestnut

Figure 1: A sample Romanian-English dictionary. The POS tags are used only for evaluation and are not available in many bilingual dictionaries.

    FW       e_i       P(Pos_j | e_i)        P(Pos_j | FW)
                       N     V     A         N     V     A
    MANDAT   Warrant   .66   .34   .00
             Proxy     .55   .00   .45       .67   .18   .15
             Mandate   .80   .20   .00
    (e_i via bilingual dictionary; P(Pos_j | e_i) via English treebank)

Figure 2: Inducing a preliminary POS distribution for the Romanian word mandat via a simple English translation list.

However, when a translation candidate is phrasal (e.g. mandat -> money order), one can model the more general probability of the foreign word's part of speech tag ($T_f$) given the part of speech sequence of the English phrasal translation ($T_{e_1} \ldots T_{e_n}$). For example, one could model $P(T_f \mid \text{money order})$ via $P(T_f \mid N_e N_e)$ and $P(T_f \mid \text{manifest itself})$ via $P(T_f \mid V_e\, Pro_e)$. However, because English words often have multiple parts of speech (e.g. order may be a verb), one may weight phrasal POS sequence probabilities (making an independence assumption) as:

$$
\begin{aligned}
P(N_f \mid \text{money order}) = {} & P(N_f \mid N_e N_e) \cdot P(N_e \mid \text{money}) \cdot P(N_e \mid \text{order}) \; + \\
& P(N_f \mid N_e V_e) \cdot P(N_e \mid \text{money}) \cdot P(V_e \mid \text{order}) \; + \\
& P(N_f \mid V_e N_e) \cdot P(V_e \mid \text{money}) \cdot P(N_e \mid \text{order}) \; + \\
& P(N_f \mid V_e V_e) \cdot P(V_e \mid \text{money}) \cdot P(V_e \mid \text{order}) \; + \ldots
\end{aligned}
$$

And in general:

$$P(T_f \mid w_{e_1} w_{e_2}) = \sum_{T_{e_1}} \sum_{T_{e_2}} P(T_f \mid T_{e_1} T_{e_2}) \cdot P(T_{e_1} \mid w_{e_1}) \cdot P(T_{e_2} \mid w_{e_2})$$

where $P(T_{e_i} \mid w_{e_i})$ is estimated from the dictionary as above. Without an independence assumption:

$$P(T_f \mid w_{e_1} \ldots w_{e_n}) = \sum_{T_{e_1} \ldots T_{e_n}} P(T_f \mid T_{e_1} \ldots T_{e_n}) \cdot P(T_{e_1} \ldots T_{e_n} \mid w_{e_1} \ldots w_{e_n})$$

There are two major options via which one can estimate $P(T_f \mid T_{e_1} \ldots T_{e_n})$. The first is to assume that the part-of-speech usage of phrasal (English) translations is generally consistent across dictionaries (e.g. $P(N_f \mid N_{e_1} N_{e_2})$ remains high regardless of publisher or language). Hence one could use any foreign-English bilingual dictionary that also includes the true foreign word part of speech in addition to its translations to train these probabilities. Alternately, one could do a first-pass assignment of foreign-word part of speech based on only single word translations as in Figure 2, and use this to train $P(T_f \mid T_{e_1} \ldots T_{e_n})$ for those foreign words having both phrasal and single-word definitions (such as mandat). The advantage of this approach is that it may benefit dictionaries with different phrasal translation styles from the training dictionary (e.g. use or omission of the word 'to' in verb definitions). However, given the assumption of relatively consistent dictionary formatting styles (which was unfortunately not the case for Kurdish), we evaluated this work based on supervised phrasal training from a single independent third language dictionary.

Table 1 measures the POS induction performance on three languages, where the true POS tags were given in the dictionary (as in Figure 1), but ignored except for evaluation. The accuracy values in this table are based on exact matches between a word's dictionary-provided POS and the most probable tag in its induced distribution.

For our target application of part-of-speech tagging, what matters is to have a robust tag probability distribution that includes the true candidate with sufficiently large probability to seed further training. By setting this baseline threshold to 0.1 and deleting lower ranked candidates, up to 98% of the true POS were found to be above this threshold and hence were considered in future training.

The Mean Probability of Truth, as shown in Table 1, is another measure of the quality of the POS predictions made by the algorithm, representing the probability mass associated with the true POS tag averaged over all words.

In some cases the algorithm could not predict a POS tag, primarily due to English translations for which no POS distribution was known (often an obscure word, proper name or OCR error). This occasional omission is measured by the coverage column.

    Target     Training           Accuracy,       Correct POS over   Coverage   Mean Probability
    Language   Dictionary         Exact POS (%)   Threshold (%)      (%)        of Truth
    Romanian   Spanish-English    92.9            97.8               98         .91
    Kurdish    Spanish-English    76.8            93.1               95         .82
    Spanish    Romanian-English   83.3            94.9               97         .86

Table 1: Performance of inducing candidate part-of-speech distributions derived solely from untagged English translation lists. Results are measured by type (all dictionary entries are weighted equally).

Most of the observed errors are due to differences in phrasal definitional conventions in the training and testing dictionaries, long phrasal idioms, single-word definitions with ambiguous English parts-of-speech and OCR errors. The Kurdish dictionary was particularly hindered by frequent long phrasal translations which often included an explanation or definition in their translation. Because all dictionary entries are equally weighted, errors on rare words such as mythological characters or kinship terms can substantially downgrade performance. But for the purposes of providing seed POS distributions to context-sensitive taggers, performance is quite adequate for this follow-on task.

3 Inducing Morphological Analyses

There has been extensive previous work in the supervised and minimally supervised induction of both affix paradigms (e.g. Goldsmith, 2000; Snover and Brent, 2001) and diverse models of regular and irregular concatenative and non-concatenative morphology (e.g. Schone and Jurafsky, 2000; van den Bosch and Daelemans, 1999; Yarowsky and Wicentowski, 2000). While such approaches are important from the perspective of learning theory or broad coverage handling of irregular forms, another possible paradigm for minimal supervision is to begin with whatever knowledge can be efficiently manually entered from the grammar book in several hours' work.

We defined such grammar-based "supervision" as entry of regular inflectional affix changes and their associated part of speech in standardized ordering of fine-grained attributes, as in Table 2 for Spanish and Romanian. The full tables have approximately 200 lines each and required roughly 1.5-2 person-hours for entry.

    Root Affix   Inflected Affix   Part-of-speech Tag
    Spanish:
    o$           o$                Adj-masc-sing
    o$           os$               Adj-masc-plur
    o$           a$                Adj-fem-sing
    o$           as$               Adj-fem-plur
    e$           e$                Adj-masc,fem-sing
    e$           es$               Adj-masc,fem-plur
    ar$          o$                Verb-Indic_Pres-p1-sing
    ar$          as$               Verb-Indic_Pres-p2-sing
    ar$          a$                Verb-Indic_Pres-p3-sing
    ar$          amos$             Verb-Indic_Pres-p1-plur
    ar$          áis$              Verb-Indic_Pres-p2-plur
    ar$          an$               Verb-Indic_Pres-p3-plur
    Romanian:
    ă$           e$                Noun-Nomin-p3-fem-plur-indef
    e$           i$                Noun-Nomin-p3-fem-plur-indef
    ea$          ele$              Noun-Nomin-p3-fem-plur-indef
    i$           ile$              Noun-Nomin-p3-fem-plur-indef
    a$           ale$              Noun-Nomin-p3-fem-plur-indef
    $            $                 Adj-masc,neut-sing
    $            ă$                Adj-fem-sing
    $            i$                Adj-masc,neut,fem-plur
    $            e$                Adj-fem,neut-plur
    ru$          ra$               Adj-fem-sing
    ru$          ri$               Adj-masc,neut,fem-plur
    ru$          re$               Adj-fem-plur
    ...          ...               ...
    e$           $                 Verb-Indic_Pres-p1-sing
    e$           i$                Verb-Indic_Pres-p2-sing
    e$           e$                Verb-Indic_Pres-p3-sing
    e$           em$               Verb-Indic_Pres-p1-plur
    e$           eţi$              Verb-Indic_Pres-p2-plur
    e$           $                 Verb-Indic_Pres-p3-plur

Table 2: Sample extracted regular inflectional paradigms (suffix context is marked by $).
Given a dictionary marked with core parts of speech, it is trivial to generate hypothesized inflected forms following the regular paradigms, as shown in the left side of Figure 3. However, due to irregularities and semi-regularities such as stem-changes, such generation will clearly have substantial inaccuracies and overgenerations.

However, through weighted-Levenshtein-based iterative alignment models, such as described in Yarowsky and Wicentowski (2000), one can perform a probabilistic string match from all lexical tokens actually observed in a monolingual corpus, as in the right side of Figure 3 [2].

[2] For processing efficiency, one additional constraint is that potential hypothesized -> observed string pair candidates must exactly match in both initial consonant cluster and suffix of the generated hypothesis.

[Figure 3 (diagram): three columns titled "Dictionary Rootword", "Regular Inflection Generation" and "Observed Corpus Words", with alignment arcs labeled by the stem changes z->c, φ->y and o->ue for the roots destrozar/V, destruir/V, dormir/V and doler/V.]

Figure 3: Inflectional analysis induction via weighted string alignment to noisy generations from dictionary roots under regular paradigms.

For example, when looking for a potential analysis path for the Spanish irregular inflection destrocen, the closest string match is the regular hypothesis destrozar/V -> destrozen/V-pres_subj-3pl. Likewise, the closest string match for destruyen is destruir/V -> destruen/V-pres_indic-3pl. The differences between these regular hypotheses and observed inflected forms are the relatively productive stem changes z->c and φ->y, neither of which was listed in the inflectional supervision table, and yet they were correctly handled. Note that a traditional $P(POS \mid \text{suffix})$ model would fail to handle this case given that the common inflection suffix -en corresponds to two different parts of speech here (present indicative or subjunctive depending on -ir or -ar paradigm).

Also note that the irregular stem change processes such as dormir -> duermen have a correct best-fit analysis, despite the absence of any internal stem change exemplars (e.g. o->ue) in the human-generated inflectional supervision table.

For further robustness, the consensus model of $P(Pos_j \mid FW)$ is estimated as a weighted mixture of the part-of-speech tags of the most closely aligned pseudo-regular generated inflections.

The inflections of closed-class words (such as pronouns, determiners and auxiliary verbs) are not well handled by this generative-alignment model, both due to their often very high irregularity (e.g. the Spanish verb ser (to be)) and/or their typical shortness (e.g. the pronominal inflections of mi, tu, su). Thus as one final amount of supervision, lists of closed-class words, paired with their inflections and fine-grained part-of-speech tags, were entered manually from the grammar book (e.g. aquellas#(aquel)Adj_Dem-fem-plur-p3). This final source of supervision utilized an average of 400 lines and 3 person-hours per language.

4 POS Model Induction

The non-traditional supervision methodology in Sections 2 and 3 yields a noisy but broad-coverage candidate space of parts of speech with little human effort.

We then perform a noise-robust combination of model estimation and re-estimation techniques for the syntagmatic trigram models $P(pos_i \mid pos_{i-1}, pos_{i-2})$ and lexical priors $P(w_i \mid pos_i)$ using the word co-occurrence information from a raw corpus.

- A suffix-based part-of-speech probability model $P(pos \mid \text{suffix}(w))$ using hierarchically smoothed tries is trained on the raw initial tag distributions, yielding coverage to unseen words and smoothing of low-confidence initial tag assignments.
- Paradigmatic cross-context tag modeling is performed as in Cucerzan and Yarowsky (2000) when sufficiently large unannotated corpora are available.
- Sub-part-of-speech contextual agreement for features such as gender is performed as described in Section 4.1.
- The part-of-speech tag sequence models $P(pos_i \mid pos_{i-1}, pos_{i-2})$ utilize a weighted backoff between fine-grained and coarse-grained tags.
- Both the tag-sequence and lexical prior models are iteratively retrained using these additional evidence sources and first-pass probability distributions.

The success of this model is based on the assumption that (a) words of the same part of speech tend to have similar tag sequence behavior, and (b) there are sufficient instances of each POS tag labeled by either the morphology models or closed-class entries described in Section 3. One example where these assumptions do not hold is for the Romanian word a, which has 5 possible POS tags, including Infinitive_Marker (corresponding to the English word to). But because the Infinitive_Marker tag has no other word instances in Romanian, no other filial supervision exists to resolve the ambiguity of a if no context-sensitive tagging is provided (such as the preference for a to be labeled Infinitive_Marker when followed by a Verb-Infinitive). Thus one avenue of potential improvement to these models would be to include limited tagged contexts for ambiguous small class (or singleton class) words, although such supervision is less readily extractable from grammar books by non-native speakers, and was not employed here.

4.1 Contextual-agreement models for part-of-speech subtags

Traditional part-of-speech models assume a strict Markovian sequential dependency. However, Adj-Noun, Det-Noun and Noun-Verb agreement at the subtag-level (e.g. for person, number, case and gender) often do not require direct adjacency, and are [...]

However, given the assumptions of minimal supervision, it is not reasonable to require a parser or dependency model to identify non-adjacent agreeing pairs explicitly. Rather, we utilize a much more general tendency for words exhibiting a property such as grammatical gender to co-occur in a relatively narrow window with other words of the same gender (etc.) with a probability greater than chance. Empirically, we observe this in Figures 4-5, which show the gender-agreement ratio between a target noun/adjective and other gender marked words appearing in context at relative position $\pm i$. Adjectives in Romanian exhibit a stronger agreement tendency with words to their left (5/1 ratio), while for nouns the agreement ratio is quite closely balanced between -1 (primarily determiners) and +1 (primarily adjectives), although weaker (2.4/1 ratio), perhaps due to a greater relative tendency for nouns to juxtapose directly with other independent clauses of different gender. Also, both parts of speech converge on the agreement ratio expected by chance (0.82) relatively quickly. Thus while any individual context may suggest incorrect gender based on agreement, if one aggregates over all occurrences of a word in a corpus, a consensus gender preference emerges, with the true gender agreement signal exceeding nearby spurious gender noise.

[Figure 4 (two plots, "Adjectives" above and "Nouns" below): x-axis "Relative Position" from -10 to 10; y-axis "Agreement/Non-agreement Ratio".]

Figure 4: Ratio of the frequency that a gender-marked adjective (above) or noun (below) agrees in gender with another noun/adjective/determiner at relative position i over the frequency of gender disagreement at that relative position.

[Figure 5 (plot): x-axis "Context Width" from 1 to 10; y-axis "Probability of existence of gender-marked tokens within context window".]

Figure 5: The probability that at least one gender-marked word will occur within a window of $\pm i$ words relative to another gender marked word (of any part of speech).

Formally, we can model this window-weighted global feature consensus as:

$$P(gen \mid w) = \sum_{i:\, w_i = w} \;\; \sum_{j \neq 0,\ j \in \text{window}} P(gen \mid w_{i+j}) \cdot Wt(j)$$

The window-size parameter was selected prior to the studies shown in Figures 4-5, but is supported by them. Beyond this window the agreement/disagreement ratio approaches chance, but with a smaller window the probability of finding any gender-marked word in the window drops below the 80% coverage observed for the selected window, trading lower coverage for increased accuracy.

If one makes the assumption that the overwhelming majority of nouns have a single grammatical gender independent of context, we perform smoothing to force nouns with sufficient global context frequency towards their single most likely gender.

Finally, the trie-based suffix model noted in Section 3 can be utilized here to further generalize gender affixal tendencies for use in smoothing poorly represented single words. Through this approach we successfully discover a wide space of low-entropy gender affix tendencies, including the common -a, -dad and -ción feminine affixes in Spanish, without any human or dictionary supervision of nominal gender. But even those words without gender-distinguishing affixes (e.g. parte, cabal) can be successfully learned via global context maximization.

5 Evaluation of the Full Part-of-speech Tagger

One problem with minimally supervised learning of foreign languages is that annotated evaluation data are often not available for the features being induced, or are otherwise difficult to obtain. Thus we have used for initial test languages two languages familiar to the authors (Romanian and Spanish) for which sufficient evaluation resources could be obtained. However, the monolingual corpora utilized for bootstrapping were quite small (123 thousand words of the book 1984 for Romanian and 3.2 million words of newswire for Spanish), which are easily comparable to the sizes that can be accessed online for 60-100 world languages. The seed dictionaries were located online (for Spanish, 42k entries) and via OCR (for Romanian, 7k entries), and small grammar references were obtained at a local bookstore. 1000 words of test data were annotated with a standardized, finely detailed part-of-speech tag inventory including the full complex distinctions for gender, person, number, case, detailed tense and nominal definiteness (inventories of 259 and 230 fine-grained tags were used for Spanish and Romanian respectively).

The minimal supervision in this study consisted of an average total of 4 person-hours per language for manually entering the inflectional paradigms and associated parts of speech from a grammar as in Section 3, and an additional average of 3 person-hours per language for dictionary extraction and entry parsing. (OCR itself on our high-speed 2-sided scanner with OmniPage Pro took under 30 minutes.) As would be expected given that data entry was done by computer scientists who were not native speakers of the test languages, significant analysis errors or gaps were introduced when rather blindly transferring from the reference grammar. Thus to test the relative contributions of limited native speaker help when available, for roughly 4 additional total person hours in a second test condition for Romanian a native speaker corrected and augmented gaps in the patterns previously entered from the grammar book, focusing almost exclusively on the complex inflections of closed-class words.

A summary of the results for these three supervision modes is given in Table 3. Performance is broken down by fine-grained part of speech. Exact-match accuracy is measured over both the full fine-grained (up to 5-feature) part-of-speech space, as well as the 12-class core POS tag (noun and proper noun, pronoun, verb, adjective, adverb, numeral, determiner, conjunction, preposition, interjection, particle, punctuation). The feature of grammatical gender was specifically isolated because it is rarely salient for cross-language applications such as machine translation (where grammatical gender rarely transfers), and because its induction algorithm in Section 4.1 depends heavily on the size of the monolingual corpus (which is small in these experiments, suggesting size-dependent potential for significant further improvement here).

                         Spanish    Romanian   Romanian
                         NNS 8h     NNS 8h     NNS-8h, NS-4h
    All words
      core-tag           93.1       86.3       89.2
      exact-match        86.5       68.6       75.5
      exact w/o gender   87.0       76.7       83.0
    Nouns
      core-tag           90.3       97.4       97.4
      *number            100.0      97.4       98.9
      *gender            100.0      54.9       64.7
      *definiteness      -          96.6       93.7
      *case              -          97.4       97.4
    Verbs
      core-tag           94.7       87.9       89.5
      *tense             93.0       92.6       93.2
      *number            100.0      91.5       91.2
      *person            97.2       92.6       93.2
    Adjectives
      core-tag           79.7       78.6       81.5
      *gender            100.0      81.3       82.2
      *number            100.0      98.3       98.3

Table 3: Performance of POS tagger induction based on 1 person-day of supervision, no tagged training corpora and a fine-grained (~250 tags) tagset. NNS and NS refer to non-native-speaker and native-speaker effort.

Finally, a post-hoc analysis of the system vs. test data discrepancies showed that a significant number were simply arbitrary differences in annotation convention between the grammar-book analyses and the test data tagging policy. For example, one such "error"/discrepancy is the rather arbitrary distinction of whether the Romanian word oricare (meaning any) should be considered an adjective (as listed in a standard bilingual dictionary) or a determiner. Another difference is whether proper-name citations of common nouns (e.g. Casa Blanca) should be annotated for gender/number etc. or not.

Yet regardless of exactly how many system-test discrepancies are just policy differences rather than errors, even the raw accuracy here is very promising given the very fine-grained part-of-speech inventory and small monolingual data size used for bootstrapping. And ultimately the performance is quite remarkable given that it is the result of less than 1 total person day of data collection and supervision, in contrast to the thousands of hours and $100,000-$1,000,000 spent on some annotated training data in much more limited tagset inventories. Thus in terms of cost-benefit analysis, the supervision paradigm and associated bootstrapping models presented here offer quite a good value of new functionality per labor invested.

6 Conclusion

This paper has presented an alternative to traditional corpus annotation-based supervision of part-of-speech taggers. Given that even obscure languages have reference grammars and dictionaries available in large bookstores, libraries or even online, the focus of this work is on using human supervision for efficient structured entry of this seed knowledge (in the form of regular and semi-regular inflectional paradigms and often irregular closed-class part-of-speech entries). Minimally supervised bootstrapping procedures then used corpus-derived distributional data to induce lexical tag probabilities from dictionaries, irregular morphological analyses via weighted Levenshtein-based alignment models, tag sequence probability induction and grammatical gender agreement modeling. Experiments show high accuracy coarse and fine-grained (~250 tag) part-of-speech analyses using only one person day of new human supervision based on readily available linguistic resources.

Acknowledgements

This work was partially supported by NSF grant IIS-9985033 and ONR/MURI contract N00014-01-1-0685.

References

Baum, L. 1972. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3:1-8.

Collins, M., and Y. Singer. 1999. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC 1999, pp. 100-110.

Cucerzan, S., and D. Yarowsky. 1999. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC 1999, pp. 90-99.

Cucerzan, S., and D. Yarowsky. 2000. Language independent minimally supervised induction of lexical probabilities. In Proceedings of ACL 2000, pp. 270-277.

Goldsmith, J. A. 2000. Unsupervised learning of the morphology of a natural language. Computational Linguistics 27(2):153-198.

Merialdo, B. 1994. Tagging English text with a probabilistic model. Computational Linguistics 20:155-171.

Ngai, G., and D. Yarowsky. 2000. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of NAACL 2000, pp. 200-207.

Schone, P., and D. Jurafsky. 2000. Knowledge-free induction of morphology using latent semantic analysis. In Proceedings of CoNLL 2000.

Snover, M. G., and M. R. Brent. 2001. A Bayesian model for morpheme and paradigm identification. In Proceedings of ACL 2001, pp. 482-490.

Van den Bosch, A., and W. Daelemans. 1999. Memory-based morphological analysis. In Proceedings of ACL 1999, pp. 285-292.

Yarowsky, D., G. Ngai, and R. Wicentowski. 2001. Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora. In Proceedings of HLT 2001, pp. 161-168.

Yarowsky, D., and R. Wicentowski. 2000. Minimally supervised morphological analysis by multimodal alignment. In Proceedings of ACL 2000, pp. 207-216.
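As a supplementary illustration of the window-weighted gender consensus described in Section 4.1, the following Python sketch aggregates gender evidence over every corpus occurrence of a word. It is not the authors' implementation: the function names, the window of 3 positions, the inverse-distance weighting and the final normalization are all assumptions made for the example, and the tiny Spanish token list and seed genders are invented.

```python
from collections import defaultdict


def inverse_distance(offset):
    """Hypothetical position weight: closer neighbours count more."""
    return 1.0 / abs(offset)


def gender_consensus(tokens, known_gender, target, window=3, weight=inverse_distance):
    """Sum weighted gender evidence from neighbours of every occurrence of target.

    known_gender maps a word to its current gender distribution, e.g. from
    closed-class lists or regular affix paradigms. window and weight are
    assumed parameters, not values taken from the paper.
    """
    scores = defaultdict(float)
    for i, token in enumerate(tokens):
        if token != target:
            continue
        for j in range(-window, window + 1):
            if j == 0 or not 0 <= i + j < len(tokens):
                continue
            for gender, p in known_gender.get(tokens[i + j], {}).items():
                scores[gender] += p * weight(j)
    total = sum(scores.values())
    # Normalizing to a distribution is an added convenience (an assumption).
    return {g: s / total for g, s in scores.items()} if total else {}


tokens = "la parte buena de la casa y el libro la parte alta".split()
seed = {
    "la": {"fem": 1.0}, "el": {"masc": 1.0}, "buena": {"fem": 1.0},
    "alta": {"fem": 1.0}, "casa": {"fem": 1.0}, "libro": {"masc": 1.0},
}
parte = gender_consensus(tokens, seed, "parte")
print(parte)
```

In this toy corpus one occurrence of parte sits near the masculine words el and libro, yet the aggregate over both occurrences still favours feminine, which is the kind of consensus effect the section describes.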
