
Building a Statistical Machine Translation System from Scratch: How Much Bang for the Buck Can We Expect?

Ulrich Germann
USC Information Sciences Institute
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
[email protected]

Report date: 2001. Approved for public release; distribution unlimited.

Abstract

We report on our experience with building a statistical MT system from scratch, including the creation of a small parallel Tamil-English corpus, and the results of a task-based pilot evaluation of statistical MT systems trained on sets of ca. 1300 and ca. 5000 parallel sentences of Tamil and English data. Our results show that even with apparently incomprehensible system output, humans without any knowledge of Tamil can achieve performance rates as high as 86% accuracy for topic identification, 93% recall for document retrieval, and 64% recall on question answering (plus an additional 14% partially correct answers).

1 Introduction

Crises and disasters frequently attract international attention to regions of the world that have previously been largely ignored by the international community. While it is possible to stock up on emergency relief supplies and, for the worst case, weapons, regardless of where exactly they are eventually going to be used, this cannot be done with multilingual information processing technology. This technology will often have to be developed after the fact in a quick response to the given situation. Multilingual data resources for statistical approaches, such as parallel corpora, may not always be available.

In the fall of 2000, we decided to put the current state of the art to the test with respect to the rapid construction of a machine translation system from scratch. Within one month, we would

• hire translators;
• translate as much text as possible; and
• train a statistical MT system on the data thus created.

The language of choice was Tamil, which is spoken in Sri Lanka and in the southern part of India. Tamil is a head-last language with a very rich morphology and therefore quite different from English.

2 Data Collection and Preparation

2.1 Obtaining Tamil Data

Tamil data is not very difficult to find on the web. There are several Tamil newspapers and magazines with online editions, and the large international Tamil community fosters the use of the Internet for the dissemination of information. After initial investigation of several web sites we decided to download our experimental corpus from www.tamilnet.com, a news site that provides local news on Sri Lanka in both Tamil and English. The Tamil and English news texts on this site do not seem to be translations of each other. The availability of a fairly large in-domain corpus of local news on Sri Lanka in English (over 2 million words) allowed us to train an in-domain English language model of Sri Lankan news.

2.2 Encoding and Tokenization

Tamil is written in a phonematic, non-Latin script. Several encoding schemes exist in parallel. Even though the Unicode standard includes a set of glyphs for Tamil, it is not widely used in practice. Most web sites that offer Tamil language material assume Latin-1 encoding and rely on special TrueType fonts, which often are also offered for free download at those sites. Tamil text is therefore fairly easy to identify on web sites via the face attribute of the HTML font tag. All that is necessary is a list of Tamil font names used by the different sites, and knowledge about which encodings these fonts implement. While we could restrict ourselves to one data source and encoding for our experiment, any large-scale system would have to take this into account. In order to make the source text recognizable to humans who have no knowledge of Tamil, we decided to work with transliterated text[1].

[1] Translations, however, were produced from the original Tamil.

2.3 Translating the Corpus

Originally we hoped to be able to create a parallel corpus of about 100,000 words on the Tamil side within one month, using several translators. Professional translation services in the US currently charge rates of about 30 cents per English word for translations from Tamil into English. Given that the English translation of a Tamil text usually contains about 1.2 times as many words as the Tamil original, the translation of a corpus of 100,000 Tamil words would cost approximately USD 36,000. This was far beyond our budget.

In India, by comparison, raw translations may cost as little as one cent per Tamil word[2]. However, outsourcing the translation work abroad was not feasible for us, since we had neither the administrative infrastructure nor the time to manage such an effort. Also, working with partners so remote would have made it very difficult to communicate our exact needs and to implement proper quality control.

We finally decided to hire as translators four entering and second-year graduate students in the department of engineering whose native language is Tamil and who had responded to an ad posted in the local mailing list for students from India.

In order to manage the corpus translation process, we set up a web interface through which the translators could retrieve source texts and upload their translations, post-editors could post-edit text online, the project progress could be monitored, and all incoming text was available to other project members as soon as it was submitted.

We originally assumed that translators would be able to translate about 500 words per hour if we were content with raw translations and hardly any formatting, and if we allowed them to skip difficult words or sentences. This estimate was based on an internal evaluation, in which multilingual members of our group translated sample documents from their native language (Arabic, German, Romanian) into English and kept track of the time they spent on this.

It turned out that our expectations were very much exaggerated with respect to both translation speed and the quality of translation. The actual translation speed for Tamil varied between 156 and 247 words per hour with an average of 170 words per hour. In 139 hours of reported translation time (over a period of eventually 6 weeks), about 24,000 words/1,300 sentences of Tamil text were translated, at an effective cost of ca. 10.8 cents per Tamil word (translators' compensation plus administrative overhead). This figure does not include the effort for manually post-editing the translations by a native speaker of English (12-16 person hours).

The overall organization of the project (source data retrieval, hiring and management of the translators, design and implementation of the web interface for managing the project via the Internet, development of transliterator and stemmer, etc.) required an additional estimated 2.5 person months. However, a good part of this effort led to resources that can also be used for other purposes.

[2] Personal communications with Thomas Malten, University of Cologne.

2.4 Lessons Learned for Future Projects

If we were to give advice for future, similar projects, we would emphasize and recommend the following:

2.4.1 Good translators are not easy to find

It is difficult to find good translators for a short-term commitment. Unless one is willing to pay a premium price, it is unlikely that one will find professional translators who are willing to commit much of their time for a limited period of time and on short notice.

2.4.2 Make the translation job attractive

As foreign students, our translators would each have been allowed to work up to twenty hours per week. None of them did, because the work was frustrating and boring, and because they found more attractive, long-term employment on campus. Our translators' frustration may have been fostered by several factors:

• the differences between Sri Lankan Tamil (the variety used in our corpus) and the Tamil spoken in Southern India (the native language of our translators), which made translating, according to our translators, very difficult;
• the lack of translation experience of our translators; and
• our high expectations. We originally told our translators that since they were not working on site, we would expect the translation of 500 words per hour reported. When we later switched to hourly pay regardless of translation volume, the translation volume picked up slightly.

2.4.3 Be prepared to post-edit

In professional translating, translators typically translate into their native language only. One may not be able to find translators with English as their native language for low density or "small" languages, so it may be necessary to have the translations post-edited by people with greater language proficiency in English.

2.4.4 Have translators and post-editors work on site

It is better to have translators and post-editors work on site and ideally as teams, so that they can resolve ambiguities and misunderstandings immediately without the delays of communicating indirectly, be it by email or other means. A post-editor who does not know the source language may misinterpret the translator, as the following case from our corpus illustrates:

Raw translation: Information about the schools in which people who migrated to Kudaanadu are staying is being gathered.

Post-edited version: Information about the schools in (sic!) which immigrants to Kudaanadu are attending is being gathered.

In this case, the post-editor clearly misinterpreted the translator. What the translator meant to and actually did say is that information was being gathered about the schools in which migrants/war refugees who had arrived in Kudaanadu had found shelter. However, the post-editor interpreted the phrase "people who migrated to Kudaanadu" as describing immigrants and assumed that information was being gathered about their education rather than their housing.

3 Evaluation Experiments

3.1 A Priori Considerations

The richer the morphology of a language is, the greater is the total number of distinct word forms that a given corpus consists of, and the smaller is the probability that a certain word form actually occurs in any given text segment. Figure 1 shows the percentage of word forms in unseen text that have occurred in previously seen text as a function of the amount of previously seen text. The graph on the left shows the curves for English, the one on the right the curves for Sri Lankan Tamil. The graphs show the averages of 100 runs on different text fragments; the error bars indicate standard deviation.

[Figure 1: Text coverage on previously unseen text for English (left) and Tamil (right). The upper line in each graph shows the coverage by tokens that have been seen at least once, the lower line shows the coverage by tokens that have been seen at least 5 times. The error bars indicate standard deviation. Axes: corpus size (in thousand tokens, 0 to 120) against text coverage (in %, 0 to 100).]

The numbers were computed in the following manner: A corpus of 120,000 tokens was split into segments of 1000 tokens each. For each segment $n_k$, we computed how many of the tokens had been previously seen in the segments $n_1 \ldots n_{k-1}$. The upper line in the graphs shows the percentage of tokens in $n_k$ that had occurred at least once before in the segments $n_1 \ldots n_{k-1}$, the lower line shows the percentage of tokens that had been seen at least five times before.

For the purpose of statistical NLP, it seems reasonable to assume that the lower curve gives a better indication of how many percent of previously unseen text we can expect to be "known" to a statistical model trained on a corpus of $m$ tokens.

At a corpus size of 24,000 tokens, which is approximately the size of the parallel corpus we were able to create during our experiment, about 28% of all word forms in previously unseen Sri Lankan Tamil text cannot be found in the corpus, and 50% have been seen less than 5 times. In other words, if we train a system on this data, we can expect it to stumble over every other word! At a corpus size of 100,000 tokens, the numbers are 17% and 33%.

For English, the numbers are 9%/23% for a corpus of 24K tokens and 0%/8% for a corpus of 100K tokens.

In order to boost the text coverage we built a simple text stemmer for Tamil, based on the Tamil inflection tables in Steever (1990) and some additional inspection of our parallel corpus. The stemmer uses regular expression matching to cut off inflectional endings and introduce some extra tokens for negation and certain case markings (such as locative and genitive), which are all marked morphologically in Tamil. It should be noted that the stemmer is far from perfect and was only intended to be an interim solution. The performance increases are displayed in Figure 2. For a corpus size of 24K tokens, the percentages of unknown items drop to 19% (from 28%; never seen before) and 36% (from 50%; seen less than 5 times). For a training corpus of 100K tokens, the numbers are 12% and 23% (from 17%/33%).

[Figure 2: Text coverage increase by stemming for Tamil. The solid lines indicate text coverage for unstemmed data (seen at least once and at least five times, respectively), the dashed lines the text coverage for stemmed data. Axes: corpus size (in thousand tokens, 0 to 120) against text coverage (in %, 0 to 100).]

3.2 Task-Based Pilot Evaluation

Given these numbers, it is obvious that one cannot expect much performance from a system that relies on models trained on only 24K tokens of data. As a matter of fact, it is close to impossible to make any sense whatsoever of the output of such a system (cf. Fig. 3).

To get an estimate of the performance with more training data, we augmented our corpus with a parallel corpus of international news texts in Southern Indian Tamil which was made available to us by Fred Gey of the University of California at Berkeley (henceforth: Berkeley corpus). This corpus contains ca. 3,800 sentence pairs with 75,800 Tamil tokens after stemming (before stemming: 60,000; the difference is due to the introduction of additional markers during stemming). Some of the parallel data was withheld for system evaluation; the augmented training corpus (Berkeley and TamilNet corpus; short B+TN) had a size of 85K tokens on the Tamil side. The augmented training corpus had a text coverage of 81% (seen at least once; 75% without augmentation), and 67% (seen at least 5 times; 60% without augmentation), respectively, for Sri Lankan Tamil. We trained IBM Translation Model 4 (Brown et al., 1993) both on our corpus alone and on the augmented corpus, using the EGYPT toolkit (Knight et al., 1999; Al-Onaizan et al., 1999), and then translated a number of texts using different translation models and different transfer methods, namely glossing (replacing each Tamil word by the most likely candidate from the translation tables created with the EGYPT toolkit) and Model 4 decoding (Brown et al., 1995; Germann et al., 2001).

Figure 3 shows the output of the different systems in comparison with the human translation.

[Figure 3: Sample output of various systems. The figure sets the human translation of a sample text beside the output of four systems: gloss and Model 4 decoding (Brown et al., 1993; Brown et al., 1995) trained on the small corpus (24K tokens (Tamil) of training data), and gloss and Model 4 decoding trained on the augmented corpus (85K tokens (Tamil) of training data). Human translation: "Government-United National Party meeting will not take place tomorrow. The proposed meeting tomorrow, Thursday, between the Peoples Front Government and the United National Party regarding the New Political Document was announced as postponed to a later date. Presidential Secretariat sources say that this meeting was postponed as the Sri Lankan President Chandrika Bandaranayakka has gone to a foreign country. At the same time it is to be noted that the meeting of the sub committee examining the political document between the United National Party and the Government was held yesterday Tuesday evening"]

We then conducted the following experiments.

3.2.1 Document Classification Task

Seven human subjects without any knowledge of Tamil were given translations of a set of 15 texts (all from the Berkeley corpus) and asked to categorize them according to the following topic hierarchy:

• News about Sri Lanka
  • Reports about clashes between the Sri Lankan army and the Liberation Tigers
  • Sri Lankan security-related news (arrests, arms deals, etc.)
  • Sri Lankan political news (strikes, transport, telecom)
  • concerns Sri Lanka but doesn't fit any of the above
• News about Pakistan/India
  • Nuclear tests in Pakistan and India, including their aftermath (international reactions, etc.)
  • Corruption investigation against Benazir Bhutto
  • News about Pakistan/India but none of the above
• International news
  • Disasters, accidents
  • Nelson Mandela's birthday
  • Other international news
• Impossible to tell

Except for one duplicate set, each subject received a different set of translations. The sets differed in training parameters and the translation method used. Table 1 shows the results of this evaluation. The difference between the subjects 5a and 5b, who received the same set of translations, suggests that the individual classifiers' accuracy influences the results so much as to blur the effect of the other parameters. There seems to be a tendency for glossing to work better than Model 4 decoding. Glossing, in our system, is a simple baseline algorithm that provides the most likely word translation for each word of input. Translation candidates and their probabilities are retrieved from the translation table, which is part of the translation model trained on the parallel corpus.

Table 1: Results of the Document Classification Task. Test subjects were asked to classify the translations of 15 documents into 4 major and 11 minor categories.

subject  input    pegging(a)?  transfer        correct  partially correct(b)  incorrect
1        raw      no           M4 decoding(c)  7        4                     4
2        stemmed  yes          M4 decoding     8        3                     4
3        stemmed  no           M4 decoding     13       2                     0
4        raw      no           gloss           13       1                     1
5a       stemmed  yes          gloss           8        3                     4
5b       stemmed  yes          gloss           12       2                     1
6        stemmed  no           gloss           11       2                     2

(a) pegging causes the training algorithm to consider a larger search space
(b) correct top level category but incorrect sub-category
(c) translation by maximizing the IBM Model 4 probability of the source/translation pair (Brown et al., 1993; Brown et al., 1995)

The document classification test is foremost and above all a measure of the quality of the translation table for frequently occurring words. In practice, actual classification might be performed by automatic procedures rather than humans. If we dare to accept the top performances of our human subjects as the tentative upper bound of what can be achieved with the current system using a translation model trained on 85K tokens of Tamil text and the corresponding English translations, we can conclude that the classification accuracy can exceed 86% (13/15) for fine-grained classification and reach 100% for coarse-grained classification. However, given the extremely small sample size in this evaluation, the evidence should not be considered conclusive.

3.2.2 Document Retrieval Task

The document retrieval task and the question answering task (see below) were combined into one task. The subjects received 14 texts (from the TamilNet corpus) and 15 lead questions plus 13 additional follow-up questions. Their task was to identify the document(s) that contain(s) the answer to the question and to answer the questions asked. Typical lead questions were questions such as "What is the security situation in Trincomalee?" or "Who is S. Thivakarasa?"; typical follow-up questions were "Who is in control of the situation?", "What happened to him/her?", or "How did (other) people react to what happened?". As in the previous experiment, each subject received the output of a different system.

Table 2 shows the result of the document retrieval task. Again, the sample size was too small to draw any final conclusions, but our results seem to suggest the following. Firstly, the test subjects in the group dealing with output of systems trained on the bigger training set tend to perform better than the ones dealing with the results of training on less data. This suggests that the jump from 24K to 85K tokens of training data might improve system performance in a significant manner. We were surprised that even with the poor translation performance of our system, recalls as high as 93% at a precision of 88% could be achieved. Secondly, the data shows that gaps are not randomly distributed over the data, but that some questions clearly seem to have been more difficult than others. One of the particularly difficult aspects of the task was the spelling of names. Question 11, for example, asked "What happened to Chandra Kumar Abayasingh?". In the translations, however, it was rendered in simple transliteration: cantirakumaara apayacingka. It requires a considerable degree of tenacity and imagination to find this connection.

Table 2: Recall and precision on the document retrieval task. Test subjects were asked to identify the document(s) containing the answers to 15 lead questions.

test set            training corpus  transl. method  recall  precision
1                   TN(a)            glossing        67%     79%
2                   TN               M4 dec.(b)      53%     80%
3                   TN               both(c)         60%     48%
4                   TN(d)            both            67%     79%
5                   B+TN.1(e)        glossing        80%     87%
6                   B+TN.1           M4 dec.         67%     86%
7                   B+TN.1           both            80%     86%
8                   B+TN.1(f)        both            60%     73%
9                   B+TN.1(g)        both            93%     88%
10                  B+TN.2(h)        both            87%     75%
Human Translations                                   100%    100%

(a) TamilNet corpus only; stemmed; 1291 aligned text chunks; 23,359 tokens on Tamil side; 1000 training iterations.
(b) IBM Model 4 decoding.
(c) Both glossing and IBM Model 4 decoding were available to the test subject.
(d) same as above, but trained with pegging option (more thorough search during training); 10 training iterations.
(e) Berkeley and TamilNet corpora; 5069 aligned text chunks; 85421 tokens on Tamil side; 100 training iterations.
(f) same as above; 10 training iterations.
(g) same as above, trained with pegging option; 10 training iterations.
(h) Berkeley and TamilNet corpora, raw (unstemmed); 64439 tokens on Tamil side; 50 training iterations.

3.2.3 Question Answering Task

In order to measure the performance in the question answering part of this evaluation, we considered only questions relevant to the documents that the test subjects had identified correctly. Because of the difficulty of the task, we were lenient to some degree in the evaluation. For example, if the correct answer was "the former president of the teacher's union" and the answer given was "an official of the teacher's union", we still counted this as "close enough" and therefore correct. In addition, we also allowed partially correct answers, that is, answers that went into the right direction but were not quite correct. For example, if the correct answer was "The army imposed a curfew on fishing", we counted the answer "the army is stopping fishing boats" as partially correct. All in all, it was very difficult to evaluate this section of the task, because it was often close to impossible to determine whether the answer was just an educated guess or actually based on the text. There were some cases where answers were partially or even fully correct even though the correct document had not been identified. In retrospect we conclude that it would have been better to have the test subjects mark up those text passages in the text that justify their answers.

Table 3: Accuracy on question answering. The test sets are the same as in Table 2. Only questions concerning documents that were identified correctly were considered in this evaluation.

test set  training corpus  No. of relevant questions  correct   partially correct
1         TN               17                         2 = 12%   0 = 0%
2         TN               16                         3 = 19%   4 = 25%
3         TN               18                         2 = 11%   0 = 0%
4         TN               17                         7 = 41%   5 = 19%
5         B+TN.1           23                         14 = 61%  6 = 26%
6         B+TN.1           19                         5 = 26%   3 = 16%
7         B+TN.1           22                         14 = 64%  3 = 14%
8         B+TN.1           19                         12 = 63%  1 = 5%
9         B+TN.1           26                         14 = 54%  5 = 19%
10        B+TN.2           24                         8 = 33%   9 = 38%
human                      28                         24 = 86%  2 = 7%

Again, the data suggests that the difference in training corpus size does affect the amount of information that is available from the system output. Subjects using output of a system based on a translation model that was trained on only the TamilNet data tend to perform worse than subjects using output from a system based on a translation model trained on the larger corpus. The poor performance on test set No. 6 may suggest that for this task and at this level of translation quality, glossing provides more informative output than Model 4 decoding. This result is not particularly surprising, since we noticed that Model 4 decoding tends to leave out more words than acceptable. Clearly, this is one area where the translation model has to be improved.

Test set 10 is the only set produced by a system using a translation model trained on raw, unstemmed data. It is unclear whether the poor performance on question answering for this test set is due to a principally worse translation quality or the (lack of) tenacity and willingness of the test subject to work her way through the system output.

All in all, we were astonished by the amount of information that our test subjects were able to retrieve from the material they received (the top recall for the question answering task is 64%, plus an additional 14% partially correct answers!). However, using a system such as the one discussed in the paper is not an option for actual information processing. Especially those subjects that had to deal with the output of systems trained on the smaller corpus experienced the task as utterly frustrating and would not want to do it again.

4 Conclusions

We have reported on our experience with rapidly building a statistical MT system from scratch. Within ca. 140 translator hours, we were able to create a parallel corpus of about 1300 sentence pairs with 24,000 tokens on the Tamil side, at an average translation rate of approximately 170 Tamil words per hour.

Very clearly, the effort needed to create parallel data is one of the biggest obstacles to the rapid development of statistical MT systems for new languages.

With the output of a system which uses a translation model trained on the small amount of parallel data that we created during the course of our experiment, human test subjects achieved a recall of over 50% on the document retrieval task but generally performed poorly on question answering (less than 20%).

The addition of an additional corpus of 3,800 sentence pairs allowed us to estimate the benefits of increasing the overall corpus size by roughly 300%. Based on our experience with translating the TamilNet corpus, this additional effort would require an additional 450 translator and 36 to 48 post-editor hours.

With the additional training data, we were able to produce output that increased the performance on our evaluation tasks (document retrieval and question answering) to up to 93% for document retrieval and 64% for question answering.

With respect to the scenario of "MT in a month", we can now make the following calculation: If we assume that the average translator translates at a rate of 170 words/hour and is able to spend 6-7 hours per day on actual translations, then a translator can translate about 1000-1200 words per day. In order to translate a corpus of 100,000 words within one month (assuming a five-day work week), we therefore need four to five full-time translators. For this effort, we can expect a translation system whose performance resembles the one shown in our evaluation.

This, of course, raises the following questions, which we are only able to ask but not to answer at this point.

• Can the translation model and the algorithms for statistical training be improved so that they require less data to produce acceptable results?
• Are there more efficient uses of scarce resources (such as language experts and translators) for building a statistical (or any other) MT system quickly, for example the creation of less but more informative data, e.g. a parallel corpus with alignments on the word level, or the compilation of a glossary/dictionary of the most frequently used terms?
• How do the various approaches compare with respect to the ratio of construction effort versus performance improvement when the MT systems are scaled up? One approach may show rapid improvements initially but also reach a plateau quickly, whereas another may show slow but steady improvements.
• Is there any potential for bootstrapping the resource creation process by using knowledge that can be extracted from little and poor data to speed up the creation of more and better data?

These are some of the questions that will need to be addressed in future research on Quick MT.

5 Acknowledgments

This research has been funded by the DARPA TIDES program under grant No. N66001-00-1-8914. We would like to thank our translators as well as Fred Gey of the University of California at Berkeley and Thomas Malten of the University of Cologne for their kind support of our investigations into Tamil, and our test subjects for their tenacity and patience during this experiment.

References

Yaser Al-Onaizan, David Purdy, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Noah A. Smith, Franz Josef Och, and David Yarowsky. 1999. Statistical machine translation. Final report, Center for Language and Speech Processing, Johns Hopkins University.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Peter Brown, John Cocke, Stephen Della Pietra, Vincent Della Pietra, Frederick Jelinek, Jennifer Lai, and Robert Mercer. 1995. Method and system for natural language translation. U.S. Patent 5,477,451, Dec 19.

Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001. Fast decoding and optimal decoding for MT. In 39th Annual Meeting of the Association for Computational Linguistics (ACL-2001).

Kevin Knight, Yaser Al-Onaizan, David Purdy, Jan Curin, Michael Jahr, John Lafferty, Dan Melamed, Noah Smith, Franz Josef Och, and David Yarowsky. 1999. EGYPT: a statistical machine translation toolkit. http://www.clsp.jhu.edu/ws99/projects/mt/.

Sanford B. Steever. 1990. Tamil and the Dravidian languages. In Bernard Comrie, editor, The World's Major Languages, pages 725–746. Oxford University Press, New York, Oxford.

DTIC ADA460337: Building a Statistical Machine Translation System from Scratch: How Much Bang for the Buck Can We Expect PDF

by Defense Technical Information Center
