Final Report: Natural Language Generation in the Context of Machine Translation

Jan Hajič¹, Martin Čmejrek², Bonnie Dorr³, Yuan Ding⁴, Jason Eisner⁵, Dan Gildea⁶, Terry Koo⁷, Kristen Parton⁸, Gerald Penn⁹, Dragomir Radev¹⁰, Owen Rambow¹¹

with contributions by¹² Jan Cuřín, Jiří Havelka, Vladislav Kuboň, Ivona Kučerová

March 27, 2004

¹ Team leader, Charles University, Prague, Czech Republic
² Charles University, Prague, Czech Republic
³ Affiliate team member, University of Maryland, USA
⁴ University of Pennsylvania, USA
⁵ Johns Hopkins University, USA
⁶ Affiliate team member, University of Pennsylvania, USA
⁷ MIT, USA
⁸ Stanford University, USA
⁹ University of Toronto, Canada
¹⁰ University of Michigan, USA
¹¹ University of Pennsylvania, USA
¹² All from Charles University, Prague, Czech Republic

Contents

1 Introduction
2 Summary of Resources
  2.1 The Prague Dependency Treebank
  2.2 The Penn Treebank
  2.3 The Proposition Bank
  2.4 Existing Czech-English parallel corpora
  2.5 English to Czech translation of Penn Treebank
  2.6 English Analytical Dependency Trees
    2.6.1 Marking Heads in English
    2.6.2 Lemmatization of English
    2.6.3 Transformation of Phrase Trees into Analytical Representations
  2.7 English Tectogrammatical Dependency Trees
  2.8 Part-of-Speech Tagging and Lemmatization of Czech
  2.9 Analytical parsing of Czech
  2.10 Tectogrammatical parsing of Czech
  2.11 Tectogrammatical Lexical Transfer - "Czenglish" tectogrammatical representation
    2.11.1 Translation equivalent replacement algorithm (for 1-1, 1-2 entry-translation mapping)
  2.12 Czech-English Word-to-Word Translation Dictionaries
    2.12.1 Manual Dictionary Sources
    2.12.2 Dictionary Filtering
    2.12.3 Scoring Translations Using GIZA++
3 The Generation System MAGENTA
  3.1 A Generative Model for Pairs of Trees
    3.1.1 Tree-to-Tree Mappings
    3.1.2 A Natural Proposal: Synchronous TSG
    3.1.3 Past Work
    3.1.4 A Probabilistic TSG Formalism
    3.1.5 Tree Parsing Algorithms for TSG
    3.1.6 Extending to Synchronous TSG
    3.1.7 Status of the Implementation
  3.2 The Proposer
  3.3 Using Global Tree Information: Preposition Insertion
    3.3.1 Description of C5.0
    3.3.2 Using Classifiers to Capture Tree Information
    3.3.3 The Preposition Insertion Classifier
    3.3.4 Future Work
  3.4 The Word Order Language Model
    3.4.1 Surface Bigram
    3.4.2 Tree-Based Bigram
    3.4.3 Results
  3.5 English Morphological Generation
4 The Generation System ARGENT
  4.1 Introduction
  4.2 ARGENT architecture
  4.3 Example
  4.4 The ARGENT grammar
  4.5 Evaluation
  4.6 Future work
5 English Tectogrammatical Parsing
  5.1 Introduction
  5.2 Types of Corpus Labels
  5.3 PropBank
  5.4 Tectogrammatical Representation
  5.5 VerbNet
  5.6 Lexical Conceptual Structure
  5.7 Relation Between Semantic Labels
  5.8 Predicting TR Labels
  5.9 Conclusions
6 Evaluation
  6.1 Overview
  6.2 BLEU
  6.3 Upper Bound System
  6.4 Baseline System
  6.5 GIZA++ System
  6.6 Evaluation Mechanics
  6.7 Results
7 Conclusions
8 References

List of Figures

2.1 Example of a lemmatized sentence with marked heads: "The aim would be to end the guerrilla war for control of Cambodia by allowing the Khmer Rouge a small share of power." Terminal nodes consist of a sequence of part-of-speech tag, word form, lemma, and a unique id. The names of the head constituents start with @. (In the noun phrase Khmer Rouge, the word Rouge was marked as head by mistake.)
2.2 Example of an analytical tree automatically converted from the Penn Treebank: "The aim would be to end the guerrilla war for control of Cambodia by allowing the Khmer Rouge a small share of power." (In the noun phrase Khmer Rouge, the word Rouge was marked as head by mistake.)
2.3 Example of a tectogrammatical tree automatically converted from the Penn Treebank: "The aim would be to end the guerrilla war for control of Cambodia by allowing the Khmer Rouge a small share of power." (In the noun phrase Khmer Rouge, the word Rouge was marked as the head by mistake.)
2.4 Example of a Czech analytical tree automatically parsed from input text: "Cílem by bylo ukončení partyzánské války usilující o ovládnutí Kambodže, přičemž by Rudí Khmérové získali nevelký podíl na moci." (As a result of automatic parsing, this tree contains some errors in attachment and analytical function assignment: specifically, the phrase headed by "usilující" (trying) should have been modifying "války" (war), not "bylo" (to be, the main verb of the sentence).)
2.5 Example of a Czech tectogrammatical tree automatically converted from the analytical one: "Cílem by bylo ukončení partyzánské války usilující o ovládnutí Kambodže, přičemž by Rudí Khmérové získali nevelký podíl na moci." (The incorrect structure from the analytical tree in Figure 2.4 persists.)
2.6 Example of a packed tree representation of a forest of Czenglish tectogrammatical trees resulting from the sentence: "Cílem by bylo ukončení partyzánské války usilující o ovládnutí Kambodže, přičemž by Rudí Khmérové získali nevelký podíl na moci."
2.7 Sample of the Czech-English dictionary used for the transfer.
3.1 TR to AR mapping
3.2 Proposer: training phase
3.3 Proposer: decoding stage
3.4 Query made by the Proposer
3.5 Inputs to the classifier when classifying the current node (hexagon).
3.6 TR tree for "John talked to Mary and to Jane"
3.7 TR tree for "John talked to Mary and Jane"
3.8 Number of Classifications Returned and Percent Recall for the three thresholding methods. Values in both C Thresholding columns are bolded when their effective value is close to an integer. For easier comparison, the N Thresholding values have been moved next to their closest companions in the C Thresholding column.
3.9 Percent Recall vs. Number of Classifications Returned for the three thresholding methods.
4.1 Sample PDT tree for (wsj-0000-001).
4.2 Sample PDT tree for (wsj-0012-003).
4.3 FD representation for wsj-000-01.
4.4 FD for "Pierre Vinken" (wsj-0001-01) before lexicalization.
4.5 FD for "Pierre Vinken" (wsj-0001-01) after lexicalization.
4.6 ARGENT FD for "Alan Spoon" before lexicalization.
4.7 Target FD for "Alan Spoon" (wsj-0012-003) after lexicalization.
4.8 ARGENT-augmented FD for "Alan Spoon" after lexicalization.
4.9 Sample tectogrammatical functors.
4.10 Sample SURGE constructs.
4.11 Lexicon entry for verbs like "say" and "report" in the "X says that Y" subcategorization, where "that Y" is a relative clause.
4.12 Actor head guessing macro.
4.13 Converting an APPS to a Surge-style apposition.
4.14 General structure of the ARGENT grammar.
4.15 Most frequent dependency paths in section 00 of the Penn Treebank.
4.16 FD for Pierre Vinken after lexicalization.
4.17 Results.
5.1 Surface syntactic representation for the sentences in (1)
5.2 Deep-syntactic representation
5.3 Deep-syntactic representation: missed generalization
5.4 Local semantic representation for both (2a) and (2b)
5.5 Global semantic representation for (3a) (with load) and (3b) (with throw); the labels used are for illustrative purposes
5.6 The unique roleset for kick
5.7 TR extension to the PropBank entry for bid, roleset name "auction"
5.8 Inventory of thematic role labels used in VerbNet
5.9 Entry in the PropBank-to-VerbNet lexicon for put (excerpt)
5.10 LCS definitions of jog (excerpt)
5.11 Inventory of LCS thematic roles (extract)
5.12 Error rates for predicting one label set from another, with and without using the feature mlex (the governing verb's lexeme); the number of tokens for the study is also given
5.13 Exhaustive mapping of three VerbNet labels to TR labels other than ACT and PAT (the argument being labeled is X)
5.14 Sample generated rule set (excerpt; "fw" is the function word for the argument, "mlex" the governing verb's lexeme, "pba" the modifier tag from the Penn Treebank)
5.15 Results (error rate) for different combinations of syntactic features (left column) and semantic features (top row); the baseline error rate using hand-written AutoTR code is 19.01%
6.1 Final evaluation results for all systems.
6.2 Graph of final evaluation results for all systems.

Chapter 1
Introduction

Jan Hajič

Let's imagine a system for translating a sentence from a foreign language (say Czech, but substitute any of your favorite languages here) into your native language (say English). Such a system works as follows. It analyzes the foreign-language sentence to obtain a structural representation that captures its essence, i.e. "who did what to whom and where." It then translates (or "transfers") the actors, actions, etc. into words of your language while "copying over" the deeper relationships between them. Finally, it synthesizes a syntactically well-formed sentence that conveys the essential meaning of the original sentence.

Each step in this process is a hard technical problem, to which the best known solutions are either not adequate, or good enough only in narrow application domains, failing when applied to other domains. This summer, we concentrated on improving one of these three steps, namely synthesis (or generation), while keeping in mind that some of the core technologies can be applied to other parts of an MT system as long as the basic principle (i.e., using graph/tree representations with only local transformations) is kept.

The target language for generation has been English, and the source language to the MT system has been a language of a different type (Czech). We have been able to produce automatically a fairly deeply analyzed sentence structure of the source language (albeit an imperfect one, of course). The incorporation of this deep analysis makes the whole approach very novel: so far, no large-coverage translation system has tried to operate with such a structure, and the application to languages more diverse than, say, English and French has made it an exciting enterprise.

Within the generation process, we have focused on the structural (syntactic) part, even though we also had to develop a simple morphological generation module in order to be able to fully evaluate the final result. Statistical methods have been used throughout (except for a comparison to an existing state-of-the-art system for rule-based NL generation). Evaluation has been carried out using the BLEU metric, with comparison mainly to a baseline GIZA++-based system that was available previously (created in the past by some of the team members).
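BLEU itself is described in Chapter 6; purely as an illustration of the kind of n-gram-overlap scoring involved, the sketch below computes a corpus-level BLEU score with NLTK. This is not the tooling used at the workshop, and the sentences, variable names, and smoothing choice are illustrative assumptions only.

```python
# Illustrative only: an NLTK-based BLEU scoring sketch, not the workshop's evaluation code.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One hypothetical system output and its reference translation, pre-tokenized.
hypotheses = [
    ["the", "aim", "would", "be", "to", "end", "the", "guerrilla", "war"],
]
references = [
    # Each hypothesis may have several references; here just one.
    [["the", "aim", "would", "be", "to", "end", "the", "guerrilla", "war"]],
]

# Standard BLEU combines 1- to 4-gram precisions with uniform weights;
# smoothing avoids zero scores on short illustrative inputs like these.
score = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU = {score:.4f}")
```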
Although the final results were below those generated by the simpler GIZA++ system, we believe that using deeper analysis for machine translation is the way to go. Needless to say, the system that we compared against those carefully researched and developed systems was itself the result of only six weeks' work (albeit very intensive work indeed).

A significant part of the pre-workshop work was devoted to resource creation for the complete Czech-English MT system. After further expansion and finalization, these resources will be made publicly available (in a much broader and more complete form than the one used during the workshop).

In this report, we summarize the resources, findings, methods and experiments from the 2002 JHU/CLSP Workshop. We hope that you will find here, at the very least, inspiration for future work on both Machine Translation and NL Generation.

Chapter 2
Summary of Resources

Jan Hajič, Martin Čmejrek, Jan Cuřín, Jiří Havelka, Vladislav Kuboň

The MAGENTA system generates English analytical dependency trees from four different input options:

1. English tectogrammatical trees, automatically created from the Penn Treebank;
2. English tectogrammatical trees, human-annotated;
3. English tectogrammatical trees, automatically created from the Penn Treebank and improved by information from the Proposition Bank;
4. So-called "Czenglish" tectogrammatical trees, automatically created from the Czech input text. This input option represents an attempt to develop a full MT system based on dependency trees.

In the sequel, we summarize the resources available before the workshop (Sections 2.1–2.4) as well as those created during it (Section 2.5). Sections 2.6–2.11 describe the automatic procedures used to prepare both training and testing data for all four input options used in the MAGENTA system. Section 2.12 describes the process of filtering the dictionaries used in the transfer procedure.

2.1 The Prague Dependency Treebank

The Prague Dependency Treebank project¹ aims at a complex annotation of a corpus containing about 1.8 million word occurrences (about 80,000 running-text sentences) in Czech. The annotation, which is based on dependency syntax, is carried out in three steps: morphological, analytical, and tectogrammatical. The first two have been finished so far; at present, about 18,000 sentences are annotated tectogrammatically. See (Hajič et al., 2001) and (Hajičová, Panevová, and Sgall, 2000), respectively, for details of the analytical and tectogrammatical annotation.

¹ Version 1.0; LDC catalog no.: LDC2001T10, ISBN: 1-58563-212-0, http://ufal.mff.cuni.cz/pdt

2.2 The Penn Treebank

The Penn Treebank project² consists of about 1,500,000 tokens. Its bracketing style is based on constituent syntax; it comprises the surface syntactic structure, various types of null elements representing underlying positions for wh-movement, passive voice, infinitive constructions, etc., and also predicate-argument structure markup. The largest component of the corpus consists of about 1 million words (about 40,000 sentences) from the Wall Street Journal newspaper. Only this part of the Penn Treebank corpus was used in the MAGENTA project.

² Version 3; LDC catalog no.: LDC99T42, ISBN: 1-58563-163-9

2.3 The Proposition Bank

The PropBank project adds annotation of basic semantic propositions to the Penn Treebank corpus. For each verb, there is a list of the syntactic frames (a frameset) that have occurred in the annotated data; each position in a frame is associated with a semantic role in the predicate-argument structure of the given verb. The annotation started with the most frequent verbs (all occurrences of a verb are annotated at the same time) and continues towards less frequent ones. See (Kingsbury, Palmer, and Marcus, 2002a) for further details.
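To make the frameset idea concrete, here is a minimal sketch of how a PropBank-style roleset and one annotated predicate instance might be represented in code. The class names, the roleset id, the role descriptions, and the example arguments are illustrative assumptions, not excerpts from the actual PropBank frame files.

```python
# A minimal, hypothetical in-memory model of a PropBank-style annotation.
from dataclasses import dataclass, field

@dataclass
class Roleset:
    """One sense of a verb together with its numbered argument roles."""
    roleset_id: str                              # e.g. "end.01" (format assumed for illustration)
    roles: dict = field(default_factory=dict)    # arg label -> informal description

@dataclass
class PredicateInstance:
    """One annotated occurrence of a predicate in a treebank sentence."""
    roleset: Roleset
    arguments: dict                              # arg label -> surface span filling the role

# Hypothetical roleset: the numbered args carry verb-specific meanings.
end_01 = Roleset("end.01", {"Arg0": "agent causing the end", "Arg1": "thing ending"})

# A made-up annotated instance, just to show how frames and fillers line up.
instance = PredicateInstance(
    roleset=end_01,
    arguments={"Arg0": "the committee", "Arg1": "the debate"},
)

for label, span in instance.arguments.items():
    print(f"{label} ({instance.roleset.roles[label]}): {span}")
```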
2.4 Existing Czech-English parallel corpora

Two considerable resources of Czech-English parallel texts were available before the workshop and were used in previous experiments related to statistical Czech-English MT: Reader's Digest Výběr ((Al-Onaizan et al., 1999), (Cuřín and Čmejrek, 2001)) and the IBM AIX and AS/400 operating system guides and message translations (Cuřín and Čmejrek, 1999). The Reader's Digest Výběr corpus (58,137 sentence pairs) contains sentences from a broad domain with very free translations, while the IBM corpus (119,886 sentence pairs) contains domain-specific sentences with literal, almost word-by-word translations.

According to the automatic sentence alignment procedure, only 57% of the sentence pairs from the Reader's Digest Výběr corpus are 1-1 matching sentence pairs, compared to 98% of 1-1 sentence pairs in the IBM corpus.

Both corpora are automatically morphologically annotated by the BH tagging tools (Hajič and Hladká, 1998). Neither corpus contains any syntactic annotation.

2.5 English to Czech translation of Penn Treebank

The MAGENTA system uses syntactically annotated parallel texts as training data. Before the workshop preparation work started, there were no manually syntactically annotated Czech-English parallel data. We decided to have a considerable part of the existing syntactically annotated English corpus (the Penn Treebank) translated by human translators rather than to syntactically annotate existing Czech-English paral-