DTIC ADA460603: Covering Treebanks with GLARF (9 pages)
Covering Treebanks with GLARF

Adam Meyers, Ralph Grishman, Michiko Kosaka and Shubin Zhao
New York University, 719 Broadway, 7th Floor, NY, NY 10003 USA
Monmouth University, West Long Branch, N.J. 07764, USA
meyers/grishman/[email protected], [email protected]

Abstract

This paper introduces GLARF, a framework for predicate argument structure. We report on converting the Penn Treebank II into GLARF by automatic methods that achieved about 90% precision/recall on test sentences from the Penn Treebank. Plans for a corpus of hand-corrected output, extensions of GLARF to Japanese and applications for MT are also discussed.

1 Introduction

Applications using annotated corpora are often, by design, limited by the information found in those corpora. Since most English treebanks provide limited predicate-argument (PRED-ARG) information, parsers based on these treebanks do not produce more detailed predicate argument structures (PRED-ARG structures). The Penn Treebank II (Marcus et al., 1994) marks subjects (SBJ), logical objects of passives (LGS), some reduced relative clauses (RRC), as well as other grammatical information, but does not mark each constituent with a grammatical role. In our view, a full PRED-ARG description of a sentence would do just that: assign each constituent a grammatical role that relates that constituent to one or more other constituents in the sentence. For example, the role HEAD relates a constituent to its parent and the role OBJ relates a constituent to the HEAD of its parent. We believe that the absence of this detail limits the range of applications for treebank-based parsers. In particular, it limits the extent to which it is possible to generalize, e.g., marking IND-OBJ and OBJ roles allows one to generalize a single pattern to cover two related examples ("John gave Mary a book" = "John gave a book to Mary"). Distinguishing complement PPs (COMP) from adjunct PPs (ADV) is useful because the former is likely to have an idiosyncratic interpretation, e.g., the object of "at" in "John is angry at Mary" is not a locative and should be distinguished from the locative case by many applications.

In an attempt to fill this gap, we have begun a project to add this information using both automatic procedures and hand-annotation. We are implementing automatic procedures for mapping the Penn Treebank II (PTB) into a PRED-ARG representation and then we are correcting the output of these procedures manually. In particular, we are hoping to encode information that will enable a greater level of regularization across linguistic structures than is possible with PTB.

This paper introduces GLARF, the Grammatical and Logical Argument Representation Framework. We designed GLARF with four objectives in mind: (1) capturing regularizations: noncanonical constructions (e.g., passives, filler-gap constructions, etc.) are represented in terms of their canonical counterparts (simple declarative clauses); (2) representing all phenomena using one simple data structure, the typed feature structure; (3) consistently labeling all arguments and adjuncts for phrases with clear heads; and (4) producing clear and consistent PRED-ARGs for phrases that do not have heads, e.g., conjoined structures, named entities, etc.; rather than trying to squeeze these phrases into an X-bar mold, we customized our representations to reflect their head-less properties. We believe that a framework for PRED-ARG needs to satisfy these objectives to adequately cover a corpus like PTB.

We believe that GLARF, because of its uniform treatment of PRED-ARG relations, will be valuable for many applications, including question answering, information extraction, and machine translation.
[Report Documentation Page (Standard Form 298, OMB No. 0704-0188): report date 2001; title "Covering Treebanks with GLARF"; performing organization: Department of Computer Science, New York University, 715 Broadway, New York, NY 10003; approved for public release, distribution unlimited; 8 pages, unclassified.]

In particular, for MT, we expect it will benefit procedures which learn translation rules from syntactically analyzed parallel corpora, such as (Matsumoto et al., 1993; Meyers et al., 1996).
Much closer alignments will be possible using GLARF, because of its multiple levels of representation, than would be possible with surface structure alone (an example is provided at the end of Section 2). For this reason, we are currently investigating the extension of our mapping procedure to treebanks of Japanese (the Kyoto Corpus) and Spanish (the UAM Treebank (Moreno et al., 2000)). Ultimately, we intend to create a parallel trilingual treebank using a combination of automatic methods and human correction. Such a treebank would be a valuable resource for corpus-trained MT systems.

The primary goal of this paper is to discuss the considerations for adding PRED-ARG information to PTB, and to report on the performance of our mapping procedure. We intend to wait until these procedures are mature before beginning annotation on a larger scale. We also describe our initial research on covering the Kyoto Corpus of Japanese with GLARF.

2 Previous Treebanks

There are several corpora annotated with PRED-ARG information, but each encodes a somewhat different set of distinctions. The Susanne Corpus (Sampson, 1995) consists of about 1/6 of the Brown Corpus annotated with detailed syntactic information. Unlike GLARF, the Susanne framework does not guarantee that each constituent be assigned a grammatical role. Some grammatical roles (e.g., subject, object) are marked explicitly, others are implied by phrasetags (Fr corresponds to the GLARF node label SBAR under a RELATIVE arc label) and other constituents are not assigned roles (e.g., constituents of NPs). Apart from this concern, it is reasonable to ask why we did not adapt this scheme for our use. Susanne's granularity surpasses PTB-based GLARF in many areas, with about 350 wordtags (parts of speech) and 100 phrasetags (phrase node labels). However, GLARF would express many of the details in other ways, using fewer node and part of speech (POS) labels and more attributes and role labels. In the feature structure tradition, GLARF can represent varying levels of detail by adding or subtracting attributes or defining subsumption hierarchies. Thus both Susanne's NP1p wordtag and Penn's NNP wordtag would correspond to GLARF's NNP POS tag. A GLARF-style Susanne analysis of "Ontario, Canada" is (NP (PROVINCE (NNP Ontario)) (PUNCTUATION (, ,)) (COUNTRY (NNP Canada)) (PATTERN NAME) (SEM-FEATURE LOC)). A GLARF-style PTB analysis uses the roles NAME1 and NAME2 instead of PROVINCE and COUNTRY, where name roles (NAME1, NAME2) are more general than PROVINCE and COUNTRY in a subsumption hierarchy. In contrast, attempts to convert PTB into Susanne would fail because the detail would be unavailable. Similarly, attempts to convert Susanne into the PTB framework would lose information. In summary, GLARF's ability to represent varying levels of detail allows different types of treebank formats to be converted into GLARF, even if they cannot be converted into each other. Perhaps GLARF can become a lingua franca among annotated treebanks.

The Negra Corpus (Brants et al., 1997) provides PRED-ARG information for German, similar in granularity to GLARF. The most significant difference is that GLARF regularizes some phenomena which a Negra version of English probably would not, e.g., control phenomena. Another novel feature of GLARF is the ability to represent paraphrases (in the Harrisian sense) that are not entirely syntactic, e.g., nominalizations as sentences. Other schemes seem to regularize only strictly syntactic phenomena.

3 The Structure of GLARF

In GLARF, each sentence is represented by a typed feature structure. As is standard, we model feature structures as single-rooted directed acyclic graphs (DAGs). Each nonterminal is labeled with a phrase category, and each leaf is labeled with either: (a) a (PTB) POS label and a word (eat, fish, etc.) or (b) an attribute value (e.g., singular, passive, etc.). Types are based on nonterminal node labels, POSs and other attributes (Carpenter, 1992). Each arc bears a feature label which represents either a grammatical role (SBJ, OBJ, etc.) or some attribute of a word or phrase (morphological features, tense, semantic features, etc.).
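The data structure just described can be sketched concretely. The following toy Python fragment is our illustration, not the authors' implementation: the Node class and its helper are hypothetical, but the role labels follow the text. It builds a GLARF-style single-rooted DAG whose arcs bear role labels, and shows how a passive's surface subject and logical object share one constituent:

```python
# Minimal sketch of a GLARF-style feature structure: a single-rooted DAG
# whose arcs bear role labels (SBJ, OBJ, HEAD, ...) or attribute labels.
# Hypothetical illustration, not the authors' implementation.

class Node:
    def __init__(self, label):
        self.label = label   # phrase category, (POS, word) pair, or attribute value
        self.arcs = []       # list of (role, child) pairs

    def add(self, role, child):
        self.arcs.append((role, child))
        return child

# "John ate cheese": SBJ and OBJ are combined surface/logical (SL-) arcs.
s = Node("S")
s.add("SBJ", Node(("NNP", "John")))
s.add("HEAD", Node(("VBD", "ate")))
s.add("OBJ", Node(("NN", "cheese")))

# Structure sharing: in a passive, the same constituent heads both an
# S-SBJ arc and an L-OBJ arc, so the graph is a DAG rather than a tree.
p = Node("S")
cheese = Node(("NN", "cheese"))
p.add("S-SBJ", cheese)
p.add("HEAD", Node(("VBN", "eaten")))
p.add("L-OBJ", cheese)   # two arcs, one shared head

assert [r for r, _ in s.arcs] == ["SBJ", "HEAD", "OBJ"]
assert p.arcs[0][1] is p.arcs[2][1]   # the shared constituent
```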
For example, the subject of a sentence is the head of a SBJ arc, an attribute like SINGULAR is the head of a GRAM-NUMBER arc, etc. (A few grammatical roles are nonfunctional, e.g., a constituent can have multiple ADV constituents; we number these roles (ADV1, ADV2, ...) to preserve functionality.) A constituent involved in multiple surface or logical relations may be at the head of multiple arcs. For example, the surface subject (S-SBJ) of a passive verb is also the logical object (L-OBJ). These two roles are represented as two arcs which share the same head. This sort of structure-sharing analysis originates with Relational Grammar and related frameworks (Perlmutter, 1984; Johnson and Postal, 1980) and is common in feature structure frameworks (LFG, HPSG, etc.). Following (Johnson et al., 1993), which uses two arc types (category and relational), arcs are typed. There are five different types of role labels:

- Attribute roles: Gram-Number (grammatical number), Mood, Tense, Sem-Feature (semantic features like temporal/locative), etc.

- Surface-only relations (prefixed with S-), e.g., the surface subject (S-SBJ) of a passive.

- Logical-only roles (prefixed with L-), e.g., the logical object (L-OBJ) of a passive.

- Intermediate roles (prefixed with I-), representing neither surface nor logical positions. In "John seemed to be kidnapped by aliens", "John" is the surface subject of "seem", the logical object of "kidnapped", and the intermediate subject of "to be". Intermediate arcs are helpful for modeling the way sentences conform to constraints. The intermediate subject arc obeys lexical constraints and connects the surface subject of "seem" (COMLEX Syntax class TO-INF-RS (Macleod et al., 1998a)) to the subject of the infinitive. However, the subject of the infinitive in this case is not a logical subject, due to the passive. In some cases, intermediate arcs are subject to number agreement, e.g., in "Which aliens did you say were seen?", the I-SBJ of "were seen" agrees with "were".

- Combined surface/logical roles (unprefixed arcs, which we refer to as SL-arcs). For example, "John" in "John ate cheese" would be the target of a SBJ subject arc.

Logical relations, encoded with SL- and L- arcs, are defined more broadly in GLARF than in most frameworks. Any regularization from a non-canonical linguistic structure to a canonical one results in logical relations. Following (Harris, 1968) and others, our model of canonical linguistic structure is the tensed active indicative sentence with no missing arguments. The following argument types will be at the head of logical (L-) arcs based on counterparts in canonical sentences which are at the head of SL- arcs: logical arguments of passives, understood subjects of infinitives, understood fillers of gaps, and interpreted arguments of nominalizations (in "Rome's destruction of Carthage", "Rome" is the logical subject and "Carthage" is the logical object). While canonical sentence structure provides one level of regularization, canonical verb argument structures provide another. In the case of argument alternations (Levin, 1993), the same role marks an alternating argument regardless of where it occurs in a sentence. Thus "the man" is the indirect object (IND-OBJ) and "a dollar" is the direct object (OBJ) in both "She gave the man a dollar" and "She gave a dollar to the man" (the dative alternation). Similarly, "the people" is the logical object (L-OBJ) of both "The people evacuated from the town" and "The troops evacuated the people from the town", when we assume the appropriate regularization. Encoding this information allows applications to generalize. For example, a single Information Extraction pattern that recognizes the IND-OBJ/OBJ distinction would be able to handle these two examples. Without this distinction, two patterns would be needed.

Due to the diverse types of logical roles, we sub-type roles according to the type of regularization that they reflect. Depending on the application, one can apply different filters to a detailed GLARF representation, only looking at certain types of arcs.
For example, one might choose all logical (L- and SL-) roles for an application that is trying to acquire selection restrictions, or all surface (S- and SL-) roles if one was interested in obtaining a surface parse. For other applications, one might want to choose between subtypes of logical arcs. Given a trilingual treebank, suppose that a Spanish treebank sentence corresponds to a Japanese nominalization phrase and an English nominalization phrase, e.g.,

  Disney ha comprado Apple Computers
  Disney's acquisition of Apple Computers

Furthermore, suppose that the English treebank analyzes the nominalization phrase both as an NP (Disney = possessive, Apple Computers = object of preposition) and as a paraphrase of a sentence (Disney = subject, Apple Computers = object). For an MT system that aligns the Spanish and English graph representations, it may be useful to view the nominalization phrase in terms of the clausal arguments. However, in a Japanese/English system, we may only want to look at the structure of the English nominalization phrase as an NP.

(S (NP-SBJ (PRP they))
   (VP (VP (VBD spent)
           (NP-2 ($ $) (CD 325,000) (-NONE- *U*))
           (PP-TMP-3 (IN in) (NP (CD 1989))))
       (CC and)
       (VP (NP=2 ($ $) (CD 340,000) (-NONE- *U*))
           (PP-TMP=3 (IN in) (NP (CD 1990))))))

Figure 1: Penn representation of gapping

4 GLARF and the Penn Treebank

This section focuses on some characteristics of English GLARF and how we map PTB into GLARF, as exemplified by mapping the PTB representation in Figure 1 to the GLARF representation in Figure 2. In the process, we will discuss how some of the more interesting linguistic phenomena are represented in GLARF.

4.1 Mapping into GLARF

Our procedure for mapping PTB into GLARF uses a sequence of transformations. The first transformation applies to PTB, and the output of each transformation is the input of the next. As many of these transformations are trivial, we focus on the most interesting set of problems. In addition, we explain how GLARF is used to represent some of the more difficult phenomena.

(Brants et al., 1997) describes an effort to minimize human effort in the annotation of raw text with comparable PRED-ARG information. In contrast, we are starting with an annotated corpus and want to add as much detail as possible automatically. We are as much concerned with finding good procedures for PTB-based parser output as we are with minimizing the effort of future human taggers. The procedures are designed to get the right answer most of the time. Human taggers will correct the results when they are wrong.

4.1.1 Conjunctions

The treatment of coordinate conjunction in PTB is not uniform. Words labeled CC and phrases labeled CONJP usually function as coordinate conjunctions in PTB. However, a number of problems arise when one attempts to unambiguously identify the phrases which are conjoined. Most significantly, given a phrase XP with conjunctions and commas and some set of other constituents X1, ..., Xn, it is not always clear which Xi are conjuncts and which are not, i.e., Penn does not explicitly mark items as conjuncts and one cannot assume that all Xi are conjuncts. In GLARF, conjoined phrases are clearly identified and conjuncts in those phrases are distinguished from non-conjuncts. We will discuss each problematic case that we observed in turn.

Instances of words that are marked CC in Penn do not always function as conjunctions. They may play the role of a sentential adverb, a preposition or the head of a parenthetical constituent.
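The pipeline organization described in Section 4.1 (the output of each transformation is the input of the next) can be sketched as follows. This is our own toy illustration: the driver function and the two stand-in transformations are hypothetical, and they operate on a flat token list rather than a real PTB tree:

```python
# Sketch of the mapping pipeline: the first transformation applies to the
# PTB input, and the output of each transformation feeds the next.
# The concrete transformations below are hypothetical stand-ins.

def apply_pipeline(tree, transformations):
    for transform in transformations:
        tree = transform(tree)   # chain: each output is the next input
    return tree

# Toy stand-ins for real transformations:
strip_empty = lambda toks: [t for t in toks if t != "*U*"]
mark_conjoined = lambda toks: (["(CONJOINED T)"] + toks) if "and" in toks else toks

out = apply_pipeline(["spent", "$", "325,000", "*U*", "and", "in", "1990"],
                     [strip_empty, mark_conjoined])
assert out == ["(CONJOINED T)", "spent", "$", "325,000", "and", "in", "1990"]
```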
In GLARF, conjoined phrases are explicitly marked with the attribute value (CONJOINED T). The mapping procedures recognize that phrases beginning with CCs, PRN phrases containing CCs, among others, are not conjoined phrases.

A sister of a conjunction (other than a conjunction) need not be a conjunct. There are two cases. First of all, a sister of a conjunction can be a shared modifier, e.g., the right-node-raised PP modifier in "[NP senior vice president] and [NP general manager] [PP of this U.S. sales and marketing arm]", and the locative "there" in "deterring U.S. high-technology firms from [investing or [marketing their best products] there]". In addition, the boundaries of the conjoined phrase and/or the conjuncts that they contain are omitted in some environments, particularly when single words are conjoined and/or when the phrases occur before the head of a noun phrase or quantifier phrase. Some phrases which are under a single nonterminal node in the treebank (and are not further broken down) include the following: "between $190 million and $195 million", "Hollingsworth & Vose Co.", "cotton and acetate fibers", "those workers and managers", "this U.S. sales and marketing arm", and "Messrs. Cray and Barnum". To overcome this sort of problem, procedures introduce brackets and mark constituents as conjuncts. Considerations include POS categories, similarity measures and construction type (e.g., & is typically part of a name), among other factors.

CONJPs have a different distribution than CCs, and different considerations are needed for identifying the conjuncts. CONJPs, unlike CCs, can occur initially, e.g., "[Not only] [was Fred a good doctor], [he was a good friend as well]." Secondly, they can be embedded in the first conjunct, e.g., "[Fred, not only, liked to play doctor], [he was good at it as well.]".

[Figure 2: GLARF representation of gapping. The figure is a DAG, not reproduced here: the top-most S dominates a conjoined VP (CONJOINED T) whose conjuncts CONJ1 and CONJ2 share the verb "spent", the second via an L-GAPPING-HEAD arc; NUMBER patterns cover "$325,000" and "$340,000", and TIME patterns cover "in 1989" and "in 1990".]

In Figure 2, the conjuncts are labeled explicitly with their roles CONJ1 and CONJ2, the conjunction is labeled as CONJUNCTION1 and the top-most VP is explicitly marked as a conjoined phrase with the attribute/value (CONJOINED T).

4.1.2 Applying Lexical Resources

We merged together two lexical resources, NOMLEX (Macleod et al., 1998b) and COMLEX Syntax 3.1 (Macleod et al., 1998a), deriving PP complements of nouns from NOMLEX and using COMLEX for other types of lexical information. We use these resources to help add additional brackets, make additional role distinctions and fill a gap when its filler is not marked in PTB. Although Penn's -CLR tags are good indicators of complement-hood, they only apply to verbal complements. Thus procedures for making adjunct/complement distinctions benefited from the dictionary classes. Similarly, COMLEX's NP-FOR-NP class helped identify those -BNF constituents which were indirect objects ("John baked Mary a cake", "John baked a cake [for Mary]").
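The dictionary-assisted role decision just described can be sketched schematically. This is our own toy illustration: the lookup table stands in for the real COMLEX NP-FOR-NP class (only the class name comes from the text), and the function is hypothetical:

```python
# Sketch of dictionary-assisted role assignment: a COMLEX-style verb class
# helps decide whether a PTB -BNF constituent is an indirect object or a
# benefactive adjunct. The table entries are illustrative stand-ins.

NP_FOR_NP_VERBS = {"bake", "buy", "cook"}   # toy stand-in for COMLEX NP-FOR-NP

def bnf_role(verb_lemma):
    """Role for a -BNF constituent given its governing verb."""
    return "IND-OBJ" if verb_lemma in NP_FOR_NP_VERBS else "ADV"

assert bnf_role("bake") == "IND-OBJ"   # "John baked Mary a cake"
assert bnf_role("sing") == "ADV"       # benefactive adjunct, not an object
```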
The class PRE-ADJ identified those adverbial modifiers within NPs which really modify the adjective; thus we could add the following brackets to the NP "[even brief] exposures". NTITLE and NUNIT were useful for the analysis of pattern-type noun phrases, e.g., "President Bill Clinton", "five million dollars". Our procedures for identifying the logical subjects of infinitives make extensive use of the control/raising properties of COMLEX classes. For example, X is the subject of the infinitives in "X appeared to leave" and "X was likely to bring attention to the problem".

4.1.3 NEs and Other Patterns

Over the past few years, there has been a lot of interest in automatically recognizing named entities, time phrases and quantities, among other special types of noun phrases. These phrases have a number of things in common, including: (1) their internal structure can have idiosyncratic properties relative to other types of noun phrases, e.g., person names typically consist of optional titles plus one or more names (first, middle, last) plus an optional post-honorific; and (2) externally, they can occur wherever some more typical phrasal constituent (usually NP) occurs. Identifying these patterns makes it possible to describe these differences in structure, e.g., instead of identifying a head for "John Smith, Esq.", we identify two names and a post-honorific. If this named entity went unrecognized, we would incorrectly assume that "Esq." was the head. Currently, we merge the output of a named entity tagger into the Penn Treebank prior to processing. In addition to NE tagger output, we use procedures based on Penn's proper noun wordtags.

In Figure 2, there are four patterns: two NUMBER and two TIME patterns. The TIME patterns are very simple, each consisting just of YEAR elements, although MONTH, DAY, HOUR, MINUTE, etc. elements are possible. The NUMBER patterns each consist of a single NUMBER (although multiple NUMBER constituents are possible, e.g., "one thousand") and one UNIT constituent. The types of these patterns are indicated by the PATTERN attribute.

4.1.4 Gapping Constructions

Figures 1 and 2 are corresponding PTB and GLARF representations of gapping. Penn represents gapping via "parallel" indices for corresponding arguments. In GLARF, the shared verb is at the head of two HEAD arcs. GLARF overcomes some problems with structure-sharing analyses of gapping constructions. The verb gap is a "sloppy" (Ross, 1967) copy of the original verb: two separate spending events are represented by one verb. Intuitively, structure sharing implies token identity, whereas type identity would be more appropriate. In addition, the copied verb need not agree with the subject in the second conjunct, e.g., "was", not "were", would agree with the second conjunct in "the risks were too high and the potential payoff too far in the future". It is thus problematic to view the gap as identical in every way to the filler in this case. In GLARF, we can thus distinguish the gapping sort of logical arc (L-GAPPING-HEAD) from the other types of L-HEAD arcs. We can stipulate that a gapping logical arc represents an appropriately inflected copy of the phrase at the head of that arc.

In GLARF, the predicate is always explicit. However, Penn's representation (H. Koti, pc) provides an easy way to represent complex cases, e.g., "John wanted to buy gold, and Mary *gap* silver." In GLARF, the gap would be filled by the non-constituent "wanted to buy". Unfortunately, we believe that this is a necessary burden: a goal of GLARF is to explicitly mark all PRED-ARG relations. Given parallel indices, the user must extract the predicate from the text by (imperfect) automatic means. The current solution for GLARF is to provide multiple gaps. The second conjunct of the example in question would have the following analysis: (S (SBJ NP-i) (PRD (VP (HEAD gap-1) (COMP (S (PRD (VP (HEAD gap-2) (OBJ silver)))))))), where gap-1 is filled by "wanted", gap-2 is filled by "to buy" and NP-i is bound to "Mary".
5 Japanese GLARF

Japanese GLARF will have many of the same specifications described above. To illustrate how we will extend GLARF to Japanese, we discuss two difficult-to-represent phenomena: elision and stacked postpositions.

Grammatical analyses of Japanese are often dependency trees which use postpositions as arc labels. Arguments, when elided, are omitted from the analysis. In GLARF, however, we use role labels like SBJ, OBJ, IND-OBJ and COMP and mark elided constituents as zeroed arguments. In the case of stacked postpositions, we represent the different roles via different arcs. We also reanalyze certain postpositions as being complementizers (subordinators) or adverbs, thus excluding them from canonical roles. By reanalyzing this way, we arrived at two types of true stacked postpositions: nominalization and topicalization. For example, in Figure 3, the topicalized NP is at the head of two arcs, labeled S-TOP and L-COMP, and the associated postpositions are analyzed as morphological case attributes.

[Figure 3: Stacked Postpositions in GLARF (figure not reproduced)]

6 Testing the Procedures

To test our mapping procedures, we apply them to some PTB files and then correct the resulting representation using ANNOTATE (Brants and Plaehn, 2000), a program for annotating edge-labeled trees and DAGs, originally created for the NEGRA corpus. We chose both files that we have used extensively to tune the mapping procedures (training) and other files. We then convert the resulting GLARF feature structures into triples of the form (Role-Name, Pivot, Non-Pivot) for all logical arcs (cf. (Caroll et al., 1998)), using automatic procedures. The "pivot" is the head of headed structures, but may be some other constituent in non-headed structures. For example, in a conjoined phrase, the pivot is the conjunction, and the head would be the list of heads of the conjuncts. Rather than listing the whole pivot and non-pivot phrases in the triples, we simply list the heads of these phrases, which is usually a single word. Finally, we compute precision and recall by comparing the triples generated from our procedures to triples generated from the corrected GLARF. An exact match is a correct answer and anything else is incorrect. (We admit a bias towards our output in a small number of cases, less than 1%. For example, it is unimportant whether "exposed to it" modifies "the group" or "workers" in "a group of workers exposed to it"; the output will get full credit for this example regardless of where the reduced relative is attached. (Caroll et al., 1998) report about 88% precision and recall for similar triples derived from parser output; however, they allow triples to match in some cases when the roles are different and they do not mark modifier relations.)

6.1 The Test and the Results

We developed our mapping procedures in two stages. We implemented some mapping procedures based on PTB manuals, related papers and actual usage of labels in PTB. After our initial implementation, we tuned the procedures based on a training set of 64 sentences from two PTB files, wsj_0003 and wsj_0051, yielding 1285 triples. Then we tested these procedures against a test set consisting of 65 sentences from wsj_0089 (1369 triples). Our results are provided in Figure 4. Precision and recall are calculated on a per-sentence basis and then averaged. The precision for a sentence is the number of correct triples divided by the total number of triples generated. The recall is the total number of correct triples divided by the total number of triples in the answer key.

Out of 187 incorrect triples in the test corpus, 31 reflected the incorrect role being selected, e.g., the adjunct/complement distinction, 139 reflected errors or omissions in our procedures and 7 triples related to other factors. We expect a sizable improvement as we increase the size of our training corpus and expand the coverage of our procedures, particularly since one omission often resulted in several incorrect triples.

  Data      Sentences  Recall  Precision
  Training  64         94.4    94.3
  Test      65         89.0    89.7

Figure 4: Results
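The scoring described in Section 6.1 can be sketched directly. This is our own illustration of the stated definitions (exact-match triples, per-sentence precision and recall); the function name and the toy triples are hypothetical:

```python
# Sketch of the Section 6 evaluation: GLARF analyses are flattened into
# (role, pivot, non-pivot) triples and scored by exact match against the
# hand-corrected key; per-sentence scores are then averaged over the corpus.

def score_sentence(generated, key):
    correct = len(set(generated) & set(key))        # exact matches only
    precision = correct / len(generated) if generated else 0.0
    recall = correct / len(key) if key else 0.0
    return precision, recall

# Toy triples for one sentence (hypothetical, loosely based on Figure 1):
gen = [("SBJ", "spent", "they"), ("OBJ", "spent", "$"), ("ADV", "spent", "in")]
key = [("SBJ", "spent", "they"), ("OBJ", "spent", "$"), ("TMP", "spent", "in")]

p, r = score_sentence(gen, key)
assert (p, r) == (2/3, 2/3)   # the role mismatch counts as incorrect
```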
7 Concluding Remarks

We show that it is possible to automatically map PTB input into PRED-ARG structure with high accuracy. While our initial results are promising, mapping procedures are limited by available resources. To produce the best possible GLARF resource, hand correction will be necessary.

We are improving our mapping procedures and extending them to PTB-based parser output. We are creating mapping procedures for the Susanne corpus, the Kyoto Corpus and the UAM Treebank. This work is a precursor to the creation of a trilingual GLARF treebank.

We are currently defining the problem of mapping treebanks into GLARF. Subsequently, we intend to create standardized mapping rules which can be applied by any number of algorithms. The end result may be that detailed parsing can be carried out in two stages. In the first stage, one derives a parse at the level of detail of the Penn Treebank II. In the second stage, one derives a more detailed parse. The advantage of such a division should be obvious: one is free to find the best procedures for each stage and combine them. These procedures could come from different sources and use totally different methods.

Acknowledgements

This research was supported by the Defense Advanced Research Projects Agency under Grant N66001-00-1-8917 from the Space and Naval Warfare Systems Center, San Diego, and by the National Science Foundation under Grant IIS-0081962.

References

T. Brants and O. Plaehn. 2000. Interactive corpus annotation. LREC 2000, pages 453-459.

T. Brants, W. Skut, and B. Krenn. 1997. Tagging Grammatical Functions. In EMNLP-2.

J. Caroll, T. Briscoe, and A. Sanfillippo. 1998. Parse Evaluation: a Survey and a New Proposal. LREC 1998, pages 447-454.

B. Carpenter. 1992. The Logic of Typed Features. Cambridge University Press, New York.

Z. Harris. 1968. Mathematical Structures of Language. Wiley-Interscience, New York.

D. Johnson and P. Postal. 1980. Arc Pair Grammar. Princeton University Press, Princeton.

D. Johnson, A. Meyers, and L. Moss. 1993. A Unification-Based Parser for Relational Grammar. ACL 1993, pages 97-104.

B. Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago.

C. Macleod, R. Grishman, and A. Meyers. 1998a. COMLEX Syntax. Computers and the Humanities, 31(6):459-481.

C. Macleod, R. Grishman, A. Meyers, L. Barrett, and R. Reeves. 1998b. NOMLEX: A lexicon of nominalizations. Euralex 98.

M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the 1994 ARPA Human Language Technology Workshop.

Y. Matsumoto, H. Ishimoto, T. Utsuro, and M. Nagao. 1993. Structural Matching of Parallel Texts. In ACL 1993.

A. Meyers, R. Yangarber, and R. Grishman. 1996. Alignment of Shared Forests for Bilingual Corpora. Coling 1996, pages 460-465.

A. Meyers, M. Kosaka, and R. Grishman. 2000. Chart-Based Transfer Rule Application in Machine Translation. Coling 2000, pages 537-543.

A. Moreno, R. Grishman, S. Lopez, F. Sanchez, and S. Sekine. 2000. A treebank of Spanish and its application to parsing. LREC, pages 107-111.

D. Perlmutter. 1984. Studies in Relational Grammar 1. University of Chicago Press, Chicago.

J. Ross. 1967. Constraints on Variables in Syntax. Ph.D. thesis, MIT.

G. Sampson. 1995. English for the Computer: The Susanne Corpus and Analytic Scheme. Clarendon Press, Oxford.
