arXiv:cs/9902001v1 [cs.CL] 31 Jan 1999

Compacting the Penn Treebank Grammar

Alexander Krotov, Mark Hepple, Robert Gaizauskas and Yorick Wilks
Department of Computer Science, Sheffield University
211 Portobello Street, Sheffield S1 4DP, UK
{alexk, hepple, robertg, yorick}@dcs.shef.ac.uk

Abstract

Treebanks, such as the Penn Treebank (PTB), offer a simple approach to obtaining a broad coverage grammar: one can simply read the grammar off the parse trees in the treebank. While such a grammar is easy to obtain, a square-root rate of growth of the rule set with corpus size suggests that the derived grammar is far from complete and that much more treebanked text would be required to obtain a complete grammar, if one exists at some limit. However, we offer an alternative explanation in terms of the underspecification of structures within the treebank. This hypothesis is explored by applying an algorithm to compact the derived grammar by eliminating redundant rules – rules whose right hand sides can be parsed by other rules. The size of the resulting compacted grammar, which is significantly less than that of the full treebank grammar, is shown to approach a limit. However, such a compacted grammar does not yield very good performance figures. A version of the compaction algorithm taking rule probabilities into account is proposed, which is argued to be more linguistically motivated. Combined with simple thresholding, this method can be used to give a 58% reduction in grammar size without significant change in parsing performance, and can produce a 69% reduction with some gain in recall, but a loss in precision.

1 Introduction

The Penn Treebank (PTB) (Marcus et al., 1994) has been used for a rather simple approach to deriving large grammars automatically: one where the grammar rules are simply 'read off' the parse trees in the corpus, with each local subtree providing the left and right hand sides of a rule. Charniak (Charniak, 1996) reports precision and recall figures of around 80% for a parser employing such a grammar. In this paper we show that the huge size of such a treebank grammar (see below) can be reduced without appreciable loss in performance and, in fact, with an improvement in recall.[1]

Our approach can be generalised in terms of Data-Oriented Parsing (DOP) methods (see (Bonnema et al., 1997)), with a tree depth of 1. However, the number of trees produced with a general DOP method is so large that Bonnema (Bonnema et al., 1997) has to resort to restricting the tree depth, using a very domain-specific corpus such as ATIS or OVIS, and parsing very short sentences of average length 4.74 words. Our compaction algorithm can be extended for use within the DOP framework but, because of the huge size of the derived grammar (see below), we chose to use the simplest PCFG framework for our experiments.

We are concerned with the nature of the rule set extracted, and how it can be improved, with regard both to linguistic criteria and processing efficiency. In what follows, we report the worrying observation that the growth of the rule set continues at a square-root rate throughout processing of the entire treebank, suggesting, perhaps, that the rule set is far from complete. Our results are similar to those reported in (Krotov et al., 1994). We discuss an alternative possible source of this rule growth phenomenon, partial bracketting, and suggest that it can be alleviated by compaction, where rules that are redundant (in a sense to be defined) are eliminated from the grammar.

Our experiments on compacting a PTB treebank grammar resulted in two major findings: one, that the grammar can at the limit be compacted to about 7% of its original size, and that the rule number growth of the compacted grammar stops at some point. The other is that a 58% reduction can be achieved with no loss in parsing performance, whereas a 69% reduction yields a gain in recall, but a small loss in precision. This, we believe, gives further support to the utility of treebank grammars and to the compaction method. For example, compaction methods can be applied within the DOP framework to reduce the number of trees. Also, by partially lexicalising the rule extraction process (i.e., by using some more frequent words as well as the part-of-speech tags), we may be able to achieve parsing performance similar to the best results in the field, obtained in (Collins, 1996).

[1] For the complete investigation of the grammar extracted from the Penn Treebank II see (Gaizauskas, 1995).
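To make the rule extraction concrete, here is a minimal sketch of 'reading off' a grammar from PTB-style bracketed trees. The bracketed-string reader, the dropping of lexical material, and the toy example are our own simplifications; the paper's actual extraction, which also collapses unary and epsilon rules (see Section 4), is not reproduced here.

```python
from collections import Counter

def read_tree(s):
    """Parse a PTB-style bracketed string into (label, children) tuples;
    leaves are plain word strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0
    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)
    return parse()

def read_off_rules(tree, bag):
    """Each local subtree yields one rule: LHS = node label, RHS = the
    sequence of child labels. Words are dropped, so preterminal nodes
    (e.g. DT dominating 'the') contribute no rule here."""
    label, children = tree
    rhs = tuple(c[0] for c in children if isinstance(c, tuple))
    if rhs:
        bag[(label, rhs)] += 1
    for c in children:
        if isinstance(c, tuple):
            read_off_rules(c, bag)
    return bag

bag = read_off_rules(
    read_tree("(S (NP (DT the) (NN dog)) (VP (VB barks)))"), Counter())
# bag now holds S -> NP VP, NP -> DT NN and VP -> VB, each with count 1
```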
2 Growth of the Rule Set

One could investigate whether there is a finite grammar that should account for any text within a class of related texts (i.e. a domain-oriented sub-grammar of English). If there is, the number of extracted rules will approach a limit as more sentences are processed, i.e. as the rule number approaches the size of such an underlying and finite grammar.

We had hoped that some approach to a limit would be seen using PTB II (Marcus et al., 1994), which is larger and more consistently annotated than PTB I. As shown in Figure 1, however, the rule number growth continues unabated even after more than 1 million part-of-speech tokens have been processed.

[Figure 1: Rule Set Growth for Penn Treebank II – the number of rules (0 to 20,000) plotted against the percentage of the corpus processed (0 to 100%).]
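Figure 1 is, in effect, the output of the obvious incremental count: accumulate the distinct rules seen after each successive slice of the corpus. A short sketch, assuming the corpus is given as a list of bracketed-tree strings and reusing read_tree and read_off_rules from the sketch above:

```python
from collections import Counter

def growth_curve(tree_strings, checkpoints=10):
    """Distinct-rule count after each successive slice of the corpus."""
    seen = Counter()
    step = max(1, len(tree_strings) // checkpoints)
    curve = []
    for i, s in enumerate(tree_strings, 1):
        read_off_rules(read_tree(s), seen)  # accumulate rules from this tree
        if i % step == 0 or i == len(tree_strings):
            curve.append((round(100 * i / len(tree_strings)), len(seen)))
    return curve  # (percentage processed, rule count) points, as in Figure 1
```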
3 Rule Growth and Partial Bracketting

Why should the set of rules continue to grow in this way? Putting aside the possibility that natural languages do not have finite rule sets, we can think of two possible answers. First, it may be that the full "underlying grammar" is much larger than the rule set that has so far been produced, requiring a much larger treebanked corpus than is now available for its extraction. If this were true, then the outlook would be bleak for achieving near-complete grammars from treebanks, given the resource demands of producing such corpora. However, the radical incompleteness of grammar that this alternative implies seems incompatible with the promising parsing results that Charniak reports (Charniak, 1996).

A second answer is suggested by the presence in the extracted grammar of rules such as (1).[2] This rule is suspicious from a linguistic point of view, and we would expect that the text from which it has been extracted should more properly have been analysed using rules (2,3), i.e. as a coordination of two simpler NPs.

NP → DT NN CC DT NN   (1)
NP → NP CC NP         (2)
NP → DT NN            (3)

Our suspicion is that this example reflects a widespread phenomenon of partial bracketting within the PTB. Such partial bracketting will arise during the hand-parsing of texts, with (human) parsers adding brackets where they are confident that some string forms a given constituent, but leaving out many brackets where they are less confident of the constituent structure of the text. This will mean that many rules extracted from the corpus will be 'flatter' than they should be, corresponding properly to what should be the result of using several grammar rules, showing only the top node and leaf nodes of some unspecified tree structure (where the 'leaf nodes' here are category symbols, which may be nonterminal). For the example above, a tree structure that should properly have been given as (4) has instead received only the partial analysis (5), from the flatter 'partial-structure' rule (1).

(4) [NP [NP DT NN] CC [NP DT NN]]

(5) [NP DT NN CC DT NN]

[2] PTB POS tags are used here, i.e. DT for determiner, CC for coordinating conjunction (e.g. 'and'), NN for noun.

4 Grammar Compaction

The idea of partiality of structure in treebanks and their grammars suggests a route by which treebank grammars may be reduced in size, or compacted as we shall call it, by the elimination of partial-structure rules. A rule that may be eliminable as a partial-structure rule is one that can be 'parsed' (in the familiar sense of context-free parsing) using other rules of the grammar. For example, the rule (1) can be parsed using the rules (2,3), as the structure (4) demonstrates. Note that, although a partial-structure rule should be parsable using other rules, it does not follow that every rule which is so parsable is a partial-structure rule that should be eliminated. There may be linguistically correct rules which can be parsed. This is a topic to which we will return at the end of the paper (Sec. 6). For most of what follows, however, we take the simpler path of assuming that the parsability of a rule is not only necessary, but also sufficient, for its elimination.

Rules which can be parsed using other rules in the grammar are redundant in the sense that eliminating such a rule will never have the effect of making a sentence unparsable that could previously be parsed.[3]

The algorithm we use for compacting a grammar is straightforward. A loop is followed whereby each rule R in the grammar is addressed in turn. If R can be parsed using other rules (which have not already been eliminated) then R is deleted (and the grammar without R is used for parsing further rules). Otherwise R is kept in the grammar. The rules that remain when all rules have been checked constitute the compacted grammar.

An interesting question is whether the result of compaction is independent of the order in which the rules are addressed. In general, this is not the case, as is shown by the following rules, of which (8) and (9) can each be used to parse the other, so that whichever is addressed first will be eliminated, whilst the other will remain.

B → C     (6)
C → B     (7)
A → B B   (8)
A → C C   (9)

Order-independence can be shown to hold for grammars that contain no unary or epsilon ('empty') rules, i.e. rules whose right hand sides have one or zero elements. The grammar that we have extracted from PTB II, and which is used in the compaction experiments reported in the next section, is one that excludes such rules: unary and epsilon rules were collapsed with the sister nodes, e.g. the structure (S (NP -NULL-) (VP VB (NP (QP ...))) .) will produce the rules S -> VP ., VP -> VB QP and QP -> ....[4] For further discussion, and for the proof of the order independence, see (Krotov, 1998).

[3] Thus, wherever a sentence has a parse P that employs the parsable rule R, it also has a further parse that is just like P, except that any use of R is replaced by a more complex substructure, i.e. a parse of R.

[4] See (Gaizauskas, 1995) for discussion.
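The compaction loop is easy to state in code. Below is a minimal sketch in which the parsability of a rule is decided by a naive memoized search that tries to derive the rule's left hand side from its right-hand-side symbol string using the surviving rules. The paper itself used full context-free parsers for this test (BOOGIE and the Apple Pie Parser; see Section 5), so this is an illustration of the idea rather than the original implementation, and it assumes, as the extracted grammar guarantees, that there are no unary or epsilon rules.

```python
def parsable(rule, grammar):
    """True if rule's RHS can be parsed to its LHS using grammar minus rule.
    Assumes no unary or epsilon rules, which guarantees termination."""
    lhs, rhs = rule
    others = [r for r in grammar if r != rule]
    memo = {}

    def derives(cat, span):
        # Can the category sequence `span` be reduced to the category `cat`?
        if len(span) == 1 and span[0] == cat:
            return True
        if (cat, span) not in memo:
            memo[(cat, span)] = any(
                a == cat and len(beta) <= len(span) and splits(beta, span)
                for (a, beta) in others)
        return memo[(cat, span)]

    def splits(beta, span):
        # Cut `span` into len(beta) non-empty pieces, each derivable from
        # the corresponding symbol of beta.
        if not beta:
            return not span
        head, rest = beta[0], beta[1:]
        return any(derives(head, span[:k]) and splits(rest, span[k:])
                   for k in range(1, len(span) - len(rest) + 1))

    return derives(lhs, tuple(rhs))

def compact(grammar):
    """The loop described above: visit each rule in turn and delete it if
    the rules surviving so far can parse it."""
    kept = list(grammar)
    for rule in list(kept):
        if parsable(rule, kept):
            kept.remove(rule)
    return kept

g = [("NP", ("DT", "NN", "CC", "DT", "NN")),  # rule (1)
     ("NP", ("NP", "CC", "NP")),              # rule (2)
     ("NP", ("DT", "NN"))]                    # rule (3)
assert compact(g) == g[1:]  # the flat rule (1) is eliminated
```

Note that a single pass suffices: deleting a rule can only remove derivations, so a rule kept earlier can never become parsable later in the same pass.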
5 Experiments

We conducted a number of compaction experiments.[5] First, the complete grammar was parsed as described in Section 4. The results exceeded our expectations: the set of 17,529 rules reduced to only 1,667 rules, a better than 90% reduction.

To investigate in more detail how the compacted grammar grows, we conducted a further experiment involving a staged compaction of the grammar. Firstly, the corpus was split into 10% chunks (by number of files) and the rule sets extracted from each. The staged compaction then proceeded as follows: the rule set of the first 10% was compacted, then the rules for the next 10% were added and the resulting set again compacted, then the rules for the next 10% added, and so on. The results of this experiment are shown in Figure 2.

[Figure 2: Compacted Grammar Size – the number of rules (0 to 2,000) plotted against the percentage of the corpus processed (0 to 100%).]

At 50% of the corpus processed the compacted grammar size actually exceeds the level it reaches at 100%, and thereafter the overall grammar size goes down as well as up. This reflects the fact that new rules are either redundant, or make "old" rules redundant, so that the compacted grammar size seems to approach a limit.

[5] For these experiments, we used two parsers: Stolcke's BOOGIE (Stolcke, 1995) and Sekine's Apple Pie Parser (Sekine and Grishman, 1995).

6 Retaining Linguistically Valid Rules

Even though parsable rules are redundant in the sense defined above, it does not follow that they should always be removed. In particular, there are cases where the flatter structure allowed by some rule may be more linguistically correct, rather than simply a case of partial bracketting. Consider, for example, the (linguistically plausible) rules (10,11,12). Rules (11) and (12) can be used to parse (10), but the latter should not be eliminated, as there are cases where the flatter structure it allows is more linguistically correct.

VP → VB NP PP   (10)
VP → VB NP      (11)
NP → NP PP      (12)

(13) i. [VP VB NP PP]   ii. [VP VB [NP NP PP]]

We believe that a solution to this problem can be found by exploiting the data provided by the corpus. Frequency of occurrence data for rules can be collected from the corpus and used to assign probabilities to rules, and hence to the structures they allow, so as to produce a probabilistic context-free grammar. Where a parsable rule is correct rather than merely partially bracketted, we expect this fact to be reflected in rule and parse probabilities (reflecting the occurrence data of the corpus), which can be used to decide when a rule that may be eliminated should be eliminated. In particular, a rule should be eliminated only when the more complex structure allowed by other rules is more probable than the simpler structure that the rule itself allows.

We developed a linguistic compaction algorithm employing the ideas just described, but we cannot present it here due to space limitations.
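Since the paper leaves its linguistic compaction algorithm unpresented, the following is only a hedged sketch of the stated criterion: eliminate a parsable rule only if the best re-parse of its right hand side, scored as a product of rule probabilities, is more probable than the rule itself. The relative-frequency estimate and the reuse of the derives/splits search pattern from the earlier sketch are our own assumptions, not the authors' algorithm.

```python
from collections import Counter

def rule_probs(counts):
    """Relative-frequency PCFG estimate: P(A -> beta) = c(A -> beta) / c(A)."""
    totals = Counter()
    for (a, beta), c in counts.items():
        totals[a] += c
    return {r: c / totals[r[0]] for r, c in counts.items()}

def best_reparse_prob(rule, probs):
    """Probability of the best parse of rule's RHS using the other rules
    (product of their probabilities); 0.0 if the rule is not parsable.
    Again assumes no unary or epsilon rules."""
    lhs, rhs = rule
    others = [r for r in probs if r != rule]

    def derives(cat, span):
        best = 1.0 if (len(span) == 1 and span[0] == cat) else 0.0
        for (a, beta) in others:
            if a == cat and len(beta) <= len(span):
                best = max(best, probs[(a, beta)] * splits(beta, span))
        return best

    def splits(beta, span):
        if not beta:
            return 1.0 if not span else 0.0
        head, rest = beta[0], beta[1:]
        return max((derives(head, span[:k]) * splits(rest, span[k:])
                    for k in range(1, len(span) - len(rest) + 1)),
                   default=0.0)

    return derives(lhs, tuple(rhs))

def linguistic_compact(counts):
    """Keep a parsable rule when its own probability matches or beats the
    best re-parse of its right hand side."""
    probs = rule_probs(counts)
    return [r for r in probs if probs[r] >= best_reparse_prob(r, probs)]
```

Under this reading, rule (10) survives whenever P(VP → VB NP PP) is at least P(VP → VB NP) · P(NP → NP PP), which is exactly the situation the example above describes.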
The preliminary results of our experiments are presented in Table 1. Simple thresholding (removing rules that only occur once) was also used, to achieve the maximum compaction ratio. For labelled as well as unlabelled evaluation of the resulting parse trees we used the evalb software by Satoshi Sekine. See (Krotov, 1998) for the complete presentation of our methodology and results.

                          Full     Simply       Fully       Linguistically compacted
                                   thresholded  compacted   Grammar 1    Grammar 2
Labelled evaluation
  Recall                  70.55%   70.78%       30.93%      71.55%       70.76%
  Precision               77.89%   77.66%       19.18%      72.19%       77.21%
Unlabelled evaluation
  Recall                  73.49%   73.71%       43.61%      74.72%       73.67%
  Precision               81.44%   80.87%       27.04%      75.39%       80.39%
Grammar size              15,421   7,278        1,122       4,820        6,417
Reduction (as % of full)  0%       53%          93%         69%          58%

Table 1: Preliminary results of evaluating the grammar compaction method

As one can see, the fully compacted grammar yields poor recall and precision figures. This may be because collapsing the rules often produces too much substructure (hence the lower precision figures), and also because many longer rules in fact encode valid linguistic information. However, linguistic compaction combined with simple thresholding achieves a 58% reduction without any loss in performance, and a 69% reduction even yields higher recall.

7 Conclusions

We see the principal results of our work to be the following:

• the result showing continued square-root growth in the rule set extracted from the PTB II;

• the analysis of the source of this continued growth in terms of partial bracketting, and the justification this provides for compaction via rule-parsing;

• the result that the compacted rule set does approach a limit at some point during staged rule extraction and compaction, after a sufficient amount of input has been processed;

• that, though the fully compacted grammar produces lower parsing performance than the extracted grammar, a 58% reduction (without loss) can still be achieved by using linguistic compaction, and a 69% reduction yields a gain in recall, but a loss in precision.

The latter result in particular provides further support for the possible future utility of the compaction algorithm. Our method is similar to that used by Shirai (Shirai et al., 1995), but the principal differences are as follows. First, their algorithm does not employ full context-free parsing in determining the redundancy of rules, considering instead only direct composition of the rules (so that only parses of depth 2 are addressed). Moreover, we proved that, except in cases of 'mutual parsability' (discussed in Section 4), the result of compaction is independent of the order in which the rules in the grammar are parsed; in such cases our algorithm retains one of the rules, whereas Shirai's algorithm will eliminate both, so that coverage is lost. Secondly, it is not clear that compaction will work in the same way for English as it did for Japanese.
References

Remko Bonnema, Rens Bod, and Remko Scha. 1997. A DOP model for semantic interpretation. In Proceedings of the European Chapter of the ACL, pages 159–167.

Eugene Charniak. 1996. Tree-bank grammars. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 1031–1036. MIT Press, August.

Michael Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the ACL.

Robert Gaizauskas. 1995. Investigations into the grammar underlying the Penn Treebank II. Research Memorandum CS-95-25, University of Sheffield.

Alexander Krotov, Robert Gaizauskas, and Yorick Wilks. 1994. Acquiring a stochastic context-free grammar from the Penn Treebank. In Proceedings of the Third Conference on the Cognitive Science of Natural Language Processing, pages 79–86, Dublin.

Alexander Krotov. 1998. Notes on compacting the Penn Treebank grammar. Technical Memo, Department of Computer Science, University of Sheffield.

M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the ARPA Speech and Natural Language Workshop.

Satoshi Sekine and Ralph Grishman. 1995. A corpus-based probabilistic grammar with only two non-terminals. In Proceedings of the Fourth International Workshop on Parsing Technologies.

Kiyoaki Shirai, Takenobu Tokunaga, and Hozumi Tanaka. 1995. Automatic extraction of Japanese grammar from a bracketed corpus. In Proceedings of the Natural Language Processing Pacific Rim Symposium, Korea, December.

Andreas Stolcke. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2):165–201.
