Two Algorithms for Learning the Parameters of Stochastic Context-Free Grammars

Brent Heeringa
Department of Computer Science
University of Massachusetts, Amherst
140 Governor's Drive
Amherst, MA 01003
[email protected]

Tim Oates
Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250
[email protected]

Copyright © 2001, American Association for Artificial Intelligence. All rights reserved.

Abstract

Stochastic context-free grammars (SCFGs) are often used to represent the syntax of natural languages. Most algorithms for learning them require storage and repeated processing of a sentence corpus. The memory and computational demands of such algorithms are ill-suited for embedded agents such as a mobile robot. Two algorithms are presented that incrementally learn the parameters of stochastic context-free grammars as sentences are observed. Both algorithms require a fixed amount of space regardless of the number of sentence observations. Despite using less information than the inside-outside algorithm, the algorithms perform almost as well.

Introduction

Although natural languages are not entirely context free, stochastic context-free grammars (SCFGs) are an effective representation for capturing much of their structure. However, for embedded agents, most algorithms for learning SCFGs from data have two shortcomings. First, they need access to a corpus of complete sentences, requiring the agent to retain every sentence it hears. Second, they are batch algorithms that make repeated passes over the data, often requiring significant computation in each pass. These shortcomings are addressed through two online algorithms called SPAN¹ and PRESPAN² that learn the parameters of SCFGs using only summary statistics in combination with repeated sampling techniques.

¹ SPAN stands for Sample Parse Adjust Normalize.
² PRESPAN is so named because it is the predecessor of SPAN, so it literally means pre-SPAN.

SCFGs contain both structure (i.e. rules) and parameters (i.e. rule probabilities). One approach to learning SCFGs from data is to start with a grammar containing all possible rules that can be created from some alphabet of terminals and non-terminals. Typically the size of the right-hand side of each rule is bounded by a small constant (e.g. 2). Then an algorithm for learning parameters is applied and allowed to "prune" rules by setting their expansion probabilities to zero (Lari & Young, 1990). PRESPAN and SPAN operate in this paradigm by assuming a fixed structure and modifying the parameters.

Given a SCFG to be learned, both algorithms have access to the structure of the grammar and a set of sentences generated by the grammar. The correct parameters are unknown. PRESPAN and SPAN begin by parsing the sentence corpus using a chart parser. Note that the parse of an individual sentence does not depend on the parameters; it only depends on the structure. However, the distribution of sentences parsed does depend on the parameters of the grammar used to generate them. Both algorithms associate with each rule a histogram that records the number of times the rule is used in parses of the individual sentences.

PRESPAN and SPAN make an initial guess at the values of the parameters by setting them randomly. They then generate a corpus of sentences with these parameters and parse them, resulting in a second set of histograms. The degree to which the two sets of histograms differ is a measure of the difference between the current parameter estimates and the target parameters. PRESPAN modifies its parameter estimates so the sum total difference between the histograms is minimized. In contrast, SPAN modifies its estimates so the difference between individual histograms is minimized. Empirical results show that this procedure yields parameters that are close to those found by the inside-outside algorithm.

Stochastic Context-Free Grammars

Stochastic context-free grammars³ are the natural extension of context-free grammars to the probabilistic domain (Sipser, 1997; Charniak, 1993). Said differently, they are context-free grammars with probabilities associated with each rule. Formally, a SCFG is a four-tuple $M = \langle V, \Sigma, R, S \rangle$ where

1. $V$ is a finite set of non-terminals.
2. $\Sigma$ is a finite set, disjoint from $V$, of terminals.
3. $R$ is a finite set of rules of the form $A \rightarrow w$ where $A$ belongs to $V$, and $w$ is a finite string composed of elements from $V$ and $\Sigma$. We refer to $A$ as the left-hand side (LHS) of the rule and $w$ as the right-hand side (RHS), or expansion, of the rule. Additionally, each rule $r$ has an associated probability $p(r)$ such that the probabilities of rules with the same left-hand side sum to 1.
4. $S$ is the start symbol.

³ Stochastic Context-Free Grammars are often called Probabilistic Context-Free Grammars.

Grammars can either be ambiguous or unambiguous. Ambiguous grammars can generate the same string in multiple ways. Unambiguous grammars cannot.

Table 1: A grammar that generates palindromes

S → A A
S → B B
S → A C
S → B D
C → S A
D → S B
A → Y
B → Z
Y → y
Z → z

Learning Stochastic Context-Free Grammars

Learning context-free grammars is the problem of inducing a context-free structure (or model) from a corpus of sentences (i.e., data). When the grammars are stochastic, one faces the additional problem of learning rule probabilities (parameters) from the corpus. Given a set of sentence observations $O = \{o_0, \ldots, o_{n-1}\}$, the goal is to discover the grammar that generated $O$. Typically, this problem is framed in terms of a search in grammar space where the objective function is the likelihood of the data given the grammar.
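A set of observations such as $O$ is produced by sampling sentences from a grammar. The following is a minimal sketch, not the authors' implementation, of how the palindrome grammar of Table 1 might be represented and sampled. It reads the last rule of Table 1 as Z → z, and the rule probabilities, the names `PALINDROME`, `check_normalized` and `sample_sentence`, and the dictionary layout are all assumptions made for illustration; the paper gives no probabilities for this grammar. The rule counts are read off the generating derivation rather than obtained from a chart parser.

```python
import random
from collections import Counter

# Rules of Table 1 (last rule read as Z -> z). Each left-hand side maps to a
# list of (right-hand side, probability) pairs. The probabilities are
# hypothetical values chosen only so that each left-hand side sums to 1.
PALINDROME = {
    "S": [(("A", "A"), 0.3), (("B", "B"), 0.3), (("A", "C"), 0.2), (("B", "D"), 0.2)],
    "C": [(("S", "A"), 1.0)],
    "D": [(("S", "B"), 1.0)],
    "A": [(("Y",), 1.0)],
    "B": [(("Z",), 1.0)],
    "Y": [(("y",), 1.0)],
    "Z": [(("z",), 1.0)],
}


def check_normalized(grammar, tol=1e-9):
    """Return True if the probabilities for every left-hand side sum to 1."""
    return all(abs(sum(p for _, p in rules) - 1.0) <= tol
               for rules in grammar.values())


def sample_sentence(grammar, rng, start="S"):
    """Expand `start` top-down and return (terminals, rule-use counts)."""
    counts = Counter()

    def expand(symbol):
        if symbol not in grammar:          # symbols without rules are terminals
            return [symbol]
        rules = grammar[symbol]
        rhs, _ = rng.choices(rules, weights=[p for _, p in rules], k=1)[0]
        counts[(symbol, rhs)] += 1         # one more use of rule symbol -> rhs
        words = []
        for child in rhs:
            words.extend(expand(child))
        return words

    return expand(start), counts


if __name__ == "__main__":
    rng = random.Random(0)
    words, counts = sample_sentence(PALINDROME, rng)
    print(" ".join(words))
    for (lhs, rhs), n in sorted(counts.items()):
        print(f"{lhs} -> {' '.join(rhs)}: used {n} time(s)")
```

Keeping one `Counter` per sentence mirrors the per-sentence rule counts that the histograms summarize; the counts can be discarded together with the sentence once they have been added to a histogram.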
While the problem of incrementally learning the structure of SCFGs is interesting in its own right, the main focus here is on learning parameters. For a thorough overview of existing techniques for learning structure, see (Stolcke, 1994; Chen, 1996; Nevill-Manning & Witten, 1997).

Learning Parameters

The inside-outside algorithm (Lari & Young, 1990; Lari & Young, 1991) is the standard method for estimating parameters in SCFGs. The algorithm uses the general-purpose expectation-maximization (EM) procedure. Almost all parameter learning is done batch style using some version of the inside-outside algorithm. For example, in learning parameters, (Chen, 1995) initially estimates the rule probabilities using the most probable parse of the sentences given the grammar (the Viterbi parse) and then uses a "post-pass" procedure that incorporates the inside-outside algorithm.

To use EM the entire sentence corpus must be stored. While this storage may not be in the form of actual sentences, it is always in some representation that easily allows the reconstruction of the original corpus (e.g., the chart of a chart parse). Because we are interested in language acquisition in embedded agents over long periods of time, the prospect of memorizing and repeatedly processing entire sentence corpora is unpalatable.

This motivation also carries a desire to easily adjust our parameters when new sentences are encountered. That is, we want to learn production probabilities incrementally. While the inside-outside algorithm can incorporate new sentences in between iterations, it still uses the entire sentence corpus for estimation.

The Algorithms

PRESPAN and SPAN address some of the concerns given in the previous section. Both are unsupervised, incremental algorithms for finding parameter estimates in stochastic context-free grammars.

PRESPAN and SPAN are incremental in two ways. First, in the classical sense, at every iteration they use the previously learned parameter estimate as a stepping stone to the new one. Second, both algorithms naturally allow new data to contribute to learning without restarting the entire process.

PRESPAN and SPAN use only a statistical summary of the observation data for learning. Both store the summary information in histograms. SPAN and PRESPAN also record histogram information about their current parameter estimates, so the addition of new sentences typically does not increase the memory requirements. Furthermore, the histograms play a crucial role in learning. If the parameter estimates are accurate, the histograms of the observed data should resemble the histograms of the current parameterization. When the histograms do not resemble each other, the difference is used to guide the learning process.

A Description of PRESPAN

Let $T = \langle V, \Sigma, R, S \rangle$ be a SCFG and let $O = \{o_0, \ldots, o_{n-1}\}$ be a set of $n$ sentences generated stochastically from $T$. Let $M = \langle V, \Sigma, R', S \rangle$ be a SCFG that is the same as $T$ except that the rule probabilities in $R'$ have been assigned at random (subject to the constraint that the sum of the probabilities for all rules with the same left-hand side is one). $T$ is called the target grammar and $M$ the learning grammar. The goal is to use a statistical summary of $O$ to obtain parameters for $M$ that are as close to the unknown parameters of $T$ as possible.

Using $M$ and a standard chart parsing algorithm (e.g., Charniak, 1993 or Allen, 1995) one can parse a sentence and count how many times a particular rule was used in deriving that sentence. Let each rule $r$ in a grammar have two associated histograms called $H_r^O$ and $H_r^L$. $H_r^O$ is constructed by parsing each sentence in $O$ and recording the number of times rule $r$ appears in the parse tree of the sentence. Histograms constructed in this way are called observation histograms. The indices of the histogram range from 0 to $k$, where $k$ is the maximum number of times a rule was used in a particular sentence parse. In many cases $k$ remains small and, more importantly, when a sentence parse does not increase $k$, the storage requirements remain unchanged.

Each $H_r^L$ is a histogram identical in nature to $H_r^O$ but is used during the learning process, so it is called a learning histogram. Like the observation histograms, PRESPAN uses each learning histogram to record the number of times each rule occurs in single sentence parses during the learning process. The difference is that the corpus of sentences parsed to fill $H_r^L$ is generated stochastically from $M$ using its current parameters.

For example, suppose PRESPAN is provided with the palindrome-generating structure given in Table 1 and encounters the sentence y y y y. Chart parsing the sentence reveals that rule A → Y has frequency 4, rules S → A A, S → A C and C → S A have frequency 1, and the remaining non-terminal rules have frequency 0. Figure 1 depicts graphically the histograms for each rule after parsing the sentence. In parsing the sentence y y, the rule S → A A is used once, A → Y is used twice and the other rules are not used. Figure 2 shows how the histograms in Figure 1 change after additionally parsing y y.

[Figure 1: one histogram panel per rule, labeled S → A A, S → A C, S → B B, S → B C, C → S A, D → S B, B → Z and A → Y.]
Figure 1: The palindrome grammar rule histograms after only one parse of the sentence y y y y. Because only one sentence has been parsed, the mass of the distribution is concentrated in a single bin.

[Figure 2: the same histogram panels as in Figure 1.]
Figure 2: The palindrome grammar rule histograms after parsing y y y y and y y. Notice that the mass of rules S → A C and C → S A is now evenly distributed between 0 and 1. Similarly, the mass of rule A → Y is evenly distributed between 2 and 4.

After every sentence parse, PRESPAN updates the observation histograms and discards the sentence along with its parse. It is left with only a statistical summary of the corpus. As a result, one cannot reconstruct the observation corpus or any single sentence within it. From this point forward the observation histograms are updated only when new data is encountered.

PRESPAN now begins the iterative process. First, it randomly generates a small sentence corpus of prespecified constant size $s$ from its learning grammar $M$. Each sentence in the sample is parsed using a chart parser. Using the chart, PRESPAN records summary statistics exactly as it did for the observation corpus, except that the statistics for each rule $r$ are added to the learning histograms instead of the observation histograms. After discarding the sentences, the learning histograms are normalized to some fixed size $h$. Without normalization, the information provided by the new statistics would have decreasing impact on the histograms' distributions. This is because the bin counts typically increase linearly while the sample size remains constant. Future work will examine the role of the normalization factor; for this work it is kept fixed throughout the duration of the algorithm.

For each rule $r$ PRESPAN now has two distributions: $H_r^O$, based on the corpus generated from $T$, and $H_r^L$, based on the corpus generated from $M$. Comparing $H_r^L$ to $H_r^O$ seems a natural predictor of the likelihood of the observation corpus given PRESPAN's learning grammar. Relative entropy (also known as the Kullback-Leibler distance) is commonly used to compare two distributions $p$ and $q$ (Cover & Thomas, 1991). It is defined as:

$$D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

Because two distributions are associated with each rule $r$, the relative entropies are summed over the rules.

$$\tau = \sum_r \sum_x p_r(x) \log \frac{p_r(x)}{q_r(x)} \qquad (1)$$

If $\tau$ decreases between iterations, then the likelihood of $M$ is increasing, so PRESPAN increases the probabilities of the rules used in generating the sample corpus. When $s$ is large, the algorithm only increases a small subset of the rules used to generate the sample.⁴ Likewise, if $\tau$ increases between iterations, PRESPAN decreases the rule probabilities.

⁴ Using only the rules fired during the generation of the last sentence seems to work well.

PRESPAN uses a multiplicative update function. Suppose rule $r$ was selected for an update at time $t$. If $p_t(r)$ is the probability of $r$ at time $t$ and $\tau$ decreased between iterations, then $p_{t+1}(r) = 1.01 \cdot p_t(r)$. Once the probability updates are performed PRESPAN starts another iteration, beginning with the generation of a small sentence corpus from the learning grammar. The algorithm stops iterating when the relative entropy falls below a threshold, or some prespecified number of iterations has completed.

A Description of SPAN

SPAN differs from PRESPAN in the selection of rules to update, the criteria for updates, and the update rule itself. Recall that PRESPAN uses $\tau$ (see Equation 1), the sum of the relative entropy calculations for each rule, as a measure of progress or deterioration of grammar updates. Since $\tau$ is an aggregate value, an unsuccessful change in probability for one rule could overshadow a successful change of another rule. Furthermore, the update rule does not differentiate between small and large successes and failures.

SPAN addresses these concerns by examining local changes in relative entropy and using those values to make rule-specific changes. SPAN calculates the relative entropy for rule $r$ at time $t$ and compares it with the relative entropy at time $t-1$. If the relative entropy decreases, SPAN updated the rule probability favorably; if it increases, the distributions have become more dissimilar, so the probability should move in the opposite direction. This is best explained by examining SPAN's update rule:

$$P_{t+1}(r) = P_t(r) + \alpha \cdot \mathrm{sgn}(P_t(r) - P_{t-1}(r)) \cdot \mathrm{sgn}(\Delta RE_t) \cdot f(\Delta RE_t) + \beta \cdot (P_t(r) - P_{t-1}(r)) \qquad (2)$$

The update rule is based on the steepest descent method (Bertsekas & Tsitsiklis, 1996). Here, $\mathrm{sgn}$ is the "sign" function that returns -1 if its argument is negative, 0 if its argument is zero and +1 if its argument is positive. The first sign function determines the direction of the previous update. That is, it determines whether, in the last time step, SPAN increased or decreased the probability. The second sign function determines whether the relative entropy has increased or decreased. If it has decreased, then the difference is positive; if it increased, the difference is negative. Together these sign functions determine the direction of the step. The function $f(\Delta RE_t)$ returns the magnitude of the step. This, intuitively, is an estimate of the gradient, since the magnitude of the change in relative entropy is reflective of the slope. The $\alpha$ parameter is a step-size. Finally, the $\beta \cdot (P_t(r) - P_{t-1}(r))$ expression is a momentum term.

Once the probability updates are performed for each rule, another iteration starts, beginning with the generation of a small sentence corpus from the learning grammar. Like PRESPAN, the algorithm stops iterating when the relative entropy falls below a threshold, or some prespecified number of iterations has completed.

SPAN is a more focused learning algorithm than PRESPAN. This is because all rules are individually updated based on local changes instead of stochastically selected and updated based on global changes. The algorithmic changes speed up learning by, at times, two orders of magnitude. While these changes drastically reduce learning time, they do not necessarily result in more accurate grammars. Evidence and explanation of this is given in the Experiments section.

Table 2: A grammar generating simple English phrases

S → NP VP            N → SQUARE
NP → Det N           N → TRIANGLE
VP → Vt NP           P → ABOVE
VP → Vc PP           P → BELOW
VP → Vi              A → a
PP → P NP            THE → the
Det → A              TOUCHES → touches
Det → THE            IS → is
Vt → TOUCHES         ROLLS → rolls
Vt → COVERS          BOUNCES → bounces
Vc → IS              CIRCLE → circle
Vi → ROLLS           SQUARE → square
Vi → BOUNCES         TRIANGLE → triangle
N → CIRCLE           ABOVE → above
                     BELOW → below

Algorithm Analysis

In this section both the time and space requirements of the algorithms are analyzed. Comparing the results with the time and space requirements of the inside-outside algorithm shows that SPAN and PRESPAN are asymptotically equivalent in time but nearly constant (as opposed to linear with inside-outside) in space.

The inside-outside algorithm runs in time $O(L^3 |V|^3)$ where $L$ is the length of the sentence corpus and $|V|$ is the number of non-terminal symbols (Stolcke, 1994). The complexity arises directly from the chart parsing routines used to estimate probabilities. Note that the number of iterations used by the inside-outside algorithm is dominated by the computational complexity of chart parsing.

Both SPAN and PRESPAN chart parse the observation corpus once but repeatedly chart parse the fixed size samples they generate during the learning process. Taken as a whole, this iterative process typically dominates the single parse of the observation sentences, so the computational complexity is $O(J^3 |V|^3)$ where $J$ is the length of the maximum sample of any iteration.

Every iteration of the inside-outside algorithm requires the complete sentence corpus. Using the algorithm in the context of embedded agents, where the sentence corpus increases continuously with time, means a corresponding continuous increase in memory. With SPAN and PRESPAN, the memory requirements remain effectively constant.

While the algorithms continually update their learning histograms through the learning process, the number of bins increases only when a sentence parse contains an occurrence count larger than any encountered previously. The sample is representative of the grammar parameters and structure, so typically after a few iterations the number of bins becomes stable. This means that when new sentences are encountered there is typically no increase in the amount of space required.

Experiments

The previous section described two online algorithms for learning the parameters of SCFGs given summary statistics computed from a corpus of sentences. The remaining question is whether the quality of the learned grammar is sacrificed because a statistical summary of the information is used rather than the complete sentence corpus. This section presents the results of experiments that compare the grammars learned with PRESPAN and SPAN with those learned by the inside-outside algorithm.

The following sections provide experimental results for both the PRESPAN and SPAN algorithms.

Experiments with PRESPAN

Let $M_T$ be the target grammar whose parameters are to be learned. Let $M_L$ be a grammar that has the same structure as $M_T$ but with rule probabilities initialized uniformly at random and normalized so the sum of the probabilities of the rules with the same left-hand side is 1.0. Let $O_T$ be a set of sentences generated stochastically from $M_T$. The performance of the algorithm is measured by running it on $M_L$ and $O_T$ and computing the log likelihood of $O_T$ given the final grammar.

Because the algorithm learns parameters for a fixed structure, a number of different target grammars are used in experimentation, each with the same structure but different rule probabilities. The goal is to determine whether any regions of parameter space were significantly better for one algorithm over the other. This is accomplished by stochastically sampling from this space. Note that a new corpus is generated for each new set of parameters, as they influence which sentences are generated.

The grammar shown in Table 2 (Stolcke, 1994) was used in this manner with 50 different target parameter settings and 500 sentences in $O_T$ for each setting. The mean and standard deviation of the log likelihoods for PRESPAN with $h = s = 100$ (histogram size and learning corpus size respectively) were $\mu = -962.58$ and $\sigma = 241.25$. These values for the inside-outside algorithm were $\mu = -959.83$ and $\sigma = 240.85$. Recall that equivalent performance would be a significant accomplishment because the online algorithm has access to much less information about the data. Suppose the means of both empirical distributions are equal. With this assumption as the null hypothesis, a two-tailed t-test results in $p = 0.95$. This means that if one rejects the null hypothesis, the probability of making an error is 0.95.

Unfortunately, the above result does not sanction the conclusion that the two distributions are the same. One can, however, look at the power of the test in this case. If the test's power is high then it is likely that a true difference in the means would be detected. If the power is low then it is unlikely that the test would detect a real difference. The power of a test depends on a number of factors, including the sample size, the standard deviation, the significance level of the test, and the actual difference between the means. Given a sample size of 50, a standard deviation of 240.05, a significance level of 0.05, and an actual delta of 174.79, the power of the t-test is 0.95. That is, with probability 0.95 the t-test will detect a difference in means of at least 174.79 at the given significance level. Because the difference between the means of the two distributions is minute, a more powerful test is needed.

Since both PRESPAN and the inside-outside algorithm were run on the same problems, a paired sample t-test can be applied. This test is more powerful than the standard t-test. Suppose again that the means of the two distributions are equal. Using this as the null hypothesis and performing the paired sample t-test yields $p < 0.01$. That is, the probability of making an error in rejecting the null hypothesis is less than 0.01. Closer inspection of the data reveals why this is the case. Inside-outside performed better than the online algorithm on each of the 50 grammars. However, as is evident from the means and standard deviations, the absolute difference in each case was quite small.

Table 3: An ambiguous grammar

S → A
S → B A
A → C
A → C C
B → S
B → D
B → E
B → F
C → z
D → y
E → x
F → w

The same experiments were conducted with the ambiguous grammar shown in Table 3. The grammar is ambiguous because, for example, z z can be generated by S → A → C C → z z or by S → B A with B → S → A → C → z and A → C → z. The mean and standard deviation of the log likelihoods for PRESPAN were $\mu = -1983.15$ and $\sigma = 250.95$. These values for the inside-outside algorithm were $\mu = -1979.37$ and $\sigma = 250.57$. The standard t-test returned a $p$ value of 0.94 and the paired sample t-test was significant at the 0.01 level. Again, inside-outside performed better on every one of the 50 grammars, but the differences were very small.

Experiments with SPAN

The same experiments were performed using SPAN. That is, the grammar in Table 2 was used with 50 different target parameter settings and 500 sentences in $O_T$ for each setting. The mean and standard deviation of the log likelihoods for SPAN with $h = s = 100$ (histogram size and learning corpus size respectively) were $\mu = -4266.44$ and $\sigma = 650.57$. These values for the inside-outside algorithm were $\mu = -3987.58$ and $\sigma = 608.59$. Recall that equivalent performance would be a significant accomplishment because the online algorithm has access to much less information about the data. Assuming the means of both distributions are equal and using this as the null hypothesis, a two-tailed t-test results in $p = 0.03$.

The same experiment was conducted with the ambiguous grammar shown in Table 3. The grammar is ambiguous, for example, because z z could be generated by S → A → C C → z z or by S → B A with B → S → A → C → z and A → C → z. The mean and standard deviation of the log likelihoods for the online algorithm were $\mu = -2025.93$ and $\sigma = 589.78$. These values for the inside-outside algorithm were $\mu = -1838.41$ and $\sigma = 523.46$. The t-test returned a $p$ value of 0.33.

Inside-outside performed significantly better on the unambiguous grammar, but there was not a significant difference on the ambiguous grammar. Given the fact that SPAN has access to far less information than the inside-outside algorithm, this is not a trivial accomplishment. One conjecture is that SPAN never actually converges to a stable set of parameters but walks around whatever local optimum it finds in parameter space. This is suggested by the observation that for any given training set the log likelihood for the inside-outside algorithm is always higher than that for SPAN. Comparison of the parameters learned shows that SPAN is moving in the direction of the correct parameters but that it never actually converges on them.

Discussion

It was noted earlier that SPAN learns more quickly than PRESPAN, but the Experiments section shows this improvement may come at a cost. One reason for this may lie in the sentence samples produced from the learning grammar during each iteration. Recall that SPAN learns by generating a sentence sample using its current parameter estimates. Then this sample is parsed and the distribution is compared to the distribution of the sentences generated from the target grammar. Each sentence sample reflects the current parameter estimates, but also has some amount of error. This error may be more pronounced in SPAN because at each iteration every rule is updated. This update is a direct function of statistics computed from the sample, so the sample error may overshadow actual improvement or deterioration in parameter updates from the last iteration.

Overcoming the sample error problem in SPAN might be accomplished by incorporating global views of progress, not unlike those used in PRESPAN. In fact, a synergy of the two algorithms may be an appropriate next step in this research.

Another interesting prospect for future parameter-learning research is based on rule orderings. Remember that parameter changes in rules closer to the start symbol of a grammar have more effect on the overall distribution of sentences than changes to parameters farther away. One idea is to take the grammar and transform it into a graph so that each unique left-hand side symbol is a vertex and each individual right-hand-side symbol is a weighted arc. Using the start-symbol vertex as the root node and assuming each arc has weight 1.0, one can assign a rank to each vertex by finding the weight of the shortest path from the root to all the other vertices. This ordering may provide a convenient way to iteratively learn the rule probabilities. One can imagine concentrating only on learning the parameters of the rules with rank 1, then fixing those parameters and working on rules with rank 2, and so forth. When the final rank is reached, the process would start again from the beginning. Clearly self-referential rules may pose some difficulty, but the ideas have yet to be fully examined.

Conclusion

Most parameter learning algorithms for stochastic context-free grammars retain the entire sentence corpus throughout the learning process. Incorporating a complete memory of sentence corpora seems ill-suited for learning in embedded agents. PRESPAN and SPAN are two incremental algorithms for learning parameters in stochastic context-free grammars using only summary statistics of the observed data. Both algorithms require a fixed amount of space regardless of the number of sentences they process. Despite using much less information than the inside-outside algorithm, PRESPAN and SPAN perform almost as well.

Acknowledgements

Paul R. Cohen provided helpful comments and guidance when revising and editing the paper. Additional thanks to Gary Warren King and Clayton Morrison for their excellent suggestions. Mark Johnson supplied a clean implementation of the inside-outside algorithm.

This research is supported by DARPA/USASMDC under contract number DASG60-99-C-0074. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA/USASMDC or the U.S. Government.

References

Allen, J. (1995). Natural language understanding. The Benjamin/Cummings Publishing Company, Inc. 2nd edition.

Bertsekas, D. P., & Tsitsiklis, J. N. (Eds.). (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.

Charniak, E. (1993). Statistical language learning. Cambridge: The MIT Press.

Chen, S. F. (1995). Bayesian grammar induction for language modeling (Technical Report TR-01-95). Center for Research in Computing Technology, Harvard University, Cambridge, MA.

Chen, S. F. (1996). Building probabilistic models for natural language. Doctoral dissertation, The Division of Applied Sciences, Harvard University.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. John Wiley and Sons, Inc.

Lari, K., & Young, S. J. (1990). The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4, 35–56.

Lari, K., & Young, S. J. (1991). Applications of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 5, 237–257.

Nevill-Manning, C., & Witten, I. (1997). Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 7, 67–82.

Sipser, M. (1997). Introduction to the theory of computation. Boston: PWS Publishing Company.

Stolcke, A. (1994). Bayesian learning of probabilistic language models. Doctoral dissertation, Division of Computer Science, University of California, Berkeley.

Stolcke, A., & Omohundro, S. (1994). Inducing probabilistic grammars by Bayesian model merging. Grammatical Inference and Applications (pp. 106–118). Berlin, Heidelberg: Springer.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: an introduction. Cambridge: The MIT Press.
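The relative entropy between a rule's two histograms (Equation 1, one rule at a time) and SPAN's per-rule update (Equation 2) can be illustrated with a short sketch. This is not the authors' code and several choices are assumptions made only for illustration: the observation histogram is taken as $p$ and the learning histogram as $q$, empty bins are smoothed with a small constant `eps`, the magnitude function is taken to be $f(x) = |x|$, $\Delta RE_t$ is computed as the previous relative entropy minus the current one (so that it is positive when the relative entropy has decreased, as the text describes), and the default values of `alpha` and `beta` are arbitrary. Renormalizing the probabilities of rules that share a left-hand side is left out.

```python
import math


def sgn(x):
    """Sign function: -1, 0 or +1."""
    return (x > 0) - (x < 0)


def relative_entropy(p_counts, q_counts, eps=1e-6):
    """D(p || q) for two histograms given as lists of bin counts.

    Bins are padded to a common length and smoothed by `eps` (an assumption)
    so that empty bins do not produce log(0) or division by zero.
    """
    n = max(len(p_counts), len(q_counts))
    p = [(p_counts[i] if i < len(p_counts) else 0) + eps for i in range(n)]
    q = [(q_counts[i] if i < len(q_counts) else 0) + eps for i in range(n)]
    zp, zq = sum(p), sum(q)
    return sum((a / zp) * math.log((a / zp) / (b / zq)) for a, b in zip(p, q))


def span_update(p_t, p_prev, re_t, re_prev, alpha=0.1, beta=0.5):
    """One application of Equation 2 to a single rule probability.

    p_t, p_prev   : the rule's probability at times t and t-1
    re_t, re_prev : the rule's relative entropy at times t and t-1
    alpha, beta   : step size and momentum weight (hypothetical defaults)
    """
    delta_re = re_prev - re_t            # positive when relative entropy fell
    step = alpha * sgn(p_t - p_prev) * sgn(delta_re) * abs(delta_re)
    momentum = beta * (p_t - p_prev)
    return p_t + step + momentum


if __name__ == "__main__":
    # Histograms for rule A -> Y: observation counts after parsing y y y y and
    # y y (bins 2 and 4 hold one sentence each) versus a hypothetical learning
    # histogram with all of its mass in bin 4.
    observed = [0, 0, 1, 0, 1]
    learning = [0, 0, 0, 0, 2]
    print(relative_entropy(observed, learning))
    print(span_update(p_t=0.5, p_prev=0.4, re_t=0.2, re_prev=0.3))
```

With these assumptions, a probability that was just increased keeps moving up when the relative entropy fell and is pushed back when it rose, while the momentum term always continues in the direction of the previous change.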