Two Algorithms for Learning the Parameters of Stochastic Context-Free Grammars

Brent Heeringa
Department of Computer Science
University of Massachusetts, Amherst
140 Governor's Drive
Amherst, MA 01003
heeringa@cs.umass.edu

Tim Oates
Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250
oates@eecs.umbc.edu
Abstract

Stochastic context-free grammars (SCFGs) are often used to represent the syntax of natural languages. Most algorithms for learning them require storage and repeated processing of a sentence corpus. The memory and computational demands of such algorithms are ill-suited for embedded agents such as a mobile robot. Two algorithms are presented that incrementally learn the parameters of stochastic context-free grammars as sentences are observed. Both algorithms require a fixed amount of space regardless of the number of sentence observations. Despite using less information than the inside-outside algorithm, the algorithms perform almost as well.

Introduction

Although natural languages are not entirely context free, stochastic context-free grammars (SCFGs) are an effective representation for capturing much of their structure. However, for embedded agents, most algorithms for learning SCFGs from data have two shortcomings. First, they need access to a corpus of complete sentences, requiring the agent to retain every sentence it hears. Second, they are batch algorithms that make repeated passes over the data, often requiring significant computation in each pass. These shortcomings are addressed through two online algorithms called SPAN¹ and PRESPAN² that learn the parameters of SCFGs using only summary statistics in combination with repeated sampling techniques.

SCFGs contain both structure (i.e., rules) and parameters (i.e., rule probabilities). One approach to learning SCFGs from data is to start with a grammar containing all possible rules that can be created from some alphabet of terminals and non-terminals. Typically the size of the right-hand side of each rule is bound by a small constant (e.g., 2). Then an algorithm for learning parameters is applied and allowed to "prune" rules by setting their expansion probabilities to zero (Lari & Young, 1990). PRESPAN and SPAN operate in this paradigm by assuming a fixed structure and modifying the parameters.

Given a SCFG to be learned, both algorithms have access to the structure of the grammar and a set of sentences generated by the grammar. The correct parameters are unknown. PRESPAN and SPAN begin by parsing the sentence corpus using a chart parser. Note that the parse of an individual sentence does not depend on the parameters; it only depends on the structure. However, the distribution of sentences parsed does depend on the parameters of the grammar used to generate them. Both algorithms associate with each rule a histogram that records the number of times the rule is used in parses of the individual sentences.

PRESPAN and SPAN make an initial guess at the values of the parameters by setting them randomly. They then generate a corpus of sentences with these parameters and parse them, resulting in a second set of histograms. The degree to which the two sets of histograms differ is a measure of the difference between the current parameter estimates and the target parameters. PRESPAN modifies its parameter estimates so the sum total difference between the histograms is minimized. In contrast, SPAN modifies its estimates so the difference between individual histograms is minimized. Empirical results show that this procedure yields parameters that are close to those found by the inside-outside algorithm.

Stochastic Context-Free Grammars

Stochastic context-free grammars³ are the natural extension of context-free grammars to the probabilistic domain (Sipser, 1997; Charniak, 1993). Said differently, they are context-free grammars with probabilities associated with each rule. Formally, a SCFG is a four-tuple M = ⟨V, Σ, R, S⟩ where

1. V is a finite set of non-terminals.
2. Σ is a finite set, disjoint from V, of terminals.
3. R is a finite set of rules of the form A → w where A belongs to V, and w is a finite string composed of elements from V and Σ. We refer to A as the left-hand side (LHS) of the rule and w as the right-hand side (RHS), or expansion, of the rule. Additionally, each rule r has an associated probability p(r) such that the probabilities of rules with the same left-hand side sum to 1.
4. S is the start symbol.

Copyright © 2001, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
¹SPAN stands for Sample Parse Adjust Normalize.
²PRESPAN is so named because it is the predecessor of SPAN, so it literally means pre-SPAN.
³Stochastic Context-Free Grammars are often called Probabilistic Context-Free Grammars.
Grammars can either be ambiguous or unambiguous. Ambiguous grammars can generate the same string in multiple ways. Unambiguous grammars cannot.

Table 1: A grammar that generates palindromes

S → A A
S → B B
S → A C
S → B D
C → S A
D → S B
A → Y
B → Z
Y → y
Z → z
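As a concrete illustration (not taken from the paper), the grammar structure in Table 1 could be written down in Python roughly as follows. The rule probabilities, the check_normalized helper and the generate sampler are assumptions made here for the sketch; the paper only requires that probabilities for rules sharing a left-hand side sum to one.

    import random

    # Rules grouped by left-hand side; each expansion carries a probability.
    # The probabilities below are placeholders, not values from the paper.
    palindrome_scfg = {
        "S": [(("A", "A"), 0.25), (("B", "B"), 0.25),
              (("A", "C"), 0.25), (("B", "D"), 0.25)],
        "C": [(("S", "A"), 1.0)],
        "D": [(("S", "B"), 1.0)],
        "A": [(("Y",), 1.0)],
        "B": [(("Z",), 1.0)],
        "Y": [(("y",), 1.0)],
        "Z": [(("z",), 1.0)],
    }

    def check_normalized(grammar, tol=1e-9):
        # Probabilities of rules with the same left-hand side must sum to 1.
        return all(abs(sum(p for _, p in exps) - 1.0) < tol
                   for exps in grammar.values())

    def generate(grammar, symbol="S"):
        # Stochastically expand `symbol` until only terminals remain.
        if symbol not in grammar:          # terminal symbol
            return [symbol]
        expansions = [rhs for rhs, _ in grammar[symbol]]
        weights = [p for _, p in grammar[symbol]]
        rhs = random.choices(expansions, weights=weights)[0]
        return [tok for s in rhs for tok in generate(grammar, s)]

    assert check_normalized(palindrome_scfg)
    print(" ".join(generate(palindrome_scfg)))   # e.g. "y y" or "y y y y"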
Learning Stochastic Context-Free Grammars

Learning context-free grammars is the problem of inducing a context-free structure (or model) from a corpus of sentences (i.e., data). When the grammars are stochastic, one faces the additional problem of learning rule probabilities (parameters) from the corpus. Given a set of sentence observations O = {o_0, ..., o_{n−1}}, the goal is to discover the grammar that generated O. Typically, this problem is framed in terms of a search in grammar space where the objective function is the likelihood of the data given the grammar. While the problem of incrementally learning the structure of SCFGs is interesting in its own right, the main focus here is on learning parameters. For a thorough overview of existing techniques for learning structure, see (Stolcke, 1994; Chen, 1996; Nevill-Manning & Witten, 1997).
Learning Parameters

The inside-outside algorithm (Lari & Young, 1990; Lari & Young, 1991) is the standard method for estimating parameters in SCFGs. The algorithm uses the general-purpose expectation-maximization (EM) procedure. Almost all parameter learning is done batch style using some version of the inside-outside algorithm. For example, in learning parameters, Chen (1995) initially estimates the rule probabilities using the most probable parse of the sentences given the grammar (the Viterbi parse) and then uses a "post-pass" procedure that incorporates the inside-outside algorithm.

To use EM the entire sentence corpus must be stored. While this storage may not be in the form of actual sentences, it is always in some representation that easily allows the reconstruction of the original corpus (e.g., the chart of a chart parse). Because we are interested in language acquisition in embedded agents over long periods of time, the prospect of memorizing and repeatedly processing entire sentence corpora is unpalatable.

This motivation also carries a desire to easily adjust our parameters when new sentences are encountered. That is, we want to learn production probabilities incrementally. While the inside-outside algorithm can incorporate new sentences in between iterations, it still uses the entire sentence corpus for estimation.

The Algorithms

PRESPAN and SPAN address some of the concerns given in the previous section. Both are unsupervised, incremental algorithms for finding parameter estimates in stochastic context-free grammars.

PRESPAN and SPAN are incremental in two ways. First, in the classical sense, at every iteration they use the previously learned parameter estimation as a stepping stone to the new one. Second, both algorithms naturally allow new data to contribute to learning without restarting the entire process.

PRESPAN and SPAN use only a statistical summary of the observation data for learning. Both store the summary information in histograms. SPAN and PRESPAN also record histogram information about their current parameter estimates. So the addition of new sentences typically does not increase the memory requirements. Furthermore, the histograms play a crucial role in learning. If the parameter estimates are accurate, the histograms of the observed data should resemble the histograms of the current parameterization. When the histograms do not resemble each other, the difference is used to guide the learning process.

A Description of PRESPAN

Let T = ⟨V, Σ, R, S⟩ be a SCFG and let O = {o_0, ..., o_{n−1}} be a set of n sentences generated stochastically from T. Let M = ⟨V, Σ, R′, S⟩ be a SCFG that is the same as T except the rule probabilities in R′ have been assigned at random (subject to the constraint that the sum of the probabilities for all rules with the same left-hand side is one). T is called the target grammar and M the learning grammar. The goal is to use a statistical summary of O to obtain parameters for M that are as close to the unknown parameters of T as possible.

Using M and a standard chart parsing algorithm (e.g., Charniak, 1993 or Allen, 1995) one can parse a sentence and count how many times a particular rule was used in deriving that sentence. Let each rule r in a grammar have two associated histograms called H_r^O and H_r^L. H_r^O is constructed by parsing each sentence in O and recording the number of times rule r appears in the parse tree of the sentence.
Histograms constructed in this way are called observation histograms. The indices of the histogram range from 0 to k where k is the maximum number of times a rule was used in a particular sentence parse. In many cases, k remains small, and more importantly, when a sentence parse does not increase k, the storage requirements remain unchanged.

Each H_r^L is a histogram identical in nature to H_r^O but is used during the learning process, so it is a learning histogram. Like the observation histograms, PRESPAN uses each learning histogram to record the number of times each rule occurs in single sentence parses during the learning process. The difference is that the corpus of sentences parsed to fill H_r^L is generated stochastically from M using its current parameters.

For example, suppose PRESPAN is provided with the palindrome-generating structure given in Table 1 and encounters the sentence y y y y. Chart parsing the sentence reveals that rule A → Y has frequency 4, rules S → A A, S → A C and C → S A have frequency 1, and the remaining non-terminals have frequency 0. Figure 1 depicts graphically the histograms for each rule after parsing the sentence. In parsing the sentence y y, the rule S → A A is used once, A → Y is used twice and the other rules are not used. Figure 2 shows how the histograms in Figure 1 change after additionally parsing y y.

Figure 1: The palindrome grammar rule histograms after only one parse of the sentence y y y y. Because only one sentence has been parsed, the mass of each distribution is concentrated in a single bin. (The figure shows one histogram panel per rule.)

Figure 2: The palindrome grammar rule histograms after parsing y y y y and y y. Notice that the mass of rules S → A C and C → S A is now evenly distributed between 0 and 1. Similarly the mass of rule A → Y is evenly distributed between 2 and 4.

After every sentence parse, PRESPAN updates the observation histograms and discards the sentence along with its parse. It is left with only a statistical summary of the corpus. As a result, one cannot reconstruct the observation corpus or any single sentence within it. From this point forward the observation histograms are updated only when new data is encountered.
is used dHurring the learning process, so it’s a learnHingr his- If decreases between iterations, then the likelihoodof
togram. Like the observation histograms, PRESPAN uses is(cid:28)increasing so PRESPAN increases the probabilitiesof
eachlearninghistogramtorecordthenumberoftimeseach Mthe rules used in generating the sample corpus. When is
ruleoccursinsinglesentenceparsesduringthelearningpro- large,thealgorithmonlyincreasesasmallsubsetoftherusles
cess. Thedifferenceisthatthecorpusofsentencesparsedto used to generate the sample.4 Likewise, if increases be-
fill L isgeneratedstochasticallyfrom usingitscurrent tweeniterations,PRESPAN decreases therule(cid:28)probabilities.
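The following sketch computes the quantity in Equation 1 from the two sets of histograms. Turning each histogram into a distribution over the union of its bins and smoothing empty bins with a small epsilon are assumptions made here; the paper does not specify how zero counts are handled.

    import math

    def tau(obs_hist, learn_hist, eps=1e-6):
        """Sum of per-rule relative entropies (Equation 1).

        obs_hist, learn_hist: {rule: {bin: count}}.  p_r is taken to be the
        observation distribution and q_r the learning distribution.
        """
        total = 0.0
        for rule, p_counts in obs_hist.items():
            q_counts = learn_hist.get(rule, {})
            bins = set(p_counts) | set(q_counts)
            p_sum = sum(p_counts.values()) + eps * len(bins)
            q_sum = sum(q_counts.values()) + eps * len(bins)
            for b in bins:
                p = (p_counts.get(b, 0) + eps) / p_sum
                q = (q_counts.get(b, 0) + eps) / q_sum
                total += p * math.log(p / q)
        return total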
PRESPAN uses a multiplicative update function. Suppose rule r was selected for an update at time t. If p_t(r) is the probability of r at time t and τ decreased between iterations, then p_{t+1}(r) = 1.01 · p_t(r). Once the probability updates are performed PRESPAN starts another iteration beginning with the generation of a small sentence corpus from the learning grammar. The algorithm stops iterating when the relative entropy falls below a threshold, or some prespecified number of iterations has completed.
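Putting the pieces together, one PRESPAN iteration might look like the sketch below. The sampling, chart-parsing and τ routines are passed in as callables because the paper does not fix their implementations; decreasing probabilities by the reciprocal factor and renormalizing within each left-hand side are additional assumptions made here.

    from collections import Counter

    def prespan_iteration(probs, obs_hist, learn_hist, prev_tau, *,
                          tau, sample_sentence, count_rule_uses,
                          sample_size=100, hist_size=100, factor=1.01):
        """One Sample-Parse-Adjust-Normalize step for PRESPAN (a sketch).

        probs: {(lhs, rhs): probability} -- the learning grammar's parameters.
        """
        fired = set()
        for _ in range(sample_size):
            sentence = sample_sentence(probs)             # assumed helper
            counts = count_rule_uses(probs, sentence)     # assumed helper
            fired.update(r for r, c in counts.items() if c > 0)
            for rule in probs:
                learn_hist.setdefault(rule, Counter())[counts.get(rule, 0)] += 1

        # Normalize each learning histogram to a fixed total mass hist_size
        # so that later samples keep a constant influence.
        for hist in learn_hist.values():
            total = sum(hist.values())
            if total:
                for b in hist:
                    hist[b] = hist[b] * hist_size / total

        cur_tau = tau(obs_hist, learn_hist)
        # Multiplicative update: scale up the fired rules if tau fell,
        # scale them down otherwise (the paper states the 1.01 increase).
        scale = factor if cur_tau < prev_tau else 1.0 / factor
        for rule in fired:
            probs[rule] *= scale
        # Assumption: keep each left-hand-side group summing to one.
        lhs_total = Counter()
        for (lhs, rhs), p in probs.items():
            lhs_total[lhs] += p
        for (lhs, rhs) in list(probs):
            probs[(lhs, rhs)] /= lhs_total[lhs]
        return cur_tau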
A Description of SPAN

SPAN differs from PRESPAN in the selection of rules to update, the criteria for updates, and the update rule itself. Recall that PRESPAN uses τ (see Equation 1), the sum of relative entropy calculations for each rule, as a measure of progress or deterioration of grammar updates. Since τ is an aggregate value, an unsuccessful change in probability for one rule could overshadow a successful change of another rule. Furthermore, the update rule does not differentiate between small and large successes and failures.

SPAN addresses these concerns by examining local changes in relative entropy and using those values to make rule-specific changes. SPAN calculates the relative entropy for rule r at time t and compares it with the relative entropy at time t−1. If the relative entropy decreases it means SPAN updated the rule probability favorably; if it increases, the distributions have become more dissimilar so the probability should move in the opposite direction.
This is best explained by examining SPAN's update rule:

    P_{t+1}(r) = P_t(r) + α · sgn(P_t(r) − P_{t−1}(r)) · sgn(ΔRE_t) · f(ΔRE_t) + β · (P_t(r) − P_{t−1}(r))    (2)

The update rule is based on the steepest descent method (Bertsekas & Tsitsiklis, 1996). Here, sgn is the "sign" function that returns −1 if its argument is negative, 0 if its argument is zero and +1 if its argument is positive. The first sign function determines the direction of the previous update. That is, it determines whether, in the last time step, SPAN increased or decreased the probability. The second sign function determines if the relative entropy has increased or decreased. If it has decreased, then the difference is positive; if it increased, the difference is negative. Together these sign functions determine the direction of the step. The function f(ΔRE_t) returns the magnitude of the step. This, intuitively, is an estimate of the gradient since the magnitude of the change in relative entropy is reflective of the slope. The α parameter is a step-size. Finally, the β · (P_t(r) − P_{t−1}(r)) expression is a momentum term.
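A small sketch of Equation 2 for a single rule. The choice f = abs and the particular α and β values are assumptions for illustration; the paper only states that f returns the step magnitude and that α is a step-size.

    def sgn(x):
        # Sign function: -1, 0 or +1.
        return (x > 0) - (x < 0)

    def span_update(p_t, p_prev, delta_re, alpha=0.01, beta=0.5, f=abs):
        """One application of SPAN's update rule (Equation 2) to a rule.

        delta_re is the change in the rule's relative entropy, positive when
        the relative entropy decreased between iterations (the paper's
        convention).
        """
        step = alpha * sgn(p_t - p_prev) * sgn(delta_re) * f(delta_re)
        momentum = beta * (p_t - p_prev)
        return p_t + step + momentum

    # If the last update raised the probability and the relative entropy then
    # fell, the next step continues in the same (upward) direction:
    print(span_update(p_t=0.30, p_prev=0.25, delta_re=0.8))   # 0.333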
Once the probability updates are performed for each rule, another iteration starts beginning with the generation of a small sentence corpus from the learning grammar. Like PRESPAN, the algorithm stops iterating when the relative entropy falls below a threshold, or some prespecified number of iterations has completed.

SPAN is a more focused learning algorithm than PRESPAN. This is because all rules are individually updated based on local changes instead of stochastically selected and updated based on global changes. The algorithmic changes speed up learning by, at times, two orders of magnitude. While these benefits drastically reduce learning time, they do not necessarily result in more accurate grammars. Evidence and explanation of this is given in the Experiments section.

Algorithm Analysis

In this section both the time and space requirements of the algorithms are analyzed. Comparing the results with the time and space requirements of the inside-outside algorithm shows that SPAN and PRESPAN are asymptotically equivalent in time but nearly constant (as opposed to linear with inside-outside) in space.

The inside-outside algorithm runs in time O(L³|V|³) where L is the length of the sentence corpus and |V| is the number of non-terminal symbols (Stolcke, 1994). The complexity arises directly from the chart parsing routines used to estimate probabilities. Note that the cost of each iteration of the inside-outside algorithm is dominated by the computational complexity of chart parsing.

Both SPAN and PRESPAN chart parse the observation corpus once but repeatedly chart parse the fixed-size samples they generate during the learning process. Taken as a whole, this iterative process typically dominates the single parse of the observation sentences, so the computational complexity is O(J³|V|³) where J is the length of the maximum sample of any iteration.

Every iteration of the inside-outside algorithm requires the complete sentence corpus. Using the algorithm in the context of embedded agents, where the sentence corpus increases continuously with time, means a corresponding continuous increase in memory. With SPAN and PRESPAN, the memory requirements remain effectively constant.

While the algorithms continually update their learning histograms through the learning process, the number of bins increases only when a sentence parse contains an occurrence count larger than any encountered previously. The sample is representative of the grammar parameters and structure, so typically after a few iterations the number of bins becomes stable. This means that when new sentences are encountered there is typically no increase in the amount of space required.

Experiments

The previous section described two online algorithms for learning the parameters of SCFGs given summary statistics computed from a corpus of sentences. The remaining question is whether the quality of the learned grammar is sacrificed because a statistical summary of the information is used rather than the complete sentence corpus. This section presents the results of experiments that compare the grammars learned with PRESPAN and SPAN with those learned by the inside-outside algorithm.

The following sections provide experimental results for both the PRESPAN and SPAN algorithms.

Table 2: A grammar generating simple English phrases

S → NP VP
NP → Det N
VP → Vt NP
VP → Vc PP
VP → Vi
PP → P NP
Det → A
Det → THE
Vt → TOUCHES
Vt → COVERS
Vc → IS
Vi → ROLLS
Vi → BOUNCES
N → CIRCLE
N → SQUARE
N → TRIANGLE
P → ABOVE
P → BELOW
A → a
THE → the
TOUCHES → touches
COVERS → covers
IS → is
ROLLS → rolls
BOUNCES → bounces
CIRCLE → circle
SQUARE → square
TRIANGLE → triangle
ABOVE → above
BELOW → below

Experiments with PRESPAN

Let M_T be the target grammar whose parameters are to be learned. Let M_L be a grammar that has the same structure
as M_T but with rule probabilities initialized uniformly at random and normalized so the sum of the probabilities of the rules with the same left-hand side is 1.0. Let O_T be a set of sentences generated stochastically from M_T. The performance of the algorithm is compared by running it on M_L and O_T and computing the log likelihood of O_T given the final grammar.

Because the algorithm learns parameters for a fixed structure, a number of different target grammars are used in experimentation, each with the same structure but different rule probabilities. The goal is to determine whether any regions of parameter space were significantly better for one algorithm over the other. This is accomplished by stochastically sampling from this space. Note that a new corpus is generated for each new set of parameters as they influence which sentences are generated.

The grammar shown in Table 2 (Stolcke, 1994) was used in this manner with 50 different target parameter settings and 500 sentences in O_T for each setting. The mean and standard deviation of the log likelihoods for PRESPAN with h = s = 100 (histogram size and learning corpus size respectively) were μ = −962.58 and σ = 241.25. These values for the inside-outside algorithm were μ = −959.83 and σ = 240.85. Recall that equivalent performance would be a significant accomplishment because the online algorithm has access to much less information about the data. Suppose the means of both empirical distributions are equal. With this assumption as the null hypothesis, a two-tailed t-test results in p = 0.95. This means that if one rejects the null hypothesis, the probability of making an error is 0.95.

Unfortunately, the above result does not sanction the conclusion that the two distributions are the same. One can, however, look at the power of the test in this case. If the test's power is high then it is likely that a true difference in the means would be detected. If the power is low then it is unlikely that the test would detect a real difference. The power of a test depends on a number of factors, including the sample size, the standard deviation, the significance level of the test, and the actual difference between the means. Given a sample size of 50, a standard deviation of 240.05, a significance level of 0.05, and an actual delta of 174.79, the power of the t-test is 0.95. That is, with probability 0.95 the t-test will detect a difference in means of at least 174.79 at the given significance level. Because the difference between the means of the two distributions is minute, a more powerful test is needed.

Since both PRESPAN and the inside-outside algorithm were run on the same problems, a paired sample t-test can be applied. This test is more powerful than the standard t-test. Suppose again that the means of the two distributions are equal. Using this as the null hypothesis and performing the paired sample t-test yields p < 0.01. That is, the probability of making an error in rejecting the null hypothesis is less than 0.01. Closer inspection of the data reveals why this is the case: inside-outside performed better than the online algorithm on each of the 50 grammars. However, as is evident from the means and standard deviations, the absolute difference in each case was quite small.
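The contrast between the two tests can be illustrated with a short script. The log-likelihood arrays below are synthetic stand-ins (drawn to mimic the reported means and a small per-grammar gap), not the actual experimental data.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Synthetic per-grammar log likelihoods for 50 target grammars.
    ll_online = rng.normal(-962.58, 241.25, size=50)
    ll_inside_outside = ll_online + rng.normal(3.0, 1.0, size=50)

    # Unpaired two-tailed t-test: a small, consistent gap is swamped by the
    # large between-grammar variance.
    t_u, p_unpaired = stats.ttest_ind(ll_online, ll_inside_outside)

    # Paired t-test: each grammar serves as its own control, so the same
    # small but consistent advantage becomes detectable.
    t_p, p_paired = stats.ttest_rel(ll_online, ll_inside_outside)

    print(f"unpaired p = {p_unpaired:.2f}, paired p = {p_paired:.2e}")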
Table 3: An ambiguous grammar

S → A
S → B A
A → C
A → C C
B → S
B → D
B → E
B → F
C → z
D → y
E → x
F → w

The same experiments were conducted with the ambiguous grammar shown in Table 3. The grammar is ambiguous because, for example, z z can be generated by S → A, A → C C and C → z, or by S → B A with B → S, S → A, A → C and C → z. The mean and standard deviation of the log likelihoods for PRESPAN were μ = −1983.15 and σ = 250.95. These values for the inside-outside algorithm were μ = −1979.37 and σ = 250.57. The standard t-test returned a value of p = 0.94 and the paired sample t-test was significant at the 0.01 level. Again, inside-outside performed better on every one of the 50 grammars, but the differences were very small.

Experiments with SPAN

The same experiments were performed using SPAN. That is, the grammar in Table 2 was used with 50 different target parameter settings and 500 sentences in O_T for each setting. The mean and standard deviation of the log likelihoods for SPAN with h = s = 100 (histogram size and learning corpus size respectively) were μ = −4266.44 and σ = 650.57. These values for the inside-outside algorithm were μ = −3987.58 and σ = 608.59. Recall that equivalent performance would be a significant accomplishment because the online algorithm has access to much less information about the data. Assuming both means of the distributions are equal and using this as the null hypothesis of a two-tailed t-test results in p = 0.03.

The same experiment was conducted with the ambiguous grammar shown in Table 3. The grammar is ambiguous, for example, because z z could be generated by S → A, A → C C and C → z, or by S → B A with B → S, S → A, A → C and C → z. The mean and standard deviation of the log likelihoods for the online algorithm were μ = −2025.93 and σ = 589.78. These values for the inside-outside algorithm were μ = −1838.41 and σ = 523.46. The t-test returned a value of p = 0.33.

Inside-outside performed significantly better on the unambiguous grammar but there was not a significant difference on the ambiguous grammar. Given the fact that SPAN has access to far less information than the inside-outside algorithm, this is not a trivial accomplishment. One conjecture is that SPAN never actually converges to a stable set of parameters but walks around whatever local optimum it finds in parameter space. This is suggested by the observation that for any given training set the log likelihood for
the inside-outside algorithm is always higher than that for SPAN. Comparison of the parameters learned shows that SPAN is moving in the direction of the correct parameters but that it never actually converges on them.

Discussion

It was noted earlier that SPAN learns more quickly than PRESPAN, but the Experiments section shows this improvement may come at a cost. One reason for this may lie in the sentence samples produced from the learning grammar during each iteration. Recall that SPAN learns by generating a sentence sample using its current parameter estimates. Then this sample is parsed and the distribution is compared to the distribution of the sentences generated from the target grammar. Each sentence sample reflects the current parameter estimates, but also has some amount of error. This error may be more pronounced in SPAN because at each iteration, every rule is updated. This update is a direct function of statistics computed from the sample, so the sample error may overshadow actual improvement or deterioration in parameter updates from the last iteration.

Overcoming the sample error problem in SPAN might be accomplished by incorporating global views of progress, not unlike those used in PRESPAN. In fact, a synergy of the two algorithms may be an appropriate next step in this research.

Another interesting prospect for future parameter-learning research is based on rule orderings. Remember that parameter changes in rules closer to the start-symbol of a grammar have more effect on the overall distribution of sentences than changes to parameters farther away. One idea is to take the grammar and transform it into a graph so that each unique left-hand side symbol is a vertex and each individual right-hand-side symbol is a weighted arc. Using the start-symbol vertex as the root node and assuming each arc has weight 1.0, one can assign a rank to each vertex by finding the weight of the shortest path from the root to all the other vertices. This ordering may provide a convenient way to iteratively learn the rule probabilities. One can imagine concentrating only on learning the parameters of the rules with rank 1, then fixing those parameters and working on rules with rank 2, and so forth. When the final rank is reached, the process would start again from the beginning. Clearly self-referential rules may pose some difficulty, but the ideas have yet to be fully examined.
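For concreteness, the ranking just described can be sketched as a breadth-first search over the grammar structure (equivalent to shortest paths when every arc has weight 1.0). The representation and helper below are illustrative choices, not part of the paper.

    from collections import deque

    def rank_nonterminals(structure, start="S"):
        """Rank each left-hand-side symbol by its shortest-path distance
        from the start symbol, treating every arc as having weight 1.0.

        structure: {lhs: [rhs_tuple, ...]} -- rules without probabilities.
        """
        edges = {lhs: set() for lhs in structure}
        for lhs, expansions in structure.items():
            for rhs in expansions:
                edges[lhs].update(s for s in rhs if s in structure)

        rank = {start: 0}
        queue = deque([start])
        while queue:                  # BFS = shortest paths for unit weights
            node = queue.popleft()
            for nxt in edges[node]:
                if nxt not in rank:
                    rank[nxt] = rank[node] + 1
                    queue.append(nxt)
        return rank

    # The palindrome grammar of Table 1: S gets rank 0; A, B, C and D rank 1;
    # Y and Z rank 2.
    palindrome_structure = {
        "S": [("A", "A"), ("B", "B"), ("A", "C"), ("B", "D")],
        "C": [("S", "A")], "D": [("S", "B")],
        "A": [("Y",)], "B": [("Z",)], "Y": [("y",)], "Z": [("z",)],
    }
    print(rank_nonterminals(palindrome_structure))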
Conclusion

Most parameter learning algorithms for stochastic context-free grammars retain the entire sentence corpus throughout the learning process. Incorporating a complete memory of sentence corpora seems ill-suited for learning in embedded agents. PRESPAN and SPAN are two incremental algorithms for learning parameters in stochastic context-free grammars using only summary statistics of the observed data. Both algorithms require a fixed amount of space regardless of the number of sentences they process. Despite using much less information than the inside-outside algorithm, PRESPAN and SPAN perform almost as well.

Acknowledgements

Paul R. Cohen provided helpful comments and guidance when revising and editing the paper. Additional thanks to Gary Warren King and Clayton Morrison for their excellent suggestions.

Mark Johnson supplied a clean implementation of the inside-outside algorithm.

This research is supported by DARPA/USASMDC under contract number DASG60-99-C-0074. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA/USASMDC or the U.S. Government.

References

Allen, J. (1995). Natural language understanding. The Benjamins/Cummings Publishing Company, Inc., 2nd edition.

Bertsekas, D. P., & Tsitsiklis, J. N. (Eds.). (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.

Charniak, E. (1993). Statistical language learning. Cambridge: The MIT Press.

Chen, S. F. (1995). Bayesian grammar induction for language modeling (Technical Report TR-01-95). Center for Research in Computing Technology, Harvard University, Cambridge, MA.

Chen, S. F. (1996). Building probabilistic models for natural language. Doctoral dissertation, The Division of Applied Sciences, Harvard University.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. John Wiley and Sons, Inc.

Lari, K., & Young, S. J. (1990). The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4, 35–56.

Lari, K., & Young, S. J. (1991). Applications of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 5, 237–257.

Nevill-Manning, C., & Witten, I. (1997). Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 7, 67–82.

Sipser, M. (1997). Introduction to the theory of computation. Boston: PWS Publishing Company.

Stolcke, A. (1994). Bayesian learning of probabilistic language models. Doctoral dissertation, Division of Computer Science, University of California, Berkeley.

Stolcke, A., & Omohundro, S. (1994). Inducing probabilistic grammars by Bayesian model merging. Grammatical Inference and Applications (pp. 106–118). Berlin, Heidelberg: Springer.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge: The MIT Press.