Sparse Forward-Backward for Fast Training of
Conditional Random Fields
Charles Sutton, Chris Pal and Andrew McCallum
University of Massachusetts Amherst
Dept. of Computer Science
Amherst, MA 01003
{casutton,pal,mccallum}@cs.umass.edu
Abstract
Complex tasks in speech and language processing often include random variables with large state spaces, both in speech tasks that involve predicting words and phonemes, and in joint processing of pipelined systems, in which the state space can be the labeling of an entire sequence. In large state spaces, however, discriminative training can be expensive, because it often requires many calls to forward-backward. Beam search is a standard heuristic for controlling complexity during Viterbi decoding, but during forward-backward, standard beam heuristics can be dangerous, as they can make training unstable. We introduce sparse forward-backward, a variational perspective on beam methods that uses an approximating mixture of Kronecker delta functions. This motivates a novel minimum-divergence beam criterion based on minimizing KL divergence between the respective marginal distributions. Our beam selection approach is not only more efficient for Viterbi decoding, but also more stable within sparse forward-backward training. For a standard text-to-speech problem, we reduce CRF training time fourfold, from over a day to six hours, with no loss in accuracy.
1 Introduction
Complex tasks in speech and language processing often include random variables with large state spaces. Training such models can be expensive, even for linear chains, because standard estimation techniques, such as expectation maximization and conditional maximum likelihood, often require repeatedly running forward-backward over the training set, which requires quadratic time in the number of states. During Viterbi decoding, a standard technique to address this problem is beam search, that is, ignoring variable configurations whose estimated max-marginal is sufficiently low. For sum-product inference methods such as forward-backward, however, beam methods can be dangerous, because standard beam selection criteria can inappropriately discard probability mass in a way that makes training unstable.
In this paper, we introduce a perspective on beam search that motivates its use within sum-product inference. In particular, we cast beam search as a variational procedure that approximates a distribution with a large state space by a mixture of many fewer Kronecker delta functions. This motivates sparse forward-backward, a novel message-passing algorithm in which approximate marginal potentials are compressed after each message
pass. Essentially, this extends beam search from max-product inference to sum-product. Our perspective also motivates the minimum-divergence beam, a new beam criterion that selects a compressed marginal distribution within a fixed Kullback-Leibler (KL) divergence of the true marginal. Not only does this criterion perform better than standard beam criteria for Viterbi decoding, it also interacts more stably with training. On one real-world task, the NetTalk text-to-speech dataset [5], we can now train a conditional random field (CRF) in about 6 hours for which training previously required over a day, with no loss in accuracy.
2 Sparse Forward-Backward
Standard beam search can be viewed as maintaining sparse local marginal distributions such that together they are as close as possible to a large distribution. In this section, we formalize this intuition using a variational argument, which motivates our new beam criterion for sparse forward-backward.
Consider a discrete distribution $p(y)$, where $y$ is assumed to have very many possible configurations. We approximate $p$ by a sparse distribution $q$, which we write as a mixture of Kronecker delta functions:
$$q(y) = \sum_{i \in \mathcal{I}} q_i \, \delta_i(y), \qquad (1)$$
where $\mathcal{I} = \{i_1, \ldots, i_k\}$ is the set of indices $i$ such that $q(y = i)$ is non-zero, and $\delta_i(y) = 1$ if $y = i$. We refer to the set $\mathcal{I}$ as the beam.
Consider the problem of finding the distribution $q(y)$ of smallest weight such that $\mathrm{KL}(q \| p) \le \epsilon$. First, suppose the set $\mathcal{I} = \{i_1, \ldots, i_k\}$ is fixed in advance, and we wish to choose the probabilities $q_i$ to minimize $\mathrm{KL}(q \| p)$. Then the optimal choice is simply $q_i = p_i / \sum_{i' \in \mathcal{I}} p_{i'}$, a result which can be verified using Lagrange multipliers on the normalization constraint of $q$.
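For completeness, a brief sketch of that verification (our own addition; it follows directly from the normalization constraint just stated). Forming the Lagrangian and setting its derivative to zero gives
$$\mathcal{L}(\{q_i\}, \lambda) = \sum_{i \in \mathcal{I}} q_i \log \frac{q_i}{p_i} + \lambda \Big( \sum_{i \in \mathcal{I}} q_i - 1 \Big), \qquad \frac{\partial \mathcal{L}}{\partial q_i} = \log \frac{q_i}{p_i} + 1 + \lambda = 0 \;\Rightarrow\; q_i \propto p_i,$$
so renormalizing over the beam yields $q_i = p_i / \sum_{i' \in \mathcal{I}} p_{i'}$.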
Second, suppose we wish to determine the set of indices $\mathcal{I}$ of a fixed size $k$ which minimizes $\mathrm{KL}(q \| p)$. Then the optimal choice is when $\mathcal{I} = \{i_1, \ldots, i_k\}$ consists of the indices of the largest $k$ values of the discrete distribution $p$. First, define $Z(\mathcal{I}) = \sum_{i \in \mathcal{I}} p_i$; then the optimal approximating distribution is:
$$\arg\min_q \mathrm{KL}(q \| p) = \arg\min_{\mathcal{I}} \Big\{ \min_{\{q_i\}} \sum_{i \in \mathcal{I}} q_i \log \frac{q_i}{p_i} \Big\} \qquad (2)$$
$$= \arg\min_{\mathcal{I}} \Big\{ \sum_{i \in \mathcal{I}} \frac{p_i}{Z(\mathcal{I})} \log \frac{p_i / Z(\mathcal{I})}{p_i} \Big\} \qquad (3)$$
$$= \arg\max_{\mathcal{I}} \big\{ \log Z(\mathcal{I}) \big\} \qquad (4)$$
That is, the optimal choice of indices is the one that retains the most probability mass. This means that it is straightforward to find the discrete distribution $q$ of minimal weight such that $\mathrm{KL}(q \| p) \le \epsilon$: we can sort the elements of the probability vector $p$, truncate once $\log Z(\mathcal{I})$ exceeds $-\epsilon$, and renormalize to obtain $q$.
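This truncation rule is simple to implement. The following is a minimal sketch, not from the original text; the function name, the NumPy-based interface, and the k_min argument (which anticipates the minimum belief size constraint introduced later in this section) are our own.

import numpy as np

def compress_minimum_divergence(p, epsilon, k_min=1):
    """Return the beam I and sparse distribution q with KL(q || p) <= epsilon.

    p       : 1-D array of normalized marginal probabilities
    epsilon : allowed KL divergence between compressed and original marginal
    k_min   : minimum number of retained states (hypothetical safeguard)
    """
    order = np.argsort(p)[::-1]            # states sorted by decreasing mass
    mass = np.cumsum(p[order])             # Z(I) as the beam grows
    # KL(q || p) = -log Z(I), so keep the smallest prefix with Z(I) >= exp(-epsilon).
    k = int(np.searchsorted(mass, np.exp(-epsilon))) + 1
    k = max(k, k_min)
    beam = order[:k]
    q = np.zeros_like(p)
    q[beam] = p[beam] / p[beam].sum()      # renormalize the retained entries
    return beam, q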
To apply these ideas to forward-backward, we essentially compress the marginal beliefs after every message pass. We call this method sparse forward-backward, which we define as follows. Consider a linear-chain probability distribution $p(\mathbf{y}, \mathbf{x}) \propto \prod_t \Psi_t(y_t, y_{t-1}, \mathbf{x})$, such as a hidden Markov model (HMM) or conditional random field (CRF). Let $\alpha_t(i)$ denote the forward messages, $\beta_t(i)$ the backward messages, and $\gamma_t(i) = \alpha_t(i) \beta_t(i)$ the computed marginals. Then the sparse forward recursion is:
1. Pass the message in the standard way:
$$\alpha_t(j) \leftarrow \sum_i \Psi_t(i, j, \mathbf{x}) \, \alpha_{t-1}(i) \qquad (5)$$
2. Compute the new dense belief $\gamma_t$ as
$$\gamma_t(j) \propto \alpha_t(j) \, \beta_t(j) \qquad (6)$$
3. Compress into a sparse belief $\gamma'_t(j)$, maintaining $\mathrm{KL}(\gamma'_t \| \gamma_t) \le \epsilon$. That is, sort the elements of $\gamma_t$ and truncate once $\log Z(\mathcal{I})$ exceeds $-\epsilon$. Call the resulting beam $\mathcal{I}_t$.
4. Compress $\alpha_t(j)$ to respect the new beam $\mathcal{I}_t$ (a sketch of the full recursion follows the next paragraph).
The backward recursion is defined similarly. Note that in every compression operation, the beam $\mathcal{I}_t$ is recomputed from scratch; therefore, during the backward pass, variable configurations can both leave and enter the beam on the basis of backward information. Just as in standard forward-backward, it can be shown by recursion that the sum of final alphas yields the mass of the beam. That is, if $\mathcal{I}$ is the set of all state sequences in the beam, then $\sum_j \alpha_T(j) = \sum_{\mathbf{y} \in \mathcal{I}} \prod_t \Psi_t(y_t, y_{t-1}, \mathbf{x})$. Therefore, because backward revisions to the beam do not decrease the local sum of betas, they do not damage the quality of the global beam over sequences.
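To make the recursion concrete, here is a minimal Python sketch under simplifying assumptions: the potentials are dense NumPy arrays Psi[t] of shape (num_states, num_states), backward messages are not yet available (as on the first forward pass, so the belief is just the normalized alpha), and compress_minimum_divergence is the hypothetical helper sketched above.

import numpy as np

def sparse_forward(Psi, epsilon, k_min=1):
    """Sparse forward pass: compress the belief after every message pass.

    Psi : list of T dense potentials; Psi[t][i, j] plays the role of
          Psi_t(y_{t-1} = i, y_t = j, x) in the linear chain.
    Returns the per-step beams and the (unnormalized) sparse alpha messages.
    """
    num_states = Psi[0].shape[0]
    alpha = np.ones(num_states)            # uniform start state (an assumption)
    beams, alphas = [], []
    for t in range(len(Psi)):
        # Step 1: pass the message in the standard way.  Entries of alpha outside
        # the previous beam are already zero, so only beam states contribute.
        alpha = Psi[t].T @ alpha
        # Steps 2-3: form the belief and compress it with the minimum-divergence
        # criterion (no beta is available on this pass).
        beam, _ = compress_minimum_divergence(alpha / alpha.sum(), epsilon, k_min)
        # Step 4: zero out alpha entries that fell outside the new beam.
        pruned = np.zeros(num_states)
        pruned[beam] = alpha[beam]
        alpha = pruned
        beams.append(beam)
        alphas.append(alpha)
    return beams, alphas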
The criterion in step 3 for selecting the beam is novel, and we call it the minimum-divergence criterion. Alternatively, we could take the top $N$ states, or all states within a threshold. In the next section we compare against these alternative criteria.
Finally, we discuss a few practical considerations. We have found improved results by adding a minimum belief size constraint $K$, which prevents a belief state $\gamma'_t(j)$ from being compressed below $K$ non-zero entries. Also, we have found that the minimum-divergence criterion usually finds a good beam after a single forward pass. Minimizing the number of passes is desirable, because if finding a good beam requires many forward and backward passes, one may as well do exact forward-backward.
3 Results and Analysis
Inthissectionweevaluatesparseforward-backwardforbothmax-productandsum-product
inferenceinHMMsandCRFsandthewellknownNetTalktext-to-speechdataset[5]which
contains20,008Englishwords. Thetaskistoproducetheproperphonesgivenastringof
lettersasinput.
3.1 Decoding Experiments
In this section we compare our minimum-divergence criterion to traditional beam search criteria during Viterbi decoding. We generate synthetic data from an HMM of length 75. Transition matrix entries are sampled from a Dirichlet with every $\alpha_j = 0.1$. Emission matrices are generated from a mixture of two distributions: (a) a low-entropy, sparse conditional distribution with 10 non-zero elements and (b) a high-entropy Dirichlet with every $\alpha_j = 10^4$, with mixture weights of 0.75 and 0.25 respectively. The goal is to simulate a regime where most states are highly informative about their destination, but a few are less informative. We compared three beam criteria: (1) a fixed beam size, (2) an adaptive beam where message entries are retained if their log score is within a fixed threshold of the best so far, and (3) our minimum-divergence criterion with $\mathrm{KL} \le 0.001$ and an additional minimum beam size constraint of $K \ge 4$. Our minimum-divergence criterion finds the exact Viterbi path using an average of only 9.6 states per variable. On the other hand, the fixed beam requires between 20 and 25 states to reach the same accuracy, and the simple threshold beam requires 30.4 states per variable. We have similar results on the NetTalk data (omitted due to space).
3.2 Training Experiments
In this section, we present results showing that sparse forward-backward can be embedded within CRF training, yielding significant speedups in training time with no loss in testing performance.
[Figure 1 appears here: two panels, each titled "Computation Time vs. Likelihood", plotting log likelihood (×10^3) against computation time (seconds) for the fixed beam, threshold beam, minimum-divergence beam (KL ≤ ε with |I| ≥ K), and exact forward-backward methods; panel (a) shows the synthetic data and panel (b) the NetTalk data set.]
Figure 1: Comparison of sparse forward-backward methods for CRF training on both synthetic data (left) and on the NetTalk data set (right). Both graphs plot log likelihood on the training data as a function of training time. In both cases, sparse forward-backward performs equivalently to exact training on both training and test accuracy using only a quarter of the training time.
First, we train CRFs using synthetic data generated from a 100-state HMM in the same manner as in the previous section. We use 50 sequences for training and 50 sequences for testing. In all cases we use exact Viterbi decoding to compute testing accuracy. We compare five different methods for discarding probability mass: (1) the minimum-divergence beam with $\mathrm{KL} \le 0.5$ and minimum beam size $K \ge 30$, (2) a fixed beam of size $K = 30$, (3) a fixed beam whose size was the average size used by the minimum-divergence beam, (4) a threshold-based beam which explores on average the same number of states as the minimum-divergence beam, and (5) exact forward-backward. Learning curves are shown in Figure 1(a).
Sparse forward-backward uses one-fourth of the time of exact training with no loss in accuracy. Also, we find it is important for the beam to be adaptive, by comparing to the fixed beam whose size is the average number of states used by our minimum-divergence criterion. Although minimum divergence and the fixed beam converge to the same solution, minimum divergence finishes faster, indicating that the adaptive beam does help training time. Most of the benefit occurs later in training, as the model becomes farther from uniform.
In the case of the smaller fixed beam of size N, our L-BFGS optimizer terminated with an error as a result of the noisy gradient computation. In the case of the threshold beam, the likelihood gradients were erratic, but L-BFGS did terminate normally. However, the recognition accuracy of the final model was low, at 67.1%.
Finally, we present results from training on the real-world NetTalk data set. In Figure 1(b) we present runtime, model likelihood, and accuracy results for a 52-state CRF for the NetTalk problem that was optimized using 19,075 examples and tested using 934 examples. For the minimum-divergence beam, we set the divergence threshold $\epsilon = 0.005$ and the minimum beam size $K \ge 10$. We initialize the CRF parameters using a subset of 12% of the data, before training on the full data until convergence. We used the beam methods during the complete training run and during this initialization period.
During the complete training run, the threshold beam gradient estimates were so noisy that our L-BFGS optimizer was unable to take a complete step. Exact forward-backward training produced a test set accuracy of 91.6%. Training using the larger fixed beam (N = 20) terminated normally, but very noisy intermediate gradients were found in the terminating iteration. The result was a much lower accuracy of 85.7%. In contrast, the minimum-divergence beam achieved an accuracy of 91.7% in less than 25% of the time it took to exactly train the CRF using forward-backward.
4 Related Work
Related to our work is zero-compression in junction trees [3], described in [2], which considers every potential in a clique tree and sets the smallest potential values to zero, with the constraint that the total mass of the potential does not fall below a fixed value $\delta$. In contrast to our work, they prune the model's potentials once before performing inference, whereas we dynamically prune the beliefs during inference, and indeed the beam can change during inference as new information arrives from other parts of the model. Also, Jordan et al. [4], in their work on hidden Markov decision trees, introduce a variational algorithm that uses a delta on a single best state sequence, but they provide no experimental evaluation of this technique. In computer vision, Coughlan and Ferreira [1] have used a belief pruning method within belief propagation for loopy models which is very similar to our threshold beam baseline.
5 Conclusions
We have presented a principled method for significantly speeding up decoding and learning tasks in HMMs and CRFs, along with experimental work demonstrating the utility of our approach. As future work, we believe a promising avenue of exploration would be adaptive strategies that interact with our L-BFGS optimizer, detecting excessively noisy gradients and automatically setting $\epsilon$ values. While the results here are only for linear-chain models, we believe this approach should be more generally applicable. For example, in pipelines of NLP tasks, it is often better to pass lattices of predictions rather than single-best predictions, in order to preserve uncertainty between the tasks. For such systems, the current work has implications for how to select the lattice size, and how to pass information backwards through the pipeline, so that higher-level information from later tasks can improve performance on earlier tasks.
ACKNOWLEDGMENTS This work was supported in part by the Center for Intelligent Information Retrieval, in part by the Central Intelligence Agency, the National Security Agency and National Science Foundation under NSF grants #IIS-0326249 and #IIS-0427594, and in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.
References
[1] James M. Coughlan and Sabino J. Ferreira. Finding deformable shapes using loopy belief propagation. In European Conference on Computer Vision, 2002.
[2] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999.
[3] F. Jensen and S. K. Andersen. Approximations in Bayesian belief universes for knowledge-based systems. In Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence, 1990. Appears to be unavailable.
[4] Michael I. Jordan, Zoubin Ghahramani, and Lawrence K. Saul. Hidden Markov decision trees. In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, Advances in Neural Information Processing Systems (NIPS), volume 9. MIT Press, 1996.
[5] T. J. Sejnowski and C. R. Rosenberg. NETtalk: a parallel network that learns to read aloud. Cognitive Science, 14:179–211, 1990.