Sparse Forward-Backward for Fast Training of Conditional Random Fields

Charles Sutton, Chris Pal and Andrew McCallum
University of Massachusetts Amherst
Dept. of Computer Science
Amherst, MA 01003
{casutton,pal,mccallum}@cs.umass.edu

Abstract

Complex tasks in speech and language processing often include random variables with large state spaces, both in speech tasks that involve predicting words and phonemes, and in joint processing of pipelined systems, in which the state space can be the labeling of an entire sequence. In large state spaces, however, discriminative training can be expensive, because it often requires many calls to forward-backward. Beam search is a standard heuristic for controlling complexity during Viterbi decoding, but during forward-backward, standard beam heuristics can be dangerous, as they can make training unstable. We introduce sparse forward-backward, a variational perspective on beam methods that uses an approximating mixture of Kronecker delta functions. This motivates a novel minimum-divergence beam criterion based on minimizing the KL divergence between the respective marginal distributions. Our beam selection approach is not only more efficient for Viterbi decoding, but also more stable within sparse forward-backward training. For a standard text-to-speech problem, we reduce CRF training time fourfold, from over a day to six hours, with no loss in accuracy.

1 Introduction

Complex tasks in speech and language processing often include random variables with large state spaces. Training such models can be expensive, even for linear chains, because standard estimation techniques, such as expectation maximization and conditional maximum likelihood, often require repeatedly running forward-backward over the training set, which requires quadratic time in the number of states. During Viterbi decoding, a standard technique to address this problem is beam search, that is, ignoring variable configurations whose estimated max-marginal is sufficiently low. For sum-product inference methods such as forward-backward, however, beam methods can be dangerous, because standard beam selection criteria can inappropriately discard probability mass in a way that makes training unstable.

In this paper, we introduce a perspective on beam search that motivates its use within sum-product inference. In particular, we cast beam search as a variational procedure that approximates a distribution with a large state space by a mixture of many fewer Kronecker delta functions. This motivates sparse forward-backward, a novel message-passing algorithm in which the approximate marginal potentials are compressed after each message
pass. Essentially, this extends beam search from max-product inference to sum-product. Our perspective also motivates the minimum-divergence beam, a new beam criterion that selects a compressed marginal distribution within a fixed Kullback-Leibler (KL) divergence of the true marginal. Not only does this criterion perform better than standard beam criteria for Viterbi decoding, it also interacts more stably with training. On one real-world task, the NetTalk text-to-speech data set [5], we can now train a conditional random field (CRF) in about 6 hours for which training previously required over a day, with no loss in accuracy.

2 Sparse Forward-Backward

Standard beam search can be viewed as maintaining sparse local marginal distributions such that together they are as close as possible to a large distribution. In this section, we formalize this intuition using a variational argument, which motivates our new beam criterion for sparse forward-backward.

Consider a discrete distribution p(y), where y is assumed to have very many possible configurations. We approximate p by a sparse distribution q, which we write as a mixture of Kronecker delta functions:

$$
q(y) = \sum_{i \in I} q_i \, \delta_i(y), \qquad (1)
$$

where I = {i_1, ..., i_k} is the set of indices i such that q(y = i) is non-zero, and δ_i(y) = 1 if y = i. We refer to the set I as the beam.

Consider the problem of finding the distribution q(y) of smallest weight such that KL(q ‖ p) ≤ ε. First, suppose the set I = {i_1, ..., i_k} is fixed in advance, and we wish to choose the probabilities q_i to minimize KL(q ‖ p). Then the optimal choice is simply q_i = p_i / Σ_{i∈I} p_i, a result which can be verified using Lagrange multipliers on the normalization constraint of q.

Second, suppose we wish to determine the set of indices I of a fixed size k which minimizes KL(q ‖ p). Then the optimal choice is for I = {i_1, ..., i_k} to consist of the indices of the k largest values of the discrete distribution p. To see this, define Z(I) = Σ_{i∈I} p_i; then the optimal index set is:

$$
\begin{aligned}
\arg\min_{q} \mathrm{KL}(q \| p)
&= \arg\min_{I} \Big\{ \min_{\{q_i\}} \sum_{i \in I} q_i \log \frac{q_i}{p_i} \Big\} \qquad & (2)\\
&= \arg\min_{I} \Big\{ \sum_{i \in I} \frac{p_i}{Z(I)} \log \frac{p_i / Z(I)}{p_i} \Big\} \qquad & (3)\\
&= \arg\max_{I} \big\{ \log Z(I) \big\} \qquad & (4)
\end{aligned}
$$

That is, the optimal choice of indices is the one that retains the most probability mass. This means that it is straightforward to find the discrete distribution q of minimal weight such that KL(q ‖ p) ≤ ε: we can sort the elements of the probability vector p, truncate once log Z(I) exceeds −ε, and renormalize to obtain q.
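To make this compression step concrete, here is a minimal sketch in Python/NumPy (not from the paper; the function name compress_belief and the arguments eps and k_min are illustrative, and k_min anticipates the minimum beam size constraint discussed later). It sorts the probability vector, keeps the largest entries until log Z(I) reaches −ε, and renormalizes on the retained beam:

```python
import numpy as np

def compress_belief(p, eps, k_min=1):
    """Smallest-support approximation q of a normalized vector p with
    KL(q || p) = -log Z(I) <= eps, keeping at least k_min entries.
    Returns (beam, q), where beam is the array of retained indices."""
    order = np.argsort(-p)               # indices of p, largest entries first
    mass = np.cumsum(p[order])           # Z(I) after keeping 1, 2, ... entries
    # KL(q || p) = -log Z(I), so keep entries until Z(I) >= exp(-eps).
    k = int(np.searchsorted(mass, np.exp(-eps))) + 1
    k = max(k, k_min)
    beam = order[:k]
    q = np.zeros_like(p, dtype=float)
    q[beam] = p[beam] / p[beam].sum()    # renormalize on the beam
    return beam, q
```

For example, with p = (0.6, 0.3, 0.05, 0.05) and ε = 0.2, the two largest entries already hold 0.9 ≥ e^(−0.2) ≈ 0.82 of the mass, so the beam is the first two states, q = (2/3, 1/3, 0, 0), and KL(q ‖ p) = −log 0.9 ≈ 0.11.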
To apply these ideas to forward-backward, essentially we compress the marginal beliefs after every message pass. We call this method sparse forward-backward, which we define as follows.

Consider a linear-chain probability distribution p(y, x) ∝ ∏_t Ψ_t(y_t, y_{t−1}, x), such as a hidden Markov model (HMM) or conditional random field (CRF). Let α_t(i) denote the forward messages, β_t(i) the backward messages, and γ_t(i) = α_t(i) β_t(i) the computed marginals. Then the sparse forward recursion is:

1. Pass the message in the standard way:

$$
\alpha_t(j) \leftarrow \sum_i \Psi_t(i, j, \mathbf{x}) \, \alpha_{t-1}(i) \qquad (5)
$$

2. Compute the new dense belief γ_t as

$$
\gamma_t(j) \propto \alpha_t(j) \, \beta_t(j) \qquad (6)
$$

3. Compress γ_t into a sparse belief γ′_t(j), maintaining KL(γ′_t ‖ γ_t) ≤ ε. That is, sort the elements of γ_t and truncate once log Z(I) exceeds −ε. Call the resulting beam I_t.

4. Compress α_t(j) to respect the new beam I_t.

The backward recursion is defined similarly. Note that in every compression operation, the beam I_t is recomputed from scratch; therefore, during the backward pass, variable configurations can both leave and enter the beam on the basis of backward information. Just as in standard forward-backward, it can be shown by recursion that the sum of the final alphas yields the mass of the beam. That is, if I is the set of all state sequences in the beam, then Σ_j α_T(j) = Σ_{y∈I} ∏_t Ψ_t(y_t, y_{t−1}, x). Therefore, because backward revisions to the beam do not decrease the local sum of betas, they do not damage the quality of the global beam over sequences.

The criterion in step 3 for selecting the beam is novel, and we call it the minimum-divergence criterion. Alternatively, we could take the top N states, or all states within a threshold. In the next section we compare against these alternative criteria.

Finally, we discuss a few practical considerations. We have found improved results by adding a minimum belief size constraint K, which prevents a belief state γ′_t(j) from being compressed below K non-zero entries. Also, we have found that the minimum-divergence criterion usually finds a good beam after a single forward pass. Minimizing the number of passes is desirable, because if finding a good beam required many forward and backward passes, one might as well do exact forward-backward.
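As an illustration of how this recursion might look in code, the following is a minimal sketch of the sparse forward pass (assuming the compress_belief helper sketched earlier; the uniform initial belief, uniform backward messages, and per-step normalization are simplifications for readability rather than details from the paper, and a full implementation would also track the log normalization constants and run the analogous sparse backward pass):

```python
import numpy as np

def sparse_forward(psi, eps, k_min=1):
    """Sparse forward pass for a linear chain with dense potentials
    psi[t][i, j] = Psi_t(i, j, x), one S x S matrix per transition.

    Backward messages are taken as uniform here (as on a first forward
    pass), so the belief gamma_t is proportional to alpha_t itself.
    Returns the locally normalized sparse alphas and the beams I_t."""
    S = psi[0].shape[0]
    alpha = np.full(S, 1.0 / S)                  # initial belief (illustrative)
    beam, alpha = compress_belief(alpha, eps, k_min)
    alphas, beams = [alpha], [beam]

    for t in range(len(psi)):
        # Eq. (5), restricted to the previous beam: only |I_{t-1}| rows of Psi are used.
        alpha = psi[t][beam, :].T @ alpha[beam]
        alpha = alpha / alpha.sum()              # local normalization for stability
        # Steps 2-4: with uniform betas, compressing gamma_t and alpha_t coincide.
        beam, alpha = compress_belief(alpha, eps, k_min)
        alphas.append(alpha)
        beams.append(beam)

    return alphas, beams
```

The savings come from the matrix-vector product: each step touches only the |I_{t−1}| retained rows of the S × S potential, reducing the per-step cost from O(S²) to O(|I| · S), which is the source of the training-time reductions reported in the next section.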
3 Results and Analysis

In this section we evaluate sparse forward-backward for both max-product and sum-product inference in HMMs and CRFs, and on the well-known NetTalk text-to-speech data set [5], which contains 20,008 English words. The task is to produce the proper phones given a string of letters as input.

3.1 Decoding Experiments

In this section we compare our minimum-divergence criterion to traditional beam search criteria during Viterbi decoding. We generate synthetic data from an HMM of length 75. Transition matrix entries are sampled from a Dirichlet with every α_j = 0.1. Emission matrices are generated from a mixture of two distributions: (a) a low-entropy, sparse conditional distribution with 10 non-zero elements and (b) a high-entropy Dirichlet with every α_j = 10^4, with mixture weights of .75 and .25 respectively. The goal is to simulate a regime where most states are highly informative about their destination, but a few are less informative. We compared three beam criteria: (1) a fixed beam size, (2) an adaptive beam where message entries are retained if their log score is within a fixed threshold of the best so far, and (3) our minimum-divergence criterion with KL ≤ 0.001 and an additional minimum beam size constraint of K ≥ 4. Our minimum-divergence criterion finds the exact Viterbi path using an average of only 9.6 states per variable. On the other hand, the fixed beam requires between 20 and 25 states to reach the same accuracy, and the simple threshold beam requires 30.4 states per variable. We have similar results on the NetTalk data (omitted due to space).

3.2 Training Experiments

In this section, we present results showing that sparse forward-backward can be embedded within CRF training, yielding significant speedups in training time with no loss in testing performance.

[Figure 1: two panels, each titled "Computation Time vs. Likelihood", plotting log likelihood (×10^3) against computation time (seconds) for the compared beam methods and exact forward-backward.]

Figure 1: Comparison of sparse forward-backward methods for CRF training on both synthetic data (left) and on the NetTalk data set (right). Both graphs plot log likelihood on the training data as a function of training time. In both cases, sparse forward-backward performs equivalently to exact training on both training and test accuracy using only a quarter of the training time.

First, we train CRFs using synthetic data generated from a 100-state HMM in the same manner as in the previous section. We use 50 sequences for training and 50 sequences for testing. In all cases we use exact Viterbi decoding to compute testing accuracy. We compare five different methods for discarding probability mass: (1) the minimum-divergence beam with KL ≤ 0.5 and minimum beam size K ≥ 30, (2) a fixed beam of size K = 30, (3) a fixed beam whose size was the average size used by the minimum-divergence beam, (4) a threshold-based beam which explores on average the same number of states as the minimum-divergence beam, and (5) exact forward-backward. Learning curves are shown in Figure 1(a).

Sparse forward-backward uses one-fourth of the time of exact training with no loss in accuracy. Also, we find it is important for the beam to be adaptive, by comparing to the fixed beam whose size is the average number of states used by our minimum-divergence criterion. Although minimum divergence and the fixed beam converge to the same solution, minimum divergence finishes faster, indicating that the adaptive beam does help training time. Most of the benefit occurs later in training, as the model becomes farther from uniform.

In the case of the smaller, fixed beam of size N, our L-BFGS optimizer terminated with an error as a result of the noisy gradient computation. In the case of the threshold beam, the likelihood gradients were erratic, but L-BFGS did terminate normally. However, the recognition accuracy of the final model was low, at 67.1%.

Finally, we present results from training on the real-world NetTalk data set. In Figure 1(b) we present runtime, model likelihood and accuracy results for a 52-state CRF for the NetTalk problem that was optimized using 19,075 examples and tested using 934 examples. For the minimum-divergence beam, we set the divergence threshold ε = 0.005 and the minimum beam size K ≥ 10. We initialize the CRF parameters using a subset of 12% of the data, before training on the full data until convergence. We used the beam methods during the complete training run and during this initialization period.

During the complete training run, the threshold beam gradient estimates were so noisy that our L-BFGS optimizer was unable to take a complete step. Exact forward-backward training produced a test set accuracy of 91.6%. Training using the larger fixed beam (N = 20) terminated normally, but very noisy intermediate gradients were found in the terminating iteration. The result was a much lower accuracy of 85.7%. In contrast, the minimum-divergence beam achieved an accuracy of 91.7% in less than 25% of the time it took to exactly train the CRF using forward-backward.

4 Related Work

Related to our work is zero-compression in junction trees [3], described in [2], which considers every potential in a clique tree and sets the smallest potential values to zero, with the constraint that the total mass of the potential does not fall below a fixed value δ.
In contrast to our work, they prune the model's potentials once before performing inference, whereas we dynamically prune the beliefs during inference, and indeed the beam can change during inference as new information arrives from other parts of the model. Also, Jordan et al. [4], in their work on hidden Markov decision trees, introduce a variational algorithm that uses a delta function on a single best state sequence, but they provide no experimental evaluation of this technique. In computer vision, Coughlan and Ferreira [1] have used a belief pruning method within belief propagation for loopy models which is very similar to our threshold beam baseline.

5 Conclusions

We have presented a principled method for significantly speeding up decoding and learning tasks in HMMs and CRFs. We have also presented experimental work demonstrating the utility of our approach. As future work, we believe a promising avenue of exploration would be adaptive strategies involving interaction with our L-BFGS optimizer, detecting excessively noisy gradients and automatically setting ε values. While the results here were only with linear-chain models, we believe this approach should be more generally applicable. For example, in pipelines of NLP tasks, it is often better to pass lattices of predictions rather than single-best predictions, in order to preserve uncertainty between the tasks. For such systems, the current work has implications for how to select the lattice size, and how to pass information backwards through the pipeline, so that higher-level information from later tasks can improve performance on earlier tasks.

ACKNOWLEDGMENTS  This work was supported in part by the Center for Intelligent Information Retrieval, in part by the Central Intelligence Agency, the National Security Agency and the National Science Foundation under NSF grants #IIS-0326249 and #IIS-0427594, and in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.

References

[1] James M. Coughlan and Sabino J. Ferreira. Finding deformable shapes using loopy belief propagation. In European Conference on Computer Vision, 2002.

[2] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999.

[3] F. Jensen and S. K. Andersen. Approximations in Bayesian belief universes for knowledge-based systems. In Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence, 1990.

[4] Michael I. Jordan, Zoubin Ghahramani, and Lawrence K. Saul. Hidden Markov decision trees. In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, Advances in Neural Information Processing Systems 9 (NIPS). MIT Press, 1996.

[5] T. J. Sejnowski and C. R. Rosenberg. NETtalk: a parallel network that learns to read aloud. Cognitive Science, 14:179–211, 1990.