Efficient Computation of Entropy Gradient for Semi-Supervised Conditional Random Fields

Gideon S. Mann and Andrew McCallum
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
[email protected], [email protected]

Abstract

Entropy regularization is a straightforward and successful method of semi-supervised learning that augments the traditional conditional likelihood objective function with an additional term that aims to minimize the predicted label entropy on unlabeled data. It has previously been demonstrated to provide positive results in linear-chain CRFs, but the published method for calculating the entropy gradient requires significantly more computation than supervised CRF training. This paper presents a new derivation and dynamic program for calculating the entropy gradient that is significantly more efficient—having the same asymptotic time complexity as supervised CRF training. We also present efficient generalizations of this method for calculating the label entropy of all sub-sequences, which is useful for active learning, among other applications.

1 Introduction

Semi-supervised learning is of growing importance in machine learning and NLP (Zhu, 2005). Conditional random fields (CRFs) (Lafferty et al., 2001) are an appealing target for semi-supervised learning because they achieve state-of-the-art performance across a broad spectrum of sequence labeling tasks, and yet, like many other machine learning methods, training them by supervised learning typically requires large annotated data sets.

Entropy regularization (ER) is a method of semi-supervised learning first proposed for classification tasks (Grandvalet and Bengio, 2004). In addition to maximizing the conditional likelihood of the available labels, ER also aims to minimize the entropy of the predicted label distribution on unlabeled data. By insisting on peaked, confident predictions, ER guides the decision boundary away from dense regions of input space. It is simple and compelling: it requires no pre-clustering and no "auxiliary functions," it tunes only one meta-parameter, and it is discriminative.

Jiao et al. (2006) apply this method to linear-chain CRFs and demonstrate encouraging accuracy improvements on a gene-name-tagging task. However, the method they present for calculating the gradient of the entropy takes substantially more time than the traditional supervised-only gradient. Whereas supervised training requires only the classic forward/backward algorithm, taking time O(ns^2) (sequence length times the square of the number of labels), their training method takes O(n^2 s^3)—a factor of O(ns) more. This greatly reduces the practicality of using large amounts of unlabeled data, which is exactly the desired use-case.

This paper presents a new, more efficient entropy gradient derivation and dynamic program that has the same asymptotic time complexity as the gradient for traditional CRF training, O(ns^2). In order to describe this calculation, the paper introduces the concept of subsequence constrained entropy—the entropy of a CRF for an observed data sequence when part of the label sequence is fixed. These methods will allow training on larger unannotated data set sizes than previously possible and support active learning.

2 Semi-Supervised CRF Training

Lafferty et al. (2001) present linear-chain CRFs, a discriminative probabilistic model over observation sequences x and label sequences Y = ⟨Y_1 .. Y_n⟩, where |x| = |Y| = n, and each label Y_i has s different possible discrete values. For a linear-chain CRF of Markov order one:

p_\theta(Y|x) = \frac{1}{Z(x)} \exp\Big( \sum_k \theta_k F_k(x, Y) \Big),

where F_k(x, Y) = \sum_i f_k(x, Y_i, Y_{i+1}, i), and the partition function Z(x) = \sum_Y \exp( \sum_k \theta_k F_k(x, Y) ). Given training data D = ⟨d_1 .. d_n⟩, the model is trained by maximizing the log-likelihood of the data, L(\theta; D) = \sum_d \log p_\theta(Y^{(d)}|x^{(d)}), by gradient methods (e.g., limited-memory BFGS), where the gradient of the likelihood is:

\frac{\partial}{\partial\theta_k} L(\theta; D) = \sum_d F_k(x^{(d)}, Y^{(d)}) - \sum_d \sum_Y p_\theta(Y|x^{(d)}) F_k(x^{(d)}, Y).

The second term (the expected counts of the features given the model) can be computed in a tractable amount of time, since according to the Markov assumption, the feature expectations can be rewritten:

\sum_Y p_\theta(Y|x) F_k(x, Y) = \sum_i \sum_{Y_i, Y_{i+1}} p_\theta(Y_i, Y_{i+1}|x) f_k(x, Y_i, Y_{i+1}, i).

A dynamic program (the forward/backward algorithm) then computes in time O(ns^2) all the needed probabilities p_\theta(Y_i, Y_{i+1}), where n is the sequence length and s is the number of labels.

For semi-supervised training by entropy regularization, we change the objective function by adding the negative entropy of the unannotated data U = ⟨u_1 .. u_n⟩ (a Gaussian prior is also shown):

L(\theta; D, U) = \sum_d \log p_\theta(Y^{(d)}|x^{(d)}) - \sum_k \frac{\theta_k^2}{2\sigma^2} + \lambda \sum_u p_\theta(Y^{(u)}|x^{(u)}) \log p_\theta(Y^{(u)}|x^{(u)}).

This negative entropy term increases as the decision boundary is moved into sparsely-populated regions of input space.
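To make the inference step concrete, the following minimal numpy sketch (ours, not the authors' code) computes these pairwise marginals with forward/backward. For compactness it assumes the feature scores \sum_k \theta_k f_k(x, a, b, i) have already been packed into per-transition log potentials log_phi; all names are illustrative.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True)), axis=axis)

def pairwise_marginals(log_phi):
    """Forward/backward for a linear-chain CRF, O(n s^2).

    log_phi: (n-1, s, s) log potentials, where log_phi[i, a, b] is the total
    score sum_k theta_k f_k(x, a, b, i) for labels (Y_i=a, Y_{i+1}=b); node
    scores are assumed folded into the transitions.  Returns an (n-1, s, s)
    array of pairwise marginals p(Y_i, Y_{i+1} | x).
    """
    n_trans, s, _ = log_phi.shape
    alpha = np.zeros((n_trans + 1, s))   # forward log-messages
    beta = np.zeros((n_trans + 1, s))    # backward log-messages
    for i in range(n_trans):
        alpha[i + 1] = logsumexp(alpha[i][:, None] + log_phi[i], axis=0)
    for i in reversed(range(n_trans)):
        beta[i] = logsumexp(log_phi[i] + beta[i + 1][None, :], axis=1)
    log_z = logsumexp(alpha[-1], axis=0)  # log partition function Z(x)
    # Marginal of each adjacent pair: alpha * potential * beta / Z.
    return np.exp(alpha[:-1, :, None] + log_phi + beta[1:, None, :] - log_z)
```

Each slice pairwise_marginals(log_phi)[i] sums to one, and summing it over either label axis gives the per-position marginals.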
3 An Efficient Form of the Entropy Gradient

In order to maximize the above objective function, the gradient for the entropy term must be computed. Jiao et al. (2006) perform this computation by:

\frac{\partial}{\partial\theta} -H(Y|x) = \mathrm{cov}_{p_\theta(Y|x)}[F(x,Y)] \, \theta,

where

\mathrm{cov}_{p_\theta(Y|x)}[F_j(x,Y), F_k(x,Y)] = E_{p_\theta(Y|x)}[F_j(x,Y) F_k(x,Y)] - E_{p_\theta(Y|x)}[F_j(x,Y)] \, E_{p_\theta(Y|x)}[F_k(x,Y)].

While the second term of the covariance is easy to compute, the first term requires calculation of quadratic feature expectations. The algorithm they propose to compute this term is O(n^2 s^3), as it requires an extra nested loop in forward/backward.

However, the above form of the gradient is not the only possibility. We present here an alternative derivation of the gradient:

\frac{\partial}{\partial\theta_k} -H(Y|x)
  = \frac{\partial}{\partial\theta_k} \sum_Y p_\theta(Y|x) \log p_\theta(Y|x)
  = \sum_Y \Big( \frac{\partial}{\partial\theta_k} p_\theta(Y|x) \Big) \log p_\theta(Y|x) + \sum_Y p_\theta(Y|x) \Big( \frac{\partial}{\partial\theta_k} \log p_\theta(Y|x) \Big)
  = \sum_Y p_\theta(Y|x) \Big( F_k(x,Y) - \sum_{Y'} p_\theta(Y'|x) F_k(x,Y') \Big) \log p_\theta(Y|x)
    + \sum_Y p_\theta(Y|x) \Big( F_k(x,Y) - \sum_{Y'} p_\theta(Y'|x) F_k(x,Y') \Big).

Since \sum_Y p_\theta(Y|x) \sum_{Y'} p_\theta(Y'|x) F_k(x,Y') = \sum_{Y'} p_\theta(Y'|x) F_k(x,Y'), the second summand cancels, leaving:

\frac{\partial}{\partial\theta_k} -H(Y|x) = \sum_Y p_\theta(Y|x) \log p_\theta(Y|x) F_k(x,Y)
  - \Big( \sum_Y p_\theta(Y|x) \log p_\theta(Y|x) \Big) \Big( \sum_{Y'} p_\theta(Y'|x) F_k(x,Y') \Big).

Like the gradient obtained by Jiao et al. (2006), there are two terms, and the second is easily computable given the feature expectations obtained by forward/backward and the entropy for the sequence. However, unlike the previous method, here the first term can be efficiently calculated as well. First, the term must be further factored into a form more amenable to analysis:

\sum_Y p_\theta(Y|x) \log p_\theta(Y|x) F_k(x,Y)
  = \sum_Y p_\theta(Y|x) \log p_\theta(Y|x) \sum_i f_k(x, Y_i, Y_{i+1}, i)
  = \sum_i \sum_{Y_i, Y_{i+1}} f_k(x, Y_i, Y_{i+1}, i) \sum_{Y_{-(i..i+1)}} p_\theta(Y|x) \log p_\theta(Y|x).

Here, Y_{-(i..i+1)} = ⟨Y_{1..(i-1)} Y_{(i+2)..n}⟩. In order to efficiently calculate this term, it is sufficient to calculate \sum_{Y_{-(i..i+1)}} p_\theta(Y|x) \log p_\theta(Y|x) for all pairs y_i, y_{i+1}. The next section presents a dynamic program which can perform these computations in O(ns^2).
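Assuming these constrained sums are available as an array (computed by the dynamic program of the next section), a hypothetical numpy sketch of the final gradient assembly looks as follows. Dense feature arrays are used purely for illustration; a real tagger would exploit feature sparsity.

```python
import numpy as np

def entropy_gradient(feats, pair_marg, h_sigma):
    """Assemble d(-H)/d(theta) from precomputed lattice quantities.

    feats:     (n-1, s, s, K) feature values f_k(x, y_i, y_{i+1}, i)
    pair_marg: (n-1, s, s)    pairwise marginals p(y_i, y_{i+1} | x)
    h_sigma:   (n-1, s, s)    constrained sums
                              sum over Y_{-(i..i+1)} of p(Y|x) log p(Y|x)
                              for each fixed pair (y_i, y_{i+1})
    Returns a length-K gradient vector.
    """
    # First term: sum_i sum_{y_i,y_{i+1}} f_k * constrained (p log p).
    first = np.einsum('iabk,iab->k', feats, h_sigma)
    # Feature expectations E[F_k] from the pairwise marginals.
    exp_fk = np.einsum('iabk,iab->k', feats, pair_marg)
    # sum_Y p log p (= -H(Y|x)): marginalizing any one position's
    # constrained sums over (y_i, y_{i+1}) recovers the full sum.
    neg_entropy = h_sigma[0].sum()
    # Second term: entropy times the feature expectations.
    return first - neg_entropy * exp_fk
```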
4 Subsequence Constrained Entropy

We define subsequence constrained entropy as

H^\sigma(Y_{-(a..b)} | y_{a..b}, x) = \sum_{Y_{-(a..b)}} p_\theta(Y|x) \log p_\theta(Y|x).

The key to the efficient calculation for all subsets is to note that the entropy can be factored given a linear-chain CRF of Markov order 1, since Y_{i+2} is independent of Y_i given Y_{i+1}:

\sum_{Y_{-(a..b)}} p_\theta(Y_{-(a..b)}, y_{a..b}|x) \log p_\theta(Y_{-(a..b)}, y_{a..b}|x)
  = \sum_{Y_{-(a..b)}} p_\theta(y_{a..b}|x) \, p_\theta(Y_{-(a..b)}|y_{a..b}, x) \big( \log p_\theta(y_{a..b}|x) + \log p_\theta(Y_{-(a..b)}|y_{a..b}, x) \big)
  = p_\theta(y_{a..b}|x) \log p_\theta(y_{a..b}|x) + p_\theta(y_{a..b}|x) \, H^\sigma(Y_{-(a..b)}|y_{a..b}, x)
  = p_\theta(y_{a..b}|x) \log p_\theta(y_{a..b}|x) + p_\theta(y_{a..b}|x) \, H^\alpha(Y_{1..(a-1)}|y_a, x) + p_\theta(y_{a..b}|x) \, H^\beta(Y_{(b+1)..n}|y_b, x).

Given the H^\alpha(\cdot) and H^\beta(\cdot) lattices, any sequence entropy can be computed in constant time.

[Figure 1: Partial lattice for computing the subsequence constrained entropy \sum_Y p(Y_{-(3..4)}, y_3, y_4) \log p(Y_{-(3..4)}, y_3, y_4), with nodes y_3 and y_4 flanked by H^\alpha(\emptyset|y_1) and H^\alpha(Y_1|y_2) on the left and H^\beta(Y_6|y_5) and H^\beta(\emptyset|y_6) on the right. Once the complete H^\alpha and H^\beta lattices are constructed (in the direction of the arrows), the entropy for each label sequence can be computed in linear time.]

Figure 1 illustrates an example in which the constrained sequence is of size two, but the method applies to arbitrary-length contiguous label sequences.

Computing the H^\alpha(\cdot) and H^\beta(\cdot) lattices is easily performed using the probabilities obtained by forward/backward. First recall the decomposition formulas for entropy:

H(X, Y) = H(X) + H(Y|X)
H(Y|X) = \sum_x P(X = x) H(Y|X = x).

Using this decomposition, we can define a dynamic program over the entropy lattices similar to forward/backward:

H^\alpha(Y_{1..i}|y_{i+1}, x)
  = H(Y_i|y_{i+1}, x) + H(Y_{1..(i-1)}|Y_i, y_{i+1}, x)
  = \sum_{y_i} p_\theta(y_i|y_{i+1}, x) \log p_\theta(y_i|y_{i+1}, x) + \sum_{y_i} p_\theta(y_i|y_{i+1}, x) \, H^\alpha(Y_{1..(i-1)}|y_i).

The base case for the dynamic program is H^\alpha(\emptyset|y_1) = p(y_1) \log p(y_1). The backward entropy is computed in a similar fashion. The conditional probabilities p_\theta(y_i|y_{i-1}, x) in each of these dynamic programs are available by marginalizing over the per-transition marginal probabilities obtained from forward/backward.
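The following numpy sketch transcribes the H^\alpha recursion directly, using the paper's stated base case; pair_marg is assumed to come from one run of forward/backward, and H^\beta would be built symmetrically from the other end of the sequence. All names are illustrative.

```python
import numpy as np

def h_alpha_lattice(pair_marg):
    """Forward entropy lattice H^alpha for a linear-chain CRF.

    pair_marg: (n-1, s, s) pairwise marginals p(y_i, y_{i+1} | x).
    Returns an (n-1, s) array where H[i, b] holds the lattice value
    H^alpha(Y_{1..i} | y_{i+1}=b, x), built with the decomposition
    H(X, Y) = H(X) + H(Y | X).
    """
    n_trans, s, _ = pair_marg.shape
    H = np.zeros((n_trans, s))
    # Base case from the paper: H^alpha(empty | y_1) = p(y_1) log p(y_1).
    p_y1 = pair_marg[0].sum(axis=1)
    prev = np.where(p_y1 > 0, p_y1 * np.log(np.clip(p_y1, 1e-300, None)), 0.0)
    for i in range(n_trans):
        # p(y_i | y_{i+1}, x), by normalizing the pairwise marginal.
        cond = pair_marg[i] / pair_marg[i].sum(axis=0, keepdims=True)
        plogp = np.where(cond > 0, cond * np.log(np.clip(cond, 1e-300, None)), 0.0)
        # Local conditional (p log p) term plus the expected previous column.
        H[i] = plogp.sum(axis=0) + cond.T @ prev
        prev = H[i]
    return H
```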
The computation for one label sequence requires one run of forward/backward at O(ns^2), and equivalent time to calculate the H^\alpha and H^\beta lattices. Calculating the gradient then requires one final iteration over all label pairs at each position, which is again time O(ns^2), but no greater, as forward/backward and the entropy calculations need only be done once. The complete asymptotic computational cost of calculating the entropy gradient is therefore O(ns^2), which is the same time as supervised training, and a factor of O(ns) faster than the method proposed by Jiao et al. (2006).

Wall-clock timing experiments show that this method takes approximately 1.5 times as long as traditional supervised training—less than the constant factors would suggest.[1] In practice, since the three extra dynamic programs do not require re-calculation of the dot-product between parameters and input features (typically the most expensive part of inference), they are significantly faster than calculating the original forward/backward lattice.

[1] Reporting experimental results with accuracy is unnecessary, since we duplicate the training method of Jiao et al. (2006).

5 Confidence Estimation

In addition to its merits for computing the entropy gradient, subsequence constrained entropy has other uses, including confidence estimation. Kim et al. (2006) propose using entropy as a confidence estimator in active learning in CRFs, where examples with the most uncertainty are selected for presentation to human labelers. In practice, they approximate the entropy of the labels given the N-best labels. Not only could our method quickly and exactly compute the true entropy, but it could also be used to find the subsequence that has the highest uncertainty, which could further reduce the additional human tagging effort.
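As a small illustration of this use (our sketch, not the authors' implementation), one can already rank adjacent label pairs by their marginal entropy using only the pairwise marginals; the H^\alpha/H^\beta lattices extend the same selection to spans of arbitrary length.

```python
import numpy as np

def most_uncertain_position(pair_marg):
    """Rank adjacent label pairs by marginal entropy.

    pair_marg: (n-1, s, s) pairwise marginals p(y_i, y_{i+1} | x).
    Returns (position, entropy) for the least confident adjacent pair,
    a candidate span to present to a human annotator.
    """
    p = np.clip(pair_marg, 1e-300, 1.0)  # guard log(0)
    ent = -(pair_marg * np.log(p)).sum(axis=(1, 2))  # one entropy per position
    i = int(ent.argmax())
    return i, float(ent[i])
```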
6 Related Work

Hernando et al. (2005) present a dynamic program for calculating the entropy of an HMM, which has some loose similarities to the forward pass of the algorithm proposed in this paper. Notably, our algorithm allows for efficient calculation of the entropy of any label subsequence.

Semi-supervised learning has been used in many models, predominantly for classification, as opposed to structured output models like CRFs. Zhu (2005) provides a comprehensive survey of popular semi-supervised learning techniques.

7 Conclusion

This paper presents two algorithmic advances. First, it introduces an efficient method for calculating subsequence constrained entropies in linear-chain CRFs (useful for active learning). Second, it demonstrates how these subsequence constrained entropies can be used to efficiently calculate the gradient of the CRF entropy in time O(ns^2)—the same asymptotic time complexity as the forward/backward algorithm, and an O(ns) improvement over previous algorithms—enabling the practical application of CRF entropy regularization to large unlabeled data sets.

Acknowledgements

This work was supported in part by DoD contract #HM1582-06-1-2013, in part by The Central Intelligence Agency, the National Security Agency and National Science Foundation under NSF grant #IIS-0427594, and in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010. Any opinions, findings and conclusions or recommendations expressed in this material belong to the author(s) and do not necessarily reflect those of the sponsor.

References

Y. Grandvalet and Y. Bengio. 2004. Semi-supervised learning by entropy minimization. In NIPS.

D. Hernando, V. Crespi, and G. Cybenko. 2005. Efficient computation of the hidden Markov model entropy for a given observation sequence. IEEE Trans. on Information Theory, 51(7):2681–2685.

F. Jiao, S. Wang, C.-H. Lee, R. Greiner, and D. Schuurmans. 2006. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In COLING/ACL.

S. Kim, Y. Song, K. Kim, J.-W. Cha, and G. G. Lee. 2006. MMR-based active machine learning for bio named entity recognition. In HLT/NAACL.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282–289.

X. Zhu. 2005. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison.