Multi-Conditional Learning: Generative/Discriminative Training for Clustering and Classification

Andrew McCallum, Chris Pal, Greg Druck and Xuerui Wang
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, MA 01003-9264
{mccallum,pal,gdruck,xuerui}@cs.umass.edu

Abstract

This paper presents multi-conditional learning (MCL), a training criterion based on a product of multiple conditional likelihoods. When combining the traditional conditional probability of "label given input" with a generative probability of "input given label," the latter acts as a surprisingly effective regularizer. When applied to models with latent variables, MCL combines the structure-discovery capabilities of generative topic models, such as latent Dirichlet allocation and the exponential family harmonium, with the accuracy and robustness of discriminative classifiers, such as logistic regression and conditional random fields. We present results on several standard text data sets showing significant reductions in classification error due to MCL regularization, and substantial gains in precision and recall due to the latent structure discovered under MCL.

Introduction

Conditional-probability training, in the form of maximum entropy classifiers (Berger et al., 1996) and conditional random fields (CRFs) (Lafferty et al., 2001; Sutton & McCallum, 2006), has had dramatic and growing impact on natural language processing, information retrieval, computer vision, bioinformatics, and other related fields. However, discriminative models tend to overfit the training data, and a prior on parameters typically provides limited relief. In fact, it has been shown that in some cases generative naïve Bayes classifiers provide higher accuracy than conditional maximum entropy classifiers (Ng & Jordan, 2002). We thus consider alternative training criteria with reduced reliance on parameter priors, which also combine generative and discriminative learning.

This paper presents multi-conditional learning, a family of parameter estimation objective functions based on a product of multiple conditional likelihoods. In one configuration of this approach, the objective function is the (weighted) product of the "discriminative" probability of label given input and the "generative" probability of the input given label. The former aims to find a good decision boundary, the latter aims to model the density of the input, and the single set of parameters in our naïve-Bayes-structured model thus strives for both. All regularizers provide some additional constraints on parameter estimation. Our experimental results on a variety of standard text data sets show that this density-estimation constraint is a more effective regularizer than "shrinkage toward zero," which is the basis of traditional regularizers such as the Gaussian prior, reducing error by nearly 50% in some cases. As well as improving accuracy, the inclusion of a density estimation criterion helps improve confidence prediction.

In addition to simple conditional models, there has been growing interest in conditionally-trained models with latent variables (Jebara & Pentland, 1998; McCallum et al., 2005; Quattoni et al., 2004). Simultaneously there is immense interest in generative "topic models," such as latent Dirichlet allocation and its progeny, as well as their undirected analogues, including the harmonium models (Welling et al., 2005; Xing et al., 2005; Smolensky, 1986).

In this paper we also demonstrate multi-conditional learning applied to latent-variable models. MCL discovers a latent space projection that captures not only the co-occurrence of features in the input (as in generative models), but also provides the ability to accurately predict designated outputs (as in discriminative models). We find that MCL is more robust than the conditional criterion alone, while also being more purposeful than generative latent variable models. On the document retrieval task introduced in Welling et al. (2005), we find that MCL more than doubles precision and recall in comparison with the generative harmonium.
In latent variable models, MCL can be seen as a form of semi-supervised clustering, with the flexibility to operate on relational, structured, CRF-like models in a principled way. MCL here aims to combine the strengths of CRFs (handling auto-correlation and non-independent input features in making predictions) with the strengths of topic models (discovering co-occurrence patterns and useful latent projections). This paper sets the stage for various interesting future work in multi-conditional learning. Many configurations of multi-conditional learning are possible, including ones with more than two conditional probabilities. For example, transfer learning could naturally be configured as the product of conditional probabilities for the labels of each task, with some latent variables and parameters shared. Semi-supervised learning could be configured as the product of conditional probabilities for predicting the label, as well as predicting each input given the others. These configurations are the subject of ongoing work.
Multi-Conditional Learning and MRFs

In the following exposition we first present the general framework of multi-conditional learning. We then derive the equations used for multi-conditional learning in several structured Markov Random Field (MRF) models. We introduce discrete hidden (sub-class) variables into naïve MRF models, creating multi-conditional mixtures, and discuss how multi-conditional methods are derived. We then construct binary word occurrence models coupled with hidden continuous variables, as in the exponential family harmonium, demonstrating the advantages of multi-conditional learning for these models also.

The MCL Framework

Consider a data set consisting of i = 1, ..., N instances. We will construct probabilistic models consisting of discrete observed random variables {x}, discrete hidden variables {z} and continuous hidden variables z. Denote an outcome of a random variable as x̃. Define j = 1, ..., N_s pairs of disjoint subsets of observations {x̃_A}_{ij} and {x̃_B}_{ij}, where our indices denote the ith instance of the variables in subset j. We will construct a multi-conditional objective by taking the product of different conditional probabilities involving these subsets, and we will use α_j to weight the contributions of the different conditionals. Using these definitions, the optimal parameter settings under our multi-conditional criterion are given by

    \arg\max_{\theta} \prod_{i,j} \sum_{\{z\}_{ij}} \int P\big(\{\tilde{x}_A\}_{ij}, \{z\}, z \mid \{\tilde{x}_B\}_{ij}; \theta\big)^{\alpha_j} \, dz,   (1)

where we derive these marginal conditional likelihoods from a single underlying joint probability model with parameters θ. Our underlying joint probability model may itself be normalized locally, globally or using some combination of the two.

For the experiments in this paper we will partition observed variables into a set of "labels" y and a set of "features" x. We define two pairs of subsets: {x_A, x_B}_1 = {y, x} and {x_A, x_B}_2 = {x, y}. We then construct multi-conditional objective functions L_MC with the following form

    L_{MC} = \log\big( P(y \mid x)^{\alpha} \, P(x \mid y)^{\beta} \big) = \alpha L_{y|x}(\theta) + \beta L_{x|y}(\theta).   (2)

In this configuration one can think of our objective as having a generative component P(x|y) and a discriminative component P(y|x). Another attractive definition using two pairs is {x_A, x_B}_1 = {y, x} and {x_A, x_B}_2 = {x, ∅}, giving rise to objectives of the following form

    L = \log\big( P(y \mid x)^{\alpha} \, P(x)^{\beta} \big),   (3)

which represents a way of restructuring a joint likelihood to concentrate modeling power on a conditional distribution of interest. This objective is similar to the approach advocated in Minka (2005).
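To make Eq. (2) concrete, the following is a minimal sketch (not from the paper) of how a weighted multi-conditional objective can be assembled and handed to a generic optimizer, assuming the two conditional log-likelihoods share a single parameter vector. The helper names `log_p_y_given_x` and `log_p_x_given_y` are hypothetical placeholders for whichever parameterization is used.

```python
from scipy.optimize import minimize

def neg_multi_conditional(theta, data, alpha, beta,
                          log_p_y_given_x, log_p_x_given_y):
    """Negative L_MC = -(alpha * L_{y|x} + beta * L_{x|y}), as in Eq. (2).

    Both conditionals are evaluated with the *same* parameters theta,
    which is what couples the discriminative and generative terms.
    """
    l_yx = sum(log_p_y_given_x(theta, x, y) for x, y in data)
    l_xy = sum(log_p_x_given_y(theta, x, y) for x, y in data)
    return -(alpha * l_yx + beta * l_xy)

# Hypothetical usage: theta0 is an initial parameter vector, data is a list
# of (x, y) pairs, and alpha/beta weight the two conditionals as in the text.
# result = minimize(neg_multi_conditional, theta0,
#                   args=(data, 1.0, 0.1, log_p_y_given_x, log_p_x_given_y))
```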
Naïve MRFs for Documents

The graphical descriptions of the naïve Bayes model for text documents (Nigam et al., 2000) and the multinomial logistic regression or maximum entropy (Berger et al., 1996) model can be written with similar naïve graphical structures. Here we consider naïve MRFs, which can also be represented by a similar graphical structure but define a joint distribution in terms of unnormalized potential functions.

Consider data D = \{(\tilde{y}_n, \tilde{x}_{j,n}); n = 1, \ldots, N, j = 1, \ldots, M_n\} where there are N instances and within each instance there are M_n realizations of discrete random variables {x}. We will use y_n to denote a single discrete random variable for a class label. Model parameters are denoted by θ. For a collection of N documents we thus have M_n word events for each document. The joint distribution of the data can be modeled using a set of naïve MRFs, one for each observation, such that

    P(x_1, \ldots, x_{M_n}, y \mid \theta) = \frac{1}{Z} \, \phi(y \mid \theta_y) \prod_{j=1}^{M_n} \phi(x_j, y \mid \theta_{x,y})   (4)

where

    Z = \sum_{y} \sum_{x_1} \cdots \sum_{x_{M_n}} \phi(y \mid \theta_y) \prod_{j=1}^{M_n} \phi(x_j, y \mid \theta_{x,y}).   (5)

If we define potential functions φ(·) to consist of exponentiated linear functions of multinomial variables (sparse vectors with a single 1 in one of the dimensions), y for labels and w_j for each word, a naïve MRF can be written as

    P(y, \{w\}) = \frac{1}{Z} \exp\Big( y^T \theta_y + y^T \theta_{x,y}^T \sum_{j=1}^{M_n} w_j \Big).   (6)

To simplify our presentation, consider now combining our multinomial word variables {w} such that x = [\sum_{j=1}^{M_n} w_j; 1]. One can also combine θ_y and θ_{x,y} into θ such that

    P(y, x) = \frac{1}{Z} \exp\big( y^T \theta^T x \big).   (7)

Under this model, to optimize L_MC from (2) we have

    P(y \mid x) = \frac{\exp(y^T \theta^T x)}{\sum_{y} \exp(y^T \theta^T x)} \quad \text{and} \quad P(x \mid y) = \frac{\exp(y^T \theta^T x)}{Z(y)}   (8)

where

    Z(y) = \sum_{w_1} \cdots \sum_{w_{M_n}} \prod_{j=1}^{M_n} \exp\big( y^T \theta_{x,y}^T w_j \big) \exp\big( y^T \theta_y \big).   (9)

The gradients of the log conditional likelihoods contained in our objective can then be computed using

    \nabla L_{y|x}(\theta) = \sum_{n=1}^{N} \left( x_n y_n^T - \frac{\sum_{y} \exp(y^T \theta^T x_n) \, x_n y^T}{\sum_{y} \exp(y^T \theta^T x_n)} \right) = N \Big( \langle x y^T \rangle_{\tilde{P}(x,y)} - \big\langle \langle x y^T \rangle_{P(y|x)} \big\rangle_{\tilde{P}(x)} \Big)   (10)

where ⟨·⟩_{P(x)} denotes the expectation with respect to distribution P(x), and we use P̃(x) to denote the empirical distribution of the data, the distribution obtained by placing a delta function on each data point and normalizing by N. To compute ∇L_{x|y}(θ_{x,y}), we observe that

    P(x \mid y) = \prod_{j=1}^{M_n} P(w_j \mid y) = \prod_{j=1}^{M_n} \frac{\exp(y^T \theta_{x,y}^T w_j)}{\sum_{w_j} \exp(y^T \theta_{x,y}^T w_j)},   (11)

and therefore

    \nabla L_{x|y}(\theta_{x,y}) = \sum_{n=1}^{N} \sum_{j=1}^{M_n} \Big( \tilde{w}_{j,n} \tilde{y}_n^T - \sum_{w_{j,n}} w_{j,n} \tilde{y}_n^T \, P(w_{j,n} \mid \tilde{y}_n) \Big).   (12)
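As an illustration of Eqs. (10) and (12), the following is a small numpy sketch (ours, not from the paper) of the two gradient terms for the collapsed parameterization of Eq. (7). It assumes θ is stored as a (V+1) × C matrix whose last row plays the role of θ_y, documents are rows of word counts with a trailing 1, and labels are one-hot rows; these storage conventions are our own.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def mcl_gradient(theta, X, Y, alpha, beta):
    """Gradient of alpha * L_{y|x} + beta * L_{x|y} for the collapsed naive MRF.

    theta : (V+1, C) weights; the last row acts as the class bias theta_y.
    X     : (N, V+1) word-count vectors with a trailing column of ones.
    Y     : (N, C) one-hot class labels.
    """
    V = theta.shape[0] - 1

    # Discriminative term, Eq. (10): empirical minus model expectation of x y^T.
    p_y_given_x = softmax(X @ theta, axis=1)             # (N, C)
    grad_yx = X.T @ (Y - p_y_given_x)                    # (V+1, C)

    # Generative term, Eq. (12): each class defines a multinomial over words.
    p_w_given_y = softmax(theta[:V], axis=0)             # (V, C), columns sum to 1
    counts = X[:, :V]                                    # word counts per document
    words_per_class = (Y * counts.sum(axis=1, keepdims=True)).sum(axis=0)  # (C,)
    grad_xy = np.zeros_like(theta)
    grad_xy[:V] = counts.T @ Y - p_w_given_y * words_per_class

    return alpha * grad_yx + beta * grad_xy
```

A gradient ascent or quasi-Newton step on this quantity, together with the corresponding log-likelihood value, is all that is needed to train the model of Eq. (7) under the objective of Eq. (2).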
Mixtures of Naïve MRFs

We can extend the basic naïve MRF model shown in Figure 1 (Left) by adding a hidden subclass variable as illustrated (Right). In a mixture of naïve MRFs the joint distribution of the data for each observation can be modeled using

    P(\{x\}, y, z \mid \theta) = \frac{1}{Z} \, \phi(y \mid \theta_y) \, \phi(y, z \mid \theta_{y,z}) \prod_{j=1}^{M_n} \phi(x_j, z \mid \theta_{x,z}),   (13)

where the φ(y, z | θ_{y,z}) potential encodes a sparse compatibility function relating labels or classes to a subset of states of the hidden discrete variable z.

Figure 1: (Left) A factor graph (Kschischang et al., 2001) for a naïve MRF. (Right) A factor graph for a mixture of naïve MRFs. In these models each word occurrence is a draw from a discrete random variable; there are M_n random variables in document n.

To optimize a mixture of naïve MRFs, we use the expected gradient algorithm (Salakhutdinov et al., 2003). In this model we can compute the gradient of the complete log likelihood, and this gradient decomposes with respect to our expectation such that the following computation can be efficiently performed,

    \nabla L_{x|y}(\theta) = \frac{\partial}{\partial \theta} \ln P(\{x\} \mid y; \theta) = \sum_{z} P(z \mid \{x\}, y; \theta) \, \frac{\partial}{\partial \theta} \ln P(\{x\}, z \mid y; \theta).   (14)

For example, the gradient for the "weights" λ_{x_e,z_s} comprising the elements of the potential function parameters θ_{x,z} is computed from

    \frac{\partial L_{x|y}(\theta)}{\partial \lambda_{x_e,z_s}} = \sum_{n=1}^{N} \sum_{j=1}^{M_n} \Big[ \sum_{z_n} P(z_n \mid \{\tilde{x}\}_n, \tilde{y}_n; \theta) \, f_{x_e,z_s}(\tilde{x}_{j,n}, z_n) - \sum_{z_n} \sum_{\{x\}_n} P(\{x\}_n, z_n \mid \tilde{y}_n; \theta) \, f_{x_e,z_s}(x_{j,n}, z_n) \Big],   (15)

where f_{x_e,z_s}(x, z) are binary feature functions evaluating to one when the state of x = x_e and the state of z = z_s. The updates for the potential function parameters using L_{y|x} take a form similar to the standard "maximum entropy" gradient computations, augmented with a hidden variable. We term mixture models trained by multi-conditional learning multi-conditional mixtures (MCM).
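The clamped (first) term of Eq. (15) only requires posteriors over the hidden subclass for each document, which are cheap to compute because the global partition function of Eq. (13) cancels. The snippet below is a small numpy sketch under our own log-linear parameterization of the potentials (the names `theta_yz` and `theta_xz` are not from the paper); the second, model-expectation term follows the same per-subclass multinomial pattern as Eq. (12).

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def subclass_posteriors(counts, labels, theta_yz, theta_xz):
    """P(z | {x}, y; theta) for each document, as used in Eqs. (14)-(15).

    counts   : (N, V) word-count vectors.
    labels   : (N,) integer class labels.
    theta_yz : (C, S) label/subclass compatibilities.
    theta_xz : (V, S) word/subclass weights.
    """
    # Only potentials touching z survive; the global Z of Eq. (13) cancels.
    scores = counts @ theta_xz + theta_yz[labels]         # (N, S)
    return softmax(scores, axis=1)

def clamped_term(counts, responsibilities):
    """Posterior-weighted word/subclass counts: the first term of Eq. (15)."""
    return counts.T @ responsibilities                     # (V, S)
```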
Inourex- learning,gradientvaluescanbeobtainedfrom periments here we shall use the harmonium’sfactorization structure to define an MRF and we will then define sets of ∂L ∂F(x ,x ;θ) marginal conditionals distributions of some observed vari- MC =N (α+β) b d ∂θ ∂θ ablesgivenothersthatareofparticularinterestsoastoform (cid:20) (cid:20)(cid:28) (cid:29)P˜(xb,xd) ourmulti-conditionalobjective. ∂F(x ,x ;θ) Importantly,usingagloballynormalizedjointdistribution −α b d (23) ∂θ withthisconstructionitisalsopossibletoderivetwoconsis- (cid:28)(cid:28) (cid:29)P(xd|xb;θ)(cid:29)P˜(xb)(cid:21) tentconditionalmodels,oneforhiddenvariablesgivenob- ∂F(x ,x ;θ) servedvariablesandoneforobservedvariablesgivenhidden −β b d ∂θ variables (Welling et al., 2005). The conditional distribu- (cid:28)(cid:28) (cid:29)P(xb|xd;θ)(cid:29)P˜(xd)(cid:21)(cid:21) tionsdefinedbythesemodelscanalsobeusedtoimplement Relationships to OtherWork sampling schemes for various probabilities in the underly- ingjointmodel. However,itisimportanttorememberthat Theoretical and empirical results in Ng and Jordan (2002) the originalmodelparameterizationis notdefined in terms havesupportedthenotionthat,whileadiscriminativemodel of these conditional distributions. In our experiments be- may have a lower asymptotic error (with more data), the low we usea jointmodelwith a formdefinedby(16)with error rate of classifications based on an analogous genera- WT =[WTWT]suchthatthe(exponentialfamily)condi- b d tivemodelcanoftenapproachanasymptoticallyhigherer- tionaldistributionsconsistentwiththejointmodelare ror rate faster. Hybridsmethodscombininggenerativeand P(z |x˜) = N(z ;µˆ,I), µˆ =µ+WTx˜ (17) discriminative methods are appealing in that they have the n n potentialtodrawuponthestrengthsofbothapproaches.For P(x |˜z) = B(x ;θˆ ), θˆ =θ +W z˜ (18) b b b b b b example,in Raina etal. (2003),a highdimensionalsubset P(x |˜z) = D(x ;θˆ ), θˆ =θ +W z˜, (19) of parametersare trainedundera jointlikelihoodobjective d d d d d d while another smaller subset of parameters are trained un- where N(), B() and D() represent Normal, Bernoulli and der a conditional likelihood objective. In contrast, in our Discrete distributionsrespectively. The following equation approach all parameters are optimized under a number of canbeusedtorepresentthemarginaldistributionofx, conditionalobjectives. P(x|θ,Λ)=exp{θTx+xTΛx−A(θ,Λ)}, (20) InCorduneanuandJaakkola(2003),amethodcharacter- whereΛ= 1WWT andθcombinesθ andθ . Thelabels ized as information regularization is formulated for using forthismode2larethediscreterandomvadriable(bi.e.y=x ) informationaboutthemarginaldensityofunlabeleddatato d andthefeaturesarethebinaryvariables. constrain an otherwise free conditional distribution. Their In an exponential family model with exponential func- approachcan be thoughtof as a method for penalizingde- tionF(x;θ), itiseasyto verifythatthegradientofthelog cisionboundariesthatoccurinareasofhighmarginalden- marginal likelihood L of the observed data x, can be ex- sity. In termsof theregularizationperspective,ourmulti- pressed conditionalapproachusesadditionalorauxiliaryconditional distributions derived from an underlying joint probability ∂L(θ;x) ∂F(x;θ) ∂F(x;θ) =N − , modelasregularizers.Furthermore,ourapproachisdefined ∂θ (cid:20)(cid:28) ∂θ (cid:29)P˜(x) (cid:28) ∂θ (cid:29)P(x;θ)(cid:21) within the context of an underlying joint model. 
Relationships to Other Work

Theoretical and empirical results in Ng and Jordan (2002) have supported the notion that, while a discriminative model may have a lower asymptotic error (with more data), the error rate of classifications based on an analogous generative model can often approach an asymptotically higher error rate faster. Hybrid methods that combine generative and discriminative training are appealing in that they have the potential to draw upon the strengths of both approaches. For example, in Raina et al. (2003), a high dimensional subset of parameters is trained under a joint likelihood objective, while another, smaller subset of parameters is trained under a conditional likelihood objective. In contrast, in our approach all parameters are optimized under a number of conditional objectives.

In Corduneanu and Jaakkola (2003), a method characterized as information regularization is formulated for using information about the marginal density of unlabeled data to constrain an otherwise free conditional distribution. Their approach can be thought of as a method for penalizing decision boundaries that occur in areas of high marginal density. In terms of the regularization perspective, our multi-conditional approach uses additional or auxiliary conditional distributions derived from an underlying joint probability model as regularizers. Furthermore, our approach is defined within the context of an underlying joint model. It is our belief that these additional conditional distributions in our objective function can serve as a regularizer for the conditional distributions we primarily care about, the probability of labels. As such, we weight the conditional distributions differently in our objective.

With equal weighting of conditionals and an appropriate definition of subsets of variables, the method can be seen as a type of pseudo-likelihood (Besag, 1975). However, our goals are quite different, in that we are not trying to approximate a joint likelihood, but rather, we wish to explicitly optimize for the conditional distributions in our objective.

The mixtures of naïve MRFs we present resemble the multiple mixture components per class approach used in Nigam et al. (2000). The conditional distributions arising for our labels given our data are also related to mixtures of experts (Jordan & Jacobs, 1994), conditional mixture models (Jebara & Pentland, 1998), simple mixtures of maximum entropy models (Pavlov et al., 2002), and mixtures of conditional random fields (McCallum et al., 2005; Quattoni et al., 2004). The continuous latent variable model we present here is similar to the dual wing harmonium, or two layer random field, presented in Xing et al. (2005) for mining text and images. In that approach a lower dimensional representation of image and text data is obtained by optimizing the joint likelihood of a harmonium model.
Experimental Results

In this section, we present experimental results using multi-conditional objective functions in the context of the models described. First, we apply naïve Markov random fields to document classification and show that multi-conditional training provides better regularization than the traditional Gaussian prior. Next, we demonstrate mixture forms of the model on both real and synthetic data, including an example of topic discovery. Finally, we show that in harmonium-structured models, the multi-conditional objective provides a quantitatively better latent space.

Naïve MRFs and MCL as Regularization

We use the objective function αL_{y|x}(θ) + βL_{x|y}(θ) in naïve MRFs and compare to the generative naïve Bayes model and the discriminative maximum entropy model for document classification. We present extensive experiments with common text data sets, which are briefly described below.

• 20 Newsgroups is a corpus of approximately 20,000 newsgroup messages. We use the entire corpus (abbreviated as news), as well as two subsets (talk and comp).
• The industry sector corpus is a collection of corporate webpages split into about 70 categories. We use the entire corpus (sector), as well as three subsets: healthcare (health), financial (finan), and technology (tech).
• The movie review corpus (movie) is a collection of user movie reviews from the Internet Movie Database, compiled by Bo Pang at Cornell University. We used the polarity data set (v2.0), where the task is to classify the sentiment of each review as positive or negative.
• The sraa data set consists of 73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation.
• The Web Knowledge Base (webkb) data set consists of webpages from four universities that are classified into faculty, student, course, and project (we discard the categories of staff, department, and other).

We determine α and β, the weights of each component of our objective function, and the Gaussian prior variance σ², using cross validation. Specifically, we use 10-fold cross-validation, with 5 folds used for choosing these parameters and 5 folds used for testing. The models tend to be quite sensitive to the values of α and β. Additionally, because there is no longer a guarantee of convexity, thoughtful initialization of parameters is sometimes required. In future work, we hope to more thoroughly understand and control for these engineering issues.

During preprocessing, we remove words that occur only once in each corpus, as well as stop words, HTML, and email message headers. We also test with small-vocabulary versions of each data set in which the vocabulary size is reduced to 2000 using information gain.

The results are presented in Table 1. The parenthesized values are the standard deviations of the test accuracy across the cross-validation folds. On 15 of 20 data sets, we show improvements over both maximum entropy and naïve Bayes. Although the differences in accuracy are small in some cases, the overall trend across data sets illustrates the potential of MCL for regularization. In fact, the difference between the mean accuracy for maximum entropy and MCL is larger than the difference between the mean accuracies of naïve Bayes and maximum entropy. Across all data sets, the mean MCL accuracy is significantly greater than the mean accuracies of naive Bayes (p = 0.001) and maximum entropy (p = 0.0002) under a one-tailed paired t-test. We also found that in 10 of 15 data sets on which we calculated the area under the accuracy/coverage curve, MCL provided better confidence estimates.

Data            NaiveBayes     MaxEnt         MCL
news            85.3 (0.61)    82.9 (0.82)    85.9 (0.89)
news (2000)     76.4 (0.88)    77.4 (0.81)    77.7 (0.48)
comp            85.1 (1.78)    83.7 (0.68)    83.4 (0.94)
comp (2000)     81.8 (1.36)    82.2 (0.75)    84.0 (1.05)
talk            84.6 (1.02)    82.3 (1.43)    83.7 (1.27)
talk (2000)     83.7 (2.17)    81.6 (2.27)    84.3 (1.21)
sector          75.6 (2.05)    88.0 (1.13)    87.4 (0.84)
sector (2000)   73.9 (0.78)    82.0 (1.03)    83.2 (1.56)
tech            91.0 (1.33)    91.8 (2.24)    93.1 (1.69)
tech (2000)     92.9 (2.46)    91.4 (2.03)    94.5 (1.81)
finan           92.3 (2.36)    89.2 (1.52)    91.5 (2.57)
finan (2000)    87.3 (3.31)    89.6 (1.82)    94.6 (1.79)
health          93.5 (4.36)    94.0 (3.74)    95.5 (4.00)
health (2000)   95.0 (5.00)    91.0 (3.39)    95.5 (4.30)
movie           78.6 (1.20)    82.6 (2.96)    82.7 (2.50)
movie (2000)    90.9 (1.98)    88.8 (1.96)    94.0 (1.05)
sraa            95.9 (0.15)    96.1 (0.23)    96.7 (0.09)
sraa (2000)     93.7 (0.20)    94.7 (0.13)    95.0 (0.21)
webkb           87.9 (2.14)    92.4 (0.84)    92.4 (1.04)
webkb (2000)    84.7 (1.20)    92.4 (1.07)    92.7 (1.40)
mean            86.5 (6.73)    87.7 (5.39)    89.4 (5.76)

Table 1: Document classification accuracies for naive Bayes, maximum entropy, and MCL.
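The significance comparison above is a paired, one-tailed t-test over the per-data-set accuracies. A minimal sketch with recent SciPy versions follows; the arrays show only the first few rows of Table 1 as placeholders, whereas the reported p-values use all twenty data sets.

```python
from scipy import stats

# Per-data-set accuracies, paired by row of Table 1 (first three rows only,
# as placeholders; the real comparison uses all 20 rows).
mcl    = [85.9, 77.7, 83.4]
maxent = [82.9, 77.4, 83.7]

# alternative="greater" (SciPy >= 1.6) gives the one-tailed p-value for
# H1: mean(mcl - maxent) > 0.
t_stat, p_value = stats.ttest_rel(mcl, maxent, alternative="greater")
print(f"t = {t_stat:.3f}, one-tailed p = {p_value:.4f}")
```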
Mixtures of Naïve MRFs

In order to demonstrate the ability of multi-conditional mixtures to successfully classify data that is not linearly separable, we perform the following synthetic data experiments. Four class labels are each associated with four 4-dimensional Gaussians, having means and variances uniformly sampled between 0 and 100. Positions of data points generated from the Gaussians are rounded to integer values. For some samples of the Gaussian means and variances (e.g. an XOR configuration), a significant portion of the data would be misclassified by the best linear separator. MCMs, however, can learn and combine multiple linear decision boundaries. An MCM with two hidden subclasses per class attains an accuracy of 75%, whereas naïve Bayes, maximum entropy, and non-mixture multi-conditional naïve MRFs have accuracies of 54%, 52%, and 56%, respectively. With explicitly-constructed XOR positioning, MCM attains 99%, while the others yield less than 50%.

Running these MCMs on the talk data set yields "topics" similar to latent Dirichlet allocation (LDA) (Blei et al., 2003), except that parameter estimation is driven to discover topics that not only re-generate the words, but also help predict the class label (thus MCM can also be understood as a "semi-supervised" topic model). Furthermore, MCM topics are defined not only by positive word associations, but also by prominent negative word associations. The words with the most positive and negative θ_{x,z} are shown in Table 2.

Topic 1 (gun control)         Topic 2 (Waco incident)
guns            1.27          nra             1.63
texas           1.19          assault         1.52
gun             1.18          waco            1.21
enforcement     1.14          compound        1.19
...             ...           ...             ...
president      -0.83          employer       -0.90
peace          -0.85          cult           -0.94
years          -0.88          terrorists     -1.02
feds           -1.17          matthew        -1.15

Table 2: Two MCM-discovered "topics" associated with the politics.guns label in a run on the talk data set. On the left, discussion about gun control in Texas. The negatively-weighted words are prominent in other classes, including politics.misc. On the right, discussion about the gun rights of David Koresh when federal agents stormed their compound in Waco, TX. Aspects of the Davidian cult, however, were discussed in religion.misc.

Lower-variance Conditional Mixture Estimation

Consider data generated from two classes, each with four sub-classes drawn from 2-D isotropic Gaussians (similar to the example in Jebara and Pentland (2000)). The data are illustrated by red ◦'s and blue ×'s in Figure 3. Using joint, conditional, and multi-conditional likelihood, we fit mixture models with two (diagonal covariance, i.e. naïve) subclasses using conditional expected gradient optimization (Salakhutdinov et al., 2003). The figure depicts the parameters of the best models found under our objectives using ellipses of constant probability under the model.

Figure 3: (Left) Joint likelihood optimization. (Middle) One of the many near optimal solutions found by conditional likelihood optimization. (Right) An optimal solution found by our multi-conditional objective.

From this illustrative example, we see that the parameters estimated by joint likelihood would completely fail to classify ◦ versus × given location. In contrast, the conditional objective focuses completely on the decision boundary; however, in 30 random initializations, this produced parameters with very high variance and little interpretability. Our multi-conditional objective, however, optimizes for both class label prediction and class-conditioned density, yielding good classification accuracy and sensible, low-variance parameter estimates.

Multi-Conditional Harmoniums

We are interested in the quality of the latent representations obtained when optimizing multi-attribute harmonium structured models under standard (joint) maximum likelihood (ML), conditional likelihood (CL) and multi-conditional likelihood (MCL) objectives. We use a similar testing strategy to Welling et al. (2005), but focus on comparing the different latent spaces obtained with the various optimization objectives. As in Welling et al. (2005), we used the reduced 20 newsgroups data set prepared in MATLAB by Sam Roweis. In this data set, 16,242 documents are represented by binary occurrences over a 100-word vocabulary and are labeled as one of four domains.

To evaluate the quality of our latent space, we retrieve documents that have the same domain label as a test document based on their cosine coefficient in the latent space when observing only binary occurrences. We randomly split the data into a training set of 12,000 documents and a test set of 4,242 documents. We use a joint model with a corresponding full rank multi-variate Bernoulli conditional for binary word occurrences and a discrete conditional for domains. Figure 4 shows the precision-recall results. ML-1 is our model with no domain label information. ML-2 is optimized with domain label information. CL is optimized to predict domains from words, and MCL is optimized to predict both words from domains and domains from words. From Figure 4 we see that the latent space captured by the model is more relevant for domain classification when the model is optimized under the CL and MCL objectives. MCL more than doubles the precision and recall at reasonable values of the counterparts.

Figure 4: Precision-recall curves for the "20 newsgroups" data using ML, CL and MCL with 20 latent variables. Random guessing is a horizontal line at .25.
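A sketch of this retrieval evaluation, under our own reading of the setup: documents are embedded at the mean of Eq. (17) computed from the word-occurrence part only, and for each test document the training documents are ranked by cosine similarity and scored against the domain labels. The function and variable names are ours, and the fixed cutoff k stands in for the threshold that is swept to trace the curves in Figure 4.

```python
import numpy as np

def latent_embedding(xb, mu, Wb):
    """Mean of P(z | x) from Eq. (17), using only the binary word occurrences."""
    return mu + xb @ Wb                                   # (N, K)

def precision_recall_at_k(train_z, train_y, test_z, test_y, k):
    """Cosine-similarity retrieval in the latent space, scored by domain label."""
    tn = train_z / np.linalg.norm(train_z, axis=1, keepdims=True)
    qn = test_z / np.linalg.norm(test_z, axis=1, keepdims=True)
    sims = qn @ tn.T                                      # (N_test, N_train)
    top = np.argsort(-sims, axis=1)[:, :k]                # indices of k nearest
    hits = train_y[top] == test_y[:, None]                # relevant among retrieved
    precision = hits.mean()
    total_relevant = (train_y[None, :] == test_y[:, None]).sum(axis=1)
    recall = (hits.sum(axis=1) / total_relevant).mean()
    return precision, recall
```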
Discussion and Conclusions

We have presented multi-conditional learning in the context of naïve MRFs, mixtures of naïve MRFs and harmonium-structured models. For naïve MRFs, we show that multi-conditional learning provides improved regularization, and flexible, robust mixtures. In the context of harmonium-structured models, our experiments show that multi-conditional contrastive-divergence-based optimization procedures can lead to latent document spaces with superior quality.

Multi-conditional learning is well suited for multi-task and semi-supervised learning, since multiple prediction tasks are easily and naturally defined in the MCL framework. In recent work by Ando and Zhang (2005), semi-supervised and multi-task learning methods are combined. Their approach involves auxiliary prediction problems defined for unlabeled data, such that model structures arising from these tasks are also useful for another classification problem of particular interest. Their approach involves finding the principal components of the parameter space for the auxiliary tasks. One can similarly use the MCL approach to define auxiliary conditional distributions among features. In this way MCL is a natural framework for semi-supervised learning. We are presently exploring MCL in these multi-task and semi-supervised settings.
Acknowledgements

This work was supported in part by the Center for Intelligent Information Retrieval, in part by the Central Intelligence Agency, the National Security Agency and the National Science Foundation under NSF grant #IIS-0326249, and in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010.

References

Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817–1853.
Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. (2003). An introduction to MCMC for machine learning. Machine Learning, 50, 5–43.
Berger, A. L., Pietra, S. A. D., & Pietra, V. J. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22, 39–72.
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24, 179–195.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Corduneanu, A., & Jaakkola, T. (2003). On information regularization. Proceedings of Uncertainty in Artificial Intelligence.
Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.
Jebara, T., & Pentland, A. (1998). Maximum conditional likelihood via bound maximization and the CEM algorithm. Neural Information Processing Systems (NIPS), 11.
Jebara, T., & Pentland, A. (2000). On reversing Jensen's inequality. NIPS 13.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Kschischang, F. R., Frey, B., & Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47, 498–519.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. ICML, 282–289.
McCallum, A., Bellare, K., & Pereira, F. (2005). A conditional random field for discriminatively-trained finite-state string edit distance. Conference on Uncertainty in AI (UAI).
Minka, T. (2005). Discriminative models, not discriminative training. MSR-TR-2005-144.
Ng, A. Y., & Jordan, M. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. NIPS 14.
Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. M. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
Pavlov, D., Popescul, A., Pennock, D., & Ungar, L. (2002). Mixtures of conditional maximum entropy models. NEC Research Institute Technical Report NECI.
Quattoni, A., Collins, M., & Darrell, T. (2004). Conditional random fields for object recognition. NIPS 17, 1097–1104.
Raina, R., Shen, Y., Ng, A. Y., & McCallum, A. (2003). Classification with hybrid generative/conditional models. NIPS.
Salakhutdinov, R., Roweis, S., & Ghahramani, Z. (2003). Optimization with EM and expectation-conjugate-gradient. Proc. ICML.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. Rumelhart and J. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations, 194–281. MIT Press.
Sutton, C., & McCallum, A. (2006). An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar (Eds.), Introduction to statistical relational learning. MIT Press. To appear.
Welling, M., Rosen-Zvi, M., & Hinton, G. (2005). Exponential family harmoniums with an application to information retrieval. NIPS, 1481–1488.
Xing, E., Yan, R., & Hauptmann, A. G. (2005). Mining associated text and images with dual-wing harmoniums. Proc. Uncertainty in Artificial Intelligence.
