Optimal Mixture Models in IR

Victor Lavrenko
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts, Amherst, MA 01003
[email protected]

Abstract. We explore the use of Optimal Mixture Models to represent topics. We analyze two broad classes of mixture models: set-based and weighted. We provide an original proof that estimation of set-based models is NP-hard, and therefore not feasible. We argue that weighted models are superior to set-based models, and that the solution can be estimated by a simple gradient descent technique. We demonstrate that Optimal Mixture Models can be successfully applied to the task of document retrieval. Our experiments show that weighted mixtures outperform a simple language modeling baseline. We also observe that weighted mixtures are more robust than other approaches to estimating topical models.

1 Introduction

Statistical Language Modeling approaches have been steadily gaining popularity in the field of Information Retrieval. They were first introduced by Ponte and Croft [18], and were expanded upon in a number of following publications [4,15,24,8,9,11,14]. These approaches have proven to be very effective in a number of applications, including ad-hoc retrieval [18,4,15], topic detection and tracking [26,10], summarization [5], question answering [3], text segmentation [2], and other tasks. The main strength of Language Modeling techniques lies in very careful estimation of word probabilities, something that had been done in a heuristic fashion in prior research on Information Retrieval [21,19,20,25].

A common theme in Language Modeling approaches is that natural language is viewed as the result of repeated sampling from some underlying probability distribution over the vocabulary. If one accepts that model of text generation, many Information Retrieval problems can be re-cast in terms of estimating the probability of observing a given sample of text from a particular distribution. For example, if we knew the distribution of words in a certain topic of interest, we could estimate the probability that a given document is relevant to that topic, as was done in [26,21,14]. Alternatively, we could associate a probability distribution with every document in a large collection, and calculate the probability that a question or a query was a sample from that document [18,4,15].

1.1 Mixture Models

Mixture models represent a very popular estimation technique in the field of Language Modeling. A mixture model is simply a linear combination of several different distributions. Mixture models, in one shape or another, have been employed in every major Language Modeling publication to date. For example, smoothing [6,12,17], a critical component of any language model, can be interpreted as a mixture of a topic model with a background model, as highlighted in [15,13,16].

This paper is primarily concerned with the use of mixtures to represent semantic topic models. For the scope of this paper, a topic model is defined as a distribution which gives the probability of observing any given word in documents that discuss some particular topic. A popular way to estimate a topic model is by mixing word probabilities from the documents that are believed to be related to that topic. In the next section we briefly survey a number of publications exploring the use of mixture models to represent topical content.

1.2 Related Work on Mixture Models

Hofmann [9] described the use of latent semantic variables to represent different topical aspects of documents. Hofmann assumed that there exists a fixed number of latent topical distributions and represented documents as weighted mixtures of those distributions. He used an expectation-maximization algorithm to automatically induce topical distributions by maximizing the likelihood of the entire training set. It is worth pointing out that the nature of the estimation algorithm used by Hofmann also allows one to re-express these latent aspect distributions as mixtures of individual document models.

Berger and Lafferty [4] introduced an approach to Information Retrieval that was based on ideas from Statistical Machine Translation. The authors estimated a semantic model of the document as a weighted mixture of translation vectors. While this model does not involve mixing document models, it is still an example of a mixture model.

In the context of Topic Detection and Tracking [1], several researchers used unweighted mixtures of training documents to represent event-based topics. Specifically, Jin et al. [10] trained a Markov model from positive examples, and Yamron et al. [26] used clustering techniques to represent background topics in the dataset (a topic was represented as a mixture of the documents in the cluster).

Lavrenko [13] considered topical mixture models as a way to improve the effectiveness of smoothing. Recall that smoothing is usually done by combining the sparse topic model (obtained by counting words in some sample of text) with a background model. Lavrenko hypothesized that by using a zone of closely related text samples he could achieve semantic smoothing, where words that are closely related to the original topic would get higher probabilities. Lavrenko used an unweighted mixture model, similar to the one we will describe in section 3.1. The main drawback of the approach was that performance was extremely sensitive to the size of the subset he called the zone. A similar problem was encountered by Ogilvie [16] when he attempted to smooth document models with models of their nearest neighbors.
In two very recent publications, Lafferty and Zhai [11] and Lavrenko and Croft [14] proposed using a weighted mixture of top-ranked documents from the query to represent a topic model. The process of assigning the weights to the documents is quite different in the two publications. Lafferty and Zhai describe an iterative procedure, formalized as a Markov chain on the inverted indexes. Lavrenko and Croft estimate a joint probability of observing the query words together with any possible word in the vocabulary. Both approaches can be expressed as mixtures of document models, and in both cases the authors pointed out that the performance of their methods was strongly dependent on the number of top-ranked documents over which they estimated the probabilities.

1.3 Overview

The remainder of this paper is structured as follows. In section 2 we formally define the problem of finding an Optimal Mixture Model (OMM) for a given observation. We also describe a lower bound on solutions to any OMM problem. Section 3 describes unweighted optimal mixture models and proves that finding such models is computationally infeasible. Section 4.1 defines weighted mixture models and discusses a gradient descent technique for approximating them. Section 5 describes a set of retrieval experiments we carried out to test the empirical performance of Optimal Mixture Models.

2 Optimal Mixture Models

As we pointed out in section 1.2, a number of researchers [13,16,11,14] who employed mixture models observed that the quality of the model is strongly dependent on the subset of documents used to estimate it. In most cases the researchers used a fixed number of top-ranked documents, retrieved in response to the query. The number of documents turns out to be an important parameter that has a strong effect on performance and varies from query to query and from dataset to dataset. The desire to select this parameter automatically is the primary motivation behind the present paper. We would like to find the optimal subset of documents and form an Optimal Mixture Model. Optimality can be defined in a number of different ways; for instance, it could mean best retrieval performance with respect to some particular metric, like precision or recall. However, optimizing for such metrics requires knowledge of relevance judgments, which are not always available at the time when we want to form our mixture model. In this paper we take a very simple criterion for optimality. Suppose we have a sample observation $o_1 \dots o_n$, which could be a user's query or an example document. The optimal mixture model $M_{opt}$ is the model which assigns the highest probability to our observation.

2.1 Formal Problem Statement

Suppose $V = \{w_1, \dots, w_k\}$ is our vocabulary, and $o_1 \dots o_n$ is a string over that vocabulary.
Let $\mathbb{S}$ be the simplex of all probability distributions over $V$, that is $\mathbb{S} = \{ M \in [0,1]^V : \sum_{w \in V} M_w = 1 \}$. In most cases we will not be interested in the whole simplex $\mathbb{S}$, but only in a small subset $\mathbb{S}' \subseteq \mathbb{S}$, which corresponds to the set of all possible mixture models. The exact construction of $\mathbb{S}'$ is different for different types of mixture models, and will be detailed later. The optimal model $M_{opt}$ is the element of $\mathbb{S}'$ that gives the maximum likelihood to the observation $o_1 \dots o_n$:

$$M_{opt} = \arg\max_{M \in \mathbb{S}'} P(o_1 \dots o_n \mid M) \qquad (1)$$

In Information Retrieval research, it is common to assume that the words $o_1 \dots o_n$ are mutually independent of each other, once we fix a model $M$. Equivalently, we can say that each model $M$ is a unigram model, and $o_1 \dots o_n$ represent repeated random samples from $M$. This allows us to compute the joint probability $P(o_1 \dots o_n \mid M)$ as the product of the marginals:

$$M_{opt} = \arg\max_{M \in \mathbb{S}'} \prod_{i=1}^{n} P(o_i \mid M) \qquad (2)$$

Now we can make another assumption common in Information Retrieval: we declare that $o_1 \dots o_n$ are identically distributed according to $M$, that is, for every word $w$ we have $P(o_1 = w \mid M) = \dots = P(o_n = w \mid M)$. Assuming $o_1 \dots o_n$ are identically distributed allows us to re-arrange the terms in the product above, and group together all terms that share the same $w$:

$$M_{opt} = \arg\max_{M \in \mathbb{S}'} \prod_{w \in V} P(w \mid M)^{\#(w,\, o_1 \dots o_n)} \qquad (3)$$

Here the product goes over all the words $w$ in our vocabulary, and $\#(w, o_1 \dots o_n)$ is just the number of times $w$ was observed in our sample $o_1 \dots o_n$. If we let $T_w = \frac{1}{n}\#(w, o_1 \dots o_n)$, use a shorthand $M_w$ for $P(w \mid M)$, and take a logarithm of the objective (which does not affect maximization), we can re-express $M_{opt}$ as follows:

$$M_{opt} = \arg\max_{M \in \mathbb{S}'} \sum_{w \in V} T_w \log M_w \qquad (4)$$
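To make the estimation concrete, here is a minimal Python sketch (the function names are ours, not part of the paper) that builds the empirical distribution $T$ of equation (4) from a tokenized observation and evaluates the log-likelihood objective for a candidate model $M$:

```python
from collections import Counter
from math import log

def empirical_distribution(tokens):
    """T_w = #(w, o_1...o_n) / n, as in equation (4)."""
    n = len(tokens)
    return {w: count / n for w, count in Counter(tokens).items()}

def log_likelihood(T, M):
    """Sum_w T_w log M_w, the objective of equation (4).
    Assumes M assigns nonzero probability to every observed word."""
    return sum(T_w * log(M[w]) for w, T_w in T.items())
```

Maximizing `log_likelihood` over a family of candidate models is exactly the search for $M_{opt}$; the sum runs only over observed words because $T_w = 0$ everywhere else.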
Note that by definition $T$ is also a distribution over the vocabulary, i.e. $T$ is a member of $\mathbb{S}$, although it may not be a member of our subset $\mathbb{S}'$. We can think of $T$ as the empirical distribution of the observation $o_1 \dots o_n$. Now, since both $M$ and $T$ are distributions, maximization of the above summation is equivalent to minimizing the cross-entropy of the distributions $T$ and $M$:

$$M_{opt} = \arg\min_{M \in \mathbb{S}'} H(T \parallel M) \qquad (5)$$

Equation (5) will be used as our objective for forming optimal mixture models in all the remaining sections of this paper. The main differences will be in the composition of the subset $\mathbb{S}'$, but the objective will remain unchanged.

2.2 Lower Bound on OMM Solutions

Suppose we allowed $\mathbb{S}'$ to include all possible distributions over our vocabulary, that is, we make $\mathbb{S}' = \mathbb{S}$. Then we can prove that $T$ itself is the unique optimal solution of equation (5). The proof is detailed in section A.1 of the Appendix.

This observation serves as a very important step in analyzing the computational complexity of finding the optimal model $M$ out of the set $\mathbb{S}'$. We proved that any solution $M$ will be no better than $T$ itself. This implies that for every set $\mathbb{S}'$, determining whether $T \in \mathbb{S}'$ is no more difficult than finding an optimal mixture model from that same set $\mathbb{S}'$. The reduction is very simple: given $\mathbb{S}'$ and $T$, let $M$ be the solution of equation (5). Then, according to section A.1, $T \in \mathbb{S}'$ if and only if $M = T$. Testing whether $M = T$ can be done in linear time (with respect to the size of our vocabulary), so we have a polynomial-time reduction from testing whether $T$ is a member of $\mathbb{S}'$ to solving equation (5) and finding an optimal mixture model.

This result will be used in the remainder of this paper to prove that for certain sets $\mathbb{S}'$, solving equation (5) is NP-hard. In all cases we will show that testing whether $T \in \mathbb{S}'$ is NP-hard, and use the polynomial-time reduction from this section to assert that solving equation (5) for that particular $\mathbb{S}'$ is NP-hard as well.
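Procedurally, the reduction amounts to one call to an OMM solver followed by a linear-time comparison. A sketch, assuming a hypothetical `solve_omm` that returns the minimizer of equation (5) over $\mathbb{S}'$:

```python
def t_is_in_s_prime(T, solve_omm, tol=1e-12):
    """Decide whether T lies in S', using an OMM solver as an oracle.
    By the lower bound of section A.1, the solver's answer M equals T
    if and only if T is a member of S'."""
    M = solve_omm(T)
    vocab = set(T) | set(M)
    return all(abs(T.get(w, 0.0) - M.get(w, 0.0)) <= tol for w in vocab)
```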
3 Set-based Mixture Models

The simplest and most intuitive type of mixture model is a set-based mixture. In this section we describe two simple ways of constructing a mixture model if we are given a set of documents. One is based on concatenating the documents in the set, the other on averaging the document models. Very similar models were considered by Lavrenko [13] and Ogilvie [16] in their attempts to create unweighted mixture models. Estimating either of these models from a given set of documents is trivial. However, if we try to look for the optimal set of documents, the problem becomes infeasible, as we show in section 3.3.

3.1 Pooled Optimal Mixture Models

First we define a restricted class of mixture models that can be formed by "concatenating" several pieces of text and taking the empirical distribution of the result. To make this more formal, suppose we are given a large collection $C$ of text samples of varying length. In this paper we will only consider finite sets $C$. For Information Retrieval applications $C$ will be a collection of documents. For every text sample $s \in C$ we can construct its empirical distribution by setting $M_{s,w} = \#(w,s) / |s|$, just as we did in section 2.1. Here $|s|$ denotes the total number of words in $s$. Similarly, for every subset $S \subseteq C$, we can construct its empirical distribution by concatenating together all elements $s \in S$ and constructing the distribution of the resulting text. In that case, the probability mass on the word $w$ would be:

$$T_S(w) = \frac{\sum_{s \in S} \#(w, s)}{\sum_{s \in S} |s|} \qquad (6)$$

Now, for a given collection of samples $C$, we define the pooled mixture set $\mathbb{S}_{pool}$ to be the set of empirical distributions of all the subsets of $C$, where probabilities are computed according to equation (6). We define the Pooled Optimal Mixture Model (POMM) problem to be the task of solving equation (5) over the set $\mathbb{S}_{pool}$, i.e. finding the element $M \in \mathbb{S}_{pool}$ which minimizes the cross-entropy $H(T \parallel M)$ with a given target distribution $T$.

3.2 Averaged Optimal Mixture Models

Next we consider another class of mixture models, similar to the pooled models described in the last section. These models are also based on a collection $C$ of text samples, and can be formed by "averaging" word frequencies across several pieces of text. To make this formal, let $C$ be a finite collection of text samples, and let $C_M$ be the corresponding collection of empirical distributions; that is, for each observation $s \in C$ there exists a corresponding distribution $M_s \in C_M$ such that $M_{s,w} = \#(w,s) / |s|$. For a subset $S \subseteq C$, we can construct its distribution by averaging together the empirical distributions of the elements of $S$. Let $S$ be a set of text samples, let $S_M$ be the set of corresponding empirical models, and let $\#(S)$ denote the number of elements in $S$. The probability mass on the word $w$ is:

$$M_S(w) = \frac{1}{\#(S)} \sum_{s \in S} M_{s,w} \qquad (7)$$

For a given collection of samples $C$, we define the averaged mixture model set $\mathbb{S}_{avg}$ to be the set of averaged distributions of all subsets $S$ of $C$, with probabilities computed according to equation (7). We define the Averaged Optimal Mixture Model (AOMM) problem to be the task of solving equation (5) over the set $\mathbb{S}_{avg}$.
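The two constructions differ only in how documents are weighted, which the following sketch makes explicit (documents are given as token lists; the helper names are ours):

```python
from collections import Counter

def pooled_model(docs):
    """Equation (6): concatenate the subset, then normalize.
    Longer documents contribute proportionally more mass."""
    counts, total = Counter(), 0
    for doc in docs:
        counts.update(doc)
        total += len(doc)
    return {w: c / total for w, c in counts.items()}

def averaged_model(docs):
    """Equation (7): average the per-document empirical models.
    Every document contributes equally, regardless of its length."""
    models = [{w: c / len(doc) for w, c in Counter(doc).items()} for doc in docs]
    vocab = set().union(*(m.keys() for m in models))
    return {w: sum(m.get(w, 0.0) for m in models) / len(models) for w in vocab}
```

The two models coincide exactly when all documents in the subset have the same length.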
3.3 Finding the Optimal Subset is Infeasible

We outlined two possible ways of estimating a mixture model if we are given a set of documents. Now suppose we were given a target distribution $T$ and a collection $C$, and wanted to find a subset $S \subseteq C$ which produces an optimal mixture model with respect to $T$. It turns out that this problem is computationally infeasible. Intuitively, the problem involves searching over an exponential number of possible subsets of $C$. In section A.3 of the Appendix we prove that finding an optimal subset for pooled models is NP-hard. In section A.4 we show the same for averaged models. In both proofs we start by using the result of section 2.2 and converting the optimization problem to a decision problem over the same space of distributions. Then we describe a polynomial-time reduction from 3SAT to the corresponding decision problem. 3SAT (described in A.2) is a well-known NP-hard problem, and reducing it to finding an optimal subset of documents proves our search problem to be NP-hard as well.

It is interesting to point out that we were not able to demonstrate that finding an optimal subset can actually be solved by a nondeterministic machine in polynomial time. It is easy to show that the decision problems corresponding to POMM and AOMM are in the class NP, but the original optimization problems appear to be more difficult.

4 Weighted Mixture Models

Now we turn our attention to another, more complex class of Optimal Mixture Models. For the set-based models of section 3, the probabilities were completely determined by which documents belonged to the set, and no weighting of documents was allowed. Now we consider the kind of models where, in addition to selecting the subset, we also allow different weights to be placed on the documents in that subset. This flavor of mixture models was used by Hofmann [9], Lafferty and Zhai [11], and Lavrenko and Croft [14] in their research.

4.1 Weighted Optimal Mixture Models

If we want to find an optimal mixture model for some observation, we not only need to find the subset of documents to use, but also need to estimate the optimal weights to place on those documents. At first glance it appears that allowing weights on documents will only aggravate the fact that finding optimal models is infeasible (section 3.3), since we have just added more degrees of freedom to the problem. In reality, allowing weights to be placed on documents actually makes the problem solvable, as it paves the way for numerical approximations. Recall that both POMM and AOMM are essentially combinatorial problems: in both cases we attempt to minimize the cross-entropy (equation (5)) over a finite set of distributions, $\mathbb{S}_{pool}$ for POMM and $\mathbb{S}_{avg}$ for AOMM. Both sets are exponentially large with respect to $C$, but are finite and therefore full of discontinuities. In order to use numerical techniques we must have a continuous space $\mathbb{S}'$. In this section we describe how we can extend $\mathbb{S}_{pool}$, or equivalently $\mathbb{S}_{avg}$, to a continuous simplex $\mathbb{S}_{wgt}$. We define the Weighted Optimal Mixture Model (WOMM) problem to be the optimization of equation (5) over the simplex $\mathbb{S}_{wgt}$. We argue that a WOMM solution will always be no worse than the solution of a POMM or AOMM for a given $C$, although that solution may not necessarily lie in $\mathbb{S}_{pool}$ or $\mathbb{S}_{avg}$. We look at a simple gradient descent technique for solving WOMM. The technique is not guaranteed to find a globally optimal solution, but in practice it converges quite rapidly and exhibits good performance.
WOMM Definition. Let $C$ be our set of text samples and let $C_M$ be the corresponding set of empirical models $M_s$, one for each sample $s \in C$. For an arbitrary vector of weights $\lambda \in [0,1]^C$ we can define the corresponding model $M_\lambda$ to be the average of all the models in $C_M$, weighted by $\lambda$:

$$M_{\lambda,w} = \sum_{s \in C} \lambda_s M_{s,w} \qquad (8)$$

It is easy to verify that equation (8) defines a valid distribution, as long as $\sum_s \lambda_s = 1$. Now we can define $\mathbb{S}_{wgt}$ to be the set of all possible convex combinations of models in $C_M$, i.e. $\mathbb{S}_{wgt} = \{ M_\lambda : \lambda \geq 0, \sum_s \lambda_s = 1 \}$. WOMM is defined as solving equation (5) over $\mathbb{S}_{wgt}$.

Relationship to Set-based Models. It is important to realize that there is a strong connection between WOMM and the set-based models of section 3. The simplex $\mathbb{S}_{wgt}$ includes both sets $\mathbb{S}_{pool}$ and $\mathbb{S}_{avg}$, since:

(i) equations (7) and (8) imply that an AOMM model of a set $S$ is the same thing as a WOMM model $M_\lambda$ where $\lambda_s = 1/\#(S)$ when $s \in S$, and $\lambda_s = 0$ for $s \notin S$;

(ii) equations (6) and (8) imply that a POMM model of a set $S$ is equivalent to a WOMM model $M_\lambda$ where $\lambda_s = |s| / \sum_{s' \in S} |s'|$ when $s \in S$, and $\lambda_s = 0$ for $s \notin S$.

This implies that every element of either $\mathbb{S}_{avg}$ or $\mathbb{S}_{pool}$ is also an element of $\mathbb{S}_{wgt}$. Therefore, a weighted optimal mixture model will be as good as, or better than, any set-based mixture model, as long as we are dealing with the same collection $C$.
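A short sketch of equation (8) and of the two special-case weightings in items (i) and (ii); as before, the names are ours:

```python
def weighted_model(models, lam):
    """Equation (8): M_lambda(w) = sum_s lam_s * M_s(w).
    Yields a valid distribution whenever the weights are
    nonnegative and sum to one."""
    vocab = set().union(*(m.keys() for m in models))
    return {w: sum(l * m.get(w, 0.0) for l, m in zip(lam, models)) for w in vocab}

def aomm_weights(subset_size):
    """Item (i): uniform weights 1/#(S) recover the averaged model."""
    return [1.0 / subset_size] * subset_size

def pomm_weights(doc_lengths):
    """Item (ii): weights |s| / sum |s'| recover the pooled model."""
    total = sum(doc_lengths)
    return [length / total for length in doc_lengths]
```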
4.2 Iterative Gradient Solution

Since $\mathbb{S}_{wgt}$ is a continuous simplex, we can employ numerical techniques to iteratively approach a solution. We describe a gradient descent approach, similar to the one advocated by Yamron et al. [26]. Recall that our objective is to minimize the cross-entropy (equation (5)) of the target distribution $T$ over the simplex $\mathbb{S}_{wgt}$. For a given collection $C$, every element of $\mathbb{S}_{wgt}$ can be expressed in terms of $\lambda$, the vector of mixing weights, according to equation (8). We rewrite the objective function in terms of the mixing vector $\lambda$:

$$H(T \parallel M_\lambda) = -\sum_{w} T_w \log \sum_{s \in C} \frac{\lambda_s}{\sum_{s'} \lambda_{s'}} M_{s,w} = -\sum_{w} T_w \left[ \log \sum_{s} \lambda_s M_{s,w} - \log \sum_{s} \lambda_s \right] \qquad (9)$$

Note that in equation (9) we used the expression $\lambda_s / \sum_{s'} \lambda_{s'}$ instead of $\lambda_s$. Doing this allows us to enforce the constraint that the mixing weights should sum to one without using Lagrange multipliers or other machinery of constrained optimization. In other words, once we have made this change to the objective function, we can perform unconstrained minimization over $\lambda$. In order to find the extremum of equation (9) we take the derivative with respect to the mixing weight of each element $t \in C$:

$$\frac{\partial H(T \parallel M_\lambda)}{\partial \lambda_t} = -\sum_{w} T_w \frac{M_{t,w}}{\sum_{s} \lambda_s M_{s,w}} + \frac{1}{\sum_{s} \lambda_s} \qquad (10)$$

After setting the derivative equal to zero and re-arranging the terms, we see that the extremum is achieved when for every $t \in C$ we have:

$$1 = \sum_{w} T_w \frac{M_{t,w}}{\sum_{s} \left( \lambda_s / \sum_{s'} \lambda_{s'} \right) M_{s,w}} \qquad (11)$$

We can take this equation and turn it into an incremental update rule. Suppose $\lambda_t^{(p)}$ is the mixing weight of element $t$ after $p$ iterations of the algorithm. Then at the next iteration the weight should become:

$$\lambda_t^{(p+1)} = \lambda_t^{(p)} \sum_{w} T_w \frac{M_{t,w}}{\sum_{s} \left( \lambda_s^{(p)} / \sum_{s'} \lambda_{s'}^{(p)} \right) M_{s,w}} \qquad (12)$$

It is easy to see that when the extremum is achieved, equation (11) holds, and the value $\lambda_t^{(p)}$ will not change from one iteration to another, so the procedure is convergent. In practice, it is sufficient to run the procedure for just a few iterations, as it converges rapidly. Every iteration of update (12) requires on the order of $O(\#(C) \cdot \#(V))$ operations, and the number of iterations can be held constant.

Local Minima. It is important to realize that the iterative update in equation (12) is not guaranteed to converge to the global minimum of equation (9). The reason is that the objective function is not convex everywhere. We can see that clearly when we take the second derivative of the objective with respect to the mixing weights:

$$\frac{\partial^2 H(T \parallel M_\lambda)}{\partial \lambda_t \, \partial \lambda_u} = \sum_{w} T_w \frac{M_{t,w} M_{u,w}}{\left( \sum_{s} \lambda_s M_{s,w} \right)^2} - \frac{1}{\left( \sum_{s} \lambda_s \right)^2} \qquad (13)$$

It is not obvious whether the right-hand side of the equation above is positive or negative, so we cannot conclude whether the function is globally convex or whether it has local minima. In practice we found that the incremental algorithm converges quite rapidly.
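The following sketch implements the incremental update as we reconstructed it in equation (12); the multiplicative form and the fixed iteration count are our reading of the convergence argument, and the code assumes every word of the target has nonzero mixture probability:

```python
def womm_weights(T, models, iterations=5):
    """Approximate the WOMM solution by iterating equation (12).

    T      -- target distribution (dict: word -> probability)
    models -- list of empirical document models M_s (dicts)
    At a fixed point equation (11) holds and the weights stop changing.
    """
    lam = [1.0 / len(models)] * len(models)  # uniform starting point
    for _ in range(iterations):
        Lam = sum(lam)
        # A[w] = sum_s (lam_s / Lambda) * M_s(w): the current mixture
        A = {w: sum((l / Lam) * m.get(w, 0.0) for l, m in zip(lam, models))
             for w in T}
        lam = [l * sum(T_w * m.get(w, 0.0) / A[w]
                       for w, T_w in T.items() if A[w] > 0.0)
               for l, m in zip(lam, models)]
    Lam = sum(lam)
    return [l / Lam for l in lam]  # normalize so the weights sum to one
```

Each iteration touches every document model once for every word of the target, matching the per-iteration cost noted above.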
5 Experimental Results

In this section we discuss an application of Optimal Mixture Models to the problem of estimating a topic model from a small sample. The experiments were carried out in the following setting. Our collection $C$ consists of approximately 60,000 newswire and broadcast news stories from the TDT2 corpus [7]. For this dataset we have a collection of 96 event-centered topics. Every topic is defined by the set $R \subseteq C$ of stories that are relevant to it. The relevance assessments were carried out by the LDC [7] and are exhaustive.

For every topic, our goal is to estimate $M_R$, the distribution of words in the documents relevant to that topic. We assume that the relevant set $R$ is unknown, but that we have a single example document $d \in R$. The goal is to approximate the topic model $M_R$ as closely as possible using only $d$ and $C$. We formulate the problem as finding the optimal weighted mixture model, and use the iterative update detailed in section 4.2 to estimate the optimal mixture. Since we do not know $R$, we cannot optimize equation (5) for $M_R$ directly. We hypothesize that optimizing for $T = M_d$ is a good alternative. This hypothesis is similar to the assumptions made in [14]. Note that this assumption may be problematic, since $d$ is effectively an element of $C$. This means that our gradient solution will eventually converge to $M_d$, which is not what we want, since we want to converge to $M_R$. However, we hope that running the gradient method for just a few iterations will result in a reasonable mixture model.

We carry out three types of experiments. First, we demonstrate that the gradient procedure described in section 4.2 indeed converges to the target. Second, we look at how well the resulting mixture model approximates the real topic model $M_R$. Finally, we perform a set of ad-hoc retrieval experiments to demonstrate that our mixture model can be used to produce effective document rankings.

5.1 Convergence to Target

Figure 1 shows how quickly the weighted mixture model converges to the target distribution $M_d$. On the y-axis we plot the relative entropy (equation (14) in the Appendix) between the mixture model and the target model, as a function of the number of gradient updates. Relative entropy is averaged over all 96 topics. The solid line shows