Bayesian Nonparametrics in Topic Modeling: A Brief Tutorial

Alexander Spangher
Columbia University, class of 2014
aas2230
CS6998: High Dimensional Data Analysis

Abstract

The use of nonparametric methods has been increasingly explored in Bayesian hierarchical modeling as a way to increase model flexibility. Although the field shows a lot of promise, inference in many models, including Hierarchical Dirichlet Processes (HDP), remains prohibitively slow. One promising path forward is to exploit the submodularity inherent in the Indian Buffet Process (IBP) to derive near-optimal solutions in polynomial time. In this work, I will present a brief tutorial on Bayesian nonparametric methods, especially as they are applied to topic modeling. I will show a comparison between different non-parametric models and the current state-of-the-art parametric model, Latent Dirichlet Allocation (LDA).

1 Research Goals:

My goals at the outset of this experiment were to:

• Learn more about nonparametric models
• Investigate different non-parametric topic models and compare them to an LDA process run on a New York Times dataset with topics k = 55.
• Pick higher-performing algorithms and apply recent research in Indian Buffet Process submodularity to these algorithms to assess speed increase.

I should note that my experiments were not successful. The non-parametric topic modeling algorithms that I compared to LDA had consistently worse perplexity scores. Thus, I did not go forward with the proposed implementation of submodular methods. This speaks less to the potential of nonparametric models than to the need for more work in the field and the need to lower the barrier to entry: I found that unifying introductory texts were few and far between.

Thus, I will treat this paper as the start of a tutorial I plan to build out in the next few months. While I am certainly not an expert, I approached the subject as a beginner just a few weeks ago. This paper gives me an opportunity to offer a beginner's look at non-parametrics, one that I will expand on. Although I feel that I have not contributed in a significant way research-wise, if I expand this paper over the next few months to give a better high-level introduction to nonparametric modeling, this can be an important contribution.

I have tried to make my emphasis in this paper orthogonal to my presentation. That is, I will go lightly on the topics that I went into heavily during my presentation and try to go more deeply into the background of the processes.

2 Introduction

Bayesian hierarchical techniques present unified, flexible and consistent methods for modeling real-world problems. In this section, we will review parametric models by using the Latent Dirichlet Allocation model (LDA), presented by Blei et al. (2003) [1], as an example. We will then introduce Bayesian processes and show how they factor into non-parametric models.

2.1 Bayesian Parametric Models: Latent Dirichlet Allocation (LDA)

Figure 1: LDA graphical model [1] (Blei, 2003)

The LDA topic model is a prototypical generative Bayesian model. It is "generative" in the sense that it assumes that some real-world observations (in this case, word counts) are generated by a series of draws from underlying distributions. According to Blei's hypothesis, the LDA model is generated as follows:

For each document w in corpus D:
1. Choose N ∼ Poisson(κ)
2. Choose θ ∼ Dir(α)
3. For each of the N words w_n:
   • Choose a topic z_n ∼ Multinomial(θ)
   • Choose a word w_n ∼ Multinomial(β_{z_n})

These underlying distributions are hidden from us when we observe our sample set (the corpus), but through various methods we can infer their parameterization. This allows us to predict future samples, compress information, and explain existing samples in useful ways.
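To make this concrete, here is a minimal sketch of the generative process above in Python/NumPy. It is an illustration only: the toy sizes, the flat Dirichlet used to draw the topic-word distributions β, and all variable names are my own choices, not part of the specification in [1].

    import numpy as np

    rng = np.random.default_rng(0)

    vocab_size, n_topics, n_docs = 1000, 5, 10    # toy sizes chosen for illustration
    kappa = 50.0                                  # Poisson rate for document length
    alpha = np.full(n_topics, 0.1)                # symmetric Dirichlet prior over topic proportions
    # beta[k] is the word distribution of topic k (drawn here from a flat Dirichlet for simplicity)
    beta = rng.dirichlet(np.ones(vocab_size), size=n_topics)

    corpus = []
    for _ in range(n_docs):
        N = rng.poisson(kappa)                    # 1. choose the document length
        theta = rng.dirichlet(alpha)              # 2. choose per-document topic proportions
        z = rng.choice(n_topics, size=N, p=theta)                 # 3a. topic for each word
        words = [rng.choice(vocab_size, p=beta[k]) for k in z]    # 3b. word from that topic
        corpus.append(words)

Inference reverses this story: given only the observed words, we try to recover plausible settings of θ and β.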
The LDA is a mixture model: it assumes that each word is assigned a class, or "topic", and that each document is represented by a mixture of these topics. The overall number of topics in a corpus, k, needs to be chosen by the user, and this parameter is often difficult to interpret. To overcome this limitation, we introduce non-parametric models.

2.2 Bayesian Nonparametric Models: An Introduction

Gaussian Processes: Function Modeling

We start our discussion of nonparametrics with Gaussian Processes, since their construction follows the way in which the processes we will actually use are constructed, and I feel that Gaussian-anything is a naturally easier concept to grasp. Let's say X is distributed according to a Gaussian Process with mean measure m(x) and covariance k(x, x′):

X ∼ GP(m(x), k(x, x′))

The mean measure and covariance measure are simply functions that take some possibly infinite subset x of Euclidean space (or, more generally, Hilbert space) and return a measure. When using processes in Bayesian nonparametrics, it is most helpful to think of a "measure" as our prior belief, expressed not as a parameter but as a mapping function.

In other words, much like a function maps an input space to a quantity, measures are used to assign quantities to sets. For example, let's take the mean measure. In the Gaussian Process, m(x) = E[f(x)]. This maps a finite (or infinite) set x to an expected value. In practice, when we have no information about f(x), it is common to choose m = 0 as a suitable measure prior. (This is reflected in Figure 2.a, which shows our prior beliefs about the GP, where the span of the process is given by the gray region, with mean 0 and constant variance.)

Over a finite subset of Euclidean space, x, the measure takes an expected, real value, and the Gaussian Process takes a discrete distribution: a joint Gaussian. For example, if x_1, x_2 are subsets of the real line, then y_1, y_2 are distributed as:

(y_1, y_2) ∼ GP(m({x_1, x_2}), k({x_1, x_2}, {x_1, x_2}′)) = N((μ_1, μ_2), Σ)

This joint Gaussian property of the GP, taken over an infinite subspace, can describe the set of points forming functions, as shown in Figure 2. We can envision starting with a finite number of discrete subsets of the real line, represented as the blue points in the diagram, and continuing to add samples. As the number of subsets we add approaches infinity, we have encompassed the entire real line and created a joint Gaussian whose realization forms a continuous function. This ability of the GP to model functions allows it to be used as a prior in kernel-based non-linear regression. (For more information, see [3].)

Figure 2: Gaussian Process used to describe the function space. (a) Prior, with GP(0, Σ). The blue, dotted line represents the GP over bounded subspaces while the red and blue lines represent the unbounded cases. (b) Posterior, updated to fit observed data. [3] (Rasmussen, 2006)

Dirichlet Processes: Probability Modeling

Similarly to how the Gaussian Process can be applied over infinite subspaces to model functions, the Dirichlet Process can model continuous probability densities. This is also rather intuitive.

Let's start with a finite set, A. If Π is a partition of A such that ∪_{i=1}^k Π_i = A and Π_i ∩ Π_l = ∅ for i ≠ l, then:

(G(Π_1), ..., G(Π_k)) ∼ Dir(αH(Π_1), ..., αH(Π_k))

where G is drawn from a Dirichlet Process (DP) with base measure H and concentration parameter α:

G ∼ DP(α, H), where E[G(Π_i)] = H(Π_i)

We can think of the base measure H in the Dirichlet Process the same way as we think of the mean measure in the GP: simply as a mapping between subsets of Hilbert space and the reals, in this case reals ∈ [0, 1].
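As a small illustration of this finite-dimensional property (the direct analogue of evaluating a GP at finitely many points), the sketch below draws (G(Π_1), ..., G(Π_k)) for a hypothetical uniform base measure H over k cells of a partition; the values of α and k are arbitrary choices of mine.

    import numpy as np

    rng = np.random.default_rng(0)

    alpha = 5.0                      # concentration parameter (illustrative value)
    k = 4                            # number of cells in the finite partition of A
    H = np.full(k, 1.0 / k)          # base measure H(Pi_i): uniform over the cells

    # Finite-dimensional marginal of the DP:
    # (G(Pi_1), ..., G(Pi_k)) ~ Dir(alpha * H(Pi_1), ..., alpha * H(Pi_k))
    G = rng.dirichlet(alpha * H)
    print(G, G.sum())                # a random probability vector over the cells; sums to 1

    # Averaging many draws recovers E[G(Pi_i)] = H(Pi_i), as stated above
    print(rng.dirichlet(alpha * H, size=10000).mean(axis=0))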
Now, using a Dirichlet Process to model the space of continuous probability functions makes sense when we consider the special property of the Dirichlet distribution that a sample from the distribution is a tuple that sums to 1, essentially a probability mass function. Thus, a Dirichlet Process over infinitely many partitions gives a continuous probability distribution, much the same way that sampling a GP over infinitely many subsets of the real line gives a continuous function.

The Dirichlet Process has found many uses in modeling probability distributions. One example is the Hierarchical Dirichlet Process (HDP) by Teh et al. [4]. This model uses two stacked Dirichlet Processes. The first is sampled to provide the base distribution for the second, allowing a fluid construction of point mixture modeling known as the "Chinese Restaurant Franchise". However, the HDP has noted flaws. In particular, because the HDP draws the class proportions from a dataset-level joint distribution, it makes the assumption that the weight of a component in the entire dataset (in our application, the "corpus") is correlated with the proportion of that component being expressed within a datapoint (a "document"). Or, in other words, the probability of a datapoint exhibiting a class is correlated with the weight of that class within the datapoint. Intuitively, in the topic modeling context, we might argue that rare corpus-level topics are often expressed to a great degree within specific documents, thus indicating that this assumption of the HDP is flawed.

Beta Process: Point Event Modeling

The Beta Process offers us a way to perform class assignment without this correlational bias. We will see more in the next section.

Taking a step back, a draw from a Beta distribution is a special case of a Dirichlet distribution draw over two classes. Similarly, the Beta Process is defined over the product space Ω × [0, 1].

A draw from a Beta Process is given as:

B = Σ_{k=1}^∞ p_k δ_{w_k}

where B ∼ BP(c, B_0), and δ_{w_k} can be thought of as a unit measure at w_k. The p_k and w_k are defined by the Lévy measure, ν, of the Beta Process' product space (given by Hjort (1990) [6]):

ν_BP(dp dw) = c p^(-1) (1 - p)^(c-1) dp B_0(dw)

which can be passed as the mean measure to a Poisson Process (where c > 0 is the concentration parameter and B_0 is the continuous, finite base measure). (This construction is commonly used in Lévy-Khintchine formulations of stochastic processes.)

Therefore, we can see that BP draws need not sum to one. Taken over infinite subspaces, the Beta Process can be used to model CDFs, as point measurements are ∈ [0, 1], but, unlike the Dirichlet, draws are not constrained by a sum over subspaces. Taking the aggregate of point measurements, we can derive CDFs, which are especially useful in fields like survival analysis.

As we can see, the w_j point measurements can also be used to model class assignments. Thus, Beta Processes can effectively model each class membership separately. Per datapoint, the probability mass behind n subspaces, or classes, is no longer bounded by one as in the Dirichlet Process. (For more on Beta Processes see Paisley [7] and Zhou [8].)

2.3 Bernoulli Processes and the Indian Buffet Process

Here we construct such a mixed-membership modeling scheme. The Beta Process is conjugate to the Bernoulli family. (A Bernoulli process is a very simple stochastic process that can be thought of as a sequence of coin flips with probability p.) This makes the Beta-Bernoulli pairing an ideal candidate for mixed-membership applications like topic modeling, where each datapoint is expressed as a combination of latent classes.
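This conjugate pairing is easy to simulate. Below is a minimal finite-K sketch of it (the same finite model, π_k ∼ Beta(α/K, 1) and z_ik ∼ Bernoulli(π_k), whose marginalization is discussed in the next section); N, K and α are toy values of my own choosing. Note that each datapoint can switch on any number of classes, so its assignment mass is not forced to sum to one as in a Dirichlet draw.

    import numpy as np

    rng = np.random.default_rng(0)

    N, K, alpha = 8, 20, 2.0                  # datapoints, candidate classes, concentration (toy values)
    pi = rng.beta(alpha / K, 1.0, size=K)     # per-class inclusion probabilities, one Beta draw per class
    Z = rng.binomial(1, pi, size=(N, K))      # Bernoulli assignments: Z[i, k] = 1 if datapoint i exhibits class k

    print(Z.sum(axis=1))   # classes per datapoint: any number, not constrained to sum to one
    print(Z.sum(axis=0))   # datapoints per class: classes with large pi_k are shared widely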
Griffiths and Ghahramani [9] recognized this in 2005, and constructed the Indian Buffet Process (IBP) by marginalizing out the Beta Process. As Thibaux and Jordan [10] would later show in 2007, given:

B ∼ BP(c, B_0),  {X_i | B}_{i=1...n} ∼ BeP(B)

where BeP is the Bernoulli Process, the Beta Process can be marginalized out to give:

X_{n+1} | X_1, ..., X_n ∼ BeP( (c / (c + n)) B_0 + Σ_j (m_{n,j} / (c + n)) δ_{w_j} )

where B_0 is a continuous base distribution with mass B_0(Ω) = γ, δ_w is the unit point mass at w, and m_{n,j} = Σ_{i=1}^n X_i(w_j), the number of datapoints exhibiting w_j. Thus, we see through this marginalization that the class labels j are independent of order: as long as the order of the w_j is consistent across the X_n, the BeP construction remains the same. This realization can be used to prove exchangeability of the beta-assigned class labels, proving that these points are a de Finetti mixture. (For more details, see [10]. De Finetti mixtures are great because they can be realized as provably well-defined probability distributions.)

If we model these processes as their corresponding distributions, we can derive a probability function:

π_k | α ∼ Beta(α/K, 1)
z_{ik} | π_k ∼ Bernoulli(π_k)

P(Z) = ∏_{k=1}^K ∫ ( ∏_{i=1}^N P(z_{i,k} | π_k) ) p(π_k) dπ_k

Figure 3: Graphical model for the Indian Buffet Process [9]

where the probability of observing the set of class assignments {z_{i,1}, ..., z_{i,K}} for each datapoint i is given by P(Z). Under the exchangeability property of the Beta-Bernoulli construction given above, the Indian Buffet Process can be expressed as a sum over an unbounded set of classes:

P([Z]) = ( K! / ∏_{h=0}^{2^N - 1} K_h! ) P(Z)

P([Z]) = ( K! / ∏_{h=0}^{2^N - 1} K_h! ) ∏_{k=1}^K ∫ ( ∏_{i=1}^N P(z_{i,k} | π_k) ) p(π_k) dπ_k

P([Z] | α, β) = ( (αβ)^{K_+} / ∏_{h=1}^{2^N - 1} K_h! ) exp( -α Σ_{i=1}^N β / (β + i - 1) ) ∏_{k=1}^{K_+} Γ(m_k) Γ(N - m_k + β) / Γ(N + β)

where Γ represents the Gamma function, α and β are concentration parameters, and K_h represents the number of classes sharing history h under the left-ordered-form (lof) ordering. In addition to reordering the class labels in a way that sums probability only over the "active" set K_+ of topics, this construction collapses multiple classes assigned to the same datapoints into a single class label. For more details on the combinatorial ordering scheme see slide 22 of my presentation, included in this folder. This formulation groups the class labels of the IBP datapoints in a way that allows the probability distribution to remain well-defined even as the set of classes is unbounded. Thus, we can draw inference on the infinite parameter space through a finite set of observations.

3 Applications of the Indian Buffet Process: Experiments

The Indian Buffet Process has been applied to a host of mixed-model schemes. Like the Hierarchical Dirichlet Process, it models class membership in a non-parametric way, thus making it appealing for use in non-parametric Bayesian models. Moreover, because it escapes the correlation that the HDP introduces, the IBP has recently been studied in topic modeling applications.

One simple example is the Focused Topic Model (Williamson et al., 2010) [11]. The Focused Topic Model uses an IBP Compound Dirichlet Process, which effectively decouples the correlation between high corpus-level probability mass for a topic and high document-level probability mass. The topic weights θ_i for document i are now generated from a Dir(φ · b_i), where b_i represents a binary draw from an IBP. Thus, strong corpus-level topics may be "shut off" in the document, as the sketch below illustrates.
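This is not the full Focused Topic Model: in the sketch below the relative topic masses φ (which the FTM treats as random) are fixed, and the IBP draw is replaced by the finite Beta-Bernoulli approximation from the earlier sketch. It only illustrates how a binary switch vector b_i masks the Dirichlet over topic weights.

    import numpy as np

    rng = np.random.default_rng(0)

    K, n_docs, alpha = 20, 5, 2.0              # toy sizes
    phi = np.full(K, 1.0)                      # relative topic masses (fixed here; random in the FTM)
    pi = rng.beta(alpha / K, 1.0, size=K)      # finite Beta-Bernoulli stand-in for the IBP

    for i in range(n_docs):
        b = rng.binomial(1, pi)                # binary topic switches b_i for document i
        active = np.flatnonzero(b)             # only switched-on topics receive any mass
        theta = np.zeros(K)
        if active.size > 0:
            theta[active] = rng.dirichlet(phi[active])   # theta_i ~ Dir(phi * b_i), restricted to active topics
        # a topic that is strong at the corpus level but has b[k] = 0 is shut off in this document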
Archambeau et al. take this a step further in [12] by applying an IBP compound Dirichlet to both the document topics θ_i and the words β_w. In other words, while the Focused Topic Model decouples the corpus-level strength of a topic from the probability of a document expressing that topic, it still allows words to be weighted strongly under high-probability topics, encouraging distributions across words to be more uniform. Archambeau et al. add the IBP compound Dirichlet to word probabilities as well, creating a doubly compounded model, which they call LiDA.

Mingyuan Zhou took a different approach [13]. Instead of directly incorporating the IBP, he built a topic model around the Beta-Negative Binomial Process, a hierarchical topic model which is related to the IBP (the IBP is a Beta-Bernoulli construction) but is not binary, and thus has greater expressiveness. He presented this work at the 2014 NIPS conference.

Experiments:

I ran trials for all three of these models on a New York Times dataset of 5,294 documents and 5,065 words, which I compared against a control of LDA with 55 topics. I chose this number as it corresponds to the current 30-day window of articles that we use for training our recommendation system. I held out 10 percent of articles for a perplexity test. The log-perplexity of the held-out set was -7.9 for LDA on 55 topics, 2.099 for LiDA (which scaled to 66 topics), 5.6 for FTM and 5.4 for BNBP.

4 Submodularity of the IBP

Current inference algorithms for IBP-related processes involve either sequential Gibbs samplers or sequential variational inference, effectively creating an NP-hard problem. Recent work by Reed and Ghahramani [14] examining the IBP has focused on its submodularity properties. Although I explored this in detail in my presentation, I will not go deeply into it here, as it is not relevant.

Acknowledgments

I'd like to thank Dr. Lozano and Dr. Aravkin for a truly excellent semester, and my coworkers at the New York Times, especially Daeil Kim, for suggestions and moral support.

5 Conclusion

In this paper, I've tried to give a brief tutorial on some commonly used Bayesian nonparametrics, as well as explain some motivations behind their use. I've tried to steer clear of the euphemisms many introductory reviews use to explain nonparametrics, like stick-breaking processes or Polya urns, in favor of more general explanations. I've reviewed some recent implementations of nonparametric models and compared them on a single dataset.

Although the performance of the models I tested was generally lacking and the methods were slow, I still feel that nonparametrics offer a sophisticated approach towards constructing flexible Bayesian models. With more research, I'm confident that the field could produce some promising results.

References

[1] Blei, David M., Ng, Andrew Y., and Jordan, M. I. (2003) "Latent Dirichlet Allocation." John Lafferty (ed.), Journal of Machine Learning Research 3, pp. 993-1022.
[2] Ibrahim, Joseph Georgy, and Ming Chen. Bayesian Survival Analysis. New York: Springer, 2001. Print.
[3] Rasmussen, Carl Edward, and Christopher K. I. Williams. Gaussian Processes for Machine Learning. Cambridge, Mass.: MIT Press, 2006. Print.
[4] Teh, Yee Whye, Michael I. Jordan, Matthew J. Beal, and David M. Blei. "Hierarchical Dirichlet Processes." Journal of the American Statistical Association: 1566-1581. Print.
[5] Sudderth, Eric. "Graphical Models for Visual Object Recognition and Tracking." Doctoral thesis, Massachusetts Institute of Technology (2006).
[6] Hjort, Nils Lid. "Nonparametric Bayes Estimators Based on Beta Processes in Models for Life History Data." The Annals of Statistics (1990): 1259-1294. Print.
[7] Paisley, John, and Michael Jordan. "A Constructive Definition of the Beta Process." Technical report (2014). Print.
[8] Zhou, Mingyuan, Lauren Hannah, David Dunson, and Lawrence Carin. "Beta-Negative Binomial Process and Poisson Factor Analysis." Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) (2012).
[9] Griffiths, Thomas, and Zoubin Ghahramani. "Infinite Latent Feature Models and the Indian Buffet Process." Presented at NIPS 2005 (2005). Print.
[10] Thibaux, Romain, and Michael Jordan. "Hierarchical Beta Processes and the Indian Buffet Process." Presented at the AISTATS 2007 conference (2007).
[11] Williamson, Sinead, Chong Wang, Katherine Heller, and David Blei. "The IBP Compound Dirichlet Process and Its Application to Focused Topic Modeling." Proceedings of the 27th International Conference on Machine Learning (2010).
[12] Archambeau, Cedric, Balaji Lakshminarayanan, and Guillaume Bouchard. "Latent IBP Compound Dirichlet Allocation." IEEE Transactions on Pattern Analysis and Machine Intelligence: 1. Print.
[13] Zhou, Mingyuan. "Beta-Negative Binomial Process and Exchangeable Random Partitions for Mixed-Membership Modeling." Presented at Neural Information Processing Systems 2014 (2014).
[14] Reed, Colorado, and Zoubin Ghahramani. "Scaling the Indian Buffet Process via Submodular Maximization." Presented at the International Conference on Machine Learning 2013 (2013).
