A Nested HDP for Hierarchical Topic Models

John Paisley (Department of EECS, UC Berkeley), Chong Wang (Dept. of Machine Learning, Carnegie Mellon University), David Blei (Dept. of Computer Science, Princeton University), Michael I. Jordan (Department of EECS, UC Berkeley)

Abstract

We develop a nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. The nHDP is a generalization of the nested Chinese restaurant process (nCRP) that allows each word to follow its own path to a topic node according to a document-specific distribution on a shared tree. This alleviates the rigid, single-path formulation of the nCRP, allowing a document to more easily express thematic borrowings as a random effect. We demonstrate our algorithm on 1.8 million documents from The New York Times.¹

¹ This is a workshop version of a longer paper. Please see [1] for more details, including a scalable inference algorithm.

1 Introduction

Organizing things hierarchically is a natural process of human activity. A hierarchical tree-structured representation of data can provide an illuminating means for understanding and reasoning about the information it contains. The nested Chinese restaurant process (nCRP) [2] is a model that performs this task for the problem of topic modeling. It does this by learning a tree structure for the underlying topics, with the inferential goal being that topics closer to the root are more general and gradually become more specific in thematic content when following a path down the tree.

The nCRP is limited in the hierarchies it can model. We illustrate this limitation in Figure 1. The nCRP models the topics that go into constructing a document as lying along one path of the tree. This significantly limits the underlying structure that can be learned from the data. The nCRP performs document-specific path clustering; our goal is to develop a related Bayesian nonparametric prior that performs word-specific path clustering. We illustrate this objective in Figure 1. To this end, we make use of the hierarchical Dirichlet process [3], developing a novel prior that we refer to as the nested hierarchical Dirichlet process (nHDP). With the nHDP, a top-level nCRP becomes a base distribution for a collection of second-level nCRPs, one for each document. The nested HDP provides the opportunity for cross-thematic borrowing that is not possible with the nCRP.

2 Nested Hierarchical Dirichlet Processes

Nested CRPs. The nested Chinese restaurant process (nCRP) is a tree-structured extension of the CRP that is useful for hierarchical topic modeling. A natural interpretation of the nCRP is as a tree where each parent has an infinite number of children. Starting from the root node, a path is traversed down the tree. Given the current node, a child node is selected with probability proportional to the number of times it was previously selected among its siblings, or a new child is selected with probability proportional to a parameter α > 0.

For hierarchical topic modeling with the nCRP, each node has an associated discrete distribution on word indices drawn from a Dirichlet distribution. Each document selects a path down the tree according to a Markov process, which produces a sequence of topics. A stick-breaking process provides a distribution on the selected topics, after which words for a document are generated by first drawing a topic and then drawing the word index.

HDPs. The HDP is a multi-level version of the Dirichlet process. It makes use of the idea that the base distribution for the DP can be discrete. A discrete base distribution allows multiple draws from the DP prior to place probability mass on the same subset of atoms. Hence different groups of data can share the same atoms, but place different probability distributions on them. The HDP draws its discrete base from a DP prior, and so the atoms are learned.
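To make the nCRP child-selection rule concrete, the following is a minimal sketch of drawing a single root-to-leaf path, assuming a fixed maximum depth and a dictionary of child counts; the function name, the tuple node labels, and the depth cutoff are illustrative choices and not part of the model's formal definition (the actual tree is infinitely deep and wide).

```python
import random
from collections import defaultdict

def sample_ncrp_path(child_counts, alpha=1.0, max_depth=3):
    """Draw one root-to-leaf path; child_counts[node] maps child label -> count."""
    path = [()]                          # represent the root as the empty tuple
    for _ in range(max_depth):
        node = path[-1]
        counts = child_counts[node]
        # Existing children are chosen proportionally to their counts;
        # the remaining alpha mass opens a new child.
        r = random.uniform(0.0, sum(counts.values()) + alpha)
        chosen = None
        for child, c in counts.items():
            r -= c
            if r <= 0.0:
                chosen = child
                break
        if chosen is None:               # landed in the alpha mass
            chosen = node + (len(counts),)
        counts[chosen] += 1
        path.append(chosen)
    return path

# Usage: draw a few paths from an initially empty tree.
tree = defaultdict(lambda: defaultdict(int))
for _ in range(5):
    print(sample_ncrp_path(tree, alpha=1.0))
```

Because the counts are shared across draws, early paths reinforce popular branches while the α mass keeps the tree open to new children, which is the behavior the nCRP relies on.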
Figure 1: An example of path structures for the nested Chinese restaurant process (nCRP) and the nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. With the nCRP, the topics for a document are restricted to lying along a single path to a root node. With the nHDP, each document has access to the entire tree, but a document-specific distribution on paths will place high probability on a particular subtree. The goal of the nHDP is to learn a thematically consistent tree as achieved by the nCRP, while allowing for the cross-thematic borrowings that naturally occur within a document.

Nested HDPs. In building on the nCRP framework, our goal is to allow each document to have access to the entire tree, while still learning document-specific distributions on topics that are thematically coherent. Ideally, each document will still exhibit a dominant path corresponding to its main themes, but with offshoots allowing for random effects. Our two major changes to the nCRP formulation toward this end are that (i) each word follows its own path to a topic, and (ii) each document has its own distribution on paths in a shared tree.

With the nHDP, all documents share a global nCRP. This nCRP is equivalently an infinite collection of Dirichlet processes with a transition rule between DPs starting from a root node. With the nested HDP, we use each Dirichlet process in the global nCRP as a base for a second-level DP drawn independently for each document. This constitutes an HDP. Using a nesting of HDPs allows each document to have its own tree where the transition probabilities are defined over the same subset of nodes, but where the values of these probabilities are document-specific.

For each node in the tree, we also generate document-specific switching probabilities that are i.i.d. beta random variables. When walking down the tree for a particular word, we stop with the switching probability of the current node, or continue down the tree according to the probabilities of the child HDP. We summarize this process in Algorithm 1.

Sample Results. We present some qualitative results for the nHDP topic model on a set of 1.8 million documents from The New York Times. These results were obtained using a scalable variational inference algorithm [4] after one pass through the dataset. In Figure 2 we show example topics from the model and their relative structure. We show four topics from the top level of the tree (shaded), and connect topics according to parent/child relationship. The model learns a meaningful hierarchical structure; for example, the sports subtree branches into the various sports, which themselves appear to branch by teams. In the foreign affairs subtree, children tend to group by major subregion and then branch out into subregion or issue.

Algorithm 1: Generating Documents with the Nested Hierarchical Dirichlet Process
Step 1. Generate a global tree by constructing an nCRP.
Step 2. For each document, generate a document tree from the HDP and switching probabilities for each node.
Step 3. Generate a document. For word n in document d,
  a) Walk down the tree using HDPs. At the current node, continue/stop according to its switching probability.
  b) Sample word n from the discrete distribution at the terminal node.
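As a companion to Algorithm 1, the snippet below is a minimal sketch of Step 3 for a single word, assuming the document-level quantities have already been drawn: document-specific child-transition probabilities (from the second-level HDPs), per-node switching probabilities, and per-node topic distributions over a finite vocabulary. All names and the toy tree in the usage example are illustrative and not taken from the paper.

```python
import numpy as np

def sample_word(trans, stop, topic, rng, root=()):
    """Sample one word index by walking down a document's tree (Algorithm 1, Step 3)."""
    node = root
    # Stop at the current node with its switching probability; otherwise
    # move to a child drawn from the document's HDP at that node.
    while node in trans and rng.random() >= stop[node]:
        children = list(trans[node].keys())
        probs = np.array([trans[node][c] for c in children])
        node = children[rng.choice(len(children), p=probs)]
    # Sample the word index from the terminal node's topic distribution.
    return rng.choice(len(topic[node]), p=np.array(topic[node]))

# Usage with a made-up two-level tree and a three-word vocabulary.
rng = np.random.default_rng(0)
trans = {(): {(0,): 0.7, (1,): 0.3}}            # document-specific HDP transition probabilities
stop = {(): 0.2, (0,): 1.0, (1,): 1.0}          # stand-ins for beta-distributed switching probabilities
topic = {(): [0.4, 0.3, 0.3], (0,): [0.8, 0.1, 0.1], (1,): [0.1, 0.1, 0.8]}
print([sample_word(trans, stop, topic, rng) for _ in range(10)])
```

Because every word repeats this walk under the same document-specific probabilities, most words concentrate on a dominant subtree while individual words can still branch elsewhere, which is the cross-thematic borrowing described above.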
Figure 2: Tree-structured topics from The New York Times. A shaded node is a top-level node.

References

[1] J. Paisley, C. Wang, D. Blei, and M. I. Jordan, "Nested hierarchical Dirichlet processes," arXiv:1210.6738, 2012.
[2] D. Blei, T. Griffiths, and M. Jordan, "The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies," Journal of the ACM, vol. 57, no. 2, pp. 7:1–30, 2010.
[3] Y. Teh, M. Jordan, M. Beal, and D. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566–1581, 2006.
[4] M. Hoffman, D. Blei, C. Wang, and J. Paisley, "Stochastic variational inference," arXiv:1206.7051, 2012.
