Poisson–Gamma Dynamical Systems

Aaron Schein
College of Information and Computer Sciences
University of Massachusetts Amherst
Amherst, MA 01003
[email protected]

Mingyuan Zhou
McCombs School of Business
The University of Texas at Austin
Austin, TX 78712
[email protected]

Hanna Wallach
Microsoft Research New York
641 Avenue of the Americas
New York, NY 10011
[email protected]

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

We introduce a new dynamical system for sequentially observed multivariate count data. This model is based on the gamma–Poisson construction, a natural choice for count data, and relies on a novel Bayesian nonparametric prior that ties and shrinks the model parameters, thus avoiding overfitting. We present an efficient MCMC inference algorithm that advances recent work on augmentation schemes for inference in negative binomial models. Finally, we demonstrate the model's inductive bias using a variety of real-world data sets, showing that it exhibits superior predictive performance over other models and infers highly interpretable latent structure.

1 Introduction

Sequentially observed count vectors y^{(1)}, ..., y^{(T)} are the main object of study in many real-world applications, including text analysis, social network analysis, and recommender systems. Count data pose unique statistical and computational challenges when they are high-dimensional, sparse, and overdispersed, as is often the case in real-world applications. For example, when tracking counts of user interactions in a social network, only a tiny fraction of possible edges are ever active, exhibiting bursty periods of activity when they are. Models of such data should exploit this sparsity in order to scale to high dimensions and be robust to overdispersed temporal patterns. In addition to these characteristics, sequentially observed multivariate count data often exhibit complex dependencies within and across time steps. For example, scientific papers about one topic may encourage researchers to write papers about another related topic in the following year. Models of such data should therefore capture the topic structure of individual documents as well as the excitatory relationships between topics.

The linear dynamical system (LDS) is a widely used model for sequentially observed data, with many well-developed inference techniques based on the Kalman filter [1, 2]. The LDS assumes that each sequentially observed V-dimensional vector r^{(t)} is real-valued and Gaussian distributed: r^{(t)} ~ N(Φ θ^{(t)}, Σ), where θ^{(t)} ∈ R^K is a latent state, with K components, that is linked to the observed space via Φ ∈ R^{V×K}. The LDS derives its expressive power from the way it assumes that the latent states evolve: θ^{(t)} ~ N(Π θ^{(t-1)}, ∆), where Π ∈ R^{K×K} is a transition matrix that captures between-component dependencies across time steps. Although the LDS can be linked to non-real observations via the extended Kalman filter [3], it cannot efficiently model real-world count data because inference is O((K+V)^3) and thus scales poorly with the dimensionality of the data [2].
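To make the contrast with the model introduced below concrete, here is a minimal NumPy sketch of the LDS generative process just described; the dimensions, transition matrix, and noise covariances are illustrative assumptions, not values from the paper.

```python
# A sketch of the LDS generative process: Gaussian latent states with
# linear-Gaussian emissions. All parameter values here are assumptions.
import numpy as np

rng = np.random.default_rng(0)
K, V, T = 5, 20, 50
Pi = 0.9 * np.eye(K)                     # transition matrix Π (assumed stable)
Phi = rng.normal(size=(V, K))            # emission matrix Φ
Delta, Sigma = 0.1 * np.eye(K), np.eye(V)
theta = np.zeros(K)
r = np.empty((T, V))
for t in range(T):
    theta = rng.multivariate_normal(Pi @ theta, Delta)   # θ^{(t)} ~ N(Πθ^{(t-1)}, ∆)
    r[t] = rng.multivariate_normal(Phi @ theta, Sigma)   # r^{(t)} ~ N(Φθ^{(t)}, Σ)
```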
Many previous approaches to modeling sequentially observed count data rely on the generalized linear modeling framework [4] to link the observations to a latent Gaussian space, e.g., via the Poisson–lognormal link [5]. Researchers have used this construction to factorize sequentially observed count matrices under a Poisson likelihood, while modeling the temporal structure using well-studied Gaussian techniques [6, 7]. Most of these previous approaches assume a simple Gaussian state-space model, i.e., θ^{(t)} ~ N(θ^{(t-1)}, ∆), that lacks the expressive transition structure of the LDS; one notable exception is the Poisson linear dynamical system [8]. In practice, these approaches exhibit prohibitive computational complexity in high dimensions, and the Gaussian assumption may fail to accommodate the burstiness often inherent to real-world count data [9].

We present the Poisson–gamma dynamical system (PGDS): a new dynamical system, based on the gamma–Poisson construction, that supports the expressive transition structure of the LDS. This model naturally handles overdispersed data. We introduce a new Bayesian nonparametric prior to automatically infer the model's rank. We develop an elegant and efficient algorithm for inferring the parameters of the transition structure that advances recent work on augmentation schemes for inference in negative binomial models [10] and scales with the number of non-zero counts, thus exploiting the sparsity inherent to real-world count data. We examine the way in which the dynamical gamma–Poisson construction propagates information and derive the model's steady state, which involves the Lambert W function [11]. Finally, we use the PGDS to analyze a diverse range of real-world data sets, showing that it exhibits excellent predictive performance on smoothing and forecasting tasks and infers interpretable latent structure, an example of which is depicted in figure 1.

[Figure 1: The time-step factors for three components inferred by the PGDS from a corpus of NIPS papers. Each component is associated with a feature factor for each word type in the corpus; we list the words with the largest factors. The inferred structure tells a familiar story about the rise and fall of certain subfields of machine learning.]

2 Poisson–Gamma Dynamical Systems

We can represent a data set of V-dimensional sequentially observed count vectors y^{(1)}, ..., y^{(T)} as a V × T count matrix Y. The PGDS models a single count y_v^{(t)} ∈ {0, 1, ...} in this matrix as follows:

    y_v^{(t)} ~ Pois(δ^{(t)} Σ_{k=1}^K φ_{vk} θ_k^{(t)})   and   θ_k^{(t)} ~ Gam(τ_0 Σ_{k_2=1}^K π_{k k_2} θ_{k_2}^{(t-1)}, τ_0),   (1)

where the latent factors φ_{vk} and θ_k^{(t)} are both positive, and represent the strength of feature v in component k and the strength of component k at time step t, respectively. The scaling factor δ^{(t)} captures the scale of the counts at time step t, and therefore obviates the need to rescale the data as a preprocessing step. We refer to the PGDS as stationary if δ^{(t)} = δ for t = 1, ..., T. We can view the feature factors as a V × K matrix Φ and the time-step factors as a T × K matrix Θ. Because we can also collectively view the scaling factors and time-step factors as a T × K matrix Ψ, where element ψ_{tk} = δ^{(t)} θ_k^{(t)}, the PGDS is a form of Poisson matrix factorization: Y ~ Pois(Φ Ψ^T) [12, 13, 14, 15].

The PGDS is characterized by its expressive transition structure, which assumes that each time-step factor θ_k^{(t)} is drawn from a gamma distribution whose shape parameter is a linear combination of the K factors at the previous time step. The latent transition weights π_{11}, ..., π_{k_1 k_2}, ..., π_{KK}, which we can view as a K × K transition matrix Π, capture the excitatory relationships between components. The vector θ^{(t)} = (θ_1^{(t)}, ..., θ_K^{(t)}) has an expected value of E[θ^{(t)} | θ^{(t-1)}, Π] = Π θ^{(t-1)} and is therefore analogous to a latent state in the LDS. The concentration parameter τ_0 determines the variance of θ^{(t)}, specifically Var(θ^{(t)} | θ^{(t-1)}, Π) = (Π θ^{(t-1)}) τ_0^{-1}, without affecting its expected value.
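The following is a minimal sketch of ancestral sampling from equation 1 (stationary variant). Φ, Π, δ, and the initial gamma shapes are fixed to illustrative values here; the priors over these quantities, introduced in equation 2 below, are omitted.

```python
# A sketch of the PGDS generative process in equation 1, under assumed
# parameter values. Columns of Pi and Phi sum to one, as in the model.
import numpy as np

rng = np.random.default_rng(1)
K, V, T, tau0, delta = 4, 30, 100, 1.0, 1.0
Pi = rng.dirichlet(np.ones(K), size=K).T       # K x K, columns sum to one
Phi = rng.dirichlet(np.ones(V), size=K).T      # V x K, columns sum to one
theta = rng.gamma(tau0 * np.ones(K), 1.0 / tau0)        # θ^{(1)}, taking ν_k = 1
Y = np.empty((T, V), dtype=int)
for t in range(T):
    if t > 0:
        theta = rng.gamma(tau0 * (Pi @ theta), 1.0 / tau0)  # θ^{(t)} | θ^{(t-1)}
    Y[t] = rng.poisson(delta * (Phi @ theta))               # y_v^{(t)}
```

Note that the gamma shape at each step is τ_0 (Π θ^{(t-1)})_k with rate τ_0, so the chain has expectation Π θ^{(t-1)}, matching the text above.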
To model the strength of each component, we introduce K component weights ν = (ν_1, ..., ν_K) and place a shrinkage prior over them. We assume that the time-step factors and transition weights for component k are tied to its component weight ν_k. Specifically, we define the following structure:

    θ_k^{(1)} ~ Gam(τ_0 ν_k, τ_0),   π_k ~ Dir(ν_1 ν_k, ..., ξ ν_k, ..., ν_K ν_k),   ν_k ~ Gam(γ_0 / K, β),   (2)

where π_k = (π_{1k}, ..., π_{Kk}) is the k-th column of Π and ξ ν_k occupies the k-th position of the Dirichlet parameter. Because Σ_{k_1=1}^K π_{k_1 k} = 1, we can interpret π_{k_1 k} as the probability of transitioning from component k to component k_1. (We note that interpreting Π as a stochastic transition matrix relates the PGDS to the discrete hidden Markov model.) For a fixed value of γ_0, increasing K will encourage many of the component weights to be small. A small value of ν_k will shrink θ_k^{(1)}, as well as the transition weights in the k-th row of Π. Small values of the transition weights in the k-th row of Π therefore prevent component k from being excited by the other components and by itself. Specifically, because the shape parameter for the gamma prior over θ_k^{(t)} involves a linear combination of θ^{(t-1)} and the transition weights in the k-th row of Π, small transition weights will result in a small shape parameter, shrinking θ_k^{(t)}. Thus, the component weights play a critical role in the PGDS by enabling it to automatically turn off any unneeded capacity and avoid overfitting.

Finally, we place Dirichlet priors over the feature factors and draw the other parameters from a non-informative gamma prior: φ_k = (φ_{1k}, ..., φ_{Vk}) ~ Dir(η_0, ..., η_0) and δ^{(t)}, ξ, β ~ Gam(ε_0, ε_0). The PGDS therefore has four positive hyperparameters to be set by the user: τ_0, γ_0, η_0, and ε_0.

Bayesian nonparametric interpretation: As K → ∞, the component weights and their corresponding feature factor vectors constitute a draw G = Σ_{k=1}^∞ ν_k 1_{φ_k} from a gamma process GamP(G_0, β), where β is a scale parameter and G_0 is a finite and continuous base measure over a complete separable metric space Ω [16]. Models based on the gamma process have an inherent shrinkage mechanism because the number of atoms with weights greater than ε > 0 follows a Poisson distribution with a finite mean, specifically Pois(γ_0 ∫_ε^∞ dν ν^{-1} exp(−βν)), where γ_0 = G_0(Ω) is the total mass under the base measure. This interpretation enables us to view the priors over Π and Θ as novel stochastic processes, which we call the column-normalized relational gamma process and the recurrent gamma process, respectively. We provide the definitions of these processes in the supplementary material.

Non-count observations: The PGDS can also model non-count data by linking the observed vectors to latent counts. A binary observation b_v^{(t)} can be linked to a latent Poisson count y_v^{(t)} via the Bernoulli–Poisson distribution: b_v^{(t)} = 1(y_v^{(t)} ≥ 1) and y_v^{(t)} ~ Pois(δ^{(t)} Σ_{k=1}^K φ_{vk} θ_k^{(t)}) [17]. Similarly, a real-valued observation r_v^{(t)} can be linked to a latent Poisson count y_v^{(t)} via the Poisson randomized gamma distribution [18]. Finally, Basbug and Engelhardt [19] recently showed that many types of non-count matrices can be linked to a latent count matrix via the compound Poisson distribution [20].
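As a concrete illustration of the Bernoulli–Poisson link, the sketch below thresholds a latent Poisson count to produce a binary observation and then, given the binary value, recovers a latent count as a zero-truncated Poisson draw (the conditional used during inference). The rate `lam` is an illustrative stand-in for δ^{(t)} Σ_k φ_{vk} θ_k^{(t)}, and rejection sampling is only one simple way to draw the truncated variable.

```python
# A sketch of the Bernoulli-Poisson link: b = 1(y >= 1), with y | b = 1
# distributed as a zero-truncated Poisson. `lam` is an assumed rate.
import numpy as np

rng = np.random.default_rng(2)

def zero_truncated_poisson(lam, rng):
    """Sample y ~ Pois(lam) conditioned on y >= 1 (rejection; fine for a demo)."""
    while True:
        y = rng.poisson(lam)
        if y >= 1:
            return int(y)

lam = 0.5
y = rng.poisson(lam)                       # latent count y_v^{(t)}
b = int(y >= 1)                            # observed binary value b_v^{(t)}
y_given_b = 0 if b == 0 else zero_truncated_poisson(lam, rng)
```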
3 MCMC Inference

MCMC inference for the PGDS consists of drawing samples of the model parameters from their joint posterior distribution given an observed count matrix Y and the model hyperparameters τ_0, γ_0, η_0, and ε_0. In this section, we present a Gibbs sampling algorithm for drawing these samples. At a high level, our approach is similar to that used to develop Gibbs sampling algorithms for several other related models [10, 21, 22, 17]; however, we extend this approach to handle the unique properties of the PGDS. The main technical challenge is sampling Θ from its conditional posterior, which does not have a closed form. We address this challenge by introducing a set of auxiliary variables. Under this augmented version of the model, marginalizing over Θ becomes tractable and its conditional posterior has a closed form. Moreover, by introducing these auxiliary variables and marginalizing over Θ, we obtain an alternative model specification that we can subsequently exploit to obtain closed-form conditional posteriors for Π, ν, and ξ. We marginalize over Θ by performing a "backward filtering" pass, starting with θ^{(T)}. We repeatedly exploit the following three definitions in order to do this.

Definition 1: If y_· = Σ_{n=1}^N y_n, where y_n ~ Pois(θ_n) are independent Poisson-distributed random variables, then (y_1, ..., y_N) ~ Mult(y_·, (θ_1 / Σ_{n=1}^N θ_n, ..., θ_N / Σ_{n=1}^N θ_n)) and y_· ~ Pois(Σ_{n=1}^N θ_n) [23, 24].

Definition 2: If y ~ Pois(cθ), where c is a constant, and θ ~ Gam(a, b), then y ~ NB(a, c / (b + c)) is a negative binomial–distributed random variable. We can equivalently parameterize it as y ~ NB(a, g(ζ)), where g(z) = 1 − exp(−z) is the Bernoulli–Poisson link [17] and ζ = ln(1 + c/b).

Definition 3: If y ~ NB(a, g(ζ)) and l ~ CRT(y, a) is a Chinese restaurant table–distributed random variable, then y and l are equivalently jointly distributed as y ~ SumLog(l, g(ζ)) and l ~ Pois(aζ) [21]. The sum-logarithmic distribution is further defined as the sum of l independent and identically logarithmic-distributed random variables, i.e., y = Σ_{i=1}^l x_i and x_i ~ Log(g(ζ)).

Marginalizing over Θ: We first note that we can re-express the Poisson likelihood in equation 1 in terms of latent subcounts [13]: y_v^{(t)} = y_{v·}^{(t)} = Σ_{k=1}^K y_{vk}^{(t)} and y_{vk}^{(t)} ~ Pois(δ^{(t)} φ_{vk} θ_k^{(t)}). We then define y_{·k}^{(t)} = Σ_{v=1}^V y_{vk}^{(t)}. Via definition 1, we obtain y_{·k}^{(t)} ~ Pois(δ^{(t)} θ_k^{(t)}) because Σ_{v=1}^V φ_{vk} = 1.

We start with θ_k^{(T)} because none of the other time-step factors depend on it in their priors. Via definition 2, we can immediately marginalize over θ_k^{(T)} to obtain the following equation:

    y_{·k}^{(T)} ~ NB(τ_0 Σ_{k_2=1}^K π_{k k_2} θ_{k_2}^{(T-1)}, g(ζ^{(T)})), where ζ^{(T)} = ln(1 + δ^{(T)} / τ_0).   (3)

Next, we marginalize over θ_k^{(T-1)}. To do this, we introduce an auxiliary variable: l_k^{(T)} ~ CRT(y_{·k}^{(T)}, τ_0 Σ_{k_2=1}^K π_{k k_2} θ_{k_2}^{(T-1)}). We can then re-express the joint distribution over y_{·k}^{(T)} and l_k^{(T)} as

    y_{·k}^{(T)} ~ SumLog(l_k^{(T)}, g(ζ^{(T)})) and l_k^{(T)} ~ Pois(ζ^{(T)} τ_0 Σ_{k_2=1}^K π_{k k_2} θ_{k_2}^{(T-1)}).   (4)

We are still unable to marginalize over θ_k^{(T-1)} because it appears in a sum in the parameter of the Poisson distribution over l_k^{(T)}; however, via definition 1, we can re-express this distribution as

    l_k^{(T)} = l_{k·}^{(T)} = Σ_{k_2=1}^K l_{k k_2}^{(T)} and l_{k k_2}^{(T)} ~ Pois(ζ^{(T)} τ_0 π_{k k_2} θ_{k_2}^{(T-1)}).   (5)

We then define l_{·k}^{(T)} = Σ_{k_1=1}^K l_{k_1 k}^{(T)}. Again via definition 1, we can express the distribution over l_{·k}^{(T)} as l_{·k}^{(T)} ~ Pois(ζ^{(T)} τ_0 θ_k^{(T-1)}). We note that this expression does not depend on the transition weights because Σ_{k_1=1}^K π_{k_1 k} = 1. We also note that definition 1 implies that (l_{1k}^{(T)}, ..., l_{Kk}^{(T)}) ~ Mult(l_{·k}^{(T)}, (π_{1k}, ..., π_{Kk})). Next, we introduce m_k^{(T-1)} = y_{·k}^{(T-1)} + l_{·k}^{(T)}, which summarizes all of the information about the data at time steps T − 1 and T via y_{·k}^{(T-1)} and l_{·k}^{(T)}, respectively. Because y_{·k}^{(T-1)} and l_{·k}^{(T)} are both Poisson distributed, we can use definition 1 to obtain

    m_k^{(T-1)} ~ Pois(θ_k^{(T-1)} (δ^{(T-1)} + ζ^{(T)} τ_0)).   (6)

Combining this likelihood with the gamma prior in equation 1, we can marginalize over θ_k^{(T-1)}:

    m_k^{(T-1)} ~ NB(τ_0 Σ_{k_2=1}^K π_{k k_2} θ_{k_2}^{(T-2)}, g(ζ^{(T-1)})), where ζ^{(T-1)} = ln(1 + δ^{(T-1)} / τ_0 + ζ^{(T)}).   (7)

We then introduce l_k^{(T-1)} ~ CRT(m_k^{(T-1)}, τ_0 Σ_{k_2=1}^K π_{k k_2} θ_{k_2}^{(T-2)}) and re-express the joint distribution over l_k^{(T-1)} and m_k^{(T-1)} as the product of a Poisson and a sum-logarithmic distribution, similar to equation 4. This then allows us to marginalize over θ_k^{(T-2)} to obtain a negative binomial distribution. We can repeat the same process all the way back to t = 1, where marginalizing over θ_k^{(1)} yields m_k^{(1)} ~ NB(τ_0 ν_k, g(ζ^{(1)})). We note that just as m_k^{(t)} summarizes all of the information about the data at time steps t, ..., T, ζ^{(t)} = ln(1 + δ^{(t)} / τ_0 + ζ^{(t+1)}) summarizes all of the information about δ^{(t)}, ..., δ^{(T)}.
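Definition 3 is easy to check empirically. The sketch below (a numerical sanity check, not part of the paper) draws the pair (y, l) both ways, using the standard Bernoulli construction of the CRT distribution and NumPy's log-series sampler, and compares the means. Note that NumPy's negative binomial is parameterized by the success probability of the complementary Bernoulli trial, so the paper's g(ζ) enters as 1 − g(ζ).

```python
# Monte Carlo check of definition 3: NB-then-CRT matches Pois-then-SumLog.
import numpy as np

rng = np.random.default_rng(3)
a, zeta = 2.0, 0.7
p = 1.0 - np.exp(-zeta)                  # g(ζ)
n = 100_000

def crt(y, a, rng):
    """Draw l ~ CRT(y, a): l = sum over n=1..y of Bernoulli(a / (a + n - 1))."""
    if y == 0:
        return 0
    return int((rng.random(y) < a / (a + np.arange(y))).sum())

# Route 1: y ~ NB(a, g(ζ)), then l ~ CRT(y, a).
y1 = rng.negative_binomial(a, 1.0 - p, size=n)
l1 = np.array([crt(y, a, rng) for y in y1])

# Route 2: l ~ Pois(aζ), then y ~ SumLog(l, g(ζ)) as a sum of log-series draws.
l2 = rng.poisson(a * zeta, size=n)
y2 = np.array([rng.logseries(p, size=l).sum() if l > 0 else 0 for l in l2])

print(y1.mean(), y2.mean())   # both approx. a * g(ζ) / (1 - g(ζ))
print(l1.mean(), l2.mean())   # both approx. a * ζ
```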
As we mentioned previously, introducing these auxiliary variables and marginalizing over Θ also enables us to define an alternative model specification that we can exploit to obtain closed-form conditional posteriors for Π, ν, and ξ. We provide part of its generative process in figure 2. We define m_k^{(T)} = y_{·k}^{(T)} + l_{·k}^{(T+1)}, where l_{·k}^{(T+1)} = 0 and ζ^{(T+1)} = 0, so that we can present the alternative model specification concisely.

[Figure 2: Alternative model specification. The generative process, in part:
    l_{·k}^{(1)} ~ Pois(ζ^{(1)} τ_0 ν_k)
    (l_{1k}^{(t)}, ..., l_{Kk}^{(t)}) ~ Mult(l_{·k}^{(t)}, (π_{1k}, ..., π_{Kk})) for t > 1
    l_{k·}^{(t)} = Σ_{k_2=1}^K l_{k k_2}^{(t)} for t > 1
    m_k^{(t)} ~ SumLog(l_{k·}^{(t)}, g(ζ^{(t)}))
    (y_{·k}^{(t)}, l_{·k}^{(t+1)}) ~ Bin(m_k^{(t)}, (δ^{(t)} / (δ^{(t)} + ζ^{(t+1)} τ_0), ζ^{(t+1)} τ_0 / (δ^{(t)} + ζ^{(t+1)} τ_0)))
    (y_{1k}^{(t)}, ..., y_{Vk}^{(t)}) ~ Mult(y_{·k}^{(t)}, (φ_{1k}, ..., φ_{Vk}))]

Steady state: We draw particular attention to the backward pass ζ^{(t)} = ln(1 + δ^{(t)} / τ_0 + ζ^{(t+1)}) that propagates information about δ^{(t)}, ..., δ^{(T)} as we marginalize over Θ. In the case of the stationary PGDS, i.e., δ^{(t)} = δ, the backward pass has a fixed point that we define in the following proposition.

Proposition 1: The backward pass has a fixed point of ζ* = −W_{−1}(−exp(−1 − δ/τ_0)) − 1 − δ/τ_0.

The function W_{−1}(·) is the lower real part of the Lambert W function [11]. We prove this proposition in the supplementary material. During inference, we perform the O(T) backward pass repeatedly. The existence of a fixed point means that we can assume the stationary PGDS is in its steady state and replace the backward pass with an O(1) computation¹ of the fixed point ζ*. To make this assumption, we must also assume that l_{·k}^{(T+1)} ~ Pois(ζ* τ_0 θ_k^{(T)}) instead of l_{·k}^{(T+1)} = 0. We note that an analogous steady-state approximation exists for the LDS and is routinely exploited to reduce computation [25].

¹ Several software packages contain fast implementations of the Lambert W function.

Gibbs sampling algorithm: Given Y and the hyperparameters, Gibbs sampling involves resampling each auxiliary variable or model parameter from its conditional posterior. Our algorithm involves a "backward filtering" pass and a "forward sampling" pass, which together form a "backward filtering–forward sampling" algorithm. We use − \ Θ^{(≥t)} to denote everything excluding θ^{(t)}, ..., θ^{(T)}.

Sampling the auxiliary variables: This step is the "backward filtering" pass. For the stationary PGDS in its steady state, we first compute ζ* and draw (l_{·k}^{(T+1)} | −) ~ Pois(ζ* τ_0 θ_k^{(T)}). For the other variants of the model, we set l_{·k}^{(T+1)} = ζ^{(T+1)} = 0. Then, working backward from t = T, ..., 2, we draw

    (l_{k·}^{(t)} | − \ Θ^{(≥t)}) ~ CRT(y_{·k}^{(t)} + l_{·k}^{(t+1)}, τ_0 Σ_{k_2=1}^K π_{k k_2} θ_{k_2}^{(t-1)})   (8)

and

    (l_{k1}^{(t)}, ..., l_{kK}^{(t)} | − \ Θ^{(≥t)}) ~ Mult(l_{k·}^{(t)}, (π_{k1} θ_1^{(t-1)} / Σ_{k_2=1}^K π_{k k_2} θ_{k_2}^{(t-1)}, ..., π_{kK} θ_K^{(t-1)} / Σ_{k_2=1}^K π_{k k_2} θ_{k_2}^{(t-1)})).   (9)

After using equations 8 and 9 for all k = 1, ..., K, we then set l_{·k}^{(t)} = Σ_{k_1=1}^K l_{k_1 k}^{(t)}. For the non-steady-state variants, we also set ζ^{(t)} = ln(1 + δ^{(t)} / τ_0 + ζ^{(t+1)}); for the steady-state variant, we set ζ^{(t)} = ζ*.

Sampling Θ: We sample Θ from its conditional posterior by performing a "forward sampling" pass, starting with θ^{(1)}. Conditioned on the values of l_{·k}^{(2)}, ..., l_{·k}^{(T+1)} and ζ^{(2)}, ..., ζ^{(T+1)} obtained via the "backward filtering" pass, we sample forward from t = 1, ..., T, using the following equations:

    (θ_k^{(1)} | − \ Θ) ~ Gam(y_{·k}^{(1)} + l_{·k}^{(2)} + τ_0 ν_k, τ_0 + δ^{(1)} + ζ^{(2)} τ_0)   (10)

and

    (θ_k^{(t)} | − \ Θ^{(≥t)}) ~ Gam(y_{·k}^{(t)} + l_{·k}^{(t+1)} + τ_0 Σ_{k_2=1}^K π_{k k_2} θ_{k_2}^{(t-1)}, τ_0 + δ^{(t)} + ζ^{(t+1)} τ_0).   (11)

Sampling Π: The alternative model specification, with Θ marginalized out, assumes that (l_{1k}^{(t)}, ..., l_{Kk}^{(t)}) ~ Mult(l_{·k}^{(t)}, (π_{1k}, ..., π_{Kk})). Therefore, via Dirichlet–multinomial conjugacy,

    (π_k | − \ Θ) ~ Dir(ν_1 ν_k + Σ_{t=1}^T l_{1k}^{(t)}, ..., ξ ν_k + Σ_{t=1}^T l_{kk}^{(t)}, ..., ν_K ν_k + Σ_{t=1}^T l_{Kk}^{(t)}).   (12)
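Putting equations 8–11 and proposition 1 together, the following is a condensed sketch (under simplifying assumptions, not the authors' released implementation) of one backward filtering–forward sampling sweep for Θ in the stationary, steady-state PGDS. Here `Y_dot[t, k]` holds y_{·k}^{(t)}, and `Theta`, `Pi`, `nu`, `delta`, and `tau0` are assumed to be the sampler's current state.

```python
# One steady-state "backward filtering-forward sampling" sweep for Θ (sketch).
import numpy as np
from scipy.special import lambertw

def crt(y, a, rng):
    """Draw l ~ CRT(y, a) via its Bernoulli construction."""
    if y == 0:
        return 0
    return int((rng.random(y) < a / (a + np.arange(y))).sum())

def sweep_theta(Y_dot, Theta, Pi, nu, delta, tau0, rng):
    T, K = Y_dot.shape
    c = delta / tau0
    zeta = np.real(-lambertw(-np.exp(-1.0 - c), k=-1)) - 1.0 - c  # ζ* (prop. 1)
    L = np.zeros((T + 2, K), dtype=int)           # L[t, k] holds l_{·k}^{(t)}
    L[T + 1] = rng.poisson(zeta * tau0 * Theta[T - 1])            # l^{(T+1)}
    for t in range(T, 1, -1):                     # backward filtering (eqs 8-9)
        for k in range(K):
            shape = tau0 * (Pi[k] @ Theta[t - 2])                 # uses θ^{(t-1)}
            l_row = crt(int(Y_dot[t - 1, k] + L[t + 1, k]), shape, rng)
            w = Pi[k] * Theta[t - 2]
            L[t] += rng.multinomial(l_row, w / w.sum())           # adds l_{k k2}
    for t in range(1, T + 1):                     # forward sampling (eqs 10-11)
        prior = tau0 * nu if t == 1 else tau0 * (Pi @ Theta[t - 2])
        rate = tau0 + delta + zeta * tau0         # ζ^{(t+1)} = ζ* throughout
        Theta[t - 1] = rng.gamma(Y_dot[t - 1] + L[t + 1] + prior, 1.0 / rate)
    return Theta
```

A full Gibbs sweep would additionally resample Π via equation 12, along with ν, ξ, Φ, and the remaining parameters from the conditionals given below and in the supplementary material.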
Sampling ν and ξ: We use the alternative model specification to obtain closed-form conditional posteriors for ν_k and ξ. First, we marginalize over π_k to obtain a Dirichlet–multinomial distribution. When augmented with a beta-distributed auxiliary variable, the Dirichlet–multinomial distribution is proportional to the negative binomial distribution [26]. We draw such an auxiliary variable, which we use, along with negative binomial augmentation schemes, to derive closed-form conditional posteriors for ν_k and ξ. We provide these posteriors, along with their derivations, in the supplementary material. We also provide the conditional posteriors for the remaining model parameters (Φ, δ^{(1)}, ..., δ^{(T)}, and β), which we obtain via Dirichlet–multinomial, gamma–Poisson, and gamma–gamma conjugacy.

4 Experiments

In this section, we compare the predictive performance of the PGDS to that of the LDS and that of gamma process dynamic Poisson factor analysis (GP-DPFA) [22]. GP-DPFA models a single count in Y as y_v^{(t)} ~ Pois(Σ_{k=1}^K λ_k φ_{vk} θ_k^{(t)}), where each component's time-step factors evolve as a simple gamma Markov chain, independently of those belonging to the other components: θ_k^{(t)} ~ Gam(θ_k^{(t-1)}, c^{(t)}). We consider the stationary variants of all three models.² We used five data sets, and tested each model on two time-series prediction tasks: smoothing, i.e., predicting y_v^{(t)} given y_v^{(1)}, ..., y_v^{(t-1)}, y_v^{(t+1)}, ..., y_v^{(T)}, and forecasting, i.e., predicting y_v^{(T+s)} given y_v^{(1)}, ..., y_v^{(T)} for some s ∈ {1, 2, ...} [27]. We provide brief descriptions of the data sets below before reporting results.

² We used the pykalman Python library for the LDS and implemented GP-DPFA ourselves.

Global Database of Events, Language, and Tone (GDELT): GDELT is an international relations data set consisting of country-to-country interaction events of the form "country i took action a toward country j at time t," extracted from news corpora. We created five count matrices, one for each year from 2001 through 2005. We treated directed pairs of countries i → j as features and counted the number of events for each pair during each day. We discarded all pairs with fewer than twenty-five total events, leaving T = 365, around V ≈ 9,000, and three to six million events for each matrix.

Integrated Crisis Early Warning System (ICEWS): ICEWS is another international relations event data set extracted from news corpora. It is more highly curated than GDELT and contains fewer events. We therefore treated undirected pairs of countries i ↔ j as features. We created three count matrices, one for 2001–2003, one for 2004–2006, and one for 2007–2009. We counted the number of events for each pair during each three-day time step, and again discarded all pairs with fewer than twenty-five total events, leaving T = 365, around V ≈ 3,000, and 1.3 to 1.5 million events for each matrix.

State-of-the-Union transcripts (SOTU): The SOTU corpus contains the text of the annual SOTU speech transcripts from 1790 through 2014. We created a single count matrix with one column per year. After discarding stop words, we were left with T = 225, V = 7,518, and 656,949 tokens.

DBLP conference abstracts (DBLP): DBLP is a database of computer science research papers. We used the subset of this corpus that Acharya et al. used to evaluate GP-DPFA [22]. This subset corresponds to a count matrix with T = 14 columns, V = 1,771 unique word types, and 13,431 tokens.

NIPS corpus (NIPS): The NIPS corpus contains the text of every NIPS conference paper from 1987 to 2003. We created a single count matrix with one column per year. We treated unique word types as features and discarded all stop words, leaving T = 17, V = 9,836, and 3.1 million tokens.
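For concreteness, here is one way the preprocessing described above could look in code. This is an illustrative sketch, not the authors' pipeline, and the (feature, time step) tuple format for the input events is an assumption.

```python
# An assumed-format sketch: bin an event stream into a V x T count matrix and
# drop features with fewer than twenty-five total events, as described above.
import numpy as np
from collections import Counter

def build_count_matrix(events, num_steps, min_total=25):
    """events: list of (feature_id, time_step) pairs with 0 <= time_step < num_steps."""
    totals = Counter(feat for feat, _ in events)
    feats = sorted(f for f, total in totals.items() if total >= min_total)
    index = {f: v for v, f in enumerate(feats)}
    Y = np.zeros((len(feats), num_steps), dtype=int)
    for feat, t in events:
        if feat in index:
            Y[index[feat], t] += 1
    return Y, feats

# e.g., Y, feats = build_count_matrix([("USA->IRQ", 93), ...], num_steps=365)
```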
[Figure 3: y_v^{(t)} over time for the top four features in the NIPS (left) and ICEWS (right) data sets.]

Experimental design: For each matrix, we created four masks indicating some randomly selected subset of columns to treat as held-out data. For the event count matrices, we held out six (non-contiguous) time steps between t = 2 and t = T − 3 to test the models' smoothing performance, as well as the last two time steps to test their forecasting performance. The other matrices have fewer time steps. For the SOTU matrix, we therefore held out five time steps between t = 2 and t = T − 2, as well as t = T. For the NIPS and DBLP matrices, which contain substantially fewer time steps than the SOTU matrix, we held out three time steps between t = 2 and t = T − 2, as well as t = T.

For each matrix, mask, and model combination, we ran inference four times.³ For the PGDS and GP-DPFA, we performed 6,000 Gibbs sampling iterations, imputing the missing counts from the "smoothing" columns at the same time as sampling the model parameters. We then discarded the first 4,000 samples and retained every hundredth sample thereafter. We used each of these samples to predict the missing counts from the "forecasting" columns. We then averaged the predictions over the samples. For the LDS, we ran EM to learn the model parameters. Then, given these parameter values, we used the Kalman filter and smoother [1] to predict the held-out data. In practice, for all five data sets, V was too large for us to run inference for the LDS, which is O((K+V)^3) [2], using all V features. We therefore report results from two independent sets of experiments: one comparing all three models using only the top V = 1,000 features for each data set, and one comparing the PGDS to just GP-DPFA using all the features. The first set of experiments is generous to the LDS because the Poisson distribution is well approximated by the Gaussian distribution when its mean is large.

³ For the PGDS and GP-DPFA, we used K = 100. For the PGDS, we set τ_0 = 1, γ_0 = 50, η_0 = ε_0 = 0.1. We set the hyperparameters of GP-DPFA to the values used by Acharya et al. [22]. For the LDS, we used the default hyperparameters for pykalman, and report results for the best-performing value of K ∈ {5, 10, 25, 50}.

Table 1: Results for the smoothing ("S") and forecasting ("F") tasks. For both error measures, lower values are better. We also report the number of time steps T and the burstiness Bˆ of each data set.

                            Mean Relative Error (MRE)                  Mean Absolute Error (MAE)
Data set  T    Bˆ    Task   PGDS        GP-DPFA     LDS          PGDS         GP-DPFA      LDS
GDELT     365  1.27  S      2.335±0.19  2.951±0.32  3.493±0.53   9.366±2.19   9.278±2.01   10.098±2.39
                     F      2.173±0.41  2.207±0.42  2.397±0.29   7.002±1.43   7.095±1.67   7.047±1.25
ICEWS     365  1.10  S      0.808±0.11  0.877±0.12  1.023±0.15   2.867±0.56   2.872±0.56   3.104±0.60
                     F      0.743±0.17  0.792±0.17  0.937±0.31   1.788±0.47   1.894±0.50   1.973±0.62
SOTU      225  1.45  S      0.233±0.01  0.238±0.01  0.260±0.01   0.408±0.01   0.414±0.01   0.448±0.00
                     F      0.171±0.00  0.173±0.00  0.225±0.01   0.323±0.00   0.314±0.00   0.370±0.00
DBLP      14   1.64  S      0.417±0.03  0.422±0.05  0.405±0.05   0.771±0.03   0.782±0.06   0.831±0.01
                     F      0.322±0.00  0.323±0.00  0.369±0.06   0.747±0.01   0.715±0.00   0.943±0.07
NIPS      17   0.33  S      0.415±0.07  0.392±0.07  1.609±0.43   29.940±2.95  28.138±3.08  108.378±15.44
                     F      0.343±0.01  0.312±0.00  0.642±0.14   62.839±0.37  52.963±0.52  95.495±10.52

Results: We used two error measures, mean relative error (MRE) and mean absolute error (MAE), to compute the models' smoothing and forecasting scores for each matrix and mask combination. We then averaged these scores over the masks. For the data sets with multiple matrices, we also averaged the scores over the matrices. The two error measures differ as follows: MRE accommodates the scale of the data, while MAE does not. This is because relative error, which we define as |y_v^{(t)} − ŷ_v^{(t)}| / (1 + y_v^{(t)}), where y_v^{(t)} is the true count and ŷ_v^{(t)} is the prediction, divides the absolute error by the true count and thus penalizes overpredictions more harshly than underpredictions. MRE is therefore an especially natural choice for data sets that are bursty, i.e., data sets that exhibit short periods of activity that far exceed their mean. Models that are robust to these kinds of overdispersed temporal patterns are less likely to make overpredictions following a burst, and are therefore rewarded accordingly by MRE.

In table 1, we report the MRE and MAE scores for the experiments using the top V = 1,000 features. We also report the average burstiness of each data set. We define the burstiness of feature v in matrix Y to be Bˆ_v = (1 / (T−1)) Σ_{t=1}^{T−1} |y_v^{(t+1)} − y_v^{(t)}| / μ̂_v, where μ̂_v = (1/T) Σ_{t=1}^T y_v^{(t)}. For each data set, we calculated the burstiness of each feature in each matrix, and then averaged these values to obtain an average burstiness score Bˆ.
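Both error measures and the burstiness score translate directly into code. The sketch below assumes V × T integer arrays and that every feature has a nonzero mean, which the twenty-five-event threshold guarantees here.

```python
# A direct transcription (as a sketch) of MRE, MAE, and the burstiness score.
import numpy as np

def mre(y_true, y_pred):
    """Mean relative error: mean of |y - yhat| / (1 + y)."""
    return float(np.mean(np.abs(y_true - y_pred) / (1.0 + y_true)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of |y - yhat|."""
    return float(np.mean(np.abs(y_true - y_pred)))

def burstiness(Y):
    """Average over features v of (1/(T-1)) sum_t |y^{(t+1)} - y^{(t)}| / mu_v.
    Assumes every row of Y has a nonzero mean."""
    mu = Y.mean(axis=1)                                # estimated mean per feature
    jumps = np.abs(np.diff(Y, axis=1)).mean(axis=1)    # mean step-to-step change
    return float(np.mean(jumps / mu))
```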
The PGDS outperformed the LDS and GP-DPFA on seven of the ten prediction tasks when we used MRE to measure the models' performance; when we used MAE, the PGDS outperformed the other models on five of the tasks. In the supplementary material, we also report the results for the experiments comparing the PGDS to GP-DPFA using all the features. The superiority of the PGDS over GP-DPFA is even more pronounced in these results. We hypothesize that the difference between these models is related to the burstiness of the data. For both error measures, the only data set for which GP-DPFA outperformed the PGDS on both tasks was the NIPS data set. This data set has a substantially lower average burstiness score than the other data sets. We provide visual evidence in figure 3, where we display y_v^{(t)} over time for the top four features in the NIPS and ICEWS data sets. For the former, the features evolve smoothly; for the latter, they exhibit bursts of activity.

Exploratory analysis: We also explored the latent structure inferred by the PGDS. Because its parameters are positive, they are easy to interpret. In figure 1, we depict three components inferred from the NIPS data set. By examining the time-step factors and feature factors for these components, we see that they capture the decline of research on neural networks between 1987 and 2003, as well as the rise of Bayesian methods in machine learning. These patterns match our prior knowledge.

[Figure 4: The time-step factors for the top three components inferred by the PGDS from the 2003 GDELT matrix. The top component is in blue, the second is in green, and the third is in red. For each component, we also list the features (directed pairs of countries) with the largest feature factors.]

In figure 4, we depict the three components with the largest component weights inferred by the PGDS from the 2003 GDELT matrix. The top component is in blue, the second is in green, and the third is in red. For each component, we also list the sixteen features (directed pairs of countries) with the largest feature factors. The top component (blue) is most active in March and April, 2003. Its features involve USA, Iraq (IRQ), Great Britain (GBR), Turkey (TUR), and Iran (IRN), among others. This component corresponds to the 2003 invasion of Iraq. The second component (green) exhibits a noticeable increase in activity immediately after April, 2003. Its top features involve Israel (ISR), Palestine (PSE), USA, and Afghanistan (AFG). The third component exhibits a large burst of activity in August, 2003, but is otherwise inactive. Its top features involve North Korea (PRK), South Korea (KOR), Japan (JPN), China (CHN), Russia (RUS), and USA. This component corresponds to the six-party talks: a series of negotiations between these six countries for the purpose of dismantling North Korea's nuclear program. The first round of talks occurred during August 27–29, 2003.
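The readout behind figures 1 and 4 amounts to sorting posterior point estimates. Here is a minimal sketch, assuming `Phi` (V × K), `nu` (length K), and `vocab` (length V) come from a single posterior sample; none of these names are from the paper's code.

```python
# A sketch of the exploratory readout: rank components by weight ν_k and print
# the features with the largest factors φ_vk for each top component.
import numpy as np

def print_top_components(Phi, nu, vocab, num_components=3, num_feats=16):
    for k in np.argsort(-nu)[:num_components]:
        top = np.argsort(-Phi[:, k])[:num_feats]
        print(f"component {k} (weight {nu[k]:.2f}): "
              + ", ".join(vocab[v] for v in top))
```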
In figure 5, we also show the component weights for the top ten components, along with the corresponding subset of the transition matrix Π. There are two components with weights greater than one: the components that are depicted in blue and green in figure 4. The transition weights in the corresponding rows of Π are also large, meaning that other components are likely to transition to them. As we mentioned previously, the GDELT data set was extracted from news corpora. Therefore, patterns in the data primarily reflect patterns in media coverage of international affairs. We therefore interpret the latent structure inferred by the PGDS in the following way: in 2003, the media briefly covered various major events, including the six-party talks, before quickly returning to a backdrop of the ongoing Iraq war and Israeli–Palestinian relations. By inferring the kind of transition structure depicted in figure 5, the PGDS is able to model persistent, long-term temporal patterns while accommodating the burstiness often inherent to real-world count data. This ability is what enables the PGDS to achieve superior predictive performance over the LDS and GP-DPFA.

[Figure 5: The latent transition structure inferred by the PGDS from the 2003 GDELT matrix. Top: The component weights for the top ten components, in decreasing order from left to right; two of the weights are greater than one. Bottom: The transition weights in the corresponding subset of the transition matrix. This structure means that all components are likely to transition to the top two components.]

5 Summary

We introduced the Poisson–gamma dynamical system (PGDS): a new Bayesian nonparametric model for sequentially observed multivariate count data. This model supports the expressive transition structure of the linear dynamical system, and naturally handles overdispersed data. We presented a novel MCMC inference algorithm that remains efficient for high-dimensional data sets, advancing recent work on augmentation schemes for inference in negative binomial models. Finally, we used the PGDS to analyze five real-world data sets, demonstrating that it exhibits superior smoothing and forecasting performance over two baseline models and infers highly interpretable latent structure.

Acknowledgments

We thank David Belanger, Roy Adams, Kostis Gourgoulias, Ben Marlin, Dan Sheldon, and Tim Vieira for many helpful conversations. This work was supported in part by the UMass Amherst CIIR and in part by NSF grants SBE-0965436 and IIS-1320219. Any opinions, findings, conclusions, or recommendations are those of the authors and do not necessarily reflect those of the sponsors.

References

[1] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[2] Z. Ghahramani and S. T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems, pages 431–437, 1998.
[3] S. S. Haykin. Kalman Filtering and Neural Networks. 2001.
[4] P. McCullagh and J. A. Nelder. Generalized Linear Models. 1989.
[5] M. G. Bulmer. On fitting the Poisson lognormal distribution to species-abundance data. Biometrics, pages 101–110, 1974.
[6] D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pages 113–120, 2006.
[7] L. Charlin, R. Ranganath, J. McInerney, and D. M. Blei. Dynamic Poisson factorization. In Proceedings of the 9th ACM Conference on Recommender Systems, pages 155–162, 2015.
[8] J. H. Macke, L. Buesing, J. P. Cunningham, B. M. Yu, K. V. Krishna, and M. Sahani. Empirical models of spiking in neural populations. In Advances in Neural Information Processing Systems, pages 1350–1358, 2011.
[9] J. Kleinberg. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4):373–397, 2003.
[10] M. Zhou and L. Carin. Augment-and-conquer negative binomial processes. In Advances in Neural Information Processing Systems, pages 2546–2554, 2012.
[11] R. Corless, G. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth. On the Lambert W function. Advances in Computational Mathematics, 5(1):329–359, 1996.
[12] J. Canny. GaP: A factor model for discrete data. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 122–129, 2004.
[13] A. T. Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009.
[14] M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, 2012.
[15] P. Gopalan, J. Hofman, and D. Blei. Scalable recommendation with Poisson factorization. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, 2015.
[16] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.
[17] M. Zhou. Infinite edge partition models for overlapping community detection and link prediction. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 1135–1143, 2015.
[18] M. Zhou, Y. Cong, and B. Chen. Augmentable gamma belief networks. Journal of Machine Learning Research, 17(163):1–44, 2016.
[19] M. E. Basbug and B. Engelhardt. Hierarchical compound Poisson factorization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[20] R. M. Adelson. Compound Poisson distributions. OR, 17(1):73–75, 1966.
[21] M. Zhou and L. Carin. Negative binomial process count and mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):307–320, 2015.
[22] A. Acharya, J. Ghosh, and M. Zhou. Nonparametric Bayesian factor analysis for dynamic count matrices. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.
[23] J. F. C. Kingman. Poisson Processes. Oxford University Press, 1972.
[24] D. B. Dunson and A. H. Herring. Bayesian latent variable models for mixed discrete outcomes. Biostatistics, 6(1):11–25, 2005.
[25] W. J. Rugh. Linear System Theory. Pearson, 1995.
[26] M. Zhou. Nonparametric Bayesian negative binomial factor analysis. arXiv:1604.07464.
[27] J. Durbin and S. J. Koopman. Time Series Analysis by State Space Methods. Oxford University Press, 2012.