An Online Expectation-Maximisation Algorithm for Nonnegative Matrix Factorisation Models

Sinan Yıldırım*, A. Taylan Cemgil**, Sumeetpal S. Singh***
* Statistical Laboratory, DPMMS, University of Cambridge, UK
** Department of Computer Engineering, Boğaziçi University, Turkey
*** Department of Engineering, University of Cambridge, UK
arXiv:1401.2490v1 [cs.LG], 11 Jan 2014

Abstract: In this paper we formulate the nonnegative matrix factorisation (NMF) problem as a maximum likelihood estimation problem for hidden Markov models and propose online expectation-maximisation (EM) algorithms to estimate the NMF and the other unknown static parameters. We also propose a sequential Monte Carlo approximation of our online EM algorithm. We show the performance of the proposed method with two numerical examples.

1. INTRODUCTION

With the advancement of sensor and storage technologies, and with the cost of data acquisition dropping significantly, we are able to collect and record vast amounts of raw data. Arguably, the grand challenge facing computation in the 21st century is the effective handling of such large data sets to extract meaningful information for scientific, financial, political or technological purposes [Donoho, 2000]. Unfortunately, classical batch processing methods are unable to deal with very large data sets due to memory restrictions and slow computation times.

One key approach to the analysis of large datasets is based on the matrix and tensor factorisation paradigm. Given an observed dataset Y, where Y is a matrix of a certain dimension and each of its elements corresponds to an observed data point, the matrix factorisation problem is the computation of matrix factors B and X such that Y is approximated by the matrix product BX, i.e.,

    Y ≈ BX.

(Later we will make our notation and inferential goals more precise.) Indeed, many standard statistical methods such as clustering, independent components analysis (ICA), nonnegative matrix factorisation (NMF), latent semantic indexing (LSI) and collaborative filtering can be expressed and understood as matrix factorisation problems [Lee and Seung, 1999, Singh and Gordon, 2008, Koren et al., 2009]. Matrix factorisation models also have well understood probabilistic/statistical interpretations as generative models, and many of the standard algorithms mentioned above can be derived as maximum likelihood or maximum a-posteriori parameter estimation procedures [Fevotte and Cemgil, 2009, Salakhutdinov and Mnih, 2008, Cemgil, 2009]. The advantage of this interpretation is that it enables one to incorporate domain-specific prior knowledge in a principled and consistent way, by building hierarchical statistical models that fit the specifics of the application at hand. Moreover, the probabilistic/statistical approach provides a natural framework for sequential processing, which is desirable for developing online algorithms that pass over each data point only once. While effective online algorithms for matrix factorisation are of interest in their own right, the algorithmic ideas can also be generalised to more structured models such as tensor factorisations (e.g. see [Kolda and Bader, 2009]).

In this paper our primary interest is the estimation of B (rather than B and X), which is often the main objective in NMF problems. We formulate the NMF problem as a maximum likelihood estimation (MLE) problem for hidden Markov models (HMMs). The advantage of doing so is that the asymptotic properties of MLE for HMMs have been studied by many authors, and these results may be adapted to the NMF framework. We propose a sequential Monte Carlo (SMC) based online EM algorithm [Cappé, 2009, Del Moral et al., 2009] for the NMF problem. SMC introduces a layer of bias which decreases as the number of particles in the SMC approximation is increased.
In the literature, several algorithms have been proposed for online computation of matrix factorisations. Mairal et al. [2010] propose an online optimisation algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples; a proof of convergence is presented for the Gaussian case. There are similar formulations for other matrix factorisation models, notably NMF [Lefevre et al., 2011] and latent Dirichlet allocation [Hoffman et al., 2010], as well as alternative views of NMF based on incremental subspace learning [Bucak and Gunsel, 2009]. Although the empirical results of these methods suggest good performance, their asymptotic properties have not been established.

1.1 Notation

Let A be an M × N matrix. The (m,n)'th element of A is A(m,n). If M (or N) is 1, then A(i) = A(1,i) (or A(i,1)). The m'th row of A is A(m,·). If A and B are both M × N matrices, C = A ⊙ B denotes element-by-element multiplication, i.e., C(m,n) = A(m,n)B(m,n); A/B denotes element-by-element division, defined similarly. 1_{M×N} (0_{M×N}) is an M × N matrix of 1's (0's), and 1_{M×1} is abbreviated to 1_M. N = {0, 1, 2, ...} and R_+ = [0, ∞) are the sets of nonnegative integers and nonnegative real numbers. Random variables are denoted by capital letters, such as X, Y, Z, and their realisations by the corresponding lower-case letters (x, y, z). The indicator function I_α(x) equals 1 if x = α and 0 otherwise; likewise, for a set A, I_A(x) equals 1 if x ∈ A and 0 otherwise.

2. THE STATISTICAL MODEL FOR NMF

Consider the following HMM comprised of the latent processes {X_t, Z_t}_{t≥1} and the observation process {Y_t}_{t≥1}. The process X_t ∈ R_+^K is a Markov process of K × 1 nonnegative vectors with initial density μ_ψ and transition density f_ψ for t = 2, 3, ...:

    X_1 ∼ μ_ψ(x),    X_t | (X_{t-1} = x_{t-1}) ∼ f_ψ(x_t | x_{t-1}),    (1)

where ψ ∈ Ψ is a finite dimensional parameter which parametrises the law of the Markov process. Z_t ∈ N^{M×K} is an M × K matrix of nonnegative integers, and its elements are independent conditioned on X_t:

    Z_t | (X_t = x_t) ∼ ∏_{m=1}^{M} ∏_{k=1}^{K} PO(z_t(m,k); B(m,k) x_t(k)),

where B ∈ R_+^{M×K} is an M × K nonnegative matrix and PO(v; λ) denotes the Poisson distribution on N with intensity parameter λ ≥ 0,

    PO(v; λ) = exp(v log λ − λ − log v!).

The M × 1 observation vector Y_t is obtained from Z_t in a deterministic way:

    Y_t(m) = Σ_{k=1}^{K} Z_t(m,k),    m = 1, ..., M.

This results in the conditional density of Y_t given X_t = x_t, denoted by g_B, being a multivariate Poisson density:

    Y_t | (X_t = x_t) ∼ g_B(y_t | x_t) = ∏_{m=1}^{M} PO(y_t(m); B(m,·) x_t).    (2)

Hence the likelihood of y_t given x_t can be evaluated analytically.
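The generative model above is straightforward to simulate, which is useful both for sanity-checking an implementation and for producing synthetic data of the kind used in Section 4. The following Python sketch is illustrative only and is not the authors' code: sample_x1 and sample_f stand for user-supplied initial and transition samplers for (1), and the binary chain in the example loosely mimics the basis selection model of Section 4.1 with a symmetric persistence probability chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_nmf_hmm(B, T, sample_x1, sample_f):
    """Simulate (X_{1:T}, Z_{1:T}, Y_{1:T}) from the Poisson NMF HMM above.

    B         : (M, K) nonnegative matrix.
    sample_x1 : callable returning a draw from mu_psi (a K-vector).
    sample_f  : callable x_prev -> draw from f_psi(. | x_prev).
    """
    M, K = B.shape
    X = np.empty((T, K))
    Z = np.empty((T, M, K), dtype=np.int64)
    Y = np.empty((T, M), dtype=np.int64)
    for t in range(T):
        X[t] = sample_x1() if t == 0 else sample_f(X[t - 1])
        Z[t] = rng.poisson(B * X[t][None, :])   # Z_t(m,k) ~ PO(B(m,k) x_t(k))
        Y[t] = Z[t].sum(axis=1)                 # Y_t(m) = sum_k Z_t(m,k)
    return X, Z, Y

# Hypothetical example: a binary selection chain (cf. Section 4.1) with K = 5 bases.
K = 5
sample_x1 = lambda: rng.integers(0, 2, size=K).astype(float)
def sample_f(x_prev, stay=0.9):                 # flip each coordinate w.p. 1 - stay
    flip = rng.random(K) > stay
    return np.where(flip, 1.0 - x_prev, x_prev)

B_true = rng.gamma(2.0, 1.0, size=(8, K))
X, Z, Y = simulate_nmf_hmm(B_true, T=1000, sample_x1=sample_x1, sample_f=sample_f)
```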
Moreover, the conditional posterior distribution π_B(z_t | y_t, x_t) of Z_t given y_t and x_t has a factorised closed-form expression:

    Z_t | (Y_t = y_t, X_t = x_t) ∼ π_B(z_t | y_t, x_t) = ∏_{m=1}^{M} M(z_t(m,·); y_t(m), ρ_{t,m}),    (3)

where ρ_{t,m}(k) = B(m,k) x_t(k) / (B(m,·) x_t) and M denotes the multinomial distribution

    M(v; α, ρ) = I_α(Σ_{k=1}^{K} v_k) α! ∏_{k=1}^{K} ρ_k^{v_k} / v_k!,

in which v = [v_1 ... v_K] is a realisation of the vector valued random variable V = [V_1 ... V_K], ρ = (ρ_1, ..., ρ_K), and Σ_{k=1}^{K} ρ_k = 1. It is a standard result that the marginal mean of the k'th component is E_{α,ρ}[V_k] = α ρ_k.

Let θ = (ψ, B) ∈ Θ = Ψ × R_+^{M×K} denote all the parameters of the HMM. We can write the joint density of (X_{1:t}, Z_{1:t}, Y_{1:t}) given θ as

    p_θ(x_{1:t}, z_{1:t}, y_{1:t}) = μ_ψ(x_1) g_B(y_1 | x_1) π_B(z_1 | y_1, x_1) ∏_{i=2}^{t} f_ψ(x_i | x_{i-1}) g_B(y_i | x_i) π_B(z_i | x_i, y_i).    (4)

From (4), we observe that the joint density of (X_{1:t}, Y_{1:t}),

    p_θ(x_{1:t}, y_{1:t}) = μ_ψ(x_1) g_B(y_1 | x_1) ∏_{i=2}^{t} f_ψ(x_i | x_{i-1}) g_B(y_i | x_i),

defines the law of another HMM {X_t, Y_t}_{t≥1}, comprised of the latent process {X_t}_{t≥1} with initial and transition densities μ_ψ and f_ψ, and the observation process {Y_t}_{t≥1} with observation density g_B. Finally, the likelihood of the data is given by

    p_θ(y_{1:T}) = E_ψ[ ∏_{t=1}^{T} g_B(y_t | X_t) ].    (5)

In this paper we treat θ as unknown and seek the MLE solution θ*, which satisfies

    θ* = argmax_{θ ∈ Θ} p_θ(y_{1:T}).    (6)

2.1 Relation to the classical NMF

In the classical NMF formulation [Lee and Seung, 1999, 2000], given an M × T nonnegative matrix Y = [y_1 ... y_T], we want to factorise it into M × K and K × T nonnegative matrices B and X = [X_1 ... X_T] such that the difference between Y and BX is minimised according to a divergence:

    (B*, X*) = argmin_{B,X} D(Y || BX).    (7)

One particular choice for D is the generalised Kullback-Leibler (KL) divergence,

    D(Y || U) = Σ_{m=1}^{M} Σ_{t=1}^{T} [ Y(m,t) log( Y(m,t) / U(m,t) ) − Y(m,t) + U(m,t) ].

Noticing the similarity between the generalised KL divergence and the Poisson distribution, Lee and Seung [1999] showed that the minimisation problem can be formulated in an MLE sense. More explicitly, the solution to

    (B*, X*) = argmax_{B,X} ℓ(y_1, ..., y_T | B, X),    ℓ(y_1, ..., y_T | B, X) = ∏_{t=1}^{T} g_B(y_t | X_t)    (8)

is the same as the solution to (7). In our formulation of the NMF problem, X = [X_1 ... X_T] is not a static parameter but a random matrix whose columns constitute a Markov process. Therefore, the MLE formulation in our case changes to maximising the expected value of the likelihood in (8) over the parameter θ = (B, ψ) with respect to (w.r.t.) the law of X:

    (B*, ψ*) = argmax_{(B,ψ) ∈ Θ} E_ψ[ ℓ(y_1, ..., y_T | B, X) ].    (9)

It is clear that (6) and (9) are equivalent. We will see in Section 3 that the introduction of the additional process {Z_t}_{t≥1} is necessary to perform MLE using the EM algorithm (see Lee and Seung [2000] for its first use for the problem stated in (7)).
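The Lee and Seung equivalence between (7) and (8) can also be checked numerically: D(Y||BX) and the negative Poisson log-likelihood differ only by a term that depends on Y alone, so they share the same minimisers in (B, X). A minimal sketch of this check (my own illustration assuming NumPy/SciPy, not code from the paper):

```python
import numpy as np
from scipy.special import gammaln

def gen_kl(Y, U):
    """Generalised KL divergence D(Y || U) between two M x T nonnegative matrices."""
    return np.sum(Y * np.log(Y / U) - Y + U)

def neg_poisson_loglik(Y, B, X):
    """-log l(y_1,...,y_T | B, X) with Y(m,t) ~ PO((BX)(m,t)), cf. (2) and (8)."""
    U = B @ X
    return -np.sum(Y * np.log(U) - U - gammaln(Y + 1.0))

rng = np.random.default_rng(1)
M, K, T = 8, 5, 50
B = rng.gamma(2.0, 1.0, (M, K))
X = rng.gamma(2.0, 1.0, (K, T))
Y = rng.poisson(B @ X).astype(float) + 1e-9     # tiny offset avoids log(0) in D

# The gap does not depend on (B, X): it equals sum_{m,t} [Y log Y - Y - log Y!].
gap = gen_kl(Y, B @ X) - neg_poisson_loglik(Y, B, X)
const = np.sum(Y * np.log(Y) - Y - gammaln(Y + 1.0))
assert np.isclose(gap, const)
```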
3. EM ALGORITHMS FOR NMF

Our objective is to estimate the unknown θ given Y_{1:T} = y_{1:T}. The EM algorithm can be used to find the MLE for θ. We first introduce the batch EM algorithm and then explain how an online EM version can be obtained.

3.1 Batch EM

With the EM algorithm, given the observation sequence y_{1:T}, we increase the likelihood p_θ(y_{1:T}) in (5) iteratively until we reach a maximal point on the surface of the likelihood. The algorithm is as follows. Choose θ^(0) for initialisation; at iteration j = 0, 1, ...

• E-step: Calculate the intermediate function, which is the expectation of the log joint distribution of (X_{1:T}, Z_{1:T}, Y_{1:T}) with respect to the law of (X_{1:T}, Z_{1:T}) given Y_{1:T} = y_{1:T}:

    Q(θ^(j); θ) = E_{θ^(j)}[ log p_θ(X_{1:T}, Z_{1:T}, Y_{1:T}) | Y_{1:T} = y_{1:T} ].

• M-step: The new estimate is the maximiser of the intermediate function:

    θ^(j+1) = argmax_θ Q(θ^(j); θ).

With a slight modification of the update rules found in Cemgil [2009, Section 2], one can show that for NMF models the update rule for B reduces to calculating the expectations

    Ŝ_{1,T} = E_{θ^(j)}[ Σ_{t=1}^{T} X_t | Y_{1:T} = y_{1:T} ],    Ŝ_{2,T} = E_{θ^(j)}[ Σ_{t=1}^{T} Z_t | Y_{1:T} = y_{1:T} ]

and updating the parameter estimate for B as

    B^(j+1) = Ŝ_{2,T} / ( 1_M [Ŝ_{1,T}]^T ).

Moreover, if the transition density f_ψ belongs to an exponential family, the update rule for ψ reduces to calculating the expectation of a J × 1 vector valued function,

    Ŝ_{3,T} = E_{θ^(j)}[ Σ_{t=1}^{T} s_{3,t}(X_{t-1}, X_t) | Y_{1:T} = y_{1:T} ],

and updating the estimate for ψ using a maximisation rule Λ : R^J → Ψ, ψ^(j+1) = Λ(Ŝ_{3,T}). Note that s_{3,t} and Λ depend on the NMF model, particularly on the probability laws in (1) defining the Markov chain for {X_t}_{t≥1}. Therefore, we have to find the mean estimates of the following sufficient statistics at time t:

    S_{1,t}(x_{1:t}) = Σ_{i=1}^{t} x_i,    S_{2,t}(z_{1:t}) = Σ_{i=1}^{t} z_i,    S_{3,t}(x_{1:t}) = Σ_{i=1}^{t} s_{3,i}(x_{i-1}, x_i).    (10)
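In matrix form, the B-update of the M-step is simply an element-wise ratio of the two smoothed sufficient statistics in (10). A minimal sketch (illustrative; how Ŝ_{1,T} and Ŝ_{2,T} are obtained, exactly or via a particle smoother, is left abstract here):

```python
import numpy as np

def m_step_B(S1, S2):
    """Batch M-step for B: B^(j+1) = S2 / (1_M [S1]^T), element-wise.

    S1 : (K,)   estimate of E[ sum_t X_t | y_{1:T} ]
    S2 : (M, K) estimate of E[ sum_t Z_t | y_{1:T} ]
    """
    S1 = np.asarray(S1, dtype=float)
    S2 = np.asarray(S2, dtype=float)
    # 1_M [S1]^T stacks the row vector S1^T over M rows, so each column k of B
    # becomes the expected counts assigned to basis k divided by its total activation.
    return S2 / S1[None, :]
```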
Writing the sufficient statistics in additive form as in (10) enables us to use a forward recursion to find their expectations in an online manner. This leads to an online version of the EM algorithm, as we shall see in the following section.

3.2 Online EM

To explain the methodology in a general sense, assume that we want to calculate the expectations Ŝ_t = E_θ[ S_t(X_{1:t}, Z_{1:t}) | Y_{1:t} = y_{1:t} ] of sufficient statistics of the additive form

    S_t(x_{1:t}, z_{1:t}) = Σ_{i=1}^{t} s_i(x_{i-1}, z_{i-1}, x_i, z_i)    (11)

w.r.t. the posterior density p_θ(x_{1:t}, z_{1:t} | y_{1:t}) for a given parameter value θ. Letting u_t = (x_t, z_t) for simplicity, we define the intermediate function

    T_t(u_t) = ∫ S_t(u_{1:t}) p_θ(u_{1:t-1} | y_{1:t-1}, u_t) du_{1:t-1}.

One can show that we have the forward recursion [Del Moral et al., 2009, Cappé, 2011]

    T_t(u_t) = ∫ ( T_{t-1}(u_{t-1}) + s_t(u_{t-1}, u_t) ) p_θ(u_{t-1} | y_{1:t-1}, u_t) du_{t-1}    (12)

with the convention T_0(u) = 0. Hence T_t can be computed online, and so are the estimates

    Ŝ_t = ∫ T_t(u_t) p_θ(u_t | y_{1:t}) du_t.

We can decompose the backward transition density as

    p_θ(x_{t-1}, z_{t-1} | y_{1:t-1}, x_t, z_t) = π_B(z_{t-1} | x_{t-1}, y_{t-1}) p_θ(x_{t-1} | x_t, y_{1:t-1})    (13)

and the filtering density as

    p_θ(x_t, z_t | y_{1:t}) = π_B(z_t | x_t, y_t) p_θ(x_t | y_{1:t}),    (14)

where π_B is defined in (3). From (10) we know that the required sufficient statistics are additive in the required form; therefore, the recursion in (12) is possible for the NMF model. The recursion for S_{3,t} depends on the choice of the transition density f_ψ; however, the recursions for S_{1,t} and S_{2,t} are the same for any model regardless of the choice of f_ψ. For this reason, we shall have a detailed look at (12) for the first two sufficient statistics S_{1,t} and S_{2,t}.

For S_{1,t}, notice from (13) that p_θ(x_{t-1}, z_{t-1} | y_{1:t-1}, x_t, z_t) does not depend on z_t. Moreover, the sufficient statistic S_{1,t} is not a function of z_{1:t}. Therefore z_{t-1} in (12) integrates out and T_{1,t} is a function of x_t only; hence we write it as T_{1,t}(x_t). To sum up, we have the recursion

    T_{1,t}(x_t) = x_t + ∫ T_{1,t-1}(x_{t-1}) p_θ(x_{t-1} | x_t, y_{1:t-1}) dx_{t-1}.

For S_{2,t}, we claim that T_{2,t}(x_t, z_t) = z_t + C_t(x_t), where C_t(x_t) is a nonnegative M × K matrix valued function depending on x_t but not on z_t, and the recursion for C_t(x_t) is

    C_t(x_t) = ∫ ( C_{t-1}(x_{t-1}) + [ B ⊙ (y_{t-1} x_{t-1}^T) ] / [ (B x_{t-1}) 1_K^T ] ) p_θ(x_{t-1} | x_t, y_{1:t-1}) dx_{t-1}.

This claim can be verified by induction. Start with t = 1. Since T_{2,0} = 0_{M×K}, we immediately see that T_{2,1}(x_1, z_1) = z_1 = z_1 + C_1(x_1) with C_1(x_1) = 0_{M×K}. For general t > 1, assume that T_{2,t-1}(x_{t-1}, z_{t-1}) = z_{t-1} + C_{t-1}(x_{t-1}). Using (13),

    T_{2,t}(x_t, z_t) = z_t + ∫ ( z_{t-1} + C_{t-1}(x_{t-1}) ) π_B(z_{t-1} | x_{t-1}, y_{t-1}) p_θ(x_{t-1} | x_t, y_{1:t-1}) dx_{t-1} dz_{t-1}.

Now observe that the (m,k)'th element of the integral ∫ z_{t-1} π_B(z_{t-1} | x_{t-1}, y_{t-1}) dz_{t-1} is B(m,k) y_{t-1}(m) x_{t-1}(k) / (B(m,·) x_{t-1}), so we can write this integral as

    ∫ z_{t-1} π_B(z_{t-1} | x_{t-1}, y_{t-1}) dz_{t-1} = [ B ⊙ (y_{t-1} x_{t-1}^T) ] / [ (B x_{t-1}) 1_K^T ],

which completes the induction. Using a similar derivation and substituting (14) into (13), we can show that

    Ŝ_{2,t} = ∫ ( C_t(x_t) + [ B ⊙ (y_t x_t^T) ] / [ (B x_t) 1_K^T ] ) p_θ(x_t | y_{1:t}) dx_t.

The online EM algorithm is a variation of batch EM in which the parameter is re-estimated each time a new observation is received. In this approach, running averages of the sufficient statistics are computed [Elliott et al., 2002, Mongillo and Deneve, 2008, Cappé, 2009, 2011], [Kantas et al., 2009, Section 3.2]. Specifically, let γ = {γ_t}_{t≥1}, called the step-size sequence, be a positive decreasing sequence satisfying Σ_{t≥1} γ_t = ∞ and Σ_{t≥1} γ_t^2 < ∞. A common choice is γ_t = t^{-a} for 0.5 < a ≤ 1. Let θ_1 be the initial guess of θ* before having made any observations, and at time t let θ_{1:t} be the sequence of parameter estimates of the online EM algorithm computed sequentially based on y_{1:t-1}. Letting u_t = (x_t, z_t) again to cover the general case, when y_t is received online EM computes

    T_{γ,t}(u_t) = ∫ ( (1 − γ_t) T_{γ,t-1}(u_{t-1}) + γ_t s_t(u_{t-1}, u_t) ) p_{θ_{1:t}}(u_{t-1} | y_{1:t-1}, u_t) du_{t-1},    (15)

    Ŝ_t = ∫ T_{γ,t}(u_t) p_{θ_{1:t}}(u_t | y_{1:t}) du_t    (16)

and then applies the maximisation rule using the estimates Ŝ_t. The subscript θ_{1:t} on the densities p_{θ_{1:t}}(u_{t-1} | y_{1:t-1}, u_t) and p_{θ_{1:t}}(u_t | y_{1:t}) indicates that the densities are computed sequentially using the parameter θ_k at time k, for k ≤ t. (See Algorithm 1 for details.) In practice, the maximisation step is not executed until a burn-in time t_b, for added stability of the estimators, as discussed in Cappé [2009].
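Stripped of the model-specific smoothing, (15)-(16) say that each running statistic is a convex combination of its previous value and the newest increment, with weight γ_t, and that the M-step is applied only after the burn-in t_b. The caricature below is illustrative only: it deliberately drops the dependence of T_{γ,t} on the latent state u_t, which the full algorithm retains per particle (see Algorithm 1 below); expect_increment and maximise are hypothetical model-specific callables, and the sufficient statistics are assumed to be stacked into a single array.

```python
import numpy as np

def online_em(y_stream, theta0, expect_increment, maximise, a=0.8, t_burn=100):
    """Schematic online EM with step sizes gamma_t = t**(-a), 0.5 < a <= 1.

    expect_increment(theta, y_t, state) -> (s_t, state): filtered expectation of
        the newest additive term of the sufficient statistics (as one array).
    maximise(S) -> theta: the M-step mapping, e.g. the B and psi updates above.
    """
    theta, S, state = theta0, None, None
    for t, y_t in enumerate(y_stream, start=1):
        gamma = t ** (-a)
        s_t, state = expect_increment(theta, y_t, state)
        S = s_t if S is None else (1.0 - gamma) * S + gamma * s_t   # cf. (15)-(16)
        if t >= t_burn:     # maximisation frozen during burn-in, as in Cappe [2009]
            theta = maximise(S)
    return theta
```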
The online EM algorithm can be implemented exactly for a linear Gaussian state-space model [Elliott et al., 2002] and for finite state-space HMMs [Mongillo and Deneve, 2008, Cappé, 2011]. An exact implementation is not possible for NMF models in general; therefore we now investigate SMC implementations of the online EM algorithm.

3.3 SMC implementation of the online EM algorithm

Recall that {X_t, Y_t}_{t≥1} is also an HMM, with the initial and transition densities μ_ψ and f_ψ in (1) and the observation density g_B in (2). Since the conditional density π_B(z_t | x_t, y_t) has a closed form expression, it is sufficient to have a particle approximation to p_θ(x_{1:t} | y_{1:t}) only. This approximation can be performed in an online manner using an SMC approach. Suppose that we have the particle approximation to p_θ(x_{1:t} | y_{1:t}) at time t with N particles,

    p_θ^N(dx_{1:t} | y_{1:t}) = Σ_{i=1}^{N} w_t^(i) δ_{x_{1:t}^(i)}(dx_{1:t}),    Σ_{i=1}^{N} w_t^(i) = 1,    (17)

where x_{1:t}^(i) = (x_1^(i), ..., x_t^(i)) is the i'th path particle with weight w_t^(i) and δ_x is the Dirac measure concentrated at x. The particle approximation of the filter at time t can be obtained from p_θ^N(dx_{1:t} | y_{1:t}) by marginalisation,

    p_θ^N(dx_t | y_{1:t}) = Σ_{i=1}^{N} w_t^(i) δ_{x_t^(i)}(dx_t).

At time t+1, for each i we draw x_{t+1}^(i) from a proposal density q_θ(x_{t+1} | x_t^(i)), with a possible implicit dependency on y_{t+1}. We then update the weights according to the recursive rule

    w_{t+1}^(i) ∝ w_t^(i) f_ψ(x_{t+1}^(i) | x_t^(i)) g_B(y_{t+1} | x_{t+1}^(i)) / q_θ(x_{t+1}^(i) | x_t^(i)).

To avoid weight degeneracy, at each time one can resample from (17) to obtain a new collection of particles x_t^(i) with weights w_t^(i) = 1/N, and then proceed to time t+1. Alternatively, this resampling operation can be done according to a criterion which measures the weight degeneracy [Doucet et al., 2000]. The SMC online EM algorithm for NMF models, executing (15) and (16) based on the SMC approximation of p_θ(x_{1:t} | y_{1:t}) in (17), is presented in Algorithm 1.

Algorithm 1. SMC online EM algorithm for NMF models

• E-step: If t = 1, initialise θ_1; for i = 1, ..., N, sample x̃_1^(i) ∼ q_{θ_1}(·) and set

    w̃_1^(i) = μ_{ψ_1}(x̃_1^(i)) g_{B_1}(y_1 | x̃_1^(i)) / q_{θ_1}(x̃_1^(i)),    T̃_{1,1}^(i) = x̃_1^(i),    C̃_1^(i) = 0,    T̃_{3,1}^(i) = s_{3,1}(x̃_1^(i)).

If t > 1,
  · For i = 1, ..., N, sample x̃_t^(i) ∼ q_{θ_t}(· | x_{t-1}^(i)) and compute

    T̃_{1,t}^(i) = (1 − γ_t) T_{1,t-1}^(i) + γ_t x̃_t^(i),
    T̃_{3,t}^(i) = (1 − γ_t) T_{3,t-1}^(i) + γ_t s_{3,t}(x_{t-1}^(i), x̃_t^(i)),
    C̃_t^(i) = (1 − γ_t) C_{t-1}^(i) + (1 − γ_t) γ_{t-1} [ B_t ⊙ (y_{t-1} (x_{t-1}^(i))^T) ] / [ (B_t x_{t-1}^(i)) 1_K^T ],
    w̃_t^(i) ∝ w_{t-1}^(i) f_{ψ_t}(x̃_t^(i) | x_{t-1}^(i)) g_{B_t}(y_t | x̃_t^(i)) / q_{θ_t}(x̃_t^(i) | x_{t-1}^(i)).

  · Resample from the particles {(x̃_t, T̃_{1,t}, C̃_t, T̃_{3,t})^(i)}, i = 1, ..., N, according to the weights {w̃_t^(i)}_{i=1,...,N} to get {(x_t, T_{1,t}, C_t, T_{3,t})^(i)}, i = 1, ..., N, each with weight w_t^(i) = 1/N.

• M-step: If t < t_b, set B_{t+1} = B_t. Else, calculate, using the particles before resampling,

    Ŝ_{1,t} = Σ_{i=1}^{N} T̃_{1,t}^(i) w̃_t^(i),
    Ŝ_{2,t} = Σ_{i=1}^{N} ( C̃_t^(i) + γ_t [ B_t ⊙ (y_t (x̃_t^(i))^T) ] / [ (B_t x̃_t^(i)) 1_K^T ] ) w̃_t^(i),
    Ŝ_{3,t} = Σ_{i=1}^{N} T̃_{3,t}^(i) w̃_t^(i),

and update the parameter θ_{t+1} = (B_{t+1}, ψ_{t+1}) with

    B_{t+1} = Ŝ_{2,t} / ( 1_M [Ŝ_{1,t}]^T ),    ψ_{t+1} = Λ(Ŝ_{3,t}).

Algorithm 1 is a special application of the SMC online EM algorithm proposed in Cappé [2009] for a general state-space HMM, and it requires only O(N) computations per time step. Alternatively, one can implement an O(N²) SMC approximation to the online EM algorithm; see Del Moral et al. [2009] for its merits and demerits over the O(N) implementation. The O(N²) implementation is made possible by plugging the following SMC approximation to p_θ(x_{t-1} | x_t, y_{1:t-1}) into (12):

    p_θ^N(dx_{t-1} | x_t, y_{1:t-1}) = p_θ^N(dx_{t-1} | y_{1:t-1}) f_ψ(x_t | x_{t-1}) / ∫ p_θ^N(dx_{t-1} | y_{1:t-1}) f_ψ(x_t | x_{t-1}).
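To make the pseudo-code concrete, here is one possible Python reading of Algorithm 1 for the Poisson NMF model with the bootstrap choice q_θ = f_ψ (so the proposal cancels against f_ψ in the weight update), estimating B only with ψ assumed known, as in the experiments of Section 4. This is an illustrative sketch, not the authors' implementation; sample_x1 and sample_f are user-supplied samplers for μ_ψ and f_ψ, and multinomial resampling is applied at every step.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_gB(y, B, x):
    """log g_B(y | x) = sum_m log PO(y(m); B(m,.) x); the -log y! term is dropped
    because it does not depend on theta or on the particle."""
    lam = B @ x + 1e-12                                    # guard against log(0)
    return np.sum(y * np.log(lam) - lam)

def smc_online_em_B(Y, B0, sample_x1, sample_f, N=1000, a=0.8, t_burn=100):
    """SMC online EM (Algorithm 1, bootstrap flavour) estimating B; psi known.

    Y : (T, M) array whose t'th row is the observed count vector y_t.
    """
    T, M = Y.shape
    K = B0.shape[1]
    B = B0.copy()
    x = np.stack([sample_x1() for _ in range(N)])          # particles, (N, K)
    T1 = x.copy()                                          # running statistic for S_1
    C = np.zeros((N, M, K))                                # running statistic for S_2
    logw = np.array([log_gB(Y[0], B, x[i]) for i in range(N)])

    def ratio(B, y, xs):
        """Per-particle matrix B (.) (y x^T) / ((B x) 1_K^T), shape (N, M, K)."""
        num = B[None, :, :] * y[None, :, None] * xs[:, None, :]
        den = (xs @ B.T)[:, :, None] + 1e-12
        return num / den

    for t in range(2, T + 1):
        gamma, y_prev, y_t = t ** (-a), Y[t - 2], Y[t - 1]
        # Statistic updates (the C update uses the previous particles and y_{t-1})
        C = (1 - gamma) * C + (1 - gamma) * (t - 1) ** (-a) * ratio(B, y_prev, x)
        x = np.stack([sample_f(x[i]) for i in range(N)])   # bootstrap proposal
        T1 = (1 - gamma) * T1 + gamma * x
        logw = logw + np.array([log_gB(y_t, B, x[i]) for i in range(N)])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        if t >= t_burn:
            # M-step with the pre-resampling particles: B = S_2 / (1_M [S_1]^T)
            S1 = w @ T1                                    # (K,)
            S2 = np.einsum('i,imk->mk', w, C + gamma * ratio(B, y_t, x))
            B = S2 / (S1[None, :] + 1e-12)
        # Multinomial resampling; the statistics travel with their particles
        idx = rng.choice(N, size=N, p=w)
        x, T1, C = x[idx], T1[idx], C[idx]
        logw = np.zeros(N)
    return B
```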
4. NUMERICAL EXAMPLES

4.1 Multiple basis selection model

In this simple basis selection model, X_t ∈ {0,1}^K determines which columns of B are selected to contribute to the intensity of the Poisson distribution for the observations. For k = 1, ..., K,

    X_1(k) ∼ μ(·),    Prob(X_t(k) = i | X_{t-1}(k) = j) = P(j,i),

where μ is the initial distribution of X_1(k) and P is a 2 × 2 transition matrix with P(1,1) = p and P(2,2) = q. Estimation of ψ = (p, q) can be done by calculating

    Ŝ_{3,T} = E_θ[ Σ_{i=1}^{T} s_{3,i}(X_{i-1}, X_i) | Y_{1:T} = y_{1:T} ],

    s_{3,t}(x_{t-1}, x_t) = Σ_{k=1}^{K} [ I_{(0,0)}(x_{t-1}(k), x_t(k)),  I_0(x_t(k)),  I_{(1,1)}(x_{t-1}(k), x_t(k)),  I_1(x_t(k)) ]^T,

and applying the maximisation rule (p^(j+1), q^(j+1)) = Λ(Ŝ_{3,t}^(j)), where Λ(·) for this model is defined as

    Λ(Ŝ_{3,t}) = ( Ŝ_{3,t}(1) / Ŝ_{3,t}(2),  Ŝ_{3,t}(3) / Ŝ_{3,t}(4) ).

Figure 1 shows the online estimation results of the exact implementation of online EM (with γ_t = t^{-0.8} and t_b = 100) for the 8 × 5 matrix B (assuming (p,q) known), given the 8 × 100000 matrix Y simulated with p = 0.8571 and q = 0.6926.

[Fig. 1. Online estimation of B in the NMF model in Section 4.1 using the exact implementation of online EM for NMF. The (i,j)'th subfigure shows the estimation result for B(i,j) (horizontal lines).]

4.2 A relaxation of the multiple basis selection model

In this model, the process {X_t ∈ (0,1)^K}_{t≥1} is not discrete; it is a Markov process on the unit interval (0,1). The law of the Markov chain for {X_t}_{t≥1} is as follows: for k = 1, ..., K, X_1(k) ∼ U(0,1) and

    X_{t+1}(k) | (X_t(k) = x) ∼ ρ(x) U(0,x) + (1 − ρ(x)) U(x,1),    ρ(x) = { α, if x ≤ 0.5;  1 − α, if x > 0.5 }.

When α is close to 1, the process will spend most of its time around 0 and 1 with a strong correlation. (Figure 2 shows a realisation of {X_t(1)}_{t≥1} for 500 time steps when α = 0.95.)

[Fig. 2. A realisation of {X_t(1)}_{t≥1} for the first 500 time steps (α = 0.95).]

For estimation of α, one needs to calculate

    Ŝ_{3,T} = E_θ[ Σ_{i=1}^{T} s_{3,i}(X_{i-1}, X_i) | Y_{1:T} = y_{1:T} ],

    s_{3,t}(x_{t-1}, x_t) = Σ_{k=1}^{K} [ I_{A_{x_{t-1}(k)}}(x_{t-1}(k), x_t(k)),  I_{(0,1)×(0,1) \ A_{x_{t-1}(k)}}(x_{t-1}(k), x_t(k)) ]^T,

where, for u ∈ (0,1), we define the set

    A_u = ( (0, 0.5] × (0, u] ) ∪ ( (0.5, 1) × (u, 1) ).

The maximisation step for α is characterised as

    Λ(Ŝ_{3,t}) = Ŝ_{3,t}(1) / ( Ŝ_{3,t}(1) + Ŝ_{3,t}(2) ).

We generated an 8 × 50000 observation matrix Y using an 8 × 5 matrix B and α = 0.95. We used the SMC online EM algorithm described in Algorithm 1 to estimate B (assuming α known), with N = 1000 particles, the bootstrap proposal q_θ(x_t | x_{t-1}) = f_ψ(x_t | x_{t-1}), γ_t = t^{-0.8}, and t_b = 100. Figure 3 shows the estimation results.

[Fig. 3. Online estimation of B in the NMF model in Section 4.2 using Algorithm 1. The (i,j)'th subfigure shows the estimation result for B(i,j) (horizontal lines).]
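The relaxed chain of Section 4.2 is easy to simulate, which makes the role of ρ(x) concrete: with α close to 1, a coordinate below 0.5 is most likely redrawn from U(0, x) and a coordinate above 0.5 from U(x, 1), so the chain lingers near 0 and near 1 as described for Figure 2. A small illustrative sketch of this law (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def step_relaxed(x, alpha=0.95):
    """One transition of the Section 4.2 chain, applied coordinate-wise to x in (0,1)^K."""
    rho = np.where(x <= 0.5, alpha, 1.0 - alpha)       # rho(x)
    pick_lower = rng.random(x.shape) < rho             # choose U(0, x) with prob. rho(x)
    return np.where(pick_lower, rng.uniform(0.0, x), rng.uniform(x, 1.0))

K, T, alpha = 5, 500, 0.95
X = np.empty((T, K))
X[0] = rng.uniform(0.0, 1.0, size=K)                   # X_1(k) ~ U(0, 1)
for t in range(1, T):
    X[t] = step_relaxed(X[t - 1], alpha)
# Each coordinate spends long stretches near 0 or near 1, as in Figure 2.
```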
5. DISCUSSION

In this paper, we presented an online EM algorithm for NMF models with Poisson observations. We demonstrated an exact implementation and an SMC implementation of the online EM method on two separate NMF models. The method is, however, applicable to any NMF model in which the columns of the matrix X can be represented as a stationary Markov process, e.g. a log-Gaussian process.

The results in Section 4 do not reflect the full generality of the method: only B is estimated and the parameter ψ is assumed known, although we formulated the estimation rules for all of the parameters in θ. Also, the dimensions of the B matrix in our experiments may be too small for realistic scenarios. Note that in Algorithm 1 we used the bootstrap particle filter, which is the simplest SMC implementation. The SMC implementation may be improved by devising more sophisticated particle filters (e.g. better proposal densities that learn from the current observation, SMC samplers, etc.), and we believe that only with such improvements can the method handle more complex, higher dimensional problems.

REFERENCES

S. S. Bucak and B. Gunsel. Incremental subspace learning via non-negative matrix factorization. Pattern Recognition, 42:788–797, 2009.
O. Cappé. Online sequential Monte Carlo EM algorithm. In Proc. IEEE Workshop on Statistical Signal Processing, 2009.
O. Cappé. Online EM algorithm for hidden Markov models. Journal of Computational and Graphical Statistics, 20(3):728–749, 2011.
A. T. Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009:1–17, 2009.
P. Del Moral, A. Doucet, and S. S. Singh. Forward smoothing using sequential Monte Carlo. Technical Report 638, Cambridge University, Engineering Department, 2009.
D. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. "Math Challenges of the 21st Century", 2000.
A. Doucet, S. J. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197–208, 2000.
R. J. Elliott, J. J. Ford, and J. B. Moore. On-line almost-sure parameter estimation for partially observed discrete-time linear systems with known noise characteristics. International Journal of Adaptive Control and Signal Processing, 16:435–453, 2002. doi:10.1002/acs.703.
C. Fevotte and A. T. Cemgil. Nonnegative matrix factorisations as probabilistic inference in composite models. In Proc. 17th European Signal Processing Conference (EUSIPCO'09), Glasgow, 2009.
M. Hoffman, D. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 856–864, 2010.
N. Kantas, A. Doucet, S. S. Singh, and J. M. Maciejowski. An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. In Proceedings of the IFAC System Identification (SysId) Meeting, 2009.
T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
D. D. Lee and H. S. Seung. Learning the parts of objects with nonnegative matrix factorization. Nature, 401:788–791, 1999.
D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562, 2000.
A. Lefevre, F. Bach, and C. Fevotte. Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 313–316, 2011.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. February 2010. URL http://arxiv.org/abs/0908.0050.
G. Mongillo and S. Deneve. Online learning with hidden Markov models. Neural Computation, 20(7):1706–1716, 2008.
R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, 2008.
A. P. Singh and G. J. Gordon. A unified view of matrix factorization models. In ECML PKDD'08, Part II, number 5212, pages 358–373. Springer, 2008.
