Slow mixing for Latent Dirichlet Allocation

Johan Jonasson∗†

January 12, 2017

arXiv:1701.02960v1 [cs.LG] 11 Jan 2017

Abstract

Markov chain Monte Carlo (MCMC) algorithms are ubiquitous in probability theory in general and in machine learning in particular. A Markov chain is devised so that its stationary distribution is some probability distribution of interest. Then one samples from the given distribution by running the Markov chain for a "long time" until it appears to be stationary and then collects the sample. However these chains are often very complex and there are no theoretical guarantees that stationarity is actually reached. In this paper we study the Gibbs sampler of the posterior distribution of a very simple case of Latent Dirichlet Allocation, an attractive Bayesian unsupervised learning model for text generation and text classification. It turns out that in some situations, the mixing time of the Gibbs sampler is exponential in the length of documents and so it is practically impossible to properly sample from the posterior when documents are sufficiently long.

AMS Subject classification: 60J10

Keywords and phrases: mixing time, MCMC, Gibbs sampler, topic model

Short title: Slow mixing for LDA

1 Introduction

Markov chain Monte Carlo (MCMC) is a powerful tool for sampling from a given probability distribution on a very large state space, where direct sampling is difficult, in part because of the size of the state space and in part because of normalizing constants that are difficult to compute.

∗Chalmers University of Technology and University of Gothenburg
†Research supported by the Knut and Alice Wallenberg Foundation

In machine learning in particular, MCMC algorithms are extremely common for sampling from posterior distributions of Bayesian probabilistic models. The posterior distribution given observed data is then difficult to sample from for the reasons above. One then designs an (irreducible, aperiodic) Markov chain whose stationary distribution is precisely the targeted posterior.
This is usually fairly easy since the posterior is usually easy to compute up to the normalizing constant (the denominator in Bayes' formula). One usually uses Gibbs sampling or the related Metropolis-Hastings algorithm.

Gibbs sampling in these situations can be described as follows. Our state space is a finite set of random variables $X = \{X_a\}_{a \in A}$, where $X_a \in T$ for some measurable space $T$, so that $X \in T^A$ and the targeted distribution is a given probability measure $P$ on $T^A$. In order to sample from $P$ one starts a Markov chain on $T^A$ whose updates are given by first choosing an index $a \in A$ at random and then choosing a new value of $X_a$ according to the conditional distribution of $X_a$ given all $X_b$, $b \in A \setminus \{a\}$. Under mild conditions, this chain converges in distribution to $P$. The Markov chain is then run for a "long time" and then a sample, hopefully approximately from $P$, is collected. A key question here is for how long the chain actually has to be run in order for the distribution after that time to be a good approximation of $P$. Since $A$ is usually large, the number of steps needed should be at most polynomial in the size of $A$ for Gibbs sampling to be feasible. In almost all practical cases, the structure of the sample space and the probability measure $P$ is so complex that it is virtually impossible to make a rigorous analysis of the mixing rate. However, it may be possible to consider some very simplified special cases. In this paper, we will analyse a special case of Latent Dirichlet Allocation, henceforth LDA for short, and demonstrate for such a simple special case that mixing can indeed be a problem.

LDA is a model used to classify documents according to their topics. We have a large corpus of documents and we want to determine for each word in each document which topic it comes from. Knowing this, we can also classify the documents according to the proportion of words of the different topics they contain.
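As an aside not in the original paper, the generic scheme just described can be sketched in a few lines of Python; the toy target distribution and all function names here are our own illustration. Note that the update only uses ratios of unnormalized weights, so the normalizing constant never needs to be computed:

```python
import random

# Toy unnormalized target on {0,1}^3 (weights chosen for illustration;
# they factorize, so each coordinate is independently Bernoulli(2/3)).
weights = {
    (0, 0, 0): 1.0, (0, 0, 1): 2.0, (0, 1, 0): 2.0, (0, 1, 1): 4.0,
    (1, 0, 0): 2.0, (1, 0, 1): 4.0, (1, 1, 0): 4.0, (1, 1, 1): 8.0,
}

def gibbs_step(x, rng):
    """One random-scan Gibbs update: choose an index a at random, then
    resample X_a from its conditional law given all other coordinates.
    Only ratios of unnormalized weights are needed."""
    a = rng.randrange(len(x))
    x0, x1 = list(x), list(x)
    x0[a], x1[a] = 0, 1
    w0, w1 = weights[tuple(x0)], weights[tuple(x1)]
    return tuple(x1) if rng.random() < w1 / (w0 + w1) else tuple(x0)

def gibbs_run(n_steps, seed=0):
    """Run the chain for n_steps from a fixed start and return the final state."""
    rng = random.Random(seed)
    x = (0, 0, 0)
    for _ in range(n_steps):
        x = gibbs_step(x, rng)
    return x
```

For this factorized toy target the chain mixes rapidly; the point of the paper is precisely that for the LDA posterior studied below it does not.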
The setup in LDA is that one has a fixed number $D$ of documents of lengths $N_d$, a fixed set of topics $t_1, t_2, \ldots, t_s$ and a fixed set of words $w_1, w_2, \ldots, w_v$. These are specified in advance. The number of topics is usually not large, whereas the number of words is. Next, for each document $d = 1, \ldots, D$, a multinomial distribution $\theta_d = (\theta_d(1), \ldots, \theta_d(s))$ over topics is chosen according to a Dirichlet prior with a known parameter $\alpha = (\alpha_1, \ldots, \alpha_s)$. For each topic $t_i$ a multinomial distribution $\phi_i = (\phi_i(1), \ldots, \phi_i(v))$ is chosen according to a Dirichlet prior with parameter $\beta = (\beta_1, \ldots, \beta_v)$, independently of each other and of the $\theta_d$:s. Given these, the corpus is then generated by, for each position $p = 1, \ldots, N_d$ in each document $d$, picking a topic $z_{d,p}$ according to $\theta_d$ and then picking the word at that position according to $\phi_{z_{d,p}}$, doing this independently for all positions. Note that the LDA model is a so called "bag of words" model, i.e. it is invariant under permutations within each document.

In this paper, we will consider the very simplified case with $D = s = v = 2$, $N_1 = N_2 = m$ and $\alpha = \beta = (1,1)$ and study the mixing time asymptotics as $m \to \infty$. To simplify the notation, denote the two topics by $A$ and $B$ and the two words by 1 and 2. Define $n_{ij}$ as the number of occurrences of the word $j$ in document $i$, $i,j = 1,2$, and write $n_{i.} = n_{i1} + n_{i2}$ (which by assumption equals $m$), $n_{.j} = n_{1j} + n_{2j}$ and $n_{..} = \sum_{i,j} n_{ij} = 2m$. We consider the mixing time for Gibbs sampling of the posterior in a seemingly typical case, namely that the number of 1:s in the first document is $3m/10$ and in the second document $6m/10$. Our result is the following.

Theorem 1.1 Consider the case $n_{11} = 3m/10$ and $n_{21} = 6m/10$. Then there exists a $\lambda > 0$ such that for each $0 < \kappa < 3/4$, $\tau_{\mathrm{mix}}(\kappa) > e^{\lambda m}$.

Remark. The point of this paper is to demonstrate that the mixing time issue can be a real problem for Bayesian inference in machine learning in general, and not for LDA in particular.
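To make the simplified setting concrete, here is a small simulation sketch (our own illustrative code, not part of the paper's argument; the function names are ours) of the generative process with $D = s = v = 2$, $N_1 = N_2 = m$ and uniform, i.e. Dirichlet$(1,1)$, priors:

```python
import random

def generate_corpus(m, seed=0):
    """Draw one corpus from the simplified LDA model: 2 documents of
    length m, topics 'A'/'B', words 1/2, and Dirichlet(1,1) -- i.e.
    uniform -- priors on all multinomial parameters."""
    rng = random.Random(seed)
    theta = [rng.random(), rng.random()]          # theta_d = P(topic A) in doc d
    phi = {"A": rng.random(), "B": rng.random()}  # phi_t = P(word 1 | topic t)
    docs, topics = [], []
    for d in range(2):
        z = ["A" if rng.random() < theta[d] else "B" for _ in range(m)]
        w = [1 if rng.random() < phi[t] else 2 for t in z]
        topics.append(z)
        docs.append(w)
    return docs, topics

def word_counts(docs):
    """The n_ij of the paper: occurrences of word j in document i."""
    return [[doc.count(1), doc.count(2)] for doc in docs]
```

The theorem concerns the posterior over the latent `topics` given a corpus whose counts $n_{11}, n_{21}$ happen to equal $3m/10$ and $6m/10$.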
There are also other methods for estimation that seem to work well for LDA, in particular variational inference. Furthermore, experimental results seem to be fairly well in line with what one would expect from a topic classification algorithm.

Before moving on to the proof of Theorem 1.1, we formally introduce the concept of mixing time. Let $\{X_t\}_{t=0}^{\infty}$ be a discrete time Markov chain on the finite state space $S$ and for $s \in S$, let $P_s$ be the underlying probability measure under $X_0 = s$.

Definition 1.2 Let $\mu$ and $\nu$ be two probability measures on $S$. Then the total variation distance between $\mu$ and $\nu$ is given by

$$\|\mu - \nu\|_{TV} = \max_{A \subseteq S}(\mu(A) - \nu(A)) = \frac{1}{2}\sum_{s \in S}|\mu(s) - \nu(s)|.$$

Definition 1.3 For each $\kappa \in (0,1)$, the $\kappa$-mixing time of $\{X_t\}$ is given by

$$\tau_{\mathrm{mix}}(\kappa) = \max_{s \in S}\min\{t : \|P_s(X_t \in \cdot) - \pi\|_{TV} \le \kappa\}.$$

The essence of Theorem 1.1 is that even after an exponentially long time, the distribution of the state of the Gibbs sampler is concentrated on a set whose probability mass according to the targeted posterior is at most 1/4.

2 Proof of Theorem 1.1

The following two combinatorial lemmas will be needed.

Lemma 2.1 Let $X$ be a standard uniform random variable and $0 \le k \le n$. Then

$$E[X^k(1-X)^{n-k}] = \frac{1}{(n+1)\binom{n}{k}}.$$

Proof. First recall that for any $m = 1, 2, \ldots$,

$$E[X^m] = \frac{1}{m+1}.$$

Next observe that the result holds true for any $n$ with $k = n$. We want to prove that the claimed result holds for $(n,k) = (p,r)$ for some arbitrary $0 \le r \le p$. We do this by induction. We may then assume that the result is true with $(n,k) = (m,l)$ for any $m$ and $l$ with either $m < p$, or $m = p$ and $l > r$. Then, taking $Y = X^r(1-X)^{p-r-1}$,

$$E[X^r(1-X)^{p-r}] = E[Y(1-X)] = E[Y] - E[YX] = E[X^r(1-X)^{p-r-1}] - E[X^{r+1}(1-X)^{p-r-1}] = \frac{1}{p\binom{p-1}{r}} - \frac{1}{(p+1)\binom{p}{r+1}} = \frac{1}{(p+1)\binom{p}{r}},$$

where the fourth equality follows from the induction hypothesis and the last equality holds because

$$\frac{1}{(p+1)\binom{p}{r}} + \frac{1}{(p+1)\binom{p}{r+1}} = \frac{\binom{p+1}{r+1}}{(p+1)\binom{p}{r}\binom{p}{r+1}} = \frac{\frac{p+1}{r+1}\binom{p}{r}}{(p+1)\binom{p}{r}\binom{p}{r+1}} = \frac{1}{(r+1)\binom{p}{r+1}} = \frac{1}{p\binom{p-1}{r}}.$$

✷
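Lemma 2.1 can also be checked mechanically in exact arithmetic (a check of our own, not part of the paper), since the left hand side is the Beta integral $\int_0^1 x^k(1-x)^{n-k}\,dx = k!(n-k)!/(n+1)!$:

```python
from fractions import Fraction
from math import comb, factorial

def beta_moment(n, k):
    """E[X^k (1-X)^(n-k)] for X uniform on (0,1), via the Beta integral:
    the integral of x^k (1-x)^(n-k) over (0,1) equals k!(n-k)!/(n+1)!."""
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

def lemma_2_1_rhs(n, k):
    """Right-hand side of Lemma 2.1: 1 / ((n+1) * C(n,k))."""
    return Fraction(1, (n + 1) * comb(n, k))
```

The two functions agree exactly for all $0 \le k \le n$.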
Lemma 2.2 For any nonnegative integers $a$, $b$, $c$ and $d$,

$$\binom{a+b+c+d}{a+c}\binom{a+c}{a}\binom{b+d}{d} = \binom{a+b+c+d}{a+b}\binom{a+b}{a}\binom{c+d}{d}.$$

Proof. Both sides of the equality are equal to the multinomial coefficient

$$\binom{a+b+c+d}{a,\,b,\,c,\,d}.$$

✷

Let $W = (w_{11}, w_{12}, \ldots, w_{1m}, w_{21}, \ldots, w_{2m})$ be the words in our corpus, let $Z = (z_{11}, z_{12}, \ldots, z_{2m})$ be the latent topics and $Z_d$ be the latent topics in document $d$. Let also $\theta_d$ be the probability that $z_{d1} = A$, $d = 1,2$, and let $\phi_t$ be the conditional probability that $w_{dj} = 1$ given that $z_{dj} = t$, $t = A, B$. In the case under study, these four quantities are all independent standard uniform random variables. We begin by determining the posterior distribution.

$$P(Z = z \mid W = w) \propto P(W = w \mid Z = z)\, P(Z = z).$$

Now

$$P(Z = z) = E[P(Z = z \mid \theta_1, \theta_2)] = E[\theta_1^{k_{1.}}(1-\theta_1)^{n_{1.}-k_{1.}}]\, E[\theta_2^{k_{2.}}(1-\theta_2)^{n_{2.}-k_{2.}}] = \frac{1}{(n_{1.}+1)(n_{2.}+1)\binom{n_{1.}}{k_{1.}}\binom{n_{2.}}{k_{2.}}}.$$

Here we have used the notation $k_{dj}$ for the number of words in document $d$ with the topic being $A$ and the word being $j$, and the same dot notation for the $k$:s as for the $n$:s. The last equality is Lemma 2.1. For the second factor we have analogously, again using Lemma 2.1,

$$P(W = w \mid Z = z) = E[\phi_A^{k_{.1}}(1-\phi_A)^{k_{..}-k_{.1}}]\, E[\phi_B^{n_{.1}-k_{.1}}(1-\phi_B)^{(n_{..}-k_{..})-(n_{.1}-k_{.1})}] = \frac{1}{(k_{..}+1)(n_{..}-k_{..}+1)\binom{k_{..}}{k_{.1}}\binom{n_{..}-k_{..}}{n_{.1}-k_{.1}}}.$$

Hence, ignoring factors that do not depend on the $k$:s and using Lemma 2.2 for the second equality,

$$P(Z = z \mid W = w) \propto \left((k_{..}+1)(n_{..}-k_{..}+1)\binom{n_{1.}}{k_{1.}}\binom{n_{2.}}{k_{2.}}\binom{k_{..}}{k_{.1}}\binom{n_{..}-k_{..}}{n_{.1}-k_{.1}}\right)^{-1} = \frac{\binom{n_{..}}{k_{..}}}{(k_{..}+1)(n_{..}-k_{..}+1)\binom{n_{1.}}{k_{1.}}\binom{n_{2.}}{k_{2.}}\binom{n_{.1}}{k_{.1}}\binom{n_{.2}}{k_{.2}}}.$$

This expression only depends on $z$ via $k = k(z) := (k_{11}, k_{12}, k_{21}, k_{22})$.
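Lemma 2.2 admits the same kind of mechanical check (again our own, purely illustrative): both sides can be computed together with the multinomial coefficient they are claimed to equal.

```python
from math import comb, factorial

def lemma_2_2_sides(a, b, c, d):
    """Return the two sides of Lemma 2.2 and the multinomial coefficient
    (a+b+c+d)! / (a! b! c! d!) that both sides equal."""
    n = a + b + c + d
    lhs = comb(n, a + c) * comb(a + c, a) * comb(b + d, d)
    rhs = comb(n, a + b) * comb(a + b, a) * comb(c + d, d)
    multinomial = factorial(n) // (
        factorial(a) * factorial(b) * factorial(c) * factorial(d))
    return lhs, rhs, multinomial
```

Each side groups the four cells of the $2 \times 2$ table $(a, b, c, d)$ by one of its two margins, which is exactly how the lemma is used in the posterior computation above.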
Identifying all $z$ having the same $k(z)$, we have (regarding, with some abuse of notation, a $k$ also as the equivalence class consisting of all $z$:s having that particular $k_{dj}$:s) that for any $k_1$ and $k_2$, all $z \in k_1$ have the same probability of transitioning into $k_2$. Hence the process where we only record the $k$:s is a lumped Markov chain and it suffices to regard this chain, whose state space is $[n_{11}] \times [n_{12}] \times [n_{21}] \times [n_{22}]$. In particular the Gibbs sampler does not mix any faster than the lumped Markov chain; this follows from the definition of the total variation norm and the triangle inequality.

The lumped Markov chain has the stationary distribution

$$f(k) = f_w(k) := P(K = k \mid W = w) = C_0\, \frac{\binom{n_{11}}{k_{11}}\binom{n_{12}}{k_{12}}\binom{n_{21}}{k_{21}}\binom{n_{22}}{k_{22}}\binom{n_{..}}{k_{..}}}{(k_{..}+1)(n_{..}-k_{..}+1)\binom{n_{1.}}{k_{1.}}\binom{n_{2.}}{k_{2.}}\binom{n_{.1}}{k_{.1}}\binom{n_{.2}}{k_{.2}}},$$

where of course $K = k(Z)$ and $C_0$ is the normalizing constant.

To prove Theorem 1.1, we will show that there are small neighborhoods of the two states $k_0 = (3m/10, 0, 6m/10, 0)$ and $k_1 = (3m/10, 7m/10, 0, 0)$ that are extremely difficult to leave.

Remark. By switching the names of the two topics, it follows that the states $(0, 7m/10, 0, 4m/10)$ and $(0, 0, 6m/10, 4m/10)$ are equally difficult to leave. One can on good grounds argue that these are really the same as $k_0$ and $k_1$ respectively. Doing so, the result of Theorem 1.1 is still valid, at least for $\kappa < 1/2$, which is "bad enough".

Define

$$h(x) = \left(x^x(1-x)^{1-x}\right)^{-1},\quad x \in (0,1).$$

Then, for $a \in (0,1)$, by Stirling's formula,

$$\binom{m}{am} = C_1\, h(a)^m,$$

where $C_1$ is of order $m^{-1/2}$. Hence

$$f(k_0) = C_0 C_2\, \frac{h(9/20)^{2m}}{h(3/10)^m\, h(3/5)^m} = C_0 C_2 \left(\frac{h(9/20)^2}{h(3/10)\,h(3/5)}\right)^m,$$

where $C_2$ is of order $m^{1/2}$. Now consider $f$ in a small neighborhood of $k_0$: let $a, b, c, d$ be small positive numbers, say at most $1/10$. Then

$$f(3m/10 - am,\ bm,\ 3m/5 - cm,\ dm) = C_0 C_3 \left(\frac{h\big(\tfrac{9}{20} + \tfrac{b+d-a-c}{2}\big)^2\, h\big(\tfrac{10a}{3}\big)^{3/10}\, h\big(\tfrac{10b}{7}\big)^{7/10}\, h\big(\tfrac{5c}{3}\big)^{3/5}\, h\big(\tfrac{5d}{2}\big)^{2/5}}{h\big(\tfrac{10(a+c)}{9}\big)^{9/10}\, h\big(\tfrac{10(b+d)}{11}\big)^{11/10}\, h\big(\tfrac{3}{10}+b-a\big)\, h\big(\tfrac{3}{5}+d-c\big)}\right)^m =: C_0 C_3\, g(a,b,c,d)^m,$$
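For moderate $m$, the stationary distribution of the lumped chain can be tabulated exactly from the formula for $f(k)$ above. The sketch below (our own code; the function name is hypothetical) does this in exact rational arithmetic, which also lets one verify that the Lemma 2.2 simplification left $f$ proportional to the direct product form derived earlier:

```python
from fractions import Fraction
from math import comb

def stationary_f(m):
    """Unnormalized stationary distribution f(k) of the lumped chain for
    the case n11 = 3m/10, n21 = 6m/10 (m divisible by 10), evaluated from
    the closed-form expression derived in the text."""
    n11, n12, n21, n22 = 3 * m // 10, 7 * m // 10, 6 * m // 10, 4 * m // 10
    n1, n2 = n11 + n12, n21 + n22   # row sums n_1., n_2. (both equal m)
    c1, c2 = n11 + n21, n12 + n22   # column sums n_.1, n_.2
    n = n1 + n2                     # n_.. = 2m
    f = {}
    for k11 in range(n11 + 1):
        for k12 in range(n12 + 1):
            for k21 in range(n21 + 1):
                for k22 in range(n22 + 1):
                    k1, k2 = k11 + k12, k21 + k22
                    kc1, kc2 = k11 + k21, k12 + k22
                    kk = k1 + k2
                    f[(k11, k12, k21, k22)] = Fraction(
                        comb(n11, k11) * comb(n12, k12) * comb(n21, k21)
                        * comb(n22, k22) * comb(n, kk),
                        (kk + 1) * (n - kk + 1) * comb(n1, k1) * comb(n2, k2)
                        * comb(c1, kc1) * comb(c2, kc2))
    return f
```

For $m = 10$ the lumped state space already has $4 \cdot 8 \cdot 7 \cdot 5 = 1120$ states, against $2^{20}$ topic assignments for the unlumped sampler.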
where $C_3$ is of order between $m^{-1/2}$ and $m^{1/2}$. We claim that all partial derivatives of $g$ are strictly negative at the origin. We do this with respect to $a$, leaving the analogous calculations to the reader. Now

$$g(a,0,0,0) = \frac{h\big(\tfrac{9}{20} - \tfrac{a}{2}\big)^2\, h\big(\tfrac{10a}{3}\big)^{3/10}}{h\big(\tfrac{10a}{9}\big)^{9/10}\, h\big(\tfrac{3}{10} - a\big)}$$

so that

$$\log g(a,0,0,0) = 2\log h\Big(\frac{9}{20} - \frac{a}{2}\Big) + \frac{3}{10}\log h\Big(\frac{10a}{3}\Big) - \frac{9}{10}\log h\Big(\frac{10a}{9}\Big) - \log h\Big(\frac{3}{10} - a\Big).$$

It is readily checked that

$$\frac{d}{dx}\log h(x) = \log\frac{1-x}{x}.$$

Hence

$$\frac{\partial}{\partial a}\log g(a,0,0,0) = \log\frac{\big(\tfrac{9}{20} - \tfrac{a}{2}\big)\big(1 - \tfrac{10a}{3}\big)\tfrac{10a}{9}\big(\tfrac{7}{10} + a\big)}{\big(\tfrac{11}{20} + \tfrac{a}{2}\big)\tfrac{10a}{3}\big(1 - \tfrac{10a}{9}\big)\big(\tfrac{3}{10} - a\big)}.$$

For $a = 0$, the right hand side equals $-\log(11/7)$, which is smaller than $-9/20$. Analogously, the partial derivatives of $\log g$ with respect to $b$, $c$ and $d$ are less than $-1/4$ (we do not get exactly the same numbers here). It follows that for a sufficiently small $\epsilon > 0$,

$$f(k) < e^{-\epsilon m/4} f(k_0)$$

whenever $\|k - k_0\|_\infty = \epsilon m$. For such a $k$, let $V_k$ be the number of visits to $k$ of the lumped Markov chain starting at $k_0$ until it returns to $k_0$. Then by an intuitive and well known fact in Markov theory (see e.g. [1], Lemma 6 of Chapter 2),

$$E[V_k] = \frac{f(k)}{f(k_0)} < e^{-\epsilon m/4}.$$

Hence

$$P(V_k > 0) = \frac{E[V_k]}{E[V_k \mid V_k > 0]} < e^{-\epsilon m/4}.$$

By a Bonferroni bound we get

$$P\Big(\bigcup_{k : \|k - k_0\|_\infty = \epsilon m}\{V_k > 0\}\Big) < 4(\epsilon m)^3 e^{-\epsilon m/4} < e^{-\epsilon m/8}$$

for sufficiently large $m$ ($\epsilon m > 170$ is sufficient). Hence, starting at $k_0$, the probability of getting more than distance $\epsilon m$ from $k_0$ in $e^{\epsilon m/9}$ steps is at most $e^{\epsilon m/9}e^{-\epsilon m/8} = e^{-\epsilon m/72}$, which goes to 0 as $m \to \infty$.

Next, we make an analogous analysis of the neighborhood of $k_1$. We get

$$f(k_1) = C_0 C_1 \left(\frac{h(1/2)^2}{h(1/3)^{9/10}\, h(4/11)^{11/10}}\right)^m,$$

where $C_1$ is of order $m^{1/2}$, and

$$f(3m/10 - am,\ 7m/10 - bm,\ cm,\ dm) = C_0 C_3 \left(\frac{h\big(\tfrac{1}{2} + \tfrac{c+d-a-b}{2}\big)^2\, h\big(\tfrac{10a}{3}\big)^{3/10}\, h\big(\tfrac{10b}{7}\big)^{7/10}\, h\big(\tfrac{5c}{3}\big)^{3/5}\, h\big(\tfrac{5d}{2}\big)^{2/5}}{h\big(\tfrac{1}{3} + \tfrac{10(c-a)}{9}\big)^{9/10}\, h\big(\tfrac{7}{11} + \tfrac{10(d-b)}{11}\big)^{11/10}\, h(a+b)\, h(c+d)}\right)^m =: C_0 C_3\, g(a,b,c,d)^m.$$

This entails that

$$g(a,0,0,0) = \frac{h\big(\tfrac{1}{2} - \tfrac{a}{2}\big)^2\, h\big(\tfrac{10a}{3}\big)^{3/10}}{h\big(\tfrac{1}{3} - \tfrac{10a}{9}\big)^{9/10}\, h(a)}.$$
Taking the logarithm and differentiating with respect to $a$ gives

$$\frac{\partial}{\partial a}\log g(a,0,0,0) = \log\frac{\big(\tfrac{1}{2} - \tfrac{a}{2}\big)\big(1 - \tfrac{10a}{3}\big)\big(\tfrac{2}{3} + \tfrac{10a}{9}\big)\, a}{\big(\tfrac{1}{2} + \tfrac{a}{2}\big)\tfrac{10a}{3}\big(\tfrac{1}{3} - \tfrac{10a}{9}\big)(1 - a)}$$

and taking $a = 0$, this becomes $-\log(5/3) < -1/2$. Repeating the above argument, it follows for some $\nu > 0$ that, starting from $k_1$, the probability of getting more than distance $\nu m$ from $k_1$ in $e^{\nu m/9}$ steps is at most $e^{-\nu m/72}$. Combining this with the analogous result for $k_0$, this proves Theorem 1.1.

References

[1] David Aldous and James A. Fill, Reversible Markov Chains and Random Walks on Graphs, unfinished monograph at http://www.stat.berkeley.edu/~aldous/RWG/book.html

[2] Mark Andrews and Gabriella Vigliocco, The Hidden Markov Topic Model: A Probabilistic Model of Semantic Representation, Topics in Cognitive Science 2 (2010), 101-113.

[3] David M. Blei, Introduction to Probabilistic Topic Models (2011), Princeton University.

[4] David M. Blei, Andrew Y. Ng and Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3 (2003), 993-1022.

[5] Matthew D. Hoffman, David M. Blei, Chong Wang and John Paisley, Stochastic Variational Inference, Journal of Machine Learning Research 14 (2013), 1303-1347.
