Estimating probabilities from experimental frequencies

Inés Samengo*
Centro Atómico Bariloche and Instituto Balseiro, (8400) San Carlos de Bariloche, Río Negro, Argentina

Estimating the probability distribution q governing the behaviour of a certain variable by sampling its value a finite number of times most typically involves an error. Successive measurements allow the construction of a histogram, or frequency count f, of each of the possible outcomes. In this work, the probability that the true distribution be q, given that the frequency count f was sampled, is studied. Such a probability may be written as a Gibbs distribution. A thermodynamic potential, which allows an easy evaluation of the mean Kullback-Leibler divergence between the true and measured distributions, is defined. For a large number of samples, the expectation value of any function of q is expanded in powers of the inverse number of samples. As an example, the moments, the entropy and the mutual information are analyzed.

PACS numbers: 02.50.Tt

I. ESTIMATING PROBABILITIES FROM EXPERIMENTAL FREQUENCIES

The estimation of probability distributions from a limited number of samples typically involves an error. Consider, for example, a random variable that can be either 0 or 1, both values with probability 1/2. An experimenter measures the variable, say, four times. If n_0 (similarly, n_1) is the number of trials in which the result was 0 (correspondingly, 1), the possible outcomes are n_0 = j, n_1 = 4 - j, where j may vary between 0 and 4. Each of those possibilities has probability 3/[2 j! (4-j)!] of occurring. If the experimenter estimates the underlying probability from the frequencies, his or her claim will be that the probability of getting a zero is n_0/4. However, since n_0 depends on the particular outcome of the four trials, only a fraction 3/8 of the times will this procedure give the correct result, that is, f_0 = q_0 = 1/2.

In the above example, there are three probability distributions involved. First, there is the true underlying probability q, actually governing the outcome of the experiment. In vector notation, q = (q_0, q_1), and in the particular instance above, q = (1/2, 1/2). Then, there is the frequency count f = (f_0, f_1), where f_i is obtained by dividing n_i by the total number of measurements N (four, in the example). And finally, there is the probability that f = q. To define this last probability, one has to consider all possible samples of N trials, and evaluate how often the condition f = q is fulfilled.

More generally, one can define the probability of measuring a particular f, while the underlying q remains fixed. This means to consider a probability distribution over all the possible frequency counts. The independent variable is the vector f, which varies in a discrete set, and the dependent variable is p(f|q).

The frequency count f is an estimation of the underlying q. In many applications, however, one is interested not quite in q, but rather in some function of q. Treves and Panzeri [1], for example, have quantified the mean error that an experimenter makes when evaluating the mutual information in the frequency count f, as an approximation to that in the true (and unknown) q. Their analysis was made in the same spirit as above, that is, they have considered q fixed, while the value of f depended on the particular outcome of N measurements. They have obtained a clean analytical result, under an independence approximation. Their approach may be naturally generalized to situations where q is a probability density, that is, varies in a continuous set [2].

However, what the experimenter knows is not the true q, but one particular f, obtained after N observations. His or her aim is to estimate the most probable value of q (or of some function of q) from the knowledge of f. More generally, the experimenter may be interested in the whole distribution P(q|f), that is, the probability that the true distribution be q, given that he or she has measured f. This means to pose the problem the other way round from the way it was studied by Treves and Panzeri, and in the example above. It actually corresponds to Wolpert and Wolf's approach [3] in the estimation of entropies.

*Electronic address: [email protected]
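The combinatorics of the four-toss example above are compact enough to check directly. The following is a minimal sketch (the paper itself contains no code); the function name and the sample sizes are illustrative assumptions:

```python
from math import comb

# Probability that the frequency count reproduces the true distribution,
# f0 = q0, for N tosses of a binary variable with P(0) = q0.
# For N = 4 and q0 = 1/2 this gives C(4,2)/2^4 = 3/8, as quoted above.
def prob_frequency_matches(N: int, q0: float = 0.5) -> float:
    n0 = q0 * N
    if n0 != int(n0):        # f0 can only equal q0 if q0*N is an integer
        return 0.0
    n0 = int(n0)
    return comb(N, n0) * q0**n0 * (1 - q0)**(N - n0)

for N in (4, 16, 64, 256):
    print(N, prob_frequency_matches(N))
```

Note that the probability of an exact match actually decreases as N grows (roughly as 1/sqrt(N) for a fair coin), even though f concentrates ever more tightly around q: recovering q exactly is not the same as recovering it approximately.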
In the following section, the properties of the distribution P(q|f) are studied. In Sect. III, P(q|f) is written as a Gibbs distribution, where the inverse number of samples plays the role of an effective temperature, and the Kullback-Leibler divergence between f and q is the equivalent of the energy of state q. As a consequence, a thermodynamic potential is defined, thus allowing the calculation of the mean Kullback-Leibler divergence between f and q by simple derivation. This inspires the expansion made in Sect. IV, where the expectation value of an arbitrary function of q is written as a power series in the inverse number of samples. The cases of the entropy, the mutual information, and the moments of the distribution q are shown in the examples of Sect. V. Next, in Sect. VI the analytical results are confronted with numerical simulations. Finally, in Sect. VII, the main results are summarized and discussed.

II. THE PROBABILITY DISTRIBUTION FOR THE TRUE PROBABILITY DISTRIBUTION

Consider the random variable X taking values from the set x = (x_1, ..., x_S), with probabilities q = (q_1, ..., q_S). In principle, there is no need for x_1, ..., x_S to be numerical values; it suffices that they be any exclusive and exhaustive set of categories.

An experimenter makes N observations of the value of X and builds a histogram n = (n_1, ..., n_S), where n_i is the number of times the outcome was x_i. The experimenter considers the frequencies f = (f_1, ..., f_S) = (n_1/N, ..., n_S/N) as an estimation of the true underlying probability distribution q. If the measurements are taken independently, the probability of measuring f given that the data are sorted according to q is equal to the probability of observing each x_i a number n_i of times, that is,

  p(f|q) = N! Π_i q_i^{n_i}/n_i! = [N!/Π_i (N f_i)!] exp(N Σ_i f_i ln q_i).   (1)

However, the knowledge the experimenter has at hand is f, not q. He or she may therefore wonder what is the probability that the true distribution be q, given that the outcome of the experiment was f. This means to evaluate a probability density P(q|f), whose independent variable q runs over all the possible distributions of the data. That is, all vectors in R^S such that

  Σ_i q_i = 1,
  0 ≤ q_i ≤ 1, for all i.   (2)

The set of all q obeying Eqs. (2) constitutes the domain D where P(q|f) is defined. It is a finite portion of an (S-1)-dimensional plane embedded in R^S, and is normal to the vector (1, 1, ..., 1).

Notice that since each f_i is the ratio of two natural numbers, the set of possible frequencies f is discrete. The domain D, on the contrary, contains a continuum of distributions q. Consequently, p(f|q) is a probability, whereas P(q|f) is a density.

Bayes' rule states that

  P(q|f) = p(f|q) P(q) / p(f),   (3)

where P(q) is the prior probability distribution for q, and

  p(f) = ∫_D p(f|q) P(q) d^S q.   (4)

Here, d^S q is a volume element in D.

The prior P(q) contains all additional pieces of knowledge about q, apart from the experimental data. Here, the assumption is made that there is no a priori knowledge. However, it turns out to be crucial to specify what it is that is not known [5]. A prior that is uniform over D, as was used by Wolpert and Wolf [3], is certainly not uniform over any non-linear function of q, for example the log-likelihood. Thus, not knowing anything about q implies knowing something about ln q, which in turn may result in awkward scaling properties. In this work, the power prior

  P_β(q) = Π_{i=1}^S q_i^{β-1} / Z_β,   (5)

is repeatedly used, with Z_β = √S [Γ(β)]^S / Γ(Sβ) (notice that when β → 0, Z_β diverges as S√S β^{1-S}). However, as was shown in [5], choosing any of these priors results in a surprisingly peaked a priori distribution of the possible entropies. Hence, the choice of the prior is a delicate issue and, in any particular application, it should be made carefully. Here, no attempt will be made to instruct on the way such a choice should be made, but since the results that follow are strongly grounded on Bayesian inference, "their validity is, at most, as good as the prior" [3].
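Combining the likelihood (1) with the power prior (5) through Bayes' rule (3) yields a Dirichlet posterior over q with parameters N f_i + β; this is Eq. (12) below. A minimal sketch, assuming scipy is available (the numerical values are arbitrary):

```python
import numpy as np
from scipy.stats import multinomial, dirichlet

counts = np.array([22, 10, 8])        # histogram n_i
N = counts.sum()
f = counts / N                        # frequency count f_i
beta = 1.0                            # uniform prior over D, Eq. (5)

likelihood = multinomial.pmf(counts, n=N, p=f)   # p(f|q) at q = f, Eq. (1)
posterior = dirichlet(N * f + beta)              # P(q|f), Eq. (12) below

print(likelihood)
print(posterior.mean())               # (N f_i + beta)/(N + S beta), Eq. (14)
```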
Replacing Eqs. (1) and (4) in Eq. (3),

  P(q|f) = exp[-N D(f, q)] P(q) / Z,   (6)

where D is the Kullback-Leibler divergence between f and q,

  D(f, q) = Σ_i f_i ln(f_i / q_i),   (7)

which quantifies the mean information for discriminating in favor of f against q, given the data [4]. The function Z reads

  Z = ∫_D d^S q P(q) exp[-N D(f, q)].   (8)

In the remainder of the section, the properties of P(q|f) are studied for the particular P_β(q) defined in Eq. (5). In doing so, the integral

  ∫_D Π_{i=1}^S q_i^{m_i} d^S q = √S Π_i Γ(m_i + 1) / Γ(S + Σ_i m_i)   (9)

is frequently encountered. Equation (9) was first derived in [3], and an alternative proof may be found in the Appendix.

For the priors in Eq. (5), the function Z of Eq. (8) may be calculated analytically, and it reads

  Z = exp[N H(f)] √S Π_{j=1}^S Γ(N f_j + β) / Γ(N + Sβ),   (10)

where H is the entropy of a distribution,

  H(f) = -Σ_{i=1}^S f_i ln f_i.   (11)

Thus, replacing Eq. (10) in Eq. (6),

  P(q|f) = [Γ(N + Sβ)/√S] Π_i q_i^{N f_i + β - 1} / Γ(N f_i + β).   (12)

The most probable q^M = (q_1^M, ..., q_S^M) is obtained by maximizing Eq. (12) under the normalization constraint. The result is

  q_i^M = (N f_i + β - 1) / [N + S(β - 1)].   (13)

Thus, if P(q) is uniform in D (β = 1), then the most probable q is f. With the maximum-likelihood prior (β → 0), the most probable q is shifted from f towards lower counts. The Krichevsky-Trofimov estimator [8] (β = 1/2) and the Schürmann-Grassberger estimator [9] (β = 1/S) lie in between.

Using Eq. (9), the expectation value of each component q_i may be calculated,

  ⟨q_i⟩ = (N f_i + β) / (N + Sβ).   (14)

For the uniform prior β = 1, this equation reduces to Laplace's estimator of probabilities, first introduced by Laplace in his Essay on probabilities.
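The two estimators just derived are straightforward to tabulate. A minimal sketch (illustrative values, not from the paper; note that the mode formula (13) only applies while N f_i + β ≥ 1 for all i, otherwise the maximum sits on the border of D):

```python
import numpy as np

def mode_and_mean(counts, beta):
    n = np.asarray(counts, dtype=float)
    N, S = n.sum(), n.size
    mode = (n + beta - 1) / (N + S * (beta - 1))   # Eq. (13)
    mean = (n + beta) / (N + S * beta)             # Eq. (14)
    return mode, mean

counts = [5, 3, 2]                                 # N = 10 samples, S = 3
for label, beta in [("uniform (Laplace)", 1.0),
                    ("Krichevsky-Trofimov", 0.5),
                    ("Schuermann-Grassberger", 1.0 / 3)]:
    mode, mean = mode_and_mean(counts, beta)
    print(label, mode, mean)
```

The printout makes the shift described above concrete: the smaller β, the closer the mode moves to the raw frequencies, while the mean is always pulled towards the flat distribution 1/S.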
In figure 1 the difference between ⟨q_i⟩ and the frequency count f_i is shown, for β = 1. It is seen that when f_i is smaller than 1/S, ⟨q_i⟩ is larger than f_i. On the other hand, if f_i > 1/S, then ⟨q_i⟩ < f_i. That is, the mean value of q_i is displaced from the frequency count so as to approach the flat distribution 1/S. Of course, the larger the number of samples N, the smaller the effect. Changing the value of β is equivalent to re-scaling the vertical axis of figure 1.

FIG. 1: Difference between ⟨q_i⟩ and f_i, as a function of f_i. The value of β has been set to 1. The three lines correspond to N = 3, 6 and 30. Here, X may take 3 values (S = 3). When f_i < 1/3, the expectation value of q_i is larger than the measured frequency f_i. As N increases, the effect becomes less important.

Typically, one wants to make a guess about the true q. Here, two possible estimators have been calculated: the maximum q^M and the mean ⟨q⟩. By using the maximum, one is choosing the value that is most probably correct. But of course, eventually one will also make an error. If one measures the error as (q^M - q)^2, and averages it with P(q|f), its mean turns out to be larger than if one had chosen ⟨q⟩ [3]. Hence, although q^M is the estimator that most frequently gives the correct answer, if one cares for the typical size of the errors, ⟨q⟩ is a better choice.

When using ⟨q⟩ as an estimator, the covariance matrix Σ_ij may be of interest. By means of Eq. (9) it is easy to show that for i ≠ j

  Σ_ij = ⟨(q_i - ⟨q_i⟩)(q_j - ⟨q_j⟩)⟩ = -(N f_i + β)(N f_j + β) / [(N + Sβ)^2 (N + Sβ + 1)]
       → -f_i f_j / N  when N ≫ S,   (15)

whereas for i = j

  Σ_ii = ⟨(q_i - ⟨q_i⟩)^2⟩ = (N f_i + β)[N(1 - f_i) + β(S - 1)] / [(N + Sβ)^2 (N + Sβ + 1)]
       → f_i (1 - f_i) / N  when N ≫ S.   (16)

The negative sign in Eq. (15) derives from the normalization condition: since the sum of all q_i is fixed to unity, if one of them surpasses its mean, it is to be expected that some other component will lie below its own. In contrast, Eq. (16) shows that Σ_ii is always positive.

The expectation value of q, Eq. (14), together with the covariance matrix, Eqs. (15) and (16), are useful to give the Gaussian approximation to P(q|f), centered at its mean:

  P(q|f) = K exp[-(1/2) (q - ⟨q⟩)^t Σ̃^{-1} (q - ⟨q⟩)],   (17)

where the superscript t means transposed, and K is a normalization constant. Equation (17) is only defined in the plane containing D, normal to the vector (1, 1, ..., 1). Actually, Σ does not have an inverse in the entire space R^S, since the direction (1, 1, ..., 1) is one of its eigenvectors, with eigenvalue equal to zero. However, Σ being a symmetric matrix, it can be diagonalized by an orthogonal basis. Hence, the S - 1 remaining eigenvectors lie in the plane containing D. The restriction of Σ to that subspace is Σ̃, and its inverse is the matrix appearing in the exponent of Eq. (17).
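A minimal numerical check of this structure (not from the paper): build Σ from Eqs. (15)-(16) and confirm that (1, ..., 1) is a null eigenvector, so that only the restriction Σ̃ to the plane of D can be inverted.

```python
import numpy as np

def covariance(f, N, beta):
    f = np.asarray(f, dtype=float)
    S = f.size
    a = N * f + beta
    A = N + S * beta
    Sigma = -np.outer(a, a) / (A**2 * (A + 1))                # Eq. (15)
    np.fill_diagonal(Sigma, a * (A - a) / (A**2 * (A + 1)))   # Eq. (16)
    return Sigma

Sigma = covariance([0.5, 0.3, 0.2], N=100, beta=1.0)
print(Sigma @ np.ones(3))          # ~0: the direction (1,1,1) has eigenvalue 0
print(np.linalg.eigvalsh(Sigma))   # one zero eigenvalue, the rest of order 1/N
```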
In order to normalize the approximation (17) an integral of a Gaussian function in D is needed. This is certainly not an easy task. If, however, one can assume that the distribution is sufficiently peaked, so that P(q|f) ≈ 0 for q on the border of D, then the domain can be extended to the whole plane normal to (1, 1, ..., 1). In that case, K^{-1} = √(Π_j 2π λ_j), where the λ_j are the S - 1 eigenvalues of Σ̃. While the calculation of all the λ_j is a difficult problem, it is quite straightforward to show that when N ≫ S, all the λ_j are proportional to 1/N. Therefore, the square root of each eigenvalue is a useful measure of the width of P(q|f) in the direction of its eigenvector.

However, the Gaussian approximation (17) is not useful for other purposes, as for instance calculating mean values, since it lacks analytical expressions such as (9). As a consequence, in what follows, the full Eq. (12) is used.

Equation (9) allows the evaluation of all moments of P(q_i|f). It follows that

  ⟨q_i^k⟩ = Γ(N f_i + k + β) Γ(N + Sβ) / [Γ(N f_i + β) Γ(N + Sβ + k)].   (18)

Since the moments are the coefficients of the Taylor expansion of the Fourier transform of a distribution, the single-component distribution reads

  P(q_i|f) = P(q_i|f_i) = q_i^{N f_i + β - 1} (1 - q_i)^{N(1 - f_i) + β(S - 1) - 1} / B[N f_i + β, N(1 - f_i) + β(S - 1)],   (19)

where B(x, y) = Γ(x)Γ(y)/Γ(x + y). Figure 2 displays the distribution P(q_i|f_i) for three different values of N, and β = 1. In all cases, when N is large, the distribution is symmetrical, and reaches its maximum value at q_i = f_i = 1/3. In fact, it may be shown analytically that when N ≫ 1,

  lim_{N≫1} P(q_i|f_i) = (2πσ^2)^{-1/2} exp[-(q_i - f_i)^2 / 2σ^2],   (20)

where σ = [f_i(1 - f_i)/N]^{1/2}. That is, the distribution tends to a Gaussian function centered at the experimental frequency, and with a dispersion that diminishes with the square root of the number of samples. Notice that in this limit, P(q|f) does not depend on β.

It may be seen in Fig. 2 that for smaller values of N, the distribution is no longer symmetrical. In fact, since S = 2 and f_1 = 1/3 < 1/S, the tail of P(q_1|f_1) extends to the right, resulting in a positive ⟨q_i⟩ - f_i, as predicted by equation (18).

III. THE INVERSE NUMBER OF SAMPLES AS AN EFFECTIVE TEMPERATURE

Equation (6) states that P(q|f) is completely analogous to a Gibbs distribution, where the number of samples N plays the role of the inverse of the temperature, D(f, q) is the equivalent of the energy of the state q, and P(q) is the density of states. This analogy was first pointed out in the context of machine learning [6], and has since been used several times in learning theory (see for example [7]). In those cases, when fluctuations were neglected, the probability distribution under study had the form of Eq. (6). In the present context, no approximations are needed to write down Eq. (6).

The exponential factor in (6) depends on q and f only through the combination D(f, q), diminishing exponentially as the divergence between the two distributions grows. Its maximum is attained when D = 0. It can be shown [4] that for any f and q, D(f, q) ≥ 0, and the equality holds only when f = q.

Defining the thermodynamic potential

  F = -ln Z,   (21)

it follows that

  ⟨D⟩ = ∂F/∂N,   (22)

  σ_D^2 = ⟨D^2⟩ - ⟨D⟩^2 = -∂^2 F/∂N^2,   (23)

where the mean values ⟨·⟩ are defined by ∫_D (·) P(q|f) d^S q. For example, when the prior is given by Eq. (5),

  ⟨D⟩ = Ψ(N + Sβ) - Σ_i f_i Ψ(N f_i + β) - H(f),   (24)

where Ψ(x) = d ln Γ(x)/dx is the Digamma function [10]. It is easy to show that

  lim_{N≫S} ⟨D⟩ = (S - 1)/2N + O(1/N^2).   (25)

Here, both N and N f_i have been supposed large, for all i. Since f_i is of the order of 1/S, the above limit holds when N ≫ S. Equation (25) states that for a large number of samples, the expected value of the divergence between the experimental frequencies and the true distribution does not depend on the measured f. It grows linearly with the number of categories, and decreases as 1/N.

Accordingly,

  σ_D^2 = Σ_{i=1}^S f_i^2 Ψ1(N f_i + β) - Ψ1(N + Sβ),   (26)

where Ψ1(x) = dΨ(x)/dx is the first Polygamma function [10]. Taking the limit of a large number of samples,

  lim_{N≫S} σ_D^2 = (S - 1)/2N^2 + O(1/N^3).   (27)

In the limit N ≫ S, the mean quadratic dispersion does not depend on the measured f_i.
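Eqs. (24)-(27) are easy to evaluate with standard special-function libraries. A minimal sketch, assuming scipy; it compares the exact ⟨D⟩ and σ_D^2 with their N ≫ S asymptotics:

```python
import numpy as np
from scipy.special import digamma, polygamma

def divergence_stats(f, N, beta=1.0):
    f = np.asarray(f, dtype=float)
    S = f.size
    # <D> = Psi(N+S*beta) - sum_i f_i Psi(N f_i + beta) - H(f),  Eq. (24)
    mean_D = (digamma(N + S * beta)
              - np.sum(f * digamma(N * f + beta))
              + np.sum(f * np.log(f)))
    # sigma_D^2 = sum_i f_i^2 Psi1(N f_i + beta) - Psi1(N + S*beta),  Eq. (26)
    var_D = np.sum(f**2 * polygamma(1, N * f + beta)) - polygamma(1, N + S * beta)
    return mean_D, var_D

f = np.array([0.5, 0.3, 0.2])
S = f.size
for N in (100, 1000, 10000):
    mean_D, var_D = divergence_stats(f, N)
    print(N, mean_D, (S - 1) / (2 * N),        # Eq. (25)
          var_D, (S - 1) / (2 * N**2))         # Eq. (27)
```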
(29) .b ab f (1 f ) a i − i when N S. (35) X → N ≫ Since q is unknown, an interesting guess for W(q) is its Bayesianestimation Summarizing, to first order in 1/N, W W (f)+ (36) W = W(q)P(qf), (30) h i ≈ h i ZD | S ∂W β(1 Sfi) + − + whichhastheappealingpropertyofminimizingthemean i=1 ∂qi (cid:12)f N squareerror[3]. Thezeroorderguessfor W isW(f). In X (cid:12) what follows,a systematic method to imhproive this value 1 S ∂2W(cid:12)(cid:12) fi(1 fi) + − is derived. 2 ∂q2 N − i=1 i (cid:12)f Intheprevioussectiontheexpectationvalueofthedi- X (cid:12) vergencebetweenthetrueandthemeasureddistribution S ∂2(cid:12)(cid:12)W fifj . was calculated,as wellas the size of the fluctuations, for − ∂q ∂q N i=1 j<i i j(cid:12)f thepriorsinEq. (5). Asthenumberofsamplesincreases, XX (cid:12) (cid:12) boththeexpecteddivergenceandthefluctuationsdimin- This general formula allows the ca(cid:12)lculation of the first ish as 1/N. Since a smalldivergencemeans that the two correction of the expectation value of an arbitrary func- distributionsarenecessarilyverysimilar,onlytheqthat tion W(q), whenever the prior is given by Eq. (5). are very near f have a non vanishing probability—for D Now, consider the more general case of an arbitrary sufficiently small, this argument holds for any definition prior. If P(q) is not given by Eq. (5), then one can still of similarity. proceed as above, but replacing W(q) by the product As a consequence, it is reasonable to expand W(q) in W(q)P(q), and setting β =1. its Taylor series in the neighborhood of f. Hence, Eq. (30) reads V. EXAMPLES ∞ S k 1 ∂ W = (q f ) W . (31) h i * k! i− i ∂qi! |f+ Here, the expansion(36) is applied to a few particular Xk=0 Xi=1 cases. Wolpert and Wolf [3] have already calculated the Since P(qf) decreases dramatically as q departs from firsttwoexamplesexactly(Subsect. VAandVB),inthe f, the high|er order terms (large k) in Eq. (31) should particularcaseofβ =1. Theirresults,onceexpandedup become negligible, at least, for large N. to first order in 1/N are now compared to Eq. (36), for In the first place, the mean values of Eq. (31) are verification. The advantage of Eq. (36) is that, in con- evaluated for the special case of the power law priors. trast to Wolpert and Wolf’s approach, it applies to any This involves, basically, the computation of integrals in function W. The counterpart, of course, is that it gives of ΠS (q f )ki, for a set of non negative indexes nomorethanthefirstcorrectionto W . SubsectionVC (Dk ,k ,.i.=.k1 )it−hatisum up to K. This can be done using deals with the calculation of momenhts.i i 2 S Eq. (9). Of course, the term k = 0—that is, the raw guess–doesnotdependonN. Itmaybeshownthatonly A. The mean value of the entropy k =1 and k =2 are proportional to 1/N. Specifically, β(1 Sfi) Inthe firstplace,the functionW(q)is takentobe the q f = − h i− ii N +Sβ entropy of the distribution q, defined in Eq. (11), for H β(1−Sfi), when N S. (32) qwh=erefa.sI∂t2is /e∂asqy∂tqo v=erifyδ t/hqat, ∂wHhe/r∂eqiδ=is−K[1r+oelnneqkie]r, → N ≫ H i j − ij i ij 6 delta function: δ = 1, if i = j and δ = 0, if i = j. since the mean value in Eq. (40) involves the distribu- ij ij 6 Replacing in Eq. (36) and keeping only up to the first tion P(qf). The approach in [1], instead, uses p(f q), | | order in 1/N one arrives at while the true q is fixed. In the present approach, the mean value I can be either higher or lower than I(f). h i βS = 1 (f)+ (37) hHi − N H (cid:18) (cid:19) C. The mean value of functions of X S β 1 S 1 ln − + (1/N2). 
(38) N f − 2N O i=1 (cid:18) i(cid:19) Consider a function g : x1,...,xS that maps X { } → R the possiblevaluesofX intorealnumbers. Forexample, For the case of β = 1, this same expression is obtained if X takes numerical values, then g can be such that k by expanding the exact result, obtained in [3] g (x )=xk. For each such g, another function G: k i i D → is defined, namely G(q)= g(x )q . In the example S R i i i Nf +1 above, G is the k-moment of the distribution q. The hHi[3] = − Ni+S Φ(1)(Nfi+2)− expectatioknvalue G iseasilyPcalculatedusingEq. (36), Xi=1 h and reads h i Φ(1)(N +S+1) , (39) S βS β i G =G(f) 1 + g(x ). (42) i where Φ(1)(x) = dlnΓ(x)/dx is the Digamma function h i − N N (cid:18) (cid:19) i=1 X [10]. Inparticular,fortheg consideredabove,thisisthefirst k order correction to all moments of q. B. The mean value of the mutual information VI. NUMERICAL SIMULATIONS NowW istakentobethemutualinformationbetween two sets, as defined by Eq. (28). Replacing in Eq. (36), In this section, Eq. (36) is confronted to the result S S of numerical simulations. Once again, and just to fol- I = I(f) 1 β 1 2 + (40) h i − N low previous studies, W(q) is set equal to the mutual (cid:18) (cid:19) information. However, in contrast to what was done up S S +1 S S β f 1 2 − 1− 2 + ln ab , to now [1, 2, 3], the simulations are performed strictly 2N N f f ab (cid:18) a. .b(cid:19) within the present framework. That is, the measured X frequencyf iskeptfixed,andtheprobabilityforthetrue Where S1 and S2 are the number of elements in the sets q is evaluated. Z1 and Z2. When β = 1, Eq. (40) coincides with the The procedure to measure numerically P(qf) is now expansion up to first order in 1/N of the exact result explained. As before, X takes values in a set|of S ele- derived in [3], ments. Hence, f and q are S-dimensional vectors. The value of f is fixed. The domain is discretized into Nf +1 D I = ab Φ(1)(Nf +2) a number J of cells. Each cell corresponds to a vector h i[3] N +S1S2 ab − q that will be visited by the program. The larger the Xab h number of cells J, the better the sampling of the do- Φ(1)(N +S S +1) 1 2 main . For each one of these cells, the value of X is − D Nf +S i measured N times. The outcomes are sorted with the a. 2 Φ(1)(Nfa.+S2+1) distribution q of the actual cell. If the frequency count − N +S S − 1 2 Xa h thus obtained equals f, the counter of the selected cell Φ(1)(N +S S +1) is increased (there is counter for each cell in ). The 1 2 − comparisonbetween the frequency count and thDe (fixed) Nf.b+S1 Φ(1)(iNf +S +1) f is done with precision ǫ. The procedure is repeated M .b 1 − N +S1S2 − times (M large) in order to have enough counts. This Xb h algorithmallowsto construct a histogramfor the proba- Φ(1)(N +S1S2+1) . (41) bility that a given q generates the selected f. ∈D i Forsimplicity,intheresultsbelowthenumberoftrials The quantities f and f in Eqs. (40) and (41) are M is the same for all cells. This is equivalent to using a. .b defined as in (29). a uniform prior in (β = 1). A simulation with a non D In contrastto the resultobtainedin[1], the firstorder uniform prior can be carried out by choosing a different correctionto the mutual information does bear a depen- M for each cell. dence on the values of the individual probabilities f . Thetwoparametersthatdeterminetheprecisionofthe ab There is no conflict, however, between the two results, simulations are J and ǫ. 
If D_J is the Kullback-Leibler divergence between two neighboring q cells, whenever 1/N ≪ D_J the only vector q that produces frequency counts equal to f is q = f. That is, for N sufficiently large, the discretized system behaves as if N = ∞. Notice that for large J, two neighboring cells correspond to q and q + δq, with each δq_i ∝ 1/J^{1/(S-1)}; the Kullback-Leibler distance between them is then ≈ S/J^{2/(S-1)}. Thus, once 1/N is small compared with this spacing, the simulation starts to behave as if N were actually infinite.

On the other hand, if ε is not small enough, one mistakenly counts coincidences with f, just because the criterion used in the comparison is too coarse. In other words, a large ε allows cells q too far away from f to give rise to frequency counts equal to f. That is, the system behaves as if N were smaller than its actual value.

FIG. 2: Probability distribution P(q_1|f_1) for the case f_1 = 1/3, β = 1 and S = 2. Different curves correspond to several values of the number of samples N. The full line depicts the analytical result Eq. (19), while the dots are the numerical simulations (see Sect. VI).

The dots in figure 2 show the result of the above procedure, for a single component q_1. As observed, there is very good agreement with the full line, showing the analytical result, Eq. (12).

To evaluate the expectation value of a certain function, one simply needs to calculate the sum

  ⟨W⟩|_numerical = Σ_{cells in D} W(q) P(q|f),   (43)

using the P(q|f) obtained with the algorithm explained above. Figure 3 depicts the result for the mutual information, with β = 1. The dots represent the simulations, Eq. (43), whereas the full line shows the analytical result (40). The computational time required to evaluate P(q|f) increases exponentially with the number of dimensions S. Hence, in the present comparison it is desirable to keep S as small as possible. However, in order to define a mutual information two sets Z^1 and Z^2 are needed, with S_1 and S_2 elements each. In figure 3, S_1 = 2 and S_2 = 2, thus making a 3-dimensional domain D.

FIG. 3: Difference between the expectation value of the mutual information ⟨I⟩ and the measured I(f), as a function of the inverse number of samples 1/N. The β = 1 prior was considered. The full line represents the analytical result, Eq. (40), and the dots the simulations. In (a), f_11 = f_12 = f_21 = f_22 = 1/4, and I(f) = 0. For each cell in D, 30,000 sets of N samples have been sorted. In (b), f_11 = 0.4, f_12 = 0.1, f_21 = 0.1, and f_22 = 0.4, so I(f) = 0.192745. For each cell in D, 10,000 sets of N samples have been sorted. In both cases, each axis in q space has been divided into 20 intervals, in order to discretize D, while the parameter ε was set to 0.0125.

In (a) the selected f had no mutual information: I(f) = 0. The graph shows that the expectation value of I is positive. With the chosen parameters (see the caption of the figure), the analytical result (40) coincides exactly with the one derived by Treves and Panzeri [1], that is, ⟨I⟩ = (S_1 - 1)(S_2 - 1)/2N. Since for I(f) = 0, Eq. (40) reduces to ⟨I⟩ = (S_1 S_2 + 1 - S_1 - S_2)/2N, for some particular choices of S_1 and S_2 the two expressions may coincide.
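Since Eq. (12) identifies P(q|f) as a Dirichlet distribution, the sum (43) can also be checked by plain posterior sampling instead of the cell-counting algorithm. A minimal sketch for the setting of Fig. 3 (β = 1, S_1 = S_2 = 2); sample sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

def mutual_info(q):                           # q has shape (..., 2, 2)
    qa = q.sum(-1, keepdims=True)             # marginals q_a.
    qb = q.sum(-2, keepdims=True)             # marginals q_.b
    return np.sum(q * np.log(q / (qa * qb)), axis=(-2, -1))

f = np.array([[0.4, 0.1], [0.1, 0.4]])
N, beta, S1, S2 = 200, 1.0, 2, 2

# Sample q from P(q|f) = Dirichlet(N*f + beta), Eq. (12), and average I(q).
q = rng.dirichlet(N * f.ravel() + beta, size=200000).reshape(-1, 2, 2)
I_mc = mutual_info(q).mean()

I_eq40 = (mutual_info(f) * (1 - beta * S1 * S2 / N)
          + (S1 * S2 + 1 - S1 - S2) / (2 * N)
          + beta / N * np.log(f / (f.sum(1, keepdims=True)
                                   * f.sum(0, keepdims=True))).sum())
print(I_mc, I_eq40)   # agree up to O(1/N^2) terms plus Monte Carlo noise
```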
(36), it is possible to decide whether the term pro- captionfordetails). Inthiscase,the simulationsconfirm portional to 1/N is only a small correction to W(f) or, the phenomenon that was pointed out in the previous on the contrary, the two terms are comparable. In the section, namely, that the expectation value I may be latter case, more measurements should be carried out. h i lower than the measured I(f). Although some of the expressions presented here are ItmaybeseenthatforlargeN,allthedotsconcentrate valid for an arbitraryprior, much of the work deals with in I = I(f). This is, as pointed out before, due to the the particular case of Eq. (5). The use of a prior that is h i discretizationof . If the number of cells J is increased, essentially a linear combination of functions of the form D one needs to go to a larger N to find such a saturation. (5) has been proposed [5], specifically, to be used in the On the contrary, for smaller N, the simulated I lies inference of entropies. For this case, the partition func- h i below its theoretical value. This is a manifestation of tion should be constructed by applying the same linear the finite nature of ǫ, and the phenomenon becomes less superposition to Eq. (10), and the same holds for Eqs. evident as ǫ is lowered. (13-19). The calculation of D and σ as derivatives of D h i F isstillvalid,whereasEq. (12)shouldalsobeaveraged. The analysis of P(qf) carriedout in Sect. II, and the VII. DISCUSSION statistical mechanical|description of Sect. III are valid even for small N. The fact that D 1/N for large h i → In this work, the probability density P(qf) for the N inspires the expansion of W of Sect. IV. It should | h i true distribution q given the experimental frequencies f be clear, nevertheless, that such an expansion is only is analyzed. Such a density, it is shown, may be writ- convergent when N S. Actually, Eq. (12) is the first ≫ ten as a Gibbs distribution, where the inverse number order term in powers of S/N, and there is no reason to of samples plays the role of an effective temperature, think that the higher order terms will be negligible, if and the Kullback-Leibzig divergence between f and q is suchaconditiondoes nothold. Moreover,itisnecessary the equivalent of the energy of state q. Its study is not to have Nf 1 for all i. When N is large enough, one i ≫ onlyforacademicpurposes,buteventuallyalsopractical. can always define the number of categories S as to have In the ideal situation, it would be valuable to calculate them all well populated. But for N S this may well ≈ P(qf)whileanexperimentisbeingcarriedout,inorder not be the case. The consequences may,in fact, be quite | to know when the number of samples is already enough. dramatic. For instance, in the example of the entropy The experimenter may thus decide to give an end to the (Subsect. VA) one can explicitly see that f appears in i samplingprocesswhenthewidthofP(qf)reachessome the denominator of Eq. (37). In other words, the result | acceptable value. For example, someone interested in is meaningless if there are empty categories. measuring the public opinion prior to an election may However, when the condition N S does hold, Eq. wonder how many subjects need to be polled in order (12) may serve to draw non trivia≫l conclusions. For to have a reliable estimation of the forthcoming result. instance, it is usually supposed that limited sampling, Many times, however, experiments comes to an end be- on average,flaws the data introducing false correlations. 
By means of Eq. (36), it is possible to decide whether the term proportional to 1/N is only a small correction to W(f) or, on the contrary, the two terms are comparable. In the latter case, more measurements should be carried out.

Although some of the expressions presented here are valid for an arbitrary prior, much of the work deals with the particular case of Eq. (5). The use of a prior that is essentially a linear combination of functions of the form (5) has been proposed [5], specifically, to be used in the inference of entropies. For this case, the partition function should be constructed by applying the same linear superposition to Eq. (10), and the same holds for Eqs. (13)-(19). The calculation of ⟨D⟩ and σ_D as derivatives of F is still valid, whereas Eq. (12) should also be averaged.

The analysis of P(q|f) carried out in Sect. II, and the statistical-mechanical description of Sect. III, are valid even for small N. The fact that ⟨D⟩ falls off as 1/N for large N inspires the expansion of ⟨W⟩ of Sect. IV. It should be clear, nevertheless, that such an expansion is only convergent when N ≫ S. Actually, Eq. (36) is the first-order term in powers of S/N, and there is no reason to think that the higher-order terms will be negligible if such a condition does not hold. Moreover, it is necessary to have N f_i ≫ 1 for all i. When N is large enough, one can always define the number of categories S so as to have them all well populated. But for N ≈ S this may well not be the case. The consequences may, in fact, be quite dramatic. For instance, in the example of the entropy (Subsect. V A) one can explicitly see that f_i appears in the denominator of Eq. (37). In other words, the result is meaningless if there are empty categories.

However, when the condition N ≫ S does hold, Eq. (12) may serve to draw non-trivial conclusions. For instance, it is usually supposed that limited sampling, on average, flaws the data by introducing false correlations. This work shows this is not necessarily the case: limited sampling may sometimes, on average, lower the correlations. This is clear in the simulations of Sect. VI, where finite sampling results, in the mean, in a downwards bias of the mutual information.

Acknowledgments

I thank Ilya Nemenman for his very useful comments and suggestions. I also thank Alessandro Treves and Stefano Panzeri for a critical reading of the manuscript. This work has been carried out with a grant of Fundación Antorchas.

[1] Alessandro Treves and Stefano Panzeri, Neural Comp. 7, 399 (1995).
[2] Stefano Panzeri and Alessandro Treves, Network 7, 87 (1996).
[3] David H. Wolpert and David R. Wolf, Phys. Rev. E 52 (6), 6841 (1995).
[4] S. Kullback, "Information theory and statistics" (Dover, New York, 1968).
[5] Ilya Nemenman, Fariel Shafee and William Bialek, xxx.lanl.gov/abs/physics/0108025 (2001).
[6] H. S. Seung, H. Sompolinsky and N. Tishby, Phys. Rev. A 45 (8), 6056 (1992).
[7] William Bialek, Ilya Nemenman and Naftali Tishby, Neural Comput. 13 (11), 2409 (2001).
[8] F. Willems, Y. Shtarkov and T. Tjalkens, IEEE Trans. Inf. Thy. 41, 653 (1995).
[9] T. Schürmann and P. Grassberger, Chaos 6, 414 (1996).
[10] M. Abramowitz and I. A. Stegun (Editors), "Handbook of mathematical functions" (Dover, New York, 1972).
[11] I. S. Gradshteyn and I. M. Ryzhik, "Tables of integrals, series and products" (Academic Press, San Diego, 1994).

APPENDIX A: INTEGRATING A POWER DISTRIBUTION IN D

Here, Eq. (9) is derived. An alternative and more general line of reasoning may be found in [3]. The aim is to calculate

  I_m^S = ∫_D Π_{i=1}^S dq_i q_i^{m_i}   (A1)
        = ∫_0^1 dq_1 q_1^{m_1} ... ∫_0^1 dq_S q_S^{m_S} δ[λ_S (1 - Σ_{j=1}^S q_j)],

where λ_S is a constant ensuring that when all m_i vanish, I_0^S is the volume of D. The supra-index in I_m^S indicates the dimension of the vectors m and q.

If X can only take two values, then S = 2. In this case, [11]

  I_m^2 = ∫_0^1 dq_1 q_1^{m_1} ∫_0^1 dq_2 q_2^{m_2} δ[λ_2 (1 - q_1 - q_2)]
        = (1/λ_2) ∫_0^1 dq_1 q_1^{m_1} (1 - q_1)^{m_2}
        = (1/λ_2) m_1! m_2! / (m_1 + m_2 + 1)!.   (A2)

Now, the hypothesis is made for arbitrary S

  I_m^S = (1/λ_S) Π_{i=1}^S m_i! / (S - 1 + Σ_{j=1}^S m_j)!.   (A3)

To prove it, one proceeds by complete induction. Eq. (A3) is assumed true for a given m = (m_1, ..., m_S) and the aim is to prove it for (m_1, ..., m_{S+1}). Hence

  I^{S+1}_{(m_1,...,m_{S+1})} = ∫_D Π_{i=1}^{S+1} dq_i q_i^{m_i}
        = (1/λ_{S+1}) ∫_0^1 Π_{i=1}^{S-1} dq_i q_i^{m_i} Θ(1 - Σ_{j=1}^{S-1} q_j) ∫_0^{1 - Σ_{j=1}^{S-1} q_j} dq_S q_S^{m_S} (1 - Σ_{j=1}^S q_j)^{m_{S+1}}   (A4)
        = (λ_S/λ_{S+1}) I^S_{(m_1,...,m_{S-1}), m_S + m_{S+1} + 1} × m_S! m_{S+1}! / (m_S + m_{S+1} + 1)!   (A5)
        = (1/λ_{S+1}) Π_{i=1}^{S+1} m_i! / [(S + 1) - 1 + Σ_{j=1}^{S+1} m_j]!,   (A6)

where Θ(x) is the Heaviside step function: Θ(x) = 1 if x ≥ 0, and Θ(x) = 0 if x < 0. When passing from Eq. (A4) to Eq. (A5), use was made of the result (A2). Accordingly, (A6) derives from the inductive hypothesis (A3). Since Eq. (A6) coincides with (A3) when S is replaced by S + 1, the hypothesis (A3) is proved true.

Finally, to determine λ_S one evaluates

  I_0^S = 1 / [λ_S (S - 1)!].   (A7)
(A7) = dq qm1(1 q )m2 − λ 1 1 − 1 2 Z0 The volume of is √S/(S 1)!,ascanbe verified,once = 1 m1!m2! . (A2) again, by compDlete inductio−n. Then λ =1/√S. S λ (m +m +1)! 2 1 2