Bayesian Properties of Normalized Maximum Likelihood and its Fast Computation

Andrew Barron, Department of Statistics, Yale University. Email: [email protected]
Teemu Roos, Department of Computer Science, University of Helsinki. Email: [email protected]
Kazuho Watanabe, Graduate School of Information Science, Nara Institute of Science and Technology. Email: [email protected]

Abstract—The normalized maximized likelihood (NML) provides the minimax regret solution in universal data compression, gambling, and prediction, and it plays an essential role in the minimum description length (MDL) method of statistical modeling and estimation. Here we show that the normalized maximum likelihood has a Bayes-like representation as a mixture of the component models, even in finite samples, though the weights of linear combination may be both positive and negative. This representation addresses in part the relationship between MDL and Bayes modeling. This representation has the advantage of speeding the calculation of marginals and conditionals required for coding and prediction applications.

Index Terms—universal coding, universal prediction, minimax regret, Bayes mixtures

I. INTRODUCTION

For a family of probability mass (or probability density) functions $p(x;\theta)$, also denoted $p_\theta(x)$ or $p(x|\theta)$, for data $x$ in a data space $\mathcal{X}$ and a parameter $\theta$ in a parameter set $\Theta$, there is a distinguished role in information theory and statistics for the maximum likelihood measure with mass (or density) function proportional to $p(x;\hat\theta(x))$, obtained from the maximum likelihood estimator $\hat\theta(x)$ achieving the maximum likelihood value $m(x)=\max_\theta p(x;\theta)$. Let $C=\sum_x m(x)$, where the sum is replaced by an integral in the density case. For statistical models in which $\sum_x m(x)$ is finite (i.e., the maximum likelihood measure is normalizable), this maximum value $m(x)$ characterizes the exact solution, in an arbitrary sequence (non-stochastic) setting, to certain modeling tasks ranging from universal data compression, to arbitrage-free gambling, to predictive distributions with minimax regret.

Common to these modeling tasks is the problem of providing a single non-negative distribution $q(x)$ with $\sum_x q(x)=1$ with a certain minimax property. For instance, for the compression of data $x$ with codelength $\log 1/q(x)$, the codelength is to be compared to the best codelength with hindsight, $\min_\theta \log 1/p_\theta(x)$, among the codes parameterized by the family. This ideal codelength is not exactly attainable, because the maximized likelihood $m(x)$ will (except in a trivial case) have a sum that is greater than 1, so that the Kraft inequality required for unique decodability would not be satisfied by plugging in the MLE. We must work with a $q(x)$ whose sum is not greater than 1. The difference between these actual and ideal codelengths is the (pointwise) regret
$$\mathrm{regret}(q,x)=\log\frac{1}{q(x)}-\min_\theta \log\frac{1}{p_\theta(x)},$$
which of course is the same as $\mathrm{regret}(q,x)=\log\frac{m(x)}{q(x)}$. The minimax regret problem is to solve for the distribution $q^*(x)$ achieving
$$\min_q \max_{x\in\mathcal{X}} \mathrm{regret}(q,x),$$
where the minimum is taken over all probability mass functions. Beginning with Shtarkov [8], who formulated the minimax regret problem for universal data compression, it has been shown that the solution is given by the normalized maximized likelihood $q^*(x)=\mathrm{NML}(x)$ given by
$$\mathrm{NML}(x)=\frac{m(x)}{C},$$
where $C=C_{\mathrm{Shtarkov}}$ is the normalizer given by $\sum_x m(x)$. This $q^*(x)$ is an equalizer rule (achieving constant regret), showing that the minimax regret is $\log C$.
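To make these definitions concrete, the following is a minimal sketch (our own illustration, not code from the paper) that computes the maximized likelihood $m(x^n)$, the Shtarkov normalizer $C$, and $\mathrm{NML}(x^n)$ by brute force for short strings of Bernoulli trials; it relies only on the standard fact that the Bernoulli MLE is $\hat\theta=T/n$, and the function names are ours.

```python
# Minimal sketch (ours, not from the paper): maximized likelihood m(x^n), the
# Shtarkov normalizer C = sum_{x^n} m(x^n), and NML(x^n) = m(x^n)/C, by brute
# force over all 2^n Bernoulli sequences.  The Bernoulli MLE is theta_hat = T/n,
# so m(x^n) = (T/n)^T (1 - T/n)^(n-T), with the convention 0^0 = 1.
import itertools
import numpy as np

def maximized_likelihood(x):
    n, t = len(x), sum(x)
    theta_hat = t / n
    return theta_hat**t * (1 - theta_hat)**(n - t)

n = 2
sequences = list(itertools.product([0, 1], repeat=n))
m = np.array([maximized_likelihood(x) for x in sequences])
C = m.sum()                        # Shtarkov value; the minimax regret is log C
nml = m / C

print(C)                           # 2.5 for n = 2
print(dict(zip(sequences, nml)))   # {(0,0): 0.4, (0,1): 0.1, (1,0): 0.1, (1,1): 0.4}
```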
For settings in which $C=\sum_x m(x)$ is infinite, the maximized likelihood measure $m(x)$ is not normalizable and the minimax regret defined above is infinite. Nevertheless, one can identify problems of this type in which the maximized likelihood value continues to have a distinguished role. In particular, suppose the data comes in two parts $(x,x')$, $x\in\mathcal{X}$, $x'\in\mathcal{X}'$, thought of as initial and subsequent data strings. Then the maximum likelihood value $m(x,x')=\max_\theta p(x,x';\theta)$ often has a finite marginal $m_{\mathrm{init}}(x)=\sum_{x'} m(x,x')$, leading to the conditional NML distribution
$$\mathrm{condNML}(x'|x)=\frac{m(x,x')}{m_{\mathrm{init}}(x)},$$
which is non-negative and sums to 1 over $x'\in\mathcal{X}'$ for each such conditioning event $x\in\mathcal{X}$.

Bayes mixtures are used in approximation and, as we shall see, in exact representation of the maximized likelihood measure. There are reasons for this use of Bayes mixtures when studying properties of logarithmic regret in general and when studying the normalized maximum likelihood in particular. There are two traditional reasons that have an approximate nature. One is a relationship between minimax pointwise regret and minimax expected regret, for which Bayes procedures are known to play a distinguished role. The other is the established role of such mixtures in the asymptotic characterization of minimax pointwise regret.

Here we offer two more reasons for consideration of Bayes mixtures, which are based on exact representation of the normalized maximum likelihood. One is that representation by mixtures provides computational simplification of coding and prediction by NML or conditional NML. The other is that the exact representation of NML allows determination of which parametric families allow a Bayes interpretation with positive weights and which require a combination of positive and negative weights.

Before turning attention to exact representation of the NML, let us first recall the information-theoretic role in which Bayes mixtures arise. The expected regret (redundancy) in data compression is $E_{X|\theta}[\log 1/q(X)-\log 1/p(X|\theta)]$, which is a function of $\theta$ giving the expectation of the difference between the codelength based on $q$ and the optimal codelength (the expected difference being a Kullback divergence). There is a similar formulation of expected regret for the description of $X'$ given $X$, which is the risk of the statistical decision problem with loss specified by Kullback divergence.

For these two decision problems the procedures minimizing the average risk are the Bayes mixture distributions and the Bayes predictive distributions, respectively. In general admissibility theory for convex losses like Kullback divergence, the only procedures not improvable in their risk functions are Bayes procedures and certain limits of Bayes procedures with positive priors. For minimax expected redundancy, with min over $q$ and max over $\theta$, the minimax solution is characterized using the maximin average redundancy, which calls for a least favorable (capacity achieving) prior [6].

The maximum pointwise regret $\max_x \log(m(x)/q(x))$ provides an upper bound on $\max_\theta E_{X|\theta}\log(m(X)/q(X))$, as well as an upper bound on the maximum expected redundancy. It is for the max over $\theta$ problems that the minimax solution takes the form of a Bayes mixture. So it is a surprise that the max over $x$ form also has a mixture representation, as we shall see.

The other traditional role for Bayes mixtures in the study of the NML arises in asymptotics [12]. Suppose $x=(x_1,x_2,\ldots,x_n)$ is a string of outcomes from a given alphabet. Large sample approximations for smooth families of distributions show a role for sequences of prior distributions $W_n$ with densities close to Jeffreys prior $w(\theta)$, taken to be proportional to $|I(\theta)|^{1/2}$, where $I(\theta)$ is the Fisher information. Bayes mixtures of this type are asymptotically minimax for the expected regret [3], [13], and in certain exponential families Bayes mixtures are simultaneously asymptotically minimax for pointwise regret and expected regret [14], [9]. However, in non-exponential families it is problematic for Bayes mixtures to be asymptotically minimax for pointwise regret, because there are data sequences for which the empirical Fisher information (arising in the large sample Laplace approximation) does not match the Fisher information, so that Jeffreys prior fails. The work of [9], [10] overcomes this problem in the asymptotic setting by putting a Bayes mixture on a slight enlargement of the family to compensate for this difficulty. The present work motivates consideration of signed mixtures in the original family rather than enlarging the family.

We turn now to the main finite sample reasons for exploration of a Bayes representation of NML. The first is the matter of computational simplification of the representation of NML by mixtures with possibly signed weights of combination. For any coding distribution $q(x_1,x_2,\ldots,x_n)$ in general, and NML in particular, coding implementation (for which arithmetic coding is the main tool) requires computation of the sequence of conditional distributions of $x_i|x_1,\ldots,x_{i-1}$ defined by the ratios of consecutive marginals for $(x_1,\ldots,x_i)$ for $1\le i\le n$. This appears to be a very difficult task for normalized maximum likelihood, for which direct methods require sums of size up to $|\mathcal{X}|^{n-1}$. Fast methods for NML coding have been developed in specialized settings [5], yet remain intractable for most models.

In contrast, for computation of the corresponding ingredients of Bayes mixtures $\mathrm{mix}(x^n)=\int p(x^n|\theta)\,W(d\theta)$, one can make use of simplifying conditional rules for $p(x_i|x_1,\ldots,x_{i-1},\theta)$; e.g., it is equal to $p(x_i|\theta)$ in the conditionally i.i.d. case or $p(x_i|x_{i-1},\theta)$ in the first order Markov case, which multiply in providing $p(x^i|\theta)$ for each $i\le n$. So to compute $\mathrm{mix}(x_i|x_1,\ldots,x_{i-1})$ one has ready access to the computation of the ratios of consecutive marginals $\mathrm{mix}(x_1,\ldots,x_i)=\int p(x^i|\theta)\,W(d\theta)$, contingent on the ability to do the sums or integrals required by the measure $W$. Equivalently, one has a representation of the required predictive distributions $\mathrm{mix}(x_i|x_1,\ldots,x_{i-1})$ as a posterior average; e.g., in the i.i.d. case it is $\int p(x_i|\theta)\,W(d\theta|x_1,\ldots,x_{i-1})$. So Bayes mixtures permit simplified marginalization (and conditioning) compared to direct marginalization of the NML.
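The simplification just described can be seen in a few lines of code. The sketch below is ours, with illustrative grid points and weights: for a discrete, possibly signed, measure $W$ on points $\theta_k$ with i.i.d. Bernoulli components, each marginal is a sum of finitely many terms and each predictive probability is a ratio of consecutive marginals.

```python
# Sketch (ours) of mixture marginals and predictives for a discrete, possibly
# signed, prior W on grid points theta_k with i.i.d. Bernoulli components:
# mix(x_1,...,x_i) = sum_k p(x_1,...,x_i | theta_k) W_k, and the predictive
# mix(x_i | x_1,...,x_{i-1}) is a ratio of consecutive marginals.
import numpy as np

def mixture_marginal(prefix, thetas, weights):
    t, i = sum(prefix), len(prefix)
    likelihoods = thetas**t * (1 - thetas)**(i - t)   # p(prefix | theta_k), i.i.d. case
    return float(np.dot(likelihoods, weights))

def predictive(prefix, x_next, thetas, weights):
    return (mixture_marginal(list(prefix) + [x_next], thetas, weights)
            / mixture_marginal(prefix, thetas, weights))

# illustrative grid and weights (in general the weights may be negative,
# as long as the resulting marginals stay positive)
thetas = np.array([0.1, 0.5, 0.9])
weights = np.array([0.3, 0.4, 0.3])
p1 = predictive([1, 1, 0], 1, thetas, weights)
p0 = predictive([1, 1, 0], 0, thetas, weights)
print(p1, p0, p1 + p0)      # the two predictive probabilities sum to 1
```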
The purpose of the present paper is to explore, in the finite sample setting, the question of whether we can take advantage of the Bayes mixture to provide exact representation of the maximized likelihood measures. That is, the question explored is whether there is a prior measure $W$ such that exactly
$$\max_\theta p(x;\theta)=\int p(x;\theta)\,W(d\theta).$$
Likewise, for strings $x^n=(x_1,\ldots,x_n)$ we want the representation $\max_\theta p(x^n;\theta)=\int p(x^n;\theta)\,W_n(d\theta)$ for some prior measure $W_n$ that may depend on $n$. Then, to perform the marginalization required for sequential prediction and coding by maximized likelihood measures, we can obtain the marginals computationally easily as $\int p(x^i;\theta)\,W_n(d\theta)$ for $i\le n$. We point out that this computational simplicity holds just as well if $W_n$ is a signed (not necessarily non-negative) measure.

In Section II, we give a result on exact representation in the case that the family has a finitely supported sufficient statistic. In Section III we demonstrate numerical solutions in the Bernoulli trials case, using linear algebra solutions for $W_n$ as well as a Rényi divergence optimization, close to optimization of the maximum ratio of the NML and the mixture.

Finally, we emphasize that at no point are we trying to report a negative probability as a direct model of an observable variable. Negative weights of combination over unobservable parameters instead arise as ingredients in the representation and calculation of non-negative probability mass functions of observable quantities. The marginal and predictive distributions of observable outcomes all remain non-negative, as they must.

II. SIGNED PRIOR REPRESENTATION OF NML

We say that a function $\mathrm{mix}(x)=\int p(x|\theta)\,W(d\theta)$ is a signed Bayes mixture when $W$ is allowed to be a signed measure, with positive and negative parts $W_+$ and $W_-$, respectively. These signed Bayes mixtures may play a role in the representation of $\mathrm{NML}(x)$. For now, let us note that for strings $x=(x_1,x_2,\ldots,x_n)$ a signed Bayes mixture has some of the same marginalization properties as a proper Bayes mixture. The marginal for $(x_1,\ldots,x_i)$ is defined by summing out $x_{i+1}$ through $x_n$. For the components $p(x^n|\theta)$, the marginals are denoted $p(x_1,\ldots,x_i|\theta)$. These may be conveniently simple to evaluate for some choices of the component family, e.g. i.i.d. or Markov. Then the signed Bayes mixture has marginals $\mathrm{mix}(x_1,\ldots,x_i)=\int p(x_1,\ldots,x_i|\theta)\,W(d\theta)$. [Here it is being assumed that, at least for indices $i$ past some initial value, $\mathrm{mix}_+(x^i)=\int p(x^i|\theta)\,W_+(d\theta)$ and $\mathrm{mix}_-(x^i)=\int p(x^i|\theta)\,W_-(d\theta)$ are finite, so that the exchange of the order of the integral and the sum producing this marginal is valid.] Our emphasis will be on cases in which the mixture $\mathrm{mix}(x^n)$ is non-negative (that is, $\mathrm{mix}_-(x^n)\le\mathrm{mix}_+(x^n)$ for all $x^n$), and then the marginals will be non-negative as well. Accordingly, one has predictive distributions $\mathrm{mix}(x_i|x_1,\ldots,x_{i-1})$ defined as ratios of consecutive marginals, as long as the conditioning string has $\mathrm{mix}(x_1,\ldots,x_{i-1})$ finite and non-zero. It is seen then that $\mathrm{mix}(x_i|x_1,\ldots,x_{i-1})$ is a non-negative distribution which sums to 1, summing over $x_i$ in $\mathcal{X}$, for each such conditioning string. Moreover, one may formally define a possibly-signed posterior distribution such that the predictive distribution is still a posterior average; e.g., in the i.i.d. case one still has the representation $\int p(x_i|\theta)\,W(d\theta|x_1,\ldots,x_{i-1})$.

We mention that for most families the maximized likelihood $m^n(x^n)=\max_\theta p(x^n|\theta)$ has a horizon dependence property, such that the marginals $m^n(x_1,\ldots,x_i)$, defined as $\sum_{x_{i+1},\ldots,x_n} m^n(x_1,\ldots,x_i,x_{i+1},\ldots,x_n)$, remain slightly dependent on $n$ for each $i\le n$. In suitable Bayes approximations and exact representations, this horizon dependence is reflected in a (possibly-signed) prior $W_n$ depending on $n$, such that its marginals $m^n(x_1,\ldots,x_i)$ take the form $\int p(x_1,\ldots,x_i|\theta)\,W_n(d\theta)$. (Un)achievability of asymptotic minimax regret without dependency on the horizon was characterized as a conjecture in [12]. Three models within one-dimensional exponential families are exceptions to this horizon dependence [2]. It is also of interest that, as shown in [2], those horizon independent maximized likelihood measures have exact representation using horizon independent positive priors.

Any finite-valued statistic $T=T(x^n)$ has a distribution $p_T(t|\theta)=\sum_{x^n:\,T(x^n)=t} p(x^n|\theta)$. It is a sufficient statistic if there is a function $g(x^n)$ not depending on $\theta$ such that the likelihood factorizes as $p(x^n|\theta)=g(x^n)\,p_T(T(x^n)|\theta)$. If the statistic $T$ takes on $M$ values, we may regard the distribution of $T$ as a vector in the positive orthant of $\mathbb{R}^M$, with sum of coordinates equal to 1. For example, if $X_1,\ldots,X_n$ are Bernoulli($\theta$) trials then $T=\sum_{i=1}^n X_i$ is well known to be a sufficient statistic having a Binomial($n,\theta$) distribution with $M=n+1$.

The main point of the present paper is to explore the ramifications of the following simple result.
Theorem (Signed-Bayes representation of maximized likelihood): Suppose the parametric family $p(x^n|\theta)$, with $x^n$ in $\mathcal{X}^n$ and $\theta$ in $\Theta$, has a sufficient statistic $T(x^n)$ with values in a set of cardinality $M$, where $M=M_n$ may depend on $n$. Then for any subset $\Theta_M=\{\theta_1,\theta_2,\ldots,\theta_M\}$ for which the distributions of $T$ are linearly independent in $\mathbb{R}^M$, there is a possibly-signed measure $W_n$ supported on $\Theta_M$, with values $W_{1,n},W_{2,n},\ldots,W_{M,n}$, such that $m(x^n)=\max_\theta p(x^n|\theta)$ has the representation
$$m(x^n)=\int p(x^n|\theta)\,W_n(d\theta)=\sum_{k=1}^M p(x^n|\theta_k)\,W_{k,n}.$$

Proof: By sufficiency, it is enough to represent $m_T(t)=\max_\theta p_T(t|\theta)$ as a linear combination of $p_T(t|\theta_1),p_T(t|\theta_2),\ldots,p_T(t|\theta_M)$, which is possible since these are linearly independent and hence span $\mathbb{R}^M$.

Remark 1: Consequently, the task of computing the marginals (and hence conditionals) of the maximized likelihood needed for minimax pointwise redundancy codes is reduced from the seemingly hard task of summing over $\mathcal{X}^{n-i}$ to the much simpler task of computing the sum of $M$ terms
$$m^n(x_1,\ldots,x_i)=\sum_{k=1}^M p(x_1,\ldots,x_i|\theta_k)\,W_{k,n}.$$
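For Bernoulli trials the construction in the proof amounts to solving an $(n+1)\times(n+1)$ linear system. The following sketch (ours, not the authors' code; it assumes NumPy and SciPy are available) solves for possibly-signed weights on the $\sin^2$ grid used in Section III and then checks Remark 1 numerically: the marginal of a short prefix computed as a sum of $M=n+1$ terms agrees with the direct sum over all $2^{n-i}$ continuations.

```python
# Sketch (ours) of the theorem's construction and Remark 1 for Bernoulli trials.
# Solve A w = m_T, where A[t, k] = p_T(t | theta_k) is the Binomial(n, theta_k)
# pmf and m_T(t) = max_theta p_T(t | theta), then compare the M-term marginal of
# a prefix against the direct sum over all continuations.
import itertools
import numpy as np
from scipy.stats import binom

n = 8
t = np.arange(n + 1)
thetas = np.sin(t * np.pi / (2 * n)) ** 2                     # sin^2 grid, incl. 0 and 1
A = np.column_stack([binom.pmf(t, n, th) for th in thetas])   # A[t, k] = p_T(t | theta_k)
m_T = binom.pmf(t, n, t / n)                                  # maximized likelihood of T
W = np.linalg.solve(A, m_T)                                   # possibly-signed weights W_{k,n}

def m_seq(x):                                                 # m(x^n) for a full sequence
    s, l = sum(x), len(x)
    return (s / l) ** s * (1 - s / l) ** (l - s)

prefix = (1, 1, 0)
direct = sum(m_seq(prefix + tail)                             # sum over 2^(n-i) continuations
             for tail in itertools.product([0, 1], repeat=n - len(prefix)))
s, i = sum(prefix), len(prefix)
fast = np.dot(thetas ** s * (1 - thetas) ** (i - s), W)       # Remark 1: sum of M terms

print(direct, fast)          # agree up to floating-point error
print(W.sum(), m_T.sum())    # total weight equals sum_t m_T(t), cf. Remark 3 below
```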
Remark 2: Often the likelihood ratio $p(x^n|\theta)/p(x^n|\hat\theta)$ simplifies, where $\hat\theta$ is the maximum likelihood estimate. For instance, in i.i.d. exponential families it takes the form $\exp\{-nD(\hat\theta\|\theta)\}$, where $D(\theta\|\theta')$ is the relative entropy between the distributions at $\theta$ and at $\theta'$. So then, dividing through by $p(x^n|\hat\theta)$, the representation task is to find a possibly-signed measure $W_n$ such that the integral of these $e^{-nD}$ is constant for all possible values of $\hat\theta$, that is, $\int e^{-nD(\hat\theta\|\theta)}\,W_n(d\theta)=1$. The $\hat\theta$ will only depend on the sufficient statistic, so this is a simplified form of the representation.

Remark 3: Summing out $x^n$, one sees that the Shtarkov value $C_{\mathrm{Shtarkov}}=\sum_{x^n} m(x^n)$ has the representation $C_{\mathrm{Shtarkov}}=\sum_{k=1}^M W_{k,n}$. That is, the Shtarkov value matches the total signed measure of $\Theta$. When $C_{\mathrm{Shtarkov}}$ is finite, one may alternatively divide out $C_{\mathrm{Shtarkov}}$ and provide a representation $\mathrm{NML}(x^n)=\int p(x^n|\theta)\,W_n(d\theta)$ in which the possibly-signed prior has total measure 1.

Remark 4: In the Bernoulli trials case, the likelihoods are proportional to $[\theta/(1-\theta)]^T(1-\theta)^n$. The $(1-\theta)^n$ can be associated with the weights of combination. To see the linear independence required in the theorem, it is enough to note that the vectors of exponentials $(e^{\eta t}: t=1,2,\ldots,n)$ are linearly independent for any $n$ distinct values of the log odds $\eta=\log[\theta/(1-\theta)]$. The roles of $\theta$ and $1-\theta$ can be exchanged in the maximized likelihood and, correspondingly, the representation can be arranged with a prior symmetric around $\theta=1/2$. Numerical selections are studied below.

III. NUMERICAL RESULTS

We now proceed to demonstrate the discussed mixture representations.

A. A trivial example where negative weights are required

We start with a simple illustration of a case where negative weights are required. Consider a single observation of a ternary random variable $X\in\{1,2,3\}$ under a model consisting of three probability mass functions $p_1,p_2,p_3$, defined as follows:
$$p_1=\left(\tfrac12,\tfrac12,0\right),\quad p_2=\left(0,\tfrac12,\tfrac12\right),\quad p_3=\left(\tfrac27,\tfrac37,\tfrac27\right).$$
The maximum likelihood values are given by $m(x)=\max_\theta p(x;\theta)=1/2$ for all $x$, since for all $x\in\{1,2,3\}$ the maximum, $1/2$, is achieved by either $p_1$ or $p_2$ (or both). The NML distribution is therefore the uniform distribution $\mathrm{NML}(x)=(1/3,1/3,1/3)$.

An elementary solution for the weights $W_1,W_2,W_3$ such that $\mathrm{mix}(x)=\mathrm{NML}(x)$ for all $x$ yields $W=(-2/3,-2/3,7/3)$. The solution is unique, implying in particular that there is no weight vector that achieves the matching with only positive weights.
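The stated weights can be checked directly by solving the $3\times 3$ linear system $\sum_j W_j\,p_j(x)=\mathrm{NML}(x)$; a quick verification (ours) follows.

```python
# Quick check (ours) of the ternary example: solve sum_j W_j p_j(x) = NML(x).
import numpy as np

P = np.array([[1/2, 1/2, 0],      # p_1
              [0,   1/2, 1/2],    # p_2
              [2/7, 3/7, 2/7]])   # p_3
nml = np.full(3, 1/3)
W = np.linalg.solve(P.T, nml)     # rows of P.T index outcomes x, columns index p_j
print(W)                          # approximately [-0.6667, -0.6667, 2.3333] = (-2/3, -2/3, 7/3)
```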
B. Bernoulli trials: Linear equations for $W_n$ with fixed $\theta$

For $n$ Bernoulli trials, the sufficient statistic $T=\sum_{i=1}^n X_i$ takes on $M=n+1$ possible values. We discuss two alternative methods for finding the prior. First, as a direct application of linear algebra, we choose a set of $M$ fixed parameter values $\theta_1,\ldots,\theta_M$ and obtain the weights by solving a system of $M$ linear equations. By Remark 4 above, any combination of distinct $\theta$ values yields linearly independent distributions of $T$, and a signed-Bayes representation is guaranteed to exist. However, the choice of the $\theta_k$ has a strong effect on the resulting weights $W_{k,n}$.

We consider two alternative choices of $\theta_k$, $k\le n+1$: first, a uniform grid with $\theta_k=(k-1)/n$, and second, a grid with points at $\theta_k=\sin^2((k-1)\pi/(2n))$. The latter are the quantiles of the Beta(1/2,1/2) distribution (also known as the arcsine law), which is the Jeffreys prior motivated by the asymptotics discussed in the introduction.

Figure 1 shows priors representing NML obtained by solving the associated linear equations. For mass points given by $\sin^2((k-1)\pi/(2n))$, $k=1,\ldots,n+1$, the prior is nearly uniform except at the boundaries of the parameter space, where the weights are higher. For uniformly spaced mass points, the prior involves both negative and positive weights when $n\ge 10$. Without the non-negativity constraint, the requirement that the weights sum to one no longer implies a bound ($W_k\le 1$) on the magnitudes of the weights, and in fact the absolute values of the weights become very large as $n$ grows.

Fig. 1 (plots omitted). Examples of priors representing the NML distribution in the Bernoulli model with $n=5,10,20,40$. The left panels show priors for $n+1$ mass points chosen at uniform intervals in $[0,1]$; the right panels show priors for the same number of mass points at $\sin^2((k-1)\pi/(2n))$, $k=1,\ldots,n+1$. The prior weights are obtained by directly solving a set of linear equations. Negative weights are plotted in red.
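The effect of the grid on the weights, described above and illustrated in Figure 1, can be reproduced with the same linear-algebra approach. The sketch below (ours, not the paper's code) solves for the weights on both grids for a few values of $n$ and reports the most negative weight and the largest magnitude.

```python
# Sketch (ours) comparing the uniform grid and the sin^2 (arcsine-quantile) grid.
import numpy as np
from scipy.stats import binom

def nml_weights(n, thetas):
    t = np.arange(n + 1)
    A = np.column_stack([binom.pmf(t, n, th) for th in thetas])  # p_T(t | theta_k)
    m_T = binom.pmf(t, n, t / n)                                 # maximized likelihood of T
    return np.linalg.solve(A, m_T)

for n in (5, 10, 20):
    k = np.arange(n + 1)
    w_uniform = nml_weights(n, k / n)
    w_arcsine = nml_weights(n, np.sin(k * np.pi / (2 * n)) ** 2)
    print(n, w_uniform.min(), np.abs(w_uniform).max(), w_arcsine.min())

# Consistent with Figure 1, the uniform grid yields negative weights of rapidly
# growing magnitude once n is around 10 or larger, while the arcsine grid keeps
# the weights non-negative and nearly uniform away from the boundaries.
```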
C. Bernoulli trials: Divergence optimization of $\theta$ and $W_n$

For fewer than $n+1$ parameter values $\theta_k$ with non-zero prior probability, there is no guarantee that an exact representation of NML is possible. However, $\lfloor n/2\rfloor+1$ mass points should suffice when we are also free to choose the $\theta$ values, because the total number of degrees of freedom of the symmetric discrete prior, $\lfloor n/2\rfloor$, coincides with the number of equations to be satisfied by the mixture.

While solving for the required weights $W_{k,n}$ can be done in a straightforward manner using linear algebra, the same does not hold for the $\theta_k$. Inspired by the work of Watanabe and Ikeda [11], we implemented a Newton-type algorithm for minimizing
$$\frac{1}{\beta}\log\sum_{x^n}\mathrm{NML}(x^n)\left(\frac{\mathrm{NML}(x^n)}{\sum_{k=1}^K p(x^n|\theta_k)W_{k,n}}\right)^{\beta}$$
with a large value of $\beta$, under the constraint that $\sum_{k=1}^K W_{k,n}=1$, where $K$ is the number of mass points. This optimization criterion is equivalent to the Rényi divergence [7], and converges to the log of the worst-case ratio
$$\log\max_{x^n}\frac{\mathrm{NML}(x^n)}{\sum_{k=1}^K p(x^n|\theta_k)W_{k,n}}$$
as $\beta\to\infty$. In the following, we use $\beta=150$, except for $n=500$ where we use $\beta=120$ in order to avoid numerical problems. The mass points were initialized at $\sin^2((k-1)\pi/(2(K-1)))$, $k=1,\ldots,K$.

Figure 2 shows the priors obtained by optimizing the locations of the mass points, $\theta_k$, and the respective prior weights, $W_{k,n}$. The left panels show priors where the number of mass points is $\lfloor n/2\rfloor+1$, while the right panels show priors with $n+1$ mass points, which guarantees that an exact representation is possible even without optimization of the $\theta_k$. Note, however, that the divergence optimization method we use can only deal with non-negative prior weights. The obtained mixtures had Kullback-Leibler divergence $D(\mathrm{NML}\|\mathrm{mix})<10^{-7}$ and worst-case ratio $\max_{x^n}\mathrm{NML}(x^n)/\mathrm{mix}(x^n)<1+10^{-3}$ in each case.

Fig. 2 (plots omitted). Examples of priors representing the NML distribution in the Bernoulli model with $n=5,10,20,40,100,500$. Both the locations of the mass points, $\theta_k$, and the respective prior weights, $W_{k,n}$, are optimized using a Newton-type method. The left panels show priors with $\lfloor n/2\rfloor+1$ mass points; the right panels show priors with $n+1$ mass points.
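Because NML and the mixture depend on the data only through the sufficient statistic $T$, the criterion above can be evaluated as a sum over the $n+1$ values of $T$ rather than over all $2^n$ sequences. The sketch below (ours) only evaluates the objective for a given set of mass points and weights; the Newton-type optimization over $(\theta_k, W_{k,n})$ used in the paper is not reproduced here.

```python
# Sketch (ours): the Renyi-type criterion evaluated via the sufficient statistic T.
# NML_T(t) = m_T(t)/C and mix_T(t) = sum_k p_T(t | theta_k) W_k have the same
# ratio as the sequence-level quantities, so the sum over x^n collapses to a sum over t.
import numpy as np
from scipy.stats import binom

def renyi_objective(n, thetas, weights, beta=150.0):
    t = np.arange(n + 1)
    nml_T = binom.pmf(t, n, t / n)
    nml_T = nml_T / nml_T.sum()                              # normalize by the Shtarkov value
    mix_T = np.column_stack([binom.pmf(t, n, th) for th in thetas]) @ weights
    ratio = nml_T / mix_T
    return float(np.log(np.sum(nml_T * ratio ** beta)) / beta)

# illustrative evaluation at the initialization used in the text
n, K = 10, 11
k = np.arange(K)
thetas = np.sin(k * np.pi / (2 * (K - 1))) ** 2
weights = np.full(K, 1.0 / K)                                # satisfies sum_k W_k = 1
print(renyi_objective(n, thetas, weights))                   # approximates the worst-case log-ratio
```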
IV. CONCLUSIONS AND FUTURE WORK

Unlike many earlier studies that have focused on either finite-sample or asymptotic approximations of the normalized maximum likelihood (NML) distribution, the focus of the present paper is on exact representations. We showed that an exact representation of NML as a Bayes-like mixture with a possibly signed prior exists under a mild condition related to linear independence of a subset of the statistical model in consideration. We presented two techniques for finding the required signed priors in the case of Bernoulli trials.

The implications of this work are two-fold. First, from a theoretical point of view, it provides insight into the relationship between MDL and Bayesian methods by demonstrating that in some models, a finite-sample Bayes-like counterpart to NML only exists when the customary assumption that prior probabilities are non-negative is removed. This complements earlier asymptotic and approximate results. Second, from a practical point of view, a Bayes-like representation offers a computationally efficient way to extract marginal probabilities $p(x_1,\ldots,x_i)$ and conditional probabilities $p(x_i|x_1,\ldots,x_{i-1})$ for $i<n$, where $n$ is the total sample size. These probabilities are required in, for instance, data compression using arithmetic coding.

Other algorithms will be explored in the full paper, along with other families, including the truncated Poisson and the multinomial, which have an interesting relationship between supports for the prior. The full paper will also show how matching NML produces a prior with some interesting prediction properties. In particular, in Bernoulli trials, by the NML matching device, we can arrange a prior in which the posterior mean of the log odds is the same as the maximum likelihood estimate of the log odds, whenever the count of ones is neither 0 nor $n$.

REFERENCES

[1] A. R. Barron, J. Rissanen and B. Yu, "The minimum description length principle in coding and modeling," IEEE Trans. Inform. Theory, vol. 44, no. 6, pp. 2743-2760, 1998.
[2] P. Bartlett, P. Grünwald, P. Harremoës, F. Hedayati, W. Kotłowski, "Horizon-independent optimal prediction with log-loss in exponential families," arXiv:1305.4324v1, May 2013.
[3] B. Clarke & A. R. Barron, "Jeffreys prior is asymptotically least favorable under entropy risk," J. Statistical Planning and Inference, vol. 41, pp. 37-60, 1994.
[4] P. D. Grünwald, The Minimum Description Length Principle, MIT Press, 2007.
[5] P. Kontkanen & P. Myllymäki, "A linear-time algorithm for computing the multinomial stochastic complexity," Information Processing Letters, vol. 103, pp. 227-233, 2007.
[6] D. Haussler, "A general minimax result for relative entropy," IEEE Trans. Inform. Theory, vol. 43, no. 4, pp. 1276-1280, 1997.
[7] A. Rényi, "On measures of entropy and information," Proc. of the Fourth Berkeley Symp. on Math. Statist. and Prob., vol. 1, Univ. of Calif. Press, pp. 547-561, 1961.
[8] Yu. M. Shtarkov, "Universal sequential coding of single messages," Problems of Information Transmission, vol. 23, pp. 3-17, July 1988.
[9] J. Takeuchi & A. R. Barron, "Asymptotically minimax regret by Bayes mixtures," Proc. 1998 IEEE ISIT, 1998.
[10] J. Takeuchi & A. R. Barron, "Asymptotically minimax regret by Bayes mixtures for non-exponential families," Proc. 2013 IEEE ITW, pp. 204-208, 2013.
[11] K. Watanabe & S. Ikeda, "Convex formulation for nonparametric estimation of mixing distribution," Proc. 2012 WITMSE, pp. 36-39, 2012.
[12] K. Watanabe, T. Roos & P. Myllymäki, "Achievability of asymptotic minimax regret in online and batch prediction," Proc. 2013 ACML, pp. 181-196, 2013.
[13] Q. Xie & A. R. Barron, "Minimax redundancy for the class of memoryless sources," IEEE Trans. Inform. Theory, vol. 43, pp. 646-657, 1997.
[14] Q. Xie & A. R. Barron, "Asymptotic minimax regret for data compression, gambling and prediction," IEEE Trans. Inform. Theory, vol. 46, pp. 431-445, 2000.
