Functional Bregman Divergence and Bayesian Estimation of Distributions

B. A. Frigyik, S. Srivastava, and M. R. Gupta

B. A. Frigyik is with the Department of Mathematics, Purdue University, West Lafayette, IN 47907 (e-mail: [email protected]). S. Srivastava is with the Department of Applied Mathematics, University of Washington, Seattle, WA 98195 (e-mail: [email protected]). M. R. Gupta is with the Department of Electrical Engineering, University of Washington, Seattle, WA 98195 (e-mail: [email protected]); this author's work was supported in part by the United States Office of Naval Research.

Abstract—A class of distortions termed functional Bregman divergences is defined, which includes squared error and relative entropy. A functional Bregman divergence acts on functions or distributions, and generalizes the standard Bregman divergence for vectors and a previous pointwise Bregman divergence that was defined for functions. A recently published result showed that the mean minimizes the expected Bregman divergence. The new functional definition enables the extension of this result to the continuous case, showing that the mean minimizes the expected functional Bregman divergence over a set of functions or distributions. It is shown how this theorem applies to the Bayesian estimation of distributions. Estimation of the uniform distribution from independent and identically drawn samples is used as a case study.

Index Terms—Bregman divergence, Bayesian estimation, uniform distribution, learning.

Bregman divergences are a useful set of distortion functions that include squared error, relative entropy, logistic loss, Mahalanobis distance, and the Itakura-Saito function. Bregman divergences are popular in statistical estimation and information theory. Analysis using the concept of Bregman divergences has played a key role in recent advances in statistical learning [1]–[10], clustering [11], [12], inverse problems [13], maximum entropy estimation [14], and the applicability of the data processing theorem [15]. Recently, it was discovered that the mean is the minimizer of the expected Bregman divergence for a set of d-dimensional points [11], [16].

In this paper we define a functional Bregman divergence that applies to functions and distributions, and we show that this new definition is equivalent to the Bregman divergence applied to vectors. The functional definition generalizes a pointwise Bregman divergence that has been previously defined for measurable functions [7], [17], and thus extends the class of distortion functions that are Bregman divergences; see Section I-A.2 for an example. Most importantly, the functional definition enables one to solve functional minimization problems using standard methods from the calculus of variations; we extend the recent result on the expectation of the vector Bregman divergence [11], [16] to show that the mean minimizes the expected Bregman divergence for a set of functions or distributions. We show how this theorem links to Bayesian estimation of distributions. For distributions from the exponential family, many popular divergences, such as relative entropy, can be expressed as a (different) Bregman divergence on the exponential distribution parameters. The functional Bregman definition enables stronger results and a more general application.

In Section I we state a functional definition of the Bregman divergence and give examples for total squared difference, relative entropy, and squared bias. In later subsections, the relationship between the functional definition and previous Bregman definitions is established, and properties are noted. Then in Section II we present the main theorem: that the expectation of a set of functions minimizes the expected Bregman divergence. We discuss the application of this theorem to Bayesian estimation and, as a case study, compare different estimates for the uniform distribution given independent and identically drawn samples. Proofs are in the appendix. Readers who are not familiar with functional derivatives may find helpful our short introduction to functional derivatives [18] or the text by Gelfand and Fomin [19].

I. FUNCTIONAL BREGMAN DIVERGENCE

Let (R^d, Ω, ν) be a measure space, where ν is a Borel measure and d is a positive integer. Let φ be a real functional over the normed space L^p(ν) for 1 ≤ p ≤ ∞. Recall that the bounded linear functional δφ[f; ·] is the Fréchet derivative of φ at f ∈ L^p(ν) if

    φ[f+a] − φ[f] = Δφ[f;a] = δφ[f;a] + ε[f,a] ||a||_{L^p(ν)}    (1)

for all a ∈ L^p(ν), with ε[f,a] → 0 as ||a||_{L^p(ν)} → 0 [19].
Then, given an appropriate functional φ, a functional Bregman divergence can be defined:

Definition I.1 (Functional Definition of the Bregman Divergence). Let φ : L^p(ν) → R be a strictly convex, twice-continuously Fréchet-differentiable functional. The Bregman divergence d_φ : L^p(ν) × L^p(ν) → [0, ∞) is defined for all admissible f, g ∈ L^p(ν) as

    d_φ[f,g] = φ[f] − φ[g] − δφ[g; f−g],    (2)

where δφ[g; ·] is the Fréchet derivative of φ at g.

Here we have used the Fréchet derivative, but the definition (and the results in this paper) can easily be extended using other definitions of functional derivatives; a sample extension is given in Section I-A.3.
A. Examples

Different choices of the functional φ lead to different Bregman divergences. Illustrative examples are given for squared error, squared bias, and relative entropy. Functionals for other Bregman divergences can be derived based on these examples, from the example functions for the discrete case given in Table 1 of [16], and from the fact that φ is a strictly convex functional if it has the form φ[g] = ∫ φ̃(g(t)) dt, where φ̃ : R → R is strictly convex and g is in some well-defined vector space of functions [20].

1) Total Squared Difference: Let φ[g] = ∫ g^2 dν, where φ : L^2(ν) → R, and let g, f, a ∈ L^2(ν). Then

    φ[g+a] − φ[g] = ∫ (g+a)^2 dν − ∫ g^2 dν = 2 ∫ ga dν + ∫ a^2 dν.

Because

    ∫ a^2 dν / ||a||_{L^2(ν)} = ||a||_{L^2(ν)}^2 / ||a||_{L^2(ν)} = ||a||_{L^2(ν)} → 0

as a → 0 in L^2(ν),

    δφ[g;a] = 2 ∫ ga dν,

which is a continuous linear functional in a. To show that φ is strictly convex, we show that its second variation is strongly positive. When the second variation δ^2 φ and the third variation δ^3 φ exist, they are described by

    Δφ[f;a] = δφ[f;a] + (1/2) δ^2 φ[f;a,a] + ε[f,a] ||a||_{L^p(ν)}^2    (3)

and

    Δφ[f;a] = δφ[f;a] + (1/2) δ^2 φ[f;a,a] + (1/6) δ^3 φ[f;a,a,a] + ε[f,a] ||a||_{L^p(ν)}^3,

where ε[f,a] → 0 as ||a||_{L^p(ν)} → 0. The quadratic functional δ^2 φ[f;a,a] defined on the normed linear space L^p(ν) is strongly positive if there exists a constant k > 0 such that δ^2 φ[f;a,a] ≥ k ||a||_{L^p(ν)}^2 for all a ∈ L^p(ν). By the definition of the second Fréchet derivative,

    δ^2 φ[g;b,a] = δφ[g+b;a] − δφ[g;a] = 2 ∫ (g+b)a dν − 2 ∫ ga dν = 2 ∫ ba dν.

Thus δ^2 φ[g;b,a] is a quadratic form, where δ^2 φ is actually independent of g and strongly positive, since

    δ^2 φ[g;a,a] = 2 ∫ a^2 dν = 2 ||a||_{L^2(ν)}^2

for all a ∈ L^2(ν), which implies that φ is strictly convex and

    d_φ[f,g] = ∫ f^2 dν − ∫ g^2 dν − 2 ∫ g(f−g) dν = ∫ (f−g)^2 dν = ||f−g||_{L^2(ν)}^2.
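For intuition, definition (2) can be checked numerically for this choice of φ. The following is a minimal sketch, assuming a uniform grid discretization of L^2(ν) with ν taken to be Lebesgue measure on [0, 1]; the function names are illustrative only.

```python
import numpy as np

# Grid discretization of L^2(nu), with nu = Lebesgue measure on [0, 1].
t = np.linspace(0.0, 1.0, 10001)

def phi(g):
    """phi[g] = integral of g^2 dnu (trapezoidal approximation)."""
    return np.trapz(g**2, t)

def delta_phi(g, a):
    """Frechet derivative of phi at g in direction a: 2 * integral of g*a dnu."""
    return 2.0 * np.trapz(g * a, t)

def d_phi(f, g):
    """Functional Bregman divergence (2): phi[f] - phi[g] - delta phi[g; f - g]."""
    return phi(f) - phi(g) - delta_phi(g, f - g)

f = np.sin(2.0 * np.pi * t)
g = t**2
# d_phi[f, g] matches the total squared difference ||f - g||^2 in L^2(nu).
print(d_phi(f, g), np.trapz((f - g)**2, t))
```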
2) Squared Bias: Under definition (2), squared bias is a Bregman divergence; we have not previously seen this noted in the literature, despite the importance of minimizing bias in estimation [21].

Let φ[g] = (∫ g dν)^2, where φ : L^1(ν) → R. In this case,

    φ[g+a] − φ[g] = (∫ g dν + ∫ a dν)^2 − (∫ g dν)^2 = 2 ∫ g dν ∫ a dν + (∫ a dν)^2.    (4)

Note that 2 ∫ g dν ∫ a dν is a continuous linear functional of a on L^1(ν), and (∫ a dν)^2 ≤ ||a||_{L^1(ν)}^2, so that

    0 ≤ (∫ a dν)^2 / ||a||_{L^1(ν)} ≤ ||a||_{L^1(ν)}^2 / ||a||_{L^1(ν)} = ||a||_{L^1(ν)}.

Thus, from (4) and the definition of the Fréchet derivative,

    δφ[g;a] = 2 ∫ g dν ∫ a dν.

By the definition of the second Fréchet derivative,

    δ^2 φ[g;b,a] = δφ[g+b;a] − δφ[g;a] = 2 ∫ (g+b) dν ∫ a dν − 2 ∫ g dν ∫ a dν = 2 ∫ b dν ∫ a dν

is another quadratic form, and δ^2 φ is independent of g. Then δ^2 φ[g;a,a] is strongly positive because

    δ^2 φ[g;a,a] = 2 (∫ a dν)^2 = 2 ||a||_{L^1(ν)}^2 ≥ 0

for a ∈ L^1(ν), and thus φ is in fact strictly convex. The Bregman divergence is thus

    d_φ[f,g] = (∫ f dν)^2 − (∫ g dν)^2 − 2 ∫ g dν ∫ (f−g) dν
             = (∫ f dν)^2 + (∫ g dν)^2 − 2 ∫ g dν ∫ f dν
             = (∫ (f−g) dν)^2
             ≤ ||f−g||_{L^1(ν)}^2.
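The same numerical check works for squared bias. In the sketch below (same assumed grid and Lebesgue measure as before; names illustrative), the divergence induced by φ[g] = (∫ g dν)^2 reduces to the squared difference of the two integrals.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 10001)  # grid for nu = Lebesgue measure on [0, 1]

def phi(g):
    """phi[g] = (integral of g dnu)^2."""
    return np.trapz(g, t)**2

def delta_phi(g, a):
    """Frechet derivative: 2 * (integral of g dnu) * (integral of a dnu)."""
    return 2.0 * np.trapz(g, t) * np.trapz(a, t)

def d_phi(f, g):
    return phi(f) - phi(g) - delta_phi(g, f - g)

f, g = np.exp(-t), np.cos(t)
# d_phi[f, g] equals the squared bias (integral of (f - g) dnu)^2.
print(d_phi(f, g), np.trapz(f - g, t)**2)
```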
3) Relative Entropy of Simple Functions: Denote by S the collection of all integrable simple functions on the measure space (R^d, Ω, ν), that is, the set of functions that can be written as a finite linear combination of indicator functions, so that any g ∈ S can be expressed as

    g(x) = Σ_{i=0}^{t} α_i I_{T_i},    α_0 = 0,

where I_{T_i} is the indicator function of the set T_i, {T_i}_{i=1}^{t} is a collection of mutually disjoint sets of finite measure, and T_0 = R^d \ ∪_{i=1}^{t} T_i. We adopt the convention that T_0 is the set on which g is zero, and therefore α_i ≠ 0 if i ≠ 0.

Consider the normed vector space (S, ||·||_{L^∞(ν)}), and let W be the subset (not necessarily a vector subspace) of non-negative functions in this normed space:

    W = {g ∈ S subject to g ≥ 0}.

If g ∈ W, then

    ∫_{R^d} g ln g dν = Σ_{i=1}^{t} ∫_{T_i} α_i ln α_i dν = Σ_{i=1}^{t} α_i ln α_i ν(T_i),    (5)

since 0 ln 0 = 0. Define the functional φ on W:

    φ[g] = ∫_{R^d} g ln g dν,    g ∈ W.    (6)

The functional φ is not Fréchet-differentiable at g, because in general it cannot be guaranteed that g + h is non-negative on the set where g = 0 for all perturbing functions h in the underlying normed vector space (S, ||·||_{L^∞(ν)}) with norm smaller than any prescribed ε > 0. However, a generalized Gâteaux derivative can be defined if we limit the perturbing function h to a vector subspace. To that end, let G be the subspace of (S, ||·||_{L^∞(ν)}) defined by

    G = {f ∈ S subject to f dν ≪ g dν}.

It is straightforward to show that G is a vector space. We define the generalized Gâteaux derivative of φ at g ∈ W to be the linear operator δ_G φ[g; ·] if

    lim_{||h||_{L^∞(ν)} → 0, h ∈ G} |φ[g+h] − φ[g] − δ_G φ[g;h]| / ||h||_{L^∞(ν)} = 0.    (7)

Note that δ_G φ[g; ·] is not linear in general, but it is on the vector space G. In general, if G is the entire underlying vector space, then (7) is the Fréchet derivative, and if G is the span of only one element of the underlying vector space, then (7) is the Gâteaux derivative. Here, we have generalized the Gâteaux derivative to the present case, in which G is a subspace of the underlying vector space.

It remains to be shown that, given the functional (6), the derivative (7) exists and yields a Bregman divergence corresponding to the usual notion of relative entropy. Consider the possible solution

    δ_G φ[g;h] = ∫_{R^d} (1 + ln g) h dν,    (8)

which, coupled with (6), does yield relative entropy. It remains only to show that (8) satisfies (7). Note that

    φ[g+h] − φ[g] − δ_G φ[g;h] = ∫ (h+g) ln((h+g)/g) − h dν = ∫_E (h+g) ln((h+g)/g) − h dν,    (9)

where E is the set on which g is not zero. Because g ∈ W, there are m, M > 0 such that m ≤ g ≤ M on E. Let h ∈ G be such that ||h||_{L^∞(ν)} ≤ m; then g + h ≥ 0. Our goal is to show that the expression

    (φ[g+h] − φ[g] − δ_G φ[g;h]) / ||h||_{L^∞(ν)}    (10)

is non-negative and bounded above by a bound that goes to 0 as ||h||_{L^∞(ν)} → 0. We start by bounding the integrand from above using the inequality ln x ≤ x − 1:

    (h+g) ln((h+g)/g) − h ≤ (h+g)(h/g) − h = h^2/g.

Then, since h^2/g ≤ ||h||_{L^∞(ν)}^2 / m,

    (φ[g+h] − φ[g] − δ_G φ[g;h]) / ||h||_{L^∞(ν)} ≤ (1/||h||_{L^∞(ν)}) ∫_E (h^2/g) dν ≤ (ν(E)/m) ||h||_{L^∞(ν)}.

Since g is integrable, ν(E) < ∞, and the right-hand side goes to 0 as ||h||_{L^∞(ν)} → 0.

Next, to show that (10) is non-negative, we must prove that the integral (9) is not negative. To do so, we normalize the measure and apply Jensen's inequality. Take the first term of the integrand of (9):

    ∫_E (h+g) ln((h+g)/g) dν = ∫_E ((h+g)/g) ln((h+g)/g) g dν
     = ||g||_{L^1(ν)} ∫_E ((h+g)/g) ln((h+g)/g) (g/||g||_{L^1(ν)}) dν
     = ||g||_{L^1(ν)} ∫_E λ((h+g)/g) dν̃,

where the normalized measure dν̃ = (g/||g||_{L^1(ν)}) dν is a probability measure and λ(x) = x ln x is a convex function on (0, ∞). By Jensen's inequality, and then changing the measure back to dν,

    ||g||_{L^1(ν)} ∫_E λ((h+g)/g) dν̃ ≥ ||g||_{L^1(ν)} λ(∫_E ((h+g)/g) dν̃)
     = ||g||_{L^1(ν)} λ(||g+h||_{L^1(ν)} / ||g||_{L^1(ν)})
     = ||g+h||_{L^1(ν)} ln(||g+h||_{L^1(ν)} / ||g||_{L^1(ν)})
     ≥ ||g+h||_{L^1(ν)} (1 − ||g||_{L^1(ν)} / ||g+h||_{L^1(ν)})
     = ||g+h||_{L^1(ν)} − ||g||_{L^1(ν)} = ∫_E h dν,

where we used the fact that ln(1/x) ≥ 1 − x for all x > 0. Combining these two latest results, we find that

    ∫_E (h+g) ln((h+g)/g) dν ≥ ∫_E h dν,

so (9) is always non-negative. This fact also confirms that the resulting relative entropy d_φ[f,g] is always non-negative, because (9) is d_φ[f,g] if one sets h = f − g.

Lastly, one must show that the functional defined in (6) is strictly convex; again, we do so by showing that its second variation is strongly positive. Let f ∈ G with ||f||_{L^∞(ν)} ≤ m. Using the Taylor expansion of ln, one can express

    δ_G φ[g+f;h] − δ_G φ[g;h] = ∫_E h ln(1 + f/g) dν = ∫_E h (f/g) dν + ε[f,g] ||f||_{L^∞(ν)},

where ε[f,g] goes to 0 as ||f||_{L^∞(ν)} → 0, because

    |ε[f,g]| ≤ (ν(E) ||h||_{L^∞(ν)} / (2M^2)) ||f||_{L^∞(ν)}.

Therefore

    δ_G^2 φ[g;h,f] = ∫_E (hf/g) dν,

and

    δ_G^2 φ[g;h,h] = ∫_E (h^2/g) dν ≥ (1/M) ||h||_{L^2(ν)}^2.

Thus δ_G^2 φ[g;h,h] is strongly positive.
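Setting h = f − g in (9) shows that the resulting divergence is the generalized relative entropy ∫ f ln(f/g) − f + g dν. A minimal sketch for simple functions follows (our own discretized illustration; the cell measures, values, and names are assumptions):

```python
import numpy as np

# Simple functions: value alpha_i on cell T_i, with cell measure nu_i = nu(T_i).
nu = np.array([0.5, 0.3, 0.2])   # nu(T_i)
f = np.array([1.2, 0.8, 0.9])    # values of f on the cells (positive on E)
g = np.array([1.1, 0.9, 1.0])    # values of g on the cells

def phi(u):
    """phi[u] = integral of u ln u dnu, as in (5)-(6)."""
    return np.sum(u * np.log(u) * nu)

def delta_phi(g, h):
    """Generalized Gateaux derivative (8): integral of (1 + ln g) h dnu."""
    return np.sum((1.0 + np.log(g)) * h * nu)

d = phi(f) - phi(g) - delta_phi(g, f - g)
kl = np.sum((f * np.log(f / g) - f + g) * nu)  # generalized relative entropy
print(d, kl)  # the two values agree
```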
B. Relationship to Other Bregman Divergence Definitions

Two propositions establish the relationship of the functional Bregman divergence to other Bregman divergence definitions.

Proposition I.2 (Functional Bregman Divergence Generalizes Vector Bregman Divergence). The functional definition (2) is a generalization of the standard vector Bregman divergence

    d_φ̃(x,y) = φ̃(x) − φ̃(y) − ∇φ̃(y)^T (x−y),    (11)

where x, y ∈ R^n and φ̃ : R^n → R is strictly convex and twice differentiable.

Jones and Byrne describe a general class of divergences between functions using a pointwise formulation [7]. Csiszár specialized the pointwise formulation to a class of divergences he termed Bregman distances B_{s,ν} [17]: given a σ-finite measure space (X, Ω, ν) and non-negative measurable functions f(x) and g(x),

    B_{s,ν}(f,g) = ∫ s(f(x)) − s(g(x)) − s′(g(x))(f(x) − g(x)) dν(x).    (12)

The function s : (0, ∞) → R is constrained to be differentiable and strictly convex, and the limits lim_{x→0} s(x) and lim_{x→0} s′(x) must exist, but need not be finite. The function s plays a role similar to that of the functional φ in the functional Bregman divergence; however, s acts on the range of the functions f and g, whereas φ acts on the functions f and g themselves.

Proposition I.3 (Functional Definition Generalizes Pointwise Definition). Given a pointwise Bregman divergence as per (12), an equivalent functional Bregman divergence can be defined as per (2) if the measure ν is finite. However, given a functional Bregman divergence d_φ[f,g], there is not necessarily an equivalent pointwise Bregman divergence.
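As a concrete instance of (12), one can take s(x) = x ln x, for which the pointwise Bregman distance is again the generalized relative entropy. The sketch below is our own, with an assumed finite measure space and illustrative names; it evaluates (12) directly.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 10001)         # finite measure space: Lebesgue on [0, 1]
f = 1.0 + 0.5 * np.sin(2.0 * np.pi * t)  # non-negative measurable functions
g = np.full_like(t, 1.0)

def s(x):
    return x * np.log(x)                 # differentiable, strictly convex on (0, inf)

def s_prime(x):
    return np.log(x) + 1.0

# Pointwise Bregman distance (12), integrated over the measure space.
B = np.trapz(s(f) - s(g) - s_prime(g) * (f - g), t)
print(B)  # generalized relative entropy of f from g
```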
C. Properties of the Functional Bregman Divergence

The Bregman divergence for random variables has some well-known properties, as reviewed in [11, Appendix A]. Here, we note that the same properties hold for the functional Bregman divergence (2); complete proofs are given in [18].

1. Non-negativity: The functional Bregman divergence is non-negative: d_φ[f,g] ≥ 0 for all admissible inputs.

2. Convexity: The Bregman divergence d_φ[f,g] is always convex with respect to f.

3. Linearity: The functional Bregman divergence is linear in the sense that

    d_{(c1 φ1 + c2 φ2)}[f,g] = c1 d_{φ1}[f,g] + c2 d_{φ2}[f,g].

4. Equivalence Classes: Partition the set of strictly convex, differentiable functionals {φ} into classes such that φ1 and φ2 belong to the same class if d_{φ1}[f,g] = d_{φ2}[f,g] for all f, g ∈ A. For brevity, we denote d_{φ1}[f,g] simply by d_{φ1}, and let φ1 ∼ φ2 denote that φ1 and φ2 belong to the same class. Then ∼ is an equivalence relation: it satisfies reflexivity (d_{φ1} = d_{φ1}), symmetry (if d_{φ1} = d_{φ2}, then d_{φ2} = d_{φ1}), and transitivity (if d_{φ1} = d_{φ2} and d_{φ2} = d_{φ3}, then d_{φ1} = d_{φ3}). Further, if φ1 ∼ φ2, then they differ only by an affine transformation.

5. Linear Separation: The locus of admissible functions f ∈ L^p(ν) that are equidistant from two fixed functions g1, g2 ∈ L^p(ν) in terms of functional Bregman divergence forms a hyperplane.

6. Dual Divergence: Given a pair (g, φ), where g ∈ L^p(ν) and φ is a strictly convex, twice-continuously Fréchet-differentiable functional, the function-functional pair (G, ψ) is the Legendre transform of (g, φ) [19] if

    φ[g] = −ψ[G] + ∫ g(x) G(x) dν(x),    (13)
    δφ[g;a] = ∫ G(x) a(x) dν(x),    (14)

where ψ is a strictly convex, twice-continuously Fréchet-differentiable functional and G ∈ L^q(ν), with 1/p + 1/q = 1. Given Legendre transformation pairs f, g ∈ L^p(ν) and F, G ∈ L^q(ν),

    d_φ[f,g] = d_ψ[G,F].

7. Generalized Pythagorean Inequality: For any admissible f, g, h ∈ L^p(ν),

    d_φ[f,h] = d_φ[f,g] + d_φ[g,h] + δφ[g; f−g] − δφ[h; f−g].

II. MINIMUM EXPECTED BREGMAN DIVERGENCE

Consider two sets of functions (or distributions), M and A. Let F ∈ M be a random function with realization f, and suppose there exists a probability distribution P_F over the set M such that P_F(f) is the probability of f ∈ M. For example, consider the set of Gaussian distributions: given samples drawn independently and identically from a randomly selected Gaussian distribution N, the data imply a posterior probability P_N(N) for each possible generating realization of a Gaussian distribution N. The goal is to find the function g* ∈ A that minimizes the expected Bregman divergence between the random function F and any function g ∈ A. The following theorem shows that if the set of possible minimizers A includes E_{P_F}[F], then g* = E_{P_F}[F] minimizes the expectation of any Bregman divergence. Note that the theorem requires slightly stronger conditions on φ than the definition of the Bregman divergence (2) requires.

Theorem II.1 (Minimizer of the Expected Bregman Divergence). Let δ^2 φ[f;a,a] be strongly positive, and let φ ∈ C^3(L^1(ν); R) be a three-times continuously Fréchet-differentiable functional on L^1(ν). Let M be a set of functions that lie on a manifold M, with associated measure dM such that integration is well-defined. Suppose there is a probability distribution P_F defined over the set M, and let A be a set of functions that includes E_{P_F}[F] if it exists. Suppose the function g* minimizes the expected Bregman divergence between the random function F and any function g ∈ A, that is,

    g* = arg inf_{g ∈ A} E_{P_F}[d_φ(F,g)].

Then, if g* exists, it is given by

    g* = E_{P_F}[F].    (15)

A. Bayesian Estimation

Theorem II.1 can be applied to a set of distributions to find the Bayesian estimate of a distribution given a posterior or likelihood. For parametric distributions parameterized by θ ∈ R^n, a probability measure Λ(θ), and some risk function R(θ,ψ), ψ ∈ R^n, the Bayes estimator is defined [22] as

    θ̂ = arg inf_{ψ ∈ R^n} ∫ R(θ,ψ) dΛ(θ).    (16)

That is, the Bayes estimator minimizes an expected risk in terms of the parameters. It follows from recent results [16] that θ̂ = E[Θ] if the risk R is a Bregman divergence, where Θ is the random variable whose realization is θ; this property has been previously noted [8], [10].

The principle of Bayesian estimation can be applied to the distributions themselves rather than to the parameters:

    ĝ = arg inf_{g ∈ A} ∫_M R(f,g) P_F(f) dM,    (17)

where P_F(f) is a probability measure on the distributions f ∈ M, dM is a measure for the manifold M, and A is either the space of all distributions or a subset of it, such as the set M. When the set A includes the distribution E_{P_F}[F] and the risk function R in (17) is a functional Bregman divergence, Theorem II.1 establishes that ĝ = E_{P_F}[F].

For example, in recent work, two of the authors derived the mean class posterior distribution for each class for a Bayesian quadratic discriminant analysis classifier, and showed that the classification results were superior to parameter-based Bayesian quadratic discriminant analysis [23].

Of particular interest for estimation problems are the Bregman divergence examples given in Section I-A: total squared difference (mean squared error) is a popular risk function in regression [21]; minimizing relative entropy leads to useful theorems for large deviations and other statistical subfields [24]; and analyzing bias is a common approach to characterizing and understanding statistical learning algorithms [21].
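Theorem II.1 can be illustrated empirically. In the following Monte Carlo sketch (ours; the family of random densities and all names are assumptions made for illustration), the expected total squared difference divergence E[d_φ(F,g)] is estimated from a sample of random functions, and the pointwise sample mean attains a smaller value than a competing candidate.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 2001)

def draw_f():
    """Draw a random density on [0, 1]: a random mixture of two fixed densities."""
    w = rng.uniform(0.0, 1.0)
    return w * (2.0 * t) + (1.0 - w) * (2.0 - 2.0 * t)

sample = [draw_f() for _ in range(2000)]

def d_phi(f, g):
    """Total squared difference, a functional Bregman divergence (Section I-A.1)."""
    return np.trapz((f - g)**2, t)

def expected_div(g):
    return np.mean([d_phi(f, g) for f in sample])

g_mean = np.mean(sample, axis=0)  # estimate of E[F]
g_other = sample[0]               # an arbitrary competing candidate
print(expected_div(g_mean), expected_div(g_other))  # the mean attains the smaller value
```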
Let A be X ,X ,...,X drawnfromanunknownuniformdistribution F 1 2 n a set of functions that includes E [F] if it exists. Suppose f ∈U, the generating distribution is to be estimated. The risk PF the function g∗ minimizes the expected Bregman divergence function R is taken to be squared error or total squared error between the random function F and any function g ∈A such depending on context. that 1) BayesianParameterEstimate: Dependingonthechoice of the probability measure Λ(θ), the integral (16) may not be g∗ =arg inf E [d (F,g)]. g∈A PF φ finite; for example, using the likelihood of θ with Lebesgue measuretheintegralisnotfinite.Astandardsolutionistouse Then, if g∗ exists, it is given by agammaprioronθandLebesguemeasure.LetΘbearandom parameter with realization θ, let the gamma distribution have g∗ =EPF[F]. (15) parameters t1 and t2, and denote the maximum of the data as 6 FUNCTIONALBREGMANDIVERGENCE X =max{X ,X ,...,X }. Then a Bayesian estimate is [9],[23].Inpractice,priorsaresometimeschoseninBayesian max 1 2 n formulated [22, p. 240, 285]: estimation to tame the tail of likelihood distributions so that expectations will exist when they might otherwise be infinite E[Θ|{X ,X ,...,X },t ,t ] 1 2 n 1 2 [22]. This mathematically convenient use of priors adds esti- (cid:82) = (cid:82)X∞X∞mmaaxxθθθnn++1t1t11++11eeθ−θ−tt1212ddθθ. (18) mathltaeetrinoesanttiibmvieaasttioothnmataptmrhoeabmyleabmteicuaanlslwyaacrormnavninetenidmieibnzytatppiroriinoorrsokfinsoatownlfeoedxrgmpeeu.clAtaetnde The integrals can be expressed in terms of the chi-squared Bregman divergence between the unknown distribution and random variable I2 with v degrees of freedom: the estimated distribution, and restrict the set of distributions v that can be the minimizer to be a set for which the Bayesian E[Θ|{X ,X ,...,X },t ,t ]= 1 2 n 1 2 integral exist. Open questions are how such restrictions trade- 1 P(χ22(n+t1−1) < t2X2max). (19) off bias for reduced variance, and how to find or define an t (n+t −a) P(χ2 < 2 ) “optimal” restricted set of distributions for this estimation 2 1 2(n+t1) t2Xmax approach. Note that (16) presupposes that the best solution is also a Finally, there are some results for the standard vector uniform distribution. Bregman divergence that have not been extended here. It has 2) BayesianUniformDistributionEstimate: Ifonerestricts been shown that a standard vector Bregman divergence must theminimizerof(17)tobeauniformdistribution,then(17)is be the risk function in order for the mean to be the minimizer solved with A = U. Because the set of uniform distributions of an expected risk [16, Theorems 3 and 4]. The proof of that doesnotgenerallyincludeitsmean,TheoremII.1doesnotap- result relies heavily on the discrete nature of the underlying ply,andthusdifferentBregmandivergencesmaygivedifferent vectors,anditremainsanopenquestionastowhetherasimilar minimizers for (17). Let P be the likelihood of the data (no F result holds for the functional Bregman divergence. Another priorisassumedoverthesetU),andusetheFisherinformation result that has been shown for the vector case but remains metric ( [25]–[27]) for dM. Then the solution to (17) is an open question in the functional case is convergence in the uniform distribution on [0,21/nX ]. Using Lebesgue max probability [16, Theorem 2]. measure instead gives a similar result: [0,21/(n+1/2)X ]. max We were unable to find these estimates in the literature, and so their derivations are presented in the appendix. 
3) Unrestricted Bayesian Distribution Estimate: When the only restriction placed on the minimizer g in (17) is that g be a distribution, one can apply Theorem II.1 and solve directly for the expected distribution E_{P_F}[F]. Let P_F be the likelihood of the data (no prior is assumed over the set U), and use the Fisher information metric for dM. Solving (15), noting that the uniform probability of x is f(x) = 1/a if x ≤ a and zero otherwise, and that the likelihood of the n drawn points is (1/a)^n if a ≥ X_max and zero otherwise,

    g*(x) = [∫_{max(x, X_max)}^∞ (1/a)(1/a^n)(da/a)] / [∫_{X_max}^∞ (1/a^n)(da/a)] = n (X_max)^n / ((n+1) [max(x, X_max)]^{n+1}).    (20)
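A quick numerical check of (20) (our own sketch; names assumed): g* is flat on [0, X_max], decays with a Pareto-type tail beyond X_max, and integrates to one.

```python
import numpy as np
from scipy.integrate import quad

n, x_max = 10, 2.0

def g_star(x):
    """Unrestricted Bayesian distribution estimate (20)."""
    return n * x_max**n / ((n + 1) * max(x, x_max)**(n + 1))

# Integrate the flat part and the tail separately.
head, _ = quad(g_star, 0.0, x_max)
tail, _ = quad(g_star, x_max, np.inf)
print(head + tail)  # approximately 1.0: g* is a valid density
```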
III. FURTHER DISCUSSION AND OPEN QUESTIONS

We have defined a general Bregman divergence for functions and distributions that can provide a foundation for results in statistics, information theory, and signal processing. Theorem II.1 is important for these fields because it ties Bregman divergences to expectation. As shown in Section II-A, Theorem II.1 can be applied directly to distributions to show that Bayesian distribution estimation simplifies to expectation when the risk function is a Bregman divergence and the minimizing distribution is unrestricted.

It is common in Bayesian estimation to interpret the prior as representing some actual prior knowledge, but in fact prior knowledge often is not available or is difficult to quantify. Another approach is to use a prior to capture coarse information from the data that may be used to stabilize the estimation [9], [23]. In practice, priors are sometimes chosen in Bayesian estimation to tame the tail of likelihood distributions so that expectations will exist when they might otherwise be infinite [22]. This mathematically convenient use of priors adds estimation bias that may be unwarranted by actual prior knowledge. An alternative is to minimize an expected Bregman divergence between the unknown distribution and the estimated distribution, and to restrict the set of distributions that can be the minimizer to a set for which the Bayesian integral exists. Open questions are how such restrictions trade off bias for reduced variance, and how to find or define an "optimal" restricted set of distributions for this estimation approach.

Finally, there are some results for the standard vector Bregman divergence that have not been extended here. It has been shown that a standard vector Bregman divergence must be the risk function in order for the mean to be the minimizer of an expected risk [16, Theorems 3 and 4]. The proof of that result relies heavily on the discrete nature of the underlying vectors, and it remains an open question whether a similar result holds for the functional Bregman divergence. Another result that has been shown for the vector case but remains an open question in the functional case is convergence in probability [16, Theorem 2].

ACKNOWLEDGMENTS

This work was funded in part by the Office of Naval Research, Code 321, Grant # N00014-05-1-0843. The authors thank Inderjit Dhillon, Castedo Ellerman, and Galen Shorack for helpful discussions.

APPENDIX: PROOFS

A. Proof of Proposition I.2

We give a constructive proof that there is a corresponding functional Bregman divergence d_φ[f,g] for f, g ∈ L^1(ν) and a specific choice of φ : L^1(ν) → R, where ν = Σ_{i=1}^n δ_{c_i}. Here, δ_x denotes the Dirac measure, which concentrates all mass at x, and {c1, c2, ..., cn} is a collection of n distinct points in R^d.

For any x ∈ R^n, define φ[f] = φ̃(x1, x2, ..., xn), where f(c1) = x1, f(c2) = x2, ..., f(cn) = xn. Then the difference is

    Δφ[f;a] = φ[f+a] − φ[f]
     = φ̃((f+a)(c1), ..., (f+a)(cn)) − φ̃(x1, ..., xn)
     = φ̃(x1 + a(c1), ..., xn + a(cn)) − φ̃(x1, ..., xn).

Writing a_i as shorthand for a(c_i) and using the Taylor expansion for functions of several variables yields

    Δφ[f;a] = ∇φ̃(x1,...,xn)^T (a1,...,an) + ε[f,a] ||a||_{L^1}.

Therefore,

    δφ[f;a] = ∇φ̃(x1,...,xn)^T (a1,...,an) = ∇φ̃(x)^T a,

where x = (x1,...,xn) and a = (a1,...,an). Thus, from (3), the functional Bregman divergence definition (2) for this φ is equivalent to the standard vector Bregman divergence:

    d_φ̃[f,g] = φ[f] − φ[g] − δφ[g; f−g] = φ̃(x) − φ̃(y) − ∇φ̃(y)^T (x−y).    (21)

B. Proof of Proposition I.3

First, we give a constructive proof of the first part of the proposition by showing that, given a B_{s,ν}, there is an equivalent functional divergence d_φ. Then, the second part of the proposition is shown by example: we prove that the squared bias functional Bregman divergence given in Section I-A.2 cannot be defined as a pointwise Bregman divergence.

Note that the integral that defines B_{s,ν} is not always finite. To ensure finite B_{s,ν}, we explicitly constrain lim_{x→0} s′(x) and lim_{x→0} s(x) to be finite. From the assumption that s is strictly convex, s must be continuous on (0, ∞). Recall from the assumptions that the measure ν is finite and that the function s is differentiable on (0, ∞).

Given a B_{s,ν}, define the continuously differentiable function

    s̃(x) = { s(x),              x ≥ 0;
             −s(−x) + 2s(0),    x < 0 }.

Specify φ : L^∞(ν) → R as

    φ[f] = ∫_X s̃(f(x)) dν.

Note that if f ≥ 0,

    φ[f] = ∫_X s(f(x)) dν.

Because s̃ is continuous on R, s̃(f) ∈ L^∞(ν) whenever f ∈ L^∞(ν), so the above integrals always make sense.

It remains to be shown that δφ[f; ·] completes the equivalence when f ≥ 0. For h ∈ L^∞(ν),

    φ[f+h] − φ[f] = ∫_X s̃(f(x)+h(x)) dν − ∫_X s(f(x)) dν
     = ∫_X s̃(f(x)+h(x)) − s(f(x)) dν
     = ∫_X s̃′(f(x)) h(x) + ε(f(x),h(x)) h(x) dν
     = ∫_X s′(f(x)) h(x) + ε(f(x),h(x)) h(x) dν,

where we used the fact that

    s̃(f(x)+h(x)) = s̃(f(x)) + (s̃′(f(x)) + ε(f(x),h(x))) h(x) = s(f(x)) + (s′(f(x)) + ε(f(x),h(x))) h(x),

because f ≥ 0. On the other hand, if h(x) = 0 then ε(f(x),h(x)) = 0, and if h(x) ≠ 0 then

    |ε(f(x),h(x))| ≤ |(s̃(f(x)+h(x)) − s̃(f(x))) / h(x)| + |s′(f(x))|.

Suppose {h_n} ⊂ L^∞(ν) is such that h_n → 0. Then there is a measurable set E such that its complement is of measure 0 and h_n → 0 uniformly on E; there is some N > 0 such that for any n > N, |h_n(x)| ≤ ε for all x ∈ E. Without loss of generality, assume that there is some M > 0 such that |f(x)| ≤ M for all x ∈ E. Since s̃ is continuously differentiable, there is a K > 0 such that max{s̃′(t) subject to t ∈ [−M−ε, M+ε]} ≤ K, and by the mean value theorem

    |(s̃(f(x)+h(x)) − s̃(f(x))) / h(x)| ≤ K

for almost all x ∈ X. Then

    |ε(f(x),h(x))| ≤ 2K,

except on a set of measure 0. The fact that h(x) → 0 almost everywhere implies that |ε(f(x),h(x))| → 0 almost everywhere, and by Lebesgue's dominated convergence theorem, the corresponding integral goes to 0. As a result, the Fréchet derivative of φ is

    δφ[f;h] = ∫_X s′(f(x)) h(x) dν.    (22)

Thus the functional Bregman divergence is equivalent to the given pointwise B_{s,ν}.

We note additionally that the assumptions that f ∈ L^∞(ν) and that the measure ν is finite are necessary for this proof: counterexamples can be constructed for f ∈ L^p or ν(X) = ∞ such that the Fréchet derivative of φ does not obey (22). This concludes the first part of the proof.
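The extension s̃ and the equivalence asserted via (22) can be checked numerically. A minimal sketch (ours), assuming s(x) = x^2, whose limits at 0 are finite, and a finite measure space; all names are illustrative.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 10001)   # finite measure space: Lebesgue on [0, 1]
f = np.abs(np.sin(3.0 * t))        # non-negative functions
g = 0.5 + 0.5 * t

def s(x):
    return x**2                    # strictly convex; s(0) and s'(0) are finite

def s_tilde(x):
    """Continuously differentiable extension of s to all of R."""
    return np.where(x >= 0, s(x), -s(-x) + 2.0 * s(0))

def phi(u):
    """phi[u] = integral of s_tilde(u) dnu."""
    return np.trapz(s_tilde(u), t)

def delta_phi(u, h):
    """Frechet derivative (22): integral of s'(u) h dnu, with s'(x) = 2x."""
    return np.trapz(2.0 * u * h, t)

d_functional = phi(f) - phi(g) - delta_phi(g, f - g)
B_pointwise = np.trapz(s(f) - s(g) - 2.0 * g * (f - g), t)  # Csiszar's (12)
print(d_functional, B_pointwise)  # the two values agree
```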
To show that the squared bias functional Bregman divergence given in Section I-A.2 is an example of a functional Bregman divergence that cannot be defined as a pointwise Bregman divergence, we prove that the converse statement leads to a contradiction.

Suppose (X, Σ, ν) and (X, Σ, μ) are measure spaces, where ν is a non-zero σ-finite measure, and that there is a differentiable function f : (0, ∞) → R such that

    (∫ ξ dν)^2 = ∫ f(ξ) dμ,    (23)

where ξ ∈ L^1(ν). Let f(0) = lim_{x→0} f(x), which can be finite or infinite, and let α be any real number. Then

    ∫ f(αξ) dμ = (∫ αξ dν)^2 = α^2 (∫ ξ dν)^2 = α^2 ∫ f(ξ) dμ.

Because ν is σ-finite, there is a measurable set E such that 0 < |ν(E)| < ∞. Let X\E denote the complement of E in X. Then

    α^2 ν^2(E) = α^2 (∫ I_E dν)^2 = α^2 ∫ f(I_E) dμ
     = α^2 ∫_{X\E} f(0) dμ + α^2 ∫_E f(1) dμ
     = α^2 f(0) μ(X\E) + α^2 f(1) μ(E).

Also,

    α^2 ν^2(E) = (∫ α I_E dν)^2.

However,

    (∫ α I_E dν)^2 = ∫ f(α I_E) dμ = ∫_{X\E} f(α I_E) dμ + ∫_E f(α I_E) dμ = f(0) μ(X\E) + f(α) μ(E);

so one can conclude that

    α^2 f(0) μ(X\E) + α^2 f(1) μ(E) = f(0) μ(X\E) + f(α) μ(E).    (24)

Applying equation (23) with ξ = 0 yields

    0 = (∫ 0 dν)^2 = ∫ f(0) dμ = f(0) μ(X).

Since |ν(E)| > 0, μ(X) ≠ 0, so it must be that f(0) = 0, and (24) becomes

    α^2 ν^2(E) = α^2 f(1) μ(E) = f(α) μ(E)    for all α ∈ R.

The first equation implies that μ(E) ≠ 0; the second determines the function f completely:

    f(α) = f(1) α^2.

Then (23) becomes

    (∫ ξ dν)^2 = ∫ f(1) ξ^2 dμ.

Consider any two disjoint measurable sets E1 and E2 with finite nonzero measure, and define ξ1 = I_{E1} and ξ2 = I_{E2}. Then, with ξ = ξ1 + ξ2 and ξ1 ξ2 = I_{E1} I_{E2} = 0, equation (23) becomes

    ∫ ξ1 dν ∫ ξ2 dν = f(1) ∫ ξ1 ξ2 dμ.    (25)

This implies the following contradiction:

    ∫ ξ1 dν ∫ ξ2 dν = ν(E1) ν(E2) ≠ 0,    (26)

but

    f(1) ∫ ξ1 ξ2 dμ = 0.    (27)

C. Proof of Theorem II.1

Recall that for a functional J to have an extremum (minimum) at f = f̂, it is necessary that

    δJ[f;a] = 0  and  δ^2 J[f;a,a] ≥ 0

for f = f̂ and for all admissible functions a ∈ A. A sufficient condition for a functional J[f] to have a minimum at f = f̂ is that its first variation δJ[f;a] vanish at f = f̂ and its second variation δ^2 J[f;a,a] be strongly positive at f = f̂.

Let

    J[g] = E_{P_F}[d_φ(F,g)] = ∫_M d_φ[f,g] P_F(f) dM = ∫_M (φ[f] − φ[g] − δφ[g; f−g]) P_F(f) dM,    (28)

where (28) follows by substituting the definition of the Bregman divergence (2). Consider the increment

    ΔJ[g;a] = J[g+a] − J[g]    (29)
     = −∫_M (φ[g+a] − φ[g]) P_F(f) dM − ∫_M (δφ[g+a; f−g−a] − δφ[g; f−g]) P_F(f) dM,    (30)

where (30) follows from substituting (28) into (29). Using the definition of the differential of a functional given in (1), the first integrand in (30) can be written as

    φ[g+a] − φ[g] = δφ[g;a] + ε[g,a] ||a||_{L^1(ν)}.    (31)

Take the second integrand of (30), and subtract and add δφ[g; f−g−a]:

    δφ[g+a; f−g−a] − δφ[g; f−g]
     = δφ[g+a; f−g−a] − δφ[g; f−g−a] + δφ[g; f−g−a] − δφ[g; f−g]
     (a)= δ^2 φ[g; f−g−a, a] + ε[g,a] ||a||_{L^1(ν)} + δφ[g; f−g] − δφ[g;a] − δφ[g; f−g]
     (b)= δ^2 φ[g; f−g, a] − δ^2 φ[g; a, a] + ε[g,a] ||a||_{L^1(ν)} − δφ[g;a],    (32)

where (a) follows from the linearity of the third term, and (b) follows from the linearity of the first term. Substituting (31) and (32) into (30),

    ΔJ[g;a] = −∫_M (δ^2 φ[g; f−g, a] − δ^2 φ[g; a, a] + ε[g,a] ||a||_{L^1(ν)}) P_F(f) dM.

Note that the term δ^2 φ[g;a,a] is of order ||a||_{L^1(ν)}^2; that is, ||δ^2 φ[g;a,a]||_{L^1(ν)} ≤ m ||a||_{L^1(ν)}^2 for some constant m. Therefore,

    lim_{||a||_{L^1(ν)} → 0} ||J[g+a] − J[g] − δJ[g;a]||_{L^1(ν)} / ||a||_{L^1(ν)} = 0,

where

    δJ[g;a] = −∫_M δ^2 φ[g; f−g, a] P_F(f) dM.    (33)

For fixed a, δ^2 φ[g; ·, a] is a bounded linear functional in the second argument, so the integration and the functional can be interchanged in (33), which becomes

    δJ[g;a] = −δ^2 φ[g; ∫_M (f−g) P_F(f) dM, a].
Using the functional optimality conditions, J[g] has an extremum at g = ĝ if

    δ^2 φ[ĝ; ∫_M (f−ĝ) P_F(f) dM, a] = 0.    (34)

Set a = ∫_M (f−ĝ) P_F(f) dM in (34), and use the assumption that the quadratic functional δ^2 φ[g;a,a] is strongly positive, which implies that the above functional can be zero if and only if a = 0; that is,

    0 = ∫_M (f−ĝ) P_F(f) dM,    (35)
    ĝ = E_{P_F}[F],    (36)

where the last line holds if the expectation exists (i.e., if the measure is well-defined and the expectation is finite). Because a Bregman divergence is not necessarily convex in its second argument, it is not yet established that the above unique extremum is a minimum. To see that (36) is in fact a minimum of J[g], by the functional optimality conditions it is enough to show that δ^2 J[ĝ;a,a] is strongly positive. To show this, for b ∈ L^1(ν), consider

    δJ[g+b;a] − δJ[g;a]
     (c)= −∫_M (δ^2 φ[g+b; f−g−b, a] − δ^2 φ[g; f−g, a]) P_F(f) dM
     (d)= −∫_M (δ^2 φ[g+b; f−g−b, a] − δ^2 φ[g; f−g−b, a] + δ^2 φ[g; f−g−b, a] − δ^2 φ[g; f−g, a]) P_F(f) dM
     (e)= −∫_M (δ^3 φ[g; f−g−b, a, b] + ε[g,a,b] ||b||_{L^1(ν)} + δ^2 φ[g; f−g, a] − δ^2 φ[g; b, a] − δ^2 φ[g; f−g, a]) P_F(f) dM
     (f)= −∫_M (δ^3 φ[g; f−g, a, b] − δ^3 φ[g; b, a, b] + ε[g,a,b] ||b||_{L^1(ν)} − δ^2 φ[g; b, a]) P_F(f) dM,    (37)

where (c) follows from (33); (d) from subtracting and adding δ^2 φ[g; f−g−b, a]; (e) from the fact that the variation of the second variation of φ is the third variation of φ [28]; and (f) from the linearity of the first term and the cancellation of the third and fifth terms. Note that in (37), for fixed a, the term δ^3 φ[g;b,a,b] is of order ||b||_{L^1(ν)}^2, while the first and last terms are of order ||b||_{L^1(ν)}. Therefore,

    lim_{||b||_{L^1(ν)} → 0} ||δJ[g+b;a] − δJ[g;a] − δ^2 J[g;a,b]||_{L^1(ν)} / ||b||_{L^1(ν)} = 0,

where

    δ^2 J[g;a,b] = −∫_M δ^3 φ[g; f−g, a, b] P_F(f) dM + ∫_M δ^2 φ[g; a, b] P_F(f) dM.    (38)

Substituting b = a and g = ĝ, and interchanging integration and the continuous functional δ^3 φ in the first integral of (38), gives

    δ^2 J[ĝ;a,a] = −δ^3 φ[ĝ; ∫_M (f−ĝ) P_F(f) dM, a, a] + ∫_M δ^2 φ[ĝ; a, a] P_F(f) dM
     = ∫_M δ^2 φ[ĝ; a, a] P_F(f) dM    (39)
     ≥ ∫_M k ||a||_{L^1(ν)}^2 P_F(f) dM = k ||a||_{L^1(ν)}^2 > 0,    (40)

where (39) follows from (35), and (40) follows from the strong positivity of δ^2 φ[ĝ;a,a]. Therefore, from (40) and the functional optimality conditions, ĝ is the minimum.

D. Derivation of the Bayesian Distribution-Based Uniform Estimate Restricted to a Uniform Minimizer

Let f(x) = 1/a for all 0 ≤ x ≤ a and g(x) = 1/b for all 0 ≤ x ≤ b. Assume at first that b > a; then the total squared difference between f and g is

    ∫ (f(x) − g(x))^2 dx = a (1/a − 1/b)^2 + (b−a)(1/b)^2 = (b−a)/(ab) = |b−a|/(ab),

where the last expression does not require the assumption that b > a.

In this case, the integral (17) is over the one-dimensional manifold of uniform distributions U; a Riemannian metric can be formed by using the differential arc element to convert Lebesgue measure on the set U to a measure on the set of parameters a, so that (17) is re-formulated in terms of the parameters for ease of calculation:

    b* = arg min_{b ∈ R+} ∫_{a = X_max}^∞ (|b−a|/(ab)) (1/a^n) ||df/da||_2 da,    (41)

where 1/a^n is the likelihood of the n data points being drawn from a uniform distribution on [0, a], and the estimated distribution is uniform on [0, b*]. The differential arc element ||df/da||_2 can be calculated by expanding df/da in terms of the Haar orthonormal basis {1/√a, φ_{jk}(x)}, which forms a complete orthonormal basis for the interval 0 ≤ x ≤ a; the required norm is then equal to the norm of the basis coefficients of the orthonormal expansion:

    ||df/da||_2 = 1/a^{3/2}.    (42)
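The minimization (41) with the arc element (42) can be verified numerically. The sketch below (ours; names assumed) recovers the closed-form Lebesgue-measure solution [0, 2^{1/(n+1/2)} X_max] quoted in Section II-B.2.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

n, x_max = 10, 2.0

def objective(b):
    """Expected total squared difference in (41), with ||df/da||_2 = a^(-3/2)."""
    w = lambda a: (abs(b - a) / (a * b)) * a**(-n) * a**(-1.5)
    head, _ = quad(w, x_max, max(b, x_max))  # |b - a| has a kink at a = b
    tail, _ = quad(w, max(b, x_max), np.inf)
    return head + tail

res = minimize_scalar(objective, bounds=(x_max, 3.0 * x_max), method="bounded")
print(res.x, 2.0 ** (1.0 / (n + 0.5)) * x_max)  # numerical and closed-form minimizers agree
```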
For estimation problems, the measure determined by the Fisher information metric may be more appropriate than Lebesgue measure [25]–[27]. Then

    dM = |I(a)|^{1/2} da,    (43)

where I is the Fisher information matrix. For the one-dimensional manifold M formed by the set of scaled uniform distributions, this yields |I(a)|^{1/2} = 1/a; with this weight in place of (42), the minimizer of (41) is b* = 2^{1/n} X_max, as stated in Section II-B.2.