Unifying Divergence Minimization and Statistical Inference via Convex Duality

Yasemin Altun(1) and Alex Smola(2)

(1) Toyota Technological Institute at Chicago, Chicago, IL 60637, USA* (altun@tti-c.org)
(2) National ICT Australia, North Road, Canberra 0200 ACT, Australia

* Parts of this work were done when the author was visiting National ICT Australia.

Abstract. In this paper we unify divergence minimization and statistical inference by means of convex duality. In the process of doing so, we prove that the dual of approximate maximum entropy estimation is maximum a posteriori estimation. Moreover, our treatment leads to stability and convergence bounds for many statistical learning problems. Finally, we show how an algorithm by Zhang can be used to solve this class of optimization problems efficiently.

1 Introduction

It has become part of machine learning folklore that maximum entropy estimation and maximum likelihood are convex duals to each other. This raises the question whether maximum a posteriori estimates have similar counterparts and whether such estimates can be obtained efficiently.

Recently Dudik et al. [9] showed that a certain form of regularized maximum entropy density estimation corresponds to $\ell_1$ regularization in the dual problem. This is our starting point to develop a theory of regularized divergence minimization which aims to unify a collection of related approaches. By means of convex duality we are able to give a common treatment to

– The regularized LMS minimization methods of Arsenin and Tikhonov [21], and of Morozov [14]. There the problem of minimizing $\|x\|_2^2$ subject to $\|Ax - b\|_2^2 \le \epsilon$ is studied as a means of improving the stability of the problem $Ax = b$.
– Ruderman and Bialek [18] study a related problem where, instead of a quadratic penalty on $x$, the objective $-H(x)$ is minimized subject to $\|Ax - b\|_2^2 \le \epsilon$. In other words, the problem of solving $Ax = b$ is stabilized by finding the maximum entropy distribution which satisfies the constraint.
– The density estimation problem of [9] can be viewed as one of solving a variant of the above, namely that of minimizing $-H(x)$ subject to $\|Ax - b\|_\infty \le \epsilon$, where the constraint encodes deviations between the measured values of some moments or features and their expected values.

The problem we study can be abstractly stated as the regularized inverse problem

  \min_{x \in X} f(x) \quad \text{subject to} \quad \|Ax - b\|_B \le \epsilon,

where X and B are Banach spaces. We start by establishing a general framework of duality to solve this problem using a convex analysis tool, namely Fenchel's duality. This theory is especially useful in the most general form of our problem, where X and B are infinite dimensional, since in this case Lagrangian techniques are problematic due to differentiability issues. We apply this framework to a generalized notion of regularized divergence minimization, since a large subset of the statistical learning literature can be analyzed within this class of problems. By studying convex duality of two important classes of divergences, namely Csiszár and Bregman divergences, we show that maximum a posteriori estimation is the convex dual of approximate maximum entropy estimation. Various statistical inference methods, such as boosting, logistic regression, Gaussian Processes and others, become instances of our framework by using different entropy functions and regularization methods. Following these lines, we not only give a common treatment to these methods, but also provide directions to develop new inference techniques by investigating different entropy-regularization combinations.
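As a concrete warm-up, the following short Python sketch (not part of the paper; the matrix sizes, noise level and tolerance $\epsilon$ are illustrative assumptions) solves the first problem in the list above, the Tikhonov-regularized inverse problem $\min \|x\|_2^2$ subject to $\|Ax - b\|_2^2 \le \epsilon$, by bisecting over the coefficient of the equivalent ridge-penalized problem.

```python
# Tikhonov-regularized inverse problem: min ||x||^2  s.t. ||Ax - b||^2 <= eps.
# The constrained solution lies on the ridge path x_mu = (A^T A + mu I)^{-1} A^T b,
# and its residual grows monotonically with mu, so we bisect on mu until the
# constraint holds with (near) equality.  All numbers below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
x_true = rng.normal(size=20)
b = A @ x_true + 0.1 * rng.normal(size=50)
eps = 1.0                                   # allowed squared residual (hypothetical)

def ridge(mu):
    return np.linalg.solve(A.T @ A + mu * np.eye(A.shape[1]), A.T @ b)

def residual(mu):
    return np.sum((A @ ridge(mu) - b) ** 2)

lo, hi = 0.0, 1e6                           # bracket for the ridge coefficient
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if residual(mid) <= eps:
        lo = mid                            # feasible: try a stronger regularizer
    else:
        hi = mid
x_hat = ridge(lo)
print("||x_hat||^2 =", x_hat @ x_hat, " residual =", residual(lo))
```

Larger penalties shrink $\|x\|$ while increasing the residual, which is exactly the trade-off that the constraint level $\epsilon$ controls in the primal formulation.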
For example, working in Banach spaces, we can perform different regularizations on subsets of basis functions, which is useful in problems like structured learning where there are several distinctly different sets of basis functions.

From a regularization point of view, our approach provides a natural interpretation of the regularization coefficient $\epsilon$, which corresponds to the approximation parameter in the primal problem. Studying the concentration of empirical means, we show that a good value of $\epsilon$ is proportional to $O(1/\sqrt{m})$, where $m$ is the sample size. Noting that $\epsilon$ is generally chosen by cross validation techniques in practice, we believe our framework gives an enhanced interpretation of regularized optimization problems. We also provide unified bounds on the performance of the estimate with respect to the loss on empirical estimates as well as the loss on true statistics, which apply to arbitrary linear classes and divergences. Finally, we show that a single algorithm can efficiently optimize this large class of optimization problems with good convergence rates.

Related work. There is a large literature on analyzing various loss functions on exponential families as the convex dual of relative entropy minimization via equality constraints of the form $Ax = b$. For example, Lafferty [12] analyzes logistic regression and exponential loss as special cases of Bregman divergence minimization and proposes a family of sequential update algorithms. Similar treatments are given in [11,8]. One common property of these studies is that they investigate exact divergence minimization.

Previous work on approximate divergence minimization focused on minimizing the KL divergence such that its convex dual is penalized by $\ell_1$ and $\ell_2$ norm terms, e.g. [7]. [9] show that approximate KL divergence minimization with respect to $\|Ax - b\|_\infty \le \epsilon$ has as its convex dual $\ell_1$ norm regularized maximum likelihood. Recently [10] produced similar results for $\ell_p$ norm regularization.

In this paper, we improve over previous work by generalizing to a family of divergence functions with inequality constraints in Banach spaces. Our unified treatment of various entropy measures (including Csiszár and Amari divergences) and various normed spaces allows us to produce the cited work as special cases and to define more sophisticated regularizations via Banach space norms. Finally, we provide risk bounds for the estimates.

2 Fenchel Duality

We now give a formal definition of the class of inverse problems we solve. Denote by X and B Banach spaces and let $A : X \to B$ be a bounded linear operator between those two spaces. Here A corresponds to an "observation operator", e.g. mapping distributions into a set of moments, marginals, etc. Moreover, let $b \in B$ be the target of the estimation problem. Finally, denote by $f : X \to \mathbb{R}$ and $g : B \to \mathbb{R}$ convex functions and let $\epsilon \ge 0$.

Problem 1 (Regularized inverse). Our goal is to find $x \in X$ which solves the following convex optimization problem:

  \min_{x \in X} f(x) \quad \text{subject to} \quad \|Ax - b\|_B \le \epsilon.

Example 2 (Density estimation). Assume that $x$ is a density, $f$ is the negative Shannon-Boltzmann entropy, $b$ contains the observed values of some moments or features, $A$ is the expectation operator of those features with respect to the density $x$, and the Banach space B is $\ell_p$.

We shall see in Section 3.2 that the dual to Example 2 is a maximum a posteriori estimation problem. In cases where B and X are finite dimensional, the problem is easily solved by calculating the corresponding Lagrangian, setting its derivative to 0 and solving for $x$.
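To make the finite-dimensional case concrete, the following sketch (illustrative, not from the paper) treats Example 2 on a finite domain with exact moment constraints ($\epsilon = 0$). The Lagrangian argument reduces maximum entropy estimation to an unconstrained concave maximization over dual parameters $\phi$, with primal solution $p(t) \propto \exp(\langle \phi, \psi(t) \rangle)$; the feature map, domain size and target moments below are hypothetical.

```python
# Finite-dimensional maximum entropy via the dual: maximize
#   <phi, psi_tilde> - log sum_t exp(<phi, psi(t)>)
# and recover p(t) ∝ exp(<phi, psi(t)>), whose moments match psi_tilde.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(1)
T = 10                                   # finite domain size (illustrative)
psi = rng.normal(size=(T, 3))            # feature map psi: T -> R^3
p_star = rng.dirichlet(np.ones(T))
psi_tilde = psi.T @ p_star               # "observed" moments b = E_{p*}[psi]

def neg_dual(phi):
    return logsumexp(psi @ phi) - phi @ psi_tilde

phi_hat = minimize(neg_dual, np.zeros(3), method="BFGS").x
p_hat = np.exp(psi @ phi_hat - logsumexp(psi @ phi_hat))
print("moment mismatch:", np.abs(psi.T @ p_hat - psi_tilde).max())
```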
In the infinite dimensional case, more careful analysis is required to ensure continuity and differentiability. Convex analysis provides a powerful machinery, namely Fenchel's conjugate duality theorem, to study this problem by formulating the primal-dual space relations of convex optimization problems in our general setting. We need the following definition:

Definition 3 (Convex conjugate). Denote by X a Banach space and let $X^*$ be its dual. The convex conjugate or Legendre-Fenchel transformation of a function $f : X \to \mathbb{R}$ is $f^* : X^* \to \mathbb{R}$, defined as

  f^*(x^*) = \sup_{x \in X} \{ \langle x, x^* \rangle - f(x) \}.

We present Fenchel's theorem, where the primal problem is of the form $f(x) + g(Ax)$; Problem 1 becomes an instance of the latter for suitably defined $g$.

Theorem 4 (Fenchel Duality [3, Th. 4.4.3]). Let $g : B \to \mathbb{R}$ be a convex function on B and all other variables as above. Define $t$ and $d$ as follows:

  t = \inf_{x \in X} \{ f(x) + g(Ax) \} \quad \text{and} \quad d = \sup_{x^* \in B^*} \{ -f^*(A^* x^*) - g^*(-x^*) \}.

Assume that $f$, $g$ and $A$ satisfy one of the following constraint qualifications:

a) $0 \in \mathrm{core}(\mathrm{dom}\, g - A\, \mathrm{dom}\, f)$ and both $f$ and $g$ are lower semi-continuous (lsc);
b) $A\, \mathrm{dom}\, f \cap \mathrm{cont}\, g \ne \emptyset$.

Here $s \in \mathrm{core}(S)$ if $\bigcup_{\lambda > 0} \lambda(S - s) \supseteq X$, where X is a Banach space and $S \subseteq X$. In this case $t = d$, where the dual solution $d$ is attainable if it is finite.

We now apply Fenchel's duality theorem to convex constrained optimization problems such as Problem 1, since the dual problem is easier to solve in certain cases.

Lemma 5 (Fenchel duality with constraints). In addition to the assumptions of Theorem 4, let $b \in B$ and $\epsilon \ge 0$. Define $t$ and $d$ as follows:

  t = \inf_{x \in X} \{ f(x) \ \text{subject to} \ \|Ax - b\|_B \le \epsilon \}
  \quad \text{and} \quad
  d = \sup_{x^* \in B^*} \{ -f^*(A^* x^*) + \langle b, x^* \rangle - \epsilon \|x^*\|_{B^*} \}.

Suppose $f$ is lower semi-continuous and that, for the unit ball $\mathcal{B} := \{ \bar b \in B : \|\bar b\| \le 1 \}$, the following constraint qualification holds:

  \mathrm{core}(A\, \mathrm{dom}\, f) \cap (b + \epsilon\, \mathrm{int}(\mathcal{B})) \ne \emptyset.   (CQ)

In this case $t = d$ with dual attainment.

Proof. Define $g$ in Theorem 4 as the characteristic function on $\epsilon \mathcal{B} + b$, i.e.

  g(\bar b) = \chi_{\epsilon \mathcal{B} + b}(\bar b) = \begin{cases} 0 & \text{if } \bar b \in \epsilon \mathcal{B} + b \\ \infty & \text{otherwise.} \end{cases}   (1)

The convex conjugate of $g$ is given by

  g^*(x^*) = \sup_{\bar b} \{ \langle \bar b, x^* \rangle \ \text{subject to} \ \bar b - b \in \epsilon \mathcal{B} \}
           = \langle x^*, b \rangle + \epsilon \sup_{\bar b} \{ \langle \bar b, x^* \rangle \ \text{subject to} \ \bar b \in \mathcal{B} \}
           = \epsilon \|x^*\|_{B^*} + \langle x^*, b \rangle,

so that $-g^*(-x^*) = \langle b, x^* \rangle - \epsilon \|x^*\|_{B^*}$. Theorem 4 and the relation $\mathrm{core}(\mathcal{B}) = \mathrm{int}(\mathcal{B})$ prove the lemma.

The constraint qualification (CQ) ensures the non-emptiness of the sub-differential. Setting $\epsilon = 0$ leads to the equality constraints $Ax = b$, for which (CQ) requires $b$ to be an element of $\mathrm{core}(A\, \mathrm{dom}\, f)$. If the equality constraints are not feasible, i.e. $b \notin \mathrm{core}(A\, \mathrm{dom}\, f)$, which can be the case in real problems, the solution diverges. Such problems may be rendered feasible by relaxing the constraints ($\epsilon > 0$), which corresponds to expanding the search space by defining an $\epsilon$ ball around $b$ and searching for a point in the intersection of this ball and $\mathrm{core}(A\, \mathrm{dom}\, f)$. In the convex dual problem, this relaxation is penalized with the norm of the dual parameters, scaling linearly with the relaxation parameter $\epsilon$.

In practice it is difficult to check whether (CQ) holds. One solution is to solve the dual optimization problem and infer that the condition holds if the solution does not diverge. To assure a finite solution, we restrict the function class such that $f^*$ is Lipschitz and perturb the regularization slightly by taking its $k$-th power, resulting in a Lipschitz continuous optimization problem. For instance, Support Vector Machines perform this type of adjustment to ensure feasibility [4].
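The following finite-dimensional sanity check (illustrative; the matrix sizes and $\epsilon$ are arbitrary choices, not from the paper) instantiates Lemma 5 with $f(x) = \frac{1}{2}\|x\|_2^2$, whose conjugate is $f^*(z) = \frac{1}{2}\|z\|_2^2$, and compares the primal and dual optimal values; for a surjective $A$ and $\epsilon > 0$ the constraint qualification holds, so the two values should agree up to solver tolerance.

```python
# Lemma 5 with f(x) = 0.5||x||^2 (so f*(z) = 0.5||z||^2) and B = l2:
#   primal  t = inf_x  { 0.5||x||^2 : ||Ax - b||_2 <= eps }
#   dual    d = sup_phi{ -0.5||A^T phi||^2 + <b, phi> - eps ||phi||_2 }
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 8))                 # surjective with probability one
b = 3.0 * rng.normal(size=5)
eps = 0.5

x0 = np.linalg.lstsq(A, b, rcond=None)[0]   # a strictly feasible starting point
primal = minimize(
    lambda x: 0.5 * (x @ x), x0, method="SLSQP",
    constraints=[{"type": "ineq",
                  "fun": lambda x: eps**2 - np.sum((A @ x - b) ** 2)}],
)

dual = minimize(
    lambda phi: 0.5 * np.sum((A.T @ phi) ** 2) - b @ phi
                + eps * np.linalg.norm(phi),
    np.ones(5), method="BFGS",
)

print("primal value t =", primal.fun)
print("dual value   d =", -dual.fun)        # we minimized the negated dual
```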
Lemma 6. Denote by X a Banach space, let $b \in X^*$ and let $k > 1$. Assume that $f(Ax)$ is convex and Lipschitz continuous in $x$ with Lipschitz constant $C$. Then

  \inf_{x \in X} \left\{ f(Ax) - \langle b, x \rangle + \epsilon \|x\|^k \right\}   (2)

does not diverge, and the norm of $x$ is bounded by $\|x\|_X \le \left[ (\|b\|_{X^*} + C)/(k\epsilon) \right]^{1/(k-1)}$.

Proof [sketch]. Note that the overall Lipschitz constant of the objective function (except for the norm term) is bounded by $\|b\|_{X^*} + C$. The objective function cannot increase further once the slope due to the norm term is larger than what the Lipschitz constant admits. Solving $\epsilon k \|x\|_X^{k-1} = \|b\|_{X^*} + C$ proves the claim.

3 Divergence Minimization and Convex Duality

Now that we have established a rather general framework of duality for regularized inverse problems, we consider applications to problems in statistics. For the remainder of the section, $x$ will either be a density or a conditional density over the domain T. For this reason we will use $p$ instead of $x$ to denote the variable of the optimization problem.

Denote by $\psi : T \to B$ feature functions and let $A : X \to B$ be the expectation operator of the feature map with respect to $p$; in other words, $Ap := E_{t \sim p}[\psi(t)]$. With some abuse of notation we will use the shorthand $E_p[\psi]$ whenever convenient. Finally, denote by $\tilde\psi = b$ the observed value of the features $\psi(t)$, derived e.g. via $b = m^{-1} \sum_{i=1}^m \psi(t_i)$ for $t_i \in S$, the sample of size $m$.

This setting allows us to study various statistical learning methods within the convex duality framework. It is well known that the dual of maximum (Shannon) entropy (commonly called MaxEnt) is maximum likelihood (ML) estimation. One of the corollaries, which follows immediately from the more general result in Lemma 11, is that the dual of approximate maximum entropy is maximum a posteriori estimation (MAP).

Theorem 7. Assume that $f$ is the negative Shannon entropy, that is $f(p) := -H(p) = \int_T \log p(t)\, dp(t)$. Under the above conditions we have

  \min_p \left\{ -H(p) \ \text{subject to} \ \|E_p[\psi] - \tilde\psi\| \le \epsilon \ \text{and} \ \int_T dp(t) = 1 \right\}   (3)
  = \max_\phi \ \langle \phi, \tilde\psi \rangle - \log \int_T \exp(\langle \phi, \psi(t) \rangle)\, dt - \epsilon \|\phi\| + e^{-1}.

Equivalently, $\phi$ maximizes $\Pr(S|\phi)\Pr(\phi)$ where $\Pr(\phi) \propto \exp(-\epsilon\|\phi\|)$.

In order to provide a common treatment to various statistical inference techniques, as well as to give insight into the development of new ones, we study two important classes of divergence functions, Csiszár divergences and Bregman divergences. The Csiszár divergence, which includes Amari's $\alpha$ divergences as special cases, gives the asymmetric distance between two infinite-dimensional density functions induced by a manifold. Bregman divergences are commonly defined over distributions over a finite domain. The two classes of divergences intersect at the KL divergence. To avoid technical problems, we assume that the constraint qualifications are satisfied (e.g. via Lemma 6).
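On a finite domain, Theorem 7 can be checked directly. The sketch below (the domain size, features, sample size and the choice $\epsilon = 1/\sqrt{m}$ are illustrative assumptions) maximizes the norm-penalized log-likelihood $\langle \phi, \tilde\psi \rangle - \log \sum_t \exp(\langle \phi, \psi(t) \rangle) - \epsilon\|\phi\|_2$, i.e. a MAP estimate under the prior $\Pr(\phi) \propto \exp(-\epsilon\|\phi\|_2)$, and reports the moment gap $\|E_{\hat p}[\psi] - \tilde\psi\|$, which the optimality conditions keep at (or below) $\epsilon$.

```python
# Approximate MaxEnt = norm-penalized ML / MAP on a finite domain (cf. Theorem 7).
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(3)
T, d, m = 12, 4, 200
psi = rng.normal(size=(T, d))                 # feature map on the finite domain
phi_true = rng.normal(size=d)
p_true = np.exp(psi @ phi_true - logsumexp(psi @ phi_true))
sample = rng.choice(T, size=m, p=p_true)      # observed sample of size m
psi_tilde = psi[sample].mean(axis=0)          # empirical feature means
eps = 1.0 / np.sqrt(m)                        # eps ~ O(1/sqrt(m)), cf. Section 4

def neg_map(phi):
    return logsumexp(psi @ phi) - phi @ psi_tilde + eps * np.linalg.norm(phi)

phi_hat = minimize(neg_map, 0.1 * np.ones(d), method="BFGS").x
p_hat = np.exp(psi @ phi_hat - logsumexp(psi @ phi_hat))
gap = np.linalg.norm(psi.T @ p_hat - psi_tilde)
print("moment gap:", gap, " vs eps:", eps)    # gap <= eps, = eps when phi_hat != 0
```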
3.1 Csiszár Divergences

Definition 8. Denote by $h : \mathbb{R} \to \mathbb{R}$ a convex lsc function and let $p, q$ be two distributions on T. Then the Csiszár divergence is given by

  f_h(q, p) = \int_T q(t)\, h\!\left( \frac{p(t)}{q(t)} \right) dt.   (4)

Different choices of $h$ lead to different divergence measures; for instance, $h(\xi) = \xi \log \xi$ yields the Kullback-Leibler divergence. Commonly, $q$ is fixed and optimization is performed with respect to $p$, which we denote by $f_{h,q}(p)$. Since $f_{h,q}(p)$ is convex and expectation is a linear operator, we can apply Lemma 5 to obtain the convex conjugate of Csiszár divergence optimization:

Lemma 9 (Duality of Csiszár Divergence). Assume that the conditions of Lemma 5 hold and let $f$ be defined as a Csiszár divergence. Then

  \min_p \left\{ f_{h,q}(p) \ : \ \|E_p[\psi] - \tilde\psi\|_B \le \epsilon \right\}
  = \max_\phi \left\{ -f^*_{h,q}(\langle \phi, \psi(\cdot) \rangle) + \langle \phi, \tilde\psi \rangle - \epsilon \|\phi\|_{B^*} \right\}.

Moreover, the solutions $\hat p$ and $\hat\phi$ are connected by $\hat p(t) = q(t)\,(h^*)'(\langle \psi(t), \hat\phi \rangle)$.

Proof. The adjoint of the linear operator $A$ is given by $\langle Ax, \phi \rangle = \langle A^*\phi, x \rangle$. Letting $A$ be the expectation with respect to $p$, we have $\langle \int_T p(t)\psi(t)\,dt, \phi \rangle = \int_T p(t) \langle \psi(t), \phi \rangle\, dt = (A^*\phi)(p)$ for $A^*\phi = \langle \phi, \psi(\cdot) \rangle$. Next note that $f^*_{h,q}(\langle \phi, \psi(\cdot) \rangle) = \int_T q(t)\, h^*(\langle \phi, \psi(t) \rangle)\, dt$. Plugging this into Lemma 5, we obtain the first claim.

Using attainability of the solution, it follows that there exist $\hat p$ and $\hat\phi$ which solve the corresponding optimization problems. Equating both sides, we have

  \int_T q(t)\, h\!\left( \frac{\hat p(t)}{q(t)} \right) dt
  = -f^*_{h,q}(\langle \hat\phi, \psi(\cdot) \rangle) + \langle \tilde\psi, \hat\phi \rangle - \epsilon \|\hat\phi\|_{B^*}
  = -f^*_{h,q}(\langle \hat\phi, \psi(\cdot) \rangle) + \langle \hat\phi, E_{\hat p}[\psi] \rangle.

Here the last equality follows from the definition of the constraints (see the proof of Lemma 5). Taking the derivative at the solution $\hat p$ (possible due to the constraint qualification), and noting that the first term on the right-hand side does not depend on $p$, we get $h'(\hat p/q) = \langle \hat\phi, \psi \rangle$. Using the relation $(h')^{-1} = (h^*)'$ completes the proof.

Since we are dealing with probability distributions, it is convenient to add the constraint $\int_T dp(t) = 1$. We have the following corollary.

Corollary 10 (Csiszár divergence and probability constraints). Define all variables as in Lemma 9. We have

  \min_p \left\{ f_{h,q}(p) \ \text{subject to} \ \|E_p[\psi] - \tilde\psi\|_B \le \epsilon \ \text{and} \ \int_T dp(t) = 1 \right\}
  = \max_\phi \left\{ -f^*_{h,q}(\langle \phi, \psi(\cdot) \rangle - \Lambda_\phi) + \langle \phi, \tilde\psi \rangle - \Lambda_\phi - \epsilon \|\phi\|_{B^*} \right\} =: -L^C_{\tilde\psi}(\phi).   (5)

Here the solution is given by $\hat p(t) = q(t)\,(h^*)'(\langle \psi(t), \hat\phi \rangle - \Lambda_{\hat\phi})$, where $\Lambda_{\hat\phi}$ is the partition function which ensures that $p$ is a probability distribution ($\Lambda_{\hat\phi}$ is the minimizer of (5) with respect to $\Lambda_\phi$).

Proof [sketch]. Define $P = \{ p \mid \int_T dp(t) = 1 \}$ and let $f$ in Lemma 5 be $f(p) = f_{h,q}(p) + \chi_P(p)$. Then the convex conjugate of $f$ is

  f^*(p^*) = \sup_p \left\{ \langle p, p^* \rangle - f_{h,q}(p) - \Lambda_{p^*} \left( \int_T dp(t) - 1 \right) \right\} = \Lambda_{p^*} + (f_{h,q})^*(p^* - \Lambda_{p^*}).

Performing the steps in the proof of Lemma 9 gives the result.

An important and well-studied special case of this duality is the minimization of the KL divergence, which we investigate in the next section. Note that new inference techniques can be derived using other $h$ functions, e.g. Tsallis entropy, which is preferable to Shannon entropy in fields such as statistical mechanics.

3.2 MAP and Maximum Likelihood via KL Divergence

Defining $h$ in (4) as $h(\xi) := \xi \ln \xi$, we have $h^*(\xi^*) = \exp(\xi^* - 1)$, and the Csiszár divergence becomes the KL divergence. Applying Corollary 10, we have:

Lemma 11 (KL divergence with probability constraints). Define all variables as in Corollary 10. We have

  \min_p \left\{ \mathrm{KL}(p\|q) \ \text{subject to} \ \|E_p[\psi] - \tilde\psi\|_B \le \epsilon \ \text{and} \ \int_T dp(t) = 1 \right\}
  = \max_\phi \left\{ \langle \phi, \tilde\psi \rangle - \log \int_T q(t) \exp(\langle \phi, \psi(t) \rangle)\, dt - \epsilon \|\phi\|_{B^*} + e^{-1} \right\},   (6)

where the unique solution is given by $\hat p_{\hat\phi}(t) = q(t) \exp(\langle \hat\phi, \psi(t) \rangle - \Lambda_{\hat\phi})$.

Proof. The dual of $f_{h,q}$ is $f^*_{h,q}(x^*) = \int_T q(t) \exp(x^*(t) - 1)\, dt$. Hence we have the dual objective function

  -\int_T q(t) \exp(\langle \phi, \psi(t) \rangle - \Lambda_\phi - 1)\, dt + \langle \phi, \tilde\psi \rangle - \Lambda_\phi - \epsilon \|\phi\|_{B^*}.

Solving for optimality in $\Lambda_\phi$ yields $\Lambda_\phi = \log \int_T q(t) \exp(\langle \phi, \psi(t) \rangle)\, dt$. Substituting this into the objective function proves the claim.

Thus, optimizing approximate KL divergence leads to exponential families. Many well-known statistical inference methods can be viewed as special cases.
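As a discrete-domain consistency check of Lemma 11 (a sketch under illustrative choices, with $\epsilon = 0$ for simplicity), the code below minimizes $\mathrm{KL}(p\|q)$ under exact moment and normalization constraints both directly and through the dual, and verifies that the two solutions agree, with $\hat p(t) \propto q(t)\exp(\langle \hat\phi, \psi(t) \rangle)$.

```python
# KL projection onto moment constraints: primal vs dual (Lemma 11 with eps = 0).
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(4)
T, d = 8, 2
psi = rng.normal(size=(T, d))
q = rng.dirichlet(np.ones(T))                 # reference distribution
p0 = rng.dirichlet(np.ones(T))
psi_tilde = psi.T @ p0                        # target moments (feasible by construction)

# Dual: unconstrained concave maximization over phi.
def neg_dual(phi):
    return logsumexp(np.log(q) + psi @ phi) - phi @ psi_tilde

phi_hat = minimize(neg_dual, np.zeros(d), method="BFGS").x
log_p = np.log(q) + psi @ phi_hat
p_dual = np.exp(log_p - logsumexp(log_p))     # q(t) exp(<phi, psi(t)> - Lambda)

# Primal: constrained minimization of KL(p || q).
def kl(p):
    return np.sum(p * (np.log(p) - np.log(q)))

cons = [{"type": "eq", "fun": lambda p: psi.T @ p - psi_tilde},
        {"type": "eq", "fun": lambda p: p.sum() - 1.0}]
p_primal = minimize(kl, q.copy(), method="SLSQP", constraints=cons,
                    bounds=[(1e-9, 1.0)] * T).x

print("max |p_dual - p_primal| =", np.abs(p_dual - p_primal).max())
```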
Let $P = \{ p \mid p \in X, \int_T dp(t) = 1 \}$ and $q(t) = 1$ for all $t \in T$.

Example 12. For $\epsilon = 0$, we recover the well-known duality between Maximum Entropy and Maximum Likelihood estimation:

  \min_{p \in P} \left\{ -H(p) \ \text{subject to} \ E_p[\psi] = \tilde\psi \right\}
  = \max_\phi \ \langle \phi, \tilde\psi \rangle - \log \int_T \exp(\langle \phi, \psi(t) \rangle)\, dt + e^{-1}.

Example 13. For $B = \ell_\infty$ we obtain the density estimation problem of [9]:

  \min_{p \in P} \left\{ -H(p) \ \text{subject to} \ \|E_p[\psi] - \tilde\psi\|_\infty \le \epsilon \right\}
  = \max_\phi \ \langle \phi, \tilde\psi \rangle - \log \int_T \exp(\langle \phi, \psi(t) \rangle)\, dt - \epsilon \|\phi\|_1 + e^{-1}.

If B is a reproducing kernel Hilbert space of spline functions, we obtain the density estimator of [16], who use an RKHS penalty on $\phi$.

The well-known overfitting behavior of ML can be explained by the constraint qualification (CQ) of Section 2. While it can be shown that in exponential families the constraint qualifications are satisfied [22] if we consider the closure of the marginal polytope, the solution may be on (or close to) a vertex of the marginal polytope. This can lead to large (or possibly diverging) values of $\phi$. Hence, regularization by approximate moment matching is useful to ensure that such divergence does not occur.

Regularizing ML with $\ell_2$ and $\ell_1$ norm terms is common practice [7], where the coefficient is determined by cross validation techniques. The analysis above provides a unified treatment of these regularization methods. More importantly, it leads to a principled way of determining the regularization coefficient $\epsilon$, as discussed in Section 4.

Note that if $t \in T$ is an input-output pair $t = (x, y)$, we could maximize the entropy of either the joint probability density $p(x, y)$ or the conditional model $p(y|x)$, which is what we really need in order to estimate $y|x$. If we maximize the entropy of $p(y|x)$ and B is an RKHS with kernel $k(t, t') := \langle \psi(t), \psi(t') \rangle$, we obtain a range of conditional estimation methods:

– For $\psi(t) = y\,\psi_x(x)$ and $y \in \{\pm 1\}$, we obtain binary Gaussian Process classification [15].
– For $\psi(t) = (y, y^2)\,\psi_x(x)$, we obtain the heteroscedastic GP regression estimates of [13].
– For decomposing $\psi(t)$, we obtain various graphical models and conditional random fields as described in [1].
– For $\psi(t) = y\,\psi_x(x)$ and $\ell_\infty$ spaces, we obtain as the dual the $\ell_1$ regularization typically used in sparse estimation methods.

The obvious advantage of using convex duality in Banach spaces is that it provides a unified approach (including bounds) to the different regularization/relaxation schemes listed above. More importantly, this generality provides flexibility for more complex choices of regularization. For instance, we could define different regularizations for features that possess different characteristics.

3.3 Bregman Divergence

The Bregman divergence between two distributions $p$ and $q$, for a convex function $h$ acting on the space of probabilities, is given by

  \triangle_h(p, q) = h(p) - h(q) - \langle p - q, \nabla_q h(q) \rangle.   (7)

Note that $\triangle_h(p, q)$ is convex in $p$. Applying Fenchel's duality theory, we have:

Corollary 14 (Duality of Bregman Divergence). Assume that the conditions of Lemma 5 hold and let $f$ be defined as a Bregman divergence. Then, with $\phi_q := h'(q)$,

  \min_p \left\{ \triangle_h(p, q) \ \text{subject to} \ \|E_p[\psi] - \tilde\psi\|_B \le \epsilon \right\}
  = \max_\phi \left\{ -h^*(\langle \phi, \psi(\cdot) \rangle + \phi_q) + \langle \phi, \tilde\psi \rangle - \epsilon \|\phi\|_{B^*} \right\} =: -L^B_{\tilde\psi}(\phi).   (8)

Proof. Defining $H_q(p) = h(p) - \langle p, h'(q) \rangle$, we have $\triangle_h(p, q) = H_q(p) + h^*(\phi_q)$. The convex conjugate of $H_q$ is $H_q^*(\phi) = \sup_p \langle p, \phi + h'(q) \rangle - h(p) = h^*(\phi + \phi_q)$, since $h'(q) = \phi_q$. Since $q$ is constant, we obtain the equality (up to a constant) by plugging $H_q^*$ into Lemma 5.
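A minimal finite-dimensional implementation of definition (7) (purely illustrative) makes two familiar special cases explicit: the negative-entropy generator recovers the KL divergence for normalized $p$ and $q$, and the squared Euclidean norm recovers $\frac{1}{2}\|p - q\|^2$.

```python
# Bregman divergence Delta_h(p, q) = h(p) - h(q) - <p - q, grad h(q)>.
import numpy as np

def bregman(h, grad_h, p, q):
    return h(p) - h(q) - np.dot(p - q, grad_h(q))

neg_entropy = lambda p: np.sum(p * np.log(p))
grad_neg_entropy = lambda p: np.log(p) + 1.0
sq_norm = lambda p: 0.5 * np.dot(p, p)
grad_sq_norm = lambda p: p

rng = np.random.default_rng(5)
p = rng.dirichlet(np.ones(6))
q = rng.dirichlet(np.ones(6))

print(bregman(neg_entropy, grad_neg_entropy, p, q),
      "vs KL(p||q) =", np.sum(p * np.log(p / q)))
print(bregman(sq_norm, grad_sq_norm, p, q),
      "vs 0.5||p-q||^2 =", 0.5 * np.sum((p - q) ** 2))
```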
As with Csiszár divergences, the KL divergence becomes a special case of the Bregman divergence by defining $h$ as $h(p) := \int_T p(t) \ln p(t)\, dt$. Thus, we can achieve the same results as in Section 3.2 using Bregman divergences as well. Also, it has been shown in various studies that Boosting, which minimizes exponential loss, can be cast as a special case of a Bregman divergence problem with linear equality constraints [8,11]. An immediate consequence of Corollary 14 is therefore to generalize these approaches by relaxing the equality constraints with respect to various norms, which yields regularized exp-loss optimization problems and hence various regularized boosting approaches. Due to space limitations, we omit the details.

4 Bounds on the Dual Problem and Uniform Stability

The generalization performance of estimators obtained by optimizing various convex functions in Reproducing Kernel Hilbert Spaces has been studied extensively; see e.g. [19,5] and references therein. Producing similar results in the general form of convex analysis allows us to unify previous results via simpler proofs and tight bounds.

4.1 Concentration of empirical means

One of the key tools in the analysis of divergence estimates is the fact that deviations of the random variable $\tilde\psi = \frac{1}{m}\sum_i \psi(t_i)$ are well controlled.

Theorem 15. Denote by $S := \{t_1, \ldots, t_m\} \subseteq T$ a set of random variables drawn from $p$. Let $\psi : T \to B$ be a feature map into a Banach space B which is uniformly bounded by $R$. Then the following bound holds:

  \left\| \frac{1}{m} \sum_{i=1}^m \psi(t_i) - E_p[\psi(t)] \right\|_B \le 2 R_m(\mathcal{F}, p) + \epsilon   (9)

with probability at least $1 - \exp(-\epsilon^2 m / R^2)$. Here $R_m(\mathcal{F}, p)$ denotes the Rademacher average with respect to the function class $\mathcal{F} := \{ \phi_p(\cdot) = \langle \psi(\cdot), \phi_p \rangle : \|\phi_p\|_{B^*} \le 1 \}$.

Moreover, if B is an RKHS with kernel $k(t, t')$, the right-hand side of (9) can be tightened to $\sqrt{m^{-1} E_p[k(t,t) - k(t,t')]} + \epsilon$. The same bound for $\epsilon$ as above applies.

See [2] for more details and [20] for earlier results on Hilbert spaces.

Proof. The first claim follows immediately from [2, Theorems 9 and 10]. The second part is due to an improved calculation of the expected value of the left-hand side of (9). By convexity we have

  E_p\!\left[ \left\| \frac{1}{m}\sum_{i=1}^m \psi(t_i) - E_p[\psi(t)] \right\|_B \right]
  \le \left( E_p\!\left[ \left\| \frac{1}{m}\sum_{i=1}^m \psi(t_i) - E_p[\psi(t)] \right\|_B^2 \right] \right)^{1/2}
  = m^{-1/2} \left( E_p\!\left[ \|\psi(t) - E_p[\psi(t)]\|^2 \right] \right)^{1/2}
  = m^{-1/2} \left( E_p[k(t,t) - k(t,t')] \right)^{1/2}.

The concentration inequality for bounding large deviations remains unchanged with respect to the Banach space case, where the same tail bound holds.

The usefulness of Theorem 15 arises from the fact that it allows us to determine $\epsilon$ in the inverse problem. If $m$ is small, it is sensible to choose a large value of $\epsilon$, and with increasing $m$ our precision should improve as $O(1/\sqrt{m})$. This gives us a principled way of determining $\epsilon$ based on statistical principles.

4.2 Stability with respect to changes in b

Next we study the stability of constrained optimization problems when changing the empirical mean parameter $b$. Consider the convex dual problem of Lemma 5 and the objective function of its special case (8). Both problems can be summarized as

  L(\phi, b) := f(A\phi) - \langle b, \phi \rangle + \epsilon \|\phi\|^k_{B^*},   (10)

where $\epsilon > 0$ and $f(A\phi)$ is a convex function. We first show that, for any $b'$, the difference between the value of $L(\phi, b')$ obtained by minimizing $L(\phi, b)$ with respect to $\phi$, and vice versa, is bounded.
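This stability quantity can be probed numerically. The sketch below (an illustration with synthetic data and arbitrary sizes, not the paper's argument) takes $f(A\phi)$ to be the log-partition function $\log \sum_t \exp(\langle \phi, \psi(t) \rangle)$ from Section 3 and $k = 2$, minimizes $L(\cdot, b)$ and $L(\cdot, b')$ for a slightly perturbed $b'$, and compares $L(\phi_b, b')$ with $L(\phi_{b'}, b')$.

```python
# Stability of L(phi, b) = f(A phi) - <b, phi> + eps ||phi||^k under changes in b.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(6)
T, d, eps, k = 15, 3, 0.1, 2
psi = rng.normal(size=(T, d))                 # feature map psi(t) on a finite domain

def L(phi, b):
    return logsumexp(psi @ phi) - b @ phi + eps * np.linalg.norm(phi) ** k

def argmin_L(b):
    return minimize(lambda phi: L(phi, b), np.zeros(d), method="BFGS").x

b = psi.T @ rng.dirichlet(np.ones(T))         # empirical mean parameter
b_prime = b + 0.01 * rng.normal(size=d)       # a small perturbation of b

phi_b, phi_bp = argmin_L(b), argmin_L(b_prime)
# Value of the minimizer of L(., b), evaluated at the perturbed parameter b'.
print("L(phi_b, b') - L(phi_b', b') =", L(phi_b, b_prime) - L(phi_bp, b_prime))
print("||b - b'|| =", np.linalg.norm(b - b_prime))
```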
