
Bounding errors of Expectation-Propagation

Guillaume Dehaene (University of Geneva, [email protected])
Simon Barthelmé (CNRS, Gipsa-lab, [email protected])

Abstract

Expectation Propagation is a very popular algorithm for variational inference, but comes with few theoretical guarantees. In this article, we prove that the approximation errors made by EP can be bounded. Our bounds have an asymptotic interpretation in the number n of datapoints, which allows us to study EP's convergence with respect to the true posterior. In particular, we show that EP converges at a rate of O(n⁻²) for the mean, up to an order of magnitude faster than the traditional Gaussian approximation at the mode. We also give similar asymptotic expansions for moments of order 2 to 4, as well as for the excess Kullback-Leibler cost (defined as the additional KL cost incurred by using EP rather than the ideal Gaussian approximation). All these expansions highlight the superior convergence properties of EP. Our approach for deriving those results is likely applicable to many similar approximate inference methods. In addition, we introduce bounds on the moments of log-concave distributions that may be of independent interest.

Introduction

Expectation Propagation (EP, [1]) is an efficient approximate inference algorithm that is known to give good approximations, to the point of being almost exact in certain applications [2, 3]. It is surprising that, while the method is empirically very successful, there are few theoretical guarantees on its behavior. Indeed, most work on EP has focused on efficiently implementing the method in various settings. Theoretical work on EP mostly consists of new justifications of the method which, while they offer intuitive insight, do not give mathematical proofs that the method behaves as expected. One recent breakthrough is due to Dehaene and Barthelmé [4], who prove that, in the large-data limit, the EP iteration behaves like a Newton search and its approximation is asymptotically exact. However, it remains unclear how good we can expect the approximation to be when we have only finite data. In this article, we offer a characterization of the quality of the EP approximation in terms of the worst-case distance between the true and approximate mean and variance.

When approximating a probability distribution p(x) that is, for some reason, close to being Gaussian, a natural approximation to use is the Gaussian with mean equal to the mode (or argmax) of p(x) and with variance the inverse of the curvature of −log p at the mode. We call this the Canonical Gaussian Approximation (CGA), and its use is usually justified by appealing to the Bernstein-von Mises theorem, which shows that, in the limit of a large number of independent observations, posterior distributions tend towards their CGA. This powerful justification, and the ease with which the CGA is computed (finding the mode can be done using Newton methods), make it a good reference point for any method like EP which aims to offer a better Gaussian approximation at a higher computational cost.

In section 1, we introduce the CGA and the EP approximation. In section 2, we give our theoretical results bounding the quality of EP approximations.

1 Background

In this section, we present the CGA and give a short introduction to the EP algorithm. In-depth descriptions of EP can be found in Minka [5], Seeger [6], Bishop [7], and Raymond et al. [8].

1.1 The Canonical Gaussian Approximation

What we call here the CGA is perhaps the most common approximate inference method in the machine-learning cookbook. It is often called the "Laplace approximation", but this is a misnomer: the Laplace approximation refers to approximating the integral of p using the integral of the CGA. The reason the CGA is so often used is its compelling simplicity: given a target distribution p(x) = exp(−φ(x)), we find the mode x⋆ and compute the second derivative of φ at x⋆:

$$ x^\star = \operatorname{argmin}_x \phi(x) \qquad \beta^\star = \phi''(x^\star) $$

to form a Gaussian approximation q(x) = N(x | x⋆, 1/β⋆) ≈ p(x). The CGA is effectively just a second-order Taylor expansion, and its use is justified by the Bernstein-von Mises theorem [9], which essentially says that the CGA becomes exact in the large-data (large-n) asymptotic limit. Roughly, if p_n(x) ∝ ∏_{i=1}^n p(y_i | x) p_0(x), where y_1 … y_n represent independent datapoints, then

$$ \lim_{n \to \infty} p_n(x) = N\left(x \,\middle|\, x^\star_n, \frac{1}{\beta^\star_n}\right) \quad \text{in total variation.} $$
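As a concrete illustration of this recipe, here is a minimal sketch in Python. Everything in it is an assumption of ours rather than part of the paper: the helper name `cga`, the Newton stopping rule, and the Gamma toy target whose derivatives are supplied analytically.

```python
import numpy as np

def cga(phi_grad, phi_hess, x0, n_iter=50, tol=1e-10):
    """Canonical Gaussian Approximation of p(x) = exp(-phi(x)) in one dimension.

    Newton's method locates the mode x_star of p (the minimum of phi);
    the CGA is then the Gaussian N(x_star, 1/phi''(x_star)).
    """
    x = x0
    for _ in range(n_iter):
        step = phi_grad(x) / phi_hess(x)   # Newton step towards the minimum of phi
        x -= step
        if abs(step) < tol:
            break
    return x, 1.0 / phi_hess(x)            # CGA mean and variance

# Toy target: Gamma with shape a and rate b, i.e. phi(x) = b*x - (a-1)*log(x).
a, b = 10.0, 2.0
mean_cga, var_cga = cga(lambda x: b - (a - 1) / x,
                        lambda x: (a - 1) / x ** 2,
                        x0=1.0)
print(mean_cga, a / b)   # CGA mean = mode = (a-1)/b = 4.5, true mean = 5.0
```

The gap between the two printed values is an instance of the mode-versus-mean error that the paper later shows to be of order n⁻¹ in the number of datapoints.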
1.2 CGA vs Gaussian EP

Gaussian EP, as its name indicates, provides an alternative way of computing a Gaussian approximation to a target distribution. There is broad overlap between the problems where EP can be applied and the problems where the CGA can be used, with EP coming at a higher cost. Our contribution is to show formally that the higher computational cost of EP may well be worth bearing, as EP approximations can outperform CGAs by an order of magnitude. To be specific, we focus on the moment estimates (mean and covariance) computed by EP and the CGA, and derive bounds on their distance to the true mean and variance of the target distribution. Our bounds have an asymptotic interpretation, and under that interpretation we show, for example, that the mean returned by EP is within O(n⁻²) of the true mean, where n is the number of datapoints. For the CGA, which uses the mode as an estimate of the mean, we exhibit an O(n⁻¹) upper bound, and we compute the error term responsible for this O(n⁻¹) behavior. This enables us to show that, in the situations in which this error is indeed of order n⁻¹, EP is better than the CGA.

1.3 The EP algorithm

We consider the task of approximating a probability distribution p(x) over a random variable X, which we call the target distribution. X can be high-dimensional, but for simplicity we focus on the one-dimensional case. One important hypothesis that makes EP feasible is that p(x) factorizes into n simple factor terms:

$$ p(x) = \prod_i f_i(x) $$

EP proposes to approximate each f_i(x) (usually referred to as the sites) by a Gaussian function q_i(x) (referred to as the site-approximations). It is convenient to parametrize the Gaussians in terms of their natural parameters:

$$ q_i(x \mid r_i, \beta_i) \propto \exp\left( r_i x - \beta_i \frac{x^2}{2} \right) $$

which makes some of the further computations easier to understand. Note that EP could also be used with other exponential approximating families. These Gaussian approximations are computed iteratively. Starting from a current approximation q_i^t(x | r_i^t, β_i^t), we select a site for update with index i. We then:

• Compute the cavity distribution q_{−i}^t(x) ∝ ∏_{j≠i} q_j^t(x). This is very easy in natural parameters:

$$ q_{-i}(x) \propto \exp\left( \Big(\sum_{j \ne i} r_j^t\Big) x - \Big(\sum_{j \ne i} \beta_j^t\Big) \frac{x^2}{2} \right) $$

• Compute the hybrid distribution h_i^t(x) ∝ q_{−i}^t(x) f_i(x) and its mean and variance.

• Compute the Gaussian which minimizes the Kullback-Leibler divergence to the hybrid, i.e. the Gaussian with the same mean and variance:

$$ P(h_i^t) = \operatorname{argmin}_q \, KL\left( h_i^t \,\big\|\, q \right) $$

• Finally, update the approximation of f_i:

$$ q_i^{t+1} = \frac{P(h_i^t)}{q_{-i}^t} $$

where the division is simply computed as a subtraction between natural parameters.

We iterate these operations until a fixed point is reached, at which point we return a Gaussian approximation of p(x) ≈ ∏_i q_i(x).
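The following is a minimal sketch of this loop for the one-dimensional Gaussian case. Everything here is an illustrative assumption rather than the paper's specification: the name `gaussian_ep`, the brute-force grid quadrature used for the hybrid moments, the fixed sweep count in place of a convergence test, and the small initial site precisions.

```python
import numpy as np

def gaussian_ep(log_sites, n_sweeps=50, grid=np.linspace(-10.0, 10.0, 4001)):
    """Gaussian EP in one dimension, sites in natural parameters (r_i, beta_i).

    log_sites: callables log f_i(x), vectorized over x; the target is
    p(x) proportional to prod_i f_i(x). Hybrid means and variances are
    computed by quadrature on a uniform grid, enough for illustration.
    """
    n = len(log_sites)
    r = np.zeros(n)           # site-approximation natural parameters r_i
    beta = np.full(n, 1e-3)   # site precisions beta_i (small start, for stability)
    for _ in range(n_sweeps):
        for i in range(n):
            # Cavity q_{-i}: the global approximation with site i removed.
            r_cav = r.sum() - r[i]
            b_cav = beta.sum() - beta[i]
            # Hybrid h_i(x), proportional to q_{-i}(x) f_i(x), on the grid.
            log_h = r_cav * grid - 0.5 * b_cav * grid ** 2 + log_sites[i](grid)
            h = np.exp(log_h - log_h.max())
            h /= h.sum()
            mu = (grid * h).sum()                 # hybrid mean
            v = ((grid - mu) ** 2 * h).sum()      # hybrid variance
            # KL projection = moment matching; dividing by the cavity is a
            # subtraction of natural parameters.
            r[i] = mu / v - r_cav
            beta[i] = 1.0 / v - b_cav
    return r, beta   # q(x) has natural parameters (r.sum(), beta.sum())
```

At a fixed point, q(x) has mean `r.sum()/beta.sum()` and variance `1/beta.sum()`; a production implementation would add damping and a proper convergence test instead of a fixed number of sweeps.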
1.4 The "EP approximation"

In this work, we characterize the quality of an EP approximation of p(x). We define this to be any fixed point of the iteration presented in section 1.3, any of which could be returned by the algorithm. It is known that EP has at least one fixed point [1], but it is unknown under which conditions the fixed point is unique. We conjecture that, when all sites are log-concave (one of our hypotheses for controlling the behavior of EP), it is in fact unique, but we cannot offer a proof yet. If p(x) isn't log-concave, it is straightforward to construct examples in which EP has multiple fixed points. These open questions won't matter for our result, because we will show that all fixed points of EP (should there be more than one) produce a good approximation of p(x).

Fixed points of EP have a very interesting characterization. If we note q_i∗ the site-approximations at a given fixed point, h_i∗ the corresponding hybrid distributions, and q∗ the global approximation of p(x), then the mean and variance of all the hybrids and of q∗ are the same (for non-Gaussian approximating families, the expected values of all sufficient statistics of the exponential family are equal). As we will show in section 2.2, this leads to a very tight bound on the possible positions of these fixed points.
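This characterization is easy to probe numerically. The check below reuses the hypothetical `gaussian_ep` sketch from section 1.3 on strongly log-concave, non-Gaussian sites of our own choosing, and recomputes each hybrid's moments at the returned fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=6)
# Strongly log-concave non-Gaussian sites: phi_i(x) = (x-y_i)^2/2 + (x-y_i)^4/10.
log_sites = [lambda x, yi=yi: -0.5 * (x - yi) ** 2 - 0.1 * (x - yi) ** 4 for yi in y]
r, beta = gaussian_ep(log_sites)

grid = np.linspace(-10.0, 10.0, 4001)
mu_q, v_q = r.sum() / beta.sum(), 1.0 / beta.sum()   # moments of q*
for i, log_f in enumerate(log_sites):
    r_cav, b_cav = r.sum() - r[i], beta.sum() - beta[i]
    log_h = r_cav * grid - 0.5 * b_cav * grid ** 2 + log_f(grid)
    h = np.exp(log_h - log_h.max())
    h /= h.sum()
    mu_i = (grid * h).sum()
    v_i = ((grid - mu_i) ** 2 * h).sum()
    print(i, mu_i - mu_q, v_i - v_q)   # both differences should be ~0 at a fixed point
```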
1.5 Notation

We will use the following notation repeatedly. p(x) = ∏_i f_i(x) is the target distribution we want to approximate. The sites f_i(x) are each approximated by a Gaussian site-approximation q_i(x), yielding an approximation p(x) ≈ q(x) = ∏_i q_i(x). The hybrids h_i(x) interpolate between q(x) and p(x) by replacing one site-approximation q_i(x) with the true site f_i(x).

Our results make heavy use of the log-functions of the sites and of the target distribution. We note φ_i(x) = −log(f_i(x)) and φ_p(x) = −log(p(x)) = ∑_i φ_i(x). We will introduce in section 2 hypotheses on these functions: the parameter β_m controls their minimum curvature and the parameters K_d control their maximum dth derivatives.

We will always consider fixed points of EP, where the mean and variance under all hybrids and under q(x) are identical. We will note these common values µ_EP and v_EP. We will also refer to the third and fourth centered moments of the hybrids, denoted by m_3^i and m_4^i, and to the fourth centered moment of q(x), which is simply 3v_EP². We will show how all these moments are related to the true moments of the target distribution, which we note µ and v for the mean and variance, and m_3^p and m_4^p for the third and fourth centered moments. We also investigate the quality of the CGA: µ ≈ x⋆ and v ≈ [φ_p''(x⋆)]⁻¹, where x⋆ is the mode of p(x).

2 Results

In this section, we give tight bounds on the quality of the EP approximation (i.e., of fixed points of the EP iteration). Our results lean on the properties of log-concave distributions [10]. In section 2.1, we introduce new bounds on the moments of log-concave distributions; these bounds show that such distributions are, in a certain sense, close to being Gaussian. We then apply these results to fixed points of EP, where they enable us to bound the distance between the mean and variance of the true distribution p(x) and those of the approximation given by EP, which we do in section 2.2.

Our bounds require us to assume that all sites f_i(x) are β_m-strongly log-concave with slowly changing log-functions. That is, with φ_i(x) = −log(f_i(x)):

$$ \forall i\ \forall x \quad \phi_i''(x) \ \ge\ \beta_m \ >\ 0 \tag{1} $$
$$ \forall i\ \forall d \in \{3,4,5,6\} \quad \left| \phi_i^{(d)}(x) \right| \ \le\ K_d \tag{2} $$

The target distribution p(x) then inherits those properties from the sites. Noting φ_p(x) = −log(p(x)) = ∑_i φ_i(x), φ_p is nβ_m-strongly log-concave and its higher derivatives are bounded:

$$ \forall x \quad \phi_p''(x) \ \ge\ n\beta_m \tag{3} $$
$$ \forall d \in \{3,4,5,6\} \quad \left| \phi_p^{(d)}(x) \right| \ \le\ nK_d \tag{4} $$

A natural concern here is whether or not our conditions on the sites are of practical interest. Indeed, strongly log-concave likelihoods are rare. We picked these strong regularity conditions because they make the proofs relatively tractable (although still technical and long). The proof technique carries over to more complicated, but more realistic, cases. One such interesting generalization is the case in which p(x) and all hybrids at the fixed point are log-concave with slowly changing log-functions (with possibly differing constants). In such a case, while the math becomes more unwieldy, bounds similar to ours can be found, greatly extending the scope of our results. The results we present here should thus be understood as a stepping stone, not as the final word on the quality of the EP approximation: we have focused on providing a rigorous but extensible proof.

2.1 Log-concave distributions are strongly constrained

Log-concave distributions have many interesting properties. They are of course unimodal, and the family is closed under both marginalization and multiplication. For our purposes, however, the most important property is a result due to Brascamp and Lieb [11], which bounds their even moments. We give here an extension for log-concave distributions with slowly changing log-functions (as quantified by eq. (2)). Our results show that these are close to being Gaussian.

The Brascamp-Lieb inequality states that, if LC(x) ∝ exp(−φ(x)) is β_m-strongly log-concave (i.e., φ''(x) ≥ β_m), then the centered even moments of LC are bounded by the corresponding moments of a Gaussian with variance β_m⁻¹. Noting these moments m_{2k} and µ_LC = E_LC(x) the mean of LC:

$$ m_{2k} = E_{LC}\left[ (x - \mu_{LC})^{2k} \right] \ \le\ (2k-1)!!\ \beta_m^{-k} \tag{5} $$

where (2k−1)!! is the double factorial, the product of all odd terms from 1 to 2k−1: 3!! = 3, 5!! = 15, 7!! = 105, etc. This result can be understood as stating that a log-concave distribution must have a small variance, but it doesn't generally need to be close to a Gaussian.

With our hypothesis of slowly changing log-functions, we were able to improve on this result. Our improved results include a bound on odd moments, as well as first-order expansions of even moments (eqs. (6)-(9)).

Our extension of the Brascamp-Lieb inequality is as follows. If φ is slowly changing, in the sense that some of its higher derivatives are bounded as per eq. (2), then we can bound φ'(µ_LC) (showing that µ_LC is close to the mode x⋆_LC of LC; see eqs. (10) to (13)) and m_3 (showing that LC is mostly symmetric):

$$ \left| \phi'(\mu_{LC}) \right| \ \le\ \frac{K_3}{2\beta_m} \tag{6} $$
$$ \left| m_3 \right| \ \le\ \frac{K_3}{2\beta_m^3} \tag{7} $$

and we can compute first-order expansions of m_2 and m_4, bounding the errors in terms of β_m and the K_d:

$$ \left| m_2^{-1} - \phi''(\mu_{LC}) \right| \ \le\ \frac{K_3^2}{\beta_m^2} + \frac{K_4}{2\beta_m} \tag{8} $$
$$ \left| \phi''(\mu_{LC})\, m_4 - 3 m_2 \right| \ \le\ \frac{19 K_3^2}{2\beta_m^4} + \frac{5 K_4}{2\beta_m^3} \tag{9} $$

With eqs. (8) and (9), we see that m_2 ≈ [φ''(µ_LC)]⁻¹ and m_4 ≈ 3[φ''(µ_LC)]⁻² and, in that sense, that LC(x) is close to the Gaussian with mean µ_LC and inverse variance φ''(µ_LC). These expansions could be extended to further orders, and similar formulas can be found for the other moments of LC(x): for example, any odd moment can be bounded by |m_{2k+1}| ≤ C_k K_3 β_m^{−(k+2)} (with C_k some constant), and any even moment has the first-order expansion m_{2k} ≈ (2k−1)!! [φ''(µ_LC)]^{−k}. The proof, as well as more detailed results, can be found in the Supplement.

Note how our result relates to the Bernstein-von Mises theorem, which says that, in the limit of a large number of observations, a posterior p(x) tends towards its CGA. If we consider the posterior obtained from n likelihood functions that are all log-concave and slowly changing, our results show the slightly different result that the moments of that posterior are close to those of a Gaussian with mean µ_LC (instead of x⋆_LC) and inverse variance φ''(µ_LC) (instead of φ''(x⋆_LC)). This point is critical. While the CGA still ends up capturing the limit behavior of p, since µ_LC → x⋆_LC in the large-data limit (see eq. (13) below), an approximation that returns the Gaussian approximation at µ_LC would be better. This is essentially what EP does, and this is how it improves on the CGA.
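Before moving on, here is a brute-force numerical check of the moment bound (5) on a strongly log-concave example of our own choosing; the density, grid, and β_m = 1 are assumptions made purely for illustration.

```python
import numpy as np

# Check of eq. (5) on phi(x) = x^2/2 + x^4/4, for which
# phi''(x) = 1 + 3x^2 >= beta_m = 1 everywhere.
x = np.linspace(-8.0, 8.0, 200001)
p = np.exp(-(x ** 2 / 2 + x ** 4 / 4))
p /= p.sum()                               # discrete normalization (uniform grid)
mu = (x * p).sum()
for k in (1, 2, 3):
    m2k = ((x - mu) ** (2 * k) * p).sum()        # centered even moment
    bound = np.prod(np.arange(1, 2 * k, 2))      # (2k-1)!! * beta_m^{-k} with beta_m = 1
    print(f"m_{2 * k} = {m2k:.4f} <= {bound}")
```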
2.2 Computing bounds on EP approximations

In this section, we consider a given EP fixed point q_i∗(x | r_i, β_i) and the corresponding approximation of p(x), q∗(x | r = ∑_i r_i, β = ∑_i β_i). We will show that the expected value and variance of q∗ (resp. µ_EP and v_EP) are close to the true mean and variance of p (resp. µ and v), and we also investigate the quality of the CGA (µ ≈ x⋆, v ≈ [φ_p''(x⋆)]⁻¹).

Under our assumptions on the sites (eqs. (1) and (2)), we are able to derive bounds on the quality of the EP approximation. The proof is quite involved and long, and we present it in full only in the Supplement. In the main text, we give a partial version: we detail the first step of the demonstration, which consists of computing a rough bound on the distance between the true mean µ, the EP approximation µ_EP, and the mode x⋆, and we give an outline of the rest of the proof.

Let us show that µ, µ_EP and x⋆ are all close to one another. We start from eq. (6) applied to p(x):

$$ \left| \phi_p'(\mu) \right| \ \le\ \frac{K_3}{2\beta_m} \tag{10} $$

which tells us that φ_p'(µ) ≈ 0. µ must thus be close to x⋆. Indeed, for some ξ ∈ [µ, x⋆]:

$$ \left| \phi_p'(\mu) \right| = \left| \phi_p'(\mu) - \phi_p'(x^\star) \right| \tag{11} $$
$$ = \left| \phi_p''(\xi) \right| \left| \mu - x^\star \right| \ \ge\ n\beta_m \left| \mu - x^\star \right| \tag{12} $$

Combining eqs. (10) and (12), we finally have:

$$ \left| \mu - x^\star \right| \ \le\ n^{-1} \frac{K_3}{2\beta_m^2} \tag{13} $$

Let us now show that µ_EP is also close to x⋆. We proceed similarly, starting from eq. (6), but applied to all hybrids h_i(x):

$$ \forall i \quad \left| \phi_i'(\mu_{EP}) + \beta_{-i}\, \mu_{EP} - r_{-i} \right| \ \le\ n^{-1} \frac{K_3}{2\beta_m} \tag{14} $$

which is not really equivalent to eq. (10) yet. Recall that q(x | r, β) has mean µ_EP: we thus have r = βµ_EP, which gives:

$$ \sum_i \beta_{-i}\, \mu_{EP} = \big( (n-1)\beta \big)\, \mu_{EP} = (n-1)\, r = \sum_i r_{-i} \tag{15} $$

If we sum all terms in eq. (14), the β_{−i}µ_EP and r_{−i} terms thus cancel, leaving us with:

$$ \left| \phi_p'(\mu_{EP}) \right| \ \le\ \frac{K_3}{2\beta_m} \tag{16} $$

which is the equivalent of eq. (10), but for µ_EP instead of µ. This shows that µ_EP is, like µ, close to x⋆:

$$ \left| \mu_{EP} - x^\star \right| \ \le\ n^{-1} \frac{K_3}{2\beta_m^2} \tag{17} $$

At this point, since both are close to x⋆ (eqs. (13) and (17)), we can conclude that µ = µ_EP + O(n⁻¹), which constitutes the first step of our computation of bounds on the quality of EP.

The next step is evaluating the quality of the approximation of the variance, by computing |v⁻¹ − v_EP⁻¹| for EP and |v⁻¹ − φ_p''(x⋆)| for the CGA, from eq. (8). In both cases, we find:

$$ v^{-1} = v_{EP}^{-1} + O(1) \tag{18} $$
$$ v^{-1} = \phi_p''(x^\star) + O(1) \tag{19} $$

Since v⁻¹ is of order n, because of eq. (5) (the Brascamp-Lieb upper bound on the variance), this is a decent approximation: the relative error is of order n⁻¹.

We find similarly that both EP and the CGA do a good job of approximating the fourth moment of p. For EP, this means that the fourth moments of each hybrid and of q are a close match:

$$ \forall i \quad m_4^p \ \approx\ m_4^i \ \approx\ 3 v_{EP}^2 \tag{20} $$
$$ \approx\ 3 \left[ \phi_p''(\mu) \right]^{-2} \tag{21} $$

In contrast, the third moment of each hybrid doesn't match the third moment of p at all, but their sum does:

$$ m_3^p \ \approx\ \sum_i m_3^i \tag{22} $$

Finally, we come back to the approximation of µ by µ_EP. These obey two very similar relationships:

$$ \phi_p'(\mu) + \phi_p^{(3)}(\mu)\, \frac{v}{2} = O\!\left( n^{-1} \right) \tag{23} $$
$$ \phi_p'(\mu_{EP}) + \phi_p^{(3)}(\mu_{EP})\, \frac{v_{EP}}{2} = O\!\left( n^{-1} \right) \tag{24} $$

Since v_EP = v + O(n⁻²) (a slight rephrasing of eq. (18)), we finally have:

$$ \mu = \mu_{EP} + O\!\left( n^{-2} \right) \tag{25} $$
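Before stating the theorem, here is an empirical illustration of these rates on a tractable toy problem of our own: n identical sites f_i(x) = x^a e^{−bx} on x > 0, so that the target is a Gamma(na+1, nb) whose mean (na+1)/(nb) and mode a/b are known in closed form. It reuses the hypothetical `gaussian_ep` sketch from section 1.3; the grid and site parameters are arbitrary choices.

```python
import numpy as np

a, b = 2.0, 2.0
grid = np.linspace(1e-4, 6.0, 8001)   # positive support for the Gamma target
for n in (4, 8, 16, 32):
    log_site = lambda x: a * np.log(x) - b * x   # one site; the target stacks n of them
    r, beta = gaussian_ep([log_site] * n, grid=grid)
    mu_ep = r.sum() / beta.sum()                 # EP mean
    mu = (n * a + 1) / (n * b)                   # true mean of Gamma(n*a + 1, n*b)
    x_star = a / b                               # mode of the target = CGA mean
    print(n, abs(mu - x_star), abs(mu - mu_ep))
# The CGA error (middle column) shrinks like 1/n; the EP error (last column)
# should shrink roughly like 1/n^2, matching eqs. (13) and (25).
```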
We summarize these results in the following theorem.

Theorem 1 (Characterizing fixed points of EP). Under the assumptions given by eqs. (1) and (2) (log-concave sites with slowly changing log-functions), we can bound the quality of the EP approximation and of the CGA:

$$ \left| \mu - x^\star \right| \ \le\ n^{-1} \frac{K_3}{2\beta_m^2} $$
$$ \left| \mu - \mu_{EP} \right| \ \le\ B_1(n) = O\!\left( n^{-2} \right) $$
$$ \left| v^{-1} - \phi_p''(x^\star) \right| \ \le\ \frac{2 K_3^2}{\beta_m^2} + \frac{K_4}{2\beta_m} $$
$$ \left| v^{-1} - v_{EP}^{-1} \right| \ \le\ B_2(n) = O(1) $$

We give the full expressions for the bounds B_1 and B_2 in the Supplement.

Note that the order of magnitude of the bound on |µ − x⋆| is the best possible, because it is attained for certain distributions. For example, consider a Gamma distribution with natural parameters (nα, nβ), whose mean α/β is approximated at order n⁻¹ by its mode α/β − 1/(nβ). More generally, from eq. (23), we can compute the first order of the error:

$$ \mu - x^\star \ \approx\ -\frac{\phi_p^{(3)}(\mu)\, v}{2\, \phi_p''(\mu)} \ \approx\ -\frac{1}{2} \frac{\phi_p^{(3)}(\mu)}{\left[ \phi_p''(\mu) \right]^2} \tag{26} $$

which is the term responsible for the order-n⁻¹ error. Whenever this term is significant, it is thus safe to conclude that EP improves on the CGA.

Also note that, since v⁻¹ is of order n, the relative error of the v⁻¹ approximation is of order n⁻¹ for both methods. Despite having a convergence rate of the same order, the EP approximation is demonstrably better than the CGA, as we show next. Let us first see why the approximation of v⁻¹ is only of order 1 for both methods. The following relationship holds:

$$ v^{-1} = \phi_p''(\mu) + \phi_p^{(3)}(\mu)\, \frac{m_3^p}{2v} + \phi_p^{(4)}(\mu)\, \frac{m_4^p}{3!\, v} + O\!\left( n^{-1} \right) \tag{27} $$

In this relationship, φ_p''(µ) is an order-n term while the rest are order 1. If we now compare this to the CGA approximation of v⁻¹, we find that it fails at multiple levels. First, it completely ignores the two order-1 terms; then, because it takes the value of φ_p'' at x⋆, which is at a distance of O(n⁻¹) from µ, it adds another order-1 error term (since φ_p^{(3)} = O(n)). The CGA thus adds quite a bit of error, even if each component is of order 1.

Meanwhile, v_EP obeys a relationship similar to eq. (27):

$$ v_{EP}^{-1} = \phi_p''(\mu_{EP}) + \sum_i \left[ \phi_i^{(3)}(\mu_{EP})\, \frac{m_3^i}{2 v_{EP}} \right] + \phi_p^{(4)}(\mu_{EP})\, \frac{3 v_{EP}^2}{3!\, v_{EP}} + O\!\left( n^{-1} \right) \tag{28} $$

We can see where the EP approximation produces errors. The φ_p'' term is well approximated: since |µ − µ_EP| = O(n⁻²), we have φ_p''(µ) = φ_p''(µ_EP) + O(n⁻¹). The term involving m_4 is also well approximated, and we can see that the only term that fails is the m_3 term. The order-1 error thus comes entirely from this term, which shows that EP performance suffers more from the skewness of the target distribution than from its kurtosis.

Finally, note that, with our result, we can get some intuition about the quality of the EP approximation under other metrics. For example, if the most interesting metric is the KL divergence KL(p, q), the excess KL divergence from using the EP approximation q instead of the true minimizer q_KL (which has the same mean µ and variance v as p) is given by:

$$ \Delta KL = \int p \log\frac{q_{KL}}{q} = \int p(x) \left( -\frac{(x-\mu)^2}{2v} + \frac{(x-\mu_{EP})^2}{2 v_{EP}} - \frac{1}{2} \log\frac{v}{v_{EP}} \right) dx \tag{29} $$
$$ = \frac{1}{2} \left[ \frac{v}{v_{EP}} - 1 - \log\frac{v}{v_{EP}} \right] + \frac{(\mu - \mu_{EP})^2}{2 v_{EP}} \tag{30} $$
$$ \approx \frac{1}{4} \left( \frac{v - v_{EP}}{v_{EP}} \right)^2 + \frac{(\mu - \mu_{EP})^2}{2 v_{EP}} \tag{31} $$

where eq. (30) follows from E_p[(x − µ_EP)²] = v + (µ − µ_EP)², and which we recognize as KL(q_KL, q). A similar formula gives the excess KL divergence from using the CGA instead of q_KL. For both methods, the variance term is of order n⁻² (though it should be smaller for EP), but the mean term is of order n⁻³ for EP while it is of order n⁻¹ for the CGA. Once again, EP is found to be the better approximation.
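The closed form (30) makes the excess KL easy to evaluate numerically. The snippet below is again our own illustration, built on the hypothetical `gaussian_ep` sketch and the Gamma toy target used earlier, and compares the excess KL of EP and of the CGA:

```python
import numpy as np

def excess_kl(mu_q, v_q, mu, v):
    # Eq. (30): KL(q_KL, q) between Gaussians with moments (mu, v) and (mu_q, v_q).
    return 0.5 * (v / v_q - 1.0 - np.log(v / v_q)) + (mu - mu_q) ** 2 / (2.0 * v_q)

a, b = 2.0, 2.0
grid = np.linspace(1e-4, 6.0, 8001)
for n in (4, 8, 16, 32):
    shape, rate = n * a + 1, n * b               # the target is Gamma(shape, rate)
    mu, v = shape / rate, shape / rate ** 2      # its true mean and variance
    r, beta = gaussian_ep([lambda x: a * np.log(x) - b * x] * n, grid=grid)
    kl_ep = excess_kl(r.sum() / beta.sum(), 1.0 / beta.sum(), mu, v)
    x_star = (shape - 1) / rate                  # mode; CGA variance = 1/phi_p''(x_star)
    kl_cga = excess_kl(x_star, x_star ** 2 / (shape - 1), mu, v)
    print(n, kl_ep, kl_cga)   # the CGA column should shrink like 1/n, EP much faster
```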
Finally, note that our bounds are quite pessimistic: the true error might be much smaller than we have predicted here. A first cause is the bounding of the derivatives of log p (eqs. (3) and (4)): while those bounds are correct, they can prove very pessimistic. For example, if the contributions of the sites to the higher derivatives cancel each other out, a much lower bound than nK_d might apply. Similarly, the curvature might admit a lower bound much higher than nβ_m.

Another cause is the bounding of the variance by the curvature. While applying Brascamp-Lieb requires the distribution to have high log-curvature everywhere, a distribution with high curvature close to the mode and low curvature in the tails still has very low variance: in such a case, the Brascamp-Lieb bound is very pessimistic.

In order to improve on our bounds, we will thus need tighter bounds on the log-derivatives of the hybrids and of the target distribution, but we will also need an extension of the Brascamp-Lieb result that can deal with those cases where a distribution is strongly log-concave around its mode but has much lower log-curvature in the tails.

3 Conclusion

EP has been used for quite some time without any concrete theoretical guarantees on its performance. In this work, we provide explicit performance bounds and show that EP is superior to the CGA, in the sense of giving provably better approximations of the mean and variance. There are now theoretical arguments for substituting EP for the CGA in a number of practical problems where the gain in precision is worth the increased computational cost. This work tackles the first steps of proving that EP offers an appropriate approximation. Continuing along its tracks will most likely lead to more general and less pessimistic bounds, but it remains an open question how to quantify the quality of the approximation under other distance measures. For example, it would be highly useful for machine learning if one could bound the prediction error incurred by using EP. We believe that our approach should extend to more general performance measures, and we plan to investigate this further in the future.

References

[1] Thomas P. Minka. Expectation Propagation for approximate Bayesian inference. In UAI '01: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pages 362-369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[2] Malte Kuss and Carl E. Rasmussen. Assessing Approximate Inference for Binary Gaussian Process Classification. Journal of Machine Learning Research, 6:1679-1704, December 2005.

[3] Hannes Nickisch and Carl E. Rasmussen. Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research, 9:2035-2078, October 2008. URL http://www.jmlr.org/papers/volume9/nickisch08a/nickisch08a.pdf.

[4] Guillaume Dehaene and Simon Barthelmé. Expectation propagation in the large-data limit. Technical report, March 2015. URL http://arxiv.org/abs/1503.08060.

[5] T. Minka. Divergence Measures and Message Passing. Technical report, 2005.

[6] M. Seeger. Expectation Propagation for Exponential Families. Technical report, 2005. URL http://people.mmci.uni-saarland.de/~mseeger/papers/epexpfam.pdf.

[7] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006.

[8] Jack Raymond, Andre Manoel, and Manfred Opper. Expectation propagation. September 2014. URL http://arxiv.org/abs/1409.6179.

[9] Anirban DasGupta. Asymptotic Theory of Statistics and Probability (Springer Texts in Statistics). Springer, 2008.
[10] Adrien Saumard and Jon A. Wellner. Log-concavity and strong log-concavity: a review. Statistics Surveys, 8:45-114, 2014. doi: 10.1214/14-SS107. URL http://dx.doi.org/10.1214/14-SS107.

[11] Herm J. Brascamp and Elliott H. Lieb. Best constants in Young's inequality, its converse, and its generalization to more than three functions. Advances in Mathematics, 20(2):151-173, May 1976. doi: 10.1016/0001-8708(76)90184-5. URL http://dx.doi.org/10.1016/0001-8708(76)90184-5.

Supplementary information for "Bounding errors of Expectation-Propagation"

A Improving on the Brascamp-Lieb bound

In this section, we detail our mathematical results concerning the extension of the Brascamp-Lieb bound. We note LC(x) = exp(−φ(x)) a log-concave distribution. We assume that φ is strongly convex and slowly changing, i.e.:

$$ \forall x \quad \phi''(x) \ \ge\ \beta_m \tag{32} $$
$$ \forall d \in \{3,4,5,6\} \quad \left| \phi^{(d)}(x) \right| \ \le\ K_d \tag{33} $$

A.1 The original Brascamp-Lieb theorem

Let µ_LC = E_LC(x) be the expected value of LC. The original Brascamp-Lieb result [11] bounds the fractional centered moments of LC by the corresponding fractional moments of a Gaussian of variance β_m⁻¹ centered at µ_LC. Noting g(x) = N(x | µ_LC, β_m⁻¹) that Gaussian, we have:

$$ \forall \alpha \ge 1 \quad E_{LC}\left( |x - \mu_{LC}|^\alpha \right) \ \le\ E_g\left( |x - \mu_{LC}|^\alpha \right) \tag{34} $$

However, we are not interested in their full result, but only in a restricted version of it concerning even moments. This version simply reads:

$$ \forall k \in \mathbb{N} \quad m_{2k} = E_{LC}\left( |x - \mu_{LC}|^{2k} \right) \ \le\ (2k-1)\, m_{2k-2}\, \beta_m^{-1} \tag{35} $$
$$ m_{2k} \ \le\ (2k-1)!!\ \beta_m^{-k} \tag{36} $$

where (2k−1)!! is the double factorial: the product of all odd terms between 1 and 2k−1. Eq. (35) might be a new result. Note that equality occurs only when f(x) = 1 and LC is Gaussian. Note also that the bounds on the higher derivatives of φ are not needed for this result, but only for our extension.

We offer here a proof of eq. (35) (of which eq. (36) is a trivial consequence) that is slightly different from Brascamp and Lieb's original proof. We believe this proof to be original, though it is still quite similar to theirs.

Proof. Let us decompose LC(x) into two parts:

• g(x) = N(x | µ_LC, β_m⁻¹), the bounding Gaussian with the same mean as LC;
• f(x) = LC(x)/g(x), the remainder.

f is easily shown to be log-concave, which means that it is unimodal. We note x⋆ the mode of f; f is increasing on (−∞, x⋆] and decreasing on [x⋆, ∞). We thus know the sign of f'(x):

$$ \operatorname{sign}\left( f'(x) \right) = \operatorname{sign}\left( x^\star - x \right) \tag{37} $$

Consider the integral ∫_{−∞}^{+∞} g(x) f'(x) dx. By integration by parts (or by Stein's lemma), since g'(x) = −β_m (x − µ_LC) g(x):

$$ \int_{-\infty}^{+\infty} g(x) f'(x)\, dx \ =\ \int_{-\infty}^{+\infty} g(x) f(x)\, \beta_m (x - \mu_{LC})\, dx \ =\ \beta_m \left( \mu_{LC} - \mu_{LC} \right) \ =\ 0 \tag{38} $$
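As a sanity check on this integration-by-parts step, here is a small numerical verification of eq. (38) on an asymmetric strongly log-concave density of our own choosing (β_m = 0.73 is the exact minimum of φ'' for this example):

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 200001)
dx = x[1] - x[0]
phi = x ** 2 / 2 + 0.3 * x ** 3 + x ** 4 / 4   # phi''(x) = 1 + 1.8x + 3x^2 >= 0.73
lc = np.exp(-phi)
lc /= lc.sum() * dx                             # normalize the density LC
mu = (x * lc).sum() * dx                        # mu_LC
beta_m = 0.73
g = np.sqrt(beta_m / (2 * np.pi)) * np.exp(-0.5 * beta_m * (x - mu) ** 2)
f = lc / g                                      # the log-concave remainder
print((g * np.gradient(f, dx)).sum() * dx)      # ~0, as eq. (38) predicts
```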
