An Improvement to the Domain Adaptation Bound in a PAC-Bayesian Context

Pascal Germain, IFT-GLO, Université Laval, Québec (QC), Canada ([email protected])
Amaury Habrard, LaHC, UMR CNRS 5516, Univ. of St-Etienne, France ([email protected])
François Laviolette, IFT-GLO, Université Laval, Québec (QC), Canada ([email protected])
Emilie Morvant, LaHC, UMR CNRS 5516, Univ. of St-Etienne, France ([email protected])

Abstract

This paper provides a theoretical analysis of domain adaptation based on the PAC-Bayesian theory. We propose an improvement of the previous domain adaptation bound obtained by Germain et al. [1] in two ways. We first give a generalization bound that is tighter and easier to interpret. Moreover, we provide a new analysis of the constant term appearing in the bound, which can be of high interest for developing new algorithmic solutions.

1 Introduction

Domain adaptation (DA) arises when the distribution generating the target data differs from the one that generated the source learning data. Classical theoretical analyses of domain adaptation provide generalization bounds on the expected risk, over the target domain, of a classifier belonging to a hypothesis class $\mathcal{H}$ [2, 3, 4]. Recently, Germain et al. have given a generalization bound expressed as an averaging over the classifiers in $\mathcal{H}$, using the PAC-Bayesian theory [1]. In this paper, we derive a new PAC-Bayesian domain adaptation bound that improves the previous result of [1]. Moreover, we provide an analysis of the constant term appearing in the bound, which opens the door to the design of new algorithms able to control this term. The paper is organized as follows. We introduce the classical PAC-Bayesian theory in Section 2. We present the domain adaptation bound obtained in [1] in Section 3. Section 4 presents our new results.

2 PAC-Bayesian Setting in Supervised Learning

In the non-adaptive setting, the PAC-Bayesian theory [5] offers generalization bounds (and algorithms) for weighted majority votes over a set of functions, called voters. Let $X \subseteq \mathbb{R}^d$ be the input space of dimension $d$ and $Y = \{-1,+1\}$ be the output space. A domain $P_s$ is an unknown distribution over $X \times Y$. The marginal distribution of $P_s$ over $X$ is denoted by $D_s$. Let $\mathcal{H}$ be a set of $n$ voters such that $\forall h \in \mathcal{H},\ h : X \to Y$, and let $\pi$ be a prior on $\mathcal{H}$. A prior is a probability distribution on $\mathcal{H}$ that "models" some a priori knowledge on the quality of the voters of $\mathcal{H}$.

Then, given a learning sample $S = \{(x_i, y_i)\}_{i=1}^{m_s}$ drawn independently and identically distributed (i.i.d.) according to the distribution $P_s$, the aim of the PAC-Bayesian learner is to find a posterior distribution $\rho$ leading to a $\rho$-weighted majority vote $B_\rho$ over $\mathcal{H}$ that has the lowest possible expected risk, i.e., the lowest probability of making an error on future examples drawn from $P_s$. More precisely, the vote $B_\rho$ and its true and empirical risks are defined as follows.

Definition 1. Let $\mathcal{H}$ be a set of voters. Let $\rho$ be a distribution over $\mathcal{H}$. The $\rho$-weighted majority vote $B_\rho$ (sometimes called the Bayes classifier) is
\[
\forall x \in X, \quad B_\rho(x) \;\overset{\mathrm{def}}{=}\; \mathrm{sign}\Big[ \mathop{\mathbf{E}}_{h\sim\rho} h(x) \Big].
\]
The true risk of $B_\rho$ on a domain $P_s$ and its empirical risk on an $m_s$-sample $S$ are respectively
\[
R_{P_s}(B_\rho) \;\overset{\mathrm{def}}{=}\; \mathop{\mathbf{E}}_{(x,y)\sim P_s} \mathbf{I}[B_\rho(x) \neq y],
\quad\text{and}\quad
R_S(B_\rho) \;\overset{\mathrm{def}}{=}\; \frac{1}{m_s}\sum_{i=1}^{m_s} \mathbf{I}[B_\rho(x_i) \neq y_i],
\]
where $\mathbf{I}[a \neq b]$ is the 0-1 loss function returning 1 if $a \neq b$ and 0 otherwise.

Usual PAC-Bayesian analyses [5, 6, 7, 8, 9] do not directly focus on the risk of $B_\rho$, but bound the risk of the closely related stochastic Gibbs classifier $G_\rho$. It predicts the label of an example $x$ by first drawing a classifier $h$ from $\mathcal{H}$ according to $\rho$, and then returning $h(x)$. Thus, the true risk of $G_\rho$ and its empirical risk on an $m_s$-sample $S$ correspond to the expectation of the risks over $\mathcal{H}$ according to $\rho$:
\[
R_{P_s}(G_\rho) \;\overset{\mathrm{def}}{=}\; \mathop{\mathbf{E}}_{h\sim\rho} R_{P_s}(h) \;=\; \mathop{\mathbf{E}}_{(x,y)\sim P_s}\, \mathop{\mathbf{E}}_{h\sim\rho} \mathbf{I}[h(x) \neq y],
\quad\text{and}\quad
R_S(G_\rho) \;\overset{\mathrm{def}}{=}\; \mathop{\mathbf{E}}_{h\sim\rho} R_S(h) \;=\; \frac{1}{m_s}\sum_{i=1}^{m_s} \mathop{\mathbf{E}}_{h\sim\rho} \mathbf{I}[h(x_i) \neq y_i].
\]
Note that it is well known in the PAC-Bayesian literature that the risk of the deterministic classifier $B_\rho$ and the risk of the stochastic classifier $G_\rho$ are related by $R_{P_s}(B_\rho) \leq 2\, R_{P_s}(G_\rho)$.
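To make the distinction between $B_\rho$ and $G_\rho$ concrete, here is a small self-contained NumPy sketch (not taken from the paper; the decision-stump voters, the softmax posterior and every variable name are illustrative choices) that computes the empirical majority-vote risk $R_S(B_\rho)$ and the empirical Gibbs risk $R_S(G_\rho)$ on a toy one-dimensional sample, and checks the relation $R_S(B_\rho) \leq 2\,R_S(G_\rho)$, which holds for any distribution and hence also for the empirical one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D source sample: label = sign(x), flipped with 10% probability.
m_s = 500
X = rng.uniform(-1.0, 1.0, size=m_s)
y = np.sign(X)
flip = rng.random(m_s) < 0.1
y[flip] *= -1.0

# Voters: decision stumps h_t(x) = sign(x - t) on a grid of thresholds t.
thresholds = np.linspace(-0.9, 0.9, 19)
H_out = np.sign(X[None, :] - thresholds[:, None])   # shape (n_voters, m_s)

# Posterior rho over the voters (here: a softmax of the negative empirical risks).
voter_risks = np.mean(H_out != y[None, :], axis=1)
rho = np.exp(-20.0 * voter_risks)
rho /= rho.sum()

# Empirical Gibbs risk R_S(G_rho): rho-weighted average of the voters' risks.
gibbs_risk = float(rho @ voter_risks)

# Empirical majority-vote risk R_S(B_rho): risk of sign(E_{h~rho} h(x)).
majority_vote = np.sign(rho @ H_out)
bayes_risk = float(np.mean(majority_vote != y))

print(f"R_S(G_rho) = {gibbs_risk:.3f}")
print(f"R_S(B_rho) = {bayes_risk:.3f} <= 2 * R_S(G_rho) = {2.0 * gibbs_risk:.3f}")
```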
3 PAC-Bayesian Domain Adaptation of the Gibbs Classifier

Throughout the rest of this paper, we consider the PAC-Bayesian DA setting introduced by Germain et al. [1]. The main difference between supervised learning and DA is that we have two different domains over $X \times Y$: the source domain $P_s$ and the target domain $P_t$ ($D_s$ and $D_t$ are the respective marginals over $X$). The aim is then to learn a good model on the target domain $P_t$, knowing that we only have label information from the source domain $P_s$. Concretely, in the setting described in [1], we have a labeled source sample $S = \{(x_i, y_i)\}_{i=1}^{m_s}$ drawn i.i.d. from $P_s$, and an unlabeled target sample $T = \{x_j\}_{j=1}^{m_t}$ drawn i.i.d. from $D_t$. One thus desires to learn from $S$ and $T$ a weighted majority vote with the lowest possible expected risk $R_{P_t}(B_\rho)$ on the target domain, i.e., with good generalization guarantees on $P_t$. Recalling that usual PAC-Bayesian generalization bounds study the risk of the Gibbs classifier, Germain et al. [1] have analyzed its target risk $R_{P_t}(G_\rho)$; their analysis also relies on the notion of disagreement between the voters:
\[
R_D(h, h') \;\overset{\mathrm{def}}{=}\; \mathop{\mathbf{E}}_{x\sim D} \mathbf{I}[h(x) \neq h'(x)]. \tag{1}
\]
For two distributions $\rho$ and $\rho'$ on $\mathcal{H}$, we write $R_D(G_\rho, G_{\rho'}) \overset{\mathrm{def}}{=} \mathbf{E}_{h\sim\rho}\, \mathbf{E}_{h'\sim\rho'} R_D(h, h')$ for the expected disagreement between the associated Gibbs classifiers. Their main result is the following theorem.

Theorem 1 (Theorem 4 of [1]). Let $\mathcal{H}$ be a set of voters. For every distribution $\rho$ over $\mathcal{H}$, we have
\[
R_{P_t}(G_\rho) \;\leq\; R_{P_s}(G_\rho) + \mathrm{dis}_\rho(D_s, D_t) + \lambda_{\rho,\rho^*_T}, \tag{2}
\]
where $\mathrm{dis}_\rho(D_s, D_t)$ is the domain disagreement between the marginals $D_s$ and $D_t$,
\[
\mathrm{dis}_\rho(D_s, D_t) \;\overset{\mathrm{def}}{=}\; \Big| \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} \big( R_{D_s}(h, h') - R_{D_t}(h, h') \big) \Big|, \tag{3}
\]
with $\rho^2(h, h') = \rho(h) \times \rho(h')$, and where
\[
\lambda_{\rho,\rho^*_T} \;=\; R_{P_t}(G_{\rho^*_T}) + R_{D_t}(G_\rho, G_{\rho^*_T}) + R_{D_s}(G_\rho, G_{\rho^*_T}),
\]
with $\rho^*_T = \operatorname{argmin}_\rho R_{P_t}(G_\rho)$ the best distribution on the target domain.

Note that this bound reflects the usual philosophy in DA: it is well known that a favorable situation for DA arises when the divergence between the domains is small while a good source performance is achieved [2, 3, 4]. Germain et al. [1] have then derived a first promising algorithm, called PBDA, for minimizing this trade-off between source risk and domain disagreement.

Note that Germain et al. [1] also showed that, for a given hypothesis class $\mathcal{H}$, the domain disagreement of Equation (3) is always smaller than the $\mathcal{H}\Delta\mathcal{H}$-distance of Ben-David et al. [2, 3], defined by $\frac{1}{2}\sup_{(h,h')\in\mathcal{H}^2} \big| R_{D_t}(h, h') - R_{D_s}(h, h') \big|$.

4 New Results

4.1 Improvement of Theorem 1

First, we introduce the notion of expected joint error of a pair of classifiers $(h, h')$ drawn according to the distribution $\rho$, defined as
\[
e_{P_s}(G_\rho, G_\rho) \;\overset{\mathrm{def}}{=}\; \mathop{\mathbf{E}}_{(h,h')\sim\rho^2}\, \mathop{\mathbf{E}}_{(x,y)\sim P_s} \mathbf{I}[h(x) \neq y] \times \mathbf{I}[h'(x) \neq y]. \tag{4}
\]
Theorem 2 below relies on the domain disagreement of Eq. (3) and on the expected joint error of Eq. (4).

Theorem 2. Let $\mathcal{H}$ be a hypothesis class. We have
\[
\forall \rho \text{ on } \mathcal{H}, \quad R_{P_t}(G_\rho) \;\leq\; R_{P_s}(G_\rho) + \tfrac{1}{2}\,\mathrm{dis}_\rho(D_s, D_t) + \lambda_\rho, \tag{5}
\]
where $\lambda_\rho$ is the deviation between the expected joint errors of $G_\rho$ on the target and source domains:
\[
\lambda_\rho \;\overset{\mathrm{def}}{=}\; \big| e_{P_t}(G_\rho, G_\rho) - e_{P_s}(G_\rho, G_\rho) \big|. \tag{6}
\]

Proof. First, note that for any distribution $P$ on $X \times Y$, with marginal distribution $D$ on $X$, we have
\[
R_P(G_\rho) \;=\; \tfrac{1}{2}\, R_D(G_\rho, G_\rho) + e_P(G_\rho, G_\rho),
\]
as
\begin{align*}
2\, R_P(G_\rho) &= \mathop{\mathbf{E}}_{(h,h')\sim\rho^2}\, \mathop{\mathbf{E}}_{(x,y)\sim P} \Big( \mathbf{I}[h(x) \neq y] + \mathbf{I}[h'(x) \neq y] \Big) \\
&= \mathop{\mathbf{E}}_{(h,h')\sim\rho^2}\, \mathop{\mathbf{E}}_{(x,y)\sim P} \Big( 1 \times \mathbf{I}[h(x) \neq h'(x)] + 2 \times \mathbf{I}[h(x) \neq y]\, \mathbf{I}[h'(x) \neq y] \Big) \\
&= R_D(G_\rho, G_\rho) + 2 \times e_P(G_\rho, G_\rho).
\end{align*}
Therefore,
\begin{align*}
R_{P_t}(G_\rho) - R_{P_s}(G_\rho)
&= \tfrac{1}{2}\Big( R_{D_t}(G_\rho, G_\rho) - R_{D_s}(G_\rho, G_\rho) \Big) + \Big( e_{P_t}(G_\rho, G_\rho) - e_{P_s}(G_\rho, G_\rho) \Big) \\
&\leq \tfrac{1}{2}\Big| R_{D_t}(G_\rho, G_\rho) - R_{D_s}(G_\rho, G_\rho) \Big| + \Big| e_{P_t}(G_\rho, G_\rho) - e_{P_s}(G_\rho, G_\rho) \Big| \\
&= \tfrac{1}{2}\,\mathrm{dis}_\rho(D_s, D_t) + \lambda_\rho. \qquad\square
\end{align*}
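All the terms of Theorem 2 except $\lambda_\rho$ (which involves the unknown target labels) can be estimated from the samples $S$ and $T$. The sketch below is a purely illustrative companion to the bound, not code from the paper; the matrix encoding of the voters' outputs and all function names are our own assumptions. It estimates the source Gibbs risk, the domain disagreement $\mathrm{dis}_\rho(S, T)$ of Eq. (3), and the joint error of Eq. (4).

```python
import numpy as np

def gibbs_risk(rho, H_out, y):
    # Empirical Gibbs risk: rho-weighted average of the voters' empirical risks.
    # H_out[i, k] = h_i(x_k) in {-1, +1}; y[k] in {-1, +1}.
    return float(rho @ np.mean(H_out != y[None, :], axis=1))

def expected_disagreement(rho, H_out):
    # E_{(h,h')~rho^2} R(h, h') on one (unlabeled) sample, using
    # I[h(x) != h'(x)] = (1 - h(x) h'(x)) / 2.
    m = H_out.shape[1]
    C = (H_out @ H_out.T) / m          # C[i, j] = (1/m) sum_k h_i(x_k) h_j(x_k)
    return float(0.5 * (1.0 - rho @ C @ rho))

def domain_disagreement(rho, H_src, H_tgt):
    # Empirical dis_rho(S, T) of Eq. (3): absolute difference between the
    # expected disagreements measured on the source and target samples.
    return abs(expected_disagreement(rho, H_src) - expected_disagreement(rho, H_tgt))

def joint_error(rho, H_out, y):
    # Empirical joint error of Eq. (4):
    # E_{(h,h')~rho^2} (1/m) sum_k I[h(x_k) != y_k] I[h'(x_k) != y_k].
    err = (H_out != y[None, :]).astype(float)   # err[i, k] = I[h_i(x_k) != y_k]
    m = err.shape[1]
    return float(rho @ ((err @ err.T) / m) @ rho)
```

Given output matrices `H_src` and `H_tgt` of the voters on $S$ and $T$ and the source labels `y_src`, the observable part of the right-hand side of Eq. (5) would then read `gibbs_risk(rho, H_src, y_src) + 0.5 * domain_disagreement(rho, H_src, H_tgt)`.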
The improvement of Theorem 2 over Theorem 1 relies on two main points. On the one hand, our new result contains only half of $\mathrm{dis}_\rho(D_s, D_t)$. On the other hand, contrary to $\lambda_{\rho,\rho^*_T}$ of Eq. (2), the term $\lambda_\rho$ of Eq. (5) does not depend anymore on the best $\rho^*_T$ on the target domain. This implies that our new bound does not degenerate when the two distributions $P_s$ and $P_t$ are equal (or very close). Conversely, when $P_s = P_t$, the bound of Theorem 1 gives
\[
R_{P_t}(G_\rho) \;\leq\; R_{P_t}(G_\rho) + R_{P_t}(G_{\rho^*_T}) + 2\, R_{D_t}(G_\rho, G_{\rho^*_T}),
\]
which is at least $2\, R_{P_t}(G_{\rho^*_T})$ (since $\rho^*_T$ minimizes the target risk, $R_{P_t}(G_\rho) \geq R_{P_t}(G_{\rho^*_T})$). Moreover, the term $2\, R_{D_t}(G_\rho, G_{\rho^*_T})$ is greater than zero for any $\rho$ when the support of $\rho$ and $\rho^*_T$ in $\mathcal{H}$ consists of at least two different classifiers.

4.2 A New PAC-Bayesian Bound

Note that the improvements introduced by Theorem 2 do not change the form and the philosophy of the PAC-Bayesian theorems previously presented by Germain et al. [1]. Indeed, following the same proof technique, we obtain the following PAC-Bayesian domain adaptation bound.

Theorem 3. For any domains $P_s$ and $P_t$ (resp. with marginals $D_s$ and $D_t$) over $X \times Y$, any set of hypotheses $\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, and any real numbers $\alpha > 0$ and $c > 0$, with a probability at least $1-\delta$ over the choice of $S \times T \sim (P_s \times D_t)^m$, for every posterior distribution $\rho$ on $\mathcal{H}$, we have
\[
R_{P_t}(G_\rho) \;\leq\; c'\, R_S(G_\rho) + \alpha'\, \tfrac{1}{2}\,\mathrm{dis}_\rho(S, T)
+ \left( \frac{c'}{c} + \frac{\alpha'}{\alpha} \right) \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta}}{m}
+ \lambda_\rho + \tfrac{1}{2}(\alpha' - 1),
\]
where $\lambda_\rho$ is defined by Eq. (6), and where $c' \overset{\mathrm{def}}{=} \dfrac{c}{1 - e^{-c}}$ and $\alpha' \overset{\mathrm{def}}{=} \dfrac{2\alpha}{1 - e^{-2\alpha}}$.

4.3 On the Estimation of the Unknown Term $\lambda_\rho$

The next proposition gives an upper bound on the term $\lambda_\rho$ of Theorems 2 and 3.

Proposition 4. Let $\mathcal{H}$ be the hypothesis space. If we suppose that $P_s$ and $P_t$ share the same support, then
\[
\forall \rho \text{ on } \mathcal{H}, \quad \lambda_\rho \;\leq\; \sqrt{ \chi^2\!\left(P_t \,\|\, P_s\right)\, e_{P_s}(G_\rho, G_\rho) },
\]
where $e_{P_s}(G_\rho, G_\rho)$ is the expected joint error on the source distribution, as defined by Eq. (4), and $\chi^2\!\left(P_t \,\|\, P_s\right) \overset{\mathrm{def}}{=} \mathbf{E}_{(x,y)\sim P_s} \big( \frac{P_t(x,y)}{P_s(x,y)} - 1 \big)^2$ is the chi-squared divergence between the target and the source distributions.

Proof. Supposing that $P_t$ and $P_s$ have the same support, we can upper bound $\lambda_\rho$ using the Cauchy-Schwarz inequality, which gives the fourth line from the third one:
\begin{align*}
\lambda_\rho &= \left| \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} \left[ \mathop{\mathbf{E}}_{(x,y)\sim P_t} \mathbf{I}[h(x)\neq y]\,\mathbf{I}[h'(x)\neq y] - \mathop{\mathbf{E}}_{(x,y)\sim P_s} \mathbf{I}[h(x)\neq y]\,\mathbf{I}[h'(x)\neq y] \right] \right| \\
&= \left| \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} \left[ \mathop{\mathbf{E}}_{(x,y)\sim P_s} \frac{P_t(x,y)}{P_s(x,y)}\, \mathbf{I}[h(x)\neq y]\,\mathbf{I}[h'(x)\neq y] - \mathop{\mathbf{E}}_{(x,y)\sim P_s} \mathbf{I}[h(x)\neq y]\,\mathbf{I}[h'(x)\neq y] \right] \right| \\
&= \left| \mathop{\mathbf{E}}_{(h,h')\sim\rho^2}\, \mathop{\mathbf{E}}_{(x,y)\sim P_s} \left( \frac{P_t(x,y)}{P_s(x,y)} - 1 \right) \mathbf{I}[h(x)\neq y]\,\mathbf{I}[h'(x)\neq y] \right| \\
&\leq \sqrt{ \mathop{\mathbf{E}}_{(x,y)\sim P_s} \left( \frac{P_t(x,y)}{P_s(x,y)} - 1 \right)^2 } \times \sqrt{ \mathop{\mathbf{E}}_{(h,h')\sim\rho^2}\, \mathop{\mathbf{E}}_{(x,y)\sim P_s} \big( \mathbf{I}[h(x)\neq y]\,\mathbf{I}[h'(x)\neq y] \big)^2 } \\
&\leq \sqrt{ \mathop{\mathbf{E}}_{(x,y)\sim P_s} \left( \frac{P_t(x,y)}{P_s(x,y)} - 1 \right)^2 } \times \sqrt{ \mathop{\mathbf{E}}_{(h,h')\sim\rho^2}\, \mathop{\mathbf{E}}_{(x,y)\sim P_s} \mathbf{I}[h(x)\neq y]\,\mathbf{I}[h'(x)\neq y] } \\
&= \sqrt{ \mathop{\mathbf{E}}_{(x,y)\sim P_s} \left( \frac{P_t(x,y)}{P_s(x,y)} - 1 \right)^2 e_{P_s}(G_\rho, G_\rho) }
\;=\; \sqrt{ \chi^2\!\left(P_t \,\|\, P_s\right)\, e_{P_s}(G_\rho, G_\rho) }. \qquad\square
\end{align*}

This result indicates that $\lambda_\rho$ can be controlled by the term $e_{P_s}(G_\rho, G_\rho)$, which can be estimated from samples, and by the chi-squared divergence between the two distributions, which we could try to estimate in an unsupervised way or, maybe more appropriately, use as a constant to tune, expressing a trade-off between the two distributions. This opens the door to deriving new learning algorithms for domain adaptation, with the hope of controlling in part some negative transfer.
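As a rough illustration of how the bound of Proposition 4 could be evaluated, the following sketch (again not from the paper; the covariate-shift scenario, the Gaussian marginals and the plugged-in value of $e_{P_s}(G_\rho, G_\rho)$ are all hypothetical choices) estimates $\chi^2(P_t \| P_s)$ by Monte Carlo when the two densities are assumed known, and combines it with an estimate of the source joint error to bound $\lambda_\rho$.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Illustrative covariate-shift scenario: the labeling is shared, so the joint
# density ratio P_t(x, y) / P_s(x, y) reduces to the ratio of the marginals,
# here two 1-D Gaussians (an assumption made only for this sketch).
mu_s, sigma_s = 0.0, 1.0
mu_t, sigma_t = 0.5, 1.0

# Monte Carlo estimate of chi^2(P_t || P_s) = E_{P_s}[(ratio - 1)^2].
m = 200_000
x_s = rng.normal(mu_s, sigma_s, size=m)
ratio = gaussian_pdf(x_s, mu_t, sigma_t) / gaussian_pdf(x_s, mu_s, sigma_s)
chi2 = float(np.mean((ratio - 1.0) ** 2))

# Suppose the expected joint error on the source, e_{P_s}(G_rho, G_rho), has been
# estimated from the labeled source sample (see the earlier sketch); we plug in a value.
e_source = 0.05

# Upper bound of Proposition 4 on the unknown term lambda_rho.
lambda_upper = float(np.sqrt(chi2 * e_source))
print(f"chi^2(P_t || P_s) ~ {chi2:.3f}, bound on lambda_rho ~ {lambda_upper:.3f}")
```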
References

[1] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In ICML, pages 738-746, 2013.
[2] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137-144, 2006.
[3] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151-175, 2010.
[4] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Conference on Learning Theory, pages 19-30, 2009.
[5] D. A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355-363, 1999.
[6] D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5-21, 2003.
[7] M. Seeger. PAC-Bayesian generalization bounds for Gaussian processes. Journal of Machine Learning Research, 3:233-269, 2002.
[8] O. Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning, volume 56. Institute of Mathematical Statistics, 2007.
[9] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning, 2009.