Achieving Privacy in the Adversarial Multi-Armed Bandit

Aristide C. Y. Tossou
Chalmers University of Technology
Gothenburg, Sweden
[email protected]

Christos Dimitrakakis
University of Lille, France; Chalmers University of Technology, Sweden; Harvard University, USA
[email protected]

arXiv:1701.04222v1 [cs.LG] 16 Jan 2017

Copyright (c) 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In this paper, we improve the previously best known regret bound to achieve ε-differential privacy in oblivious adversarial bandits from O(T^{2/3}/ε) to O(√T ln T/ε). This is achieved by combining a Laplace mechanism with EXP3. We show that, though EXP3 is already differentially private, it leaks a linear amount of information in T. However, we can improve this privacy by relying on its intrinsic exponential mechanism for selecting actions. This allows us to reach O(√ln T)-DP, with a regret of O(T^{2/3}) that holds against an adaptive adversary, an improvement from the best known bound of O(T^{3/4}). This is done by using an algorithm that runs EXP3 in a mini-batch loop. Finally, we run experiments that clearly demonstrate the validity of our theoretical analysis.

1 Introduction

We consider multi-armed bandit problems in the adversarial setting, whereby an agent selects one from a number of alternatives (called arms) at each round and receives a gain that depends on its choice. The agent's goal is to maximize its total gain over time. There are two main settings for the bandit problem. In the stochastic one, the gains of each arm are generated i.i.d. by some unknown probability law. In the adversarial setting, which is the focus of this paper, the gains are generated adversarially. We are interested in finding algorithms with a total gain over T rounds not much smaller than that of an oracle with additional knowledge about the problem. In both settings, algorithms that achieve the optimal (problem-independent) regret bound of O(√T) are known (Auer, Cesa-Bianchi, and Fischer 2002; Burnetas and Katehakis 1996; Pandey and Olston 2006; Thompson 1933; Auer et al. 2003; Auer 2002; Agrawal and Goyal 2012).

This problem is a model for many applications where there is a need for trading off exploration and exploitation. This is so because, whenever we make a choice, we only observe the gain generated by that choice, and not the gains that we could have obtained otherwise. An example is clinical trials, where arms correspond to different treatments or tests, and the goal is to maximize the number of cured patients over time while being uncertain about the effects of treatments. Other problems, such as search engine advertisement and movie recommendations, can be formalized similarly (Pandey and Olston 2006).

Privacy can be a serious issue in the bandit setting (cf. (Jain, Kothari, and Thakurta 2012; Thakurta and Smith 2013; Mishra and Thakurta 2015; Zhao et al. 2014)). For example, in clinical trials, we may want to detect and publish results about the best drug without leaking sensitive information, such as the patient's health condition and genome. Differential privacy (Dwork 2006) formally bounds the amount of information that a third party can learn, no matter their power or side information.

Differential privacy has been used before in the stochastic setting (Tossou and Dimitrakakis 2016; Mishra and Thakurta 2015; Jain, Kothari, and Thakurta 2012), where the authors obtain optimal algorithms up to logarithmic factors. In the adversarial setting, (Thakurta and Smith 2013) adapt an algorithm called Follow The Approximate Leader to make it private and obtain a regret bound of O(T^{2/3}). In this work, we show that a number of simple algorithms can satisfy privacy guarantees while achieving nearly optimal regret (up to logarithmic factors) that scales naturally with the level of privacy desired.

Our work is also of independent interest for non-private multi-armed bandit algorithms, as they are competitive with the current state of the art against switching-cost adversaries (where we recover the optimal bound). Finally, we provide rigorous empirical results against a variety of adversaries.

The following section gives the main background and notation. Section 3.1 describes meta-algorithms that perturb the gain sequence to achieve privacy, while Section 3.2 explains how to leverage the privacy inherent in the EXP3 algorithm by modifying the way gains are used. Section 4 compares our algorithms with EXP3 in a variety of settings. The full proofs of all our main results are in the full version.
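The interaction protocol just described can be sketched in a few lines. The following is an illustrative simulation, not code from the paper: the uniformly random agent, the random gain table, and all names are stand-ins.

```python
import random

def play_game(T, K, gains, agent):
    """Run one oblivious adversarial bandit game.

    gains[t][i] is fixed before the game starts (oblivious adversary);
    the agent only ever observes the gain of the arm it actually pulls.
    """
    total = 0.0
    for t in range(T):
        arm = agent(t)
        total += gains[t][arm]
    # Regret against the fixed oracle: best single arm in hindsight.
    best_fixed = max(sum(gains[t][i] for t in range(T)) for i in range(K))
    return total, best_fixed - total

random.seed(0)
T, K = 1000, 4
gains = [[random.random() for _ in range(K)] for _ in range(T)]
total, regret = play_game(T, K, gains, lambda t: random.randrange(K))
```

A uniformly random agent already collects roughly T/2 on such uniform gains; the algorithms studied below aim to make the regret term grow only as O(√T) rather than linearly.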
2 Preliminaries

2.1 The Multi-Armed Bandit problem

Formally, a bandit game is defined between an adversary and an agent as follows: there is a set of K arms A, and at each round t, the agent plays an arm I_t ∈ A. Given the choice I_t, the adversary grants the agent a gain g_{I_t,t} ∈ [0,1]. The agent only observes the gain of arm I_t, and not that of any other arm. The goal of this agent is to maximize its total gain after T rounds, Σ_{t=1}^T g_{I_t,t}. A randomized bandit algorithm Λ : (A × [0,1])* → D(A) maps every arm-gain history to a distribution over the next arm to take.

The nature of the adversary, and specifically how the gains are generated, determines the nature of the game. For the stochastic adversary (Thompson 1933; Auer, Cesa-Bianchi, and Fischer 2002), the gain obtained at round t is generated i.i.d. from a distribution P_{I_t}. The more general fully oblivious adversary (Audibert and Bubeck 2010) generates the gains independently at round t, but not necessarily identically, from a distribution P_{I_t,t}. Finally, we have the oblivious adversary (Auer et al. 2003), whose only constraint is to generate the gain g_{I_t,t} as a function of the current action I_t only, i.e. ignoring previous actions and gains.

While focusing on oblivious adversaries, we discovered that by targeting differential privacy we can also compete against the stronger m-bounded memory adaptive adversary (Cesa-Bianchi, Dekel, and Shamir 2013; Merhav et al. 2002; Dekel, Tewari, and Arora 2012), who can use up to the last m gains. The oblivious adversary is a special case with m = 0. Another special case of this adversary is the one with switching costs, who penalises the agent whenever he switches arms by giving the lowest possible gain of 0 (here m = 1).

Regret. Relying on the cumulative gain of an agent to evaluate its performance can be misleading. Indeed, consider the case where an adversary gives a zero gain for all arms at every round. The cumulative gain of the agent would look bad, but no other agent could have done better. This is why one compares the gap between the agent's cumulative gain and the one obtained by some hypothetical agent, called an oracle, with additional information or computational power. This gap is called the regret.

There are also variants of the oracle considered in the literature. The most common variant is the fixed oracle, which always plays the best fixed arm in hindsight. The regret R against this oracle is:

    R = max_{i=1,...,K} Σ_{t=1}^T g_{i,t} − Σ_{t=1}^T g_{I_t,t}

In practice, we either prove a high-probability bound on R or a bound on the expected value E R with:

    E R = E [ max_{i=1,...,K} Σ_{t=1}^T g_{i,t} − Σ_{t=1}^T g_{I_t,t} ]

where the expectation is taken with respect to the random choices of both the agent and the adversary. There are other oracles, such as the shifting oracle, but those are out of the scope of this paper.

EXP3. The Exponential-weight algorithm for Exploration and Exploitation (EXP3 (Auer et al. 2003)) achieves the optimal bound (up to logarithmic factors) of O(√(T K ln K)) for the weak regret (i.e. the expected regret compared to the fixed oracle) against an oblivious adversary. EXP3 simply maintains an estimate G̃_{i,t} of the cumulative gain of arm i up to round t, with G̃_{i,t} = Σ_{s=1}^t (g_{i,s}/p_{i,s}) 1{I_s = i}, where

    p_{i,t} = (1 − γ) exp((γ/K) G̃_{i,t}) / Σ_{j=1}^K exp((γ/K) G̃_{j,t}) + γ/K    (2.1)

with γ a well-defined constant. Finally, EXP3 plays one action randomly according to the probability distribution p_t = {p_{1,t}, ..., p_{K,t}}, with p_{i,t} as defined above.

2.2 Differential Privacy

The following definition (from (Tossou and Dimitrakakis 2016)) specifies what is meant when we call a bandit algorithm differentially private at a single round t:

Definition 2.1 (Single-round (ε, δ)-differentially private bandit algorithm). A randomized bandit algorithm Λ is (ε, δ)-differentially private at round t if, for all sequences g_{1:t−1} and g'_{1:t−1} that differ in at most one round, we have for any action subset S ⊆ A:

    P_Λ(I_t ∈ S | g_{1:t−1}) ≤ δ + P_Λ(I_t ∈ S | g'_{1:t−1}) e^ε,    (2.2)

where P_Λ denotes the probability distribution specified by the algorithm and g_{1:t−1} = {g_1, ..., g_{t−1}}, with g_s the gains of all arms at round s. When δ = 0, the algorithm is said to be ε-differentially private.

The ε and δ parameters quantify the amount of privacy loss. Lower (ε, δ) indicates higher privacy, and consequently we will also refer to (ε, δ) as the privacy loss. Definition 2.1 means that the output of the bandit algorithm at round t is almost insensitive to any single change in the gain sequence. This implies that whether we remove a single round or replace its gains, the bandit algorithm will still play almost the same action. Assuming the gains at round t are linked to a user's private data (for example, his cancer status or the advertisement he clicked), the definition preserves the privacy of that user against any third party looking at the output. This is the case because the choices or the participation of that user would almost not affect the output. Equation (2.2) specifies how much the output is affected by a single user.

We would like Definition 2.1 to hold for all rounds, so as to protect the privacy of all users. If it does for some (ε, δ), then we say the algorithm has a per-round or instantaneous privacy loss of (ε, δ). Such an algorithm also has a cumulative privacy loss of at most (ε', δ'), with ε' = εT and δ' = δT, after T steps. Our goal is to design bandit algorithms such that their cumulative privacy loss (ε', δ') is as low as possible while simultaneously achieving a very low regret. In practice, we would like ε' and the regret to be sub-linear, while δ' should be a very small quantity. Definition 2.2 formalizes clearly the meaning of this cumulative privacy loss and, for ease of presentation, we will drop the term "cumulative" when referring to it.

Definition 2.2 ((ε, δ)-differentially private bandit algorithm). A randomized bandit algorithm Λ is (ε, δ)-differentially private up to round t if, for all g_{1:t−1} and g'_{1:t−1} that differ in at most one round, we have for any action subset S ⊆ A^t:

    P_Λ(I_{1:t} ∈ S | g_{1:t−1}) ≤ δ + P_Λ(I_{1:t} ∈ S | g'_{1:t−1}) e^ε,    (2.3)

where P_Λ and g are as defined in Definition 2.1.
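The sampling rule of eq. (2.1) and the importance-weighted update of G̃ can be sketched as follows. This is a minimal illustration, assuming Bernoulli gains and a hypothetical gain_fn callback; subtracting the maximum before exponentiating is only a numerical-stability trick and does not change the sampling distribution.

```python
import math, random

def exp3(T, K, gamma, gain_fn):
    """Minimal EXP3: sample from eq. (2.1), update G-tilde by importance weighting."""
    G = [0.0] * K  # cumulative gain estimates G-tilde_i
    total = 0.0
    for t in range(T):
        m = max(G)  # subtract the max before exponentiating (stability only)
        w = [math.exp((gamma / K) * (g - m)) for g in G]
        s = sum(w)
        p = [(1 - gamma) * wi / s + gamma / K for wi in w]  # eq. (2.1)
        arm = random.choices(range(K), weights=p)[0]
        g = gain_fn(t, arm)            # only the played arm's gain is observed
        G[arm] += g / p[arm]           # unbiased importance-weighted update
        total += g
    return total

random.seed(1)
T, K = 2000, 4
gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * T)))
# Illustrative stochastic gains: arm 0 pays Bern(0.6), the others Bern(0.4).
total = exp3(T, K, gamma, lambda t, a: float(random.random() < (0.6 if a == 0 else 0.4)))
```

With this gap between arms, the agent's total gain should clearly exceed what playing only a suboptimal arm (0.4 per round on average) would collect.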
Most of the time, we will refer to Definition 2.2; whenever we need to use Definition 2.1 instead, this will be made explicit.

The simplest mechanism to achieve differential privacy for a function is to add Laplace noise of scale proportional to its sensitivity. The sensitivity is the maximum amount by which the value of the function can change if we change a single element in the input sequence. For example, if the input is a stream of numbers in [0,1] and the function is their sum, we can add Laplace noise of scale 1/ε to each number and achieve ε-differential privacy with an error of O(√T/ε) in the sum. However, (Chan, Shi, and Song 2010) introduced the Hybrid Mechanism, which achieves ε-differential privacy with only poly-logarithmic error (with respect to the true sum). The idea is to group the stream of numbers in a binary tree and only add Laplace noise at the nodes of the tree.

As demonstrated above, the main challenge with differential privacy is thus to trade off privacy and utility optimally.

Notation. In this paper, i will be used as an index for an arbitrary arm in [1, K], while k will be used to indicate an optimal arm, and I_t is the arm played by an agent at round t. We use g_{i,t} to indicate the gain of the i-th arm at round t. R_Λ(T) is the regret of the algorithm Λ after T rounds. The index and T are dropped when clear from the context. Unless otherwise specified, the regret is defined for oblivious adversaries against the fixed oracle. We use "x ∼ P" to denote that x is generated from distribution P. Lap(λ) denotes the Laplace distribution with scale λ, while Bern(p) denotes the Bernoulli distribution with parameter p.

3 Algorithms and Analysis

3.1 DP-Λ-Lap: Differential privacy through additional noise

We start by showing that the obvious technique to achieve a given ε-differential privacy in adversarial bandits already beats the state of the art. The main idea is to take any base bandit algorithm Λ as input and add Laplace noise of scale 1/ε to each gain before Λ observes it. This technique gives ε-differential privacy, as the gains are bounded in [0,1] and the noises are added i.i.d. at each round.

However, bandit algorithms require bounded gains, while the noisy gains are not bounded. The trick is to ignore rounds where the noisy gains fall outside an interval of the form [−b, b+1]. We pick the threshold b such that, with high probability, the noisy gains will be inside the interval [−b, b+1]. More precisely, b can be chosen such that, with high probability, the number of ignored rounds is lower than the upper bound R_Λ on the regret of Λ. Given that in the standard bandit problem the gains are bounded in [0,1], the gains at accepted rounds are rescaled back to [0,1].

Theorem 3.2 shows that all these operations still preserve ε-DP, while Theorem 3.1 demonstrates that the upper bound on the expected regret of DP-Λ-Lap adds only some small additional terms to R_Λ. To illustrate how small those additional terms are, we instantiate DP-Λ-Lap with the EXP3 algorithm. This leads to a mechanism called DP-EXP3-Lap, described in Algorithm 1. With a carefully chosen threshold b, Corollary 3.1 implies that the additional terms are such that the expected regret of DP-EXP3-Lap is O(√T ln T/ε), which is optimal in T up to some logarithmic factors. This result is a significant improvement over the best known bound so far of O(T^{2/3}/ε) from (Thakurta and Smith 2013), and simultaneously solves the challenge (whether or not one can get an ε-DP mechanism with optimal regret) posed by those authors.

Algorithm 1 DP-EXP3-Lap

Let G̃_i = 0 for all arms, b = ln T / ε, and γ = √( K ln K / ((e−1) T) )
for each round t = 1, ..., T do
    Compute the probability distribution p over the arms, with p = (p_{1,t}, ..., p_{K,t}) and p_{i,t} as in eq. (2.1).
    Draw an arm I_t from the probability distribution p.
    Receive the reward g_{I_t,t}
    Let the noisy gain be g'_{I_t,t} = g_{I_t,t} + N_{I_t,t}, with N_{I_t,t} ∼ Lap(1/ε)
    if g'_{I_t,t} ∈ [−b, b+1] then
        Scale g'_{I_t,t} to [0,1]
        Update the estimated cumulative gain of arm I_t: G̃_{I_t} = G̃_{I_t} + g'_{I_t,t} / p_{I_t,t}
    end if
end for

Theorem 3.1. If DP-Λ-Lap is run with input a base bandit algorithm Λ, the noisy reward g'_{I_t,t} of the true reward g_{I_t,t} set to g'_{I_t,t} = g_{I_t,t} + N_{I_t,t} with N_{I_t,t} ∼ Lap(1/ε), the acceptance interval set to [−b, b+1], and the scaling of the rewards to [0,1] done using g'_{I_t,t} ← (g'_{I_t,t} + b)/(2b+1), then the regret R_{DP-Λ-Lap} of DP-Λ-Lap satisfies:

    E R_{DP-Λ-Lap} ≤ E R_Λ^scaled + 2TK exp(−εb) + √(32T)/ε    (3.1)

where R_Λ^scaled is the upper bound on the regret of Λ when the rewards are scaled from [−b, b+1] to [0,1].
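Algorithm 1 can be sketched as below. The Laplace noise is drawn as a difference of two Exp(ε) variables, which is distributed as Lap(1/ε); the Bernoulli gain table and all names are illustrative, not from the paper.

```python
import math, random

def dp_exp3_lap(T, K, eps, gain_fn, rng):
    """Sketch of Algorithm 1 (DP-EXP3-Lap).

    Each observed gain is perturbed with Lap(1/eps) noise; rounds whose
    noisy gain leaves [-b, b+1] are ignored, and accepted gains are
    rescaled to [0, 1] before the usual EXP3 update.
    """
    b = math.log(T) / eps
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * T)))
    G = [0.0] * K
    for t in range(T):
        m = max(G)  # max-subtraction for numerical stability only
        w = [math.exp((gamma / K) * (g - m)) for g in G]
        s = sum(w)
        p = [(1 - gamma) * wi / s + gamma / K for wi in w]
        arm = rng.choices(range(K), weights=p)[0]
        g = gain_fn(t, arm)
        # Difference of two Exp(eps) draws has a Laplace(1/eps) law.
        noisy = g + rng.expovariate(eps) - rng.expovariate(eps)
        if -b <= noisy <= b + 1:                 # accept the round
            scaled = (noisy + b) / (2 * b + 1)   # rescale to [0, 1]
            G[arm] += scaled / p[arm]
        # rejected rounds contribute nothing, as in Algorithm 1
    return G

rng = random.Random(2)
table = [[float(rng.random() < (0.7 if i == 0 else 0.3)) for i in range(4)]
         for _ in range(3000)]
G = dp_exp3_lap(3000, 4, 1.0, lambda t, a: table[t][a], rng)
```

Since accepted scaled gains lie in [0, 1], every G̃_i stays non-negative and finite, matching the bounded-gain requirement discussed above.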
Proof sketch. We observed that DP-Λ-Lap is an instance of Λ run with the noisy rewards g' instead of g. This means R_Λ^scaled is an upper bound on the regret L measured on g'. Then, we derived a lower bound on L showing how close it is to R_{DP-Λ-Lap}. This allows us to conclude.

Corollary 3.1. If DP-Λ-Lap is run with EXP3 as its base algorithm and b = ln T/ε, then its expected regret E R_{DP-EXP3-Lap} satisfies:

    E R_{DP-EXP3-Lap} ≤ (4 ln T / ε) √((e−1) T K ln K) + 2K + √(32T)/ε

Proof. The proof comes by combining the regret of EXP3 (Auer et al. 2003) with Theorem 3.1.

Theorem 3.2. DP-Λ-Lap is ε-differentially private up to round T.

Proof sketch. Combining the privacy of the Laplace Mechanism with the parallel composition (McSherry 2009) and post-processing theorems (Dwork and Roth 2013) concludes the proof.

3.2 Leveraging the inherent privacy of EXP3

On the differential privacy of EXP3. (Dwork and Roth 2013) show that a variation of EXP3 for the full-information setting (where the agent observes the gains of all arms at every round, regardless of what he played) is already differentially private. Their results imply that one can achieve the optimal regret with only a sub-logarithmic privacy loss (O(√(128 log T))) after T rounds.

We start this section by showing a similar result for EXP3 in Theorem 3.3. Indeed, we show that EXP3 is already differentially private, but with a per-round privacy loss of 2 (assuming we want a sub-linear regret; see Theorem 3.3). Our results imply that EXP3 can achieve the optimal regret, albeit with a linear privacy loss of O(2T)-DP after T rounds. This is a huge gap compared with the full-information setting, and it underlines the significance of our result in Section 3.1, where we describe a concrete algorithm demonstrating that the optimal regret can be achieved with only a logarithmic privacy loss after T rounds.

Theorem 3.3. The EXP3 algorithm is

    min{ 2T,  T · ln[ (K(1−γ)+γ) / γ ],  2(1−γ)T + 2√(2T ln T) }

-differentially private up to round T.

In practice, we also want EXP3 to have a sub-linear regret. This implies γ ≪ 1, and EXP3 is then simply 2T-DP over T rounds.

Proof sketch. The first two terms in the theorem come from the observation that EXP3 is a combination of two mechanisms: the Exponential Mechanism (McSherry and Talwar 2007) and a randomized response. The last term comes from the observation that with probability γ we enjoy a perfect 0-DP. Then, we use a Chernoff bound to bound, with high probability, the number of times we suffer a non-zero privacy loss.

We will now show that the privacy of EXP3 itself may be improved without any additional noise, and with only a moderate impact on the regret.

On the privacy of an EXP3 wrapper algorithm. The previous paragraph leads to the conclusion that it is impossible to obtain a sub-linear privacy loss with a sub-linear regret while using the original EXP3. Here, we will prove that an existing technique already achieves this goal. The algorithm, which we call EXP3_τ, is from (Dekel, Tewari, and Arora 2012). It groups the rounds into disjoint intervals of fixed size τ, where the j-th interval starts on round (j−1)τ + 1 and ends on round jτ. At the beginning of interval j, EXP3_τ receives an action from EXP3 and plays it for τ rounds. During that time, EXP3 does not observe any feedback. At the end of the interval, EXP3_τ feeds EXP3 with a single gain: the average gain received during the interval.

Theorem 3.4, borrowed from (Dekel, Tewari, and Arora 2012), specifies the upper bound on the regret of EXP3_τ. It is remarkable that this bound holds against the m-memory bounded adaptive adversary. While Theorem 3.5 shows the privacy loss enjoyed by this algorithm, one gets a better intuition of how good these results are from Corollaries 3.2 and 3.3. Indeed, we can observe that EXP3_τ achieves a sub-logarithmic privacy loss of O(√ln T) with a regret of O(T^{2/3}) against a special case of the m-memory bounded adaptive adversary called the switching costs adversary, for which m = 1. This is the optimal regret bound (in the sense that there is a matching lower bound (Dekel et al. 2014)). This means that in some sense we are getting privacy for free against this adversary.

Theorem 3.4 (Regret of EXP3_τ (Dekel, Tewari, and Arora 2012)). The expected regret of EXP3_τ is upper bounded by:

    √(7 T τ K ln K) + Tm/τ + τ

against the m-memory bounded adaptive adversary, for any m < τ.

Theorem 3.5 (Privacy loss of EXP3_τ). EXP3_τ is ( 4T/τ³ + √(8 ln(1/δ') T/τ³), δ' )-DP up to round T.

Proof. The sensitivity of each gain is now 1/τ, as we are using the average. Combined with Theorem 3.3, this means the per-round privacy loss is 2/τ. Given that EXP3 only observes T/τ rounds, using the advanced composition theorem (Dwork, Rothblum, and Vadhan 2010) (Theorem III.3) concludes the final privacy loss over T rounds.

Corollary 3.2. EXP3_τ run with τ = (7K log K)^{−1/3} T^{1/3} is (ε, δ')-differentially private up to round T, with δ' = T^{−2} and ε = 28K ln K + √(112 K ln K ln T). Its expected regret against the switching costs adversary is upper bounded by 2(7K ln K)^{1/3} T^{2/3} + (7K log K)^{−1/3} T^{1/3}.

Proof. The proof is immediate by replacing τ and δ' in Theorems 3.4 and 3.5, and by the fact that for the switching costs adversary m = 1.

Corollary 3.3. EXP3_τ run with τ = ( 4T/ε + 2T ln(1/δ)/ε² )^{1/3} is (ε, δ)-differentially private, and its expected regret against the switching costs adversary is upper bounded by:

    O( T^{2/3} √(K ln K) ( √(ln(1/δ)) / ε )^{1/3} )
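The mini-batch wrapper can be sketched as follows. This is an illustrative sketch, not the paper's code: the exploration rate and the gain source are assumed choices, while the interval handling follows the description above.

```python
import math, random

def exp3_tau(T, K, tau, gain_fn, rng):
    """Sketch of EXP3_tau: draw one arm from EXP3 per interval of size tau,
    play it for the whole interval with no feedback, then feed EXP3 a single
    gain, the interval's average (each fed gain has sensitivity 1/tau)."""
    n_intervals = max(1, T // tau)
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * n_intervals)))
    G = [0.0] * K
    t, total = 0, 0.0
    while t < T:
        m = max(G)
        w = [math.exp((gamma / K) * (g - m)) for g in G]
        s = sum(w)
        p = [(1 - gamma) * wi / s + gamma / K for wi in w]
        arm = rng.choices(range(K), weights=p)[0]
        # Play the same arm for the whole interval, observing no feedback.
        block = [gain_fn(t + off, arm) for off in range(min(tau, T - t))]
        G[arm] += (sum(block) / len(block)) / p[arm]  # one averaged gain
        total += sum(block)
        t += len(block)
    return total

rng = random.Random(3)
T, K = 8000, 4
tau = max(1, round((7 * K * math.log(K)) ** (-1 / 3) * T ** (1 / 3)))  # Corollary 3.2
total = exp3_tau(T, K, tau,
                 lambda t, a: float(rng.random() < (0.6 if a == 0 else 0.4)), rng)
```

Note that the arm changes at most once every tau rounds, which is exactly why this wrapper suits the switching costs adversary.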
4 Experiments

We tested DP-EXP3-Lap and EXP3_τ together with the non-private EXP3 against a few different adversaries. The privacy parameter ε of DP-EXP3-Lap is set as defined in Corollary 3.2. This is done so that the regrets of DP-EXP3-Lap and EXP3_τ are compared at the same privacy level. All the other parameters of DP-EXP3-Lap are taken as defined in Corollary 3.1, while the parameters of EXP3_τ are taken as defined in Corollary 3.2.

For all experiments, the horizon is T = 2^18 and the number of arms is K = 4. We performed 720 independent trials and reported the median-of-means estimator (used heavily in the streaming literature (Alon, Matias, and Szegedy 1996)) of the cumulative regret. It partitions the trials into a_0 equal groups and returns the median of the sample means of each group. Proposition 4.1 is a well-known result (also in (Hsu and Sabato 2013; Lerasle and Oliveira 2011)) giving the accuracy of this estimator. Its convergence is O(σ/√N), with exponential probability tails, even though the random variable x may have heavy tails. In comparison, the empirical mean cannot provide such a guarantee for any σ > 0 and confidence in [0, 1/(2e)] (Catoni 2012).

Proposition 4.1. Let x be a random variable with mean µ and variance σ² < ∞. Assume that we have N independent samples of x, and let µ̂ be the median-of-means computed using a_0 groups. With probability at least 1 − e^{−a_0/4.5}, µ̂ satisfies |µ̂ − µ| ≤ σ √(6 a_0 / N).

We set the number of groups to a_0 = 24, so that the confidence interval holds with probability at least 0.995.

We also reported the deviation of each algorithm using Gini's Mean Difference (GMD hereafter) (Gini and Pearson 1912). GMD computes the deviation as Σ_{j=1}^N (2j − N − 1) x_{(j)}, with x_{(j)} the j-th order statistic of the sample (that is, x_{(1)} ≤ x_{(2)} ≤ ... ≤ x_{(N)}). As shown in (Yitzhaki and others 2003; David 1968), the GMD provides a superior approximation of the true deviation than the standard one. To account for the fact that the cumulative regret of our algorithms might not follow a symmetric distribution, we computed the GMD separately for the values above and below the median-of-means.

At round t, we computed the cumulative regret against the fixed oracle who plays the best arm assuming that the end of the game is at t. The oracle uses the actual sequence of gains to decide its best arm. For a given trial, we make sure that all algorithms are playing the same game by generating the gains for all possible round-arm pairs before the game starts.

Deterministic adversary. As shown by (Audibert and Bubeck 2010), the expected regret of any agent against an oblivious adversary cannot be worse than that against the worst-case deterministic adversary. In this experiment, arm 2 is the best and gives 1 for every even round. To trick the players into picking the wrong arms, the first arm always gives 0.38, whereas the third gives 1 for every round that is a multiple of 3. The remaining arms always give 0. As shown by the figure, this simple adversary is already powerful enough to make the algorithms attain their upper bound.

Stochastic adversary. This adversary draws the gains of the first arm i.i.d. from Bern(0.55), whereas all other gains are drawn i.i.d. from Bern(0.5).

Fully oblivious adversary. For the best arm k, it first draws a number p uniformly in [0.5, 0.5 + 2ε] and generates the gain g_{k,t} ∼ Bern(p). For all other arms, p is drawn from [0.5 − ε, 0.5 + ε]. This process is repeated at every round. In our experiments, ε = 0.05.

Oblivious adversary. This adversary is identical to the fully oblivious one at every round that is a multiple of 200. Between two multiples of 200, the last gain of each arm is repeated.

Switching costs adversary. This adversary (defined at Figure 1 in (Dekel et al. 2014)) defines a stochastic process (including the simple Gaussian random walk as a special case) for generating the gains. It was used to prove that any algorithm against this adversary must incur a regret of Ω(T^{2/3}).

Discussion. Figure 1 shows our results against a variety of adversaries, with respect to a fixed oracle. Overall, the performance (in terms of regret) of DP-EXP3-Lap is very competitive against that of EXP3 while providing significantly better privacy. This means that DP-EXP3-Lap allows us to get privacy for free in the bandit setting against an adversary not more powerful than the oblivious one.

The performance of EXP3_τ is worse than that of DP-EXP3-Lap against an oblivious adversary or a less powerful one. However, the situation is completely reversed against the more powerful switching costs adversary. In that setting, EXP3_τ outperforms both EXP3 and DP-EXP3-Lap, confirming the theoretical analysis. We can see EXP3_τ as the algorithm providing us privacy for free against the switching costs adversary, and the adaptive m-bounded memory one in general.
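The two estimators used above can be sketched directly. The heavy-tailed sample below is illustrative; the GMD here is normalized by N(N−1)/2 so that it equals the mean absolute pairwise difference, which the order-statistics sum in the text computes up to that constant.

```python
import random
from statistics import mean, median

def median_of_means(xs, a0):
    """Median-of-means (Proposition 4.1): split the N samples into
    a0 groups and return the median of the group means."""
    n = len(xs) // a0
    return median(mean(xs[i * n:(i + 1) * n]) for i in range(a0))

def gmd(xs):
    """Gini's Mean Difference via order statistics:
    sum_j (2j - N - 1) x_(j), normalized here by N(N-1)/2."""
    ys = sorted(xs)
    N = len(ys)
    return sum((2 * j - N - 1) * y for j, y in enumerate(ys, 1)) * 2 / (N * (N - 1))

rng = random.Random(0)
# Illustrative heavy-tailed sample: squared exponentials.
xs = [rng.expovariate(1.0) ** 2 for _ in range(720)]
est = median_of_means(xs, 24)   # 720 trials, a0 = 24 groups, as in the text
dev = gmd(xs)
```

Even with the rare very large values in this sample, the median of the 24 group means stays close to the true mean, which is the robustness property Proposition 4.1 quantifies.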
5 Conclusion

We have provided the first results on differentially private adversarial multi-armed bandits, which are optimal up to logarithmic factors. One open question is how differential privacy affects regret in the full reinforcement learning problem. At this point in time, the only known results in the MDP setting obtain differentially private algorithms for Monte Carlo policy evaluation (Balle, Gomrokchi, and Precup 2016). While this implies that it is possible to obtain policy iteration algorithms, it is unclear how to extend this to the full online reinforcement learning problem.

Acknowledgements. This research was supported by the SNSF grants "Adaptive control with approximate Bayesian computation and differential privacy" and "Swiss Sense Synergy", by the Marie Curie Actions (REA 608743), the Future of Life Institute grant "Mechanism Design for AI Architectures", and the CNRS Specific Action on Security.
[Figure 1: Regret and error bars against five different adversaries, with respect to the fixed oracle. Each panel plots the cumulative regret of EXP3, DP-EXP3-Lap, and EXP3_τ over time steps: (a) Deterministic, (b) Stochastic, (c) Fully Oblivious, (d) Oblivious, (e) Switching costs.]

References

Agrawal, S., and Goyal, N. 2012. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012.
Alon, N.; Matias, Y.; and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In 28th STOC, 20-29. ACM.
Audibert, J.-Y., and Bubeck, S. 2010. Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11:2785-2836.
Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2003. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48-77.
Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite time analysis of the multiarmed bandit problem. Machine Learning 47(2/3):235-256.
Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3:397-422.
Balle, B.; Gomrokchi, M.; and Precup, D. 2016. Differentially private policy evaluation. In ICML 2016.
Burnetas, A. N., and Katehakis, M. N. 1996. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17(2):122-142.
Catoni, O. 2012. Challenging the empirical mean and empirical variance: A deviation study. Annales de l'I.H.P. Probabilités et statistiques 48(4):1148-1185.
Cesa-Bianchi, N.; Dekel, O.; and Shamir, O. 2013. Online learning with switching costs and other adaptive adversaries. In NIPS, 1160-1168.
Chan, T. H.; Shi, E.; and Song, D. 2010. Private and continual release of statistics. In Automata, Languages and Programming. Springer. 405-417.
David, H. 1968. Miscellanea: Gini's mean difference rediscovered. Biometrika 55(3):573-575.
Dekel, O.; Ding, J.; Koren, T.; and Peres, Y. 2014. Bandits with switching costs: T^{2/3} regret. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC '14, 459-467. New York, NY, USA: ACM.
Dekel, O.; Tewari, A.; and Arora, R. 2012. Online bandit learning against an adaptive adversary: from regret to policy regret. In ICML. icml.cc/Omnipress.
Dwork, C., and Roth, A. 2013. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3-4):211-407.
Dwork, C.; Rothblum, G. N.; and Vadhan, S. 2010. Boosting and differential privacy. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, 51-60.
Dwork, C. 2006. Differential privacy. In ICALP, 1-12. Springer.
Gini, C., and Pearson, K. 1912. Variabilità e mutabilità: contributo allo studio delle distribuzioni e delle relazioni statistiche. Fascicolo 1. Tipografia di Paolo Cuppini.
Hsu, D., and Sabato, S. 2013. Loss minimization and parameter estimation with heavy tails. arXiv preprint arXiv:1307.1827.
Jain, P.; Kothari, P.; and Thakurta, A. 2012. Differentially private online learning. In Mannor, S.; Srebro, N.; and Williamson, R. C., eds., COLT 2012, volume 23, 24.1-24.34.
Lerasle, M., and Oliveira, R. I. 2011. Robust empirical mean estimators. arXiv preprint arXiv:1112.3914.
McSherry, F., and Talwar, K. 2007. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS '07, 94-103. Washington, DC, USA: IEEE Computer Society.
McSherry, F. D. 2009. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, 19-30. New York, NY, USA: ACM.
Merhav, N.; Ordentlich, E.; Seroussi, G.; and Weinberger, M. J. 2002. On sequential strategies for loss functions with memory. IEEE Trans. Information Theory 48(7):1947-1958.
Mishra, N., and Thakurta, A. 2015. (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the 31st UAI.
Pandey, S., and Olston, C. 2006. Handling advertisements of unknown quality in search advertising. In Schölkopf, B.; Platt, J. C.; and Hoffman, T., eds., Twentieth NIPS, 1065-1072.
Thakurta, A. G., and Smith, A. D. 2013. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In NIPS, 2733-2741.
Thompson, W. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3-4):285-294.
Tossou, A. C. Y., and Dimitrakakis, C. 2016. Algorithms for differentially private multi-armed bandits. In AAAI, 2087-2093. AAAI Press.
Yitzhaki, S., et al. 2003. Gini's mean difference: A superior measure of variability for non-normal distributions. Metron 61(2):285-316.
Zhao, J.; Jung, T.; Wang, Y.; and Li, X. 2014. Achieving differential privacy of data disclosure in the smart grid. In 2014 IEEE Conference on Computer Communications, INFOCOM 2014, 504-512.