Differentially Private Bayesian Optimization

Matt J. Kusner, Jacob R. Gardner, Roman Garnett, Kilian Q. Weinberger
Computer Science & Engineering, Washington University in St. Louis

Abstract

Bayesian optimization is a powerful tool for fine-tuning the hyper-parameters of a wide variety of machine learning models. The success of machine learning has led practitioners in diverse real-world settings to learn classifiers for practical problems. As machine learning becomes commonplace, Bayesian optimization becomes an attractive method for practitioners to automate the process of classifier hyper-parameter tuning. A key observation is that the data used for tuning models in these settings is often sensitive. Certain data such as genetic predisposition, personal email statistics, and car accident history, if not properly private, may be at risk of being inferred from Bayesian optimization outputs. To address this, we introduce methods for releasing the best hyper-parameters and classifier accuracy privately. Leveraging the strong theoretical guarantees of differential privacy and known Bayesian optimization convergence bounds, we prove that under a GP assumption these private quantities are also near-optimal. Finally, even if this assumption is not satisfied, we can use different smoothness guarantees to protect privacy.

1. Introduction

Machine learning is increasingly used in application areas with sensitive data. For example, hospitals use machine learning to predict if a patient is likely to be readmitted soon (Yu et al., 2013), webmail providers classify spam emails from non-spam (Weinberger et al., 2009), and insurance providers forecast the extent of bodily injury in car crashes (Chong et al., 2005).

In these scenarios data cannot be shared legally, but companies and hospitals may want to share hyper-parameters and validation accuracies through publications or other means. However, data-holders must be careful, as even a small amount of information can compromise privacy. Which hyper-parameter setting yields the highest accuracy can reveal sensitive information about individuals in the validation or training data set, reminiscent of the reconstruction attacks described by Dwork & Roth (2013) and Dinur & Nissim (2003). For example, imagine updated hyper-parameters are released right after a prominent public figure is admitted to a hospital. If a hyper-parameter is known to correlate strongly with a particular disease the patient is suspected to have, an attacker could make a direct correlation between the hyper-parameter value and the individual.

To prevent this sort of attack, we develop a set of algorithms that automatically fine-tune the hyper-parameters of a machine learning algorithm while provably preserving differential privacy (Dwork et al., 2006b). Our approach leverages recent results on Bayesian optimization (Snoek et al., 2012; Hutter et al., 2011; Bergstra & Bengio, 2012; Gardner et al., 2014), training a Gaussian process (GP) (Rasmussen & Williams, 2006) to accurately predict and maximize the validation gain of hyper-parameter settings. We show that the GP model in Bayesian optimization allows us to release noisy final hyper-parameter settings to protect against the aforementioned privacy attacks, while only sacrificing a tiny, bounded amount of validation gain.

Our privacy guarantees hold for releasing the best hyper-parameters and best validation gain. Specifically, our contributions are as follows:

• We derive, to the best of our knowledge, the first framework for Bayesian optimization with provable differential privacy guarantees,
• We develop variations both with and without observation noise, and

• We show that even if our validation gain is not drawn from a Gaussian process, we can guarantee differential privacy under different smoothness assumptions.

We begin with background on Bayesian optimization and differential privacy that we will use to prove our guarantees.

2. Background

In general, our aim will be to protect the privacy of a validation dataset V ⊆ X of sensitive records (where X is the collection of all possible records) when the results of Bayesian optimization depend on V.

Bayesian optimization. Our goal is to maximize an unknown function f_V : Λ → R that depends on some validation dataset V ⊆ X:

  max_{λ∈Λ} f_V(λ).  (1)

It is important to point out that all of our results hold for the general setting of eq. (1), but throughout the paper we use the vocabulary of a common application: that of machine learning hyper-parameter tuning. In this case f_V(λ) is the gain of a learning algorithm evaluated on validation dataset V that was trained with hyper-parameters λ ∈ Λ ⊆ R^d.

As evaluating f_V is expensive (e.g., each evaluation requires training a learning algorithm), Bayesian optimization gives a procedure for selecting a small number of locations at which to sample f_V: λ_T = [λ_1, ..., λ_T] ∈ R^{d×T}. Specifically, given a current sample λ_t, we observe a validation gain v_t such that v_t = f_V(λ_t) + α_t, where α_t ∼ N(0, σ²) is Gaussian noise with possibly non-zero variance σ². Then, given v_t and previously observed values v_1, ..., v_{t−1}, Bayesian optimization updates its belief of f_V and samples a new hyper-parameter λ_{t+1}. Each step of the optimization proceeds in this way.
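As a concrete illustration of the black-box objective f_V in the hyper-parameter tuning setting, the sketch below scores a kernelized SVM (the model used in our experiments in Section 6) on a held-out validation set. It is a minimal sketch only; the data split and the scikit-learn estimator are stand-ins supplied by the user, not part of the method itself.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def make_validation_gain(X_train, y_train, X_val, y_val):
    """Return a callable f_V(lambda): validation accuracy of an SVM trained with lambda = [C, gamma]."""
    def f_V(lam):
        C, gamma = lam
        model = SVC(C=C, gamma=gamma).fit(X_train, y_train)   # train with hyper-parameters lambda
        return accuracy_score(y_val, model.predict(X_val))    # validation gain on the sensitive set V
    return f_V
```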
One fine a Gaussian process over hyper-parameters λ,λ Λ (cid:48) popular prior distribution over functions is the Gaussian (cid:0) ∈ (cid:0) (cid:1) and datasets , (cid:48) as follows: 0,k1( , (cid:48)) process µ(),k(, ) (Rasmussen & Williams, 2006), (cid:1) V V ⊆ X GP V V ⊗ parameteGrPized b·y a ·m·ean function µ() (we set µ = 0, k2(λ,λ(cid:48)) . A prior of this form is known as a multi-task · Gaussianprocess(Bonillaetal.,2008). Manychoicesfor w.l.o.g.)andakernelcovariancefunctionk(, ).Functions drawnfromaGaussianprocesshavethepro·p·ertythatany k1 and k2 are possible. The function k1(V,V(cid:48)) defines a set kernel (e.g., a function of the number of records that finitesetofvaluesofthefunctionarenormallydistributed. differ between and ). For k , we focus on either the Additionally, given samples λT = [λ1,...,λT] and ob- squaredexponeVntial: kV((cid:48)λ,λ)=2exp(cid:0) λ λ 2/(2(cid:96)2)(cid:1) servations vT = [v1,...,vT], the GP posterior mean and or Mate´rn kernels: (e.g2., for(cid:48)ν = 5/2−, k(cid:107) (λ−,λ(cid:48)(cid:107))2= (1+ variancehasaclosedform: 2 (cid:48) √5r/(cid:96)+(5r2)/(3(cid:96)2))exp( √5r/(cid:96)),forr = λ λ ), (cid:48) 2 − (cid:107) − (cid:107) forafixed(cid:96),astheyhaveknownboundsonthemaximum µ (λ)=k(λ,λ )(K +σ2I) 1v T T T − T information gain (Srinivas et al., 2010). Note that as de- kT(λ,λ(cid:48))=k(λ,λ(cid:48))−k(λ,λT)(KT +σ2I)−1k(λT,λ(cid:48)) fined,thekernelk2isnormalized(i.e.,k2(λ,λ)=1). σT2(λ)=kT(λ,λ), (2) Assumption 1. We have a problem of type (1), where all possible dataset functions [f ,...,f ] are GP dis- wherek(λ,λ ) R1 T isevaluatedelement-wiseoneach tributed (cid:0)0,k ( , ) k (λ1,λ)(cid:1) fo2|rX|known kernels T × 1 (cid:48) 2 (cid:48) ∈ GP V V ⊗ of the T columns of λ . As well, K = k(X ,X ) k ,k ,forall , andλ,λ Λ,where Λ . T T T T 1 2 (cid:48) (cid:48) ∈ V V ⊆X ∈ | |≤∞ DifferentiallyPrivateBayesianOptimization Similar Gaussian process assumptions have been made in , differbyswappingonerecord)is (cid:48) V V previouswork(Srinivasetal.,2010). Foraresultintheno- ∆ (cid:44) max ( ) ( ) . noiseobservationsetting,wewillmakeuseoftheassump- A , (cid:48) (cid:107)A V −A V(cid:48) (cid:107)1 VV ⊆X tionsofdeFreitasetal.(2012)forourprivacyguarantees, (Exponentialmechanism)Theglobalsensitivityofafunc- asdescribedinSection4. tionq: Λ Roverallneighboringdatasets , is (cid:48) X× → V V 2.1.DifferentialPrivacy ∆ (cid:44) max q( ,λ) q( ,λ) . q (cid:48) 1 , (cid:48) (cid:107) V − V (cid:107) One of the most widely accepted frameworks for private VλV ⊆ΛX ∈ data release is differential privacy (Dwork et al., 2006b), TheLaplacemechanismhidestheoutputof byperturb- whichhasbeenshowntoberobusttoavarietyofprivacy A ingitsoutputwithsomeamountofrandomnoise. attacks (Ganta et al., 2008; Sweeney, 1997; Narayanan & Definition 3. Given a dataset and an algorithm , the Shmatikov, 2008). Given an algorithm that outputs a V A valueλwhenrunondataset ,thegoaloAfdifferentialpri- Laplacemechanismreturns ( )+ω,whereωisanoise A V vacy is to ‘hide’ the effect oVf a small change in on the variabledrawnfromLap(∆ /(cid:15)),theLaplacedistribution output of . Equivalently, an attacker should noVt be able withscaleparameter∆ /(cid:15)(Aandlocationparameter0). A A totellifaprivaterecordwasswappedin justbylooking V The exponential mechanism draws a slightly different λ˜ at the output of . If two datasets , differ by swap- A V V(cid:48) thatis‘close’toλ,theoutputof . pingasingleelement,wewillrefertothemasneighboring A Definition 4. Given a dataset and an algorithm datasets. 
Contributions. Alongside maximizing f_V, we would like to guarantee that if f_V depends on (sensitive) validation data, we can release information about f_V so that the data remains private. Specifically, we may wish to release (a) our best guess λ̂ ≜ argmax_{t≤T} f_V(λ_t) of the true (unknown) maximizer λ* and (b) our best guess f_V(λ̂) of the true (also unknown) maximum objective f_V(λ*). The primary question this work aims to answer is: how can we release private versions of λ̂ and f_V(λ̂) that are close to their true values, or better, the values λ* and f_V(λ*)? We give two answers to these questions. The first will make a Gaussian process assumption on f_V, which we describe immediately below. The second, described in Section 5, will utilize Lipschitz and convexity assumptions to guarantee privacy in the event the GP assumption does not hold.

Setting. For our first answer to this question, let us define a Gaussian process over hyper-parameters λ, λ' ∈ Λ and datasets V, V' ⊆ X as follows: GP(0, k_1(V, V') ⊗ k_2(λ, λ')). A prior of this form is known as a multi-task Gaussian process (Bonilla et al., 2008). Many choices for k_1 and k_2 are possible. The function k_1(V, V') defines a set kernel (e.g., a function of the number of records that differ between V and V'). For k_2, we focus on either the squared exponential kernel, k_2(λ, λ') = exp(−‖λ − λ'‖²_2 / (2ℓ²)), or Matérn kernels (e.g., for ν = 5/2, k_2(λ, λ') = (1 + √5 r/ℓ + 5r²/(3ℓ²)) exp(−√5 r/ℓ), for r = ‖λ − λ'‖_2), for a fixed ℓ, as they have known bounds on the maximum information gain (Srinivas et al., 2010). Note that, as defined, the kernel k_2 is normalized (i.e., k_2(λ, λ) = 1).

Assumption 1. We have a problem of type (1), where all possible dataset functions [f_{V_1}, ..., f_{V_{2^{|X|}}}] are GP distributed GP(0, k_1(V, V') ⊗ k_2(λ, λ')) for known kernels k_1, k_2, for all V, V' ⊆ X and λ, λ' ∈ Λ, where |Λ| ≤ ∞.

Similar Gaussian process assumptions have been made in previous work (Srinivas et al., 2010). For a result in the no-noise observation setting, we will make use of the assumptions of de Freitas et al. (2012) for our privacy guarantees, as described in Section 4.

2.1. Differential Privacy

One of the most widely accepted frameworks for private data release is differential privacy (Dwork et al., 2006b), which has been shown to be robust to a variety of privacy attacks (Ganta et al., 2008; Sweeney, 1997; Narayanan & Shmatikov, 2008). Given an algorithm A that outputs a value λ when run on dataset V, the goal of differential privacy is to 'hide' the effect of a small change in V on the output of A. Equivalently, an attacker should not be able to tell if a private record was swapped in V just by looking at the output of A. If two datasets V, V' differ by swapping a single element, we will refer to them as neighboring datasets. Note that any non-trivial algorithm (i.e., an algorithm A that outputs different values on V and V' for some pair V, V' ⊆ X) must include some amount of randomness to guarantee such a change in V is unobservable in the output λ of A (Dwork & Roth, 2013). The level of privacy we wish to guarantee decides the amount of randomness we need to add to λ (better privacy requires increased randomness). Formally, the definition of differential privacy is stated below.

Definition 1. A randomized algorithm A is (ε, δ)-differentially private for ε, δ ≥ 0 if for all λ ∈ Range(A) and for all neighboring datasets V, V' (i.e., such that V and V' differ by swapping one record) we have that

  Pr[A(V) = λ] ≤ e^ε Pr[A(V') = λ] + δ.  (4)

The parameters ε, δ quantify how private A is; the smaller, the more private. The maximum privacy is ε = δ = 0, in which case eq. (4) holds with equality. This can be seen by the fact that V and V' can be swapped in the definition, and thus the inequality holds in both directions. If δ = 0, we say the algorithm is simply ε-differentially private. For a survey on differential privacy we refer the interested reader to Dwork & Roth (2013).

There are two popular methods for making an algorithm ε-differentially private: (a) the Laplace mechanism (Dwork et al., 2006b), in which we add random noise to λ, and (b) the exponential mechanism (McSherry & Talwar, 2007), which draws a random output λ̃ such that λ̃ ≈ λ. For each mechanism we must define an intermediate quantity called the global sensitivity, describing how much A changes when V changes.

Definition 2. (Laplace mechanism) The global sensitivity of an algorithm A over all neighboring datasets V, V' (i.e., V, V' differ by swapping one record) is
  Δ_A ≜ max_{V,V'⊆X} ‖A(V) − A(V')‖_1.
(Exponential mechanism) The global sensitivity of a function q: X × Λ → R over all neighboring datasets V, V' is
  Δ_q ≜ max_{V,V'⊆X, λ∈Λ} ‖q(V, λ) − q(V', λ)‖_1.

The Laplace mechanism hides the output of A by perturbing its output with some amount of random noise.

Definition 3. Given a dataset V and an algorithm A, the Laplace mechanism returns A(V) + ω, where ω is a noise variable drawn from Lap(Δ_A/ε), the Laplace distribution with scale parameter Δ_A/ε (and location parameter 0).

The exponential mechanism draws a slightly different λ̃ that is 'close' to λ, the output of A.

Definition 4. Given a dataset V and an algorithm A(V) = argmax_{λ∈Λ} q(V, λ), the exponential mechanism returns λ̃, where λ̃ is drawn from the distribution (1/Z) exp(ε q(V, λ)/(2Δ_q)), and Z is a normalizing constant.
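For concreteness, here is a minimal sketch of the two mechanisms in Definitions 3 and 4 over a finite output set. The sensitivities Δ_A and Δ_q are assumed to be supplied by the analysis (as in the theorems of Section 3); nothing here is specific to Bayesian optimization.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=np.random.default_rng()):
    """Definition 3: release value + Lap(sensitivity / epsilon) noise."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def exponential_mechanism(scores, sensitivity, epsilon, rng=np.random.default_rng()):
    """Definition 4: sample index i with probability proportional to exp(eps * q_i / (2 * sensitivity))."""
    logits = epsilon * np.asarray(scores) / (2.0 * sensitivity)
    logits -= logits.max()                        # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(scores), p=probs))
```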
Given Λ, a possible set of hyper-parameters, we derive methods for privately releasing the best hyper-parameters and the best function values f_V, approximately solving eq. (1). We first address the setting with observation noise (σ² > 0) in eq. (2) and then describe small modifications for the no-noise setting. For each setting we use the UCB sampling technique in eq. (3) to derive our private results.

3. With observation noise

In general cases of Bayesian optimization, observation noise occurs in a variety of real-world modeling settings such as sensor measurement prediction (Krause et al., 2008). In hyper-parameter tuning, noise in the validation gain may be a result of noisy validation or training features.

In the sections that follow, although the quantities f, µ, σ, v all depend on the validation dataset V, for notational simplicity we will occasionally omit the subscript V. Similarly, for V' we will often write f', µ', σ'², v'.

3.1. Private near-maximum hyper-parameters

In this section we guarantee that releasing λ̃ in Algorithm 1 is private (Theorem 1, Corollary 1) and that it is near-optimal (Theorem 2). Our proof strategy is as follows: we will first demonstrate the global sensitivity of µ_T(λ) with probability at least 1 − δ. Then we will show that releasing λ̃ via the exponential mechanism is (ε, δ)-differentially private. Finally, we prove that µ_T(λ̃) is close to f(λ*), the true maximum of eq. (1).

Algorithm 1: Private Bayesian Opt. (noisy observations)
  Input: V ⊆ X; Λ ⊆ R^d; T; (ε, δ); σ²; γ_T
  µ_{V,0} = 0
  for t = 1, ..., T do
    β_t = 2 log(|Λ| t² π² / (3δ))
    λ_t ≜ argmax_{λ∈Λ} µ_{V,t−1}(λ) + √(β_t) σ_{V,t−1}(λ)
    Observe validation gain v_{V,t}, given λ_t
    Update µ_{V,t} and σ²_{V,t} according to (2)
  end for
  c = 2 √((1 − k_1(V, V')) log(3|Λ|/δ))
  q = σ √(4 log(3/δ))
  C_1 = 8 / log(1 + σ^{−2})
  Draw λ̃ ∈ Λ w.p. Pr[λ] ∝ exp( ε µ_{V,T}(λ) / (2(2√(β_{T+1}) + c)) )
  v* = max_{t≤T} v_{V,t}
  Draw θ ∼ Lap( √(C_1 T β_T γ_T)/(εT) + c/ε + q/ε )
  ṽ = v* + θ
  Return: λ̃, ṽ
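Under Assumption 1, the release step at the end of Algorithm 1 can be sketched as follows. This is a minimal NumPy illustration, not a reference implementation: `mu_T` is the posterior mean over the candidate set, `k1_vv` the value of k_1(V, V') used in c, and `gamma_T` an upper bound on the maximum information gain, all assumed to be available from the optimization loop above.

```python
import numpy as np

def private_release_noisy(mu_T, v_obs, candidates, k1_vv, gamma_T,
                          sigma2, epsilon, delta, rng=np.random.default_rng()):
    # Release (lambda~, v~) as at the end of Algorithm 1 (noisy observations)
    T, n = len(v_obs), len(candidates)
    beta_T  = 2.0 * np.log(n * T**2 * np.pi**2 / (3.0 * delta))
    beta_T1 = 2.0 * np.log(n * (T + 1)**2 * np.pi**2 / (3.0 * delta))
    c = 2.0 * np.sqrt((1.0 - k1_vv) * np.log(3.0 * n / delta))
    q = np.sqrt(sigma2 * 4.0 * np.log(3.0 / delta))
    C1 = 8.0 / np.log(1.0 + 1.0 / sigma2)

    # Exponential mechanism over the posterior mean, sensitivity 2*sqrt(beta_{T+1}) + c
    logits = epsilon * np.asarray(mu_T) / (2.0 * (2.0 * np.sqrt(beta_T1) + c))
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    lam_tilde = candidates[rng.choice(n, p=probs)]

    # Laplace mechanism on the best observed validation gain
    scale = (np.sqrt(C1 * beta_T * gamma_T) / np.sqrt(T) + c + q) / epsilon
    v_tilde = np.max(v_obs) + rng.laplace(0.0, scale)
    return lam_tilde, v_tilde
```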
Global sensitivity. As a first step we bound the global sensitivity of µ_T(λ) as follows:

Theorem 1. Given Assumption 1, for any two neighboring datasets V, V' and for all λ ∈ Λ, with probability at least 1 − δ there is an upper bound on the global sensitivity (in the exponential mechanism sense) of µ_T:
  |µ'_T(λ) − µ_T(λ)| ≤ 2√(β_{T+1}) + σ_1 √(2 log(3|Λ|/δ)),
for σ_1 = √(2(1 − k_1(V, V'))) and β_t = 2 log(|Λ| t² π² / (3δ)).

Proof. Note that, by applying the triangle inequality twice, for all λ ∈ Λ,
  |µ'_T(λ) − µ_T(λ)| ≤ |µ'_T(λ) − f'(λ)| + |f'(λ) − µ_T(λ)|
                     ≤ |µ'_T(λ) − f'(λ)| + |f'(λ) − f(λ)| + |f(λ) − µ_T(λ)|.
We can now bound each of the terms on the right hand side (RHS), each with probability at least 1 − δ/3. According to Srinivas et al. (2010), Lemma 5.1, we obtain |µ'_T(λ) − f'(λ)| ≤ √(β_{T+1}) σ'_T(λ). The same can be applied to |f(λ) − µ_T(λ)|. As σ'_T(λ) ≤ 1, because k(λ, λ) = 1, we can upper bound both terms by 2√(β_{T+1}). In order to bound the remaining (middle) term on the RHS, recall that for a random variable Z ∼ N(0, 1) we have Pr[|Z| > γ] ≤ e^{−γ²/2}. For variables Z_1, ..., Z_n ∼ N(0, 1) we have, by the union bound, that Pr[∀i, |Z_i| ≤ γ] ≥ 1 − n e^{−γ²/2} ≜ 1 − δ/3. If we set Z_i = |f(λ_i) − f'(λ_i)| / σ_1 and n = |Λ|, we obtain γ = √(2 log(3|Λ|/δ)), which completes the proof. □

We remark that all of the quantities in Theorem 1 are either given or selected by the modeler (e.g., δ, T). Given this upper bound we can apply the exponential mechanism to release λ̃ privately, as per Definition 1:

Corollary 1. Let A(V) denote Algorithm 1 applied on dataset V. Given Assumption 1, λ̃ is (ε, δ)-differentially private, i.e., Pr[A(V) = λ̃] ≤ e^ε Pr[A(V') = λ̃] + δ, for any pair of neighboring datasets V, V'.

We leave the proof of Corollary 1 to the supplementary material. Even though we must release a noisy hyper-parameter setting λ̃, it is in fact near-optimal.

Theorem 2. Given Assumption 1, the following near-optimal approximation guarantee for releasing λ̃ holds:
  µ_T(λ̃) ≥ f(λ*) − 2√(β_T) − q − (2Δ/ε)(log|Λ| + a)
w.p. at least 1 − (δ + e^{−a}), where Δ = 2√(β_{T+1}) + c (for β_{T+1}, c, and q defined as in Algorithm 1).

Proof. In general, the exponential mechanism selects λ̃ that is close to the maximizing λ (McSherry & Talwar, 2007):
  µ_T(λ̃) ≥ max_{λ∈Λ} µ_T(λ) − (2Δ/ε)(log|Λ| + a),  (5)
with probability at least 1 − e^{−a}. Recall we assume that at each optimization step we observe the noisy gain v_t = f(λ_t) + α_t, where α_t ∼ N(0, σ²) (with fixed noise variance σ² > 0). As such, we can lower bound the term max_{λ∈Λ} µ_T(λ):
  max_{λ∈Λ} µ_T(λ) ≥ f(λ_T) + α_T  (= v_T)
  f(λ*) − max_{λ∈Λ} µ_T(λ) ≤ f(λ*) − f(λ_T) − α_T
  f(λ*) − max_{λ∈Λ} µ_T(λ) ≤ 2√(β_T) σ_{T−1}(λ_T) − α_T
  max_{λ∈Λ} µ_T(λ) ≥ f(λ*) − 2√(β_T) + α_T,  (6)
where the third line follows from Srinivas et al. (2010), Lemma 5.2, and the fourth line from the fact that σ_{T−1}(λ_T) ≤ 1.

As in the proof of Theorem 1, given a normal random variable Z ∼ N(0, 1) we have Pr[|Z| ≤ γ] ≥ 1 − e^{−γ²/2} ≜ 1 − δ/2. Therefore, if we set Z = α_T/σ we have γ = √(2 log(2/δ)). This implies that |α_T| ≤ σ√(2 log(2/δ)) ≤ σ√(4 log(3/δ)) = q (as defined in Algorithm 1) with probability at least 1 − δ/2. Thus, we can lower bound α_T by −q, and we can then lower bound max_{λ∈Λ} µ_T(λ) in eq. (5) with the right hand side of eq. (6). Therefore, given the β_T in Algorithm 1, Lemma 5.2 of Srinivas et al. (2010) holds with probability at least 1 − δ/2 and the theorem statement follows. □
3.2. Private near-maximum validation gain

In this section we demonstrate that releasing the validation gain ṽ in Algorithm 1 is private (Corollary 2) and that the noise we add to ensure privacy is bounded with high probability (Theorem 4). As in the previous section, our approach will be to first derive the global sensitivity of the maximum v found by Algorithm 1. Then we show releasing ṽ is (ε, δ)-differentially private via the Laplace mechanism. Perhaps surprisingly, we also show that ṽ is close to f(λ*).

Global sensitivity. We bound the global sensitivity of the maximum v found with Bayesian optimization and UCB:

Theorem 3. Given Assumption 1 and neighboring V, V', we have the following global sensitivity bound (in the Laplace mechanism sense) for the maximum v, w.p. at least 1 − δ:
  |max_{t≤T} v'_t − max_{t≤T} v_t| ≤ √(C_1 T β_T γ_T)/T + c + q,
where the maximum Gaussian process information gain γ_T is bounded above for the squared exponential and Matérn kernels (Srinivas et al., 2010).

Proof. For notational simplicity let us denote the regret term Ω ≜ √(C_1 T β_T γ_T). Then from Theorem 1 in Srinivas et al. (2010) we have that
  Ω/T ≥ (1/T) Σ_{t=1}^T [f(λ*) − f(λ_t)] ≥ f(λ*) − max_{t≤T} f(λ_t).  (7)
This implies f(λ*) ≤ max_{t≤T} f(λ_t) + Ω/T with probability at least 1 − δ/3 (with appropriate choice of β_T).

Recall that in the proof of Theorem 1 we showed that |f(λ) − f'(λ)| ≤ c with probability at least 1 − δ/3 (for c given in Algorithm 1). This, along with the above expression, implies the following two sets of inequalities with probability greater than 1 − 2δ/3:
  f'(λ*) − c ≤ f(λ*) < max_{t≤T} f(λ_t) + Ω/T;
  f(λ*) − c ≤ f'(λ*) < max_{t≤T} f'(λ_t) + Ω/T.
These, in turn, imply the two sets of inequalities:
  max_{t≤T} f'(λ_t) ≤ f'(λ*) < max_{t≤T} f(λ_t) + Ω/T + c;
  max_{t≤T} f(λ_t) ≤ f(λ*) < max_{t≤T} f'(λ_t) + Ω/T + c.
That is, the global sensitivity of max_{t≤T} f(λ_t) is bounded: |max_{t≤T} f'(λ_t) − max_{t≤T} f(λ_t)| ≤ Ω/T + c.

Given the sensitivity of the maximum f, we can readily derive the sensitivity of the maximum v. First note that we can use the triangle inequality to derive
  |max_{t≤T} v'(λ_t) − max_{t≤T} v(λ_t)| ≤ |max_{t≤T} v(λ_t) − max_{t≤T} f(λ_t)| + |max_{t≤T} v'(λ_t) − max_{t≤T} f'(λ_t)| + |max_{t≤T} f'(λ_t) − max_{t≤T} f(λ_t)|.
We can immediately bound the final term on the right hand side. Note that, as v_t = f(λ_t) + α_t, the first two terms are bounded above by |α_⌈t⌉| and |α'_⌈t⌉|, where ⌈t⌉ ≜ argmax_{t≤T} |α_t| (similarly for α'). This is because, in the worst case, the observation noise shifts the observed maximum max_{t≤T} v_t up or down by α. Therefore, let α̂ = α if |α| > |α'| and α̂ = α' otherwise, so that we have
  |max_{t≤T} v'(λ_t) − max_{t≤T} v(λ_t)| ≤ Ω/T + c + |2α̂|.
Although α̂ can be arbitrarily large, recall that for Z ∼ N(0, 1), if we set Z = 2α̂/(σ√2) we have Pr[|Z| ≤ γ] ≥ 1 − e^{−γ²/2} ≜ 1 − δ/3, where γ = √(2 log(3/δ)). This implies that |2α̂| ≤ σ√(4 log(3/δ)) = q with probability at least 1 − δ/3. Therefore, if Theorem 1 from Srinivas et al. (2010) and the bound on |f(λ) − f'(λ)| hold together with probability at least 1 − 2δ/3 as described above, the theorem follows directly. □

As in Theorem 1, each quantity in the above bound is given in Algorithm 1 (β, c, q), given in previous results (Srinivas et al., 2010) (γ_T, C_1), or specified by the modeler (T, δ). Now that we have a bound on the sensitivity of the maximum v, we will use the Laplace mechanism to prove our privacy guarantee (proof in supplementary material):

Corollary 2. Let A(V) denote Algorithm 1 run on dataset V. Given Assumption 1, releasing ṽ is (ε, δ)-differentially private, i.e., Pr[A(V) = ṽ] ≤ e^ε Pr[A(V') = ṽ] + δ.

Further, as the Laplace distribution has exponential tails, the noise we add to obtain ṽ is not too large:

Theorem 4. Given the assumptions of Theorem 1, we have the following bound,
  |ṽ − f(λ*)| ≤ √(2 log(2T/δ)) + Ω/T + a(Ω/(εT) + c/ε + q/ε),
with probability at least 1 − (δ + e^{−a}), for Ω = √(C_1 T β_T γ_T).

Proof (Theorem 4). Let Z be a Laplace random variable with scale parameter b and location parameter 0; Z ∼ Lap(b). Then Pr[|Z| ≤ ab] = 1 − e^{−a}. Thus, in Algorithm 1, |ṽ − max_{t≤T} v_t| ≤ ab for b = Ω/(εT) + c/ε + q/ε with probability at least 1 − e^{−a}. Further observe,
  ab ≥ max_{t≤T} v_t − ṽ ≥ (max_{t≤T} f(λ_t) − max_{t≤T} |α_t|) − ṽ ≥ f(λ*) − Ω/T − max_{t≤T} |α_t| − ṽ,  (8)
where the second and third inequalities follow from the proof of Theorem 3 (using the regret bound of Srinivas et al. (2010), Theorem 1). Note that the third inequality holds with probability greater than 1 − δ/2 (given β_T in Algorithm 1). The final inequality implies f(λ*) − ṽ ≤ max_{t≤T} |α_t| + Ω/T + ab. Also note that,
  ab ≥ ṽ − max_{t≤T} v_t ≥ ṽ − (max_{t≤T} f(λ_t) + max_{t≤T} |α_t|) ≥ ṽ − f(λ*) − Ω/T − max_{t≤T} |α_t|.  (9)
Thus, we have that |ṽ − f(λ*)| ≤ max_{t≤T} |α_t| + Ω/T + ab. Because each |α_t| could be arbitrarily large, we give a high-probability bound: for Z_1, ..., Z_n ∼ N(0, 1) we have, by the tail probability bound and the union bound, that Pr[∀t, |Z_t| ≤ γ] ≥ 1 − n e^{−γ²/2} ≜ 1 − δ/2, with γ = √(2 log(2T/δ)). Setting Z_t = α_t/σ and n = T bounds max_{t≤T} |α_t|, and the theorem statement follows. □

We note that, because releasing either λ̃ or ṽ is (ε, δ)-differentially private, by Corollaries 1 and 2, releasing both private quantities in Algorithm 1 guarantees (2ε, 2δ)-differential privacy for validation dataset V. This is due to the composition properties of (ε, δ)-differential privacy (Dwork et al., 2006a); in fact, stronger composition results can be demonstrated (Dwork & Roth, 2013).

4. Without observation noise

In hyper-parameter tuning it may be reasonable to assume that we can observe function evaluations exactly: v_{V,t} = f_V(λ_t).
First note that we can use the same algorithm to report the maximum λ in the no-noise setting: Theorems 1 and 2 still hold (note that q = 0 in Theorem 2). However, we cannot readily report a private maximum f, as the information gain γ_T in Theorems 3 and 4 approaches infinity as σ² → 0. Therefore, we extend results from the previous section to the exact observation case via the regret bounds of de Freitas et al. (2012). Algorithm 2 demonstrates how to privatize the maximum f in the exact observation case.

Algorithm 2: Private Bayesian Opt. (noise-free observations)
  Input: V ⊆ X; Λ ⊆ R^d; T; (ε, δ); A, τ; assumptions on f_V in de Freitas et al. (2012)
  Run the method of de Freitas et al. (2012), resulting in noise-free observations f_V(λ_1), ..., f_V(λ_T)
  c = 2 √((1 − k(V, V')) log(2|Λ|/δ))
  Draw θ ∼ Lap( A e^{−2τ/(log 2)^{d/4}} / ε + c/ε )
  Return: f̃ = max_{2≤t≤T} f_V(λ_t) + θ

4.1. Private near-maximum validation gain

We demonstrate that releasing f̃ in Algorithm 2 is private (Corollary 3) and that only a small amount of noise is added to make f̃ private (Theorem 6). To do so, we derive the global sensitivity of max_{2≤t≤T} f(λ_t) in Algorithm 2 independent of the maximum information gain γ_T via de Freitas et al. (2012). Then we prove releasing f̃ is (ε, δ)-differentially private and that f̃ is almost max_{2≤t≤T} f(λ_t).

Global sensitivity. The following theorem gives a bound on the global sensitivity of the maximum f.

Theorem 5. Given Assumption 1 and the assumptions of de Freitas et al. (2012), for neighboring datasets V, V' the following global sensitivity bound holds:
  |max_{2≤t≤T} f'(λ_t) − max_{2≤t≤T} f(λ_t)| ≤ A e^{−2τ/(log 2)^{d/4}} + c
w.p. at least 1 − δ, for c = 2 √((1 − k(V, V')) log(2|Λ|/δ)).

We leave the proof to the supplementary material.

Given this sensitivity, we may apply the Laplace mechanism to release f̃.

Corollary 3. Let A(V) denote Algorithm 2 run on dataset V. Given Assumption 1 and that f_V satisfies the assumptions of de Freitas et al. (2012), f̃ is (ε, δ)-differentially private with respect to any neighboring dataset V', i.e.,
  Pr[A(V) = f̃] ≤ e^ε Pr[A(V') = f̃] + δ.

Even though we must add noise to the maximum f, we show that f̃ is still close to the optimal f(λ*).

Theorem 6. Given the assumptions of Theorem 5, we have the following utility guarantee for Algorithm 2:
  |f̃ − f(λ*)| ≤ Ω + a(Ω/ε + c/ε)
w.p. at least 1 − (δ + e^{−a}), for Ω = A e^{−2τ/(log 2)^{d/4}}.

We prove Corollary 3 and Theorem 6 in the supplementary material. We have demonstrated that in the noisy and noise-free settings we can release private near-optimal hyper-parameter settings λ̃ and function evaluations ṽ, f̃. However, the analysis thus far assumes the hyper-parameter set is finite: |Λ| < ∞. It is possible to relax this assumption using an analysis similar to Srinivas et al. (2010). We leave this analysis to the supplementary material.
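A corresponding sketch of the noise-free release in Algorithm 2 follows. Because the constants A and τ in the exponentially small de Freitas et al. (2012) regret term of Theorem 5 are problem dependent, that term is passed in as `regret_bound` rather than recomputed; this is an illustrative sketch, not a reference implementation.

```python
import numpy as np

def private_release_noisefree(f_obs, n_candidates, k1_vv, regret_bound,
                              epsilon, delta, rng=np.random.default_rng()):
    """Release f~ as in Algorithm 2: Laplace noise scaled by the Theorem 5 sensitivity.

    regret_bound -- the exponentially small term from de Freitas et al. (2012), as in Theorem 5
    k1_vv        -- assumed value of the dataset kernel k_1(V, V') used in c
    """
    c = 2.0 * np.sqrt((1.0 - k1_vv) * np.log(2.0 * n_candidates / delta))
    scale = (regret_bound + c) / epsilon
    return np.max(f_obs) + rng.laplace(0.0, scale)
```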
5. Without the GP assumption

Even if our true validation score f is not drawn from a Gaussian process (Assumption 1), we can still guarantee differential privacy for releasing its value after Bayesian optimization, f^BO = max_{t≤T} f(λ_t). In this section we describe a different functional assumption on f that also yields differentially private Bayesian optimization for the case of machine learning hyper-parameter tuning.

Assume we have a (nonsensitive) training set T = {(x_i, y_i)}_{i=1}^n which, given a hyperparameter λ, produces a model w(λ) from the following optimization,
  w_λ = argmin_w [ (λ/2) ‖w‖²_2 + (1/n) Σ_{i=1}^n ℓ(w, x_i, y_i) ] ≜ argmin_w O_λ(w).  (10)
The function ℓ is a training loss function (e.g., logistic loss, hinge loss). Given a (sensitive) validation set V = {(x_i, y_i)}_{i=1}^m ⊆ X, we would like to use Bayesian optimization to maximize a validation score f_V.

Assumption 2. Our true validation score f_V is
  f_V(w(λ)) = −(1/m) Σ_{i=1}^m g(w_λ, x_i, y_i),
where g(·) is a validation loss function that is L-Lipschitz in w (e.g., ramp loss, normalized sigmoid (Huang et al., 2014)). Additionally, the training model w_λ is the minimizer of eq. (10) for a training loss ℓ(·) that is 1-Lipschitz in w and convex (e.g., logistic loss, hinge loss).

Algorithm 3 describes a procedure for privately releasing the best validation accuracy f^BO given Assumption 2. Different from the previous algorithms, we may run Bayesian optimization in Algorithm 3 with any acquisition function (e.g., expected improvement (Mockus et al., 1978), UCB) and privacy is still guaranteed.

Algorithm 3: Private Bayesian Opt. (Lipschitz and convex)
  Input: T of size n; V of size m; Λ; λ_min; λ_max; ε; T; L; d
  Run Bayesian optimization for T timesteps, observing f_V(w_{λ_1}), ..., f_V(w_{λ_T}) for {λ_1, ..., λ_T} = Λ_{T,V} ⊆ Λ
  f^BO = max_{t≤T} f_V(w_{λ_t})
  g* = max_{(x,y)∈X, w∈R^d} g(w, x, y)
  Draw θ ∼ Lap( (1/ε) min{g*/m, L/(mλ_min)} + (λ_max − λ_min)L/(ε λ_max λ_min) )
  Return: f̃_L = f^BO + θ
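A minimal sketch of the release step of Algorithm 3 is shown below. The Bayesian optimization loop that produces `f_bo`, the Lipschitz constant L of the validation loss, the bound g* on that loss, and the hyper-parameter range [λ_min, λ_max] are all assumed to be supplied by the user; the function name is illustrative.

```python
import numpy as np

def private_best_validation_score(f_bo, m, L, g_star, lam_min, lam_max,
                                  epsilon, rng=np.random.default_rng()):
    """Laplace release of the best validation score, as in Algorithm 3.

    f_bo   -- best (non-private) validation score found by Bayesian optimization
    m      -- size of the sensitive validation set V
    L      -- Lipschitz constant of the validation loss g in w
    g_star -- maximum value of the validation loss g
    """
    sensitivity = (min(g_star / m, L / (m * lam_min))
                   + (lam_max - lam_min) * L / (lam_max * lam_min))
    return f_bo + rng.laplace(0.0, sensitivity / epsilon)
```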
Similar to Algorithms 1 and 2, we use the Laplace mechanism to mask the possible change in validation accuracy when V is swapped with a neighboring validation set V'. Different from the work of Chaudhuri & Vinterbo (2013), changing V to V' may also lead to Bayesian optimization searching different hyper-parameters, Λ_{T,V} vs. Λ_{T,V'}. Therefore, we must bound the total global sensitivity of f with respect to V and λ.

Definition 5. The total global sensitivity of f over all neighboring datasets V, V' is
  Δ_f ≜ max_{V,V'⊆X; λ,λ'∈Λ} |f_V(w_λ) − f_V'(w_λ')|.

In the following theorem we demonstrate that we can bound the change in f for arbitrary λ < λ'.

Theorem 7. Given Assumption 2, for neighboring V, V' and arbitrary λ < λ', we have that
  |f_V(w_λ) − f_V'(w_λ')| ≤ (λ' − λ)L/(λ'λ) + min{g*/m, L/(mλ_min)},
where L is the Lipschitz constant of f, g* = max_{(x,y)∈X, w∈R^d} g(w, x, y), and m is the size of V.

Proof. Applying the triangle inequality yields
  |f_V(w_λ) − f_V'(w_λ')| ≤ |f_V(w_λ) − f_V(w_λ')| + |f_V(w_λ') − f_V'(w_λ')|.
The second term is bounded by Chaudhuri & Vinterbo (2013) in the proof of their Theorem 4. The only difference is, as we are not adding random noise to w_λ', we have that |f_V(w_λ') − f_V'(w_λ')| ≤ min{g*/m, L/(mλ_min)}.

To bound the first term, let O_λ(w) be the value of the objective in eq. (10) for a particular λ. Note that O_λ(w) and O_λ'(w) are λ- and λ'-strongly convex. Define
  h(w) = O_λ'(w) − O_λ(w) = ((λ' − λ)/2) ‖w‖²_2.  (11)
Further, define the minimizers w_λ = argmin_w O_λ(w) and w_λ' = argmin_w [O_λ(w) + h(w)]. This implies that
  ∇O_λ(w_λ) = ∇O_λ(w_λ') + ∇h(w_λ') = 0.  (12)
Given that O_λ is λ-strongly convex (Shalev-Shwartz, 2007), and by the Cauchy-Schwartz inequality,
  λ ‖w_λ − w_λ'‖²_2 ≤ [∇O_λ(w_λ) − ∇O_λ(w_λ')]^⊤ (w_λ − w_λ')
                    ≤ ∇h(w_λ')^⊤ (w_λ − w_λ')
                    ≤ ‖∇h(w_λ')‖_2 ‖w_λ − w_λ'‖_2.
Rearranging,
  ‖w_λ − w_λ'‖_2 ≤ (1/λ) ‖∇h(w_λ')‖_2 = (1/λ) ‖ ((λ' − λ)/2) ∇‖w_λ'‖²_2 ‖_2.  (13)
Now, as w_λ' is the minimizer of O_λ', we have
  ∇‖w_λ'‖²_2 = (2/λ') [ −(1/n) Σ_{i=1}^n ∇ℓ(w_λ', x_i, y_i) ].
Substituting this value of ∇‖w_λ'‖²_2 into eq. (13), pulling the positive constant (λ' − λ)/2 out of the norm, and dropping the negative sign inside the norm gives us
  (1/λ) ‖∇h(w_λ')‖_2 = ((λ' − λ)/(λλ')) ‖ (1/n) Σ_{i=1}^n ∇ℓ(w_λ', x_i, y_i) ‖_2 ≤ (λ' − λ)/(λλ').
The last inequality follows from the fact that the loss ℓ is 1-Lipschitz by Assumption 2 and the triangle inequality. Thus, along with eq. (13), we have
  ‖w_λ − w_λ'‖_2 ≤ (1/λ) ‖∇h(w_λ')‖_2 ≤ (λ' − λ)/(λλ').
Finally, as f is L-Lipschitz in w,
  |f_V(w_λ) − f_V(w_λ')| ≤ L ‖w_λ − w_λ'‖_2 ≤ L(λ' − λ)/(λλ').
Combining the result of Chaudhuri & Vinterbo (2013) with the above expression completes the proof. □

Given a finite set of possible hyperparameters Λ, we would like to bound |f*_V − f*_V'|, the difference between the best validation scores found when running Bayesian optimization on V vs. V'. Note that, by Theorem 7,
  |f*_V − f*_V'| ≤ max_{λ,λ'} |f_V(w_λ) − f_V'(w_λ')| ≤ (λ_max − λ_min)L/(λ_max λ_min) + min{g*/m, L/(mλ_min)},
as (λ' − λ)/(λλ') is strictly increasing in λ' and strictly decreasing in λ. Given this sensitivity of f*, we can use the Laplace mechanism to hide changes in the validation set as follows.

Corollary 4. Let A(V) denote Algorithm 3 applied on dataset V. Given Assumption 2, f̃_L is ε-differentially private, i.e., Pr[A(V) = f̃_L] ≤ e^ε Pr[A(V') = f̃_L].

We leave the proof to the supplementary material. Further, by the exponential tails of the Laplace mechanism we have the following utility guarantee.

Theorem 8. Given the assumptions of Theorem 7, we have the following utility guarantee for f̃_L w.r.t. f^BO:
  |f̃_L − f^BO| ≤ a [ (1/(εm)) min{g*, L/λ_min} + (λ_max − λ_min)L/(ε λ_max λ_min) ],
with probability at least 1 − e^{−a}.

Proof. This follows exactly from the tail bound on Laplace random variables, given at the beginning of the proof of Theorem 4. □
6. Results

In this section we examine the validity of our multi-task Gaussian process assumption on [f_{V_1}, ..., f_{V_{2^{|X|}}}]. Specifically, we search for the most likely value of the multi-task Gaussian process covariance element k_1(V, V') for classifier hyper-parameter tuning. Larger values of k_1(V, V') correspond to smaller global sensitivity bounds in Theorems 1, 3, and 5, leading to improved privacy guarantees.

For our setting of hyper-parameter tuning, each λ = [C, γ²] gives the hyper-parameters for training a kernelized support vector machine (SVM) (Cortes & Vapnik, 1995; Schölkopf & Smola, 2001) with cost parameter C and radial basis kernel width γ². The value f_V(λ) is the accuracy of the SVM model trained with hyper-parameters λ on V.

To search for the most likely k_1(V, V') we start by sampling 100 different SVM hyper-parameter settings λ_1, ..., λ_100 from a Sobol sequence and train an SVM model for each on the Forest UCI dataset (36,603 training inputs). We then randomly sample 100 i.i.d. validation sets V. Here we describe the evaluation procedure for a fixed validation set size, which corresponds to a single curve in Figure 1 (as such, to generate all results we repeat this procedure for each validation set size in the set {1000, 2000, 3000, 5000, 15000}). For each of the 100 validation sets we randomly add or remove an input to form a neighboring dataset V'. We then evaluate each of the trained SVM models on all 100 datasets and their pairs V'. This results in two 100 × 100 (number of datasets by number of trained SVM models) function evaluation matrices F_V and F_V'. Thus [F_V]_{ij} is the jth SVM model's accuracy on the ith validation set V_i (and similarly for F_V').

The likelihood of function evaluations for a dataset pair (V_i, V_i'), for a value of k_1(V_i, V_i'), is given by the marginal likelihood of the multi-task Gaussian process:
  Pr( [F_V]_i, [F_V']_i ) ∼ N(µ_100, σ²_100),  (14)
where [F_V]_i = [f_V(λ_1), ..., f_V(λ_100)] (similarly for V'), and µ_100 and σ²_100 are the posterior mean and variance of the multi-task Gaussian process using kernel k_1(V, V') ⊗ k_2(λ, λ') after observing [F_V]_i and [F_V']_i (for more details see Bonilla et al. (2008)). As µ_100 and σ²_100 depend on k_1(V, V'), we treat it as a free parameter and vary its value from 0.05 to 0.95 in increments of 0.05. For each value, we compute the marginal likelihood for all validation datasets (V_i for i = 1, ..., 100). As each V_i is sampled i.i.d., the joint marginal likelihood is simply the product of all likelihoods. Computing this joint marginal likelihood for each k_1(V, V') value yields a single curve of Figure 1. As shown, the largest value of k_1(V, V') = 0.95 is most likely, meaning that c in the global sensitivity bounds is quite small, leading to private values that are closer to their true optimums.

Figure 1. The log likelihood of a multi-task Gaussian process for different values of the kernel value k_1(V, V'). The function evaluations are the validation accuracy of SVMs with different hyper-parameters. (One curve per validation set size in {1000, 2000, 3000, 5000, 15000}; x-axis: dataset kernel value k_1(V, V'); y-axis: log likelihood.)
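A simplified sketch of this model-selection procedure is given below. It assumes, for illustration, a zero-mean multi-task GP prior and precomputed evaluation matrices F and F' together with the hyper-parameter kernel matrix K_2; it is not the exact computation behind eq. (14), and all names are illustrative.

```python
import numpy as np

def joint_log_likelihood(F, Fp, K2, k1_values, sigma2=1e-3):
    """Log-likelihood of paired evaluations under a zero-mean multi-task GP prior
    with covariance k1(V, V') x k2(lambda, lambda'), for a grid of k1 values.

    F, Fp -- (n_datasets, n_lambdas) validation accuracies on each V_i and its neighbor V_i'
    K2    -- (n_lambdas, n_lambdas) kernel matrix k2 over the hyper-parameter settings
    """
    n, p = F.shape
    scores = []
    for k1 in k1_values:
        # Covariance for one (V_i, V_i') pair: [[1, k1], [k1, 1]] Kronecker K2, plus noise.
        C = np.block([[K2, k1 * K2], [k1 * K2, K2]]) + sigma2 * np.eye(2 * p)
        _, logdet = np.linalg.slogdet(C)
        Cinv = np.linalg.inv(C)
        total = 0.0
        for i in range(n):
            y = np.concatenate([F[i], Fp[i]])
            total += -0.5 * (y @ Cinv @ y + logdet + 2 * p * np.log(2 * np.pi))
        scores.append(total)              # i.i.d. dataset pairs: log-likelihoods add
    return np.array(scores)
```

The index of the largest entry of the returned array identifies the most likely k_1(V, V') on the grid, mirroring the curves in Figure 1.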
7. Related work

There has been much work towards differentially private convex optimization (Chaudhuri et al., 2011; Kifer et al., 2012; Duchi et al., 2013; Song et al., 2013; Jain & Thakurta, 2014; Bassily et al., 2014). The work of Bassily et al. (2014) established upper and lower bounds for the excess empirical risk of ε- and (ε, δ)-differentially private algorithms for many settings, including convex and strongly convex risk functions that may or may not be smooth. There is also related work towards private high-dimensional regression, where the dimensions outnumber the number of instances (Kifer et al., 2012; Smith & Thakurta, 2013a). In such cases the Hessian becomes singular and so the loss is nonconvex. However, it is possible to use the restricted strong convexity of the loss in the regression case to guarantee privacy.

Differential privacy has been shown to be achievable in online and interactive kernel learning settings (Jain et al., 2012; Smith & Thakurta, 2013b; Jain & Thakurta, 2013; Mishra & Thakurta, 2014). In general, non-private online algorithms are closest in spirit to the methods of Bayesian optimization. However, all of the previous work in differentially private online learning represents a dataset as a sequence of bandit arm pulls (the equivalent notion in Bayesian optimization is function evaluations f(x_t)). Instead, we consider functions in which changing a single dataset entry possibly affects all future function evaluations. Closest to our work is that of Chaudhuri & Vinterbo (2013), who show that, given a fixed set of hyper-parameters which are always evaluated for any validation set, they can return a private version of the index of the best hyper-parameter, as well as a private model trained with that hyper-parameter. Our setting is strictly more general in that, if the validation set changes, Bayesian optimization could search completely different hyper-parameters.

Bayesian optimization, largely due to its principled handling of the exploration/exploitation trade-off of global, black-box function optimization, is quickly becoming the global optimization paradigm of choice. Alongside promising empirical results there is a wealth of recent work on convergence guarantees for Bayesian optimization, similar to those used in this work (Srinivas et al., 2010; de Freitas et al., 2012). Vazquez & Bect (2010) and Bull (2011) give regret bounds for optimizing the expected improvement acquisition function at each optimization step. BayesGap (Hoffman et al., 2014) gives a convergence guarantee for Bayesian optimization with budget constraints. Bayesian optimization has also been extended to multi-task optimization (Bardenet et al., 2013; Swersky et al., 2013), to the setting where multiple experiments can be run at once (Azimi et al., 2012; Snoek et al., 2012), and to constrained optimization (Gardner et al., 2014).

8. Conclusion

We have introduced methods for privately releasing the best hyper-parameters and validation accuracies in the case of exact and noisy observations. Our work makes use of the differential privacy framework, which has become commonplace in private machine learning (Dwork & Roth, 2013). We believe we are the first to demonstrate differentially private quantities in the setting of global optimization of expensive (possibly nonconvex) functions, through the lens of Bayesian optimization.

One key future direction is to design techniques to release each sampled hyper-parameter and validation accuracy privately (during the run of Bayesian optimization). This requires analyzing how the maximum upper-confidence bound changes as the validation dataset changes. Another interesting direction is extending our guarantees in Sections 3 and 4 to other acquisition functions.

For the case of machine learning hyper-parameter tuning, our results are designed to guarantee privacy of the validation set only (equivalently, the training set is never allowed to change). To simultaneously protect the privacy of the training set it may be possible to use techniques similar to the training stability results of Chaudhuri & Vinterbo (2013). Training stability could be guaranteed, for example, by assuming an additional training set kernel that bounds the effect of altering the training set on f. We leave developing these guarantees for future work.

As practitioners begin to use Bayesian optimization in practical settings involving sensitive data, it becomes crucial to consider how to preserve data privacy while reporting accurate Bayesian optimization results. This work presents methods to achieve such privacy, which we hope will be useful to practitioners and theorists alike.

References

Auer, Peter, Cesa-Bianchi, Nicolo, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Azimi, Javad, Jalali, Ali, and Fern, Xiaoli Z. Hybrid batch Bayesian optimization. In ICML, pp. 1215–1222. ACM, 2012.

Bardenet, Rémi, Brendel, Mátyás, Kégl, Balázs, and Sebag, Michèle. Collaborative hyperparameter tuning. In ICML, 2013.

Bassily, Raef, Smith, Adam, and Thakurta, Abhradeep. Private empirical risk minimization, revisited. arXiv preprint arXiv:1405.7085, 2014.

Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. JMLR, 13:281–305, 2012.

Bonilla, Edwin, Chai, Kian Ming, and Williams, Christopher. Multi-task Gaussian process prediction. In NIPS, 2008.

Bull, Adam D. Convergence rates of efficient global optimization algorithms. JMLR, 12:2879–2904, 2011.
Chaudhuri, Kamalika and Vinterbo, Staal A. A stability-based validation procedure for differentially private machine learning. In Advances in Neural Information Processing Systems, pp. 2652–2660, 2013.

Chaudhuri, Kamalika, Monteleoni, Claire, and Sarwate, Anand D. Differentially private empirical risk minimization. JMLR, 12:1069–1109, 2011.

Chong, Miao M, Abraham, Ajith, and Paprzycki, Marcin. Traffic accident analysis using machine learning paradigms. Informatica (Slovenia), 29(1):89–98, 2005.

Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

de Freitas, Nando, Smola, Alex, and Zoghi, Masrour. Exponential regret bounds for Gaussian process bandits with deterministic observations. In ICML, 2012.

Dinur, Irit and Nissim, Kobbi. Revealing information while preserving privacy. In Proceedings of the SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 202–210. ACM, 2003.

Duchi, John C, Jordan, Michael I, and Wainwright, Martin J. Local privacy and statistical minimax rates. In FOCS, pp. 429–438. IEEE, 2013.

Dwork, Cynthia and Roth, Aaron. The algorithmic foundations of differential privacy. Theoretical Computer Science, 9(3-4):211–407, 2013.

Dwork, Cynthia, Kenthapadi, Krishnaram, McSherry, Frank, Mironov, Ilya, and Naor, Moni. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology - EUROCRYPT 2006, pp. 486–503. Springer, 2006a.

Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, and Smith, Adam. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pp. 265–284. Springer, 2006b.

Ganta, Srivatsava Ranjit, Kasiviswanathan, Shiva Prasad, and Smith, Adam. Composition attacks and auxiliary information in data privacy. In KDD, pp. 265–273. ACM, 2008.

Gardner, Jacob, Kusner, Matt, Xu, Zhixiang, Weinberger, Kilian, and Cunningham, John. Bayesian optimization with inequality constraints. In ICML, pp. 937–945, 2014.

Hoffman, Matthew, Shahriari, Bobak, and de Freitas, Nando. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In AISTATS, pp. 365–374, 2014.

Huang, Xiaolin, Shi, Lei, and Suykens, Johan AK. Ramp loss linear programming support vector machine. The Journal of Machine Learning Research, 15(1):2185–2211, 2014.

Hutter, Frank, Hoos, H. Holger, and Leyton-Brown, Kevin. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pp. 507–523. Springer, 2011.

Jain, Prateek and Thakurta, Abhradeep. Differentially private learning with kernels. In ICML, pp. 118–126, 2013.

Jain, Prateek and Thakurta, Abhradeep Guha. (Near) dimension independent risk bounds for differentially private learning. In ICML, pp. 476–484, 2014.

Jain, Prateek, Kothari, Pravesh, and Thakurta, Abhradeep. Differentially private online learning. COLT, 2012.

Kifer, Daniel, Smith, Adam, and Thakurta, Abhradeep. Private convex empirical risk minimization and high-dimensional regression. JMLR, 1:41, 2012.

Krause, Andreas, Singh, Ajit, and Guestrin, Carlos. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.

McSherry, Frank and Talwar, Kunal. Mechanism design via differential privacy. In FOCS, pp. 94–103. IEEE, 2007.

Mishra, Nikita and Thakurta, Abhradeep. Private stochastic multi-arm bandits: From theory to practice. In ICML Workshop on Learning, Security, and Privacy, 2014.

Mockus, Jonas, Tiesis, Vytautas, and Zilinskas, Antanas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117-129):2, 1978.

Narayanan, Arvind and Shmatikov, Vitaly. Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy, pp. 111–125. IEEE, 2008.