Differentially Private Bayesian Optimization

Matt J. Kusner, Jacob R. Gardner, Roman Garnett, Kilian Q. Weinberger
Computer Science & Engineering, Washington University in St. Louis

arXiv:1501.04080v2 [stat.ML] 23 Feb 2015

Abstract

Bayesian optimization is a powerful tool for fine-tuning the hyper-parameters of a wide variety of machine learning models. The success of machine learning has led practitioners in diverse real-world settings to learn classifiers for practical problems. As machine learning becomes commonplace, Bayesian optimization becomes an attractive method for practitioners to automate the process of classifier hyper-parameter tuning. A key observation is that the data used for tuning models in these settings is often sensitive. Certain data such as genetic predisposition, personal email statistics, and car accident history, if not properly private, may be at risk of being inferred from Bayesian optimization outputs. To address this, we introduce methods for releasing the best hyper-parameters and classifier accuracy privately. Leveraging the strong theoretical guarantees of differential privacy and known Bayesian optimization convergence bounds, we prove that under a GP assumption these private quantities are also near-optimal. Finally, even if this assumption is not satisfied, we can use different smoothness guarantees to protect privacy.

1. Introduction

Machine learning is increasingly used in application areas with sensitive data. For example, hospitals use machine learning to predict if a patient is likely to be readmitted soon (Yu et al., 2013), webmail providers classify spam emails from non-spam (Weinberger et al., 2009), and insurance providers forecast the extent of bodily injury in car crashes (Chong et al., 2005).

In these scenarios data cannot be shared legally, but companies and hospitals may want to share hyper-parameters and validation accuracies through publications or other means. However, data-holders must be careful, as even a small amount of information can compromise privacy.

Which hyper-parameter setting yields the highest accuracy can reveal sensitive information about individuals in the validation or training data set, reminiscent of reconstruction attacks described by Dwork & Roth (2013) and Dinur & Nissim (2003). For example, imagine updated hyper-parameters are released right after a prominent public figure is admitted to a hospital. If a hyper-parameter is known to correlate strongly with a particular disease the patient is suspected to have, an attacker could make a direct correlation between the hyper-parameter value and the individual.

To prevent this sort of attack, we develop a set of algorithms that automatically fine-tune the hyper-parameters of a machine learning algorithm while provably preserving differential privacy (Dwork et al., 2006b). Our approach leverages recent results on Bayesian optimization (Snoek et al., 2012; Hutter et al., 2011; Bergstra & Bengio, 2012; Gardner et al., 2014), training a Gaussian process (GP) (Rasmussen & Williams, 2006) to accurately predict and maximize the validation gain of hyper-parameter settings. We show that the GP model in Bayesian optimization allows us to release noisy final hyper-parameter settings to protect against the aforementioned privacy attacks, while only sacrificing a tiny, bounded amount of validation gain.

Our privacy guarantees hold for releasing the best hyper-parameters and best validation gain. Specifically, our contributions are as follows:

• We derive, to the best of our knowledge, the first framework for Bayesian optimization with provable differential privacy guarantees,
• We develop variations both with and without observation noise, and

• We show that even if our validation gain is not drawn from a Gaussian process, we can guarantee differential privacy under different smoothness assumptions.

We begin with background on Bayesian optimization and differential privacy we will use to prove our guarantees.

2. Background

In general, our aim will be to protect the privacy of a validation dataset $\mathcal{V} \subseteq \mathcal{X}$ of sensitive records (where $\mathcal{X}$ is the collection of all possible records) when the results of Bayesian optimization depend on $\mathcal{V}$.

Bayesian optimization. Our goal is to maximize an unknown function $f_\mathcal{V}: \Lambda \to \mathbb{R}$ that depends on some validation dataset $\mathcal{V} \subseteq \mathcal{X}$:

$$\max_{\lambda \in \Lambda} f_\mathcal{V}(\lambda). \qquad (1)$$

It is important to point out that all of our results hold for the general setting of eq. (1), but throughout the paper we use the vocabulary of a common application: that of machine learning hyper-parameter tuning. In this case $f_\mathcal{V}(\lambda)$ is the gain of a learning algorithm evaluated on validation dataset $\mathcal{V}$ that was trained with hyper-parameters $\lambda \in \Lambda \subseteq \mathbb{R}^d$.

As evaluating $f_\mathcal{V}$ is expensive (e.g., each evaluation requires training a learning algorithm), Bayesian optimization gives a procedure for selecting a small number of locations to sample $f_\mathcal{V}$: $[\lambda_1, \ldots, \lambda_T] = \boldsymbol{\lambda}_T \in \mathbb{R}^{d \times T}$. Specifically, given a current sample $\lambda_t$, we observe a validation gain $v_t$ such that $v_t = f_\mathcal{V}(\lambda_t) + \alpha_t$, where $\alpha_t \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise with possibly non-zero variance $\sigma^2$. Then, given $v_t$ and previously observed values $v_1, \ldots, v_{t-1}$, Bayesian optimization updates its belief of $f_\mathcal{V}$ and samples a new hyper-parameter $\lambda_{t+1}$. Each step of the optimization proceeds in this way.

To decide which hyper-parameter to sample next, Bayesian optimization places a prior distribution over $f_\mathcal{V}$ and updates it after every (possibly noisy) function observation.
One popular prior distribution over functions is the Gaussian process $\mathcal{GP}(\mu(\cdot), k(\cdot,\cdot))$ (Rasmussen & Williams, 2006), parameterized by a mean function $\mu(\cdot)$ (we set $\mu = 0$, w.l.o.g.) and a kernel covariance function $k(\cdot,\cdot)$. Functions drawn from a Gaussian process have the property that any finite set of values of the function are normally distributed. Additionally, given samples $\boldsymbol{\lambda}_T = [\lambda_1, \ldots, \lambda_T]$ and observations $\mathbf{v}_T = [v_1, \ldots, v_T]$, the GP posterior mean and variance have a closed form:

$$\mu_T(\lambda) = k(\lambda, \boldsymbol{\lambda}_T)(K_T + \sigma^2 I)^{-1}\mathbf{v}_T$$
$$k_T(\lambda, \lambda') = k(\lambda, \lambda') - k(\lambda, \boldsymbol{\lambda}_T)(K_T + \sigma^2 I)^{-1} k(\boldsymbol{\lambda}_T, \lambda')$$
$$\sigma_T^2(\lambda) = k_T(\lambda, \lambda), \qquad (2)$$

where $k(\lambda, \boldsymbol{\lambda}_T) \in \mathbb{R}^{1 \times T}$ is evaluated element-wise on each of the $T$ columns of $\boldsymbol{\lambda}_T$. As well, $K_T = k(\boldsymbol{\lambda}_T, \boldsymbol{\lambda}_T) \in \mathbb{R}^{T \times T}$ and $\lambda \in \Lambda$ is any hyper-parameter. As more samples are observed, the posterior mean function $\mu_T(\lambda)$ approaches $f_\mathcal{V}(\lambda)$.

One well-known method to select hyper-parameters $\lambda$ maximizes the upper-confidence bound (UCB) of the posterior GP model of $f_\mathcal{V}$ (Auer et al., 2002; Srinivas et al., 2010):

$$\lambda_{t+1} \triangleq \arg\max_{\lambda \in \Lambda}\; \mu_t(\lambda) + \sqrt{\beta_{t+1}}\,\sigma_t(\lambda), \qquad (3)$$

where $\beta_{t+1}$ is a parameter that trades off the exploitation of maximizing $\mu_t(\lambda)$ and the exploration of maximizing $\sigma_t(\lambda)$. Srinivas et al. (2010) proved that given certain assumptions on $f_\mathcal{V}$ and fixed, non-zero observation noise ($\sigma^2 > 0$), selecting hyper-parameters $\lambda_t$ to maximize eq. (3) is a no-regret Bayesian optimization procedure: $\lim_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} f_\mathcal{V}(\lambda^*) - f_\mathcal{V}(\lambda_t) = 0$, where $f_\mathcal{V}(\lambda^*)$ is the maximum of eq. (1). For the no-noise setting, de Freitas et al. (2012) give a UCB-based no-regret algorithm.

Contributions. Alongside maximizing $f_\mathcal{V}$, we would like to guarantee that if $f_\mathcal{V}$ depends on (sensitive) validation data, we can release information about $f_\mathcal{V}$ so that the data remains private. Specifically, we may wish to release (a) our best guess $\hat{\lambda} \triangleq \arg\max_{t \leq T} f_\mathcal{V}(\lambda_t)$ of the true (unknown) maximizer $\lambda^*$ and (b) our best guess $f_\mathcal{V}(\hat{\lambda})$ of the true (also unknown) maximum objective $f_\mathcal{V}(\lambda^*)$. The primary question this work aims to answer is: how can we release private versions of $\hat{\lambda}$ and $f_\mathcal{V}(\hat{\lambda})$ that are close to their true values, or better, the values $\lambda^*$ and $f_\mathcal{V}(\lambda^*)$? We give two answers to these questions. The first will make a Gaussian process assumption on $f_\mathcal{V}$, which we describe immediately below. The second, described in Section 5, will utilize Lipschitz and convexity assumptions to guarantee privacy in the event the GP assumption does not hold.

Setting. For our first answer to this question, let us define a Gaussian process over hyper-parameters $\lambda, \lambda' \in \Lambda$ and datasets $\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}$ as follows: $\mathcal{GP}\big(0,\; k_1(\mathcal{V}, \mathcal{V}') \otimes k_2(\lambda, \lambda')\big)$. A prior of this form is known as a multi-task Gaussian process (Bonilla et al., 2008). Many choices for $k_1$ and $k_2$ are possible. The function $k_1(\mathcal{V}, \mathcal{V}')$ defines a set kernel (e.g., a function of the number of records that differ between $\mathcal{V}$ and $\mathcal{V}'$). For $k_2$, we focus on either the squared exponential kernel, $k_2(\lambda, \lambda') = \exp\big(-\|\lambda - \lambda'\|_2^2 / (2\ell^2)\big)$, or Matérn kernels (e.g., for $\nu = 5/2$, $k_2(\lambda, \lambda') = \big(1 + \sqrt{5}r/\ell + 5r^2/(3\ell^2)\big)\exp(-\sqrt{5}r/\ell)$, for $r = \|\lambda - \lambda'\|_2$), for a fixed $\ell$, as they have known bounds on the maximum information gain (Srinivas et al., 2010). Note that as defined, the kernel $k_2$ is normalized (i.e., $k_2(\lambda, \lambda) = 1$).

Assumption 1. We have a problem of type (1), where all possible dataset functions $[f_{\mathcal{V}_1}, \ldots, f_{\mathcal{V}_{2^{|\mathcal{X}|}}}]$ are GP distributed $\mathcal{GP}\big(0,\; k_1(\mathcal{V}, \mathcal{V}') \otimes k_2(\lambda, \lambda')\big)$ for known kernels $k_1, k_2$, for all $\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}$ and $\lambda, \lambda' \in \Lambda$, where $|\Lambda| \leq \infty$.
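To make the GP-UCB machinery above concrete, the following is a minimal sketch (our illustration, not the authors' code) of the posterior update in eq. (2) and the UCB selection rule in eq. (3), assuming a squared-exponential kernel $k_2$ over a finite candidate set `Lambda`; the function `validation_gain` standing in for a noisy evaluation of $f_\mathcal{V}$ is hypothetical.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel k_2 between rows of A (n x d) and B (m x d)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

def gp_posterior(Lambda, sampled, values, noise_var=0.01):
    """Posterior mean/std over all candidates, as in eq. (2)."""
    K_T = sq_exp_kernel(sampled, sampled) + noise_var * np.eye(len(sampled))
    K_s = sq_exp_kernel(Lambda, sampled)      # k(lambda, lambda_T) for every candidate
    K_inv = np.linalg.inv(K_T)
    mu = K_s @ K_inv @ values
    var = 1.0 - np.einsum('ij,jk,ik->i', K_s, K_inv, K_s)  # k(l, l) = 1 for this kernel
    return mu, np.sqrt(np.maximum(var, 0.0))

def ucb_select(Lambda, sampled, values, beta):
    """UCB rule of eq. (3): argmax_lambda mu_t(lambda) + sqrt(beta) * sigma_t(lambda)."""
    mu, sigma = gp_posterior(Lambda, np.asarray(sampled), np.asarray(values))
    return int(np.argmax(mu + np.sqrt(beta) * sigma))

# Hypothetical usage, with validation_gain(lam) returning v_t = f_V(lam) + alpha_t:
# rng = np.random.default_rng(0)
# Lambda = rng.uniform(-1, 1, size=(50, 2))
# sampled, values = [Lambda[0]], [validation_gain(Lambda[0])]
# for t in range(1, 20):
#     beta = 2 * np.log(len(Lambda) * (t * np.pi) ** 2 / (3 * 0.05))
#     idx = ucb_select(Lambda, sampled, values, beta)
#     sampled.append(Lambda[idx]); values.append(validation_gain(Lambda[idx]))
```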
Similar Gaussian process assumptions have been made in previous work (Srinivas et al., 2010). For a result in the no-noise observation setting, we will make use of the assumptions of de Freitas et al. (2012) for our privacy guarantees, as described in Section 4.

2.1. Differential Privacy

One of the most widely accepted frameworks for private data release is differential privacy (Dwork et al., 2006b), which has been shown to be robust to a variety of privacy attacks (Ganta et al., 2008; Sweeney, 1997; Narayanan & Shmatikov, 2008). Given an algorithm $\mathcal{A}$ that outputs a value $\lambda$ when run on dataset $\mathcal{V}$, the goal of differential privacy is to 'hide' the effect of a small change in $\mathcal{V}$ on the output of $\mathcal{A}$. Equivalently, an attacker should not be able to tell if a private record was swapped in $\mathcal{V}$ just by looking at the output of $\mathcal{A}$. If two datasets $\mathcal{V}, \mathcal{V}'$ differ by swapping a single element, we will refer to them as neighboring datasets. Note that any non-trivial algorithm (i.e., an algorithm $\mathcal{A}$ that outputs different values on $\mathcal{V}$ and $\mathcal{V}'$ for some pair $\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}$) must include some amount of randomness to guarantee such a change in $\mathcal{V}$ is unobservable in the output $\lambda$ of $\mathcal{A}$ (Dwork & Roth, 2013). The level of privacy we wish to guarantee decides the amount of randomness we need to add to $\lambda$ (better privacy requires increased randomness). Formally, the definition of differential privacy is stated below.

Definition 1. A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private for $\epsilon, \delta \geq 0$ if for all $\lambda \in \mathrm{Range}(\mathcal{A})$ and for all neighboring datasets $\mathcal{V}, \mathcal{V}'$ (i.e., such that $\mathcal{V}$ and $\mathcal{V}'$ differ by swapping one record) we have that

$$\Pr\big[\mathcal{A}(\mathcal{V}) = \lambda\big] \leq e^{\epsilon} \Pr\big[\mathcal{A}(\mathcal{V}') = \lambda\big] + \delta. \qquad (4)$$

The parameters $\epsilon, \delta$ guarantee how private $\mathcal{A}$ is; the smaller, the more private. The maximum privacy is $\epsilon = \delta = 0$, in which case eq. (4) holds with equality. This can be seen by the fact that $\mathcal{V}$ and $\mathcal{V}'$ can be swapped in the definition, and thus the inequality holds in both directions. If $\delta = 0$, we say the algorithm is simply $\epsilon$-differentially private. For a survey on differential privacy we refer the interested reader to Dwork & Roth (2013).

There are two popular methods for making an algorithm $\epsilon$-differentially private: (a) the Laplace mechanism (Dwork et al., 2006b), in which we add random noise to $\lambda$, and (b) the exponential mechanism (McSherry & Talwar, 2007), which draws a random output $\tilde{\lambda}$ such that $\tilde{\lambda} \approx \lambda$. For each mechanism we must define an intermediate quantity called the global sensitivity, describing how much $\mathcal{A}$ changes when $\mathcal{V}$ changes.

Definition 2. (Laplace mechanism) The global sensitivity of an algorithm $\mathcal{A}$ over all neighboring datasets $\mathcal{V}, \mathcal{V}'$ (i.e., $\mathcal{V}, \mathcal{V}'$ differ by swapping one record) is

$$\Delta_\mathcal{A} \triangleq \max_{\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}} \|\mathcal{A}(\mathcal{V}) - \mathcal{A}(\mathcal{V}')\|_1.$$

(Exponential mechanism) The global sensitivity of a function $q: \mathcal{X} \times \Lambda \to \mathbb{R}$ over all neighboring datasets $\mathcal{V}, \mathcal{V}'$ is

$$\Delta_q \triangleq \max_{\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X},\, \lambda \in \Lambda} \|q(\mathcal{V}, \lambda) - q(\mathcal{V}', \lambda)\|_1.$$

The Laplace mechanism hides the output of $\mathcal{A}$ by perturbing its output with some amount of random noise.

Definition 3. Given a dataset $\mathcal{V}$ and an algorithm $\mathcal{A}$, the Laplace mechanism returns $\mathcal{A}(\mathcal{V}) + \omega$, where $\omega$ is a noise variable drawn from $\mathrm{Lap}(\Delta_\mathcal{A}/\epsilon)$, the Laplace distribution with scale parameter $\Delta_\mathcal{A}/\epsilon$ (and location parameter 0).

The exponential mechanism draws a slightly different $\tilde{\lambda}$ that is 'close' to $\lambda$, the output of $\mathcal{A}$.

Definition 4. Given a dataset $\mathcal{V}$ and an algorithm $\mathcal{A}(\mathcal{V}) = \arg\max_{\lambda \in \Lambda} q(\mathcal{V}, \lambda)$, the exponential mechanism returns $\tilde{\lambda}$, where $\tilde{\lambda}$ is drawn from the distribution $\frac{1}{Z}\exp\big(\epsilon\, q(\mathcal{V}, \lambda)/(2\Delta_q)\big)$, and $Z$ is a normalizing constant.
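As a concrete illustration of Definitions 3 and 4 (a minimal sketch, not the paper's implementation), the two mechanisms can be written as follows; the sensitivity values passed in are assumed to have been bounded elsewhere, e.g., by the theorems in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(value, sensitivity, epsilon):
    """Definition 3: release value + Lap(sensitivity / epsilon) noise."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def exponential_mechanism(scores, sensitivity, epsilon):
    """Definition 4: sample index i w.p. proportional to exp(eps * q_i / (2 * sensitivity)).

    `scores` holds q(V, lambda) for each candidate lambda in a finite set Lambda.
    """
    scores = np.asarray(scores, dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()                    # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(scores), p=probs)
```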
Given $\Lambda$, a possible set of hyper-parameters, we derive methods for privately releasing the best hyper-parameters and the best function values of $f_\mathcal{V}$, approximately solving eq. (1). We first address the setting with observation noise ($\sigma^2 > 0$) in eq. (2) and then describe small modifications for the no-noise setting. For each setting we use the UCB sampling technique in eq. (3) to derive our private results.

3. With observation noise

In general cases of Bayesian optimization, observation noise occurs in a variety of real-world modeling settings such as sensor measurement prediction (Krause et al., 2008). In hyper-parameter tuning, noise in the validation gain may be a result of noisy validation or training features.

In the sections that follow, although the quantities $f, \mu, \sigma, v$ all depend on the validation dataset $\mathcal{V}$, for notational simplicity we will occasionally omit the subscript $\mathcal{V}$. Similarly, for $\mathcal{V}'$ we will often write $f', \mu', \sigma'^2, v'$.

3.1. Private near-maximum hyper-parameters

In this section we guarantee that releasing $\tilde{\lambda}$ in Algorithm 1 is private (Theorem 1) and that it is near-optimal (Theorem 2). Our proof strategy is as follows: we will first bound the global sensitivity of $\mu_T(\lambda)$, with probability at least $1 - \delta$. Then we will show that releasing $\tilde{\lambda}$ via the exponential mechanism is $(\epsilon, \delta)$-differentially private. Finally, we prove that $\mu_T(\tilde{\lambda})$ is close to $f(\lambda^*)$, the true maximum of eq. (1).

Algorithm 1: Private Bayesian Opt. (noisy observations)

  Input: $\mathcal{V} \subseteq \mathcal{X}$; $\Lambda \subseteq \mathbb{R}^d$; $T$; $(\epsilon, \delta)$; $\sigma^2$; $\gamma_T$
  $\mu_{\mathcal{V},0} = 0$
  for $t = 1, \ldots, T$ do
    $\beta_t = 2\log\big(|\Lambda| t^2 \pi^2 / (3\delta)\big)$
    $\lambda_t \triangleq \arg\max_{\lambda \in \Lambda}\; \mu_{\mathcal{V},t-1}(\lambda) + \sqrt{\beta_t}\,\sigma_{\mathcal{V},t-1}(\lambda)$
    Observe validation gain $v_{\mathcal{V},t}$, given $\lambda_t$
    Update $\mu_{\mathcal{V},t}$ and $\sigma^2_{\mathcal{V},t}$ according to (2)
  end for
  $c = 2\sqrt{\big(1 - k_1(\mathcal{V}, \mathcal{V}')\big)\log\big(3|\Lambda|/\delta\big)}$
  $q = \sigma\sqrt{4\log(3/\delta)}$
  $C_1 = 8/\log(1 + \sigma^{-2})$
  Draw $\tilde{\lambda} \in \Lambda$ w.p. $\Pr[\lambda] \propto \exp\Big(\frac{\epsilon\, \mu_{\mathcal{V},T}(\lambda)}{2(2\sqrt{\beta_{T+1}} + c)}\Big)$
  $v^* = \max_{t \leq T} v_{\mathcal{V},t}$
  Draw $\theta \sim \mathrm{Lap}\Big(\frac{\sqrt{C_1 T \beta_T \gamma_T}}{T\epsilon} + \frac{c}{\epsilon} + \frac{q}{\epsilon}\Big)$
  $\tilde{v} = v^* + \theta$
  Return: $\tilde{\lambda}, \tilde{v}$
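The two release steps at the end of Algorithm 1 are instances of the mechanisms from Section 2.1. The sketch below (an illustration under Assumption 1, not the authors' reference code) shows how the private hyper-parameter index and private gain $\tilde{v}$ could be produced once the $T$ UCB iterations have finished; `mu_T` is the vector of posterior means over the finite set $\Lambda$, `v` the observed gains, and `k1_neighbor` an assumed value of the set kernel $k_1(\mathcal{V}, \mathcal{V}')$ between neighboring datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_release(mu_T, v, k1_neighbor, sigma, gamma_T, beta_T, beta_T1, eps, delta):
    """Release (lambda_tilde index, v_tilde) as in the last lines of Algorithm 1."""
    T = len(v)
    n_candidates = len(mu_T)
    # Sensitivity-related constants from Algorithm 1.
    c = 2.0 * np.sqrt((1.0 - k1_neighbor) * np.log(3.0 * n_candidates / delta))
    q = sigma * np.sqrt(4.0 * np.log(3.0 / delta))
    C1 = 8.0 / np.log(1.0 + sigma ** -2)
    # Exponential mechanism over the posterior mean (sensitivity 2*sqrt(beta_{T+1}) + c).
    logits = eps * np.asarray(mu_T) / (2.0 * (2.0 * np.sqrt(beta_T1) + c))
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    lam_idx = int(rng.choice(n_candidates, p=probs))
    # Laplace mechanism on the best observed gain.
    scale = np.sqrt(C1 * T * beta_T * gamma_T) / (T * eps) + c / eps + q / eps
    v_tilde = np.max(v) + rng.laplace(scale=scale)
    return lam_idx, v_tilde
```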
Global sensitivity. As a first step we bound the global sensitivity of $\mu_T(\lambda)$ as follows.

Theorem 1. Given Assumption 1, for any two neighboring datasets $\mathcal{V}, \mathcal{V}'$ and for all $\lambda \in \Lambda$, with probability at least $1 - \delta$ there is an upper bound on the global sensitivity (in the exponential mechanism sense) of $\mu_T$:

$$|\mu'_T(\lambda) - \mu_T(\lambda)| \leq 2\sqrt{\beta_{T+1}} + \sigma_1\sqrt{2\log(3|\Lambda|/\delta)},$$

for $\sigma_1 = \sqrt{2\big(1 - k_1(\mathcal{V}, \mathcal{V}')\big)}$ and $\beta_t = 2\log\big(|\Lambda| t^2 \pi^2/(3\delta)\big)$.

Proof. Note that, by applying the triangle inequality twice, for all $\lambda \in \Lambda$,

$$|\mu'_T(\lambda) - \mu_T(\lambda)| \leq |\mu'_T(\lambda) - f'(\lambda)| + |f'(\lambda) - \mu_T(\lambda)| \leq |\mu'_T(\lambda) - f'(\lambda)| + |f'(\lambda) - f(\lambda)| + |f(\lambda) - \mu_T(\lambda)|.$$

We can now bound each one of the terms in the summation on the right hand side (RHS), each with probability at least $1 - \delta/3$. According to Srinivas et al. (2010), Lemma 5.1, we obtain $|\mu'_T(\lambda) - f'(\lambda)| \leq \sqrt{\beta_{T+1}}\,\sigma'_T(\lambda)$. The same lemma can be applied to $|f(\lambda) - \mu_T(\lambda)|$. As $\sigma'_T(\lambda) \leq 1$, because $k(\lambda, \lambda) = 1$, we can upper bound both terms by $\sqrt{\beta_{T+1}}$. In order to bound the remaining (middle) term on the RHS, recall that for a random variable $Z \sim \mathcal{N}(0,1)$ we have $\Pr[|Z| > \gamma] \leq e^{-\gamma^2/2}$. For variables $Z_1, \ldots, Z_n \sim \mathcal{N}(0,1)$ we have, by the union bound, that $\Pr[\forall i,\ |Z_i| \leq \gamma] \geq 1 - n e^{-\gamma^2/2} \triangleq 1 - \delta/3$. If we set $Z_i = |f(\lambda_i) - f'(\lambda_i)|/\sigma_1$ and $n = |\Lambda|$, we obtain $\gamma = \sqrt{2\log(3|\Lambda|/\delta)}$, which completes the proof. □

We remark that all of the quantities in Theorem 1 are either given or selected by the modeler (e.g., $\delta, T$). Given this upper bound we can apply the exponential mechanism to release $\tilde{\lambda}$ privately, as per Definition 1.

Corollary 1. Let $\mathcal{A}(\mathcal{V})$ denote Algorithm 1 applied on dataset $\mathcal{V}$. Given Assumption 1, $\tilde{\lambda}$ is $(\epsilon, \delta)$-differentially private, i.e., $\Pr[\mathcal{A}(\mathcal{V}) = \tilde{\lambda}] \leq e^{\epsilon}\Pr[\mathcal{A}(\mathcal{V}') = \tilde{\lambda}] + \delta$, for any pair of neighboring datasets $\mathcal{V}, \mathcal{V}'$.

We leave the proof of Corollary 1 to the supplementary material. Even though we must release a noisy hyper-parameter setting $\tilde{\lambda}$, it is in fact near-optimal.

Theorem 2. Given Assumption 1, the following near-optimal approximation guarantee for releasing $\tilde{\lambda}$ holds:

$$\mu_T(\tilde{\lambda}) \geq f(\lambda^*) - 2\sqrt{\beta_T} - q - \frac{2\Delta}{\epsilon}\big(\log|\Lambda| + a\big)$$

w.p. $\geq 1 - (\delta + e^{-a})$, where $\Delta = 2\sqrt{\beta_{T+1}} + c$ (for $\beta_{T+1}$, $c$, and $q$ defined as in Algorithm 1).

Proof. In general, the exponential mechanism selects $\tilde{\lambda}$ that is close to the maximizer of $\mu_T$ (McSherry & Talwar, 2007):

$$\mu_T(\tilde{\lambda}) \geq \max_{\lambda \in \Lambda} \mu_T(\lambda) - \frac{2\Delta}{\epsilon}\big(\log|\Lambda| + a\big), \qquad (5)$$

with probability at least $1 - e^{-a}$. Recall we assume that at each optimization step we observe the noisy gain $v_t = f(\lambda_t) + \alpha_t$, where $\alpha_t \sim \mathcal{N}(0, \sigma^2)$ (with fixed noise variance $\sigma^2 > 0$). As such, we can lower bound the term $\max_{\lambda \in \Lambda}\mu_T(\lambda)$:

$$\max_{\lambda \in \Lambda} \mu_T(\lambda) \geq f(\lambda_T) + \alpha_T \;(= v_T),$$
$$f(\lambda^*) - \max_{\lambda \in \Lambda} \mu_T(\lambda) \leq f(\lambda^*) - f(\lambda_T) - \alpha_T \leq 2\sqrt{\beta_T}\,\sigma_{T-1}(\lambda_T) - \alpha_T,$$
$$\max_{\lambda \in \Lambda} \mu_T(\lambda) \geq f(\lambda^*) - 2\sqrt{\beta_T} + \alpha_T, \qquad (6)$$

where the second inequality in the middle line follows from Srinivas et al. (2010), Lemma 5.2, and the last line from the fact that $\sigma_{T-1}(\lambda_T) \leq 1$.

As in the proof of Theorem 1, for a normal random variable $Z \sim \mathcal{N}(0,1)$ we have $\Pr[|Z| \leq \gamma] \geq 1 - e^{-\gamma^2/2}$. Setting $Z = \alpha_T/\sigma$ and $\gamma = \sqrt{2\log(2/\delta)}$ implies that $|\alpha_T| \leq \sigma\sqrt{2\log(2/\delta)} \leq \sigma\sqrt{4\log(3/\delta)} = q$ (as defined in Algorithm 1), with probability at least $1 - \delta/2$. Thus, we can lower bound $\alpha_T$ by $-q$, and so we can lower bound $\max_{\lambda \in \Lambda}\mu_T(\lambda)$ in eq. (5) by the right hand side of eq. (6). Finally, given the $\beta_T$ in Algorithm 1, Lemma 5.2 of Srinivas et al. (2010) holds with probability at least $1 - \delta/3$, and the theorem statement follows. □

3.2. Private near-maximum validation gain

In this section we demonstrate that releasing the validation gain $\tilde{v}$ in Algorithm 1 is private (Theorem 3) and that the noise we add to ensure privacy is bounded with high probability (Theorem 4). As in the previous section, our approach will be to first derive the global sensitivity of the maximum $v$ found by Algorithm 1. Then we show releasing $\tilde{v}$ is $(\epsilon,\delta)$-differentially private via the Laplace mechanism. Perhaps surprisingly, we also show that $\tilde{v}$ is close to $f(\lambda^*)$.

Global sensitivity. We bound the global sensitivity of the maximum $v$ found with Bayesian optimization and UCB.

Theorem 3. Given Assumption 1 and neighboring $\mathcal{V}, \mathcal{V}'$, we have the following global sensitivity bound (in the Laplace mechanism sense) for the maximum $v$, w.p. $\geq 1 - \delta$:

$$\Big|\max_{t \leq T} v'_t - \max_{t \leq T} v_t\Big| \leq \frac{\sqrt{C_1 T \beta_T \gamma_T}}{T} + c + q,$$

where the maximum Gaussian process information gain $\gamma_T$ is bounded above for the squared exponential and Matérn kernels (Srinivas et al., 2010).

Proof. For notational simplicity let us denote the regret quantity $\Omega \triangleq \sqrt{C_1 T \beta_T \gamma_T}$. Then from Theorem 1 in Srinivas et al. (2010) we have that

$$\frac{\Omega}{T} \geq \frac{1}{T}\sum_{t=1}^{T}\big(f(\lambda^*) - f(\lambda_t)\big) \geq f(\lambda^*) - \max_{t \leq T} f(\lambda_t). \qquad (7)$$

This implies $f(\lambda^*) \leq \max_{t \leq T} f(\lambda_t) + \Omega/T$ with probability at least $1 - \delta/3$ (with appropriate choice of $\beta_T$).

Recall that in the proof of Theorem 1 we showed that $|f(\lambda) - f'(\lambda)| \leq c$ with probability at least $1 - \delta/3$ (for $c$ given in Algorithm 1). This, along with the above expression, implies the following two inequalities with probability greater than $1 - 2\delta/3$:

$$f'(\lambda^*) - c \leq f(\lambda^*) < \max_{t \leq T} f(\lambda_t) + \frac{\Omega}{T}; \qquad f(\lambda^*) - c \leq f'(\lambda^*) < \max_{t \leq T} f'(\lambda_t) + \frac{\Omega}{T}.$$

These, in turn, imply the two inequalities:

$$\max_{t \leq T} f'(\lambda_t) \leq f'(\lambda^*) < \max_{t \leq T} f(\lambda_t) + \frac{\Omega}{T} + c; \qquad \max_{t \leq T} f(\lambda_t) \leq f(\lambda^*) < \max_{t \leq T} f'(\lambda_t) + \frac{\Omega}{T} + c.$$

That is, the global sensitivity of $\max_{t \leq T} f(\lambda_t)$ is bounded: $|\max_{t \leq T} f'(\lambda_t) - \max_{t \leq T} f(\lambda_t)| \leq \Omega/T + c$.

Given the sensitivity of the maximum $f$, we can readily derive the sensitivity of the maximum $v$. First note that we can use the triangle inequality to derive

$$\Big|\max_{t \leq T} v'(\lambda_t) - \max_{t \leq T} v(\lambda_t)\Big| \leq \Big|\max_{t \leq T} v(\lambda_t) - \max_{t \leq T} f(\lambda_t)\Big| + \Big|\max_{t \leq T} v'(\lambda_t) - \max_{t \leq T} f'(\lambda_t)\Big| + \Big|\max_{t \leq T} f'(\lambda_t) - \max_{t \leq T} f(\lambda_t)\Big|.$$

We can immediately bound the final term on the right hand side. Note that as $v_t = f(\lambda_t) + \alpha_t$, the first two terms are bounded above by $|\alpha_{\lceil t\rceil}|$ and $|\alpha'_{\lceil t\rceil}|$, where $\lceil t\rceil \triangleq \arg\max_{t \leq T}\{|\alpha_t|\}$ (similarly for $\alpha'$). This is because, in the worst case, the observation noise shifts the observed maximum $\max_{t \leq T} v_t$ up or down by this amount. Therefore, let $\hat{\alpha} = \alpha_{\lceil t\rceil}$ if $|\alpha_{\lceil t\rceil}| > |\alpha'_{\lceil t\rceil}|$ and $\hat{\alpha} = \alpha'_{\lceil t\rceil}$ otherwise, so that we have

$$\Big|\max_{t \leq T} v'(\lambda_t) - \max_{t \leq T} v(\lambda_t)\Big| \leq \frac{\Omega}{T} + c + |2\hat{\alpha}|.$$

Although $\hat{\alpha}$ can be arbitrarily large, recall that for $Z \sim \mathcal{N}(0,1)$, if we set $Z = \frac{2\hat{\alpha}}{\sigma\sqrt{2}}$ and $\gamma = \sqrt{2\log(3/\delta)}$, we have $\Pr[|Z| \leq \gamma] \geq 1 - e^{-\gamma^2/2} \triangleq 1 - \delta/3$. This implies that $|2\hat{\alpha}| \leq \sigma\sqrt{4\log(3/\delta)} = q$ with probability at least $1 - \delta/3$. Therefore, if Theorem 1 from Srinivas et al. (2010) and the bound on $|f(\lambda) - f'(\lambda)|$ hold together with probability at least $1 - 2\delta/3$ as described above, the theorem follows directly. □

As in Theorem 1, each quantity in the above bound is given in Algorithm 1 ($\beta_T$, $c$, $q$), given in previous results (Srinivas et al., 2010) ($\gamma_T$, $C_1$), or specified by the modeler ($T$, $\delta$). Now that we have a bound on the sensitivity of the maximum $v$, we will use the Laplace mechanism to prove our privacy guarantee (proof in the supplementary material).

Corollary 2. Let $\mathcal{A}(\mathcal{V})$ denote Algorithm 1 run on dataset $\mathcal{V}$. Given Assumption 1, releasing $\tilde{v}$ is $(\epsilon,\delta)$-differentially private, i.e., $\Pr[\mathcal{A}(\mathcal{V}) = \tilde{v}] \leq e^{\epsilon}\Pr[\mathcal{A}(\mathcal{V}') = \tilde{v}] + \delta$.

Further, as the Laplace distribution has exponential tails, the noise we add to obtain $\tilde{v}$ is not too large.

Theorem 4. Given the assumptions of Theorem 1, we have the following bound,

$$|\tilde{v} - f(\lambda^*)| \leq \sigma\sqrt{2\log(2T/\delta)} + \frac{\Omega}{T} + a\Big(\frac{\Omega}{T\epsilon} + \frac{c}{\epsilon} + \frac{q}{\epsilon}\Big),$$

with probability at least $1 - (\delta + e^{-a})$, for $\Omega = \sqrt{C_1 T \beta_T \gamma_T}$.

Proof (Theorem 4). Let $Z$ be a Laplace random variable with scale parameter $b$ and location parameter 0; $Z \sim \mathrm{Lap}(b)$. Then $\Pr[|Z| \leq ab] = 1 - e^{-a}$. Thus, in Algorithm 1, $|\tilde{v} - \max_{t \leq T} v_t| \leq ab$ for $b = \frac{\Omega}{T\epsilon} + \frac{c}{\epsilon} + \frac{q}{\epsilon}$, with probability at least $1 - e^{-a}$. Further observe,

$$ab \geq \max_{t \leq T} v_t - \tilde{v} \geq \Big(\max_{t \leq T} f(\lambda_t) - \max_{t \leq T}|\alpha_t|\Big) - \tilde{v} \geq f(\lambda^*) - \frac{\Omega}{T} - \max_{t \leq T}|\alpha_t| - \tilde{v}, \qquad (8)$$

where the second and third inequalities follow from the proof of Theorem 3 (using the regret bound of Srinivas et al. (2010), Theorem 1). Note that the third inequality holds with probability greater than $1 - \delta/2$ (given $\beta_T$ in Algorithm 1). The final inequality implies $f(\lambda^*) - \tilde{v} \leq \max_{t \leq T}|\alpha_t| + \frac{\Omega}{T} + ab$.
Also note that,

$$ab \geq \tilde{v} - \max_{t \leq T} v_t \geq \tilde{v} - \Big(\max_{t \leq T} f(\lambda_t) + \max_{t \leq T}|\alpha_t|\Big) \geq \tilde{v} - f(\lambda^*) - \frac{\Omega}{T} - \max_{t \leq T}|\alpha_t|, \qquad (9)$$

which implies that $\tilde{v} - f(\lambda^*) \leq \max_{t \leq T}|\alpha_t| + \frac{\Omega}{T} + ab$. Thus, we have that $|\tilde{v} - f(\lambda^*)| \leq \max_{t \leq T}|\alpha_t| + \frac{\Omega}{T} + ab$. Although each $|\alpha_t|$ could be arbitrarily large, for variables $Z_1, \ldots, Z_n \sim \mathcal{N}(0,1)$ we have, by the tail probability bound and the union bound, that $\Pr[\forall t,\ |Z_t| \leq \gamma] \geq 1 - n e^{-\gamma^2/2} \triangleq 1 - \delta/2$. If we set $Z_t = \alpha_t/\sigma$ and $n = T$, we obtain $\gamma = \sqrt{2\log(2T/\delta)}$, which completes the proof. □

We note that, because releasing either $\tilde{\lambda}$ or $\tilde{v}$ is $(\epsilon,\delta)$-differentially private, by Corollaries 1 and 2, releasing both private quantities in Algorithm 1 guarantees $(2\epsilon, 2\delta)$-differential privacy for validation dataset $\mathcal{V}$. This is due to the composition properties of $(\epsilon,\delta)$-differential privacy (Dwork et al., 2006a); in fact stronger composition results can be demonstrated (Dwork & Roth, 2013).

4. Without observation noise

In hyper-parameter tuning it may be reasonable to assume that we can observe function evaluations exactly: $v_{\mathcal{V},t} = f_\mathcal{V}(\lambda_t)$. First note that we can use the same algorithm to report the maximum $\lambda$ in the no-noise setting. Theorems 1 and 2 still hold (note that $q = 0$ in Theorem 2). However, we cannot readily report a private maximum $f$, as the information gain $\gamma_T$ in Theorems 3 and 4 approaches infinity as $\sigma^2 \to 0$. Therefore, we extend results from the previous section to the exact observation case via the regret bounds of de Freitas et al. (2012). Algorithm 2 demonstrates how to privatize the maximum $f$ in the exact observation case.

Algorithm 2: Private Bayesian Opt. (noise-free observations)

  Input: $\mathcal{V} \subseteq \mathcal{X}$; $\Lambda \subseteq \mathbb{R}^d$; $T$; $(\epsilon,\delta)$; $A, \tau$; assumptions on $f_\mathcal{V}$ in de Freitas et al. (2012)
  Run the method of de Freitas et al. (2012), resulting in noise-free observations $f_\mathcal{V}(\lambda_1), \ldots, f_\mathcal{V}(\lambda_T)$
  $c = 2\sqrt{\big(1 - k_1(\mathcal{V}, \mathcal{V}')\big)\log\big(2|\Lambda|/\delta\big)}$
  Draw $\theta \sim \mathrm{Lap}\Big[\frac{A}{\epsilon}\, e^{-\tau T/(\log T)^{d/4}} + \frac{c}{\epsilon}\Big]$
  Return: $\tilde{f} = \max_{2 \leq t \leq T} f_\mathcal{V}(\lambda_t) + \theta$

4.1. Private near-maximum validation gain

We demonstrate that releasing $\tilde{f}$ in Algorithm 2 is private (Corollary 3) and that only a small amount of noise is added to make $\tilde{f}$ private (Theorem 6). To do so, we derive the global sensitivity of $\max_{2 \leq t \leq T} f(\lambda_t)$ in Algorithm 2 independent of the maximum information gain $\gamma_T$, via de Freitas et al. (2012). Then we prove releasing $\tilde{f}$ is $(\epsilon,\delta)$-differentially private and that $\tilde{f}$ is almost $\max_{2 \leq t \leq T} f(\lambda_t)$.

Global sensitivity. The following theorem gives a bound on the global sensitivity of the maximum $f$.

Theorem 5. Given Assumption 1 and that $f_\mathcal{V}$ and $f_{\mathcal{V}'}$ satisfy the assumptions of de Freitas et al. (2012), for neighboring datasets $\mathcal{V}, \mathcal{V}'$ we have the following global sensitivity bound (in the Laplace mechanism sense):

$$\Big|\max_{2 \leq t \leq T} f'(\lambda_t) - \max_{2 \leq t \leq T} f(\lambda_t)\Big| \leq A e^{-\tau T/(\log T)^{d/4}} + c$$

w.p. at least $1 - \delta$, for $c = 2\sqrt{\big(1 - k_1(\mathcal{V},\mathcal{V}')\big)\log\big(2|\Lambda|/\delta\big)}$.

We leave the proof to the supplementary material. Given this sensitivity, we may apply the Laplace mechanism to release $\tilde{f}$.

Corollary 3. Let $\mathcal{A}(\mathcal{V})$ denote Algorithm 2 run on dataset $\mathcal{V}$. Given Assumption 1 and that $f_\mathcal{V}$ satisfies the assumptions of de Freitas et al. (2012), $\tilde{f}$ is $(\epsilon,\delta)$-differentially private with respect to any neighboring dataset $\mathcal{V}'$, i.e.,

$$\Pr\big[\mathcal{A}(\mathcal{V}) = \tilde{f}\big] \leq e^{\epsilon}\Pr\big[\mathcal{A}(\mathcal{V}') = \tilde{f}\big] + \delta.$$

Even though we must add noise to the maximum $f$, we show that $\tilde{f}$ is still close to the optimal $f(\lambda^*)$.

Theorem 6. Given the assumptions of Theorem 5, we have the following utility guarantee for Algorithm 2:

$$|\tilde{f} - f(\lambda^*)| \leq \Omega + a\Big(\frac{\Omega}{\epsilon} + \frac{c}{\epsilon}\Big)$$

w.p. at least $1 - (\delta + e^{-a})$, for $\Omega = A e^{-\tau T/(\log T)^{d/4}}$.

We prove Corollary 3 and Theorem 6 in the supplementary material. We have demonstrated that in the noisy and noise-free settings we can release private near-optimal hyper-parameter settings $\tilde{\lambda}$ and function evaluations $\tilde{v}, \tilde{f}$. However, the analysis thus far assumes the hyper-parameter set is finite: $|\Lambda| < \infty$. It is possible to relax this assumption, using an analysis similar to Srinivas et al. (2010). We leave this analysis to the supplementary material.
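For concreteness, the release step of Algorithm 2 amounts to a single Laplace mechanism whose scale is set by the Theorem 5 sensitivity as stated above. A minimal sketch (our illustration, not the authors' code) is below, where `f_obs` are the noise-free observations $f_\mathcal{V}(\lambda_1), \ldots, f_\mathcal{V}(\lambda_T)$ and `A`, `tau` are the constants from the exponential regret bound of de Freitas et al. (2012), which in practice would need to be bounded for the problem at hand.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_max_noise_free(f_obs, k1_neighbor, n_candidates, A, tau, d, eps, delta):
    """Algorithm 2: release a private version of max_{2<=t<=T} f_V(lambda_t)."""
    T = len(f_obs)
    c = 2.0 * np.sqrt((1.0 - k1_neighbor) * np.log(2.0 * n_candidates / delta))
    # Regret term in the Theorem 5 sensitivity, following de Freitas et al. (2012).
    regret_term = A * np.exp(-tau * T / np.log(T) ** (d / 4.0))
    scale = (regret_term + c) / eps
    return np.max(f_obs[1:]) + rng.laplace(scale=scale)
```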
5. Without the GP assumption

Even if our true validation score $f_\mathcal{V}$ is not drawn from a Gaussian process (Assumption 1), we can still guarantee differential privacy for releasing its value after Bayesian optimization, $f^{BO} = \max_{t \leq T} f(\lambda_t)$. In this section we describe a different functional assumption on $f_\mathcal{V}$ that also yields differentially private Bayesian optimization for the case of machine learning hyper-parameter tuning.

Assume we have a (non-sensitive) training set $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{n}$, which, given a hyper-parameter $\lambda$, produces a model $w_\lambda$ from the following optimization:

$$w_\lambda = \arg\min_{w}\; \underbrace{\frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^{n} \ell(w, x_i, y_i)}_{O_\lambda(w)}. \qquad (10)$$

The function $\ell$ is a training loss function (e.g., logistic loss, hinge loss). Given a (sensitive) validation set $\mathcal{V} = \{(x_i, y_i)\}_{i=1}^{m} \subseteq \mathcal{X}$ we would like to use Bayesian optimization to maximize a validation score $f_\mathcal{V}$.

Assumption 2. Our true validation score $f_\mathcal{V}$ is

$$f_\mathcal{V}(w_\lambda) = -\frac{1}{m}\sum_{i=1}^{m} g(w_\lambda, x_i, y_i),$$

where $g(\cdot)$ is a validation loss function that is $L$-Lipschitz in $w$ (e.g., ramp loss, normalized sigmoid (Huang et al., 2014)). Additionally, the training model $w_\lambda$ is the minimizer of eq. (10) for a training loss $\ell(\cdot)$ that is 1-Lipschitz in $w$ and convex (e.g., logistic loss, hinge loss).

Algorithm 3 describes a procedure for privately releasing the best validation accuracy $f^{BO}$ given Assumption 2. Different from previous algorithms, we may run Bayesian optimization in Algorithm 3 with any acquisition function (e.g., expected improvement (Mockus et al., 1978), UCB) and privacy is still guaranteed.

Algorithm 3: Private Bayesian Opt. (Lipschitz and convex)

  Input: $\mathcal{T}$ of size $n$; $\mathcal{V}$ of size $m$; $\Lambda$; $\lambda_{\min}$; $\lambda_{\max}$; $\epsilon$; $T$; $L$; $d$
  Run Bayesian optimization for $T$ timesteps, observing $f_\mathcal{V}(w_{\lambda_1}), \ldots, f_\mathcal{V}(w_{\lambda_T})$ for $\{\lambda_1, \ldots, \lambda_T\} = \Lambda_{T,\mathcal{V}} \subseteq \Lambda$
  $f^{BO} = \max_{t \leq T} f_\mathcal{V}(w_{\lambda_t})$
  $g^* = \max_{(x,y) \in \mathcal{X},\, w \in \mathbb{R}^d} g(w, x, y)$
  Draw $\theta \sim \mathrm{Lap}\Big[\frac{1}{\epsilon}\min\Big\{\frac{g^*}{m}, \frac{L}{m\lambda_{\min}}\Big\} + \frac{(\lambda_{\max} - \lambda_{\min})L}{\epsilon\, \lambda_{\max}\lambda_{\min}}\Big]$
  Return: $\tilde{f}_L = f^{BO} + \theta$

Similar to Algorithms 1 and 2, we use the Laplace mechanism to mask the possible change in validation accuracy when $\mathcal{V}$ is swapped with a neighboring validation set $\mathcal{V}'$. Different from the work of Chaudhuri & Vinterbo (2013), changing $\mathcal{V}$ to $\mathcal{V}'$ may also lead to Bayesian optimization searching different hyper-parameters, $\Lambda_{T,\mathcal{V}}$ vs. $\Lambda_{T,\mathcal{V}'}$. Therefore, we must bound the total global sensitivity of $f$ with respect to $\mathcal{V}$ and $\lambda$.

Definition 5. The total global sensitivity of $f$ over all neighboring datasets $\mathcal{V}, \mathcal{V}'$ is

$$\Delta f \triangleq \max_{\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X},\ \lambda, \lambda' \in \Lambda} |f_\mathcal{V}(w_\lambda) - f_{\mathcal{V}'}(w_{\lambda'})|.$$

In the following theorem we demonstrate that we can bound the change in $f$ for arbitrary $\lambda' < \lambda$.

Theorem 7. Given Assumption 2, for neighboring $\mathcal{V}, \mathcal{V}'$ and arbitrary $\lambda' < \lambda$ we have that

$$|f_\mathcal{V}(w_\lambda) - f_{\mathcal{V}'}(w_{\lambda'})| \leq \frac{(\lambda - \lambda')L}{\lambda'\lambda} + \min\Big\{\frac{g^*}{m}, \frac{L}{m\lambda_{\min}}\Big\},$$

where $L$ is the Lipschitz constant of $f$, $g^* = \max_{(x,y) \in \mathcal{X},\, w \in \mathbb{R}^d} g(w, x, y)$, and $m$ is the size of $\mathcal{V}$.

Proof. Applying the triangle inequality yields

$$|f_\mathcal{V}(w_\lambda) - f_{\mathcal{V}'}(w_{\lambda'})| \leq |f_\mathcal{V}(w_\lambda) - f_\mathcal{V}(w_{\lambda'})| + |f_\mathcal{V}(w_{\lambda'}) - f_{\mathcal{V}'}(w_{\lambda'})|.$$

The second term is bounded by Chaudhuri & Vinterbo (2013) in the proof of their Theorem 4. The only difference is that, as we are not adding random noise to $w_{\lambda'}$, we have that $|f_\mathcal{V}(w_{\lambda'}) - f_{\mathcal{V}'}(w_{\lambda'})| \leq \min\{g^*/m,\ L/(m\lambda_{\min})\}$.

To bound the first term, let $O_\lambda(w)$ be the value of the objective in eq. (10) for a particular $\lambda$. Note that $O_\lambda(w)$ and $O_{\lambda'}(w)$ are $\lambda$- and $\lambda'$-strongly convex, respectively. Define

$$h(w) = O_{\lambda'}(w) - O_\lambda(w) = \frac{\lambda' - \lambda}{2}\|w\|_2^2. \qquad (11)$$

Further, define the minimizers $w_\lambda = \arg\min_w O_\lambda(w)$ and $w_{\lambda'} = \arg\min_w \big[O_\lambda(w) + h(w)\big]$. This implies that

$$\nabla O_\lambda(w_\lambda) = \nabla O_\lambda(w_{\lambda'}) + \nabla h(w_{\lambda'}) = 0. \qquad (12)$$

Given that $O_\lambda$ is $\lambda$-strongly convex (Shalev-Shwartz, 2007), and by the Cauchy-Schwartz inequality,

$$\lambda\|w_\lambda - w_{\lambda'}\|_2^2 \leq \big[\nabla O_\lambda(w_\lambda) - \nabla O_\lambda(w_{\lambda'})\big]^\top \big[w_\lambda - w_{\lambda'}\big] \leq \nabla h(w_{\lambda'})^\top \big[w_\lambda - w_{\lambda'}\big] \leq \|\nabla h(w_{\lambda'})\|_2\, \|w_\lambda - w_{\lambda'}\|_2.$$

Rearranging,

$$\|w_\lambda - w_{\lambda'}\|_2 \leq \frac{1}{\lambda}\|\nabla h(w_{\lambda'})\|_2 = \frac{1}{\lambda}\Big\|\frac{\lambda' - \lambda}{2}\nabla\|w_{\lambda'}\|_2^2\Big\|_2. \qquad (13)$$

Now, as $w_{\lambda'}$ is the minimizer of $O_{\lambda'}$, we have

$$\nabla\|w_{\lambda'}\|_2^2 = \frac{2}{\lambda'}\Big[-\frac{1}{n}\sum_{i=1}^{n}\nabla\ell(w_{\lambda'}, x_i, y_i)\Big].$$

Substituting this value of $\nabla\|w_{\lambda'}\|_2^2$ into eq. (13), and noting that we can pull the positive constant term $(\lambda - \lambda')/2$ out of the norm and drop the negative sign in the norm, gives us

$$\frac{1}{\lambda}\|\nabla h(w_{\lambda'})\|_2 = \frac{\lambda - \lambda'}{\lambda\lambda'}\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla\ell(w_{\lambda'}, x_i, y_i)\Big\|_2 \leq \frac{\lambda - \lambda'}{\lambda\lambda'}.$$

The last inequality follows from the fact that the loss $\ell$ is 1-Lipschitz by Assumption 2 and the triangle inequality. Thus, along with eq. (13), we have

$$\|w_\lambda - w_{\lambda'}\|_2 \leq \frac{1}{\lambda}\|\nabla h(w_{\lambda'})\|_2 \leq \frac{\lambda - \lambda'}{\lambda\lambda'}.$$

Finally, as $f_\mathcal{V}$ is $L$-Lipschitz in $w$,

$$|f_\mathcal{V}(w_\lambda) - f_\mathcal{V}(w_{\lambda'})| \leq L\|w_\lambda - w_{\lambda'}\|_2 \leq \frac{L(\lambda - \lambda')}{\lambda\lambda'}.$$

Combining the result of Chaudhuri & Vinterbo (2013) with the above expression completes the proof. □
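Since Algorithm 3 only post-processes the best observed validation score with Laplace noise scaled by the Theorem 7 sensitivity, it is short to state in code. The following is an illustrative sketch under Assumption 2 (not the authors' implementation); `scores` are the observed values $f_\mathcal{V}(w_{\lambda_1}), \ldots, f_\mathcal{V}(w_{\lambda_T})$ produced by any acquisition strategy, and `g_max` is an assumed upper bound on the validation loss $g$ (e.g., 1 for the ramp loss).

```python
import numpy as np

rng = np.random.default_rng(0)

def private_best_score(scores, m, L, lam_min, lam_max, g_max, eps):
    """Algorithm 3: Laplace-noised best validation score under Assumption 2.

    m       -- size of the sensitive validation set V
    L       -- Lipschitz constant of the validation loss g in w
    lam_min -- smallest regularization value in Lambda
    lam_max -- largest regularization value in Lambda
    """
    f_bo = max(scores)
    # Total global sensitivity from Theorem 7, evaluated at (lam_max, lam_min).
    sensitivity = (min(g_max / m, L / (m * lam_min))
                   + (lam_max - lam_min) * L / (lam_max * lam_min))
    return f_bo + rng.laplace(scale=sensitivity / eps)
```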
Given a finite set of possible hyper-parameters $\Lambda$, we would like to bound $|f^*_\mathcal{V} - f^*_{\mathcal{V}'}|$: the difference between the best validation scores found when running Bayesian optimization on $\mathcal{V}$ vs. $\mathcal{V}'$. Note that, by Theorem 7,

$$|f^*_\mathcal{V} - f^*_{\mathcal{V}'}| \leq \max_{\lambda, \lambda'} |f_\mathcal{V}(w_\lambda) - f_{\mathcal{V}'}(w_{\lambda'})| \leq \frac{(\lambda_{\max} - \lambda_{\min})L}{\lambda_{\max}\lambda_{\min}} + \min\Big\{\frac{g^*}{m}, \frac{L}{m\lambda_{\min}}\Big\},$$

as $(\lambda - \lambda')/(\lambda\lambda')$ is strictly increasing in $\lambda$ and strictly decreasing in $\lambda'$. Given this sensitivity of $f$ we can use the Laplace mechanism to hide changes in the validation set as follows.

Corollary 4. Let $\mathcal{A}(\mathcal{V})$ denote Algorithm 3 applied on dataset $\mathcal{V}$. Given Assumption 2, $\tilde{f}_L$ is $\epsilon$-differentially private, i.e., $\Pr[\mathcal{A}(\mathcal{V}) = \tilde{f}_L] \leq e^{\epsilon}\Pr[\mathcal{A}(\mathcal{V}') = \tilde{f}_L]$.

We leave the proof to the supplementary material. Further, by the exponential tails of the Laplace mechanism we have the following utility guarantee.

Theorem 8. Given the assumptions of Theorem 7, we have the following utility guarantee for $\tilde{f}_L$ w.r.t. $f^{BO}$:

$$|\tilde{f}_L - f^{BO}| \leq a\Big[\frac{1}{\epsilon m}\min\Big\{g^*, \frac{L}{\lambda_{\min}}\Big\} + \frac{(\lambda_{\max} - \lambda_{\min})L}{\epsilon\,\lambda_{\max}\lambda_{\min}}\Big]$$

with probability at least $1 - e^{-a}$.

Proof. This follows exactly from the tail bound on Laplace random variables, given at the beginning of the proof of Theorem 6. □

6. Results

In this section we examine the validity of our multi-task Gaussian process assumption on $[f_{\mathcal{V}_1}, \ldots, f_{\mathcal{V}_{2^{|\mathcal{X}|}}}]$. Specifically, we search for the most likely value of the multi-task Gaussian process covariance element $k_1(\mathcal{V}, \mathcal{V}')$ for classifier hyper-parameter tuning. Larger values of $k_1(\mathcal{V}, \mathcal{V}')$ correspond to smaller global sensitivity bounds in Theorems 1, 3, and 5, leading to improved privacy guarantees.

For our setting of hyper-parameter tuning, each $\lambda = [C, \gamma^2]$ is a pair of hyper-parameters for training a kernelized support vector machine (SVM) (Cortes & Vapnik, 1995; Schölkopf & Smola, 2001) with cost parameter $C$ and radial basis kernel width $\gamma^2$. The value $f_\mathcal{V}(\lambda)$ is the accuracy of the SVM model trained with hyper-parameters $\lambda$, evaluated on $\mathcal{V}$.

To search for the most likely $k_1(\mathcal{V}, \mathcal{V}')$ we start by sampling 100 different SVM hyper-parameter settings $\lambda_1, \ldots, \lambda_{100}$ from a Sobol sequence and train an SVM model for each on the Forest UCI dataset (36,603 training inputs). We then randomly sample 100 i.i.d. validation sets $\mathcal{V}$. Here we describe the evaluation procedure for a fixed validation set size, which corresponds to a single curve in Figure 1 (as such, to generate all results we repeat this procedure for each validation set size in the set $\{1000, 2000, 3000, 5000, 15000\}$). For each of the 100 validation sets we randomly add or remove an input to form a neighboring dataset $\mathcal{V}'$. We then evaluate each of the 100 trained SVM models on all 100 datasets $\mathcal{V}$ and their pairs $\mathcal{V}'$. This results in two $100 \times 100$ (number of datasets $\times$ number of trained SVM models) function evaluation matrices $F_\mathcal{V}$ and $F_{\mathcal{V}'}$. Thus, $[F_\mathcal{V}]_{ij}$ is the $j$th SVM model's validation accuracy on the $i$th validation set $\mathcal{V}_i$.

The likelihood of the function evaluations for a dataset pair $(\mathcal{V}_i, \mathcal{V}'_i)$, for a value of $k_1(\mathcal{V}_i, \mathcal{V}'_i)$, is given by the marginal likelihood of the multi-task Gaussian process:

$$\Pr\begin{pmatrix}[F_\mathcal{V}]_i \\ [F_{\mathcal{V}'}]_i\end{pmatrix} \sim \mathcal{N}(\mu_{100}, \sigma^2_{100}), \qquad (14)$$

where $[F_\mathcal{V}]_i = [f_{\mathcal{V}_i}(\lambda_1), \ldots, f_{\mathcal{V}_i}(\lambda_{100})]$ (similarly for $\mathcal{V}'$), and $\mu_{100}$ and $\sigma^2_{100}$ are the posterior mean and variance of the multi-task Gaussian process using kernel $k_1(\mathcal{V}, \mathcal{V}') \otimes k_2(\lambda, \lambda')$ after observing $[F_\mathcal{V}]_i$ and $[F_{\mathcal{V}'}]_i$ (for more details see Bonilla et al. (2008)). As $\mu_{100}$ and $\sigma^2_{100}$ depend on $k_1(\mathcal{V}, \mathcal{V}')$, we treat it as a free parameter and vary its value from 0.05 to 0.95 in increments of 0.05. For each value, we compute the marginal likelihood (14) for all validation datasets ($\mathcal{V}_i$ for $i = 1, \ldots, 100$). As each $\mathcal{V}_i$ is sampled i.i.d., the joint marginal likelihood is simply the product of all likelihoods. Computing this joint marginal likelihood for each $k_1(\mathcal{V}, \mathcal{V}')$ value yields a single curve of Figure 1. As shown, the largest value $k_1(\mathcal{V}, \mathcal{V}') = 0.95$ is most likely, meaning that $c$ in the global sensitivity bounds is quite small, leading to private values that are closer to their true optimums.
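To show how such a likelihood curve could be computed, here is a minimal sketch (our illustration, not the authors' experimental code) of the joint log marginal likelihood of eq. (14) for one candidate value of $k_1(\mathcal{V}, \mathcal{V}')$, assuming the stacked evaluations $([F_\mathcal{V}]_i, [F_{\mathcal{V}'}]_i)$ follow a zero-mean multi-task GP with Kronecker covariance $k_1 \otimes K_2$ plus a small noise term.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(F_V, F_Vp, K2, k1_value, noise_var=1e-4):
    """Sum over dataset pairs i of log N([ [F_V]_i ; [F_Vp]_i ] | 0, K1 (x) K2 + noise*I).

    F_V, F_Vp -- (num_datasets x num_hyperparams) accuracy matrices
    K2        -- (num_hyperparams x num_hyperparams) hyper-parameter kernel matrix
    """
    K1 = np.array([[1.0, k1_value], [k1_value, 1.0]])
    cov = np.kron(K1, K2) + noise_var * np.eye(2 * K2.shape[0])
    total = 0.0
    for fv, fvp in zip(F_V, F_Vp):
        stacked = np.concatenate([fv, fvp]).astype(float)
        total += multivariate_normal.logpdf(stacked, mean=np.zeros_like(stacked), cov=cov)
    return total

# Hypothetical usage: sweep k1_value over 0.05, 0.10, ..., 0.95 and compare the totals,
# which reproduces one curve of Figure 1 up to the choice of K2 and the mean.
```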
[Figure 1: The log likelihood of a multi-task Gaussian process for different values of the kernel value $k_1(\mathcal{V}, \mathcal{V}')$ (x-axis), with one curve per validation set size in {1000, 2000, 3000, 5000, 15000}. The function evaluations are the validation accuracies of SVMs with different hyper-parameters.]

7. Related work

There has been much work towards differentially private convex optimization (Chaudhuri et al., 2011; Kifer et al., 2012; Duchi et al., 2013; Song et al., 2013; Jain & Thakurta, 2014; Bassily et al., 2014). The work of Bassily et al. (2014) established upper and lower bounds for the excess empirical risk of $\epsilon$- and $(\epsilon,\delta)$-differentially private algorithms for many settings, including convex and strongly convex risk functions that may or may not be smooth. There is also related work towards private high-dimensional regression, where the dimensions outnumber the number of instances (Kifer et al., 2012; Smith & Thakurta, 2013a). In such cases the Hessian becomes singular and so the loss is non-convex. However, it is possible to use the restricted strong convexity of the loss in the regression case to guarantee privacy.

Differential privacy has been shown to be achievable in online and interactive kernel learning settings (Jain et al., 2012; Smith & Thakurta, 2013b; Jain & Thakurta, 2013; Mishra & Thakurta, 2014). In general, non-private online algorithms are closest in spirit to the methods of Bayesian optimization. However, all of the previous work in differentially private online learning represents a dataset as a sequence of bandit arm pulls (the equivalent notion in Bayesian optimization is function evaluations $f(x_t)$). Instead, we consider functions in which changing a single dataset entry possibly affects all future function evaluations. Closest to our work is that of Chaudhuri & Vinterbo (2013), who show that given a fixed set of hyper-parameters which are always evaluated for any validation set, they can return a private version of the index of the best hyper-parameter, as well as a private model trained with that hyper-parameter. Our setting is strictly more general in that, if the validation set changes, Bayesian optimization could search completely different hyper-parameters.

Bayesian optimization, largely due to its principled handling of the exploration/exploitation trade-off of global, black-box function optimization, is quickly becoming the global optimization paradigm of choice. Alongside promising empirical results there is a wealth of recent work on convergence guarantees for Bayesian optimization, similar to those used in this work (Srinivas et al., 2010; de Freitas et al., 2012). Vazquez & Bect (2010) and Bull (2011) give regret bounds for optimizing the expected improvement acquisition function at each optimization step. BayesGap (Hoffman et al., 2014) gives a convergence guarantee for Bayesian optimization with budget constraints. Bayesian optimization has also been extended to multi-task optimization (Bardenet et al., 2013; Swersky et al., 2013), to the setting where multiple experiments can be run at once (Azimi et al., 2012; Snoek et al., 2012), and to constrained optimization (Gardner et al., 2014).
8. Conclusion

We have introduced methods for privately releasing the best hyper-parameters and validation accuracies in the case of exact and noisy observations. Our work makes use of the differential privacy framework, which has become commonplace in private machine learning (Dwork & Roth, 2013). We believe we are the first to demonstrate differentially private quantities in the setting of global optimization of expensive (possibly non-convex) functions, through the lens of Bayesian optimization.

One key future direction is to design techniques to release each sampled hyper-parameter and validation accuracy privately (during the run of Bayesian optimization). This requires analyzing how the maximum upper-confidence bound changes as the validation dataset changes. Another interesting direction is extending our guarantees in Sections 3 and 4 to other acquisition functions.

For the case of machine learning hyper-parameter tuning, our results are designed to guarantee privacy of the validation set only (it is equivalent to guarantee that the training set is never allowed to change). To simultaneously protect the privacy of the training set it may be possible to use techniques similar to the training stability results of Chaudhuri & Vinterbo (2013). Training stability could be guaranteed, for example, by assuming an additional training set kernel that bounds the effect of altering the training set on $f$. We leave developing these guarantees for future work.

As practitioners begin to use Bayesian optimization in practical settings involving sensitive data, it suddenly becomes crucial to consider how to preserve data privacy while reporting accurate Bayesian optimization results. This work presents methods to achieve such privacy, which we hope will be useful to practitioners and theorists alike.

References

Auer, Peter, Cesa-Bianchi, Nicolo, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

Azimi, Javad, Jalali, Ali, and Fern, Xiaoli Z. Hybrid batch Bayesian optimization. In ICML, pp. 1215-1222. ACM, 2012.

Bardenet, Rémi, Brendel, Mátyás, Kégl, Balázs, and Sebag, Michèle. Collaborative hyperparameter tuning. In ICML, 2013.

Bassily, Raef, Smith, Adam, and Thakurta, Abhradeep. Private empirical risk minimization, revisited. arXiv preprint arXiv:1405.7085, 2014.

Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. JMLR, 13:281-305, 2012.

Bonilla, Edwin, Chai, Kian Ming, and Williams, Christopher. Multi-task Gaussian process prediction. In NIPS, 2008.

Bull, Adam D. Convergence rates of efficient global optimization algorithms. JMLR, 12:2879-2904, 2011.
Chaudhuri, Kamalika and Vinterbo, Staal A. A stability-based validation procedure for differentially private machine learning. In NIPS, pp. 2652-2660, 2013.

Chaudhuri, Kamalika, Monteleoni, Claire, and Sarwate, Anand D. Differentially private empirical risk minimization. JMLR, 12:1069-1109, 2011.

Chong, Miao M, Abraham, Ajith, and Paprzycki, Marcin. Traffic accident analysis using machine learning paradigms. Informatica (Slovenia), 29(1):89-98, 2005.

Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

de Freitas, Nando, Smola, Alex, and Zoghi, Masrour. Exponential regret bounds for Gaussian process bandits with deterministic observations. In ICML, 2012.

Dinur, Irit and Nissim, Kobbi. Revealing information while preserving privacy. In Proceedings of the SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 202-210. ACM, 2003.

Duchi, John C, Jordan, Michael I, and Wainwright, Martin J. Local privacy and statistical minimax rates. In FOCS, pp. 429-438. IEEE, 2013.

Dwork, Cynthia and Roth, Aaron. The algorithmic foundations of differential privacy. Theoretical Computer Science, 9(3-4):211-407, 2013.

Dwork, Cynthia, Kenthapadi, Krishnaram, McSherry, Frank, Mironov, Ilya, and Naor, Moni. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology - EUROCRYPT 2006, pp. 486-503. Springer, 2006a.

Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, and Smith, Adam. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pp. 265-284. Springer, 2006b.

Ganta, Srivatsava Ranjit, Kasiviswanathan, Shiva Prasad, and Smith, Adam. Composition attacks and auxiliary information in data privacy. In KDD, pp. 265-273. ACM, 2008.

Gardner, Jacob, Kusner, Matt, Xu, Zhixiang, Weinberger, Kilian, and Cunningham, John. Bayesian optimization with inequality constraints. In ICML, pp. 937-945, 2014.

Hoffman, Matthew, Shahriari, Bobak, and de Freitas, Nando. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In AISTATS, pp. 365-374, 2014.

Huang, Xiaolin, Shi, Lei, and Suykens, Johan AK. Ramp loss linear programming support vector machine. JMLR, 15(1):2185-2211, 2014.

Hutter, Frank, Hoos, Holger H., and Leyton-Brown, Kevin. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pp. 507-523. Springer, 2011.

Jain, Prateek and Thakurta, Abhradeep. Differentially private learning with kernels. In ICML, pp. 118-126, 2013.

Jain, Prateek and Thakurta, Abhradeep Guha. (Near) dimension independent risk bounds for differentially private learning. In ICML, pp. 476-484, 2014.

Jain, Prateek, Kothari, Pravesh, and Thakurta, Abhradeep. Differentially private online learning. In COLT, 2012.

Kifer, Daniel, Smith, Adam, and Thakurta, Abhradeep. Private convex empirical risk minimization and high-dimensional regression. JMLR, 1:41, 2012.

Krause, Andreas, Singh, Ajit, and Guestrin, Carlos. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9:235-284, 2008.

McSherry, Frank and Talwar, Kunal. Mechanism design via differential privacy. In FOCS, pp. 94-103. IEEE, 2007.

Mishra, Nikita and Thakurta, Abhradeep. Private stochastic multi-arm bandits: From theory to practice. In ICML Workshop on Learning, Security, and Privacy, 2014.

Mockus, Jonas, Tiesis, Vytautas, and Zilinskas, Antanas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117-129):2, 1978.

Narayanan, Arvind and Shmatikov, Vitaly. Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy, pp. 111-125. IEEE, 2008.
