Differentially Private Bayesian Optimization

Matt J. Kusner MKUSNER@WUSTL.EDU
Jacob R. Gardner GARDNER.JAKE@WUSTL.EDU
Roman Garnett GARNETT@WUSTL.EDU
Kilian Q. Weinberger KILIAN@WUSTL.EDU
Computer Science & Engineering, Washington University in St. Louis

Abstract

Bayesian optimization is a powerful tool for fine-tuning the hyper-parameters of a wide variety of machine learning models. The success of machine learning has led practitioners in diverse real-world settings to learn classifiers for practical problems. As machine learning becomes commonplace, Bayesian optimization becomes an attractive method for practitioners to automate the process of classifier hyper-parameter tuning. A key observation is that the data used for tuning models in these settings is often sensitive. Certain data such as genetic predisposition, personal email statistics, and car accident history, if not properly private, may be at risk of being inferred from Bayesian optimization outputs. To address this, we introduce methods for releasing the best hyper-parameters and classifier accuracy privately. Leveraging the strong theoretical guarantees of differential privacy and known Bayesian optimization convergence bounds, we prove that under a GP assumption these private quantities are also near-optimal. Finally, even if this assumption is not satisfied, we can use different smoothness guarantees to protect privacy.

1. Introduction

Machine learning is increasingly used in application areas with sensitive data. For example, hospitals use machine learning to predict if a patient is likely to be readmitted soon (Yu et al., 2013), webmail providers classify spam emails from non-spam (Weinberger et al., 2009), and insurance providers forecast the extent of bodily injury in car crashes (Chong et al., 2005).

In these scenarios data cannot be shared legally, but companies and hospitals may want to share hyper-parameters and validation accuracies through publications or other means. However, data-holders must be careful, as even a small amount of information can compromise privacy.

Which hyper-parameter setting yields the highest accuracy can reveal sensitive information about individuals in the validation or training data set, reminiscent of the reconstruction attacks described by Dwork & Roth (2013) and Dinur & Nissim (2003). For example, imagine updated hyper-parameters are released right after a prominent public figure is admitted to a hospital. If a hyper-parameter is known to correlate strongly with a particular disease the patient is suspected to have, an attacker could make a direct connection between the hyper-parameter value and the individual.

To prevent this sort of attack, we develop a set of algorithms that automatically fine-tune the hyper-parameters of a machine learning algorithm while provably preserving differential privacy (Dwork et al., 2006b). Our approach leverages recent results on Bayesian optimization (Snoek et al., 2012; Hutter et al., 2011; Bergstra & Bengio, 2012; Gardner et al., 2014), training a Gaussian process (GP) (Rasmussen & Williams, 2006) to accurately predict and maximize the validation gain of hyper-parameter settings. We show that the GP model in Bayesian optimization allows us to release noisy final hyper-parameter settings to protect against the aforementioned privacy attacks, while only sacrificing a tiny, bounded amount of validation gain.

Our privacy guarantees hold for releasing the best hyper-parameters and the best validation gain. Specifically, our contributions are as follows:

• We derive, to the best of our knowledge, the first framework for Bayesian optimization with provable differential privacy guarantees,
• We develop variations both with and without observation noise, and
• We show that even if our validation gain is not drawn from a Gaussian process, we can guarantee differential privacy under different smoothness assumptions.
We begin with background on Bayesian optimization and differential privacy that we will use to prove our guarantees.

2. Background

In general, our aim will be to protect the privacy of a validation dataset V ⊆ X of sensitive records (where X is the collection of all possible records) when the results of Bayesian optimization depend on V.

Bayesian optimization. Our goal is to maximize an unknown function f_V : Λ → R that depends on some validation dataset V ⊆ X:

    max_{λ∈Λ} f_V(λ).    (1)

It is important to point out that all of our results hold for the general setting of eq. (1), but throughout the paper we use the vocabulary of a common application: that of machine learning hyper-parameter tuning. In this case f_V(λ) is the gain of a learning algorithm evaluated on validation dataset V that was trained with hyper-parameters λ ∈ Λ ⊆ R^d.

As evaluating f_V is expensive (e.g., each evaluation requires training a learning algorithm), Bayesian optimization gives a procedure for selecting a small number of locations at which to sample f_V: [λ_1, ..., λ_T] = λ_T ∈ R^{d×T}. Specifically, given a current sample λ_t, we observe a validation gain v_t such that v_t = f_V(λ_t) + α_t, where α_t ∼ N(0, σ²) is Gaussian noise with possibly non-zero variance σ². Then, given v_t and previously observed values v_1, ..., v_{t−1}, Bayesian optimization updates its belief of f_V and samples a new hyper-parameter λ_{t+1}. Each step of the optimization proceeds in this way.

To decide which hyper-parameter to sample next, Bayesian optimization places a prior distribution over f_V and updates it after every (possibly noisy) function observation. One popular prior distribution over functions is the Gaussian process GP(µ(·), k(·,·)) (Rasmussen & Williams, 2006), parameterized by a mean function µ(·) (we set µ = 0, w.l.o.g.) and a kernel covariance function k(·,·). Functions drawn from a Gaussian process have the property that any finite set of values of the function is normally distributed. Additionally, given samples λ_T = [λ_1, ..., λ_T] and observations v_T = [v_1, ..., v_T], the GP posterior mean and variance have a closed form:

    µ_T(λ) = k(λ, λ_T)(K_T + σ²I)^{−1} v_T
    k_T(λ, λ′) = k(λ, λ′) − k(λ, λ_T)(K_T + σ²I)^{−1} k(λ_T, λ′)
    σ²_T(λ) = k_T(λ, λ),    (2)

where k(λ, λ_T) ∈ R^{1×T} is evaluated element-wise on each of the T columns of λ_T. As well, K_T = k(λ_T, λ_T) ∈ R^{T×T} and λ ∈ Λ is any hyper-parameter. As more samples are observed, the posterior mean function µ_T(λ) approaches f_V(λ).

One well-known method to select hyper-parameters λ maximizes the upper-confidence bound (UCB) of the posterior GP model of f_V (Auer et al., 2002; Srinivas et al., 2010):

    λ_{t+1} ≜ argmax_{λ∈Λ} µ_t(λ) + √β_{t+1} σ_t(λ),    (3)

where β_{t+1} is a parameter that trades off the exploitation of maximizing µ_t(λ) and the exploration of maximizing σ_t(λ). Srinivas et al. (2010) proved that, given certain assumptions on f_V and fixed, non-zero observation noise (σ² > 0), selecting hyper-parameters λ_t to maximize eq. (3) is a no-regret Bayesian optimization procedure: lim_{T→∞} (1/T) Σ_{t=1}^T [f_V(λ*) − f_V(λ_t)] = 0, where λ* is the maximizer of eq. (1). For the no-noise setting, de Freitas et al. (2012) give a UCB-based no-regret algorithm.
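To make the posterior update in eq. (2) and the UCB rule in eq. (3) concrete, here is a minimal numpy sketch of one selection step over a finite candidate set Λ, assuming the normalized squared-exponential kernel k_2 from below; the function names, the fixed length-scale, and the noise value are illustrative choices, not the paper's implementation.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=1.0):
    # Squared-exponential kernel; normalized so that k(lam, lam) = 1.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell ** 2))

def gp_posterior(lam_grid, lam_obs, v_obs, sigma2=0.01, ell=1.0):
    # Posterior mean and variance of eq. (2) at every candidate hyper-parameter.
    K = sq_exp_kernel(lam_obs, lam_obs, ell) + sigma2 * np.eye(len(lam_obs))
    k_star = sq_exp_kernel(lam_grid, lam_obs, ell)          # k(lam, lam_T)
    mu = k_star @ np.linalg.solve(K, v_obs)
    var = 1.0 - np.einsum('ij,ij->i', k_star, np.linalg.solve(K, k_star.T).T)
    return mu, np.maximum(var, 1e-12)

def ucb_next(lam_grid, lam_obs, v_obs, beta, sigma2=0.01):
    # UCB selection rule of eq. (3): argmax of mu_t + sqrt(beta) * sigma_t.
    mu, var = gp_posterior(lam_grid, lam_obs, v_obs, sigma2)
    return lam_grid[np.argmax(mu + np.sqrt(beta) * np.sqrt(var))]
```

Looping this step, appending each selected λ_t and its observed gain v_t to (lam_obs, v_obs), reproduces the (non-private) optimization loop that Algorithm 1 later builds on.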
Contributions. Alongside maximizing f_V, we would like to guarantee that if f_V depends on (sensitive) validation data, we can release information about f_V so that the data remains private. Specifically, we may wish to release (a) our best guess λ̂ ≜ argmax_{t≤T} f_V(λ_t) of the true (unknown) maximizer λ*, and (b) our best guess f_V(λ̂) of the true (also unknown) maximum objective f_V(λ*). The primary question this work aims to answer is: how can we release private versions of λ̂ and f_V(λ̂) that are close to their true values, or better, the values λ* and f_V(λ*)? We give two answers to these questions. The first will make a Gaussian process assumption on f_V, which we describe immediately below. The second, described in Section 5, will utilize Lipschitz and convexity assumptions to guarantee privacy in the event the GP assumption does not hold.

Setting. For our first answer to this question, let us define a Gaussian process over hyper-parameters λ, λ′ ∈ Λ and datasets V, V′ ⊆ X as follows: GP(0, k_1(V, V′) ⊗ k_2(λ, λ′)). A prior of this form is known as a multi-task Gaussian process (Bonilla et al., 2008). Many choices for k_1 and k_2 are possible. The function k_1(V, V′) defines a set kernel (e.g., a function of the number of records that differ between V and V′). For k_2, we focus on either the squared exponential kernel, k_2(λ, λ′) = exp(−‖λ − λ′‖²_2 / (2ℓ²)), or Matérn kernels (e.g., for ν = 5/2, k_2(λ, λ′) = (1 + √5 r/ℓ + 5r²/(3ℓ²)) exp(−√5 r/ℓ), for r = ‖λ − λ′‖_2), for a fixed ℓ, as they have known bounds on the maximum information gain (Srinivas et al., 2010). Note that, as defined, the kernel k_2 is normalized (i.e., k_2(λ, λ) = 1).

Assumption 1. We have a problem of type (1), where all possible dataset functions [f_1, ..., f_{2^{|X|}}] are GP distributed GP(0, k_1(V, V′) ⊗ k_2(λ, λ′)) for known kernels k_1, k_2, for all V, V′ ⊆ X and λ, λ′ ∈ Λ, where |Λ| ≤ ∞.
Similar Gaussian process assumptions have been made in previous work (Srinivas et al., 2010). For a result in the no-noise observation setting, we will make use of the assumptions of de Freitas et al. (2012) for our privacy guarantees, as described in Section 4.

2.1. Differential Privacy

One of the most widely accepted frameworks for private data release is differential privacy (Dwork et al., 2006b), which has been shown to be robust to a variety of privacy attacks (Ganta et al., 2008; Sweeney, 1997; Narayanan & Shmatikov, 2008). Given an algorithm A that outputs a value λ when run on dataset V, the goal of differential privacy is to 'hide' the effect of a small change in V on the output of A. Equivalently, an attacker should not be able to tell if a private record was swapped in V just by looking at the output of A. If two datasets V, V′ differ by swapping a single element, we will refer to them as neighboring datasets. Note that any non-trivial algorithm (i.e., an algorithm A that outputs different values on V and V′ for some pair V, V′ ⊆ X) must include some amount of randomness to guarantee such a change in V is unobservable in the output λ of A (Dwork & Roth, 2013). The level of privacy we wish to guarantee decides the amount of randomness we need to add to λ (better privacy requires increased randomness). Formally, the definition of differential privacy is stated below.

Definition 1. A randomized algorithm A is (ε, δ)-differentially private for ε, δ ≥ 0 if for all λ ∈ Range(A) and for all neighboring datasets V, V′ (i.e., such that V and V′ differ by swapping one record) we have that

    Pr[A(V) = λ] ≤ e^ε Pr[A(V′) = λ] + δ.    (4)

The parameters ε, δ quantify how private A is; the smaller, the more private. The maximum privacy is ε = δ = 0, in which case eq. (4) holds with equality. This can be seen by the fact that V and V′ can be swapped in the definition, and thus the inequality holds in both directions. If δ = 0, we say the algorithm is simply ε-differentially private. For a survey on differential privacy we refer the interested reader to Dwork & Roth (2013).

There are two popular methods for making an algorithm ε-differentially private: (a) the Laplace mechanism (Dwork et al., 2006b), in which we add random noise to λ, and (b) the exponential mechanism (McSherry & Talwar, 2007), which draws a random output λ̃ such that λ̃ ≈ λ. For each mechanism we must define an intermediate quantity called the global sensitivity, describing how much A changes when V changes.

Definition 2. (Laplace mechanism) The global sensitivity of an algorithm A over all neighboring datasets V, V′ (i.e., V, V′ differ by swapping one record) is

    ∆_A ≜ max_{V,V′⊆X} ‖A(V) − A(V′)‖_1.

(Exponential mechanism) The global sensitivity of a function q : X × Λ → R over all neighboring datasets V, V′ is

    ∆_q ≜ max_{V,V′⊆X, λ∈Λ} ‖q(V, λ) − q(V′, λ)‖_1.

The Laplace mechanism hides the output of A by perturbing its output with some amount of random noise.

Definition 3. Given a dataset V and an algorithm A, the Laplace mechanism returns A(V) + ω, where ω is a noise variable drawn from Lap(∆_A/ε), the Laplace distribution with scale parameter ∆_A/ε (and location parameter 0).

The exponential mechanism draws a slightly different λ̃ that is 'close' to λ, the output of A.

Definition 4. Given a dataset V and an algorithm A(V) = argmax_{λ∈Λ} q(V, λ), the exponential mechanism returns λ̃, where λ̃ is drawn from the distribution (1/Z) exp(ε q(V, λ)/(2∆_q)), and Z is a normalizing constant.
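As a concrete reference, the following is a small sketch of the two mechanisms in Definitions 3 and 4, assuming the relevant global sensitivities ∆_A and ∆_q are supplied by the caller; the function names are illustrative.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=np.random.default_rng()):
    # Definition 3: perturb a numeric output with Lap(sensitivity / epsilon) noise.
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def exponential_mechanism(candidates, scores, sensitivity, epsilon,
                          rng=np.random.default_rng()):
    # Definition 4: sample a candidate with probability proportional to
    # exp(eps * q(V, lam) / (2 * Delta_q)); subtracting the max score only
    # improves numerical stability and leaves the distribution unchanged.
    logits = epsilon * (scores - scores.max()) / (2 * sensitivity)
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```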
Given Λ, a possible set of hyper-parameters, we derive methods for privately releasing the best hyper-parameters and the best function values f_V, approximately solving eq. (1). We first address the setting with observation noise (σ² > 0) in eq. (2) and then describe small modifications for the no-noise setting. For each setting we use the UCB sampling technique in eq. (3) to derive our private results.

3. With observation noise

In general cases of Bayesian optimization, observation noise occurs in a variety of real-world modeling settings such as sensor measurement prediction (Krause et al., 2008). In hyper-parameter tuning, noise in the validation gain may be the result of noisy validation or training features.

In the sections that follow, although the quantities f, µ, σ, v all depend on the validation dataset V, for notational simplicity we will occasionally omit the subscript V. Similarly, for V′ we will often write f′, µ′, σ′², v′.

3.1. Private near-maximum hyper-parameters

In this section we guarantee that releasing λ̃ in Algorithm 1 is private (Theorem 1) and that it is near-optimal (Theorem 2). Our proof strategy is as follows: we will first demonstrate the global sensitivity of µ_T(λ), which holds with probability at least 1 − δ. Then we will show that releasing λ̃ via the exponential mechanism is (ε, δ)-differentially private. Finally, we prove that µ_T(λ̃) is close to f(λ*), the true maximum of eq. (1).
Algorithm 1 Private Bayesian Opt. (noisy observations)
Input: V ⊆ X; Λ ⊆ R^d; T; (ε, δ); σ²; γ_T
µ_{V,0} = 0
for t = 1 ... T do
    β_t = 2 log(|Λ| t² π² / (3δ))
    λ_t ≜ argmax_{λ∈Λ} µ_{V,t−1}(λ) + √β_t σ_{V,t−1}(λ)
    Observe validation gain v_{V,t}, given λ_t
    Update µ_{V,t} and σ²_{V,t} according to (2)
end for
c = 2 √((1 − k(V, V′)) log(3|Λ|/δ))
q = σ √(4 log(3/δ))
C_1 = 8 / log(1 + σ^{−2})
Draw λ̃ ∈ Λ w.p. Pr[λ] ∝ exp( ε µ_{V,T}(λ) / (2(2√β_{T+1} + c)) )
v* = max_{t≤T} v_{V,t}
Draw θ ∼ Lap( √(C_1 T β_T γ_T)/(εT) + c/ε + q/ε )
ṽ = v* + θ
Return: λ̃, ṽ
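The release steps of Algorithm 1 can be sketched as follows, assuming the optimization loop has already produced the posterior mean µ_{V,T} evaluated on the finite grid Λ and the observed gains v_1, ..., v_T, and that the information-gain bound γ_T and the set-kernel value k(V, V′) are supplied as inputs, as in the algorithm. This is an illustrative reading of the pseudocode, not the authors' code.

```python
import numpy as np

def private_release_noisy(mu_T, v, Lam, T, eps, delta, sigma2, gamma_T, k_VVp,
                          rng=np.random.default_rng()):
    # Constants from Algorithm 1.
    beta_T1 = 2 * np.log(len(Lam) * (T + 1) ** 2 * np.pi ** 2 / (3 * delta))
    beta_T = 2 * np.log(len(Lam) * T ** 2 * np.pi ** 2 / (3 * delta))
    c = 2 * np.sqrt((1 - k_VVp) * np.log(3 * len(Lam) / delta))
    q = np.sqrt(sigma2) * np.sqrt(4 * np.log(3 / delta))
    C1 = 8 / np.log(1 + 1 / sigma2)

    # Exponential mechanism over the posterior mean
    # (sensitivity 2 * sqrt(beta_{T+1}) + c, from Theorem 1).
    delta_mu = 2 * np.sqrt(beta_T1) + c
    logits = eps * (mu_T - mu_T.max()) / (2 * delta_mu)
    p = np.exp(logits)
    p /= p.sum()
    lam_tilde = Lam[rng.choice(len(Lam), p=p)]

    # Laplace mechanism on the best observed gain (Theorem 3 sensitivity / eps).
    scale = np.sqrt(C1 * T * beta_T * gamma_T) / (eps * T) + c / eps + q / eps
    v_tilde = np.max(v) + rng.laplace(0.0, scale)
    return lam_tilde, v_tilde
```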
Global sensitivity. As a first step we bound the global sensitivity of µ_T(λ) as follows:

Theorem 1. Given Assumption 1, for any two neighboring datasets V, V′ and for all λ ∈ Λ, with probability at least 1 − δ there is an upper bound on the global sensitivity (in the exponential mechanism sense) of µ_T:

    |µ′_T(λ) − µ_T(λ)| ≤ 2√β_{T+1} + σ_1 √(2 log(3|Λ|/δ)),

for σ_1 = √(2(1 − k_1(V, V′))) and β_t = 2 log(|Λ| t² π² / (3δ)).

Proof. Note that, by applying the triangle inequality twice, for all λ ∈ Λ,

    |µ′_T(λ) − µ_T(λ)| ≤ |µ′_T(λ) − f′(λ)| + |f′(λ) − µ_T(λ)|
        ≤ |µ′_T(λ) − f′(λ)| + |f′(λ) − f(λ)| + |f(λ) − µ_T(λ)|.

We can now bound each one of the terms in the summation on the right hand side (RHS), each with probability at least 1 − δ/3. According to Srinivas et al. (2010), Lemma 5.1, we obtain |µ′_T(λ) − f′(λ)| ≤ √β_{T+1} σ′_T(λ). The same can be applied to |f(λ) − µ_T(λ)|. As σ′_T(λ) ≤ 1, because k(λ, λ) = 1, we can upper bound both terms together by 2√β_{T+1}. In order to bound the remaining (middle) term on the RHS, recall that for a random variable Z ∼ N(0, 1) we have Pr[|Z| > γ] ≤ e^{−γ²/2}. For variables Z_1, ..., Z_n ∼ N(0, 1) we have, by the union bound, that Pr[∀i, |Z_i| ≤ γ] ≥ 1 − n e^{−γ²/2} ≜ 1 − δ/3. If we set Z = |f(λ) − f′(λ)| / σ_1 and n = |Λ|, we obtain γ = √(2 log(3|Λ|/δ)), which completes the proof. □

We remark that all of the quantities in Theorem 1 are either given or selected by the modeler (e.g., δ, T). Given this upper bound we can apply the exponential mechanism to release λ̃ privately, as per Definition 1:

Corollary 1. Let A(V) denote Algorithm 1 applied on dataset V. Given Assumption 1, λ̃ is (ε, δ)-differentially private, i.e., Pr[A(V) = λ̃] ≤ e^ε Pr[A(V′) = λ̃] + δ, for any pair of neighboring datasets V, V′.

We leave the proof of Corollary 1 to the supplementary material. Even though we must release a noisy hyper-parameter setting λ̃, it is in fact near-optimal.

Theorem 2. Given Assumption 1, the following near-optimal approximation guarantee for releasing λ̃ holds:

    µ_T(λ̃) ≥ f(λ*) − 2√β_T − q − (2∆/ε)(log|Λ| + a)

w.p. ≥ 1 − (δ + e^{−a}), where ∆ = 2√β_{T+1} + c (for β_{T+1}, c, and q defined as in Algorithm 1).

Proof. In general, the exponential mechanism selects λ̃ that is close to the maximum λ (McSherry & Talwar, 2007):

    µ_T(λ̃) ≥ max_{λ∈Λ} µ_T(λ) − (2∆/ε)(log|Λ| + a),    (5)

with probability at least 1 − e^{−a}. Recall we assume that at each optimization step we observe the noisy gain v_t = f(λ_t) + α_t, where α_t ∼ N(0, σ²) (with fixed noise variance σ² > 0). As such, we can lower bound the term max_{λ∈Λ} µ_T(λ):

    max_{λ∈Λ} µ_T(λ) ≥ f(λ_T) + α_T  (= v_T)
    f(λ*) − max_{λ∈Λ} µ_T(λ) ≤ f(λ*) − f(λ_T) − α_T
    f(λ*) − max_{λ∈Λ} µ_T(λ) ≤ 2√β_T σ_{T−1}(λ_T) − α_T
    max_{λ∈Λ} µ_T(λ) ≥ f(λ*) − 2√β_T + α_T,    (6)

where the third line follows from Srinivas et al. (2010), Lemma 5.2, and the fourth line from the fact that σ_{T−1}(λ_T) ≤ 1.

As in the proof of Theorem 1, given a normal random variable Z ∼ N(0, 1) we have Pr[|Z| ≤ γ] ≥ 1 − e^{−γ²/2} ≜ 1 − δ/2. Therefore, if we set Z = α_T/σ, we have γ = √(2 log(2/δ)). This implies that |α_T| ≤ σ√(2 log(2/δ)) ≤ σ√(4 log(3/δ)) = q (as defined in Algorithm 1) with probability at least 1 − δ/2. Thus, we can lower bound α_T by −q. We can then lower bound max_{λ∈Λ} µ_T(λ) in eq. (5) with the right hand side of eq. (6). Therefore, given the β_T in Algorithm 1, Lemma 5.2 of Srinivas et al. (2010) holds with probability at least 1 − δ/2, and the theorem statement follows. □

3.2. Private near-maximum validation gain

In this section we demonstrate that releasing the validation gain ṽ in Algorithm 1 is private (Theorem 3) and that the noise we add to ensure privacy is bounded with high probability (Theorem 4). As in the previous section, our approach will
be to first derive the global sensitivity of the maximum v found by Algorithm 1. Then we show that releasing ṽ is (ε, δ)-differentially private via the Laplace mechanism. Perhaps surprisingly, we also show that ṽ is close to f(λ*).

Global sensitivity. We bound the global sensitivity of the maximum v found with Bayesian optimization and UCB:

Theorem 3. Given Assumption 1, and neighboring V, V′, we have the following global sensitivity bound (in the Laplace mechanism sense) for the maximum v, w.p. ≥ 1 − δ:

    |max_{t≤T} v′_t − max_{t≤T} v_t| ≤ √(C_1 T β_T γ_T)/T + c + q,

where the maximum Gaussian process information gain γ_T is bounded above for the squared exponential and Matérn kernels (Srinivas et al., 2010).

Proof. For notational simplicity let us denote the regret term as Ω ≜ √(C_1 T β_T γ_T). Then from Theorem 1 in Srinivas et al. (2010) we have that

    Ω/T ≥ (1/T) Σ_{t=1}^T [f(λ*) − f(λ_t)] ≥ f(λ*) − max_{t≤T} f(λ_t).    (7)

This implies f(λ*) ≤ max_{t≤T} f(λ_t) + Ω/T with probability at least 1 − δ/3 (with appropriate choice of β_T).

Recall that in the proof of Theorem 1 we showed that |f(λ) − f′(λ)| ≤ c with probability at least 1 − δ/3 (for c given in Algorithm 1). This, along with the above expression, implies the following two sets of inequalities with probability greater than 1 − 2δ/3:

    f′(λ*) − c ≤ f(λ*) < max_{t≤T} f(λ_t) + Ω/T;
    f(λ*) − c ≤ f′(λ*) < max_{t≤T} f′(λ_t) + Ω/T.

These, in turn, imply the two sets of inequalities:

    max_{t≤T} f′(λ_t) ≤ f′(λ*) < max_{t≤T} f(λ_t) + Ω/T + c;
    max_{t≤T} f(λ_t) ≤ f(λ*) < max_{t≤T} f′(λ_t) + Ω/T + c.

That is, the global sensitivity of max_{t≤T} f(λ_t) is bounded: |max_{t≤T} f′(λ_t) − max_{t≤T} f(λ_t)| ≤ Ω/T + c. Given the sensitivity of the maximum f, we can readily derive the sensitivity of the maximum v. First note that we can use the triangle inequality to derive

    |max_{t≤T} v′(λ_t) − max_{t≤T} v(λ_t)| ≤ |max_{t≤T} v(λ_t) − max_{t≤T} f(λ_t)|
        + |max_{t≤T} v′(λ_t) − max_{t≤T} f′(λ_t)|
        + |max_{t≤T} f′(λ_t) − max_{t≤T} f(λ_t)|.

We can immediately bound the final term on the right hand side. Note that, as v_t = f(λ_t) + α_t, the first two terms are bounded above by |α| and |α′|, where α = α_t̂ for t̂ ≜ argmax_{t≤T} |α_t| (similarly for α′). This is because, in the worst case, the observation noise shifts the observed maximum max_{t≤T} v_t up or down by α. Therefore, let α̂ = α if |α| > |α′| and α̂ = α′ otherwise, so that we have:

    |max_{t≤T} v′(λ_t) − max_{t≤T} v(λ_t)| ≤ Ω/T + c + |2α̂|.

Although α̂ can be arbitrarily large, recall that for Z ∼ N(0, 1) we have Pr[|Z| ≤ γ] ≥ 1 − e^{−γ²/2} ≜ 1 − δ/3. Therefore, if we set Z = 2α̂/(σ√2) we have γ = √(2 log(3/δ)). This implies that |2α̂| ≤ σ√(4 log(3/δ)) = q with probability at least 1 − δ/3. Therefore, if Theorem 1 from Srinivas et al. (2010) and the bound on |f(λ) − f′(λ)| hold together with probability at least 1 − 2δ/3 as described above, the theorem follows directly. □

As in Theorem 1, each quantity in the above bound is given in Algorithm 1 (β, c, q), given in previous results (Srinivas et al., 2010) (γ_T, C_1), or specified by the modeler (T, δ). Now that we have a bound on the sensitivity of the maximum v, we will use the Laplace mechanism to prove our privacy guarantee (proof in supplementary material):

Corollary 2. Let A(V) denote Algorithm 1 run on dataset V. Given Assumption 1, releasing ṽ is (ε, δ)-differentially private, i.e., Pr[A(V) = ṽ] ≤ e^ε Pr[A(V′) = ṽ] + δ.

Further, as the Laplace distribution has exponential tails, the noise we add to obtain ṽ is not too large:

Theorem 4. Given the assumptions of Theorem 1, we have the following bound,

    |ṽ − f(λ*)| ≤ √(2 log(2T/δ)) + Ω/T + a(Ω/(εT) + c/ε + q/ε),

with probability at least 1 − (δ + e^{−a}), for Ω = √(C_1 T β_T γ_T).

Proof (Theorem 4). Let Z be a Laplace random variable with scale parameter b and location parameter 0; Z ∼ Lap(b). Then Pr[|Z| ≤ ab] = 1 − e^{−a}. Thus, in Algorithm 1, |ṽ − max_{t≤T} v_t| ≤ ab for b = Ω/(εT) + c/ε + q/ε with probability at least 1 − e^{−a}. Further observe,

    ab ≥ max_{t≤T} v_t − ṽ ≥ (max_{t≤T} f(λ_t) − max_{t≤T} |α_t|) − ṽ
       ≥ f(λ*) − Ω/T − max_{t≤T} |α_t| − ṽ,    (8)

where the second and third inequalities follow from the proof of Theorem 3 (using the regret bound of Srinivas et al. (2010): Theorem 1). Note that the third inequality holds with probability greater than 1 − δ/2 (given β_t in Algorithm 1). The final inequality implies f(λ*) − ṽ ≤
max_{t≤T} |α_t| + Ω/T + ab. Also note that,

    ab ≥ ṽ − max_{t≤T} v_t ≥ ṽ − (max_{t≤T} f(λ_t) + max_{t≤T} |α_t|)
       ≥ ṽ − f(λ*) − Ω/T − max_{t≤T} |α_t|.    (9)

Thus, we have that |ṽ − f(λ*)| ≤ max_{t≤T} |α_t| + Ω/T + ab. Because α_t may be arbitrarily large, we give a high-probability upper bound on max_{t≤T} |α_t|: for Z_1, ..., Z_n ∼ N(0, 1) we have, by the tail probability bound and union bound, that Pr[∀t, |Z_t| ≤ γ] ≥ 1 − n e^{−γ²/2} ≜ 1 − δ/2. As we set Z_t = α_t/σ and n = T, we obtain γ = √(2 log(2T/δ)), which bounds max_{t≤T} |α_t| with probability at least 1 − δ/2 and completes the proof. □

We note that, because releasing either λ̃ or ṽ is (ε, δ)-differentially private by Corollaries 1 and 2, releasing both private quantities in Algorithm 1 guarantees (2ε, 2δ)-differential privacy for validation dataset V. This is due to the composition properties of (ε, δ)-differential privacy (Dwork et al., 2006a) (in fact stronger composition results can be demonstrated (Dwork & Roth, 2013)).

4. Without observation noise

In hyper-parameter tuning it may be reasonable to assume that we can observe function evaluations exactly: v_{V,t} = f_V(λ_t). First note that we can use the same algorithm to report the maximum λ in the no-noise setting. Theorems 1 and 2 still hold (note that q = 0 in Theorem 2). However, we cannot readily report a private maximum f, as the information gain γ_T in Theorems 3 and 4 approaches infinity as σ² → 0. Therefore, we extend results from the previous section to the exact observation case via the regret bounds of de Freitas et al. (2012). Algorithm 2 demonstrates how to privatize the maximum f in the exact observation case.

Algorithm 2 Private Bayesian Opt. (noise free obs.)
Input: V ⊆ X; Λ ⊆ R^d; T; (ε, δ); A, τ; assumptions on f_V in de Freitas et al. (2012)
Run method of de Freitas et al. (2012), resulting in noise free observations: f_V(λ_1), ..., f_V(λ_T)
c = 2 √((1 − k(V, V′)) log(2|Λ|/δ))
Draw θ ∼ Lap[ A e^{−2τ/(log 2)^{d/4}} / ε + c/ε ]
Return: f̃ = max_{2≤t≤T} f_V(λ_t) + θ
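A corresponding sketch of Algorithm 2's release step, assuming the exact observations f_V(λ_1), ..., f_V(λ_T) were produced by the method of de Freitas et al. (2012) and that the constants A, τ, d and k(V, V′) are given; the form of the Ω exponent follows our reading of the regret bound above and should be checked against the original statement.

```python
import numpy as np

def private_release_noise_free(f_vals, Lam_size, eps, delta, A, tau, d, k_VVp,
                               rng=np.random.default_rng()):
    # Global sensitivity bound of Theorem 5: Omega + c, with Omega taken from
    # the de Freitas et al. (2012) regret bound (assumed form below).
    Omega = A * np.exp(-2 * tau / np.log(2) ** (d / 4))
    c = 2 * np.sqrt((1 - k_VVp) * np.log(2 * Lam_size / delta))
    # Laplace mechanism on the best exact observation over t = 2, ..., T.
    return np.max(f_vals[1:]) + rng.laplace(0.0, Omega / eps + c / eps)
```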
4.1. Private near-maximum validation gain

We demonstrate that releasing f̃ in Algorithm 2 is private (Corollary 3) and that only a small amount of noise is added to make f̃ private (Theorem 6). To do so, we derive the global sensitivity of max_{2≤t≤T} f(λ_t) in Algorithm 2, independent of the maximum information gain γ_T, via de Freitas et al. (2012). Then we prove that releasing f̃ is (ε, δ)-differentially private and that f̃ is almost max_{2≤t≤T} f(λ_t).

Global sensitivity. The following theorem gives a bound on the global sensitivity of the maximum f.

Theorem 5. Given Assumption 1 and the assumptions in Theorem 2 of de Freitas et al. (2012), for neighboring datasets V, V′ we have the following global sensitivity bound (in the Laplace mechanism sense),

    |max_{2≤t≤T} f′(λ_t) − max_{2≤t≤T} f(λ_t)| ≤ A e^{−2τ/(log 2)^{d/4}} + c

w.p. at least 1 − δ, for c = 2 √((1 − k(V, V′)) log(2|Λ|/δ)).

We leave the proof to the supplementary material.

Given this sensitivity, we may apply the Laplace mechanism to release f̃.

Corollary 3. Let A(V) denote Algorithm 2 run on dataset V. Given Assumption 1 and that f satisfies the assumptions of de Freitas et al. (2012), f̃ is (ε, δ)-differentially private with respect to any neighboring dataset V′, i.e.,

    Pr[A(V) = f̃] ≤ e^ε Pr[A(V′) = f̃] + δ.

Even though we must add noise to the maximum f, we show that f̃ is still close to the optimal f(λ*).

Theorem 6. Given the assumptions of Theorem 5, we have the following utility guarantee for Algorithm 2:

    |f̃ − f(λ*)| ≤ Ω + a(Ω/ε + c/ε)

w.p. at least 1 − (δ + e^{−a}), for Ω = A e^{−2τ/(log 2)^{d/4}}.

We prove Corollary 3 and Theorem 6 in the supplementary material. We have demonstrated that in the noisy and noise-free settings we can release private near-optimal hyper-parameter settings λ̃ and function evaluations ṽ, f̃. However, the analysis thus far assumes the hyper-parameter set is finite: |Λ| < ∞. It is possible to relax this assumption, using an analysis similar to (Srinivas et al., 2010). We leave this analysis to the supplementary material.

5. Without the GP assumption

Even if our true validation score f is not drawn from a Gaussian process (Assumption 1), we can still guarantee differential privacy for releasing its value after Bayesian optimization, f^{BO} = max_{t≤T} f(λ_t). In this section we describe a different functional assumption on f that also yields differentially private Bayesian optimization for the case of machine learning hyper-parameter tuning.
Assume we have a (nonsensitive) training set T = {(x_i, y_i)}_{i=1}^n which, given a hyper-parameter λ, produces a model w(λ) from the following optimization,

    w_λ = argmin_w O_λ(w) ≜ (λ/2)‖w‖²_2 + (1/n) Σ_{i=1}^n ℓ(w, x_i, y_i).    (10)

The function ℓ is a training loss function (e.g., logistic loss, hinge loss). Given a (sensitive) validation set V = {(x_i, y_i)}_{i=1}^m ⊆ X, we would like to use Bayesian optimization to maximize a validation score f_V.

Assumption 2. Our true validation score f_V is

    f_V(w(λ)) = −(1/m) Σ_{i=1}^m g(w_λ, x_i, y_i),

where g(·) is a validation loss function that is L-Lipschitz in w (e.g., ramp loss, normalized sigmoid (Huang et al., 2014)). Additionally, the training model w_λ is the minimizer of eq. (10) for a training loss ℓ(·) that is 1-Lipschitz in w and convex (e.g., logistic loss, hinge loss).

Algorithm 3 describes a procedure for privately releasing the best validation accuracy f^{BO} given Assumption 2. Different from previous algorithms, we may run Bayesian optimization in Algorithm 3 with any acquisition function (e.g., expected improvement (Mockus et al., 1978), UCB) and privacy is still guaranteed.

Algorithm 3 Private Bayesian Opt. (Lipschitz and convex)
Input: T of size n; V of size m; Λ; λ_min; λ_max; ε; T; L; d
Run Bayesian optimization for T timesteps, observing: f_V(w_{λ_1}), ..., f_V(w_{λ_T}) for {λ_1, ..., λ_T} = Λ_{T,V} ⊆ Λ
f^{BO} = max_{t≤T} f_V(w_{λ_t})
g* = max_{(x,y)∈X, w∈R^d} g(w, x, y)
Draw θ ∼ Lap[ (1/(εm)) min{g*, L/λ_min} + (λ_max − λ_min)L / (ε λ_max λ_min) ]
Return: f̃_L = f^{BO} + θ

Similar to Algorithms 1 and 2, we use the Laplace mechanism to mask the possible change in validation accuracy when V is swapped with a neighboring validation set V′. Different from the work of Chaudhuri & Vinterbo (2013), changing V to V′ may also lead to Bayesian optimization searching different hyper-parameters, Λ_{T,V} vs. Λ_{T,V′}. Therefore, we must bound the total global sensitivity of f with respect to V and λ.

Definition 5. The total global sensitivity of f over all neighboring datasets V, V′ is

    ∆_f ≜ max_{V,V′⊆X, λ,λ′∈Λ} |f_V(w_λ) − f_{V′}(w_{λ′})|.
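A sketch of Algorithm 3's release step under Assumption 2, assuming the best score f^{BO}, the Lipschitz constant L, the worst-case validation loss g*, the validation set size m, and the regularization range [λ_min, λ_max] are given; the helper names are illustrative.

```python
import numpy as np

def total_sensitivity(L, g_star, m, lam_min, lam_max):
    # Total global sensitivity of f (Definition 5), i.e., the Theorem 7 bound
    # maximized over lambda, lambda' in [lam_min, lam_max].
    return (min(g_star / m, L / (m * lam_min))
            + (lam_max - lam_min) * L / (lam_max * lam_min))

def private_best_accuracy(f_bo, L, g_star, m, lam_min, lam_max, eps,
                          rng=np.random.default_rng()):
    # Algorithm 3: Laplace noise with scale (total sensitivity / eps), added to
    # the best validation score found by (any) Bayesian optimization run.
    scale = total_sensitivity(L, g_star, m, lam_min, lam_max) / eps
    return f_bo + rng.laplace(0.0, scale)
```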
In the following theorem we demonstrate that we can bound the change in f for arbitrary λ < λ′.

Theorem 7. Given Assumption 2, for neighboring V, V′ and arbitrary λ < λ′ we have that

    |f_V(w_λ) − f_{V′}(w_{λ′})| ≤ (λ′ − λ)L / (λ′λ) + min{g*/m, L/(m λ_min)},

where L is the Lipschitz constant of f, g* = max_{(x,y)∈X, w∈R^d} g(w, x, y), and m is the size of V.

Proof. Applying the triangle inequality yields

    |f_V(w_λ) − f_{V′}(w_{λ′})| ≤ |f_V(w_λ) − f_V(w_{λ′})| + |f_V(w_{λ′}) − f_{V′}(w_{λ′})|.

The second term is bounded by Chaudhuri & Vinterbo (2013) in the proof of their Theorem 4. The only difference is that, as we are not adding random noise to w_{λ′}, we have |f_V(w_{λ′}) − f_{V′}(w_{λ′})| ≤ min{g*/m, L/(m λ_min)}.

To bound the first term, let O_λ(w) be the value of the objective in eq. (10) for a particular λ. Note that O_λ(w) and O_{λ′}(w) are λ- and λ′-strongly convex. Define

    h(w) = O_{λ′}(w) − O_λ(w) = ((λ′ − λ)/2) ‖w‖²_2.    (11)

Further, define the minimizers w_λ = argmin_w O_λ(w) and w_{λ′} = argmin_w [O_λ(w) + h(w)]. This implies that

    ∇O_λ(w_λ) = ∇O_λ(w_{λ′}) + ∇h(w_{λ′}) = 0.    (12)

Given that O_λ is λ-strongly convex (Shalev-Shwartz, 2007), and by the Cauchy-Schwartz inequality,

    λ‖w_λ − w_{λ′}‖²_2 ≤ [∇O_λ(w_λ) − ∇O_λ(w_{λ′})]^⊤ [w_λ − w_{λ′}]
        ≤ ∇h(w_{λ′})^⊤ [w_λ − w_{λ′}] ≤ ‖∇h(w_{λ′})‖_2 ‖w_λ − w_{λ′}‖_2.

Rearranging,

    (1/λ)‖∇h(w_{λ′})‖_2 = (1/λ)‖((λ′ − λ)/2) ∇‖w_{λ′}‖²_2‖_2 ≥ ‖w_λ − w_{λ′}‖_2.    (13)

Now, as w_{λ′} is the minimizer of O_{λ′}, we have

    ∇‖w_{λ′}‖²_2 = (2/λ′) [ −(1/n) Σ_{i=1}^n ∇ℓ(w_{λ′}, x_i, y_i) ].

Substituting this value of ∇‖w_{λ′}‖²_2 into eq. (13), and noting that we can pull the positive constant term out of the norm and drop the negative sign, gives us

    (1/λ)‖∇h(w_{λ′})‖_2 = ((λ′ − λ)/(λλ′)) ‖(1/n) Σ_{i=1}^n ∇ℓ(w_{λ′}, x_i, y_i)‖_2 ≤ (λ′ − λ)/(λλ′).

The last inequality follows from the fact that the loss ℓ is 1-Lipschitz by Assumption 2 and the triangle inequality. Thus, along with eq. (13), we have

    ‖w_λ − w_{λ′}‖_2 ≤ (1/λ)‖∇h(w_{λ′})‖_2 ≤ (λ′ − λ)/(λ′λ).
Finally, as f is L-Lipschitz in w,

    |f_V(w_λ) − f_V(w_{λ′})| ≤ L‖w_λ − w_{λ′}‖_2 ≤ L(λ′ − λ)/(λ′λ).

Combining the result of Chaudhuri & Vinterbo (2013) with the above expression completes the proof. □

Given a finite set of possible hyper-parameters Λ, we would like to bound |f*_V − f*_{V′}|, the difference between the best validation scores found when running Bayesian optimization on V vs. V′. Note that, by Theorem 7,

    |f*_V − f*_{V′}| ≤ max_{λ,λ′} |f_V(w_λ) − f_{V′}(w_{λ′})|
        ≤ (λ_max − λ_min)L / (λ_max λ_min) + min{g*/m, L/(m λ_min)},

as (λ′ − λ)/(λλ′) is strictly increasing in λ′ and strictly decreasing in λ. Given this sensitivity of f*, we can use the Laplace mechanism to hide changes in the validation set as follows.

Corollary 4. Let A(V) denote Algorithm 3 applied on dataset V. Given Assumption 2, f̃_L is ε-differentially private, i.e., Pr[A(V) = f̃_L] ≤ e^ε Pr[A(V′) = f̃_L].

We leave the proof to the supplementary material. Further, by the exponential tails of the Laplace mechanism, we have the following utility guarantee.

Theorem 8. Given the assumptions of Theorem 7, we have the following utility guarantee for f̃_L w.r.t. f^{BO},

    |f̃_L − f^{BO}| ≤ a[ (1/(εm)) min{g*, L/λ_min} + (λ_max − λ_min)L / (ε λ_max λ_min) ]

with probability at least 1 − e^{−a}.

Proof. This follows exactly from the tail bound on Laplace random variables, given in the beginning of the proof of Theorem 4. □

6. Results

In this section we examine the validity of our multi-task Gaussian process assumption on [f_1, ..., f_{2^{|X|}}]. Specifically, we search for the most likely value of the multi-task Gaussian process covariance element k_1(V, V′) for classifier hyper-parameter tuning. Larger values of k_1(V, V′) correspond to smaller global sensitivity bounds in Theorems 1, 3, and 5, leading to improved privacy guarantees.

For our setting of hyper-parameter tuning, each λ = [C, γ²] is a pair of hyper-parameters for training a kernelized support vector machine (SVM) (Cortes & Vapnik, 1995; Schölkopf & Smola, 2001) with cost parameter C and radial basis kernel width γ². The value f_V(λ) is the accuracy of the SVM model trained with hyper-parameters λ on V.

To search for the most likely k_1(V, V′) we start by sampling 100 different SVM hyper-parameter settings λ_1, ..., λ_100 from a Sobol sequence and train an SVM model for each on the Forest UCI dataset (36,603 training inputs). We then randomly sample 100 i.i.d. validation sets V. Here we describe the evaluation procedure for a fixed validation set size, which corresponds to a single curve in Figure 1 (as such, to generate all results we repeat this procedure for each validation set size in the set {1000, 2000, 3000, 5000, 15000}). For each of the 100 validation sets we randomly add or remove an input to form a neighboring dataset V′. We then evaluate each of the trained SVM models on all 100 datasets V and their pairs V′. This results in two 100 × 100 (number of datasets, number of trained SVM models) function evaluation matrices F_V and F_{V′}. Thus, [F_V]_{ij} is the jth SVM model's accuracy on the ith validation set V.

The likelihood of function evaluations for a dataset pair (V_i, V′_i), for a value of k_1(V_i, V′_i), is given by the marginal likelihood of the multi-task Gaussian process:

    ([F_V]_i ; [F_{V′}]_i) ∼ N(µ_100, σ²_100),    (14)

where [F_V]_i = [f_V(λ_1), ..., f_V(λ_100)] (similarly for V′), and µ_100 and σ²_100 are the posterior mean and variance of the multi-task Gaussian process using kernel k_1(V, V′) ⊗ k_2(λ, λ′) after observing [F_V]_i and [F_{V′}]_i (for more details see Bonilla et al. (2008)). As µ_100 and σ²_100 depend on k_1(V, V′), we treat it as a free parameter and vary its value from 0.05 to 0.95 in increments of 0.05. For each value, we compute the marginal likelihood (14) for all validation dataset pairs (V_i for i = 1, ..., 100). As each V_i is sampled i.i.d., the joint marginal likelihood is simply the product of the likelihoods over all V_i. Computing this joint marginal likelihood for each k_1(V, V′) value yields a single curve of Figure 1. As shown, the largest value k_1(V, V′) = 0.95 is the most likely, meaning that c in the global sensitivity bounds is quite small, leading to private values that are closer to their true optimums.

Figure 1. The log likelihood of a multi-task Gaussian process for different values of the kernel value k_1(V, V′). The function evaluations are the validation accuracy of SVMs with different hyper-parameters. (One curve per validation set size in {1000, 2000, 3000, 5000, 15000}; x-axis: dataset kernel value k_1(V, V′) from 0.05 to 0.95; y-axis: log likelihood, ×10^4.)
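The model-selection loop behind Figure 1 can be sketched as below, assuming the accuracy matrices F_V and F_Vp have been computed and using a squared-exponential k_2 over the sampled hyper-parameter settings; the marginal-likelihood computation follows the standard multi-task GP construction of Bonilla et al. (2008) and is our illustrative reading, not the authors' exact code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_log_likelihood(F_V, F_Vp, lams, k1_value, ell=1.0, jitter=1e-6):
    # Multi-task GP covariance: Kronecker product of the 2x2 dataset kernel
    # [[1, k1], [k1, 1]] and the hyper-parameter kernel k2(lams, lams).
    d2 = ((lams[:, None, :] - lams[None, :, :]) ** 2).sum(-1)
    K2 = np.exp(-d2 / (2 * ell ** 2))
    K1 = np.array([[1.0, k1_value], [k1_value, 1.0]])
    K = np.kron(K1, K2) + jitter * np.eye(2 * len(lams))
    mvn = multivariate_normal(mean=np.zeros(2 * len(lams)), cov=K)
    total = 0.0
    for i in range(F_V.shape[0]):              # one term per dataset pair
        y = np.concatenate([F_V[i], F_Vp[i]])  # stacked evaluations, eq. (14)
        total += mvn.logpdf(y)
    return total

# Sweep k1 over {0.05, 0.10, ..., 0.95} and keep the most likely value:
# best_k1 = max(np.arange(0.05, 1.0, 0.05),
#               key=lambda k1: joint_log_likelihood(F_V, F_Vp, lams, k1))
```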
7. Related work

There has been much work towards differentially private convex optimization (Chaudhuri et al., 2011; Kifer et al., 2012; Duchi et al., 2013; Song et al., 2013; Jain & Thakurta, 2014; Bassily et al., 2014). The work of Bassily et al. (2014) established upper and lower bounds for the excess empirical risk of ε- and (ε, δ)-differentially private algorithms for many settings, including convex and strongly convex risk functions that may or may not be smooth. There is also related work towards private high-dimensional regression, where the dimensions outnumber the number of instances (Kifer et al., 2012; Smith & Thakurta, 2013a). In such cases the Hessian becomes singular and so the loss is nonconvex. However, it is possible to use the restricted strong convexity of the loss in the regression case to guarantee privacy.

Differential privacy has been shown to be achievable in online and interactive kernel learning settings (Jain et al., 2012; Smith & Thakurta, 2013b; Jain & Thakurta, 2013; Mishra & Thakurta, 2014). In general, non-private online algorithms are closest in spirit to the methods of Bayesian optimization. However, all of the previous work in differentially private online learning represents a dataset as a sequence of bandit arm pulls (the equivalent notion in Bayesian optimization is function evaluations f(x_t)). Instead, we consider functions in which changing a single dataset entry possibly affects all future function evaluations. Closest to our work is that of Chaudhuri & Vinterbo (2013), who show that, given a fixed set of hyper-parameters which are always evaluated for any validation set, they can return a private version of the index of the best hyper-parameter, as well as a private model trained with that hyper-parameter. Our setting is strictly more general in that, if the validation set changes, Bayesian optimization could search completely different hyper-parameters.

Bayesian optimization, largely due to its principled handling of the exploration/exploitation trade-off of global, black-box function optimization, is quickly becoming the global optimization paradigm of choice. Alongside promising empirical results there is a wealth of recent work on convergence guarantees for Bayesian optimization, similar to those used in this work (Srinivas et al., 2010; de Freitas et al., 2012). Vazquez & Bect (2010) and Bull (2011) give regret bounds for optimizing the expected improvement acquisition function at each optimization step. BayesGap (Hoffman et al., 2014) gives a convergence guarantee for Bayesian optimization with budget constraints. Bayesian optimization has also been extended to multi-task optimization (Bardenet et al., 2013; Swersky et al., 2013), the setting where multiple experiments can be run at once (Azimi et al., 2012; Snoek et al., 2012), and to constrained optimization (Gardner et al., 2014).

8. Conclusion

We have introduced methods for privately releasing the best hyper-parameters and validation accuracies in the case of exact and noisy observations. Our work makes use of the differential privacy framework, which has become commonplace in private machine learning (Dwork & Roth, 2013). We believe we are the first to demonstrate differentially private quantities in the setting of global optimization of expensive (possibly nonconvex) functions, through the lens of Bayesian optimization.

One key future direction is to design techniques to release each sampled hyper-parameter and validation accuracy privately (during the run of Bayesian optimization). This requires analyzing how the maximum upper-confidence bound changes as the validation dataset changes. Another interesting direction is extending our guarantees in Sections 3 and 4 to other acquisition functions.

For the case of machine learning hyper-parameter tuning, our results are designed to guarantee privacy of the validation set only (it is equivalent to guarantee that the training set is never allowed to change). To simultaneously protect the privacy of the training set it may be possible to use techniques similar to the training stability results of Chaudhuri & Vinterbo (2013). Training stability could be guaranteed, for example, by assuming an additional training set kernel that bounds the effect of altering the training set on f. We leave developing these guarantees for future work.

As practitioners begin to use Bayesian optimization in practical settings involving sensitive data, it suddenly becomes crucial to consider how to preserve data privacy while reporting accurate Bayesian optimization results. This work presents methods to achieve such privacy, which we hope will be useful to practitioners and theorists alike.

References

Auer, Peter, Cesa-Bianchi, Nicolo, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Azimi, Javad, Jalali, Ali, and Fern, Xiaoli Z. Hybrid batch Bayesian optimization. In ICML 2012, pp. 1215–1222. ACM, 2012.

Bardenet, Rémi, Brendel, Mátyás, Kégl, Balázs, and Sebag, Michèle. Collaborative hyperparameter tuning. In ICML, 2013.

Bassily, Raef, Smith, Adam, and Thakurta, Abhradeep.
Private empirical risk minimization, revisited. arXiv preprint arXiv:1405.7085, 2014.

Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. JMLR, 13:281–305, 2012.

Bonilla, Edwin, Chai, Kian Ming, and Williams, Christopher. Multi-task Gaussian process prediction. In NIPS, 2008.

Bull, Adam D. Convergence rates of efficient global optimization algorithms. JMLR, 12:2879–2904, 2011.

Chaudhuri, Kamalika and Vinterbo, Staal A. A stability-based validation procedure for differentially private machine learning. In Advances in Neural Information Processing Systems, pp. 2652–2660, 2013.

Chaudhuri, Kamalika, Monteleoni, Claire, and Sarwate, Anand D. Differentially private empirical risk minimization. JMLR, 12:1069–1109, 2011.

Chong, Miao M, Abraham, Ajith, and Paprzycki, Marcin. Traffic accident analysis using machine learning paradigms. Informatica (Slovenia), 29(1):89–98, 2005.

Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

de Freitas, Nando, Smola, Alex, and Zoghi, Masrour. Exponential regret bounds for Gaussian process bandits with deterministic observations. In ICML, 2012.

Dinur, Irit and Nissim, Kobbi. Revealing information while preserving privacy. In Proceedings of the SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 202–210. ACM, 2003.

Duchi, John C, Jordan, Michael I, and Wainwright, Martin J. Local privacy and statistical minimax rates. In FOCS, pp. 429–438. IEEE, 2013.

Dwork, Cynthia and Roth, Aaron. The algorithmic foundations of differential privacy. Theoretical Computer Science, 9(3-4):211–407, 2013.

Dwork, Cynthia, Kenthapadi, Krishnaram, McSherry, Frank, Mironov, Ilya, and Naor, Moni. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology - EUROCRYPT 2006, pp. 486–503. Springer, 2006a.

Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, and Smith, Adam. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pp. 265–284. Springer, 2006b.

Ganta, Srivatsava Ranjit, Kasiviswanathan, Shiva Prasad, and Smith, Adam. Composition attacks and auxiliary information in data privacy. In KDD, pp. 265–273. ACM, 2008.

Gardner, Jacob, Kusner, Matt, Xu, Zhixiang, Weinberger, Kilian, and Cunningham, John. Bayesian optimization with inequality constraints. In ICML, pp. 937–945, 2014.

Hoffman, Matthew, Shahriari, Bobak, and de Freitas, Nando. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In AISTATS, pp. 365–374, 2014.

Huang, Xiaolin, Shi, Lei, and Suykens, Johan AK. Ramp loss linear programming support vector machine. The Journal of Machine Learning Research, 15(1):2185–2211, 2014.

Hutter, Frank, Hoos, H. Holger, and Leyton-Brown, Kevin. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pp. 507–523. Springer, 2011.

Jain, Prateek and Thakurta, Abhradeep. Differentially private learning with kernels. In ICML, pp. 118–126, 2013.

Jain, Prateek and Thakurta, Abhradeep Guha. (Near) dimension independent risk bounds for differentially private learning. In ICML, pp. 476–484, 2014.

Jain, Prateek, Kothari, Pravesh, and Thakurta, Abhradeep. Differentially private online learning. COLT, 2012.

Kifer, Daniel, Smith, Adam, and Thakurta, Abhradeep. Private convex empirical risk minimization and high-dimensional regression. JMLR, 1:41, 2012.

Krause, Andreas, Singh, Ajit, and Guestrin, Carlos. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.

McSherry, Frank and Talwar, Kunal. Mechanism design via differential privacy. In FOCS, pp. 94–103. IEEE, 2007.

Mishra, Nikita and Thakurta, Abhradeep. Private stochastic multi-arm bandits: From theory to practice. In ICML Workshop on Learning, Security, and Privacy, 2014.

Mockus, Jonas, Tiesis, Vytautas, and Zilinskas, Antanas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117-129):2, 1978.

Narayanan, Arvind and Shmatikov, Vitaly. Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy, pp. 111–125. IEEE, 2008.