Bayesian Optimization in a Billion Dimensions via Random Embeddings

arXiv:1301.1942v1 [stat.ML] 9 Jan 2013

Ziyu Wang, [email protected], University of British Columbia
Masrour Zoghi, [email protected], University of Amsterdam
David Matheson, [email protected], University of British Columbia
Frank Hutter, [email protected], University of British Columbia
Nando de Freitas, [email protected], University of British Columbia

Abstract

Bayesian optimization techniques have been successfully applied to robotics, planning, sensor placement, recommendation, advertising, intelligent user interfaces and automatic algorithm configuration. Despite these successes, the approach is restricted to problems of moderate dimension, and several workshops on Bayesian optimization have identified its scaling to high dimensions as one of the holy grails of the field. In this paper, we introduce a novel random embedding idea to attack this problem. The resulting Random EMbedding Bayesian Optimization (REMBO) algorithm is very simple, has important invariance properties, and applies to domains with both categorical and continuous variables. We present a thorough theoretical analysis of REMBO, including regret bounds that only depend on the problem's intrinsic dimensionality. Empirical results confirm that REMBO can effectively solve problems with billions of dimensions, provided the intrinsic dimensionality is low. They also show that REMBO achieves state-of-the-art performance in optimizing the 47 discrete parameters of a popular mixed integer linear programming solver.

1. Introduction

Let f : X → R be a function on a compact subset X ⊆ R^D. We address the following global optimization problem:

x⋆ = argmax_{x ∈ X} f(x).

We are particularly interested in objective functions f that may satisfy one or more of the following criteria: they do not have a closed-form expression, are expensive to evaluate, do not have easily available derivatives, or are non-convex. We treat f as a blackbox function that only allows us to query its function value at arbitrary x ∈ X. To address objectives of this challenging nature, we adopt the Bayesian optimization framework. There is a rich literature on Bayesian optimization, and we refer readers unfamiliar with the topic to more tutorial treatments (Brochu et al., 2009; Jones et al., 1998; Jones, 2001; Lizotte et al., 2011; Močkus, 1994; Osborne et al., 2009) and recent theoretical results (Srinivas et al., 2010; Bull, 2011; de Freitas et al., 2012).

Bayesian optimization has two ingredients. The first ingredient is a prior distribution that captures our beliefs about the behavior of the unknown objective function. The second ingredient is a risk function that describes the deviation from the global optimum. The expected risk is used to decide where to sample next. After observing a few samples of the objective, the prior is updated to produce a more informative posterior distribution over the space of objective functions (see Figure 1 in Brochu et al., 2009). One problem with this maximum expected utility framework is that the risk is typically very hard to compute. This has led to the introduction of many sequential heuristics, known as acquisition functions, including Thompson sampling (Thompson, 1933), probability of improvement (Jones, 2001), expected improvement (Močkus, 1994) and upper confidence bounds (Srinivas et al., 2010). These acquisition functions trade off exploration and exploitation. They are typically optimized by choosing points where the predictive mean is high (exploitation) and where the variance is large (exploration). Since the acquisition functions above have analytical expressions that are easy to evaluate, they are much easier to optimize than the original objective function, using off-the-shelf numerical optimization algorithms. It is also possible to use dynamic portfolios of acquisition functions to improve the efficiency of the method (Hoffman et al., 2011).

The term Bayesian optimization seems to have been coined several decades ago by Jonas Močkus (1982). A popular version of the method has been known as efficient global optimization in the experimental design literature since the 1990s (Jones et al., 1998). Often, the approximation of the objective function is obtained using Gaussian process (GP) priors. For this reason, the technique is also referred to as GP bandits (Srinivas et al., 2010). However, many other approximations of the objective have been proposed, including Parzen estimators (Bergstra et al., 2011), Bayesian parametric models (Wang & de Freitas, 2011), treed GPs (Gramacy et al., 2004) and random forests (Brochu et al., 2009; Hutter, 2009). These may be more suitable than GPs when the number of iterations grows without bound, or when the objective function is believed to have discontinuities. We also note that assumptions on the smoothness of the objective function are often encoded without use of the Bayesian paradigm, while leading to similar algorithms and theoretical guarantees (see, for example, Bubeck et al., 2011, and the references therein).

In recent years, the machine learning community has increasingly used Bayesian optimization (Rasmussen, 2003; Brochu et al., 2007; Martinez-Cantin et al., 2007; Lizotte et al., 2007; Frazier et al., 2009; Azimi et al., 2010; Hamze et al., 2011; Azimi et al., 2011; Bergstra et al., 2011; Gramacy & Polson, 2011; Denil et al., 2012; Mahendran et al., 2012; Azimi et al., 2012; Hennig & Schuler, 2012; Marchant & Ramos, 2012). Despite many success stories, the approach is restricted to problems of moderate dimension, typically up to about 10; see for example the excellent and very recent overview by Snoek et al. (2012). Of course, for a great many problems this is all that is needed. However, to advance the state of the art, we need to scale the methodology to high-dimensional parameter spaces. This is the goal of this paper.

It is difficult to scale Bayesian optimization to high dimensions. To ensure that a global optimum is found, we require good coverage of X, but as the dimensionality increases, the number of evaluations needed to cover X increases exponentially. As a result, there has been little progress on this challenging problem, with a few exceptions. Hutter et al. (2011) used random forest models in Bayesian optimization to achieve state-of-the-art performance in optimizing up to 76 mixed discrete/continuous parameters of algorithms for solving hard combinatorial problems. However, their method is based on frequentist uncertainty estimates that can fail even for the optimization of very simple functions, and it lacks theoretical guarantees.

In the linear bandits case, Carpentier & Munos (2012) recently proposed a compressed sensing strategy to attack problems with a high degree of sparsity. Also recently, Chen et al. (2012) made significant progress by introducing a two-stage strategy for optimization and variable selection of high-dimensional GPs. In the first stage, sequential likelihood ratio tests, with a couple of tuning parameters, are used to select the relevant dimensions. This, however, requires the relevant dimensions to be axis-aligned with an ARD kernel. Chen et al. provide empirical results only for synthetic examples (of up to 400 dimensions), but they provide key theoretical guarantees.

Many researchers have noted that for certain classes of problems most dimensions do not change the objective function significantly; examples include hyper-parameter optimization for neural networks and deep belief networks (Bergstra & Bengio, 2012) and automatic configuration of state-of-the-art algorithms for solving NP-hard problems (Hutter, 2009). That is to say, these problems have "low effective dimensionality". To take advantage of this property, Bergstra & Bengio (2012) proposed to simply use random search for optimization, the rationale being that points sampled uniformly at random in each dimension can densely cover each low-dimensional subspace. As such, random search can exploit low effective dimensionality without knowing which dimensions are important. In this paper, we exploit the same property in a new Bayesian optimization variant based on random embeddings.
Figure 1 illustrates the idea behind random embeddings in a nutshell. Assume we know that a given D = 2 dimensional black-box function f(x₁, x₂) only has d = 1 important dimension, but we do not know which of the two dimensions is the important one. We can then perform optimization in the embedded 1-dimensional subspace defined by x₁ = x₂, since this is guaranteed to include the optimum. As we demonstrate in this paper, using random embeddings this simple idea largely scales to arbitrary D and d, allowing us to perform Bayesian optimization in a low-dimensional space to optimize a high-dimensional function with low intrinsic dimensionality. Importantly, this trick is not restricted to cases with axis-aligned intrinsic dimensions but applies to any d-dimensional linear subspace.

Figure 1. This function in D = 2 dimensions only has d = 1 effective dimension: the vertical axis indicated with the word "important" on the right-hand-side figure. Hence, the 1-dimensional embedding includes the 2-dimensional function's optimizer. It is more efficient to search for the optimum along the 1-dimensional random embedding than in the original 2-dimensional space.

Following an explanation of GP-based Bayesian optimization (Section 2), we introduce the Random EMbedding Bayesian Optimization (REMBO) algorithm and state its theoretical properties, including regret bounds that only depend on the problem's intrinsic dimensionality (Section 3). Our experiments (Section 4) show that REMBO can solve problems of previously untenable high extrinsic dimensions. They also show that REMBO can achieve state-of-the-art performance when automatically tuning the 47 discrete parameters of a popular mixed integer linear programming solver.
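To make the embedding idea concrete, the following minimal NumPy sketch (ours, not the authors' code) optimizes a toy D = 2 function with a single effective dimension through a random 1-dimensional embedding, using a crude grid search over y as a stand-in for Bayesian optimization:

```python
import numpy as np

# Toy illustration of Figure 1 (a sketch, not the paper's code).
# f has D = 2 inputs but only d_e = 1 effective dimension (x2 is ignored).
def f(x):
    return -(x[0] - 0.3) ** 2          # maximized at x1 = 0.3, for any x2

D, d = 2, 1
rng = np.random.default_rng(0)
A = rng.standard_normal((D, d))        # random embedding matrix

def g(y):
    return f(A @ y)                    # optimize f through the embedding

# A cheap stand-in for Bayesian optimization: search g over a 1-D grid of y values.
ys = np.linspace(-5, 5, 2001).reshape(-1, 1)
best_y = max(ys, key=lambda y: g(y))
print("best y:", best_y, "g(y):", g(best_y))   # close to 0 up to grid resolution
```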
2. Bayesian Optimization

As mentioned in the introduction, Bayesian optimization has two ingredients that need to be specified: the prior and the acquisition function. In this work, we adopt GP priors. We review GPs very briefly and refer the interested reader to the book by Rasmussen & Williams (2006). A GP is a distribution over functions specified by its mean function m(·) and covariance k(·, ·). More specifically, given a set of points x_{1:t}, with x_i ∈ R^D, we have

f(x_{1:t}) ∼ N( m(x_{1:t}), K(x_{1:t}, x_{1:t}) ),

where K(x_{1:t}, x_{1:t})_{i,j} = k(x_i, x_j) serves as the covariance matrix. A common choice of k is the squared exponential function (see Definition 4), but many other choices are possible depending on our degree of belief about the smoothness of the objective function.

An advantage of using GPs lies in their analytical tractability. In particular, given observations x_{1:t} with corresponding values f_{1:t}, where f_i = f(x_i), and a new point x⋆, the joint distribution is given by

$$\begin{bmatrix} f_{1:t} \\ f_\star \end{bmatrix} \sim \mathcal{N}\!\left( m(x_{1:t}),\; \begin{bmatrix} K(x_{1:t}, x_{1:t}) & k(x_{1:t}, x^\star) \\ k(x^\star, x_{1:t}) & k(x^\star, x^\star) \end{bmatrix} \right).$$

For simplicity, we assume that m(x_{1:t}) = 0. Using the Sherman-Morrison-Woodbury formula, one can easily arrive at the posterior predictive distribution

f⋆ | D_t, x⋆ ∼ N( µ(x⋆ | D_t), σ(x⋆ | D_t) ),

with data D_t = {x_{1:t}, f_{1:t}}, mean µ(x⋆ | D_t) = k(x⋆, x_{1:t}) K(x_{1:t}, x_{1:t})^{-1} f_{1:t} and variance σ(x⋆ | D_t) = k(x⋆, x⋆) − k(x⋆, x_{1:t}) K(x_{1:t}, x_{1:t})^{-1} k(x_{1:t}, x⋆). That is, we can compute the posterior predictive mean µ(·) and variance σ(·) exactly for any point x⋆.

At each iteration of Bayesian optimization, one has to re-compute the predictive mean and variance. These two quantities are used to construct the second ingredient of Bayesian optimization: the acquisition function. In this work, we report results for the expected improvement acquisition function u(x | D_t) = E( max{0, f_{t+1}(x) − f(x⁺)} | D_t ) (Močkus, 1982; Vazquez & Bect, 2010; Bull, 2011). In this definition, x⁺ = argmax_{x ∈ {x_{1:t}}} f(x) is the element with the best objective value in the first t steps of the optimization process. The next query is x_{t+1} = argmax_{x ∈ X} u(x | D_t). Note that this utility favors the selection of points with high variance (points in regions not well explored) and points with high mean value (points worth exploiting). We also experimented with the UCB acquisition function (Srinivas et al., 2010; de Freitas et al., 2012) and found it to yield similar results. The optimization of the closed-form acquisition function can be carried out by off-the-shelf numerical optimization procedures, such as DIRECT (Jones et al., 1993) and CMA-ES (Hansen & Ostermeier, 2001).

The Bayesian optimization procedure is shown in Algorithm 1.
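The quantities above are straightforward to compute. The following sketch (assuming a fixed squared exponential kernel with length scale ℓ, zero prior mean and noise-free observations; it is an illustration, not the authors' implementation) evaluates the posterior predictive mean and variance and the expected improvement acquisition function:

```python
import numpy as np
from scipy.stats import norm

def sq_exp_kernel(X1, X2, ell=1.0):
    """Squared exponential kernel k_ell (Definition 4), evaluated row-wise."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

def gp_posterior(X, f, Xstar, ell=1.0, jitter=1e-10):
    """Posterior predictive mean and variance with zero prior mean."""
    K = sq_exp_kernel(X, X, ell) + jitter * np.eye(len(X))
    ks = sq_exp_kernel(Xstar, X, ell)
    Kinv_f = np.linalg.solve(K, f)
    mu = ks @ Kinv_f
    var = 1.0 - np.einsum('ij,ji->i', ks, np.linalg.solve(K, ks.T))
    return mu, np.maximum(var, 1e-12)

def expected_improvement(X, f, Xstar, ell=1.0):
    """EI acquisition u(x|D_t) = E[max(0, f(x) - f(x+)) | D_t]."""
    mu, var = gp_posterior(X, f, Xstar, ell)
    sigma = np.sqrt(var)
    best = f.max()                         # f(x+): incumbent value so far
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
```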
Algorithm 1 Bayesian Optimization
1: for t = 1, 2, ... do
2:   Find x_{t+1} ∈ R^D by optimizing the acquisition function u: x_{t+1} = argmax_{x ∈ X} u(x | D_t).
3:   Augment the data D_{t+1} = {D_t, (x_{t+1}, f(x_{t+1}))}.
4: end for

3. Random Embedding for Bayesian Optimization

Before introducing our new algorithm and its theoretical properties, we need to define what we mean by effective dimensionality formally.

Definition 1. A function f : R^D → R is said to have effective dimensionality d_e, with d_e < D, if there exists a linear subspace T of dimension d_e such that for all x_⊤ ∈ T ⊂ R^D and x_⊥ ∈ T⊥ ⊂ R^D, we have f(x) = f(x_⊤ + x_⊥) = f(x_⊤), where T⊥ denotes the orthogonal complement of T. We call T the effective subspace of f and T⊥ the constant subspace.

This definition simply states that the function does not change along the coordinates x_⊥, and this is why we refer to T⊥ as the constant subspace. Given this definition, the following theorem shows that problems of low effective dimensionality can be solved via random embedding.

Theorem 2. Assume we are given a function f : R^D → R with effective dimensionality d_e and a random matrix A ∈ R^{D×d} with independent entries sampled according to N(0, 1) and d ≥ d_e. Then, with probability 1, for any x ∈ R^D, there exists a y ∈ R^d such that f(x) = f(Ay).

Proof. Please refer to the appendix.

Theorem 2 says that given any x ∈ R^D and a random matrix A ∈ R^{D×d}, with probability 1, there is a point y ∈ R^d such that f(x) = f(Ay). This implies that for any optimizer x⋆ ∈ R^D, there is a point y⋆ ∈ R^d with f(x⋆) = f(Ay⋆). Therefore, instead of optimizing in the high-dimensional space, we can optimize the function g(y) = f(Ay) in the lower-dimensional space. This observation gives rise to our new Random EMbedding Bayesian Optimization (REMBO) algorithm (see Algorithm 2). REMBO first draws a random embedding (given by A) and then performs Bayesian optimization in this embedded space.

Algorithm 2 REMBO: Bayesian Optimization with Random Embedding
1: Generate a random matrix A ∈ R^{D×d}.
2: Choose the bounded region set Y ⊂ R^d.
3: for t = 1, 2, ... do
4:   Find y_{t+1} ∈ R^d by optimizing the acquisition function u: y_{t+1} = argmax_{y ∈ Y} u(y | D_t).
5:   Augment the data D_{t+1} = {D_t, (y_{t+1}, f(Ay_{t+1}))}.
6:   Update the kernel hyper-parameters.
7: end for

In practice, we do not typically perform optimization across all of R^D, but rather across a compact subset X ⊂ R^D (typically a box). When REMBO selects a point y such that Ay is outside the box X, it projects Ay onto X before evaluating f. That is, g(y) = f(p_X(Ay)), where p_X : R^D → R^D is the standard projection operator for our box constraint: p_X(y) = argmin_{z ∈ X} ‖z − y‖₂; see Figure 2.

Figure 2. Embedding from d = 1 into D = 2. The box illustrates the 2D constrained space X, while the thicker red line illustrates the 1D constrained space Y. Note that if Ay is outside X, it is projected onto X. The set Y must be chosen large enough so that the projection of its image, AY, onto the effective subspace (vertical axis in this diagram) covers the vertical side of the box.

We still need to describe how REMBO chooses the bounded region Y ⊂ R^d, inside which it performs Bayesian optimization. This is important because REMBO's effectiveness depends on the size of Y. Locating the optimum within Y is easier if Y is small, but if we set Y too small it may not actually contain the global optimizer. In the following theorem, we show that we can choose Y in a way that only depends on the effective dimensionality d_e such that the optimizer of the original problem is contained in the low-dimensional space with constant probability.

Theorem 3. Suppose we want to optimize a function f : R^D → R with effective dimension d_e ≤ d subject to the box constraint X ⊂ R^D, where X is centered around 0. Let us denote one of the optimizers by x⋆. Suppose further that the effective subspace T of f is such that T is the span of d_e basis vectors. Let x⋆_⊤ ∈ T ∩ X be an optimizer of f inside T. If A is a D×d random matrix with independent standard Gaussian entries, there exists an optimizer y⋆ ∈ R^d such that f(Ay⋆) = f(x⋆_⊤) and ‖y⋆‖₂ ≤ (√d_e / ε) ‖x⋆_⊤‖₂ with probability at least 1 − ε.

Proof. Please refer to the appendix.
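A compact sketch of Algorithm 2 follows, assuming the box X = [−1, 1]^D (so that the projection p_X reduces to coordinate-wise clipping), the choice Y = [−√d, √d]^d used later in the paper, and a hypothetical helper propose_next that stands in for maximizing the acquisition function over Y:

```python
import numpy as np

def rembo(f_highdim, D, d, n_iter, propose_next, rng=None):
    """Sketch of Algorithm 2 (REMBO). `propose_next(Y_pts, f_vals, bounds)`
    stands in for maximizing the acquisition function over Y; it is a
    hypothetical helper, not something defined in the paper."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((D, d))              # step 1: draw random embedding
    side = np.sqrt(d)                            # Y = [-sqrt(d), sqrt(d)]^d
    bounds = [(-side, side)] * d
    Y_pts, f_vals = [], []
    for t in range(n_iter):                      # step 3: BO loop in R^d
        y = propose_next(np.array(Y_pts), np.array(f_vals), bounds)
        x = np.clip(A @ y, -1.0, 1.0)            # p_X(Ay): clipping = box projection
        Y_pts.append(y)
        f_vals.append(f_highdim(x))              # evaluate f(p_X(Ay))
    best = int(np.argmax(f_vals))
    return Y_pts[best], f_vals[best], A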
Theorem 3 says that if the set X in the original space is a box constraint, then there exists an optimizer x⋆_⊤ ∈ X that is d_e-sparse such that, with probability at least 1 − ε, ‖y⋆‖₂ ≤ (√d_e / ε) ‖x⋆_⊤‖₂, where f(Ay⋆) = f(x⋆_⊤). If the box constraint is X = [−1, 1]^D (which is always achievable through rescaling), we have with probability at least 1 − ε that

‖y⋆‖₂ ≤ (√d_e / ε) ‖x⋆_⊤‖₂ ≤ (√d_e / ε) √d_e = d_e / ε.

Hence, to choose Y, we just have to make sure that it contains the ball of radius d_e / ε, i.e., B(0, d_e / ε) ⊆ Y. In most practical scenarios, we found that the optimizer does not fall on the boundary, which implies that ‖x⋆_⊤‖₂ < √d_e. Thus setting Y too big may be unnecessarily wasteful; in all our experiments we set Y to be [−√d, √d]^d.

3.1. Increasing the Success Rate of REMBO

Theorem 3 only guarantees that Y contains the optimum with probability at least 1 − ε; with probability δ ≤ ε the optimizer lies outside of Y. There are several ways to guard against this problem. One is to simply run REMBO multiple times with different independently drawn random embeddings. Since the probability of failure with each embedding is δ, the probability of the optimizer not being included in the considered space of k independently drawn embeddings is δ^k. Thus, the failure probability vanishes exponentially quickly in the number of REMBO runs, k. Note also that these independent runs can be trivially parallelized to harness the power of modern multi-core machines and large compute clusters.

Another way of increasing REMBO's success rate is to increase the dimensionality d it uses internally. When d > d_e, with probability 1 we have (d choose d_e) different embeddings of dimensionality d_e. That is, we only need to select d_e columns of A ∈ R^{D×d} to represent the d_e relevant dimensions of x. We can do this by choosing d_e sub-components of the d-dimensional vector y while setting the remaining components to zero. Informally, since we have more embeddings, it is more likely that one of these will include the optimizer. In our experiments, we will assess the merits and shortcomings of these two strategies.

3.2. Choice of Kernel

Since REMBO uses GP-based Bayesian optimization to search in the region Y ⊂ R^d, we need to define a kernel between two points y^(1), y^(2) ∈ Y. We begin with the standard definition of the squared exponential kernel:

Definition 4. Given a length scale ℓ > 0, we define the corresponding squared exponential kernel as

k_ℓ^d(y^(1), y^(2)) = exp( −‖y^(1) − y^(2)‖² / (2ℓ²) ).

It is possible to work with two variants of this kernel. First, we can use k_ℓ^d(y^(1), y^(2)) as in Definition 4. We refer to this kernel as the low-dimensional kernel. We can also adopt an implicitly defined high-dimensional kernel on X:

k_ℓ^D(y^(1), y^(2)) = exp( −‖p_X(Ay^(1)) − p_X(Ay^(2))‖² / (2ℓ²) ),

where p_X : R^D → R^D is the projection operator for our box constraint as above (see Figure 2).

Note that when using this high-dimensional kernel, we are fitting the GP in D dimensions. However, the search space is no longer the box X, but is instead given by the much smaller subspace {p_X(Ay) : y ∈ Y}. Importantly, in practice it is easier to maximize the acquisition function in this subspace.

Both kernel choices have strengths and weaknesses. The low-dimensional kernel has the benefit of only having to construct a GP in the space of intrinsic dimensionality d, whereas the high-dimensional kernel has to construct the GP in a space of extrinsic dimensionality D. However, the low-dimensional kernel may waste time exploring in the region of the embedding outside of X (see Figure 2), because two points far apart in this region may be projected via p_X to nearby points on the boundary of X. The high-dimensional kernel is not affected by this problem because the search is conducted directly on {p_X(Ay) : y ∈ Y}.
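For concreteness, here is a sketch of the two kernel variants, assuming the box X = [−1, 1]^D so that p_X is coordinate-wise clipping:

```python
import numpy as np

def k_low(y1, y2, ell):
    """Low-dimensional kernel k_ell^d: squared exponential directly on y in Y."""
    return np.exp(-np.sum((y1 - y2) ** 2) / (2 * ell ** 2))

def k_high(y1, y2, ell, A, lo=-1.0, hi=1.0):
    """High-dimensional kernel k_ell^D: squared exponential on the projected
    points p_X(Ay); for a box X = [lo, hi]^D the projection is clipping."""
    x1 = np.clip(A @ y1, lo, hi)
    x2 = np.clip(A @ y2, lo, hi)
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * ell ** 2))
```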
The choice of kernel also depends on whether our variables are continuous, integer or categorical. The categorical case is important because we often encounter optimization problems that contain discrete choices. We define our kernel for categorical variables as

k_λ^D(y^(1), y^(2)) = exp( −(λ/2) h( s(Ay^(1)), s(Ay^(2)) )² ),

where y^(1), y^(2) ∈ Y ⊂ R^d and h defines the distance between two vectors. The function s maps continuous vectors to discrete vectors. In more detail, s(x) first projects x to [−1, 1]^D to generate x̄. For each dimension x̄_i of x̄, s then maps x̄_i to the corresponding discrete parameter by scaling and rounding. In our experiments, following Hutter (2009), we defined h(x^(1), x^(2)) = |{i : x_i^(1) ≠ x_i^(2)}| so as not to impose an artificial ordering between the values of categorical parameters. In essence, we measure the distance between two points in the low-dimensional space as the Hamming distance between their mappings in the high-dimensional space.
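A sketch of this categorical kernel follows. The exact scaling-and-rounding rule used inside s is not spelled out in the text, so the version below is one plausible choice, in which n_values[i] is assumed to give the number of admissible values of the i-th discrete parameter:

```python
import numpy as np

def s_map(x, n_values):
    """Map a continuous D-vector to discrete parameter indices: clip each
    coordinate to [-1, 1], then scale and round to one of n_values[i] choices
    (one plausible realization of the s(.) described in the text)."""
    xbar = np.clip(x, -1.0, 1.0)
    frac = (xbar + 1.0) / 2.0                        # now in [0, 1]
    return np.minimum((frac * np.asarray(n_values)).astype(int),
                      np.asarray(n_values) - 1)

def k_categorical(y1, y2, A, n_values, lam=1.0):
    """k_lambda^D(y1, y2) = exp(-(lambda/2) * h(s(Ay1), s(Ay2))^2), where h is
    the Hamming distance between the two discrete configurations."""
    c1, c2 = s_map(A @ y1, n_values), s_map(A @ y2, n_values)
    h = np.count_nonzero(c1 != c2)
    return np.exp(-0.5 * lam * h ** 2)
```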
3.3. Regret Bounds

When using the high-dimensional kernel k_ℓ^D on {p_X(Ay) : y ∈ Y} ⊂ X, we could easily apply previous theoretical results (Srinivas et al., 2010; Bull, 2011; de Freitas et al., 2012). However, this is not satisfying, since the rates of convergence would still depend on D. If the low-dimensional embedding captures the optimizer, and since the search is conducted in {p_X(Ay) : y ∈ Y} instead of X, we should expect faster rates of convergence that only depend on the size of the embedding's dimensionality. The rest of this section shows that it is indeed possible to obtain rates of convergence that only depend on the embedding's dimensionality.

We begin our mathematical treatment with the definitions of simple regret and the skew squared exponential (SSE) kernel.

Definition 5. Given a function f : X → R and a sequence of points {x_t}_{t=1}^∞ ⊆ X, the simple regret on the set X at time T is defined to be r_X(T) = sup_X f − max_{t=1,...,T} f(x_t).

Definition 6. Given a symmetric, positive-definite matrix Λ, we define the corresponding skew squared exponential kernel using the formula

k_Λ(x^(1), x^(2)) = exp( −(x^(1) − x^(2))^⊤ Λ^{−1} (x^(1) − x^(2)) ).

Given Λ, ℓ (as in the squared exponential kernel k_ℓ^d) and X ⊆ R^d, we denote the Reproducing Kernel Hilbert Spaces (RKHSs) corresponding to k_Λ and k_ℓ by H_Λ(X) and H_ℓ(X), respectively (Steinwart & Christmann, 2008, Definition 4.18). Moreover, given an arbitrary kernel k, we will denote its RKHS by H_k.

Our main result below shows that the simple regret vanishes with rate O(t^{−1/d}) with high probability when we use the squared exponential kernel. Note that we only make the assumption that the cost function restricted to T is governed by a skew squared exponential kernel, a much weaker assumption than the standard assumption that the cost function is governed by an axis-aligned kernel in D dimensions (see, e.g., Bull, 2011).

Theorem 7. Let X ⊂ R^D be a compact subset with non-empty interior that is convex and contains the origin, and let f : X → R be a function with effective dimension d. Suppose that the restriction of f to its effective subspace T, denoted f|_T, is an element of the RKHS H_Λ(R^d) with Λ symmetric and positive definite and also satisfying 0 < r² < λ_min(Λ) ≤ λ_max(Λ) < R² for constants r and R. Let A be a D×d matrix whose elements are drawn from the normal distribution (1/√d) N(0, 1). Given any ε > 0, we can choose a length scale ℓ = ℓ(ε) such that running Expected Improvement with kernel k_ℓ on the restriction of f to the image of A inside X would have simple regret in O(t^{−1/d}) with probability 1 − ε.

This theorem does not follow directly from the results of Bull (2011), since the kernel is not aligned with the axes, both in the high-dimensional space and the lower-dimensional embedding. Thus, even given the true hyper-parameter, the aforementioned paper does not entail a convergence result.

Please refer to the appendix for the proof of this theorem. The general idea of the proof is as follows. If we have a squared exponential kernel k_ℓ with a smaller length scale than a given kernel k, then an element f of the RKHS of k is also an element of the RKHS of k_ℓ. So, when running expected improvement, one can safely use k_ℓ instead of k as the kernel and still get a regret bound. Most of the proof is dedicated to finding a length scale ℓ that fits "underneath" our kernel, so that we can replace our kernel with k_ℓ, to which we can apply the results of Bull (2011).

The above theorem requires the embedded dimension and the effective dimension to coincide, but due to Proposition 1 in (de Freitas et al., 2012), we strongly believe that the analysis in (Bull, 2011) can be modified to allow for situations in which some of the ARD parameters are zero, which is precisely what is preventing us from extending this result to the case where d > d_e.

3.4. Hyper-parameter Optimization

For Bayesian optimization (and therefore REMBO), it is difficult to manually estimate the true length scale hyper-parameter of a problem at hand. To avoid any manual steps and to achieve robust performance across diverse sets of objective functions, in this paper we adopted an adaptive hyper-parameter optimization scheme. The length scale of GPs is often set by maximizing the marginal likelihood (Rasmussen & Williams, 2006; Jones et al., 1998). However, as demonstrated by Bull (2011), this approach, when implemented naively, may not guarantee convergence. Here, we propose to optimize the length scale parameter ℓ by maximizing the marginal likelihood subject to an upper bound U, which is decreased when the algorithm starts exploiting too much. Full details are given in Algorithm 3.

Algorithm 3 Bayesian Optimization with Hyper-parameter Optimization
input: Threshold t_σ.
input: Upper and lower bounds U > L > 0 for the hyper-parameter.
input: Initial length scale hyper-parameter ℓ ∈ [L, U].
1: Initialize C = 0.
2: for t = 1, 2, ... do
3:   Find x_{t+1} by optimizing the acquisition function u: x_{t+1} = argmax_{x ∈ X} u(x | D_t).
4:   if √σ(x_{t+1}) < t_σ then
5:     C = C + 1
6:   else
7:     C = 0
8:   end if
9:   Augment the data D_{t+1} = {D_t, (x_{t+1}, f(x_{t+1}))}.
10:  if t mod 20 = 0 or C = 5 then
11:    if C = 5 then
12:      U = max{0.9ℓ, L}
13:      C = 0
14:    end if
15:    Learn the hyper-parameter by optimizing the log marginal likelihood using DIRECT: ℓ = argmax_{l ∈ [L,U]} log p(f_{1:t+1} | x_{1:t+1}, l).
16:  end if
17: end for

We say that the algorithm is exploiting when the standard deviation at the maximizer of the acquisition function, √σ(x_{t+1}), is less than some threshold t_σ for 5 consecutive iterations. Intuitively, this means that the algorithm did not emphasize exploration (searching in new parts of the space, where the predictive uncertainty is high) for 5 consecutive iterations. When this criterion is met, the algorithm decreases its upper bound U multiplicatively and re-optimizes the hyper-parameter subject to the new bound. Even when the criterion is not met, the hyper-parameter is re-optimized every 20 iterations.

The motivation of this algorithm is to rather err on the side of having too small a length scale.¹ Given a squared exponential kernel k_ℓ with a smaller length scale than another kernel k, one can show that any function f in the RKHS characterized by k is also an element of the RKHS characterized by k_ℓ. So, when running expected improvement, one can safely use k_ℓ instead of k as the kernel of the GP and still preserve convergence (Bull, 2011). We argue that (with a small enough lower bound L) the algorithm would eventually reduce the upper bound enough to allow convergence. Also, the algorithm would not explore indefinitely, as L is required to be positive. In our experiments, we set the initial constraint [L, U] to be [0.01, 50] and set t_σ = 0.002.

¹A similar idea is exploited in the proof of Theorem 7.
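A sketch of Algorithm 3 is given below; propose_and_sd (returning the acquisition maximizer together with its predictive standard deviation) and fit_lengthscale (maximizing the log marginal likelihood over ℓ ∈ [L, U], e.g. with DIRECT) are hypothetical helpers, and starting ℓ at U is an arbitrary choice for illustration:

```python
import numpy as np

def adaptive_lengthscale_bo(f, propose_and_sd, fit_lengthscale, n_iter,
                            L=0.01, U=50.0, t_sigma=0.002):
    """Sketch of Algorithm 3. `propose_and_sd(data, ell)` should return the
    acquisition maximizer x_{t+1} and the predictive s.d. at that point;
    `fit_lengthscale(data, L, U)` should maximize the log marginal likelihood
    over ell in [L, U]. Both are hypothetical helpers."""
    ell, C, data = U, 0, []
    for t in range(1, n_iter + 1):
        x_next, sd = propose_and_sd(data, ell)
        C = C + 1 if sd < t_sigma else 0        # count consecutive "exploiting" steps
        data.append((x_next, f(x_next)))
        if t % 20 == 0 or C == 5:
            if C == 5:
                U = max(0.9 * ell, L)           # shrink the upper bound
                C = 0
            ell = fit_lengthscale(data, L, U)   # re-fit ell subject to [L, U]
    return data
```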
4. Experiments

We now study REMBO empirically. We first use synthetic functions of small intrinsic dimensionality d_e = 2 but extrinsic dimension D up to 1 billion to demonstrate REMBO's independence of D. Then, we apply REMBO to automatically optimize the 47 parameters of a widely used mixed integer linear programming solver and demonstrate that it achieves state-of-the-art performance. However, we also warn against the blind application of REMBO. To illustrate this, in the appendix we study REMBO's performance for tuning the 14 parameters of a random forest body part classifier used by Kinect. In this application, all of the D = 14 parameters appear to be important, and while REMBO (based on d = 3) finds reasonable solutions (better than random search and comparable to what domain experts achieve), standard Bayesian optimization can outperform REMBO (and the domain experts) in such moderate-dimensional spaces.

4.1. Experimental Setup

For all our experiments, we used a single robust version of REMBO that automatically sets its GP's length scale parameter as described in Section 3.4. For each optimization of the acquisition function, this version runs both DIRECT (Jones et al., 1993) and CMA-ES (Hansen & Ostermeier, 2001) and uses the result of the better of the two. The code for REMBO, as well as all data used in our experiments, will be made publicly available in the near future.

Some of our experiments required substantial computational resources, with the computational expense of each experiment depending mostly on the cost of evaluating the respective blackbox function. While the synthetic experiments in Section 4.2 only required minutes for each run of each method, optimizing the mixed integer programming solver in Section 4.3 required 4-5 hours per run, and optimizing the random forest classifier in Appendix D required 4-5 days per run. In total, we used over half a year of CPU time for the experiments in this paper.

In each experiment, we study the effect of our two methods for increasing REMBO's success rate (see Section 3.1) by running different numbers of independent REMBO runs with different settings of its internal dimensionality d.
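As an illustration of the interleaving strategy (a sketch building on the hypothetical rembo helper above, not the authors' code), a fixed evaluation budget can be split across k runs with independently drawn embeddings:

```python
# Sketch: k interleaved REMBO runs sharing one evaluation budget (e.g. 500).
# Each run draws its own random embedding A, so the chance that at least one
# embedding contains the optimizer grows as 1 - delta^k (Section 3.1).
# `rembo` and `propose_next` refer to the earlier sketch (hypothetical helpers).
def interleaved_rembo(f, D, d, total_budget, k, propose_next):
    per_run = total_budget // k
    results = [rembo(f, D, d, per_run, propose_next, rng=i) for i in range(k)]
    return max(results, key=lambda r: r[1])     # best (y, value, A) triple
```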
4.2. Bayesian Optimization in a Billion Dimensions

In this section, we add empirical evidence to our theoretical finding from Section 3 that REMBO's performance is independent of the extrinsic dimensionality D when using the low-dimensional kernel k_ℓ^d(y^(1), y^(2)) from Definition 4. Specifically, using synthetic data, we show that when using that kernel REMBO has no problem scaling up to as many as 1 billion dimensions. We also study REMBO's invariance properties and empirically evaluate our two strategies for increasing its success probability.

The experiments in this section employ a standard d_e = 2-dimensional benchmark function for Bayesian optimization, embedded in a D-dimensional space. That is, we add D − 2 additional dimensions which do not affect the function at all. More precisely, the function whose optimum we seek is f(x_{1:D}) = b(x_i, x_j), where b is the Branin function (see Lizotte, 2008, for its exact formula), and where dimensions i and j are selected once using a random permutation of 1, ..., D. To measure the performance of each optimization method, we used the optimality gap: the difference between the best function value it found and the optimal function value.

We first study the effectiveness of the two techniques for increasing REMBO's success probability that we proposed in Section 3.1. To empirically study the effectiveness of using internal dimensionalities d > d_e, and of interleaving multiple independent runs, k, we ran REMBO with all combinations of three different values of d and k. The results in Table 1 demonstrate that both techniques helped improve REMBO's performance, with interleaved runs being the more effective strategy. We note that in 13/50 REMBO runs, the global optimum was indeed not contained in the box Y that REMBO searched with d = 2; this is the reason for the poor mean performance of REMBO with d = 2 and k = 1. However, the remaining 37 runs performed very well, and REMBO thus performed well when using multiple interleaved runs: with a failure rate of 13/50 = 0.26 per independent run, the failure rate using k = 4 interleaved runs is only 0.26⁴ ≈ 0.005. One could easily achieve an arbitrarily small failure rate by using many independent parallel runs. Here we evaluated all REMBO versions using a fixed budget of 500 function evaluations that is spread across multiple interleaved runs; for example, when using k = 4 interleaved REMBO runs, each of them was only allowed 125 function evaluations. The results show that performing multiple independent runs nevertheless substantially improved REMBO's performance. Using a larger d was also effective in increasing the probability of the optimizer falling into REMBO's box Y, but at the same time it slowed down REMBO's convergence (such that interleaving several short runs lost its effectiveness). We conclude that using several interleaved runs of REMBO with small d ≥ d_e performs best.

Table 1. Optimality gap for the d_e = 2-dimensional Branin function embedded in D = 25 dimensions, for REMBO variants using a total of 500 function evaluations. The variants differed in the internal dimensionality d and in the number of interleaved runs k (each such run was only allowed 500/k function evaluations). We show means and standard deviations of the optimality gap achieved after 500 function evaluations.

  k      d = 2              d = 4              d = 6
  10     0.0022 ± 0.0035    0.1553 ± 0.1601    0.4865 ± 0.4769
  5      0.0004 ± 0.0011    0.0908 ± 0.1252    0.2586 ± 0.3702
  4      0.0001 ± 0.0003    0.0654 ± 0.0877    0.3379 ± 0.3170
  2      0.1514 ± 0.9154    0.0309 ± 0.0687    0.1643 ± 0.1877
  1      0.7406 ± 1.8996    0.0143 ± 0.0406    0.1137 ± 0.1202

Next, we compared REMBO to standard Bayesian optimization (BO) and to random search, for an extrinsic dimensionality of D = 25. Standard BO is well known to perform well in low dimensions, but to degrade above a tipping point of about 15-20 dimensions. Our results for D = 25 (see Figure 3, left) confirm that BO performed rather poorly just above this critical dimensionality (merely tying with random search). REMBO, on the other hand, still performed very well in 25 dimensions.

Since REMBO is independent of the extrinsic dimensionality D as long as the intrinsic dimensionality d_e is small, it performed just as well in D = 1,000,000,000 dimensions (see Figure 3, middle). To the best of our knowledge, the only other existing method that can be run in such high dimensionality is random search.
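The synthetic benchmark described above is easy to reproduce. The sketch below embeds the standard Branin function into D dimensions via a random permutation of coordinates and, optionally, an orthogonal rotation (used in the invariance experiment below); the domain handling and scaling are left to the caller and are an assumption of this sketch:

```python
import numpy as np

def branin(x1, x2):
    """Standard Branin function (usually evaluated on [-5, 10] x [0, 15])."""
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s

BRANIN_MIN = 0.397887                      # known optimal value of the Branin function

def make_embedded_branin(D, rotate=False, seed=0):
    """f(x_{1:D}) = b(x_i, x_j): only two randomly chosen coordinates matter.
    With rotate=True the effective subspace is no longer axis-aligned (f(Rx))."""
    rng = np.random.default_rng(seed)
    i, j = rng.permutation(D)[:2]
    R = np.linalg.qr(rng.standard_normal((D, D)))[0] if rotate else np.eye(D)

    def f(x):                               # x assumed scaled to the Branin domain
        z = R @ x
        return branin(z[i], z[j])

    def optimality_gap(best_value_found):
        return best_value_found - BRANIN_MIN
    return f, optimality_gap
```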
Finally, one important advantage of REMBO is that, in contrast to the approach of Chen et al. (2012), it does not require the effective dimension to be coordinate aligned. To demonstrate this fact empirically, we rotated the embedded Branin function by an orthogonal rotation matrix R ∈ R^{D×D}. That is, we replaced f(x) by f(Rx). Figure 3 (right) shows that REMBO's performance is not affected by this rotation.

Figure 3. Comparison of random search (RANDOM), standard Bayesian optimization (BO), and REMBO. Left: D = 25 extrinsic dimensions; Middle: D = 1,000,000,000 extrinsic dimensions; Right: D = 25, with a rotated objective function. For each method, we plot means and 1/4 standard deviation confidence intervals of the optimality gap across 50 trials.

4.3. Automatic Configuration of a Mixed Integer Linear Programming Solver

State-of-the-art algorithms for solving hard computational problems tend to parameterize several design choices in order to allow a customization of the algorithm to new problem domains. Automated methods for algorithm configuration have recently demonstrated that substantial performance gains of state-of-the-art algorithms can be achieved in a fully automated fashion (Močkus et al., 1999; Hutter et al., 2010; Bergstra et al., 2011; Wang & de Freitas, 2011). These successes have led to a paradigm shift in algorithm development towards the active design of highly parameterized frameworks that can be automatically customized to particular problem domains using optimization (Hoos, 2012; Bergstra et al., 2012).

It has long been suspected that the resulting algorithm configuration problems have low dimensionality (Hutter, 2009). Here, we demonstrate that REMBO can exploit this low dimensionality even in the discrete spaces typically encountered in algorithm configuration. We use a configuration problem obtained from Hutter et al. (2010), aiming to configure the 40 binary and 7 categorical parameters of lpsolve², a popular mixed integer programming (MIP) solver that has been downloaded over 40,000 times in the last year. The objective is to minimize the optimality gap lpsolve can obtain in a time limit of five seconds for a MIP encoding of a wildlife corridor problem from computational sustainability (Gomes et al., 2008). Algorithm configuration usually aims to improve performance for a representative set of problem instances, and effective methods need to solve two orthogonal problems: searching the parameter space effectively and deciding how many instances to use in each evaluation (to trade off computational overhead and overfitting). Our contribution is for the first of these problems; to focus on how effectively the different methods search the parameter space, we only consider configuration on a single problem instance.

²http://lpsolve.sourceforge.net/

Due to the discrete nature of this optimization problem, we could only apply REMBO using the high-dimensional kernel for categorical variables k_λ^D(y^(1), y^(2)) described in Section 3.2. While we have not proven any theoretical guarantees for discrete optimization problems, REMBO appears to effectively exploit the low effective dimensionality of at least this particular optimization problem.

As baselines for judging our performance in configuring lpsolve, we used the configuration procedures ParamILS (Hutter et al., 2009) and SMAC (Hutter et al., 2011). ParamILS and SMAC have been specifically designed for the configuration of algorithms with many discrete parameters and yield state-of-the-art performance for this task.

As Figure 4 (top) shows, ParamILS and SMAC indeed outperformed random search and BO.
However, remarkably, our vanilla REMBO method performed even slightly better. While the figure only shows REMBO with d = 5 to avoid clutter, we by no means optimized this parameter; the only other value we tried was d = 3, which resulted in indistinguishable performance.

As in the synthetic experiment, REMBO's performance could be further improved by using multiple interleaved runs. However, as shown by Hutter et al. (2012), multiple independent runs can also improve the performance of SMAC and especially ParamILS. Thus, to be fair, we re-evaluated all approaches using interleaved runs. Figure 4 (bottom) shows that when using k = 4 interleaved runs of 500 evaluations each, REMBO and ParamILS performed best, with a slight advantage for REMBO early on in the search.

Figure 4. Performance of various methods for configuration of lpsolve; we show the optimality gap lpsolve achieved with the configurations found by the various methods (lower is better). Top: a single run of each method; Bottom: performance with k = 4 interleaved runs.

5. Conclusion

We have demonstrated that it is possible to use random embeddings in Bayesian optimization to optimize functions of extremely high extrinsic dimensionality D, provided that they have low intrinsic dimensionality d_e. Our resulting REMBO algorithm is coordinate independent and has provable regret bounds that are independent of the extrinsic dimensionality D. Moreover, it only requires a simple modification of the original Bayesian optimization algorithm, namely multiplication by a random matrix. We confirmed REMBO's independence of D empirically by optimizing low-dimensional functions embedded in previously untenable extrinsic dimensionalities of up to 1 billion. Finally, we demonstrated that REMBO achieves state-of-the-art performance for optimizing the 47 discrete parameters of a popular mixed integer programming solver, thereby providing further evidence for the observation (already put forward by Bergstra, Hutter and colleagues) that, for many problems of great practical interest, the number of important dimensions indeed appears to be much lower than their extrinsic dimensionality.

References
Azimi, J., Fern, A., and Fern, X. Batch Bayesian optimization via simulation matching. In NIPS, 2010.

Azimi, J., Fern, A., and Fern, X. Z. Budgeted optimization with concurrent stochastic-duration experiments. In NIPS, 2011.

Azimi, J., Jalali, A., and Fern, X. Z. Hybrid batch Bayesian optimization. In ICML, 2012.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. JMLR, 13:281–305, 2012.

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In NIPS, pp. 2546–2554, 2011.

Bergstra, J., Yamins, D., and Cox, D. D. Making a science of model search. CoRR, abs/1209.5111, 2012.

Brochu, E., de Freitas, N., and Ghosh, A. Active preference learning with discrete choice data. In NIPS, pp. 409–416, 2007.

Brochu, E., Cora, V. M., and de Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report UBC TR-2009-23 and arXiv:1012.2599v1, Dept. of Computer Science, University of British Columbia, 2009.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. X-armed bandits. JMLR, 12:1655–1695, 2011.

Bull, A. D. Convergence rates of efficient global optimization algorithms. JMLR, 12:2879–2904, 2011.

Carpentier, A. and Munos, R. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In AISTATS, pp. 190–198, 2012.

Chen, B., Castro, R. M., and Krause, A. Joint optimization and variable selection of high-dimensional Gaussian processes. In ICML, 2012.

de Freitas, N., Smola, A., and Zoghi, M. Exponential regret bounds for Gaussian process bandits with deterministic observations. In ICML, 2012.

Denil, M., Bazzani, L., Larochelle, H., and de Freitas, N. Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8):2151–2184, 2012.

Frazier, P., Powell, W., and Dayanik, S. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.

Gomes, C. P., van Hoeve, W. J., and Sabharwal, A. Connections in networks: A hybrid approach. In CPAIOR, volume 5015, pp. 303–307, 2008.

Gramacy, R. B., Lee, H. K. H., and Macready, W. G. Parameter space exploration with Gaussian process trees. In ICML, pp. 45–52, 2004.

Gramacy, R. B. and Polson, N. G. Particle learning of Gaussian process models for sequential design and optimization. Journal of Computational and Graphical Statistics, 20(1):102–118, 2011.

Hamze, F., Wang, Z., and de Freitas, N. Self-avoiding