On the Reducibility of Submodular Functions

Jincheng Mei
Department of Computing Science
University of Alberta
Edmonton, AB, Canada, T6G 2E8

Hao Zhang
The Robotics Institute, School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA

Bao-Liang Lu*
Key Lab of SMEC for IICE, Dept. of Computer Sci. and Eng.
Shanghai Jiao Tong University
Shanghai 200240, China

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 41. Copyright 2016 by the authors.

Abstract

The scalability of submodular optimization methods is critical for their usability in practice. In this paper, we study the reducibility of submodular functions, a property that enables us to reduce the solution space of submodular optimization problems without performance loss. We introduce the concept of reducibility using marginal gains. Then we show that by adding perturbation, we can endow irreducible functions with reducibility, based on which we propose the perturbation-reduction optimization framework. Our theoretical analysis proves that given the perturbation scales, the reducibility gain can be computed, and the performance loss has additive upper bounds. We further conduct empirical studies, and the results demonstrate that our proposed framework significantly accelerates existing optimization methods for irreducible submodular functions at the cost of only small performance losses.

1 INTRODUCTION

Submodularity naturally arises in a number of machine learning problems, such as active learning [10], clustering [22], and dictionary selection [5]. The scalability of submodular optimization methods is critical in practice and has therefore drawn much attention from the research community. For example, Iyer et al. [12] propose O(n^2) general optimization methods based on the semidifferential. Wei et al. [25] combine approximation with pruning to accelerate the greedy algorithm for uniform matroid constrained submodular maximization. Mirzasoleiman et al. [21] and Pan et al. [24] use distributed implementations to accelerate existing optimization methods. Other techniques, including stochastic sampling [20] and decomposability assumptions [14], have also been applied to scale up submodular optimization methods. In this paper, we focus on the reducibility of submodular functions, a favourable property that can substantially improve the scalability of submodular optimization methods. Reducibility directly shrinks the solution space of a submodular optimization problem while preserving all the optima in the reduced space, thereby enabling us to accelerate the optimization process without incurring performance loss.

Recent research shows that for some submodular functions, by evaluating marginal gains, reduction can be applied to unconstrained maximization [8], unconstrained minimization [6, 12], and uniform matroid constrained maximization [25]. By leveraging reducibility, a variety of methods have been developed to scale up the optimization of reducible submodular functions [6, 8, 12, 19]. While existing works mainly focus on reducible functions, there exist a number of irreducible submodular functions widely applied in practice, for which existing methods can only provide vacuous reduction.

In this paper, we investigate whether irreducible functions can also exploit this favorable property. We first introduce the concept of reducibility using marginal gains over the endpoint sets of a given lattice. Then, for irreducible functions, we transform them into reducible functions by adding random noise to perturb the marginal gains, after which we perform lattice reduction for the perturbed functions and solve the original functions on the reduced lattice. Theoretical results show that given the perturbation scales, the reducibility gain is lower bounded, and the performance loss has additive upper bounds. The empirical results demonstrate that there exist useful perturbation scale intervals in practice, which enable us to significantly accelerate existing optimization methods with small performance losses.

In summary, this paper makes the following contributions. Firstly, we introduce the concept of reducibility and propose the perturbation-reduction framework. Secondly, we theoretically analyze the proposed method. In particular, for the reducibility gain, we prove a lower bound in terms of the perturbation scale. For the performance loss, we prove both deterministic and probabilistic upper bounds. The deterministic bound clarifies the relationship between the reducibility gain and the performance loss, while the probabilistic bound explains the experimental results. Finally, we empirically show that the proposed method is applicable to a variety of commonly used irreducible submodular functions.

The paper is organized as follows. In Section 2, we introduce the definitions and the existing reduction algorithms. In Section 3, we propose our perturbation based method. Theoretical analysis and empirical results are presented in Section 4 and Section 5, respectively. In Section 6, we review related work. Section 7 concludes the paper.
2 REDUCIBILITY

2.1 Notations and Definitions

Given a finite set N = {1, 2, ..., n} and a set function f: 2^N → R, f is said to be submodular [6] if ∀X, Y ⊆ N, f(X) + f(Y) ≥ f(X ∩ Y) + f(X ∪ Y). An equivalent definition of submodularity is the diminishing return property: ∀A ⊆ B ⊆ N, ∀i ∈ N \ B, f(i|A) ≥ f(i|B), where f(i|A) ≜ f(A + i) − f(A) is called the marginal gain of element i with respect to set A. To simplify the notation, we denote A ∪ {i} by A + i, and A \ {i} by A − i. Given A ⊆ B ⊆ N, the (set interval) lattice is defined as [A, B] ≜ {S | A ⊆ S ⊆ B}.

Suppose N = {1, 2, ..., n} and f: 2^N → R is a submodular function. In this paper, we focus on unconstrained submodular optimization problems,

Problem 1: min_{X⊆N} f(X),    Problem 2: max_{X⊆N} f(X).

Problem 1 can be solved exactly in polynomial time [23], while Problem 2 is NP-hard since some of its special cases (e.g., Max Cut) are NP-hard.

For convenience of presentation, we will use P1 and P2 to refer to Problem 1 and Problem 2, respectively. Similarly, for the following algorithms, we will use A1 and A2 to refer to Algorithm 1 and Algorithm 2. The same convention applies to P3, P4, A3 and A4.

Define X_min ≜ {X_* ⊆ N | f(X_*) ≤ f(X), ∀X ⊆ N} as the optima set of P1. Similarly, define X_max ≜ {X^* ⊆ N | f(X^*) ≥ f(X), ∀X ⊆ N}. Obviously, we have X_min ⊆ [∅, N] and X_max ⊆ [∅, N].

We now give the definition of reducibility. For P1 (P2), we say the objective function f is reducible for minimization (maximization) if ∃ [S, T] ⊂ [∅, N], where [S, T] can be obtained in O(n^p) function evaluations, such that X_min ⊆ [S, T] (X_max ⊆ [S, T]). Note that if we can only find [S, T] in O(2^n) time, the reduction is meaningless, since O(2^n) time is enough to find all the optima. The ratio 1 − |T \ S| / |N| ∈ (0, 1] is called the reduction rate.

2.2 Algorithms

Existing works on reduction for unconstrained submodular optimization can be summarized by the following two algorithms, both of which terminate in O(n^2) time. A brief review of existing works can be found in Section 6.

Algorithm 1 Reduction for Minimization
Input: N, f, X_0 ← ∅, Y_0 ← N, t ← 0.
Output: [X_t, Y_t].
1: Find U_t = {i ∈ Y_t \ X_t | f(i|X_t) < 0}.
2: X_{t+1} ← X_t ∪ U_t.
3: Find D_t = {j ∈ Y_t \ X_t | f(j|Y_t − j) > 0}.
4: Y_{t+1} ← Y_t \ D_t.
5: If X_{t+1} = X_t and Y_{t+1} = Y_t, terminate.
6: t ← t + 1. Go to Step 1.

Algorithm 2 Reduction for Maximization
Input: N, f, X_0 ← ∅, Y_0 ← N, t ← 0.
Output: [X_t, Y_t].
1: Find U_t = {i ∈ Y_t \ X_t | f(i|X_t) < 0}.
2: Y_{t+1} ← Y_t \ U_t.
3: Find D_t = {j ∈ Y_t \ X_t | f(j|Y_t − j) > 0}.
4: X_{t+1} ← X_t ∪ D_t.
5: If X_{t+1} = X_t and Y_{t+1} = Y_t, terminate.
6: t ← t + 1. Go to Step 1.
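As a concrete reading of the two pseudocode listings, here is a minimal Python sketch of the reduction loop; the function names (reduce_lattice, marginal_gain), the mode flag, and the representation of f as a callable on frozensets are our own illustrative choices rather than anything prescribed by the paper.

```python
def marginal_gain(f, i, A):
    """f(i|A) = f(A + i) - f(A)."""
    return f(A | {i}) - f(A)

def reduce_lattice(f, X0, Y0, mode="min"):
    """Sketch of Algorithm 1 (mode='min') and Algorithm 2 (mode='max').

    f      : callable mapping a frozenset to a real value
    X0, Y0 : endpoints of the initial lattice, typically (emptyset, N)
    Returns a lattice [X, Y] that still contains every optimum of the
    corresponding unconstrained problem (Proposition 1 below).
    """
    X, Y = frozenset(X0), frozenset(Y0)
    while True:
        mid = Y - X
        # U_t: elements with negative marginal gain over the lower endpoint.
        U = {i for i in mid if marginal_gain(f, i, X) < 0}
        # D_t: elements with positive marginal gain over the upper endpoint.
        D = {j for j in mid if marginal_gain(f, j, Y - {j}) > 0}
        if mode == "min":                 # Algorithm 1
            X_new, Y_new = X | U, Y - D
        else:                             # Algorithm 2
            X_new, Y_new = X | D, Y - U
        if X_new == X and Y_new == Y:     # Step 5: no change, terminate
            return X_new, Y_new
        X, Y = X_new, Y_new
```

For example, reduce_lattice(f, frozenset(), frozenset(range(n)), mode="min") reproduces A1 started from [∅, N]; when U_0 = D_0 = ∅ the loop stops immediately and the output [∅, N] is the vacuous reduction discussed next.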
Proposition 1. Suppose f: 2^N → R is submodular. After each iteration of A1 (A2), we have X_min ⊆ [X_t, Y_t] (X_max ⊆ [X_t, Y_t]).

We prove Proposition 1 in the supplementary material. According to Proposition 1, if the output of A1 (A2) satisfies [X_t, Y_t] ⊂ [∅, N], then f is reducible.

According to A1 (A2), if U_0 = D_0 = ∅, then we have X_1 = X_0 and Y_1 = Y_0. The algorithm then terminates after the first iteration and the output is [X_0, Y_0] = [∅, N], which provides a vacuous reduction. In this case, we say that f is irreducible with respect to A1 (A2). For convenience, we simply say f is irreducible.

We conclude two points from the above algorithms. First, by the definition of U_t and D_t, the reducibility of f can be determined by the signs of marginal gains with respect to the endpoint sets of the current working lattice. Second, the reducibility of f for minimization and for maximization are actually the same property. Specifically, suppose that in a certain iteration A1 and A2 have the same working lattice [S, T]. According to the algorithms, they also have the same U_t and D_t, which determine whether f is reducible after the current iteration.

Proposition 2. Given a submodular function f: 2^N → R and a lattice [S, T], ∀i ∈ T \ S, define K_i = sgn{f(i|S)} · sgn{f(i|T − i)}. Then f is reducible on [S, T] with respect to A1 (A2) if and only if

K = max_{i ∈ T\S} K_i > 0.    (1)

Proof. Suppose [X_0, Y_0] = [S, T] in A1 (A2). Then f is reducible if and only if the algorithm does not terminate after its first iteration, i.e., U_0 ≠ ∅ or D_0 ≠ ∅. Suppose U_0 ≠ ∅, i.e., ∃i ∈ T \ S, f(i|S) < 0. By submodularity, f(i|T − i) ≤ f(i|S) < 0. We have K ≥ K_i = sgn{f(i|S)} · sgn{f(i|T − i)} = 1 > 0. Suppose D_0 ≠ ∅, i.e., ∃j ∈ T \ S, f(j|T − j) > 0. By submodularity, f(j|S) ≥ f(j|T − j) > 0. We have K ≥ K_j = sgn{f(j|S)} · sgn{f(j|T − j)} = 1 > 0.

According to Proposition 2, the reducibility of f for minimization (maximization) can be determined by (1). Thus we say f is reducible with respect to A1 (A2) if (1) holds. Similarly, without ambiguity in this paper, we simply say f is reducible if (1) holds.

3 PERTURBATION REDUCTION

Given a reducible submodular function, we can use A1 and A2 to provide useful reduction. Unfortunately, there still exist many irreducible submodular functions, some of which are listed in the experimental section. Consider a submodular function f that is irreducible on [S, T]. According to Proposition 2, ∀i ∈ T \ S, we have f(i|S) ≥ 0 and f(i|T − i) ≤ 0.

If we expect A1 and A2 to provide nontrivial reduction, we need to guarantee that (1) holds for some elements without changing the submodularity of the objective function. A natural way is to add random noise r (a modular function, with r(X) ≜ Σ_{i∈X} r(i)) to perturb the original function as follows,

Problem 3: min_{X⊆N} g(X) ≜ min_{X⊆N} f(X) + r(X),
Problem 4: max_{X⊆N} g(X) ≜ max_{X⊆N} f(X) + r(X),

where ∀i ∈ N, r(i) ∈ R is generated uniformly at random in [−t, t] for some t ≥ 0. By appropriately choosing the value of t, we can ensure that g(i|S) < 0 or g(i|T − i) > 0 holds for some i ∈ T \ S. Thus (1) holds, indicating that g is reducible. At the same time, since r is a modular function, the submodularity of g still holds.

Algorithm 3 Perturbation-Reduction Minimization
Input: N, f, [S, T] where X_min ⊆ [S, T].
Output: An approximate solution X_*^p.
1: If f is reducible on [S, T], [X_0, Y_0] ← [S, T], run A1 for f, [S, T] ← [X_t, Y_t].
2: Generate r. Let g = f + r. [X_0, Y_0] ← [S, T], run A1 for g, [S, T] ← [X_t, Y_t].
3: Solve X_*^p ∈ argmin_{X ∈ [S,T]} f(X).

Algorithm 4 Perturbation-Reduction Maximization
Input: N, f, [S, T] where X_max ⊆ [S, T].
Output: An approximate solution X_p^*.
1: If f is reducible on [S, T], [X_0, Y_0] ← [S, T], run A2 for f, [S, T] ← [X_t, Y_t].
2: Generate r. Let g = f + r. [X_0, Y_0] ← [S, T], run A2 for g, [S, T] ← [X_t, Y_t].
3: Solve X_p^* ∈ argmax_{X ∈ [S,T]} f(X).

We propose our perturbation based method for minimization and maximization in A3 and A4, respectively. For an irreducible submodular function f on a given lattice [S, T], we first perturb the objective function to make it reducible, i.e., g ≜ f + r. A1 or A2 is then employed to obtain the reduced lattice of g. Finally, we solve the original problem of f on the reduced lattice exactly or approximately using existing methods.

It is worth mentioning that, although we mainly focus on irreducible functions, our methods also work for reducible ones, as they are special cases of irreducible functions: given a reducible function f on [S, T] whose reduction rate is less than 1, after A1 (A2) terminates we obtain a sublattice [P, Q] ⊂ [S, T] on which f is irreducible.
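Putting the pieces together, the following sketch shows how A3/A4 could be wired up on top of the reduce_lattice sketch above; the uniform-noise helper, the brute-force final step (used here purely to keep the sketch self-contained on small reduced lattices), and all names are our own assumptions, not the authors' implementation.

```python
import itertools
import random

def perturbation_reduction(f, N, t, mode="max", seed=None):
    """Sketch of Algorithm 3 (mode='min') / Algorithm 4 (mode='max')."""
    rng = random.Random(seed)
    N = frozenset(N)
    # Step 1: exploit whatever reducibility f already has.
    S, T = reduce_lattice(f, frozenset(), N, mode=mode)
    # Step 2: perturb with a modular r and reduce again for g = f + r.
    r = {i: rng.uniform(-t, t) for i in N}
    g = lambda A: f(A) + sum(r[i] for i in A)
    S, T = reduce_lattice(g, S, T, mode=mode)
    # Step 3: optimize the ORIGINAL f over the reduced lattice [S, T].
    # Brute force is used only for clarity; in practice an exact or
    # approximate solver is plugged in here.
    candidates = (S | frozenset(extra)
                  for k in range(len(T - S) + 1)
                  for extra in itertools.combinations(T - S, k))
    pick = max if mode == "max" else min
    return pick(candidates, key=f)
```

With t = 0 the perturbation vanishes and the procedure falls back to plain A1/A2 followed by an exact solve, which matches the observation above that reducible functions are covered as a special case.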
4 THEORETICAL ANALYSIS

By perturbing the irreducible submodular function, we transform P1 (P2) into P3 (P4). This makes the objective reducible, but it also makes the solution inexact. Correspondingly, our theoretical analysis focuses on two main aspects: the reducibility gain and the performance loss incurred by the perturbation.

4.1 Reducibility Gain

Suppose f: 2^N → R is an irreducible submodular function on [S, T], and g ≜ f + r as defined in P3 and P4. Since f is irreducible, ∀i ∈ T \ S, we have f(i|S) ≥ 0 and f(i|T − i) ≤ 0.

Proposition 3. Given a submodular function f: 2^N → R which is irreducible on [S, T], define m{f, [S,T]} ≜ min_{i ∈ T\S} min{f(i|S), −f(i|T − i)}. If t ≤ m{f, [S,T]}, then g is irreducible on [S, T].

Proof. Since m{f, [S,T]} ≥ 0, we suppose 0 ≤ t ≤ m{f, [S,T]}. ∀i ∈ T \ S, we have g(i|S) = f(i|S) + r(i) ≥ f(i|S) − t ≥ f(i|S) − m{f, [S,T]} ≥ 0, and g(i|T − i) = f(i|T − i) + r(i) ≤ f(i|T − i) + t ≤ f(i|T − i) + m{f, [S,T]} ≤ 0, which implies that g is also irreducible on [S, T].

Proposition 3 indicates that if the perturbation scale t is small enough, there is no reducibility gain. This is intuitively reasonable since g → f as t → 0.

To lower bound the reducibility gain of adding perturbation, we generalize the concept of curvature [4, 13] to non-monotone irreducible submodular functions.

Definition 1. Given a submodular function f: 2^N → R, the curvature of f on [S, T] is defined as

c{f, [S,T]} = max_{i ∈ T\S, f(i|S) > 0} [f(i|S) − f(i|T − i)] / f(i|S).

Note that for any irreducible submodular function f on [S, T], we have c{f, [S,T]} ≥ 1.

Theorem 1. Suppose t > m{f, [S,T]}, and denote s = |T \ S| > 0, k = Σ_{i ∈ T\S} f(i|S), c = c{f, [S,T]}. The reduction rate of g in expectation is at least 1 − ck / (2ts).

Proof. Suppose T \ S = {1, 2, ..., s}. ∀i ∈ T \ S, we define a random variable H_i as

H_i = 1 if K_i > 0, and H_i = 0 otherwise.

H_i indicates whether i can be reduced from T \ S or not. Define H = Σ_{i ∈ T\S} H_i as the total number of reduced elements. We lower bound E(H) by the expected number of elements reduced after the first iteration round of A1 (A2),

E(H) = Σ_{i ∈ T\S} E(H_i) = Σ_{i ∈ T\S} Pr{H_i = 1}
     ≥ (1 / 2t) Σ_{i ∈ T\S} [ max{0, t − f(i|S)} + max{0, t + f(i|T − i)} ]
     ≥ s − (c / 2t) Σ_{i ∈ T\S} f(i|S) = s − ck / 2t.

Consequently, the reduction rate in expectation is E(H)/s ≥ 1 − ck / (2ts).

Theorem 1 implies that the reduction rate in expectation approaches 1 as the perturbation scale t increases. This is consistent with the intuition that g → r as t → ∞. Note that r is a modular function, which always has the highest reduction rate 1.
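To make the quantities in Proposition 3, Definition 1, and Theorem 1 concrete, here is a small illustrative helper (function and variable names are ours) that evaluates m{f,[S,T]}, the curvature c{f,[S,T]}, and the resulting lower bound 1 − ck/(2ts) on the expected reduction rate for a candidate perturbation scale t.

```python
def gain(f, i, A):
    """Marginal gain f(i|A) = f(A + i) - f(A)."""
    return f(A | {i}) - f(A)

def reducibility_gain_bound(f, S, T, t):
    """Assumes f is irreducible on [S, T], i.e. f(i|S) >= 0 and
    f(i|T - i) <= 0 for every i in T \\ S, and that t > m{f,[S,T]}."""
    mid = T - S
    s = len(mid)
    # m{f,[S,T]}: below this scale the perturbation gains nothing (Prop. 3).
    m = min(min(gain(f, i, S), -gain(f, i, T - {i})) for i in mid)
    # k: total marginal gain over the lower endpoint S.
    k = sum(gain(f, i, S) for i in mid)
    # Curvature c{f,[S,T]} (Definition 1); at least 1 for irreducible f.
    ratios = [(gain(f, i, S) - gain(f, i, T - {i})) / gain(f, i, S)
              for i in mid if gain(f, i, S) > 0]
    c = max(ratios) if ratios else 1.0
    # Theorem 1: the expected reduction rate of g = f + r is at least this.
    return m, c, max(0.0, 1.0 - c * k / (2.0 * t * s))
```

Note that the returned bound stays at zero until t exceeds roughly ck/(2s), which is one way to read the statement that small perturbation scales give no reducibility gain.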
4.2 Performance Loss

Suppose X_* ∈ X_min (X^* ∈ X_max), i.e., X_* (X^*) is an optimum of P1 (P2). Recall that X_*^p (X_p^*) is the output of A3 (A4). For P1 (P2), we define f(X_*^p) − f(X_*) (resp. f(X^*) − f(X_p^*)) as the performance loss incurred by the perturbation. For P1 (P2), the following result shows that the performance loss is upper bounded by the total perturbation of the "mistakenly" reduced elements, which will be explained later on.

Theorem 2. Given an irreducible submodular function f: 2^N → R, suppose t is the perturbation scale in A3 (A4), and R_t is the reduction rate. We have

f(X_*^p) − f(X_*) < −r(X_t \ X_*) + r(X_* \ Y_t) < ntR_t,
f(X^*) − f(X_p^*) < r(X_t \ X^*) − r(X^* \ Y_t) < ntR_t.

Proof. We prove the maximization case. In general, we have X^* ∉ [X_t, Y_t], otherwise the loss is zero. Note that X_p^* ∈ [X_t, Y_t] according to A4. We first introduce an intermediate set X^* ∪ X_t ∩ Y_t, i.e., the contraction of X^* onto [X_t, Y_t], for our analysis. Given the fact that f(X^* ∪ X_t ∩ Y_t) ≤ f(X_p^*), if we can upper bound f(X^*) − f(X^* ∪ X_t ∩ Y_t), then the total performance loss is also upper bounded. In A2, we have [X_t, Y_t] ⊂ ··· ⊂ [X_1, Y_1] ⊂ [X_0, Y_0] = [∅, N]. By definition, ∀ 0 ≤ k ≤ t − 1, X_{k+1} = X_k ∪ D_k and Y_{k+1} = Y_k \ U_k. We have

f(X^* ∪ X_t) − f(X^*)                                                        (2)
  = Σ_{s=1}^{|X_t \ X^*|} f(x_s | X^* + x_1 + ··· + x_{s−1}),  x_s ∈ X_t \ X^*  (3)
  ≥ Σ_{i=0}^{t−1} Σ_{d ∈ D_i \ X^*} f(d | Y_i − d)                            (4)
  > − Σ_{i=0}^{t−1} Σ_{d ∈ D_i \ X^*} r(d)                                    (5)
  = −r(X_t \ X^*),                                                            (6)

where (3) is the telescoping expansion of (2), (4) holds by submodularity, and (5) follows from the third step of A2 (d ∈ D_i implies g(d|Y_i − d) = f(d|Y_i − d) + r(d) > 0). Similarly, we have

f(X^* ∪ X_t) − f(X^* ∪ X_t ∩ Y_t) < −r(X^* \ Y_t).                            (7)

Combining (6) with (7), and noting that X^* ∪ X_t ∩ Y_t ∈ [X_t, Y_t] and f(X_p^*) ≥ f(X^* ∪ X_t ∩ Y_t), we have

f(X^*) − f(X_p^*) < r(X_t \ X^*) − r(X^* \ Y_t).                              (8)

In (8), X_t \ X^* is the set of all elements which are not in X^* but were added by A2, while X^* \ Y_t is the set of all elements which are in X^* but were eliminated by A2. Consequently, the performance loss is upper bounded by the total perturbation value of all the mistakenly reduced elements. Since the number of mistakenly reduced elements is no more than the number of all reduced elements nR_t, and the perturbation is generated in [−t, t], we have r(X_t \ X^*) − r(X^* \ Y_t) ≤ ntR_t.

For the minimization case, the proof is similar.
larly, X∗\Yt isthesetofalltheelementswhicharein CombiningtheaboveresultwithTheorem2,andnote X∗ but eliminated by A2. Consequently, the perfor- r(X \X∗)−r(X∗ \Y ) = r(X∗)−r(X∗), we have, t t c mancelossisupperboundedbythetotalperturbation with probability at least 1−δ, value of all the mistakenly reduced elements. Since (cid:112) f(X∗)−f(X∗)<r(X∗)−r(X∗)<t 2N (n+log(1/δ)). the number of all the mistakenly reduced elements is p c r no more than the number of all the reduced elements Using a similar method, (9) can also be proved. nR , and the perturbation is generated in [−t,t], we t have r(X \X∗)−r(X∗\Y )≤ntR . Theorem 4 has an intuitive interpretation. Take P2 t t t and P4 as examples, ∀Y ⊆ N, if f(X∗) − f(Y) is For the minimization case, the proof is similar. large, then it is unlikely that Y is an optimum of P4. Suppose f(X∗)−f(Y)=σ >0, then we have, Note that in Theorem 2, the performance loss is up- per bounded by the sum of random variables, which Pr[f(Y)+r(Y)≥f(X)+r(X),∀X ⊆N] means we can obtain high probability bounds using ≤Pr[f(Y)+r(Y)≥f(X∗)+r(X∗)] some concentration inequalities, such as [11]. =Pr[r(Y)−r(X∗)≥σ]. Theorem 3. (Hoeffding) Let X1,X2,...,Xn be in- Totally, the probability of Y being an optimum of P4 dependent real-valued random variables such that ∀i∈ isupperboundedbytheprobabilitythattheperturba- {1,2,...,n}, |Xi|≤t. Then with probability 1−δ, tiondifferencer(Y)−r(X∗)cancompensatethefunc- n (cid:34) n (cid:35) tion value difference σ, where the later probability is (cid:88) (cid:88) (cid:112) X −E X <t 2nlog(1/δ). small when σ is large. i i i=1 i=1 Finally, we show that the M and N in Theorem 4, r r Theorem 4. Define X∗c (cid:44) X∗ ∪Xt ∩Yt, and Xc∗ (cid:44) whicharethenumbersofmistakenlyreducedelements, X∗ ∪ Xt ∩ Yt. Denote Mr (cid:44) |X∗c(cid:52)X∗|, and Nr (cid:44) can also be upper bounded by functions of f and t. |X∗(cid:52)X∗|, where A(cid:52)B (cid:44) (A \ B) ∪ (B \ A) is the Theorem 5. Denote the total number of the mistak- c symmetric difference between the two sets A and B. enly reduced elements in the first iteration of A1 and Then with probability at least 1−δ, A2 as M1 and N1, respectively. We have, r r (cid:112) f(X∗p)−f(X∗)<t(cid:112)2Mr(n+log(1/δ)), (9) Er[Mr1]≤ n2 − F −ft(X∗), (12) f(X∗)−f(Xp∗)<t 2Nr(n+log(1/δ)). (10) E [N1]≤ n − f(X∗)−F, (13) r r 2 t Proof. We prove (10). Since the perturbation vector where F (cid:44) 1(f(∅)+f(N))∈[f(X ),f(X∗)]. r has zero expectation value, and each element of r 2 ∗ is independently generated. For any fixed X ⊆ N, Proof. For (13), we calculate the total mistakenly re- accordingtoTheorem3,withprobabilityatleast1−δ, duced element number in expectation in the first iter- (cid:112) r(X)−r(X∗)≤t 2|X(cid:52)X∗|log(1/δ). (11) ation of A2. According to the definition of symmetric Suppose X∗ ∈ [S,T], and define m (cid:44) |[S,T]|. Obvi- difference, Nr1 =|X∗\Y1|+|X1\X∗|. ously, we have m≤2n. Hence, ∀i∈X∗,f(i|∅)≥f(i|X∗−i)=f(X∗)−f(X∗−i)≥0. Pr(cid:104)r(X∗)−r(X∗)≥t(cid:112)2N log(m/δ)(cid:105) And i∈X∗\Y1 iff f(i|∅)+t<0. Similarly, ∀j (cid:54)∈X∗, c r f(j|N−j)≤f(j|X∗)=f(X∗+j)−f(X∗)≤0. And (cid:20) (cid:114) (cid:21) =(cid:88) Pr r(X∗)−r(X∗)≥t 2|X∗(cid:52)X∗|log(m),X∗=X j ∈X1\X∗ iff f(j|N −j)+t>0. 
Finally, we show that M_r and N_r in Theorem 4, the numbers of mistakenly reduced elements, can themselves be upper bounded by functions of f and t.

Theorem 5. Denote the total number of mistakenly reduced elements in the first iteration of A1 and A2 as M_r^1 and N_r^1, respectively. We have

E_r[M_r^1] ≤ n/2 − (F − f(X_*)) / t,    (12)
E_r[N_r^1] ≤ n/2 − (f(X^*) − F) / t,    (13)

where F ≜ (1/2)(f(∅) + f(N)) ∈ [f(X_*), f(X^*)].

Proof. For (13), we compute the expected number of mistakenly reduced elements in the first iteration of A2. By the definition of the symmetric difference, N_r^1 = |X^* \ Y_1| + |X_1 \ X^*|.

∀i ∈ X^*, f(i|∅) ≥ f(i|X^* − i) = f(X^*) − f(X^* − i) ≥ 0, and i ∈ X^* \ Y_1 iff g(i|∅) = f(i|∅) + r(i) < 0. Similarly, ∀j ∉ X^*, f(j|N − j) ≤ f(j|X^*) = f(X^* + j) − f(X^*) ≤ 0, and j ∈ X_1 \ X^* iff g(j|N − j) = f(j|N − j) + r(j) > 0. Thus we have

E_r[N_r^1] = E_r|X^* \ Y_1| + E_r|X_1 \ X^*|
  = Σ_{i ∈ X^*} (t − f(i|∅)) / (2t) + Σ_{j ∉ X^*} (t + f(j|N − j)) / (2t)
  = n/2 − (1 / 2t) [ Σ_{i ∈ X^*} f(i|∅) − Σ_{j ∉ X^*} f(j|N − j) ]
  ≤ n/2 − [ f(X^*) − f(∅) + f(X^*) − f(N) ] / (2t)
  = n/2 − (f(X^*) − F) / t.

For (12), the proof is similar.

Using similar methods we can obtain the following result, which recovers Theorem 5 as a special case.

Theorem 6. Denote the total number of mistakenly reduced elements in the kth iteration as M_r^k and N_r^k, respectively. We have

E_r[M_r^k] ≤ n_{k−1}/2 − (F_{k−1} − f(X_*)) / t,
E_r[N_r^k] ≤ n_{k−1}/2 − (f(X^*) − F_{k−1}) / t,

where n_{k−1} ≜ |Y_{k−1} \ X_{k−1}| and F_{k−1} ≜ (1/2)(f(X_{k−1}) + f(Y_{k−1})) ∈ [f(X_*), f(X^*)].

Theorem 5 implies that the expected number of mistakenly reduced elements in the first iteration approaches n/2 as the perturbation scale t increases. This is consistent with the intuition: let t → ∞, then g → r, each element is randomly added or eliminated with probability 1/2, and the expected number of mistakenly reduced elements is n/2.

Remark 1. When t is large enough, most elements will be reduced in the first iteration, i.e., N_r ≈ N_r^1. Let t = 2(f(X^*) − F) / (n(1 − 2ε)), where ε > 0. By (13) we have E_r[N_r] ≈ E_r[N_r^1] ≤ εn, which indicates that if t = 2(f(X^*) − F) / (n(1 − 2ε)) is a large enough perturbation scale, then the number of mistakenly reduced elements can be desirably upper bounded.

Remark 2. Suppose t is large enough and N_r ≈ N_r^1. With the result of Theorem 2, we have

f(X^*) − f(X_p^*) < tN_r ≈ nt/2 − (f(X^*) − F).

Let t = 2[(1 + δ)f(X^*) − F] / n, where δ > 0; from the above we have f(X_p^*) > (1 − δ)f(X^*). This means that if there is some relationship between the optimum and the perturbation (t = 2[(1 + δ)f(X^*) − F] / n is a large perturbation scale), then the previous performance loss results can be transformed into approximation ratios.
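Remarks 1 and 2 both read off a perturbation scale from an estimate of the optimal value f(X^*) and the midpoint F = (f(∅) + f(N))/2. The tiny helper below is purely illustrative (in practice f(X^*) is unknown and would have to be replaced by an estimate or bound):

```python
def remark_scales(f_opt, f_empty, f_full, n, eps=0.1, delta=0.1):
    """Perturbation scales suggested by Remark 1 and Remark 2.

    f_opt   : (estimate of) the optimal value f(X*)
    f_empty : f(emptyset),  f_full : f(N),  n : ground set size
    eps     : target fraction of mistakenly reduced elements (Remark 1)
    delta   : target relative loss, f(X_p*) > (1 - delta) f(X*) (Remark 2)
    """
    F = 0.5 * (f_empty + f_full)
    t_remark1 = 2.0 * (f_opt - F) / (n * (1.0 - 2.0 * eps))
    t_remark2 = 2.0 * ((1.0 + delta) * f_opt - F) / n
    return t_remark1, t_remark2
```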
5 EXPERIMENTAL RESULTS

For reducible submodular functions, incorporating reduction into optimization methods has achieved favorable performance [6, 8, 12, 19]. In our experiments, we therefore focus on (nearly) irreducible submodular functions, as listed below.

Subset Selection Function. The objective function [18, 12] is irreducible. Given M ∈ R_+^{n×n}, f(X) ≜ Σ_{i∈N} Σ_{j∈X} M_ij − λ Σ_{i,j∈X} M_ij, where λ ∈ [0.5, 1]. We set n = 100, λ = 0.7, randomly generate a symmetric matrix M in (0, 1)^{n×n}, and set M_ii = 1, ∀i ∈ N.

Mutual Information Function. Given n random vectors X_1, X_2, ..., X_n, define h(X) as the entropy of the random variables {X_i | i ∈ X}, which is a highly reducible submodular function. The symmetrization [1] of h leads to the mutual information f(X) ≜ h(X) + h(N \ X), which is irreducible. We set n = 100 and randomly generate {X_i | i = 1, 2, ..., n}.

Log-Determinant Function. Given a positive definite matrix K ∈ S_{++}^n, the log-determinant [17] is submodular. The symmetrization of the log-determinant is f(X) ≜ log det(K_X) + log det(K_{N\X}), where K_X ≜ [K_ij]_{i,j∈X}, ∀X ⊆ N. We set n = 100, randomly generate n data points, and compute the n×n similarity matrix as the positive definite matrix K.

Negative Half-Products Function. The objective [2] is f(X) ≜ c(X) − Σ_{i,j∈X, i<j} a(i)b(j), where a, b, c are non-negative vectors. When c is not non-negative, f can be highly reducible [19]. Here c is non-negative, and f is nearly irreducible: the reduction rate of A1 (A2) is about 1%. We set n = 100 and randomly generate a, b in (0.1, 0.5)^n and c in (1, 5)^n.
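For readers who want to reproduce the flavor of this setup, here is a small sketch that builds two of the objectives above as plain Python callables on frozensets (NumPy-based, with our own random data; the constants mirror the settings quoted in the text, but the code is not the authors' experimental harness). These callables can be fed directly to the reduce_lattice / perturbation_reduction sketches from the earlier sections.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
N = frozenset(range(n))

# Subset selection: f(X) = sum_{i in N, j in X} M_ij - lam * sum_{i,j in X} M_ij.
M = rng.uniform(0.0, 1.0, size=(n, n))
M = (M + M.T) / 2.0
np.fill_diagonal(M, 1.0)
lam = 0.7

def subset_selection(X):
    idx = sorted(X)
    if not idx:
        return 0.0
    return M[:, idx].sum() - lam * M[np.ix_(idx, idx)].sum()

# Symmetrized log-determinant: f(X) = log det(K_X) + log det(K_{N\X}).
P = rng.normal(size=(n, n))
K = P @ P.T + n * np.eye(n)        # a positive definite "similarity" matrix

def logdet(idx):
    if not idx:
        return 0.0                  # empty principal minor contributes nothing
    _, val = np.linalg.slogdet(K[np.ix_(idx, idx)])
    return val

def sym_logdet(X):
    return logdet(sorted(X)) + logdet(sorted(N - X))

print(subset_selection(frozenset(range(10))), sym_logdet(frozenset(range(10))))
```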
5.1 Perturbation Scale

In Theorem 1, we lower bound the expected reduction rate using the expected reduction rate after the first iteration of A1 (A2). It has recently been reported that for reducible submodular minimization, this kind of first-iteration bound is often not tight in practice [12]. Given that our method transforms irreducible functions into reducible ones, it is reasonable to borrow experience from the reducible case. We conjecture that relatively small reduction rates after the first iteration are already sufficient for high reduction rates after the last iteration, so that we only need to choose small perturbation scales to obtain desirable reducibility gains. We verify the conjecture empirically, as shown in Figure 1. Appropriate perturbation scales t are chosen so that the reduction rates after the last iteration range from 0 to nearly 1. For each perturbation scale, we repeatedly generate r 10 times and record the average reduction rates of A2 after iterations 1-4 and after the last iteration. We observe that A2 terminates within 10 iterations for all objective functions.

[Figure 1: Average Reduction Rates of Maximization as a function of the perturbation scale. Panels: (a) Subset selection, (b) Mutual information, (c) Log-determinant, (d) Negative half-products; each panel plots the average reduction rate after iterations 1-4 and after the last iteration.]

We defer the analogous results for minimization to the supplementary material. As conjectured, we learn from Figure 1 that the gap between the average reduction rates after the first iteration and after the last iteration is always large in practice. Hence, we can choose t to reach a moderate expected reduction rate (e.g., 0.3) after the first iteration, so as to obtain a potentially high final reduction rate. Although we can empirically exploit this gap to choose relatively small perturbation scales, we would like to point out that theoretically determining the expected reduction rate after the last iteration for a given perturbation scale is still an open problem.

5.2 Optimization Results

We implement our method using the SFO toolbox [15]. For maximization, we compare A4 with both exact and approximate methods, as exact methods usually cannot terminate in acceptable time at larger input scales. Denote the outputs of the proposed A4 and of the existing method as X_p and X_e, respectively, and denote the corresponding running times as T_p and T_e. We measure the performance loss using the relative error, defined as E_r ≜ |f(X_e) − f(X_p)| / |f(X_e)|. When X_e is exact, 1 − E_r is the approximation ratio. We measure the reducibility gain using both the reduction rate and the time ratio T_p / T_e. Small time ratios and relative errors indicate large reducibility gains and small performance losses, respectively.

We employ the branch-and-bound method [9] as the exact solver. Since it has exponential time complexity, we reset n = 20 so that it terminates within acceptable time. The results are shown in Figure 2. For comparison, we normalize the perturbation scale as follows. We define M{f, [S,T]} ≜ max_{i ∈ T\S} max{f(i|S), −f(i|T − i)}, and define the perturbation scale ratio as P(t) ≜ (t − m) / (M − m). We change the perturbation scale t in [m, M] by varying P(t) in [0, 1]. We then randomly generate 10 cases for each objective function and record the average relative errors, average reduction rates, and average time ratios for each perturbation scale ratio.

[Figure 2: Maximization Results Using the Branch-and-Bound Method [9] (Exact Solver): average relative error, average time ratio, and average reduction rate versus the perturbation scale ratio. Panels: (a) Subset selection, (b) Mutual information, (c) Log-determinant, (d) Negative half-products.]

Figure 3 shows the results compared with the random bi-directional greedy method [3], which is used as the approximate solver. Note that n is set to 100. For each case, we first run A2 once, then run the random method 5 times on both the original and the reduced lattice, and record the best solutions.

[Figure 3: Maximization Results Using Random Bi-directional Greedy [3] (Approximate Solver): average relative error, average time ratio, and average reduction rate versus the perturbation scale ratio. Panels: (a) Subset selection, (b) Mutual information, (c) Log-determinant, (d) Negative half-products.]
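The evaluation metrics and the scale normalization used above are straightforward to express in code; a small illustrative helper (names are ours) is sketched below, with the marginal-gain helper repeated so the snippet stands on its own.

```python
def gain(f, i, A):
    """Marginal gain f(i|A) = f(A + i) - f(A), as in the Section 4 sketch."""
    return f(A | {i}) - f(A)

def scale_from_ratio(f, S, T, ratio):
    """Map a perturbation scale ratio P(t) in [0, 1] back to t in [m, M],
    where m = m{f,[S,T]} and M = M{f,[S,T]} as defined in the text."""
    mid = T - S
    lo = min(min(gain(f, i, S), -gain(f, i, T - {i})) for i in mid)   # m
    hi = max(max(gain(f, i, S), -gain(f, i, T - {i})) for i in mid)   # M
    return lo + ratio * (hi - lo)

def relative_error(f, X_e, X_p):
    """E_r = |f(X_e) - f(X_p)| / |f(X_e)|; 1 - E_r is the approximation
    ratio when X_e is an exact optimum."""
    return abs(f(X_e) - f(X_p)) / abs(f(X_e))
```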
We change M−m On the Reducibility of Submodular Functions 2 Average Relative Error Average Relative Error Average Relative Error Average Relative Error Average Time Ratio 1.2 Average Time Ratio 1.2 Average Time Ratio 1.2 Average Time Ratio 1.5 Average Reduction Rate Average Reduction Rate Average Reduction Rate Average Reduction Rate 1 1 1 1 0.8 0.8 0.8 0.6 0.6 0.6 0.5 0.4 0.4 0.4 0.2 0.2 0.2 0 0 0 0 −0.50 0.2 0.4 0.6 0.8 1 −0.20 0.2 0.4 0.6 0.8 1 −0.20 0.2 0.4 0.6 0.8 1 −0.20 0.2 0.4 0.6 0.8 1 Perturbation Scale Ratio Perturbation Scale Ratio Perturbation Scale Ratio Perturbation Scale Ratio (a) Subset selection (b) Mutual information (c) Log-determinant (d) Negative half-products Figure 3: Maximization Results Using Random Bi-directional Greedy [3] (Approximate Solver) useful interval of perturbation scales is smaller than 7 CONCLUSIONS those of other functions. According to Remark 2, as the marginal gains are relatively large compared to In this paper, we introduce the reducibility of sub- the optimal value, it is inappropriate to choose large modularity, which can improve the efficiency of sub- perturbation scales in this case. modular optimization methods. We then propose the perturbation-reduction framework, and demonstrate its advantages theoretically and empirically. We an- Average Relative Error 1.5 Average Relative Error alyze the reducibility gain and performance loss given 1.2 Average Time Ratio Average Time Ratio Average Reduction Rate Average Reduction Rate perturbation scales. Experimental results show that 1 there exists practically useful intervals, and choosing 0.8 1 perturbation scales from them enables us to signifi- 0.6 0.4 cantlyacceleratetheexistingmethodswithonlysmall 0.5 0.2 performance loss. For the future work, we would like 0 to study the reducibility of submodular functions in −0.20 0.2 0.4 0.6 0.8 1 00 0.2 0.4 0.6 0.8 1 constrained problems. Perturbation Scale Ratio Perturbation Scale Ratio (a) Log-determinant (b) Negative half-products Acknowledgements Figure 4: Minimization Results Jincheng Mei would like to thank Csaba Szepesv´ari for fixing the proof of Theorem 4. Bao-Liang Lu was supported by the National Basic Research Program of China (No. 2013CB329401), the National Natural 6 RELATED WORK Science Foundation of China (No. 61272248) and the Science and Technology Commission of Shanghai Mu- nicipality (No. 13511500200). Asterisk indicates the In this section, we review some existing works related corresponding author. to solution space reduction for submodular optimiza- tion. ForP1,Fujishige[6]firstlyprovesX ⊆[A,B], min References where A = {i ∈ N | f(i|∅) < 0} and B = {j ∈ N | f(j|N − j) ≤ 0}. Note that actually [A,B] = [X ,Y ],whichistheworkinglatticeofA1afteritsfirst [1] Francis Bach. Learning with submodular functions: 1 1 A convex optimization perspective. Foundations and iteration. Recently,Iyeretal. [12]proposethediscrete Trends Machine Learning, 6(2-3):145–373, 2013. Majorization-Minimization(MMin)frameworkforP1. They prove that by choosing appropriate supergra- [2] Endre Boros and Peter L Hammer. Pseudo- dients, MMin is identical with A1. For P2, Golden- boolean optimization. Discrete Applied Mathematics, gorin [8] proposes the Preliminary Preservation Algo- 123(1):155–225, 2002. rithm (PPA), which is identical with A2. For general [3] Niv Buchbinder, Michael Feldman, Joseph Naor, cases, Mei et al. [19] prove that the two algorithms and Roy Schwartz. A tight linear time (1/2)- work for quasi-submodular functions. 
6 RELATED WORK

In this section, we review existing works related to solution space reduction for submodular optimization. For P1, Fujishige [6] first proves X_min ⊆ [A, B], where A = {i ∈ N | f(i|∅) < 0} and B = {j ∈ N | f(j|N − j) ≤ 0}. Note that [A, B] = [X_1, Y_1], the working lattice of A1 after its first iteration. Recently, Iyer et al. [12] propose the discrete Majorization-Minimization (MMin) framework for P1. They prove that, by choosing appropriate supergradients, MMin is identical to A1. For P2, Goldengorin [8] proposes the Preliminary Preservation Algorithm (PPA), which is identical to A2. For more general cases, Mei et al. [19] prove that the two algorithms work for quasi-submodular functions. Beyond unconstrained problems, for uniform matroid constrained monotone submodular function optimization, Wei et al. [25] propose a similar pruning method in which the reduced ground set contains all the original solutions of the greedy algorithm.

7 CONCLUSIONS

In this paper, we introduce the reducibility of submodular functions, which can improve the efficiency of submodular optimization methods. We then propose the perturbation-reduction framework and demonstrate its advantages theoretically and empirically. We analyze the reducibility gain and the performance loss for given perturbation scales. Experimental results show that there exist practically useful intervals, and choosing perturbation scales from them enables us to significantly accelerate existing methods with only small performance losses. For future work, we would like to study the reducibility of submodular functions in constrained problems.

Acknowledgements

Jincheng Mei would like to thank Csaba Szepesvári for fixing the proof of Theorem 4. Bao-Liang Lu was supported by the National Basic Research Program of China (No. 2013CB329401), the National Natural Science Foundation of China (No. 61272248) and the Science and Technology Commission of Shanghai Municipality (No. 13511500200). The asterisk indicates the corresponding author.

References

[1] Francis Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.

[2] Endre Boros and Peter L. Hammer. Pseudo-boolean optimization. Discrete Applied Mathematics, 123(1):155–225, 2002.

[3] Niv Buchbinder, Michael Feldman, Joseph Naor, and Roy Schwartz. A tight linear time (1/2)-approximation for unconstrained submodular maximization. In IEEE Annual Symposium on Foundations of Computer Science, pages 649–658, 2012.

[4] Michele Conforti and Gérard Cornuéjols. Submodular set functions, matroids and the greedy algorithm: Tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7(3):251–274, 1984.

[5] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In International Conference on Machine Learning, pages 1057–1064, 2011.

[6] Satoru Fujishige. Submodular Functions and Optimization, volume 58. Elsevier, 2005.

[7] Satoru Fujishige and Shigueo Isotani. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7(1):3–17, 2011.

[8] Boris Goldengorin. Maximization of submodular functions: Theory and enumeration algorithms. European Journal of Operational Research, 198(1):102–112, 2009.

[9] Boris Goldengorin, Gerard Sierksma, Gert A. Tijssen, and Michael Tso. The data-correcting algorithm for the minimization of supermodular functions. Management Science, 45(11):1539–1551, 1999.

[10] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.

[11] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[12] Rishabh Iyer, Stefanie Jegelka, and Jeff Bilmes. Fast semidifferential-based submodular function optimization. In International Conference on Machine Learning, pages 855–863, 2013.

[13] Rishabh Iyer, Stefanie Jegelka, and Jeff A. Bilmes. Curvature and optimal algorithms for learning and minimizing submodular functions. In Advances in Neural Information Processing Systems, pages 2742–2750, 2013.
[14] Stefanie Jegelka, Francis Bach, and Suvrit Sra. Reflection methods for user-friendly submodular optimization. In Advances in Neural Information Processing Systems, pages 1313–1321, 2013.

[15] Andreas Krause. SFO: A toolbox for submodular function optimization. The Journal of Machine Learning Research, 11:1141–1144, 2010.

[16] Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009.

[17] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012.

[18] Hui Lin and Jeff Bilmes. How to select a good training-data subset for transcription: Submodular active selection for sequences. In INTERSPEECH, pages 2859–2862, 2009.

[19] Jincheng Mei, Kang Zhao, and Bao-Liang Lu. On unconstrained quasi-submodular function optimization. In AAAI Conference on Artificial Intelligence, pages 1191–1197, 2015.

[20] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. In AAAI Conference on Artificial Intelligence, pages 1812–1818, 2015.

[21] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, pages 2049–2057, 2013.

[22] Mukund Narasimhan, Nebojsa Jojic, and Jeff A. Bilmes. Q-clustering. In Advances in Neural Information Processing Systems, pages 979–986, 2005.

[23] James B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming, 118(2):237–251, 2009.

[24] Xinghao Pan, Stefanie Jegelka, Joseph E. Gonzalez, Joseph K. Bradley, and Michael I. Jordan. Parallel double greedy submodular maximization. In Advances in Neural Information Processing Systems, pages 118–126, 2014.

[25] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Fast multi-stage submodular maximization. In International Conference on Machine Learning, pages 1494–1502, 2014.

Appendix

A Proof of Proposition 1

For Algorithm 1, the proof can be found in [12]. For Algorithm 2, the proof can be found in [8]. A proof under the weaker assumption of quasi-submodularity of f can be found in [19]. We prove Proposition 1 here for completeness.

Proof. Algorithm 1. Obviously X_min ⊆ [X_0, Y_0]. Suppose X_min ⊆ [X_k, Y_k]; we now prove X_min ⊆ [X_{k+1}, Y_{k+1}]. Suppose X_* ∈ X_min is a minimum of f; then we have X_k ⊆ X_* ⊆ Y_k. ∀i ∈ U_k, if i ∉ X_*, by submodularity we have f(i|X_*) ≤ f(i|X_k) < 0, i.e., f(X_* + i) < f(X_*), which contradicts the optimality of X_*. So we have U_k ⊆ X_*, and X_{k+1} = X_k ∪ U_k ⊆ X_*. ∀j ∈ D_k, if j ∈ X_*, by submodularity we have f(j|X_* − j) ≥ f(j|Y_k − j) > 0, i.e., f(X_*) > f(X_* − j), which also contradicts the optimality of X_*. Therefore we have D_k ⊆ N \ X_*, and X_* ⊆ Y_{k+1} = Y_k \ D_k.

Now we have X_{k+1} ⊆ X_* ⊆ Y_{k+1}. Since X_* can be an arbitrary element of X_min, we have X_min ⊆ [X_{k+1}, Y_{k+1}].

Algorithm 2. Obviously X_max ⊆ [X_0, Y_0]. Suppose X_max ⊆ [X_k, Y_k]; we now prove X_max ⊆ [X_{k+1}, Y_{k+1}]. Suppose X^* ∈ X_max is a maximum of f; then we have X_k ⊆ X^* ⊆ Y_k. ∀i ∈ U_k, if i ∈ X^*, by submodularity we have f(i|X^* − i) ≤ f(i|X_k) < 0, i.e., f(X^*) < f(X^* − i), which contradicts the optimality of X^*. So we have U_k ⊆ N \ X^*, and X^* ⊆ Y_{k+1} = Y_k \ U_k. ∀j ∈ D_k, if j ∉ X^*, by submodularity we have f(j|X^*) ≥ f(j|Y_k − j) > 0, i.e., f(X^* + j) > f(X^*), which also contradicts the optimality of X^*. So we have D_k ⊆ X^*, and X_{k+1} = X_k ∪ D_k ⊆ X^*.

Now we have X_{k+1} ⊆ X^* ⊆ Y_{k+1}. Since X^* can be an arbitrary element of X_max, we have X_max ⊆ [X_{k+1}, Y_{k+1}].

B Reduction Rate of Algorithm 1

Figure 5 shows the reduction rates of Algorithm 1. All the settings are the same as those of Algorithm 2 in the paper.
C More Experimental Results

C.1 Results of Maximization

In the paper we use the random bi-directional greedy method as the approximate solver for maximization. We also report the results of random permutation [12] and random local search [12]. The settings are the same as those in the paper. The results are shown in Figure 6 and Figure 7.

C.2 Results Using Real Data

Finally, we compare the results on real data. The objective function is the log-determinant function. For each test case, we randomly select 100 samples from the CIFAR dataset [16], and then compute the similarity matrix as the positive definite matrix K. Other settings are the same as those in the paper. The results are shown in Figure 8.

In Figure 8, the first three subfigures show the results of maximization using random local search, random permutation, and random bi-directional greedy, respectively. The last subfigure presents the results of minimization using the Fujishige-Wolfe minimum-norm point algorithm [7].
