Submodular relaxation for inference in Markov random fields

Anton Osokin, Dmitry Vetrov

Abstract—In this paper we address the problem of finding the most probable state of a discrete Markov random field (MRF), also known as the MRF energy minimization problem. The task is known to be NP-hard in general, and its practical importance motivates numerous approximate algorithms. We propose a submodular relaxation approach (SMR) based on a Lagrangian relaxation of the initial problem. Unlike the dual decomposition approach of Komodakis et al. [30], SMR does not decompose the graph structure of the initial problem but constructs a submodular energy that is minimized within the Lagrangian relaxation. Our approach is applicable to both pairwise and high-order MRFs and allows us to take into account global potentials of certain types. We study theoretical properties of the proposed approach and evaluate it experimentally.

Index Terms—Markov random fields, energy minimization, combinatorial algorithms, relaxation, graph cuts

1 INTRODUCTION

The problem of inference in a Markov random field (MRF) arises in many applied domains, e.g. machine learning, computer vision, and natural language processing. In this paper we focus on one important type of inference: maximum a posteriori (MAP) inference, often referred to as MRF energy minimization. Inference of this type is a combinatorial optimization problem, i.e. an optimization problem with a finite domain.

The most studied case of MRF energy minimization is the situation when the energy can be represented as a sum of terms (potentials) that depend on only one or two variables each (unary and pairwise potentials). In this setting the energy is said to be defined by a graph where the nodes correspond to the variables and the edges to the pairwise potentials. Minimization of energies defined on graphs is known to be NP-hard in general [8] but can be done exactly in polynomial time in a number of special cases, e.g. if the graph defining the energy is acyclic [36] or if the energy is submodular in the standard [28] or multi-label sense [10].

One way to go beyond pairwise potentials is to add higher-order summands to the energy. For example, Kohli et al. [23] and Ladický et al. [32] use high-order potentials based on superpixels (image regions) for semantic image segmentation; Delong et al. [11] use label-cost potentials for geometric model fitting tasks. To be tractable, high-order potentials need to have a compact representation. The recently proposed sparse high-order potentials define specific values for a short list of preferred configurations and a default value for all the others [29], [37].

In this paper we develop the submodular relaxation framework (SMR) for approximate minimization of energies with pairwise and sparse high-order potentials. SMR is based on a Lagrangian relaxation of consistency constraints on labelings, i.e. constraints that require each node to have exactly one label assigned. The SMR Lagrangian can be viewed as a pseudo-Boolean function of binary indicator variables and is often submodular, and thus can be efficiently minimized with a max-flow/min-cut algorithm (graph cut) at each step of the Lagrangian relaxation process.

In this paper we explore the SMR framework theoretically and provide explicit expressions for the obtained lower bounds. We experimentally compare our approach against several baselines (decomposition-based approaches [30], [25], [29]) and show its applicability on a number of real-world tasks.

The rest of the paper is organized as follows. In sec. 2 we review the related works. In sec. 3 we introduce our notation and discuss some well-known results. In sec. 4 we present the SMR framework. In sec. 5 we construct several algorithms within the SMR framework. In sec. 6 we present the theoretical analysis of our approach and in sec. 7 its experimental evaluation. We conclude in sec. 8.

2 RELATED WORKS

We mention works related to ours in two ways. First, we discuss methods that, analogously to SMR, rely on multiple calls of graph cuts.
Second, we mention some decomposition methods based on Lagrangian relaxation.

• A. Osokin is with the SIERRA team, INRIA and École Normale Supérieure, Paris, France. E-mail: [email protected]
• D. Vetrov is with the Faculty of Computer Science, Higher School of Economics, Moscow, Russia, and Lomonosov Moscow State University, Moscow, Russia. E-mail: [email protected]

Multiple graph cuts. The SMR framework uses the one-versus-all encoding for the variables, i.e. the indicators that particular labels are assigned to particular variables, and runs the graph cut at the inner step. In this sense our method is similar to the α-expansion algorithm [8] (and its generalizations [22], [23]) and to the methods for minimizing multi-label submodular functions [14], [10]. Nevertheless, SMR is quite different from these approaches. The α-expansion algorithm is a move-making method where each move decreases the energy. The methods of [14] and [10] are exact methods applicable only when a total order on the set of labels is defined and the pairwise potentials are convex w.r.t. it (or, more generally, submodular in the multi-label sense). Note that the frequently used Potts model is not submodular when the number of possible labels is larger than 2, so the methods of [14] and [10] are not applicable to it.

Jegelka and Bilmes [16] and Kohli et al. [24] studied high-order potentials that group lower-order potentials together (cooperative cuts). Similarly to SMR, cooperative-cut methods rely on calling graph cuts multiple times. As opposed to SMR, these methods construct and minimize upper (not lower) bounds on the initial energy.

The QPBO method [5] is a popular way of constructing a lower bound on the global minimum using graph cuts. When the energy depends on binary variables, the SMR approach obtains exactly the same lower bound as QPBO.

Decomposition methods split the original hard problem into easier solvable problems and try to make them reach an agreement on their solutions. One advantage of these methods is the ability not only to estimate the solution but also to provide a lower bound on the global minimum. Obtaining a lower bound is important because it gives an idea of how much better the solution could possibly be. If the lower bound is tight (the obtained energy equals the obtained lower bound), the method provides a certificate of optimality, i.e. ensures that the obtained solution is optimal. The typical drawback of such methods is a higher computational cost in comparison with, e.g., the α-expansion algorithm [18].

Usually a decomposition method can be described by two features: the way to enforce agreement of the subproblems and the type of subproblems. We are aware of two ways to enforce agreement: message passing and Lagrangian relaxation. Message passing was used in this context by Wainwright et al. [43] and Kolmogorov [25] in the tree-reweighted message passing (TRW) algorithms. The Lagrangian relaxation was applied to energy minimization by Komodakis et al. [30] and became very popular because of its flexibility. The SMR approach uses the Lagrangian relaxation path.

The majority of methods based on the Lagrangian relaxation decompose the graph defining the problem into subgraphs of simple structure. The usual choice is to use acyclic graphs (trees) or, more generally, graphs of low treewidth [30], [2]. Komodakis et al. [30] pointed out the alternative option of using submodular subproblems, but only in the case of problems with binary variables and together with the decomposition of the graph. Strandmark and Kahl [42] used decomposition into submodular subproblems to construct a distributed max-flow/min-cut algorithm for large-scale binary submodular problems. Komodakis and Paragios [29] generalized the method of [30] to work with energies with sparse high-order potentials. In this method groups of high-order potentials were put into separate subproblems.

Several times the Lagrangian relaxation approach was applied to the task of energy minimization without decomposing the graph of the problem. Yarkony et al. [46] created several copies of each node and enforced the copies to take equal values. Yarkony et al. [47] introduced planar binary subproblems in a one-versus-all way (similarly to SMR) and used them in a scheme of tightening the relaxation.

The SMR approach is similar to the methods of [2], [29], [30], [42], [46], [47] in the sense that all of them use the Lagrangian relaxation to reach an agreement. The main advantage of SMR is that the number of dual variables does not depend on the number of labels or the number of high-order potentials but is fixed to the number of nodes. This structure results in speed improvements over the competitors.

3 NOTATION AND PRELIMINARIES

Notation. Let V be a finite set of variables. Each variable i ∈ V takes values x_i ∈ P, where P is a finite set of labels. For any subset of variables C ⊆ V the Cartesian product P^{|C|} is the joint domain of the variables C. A joint state (or a labeling) of variables C is a mapping x_C : C → P, and x_C(i), i ∈ C, is the image of i under x_C. We use P^C to denote the set of all possible joint states of the variables in C.¹ A joint state of all variables in V is denoted by x (instead of x_V). If an expression contains both x_C ∈ P^C and x_A ∈ P^A, A, C ⊆ V, we assume that the mappings are consistent, i.e. ∀i ∈ A ∩ C : x_C(i) = x_A(i).

Consider a hypergraph (V, \mathcal{C}) where \mathcal{C} ⊆ 2^V is a set of hyperedges. A potential of hyperedge C is a mapping θ_C : P^C → R, and the number θ_C(x_C) ∈ R is the value of the potential on labeling x_C. The order of potential θ_C refers to the cardinality of the set C and is denoted by |C|. Potentials of order 1 and 2 are called unary and pairwise, correspondingly. Potentials of order higher than 2 are called high-order.

An energy defined on hypergraph (V, \mathcal{C}) (factorizable according to hypergraph (V, \mathcal{C})) is a mapping E : P^V → R that can be represented as a sum of potentials of hyperedges C ∈ \mathcal{C}:

  E(x) = \sum_{C \in \mathcal{C}} \theta_C(x_C).    (1)

¹. Note that P^C is highly related to P^{|C|}. The "mapping terminology" is introduced to allow the original variables i ∈ C ⊆ V to be used as arguments of labelings: x_C(i).

It is convenient to use notational shortcuts related to potentials and variables: θ_{i_1...i_k}(x_{i_1},...,x_{i_k}) = θ_{i_1...i_k, x_{i_1},...,x_{i_k}} = θ_{C, x_C} = θ_C(x_C) and x_C(i) = x_{C,i}, where C = {i_1,...,i_k} and i ∈ C.

When an energy is factorizable into unary and pairwise potentials only, we call it pairwise. We write pairwise energies as follows:

  E(x) = \sum_{i \in V} \theta_i(x_i) + \sum_{\{i,j\} \in \mathcal{E}} \theta_{ij}(x_i, x_j).    (2)

Here \mathcal{E} is a set of edges, i.e. hyperedges of order 2. Under this notation the set \mathcal{C} contains hyperedges of order 1 and 2: \mathcal{C} = \{\{i\} \mid i \in V\} \cup \{\{i,j\} \mid \{i,j\} \in \mathcal{E}\}.

Sometimes it will be convenient to use indicators of each variable taking each value: y_{ip} = [x_i = p], i ∈ V, p ∈ P. The indicators allow us to rewrite energy (1) as

  E_I(y) = \sum_{C \in \mathcal{C}} \sum_{d \in P^C} \theta_{C,d} \prod_{i \in C} y_{i d_i}.    (3)

Minimizing the multi-label energy (1) over P^V is equivalent to minimizing the binary energy (3) over the Boolean cube {0,1}^{V×P} under the so-called consistency constraints:

  \sum_{p \in P} y_{ip} = 1,  ∀i ∈ V.    (4)
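The correspondence between the multi-label energy (2) and its pseudo-Boolean indicator form (3) under the consistency constraints (4) can be checked directly on a toy instance. The sketch below is only an illustration: the nodes, labels, and potential values are hypothetical, and brute-force enumeration is used, which is viable only at toy sizes.

```python
import itertools

# Toy pairwise MRF: 3 nodes, 2 labels, chain edges; all potentials hypothetical.
V = [0, 1, 2]
P = [0, 1]
EDGES = [(0, 1), (1, 2)]
theta_unary = {(i, p): float((i + 1) * p) for i in V for p in P}
theta_pair = {(i, j, p, q): (0.5 if p != q else -0.5)
              for (i, j) in EDGES for p in P for q in P}

def energy(x):
    """Multi-label energy E(x) of eq. (2); x is a tuple of labels."""
    return (sum(theta_unary[i, x[i]] for i in V)
            + sum(theta_pair[i, j, x[i], x[j]] for (i, j) in EDGES))

def energy_indicator(y):
    """Pseudo-Boolean form E_I(y) of eq. (3) restricted to orders 1 and 2."""
    return (sum(theta_unary[i, p] * y[i, p] for i in V for p in P)
            + sum(theta_pair[i, j, p, q] * y[i, p] * y[j, q]
                  for (i, j) in EDGES for p in P for q in P))

def indicators(x):
    """Consistent indicator point y_ip = [x_i = p], satisfying (4)."""
    return {(i, p): int(x[i] == p) for i in V for p in P}

# The two forms agree on every labeling, so minimizing (3) under the
# consistency constraints (4) is the same problem as minimizing (2).
for x in itertools.product(P, repeat=len(V)):
    assert abs(energy(x) - energy_indicator(indicators(x))) < 1e-12
```

The indicator point built by `indicators` is exactly a vertex of the Boolean cube that satisfies (4); points of the cube violating (4) carry no multi-label interpretation, which is what the relaxation of the next section exploits.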
Throughout this paper we use equation numbers to indicate a constraint given by the corresponding equation, for example y ∈ (4).

The Lagrangian relaxation approach. A popular approximate approach to minimizing energy (1) is to enlarge the variables' domain, i.e. to perform a relaxation, and to construct the dual problem to the relaxed one, as it might be easier to solve. Formally, we get

  \max_{\lambda} D(\lambda) \le \min_{y \in Y} E_{lb}(y) \le \min_{x \in X} E(x)    (5)

where X and Y are discrete and continuous sets, respectively, with X ⊂ Y; E_{lb}(y) is a lower-bound function defined on the set Y, and D(λ) is a dual function. Both inequalities in (5) can be strict. If the left one is strict, we say that there is a non-zero duality gap; if the right one is strict, there is a non-zero integrality gap. Note that if the middle problem in (5) is convex and one of the regularity conditions is satisfied (e.g. it is a linear program), there is no duality gap. The goal of all relaxation methods is to make the joint gap in (5) as small as possible. In this paper we will construct the dual directly from the discrete primal and will need the continuous primal only for the theoretical analysis.

The linear relaxation of a pairwise energy consists in converting the energy minimization problem to an equivalent integer linear program (ILP) and relaxing it to a linear program (LP) [40], [43]. Wainwright et al. [43] have shown that minimizing energy (2) over the discrete domain is equivalent to minimizing the linear function

  E_L(y_L) = \sum_{i \in V} \sum_{p \in P} \theta_{ip} y_{ip} + \sum_{(i,j) \in \mathcal{E}} \sum_{p,q \in P} \theta_{ij,pq} y_{ij,pq}    (6)

w.r.t. the unary y_{ip} ≥ 0 and pairwise y_{ij,pq} ≥ 0 variables over the so-called marginal polytope: a linear polytope with facets defined by all marginalization constraints. The marginal polytope has an exponential number of facets, thus minimizing (6) over it is intractable. A polytope defined using only the first-order marginalization constraints is called a local polytope:

  \sum_{p \in P} y_{ip} = 1,  ∀i ∈ V,    (7)
  \sum_{p \in P} y_{ij,pq} = y_{jq},  \sum_{q \in P} y_{ij,pq} = y_{ip},  ∀\{i,j\} ∈ \mathcal{E}, ∀p, q ∈ P.    (8)

We refer to constraints (7) as consistency constraints and to constraints (8) as marginalization constraints. Minimizing (6) over the local polytope (7)-(8) is often called Schlesinger's LP relaxation [40] (see e.g. [44] for a recent review). In this paper we refer to this relaxation as the standard LP relaxation.

4 SUBMODULAR RELAXATION

In this section we present our approach. We start with pairwise MRFs (sec. 4.1) [35], then generalize to high-order potentials (sec. 4.2) and to linear global constraints (sec. 4.3). A version of SMR for nonsubmodular Lagrangians is discussed in the supplementary material.

4.1 Pairwise MRFs

The indicator notation allows us to rewrite the minimization of the pairwise energy (2) as a constrained optimization problem:

  \min_y \sum_{i \in V} \sum_{p \in P} \theta_{ip} y_{ip} + \sum_{\{i,j\} \in \mathcal{E}} \sum_{p,q \in P} \theta_{ij,pq} y_{ip} y_{jq},    (9)
  s.t.  y_{ip} \in \{0,1\},  ∀i ∈ V, p ∈ P,    (10)
        \sum_{p \in P} y_{ip} = 1,  ∀i ∈ V.    (11)

Constraints (11) (i.e. (4)) guarantee that each node i has exactly one label assigned; constraints (10) fix the variables y_{ip} to be binary.

The target function (9) (denoted by E_I(y)) of the binary variables y (i.e. when (10) holds) is a pseudo-Boolean polynomial of degree at most 2. This function is easy to minimize (ignoring the consistency constraints (11)) if it is submodular, which is equivalent to having θ_{ij,pq} ≤ 0 for all edges {i,j} ∈ \mathcal{E} and all labels p, q ∈ P. The so-called associative potentials are examples of potentials that satisfy these constraints. A pairwise potential θ_{ij} is associative if it rewards connected nodes for taking the same label, i.e.

  \theta_{ij}(p,q) = -C_{ij,p}[p = q] = \begin{cases} -C_{ij,p}, & p = q, \\ 0, & p \ne q, \end{cases}    (12)

where C_{ij,p} ≥ 0. Note that if the constants C_{ij,p} are independent of the label index p (i.e. C_{ij,p} = C_{ij}), then potentials (12) are equivalent to the Potts potentials.
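The submodularity condition θ_{ij,pq} ≤ 0 can be sanity-checked by brute force: a function f on {0,1}^n is submodular iff f(a ∨ b) + f(a ∧ b) ≤ f(a) + f(b) for all binary points a, b. The sketch below (coefficients are hypothetical; exhaustive checking is only feasible for tiny n) verifies this for a quadratic pseudo-Boolean function with nonpositive pairwise coefficients, as in (9) with associative potentials:

```python
import itertools

# Quadratic pseudo-Boolean function with arbitrary unary coefficients and
# nonpositive pairwise coefficients -- the condition stated above for E_I(y).
unary = [0.3, -0.5, 1.0, 0.2]
pair = {(0, 1): -0.7, (1, 2): -0.2, (2, 3): -1.0, (0, 3): -0.4}
n = len(unary)

def f(a):
    """Evaluate the quadratic pseudo-Boolean function at a point of {0,1}^n."""
    return (sum(unary[i] * a[i] for i in range(n))
            + sum(c * a[i] * a[j] for (i, j), c in pair.items()))

# Submodularity check: f(a|b) + f(a&b) <= f(a) + f(b) for all pairs a, b.
points = list(itertools.product((0, 1), repeat=n))
for a in points:
    for b in points:
        join = tuple(max(u, v) for u, v in zip(a, b))  # coordinate-wise OR
        meet = tuple(min(u, v) for u, v in zip(a, b))  # coordinate-wise AND
        assert f(join) + f(meet) <= f(a) + f(b) + 1e-12
```

The unary (modular) part cancels in the inequality; only the sign of the quadratic coefficients matters, which is why flipping any pairwise coefficient to a positive value breaks the check.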
Osokin et al. [35] observe that when E_I(y) is submodular, the consistency constraints (11) are the only obstacle preventing problem (9)-(11) from being efficiently solvable, i.e. if we drop them or perform their Lagrangian relaxation, we obtain an easily computable lower bound. In other words, we can look at the following Lagrangian:

  L(y, \lambda) = E_I(y) + \sum_{i \in V} \lambda_i \Big( \sum_{p \in P} y_{ip} - 1 \Big).    (13)

As a function of the binary variables y, the Lagrangian L(y, λ) differs from E_I(y) only by the presence of the unary potentials λ_i y_{ip} and a constant, so L(y, λ) is submodular whenever E_I(y) is submodular.

Applying the Lagrangian relaxation to problem (9)-(11), we bound the solution from below:

  \min_{y \in (10),(11)} E_I(y) = \min_{y \in (10)} \max_{\lambda} L(y, \lambda) \ge \max_{\lambda} \min_{y \in (10)} L(y, \lambda) = \max_{\lambda} D(\lambda)    (14)

where D(λ) is the Lagrangian dual function:

  D(\lambda) = \min_{y \in (10)} \Big( E_I(y) + \sum_{p \in P} \sum_{i \in V} \lambda_i y_{ip} \Big) - \sum_{i \in V} \lambda_i.    (15)

The dual function D(λ) is a minimum of a finite number of linear functions and thus is concave and piecewise-linear (non-smooth). A subgradient g ∈ R^{|V|} of the dual D(λ) can be computed given the result of the inner pseudo-Boolean minimization:

  g_i = \sum_{p \in P} y^*_{ip} - 1,  i ∈ V,    (16)

where \{y^*_{ip}\}_{i \in V}^{p \in P} = \arg\min_{y \in (10)} \big( E_I(y) + \sum_{p \in P} \sum_{i \in V} \lambda_i y_{ip} \big).

The concavity of D(λ), together with the ability to compute a subgradient, makes it possible to tackle the problem of finding the best possible lower bound in family (14) (i.e. the maximization of D(λ)) by convex optimization techniques. The optimization methods are discussed in sec. 5.

Note that if at some point λ the subgradient given by eq. (16) equals zero, it immediately follows that the inequality in (14) becomes an equality, i.e. there is no integrality gap and a certificate of optimality is provided. This effect happens because there exists a labeling y* delivering the minimum of L(y, λ) w.r.t. y and satisfying the consistency constraints (11).

In the special case when all the pairwise potentials are associative (of form (12)), the minimization of the Lagrangian w.r.t. y decomposes into |P| subproblems, each containing only the variables indexed by a particular label p:

  \min_{y \in (10)} L(y, \lambda) = \sum_{p \in P} \min_{y_p \in \{0,1\}^{|V|}} \Phi_p(y_p, \lambda) - \sum_{i \in V} \lambda_i

where \Phi_p(y_p, \lambda) = \sum_{i \in V} (\theta_{ip} + \lambda_i) y_{ip} - \sum_{\{i,j\} \in \mathcal{E}} C_{ij,p} y_{ip} y_{jp}. In this case the minimization tasks w.r.t. y_p can be solved independently, e.g. in parallel.

4.2 High-order MRFs

Consider the problem of minimizing an energy consisting of the high-order potentials (1). The unconstrained minimization of the energy E(x) over the multi-label variables x is equivalent to the minimization of the function E_I(y) (3) over the variables y under the consistency constraints (11) and the discretization constraints (10).

For now let us assume that all the high-order potentials are pattern-based and sparse [37], [29], i.e.

  \theta_C(x_C) = \begin{cases} \hat\theta_{C,d}, & \text{if } x_C = d \in D_C, \\ 0, & \text{otherwise.} \end{cases}    (17)

Here the set D_C contains some selected labelings of the variables x_C, and all the values of the potential are non-positive, i.e. \hat\theta_{C,d} ≤ 0, d ∈ D_C. In the case of a sparse potential the cardinality of the set D_C is much smaller than the number of all possible labelings of the variables x_C, i.e. |D_C| ≪ |P^C|.

In this setting we can use the identity

  -\prod_{\ell \in C} y_\ell = \min_{z \in \{0,1\}} \Big( (|C|-1) z - \sum_{\ell \in C} y_\ell z \Big)

to perform a reduction² of the high-order function E_I(y) to a pairwise function in such a way that

  \min_{y \in \{0,1\}^*} E_I(y) = \min_{z \in \{0,1\}^*,\, y \in \{0,1\}^*} E^*_I(y, z)

where by {0,1}^* we denote Boolean cubes of appropriate sizes and

  E^*_I(y, z) = -\sum_{C \in \mathcal{C}} \sum_{d \in D_C} \hat\theta_{C,d} \Big( (|C|-1) z_{C,d} - \sum_{\ell \in C} y_{\ell d_\ell} z_{C,d} \Big).    (18)

The function E^*_I(y, z) is pairwise and submodular w.r.t. the binary variables y and z, and thus in the absence of additional constraints it can be efficiently minimized by graph cuts [28].

Adding and relaxing constraints (11) in the minimization of (3) gives us the following Lagrangian dual:

  D(\lambda) = \min_{y,z \in \{0,1\}^*} L(y, z, \lambda) = \min_{y,z \in \{0,1\}^*} \Big( E^*_I(y, z) + \sum_{i \in V} \lambda_i \Big( \sum_{p \in P} y_{ip} - 1 \Big) \Big).    (19)

For all λ the Lagrangian L(y, z, λ) is submodular w.r.t. y and z, allowing us to efficiently evaluate D at any point. The function D(λ) is a lower bound on the global minimum of energy (1) and is concave but non-smooth as a function of λ. Analogously to (15), the best possible lower bound can be found using convex optimization techniques.

². This transformation of the binary high-order function (3) is in fact equivalent to a special case of the Type-II binary transformation of [37], but in this form it was proposed much earlier (see e.g. [15] for a review).
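The reduction behind (18) rests entirely on the stated identity −∏_{ℓ∈C} y_ℓ = min_{z∈{0,1}} ((|C|−1)z − Σ_{ℓ∈C} y_ℓ z). A minimal sketch verifying it exhaustively for small hyperedge sizes:

```python
import itertools

# Verify: -prod(y) = min over z in {0,1} of ((|C|-1)*z - sum(y)*z)
# for every binary vector y -- the identity used to build E_I^*(y, z) in (18).
for size in range(1, 6):
    for y in itertools.product((0, 1), repeat=size):
        lhs = -1 if all(y) else 0                       # -prod_{l in C} y_l
        rhs = min(((size - 1) - sum(y)) * z for z in (0, 1))
        assert lhs == rhs
```

When all y_ℓ = 1 the bracket equals −1, so the minimum is attained at z = 1; otherwise the bracket is nonnegative and z = 0 is optimal. This is why one switching variable z_{C,d} per selected labeling d suffices.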
Robust sparse pattern-based potentials. We can generalize the SMR approach to the case of robust sparse high-order potentials using the ideas of Kohli et al. [23] and Rother et al. [37]. Robust sparse high-order potentials, in addition to rewarding the specific labelings d ∈ D_C, also reward labelings whose deviation from some d ∈ D_C is small. We can formulate such a term for hyperedge C in the following way:

  \theta_C(x_C) = \sum_{d \in D_C} \min\Big( 0,\; \hat\theta_{C,d} + \sum_{\ell \in C} w^C_\ell [x_\ell \ne d_\ell] \Big)    (20)

or, equivalently, in terms of the indicator variables y:

  \theta_C(y_C) = \sum_{d \in D_C} \min\Big( 0,\; \hat\theta_{C,d} + \sum_{\ell \in C} w^C_\ell (1 - y_{\ell d_\ell}) \Big).    (21)

The inner min of (21) can be transformed to pairwise form by adding switching variables z (analogously to (18)):

  \min_{z_{C,d} \in \{0,1\}} \Big( \hat\theta_{C,d} z_{C,d} + \sum_{\ell \in C} w^C_\ell z_{C,d} (1 - y_{\ell d_\ell}) \Big).

This function is submodular w.r.t. the variables y and z if the coefficients w^C_\ell are nonnegative. Introducing Lagrange multipliers and maximizing the dual function is completely analogous to the case of non-robust sparse high-order potentials.

One particular instance of robust pattern-based potentials is the robust P^n-Potts model [23]. Here the cardinality of the set D_C equals the number of labels |P|, and all the selected labelings correspond to the uniform labelings of the variables x_C:

  \theta_C(x_C) = \sum_{p \in P} \min\Big( 0,\; -\gamma + \sum_{\ell \in C} (\gamma / Q) [x_\ell \ne p] \Big).    (22)

Here γ ≥ 0 and Q ∈ (0, |C|] are the model parameters. γ represents the reward that the energy gains if all the variables x_C take equal values. Q stands for the number of variables that are allowed to deviate from the selected label p before the penalty is truncated.³

Positive values of \hat\theta_C. Now we describe a simple way to generalize our approach to the case of general high-order potentials, i.e. potentials of form (17) but with the values \hat\theta_{C,d} allowed to be positive. First, note that constraints (10) and (11) imply that for each hyperedge C ∈ \mathcal{C} the equality \sum_{p \in P^C} \prod_{\ell \in C} y_{\ell p_\ell} = 1 holds. Because of this fact we can subtract the constant M_C = \max_{d \in D_C} \hat\theta_C(d) from all values of the potential θ_C simultaneously without changing the minimum of the problem, i.e.

  \min_{y \in (10),(11)} E_I(y) = \min_{y \in (10),(11)} \sum_{C \in \mathcal{C}} \Big( M_C + \sum_{p \in P^C} \tilde\theta_C(p) \prod_{\ell \in C} y_{\ell p_\ell} \Big)

where \tilde\theta_C(p) = \theta_C(p) - M_C. We have all \tilde\theta_C(p) non-positive and can therefore apply the SMR technique described at the beginning of section 4.2. Note that in this case the number of variables z required within the SMR approach can be exponential in the arity of hyperedge C. Therefore SMR is practical when either the arity of all hyperedges is small or there exists a compact representation such as (17) or (20).

³. In the original formulation of the robust P^n-Potts model [23] the minimum operation was used instead of the outer sum in (22). The two schemes are equivalent if the cost of the deviation from the selected labelings is high enough, i.e. Q < 0.5|C|. The substitution of the minimum with a summation in a similar context was used by Rother et al. [37].

4.3 Linear global constraints

Linear constraints on the indicator variables form an important type of global constraints. The following constraints are special cases of this class:

  \sum_{j \in V} y_{jp} = c,    strict class size constraints,    (23)
  \sum_{j \in V} y_{jp} \in [c_1, c_2],    soft class size constraints,
  \sum_{j \in V} y_{jp} I_j = \sum_{j \in V} y_{jq} I_j,    flux equalities,
  \sum_{j \in V} y_{jp} I_j = \mu \sum_{i \in V} y_{ip},    mean,
  \sum_{j \in V} y_{jp} (I_j - \mu)(I_j - \mu)^T = \Sigma \sum_{i \in V} y_{ip},    variance.

Here I_j is the observable scalar or vector value associated with node j, e.g. the intensity or color of the corresponding image pixel. There have been a number of works that separately consider constraints on the total number of nodes that take a specific label (e.g. [31]).

The SMR approach can easily handle general linear constraints such as

  \sum_{i \in V} \sum_{p \in P} w^m_{ip} y_{ip} = c_m,  m = 1, ..., M,    (24)
  \sum_{i \in V} \sum_{p \in P} v^k_{ip} y_{ip} \le d_k,  k = 1, ..., K.    (25)

We introduce additional Lagrange multipliers ξ = {ξ_m}_{m=1,...,M} and π = {π_k}_{k=1,...,K} for constraints (24), (25) and get the Lagrangian

  L(y, \lambda, \xi, \pi) = E_I(y) + \sum_{i \in V} \lambda_i \Big( \sum_{p \in P} y_{ip} - 1 \Big)
    + \sum_{m=1}^{M} \xi_m \Big( \sum_{i \in V} \sum_{p \in P} w^m_{ip} y_{ip} - c_m \Big)
    + \sum_{k=1}^{K} \pi_k \Big( \sum_{i \in V} \sum_{p \in P} v^k_{ip} y_{ip} - d_k \Big)

where the inequality multipliers π_k are constrained to be nonnegative. The Lagrangian is submodular w.r.t. the indicator variables y, and thus the dual function

  D(\lambda, \xi, \pi) = \min_{y \in (10)} L(y, \lambda, \xi, \pi)    (26)

can be efficiently evaluated. The dual function is piecewise linear and concave w.r.t. all variables and thus can be treated similarly to the dual functions introduced earlier.
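As a minimal illustration of how a multiplier on a constraint of type (23)/(24) enters the bound, consider a hypothetical toy problem with a modular binary energy and the strict size constraint Σ_j y_j = c. Brute-force enumeration stands in for the graph cut, and the specific numbers (including the tight multiplier ξ = −0.5) are assumptions of this sketch only:

```python
import itertools

# Toy: 3 binary indicators with modular energy E(y) = sum_j u_j * y_j,
# constrained to have exactly c ones (a strict class-size constraint (23)).
u = [1.0, -2.0, 0.5]
c = 2

def dual(xi):
    """D(xi) = min_y [E(y) + xi * (sum_j y_j - c)]; the energy is separable,
    so the inner minimization is done per coordinate (in general: one cut)."""
    return sum(min(0.0, u_j + xi) for u_j in u) - xi * c

# Constrained optimum by enumeration.
opt = min(sum(u_j * y_j for u_j, y_j in zip(u, y))
          for y in itertools.product((0, 1), repeat=len(u))
          if sum(y) == c)

# Weak duality: every multiplier gives a lower bound; on this instance the
# bound is even tight at xi = -0.5 (found by inspecting breakpoints of D).
assert all(dual(xi / 10.0) <= opt + 1e-9 for xi in range(-30, 31))
assert abs(dual(-0.5) - opt) < 1e-9
```

The same pattern applies to the full Lagrangian (26): each multiplier only shifts the unary coefficients of the submodular inner problem, so one graph cut per dual evaluation still suffices.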
5 ALGORITHMS

In sec. 5.1 we discuss the convex optimization methods that can be used to maximize the dual functions D(λ) constructed within the SMR approach. In sec. 5.2 we develop an alternative approach based on coordinate-wise ascent.

5.1 Convex optimization

We consider several optimization methods that might be used to maximize the SMR dual functions:
1) subgradient ascent methods [30];
2) bundle methods [17];
3) a non-smooth version of L-BFGS [33];
4) proximal methods [48];
5) smoothing-based techniques [39], [48];
6) stochastic subgradient methods [30].

First, we begin with the methods that are not suited to SMR: proximal, smoothing-based, and stochastic. The first two groups of methods are not directly applicable to SMR because they require not only the black-box oracle that evaluates the function and computes the subgradient (in SMR this is done by solving the max-flow problem) but request additional computations: proximal methods require the prox-operator; log-sum-exp smoothing [39] requires solving the sum-product problem instead of max-sum; quadratic smoothing [48] requires explicit computation of convex conjugates.⁴ We are not aware of possibilities of computing any of those in the case of the SMR approach.

Stochastic techniques have proven themselves to be highly efficient for large-scale machine learning problems. Bousquet and Bottou [6, sec. 13.3.2] show that stochastic methods perform well within the estimation-optimization tradeoff but are not as good as optimization methods for the empirical risk (i.e. the optimized function). In the case of SMR we do not face the machine learning tradeoffs, and thus the stochastic subgradient is not suitable.

Further in this section we give a short review of the methods of the three remaining groups (subgradient, bundle, L-BFGS), which require only a black-box oracle that for an arbitrary value of the dual variables λ computes the function value D(λ) and one subgradient ∂D(λ).

Subgradient methods at iteration n perform an update in the direction of the current subgradient, λ^{n+1} = λ^n + α_n ∂D(λ^n), with a step size α_n. Komodakis et al. [30] and Kappes et al. [17] suggest the following adaptive scheme for selecting the step size:

  \alpha_n = \gamma \frac{A_n - D(\lambda^n)}{\| \partial D(\lambda^n) \|_2^2}    (27)

where A_n is the current estimate of the optimum (the best value of the primal solution found so far) and γ is a parameter. Fig. 1 summarizes the overall method.

Input: Lagrangian L(y, λ) (13), initialization λ⁰;
Output: λ* = argmax_λ min_y L(y, λ);
 1: λ := λ⁰;
 2: repeat
 3:   y* := argmin_y L(y, λ);          // run max-flow
 4:   D(λ) := L(y*, λ);                // evaluate the function
 5:   for all i ∈ V do
 6:     g_i := Σ_{p∈P} y*_{ip} − 1;    // compute the subgradient
 7:   if g = 0 then
 8:     return λ;                      // the certificate of optimality
 9:   else
10:     estimate the primal;           // see sec. 5.3
11:     choose step size α;            // e.g. using rule (27)
12:     λ := λ + αg;                   // make the step
13: until convergence criteria are met
14: return λ;

Fig. 1. Subgradient ascent algorithm to maximize the SMR dual (15).

Bundle methods were analyzed by Kappes et al. [17] in the context of MRF energy minimization. A bundle B is a collection of points λ′, values of the dual function D(λ′), and subgradients ∂D(λ′). The next point is computed by the following rule:

  \lambda^{n+1} = \arg\max_{\lambda} \Big( \hat D(\lambda) - \frac{w_n}{2} \| \lambda - \bar\lambda \|_2^2 \Big)    (28)

where \hat D(\lambda) is an upper bound on the dual function:

  \hat D(\lambda) = \min_{(\lambda', D(\lambda'), \partial D(\lambda')) \in B} \big( D(\lambda') + \langle \partial D(\lambda'), \lambda - \lambda' \rangle \big).

The parameter w_n is intended to keep λ^{n+1} close to the current solution estimate \bar\lambda. If the current step does not give a significant improvement, then a null step is performed: the bundle B is updated with (λ^{n+1}, D(λ^{n+1}), ∂D(λ^{n+1})). Otherwise a serious step is performed: in addition to updating the bundle, the current estimate \bar\lambda is replaced by λ^{n+1}. Kappes et al. [17] suggest using the relative improvement of D(λ^{n+1}) and \hat D(λ^{n+1}) over D(\bar\lambda) (threshold m_L) to choose the type of the step. The inverse step size w_n can be chosen adaptively if the previous step was serious:

  w_n = P_{[w_{min}, w_{max}]} \Big( \gamma \frac{A_n - \max_{k=1,...,n} D(\lambda^k)}{\| \partial D(\lambda^n) \|_2^2} \Big)^{-1}    (29)

where P_{[w_{min}, w_{max}]} is the projection onto the segment. In the case of a null step, w_n = w_{n-1}. In summary, the bundle method has the following parameters: γ, w_min, w_max, m_L, and the maximum size of the bundle, b_s. Note that the update (28) implicitly estimates both the direction g^n and the step size α_n.

Another algorithm analyzed by Kappes et al. [17] is the aggregated bundle method proposed by Kiwiel [19] with the same step-size rule (29). The method has the following parameters: γ, w_min, w_max, and a threshold to choose a null or serious step, m_r. Please see the works [17], [19] for details. We have also tried a combined strategy: the LMBM method⁵ [13], where we used only one non-default parameter: the maximum size of a bundle, b_s.

L-BFGS methods choose the direction of the update using the history of function evaluations: S^n ∂D(λ^n), where S^n is a negative semi-definite estimate of the inverse Hessian at λ^n. The step size α_n is typically chosen via approximate one-dimensional optimization, a.k.a. line search. Lewis and Overton [33] proposed a specialized version of L-BFGS for non-smooth functions implemented in the HANSO library.⁶ In this code we varied only one parameter: the maximum rank of the Hessian estimator, h_r.

⁴. Zach et al. [48] directly construct the smooth approximation f_ε = (f* + ε‖·‖²₂/2)* where ε > 0 is a smoothing parameter and (·)* is the convex conjugate. In their case the function f is a sum of summands of simple form for which analytical computation of the convex conjugates is possible. In our case each summand is a min-cut LP, and its conjugate is a max-flow LP. The addition of quadratic terms to f* makes f_ε quadratic instead of a min-cut LP, and thus efficient max-flow algorithms are no longer applicable.

⁵. http://napsu.karmitsa.fi/lmbm/lmbmu/lmbm-mex.tar.gz
⁶. http://www.cs.nyu.edu/overton/software/hanso/

5.2 Coordinate-wise optimization

An alternative approach to maximizing the dual function D(λ) consists in selecting a group of variables and finding the maximum w.r.t. them. This strategy was used in the max-sum diffusion [44] and MPLP [12] algorithms. In this section we derive an analogous algorithm for SMR in the case of pairwise associative potentials. The algorithm is based on the concept of min-marginal averaging used in [43], [25] to optimize the TRW dual function.

Consider some value λ^old of the dual variables λ and a selected node j ∈ V. We will show how to construct a new value λ^new such that D(λ^new) ≥ D(λ^old). The new value will differ from the old value only in one coordinate λ_j, i.e. λ^new_i = λ^old_i, i ≠ j.

Consider the unary min-marginals of the Lagrangian (13):

  MM_{jp,k} L(y, \lambda^{old}) = \min_{y \in (10),\, y_{jp} = k} L(y, \lambda^{old})

where k ∈ {0,1}, p ∈ P. The min-marginals can be computed efficiently using the dynamic graph cut framework [21]. Denote the differences of the min-marginals for labels 0 and 1 by δ^j_p:

  \delta^j_p = MM_{jp,0} L(y, \lambda^{old}) - MM_{jp,1} L(y, \lambda^{old}).

Let δ^j_{(1)} be the maximum of the δ^j_p and δ^j_{(2)} the next maximum, with p^{(1)}_j and p^{(2)}_j the corresponding indices: p^{(1)}_j = \arg\max_{p \in P} \delta^j_p, \delta^j_{(1)} = \delta^j_{p^{(1)}_j}; p^{(2)}_j = \arg\max_{p \in P \setminus \{p^{(1)}_j\}} \delta^j_p, \delta^j_{(2)} = \delta^j_{p^{(2)}_j}. Let us construct the new value λ^new_j of the dual variable λ_j by the following rule:

  \lambda^{new}_j = \lambda^{old}_j + \Delta_j    (30)

where Δ_j is a number from the segment [\delta^j_{(2)}, \delta^j_{(1)}].

Theorem 1. Rule (30), when applied to a pairwise energy with associative pairwise potentials, delivers the maximum of the function D(λ) w.r.t. the variable λ_j.

The proof relies on the intuition that rule (30) directly assigns the min-marginals MM_{jp,k} L(y, λ), p ∈ P, k ∈ {0,1}, such values that only one subproblem p has y_{jp} equal to 1. We provide the full proof in the supplementary material.

Theorem 1 allows us to construct the coordinate-ascent algorithm (fig. 2).

Input: Lagrangian L(y, λ) (13), initialization λ⁰;
Output: coordinate-wise maximum of D(λ);
 1: λ := λ⁰;
 2: repeat
 3:   converged := true;
 4:   for all j ∈ V do
 5:     for all p ∈ P do
 6:       compute MM_{jp,0} L(y, λ), MM_{jp,1} L(y, λ);
 7:       δ^j_p := MM_{jp,0} L(y, λ) − MM_{jp,1} L(y, λ);
 8:     find δ^j_{(1)} and δ^j_{(2)};
 9:     if δ^j_{(1)} and δ^j_{(2)} are of the same sign then
10:       converged := false;
11:       λ_j := λ_j + 0.5 (δ^j_{(1)} + δ^j_{(2)});
12: until converged = true;
13: return λ;

Fig. 2. Coordinate ascent algorithm for maximizing the dual D(λ) (15) in the case of pairwise associative potentials.

This result cannot be extended to the cases of general pairwise or high-order potentials. We are not aware of analytical updates like (30) for these cases. The update w.r.t. a single variable can be computed numerically, but the resulting algorithm would be very slow.
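The loop of fig. 1 can be prototyped in a few lines. In the sketch below the max-flow call is replaced by brute-force enumeration (viable only at toy sizes), the two-node associative instance is hypothetical, and a simple diminishing step size is used in place of rule (27):

```python
import itertools

# Toy pairwise associative instance: 2 nodes, 2 labels, one Potts-like edge.
V, P = [0, 1], [0, 1]
theta_u = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
C_pair = 0.5  # reward for the edge (0, 1) when labels agree, cf. (12)

def lagrangian(y, lam):
    """L(y, lambda) of eq. (13) for this instance."""
    e = sum(theta_u[i, p] * y[i, p] for i in V for p in P)
    e -= C_pair * sum(y[0, p] * y[1, p] for p in P)   # theta_{01,pp} = -C
    e += sum(lam[i] * (sum(y[i, p] for p in P) - 1) for i in V)
    return e

def inner_min(lam):
    """Stand-in for the max-flow call: enumerate all binary y."""
    best = None
    for bits in itertools.product((0, 1), repeat=len(V) * len(P)):
        y = {(i, p): bits[i * len(P) + p] for i in V for p in P}
        val = lagrangian(y, lam)
        if best is None or val < best[0]:
            best = (val, y)
    return best

# Subgradient ascent (fig. 1) with a diminishing step size.
lam = [0.0, 0.0]
best_dual = float("-inf")
for n in range(50):
    d_val, y_star = inner_min(lam)
    best_dual = max(best_dual, d_val)
    g = [sum(y_star[i, p] for p in P) - 1 for i in V]  # subgradient, eq. (16)
    if all(gi == 0 for gi in g):
        break                                          # certificate of optimality
    lam = [li + gi / (n + 1.0) for li, gi in zip(lam, g)]

# Weak duality (14): the best dual value lower-bounds the true optimum.
opt = min(theta_u[0, x0] + theta_u[1, x1] - C_pair * (x0 == x1)
          for x0 in P for x1 in P)
assert best_dual <= opt + 1e-9
```

On this instance the relaxation is tight, so the best dual value tracked by the loop matches the optimal energy; on instances with an integrality gap the same loop would oscillate below it, as discussed in sec. 6.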
5.3 Getting a primal solution

Obtaining a primal solution given some value of the dual variables is an important practical issue in the Lagrangian relaxation framework. There are two different tasks: obtaining a feasible primal point of the relaxation and obtaining the final discrete solution. The first task was studied by Savchynskyy et al. [39] and Zach et al. [48] for algorithms based on smoothing the dual function. Recently, Savchynskyy and Schmidt [38] proposed a general approach that can be applied to SMR whenever a subgradient algorithm is used to maximize the dual. The second aspect is usually resolved using some heuristics. Osokin and Vetrov [34] propose a heuristic method to obtain the discrete primal solution within the SMR approach. Their method is based on the idea of block-coordinate descent w.r.t. the primal variables and is similar to the analogous heuristics in TRW-S [25].

6 THEORETICAL ANALYSIS

In this section we explore some properties of the maximum points of the dual functions and express their maximum values explicitly in the form of LPs. The latter allows us to reason about the tightness of the obtained lower bounds and to provide analytical comparisons against other approaches.

Fig. 3. The three possible cases at the maximum of the SMR dual function. The optimal energy value is shown by the horizontal dotted line. Solid lines show faces of the hypograph of the dual function D(λ).
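As a preview of the quantities analyzed in sec. 6.1 below, the sets Z_ip(λ) of optimal indicator values and the two agreement conditions can be computed by enumeration on a toy instance (brute force in place of graph cuts; the instance and the choice of λ are assumptions of this sketch):

```python
import itertools

# Same flavor of toy instance as in sec. 4.1: 2 nodes, 2 labels, Potts edge.
V, P = [0, 1], [0, 1]
theta_u = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
C_pair = 0.5

def lagrangian(y, lam):
    e = sum(theta_u[i, p] * y[i, p] for i in V for p in P)
    e -= C_pair * sum(y[0, p] * y[1, p] for p in P)
    e += sum(lam[i] * (sum(y[i, p] for p in P) - 1) for i in V)
    return e

def Z_sets(lam):
    """Z_ip(lambda): values taken by y_ip over all minimizers of L(., lam)."""
    ys, vals = [], {}
    for bits in itertools.product((0, 1), repeat=len(V) * len(P)):
        y = {(i, p): bits[i * len(P) + p] for i in V for p in P}
        vals[bits] = lagrangian(y, lam)
        ys.append((bits, y))
    m = min(vals.values())
    argmins = [y for bits, y in ys if abs(vals[bits] - m) < 1e-12]
    return {(i, p): {y[i, p] for y in argmins} for i in V for p in P}

def weak_agreement(Z):
    ok1 = all(any(1 in Z[i, p] for p in P) for i in V)
    ok2 = all(Z[i, p] != {1} or all(0 in Z[i, q] for q in P if q != p)
              for i in V for p in P)
    return ok1 and ok2

def strong_agreement(Z):
    return all(sum(Z[i, p] == {1} for p in P) == 1
               and all(Z[i, q] == {0} for q in P if Z[i, q] != {1})
               for i in V)

Z = Z_sets([0.0, 0.0])  # lambda = 0 happens to maximize D on this instance
assert weak_agreement(Z)        # consistent with the weak agreement property
assert not strong_agreement(Z)  # ties among minimizers: strong agreement fails
```

Here the Lagrangian has several minimizers at the maximizing λ, so some sets Z_ip contain both 0 and 1: the weak agreement condition holds while the strong one does not, which is exactly the middle case discussed in sec. 6.1.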
The three possible cases at the maximum of the SMR dual function. The optimal energy value is shown by the horizontal dotted line. Solid lines show faces of the hypograph of the dual function D(λ).

6.1 Definitions and theoretical properties

Denote the possible optimal labelings of the binary variable y_ip associated with node i and label p given the Lagrange multipliers λ by

  Z_ip(λ) = { z | ∃ y^λ ∈ Argmin_y L(y,λ) : y^λ_ip = z },   (50)

where L(y,λ) is the Lagrangian constructed within the SMR approach (13).

Definition 1. Point λ satisfies the strong agreement condition if for all nodes i, for one label p the set Z_ip(λ) equals {1} and for the other labels it equals {0}:

  ∀i ∈ V ∃! p ∈ P : Z_ip(λ) = {1}, ∀q ≠ p Z_iq(λ) = {0}.

This definition implies that for a strong agreement point λ there exists the unique binary labeling y consistent with all sets Z_ip(λ) (y_ip ∈ Z_ip(λ)) and with the consistency constraints (51). Consequently, such y defines a labeling x ∈ P^V that delivers the global minimum of the initial energy (1).

Definition 2. Point λ satisfies the weak agreement condition if for all nodes i ∈ V both statements hold true:
1) ∃ p ∈ P : 1 ∈ Z_ip(λ);
2) ∀p ∈ P : Z_ip(λ) = {1} ⇒ ∀q ≠ p, 0 ∈ Z_iq(λ).

This definition means that for a weak agreement point λ there exists a binary labeling y consistent with all sets Z_ip(λ) and with the consistency constraints (51). It is easy to see that the strong agreement condition implies the weak agreement condition.

Denote a maximum point of the dual function D(λ) (15) by λ* and an assignment of the primal variables that minimizes the Lagrangian L(y,λ*) by y*. We prove the following property of λ* in the supplementary material.

Theorem 2. A maximum point λ* of the dual function D(λ) satisfies the weak agreement condition.

In general, three situations are possible at the global maximum of the dual function of SMR (see fig. 3 for illustrations):
1) The strong agreement condition holds; the binary labeling y* satisfying the consistency constraints (51) (and thus the labeling x* as well) can be easily reconstructed; there is no integrality gap.
2) Some (maybe all) sets Z_ip(λ*) consist of multiple elements; x* consistent with (51) is non-trivial to reconstruct, but there is still no integrality gap.
3) Some sets Z_ip(λ*) consist of multiple elements; a labeling y* that is both consistent with (51) and a minimum of the Lagrangian does not exist; the integrality gap is non-zero.

In situations 1 and 2 there exists a binary labeling y* that simultaneously satisfies the consistency constraints (51) and delivers the minimum of the Lagrangian L(y,λ*). Labeling y* defines the horizontal hyperplane that contains the point (λ*, D(λ*)).

In situation 1 labeling y* is easy to find because the subgradient defined by (16) equals zero. Situation 2 can be trickier but in certain cases might be resolved exactly; some examples are given in [35].

In practice, the convex optimization methods described in sec. 5.1 typically converge to a point that satisfies the strong agreement condition rather quickly when there is no integrality gap. In the case of a non-zero integrality gap all the methods start to oscillate around the optimum and do not provide a point satisfying even the weak agreement condition. The coordinate-wise algorithm (fig. 2) often converges to points with low values of the dual function (there exist multiple coordinate-wise maximums of the dual function) but allows us to obtain a point that satisfies the weak agreement condition. This result is summarized in theorem 3, proven in the supplementary material.

Theorem 3. The coordinate ascent algorithm (fig. 2) when applied to a pairwise energy with associative pairwise potentials returns a point that satisfies the weak agreement condition.

6.2 Tightness of the lower bounds

6.2.1 Pairwise associative MRFs

Osokin et al. [35] show that in the case of pairwise associative MRFs maximizing the Lagrangian dual (15) is equivalent to solving the standard LP relaxation of the pairwise energy (2); thus the SMR lower bound equals the ones of competitive methods such as [30], [17], [39] and in general is tighter than the lower bounds of TRW-S [25] and MPLP [12] (TRW-S and MPLP can get stuck at bad coordinate-wise maximums). In addition, Osokin et al. [35] show that in the case of pairwise associative MRFs the standard LP relaxation is equivalent to the Kleinberg-Tardos (KT) relaxation of a uniform metric labeling problem [20]. In this section we reformulate the most relevant result for the sake of completeness. The proof will be given later as a corollary of our more general result for high-order MRFs (theorems 5 and 6).

Theorem 4. In the case of pairwise MRFs with associative pairwise potentials (2), (12) the maximum of the Lagrangian dual (15) equals the minimum value of the standard LP relaxation (6), (7), (8) of the discrete energy minimization problem.

6.2.2 High-order MRFs

In this section we state the general result, compare it to the known LP relaxations of high-order terms and further elaborate on one important special case. The following theorem explicitly defines the LP relaxation of energy (1) that is solved by SMR.

Theorem 5. In the case of a high-order MRF with sparse pattern-based potentials (17) the maximum of the SMR dual function D(λ) (19) equals the minimum value of the following linear program:

  min_y Σ_{C∈C} Σ_{d∈D_C} θˆ_{C,d} y_{C,d}   (31)
  s.t. y_ip, y_{C,d} ∈ [0,1], ∀i ∈ V, ∀p ∈ P, ∀C ∈ C, ∀d ∈ D_C,   (32)
       y_{C,d} ≤ y_{ℓ d_ℓ}, ∀C ∈ C, ∀d ∈ D_C, ∀ℓ ∈ C,   (33)
       Σ_{p∈P} y_ip = 1, ∀i ∈ V.   (34)

Proof: We start the proof with two lemmas.

Lemma 1. Let y_1, ..., y_K be real numbers belonging to the segment [0,1], K > 1. Consider the following linear program:

  min_{z, z_1, ..., z_K ∈ [0,1]} z(K−1) − Σ_{k=1}^K z_k   (35)
  s.t. z_k ≤ z, z_k ≤ y_k.   (36)

The point z = z_1 = ··· = z_K = min_{k=1,...,K} y_k delivers the minimum of LP (35)-(36).

Lemma 1 is proven in the supplementary material.

Lemma 2. Consider the convex optimization problem

  min_{y∈Y} f(y),   (37)
  s.t. Ay = b,   (38)

where Y ⊆ R^n is a convex set, A is a real-valued m×n matrix, b ∈ R^m, and f : R^n → R is a convex function. Let the set Y contain a non-empty interior and the solution of problem (37)-(38) be finite. Then the value of the solution of (37)-(38) equals max_{λ∈R^m} min_{y∈Y} L(y,λ), where L(y,λ) = f(y) + λᵀ(Ay − b).

Lemma 2 is a direct corollary of the strong duality theorem with the linearity constraint qualification [3, Th. 6.4.2].

Let us finish the proof of theorem 5. All values of the potentials are non-positive, θˆ_{C,d} ≤ 0, so lemma 1 implies that the LP (31)-(34) is equivalent to the following LP:

  min_{y,z} −Σ_{C∈C} Σ_{d∈D_C} θˆ_{C,d} ( (|C|−1) z_{C,d} − Σ_{ℓ∈C} z^ℓ_{C,d} )   (39)
  s.t. z^ℓ_{C,d} ∈ [0,1], ∀C ∈ C, ∀d ∈ D_C, ∀ℓ ∈ C,   (40)
       y_ip, z_{C,d} ∈ [0,1], ∀i ∈ V, ∀p ∈ P, ∀C ∈ C, ∀d ∈ D_C,
       z^ℓ_{C,d} ≤ z_{C,d}, z^ℓ_{C,d} ≤ y_{ℓ d_ℓ}, ∀C ∈ C, ∀d ∈ D_C, ∀ℓ ∈ C,   (41)
       Σ_{p∈P} y_ip = 1, ∀i ∈ V.   (42)

Denote the target function (39) by Q(y,z) and let

  R(λ) = min_{y,z ∈ (40)−(41)} ( Q(y,z) + Σ_{i∈V} λ_i ( Σ_{p∈P} y_ip − 1 ) ).

Task (39), (40)-(41) is equivalent to the standard LP-relaxation of function (18) of the binary variables y and z (see lemma 3 in the supplementary material). As θˆ_{C,d} ≤ 0, function (18) is submodular and thus its LP-relaxation is tight [27, Th. 3], which means that R(λ) = D(λ) for an arbitrary λ. The function f(y) = min_{z∈(40)−(41)} Q(y,z), the set Y defined by y_ip ∈ [0,1], and the matrix A and vector b defined by (42) satisfy the conditions of lemma 2, which implies that the value of the solution of LP (39)-(42) equals R(λ*), where λ* = argmax_λ R(λ). The proof is finished by the equality R(λ*) = D(λ*).

Corollary 1. For the case of a pairwise energy (2) with θ_{ij,pq} ≤ 0 SMR returns the lower bound equal to the solution of the following LP-relaxation:

  min_y Σ_{i∈V} Σ_{p∈P} θ_ip y_ip + Σ_{{i,j}∈E} Σ_{p,q∈P} θ_{ij,pq} y_{ij,pq}   (43)
  s.t. y_ip ∈ [0,1], ∀i ∈ V, ∀p ∈ P,
       y_{ij,pq} ∈ [0,1], ∀{i,j} ∈ E, ∀p,q ∈ P,
       y_{ij,pq} ≤ y_ip, y_{ij,pq} ≤ y_jq, ∀{i,j} ∈ E, ∀p,q ∈ P,   (44)
       Σ_{p∈P} y_ip = 1, ∀i ∈ V.   (45)

Permuted Pn-Potts. Komodakis and Paragios [29] formulate a generally tighter relaxation than (31)-(34) by adding constraints responsible for the marginalization of the high-order potentials w.r.t. all but one variable:

  Σ_{d∈D_C : d_ℓ = p_0} y_{C,d} = y_{ℓ p_0}, ∀C ∈ C, ∀p_0 ∈ P, ∀ℓ ∈ C.   (46)
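The pivotal computation in the proof above is lemma 1. Its claimed minimizer can be sanity-checked numerically: the sketch below enumerates LP (35)-(36) on a small grid that contains all the y_k (brute force stands in for an LP solver here, and the instance is a hypothetical example):

```python
from itertools import product

# Grid of candidate values, chosen so that every y_k and min_k y_k lies on it.
GRID = [0.0, 0.25, 0.5, 0.75, 1.0]
y = (0.5, 0.75, 0.25)   # example y_1, ..., y_K with K = 3
K = len(y)

def objective(z, zs):
    # Target function of LP (35): z(K-1) - sum_k z_k.
    return z * (K - 1) - sum(zs)

# Enumerate all grid-feasible points satisfying z_k <= z and z_k <= y_k.
best = min(
    objective(z, zs)
    for z in GRID
    for zs in product(GRID, repeat=K)
    if all(zk <= z and zk <= yk for zk, yk in zip(zs, y))
)

m = min(y)
claimed = objective(m, (m,) * K)   # the point z = z_1 = ... = z_K = min_k y_k
assert best == claimed == -m      # no grid point beats the lemma's minimizer
```

Since the claimed minimizer lies on the grid, the grid minimum can only match it or beat it; the assertion confirms it is not beaten, in line with the lemma.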
In this paragraph we show that in certain cases the relaxation of [29] is equivalent to (31)-(34), i.e. constraints (46) are redundant.

Definition 3. Potential θ_C is called permuted Pn-Potts if

  ∀C ∈ C ∀d′,d′′ ∈ D_C ∀i ∈ C : d′ ≠ d′′ ⇒ d′_i ≠ d′′_i.

In permuted Pn-Potts potentials all preferable configurations differ from each other in all variables. The Pn-Potts potential described by Kohli et al. [22] is a special case.

Theorem 6. If all higher-order potentials are permuted Pn-Potts, then the maximum of the dual function (19) is equal to the value of the solution of LP (31)-(34), (46).

Proof: First, all values θˆ_{C,d} are non-positive, so at the minimum the variables y_{C,d} take the maximal possible values. This implies that the equality in (46) can be substituted with an inequality. Second, for a permuted Pn-Potts potential θ_C, for each ℓ ∈ C and for each p_0 ∈ P there exists no more than one labeling d ∈ D_C such that d_ℓ = p_0, which means that all the sums in (46) can be substituted by the single summands. These two observations prove that (46) is equivalent to (33).

The LP result for the associative pairwise MRFs (theorem 4) immediately follows from theorem 6 and corollary 1, because all associative pairwise potentials (12) are permuted Pn-Potts and condition (46) reduces to the marginalization constraints (8).

6.2.3 Global linear constraints

Theorem 7. The maximum of the dual function D(λ,ξ,π) (26) (constructed for the minimization of the pairwise energy (2) under the global linear constraints (24) and (25)) under the constraints π_k ≥ 0 is equal to the value of the solution of the LP relaxation (67)-(69) with the addition of the global linear constraints (24) and (25).

The proof is similar to the proof of theorem 5 but requires including extra terms in the Lagrangian. Theorem 7 generalizes to the case of high-order potentials.

7 EXPERIMENTAL EVALUATION

7.1 Pairwise associative MRFs

In this section we evaluate the performance of the different optimization methods discussed in sec. 5.1 and compare the SMR approach against the dual decomposition on trees used in [30], [17] (DD TRW) and the TRW-S method [25].

Dataset. To perform our tests we use the MRF instances created by Alahari et al. [1].7 We work with instances of two types: stereo (four MRFs) and semantic image segmentation (five MRFs). All of them are pairwise MRFs with a 4-connectivity system and Potts pairwise potentials. The instances of these two groups are of significantly different sizes, and thus we report the results separately.

Parameter choice. For each group and for each optimization method we use grid search to find the optimal method parameters. As a criterion we use the maximum value of the lower bound obtained within 25 seconds for segmentation and 200 seconds for stereo. For the bundle method we observe that larger bundles lead to better results but increase the computation time; the value b_s = 100 is a compromise for both the stereo and segmentation datasets. The same reasoning gives us b_s = 10 for LMBM and h_r = 10 for L-BFGS. The parameter w_min of the (aggregated) bundle method did not affect the performance, so we used 10^−10 as in [17]. The remaining parameters are presented in table 1.

TABLE 1
The optimal parameter values chosen by the grid search in sec. 7.1.

  Method                  Segmentation                    Stereo
  DD TRW, subgradient     γ=0.3                           γ=0.1
  DD TRW, bundle          γ=0.01, mL=0.05, wmax=1000      γ=0.01, mL=0.05, wmax=2000
  DD TRW, aggr. bundle    γ=0.02, wmax=100, mr=0.0005     γ=0.01, mr=0.002, wmax=500
  SMR, subgradient        γ=0.7                           γ=0.3
  SMR, bundle             γ=0.1, mL=0.2, wmax=100         γ=0.01, mL=0.1, wmax=100
  SMR, aggr. bundle       γ=0.02, mr=0.001, wmax=500      γ=0.02, mr=0.001, wmax=1000

Implementation details. For L-BFGS and LMBM we use the original authors' implementations. All the other optimization routines are implemented in MATLAB. Both SMR and DD TRW oracles are implemented in C++: for SMR we used the dynamic version of the BK max-flow algorithm [7], [21]; for DD TRW we used the domestic implementation of dynamic programming sped up specifically for the Potts model. To estimate the primal solution at the current point we use 5 iterations of ICM (Iterated Conditional Modes) [4] everywhere. We normalize all the energies in such a way that the energy of the α-expansion result corresponds to 100 and the sum of the minimums of the unary potentials to 0. All our codes are available online.8

Results. Fig. 4 shows the plots of the averaged lower bounds vs. time for the stereo and segmentation instances. See fig. 7 in the supplementary material for more detailed plots. We observe that SMR outperforms DD TRW for almost all optimization methods, and for the segmentation instances it outperforms TRW-S as well. When α-expansion was not able to find the global minimum of the energy (one segmentation instance and all stereo instances), all the tested methods obtained primal solutions with lower energies than the energies obtained by α-expansion.

7.2 High-order potentials

In this section we evaluate the performance of the SMR approach on the high-order robust Pn-Potts

7. http://www.di.ens.fr/∼alahari/data/pami10data.tgz
8. https://github.com/aosokin/submodular-relaxation
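The implementation details in sec. 7.1 mention rounding the current point with a few iterations of ICM [4]. A minimal sketch of ICM for a Potts pairwise model follows; it is an illustration with hypothetical potentials, not the authors' MATLAB/C++ implementation:

```python
def icm(unary, edges, weight, n_labels, x, n_iters=5):
    """Iterated Conditional Modes for a Potts pairwise model:
    repeatedly re-label each node with its locally optimal label."""
    nbrs = {i: [] for i in range(len(unary))}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(n_iters):
        for i in range(len(unary)):
            # Potts penalty: pay `weight` for each neighbor with a different label.
            cost = lambda p: unary[i][p] + weight * sum(x[j] != p for j in nbrs[i])
            x[i] = min(range(n_labels), key=cost)
    return x

def energy(unary, edges, weight, x):
    return (sum(unary[i][x[i]] for i in range(len(unary)))
            + weight * sum(x[i] != x[j] for i, j in edges))

# Toy 4-node chain with 3 labels (hypothetical potentials).
unary = [[0.0, 2.0, 2.0], [2.0, 0.5, 2.0], [2.0, 0.4, 2.0], [0.1, 2.0, 2.0]]
edges = [(0, 1), (1, 2), (2, 3)]
x0 = [2, 2, 2, 2]
e_before = energy(unary, edges, 1.0, x0)
x = icm(unary, edges, 1.0, 3, list(x0), n_iters=5)
e_after = energy(unary, edges, 1.0, x)
assert e_after <= e_before  # ICM never increases the energy
```

Each single-node update minimizes the conditional energy given the rest of the labeling, so the total energy is non-increasing, which is what makes ICM a safe (if greedy) rounding step.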