A Method for Finding Structured Sparse Solutions to Non-negative Least Squares Problems with Applications

Ernie Esser^1, Yifei Lou^1, Jack Xin^1

Abstract

Demixing problems in many areas such as hyperspectral imaging and differential optical absorption spectroscopy (DOAS) often require finding sparse non-negative linear combinations of dictionary elements that match observed data. We show how aspects of these problems, such as misalignment of DOAS references and uncertainty in hyperspectral endmembers, can be modeled by expanding the dictionary with grouped elements and imposing a structured sparsity assumption that the combinations within each group should be sparse or even 1-sparse. If the dictionary is highly coherent, it is difficult to obtain good solutions using convex or greedy methods, such as non-negative least squares (NNLS) or orthogonal matching pursuit. We use penalties related to the Hoyer measure, which is the ratio of the l1 and l2 norms, as sparsity penalties to be added to the objective in NNLS-type models. For solving the resulting nonconvex models, we propose a scaled gradient projection algorithm that requires solving a sequence of strongly convex quadratic programs. We discuss its close connections to convex splitting methods and difference of convex programming. We also present promising numerical results for example DOAS analysis and hyperspectral demixing problems.

^1 Department of Mathematics, UC Irvine, Irvine, CA 92697. Author emails: [email protected], [email protected], [email protected]. The work was partially supported by NSF grants DMS-0911277, DMS-0928427, and DMS-1222507.

1 Introduction

A general demixing problem is to estimate the quantities or concentrations of the individual components of some observed mixture. Often a linear mixture model is assumed [1]. In this case the observed mixture b is modeled as a linear combination of references for each component known to possibly be in the mixture. If we put these references in the columns of a dictionary matrix A, then the mixing model is simply Ax = b. Physical constraints often mean that x should be non-negative, and depending on the application we may also be able to make sparsity assumptions about the unknown coefficients x. This can be posed as a basis pursuit problem in which we are interested in finding a sparse, and perhaps also non-negative, linear combination of dictionary elements that matches the observed data. This is a very well studied problem. Some standard convex models are non-negative least squares (NNLS) [2, 3],

\min_{x \ge 0} \frac{1}{2} \|Ax - b\|^2    (1)

and methods based on l1 minimization [4, 5, 6]. There are also variations that enforce different group sparsity assumptions on x [7, 8, 9].
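As a point of reference for the models that follow, the NNLS problem (1) can be solved directly with standard tools. The short Python sketch below is only an illustration on synthetic data (the dictionary, sizes, and noise level are made up, not taken from the paper's experiments); it uses scipy.optimize.nnls to solve (1).

    import numpy as np
    from scipy.optimize import nnls

    # Illustrative sizes only: W spectral channels, N dictionary columns.
    rng = np.random.default_rng(0)
    W, N = 100, 20
    A = np.abs(rng.standard_normal((W, N)))   # non-negative reference "spectra"
    A /= np.linalg.norm(A, axis=0)            # l2-normalize the columns
    x_true = np.zeros(N)
    x_true[[3, 11]] = [0.7, 1.2]              # a 2-sparse non-negative mixture
    b = A @ x_true + 0.01 * rng.standard_normal(W)

    # Solve min_{x >= 0} 0.5 * ||Ax - b||^2, i.e., model (1).
    x_nnls, residual = nnls(A, b)
    print("support of NNLS solution:", np.nonzero(x_nnls > 1e-6)[0])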
In this paper we are interested in how to deal with uncertainty in the dictionary. The case when the dictionary is unknown is dealt with in sparse coding and non-negative matrix factorization (NMF) problems [10, 11, 12, 13, 14, 15], which require learning both the dictionary and a sparse representation of the data. We are, however, interested in the case where we know the dictionary but are uncertain about each element. One example we will study in this paper is differential optical absorption spectroscopy (DOAS) analysis [16], for which we know the reference spectra but are uncertain about how to align them with the data because of wavelength misalignment. Another example we will consider is hyperspectral unmixing [17, 18, 19]. Multiple reference spectral signatures, or endmembers, may have been measured for the same material, and they may all be slightly different if they were measured under different conditions. We may not know ahead of time which one to choose that is most consistent with the measured data. Although there is previous work that considers noise in the endmembers [20] and represents endmembers as random vectors [21], we may not always have a good general model for endmember variability. For the DOAS example, we do have a good model for the unknown misalignment [16], but even so, incorporating it may significantly complicate the overall model. Therefore, for both examples, instead of attempting to model the uncertainty, we propose to expand the dictionary to include a representative group of possible elements for each uncertain element, as was done in [22].

The grouped structure of the expanded dictionary is known by construction, and this allows us to make additional structured sparsity assumptions about the corresponding coefficients. In particular, the coefficients should be extremely sparse within each group of representative elements, and in many cases we would like them to be at most 1-sparse. We will refer to this as intra group sparsity. If we expected sparsity of the coefficients for the unexpanded dictionary, then this carries over to an inter group sparsity assumption about the coefficients for the expanded dictionary. By inter group sparsity we mean that when the coefficients are split into groups, the number of groups containing nonzero elements should also be small. Modeling structured sparsity by applying sparsity penalties separately to overlapping subsets of the variables has been considered in a much more general setting in [8, 23].
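To make the dictionary expansion concrete, the sketch below builds a grouped dictionary by stacking several shifted copies of each reference column, one simple stand-in for DOAS-style wavelength misalignment, and records the column indices of each group for later use by the intra group penalties. The construction and function name are our own illustration, not the paper's code.

    import numpy as np

    def expand_dictionary(A0, shifts):
        """Expand each column of A0 into a group of circularly shifted copies.

        Returns the expanded dictionary A and a list of index arrays, one per group,
        so that A[:, groups[j]] holds the representatives of original column j.
        This is only an illustrative stand-in for the paper's DOAS/endmember groups.
        """
        cols, groups, start = [], [], 0
        for j in range(A0.shape[1]):
            block = np.stack([np.roll(A0[:, j], s) for s in shifts], axis=1)
            block /= np.linalg.norm(block, axis=0)   # keep columns l2-normalized
            cols.append(block)
            groups.append(np.arange(start, start + block.shape[1]))
            start += block.shape[1]
        return np.hstack(cols), groups

    # Example: 3 reference spectra, each expanded into a group of 5 shifted copies.
    rng = np.random.default_rng(1)
    A0 = np.abs(rng.standard_normal((100, 3)))
    A, groups = expand_dictionary(A0, shifts=range(-2, 3))
    print(A.shape, [g.tolist() for g in groups])     # (100, 15), three groups of size 5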
The expanded dictionary is usually an underdetermined matrix with the property that it is highly coherent, because the added columns tend to be similar to each other. This makes it very challenging to find good sparse representations of the data using standard convex minimization and greedy optimization methods. If A satisfies certain properties related to its columns not being too coherent [24], then sufficiently sparse non-negative solutions are unique and can therefore be found by solving the convex NNLS problem. These assumptions are usually not satisfied for our expanded dictionaries, and while NNLS may still be useful as an initialization, it does not by itself produce sufficiently sparse solutions. Similarly, our expanded dictionaries usually do not satisfy the incoherence assumptions required for l1 minimization or greedy methods like Orthogonal Matching Pursuit (OMP) to recover the l0-sparse solution [25, 26]. However, with an unexpanded dictionary having relatively few columns, these techniques can be effectively used for sparse hyperspectral unmixing [27].

The coherence of our expanded dictionary means we need to use different tools to find good solutions that satisfy our sparsity assumptions. We would like to use a variational approach as similar as possible to the NNLS model that enforces the additional sparsity while still allowing all the groups to collaborate. We propose adding nonconvex sparsity penalties to the NNLS objective function (1). We can apply these penalties separately to each group of coefficients to enforce intra group sparsity, and we can simultaneously apply them to the vector of all coefficients to enforce additional inter group sparsity. From a modeling perspective, the ideal sparsity penalty is l0. There is a very interesting recent work that directly deals with l0 constraints and penalties via a quadratic penalty approach [28]. If the variational model is going to be nonconvex, we prefer to work with a differentiable objective when possible. We therefore explore the effectiveness of sparsity penalties based on the Hoyer measure [29, 30], which is essentially the ratio of the l1 and l2 norms. In previous works, this has been successfully used to model sparsity in NMF and blind deconvolution applications [29, 31, 32]. We also consider the difference of the l1 and l2 norms. By the relationship ‖x‖_1 - ‖x‖_2 = ‖x‖_2 (‖x‖_1/‖x‖_2 - 1), we see that while the ratio of norms is constant in radial directions, the difference increases moving away from the origin except along the axes. Since the Hoyer measure is twice differentiable on the non-negative orthant away from the origin, it can be locally expressed as a difference of convex functions, and convex splitting or difference of convex (DC) methods [33] can be used to find a local minimum of the nonconvex problem. Some care must be taken, however, to deal with its poor behavior near the origin. It is even easier to apply DC methods when using l1 - l2 as a penalty, since this is already a difference of convex functions, and it is well defined at the origin.

The paper is organized as follows. In Section 2 we define the general model, describe the dictionary structure and show how to use both the ratio and the difference of the l1 and l2 norms to model our intra and inter group sparsity assumptions. Section 3 derives a method for solving the general model, discusses connections to existing methods and includes convergence analysis. In Section 4 we discuss specific problem formulations for several examples related to DOAS analysis and hyperspectral demixing. Numerical experiments comparing methods and applications to example problems are presented in Section 5.

2 Problem

For the non-negative linear mixing model Ax = b, let b ∈ R^W, A ∈ R^{W×N} and x ∈ R^N with x ≥ 0. Let the dictionary A have l2-normalized columns and consist of M groups, each with m_j elements. We can write A = [A_1 ··· A_M] and x = [x_1 ··· x_M]^T, where each x_j ∈ R^{m_j} and N = \sum_{j=1}^M m_j. The general non-negative least squares problem with sparsity constraints that we will consider is

\min_{x \ge 0} F(x) := \frac{1}{2} \|Ax - b\|^2 + R(x),    (2)

where

R(x) = \sum_{j=1}^M \gamma_j R_j(x_j) + \gamma_0 R_0(x).    (3)

The functions R_j represent the intra group sparsity penalties applied to each group of coefficients x_j, j = 1,...,M, and R_0 is the inter group sparsity penalty applied to x. If F is differentiable, then a necessary condition for x^* to be a local minimum is given by

(y - x^*)^T \nabla F(x^*) \ge 0 \quad \forall y \ge 0.    (4)

For the applications we will consider, we want to constrain each vector x_j to be at most 1-sparse, which is to say that we want ‖x_j‖_0 ≤ 1. To accomplish this through the model (2), we will need to choose the parameters γ_j to be sufficiently large.

[Figure 1: Non-negative parts of the l1 and l2 unit balls in 2D (the curves ‖x‖_1 = 1 and ‖x‖_2 = 1).]

The sparsity penalties R_j and R_0 will either be the ratios of l1 and l2 norms, defined by

H_j(x_j) = \gamma_j \frac{\|x_j\|_1}{\|x_j\|_2} \quad \text{and} \quad H_0(x) = \gamma_0 \frac{\|x\|_1}{\|x\|_2},    (5)

or they will be the differences, defined by

S_j(x_j) = \gamma_j \left( \|x_j\|_1 - \|x_j\|_2 \right) \quad \text{and} \quad S_0(x) = \gamma_0 \left( \|x\|_1 - \|x\|_2 \right).    (6)
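As a quick numerical check of the definitions in (5) and (6), the sketch below evaluates the ratio and difference penalties on a dense vector and a 1-sparse vector, with the weights γ set to 1; the helper names are ours.

    import numpy as np

    def hoyer_ratio(x):
        """l1/l2 ratio from (5): equals 1 for a 1-sparse x, sqrt(len(x)) for a constant x."""
        return np.linalg.norm(x, 1) / np.linalg.norm(x, 2)

    def l1_minus_l2(x):
        """l1 - l2 difference from (6): equals 0 for any 1-sparse x."""
        return np.linalg.norm(x, 1) - np.linalg.norm(x, 2)

    dense = np.ones(4)                        # not sparse
    sparse = np.array([0.0, 2.5, 0.0, 0.0])   # 1-sparse
    for v in (dense, sparse):
        print(hoyer_ratio(v), l1_minus_l2(v))
    # ratio: 2.0 vs 1.0, difference: 2.0 vs 0.0 -- both penalties prefer the sparser vector.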
A geometric intuition for why minimizing ‖x‖_1/‖x‖_2 promotes sparsity of x is that, since it is constant in radial directions, minimizing it tries to reduce ‖x‖_1 without changing ‖x‖_2. As seen in Figure 1, sparser vectors have smaller l1 norm on the l2 sphere.

Neither H_j nor S_j is differentiable at zero, and H_j is not even continuous there. Figure 2 shows a visualization of both penalties in two dimensions. To obtain a differentiable F, we can smooth the sparsity penalties by replacing the l2 norm with the Huber function, defined by the infimal convolution

\phi(x, \epsilon) = \inf_y \|y\|_2 + \frac{1}{2\epsilon} \|y - x\|^2 = \begin{cases} \frac{\|x\|_2^2}{2\epsilon} & \text{if } \|x\|_2 \le \epsilon \\ \|x\|_2 - \frac{\epsilon}{2} & \text{otherwise.} \end{cases}    (7)

[Figure 2: Visualization of the l1/l2 and l1 - l2 penalties.]

In this way we can define differentiable versions of the sparsity penalties H and S by

H_j^{\epsilon_j}(x_j) = \gamma_j \frac{\|x_j\|_1}{\phi(x_j, \epsilon_j) + \frac{\epsilon_j}{2}}, \qquad H_0^{\epsilon}(x) = \gamma_0 \frac{\|x\|_1}{\phi(x, \epsilon_0) + \frac{\epsilon_0}{2}},    (8)

S_j^{\epsilon}(x_j) = \gamma_j \left( \|x_j\|_1 - \phi(x_j, \epsilon_j) \right), \qquad S_0^{\epsilon}(x) = \gamma_0 \left( \|x\|_1 - \phi(x, \epsilon_0) \right).    (9)

These smoothed sparsity penalties are shown in Figure 3. The regularized penalties behave more like l1 near the origin and should tend to shrink x_j that have small l2 norms.

An alternate strategy for obtaining a differentiable objective that doesn't require smoothing the sparsity penalties is to add M additional dummy variables and modify the convex constraint set. Let d ∈ R^M, d ≥ 0, denote a vector of dummy variables. Consider applying R_j to the vectors [x_j ; d_j] instead of to x_j. Then if we add the constraints ‖x_j‖_1 + d_j ≥ ε_j, we are assured that R_j(x_j, d_j) will only be applied to nonzero vectors, even though x_j is still allowed to be zero. Moreover, by requiring that \sum_j \frac{d_j}{\epsilon_j} \le M - r, we can ensure that at least r of the vectors x_j have one or more nonzero elements. In particular, this prevents x from being zero, so R_0(x) is well defined as well.

[Figure 3: Visualization of the regularized l1/l2 and l1 - l2 penalties.]

The dummy variable strategy is our preferred approach for using the l1/l2 penalty. The high variability of the regularized version near the origin creates numerical difficulties. It either needs a lot of smoothing, which makes it behave too much like l1, or its steepness near the origin makes it harder numerically to avoid getting stuck in bad local minima. For the l1 - l2 penalty, the regularized approach is our preferred strategy because it is simpler and not much regularization is required. Smoothing also makes this penalty behave more like l1 near the origin, but a small shrinkage effect there may in fact be useful, especially for promoting inter group sparsity.

These two main problem formulations are summarized below as Problem 1 and Problem 2 respectively.

Problem 1:

\min_{x,d} F_H(x,d) := \frac{1}{2} \|Ax - b\|^2 + \sum_{j=1}^M \gamma_j H_j(x_j, d_j) + \gamma_0 H_0(x)

such that x \ge 0, \; d \ge 0, \; \sum_{j=1}^M \frac{d_j}{\epsilon_j} \le M - r \; \text{ and } \; \|x_j\|_1 + d_j \ge \epsilon_j, \; j = 1, \dots, M.

Problem 2:

\min_{x \ge 0} F_S(x) := \frac{1}{2} \|Ax - b\|^2 + \sum_{j=1}^M \gamma_j S_j^{\epsilon}(x_j) + \gamma_0 S_0^{\epsilon}(x).
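The ingredients of Problem 2 are simple to evaluate numerically. The following sketch, our own illustrative code rather than the paper's, implements the Huber function (7), the smoothed l1 - l2 penalty used in (9), and an objective F_S for a given group structure; here each weight γ_j and γ_0 is applied once, in the objective.

    import numpy as np

    def huber(x, eps):
        """Huber smoothing of the l2 norm, equation (7)."""
        nx = np.linalg.norm(x)
        return nx ** 2 / (2 * eps) if nx <= eps else nx - eps / 2

    def smoothed_l1_minus_l2(x, eps):
        """||x||_1 - phi(x, eps): the smoothed difference penalty appearing in (9)."""
        return np.sum(np.abs(x)) - huber(x, eps)

    def F_S(x, A, b, groups, gamma, gamma0, eps, eps0):
        """Objective of Problem 2: data fit plus intra group and inter group penalties."""
        fit = 0.5 * np.sum((A @ x - b) ** 2)
        intra = sum(gamma[j] * smoothed_l1_minus_l2(x[g], eps[j])
                    for j, g in enumerate(groups))
        inter = gamma0 * smoothed_l1_minus_l2(x, eps0)
        return fit + intra + inter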
3 Algorithm

Both Problems 1 and 2 from Section 2 can be written abstractly as

\min_{x \in X} F(x) := \frac{1}{2} \|Ax - b\|^2 + R(x),    (10)

where X is a convex set. Problem 2 is already of this form, with X = {x ∈ R^N : x ≥ 0}. Problem 1 is also of this form, with X = {x ∈ R^N, d ∈ R^M : x ≥ 0, d ≥ 0, ‖x_j‖_1 + d_j ≥ ε_j, \sum_j \frac{d_j}{\epsilon_j} \le M - r}. Note that the objective function of Problem 1 can also be written as in (10) if we redefine x_j as [x_j ; d_j] and consider an expanded vector of coefficients x ∈ R^{N+M} that includes the M dummy variables d. The data fidelity term can still be written as ½‖Ax - b‖^2 if columns of zeros are inserted into A at the indices corresponding to the dummy variables.

In this section, we will describe algorithms and convergence analysis for solving (10) under either of two sets of assumptions.

Assumption 1.
- X is a convex set.
- R(x) ∈ C^2(X, R) and the eigenvalues of ∇^2 R(x) are bounded on X.
- F is coercive on X in the sense that for any x^0 ∈ X, {x ∈ X : F(x) ≤ F(x^0)} is a bounded set. In particular, F is bounded below.

Assumption 2.
- R(x) is concave and differentiable on X.
- Same assumptions on X and F as in Assumption 1.

Problem 1 satisfies Assumption 1 and Problem 2 satisfies Assumption 2. We will first consider the case of Assumption 1.

Our approach for solving (10) was originally motivated by a convex splitting technique from [34, 35], a semi-implicit method for solving dx/dt = -∇F(x), x(0) = x^0, when F can be split into a sum of convex and concave functions F_C(x) + F_E(x), both in C^2(R^N, R). Let λ_max^E be an upper bound on the eigenvalues of ∇^2 F_E, and let λ_min be a lower bound on the eigenvalues of ∇^2 F. Under the assumption that λ_max^E ≤ ½ λ_min, it can be shown that the update defined by

x^{n+1} = x^n + \Delta t \left( -\nabla F_C(x^{n+1}) - \nabla F_E(x^n) \right)    (11)

doesn't increase F for any time step Δt > 0. This can be seen by using second order Taylor expansions to derive the estimate

F(x^{n+1}) - F(x^n) \le \left( \lambda_{\max}^E - \frac{1}{2} \lambda_{\min} - \frac{1}{\Delta t} \right) \|x^{n+1} - x^n\|^2.    (12)

This convex splitting approach has been shown to be an efficient method, much faster than gradient descent, for solving phase-field models such as the Cahn-Hilliard equation, which has been used for example to simulate coarsening [35] and for image inpainting [36].
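For intuition, when F_C is quadratic, say F_C(x) = ½ x^T Q x + q^T x, the semi-implicit update (11) reduces to a linear solve: (I + Δt Q) x^{n+1} = x^n - Δt (q + ∇F_E(x^n)). The toy sketch below, with a made-up splitting chosen only to show the mechanics, illustrates this on an unconstrained example.

    import numpy as np

    # Toy splitting F = F_C + F_E: convex quadratic F_C(x) = 0.5*||Ax - b||^2 (+ const)
    # and a mildly concave F_E(x) = -0.1*||x||^2, chosen only for illustration.
    rng = np.random.default_rng(2)
    A = rng.standard_normal((30, 10))
    b = rng.standard_normal(30)
    Q, q = A.T @ A, -A.T @ b
    grad_FE = lambda x: -0.2 * x

    dt = 1.0                    # the implicit treatment of F_C is what permits a large step
    x = np.zeros(10)
    I = np.eye(10)
    for _ in range(50):
        # Update (11): x^{n+1} = x^n + dt*(-grad F_C(x^{n+1}) - grad F_E(x^n)),
        # which for quadratic F_C is the linear system below.
        x = np.linalg.solve(I + dt * Q, x - dt * (q + grad_FE(x)))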
By the assumptions on R, we can achieve a convex-concave splitting F = F_C + F_E by letting F_C(x) = ½‖Ax - b‖^2 + ‖x‖_C^2 and F_E(x) = R(x) - ‖x‖_C^2 for an appropriately chosen positive definite matrix C. We can also use the fact that F_C(x) is quadratic to improve upon the estimate in (12) when bounding F(x^{n+1}) - F(x^n) by a quadratic function of x^{n+1}. Then, instead of choosing a time step and updating according to (11), we can dispense with the time step interpretation altogether and choose an update that reduces the upper bound on F(x^{n+1}) - F(x^n) as much as possible subject to the constraint. This requires minimizing a strongly convex quadratic function over X.

Proposition 3.1. Let Assumption 1 hold. Also let λ_r and λ_R be lower and upper bounds, respectively, on the eigenvalues of ∇^2 R(x) for x ∈ X. Then for x, y ∈ X and for any matrix C,

F(y) - F(x) \le (y-x)^T \left( (\lambda_R - \tfrac{1}{2}\lambda_r) I - C \right)(y-x) + (y-x)^T \left( \tfrac{1}{2} A^T A + C \right)(y-x) + (y-x)^T \nabla F(x).    (13)

Proof. The estimate follows from combining several second order Taylor expansions of F and R with our assumptions. First expanding F about y and using h = y - x to simplify notation, we get that

F(x) = F(y) - h^T \nabla F(y) + \tfrac{1}{2} h^T \nabla^2 F(y - \alpha_1 h) h

for some α_1 ∈ (0,1). Substituting F as defined by (10), we obtain

F(y) - F(x) = h^T (A^T A y - A^T b + \nabla R(y)) - \tfrac{1}{2} h^T A^T A h - \tfrac{1}{2} h^T \nabla^2 R(y - \alpha_1 h) h.    (14)

Similarly, we can compute Taylor expansions of R about both x and y:

R(x) = R(y) - h^T \nabla R(y) + \tfrac{1}{2} h^T \nabla^2 R(y - \alpha_2 h) h,
R(y) = R(x) + h^T \nabla R(x) + \tfrac{1}{2} h^T \nabla^2 R(x + \alpha_3 h) h.

Again, both α_2 and α_3 are in (0,1). Adding these expressions implies that

h^T (\nabla R(y) - \nabla R(x)) = \tfrac{1}{2} h^T \nabla^2 R(y - \alpha_2 h) h + \tfrac{1}{2} h^T \nabla^2 R(x + \alpha_3 h) h.

From the assumption that the eigenvalues of ∇^2 R are bounded above by λ_R on X,

h^T (\nabla R(y) - \nabla R(x)) \le \lambda_R \|h\|^2.    (15)

Adding and subtracting h^T ∇R(x) and h^T A^T A x to (14) yields

F(y) - F(x) = h^T A^T A h + h^T (A^T A x - A^T b + \nabla R(x)) + h^T (\nabla R(y) - \nabla R(x)) - \tfrac{1}{2} h^T A^T A h - \tfrac{1}{2} h^T \nabla^2 R(y - \alpha_1 h) h
            = \tfrac{1}{2} h^T A^T A h + h^T \nabla F(x) + h^T (\nabla R(y) - \nabla R(x)) - \tfrac{1}{2} h^T \nabla^2 R(y - \alpha_1 h) h.

Using (15),

F(y) - F(x) \le \tfrac{1}{2} h^T A^T A h + h^T \nabla F(x) - \tfrac{1}{2} h^T \nabla^2 R(y - \alpha_1 h) h + \lambda_R \|h\|^2.

The assumption that the eigenvalues of ∇^2 R(x) are bounded below by λ_r on X means

F(y) - F(x) \le (\lambda_R - \tfrac{1}{2}\lambda_r) \|h\|^2 + \tfrac{1}{2} h^T A^T A h + h^T \nabla F(x).

Since the estimate is unchanged by adding and subtracting h^T C h for any matrix C, the inequality in (13) follows directly.

Corollary. Let C be symmetric positive definite and let λ_c denote the smallest eigenvalue of C. If λ_c ≥ λ_R - ½λ_r, then for x, y ∈ X,

F(y) - F(x) \le (y-x)^T \left( \tfrac{1}{2} A^T A + C \right)(y-x) + (y-x)^T \nabla F(x).

A natural strategy for solving (10) is then to iterate

x^{n+1} = \arg\min_{x \in X} \; (x - x^n)^T \left( \tfrac{1}{2} A^T A + C_n \right)(x - x^n) + (x - x^n)^T \nabla F(x^n)    (16)

for C_n chosen to guarantee a sufficient decrease in F. The method obtained by iterating (16) can be viewed as an instance of scaled gradient projection [37, 38, 39], where the orthogonal projection of x^n - (A^T A + 2C_n)^{-1} ∇F(x^n) onto X is computed in the norm ‖·‖_{A^T A + 2C_n}. The approach of decreasing F by minimizing an upper bound coming from an estimate like (13) can be interpreted as an optimization transfer strategy of defining and minimizing a surrogate function [40], which is done for related applications in [12, 13]. It can also be interpreted as a special case of difference of convex programming [33].
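Each iteration of (16) requires minimizing a strongly convex quadratic over X. For Problem 2, where X is just the non-negative orthant, one simple (if not the fastest) choice is projected gradient descent on the surrogate, sketched below under that assumption; it is our own illustration, not the paper's solver.

    import numpy as np

    def solve_subproblem(xn, gradF_xn, Q, n_iter=200):
        """Projected gradient for the surrogate in (16) with X = {x >= 0}.

        The surrogate is q(x) = (x - xn)^T (0.5*A^T A + C)(x - xn) + (x - xn)^T gradF(xn),
        whose gradient is Q @ (x - xn) + gradF(xn) with Q = A^T A + 2C.
        """
        L = np.linalg.norm(Q, 2)            # Lipschitz constant of the surrogate gradient
        x = xn.copy()
        for _ in range(n_iter):
            g = Q @ (x - xn) + gradF_xn
            x = np.maximum(0.0, x - g / L)  # gradient step followed by projection onto x >= 0
        return x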
Choosing C_n in such a way that guarantees (x^{n+1} - x^n)^T ((λ_R - ½λ_r) I - C_n)(x^{n+1} - x^n) ≤ 0 may be numerically inefficient, and it also isn't strictly necessary for the algorithm to converge. To simplify the description of the algorithm, suppose C_n = c_n C for some scalar c_n > 0 and symmetric positive definite C. Then as c_n gets larger, the method becomes more like explicit gradient projection with small time steps. This can be slow to converge as well as more prone to converging to bad local minima. However, the method still converges as long as each c_n is chosen so that the x^{n+1} update decreases F sufficiently. Therefore we want to dynamically choose c_n ≥ 0 to be as small as possible such that the x^{n+1} update given by (16) decreases F by a sufficient amount, namely

F(x^{n+1}) - F(x^n) \le \sigma \left[ (x^{n+1} - x^n)^T \left( \tfrac{1}{2} A^T A + C_n \right)(x^{n+1} - x^n) + (x^{n+1} - x^n)^T \nabla F(x^n) \right]

for some σ ∈ (0,1]. Additionally, we want to ensure that the modulus of strong convexity of the quadratic objective in (16) is large enough by requiring the smallest eigenvalue of ½A^T A + C_n to be greater than or equal to some ρ > 0. The following is an algorithm for solving (10) and a dynamic update scheme for C_n = c_n C that is similar to Armijo line search but designed to reduce the number of times that the solution to the quadratic problem has to be rejected for not decreasing F sufficiently.

Algorithm 1: A Scaled Gradient Projection Method for Solving (10) Under Assumption 1

Define x^0 ∈ X, c_0 > 0, σ ∈ (0,1], ε > 0, ρ > 0, ξ_1 > 1, ξ_2 > 1 and set n = 0.
while n = 0 or ‖x^n - x^{n-1}‖_∞ > ε
    y = argmin_{x ∈ X} (x - x^n)^T (½A^T A + c_n C)(x - x^n) + (x - x^n)^T ∇F(x^n)
    if F(y) - F(x^n) > σ [ (y - x^n)^T (½A^T A + c_n C)(y - x^n) + (y - x^n)^T ∇F(x^n) ]
        c_n = ξ_2 c_n
    else
        x^{n+1} = y
        c_{n+1} = c_n/ξ_1   if the smallest eigenvalue of (c_n/ξ_1) C + ½A^T A is greater than ρ
                  c_n       otherwise
        n = n + 1
    end if
end while

It isn't necessary to impose an upper bound on c_n in Algorithm 1 even though we want it to be bounded. The reason for this is that once c_n ≥ λ_R - ½λ_r, F will be sufficiently decreased for any choice of σ ∈ (0,1], so c_n is effectively bounded by ξ_2 (λ_R - ½λ_r).

Under Assumption 2 it is much more straightforward to derive an estimate analogous to Proposition 3.1. Concavity of R(x) immediately implies

R(y) \le R(x) + (y - x)^T \nabla R(x).

Adding to this the expression

\frac{1}{2} \|Ay - b\|^2 = \frac{1}{2} \|Ax - b\|^2 + (y - x)^T (A^T A x - A^T b) + \frac{1}{2} (y - x)^T A^T A (y - x)

yields

F(y) - F(x) \le (y - x)^T \tfrac{1}{2} A^T A (y - x) + (y - x)^T \nabla F(x)    (17)

for x, y ∈ X. Moreover, the estimate still holds if we add (y - x)^T C (y - x) to the right hand side for any positive semi-definite matrix C. We are again led to iterating (16) to decrease F, and in this case C_n need only be included to ensure that A^T A + 2C_n is positive definite. We can let C_n = C since the dependence on n is no longer necessary. We can choose any C such that the smallest eigenvalue of C + ½A^T A is greater than ρ > 0, but it is still preferable to choose C as small as is numerically practical.

Algorithm 2: A Scaled Gradient Projection Method for Solving (10) Under Assumption 2

Define x^0 ∈ X, C symmetric positive definite and ε > 0.
while n = 0 or ‖x^n - x^{n-1}‖_∞ > ε
    x^{n+1} = argmin_{x ∈ X} (x - x^n)^T (½A^T A + C)(x - x^n) + (x - x^n)^T ∇F(x^n)    (18)
    n = n + 1
end while
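Combining the pieces, a minimal sketch of Algorithm 2 for Problem 2 might look as follows. It assumes a gradient function gradF for the smoothed objective and the subproblem solver solve_subproblem from the earlier sketch; C is taken to be a scalar multiple of the identity, the simplest choice satisfying the eigenvalue condition. This is an illustration of the scheme, not the authors' implementation.

    import numpy as np

    def scaled_gradient_projection(A, gradF, x0, rho=1e-3, tol=1e-6, max_iter=500):
        """Sketch of Algorithm 2: iterate the update (18) with a fixed C = c*I,
        where c is chosen so that 0.5*A^T A + C has smallest eigenvalue >= rho."""
        AtA = A.T @ A
        lam_min = np.linalg.eigvalsh(0.5 * AtA).min()
        c = max(rho - lam_min, 0.0) + 1e-12
        Q = AtA + 2.0 * c * np.eye(A.shape[1])   # Hessian of the quadratic surrogate in (16)/(18)

        x = x0.copy()
        for _ in range(max_iter):
            x_new = solve_subproblem(x, gradF(x), Q)   # strongly convex QP over {x >= 0}
            if np.max(np.abs(x_new - x)) <= tol:       # stopping rule ||x^{n+1} - x^n||_inf <= tol
                return x_new
            x = x_new
        return x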
