SHRINAKGE FUNCTION AND ITS APPLICATIONS IN MATRIX APPROXIMATION TOBY BOAS∗, ARITRA DUTTA†, XIN LI‡, KATIE MERCIER§, AND ERIC NIDERMAN¶ Abstract. We first give an elementary derivation of the shrinkage function widely used in compressivesensingandstatisticalestimations. Thenwedemonstratehowitcanbeusedinsolving 6 severalwell-knownproblemsandprovingonenewresultinmatrixapproximations. 1 0 Key words. Shrinkagefunction,singularvaluedecomposition,low-rankapproximation,sparse 2 approximation. n a AMS subject classifications. 65F15,65F30,65F35, 65F50,65K10 J 0 3 1. Introduction. Thereareafewmathematicalfunctionswhichareintroduced for convenience. For example, the Heaviside step function H(·), a piecewise constant ] function given by: C O 1, x>0 h. H(x)= 1, x=0 , 2 t 0, x<0 a m andtheDiracdeltafunctionδ(·)(moreprecisely,adistribution(see,e.g.,[1])),agen- [ eralizedfunction whose discrete analogis referredto as the Kroneckerdelta function: 2 v 1, i=j 0 δ = . ij 0, i6=j 0 (cid:26) 6 7 This article is on a newcomer, the shrinkage function S (·), first introduced by λ 0 Donoho and Johnstone in their landmark paper ([2], see also [3]) on function esti- . 1 mation using wavelets in the early 1990’s. Recently, the shrinkage function has been 0 heavily used in the solutions of several optimization and approximation problems of 6 matrices (see, e.g.,[4, 5, 6,7]). We give anelementarytreatment thatis accessibleto 1 a vast group of researchers,as it only requires basic knowledge in calculus and linear : v algebra and show how naturally the shrinkage function can be used in solving more i X advanced problems. r a 2. A calculus problem. We start with a regular calculus problem. Let λ > 0 and a∈R be given. Consider the following problem: 1 (2.1) min λ|x|+ (x−a)2 . x∈R 2 (cid:20) (cid:21) ∗DepartmentofMathematics,UniversityofFlorida,Gainesville,FL32611-8105(tboas@ufl.edu). †DepartmentofMathematics,UniversityofCentralFlorida,4000CentralFloridaBlvd,Orlando, Florida32816([email protected]). ‡DepartmentofMathematics,UniversityofCentralFlorida,4000CentralFloridaBlvd,Orlando, Florida32816([email protected]). §656EldronAve,Deltona, FL,32738([email protected]). ¶DepartmentofMathematics,UniversityofCentralFlorida,4000CentralFloridaBlvd,Orlando, Florida32816([email protected]). 1 2 BOAS,DUTTA,LI,MERCIERANDNIDERMAN We adopt the notation a = argminf(x) to mean that a ∈ A is a solution of the x∈A minimization problem minf(x) and define: x∈A 1 (2.2) S (a):=argmin λ|x|+ (x−a)2 . λ x∈R 2 (cid:20) (cid:21) Theorem 2.1. Let λ > 0 be fixed. For each a ∈ R, there is one and only one solution S (a), to the minimization problem (2.2). Furthermore, λ a−λ, a>λ S (a)= 0, |a|≤λ . λ a+λ, a<−λ Remark. The function S (·) defined above is called the shrinkage function (also λ referred to as soft shrinkage or soft threshold, [2, 3]). One may imagine that S (a) λ “shrinks” a to zero when |a|≤λ. A plot of S (·) for λ=1 is given in Figure 2.1. λ a,Sλ(a),λ=1 5 y=Sλ(a) 4 y=a 3 2 1 0 -1 -2 -3 -4 -5 -5 -4 -3 -2 -1 0 1 2 3 4 5 Fig. 2.1: A plot of S for λ=1. λ Proof. Let f(x) = λ|x|+ 1(x−a)2. Note that f(x) → ∞ when |x| → ∞ and f 2 is continuous on R and differentiable everywhere except a single point x = 0. So, f achievesits minimum value on R atone of its criticalpoints. Let x∗ =argminf(x). x∈R We consider three cases. Case 1: x∗ >0. Since f is differentiable at x=x∗ and achieves its minimum, we must have f′(x∗)=0. Note that, for x>0, we have d 1 f′(x)= [λx+ (x−a)2]=λ+(x−a). dx 2 So, λ+(x∗−a)=0, which implies x∗ =a−λ. To be consistent with x∗ >0, it is necessary that a−λ>0 or, equivalently, a>λ. Constrainedlowrankapproximationofmatrices 3 f(x)=λ|x|+1(x−a)2,λ=1 2 2 a=0 1.8 a=0.75 a=−1.6 1.6 a=1.5 1.4 1.2 fx() 1 0.8 0.6 0.4 0.2 0 -1 -0.5 0 0.5 1 X-range Fig. 2.2: Plots of f(x) for different values of a with λ=1. Case 2: x∗ <0. By proceeding similarly as in Case 1 above, we can arrive at x∗ =a+λ with a<−λ. Case 3: x∗ =0. Note that f(x) is no longer differentiable at x∗ =0 (so we could not use the condition f′(x∗) = 0 as before). But since f has a minimum at x∗ = 0 and since f is differentiable on each side of x∗ =0, it is necessary that f′(x)>0 for x>0 and f′(x)<0for x<0. So, λ+x−a>0 forx>0 and −λ+x−a<0 forx<0. Thus, λ−a>0 and −λ−a<0, or, equivalently, |a|≤λ. To summarize, we have a−λ with a>λ, x∗ = a+λ with a<−λ, 0 with |a|≤λ. Since one and only one of the three cases (1) a>λ, (2) a <−λ, and (3) |a|≤λ holds,weobtaintheuniquenessingeneral. Withtheuniqueness,itisstraightforward to verify that each of the three cases would imply the corresponding formula for x∗. This completes the proof. 4 BOAS,DUTTA,LI,MERCIERANDNIDERMAN 3. A sparse recovery problem. Recently, research in compressive sensing leads to the recognition of the fact that the ℓ -norm of a vector is a good substitute 1 for the count of the number of non-zero entries of the vector in many minimization problems. In this section, we solve some simple minimization problems using the count of non-zero entries or the ℓ -norm. Given a vector v∈Rn, we want to solve 1 β (3.1) min[card(u)+ ku−vk2 ], u∈Rn 2 ℓ2 wherecard(u)denotesthenumberofnon-zeroentriesofu,k·k denotestheEuclidean ℓ2 norm in Rn, and β > 0 is a given balancing parameter. We can solve problem (3.1) component-wise(in eachu ) asfollows. Notice that, givenu∈Rn, eachentry u ofu i i contributes 1 to card(u) if u is non-zero,and contributes0 if u is zero. Since we are i i minimizing g(u):=card(u)+ βku−vk2 , if u is zero then the contribution to g(u) 2 ℓ2 i depending on this u is βv2; otherwise, if u is non-zero, then we should minimize i 2 i i β(u −v )2 foru ∈R\{0},whichforcesthatu =v andcontributes1tog(u)asthe 2 i i i i i minimum value. Therefore, the solution u to problem (3.1) is given component-wise by 0, if β(v )2 ≤1 u = 2 i i (vi, otherwise. Next, we replace card(u) by kuk in (3.1) and solve: l1 β (3.2) min[kuk + ku−vk2 ], u∈Rn ℓ1 2 ℓ2 where k·k denotes the ℓ norm in Rn. ℓ1 1 Using Theorem 2.1, we can solve (3.2) component-wise as follows. Theorem 3.1. [8] Let β >0 and v∈Rn be given and let β u∗ =arg min[kuk + ku−vk2 ], u∈Rn ℓ1 2 ℓ2 then u∗ =S (v), 1/β where, S (v)denotesthevectorwhoseentriesareobtainedbyapplyingtheshrinkage 1/β function S (·) to the corresponding entries of v. 1/β Proof. If u and v denote the ith entry of the vectors u and v, respectively, i i i=1,2,...,n, then we have, β u∗ =arg min[kuk + ku−vk2 ] u∈Rn ℓ1 2 ℓ2 n 1 1 =arg min |u |+ (u −v )2 . u∈Rn"i=1(cid:18)β i 2 i i (cid:19)# X Since the i-th term in the summation depends only on u , the vector u∗ must have i components u∗ satisfying i 1 1 u∗ =argmin[ |u |+ (u −v )2], i u∗∈R β i 2 i i i Constrainedlowrankapproximationofmatrices 5 for i = 1,2,...,n. But by Theorem 2.1, the solution to each of these problems is given precisely by S (v ). This yields the result. 1/β i Remark. Thepreviousproofstillworksifwereplacethevectorsbymatricesanduse the extension of the ℓ and ℓ -norms to matrices by treating them as vectors. Thus 1 2 by using the same argument we can easily show the following matrix version of the previous theorem. Theorem 3.2. [8] Let β >0 and V∈Rm×n be given. Then β S (V)=arg min [kUk + kU−Vk2 ], 1/β U∈Rm×n ℓ1 2 ℓ2 where S (V) is again defined component-wise. 1/β Theorem 3.2 solves the problem of approximating a given matrix by a sparse matrix by using the shrinkage function. 4. Approximationbylowrankmatrices. Thesparseapproximationasgiven by Theorem 3.2 has many applications such as data compression and dimension re- duction. In these areas, one may also be interested in finding matrices of low rank. Forexample,givenamatrixA∈Rm×n,wewanttosolvethefollowingapproximation problem: β min rank(X)+ kX−Ak2 , X∈Rm×n 2 F (cid:20) (cid:21) wherek·k denotestheFrobeniusnormofmatrices(whichturnsouttobeequivalent F to the vector l norm if we treat a matrix as a vector - see more discussion on the 2 matrix norms in subsection 4.1). This is a harder problem since rank(X) is not a convex function. A convex relaxation (see, e.g., [9]) of the problem is provided by replacing the term rank(X) by the nuclear norm of X, kXk , (again, see subsection 4.1 for a discussion on the ∗ nuclear norm and its properties). The problem then becomes: β (4.1) min kXk + kX−Ak2 . X∈Rm×n ∗ 2 F (cid:20) (cid:21) This problem again yields an explicit solution ([4, 10]), but in these literatures, the formulaisderivedbyusingadvancedtoolsfromconvexanalysis(“subdifferentials”to be more specific). Here, we will showhow we canobtain the solutionby using simple ideas from the previous section. 4.1. Singular value decompositionand matrix norms. Itwillbebeneficial torecallthevariousmatrixnorms. Manyusefulmatrixnormscanbedefinedinterms of the singular values of the matrices. We will deal with two of them: the nuclear norm k·k and the Frobenius norm k·k . ∗ F Let A ∈ Rm×n and A = UA˜VT be a singular value decomposition (SVD) of A with U ∈ Rm×m and V ∈ Rn×n being two orthogonal matrices (that is, U−1 = UT and V−1 = VT) and A˜ = diag(σ (A) σ (A)···σ (A)) being a diagonal 1 2 min{m,n} matrix such that σ (A) ≥ σ (A) ≥ ··· ≥ σ (A) ≥ 0. The σ (A)’s are called 1 2 min{m,n} i the singular values of A. It is known ([11]) that every matrix in Rm×n has a SVD and that SVD of a matrix is not unique. Then the nuclear norm of A is given by min{m,n} kAk = σ (A), ∗ i i=1 X 6 BOAS,DUTTA,LI,MERCIERANDNIDERMAN and we can also define the Frobenius norm of A as 1/2 min{m,n} kAk = (σ (A))2 . F i i=1 X Thisnormturnsouttobethesameastheℓ normofA,treatedasavectorinRmn×1. 2 This is because the nonzerosingular values σ (A)’s areexactly the squarerootof the i nonzero eigenvalues of AAT or ATA. So, m n min{m,n} kAk2 = (a )2 =trace(AAT)= (σ (A))2. l2 ij i i=1j=1 i=1 XX X Here we haveusedtrace(·) to denote the traceof a matrix(which is equalto the sum of all diagonal entries of the matrix). We will need the following simple fact about the nuclear norms of a matrix and that of its diagonal: Let D(A) denote the diagonal matrix using the diagonal of A. We have (4.2) kD(A)k ≤kAk . ∗ ∗ This inequality can be verified by using a SVD of A = UA˜VT as follows. Write U=(u ), V=(v ), and t=min{m,n}. Then ij ij m t t m kD(A)k =kD(UA˜VT)k = σ (A)u v ≤ σ (A) |u v | ∗ ∗ (cid:12) j ij ij(cid:12) j ij ij i=1(cid:12)j=1 (cid:12) j=1 i=1 X(cid:12)X (cid:12) X X (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) t m 1/2 m 1/2 t ≤ σ (A)· |u |2 |v |2 ≤ σ (A)=kAk , j ij ij j ∗ ! ! j=1 i=1 i=1 j=1 X X X X wherewehaveusedtheCauchy-Schwarzinequalityinobtainingthesecondinequality, and the orthogonality of U and V (so that m |u |2 ≤ 1 and m |v |2 ≤ 1) in i=1 ij i=1 ij the last inequality. P P We will also use the fact that for any orthogonal matrices O and R, OAR and A have the same singular values, and therefore their Frobenius norms and nuclear norms are same: kOARk =kAk andkOARk =kAk . F F ∗ ∗ This is known as the unitary invariance of the Frobenius norm and nuclear norm. 4.2. Solutionto(4.1)viaproblem(2.2). Wearereadytoshowhowproblem (4.1) is problem (2.2) in disguise. Given β > 0, using the unitary invariance of the Frobenius norm and the nuclear norm, we have β β min[kXk + kX−Ak2]=min[kXk + kX−UA˜VTk2] X ∗ 2 F X ∗ 2 F β =min[kUTXVk + kUTXV−A˜k2]. X ∗ 2 F Constrainedlowrankapproximationofmatrices 7 It can be seen from the last expression that the minimum occurs when X˜ :=UTXV is diagonal since both terms in that expression get no larger when X˜ is replaced by its diagonal matrix (with the help of (4.2)). So, the matrix E = (e ) := X˜ −A˜ is a ij diagonal matrix. Thus, X=UX˜VT, with X˜ =A˜ +E, which yields a SVD of X (using the same matrices U and V as in a SVD of A !). Then, β β min[kXk + kX−Ak2]= min [kX˜k + kX˜ −A˜k2] X ∗ 2 F X˜∈diag ∗ 2 F β = min [ x˜ + (x˜ −σ (A))2], ii ii i X˜∈diag 2 i i X X where “diag” is the set of diagonal matrices in Rm×n. Above is an optimization problem like (2.2) (for vectors (x˜ ,x˜ ,···)T as X˜ varies)whose solution is givenby 11 22 x˜ =S (σ (A)), i=1,2,··· . ii 1/β i To summarize, we have proven the following. Theorem 4.1. [4] Suppose that A ∈ Rm×n and β > 0 are given. Then the solutiontotheminimization problem (4.1)isgiven byXˆ =UX˜VT wherethediagonal matrix X˜ has diagonal entries x˜ =S (σ (A)), i=1,2,...,min{m,n}, ii 1/β i where UA˜VT is a SVD of A. Remark. 1. A recent proof of this theorem is givenby Cai, Candes, and Shen in [4] where they giveanadvancedverificationofthe result. Our proofgivenabovehas the advantage that it is elementary and allows the reader to “discover” the result. 2. There are many earlier discoveries of related results ([12]) where rank(X) is used instead of the nuclear norm kXk . We will examine one such variant in the next ∗ section. 3. One key ingredient in the above discussion is the unitary invariance of the norms k·k andk·k . ItwasvonNeumann(see,e.g.,[9])who wasamongthe firstto study ∗ F the family of all unitarily invariant matrix norms in matrix approximation, k·k F being one of them. 4. A closely related (but harder) problem is compressive sensing ([13, 14]). Readers are strongly recommended to the recently survey by Bryan and Leise ([15]). 5. A variation. More problems can be solved by applying similar ideas. For example, let us consider a variant of a well-known result of Schmidt (see, e.g., [12, Section 5]), replacing the rank by the nuclear norm: For a fixed positive number τ, consider (5.1) min kX−Ak subject to kXk ≤τ. F ∗ X∈Rm×n Using similar methods as in Section 3, this problem can be transformed into the following: (5.2) min ku−vk subject to kuk ≤τ. ℓ2 ℓ1 u∈Rmin{m,n} 8 BOAS,DUTTA,LI,MERCIERANDNIDERMAN Notethat,(5.2)isrelatedtoaLassoproblem[3,16,17]. ButunlikeaLASSOproblem, no special assumption (like 0 mean) is made on v in (5.2). As in [3], one can form a Lagrangian of (5.2) and solve: 1 (5.3) u∗ =arg min { ku−vk2 +λkuk }, withkS (v)k =τ, u∈Rmin{m,n} 2 ℓ2 ℓ1 λ ℓ1 which has a solution u∗ = S (v) according to Theorem 3.1. (The reason for us to λ use λ instead of β in (5.2) is nonessential: it is only for indicating the similarity with LASSO formulation.) We now sketch the derivation of converting (5.1) to (5.2): As before, let A=UA˜VT be a SVD of A. Then, min kX−Ak = min kUTXV−A˜k . F F X∈Rm×n X∈Rm×n Note that kXk =kUTXVk , so (5.1) can be written as ∗ ∗ min kX−A˜k subject to kXk ≤τ, F ∗ X∈Rm×n which, by using (4.2), can be further transformed to (5.4) min kX−A˜k subject to X being diagonal and kXk ≤τ. F ∗ X∈Rm×n Next,ifweletuandvbetwovectorsinRmin{m,n} consistingofthediagonalelements of X and A˜, respectively, then (5.4) is (5.2). Thus we have established the following result. Theorem 5.1. With the notations above, the solution to problem (5.1) is given by Xˆ =US (A˜)VT, λ for some λ such that kS (A˜)k =τ. λ ℓ1 Acknowledgment. This work is partially supported by a CSUMS program of the National Science Foundation DMS-0803059. Toby Boas, Katie Mercier, and Eric Niederman contributed to this work as undergraduate students. REFERENCES [1] R. Bracewell,TheImpulseSymbol,The Fourier Transform and Its Applications (3rded., 2000), Ch.5,McGraw-Hill,74–104. [2] D. L. Donoho and I. M. Johnstone, Ideal spatial adaptation by wavelet shrinkage, Biometrika,81(1994) 425–455. [3] R.Tibshirani,Regression shrinkage andselectionviathe LASSO,J.oftheRoyalstatistical society,seriesB,58(1996)267–288. [4] J. Cai, E. J. Candes, and Z. Shen, A singular value thresholding algorithm for matrix completion, SIAMJ.Optim.,20(2010) 1956–1982. [5] Z.Lin,M.Chen,andY.Ma,TheaugmentedLagrangemultipliermethodforexactrecovery of corrupted low-rank matrices,arXiv:1009.5055v2,March9,2011. [6] M.TaoandX.Yuan,Recovering low-rank and sparse components of matrices from incom- plete and noisy observations,SIAMJ.Optim.,21(2011) 57–81. [7] X. Yuan and J. Yang, Sparse and low-rank matrix decomposition via alternating direction methods, Technical report (available from http://www.optimization-online.org/DBFILE/2009/11/2447.pdf), Dept. of Mathe- matics,HongKongBaptistUniversity,2009. Constrainedlowrankapproximationofmatrices 9 [8] W.Yin,E.Hale,andY.Zhang,Fixed-pointcontinuationforl1-minimization: methodology and convergence,SIAMJ.Optim.,19(2008)1107–1130. [9] M. Fazel, Matrix Rank Minimization with Applications, Ph.D. dissertation, Department of ElectricalEngineering,StanfordUniversity,2002. [10] S. Q. Ma, D. Goldfarb and L. F. Chen, Fixed point and Bregman iterative methods for matrix rank minimization,Math.Prog.Ser.A,(2009) DOI10.1007/s10107-009-0306-5. [11] G.W. Strang,Introduction to Linear Algebra,3rded.,Wellesley-CambridgePress,1998. [12] G. W. Stewart, On the early history of the singular value decomposition, SIAM Review, 35(1993) 551–566. [13] E.J.Candes, J.Romberg,andT.Tao,Robust uncertainty principles: Exact signalrecon- struction from highly incomplete frequency information,IEEETransactions onInforma- tionTheory,52(2006) 489–509. [14] E.J.CandesandT.Tao,Thepowerofconvexrelaxation: Near-optimalmatrixcompletion, IEEETransactionsonInformationTheory,56(2010) 2053–2080. [15] K.BryanandT.Leise,Making do withless: anintroduction tocompressed sensing,SIAM Review,55(2013), 547–566. [16] R. J. Tibshirani and J. Taylor, The solution path of the generalized LASSO, The Annals ofStatistics, 39-3(2011), 1335–1371. [17] M. R. Osborne, B. Presnell, and B. A. Turlach, On the LASSO and its dual, J. of Computational andGraphicalStat, 9(1999), 319–337.