Rectified Gaussian Scale Mixtures and the Sparse Non-Negative Least Squares Problem

Alican Nalci, Igor Fedorov, and Bhaskar D. Rao

The authors are with the Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92092 USA ([email protected]; [email protected]; [email protected]). This work was partially supported by a KIET of MOTIE grant funded by the Korea government in 2015 [No. 10050527, General-purpose module, sensor and system development projects to enhance industrial safety network]. Alican Nalci was partially supported by the UCSD Frontiers of Innovation Scholars Program. Igor Fedorov was partially supported by the ARCS Foundation.

Abstract—In this paper we introduce a hierarchical Bayesian framework to obtain sparse and non-negative solutions to the sparse non-negative least squares problem (S-NNLS). We introduce a new family of scale mixtures, the Rectified Gaussian Scale Mixture (R-GSM), to model the sparsity enforcing prior distribution for the signal of interest. One advantage of the R-GSM prior is that through proper choice of the mixing density it encompasses a wide variety of heavy tailed distributions, such as the rectified Laplacian and rectified Student's t distributions. Similar to the Gaussian Scale Mixture (GSM) approach, a Type II Expectation-Maximization framework is developed to estimate the hyper-parameters and obtain a point estimate of the parameter of interest. In the proposed method, called Rectified Sparse Bayesian Learning (R-SBL), we provide two ways to perform the Expectation step: Markov-Chain Monte-Carlo (MCMC) simulations and a simple yet effective diagonal approximation approach (DA). Through numerical experiments we show that R-SBL outperforms existing S-NNLS solvers in terms of both signal and support recovery, and that the proposed DA approach admits both computational efficiency and numerical accuracy.

Index Terms—Non-negative Least Squares, Bayesian Sparse Signal Recovery, Rectified Gaussian Scale Mixtures, Diagonal Approximation, Markov-Chain Monte-Carlo

I. INTRODUCTION

In this paper we consider a linear system of equations of the form

    y = Φx + v    (1)

where the solution of interest, x ∈ R^M_+, is assumed to be non-negative. The matrix Φ ∈ R^{N×M} is fixed and related to the underlying physical problem, y ∈ R^N is the measurement vector, and v is the additive measurement noise modeled as a zero mean Gaussian random vector with uncorrelated entries, i.e. v ∼ N(0, σ²I). Recovering the optimal solution to (1) is known as solving the non-negative least squares (NNLS) problem. NNLS has received considerable attention in the context of methods for solving systems of linear equations [1], density estimation [2], compressive non-negative imaging [3], and non-negative matrix factorization (NMF) [4], [5], [6], among others. Non-negative data is also a natural occurrence in many application areas, such as text mining [7], image processing [8], speech enhancement [9], and spectral decomposition [10], [11], and requires special consideration. To this end, we pose the solution to the NNLS problem in (1) as

    minimize_{x ≥ 0} ‖y − Φx‖₂².    (2)

In many applications, N < M and, therefore, (1) is underdetermined. As a result, a unique solution to (2) may not exist. However, recovering an exact solution is possible if additional information is known about the solution. A useful assumption, and one that has been recently made in many applications, is that the solution is sparse [12], [13], [14]. The problem becomes more well-posed if x is known to be sparse, i.e. has few non-zero elements. In this case, (2) is simply modified to

    minimize_{x ≥ 0} ‖x‖₀  subject to  y = Φx    (3)

where ‖x‖₀, the ℓ₀ pseudo-norm, is the number of non-zero elements in x. We will refer to (3) as the sparse NNLS (S-NNLS) problem.

Directly solving (3) is not tractable, since the ℓ₀ pseudo-norm is not convex in its arguments and the problem is NP-hard in general [15]. Therefore, greedy algorithms have been suggested to approximate the solution [15], [16]. One example is the class of algorithms known as Orthogonal Matching Pursuit (OMP) [17], which successively select non-zero elements of x in a greedy fashion. In order to adapt OMP to the S-NNLS problem, the criterion by which a new non-zero element of x is selected is modified to select the one having the largest positive value [18]. Another approach in this class of algorithms first finds an x such that ‖y − Φx‖₂ ≤ ǫ and x ≥ 0 using the active-set Lawson-Hanson algorithm [1] and then prunes x with a greedy procedure until ‖x‖₀ ≤ K, where K is a pre-specified desired sparsity [4].
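To make the greedy strategy concrete, the following is a minimal sketch of a non-negative OMP-style solver in the spirit of [18]: at each step it picks the column most positively correlated with the residual and re-fits a non-negative least squares problem on the current support. It is an illustrative approximation rather than the exact algorithm of [18]; the function name, the stopping rule, and the use of NumPy/SciPy are our own choices.

```python
import numpy as np
from scipy.optimize import nnls

def nn_omp(y, Phi, K):
    """Greedy S-NNLS sketch: at each step pick the column most positively
    correlated with the residual, then re-fit NNLS on the current support."""
    M = Phi.shape[1]
    support, x = [], np.zeros(M)
    r = y.copy()
    for _ in range(K):
        corr = Phi.T @ r
        j = int(np.argmax(corr))
        if corr[j] <= 0:              # no column improves the fit while staying non-negative
            break
        support.append(j)
        x_s, _ = nnls(Phi[:, support], y)   # non-negative LS on the current support
        x[:] = 0.0
        x[support] = x_s
        r = y - Phi @ x
    return x
```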
Greedy algorithms are computationally attractive to solve NP-hard problems, but often lead to sub-optimal solutions [15]. Thus, numerous convex relaxations of the ℓ₀ penalty have been considered as an alternative to greedy methods [15]. One simple alternative is to replace the ℓ₀ penalty with ℓ₁ and cast the problem in (3) as

    minimize_{x ≥ 0} ‖y − Φx‖₂² + λ‖x‖₁    (4)

where λ > 0 is a suitably chosen regularization parameter. The advantage of this formulation is that it is in the form of a constrained convex optimization problem and can be solved by any number of methods [19], [20]. One approach is to estimate x through a projected gradient descent procedure [21]. Whereas the projected gradient descent procedure requires a line-search in order to calculate the learning rate at each iteration, the approach can be simplified by employing a multiplicative update rule which guarantees a decrease in (4) at each iteration [22]. In fact, the ℓ₁ norm penalty in (4) can be replaced by any sparsity inducing regularization function g(x):

    minimize_{x ≥ 0} ‖y − Φx‖₂² + λ g(x).    (5)

For instance, [23], [24] consider g(x) = Σ_{i=1}^M log(x_i² + β), which leads to an iterative re-weighted optimization approach.
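As an illustration of the deterministic route, here is a small sketch of solving (4) by projected gradient descent with a fixed step size rather than the line search of [21]; on the non-negative orthant ‖x‖₁ reduces to the sum of the entries, so the ℓ₁ term contributes a constant λ to every gradient coordinate. The function and parameter names are ours, and the fixed step is chosen from the Lipschitz constant of the smooth term.

```python
import numpy as np

def s_nnls_l1_pgd(y, Phi, lam=0.1, n_iter=500):
    """Projected gradient sketch for (4): min_{x>=0} ||y - Phi x||_2^2 + lam*||x||_1.
    On the non-negative orthant ||x||_1 = sum(x), so the l1 term simply adds lam
    to every coordinate of the gradient."""
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)   # 1 / Lipschitz constant of the smooth part
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ x - y) + lam
        x = np.maximum(x - step * grad, 0.0)           # project onto x >= 0
    return x
```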
An alternative and more promising view on the S-NNLS problem is to cast the entire problem in a Bayesian framework and consider the maximum a-posteriori (MAP) estimate of x given the data y:

    x_MAP = argmax_x p(x | y).    (6)

There is a strong connection between the Bayesian framework in (6) and deterministic formulations like the one in (5): it has recently been shown that formulations of the form in (5) are equivalent to the formulation in (6) with the proper choice of prior p(x) [25]. For example, considering a separable p(x) of the form

    p(x) = Π_{i=1}^M p(x_i),    (7)

the ℓ₁ regularization approach in (4) is equivalent to the Bayesian approach in (6) with an exponential prior for p(x_i). In this paper, our emphasis will be on Bayesian approaches to solving (1).

A. Contributions of the Paper

• We introduce a novel class of sparsity inducing priors for the S-NNLS problem, namely priors modeled as a Rectified Gaussian Scale Mixture (R-GSM). These are an extension of the Gaussian Scale Mixture (GSM) model [26], [27], [28], [29] and represent a large class of heavy tailed priors appropriate for inducing sparsity.
• We cast the S-NNLS problem in a Bayesian framework using R-GSM priors and develop a Type II inference procedure, which we call Rectified Sparse Bayesian Learning (R-SBL). We develop a detailed Bayesian inference procedure for estimating x using the Expectation-Maximization (EM) algorithm, using both MCMC and a Diagonal Approximation (DA) approach that performs very close to MCMC in terms of recovery performance but is computationally much more efficient.
• We detail two potential point estimate strategies for R-SBL to find the optimal solution: the mean and mode point estimates.
• We demonstrate the utility of the R-GSM prior and efficacy of the R-SBL algorithm with extensive experimental results comparing the proposed technique to existing S-NNLS algorithms. We also discuss the robustness of the proposed method in the case of noisy measurements.

B. Organization of the Paper

In the following sections, we provide details of the proposed R-SBL algorithm. In Section II, we discuss the advantages of scale mixture priors for p(x) and introduce the R-GSM prior. In Section III, we define the Type I and Type II Bayesian approaches to solve the S-NNLS problem and introduce the main R-SBL framework with the R-GSM prior. We provide details of an EM technique for estimating x using MCMC and our DA method in Section III-B. We present experimental results comparing the proposed R-SBL algorithm to existing methods in Section IV. Finally, we conclude the work in Section V.

II. RECTIFIED GAUSSIAN SCALE MIXTURES

In this work, we assume separable priors of the form (7) and focus on the choice of p(x_i). The choice of prior is very important because it plays a central role in the Bayesian inference procedure. For the problem at hand, the prior must be sparsity inducing, satisfy the non-negativity constraint, and be general and versatile enough to be widely applicable. Consequently, we consider the scale mixture prior:

    p(x_i) = ∫_0^∞ p(x_i | γ_i) p(γ_i) dγ_i.    (8)

Scale mixture priors were first considered in the context of GSM's [26]. It is well known that super-Gaussian densities are the best priors for promoting sparsity [30], [31] and most such priors can be represented in the form shown in (8), with the proper choice of p(γ_i) [32], [33], [29], [28], [27]. This has made GSM based priors a valuable form of prior for the general sparse signal recovery problem. Another advantage of the scale mixture prior is that it establishes a Markovian structure of the form

    γ → x → y

where inference can be performed in the x domain (MAP or Type I) and also in the γ domain (Type II). This provides additional flexibility and opportunity to explore algorithm development. Experimental results in the standard sparse signal recovery problem show that performing inference in the γ domain consistently achieves superior performance [25], [30]. The performance gains may be due to the intuition that γ is deeper in the Markovian chain than x, so the influence of errors in performing inference in the γ domain may be diminished [25], [34], [35]. At the same time, γ is close enough to y in the Markovian chain such that meaningful inference about γ can still be performed [25].

Although priors of the form shown in (8) have been used widely in the compressed sensing literature (where the signal model is identical to (1) without the non-negativity constraint) [25], this framework has not been extended to the S-NNLS problem. In [36], a Rectified Gaussian (RG) prior is considered for p(x_i) within a variational Bayesian inference framework. Our goal in this work is to develop a more flexible and general class of priors. Considering the findings that the scale mixture prior is superior to most existing sparse signal recovery algorithms [25], we propose an R-GSM prior for the S-NNLS problem, where p(x_i | γ_i) in (8) is given by the RG distribution. We refer to the proposed inference framework with the R-GSM prior as Rectified Sparse Bayesian Learning (R-SBL).
The RG distribution is defined as

    N_R(x | µ, γ) = √(2/(πγ)) · exp(−(x − µ)²/(2γ)) / erfc(−µ/√(2γ)) · u(x)

where µ is the location parameter, γ is the scale parameter, u(x) is the unit step function, and erfc(x) is the complementary error function, defined as

    erfc(x) = (2/√π) ∫_x^∞ e^{−t²} dt.

When µ = 0, the RG density becomes

    N_R(x | 0, γ) = 2 N(x | 0, γ) u(x) = √(2/(πγ)) · e^{−x²/(2γ)} u(x).    (9)

As noted in [37], [38], closed form inference computations using scale mixtures of RG's are tractable only if the location parameter is zero, which effectively makes the erfc(·) term drop out. Thus, in this work, we focus on the R-GSM with µ = 0.

Motivated by Gaussian scale mixtures, we use the RG prior in a scale mixture framework in order to provide a general and flexible class of priors and to better promote sparse non-negative solutions. R-GSM priors have the form

    p(x) = ∫_0^∞ N_R(x | 0, γ) p(γ) dγ.    (10)

Different choices of p(γ) lead to different choices of priors. A few examples are presented below.

A. Examples of R-GSM Representation of Sparse Priors

A random variable modeled by an R-GSM can also be viewed as generating a random variable with a GSM and then taking its absolute value. This allows one to leverage the GSM prior literature to develop the class of R-GSM priors. For instance, consider the rectified Laplacian prior for x:

    p_e(x) = λ e^{−λx} u(x).

By using an exponential prior for p(γ),

    p(γ) = (λ²/2) e^{−λ²γ/2} u(γ),    (11)

we can express p_e(x) in the R-GSM form [39] as

    p_e(x) = 2u(x) ∫_0^∞ N(x | 0, γ) (λ²/2) e^{−λ²γ/2} dγ = λ e^{−λx} u(x).

Likewise, by considering the R-GSM with p(γ) given by the Gamma(a, b) distribution, we get a rectified Student-t distribution for p(x). In this case, (8) simplifies to [30]

    p(x) = 2u(x) ∫_0^∞ N(x | 0, γ) (b^a / Γ(a)) γ^{a−1} e^{−bγ} dγ
         = [2 b^a Γ(a + 1/2) / ((2π)^{1/2} Γ(a))] (b + x²/2)^{−(a+1/2)} u(x)

where Γ is defined as

    Γ(a) = ∫_0^∞ t^{a−1} e^{−t} dt.

More generally, all of the distributions represented by the GSM have a corresponding rectified version represented by an R-GSM, e.g. contaminated normal and slash, symmetric stable and logistic, hyperbolic, etc. [29], [28], [27], [26], [32], [33].
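The rectified Laplacian example above can be checked numerically by sampling from the hierarchy: draw γ from the exponential mixing density in (11) and then x from the rectified Gaussian in (9). A short sketch (our own, using NumPy; the variable names and the choice λ = 2 are illustrative) is shown below; the empirical mean of the samples should be close to 1/λ, the mean of p_e(x) = λe^{−λx}u(x).

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 2.0, 200_000

# mixing density (11): gamma ~ Exponential with rate lam^2/2, i.e. scale 2/lam^2
gamma = rng.exponential(scale=2.0 / lam**2, size=n)

# conditional (9): rectified Gaussian with scale gamma, i.e. |N(0, gamma)|
x = np.abs(rng.normal(0.0, np.sqrt(gamma)))

# sample mean vs. the rectified Laplacian mean 1/lam
print(x.mean(), 1.0 / lam)
```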
III. BAYESIAN INFERENCE WITH SCALE MIXTURE PRIOR

There are two Bayesian approaches for solving the S-NNLS problem with a scale mixture prior. The first approach, called Type I estimation (MAP), performs inference in the x domain by treating γ as the hidden variable. The second approach, called Type II, performs inference in the γ domain. We briefly discuss Type I in the next section; the major portion of the paper is dedicated to Type II estimation because of its observed superiority in standard sparse signal recovery problems [25].

A. Type I or MAP Estimation

Employing Type I to solve S-NNLS translates into calculating the maximum a-posteriori (MAP) estimate of x given y:

    argmin_x ‖y − Φx‖₂² − λ Σ_{i=1}^M log p(x_i).    (12)

Many of the ℓ₀ relaxation methods described in Section I can be re-derived from the Type I perspective. For instance, by choosing an exponential prior for p(γ_i), leading to an exponential prior on p(x_i), (12) reduces to the ℓ₁ regularization approach in (4) with the added benefit that λ has a probabilistic interpretation. Similarly, by choosing a Gamma prior for p(γ_i), which corresponds to a rectified Student-t prior on p(x_i), (12) reduces to

    argmin_x ‖y − Φx‖₂² + λ Σ_{i=1}^M log(b + x_i²/2)

which leads to the re-weighted ℓ₂ approach to the S-NNLS problem described in [23], [24].

B. Type II Estimation: R-SBL for S-NNLS

The R-SBL framework aims to infer the hyper-parameters γ that shape the R-GSM prior. A MAP estimate of γ is computed and the posterior density p(x | y, γ_MAP) is used as an approximation to the true posterior p(x | y). Having an estimate of the posterior density, an appropriate point estimate of x can then readily be obtained.

Based on past experience in Type II estimation for the standard sparse signal recovery problem, several strategies exist for estimating γ [30]. The first strategy considers the problem of forming a MAP estimate of γ given y by directly optimizing the appropriate posterior density. In this case, p(γ | y) does not appear to admit a closed form, so we do not pursue this strategy. The second strategy, investigated here, aims to estimate γ by utilizing an EM framework.

In the EM approach, we treat (x, y, γ) as the complete data and x as the hidden variable. The E-step involves determining Q(γ, γ^t), which is defined as the conditional expectation of the complete data log-likelihood:

    Q(γ, γ^t) = E_{x|y;γ^t,σ²}[ log p(y|x) + log p(x|γ) + log p(γ) ]    (13)
              ≐ Σ_{i=1}^M E_{x|y;γ^t,σ²}[ −(1/2) log γ_i − x_i²/(2γ_i) ]    (14)

where t refers to the iteration index and ≐ indicates that constant terms and terms which do not depend on γ have been omitted, since they do not affect the maximization step. For the sake of simplicity, we also assume a non-informative prior on γ [30]. In the M-step, we maximize Q(γ, γ^t) with respect to γ by taking the derivative and setting it equal to zero, which yields an estimate of γ:

    γ_i^{t+1} = E_{x|y,γ^t,σ²}[x_i²] := ⟨x_i²⟩    (15)

To compute ⟨x_i²⟩, we first consider the form of the posterior p(x | y, γ, σ²):

    p(x | y, γ) = c(y) exp( −(x − µ)ᵀ Σ⁻¹ (x − µ)/2 ) u(x)    (16)

where µ and Σ are given by [40], [30], [41] as

    µ = ΓΦᵀ(σ²I + ΦΓΦᵀ)⁻¹ y    (17)
    Σ = Γ − ΓΦᵀ(σ²I + ΦΓΦᵀ)⁻¹ ΦΓ    (18)

where Γ = diag(γ).
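For reference, a direct sketch of the posterior parameter computations (17) and (18) is given below; it assumes NumPy and forms the N × N matrix σ²I + ΦΓΦᵀ explicitly, which is adequate for the problem sizes considered here but is not meant as an optimized implementation.

```python
import numpy as np

def posterior_params(y, Phi, gamma, sigma2):
    """Mean and covariance parameters of the posterior p(x | y, gamma) in (16),
    following (17) and (18)."""
    N = Phi.shape[0]
    Gamma = np.diag(gamma)
    S = sigma2 * np.eye(N) + Phi @ Gamma @ Phi.T
    # K = Gamma Phi^T S^{-1}; computed via a solve instead of an explicit inverse
    K = np.linalg.solve(S, Phi @ Gamma).T
    mu = K @ y                       # (17)
    Sigma = Gamma - K @ Phi @ Gamma  # (18)
    return mu, Sigma
```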
The posterior in (16) is known as a multivariate RG (or multivariate truncated normal [42]). The normalizing constant c(y) does not appear to have a closed form. The multivariate RG is a particularly difficult distribution to work with because its marginals are not univariate RG's and do not admit a closed form [42], which makes computing the marginal moments difficult. As a result, we resort to calculating ⟨x_i²⟩ with two strategies. The first strategy uses MCMC to draw samples from p(x | y, γ) and estimates ⟨x_i²⟩ from those samples. This process can be time intensive but is guaranteed to generate quality estimates. The second strategy is an approximation to the true second moments of the posterior p(x | y, γ, σ²). In this approach, in the moment calculations for ⟨x_i²⟩ we assume that the off-diagonal terms of Σ are very close to zero and hence we disregard them. This leads to a very fast algorithm that uses closed form computations and outperforms baseline S-NNLS solvers by a large margin. We call this the diagonal approximation, or (DA) for short.

1) Markov-Chain Monte-Carlo: In this work, we employ a Gibbs sampling [43] strategy, which is a popular MCMC technique for sampling from multivariate distributions. We follow [44] and begin by introducing the transformation

    w = L⁻¹(x − µ)    (19)

where L is the lower triangular Cholesky decomposition of Σ. It can be shown that w has a truncated normal distribution TN(w; 0, I, L, −µ, ∞) as in [44], where

    TN(w; µ, Σ, R̂, α_L, α_U) = c_tn exp( −(w − µ)ᵀ Σ⁻¹ (w − µ)/2 ) 1_{α_L ≤ R̂w ≤ α_U}    (20)

and 1(·) is the indicator function and c_tn is the normalizing constant. The Gibbs sampler then proceeds by iteratively drawing samples from the conditional distribution p(w_i | y, γ, σ², w_{−i}), where w_{−i} refers to the vector containing all but the i'th element of w. Given a set of samples drawn from w, we can generate samples from the original distribution of interest by inverting the transformation:

    {x^n}_{n=1}^N = {Lw^n + µ}_{n=1}^N    (22)

Finally, the sample estimate of ⟨x_i²⟩ can be calculated using

    ⟨x_i²⟩ = (1/N) Σ_{n=1}^N (x_i^n)²    (23)

2) Diagonal Approximation (DA): As an approximation to the true moments, we propose that the moments ⟨x_i⟩ and ⟨x_i²⟩ will be largely affected by the diagonal entries of Σ. Therefore, inspired by the diagonal approximation ideas in [45], [46], we set Σ_ij = 0 for i ≠ j just for the moment calculations. Through numerical experiments we verify that this assumption is a very good one and produces very good estimates for ⟨x_i⟩ and ⟨x_i²⟩, outperforming S-NNLS solvers in terms of sparse recovery performance and performing fairly similarly to MCMC. More importantly, the MCMC procedure is computationally very intensive, especially for large problem sizes, and convergence of ⟨x_i²⟩ appears to be slow. The DA approach, on the other hand, requires a single closed-form update equation requiring only the computation of the complementary error function. Carrying out the second moment computation with this assumption leads to the following approximation of ⟨x_i²⟩:

    ⟨x_i²⟩ ≈ µ_i² + Σ_ii + µ_i √(2Σ_ii/π) · exp(−µ_i²/(2Σ_ii)) / erfc(−µ_i/√(2Σ_ii))    (24)

where erfc(·) refers to the complementary error function. This is simply the second moment of a univariate RG density given in [47].

3) Point estimate of x: The above equations are sufficient to implement the exact MCMC and DA versions of the EM algorithm. After finding an estimate of γ, denoted γ̂, the goal is to find a point estimate of x from p(x | y, γ̂, σ²). The optimal estimator of x in the mean-square-error (MSE) sense is simply given by [48]

    x̂_mean = E_{x|y,γ̂}[x]    (25)

where the expectation can be computed by using MCMC to sample from p(x | y, γ̂, σ²) and average the samples, or by using the diagonal approximation, which results in the approximation below, i.e. the first moment of a univariate RG as in [47]:

    E_{x|y,γ̂}[x_i] ≈ µ_i + √(2Σ_ii/π) · exp(−µ_i²/(2Σ_ii)) / erfc(−µ_i/√(2Σ_ii))    (26)

An alternative point estimate is to use x̂_mode, given by

    x̂_mode = argmax_x p(x | y, γ̂, σ²) = argmin_{x ≥ 0} ‖y − Φx‖₂² + λ Σ_{i=1}^M x_i²/γ_i    (27)

where (27) can be solved by any NNLS solver. x̂_mode is a favorable point estimate because it chooses the peak of p(x | y, γ̂, σ²), which could be multi-modal and not characterized well by its mean. Note that if all entries of µ are positive, then calculating (27) is not necessary, as this point is the true mode, i.e. x̂_mode = µ. The performance of both point estimate methods is discussed in Section IV.
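Under the diagonal approximation, the EM update (15) and the mean point estimate reduce to the univariate rectified Gaussian moments in (24) and (26), which can be computed entry-wise. A small sketch (ours; SciPy's erfc is assumed, and the function and argument names are illustrative) is:

```python
import numpy as np
from scipy.special import erfc

def rg_moments(mu_i, sig_ii):
    """First and second moments of a univariate rectified Gaussian with
    parameters mu_i and Sigma_ii, i.e. the DA updates (26) and (24).
    (For strongly negative mu_i, a scaled erfc such as scipy.special.erfcx
    would be numerically safer; plain erfc is kept here for clarity.)"""
    ratio = (np.sqrt(2.0 * sig_ii / np.pi)
             * np.exp(-mu_i ** 2 / (2.0 * sig_ii))
             / erfc(-mu_i / np.sqrt(2.0 * sig_ii)))
    first = mu_i + ratio                        # (26)
    second = mu_i ** 2 + sig_ii + mu_i * ratio  # (24)
    return first, second
```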
The complete R-SBL algorithm is summarized in Figure 1 for easy reference.

    Require: y, Φ, σ², ǫ
    1: Initialize Γ⁰ = I, t = 1
    2: while ‖γ^t − γ^{t−1}‖₂ / ‖γ^{t−1}‖₂ ≥ ǫ do
    3:   Calculate µ^t using (17) and Γ^{t−1}
    4:   Compute Σ^t using (18) and Γ^{t−1}
    5:   Option 1: Calculate γ^{t+1} using MCMC in (23)
    6:   Option 2: Approximate γ^{t+1} using DA in (24)
    7:   t ← t + 1
    8: end while
    9: For the mean estimator, compute E_{x|y,γ^t,σ²}[x] using the MCMC sample mean or using the DA in (26)
    10: For the mode estimator, compute x̂_mode using (27) with the converged values of γ^{t_end}

    Fig. 1: R-SBL Algorithm

IV. EXPERIMENTS

In this section we provide two sets of experiments to show the recovery performance of the R-SBL approach. The first set presents the MCMC results along with the DA and the baseline S-NNLS solvers. This experiment aims to compare MCMC with DA and establishes the superiority of R-SBL over the baseline methods. We focus on two relatively small problems (P1) and (P2), as MCMC is computationally not feasible for large problems. The second set contains extensive simulations comparing the proposed DA approach with baseline S-NNLS solvers for a larger problem size and for various cardinalities. In both the first and second sets, we simulate a near 'noiseless' case where the noise has distribution v ∼ N(0, σ²I) with σ² = 10⁻⁸. We present results of the proposed DA for a wide variety of distributions for x and Φ. Additionally, in a separate simulation, we set the noise variance such that the signal to noise ratio (SNR) is 20 dB to test the robustness of the R-SBL with DA under noisy conditions.

In all of the experiments, we compare our approach to available baseline S-NNLS solvers: SLEP-ℓ1 [49] and NN-OMP [50]. The noiseless case provides a fair comparison between the recovery performance of the S-NNLS solvers since, in this scenario, algorithm parameters are easy to select. In the noisy setting, tuning of various algorithm parameters leads to differing behavior of the algorithms, making it difficult to draw wide-reaching conclusions about their relative performance. Therefore, the noisy simulations are meant for robustness analysis of R-SBL with DA rather than comparison with other methods. Note that we use the terms N_R and RG interchangeably to denote the rectified Gaussian distribution.

A. Experiment Design

For the MCMC simulations we focus on two problems with different sizes:
• P1: x_gen ∈ R^{20}_+ and Φ ∈ R^{20×80}
• P2: x_gen ∈ R^{40}_+ and Φ ∈ R^{40×160}
with elements of x_gen drawn from the RG (Location: 0, Scale: 1) density. For P1 we perform 250 trials for a fixed cardinality of K = 10, and for P2 we perform 80 trials for a fixed cardinality of K = 20.

To test the proposed DA approach, we generate sparse ground truth solutions x_gen ∈ R^{100}_+ such that ‖x_gen‖₀ = K. We draw elements of x_gen according to the following distributions:
1) RG (Location: 0, Scale: 1)
2) NN-Cauchy (Location: 0, Scale: 1)
3) NN-Laplace (Location: 0, Scale: 1)
4) NN-Gamma (Location: 1, Scale: 2)
5) Chi-square with ν = 2
6) Bernoulli with p(0.5) = 1/2 and p(1.5) = 1/2
where the prefix 'NN' stands for non-negative. The non-negative distributions are obtained by taking the absolute value of the respective probability densities, i.e. NN-X = |X|. Next, we randomly generate Φ ∈ R^{100×400} according to the following densities
I. Gaussian (Location: 0, Scale: 1)
II. ±1 with p(1) = 1/2 and p(−1) = 1/2
and we normalize the columns of Φ to have unit ℓ2 norm. For a given Φ and x_gen, we compute the synthetic measurements y = Φx_gen and use the various S-NNLS algorithms to approximate x_gen by x̂. For the DA simulations, we perform 1000 trials for each combination of distribution for x_gen, distribution for Φ, and cardinality K ∈ {1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50}.
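For concreteness, the data generation used in the DA experiments can be sketched as follows (our own code; only the RG(0,1) case is shown, and the random seed and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 100, 400, 20                      # dimensions used in the DA experiments

Phi = rng.normal(size=(N, M))               # Gaussian dictionary
Phi /= np.linalg.norm(Phi, axis=0)          # unit l2-norm columns

x_gen = np.zeros(M)
support = rng.choice(M, size=K, replace=False)
x_gen[support] = np.abs(rng.normal(size=K))  # K-sparse ground truth, entries ~ RG(0, 1)

sigma2 = 1e-8                                # near-noiseless setting
y = Phi @ x_gen + rng.normal(scale=np.sqrt(sigma2), size=N)
```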
B. Performance Metrics

To evaluate the performance of the S-NNLS algorithms, we use the mean squared recovery error (MSE) and the probability of error in the recovered support (PE), which are two commonly used performance metrics in the compressive sensing literature [15]. We measure the MSE between the recovered signal x̂ and the ground truth x_gen using

    MSE = (1/N) Σ_{i=1}^N (x̂_i − x_gen,i)².    (28)

Denoting the support of the true solution as S and the support of x̂ as Ŝ, PE is calculated using

    PE = ( max{|S|, |Ŝ|} − |S ∩ Ŝ| ) / max{|S|, |Ŝ|}.    (29)

A value of PE = 0 indicates that the ground truth and recovered supports are the same, whereas PE = 1 indicates no overlap between the supports. Averaging the PE over multiple trials gives the empirical probability of making errors in the support. We calculate the average values of MSE and PE over the respective trials, i.e. 250 (P1), 80 (P2), and 1000 (DA). In the following sections, we use MSE and PE to indicate the averaged values.
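The two metrics can be computed directly from x̂ and x_gen; in the sketch below (ours) the MSE is taken as the average squared entry-wise difference, and the support is obtained by thresholding small entries at an illustrative tolerance:

```python
import numpy as np

def mse(x_hat, x_gen):
    # mean squared recovery error, cf. (28): average squared entry-wise difference
    return np.mean((x_hat - x_gen) ** 2)

def support_error(x_hat, x_gen, tol=1e-6):
    # probability of error in the recovered support, cf. (29)
    S_hat = set(np.flatnonzero(np.abs(x_hat) > tol))
    S = set(np.flatnonzero(np.abs(x_gen) > tol))
    m = max(len(S), len(S_hat))
    return (m - len(S & S_hat)) / m if m > 0 else 0.0
```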
C. Experiment Results

In this section we detail the experimental results, showing the benefits of R-SBL compared to other methods. The results show that, in all of the sparse recovery experiments described, the proposed R-SBL method performs significantly better than its SLEP-ℓ1 and NN-OMP counterparts in terms of both MSE and PE, in both the DA and MCMC simulations.

Markov-Chain Monte-Carlo Results: Table I shows the MCMC simulation results together with the DA and other baseline methods for problems P1 and P2. As expected, R-SBL with both MCMC and DA achieves better performance with respect to NN-OMP and SLEP, demonstrating the superiority of R-SBL over the other S-NNLS solvers. Moreover, since the MCMC solution is exact, it achieves better performance compared to the DA. Although the MCMC-based EM is theoretically more pleasing, it is not feasible numerically and complexity-wise, especially for large problem sizes. Trials for P1 took from a couple of minutes to hours depending on how ill-posed the problem was, and we found that simulations for the larger problem P2 took significantly longer. Therefore, we regard the DA as a simple and computationally more attractive alternative to the MCMC method.

                            P1                  P2
                        MSE      PE         MSE      PE
    Mean R-SBL (MCMC)  0.0692   0.1148     0.0683   0.0853
    Mode R-SBL (MCMC)  0.0692   0.0936     0.0687   0.0847
    Mean R-SBL (DA)    0.1045   0.1192     0.0885   0.0927
    Mode R-SBL (DA)    0.1047   0.0956     0.0886   0.0873
    SLEP               0.2013   0.3276     0.1561   0.2947
    NN-OMP             0.4824   0.4016     0.3763   0.3380
    TABLE I: MCMC and DA Results

Diagonal Approximation (DA) Results: Figures 2 and 3 show the MSE and PE as a function of solution cardinality using the DA approach, with elements of Φ drawn from a normal distribution N(0,1) and x_gen,i drawn from RG(0,1); mean and mode refer to the type of point estimate used. We see that R-SBL performs significantly better than the other methods. In fact, the average reconstruction error is less than 1% with R-SBL for cardinalities less than K = 45. Compared to the other algorithms, this is a significant improvement in reconstruction performance. Also note that the mean and mode point estimates have the same recovery performance.

    Fig. 2: Average MSE versus cardinality. Φ is N(0,1) and x_gen is RG(0,1).
    Fig. 3: Average PE versus cardinality. Φ is N(0,1) and x_gen is RG(0,1).

We find that R-SBL with DA performs the best regardless of the distribution of x_gen and outperforms the other algorithms by a large margin. We summarize the MSE and PE for all other distributions of x_gen in Table II for Φ ∼ N(0,1) and Table III for Φ ∼ ±1 for a fixed cardinality of K = 50. We select K = 50 because it is the most difficult experimental setting and because the relative performance of the tested algorithms is largely the same across different experimental settings as compared to Figures 2 and 3.

Table II and Table III show that R-SBL with DA has superior recovery performance. The MSE for NN-Cauchy, NN-Laplace, Gamma and Chi-square is below 0.6% with the proposed method, whereas with SLEP and NN-OMP the MSE is above 7%. Similar performance gains are observed in the PE. We also observe that the distribution of Φ does not affect the results significantly, so R-SBL with DA is robust to different dictionary distributions.

              Distribution of x_gen   NN-OMP    SLEP     R-SBL
    Avg. MSE  RG(0,1)                 0.3918    0.1582   0.0271
              NN-Cauchy(0,1)          0.0093    0.0080   0.0002
              NN-Laplace(0,1)         0.1163    0.0768   0.0055
              Gamma(1,2)              0.1293    0.0783   0.0047
              Chi-square (ν=2)        0.1261    0.0746   0.0029
              Bernoulli               0.8972    0.2847   0.2108
    Avg. PE   RG(0,1)                 0.4213    0.3363   0.0465
              NN-Cauchy(0,1)          0.1932    0.3121   0.0394
              NN-Laplace(0,1)         0.2579    0.3197   0.0169
              Gamma(1,2)              0.2770    0.3191   0.0147
              Chi-square (ν=2)        0.2672    0.3210   0.0117
              Bernoulli               0.5473    0.3504   0.2631
    TABLE II: Averaged MSE and PE performance over 1000 trials for a cardinality of K = 50. Φ is distributed as N(0,1).

              Distribution of x_gen   NN-OMP    SLEP     R-SBL
    Avg. MSE  RG(0,1)                 0.4028    0.1581   0.0028
              NN-Cauchy(0,1)          0.0097    0.0083   0.0001
              NN-Laplace(0,1)         0.1270    0.0794   0.0053
              Gamma(1,2)              0.1219    0.0740   0.0035
              Chi-square (ν=2)        0.1318    0.0759   0.0048
              Bernoulli               0.9083    0.2847   0.2086
    Avg. PE   RG(0,1)                 0.4202    0.3362   0.0510
              NN-Cauchy(0,1)          0.1936    0.3144   0.0337
              NN-Laplace(0,1)         0.2756    0.3256   0.0163
              Gamma(1,2)              0.2717    0.3195   0.0113
              Chi-square (ν=2)        0.2658    0.3179   0.0128
              Bernoulli               0.5540    0.3488   0.2624
    TABLE III: Averaged MSE and PE performance over 1000 trials for a cardinality of K = 50. Φ is distributed as ±1.

The worst performance is observed when x_gen is Bernoulli with p(x_gen,i = 0.5) = 1/2 and p(x_gen,i = 1.5) = 1/2. We see significantly larger MSE values with the greedy NN-OMP method, while SLEP-ℓ1 performs significantly better than NN-OMP. R-SBL performs much better compared to the other techniques, although the MSE and PE are relatively high compared to the results for the other distributions of x_gen. This performance loss may result from the fact that the Bernoulli distribution is not approximated well by continuous sparsity-enforcing distributions.

Noise Performance: We now demonstrate the performance of the proposed DA approach in the presence of Gaussian noise; due to its computational complexity, noise analysis for MCMC is not pursued further. The zero mean noise vector v is added to the measurements y such that the SNR is 20 dB. Selection of tuning parameters for NN-OMP, SLEP-ℓ1, and R-SBL is not straightforward in the noisy simulations. To find a common ground between these tuning parameters, we do a line search over the parameters for SLEP and NN-OMP and select the tuning parameters that give the smallest MSE over all cardinalities. For R-SBL, we fix σ² to be the true noise variance. It should be noted that R-SBL is sensitive to the choice of σ², and estimation of σ² is a topic of research in its own right. Depending on the application, σ² can be estimated offline, or estimation of σ² can be incorporated into the R-SBL framework [31].

Figures 4 and 5 show the MSE and PE for Mean R-SBL and Mode R-SBL. x_gen,i is drawn from RG(0,1) and the elements of Φ are drawn from N(0,1). Compared to Figures 2 and 3, we see that noise degrades the recovery performance of each method. However, both Mean R-SBL and Mode R-SBL perform much better compared to SLEP-ℓ1 and NN-OMP, especially at higher cardinalities.

    Fig. 4: Average MSE versus cardinality. Φ is N(0,1), x_gen is RG(0,1), and v is Gaussian. SNR is 20 dB.
    Fig. 5: Average PE versus cardinality. Φ is N(0,1), x_gen is RG(0,1), and v is Gaussian. SNR is 20 dB.
Timing Performance: In this part we comment on the timing performance of the R-SBL with DA approach. R-SBL with DA is computationally much more efficient than MCMC, which requires drawing samples and simulating the posterior density in an iterative fashion. In fact, complexity-wise, the R-SBL with DA updates in Figure 1 differ from the SBL algorithm introduced in [31] only in the first and second moment calculations presented in Equations (26) and (24). If further computational efficiency is desired, one way of achieving a speed up is by pruning γ when its elements become close to zero. When an element γ_i becomes smaller than a threshold, e.g. ǫ_γ = 10⁻⁵, we shrink γ and Φ by ignoring the index i in subsequent iterations. This effectively reduces the problem dimensions and improves execution time. Alternatively, we can relax the convergence condition by making the ǫ in the algorithm of Figure 1 larger, depending on the desired recovery MSE and PE.
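A minimal sketch of the pruning step (our own; the threshold argument is illustrative) simply drops the converged hyper-parameters and the corresponding columns of Φ, while remembering the surviving indices so that the final estimate can be mapped back to the full-length x:

```python
import numpy as np

def prune(gamma, Phi, eps_gamma=1e-5):
    """Drop hyper-parameters whose scale has effectively converged to zero,
    along with the matching columns of Phi, and return the surviving indices."""
    keep = gamma > eps_gamma
    return gamma[keep], Phi[:, keep], np.flatnonzero(keep)
```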
We note that SLEP-ℓ1 and NN-OMP will still run faster than R-SBL, as OMP and ℓ1 minimization approaches are faster than SBL, despite the speed-up considerations. Nevertheless, it should be noted that message passing techniques provide considerable speed-up in compressed sensing problems [51], [52], [53], and recent developments in message passing techniques for SBL [54] could be extended to the R-SBL framework, leading to significant speed-up.

Figure 6 shows an example of the average execution times to solve the S-NNLS problem. This example includes both the pruning and the no-pruning methods. In the pruning method, we prune γ with a threshold of ǫ_γ = 10⁻³, and the convergence condition is set to ǫ = 10⁻⁵ in both cases. Results show that R-SBL with DA is slightly slower than SLEP-ℓ1 and NN-OMP, by about 200 milliseconds on average. Pruning helps considerably with the mode point estimate but is rather ineffective for the mean point estimate.

    Fig. 6: Average execution time to solve the S-NNLS problem (average execution time in seconds versus the cardinality of the true solution, for Mean R-SBL, Mode R-SBL, Mean R-SBL Prune, Mode R-SBL Prune, SLEP-L1, and NN-OMP).

V. CONCLUSION

In this paper, we introduced a hierarchical Bayesian method to solve the S-NNLS problem. We developed the rectified Gaussian scale mixture model as a general and versatile prior to promote sparsity. We constructed the R-SBL algorithm using the EM framework, using both the MCMC approach and an approximation technique that is fast, scalable to large problems, and effective in performance. We presented two point estimate methods for the EM framework. We demonstrated that R-SBL outperforms the available S-NNLS solvers by a large margin both in terms of signal and support recovery. The performance gains achieved by R-SBL with the Diagonal Approximation are consistent across different non-negative data distributions for x and different sensing matrix distributions for Φ. Under noisy conditions, we have shown that R-SBL with DA is very robust in the MSE sense with both mean and mode point estimates and achieves much better reconstruction performance compared to the other methods tested.
REFERENCES

[1] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems. SIAM, 1974, vol. 161.
[2] B. M. Jedynak and S. Khudanpur, "Maximum likelihood set for estimating a probability mass function," Neural Computation, vol. 17, no. 7, pp. 1508–1530, 2005.
[3] J. P. Vila and P. Schniter, "An empirical-Bayes approach to recovering linearly constrained non-negative sparse signals," IEEE Transactions on Signal Processing, vol. 62, no. 18, pp. 4689–4703, 2014.
[4] R. Peharz and F. Pernkopf, "Sparse nonnegative matrix factorization with ℓ0-constraints," Neurocomputing, vol. 80, pp. 38–46, 2012.
[5] H. Kim and H. Park, "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis," Bioinformatics, vol. 23, no. 12, pp. 1495–1502, 2007.
[6] H. Kim and H. Park, "Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 2, pp. 713–730, 2008.
[7] V. P. Pauca, F. Shahnaz, M. W. Berry, and R. J. Plemmons, "Text mining using non-negative matrix factorizations," in SDM, vol. 4, 2004, pp. 452–456.
[8] V. Monga and M. K. Mihçak, "Robust and secure image hashing via non-negative matrix factorizations," IEEE Transactions on Information Forensics and Security, vol. 2, no. 3, pp. 376–390, 2007.
[9] P. C. Loizou, "Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 857–869, 2005.
[10] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[11] P. Sajda, S. Du, T. R. Brown, R. Stoyanova, D. C. Shungu, X. Mao, and L. C. Parra, "Nonnegative matrix factorization for rapid recovery of constituent spectra in magnetic resonance chemical shift imaging of the brain," IEEE Transactions on Medical Imaging, vol. 23, no. 12, pp. 1453–1465, 2004.
[12] L. C. Potter, E. Ertin, J. T. Parker, and M. Cetin, "Sparsity and compressed sensing in radar imaging," Proceedings of the IEEE, vol. 98, no. 6, pp. 1006–1020, 2010.
[13] A. Hurmalainen, R. Saeidi, and T. Virtanen, "Group sparsity for speaker identity discrimination in factorisation-based speech recognition," in INTERSPEECH, 2012.
[14] M. Lustig, J. M. Santos, D. L. Donoho, and J. M. Pauly, "k-t SPARSE: High frame rate dynamic MRI exploiting spatio-temporal sparsity," in Proceedings of the 13th Annual Meeting of ISMRM, Seattle, vol. 2420, 2006.
[15] M. Elad, Sparse and Redundant Representations. Springer New York, 2010.
[16] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.
[17] Y. C. Pati, R. Rezaiifar, and P. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Conference Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers. IEEE, 1993, pp. 40–44.
[18] A. M. Bruckstein, D. L. Donoho, and M. Elad, "From sparse solutions of systems of equations to sparse modeling of signals and images," SIAM Review, vol. 51, no. 1, pp. 34–81, 2009.
[19] J. Nocedal and S. Wright, Numerical Optimization. Springer Science & Business Media, 2006.
[20] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[21] C.-J. Lin, "Projected gradient methods for nonnegative matrix factorization," Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.
[22] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," The Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.
[23] P. D. Grady and S. T. Rickard, "Compressive sampling of non-negative signals," in 2008 IEEE Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2008, pp. 133–138.
[24] R. Chartrand and W. Yin, "Iteratively reweighted algorithms for compressive sensing," in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2008, pp. 3869–3872.
[25] R. Giri and B. D. Rao, "Type I and Type II Bayesian methods for sparse signal recovery using scale mixtures," arXiv preprint arXiv:1507.05087, 2015.
[26] D. F. Andrews and C. L. Mallows, "Scale mixtures of normal distributions," Journal of the Royal Statistical Society, Series B (Methodological), pp. 99–102, 1974.
[27] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[28] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Iteratively reweighted least squares for linear regression when errors are normal/independent distributed," 1980.
[29] K. Lange and J. S. Sinsheimer, "Normal/independent distributions and their applications in robust regression," Journal of Computational and Graphical Statistics, vol. 2, no. 2, pp. 175–198, 1993.
[30] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," The Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.
[31] D. P. Wipf and B. D. Rao, "An empirical Bayesian strategy for solving the simultaneous sparse approximation problem," IEEE Transactions on Signal Processing, vol. 55, no. 7, pp. 3704–3716, 2007.
[32] J. A. Palmer, "Variational and scale mixture representations of non-Gaussian densities for estimation in the Bayesian linear model: Sparse coding, independent component analysis, and minimum entropy segmentation," 2006.
[33] J. Palmer, K. Kreutz-Delgado, B. D. Rao, and D. P. Wipf, "Variational EM algorithms for non-Gaussian latent variable models," in Advances in Neural Information Processing Systems, 2005, pp. 1059–1066.
[34] E. L. Lehmann and G. Casella, Theory of Point Estimation. Springer Science & Business Media, 1998, vol. 31.
[35] D. P. Wipf, B. D. Rao, and S. Nagarajan, "Latent variable Bayesian models for promoting sparsity," IEEE Transactions on Information Theory, vol. 57, no. 9, pp. 6236–6255, 2011.
[36] R. Schachtner, G. Poeppel, A. Tomé, and E. Lang, "A Bayesian approach to the Lee–Seung update rules for NMF," Pattern Recognition Letters, vol. 45, pp. 251–256, 2014.
[37] M. Harva and A. Kabán, "Variational learning for rectified factor analysis," Signal Processing, vol. 87, no. 3, pp. 509–527, 2007.
[38] J. W. Miskin, "Ensemble learning for independent component analysis," in Advances in Independent Component Analysis, 2000.
[39] M. A. Figueiredo, "Adaptive sparseness for supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1150–1159, 2003.
[40] D. P. Wipf and B. D. Rao, "Sparse Bayesian learning for basis selection," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2153–2164, 2004.
[41] Z. Zhang, T.-P. Jung, S. Makeig, Z. Pi, and B. Rao, "Spatiotemporal sparse Bayesian learning with applications to compressed sensing of multichannel physiological signals," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 22, no. 6, pp. 1186–1197, 2014.
[42] W. C. Horrace, "Some results on the multivariate truncated normal distribution," Journal of Multivariate Analysis, vol. 94, no. 1, pp. 209–221, 2005.
[43] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 721–741, 1984.
[44] Y. Li and S. K. Ghosh, "Efficient sampling methods for truncated multivariate normal and Student-t distributions subject to linear inequality constraints," Journal of Statistical Theory and Practice, vol. 9, no. 4, pp. 712–732, 2015.
[45] M. Magdon-Ismail and J. T. Purnell, "Approximating the covariance matrix of GMMs with low-rank perturbations," in Intelligent Data Engineering and Automated Learning – IDEAL 2010. Springer, 2010, pp. 300–307.
[46] W. C. Horrace, "On ranking and selection from independent truncated normal distributions," Journal of Econometrics, vol. 126, no. 2, pp. 335–354, 2005.
[47] J. W. Miskin, "Ensemble learning for independent component analysis," in Advances in Independent Component Analysis, 2000.
[48] B. Hajek, Random Processes for Engineers. Cambridge University Press, 2015.
[49] J. Liu, S. Ji, J. Ye et al., "SLEP: Sparse learning with efficient projections," Arizona State University, vol. 6, p. 491, 2009.
[50] A. M. Bruckstein, M. Elad, and M. Zibulevsky, "On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations," IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 4813–4820, 2008.
[51] J. P. Vila and P. Schniter, "Expectation-maximization Gaussian-mixture approximate message passing," IEEE Transactions on Signal Processing, vol. 61, no. 19, pp. 4658–4672, 2013.
[52] D. L. Donoho, A. Maleki, and A. Montanari, "Message-passing algorithms for compressed sensing," Proceedings of the National Academy of Sciences, vol. 106, no. 45, pp. 18914–18919, 2009.
[53] D. L. Donoho, A. Maleki, and A. Montanari, "Message passing algorithms for compressed sensing: I. Motivation and construction," in 2010 IEEE Information Theory Workshop (ITW). IEEE, 2010, pp. 1–5.
[54] M. Al-Shoukairi and B. Rao, "Sparse Bayesian learning using approximate message passing."