An Asynchronous Parallel Approach to Sparse Recovery

Deanna Needell
Department of Mathematical Sciences
Claremont McKenna College
Claremont, CA 91711, USA
Email: [email protected]

Tina Woolf
Institute of Mathematical Sciences
Claremont Graduate University
Claremont, CA 91711, USA
Email: [email protected]

Abstract—Asynchronous parallel computing and sparse recovery are two areas that have received recent interest. Asynchronous algorithms are often studied to solve optimization problems where the cost function takes the form \sum_{i=1}^{M} f_i(x), with a common assumption that each f_i is sparse; that is, each f_i acts only on a small number of components of x ∈ R^n. Sparse recovery problems, such as compressed sensing, can be formulated as optimization problems; however, the cost functions f_i are dense with respect to the components of x, and instead the signal x is assumed to be sparse, meaning that it has only s non-zeros where s ≪ n. Here we address how one may use an asynchronous parallel architecture when the cost functions f_i are not sparse in x, but rather the signal x is sparse. We propose an asynchronous parallel approach to sparse recovery via a stochastic greedy algorithm, where multiple processors asynchronously update a vector in shared memory containing information on the estimated signal support. We include numerical simulations that illustrate the potential benefits of our proposed asynchronous method.

I. INTRODUCTION

Technological advances in data gathering systems have led to the rapid growth of big data in diverse applications. At the same time, the recent emergence of inexpensive multicore processors, with the number of cores on each workstation on the rise, has motivated the study of parallel computing strategies. This has presented a challenge to many existing and popular algorithms that are designed to run iteratively and sequentially.

One possible approach to this problem is synchronous parallel computing, which assigns tasks to multiple cores and then waits for all cores to complete before the next step begins. The drawback of this approach is that all cores must wait for the slowest core to finish, even if the remaining cores all complete their computation quickly. An alternative approach, and one that has received much recent interest, is asynchronous parallel computing. In an asynchronous system, all cores run continuously, thus eliminating the idle time present in the synchronous approach, and all cores have access to shared memory and are able to make updates as needed.

Asynchronous algorithms are often studied to solve optimization problems where the cost function takes the form \sum_{i=1}^{M} f_i(x). The decision variable x ∈ R^n is updated iteratively, and its current state is accessible in shared memory by all processors. Many approaches to asynchronous parallel computing for solving optimization problems of this form assume that each f_i is sparse, meaning that each f_i acts only on a small number of components of x; therefore, individual core computations depend on, and update, only a small number of coordinates of x (e.g., [11], [25]). The benefit of this setup is that memory overwrites are rare, making it unlikely for the progress of faster cores to be overwritten by updates from slower cores.

There has also been a surge of interest in sparse recovery problems; for instance, the compressed sensing literature (e.g., [4], [5], [10]) addresses the problem of recovering a sparse vector x ∈ R^n from few nonadaptive, linear, and possibly noisy measurements of the form y = Ax + z, where A ∈ R^{m×n} is the measurement matrix and z ∈ R^m is noise. Recovering x from the noisy measurements y can be formulated as the optimization problem

    \min_{\tilde{x} \in \mathbb{R}^n} \; \frac{1}{2m} \| y - A\tilde{x} \|_2^2 \quad \text{subject to} \quad \|\tilde{x}\|_0 \le s.     (1)
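To make this setup concrete, the following sketch (ours, not from the paper) draws a standard i.i.d. Gaussian measurement matrix, forms noisy measurements y = Ax + z of an s-sparse signal, and evaluates the cost function in (1); the dimensions simply mirror the simulations of Section IV.

```python
# Illustrative sketch (not from the paper): the compressed sensing measurement
# model y = A x + z and the objective of problem (1).
import numpy as np

rng = np.random.default_rng(0)
n, m, s = 1000, 300, 20                         # dimensions as in Section IV

A = rng.standard_normal((m, n))                 # i.i.d. Gaussian measurement matrix
x = np.zeros(n)
support = rng.choice(n, size=s, replace=False)  # true support T with |T| = s
x[support] = rng.standard_normal(s)
z = 0.01 * rng.standard_normal(m)               # measurement noise
y = A @ x + z

def objective(x_tilde):
    """Cost function of (1): (1/(2m)) * ||y - A x_tilde||_2^2."""
    r = y - A @ x_tilde
    return r @ r / (2 * m)
```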
It is then natural to ask whether we can apply asynchronous parallel computing to solve (1). The challenge, however, is that the cost function depends on A, which is typically not sparse (e.g., A is commonly taken to have standard i.i.d. Gaussian entries). This is in stark contrast to the typical assumptions made in the asynchronous parallel computing literature, since here the cost function is dense in the components of x, and instead the signal x is assumed to be sparse. Indeed, since the signal x is sparse, it is very likely that the same non-zero entries will be updated from one iteration to the next, while the remaining entries are set to zero to maintain a sparse solution. If executed asynchronously, memory overwrites would be frequent, and a slow core could easily "undo" the progress of previous updates by faster cores.

Contribution: In this paper, we consider one of the stochastic greedy algorithms studied in [22] for sparse recovery. Focusing on the compressed sensing problem, we propose a strategy for utilizing the algorithm asynchronously despite the matrix A (and thus the cost function) not being sparse. Instead of keeping the current solution estimate in shared memory, a current estimate of the locations of the non-zeros of the signal is shared and available to each core. Thus, we provide a solution that merges asynchronous parallel architectures with sparse recovery problems whose iterative updates are not sparse, and that will serve as a springboard to other more general approaches.

II. RELATION TO PRIOR WORK

The seminal text [2] provides foundational work for parallel and distributed optimization algorithms. More recently, [25] studies an asynchronous variant of stochastic gradient descent called HOGWILD!. A key assumption in that analysis is that the cost function of the optimization problem is sparse with respect to the decision variable, meaning that most gradient updates only modify small and distinct parts of the solution. Much of the recent literature on asynchronous parallel computing borrows from the framework proposed in [25]. [11] also studies stochastic optimization when the cost function is sparse, and proposes two asynchronous algorithms whose analysis is influenced by that in [25]. Also following the technique in [25], [19] presents an asynchronous parallel variant of the randomized Kaczmarz algorithm for solving the linear system Ax = y when A is large and sparse. [1] considers the same linear system with A symmetric positive definite and presents an asynchronous solver; note that the convergence rate analysis again depends on the sparsity of the matrix. Asynchronous parallel stochastic coordinate descent algorithms (which are clearly not designed for sparse solutions) are proposed in [18] and [17], which also follow the model of [25]. Other recent and relevant work on asynchronous parallel algorithms and analysis frameworks includes [6]–[9], [12]–[16], [20], [24], [27].

In the sparse recovery literature, popular greedy recovery algorithms include IHT [3], OMP [26], and CoSaMP [21]. Most relevant to our work is IHT, which we briefly review here. Starting with an initial estimate x^1 = 0, the IHT algorithm computes the recursive update

    x^{t+1} = H_s\big( x^t + A^*(y - Ax^t) \big),     (2)

where H_s(a) is the thresholding operator that sets all but the largest (in magnitude) s coefficients of a to zero. The work [22] proposes a stochastic variant of IHT called StoIHT, which is the algorithm of focus here. Note that [22] also proposes a stochastic variant of GradMP [23], which is based on CoSaMP; our work here can easily be generalized to these other algorithms as well.
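For concreteness, the sketch below (ours, not code from [3] or [22]) implements the hard thresholding operator H_s and the IHT update (2); in practice A is scaled, or a step size is introduced, so that the iteration is contractive.

```python
# Illustrative sketch of the IHT update (2). H_s keeps the s largest-magnitude
# entries of its argument and zeros out the rest.
import numpy as np

def hard_threshold(a, s):
    """H_s(a): zero out all but the s largest-magnitude entries of a."""
    out = np.zeros_like(a)
    keep = np.argpartition(np.abs(a), -s)[-s:]
    out[keep] = a[keep]
    return out

def iht(y, A, s, num_iters=200):
    """x^{t+1} = H_s(x^t + A^*(y - A x^t)); assumes A is scaled so the map contracts."""
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        x = hard_threshold(x + A.conj().T @ (y - A @ x), s)
    return x
```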
III. ASYNCHRONOUS SPARSE RECOVERY WITH TALLY UPDATES

Notation: We denote by [M] the set {1, 2, ..., M} and let p(1), ..., p(M) be the probability distribution of an index i selected at random from the set [M], so that \sum_{i=1}^{M} p(i) = 1. Let A^* denote the conjugate transpose of the matrix A. For a vector a ∈ R^n, let supp_s(a) be the operator that returns the set of cardinality s containing the indices of the largest (in magnitude) elements of a. For a set Γ ⊂ {1, ..., n}, let a_Γ denote the vector a with everything except the components indexed in Γ set to zero.

As described above, suppose we observe y = Ax + z, where x is s-sparse and supported on T ⊂ {1, ..., n} (that is, ‖x‖_0 = |{i : x_i ≠ 0}| = |T| ≤ s ≪ n). To recover x from the noisy measurements y, we aim to solve the optimization problem (1). Note that we can also express the cost function in (1) as

    \frac{1}{2m} \| y - A\tilde{x} \|_2^2 = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{2b} \| y_{b_i} - A_{b_i} \tilde{x} \|_2^2,

where y has been decomposed into non-overlapping vectors y_{b_i} of size b, A has been decomposed into non-overlapping b × n submatrices A_{b_i}, and M = m/b (which for simplicity we assume is integral). Notice that the cost function now takes the form \sum_{i=1}^{M} f_i(x), and each f_i(x) = \frac{1}{2bM} \| y_{b_i} - A_{b_i} x \|_2^2 accounts for a block of observations of size b. The StoIHT algorithm from [22] for solving (1), specialized to the compressed sensing setting, is shown in Algorithm 1, where γ denotes a step-size parameter. The recovery error of the algorithm depends on the block size b; we refer the reader to [22] for details.

Algorithm 1 StoIHT Algorithm [22]
input: s, γ, p(i), and stopping criterion
initialize: x^1 and t = 1
repeat
    randomize: select i_t ∈ [M] with probability p(i_t)
    proxy: b^t = x^t + (γ / (M p(i_t))) A^*_{b_{i_t}} (y_{b_{i_t}} − A_{b_{i_t}} x^t)
    identify: Γ^t = supp_s(b^t)
    estimate: x^{t+1} = b^t_{Γ^t}
    t = t + 1
until halting criterion true
output: x̂ = x^t
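A minimal sequential sketch of Algorithm 1 follows (ours, not code from [22]); it assumes contiguous blocks of b rows, a uniform distribution p(i) = 1/M, and the halting criterion used in the simulations of Section IV.

```python
# Illustrative sketch of Algorithm 1 (StoIHT) for the compressed sensing setting.
import numpy as np

def sto_iht(y, A, s, b, gamma=1.0, max_iter=1500, tol=1e-7, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    M = m // b                              # number of blocks (assumed integral)
    p = np.full(M, 1.0 / M)                 # uniform block-selection probabilities
    x = np.zeros(n)
    for _ in range(max_iter):
        i = rng.choice(M, p=p)                                      # randomize
        rows = slice(i * b, (i + 1) * b)
        Ab, yb = A[rows], y[rows]
        bt = x + gamma / (M * p[i]) * Ab.conj().T @ (yb - Ab @ x)   # proxy
        Gamma = np.argpartition(np.abs(bt), -s)[-s:]                # identify: supp_s(b^t)
        x = np.zeros(n)
        x[Gamma] = bt[Gamma]                                        # estimate: restrict b^t to Gamma^t
        if np.linalg.norm(y - A @ x) < tol:                         # halting criterion (Section IV)
            break
    return x
```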
In the asynchronous approach, each core executes its own slightly modified version of Algorithm 1. In order to avoid having each core update the entire solution in shared memory at each iteration, we propose to instead keep a tally vector φ ∈ R^n in shared memory. The tally φ contains information on the estimated support locations identified by each core's most recent iteration.

The steps executed by each core at each iteration are detailed in Algorithm 2, where the tally vector φ is available for read and write by each core. Note that in Algorithm 2 the iteration number t and the iterate x^t are local to each core. Also note that in the tally update step, the tally is incremented by t (the core's local iteration number) on Γ^t and decremented by t − 1 on Γ^{t−1} (as mentioned in [25], these operations can be performed atomically on most modern processors). This gives more weight to faster cores, which are further along in the algorithm and should thus be able to provide a better support estimate, and less weight to slower cores. In order to retain only the information from each core's most recent iteration, the tally contribution from the previous iteration t − 1 is removed.

Algorithm 2 Asynchronous StoIHT Iteration
Each core performs the following at each iteration. The tally vector φ is available to each core.
    randomize: select i_t ∈ [M] with probability p(i_t)
    proxy: b^t = x^t + (γ / (M p(i_t))) A^*_{b_{i_t}} (y_{b_{i_t}} − A_{b_{i_t}} x^t)
    identify: Γ^t = supp_s(b^t)
              T̃^t = supp_s(φ)
    estimate: x^{t+1} = b^t_{Γ^t ∪ T̃^t}
    update tally: φ_{Γ^t} = φ_{Γ^t} + t
                  φ_{Γ^{t−1}} = φ_{Γ^{t−1}} − (t − 1)
    t = t + 1
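The sketch below (ours) illustrates one core's iteration of Algorithm 2 in single-threaded NumPy; the shared tally φ is modeled as an ordinary array, and the atomic increments of a real shared-memory implementation are replaced by plain additions. Uniform block selection is assumed, so γ/(M p(i_t)) = γ.

```python
# Illustrative single-core sketch of one iteration of Algorithm 2.
import numpy as np

def async_core_step(x, t, Gamma_prev, phi, y, A, s, b, gamma=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    m, n = A.shape
    M = m // b
    i = rng.integers(M)                                    # randomize (uniform p(i) = 1/M)
    rows = slice(i * b, (i + 1) * b)
    Ab, yb = A[rows], y[rows]
    bt = x + gamma * Ab.conj().T @ (yb - Ab @ x)           # proxy (gamma/(M p(i)) = gamma here)
    Gamma = np.argpartition(np.abs(bt), -s)[-s:]           # identify: Gamma^t = supp_s(b^t)
    T_tilde = np.argpartition(np.abs(phi), -s)[-s:]        # T~^t = supp_s(phi), read from shared memory
    keep = np.union1d(Gamma, T_tilde)
    x_new = np.zeros(n)
    x_new[keep] = bt[keep]                                 # estimate: restrict b^t to Gamma^t ∪ T~^t
    phi[Gamma] += t                                        # update tally: weight by local iteration count
    if Gamma_prev is not None:
        phi[Gamma_prev] -= t - 1                           # remove the previous iteration's contribution
    return x_new, Gamma
```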
It is worth mentioning that inconsistent reads, where components of the shared memory vector variables may be written by some cores while being simultaneously read by others, are discussed in the current literature on asynchronous computing. It is desirable to incorporate this feature into any models and analyses of the algorithms. Because the shared memory holds the tally φ rather than the current solution estimate, our hope is that the proposed approach is robust in this respect, and that an inconsistently read version of φ will still provide valuable information on the support locations of the signal. Although not addressed here, analyses and simulations of this impact are certainly of interest.

IV. SIMULATIONS

In this section, we provide encouraging experimental results for the proposed asynchronous tally update scheme. In all of the experiments, we take the signal dimension n = 1000, the sparsity level s = 20, the number of measurements m = 300, the block size b = 15, the step size γ = 1, and the initial estimate x^1 = 0. The algorithms exit once ‖y − Ax^t‖_2 drops below the tolerance 10^{−7} or a maximum of 1500 iterations is reached.

A. StoIHT with an Accurate Support Estimate

Our first experiment provides evidence that performing the estimation step onto the top s coefficients, in addition to a set that accurately describes the true signal support, increases the convergence rate of the standard StoIHT algorithm. That is, we execute Algorithm 1 with the modification that x^{t+1} = b^t_{Γ^t ∪ T̃}, where T̃ estimates the true support T with |T̃| = s and accuracy |T̃ ∩ T| / |T̃| = α. Figure 1 compares the mean recovery error as a function of iteration over 50 trials of the standard StoIHT algorithm (Algorithm 1) with the modified StoIHT algorithm as described, for various values of α. Indeed, for α > 0.5, fewer iterations are needed on average for the algorithm to converge. When α = 1, roughly half as many iterations as the standard algorithm are needed on average for convergence. We emphasize that this experiment, as a proof of concept, has no parallel implementation, and hence iterations are comparable to runtime. This result suggests that the asynchronous approach using Algorithm 2 could lead to speedups as long as the tally φ becomes accurate fast enough.

Fig. 1: Mean recovery error over 50 trials versus iteration of StoIHT and a modified version of StoIHT. In the modified StoIHT, we execute Algorithm 1 where the estimation step is performed by projecting b^t onto Γ^t ∪ T̃ at each iteration, where T̃ has accuracy |T̃ ∩ T| / |T̃| = α.

B. Asynchronous StoIHT

Here, we simulate the execution of asynchronous StoIHT with c cores, where each core uses the iteration defined in Algorithm 2. For clarity, define a time step to be the amount of time needed for the fastest core to complete an iteration of Algorithm 2. First, we assume each core takes the same amount of time to perform an iteration; thus, in a single time step, all c cores complete an iteration of Algorithm 2. We also assume that an iteration of Algorithm 1 takes a single time step. When executing the t-th time step, and hence the t-th iteration for each core, every core utilizes the same set T̃^t identified by the tally φ. Once each core completes its estimation step, the tally is updated via Γ^t from each core. As soon as any core achieves the exit criteria at its local iteration t, the algorithm terminates, and t time steps are recorded as the number of time steps until exit is achieved. The upper plot of Figure 2 displays the mean number of time steps until exit, over 500 trials, for both the standard and asynchronous StoIHT algorithms, plotted against the number of cores used in the asynchronous method. The shaded regions indicate ±1 standard deviation from the mean. Since the standard method does not depend on the number of cores, horizontal lines are shown. Notice that the mean number of time steps required by the asynchronous method is always less than for the standard algorithm; therefore, a speedup in the total time for the algorithm to converge is expected when executed asynchronously.

Next, we modify the previous experiment to simulate the impact of slow cores. We take half of the cores to be "fast," meaning that they continue to update the shared tally at each time step; the other half of the cores, however, are "slow" and only complete an iteration and update the tally once out of every four time steps. The lower plot of Figure 2 displays the mean number of time steps until exit versus the number of cores in the asynchronous method. For c = 2, no improvement is gained from the asynchronous method on average; however, improvement is observed for the larger values of c tested.

Fig. 2: Comparison of the number of time steps executed until convergence versus the number of cores used in the asynchronous StoIHT method. The heavy line denotes the mean number of time steps over 500 trials, and the boundaries of the shaded region indicate ±1 standard deviation from the mean. (Upper) All cores are simulated to complete an iteration in a single time step; (lower) half of the cores are "slow" and complete an iteration only once out of every four time steps.
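As a rough illustration of this experimental setup, the sketch below (ours, and a simplification of the protocol described above, since cores are visited sequentially within a time step and each core reads the tally immediately before its own update) emulates c cores sharing a tally, with "slow" cores acting only once every slow_period time steps. It reuses the hypothetical async_core_step helper sketched after Algorithm 2.

```python
# Illustrative serial emulation of the Section IV-B experiment: c cores share a
# tally phi; slow cores complete an iteration only once every `slow_period` steps.
import numpy as np

def simulate_async_stoiht(y, A, s, b, c, slow_fraction=0.5, slow_period=4,
                          gamma=1.0, max_steps=1500, tol=1e-7, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    phi = np.zeros(n)                                 # shared tally vector
    xs = [np.zeros(n) for _ in range(c)]              # each core's local iterate
    ts = [1] * c                                      # each core's local iteration count
    prev_Gammas = [None] * c
    n_slow = int(slow_fraction * c)                   # the first n_slow cores are "slow"
    for step in range(1, max_steps + 1):
        for core in range(c):
            if core < n_slow and step % slow_period != 0:
                continue                              # slow cores skip this time step
            xs[core], prev_Gammas[core] = async_core_step(
                xs[core], ts[core], prev_Gammas[core], phi, y, A, s, b, gamma, rng)
            ts[core] += 1
            if np.linalg.norm(y - A @ xs[core]) < tol:
                return step                           # time steps until any core exits
    return max_steps
```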
V. CONCLUSIONS AND FUTURE WORK

In this paper, we have proposed an asynchronous parallel variant of an existing stochastic greedy algorithm for sparse recovery. Our method is distinct from much of the existing literature on asynchronous algorithms because: 1) sparsity assumptions on the cost function, which are common to the existing literature, are not necessary, and 2) the current solution iterate is not available in shared memory; instead, a tally vector containing the latest information from the cores on the estimated support of the signal is shared and utilized. This approach provides necessary robustness to asynchronous updates and inconsistent reads even when traditional updates cannot be made sparse.

A similar approach could also be applied to the second stochastic greedy algorithm studied in [22], namely StoGradMP. Although we specialized to the compressed sensing problem, both StoIHT and StoGradMP are studied for general sparse recovery problems in [22]; thus, it would be interesting to explore the proposed idea in other settings. It would also be beneficial to develop theory for the proposed approach, perhaps building from the theory in [22] and the current literature analyzing asynchronous algorithms, and to incorporate architecture realities such as inconsistent reads.

ACKNOWLEDGMENT

The work of Deanna Needell was partially supported by NSF CAREER grant #1348721 and the Alfred P. Sloan Foundation. The work of Tina Woolf was partially supported by NSF CAREER grant #1348721.

REFERENCES

[1] H. Avron, A. Druinsky, and A. Gupta. Revisiting asynchronous linear solvers: Provable convergence rate through randomization. J. ACM, 62(6):51, 2015.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.
[3] T. Blumensath and M. Davies. Iterative hard thresholding for compressive sensing. Appl. Comput. Harmon. Anal., 27(3):265–274, 2009.
[4] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.
[5] E. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, 2006.
[6] L. Cannelli, F. Facchinei, V. Kungurtsev, and G. Scutari. Asynchronous parallel algorithms for nonconvex big-data optimization: Model and convergence. arXiv preprint arXiv:1607.04818, 2016.
[7] D. Davis. The asynchronous PALM algorithm for nonsmooth nonconvex problems. arXiv preprint arXiv:1604.00526, 2016.
[8] D. Davis, B. Edmunds, and M. Udell. The sound of APALM clapping: Faster nonsmooth nonconvex optimization with stochastic asynchronous PALM. arXiv preprint arXiv:1606.02338, 2016.
[9] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Proc. Adv. in Neural Processing Systems (NIPS), pages 1646–1654, 2014.
[10] D. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.
[11] J. Duchi, M. I. Jordan, and B. McMahan. Estimation, optimization, and parallelism when data is sparse. Proc. Adv. in Neural Processing Systems (NIPS), pages 2832–2840, 2013.
[12] J. C. Duchi, S. Chaturapruek, and C. Ré. Asynchronous stochastic convex optimization. arXiv preprint arXiv:1508.00882, 2015.
[13] H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. Proc. IEEE Conf. Decision and Control (CDC), pages 1384–1389, 2015.
[14] Z. Huo and H. Huang. Asynchronous stochastic gradient descent with variance reduction for non-convex optimization. arXiv preprint arXiv:1604.03584, 2016.
[15] R. Leblond, F. Pedregosa, and S. Lacoste-Julien. ASAGA: Asynchronous parallel SAGA. arXiv preprint arXiv:1606.04809, 2016.
[16] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. Proc. Adv. in Neural Processing Systems (NIPS), pages 2737–2745, 2015.
[17] J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM J. Optimization, 25(1):351–376, 2015.
[18] J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. J. Machine Learning Research, 16(285-322):1–5, 2015.
[19] J. Liu, S. J. Wright, and S. Sridhar. An asynchronous parallel randomized Kaczmarz algorithm. arXiv preprint arXiv:1401.4780, 2014.
[20] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.
[21] D. Needell and J. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmon. Anal., 26(3):301–321, 2009.
[22] N. Nguyen, D. Needell, and T. Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. arXiv preprint arXiv:1407.0088, 2014.
[23] N. H. Nguyen, S. Chin, and T. D. Tran. A unified iterative greedy algorithm for sparsity-constrained optimization. 2013. Submitted.
[24] Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: An algorithmic framework for asynchronous parallel coordinate updates. arXiv preprint arXiv:1506.02396, 2015.
[25] B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Proc. Adv. in Neural Processing Systems (NIPS), pages 693–701, 2011.
[26] J. Tropp and A. Gilbert. Signal recovery from partial information via orthogonal matching pursuit. IEEE Trans. Inform. Theory, 53(12):4655–4666, 2007.
[27] S. Y. Zhao and W. J. Li. Fast asynchronous parallel stochastic gradient descent. arXiv preprint arXiv:1508.05711, 2015.