A Geometrical Study of Matching Pursuit Parametrization

arXiv:0801.3372v1 [math.DG] 22 Jan 2008 — February 2, 2008

Abstract

This paper studies the effect of discretizing the parametrization of a dictionary used for Matching Pursuit decompositions of signals. Our approach relies on viewing the continuously parametrized dictionary as an embedded manifold in the signal space, on which the tools of differential (Riemannian) geometry can be applied. The main contribution of this paper is twofold. First, we prove that if a discrete dictionary reaches a minimal density criterion, then the corresponding discrete MP (dMP) is equivalent in terms of convergence to a weakened hypothetical continuous MP. Interestingly, the corresponding weakness factor depends on a density measure of the discrete dictionary. Second, we show that the insertion of a simple geometric gradient ascent optimization on the dMP atom selection maintains the previous comparison, but with a weakness factor at least two times closer to unity than without optimization. Finally, we present numerical experiments confirming our theoretical predictions for decomposition of signals and images on regular discretizations of dictionary parametrizations.

Keywords: Matching Pursuit, Riemannian geometry, Optimization, Convergence, Dictionary, Parametrization.

1 Introduction

There has been a large effort in the last decade to develop analysis techniques that decompose non-stationary signals into elementary components, called atoms, that characterize their salient features [1–5]. In particular, the matching pursuit (MP) algorithm has been extensively studied [2,6–11] to expand a signal over a redundant dictionary of elementary atoms, based on a greedy process that selects the elementary function that best matches the residual signal at each iteration.
Hence, MP progressively isolates the structures of the signal that are coherent with respect to the chosen dictionary, and provides an adaptive signal representation in which the more significant coefficients are extracted first. The progressive nature of MP is a key issue for adaptive and scalable communication applications [12,13].

A majority of works that have considered MP for practical signal approximation and compression define the dictionary based on the discretization of a parametrized prototype function, typically a scaled/modulated Gaussian function or its second derivative [6,14,15]. An orthogonal 1-D or 2-D wavelet basis is also a trivial example of such a discretization, even if in that case MP is not required to find signal coefficients; a simple wavelet decomposition is computationally more efficient. Works that do not directly rely on a prototype function either approximate such a parametrized dictionary with computationally efficient cascades of filters [16–18], or attempt to adapt a set of parametrized dictionary elements to a set of training signal samples based on vector quantization techniques [19,20].

Thus, most earlier works define their dictionary by discretizing, directly or indirectly, the parameters of a prototype function. The key question is then: how should the continuous parameter space be discretized? A fine discretization results in a large dictionary which approximates signals efficiently with few atoms, but costs both in terms of computational complexity and atom index entropy coding. Previous works have studied this trade-off empirically [6,15]. In contrast, our paper addresses this question formally. It provides a first attempt to quantify analytically how the MP convergence is affected by the discretization of the continuous space of dictionary function parameters.

Our compass to reach this objective is the natural geometry of the continuous dictionary.
This dictionary can be seen as a parametric (Riemannian) manifold on which the tools of differential geometry can be applied. This geometrical approach, of increasing interest in the signal processing literature, is inspired by the works [21,22] on Image Appearance Manifolds, and is also closely linked to manifolds of parametric probability density functions associated to the Fisher information metric [23]. Some preliminary hints were also provided in a Riemannian study of generalized correlation of signals with probing functions [24].

The outcome of our study is twofold. On the one hand, we analyze how the rate of convergence of the continuous MP (cMP) is affected by the discretization of the prototype function parameters. We demonstrate that the MP using that discretized dictionary (dMP) converges like a weak continuous MP, i.e. an MP algorithm where the coefficient of the selected atom at each iteration overtakes only a percentage (the weakness factor) of the largest atom magnitude. We then describe how this weakness factor decreases as the so-called density radius[1] of the discretization increases. This observation is demonstrated experimentally on images and randomly generated 1-D signals.

On the other hand, to improve the rate of convergence of discrete MP without resorting to a finer but computationally heavier discretization, we propose to exploit a geometric gradient ascent method. This allows convergence to a set of locally optimal continuous parameters, starting from the best set of parameters identified by a coarse but computationally light discrete MP. Each atom of the MP expansion is then defined in two steps. The first step selects the discrete set of parameters that maximizes the inner product between the corresponding dictionary function and the residual signal. The second step implements a (manifold[2]) gradient ascent method to compute the prototype function parameters that maximize the inner product function over the continuous parameter space.
As a main analytical result, we demonstrate that this geometrically optimized discrete MP (gMP) is again equivalent to a continuous MP, but with a weakness factor that is two times closer to unity than for the non-optimized dMP. Our experiments confirm that the proposed gradient ascent procedure significantly increases the rate of convergence of MP, compared to the non-optimized discrete MP. At an equivalent convergence rate, the optimization allows reduction of the discretization density by an order of magnitude, resulting in significant computational gains.

The paper is organized as follows. In Section 2, we introduce the notion of a parametric dictionary in the context of signal decomposition in an abstract Hilbert space. This dictionary is then envisioned as a Hilbert manifold, and we describe how its geometrical structure influences its parametrization using the tools of differential geometry. Section 3 surveys the definition of (weak) continuous MP, providing a theoretical optimal rate of convergence for further comparisons with other greedy decompositions. A "discretization autopsy" of this algorithm is performed in Section 4, and a resulting theorem explaining the dependence of the dMP convergence on this sampling is proved. A simple but illustrative example of a 1-D dictionary, the wavelet (affine) dictionary, is then given. The optimization scheme announced above is developed in Section 5. After a review of gradient ascent optimization evolving on manifolds, the geometrically optimized MP is introduced and its theoretical rate of convergence analyzed in a second theorem. Finally, in Section 6, experiments are performed for 1-D and 2-D signal decompositions using dMP and gMP on various regular discretizations of dictionary parametrizations.

[1] This density radius represents the maximal distance between any atom of the continuous dictionary and its closest atom in the discretization.
[2] In the sense that this gradient ascent evolves on the manifold induced by the intrinsic dictionary geometry.
We provide links to previous related works in Section 7 and conclude with possible extensions in Section 8.

2 Dictionary, Parametrization and Differential Geometry

Our object of interest throughout this paper is a general real "signal", i.e. a real function $f$ taking value on a measure space $X$. More precisely, we assume $f$ in the set of finite energy signals, i.e. $f \in L^2(X,\mathrm{d}\mu) = \{u : X \to \mathbb{R} : \|u\|^2 = \int_X |u(x)|^2\,\mathrm{d}\mu(x) < \infty\}$, for a certain integral measure $\mathrm{d}\mu(x)$. Of course, the natural comparison of two functions $u$ and $v$ in $L^2(X,\mathrm{d}\mu)$ is realized through the scalar product $\langle u,v\rangle_{L^2(X)} = \langle u,v\rangle \triangleq \int_X u(x)\,v(x)\,\mathrm{d}\mu(x)$, making $L^2(X,\mathrm{d}\mu)$ a Hilbert[3] space where $\|u\|^2 = \langle u,u\rangle$.

This very general framework can be specialized to 1-D signal or image decomposition, where $X$ is given respectively by $\mathbb{R}$ or $\mathbb{R}^2$, but also to more special spaces like the two-dimensional sphere $S^2$ [25] or the hyperboloid [26]. In the sequel, we will write simply $L^2(X) = L^2(X,\mathrm{d}\mu)$.

In the following sections, we will decompose $f$ over a highly redundant parametric dictionary of real atoms. These are obtained from smooth transformations of a real mother function $g \in L^2(X)$ of unit norm. Formally, each atom is a function $g_\lambda(x) = [U(\lambda)g](x) \in L^2(X)$, for a certain isometric operator $U$ parametrized by elements $\lambda \in \Lambda$ and such that $\|g_\lambda\| = \|g\| = 1$. The parametrization set $\Lambda$ is a continuous space where each $\lambda \in \Lambda$ corresponds to $P$ continuous components $\lambda = \{\lambda^i\}_{0\le i\le P-1}$ of different nature. For instance, in the case of 1-D signal or image analysis, $g$ may be transformed by translation, modulation, rotation, or (anisotropic) dilation operations, each associated to one component $\lambda^i$ of $\lambda$. Our dictionary is then the set $\mathrm{dict}(g,U,\Theta) \triangleq \{g_\lambda(x) = [U(\lambda)g](x) : \lambda \in \Theta\}$, for a certain subset $\Theta \subseteq \Lambda$. In the rest of the paper, we write $\mathrm{dict}(\Theta) = \mathrm{dict}(g,U,\Theta)$, assuming $g$ and $U$ implicitly given by the context. For the case $\Theta = \Lambda$, we write $\mathcal{D} = \mathrm{dict}(\Lambda)$.

We assume that $g$ is twice differentiable over $X$ and that the functions $g_\lambda(x)$ are twice differentiable on each of the $P$ components of $\lambda$. In the following, we write $\partial_i \triangleq \frac{\partial}{\partial\lambda^i}$ for the partial derivative with respect to $\lambda^i$ of any element (e.g. $g_\lambda(x)$, $\langle g_\lambda,u\rangle$, ...) depending on $\lambda$, and $\partial_{ij} = \partial_i\partial_j$. From the smoothness of $U$ and $g$, we have $\partial_{ij} = \partial_{ji}$ on quantities built from these two ingredients.

Let us now analyze the geometrical structure of $\Lambda$. Rather than an artificial Euclidean distance $d_E(\lambda_a,\lambda_b)^2 \triangleq \sum_i (\lambda_a^i - \lambda_b^i)^2$ between $\lambda_a,\lambda_b \in \Lambda$, we use a distance introduced by the dictionary $\mathcal{D}$ itself, seen as a $P$-dimensional parametric submanifold of $L^2(X)$ (or a Hilbert manifold[4] [27]). The dictionary distance $d_{\mathcal{D}}$ is thus the distance in the embedding space $L^2(X)$, i.e. $d_{\mathcal{D}}(\lambda_a,\lambda_b) \triangleq \|g_{\lambda_a} - g_{\lambda_b}\|$.

From this embedding, we can define an intrinsic distance in $\mathcal{D}$, namely the geodesic distance. This latter has been used in a similar context in the work of Grimes and Donoho [22], and we follow here their approach. For our two points $\lambda_a,\lambda_b$, assume that we have a smooth curve $\gamma : [0,1] \to \Lambda$ with $\gamma(t) = \big(\gamma^0(t),\cdots,\gamma^{P-1}(t)\big)$, such that $\gamma(0) = \lambda_a$ and $\gamma(1) = \lambda_b$. The length $\mathcal{L}(\gamma)$ of this curve in $\mathcal{D}$ is thus given by $\mathcal{L}(\gamma) \triangleq \int_0^1 \|\frac{\mathrm{d}}{\mathrm{d}t}\, g_{\gamma(t)}\|\,\mathrm{d}t$, assuming that $g_{\gamma(t)}$ is differentiable[5] with respect to $t$.

The geodesic distance between $\lambda_a$ and $\lambda_b$ in $\Lambda$ is the length of the shortest path between these two points, i.e.

  $d_G(\lambda_a,\lambda_b) \triangleq \inf_{\gamma(\lambda_a\to\lambda_b)} \int_0^1 \|\tfrac{\mathrm{d}}{\mathrm{d}t}\, g_{\gamma(t)}\|\,\mathrm{d}t$,   (1)

where $\gamma(\lambda_a\to\lambda_b)$ is any differentiable curve $\gamma(t)$ linking $\lambda_a$ to $\lambda_b$ for $t$ equal to 0 and 1 respectively.

We denote by $\gamma_{\lambda_a\lambda_b}$ the optimal geodesic curve joining $\lambda_a$ and $\lambda_b$ on the manifold $\mathcal{D}$, i.e. such that $\mathcal{L}(\gamma_{\lambda_a\lambda_b}) = d_G(\lambda_a,\lambda_b)$, and we assume henceforth that it is always possible to define this curve between two points of $\Lambda$.

[3] Assuming it complete, i.e. every Cauchy sequence converges in this space relatively to the norm $\|\cdot\|^2 = \langle\cdot,\cdot\rangle$.
Note that by construction, $d_G(\lambda_a,\lambda_b) = d_G(\lambda_a,\lambda') + d_G(\lambda',\lambda_b)$ for all $\lambda'$ on the curve $\gamma_{\lambda_a\lambda_b}(t)$.

In the language of differential geometry, the parameter space $\Lambda$ is a Riemannian manifold $\mathcal{M} = (\Lambda, \mathcal{G}_{ij})$ with metric $\mathcal{G}_{ij}(\lambda) = \langle \partial_i g_\lambda, \partial_j g_\lambda\rangle$. Indeed, for any differentiable curve $\gamma : t \in [-\delta,\delta] \to \gamma(t) \in \Lambda$ with $\delta > 0$ and $\gamma(0) = \lambda$, we have

  $\|\tfrac{\mathrm{d}}{\mathrm{d}t}\, g_{\gamma(t)}|_{t=0}\|^2 = \dot\gamma^i(0)\,\dot\gamma^j(0)\,\mathcal{G}_{ij}(\lambda)$,   (2)

with $\dot u(t) = \frac{\mathrm{d}u(t)}{\mathrm{d}t}$, and where Einstein's summation convention is used for simplicity[6].

The vector $\xi^i = \dot\gamma^i(0)$ is by definition a vector in the tangent space $T_\lambda\Lambda$ of $\Lambda$ in $\lambda$. The meaning of relation (2) is that the metric $\mathcal{G}_{ij}(\lambda)$ allows the definition of a scalar product and a norm in each $T_\lambda\Lambda$. The norm of a vector $\xi \in T_\lambda\Lambda$ is therefore noted $|\xi|^2 = |\xi|^2_\lambda \triangleq \xi^i\xi^j\,\mathcal{G}_{ij}(\lambda)$, with the correspondence $\|\tfrac{\mathrm{d}}{\mathrm{d}t}\, g_{\gamma(t)}|_{t=0}\| = |\dot\gamma|$. For the consistency of further Riemannian geometry developments, we assume that our dictionary $\mathcal{D}$ is non-degenerate, i.e. that it induces a positive definite metric $\mathcal{G}_{ij}$. Appendix A provides additional details.

We conclude this section with the arc length (or curvilinear) parametrization "$s$" [28] of a curve $\gamma(s)$. It is such that $|\gamma'|^2 \triangleq \gamma'^i(s)\,\gamma'^j(s)\,\mathcal{G}_{ij}(\gamma(s)) = 1$, where $u'(s) = \frac{\mathrm{d}}{\mathrm{d}s}u(s)$. From its definition, the curvilinear parameter $s$ is the one which measures at each point $\gamma(s)$ the length of the segment of curve already travelled on $\gamma$ from $\gamma(0)$. Therefore, in this parametrization, $\lambda_a = \gamma_{\lambda_a\lambda_b}(0)$ and $\lambda_b = \gamma_{\lambda_a\lambda_b}(d_G(\lambda_a,\lambda_b))$.

[4] This is a special case of Image Appearance Manifold (IAM) defined for instance in [21,22]. It is also closely linked to manifolds of parametric probability density functions associated to the Fisher information metric [23].
[5] Another definition of $\mathcal{L}$ exists for non-differentiable curves. See for instance [22].
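As a concrete illustration of the metric $\mathcal{G}_{ij}(\lambda) = \langle \partial_i g_\lambda, \partial_j g_\lambda\rangle$ and of relation (2), the following sketch estimates it numerically for a hypothetical 1-D family of translated/dilated Gaussian atoms. All grids, step sizes and parameter values here are illustrative assumptions, not quantities from the paper:

```python
import numpy as np

# Hypothetical 1-D example: atoms g_lambda(t) built from a unit-norm Gaussian
# mother function by translation b and dilation a, i.e. lambda = (b, a).
t = np.linspace(-20, 20, 4001)
dt = t[1] - t[0]

def atom(b, a):
    g = np.exp(-((t - b) / a) ** 2 / 2)
    return g / np.sqrt(np.sum(g ** 2) * dt)   # normalize so ||g_lambda|| = 1

def metric(b, a, h=1e-4):
    # G_ij(lambda) = <d_i g_lambda, d_j g_lambda>, partials by central differences
    d_b = (atom(b + h, a) - atom(b - h, a)) / (2 * h)
    d_a = (atom(b, a + h) - atom(b, a - h)) / (2 * h)
    P = [d_b, d_a]
    return np.array([[np.sum(P[i] * P[j]) * dt for j in range(2)] for i in range(2)])

G = metric(b=0.0, a=2.0)
# Relation (2): the squared speed of a curve through the dictionary is the
# quadratic form gammadot^i gammadot^j G_ij(lambda) on the tangent vector.
xi = np.array([1.0, 0.5])                     # a tangent vector at lambda = (0, 2)
speed2 = xi @ G @ xi
```

For this symmetric mother function the off-diagonal term vanishes and the entries scale like $a^{-2}$, consistent with the diagonal metric $\mathcal{G}_{ij}(\lambda) = a^{-2}W$ stated for the affine dictionary in Section 4.2.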
[6] Namely, a summation in an expression is defined implicitly each time the same index is repeated once as a subscript and once as a superscript, the range of summation being always $[0,P-1]$, so that for instance the expression $a_i b^i$ reads $\sum_{i=0}^{P-1} a_i b^i$.

3 Matching Pursuit in Continuous Dictionary

Let us assume that we want to decompose a function $f \in L^2(X)$ into simpler elements (atoms) coming from a dictionary $\mathrm{dict}(\Theta)$, given a possibly uncountable and infinite subset $\Theta \subseteq \Lambda$. Our general aim is thus to find a set of coefficients $\{c_m\}$ such that $f(x)$ is equal or well approximated by $f_{\mathrm{app}}(x) = \sum_m c_m\, g_{\lambda_m}(x)$ with a finite set of atoms $\{g_{\lambda_m}\} \subset \mathrm{dict}(\Theta)$.

Formally, for a given weakness factor $\alpha \in (0,1]$, a General Weak($\alpha$) Matching Pursuit decomposition of $f$ [2,29], written MP($\Theta,\alpha$), in the dictionary $\mathrm{dict}(\Theta)$ is performed through the following greedy[7] algorithm:

  $R^0 f = f$, $A^0 f = 0$ (initialization),
  $R^{m+1} f = R^m f - \langle g_{\lambda_{m+1}}, R^m f\rangle\, g_{\lambda_{m+1}}$,   (3a)
  $A^{m+1} f = A^m f + \langle g_{\lambda_{m+1}}, R^m f\rangle\, g_{\lambda_{m+1}}$,   (3b)
  with: $\langle g_{\lambda_{m+1}}, R^m f\rangle^2 \ \ge\ \alpha^2 \sup_{\lambda\in\Theta} \langle g_\lambda, R^m f\rangle^2$.   (3c)

The quantity $R^{m+1}f$ is the residual of $f$ at iteration $m+1$. Since it is orthogonal to atom $g_{\lambda_{m+1}}$, $\|R^{m+1}f\|^2 = \|R^m f\|^2 - \langle g_{\lambda_{m+1}}, R^m f\rangle^2 \le \|R^m f\|^2$, so that the energy $\|R^m f\|^2$ is non-increasing. The function $A^m f$ is the $m$-term approximation of $f$ with $A^m f = \sum_{k=0}^{m-1} \langle g_{\lambda_{k+1}}, R^k f\rangle\, g_{\lambda_{k+1}}$.

Notice that the selection rule (3c) concerns the square of the real scalar product $\langle g_\lambda, R^m f\rangle$. Matching Pursuit atom selection is typically defined over the absolute value $|\langle g_\lambda, R^m f\rangle|$. However, we prefer this equivalent quadratic formulation, first to avoid the abrupt behavior of the absolute value when the scalar product crosses zero, and second for consistency with the quadratic optimization framework to be explained in Section 5. Finally, to allow the non-weak case where $\alpha = 1$, we assume that a maximizer $g_u \in \mathrm{dict}(\Theta)$ of $\langle g,u\rangle^2$ always exists for any $u \in L^2(X)$.

If $\Theta$ is uncountable, our general Matching Pursuit algorithm is named continuous Matching Pursuit. In particular, for $\Theta = \Lambda$, we write cMP($\alpha$) = MP($\Lambda,\alpha$). The rate of convergence (or convergence) of the cMP($\alpha$), characterized by the rate of decay of $\|R^m f\|$ with $m$, can be assessed in certain particular cases. For instance, if there exists a Hilbert space $\mathcal{S} \subseteq L^2(X)$ containing $\mathcal{D} = \mathrm{dict}(\Lambda)$ such that

  $\beta^2 = \inf_{u\in\mathcal{S},\,\|u\|=1}\ \sup_{\lambda\in\Lambda}\ \langle g_\lambda, u\rangle^2 \ >\ 0$,   (4)

then the cMP($\alpha$) converges inside $\mathcal{S}$. In fact, the convergence is exponential [30] since $\langle g_{\lambda_m}, R^{m-1}f\rangle^2 \ge \alpha^2\beta^2\,\|R^{m-1}f\|^2$ and $\|R^m f\|^2 \le \|R^{m-1}f\|^2 - \alpha^2\beta^2\,\|R^{m-1}f\|^2 \le (1-\alpha^2\beta^2)^m\,\|f\|^2$. We name $\beta = \beta(\mathcal{S},\mathcal{D})$ the greedy factor since it characterizes the MP convergence (greediness).

The existence of the greedy factor $\beta$ is obvious, for instance, for a finite dimensional space [30], i.e. $f \in \mathbb{C}^N$, with a finite dictionary (finite number of atoms). For a finite dictionary in an infinite dimensional space, such as $L^2(X)$, the existence of $\beta$ is not guaranteed over the whole space. However, it exists on the space of functions given by linear combinations of dictionary elements, the number of terms being restricted by the dictionary (cumulative) coherence [29]. In the case of an infinite dictionary in an infinite dimensional space where the greedy factor vanishes, cMP($\alpha$) convergence is characterized differently on the subspace of linear combinations of countable subsets of dictionary elements. This question is addressed separately in a companion Technical Report [31] to this article. We now consider only the case where a non-zero greedy factor exists to characterize the rate of convergence of MP using continuous and discrete dictionaries.

[7] Greedy in the sense that it does not solve a global $\ell_0$ or $\ell_1$ minimization [1] to find the coefficients $c_m$ of $f_{\mathrm{app}}$ above, but works iteratively by solving at each iteration step a local and smaller minimization problem.
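The recursion (3a)-(3c) translates almost line by line into code. The sketch below runs a plain ($\alpha = 1$) Matching Pursuit over a finite dictionary of Gaussian atoms; the signal, the parameter grid and the iteration count are illustrative assumptions. An `argmax` attains the supremum exactly, so it satisfies rule (3c) for any $\alpha \in (0,1]$; a genuinely weak variant would accept any atom reaching $\alpha^2$ of the supremum.

```python
import numpy as np

# Illustrative sketch of the Weak(alpha) MP recursion (3) over a finite,
# made-up discrete dictionary of translated/dilated Gaussian atoms.
t = np.linspace(-10, 10, 2001)
dt = t[1] - t[0]

def atom(b, a):
    g = np.exp(-((t - b) / a) ** 2 / 2)
    return g / np.sqrt(np.sum(g ** 2) * dt)

# discrete parameter grid Lambda_d and the induced dictionary D_d
grid = [(b, a) for b in np.arange(-8, 8.1, 0.5) for a in (0.5, 1.0, 2.0, 4.0)]
D = np.array([atom(b, a) for b, a in grid])

def weak_mp(f, D, alpha=1.0, n_iter=30):
    R, approx, coeffs = f.copy(), np.zeros_like(f), []
    for _ in range(n_iter):
        ip = D @ R * dt                  # <g_k, R^m f> for every discrete atom
        best = np.argmax(ip ** 2)        # attains the sup: satisfies (3c) for any alpha
        c = ip[best]
        R = R - c * D[best]              # residual update (3a)
        approx = approx + c * D[best]    # approximation update (3b)
        coeffs.append((best, c))
    return approx, R, coeffs

f = atom(-2.0, 1.0) + 0.5 * atom(3.0, 2.0)   # a synthetic 2-atom signal
approx, R, coeffs = weak_mp(f, D)
```

Because both atoms of the synthetic signal lie exactly on the grid, the residual energy decays very quickly, illustrating the non-increasing energy property stated after (3c).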
4 Discretization Effects of Continuous Dictionary

The greedy algorithm cMP($\alpha$) using the dictionary $\mathcal{D}$ is obviously numerically unachievable because of the intrinsic continuity of its main ingredient, namely the parameter space $\Lambda$. Any computer implementation needs at least to discretize the parametrization of the dictionary, more or less densely, leading to a countable set $\Lambda_d \subset \Lambda$. This new parameter space leads naturally to the definition of a countable subdictionary $\mathcal{D}_d = \mathrm{dict}(\Lambda_d)$. Henceforth, elements of $\Lambda_d$ are labelled with roman letters, e.g. $k$, to distinguish them from the continuous greek-labelled elements of $\Lambda$, e.g. $\lambda$.

For a weakness factor $\alpha \in (0,1]$, the discrete Weak($\alpha$) Matching Pursuit algorithm, or dMP($\alpha$), of a function $f \in L^2(X)$ over $\mathcal{D}_d$ is naturally defined as dMP($\alpha$) = MP($\Lambda_d,\alpha$). The replacement of $\Lambda$ by $\Lambda_d$ in the MP algorithm (3) leads obviously to the following question, which we address in the next section.

Question 1. How does the MP rate of convergence evolve when the parametrization of a dictionary is discretized, and what are the quantities that control (or bound) this evolution?

4.1 Discretization Autopsy

By working with $\mathcal{D}_d$ instead of $\mathcal{D}$, the atoms selected at each iteration of dMP($\alpha$) are of course less optimal than those available in the continuous framework. Answering Question 1 requires a quantitative measure of the induced loss in the MP coefficients. More concretely, defining the score function $S_u(\lambda) = \langle g_\lambda, u\rangle^2$ for some $u \in L^2(X)$, we must analyze the difference between a maximum of $S_u$ computed over $\Lambda$ and that obtained from $\Lambda_d$. This function $u$ will next be identified with the residue of dMP($\alpha$) at any iteration to characterize the global change in convergence. We propose to base our analysis on the geometric tools described in Section 2.

Definition 1. The value $S_u(\lambda_a)$ is critical in the direction of $\lambda_b$ if, given the geodesic $\gamma = \gamma_{\lambda_a\lambda_b}$ in the manifold $\mathcal{M} = (\Lambda,\mathcal{G}_{ij})$, $\frac{\mathrm{d}}{\mathrm{d}s} S_u(\gamma(s))|_{s=0} = 0$, where $\gamma(0) = \lambda_a$.

Notice that if $S_u(\lambda_a)$ is critical in the direction of $\lambda_b$, then $\gamma'^i(0)\,\partial_i S_u(\lambda_a) = 0$. An umbilical point, for which $\partial_i S_u(\lambda_a) = 0$ for all $i$, is obviously critical in any direction. An umbilical point corresponds geometrically either to maxima, minima or saddle points of $S_u$ relatively to $\Lambda$.

Proposition 1. Given $u \in L^2(X)$, if $S_u(\lambda_a)$ is critical in the direction of $\lambda_b$ for $\lambda_a,\lambda_b \in \Lambda$, then for some $r \in (0, d_G(\lambda_a,\lambda_b))$,

  $|S_u(\lambda_a) - S_u(\lambda_b)| \ \le\ \|u\|^2\, d_G(\lambda_a,\lambda_b)^2\, \big(1 + \|\tfrac{\mathrm{d}^2}{\mathrm{d}s^2}\, g_{\gamma}|_{s=r}\|\big)$,   (5)

where $\gamma(s) = \gamma_{\lambda_a\lambda_b}(s)$ is the geodesic in $\mathcal{M}$ linking $\lambda_a$ to $\lambda_b$.

Proof. Let us define the twice differentiable function $\psi(s) \triangleq S_u(\gamma(s))$ on $s \in [0,\eta]$, with $\eta \triangleq d_G(\lambda_a,\lambda_b)$. A second order Taylor development of $\psi$ gives, for a certain $r \in (0,s)$, $\psi(s) = \psi(0) + s\,\psi'(0) + \frac{1}{2}s^2\,\psi''(r)$. Since $\psi'(0) = \gamma'^i(0)\,\partial_i S_u(\lambda_a) = 0$ by hypothesis, we have in $s = \eta$, $|\psi(0) - \psi(\eta)| = |S_u(\lambda_a) - S_u(\lambda_b)| \le \frac{1}{2}\eta^2\,|\psi''(r)|$. However, for any $s$, $|\psi''(s)| = |2\langle\tfrac{\mathrm{d}}{\mathrm{d}s}\, g_{\gamma(s)}, u\rangle^2 + 2\langle g_{\gamma(s)}, u\rangle\langle\tfrac{\mathrm{d}^2}{\mathrm{d}s^2}\, g_{\gamma(s)}, u\rangle| \le 2\,\big(\|\tfrac{\mathrm{d}}{\mathrm{d}s}\, g_{\gamma(s)}\|^2 + \|\tfrac{\mathrm{d}^2}{\mathrm{d}s^2}\, g_{\gamma(s)}\|\big)\,\|u\|^2$, using the Cauchy-Schwarz (CS) inequality in $L^2(X)$ in the last equation. The result follows from the fact that $\|\tfrac{\mathrm{d}}{\mathrm{d}s}\, g_{\gamma(s)}\| = 1$.

The previous proposition is particularly important since it bounds the loss in coefficient value, when we decide to choose $S_u(\lambda_b)$ instead of the optimal $S_u(\lambda_a)$, as a function of the geodesic distance $d_G(\lambda_a,\lambda_b)$ between the two parameters. To obtain a more satisfactory control of this difference, we however need a new property of the dictionary. We start by defining the principal curvature at the point $\lambda \in \Lambda$ as

  $\mathcal{K}_\lambda \ \triangleq\ \sup_{\xi : |\xi|=1} \|\tfrac{\mathrm{d}^2}{\mathrm{d}s^2}\, g_{\gamma_\xi(s)}|_{s=0}\|$,   (6)

where $\gamma_\xi$ is the unique geodesic in $\mathcal{M}$ starting from $\lambda = \gamma_\xi(0)$ and with $\gamma'_\xi(0) = \xi$, for a direction $\xi$ of unit norm in $T_\lambda\Lambda$.

Definition 2. The condition number of a dictionary $\mathcal{D}$ is the number $\mathcal{K}^{-1}$ obtained from

  $\mathcal{K} \ \triangleq\ \sup_{\lambda\in\Lambda}\ \mathcal{K}_\lambda$.   (7)
If $\mathcal{K}$ does not exist ($\mathcal{K}_\lambda$ not bounded), by extension, $\mathcal{D}$ is said to be of zero condition number.

The notion of condition number has been introduced by Niyogi et al. [32] to bound the local curvature of an embedded manifold[8] in its ambient space, and to characterize its self-avoidance. Essentially, it is the inverse of the maximum radius of a sphere that, when placed tangent to the manifold at any point, intersects the manifold only at that point [33,34]. Our quantity $\mathcal{K}^{-1}$ is then by construction a similar notion for the dictionary $\mathcal{D}$ seen as a manifold in $L^2(X)$. However, it does not actually prevent manifold self-crossing at large distance, due to the locality of our differential analysis[9].

Proposition 2. For a dictionary $\mathcal{D} = \mathrm{dict}(\Lambda)$,

  $1 \ \le\ \mathcal{K} \ \le\ \sup_{\lambda\in\Lambda}\ \big[\langle \partial_{ij}\, g_\lambda,\ \partial_{kl}\, g_\lambda\rangle\, \mathcal{G}^{ik}\mathcal{G}^{jl}\big]^{\frac{1}{2}}$,   (8)

where $\mathcal{G}^{ij} = \mathcal{G}^{ij}(\lambda)$ is the inverse[10] of $\mathcal{G}_{ij}$.

[8] In their work, the condition number, named there $\tau^{-1}$, of a manifold $\mathcal{M}'$ measures the maximal "thickness" $\tau$ of the normal bundle, the union of all the orthogonal complements of every tangent plane at every point of the manifold.
[9] A careful study of local self-avoidance of well-conditioned dictionaries would have to be considered, but this is beyond the scope of this paper.
[10] Using Einstein convention, this means $\mathcal{G}^{ik}\mathcal{G}_{kj} = \mathcal{G}_{jk}\mathcal{G}^{ki} = \delta^i_j$, for the Kronecker symbol $\delta^i_j = \delta_{ij} = \delta^{ij} = 1$ if $i=j$ and $0$ if $i\ne j$.

The proof is given in Appendix B since it uses some elements of differential geometry not essential to the core of this paper. The interested reader will also find there a slightly tighter bound than the one presented in (8), exploiting covariant derivatives, the Laplace-Beltrami operator and the scalar curvature of $\mathcal{M}$ [28]. We can now state the following corollary of Proposition 1.

Corollary 1. In the conditions of Proposition 1, if $\mathcal{D}$ has a non-zero condition number $\mathcal{K}^{-1}$, then

  $|S_u(\lambda_a) - S_u(\lambda_b)| \ \le\ \|u\|^2\, d_G(\lambda_a,\lambda_b)^2\, (1 + \mathcal{K})$.   (9)

Therefore, in the dMP($\alpha$) decomposition of $f$ based on $\mathcal{D}_d$, even if at each iteration the exact position of the continuous optimal atom of $\mathcal{D}$ is not known, we are now able to estimate the convergence rate of this MP, provided we introduce a new quantity characterizing the set $\Lambda_d$.

Definition 3. The density radius $\rho_d$ of a countable parameter space $\Lambda_d \subset \Lambda$ is the value

  $\rho_d \ =\ \sup_{\lambda\in\Lambda}\ \inf_{k\in\Lambda_d}\ d_G(\lambda, k)$.   (10)

We say that $\Lambda_d$ covers $\Lambda$ with a radius $\rho_d$.

This radius characterizes the density of $\Lambda_d$ inside $\Lambda$. Given any $\lambda$ in $\Lambda$, one is guaranteed that there exists an element $k$ of $\Lambda_d$ close to $\lambda$, i.e. within a geodesic distance $\rho_d$.

Theorem 1. Given a Hilbert space $\mathcal{S} \subseteq L^2(X)$ with a non-zero greedy factor $\beta$, and a dictionary $\mathcal{D} = \mathrm{dict}(\Lambda) \subset \mathcal{S}$ of non-zero condition number $\mathcal{K}^{-1}$, if $\Lambda_d$ covers $\Lambda$ with radius $\rho_d$, and if $\rho_d < \beta/\sqrt{1+\mathcal{K}}$, then, for functions belonging to $\mathcal{S}$, a dMP($\alpha$) algorithm using $\mathcal{D}_d = \mathrm{dict}(\Lambda_d)$ is bounded by the exponential convergence rate of a cMP($\alpha'$) using $\mathcal{D}$ with a weakness parameter given by $\alpha' = \alpha\,\big(1 - \beta^{-2}\rho_d^2(1+\mathcal{K})\big)^{1/2} < \alpha$.

Proof. Notice first that since $f \in \mathcal{S}$ and $\mathcal{D}_d \subset \mathcal{D} \subset \mathcal{S}$, $R^m f \in \mathcal{S}$ for all iterations $m$ of dMP. Let us take the $(m+1)$th step of dMP($\alpha$) and write $u = R^m f$. We have of course $\|R^{m+1}f\|^2 = \|u\|^2 - S_u(k_{m+1})$, where $k_{m+1}$ is the atom obtained from the selection rule (3c), i.e. $S_u(k_{m+1}) \ge \alpha^2 \sup_{k\in\Lambda_d} S_u(k)$.

Denote by $g_{\tilde\lambda}$ the atom of $\mathcal{D}$ that best represents $R^m f$, i.e. $S_u(\tilde\lambda) = \sup_{\lambda\in\Lambda} S_u(\lambda)$. If $\tilde k$ is the closest element of $\tilde\lambda$ in $\Lambda_d$, we have $d_G(\tilde\lambda,\tilde k) \le \rho_d$ from the covering property of $\Lambda_d$, and Proposition 1 tells us that, with $u = R^m f$, $|S_u(\tilde k) - S_u(\tilde\lambda)| \le \rho_d^2(1+\mathcal{K})\,\|u\|^2$, since $\partial_i S_u(\tilde\lambda) = 0$ for all $i$. Therefore, $S_u(\tilde k) \ge S_u(\tilde\lambda) - \rho_d^2(1+\mathcal{K})\,\|u\|^2 \ge \beta^2\|u\|^2 - \rho_d^2(1+\mathcal{K})\,\|u\|^2$, and $S_u(\tilde k) \ge \beta^2\big(1 - \beta^{-2}\rho_d^2(1+\mathcal{K})\big)\,\|R^m f\|^2$, this last quantity being positive from the density requirement, i.e. $\rho_d < \beta/\sqrt{1+\mathcal{K}}$.
In consequence, $S_u(k_{m+1}) \ge \alpha^2 \sup_{k\in\Lambda_d} S_u(k) \ge \alpha^2 S_u(\tilde k)$, implying $\|R^{m+1}f\|^2 = \|u\|^2 - S_u(k_{m+1}) \le \|u\|^2\,(1 - \alpha'^2\beta^2)$, for $\alpha' \triangleq \alpha\,\big(1 - \beta^{-2}\rho_d^2(1+\mathcal{K})\big)^{1/2}$. So, $\|R^{m+1}f\| \le (1 - \alpha'^2\beta^2)^{(m+1)/2}\,\|f\|$, which is the exponential convergence rate of the Weak($\alpha'$) Matching Pursuit in $\mathcal{D}$ when $\beta$ exists [29,30].

The previous theorem has an interesting interpretation: a weak Matching Pursuit decomposition in a discrete dictionary corresponds, in terms of rate of convergence, to a weaker Matching Pursuit in the continuous dictionary from which the discrete one is extracted.

About the hypotheses of the theorem, notice first that the existence of a greedy factor inside $\mathcal{S}$ concerns the continuous dictionary $\mathcal{D}$ and not the discrete one $\mathcal{D}_d$. Consequently, this condition is certainly easier to fulfill, owing to the high redundancy of $\mathcal{D}$. Second, the density requirement, $\rho_d < \beta/\sqrt{1+\mathcal{K}}$, is just sufficient, since Proposition 1 does not state that it achieves the best bound for the control of $|S_u(\lambda_a) - S_u(\lambda_b)|$ when $\lambda_a$ is critical. It is interesting to note that this inequality relates $\rho_d$, a quantity that characterizes the discretization $\Lambda_d$ in $\Lambda$, to $\beta$ and $\mathcal{K}$, which depend only on the dictionary. In particular, $\beta$ represents the density of $\mathcal{D}$ inside $\mathcal{S} \subset L^2(X)$, and $\mathcal{K}$ depends on the shape of the atoms through the curvature of the dictionary. Finally, note that as $\beta < 1$ (from definition (4)) and $\mathcal{K} \ge 1$ (Prop. 2), the density radius must at least satisfy $\rho_d < \frac{1}{\sqrt{2}}$ to guarantee that our analysis is valid.

4.2 A Simple Example of Discretization

Let us work on the line with $L^2(X) = L^2(\mathbb{R},\mathrm{d}t)$, and check whether the hypotheses of the previous theorem can be assessed in the simple case of an affine (wavelet-like) dictionary.

We select a symmetric and real mother function $g \in L^2(\mathbb{R})$ well localized around the origin, e.g. a Gaussian or a Mexican Hat, normalized such that $\|g\| = 1$. The parameter set $\Lambda$ is related to the affine group $G_{\mathrm{aff}}$, the group of translations and dilations. We identify $\lambda = (\lambda^0 = b, \lambda^1 = a)$, where $b \in \mathbb{R}$ and $a > 0$ are the translation and the dilation parameters respectively. The dictionary $\mathcal{D}$ is defined from the atoms $g_\lambda(t) = [U(\lambda)g](t) = a^{-1/2}\, g\big((t-b)/a\big)$, with $\|g_\lambda\| = 1$ for all $\lambda \in \Lambda$. Our atoms are nothing but the wavelets of a Continuous Wavelet Transform if $g$ is admissible [35], and $U$ is actually the representation of the affine group on $L^2(\mathbb{R})$ [36].

In the technical report [31], we prove that the associated metric is given by $\mathcal{G}_{ij}(\lambda) = a^{-2}W$, where $W$ is a constant $2\times 2$ diagonal matrix depending only on the mother function $g$ and its first and second derivatives. Since $\mathcal{G}^{ij}(\lambda) = a^2 W^{-1}$, $\mathcal{K}$ can be bounded by a constant also associated to $g$ and its first and second order time derivatives.

Finally, given the $\tau$-adic parameter discretization $\Lambda_d = \{k_{jn} = (b_{jn}, a_{jn}) = (n\, b_0\, \tau^j,\ a_0\, \tau^j) : j,n \in \mathbb{Z}\}$, with $\tau > 1$ and $a_0, b_0 > 0$, the density radius $\rho_d$ of $\Lambda_d$ is shown to be bounded by $\rho_d \le C\, a_0^{-1} b_0 + D\ln\tau$, with $C$ and $D$ depending only on the norms of $g$ and its first derivative.

This bound has two interesting properties. First, as for the grid $\Lambda_d$, it is invariant under the change $(b_0, a_0) \to (2b_0, 2a_0)$. Second, it is multiplied by $2^n$ if we realize a "zoom" of factor $2^n$ in our $\tau$-adic grid, in other words, if $(b_0, \tau) \to (2^n b_0, \tau^{2^n})$. By the same argument, the true density radius also has to respect these rules. Therefore, we conjecture that $\rho_d = C'\, a_0^{-1} b_0 + D'\ln\tau$, for two particular (non-computed) positive constants $C'$ and $D'$.

Unfortunately, even for this simple affine dictionary, the existence of $\beta = \beta(\mathcal{S},\mathcal{D})$ is non-trivial to prove. However, if the greedy factor exists, the control of $\tau$, $a_0$ and $b_0$ over $\rho_d$ tells us that it is possible to satisfy the density requirement for convenient values of these parameters.
5 Optimization of Discrete Matching Pursuits

The previous section has shown that, under a few assumptions, a dMP is equivalent, in terms of rate of convergence, to a weaker cMP in the continuous dictionary from which the discrete one has been sampled.

Question 2. Can we improve the rate of convergence of a dMP, not by an obvious increase of the dictionary sampling, but by taking advantage of the dictionary geometry?

Our approach is to introduce an optimization of the discrete dMP scheme. In short, at each iteration, we propose to use the atoms of $\mathcal{D}_d$ as the seeds of an iterative optimization, such as the basic gradient descent/ascent, respecting the geometry of the manifold $\mathcal{M} = (\Lambda,\mathcal{G}_{ij})$. Under the same density hypothesis as Theorem 1, we show that, in the worst case and if the number of optimization steps is large enough, an optimized discrete MP is again equivalent to a continuous MP, but with a weakness factor two times closer to unity than for the non-optimized discrete MP.

In this section, we first introduce the basic gradient descent/ascent on a manifold. Next, we show how this optimization can be introduced in the Matching Pursuit scheme to define the geometrically optimized MP (gMP). Finally, the rate of convergence of this method is analyzed.

5.1 Gradient Ascent on Riemannian Manifolds

Given a function $u \in L^2(X)$ and $S_u(\lambda) = \langle g_\lambda, u\rangle^2$, we wish to find the parameter that maximizes $S_u$, i.e.

  $\lambda_* \ =\ \mathrm{argmax}_{\lambda\in\Lambda}\ S_u(\lambda)$.   (P.1)

Equivalently, by introducing $h_{u,\lambda} = \langle g_\lambda, u\rangle\, g_\lambda$, we can decide to find $\lambda_*$ by the minimization

  $\lambda_* \ =\ \mathrm{argmin}_{\lambda\in\Lambda}\ \|u - h_{u,\lambda}\|^2$.   (P.2)

If we are not afraid to get stuck in local maxima (P.1) or minima (P.2) of these two non-convex problems, we can solve them using well-known optimization techniques such as gradient descent/ascent, or Newton or Gauss-Newton optimizations. We present here a basic gradient ascent for Problem (P.1) that respects the geometry of $\mathcal{M} = (\Lambda,\mathcal{G}_{ij})$ [37].
This method iteratively increases the value of $S_u$ by following a path in $\Lambda$, composed of geodesic segments, driven by the gradient of $S_u$.

Given a sequence of step sizes $t_r > 0$, the gradient ascent of $S_u$ starting from $\lambda_0 \in \Lambda$ is defined by the following induction [38]:

  $\phi_0(\lambda_0) = \lambda_0$,  $\phi_{r+1}(\lambda_0) = \gamma\big(t_r,\ \phi_r(\lambda_0),\ \xi_r(\lambda_0)\big)$,

where $\xi_r(\lambda_0) = |\nabla S_u(\phi_r(\lambda_0))|^{-1}\,\nabla S_u(\phi_r(\lambda_0))$ is the gradient direction obtained from the gradient $\nabla^i S_u = \mathcal{G}^{ij}\partial_j S_u$, and $\gamma(s,\lambda_0,\xi_0)$ is the geodesic starting at $\lambda_0 = \gamma(0,\lambda_0,\xi_0)$ with the unit velocity $\xi_0 = \frac{\partial}{\partial s}\gamma(0,\lambda_0,\xi_0)$. Notice that $\nabla^i$ is the natural notion of gradient on a Riemannian manifold. Indeed, as for the Euclidean case, with $\nabla^i h \triangleq \mathcal{G}^{ij}\partial_j h$ for $h \in L^2(X)$, given $w \in T_\lambda\Lambda$, the directional derivative $D_w h$ is equivalent to $D_w h(\lambda) \triangleq w^i\,\partial_i h(\lambda) = \langle\nabla h, w\rangle \triangleq w^i\,\nabla^j h(\lambda)\,\mathcal{G}_{ij}(\lambda)$, since $\mathcal{G}^{ik}\mathcal{G}_{kj} = \delta^i_j$.

Practically, in our gradient ascent, we use the linear first order approximation of $\gamma$, i.e.

  $\phi_{r+1}(\lambda) \ =\ \phi_r(\lambda) + t_r\,\xi_r(\lambda)$,   (11)
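A minimal sketch of how the first-order update (11) can refine a discrete selection is given below, following the two-step gMP atom search described in the introduction: pick the best atom on a coarse grid, then follow the normalized Riemannian gradient $\nabla^i S_u = \mathcal{G}^{ij}\partial_j S_u$ with a fixed step size. The residual $u$, the grid, the step size and the stopping rule are all illustrative assumptions, not the paper's actual experimental settings.

```python
import numpy as np

# One gMP atom selection: discrete argmax (rule (3c) with alpha = 1), then the
# first-order update (11) phi_{r+1} = phi_r + t_r * xi_r along the normalized
# Riemannian gradient. Atoms are unit-norm Gaussians, lambda = (b, a).
t = np.linspace(-10, 10, 2001)
dt = t[1] - t[0]

def atom(b, a):
    g = np.exp(-((t - b) / a) ** 2 / 2)
    return g / np.sqrt(np.sum(g ** 2) * dt)

def S(u, lam):                                # score S_u(lambda) = <g_lambda, u>^2
    return (np.sum(atom(*lam) * u) * dt) ** 2

def grad_and_metric(u, lam, h=1e-4):
    lam = np.asarray(lam, float)
    dS, P = np.zeros(2), []
    for i in range(2):
        e = np.zeros(2); e[i] = h
        dS[i] = (S(u, lam + e) - S(u, lam - e)) / (2 * h)         # d_i S_u
        P.append((atom(*(lam + e)) - atom(*(lam - e))) / (2 * h)) # d_i g_lambda
    G = np.array([[np.sum(P[i] * P[j]) * dt for j in range(2)] for i in range(2)])
    return np.linalg.solve(G, dS), G          # grad^i S_u = G^{ij} d_j S_u

u = atom(1.4, 1.4)                            # residual whose optimum is off-grid
grid = [(b, a) for b in np.arange(-4, 4.1, 1.0) for a in (1.0, 2.0, 4.0)]
lam = np.array(max(grid, key=lambda k: S(u, k)), float)   # step 1: dMP selection
s0 = S(u, lam)
for r in range(50):                           # step 2: gradient ascent, Eq. (11)
    xi, G = grad_and_metric(u, lam)
    norm = np.sqrt(xi @ G @ xi)               # |xi|_lambda, Riemannian norm
    if norm < 1e-10:
        break
    lam = lam + 0.05 * xi / norm              # fixed step t_r = 0.05 (assumption)
s1 = S(u, lam)
```

Here `s1 > s0`: the refined parameters recover a strictly larger coefficient than the best coarse-grid atom, which is exactly the mechanism by which gMP tightens the weakness factor of Theorem 1.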
