Two convergence results for an alternation maximization procedure

Andreas Andresen∗
Weierstrass-Institute, Mohrenstr. 39, 10117 Berlin, Germany
[email protected]

Vladimir Spokoiny†
Weierstrass Institute and HU Berlin, Moscow Institute of Physics and Technology, Mohrenstr. 39, 10117 Berlin, Germany
[email protected]

January 8, 2015

arXiv:1501.01525v1 [math.ST] 7 Jan 2015

Abstract

Andresen and Spokoiny's (2013) "critical dimension in semiparametric estimation" provides a technique for the finite sample analysis of profile M-estimators. This paper uses very similar ideas to derive two convergence results for the alternating procedure to approximate the maximizer of random functionals such as the realized log likelihood in maximum likelihood estimation. We manage to show that the sequence attains the same deviation properties as shown for the profile M-estimator in Andresen and Spokoiny (2013), i.e. a finite sample Wilks and Fisher theorem. Further, under slightly stronger smoothness constraints on the random functional, we can show nearly linear convergence to the global maximizer if the starting point for the procedure is well chosen.

AMS 2000 Subject Classification: Primary 62F10. Secondary 62J12, 62F25, 62H12

Keywords: alternating procedure, EM-algorithm, M-estimation, profile maximum likelihood, local linear approximation, spread, local concentration

∗ The author is supported by Research Units 1735 "Structural Inference in Statistics: Adaptation and Efficiency".
† The author is partially supported by Laboratory for Structural Methods of Data Analysis in Predictive Modeling, MIPT, RF government grant, ag. 11.G34.31.0073. Financial support by the German Research Foundation (DFG) through the CRC 649 "Economic Risk" is gratefully acknowledged.

Convergence of an alternation procedure

1 Introduction

This paper presents a convergence result for an alternating maximization procedure to approximate M-estimators.
Let $Y \in \mathcal{Y}$ denote some observed random data, and $\mathbb{P}$ denote the data distribution. In the semiparametric profile M-estimation framework the target of analysis is
\[ \theta^* = \Pi_\theta \upsilon^* = \Pi_\theta \operatorname*{argmax}_{\upsilon} \mathbb{E}_{\mathbb{P}} L(\upsilon, Y), \tag{1.1} \]
where $L: \Upsilon \times \mathcal{Y} \to \mathbb{R}$, $\Pi_\theta: \Upsilon \to \mathbb{R}^p$ is a projection and where $\Upsilon$ is some high dimensional or even infinite dimensional parameter space. This paper focuses on finite dimensional parameter spaces $\Upsilon \subseteq \mathbb{R}^{p^*}$ with $p^* = p + m \in \mathbb{N}$ being the full dimension, as infinite dimensional maximization problems are computationally not feasible anyway. A prominent way of estimating $\theta^*$ is the profile M-estimator (pME)
\[ \tilde\theta \stackrel{\text{def}}{=} \Pi_\theta \tilde\upsilon, \qquad \tilde\upsilon \stackrel{\text{def}}{=} \operatorname*{argmax}_{(\theta,\eta)} L(\theta,\eta). \]

The alternating maximization procedure is used in situations where a direct computation of the full maximum estimator (ME) $\tilde\upsilon \in \mathbb{R}^{p^*}$ is not feasible or simply very difficult to implement. Consider for example the task of calculating the pME with scalar random observations $Y = (y_i)_{i=1}^n \subset \mathbb{R}$, parameter $\upsilon = (\theta,\eta) \in \mathbb{R}^p \times \mathbb{R}^m$, a function basis $(e_k) \subset L^2(\mathbb{R})$ and
\[ L(\theta,\eta) = -\frac{1}{2} \sum_{i=1}^n \Bigl| y_i - \sum_{k=0}^m \eta_k e_k(X_i^\top \theta) \Bigr|^2. \]
In this case the maximization problem is high dimensional and non-convex (see Section 3 for more details). But for fixed $\theta \in S_1 \subset \mathbb{R}^p$ maximization with respect to $\eta \in \mathbb{R}^m$ is rather simple, while for fixed $\eta \in \mathbb{R}^m$ the maximization with respect to $\theta \in \mathbb{R}^p$ can be feasible for low $p \in \mathbb{N}$. This motivates the following iterative procedure. Given some (data dependent) functional $L: \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}$ and an initial guess $\tilde\upsilon_0 \in \mathbb{R}^{p+m}$, set for $k \in \mathbb{N}$
\[ \tilde\upsilon_{k,k+1} \stackrel{\text{def}}{=} (\tilde\theta_k, \tilde\eta_{k+1}) = \Bigl( \tilde\theta_k, \operatorname*{argmax}_{\eta \in \mathbb{R}^m} L(\tilde\theta_k, \eta) \Bigr), \]
\[ \tilde\upsilon_{k,k} \stackrel{\text{def}}{=} (\tilde\theta_k, \tilde\eta_k) = \Bigl( \operatorname*{argmax}_{\theta \in \mathbb{R}^p} L(\theta, \tilde\eta_k), \tilde\eta_k \Bigr). \tag{1.2} \]

The so called "alternation maximization procedure" (or minimization) is a widely applied algorithm in many parameter estimation tasks (see Jain et al. (2013), Netrapalli et al. (2013), Keshavan et al. (2010) or Yi et al. (2013)). Some natural questions arise: Does the sequence $(\tilde\theta_k)$ converge to a limit that satisfies the same statistical properties as the profile estimator? And if the answer is yes, after how many steps does the sequence acquire these properties? Under what circumstances does the sequence actually converge to the global maximizer $\tilde\upsilon$? This problem is hard because the behavior of each step of the sequence is determined by the actual finite sample realization of the functional $L(\cdot, Y)$. To the authors' knowledge no general "convergence" result is available that answers the questions from above, except for the treatment of specific models (see again Jain et al. (2013), Netrapalli et al. (2013), Keshavan et al. (2010) or Yi et al. (2013)).

We address this difficulty by employing new finite sample techniques of Andresen and Spokoiny (2013) and Spokoiny (2012), which allow us to answer the above questions: with growing iteration number $k \in \mathbb{N}$ the estimators $\tilde\theta_k$ attain the same statistical properties as the profile M-estimator, and Theorem 2.2 provides a choice of the necessary number of steps $K \in \mathbb{N}$. Under slightly stronger conditions on the structure of the model we can give a convergence result to the global maximizer that does not rely on unimodality. Further we can address the important question under which ratio of full dimension $p^* = p + m \in \mathbb{N}$ to sample size $n \in \mathbb{N}$ the sequence behaves as desired. For instance, for smooth $L$ our results become sharp if $p^*/\sqrt{n}$ is small, and convergence to the full maximizer already occurs if $p^*/n$ is small.

The alternation maximization procedure can be understood as a special case of the Expectation Maximization algorithm (EM algorithm), as we will illustrate below. The EM algorithm itself was derived by Dempster et al. (1977), who generalized particular versions of this approach and presented a variety of problems where its application can be fruitful; for a brief history of the EM algorithm see McLachlan and Krishnan (1997) (Sect. 1.8). We briefly explain the EM algorithm.
Take observations $(X) \sim \mathbb{P}_\theta$ for some parametric family $(\mathbb{P}_\theta, \theta \in \Theta)$. Assume that a parameter $\theta \in \Theta$ is to be estimated as maximizer of the functional $L_c(X,\theta) \in \mathbb{R}$, but that only $Y \in \mathcal{Y}$ is observed, where $Y = f_Y(X)$ is the image of the complete data set $X \in \mathcal{X}$ under some map $f_Y: \mathcal{X} \to \mathcal{Y}$. Prominent examples for the map $f_Y$ are projections onto some components of $X$ if both are vectors. The information lost under the map can be regarded as missing data or latent variables. As a direct maximization of the functional is impossible without knowledge of $X$, the EM algorithm serves as a workaround. It consists of the iteration of two steps: starting with some initial guess $\tilde\theta_0$, the $k$th "Expectation step" derives the functional $Q$ via
\[ Q(\theta, \theta_k) = \mathbb{E}_{\theta_k}[L_c(X,\theta) \mid Y], \]
which means that on the right hand side the conditional expectation is calculated under the distribution $\mathbb{P}_{\theta_k}$. The $k$th "Maximization step" then simply locates the maximizer $\theta_{k+1}$ of $Q$.

Since the algorithm is very popular in applications, a lot of research on its behaviour has been done. We are only dealing with a special case of this procedure, so we restrict ourselves to citing the well known convergence result by Wu (1983). Wu presents regularity conditions that ensure that $L(\theta_{k+1}) \ge L(\theta_k)$, where
\[ L(\theta, Y) \stackrel{\text{def}}{=} \log \int_{\{X \mid Y = f_Y(X)\}} \exp L_c(X,\theta)\, dX, \]
such that $L(\theta_k) \to L^*$ for some limit value $L^* > 0$ that may depend on the starting point $\theta_0$. Additionally Wu gives conditions that guarantee that the sequence $\theta_k$ (possibly a sequence of sets) converges to $C(L^*) \stackrel{\text{def}}{=} \{\theta \mid L(\theta) = L^*\}$. Dempster et al. (1977) show that the speed of convergence is linear in the case of point valued $\theta_k$ and of some differentiability criterion being met. A limitation of these results is that it is not clear whether $L^* = \sup L(\theta)$, and thus it is not guaranteed that $C(L^*)$ is the desired MLE and not just some local maximum.
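The two steps of the EM iteration can be made concrete on a standard textbook case that is not taken from this paper: an equal-weight mixture of two unit-variance Gaussians with unknown means, where the complete data $X$ consists of the observations $Y$ together with the unobserved component labels. The Python sketch below is only an illustration under these assumptions; the names `log_likelihood`, `em_step`, `run_em` and the sample `YS` are hypothetical. It records the observed log-likelihood $L(\theta_k, Y)$ along the iteration, which by the monotonicity property cited from Wu (1983) should not decrease.

```python
import math


def log_likelihood(ys, mu1, mu2):
    """Observed-data log-likelihood of the equal-weight, unit-variance mixture."""
    total = 0.0
    for y in ys:
        d1 = math.exp(-0.5 * (y - mu1) ** 2)
        d2 = math.exp(-0.5 * (y - mu2) ** 2)
        total += math.log(0.5 * (d1 + d2) / math.sqrt(2.0 * math.pi))
    return total


def em_step(ys, mu1, mu2):
    """One EM iteration: E-step responsibilities, then M-step weighted means."""
    # E-step: posterior probability of component 1 under the current parameter.
    resp = []
    for y in ys:
        d1 = math.exp(-0.5 * (y - mu1) ** 2)
        d2 = math.exp(-0.5 * (y - mu2) ** 2)
        resp.append(d1 / (d1 + d2))
    # M-step: Q(theta, theta_k) is a weighted sum of squares, maximized by weighted means.
    w1 = sum(resp)
    w2 = len(ys) - w1
    new_mu1 = sum(r * y for r, y in zip(resp, ys)) / w1
    new_mu2 = sum((1.0 - r) * y for r, y in zip(resp, ys)) / w2
    return new_mu1, new_mu2


def run_em(ys, mu1, mu2, steps):
    """Iterate EM and keep the observed log-likelihood after every step."""
    history = [log_likelihood(ys, mu1, mu2)]
    for _ in range(steps):
        mu1, mu2 = em_step(ys, mu1, mu2)
        history.append(log_likelihood(ys, mu1, mu2))
    return (mu1, mu2), history


# Hypothetical sample with two visible clusters around -2 and 2.
YS = [-2.1, -1.9, -2.4, -1.6, -2.0, 1.8, 2.2, 2.5, 1.7, 2.0]
(MU1, MU2), HISTORY = run_em(YS, -1.0, 1.0, 25)
```

The sketch says nothing about whether the limit is the global maximizer; a different starting point may lead to a different stationary point, which is exactly the limitation discussed above.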
Of course this problem disappears if $L(\cdot)$ is unimodal and the regularity conditions are met, but this assumption may be too restrictive.

In a recent work Balakrishnan et al. (2014) present a new way of addressing the properties of the EM sequence in a very general i.i.d. setting, based on concavity of $\theta \mapsto \mathbb{E}_{\theta^*}[L_c(X,\theta)]$. They show that if, in addition to concavity, the functional $L_c$ is smooth enough (First order stability) and if for a sample $(Y_i)$ with high probability a uniform bound holds of the kind
\[ \sup_{\theta \in B_r(\theta^*)} \Bigl| \sum_{i=1}^n \operatorname*{argmax}_{\theta^\circ} \mathbb{E}_\theta[L_c(X,\theta^\circ) \mid Y_i] - \operatorname*{argmax}_{\theta^\circ} \mathbb{E}_{\theta^*}\bigl[\mathbb{E}_\theta[L_c(X,\theta^\circ) \mid Y]\bigr] \Bigr| \le \epsilon_n, \tag{1.3} \]
then with high probability and some $\rho < 1$
\[ \|\tilde\theta_k - \theta^*\| \le \rho^k \|\theta_0 - \theta^*\| + C\epsilon_n. \tag{1.4} \]
Unfortunately this does not answer our two questions to full satisfaction. First, the bound (1.3) is rather high level and has to be checked for each model, while we seek (and find) properties of the functional, such as smoothness and bounds on the moments of its gradient, that lead to comparably desirable behavior. Further, with (1.4) it remains unclear whether for large $k \in \mathbb{N}$ the alternating sequence satisfies a Fisher expansion or whether a Wilks type phenomenon occurs. In particular it remains open which ratio of dimension to sample size ensures good performance of the procedure. Also the actual convergence $\tilde\theta_k \to \theta^*$ is not implied, as the right hand side in (1.4) is bounded from below by $C\epsilon_n > 0$.

Remark 1.1. In the context of the alternating procedure the bound (1.3) would read
\[ \max_{\theta^\circ \in B_r(\theta^*)} \Bigl| \operatorname*{argmax}_{\theta} L(\theta, \tilde\eta_{\theta^\circ}) - \operatorname*{argmax}_{\theta} \mathbb{E} L(\theta, \tilde\eta_{\theta^\circ}) \Bigr| \le \epsilon_n, \]
which is still difficult to check.

To see that the procedure (1.2) is a special case of the EM algorithm, denote in the notation from above $X = \bigl( \operatorname*{argmax}_\eta L\{(\theta,\eta), Y\}, Y \bigr)$, where $\theta$ is the parameter specifying the distribution $\mathbb{P}_\theta$, and $f_Y(X) = Y$.
Then with $L_c(\theta, X) = L_c(\theta, \eta, Y) \stackrel{\text{def}}{=} L(\theta,\eta)$
\[ Q(\theta, \tilde\theta_{k-1}) = \mathbb{E}_{\tilde\theta_{k-1}}[L_c(\theta, X) \mid Y] = L_c\Bigl(\theta, \operatorname*{argmax}_\eta L\{(\tilde\theta_{k-1}, \eta), Y\}, Y\Bigr) = L(\theta, \tilde\eta_k), \]
and thus the resulting sequence is the same as in (1.2). Consequently the convergence results from above apply to our problem if the involved regularity criteria are met. But as noted, these results do not tell us whether the limit of the sequence $(\tilde\theta_k)$ actually is the profile estimator, and the statistical properties of limit points are not clear without too restrictive assumptions on $L$ and the data.

This work fills this gap for a wide range of settings. Our main result can be summarized as follows: Under a set of regularity conditions on the data and the functional $L$, points of the sequence $(\tilde\theta_k)$ behave for large iteration number $k \in \mathbb{N}$ like the pME. To be more precise, we show in Theorem 2.2 that when the initial guess $\tilde\upsilon_0 \in \Upsilon$ is good enough, then the step estimator sequence $(\tilde\theta_k)$ satisfies with high probability
\[ \bigl\| \breve{D}(\tilde\theta_k - \theta^*) - \breve\xi \bigr\|^2 \le \epsilon\,(p^* + \rho^k R_0), \]
\[ \Bigl| \max_\eta L(\tilde\theta_k, \eta) - \max_\eta L(\theta^*, \eta) - \|\breve\xi\|^2/2 \Bigr| \le (p + x)^{1/2} \epsilon\,(p^* + \rho^k R_0), \]
where $\rho < 1$ and $\epsilon > 0$ is some small number, for example $\epsilon = C p^*/\sqrt{n}$ in the smooth i.i.d. setting. Further, $R_0 > 0$ is a bound related to the quality of the initial guess. The random variable $\breve\xi \in \mathbb{R}^p$ and the matrix $\breve{D} \in \mathbb{R}^{p \times p}$ are related to the efficient influence function in semiparametric models and its covariance. These are, up to $\rho^k R_0$, the same properties as those proven for the pME in Andresen and Spokoiny (2013) under nearly the same set of conditions. Further, in our second main result we manage to show under slightly stronger smoothness conditions that $(\tilde\theta_k, \tilde\eta_k)$ approaches the ME $\tilde\upsilon$ with nearly linear convergence speed, i.e. $\|\mathcal{D}((\tilde\theta_k, \tilde\eta_k) - \tilde\upsilon)\| \le \tau^{k/\log(k)}$ with some $0 < \tau < 1$ and $\mathcal{D}^2 = -\nabla^2 \mathbb{E} L(\upsilon^*)$ (see Theorem 2.4).
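The alternating sequence (1.2) and its approach to the full maximizer can be illustrated on a deliberately simple functional that is an assumption of this sketch, not an example from the paper: the two-regressor least squares criterion $L(\theta,\eta) = -\frac12 \sum_i (y_i - \theta x_i - \eta z_i)^2$ with scalar $\theta$ and $\eta$. Both partial maximizers are available in closed form, and the joint maximizer follows from the $2 \times 2$ normal equations, so the error of the sequence can be tracked exactly. The names `dot`, `full_maximizer`, `alternate` and the data are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def full_maximizer(xs, zs, ys):
    """Joint maximizer of L, from the 2x2 normal equations (Cramer's rule)."""
    sxx, szz, sxz = dot(xs, xs), dot(zs, zs), dot(xs, zs)
    sxy, szy = dot(xs, ys), dot(zs, ys)
    det = sxx * szz - sxz ** 2
    return (szz * sxy - sxz * szy) / det, (sxx * szy - sxz * sxy) / det


def alternate(xs, zs, ys, theta, eta, sweeps):
    """Run the two partial maximizations of (1.2) for a fixed number of sweeps."""
    sxx, szz, sxz = dot(xs, xs), dot(zs, zs), dot(xs, zs)
    sxy, szy = dot(xs, ys), dot(zs, ys)
    path = [(theta, eta)]
    for _ in range(sweeps):
        eta = (szy - theta * sxz) / szz    # argmax over eta with theta fixed
        theta = (sxy - eta * sxz) / sxx    # argmax over theta with eta fixed
        path.append((theta, eta))
    return path


# Hypothetical data; any regressors with dot(XS, ZS)**2 < dot(XS, XS) * dot(ZS, ZS) work.
XS = [1.0, 2.0, 3.0, 4.0, 5.0]
ZS = [1.0, 0.0, 1.0, 0.0, 1.0]
YS = [2.1, 3.9, 6.2, 8.1, 9.8]

THETA_HAT, ETA_HAT = full_maximizer(XS, ZS, YS)
PATH = alternate(XS, ZS, YS, theta=0.0, eta=0.0, sweeps=40)
ERRORS = [abs(t - THETA_HAT) for t, _ in PATH]
# Per-sweep contraction factor of the theta-error for this quadratic functional.
RATE = dot(XS, ZS) ** 2 / (dot(XS, XS) * dot(ZS, ZS))
```

For this quadratic criterion the error in $\theta$ shrinks by the fixed factor `RATE` in every sweep, which is the purely linear convergence that the general result can only reproduce up to the $\tau^{k/\log(k)}$ rate.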
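In the following we write $\tilde\upsilon_{k,k(+1)}$ in statements that are true for both $\tilde\upsilon_{k,k+1}$ and $\tilde\upsilon_{k,k}$. Also we do not specify whether the elements of the resulting sequence are sets or single points. All statements made about properties of $\tilde\upsilon_{k,k(+1)}$ are to be understood in the sense that they hold for "every point of $\tilde\upsilon_{k,k(+1)}$".

1.1 Idea of the proof

To motivate the approach first consider the toy model
\[ Y = \upsilon^* + \varepsilon, \quad \text{where } \varepsilon \sim \mathcal{N}(0, F_{\upsilon^*}^{-2}), \qquad F_{\upsilon^*}^2 =: \begin{pmatrix} F_{\theta^*}^2 & A \\ A^\top & F_{\eta^*}^2 \end{pmatrix}. \]
In this case we set $L$ to be the true log likelihood of the observations
\[ L(\upsilon, Y) = -\|F_{\upsilon^*}(\upsilon - Y)\|^2/2. \]
With any initial guess $\tilde\upsilon_0 \in \mathbb{R}^{p+m}$ we obtain from (1.2) for $k \in \mathbb{N}$ and the usual first order criterion of maximality the following two equations
\[ F_{\theta^*}(\tilde\theta_k - \theta^*) = F_{\theta^*}^{-1}(F_{\theta^*}^2 \varepsilon_\theta + A\varepsilon_\eta) - F_{\theta^*}^{-1} A(\tilde\eta_k - \eta^*), \]
\[ F_{\eta^*}(\tilde\eta_{k+1} - \eta^*) = F_{\eta^*}^{-1}(A^\top \varepsilon_\theta + F_{\eta^*}^2 \varepsilon_\eta) - F_{\eta^*}^{-1} A^\top(\tilde\theta_k - \theta^*). \]
Combining these two equations we derive, assuming $\|F_{\theta^*}^{-1} A F_{\eta^*}^{-2} A^\top F_{\theta^*}^{-1}\| =: \|M_0\| = \nu < 1$,
\[ F_{\theta^*}(\tilde\theta_k - \theta^*) = F_{\theta^*}^{-1}(F_{\theta^*}^2 - A F_{\eta^*}^{-2} A^\top)\varepsilon_\theta + M_0 F_{\theta^*}(\tilde\theta_{k-1} - \theta^*) \]
\[ = \sum_{l=1}^{k} M_0^{k-l} F_{\theta^*}^{-1}(F_{\theta^*}^2 - A F_{\eta^*}^{-2} A^\top)\varepsilon_\theta + M_0^k F_{\theta^*}(\tilde\theta_0 - \theta^*) \to F_{\theta^*}(\hat\theta - \theta^*). \]
Because the limit $\hat\theta$ is independent of the initial point $\tilde\upsilon_0$ and because the profile $\tilde\theta$ is a fixed point of the procedure, the unique limit satisfies $\hat\theta = \tilde\theta$. This argument is based on the fact that in this setting the functional is quadratic such that the gradient satisfies
\[ \nabla L(\upsilon) = -F_{\upsilon^*}^2(\upsilon - \upsilon^*) + F_{\upsilon^*}^2 \varepsilon. \]
Any smooth function is quadratic around its maximizer, which motivates a local linear approximation of the gradient of the functional $L$ to derive our results with similar arguments. This is done in the proof of Theorem 2.2. First it is ensured that the whole sequence $(\tilde\upsilon_{k,k(+1)})_{k \in \mathbb{N}_0}$ satisfies for some $R_0 > 0$
\[ \{\tilde\upsilon_{k,k(+1)},\ k \in \mathbb{N}_0\} \subset \{\|\mathcal{D}(\upsilon - \upsilon^*)\| \le R_0\}, \tag{1.5} \]
where $\mathcal{D}^2 \stackrel{\text{def}}{=} -\nabla^2 \mathbb{E} L(\upsilon^*)$ (see Theorem 4.3). In the second step we approximate with $\zeta = L - \mathbb{E} L$
\[ L(\upsilon, \upsilon^*) = \nabla\zeta(\upsilon^*)(\upsilon - \upsilon^*) - \|\mathcal{D}(\upsilon - \upsilon^*)\|^2/2 + \alpha(\upsilon, \upsilon^*), \tag{1.6} \]
where $\alpha(\upsilon, \upsilon^*)$ is defined by (1.6).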
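As a sanity check of the toy model in Section 1.1, the recursion can be worked out by hand in the scalar case $p = m = 1$. The entries $a$, $b$, $c$ of the information matrix are notation introduced only for this illustration; the first two lines are the first order conditions of the two partial maximizations, obtained from $L(\upsilon, Y) = -\|F_{\upsilon^*}(\upsilon - Y)\|^2/2$ with $Y = \upsilon^* + \varepsilon$.

```latex
% Scalar toy model: p = m = 1, F^2 = [[a, c], [c, b]] with a, b > 0 and c^2 < ab.
\begin{align*}
a(\tilde\theta_k - \theta^*)
  &= a\varepsilon_\theta + c\varepsilon_\eta - c(\tilde\eta_k - \eta^*), \\
b(\tilde\eta_k - \eta^*)
  &= c\varepsilon_\theta + b\varepsilon_\eta - c(\tilde\theta_{k-1} - \theta^*), \\
\tilde\theta_k - \theta^*
  &= (1 - \nu)\varepsilon_\theta + \nu(\tilde\theta_{k-1} - \theta^*),
  \qquad \nu = \frac{c^2}{ab} < 1, \\
\tilde\theta_k - \theta^*
  &= (1 - \nu^k)\varepsilon_\theta + \nu^k(\tilde\theta_0 - \theta^*)
  \;\longrightarrow\; \varepsilon_\theta \quad (k \to \infty).
\end{align*}
```

The limit $\theta^* + \varepsilon_\theta$ is the $\theta$-component of $Y$, which maximizes the quadratic $L(\cdot, Y)$, and the influence of the initial guess decays like $\nu^k$, with $\nu$ playing the role of $\|M_0\|$.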
Similar to the toy case above, this allows using the first order criterion of maximality and (1.5) to obtain a bound of the kind
\[ \|\mathcal{D}(\tilde\upsilon_{k,k} - \upsilon^*)\| \le C \sum_{l=0}^{k} \rho^l \bigl( \|\mathcal{D}^{-1}\nabla\zeta(\upsilon^*)\| + |\alpha(\tilde\upsilon_{l,l}, \upsilon^*)| \bigr) \]
\[ \le C_1 \bigl( \|\mathcal{D}^{-1}\nabla\zeta(\upsilon^*)\| + \epsilon(R_0) \bigr) + \rho^k R_0 \stackrel{\text{def}}{=} r_k. \]
This is done in Lemma 4.5 using results from Andresen and Spokoiny (2013) to show that $\epsilon(R_0)$ is small. Finally the same arguments as in Andresen and Spokoiny (2013) allow us to obtain our main result, using that with high probability for all $k \in \mathbb{N}_0$, $\tilde\upsilon_{k,k} \in \{\|\mathcal{D}(\upsilon - \upsilon^*)\| \le r_k\}$. For the convergence result similar arguments are used. The only difference is that instead of (1.6) we use the approximation
\[ L(\upsilon, \tilde\upsilon) = -\|\mathcal{D}(\upsilon - \tilde\upsilon)\|^2/2 + \alpha'(\upsilon, \tilde\upsilon), \]
exploiting that $\nabla L(\tilde\upsilon) \equiv 0$, which allows us to obtain actual convergence to the ME.

It is worth pointing out two technical challenges of the analysis. First, the sketched approach relies on (1.5). As all estimators $(\tilde\upsilon_{k,k(+1)})$ are random, this means that we need with some small $\beta > 0$
\[ \mathbb{P}\Bigl( \bigcap_{k \in \mathbb{N}_0} \bigl\{ \tilde\upsilon_{k,k}, \tilde\upsilon_{k,k+1} \in \{\|\mathcal{D}(\upsilon - \upsilon^*)\| \le R_0\} \bigr\} \Bigr) \ge 1 - \beta. \]
This is not trivial, but Theorem 4.3 provides it thanks to $L(\tilde\upsilon_{k,k(+1)}) \ge L(\tilde\upsilon_0)$. Second, the main result, Theorem 2.2, is formulated to hold for all $k \in \mathbb{N}_0$. This implies the need of a bound of the kind
\[ \mathbb{P}\Bigl( \bigcap_{k \in \mathbb{N}_0} \Bigl\{ \bigl\| \breve{D}^{-1}\bigl\{ \breve\nabla\zeta(\tilde\upsilon_{k,k}) - \breve\nabla\zeta(\upsilon^*) \bigr\} \bigr\| \le \epsilon(r_k) \Bigr\} \Bigr) \ge 1 - \beta, \]
with some small $\epsilon(r) > 0$ that is decreasing if $r > 0$ shrinks. Again this is not trivial and not a direct implication of the results of Andresen and Spokoiny (2013) or Spokoiny (2012). We manage to derive this result in the desired way in Theorem 8.2, which is an adapted version of Theorem D.1 of Andresen and Spokoiny (2013) based on Corollary 2.5 of Spokoiny (2012).

2 Main results

2.1 Conditions

This section collects the conditions imposed on the model.
We use the same set of assumptions as in Andresen and Spokoiny (2013) and this section closely follows Section 2.1 of that paper.

Let the full dimension of the problem be finite, i.e. $p^* < \infty$. Our conditions involve the symmetric positive definite information matrix $\mathcal{D}^2 \in \mathbb{R}^{p^* \times p^*}$ and a central point $\upsilon^\circ \in \mathbb{R}^{p^*}$. In typical situations for $p^* < \infty$, one can set $\upsilon^\circ = \upsilon^*$ where $\upsilon^*$ is the "true point" from (1.1). The matrix $\mathcal{D}^2$ can be defined as follows:
\[ \mathcal{D}^2 = -\nabla^2 \mathbb{E} L(\upsilon^\circ). \]
Here and in what follows we implicitly assume that the log-functional $L(\upsilon): \mathbb{R}^{p^*} \to \mathbb{R}$ is sufficiently smooth in $\upsilon \in \mathbb{R}^{p^*}$; $\nabla L(\upsilon) \in \mathbb{R}^{p^*}$ stands for the gradient and $\nabla^2 \mathbb{E} L(\upsilon) \in \mathbb{R}^{p^* \times p^*}$ for the Hessian of the expectation $\mathbb{E} L: \mathbb{R}^{p^*} \to \mathbb{R}$ at $\upsilon \in \mathbb{R}^{p^*}$. By smooth enough we mean that we can interchange $\nabla \mathbb{E} L = \mathbb{E} \nabla L$ on $\Upsilon_\circ(R_0)$, where $\Upsilon_\circ(r)$ is defined in (2.1) and $R_0 > 0$ in (2.4). It is worth mentioning that $\mathcal{D}^2 = V^2 \stackrel{\text{def}}{=} \operatorname{Cov}(\nabla L(\upsilon^*))$ if the model $Y \sim \mathbb{P}_{\upsilon^*} \in (\mathbb{P}_\upsilon)$ is correctly specified and sufficiently regular; see e.g. Ibragimov and Khas'minskij (1981).

In the context of semiparametric estimation, it is convenient to represent the information matrix in block form:
\[ \mathcal{D}^2 = \begin{pmatrix} D^2 & A \\ A^\top & H^2 \end{pmatrix}. \]
First we state an identifiability condition.

($\mathcal{I}$) It holds for some $\rho < 1$
\[ \|H^{-1} A^\top D^{-1}\|_\infty \le \sqrt{\rho}. \]

Remark 2.1. The condition ($\mathcal{I}$) allows to introduce the important $p \times p$ efficient information matrix $\breve{D}^2$, which is defined as the inverse of the $\theta$-block of the inverse of the full dimensional matrix $\mathcal{D}^2$. The exact formula is given by
\[ \breve{D}^2 \stackrel{\text{def}}{=} D^2 - A H^{-2} A^\top, \]
and ($\mathcal{I}$) ensures that the matrix $\breve{D}^2$ is well posed.

Using the matrix $\mathcal{D}^2$ and the central point $\upsilon^\circ \in \mathbb{R}^{p^*}$, we define the local set $\Upsilon_\circ(r) \subset \Upsilon \subseteq \mathbb{R}^{p^*}$ with some $r \ge 0$:
\[ \Upsilon_\circ(r) \stackrel{\text{def}}{=} \bigl\{ \upsilon = (\theta,\eta) \in \Upsilon: \|\mathcal{D}(\upsilon - \upsilon^\circ)\| \le r \bigr\}. \tag{2.1} \]
The following two conditions quantify the smoothness properties on $\Upsilon_\circ(r)$ of the expected log-functional $\mathbb{E} L(\upsilon)$ and of the stochastic component $\zeta(\upsilon) = L(\upsilon) - \mathbb{E} L(\upsilon)$.
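($\breve{\mathcal{L}}_0$) For each $r \le r_0$, there is a constant $\delta(r)$ such that it holds on the set $\Upsilon_\circ(r)$:
\[ \|D^{-1} D^2(\upsilon) D^{-1} - I_p\| \le \delta(r), \qquad \|D^{-1}(A(\upsilon) - A) H^{-1}\| \le \delta(r), \]
\[ \bigl\| D^{-1} A H^{-1} \bigl( I_m - H^{-1} H^2(\upsilon) H^{-1} \bigr) \bigr\| \le \delta(r). \]

Remark 2.2. This condition describes the local smoothness properties of the function $\mathbb{E} L(\upsilon)$. In particular, it allows to bound the error of local linear approximation of the projected gradient $\breve\nabla_\theta \mathbb{E} L(\upsilon)$, which is defined as
\[ \breve\nabla_\theta = \nabla_\theta - A H^{-2} \nabla_\eta. \]
Under condition ($\breve{\mathcal{L}}_0$) it follows from the second order Taylor expansion for any $\upsilon, \upsilon' \in \Upsilon_\circ(r)$ (see Lemma B.1 of Andresen and Spokoiny (2013))
\[ \bigl\| \breve{D}^{-1}\bigl( \breve\nabla \mathbb{E} L(\upsilon) - \breve\nabla \mathbb{E} L(\upsilon^*) \bigr) + \breve{D}(\theta - \theta^*) \bigr\| \le \delta(r)\, r. \tag{2.2} \]
In the proofs we actually only need the condition (2.2), which in some cases can be weaker than ($\breve{\mathcal{L}}_0$).

The next condition concerns the regularity of the stochastic component $\zeta(\upsilon) \stackrel{\text{def}}{=} L(\upsilon) - \mathbb{E} L(\upsilon)$. Similarly to Spokoiny (2012), we implicitly assume that the stochastic component $\zeta(\upsilon)$ is a separable stochastic process.

($\breve{\mathcal{E}}\mathcal{D}_1$) For all $0 < r < r_0$, there exists a constant $\omega \le 1/2$ such that for all $|\mu| \le \breve{g}$ and $\upsilon, \upsilon' \in \Upsilon_\circ(r)$
\[ \sup_{\upsilon,\upsilon' \in \Upsilon_\circ(r)} \sup_{\|\gamma\| \le 1} \log \mathbb{E} \exp\Bigl\{ \mu \frac{\gamma^\top \breve{D}^{-1}\{\breve\nabla_\theta\zeta(\upsilon) - \breve\nabla_\theta\zeta(\upsilon')\}}{\omega \|\mathcal{D}(\upsilon - \upsilon')\|} \Bigr\} \le \frac{\breve\nu_1^2 \mu^2}{2}. \]

The above conditions allow to derive the main result once the accuracy of the sequence is established. We include another condition that allows to control the deviation behavior of $\|\breve{D}^{-1}\breve\nabla\zeta(\upsilon^*)\|$. To present this condition define the covariance matrices $V^2 \in \mathbb{R}^{p^* \times p^*}$ and $\breve{V}^2 \in \mathbb{R}^{p \times p}$
\[ V^2 \stackrel{\text{def}}{=} \operatorname{Var}\{\nabla L(\upsilon^\circ)\}, \qquad \breve{V}^2 = \operatorname{Cov}(\breve\nabla_\theta \zeta(\upsilon^\circ)). \]
$V^2 \in \mathbb{R}^{p^* \times p^*}$ describes the variability of the process $L(\upsilon)$ around the central point $\upsilon^\circ$.

($\breve{\mathcal{E}}\mathcal{D}_0$) There exist constants $\breve\nu_0 > 0$ and $\breve{g} > 0$ such that for all $|\mu| \le \breve{g}$
\[ \sup_{\gamma \in \mathbb{R}^p} \log \mathbb{E} \exp\Bigl\{ \mu \frac{\langle \breve\nabla_\theta \zeta(\upsilon^\circ), \gamma \rangle}{\|\breve{V}\gamma\|} \Bigr\} \le \frac{\breve\nu_0^2 \mu^2}{2}. \]

So far we only presented conditions that allow to treat the properties of $\tilde\theta_k$ on local sets $\Upsilon_\circ(r_k)$.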
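To show that $r_k$ is not too large the following, stronger conditions are employed:

($\mathcal{L}_0$) For each $r \le r_0$, there is a constant $\delta(r)$ such that it holds on the set $\Upsilon_\circ(r)$:
\[ \bigl\| \mathcal{D}^{-1}\{\nabla^2 \mathbb{E} L(\upsilon)\}\mathcal{D}^{-1} + I_{p^*} \bigr\| \le \delta(r). \]

($\mathcal{ED}_1$) There exists a constant $\omega \le 1/2$, such that for all $|\mu| \le g$ and all $0 < r < r_0$
\[ \sup_{\upsilon,\upsilon' \in \Upsilon_\circ(r)} \sup_{\|\gamma\| = 1} \log \mathbb{E} \exp\Bigl\{ \mu \frac{\gamma^\top \mathcal{D}^{-1}\{\nabla\zeta(\upsilon) - \nabla\zeta(\upsilon')\}}{\omega \|\mathcal{D}(\upsilon - \upsilon')\|} \Bigr\} \le \frac{\nu_1^2 \mu^2}{2}. \]

($\mathcal{ED}_0$) There exist constants $\nu_0 > 0$ and $g > 0$ such that for all $|\mu| \le g$
\[ \sup_{\gamma \in \mathbb{R}^{p^*}} \log \mathbb{E} \exp\Bigl\{ \mu \frac{\langle \nabla\zeta(\upsilon^\circ), \gamma \rangle}{\|V\gamma\|} \Bigr\} \le \frac{\nu_0^2 \mu^2}{2}. \]

It is important to note that the constants $\breve\omega, \breve\delta(r), \breve\nu$ and $\omega, \delta(r), \nu$ in the respective weak and strong version can differ substantially and may depend on the full dimension $p^* \in \mathbb{N}$ in less or more severe ways ($A H^{-2} \nabla_\eta L$ might be quite smooth while $\nabla_\eta L$ could be less regular). This is why we use both sets of conditions where they suit best, although the list of assumptions becomes rather long. If a short list is preferred, the following lemma shows that the stronger conditions imply the weaker ones from above:

Lemma 2.1. [Andresen and Spokoiny (2013), Lemma 2.1] Assume ($\mathcal{I}$). Then ($\mathcal{ED}_1$) implies ($\breve{\mathcal{E}}\mathcal{D}_1$), ($\mathcal{L}_0$) implies ($\breve{\mathcal{L}}_0$), and ($\mathcal{ED}_0$) implies ($\breve{\mathcal{E}}\mathcal{D}_0$) with
\[ \breve{g} = \frac{\sqrt{1 - \rho^2}}{(1 + \rho)\sqrt{1 + \rho^2}}\, g, \qquad \breve\nu = \frac{(1 + \rho)\sqrt{1 + \rho^2}}{\sqrt{1 - \rho^2}}\, \nu, \qquad \breve\delta(r) = \delta(r), \qquad \text{and} \quad \breve\omega = \omega. \]

Finally we present two conditions that allow to ensure that with a high probability the sequence $(\tilde\upsilon_{k,k(+1)})$ stays close to $\upsilon^*$ if the initial guess $\tilde\upsilon_0$ lands close to $\upsilon^*$. These conditions have to be satisfied on the whole set $\Upsilon \subseteq \mathbb{R}^{p^*}$.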