A Bernstein-type Inequality for Some Mixing Processes and Dynamical Systems with an Application to Learning

H. Hang and I. Steinwart
Institute for Stochastics and Applications
University of Stuttgart
D-70569 Stuttgart
{hanghn,ingo.steinwart}@mathematik.uni-stuttgart.de

arXiv:1501.03059v1 [math.PR] 13 Jan 2015

January 14, 2015

Abstract

We establish a Bernstein-type inequality for a class of stochastic processes that includes the classical geometrically φ-mixing processes, Rio's generalization of these processes, as well as many time-discrete dynamical systems. Modulo a logarithmic factor and some constants, our Bernstein-type inequality coincides with the classical Bernstein inequality for i.i.d. data. We further use this new Bernstein-type inequality to derive an oracle inequality for generic regularized empirical risk minimization algorithms and data generated by such processes. Applying this oracle inequality to support vector machines using Gaussian kernels for both least squares and quantile regression, it turns out that the resulting learning rates match, up to some arbitrarily small extra term in the exponent, the optimal rates for i.i.d. processes.

1 Introduction

Concentration inequalities such as Hoeffding's inequality, Bernstein's inequality, McDiarmid's inequality, and Talagrand's inequality play an important role in many areas of probability. For example, the analysis of various methods from non-parametric statistics and machine learning crucially depends on these inequalities, see e.g. [19, 20, 22, 42]. Here, stronger results can typically be achieved by Bernstein's inequality and/or Talagrand's inequality, since these inequalities allow for localization due to their specific dependence on the variance. In particular, most derivations of minimax optimal learning rates are based on one of these inequalities.

The concentration inequalities mentioned above all assume the data to be generated by an i.i.d. process.
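For orientation, the variance dependence that makes Bernstein's inequality stronger than Hoeffding's can be seen numerically. The following sketch (ours, not from the paper) evaluates the classical one-sided bounds for the mean of $n$ i.i.d. random variables with $\mathbb{E}X = 0$, $|X| \le B$, and $\operatorname{Var} X \le \sigma^2$; all parameter values are arbitrary choices for illustration:

```python
import math

# Sketch (not from the paper): classical one-sided tail bounds for the mean
# of n i.i.d. random variables with E X = 0, |X| <= B, Var X <= sigma^2.
def hoeffding_bound(n, eps, B):
    # P(mean >= eps) <= exp(-n eps^2 / (2 B^2)); ignores the variance
    return math.exp(-n * eps**2 / (2 * B**2))

def bernstein_bound(n, eps, sigma2, B):
    # P(mean >= eps) <= exp(-n eps^2 / (2 (sigma^2 + eps B / 3)))
    return math.exp(-n * eps**2 / (2 * (sigma2 + eps * B / 3)))

# With a small variance, Bernstein's bound is far tighter than Hoeffding's:
n, eps, B, sigma2 = 1000, 0.05, 1.0, 0.01
print(hoeffding_bound(n, eps, B))           # ≈ 0.287
print(bernstein_bound(n, eps, sigma2, B))   # ≈ 4e-21
```

This gap is exactly what "localization" exploits in learning-rate proofs.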
Unfortunately, however, this assumption is often violated in several important areas of application, including financial prediction, signal processing, system observation and diagnosis, text and speech recognition, and time series forecasting. For this and other reasons there has been some effort to establish concentration inequalities for non-i.i.d. processes, too. For example, generalizations of Bernstein's inequality to α-mixing and φ-mixing processes have been found in [10, 33, 32] and [38], respectively. Among many other applications, the Bernstein-type inequality established in [10] was used in [50] to obtain convergence rates for sieve estimates from α-mixing strictly stationary processes in the special case of neural networks. Furthermore, [23] applied the Bernstein-type inequality in [33] to derive an oracle inequality for generic regularized empirical risk minimization algorithms learning from stationary α-mixing processes. Moreover, by employing the Bernstein-type inequality in [32], [7] derived almost sure uniform rates of convergence for the estimated Lévy density, both in mixed-frequency and low-frequency setups, and proved that these rates are optimal in the minimax sense. Finally, in the particular case of the least squares loss, [2] obtained the optimal learning rate for φ-mixing processes by applying the Bernstein-type inequality established in [38].

However, there exist many dynamical systems, such as the uniformly expanding maps given in [17, p. 41], that are not α-mixing. To deal with such non-mixing processes, Rio [34] introduced so-called φ̃-mixing coefficients, which extend the classical φ-mixing coefficients. For dynamical systems with exponentially decreasing, modified φ̃-coefficients, [47] derived a Bernstein-type inequality, which turns out to be the same as the one for i.i.d. processes modulo some logarithmic factor. However, this modification seems to be significantly stronger than Rio's original φ̃-mixing, so it remains unclear when the Bernstein-type inequality in [47] is applicable.
In addition, the φ̃-mixing concept is still not large enough to cover many commonly considered dynamical systems. To include such dynamical systems, [31] proposed the C-mixing coefficients, which further generalize the φ̃-mixing coefficients.

In this work, we establish a Bernstein-type inequality for geometrically C-mixing processes, which, modulo a logarithmic factor and some constants, coincides with the classical one for i.i.d. processes. Using the techniques developed in [23], we then derive an oracle inequality for generic regularized empirical risk minimization and C-mixing processes. We further apply this oracle inequality to a state-of-the-art learning method, namely support vector machines (SVMs) with Gaussian kernels. Here it turns out that for both least squares and quantile regression, we can recover the (essentially) optimal rates recently found for the i.i.d. case, see [21], when the data is generated by a geometrically C-mixing process. Finally, we establish an oracle inequality for the problem of forecasting an unknown dynamical system. This oracle inequality will make it possible to extend the purely asymptotic analysis in [41] to learning rates.

The rest of this work is organized as follows: In Section 2, we recall the notion of (time-reversed) C-mixing processes. We further illustrate this class of processes by some examples and discuss the relation between C-mixing and other notions of mixing. As the main result of this work, a Bernstein-type inequality for geometrically (time-reversed) C-mixing processes will be formulated in Section 3. There, we also compare our new Bernstein-type inequality to previously established concentration inequalities. As an application of our Bernstein-type inequality, we derive an oracle inequality for regularized risk minimization schemes in Section 4. We additionally derive learning rates for SVMs and an oracle inequality for forecasting certain dynamical systems. All proofs can be found in the last section.
2 C-mixing processes

In this section we recall two classes of stationary stochastic processes, called (time-reversed) C-mixing processes, that have a certain decay of correlations for suitable pairs of functions. We also present some examples of such processes, including certain dynamical systems.

Let us begin by introducing some notation. In the following, $(\Omega, \mathcal{A}, \mu)$ always denotes a probability space. As usual, we write $L_p(\mu)$ for the space of (equivalence classes of) measurable functions $f : \Omega \to \mathbb{R}$ with finite $L_p$-norm $\|f\|_p$. It is well known that $L_p(\mu)$ together with $\|f\|_p$ forms a Banach space. Moreover, if $\mathcal{A}' \subset \mathcal{A}$ is a sub-σ-algebra, then $L_1(\mathcal{A}', \mu)$ denotes the space of all $\mathcal{A}'$-measurable functions $f \in L_1(\mu)$. In the following, for a Banach space $E$, we write $B_E$ for its closed unit ball.

Given a semi-norm $\|\cdot\|$ on a vector space $E$ of bounded measurable functions $f : Z \to \mathbb{R}$, we define the C-norm by

  $\|f\|_{\mathcal{C}} := \|f\|_\infty + \|f\|$   (1)

and denote the space of all bounded C-functions by

  $\mathcal{C}(Z) := \{ f : Z \to \mathbb{R} \mid \|f\|_{\mathcal{C}} < \infty \}$.   (2)

Throughout this work, we only consider semi-norms $\|\cdot\|$ in (1) that satisfy the inequality

  $\|e^f\| \le \|e^f\|_\infty \|f\|$   (3)

for all $f \in \mathcal{C}(Z)$. We are mostly interested in the following examples of semi-norms satisfying (3).

Example 2.1. Let $Z$ be an arbitrary set and suppose that we have $\|f\| = 0$ for all $f : Z \to \mathbb{R}$. Then it is obvious that $\|e^f\| = \|f\| = 0$. Hence, (3) is satisfied.

Example 2.2. Let $Z \subset \mathbb{R}$ be an interval. A function $f : Z \to \mathbb{R}$ is said to have bounded variation on $Z$ if its total variation $\|f\|_{BV(Z)}$ is bounded. Denote by $BV(Z)$ the set of all functions of bounded variation. It is well known that $BV(Z)$ together with $\|f\|_\infty + \|f\|_{BV(Z)}$ forms a Banach space. Moreover, (3) holds, i.e. for all $f \in \mathcal{C}(Z)$ we have

  $\|e^f\|_{BV(Z)} \le \|e^f\|_\infty \|f\|_{BV(Z)}$.

Example 2.3. Let $Z$ be a subset of $\mathbb{R}^d$ and $C_b(Z)$ be the set of bounded continuous functions on $Z$. For $f \in C_b(Z)$ and $0 < \alpha \le 1$ let

  $\|f\| := |f|_\alpha := \sup_{z \ne z'} \dfrac{|f(z) - f(z')|}{|z - z'|^\alpha}$.

Clearly, $f$ is α-Hölder continuous if and only if $|f|_\alpha < \infty$. The collection of bounded, α-Hölder continuous functions on $Z$ will be denoted by

  $C_{b,\alpha}(Z) := \{ f \in C_b(Z) : |f|_\alpha < \infty \}$.

Note that, if $Z$ is compact, then $C_{b,\alpha}(Z)$ together with the norm $\|f\|_{C_{b,\alpha}} := \|f\|_\infty + |f|_\alpha$ forms a Banach space. Moreover, the inequality (3) is also valid for $f \in C_{b,\alpha}(Z)$. As usual, we speak of Lipschitz continuous functions if $\alpha = 1$ and write $\operatorname{Lip}(Z) := C_{b,1}(Z)$.

Example 2.4. Let $Z \subset \mathbb{R}^d$ be an open subset. For a continuously differentiable function $f : Z \to \mathbb{R}$ we write

  $\|f\| := \sup_{z \in Z} |f'(z)|$

and $C^1(Z) := \{ f : Z \to \mathbb{R} \mid f \text{ continuously differentiable and } \|f\|_\infty + \|f\| < \infty \}$. It is well known that $C^1(Z)$ is a Banach space with respect to the norm $\|\cdot\|_\infty + \|\cdot\|$, and the chain rule gives

  $\|e^f\| = \|(e^f)'\|_\infty = \|e^f \cdot f'\|_\infty \le \|e^f\|_\infty \|f'\|_\infty = \|e^f\|_\infty \|f\|$

for all $f \in C^1(Z)$, i.e. (3) is satisfied.

Let us now assume that we also have a measurable space $(Z, \mathcal{B})$ and a measurable map $\chi : \Omega \to Z$. Then $\sigma(\chi)$ denotes the smallest σ-algebra on $\Omega$ for which $\chi$ is measurable. Moreover, $\mu_\chi$ denotes the χ-image measure of $\mu$, which is defined by $\mu_\chi(B) := \mu(\chi^{-1}(B))$, $B \in \mathcal{B}$.

Let $\mathcal{Z} := (Z_n)_{n \ge 0}$ be a $Z$-valued stochastic process on $(\Omega, \mathcal{A}, \mu)$, and let $\mathcal{A}_0^i$ and $\mathcal{A}_{i+n}^\infty$ be the σ-algebras generated by $(Z_0, \ldots, Z_i)$ and $(Z_{i+n}, Z_{i+n+1}, \ldots)$, respectively. The process $\mathcal{Z}$ is called stationary if $\mu_{(Z_{i_1+i}, \ldots, Z_{i_n+i})} = \mu_{(Z_{i_1}, \ldots, Z_{i_n})}$ for all $n, i, i_1, \ldots, i_n \ge 1$. In this case, we always write $P := \mu_{Z_0}$. Moreover, to define certain dependency coefficients for $\mathcal{Z}$, we denote, for $\psi, \varphi \in L_1(\mu)$ satisfying $\psi\varphi \in L_1(\mu)$, the correlation of $\psi$ and $\varphi$ by

  $\operatorname{cor}(\psi, \varphi) := \int_\Omega \psi \cdot \varphi \, d\mu - \int_\Omega \psi \, d\mu \cdot \int_\Omega \varphi \, d\mu.$

Several dependency coefficients for $\mathcal{Z}$ can be expressed by imposing restrictions on $\psi$ and $\varphi$. The following definition, which is taken from [31], introduces the restrictions on $\psi$ and $\varphi$ we consider throughout this work.

Definition 2.5.
Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $(Z, \mathcal{B})$ be a measurable space, $\mathcal{Z} := (Z_i)_{i \ge 0}$ be a $Z$-valued, stationary process on $\Omega$, and let $\|\cdot\|_{\mathcal{C}}$ be defined by (1) for some semi-norm $\|\cdot\|$. Then, for $n \ge 0$, we define:

(i) the C-mixing coefficients by

  $\phi_{\mathcal{C}}(\mathcal{Z}, n) := \sup \big\{ \operatorname{cor}(\psi, h \circ Z_{k+n}) : k \ge 0, \ \psi \in B_{L_1(\mathcal{A}_0^k, \mu)}, \ h \in B_{\mathcal{C}(Z)} \big\}$   (4)

(ii) the time-reversed C-mixing coefficients by

  $\phi_{\mathcal{C},\mathrm{rev}}(\mathcal{Z}, n) := \sup \big\{ \operatorname{cor}(h \circ Z_k, \varphi) : k \ge 0, \ h \in B_{\mathcal{C}(Z)}, \ \varphi \in B_{L_1(\mathcal{A}_{k+n}^\infty, \mu)} \big\}.$   (5)

Let $(d_n)_{n \ge 0}$ be a strictly positive sequence converging to 0. Then we say that $\mathcal{Z}$ is (time-reversed) C-mixing with rate $(d_n)_{n \ge 0}$ if we have $\phi_{\mathcal{C},(\mathrm{rev})}(\mathcal{Z}, n) \le d_n$ for all $n \ge 0$. Moreover, if $(d_n)_{n \ge 0}$ is of the form

  $d_n := c \exp(-b n^\gamma), \qquad n \ge 1,$   (6)

for some constants $b > 0$, $c \ge 0$, and $\gamma > 0$, then $\mathcal{Z}$ is called geometrically (time-reversed) C-mixing.

Obviously, $\mathcal{Z}$ is C-mixing with rate $(d_n)_{n \ge 0}$ if and only if for all $k, n \ge 0$, all $\psi \in L_1(\mathcal{A}_0^k, \mu)$, and all $h \in \mathcal{C}(Z)$, we have

  $\operatorname{cor}(\psi, h \circ Z_{k+n}) \le \|\psi\|_{L_1(\mu)} \|h\|_{\mathcal{C}} \, d_n,$   (7)

and similarly, time-reversed C-mixing with rate $(d_n)_{n \ge 0}$ if and only if for all $k, n \ge 0$, all $h \in \mathcal{C}(Z)$, and all $\varphi \in L_1(\mathcal{A}_{k+n}^\infty, \mu)$, we have

  $\operatorname{cor}(h \circ Z_k, \varphi) \le \|h\|_{\mathcal{C}} \|\varphi\|_{L_1(\mu)} \, d_n.$   (8)

In the rest of this section we consider examples of (time-reversed) C-mixing processes. To begin with, let us assume that $\mathcal{Z}$ is a stationary φ-mixing process [25] with rate $(d_n)_{n \ge 0}$. By [16, Inequality (1.1)] we then have

  $\operatorname{cor}(\psi, \varphi) \le \|\psi\|_{L_1(\mu)} \|\varphi\|_{L_\infty(\mu)} \, d_n, \qquad n \ge 1,$   (9)

for all $\mathcal{A}_0^k$-measurable $\psi \in L_1(\mu)$ and all $\mathcal{A}_{k+n}^\infty$-measurable $\varphi \in L_\infty(\mu)$. By taking $\|\cdot\|_{\mathcal{C}} := \|\cdot\|_\infty$ and $\varphi := h \circ Z_{k+n}$, we then see that (7) is satisfied, i.e. $\mathcal{Z}$ is C-mixing with rate $(d_n)_{n \ge 0}$. Finally, by similar arguments we can deduce that time-reversed φ-mixing processes [12, Section 3.13] are also time-reversed C-mixing with the same rate. In other words, we have found
  $\phi_{L_\infty(\mu)}(\mathcal{Z}, n) = \phi(\mathcal{Z}, n) \quad \text{and} \quad \phi_{L_\infty(\mu),\mathrm{rev}}(\mathcal{Z}, n) = \phi_{\mathrm{rev}}(\mathcal{Z}, n).$

To deal with processes that are not α-mixing [35], Rio [34] introduced the following relaxation of the φ-mixing coefficients,

  $\tilde\phi(\mathcal{Z}, n) := \sup_{k \ge 0, \, f \in BV_1} \big\| \mathbb{E}\big(f(Z_{k+n}) \mid \mathcal{A}_0^k\big) - \mathbb{E} f(Z_{k+n}) \big\|_\infty$   (10)
  $\phantom{\tilde\phi(\mathcal{Z}, n) :} = \sup \big\{ \operatorname{cor}(\psi, h \circ Z_{k+n}) : k \ge 0, \ \psi \in B_{L_1(\mathcal{A}_0^k, \mu)}, \ h \in B_{BV(Z)} \big\},$

and an analogous time-reversed coefficient

  $\tilde\phi_{\mathrm{rev}}(\mathcal{Z}, n) := \sup_{k \ge 0, \, f \in BV_1} \big\| \mathbb{E}\big(f(Z_k) \mid \mathcal{A}_{k+n}^\infty\big) - \mathbb{E} f(Z_k) \big\|_\infty$
  $\phantom{\tilde\phi_{\mathrm{rev}}(\mathcal{Z}, n) :} = \sup \big\{ \operatorname{cor}(h \circ Z_k, \varphi) : k \ge 0, \ \varphi \in B_{L_1(\mathcal{A}_{k+n}^\infty, \mu)}, \ h \in B_{BV(Z)} \big\},$

where the two identities follow from [18, Lemma 4]. In other words, we have

  $\phi_{BV(Z)}(\mathcal{Z}, n) = \tilde\phi(\mathcal{Z}, n) \quad \text{and} \quad \phi_{BV(Z),\mathrm{rev}}(\mathcal{Z}, n) = \tilde\phi_{\mathrm{rev}}(\mathcal{Z}, n).$

Moreover, [17, p. 41] shows that some uniformly expanding maps are φ̃-mixing but not α-mixing. Figure 1 summarizes the relations between φ-, φ̃-, and C-mixing.

Figure 1: Relationship between φ-, φ̃-, and C-mixing processes.

Our next goal is to relate C-mixing to some well-known results on the decay of correlations for dynamical systems. To this end, recall that $(\Omega, \mathcal{A}, \mu, T)$ is a dynamical system if $T : \Omega \to \Omega$ is a measurable map satisfying $\mu(T^{-1}(A)) = \mu(A)$ for all $A \in \mathcal{A}$. Let us consider the stationary stochastic process $\mathcal{Z} := (Z_n)_{n \ge 0}$ defined by $Z_n := T^n$ for $n \ge 0$. Since $\mathcal{A}_{n+1} \subset \mathcal{A}_n$ for all $n \ge 0$, we conclude that $\mathcal{A}_{k+n}^\infty = \mathcal{A}_{k+n}$. Consequently, $\varphi$ is $\mathcal{A}_{k+n}^\infty$-measurable if and only if it is $\mathcal{A}_{k+n}$-measurable. Moreover, $\mathcal{A}_{k+n}$ is the σ-algebra generated by $T^{k+n}$, and hence $\varphi$ is $\mathcal{A}_{k+n}$-measurable if and only if it is of the form $\varphi = g \circ T^{k+n}$ for some suitable, measurable $g : \Omega \to \mathbb{R}$. Let us now suppose that $\|\cdot\|_{\mathcal{C}(\Omega)}$ is defined by (1) for some semi-norm $\|\cdot\|$. For $h \in \mathcal{C}(\Omega)$ we then find

  $\operatorname{cor}(h \circ Z_k, \varphi) = \operatorname{cor}(h \circ Z_k, g \circ Z_{k+n}) = \operatorname{cor}(h, g \circ Z_n)$
  $\phantom{\operatorname{cor}(h \circ Z_k, \varphi)} = \int_\Omega h \cdot (g \circ T^n) \, d\mu - \int_\Omega h \, d\mu \cdot \int_\Omega g \, d\mu =: \operatorname{cor}_{T,n}(h, g).$

The next result shows that $\mathcal{Z}$ is time-reversed C-mixing even if we only have generic constants $C(h,g)$ in (8).

Theorem 2.6.
Let $(\Omega, \mathcal{A}, \mu, T)$ be a dynamical system and the stochastic process $\mathcal{Z} := (Z_n)_{n \ge 0}$ be defined by $Z_n := T^n$ for $n \ge 0$. Moreover, let $\|\cdot\|_{\mathcal{C}}$ be defined by (1) for some semi-norm $\|\cdot\|$. Then $\mathcal{Z}$ is time-reversed C-mixing with rate $(d_n)_{n \ge 0}$ if and only if for all $h \in \mathcal{C}(\Omega)$ and all $g \in L_1(\mu)$ there exists a constant $C(h,g)$ such that

  $\operatorname{cor}_{T,n}(h, g) \le C(h,g) \, d_n, \qquad n \ge 0.$

Thus, we see that $\mathcal{Z}$ is time-reversed C-mixing if $\operatorname{cor}_{T,n}(h,g)$ converges to zero for all $h \in \mathcal{C}(\Omega)$ and $g \in L_1(\mu)$ with a rate that is independent of $h$ and $g$.

For concrete examples, let us first mention that [31] presents some discrete dynamical systems that are time-reversed geometrically C-mixing, such as Lasota-Yorke maps, uni-modal maps, and piecewise expanding maps in higher dimensions. Here, the involved spaces are either $BV(Z)$ or $\operatorname{Lip}(Z)$.

In dynamical systems where chaos is weak, correlations often decay polynomially, i.e. the correlations satisfy

  $|\operatorname{cor}_{T,n}(h, g)| \le C(h,g) \cdot n^{-b}, \qquad n \ge 0,$   (11)

for some constants $b > 0$ and $C(h,g) \ge 0$ depending on the functions $h$ and $g$. Young [49] developed a powerful method for studying correlations in systems with weak chaos where correlations decay at a polynomial rate for bounded $g$ and Hölder continuous $h$. Her method was applied to billiards with slow mixing rates, such as Bunimovich billiards, see [6, Theorem 3.5]. For example, modulo some logarithmic factors, [30, 14] obtained (11) with $b = 1$ and $b = 2$ for certain forms of Bunimovich billiards and Hölder continuous $h$ and $g$. Besides these results, Baladi [5] also compiles a list of "parabolic" or "intermittent" systems having a polynomial decay.

It is well known that, if the functions $h$ and $g$ are sufficiently smooth, there exist dynamical systems where chaos is strong enough such that the correlations decay exponentially fast, that is,

  $|\operatorname{cor}_{T,n}(h, g)| \le C(h,g) \cdot \exp(-b n^\gamma), \qquad n \ge 0,$   (12)

for some constants $b > 0$, $\gamma > 0$, and $C(h,g) \ge 0$ depending on $h$ and $g$.
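The exponential decay (12) is easy to observe numerically. The following sketch (ours, not from the paper) uses the Lebesgue-preserving expanding map $T(x) = 3x \bmod 1$ on $[0,1]$ with $h(x) = g(x) = x$; a direct computation gives $\operatorname{cor}_{T,n}(h,g) = 1/(12 \cdot 3^n)$, i.e. decay as in (12) with $b = \log 3$ and $\gamma = 1$:

```python
import numpy as np

# Sketch (not from the paper): decay of correlations for the expanding map
# T(x) = 3x mod 1, which preserves Lebesgue measure on [0, 1]. For
# h(x) = g(x) = x one computes cor_{T,n}(h, g) = 1 / (12 * 3^n) exactly.

N = 2**22                         # midpoint grid approximating Lebesgue measure
x = (np.arange(N) + 0.5) / N

y = x.copy()                      # y holds T^n applied to the grid
for n in range(5):
    # empirical cor_{T,n}(h, g) = E[h * (g o T^n)] - E[h] * E[g o T^n]
    cor = np.mean(x * y) - np.mean(x) * np.mean(y)
    print(n, cor, 1.0 / (12 * 3**n))   # numeric estimate vs. exact value
    y = (3.0 * y) % 1.0                # one more application of T
```

The quadrature error grows with the number of discontinuities of $x \mapsto 3^n x \bmod 1$, so the grid resolution limits how many steps of the geometric decay are visible.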
Again, Baladi [5] lists some simple examples of dynamical systems enjoying (12) for analytic $h$ and $g$, such as the angle doubling map and Arnold's cat map. Moreover, for continuously differentiable $h$ and $g$, [36, 39] proved (12) for two closely related classes of systems, more precisely, $C^{1+\varepsilon}$ Anosov or Axiom-A diffeomorphisms with Gibbs invariant measures, and topological Markov chains, which are also known as subshifts of finite type, see also [11]. These results were then extended by [24, 37] to expanding interval maps with smooth invariant measures for functions $h$ and $g$ of bounded variation. In the 1990s, similar results for Hölder continuous $h$ and $g$ were proved for systems with somewhat weaker chaotic behavior, characterized by nonuniform hyperbolicity, such as quadratic interval maps, see [48], [27], and the Hénon map [8], and then extended to chaotic systems with singularities by [28] and specifically to Sinai billiards in a torus by [48, 13]. For some of these extensions, such as smooth expanding dynamics, smooth nonuniformly hyperbolic systems, and hyperbolic systems with singularities, we refer to [4] as well. Recently, for $h$ of bounded variation and bounded $g$, [29] obtained (12) for a class of piecewise smooth one-dimensional maps with critical points and singularities. Moreover, [3] deduced (12) for $h, g \in \operatorname{Lip}(Z)$ and a suitable iterate of Poincaré's first return map $T$ of a large class of singular hyperbolic flows.

3 A Bernstein-type inequality

In this section, we present the key result of this work, a Bernstein-type inequality for stationary geometrically (time-reversed) C-mixing processes.

Theorem 3.1. Let $\mathcal{Z} := (Z_n)_{n \ge 0}$ be a $Z$-valued stationary geometrically (time-reversed) C-mixing process on $(\Omega, \mathcal{A}, \mu)$ with rate $(d_n)_{n \ge 0}$ as in (6), let $\|\cdot\|_{\mathcal{C}}$ be defined by (1) for some semi-norm $\|\cdot\|$ satisfying (3), and let $P := \mu_{Z_0}$. Moreover, let $h \in \mathcal{C}(Z)$ with $\mathbb{E}_P h = 0$, and assume that there exist some $A > 0$, $B > 0$, and $\sigma \ge 0$ such that $\|h\| \le A$, $\|h\|_\infty \le B$, and $\mathbb{E}_P h^2 \le \sigma^2$.
Then, for all $\varepsilon > 0$ and all

  $n \ge n_0 := \max\Big\{ \min\big\{ m \ge 3 : m^2 \ge \tfrac{808 c (3A+B)}{B} \text{ and } \tfrac{m}{(\log m)^{2/\gamma}} \ge 4 \big\}, \ e^{3/b} \Big\},$   (13)

we have

  $\mu\Big( \Big\{ \omega \in \Omega : \tfrac{1}{n} \sum_{i=1}^n h \circ Z_i \ge \varepsilon \Big\} \Big) \le 2 \exp\Big( - \dfrac{n \varepsilon^2}{8 (\log n)^{2/\gamma} (\sigma^2 + \varepsilon B/3)} \Big),$   (14)

or alternatively, for all $n \ge n_0$ and $\tau > 0$, we have

  $\mu\Big( \Big\{ \omega \in \Omega : \tfrac{1}{n} \sum_{i=1}^n h(Z_i(\omega)) \ge \sqrt{\dfrac{8 (\log n)^{2/\gamma} \sigma^2 \tau}{n}} + \dfrac{8 (\log n)^{2/\gamma} B \tau}{3n} \Big\} \Big) \le 2 e^{-\tau}.$   (15)

Note that besides the additional logarithmic factor $4 (\log n)^{2/\gamma}$ and the constant 2 in front of the exponential, (14) coincides with Bernstein's classical inequality for i.i.d. processes.

In the remainder of this section, we compare Theorem 3.1 with some other concentration inequalities for non-i.i.d. processes $\mathcal{Z}$. Here, $\mathcal{Z}$ is real-valued and $h$ is the identity map if not specified otherwise.

Example 3.2. Theorem 2.3 in [4] shows that smooth expanding systems on $[0,1]$ have exponential decay of correlations (7). Moreover, if, for such expanding systems, the transformation $T$ is Lipschitz continuous and satisfies the conditions at the end of Section 4 in [18], and the ergodic measure $\mu$ satisfies [18, condition (4.8)], then [18, Theorem 2] shows that for all $\varepsilon \ge 0$ and $n \ge 1$, the left-hand side of (14) is bounded by

  $\exp\Big( -\dfrac{\varepsilon^2 n}{C} \Big),$

where $C$ is some constant independent of $n$. The same result has been proved in [15, Theorem III.1] as well. Obviously, this is a Hoeffding-type bound instead of a Bernstein-type one. Hence, it is always larger than ours if the denominator of the exponent in (14) is smaller than $C$.

Example 3.3. For dynamical systems with exponentially decreasing φ̃-coefficients, see [47, condition (3.1)], [47, Theorem 3.1] provides a Bernstein-type inequality for 1-Lipschitz functions $h : Z \to [-1/2, 1/2]$ w.r.t. some metric $d$ on $Z$, in which the left-hand side of (14) is bounded by

  $\exp\Big( -\dfrac{C \varepsilon^2 n}{\sigma^2 + \varepsilon \log f(n)} \Big)$   (16)

for some constant $C$ independent of $n$ and $f(n)$ being some function monotonically increasing in $n$. Note that modulo the logarithmic factor $\log f(n)$, the bound (16) is the same as the one for i.i.d. processes. Moreover, if $f(n)$ grows polynomially, cf.
[47, Section 3.3], then (16) has the same asymptotic behaviour as our bound. However, geometrically C-mixing is weaker than Condition (3.1) in [47]: Indeed, the required exponential form of Condition (3.1) in [47], i.e.

  $\sup_{k \ge 0} \tilde\phi\big(\mathcal{A}_0^k, Z_{k+n}^{k+2n-1}\big) := \sup_{k \ge 0} \sup_{f \in \mathcal{F}_n} \big\| \mathbb{E}\big(f(Z_{k+n}^{k+2n-1}) \mid \mathcal{A}_0^k\big) - \mathbb{E} f(Z_{k+n}^{k+2n-1}) \big\|_\infty \le c \cdot e^{-bn}$

for some $c, b > 0$ and all $n \ge 1$, where $Z_{k+n}^{k+2n-1} := (Z_{k+n}, \ldots, Z_{k+2n-1})$ and $\mathcal{F}_n$ is the set of 1-Lipschitz functions $f : Z^n \to [-\tfrac12, \tfrac12]$ w.r.t. the metric $d_n(x,y) := \tfrac{1}{n} \sum_{i=1}^n d(x_i, y_i)$, implies

  $\sup_{k \ge 0} \sup_{f \in \mathcal{F}} \big\| \mathbb{E}\big(f(Z_{k+n}) \mid \mathcal{A}_0^k\big) - \mathbb{E} f(Z_{k+n}) \big\|_\infty \le c \cdot n e^{-bn} \le \tilde c \cdot e^{-\tilde b n}$

for some $\tilde c, \tilde b > 0$ and all $n \ge 1$, where $\mathcal{F}$ is the set of 1-Lipschitz functions $f : Z \to [-\tfrac12, \tfrac12]$ w.r.t. the metric $d$. In other words, processes satisfying Condition (3.1) in [47] are φ̃-mixing, see (10), which is stronger than geometrically C-mixing, see again Figure 1. Moreover, our result holds for all $\gamma > 0$, while [47] only considers the case $\gamma = 1$.

Example 3.4. For an α-mixing sequence of centered and bounded random variables satisfying $\alpha(n) \le c \exp(-b n^\gamma)$ for some constants $b > 0$, $c \ge 0$, and $\gamma > 0$, [33, Theorem 4.3] bounds the left-hand side of (14) by

  $(1 + 4e^{-2}c) \exp\Big( -\dfrac{3 \varepsilon^2 n^{(\gamma)}}{6\sigma^2 + 2\varepsilon B} \Big) \quad \text{with } n^{(\gamma)} \asymp n^{\frac{\gamma}{\gamma+1}}$   (17)

for all $n \ge 1$ and all $\varepsilon > 0$. In general, this bound and our result are not comparable, since not every α-mixing process satisfies (7) and, conversely, not every process satisfying (7) is necessarily α-mixing, see Figure 2. Nevertheless, for φ-mixing processes, it is easily seen that this bound is always worse than ours for a fixed $\gamma > 0$ if $n$ is large enough.

Figure 2: Relationship between α-, φ-, and C-mixing processes.

Example 3.5. For an α-mixing stationary sequence of centered and bounded random variables satisfying $\alpha(n) \le \exp(-2cn)$ for some $c > 0$, [32, Theorem 2] bounds the left-hand side of (14) by

  $\exp\Big( -\dfrac{C \varepsilon^2 n}{v^2 + \varepsilon B (\log n)^2 + n^{-1} B^2} \Big),$   (18)

where $C > 0$ is some constant and

  $v^2 := \sigma^2 + 2 \sum_{2 \le i \le n} |\operatorname{cov}(X_1, X_i)|.$   (19)
By applying the covariance inequality for α-mixing processes, see [16, the corollary to Lemma 2.1], we obtain $v^2 \le C_\delta \|X_1\|_{2+\delta}^2$ for an arbitrary $\delta > 0$ and a constant $C_\delta$ only depending on $\delta$. If the additional $\delta > 0$ is ignored, (18) therefore has the same asymptotic behavior as our bound. In general, however, the additional $\delta$ does influence the asymptotic behavior. For example, the oracle inequality we obtain in the next section would be slower by a factor of $n^\xi$, where $\xi > 0$ is arbitrary, if we used (18) instead. Finally, note that in general the bound (18) and ours are not comparable, see again Figure 2.

In particular, Inequality (18) can be applied to geometrically φ-mixing processes with $\gamma = 1$. By using the covariance inequality (1.1) for φ-mixing processes in [16], we can bound $v^2$ defined as in (19) by $C\sigma^2$ with some constant $C$ independent of $n$. Modulo the term $n^{-1}B^2$ in the denominator, the bound (18) coincides with ours for geometrically φ-mixing processes with $\gamma = 1$. However, our bound also holds for such processes with $\gamma \in (0,1)$.

Example 3.6. For stationary, geometrically α-mixing Markov chains with centered and bounded random variables, [1] bounds the left-hand side of (14) by

  $\exp\Big( -\dfrac{n \varepsilon^2}{\tilde\sigma^2 + \varepsilon B \log n} \Big),$   (20)

where $\tilde\sigma^2 = \lim_{n \to \infty} \tfrac{1}{n} \operatorname{Var} \sum_{i=1}^n X_i$. By a similar argument as in Example 3.5 we obtain

  $\operatorname{Var} \sum_{i=1}^n X_i = n\sigma^2 + 2 \sum_{1 \le i < j \le n} |\operatorname{cov}(X_i, X_j)| \le n\sigma^2 + \tilde C_\delta \, n \|X_1\|_{2+\delta}^2$

for an arbitrary $\delta > 0$ and a constant $\tilde C_\delta$ depending only on $\delta$. Consequently, we conclude that modulo some arbitrarily small number $\delta > 0$ and the logarithmic factor $\log n$ instead of $(\log n)^2$, the bound (20) coincides with ours. Again, this bound and our result are not comparable, see Figure 2.

Example 3.7. For stationary, weakly dependent processes of centered and bounded random variables with $|\operatorname{cov}(X_1, X_n)| \le c \cdot \exp(-bn)$ for some $c, b > 0$ and all $n \ge 1$, [26, Theorem 2.1] bounds the left-hand side of (14) by

  $\exp\Big( -\dfrac{\varepsilon^2 n}{C_1 + C_2 \, \varepsilon^{5/3} n^{2/3}} \Big),$   (21)

where $C_1$ is some constant depending on $c$ and $b$, and $C_2$ is some constant depending on $c$, $b$, and $B$. Note that the denominator in (21) is at least $C_1$, and therefore the bound (21) is more of Hoeffding type.
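To get a feel for what the logarithmic factor in (14) costs in practice, the following sketch (ours, not from the paper) evaluates the right-hand side of (14) for a few values of $\gamma$ against the classical i.i.d. Bernstein bound; all parameter values are arbitrary choices for illustration:

```python
import math

# Sketch (not from the paper): right-hand side of the Bernstein-type
# bound (14) for geometrically C-mixing data vs. the classical i.i.d.
# Bernstein bound, for illustrative parameter choices.
def bound_14(n, eps, sigma2, B, gamma):
    log_factor = math.log(n) ** (2.0 / gamma)
    return 2.0 * math.exp(-n * eps**2 / (8.0 * log_factor * (sigma2 + eps * B / 3)))

def bernstein_iid(n, eps, sigma2, B):
    return math.exp(-n * eps**2 / (2.0 * (sigma2 + eps * B / 3)))

n, eps, sigma2, B = 10_000, 0.05, 0.1, 1.0
for gamma in (0.5, 1.0, 2.0):
    # small gamma inflates (log n)^{2/gamma}, and (14) can even be vacuous (> 1)
    print(gamma, bound_14(n, eps, sigma2, B, gamma))
print(bernstein_iid(n, eps, sigma2, B))
```

The smaller $\gamma$ is, i.e. the slower the geometric mixing, the larger $(\log n)^{2/\gamma}$ becomes and the longer it takes for (14) to become non-trivial.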
4 Applications to Statistical Learning

In this section, we apply the Bernstein-type inequality from the last section to deduce oracle inequalities for some widely used learning methods and observations generated by geometrically C-mixing processes. More precisely, in Subsection 4.1, we recall some basic concepts of statistical learning and formulate an oracle inequality for learning methods that are based on (regularized) empirical risk minimization. Then, in Subsection 4.2, we illustrate this oracle inequality by deriving learning rates for SVMs. Finally, in Subsection 4.3, we present an oracle inequality for forecasting of dynamical systems.

4.1 Oracle inequality for CR-ERMs

In this section, let $X$ always be a measurable space if not mentioned otherwise and $Y \subset \mathbb{R}$ always be a closed subset. Recall that in (supervised) statistical learning, our aim is to find a function $f : X \to \mathbb{R}$ such that for $(x,y) \in X \times Y$ the value $f(x)$ is a good prediction of $y$ at $x$. To evaluate the quality of such functions $f$, we need a measurable loss function $L : X \times Y \times \mathbb{R} \to [0, \infty)$. Following [42, Definition 2.22], we say that a loss $L$ can be clipped at $M > 0$ if, for all $(x, y, t) \in X \times Y \times \mathbb{R}$, we have

  $L(x, y, \hat t\,) \le L(x, y, t),$   (22)

where $\hat t$ denotes the clipped value of $t$ at $\pm M$, that is, $\hat t := t$ if $t \in [-M, M]$, $\hat t := -M$ if $t < -M$, and $\hat t := M$ if $t > M$. Various often-used loss functions can be clipped. For example, if $Y := \{-1, 1\}$ and $L$ is a convex, margin-based loss represented by $\varphi : \mathbb{R} \to [0, \infty)$, that is $L(y,t) = \varphi(yt)$ for all $y \in Y$ and $t \in \mathbb{R}$, then $L$ can be clipped if and only if $\varphi$ has a global minimum, see [42, Lemma 2.23]. In particular, the hinge loss, the least squares loss for classification, and the squared hinge loss can be clipped, but the logistic loss for classification and the AdaBoost loss cannot be clipped. Moreover, if $Y := [-M, M]$ and $L$ is a convex, distance-based loss represented by some $\psi : \mathbb{R} \to [0, \infty)$, that is $L(y,t) = \psi(y - t)$ for all $y \in Y$ and $t \in \mathbb{R}$, then $L$ can be clipped whenever $\psi(0) = 0$, see again [42, Lemma 2.23]. In
In ∈ ∈ particular, theleastsquares loss L(y,t) =(y t)2 (23) − andtheτ-pinballloss (1 τ)(y t), ify t < 0 L (y,t) := ψ(y t)= − − − − (24) τ − (τ(y t), ify t 0 − − ≥ usedforquantile regression canbeclipped, ifthespaceoflabelsY isbounded. Nowwesummarizeassumptions onthelossfunctionLthatwillbeusedthroughout thiswork. Assumption4.1. ThelossfunctionL : X Y R [0, )canbeclippedatsomeM > 0. Moreover, × × → ∞ itisbothbounded inthesenseofL(x,y,t) 1andlocallyLipschitzcontinuous, thatis, ≤ L(x,y,t) L(x,y,t′) t t′ . (25) | − | ≤ | − | Here both inequalites are supposed to hold for all (x,y) X Y and t,t′ [ M,M]. Note that the ∈ × ∈ − formerassumption cantypically beenforced byscaling. Given a loss function L and an f : X R, we often use the notation L f for the function → ◦ (x,y) L(x,y,f(x)). Our major goal is to have a small average loss for future unseen observations 7→ (x,y). Thisleadstothefollowingdefinition, seealso[42,Definitions2.2&2.3]. 9 Definition 4.2. Let L : X Y R [0, ) be a loss function and P be a probability measure on X Y. Then,forameasur×ablef×unctio→n f :∞X RtheL-riskisdefinedby × → (f):= L(x,y,f(x))dP(x,y). L,P R Z X×Y Moreover, theminimalL-risk ∗ := inf (f)f :X Rmeasurable RL,P {RL,P | → } is called the Bayes risk with respect to P and L. In addition, a measurable function f∗ : X R L,P → satisfying (f∗ )= ∗ iscalledaBayesdecisionfunction. RL,P L,P RL,P Informally, thegoal oflearning from atraining setD (X Y)n istofindadecision function f D ∈ × suchthat (f )isclosetotheminimalrisk ∗ . Ournextgoalistoformalizethisidea. Webegin RL,P D RL,P withthefollowingdefinition. Definition 4.3. Let X be a set and Y R be a closed subset. A learning method on X Y maps ⊂ L × everysetD (X Y)n,n 1,toafunction f : X R. D ∈ × ≥ → Let us now describe the learning algorithms we are interested in. To this end, we assume that we haveahypothesis set consisting ofbounded measurable functions f : X R,whichispre-compact F → with respect to the supremum norm . 
Since $\mathcal{F}$ can be infinite, we need to recall the following classical concept, which will enable us to approximate the infinite $\mathcal{F}$ by finite subsets.

Definition 4.4. Let $(T, d)$ be a metric space and $\varepsilon > 0$. We call $S \subset T$ an ε-net of $T$ if for all $t \in T$ there exists an $s \in S$ with $d(s,t) \le \varepsilon$. Moreover, the ε-covering number of $T$ is defined by

  $\mathcal{N}(T, d, \varepsilon) := \inf \Big\{ n \ge 1 : \exists \, s_1, \ldots, s_n \in T \text{ such that } T \subset \bigcup_{i=1}^n B_d(s_i, \varepsilon) \Big\},$

where $\inf \emptyset := \infty$ and $B_d(s, \varepsilon) := \{ t \in T : d(t,s) \le \varepsilon \}$ denotes the closed ball with center $s \in T$ and radius $\varepsilon$.

Note that our hypothesis set $\mathcal{F}$ is assumed to be pre-compact, and hence for all $\varepsilon > 0$, the covering number $\mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$ is finite.

In order to introduce our generic learning algorithms, we write

  $D := \big( (X_1, Y_1), \ldots, (X_n, Y_n) \big) := (Z_1, \ldots, Z_n) \in (X \times Y)^n$

for a training set of length $n$ that is distributed according to the first $n$ components of the $X \times Y$-valued process $\mathcal{Z} = (Z_i)_{i \ge 1}$. Furthermore, we write $D_n := \tfrac{1}{n} \sum_{i=1}^n \delta_{(X_i, Y_i)}$, where $\delta_{(X_i, Y_i)}$ denotes the (random) Dirac measure at $(X_i, Y_i)$. In other words, $D_n$ is the empirical measure associated to the data set $D$. Finally, the risk of a function $f : X \to \mathbb{R}$ with respect to this measure,

  $\mathcal{R}_{L,D_n}(f) = \frac{1}{n} \sum_{i=1}^n L(X_i, Y_i, f(X_i)),$

is called the empirical $L$-risk.

With these preparations we can now introduce the class of learning methods we are interested in, see also [42, Definition 7.18].
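The clipping operation (22) and the losses (23) and (24) are straightforward to implement. The following sketch (ours, not from the paper) checks on a few sample points that clipping a prediction to $[-M, M]$ never increases the loss when the labels lie in $[-M, M]$:

```python
# Sketch (not from the paper): the clipping operation used in (22) and the
# least squares and pinball losses (23) and (24); when labels lie in
# [-M, M], clipping a prediction to [-M, M] never increases the loss.
def clip(t, M):
    return max(-M, min(M, t))

def least_squares(y, t):
    return (y - t) ** 2                         # loss (23)

def pinball(y, t, tau):
    r = y - t
    return tau * r if r >= 0 else (tau - 1) * r  # loss (24)

M = 1.0
for y in (-0.8, 0.0, 0.7):                      # labels in [-M, M]
    for t in (-3.0, -0.5, 0.2, 2.5):            # predictions, some outside
        tc = clip(t, M)
        assert least_squares(y, tc) <= least_squares(y, t)
        assert pinball(y, tc, 0.3) <= pinball(y, tc if tc == t else t, 0.3)
print("clipping never increases the loss")
```

This monotonicity is exactly property (22), and it is the reason clipped predictors can be analyzed on the bounded interval $[-M, M]$ only.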
