A Bernstein-type Inequality for Some Mixing Processes and Dynamical Systems with an Application to Learning

H. Hang and I. Steinwart
Institute for Stochastics and Applications
University of Stuttgart
D-70569 Stuttgart
{hanghn, ingo.steinwart}@mathematik.uni-stuttgart.de

January 14, 2015
Abstract

We establish a Bernstein-type inequality for a class of stochastic processes that includes the classical geometrically φ-mixing processes, Rio's generalization of these processes, as well as many time-discrete dynamical systems. Modulo a logarithmic factor and some constants, our Bernstein-type inequality coincides with the classical Bernstein inequality for i.i.d. data. We further use this new Bernstein-type inequality to derive an oracle inequality for generic regularized empirical risk minimization algorithms and data generated by such processes. Applying this oracle inequality to support vector machines using Gaussian kernels for both least squares and quantile regression, it turns out that the resulting learning rates match, up to some arbitrarily small extra term in the exponent, the optimal rates for i.i.d. processes.
1 Introduction

Concentration inequalities such as Hoeffding's inequality, Bernstein's inequality, McDiarmid's inequality, and Talagrand's inequality play an important role in many areas of probability. For example, the analysis of various methods from non-parametric statistics and machine learning crucially depends on these inequalities, see e.g. [19, 20, 22, 42]. Here, stronger results can typically be achieved by Bernstein's inequality and/or Talagrand's inequality, since these inequalities allow for localization due to their specific dependence on the variance. In particular, most derivations of minimax optimal learning rates are based on one of these inequalities.
The concentration inequalities mentioned above all assume the data to be generated by an i.i.d. process. Unfortunately, however, this assumption is often violated in several important areas of application including financial prediction, signal processing, system observation and diagnosis, text and speech recognition, and time series forecasting. For this and other reasons there has been some effort to establish concentration inequalities for non-i.i.d. processes, too. For example, generalizations of Bernstein's inequality to α-mixing and φ-mixing processes have been found in [10, 33, 32] and [38], respectively. Among many other applications, the Bernstein-type inequality established in [10] was used in [50] to obtain convergence rates for sieve estimates from α-mixing strictly stationary processes in the special case of neural networks. Furthermore, [23] applied the Bernstein-type inequality in [33] to derive an oracle inequality for generic regularized empirical risk minimization algorithms learning from stationary α-mixing processes. Moreover, by employing the Bernstein-type inequality in [32], [7] derived almost sure uniform rates of convergence for the estimated Lévy density both in mixed-frequency and low-frequency setups and proved that these rates are optimal in the minimax sense. Finally, in the particular case of the least squares loss, [2] obtained the optimal learning rate for φ-mixing processes by applying the Bernstein-type inequality established in [38].
However, there exist many dynamical systems, such as the uniformly expanding maps given in [17, p. 41], that are not α-mixing. To deal with such non-mixing processes, Rio [34] introduced so-called $\tilde\phi$-mixing coefficients, which extend the classical φ-mixing coefficients. For dynamical systems with exponentially decreasing, modified $\tilde\phi$-coefficients, [47] derived a Bernstein-type inequality which turns out to be the same as the one for i.i.d. processes modulo some logarithmic factor. However, this modification seems to be significantly stronger than Rio's original $\tilde\phi$-mixing, so it remains unclear when the Bernstein-type inequality in [47] is applicable. In addition, the $\tilde\phi$-mixing concept is still not large enough to cover many commonly considered dynamical systems. To include such dynamical systems, [31] proposed the $\mathcal{C}$-mixing coefficients, which further generalize the $\tilde\phi$-mixing coefficients.
In this work, we establish a Bernstein-type inequality for geometrically $\mathcal{C}$-mixing processes which, modulo a logarithmic factor and some constants, coincides with the classical one for i.i.d. processes. Using the techniques developed in [23], we then derive an oracle inequality for generic regularized empirical risk minimization and $\mathcal{C}$-mixing processes. We further apply this oracle inequality to a state-of-the-art learning method, namely support vector machines (SVMs) with Gaussian kernels. Here it turns out that for both least squares and quantile regression, we can recover the (essentially) optimal rates recently found for the i.i.d. case, see [21], when the data is generated by a geometrically $\mathcal{C}$-mixing process. Finally, we establish an oracle inequality for the problem of forecasting an unknown dynamical system. This oracle inequality will make it possible to extend the purely asymptotic analysis in [41] to learning rates.
The rest of this work is organized as follows: In Section 2, we recall the notion of (time-reversed) $\mathcal{C}$-mixing processes. We further illustrate this class of processes by some examples and discuss the relation between $\mathcal{C}$-mixing and other notions of mixing. As the main result of this work, a Bernstein-type inequality for geometrically (time-reversed) $\mathcal{C}$-mixing processes will be formulated in Section 3. There, we also compare our new Bernstein-type inequality to previously established concentration inequalities. As an application of our Bernstein-type inequality, we will derive the oracle inequality for regularized risk minimization schemes in Section 4. We additionally derive learning rates for SVMs and an oracle inequality for forecasting certain dynamical systems. All proofs can be found in the last section.
2 $\mathcal{C}$-mixing processes
In this section we recall two classes of stationary stochastic processes called (time-reversed) $\mathcal{C}$-mixing processes that have a certain decay of correlations for suitable pairs of functions. We also present some examples of such processes including certain dynamical systems.

Let us begin by introducing some notation. In the following, $(\Omega, \mathcal{A}, \mu)$ always denotes a probability space. As usual, we write $L_p(\mu)$ for the space of (equivalence classes of) measurable functions $f : \Omega \to \mathbb{R}$ with finite $L_p$-norm $\|f\|_p$. It is well-known that $L_p(\mu)$ together with $\|\cdot\|_p$ forms a Banach space. Moreover, if $\mathcal{A}' \subset \mathcal{A}$ is a sub-$\sigma$-algebra, then $L_1(\mathcal{A}', \mu)$ denotes the space of all $\mathcal{A}'$-measurable functions $f \in L_1(\mu)$. In the following, for a Banach space $E$, we write $B_E$ for its closed unit ball.
Given a semi-norm $\|\cdot\|$ on a vector space $E$ of bounded measurable functions $f : Z \to \mathbb{R}$, we define the $\mathcal{C}$-norm by
$$\|f\|_{\mathcal{C}} := \|f\|_\infty + \|f\| \tag{1}$$
and denote the space of all bounded $\mathcal{C}$-functions by
$$\mathcal{C}(Z) := \bigl\{ f : Z \to \mathbb{R} \bigm| \|f\|_{\mathcal{C}} < \infty \bigr\}. \tag{2}$$
Throughout this work, we only consider semi-norms $\|\cdot\|$ in (1) that satisfy the inequality
$$\bigl\| e^f \bigr\| \le \bigl\| e^f \bigr\|_\infty \|f\| \tag{3}$$
for all $f \in \mathcal{C}(Z)$. We are mostly interested in the following examples of semi-norms satisfying (3).
Example 2.1. Let $Z$ be an arbitrary set and suppose that we have $\|f\| = 0$ for all $f : Z \to \mathbb{R}$. Then it is easy to see that $\|e^f\| = \|f\| = 0$. Hence, (3) is satisfied.
Example 2.2. Let $Z \subset \mathbb{R}$ be an interval. A function $f : Z \to \mathbb{R}$ is said to have bounded variation on $Z$ if its total variation $\|f\|_{BV(Z)}$ is bounded. Denote by $BV(Z)$ the set of all functions of bounded variation. It is well-known that $BV(Z)$ together with $\|f\|_\infty + \|f\|_{BV(Z)}$ forms a Banach space. Moreover, we have (3), i.e. we have for all $f \in \mathcal{C}(Z)$:
$$\bigl\| e^f \bigr\|_{BV(Z)} \le \bigl\| e^f \bigr\|_\infty \|f\|_{BV(Z)}.$$
Example 2.3. Let $Z$ be a subset of $\mathbb{R}^d$ and $C_b(Z)$ be the set of bounded continuous functions on $Z$. For $f \in C_b(Z)$ and $0 < \alpha \le 1$ let
$$\|f\| := |f|_\alpha := \sup_{z \neq z'} \frac{|f(z) - f(z')|}{|z - z'|^\alpha}.$$
Clearly, $f$ is $\alpha$-Hölder continuous if and only if $|f|_\alpha < \infty$. The collection of bounded, $\alpha$-Hölder continuous functions on $Z$ will be denoted by
$$C_{b,\alpha}(Z) := \{ f \in C_b(Z) : |f|_\alpha < \infty \}.$$
Note that, if $Z$ is compact, then $C_{b,\alpha}(Z)$ together with the norm $\|f\|_{C_{b,\alpha}} := \|f\|_\infty + |f|_\alpha$ forms a Banach space. Moreover, the inequality (3) is also valid for $f \in C_{b,\alpha}(Z)$. As usual, we speak of Lipschitz continuous functions if $\alpha = 1$ and write $\mathrm{Lip}(Z) := C_{b,1}(Z)$.
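As a purely illustrative numerical sketch of the Hölder seminorm from Example 2.3 (our own addition; the function and grid-size choices are not from the paper), the following Python snippet estimates $|f|_\alpha$ for $f(z) = \sqrt{z}$ on $Z = [0, 1]$ with $\alpha = 1/2$ by maximizing the difference quotient over a finite grid. Since $|\sqrt{z} - \sqrt{z'}| \le |z - z'|^{1/2}$ with equality approached at $z' = 0$, the estimate should be 1.

```python
import math

def holder_seminorm(f, points, alpha):
    """Estimate |f|_alpha = sup_{z != z'} |f(z) - f(z')| / |z - z'|^alpha on a grid."""
    best = 0.0
    for i, z in enumerate(points):
        for zp in points[i + 1:]:
            ratio = abs(f(z) - f(zp)) / abs(z - zp) ** alpha
            best = max(best, ratio)
    return best

# f(z) = sqrt(z) is 1/2-Hoelder on [0, 1] with |f|_{1/2} = 1,
# the supremum being attained against z' = 0.
grid = [k / 200 for k in range(201)]
est = holder_seminorm(lambda z: math.sqrt(z), grid, 0.5)
print(est)  # -> 1.0
```

The same routine applied with $\alpha = 1$ would return an unbounded-looking estimate near 0, reflecting that $\sqrt{z}$ is not Lipschitz on $[0,1]$.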
Example 2.4. Let $Z \subset \mathbb{R}^d$ be an open subset. For a continuously differentiable function $f : Z \to \mathbb{R}$ we write
$$\|f\| := \sup_{z \in Z} |f'(z)|$$
and $C^1(Z) := \bigl\{ f : Z \to \mathbb{R} \bigm| f \text{ continuously differentiable and } \|f\|_\infty + \|f\| < \infty \bigr\}$. It is well-known that $C^1(Z)$ is a Banach space with respect to the norm $\|\cdot\|_\infty + \|\cdot\|$, and the chain rule gives
$$\bigl\| e^f \bigr\| = \bigl\| (e^f)' \bigr\|_\infty = \bigl\| e^f \cdot f' \bigr\|_\infty \le \bigl\| e^f \bigr\|_\infty \|f'\|_\infty = \bigl\| e^f \bigr\|_\infty \|f\|$$
for all $f \in C^1(Z)$, i.e. (3) is satisfied.
Let us now assume that we also have a measurable space $(Z, \mathcal{B})$ and a measurable map $\chi : \Omega \to Z$. Then $\sigma(\chi)$ denotes the smallest $\sigma$-algebra on $\Omega$ for which $\chi$ is measurable. Moreover, $\mu_\chi$ denotes the $\chi$-image measure of $\mu$, which is defined by $\mu_\chi(B) := \mu(\chi^{-1}(B))$, $B \in \mathcal{B}$.

Let $\mathcal{Z} := (Z_n)_{n \ge 0}$ be a $Z$-valued stochastic process on $(\Omega, \mathcal{A}, \mu)$, and let $\mathcal{A}_0^i$ and $\mathcal{A}_{i+n}^\infty$ be the $\sigma$-algebras generated by $(Z_0, \ldots, Z_i)$ and $(Z_{i+n}, Z_{i+n+1}, \ldots)$, respectively. The process $\mathcal{Z}$ is called stationary if $\mu_{(Z_{i_1+i}, \ldots, Z_{i_n+i})} = \mu_{(Z_{i_1}, \ldots, Z_{i_n})}$ for all $n, i, i_1, \ldots, i_n \ge 1$. In this case, we always write $P := \mu_{Z_0}$. Moreover, to define certain dependency coefficients for $\mathcal{Z}$, we denote, for $\psi, \varphi \in L_1(\mu)$ satisfying $\psi \varphi \in L_1(\mu)$, the correlation of $\psi$ and $\varphi$ by
$$\mathrm{cor}(\psi, \varphi) := \int_\Omega \psi \cdot \varphi \, d\mu - \int_\Omega \psi \, d\mu \cdot \int_\Omega \varphi \, d\mu.$$
Several dependency coefficients for $\mathcal{Z}$ can be expressed by imposing restrictions on $\psi$ and $\varphi$. The following definition, which is taken from [31], introduces the restrictions on $\psi$ and $\varphi$ we consider throughout this work.
Definition 2.5. Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $(Z, \mathcal{B})$ be a measurable space, $\mathcal{Z} := (Z_i)_{i \ge 0}$ be a $Z$-valued, stationary process on $\Omega$, and $\|\cdot\|_{\mathcal{C}}$ be defined by (1) for some semi-norm $\|\cdot\|$. Then, for $n \ge 0$, we define:

(i) the $\mathcal{C}$-mixing coefficients by
$$\phi_{\mathcal{C}}(\mathcal{Z}, n) := \sup\bigl\{ \mathrm{cor}(\psi, h \circ Z_{k+n}) : k \ge 0,\, \psi \in B_{L_1(\mathcal{A}_0^k, \mu)},\, h \in B_{\mathcal{C}(Z)} \bigr\} \tag{4}$$

(ii) the time-reversed $\mathcal{C}$-mixing coefficients by
$$\phi_{\mathcal{C}, \mathrm{rev}}(\mathcal{Z}, n) := \sup\bigl\{ \mathrm{cor}(h \circ Z_k, \varphi) : k \ge 0,\, h \in B_{\mathcal{C}(Z)},\, \varphi \in B_{L_1(\mathcal{A}_{k+n}^\infty, \mu)} \bigr\}. \tag{5}$$

Let $(d_n)_{n \ge 0}$ be a strictly positive sequence converging to 0. Then we say that $\mathcal{Z}$ is (time-reversed) $\mathcal{C}$-mixing with rate $(d_n)_{n \ge 0}$ if we have $\phi_{\mathcal{C}, (\mathrm{rev})}(\mathcal{Z}, n) \le d_n$ for all $n \ge 0$. Moreover, if $(d_n)_{n \ge 0}$ is of the form
$$d_n := c \exp\bigl( -b n^\gamma \bigr), \qquad n \ge 1, \tag{6}$$
for some constants $b > 0$, $c \ge 0$, and $\gamma > 0$, then $\mathcal{Z}$ is called geometrically (time-reversed) $\mathcal{C}$-mixing.
Obviously, $\mathcal{Z}$ is $\mathcal{C}$-mixing with rate $(d_n)_{n \ge 0}$ if and only if for all $k, n \ge 0$, all $\psi \in L_1(\mathcal{A}_0^k, \mu)$, and all $h \in \mathcal{C}(Z)$, we have
$$\mathrm{cor}(\psi, h \circ Z_{k+n}) \le \|\psi\|_{L_1(\mu)} \|h\|_{\mathcal{C}} \, d_n, \tag{7}$$
or, similarly, time-reversed $\mathcal{C}$-mixing with rate $(d_n)_{n \ge 0}$ if and only if for all $k, n \ge 0$, all $h \in \mathcal{C}(Z)$, and all $\varphi \in L_1(\mathcal{A}_{k+n}^\infty, \mu)$, we have
$$\mathrm{cor}(h \circ Z_k, \varphi) \le \|h\|_{\mathcal{C}} \|\varphi\|_{L_1(\mu)} \, d_n. \tag{8}$$
In the rest of this section we consider examples of (time-reversed) $\mathcal{C}$-mixing processes. To begin with, let us assume that $\mathcal{Z}$ is a stationary φ-mixing process [25] with rate $(d_n)_{n \ge 0}$. By [16, Inequality (1.1)] we then have
$$\mathrm{cor}(\psi, \varphi) \le \|\psi\|_{L_1(\mu)} \|\varphi\|_{L_\infty(\mu)} \, d_n, \qquad n \ge 1, \tag{9}$$
for all $\mathcal{A}_0^k$-measurable $\psi \in L_1(\mu)$ and all $\mathcal{A}_{k+n}^\infty$-measurable $\varphi \in L_\infty(\mu)$. By taking $\|\cdot\|_{\mathcal{C}} := \|\cdot\|_\infty$ and $\varphi := h \circ Z_{k+n}$, we then see that (7) is satisfied, i.e. $\mathcal{Z}$ is $\mathcal{C}$-mixing with rate $(d_n)_{n \ge 0}$. Finally, by similar arguments we can deduce that time-reversed φ-mixing processes [12, Section 3.13] are also time-reversed $\mathcal{C}$-mixing with the same rate. In other words, we have found
$$\phi_{L_\infty(\mu)}(\mathcal{Z}, n) = \phi(\mathcal{Z}, n) \quad \text{and} \quad \phi_{L_\infty(\mu), \mathrm{rev}}(\mathcal{Z}, n) = \phi_{\mathrm{rev}}(\mathcal{Z}, n).$$
To deal with processes that are not α-mixing [35], Rio [34] introduced the following relaxation of the φ-mixing coefficients
$$\tilde\phi(\mathcal{Z}, n) := \sup_{k \ge 0,\, f \in BV_1} \bigl\| \mathbb{E}\bigl( f(Z_{k+n}) \,\big|\, \mathcal{A}_0^k \bigr) - \mathbb{E} f(Z_{k+n}) \bigr\|_\infty \tag{10}$$
$$= \sup\bigl\{ \mathrm{cor}(\psi, h \circ Z_{k+n}) : k \ge 0,\, \psi \in B_{L_1(\mathcal{A}_0^k, \mu)},\, h \in B_{BV(Z)} \bigr\}$$
and an analogous time-reversed coefficient
$$\tilde\phi_{\mathrm{rev}}(\mathcal{Z}, n) := \sup_{k \ge 0,\, f \in BV_1} \bigl\| \mathbb{E}\bigl( f(Z_k) \,\big|\, \mathcal{A}_{k+n}^\infty \bigr) - \mathbb{E} f(Z_k) \bigr\|_\infty$$
$$= \sup\bigl\{ \mathrm{cor}(h \circ Z_k, \varphi) : k \ge 0,\, \varphi \in B_{L_1(\mathcal{A}_{k+n}^\infty, \mu)},\, h \in B_{BV(Z)} \bigr\},$$
where the two identities follow from [18, Lemma 4]. In other words, we have
$$\phi_{BV(Z)}(\mathcal{Z}, n) = \tilde\phi(\mathcal{Z}, n) \quad \text{and} \quad \phi_{BV(Z), \mathrm{rev}}(\mathcal{Z}, n) = \tilde\phi_{\mathrm{rev}}(\mathcal{Z}, n).$$
Moreover, [17, p. 41] shows that some uniformly expanding maps are $\tilde\phi$-mixing but not α-mixing. Figure 1 summarizes the relations between φ-, $\tilde\phi$-, and $\mathcal{C}$-mixing.

Figure 1: Relationship between φ-, $\tilde\phi$-, and $\mathcal{C}$-mixing processes
Our next goal is to relate $\mathcal{C}$-mixing to some well-known results on the decay of correlations for dynamical systems. To this end, recall that $(\Omega, \mathcal{A}, \mu, T)$ is a dynamical system if $T : \Omega \to \Omega$ is a measurable map satisfying $\mu(T^{-1}(A)) = \mu(A)$ for all $A \in \mathcal{A}$. Let us consider the stationary stochastic process $\mathcal{Z} := (Z_n)_{n \ge 0}$ defined by $Z_n := T^n$ for $n \ge 0$. Since $\mathcal{A}_{n+1}^{n+1} \subset \mathcal{A}_n^n$ for all $n \ge 0$, we conclude that $\mathcal{A}_{k+n}^\infty = \mathcal{A}_{k+n}^{k+n}$. Consequently, $\varphi$ is $\mathcal{A}_{k+n}^\infty$-measurable if and only if it is $\mathcal{A}_{k+n}^{k+n}$-measurable. Moreover, $\mathcal{A}_{k+n}^{k+n}$ is the $\sigma$-algebra generated by $T^{k+n}$, and hence $\varphi$ is $\mathcal{A}_{k+n}^{k+n}$-measurable if and only if it is of the form $\varphi = g \circ T^{k+n}$ for some suitable measurable $g : \Omega \to \mathbb{R}$. Let us now suppose that $\|\cdot\|_{\mathcal{C}(\Omega)}$ is defined by (1) for some semi-norm $\|\cdot\|$. For $h \in \mathcal{C}(\Omega)$ we then find
$$\mathrm{cor}(h \circ Z_k, \varphi) = \mathrm{cor}(h \circ Z_k, g \circ Z_{k+n}) = \mathrm{cor}(h, g \circ Z_n) = \int_\Omega h \cdot (g \circ T^n) \, d\mu - \int_\Omega h \, d\mu \cdot \int_\Omega g \, d\mu =: \mathrm{cor}_{T,n}(h, g).$$
The next result shows that $\mathcal{Z}$ is time-reversed $\mathcal{C}$-mixing even if we only have generic constants $C(h, g)$ in (8).

Theorem 2.6. Let $(\Omega, \mathcal{A}, \mu, T)$ be a dynamical system and the stochastic process $\mathcal{Z} := (Z_n)_{n \ge 0}$ be defined by $Z_n := T^n$ for $n \ge 0$. Moreover, let $\|\cdot\|_{\mathcal{C}}$ be defined by (1) for some semi-norm $\|\cdot\|$. Then $\mathcal{Z}$ is time-reversed $\mathcal{C}$-mixing with rate $(d_n)_{n \ge 0}$ if and only if for all $h \in \mathcal{C}(\Omega)$ and all $g \in L_1(\mu)$ there exists a constant $C(h, g)$ such that
$$\mathrm{cor}_{T,n}(h, g) \le C(h, g) \, d_n, \qquad n \ge 0.$$

Thus, we see that $\mathcal{Z}$ is time-reversed $\mathcal{C}$-mixing if $\mathrm{cor}_{T,n}(h, g)$ converges to zero for all $h \in \mathcal{C}(\Omega)$ and $g \in L_1(\mu)$ with a rate that is independent of $h$ and $g$.
For concrete examples, let us first mention that [31] presents some discrete dynamical systems that are time-reversed geometrically $\mathcal{C}$-mixing, such as Lasota-Yorke maps, uni-modal maps, and piecewise expanding maps in higher dimensions. Here, the involved spaces are either $BV(Z)$ or $\mathrm{Lip}(Z)$.
In dynamical systems where chaos is weak, correlations often decay polynomially, i.e. the correlations satisfy
$$|\mathrm{cor}_{T,n}(h, g)| \le C(h, g) \cdot n^{-b}, \qquad n \ge 0, \tag{11}$$
for some constants $b > 0$ and $C(h, g) \ge 0$ depending on the functions $h$ and $g$. Young [49] developed a powerful method for studying correlations in systems with weak chaos where correlations decay at a polynomial rate for bounded $g$ and Hölder continuous $h$. Her method was applied to billiards with slow mixing rates, such as Bunimovich billiards, see [6, Theorem 3.5]. For example, modulo some logarithmic factors, [30, 14] obtained (11) with $b = 1$ and $b = 2$ for certain forms of Bunimovich billiards and Hölder continuous $h$ and $g$. Besides these results, Baladi [5] also compiles a list of "parabolic" or "intermittent" systems having a polynomial decay.
It is well-known that, if the functions $h$ and $g$ are sufficiently smooth, there exist dynamical systems where chaos is strong enough such that the correlations decay exponentially fast, that is,
$$|\mathrm{cor}_{T,n}(h, g)| \le C(h, g) \cdot \exp\bigl( -b n^\gamma \bigr), \qquad n \ge 0, \tag{12}$$
for some constants $b > 0$, $\gamma > 0$, and $C(h, g) \ge 0$ depending on $h$ and $g$. Again, Baladi [5] has listed some simple examples of dynamical systems enjoying (12) for analytic $h$ and $g$, such as the angle doubling map and Arnold's cat map. Moreover, for continuously differentiable $h$ and $g$, [36, 39] proved (12) for two closely related classes of systems, more precisely, $C^{1+\varepsilon}$ Anosov or Axiom-A diffeomorphisms with Gibbs invariant measures, and topological Markov chains, which are also known as subshifts of finite type, see also [11]. These results were then extended by [24, 37] to expanding interval maps with smooth invariant measures for functions $h$ and $g$ of bounded variation. In the 1990s, similar results for Hölder continuous $h$ and $g$ were proved for systems with somewhat weaker chaotic behavior, which is characterized by nonuniform hyperbolicity, such as quadratic interval maps, see [48], [27], and the Hénon map [8], and then extended to chaotic systems with singularities by [28] and specifically to Sinai billiards in a torus by [48, 13]. For some of these extensions, such as smooth expanding dynamics, smooth nonuniformly hyperbolic systems, and hyperbolic systems with singularities, we refer to [4] as well. Recently, for $h$ of bounded variation and bounded $g$, [29] obtained (12) for a class of piecewise smooth one-dimensional maps with critical points and singularities. Moreover, [3] has deduced (12) for $h, g \in \mathrm{Lip}(Z)$ and a suitable iterate of Poincaré's first return map $T$ of a large class of singular hyperbolic flows.
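To make the exponential decay (12) concrete, here is a small Python sketch (our own illustration, not part of the paper) that computes $\mathrm{cor}_{T,n}(h, g)$ for the angle doubling map $T(x) = 2x \bmod 1$ with Lebesgue measure and the bounded-variation observable $h(x) = g(x) = x$. A direct calculation gives $\int_0^1 x \, T^n(x) \, dx = \frac{1}{4} + \frac{1}{12} 2^{-n}$, hence $\mathrm{cor}_{T,n}(h, h) = \frac{1}{12} \cdot 2^{-n}$, i.e. (12) holds with $\gamma = 1$ and $b = \log 2$.

```python
# Midpoint-rule check that correlations of the doubling map decay like 2^{-n}.
N = 2 ** 16  # grid size; a power of two aligns with the map's breakpoints

def cor_T_n(n):
    """cor_{T,n}(h,h) = int h*(h o T^n) dmu - (int h dmu)^2 for h(x) = x."""
    total = 0.0
    for j in range(N):
        x = (j + 0.5) / N
        total += x * ((2 ** n * x) % 1.0)
    return total / N - 0.25  # int x dx = 1/2, so the product of the means is 1/4

for n in range(1, 6):
    print(n, cor_T_n(n), 1 / (12 * 2 ** n))  # numerical vs exact value
```

The printed pairs agree to high accuracy, and each step of the dynamics halves the correlation, matching the claimed rate.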
3 A Bernstein-type inequality

In this section, we present the key result of this work, a Bernstein-type inequality for stationary geometrically (time-reversed) $\mathcal{C}$-mixing processes.
Theorem 3.1. Let $\mathcal{Z} := (Z_n)_{n \ge 0}$ be a $Z$-valued stationary geometrically (time-reversed) $\mathcal{C}$-mixing process on $(\Omega, \mathcal{A}, \mu)$ with rate $(d_n)_{n \ge 0}$ as in (6), $\|\cdot\|_{\mathcal{C}}$ be defined by (1) for some semi-norm $\|\cdot\|$ satisfying (3), and $P := \mu_{Z_0}$. Moreover, let $h \in \mathcal{C}(Z)$ with $\mathbb{E}_P h = 0$ and assume that there exist some $A > 0$, $B > 0$, and $\sigma \ge 0$ such that $\|h\| \le A$, $\|h\|_\infty \le B$, and $\mathbb{E}_P h^2 \le \sigma^2$. Then, for all $\varepsilon > 0$ and all
$$n \ge n_0 := \max\left\{ \min\left\{ m \ge 3 : m^2 \ge \frac{808 c (3A + B)}{B} \text{ and } \frac{m}{(\log m)^{2/\gamma}} \ge 4 \right\},\; e^{3/b} \right\}, \tag{13}$$
we have
$$\mu\left( \left\{ \omega \in \Omega : \frac{1}{n} \sum_{i=1}^n h \circ Z_i \ge \varepsilon \right\} \right) \le 2 \exp\left( -\frac{n \varepsilon^2}{8 (\log n)^{2/\gamma} \bigl( \sigma^2 + \varepsilon B / 3 \bigr)} \right), \tag{14}$$
or alternatively, for all $n \ge n_0$ and $\tau > 0$, we have
$$\mu\left( \left\{ \omega \in \Omega : \frac{1}{n} \sum_{i=1}^n h(Z_i(\omega)) \ge \sqrt{\frac{8 (\log n)^{2/\gamma} \sigma^2 \tau}{n}} + \frac{8 (\log n)^{2/\gamma} B \tau}{3 n} \right\} \right) \le 2 e^{-\tau}. \tag{15}$$

Note that, besides the additional logarithmic factor $4 (\log n)^{2/\gamma}$ and the constant 2 in front of the exponential, (14) coincides with Bernstein's classical inequality for i.i.d. processes.
In the remainder of this section, we compare Theorem 3.1 with some other concentration inequalities for non-i.i.d. processes $\mathcal{Z}$. Here, $\mathcal{Z}$ is real-valued and $h$ is the identity map if not specified otherwise.
Example 3.2. Theorem 2.3 in [4] shows that smooth expanding systems on $[0, 1]$ have exponential decay of correlations (7). Moreover, if, for such expanding systems, the transformation $T$ is Lipschitz continuous and satisfies the conditions at the end of Section 4 in [18], and the ergodic measure $\mu$ satisfies [18, condition (4.8)], then [18, Theorem 2] shows that for all $\varepsilon \ge 0$ and $n \ge 1$, the left-hand side of (14) is bounded by
$$\exp\left( -\frac{\varepsilon^2 n}{C} \right),$$
where $C$ is some constant independent of $n$. The same result has been proved in [15, Theorem III.1] as well. Obviously, this is a Hoeffding-type bound instead of a Bernstein-type one. Hence, it is always larger than ours if the denominator of the exponent in (14) is smaller than $C$.
Example 3.3. For dynamical systems with exponentially decreasing $\tilde\phi$-coefficients, see [47, condition (3.1)], [47, Theorem 3.1] provides a Bernstein-type inequality for 1-Lipschitz functions $h : Z \to [-1/2, 1/2]$ w.r.t. some metric $d$ on $Z$, in which the left-hand side of (14) is bounded by
$$\exp\left( -\frac{C \varepsilon^2 n}{\sigma^2 + \varepsilon \log f(n)} \right) \tag{16}$$
for some constant $C$ independent of $n$ and $f(n)$ being some function monotonically increasing in $n$. Note that modulo the logarithmic factor $\log f(n)$ the bound (16) is the same as the one for i.i.d. processes. Moreover, if $f(n)$ grows polynomially, cf. [47, Section 3.3], then (16) has the same asymptotic behaviour as our bound. However, geometrically $\mathcal{C}$-mixing is weaker than Condition (3.1) in [47]: indeed, the required exponential form of Condition (3.1) in [47], i.e.
$$\sup_{k \ge 0} \tilde\phi\bigl( \mathcal{A}_0^k, Z_{k+n}^{k+2n-1} \bigr) := \sup_{k \ge 0} \sup_{f \in \mathcal{F}_n} \bigl\| \mathbb{E}\bigl( f(Z_{k+n}^{k+2n-1}) \,\big|\, \mathcal{A}_0^k \bigr) - \mathbb{E} f(Z_{k+n}^{k+2n-1}) \bigr\|_\infty \le c \cdot e^{-bn}$$
for some $c, b > 0$ and all $n \ge 1$, where $Z_{k+n}^{k+2n-1} := (Z_{k+n}, \ldots, Z_{k+2n-1})$ and $\mathcal{F}_n$ is the set of 1-Lipschitz functions $f : Z^n \to [-\frac{1}{2}, \frac{1}{2}]$ w.r.t. the metric $d_n(x, y) := \frac{1}{n} \sum_{i=1}^n d(x_i, y_i)$, implies
$$\sup_{k \ge 0} \sup_{f \in \mathcal{F}} \bigl\| \mathbb{E}\bigl( f(Z_{k+n}) \,\big|\, \mathcal{A}_0^k \bigr) - \mathbb{E} f(Z_{k+n}) \bigr\|_\infty \le c \cdot n e^{-bn} \le \tilde{c} \cdot e^{-\tilde{b} n}$$
for some $\tilde{c}, \tilde{b} > 0$ and all $n \ge 1$, where $\mathcal{F}$ is the set of 1-Lipschitz functions $f : Z \to [-\frac{1}{2}, \frac{1}{2}]$ w.r.t. the metric $d$. In other words, processes satisfying Condition (3.1) in [47] are $\tilde\phi$-mixing, see (10), which is stronger than geometrically $\mathcal{C}$-mixing, see again Figure 1. Moreover, our result holds for all $\gamma > 0$, while [47] only considers the case $\gamma = 1$.
Example 3.4. For an α-mixing sequence of centered and bounded random variables satisfying $\alpha(n) \le c \exp(-b n^\gamma)$ for some constants $b > 0$, $c \ge 0$, and $\gamma > 0$, [33, Theorem 4.3] bounds the left-hand side of (14) by
$$(1 + 4 e^{-2} c) \exp\left( -\frac{3 \varepsilon^2 n^{(\gamma)}}{6 \sigma^2 + 2 \varepsilon B} \right) \quad \text{with } n^{(\gamma)} \asymp n^{\frac{\gamma}{\gamma + 1}} \tag{17}$$
for all $n \ge 1$ and all $\varepsilon > 0$. In general, this bound and our result are not comparable, since not every α-mixing process satisfies (7) and, conversely, not every process satisfying (7) is necessarily α-mixing, see Figure 2. Nevertheless, for φ-mixing processes, it is easily seen that this bound is always worse than ours for a fixed $\gamma > 0$ if $n$ is large enough.
Figure 2: Relationship between α-, φ-, and $\mathcal{C}$-mixing processes
Example 3.5. For an α-mixing stationary sequence of centered and bounded random variables satisfying $\alpha(n) \le \exp(-2cn)$ for some $c > 0$, [32, Theorem 2] bounds the left-hand side of (14) by
$$\exp\left( -\frac{C \varepsilon^2 n}{v^2 + \varepsilon B (\log n)^2 + n^{-1} B^2} \right), \tag{18}$$
where $C > 0$ is some constant and
$$v^2 := \sigma^2 + 2 \sum_{2 \le i \le n} |\mathrm{cov}(X_1, X_i)|. \tag{19}$$
By applying the covariance inequality for α-mixing processes, see [16, the corollary to Lemma 2.1], we obtain $v^2 \le C_\delta \|X_1\|_{2+\delta}^2$ for an arbitrary $\delta > 0$ and a constant $C_\delta$ only depending on $\delta$. If the additional $\delta > 0$ is ignored, (18) has therefore the same asymptotic behavior as our bound. In general, however, the additional $\delta$ does influence the asymptotic behavior. For example, the oracle inequality we obtain in the next section would be slower by a factor of $n^\xi$, where $\xi > 0$ is arbitrary, if we used (18) instead. Finally, note that in general the bound (18) and ours are not comparable, see again Figure 2.

In particular, Inequality (18) can be applied to geometrically φ-mixing processes with $\gamma = 1$. By using the covariance inequality (1.1) for φ-mixing processes in [16], we can bound $v^2$ defined as in (19) by $C \sigma^2$ with some constant $C$ independent of $n$. Modulo the term $n^{-1} B^2$ in the denominator, the bound (18) coincides with ours for geometrically φ-mixing processes with $\gamma = 1$. However, our bound also holds for such processes with $\gamma \in (0, 1)$.
Example 3.6. For stationary, geometrically α-mixing Markov chains with centered and bounded random variables, [1] bounds the left-hand side of (14) by
$$\exp\left( -\frac{n \varepsilon^2}{\tilde\sigma^2 + \varepsilon B \log n} \right), \tag{20}$$
where $\tilde\sigma^2 = \lim_{n \to \infty} \frac{1}{n} \mathrm{Var}\bigl( \sum_{i=1}^n X_i \bigr)$. By a similar argument as in Example 3.5 we obtain
$$\mathrm{Var}\Bigl( \sum_{i=1}^n X_i \Bigr) = n \sigma^2 + 2 \sum_{1 \le i < j \le n} \mathrm{cov}(X_i, X_j) \le n \sigma^2 + \tilde{C}_\delta \, n \, \|X_1\|_{2+\delta}^2$$
for an arbitrary $\delta > 0$ and a constant $\tilde{C}_\delta$ depending only on $\delta$. Consequently, we conclude that, modulo some arbitrarily small number $\delta > 0$ and the logarithmic factor $\log n$ instead of $(\log n)^2$, the bound (20) coincides with ours. Again, this bound and our result are not comparable, see Figure 2.
Example 3.7. For stationary, weakly dependent processes of centered and bounded random variables with $|\mathrm{cov}(X_1, X_n)| \le c \cdot \exp(-bn)$ for some $c, b > 0$ and all $n \ge 1$, [26, Theorem 2.1] bounds the left-hand side of (14) by
$$\exp\left( -\frac{\varepsilon^2 n}{C_1 + C_2 \varepsilon^{5/3} n^{2/3}} \right), \tag{21}$$
where $C_1$ is some constant depending on $c$ and $b$, and $C_2$ is some constant depending on $c$, $b$, and $B$. Note that the denominator in (21) is at least $C_1$, and therefore the bound (21) is more of Hoeffding type.
4 Applications to Statistical Learning

In this section, we apply the Bernstein inequality from the last section to deduce oracle inequalities for some widely used learning methods and observations generated by a geometrically $\mathcal{C}$-mixing process. More precisely, in Subsection 4.1, we recall some basic concepts of statistical learning and formulate an oracle inequality for learning methods that are based on (regularized) empirical risk minimization. Then, in Subsection 4.2, we illustrate this oracle inequality by deriving learning rates for SVMs. Finally, in Subsection 4.3, we present an oracle inequality for forecasting dynamical systems.
4.1 Oracle inequality for CR-ERMs

In this section, let $X$ always be a measurable space if not mentioned otherwise and $Y \subset \mathbb{R}$ always be a closed subset. Recall that in (supervised) statistical learning, our aim is to find a function $f : X \to \mathbb{R}$ such that for $(x, y) \in X \times Y$ the value $f(x)$ is a good prediction of $y$ at $x$. To evaluate the quality of such functions $f$, we need a measurable loss function $L : X \times Y \times \mathbb{R} \to [0, \infty)$. Following [42, Definition 2.22], we say that a loss $L$ can be clipped at $M > 0$ if, for all $(x, y, t) \in X \times Y \times \mathbb{R}$, we have
$$L(x, y, \hat{t}\,) \le L(x, y, t), \tag{22}$$
where $\hat{t}$ denotes the clipped value of $t$ at $\pm M$, that is, $\hat{t} := t$ if $t \in [-M, M]$, $\hat{t} := -M$ if $t < -M$, and $\hat{t} := M$ if $t > M$. Various often-used loss functions can be clipped. For example, if $Y := \{-1, 1\}$ and $L$ is a convex, margin-based loss represented by $\varphi : \mathbb{R} \to [0, \infty)$, that is $L(y, t) = \varphi(yt)$ for all $y \in Y$ and $t \in \mathbb{R}$, then $L$ can be clipped if and only if $\varphi$ has a global minimum, see [42, Lemma 2.23]. In particular, the hinge loss, the least squares loss for classification, and the squared hinge loss can be clipped, but the logistic loss for classification and the AdaBoost loss cannot be clipped. Moreover, if $Y := [-M, M]$ and $L$ is a convex, distance-based loss represented by some $\psi : \mathbb{R} \to [0, \infty)$, that is $L(y, t) = \psi(y - t)$ for all $y \in Y$ and $t \in \mathbb{R}$, then $L$ can be clipped whenever $\psi(0) = 0$, see again [42, Lemma 2.23]. In particular, the least squares loss
$$L(y, t) = (y - t)^2 \tag{23}$$
and the $\tau$-pinball loss
$$L_\tau(y, t) := \psi(y - t) = \begin{cases} -(1 - \tau)(y - t), & \text{if } y - t < 0, \\ \tau(y - t), & \text{if } y - t \ge 0, \end{cases} \tag{24}$$
used for quantile regression can be clipped if the space of labels $Y$ is bounded.
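The clipping property (22) is easy to verify numerically. The following Python sketch (our own illustration; the helper names are not from the paper) implements the clipping operator and the τ-pinball loss (24) and checks, over a small grid of labels and predictions, that clipping never increases the loss when $Y \subset [-M, M]$:

```python
def clip(t, M):
    """Clipped value of t at +/- M, as in (22)."""
    return max(-M, min(M, t))

def pinball(tau, y, t):
    """tau-pinball loss (24) used for quantile regression."""
    r = y - t
    return tau * r if r >= 0 else -(1 - tau) * r

M, tau = 1.0, 0.3
for y in [-1.0, -0.2, 0.7, 1.0]:           # labels in [-M, M]
    for t in [-2.5, -1.0, 0.0, 0.4, 1.7]:  # arbitrary real-valued predictions
        assert pinball(tau, y, clip(t, M)) <= pinball(tau, y, t) + 1e-12
print("clipping never increases the pinball loss")
```

The check succeeds because the pinball loss is convex in $t$ with its minimum at $t = y \in [-M, M]$, so moving a prediction toward $[-M, M]$ can only decrease the loss; the same argument applies to the least squares loss (23).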
Now we summarize the assumptions on the loss function $L$ that will be used throughout this work.

Assumption 4.1. The loss function $L : X \times Y \times \mathbb{R} \to [0, \infty)$ can be clipped at some $M > 0$. Moreover, it is both bounded in the sense of $L(x, y, t) \le 1$ and locally Lipschitz continuous, that is,
$$|L(x, y, t) - L(x, y, t')| \le |t - t'|. \tag{25}$$
Here both inequalities are supposed to hold for all $(x, y) \in X \times Y$ and $t, t' \in [-M, M]$. Note that the former assumption can typically be enforced by scaling.
Given a loss function $L$ and an $f : X \to \mathbb{R}$, we often use the notation $L \circ f$ for the function $(x, y) \mapsto L(x, y, f(x))$. Our major goal is to have a small average loss for future unseen observations $(x, y)$. This leads to the following definition, see also [42, Definitions 2.2 & 2.3].
Definition 4.2. Let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss function and $P$ be a probability measure on $X \times Y$. Then, for a measurable function $f : X \to \mathbb{R}$, the $L$-risk is defined by
$$\mathcal{R}_{L,P}(f) := \int_{X \times Y} L(x, y, f(x)) \, dP(x, y).$$
Moreover, the minimal $L$-risk
$$\mathcal{R}_{L,P}^* := \inf\bigl\{ \mathcal{R}_{L,P}(f) \bigm| f : X \to \mathbb{R} \text{ measurable} \bigr\}$$
is called the Bayes risk with respect to $P$ and $L$. In addition, a measurable function $f_{L,P}^* : X \to \mathbb{R}$ satisfying $\mathcal{R}_{L,P}(f_{L,P}^*) = \mathcal{R}_{L,P}^*$ is called a Bayes decision function.

Informally, the goal of learning from a training set $D \in (X \times Y)^n$ is to find a decision function $f_D$ such that $\mathcal{R}_{L,P}(f_D)$ is close to the minimal risk $\mathcal{R}_{L,P}^*$. Our next goal is to formalize this idea. We begin with the following definition.
Definition 4.3. Let $X$ be a set and $Y \subset \mathbb{R}$ be a closed subset. A learning method $\mathcal{L}$ on $X \times Y$ maps every set $D \in (X \times Y)^n$, $n \ge 1$, to a function $f_D : X \to \mathbb{R}$.

Let us now describe the learning algorithms we are interested in. To this end, we assume that we have a hypothesis set $\mathcal{F}$ consisting of bounded measurable functions $f : X \to \mathbb{R}$, which is pre-compact with respect to the supremum norm $\|\cdot\|_\infty$. Since $\mathcal{F}$ can be infinite, we need to recall the following classical concept, which will enable us to approximate infinite $\mathcal{F}$ by finite subsets.
Definition 4.4. Let $(T, d)$ be a metric space and $\varepsilon > 0$. We call $S \subset T$ an $\varepsilon$-net of $T$ if for all $t \in T$ there exists an $s \in S$ with $d(s, t) \le \varepsilon$. Moreover, the $\varepsilon$-covering number of $T$ is defined by
$$\mathcal{N}(T, d, \varepsilon) := \inf\left\{ n \ge 1 : \exists\, s_1, \ldots, s_n \in T \text{ such that } T \subset \bigcup_{i=1}^n B_d(s_i, \varepsilon) \right\},$$
where $\inf \emptyset := \infty$ and $B_d(s, \varepsilon) := \{ t \in T : d(t, s) \le \varepsilon \}$ denotes the closed ball with center $s \in T$ and radius $\varepsilon$.

Note that our hypothesis set $\mathcal{F}$ is assumed to be pre-compact, and hence for all $\varepsilon > 0$, the covering number $\mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$ is finite.
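For intuition, the covering number of Definition 4.4 can be computed exactly in a simple case: for $T = [0, 1]$ with the usual metric, $\mathcal{N}([0, 1], |\cdot|, \varepsilon) = \lceil 1 / (2\varepsilon) \rceil$, since each closed ball covers an interval of length $2\varepsilon$. A Python sketch (our own illustration, not from the paper) constructing such a minimal ε-net:

```python
import math

def eps_net_unit_interval(eps):
    """Centers of a minimal eps-net of [0, 1]; each ball covers length 2*eps."""
    n = math.ceil(1 / (2 * eps))
    # centers eps, 3*eps, 5*eps, ..., capped at 1.0 so they stay inside T
    return [min((2 * k + 1) * eps, 1.0) for k in range(n)]

eps = 0.1
net = eps_net_unit_interval(eps)
print(len(net), net)  # 5 centers for eps = 0.1

# sanity check: every point of a fine grid is within eps of some center
assert all(min(abs(t - s) for s in net) <= eps + 1e-12
           for t in [j / 1000 for j in range(1001)])
```

Hypothesis sets of functions behave analogously, except that their covering numbers typically grow exponentially in $1/\varepsilon$, which is exactly what the oracle inequality below has to control.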
In order to introduce our generic learning algorithms, we write
$$D := \bigl( (X_1, Y_1), \ldots, (X_n, Y_n) \bigr) := (Z_1, \ldots, Z_n) \in (X \times Y)^n$$
for a training set of length $n$ that is distributed according to the first $n$ components of the $X \times Y$-valued process $\mathcal{Z} = (Z_i)_{i \ge 1}$. Furthermore, we write $D_n := \frac{1}{n} \sum_{i=1}^n \delta_{(X_i, Y_i)}$, where $\delta_{(X_i, Y_i)}$ denotes the (random) Dirac measure at $(X_i, Y_i)$. In other words, $D_n$ is the empirical measure associated to the data set $D$. Finally, the risk of a function $f : X \to \mathbb{R}$ with respect to this measure,
$$\mathcal{R}_{L,D_n}(f) = \frac{1}{n} \sum_{i=1}^n L(X_i, Y_i, f(X_i)),$$
is called the empirical $L$-risk.
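The empirical $L$-risk is just a sample average of losses, e.g. for the least squares loss (23). A minimal Python sketch (our own illustration, with a made-up toy data set):

```python
def empirical_risk(loss, D, f):
    """R_{L,D_n}(f) = (1/n) * sum_i L(x_i, y_i, f(x_i))."""
    return sum(loss(x, y, f(x)) for x, y in D) / len(D)

least_squares = lambda x, y, t: (y - t) ** 2  # loss (23); x is unused here

D = [(0.0, 0.0), (1.0, 1.0)]  # toy training set of length n = 2
f = lambda x: 0.5             # constant predictor
print(empirical_risk(least_squares, D, f))  # (0.25 + 0.25) / 2 = 0.25
```

Empirical risk minimization picks, from the hypothesis set, a function with (approximately) the smallest such sample average; the Bernstein-type inequality of Section 3 controls how far this average can deviate from the true risk under $\mathcal{C}$-mixing sampling.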
With these preparations we can now introduce the class of learning methods we are interested in, see also [42, Definition 7.18].