Learning Rates of Regression with $q$-norm Loss and Threshold

Ting Hu
School of Mathematics and Statistics, Wuhan University
Luojia Hill, Wuhan 430072, China, [email protected]

Yuan Yao
School of Mathematical Sciences, Peking University
Beijing 100871, China, [email protected]

Abstract

This paper studies some robust regression problems associated with the $q$-norm loss ($q \ge 1$) and the $\epsilon$-insensitive $q$-norm loss in the reproducing kernel Hilbert space. We establish a variance-expectation bound under an a priori noise condition on the conditional distribution, which is the key technique for measuring the error bound. Explicit learning rates are given under approximation ability assumptions on the reproducing kernel Hilbert space.

Key Words and Phrases. Insensitive $q$-norm loss, quantile regression, reproducing kernel Hilbert space, sparsity.

Mathematical Subject Classification. 68Q32, 41A25

1 Introduction

In this paper we consider regression with the $q$-norm loss $\psi_q$ with $q \ge 1$ and an $\epsilon$-insensitive $q$-norm loss $\psi_q^\epsilon$ (to be defined) with a threshold $\epsilon > 0$. Here $\psi_q$ is the univariate function defined by $\psi_q(u) = |u|^q$. For a learning algorithm generated by a regularization scheme in reproducing kernel Hilbert spaces, learning rates and the approximation error will be presented when $\epsilon$ is chosen appropriately to balance learning rates and sparsity.

For $q = 1$, the regression problem is the classical statistical method of least absolute deviations, which is more robust than the least squares method and is resistant to outliers in the data [4]. Its associated loss $\psi(u) = |u|$, $u \in \mathbb{R}$, is widely used in practical applications for robustness. In fact, for all $q < 2$, the loss $\psi_q$ is less sensitive to outliers and is thus more robust than the square loss. Vapnik [13] proposed an $\epsilon$-insensitive loss $\psi^\epsilon(u): \mathbb{R} \to \mathbb{R}_+$ to obtain sparsity in support vector regression, which is defined by
$$\psi^\epsilon(u) = \begin{cases} |u| - \epsilon, & \text{if } |u| > \epsilon, \\ 0, & \text{if } |u| \le \epsilon. \end{cases} \tag{1.1}$$
When $\epsilon > 0$ is fixed, an error analysis was conducted in [12]. Xiang, Hu and Zhou [17, 18] showed how to accelerate learning rates and preserve sparsity by adapting $\epsilon$. In [5], the convergence ability with a flexible $\epsilon$ in an online algorithm was discussed. For quantile regression with $\epsilon = 0$ and a pinball loss having different slopes on the two sides of the origin in $\mathbb{R}$ [6], Steinwart and Christmann [10, 9] established comparison theorems and derived learning rates under some noise conditions.

In this paper, we apply the $q$-norm loss $\psi_q$ with $q > 1$ to improve the convexity of the insensitive loss $\psi$. Our results show how the insensitive parameter $\epsilon$ that produces the sparsity can be chosen adaptively as a function of the sample size, $\epsilon = \epsilon(T) \to 0$ as $T \to \infty$, and how this choice affects the error rates of the learning algorithm (to be defined by (1.4)). Such results include some early studies as special cases.

In the sequel, assume that the input space $X$ is a compact metric space and the output space $Y = \mathbb{R}$. Let $\rho$ be a Borel probability measure on $Z := X \times Y$, let $\rho_x(\cdot)$ be the conditional distribution of $\rho$ at each $x \in X$, and let $\rho_X$ be the marginal distribution on $X$. For a measurable function $f: X \to Y$, the generalization error $\mathcal{E}(f)$ associated with the $q$-norm loss $\psi_q$ is defined by
$$\mathcal{E}(f) = \int_Z \psi_q(y - f(x))\, d\rho. \tag{1.2}$$
Denote by $f_q: X \to Y$ the minimizer of the generalization error $\mathcal{E}(f)$ over all measurable functions. Its properties and the corresponding learning problem in the empirical risk minimization framework were discussed in [20].
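To make the loss functions concrete, here is a minimal Python sketch (not part of the paper) implementing the $q$-norm loss $\psi_q$ and Vapnik's $\epsilon$-insensitive loss (1.1), and approximating the generalization error (1.2) by an empirical average over a sample; the sinusoidal data model and the candidate function $f$ are purely illustrative assumptions.

```python
import numpy as np

def psi_q(u, q):
    """q-norm loss psi_q(u) = |u|^q, q >= 1."""
    return np.abs(u) ** q

def psi_eps(u, eps):
    """Vapnik's epsilon-insensitive loss (1.1): max(|u| - eps, 0)."""
    return np.maximum(np.abs(u) - eps, 0.0)

def empirical_risk(f, x, y, loss):
    """Empirical (Monte Carlo) estimate of the generalization error (1.2):
    the average of loss(y - f(x)) over the sample."""
    return np.mean(loss(y - f(x)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative data (not from the paper): y = sin(2*pi*x) + small noise.
    x = rng.uniform(0.0, 1.0, size=2000)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
    f = lambda t: np.sin(2 * np.pi * t)   # a candidate regression function
    for q in (1.0, 1.5, 2.0):
        risk = empirical_risk(f, x, y, lambda u: psi_q(u, q))
        print(f"q = {q}: empirical q-norm risk = {risk:.4f}")
    print("eps-insensitive risk (eps = 0.05):",
          empirical_risk(f, x, y, lambda u: psi_eps(u, 0.05)))
```

For $q = 1$ the empirical risk above is the mean absolute deviation, which is the sense in which the loss is robust to outliers; for $q = 2$ it is the usual mean squared error.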
When $q = 1$, the target function $f_q$ contains the medians of the conditional distributions for all $x \in X$. For symmetric distributions, the median is also the regression function, which is the conditional mean at each given $x$.

We aim at learning the minimizer $f_q$ from a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^T \in Z^T$, which is assumed to be independently drawn according to $\rho$. Inspired by the $\epsilon$-insensitive loss [13], we introduce an $\epsilon$-insensitive $q$-norm loss $\psi_q^\epsilon$ defined by
$$\psi_q^\epsilon(u) = \begin{cases} (|u| - \epsilon)^q, & \text{if } |u| > \epsilon, \\ 0, & \text{if } |u| \le \epsilon. \end{cases} \tag{1.3}$$
Our learning task will be carried out by a regularization scheme in reproducing kernel Hilbert spaces. With a continuous, symmetric and positive semidefinite function $K: X \times X \to \mathbb{R}$ (called a Mercer kernel), the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ is defined as the completion of the span of $\{K_x = K(x, \cdot): x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_u \rangle_K = K(x, u)$. The regularization algorithm in the paper takes the form
$$f_{\mathbf{z}}^\epsilon = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{T} \sum_{t=1}^T \psi_q^\epsilon(f(x_t) - y_t) + \lambda \|f\|_K^2 \Big\}. \tag{1.4}$$
Here $\lambda > 0$ is a regularization parameter. Our learning rates are stated in terms of the approximation or regularization error, noise conditions, and the capacity of the RKHS. Our main goal is to study how the learned function $f_{\mathbf{z}}^\epsilon$ in (1.4) converges to the target function $f_q$.

There is a large literature [1, 16, 7] in learning theory studying the approximation error or regularization error $\mathcal{D}(\lambda)$ of the triple $(K, \rho, q)$ defined by
$$\mathcal{D}(\lambda) = \min_{f \in \mathcal{H}_K} \big\{ \mathcal{E}(f) - \mathcal{E}(f_q) + \lambda \|f\|_K^2 \big\}, \qquad \lambda > 0.$$
The regularization function is defined as
$$f_\lambda = \arg\min_{f \in \mathcal{H}_K} \big\{ \mathcal{E}(f) - \mathcal{E}(f_q) + \lambda \|f\|_K^2 \big\}. \tag{1.5}$$
In the sequel, let $L^p_{\rho_X}$ with $p > 0$ be the space of $p$-integrable functions with respect to $\rho_X$ and $\|\cdot\|_{L^p_{\rho_X}}$ be the norm in $L^p_{\rho_X}$. A usual assumption on the regularization error $\mathcal{D}(\lambda)$, which imposes certain smoothness on $\mathcal{H}_K$, is
$$\mathcal{D}(\lambda) \le \mathcal{D}_0 \lambda^\beta, \qquad \forall \lambda > 0 \tag{1.6}$$
with some $0 < \beta \le 1$ and $\mathcal{D}_0 > 0$.

Remark 1. Assumption (1.6) always holds with $\beta = 0$. When the target function $f_q \in \mathcal{H}_K$ and $\mathcal{H}_K$ is dense in $C(X)$, which consists of the bounded continuous functions on $X$, the approximation error $\mathcal{D}(\lambda) \to 0$ as $\lambda \to 0$. Thus, the decay (1.6) is natural and can be illustrated in terms of interpolation spaces [7]. Define the integral operator $L_K: L^2_{\rho_X} \to L^2_{\rho_X}$ by $L_K(f)(x) = \int_X K(x, y) f(y)\, d\rho_X$, $x \in X$, $f \in L^2_{\rho_X}$, and suppose that the minimizer $f_q$ is in the range of $L_K^\nu$ with $0 < \nu \le \frac{1}{2}$. When $q = 1$, the approximation error $\mathcal{D}(\lambda)$ can be $O(\lambda^{\frac{\nu}{1-\nu}})$ for quantile regression [18]. When $q = 2$, $\mathcal{D}(\lambda) = O(\lambda^{2\nu})$ for the least squares. For other $q > 1$, the associated loss $\psi_q$ is Lipschitz on a bounded domain and the corresponding $\mathcal{D}(\lambda)$ can be characterized by the $\mathcal{K}$-functional [1], which can have the same polynomial decay as (1.6).

We assume that the conditional distribution $\rho_x(\cdot)$ is supported on $[-M, M]$, $M > 0$, at each $x$ and is non-degenerate, i.e. any non-empty open set of $Y$ has strictly positive measure, which ensures that the target function $f_q$ is unique. Without loss of generality, let the support of $\rho_x(\cdot)$ be $[-\frac{1}{2}, \frac{1}{2}]$ at each $x \in X$; our analysis below is applicable for any $M > 0$. We will prove this in the next section. It is natural to project the values of the learned function $f_{\mathbf{z}}^\epsilon$ onto some interval by the projection operator [1, 15].

Definition 1. The projection operator $\pi$ on the space of measurable functions $f: X \to \mathbb{R}$ onto the interval $[-1, 1]$ is defined by
$$\pi(f(x)) = \begin{cases} 1, & \text{if } f(x) \ge 1, \\ f(x), & \text{if } -1 < f(x) < 1, \\ -1, & \text{if } f(x) \le -1. \end{cases}$$
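As an illustration of how the estimator (1.4) could be computed, the following Python sketch parametrizes $f = \sum_t c_t K(x_t, \cdot)$, for which $\|f\|_K^2 = c^\top G c$ with $G$ the kernel Gram matrix (the standard representer-theorem reduction), and minimizes the resulting finite-dimensional convex objective. The Gaussian kernel, the synthetic data and the L-BFGS solver are illustrative assumptions, not prescriptions of the paper; the projection $\pi$ of Definition 1 is applied to the predictions.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X1, X2, sigma=0.5):
    """An illustrative Mercer kernel: K(x, u) = exp(-(x - u)^2 / (2 sigma^2))."""
    return np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2.0 * sigma ** 2))

def psi_q_eps(u, q, eps):
    """epsilon-insensitive q-norm loss (1.3): (|u| - eps)^q if |u| > eps, else 0."""
    return np.maximum(np.abs(u) - eps, 0.0) ** q

def psi_q_eps_grad(u, q, eps):
    """Derivative in u; for q > 1 the loss (1.3) is continuously differentiable."""
    return q * np.sign(u) * np.maximum(np.abs(u) - eps, 0.0) ** (q - 1)

def fit(x, y, q=1.5, eps=0.05, lam=1e-2, sigma=0.5):
    """Solve (1.4) over the representer parametrization f = sum_t c_t K(x_t, .),
    for which ||f||_K^2 = c^T G c with G the Gram matrix."""
    G = gaussian_kernel(x, x, sigma)
    T = len(x)

    def objective(c):
        r = G @ c - y                      # residuals f(x_t) - y_t
        val = psi_q_eps(r, q, eps).mean() + lam * c @ G @ c
        grad = G @ psi_q_eps_grad(r, q, eps) / T + 2.0 * lam * G @ c
        return val, grad

    res = minimize(objective, np.zeros(T), jac=True, method="L-BFGS-B")
    return lambda t: gaussian_kernel(np.atleast_1d(t), x, sigma) @ res.x

def project(v):
    """Projection operator pi of Definition 1 onto [-1, 1]."""
    return np.clip(v, -1.0, 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 1.0, 200)
    y = 0.4 * np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(200)
    f_z = fit(x, y)                        # the learned f_z^eps of (1.4)
    print(project(f_z(np.linspace(0.0, 1.0, 5))))
```

Since $\psi_q^\epsilon$ is convex and, for $q > 1$, continuously differentiable, supplying the exact gradient to a quasi-Newton solver is enough for this sketch; the choices of $q$, $\epsilon$, $\lambda$ and the kernel width are arbitrary.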
To demonstrate our main result in the general case, we first give the following learning rate in the special case when $K$ is $C^\infty$.

Theorem 1. Let $X \subset \mathbb{R}^n$ and $K \in C^\infty(X \times X)$. Assume that $f_q \in \mathcal{H}_K$ with $q > 1$, $\|f_q\|_\infty \le \frac{1}{4}$, and that the conditional distributions $\{\rho_x(\cdot)\}_{x \in X}$ have density functions given by
$$\frac{d\rho_x}{dy}(y) = \begin{cases} A\,|y - f_q(x)|^\varphi, & \text{if } |y - f_q(x)| \le \frac{1}{4}, \\ 0, & \text{otherwise}, \end{cases} \tag{1.7}$$
where $A = 2^{2\varphi+1}(\varphi + 1)$, $\varphi > 0$. Take $\lambda = \epsilon = T^{-\frac{q+\varphi+1}{2(q+\varphi)}}$. Then for any $0 < \delta < 1$, with confidence $1 - \delta$, we have
$$\|\pi(f_{\mathbf{z}}^\epsilon) - f_q\|_{L^{q+\varphi+1}_{\rho_X}} \le C' \sqrt{\log\frac{3}{\delta}}\; T^{-\frac{1}{2(q+\varphi)}},$$
where $C'$ is a constant independent of $T$ or $\delta$.

To state our main result in the general case, we need a noise condition on the measure $\rho$ introduced in [9, 10].

Definition 2. Let $0 < p \le \infty$ and $w > 0$. We say that $\rho$ has a $p$-average type $w$ if there exist two functions $b$ and $a$ from $X$ to $\mathbb{R}$ such that $\{b a^w\}^{-1} \in L^p_{\rho_X}$ and for any $x \in X$ and $s \in (0, a(x)]$, there holds
$$\rho_x(\{y: f_q(x) \le y \le f_q(x) + s\}) \ge b(x) s^w \quad \text{and} \quad \rho_x(\{y: f_q(x) - s \le y \le f_q(x)\}) \ge b(x) s^w. \tag{1.8}$$

This assumption is satisfied by many common conditional distributions such as Gaussian, Student's $t$ and uniform distributions. In the following, we give an example to illustrate Definition 2 in detail. More examples can be found in [9, 10].

Example 1. Assume that the conditional distributions $\{\rho_x(\cdot)\}_{x \in X}$ are Gaussian distributions with a common variance $\sigma^2 > 0$, i.e. $\frac{d\rho_x}{dy}(y) = \frac{1}{\sqrt{2\pi}\sigma}\exp\{-\frac{(y - u_x)^2}{2\sigma^2}\}$, where $\{u_x\}_{x \in X}$ are the expectations of the Gaussian distributions $\{\rho_x(\cdot)\}_{x \in X}$. It is not difficult to check that the minimizer $f_q(x)$ takes the value $u_x$ at each $x \in X$. Then for any $s \in (0, \sigma]$, there holds
$$\rho_x(\{y: f_q(x) \le y \le f_q(x) + s\}) = \frac{1}{\sqrt{2\pi}\sigma}\int_{f_q(x)}^{f_q(x)+s} \exp\Big\{-\frac{(y - u_x)^2}{2\sigma^2}\Big\}\, dy = \frac{1}{\sqrt{2\pi}\sigma}\int_0^s \exp\Big\{-\frac{y^2}{2\sigma^2}\Big\}\, dy \ge \frac{1}{\sqrt{2\pi}\sigma}\int_0^s \exp\Big\{-\frac{s^2}{2\sigma^2}\Big\}\, dy \ge \frac{e^{-\frac{1}{2}}}{\sqrt{2\pi}\sigma}\, s.$$
By symmetry, we also have $\rho_x(\{y: f_q(x) - s \le y \le f_q(x)\}) \ge \frac{e^{-\frac{1}{2}}}{\sqrt{2\pi}\sigma}\, s$. Thus, the measure $\rho$ has an $\infty$-average type $1$.
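The lower bound derived in Example 1 can be checked numerically. The sketch below (the choice of $\sigma$ values and the grid of $s$ values are arbitrary assumptions) compares the Gaussian mass of $[f_q(x), f_q(x)+s]$ with the bound $\frac{e^{-1/2}}{\sqrt{2\pi}\sigma}\, s$ for $s \in (0, \sigma]$, using translation invariance to take $u_x = 0$.

```python
import numpy as np
from scipy.stats import norm

def gaussian_lower_bound_check(sigma=0.7, n_points=200):
    """Check the inequality from Example 1:
    rho_x({f_q(x) <= y <= f_q(x) + s}) >= exp(-1/2) / (sqrt(2*pi)*sigma) * s
    for s in (0, sigma], where rho_x = N(u_x, sigma^2) and f_q(x) = u_x = 0."""
    s = np.linspace(1e-6, sigma, n_points)
    prob = norm.cdf(s, loc=0.0, scale=sigma) - 0.5        # mass of [0, s]
    bound = np.exp(-0.5) / (np.sqrt(2.0 * np.pi) * sigma) * s
    return bool(np.all(prob >= bound))

if __name__ == "__main__":
    # The inequality holds for every sigma > 0; a few illustrative values:
    print([gaussian_lower_bound_check(sigma) for sigma in (0.1, 0.7, 2.0)])
```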
Our error analysis is related to the capacity of the hypothesis space $\mathcal{H}_K$, which is measured by covering numbers.

Definition 3. For a subset $S$ of $C(X)$ and $\varepsilon > 0$, the covering number $\mathcal{N}(S, \varepsilon)$ is the minimal integer $l \in \mathbb{N}$ such that there exist $l$ disks with radius $\varepsilon$ covering $S$.

The covering numbers of balls $B_R = \{f \in \mathcal{H}_K: \|f\|_K \le R\}$ with $R > 0$ of the RKHS $\mathcal{H}_K$ have been well understood in learning theory [22, 23]. In this paper, we assume that for some $k > 0$ and $C_k > 0$,
$$\log \mathcal{N}(B_1, \varepsilon) \le C_k \Big(\frac{1}{\varepsilon}\Big)^k, \qquad \forall \varepsilon > 0. \tag{1.9}$$

Remark 2. When $X$ is a bounded subset of $\mathbb{R}^n$ and the RKHS $\mathcal{H}_K$ is a Sobolev space $H^m(X)$ with index $m$, it is shown in [22] that the condition (1.9) holds true with $k = \frac{2n}{m}$. If the kernel $K$ lies in the smooth space $C^\infty(X \times X)$, then (1.9) is satisfied for an arbitrarily small $k > 0$. Another common way to measure the capacity of $\mathcal{H}_K$ is the empirical covering number [21], which is outside the scope of our discussion in this paper.

Denote
$$\theta = \min\Big\{\frac{2}{q+w},\ \frac{p}{p+1}\Big\} \in (0, 1], \qquad r = \frac{p(q+w)}{p+1} > 0. \tag{1.10}$$
The following learning rates in the general case will be proved in Section 4. One needs to point out that the proof of Theorem 2 is only applicable to the case $q > 1$. However, when $q = 1$, it is a special case of quantile regression and the same learning rates as those of Theorem 2 can be found in [17, 18].

Theorem 2. Suppose that $\rho$ has a $p$-average type $w$ for some $0 < p \le \infty$ and $w > 0$. Assume that the regularization error condition (1.6) is satisfied for some $0 < \beta \le 1$ and that (1.9) holds with $k > 0$. Take $\lambda = T^{-\alpha}$, $\epsilon = T^{-\eta}$ with $0 < \alpha \le 1$, $0 < \eta \le \infty$. Let $\xi > 0$. Then for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
$$\|\pi(f_{\mathbf{z}}^\epsilon) - f_q\|_{L^r_{\rho_X}} \le C_* \log\frac{3}{\xi}\,\sqrt{2\log\frac{3}{\delta}}\; T^{-\Lambda}, \tag{1.11}$$
where $C_*$ is a constant independent of $T$ or $\delta$,
$$\Lambda = \frac{1}{q+w}\min\Big\{\eta,\ \alpha\beta,\ 1 - \frac{q(1-\beta)\alpha}{2},\ \frac{1}{2-\theta},\ \frac{1}{2+k-\theta} - \frac{k}{1+k}\vartheta\Big\}$$
with
$$\vartheta = \max\Big\{\frac{\alpha-\eta}{2},\ \frac{\alpha(1-\beta)}{2},\ \frac{\alpha}{2} + \frac{q(1-\beta)\alpha}{4} - \frac{1}{2},\ \frac{\alpha}{2} - \frac{1}{2(2-\theta)},\ \frac{[\alpha(2+k-\theta)-1](1+k)}{(2+k-\theta)(2+k)} + \xi\Big\} \ge 0,$$
provided that
$$\vartheta < \frac{1+k}{k(2+k-\theta)}. \tag{1.12}$$

Corollary 1. Let $X \subset \mathbb{R}^n$, $K \in C^\infty(X \times X)$. Assume (1.6) and (1.8). Take $\lambda = T^{-1}$, $\epsilon = T^{-\eta}$ with $0 < \eta \le \infty$. If $1 < q \le 2$, then the index $\Lambda$ for the learning rate (1.11) is $\frac{1}{q+w}\min\{\eta,\ \beta,\ \frac{1}{2-\theta}\}$.

Remark 3. When $\eta = \infty$, the corresponding threshold $\epsilon$ is $0$ and, for $q = 2$, it is the least squares problem, which is widely discussed in [15, 16]. If $\rho$ has an $\infty$-average type $w$ with $w > 0$ and $f_q \in \mathcal{H}_K$, the learning rate is $\|\pi(f_{\mathbf{z}}^\epsilon) - f_q\|_{L^{2+w}_{\rho_X}} = O(T^{-\frac{1}{2(1+w)}})$ for the least squares. It follows that the error $\|\pi(f_{\mathbf{z}}^\epsilon) - f_q\|^2_{L^2_{\rho_X}} = O(T^{-\frac{1}{1+w}})$ by $\|\cdot\|_{L^2_{\rho_X}} \le \|\cdot\|_{L^{2+w}_{\rho_X}}$. Thus, it can be near the optimal rate $O(T^{-1})$ in the $L^2_{\rho_X}$ space if $w$ is small enough.

When $1 < q < 2$, the learning error will be $O\big(T^{-\frac{1}{q+w}\min\{\beta,\ \frac{q+w}{2(q+w-1)},\ \frac{p+1}{p+2}\}}\big)$ with the choice $\eta \ge \beta$, depending only on the approximation ability (1.6) of $\mathcal{H}_K$ and the noise condition (1.8). In particular, when $q$ goes to $1$, it is the quantile regression [17, 18] and the best rate in this paper is $O(T^{-\frac{1}{1+w}})$ if $\rho$ has an $\infty$-average type $w$ with $0 < w \le 1$ and $f_q \in \mathcal{H}_K$.
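Because the exponent $\Lambda$ in (1.11) is defined through nested minima and maxima, a small numerical helper is convenient for evaluating it at concrete parameters. The Python sketch below merely transcribes $\theta$ and $r$ from (1.10), $\vartheta$, $\Lambda$ and condition (1.12) as stated above; the parameter values in the example are illustrative and reproduce the least squares exponent $\frac{1}{2(1+w)}$ discussed in Remark 3.

```python
import numpy as np

def rate_index(q, w, p, beta, k, alpha, eta, xi=1e-6):
    """Compute theta, r from (1.10) and the exponent Lambda of (1.11),
    together with the auxiliary quantity vartheta and the check of (1.12).
    p = np.inf and eta = np.inf are allowed, as in the statement of Theorem 2."""
    theta = min(2.0 / (q + w), p / (p + 1.0) if np.isfinite(p) else 1.0)
    r = (q + w) if not np.isfinite(p) else p * (q + w) / (p + 1.0)
    vartheta = max(
        (alpha - eta) / 2.0,
        alpha * (1.0 - beta) / 2.0,
        alpha / 2.0 + q * (1.0 - beta) * alpha / 4.0 - 0.5,
        alpha / 2.0 - 1.0 / (2.0 * (2.0 - theta)),
        (alpha * (2.0 + k - theta) - 1.0) * (1.0 + k)
            / ((2.0 + k - theta) * (2.0 + k)) + xi,
        0.0,
    )
    assert vartheta < (1.0 + k) / (k * (2.0 + k - theta)), "condition (1.12) fails"
    Lambda = (1.0 / (q + w)) * min(
        eta,
        alpha * beta,
        1.0 - q * (1.0 - beta) * alpha / 2.0,
        1.0 / (2.0 - theta),
        1.0 / (2.0 + k - theta) - k * vartheta / (1.0 + k),
    )
    return theta, r, Lambda

if __name__ == "__main__":
    # Remark 3, least squares case: q = 2, infinity-average type w, f_q in H_K
    # (beta = 1), a C-infinity kernel (small k), lambda = 1/T (alpha = 1), eta = inf.
    w = 0.5
    theta, r, Lam = rate_index(q=2.0, w=w, p=np.inf, beta=1.0, k=1e-6,
                               alpha=1.0, eta=np.inf)
    print(Lam, 1.0 / (2.0 * (1.0 + w)))   # both approximately 1/(2(1+w)) = 1/3
```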
2 Comparison and Perturbation Theorem

The approximation or learning ability of a regularized algorithm for regression problems can usually be studied by estimating the excess generalization error $\mathcal{E}(f) - \mathcal{E}(f_q)$ for the learned function $f_{\mathbf{z}}^\epsilon$ from the algorithm (1.4). However, the following comparison theorem yields bounds for the error $\|f - f_q\|_{L^r_{\rho_X}}$ in the space $L^r_{\rho_X}$ when the noise condition is satisfied.

Theorem 3. If $\rho$ has a $p$-average type $w$, then for any measurable function $f: X \to [-1, 1]$ we have the inequality
$$\|f - f_q\|_{L^r_{\rho_X}} \le C_r \big(\mathcal{E}(f) - \mathcal{E}(f_q)\big)^{\frac{1}{q+w}} \tag{2.1}$$
where the constant $C_r = 2^{\frac{q-1}{q+w}} q^{-\frac{1}{q+w}} (q+w)^{\frac{1}{q+w}} \|(b a^w)^{-1}\|^{\frac{1}{q+w}}_{L^p_{\rho_X}}$.

Proof. For a measurable function $f: X \to [-1, 1]$, the generalization error $\mathcal{E}(f)$ is rewritten as $\mathcal{E}(f) = \int_X C_{q,x}(f(x))\, d\rho_X$ where
$$C_{q,x}(t) = \int_Y \psi_q(y - t)\, d\rho_x(y) = \int_{y > t} (y - t)^q\, d\rho_x(y) + \int_{y < t} (t - y)^q\, d\rho_x(y), \qquad x \in X.$$
Denote by $t^*_x$ the minimizer of $C_{q,x}(t)$ over $t \in \mathbb{R}$. It is obvious that the minimizer $f_q(x)$ of $\mathcal{E}(f)$ takes the value $t^*_x$ for each $x \in X$. Noting that the conditional distribution $\rho_x(\cdot)$ is supported on $[-\frac{1}{2}, \frac{1}{2}]$, the minimizer $t^*_x$ lies in $[-\frac{1}{2}, \frac{1}{2}]$. Consider the case $q > 1$. Since the loss function $\psi_q$ is differentiable and $|\frac{d\psi_q(y-t)}{dt}| \le q|y - t|^{q-1} \le q$ for all $y, t \in [-\frac{1}{2}, \frac{1}{2}]$, by a corollary of the Lebesgue dominated convergence theorem we can exchange the order of integration and differentiation of $C_{q,x}(t)$ as $C'_{q,x}(t) = \frac{d}{dt}\int_Y \psi_q(y - t)\, d\rho_x(y) = \int_Y \frac{d\psi_q(y-t)}{dt}\, d\rho_x(y)$. This together with the fact $C'_{q,x}(t^*_x) = 0$, $\forall x \in X$, gives
$$C'_{q,x}(t^*_x) = q\int_{y < t^*_x} (t^*_x - y)^{q-1}\, d\rho_x(y) - q\int_{y > t^*_x} (y - t^*_x)^{q-1}\, d\rho_x(y) = 0,$$
which means that
$$\int_{y < t^*_x} (t^*_x - y)^{q-1}\, d\rho_x(y) = \int_{y > t^*_x} (y - t^*_x)^{q-1}\, d\rho_x(y). \tag{2.2}$$
Let $t^*_x = 0$ for simplicity; then we have $C_{q,x}(t) - C_{q,x}(0) = \int_0^t C'_{q,x}(s)\, ds$, $\forall t > 0$. Note that for $s > 0$,
$$C'_{q,x}(s) = q\Big(\int_{y < s} (s - y)^{q-1}\, d\rho_x(y) - \int_{y > s} (y - s)^{q-1}\, d\rho_x(y)\Big) = q\Big(\int_{y < 0} (s - y)^{q-1}\, d\rho_x(y) + \int_{0 \le y < s} (s - y)^{q-1}\, d\rho_x(y) - \int_{y > s} (y - s)^{q-1}\, d\rho_x(y)\Big) \ge q\Big(\int_{y < 0} (-y)^{q-1}\, d\rho_x(y) + \int_{0 \le y < s} (s - y)^{q-1}\, d\rho_x(y) - \int_{y > s} (y - s)^{q-1}\, d\rho_x(y)\Big).$$
The first term above together with (2.2) then gives
$$C'_{q,x}(s) \ge q\Big(\int_{y > 0} y^{q-1}\, d\rho_x(y) + \int_{0 \le y < s} (s - y)^{q-1}\, d\rho_x(y) - \int_{y > s} (y - s)^{q-1}\, d\rho_x(y)\Big) \ge q\Big(\int_{y > 0} y^{q-1}\, d\rho_x(y) + \int_{0 \le y < s} (s - y)^{q-1}\, d\rho_x(y) - \int_{y > s} y^{q-1}\, d\rho_x(y)\Big) = q\Big(\int_{0 < y \le s} y^{q-1}\, d\rho_x(y) + \int_{0 \le y < s} (s - y)^{q-1}\, d\rho_x(y)\Big) = q\int_{0 \le y \le s} \big(y^{q-1} + (s - y)^{q-1}\big)\, d\rho_x(y) \ge 2^{1-q} q\, s^{q-1} \rho_x(\{y: 0 \le y \le s\}).$$
Thus,
$$C_{q,x}(t) - C_{q,x}(0) \ge 2^{1-q} q \int_0^t s^{q-1} \rho_x(\{y: 0 \le y \le s\})\, ds.$$
Let us first consider the case $t \in [0, a(x)]$. Noting the noise condition (1.8) and $a(x) \le 1$, we obtain
$$C_{q,x}(t) - C_{q,x}(0) \ge 2^{1-q} q \int_0^t s^{q-1} b(x) s^w\, ds = \frac{2^{1-q} q}{q+w}\, b(x) t^{q+w} \ge \frac{2^{1-q} q}{q+w}\, b(x) a(x)^w t^{q+w}.$$
For the second case $t \in [a(x), 1]$, we have
$$C_{q,x}(t) - C_{q,x}(0) \ge 2^{1-q} q \Big(\int_0^{a(x)} s^{q-1} \rho_x(\{y: 0 \le y \le s\})\, ds + \int_{a(x)}^t s^{q-1} \rho_x(\{y: 0 \le y \le s\})\, ds\Big) \ge 2^{1-q} q \Big(\int_0^{a(x)} s^{q-1} b(x) s^w\, ds + \int_{a(x)}^t s^{q-1} b(x) a(x)^w\, ds\Big) = 2^{1-q} q \Big(\frac{b(x) a(x)^w t^q}{q} - \frac{w\, b(x) a(x)^{q+w}}{q(q+w)}\Big) \ge 2^{1-q} \Big(b(x) a(x)^w t^q - \frac{w\, b(x) a(x)^w t^q}{q+w}\Big) = \frac{2^{1-q} q}{q+w}\, b(x) a(x)^w t^q \ge \frac{2^{1-q} q}{q+w}\, b(x) a(x)^w t^{q+w}.$$
In general, we see that for any $0 < t \le 1$,
$$C_{q,x}(t) - C_{q,x}(0) \ge \frac{2^{1-q} q}{q+w}\, b(x) a(x)^w t^{q+w}. \tag{2.3}$$
Similarly, if $-1 \le t < 0$, we also have
$$C_{q,x}(t) - C_{q,x}(0) \ge \frac{2^{1-q} q}{q+w}\, b(x) a(x)^w |t|^{q+w}. \tag{2.4}$$
Applying the two inequalities (2.3) and (2.4) with $t = f(x)$ and $t^*_x = f_q(x)$, we have
$$|f(x) - f_q(x)|^{q+w} \le 2^{q-1} q^{-1} (q+w) \big(b(x) a(x)^w\big)^{-1} \big(C_{q,x}(f(x)) - C_{q,x}(f_q(x))\big).$$
Raising to the power $\frac{p}{p+1}$ and integrating,
$$\int_X |f(x) - f_q(x)|^{\frac{p(q+w)}{p+1}}\, d\rho_X \le 2^{\frac{p(q-1)}{p+1}} q^{-\frac{p}{p+1}} (q+w)^{\frac{p}{p+1}} \int_X \big[(b(x) a(x)^w)^{-1}\big]^{\frac{p}{p+1}} \big(C_{q,x}(f(x)) - C_{q,x}(f_q(x))\big)^{\frac{p}{p+1}}\, d\rho_X.$$
This together with the Hölder inequality $\|\cdot\|_{L^1_{\rho_X}} \le \|\cdot\|_{L^{p^*}_{\rho_X}} \|\cdot\|_{L^{q^*}_{\rho_X}}$, $\frac{1}{p^*} + \frac{1}{q^*} = 1$, applied with $p^* = p + 1$ and $q^* = \frac{p+1}{p}$, yields
$$\int_X |f(x) - f_q(x)|^{\frac{p(q+w)}{p+1}}\, d\rho_X \le 2^{\frac{p(q-1)}{p+1}} q^{-\frac{p}{p+1}} (q+w)^{\frac{p}{p+1}} \|(b a^w)^{-1}\|^{\frac{p}{p+1}}_{L^p_{\rho_X}} \big(\mathcal{E}(f) - \mathcal{E}(f_q)\big)^{\frac{p}{p+1}}.$$
Then the desired conclusion (2.1) holds. For $q = 1$, (2.1) also holds and the proof can be found in [18].

Theorem 3 yields a variance-expectation bound which will be applied in the next section.

Lemma 1. Under the same conditions as Theorem 3, for any measurable function $f: X \to [-1, 1]$, we have the inequality
$$E\big\{\big(\psi_q(f(x) - y) - \psi_q(f_q(x) - y)\big)^2\big\} \le C_\theta \big(\mathcal{E}(f) - \mathcal{E}(f_q)\big)^\theta \tag{2.5}$$
where the power index $\theta$ is defined as in (1.10) and $C_\theta = C_r^2 + 2^{2-r}\big(1 + \|f_q\|_\infty^{2-r}\big) C_r^r$.

Proof. By the continuity of $\psi_q(u)$ and $|y| \le \frac{1}{2}$, we see that
$$\big|\psi_q(f(x) - y) - \psi_q(f_q(x) - y)\big| \le q\big(\|f\|_\infty^{q-1} + \|f_q\|_\infty^{q-1} + 1\big) |f(x) - f_q(x)| \le q\big(2 + \|f_q\|_\infty^{q-1}\big) |f(x) - f_q(x)|.$$
It implies that
$$E\big\{\big(\psi_q(f(x) - y) - \psi_q(f_q(x) - y)\big)^2\big\} \le q^2 \big(\|f_q\|_\infty^{q-1} + 2\big)^2 E|f(x) - f_q(x)|^2.$$