Learning Rates of Regression with q-norm Loss and Threshold

arXiv:1701.01956v1 [math.ST] 8 Jan 2017

Ting Hu
School of Mathematics and Statistics, Wuhan University
Luojia Hill, Wuhan 430072, China, [email protected]

Yuan Yao
School of Mathematical Sciences, Peking University
Beijing 100871, China, [email protected]

Abstract

This paper studies some robust regression problems associated with the $q$-norm loss ($q \ge 1$) and the $\epsilon$-insensitive $q$-norm loss in reproducing kernel Hilbert spaces. We establish a variance-expectation bound under an a priori noise condition on the conditional distribution, which is the key technique for measuring the error bound. Explicit learning rates are given under approximation ability assumptions on the reproducing kernel Hilbert space.

Key Words and Phrases. Insensitive q-norm loss, quantile regression, reproducing kernel Hilbert space, sparsity.

Mathematical Subject Classification. 68Q32, 41A25.

1 Introduction

In this paper we consider regression with the $q$-norm loss $\psi_q$ with $q \ge 1$ and an $\epsilon$-insensitive $q$-norm loss $\psi_q^\epsilon$ (to be defined) with a threshold $\epsilon > 0$. Here $\psi_q$ is the univariate function defined by $\psi_q(u) = |u|^q$. For a learning algorithm generated by a regularization scheme in reproducing kernel Hilbert spaces, learning rates and the approximation error will be presented when $\epsilon$ is chosen appropriately to balance learning rates and sparsity.

For $q = 1$, the regression problem is the classical statistical method of least absolute deviations, which is more robust than the least squares method and is resistant to outliers in the data [4]. Its associated loss $\psi(u) = |u|$, $u \in \mathbb{R}$, is widely used in practical applications for robustness. In fact, for all $q < 2$, the loss $\psi_q$ is less sensitive to outliers and is thus more robust than the square loss. Vapnik [13] proposed an $\epsilon$-insensitive loss $\psi^\epsilon(u): \mathbb{R} \to \mathbb{R}_+$ to obtain sparsity in support vector regression, defined by

$$\psi^\epsilon(u) = \begin{cases} |u| - \epsilon, & \text{if } |u| > \epsilon, \\ 0, & \text{if } |u| \le \epsilon. \end{cases} \qquad (1.1)$$

For fixed $\epsilon > 0$, error analysis was conducted in [12]. Xiang, Hu and Zhou [17, 18] showed how to accelerate learning rates and preserve sparsity by adapting $\epsilon$. In [5], the convergence with a flexible $\epsilon$ in an online algorithm was discussed. For quantile regression with $\epsilon = 0$ and a pinball loss having different slopes on the two sides of the origin in $\mathbb{R}$ [6], Steinwart and Christmann [10, 9] established comparison theorems and derived learning rates under some noise conditions.

In this paper, we apply the $q$-norm loss $\psi_q$ with $q > 1$ to improve the convexity of the insensitive loss $\psi$. Our results show how the insensitive parameter $\epsilon$ that produces the sparsity can be chosen adaptively as a function of the sample size, $\epsilon = \epsilon(T) \to 0$ as $T \to \infty$, to affect the error rates of the learning algorithm (to be defined by (1.4)). Such results include some early studies as special cases.

In the sequel, assume that the input space $X$ is a compact metric space and the output space is $Y = \mathbb{R}$. Let $\rho$ be a Borel probability measure on $Z := X \times Y$, let $\rho_x(\cdot)$ be the conditional distribution of $\rho$ at each $x \in X$, and let $\rho_X$ be the marginal distribution on $X$. For a measurable function $f: X \to Y$, the generalization error $\mathcal{E}(f)$ associated with the $q$-norm loss $\psi_q$ is defined by

$$\mathcal{E}(f) = \int_Z \psi_q(y - f(x))\, d\rho. \qquad (1.2)$$

Denote by $f_q: X \to Y$ the minimizer of the generalization error $\mathcal{E}(f)$ over all measurable functions. Its properties and the corresponding learning problem in the empirical risk minimization framework were discussed in [20]. When $q = 1$, the target function $f_q$ is a function containing the medians of the conditional distributions for all $x \in X$. For symmetric distributions, the median is also the regression function, i.e. the conditional mean given $x$. We aim at learning the minimizer $f_q$ from a sample $z = \{(x_i, y_i)\}_{i=1}^T \in Z^T$, which is assumed to be independently drawn according to $\rho$.
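For readers who want to experiment with these definitions, the following Python sketch (our own illustration on a made-up toy sample, not part of the paper) implements the $q$-norm loss $\psi_q$, Vapnik's $\epsilon$-insensitive loss (1.1), and the empirical counterpart of the generalization error (1.2).

```python
import numpy as np

def psi_q(u, q):
    """q-norm loss psi_q(u) = |u|^q with q >= 1."""
    return np.abs(u) ** q

def psi_eps(u, eps):
    """Vapnik's epsilon-insensitive loss (1.1): max(|u| - eps, 0)."""
    return np.maximum(np.abs(u) - eps, 0.0)

def empirical_error(f, x, y, q):
    """Empirical counterpart (1/T) sum_t psi_q(y_t - f(x_t)) of the
    generalization error (1.2) for a sample drawn from rho."""
    return np.mean(psi_q(y - f(x), q))

# Toy usage on a hypothetical sample (x_i, y_i), i = 1, ..., T.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(200)
print(empirical_error(lambda t: np.sin(np.pi * t), x, y, q=1.5))
```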
Inspired by the $\epsilon$-insensitive loss [13], we introduce an $\epsilon$-insensitive $q$-norm loss $\psi_q^\epsilon$ defined by

$$\psi_q^\epsilon(u) = \begin{cases} (|u| - \epsilon)^q, & \text{if } |u| > \epsilon, \\ 0, & \text{if } |u| \le \epsilon. \end{cases} \qquad (1.3)$$

Our learning task will be carried out by a regularization scheme in reproducing kernel Hilbert spaces. With a continuous, symmetric and positive semidefinite function $K: X \times X \to \mathbb{R}$ (called a Mercer kernel), the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ is defined as the completion of the span of $\{K_x = K(x, \cdot): x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_u \rangle_K = K(x, u)$. The regularization algorithm in this paper takes the form

$$f_z^\epsilon = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{T} \sum_{t=1}^T \psi_q^\epsilon(f(x_t) - y_t) + \lambda \|f\|_K^2 \Big\}. \qquad (1.4)$$

Here $\lambda > 0$ is a regularization parameter. Our learning rates are stated in terms of the approximation or regularization error, noise conditions, and the capacity of the RKHS. Our main goal is to study how the learned function $f_z^\epsilon$ in (1.4) converges to the target function $f_q$.
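By the representer theorem, a minimizer of (1.4) can be sought in the span of $\{K_{x_t}\}_{t=1}^T$. The sketch below is a minimal illustration of one way to compute such a minimizer (our own hypothetical implementation, using a Gaussian kernel and plain subgradient descent on the coefficient vector; the paper itself does not prescribe an optimization method).

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=0.5):
    """Mercer kernel K(x, u) = exp(-(x - u)^2 / (2 sigma^2)) for scalar inputs."""
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2.0 * sigma ** 2))

def psi_q_eps(u, q, eps):
    """epsilon-insensitive q-norm loss (1.3)."""
    return np.where(np.abs(u) > eps, (np.abs(u) - eps) ** q, 0.0)

def fit(x, y, q=1.5, eps=0.05, lam=1e-2, lr=1e-2, iters=2000):
    """Subgradient descent for (1.4) restricted to f = sum_j c_j K(x_j, .),
    so that f(x_t) = (K c)_t and ||f||_K^2 = c^T K c."""
    K = gaussian_kernel(x, x)
    c = np.zeros(len(x))
    for _ in range(iters):
        r = K @ c - y                                   # residuals f(x_t) - y_t
        mag = np.maximum(np.abs(r) - eps, 0.0)
        # (sub)derivative of (|r| - eps)^q outside the insensitive band, 0 inside
        g = np.where(mag > 0.0, q * mag ** (q - 1) * np.sign(r), 0.0)
        c -= lr * (K @ g / len(x) + 2.0 * lam * (K @ c))
    # final value of the regularized empirical objective in (1.4)
    obj = np.mean(psi_q_eps(K @ c - y, q, eps)) + lam * float(c @ K @ c)
    return obj, (lambda t: gaussian_kernel(np.atleast_1d(t), x) @ c)

# obj, f_eps = fit(x, y)   # x, y as in the previous toy sample
```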
There is a large literature [1, 16, 7] in learning theory studying the approximation error or regularization error $\mathcal{D}(\lambda)$ of the triple $(K, \rho, q)$, defined by

$$\mathcal{D}(\lambda) = \min_{f \in \mathcal{H}_K} \big\{ \mathcal{E}(f) - \mathcal{E}(f_q) + \lambda \|f\|_K^2 \big\}, \qquad \lambda > 0.$$

The regularizing function is defined as

$$f_\lambda = \arg\min_{f \in \mathcal{H}_K} \big\{ \mathcal{E}(f) - \mathcal{E}(f_q) + \lambda \|f\|_K^2 \big\}. \qquad (1.5)$$

In the sequel, let $L^p_{\rho_X}$ with $p > 0$ be the space of $p$-integrable functions with respect to $\rho_X$ and $\|\cdot\|_{L^p_{\rho_X}}$ be the norm in $L^p_{\rho_X}$. A usual assumption on the regularization error $\mathcal{D}(\lambda)$, which imposes certain smoothness on $\mathcal{H}_K$, is

$$\mathcal{D}(\lambda) \le \mathcal{D}_0 \lambda^\beta, \qquad \forall \lambda > 0 \qquad (1.6)$$

with some $0 < \beta \le 1$ and $\mathcal{D}_0 > 0$.

Remark 1. Assumption (1.6) always holds with $\beta = 0$. When the target function $f_q \in \mathcal{H}_K$ and $\mathcal{H}_K$ is dense in $C(X)$, the space of bounded continuous functions on $X$, the approximation error satisfies $\mathcal{D}(\lambda) \to 0$ as $\lambda \to 0$. Thus, the decay (1.6) is natural and can be illustrated in terms of interpolation spaces [7]. Define the integral operator $L_K: L^2_{\rho_X} \to L^2_{\rho_X}$ by $L_K(f)(x) = \int_X K(x, y) f(y)\, d\rho_X$, $x \in X$, $f \in L^2_{\rho_X}$, and suppose that the minimizer $f_q$ is in the range of $L_K^\nu$ with $0 < \nu \le \frac{1}{2}$. When $q = 1$, the approximation error $\mathcal{D}(\lambda)$ can be $O(\lambda^{\frac{\nu}{1-\nu}})$ for quantile regression [18]. When $q = 2$, $\mathcal{D}(\lambda) = O(\lambda^{2\nu})$ for least squares. For other $q > 1$, the associated loss $\psi_q$ is Lipschitz on a bounded domain and the corresponding $\mathcal{D}(\lambda)$ can be characterized by the $\mathcal{K}$-functional [1], which can have the same polynomial decay as (1.6).

We assume that the conditional distribution $\rho_x(\cdot)$ is supported on $[-M, M]$, $M > 0$, at each $x$ and is non-degenerate, i.e. any non-empty open subset of $Y$ has strictly positive measure, which ensures that the target function $f_q$ is unique. Without loss of generality, let the support of $\rho_x(\cdot)$ be $[-\frac{1}{2}, \frac{1}{2}]$ at each $x \in X$; our analysis below is applicable for any $M > 0$. We will prove this in the next section. It is natural to project values of the learned function $f_z^\epsilon$ onto some interval by the projection operator [1, 15].

Definition 1. The projection operator $\pi$ on the space of measurable functions $f: X \to \mathbb{R}$ onto the interval $[-1, 1]$ is defined by

$$\pi(f)(x) = \begin{cases} 1, & \text{if } f(x) \ge 1, \\ f(x), & \text{if } -1 < f(x) < 1, \\ -1, & \text{if } f(x) \le -1. \end{cases}$$

To demonstrate our main result in the general case, we first give the following learning rate in the special case when $K$ is $C^\infty$.

Theorem 1. Let $X \subset \mathbb{R}^n$ and $K \in C^\infty(X \times X)$. Assume that $f_q \in \mathcal{H}_K$ with $q > 1$, $\|f_q\|_\infty \le \frac{1}{4}$, and that the conditional distributions $\{\rho_x(\cdot)\}_{x \in X}$ have density functions given by

$$\frac{d\rho_x}{dy}(y) = \begin{cases} A |y - f_q(x)|^\phi, & \text{if } |y - f_q(x)| \le \frac{1}{4}, \\ 0, & \text{otherwise}, \end{cases} \qquad (1.7)$$

where $A = 2^{2\phi+1}(\phi+1)$ and $\phi > 0$. Take $\lambda = \epsilon = T^{-\frac{q+\phi+1}{2(q+\phi)}}$; then for any $0 < \delta < 1$, with confidence $1 - \delta$, we have

$$\|\pi(f_z^\epsilon) - f_q\|_{L^{q+\phi+1}_{\rho_X}} \le C' \sqrt{\log\frac{3}{\delta}}\; T^{-\frac{1}{2(q+\phi)}},$$

where $C'$ is a constant independent of $T$ or $\delta$.

To state our main result in the general case, we need a noise condition on the measure $\rho$ introduced in [9, 10].

Definition 2. Let $0 < p \le \infty$ and $w > 0$. We say that $\rho$ has a $p$-average type $w$ if there exist two functions $b$ and $a$ from $X$ to $\mathbb{R}$ such that $\{b a^w\}^{-1} \in L^p_{\rho_X}$ and, for any $x \in X$ and $s \in (0, a(x)]$, there holds

$$\rho_x(\{y: f_q(x) \le y \le f_q(x) + s\}) \ge b(x) s^w \quad \text{and} \quad \rho_x(\{y: f_q(x) - s \le y \le f_q(x)\}) \ge b(x) s^w. \qquad (1.8)$$

This assumption is satisfied by many common conditional distributions such as Gaussian distributions, Student's t distributions and uniform distributions. In the following, we give an example to illustrate Definition 2 in detail. More examples can be found in [9, 10].

Example 1. Assume that the conditional distributions $\{\rho_x(\cdot)\}_{x \in X}$ are Gaussian distributions with a common variance $\sigma^2$, $\sigma > 0$, i.e. $\frac{d\rho_x}{dy}(y) = \frac{1}{\sqrt{2\pi}\sigma} \exp\{-\frac{(y - u_x)^2}{2\sigma^2}\}$, where $\{u_x\}_{x \in X}$ are the means of the Gaussian distributions $\{\rho_x(\cdot)\}_{x \in X}$. It is not difficult to check that the minimizer $f_q(x)$ takes the value $u_x$ at each $x \in X$; then for any $s \in (0, \sigma]$, there holds

$$\rho_x(\{y: f_q(x) \le y \le f_q(x) + s\}) = \frac{1}{\sqrt{2\pi}\sigma} \int_{f_q(x)}^{f_q(x)+s} \exp\Big\{-\frac{(y - u_x)^2}{2\sigma^2}\Big\}\, dy = \frac{1}{\sqrt{2\pi}\sigma} \int_0^s \exp\Big\{-\frac{y^2}{2\sigma^2}\Big\}\, dy \ge \frac{1}{\sqrt{2\pi}\sigma} \int_0^s \exp\Big\{-\frac{s^2}{2\sigma^2}\Big\}\, dy \ge \frac{e^{-\frac{1}{2}}}{\sqrt{2\pi}\sigma}\, s.$$

Similarly, we also have $\rho_x(\{y: f_q(x) - s \le y \le f_q(x)\}) \ge \frac{e^{-\frac{1}{2}}}{\sqrt{2\pi}\sigma}\, s$. Thus, the measure $\rho$ has an $\infty$-average type $1$.
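As a quick numerical sanity check of Example 1 (our own illustration; the constants $b(x) = e^{-1/2}/(\sqrt{2\pi}\sigma)$ and $a(x) = \sigma$ are those derived above), the snippet below verifies the lower bound in (1.8) with $w = 1$ on a grid of $s \in (0, \sigma]$ for a hypothetical value of $\sigma$.

```python
import numpy as np
from math import erf, exp, pi, sqrt

def gauss_mass(lo, hi, mu, sigma):
    """rho_x-mass of the interval [lo, hi] for a N(mu, sigma^2) conditional distribution."""
    cdf = lambda t: 0.5 * (1.0 + erf((t - mu) / (sigma * sqrt(2.0))))
    return cdf(hi) - cdf(lo)

sigma, mu = 0.3, 0.0                          # hypothetical conditional law N(mu, sigma^2)
b = exp(-0.5) / (sqrt(2.0 * pi) * sigma)      # b(x) from Example 1
for s in np.linspace(1e-3, sigma, 50):
    mass = gauss_mass(mu, mu + s, mu, sigma)  # rho_x({f_q(x) <= y <= f_q(x) + s})
    assert mass >= b * s - 1e-12              # noise condition (1.8) with w = 1, a(x) = sigma
print("(1.8) holds with w = 1 for all tested s in (0, sigma]")
```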
Our error analysis is related to the capacity of the hypothesis space $\mathcal{H}_K$, which is measured by covering numbers.

Definition 3. For a subset $S$ of $C(X)$ and $\varepsilon > 0$, the covering number $\mathcal{N}(S, \varepsilon)$ is the minimal integer $l \in \mathbb{N}$ such that there exist $l$ disks with radius $\varepsilon$ covering $S$.

The covering numbers of the balls $B_R = \{f \in \mathcal{H}_K: \|f\|_K \le R\}$ with $R > 0$ of the RKHS $\mathcal{H}_K$ are well understood in learning theory [22, 23]. In this paper, we assume for some $k > 0$ and $C_k > 0$ that

$$\log \mathcal{N}(B_1, \varepsilon) \le C_k \Big(\frac{1}{\varepsilon}\Big)^k, \qquad \forall \varepsilon > 0. \qquad (1.9)$$

Remark 2. When $X$ is a bounded subset of $\mathbb{R}^n$ and the RKHS $\mathcal{H}_K$ is a Sobolev space $H^m(X)$ with index $m$, it is shown in [22] that condition (1.9) holds true with $k = \frac{2n}{m}$. If the kernel $K$ lies in the smooth space $C^\infty(X \times X)$, then (1.9) is satisfied for an arbitrarily small $k > 0$. Another common way to measure the capacity of $\mathcal{H}_K$ is the empirical covering number [21], which is outside the scope of our discussion in this paper.

Denote

$$\theta = \min\Big\{\frac{2}{q+w}, \frac{p}{p+1}\Big\} \in (0, 1], \qquad r = \frac{p(q+w)}{p+1} > 0. \qquad (1.10)$$

The following learning rates in the general case will be proved in Section 4. One needs to point out that the proof of Theorem 2 is only applicable to the case $q > 1$. However, when $q = 1$, it is a special case of quantile regression and the same learning rates as those of Theorem 2 can be found in [17, 18].

Theorem 2. Suppose that $\rho$ has a $p$-average type $w$ for some $0 < p \le \infty$ and $w > 0$. Assume that the regularization error condition (1.6) is satisfied for some $0 < \beta \le 1$ and that (1.9) holds with $k > 0$. Take $\lambda = T^{-\alpha}$, $\epsilon = T^{-\eta}$ with $0 < \alpha \le 1$, $0 < \eta \le \infty$, and let $\xi > 0$. Then for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\|\pi(f_z^\epsilon) - f_q\|_{L^r_{\rho_X}} \le C_* \log\frac{3}{\xi}\, \sqrt{2\log\frac{3}{\delta}}\; T^{-\Lambda}, \qquad (1.11)$$

where $C_*$ is a constant independent of $T$ or $\delta$,

$$\Lambda = \frac{1}{q+w} \min\Big\{\eta,\ \alpha\beta,\ 1 - \frac{q(1-\beta)\alpha}{2},\ \frac{1}{2-\theta},\ \frac{1}{2+k-\theta} - \frac{k}{1+k}\vartheta\Big\}$$

with

$$\vartheta = \max\Big\{\frac{\alpha-\eta}{2},\ \frac{\alpha(1-\beta)}{2},\ \frac{\alpha}{2} + \frac{q(1-\beta)\alpha}{4} - \frac{1}{2},\ \frac{\alpha}{2} - \frac{1}{2},\ \frac{[\alpha(2+k-\theta)-1](1+k)}{2(2-\theta)(2+k-\theta)(2+k)} + \xi,\ 0\Big\} \ge 0,$$

provided that

$$\vartheta < \frac{1+k}{k(2+k-\theta)}. \qquad (1.12)$$

Corollary 1. Let $X \subset \mathbb{R}^n$ and $K \in C^\infty(X \times X)$. Assume (1.6) and (1.8). Take $\lambda = T^{-1}$ and $\epsilon = T^{-\eta}$ with $0 < \eta \le \infty$. If $1 < q \le 2$, then the index $\Lambda$ for the learning rate (1.11) is $\frac{1}{q+w}\min\{\eta, \beta, \frac{1}{2-\theta}\}$.

Remark 3. When $\eta = \infty$, the corresponding threshold $\epsilon$ is $0$, and for $q = 2$ it is the least squares problem, which is widely discussed in [15, 16]. If $\rho$ has an $\infty$-average type $w$ with $w > 0$ and $f_q \in \mathcal{H}_K$, the learning rate for least squares is $\|\pi(f_z^\epsilon) - f_q\|_{L^{2+w}_{\rho_X}} = O(T^{-\frac{1}{2(1+w)}})$. It follows that the error $\|\pi(f_z^\epsilon) - f_q\|^2_{L^2_{\rho_X}} = O(T^{-\frac{1}{1+w}})$ by $\|\cdot\|_{L^2_{\rho_X}} \le \|\cdot\|_{L^{2+w}_{\rho_X}}$. Thus, it can be near the optimal rate $O(T^{-1})$ in the $L^2_{\rho_X}$ space if $w$ is small enough.

When $1 < q < 2$, the learning error will be $O(T^{-\frac{1}{q+w}\min\{\beta,\, \frac{q+w}{2(q+w-1)},\, \frac{p+1}{p+2}\}})$ with the choice $\eta \ge \beta$, depending only on the approximation ability (1.6) of $\mathcal{H}_K$ and the noise condition (1.8). In particular, when $q$ goes to $1$, this is the quantile regression setting [17, 18], and the best rate in this paper is $O(T^{-\frac{1}{1+w}})$ if $\rho$ has an $\infty$-average type $w$ with $0 < w \le 1$ and $f_q \in \mathcal{H}_K$.
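To make the rate arithmetic in Corollary 1 and Remark 3 concrete, the short script below (our own illustration, with hypothetical parameter values) computes $\theta$ from (1.10) and the resulting exponent $\Lambda$, and checks the least squares rate $T^{-\frac{1}{2(1+w)}}$ claimed in Remark 3.

```python
def rate_exponent(q, w, beta, eta, p=float("inf")):
    """Index Lambda from Corollary 1 (lambda = T^{-1}, 1 < q <= 2):
    Lambda = (1/(q+w)) * min(eta, beta, 1/(2 - theta)),
    with theta = min(2/(q+w), p/(p+1)) as in (1.10)."""
    theta = min(2.0 / (q + w), 1.0 if p == float("inf") else p / (p + 1.0))
    return min(eta, beta, 1.0 / (2.0 - theta)) / (q + w)

# Remark 3: q = 2, infinity-average type w, f_q in H_K (so beta = 1), eta = infinity.
for w in (0.1, 0.5, 1.0):
    lam = rate_exponent(q=2.0, w=w, beta=1.0, eta=float("inf"))
    assert abs(lam - 1.0 / (2.0 * (1.0 + w))) < 1e-12
    print(f"w = {w}: rate O(T^(-{lam:.4f})) in the L^{2 + w} norm")
```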
2 Comparison and Perturbation Theorem

The approximation or learning ability of a regularized algorithm for regression problems can usually be studied by estimating the excess generalization error $\mathcal{E}(f) - \mathcal{E}(f_q)$ for the learned function $f_z^\epsilon$ from the algorithm (1.4). The following comparison theorem then yields bounds for the error $\|f - f_q\|_{L^r_{\rho_X}}$ in the space $L^r_{\rho_X}$ when the noise condition is satisfied.

Theorem 3. If $\rho$ has a $p$-average type $w$, then for any measurable function $f: X \to [-1, 1]$ we have the inequality

$$\|f - f_q\|_{L^r_{\rho_X}} \le C_r \big(\mathcal{E}(f) - \mathcal{E}(f_q)\big)^{\frac{1}{q+w}} \qquad (2.1)$$

where the constant $C_r = 2^{\frac{q-1}{q+w}} q^{-\frac{1}{q+w}} (q+w)^{\frac{1}{q+w}} \|(b a^w)^{-1}\|_{L^p_{\rho_X}}^{\frac{1}{q+w}}$.

Proof. For a measurable function $f: X \to [-1, 1]$, the generalization error $\mathcal{E}(f)$ can be rewritten as $\mathcal{E}(f) = \int_X C_{q,x}(f(x))\, d\rho_X$, where

$$C_{q,x}(t) = \int_Y \psi_q(y - t)\, d\rho_x(y) = \int_{y > t} (y - t)^q\, d\rho_x(y) + \int_{y < t} (t - y)^q\, d\rho_x(y), \qquad x \in X.$$

Denote $t_x^* = \arg\min_{t \in \mathbb{R}} C_{q,x}(t)$. It is obvious that the minimizer $f_q(x)$ of $\mathcal{E}(f)$ takes the value $t_x^*$ for each $x \in X$. Noting that the conditional distribution $\rho_x(\cdot)$ is supported on $[-\frac{1}{2}, \frac{1}{2}]$, the minimizer $t_x^*$ lies in $[-\frac{1}{2}, \frac{1}{2}]$. Consider the case $q > 1$. Since the loss function $\psi_q$ is differentiable and $|\frac{d\psi_q(y-t)}{dt}| \le q|y - t|^{q-1} \le q$ for all $y, t \in [-\frac{1}{2}, \frac{1}{2}]$, by a corollary of the Lebesgue dominated convergence theorem we may exchange the order of integration and differentiation of $C_{q,x}(t)$, i.e. $C'_{q,x}(t) = \frac{d}{dt}\int_Y \psi_q(y - t)\, d\rho_x(y) = \int_Y \frac{d\psi_q(y-t)}{dt}\, d\rho_x(y)$. Together with the fact that $C'_{q,x}(t_x^*) = 0$ for all $x \in X$, we have

$$C'_{q,x}(t_x^*) = q \int_{y < t_x^*} (t_x^* - y)^{q-1}\, d\rho_x(y) - q \int_{y > t_x^*} (y - t_x^*)^{q-1}\, d\rho_x(y) = 0,$$

which means that

$$\int_{y < t_x^*} (t_x^* - y)^{q-1}\, d\rho_x(y) = \int_{y > t_x^*} (y - t_x^*)^{q-1}\, d\rho_x(y). \qquad (2.2)$$

Let $t_x^* = 0$ for simplicity; then $C_{q,x}(t) - C_{q,x}(0) = \int_0^t C'_{q,x}(s)\, ds$ for all $t > 0$. Note that for $s > 0$,

$$\begin{aligned}
C'_{q,x}(s) &= q\Big(\int_{y < s} (s - y)^{q-1}\, d\rho_x(y) - \int_{y > s} (y - s)^{q-1}\, d\rho_x(y)\Big) \\
&= q\Big(\int_{y < 0} (s - y)^{q-1}\, d\rho_x(y) + \int_{0 \le y < s} (s - y)^{q-1}\, d\rho_x(y) - \int_{y > s} (y - s)^{q-1}\, d\rho_x(y)\Big) \\
&\ge q\Big(\int_{y < 0} (-y)^{q-1}\, d\rho_x(y) + \int_{0 \le y < s} (s - y)^{q-1}\, d\rho_x(y) - \int_{y > s} (y - s)^{q-1}\, d\rho_x(y)\Big).
\end{aligned}$$

Applying (2.2) (with $t_x^* = 0$) to the first term above, we get

$$\begin{aligned}
C'_{q,x}(s) &\ge q\Big(\int_{y > 0} y^{q-1}\, d\rho_x(y) + \int_{0 \le y < s} (s - y)^{q-1}\, d\rho_x(y) - \int_{y > s} (y - s)^{q-1}\, d\rho_x(y)\Big) \\
&\ge q\Big(\int_{y > 0} y^{q-1}\, d\rho_x(y) + \int_{0 \le y < s} (s - y)^{q-1}\, d\rho_x(y) - \int_{y > s} y^{q-1}\, d\rho_x(y)\Big) \\
&= q\Big(\int_{0 < y \le s} y^{q-1}\, d\rho_x(y) + \int_{0 \le y < s} (s - y)^{q-1}\, d\rho_x(y)\Big) \\
&= q \int_{0 \le y \le s} \big(y^{q-1} + (s - y)^{q-1}\big)\, d\rho_x(y) \ge 2^{1-q} q\, s^{q-1} \rho_x(\{y: 0 \le y \le s\}).
\end{aligned}$$

Thus,

$$C_{q,x}(t) - C_{q,x}(0) \ge 2^{1-q} q \int_0^t s^{q-1} \rho_x(\{y: 0 \le y \le s\})\, ds.$$

Consider first the case $t \in [0, a(x)]$. By the noise condition (1.8) and $a(x) \le 1$, we obtain

$$C_{q,x}(t) - C_{q,x}(0) \ge 2^{1-q} q \int_0^t s^{q-1} b(x) s^w\, ds = \frac{2^{1-q} q}{q+w} b(x) t^{q+w} \ge \frac{2^{1-q} q}{q+w} b(x) a(x)^w t^{q+w}.$$

For the second case $t \in [a(x), 1]$, we have

$$\begin{aligned}
C_{q,x}(t) - C_{q,x}(0) &\ge 2^{1-q} q \Big(\int_0^{a(x)} s^{q-1} \rho_x(\{y: 0 \le y \le s\})\, ds + \int_{a(x)}^t s^{q-1} \rho_x(\{y: 0 \le y \le s\})\, ds\Big) \\
&\ge 2^{1-q} q \Big(\int_0^{a(x)} s^{q-1} b(x) s^w\, ds + \int_{a(x)}^t s^{q-1} b(x) a(x)^w\, ds\Big) \\
&= 2^{1-q} q \Big(\frac{b(x) a(x)^w t^q}{q} - \frac{w\, b(x) a(x)^{q+w}}{q(q+w)}\Big) \\
&\ge 2^{1-q} q \Big(\frac{b(x) a(x)^w t^q}{q} - \frac{w\, b(x) a(x)^w t^q}{q(q+w)}\Big) = \frac{2^{1-q} q}{q+w} b(x) a(x)^w t^q \ge \frac{2^{1-q} q}{q+w} b(x) a(x)^w t^{q+w}.
\end{aligned}$$

In general, we see that for any $0 < t \le 1$,

$$C_{q,x}(t) - C_{q,x}(0) \ge \frac{2^{1-q} q}{q+w} b(x) a(x)^w t^{q+w}. \qquad (2.3)$$

Similarly, if $-1 \le t < 0$, we also have

$$C_{q,x}(t) - C_{q,x}(0) \ge \frac{2^{1-q} q}{q+w} b(x) a(x)^w |t|^{q+w}. \qquad (2.4)$$

Applying the two inequalities (2.3) and (2.4) with $t = f(x)$ and $t_x^* = f_q(x)$, we have

$$|f(x) - f_q(x)|^{q+w} \le 2^{q-1} q^{-1} (q+w) \big(b(x) a(x)^w\big)^{-1} \big(C_{q,x}(f(x)) - C_{q,x}(f_q(x))\big).$$

Taking the power $\frac{p}{p+1}$ and integrating,

$$\int_X |f(x) - f_q(x)|^{\frac{p(q+w)}{p+1}}\, d\rho_X \le 2^{\frac{p(q-1)}{p+1}} q^{-\frac{p}{p+1}} (q+w)^{\frac{p}{p+1}} \int_X \big[(b(x) a(x)^w)^{-1}\big]^{\frac{p}{p+1}} \big(C_{q,x}(f(x)) - C_{q,x}(f_q(x))\big)^{\frac{p}{p+1}}\, d\rho_X.$$

Combining this with Hölder's inequality $\|\cdot\|_{L^1_{\rho_X}} \le \|\cdot\|_{L^{p^*}_{\rho_X}} \|\cdot\|_{L^{q^*}_{\rho_X}}$, $\frac{1}{p^*} + \frac{1}{q^*} = 1$, applied with $p^* = p+1$ and $q^* = \frac{p+1}{p}$, we obtain

$$\int_X |f(x) - f_q(x)|^{\frac{p(q+w)}{p+1}}\, d\rho_X \le 2^{\frac{p(q-1)}{p+1}} q^{-\frac{p}{p+1}} (q+w)^{\frac{p}{p+1}} \|(b a^w)^{-1}\|_{L^p_{\rho_X}}^{\frac{p}{p+1}} \big(\mathcal{E}(f) - \mathcal{E}(f_q)\big)^{\frac{p}{p+1}}.$$

Then the desired conclusion (2.1) holds. For $q = 1$, (2.1) also holds and the proof can be found in [18].

Theorem 3 yields a variance-expectation bound which will be applied in the next section.

Lemma 1. Under the same conditions as Theorem 3, for any measurable function $f: X \to [-1, 1]$ we have the inequality

$$E\big\{(\psi_q(f(x) - y) - \psi_q(f_q(x) - y))^2\big\} \le C_\theta \big(\mathcal{E}(f) - \mathcal{E}(f_q)\big)^\theta \qquad (2.5)$$

where the power index $\theta$ is defined in (1.10) and $C_\theta = C_r^2 + 2^{2-r}(1 + \|f_q\|_\infty^{2-r}) C_r^r$.

Proof. By the continuity of $\psi_q(u)$ and $|y| \le \frac{1}{2}$, we see that

$$|\psi_q(f(x) - y) - \psi_q(f_q(x) - y)| \le q\big(\|f\|_\infty^{q-1} + \|f_q\|_\infty^{q-1} + 1\big)|f(x) - f_q(x)| \le q\big(2 + \|f_q\|_\infty^{q-1}\big)|f(x) - f_q(x)|.$$

It implies that

$$E\big\{(\psi_q(f(x) - y) - \psi_q(f_q(x) - y))^2\big\} \le q^2\big(\|f_q\|_\infty^{q-1} + 2\big)^2 E|f(x) - f_q(x)|^2.$$
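As a small numerical aside (our own check, not part of the paper), the following snippet verifies on a grid the elementary estimate used in the proof of Theorem 3, namely $y^{q-1} + (s-y)^{q-1} \ge 2^{1-q} s^{q-1}$ for $0 \le y \le s$, which follows from $\max(y, s-y) \ge s/2$.

```python
import numpy as np

# Check y^(q-1) + (s-y)^(q-1) >= 2^(1-q) * s^(q-1) for 0 <= y <= s and q > 1
# on a grid of hypothetical values of q and s.
for q in (1.2, 1.7, 2.0, 3.0):
    for s in np.linspace(0.05, 0.5, 10):
        y = np.linspace(0.0, s, 101)
        lhs = y ** (q - 1.0) + (s - y) ** (q - 1.0)
        assert np.all(lhs >= 2.0 ** (1.0 - q) * s ** (q - 1.0) - 1e-12)
print("lower bound 2^(1-q) s^(q-1) verified on the grid")
```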
