The cumulative Kolmogorov filter for model-free screening in ultrahigh dimensional data

Arlene K. H. Kim$^1$ and Seung Jun Shin$^2$

$^1$University of Cambridge and $^2$Korea University

Abstract

We propose a cumulative Kolmogorov filter to improve the fused Kolmogorov filter proposed by Mai and Zou (2015) via cumulative slicing. We establish an improved asymptotic result under relaxed assumptions and numerically demonstrate its enhanced finite sample performance.

Keywords: cumulative slicing; Kolmogorov filter; model-free marginal screening

1 Introduction

Since Fan and Lv (2008), marginal feature screening has been regarded as a canonical tool in ultrahigh-dimensional data analysis. Let $Y$ be a univariate response and $X = (X_1,\ldots,X_p)^T$ be a $p$-dimensional covariate. We assume that only a small subset of covariates is informative for explaining $Y$. In particular, we assume $|S^*| = d \ll p$, where
\[
S^* = \{\, j : F(y \mid X) \text{ functionally depends on } X_j \text{ for some } y \,\}, \tag{1}
\]
with $F(\cdot \mid X)$ being the conditional distribution function of $Y$ given $X$. Such an assumption is reasonable since including a large number of variables with weak signals often deteriorates model performance due to accumulated estimation errors. Since the introduction of Fan and Lv (2008), numerous marginal screening methods have been developed (see Section 1 of Mai and Zou (2015) for a comprehensive summary). Among these methods, model-free screening (Zhu et al., 2011; Li et al., 2012; Mai and Zou, 2015) is desirable since screening is a pre-processing step that precedes the main statistical analysis.

For feature selection in binary classification, the Kolmogorov filter (KF) was proposed by Mai and Zou (2012). For each $X_j$, $j = 1,\ldots,p$, KF computes
\[
\kappa_j = \sup_x \bigl| P(X_j \le x \mid Y = 1) - P(X_j \le x \mid Y = -1) \bigr|, \qquad j = 1,\ldots,p,
\]
and selects the variables with large $\kappa_j$'s among all $j = 1,\ldots,p$. A sample version of $\kappa_j$ is obtained by replacing the probability measure with its empirical counterpart, leading to the well-known Kolmogorov–Smirnov statistic, from which the filter takes its name. KF shows impressive performance in binary classification.

Recently, Mai and Zou (2015) extended the idea of KF beyond the binary response by slicing the data into $G$ pieces depending on the value of $Y$. In particular, a pseudo response $\tilde Y$ taking the value $g$ if $Y \in (a_{g-1}, a_g]$ for $g = 1,\ldots,G$ is defined for a given set of knots $\mathcal{G} = \{(-\infty =)\, a_0 < a_1 < \cdots < a_G \,(= \infty)\}$. Following the spirit of KF, one can select a set of variables with large values of
\[
\kappa_j^{\mathcal{G}} = \max_{l,m} \sup_x \bigl| P(X_j \le x \mid \tilde Y = m) - P(X_j \le x \mid \tilde Y = l) \bigr|, \qquad j = 1,\ldots,p. \tag{2}
\]
However, information loss is inevitable due to the lower resolution of the pseudo response $\tilde Y$ compared to $Y$, regardless of the choice of $\mathcal{G}$. To tackle this, Mai and Zou (2015) proposed the fused Kolmogorov filter (FKF), which combines $\kappa_j^{\mathcal{G}}$ over $N$ different sets of knots $\mathcal{G}_1,\ldots,\mathcal{G}_N$ and selects variables with large values of $\kappa_j^{\mathrm{fused}} = \sum_{\ell=1}^N \kappa_j^{\mathcal{G}_\ell}$, for $j = 1,\ldots,p$. The source of improvement in FKF is clear; however, it cannot perfectly overcome the information-loss problem caused by slicing. In addition, it is subtle to decide how to slice the data in a finite sample. To this end, we propose the cumulative Kolmogorov filter (CKF). CKF minimizes the information loss from the slicing step and is free from the choice of slices. As a consequence, it enhances the FKF.

2 Cumulative Kolmogorov filter

We let $F(\cdot \mid X_j)$ denote the conditional distribution function of $Y$ given $X_j$. Given $x$ such that $0 < P(X_j \le x) < 1$, define
\[
k_j(x) = \sup_y \bigl| F(y \mid X_j > x) - F(y \mid X_j \le x) \bigr|, \qquad j = 1,\ldots,p. \tag{3}
\]
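For intuition, the quantity in (3) has a simple empirical counterpart: dichotomize $X_j$ at $x$ and compute the two-sample Kolmogorov–Smirnov distance between the responses in the two resulting groups. The short sketch below is not part of the original paper; the helper name `k_hat` and the use of `scipy.stats.ks_2samp` are our own choices, and it coincides with the sample version $\hat k_j(x)$ introduced later in this section.

```python
import numpy as np
from scipy.stats import ks_2samp

def k_hat(y, xj, x):
    """Empirical analogue of k_j(x) in (3): the two-sample KS distance between
    the responses of the groups {X_j > x} and {X_j <= x}.
    y, xj: 1-d numpy arrays of responses and of the j-th covariate.
    Illustrative sketch only; interface and names are ours, not the authors'."""
    hi = y[xj > x]   # responses whose covariate lies above the cut point
    lo = y[xj <= x]  # responses whose covariate lies at or below the cut point
    if hi.size == 0 or lo.size == 0:
        return 0.0   # cut point outside the admissible range in (3); return 0 by convention
    return ks_2samp(hi, lo).statistic
```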
We remark that (3) is identical to (2) with $\mathcal{G} = \{-\infty, x, \infty\}$, except that the sliced variable in (3) is $X_j$ instead of $Y$. The choice of the slicing variable between $X_j$ and $Y$ is not crucial; however, it is more natural to slice the independent variable in a regression setup whose target is $E(Y \mid X)$. Now,
\[
k_j(x) = \frac{1}{P(X_j \le x)\{1 - P(X_j \le x)\}} \sup_y \bigl| P(X_j \le x) P(Y \le y) - P(Y \le y, X_j \le x) \bigr|,
\]
which immediately yields that $k_j(x) = 0$ for all $x$ satisfying $0 < P(X_j \le x) < 1$ if and only if $X_j$ and $Y$ are independent. In fact, $k_j(x)$ indicates the level of dependence, as shown in the following lemma.

Lemma 2.1 Suppose $(X_j, Y)$ has a bivariate Gaussian copula distribution; that is, $(g_1(X_j), g_2(Y))$ is jointly normal with correlation $\rho_j = \mathrm{Cor}(g_1(X_j), g_2(Y))$ after transformation via two monotone functions $g_1, g_2$, and $g_1(X_j)$ and $g_2(Y)$ are marginally standard normal. Then

1. $k_j(x) = 1$ if $|\rho_j| = 1$ and $k_j(x) = 0$ if $\rho_j = 0$.

2. Denoting $y^* = x\bigl(1 - \sqrt{1 - \rho_j^2}\bigr)/\rho_j$,
\[
k_j(x) = \frac{1}{\Phi(x)\{1 - \Phi(x)\}} \left| \int_{-\infty}^{y^*} \Phi\!\left( \frac{x - \rho_j u}{\sqrt{1 - \rho_j^2}} \right) \phi(u)\, du - \Phi(x)\Phi(y^*) \right|.
\]

3. For each $x$, $k_j(x)$ is a strictly increasing function of $|\rho_j|$.

Nonetheless, (3) loses a great deal of information through the dichotomization of $X_j$. To overcome this, we define
\[
K_j = E\bigl[ k_j(\tilde X_j) \bigr], \qquad j = 1,\ldots,p, \tag{4}
\]
where $\tilde X_j$ denotes an independent copy of $X_j$. At the population level, (4) fuses infinitely many KFs over all possible dichotomizations of $X_j$. By doing this, we not only minimize the efficiency loss but also become free from the choice of knot sets. A similar idea was first proposed by Zhu et al. (2010) in the context of sufficient dimension reduction, where the slicing scheme has been regarded as a canonical approach.

Given $(Y_i, X_i)$, $i = 1,\ldots,n$, where $X_i = (X_{i1},\ldots,X_{ip})^T$, a sample version of (3) is
\[
\hat k_j(x) = \sup_y \bigl| \hat F(y \mid X_j > x) - \hat F(y \mid X_j \le x) \bigr|,
\quad \text{where} \quad
\hat F(y \mid X_j > x) = \frac{\sum_{i=1}^n \mathbf{1}\{Y_i \le y,\, X_{ij} > x\}}{\sum_{i=1}^n \mathbf{1}\{X_{ij} > x\}}
\]
and $\hat F(y \mid X_j \le x)$ is similarly defined. Now, an estimator of (4) is given by
\[
\hat K_j = \frac{1}{n} \sum_{i=1}^n \hat k_j(X_{ij}). \tag{5}
\]
Finally, for $d_n \in \mathbb{N}$, we propose the CKF that selects the set
\[
\hat S(d_n) = \{\, j : \hat K_j \text{ is among the first } d_n \text{ largest of all } \hat K_j,\ j = 1,\ldots,p \,\}.
\]

3 The Sure Screening Property

We assume a regularity condition.

Assumption 3.1 There exists a nondegenerate set $S$ such that $S^* \subseteq S$ and
\[
\Delta_S = \min_{j \in S} K_j - \max_{j \notin S} K_j > 0.
\]

Assumption 3.1 is similar to the regularity condition (C1) for FKF (Mai and Zou, 2015). In fact, FKF requires one additional condition guaranteeing that the estimated slices are not very different from oracle slices based on population quantiles of $Y$; this is not necessary for CKF since it is free from the choice of slices. KF with a binary response requires only one assumption similar to Assumption 3.1.

Theorem 3.2 Under Assumption 3.1, when $d_n \ge |S|$ and $\Delta_S > 4/n$,
\[
P\bigl(S^* \subset \hat S(d_n)\bigr) \ge 1 - \eta,
\quad \text{where} \quad
\eta = p\bigl( 4n \exp(-n\Delta_S^2/128) + 2\exp(-n\Delta_S^2/16) \bigr).
\]
This probability tends to 1 when $\Delta_S \gg \sqrt{\log(pn)/n}$.

The sure screening probability converges to one when $\Delta_S \gg \{\log(pn)/n\}^{1/2}$, which is more relaxed than FKF, which requires $\Delta_S$ to be greater than $\{\log n \, \log(pN\log n)/n\}^{1/2}$. Theorem 3.2 demonstrates that CKF indeed improves FKF by minimizing the efficiency loss entailed by the slicing step.
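To make the estimator in (5) and the selection rule $\hat S(d_n)$ concrete before turning to simulations, the following minimal sketch performs the CKF screening step. It is our own illustration rather than the authors' code: it reuses the hypothetical `k_hat` helper sketched in Section 2, makes no attempt at efficiency, and simply averages the dichotomized KS statistics over the observed cut points before keeping the $d_n$ top-ranked covariates.

```python
import numpy as np

def ckf_screen(Y, X, d_n):
    """Cumulative Kolmogorov filter screening, sketched after (5):
    K_hat[j] averages k_hat(Y, X[:, j], x) over the observed cut points x = X_{ij},
    and the d_n covariates with the largest scores form S_hat(d_n).
    Sketch only; names and interface are ours, and k_hat is the helper above."""
    n, p = X.shape
    K_hat = np.empty(p)
    for j in range(p):
        xj = X[:, j]
        # average of the dichotomized KS statistics over all observed cut points
        K_hat[j] = np.mean([k_hat(Y, xj, x) for x in xj])
    selected = np.argsort(K_hat)[::-1][:d_n]  # indices of the d_n largest scores
    return selected, K_hat
```

By Theorem 3.2, keeping any $d_n \ge |S|$ of the top-ranked covariates retains $S^*$ with probability tending to one once $\Delta_S \gg \{\log(pn)/n\}^{1/2}$.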
4 A simulation study

4.1 A toy example

Consider a simple regression model $Y = \beta X + \epsilon$, where $X$ and $\epsilon$ are independent $N(0,1)$ variables. In this setting, (5) can be viewed as a statistic for testing $H_0 : \beta = 0$. To demonstrate the performance of CKF, we compare its power to (i) $\hat\kappa^{\mathrm{binary}} = \sup_y |\hat F(y \mid X > \mathrm{median}\{X\}) - \hat F(y \mid X \le \mathrm{median}\{X\})|$ and (ii) $\sum_{\ell=1}^4 \hat\kappa^{\mathcal{G}_\ell}/4$ with four equally spaced knot sets of sizes 3, 4, 5, and 6, as suggested by Mai and Zou (2015). Figure 1 depicts the numerically computed power functions of the three methods at significance level $\alpha = 0.05$. As expected, CKF (5) performs best while the simplest $\hat\kappa^{\mathrm{binary}}$ performs worst, which echoes the fact that screening performance can be improved by minimizing the information loss entailed by the slicing step, and CKF indeed achieves this.

[Figure 1: Power functions of the binary, fused, and cumulative Kolmogorov filters as functions of $\beta$ under $\alpha = 0.05$. CKF shows a clear improvement.]

4.2 Comparison to other screening methods

We consider the following nine models with $(n, p) = (200, 5000)$ and $\epsilon \sim N(0,1)$ independent of $X$:

1. $U(Y) = T(X)^T\beta + \epsilon$, where $\beta = (2.8 \times \mathbf{1}_2^T, \mathbf{0}_{p-2}^T)^T$ and $T(X) \sim N_p(\mathbf{0}_p, \Sigma)$ with $\Sigma = \mathrm{CS}(0.7)$, a compound symmetry correlation matrix with correlation coefficient 0.7. Let $U(Y) = Y$ and $T(X) = X$.

2. $T(X) = X^{1/9}$ and the other settings are the same as in Model 1.

3. $U(Y) = Y^{1/9}$ and the other settings are the same as in Model 1.

4. $U(Y) = T(X)^T\beta + \epsilon$, where $\beta = (0.8 \times \mathbf{1}_{10}^T, \mathbf{0}_{p-10}^T)^T$ and $T(X) \sim N_p(\mathbf{0}_p, \Sigma)$ with $\Sigma = \mathrm{AR}(0.7)$, an autoregressive correlation matrix with autoregressive coefficient 0.7. Let $U(Y) = Y$ and $T(X) = X$.

5. $T(X) = \frac{1}{2}\log(X)$ and the other settings are the same as in Model 4.

6. $U(Y) = \log(Y)$ and the other settings are the same as in Model 4.

7. $Y = (X_1 + X_2 + 1)^3 + \epsilon$, where the $X_j$ are i.i.d. Cauchy.

8. $Y = 4X_1 + 2\tan(\pi X_2/2) + 5X_3 + \epsilon$, where the $X_j$ are independently $U(0,1)$.

9. $Y = 2(X_1 + 0.8X_2 + 0.6X_3 + 0.4X_4 + 0.2X_5) + \exp(X_{20} + X_{21} + X_{22})\,\epsilon$, where $X \sim N(0, \Sigma)$ with $\Sigma = \mathrm{CS}(0.8)$.

To avoid a cutoff selection problem, we report the average number of minimum variables needed to recover all informative ones over 100 independent repetitions; hence, a smaller value implies better performance. Table 1 contains the comparison results against correlation learning (SIS; Fan and Lv, 2008) and distance correlation learning (DCS; Li et al., 2012) as well as FKF. The results clearly show that the proposed CKF has improved performance compared to the others, including FKF.

Table 1: Average number of minimum variables needed to keep all informative ones over 100 independent repetitions. Standard deviations are in parentheses.

Model   d    SIS                  DCS                  FKF               CKF
1       2    2.00 (0.00)          2.00 (0.00)          3.79 (6.28)       2.00 (0.00)
2       2    2038.12 (1348.05)    1985.10 (1460.82)    4.62 (9.14)       2.00 (0.00)
3       2    891.22 (1071.58)     350.88 (794.67)      3.88 (6.96)       2.00 (0.00)
4       10   10.04 (0.20)         10.04 (0.20)         10.26 (1.09)      10.06 (0.24)
5       10   150.10 (351.46)      12.50 (10.42)        10.23 (0.49)      10.11 (0.35)
6       10   1618.50 (1423.11)    927.16 (916.20)      10.81 (4.27)      10.03 (0.17)
7       2    1051.14 (1473.43)    682.47 (965.43)      2.00 (0.00)       2.00 (0.00)
8       3    2980.23 (1494.26)    277.43 (606.47)      9.05 (18.69)      6.66 (11.27)
9       8    3562.30 (1252.76)    231.63 (526.51)      60.84 (126.12)    38.59 (52.58)
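As a concrete reading of the evaluation criterion above, the sketch below computes the minimum model size for a single replication: given a vector of screening scores and the index set of truly informative covariates, it returns how far down the ranked list one must go before all of them are retained. This is our own illustration of the criterion reported in Table 1, not code used by the authors.

```python
import numpy as np

def minimum_model_size(scores, true_idx):
    """Smallest d such that the d top-ranked covariates (largest score first)
    contain every truly informative covariate; this is the quantity averaged
    over repetitions in Table 1. (Our illustrative helper, not the authors'.)"""
    order = np.argsort(scores)[::-1]               # covariate indices, best first
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)   # rank of each covariate
    return int(np.max(ranks[np.asarray(true_idx)]))

# Example: in Model 1 the informative covariates are the first two, so one
# would call minimum_model_size(K_hat, [0, 1]) for each repetition.
```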
5 Discussions

We employ a cumulative slicing technique to extend a screening tool for a binary response to a continuous one. The idea is quite general and can be applied to $t$-test-based screening (Fan and Fan, 2008; Fan and Lv, 2008) as well as to logistic-regression-based screening (Fan and Song, 2010). In addition, it is possible to extend the idea of CKF to a censored response by replacing the empirical distribution function with the Kaplan–Meier estimator.

References

Fan, J. and Fan, Y. (2008). High dimensional classification using features annealed independence rules, The Annals of Statistics 36(6): 2605–2637.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space, Journal of the Royal Statistical Society: Series B 70(5): 849–911.

Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality, The Annals of Statistics 38(6): 3567–3604.

Li, R., Zhong, W. and Zhu, L. (2012). Feature screening via distance correlation learning, Journal of the American Statistical Association 107(499): 1129–1139.

Mai, Q. and Zou, H. (2012). The Kolmogorov filter for variable screening in high-dimensional binary classification, Biometrika 100: 229–234.

Mai, Q. and Zou, H. (2015). The fused Kolmogorov filter: A nonparametric model-free screening method, The Annals of Statistics 43(4): 1471–1497.

Van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes, Springer, New York.

Zhu, L.-P., Li, L., Li, R. and Zhu, L.-X. (2011). Model-free feature screening for ultrahigh-dimensional data, Journal of the American Statistical Association 106(496): 1464–1475.

Zhu, L.-P., Zhu, L.-X. and Feng, Z.-H. (2010). Dimension reduction in regressions through cumulative slicing estimation, Journal of the American Statistical Association 105(492): 1455–1466.

A Proof of Lemma 2.1

Because $k_j$ is invariant under monotone transformations, it suffices to consider the case where $g_1(t) = t$ and $g_2(t) = t$, so that $X_j$ and $Y$ are jointly normal. If $\rho_j = 0$, then $X_j$ is independent of $Y$ and $k_j(x) = 0$. On the other hand, if $\rho_j \ne 0$, then $Y \mid X_j = x \sim N(\rho_j x, 1 - \rho_j^2)$. Let
\[
G(y) := \Phi(x)\Phi(y) - \int_{-\infty}^{y} \Phi\!\left( \frac{x - \rho_j u}{\sqrt{1 - \rho_j^2}} \right) \phi(u)\, du.
\]
Then we have
\[
k_j(x) = \frac{1}{\Phi(x)\{1 - \Phi(x)\}} \sup_y |G(y)|.
\]
Note that
\[
\frac{\partial G}{\partial y} = \Phi(x)\phi(y) - \Phi\!\left( \frac{x - \rho_j y}{\sqrt{1 - \rho_j^2}} \right)\phi(y)
= \phi(y)\left\{ \Phi(x) - \Phi\!\left( \frac{x - \rho_j y}{\sqrt{1 - \rho_j^2}} \right) \right\},
\]
which gives $\partial G/\partial y\,\big|_{y = y^*} = 0$, where $y^* = x\bigl(1 - \sqrt{1 - \rho_j^2}\bigr)/\rho_j$. When $\rho_j < 0$, $G''(y^*) = \phi(y^*)\,\frac{\rho_j}{\sqrt{1 - \rho_j^2}}\,\phi(x) < 0$, and hence $G$ attains its supremum at $y = y^*$. Similarly, when $\rho_j > 0$, $-G$ attains its supremum at $y = y^*$. It follows that
\[
k_j(x) = \frac{1}{\Phi(x)\{1 - \Phi(x)\}} \left| \int_{-\infty}^{y^*} \Phi\!\left( \frac{x - \rho_j u}{\sqrt{1 - \rho_j^2}} \right) \phi(u)\, du - \Phi(x)\Phi(y^*) \right|.
\]
When $\rho_j = 1$, $k_j(x) = \frac{1}{\Phi(x)\{1 - \Phi(x)\}}\,|\Phi(x) - \Phi(x)\Phi(x)| = 1$. When $\rho_j = -1$, $k_j(x) = \frac{1}{\Phi(x)\{1 - \Phi(x)\}}\,|{-\Phi(x)\Phi(-x)}| = 1$.

Now we show that $k_j(x)$ is an increasing function of $|\rho_j|$ by taking the derivative of $k_j(x)$ with respect to $\rho_j$. After some tedious calculations,
\[
\frac{\partial k_j(x)}{\partial \rho_j}
= \frac{\mathrm{sgn}(\rho_j)}{\Phi(x)\{1 - \Phi(x)\}}\, (1 - \rho_j^2)^{-3/2} \int_{-\infty}^{y^*} (\rho_j x - u)\, \phi\!\left( \frac{x - \rho_j u}{\sqrt{1 - \rho_j^2}} \right) \phi(u)\, du
= \frac{\mathrm{sgn}(\rho_j)}{\Phi(x)\{1 - \Phi(x)\}}\, \frac{(1 - \rho_j^2)^{-1/2}}{2\pi}\, \exp\bigl(-x^2 h(\rho_j)\bigr),
\]
where $h(\rho_j) = \bigl(1 - \sqrt{1 - \rho_j^2}\bigr)/\rho_j^2$. Thus, $k_j(x)$ is increasing in $|\rho_j|$ since $h$ is symmetric.

B Proof of Theorem 3.2

Under the event that $\max_{j \in \{1,\ldots,p\}} |\hat K_j - K_j| < \Delta_S$, we have
\[
\hat K_j > K_j - \Delta_S \ge \min_{j \in S} K_j - \Delta_S = \max_{j \notin S} K_j, \qquad j \in S,
\]
\[
\hat K_j < K_j + \Delta_S \le \max_{j \notin S} K_j + \Delta_S = \min_{j \in S} K_j, \qquad j \notin S.
\]
Hence, for any $d_n \ge |S|$, we have $\hat S(|S|) \subset \hat S(d_n)$, which implies $S^* \subseteq \hat S(d_n)$.
On the other hand, by the following Lemma B.1, we have for any $\Delta_S > 4/n$,
\[
P\Bigl( \max_{j \in \{1,\ldots,p\}} |\hat K_j - K_j| < \Delta_S \Bigr)
\ge 1 - \sum_{j=1}^{p} P\bigl( |\hat K_j - K_j| \ge \Delta_S \bigr)
\ge 1 - p\bigl( 4n \exp(-n\Delta_S^2/128) + 2\exp(-n\Delta_S^2/16) \bigr).
\]
It follows that when $\Delta_S \gg \sqrt{\log(pn)/n}$, this probability tends to 1.

Lemma B.1 Consider $K_j$ in (4) and $\hat K_j$ in (5). Then for any $\epsilon > 4/n$,
\[
P\bigl( |\hat K_j - K_j| \ge \epsilon \bigr) \le 4n \exp(-n\epsilon^2/128) + 2\exp(-n\epsilon^2/16).
\]

Proof of Lemma B.1. Without loss of generality, we only need to consider $\epsilon < 1$, since otherwise the probability on the left-hand side is trivially 0. We also assume that all $X_{\ell j}$ are distinct for convenience. First, we use a simple triangle inequality to bound
\[
P\bigl( |\hat K_j - K_j| \ge \epsilon \bigr)
= P\Bigl( \Bigl| \frac{1}{n}\sum_{\ell} \hat k_j(X_{\ell j}) - \frac{1}{n}\sum_{\ell} k_j(X_{\ell j}) + \frac{1}{n}\sum_{\ell} k_j(X_{\ell j}) - E k_j(\tilde X_j) \Bigr| \ge \epsilon \Bigr)
\]
\[
\le P\Bigl( \Bigl| \sum_{\ell} \hat k_j(X_{\ell j}) - \sum_{\ell} k_j(X_{\ell j}) \Bigr| \ge \frac{n\epsilon}{2} \Bigr)
+ P\Bigl( \Bigl| \frac{1}{n}\sum_{\ell} k_j(X_{\ell j}) - E k_j(\tilde X_j) \Bigr| \ge \frac{\epsilon}{2} \Bigr)
:= (i) + (ii). \tag{6}
\]

We first treat the second term (ii). By Bernstein's inequality (e.g., Lemma 2.2.9 of Van der Vaart and Wellner (1996)), and using the fact that each $X_{\ell j}$, $\ell = 1,\ldots,n$, is independent and has the same distribution as $\tilde X_j$, we have
\[
(ii) \le 2\exp\left( -\frac{1}{8}\, \frac{n^2\epsilon^2}{n + n\epsilon/3} \right) \le 2\exp\left( -\frac{1}{16}\, n\epsilon^2 \right),
\]
where the first inequality follows by bounding the variance of each $k_j(X_{\ell j}) - E k_j(\tilde X_j)$ by 1, using the fact that $|k_j(X_{\ell j}) - E k_j(\tilde X_j)| \le 1$ for all $\ell = 1,\ldots,n$.

Now we consider the first term (i) in (6). First note that $|\hat k_j(X_{\ell j}) - k_j(X_{\ell j})| \le 1$ for any $\ell = 1,\ldots,n$. We use this trivial bound for $\ell = \ell'$, where $X_{\ell' j}$ is the maximum of $X_{1j},\ldots,X_{nj}$. Let $\epsilon' := \epsilon/2 - 1/n$ and $\epsilon_\ell := \frac{1}{2}\sqrt{n/\ell}\,\epsilon'$. Using
\[
\sum_{\ell=1}^{n} \epsilon_\ell = \sum_{\ell=1}^{n} \frac{1}{2}\sqrt{\frac{n}{\ell}}\,\epsilon'
\le \frac{\sqrt{n}\,\epsilon'}{2} \int_{0}^{n} x^{-1/2}\, dx \le n\epsilon',
\]
we have by the union bound that
\[
(i) = P\Bigl( \Bigl| \sum_{\ell=1}^{n} \hat k_j(X_{\ell j}) - \sum_{\ell=1}^{n} k_j(X_{\ell j}) \Bigr| \ge \frac{n\epsilon}{2} \Bigr)
\le P\Bigl( \Bigl| \sum_{\ell \ne \ell'} \hat k_j(X_{\ell j}) - \sum_{\ell \ne \ell'} k_j(X_{\ell j}) \Bigr| \ge n\epsilon' \Bigr)
\le \sum_{\ell \ne \ell'} P\Bigl( \bigl| \hat k_j(X_{\ell j}) - k_j(X_{\ell j}) \bigr| \ge \epsilon_{\tilde\ell} \Bigr), \tag{7}
\]
where $\tilde\ell$ corresponds to the rank of $X_{\ell j}$.

We bound (7) from above using ideas similar to those in Lemma A1 of Mai and Zou (2012). By the Dvoretzky–Kiefer–Wolfowitz inequality, for any $x$ in the support of $X_j$,
\[
P\Bigl( \bigl| \hat k_j(x) - k_j(x) \bigr| \ge \epsilon_{\tilde\ell} \,\Big|\, X_{1j},\ldots,X_{nj} \Bigr)
\le 2\exp(-n_+ \epsilon_{\tilde\ell}^2/2) + 2\exp(-n_- \epsilon_{\tilde\ell}^2/2),
\]
where $n_+ = \sum_{i=1}^{n} \mathbf{1}\{X_{ij} > x\}$ and $n_- = \sum_{i=1}^{n} \mathbf{1}\{X_{ij} \le x\}$.
Thus, replacing $x$ by $X_{\ell j}$ and taking the expectation, we have
\[
\sum_{\ell \ne \ell'} P\Bigl( \bigl| \hat k_j(X_{\ell j}) - k_j(X_{\ell j}) \bigr| \ge \epsilon_{\tilde\ell} \Bigr)
\le \sum_{\ell=1}^{n-1} \Bigl( 2\exp\bigl(-(n-\ell)\epsilon_\ell^2/2\bigr) + 2\exp\bigl(-\ell\epsilon_\ell^2/2\bigr) \Bigr).
\]
It follows by symmetry that
\[
(i) \le 4\sum_{\ell=1}^{n-1} \exp\bigl(-\ell\epsilon_\ell^2/2\bigr)
= 4\sum_{\ell=1}^{n-1} \exp\bigl(-n\epsilon'^2/8\bigr)
\le 4n\exp\bigl(-n\epsilon^2/128\bigr),
\]
where the last inequality holds since $\epsilon' = \epsilon/2 - 1/n \ge \epsilon/4$. The proof is complete.