Prediction Error Reduction Function as a Variable Importance Score Ernest Fokou´e∗,† †School of Mathematical Sciences Rochester Institute of Technology 98 Lomb Memorial Drive, Rochester, NY 14623, USA 5 e-mail: [email protected] 1 Abstract: Thispaperintroducesanddevelopsanovelvariableimportancescorefunctioninthecontext 0 ofensemblelearninganddemonstratesitsappealboththeoreticallyandempirically.Ourproposedscore 2 function is simple and more straightforward than its counterpart proposed in the context of random n forest, and by avoiding permutations, it is by design computationally more efficient than the random a forestvariable importance function. Just likethe random forestvariable importance function, our score J handlesbothregressionandclassificationseamlessly.Oneofthedistinctadvantageofourproposedscore is the fact that it offers a natural cut off at zero, with all the positive scores indicating importance and 5 significance,whilethenegativescoresaredeemedindicationsofinsignificance.Anextraadvantageofour 2 proposed score lies in the fact it works very well beyond ensemble of trees and can seamlessly be used withanybaselearnersintherandomsubspacelearningcontext.Ourexamples,bothsimulatedandreal, ] demonstratethatourproposedscoredoescompetemostlyfavorablywiththerandomforestscore. L M AMS 2000 subject classifications:Primary62H30;secondary62H25. Keywordsand phrases:High-dimensional,VariableImportance,RandomSubspaceLearning,Out-of- t. BagError,RandomForest,Largepsmalln,Classification,Regression,EnsembleLearning,BaseLearner. a t s [ 1. Introduction 1 v Consider a data set D ={(x1,y1),··· ,(xn,yn)} where xi is a p-dimensional vector of attributes of potentially 6 differenttypes observableonsome input spacedenotedhere by X,andy arethe responsestakenfromY. We i 1 shall consider various scenarios, but mainly the regression scenario with Y =R and the classification scenario 1 withY ={1,2,··· ,K}.Weconsiderthetaskofbuildingtheestimatorf(·)ofthetruebutunknownunderlying 6 0 f, and seek to build f(·) such that the true error (generalization error) is as small as possible. In this context, b . we shall use the average test error AVTE(·), as our measure of predictive performance, namely 1 b 0 5 1 R 1 m 1 AVTE(f)= ℓ(y(r),f(r)(x(r))), (1.1) R m j j v: b Xr=1 Xj=1 b i X where x(r),y(r) isthe jth observationfromthe testsetatthe rth randomreplicationofthe splitofthe data. j j r (cid:16) (cid:17) a Throughout this paper, we shall use the zero-one loss (1.2) for all our classification tasks. ℓ(yj(r),f(r)(x(jr)))=1{yj(r)6=fb(r)(x(jr))} =(cid:26) 01 oifthyej(rr)w6=isef.(r)(xj(r)) (1.2) b b For regressiontasks, we shall use the squared error loss (1.2), namely ℓ(y(r),f(r)(x(r)))=(y(r)−f(r)(x(r)))2. (1.3) j j j j b b Besides,seekingtheoptimalpredictiveestimatoroff,wealsoseektoselectthemostimportant(useful)predictor variables as a byproduct of our overall learning scheme. Indeed, while accurate prediction is very important in and of itself, it’s often desirable or even crucial in some cases, provide the added description of the importance of the variables involved in the prediction task. The statistical literature is filled with thousands of papers on variable selection and measurement of variable importance. ∗Correspondingauthor 1 imsart-generic ver. 2014/07/30 file: perf-score-version-1.tex date: January 27, 2015 Ernest Fokou´e/PERF Score 2 2. Main result We consider a frameworkwith a p-dimensionalinput space X with typical input vector x=(x ,··· ,x )⊤. We 1 p also consider building different models with different subsets of the p original variables. Let γ = (γ ,··· ,γ )⊤ 1 p denote the p-dimensional indicator such that 1 if x is active in the current model indexed by γ γ = j (2.1) j (cid:26) 0 otherwise. Assume that we are given an ensemble (collection or aggregation)of models, say H ={h(·,γ(1))),h(·,γ(2))),··· ,h(·,γ(B)))} (2.2) where h(·,γ(b))) denotes the function built with only those variables that are active in the bth model of the ensemble (aggregation), and γ(b) =(γ(b),··· ,γ(b)) with 1 p γ(b) = 1 if xj is active in the b-th model of the ensemble (2.3) j (cid:26) 0 otherwise. For instance, we may consider a homogeneous ensemble, i.e, an ensemble in which all the functions are of the samefamily,likethecasewhereallthebaselearnersaremultiplelinearregression(MLR)modelsdifferingbythe variables upon which they are built. Consider a score function score(h(·,γ(b))) used to assess the performance ofmodelindexedbythevariablesactiveinγ(b).Weproposeavariableimportancescoreintheformofafunction that measures the importance of a variable x in terms of the reduction in averagescore j B B 1 1 PERF(x )= score(h(·,γ(b)))− γ(b)score(h(·,γ(b))) (2.4) j B B j Xb=1 j Xb=1 where B is the number of models containing the variable x , specifically B = B 1 . In words, j j j b=1 {γ(b)=1} j P PERF(x )=Average score over all models−Average score over all models with x j j Intuitively, PERF(x ) somewhat measures the impact of variable x . In the way similar to the approachused by j j sports writers to decide the MVP on a team or in a league, PERF(x ) looks at the overall performance of the j wholeensembleandthenforeachvariablex computesthedirectionandmagnitudeofthechangetothatoverall j performance of the ensemble brought by its presence in models. If a variable x is important, then its presence j in any model will cause that model to perform better in the sense of having a lower than common average error (score). The average score of all models containing an important variable x should therefore be lower than the j overall average score. • |PERF(x )| measures the magnitude of the importance/impact. j • sign(PERF(x )) measures the direction of the impact. j • If sign(PERF(x ))=+1 and |PERF(x )| is relatively large, then x is an important variable. j j j • Seamlessly applied to large p small n. • All variables with PERF(x )≤0 are unimportant and can be discarded. j • The PERF(·) score can be used whenever an ensemble H is available along with a suitable score function for each base learner. • This works with any base learner and can be adapted to parametric, nonparametric and semi-parametric models and one can imagine ensembles with any base learners as its atoms. • A great advantage over the traditional variable importance Breiman (2001a), Breiman (2001b) score functions is that the clear cut-off at zero, in the sense that all variables with PERF(x ) > 0 are kept and j all those variables with PERF(x )≤0 are thrown away. j 2.1. PERF score via Random Subspace Learning AnaturalimplementationofPERF(·)canbedoneusingtheubiquitousbootstrapalongwiththerandomsubspace learning scheme. The Out-of-Bag (oob) error in the bagging or random subspace learning context is a good imsart-generic ver. 2014/07/30 file: perf-score-version-1.tex date: January 27, 2015 Ernest Fokou´e/PERF Score 3 (infactexcellent)candidatescorefunction,especiallywhenthegoalifthe selectionofvariablesthatleadto the lowest prediction error. The advantage of using oob as the score lies in the fact that the score is obtained as partof building the ensemble in the randomsubspace learningframework.Consider the trainingset D ={z = i (x⊤,y )⊤, i = 1,··· ,n}, where x⊤ = (x ,··· ,x ) and y ∈ Y are realizations of two random variables X i i i i1 ip i and Y respectively. Let xi,πj = (xi,1,··· ,xi,πj,··· ,xi,d). The permutation πj acts the |D¯(b)|-dimensional jth column of the out-of-bag data matrix. Essentially, π simply permutes the |D¯(b)| elements of the jth column of j the out-of-bag data matrix. Algorithm 1 PERF Score Estimate via Random Subspace Learning 1: procedurePERF Score(B) ⊲ComputingthePERFScorebasedonB baselearners 2: Choose a base learner h(·) ⊲e.g.:Trees,MLR 3: Choose an estimation method ⊲e.g.:RecursivePartitioningorOLS b 4: Initialize all the PERF(xj) and VI(xj) at zero 5: forb=1toB do 6: Draw with replacedment fromDcabootstrapsampleD(b) ={z(1b),···,z(nb)} 7: Draw without replacement from{1,···,p}asubsetV(b)={j(b),···,j(b)}ofdvariables. 1 d 8: Form the indicator vectorγ(b)=(γj(b),···,γp(b))with γ(b) = 1 ifj∈{j1(b),···,jd(b)} j (cid:26) 0 otherwise. 9: Drop unselected variables fromD(b) sothatD(b) isddimensional sub 10: Build the bth base learner h(·,γ(b))basedonD(b) sub 11: Compute score of the bth base learnerh(·,γ(b)) ⊲e.g.Out-of-bagerror b b 1 s(b) =score(h(·,γ(b)))= |D¯(b)| ℓ(yi,h(xi,γ(b))) b zi∈/XD(b) b 12: forj∈V(b) do 13: Generate the permutation of the jth column of D¯(b), namely π j 14: Compute the permutation impacted score sπ(bj)=scoreπj(h(·,γ(b)))= |D¯1(b)| ℓ(yi,h(xi,πj,γ(b))) b zi∈X/D(b) b 15: Compute the bth instance of the importance of xj VI(b)(xj)=s(b)−s(πbj) 16: endfor c 17: endfor 18: UsetheensembleH = h(·,γ(b)),b=1,···,B toformtheestimator n o b PERF(xj)= B1 bX=B1score(h(·,γ(b)))− B1j bX=B1γj(b)score(h(·,γ(b))) (2.5) d b b VI(xj)= B1j bX=B1γj(b)VI(b)(xj) (2.6) c c 19: endprocedure 3. Computational demonstrations 3.1. Simulated Example The dataset in this example is simulated data with different scenarios on the level of correlation among the variables, and the ratio n and p. In this particular example, the true function is f(x)=1+2x +x +3x 3 7 9 imsart-generic ver. 2014/07/30 file: perf-score-version-1.tex date: January 27, 2015 Ernest Fokou´e/PERF Score 4 X9 X9 3 600 500 X8 2 X3 X8 X3 PERF(Xj) 1 X6 X7 X10 Random Forest VI(Xj) 300400 X6 X7 X4 0 X2 X5 200 X2 X4 X5 X10 −1 X1 X11 X12 X13 X14 X15 X16 X17 100 X1 X11 X12 X13 X14 X15 X16 X17 Variable Variable (a) Permutation-freeVariableImportance. (b) Permutation-basedVariableImportance. Fig 1. Variable score for simulated data with high correlation among the variables in low dimension high sample size setting with x ∼ MVN(1 ,Σ ) and ǫ ∼ N(0,1). The dataset in this example is simulated data with different scenarios 9 ρ onthe levelofcorrelationamong the variables,andthe ration andp. Specifically,we simulate data by defining ρ ∈ [0,1), then we generate our predictor variables using a multivariate normal distribution. Throughout this paper, the multivariate Gaussian density will be denoted by φ (x;µ,Σ) p 1 1 φ (x;µ,Σ)= exp − (x−µ)⊤Σ−1(x−µ) (3.1) p (2π)p|Σ| (cid:26) 2 (cid:27) p Furthermore, in order to study the effect of the correlation pattern, we simulate the data using a covariance matrix Σ parameterized by τ and ρ and defined by τΣ where Σ=(σ ) with σ =ρ|i−j|. ij ij 1 ρ ··· ρp−2 ρp−1 ρ 1 ρ ··· ρp−2 .. .. .. .. .. Σ=Σ(τ,ρ)=τ . . . . . ρp−2 ... ρ 1 ρ ρp−1 ρp−2 ··· ρ 1 For simplicity however,we use the first Σ with τ =1 throughout this paper. For the remaining parameters,we use ρ∈{0,0.25,0.75}and p∈{17,250},with the same n=200. 4. Conclusion and Discussion We have presented a variable importance score function in the context of ensemble learning. Our proposed score function is simple and more straightforward than its counterpart proposed in the context of random forest, and by avoiding permutations, it is by design computationally more efficient than the random forest variable importance function. Just like the random forest variable importance function, our score handles both regression and classification seamlessly. One of the distinct advantage of our proposed score is the fact that it offers a natural cut off at zero, with all the positive scores indicating importance and significance, while the negative scores are deemed indications of insignificance. An extra advantage of our proposed score lies in the fact it works very well beyond ensemble of trees and can seamlessly be used with any base learners in the random subspace learning context. Our examples, both simulated and real, demonstrated that our proposed scoredoes compete mostly favorablywith the randomforestscore.In ourfuture work,we presentandcompare the corresponding average test errors of the single models made up of the most important variables. We also provide in our future work theoretical proofs of the connection between our score function and the significance imsart-generic ver. 2014/07/30 file: perf-score-version-1.tex date: January 27, 2015 Ernest Fokou´e/PERF Score 5 X9 700 X9 5 600 4 500 PERF(Xj) 23 X3 Random Forest VI(Xj) 300400 X3 1 X10 200 X7 X10 0 −1 X1 X2 X4 X5 X6 X7 X8 X11 X12 X13 X14 X15 X16 X17 100 X1 X2 X4 X5 X6 X8 X11 X12 X13 X14 X15 X16 X17 Variable Variable (a) Permutation-freeVariableImportance. (b) Permutation-basedVariableImportance. Fig2.VariableImportance Scoresforsimulateddatawithmildcorrelationamongthevariablesinlowdimensionhighsample size setting 8 X9 1000 X9 800 6 PERF(Xj) 4 X3 Random Forest VI(Xj) 400600 X3 2 200 X7 0 X1 X2 X4 X5 X6 X7 X8 X10 X11 X12 X13 X14 X15 X16 X17 X1 X2 X4 X5 X6 X8 X10 X11 X12 X13 X14 X15 X16 X17 Variable Variable (a) Permutation-freeVariableImportance. (b) Permutation-basedVariableImportance. Fig3.VariableImportance Scoresforsimulateddatawithzerocorrelation among thevariablesinlowdimensionhighsample size setting imsart-generic ver. 2014/07/30 file: perf-score-version-1.tex date: January 27, 2015 Ernest Fokou´e/PERF Score 6 60complaints complaints 40 800 PERF(Xj) 20 learning raises Random Forest VI(Xj) 600 learning raises 0 400 privileges critical −20 privileges advance 200 advance critical Variable Variable (a) Permutation-freeVariableImportance. (b) Permutation-basedVariableImportance. Fig 4. Variable Importance Scores for the Attitude Data Set, for which n=30 and p=6. 0.03 glu 16 glu 0.02 14 PERF(Xj) 0.01 age Random Forest VI(Xj) 1012 age ped bmi 0.00 ped npreg 8 skin npreg −0.01 bmi bp skin bp 6 Variable Variable (a) Permutation-freeVariableImportance. (b) Permutation-basedVariableImportance. Fig 5. Variable Importance Scores for the Spam DetectionDataset where n=200 and p=7, and K=2 classes. imsart-generic ver. 2014/07/30 file: perf-score-version-1.tex date: January 27, 2015 Ernest Fokou´e/PERF Score 7 charExclamation 200 charExclamation 0.03 charDollar remove free charDollar 150 0.02 capitalLong hp remove your PERF(Xj) 0.01capcitaaplTiotatalAlve mnounme0y00 our Random Forest VI(Xj) 100 cacpaitpailtLaolAnvge hp your free george capitalTotal 0.00 edu num1999 hpl ybouusiness internet 50 mnounme0y00you our −0.01 cchhcaahrraSHcrqaRhusaoachruSornenedbfmebrtaarriceacbonkclepleckorroeeetonmrtjiegecientdtcainsilrgepptcmeatcrhtnsnnuoumlmond8gu4a5ym1ttea58llna5neb7luatsmb650 fonctreedmitaadidlrreepspesoorewreptscillemeloivaredilerovneuramda3dlmldreaskse 0 cchhcaahrraSHcrqaRhusaoachruSornenedbfmebrtaarriceacbeonkcldepleckoruroeeetonmrtjiegecientdtcainsilrgenpptcumeatmcrhtn1snnu9oum9lmo9nd8gu4a5ym1ttea58llna5neb7luatsgmbe6oh5rgp0el fonctrebedumitasadinidlerrseepsspesoorewreptscillemeloiivnaredtileerrnoevneturamda3dlmldreaskse Variable Variable (a) Permutation-freeVariableImportance. (b) Permutation-basedVariableImportance. Fig 6. Variable Importance Scores for the Spam DetectionDataset where n=4601 and p=57, and K=2 classes. ofvariablesselectedusingexistingcriteria.Itis alsoourplantoaddressthe factthatsometimesthe correlation structure among the predictor variables obscures the ability of our proposed score to correctly identify some significant variables. Acknowledgements Ernest Fokou´e wishes to express his heartfelt gratitude and infinite thanks to Our Lady of Perpetual Help for Herever-presentsupportandguidance,especiallyfortheuninterruptedflowofinspirationreceivedthroughHer most powerful intercession. References Breiman, L. (2001a). Random forests. Machine Learning 45, 5–32. Breiman, L. (2001b, August). Statistical modeling: The two cultures. Statistical Science 16(3), 199–215. imsart-generic ver. 2014/07/30 file: perf-score-version-1.tex date: January 27, 2015