Feature Selection for Regression Problems Based on the Morisita Estimator of Intrinsic Dimension

Jean GOLAY, Michael LEUENBERGER and Mikhail KANEVSKI
Institute of Earth Surface Dynamics, Faculty of Geosciences and Environment, University of Lausanne, 1015 Lausanne, Switzerland. Email: [email protected].

arXiv:1602.00216v5 [stat.ML] 8 Apr 2016

Abstract

Data acquisition, storage and management have been improved, while the key factors of many phenomena are not well known. Consequently, irrelevant and redundant features artificially increase the size of datasets, which complicates learning tasks, such as regression. To address this problem, feature selection methods have been proposed. This paper introduces a new supervised filter based on the Morisita estimator of intrinsic dimension. It can identify relevant features and distinguish between redundant and irrelevant information. Besides, it offers a clear graphical representation of the results and it can be easily implemented in different programming languages. Comprehensive numerical experiments are conducted using simulated datasets characterized by different levels of complexity, sample size and noise. The suggested algorithm is also successfully tested on a selection of real world applications and compared with RReliefF using extreme learning machine. In addition, a new measure of feature relevance is presented and discussed.

Keywords: Feature selection, Intrinsic dimension, Morisita index, Measure of relevance, Data mining

1. Introduction

In data mining, it is often not known a priori what features (or input variables¹) are truly necessary to capture the main characteristics of a studied phenomenon. This lack of knowledge implies that many of the considered features are irrelevant or redundant. They artificially increase the dimension E of the Euclidean space in which the data points are embedded (E equals the number of input and output variables under consideration). This is a serious matter, since fast improvements in data acquisition, storage and management cause the number of redundant and irrelevant features to increase. As a consequence, the interpretation of the results becomes more complicated and, unless the sample size N grows exponentially with E, the curse of dimensionality [1] may reduce the overall accuracy yielded by any learning algorithm. Besides, large N and E are also difficult to deal with because of computer performance limitations.

¹ In this paper, the term "feature" is used as a synonym for "input variable".

In regression and classification, these issues are often addressed by implementing supervised feature selection methods [2–5]. They can be broadly subdivided into filter (e.g. RReliefF [6], mRMR [7] and CFS [8]), wrapper [9, 10] and embedded methods (e.g. the Lasso [11] and random forest [12]). Filters rank features or subsets of features according to a relevance measure independently of any predictive model, while wrappers use an evaluation criterion involving a learning machine. Both approaches can be used with search strategies, since an exhaustive exploration of the 2^#Feat. − 1 models (all the possible combinations of features) is often computationally infeasible. Greedy strategies [13, 14], such as Sequential Forward Selection (SFS), can be distinguished from stochastic ones (e.g. simulated annealing [15, 16] and particle swarm optimization [17, 18]).
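To make the greedy strategy concrete before turning to embedded methods, the following is a minimal, illustrative Python sketch of a generic SFS loop. It is not taken from the paper: score() stands for any user-chosen criterion (a filter relevance measure or a wrapper's validation error) and all names are hypothetical.

```python
def sfs(features, score, n_select):
    """Generic Sequential Forward Selection driven by a user-supplied score()."""
    selected, remaining = [], list(features)
    for _ in range(n_select):
        # Evaluate only |remaining| candidate subsets per step instead of
        # exploring all 2^len(features) - 1 feature combinations.
        best = max(remaining, key=lambda f: score(selected + [f]))
        remaining.remove(best)
        selected.append(best)
    return selected
```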
Regarding the embedded methods, feature selection is a by-product of the training procedure. It can be achieved by adding constraints to the cost function of a predictive model (e.g. the Lasso [11] and ridge regression [19]) or it can be more specific to a given algorithm (e.g. random forest [12] and adaptive general regression neural networks [20, 21]).

The present paper² deals with a new SFS filter algorithm. It relies on the idea that, although data points are embedded in E-dimensional spaces, they often reside on lower M-dimensional manifolds [23–25]. The value M (≤ E) is called Intrinsic Dimension (ID) and it can be estimated using the Morisita estimator of ID, introduced in [26], which is closely related to fractal theory. The proposed filter algorithm is supervised, designed for regression problems and based on this new ID estimator. It also keeps the simplicity of the Fractal Dimension Reduction (FDR) algorithm introduced in [27]. Finally, the results show the ability of this new filter to capture non-linear relationships and to effectively identify both redundant and irrelevant information.

² The main idea of this paper was partly presented at the 23rd symposium on artificial neural networks, computational intelligence and machine learning (ESANN2015) [22].

Figure 1: 50 values were sampled from two uniformly distributed variables V_1 and V_2. The two variables are (a) independent, (b) linearly dependent or (c) non-linearly dependent.

The paper is organized as follows. Section 2 reviews previous work on ID-based feature selection approaches. The Morisita estimator of ID is briefly presented in Section 3 (for the completeness of the paper). Section 4 introduces the Morisita-based filter and Section 5 is devoted to numerical experiments conducted on simulated data of varying complexity. In Section 6, real world applications from publicly accessible repositories are presented and a comparison with a benchmark algorithm, RReliefF [6], is carried out using Extreme Learning Machine (ELM) [28]. Finally, conclusions are drawn in the last section with a special emphasis on future challenges and applications.

2. Related Work

The concept of ID can be extended to the more general case where the data ID may be a non-integer dimension D [24, 27, 29]. The value D is estimated by using fractal-based methods which have been presented in [24, 25, 30] and successfully implemented in various fields, such as physics [31], cosmology [32], meteorology [33] and pattern recognition [34]. These methods rely on well-known fractal dimensions (e.g. the box-counting dimension [35, 36], the correlation dimension [31] and Rényi's dimensions of qth order [37]) and they can be used in feature selection [27, 38] and dimensionality reduction [24] to detect dependencies between variables (or features).

Traina et al. [27, 39] have opened up new prospects for the effective use of ID estimation in data mining by introducing the Fractal Dimension Reduction (FDR) algorithm. FDR executes an unsupervised procedure of feature selection aiming to remove from a dataset all the redundant variables. The fundamental idea is that fully redundant variables do not contribute to the value of the data ID. This idea is illustrated in Figure 1, where 50 points were sampled from two uniformly distributed variables V_1 and V_2.
In the first panel, V_1 and V_2 are independent, which means that they are not redundant, and one has that:

ID(V_1, V_2) ≈ ID(V_1) + ID(V_2) ≈ 1 + 1 = 2    (1)

where ID(·) denotes the ID of a dataset. It indicates that both V_1 and V_2 contribute to increasing the value of ID(V_1, V_2) by about 1, which is, by construction, equal to the ID of each variable (i.e. ID(V_1) and ID(V_2)). Conversely, the removal of either V_1 or V_2 would lead to a reduction in the data ID from about 2 (i.e. the dimension of the embedding space) to 1 (i.e. the ID of a single variable) and information would be irreparably lost. In contrast, the other two panels show what happens when V_1 and V_2 are fully redundant with each other. One has that:

ID(V_1, V_2) ≈ ID(V_1) ≈ ID(V_2) ≈ 1    (2)

where the ID of the full dataset is approximately equal to the topological dimension of a smooth line. This means that the contribution of only one variable is enough to reach the value of ID(V_1, V_2) and the remaining one can be disregarded without losing any information.

Based on these considerations, the FDR algorithm removes the redundant variables from a dataset by implementing a Sequential Backward Selection (SBS) strategy [13]. Besides, it uses Rényi's dimension of order q = 2, D_2, for the ID estimation. Following the same principles, De Sousa et al. [40] examined additional developments to FDR and presented a new algorithm for identifying subgroups of correlated variables.

FDR is designed to carry out unsupervised tasks and it is not able to distinguish between variables that are relevant to a learning process and those that are irrelevant. The reason is that such variables can all contribute to the data ID. For instance, in the left-hand panel of Figure 1, V_1 could be regarded as irrelevant to the learning of V_2, but it would be selected by FDR because it makes the data ID increase by about 1. Consequently, different studies were carried out to adapt FDR to supervised learning. Lee et al. [41] suggested decoupling the relevance and redundancy analysis. Following the same idea, Pham et al. [42] used mutual information to identify irrelevant features and combined the results with those of FDR. Finally, Mo and Huang [38] developed an advanced algorithm to detect both redundant and irrelevant information in a single step. This algorithm follows a backward search strategy and relies on the correlation dimension, df_cor, for the estimation of the data ID.

The filter algorithm suggested in this paper is designed in such a way that it combines the advantages of both FDR and Mo's algorithm: it can deal with non-linear dependencies, it does not rely on any user-defined threshold, it can discriminate between redundant and irrelevant information and the results can be easily summarized in informative plots. Moreover, it can cope with high-dimensional datasets thanks to its SFS search strategy and it uses the Morisita estimator of ID, which was shown to yield comparable or better results than D_2 and df_cor [26].

3. The Morisita Estimator of Intrinsic Dimension

The Morisita estimator of ID, M_m, has been recently introduced [26]. It is a fractal-based ID estimator derived from the multipoint Morisita index I_{m,δ} [30, 44] (named after Masaaki Morisita, who proposed the first version of the index to study the spatial clustering of ecological data [45]). I_{m,δ} is computed by superimposing a grid of Q quadrats of diagonal size δ onto the data points (see Figure 2). It measures how many times more likely it is that m (m ≥ 2) points selected at random will be from the same quadrat than it would be if all the N points of the studied dataset were distributed according to a random distribution generated from a Poisson process (i.e. complete spatial randomness). The formula is the following:

I_{m,δ} = Q^{m−1} [ Σ_{i=1}^{Q} n_i (n_i − 1)(n_i − 2) ··· (n_i − m + 1) ] / [ N(N − 1)(N − 2) ··· (N − m + 1) ]    (3)

where n_i is the number of points in the i-th quadrat.
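As an illustration of Eq. (3), here is a minimal NumPy sketch of the index for the case m = 2 used throughout the paper. It assumes each variable has already been rescaled to [0, 1] and specifies the grid by the number of subdivisions per axis (the practical convention described just below, with edge length ℓ = 1/n_subdiv); the function name is illustrative and this is not the authors' MINDID implementation.

```python
import numpy as np

def morisita_index_m2(X, n_subdiv):
    """I_{2,l} of Eq. (3) for points X in [0,1]^E, with n_subdiv quadrats per
    axis (Q = n_subdiv**E quadrats of edge length l = 1/n_subdiv)."""
    N, E = X.shape
    # Assign each point to a quadrat; points lying exactly on 1.0 go to the last cell.
    cells = np.minimum((X * n_subdiv).astype(int), n_subdiv - 1)
    # Occupancy counts n_i of the occupied quadrats; empty quadrats add 0 to the sum.
    _, counts = np.unique(cells, axis=0, return_counts=True)
    Q = n_subdiv ** E
    return Q * np.sum(counts * (counts - 1)) / (N * (N - 1))
```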
For a fixed value of m, I_{m,δ} is calculated for a chosen scale range. If a dataset approximates a fractal behavior (i.e. is self-similar) within this range, the relationship of the plot relating log(I_{m,δ}) to log(1/δ) is linear and the slope of the regression line is defined as the Morisita slope S_m. Finally, M_m is expressed as:

M_m = E − S_m / (m − 1).    (4)

In practice, each variable is rescaled to the [0,1] interval and δ can be replaced with the quadrat edge length ℓ, with ℓ^{−1} being the number of grid subdivisions (see Figure 2). Then, a set of R values of ℓ (or ℓ^{−1}) is chosen so that it captures the linear part of the log-log plot. In the rest of this paper, only M_2 will be used and it will be computed with an algorithm called Morisita INDex for Intrinsic Dimension estimation (MINDID) [26], whose complexity is O(N · E · R).

Figure 2: Illustration of the way the Morisita slope, S_2, is computed for a 2-dimensional problem. In the two left-hand panels, S_2 is the slope of the dashed line (here S_2 = 0.76) and the red squares correspond to the values of log_e(I_{2,ℓ}) calculated with the grids of 1, 4, 9 and 16 quadrats displayed on the right. The R dataset "Quakes" [43] was used and the data were rescaled to the [0,1] interval.

4. The Morisita-based Filter for Regression Problems

The Morisita-Based Filter for Regression (MBFR) relies on three observations following from the work by Traina et al. [27], De Sousa et al. [40] and Mo and Huang [38]:

1. Given an output variable Y generated from k relevant and non-redundant input variables X_1, ..., X_k, one has that:

   ID(X_1, ..., X_k, Y) − ID(X_1, ..., X_k) ≈ 0    (5)

   where ID(·) denotes the (possibly non-integer) ID of a dataset.

2. Given i irrelevant input variables I_1, ..., I_i completely independent of Y, one has that:

   ID(I_1, ..., I_i, Y) − ID(I_1, ..., I_i) ≈ ID(Y)    (6)

3. Given a randomly selected subset of X_1, ..., X_k of size r with 0 < r < k and k > 1, j redundant input variables J_1, ..., J_j related to some or all of X_1, ..., X_r, and all the i irrelevant input variables I_1, ..., I_i, one has that:

   ID(X_1, ..., X_r, J_1, ..., J_j, I_1, ..., I_i, Y) − ID(X_1, ..., X_r, J_1, ..., J_j, I_1, ..., I_i) ≈ H    (7)

   where H ∈ ]0, ID(Y)[ and H decreases to 0 as r increases to k.

The difference

   Diss(F, Y) = ID(F, Y) − ID(F)    (8)

can thus be suggested as a way of measuring the dissimilarity (i.e. the independence) between Y and a set F of features (e.g. F = {X_1, ..., X_k}), among which only the relevant ones (i.e. the non-redundant features on which Y depends) contribute to reducing the dissimilarity.
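Building on the I_{2,ℓ} sketch given in Section 3, the two quantities used by the filter can be estimated in a few lines: M_2 from the slope of log(I_{2,ℓ}) against log(1/ℓ) (Eq. 4 with m = 2), and the dissimilarity of Eqs. (8)–(9). This is an illustrative sketch under the assumption that the chosen subdivision values capture the linear part of the log-log plot; it is not the MINDID code of [26].

```python
import numpy as np  # morisita_index_m2() is the sketch from Section 3

def morisita_m2(X, subdiv_values):
    """Morisita ID estimate M_2 = E - S_2 (Eq. 4 with m = 2) for X in [0,1]^E."""
    log_inv_l = np.log(np.asarray(subdiv_values, dtype=float))   # log(1/l)
    log_i2 = np.log([morisita_index_m2(X, n) for n in subdiv_values])
    s2 = np.polyfit(log_inv_l, log_i2, 1)[0]                     # Morisita slope S_2
    return X.shape[1] - s2

def diss(F, y, subdiv_values):
    """Estimated dissimilarity Diss(F, Y) = M_2(F, Y) - M_2(F) (Eqs. 8-9)."""
    return (morisita_m2(np.column_stack([F, y]), subdiv_values)
            - morisita_m2(F, subdiv_values))
```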
Based on that idea, MBFR (see Algorithm 1) aims at retrieving the relevant features available in a dataset by sorting each subset of variables according to its dissimilarity with Y. MBFR implements an SFS search strategy and relies on the Morisita estimator of ID and the MINDID algorithm [26] to estimate Diss:

   \widehat{Diss}(F, Y) = M_2(F, Y) − M_2(F).    (9)

Algorithm 1 MBFR
INPUT: A dataset A with E − 1 features F_1, ..., F_{E−1} and one output variable Y. A vector L of values ℓ. An integer C (≤ E − 1) indicating the number of iterations. Two empty vectors of length C, SelF and DissF, for storing, respectively, the names of the selected features and the dissimilarities. An empty matrix Z for storing the selected features.
OUTPUT: SelF and DissF.
1: Rescale each feature and Y to [0, 1].
2: for i = 1 to C do
3:    for j = 1 to (E − i) do
4:       \widehat{Diss}(Z, F_j, Y) = M_2(Z, F_j, Y) − M_2(Z, F_j)   (MINDID used with L)
5:    end for
6:    Store in SelF[i] the name of the F_j yielding the lowest value of \widehat{Diss}.
7:    Store this value of \widehat{Diss} in DissF[i].
8:    Remove the corresponding F_j from A and add it to Z.
9: end for

The complexity of the algorithm is O(N · E² · R · C), where C (≤ E − 1) is the number of iterations of the SFS procedure. For high-dimensional datasets, if Diss is likely to reach its minimum value after only a few iterations because of many redundant and irrelevant variables, C can be set to a value lower than E − 1.

For ease of comparison, the coefficient of dimensional relevance, DR(F, Y), can be introduced. It is defined as:

   DR(F, Y) = 1 − Diss(F, Y)/ID(Y) = 1 − [ID(F, Y) − ID(F)]/ID(Y)    (10)

and it can be computed using the Morisita estimator of ID, M_2. In the same way as Diss(F, Y), DR(F, Y) is able to capture both linear and non-linear relationships between an input and an output space. Besides, it lies between 0 and 1. If the target variable Y can be completely explained by the considered features F, DR(F, Y) = 1. On the contrary, if all the available features are irrelevant, DR(F, Y) = 0 and, in between, the closer it is to 1, the greater the explanatory power of F.

Figure 3: (left) The functional relationship between the output variable Y and the relevant features X_1 and X_2 of the butterfly dataset; (right) shuffling of the output variable Y.

 j    ω_{1,j}    ω_{2,j}     β_j
 1     0.6655     0.8939     1.3446
 2     1.2611    −0.3512    −0.0115
 3     0.3961    −1.7827     1.2770
 4    −1.7065    −0.5297     0.5962
 5     0.8807     1.9574    −0.8530
 6     1.8260     0.7962    −0.7290
 7     1.3400     1.5001     1.2339
 8     1.2919    −0.4462     0.1186
 9    −1.3902     1.6856     0.5277
10     0.0743     1.5625    −0.6952

Table 1: Weights used in the construction of the butterfly dataset.

In the rest of this paper, MBFR will be thoroughly tested by using both simulated data and real world applications. It will also be compared with another filter, RReliefF, through the results of different ELM models.
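The following Python sketch mirrors the structure of Algorithm 1 and Eq. (10), reusing the diss() and morisita_m2() sketches above. It is a simplified illustration under the same assumptions (suitable ℓ values, no constant features), not the authors' implementation, which the paper notes (Section 5) was written in R.

```python
import numpy as np

def mbfr(X, y, feature_names, subdiv_values, C=None):
    """Greedy SFS ranking of the E-1 features in X by estimated dissimilarity with y."""
    # Step 1: rescale every feature and Y to [0, 1].
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    y = (y - y.min()) / (y.max() - y.min())
    remaining = list(range(X.shape[1]))
    C = len(remaining) if C is None else C
    Z = np.empty((X.shape[0], 0))                      # selected features so far
    sel_f, diss_f = [], []
    for _ in range(C):
        # Steps 3-5: score every remaining feature added to the current selection Z.
        scores = [diss(np.column_stack([Z, X[:, [j]]]), y, subdiv_values)
                  for j in remaining]
        best = int(np.argmin(scores))                  # steps 6-8: keep the lowest Diss
        j = remaining.pop(best)
        sel_f.append(feature_names[j])
        diss_f.append(scores[best])
        Z = np.column_stack([Z, X[:, [j]]])
    return sel_f, diss_f

def dimensional_relevance(F, y, subdiv_values):
    """Coefficient of dimensional relevance DR(F, Y) of Eq. (10), for F and y in [0,1]."""
    return 1.0 - diss(F, y, subdiv_values) / morisita_m2(y.reshape(-1, 1), subdiv_values)
```

Plotting diss_f against the order of selection gives one possible graphical summary of the kind mentioned above: the estimated dissimilarity decreases towards its minimum as relevant features enter Z.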
• Question 5: How does MBFR respond to the presence of noise in data (see Subsection 5.6)?

• Question 6: How much better is the coefficient of dimensional relevance compared to a linear measure (see Subsection 5.7)?

Notice also that the R environment [43] was used to implement the MBFR algorithm and to carry out the experiments.

5.1. Simulated Datasets

Two simulated datasets were used: the butterfly and the Friedman datasets.

1. The butterfly dataset³ (see Figure 3): an output variable Y is generated from two uniformly distributed input variables X_1, X_2 ∈ ]−5, 5[ by using an Artificial Neural Network (ANN) consisting of one hidden layer of 10 neurons. It can be expressed as:

   Y = Σ_{j=1}^{10} β_j sig(X_1 ω_{1,j} + X_2 ω_{2,j}) + ε    (11)

   where ω_{1,j} and ω_{2,j} are the weights connecting the input variables to the j-th neuron, sig(x): ℝ → ℝ is a sigmoid transfer function, β_j is the weight between the j-th neuron and the output layer, and ε is a Gaussian noise with zero mean and varying standard deviation (by default, it is set to 0.00). The exact weights used in the construction of the dataset are given in Table 1. Moreover, three redundant (J) and three irrelevant (I) variables are also added to complete the input space: J_3 = log_10(X_1 + 5), J_4 = X_1² − X_2², J_5 = X_1⁴ − X_2⁴, a uniformly distributed variable I_6 ∈ ]−5, 5[, I_7 = log_10(I_6 + 5) and I_8 = I_6 + I_7. Finally, the butterfly dataset is generated by random sampling of X_1, X_2 and I_6. In this paper, different sample sizes were considered: N = 1000, 2000, 10000, 20000.

³ It can be downloaded from: https://sites.google.com/site/jeangolayresearch/.
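For concreteness, a possible NumPy construction of the butterfly dataset is sketched below, combining Eq. (11) with the weights of Table 1 and the redundant and irrelevant features listed above. The excerpt does not specify which sigmoid sig(x) was used, so the logistic function is assumed here; treat this as an approximate reconstruction rather than the authors' generator (available from the link in footnote 3).

```python
import numpy as np

# Columns of Table 1: omega_{1,j}, omega_{2,j}, beta_j for j = 1, ..., 10.
W = np.array([[ 0.6655,  0.8939,  1.3446],
              [ 1.2611, -0.3512, -0.0115],
              [ 0.3961, -1.7827,  1.2770],
              [-1.7065, -0.5297,  0.5962],
              [ 0.8807,  1.9574, -0.8530],
              [ 1.8260,  0.7962, -0.7290],
              [ 1.3400,  1.5001,  1.2339],
              [ 1.2919, -0.4462,  0.1186],
              [-1.3902,  1.6856,  0.5277],
              [ 0.0743,  1.5625, -0.6952]])

def butterfly(N=1000, noise_sd=0.0, seed=0):
    """Sample the butterfly dataset: relevant X1, X2, redundant J3-J5,
    irrelevant I6-I8 and the output Y of Eq. (11)."""
    rng = np.random.default_rng(seed)
    X1, X2, I6 = (rng.uniform(-5, 5, N) for _ in range(3))
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))               # assumed sigmoid
    Y = sig(X1[:, None] * W[:, 0] + X2[:, None] * W[:, 1]) @ W[:, 2]
    Y = Y + rng.normal(0.0, noise_sd, N)                   # epsilon (default sd 0.00)
    J3, J4, J5 = np.log10(X1 + 5), X1**2 - X2**2, X1**4 - X2**4
    I7 = np.log10(I6 + 5)
    I8 = I6 + I7
    return np.column_stack([X1, X2, J3, J4, J5, I6, I7, I8, Y])
```

A call such as mbfr(data[:, :8], data[:, 8], ['X1', 'X2', 'J3', 'J4', 'J5', 'I6', 'I7', 'I8'], subdiv_values=range(5, 31, 5)) then illustrates the intended use of the filter on this dataset; the subdivision values here are arbitrary and would need to be tuned to the linear part of the log-log plot.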
