Novel measurement technologies such as microarray expression analysis, genome-wide SNP typing and mass spectrometry are now producing experimental data of extremely high dimensions. While these techniques provide unprecedented opportunities for exploratory data analysis, the increase in dimensionality also introduces many difficulties. A key problem is to discover the most relevant variables, or features, among the tens of thousands of parallel measurements in a particular experiment. This is referred to as feature selection. For feature selection to be principled, one needs to decide exactly what it means for a feature to be "relevant". This thesis considers relevance from a statistical viewpoint, as a measure of statistical dependence on a given target variable. The target variable might be continuous, such as a patient’s blood glucose level, or categorical, such as "smoker" vs. "non-smoker". Several forms of relevance are examined and related to each other to form a coherent theory. Each form of relevance then defines a different feature selection problem.
The predictive features are those that allow an accurate predictive model, for example for disease diagnosis. I prove that finding predictive features is a tractable problem, in that consistent estimates can be computed in polynomial time. This is a substantial improvement upon current theory. However, I also demonstrate that selecting features to optimize prediction accuracy does not control feature error rates. This is a severe drawback in life science, where the selected features per se are important, for example as candidate drug targets. To address this problem, I propose a statistical method which to my knowledge is the first to achieve error control. Moreover, I show that in high dimensions, feature sets can be impossible to replicate in independent experiments even with controlled error rates. This finding may explain the lack of agreement among genome-wide association studies and molecular signatures of disease.
The most predictive features may not always be the most relevant ones from a biological perspective, since the predictive power of a given feature may depend on measurement noise rather than biological properties. I therefore consider a wider definition of relevance that avoids this problem. The resulting feature selection problem is shown to be asymptotically intractable in the general case; however, I derive a set of simplifying assumptions which admit an intuitive, consistent polynomial-time algorithm. Moreover, I present a method that controls error rates also for this problem. This algorithm is evaluated on microarray data from case studies in diabetes and cancer. In some cases, however, I find that these statistical relevance concepts are insufficient to prioritize among candidate features in a biologically reasonable manner. Therefore, effective feature selection for life science requires both a careful definition of relevance and a principled integration of existing biological knowledge.
CONTENTS
    Nearest-neighbor methods
    2.6.3 Kernel methods
  2.7 Priors, regularization and over-fitting
    2.7.1 Over-fitting
    2.7.2 Regularization
    2.7.3 Priors and Bayesian statistics
  2.8 Summary
3 Feature Selection Problems
  3.1 Predictive features
    3.1.1 The Markov boundary
    3.1.2 The Bayes-relevant features
  3.2 Small sample-optimal features
    3.2.1 The min-features bias
    3.2.2 k-optimal feature sets
  3.3 All relevant features
    3.3.1 The univariate case
    3.3.2 The multivariate case
  3.4 Feature extraction and gene set testing
  3.5 Summary
4 Feature Selection Methods
  4.1 Filter methods
    4.1.1 Statistical hypothesis tests
    4.1.2 The multiple testing problem
    4.1.3 Variable ranking
    4.1.4 Multivariate filters
    4.1.5 Multivariate search methods
  4.2 Wrapper methods
  4.3 Embedded methods
    4.3.1 Sparse linear predictors
    4.3.2 Non-linear methods
  4.4 Feature extraction and gene set testing methods
  4.5 Summary
5 A benchmark study
  5.1 Evaluation system
  5.2 Feature selection methods tested
  5.3 Results
    5.3.1 Robustness against irrelevant features
    5.3.2 Regularization in high dimensions
    5.3.3 Ranking methods are comparable in high dimensions
    5.3.4 Number of selected features increases with dimension
    5.3.5 No method improves SVM accuracy
    5.3.6 Feature set accuracy
  5.4 Summary
6 Consistent Feature Selection in Polynomial Time
  6.1 Relations between feature sets
    6.1.1 The Markov boundary and strong relevance
    6.1.2 The Bayes-relevant features
    6.1.3 The optimal feature set
  6.2 Consistent polynomial-time search algorithms
    6.2.1 Experimental data
  6.3 Discussion
  6.4 Summary
7 Bootstrapping Feature Selection
  7.1 Stability and error rates
  7.2 Feature selection is ill-posed
  7.3 The bootstrap approach
    7.3.1 Accuracy of the bootstrap
    7.3.2 Simulation studies
    7.3.3 Application to cancer data
  7.4 Discussion
  7.5 Summary
8 Finding All Relevant Features
  8.1 Computational complexity
  8.2 The Recursive Independence Test algorithm
    8.2.1 Outline
    8.2.2 Asymptotic correctness
    8.2.3 Biological relevance of the PCWT class
    8.2.4 Multiplicity and small-sample error control
    8.2.5 Simulated data
    8.2.6 Microarray data
    8.2.7 Discussion
  8.3 The Recursive Markov Boundary algorithm
    8.3.1 Outline
    8.3.2 Asymptotic correctness
  8.4 Related work
  8.5 Summary
9 Conclusions
  9.1 Model-based feature selection
  9.2