Linköping Studies in Science and Technology. Dissertations No. 1090.

Statistical Feature Selection With Applications in Life Science

Roland Nilsson

Department of Physics, Chemistry and Biology
Linköping University, SE-581 83 Linköping, Sweden

Linköping, 2007

Statistical Feature Selection With Applications in Life Science
Copyright © Roland Nilsson 2007
[email protected]

Division of Theory and Modelling
Department of Physics, Chemistry and Biology
Linköping University, SE-581 83 Linköping, Sweden

ISBN 978-91-85715-24-4
ISSN 0345-7524

Cover art by Roland Nilsson 2007. Image of nerve cells (bottom right) kindly provided by the Biomedical Electron Microscope Unit, Newcastle University, United Kingdom.

Typeset in LaTeX 2ε. Printed by LiU-Tryck, Linköping, Sweden 2007.

Abstract

The sequencing of the human genome has changed life science research in many ways. Novel measurement technologies such as microarray expression analysis, genome-wide SNP typing and mass spectrometry are now producing experimental data of extremely high dimensions. While these techniques provide unprecedented opportunities for exploratory data analysis, the increase in dimensionality also introduces many difficulties. A key problem is to discover the most relevant variables, or features, among the tens of thousands of parallel measurements in a particular experiment. This is referred to as feature selection.

For feature selection to be principled, one needs to decide exactly what it means for a feature to be "relevant". This thesis considers relevance from a statistical viewpoint, as a measure of statistical dependence on a given target variable. The target variable might be continuous, such as a patient's blood glucose level, or categorical, such as "smoker" vs. "non-smoker". Several forms of relevance are examined and related to each other to form a coherent theory. Each form of relevance then defines a different feature selection problem.

The predictive features are those that allow an accurate predictive model, for example for disease diagnosis. I prove that finding predictive features is a tractable problem, in that consistent estimates can be computed in polynomial time. This is a substantial improvement upon current theory. However, I also demonstrate that selecting features to optimize prediction accuracy does not control feature error rates. This is a severe drawback in life science, where the selected features per se are important, for example as candidate drug targets. To address this problem, I propose a statistical method which to my knowledge is the first to achieve error control. Moreover, I show that in high dimensions, feature sets can be impossible to replicate in independent experiments even with controlled error rates. This finding may explain the lack of agreement among genome-wide association studies and molecular signatures of disease.

The most predictive features may not always be the most relevant ones from a biological perspective, since the predictive power of a given feature may depend on measurement noise rather than biological properties. I therefore consider a wider definition of relevance that avoids this problem. The resulting feature selection problem is shown to be asymptotically intractable in the general case; however, I derive a set of simplifying assumptions which admit an intuitive, consistent polynomial-time algorithm. Moreover, I present a method that controls error rates also for this problem. This algorithm is evaluated on microarray data from case studies in diabetes and cancer.
In some cases, however, I find that these statistical relevance concepts are insufficient to prioritize among candidate features in a biologically reasonable manner. Therefore, effective feature selection for life science requires both a careful definition of relevance and a principled integration of existing biological knowledge.

Sammanfattning (Swedish summary)

The sequencing of the human genome at the beginning of the 2000s, together with the subsequent sequencing projects for various model organisms, has enabled revolutionary new genome-wide biological measurement techniques. Microarrays, mass spectrometry and SNP typing are examples of such techniques. These methods generate very high-dimensional data. A central problem in modern biological research is therefore to identify the relevant variables among these thousands of measurements. This is called feature selection.

To study feature selection in a systematic way, an exact definition of the concept of "relevance" is necessary. In this thesis, relevance is treated from a statistical viewpoint: "relevance" means statistical dependence on a target variable. This target may be continuous, for example a blood pressure measurement on a patient, or discrete, for example an indicator variable such as "smoker" vs. "non-smoker". Different forms of relevance are treated and a coherent theory is presented. Each definition of relevance then gives rise to a specific feature selection problem.

Predictive features are those that can be used to construct prediction models, which is important for example in clinical diagnosis systems. Here it is proved that a consistent estimate of such features can be computed in polynomial time, so that feature selection is feasible within reasonable computation time. This is a breakthrough compared with previous research. However, it is also shown that methods that optimize prediction models often yield high proportions of irrelevant features, which is very problematic in biological research. Therefore, a new feature selection method is also presented for which the relevance of the discovered features is statistically guaranteed. In this context it is also shown that, in high dimensions, feature selection methods are not reproducible in the usual sense, even when relevance is statistically guaranteed. This partly explains why genome-wide genetic association studies have so far been difficult to reproduce.

The case where all relevant features are sought is also treated. This problem is proved to require exponential computation time in the general case. However, a method is presented that solves the problem in polynomial time under certain statistical assumptions, which can be considered reasonable for biological data. Here too, the problem of false positives is taken into account, and a statistical method is presented that guarantees relevance. This method is applied to case studies in type 2 diabetes and cancer.

In some cases, however, the set of relevant features is very large. Statistical analysis of a single data type is then insufficient. In such situations it is important to draw on different data sources as well as existing biological knowledge in order to single out the most important findings.

Publications

The scientific publications underlying this Ph.D. thesis are:

1. Roland Nilsson, José M. Peña, Johan Björkegren, and Jesper Tegnér. Evaluating feature selection for SVMs in high dimensions. In Proceedings of the 17th European Conference on Machine Learning, pages 719-726, 2006.

2. José M. Peña, Roland Nilsson, Johan Björkegren, and Jesper Tegnér.
   Identifying the relevant nodes before learning the structure. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pages 367-374, 2006.

3. Roland Nilsson, José M. Peña, Johan Björkegren, and Jesper Tegnér. Consistent Feature Selection for Pattern Recognition in Polynomial Time. Journal of Machine Learning Research 8:589-612, 2007.

4. Roland Nilsson, José M. Peña, Johan Björkegren, and Jesper Tegnér. Detecting Multivariate Differentially Expressed Genes. BMC Bioinformatics, 2007 (in press).

5. Roland Nilsson, Johan Björkegren, and Jesper Tegnér. Reliable discovery of predictive gene lists using the bootstrap. Manuscript.

Acknowledgments

This thesis could never have been written without the joint effort of a small but bright and dedicated team: my supervisors at Linköping University, Karolinska Institutet and Clinical Gene Networks AB.

First, my main supervisor, professor Jesper Tegnér, who amazingly always manages to understand arbitrarily complicated problems in five minutes. More than once has a solution to an elusive puzzle dawned on me only when discussing the problem with Jesper.

Second, my main supervisor at Clinical Gene Networks AB, associate professor Johan Björkegren, who consistently provides a fresh "what-is-it-good-for?" perspective and a seemingly unlimited supply of creative (and crazy) ideas.

Third, a special acknowledgement to my co-supervisor Dr. José M. Peña, who first introduced me to the world of graphical models and who has been instrumental in many of the developments in this work. Your thorough knowledge, diligence and patience have been crucial at many points in the work presented herein.

Finally, to the Computational Medicine team at Karolinska Institutet and Linköping University and the Bioinformatics group at Linköping University, for inspiration and discussions, and also for kindly preventing me from starvation by telling me when it's time to stop staring at my theorems and go to lunch.

This work has been supported by the Swedish Knowledge Foundation through the Industrial PhD programme in Medical Bioinformatics at the Strategy and Development Office at Karolinska Institutet, Linköping University, and Clinical Gene Networks AB.

Contents

1 Introduction  1
  1.1 A brief background  3
  1.2 A guide to the thesis  7
2 Statistical Data Models  9
  2.1 Parametric models  13
    2.1.1 The exponential family  13
    2.1.2 Maximum likelihood estimation  15
  2.2 Graphical models  17
    2.2.1 Markov networks  17
    2.2.2 Bayesian networks  20
    2.2.3 Probability axioms  22
  2.3 Conditional probability models  25
  2.4 Predictors and inducers  28
  2.5 Loss and risk  31
  2.6 Nonparametric methods  37
    2.6.1 Empirical risk minimization  37
    2.6.2 Nearest-neighbor methods  38
    2.6.3 Kernel methods  39
  2.7 Priors, regularization and over-fitting  42
    2.7.1 Over-fitting  42
    2.7.2 Regularization  45
    2.7.3 Priors and Bayesian statistics  47
  2.8 Summary  50
3 Feature Selection Problems  53
  3.1 Predictive features  54
    3.1.1 The Markov boundary  54
    3.1.2 The Bayes-relevant features  56
  3.2 Small sample-optimal features  59
    3.2.1 The min-features bias  61
    3.2.2 k-optimal feature sets  61
  3.3 All relevant features  62
    3.3.1 The univariate case  65
    3.3.2 The multivariate case  65
  3.4 Feature extraction and gene set testing  66
  3.5 Summary  66
4 Feature Selection Methods  69
  4.1 Filter methods  70
    4.1.1 Statistical hypothesis tests  71
    4.1.2 The multiple testing problem  76
    4.1.3 Variable ranking  80
    4.1.4 Multivariate filters  81
    4.1.5 Multivariate search methods  84
  4.2 Wrapper methods  85
  4.3 Embedded methods  86
    4.3.1 Sparse linear predictors  87
    4.3.2 Non-linear methods  87
  4.4 Feature extraction and gene set testing methods  88
  4.5 Summary  89
5 A benchmark study  91
  5.1 Evaluation system  92
  5.2 Feature selection methods tested  94
  5.3 Results  96
    5.3.1 Robustness against irrelevant features  96
    5.3.2 Regularization in high dimensions  97
    5.3.3 Ranking methods are comparable in high dimensions  98
    5.3.4 Number of selected features increases with dimension  99
    5.3.5 No method improves SVM accuracy  101
    5.3.6 Feature set accuracy  103
  5.4 Summary  106
6 Consistent Feature Selection in Polynomial Time  109
  6.1 Relations between feature sets  110
    6.1.1 The Markov boundary and strong relevance  110
    6.1.2 The Bayes-relevant features  112
    6.1.3 The optimal feature set  116
  6.2 Consistent polynomial-time search algorithms  118
    6.2.1 Experimental data  122
  6.3 Discussion  123
  6.4 Summary  123
7 Bootstrapping Feature Selection  125
  7.1 Stability and error rates  125
  7.2 Feature selection is ill-posed  127
  7.3 The bootstrap approach  128
    7.3.1 Accuracy of the bootstrap  130
    7.3.2 Simulation studies  131
    7.3.3 Application to cancer data  134
  7.4 Discussion  135
  7.5 Summary  136
8 Finding All Relevant Features  137
  8.1 Computational complexity  137
  8.2 The Recursive Independence Test algorithm  139
    8.2.1 Outline  139
    8.2.2 Asymptotic correctness  141
    8.2.3 Biological relevance of the PCWT class  142
    8.2.4 Multiplicity and small-sample error control  144
    8.2.5 Simulated data  146
    8.2.6 Microarray data  148
    8.2.7 Discussion  153
  8.3 The Recursive Markov Boundary algorithm  154
    8.3.1 Outline  154
    8.3.2 Asymptotic correctness  155
  8.4 Related work  158
  8.5 Summary  159
9 Conclusions  161
  9.1 Model-based feature selection  161
  9.2 Recommendations for practitioners  163
  9.3 Future research  164

Notation

X             Feature vector; a vector-valued random variable
n             The dimension of X
V             The set {1,...,n}
𝒳             Domain (event space) of the random variable X
X_i           Feature; a component of the vector X, a random variable
X_S           For S ⊆ {1,...,n}, the sub-vector X_{i∈S} of X
X_{¬S}        The sub-vector X_{i∉S} of X
X_{1:n}       Equal to X_S with S = {1,...,n}
Y             Target variable; a random variable
Z             A pair of features and target, Z = (X,Y)
x             Observation of the random variable X
x_i^(j)       Observation j of the random variable X_i in a sample
x_i^(1:l)     A sample (vector) of l observations of the random variable X_i
X^(1:l)       A vector of l independent, identical random variables X
p(x)          Probability mass function
f(x)          Probability density function
P(ξ)          Probability of an event ξ ⊂ 𝒳
p(y|x)        Conditional probability of Y = y given X = x
Y ⊥ X | Z     Y is conditionally independent of X given Z
Y ⊥̸ X | Z     Y is conditionally dependent on X given Z
g(x)          Predictor; a function X ↦ Y
g*            The Bayes predictor
G             A set (domain, class) of predictors
I(Z^(1:l))    Inducer; a map Z^l ↦ G
h(ŷ|y)        Loss function
R(g)          Risk functional for classifier g
R̂(g)          Empirical risk estimate for classifier g
ρ(I)          Expected risk for inducer I
S*            The Bayes-relevant feature set (Definition 3.4)
S†            An expectation-optimal feature set (Definition 3.9)
S‡            Min-features set (Definition 3.10)
S^A           The set of all relevant features (Definition 3.11)
M*            The Markov boundary of Y
E[X]          Expectation value of X
O(f(n))       Order of f(n) (Landau notation)
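To make the notation above concrete, the following LaTeX fragment sketches how these symbols are typically combined: the risk of a predictor g as an expected loss, the empirical risk over a sample of size l, and the Markov boundary condition for Y. These are the standard textbook forms, given here only as an assumed illustration; the thesis's own definitions appear in Chapters 2 and 3.

% A minimal sketch (assumed standard definitions, not the thesis's exact formulations)
\begin{align*}
  R(g)       &= \mathrm{E}\bigl[\, h(g(X) \mid Y) \,\bigr]
             && \text{risk: expected loss of predictor } g \\
  \hat{R}(g) &= \frac{1}{l} \sum_{j=1}^{l} h\bigl(g(x^{(j)}) \mid y^{(j)}\bigr)
             && \text{empirical risk over a sample of size } l \\
  M^{*}      &= \text{a minimal } S \subseteq V \text{ such that } Y \perp X_{\neg S} \mid X_{S}
             && \text{Markov boundary of } Y
\end{align*}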