Song et al. BMC Bioinformatics 2013, 14:5
http://www.biomedcentral.com/1471-2105/14/5

METHODOLOGY ARTICLE    Open Access

Random generalized linear model: a highly accurate and interpretable ensemble predictor

Lin Song1,2, Peter Langfelder1 and Steve Horvath1,2*

*Correspondence: [email protected]
1 Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California, USA
2 Biostatistics, School of Public Health, University of California, Los Angeles, California, USA

Abstract

Background: Ensemble predictors such as the random forest are known to have superior accuracy, but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is very interpretable, especially when forward feature selection is used to construct the model. However, forward feature selection tends to overfit the data and lead to low predictive accuracy. Therefore, it remains an important research goal to combine the advantages of ensemble predictors (high accuracy) with the advantages of forward regression modeling (interpretability). To address this goal, several articles have explored GLM based ensemble predictors. Since limited evaluations suggested that these ensemble predictors were less accurate than alternative predictors, they have found little attention in the literature.

Results: Comprehensive evaluations involving hundreds of genomic data sets, the UCI machine learning benchmark data, and simulations are used to give GLM based ensemble predictors a new and careful look. A novel bootstrap aggregated (bagged) GLM predictor that incorporates several elements of randomness and instability (random subspace method, optional interaction terms, forward variable selection) often outperforms a host of alternative prediction methods including random forests and penalized regression models (ridge regression, elastic net, lasso). This random generalized linear model (RGLM) predictor provides variable importance measures that can be used to define a "thinned" ensemble predictor (involving few features) that retains excellent predictive accuracy.

Conclusion: RGLM is a state of the art predictor that shares the advantages of a random forest (excellent predictive accuracy, feature importance measures, out-of-bag estimates of accuracy) with those of a forward selected generalized linear model (interpretability). These methods are implemented in the freely available R software package randomGLM.

Background

Prediction methods (also known as classifiers, supervised machine learning methods, regression models, prognosticators, diagnostics) are widely used in biomedical research. For example, reliable prediction methods are essential for accurate disease classification, diagnosis and prognosis. Since prediction methods based on multiple features (also known as covariates or independent variables) can greatly outperform predictors based on a single feature [1], it is important to develop methods that can optimally combine features to obtain high accuracy. Introductory textbooks describe well known prediction methods such as linear discriminant analysis (LDA), K-nearest neighbor (KNN) predictors, support vector machines (SVM) [2], and tree predictors [3]. Many publications have evaluated popular prediction methods in the context of gene expression data [4-9].

Ensemble predictors are particularly attractive since they are known to lead to highly accurate predictions. An ensemble predictor generates and integrates multiple versions of a single predictor (often referred to as a base learner), and arrives at a final prediction by aggregating the predictions of the multiple base learners, e.g. via plurality voting across the ensemble. One particular approach for constructing an ensemble predictor is bootstrap aggregation (bagging) [10]. Here multiple versions of the original data are generated through bootstrapping, where observations from the training set are randomly sampled with replacement.
An individual predictor (e.g. a tree predictor) is fitted on each bootstrapped data set. Thus, 100 bootstrapped data sets (100 bags) will lead to an ensemble of 100 tree predictors. In case of a class outcome (e.g. disease status), the individual predictors "vote" for each class and the final prediction is obtained by majority voting.

Breiman (1996) showed that bagging weak predictors (e.g. tree predictors or forward selected linear models) often yields substantial gains in predictive accuracy [10]. But it seems that ensemble predictors are only very rarely used for predicting clinical outcomes. This fact points to a major weakness of ensemble predictors: they typically lead to "black box" predictions that are hard to interpret in terms of the underlying features. Clinicians and epidemiologists prefer forward selected regression models since the resulting predictors are highly interpretable: a linear combination of relatively few features can be used to predict the outcome or the probability of an outcome. But the sparsity afforded by forward feature selection comes at an unacceptably high cost: forward variable selection (and other variable selection methods) often greatly overfits the data, which results in unstable and inaccurate predictors [11,12]. Ideally, one would want to combine the advantages of ensemble predictors with those of forward selected regression models.
As discussed below, multiple articles describe ensemble predictors based on linear models, including the seminal work by Breiman [10], who evaluated a bagged forward selected linear regression model. However, the idea of bagging forward selected linear models (or other GLMs) appears to have been set aside as new ensemble predictors, such as the random forest, became popular. A random forest (RF) predictor not only bags tree predictors but also introduces an element of randomness by considering only a randomly selected subset of features at each node split [13]. The number of randomly selected features, mtry, is the only parameter of the random forest predictor. The random forest predictor has deservedly received a lot of attention for the following reasons. First, the bootstrap aggregation step allows one to use out-of-bag (OOB) samples to estimate the predictive accuracy. The resulting OOB estimate of the accuracy often obviates the need for cross-validation and other resampling techniques. Second, the RF predictor provides several measures of feature (variable) importance. Several articles explore the use of these importance measures to select genes [5,13,14]. Third, it can be used to define a dissimilarity measure for clustering applications [13,15]. Fourth, and most importantly, the RF predictor has superior predictive accuracy. It performs as well as alternatives in cancer gene expression data sets [5] but it really stands out when applied to the UCI machine learning benchmark data sets, where it is as good as (if not better than) many existing methods [13]. While we confirm the truly outstanding predictive performance of the RF, the proposed RGLM method turns out to be even more accurate than the RF (e.g. across the disease gene expression data sets). Breiman and others have pointed out that the black box predictions of the RF predictor can be difficult to interpret. For this reason, we wanted to give bagged forward selected generalized linear regression models another careful look. After exploring different approaches for injecting elements of randomness into the individual GLM predictors, we arrived at a new ensemble predictor, referred to as the random GLM (RGLM) predictor, with astonishing predictive performance. An attractive aspect of the proposed RGLM predictor is that it combines the advantages of the RF with those of a forward selected GLM. As the name generalized linear model indicates, it can be used for a general outcome such as a binary outcome, a multi-class outcome, a count outcome, or a quantitative outcome. We show that several incremental (but important) changes to the original bagged GLM predictor by Breiman add up to a qualitatively new predictor that performs at least as well as the RF predictor on the UCI benchmark data sets. While the UCI data are the benchmark data for evaluating predictors, only a dozen such data sets are available for binary outcome prediction. To provide a more comprehensive empirical comparison of the different prediction methods, we also consider over 700 comparisons involving gene expression data. In these genomic studies, the RGLM method turns out to be slightly more accurate than the considered alternatives. While the improvements in accuracy afforded by the RGLM are relatively small, they are statistically significant.

This article is organized as follows. First, we present a motivating example that illustrates the high prediction accuracy of the RGLM. Second, we compare the RGLM with other state of the art predictors when it comes to binary outcome prediction. Toward this end, we use the UCI machine learning benchmark data, over 700 empirical gene expression comparisons, and extensive simulations. Third, we compare the RGLM with other predictors for quantitative (continuous) outcome prediction. Fourth, we describe several variable importance measures and show how they can be used to define a thinned version of the RGLM that only uses few important features. Even for data sets comprised of thousands of gene features, the thinned RGLM often involves fewer than 20 features and is thus more interpretable than most ensemble predictors.

Methods

Construction of the RGLM predictor

RGLM is an ensemble predictor based on bootstrap aggregation (bagging) of generalized linear models whose features (covariates) are selected using forward regression according to the AIC criterion. GLMs comprise a large class of regression models, e.g. linear regression for a normally distributed outcome, logistic regression for a binary outcome, multinomial regression for a multi-class outcome, and Poisson regression for a count outcome [16]. Thus, RGLM can be used to predict binary, continuous, count, and other outcomes for which generalized linear models can be defined.
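To make the family correspondence concrete, the following minimal R sketch (toy data and variable names are ours, not taken from the article) shows how three of the outcome types listed above map onto GLM families in base R's glm function; multi-class outcomes require a different fitting routine and are not shown.

  # Toy data: two features and three outcome types (illustrative only)
  set.seed(1)
  x1 <- rnorm(100)
  x2 <- rnorm(100)
  yQuantitative <- 2 * x1 - x2 + rnorm(100)    # quantitative outcome
  yBinary <- rbinom(100, 1, plogis(x1))        # binary outcome
  yCount <- rpois(100, exp(0.5 * x1))          # count outcome

  glm(yQuantitative ~ x1 + x2, family = gaussian)  # linear regression
  glm(yBinary ~ x1 + x2, family = binomial)        # logistic regression
  glm(yCount ~ x1 + x2, family = poisson)          # Poisson regression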
The "randomness" in RGLM stems from two sources. First, a non-parametric bootstrap procedure is used, which randomly selects samples with replacement from the original data set. Second, a random subset of features (specified by the input parameter nFeaturesInBag) is selected for each bootstrap sample. This amounts to a random subspace method [17] applied to each bootstrap sample separately.

The steps of the RGLM construction are presented in Figure 1. First, starting from the original data set, another equal-sized data set is generated using the non-parametric bootstrap method, i.e. samples are selected with replacement from the original data set. The parameter nBags (default value 100) determines how many such bootstrap data sets (referred to as bags) are generated. Second, a random set of features (determined by the parameter nFeaturesInBag) is randomly chosen for each bag. Thus, the GLM predictor for bag 1 will typically involve a different set of features than that for bag 2. Third, the nFeaturesInBag randomly selected features per bag are rank-ordered according to their individual association with the outcome variable y in each bag. For a quantitative outcome y, one can simply use the absolute value of the correlation coefficient between the outcome and each feature to rank the features. More generally, one can fit a univariate GLM to each feature to arrive at an association measure (e.g. a Wald test statistic or a likelihood ratio test). Only the top ranking features (i.e. features with the most significant univariate significance levels) become candidate covariates for forward selection in a multivariate regression model. The number of top candidate features is determined by the input parameter nCandidateCovariates (default value 50). Fourth, forward variable (feature) selection is applied to the nCandidateCovariates of each bag to arrive at a multivariable generalized linear model per bag. The forward selection procedure used by RGLM is based on the stepAIC R function in the MASS R library, where the method is set to "forward". Fifth, the predictions of each forward selected multivariate model (one per bag) are aggregated across bags to arrive at a final ensemble prediction. The aggregation method depends on the type of outcome. For a continuous outcome, predicted values are simply averaged across bags. For a binary outcome, the adjusted Majority Vote (aMV) strategy [18] is used, which averages predicted probabilities across bags. Given the estimated class probabilities, one can get a binary prediction by choosing an appropriate threshold (default = 0.5).

Figure 1 Overview of the RGLM construction. The figure outlines the steps used in the construction of the RGLM. The pink rectangles represent data matrices at each step. Width of a rectangle reflects the number of remaining features.
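The five construction steps can be summarized in the following R sketch for a binary outcome. This is a simplified illustration of the procedure in Figure 1 (bootstrap sampling, random feature subsets, univariate ranking, AIC-based forward selection via stepAIC from the MASS package, and averaging of predicted probabilities); it is not the randomGLM package implementation, the helper name rglmSketch is ours, and absolute correlation is used as a simple stand-in for the univariate association measure.

  library(MASS)  # provides stepAIC for AIC-based forward model selection

  # Minimal sketch of the RGLM construction for a binary outcome y (coded 0/1)
  # and a numeric feature matrix x (samples in rows, features in columns).
  # An illustration of the steps above, not the randomGLM package itself.
  rglmSketch <- function(x, y, xTest, nBags = 100,
                         nFeaturesInBag = ncol(x), nCandidateCovariates = 50) {
    colnames(x) <- paste0("feature", seq_len(ncol(x)))  # ensure usable names
    xTest <- as.data.frame(xTest)
    colnames(xTest) <- colnames(x)
    testProb <- matrix(NA, nrow(xTest), nBags)
    for (b in seq_len(nBags)) {
      # Step 1: non-parametric bootstrap of the samples (one "bag")
      boot <- sample(nrow(x), replace = TRUE)
      # Step 2: random subspace - randomly chosen features for this bag
      feats <- sample(ncol(x), nFeaturesInBag)
      # Step 3: rank the selected features by univariate association with y
      assoc <- apply(x[boot, feats, drop = FALSE], 2,
                     function(f) abs(cor(f, y[boot])))
      cand <- feats[order(assoc, decreasing = TRUE)]
      cand <- cand[seq_len(min(nCandidateCovariates, length(cand)))]
      # Step 4: AIC-based forward selection among the candidate covariates
      dat <- data.frame(y = y[boot], x[boot, cand, drop = FALSE])
      nullFit <- glm(y ~ 1, data = dat, family = binomial)
      upper <- formula(paste("~", paste(colnames(dat)[-1], collapse = " + ")))
      fit <- stepAIC(nullFit, scope = list(lower = ~1, upper = upper),
                     direction = "forward", trace = FALSE)
      # Step 5: predicted class probabilities for the test samples from this bag
      testProb[, b] <- predict(fit, newdata = xTest, type = "response")
    }
    # Aggregation (adjusted majority vote): average the probabilities across
    # bags, then threshold at 0.5 to obtain the binary prediction
    prob <- rowMeans(testProb)
    list(testProbability = prob, testPrediction = as.numeric(prob > 0.5))
  }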
Importantly, RGLM also has a parameter maxInteractionOrder (default value 1) for creating interactions up to a given order among features in the model construction. For example, RGLM.inter2 results from setting maxInteractionOrder = 2, i.e. considering pairwise (also known as 2-way) interaction terms. As an example, consider the case when only pairwise interaction terms are used. For each bag a random set of features is selected (similar to the random subspace method, RSM) from the original covariates, i.e. covariates without interaction terms. Next, all pairwise interactions among the nFeaturesInBag randomly selected features are generated. Next, the usual RGLM candidate feature selection steps are applied to the combined set of pairwise interaction terms and the nFeaturesInBag randomly selected features per bag, resulting in nCandidateCovariates top ranking features per bag, which are subsequently subjected to forward feature selection.

These methods are implemented in our R software package randomGLM, which allows the user to input a training set and optionally a test set. It automatically outputs out-of-bag estimates of the accuracy and variable importance measures.

Parameter choices for the RGLM predictor

As discussed below, we find that it is usually sufficient to consider only nBags = 100 bootstrap data sets. The default value for nFeaturesInBag depends on the total number of features. It is easier to explain it in terms of the proportion of features randomly selected per bag, nFeaturesInBag/N, where N is the total number of features of the training set. Apart from N it is also valuable to consider the effective number of features, which equals the number of features N plus the number of interaction terms, e.g. N* = N + N(N-1)/2 in case of pairwise interactions. Using this notation, the default value of nFeaturesInBag can be arrived at by solving the equations presented in Table 1. These equations were found by empirically evaluating various choices of nFeaturesInBag values (e.g. √N, N/5, N/3, N/2, 2N/3, N). In particular, we found that in case of N* <= 10, using all features (i.e. setting nFeaturesInBag/N = 1) is often a good choice, whereas if N* > 300 then setting nFeaturesInBag/N = 0.2 works well. The default value nFeaturesInBag/N = 1.0276 - 0.00276 N* in the intermediate case (10 < N* <= 300) results from fitting an interpolation line through the two points (10, 1) and (300, 0.2). We find that RGLM is quite robust with respect to the parameter nFeaturesInBag. To limit the number of covariates considered in forward selection (which is computationally intensive), the default value of nCandidateCovariates is set to 50. Overall, the default values perform well in our simulations, empirical gene expression and machine learning benchmark studies. But we recommend using the OOB estimate of predictive accuracy to inform the choice of the parameter values.

Table 1 Default setting of nFeaturesInBag

                     N         nFeaturesInBag/N                  N*        nFeaturesInBag/N
No interaction       1-10      1                                 1-10      1
                     11-300    1.0276 - 0.00276 N                11-300    1.0276 - 0.00276 N*
                     >300      0.2                               >300      0.2
2-way interaction    1-4       1                                 1-10      1
                     5-24      1.0276 - 0.00276 N(N+1)/2         11-300    1.0276 - 0.00276 N*
                     >24       0.2                               >300      0.2
3-way interaction    1-3       1                                 1-10      1
                     4-12      1.0276 - 0.00276 (N³+5N)/6        11-300    1.0276 - 0.00276 N*
                     >12       0.2                               >300      0.2

This table shows the default values of nFeaturesInBag, expressed as nFeaturesInBag/N, for RGLM, RGLM.inter2 and RGLM.inter3. N is the total number of features of the training data. N* is the effective number of features, which equals the number of features N plus the number of interaction terms. Formulas are shown in terms of both N and the corresponding N*. The constants 1.0276 and 0.00276 are obtained by interpolating a straight line between (10, 1) and (300, 0.2).
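Expressed as code, the rule in Table 1 reduces to a single function of the effective number of features N*. The sketch below is ours (the helper name defaultFeaturesInBagFraction is not part of the randomGLM package) and simply restates the interpolation described above.

  # Default proportion of features per bag (Table 1), as a function of the
  # effective number of features Nstar (N plus the number of interaction terms)
  defaultFeaturesInBagFraction <- function(Nstar) {
    if (Nstar <= 10) return(1)        # few features: use them all
    if (Nstar > 300) return(0.2)      # many features: use 20% of them
    1.0276 - 0.00276 * Nstar          # line through (10, 1) and (300, 0.2)
  }

  # Example: N = 100 original features with pairwise interactions,
  # so Nstar = N(N + 1)/2 = 5050 and the default fraction is 0.2
  N <- 100
  Nstar <- N * (N + 1) / 2
  nFeaturesInBag <- round(N * defaultFeaturesInBagFraction(Nstar))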
Relationship with related prediction methods

As discussed below, RGLM can be interpreted as a variant of a bagged predictor [10]. In particular, it is similar to the bagged forward linear regression model [10] but differs in the following aspects:

1. RGLM allows for interaction terms between features, which greatly improve the performance on some data sets (in particular the UCI benchmark data sets). We refer to RGLM involving two-way or three-way interactions as RGLM.inter2 and RGLM.inter3, respectively.
2. RGLM has a parameter nFeaturesInBag that allows one to restrict the number of features used in each bootstrap sample. This parameter is conceptually related to the mtry parameter of the random forest predictor. In essence, this parameter allows one to use a random subspace method (RSM, [17]) in each bootstrap sample.
3. RGLM has a parameter nCandidateCovariates that allows one to restrict the number of features in forward regression, which not only has computational advantages but also introduces additional instability into the individual predictors, which is a desirable characteristic of an ensemble predictor.
4. RGLM optimizes the AIC criterion during forward selection.
5. RGLM has a "thinning threshold" parameter which allows one to reduce the number of features involved in prediction while maintaining good prediction accuracy. Since a thinned RGLM involves far fewer features, it facilitates understanding of how the ensemble arrives at its predictions.

RGLM is not only related to bagging but also to the random subspace method (RSM) proposed by [17]. In the RSM, the training set is also repeatedly modified as in bagging, but this modification is performed in the feature space (rather than the sample space). In the RSM, a subset of features is randomly selected, which amounts to restricting attention to a subspace of the original feature space. As one of its construction steps, RGLM uses an RSM on each bootstrap sample. Future research could explore whether random partitions, as opposed to random subspaces, would be useful for constructing an RGLM. Random partitions of the feature space are similar to random subspaces but they divide the feature space into mutually exclusive subspaces [19,20]. Random partition based predictors have been shown to perform well in high-dimensional data [19]. Both RSM and random partitions have more general applicability than RGLM since these methods can be used for any base learner. There is a vast literature on ensemble induction methods, but a property worth highlighting is that RGLM uses forward variable selection of GLMs. Recall that RGLM goes through the following steps: 1) bootstrap sampling, 2) RSM (and optionally creating interaction terms), 3) forward variable selection of a GLM, 4) aggregation of votes. Empirical studies involving different base learners (other than GLMs) have shown that combining bootstrap sampling with RSM (steps 1 and 2) leads to ensemble predictors with comparable performance to that of the random forest predictor [21].

Another prediction method, the random multinomial logit model (RMNL), also shares a similar idea with RGLM. It was recently proposed for multi-class outcome prediction [18]. RMNL bags multinomial logit models with random feature selection in each bag. It can be seen as a special case of RGLM, except that no forward model selection is carried out.

Software implementation

The RGLM method is implemented in the freely available R package randomGLM. The R function randomGLM allows the user to output training set predictions, out-of-bag predictions, test set predictions, coefficient values, and variable importance measures. The predict function can be used to arrive at test set predictions. Tutorials can be found at the following webpage: http://labs.genetics.ucla.edu/horvath/RGLM.
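As a concrete illustration of a typical analysis, the sketch below shows how the package might be called for a binary outcome. The parameter names follow the text above (nBags, nCandidateCovariates, maxInteractionOrder); the exact argument list of randomGLM and of its predict method, including the name of the new-data argument used here, are assumptions that should be checked against the package help pages and the tutorials linked above.

  library(randomGLM)

  # x: numeric feature matrix (training samples in rows), y: binary outcome
  fit <- randomGLM(x, y,
                   nBags = 100,              # number of bootstrap data sets
                   nCandidateCovariates = 50,
                   maxInteractionOrder = 1)  # set to 2 for RGLM.inter2

  # Test set predictions via the predict method described above
  # (argument name 'newdata' is an assumption; see ?predict.randomGLM)
  yTestPredicted <- predict(fit, newdata = xTest)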
Short description of alternative prediction methods

Forward selected generalized linear model predictor (forwardGLM). We denote by forwardGLM the (single) generalized linear model predictor whose covariates were selected using forward feature selection (according to the AIC criterion). Thus, forwardGLM does not involve bagging or random feature selection, and is not an ensemble predictor.

Random forest (RF). RF is an ensemble predictor that consists of a collection of decision trees which vote for the class of observations [13]. The RF is known for its outstanding predictive accuracy. We used the randomForest R package in our studies. We considered two choices for the RF parameter mtry: i) the default RF predictor, where mtry equals the square root of the number of features, and ii) RFbigmtry, where mtry equals the total number of features. We always generated at least 500 trees per forest but used 1000 trees when calculating variable importance measures.

Recursive partitioning and regression trees (Rpart). Classification and regression trees were generated using the default settings of the rpart R package. Tree methods are described in [3].

Linear discriminant analysis (LDA). LDA aims to find a linear combination of features (referred to as discriminant variables) to predict a binary outcome (reviewed in [22,23]). We used the lda R function in the MASS R package with parameter choice method = moment.

Diagonal linear discriminant analysis (DLDA). DLDA is similar to LDA but it ignores the correlation patterns between features. While this is often an unrealistic assumption, DLDA (also known as gene voting) has been found to work well in gene expression applications [4]. Here we used the default parameters from the supclust R package [24].

K nearest neighbor (KNN). We used the knn R function in the class R package [22,23], which chose the parameter k of nearest neighbors using 3-fold cross validation (CV).

Support vector machines (SVM). We used the default parameters from the e1071 R package to fit SVMs [2]. Additional details can be found in [25].

Shrunken centroids (SC). The SC predictor is known to work well in the context of gene expression data [26]. Here we used the implementation in the pamr R package [26], which chose the optimal level of shrinkage using cross validation.

Penalized regression models. Various convex penalties can be applied to generalized linear models. We considered ridge regression [27] corresponding to an ℓ2 penalty, the lasso corresponding to an ℓ1 penalty [28], and the elastic net corresponding to a linear combination of ℓ1 and ℓ2 penalties [29]. We used the glmnet R function from the glmnet R package [30,31] with alpha parameter values of 0, 1, and 0.5, respectively. glmnet also involves another parameter (lambda), which was chosen as the median of the lambda sequence output resulting from glmnet. For the UCI benchmark data sets, pairwise interactions between features were considered.
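To make the penalized regression settings concrete, the following sketch fits the three penalties with the glmnet package for a binary outcome and applies the lambda rule stated above; xTest is an illustrative test set matrix, not an object from the article.

  library(glmnet)

  # x: feature matrix, y: binary outcome (0/1); alpha selects the penalty
  fitRidge   <- glmnet(x, y, family = "binomial", alpha = 0)    # ridge (l2)
  fitElastic <- glmnet(x, y, family = "binomial", alpha = 0.5)  # elastic net
  fitLasso   <- glmnet(x, y, family = "binomial", alpha = 1)    # lasso (l1)

  # lambda chosen as the median of the lambda sequence returned by glmnet
  lambdaChosen <- median(fitLasso$lambda)
  predLasso <- predict(fitLasso, newx = xTest, s = lambdaChosen, type = "response")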
20 disease-related gene expression data sets

We use 20 disease related gene expression data sets involving cancer and other human diseases (described in Table 2). The first 10 data sets, involving various cancers, were previously used by [5]. These data can be downloaded from the author's webpage at http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html. The BrainTumor2 and DLBCL data sets were downloaded from http://www.gems-system.org/. The remaining 8 data sets (lung1–MSdiagnosis2) were downloaded from either the Gene Expression Omnibus (GEO) database or the ArrayExpress database in raw form and subsequently preprocessed using MAS5 normalization and quantile normalization. Only the top 10000 probes (features) with highest mean expression were considered for outcome prediction. We briefly point out that Diaz et al (2006) report prediction error rates estimated using a bootstrap method. In contrast, we report 3-fold cross validation estimates (averaged over 100 random partitions of the data into 3 folds), which may explain minor numerical differences between our study and that of Diaz et al (2006).

Table 2 Description of the 20 disease expression data sets

Data set        Samples   Features   Reference   Data set ID   Binary outcome
adenocarcinoma  76        9868       [32]        NA            most prevalent class vs others
brain           42        5597       [33]        NA            most prevalent class vs others
breast2         77        4869       [34]        NA            most prevalent class vs others
breast3         95        4869       [34]        NA            most prevalent class vs others
colon           62        2000       [35]        NA            most prevalent class vs others
leukemia        38        3051       [36]        NA            most prevalent class vs others
lymphoma        62        4026       [37]        NA            most prevalent class vs others
NCI60           61        5244       [38]        NA            most prevalent class vs others
prostate        102       6033       [39]        NA            most prevalent class vs others
srbct           63        2308       [40]        NA            most prevalent class vs others
BrainTumor2     50        10367      [41]        NA            Anaplastic oligodendrogliomas vs Glioblastomas
DLBCL           77        5469       [42]        NA            follicular lymphoma vs diffuse large B-cell lymphoma
lung1           58        10000      [43]        GSE10245      Adenocarcinoma vs Squamous cell carcinoma
lung2           46        10000      [44]        GSE18842      Adenocarcinoma vs Squamous cell carcinoma
lung3           71        10000      [45]        GSE2109       Adenocarcinoma vs Squamous cell carcinoma
psoriasis1      180       10000      [46,47]     GSE13355      lesional vs healthy skin
psoriasis2      82        10000      [48]        GSE14905      lesional vs healthy skin
MSstage         26        10000      [49]        E-MTAB-69     relapsing vs remitting RRMS
MSdiagnosis1    27        10000      [50]        GSE21942      RRMS vs healthy control
MSdiagnosis2    44        10000      [49]        E-MTAB-69     RRMS vs healthy control

Sample size, number of features, original reference, data set IDs and outcomes for the 20 disease related gene expression data sets.

Empirical gene expression data sets

For all data sets below, we considered 100 randomly selected gene traits, i.e. 100 randomly selected probes. They were directly used as continuous outcomes or dichotomized according to the median value (top half = 1, bottom half = 0) to generate binary outcomes. For all data sets except "Brain cancer", 2/3 of the observations (arrays) were randomly chosen as the training set, while the remaining samples were chosen as the test set. We focused on the 5000 genes (probes) with the highest mean expression levels in each data set.
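The repeated 3-fold cross-validation protocol mentioned above (accuracy averaged over many random partitions into 3 folds) can be sketched in base R as follows; cvAccuracy and fitPredict are our own placeholder names, with fitPredict standing for any classifier that is trained on one part of the data and returns class predictions for the rest.

  # Average 3-fold CV accuracy over repeated random partitions into folds.
  # fitPredict(xTrain, yTrain, xTest) must return predicted classes for xTest.
  cvAccuracy <- function(x, y, fitPredict, nRepeats = 100, nFolds = 3) {
    accPerRepeat <- replicate(nRepeats, {
      fold <- sample(rep(seq_len(nFolds), length.out = length(y)))
      nCorrect <- sapply(seq_len(nFolds), function(k) {
        pred <- fitPredict(x[fold != k, , drop = FALSE], y[fold != k],
                           x[fold == k, , drop = FALSE])
        sum(pred == y[fold == k])
      })
      sum(nCorrect) / length(y)
    })
    mean(accPerRepeat)
  }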
Brain cancer data sets. These two related data sets contain 55 and 65 microarray samples of glioblastoma (brain cancer) patients, respectively. Gene expression profiles were measured using Affymetrix U133 microarrays. A detailed description can be found in [51]. The first data set (comprised of 55 samples) was used as a training set while the second data set (comprised of 65 samples) was used as a test set.

SAFHS blood lymphocyte data set. This data set [52] was derived from blood lymphocytes of randomly ascertained participants enrolled in the San Antonio Family Heart Study. Gene expression profiles were measured with the Illumina Sentrix Human Whole Genome (WG-6) Series I BeadChips. After removing potential outliers (based on low interarray correlations), 1084 samples remained in the data set.

WB whole blood gene expression data set. This is the whole blood gene expression data from healthy controls. Peripheral blood samples from healthy individuals were analyzed using Illumina Human HT-12 microarrays. After pre-processing, 380 samples remained in the data set.

Mouse tissue gene expression data sets. The 4 tissue specific gene expression data sets were generated by the lab of Jake Lusis at UCLA. These data sets measure gene expression levels (Agilent array platform) from adipose (239 samples), brain (221 samples), liver (272 samples) and muscle (252 samples) tissue of mice from the B×H F2 mouse intercross described in [53,54]. In addition to gene traits, we also predicted 21 quantitative mouse clinical traits including mouse weight, length, abdominal fat, other fat, total fat, adiposity index (total fat × 100/weight), plasma triglycerides, total plasma cholesterol, high-density lipoprotein fraction of cholesterol, plasma unesterified cholesterol, plasma free fatty acids, plasma glucose, plasma low-density lipoprotein and very low-density lipoprotein cholesterol, plasma MCP-1 protein levels, plasma insulin, plasma glucose-insulin ratio, plasma leptin, plasma adiponectin, aortic lesion size (measured by histological examination using a semi-quantitative scoring method), aneurysms (semi-quantitative scoring method), and aortic calcification in the lesion area.

Machine learning benchmark data sets

The 12 machine learning benchmark data sets used in this article are listed in Table 3. Note that only eight of the 12 data sets have a binary outcome. The multi-class outcomes of the 4 remaining data sets were turned into binary outcomes by considering the most prevalent class versus all other classes combined. Missing data were imputed using nearest neighbor averaging. For each data set and prediction method, we report the average 3-fold CV estimate of prediction accuracy over 100 random partitions of the data into 3 folds.

Table 3 Description of the UCI benchmark data

Data set       Samples   Features
BreastCancer   699       9
HouseVotes84   435       16
Ionosphere     351       34
diabetes       768       8
Sonar          208       60
ringnorm       300       20
threenorm      300       20
twonorm        300       20
Glass          214       9
Satellite      6435      36
Vehicle        846       18
Vowel          990       10

Sample size and number of features for the 12 UCI machine learning benchmark data sets.

Simulated gene expression data sets

We simulated an outcome variable y and gene expression data that contained 5 modules (clusters). Only 2 of the modules were comprised of genes that correlated with the outcome y. 45% of the genes were background genes, i.e. these genes were outside of any module. The simulation scheme is detailed in Additional file 1 and implemented in the R function simulateDatExpr5Modules from the WGCNA R package [55]. This R function was used to simulate pairs of training and test data sets. The simulation study was used to evaluate prediction methods for continuous outcomes and for binary outcomes. For binary outcome prediction, the continuous outcome y was thresholded according to its median value. We considered 180 different simulation scenarios involving varying sizes of the training data (50, 100, 200, 500, 1000 or 2000 samples) and varying numbers of genes (60, 100, 500, 1000, 5000 or 10000 genes) that served as features. Test sets contained the same number of genes as in the corresponding training set and 1000 samples. For each simulation scenario, we simulated 5 replicates resulting from different choices of the random seed.
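For orientation, the simulation settings described above can be enumerated as a simple grid in R. The seed values 1 to 5 are illustrative stand-ins for the 5 replicate random seeds, and the arithmetic assumes that the 180 figure counts each replicate of each parameter combination separately.

  # Grid of simulation settings: training set sizes, numbers of genes,
  # and 5 replicate seeds per combination (6 x 6 x 5 = 180 data sets)
  settings <- expand.grid(
    nTrainSamples = c(50, 100, 200, 500, 1000, 2000),
    nGenes        = c(60, 100, 500, 1000, 5000, 10000),
    replicate     = 1:5
  )
  nrow(settings)  # 180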
Results

Motivating example: disease-related gene expression data sets

We compare the prediction accuracy of RGLM with that of other widely used methods on 20 gene expression data sets involving human disease related outcomes. Many of the 20 data sets (Table 2) are well known cancer data sets, which have been used in other comparative studies [4,5,56,57]. A brief description of the data sets can be found in Methods.

To arrive at an unbiased estimate of prediction accuracy, we used 3-fold cross validation (averaged over 100 random partitions of the data into 3 folds). Note that the accuracy equals 1 minus the median misclassification error rate. Table 4 reports the prediction accuracy of different methods including RGLM, random forest (RF, with default value for its mtry parameter), random forest (RFbigmtry, with mtry equal to the total number of features), tree predictor (also known as recursive partitioning, Rpart), linear discriminant analysis (LDA), diagonal linear discriminant analysis (DLDA), k nearest neighbor (KNN), support vector machine (SVM) and shrunken centroid (SC). A short description of these prediction methods is provided in Methods.

As seen from Table 4, RGLM achieves the highest mean accuracy in these disease data sets, followed by RFbigmtry and SC. Note that the standard random forest predictor (with default parameter choice) performs worse than RGLM. The accuracy difference between RGLM and alternative methods is statistically significant (Wilcoxon signed rank test p < 0.05) for all predictors except for RFbigmtry, DLDA and SC. Since RFbigmtry is an ensemble predictor that relies on thousands of features, it would be difficult to interpret its predictions in terms of the underlying genes.

Our evaluations focused on the accuracy (and misclassification error). However, a host of other accuracy measures could be considered. Additional file 2 presents the results for sensitivity and specificity. The top 3 methods with highest sensitivity are: RF (median sensitivity = 0.969), SVM (0.969) and RGLM (0.960). The top 3 methods with highest specificity are: SC (0.900), RGLM (0.857) and KNN (0.848).

Table 4 Prediction accuracy in the 20 disease gene expression data sets

Data set        RGLM    RF      RFbigmtry  Rpart    LDA     DLDA    KNN     SVM      SC
adenocarcinoma  0.842   0.842   0.842      0.737    0.842   0.744   0.842   0.842    0.803
brain           0.881   0.810   0.833      0.762    0.810   0.929   0.881   0.786    0.929
breast2         0.623   0.610   0.636      0.584    0.610   0.636   0.584   0.558    0.636
breast3         0.705   0.695   0.716      0.611    0.695   0.705   0.669   0.674    0.700
colon           0.855   0.823   0.823      0.726    0.855   0.839   0.774   0.774    0.871
leukemia        0.921   0.895   0.921      0.816    0.868   0.974   0.974   0.763    0.974
lymphoma        0.968   1.000   1.000      0.903    0.960   0.984   0.984   1.000    0.984
NCI60           0.902   0.869   0.869      0.738    0.885   0.902   0.852   0.869    0.918
prostate        0.931   0.892   0.902      0.853    0.873   0.627   0.804   0.853    0.912
srbct           1.000   0.944   0.984      0.921    0.857   0.905   0.952   0.873    1.000
BrainTumor2     0.760   0.750   0.740      0.620    0.760   0.700   0.700   0.660    0.720
DLBCL           0.909   0.851   0.883      0.831    0.922   0.779   0.870   0.792    0.857
lung1           0.931   0.931   0.931      0.828    0.914   0.931   0.931   0.897    0.914
lung2           0.935   0.935   0.935      0.826    0.957   0.978   0.935   0.848    0.978
lung3           0.901   0.901   0.887      0.803    0.873   0.859   0.831   0.859    0.887
psoriasis1      0.989   0.994   0.989      0.978    0.994   0.989   0.989   0.983    0.989
psoriasis2      0.963   0.988   0.976      0.963    0.976   0.963   0.963   0.963    0.963
MSstage1        0.846   0.846   0.846      0.423    0.769   0.769   0.808   0.769    0.769
MSdiagnosis1    0.963   0.926   0.926      0.556    0.889   0.889   0.963   0.926    0.926
MSdiagnosis2    0.591   0.614   0.614      0.568    0.545   0.568   0.568   0.568    0.523
Mean accuracy   0.871   0.856   0.863      0.752    0.843   0.833   0.844   0.813    0.863
Rank            1       4       2.5        9        6       7       5       8        2.5
P value         NA      0.029   0.079      0.00014  0.0075  0.05    0.014   0.00042  0.37

For each data set, the prediction accuracy was estimated using 3-fold cross validation across 100 random partitions of the data into 3 folds. Mean accuracies across the 20 data sets and the resulting ranks are summarized at the bottom. Two sided paired Wilcoxon test p-values can be used to determine whether the accuracy of RGLM is significantly different from that of other predictors. Note that the RGLM yields the highest mean accuracy.
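The p-values in the last row of Table 4 are two-sided paired Wilcoxon signed rank tests comparing RGLM to each alternative across the 20 data sets; in R the comparison for one method looks as follows (accRGLM and accRF are hypothetical length-20 vectors holding the per-data-set accuracies).

  # Two-sided paired Wilcoxon signed rank test: RGLM versus RF accuracies
  # across the 20 disease data sets
  wilcox.test(accRGLM, accRF, paired = TRUE, alternative = "two.sided")$p.value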
A strength of this empirical comparison is that it involves clinically or biologically interesting data sets, but a severe limitation is that it only involves 20 comparisons. Therefore, we now turn to more comprehensive empirical comparisons.

Binary outcome prediction

Empirical study involving dichotomized gene traits

Many previous empirical comparisons of gene expression data considered fewer than 20 data sets. To arrive at 700 comparisons, we use the following approach: We started out with 7 human and mouse gene expression data sets. For each data set, we randomly chose 100 genes as gene traits (outcomes), resulting in 7 × 100 possible outcomes. We removed the gene corresponding to the gene trait from the feature set. Next, each gene trait was dichotomized by its median value to arrive at a binary outcome y. The goal of each prediction analysis was to predict the dichotomized gene trait y based on the other genes. At first sight, this artificial outcome is clinically uninteresting, but it is worth emphasizing that clinicians often deal with dichotomized measures of gene products, e.g. high serum creatinine levels may indicate kidney damage, high PSA levels may indicate prostate cancer, and high HDL levels may indicate hypercholesterolemia. To arrive at unbiased estimates of prediction accuracy, we split each data set into a training and test set. Figure 2 (A) shows boxplots of the accuracies across the 700 comparisons. Similar performance patterns are observed for the individual data sets (Figure 2 (B-H)). The figure also reports pairwise comparisons of the RGLM method versus alternative methods. Specifically, it reports the two-sided Wilcoxon signed rank test p-values for testing whether the accuracy of the RGLM predictor is higher than that of the considered alternative method. Strikingly, RGLM is more accurate than the other methods overall. While the increase in accuracy is often minor, it is statistically significant, as can be seen by comparing RGLM to RF (median difference = 0.02, p = 2.1 × 10^-51), RFbigmtry (median difference = 0.01, p = 7.3 × 10^-16), LDA (median difference = 0.06, p = 2.4 × 10^-53), SVM (median difference = 0.03, p = 1.8 × 10^-62) and SC (median difference = 0.04, p = 4.3 × 10^-71). Other predictors perform even worse, and the corresponding p-values are not shown.

The fact that RFbigmtry is more accurate in this situation than the default version of RF probably indicates that relatively few genes are informative for predicting a dichotomized gene trait. Also note that RGLM is much more accurate than the unbagged forward selected GLM, which reflects that forward selection greatly overfits the training data. In conclusion, these comprehensive gene expression studies show that RGLM has outstanding prediction accuracy.
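One of the 700 prediction tasks described above can be set up in a few lines of R; expr stands for a hypothetical samples-by-probes expression matrix and is not an object from the article.

  # Pick a probe as the gene trait, remove it from the features, and
  # dichotomize it at its median (top half = 1, bottom half = 0)
  traitIndex <- sample(ncol(expr), 1)
  y <- as.numeric(expr[, traitIndex] > median(expr[, traitIndex]))
  x <- expr[, -traitIndex]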
Machine learning benchmark data analysis

Here we evaluate the performance of RGLM on the UCI machine learning benchmark data sets, which are often used for evaluating prediction methods [10,13,58-61]. We consider 12 benchmark data sets from the mlbench R package: 9 UCI data sets and 3 synthetic data sets (Table 3). We choose these data sets for two reasons. First, these 12 data sets were also used in the original evaluation of the random forest predictor [13]. Second, these data include all of the available data sets with binary outcomes in the mlbench R package. A detailed description of these data sets can be found in Methods. In his original publication on the random forest, Breiman found that the RF outperformed bagged predictors on the UCI benchmark data, which may explain why bagged GLMs have not received much attention. We hypothesize that the relatively poor performance of a bagged logistic regression model on these data sets could be ameliorated by considering interaction terms between the features. Table 5 confirms our hypothesis. RGLM.inter2 (corresponding to pairwise interaction terms) has superior or tied accuracy compared to RGLM in 10 out of 12 benchmark data sets. In particular, pairwise interactions greatly improve the prediction accuracy in the ringnorm data set. Higher order interactions (RGLM.inter3) do not perform better than RGLM.inter2 but dramatically increase the computational burden (data not shown).

Overall, we find that RGLM.inter2 ties with SVM (diff = -0.001, p = 0.96) and RF (diff = 0.001, p = 0.26) for the first place in the benchmark data. As can be seen from Additional file 3, RGLM.inter2 achieves the highest sensitivity and specificity, which also supports its good performance in the benchmark data sets.

A potential limitation of these comparisons is that we considered pairwise interaction terms for the RGLM predictor but not for the other predictors. To address this issue, we also considered pairwise interactions among features for the other predictors. Additional file 4 shows that no method surpasses RGLM.inter2 when pairwise interaction terms are considered. In particular, interaction terms between features do not improve the performance of the random forest predictor. A noteworthy disadvantage of RGLM.inter in case of many features is the computational burden that may result from adding interaction terms. In applications where interaction terms are needed for RGLM, faster alternatives (e.g. RF) remain an attractive choice.

Simulation study involving binary outcomes

As described in Methods, we simulated 180 gene expression data sets with binary outcomes.
The number of

Figure 2 Binary outcome prediction in empirical gene expression data sets. The boxplots show the test set prediction accuracies across 700 comparisons. The horizontal line inside each box represents the median accuracy. The horizontal dashed red line indicates the median accuracy of the RGLM predictor. P-values result from using the two-sided Wilcoxon signed rank test for evaluating whether the median accuracy of RGLM is the same as that of the mentioned method. For example, p.RF results from testing whether the median accuracy of RGLM is the same as that of the RF. (A) summarizes the test set performance for predicting 100 dichotomized gene traits from each of the 7 expression data sets. (B-H) show the results for the individual data sets: (B) Brain cancer, (C) SAFHS, (D) WB whole blood, (E) Mouse adipose, (F) Mouse brain, (G) Mouse liver, (H) Mouse muscle; 100 randomly chosen, dichotomized gene traits were used in each. Note the superior accuracy of the RGLM predictor across the different data sets.