Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions

Synthesis Lectures on Data Mining and Knowledge Discovery
Editor: Robert Grossman, University of Illinois, Chicago

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions
Giovanni Seni and John F. Elder
2010

Modeling and Data Mining in Blogosphere
Nitin Agarwal and Huan Liu
2009

Copyright © 2010 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other), except for brief quotations in printed reviews, without the prior permission of the publisher.

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions
Giovanni Seni and John F. Elder
www.morganclaypool.com
ISBN: 9781608452842 paperback
ISBN: 9781608452859 ebook
DOI 10.2200/S00240ED1V01Y200912DMK002

A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON DATA MINING AND KNOWLEDGE DISCOVERY
Lecture #2
Series Editor: Robert Grossman, University of Illinois, Chicago
Series ISSN
Synthesis Lectures on Data Mining and Knowledge Discovery
Print 2151-0067 Electronic 2151-0075

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions
Giovanni Seni
Elder Research, Inc. and Santa Clara University
John F. Elder
Elder Research, Inc. and University of Virginia
SYNTHESIS LECTURES ON DATA MINING AND KNOWLEDGE DISCOVERY #2
Morgan & Claypool Publishers

ABSTRACT
Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one that is usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges (from investment timing to drug discovery, and fraud detection to recommendation systems) where predictive accuracy is more vital than model interpretability.
Ensembles are useful with all modeling algorithms, but this book focuses on decision trees to explain them most clearly. After describing trees and their strengths and weaknesses, the authors provide an overview of regularization, today understood to be a key reason for the superior performance of modern ensembling algorithms. The book continues with a clear description of two recent developments: Importance Sampling (IS) and Rule Ensembles (RE). IS reveals classic ensemble methods (bagging, random forests, and boosting) to be special cases of a single algorithm, thereby showing how to improve their accuracy and speed. REs are linear rule models derived from decision tree ensembles. They are the most interpretable version of ensembles, which is essential to applications such as credit scoring and fault diagnosis. Lastly, the authors explain the paradox of how ensembles achieve greater accuracy on new data despite their (apparently much greater) complexity.

This book is aimed at novice and advanced analytic researchers and practitioners, especially in Engineering, Statistics, and Computer Science. Those with little exposure to ensembles will learn why and how to employ this breakthrough method, and advanced practitioners will gain insight into building even more powerful models. Throughout, snippets of code in R are provided to illustrate the algorithms described and to encourage the reader to try the techniques.¹

The authors are industry experts in data mining and machine learning who are also adjunct professors and popular speakers. Although early pioneers in discovering and using ensembles, they here distill and clarify the recent groundbreaking work of leading academics (such as Jerome Friedman) to bring the benefits of ensembles to practitioners.
The authors would appreciate hearing of errors in or suggested improvements to this book, by email at [email protected]@datamininglab.com. Errata and updates will be available from www.morganclaypool.com.

KEYWORDS
ensemble methods, rule ensembles, importance sampling, boosting, random forest, bagging, regularization, decision trees, data mining, machine learning, pattern recognition, model interpretation, model complexity, generalized degrees of freedom

¹ R is an Open Source language and environment for data analysis and statistical modeling, available through the Comprehensive R Archive Network (CRAN). The R system's library packages offer extensive functionality and can be downloaded from http://cran.r-project.org/ for many computing platforms. The CRAN web site also has pointers to tutorials and comprehensive documentation. A variety of excellent introductory books are also available; we particularly like Introductory Statistics with R by Peter Dalgaard and Modern Applied Statistics with S by W. N. Venables and B. D. Ripley.

To the loving memory of our fathers, Tito and Fletcher

Contents

Acknowledgments
Foreword by Jaffray Woodriff
Foreword by Tin Kam Ho

1 Ensembles Discovered
  1.1 Building Ensembles
  1.2 Regularization
  1.3 Real-World Examples: Credit Scoring + the Netflix Challenge
  1.4 Organization of This Book

2 Predictive Learning and Decision Trees
  2.1 Decision Tree Induction Overview
  2.2 Decision Tree Properties
  2.3 Decision Tree Limitations

3 Model Complexity, Model Selection and Regularization
  3.1 What is the “Right” Size of a Tree?
  3.2 Bias-Variance Decomposition
  3.3 Regularization
    3.3.1 Regularization and Cost-Complexity Tree Pruning
    3.3.2 Cross-Validation
    3.3.3 Regularization via Shrinkage
    3.3.4 Regularization via Incremental Model Building
    3.3.5 Example
    3.3.6 Regularization Summary