ebook img

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions (Synthesis Lectures on Data Mining and Knowledge Discovery) PDF

127 Pages·2010·2.6 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions (Synthesis Lectures on Data Mining and Knowledge Discovery)

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions Synthesis Lectures on Data Mining and Knowledge Discovery Editor RobertGrossman,UniversityofIllinois,Chicago EnsembleMethodsinDataMining:ImprovingAccuracyThroughCombining Predictions GiovanniSeniandJohnF.Elder 2010 ModelingandDataMininginBlogosphere NitinAgarwalandHuanLiu 2009 Copyright© 2010byMorgan&Claypool Allrightsreserved.Nopartofthispublicationmaybereproduced,storedinaretrievalsystem,ortransmittedin anyformorbyanymeans—electronic,mechanical,photocopy,recording,oranyotherexceptforbriefquotationsin printedreviews,withoutthepriorpermissionofthepublisher. EnsembleMethodsinDataMining:ImprovingAccuracyThroughCombiningPredictions GiovanniSeniandJohnF.Elder www.morganclaypool.com ISBN:9781608452842 paperback ISBN:9781608452859 ebook DOI10.2200/S00240ED1V01Y200912DMK002 APublicationintheMorgan&ClaypoolPublishersseries SYNTHESISLECTURESONDATAMININGANDKNOWLEDGEDISCOVERY Lecture#2 SeriesEditor:RobertGrossman,UniversityofIllinois,Chicago SeriesISSN SynthesisLecturesonDataMiningandKnowledgeDiscovery Print2151-0067 Electronic2151-0075 Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions Giovanni Seni ElderResearch,Inc.andSantaClaraUniversity John F.Elder ElderResearch,Inc.andUniversityofVirginia SYNTHESISLECTURESONDATAMININGANDKNOWLEDGEDISCOVERY #2 M &C Morgan &cLaypool publishers ABSTRACT EnsemblemethodshavebeencalledthemostinfluentialdevelopmentinDataMiningandMachine Learning in the past decade.They combine multiple models into one usually more accurate than the best of its components.Ensembles can provide a critical boost to industrial challenges – from investment timing to drug discovery, and fraud detection to recommendation systems – where predictiveaccuracyismorevitalthanmodelinterpretability. Ensembles are useful with all modeling algorithms,but this book focuses on decision trees toexplainthemmostclearly.Afterdescribingtreesandtheirstrengthsandweaknesses,theauthors provide an overview of regularization – today understood to be a key reason for the superior per- formance of modern ensembling algorithms.The book continues with a clear description of two recentdevelopments:ImportanceSampling(IS)andRuleEnsembles(RE).ISrevealsclassicensemble methods–bagging,randomforests,andboosting–tobespecialcasesofasinglealgorithm,thereby showinghowtoimprovetheiraccuracyandspeed.REsarelinearrulemodelsderivedfromdecision tree ensembles.They are the most interpretable version of ensembles,which is essential to appli- cations such as credit scoring and fault diagnosis. Lastly, the authors explain the paradox of how ensemblesachievegreateraccuracyonnewdatadespitetheir(apparentlymuchgreater)complexity. Thisbookisaimedatnoviceandadvancedanalyticresearchersandpractitioners–especially inEngineering,Statistics,andComputerScience.Thosewithlittleexposuretoensembleswilllearn whyandhowtoemploythisbreakthroughmethod,andadvancedpractitionerswillgaininsightinto building even more powerful models.Throughout,snippets of code in R are provided to illustrate thealgorithmsdescribedandtoencouragethereadertotrythetechniques1. Theauthorsareindustryexpertsindataminingandmachinelearningwhoarealsoadjunct professorsandpopularspeakers.Althoughearlypioneersindiscoveringandusingensembles,they heredistillandclarifytherecentgroundbreakingworkofleadingacademics(suchasJeromeFried- man)tobringthebenefitsofensemblestopractitioners. The authors would appreciate hearing of errors in or suggested improvements to this book, [email protected]@datamininglab.com.Errataand updateswillbeavailablefromwww.morganclaypool.com KEYWORDS ensemblemethods,ruleensembles,importancesampling,boosting,randomforest,bag- ging,regularization,decisiontrees,datamining,machinelearning,patternrecognition, modelinterpretation,modelcomplexity,generalizeddegreesoffreedom 1RisanOpenSourceLanguageandenvironmentfordataanalysisandstatisticalmodelingavailablethroughtheComprehensive RArchiveNetwork(CRAN).TheRsystem’slibrarypackagesofferextensivefunctionality,andbedownloadedformhttp:// cran.r-project.org/formanycomputingplatforms.TheCRANwebsitealsohaspointerstotutorialandcomprehensive documentation.Avarietyofexcellentintroductorybooksarealsoavailable;weparticularlylikeIntroductoryStatisticswithRby PeterDalgaardandModernAppliedStatisticswithSbyW.N.VenablesandB.D.Ripley. To the loving memory of our fathers, Tito and Fletcher ix Contents Acknowledgments..........................................................xiii ForewordbyJaffrayWoodriff.................................................xv ForewordbyTinKamHo...................................................xvii 1 EnsemblesDiscovered .......................................................1 1.1 BuildingEnsembles ........................................................4 1.2 Regularization .............................................................6 1.3 Real-WorldExamples:CreditScoring+theNetflixChallenge..................7 1.4 OrganizationofThisBook..................................................8 2 PredictiveLearningandDecisionTrees.......................................11 2.1 DecisionTreeInductionOverview..........................................15 2.2 DecisionTreeProperties ...................................................18 2.3 DecisionTreeLimitations..................................................19 3 ModelComplexity,ModelSelectionandRegularization.......................21 3.1 Whatisthe“Right”SizeofaTree?..........................................21 3.2 Bias-VarianceDecomposition...............................................22 3.3 Regularization ............................................................25 3.3.1 RegularizationandCost-ComplexityTreePruning 25 3.3.2 Cross-Validation 26 3.3.3 RegularizationviaShrinkage 28 3.3.4 RegularizationviaIncrementalModelBuilding 32 3.3.5 Example 34 3.3.6 RegularizationSummary 37

Description:
Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges -- from investment t
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.