
Proactive Data Mining with Decision Trees PDF

94 Pages·2014·2.14 MB·English

Preview Proactive Data Mining with Decision Trees

SpringerBriefs in Electrical and Computer Engineering

For further volumes: http://www.springer.com/series/10059

Haim Dahan • Shahar Cohen • Lior Rokach • Oded Maimon

Proactive Data Mining with Decision Trees

Haim Dahan
Dept. of Industrial Engineering
Tel Aviv University
Ramat Aviv, Israel

Shahar Cohen
Dept. of Industrial Engineering & Management
Shenkar College of Engineering and Design
Ramat Gan, Israel

Lior Rokach
Information Systems Engineering
Ben-Gurion University
Beer-Sheva, Israel

Oded Maimon
Dept. of Industrial Engineering
Tel Aviv University
Ramat Aviv, Israel

ISSN 2191-8112
ISSN 2191-8120 (electronic)
ISBN 978-1-4939-0538-6
ISBN 978-1-4939-0539-3 (eBook)
DOI 10.1007/978-1-4939-0539-3
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2014931371

© The Author(s) 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

To our families

Preface

Data mining has emerged as a new science: the exploration, algorithmically and systematically, of data in order to extract patterns that can be used as a means of supporting organizational decision making. Data mining has evolved from machine learning and pattern recognition theories and algorithms for modeling data and extracting patterns. The underlying assumption of the inductive approach is that the trained model is applicable to future, unseen examples. Data mining can be considered as a central step in the overall knowledge discovery in databases (KDD) process.

In recent years, data mining has become extremely widespread, emerging as a discipline characterized by an increasingly large number of publications. Although an immense number of algorithms have been published in the literature, most of these algorithms stop short of the final objective of data mining: providing possible actions to maximize utility while reducing costs. While these algorithms are essential in moving data mining results to eventual application, they nevertheless require considerable pre- and post-processing guided by experts.

The gap between what is being discussed in the academic literature and real-life business applications is due to three main shortcomings in traditional data mining methods:

(i) Most existing classification algorithms are 'passive' in the sense that the induced models merely predict or explain a phenomenon, rather than help users to proactively achieve their goals by intervening in the distribution of the input data.
(ii) Most methods ignore relevant environmental/domain knowledge.
(iii) The traditional classification methods are mainly focused on model accuracy.

There are very few, if any, data mining methods that overcome all of these shortcomings together.

In this book we present a proactive and domain-driven method for classification tasks. This novel proactive approach to data mining not only induces a model for predicting or explaining a phenomenon, but also utilizes specific problem/domain knowledge to suggest specific actions to achieve optimal changes in the value of the target attribute. In particular, this work suggests a specific implementation of the domain-driven proactive approach for classification trees. The proactive method is a two-phase process. In the first phase, it trains a probabilistic classifier using a supervised learning algorithm. The resulting classification model from the first phase is a model that is predisposed to potential interventions and oriented toward maximizing a utility function the organization sets. In the second phase, it utilizes the induced classifier to suggest potential actions for maximizing utility while reducing costs.

This new approach involves intervening in the distribution of the input data, with the aim of maximizing an economic utility measure. This intervention requires the consideration of domain knowledge that is exogenous to the typical classification task. The work is focused on decision trees and based on the idea of moving observations from one branch of the tree to another. This work introduces a novel splitting criterion for decision trees, termed maximal-utility, which maximizes the potential for enhancing profitability in the output tree.

This book presents two real case studies, one of a leading wireless operator and the other of a major security company. In these case studies, we utilized our new approach to solve the real-world problems that these corporations faced. This book demonstrates that by applying the proactive approach to classification tasks, it becomes possible to solve business problems that cannot be approached through traditional, passive data mining methods.
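The two-phase process described in the preface can be illustrated with a minimal sketch. It is not the book's algorithm and does not implement the maximal-utility splitting criterion: the toy data, the single-split "tree", the benefit function, and the change costs below are all assumptions made up for illustration. Phase 1 estimates a class probability per branch; phase 2 looks for the branch move with the highest net utility gain (benefit gained minus the cost of the change).

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (plan, churned) pairs. Names and values are illustrative only.
records = [
    ("basic", 1), ("basic", 1), ("basic", 0), ("basic", 1),
    ("premium", 0), ("premium", 0), ("premium", 1), ("premium", 0),
]

def train_branch_probabilities(rows):
    """Phase 1: estimate P(churn = 1 | plan), i.e. a tree with a single split."""
    counts = defaultdict(Counter)
    for value, target in rows:
        counts[value][target] += 1
    return {v: c[1] / (c[0] + c[1]) for v, c in counts.items()}

def benefit(p_churn):
    """Assumed benefit function: 100 units of revenue per retained customer."""
    return 100.0 * (1.0 - p_churn)

# Assumed cost of moving a customer from one branch to another.
change_cost = {("basic", "premium"): 20.0, ("premium", "basic"): 5.0}

def best_action(current, probs):
    """Phase 2: pick the branch move with the highest net utility gain."""
    best = (current, 0.0)  # staying put costs nothing and gains nothing
    for candidate, p in probs.items():
        if candidate == current:
            continue
        gain = benefit(p) - benefit(probs[current]) - change_cost[(current, candidate)]
        if gain > best[1]:
            best = (candidate, gain)
    return best

probs = train_branch_probabilities(records)
print(probs)
print(best_action("basic", probs))
print(best_action("premium", probs))
```

With these assumed numbers, moving a "basic" customer to "premium" pays off because the expected benefit gain exceeds the change cost, while the reverse move does not, so the suggested action for "premium" customers is to leave them where they are.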
Tel Aviv, Israel
July, 2013

Haim Dahan
Shahar Cohen
Lior Rokach
Oded Maimon

Contents

1 Introduction to Proactive Data Mining
  1.1 Data Mining
  1.2 Classification Tasks
  1.3 Basic Terms
  1.4 Decision Trees (Classification Trees)
  1.5 Cost Sensitive Classification Trees
  1.6 Classification Trees Limitations
  1.7 Active Learning
  1.8 Actionable Data Mining
  1.9 Human Cooperated Mining
  References

2 Proactive Data Mining: A General Approach and Algorithmic Framework
  2.1 Notations
  2.2 From Passive to Proactive Data Mining
  2.3 Changing the Input Data
  2.4 The Need for Domain Knowledge: Attribute Changing Cost and Benefit Functions
  2.5 Maximal Utility: The Objective of Proactive Data Mining Tasks
  2.6 An Algorithmic Framework for Proactive Data Mining
  2.7 Chapter Summary
  References

3 Proactive Data Mining Using Decision Trees
  3.1 Why Decision Trees?
  3.2 The Utility Measure of Proactive Decision Trees
  3.3 An Optimization Algorithm for Proactive Decision Trees
  3.4 The Maximal-Utility Splitting Criterion
  3.5 Chapter Summary
  References

4 Proactive Data Mining in the Real World: Case Studies
  4.1 Proactive Data Mining in a Cellular Service Provider
  4.2 The Security Company Case
  4.3 Case Studies Summary
  References

5 Sensitivity Analysis of Proactive Data Mining
  5.1 Zero-one Benefit Function
  5.2 Dynamic Benefit Function
  5.3 Dynamic Benefits and Infinite Costs of the Unchangeable Attributes
  5.4 Dynamic Benefit and Balanced Cost Functions
  5.5 Chapter Summary
  References

6 Conclusions

Chapter 1
Introduction to Proactive Data Mining

In this chapter, we provide an introduction to the aspects of the exciting field of data mining that are relevant to this book. In particular, we focus on classification tasks and on decision trees as an algorithmic approach for solving classification tasks.

1.1 Data Mining

Data mining is an emerging discipline that refers to a wide variety of methods for automatically exploring, analyzing and modeling large data repositories in an attempt to identify valid, novel, useful, and understandable patterns. Data mining involves the inferring of algorithms that explore the data in order to create and develop a model that provides a framework for discovering previously unknown patterns within the data for analysis and prediction.
The accessibility and abundance of data today make data mining a matter of considerable importance and necessity. Given the recent growth of the field, it is not surprising that researchers and practitioners have at their disposal a wide variety of methods for making their way through the mass of information that modern datasets can provide.

1.2 Classification Tasks

In many cases the goal of data mining is to induce a predictive model. For example, in business applications such as direct marketing, decision makers are required to choose the action that maximizes a utility function. Predictive models can help decision makers make the best decision.

Supervised methods attempt to discover the relationship between input attributes (sometimes called independent variables) and a target attribute (sometimes referred to as a dependent variable). The relationship that is discovered is referred to as a model. Usually models describe and explain phenomena that are hidden in the dataset and can be used for predicting the value of the target attribute based on the

H. Dahan et al., Proactive Data Mining with Decision Trees, SpringerBriefs in Electrical and Computer Engineering, DOI 10.1007/978-1-4939-0539-3_1, © The Author(s) 2014
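The direct-marketing example in Section 1.2 can be made concrete with a short sketch of how a predictive model's output feeds a utility-maximizing decision. The action names, the profit per response, and the mailing cost are hypothetical values chosen for illustration; the response probability is assumed to come from some already-trained model.

```python
def expected_utility(action, p_response, profit=50.0, mailing_cost=2.0):
    """Expected utility of an action, given the predicted response probability.

    profit and mailing_cost are assumed figures, not values from the book.
    """
    if action == "mail":
        return p_response * profit - mailing_cost
    return 0.0  # "skip": no cost and no revenue

def choose_action(p_response):
    """Return the action with the highest expected utility."""
    return max(("mail", "skip"), key=lambda a: expected_utility(a, p_response))

# A customer predicted to respond with probability 0.10 is worth mailing;
# one with probability 0.02 is not, under the assumed profit and cost.
print(choose_action(0.10))
print(choose_action(0.02))
```

The model itself only supplies the probability; the decision depends equally on the utility function, which is why two customers scored by the same model can warrant different actions.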
