Annals of Information Systems

Series Editors
Ramesh Sharda, Oklahoma State University, Stillwater, OK, USA
Stefan Voß, University of Hamburg, Hamburg, Germany

For further volumes: http://www.springer.com/series/7573

Robert Stahlbock · Sven F. Crone · Stefan Lessmann
Editors

Data Mining
Special Issue in Annals of Information Systems

Editors

Robert Stahlbock
Department of Business Administration
University of Hamburg
Institute of Information Systems
Von-Melle-Park 5
20146 Hamburg
Germany

Sven F. Crone
Department of Management Science
Lancaster University Management School
Lancaster
United Kingdom LA1 4YX

Stefan Lessmann
Department of Business Administration
University of Hamburg
Institute of Information Systems
Von-Melle-Park 5
20146 Hamburg
Germany

ISSN 1934-3221
e-ISSN 1934-3213
ISBN 978-1-4419-1279-4
e-ISBN 978-1-4419-1280-0
DOI 10.1007/978-1-4419-1280-0
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2009910538

© Springer Science+Business Media, LLC 2010
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Data mining has experienced an explosion of interest over the last two decades.
It has been established as a sound paradigm to derive knowledge from large, heterogeneous streams of data, often using computationally intensive methods. It continues to attract researchers from multiple disciplines, including computer sciences, statistics, operations research, information systems, and management science. Successful applications include domains as diverse as corporate planning, medical decision making, bioinformatics, web mining, text recognition, speech recognition, and image recognition, as well as various corporate planning problems such as customer churn prediction, target selection for direct marketing, and credit scoring. Research in information systems equally reflects this inter- and multidisciplinary approach. Information systems research exceeds the software and hardware systems that support data-intensive applications, analyzing the systems of individuals, data, and all manual or automated activities that process the data and information in a given organization.

The Annals of Information Systems devotes a special issue to topics at the intersection of information systems and data mining in order to explore the synergies between information systems and data mining. This issue serves as a follow-up to the International Conference on Data Mining (DMIN), which is held annually in conjunction with WORLDCOMP, the largest annual gathering of researchers in computer science, computer engineering, and applied computing. The special issue includes significantly extended versions of prior DMIN submissions as well as contributions without DMIN context.

We would like to thank the members of the DMIN program committee. Their support was essential for the quality of the conferences and for attracting interesting contributions. We wish to express our sincere gratitude and respect toward Hamid R. Arabnia, general chair of all WORLDCOMP conferences, for his excellent and tireless support, organization, and coordination of all WORLDCOMP conferences.
Moreover, we would like to thank the two series editors, Ramesh Sharda and Stefan Voß, for their valuable advice, support, and encouragement. We are grateful for the pleasant cooperation with Neil Levine, Carolyn Ford, and Matthew Amboy from Springer and their professional support in publishing this volume. In addition, we would like to thank the reviewers for their time and their thoughtful reviews. Finally, we would like to thank all authors who submitted their work for consideration to this focused issue. Their contributions made this special issue possible.

Hamburg, Germany    Robert Stahlbock
Hamburg, Germany    Stefan Lessmann
Lancaster, UK       Sven F. Crone

Contents

1 Data Mining and Information Systems: Quo Vadis?
  Robert Stahlbock, Stefan Lessmann, and Sven F. Crone
  1.1 Introduction
  1.2 Special Issues in Data Mining
    1.2.1 Confirmatory Data Analysis
    1.2.2 Knowledge Discovery from Supervised Learning
    1.2.3 Classification Analysis
    1.2.4 Hybrid Data Mining Procedures
    1.2.5 Web Mining
    1.2.6 Privacy-Preserving Data Mining
  1.3 Conclusion and Outlook
  References

Part I Confirmatory Data Analysis

2 Response-Based Segmentation Using Finite Mixture Partial Least Squares
  Christian M. Ringle, Marko Sarstedt, and Erik A. Mooi
  2.1 Introduction
    2.1.1 On the Use of PLS Path Modeling
    2.1.2 Problem Statement
    2.1.3 Objectives and Organization
  2.2 Partial Least Squares Path Modeling
  2.3 Finite Mixture Partial Least Squares Segmentation
    2.3.1 Foundations
    2.3.2 Methodology
    2.3.3 Systematic Application of FIMIX-PLS
  2.4 Application of FIMIX-PLS
    2.4.1 On Measuring Customer Satisfaction
    2.4.2 Data and Measures
    2.4.3 Data Analysis and Results
  2.5 Summary and Conclusion
  References

Part II Knowledge Discovery from Supervised Learning

3 Building Acceptable Classification Models
  David Martens and Bart Baesens
  3.1 Introduction
  3.2 Comprehensibility of Classification Models
    3.2.1 Measuring Comprehensibility
    3.2.2 Obtaining Comprehensible Classification Models
  3.3 Justifiability of Classification Models
    3.3.1 Taxonomy of Constraints
    3.3.2 Monotonicity Constraint
    3.3.3 Measuring Justifiability
    3.3.4 Obtaining Justifiable Classification Models
  3.4 Conclusion
  References

4 Mining Interesting Rules Without Support Requirement: A General Universal Existential Upward Closure Property
  Yannick Le Bras, Philippe Lenca, and Stéphane Lallich
  4.1 Introduction
  4.2 State of the Art
  4.3 An Algorithmic Property of Confidence
    4.3.1 On UEUC Framework
    4.3.2 The UEUC Property
    4.3.3 An Efficient Pruning Algorithm
    4.3.4 Generalizing the UEUC Property
  4.4 A Framework for the Study of Measures
    4.4.1 Adapted Functions of Measure
    4.4.2 Expression of a Set of Measures of D_dconf
  4.5 Conditions for GUEUC
    4.5.1 A Sufficient Condition
    4.5.2 A Necessary Condition
    4.5.3 Classification of the Measures
  4.6 Conclusion
  References

5 Classification Techniques and Error Control in Logic Mining
  Giovanni Felici, Bruno Simeone, and Vincenzo Spinelli
  5.1 Introduction
  5.2 Brief Introduction to Box Clustering
  5.3 BC-Based Classifier
  5.4 Best Choice of a Box System
  5.5 Bi-criterion Procedure for BC-Based Classifier
  5.6 Examples
    5.6.1 The Data Sets
    5.6.2 Experimental Results with BC
    5.6.3 Comparison with Decision Trees
  5.7 Conclusions
  References

Part III Classification Analysis

6 An Extended Study of the Discriminant Random Forest
  Tracy D. Lemmond, Barry Y. Chen, Andrew O.
  Hatch, and William G. Hanley
  6.1 Introduction
  6.2 Random Forests
  6.3 Discriminant Random Forests
    6.3.1 Linear Discriminant Analysis
    6.3.2 The Discriminant Random Forest Methodology
  6.4 DRF and RF: An Empirical Study
    6.4.1 Hidden Signal Detection
    6.4.2 Radiation Detection
    6.4.3 Significance of Empirical Results
    6.4.4 Small Samples and Early Stopping
    6.4.5 Expected Cost
  6.5 Conclusions
  References

7 Prediction with the SVM Using Test Point Margins
  Süreyya Özöğür-Akyüz, Zakria Hussain, and John Shawe-Taylor
  7.1 Introduction
  7.2 Methods
  7.3 Data Set Description
  7.4 Results
  7.5 Discussion and Future Work
  References

8 Effects of Oversampling Versus Cost-Sensitive Learning for Bayesian and SVM Classifiers
  Alexander Liu, Cheryl Martin, Brian La Cour, and Joydeep Ghosh
  8.1 Introduction
  8.2 Resampling
    8.2.1 Random Oversampling
    8.2.2 Generative Oversampling
  8.3 Cost-Sensitive Learning
  8.4 Related Work
  8.5 A Theoretical Analysis of Oversampling Versus Cost-Sensitive Learning
    8.5.1 Bayesian Classification
    8.5.2 Resampling Versus Cost-Sensitive Learning in Bayesian Classifiers
    8.5.3 Effect of Oversampling on Gaussian Naive Bayes
    8.5.4 Effects of Oversampling for Multinomial Naive Bayes
  8.6 Empirical Comparison of Resampling and Cost-Sensitive Learning
    8.6.1 Explaining Empirical Differences Between Resampling and Cost-Sensitive Learning
    8.6.2 Naive Bayes Comparisons on Low-Dimensional Gaussian Data
    8.6.3 Multinomial Naive Bayes
    8.6.4 SVMs
    8.6.5 Discussion
  8.7 Conclusion
  Appendix
  References

9 The Impact of Small Disjuncts on Classifier Learning
  Gary M. Weiss
  9.1 Introduction
  9.2 An Example: The Vote Data Set
  9.3 Description of Experiments
  9.4 The Problem with Small Disjuncts
  9.5 The Effect of Pruning on Small Disjuncts
  9.6 The Effect of Training Set Size on Small Disjuncts
  9.7 The Effect of Noise on Small Disjuncts
  9.8 The Effect of Class Imbalance on Small Disjuncts
  9.9 Related Work
  9.10 Conclusion
  References

Part IV Hybrid Data Mining Procedures

10 Predicting Customer Loyalty Labels in a Large Retail Database: A Case Study in Chile
  Cristián J. Figueroa
  10.1 Introduction
  10.2 Related Work
  10.3 Objectives of the Study
    10.3.1 Supervised and Unsupervised Learning
    10.3.2 Unsupervised Algorithms
    10.3.3 Variables for Segmentation
    10.3.4 Exploratory Data Analysis
    10.3.5 Results of the Segmentation
  10.4 Results of the Classifier