David L. Olson • Desheng Wu

Predictive Data Mining Models
Second Edition

David L. Olson
College of Business
University of Nebraska-Lincoln
Lincoln, NE, USA

Desheng Wu
Economics and Management School
University of Chinese Academy of Sciences
Beijing, China

ISSN 2191-1436    ISSN 2191-1444 (electronic)
Computational Risk Management
ISBN 978-981-13-9663-2    ISBN 978-981-13-9664-9 (eBook)
https://doi.org/10.1007/978-981-13-9664-9

1st edition: © Springer Science+Business Media Singapore 2017
2nd edition: © Springer Nature Singapore Pte Ltd. 2020

Preface

The book reviews forecasting data mining models, from basic tools for stable data through causal models and more advanced models using trends and cycles. Classification modelling tools are also discussed. These models are demonstrated using business-related data, including stock indices, crude oil price, and the price of gold. The style of the book is intended to be descriptive, seeking to explain how methods work, with some citations, but without deep scholarly reference. The data sets and software are all selected for widespread availability and access by any reader with computer links.

The second edition focuses discussion of knowledge management more on data mining aspects. More data sets are included than in the first edition. Both editions cover Rattle and WEKA software for the most common types of data mining models.

Lincoln, USA    David L. Olson
Beijing, China    Desheng Wu

Contents

1 Knowledge Management
  1.1 The Big Data Era
  1.2 Business Intelligence
  1.3 Knowledge Management
  1.4 Computer Support Systems
  1.5 Data Mining Forecasting Applications
  1.6 Data Mining Tools
  1.7 Summary
  References
2 Data Sets
  2.1 Gold
  2.2 Other Datasets
    2.2.1 Financial Index Data
    2.2.2 Loan Analysis Data
    2.2.3 Job Application Data
    2.2.4 Insurance Fraud Data
    2.2.5 Expenditure Data
  2.3 Summary
  References
3 Basic Forecasting Tools
  3.1 Moving Average Models
  3.2 Regression Models
  3.3 Time Series Error Metrics
  3.4 Seasonality
  3.5 Demonstration Data
  3.6 Software Demonstrations
    3.6.1 R Software
    3.6.2 Weka
  3.7 Summary
4 Multiple Regression
  4.1 Data Series
  4.2 Correlation
  4.3 Lags
  4.4 Summary
5 Regression Tree Models
  5.1 R Regression Trees
  5.2 Random Forests
  5.3 WEKA Regression Trees
    5.3.1 Decision Stump
    5.3.2 Random Tree Modeling
    5.3.3 REP Tree Modeling
    5.3.4 Random Forest
    5.3.5 M5P Modeling
  5.4 Summary
  Reference
6 Autoregressive Models
  6.1 ARIMA Models
  6.2 ARIMA Model of Brent Crude
  6.3 ARMA
  6.4 GARCH Models
    6.4.1 ARCH(q)
    6.4.2 GARCH(p, q)
    6.4.3 EGARCH
    6.4.4 GJR(p, q)
  6.5 Regime Switching Models
  6.6 Application on Crude Oil Data
  6.7 Summary
  References
7 Classification Tools
  7.1 Bankruptcy Data Set
  7.2 Logistic Regression
  7.3 Support Vector Machines
  7.4 Neural Networks
  7.5 Decision Trees
  7.6 Random Forests
  7.7 Boosting
  7.8 Comparison
  7.9 WEKA Classification Modeling
  7.10 Summary
  Reference
8 Predictive Models and Big Data
  References

Chapter 1
Knowledge Management

Knowledge management is an overarching term referring to the ability to identify, store, and retrieve knowledge. Identification requires gathering the information needed and analyzing available data to make effective decisions regarding whatever the organization does. This includes research, digging through records, or gathering data from wherever it can be found. Storage and retrieval of data involve database management, using many tools developed by computer science. Thus knowledge management involves understanding what knowledge is important to the organization, understanding systems important to organizational decision making, database management, and analytic tools of data mining.

1.1 The Big Data Era

The era of big data is here [1]. Davenport defines big data as:

• Data too big to fit on a single server;
• Too unstructured to fit in a row-and-column database;
• Flowing too continuously to fit into a static data warehouse;
• Having the characteristic of lacking structure.

Big data has evolved with an ever-more complex computer environment. Advances in technology have given organizations and enterprises access to unprecedented amounts and varieties of data [2]. There are many data collection channels in current industrial systems, linked by wireless sensors and Internet connectivity. Business intelligence supported by data mining and analytics is necessary to cope with this rapidly changing environment [3].

Knowledge management (KM) needs to cope with big data by identifying and managing knowledge assets within organizations. KM is process oriented, thinking in terms of how knowledge can be acquired, as well as tools to aid decision making.
Rothberg and Erickson [4] give a framework defining data as observation, which when put into context becomes information, which when processed by human understanding becomes knowledge. The point of big data is to analyze, converting data into insights, innovation, and business value. It can add value by providing real-time measures of performance, providing more timely analyses based on more complete data, and leading to sounder decisions [5].

We live in an environment driven by data. Amazon prospers by understanding what its customers want, and delivering content in effective ways. Wal-Mart has been very successful in using electronic means to gather sales in real time, storing 65 weeks of data in massive data warehouse systems that it intensively mines to support inventory, pricing, transportation, and other decisions related to its business. Data management also is found in governmental operations. The National Weather Service has collected vast quantities of information related to weather, harnessing high-end computing power to improve weather prediction. NASA has developed a knowledge base of physical relationships enabling space activities. Waller and Fawcett [6] describe big data in terms of volume, velocity, and variety.

With respect to volume, retail sales data aggregates a firm's many cash register transactions in real time. Such information can be aggregated by consumer to profile each and generate recommendations for additional sales; the same input can update inventories by stock-keeping unit and better manage shipping across the supply chain; sensor data can also track misplaced inventory in stores and warehouses.

With respect to velocity, sales data can be real-time, as well as aggregated to hourly, daily, weekly, and monthly form to support marketing decisions. Inventory data obtained in real time can be aggregated to hourly or monthly updates. Location and time information can be organized to manage the supply chain.
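The aggregation of real-time sales into hourly, daily, weekly, and monthly form can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sales records, the names SALES, period_key, and aggregate_sales, and the period labels are all hypothetical and are not drawn from the data sets used in this book.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical point-of-sale records: (timestamp, stock-keeping unit, amount).
# These values are invented for illustration only.
SALES = [
    ("2020-01-06 09:15:00", "SKU-1", 12.50),
    ("2020-01-06 09:45:00", "SKU-2", 7.25),
    ("2020-01-06 14:05:00", "SKU-1", 12.50),
    ("2020-01-07 10:30:00", "SKU-3", 20.00),
    ("2020-02-01 16:20:00", "SKU-2", 7.25),
]


def period_key(when, period):
    """Return the label of the reporting period that contains `when`."""
    if period == "hourly":
        return when.strftime("%Y-%m-%d %H:00")
    if period == "daily":
        return when.strftime("%Y-%m-%d")
    if period == "weekly":
        # ISO week numbering: weeks start on Monday.
        year, week, _weekday = when.isocalendar()
        return f"{year}-W{week:02d}"
    if period == "monthly":
        return when.strftime("%Y-%m")
    raise ValueError(f"unknown period: {period}")


def aggregate_sales(records, period):
    """Sum sale amounts per reporting period."""
    totals = defaultdict(float)
    for stamp, _sku, amount in records:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        totals[period_key(when, period)] += amount
    return dict(totals)


if __name__ == "__main__":
    for period in ("hourly", "daily", "weekly", "monthly"):
        print(period, aggregate_sales(SALES, period))
```

The same transaction stream feeds every level of aggregation; only the period label changes. Grouping on the stock-keeping unit or a customer identifier instead of the timestamp would give the inventory and customer-profile views described above.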
Variety is magnified in this context by sales events from cash registers in brick-and-mortar locations, along with Internet sales, wholesale activity, international activity, and activity by competitors. All of this information can be combined with social media monitoring to better profile customers. Inventory activity can be monitored by type of outlet as well as by vendor. Sensor-obtained data can be traced by workers involved, paths used, and locations.

1.2 Business Intelligence

Business intelligence is concerned with gaining understanding of the business environment sufficient to make sound decisions. This involves the process of systematic acquisition, sorting, analyzing, interpreting, and exploiting information [7]. This has been expanded to the following categories of business intelligence [8]:

• Competitive intelligence involves intelligence about products, customers, competitors, and the environment;
• Market intelligence involves market opportunity, market penetration strategy, and market development;
• Marketing intelligence involves information concerning opportunities, market penetration strategy, and market development;
• Strategic intelligence involves intelligence needed to aid in forming policy and plans.

1.3 Knowledge Management

Knowledge management has received steady consideration in the literature, with strong recent interest [9]. A knowledge-based view of the application of big data recognizes that knowledge is critical to firm attainment of competitive advantage [10]. One view is that knowledge management consists of the process of identifying, storing, and retrieving knowledge [11]. That view combines perspectives from information systems as well as quantitative modeling. Information systems provide means to obtain, organize, store, and access data. Chen et al. [12] studied how elements of technology interact with managerial factors to better understand data. Human, technological, and relationship assets are needed to successfully implement knowledge management.
Describe the relationship of works in applying business intelligence, knowledge management, and analytics to dealing with operational problems in the era of big data.

1.4 Computer Support Systems

Statisticians and students of artificial intelligence revolutionized the field of statistics to develop data mining, which when combined with database capabilities evolving on the computer side led to business intelligence. The quantitative side of this development is business analytics, focusing on providing better answers to business decisions based on access to massive quantities of information, ideally in real time (big data).

Davenport [13] reviewed three eras of analytics (see Table 1.1). The first era involved business intelligence, with focus on computer systems to support human decision making (for instance, use of models and focused data on dedicated computer systems in the form of decision support systems). The second era evolved in the early 21st century, introducing big data through Internet and social media generation of masses of data. Davenport sees a third era in a data-enriched environment where on-line real-time analysis can be conducted by firms in every industry. This is accomplished through new tools, using Hadoop clusters and NoSQL databases to enable data discovery, applying embedded analytics supporting cross-disciplinary data teams.
Table 1.1 The evolution of big data

Era (Davenport)              Years       Specific meaning
Decision support             1970–1985   Data analysis to support decision making
Executive support            1980–1990   Data analysis by senior executives
Online analytic processing   1990–2000   Analysis of multidimensional tables
Business intelligence        1989–2005   Tools to support data-driven decisions, emphasizing reporting
Analytics                    2005–2010   Statistical and mathematical modeling for decision making
Big data                     2010–now    Large, unstructured, fast-moving data

One source of all of this data is the Internet of Things. Not only do people send messages; now cars, phones, and machines communicate with each other [14]. This enables much closer monitoring of patient health, to include little wristbands that monitor the wearer's pulse, temperature, and blood pressure and forward the readings to the patient's physician. How people ever survived until 2010 is truly a wonder. But it does indicate the tons of data in which a minuscule bit of important data exists. Of course, signals are sent only when critical limits are reached, just as vending machines can send signals to replenish stock at their critical limits. Monitors in homes can reduce electricity use, thus saving the globe from excessive warming. Cars can send signals to dealers about engine problems, so that they might send a tow truck to the location provided by the car's GPS. Insurers already advertise their ability to attach devices to cars to identify good drivers, a euphemism for detection of bad driving so that they can cancel policies more likely to call for claims.

Use of all of this data requires increased data storage, the next link in knowledge management. It also is supported by a new data environment, allowing release from the old statistical reliance on sampling, because masses of data usually preclude the need for sampling. This also leads to a change in emphasis from hypothesis generation and testing to more reliance on pattern recognition supported by machine learning.
A prime example of what this can accomplish is customer relationship management, where every detail of company interaction with each customer can be stored and recalled to analyze for likely interest in other company products, or management of their credit, all designed to optimize company revenue from every customer.

Knowledge is defined in dictionaries as the expertise obtained through experience or education leading to understanding of a subject. Knowledge acquisition refers to the processes of perception, learning, and reasoning to capture, structure, and represent knowledge from all sources for the purpose of storing, sharing, and implementing this knowledge. Our current age has seen a view of knowledge being used to improve society.

Knowledge discovery involves the process of obtaining knowledge, which of course can be accomplished in many ways. Some learn by observing, others by theorizing, yet others by listening to authority. Almost all of us learn in different