Rajendra Akerkar • Priti Srinivas Sajja
Intelligent Techniques for Data Science
Rajendra Akerkar, Western Norway Research Institute, Sogndal, Norway
Priti Srinivas Sajja, Department of Computer Science, Sardar Patel University, Vallabh Vidhyanagar, Gujarat, India
ISBN 978-3-319-29205-2
ISBN 978-3-319-29206-9 (eBook)
DOI 10.1007/978-3-319-29206-9
Library of Congress Control Number: 2016955044
© Springer International Publishing Switzerland 2016
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Information and communication technology (ICT) has become a common tool for doing any business. With the high applicability and support provided by ICT, many difficult tasks have been simplified. On the other hand, ICT has also become a key factor in creating challenges! Today, the amount of data collected across a broad variety of domains far exceeds our ability to reduce and analyse without the use of intelligent analysis techniques. There is much valuable information hidden in the accumulated (big) data. However, it is very difficult to obtain this information and insight. Therefore, a new generation of computational theories and tools to assist humans in extracting knowledge from data is indispensable. After all, why should the tools and techniques, which are smart and intelligent in nature, not be used to minimize human involvement and to effectively manage the large pool of data?
Computational intelligence skills, which embrace the family of neural networks, fuzzy systems, and evolutionary computing in addition to other fields within machine learning, are effective in identifying, visualizing, classifying, and analysing data to support business decisions. Developed theories of computational intelligence have been applied in many fields of engineering, data analysis, forecasting, healthcare, and others. This text brings these skills together to address data science problems.
The term ‘data science’ has recently emerged to specifically designate a new profession that is expected to make sense of the vast collections of big data. But making sense of data has a long history. Data science is a set of fundamental principles that support and guide the extraction of information and insight from data. Possibly the concept most closely related to data science is data mining: the actual extraction of knowledge from data via technologies that incorporate these principles. The key output of data science is data products. Data products can be anything from a list of recommendations to a dashboard, or any product that supports achieving a more informed decision. Analytics is at the core of data science. Analytics focuses on understanding data in terms of statistical models. It is concerned with the collection, analysis, and interpretation of data, as well as the effective organization, presentation, and communication of results relying on data.
This textbook has been designed to meet the needs of individuals wishing to pursue a research and development career in data science and computational intelligence.
Overview of the Book
We have taught the topics in this book in various forms at different locations since 1994. In particular, this book is based on graduate lectures delivered by the authors over the past several years for a wide variety of data science-related courses at various universities and research organizations. The feedback from participants and colleagues at these venues has helped us to improve the text significantly.
The book can be used at the graduate or advanced undergraduate level as a textbook or major reference for courses such as intelligent control, computational science, applied artificial intelligence, and knowledge discovery in databases, among many others.
The book presents a sound foundation for the reader to design and implement data analytics solutions for real world applications in an intelligent manner. The content of the book is structured in nine chapters.
A brief description of the contents found within each chapter of the text follows.
• Data is a vital asset to any business. It can provide valuable insights into areas such as customer behaviour, market intelligence, and operational performance. Data scientists build intelligent systems to manage, interpret, understand, and derive key knowledge from this data. Chapter 1 offers an overview of such aspects of data science. Special emphasis is placed on helping the student determine how data science thinking is crucial in data-driven business.
• Data science projects differ from typical business intelligence projects. Chapter 2 presents an overview of the data life cycle, the data science project life cycle, and the data analytics life cycle. This chapter also focuses on explaining a standard analytics landscape.
• Among the most common tasks performed by data scientists are prediction and machine learning. Machine learning focuses on data modelling and related methods and learning algorithms for data sciences. Chapter 3 details the methods and algorithms used by data scientists and analysts.
• Fuzzy sets can be used as a universal approximator, which is crucial for modelling unknown objects. If an operator can linguistically describe the type of action to be taken in a specific situation, then it is quite useful to model the operator's control actions using data. Chapter 4 presents fundamental concepts of fuzzy logic and its practical use in data science.
• Chapter 5 introduces artificial neural networks, a computational intelligence technique modelled on the human brain. An important feature of these networks is their adaptive nature, where ‘learning by example’ replaces traditional ‘programming’ in problem solving. Another significant feature is the intrinsic parallelism that allows speedy computations. The chapter gives a practical primer on neural networks and deep learning.
• Evolutionary computing is an innovative approach to optimization. One area of evolutionary computing, genetic algorithms, involves the use of algorithms for global optimization. Genetic algorithms are based on the mechanisms of natural selection and genetics. Chapter 6 describes evolutionary computing, in particular with regard to biological evolution and genetic algorithms, in a machine learning context.
• Metaheuristics are known to be robust methods for optimization when the problem is computationally difficult or merely too large. Although metaheuristics often do not result in an optimal solution, they may provide reasonable solutions within adequate computation times, e.g., by using stochastic mechanisms. Metaheuristics and data analytics share common ground in that they look for approximate results out of a potentially intractable search space, via incremental operations. Chapter 7 offers a brief exposure to the essentials of metaheuristic approaches such as adaptive memory methods and swarm intelligence. Further classification approaches such as case-based reasoning are also discussed in the chapter. This classification approach relies on the idea that a new situation can be well represented by the accumulated experience of previously solved problems. Case-based reasoning has been used in important real world applications.
• To achieve the benefit that big data holds, it is necessary to instil analytics ubiquitously and to exploit the value in data. This requires an infrastructure that can manage and process exploding volumes of structured and unstructured data (in motion as well as at rest) and that can safeguard data privacy and security. Chapter 8 presents broad-based coverage of big data-specific technologies and tools that support advanced analytics as well as issues of data privacy, ethics, and security.
• Finally, Chapter 9 gives a concise introduction to R. The R programming language is elegant and flexible, and has a substantially expressive syntax designed around working with data. R also includes powerful graphics capabilities.
Lastly, the appendices provide a spectrum of popular tools for handling data science in practice. Throughout the book, real world case studies and exercises are given to highlight certain aspects of the material covered and to stimulate thought.
Intended Audience
This book is intended for individuals seeking to develop an understanding of data science from the perspective of the practicing data scientist, including:
• Graduate and undergraduate students looking to move into the world of data science.
• Managers of teams of business intelligence, analytics, and data professionals.
• Aspiring business and data analysts looking to add intelligent techniques to their skills.
Prerequisites
To fully appreciate the material in this book, we recommend the following prerequisites:
• An introduction to database systems, covering SQL and related programming systems.
• A sophomore-level course in data structures, algorithms, and discrete mathematics.
We would like to thank the students in our courses for their comments on the draft of the lecture notes. We also thank our families, friends, and colleagues who encouraged us in this endeavour. We acknowledge all the authors, researchers, and developers from whom we have acquired knowledge through their work. Finally, we must give thanks to the editorial team at Springer Verlag London, especially Helen Desmond, and the reviewers of this book for bringing the book together in an orderly manner.
We sincerely hope it meets the needs of our readers.
Sogndal, Norway: Rajendra Akerkar
Gujarat, India: Priti Srinivas Sajja
March 26, 2016
Contents
1 Introduction to Data Science
  1.1 Introduction
  1.2 History of Data Science
  1.3 Importance of Data Science in Modern Business
  1.4 Data Scientists
  1.5 Data Science Activities in Three Dimensions
    1.5.1 Managing Data Flow
    1.5.2 Managing Data Curation
    1.5.3 Data Analytics
  1.6 Overlapping of Data Science with Other Fields
  1.7 Data Analytic Thinking
  1.8 Domains of Application
    1.8.1 Sustainable Development for Resources
    1.8.2 Utilization of Social Network Platform for Various Activities
    1.8.3 Intelligent Web Applications
    1.8.4 Google’s Automatic Statistician Project
  1.9 Application of Computational Intelligence to Manage Data Science Activities
  1.10 Scenarios for Data Science in Business
  1.11 Tools and Techniques Helpful for Doing Data Science
    1.11.1 Data Cleaning Tools
    1.11.2 Data Munging and Modelling Tools
    1.11.3 Data Visualization Tools
  1.12 Exercises
  References
2 Data Analytics
  2.1 Introduction
  2.2 Cross Industry Standard Process
  2.3 Data Analytics Life Cycle
  2.4 Data Science Project Life Cycle
  2.5 Complexity of Analytics
  2.6 From Data to Insights
  2.7 Building Analytics Capabilities: Case of Banking
  2.8 Data Quality
  2.9 Data Preparation Process
  2.10 Communicating Analytics Outcomes
    2.10.1 Strategies for Communicating Analytics
    2.10.2 Data Visualization
    2.10.3 Techniques for Visualization
  2.11 Exercises
  References
3 Basic Learning Algorithms
  3.1 Learning from Data
  3.2 Supervised Learning
    3.2.1 Linear Regression
    3.2.2 Decision Tree
    3.2.3 Random Forest
    3.2.4 k-Nearest Neighbour
    3.2.5 Logistic Regression
    3.2.6 Model Combiners
    3.2.7 Naive Bayes
    3.2.8 Bayesian Belief Networks
    3.2.9 Support Vector Machine
  3.3 Unsupervised Learning
    3.3.1 Apriori Algorithm
    3.3.2 k-Means Algorithm
    3.3.3 Dimensionality Reduction for Data Compression
  3.4 Reinforcement Learning
    3.4.1 Markov Decision Process
  3.5 Case Study: Using Machine Learning for Marketing Campaign
  3.6 Exercises
  References
4 Fuzzy Logic
  4.1 Introduction
  4.2 Fuzzy Membership Functions
    4.2.1 Triangular Membership Function
    4.2.2 Trapezoidal Membership Function
    4.2.3 Gaussian Membership Function
    4.2.4 Sigmoidal Membership Function
  4.3 Methods of Membership Value Assignment
  4.4 Fuzzification and Defuzzification Methods
  4.5 Fuzzy Set Operations
    4.5.1 Union of Fuzzy Sets
    4.5.2 Intersection of Fuzzy Sets
    4.5.3 Complement of a Fuzzy Set
  4.6 Fuzzy Set Properties
  4.7 Fuzzy Relations
    4.7.1 Example of Operation on Fuzzy Relationship
  4.8 Fuzzy Propositions
    4.8.1 Fuzzy Connectives
    4.8.2 Disjunction
    4.8.3 Conjunction
    4.8.4 Negation
    4.8.5 Implication
  4.9 Fuzzy Inference
  4.10 Fuzzy Rule-Based System
  4.11 Fuzzy Logic for Data Science
    4.11.1 Application 1: Web Content Mining
    4.11.2 Application 2: Web Structure Mining
    4.11.3 Application 3: Web Usage Mining
    4.11.4 Application 4: Environmental and Social Data Manipulation
  4.12 Tools and Techniques for Doing Data Science with Fuzzy Logic
  4.13 Exercises
  References
5 Artificial Neural Network
  5.1 Introduction
  5.2 Symbolic Learning Methods
  5.3 Artificial Neural Network and Its Characteristics
  5.4 ANN Models
    5.4.1 Hopfield Model
    5.4.2 Perceptron Model
    5.4.3 Multi-Layer Perceptron
    5.4.4 Deep Learning in Multi-Layer Perceptron
    5.4.5 Other Models of ANN
    5.4.6 Linear Regression and Neural Networks
  5.5 ANN Tools and Utilities
  5.6 Emotions Mining on Social Network Platform
    5.6.1 Related Work on Emotions Mining
    5.6.2 Broad Architecture
    5.6.3 Design of Neural Network
  5.7 Applications and Challenges
  5.8 Concerns
  5.9 Exercises
  References
6 Genetic Algorithms and Evolutionary Computing
  6.1 Introduction
  6.2 Genetic Algorithms
  6.3 Basic Principles of Genetic Algorithms
    6.3.1 Encoding Individuals
    6.3.2 Mutation
    6.3.3 Crossover
    6.3.4 Fitness Function
    6.3.5 Selection
    6.3.6 Other Encoding Strategies
  6.4 Example of Function Optimization using Genetic Algorithm
  6.5 Schemata and Schema Theorem
    6.5.1 Instance, Defined Bits, and Order of Schema
    6.5.2 Importance of Schema
  6.6 Application Specific Genetic Operators
    6.6.1 Application of the Recombination Operator: Example
  6.7 Evolutionary Programming
  6.8 Applications of GA in Healthcare
    6.8.1 Case of Healthcare
    6.8.2 Patients Scheduling System Using Genetic Algorithm
    6.8.3 Encoding of Candidates
    6.8.4 Operations on Population
    6.8.5 Other Applications
  6.9 Exercises
  References
7 Other Metaheuristics and Classification Approaches
  7.1 Introduction
  7.2 Adaptive Memory Procedure
    7.2.1 Tabu Search
    7.2.2 Scatter Search
    7.2.3 Path Relinking
  7.3 Swarm Intelligence
    7.3.1 Ant Colony Optimization
    7.3.2 Artificial Bee Colony Algorithm
    7.3.3 River Formation Dynamics
    7.3.4 Particle Swarm Optimization
    7.3.5 Stochastic Diffusion Search
    7.3.6 Swarm Intelligence and Big Data
  7.4 Case-Based Reasoning
    7.4.1 Learning in Case-Based Reasoning
    7.4.2 Case-Based Reasoning and Data Science
    7.4.3 Dealing with Complex Domains
  7.5 Rough Sets
  7.6 Exercises
  References