Rajendra Akerkar • Priti Srinivas Sajja
Intelligent Techniques
for Data Science
Rajendra Akerkar
Western Norway Research Institute
Sogndal, Norway
Priti Srinivas Sajja
Department of Computer Science
Sardar Patel University
Vallabh Vidhyanagar, Gujarat, India
ISBN 978-3-319-29205-2 ISBN 978-3-319-29206-9 (eBook)
DOI 10.1007/978-3-319-29206-9
Library of Congress Control Number: 2016955044
© Springer International Publishing Switzerland 2016
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Information and communication technology (ICT) has become a common tool for
doing any business. With the high applicability and support provided by ICT, many
difficult tasks have been simplified. On the other hand, ICT has also become a key
factor in creating challenges! Today, the amount of data collected across a broad
variety of domains far exceeds our ability to reduce and analyse it without the use of
intelligent analysis techniques. There is much valuable information hidden in the
accumulated (big) data. However, it is very difficult to obtain this information and
insight. Therefore, a new generation of computational theories and tools to assist
humans in extracting knowledge from data is indispensable. After all, why should
the tools and techniques, which are smart and intelligent in nature, not be used to
minimize human involvement and to effectively manage the large pool of data?
Computational intelligence skills, which embrace the family of neural networks,
fuzzy systems, and evolutionary computing in addition to other fields within
machine learning, are effective in identifying, visualizing, classifying, and analysing
data to support business decisions. Developed theories of computational intelligence
have been applied in many fields of engineering, data analysis, forecasting,
healthcare, and others. This text brings these skills together to address data science
problems.
The term ‘data science’ has recently emerged to specifically designate a new
profession that is expected to make sense of the vast collections of big data. But
making sense of data has a long history. Data science is a set of fundamental
principles that support and guide the extraction of information and insight from
data. Possibly the most closely related concept to data science is data mining—
the actual extraction of knowledge from data via technologies that incorporate
these principles. The key output of data science is data products. Data products
can be anything from a list of recommendations to a dashboard, or any product
that supports achieving a more informed decision. Analytics is at the core of data
science. Analytics focuses on understanding data in terms of statistical models. It
is concerned with the collection, analysis, and interpretation of data, as well as the
effective organization, presentation, and communication of results relying on data.
This textbook has been designed to meet the needs of individuals wishing
to pursue a research and development career in data science and computational
intelligence.
Overview of the Book
We have taught the topics in this book in various appearances at different locations
since 1994. In particular, this book is based on graduate lectures delivered by the
authors over the past several years for a wide variety of data science-related courses
at various universities and research organizations. The feedback from participants
and colleagues at these venues has helped us to improve the text significantly.
The book can be used at the graduate or advanced undergraduate level as a
textbook or major reference for courses such as intelligent control, computational
science, applied artificial intelligence, and knowledge discovery in databases,
among many others.
The book presents a sound foundation for the reader to design and implement
data analytics solutions for real world applications in an intelligent manner. The
content of the book is structured in nine chapters.
A brief description of the contents found within each chapter of the text follows.
• Data is a vital asset to any business. It can provide valuable insights into areas
such as customer behaviour, market intelligence, and operational performance.
Data scientists build intelligent systems to manage, interpret, understand, and
derive key knowledge from this data. Chapter 1 offers an overview of such
aspects of data science. Special emphasis is placed on helping the student
determine how data science thinking is crucial in data-driven business.
• Data science projects differ from typical business intelligence projects. Chapter 2
presents an overview of data life cycle, data science project life cycle, and data
analytics life cycle. This chapter also focuses on explaining a standard analytics
landscape.
• Among the most common tasks performed by data scientists are prediction
and machine learning. Machine learning focuses on data modelling and related
methods and learning algorithms for data sciences. Chapter 3 details the methods
and algorithms used by data scientists and analysts.
• Fuzzy sets can be used as a universal approximator, which is crucial for
modelling unknown objects. If an operator can linguistically describe the type
of action to be taken in a specific situation, then it is quite useful to model his
control actions using data. Chapter 4 presents fundamental concepts of fuzzy
logic and its practical use in data science.
• Chapter 5 introduces artificial neural networks, a computational intelligence
technique modelled on the human brain. An important feature of these networks
is their adaptive nature, where ‘learning by example’ replaces traditional
‘programming’ in problem solving. Another significant feature is the intrinsic
parallelism that allows speedy computations. The chapter gives a practical primer
to neural networks and deep learning.
• Evolutionary computing is an innovative approach to optimization. One area of
evolutionary computing, genetic algorithms, involves the use of algorithms for
global optimization. Genetic algorithms are based on the mechanisms of natural
selection and genetics. Chapter 6 describes evolutionary computing, in particular
with regard to biological evolution and genetic algorithms, in a machine learning
context.
• Metaheuristics are known to be robust methods for optimization when the
problem is computationally difficult or merely too large. Although metaheuristics
often do not result in an optimal solution, they may provide reasonable solutions
within adequate computation times, e.g., by using stochastic mechanisms.
Metaheuristics and data analytics share common ground in that they look for
approximate results out of a potentially intractable search space, via incremental
operations. Chapter 7 offers a brief exposure to the essentials of metaheuristic
approaches such as adaptive memory methods and swarm intelligence. Further
classification approaches such as case-based reasoning are also discussed in the
chapter. This classification approach relies on the idea that a new situation can be
well represented by the accumulated experience of previously solved problems.
Case-based reasoning has been used in important real world applications.
• To achieve the benefit that big data holds, it is necessary to instil analytics
ubiquitously and to exploit the value in data. This requires an infrastructure that can
manage and process exploding volumes of structured and unstructured data,
both in motion and at rest, and that can safeguard data privacy and security.
Chapter 8 presents broad-based coverage of big data-specific technologies and
tools that support advanced analytics as well as issues of data privacy, ethics, and
security.
• Finally, Chapter 9 gives a concise introduction to R. The R programming language is
elegant and flexible, and has a substantially expressive syntax designed around
working with data. R also includes powerful graphics capabilities.
Lastly, the appendices provide a spectrum of popular tools for handling data
science in practice. Throughout the book, real world case studies and exercises are
given to highlight certain aspects of the material covered and to stimulate thought.
Intended Audience
This book is intended for individuals seeking to develop an understanding of data
science from the perspective of the practicing data scientist, including:
• Graduate and undergraduate students looking to move into the world of data
science.
• Managers of teams of business intelligence, analytics, and data professionals.
• Aspiring business and data analysts looking to add intelligent techniques to their
skills.
Prerequisites
To fully appreciate the material in this book, we recommend the following
prerequisites:
• An introduction to database systems, covering SQL and related programming
systems.
• A sophomore-level course in data structures, algorithms, and discrete mathematics.
We would like to thank the students in our courses for their comments on the
draft of the lecture notes. We also thank our families, friends, and colleagues who
encouraged us in this endeavour. We acknowledge all the authors, researchers, and
developers from whom we have acquired knowledge through their work. Finally, we
must give thanks to the editorial team at Springer Verlag London, especially Helen
Desmond, and the reviewers of this book for bringing the book together in an orderly
manner.
We sincerely hope it meets the needs of our readers.
Sogndal, Norway    Rajendra Akerkar
Gujarat, India    Priti Srinivas Sajja
March 26, 2016
Contents
1 Introduction to Data Science
1.1 Introduction
1.2 History of Data Science
1.3 Importance of Data Science in Modern Business
1.4 Data Scientists
1.5 Data Science Activities in Three Dimensions
1.5.1 Managing Data Flow
1.5.2 Managing Data Curation
1.5.3 Data Analytics
1.6 Overlapping of Data Science with Other Fields
1.7 Data Analytic Thinking
1.8 Domains of Application
1.8.1 Sustainable Development for Resources
1.8.2 Utilization of Social Network Platform for Various Activities
1.8.3 Intelligent Web Applications
1.8.4 Google’s Automatic Statistician Project
1.9 Application of Computational Intelligence to Manage Data Science Activities
1.10 Scenarios for Data Science in Business
1.11 Tools and Techniques Helpful for Doing Data Science
1.11.1 Data Cleaning Tools
1.11.2 Data Munging and Modelling Tools
1.11.3 Data Visualization Tools
1.12 Exercises
References
2 Data Analytics
2.1 Introduction
2.2 Cross Industry Standard Process
2.3 Data Analytics Life Cycle
2.4 Data Science Project Life Cycle
2.5 Complexity of Analytics
2.6 From Data to Insights
2.7 Building Analytics Capabilities: Case of Banking
2.8 Data Quality
2.9 Data Preparation Process
2.10 Communicating Analytics Outcomes
2.10.1 Strategies for Communicating Analytics
2.10.2 Data Visualization
2.10.3 Techniques for Visualization
2.11 Exercises
References
3 Basic Learning Algorithms
3.1 Learning from Data
3.2 Supervised Learning
3.2.1 Linear Regression
3.2.2 Decision Tree
3.2.3 Random Forest
3.2.4 k-Nearest Neighbour
3.2.5 Logistic Regression
3.2.6 Model Combiners
3.2.7 Naive Bayes
3.2.8 Bayesian Belief Networks
3.2.9 Support Vector Machine
3.3 Unsupervised Learning
3.3.1 Apriori Algorithm
3.3.2 k-Means Algorithm
3.3.3 Dimensionality Reduction for Data Compression
3.4 Reinforcement Learning
3.4.1 Markov Decision Process
3.5 Case Study: Using Machine Learning for Marketing Campaign
3.6 Exercises
References
4 Fuzzy Logic
4.1 Introduction
4.2 Fuzzy Membership Functions
4.2.1 Triangular Membership Function
4.2.2 Trapezoidal Membership Function
4.2.3 Gaussian Membership Function
4.2.4 Sigmoidal Membership Function
4.3 Methods of Membership Value Assignment
4.4 Fuzzification and Defuzzification Methods
4.5 Fuzzy Set Operations
4.5.1 Union of Fuzzy Sets
4.5.2 Intersection of Fuzzy Sets
4.5.3 Complement of a Fuzzy Set
4.6 Fuzzy Set Properties
4.7 Fuzzy Relations
4.7.1 Example of Operation on Fuzzy Relationship
4.8 Fuzzy Propositions
4.8.1 Fuzzy Connectives
4.8.2 Disjunction
4.8.3 Conjunction
4.8.4 Negation
4.8.5 Implication
4.9 Fuzzy Inference
4.10 Fuzzy Rule-Based System
4.11 Fuzzy Logic for Data Science
4.11.1 Application 1: Web Content Mining
4.11.2 Application 2: Web Structure Mining
4.11.3 Application 3: Web Usage Mining
4.11.4 Application 4: Environmental and Social Data Manipulation
4.12 Tools and Techniques for Doing Data Science with Fuzzy Logic
4.13 Exercises
References
5 Artificial Neural Network
5.1 Introduction
5.2 Symbolic Learning Methods
5.3 Artificial Neural Network and Its Characteristics
5.4 ANN Models
5.4.1 Hopfield Model
5.4.2 Perceptron Model
5.4.3 Multi-Layer Perceptron
5.4.4 Deep Learning in Multi-Layer Perceptron
5.4.5 Other Models of ANN
5.4.6 Linear Regression and Neural Networks
5.5 ANN Tools and Utilities
5.6 Emotions Mining on Social Network Platform
5.6.1 Related Work on Emotions Mining
5.6.2 Broad Architecture
5.6.3 Design of Neural Network
5.7 Applications and Challenges
5.8 Concerns
5.9 Exercises
References
6 Genetic Algorithms and Evolutionary Computing
6.1 Introduction
6.2 Genetic Algorithms
6.3 Basic Principles of Genetic Algorithms
6.3.1 Encoding Individuals
6.3.2 Mutation
6.3.3 Crossover
6.3.4 Fitness Function
6.3.5 Selection
6.3.6 Other Encoding Strategies
6.4 Example of Function Optimization using Genetic Algorithm
6.5 Schemata and Schema Theorem
6.5.1 Instance, Defined Bits, and Order of Schema
6.5.2 Importance of Schema
6.6 Application Specific Genetic Operators
6.6.1 Application of the Recombination Operator: Example
6.7 Evolutionary Programming
6.8 Applications of GA in Healthcare
6.8.1 Case of Healthcare
6.8.2 Patients Scheduling System Using Genetic Algorithm
6.8.3 Encoding of Candidates
6.8.4 Operations on Population
6.8.5 Other Applications
6.9 Exercises
References
7 Other Metaheuristics and Classification Approaches
7.1 Introduction
7.2 Adaptive Memory Procedure
7.2.1 Tabu Search
7.2.2 Scatter Search
7.2.3 Path Relinking
7.3 Swarm Intelligence
7.3.1 Ant Colony Optimization
7.3.2 Artificial Bee Colony Algorithm
7.3.3 River Formation Dynamics
7.3.4 Particle Swarm Optimization
7.3.5 Stochastic Diffusion Search
7.3.6 Swarm Intelligence and Big Data
7.4 Case-Based Reasoning
7.4.1 Learning in Case-Based Reasoning
7.4.2 Case-Based Reasoning and Data Science
7.4.3 Dealing with Complex Domains
7.5 Rough Sets
7.6 Exercises
References