Intelligent Data Analysis
From Data Gathering to Data Comprehension
Edited by
Deepak Gupta
Maharaja Agrasen Institute of Technology
Delhi, India
Siddhartha Bhattacharyya
CHRIST (Deemed to be University)
Bengaluru, India
Ashish Khanna
Maharaja Agrasen Institute of Technology
Delhi, India
Kalpna Sagar
KIET Group of Institutions
Uttar Pradesh, India
This edition first published 2020
© 2020 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
law. Advice on how to obtain permission to reuse material from this title is available at
http://www.wiley.com/go/permissions.
The right of Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, and Kalpna Sagar to be identified as the
authors of the editorial material in this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at
www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that
appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant
the accuracy of the text or exercises in this book. This work's use or discussion of MATLAB® software or related
products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach
or particular use of the MATLAB® software.
While the publisher and authors have used their best efforts in preparing this work, they make no representations
or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim
all warranties, including without limitation any implied warranties of merchantability or fitness for a particular
purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional
statements for this work. The fact that an organization, website, or product is referred to in this work as a citation
and/or potential source of further information does not mean that the publisher and authors endorse the
information or services the organization, website, or product may provide or recommendations it may make. This
work is sold with the understanding that the publisher is not engaged in rendering professional services. The
advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist
where appropriate. Further, readers should be aware that websites listed in this work may have changed or
disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be
liable for any loss of profit or any other commercial damages, including but not limited to special, incidental,
consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Gupta, Deepak, editor.
Title: Intelligent data analysis: from data gathering to data
comprehension / edited by Dr. Deepak Gupta, Dr. Siddhartha
Bhattacharyya, Dr. Ashish Khanna, Ms. Kalpna Sagar.
Description: Hoboken, NJ, USA: Wiley, 2020. | Series: The Wiley series in
intelligent signal and data processing | Includes bibliographical
references and index.
Identifiers: LCCN 2019056735 (print) | LCCN 2019056736 (ebook) | ISBN
9781119544456 (hardback) | ISBN 9781119544449 (adobe pdf) | ISBN
9781119544463 (epub)
Subjects: LCSH: Data mining. | Computational intelligence.
Classification: LCC QA76.9.D343 I57435 2020 (print) | LCC QA76.9.D343
(ebook) | DDC 006.3/12–dc23
LC record available at https://lccn.loc.gov/2019056735
LC ebook record available at https://lccn.loc.gov/2019056736
Cover Design: Wiley
Cover Image: © gremlin/Getty Images
Set in 9.5/12.5pt STIXTwoText by SPi Global, Chennai, India
10 9 8 7 6 5 4 3 2 1
Deepak Gupta would like to dedicate this book to his father, Sh. R.K. Gupta, his mother,
Smt. Geeta Gupta, his mentors for their constant encouragement, and his family members,
including his wife, brothers, sisters, kids and the students.
Siddhartha Bhattacharyya would like to dedicate this book to his parents, the late Ajit Kumar
Bhattacharyya and the late Hashi Bhattacharyya, his beloved wife, Rashni, and his research
scholars, Sourav, Sandip, Hrishikesh, Pankaj, Debanjan, Alokananda, Koyel, and Tulika.
Ashish Khanna would like to dedicate this book to his parents, the late R.C. Khanna and
Smt. Surekha Khanna, for their constant encouragement and support, and to his wife,
Sheenu, and children, Master Bhavya and Master Sanyukt.
Kalpna Sagar would like to dedicate this book to her father, Mr. Lekh Ram Sagar, and her
mother, Smt. Gomti Sagar, the strongest persons of her life.
Contents
List of Contributors xix
Series Preface xxiii
Preface xxv
1 Intelligent Data Analysis: Black Box Versus White Box Modeling 1
Sarthak Gupta, Siddhant Bagga, and Deepak Kumar Sharma
1.1 Introduction 1
1.1.1 Intelligent Data Analysis 1
1.1.2 Applications of IDA and Machine Learning 2
1.1.3 White Box Models Versus Black Box Models 2
1.1.4 Model Interpretability 3
1.2 Interpretation of White Box Models 3
1.2.1 Linear Regression 3
1.2.2 Decision Tree 5
1.3 Interpretation of Black Box Models 7
1.3.1 Partial Dependence Plot 7
1.3.2 Individual Conditional Expectation 9
1.3.3 Accumulated Local Effects 9
1.3.4 Global Surrogate Models 12
1.3.5 Local Interpretable Model-Agnostic Explanations 12
1.3.6 Feature Importance 12
1.4 Issues and Further Challenges 13
1.5 Summary 13
References 14
2 Data: Its Nature and Modern Data Analytical Tools 17
Ravinder Ahuja, Shikhar Asthana, Ayush Ahuja, and Manu Agarwal
2.1 Introduction 17
2.2 Data Types and Various File Formats 18
2.2.1 Structured Data 18
2.2.2 Semi-Structured Data 20
2.2.3 Unstructured Data 20
2.2.4 Need for File Formats 21
2.2.5 Various Types of File Formats 22
2.2.5.1 Comma Separated Values (CSV) 22
2.2.5.2 ZIP 22
2.2.5.3 Plain Text (txt) 23
2.2.5.4 JSON 23
2.2.5.5 XML 23
2.2.5.6 Image Files 24
2.2.5.7 HTML 24
2.3 Overview of Big Data 25
2.3.1 Sources of Big Data 27
2.3.1.1 Media 27
2.3.1.2 The Web 27
2.3.1.3 Cloud 27
2.3.1.4 Internet of Things 27
2.3.1.5 Databases 27
2.3.1.6 Archives 28
2.3.2 Big Data Analytics 28
2.3.2.1 Descriptive Analytics 28
2.3.2.2 Predictive Analytics 28
2.3.2.3 Prescriptive Analytics 29
2.4 Data Analytics Phases 29
2.5 Data Analytical Tools 30
2.5.1 Microsoft Excel 30
2.5.2 Apache Spark 33
2.5.3 Open Refine 34
2.5.4 R Programming 35
2.5.4.1 Advantages of R 36
2.5.4.2 Disadvantages of R 36
2.5.5 Tableau 36
2.5.5.1 How Tableau Works 36
2.5.5.2 Tableau Feature 37
2.5.5.3 Advantages 37
2.5.5.4 Disadvantages 37
2.5.6 Hadoop 37
2.5.6.1 Basic Components of Hadoop 38
2.5.6.2 Benefits 38
2.6 Database Management System for Big Data Analytics 38
2.6.1 Hadoop Distributed File System 38
2.6.2 NoSql 38
2.6.2.1 Categories of NoSql 39
2.7 Challenges in Big Data Analytics 39
2.7.1 Storage of Data 40
2.7.2 Synchronization of Data 40
2.7.3 Security of Data 40
2.7.4 Fewer Professionals 40
2.8 Conclusion 40
References 41
3 Statistical Methods for Intelligent Data Analysis: Introduction and Various Concepts 43
Shubham Kumaram, Samarth Chugh, and Deepak Kumar Sharma
3.1 Introduction 43
3.2 Probability 43
3.2.1 Definitions 43
3.2.1.1 Random Experiments 43
3.2.1.2 Probability 44
3.2.1.3 Probability Axioms 44
3.2.1.4 Conditional Probability 44
3.2.1.5 Independence 44
3.2.1.6 Random Variable 44
3.2.1.7 Probability Distribution 45
3.2.1.8 Expectation 45
3.2.1.9 Variance and Standard Deviation 45
3.2.2 Bayes' Rule 45
3.3 Descriptive Statistics 46
3.3.1 Picture Representation 46
3.3.1.1 Frequency Distribution 46
3.3.1.2 Simple Frequency Distribution 46
3.3.1.3 Grouped Frequency Distribution 46
3.3.1.4 Stem and Leaf Display 46
3.3.1.5 Histogram and Bar Chart 47
3.3.2 Measures of Central Tendency 47
3.3.2.1 Mean 47
3.3.2.2 Median 47
3.3.2.3 Mode 47
3.3.3 Measures of Variability 48
3.3.3.1 Range 48
3.3.3.2 Box Plot 48
3.3.3.3 Variance and Standard Deviation 48
3.3.4 Skewness and Kurtosis 48
3.4 Inferential Statistics 49
3.4.1 Frequentist Inference 49
3.4.1.1 Point Estimation 50
3.4.1.2 Interval Estimation 50
3.4.2 Hypothesis Testing 51
3.4.3 Statistical Significance 51
3.5 Statistical Methods 52
3.5.1 Regression 52
3.5.1.1 Linear Model 52
3.5.1.2 Nonlinear Models 52
3.5.1.3 Generalized Linear Models 53
3.5.1.4 Analysis of Variance 53
3.5.1.5 Multivariate Analysis of Variance 55
3.5.1.6 Log-Linear Models 55
3.5.1.7 Logistic Regression 56
3.5.1.8 Random Effects Model 56
3.5.1.9 Overdispersion 57
3.5.1.10 Hierarchical Models 57
3.5.2 Analysis of Survival Data 57
3.5.3 Principal Component Analysis 58
3.6 Errors 59
3.6.1 Error in Regression 60
3.6.2 Error in Classification 61
3.7 Conclusion 61
References 61
4 Intelligent Data Analysis with Data Mining: Theory and Applications 63
Shivam Bachhety, Ramneek Singhal, and Rachna Jain
Objective 63
4.1 Introduction to Data Mining 63
4.1.1 Importance of Intelligent Data Analytics in Business 64
4.1.2 Importance of Intelligent Data Analytics in Health Care 65
4.2 Data and Knowledge 65
4.3 Discovering Knowledge in Data Mining 66
4.3.1 Process Mining 67
4.3.2 Process of Knowledge Discovery 67
4.4 Data Analysis and Data Mining 69
4.5 Data Mining: Issues 69
4.6 Data Mining: Systems and Query Language 71
4.6.1 Data Mining Systems 71
4.6.2 Data Mining Query Language 72
4.7 Data Mining Methods 73
4.7.1 Classification 74
4.7.2 Cluster Analysis 75
4.7.3 Association 75
4.7.4 Decision Tree Induction 76
4.8 Data Exploration 77
4.9 Data Visualization 80
4.10 Probability Concepts for Intelligent Data Analysis (IDA) 83
Reference 83
5 Intelligent Data Analysis: Deep Learning and Visualization 85
Than D. Le and Huy V. Pham
5.1 Introduction 85
5.2 Deep Learning and Visualization 86
5.2.1 Linear and Logistic Regression and Visualization 86
5.2.2 CNN Architecture 89
5.2.2.1 Vanishing Gradient Problem 90
5.2.2.2 Convolutional Neural Networks (CNNs) 91
5.2.3 Reinforcement Learning 91
5.2.4 Inception and ResNet Networks 93
5.2.5 Softmax 94
5.3 Data Processing and Visualization 97
5.3.1 Regularization for Deep Learning and Visualization 98
5.3.1.1 Regularization for Linear Regression 98
5.4 Experiments and Results 102
5.4.1 Mask RCNN Based on Object Detection and Segmentation 102
5.4.2 Deep Matrix Factorization 108
5.4.2.1 Network Visualization 108
5.4.3 Deep Learning and Reinforcement Learning 111
5.5 Conclusion 112
References 113
6 A Systematic Review on the Evolution of Dental Caries Detection Methods and Its Significance in Data Analysis Perspective 115
Soma Datta, Nabendu Chaki, and Biswajit Modak
6.1 Introduction 115
6.1.1 Analysis of Dental Caries 115
6.2 Different Caries Lesion Detection Methods and Data Characterization 119
6.2.1 Point Detection Method 120
6.2.2 Visible Light Property Method 121
6.2.3 Radiographs 121
6.2.4 Light-Emitting Devices 123
6.2.5 Optical Coherent Tomography (OCT) 125
6.2.6 Software Tools 125
6.3 Technical Challenges with the Existing Methods 126
6.3.1 Challenges in Data Analysis Perspective 127
6.4 Result Analysis 129
6.5 Conclusion 129
Acknowledgment 131
References 131
7 Intelligent Data Analysis Using Hadoop Cluster–Inspired MapReduce Framework and Association Rule Mining on Educational Domain 137
Pratiyush Guleria and Manu Sood
7.1 Introduction 137
7.1.1 Research Areas of IDA 138
7.1.2 The Need for IDA in Education 139
7.2 Learning Analytics in Education 139
7.2.1 Role of Web-Enabled and Mobile Computing in Education 141
7.2.2 Benefits of Learning Analytics 142