The Springer Series in Applied Machine Learning

Poornachandra Sarang

Thinking Data Science
A Data Science Practitioner's Guide

The Springer Series in Applied Machine Learning

Series Editor
Oge Marques, Florida Atlantic University, Boca Raton, FL, USA

Editorial Board

Over the last decade, Machine Learning and Artificial Intelligence methods and technologies have played an increasingly important role at the core of practical solutions for a wide range of tasks, ranging from handheld apps to industrial process control, from autonomous vehicle driving to environmental policies, and from life sciences to computer game playing.

The Springer Series in "Applied Machine Learning" will focus on monographs, textbooks, edited volumes and reference books that provide suitable content and educate the reader on how the theoretical approaches, the algorithms, and the techniques of machine learning can be applied to address real-world problems in a principled way.

The series starts from the realization that virtually all successful machine learning solutions to practical problems require much more than applying a set of standard off-the-shelf techniques. Such solutions may include principled ways of controlling parameters of algorithms, carefully developed combinations of standard methods, regularization techniques applied in a non-standard manner, probabilistic inference over the results of multiple models, application-driven statistical analysis of results, modeling of the requirements for robustness, or a mix of practical 'tricks' and thoroughly analyzed approaches. In fact, most practical successes rely on machine learning applied at different levels of abstraction of the tasks that are addressed, and the books of this series will provide the reader a basis for understanding how these methods were applied, why the techniques were chosen, what the pitfalls are, and how the reader can build a solution for her/his own task.
Our goal is to build a series of state-of-the-art books for the libraries of both machine learning scientists interested in applying the latest techniques to novel problems, and subject matter experts, technologists, or researchers interested in leveraging the latest advances in machine learning for developing solutions that work for practical problems. Our aim is to cover all topics of relevance for applied machine learning and artificial intelligence.

The series will also publish in-depth analyses and studies of applications to real-world problems in areas such as economics, social good, environmental sciences, transportation science, manufacturing and production planning, logistics and distribution, financial planning, structural optimization, water resource planning, network design, and computer games.

In addition, this series will aim to address the interests of a wide spectrum of practitioners, students and researchers alike who are interested in employing machine learning and AI methods in their respective domains. The scope will span the breadth of machine learning and AI as it pertains to all application areas, both through books that address techniques specific to one application domain and books that show the applicability of different types of machine learning methods to a wide array of domains.
Poornachandra Sarang

Thinking Data Science
A Data Science Practitioner's Guide

Poornachandra Sarang
Practicing Data Scientist & Researcher
Mumbai, India

ISSN 2520-1298          ISSN 2520-1301 (electronic)
The Springer Series in Applied Machine Learning
ISBN 978-3-031-02362-0          ISBN 978-3-031-02363-7 (eBook)
https://doi.org/10.1007/978-3-031-02363-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Chapter 1 (Data Science Process) introduces you to the data science process that is followed by a modern data scientist in developing those highly acclaimed AI applications.
It describes both the traditional and the modern approach followed by a current-day data scientist in model building. In today's world, a data scientist has to deal with not just numeric data, but also text and image datasets. High-frequency datasets are another major challenge for a data scientist. After this brief on model building, the chapter introduces you to the full data science process. As we have a very large number of machine learning algorithms that can apply to your datasets, the model development process becomes time consuming and resource intensive. The chapter introduces you to AutoML, which eases this model development process and hyper-parameter tuning for the selected algorithm. Finally, it introduces you to the modern approach of using deep neural networks (DNNs) and transfer learning.

Machine learning is based on data: the more data you have, the better the learning. Consider a simple example of identifying a person in a photo, in a video, or just in real life. If you have better knowledge of that person, or know more of their features, the identification becomes a simple task. In machine learning, however, the opposite holds for features. In fact, we consider having many features a curse of dimensionality. This is mainly for two reasons: we human beings cannot visualize data beyond three dimensions, and having many dimensions demands enormous resources and training times. Chapter 2 (Dimensionality Reduction) teaches you several techniques for bringing down the dimensions of your dataset to a manageable level. The chapter gives you an exhaustive coverage of the dimensionality reduction techniques followed by a data scientist.

After we prepare the dataset for machine learning, the data scientist's major task is to select an appropriate algorithm for the problem that he is trying to solve. The Classical Algorithms Overview (Part I) gives you an overview of the various algorithms that you will study in the next few chapters.

The model development task could be of a regression or classification type.
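To give a flavor of the simplest end of the family of techniques Chapter 2 surveys, here is a minimal low-variance filter sketched in plain Python. This is my own toy illustration, not code from the book: the dataset, threshold, and function name are invented. The idea is simply that a feature that barely varies across samples carries almost no information and can be dropped.

```python
from statistics import pvariance

# Invented toy dataset: 4 samples, 3 features; the second and third
# features are (nearly) constant and therefore carry little information.
X = [
    [2.0, 5.0, 0.01],
    [4.0, 5.0, 0.02],
    [6.0, 5.0, 0.01],
    [8.0, 5.0, 0.02],
]

def low_variance_filter(rows, threshold=0.1):
    """Keep only the feature columns whose variance exceeds the threshold."""
    cols = list(zip(*rows))                                   # per-feature columns
    keep = [i for i, col in enumerate(cols) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows], keep

reduced, kept = low_variance_filter(X)
print(kept)     # only the first feature survives
print(reduced)
```

Real dimensionality reduction techniques such as PCA go much further by combining features rather than merely dropping them, but the goal is the same: fewer dimensions, little information lost.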
Regression is a well-studied statistical technique and is successfully implemented in machine learning algorithms. Chapter 3 (Regression Analysis) discusses several regression algorithms, starting with simple linear regression and moving on to Ridge, Lasso, ElasticNet, Bayesian, Logistic, and so on. You will learn their practical implementations and how to evaluate which one best fits a dataset.

Chapter 4 (Decision Trees) deals with decision trees, a fundamental building block for many machine learning algorithms. I give in-depth coverage to building and traversing the trees. The projects in this chapter prove their importance for both regression and classification problems.

Chapter 5 (Ensemble: Bagging and Boosting) talks about the statistical ensemble methods used to improve the performance of decision trees. It covers both bagging and boosting techniques. I cover several algorithms in each category, giving you definite guidelines on when to use them. Under bagging, you will learn algorithms such as Random Forest, ExtraTrees, BaggingRegressor, and BaggingClassifier. Under boosting, you will learn AdaBoost, Gradient Boosting, XGBoost, CatBoost, and LightGBM. The chapter also presents a comparative study of their performances, which will help you decide which one to use for your own datasets.

Now, we move on to classification problems. The next three chapters cover K-Nearest Neighbors, Naive Bayes, and Support Vector Machines used for classification.

Chapter 6 (K-Nearest Neighbors) describes K-Nearest Neighbors, also called KNN, which is the simplest starting algorithm for classification. I describe the algorithm in depth, along with the effect of K on the classification. I discuss the techniques for obtaining the optimal K value for better classifications and finally provide guidelines on when to use this simple algorithm.

Chapter 7 (Naive Bayes) describes Bayes' theorem, on which Naive Bayes is based, along with the algorithm's advantages and disadvantages. I also discuss its various types, such as Multinomial, Bernoulli, Gaussian, Complement, and Categorical Naive Bayes.
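The bagging idea that Chapter 5 develops, training many weak learners on bootstrap resamples and combining their votes, can be sketched in a few lines. This is my own hedged illustration on an invented one-dimensional dataset, not one of the book's projects; the "decision stump" here stands in for the full decision trees a Random Forest would use.

```python
import random

random.seed(0)   # deterministic toy run

# Invented 1-D dataset: label is 1 when x > 5, else 0.
data = [(x, int(x > 5)) for x in range(10)]

def train_stump(sample):
    """Fit a one-split 'decision stump': the best integer threshold on x."""
    best_t, best_acc = 0, -1.0
    for t in range(10):
        acc = sum(int(x > t) == y for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Bagging: train each stump on a bootstrap resample, then majority-vote.
stumps = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]

def predict(x):
    votes = sum(int(x > t) for t in stumps)
    return int(2 * votes > len(stumps))

print(predict(2), predict(8))
```

Even though each individual stump sees only a distorted resample of the data, the majority vote recovers the true decision boundary; boosting differs in that the learners are trained sequentially, each focusing on the errors of its predecessors.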
Naive Bayes is useful in classifying huge datasets. A simple project toward the end of the chapter brings out its importance.

Now, we come to another important and widely researched classification algorithm: SVM (Support Vector Machines). Chapter 8 (Support Vector Machines) gives in-depth coverage of this algorithm. There are several types of hyperplanes that divide the dataset into different classes. I fully discuss the effects of the kernel and its various types, such as Linear, Polynomial, Radial Basis, and Sigmoid. I provide you with definite guidelines for kernel selection for your dataset. SVM also requires tuning of several parameters, such as C, Degree, Gamma, and so on. You will learn parameter tuning. Toward the end, I present a project that shows how to use SVM and conclude with SVM's advantages and disadvantages in practical situations.

A data scientist need not have a deep knowledge of how these algorithms are designed. Having only a conceptual understanding of the purpose for which they were designed suffices. So, in Chaps. 3 to 8, I focus on explaining the algorithms' concepts, avoiding the mathematics on which they are based and giving more importance to how we use them practically.

Now comes the next challenge for a data scientist, and that is clustering a dataset without having labeled data points. We call this unsupervised learning. I have a large section (Part II), comprising Chaps. 9 through 16, on clustering, giving you in-depth coverage of several clustering techniques. The notion of a cluster is not well defined, and usually there is no consensus on the results produced by clustering algorithms. So, we have lots of clustering algorithms that deal with small, medium, large, and really huge spatial datasets. I cover many clustering algorithms, explaining their applications for various-sized datasets.

Chapter 9 (Centroid-Based Clustering) discusses the centroid-based clustering algorithms, which are probably the simplest and are the starting points for clustering huge spatial datasets.
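The kernels Chapter 8 discusses are, at heart, just similarity functions between pairs of points; the "kernel trick" means an SVM never needs to compute the implied high-dimensional mapping explicitly. As a self-contained sketch of my own (the points and parameter values are invented; the formulas follow the standard textbook definitions):

```python
import math

# Three standard SVM kernels as plain similarity functions.
def linear(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial(u, v, degree=3, coef0=1.0):
    # 'degree' is the SVM Degree parameter mentioned in the text.
    return (linear(u, v) + coef0) ** degree

def rbf(u, v, gamma=0.5):
    # 'gamma' is the SVM Gamma parameter; Radial Basis Function kernel.
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

u, v = [1.0, 2.0], [2.0, 0.0]
print(linear(u, v))       # 2.0
print(polynomial(u, v))   # (2 + 1) ** 3 = 27.0
print(rbf(u, v))          # exp(-0.5 * 5.0)
```

Swapping the kernel changes the shape of the separating surface the SVM can learn, which is why kernel choice (and tuning C, Degree, and Gamma) matters so much in practice.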
The chapter covers both the K-Means and K-Medoids clustering algorithms. For K-Means, I describe its working, followed by an explanation of the algorithm itself. I discuss the purpose of the objective function and techniques for selecting the optimal number of clusters: the Elbow, Average Silhouette, and Gap Statistic methods. This is followed by a discussion of K-Means' limitations and where to use it. For the K-Medoids algorithm, I follow a similar approach, describing its working, algorithm, merits, demerits, and implementation.

Chapter 10 (Connectivity-Based Clustering) describes two connectivity-based clustering algorithms: Agglomerative and Divisive. For Agglomerative clustering, I describe the Simple, Complete, and Average linkages while explaining its full working. I then discuss its advantages, disadvantages, and practical situations where this algorithm finds its use. For Divisive clustering, I take a similar approach and discuss its implementation challenges.

Chapter 11 (Gaussian Mixture Model) describes another type of clustering algorithm, for data whose distribution is of Gaussian type. I explain how to select the optimal number of clusters with a practical example.

Chapter 12 (Density-Based Clustering) focuses on density-based clustering techniques. Here I describe three algorithms: DBSCAN, OPTICS, and Mean Shift. I discuss why we use DBSCAN. After discussing the preliminaries, I discuss its full working. I then discuss its advantages, disadvantages, and implementation with the help of a project. To understand OPTICS, I first explain a few terms like core distance and reachability distance. As with the earlier algorithm, I discuss its implementation with the help of a project. Finally, I describe Mean Shift clustering by explaining its full working and how to select the bandwidth. A discussion of the algorithm's strengths, weaknesses, and applications, and a practical implementation illustrated with the help of a project, follows this.

Chapter 13 (BIRCH) discusses another important clustering algorithm called BIRCH.
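The Elbow method mentioned above can be illustrated with a deliberately tiny one-dimensional K-Means of my own (not the book's implementation; the data and the naive initialization are invented). The K-Means objective, the within-cluster sum of squared distances ("inertia"), drops sharply until K matches the true number of groups, then flattens:

```python
# Invented 1-D data with two obvious groups, near 1.0 and near 9.0.
data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]

def kmeans_1d(points, k, iters=20):
    centers = points[:k]                          # naive init: first k points
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, inertia

# Inertia collapses from K=1 to K=2, then barely improves: the elbow is at 2.
for k in (1, 2, 3):
    print(k, round(kmeans_1d(data, k)[1], 3))
```

The Average Silhouette and Gap Statistic methods the chapter covers answer the same "how many clusters?" question with more principled statistics, but the elbow plot is usually the first thing a data scientist draws.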
This is an algorithm that helps data scientists in clustering huge datasets, where all the earlier algorithms fail. BIRCH splits the huge dataset into subclusters by creating a hierarchical tree-like structure. The algorithm does clustering incrementally, eliminating the need for loading the entire dataset into memory. In this chapter, I discuss why and where to use this algorithm and explain its working by showing you how the algorithm constructs a CF tree.

Chapter 14 (CLARANS) discusses another important algorithm for clustering enormous datasets, called CLARANS. CLARA (Clustering for Large Applications) is considered an extension to K-Medoids. CLARANS (Clustering for Large Applications with Randomized Search) is a step further, to handle spatial datasets. This algorithm maintains a balance between the computational cost and the random sampling of data. I discuss both the CLARA and CLARANS algorithms and present a practical project to help you understand how to use CLARANS.

Chapter 15 (Affinity Propagation Clustering) describes an altogether different type of clustering technique, one based on gossiping. This is called Affinity Propagation clustering. I fully describe its working by explaining how we use gossiping for forming groups that have affinity toward each other and their leader. This algorithm does not require you to have a prior estimate of the number of clusters. I explain the concept of responsibility and availability matrices while presenting its full working. Toward the end, I demonstrate its implementation with a practical project.

Toward the end of the clustering section, you will learn two more clustering algorithms: STING and CLIQUE. Chapter 16 (STING & CLIQUE) discusses these algorithms. We consider them both density and grid based.
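The core grid idea behind STING, precomputing per-cell summary statistics in one pass so that later queries never touch the raw points, can be sketched in a few lines. This sketch is my own (the points, cell size, and density threshold are invented), and it keeps only a single grid level rather than STING's full hierarchy:

```python
from collections import Counter

# Invented 2-D points: two crowded corners plus one stray point.
points = [(0.10, 0.20), (0.15, 0.22), (0.80, 0.90), (0.82, 0.88), (0.50, 0.50)]

def cell(p, size=0.25):
    """Map a point to its grid cell (one level of a STING-like grid)."""
    return (int(p[0] // size), int(p[1] // size))

grid = Counter(cell(p) for p in points)   # single pass over the data

def dense_cells(min_points=2):
    # Queries read only the per-cell counts, never re-scanning the points.
    return sorted(c for c, n in grid.items() if n >= min_points)

print(dense_cells())
```

A real STING grid stores richer statistics per cell (count, mean, variance, distribution type) at several resolutions, which is what makes answering new queries so cheap.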
The advantage of STING lies in the fact that it does not re-scan the entire dataset while answering a new query. Thus, unlike the previous algorithms, this algorithm is computationally far less expensive. STING stands for STatistical INformation Grid. I explain how the algorithm constructs the grid and uses it for querying. CLIQUE is a subspace clustering algorithm that uses a bottom-up approach while clustering. This algorithm provides better clustering in the case of high-dimensional datasets. As in earlier chapters, I present the full working of the algorithms, their advantages and disadvantages, and their practical implementations in projects.

After discussing the several classical algorithms, which are based on statistical techniques, we now move on to an evolutionary technique of machine learning, and that is ANN (Artificial Neural Networks). The ANN technology definitely brought a new revolution in machine learning. Part III (ANN: Overview) provides you an overview of this technology.

In Chap. 17 (Artificial Neural Networks), I introduce you to ANN/DNN technology. You will first learn many new terms like back-propagation, optimization/loss functions, evaluation metrics, and so on. I discuss how to design an ANN architecture on your own, how to train and evaluate it, and finally how to use a trained model for inferring on unseen data. I introduce you to various network architectures, such as CNN, RNN, LSTM, Transformers, BERT, and so on. You will understand the latest Transfer Learning technology and learn to extend the functionality of a pre-trained model for your own purposes.

Chapter 18 (ANN-Based Applications) deals with two practical examples of using ANN/DNN. One example deals with text data and the other one with image data. Besides other things like designing and training networks, in this chapter you will learn how to prepare text and image datasets for machine learning.

With the wide choice of algorithms and the selection between the classical and ANN approaches to model building, the data scientist's life is quite tough. Fortunately, researchers and engineers have developed tools to help data scientists with these selections.
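As a first taste of the vocabulary Chap. 17 introduces, a single artificial neuron is nothing more than a weighted sum of its inputs passed through an activation function. The sketch below is my own, with invented weights and a sigmoid activation; it shows the forward pass only, with no back-propagation or training:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum plus bias, then sigmoid."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))   # sigmoid squashes z into (0, 1)

# Invented inputs and weights; a trained network would have learned these.
out = neuron([1.0, 0.5], [0.4, -0.2], bias=0.1)
print(round(out, 4))
```

A deep neural network stacks layers of such neurons, and back-propagation is the procedure that adjusts the weights and biases to reduce the loss on the training data.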
Chapter 19 (Automated Tools) talks about the automated tools for developing machine learning applications. The modern tools automate almost all workflows of model development. These tools build efficient data pipelines, select between GOFAI (Good Old-Fashioned AI, i.e., classical algorithms) and ANN technologies, select the best-performing algorithm, ensemble models, design an efficient neural network, and so on. You just need to ingest the data into such tools, and they come up with the best-performing model on your dataset. Not only this, some also generate the complete model development project code, a great boon to data scientists. This chapter gives thorough coverage of this technology.

The last chapter, Chap. 20 (Data Scientist's Ultimate Workflow), is the most important one. It merges all your lessons. In this chapter, I provide you with a definite path and guidelines on how to develop those highly acclaimed AI applications and become a Modern Data Scientist.

By its end, the entire book will make you a most sought-after data scientist. For those of you who are currently working as data scientists, this book will help you become a modern data scientist. A modern data scientist can handle numeric, text, and image datasets, is well conversant with GOFAI and ANN/DNN development, and can use automated tools including MLaaS (Machine Learning as a Service).

So, move on to Chap. 1 to start your journey toward becoming a highly skilled modern data scientist.

Mumbai, India
Poornachandra Sarang