
Data Science, Analytics and Machine Learning with R PDF

621 Pages·2023·97.727 MB·English

Preview Data Science, Analytics and Machine Learning with R

DATA SCIENCE, ANALYTICS AND MACHINE LEARNING WITH R

LUIZ PAULO FÁVERO
PATRÍCIA BELFIORE
RAFAEL DE FREITAS SOUZA

Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2023 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-12-824271-1

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner
Editorial Project Manager: Tim Eslava
Production Project Manager: Punithavathy Govindaradjane
Cover Designer: Greg Harris
Typeset by STRAIVE, India

Dedication

To Leonor Lopes Fávero

Epigraph

Everything in us is mortal, except the gifts of the spirit and of intelligence.
Publius Ovidius Naso

CHAPTER 1
Overview of data science, analytics, and machine learning

Introduction

This chapter provides a brief introduction to data science, analytics, and machine learning, which will serve as a foundation for understanding the concepts and techniques covered throughout the book.

In this new millennium, in which it is estimated that more than 5 quintillion pieces of data are generated daily from social networks, the internet of things, digital photos, consumer monitoring, and other sources, the understanding of the importance of data science in its various aspects is of fundamental importance for scientific and technological advancement, economic and social development, environmental preservation, business success, the discovery and exploration of new areas of knowledge, understanding of historical events, and even the protection of life on our planet!

Data science is therefore naturally multidisciplinary. We found examples of data science applications in engineering, physics, medicine, biology, education, psychology, pedagogy, law, politics, public security, economics, sociology, business, marketing, astronomy, anthropology, human resources, meteorology, geography, and history. We will hardly be able to find a field of study in which it is not possible to investigate phenomena through the techniques and procedures of data science.

There are many aspects that data science encompasses. Many are the professions associated with these aspects because every day we witness the emergence of new terminologies and positions in the market and in the academic world.
Examples include data scientist, data engineer, data architect, data analyst, business intelligence analyst, machine learning engineer, database administrator, computer engineer, information technology facilitator, edge computing master, cybercity analyst, personal data broker, machine manager, digital tailor, augmented reality (AR) journey builder, user experience (UX) writer, DevOps (developers and IT operation professionals), among many other professions. And these professionals work, as we mentioned, in the most diverse sectors! We find data engineers in the food and beverage industry as well as AR journey builders in the gaming industry.

Figure 1.1 provides an overview of the relationship among data science, analytics, and machine learning.

https://doi.org/10.1016/B978-0-12-824271-1.00034-2

FIGURE 1.1 Overview of data science, analytics, and machine learning (Fávero and Belfiore, 2019).

Through Figure 1.1, it is possible to verify, therefore, that data science encompasses knowledge about data analysis (analytics) as well as knowledge about methods, algorithms, Big Data, and decision-making processes.

The Analytics pillar involves knowledge and fundamentals about measurement scales of variables, mathematics, statistics, calculus, linear algebra, operations research, geometry, and trigonometry. It is not possible to find a data scientist who does not present some solidity of knowledge in these fields; however, if you find one who identifies this way, this person will be, at most, a pusher of codes and buttons!

The pillar referring to methods, algorithms, and Big Data refers to the knowledge for implementing routines and codes from specific languages such as R, Python, Stata, Julia, SQL, Java, C/C++, Scala, SAS, Matlab, SPSS, among many others.
Note that the implementation of routines necessarily involves knowledge about the fundamentals of Analytics so mistakes are not made when writing the codes. It is very common to find programmers who do not know the statistical foundations of a particular modeling technique and end up writing code that does not reflect, for example, the nature of the variables under study. The outputs obtained in this case will be, to say the least, inaccurate and sometimes completely wrong!

In this pillar, we can still find the fundamentals of Big Data, which correspond to the simultaneous occurrence of five characteristics, or dimensions of the data: volume, speed, variety, variability, and complexity of the data.

The exacerbated volume of data arises, among other reasons, from the increase in computational capacity and the increase in the monitoring of the most diverse phenomena. The speed with which data becomes available for treatment and analysis, due to new forms of collection that use, for example, electronic tags and radio frequency systems, is also visible and vital for the decision-making processes. The variety refers to the different formats in which the data are accessed, such as texts, indicators, secondary bases, or even speeches, and a convergent analysis can also provide better decision making. The variability of the data is related, in addition to the three previous dimensions, with cyclical or seasonal phenomena, sometimes with high frequency, directly observable or not, and that a given treatment can generate differentiated information. Last, but not least, the complexity of the data, especially for large volumes, lies in the fact that many sources can be accessed with different codes, periodicities, or criteria, which requires a control process from the researcher (Fávero and Belfiore, 2019).

In this sense, the relationship between the Analytics pillar and the Methods, Algorithms, and Big Data pillar corresponds to what we call machine learning, which refers to the processes of pattern recognition in data from codes that “train the machine” for this purpose, that is, a process for exploring data to discover meaningful patterns and rules.
Here are also deep learning algorithms, or deep pattern recognition from algorithms, for example, from neural networks for image recognition based on large amounts of data.

This processing flow cannot be supported without being accompanied by the improved professional software and increased processing capacity of increasingly gigantic datasets that are capable of supporting the elaboration of the most diverse tests and the estimation of the most varied models that should reflect the reality of each situation and according to what the researcher and the decision maker want.

These are the main reasons that have led organizations active in the most diverse sectors to invest in the structuring and development of multidisciplinary areas of data science that have the main objective of analyzing data and generating information, allowing the creation of pattern recognition and the establishment of real-time predictive capability. The emergence and improvement of complex computer systems, together with the reduction in costs for acquiring hardware and software, have made organizations increasingly store data in data warehouses, data lakes, virtual libraries, and the cloud (Fávero and Belfiore, 2019).

The direct acquisition of outputs from analytics tools and the deployment of models, which refers to a data engineering task focused on the production and availability, through APIs (Application Programming Interfaces), of models estimated in real time, generate subsidies for decision making. And, obviously, the decision-making process goes through aspects related to team management, resource allocation, and humanization of production processes!
In a cyclical way, understanding the business or the area of study can improve the acquisition of new data, increase the ability to prepare these data, and favor the development of new programming codes with a focus on the search for other machine learning models that eventually generate better adhesions between the real values of the phenomenon under study and the fitted values obtained. This can provide better results, which will favor an increase in the ability to understand the area of study and the business as a whole!

Overview of the book

The book is divided into 28 chapters, as follows:

Part I: Introduction
Chapter 1: Overview of data science, analytics, and machine learning
Chapter 2: Introduction to R-based language

Part II: Applied statistics and data visualization
Chapter 3: Types of variables, measurement scales, and accuracy scales
Chapter 4: Univariate descriptive statistics
Chapter 5: Bivariate descriptive statistics
Chapter 6: Hypotheses tests
Chapter 7: Data visualization and multivariate graphs

Part III: Data mining and preparation
Chapter 8: Webscraping and handcrafted robots
Chapter 9: Using application programming interfaces to collect data
Chapter 10: Managing data

Part IV: Unsupervised machine learning techniques
Chapter 11: Cluster analysis
Chapter 12: Principal component factor analysis
Chapter 13: Simple and multiple correspondence analysis

Part V: Supervised machine learning techniques
Chapter 14: Simple and multiple regression models
Chapter 15: Binary and multinomial logistic regression models
Chapter 16: Count-data and zero-inflated regression models
Chapter 17: Generalized linear mixed models

Part VI: Improving performance
Chapter 18: Support vector machines
Chapter 19: Classification and regression trees
Chapter 20: Boosting and bagging
Chapter 21: Random forests
Chapter 22: Artificial neural networks

Part VII: Spatial analysis
Chapter 23: Working on shapefiles
Chapter 24: Dealing with simple features objects
Chapter 25: Raster objects
Chapter 26: Exploratory spatial analysis

Part VIII: Adding value to your work
Chapter 27: Enhanced and interactive graphs
Chapter 28: Dashboards with R
