Data Science: Concepts and Practice

549 Pages·2018·48.729 MB·English

Preview Data Science: Concepts and Practice

Data Science
Concepts and Practice
Second Edition

Vijay Kotu
Bala Deshpande

Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2019 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

ISBN: 978-0-12-814761-0

For Information on all Morgan Kaufmann publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Jonathan Simpson
Acquisition Editor: Glyn Jones
Editorial Project Manager: Ana Claudia Abad Garcia
Production Project Manager: Sreejith Viswanathan
Cover Designer: Greg Harris

Typeset by MPS Limited, Chennai, India

Dedication

To all the mothers in our lives

Foreword

A lot has happened since the first edition of this book was published in 2014. There is hardly a day where there is no news on data science, machine learning, or artificial intelligence in the media. It is interesting that many of those news articles have a skeptical, if not even negative, tone. All this underlines two things: data science and machine learning are finally becoming mainstream. And people know shockingly little about it. Readers of this book will certainly do better in this regard. It continues to be a valuable resource to not only educate about how to use data science in practice, but also how the fundamental concepts work.

Data science and machine learning are fast-moving fields, which is why this second edition reflects a lot of the changes in our field. While we used to talk a lot about “data mining” and “predictive analytics” only a couple of years ago, we have now settled on the term “data science” for the broader field. And even more importantly: it is now commonly understood that machine learning is at the core of many current technological breakthroughs. These are truly exciting times for all the people working in our field then!

I have seen situations where data science and machine learning had an incredible impact. But I have also seen situations where this was not the case. What was the difference?
In most cases where organizations fail with data science and machine learning, they had used those techniques in the wrong context. Data science models are not very helpful if you only have one big decision you need to make. Analytics can still help you in such cases by giving you easier access to the data you need to make this decision, or by presenting the data in a consumable fashion. But at the end of the day, those single big decisions are often strategic. Building a machine learning model to help you make this decision is not worth doing. And often such models also do not yield better results than just making the decision on your own.

Here is where data science and machine learning can truly help: these advanced models deliver the most value whenever you need to make lots of similar decisions quickly. Good examples of this are:

• Defining the price of a product in markets with rapidly changing demands.
• Making offers for cross-selling in an E-Commerce platform.
• Approving credit or not.
• Detecting customers with a high risk for churn.
• Stopping fraudulent transactions.
• And many others.

You can see that a human being who would have access to all relevant data could make those decisions in a matter of seconds or minutes. But they can’t without data science, since they would need to make this type of decision millions of times, every day. Consider sifting through your customer base of 50 million clients every day to identify those with a high churn risk. Impossible for any human being. But no problem at all for a machine learning model.

So, the biggest value of artificial intelligence and machine learning is not to support us with those big strategic decisions. Machine learning delivers the most value when we operationalize models and automate millions of decisions.

One of the shortest descriptions of this phenomenon comes from Andrew Ng, who is a well-known researcher in the field of AI.
Andrew describes what AI can do as follows: “If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.”

I agree with him on this characterization. And I like that Andrew puts the emphasis on automation and operationalization of those models, because this is where the biggest value is. The only thing I disagree with is the time unit he chose. It is already safe to go with a minute instead of a second.

However, the quick pace of changes as well as the ubiquity of data science also underlines the importance of laying the right foundations. Keep in mind that machine learning is not completely new. It has been an active field of research since the 1950s. Some of the algorithms used today have even been around for more than 200 years now. And the first deep learning models were developed in the 1960s with the term “deep learning” being coined in 1984. Those algorithms are well understood now. And understanding their basic concepts will help you to pick the right algorithm for the right task.

To support you with this, some additional chapters on deep learning and recommendation systems have been added to the book. Another focus area is using text analytics and natural language processing. It became clear in the past years that the most successful predictive models have been using unstructured input data in addition to the more traditional tabular formats. Finally, expansion of Time Series Forecasting should get you started on one of the most widely applied data science techniques in the business.

More algorithms could mean that there is a risk of increased complexity. But thanks to the simplicity of the RapidMiner platform and the many practical examples throughout the book this is not the case here. We continue our journey towards the democratization of data science and machine learning.
This journey continues until data science and machine learning are as ubiquitous as data visualization or Excel. Of course, we cannot magically transform everybody into a data scientist overnight, but we can give people the tools to help them on their personal path of development. This book is the only tour guide you need on this journey.

Ingo Mierswa
Founder
RapidMiner Inc.
Massachusetts, USA

Preface

Our goal is to introduce you to Data Science.

We will provide you with a survey of the fundamental data science concepts as well as step-by-step guidance on practical implementations, enough to get you started on this exciting journey.

WHY DATA SCIENCE?

We have run out of adjectives and superlatives to describe the growth trends of data. The technology revolution has brought about the need to process, store, analyze, and comprehend large volumes of diverse data in meaningful ways. However, the value of the stored data is zero unless it is acted upon. The scale of data volume and variety places new demands on organizations to quickly uncover hidden relationships and patterns. This is where data science techniques have proven to be extremely useful. They are increasingly finding their way into the everyday activities of many business and government functions, whether in identifying which customers are likely to take their business elsewhere, or mapping a flu pandemic using social media signals.

Data science is a compilation of techniques that extract value from data. Some of the techniques used in data science have a long history and trace their roots to applied statistics, machine learning, visualization, logic, and computer science. Some techniques have just reached the popularity they deserve. Most emerging technologies go through what is termed the “hype cycle.” This is a way of contrasting the amount of hyperbole or hype versus the productivity that is engendered by the emerging technology.
The hype cycle has three main phases: peak of inflated expectation, trough of disillusionment, and plateau of productivity. The third phase refers to the mature and value-generating phase of any technology. The hype cycle for data science indicates that it is in this mature phase. Does this imply that data science has stopped growing or has reached a saturation point? Not at all. On the contrary, this discipline has grown beyond the scope of its initial applications in marketing and has advanced to applications in technology, internet-based fields, health care, government, finance, and manufacturing.

WHY THIS BOOK?

The objective of this book is two-fold: to help clarify the basic concepts behind many data science techniques in an easy-to-follow manner; and to prepare anyone with a basic grasp of mathematics to implement these techniques in their organizations without the need to write any lines of programming code. Beyond its practical value, we wanted to show you that the data science learning algorithms are elegant, beautiful, and incredibly effective. You will never look at data the same way once you learn the concepts of the learning algorithms.

To make the concepts stick, you will have to build data science models. While there are many data science tools available to execute algorithms and develop applications, the approaches to solving a data science problem are similar among these tools. We wanted to pick a fully functional, open source, free to use, graphical user interface-based data science tool so readers can follow the concepts and implement the data science algorithms. RapidMiner, a leading data science platform, fit the bill and, thus, we used it as a companion tool to implement the data science algorithms introduced in every chapter.

WHO CAN USE THIS BOOK?

The concepts and implementations described in this book are geared towards business, analytics, and technical professionals who use data every day.
You, the reader of the book, will get a comprehensive understanding of the different data science techniques that can be used for prediction and for discovering patterns, be prepared to select the right technique for a given data problem, and you will be able to create a general-purpose analytics process.

We have tried to follow a process to describe this body of knowledge. Our focus has been on introducing about 30 key algorithms that are in widespread use today. We present these algorithms in the framework of:

1. A high-level practical use case for each algorithm.
2. An explanation of how the algorithm works in plain language. Many algorithms have a strong foundation in statistics and/or computer science. In our descriptions, we have tried to strike a balance between being accessible to a wider audience and being academically rigorous.
