
Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications (Addison-Wesley Data & Analytics Series) PDF

282 Pages·2019·9.15 MB·English

Preview Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications (Addison-Wesley Data & Analytics Series)

Machine Learning in Production

The Pearson Addison-Wesley Data & Analytics Series

Visit informit.com/awdataseries for a complete list of available publications.

The Pearson Addison-Wesley Data & Analytics Series provides readers with practical knowledge for solving problems and answering questions with data. Titles in this series primarily focus on three areas:

1. Infrastructure: how to store, move, and manage data
2. Algorithms: how to mine intelligence or make predictions based on data
3. Visualizations: how to represent data and insights in a meaningful and compelling way

The series aims to tie all three of these areas together to help the reader build end-to-end systems for fighting spam; making recommendations; building personalization; detecting trends, patterns, or problems; and gaining insight from the data exhaust of systems and user interactions.

Make sure to connect with us! informit.com/socialconnect

Machine Learning in Production
Developing and Optimizing Data Science Workflows and Applications

Andrew Kelleher
Adam Kelleher

Boston • Columbus • New York • San Francisco • Amsterdam • Cape Town • Dubai • London • Madrid • Milan • Munich • Paris • Montreal • Toronto • Delhi • Mexico City • São Paulo • Sydney • Hong Kong • Seoul • Singapore • Taipei • Tokyo

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.
For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at [email protected] or (800) 382-3419.

For government sales inquiries, please contact [email protected].

For questions about sales outside the U.S., please contact [email protected].

Visit us on the Web: informit.com/aw

Library of Congress Control Number: 2018954331

Copyright © 2019 Pearson Education, Inc.

All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearsoned.com/permissions/.

ISBN-13: 978-0-13-411654-9
ISBN-10: 0-13-411654-2

This book is dedicated to our lifelong mentor, William F. Walsh III. We could never thank you enough for all the years of support and encouragement.
Contents

Foreword
Preface
About the Authors

I: Principles of Framing

1 The Role of the Data Scientist
  1.1 Introduction
  1.2 The Role of the Data Scientist
    1.2.1 Company Size
    1.2.2 Team Context
    1.2.3 Ladders and Career Development
    1.2.4 Importance
    1.2.5 The Work Breakdown
  1.3 Conclusion

2 Project Workflow
  2.1 Introduction
  2.2 The Data Team Context
    2.2.1 Embedding vs. Pooling Resources
    2.2.2 Research
    2.2.3 Prototyping
    2.2.4 A Combined Workflow
  2.3 Agile Development and the Product Focus
    2.3.1 The 12 Principles
  2.4 Conclusion

3 Quantifying Error
  3.1 Introduction
  3.2 Quantifying Error in Measured Values
  3.3 Sampling Error
  3.4 Error Propagation
  3.5 Conclusion

4 Data Encoding and Preprocessing
  4.1 Introduction
  4.2 Simple Text Preprocessing
    4.2.1 Tokenization
    4.2.2 N-grams
    4.2.3 Sparsity
    4.2.4 Feature Selection
    4.2.5 Representation Learning
  4.3 Information Loss
  4.4 Conclusion

5 Hypothesis Testing
  5.1 Introduction
  5.2 What Is a Hypothesis?
  5.3 Types of Errors
  5.4 P-values and Confidence Intervals
  5.5 Multiple Testing and “P-hacking”
  5.6 An Example
  5.7 Planning and Context
  5.8 Conclusion

6 Data Visualization
  6.1 Introduction
  6.2 Distributions and Summary Statistics
    6.2.1 Distributions and Histograms
    6.2.2 Scatter Plots and Heat Maps
    6.2.3 Box Plots and Error Bars
  6.3 Time-Series Plots
    6.3.1 Rolling Statistics
    6.3.2 Auto-Correlation
  6.4 Graph Visualization
    6.4.1 Layout Algorithms
    6.4.2 Time Complexity
  6.5 Conclusion

II: Algorithms and Architectures

7 Introduction to Algorithms and Architectures
  7.1 Introduction
  7.2 Architectures
    7.2.1 Services
    7.2.2 Data Sources
    7.2.3 Batch and Online Computing
    7.2.4 Scaling
  7.3 Models
    7.3.1 Training
    7.3.2 Prediction
    7.3.3 Validation
  7.4 Conclusion

8 Comparison
  8.1 Introduction
  8.2 Jaccard Distance
    8.2.1 The Algorithm
    8.2.2 Time Complexity
    8.2.3 Memory Considerations
    8.2.4 A Distributed Approach
  8.3 MinHash
    8.3.1 Assumptions
    8.3.2 Time and Space Complexity
    8.3.3 Tools
    8.3.4 A Distributed Approach
  8.4 Cosine Similarity
    8.4.1 Complexity
    8.4.2 Memory Considerations
    8.4.3 A Distributed Approach
  8.5 Mahalanobis Distance
    8.5.1 Complexity
    8.5.2 Memory Considerations
    8.5.3 A Distributed Approach
  8.6 Conclusion

9 Regression
  9.1 Introduction
    9.1.1 Choosing the Model
    9.1.2 Choosing the Objective Function
    9.1.3 Fitting
    9.1.4 Validation
