ebook img

Principles of Data Science PDF

389 Pages·2016·11.407 MB·English
by  OzdemirSinan
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Principles of Data Science

Principles of Data Science Learn the techniques and math you need to start making sense of your data Sinan Ozdemir BIRMINGHAM - MUMBAI Principles of Data Science Copyright © 2016 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: December 2016 Production reference: 1121216 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78588-791-8 www.packtpub.com Credits Author Project Coordinator Devanshi Doshi Sinan Ozdemir Proofreaders Reviewers Safis Editing Samir Madhavan Oleg Okun Indexer Tejal Daruwale Soni Acquisition Editor Sonali Vernekar Graphics Jason Monteiro Content Development Editor Samantha Gonsalves Production Coordinator Melwyn Dsa Technical Editor Anushree Arun Tendulkar Cover Work Melwyn Dsa Copy Editor Shaila Kusanale About the Author Sinan Ozdemir is a data scientist, startup founder, and educator living in the San Francisco Bay Area with his dog, Charlie; cat, Euclid; and bearded dragon, Fiero. He spent his academic career studying pure mathematics at Johns Hopkins University before transitioning to education. He spent several years conducting lectures on data science at Johns Hopkins University and at the General Assembly before founding his own start-up, Legion Analytics, which uses artificial intelligence and data science to power enterprise sales teams. After completing the Fellowship at the Y Combinator accelerator, Sinan has spent most of his days working on his fast-growing company, while creating educational material for data science. I would like to thank my parents and my sister for supporting me through life, and also, my various mentors, including Dr. Pam Sheff of Johns Hopkins University and Nathan Neal, the chapter adviser of my collegiate leadership fraternity, Sigma Chi. Thank you to Packt Publishing for giving me this opportunity to share the principles of data science and my excitement for how this field will impact all of our lives in the coming years. About the Reviewers Samir Madhavan has over six years of rich data science experience in the industry and has also written a book called Mastering Python for Data Science. He started his career with Mindtree, where he was a part of the fraud detection algorithm team for the UID (Unique Identification) project, called Aadhar, which is the equivalent of a Social Security number for India. After this, he joined Flutura Decision Sciences and Analytics as the first employee, where he was part of the core team that helped the organization scale to an over a hundred members. As a part of Flutura, he helped establish big data and machine learning practice within Flutura and also helped out in business development. At present, he is leading the analytics team for a Boston-based pharma tech company called Zapprx, and is helping the firm to create data-driven products that will be sold to its customers. Oleg Okun is a machine learning expert and an author/editor of four books, numerous journal articles, and conference papers. His career spans more than a quarter of a century. He was employed in both academia and industry in his mother country, Belarus, and abroad (Finland, Sweden, and Germany). His work experience includes document image analysis, fingerprint biometrics, bioinformatics, online/ offline marketing analytics, and credit-scoring analytics. He is interested in all aspects of distributed machine learning and the Internet of Things. Oleg currently lives and works in Hamburg, Germany. I would like to express my deepest gratitude to my parents for everything that they have done for me. www.PacktPub.com eBooks, discount offers, and more Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub. com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. https://www.packtpub.com/mapt Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career. Why subscribe? • Fully searchable across every book published by Packt • Copy and paste, print, and bookmark content • On demand and accessible via a web browser Table of Contents Preface vii Chapter 1: How to Sound Like a Data Scientist 1 What is data science? 3 Basic terminology 3 Why data science? 5 Example – Sigma Technologies 5 The data science Venn diagram 6 The math 8 Example – spawner-recruit models 8 Computer programming 10 Why Python? 10 Python practices 11 Example of basic Python 12 Domain knowledge 14 Some more terminology 15 Data science case studies 16 Case study – automating government paper pushing 16 Fire all humans, right? 18 Case study – marketing dollars 18 Case study – what's in a job description? 20 Summary 23 Chapter 2: Types of Data 25 Flavors of data 25 Why look at these distinctions? 26 Structured versus unstructured data 26 Example of data preprocessing 27 Word/phrase counts 28 Presence of certain special characters 28 Relative length of text 29 Picking out topics 29 [ i ] Table of Contents Quantitative versus qualitative data 30 Example – coffee shop data 30 Example – world alcohol consumption data 32 Digging deeper 34 The road thus far… 34 The four levels of data 35 The nominal level 35 Mathematical operations allowed 36 Measures of center 36 What data is like at the nominal level 36 The ordinal level 36 Examples 37 Mathematical operations allowed 37 Measures of center 38 Quick recap and check 39 The interval level 39 Example 39 Mathematical operations allowed 40 Measures of center 40 Measures of variation 41 The ratio level 43 Examples 43 Measures of center 43 Problems with the ratio level 44 Data is in the eye of the beholder 45 Summary 45 Chapter 3: The Five Steps of Data Science 47 Introduction to Data Science 47 Overview of the five steps 48 Ask an interesting question 48 Obtain the data 48 Explore the data 48 Model the data 49 Communicate and visualize the results 49 Explore the data 49 Basic questions for data exploration 50 Dataset 1 – Yelp 51 Dataframes 53 Series 54 Exploration tips for qualitative data 54 Dataset 2 – titanic 60 Summary 64 [ ii ] Table of Contents Chapter 4: Basic Mathematics 65 Mathematics as a discipline 65 Basic symbols and terminology 66 Vectors and matrices 66 Quick exercises 68 Answers 69 Arithmetic symbols 69 Summation 69 Proportional 70 Dot product 70 Graphs 73 Logarithms/exponents 74 Set theory 77 Linear algebra 81 Matrix multiplication 81 How to multiply matrices 82 Summary 85 Chapter 5: Impossible or Improbable – A Gentle Introduction to Probability 87 Basic definitions 88 Probability 88 Bayesian versus Frequentist 90 Frequentist approach 90 The law of large numbers 91 Compound events 93 Conditional probability 96 The rules of probability 97 The addition rule 97 Mutual exclusivity 98 The multiplication rule 99 Independence 100 Complementary events 100 A bit deeper 102 Summary 103 Chapter 6: Advanced Probability 105 Collectively exhaustive events 105 Bayesian ideas revisited 106 Bayes theorem 106 More applications of Bayes theorem 110 Example – Titanic 110 Example – medical studies 112 [ iii ]

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.