Machine Learning Algorithms

A reference guide to popular algorithms for data science and machine learning

Giuseppe Bonaccorso

BIRMINGHAM - MUMBAI

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2017
Production reference: 1200717

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78588-962-2
www.packtpub.com

Credits

Author: Giuseppe Bonaccorso
Reviewers: Manuel Amunategui, Doug Ortiz, Lukasz Tracewski
Commissioning Editor: Veena Pagare
Acquisition Editor: Divya Poojari
Content Development Editor: Mayur Pawanikar
Technical Editor: Prasad Ramesh
Copy Editors: Vikrant Phadkay, Alpha Singh
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta

About the Author

Giuseppe Bonaccorso is a machine learning and big data consultant with more than 12 years of experience. He has an M.Eng.
in electronics engineering from the University of Catania, Italy, and further postgraduate specialization from the University of Rome, Tor Vergata, Italy, and the University of Essex, UK. During his career, he has held various IT roles in several business contexts, including public administration, military, utilities, healthcare, diagnostics, and advertising. He has developed and managed projects using many technologies, including Java, Python, Hadoop, Spark, Theano, and TensorFlow. His main interests are artificial intelligence, machine learning, data science, and philosophy of mind.

About the Reviewers

Manuel Amunategui is the VP of data science at SpringML, a start-up offering Google Cloud, TensorFlow, and Salesforce enterprise solutions. Prior to that, he worked as a quantitative developer on Wall Street for a large equity options market-making firm and as a software developer at Microsoft. He holds master's degrees in predictive analytics and international administration. He is a data science advocate, a blogger/vlogger (http://amunategui.github.io), a trainer on Udemy.com and O'Reilly Media, and a technical reviewer at Packt.

Doug Ortiz is a senior big data architect at ByteCubed who has been architecting, developing, and integrating enterprise solutions throughout his career. Organizations that leverage his skill set have been able to rediscover and reuse their underutilized data via existing and emerging technologies such as Microsoft BI Stack, Hadoop, NoSQL databases, SharePoint, and related tool sets and technologies. He is also the founder of Illustris, LLC and can be reached at [email protected]. His professional experience includes integrating multiple platforms and products and working with big data, and he holds data science, R, and Python certifications.
Doug also helps organizations gain a deeper understanding of, and value, their current investments in data and existing resources, turning them into useful sources of information. He has improved, salvaged, and architected projects by utilizing unique and innovative techniques. His hobbies include yoga and scuba diving.

Lukasz Tracewski is a software developer and a scientist, specializing in machine learning, digital signal processing, and cloud computing. An active member of the open source community, he is also the author of numerous research publications. He has worked for 6 years as a software scientist in the high-tech industry in the Netherlands, first in photolithography and later in electron microscopy, helping to build algorithms and machines that reach the physical limits of throughput and precision. Currently, he leads a data science team in the financial industry. For 4 years now, Lukasz has been using his skills pro bono in conservation science, working on topics such as the classification of bird species from audio recordings and satellite imagery analysis. He inhales carbon dioxide and exhales endangered species in his spare time.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?

  Fully searchable across every book published by Packt
  Copy and paste, print, and bookmark content
  On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1785889621.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We reward our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

Chapter 1: A Gentle Introduction to Machine Learning
  Introduction - classic and adaptive machines
  Only learning matters
  Supervised learning
  Unsupervised learning
  Reinforcement learning
  Beyond machine learning - deep learning and bio-inspired adaptive systems
  Machine learning and big data
  Further reading
  Summary

Chapter 2: Important Elements in Machine Learning
  Data formats
  Multiclass strategies
  One-vs-all
  One-vs-one
  Learnability
  Underfitting and overfitting
  Error measures
  PAC learning
  Statistical learning approaches
  MAP learning
  Maximum-likelihood learning
  Elements of information theory
  References
  Summary

Chapter 3: Feature Selection and Feature Engineering
  scikit-learn toy datasets
  Creating training and test sets
  Managing categorical data
  Managing missing features
  Data scaling and normalization