Scala for Machine Learning
Second Edition

Data processing, ML algorithms, smart analytics, and more

Patrick R. Nicolas

BIRMINGHAM - MUMBAI

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2015
Second edition: September 2017
Production reference: 1190917

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78712-238-3
www.packtpub.com

Credits

Author: Patrick R. Nicolas
Reviewers: Sumit Pal, Dave Wentzel
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Tushar Gupta
Content Development Editor: Amrita Noronha
Technical Editor: Nilesh Sawakhande
Copy Editors: Safis Editing, Laxmi Subramanian
Project Coordinator: Shweta H Birwatkar
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Tania Dutta
Production Coordinator: Shantanu Zagade
Cover Work: Deepika Naik

About the Author

Patrick R. Nicolas is the director of engineering at Agile SDE, California.
He has more than 25 years of experience in software engineering and building applications in C++, Java, and more recently in Scala/Spark, and has held several managerial positions. His interests include real-time analytics, modeling, and the development of nonlinear models.

About the Reviewers

Sumit Pal has more than 24 years of experience in the software industry, spanning companies from start-ups to enterprises. He is a big data architect and a visualization and data science consultant, and builds end-to-end data-driven analytic systems. Sumit has worked for Microsoft (SQLServer), Oracle (OLAP), and Verizon (big data analytics). Currently, he works for multiple clients, building their data architectures and big data solutions, and works with Spark, Scala, Java, and Python. He has extensive experience in building scalable systems across the middle tier, data tier, and visualization layers for analytics applications, using big data and NoSQL databases. Sumit has expertise in database internals, data warehouses, and dimensional modeling. As an associate director for big data at Verizon, Sumit strategized, managed, architected, and developed analytic platforms for machine learning applications. Sumit was the chief architect at ModelN/LeapfrogRX (2006-2013), where he architected the core analytics platform.

He is the author of SQL On Big Data - Technology, Architecture and Roadmap, published by Apress in October 2016. He has spoken on the topic covered in this book at the following conferences:
• May 2016, Big Data Conference - Linux Foundation in Vancouver, Canada
• March 2016, World Data Center Conference in Las Vegas, USA
• November 2015, BigData TechCon in Chicago, USA
• August 2015, Global Big Data Conference in Boston, USA

He is also the author of SQL On Big Data, published by Apress in December 2016.

Dave Wentzel is the Chief Technology Officer (CTO) of Capax Global, a premier Microsoft consulting partner.
Dave is responsible for setting the strategy and defining service offerings and capabilities for the data platform and Azure practice at Capax. Dave also works directly with clients to help them with their big data journey. Dave is a frequent blogger and speaker on big data and data science topics.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book’s Amazon page at https://www.amazon.com/dp/1787122387.

If you’d like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface
Chapter 1: Getting Started
    Mathematical notations for the curious
    Why machine learning?
        Classification
        Prediction
        Optimization
        Regression
    Why Scala?
        Scala as a functional language
            Abstraction
            Higher kinded types
            Functors
            Monads
        Scala as an object oriented language
        Scala as a scalable language
    Model categorization
    Taxonomy of machine learning algorithms
        Unsupervised learning
            Clustering
            Dimension reduction
        Supervised learning
            Generative models
            Discriminative models
        Semi-supervised learning
        Reinforcement learning
    Leveraging Java libraries
    Tools and frameworks
        Java