Statistics for Machine Learning
Build supervised, unsupervised, and reinforcement learning
models using both Python and R
Pratap Dangeti
BIRMINGHAM - MUMBAI
Statistics for Machine Learning
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, nor its
dealers and distributors will be held liable for any damages caused or alleged to have been
caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2017
Production reference: 1180717
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78829-575-8
www.packtpub.com
Credits
Author: Pratap Dangeti
Reviewer: Manuel Amunategui
Commissioning Editor: Veena Pagare
Acquisition Editor: Aman Singh
Content Development Editor: Mayur Pawanikar
Technical Editor: Dinesh Pawar
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta
About the Author
Pratap Dangeti develops machine learning and deep learning solutions for structured,
image, and text data in the Analytics and Insights innovation lab at TCS in Bangalore. He
has extensive experience in both analytics and data science. He received his master's
degree in industrial engineering and operations research from IIT Bombay. He is an
artificial intelligence enthusiast. When not working, he likes to read about next-gen
technologies and innovative methodologies.
First and foremost, I would like to thank my mom, Lakshmi, for her support throughout
my career and in writing this book. She has been my inspiration and motivation for
continuing to improve my knowledge and helping me move ahead in my career. She is my
strongest supporter, and I dedicate this book to her. I also thank my family and friends for
their encouragement, without which it would not have been possible to write this book.
I would like to thank my acquisition editor, Aman Singh, and content development editor,
Mayur Pawanikar, who chose me to write this book and encouraged me constantly
throughout the period of writing with their invaluable feedback and input.
About the Reviewer
Manuel Amunategui is vice president of data science at SpringML, a startup offering
Google Cloud TensorFlow and Salesforce enterprise solutions. Prior to that, he worked as a
quantitative developer on Wall Street for a large equity-options market-making firm and as
a software developer at Microsoft. He holds master's degrees in predictive analytics and
international administration.
He is a data science advocate, a blogger/vlogger (amunategui.github.io), a trainer on
Udemy and O'Reilly Media, and a technical reviewer at Packt Publishing.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a
print book customer, you are entitled to a discount on the eBook copy. Get in touch with us
at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters and receive exclusive discounts and offers on Packt books and
eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt
books and video courses, as well as industry-leading tools to help you plan your personal
development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial
process. To help us improve, please leave us an honest review on this book's Amazon page
at https://www.amazon.com/dp/1788295757.
If you'd like to join our team of regular reviewers, you can e-mail us at
customerreviews@packtpub.com. We reward our regular reviewers with free eBooks and
videos in exchange for their valuable feedback. Help us be relentless in improving our
products!
Table of Contents
Preface
Chapter 1: Journey from Statistics to Machine Learning
    Statistical terminology for model building and validation
    Machine learning
    Major differences between statistical modeling and machine learning
    Steps in machine learning model development and deployment
    Statistical fundamentals and terminology for model building and validation
    Bias versus variance trade-off
    Train and test data
    Machine learning terminology for model building and validation
    Linear regression versus gradient descent
    Machine learning losses
    When to stop tuning machine learning models
    Train, validation, and test data
    Cross-validation
    Grid search
    Machine learning model overview
    Summary
Chapter 2: Parallelism of Statistics and Machine Learning
    Comparison between regression and machine learning models
    Compensating factors in machine learning models
    Assumptions of linear regression
    Steps applied in linear regression modeling
    Example of simple linear regression from first principles
    Example of simple linear regression using the wine quality data
    Example of multilinear regression - step-by-step methodology of model building
    Backward and forward selection
    Machine learning models - ridge and lasso regression
    Example of ridge regression machine learning
    Example of lasso regression machine learning model
    Regularization parameters in linear regression and ridge/lasso regression
    Summary
Chapter 3: Logistic Regression Versus Random Forest
    Maximum likelihood estimation
    Logistic regression – introduction and advantages
    Terminology involved in logistic regression
    Applying steps in logistic regression modeling
    Example of logistic regression using German credit data
    Random forest
    Example of random forest using German credit data
    Grid search on random forest
    Variable importance plot
    Comparison of logistic regression with random forest
    Summary
Chapter 4: Tree-Based Machine Learning Models
    Introducing decision tree classifiers
    Terminology used in decision trees
    Decision tree working methodology from first principles
    Comparison between logistic regression and decision trees
    Comparison of error components across various styles of models
    Remedial actions to push the model towards the ideal region
    HR attrition data example
    Decision tree classifier
    Tuning class weights in decision tree classifier
    Bagging classifier
    Random forest classifier
    Random forest classifier - grid search
    AdaBoost classifier
    Gradient boosting classifier
    Comparison between AdaBoosting versus gradient boosting
    Extreme gradient boosting - XGBoost classifier
    Ensemble of ensembles - model stacking
    Ensemble of ensembles with different types of classifiers
    Ensemble of ensembles with bootstrap samples using a single type of classifier
    Summary
Chapter 5: K-Nearest Neighbors and Naive Bayes
    K-nearest neighbors
    KNN voter example
    Curse of dimensionality