Tree-Based Methods
for Statistical
Learning in R
The book follows up most ideas and mathematical concepts with code-based
examples in the R statistical language, with an emphasis on using as few external
packages as possible. For example, users will be exposed to writing their own
random forest and gradient tree boosting functions using simple for loops and
basic tree-fitting software (like rpart and party/partykit), and more. The core
chapters also end with a detailed section on relevant software in both R and
other open-source alternatives (e.g., Python, Spark, and Julia), along with example
usage on real data sets. While the book mostly uses R, it is meant to be equally
accessible and useful to non-R programmers.
Readers of this book will gain a solid foundation in (and appreciation for)
tree-based methods and how they can be used to solve the practical problems and
challenges data scientists often face in applied work.
Features:
• Thorough coverage, from the ground up, of tree-based methods (e.g.,
CART, conditional inference trees, bagging, boosting, and random
forests).
• A companion website containing additional supplementary material
and the code to reproduce every example and figure in the book.
• A companion R package, called treemisc, which contains several data
sets and functions used throughout the book (e.g., there’s an
implementation of gradient tree boosting with LAD loss that shows how
to perform the line search step by updating the terminal node estimates
of a fitted rpart tree).
• Interesting examples that are of practical use; for example, how to
construct partial dependence plots from a fitted model in Spark MLlib
(using only Spark operations), or how to post-process tree ensembles
via the LASSO to reduce the number of trees while maintaining, or even
improving, performance.
CHAPMAN & HALL/CRC DATA SCIENCE SERIES
Reflecting the interdisciplinary nature of the field, this book series brings together
researchers, practitioners, and instructors from statistics, computer science,
machine learning, and analytics. The series will publish cutting-edge research,
industry applications, and textbooks in data science.
The inclusion of concrete examples, applications, and methods is highly
encouraged. The scope of the series includes titles in the areas of machine
learning, pattern recognition, predictive analytics, business analytics, Big
Data, visualization, programming, software, learning analytics, data wrangling,
interactive graphics, and reproducible research.
Published Titles
An Introduction to IoT Analytics
Harry G. Perros
Data Analytics
A Small Data Approach
Shuai Huang and Houtao Deng
Public Policy Analytics
Code and Context for Data Science in Government
Ken Steif
Supervised Machine Learning for Text Analysis in R
Emil Hvitfeldt and Julia Silge
Massive Graph Analytics
Edited by David Bader
Data Science
An Introduction
Tiffany-Anne Timbers, Trevor Campbell and Melissa Lee
Tree-Based Methods for Statistical Learning in R
Brandon M. Greenwell
Urban Informatics
Using Big Data to Understand and Serve Communities
Daniel T. O’Brien
For more information about this series, please visit: https://www.routledge.com/Chapman--HallCRC-Data-Science-Series/book-series/CHDSS
Tree-Based Methods
for Statistical
Learning in R
Brandon M. Greenwell
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2022 Brandon M. Greenwell
Reasonable efforts have been made to publish reliable data and information, but the author and
publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.
com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA
01923, 978-750-8400. For works that are not available on CCC please contact
mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Names: Greenwell, Brandon M., author.
Title: Tree-Based Methods for Statistical Learning in R
/ Brandon M. Greenwell.
Description: First edition. | Boca Raton, FL : CRC Press, 2022. | Series:
Chapman & Hall/CRC data science series | Includes bibliographical
references and index.
Identifiers: LCCN 2021059606 (print) | LCCN 2021059607 (ebook) | ISBN
9780367532468 (hbk) | ISBN 9780367543822 (pbk) | ISBN 9781003089032
(ebk)
Subjects: LCSH: Decision making--Mathematics. | Decision trees--Data
processing. | Trees (Graph theory)--Data processing. | R (Computer
program language)
Classification: LCC T57.95 .G73 2022 (print) | LCC T57.95 (ebook) | DDC
658.4/03--dc23/eng/20220217
LC record available at https://lccn.loc.gov/2021059606
LC ebook record available at https://lccn.loc.gov/2021059607
ISBN: 978-0-367-53246-8 (hbk)
ISBN: 978-0-367-54382-2 (pbk)
ISBN: 978-1-003-08903-2 (ebk)
DOI: 10.1201/9781003089032
Typeset in LM Roman
by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
To my son, Oliver.
Contents
Preface xiii
1 Introduction 1
1.1 Select topics in statistical and machine learning . . . . . . . 2
1.1.1 Statistical jargon and conventions . . . . . . . . . . . 3
1.1.2 Supervised learning . . . . . . . . . . . . . . . . . . . 4
1.1.2.1 Description . . . . . . . . . . . . . . . . . . . 5
1.1.2.2 Prediction . . . . . . . . . . . . . . . . . . . 6
1.1.2.3 Classification vs. regression . . . . . . . . . 7
1.1.2.4 Discrimination vs. prediction . . . . . . . . . 7
1.1.2.5 The bias-variance tradeoff . . . . . . . . . . . 8
1.1.3 Unsupervised learning . . . . . . . . . . . . . . . . . . 10
1.2 Why trees? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 A brief history of decision trees . . . . . . . . . . . . 12
1.2.2 The anatomy of a simple decision tree . . . . . . . . . 14
1.2.2.1 Example: survival on the Titanic . . . . . . . 15
1.3 Why R? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.1 No really, why R? . . . . . . . . . . . . . . . . . . . . 17
1.3.2 Software information and conventions . . . . . . . . . 19
1.4 Some example data sets . . . . . . . . . . . . . . . . . . . . . 20
1.4.1 Swiss banknotes . . . . . . . . . . . . . . . . . . . . . 21
1.4.2 New York air quality measurements . . . . . . . . . . 21
1.4.3 The Friedman 1 benchmark problem . . . . . . . . . 23
1.4.4 Mushroom edibility . . . . . . . . . . . . . . . . . . . 24
1.4.5 Spam or ham? . . . . . . . . . . . . . . . . . . . . . . 25
1.4.6 Employee attrition . . . . . . . . . . . . . . . . . . . . 28
1.4.7 Predicting home prices in Ames, Iowa . . . . . . . . . 29
1.4.8 Wine quality ratings . . . . . . . . . . . . . . . . . . 30
1.4.9 Mayo Clinic primary biliary cholangitis study . . . . 31
1.5 There ain’t no such thing as a free lunch . . . . . . . . . . . 35
1.6 Outline of this book . . . . . . . . . . . . . . . . . . . . . . . 35
I Decision trees 37
2 Binary recursive partitioning with CART 39
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.2 Classification trees . . . . . . . . . . . . . . . . . . . . . . . . 41
2.2.1 Splits on ordered variables . . . . . . . . . . . . . . . . 43
2.2.1.1 So which is it in practice, Gini or entropy? . 47
2.2.2 Example: Swiss banknotes . . . . . . . . . . . . . . . . 48
2.2.3 Fitted values and predictions . . . . . . . . . . . . . . 51
2.2.4 Class priors and misclassification costs . . . . . . . . 52
2.2.4.1 Altered priors . . . . . . . . . . . . . . . . . 54
2.2.4.2 Example: employee attrition . . . . . . . . . 55
2.3 Regression trees . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.3.1 Example: New York air quality measurements . . . . 59
2.4 Categorical splits . . . . . . . . . . . . . . . . . . . . . . . . 62
2.4.1 Example: mushroom edibility . . . . . . . . . . . . . . 64
2.4.2 Be wary of categoricals with high cardinality . . . . . 67
2.4.3 To encode, or not to encode? . . . . . . . . . . . . . . 68
2.5 Building a decision tree . . . . . . . . . . . . . . . . . . . . . 69
2.5.1 Cost-complexity pruning . . . . . . . . . . . . . . . . 71
2.5.1.1 Example: mushroom edibility . . . . . . . . . 74
2.5.2 Cross-validation . . . . . . . . . . . . . . . . . . . . . 77
2.5.2.1 The 1-SE rule . . . . . . . . . . . . . . . . . 78
2.6 Hyperparameters and tuning . . . . . . . . . . . . . . . . . . 78
2.7 Missing data and surrogate splits . . . . . . . . . . . . . . . 78
2.7.1 Other missing value strategies . . . . . . . . . . . . . 80
2.8 Variable importance . . . . . . . . . . . . . . . . . . . . . . . 82
2.9 Software and examples . . . . . . . . . . . . . . . . . . . . . 83
2.9.1 Example: Swiss banknotes . . . . . . . . . . . . . . . . 84
2.9.2 Example: mushroom edibility . . . . . . . . . . . . . . 88
2.9.3 Example: predicting home prices . . . . . . . . . . . . 96
2.9.4 Example: employee attrition. . . . . . . . . . . . . . . 100
2.9.5 Example: letter image recognition . . . . . . . . . . . 103
2.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.10.1 Advantages of CART. . . . . . . . . . . . . . . . . . . 105
2.10.2 Disadvantages of CART . . . . . . . . . . . . . . . . . 106
2.11 Recommended reading . . . . . . . . . . . . . . . . . . . . . 108
3 Conditional inference trees 111
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.2 Early attempts at unbiased recursive partitioning . . . . . . 112
3.3 A quick digression into conditional inference . . . . . . . . . 114
3.3.1 Example: X and Y are both univariate continuous . 117
3.3.2 Example: X and Y are both nominal categorical . . 118
3.3.3 Which test statistic should you use? . . . . . . . . . . 120
3.4 Conditional inference trees . . . . . . . . . . . . . . . . . . . 121
3.4.1 Selecting the splitting variable . . . . . . . . . . . . . 121
3.4.1.1 Example: New York air quality measurements 123
3.4.1.2 Example: Swiss banknotes . . . . . . . . . . 124
3.4.2 Finding the optimal split point . . . . . . . . . . . . . 125
3.4.2.1 Example: New York air quality measurements 126
3.4.3 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.4.4 Missing values . . . . . . . . . . . . . . . . . . . . . . 128
3.4.5 Choice of α, g(), and h() . . . . . . . . . . . . . . . . 128
3.4.6 Fitted values and predictions . . . . . . . . . . . . . . 131
3.4.7 Imbalanced classes . . . . . . . . . . . . . . . . . . . . 131
3.4.8 Variable importance . . . . . . . . . . . . . . . . . . . 132
3.5 Software and examples . . . . . . . . . . . . . . . . . . . . . 132
3.5.1 Example: New York air quality measurements . . . . 133
3.5.2 Example: wine quality ratings . . . . . . . . . . . . . . 137
3.5.3 Example: Mayo Clinic liver transplant data . . . . . . 140
3.6 Final thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4 The hitchhiker’s GUIDE to modern decision trees 147
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.2 A GUIDE for regression . . . . . . . . . . . . . . . . . . . . . 150
4.2.1 Piecewise constant models . . . . . . . . . . . . . . . . 150
4.2.1.1 Example: New York air quality measurements 152
4.2.2 Interaction tests . . . . . . . . . . . . . . . . . . . . . 153
4.2.3 Non-constant fits . . . . . . . . . . . . . . . . . . . . . 154
4.2.3.1 Example: predicting home prices . . . . . . 155
4.2.3.2 Bootstrap bias correction . . . . . . . . . . . 157
4.3 A GUIDE for classification . . . . . . . . . . . . . . . . . . . 157
4.3.1 Linear/oblique splits . . . . . . . . . . . . . . . . . . 157
4.3.1.1 Example: classifying the Palmer penguins . . 158
4.3.2 Priors and misclassification costs . . . . . . . . . . . . 161
4.3.3 Non-constant fits . . . . . . . . . . . . . . . . . . . . . 161
4.3.3.1 Kernel-based and k-nearest neighbor fits . . 162
4.4 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.5 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.6 Fitted values and predictions . . . . . . . . . . . . . . . . . . 163
4.7 Variable importance . . . . . . . . . . . . . . . . . . . . . . 163
4.8 Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.9 Software and examples . . . . . . . . . . . . . . . . . . . . . 165
4.9.1 Example: credit card default . . . . . . . . . . . . . . 165
4.10 Final thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . 172
II Tree-based ensembles 177
5 Ensemble algorithms 179
5.1 Bootstrap aggregating (bagging) . . . . . . . . . . . . . . . 181
5.1.1 When does bagging work? . . . . . . . . . . . . . . . . 184
5.1.2 Bagging from scratch: classifying email spam . . . . . 184
5.1.3 Sampling without replacement . . . . . . . . . . . . . 187