Springer Series in Statistics
Advisors:
P. Bickel, P. Diggle, S. Fienberg,
U. Gather, I. Olkin, S. Zeger
For other titles published in this series, go to
www.springer.com/series/692
Heping Zhang • Burton H. Singer
Recursive Partitioning
and Applications
Second Edition
Heping Zhang
Department of Epidemiology and Public Health
Yale University School of Medicine
60 College Street
New Haven, Connecticut 06520-8034
USA
heping.zhang@yale.edu

Burton H. Singer
Emerging Pathogens Institute
University of Florida
PO Box 100009
Gainesville, FL 32610
USA
ISSN 0172-7397
ISBN 978-1-4419-6823-4 e-ISBN 978-1-4419-6824-1
DOI 10.1007/978-1-4419-6824-1
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010930849
© Springer Science+Business Media, LLC 2010
All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or not
they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Dedicated to Julan, Jeffrey, and Leon (HZ) and to Eugenia,
Gregory, Maureen, and Sheila (BS)
Preface
Multiple complex pathways, characterized by interrelated events and con-
ditions, represent routes to many illnesses, diseases, and ultimately death.
Although there are substantial data and plausibility arguments support-
ing many conditions as contributory components of pathways to illness
and disease end points, we have, historically, lacked an effective methodology
for identifying the structure of the full pathways. Regression methods,
with strong linearity assumptions and data-based constraints on the extent
and order of interaction terms, have traditionally been the strategies of
choice for relating outcomes to potentially complex explanatory pathways.
However, nonlinear relationships among candidate explanatory variables
are a generic feature that must be dealt with in any characterization of
how health outcomes come about. It is noteworthy that similar challenges
arise from data analyses in Economics, Finance, Engineering, etc. Thus,
the purpose of this book is to demonstrate the effectiveness of a relatively
recently developed methodology—recursive partitioning—as a response to
this challenge. We also compare and contrast what is learned via recur-
sive partitioning with results obtained on the same data sets using more
traditional methods. This serves to highlight exactly where, and for what
kinds of questions, recursive partitioning–based strategies have a decisive
advantage over classical regression techniques.
This book is a revised edition of our first one entitled Recursive Par-
titioning in the Health Sciences. A decade has passed since we published
the first edition. This new edition reflects recent developments that are
either new or have increased in importance. It also covers areas that we
neglected before, particularly the topic of forests. The first edition focused
on two aspects. First, we presented the tree-based methods entirely within
the framework of Breiman et al. (1984). Second, the examples were from
health sciences. Although it is difficult to do justice to all alternative methods
to Breiman et al. (1984), we feel they deserve emphasis here. We also
realize that the methods presented herein have applications beyond health
sciences, and an outreach to other fields of science and societal significance
is overdue. This is the reason that we have changed the title. Lastly, we have
experienced the rapid advancement of genomics. Recursive partitioning has
become one of the most appealing analytic methods for understanding or
mining genomic data. In this revision, we demonstrate the application of
tree- and forest-based methods to understanding genomic data.
Having expanded the scope of our book, we are aiming at three broad
groups: (1) biomedical researchers, clinicians, bioinformaticists, geneticists,
psychologists, sociologists, epidemiologists, health services researchers, and
environmental policy advisers; (2) consulting statisticians who can use the
recursive partitioning technique as a guide in providing effective and in-
sightful solutions to clients’ problems; and (3) statisticians interested in
methodological and theoretical issues. The book provides an up-to-date
summary of the methodological and theoretical underpinnings of recursive
partitioning. More interestingly, it presents a host of unsolved problems
whose solutions would advance the rigorous underpinnings of statistics in
general.
From the perspective of the first two groups of readers, we demonstrate
with real applications the sequential interplay between automated production
of multiple well-fitting trees and scientific judgment leading to respecification
of variables, more refined trees subject to context-specific constraints
(on splitting and pruning, for example), and ultimately selection
of the most interpretable and useful tree(s). In this revision we include new
and substantively important examples, some related to bioinformatics
and genomics and others outside the realm of health sciences.
The sections marked with asterisks can be skipped by application-oriented
readers.
We show a more conventional regression analysis—having the same ob-
jective as the recursive partitioning analysis—side by side with the newer
methodology. In each example, we highlight the scientific insight derived
from the recursive partitioning strategy that is not readily revealed by more
conventional methods. The interfacing of automated output and scientific
judgment is illustrated with both conventional and recursive partitioning
analysis.
Theoretically oriented statisticians will find a substantial listing of chal-
lenging theoretical problems whose solutions would provide much deeper
insight than heretofore about the scope and limits of recursive partitioning
as such and multivariate adaptive splines and forests in particular.
We emphasize the development of narratives to summarize the formal
Boolean statements that define routes down the trees to terminal nodes.
Particularly with complex (by scientific necessity) trees, narrative output
facilitates understanding and interpretation of what has been provided by
automated techniques.
We illustrate the sensitivity of trees to variation in choosing misclassi-
fication cost, where the variation is a consequence of divergent views by
clinicians of the costs associated with differing mistakes in prognosis.
The book by Breiman et al. (1984) is a classical work on the subject of
recursive partitioning. In Chapter 4, we reiterate the key ideas expressed in
that book and expand our discussions in different directions on the issues
that arise from applications. Other chapters on survival trees, adaptive
splines, forests, and classification trees for multiple discrete outcomes are
new developments since the work of Breiman et al. (1984).
Heping Zhang wishes to thank his colleagues and students, Joan Buenconsejo,
Theodore Holford, James Leckman, Ju Li, Robert Makuch, Kathleen
Merikangas, Bradley Peterson, Norman Silliker, Daniel Zelterman, and
Hongyu Zhao among others, for their help with reading and commenting on
the first edition of this book. He is also grateful to many colleagues including
Drs. Michael Bracken, Dorit Carmelli, and Brian Leaderer for making
their data sets available to the first version of this book. This revision was
supported in part by NIH grants K02DA017713 and R01DA016750 to Heping
Zhang. Burton Singer thanks Tara Gruenewald (UCLA) and Jason Ku
(Princeton) for assistance in developing some of the new examples. In addition,
Drs. Xiang Chen, Kelly Cho, Yunxiao He, Yuan Jiang, and Minghui
Wang, and Ms. Donna DelBasso assisted Heping Zhang in computation
and proofreading of this revised edition.