Computer Science

Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series

Foundations of Predictive Analytics

Drawing on the authors' two decades of experience in applied modeling and data mining, Foundations of Predictive Analytics presents the fundamental background required for analyzing data and building models for many practical applications, such as consumer behavior modeling, risk and marketing analytics, and other areas. It also discusses a variety of practical topics that are frequently missing from similar texts.

The book begins with the statistical and linear algebra/matrix foundation of modeling methods, from distributions to cumulant and copula functions to Cornish–Fisher expansion and other useful but hard-to-find statistical techniques. It then describes common and unusual linear methods as well as popular nonlinear modeling approaches, including additive models, trees, support vector machines, fuzzy systems, clustering, naïve Bayes, and neural nets. The authors go on to cover methodologies used in time series and forecasting, such as ARIMA, GARCH, and survival analysis. They also present a range of optimization techniques and explore several special topics, such as Dempster–Shafer theory.

An in-depth collection of the most important fundamental material on predictive analytics, this self-contained book provides the necessary information for understanding various techniques for exploratory data analysis and modeling. It explains the algorithmic details behind each technique (including underlying assumptions and mathematical formulations) and shows how to prepare and encode data, select variables, use model goodness measures, normalize odds, and perform reject inference.

James Wu
Stephen Coggeshall

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.

AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.

PUBLISHED TITLES
UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX DECOMPOSITIONS
David Skillicorn
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY, AND APPLICATIONS
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT
David Skillicorn
MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO CONCEPTS AND THEORY
Zhongfei Zhang and Ruofei Zhang
NEXT GENERATION OF DATA MINING
Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
DATA MINING FOR DESIGN AND MARKETING
Yukio Ohsawa and Katsutoshi Yada
THE TOP TEN ALGORITHMS IN DATA MINING
Xindong Wu and Vipin Kumar
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, SECOND EDITION
Harvey J. Miller and Jiawei Han
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
Ashok N. Srivastava and Mehran Sahami
BIOLOGICAL DATA MINING
Jake Y. Chen and Stefano Lonardi
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
Vagelis Hristidis
TEMPORAL DATA MINING
Theophano Mitsa
RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS, AND APPLICATIONS
Bo Long, Zhongfei Zhang, and Philip S. Yu
KNOWLEDGE DISCOVERY FROM DATA STREAMS
João Gama
STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION
George Fernandez
INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS AND TECHNIQUES
Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu
HANDBOOK OF EDUCATIONAL DATA MINING
Cristóbal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan S.J.d. Baker
DATA MINING WITH R: LEARNING WITH CASE STUDIES
Luís Torgo
MINING SOFTWARE SPECIFICATIONS: METHODOLOGIES AND APPLICATIONS
David Lo, Siau-Cheng Khoo, Jiawei Han, and Chao Liu
DATA CLUSTERING IN C++: AN OBJECT-ORIENTED APPROACH
Guojun Gan
MUSIC DATA MINING
Tao Li, Mitsunori Ogihara, and George Tzanetakis
MACHINE LEARNING AND KNOWLEDGE DISCOVERY FOR ENGINEERING SYSTEMS HEALTH MANAGEMENT
Ashok N. Srivastava and Jiawei Han
SPECTRAL FEATURE SELECTION FOR DATA MINING
Zheng Alan Zhao and Huan Liu
ADVANCES IN MACHINE LEARNING AND DATA MINING FOR ASTRONOMY
Michael J. Way, Jeffrey D. Scargle, Kamal M. Ali, and Ashok N. Srivastava
FOUNDATIONS OF PREDICTIVE ANALYTICS
James Wu and Stephen Coggeshall

Foundations of Predictive Analytics

James Wu
Stephen Coggeshall

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2012 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20120119

International Standard Book Number-13: 978-1-4398-6948-2 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users.
For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

List of Figures
List of Tables
Preface

1 Introduction
  1.1 What Is a Model?
  1.2 What Is a Statistical Model?
  1.3 The Modeling Process
  1.4 Modeling Pitfalls
  1.5 Characteristics of Good Modelers
  1.6 The Future of Predictive Analytics

2 Properties of Statistical Distributions
  2.1 Fundamental Distributions
    2.1.1 Uniform Distribution
    2.1.2 Details of the Normal (Gaussian) Distribution
    2.1.3 Lognormal Distribution
    2.1.4 Γ Distribution
    2.1.5 Chi-Squared Distribution
    2.1.6 Non-Central Chi-Squared Distribution
    2.1.7 Student's t-Distribution
    2.1.8 Multivariate t-Distribution
    2.1.9 F-Distribution
    2.1.10 Binomial Distribution
    2.1.11 Poisson Distribution
    2.1.12 Exponential Distribution
    2.1.13 Geometric Distribution
    2.1.14 Hypergeometric Distribution
    2.1.15 Negative Binomial Distribution
    2.1.16 Inverse Gaussian (IG) Distribution
    2.1.17 Normal Inverse Gaussian (NIG) Distribution
  2.2 Central Limit Theorem
  2.3 Estimate of Mean, Variance, Skewness, and Kurtosis from Sample Data
  2.4 Estimate of the Standard Deviation of the Sample Mean
  2.5 (Pseudo) Random Number Generators
    2.5.1 Mersenne Twister Pseudorandom Number Generator
    2.5.2 Box–Muller Transform for Generating a Normal Distribution
  2.6 Transformation of a Distribution Function
  2.7 Distribution of a Function of Random Variables
    2.7.1 Z = X + Y
    2.7.2 Z = X · Y
    2.7.3 (Z1, Z2, ..., Zn) = (X1, X2, ..., Xn) · Y
    2.7.4 Z = X / Y
    2.7.5 Z = max(X, Y)
    2.7.6 Z = min(X, Y)
  2.8 Moment Generating Function
    2.8.1 Moment Generating Function of Binomial Distribution
    2.8.2 Moment Generating Function of Normal Distribution
    2.8.3 Moment Generating Function of the Γ Distribution
    2.8.4 Moment Generating Function of Chi-Square Distribution
    2.8.5 Moment Generating Function of the Poisson Distribution
  2.9 Cumulant Generating Function
  2.10 Characteristic Function
    2.10.1 Relationship between Cumulative Function and Characteristic Function
    2.10.2 Characteristic Function of Normal Distribution
    2.10.3 Characteristic Function of Γ Distribution
  2.11 Chebyshev's Inequality
  2.12 Markov's Inequality
  2.13 Gram–Charlier Series
  2.14 Edgeworth Expansion
  2.15 Cornish–Fisher Expansion
    2.15.1 Lagrange Inversion Theorem
    2.15.2 Cornish–Fisher Expansion
  2.16 Copula Functions
    2.16.1 Gaussian Copula
    2.16.2 t-Copula
    2.16.3 Archimedean Copula

3 Important Matrix Relationships
  3.1 Pseudo-Inverse of a Matrix
  3.2 A Lemma of Matrix Inversion
  3.3 Identity for a Matrix Determinant
  3.4 Inversion of Partitioned Matrix
  3.5 Determinant of Partitioned Matrix
  3.6 Matrix Sweep and Partial Correlation
  3.7 Singular Value Decomposition (SVD)
  3.8 Diagonalization of a Matrix
  3.9 Spectral Decomposition of a Positive Semi-Definite Matrix
  3.10 Normalization in Vector Space
  3.11 Conjugate Decomposition of a Symmetric Definite Matrix
  3.12 Cholesky Decomposition
  3.13 Cauchy–Schwartz Inequality
  3.14 Relationship of Correlation among Three Variables

4 Linear Modeling and Regression
  4.1 Properties of Maximum Likelihood Estimators
    4.1.1 Likelihood Ratio Test
    4.1.2 Wald Test
    4.1.3 Lagrange Multiplier Statistic
  4.2 Linear Regression
    4.2.1 Ordinary Least Squares (OLS) Regression
    4.2.2 Interpretation of the Coefficients of Linear Regression
    4.2.3 Regression on Weighted Data
    4.2.4 Incrementally Updating a Regression Model with Additional Data
    4.2.5 Partitioned Regression
    4.2.6 How Does the Regression Change When Adding One More Variable?
    4.2.7 Linearly Restricted Least Squares Regression
    4.2.8 Significance of the Correlation Coefficient
    4.2.9 Partial Correlation
    4.2.10 Ridge Regression
  4.3 Fisher's Linear Discriminant Analysis
  4.4 Principal Component Regression (PCR)
  4.5 Factor Analysis
  4.6 Partial Least Squares Regression (PLSR)
  4.7 Generalized Linear Model (GLM)
  4.8 Logistic Regression: Binary
  4.9 Logistic Regression: Multiple Nominal
  4.10 Logistic Regression: Proportional Multiple Ordinal
  4.11 Fisher Scoring Method for Logistic Regression
  4.12 Tobit Model: A Censored Regression Model
    4.12.1 Some Properties of the Normal Distribution