SANJIV RANJAN DAS

DATA SCIENCE: THEORIES, MODELS, ALGORITHMS, AND ANALYTICS

S. R. DAS

Copyright © 2013, 2014, 2016 Sanjiv Ranjan Das

Published by S. R. Das

http://algo.scu.edu/~sanjivdas/

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this book except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “as is” basis, without warranties or conditions of any kind, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This printing, July 2016

THE FUTURE IS ALREADY HERE; IT’S JUST NOT VERY EVENLY DISTRIBUTED.
– WILLIAM GIBSON

THE PUBLIC IS MORE FAMILIAR WITH BAD DESIGN THAN GOOD DESIGN. IT IS, IN EFFECT, CONDITIONED TO PREFER BAD DESIGN, BECAUSE THAT IS WHAT IT LIVES WITH. THE NEW BECOMES THREATENING, THE OLD REASSURING.
– PAUL RAND

IT SEEMS THAT PERFECTION IS ATTAINED NOT WHEN THERE IS NOTHING LEFT TO ADD, BUT WHEN THERE IS NOTHING MORE TO REMOVE.
– ANTOINE DE SAINT-EXUPÉRY

. . . IN GOD WE TRUST, ALL OTHERS BRING DATA.
– WILLIAM EDWARDS DEMING

Acknowledgements: I am extremely grateful to the following friends, students, and readers (mutually non-exclusive) who offered me feedback on these chapters. I am most grateful to John Heineke for his constant feedback and continuous encouragement. All the following students made helpful suggestions on the manuscript: Sumit Agarwal, Kevin Aguilar, Sankalp Bansal, Sivan Bershan, Ali Burney, Monalisa Chati, Jian-Wei Cheng, Chris Gadek, Karl Hennig, Pochang Hsu, Justin Ishikawa, Ravi Jagannathan, Alice Yehjin Jun, Seoyoung Kim, Ram Kumar, Federico Morales, Antonio Piccolboni, Shaharyar Shaikh, Jean-Marc Soumet, Rakesh Sountharajan, Greg Tseng, Dan Wong, Jeffrey Woo.
Contents

1 The Art of Data Science
    1.1 Volume, Velocity, Variety
    1.2 Machine Learning
    1.3 Supervised and Unsupervised Learning
    1.4 Predictions and Forecasts
    1.5 Innovation and Experimentation
    1.6 The Dark Side
        1.6.1 Big Errors
        1.6.2 Privacy
    1.7 Theories, Models, Intuition, Causality, Prediction, Correlation

2 The Very Beginning: Got Math?
    2.1 Exponentials, Logarithms, and Compounding
    2.2 Normal Distribution
    2.3 Poisson Distribution
    2.4 Moments of a continuous random variable
    2.5 Combining random variables
    2.6 Vector Algebra
    2.7 Statistical Regression
    2.8 Diversification
    2.9 Matrix Calculus
    2.10 Matrix Equations

3 Open Source: Modeling in R
    3.1 System Commands
    3.2 Loading Data
    3.3 Matrices
    3.4 Descriptive Statistics
    3.5 Higher-Order Moments
    3.6 Quick Introduction to Brownian Motions with R
    3.7 Estimation using maximum-likelihood
    3.8 GARCH/ARCH Models
    3.9 Introduction to Monte Carlo
    3.10 Portfolio Computations in R
    3.11 Finding the Optimal Portfolio
    3.12 Root Solving
    3.13 Regression
    3.14 Heteroskedasticity
    3.15 Auto-regressive models
    3.16 Vector Auto-Regression
    3.17 Logit
    3.18 Probit
    3.19 Solving Non-Linear Equations
    3.20 Web-Enabling R Functions

4 MoRe: Data Handling and Other Useful Things
    4.1 Data Extraction of stocks using quantmod
    4.2 Using the merge function
    4.3 Using the apply class of functions
    4.4 Getting interest rate data from FRED
    4.5 Cross-Sectional Data (an example)
    4.6 Handling dates with lubridate
    4.7 Using the data.table package
    4.8 Another data set: Bay Area Bike Share data
    4.9 Using the plyr package family

5 Being Mean with Variance: Markowitz Optimization
    5.1 Quadratic (Markowitz) Problem
        5.1.1 Solution in R
    5.2 Solving the problem with the quadprog package
    5.3 Tracing out the Efficient Frontier
    5.4 Covariances of frontier portfolios: r_p, r_q
    5.5 Combinations
    5.6 Zero Covariance Portfolio
    5.7 Portfolio Problems with Riskless Assets
    5.8 Risk Budgeting

6 Learning from Experience: Bayes Theorem
    6.1 Introduction
    6.2 Bayes and Joint Probability Distributions
    6.3 Correlated default (conditional default)
    6.4 Continuous and More Formal Exposition
    6.5 Bayes Nets
    6.6 Bayes Rule in Marketing
    6.7 Other Applications
        6.7.1 Bayes Models in Credit Rating Transitions
        6.7.2 Accounting Fraud
        6.7.3 Bayes was a Reverend after all...

7 More than Words: Extracting Information from News
    7.1 Prologue
    7.2 Framework
    7.3 Algorithms
        7.3.1 Crawlers and Scrapers
        7.3.2 Text Pre-processing
        7.3.3 The tm package
        7.3.4 Term Frequency - Inverse Document Frequency (TF-IDF)
        7.3.5 Wordclouds
        7.3.6 Regular Expressions
    7.4 Extracting Data from Web Sources using APIs
        7.4.1 Using Twitter
        7.4.2 Using Facebook
        7.4.3 Text processing, plain and simple
        7.4.4 A Multipurpose Function to Extract Text
    7.5 Text Classification
        7.5.1 Bayes Classifier
        7.5.2 Support Vector Machines
        7.5.3 Word Count Classifiers
        7.5.4 Vector Distance Classifier
        7.5.5 Discriminant-Based Classifier
        7.5.6 Adjective-Adverb Classifier
        7.5.7 Scoring Optimism and Pessimism
        7.5.8 Voting among Classifiers
        7.5.9 Ambiguity Filters
    7.6 Metrics
        7.6.1 Confusion Matrix
        7.6.2 Precision and Recall
        7.6.3 Accuracy
        7.6.4 False Positives
        7.6.5 Sentiment Error
        7.6.6 Disagreement
        7.6.7 Correlations
        7.6.8 Aggregation Performance
        7.6.9 Phase-Lag Metrics
        7.6.10 Economic Significance
    7.7 Grading Text
    7.8 Text Summarization
    7.9 Discussion
    7.10 Appendix: Sample text from Bloomberg for summarization

8 Virulent Products: The Bass Model
    8.1 Introduction
    8.2 Historical Examples
    8.3 The Basic Idea
    8.4 Solving the Model
        8.4.1 Symbolic math in R
    8.5 Software
    8.6 Calibration
    8.7 Sales Peak
    8.8 Notes

9 Extracting Dimensions: Discriminant and Factor Analysis
    9.1 Overview
    9.2 Discriminant Analysis
        9.2.1 Notation and assumptions
        9.2.2 Discriminant Function
        9.2.3 How good is the discriminant function?
        9.2.4 Caveats
        9.2.5 Implementation using R
        9.2.6 Confusion Matrix
        9.2.7 Multiple groups
    9.3 Eigen Systems
    9.4 Factor Analysis
        9.4.1 Notation
        9.4.2 The Idea
        9.4.3 Principal Components Analysis (PCA)
        9.4.4 Application to Treasury Yield Curves
        9.4.5 Application: Risk Parity and Risk Disparity
        9.4.6 Difference between PCA and FA
        9.4.7 Factor Rotation
        9.4.8 Using the factor analysis function

10 Bidding it Up: Auctions
    10.1 Theory
        10.1.1 Overview
        10.1.2 Auction types
        10.1.3 Value Determination
        10.1.4 Bidder Types
        10.1.5 Benchmark Model (BM)
    10.2 Auction Math
        10.2.1 Optimization by bidders
        10.2.2 Example
    10.3 Treasury Auctions
        10.3.1 DPA or UPA?
    10.4 Mechanism Design
        10.4.1 Collusion
        10.4.2 Clicks (Advertising Auctions)
        10.4.3 Next Price Auctions
        10.4.4 Laddered Auction

11 Truncate and Estimate: Limited Dependent Variables
    11.1 Introduction
    11.2 Logit
    11.3 Probit
    11.4 Analysis
        11.4.1 Slopes
        11.4.2 Maximum-Likelihood Estimation (MLE)
    11.5 Multinomial Logit
    11.6 Truncated Variables
        11.6.1 Endogeneity
        11.6.2 Example: Women in the Labor Market
        11.6.3 Endogeneity – Some Theory to Wrap Up
Description: