Institute of Transportation Engineers, Purdue Student Chapter engineering.purdue.edu/ITE Estimation of Statistical Models in R Updated July 21, 2010 • Josh Mills Contents Introduction .............................................................................................................................................. 2 Importing Data ......................................................................................................................................... 3 Importing Data from Other Programs ......................................................................................................... 6 Ordinary Least Squares (OLS) Regression ................................................................................................... 9 Model Diagnostics ................................................................................................................................... 10 Two-Sample and K-Sample Tests (Univariate and Multivariate) ................................................................. 15 Heteroskedastic t Regression .................................................................................................................... 26 Fitting Data to Distributions .................................................................................................................... 27 Robust Regression ................................................................................................................................... 33 Poisson and Negative Binomial Models ..................................................................................................... 34 Zero-Inflated and Hurdle Poisson and Negative Binomial Models ................................................................ 37 Tobit Models, Sample Selection Models, and Truncated Models ................................................................. 39 Quantile Regression ................................................................................................................................. 43 Regression Scale Models .......................................................................................................................... 44 Partial Least Squares ............................................................................................................................... 46 Multivariate Adaptive Regression Splines .................................................................................................. 49 Regression Trees ..................................................................................................................................... 52 Nonparametric and Semiparametric Estimation ......................................................................................... 58 Nested Logit Models ................................................................................................................................ 61 Ordered Models ...................................................................................................................................... 65 Systems of Simultaneous Equations .......................................................................................................... 70 Seemingly Unrelated Regression ............................................................................................................... 74 Beta Regression ...................................................................................................................................... 78 Panel Data Models .................................................................................................................................. 79 Multilevel Models .................................................................................................................................... 85 Generalized Estimating Equations ............................................................................................................. 94 Time-Series Models ................................................................................................................................. 95 Using Google Maps in R .......................................................................................................................... 98 Using US Census Data with R .................................................................................................................. 99 Spatial Econometrics and Testing for Spatial Dependence ......................................................................... 103 Performing Analysis on Multiple Combinations of Files ............................................................................. 117 Estimation of Statistical Models Using R Page 1 of 145 Institute of Transportation Engineers, Purdue Student Chapter engineering.purdue.edu/ITE Travel Demand Modeling in R ................................................................................................................ 123 Visualization of Networks and Graphs ...................................................................................................... 132 Large Data Sets: Virtual Memory and R .................................................................................................. 137 Miscellaneous Topics .............................................................................................................................. 140 Index of Packages Used .......................................................................................................................... 144 Introduction The R statistical package can best be thought of, in terms of user interface, as falling some- where in-between SAS and Limdep. Like SAS, R accomplishes different tasks using different procedures. Instead of the procedures being self-contained (like PROC REG in SAS), com- mands are issued one at a time, like Limdep. However, unlike SAS and Limdep, R can easily be extended using user-generated add-on packages, or libraries. Because R is open-source (and thus free), it is relatively easy to create add-on packages and quickly distribute them to a large, well-established community of users. Because of this, R can perform procedures and estimate statistical models that have not yet been implemented in purchaseable software such as SAS, Limdep, SPSS, and Stata. One primary example is spatial econometrics; while Stata contains capabilities for estimating certain types of spatial econometric models, the ma- jority of the most recent types of models (spatial panel models, spatial Probit, etc.) are only available in R. In short, R can give users access to cutting-edge methods and models easily. The available packages from the central R website, the Comprehensive R Archive Network (CRAN, http://cran.r-project.org), can accomplish s variety of tasks, such as statistical analy- sis (which is to be expected from a statistical software package), machine learning, importing and manipulating GIS data (R's GIS capabilities are far stronger than SAS), parallel compu- ting, qualitative research analysis, text mining, graph/network analysis, optimization, solving ordinary differential equations, image analysis, 3D graphics (through OpenGL), interactive graphics, and Bayesian inference. R can interface with a large number of other programs and platforms, such as Excel (through an add-in accessible through Excel), Google Maps, C++, WinBUGS, gretl, SAS, and even audio devices, relational databases (such as Oracle and SQL), the US Census database, LP_SOLVE, openNLP, PowerPoint, Word, LaTeX, the Web (through CURL), CPLEX, and Java. Despite this wide array of abilities, it is relatively easy to use a single data set with a wide variety of varieties, as will be demonstrated here. One important note - unlike SAS, all commands in R are case-sensitive. While package names and commands in the text of the handout are denoted using all caps, the exact capitali- zations should be used as shown in the syntax examples. Estimation of Statistical Models Using R Page 2 of 145 Institute of Transportation Engineers, Purdue Student Chapter engineering.purdue.edu/ITE Importing Data Packages Used: base, foreign, utils, prettyR Use READ.TABLE or READ.CSV to import a data set (the CSV, or comma-separated val- ues, format tends be more reliable than other formats such as Excel and Access, particularly in R, since there are no built-in functions for importing data from Excel or Access). In what may be surprising for a statistical software package, R can import data from URLs (in other words, a Web-based text file or data file can be imported through R simply by using a URL as the file name) and even directly from ZIP archives (which is accomplished using the unz command). READ.TABLE can import practically any delimited file (online or offline), but one must explicitly specify the record layout of the file, which can be tricky. In addition, importing files created by other statistical software packages, such as SAS, may lead to strangely specified variables. For example, SAS uses a period (".") to represent a missing value. When importing a file, R may misread this value and, instead of classifying a numerical variable as numerical (class NUMERIC), R would instead classify that variable as a factor (class FACTOR). Factors are useful for manipulating categorical data (for example, a variable in which a value of 1 indicates a rural minor arterial, 2 indicates a rural principal ar- terial, 3 indicates an urban minor arterial, etc.), but most data should be interpreted strictly as numeric values. To ensure this does not happen, use NA.STRINGS = "." (for data ex- ported by SAS). Commands with similar syntax requirements that can import data files from Minitab, SAS, dBase, and Stata can be found in the FOREIGN package. For most commands, such as those involving importing data or estimating models, the re- sults should be stored in separate variables. Unlike C++, one does not have to declare the "class" of a variable when it is created; R will automatically determine the appropriate class. Data imported using the READ.CSV command is stored in a data frame (class DA- TA.FRAME). Syntax trt<-read.csv("F:/lots_of_data/trtout.csv",header=F, na.strings = ".") If the first line of your data set has column names, set HEADER equal to T (for TRUE). In this example, the imported data will be stored in a data frame named TRT. Since the results of the command are being stored in the variable, one must refer to or access the variable in order to view the data or learn some characteristic about the data. To view the data, simply type the name of the variable and hit return. Syntax trt Output >trt V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 1 14 2 600 7 44 1 0 5 1 0 3 5 78 0 NA 2 13 2 400 7 44 0 0 1 0 1 0 1 85 1 NA Estimation of Statistical Models Using R Page 3 of 145 Institute of Transportation Engineers, Purdue Student Chapter engineering.purdue.edu/ITE 3 11 2 540 7 44 1 0 2 1 0 0 2 81 0 NA 4 12 2 540 7 44 1 0 5 0 1 0 4 83 1 NA 5 NA 9 2 370 7 44 1 0 5 0 1 0 4 83 1 6 10 2 350 7 44 1 0 3 1 1 0 3 81 0 NA 7 10 2 350 7 44 1 0 3 0 1 0 1 80 0 NA 8 12 1 460 14 48 0 0 2 0 1 0 1 86 0 NA 9 NA 8 2 400 7 44 1 0 3 0 1 0 1 84 1 Note that because variable names were not provided, R has named each variable V1, V2, V3, and so on. Missing values are denoted by "NA." To verify that TRT is a data frame, use the CLASS command. Syntax class(trt) Output [1] "data.frame" It is easy to view only a subset of the rows or columns in the data frame. Use the following syntax to view the first three rows and the first four columns. Syntax trt[1:3,1:4] Output > trt[1:3,1:4] V1 V2 V3 V4 1 14 2 600 7 2 13 2 400 7 3 11 2 540 7 To refer to a particular variable, use the $ operator. In this example, a histogram for variable V9 will be displayed using the HIST command. Syntax hist(trt$V9) Output Estimation of Statistical Models Using R Page 4 of 145 Institute of Transportation Engineers, Purdue Student Chapter engineering.purdue.edu/ITE R is particularly strong with graphics. The generated histogram can be easily pasted into a document or saved as an image file simply by right-clicking on the plot. Descriptive statistics can be viewed using the DESCRIBE and FREQ commands from the PRETTYR library. Syntax library(prettyR) describe(trt) freq(trt) Note that a library/package is loaded using the LIBRARY command. Output > describe(trt) Description of trt Numeric mean median var sd valid.n V1 13 13 7.4 2.72 111 V2 3.483 2 7.638 2.764 151 V3 364.9 385 7.116e+04 266.8 151 V4 135.9 10 5.643e+04 237.5 151 V5 37.93 43 414.5 20.36 151 V6 12.87 1 447 21.14 151 V7 0.3377 0 0.2918 0.5402 151 V8 1.987 2 1.786 1.337 151 V9 1.152 1 1.623 1.274 151 V10 0.6026 1 0.2411 0.491 151 V11 0.3311 0 0.3563 0.5969 151 V12 1.596 1 1.576 1.255 151 V13 60.45 81 1241 35.23 151 V14 22.43 1 1346 36.68 151 V15 0.525 1 0.2558 0.5057 40 frwy 0.06623 0 0.06225 0.2495 151 art 0.1589 0 0.1346 0.3668 151 frwytl 0.3642 0 2.006 1.416 151 rurtl 3.947 7 15.89 3.986 151 arttl 2.371 0 30.78 5.548 151 cage 25.55 5 1241 35.23 151 youngm 1 1 0 0 151 Estimation of Statistical Models Using R Page 5 of 145 Institute of Transportation Engineers, Purdue Student Chapter engineering.purdue.edu/ITE speed 23.09 22 35.24 5.937 111 > freq(trt) Frequencies for V1 10 11 12 13 14 15 16 17 18 19 20 21 NA 32 6 12 14 21 9 5 4 3 2 1 2 40 % 21.2 4 7.9 9.3 13.9 6 3.3 2.6 2 1.3 0.7 1.3 26.5 %!NA 28.8 5.4 10.8 12.6 18.9 8.1 4.5 3.6 2.7 1.8 0.9 1.8 Frequencies for V2 1 2 3 6 7 8 9 NA 24 77 10 2 9 18 11 0 % 15.9 51 6.6 1.3 6 11.9 7.3 0 %!NA 15.9 51 6.6 1.3 6 11.9 7.3 … Variables can be added to a data frame by using the assignment operators (<--, -->, or =) and the $ operators. Some examples are listed below. Note the syntax of the IF-THEN statements. The "pipe" operator (|) is used for OR, and the double-ampersand (&&) is used for AND. Syntax trt$frwy<-ifelse(trt$V2==3,1,0) trt$art<-ifelse(trt$V2==1,1,0) trt$frwytl<-ifelse(trt$V2==3,trt$V4,0) trt$rurtl<-ifelse(trt$V2==2,trt$V4,0) trt$arttl<-ifelse(trt$V2==1,trt$V4,0) trt$youngm<-ifelse(trt$V9<3 && trt$V9 ==1,1,0) trt$speed<-(trt$V5/10)/(trt$V1/60) y<-trt$speed X<-trt[,1:ncol(trt)-1] The best way to learn about the syntax of a command is by using the HELP or ? command. Help files are provided in HTML format, making it easy to browse through the different commands (the index for the package being used can be found at the bottom of the help page. For example, the HELP command can be used to access the help file for the LM command in the STATS package, which is used to estimate models using ordinary least squares (OLS). Syntax help(lm) ?lm Importing Data from Other Programs Packages Used: foreign, xlsReadWrite, xlsx A number of add-on packages are available for importing data written by programs such as Microsoft Excel, SAS, SPSS, and Stata. R can also write to most (but not all) of these for- Estimation of Statistical Models Using R Page 6 of 145 Institute of Transportation Engineers, Purdue Student Chapter engineering.purdue.edu/ITE mats. Excel files can be read and written using the XLSREADWRITE package. At the time of this writing, however, this package is not yet available for the 64-bit Windows version. The syntax is fairly straightforward, similar to using READ.CSV. Syntax ind<-read.csv("F:/Folder of Data/hoosierc.csv",header=T,na.strings=”.”) # Reading in Excel 2003 files using xlsReadWrite # Note that xlsReadWrite does not yet have a 64-bit version write.xls(ind,"F:/lots_of_data/hoosierexcel.xls",colNames=TRUE) indexcel<-read.xls("F:/lots_of_data/hoosierexcel.xls",colNames=TRUE) The XLSREADWRITE package can manipulate Excel 97-2003 files. The XLSX package can read and write files in the newer Excel 2007 (xlsx) format. The package, however, requires the installation of the Java Development Kit (http://java.sun.com/javase/downloads/index.jsp). This package can also directly manipulate Excel 2007 workbooks. The XLSX package uses Java-based commands. The documentation details only a subset of the operations that can be performed on Excel workbooks. Knowledge of Java is needed to understand the undocu- mented features. Data is read using the READ.XLSX command. This can be quite time-consuming for large files (> 100 columns). This is also true for the related WRITE.XLSX command. Syntax write.xlsx(ind,"F:/lots_of_data/hoosierexcel07.xlsx",sheetName="Indiana") # The write.xlsx command is quite time consuming indxlsx<-read.xlsx("F:/lots_of_data/hoosierc.xlsx",sheetName="hoosierc") # The read.xlsx command is also time-consuming for large files (> 1 MB) A suite of commands are available to perform commonly used tasks on Excel 2007 work- sheets, such as merging cells, changing print settings, and even adding pictures. A few of these are demonstrated below. Use help(loadWorkbook) to learn details about how these commands work. Syntax # The xlsx package also has Java-based commands to directly work with Excel spreadsheets # In this example, an xlsx file will be loaded and a picture will be added # The workbook will be set to print as landscape with 0.5 inch header and footer margins # The workbook will then be saved as a new file indwb <- loadWorkbook("F:/lots_of_data/hoosierc.xlsx") #.jmethods(indwb) # Java methods that can be used - must know Java! indsh <- getSheets(indwb) indsh1<-(indsh[['hoosierc']]) Estimation of Statistical Models Using R Page 7 of 145 Institute of Transportation Engineers, Purdue Student Chapter engineering.purdue.edu/ITE indprint<-printSetup(indsh1, footerMargin=0.5,headerMargin=0.5, land- scape=T,paperSize="LETTER_PAPERSIZE") addPicture("F:/lots_of_data/Indiana_state_seal.png", indsh1, star- tRow=10,startColumn=10) saveWorkbook(indwb,"F:/lots_of_data/hoosierp.xlsx") SAS, SPSS, Stata, and dBase (DBF) files can be read and written using the FOREIGN pack- age. The DBF and SPSS formats are the most restrictive; both formats only allow column (variable) names of a maximum of eight characters. The syntax of this set of commands is straightforward. Syntax # Reading and writing SAS, SPSS, Stata, DBF formats using package "foreign" # DBF - dBase, FoxPro - Used with shapefiles # The DBF format is rather old - column names are limited to 8 characters # In this example, a CSV file will be read then exported out as a DBF # read.dbf will be used to import this file # R will give warning messages indicating truncation had occurred county<-read.dbf("F:/lots_of_data/co18_d00.dbf") write.dbf(ind,"F:/lots_of_data/hoosierdb.dbf") indf<-read.dbf("F:/lots_of_data/hoosierdb.dbf") # Reading in SPSS files inds<-read.spss("F:/lots_of_data/hoosierspss.sav",to.data.frame=TRUE) # Reading in and writing Stata (.dta) files # Version 11 can be obtained for a low price from Krannert, 7th floor write.dta(ind,"F:/lots_of_data/hoosierstata.dta",version=10) indstata<-read.dta("F:/lots_of_data/hoosierstata.dta") The WRITE.FOREIGN command can create program-specific importing code for SAS, Stata, and SPSS. In these examples, separate code files are generated for SAS (with extension .sas), Stata (as do-files, with extension .do), and for SPSS. Syntax # Using write.foreign to write data sets for SAS, Stata, and SPSS # This command will also write importing code for each respective program # SPSS export is picky - column names must be 8 characters or less foreign:::writeForeignSAS # View the underlying code write.foreign(ind,"F:/lots_of_data/hoosier1.csv", "F:/lots_of_data/hoosier11.sas", package=c("SAS"),dataname="hoosierR") write.foreign(ind,"F:/lots_of_data/hoosier2.csv", "F:/lots_of_data/hoosier22.do", Estimation of Statistical Models Using R Page 8 of 145 Institute of Transportation Engineers, Purdue Student Chapter engineering.purdue.edu/ITE package=c("Stata")) write.foreign(ind,"F:/lots_of_data/hoosier3.csv", "F:/lots_of_data/hoosier33.txt", package=c("SPSS")) Ordinary Least Squares (OLS) Regression Packages Used: stats The LM command is used to estimate models by OLS. It is often convenient to store formu- las in a variable of class FORMULA to create more readable code. Syntax f1<-speed~1+frwy+art+cage+V6+V3 Instead of directly issuing the command, it is more beneficial to store the results of the LM regression in a variable and view the results using the SUMMARY command. Note that ob- servations can be (temporarily) deleted using the NA.ACTION=NA.OMIT statement. Syntax f1<-speed~1+frwy+art+cage+V6+V3 ff1<-lm(f1,data=trt,na.action=na.omit) summary(ff1) Output > ff1<-lm(f1,data=trt,na.action=na.omit) > summary(ff1) Call: lm(formula = f1, data = trt, na.action = na.omit) Residuals: Min 1Q Median 3Q Max -10.3842 -2.6012 -0.4189 3.1523 14.5701 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 24.502845 1.927874 12.710 < 2e-16 *** frwy 9.371764 1.843557 5.084 1.62e-06 *** art 3.486268 1.292920 2.696 0.00817 ** cage 0.082596 0.110675 0.746 0.45716 V6 1.062831 0.989525 1.074 0.28525 V3 -0.008268 0.003288 -2.515 0.01343 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.885 on 105 degrees of freedom (40 observations deleted due to missingness) Multiple R-squared: 0.3537, Adjusted R-squared: 0.3229 F-statistic: 11.49 on 5 and 105 DF, p-value: 7.351e-09 Estimation of Statistical Models Using R Page 9 of 145 Institute of Transportation Engineers, Purdue Student Chapter engineering.purdue.edu/ITE The next section will describe a series of diagnostic tests that can be used to check the valid- ity of the estimated model. Model Diagnostics Packages Used: car, lmtest, nortest, sandwich Compared to SAS and Limdep, a wider variety of statistical tests are available for testing the validity of estimated statistical models. A number of these tests can also be applied to gene- ralized linear models (for Poisson and negative binomial models, for example). The CAR and LMTEST packages are used to conduct these tests. The SANDWICH package can be used to correct for heteroskedasticity. Four tests can be used to test for heteroskedasticity: the Score test, the Breusch-Pagan test, the Goldfeld-Quandt test, and the Harrison-McCable tests. For all of these tests, the null hypothesis is that the residuals of the estimated model exhibit uniform variance (no heteroskedasticity). Note the use of # to denote comments. Syntax ncv.test(ff1) # Score test for heteroskedasticity, car bptest(f1,data=trt) # Breusch-Pagan test for heteroskedasticity, lmtest gqtest(f1, data=trt) # Goldfeld-Quandt test for heteroskedasticity, lmtest > hmctest(f1, data=trt) > ncv.test(ff1) # Score test for heteroskedasticity, car Non-constant Variance Score Test Variance formula: ~ fitted.values Chisquare = 1.654428 Df = 1 p = 0.1983573 > bptest(f1,data=trt) studentized Breusch-Pagan test data: f1 BP = 5.3949, df = 5, p-value = 0.3696 > gqtest(f1, data=trt) Goldfeld-Quandt test data: f1 GQ = 1.0827, df1 = 50, df2 = 49, p-value = 0.3908 > hmctest(f1, data=trt) Harrison-McCabe test data: f1 HMC = 0.482, p-value = 0.403 Estimation of Statistical Models Using R Page 10 of 145
Description: