R for Data Science Cookbook PDF

438 Pages·2016·5.839 MB·english

Preview R for Data Science Cookbook

R for Data Science Cookbook

Over 100 hands-on recipes to effectively solve real-world data problems using the most popular R packages and techniques

Yu-Wei, Chiu (David Chiu)

Copyright © 2016 Packt Publishing
First published: July 2016
Production reference: 1270716
Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78439-081-5
www.packtpub.com

Contents

Preface

Chapter 1: Functions in R
    Introduction
    Creating R functions
    Matching arguments
    Understanding environments
    Working with lexical scoping
    Understanding closure
    Performing lazy evaluation
    Creating infix operators
    Using the replacement function
    Handling errors in a function
    The debugging function

Chapter 2: Data Extracting, Transforming, and Loading
    Introduction
    Downloading open data
    Reading and writing CSV files
    Scanning text files
    Working with Excel files
    Reading data from databases
    Scraping web data
    Accessing Facebook data
    Working with twitteR

Chapter 3: Data Preprocessing and Preparation
    Introduction
    Renaming the data variable
    Converting data types
    Working with the date format
    Adding new records
    Filtering data
    Dropping data
    Merging data
    Sorting data
    Reshaping data
    Detecting missing data
    Imputing missing data

Chapter 4: Data Manipulation
    Introduction
    Enhancing a data.frame with a data.table
    Managing data with a data.table
    Performing fast aggregation with a data.table
    Merging large datasets with a data.table
    Subsetting and slicing data with dplyr
    Sampling data with dplyr
    Selecting columns with dplyr
    Chaining operations in dplyr
    Arranging rows with dplyr
    Eliminating duplicated rows with dplyr
    Adding new columns with dplyr
    Summarizing data with dplyr
    Merging data with dplyr

Chapter 5: Visualizing Data with ggplot2
    Introduction
    Creating basic plots with ggplot2
    Changing aesthetics mapping
    Introducing geometric objects
    Performing transformations
    Adjusting scales
    Faceting
    Adjusting themes
    Combining plots
    Creating maps

Chapter 6: Making Interactive Reports
    Introduction
    Creating R Markdown reports
    Learning the markdown syntax
    Embedding R code chunks
    Creating interactive graphics with ggvis
    Understanding basic syntax and grammar
    Controlling axes and legends
    Using scales
    Adding interactivity to a ggvis plot
    Creating an R Shiny document
    Publishing an R Shiny report

Chapter 7: Simulation from Probability Distributions
    Introduction
    Generating random samples
    Understanding uniform distributions
    Generating binomial random variates
    Generating Poisson random variates
    Sampling from a normal distribution
    Sampling from a chi-squared distribution
    Understanding Student's t-distribution
    Sampling from a dataset
    Simulating the stochastic process

Chapter 8: Statistical Inference in R
    Introduction
    Getting confidence intervals
    Performing Z-tests
    Performing student's T-tests
    Conducting exact binomial tests
    Performing Kolmogorov-Smirnov tests
    Working with the Pearson's chi-squared tests
    Understanding the Wilcoxon Rank Sum and Signed Rank tests
    Conducting one-way ANOVA
    Performing two-way ANOVA

Chapter 9: Rule and Pattern Mining with R
    Introduction
    Transforming data into transactions
    Displaying transactions and associations
    Mining associations with the Apriori rule
    Pruning redundant rules
    Visualizing association rules
    Mining frequent itemsets with Eclat
    Creating transactions with temporal information
    Mining frequent sequential patterns with cSPADE

Chapter 10: Time Series Mining with R
    Introduction
    Creating time series data
    Plotting a time series object
    Decomposing time series
    Smoothing time series
    Forecasting time series
    Selecting an ARIMA model
    Creating an ARIMA model
    Forecasting with an ARIMA model
    Predicting stock prices with an ARIMA model

Chapter 11: Supervised Machine Learning
    Introduction
    Fitting a linear regression model with lm
    Summarizing linear model fits
    Using linear regression to predict unknown values
    Measuring the performance of the regression model
    Performing a multiple regression analysis
    Selecting the best-fitted regression model with stepwise regression
    Applying the Gaussian model for generalized linear regression
    Performing a logistic regression analysis
    Building a classification model with recursive partitioning trees
    Visualizing a recursive partitioning tree
    Measuring model performance with a confusion matrix
    Measuring prediction performance using ROCR

Chapter 12: Unsupervised Machine Learning
    Introduction
    Clustering data with hierarchical clustering
    Cutting tree into clusters
    Clustering data with the k-means method
    Clustering data with the density-based method
    Extracting silhouette information from clustering
    Comparing clustering methods
    Recognizing digits using the density-based clustering method
    Grouping similar text documents with k-means clustering methods
    Performing dimension reduction with Principal Component Analysis (PCA)
    Determining the number of principal components using a scree plot
    Determining the number of principal components using the Kaiser method
    Visualizing multivariate data using a biplot

Index

Preface

Big data, the Internet of Things, and artificial intelligence have become the hottest technology buzzwords in recent years. Although there are many different terms used to define these technologies, the common concept is that they're all driven by data. Simply having data is not enough; being able to unlock its value is essential.
Therefore, data scientists have begun to focus on how to gain insights from raw data. Data science has become one of the most popular subjects among academic and industry groups. However, as data science is a very broad discipline, learning how to master it can be challenging. A beginner must learn how to prepare, process, aggregate, and visualize data. More advanced techniques involve machine learning, mining various data formats (text, image, and video), and, most importantly, using data to generate business value. The role of a data scientist is challenging and requires a great deal of effort.

A successful data scientist requires a useful tool to help solve day-to-day problems. In this field, the tool most widely used by data scientists is the R language, which is open source and free. As a language designed for statistical computing, it provides many data processing functions, machine learning packages, and visualization functions, allowing users to analyze data on the fly. R helps users quickly perform analysis and execute machine learning algorithms on their dataset without knowing every detail of the sophisticated mathematical models.

R for Data Science Cookbook takes a practical approach to teaching you how to put data science into practice with R. The book has 12 chapters, each of which is introduced by breaking down the topic into several simple recipes. Through the step-by-step instructions in each recipe, you can apply what you have learned from the book by using a variety of packages in R.

The first section of this book deals with how to create R functions to avoid unnecessary duplication of code. You will learn how to prepare, process, and perform sophisticated ETL operations for heterogeneous data sources with R packages. An example of data manipulation illustrates how to use the dplyr and data.table packages to process larger data structures efficiently, and a section focusing on ggplot2 covers how to create advanced figures for data exploration.
Also, you will learn how to build an interactive report using the ggvis package. This book also explains how to use data mining to discover items that are frequently purchased together. Later chapters offer insight into time series analysis on financial data, and there is detailed information on the hot topic of machine learning, including data classification, regression, clustering, and dimension reduction.

With R for Data Science Cookbook in hand, I can assure you that you will find data science has never been easier.

What this book covers

Chapter 1, Functions in R, describes how to create R functions. This chapter covers the basic composition, environment, and argument matching of an R function. Furthermore, we will look at advanced topics such as closure, functional programming, and how to properly handle errors.

Chapter 2, Data Extracting, Transforming, and Loading, teaches you how to read structured and unstructured data with R. The chapter begins by collecting data from text files. Subsequently, we will look at how to connect R to a database. Lastly, you will learn how to write a web scraper to crawl through unstructured data from a web page or social media site.

Chapter 3, Data Preprocessing and Preparation, introduces you to preparing data for analysis. In this chapter, we will cover the data preprocessing steps, such as type conversion, adding, filtering, dropping, merging, reshaping, and missing-value imputation, with some basic R functions.

Chapter 4, Data Manipulation, demonstrates how to manipulate data in an efficient and effective manner with the advanced R packages data.table and dplyr. The data.table package lets you quickly load and aggregate large amounts of data. The dplyr package provides the ability to manipulate data in SQL-like syntax.

Chapter 5, Visualizing Data with ggplot2, explores using ggplot2 to visualize data. This chapter begins by introducing the basic building blocks of ggplot2.
Next, we will cover advanced topics on how to create a more sophisticated graph with ggplot2 functions. Lastly, we will describe how to build a map with ggmap.

Chapter 6, Making Interactive Reports, reveals how to create a professional report with R. The chapter begins by discussing how to write R Markdown syntax and embed R code chunks. We will also explore how to add interactive charts to the report with ggvis. Finally, we will look at how to create and publish an R Shiny report.

Chapter 7, Simulation from Probability Distributions, begins with an emphasis on sampling data from different probability distributions. As a concrete example, we will look at how to simulate a stochastic trading process with a probability function.

Chapter 8, Statistical Inference in R, begins with a discussion on point estimation and confidence intervals. Subsequently, you will be introduced to parametric and non-parametric testing methods. Lastly, we will look at how one can use ANOVA to analyze whether the salary basis of an engineer differs by job title and location.

Chapter 9, Rule and Pattern Mining with R, exposes you to the common methods used to discover associated items and underlying frequency patterns from transaction data. In this chapter, we use a real-world blog as example data so that you can learn how to perform rule and pattern mining on real-world data.

Chapter 10, Time Series Mining with R, begins by introducing you to creating and manipulating time series from a finance dataset. Subsequently, we will learn how to forecast time series with HoltWinters and ARIMA. For a more concrete example, this chapter reveals how to predict stock prices with ARIMA.

Chapter 11, Supervised Machine Learning, teaches you how to build a model that makes predictions based on labeled training data. You will learn how to use regression models to make sense of numeric relationships and apply a fitted model to data for continuous value prediction.
For classification, you will learn how to fit data into a tree-based classifier.

Chapter 12, Unsupervised Machine Learning, introduces you to revealing the hidden structure of unlabeled data. Firstly, we will look at how to group similarly located hotels together with the clustering method. Subsequently, we will learn how to select and extract features on the economic freedom dataset with PCA.

What you need for this book

To follow the book's examples, you will need a computer with access to the Internet and the ability to install the R environment. You can download R from http://www.cran.r-project.org/. Detailed installation instructions are available in the first chapter. The examples provided in this book were coded and tested with R version 3.2.4 on Microsoft Windows. The examples are likely to work with any recent version of R installed on either Mac OS X or a Unix-like OS.

Who this book is for

R for Data Science Cookbook is intended for those who are already familiar with the basic operation of R, but want to learn how to efficiently and effectively analyze real-world data problems using practical R packages.
