
R: Recipes for Analysis Visualization and Machine Learning PDF

934 Pages·2016·11.702 MB·english


R: Recipes for Analysis, Visualization and Machine Learning

Get savvy with R language and actualize projects aimed at analysis, visualization and machine learning

A course in three modules

BIRMINGHAM - MUMBAI

R: Recipes for Analysis, Visualization and Machine Learning

Copyright © 2016 Packt Publishing

Published on: November 2016

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78728-959-8

www.packtpub.com

Preface

Since the release of version 1.0 in 2000, R's popularity as an environment for statistical computing, data analytics, and graphing has grown exponentially. People who have been using spreadsheets and need to perform things that spreadsheet packages cannot readily do, or need to handle larger data volumes than what a spreadsheet program can comfortably handle, are looking to R. Analogously, people using powerful commercial analytics packages are also intrigued by this free and powerful option. As a result, a large number of people are now looking to quickly get things done in R. Being an extensible system, R's functionality is divided across numerous packages with each one exposing large numbers of functions. Even experienced users cannot expect to remember all the details off the top of their head.

Our ability to generate data has improved tremendously with the advent of technology. The data generated has become more complex with the passage of time. The complexity in data forces us to develop new tools and methods to analyze it, interpret it, and communicate with the data. Data visualization empowers us with the necessary skills required to convey the meaning of underlying data. Data visualization is a remarkable intersection of data, science, and art, and this makes it hard to define visualization in a formal way; a simple Google search will prove me right. The Merriam-Webster dictionary defines visualization as "formation of mental visual images".

Big data has become a popular buzzword across many industries. An increasing number of people have been exposed to the term and are looking at how to leverage big data in their own businesses, to improve sales and profitability. However, collecting, aggregating, and visualizing data is just one part of the equation. Being able to extract useful information from data is another task, and much more challenging. Traditionally, most researchers perform statistical analysis using historical samples of data. The main downside of this process is that conclusions drawn from statistical analysis are limited. In fact, researchers usually struggle to uncover hidden patterns and unknown correlations from target data. Aside from applying statistical analysis, machine learning has emerged as an alternative. This process yields a more accurate predictive model from the data fed into a learning algorithm. Through machine learning, the analysis of business operations and processes is not limited to human-scale thinking. Machine-scale analysis enables businesses to discover hidden values in big data.

What this learning path covers

Module 1, R Data Analysis Cookbook: This module, aimed at users who are already exposed to the fundamentals of R, provides ready recipes to perform many important data analytics tasks. Instead of having to search the Web or delve into numerous books when faced with a specific task, people can find the appropriate recipe and get going in a matter of minutes.

Module 2, R Data Visualization Cookbook: In this module you will learn how to generate basic visualizations, understand the limitations and advantages of using certain visualizations, develop interactive visualizations and applications, understand various data exploratory functions in R, and finally learn ways of presenting the data to our audience.
This module is aimed at beginners and intermediate users of R who would like to go a step further in using their complex data to convey a very convincing story to their audience.

Module 3, Machine Learning with R Cookbook: This module covers how to perform statistical analysis with machine learning and how to assess the resulting models, both of which are covered in detail later in the book. The module includes content on learning how to integrate R and Hadoop to create a big data analysis platform. The detailed illustrations provide all the information required to start applying machine learning to individual projects.

What you need for this learning path

Module 1: We have tested all the code in this module for R versions 3.0.2 (Frisbee Sailing) and 3.1.0 (Spring Dance). When you install or load some of the packages, you may get a warning message to the effect that the code was compiled for a different version, but this will not impact any of the code in this module.

Module 2: You need to download R to generate the visualizations. You can download and install R using the CRAN website available at http://cran.r-project.org/. All the recipes were written using RStudio. RStudio is an integrated development environment (IDE) for R and can be downloaded from http://www.rstudio.com/products/rstudio/. Many of the visualizations are created using R packages and they are discussed in their respective recipes. In a few of the recipes, I have introduced users to some other open source platforms such as ScapeToad, ArcGIS, and Mapbox. Their installation procedures are outlined in their respective recipes.

Module 3: To follow the course's examples, you will need a computer with access to the Internet and the ability to install the R environment. You can download R from http://www.cran.r-project.org/. Detailed installation instructions are available in the first chapter.
The examples provided in this book were coded and tested with R Version 3.1.2 on a computer with Microsoft Windows installed on it. These examples should also work with any recent version of R installed on either Mac OS X or a Unix-like OS.

Who this learning path is for

This Learning Path is ideal for those who are already exposed to R, but have not yet used it extensively. This Learning Path will set you up with an extensive insight into professional techniques for analysis, visualization and machine learning with R. Regardless of your level of experience, this course also covers the basics of using R, and it is written with new and intermediate R users in mind.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this course: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the course's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a course, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the course in the Search box.
5. Select the course for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this course from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account. Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

• WinRAR / 7-Zip for Windows
• Zipeg / iZip / UnRarX for Mac
• 7-Zip / PeaZip for Linux

The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/R-Recipes-for-Analysis-Visualization-and-Machine-Learning. We also have other code bundles from our rich catalog of books, videos and courses available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Contents

Module 1

Chapter 1: A Simple Guide to R
  Installing packages and getting help in R
  Data types in R
  Special values in R
  Matrices in R
  Editing a matrix in R
  Data frames in R
  Editing a data frame in R
  Importing data in R
  Exporting data in R
  Writing a function in R
  Writing if else statements in R
  Basic loops in R
  Nested loops in R
  The apply, lapply, sapply, and tapply functions
  Using par to beautify a plot in R
  Saving plots

Chapter 2: Practical Machine Learning with R
  Introduction
  Downloading and installing R
  Downloading and installing RStudio
  Installing and loading packages
  Reading and writing data
  Using R to manipulate data
  Applying basic statistics
  Visualizing data
  Getting a dataset for machine learning

Chapter 3: Acquire and Prepare the Ingredients – Your Data
  Introduction
  Reading data from CSV files
  Reading XML data
  Reading JSON data
  Reading data from fixed-width formatted files
  Reading data from R files and R libraries
  Removing cases with missing values
  Replacing missing values with the mean
  Removing duplicate cases
  Rescaling a variable to [0,1]
  Normalizing or standardizing data in a data frame
  Binning numerical data
  Creating dummies for categorical variables

Chapter 4: What's in There? – Exploratory Data Analysis
  Introduction
  Creating standard data summaries
  Extracting a subset of a dataset
  Splitting a dataset
  Creating random data partitions
  Generating standard plots such as histograms, boxplots, and scatterplots
  Generating multiple plots on a grid
  Selecting a graphics device
  Creating plots with the lattice package
  Creating plots with the ggplot2 package
  Creating charts that facilitate comparisons
  Creating charts that help visualize a possible causality
  Creating multivariate plots

Chapter 5: Where Does It Belong? – Classification
  Introduction
  Generating error/classification-confusion matrices
  Generating ROC charts
  Building, plotting, and evaluating – classification trees
  Using random forest models for classification
  Classifying using Support Vector Machine
  Classifying using the Naïve Bayes approach
  Classifying using the KNN approach
  Using neural networks for classification
  Classifying using linear discriminant function analysis
  Classifying using logistic regression
  Using AdaBoost to combine classification tree models

Chapter 6: Give Me a Number – Regression
  Introduction
  Computing the root mean squared error
  Building KNN models for regression
  Performing linear regression
  Performing variable selection in linear regression
  Building regression trees
  Building random forest models for regression
  Using neural networks for regression
  Performing k-fold cross-validation
  Performing leave-one-out-cross-validation to limit overfitting

Chapter 7: Can You Simplify That? – Data Reduction Techniques
  Introduction
  Performing cluster analysis using K-means clustering
  Performing cluster analysis using hierarchical clustering
  Reducing dimensionality with principal component analysis

Chapter 8: Lessons from History – Time Series Analysis
  Introduction
  Creating and examining date objects
  Operating on date objects
  Performing preliminary analyses on time series data
  Using time series objects
  Decomposing time series
  Filtering time series data
  Smoothing and forecasting using the Holt-Winters method
  Building an automated ARIMA model

Chapter 9: It's All About Your Connections – Social Network Analysis
  Introduction
  Downloading social network data using public APIs
  Creating adjacency matrices and edge lists
  Plotting social network data
  Computing important network metrics

Chapter: Put Your Best Foot Forward – Document and Present Your Analysis
  Introduction
  Generating reports of your data analysis with R Markdown and knitr
