
Practical Data Science Cookbook: Data pre-processing, analysis and visualization using R and Python PDF

389 Pages·2017·1.63 MB·English


Contents

1: Preparing Your Data Science Environment
- Understanding the data science pipeline
- Installing R on Windows, Mac OS X, and Linux
- Installing libraries in R and RStudio
- Installing Python on Linux and Mac OS X
- Installing Python on Windows
- Installing the Python data stack on Mac OS X and Linux
- Installing extra Python packages
- Installing and using virtualenv

2: Driving Visual Analysis with Automobile Data with R
- Introduction
- Acquiring automobile fuel efficiency data
- Preparing R for your first project
- Importing automobile fuel efficiency data into R
- Exploring and describing fuel efficiency data
- Analyzing automobile fuel efficiency over time
- Investigating the makes and models of automobiles

3: Creating Application-Oriented Analyses Using Tax Data and Python
- Introduction
- Preparing for the analysis of top incomes
- Importing and exploring the world's top incomes dataset
- Analyzing and visualizing the top income data of the US
- Furthering the analysis of the top income groups of the US
- Reporting with Jinja2
- Repeating the analysis in R

4: Modeling Stock Market Data
- Introduction
- Acquiring stock market data
- Summarizing the data
- Cleaning and exploring the data
- Generating relative valuations
- Screening stocks and analyzing historical prices

5: Visually Exploring Employment Data
- Introduction
- Preparing for analysis
- Importing employment data into R
- Exploring the employment data
- Obtaining and merging additional data
- Adding geographical information
- Extracting state- and county-level wage and employment information
- Visualizing geographical distributions of pay
- Exploring where the jobs are, by industry
- Animating maps for a geospatial time series
- Benchmarking performance for some common tasks

6: Driving Visual Analyses with Automobile Data
- Introduction
- Getting started with IPython
- Exploring Jupyter Notebook
- Preparing to analyze automobile fuel efficiencies
- Exploring and describing fuel efficiency data with Python
- Analyzing automobile fuel efficiency over time with Python
- Investigating the makes and models of automobiles with Python

7: Working with Social Graphs
- Introduction
- Preparing to work with social networks in Python
- Importing networks
- Exploring subgraphs within a heroic network
- Finding strong ties
- Finding key players
- Exploring the characteristics of entire networks
- Clustering and community detection in social networks
- Visualizing graphs
- Social networks in R

8: Recommending Movies at Scale (Python)
- Introduction
- Modeling preference expressions
- Understanding the data
- Ingesting the movie review data
- Finding the highest-scoring movies
- Improving the movie-rating system
- Measuring the distance between users in the preference space
- Computing the correlation between users
- Finding the best critic for a user
- Predicting movie ratings for users
- Collaboratively filtering item by item
- Building a non-negative matrix factorization model
- Loading the entire dataset into memory
- Dumping the SVD-based model to disk
- Training the SVD-based model
- Testing the SVD-based model

9: Harvesting and Geolocating Twitter Data (Python)
- Introduction
- Creating a Twitter application
- Understanding the Twitter API v1.1
- Determining your Twitter followers and friends
- Pulling Twitter user profiles
- Making requests without running afoul of Twitter's rate limits
- Storing JSON data to disk
- Setting up MongoDB for storing Twitter data
- Storing user profiles in MongoDB using PyMongo
- Exploring the geographic information available in profiles
- Plotting geospatial data in Python

10: Forecasting New Zealand Overseas Visitors
- Introduction
- The ts object
- Visualizing time series data
- Simple linear regression models
- ACF and PACF
- ARIMA models
- Accuracy measurements
- Fitting seasonal ARIMA models

11: German Credit Data Analysis
- Introduction
- Simple data transformations
- Visualizing categorical data
- Discriminant analysis
- Dividing the data and the ROC
- Fitting the logistic regression model
- Decision trees and rules
- Decision tree for German data

Chapter 1. Preparing Your Data Science Environment

A traditional cookbook contains culinary recipes of interest to the authors, and helps readers expand their repertoire of foods to prepare. Many might believe that the end product of a recipe is the dish itself, and one can read this book in much the same way. Every chapter guides the reader through the application of the stages of the data science pipeline to different datasets with various goals. Also, just as in cooking, the final product can simply be the analysis applied to a particular dataset. We hope that you will take a broader view, however. Data scientists learn by doing, ensuring that every iteration and hypothesis improves the practitioner's knowledge base. By taking multiple datasets through the data science pipeline using two different programming languages (R and Python), we hope that you will start to abstract out the analysis patterns, see the bigger picture, and achieve a deeper understanding of this rather ambiguous field of data science. We also want you to know that, unlike culinary recipes, data science recipes are ambiguous.
When chefs begin a particular dish, they have a very clear picture in mind of what the finished product will look like. For data scientists, the situation is often different. One does not always know what the dataset in question will look like, or what might or might not be possible given the amount of time and resources. Recipes are essentially a way to dig into the data and get started on the path towards asking the right questions to complete the best dish possible.

If you are from a statistical or mathematical background, the modeling techniques on display might not excite you per se. Pay attention to how many of the recipes overcome practical issues in the data science pipeline, such as loading large datasets, working with scalable tools, and adapting known techniques to create data applications, interactive graphics, and web pages rather than reports and papers. We hope that these aspects will enhance your appreciation and understanding of data science and help you apply good data science to your domain.

Practicing data scientists require a great number and diversity of tools to get the job done. Data practitioners scrape, clean, visualize, model, and perform a million different tasks with a wide array of tools. If you ask most people working with data, you will learn that the foremost component in this toolset is the language used to perform the analysis and modeling of the data. Identifying the best programming language for a particular task is akin to asking which world religion is correct, just with slightly less bloodshed. In this book, we split our attention between two highly regarded, yet very different, languages used for data analysis, R and Python, and leave it up to you to make your own decision as to which language you prefer.
We will help you by dropping hints along the way as to the suitability of each language for various tasks, and we'll compare and contrast similar analyses done on the same dataset with each language.

When you learn new concepts and techniques, there is always the question of depth versus breadth. Given a fixed amount of time and effort, should you work towards achieving moderate proficiency in both R and Python, or should you go all in on a single language? From our professional experiences, we strongly recommend that you aim to master one language and have an awareness of the other. Does that mean skipping chapters on a particular language? Absolutely not! However, as you go through this book, pick one language and dig deeper, looking to develop not only conversational ability, but also fluency.

To prepare for this chapter, ensure that you have sufficient bandwidth to download up to several gigabytes of software in a reasonable amount of time.

Understanding the data science pipeline

Before we start installing any software, we need to understand the repeatable set of steps that we will use for data analysis throughout the book.

How to do it...

The following are the five key steps for data analysis:

1. Acquisition: The first step in the pipeline is to acquire the data from a variety of sources, including relational databases, NoSQL and document stores, web scraping, distributed databases such as HDFS on a Hadoop platform, RESTful APIs, flat files, and, hopefully this is not the case, PDFs.
2. Exploration and understanding: The second step is to come to an understanding of the data that you will use and how it was collected; this often requires significant exploration.
3. Munging, wrangling, and manipulation: This step is often the single most time-consuming and important step in the pipeline. Data is almost never in the needed form for the desired analysis.
4. Analysis and modeling: This is the fun part where the data scientist gets to explore the statistical relationships between the variables in the data and pulls out his or her bag of machine learning tricks to cluster, categorize, or classify the data and create predictive models to see into the future.
5. Communicating and operationalizing: At the end of the pipeline, we need to give the data back in a compelling form and structure, sometimes to ourselves to inform the next iteration, and sometimes to a completely different audience. The data products produced can be a simple one-off report or a scalable web product that will be used interactively by millions.

How it works...

Although the preceding list is a numbered list, don't assume that every project will strictly adhere to this exact linear sequence. In fact, agile data scientists know that this process is highly iterative. Often, data exploration informs how the data must be cleaned, which then enables more exploration and deeper understanding. Which of these steps comes first often depends on your initial familiarity with the data. If you work with the systems producing and capturing the data every day, the initial data exploration and understanding stage might be quite short, unless something is wrong with the production system. Conversely, if you are handed a dataset with no background details, the data exploration and understanding stage might require quite some time (and numerous non-programming steps, such as talking with the system developers). The following diagram shows the data science pipeline:

As you have probably heard or read by now, data munging or wrangling can often consume 80 percent or more of project time and resources. In a perfect world, we would always be given perfect data. Unfortunately, this is never the case, and the number of data problems that you will see is virtually infinite.
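The five stages above can be sketched as plain functions handing data to one another. This is a minimal illustration, not code from the book: the function names, the tiny inline CSV, and the "model" (a simple mean) are all stand-ins chosen to show the shape of the pipeline, including how the munging stage quietly absorbs bad rows.

```python
# A minimal sketch of the five-stage pipeline, using hypothetical names;
# real projects would swap in databases, APIs, and plotting libraries.
import csv
import io
import statistics

def acquire(raw_text):
    """Acquisition: read rows from a flat-file source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def explore(rows):
    """Exploration: summarize what we actually received."""
    return {"n_rows": len(rows), "columns": sorted(rows[0]) if rows else []}

def wrangle(rows):
    """Munging: coerce types and drop rows with unusable values."""
    clean = []
    for row in rows:
        try:
            clean.append({"city": row["city"], "mpg": float(row["mpg"])})
        except (KeyError, ValueError):
            continue  # a real pipeline would log and investigate these
    return clean

def model(rows):
    """Analysis: a stand-in 'model' -- the mean of one variable."""
    return statistics.mean(r["mpg"] for r in rows)

def communicate(summary, result):
    """Communication: render a one-line report for the audience."""
    return f"{summary['n_rows']} rows in; mean mpg of clean rows = {result:.1f}"

raw = "city,mpg\nDetroit,21.4\nTokyo,not_a_number\nMunich,26.0\n"
rows = acquire(raw)
report = communicate(explore(rows), model(wrangle(rows)))
print(report)  # -> 3 rows in; mean mpg of clean rows = 23.7
```

Note how the stages feed each other but remain separate: when exploration later reveals that `not_a_number` rows matter, only `wrangle` needs to change, which is one reason the pipeline is iterated rather than run once.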
Sometimes, a data dictionary might change or might be missing, so understanding the field values is simply not possible. Some data fields may contain garbage or values that have been switched with another field. An update to the web app that passed testing might cause a little bug that prevents data from being collected, causing a few hundred thousand rows to go missing. If it can go wrong, it probably did at some point; the data you analyze is the sum total of all of these mistakes.

The last step, communication and operationalization, is absolutely critical, but with intricacies that are not often fully appreciated. Note that the last step in the pipeline is not entitled data visualization and does not revolve around simply creating something pretty and/or compelling, which is a complex topic in itself. Instead, data visualizations will become a piece of a larger story that we will weave together from and with data. Some go even further and say that the end result is always an argument, as there is no point in undertaking all of this effort unless you are trying to persuade someone or some group of a particular point.

Installing R on Windows, Mac OS X, and Linux

Straight from the R project, R is a language and environment for statistical computing and graphics, and it has emerged as one of the de facto languages for statistical and data analysis. For us, it will be the default tool that we use in the first half of the book.

Getting ready

Make sure you have a good broadband connection to the Internet, as you may have to download up to 200 MB of software.

How to do it...

Installing R is easy; use the following steps:

1. Go to the Comprehensive R Archive Network (CRAN) and download the latest release of R for your particular operating system:
   - For Windows, go to http://cran.r-project.org/bin/windows/base/
   - For Linux, go to http://cran.us.r-project.org/bin/linux/
   - For Mac OS X, go to http://cran.us.r-project.org/bin/macosx/
   As of June 2017, the latest release of R is Version 3.4.0 from April 2017.
2. Once downloaded, follow the excellent instructions provided by CRAN to install the software on your respective platform. For both Windows and Mac, just double-click on the downloaded install packages.
3. With R installed, go ahead and launch it. You should see a window similar to that shown in the following screenshot:
4. An important modification of CRAN is available at
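After installing, one quick sanity check is to confirm which R version ended up on your PATH. The helper below is my own sketch, not from the book: it assumes the standard `R --version` banner, whose first line looks like `R version 3.4.0 (2017-04-21) -- ...`, and returns `None` when no `R` executable is found.

```python
# A small, hypothetical helper for checking the installed R version.
import re
import shutil
import subprocess

def parse_r_version(banner):
    """Extract the X.Y.Z version number from R's --version banner."""
    match = re.search(r"R version (\d+\.\d+\.\d+)", banner)
    return match.group(1) if match else None

def installed_r_version():
    """Return the installed R version, or None if R is not on the PATH."""
    if shutil.which("R") is None:
        return None
    out = subprocess.run(["R", "--version"], capture_output=True, text=True)
    return parse_r_version(out.stdout)

print(parse_r_version("R version 3.4.0 (2017-04-21) -- 'You Stupid Darkness'"))
# -> 3.4.0
```

Keeping the parsing separate from the subprocess call means the version check can be reused later, for example to warn when a recipe needs a newer R than the one installed.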

Description:
Over 85 recipes to help you complete real-world data science projects in R and Python.

About This Book:
- Tackle every step in the data science pipeline and use it to acquire, clean, analyze, and visualize your data
- Get beyond the theory and implement real-world projects in data science using R and Python

