
Java for Data Science PDF

370 pages · 2017 · 5.036 MB · English

Preview Java for Data Science

Java for Data Science
Examine the techniques and Java tools supporting the growing field of data science

Richard M. Reese
Jennifer L. Reese

BIRMINGHAM - MUMBAI

Java for Data Science
Copyright © 2017 Packt Publishing
First published: January 2017
Production reference: 1050117
Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78528-011-5
www.packtpub.com

Contents

Preface

Chapter 1: Getting Started with Data Science
  Problems solved using data science
  Understanding the data science problem-solving approach
  Using Java to support data science
  Acquiring data for an application
  The importance and process of cleaning data
  Visualizing data to enhance understanding
  The use of statistical methods in data science
  Machine learning applied to data science
  Using neural networks in data science
  Deep learning approaches
  Performing text analysis
  Visual and audio analysis
  Improving application performance using parallel techniques
  Assembling the pieces
  Summary

Chapter 2: Data Acquisition
  Understanding the data formats used in data science applications
  Overview of CSV data
  Overview of spreadsheets
  Overview of databases
  Overview of PDF files
  Overview of JSON
  Overview of XML
  Overview of streaming data
  Overview of audio/video/images in Java
  Data acquisition techniques
  Using the HttpUrlConnection class
  Web crawlers in Java
  Creating your own web crawler
  Using the crawler4j web crawler
  Web scraping in Java
  Using API calls to access common social media sites
  Using OAuth to authenticate users
  Handling Twitter
  Handling Wikipedia
  Handling Flickr
  Handling YouTube
  Searching by keyword
  Summary

Chapter 3: Data Cleaning
  Handling data formats
  Handling CSV data
  Handling spreadsheets
  Handling Excel spreadsheets
  Handling PDF files
  Handling JSON
  Using JSON streaming API
  Using the JSON tree API
  The nitty gritty of cleaning text
  Using Java tokenizers to extract words
  Java core tokenizers
  Third-party tokenizers and libraries
  Transforming data into a usable form
  Simple text cleaning
  Removing stop words
  Finding words in text
  Finding and replacing text
  Data imputation
  Subsetting data
  Sorting text
  Data validation
  Validating data types
  Validating dates
  Validating e-mail addresses
  Validating ZIP codes
  Validating names
  Cleaning images
  Changing the contrast of an image
  Smoothing an image
  Brightening an image
  Resizing an image
  Converting images to different formats
  Summary

Chapter 4: Data Visualization
  Understanding plots and graphs
  Visual analysis goals
  Creating index charts
  Creating bar charts
  Using country as the category
  Using decade as the category
  Creating stacked graphs
  Creating pie charts
  Creating scatter charts
  Creating histograms
  Creating donut charts
  Creating bubble charts
  Summary

Chapter 5: Statistical Data Analysis Techniques
  Working with mean, mode, and median
  Calculating the mean
  Using simple Java techniques to find mean
  Using Java 8 techniques to find mean
  Using Google Guava to find mean
  Using Apache Commons to find mean
  Calculating the median
  Using simple Java techniques to find median
  Using Apache Commons to find the median
  Calculating the mode
  Using ArrayLists to find multiple modes
  Using a HashMap to find multiple modes
  Using Apache Commons to find multiple modes
  Standard deviation
  Sample size determination
  Hypothesis testing
  Regression analysis
  Using simple linear regression
  Using multiple regression
  Summary

Chapter 6: Machine Learning
  Supervised learning techniques
  Decision trees
  Decision tree types
  Decision tree libraries
  Using a decision tree with a book dataset
  Testing the book decision tree
  Support vector machines
  Using an SVM for camping data
  Testing individual instances
  Bayesian networks
  Using a Bayesian network
  Unsupervised machine learning
  Association rule learning
  Using association rule learning to find buying relationships
  Reinforcement learning
  Summary

Chapter 7: Neural Networks
  Training a neural network
  Getting started with neural network architectures
  Understanding static neural networks
  A basic Java example
  Understanding dynamic neural networks
  Multilayer perceptron networks
  Building the model
  Evaluating the model
  Predicting other values
  Saving and retrieving the model
  Learning vector quantization
  Self-Organizing Maps
  Using a SOM
  Displaying the SOM results
  Additional network architectures and algorithms
  The k-Nearest Neighbors algorithm
  Instantaneously trained networks
  Spiking neural networks
  Cascading neural networks
  Holographic associative memory
  Backpropagation and neural networks
  Summary

Chapter 8: Deep Learning
  Deeplearning4j architecture
  Acquiring and manipulating data
  Reading in a CSV file
  Configuring and building a model
  Using hyperparameters in ND4J
  Instantiating the network model
  Training a model
  Testing a model
  Deep learning and regression analysis
  Preparing the data
  Setting up the class
  Reading and preparing the data
  Building the model
  Evaluating the model
  Restricted Boltzmann Machines
  Reconstruction in an RBM
  Configuring an RBM
  Deep autoencoders
  Building an autoencoder in DL4J
  Configuring the network
  Building and training the network
  Saving and retrieving a network
  Specialized autoencoders
  Convolutional networks
  Building the model
  Evaluating the model
  Recurrent Neural Networks
  Summary

Chapter 9: Text Analysis
  Implementing named entity recognition
  Using OpenNLP to perform NER
  Identifying location entities
  Classifying text
  Word2Vec and Doc2Vec
  Classifying text by labels
  Classifying text by similarity
  Understanding tagging and POS
  Using OpenNLP to identify POS
  Understanding POS tags
  Extracting relationships from sentences
  Using OpenNLP to extract relationships
  Sentiment analysis
  Downloading and extracting the Word2Vec model
  Building our model and classifying text
  Summary

Chapter 10: Visual and Audio Analysis
  Text-to-speech
  Using FreeTTS
  Getting information about voices
  Gathering voice information
  Understanding speech recognition
  Using CMUSphinx to convert speech to text
  Obtaining more detail about the words
  Extracting text from an image
  Using Tess4j to extract text
  Identifying faces
  Using OpenCV to detect faces
  Classifying visual data
  Creating a Neuroph Studio project for classifying visual images
  Training the model
  Summary

Chapter 11: Mathematical and Parallel Techniques for Data Analysis
  Implementing basic matrix operations
  Using GPUs with DeepLearning4j
  Using map-reduce
  Using Apache's Hadoop to perform map-reduce
  Writing the map method
  Writing the reduce method
  Creating and executing a new Hadoop job
  Various mathematical libraries
  Using the jblas API
  Using the Apache Commons math API
  Using the ND4J API
  Using OpenCL
  Using Aparapi
  Creating an Aparapi application
  Using Aparapi for matrix multiplication
  Using Java 8 streams
  Understanding Java 8 lambda expressions and streams
  Using Java 8 to perform matrix multiplication
  Using Java 8 to perform map-reduce
  Summary

Chapter 12: Bringing It All Together
  Defining the purpose and scope of our application
  Understanding the application's architecture
  Data acquisition using Twitter
  Understanding the TweetHandler class
  Extracting data for a sentiment analysis model
  Building the sentiment model
  Processing the JSON input
  Cleaning data to improve our results
  Removing stop words
  Performing sentiment analysis
  Analysing the results
  Other optional enhancements
  Summary

Index

Preface

In this book, we examine Java-based approaches to the field of data science. Data science is a broad topic and includes such subtopics as data mining, statistical analysis, audio and video analysis, and text analysis. A number of Java APIs provide support for these topics. The ability to apply these specific techniques allows for the creation of new, innovative applications able to handle the vast amounts of data available for analysis.

This book takes an expansive yet cursory approach to various aspects of data science. A brief introduction to the field is presented in the first chapter. Subsequent chapters cover significant aspects of data science, such as data cleaning and the application of neural networks. The last chapter combines topics discussed throughout the book to create a comprehensive data science application.

What this book covers

Chapter 1, Getting Started with Data Science, provides an introduction to the technologies covered by the book. A brief explanation of each technology is given, followed by a short overview and demonstration of the support Java provides.

Chapter 2, Data Acquisition, demonstrates how to acquire data from a number of sources, including Twitter, Wikipedia, and YouTube. The first step of a data science application is to acquire data.

Chapter 3, Data Cleaning, explains that once data has been acquired, it needs to be cleaned. This can involve such activities as removing stop words, validating the data, and data conversion.

Chapter 4, Data Visualization, shows that while numerical processing is a critical step in many data science tasks, people often prefer visual depictions of the results of analysis. This chapter demonstrates various Java approaches to this task.

Chapter 5, Statistical Data Analysis Techniques, reviews basic statistical techniques, including regression analysis, and demonstrates how various Java APIs provide statistical support. Statistical analysis is key to many data analysis tasks.

Chapter 6, Machine Learning, covers several machine learning algorithms, including decision trees and support vector machines. The abundance of available data provides an opportunity to apply machine learning techniques.
