Java for Data Science
Examine the techniques and Java tools
supporting the growing field of data science
Richard M. Reese
Jennifer L. Reese
BIRMINGHAM - MUMBAI
Java for Data Science
Copyright © 2017 Packt Publishing
First published: January 2017
Production reference: 1050117
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78528-011-5
www.packtpub.com
Contents
Preface
Chapter 1: Getting Started with Data Science
Problems solved using data science
Understanding the data science problem-solving approach
Using Java to support data science
Acquiring data for an application
The importance and process of cleaning data
Visualizing data to enhance understanding
The use of statistical methods in data science
Machine learning applied to data science
Using neural networks in data science
Deep learning approaches
Performing text analysis
Visual and audio analysis
Improving application performance using parallel techniques
Assembling the pieces
Summary
Chapter 2: Data Acquisition
Understanding the data formats used in data science applications
Overview of CSV data
Overview of spreadsheets
Overview of databases
Overview of PDF files
Overview of JSON
Overview of XML
Overview of streaming data
Overview of audio/video/images in Java
Data acquisition techniques
Using the HttpURLConnection class
Web crawlers in Java
Creating your own web crawler
Using the crawler4j web crawler
Web scraping in Java
Using API calls to access common social media sites
Using OAuth to authenticate users
Handling Twitter
Handling Wikipedia
Handling Flickr
Handling YouTube
Searching by keyword
Summary
Chapter 3: Data Cleaning
Handling data formats
Handling CSV data
Handling spreadsheets
Handling Excel spreadsheets
Handling PDF files
Handling JSON
Using JSON streaming API
Using the JSON tree API
The nitty gritty of cleaning text
Using Java tokenizers to extract words
Java core tokenizers
Third-party tokenizers and libraries
Transforming data into a usable form
Simple text cleaning
Removing stop words
Finding words in text
Finding and replacing text
Data imputation
Subsetting data
Sorting text
Data validation
Validating data types
Validating dates
Validating e-mail addresses
Validating ZIP codes
Validating names
Cleaning images
Changing the contrast of an image
Smoothing an image
Brightening an image
Resizing an image
Converting images to different formats
Summary
Chapter 4: Data Visualization
Understanding plots and graphs
Visual analysis goals
Creating index charts
Creating bar charts
Using country as the category
Using decade as the category
Creating stacked graphs
Creating pie charts
Creating scatter charts
Creating histograms
Creating donut charts
Creating bubble charts
Summary
Chapter 5: Statistical Data Analysis Techniques
Working with mean, mode, and median
Calculating the mean
Using simple Java techniques to find mean
Using Java 8 techniques to find mean
Using Google Guava to find mean
Using Apache Commons to find mean
Calculating the median
Using simple Java techniques to find median
Using Apache Commons to find the median
Calculating the mode
Using ArrayLists to find multiple modes
Using a HashMap to find multiple modes
Using Apache Commons to find multiple modes
Standard deviation
Sample size determination
Hypothesis testing
Regression analysis
Using simple linear regression
Using multiple regression
Summary
Chapter 6: Machine Learning
Supervised learning techniques
Decision trees
Decision tree types
Decision tree libraries
Using a decision tree with a book dataset
Testing the book decision tree
Support vector machines
Using an SVM for camping data
Testing individual instances
Bayesian networks
Using a Bayesian network
Unsupervised machine learning
Association rule learning
Using association rule learning to find buying relationships
Reinforcement learning
Summary
Chapter 7: Neural Networks
Training a neural network
Getting started with neural network architectures
Understanding static neural networks
A basic Java example
Understanding dynamic neural networks
Multilayer perceptron networks
Building the model
Evaluating the model
Predicting other values
Saving and retrieving the model
Learning vector quantization
Self-Organizing Maps
Using a SOM
Displaying the SOM results
Additional network architectures and algorithms
The k-Nearest Neighbors algorithm
Instantaneously trained networks
Spiking neural networks
Cascading neural networks
Holographic associative memory
Backpropagation and neural networks
Summary
Chapter 8: Deep Learning
Deeplearning4j architecture
Acquiring and manipulating data
Reading in a CSV file
Configuring and building a model
Using hyperparameters in ND4J
Instantiating the network model
Training a model
Testing a model
Deep learning and regression analysis
Preparing the data
Setting up the class
Reading and preparing the data
Building the model
Evaluating the model
Restricted Boltzmann Machines
Reconstruction in an RBM
Configuring an RBM
Deep autoencoders
Building an autoencoder in DL4J
Configuring the network
Building and training the network
Saving and retrieving a network
Specialized autoencoders
Convolutional networks
Building the model
Evaluating the model
Recurrent Neural Networks
Summary
Chapter 9: Text Analysis
Implementing named entity recognition
Using OpenNLP to perform NER
Identifying location entities
Classifying text
Word2Vec and Doc2Vec
Classifying text by labels
Classifying text by similarity
Understanding tagging and POS
Using OpenNLP to identify POS
Understanding POS tags
Extracting relationships from sentences
Using OpenNLP to extract relationships
Sentiment analysis
Downloading and extracting the Word2Vec model
Building our model and classifying text
Summary
Chapter 10: Visual and Audio Analysis
Text-to-speech
Using FreeTTS
Getting information about voices
Gathering voice information
Understanding speech recognition
Using CMUSphinx to convert speech to text
Obtaining more detail about the words
Extracting text from an image
Using Tess4J to extract text
Identifying faces
Using OpenCV to detect faces
Classifying visual data
Creating a Neuroph Studio project for classifying visual images
Training the model
Summary
Chapter 11: Mathematical and Parallel Techniques for Data Analysis
Implementing basic matrix operations
Using GPUs with Deeplearning4j
Using map-reduce
Using Apache's Hadoop to perform map-reduce
Writing the map method
Writing the reduce method
Creating and executing a new Hadoop job
Various mathematical libraries
Using the jblas API
Using the Apache Commons Math API
Using the ND4J API
Using OpenCL
Using Aparapi
Creating an Aparapi application
Using Aparapi for matrix multiplication
Using Java 8 streams
Understanding Java 8 lambda expressions and streams
Using Java 8 to perform matrix multiplication
Using Java 8 to perform map-reduce
Summary
Chapter 12: Bringing It All Together
Defining the purpose and scope of our application
Understanding the application's architecture
Data acquisition using Twitter
Understanding the TweetHandler class
Extracting data for a sentiment analysis model
Building the sentiment model
Processing the JSON input
Cleaning data to improve our results
Removing stop words
Performing sentiment analysis
Analyzing the results
Other optional enhancements
Summary
Index
Preface
In this book, we examine Java-based approaches to the field of data science. Data science is
a broad topic and includes such subtopics as data mining, statistical analysis, audio and
video analysis, and text analysis. A number of Java APIs provide support for these topics.
The ability to apply these specific techniques allows for the creation of new, innovative
applications able to handle the vast amounts of data available for analysis.
This book takes an expansive yet cursory approach to various aspects of data science. A
brief introduction to the field is presented in the first chapter. Subsequent chapters cover
significant aspects of data science, such as data cleaning and the application of neural
networks. The last chapter combines topics discussed throughout the book to create a
comprehensive data science application.
What this book covers
Chapter 1, Getting Started with Data Science, provides an introduction to the technologies
covered by the book. A brief explanation of each technology is given, followed by a short
overview and demonstration of the support Java provides.
Chapter 2, Data Acquisition, demonstrates how to acquire data from a number of sources,
including Twitter, Wikipedia, and YouTube. The first step of a data science application is to
acquire data.
Chapter 3, Data Cleaning, explains that once data has been acquired, it needs to be cleaned.
This can involve such activities as removing stop words, validating the data, and converting
it to a usable form.
Chapter 4, Data Visualization, shows that while numerical processing is a critical step in
many data science tasks, people often prefer visual depictions of the results of analysis. This
chapter demonstrates various Java approaches to this task.
Chapter 5, Statistical Data Analysis Techniques, reviews basic statistical techniques, including
regression analysis, and demonstrates how various Java APIs provide statistical support.
Statistical analysis is key to many data analysis tasks.
Chapter 6, Machine Learning, covers several machine learning algorithms, including
decision trees and support vector machines. The abundance of available data provides an
opportunity to apply machine learning techniques.