
Data Analysis Made Easy PDF

392 Pages · 2018 · 20.028 MB · English

Table of Contents

Cover
Preface

Part I: Introductory Background

1 What Can We Do With Data?
  1.1 Big Data and Data Science
  1.2 Big Data Architectures
  1.3 Small Data
  1.4 What is Data?
  1.5 A Short Taxonomy of Data Analytics
  1.6 Examples of Data Use
  1.7 A Project on Data Analytics
  1.8 How this Book is Organized
  1.9 Who Should Read this Book

Part II: Getting Insights from Data

2 Descriptive Statistics
  2.1 Scale Types
  2.2 Descriptive Univariate Analysis
  2.3 Descriptive Bivariate Analysis
  2.4 Final Remarks
  2.5 Exercises

3 Descriptive Multivariate Analysis
  3.1 Multivariate Frequencies
  3.2 Multivariate Data Visualization
  3.3 Multivariate Statistics
  3.4 Infographics and Word Clouds
  3.5 Final Remarks
  3.6 Exercises

4 Data Quality and Preprocessing
  4.1 Data Quality
  4.2 Converting to a Different Scale Type
  4.3 Converting to a Different Scale
  4.4 Data Transformation
  4.5 Dimensionality Reduction
  4.6 Final Remarks
  4.7 Exercises

5 Clustering
  5.1 Distance Measures
  5.2 Clustering Validation
  5.3 Clustering Techniques
  5.4 Final Remarks
  5.5 Exercises

6 Frequent Pattern Mining
  6.1 Frequent Itemsets
  6.2 Association Rules
  6.3 Behind Support and Confidence
  6.4 Other Types of Pattern
  6.5 Final Remarks
  6.6 Exercises

7 Cheat Sheet and Project on Descriptive Analytics
  7.1 Cheat Sheet of Descriptive Analytics
  7.2 Project on Descriptive Analytics

Part III: Predicting the Unknown

8 Regression
  8.1 Predictive Performance Estimation
  8.2 Finding the Parameters of the Model
  8.3 Technique and Model Selection
  8.4 Final Remarks
  8.5 Exercises

9 Classification
  9.1 Binary Classification
  9.2 Predictive Performance Measures for Classification
  9.3 Distance-based Learning Algorithms
  9.4 Probabilistic Classification Algorithms
  9.5 Final Remarks
  9.6 Exercises

10 Additional Predictive Methods
  10.1 Search-based Algorithms
  10.2 Optimization-based Algorithms
  10.3 Final Remarks
  10.4 Exercises

11 Advanced Predictive Topics
  11.1 Ensemble Learning
  11.2 Algorithm Bias
  11.3 Non-binary Classification Tasks
  11.4 Advanced Data Preparation Techniques for Prediction
  11.5 Description and Prediction with Supervised Interpretable Techniques
  11.6 Exercises

12 Cheat Sheet and Project on Predictive Analytics
  12.1 Cheat Sheet on Predictive Analytics
  12.2 Project on Predictive Analytics

Part IV: Popular Data Analytics Applications

13 Applications for Text, Web and Social Media
  13.1 Working with Texts
  13.2 Recommender Systems
  13.3 Social Network Analysis
  13.4 Exercises

Appendix A: A Comprehensive Description of the CRISP-DM Methodology
  A.1 Business Understanding
  A.2 Data Understanding
  A.3 Data Preparation
  A.4 Modeling
  A.5 Evaluation
  A.6 Deployment

References
Index
End User License Agreement

List of Tables

Chapter 01
  Table 1.1 Data set of our private contact list.
  Table 1.2 Family relations between contacts.

Chapter 02
  Table 2.1 Data set of our private list of contacts with weight and height.
  Table 2.2 Univariate absolute and relative frequencies for the “company” attribute.
  Table 2.3 Univariate absolute and relative frequencies for height.
  Table 2.4 Univariate plots.
  Table 2.5 Location univariate statistics for weight.
  Table 2.6 Central tendency statistics according to the type of scale.
  Table 2.7 Dispersion univariate statistics for the “weight” attribute.
  Table 2.8 The rank values for the attributes “weight” and “height”.

Chapter 03
  Table 3.1 Data set of our private list of contacts with weight and height.
  Table 3.2 Location univariate statistics for quantitative attributes.
  Table 3.3 Dispersion univariate statistics for quantitative attributes.
  Table 3.4 Covariance matrix for quantitative attributes.
  Table 3.5 Pearson correlation matrix for quantitative attributes.

Chapter 04
  Table 4.1 Filling of missing values.
  Table 4.2 Removal of redundant objects.
  Table 4.3 Data set of our private list of contacts with weight and height.
  Table 4.4 Food preferences of our colleagues.
  Table 4.5 Conversion from the nominal scale to the relative scale.
  Table 4.6 Conversion from the nominal scale to binary values.
  Table 4.7 Conversion from the nominal scale to the relative scale.
  Table 4.8 Conversion from the nominal scale to the relative scale.
  Table 4.9 Conversion from the nominal scale to the relative scale.
  Table 4.10 Conversion from the ordinal scale to the relative or absolute scale.
  Table 4.11 Conversion from the ordinal scale to the relative scale.
  Table 4.12 Euclidean distances of ages expressed in years.
  Table 4.13 Euclidean distance with age expressed in decades.
  Table 4.14 Normalization using min–max rescaling.
  Table 4.15 Normalization using standardization.
  Table 4.16 Euclidean distance with normalized values.
  Table 4.17 How much each friend earns as income and spends on dinners per year.
  Table 4.18 Correlation between each predictive attribute and the target attribute.
  Table 4.19 Predictive performance of a classifier for each predictive attribute.
  Table 4.20 Filling of missing values.

Chapter 05
  Table 5.1 Simple social network data set.
  Table 5.2 Example of bag-of-words vectors.
  Table 5.3 The clusters to which each friend belongs according to Figure 5.7a–f.
  Table 5.4 Normalized centroids of the example data set.
  Table 5.5 Advantages and disadvantages of k-means.
  Table 5.6 Advantages and disadvantages of DBSCAN.
  Table 5.7 First iteration of agglomerative hierarchical clustering.
  Table 5.8 Second iteration of agglomerative hierarchical clustering.
  Table 5.9 Advantages and disadvantages of agglomerative hierarchical clustering.

Chapter 06
  Table 6.1 Data about the preferred cuisine of contacts.
  Table 6.2 Transactional data created from Table 6.1.
  Table 6.3 Combinatorial explosion with growing size of .
  Table 6.4 Transactional data from Table 6.2 in vertical format corresponding to the first iteration ( ) of the Eclat algorithm.
  Table 6.5 An example of a two-way contingency table for itemsets and .
  Table 6.6 Two-way contingency tables for itemsets and for two groups A and B of data and the whole data set (combined groups A and B). The presence or absence of itemsets in transactions is marked by Yes and No, respectively.
  Table 6.7 Example of a sequence database with items .

Chapter 07
  Table 7.1 Summarizing methods for a single attribute.
  Table 7.2 Summarizing methods for two attributes.
  Table 7.3 Distance measures.
  Table 7.4 Clustering methods.
  Table 7.5 Time complexity and memory requirements of frequent itemset mining approaches.
  Table 7.6 Measures generally related to frequent mining approaches.
  Table 7.7 Attributes of the Breast Cancer Wisconsin data set.
  Table 7.8 Statistics of the attributes of the Breast Cancer Wisconsin data set.

Chapter 08
  Table 8.1 Data set of our contact list, with weights and heights.
  Table 8.2 Advantages and disadvantages of linear regression.
  Table 8.3 Advantages and disadvantages of ridge regression.
  Table 8.4 Advantages and disadvantages of the lasso.
  Table 8.5 My social network data using the height as target.

Chapter 09
  Table 9.1 Simple labeled social network data set for model induction.
  Table 9.2 Simple labeled social network data set 2.
  Table 9.3 Simple labeled social network data set 3.
  Table 9.4 Extended simple labeled social network data set.
  Table 9.5 Advantages and disadvantages of kNN.
  Table 9.6 Advantages and disadvantages of logistic regression.
  Table 9.7 Advantages and disadvantages of NB.

Chapter 10
  Table 10.1 Simple labeled social network data set 3.
  Table 10.2 Advantages and disadvantages of decision trees.
  Table 10.3 Advantages and disadvantages of MARS.
  Table 10.4 Advantages and disadvantages of ANNs.
  Table 10.5 Advantages and disadvantages of DL.
  Table 10.6 Advantages and disadvantages of SVM.

Chapter 11
  Table 11.1 Advantages and disadvantages of bagging.
  Table 11.2 Advantages and disadvantages of random forests.
  Table 11.3 Advantages and disadvantages of AdaBoost.

Chapter 12
  Table 12.1 A cheat sheet on predictive algorithms.
  Table 12.2 Predictive attributes of the Polish company insolvency data set.
  Table 12.3 Statistics of the Polish company insolvency data set.
  Table 12.4 KNN confusion matrix using five predictive attributes.
  Table 12.5 C4.5 confusion matrix using five predictive attributes.
  Table 12.6 Random forest confusion matrix using all predictive attributes.

Chapter 13
  Table 13.1 Training set of labeled texts.
  Table 13.2 Results of stemming.
  Table 13.3 Results of applying a stemmer.
  Table 13.4 Stems after removal of stop words.
  Table 13.5 Objects with their stems.
  Table 13.6 Item recommendation scenario.
  Table 13.7 Rating prediction scenario.
  Table 13.8 Data for a content-based RT for the user Eve from Table 13.7.
  Table 13.9 Cosine vector similarities between the users from Table 13.6.
  Table 13.10 Pearson correlation similarities between users in Table 13.7.
  Table 13.11 The adjacency matrix for the network in Figure 13.5.
  Table 13.12 The adjacency matrix from Table 13.11 squared, showing the counts of paths of length two between pairs of nodes.
  Table 13.13 Basic properties of nodes from the network in Figure 13.5.
  Table 13.14 The distance matrix – distances between nodes – for the graph in Figure 13.5.

List of Illustrations

Chapter 01
  Figure 1.1 A prediction model to classify someone as either good or bad company.
  Figure 1.2 The use of different methodologies on data analytics through time.
  Figure 1.3 The CRISP-DM methodology.

Chapter 02
  Figure 2.1 The main areas of statistics.
  Figure 2.2 The relation between the four types of scales: absolute, relative, ordinal and nominal.
  Figure 2.3 An example of an area chart used to compare several probability density functions.
  Figure 2.4 Price absolute frequency distributions with (histogram) and without (bar chart) cell definition.
  Figure 2.5 Empirical and probability distribution functions.
  Figure 2.6 Stacked bar plot for “company” split by “gender”.
  Figure 2.7 Location statistics on the absolute frequency plot for the attribute “weight”.
  Figure 2.8 Boxplot for the attribute “height”.
  Figure 2.9 Central tendency statistics in asymmetric and symmetric unimodal distributions.
  Figure 2.10 An example of the Likert scale.
  Figure 2.11 Combination of a histogram and a boxplot for the “height” attribute.
  Figure 2.12 The probability density function of .
  Figure 2.13 The probability density function for different standard deviations, .
  Figure 2.14 3D histogram for attributes “weight” and “height”.
  Figure 2.15 Scatter plot for the attributes “weight” and “height”.
  Figure 2.16 Three examples of correlation between two attributes.
  Figure 2.17 The scatter plot for the attributes and .
  Figure 2.18 Contingency table with absolute joint frequencies for “company” and “gender”.
  Figure 2.19 Mosaic plot for “company” and “gender”.
  Figure 2.20 Scatter plot.

Chapter 03
  Figure 3.1 Plot of objects with three attributes.
  Figure 3.2 Two alternatives for a plot of three attributes, where the third attribute is qualitative.
  Figure 3.3 Plot for three attributes from the contacts data set.
  Figure 3.4 Plot for four attributes of the friends data set using color for the fourth attribute.
  Figure 3.5 Parallel coordinate plot for three attributes.
  Figure 3.6 Parallel coordinate plot for five attributes.
  Figure 3.7 Parallel coordinate plots for multiple attributes: left, using a different style of line for contacts who are good and bad company; right, with the order of the attributes changed as well.
  Figure 3.8 Star plot with the value of each attribute for each object in our contacts data set.
  Figure 3.9 Star plot with the value of each attribute for each object in the contacts data set.
  Figure 3.10 Visualization of the objects in our contacts data set using Chernoff faces.
  Figure 3.11 Set of box plots, one for each attribute.
  Figure 3.12 Matrix of scatter plots for quantitative attributes.
  Figure 3.13 Matrix of scatter plots for quantitative attributes with additional Pearson correlation values.
  Figure 3.14 Correlograms for Pearson correlation between the attributes “maxtemp”, “weight”, “height” and “years”.
  Figure 3.15 Heatmap for the short version of the contacts data set.
  Figure 3.16 Infographic of the level of qualifications in England (contains public sector information licensed under the Open Government Licence v3.0).
  Figure 3.17 Text visualization using a word cloud.

Chapter 04
  Figure 4.1 Data set with and without noise.
  Figure 4.2 Data set with outliers.
  Figure 4.3 Outlier detection based on the interquartile range distance.
  Figure 4.4 Two alternatives for a plot for three attributes, the last of which is qualitative.
  Figure 4.5 Principal components obtained by PCA for the short version of the contacts data set.
  Figure 4.6 Components obtained by PCA and ICA for the short version of the contacts data set.
