ebook img

The Elements of Data Analytic Style - A guide for people who want to analyze data. PDF

98 Pages·2015·1.55 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview The Elements of Data Analytic Style - A guide for people who want to analyze data.

The Elements of Data Analytic Style A guide for people who want to analyze data. Jeff Leek Thisbookisforsaleathttp://leanpub.com/datastyle Thisversionwaspublishedon2015-03-02 ThisisaLeanpubbook.Leanpubempowersauthorsand publisherswiththeLeanPublishingprocess.Lean Publishingistheactofpublishinganin-progressebook usinglightweighttoolsandmanyiterationstogetreader feedback,pivotuntilyouhavetherightbookandbuild tractiononceyoudo. ©2014-2015JeffLeek An@simplystatspublication. ThankyoutoKarlBromanandAlyssaFrazeefor constructiveandreallyhelpfulfeedbackonthefirstdraftof thismanuscript.ThankstoRogerPeng,BrianCaffo,and RafaelIrizarryforhelpfuldiscussionsaboutdataanalysis. Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . 1 2. Thedataanalyticquestion . . . . . . . . . . . . . 3 3. Tidyingthedata . . . . . . . . . . . . . . . . . . . 10 4. Checkingthedata . . . . . . . . . . . . . . . . . . 17 5. Exploratoryanalysis . . . . . . . . . . . . . . . . . 23 6. Statisticalmodelingandinference . . . . . . . . . 34 7. Predictionandmachinelearning . . . . . . . . . . 45 8. Causality . . . . . . . . . . . . . . . . . . . . . . . 50 9. Writtenanalyses . . . . . . . . . . . . . . . . . . . 53 10.Creatingfigures . . . . . . . . . . . . . . . . . . . 58 11.Presentingdata . . . . . . . . . . . . . . . . . . . . 70 12.Reproducibility . . . . . . . . . . . . . . . . . . . . 79 13.Afewmattersofform . . . . . . . . . . . . . . . . 85 14.Thedataanalysischecklist . . . . . . . . . . . . . 87 CONTENTS 15.Additionalresources . . . . . . . . . . . . . . . . . 92 1. Introduction The dramatic change in the price and accessibility of data demands a new focus on data analytic literacy. This book is intendedforusebypeoplewhoperformregulardataanalyses. Itaimstogiveabriefsummaryofthekeyideas,practices,and pitfalls of modern data analysis. One goal is to summarize in a succinct way the most common difficulties encountered by practicing data analysts. It may serve as a guide for peer reviewers who may refer to specific section numbers when evaluating manuscripts. As will become apparent, it is modeled loosely in format and aim on the Elements of Style byWilliamStrunk. The book includes a basic checklist that may be useful as a guideforbeginningdataanalystsorasarubricforevaluating data analyses. It has been used in the author’s data analysis class to evaluate student projects. Both the checklist and this bookcoverasmallfractionofthefieldofdataanalysis,butthe experienceoftheauthoristhatoncetheseelementsaremas- tered,dataanalystsbenefitmostfromhandsonexperiencein their own discipline of application, and that many principles maybenon-transferablebeyondthebasics. If you want a more complete introduction to the analysis of data one option is the free Johns Hopkins Data Science Specialization¹. As with rhetoric, it is true that the best data analysts some- timesdisregardtherulesintheiranalyses.Expertsusuallydo ¹https://www.coursera.org/specialization/jhudatascience/1 Introduction 2 this to reveal some characteristic of the data that would be obscured by a rigid application of data analytic principles. Unless an analyst is certain of the improvement, they will oftenbebetterservedbyfollowingtherules.Aftermastering the basic principles, analysts may look to experts in their subjectdomainformorecreativeandadvanceddataanalytic ideas. 2. The data analytic question 2.1 Define the data analytic question first Data can be used to answer many questions, but not all of them. One of the most innovative data scientists of all time saiditbest. The data may not contain the answer. The com- binationofsomedataandanachingdesireforan answerdoesnotensurethatareasonableanswer canbeextractedfromagivenbodyofdata. JohnTukey Beforeperformingadataanalysisthekeyistodefinethetype ofquestionbeingasked.Somequestionsareeasiertoanswer withdataandsomeareharder.Thisisabroadcategorization ofthetypesofdataanalysisquestions,rankedbyhoweasyit istoanswerthequestionwithdata.Youcanalsousethedata analysis question type flow chart to help define the question type(Figure2.1) Thedataanalyticquestion 4 Figure2.1Thedataanalysisquestiontypeflowchart 2.2 Descriptive A descriptive data analysis seeks to summarize the measure- ments in a single data set without further interpretation. An exampleistheUnitedStatesCensus.TheCensuscollectsdata ontheresidencetype,location,age,sex,andraceofallpeople intheUnitedStatesatafixedtime.TheCensusisdescriptive because the goal is to summarize the measurements in this fixeddatasetintopopulationcountsanddescribehowmany Thedataanalyticquestion 5 people live in different parts of the United States. The inter- pretation and use of these counts is left to Congress and the public,butisnotpartofthedataanalysis. 2.3 Exploratory An exploratory data analysis builds on a descriptive analysis by searching for discoveries, trends, correlations, or rela- tionships between the measurements of multiple variables to generateideasorhypotheses.Anexampleisthediscoveryofa four-planetsolarsystembyamateurastronomersusingpublic astronomical data from the Kepler telescope. The data was made available through the planethunters.org website, that askedamateurastronomerstolookforacharacteristicpattern of light indicating potential planets. An exploratory analysis likethisoneseekstomakediscoveries,butrarelycanconfirm those discoveries. In the case of the amateur astronomers, follow-upstudiesandadditionaldatawereneededtoconfirm theexistenceofthefour-planetsystem. 2.4 Inferential Aninferentialdataanalysisgoesbeyondanexploratoryanal- ysis by quantifying whether an observed pattern will likely holdbeyondthedatasetinhand.Inferentialdataanalysesare the most common statistical analysis in the formal scientific literature. An example is a study of whether air pollution correlateswithlifeexpectancyatthestatelevelintheUnited States. The goal is to identify the strength of the relationship in both the specific data set and to determine whether that relationship will hold in future data. In non-randomized

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.