ebook img

The art of data science. A guide for anyone who works with data PDF

162 Pages·2015·1.64 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview The art of data science. A guide for anyone who works with data

ThAer otf D atSac ience А GuidfeoA rn yoWnheo W orkwsi tDha ta RogDe.Pr e n&gE lizaMbaettshu i The Art of Data Science A Guide for Anyone Who Works with Data Roger D. Peng and Elizabeth Matsui ThisisaLeanpubbook.Leanpubempowersauthorsand publisherswiththeLeanPublishingprocess.Lean Publishingistheactofpublishinganin-progressebook usinglightweighttoolsandmanyiterationstogetreader feedback,pivotuntilyouhavetherightbookandbuild tractiononceyoudo. ©2015SkybrudeConsulting,LLC Also By Roger D. Peng RProgrammingforDataScience ExploratoryDataAnalysiswithR ReportWritingforDataScienceinR Special thanks to Maggie Matsui, who created all of the artwork forthisbook. Contents 1. DataAnalysisasArt . . . . . . . . . . . . . . . . . 1 2. EpicyclesofAnalysis . . . . . . . . . . . . . . . . . 4 2.1 SettingtheScene . . . . . . . . . . . . . . . . 5 2.2 EpicycleofAnalysis . . . . . . . . . . . . . . 6 2.3 SettingExpectations . . . . . . . . . . . . . . 8 2.4 CollectingInformation . . . . . . . . . . . . 9 2.5 ComparingExpectationstoData . . . . . . 10 2.6 ApplyingtheEpicyleofAnalysisProcess . . 11 3. StatingandRefiningtheQuestion . . . . . . . . 16 3.1 TypesofQuestions . . . . . . . . . . . . . . . 16 3.2 ApplyingtheEpicycletoStatingandRefin- ingYourQuestion . . . . . . . . . . . . . . . 20 3.3 CharacteristicsofaGoodQuestion . . . . . 20 3.4 TranslatingaQuestionintoaDataProblem 23 3.5 CaseStudy . . . . . . . . . . . . . . . . . . . . 26 3.6 ConcludingThoughts . . . . . . . . . . . . . 30 4. ExploratoryDataAnalysis . . . . . . . . . . . . . 31 4.1 ExploratoryDataAnalysisChecklist:ACase Study . . . . . . . . . . . . . . . . . . . . . . . 33 4.2 Formulateyourquestion . . . . . . . . . . . 33 4.3 Readinyourdata . . . . . . . . . . . . . . . . 35 4.4 CheckthePackaging . . . . . . . . . . . . . . 36 4.5 LookattheTopandtheBottomofyourData 39 CONTENTS 4.6 ABC:AlwaysbeCheckingYour“n”s . . . . . 40 4.7 Validate With at Least One External Data Source . . . . . . . . . . . . . . . . . . . . . . 45 4.8 MakeaPlot . . . . . . . . . . . . . . . . . . . 46 4.9 TrytheEasySolutionFirst . . . . . . . . . . 49 4.10 Follow-upQuestions . . . . . . . . . . . . . . 53 5. UsingModelstoExploreYourData . . . . . . . 55 5.1 ModelsasExpectations . . . . . . . . . . . . 57 5.2 ComparingModelExpectationstoReality . 60 5.3 ReactingtoData:RefiningOurExpectations 64 5.4 ExaminingLinearRelationships . . . . . . . 67 5.5 WhenDoWeStop? . . . . . . . . . . . . . . . 73 5.6 Summary . . . . . . . . . . . . . . . . . . . . . 77 6. Inference:APrimer . . . . . . . . . . . . . . . . . 78 6.1 Identifythepopulation . . . . . . . . . . . . 78 6.2 Describethesamplingprocess . . . . . . . . 79 6.3 Describeamodelforthepopulation . . . . . 79 6.4 AQuickExample . . . . . . . . . . . . . . . . 80 6.5 FactorsAffectingtheQualityofInference . 84 6.6 Example:AppleMusicUsage . . . . . . . . . 86 6.7 PopulationsComeinManyForms . . . . . . 89 7. FormalModeling . . . . . . . . . . . . . . . . . . . 92 7.1 WhatAretheGoalsofFormalModeling? . 92 7.2 GeneralFramework . . . . . . . . . . . . . . 93 7.3 AssociationalAnalyses . . . . . . . . . . . . . 95 7.4 PredictionAnalyses. . . . . . . . . . . . . . . 104 7.5 Summary . . . . . . . . . . . . . . . . . . . . . 111 8. Inferencevs.Prediction:ImplicationsforMod- elingStrategy . . . . . . . . . . . . . . . . . . . . . 112 8.1 AirPollutionandMortalityinNewYorkCity113 8.2 InferringanAssociation . . . . . . . . . . . . 115 8.3 PredictingtheOutcome . . . . . . . . . . . . 121 CONTENTS 8.4 Summary . . . . . . . . . . . . . . . . . . . . . 123 9. InterpretingYourResults. . . . . . . . . . . . . . 124 9.1 PrinciplesofInterpretation . . . . . . . . . . 124 9.2 Case Study: Non-diet Soda Consumption andBodyMassIndex . . . . . . . . . . . . . 125 10. Communication . . . . . . . . . . . . . . . . . . . 144 10.1 Routinecommunication . . . . . . . . . . . . 144 10.2 TheAudience . . . . . . . . . . . . . . . . . . 146 10.3 Content . . . . . . . . . . . . . . . . . . . . . . 148 10.4 Style . . . . . . . . . . . . . . . . . . . . . . . . 151 10.5 Attitude . . . . . . . . . . . . . . . . . . . . . . 151 11. ConcludingThoughts . . . . . . . . . . . . . . . . 153 12. AbouttheAuthors . . . . . . . . . . . . . . . . . . 155 1. Data Analysis as Art Data analysis is hard, and part of the problem is that few people can explain how to do it. It’s not that there aren’t any people doing data analysis on a regular basis. It’s that thepeoplewhoarereallygoodatithaveyettoenlightenus aboutthethoughtprocessthatgoesonintheirheads. Imagine you were to ask a songwriter how she writes her songs.Therearemanytoolsuponwhichshecandraw.We have a general understanding of how a good song should be structured: how long it should be, how many verses, maybe there’s a verse followed by a chorus, etc. In other words, there’s an abstract framework for songs in general. Similarly, we have music theory that tells us that certain combinations of notes and chords work well together and other combinations don’t sound good. As good as these toolsmightbe,ultimately,knowledgeofsongstructureand music theory alone doesn’t make for a good song. Some- thingelseisneeded. InDonaldKnuth’slegendary1974essayComputerProgram- mingasanArt1,Knuthtalksaboutthedifferencebetweenart andscience.Inthatessay,hewastryingtogetacrosstheidea that although computer programming involved complex machinesandverytechnicalknowledge,theactofwritinga computerprogramhadanartisticcomponent.Inthisessay, hesaysthat Science is knowledge which we understand so wellthatwecanteachittoacomputer. 1http://www.paulgraham.com/knuth.html 1 DataAnalysisasArt 2 Everythingelseisart. At some point, the songwriter must inject a creative spark intotheprocesstobringallthesongwritingtoolstogether to make something that people want to listen to. This is a key part of the art of songwriting. That creative spark is difficult to describe, much less write down, but it’s clearly essentialtowritinggoodsongs.Ifitweren’t,thenwe’dhave computer programs regularly writing hit songs. For better orforworse,thathasn’thappenedyet. Much like songwriting (and computer programming, for thatmatter),it’simportanttorealizethatdata analysis is an art.Itisnotsomethingyetthatwecanteachtoacomputer. Data analysts have many tools at their disposal, from linear regression to classification trees and even deep learning, andthesetoolshaveallbeencarefullytaughttocomputers. But ultimately, a data analyst must find a way to assemble allofthetoolsandapplythemtodatatoanswerarelevant question—aquestionofinteresttopeople. Unfortunately, the process of data analysis is not one that we have been able to write down effectively. It’s true that there are many statistics textbooks out there, many lining our own shelves. But in our opinion, none of these really addresses the core problems involved in conducting real- world data analyses. In 1991, Daryl Pregibon, a promi- nent statistician previously of AT&T Research and now of Google, said in reference to the process of data analysis2 that “statisticians have a process that they espouse but do notfullyunderstand”. Describing data analysis presents a difficult conundrum. On the one hand, developing a useful framework involves characterizingtheelementsofadataanalysisusingabstract 2http://www.nap.edu/catalog/1910/the-future-of-statistical-software- proceedings-of-a-forum DataAnalysisasArt 3 languageinordertofindthecommonalitiesacrossdifferent kindsofanalyses.Sometimes,thislanguageisthelanguage of mathematics. On the other hand, it is often the very details of an analysis that makes each one so difficult and yet interesting. How can one effectively generalize across many different data analyses, each of which has important uniqueaspects? What we have set out to do in this book is to write down the process of data analysis. What we describe is not a specific “formula” for data analysis—something like “ap- ply this method and then run that test”— but rather is a general process that can be applied in a variety of situ- ations. Through our extensive experience both managing data analysts and conducting our own data analyses, we havecarefullyobservedwhatproducescoherentresultsand whatfailstoproduceusefulinsightsintodata.Ourgoalisto write down what we have learned in the hopes that others mayfindituseful.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.