“125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:42 — page i — #1

Humanities Data Analysis
Case Studies with Python

Folgert Karsdorp, Mike Kestemont & Allen Riddell

PRINCETON UNIVERSITY PRESS
PRINCETON AND OXFORD

Copyright © 2021 by Folgert Karsdorp, Mike Kestemont, and Allen Riddell

Requests for permission to reproduce material from this work should be sent to [email protected]

Published by Princeton University Press
41 William Street, Princeton, New Jersey 08540
6 Oxford Street, Woodstock, Oxfordshire OX20 1TR
press.princeton.edu

All Rights Reserved

ISBN 978-0-691-17236-1
ISBN (e-book) 978-0-691-20033-0

British Library Cataloging-in-Publication Data is available

Editorial: Susannah Shoemaker and Kristen Hop
Production Editorial: Ali Parrington
Text Design: Lorraine Doneker
Production: Jacqueline Poirier
Cover image: Shutterstock

This book has been composed in Sabon
Printed on acid-free paper. ∞
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

Contents
Preface

I Data Analysis Essentials

Chapter 1 Introduction
  1.1 Quantitative Data Analysis and the Humanities
  1.2 Overview of the Book
  1.3 Related Books
  1.4 How to Use This Book
    1.4.1 What you should know
    1.4.2 Packages and data
    1.4.3 Exercises
  1.5 An Exploratory Data Analysis of the United States’ Culinary History
  1.6 Cooking with Tabular Data
  1.7 Taste Trends in Culinary US History
  1.8 America’s Culinary Melting Pot
  1.9 Further Reading

Chapter 2 Parsing and Manipulating Structured Data
  2.1 Introduction
  2.2 Plain Text
  2.3 CSV
  2.4 PDF
  2.5 JSON
  2.6 XML
    2.6.1 Parsing XML
    2.6.2 Creating XML
    2.6.3 TEI
  2.7 HTML
    2.7.1 Retrieving HTML from the web
  2.8 Extracting Character Interaction Networks
  2.9 Conclusion and Further Reading

Chapter 3 Exploring Texts Using the Vector Space Model
  3.1 Introduction
  3.2 From Texts to Vectors
    3.2.1 Text preprocessing
  3.3 Mapping Genres
    3.3.1 Computing distances between documents
    3.3.2 Nearest neighbors
  3.4 Further Reading
  3.5 Appendix: Vectorizing Texts with NumPy
    3.5.1 Constructing arrays
    3.5.2 Indexing and slicing arrays
    3.5.3 Aggregating functions
    3.5.4 Array broadcasting

Chapter 4 Processing Tabular Data
  4.1 Loading, Inspecting, and Summarizing Tabular Data
    4.1.1 Reading tabular data with Pandas
  4.2 Mapping Cultural Change
    4.2.1 Turnover in naming practices
    4.2.2 Visualizing turnovers
  4.3 Changing Naming Practices
    4.3.1 Increasing name diversity
    4.3.2 A bias for names ending in n?
    4.3.3 Unisex names in the United States
  4.4 Conclusions and Further Reading

II Advanced Data Analysis

Chapter 5 Statistics Essentials: Who Reads Novels?
  5.1 Introduction
  5.2 Statistics
  5.3 Summarizing Location and Dispersion
    5.3.1 Data: Novel reading in the United States
  5.4 Location
  5.5 Dispersion
    5.5.1 Variation in categorical values
  5.6 Measuring Association
    5.6.1 Measuring association between numbers
    5.6.2 Measuring association between categories
    5.6.3 Mutual information
  5.7 Conclusion
  5.8 Further Reading

Chapter 6 Introduction to Probability
  6.1 Uncertainty and Thomas Pynchon
  6.2 Probability
    6.2.1 Probability and degree of belief
  6.3 Example: Bayes’s Rule and Authorship Attribution
    6.3.1 Random variables and probability distributions
  6.4 Further Reading
  6.5 Appendix
    6.5.1 Bayes’s rule
    6.5.2 Fitting a negative binomial distribution

Chapter 7 Narrating with Maps
  7.1 Introduction
  7.2 Data Preparations
  7.3 Projections and Basemaps
  7.4 Plotting Battles
  7.5 Mapping the Development of the War
  7.6 Further Reading

Chapter 8 Stylometry and the Voice of Hildegard
  8.1 Introduction
  8.2 Authorship Attribution
    8.2.1 Burrows’s Delta
    8.2.2 Function words
    8.2.3 Computing document distances with Delta
    8.2.4 Authorship attribution evaluation
  8.3 Hierarchical Agglomerative Clustering
  8.4 Principal Component Analysis
    8.4.1 Applying PCA
    8.4.2 The intuition behind PCA
    8.4.3 Loadings
  8.5 Conclusions
  8.6 Further Reading

Chapter 9 A Topic Model of United States Supreme Court Opinions, 1900–2000
  9.1 Introduction
  9.2 Mixture Models: Artwork Dimensions in the Tate Galleries
  9.3 Mixed-Membership Model of Texts
    9.3.1 Parameter estimation
    9.3.2 Checking an unsupervised model
    9.3.3 Modeling different word senses
    9.3.4 Exploring trends over time in the Supreme Court
  9.4 Conclusion
  9.5 Further Reading
  9.6 Appendix: Mapping Between Our Topic Model and Lauderdale and Clark (2014)

Epilogue: Good Enough Practices
Bibliography
Index

Preface

More and more research in the humanities and allied social sciences involves analyzing machine-readable data with computer software. But learning the techniques and perspectives that support this computational work is still difficult for students and researchers. The population of university courses and books addressed to researchers in the Geisteswissenschaften remains small and unevenly distributed. This is unfortunate because scholars associated with the humanities stand to benefit from expanding reservoirs of trustworthy, machine-readable data. We wrote this book in response to this situation. Our goal is to make common techniques and practices used in data analysis more accessible and to describe in detail how researchers can use the programming language Python, and its software ecosystem, in their work. When readers finish this book, they will have greater fluency in quantitative data analysis and will be equipped to move beyond deliberating about what one might do with large datasets and large text collections; they will be ready to begin to propose answers to questions of demonstrable interest.
This book is written with a particular group of readers in mind: students and researchers in the humanities and allied social sciences who are familiar with the Python programming language and who want to use Python in research related to their interests. (Readers fluent in a programming language other than Python should have no problem picking up the syntax of Python as they work through the initial chapters.) That such a population of readers exists, or is coming into existence, is clear. Python is the official programming language in secondary education in France and the most widely taught programming language in US universities (Ministère de l’Éducation Nationale et de la Jeunesse 2018; Guo 2014). The language is, increasingly, the dominant language used in software development in high-income countries such as the United States, United Kingdom, Germany, and Canada (Robinson 2017). There are vanishingly few barriers to learning the basics. This is a book which should be accessible to all curious hackers interested in data-intensive research.

The book is limited in that it occasionally omits detailed coverage of mathematical or algorithmic details of procedures and models, opting to focus on