Automated Data Collection with R
A Practical Guide to Web Scraping and Text Mining
Simon Munzert Christian Rubba Peter Meißner Dominic Nyhuis
Simon Munzert
Department of Politics and Public Administration, University of Konstanz, Germany
Christian Rubba
Department of Political Science, University of Zurich and National Center of Competence in Research, Switzerland
Peter Meißner
Department of Politics and Public Administration, University of Konstanz, Germany
Dominic Nyhuis
Department of Political Science, University of Mannheim, Germany
This edition first published 2015
© 2015 John Wiley & Sons, Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Munzert, Simon.
Automated data collection with R : a practical guide to web scraping and text mining / Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis.
pages cm
Summary: “This book provides a unified framework of web scraping and information extraction from text data with R for the social sciences” – Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-118-83481-7 (hardback)
1. Data mining. 2. Automatic data collection systems. 3. Social sciences–Research–Data processing. 4. R (Computer program language) I. Title.
QA76.9.D343M865 2014
006.3′12–dc23
2014032266
A catalogue record for this book is available from the British Library.
ISBN: 9781118834817
Set in 10/12pt Times by Aptara Inc., New Delhi, India.
1 2015
To my parents, for their unending support. Also, to Stefanie.
—Simon
To my parents, for their love and encouragement.
—Christian
To Kristin, Buddy, and Paul for love, regular walks, and a final deadline.
—Peter
To my family.
—Dominic
Contents
Preface
1 Introduction
1.1 Case study: World Heritage Sites in Danger
1.2 Some remarks on web data quality
1.3 Technologies for disseminating, extracting, and storing web data
1.3.1 Technologies for disseminating content on the Web
1.3.2 Technologies for information extraction from web documents
1.3.3 Technologies for data storage
1.4 Structure of the book
Part One A Primer on Web and Data Technologies
2 HTML
2.1 Browser presentation and source code
2.2 Syntax rules
2.2.1 Tags, elements, and attributes
2.2.2 Tree structure
2.2.3 Comments
2.2.4 Reserved and special characters
2.2.5 Document type definition
2.2.6 Spaces and line breaks
2.3 Tags and attributes
2.3.1 The anchor tag <a>
2.3.2 The metadata tag <meta>
2.3.3 The external reference tag <link>
2.3.4 Emphasizing tags <b>, <i>, <strong>
2.3.5 The paragraphs tag <p>
2.3.6 Heading tags <h1>, <h2>, <h3>, …
2.3.7 Listing content with <ul>, <ol>, and <dl>
2.3.8 The organizational tags <div> and <span>
2.3.9 The <form> tag and its companions
2.3.10 The foreign script tag <script>
2.3.11 Table tags <table>, <tr>, <td>, and <th>
2.4 Parsing
2.4.1 What is parsing?
2.4.2 Discarding nodes
2.4.3 Extracting information in the building process
Summary
Further reading
Problems
3 XML and JSON
3.1 A short example XML document
3.2 XML syntax rules
3.2.1 Elements and attributes
3.2.2 XML structure
3.2.3 Naming and special characters
3.2.4 Comments and character data
3.2.5 XML syntax summary
3.3 When is an XML document well formed or valid?
3.4 XML extensions and technologies
3.4.1 Namespaces
3.4.2 Extensions of XML
3.4.3 Example: Really Simple Syndication
3.4.4 Example: scalable vector graphics
3.5 XML and R in practice
3.5.1 Parsing XML
3.5.2 Basic operations on XML documents
3.5.3 From XML to data frames or lists
3.5.4 Event-driven parsing
3.6 A short example JSON document
3.7 JSON syntax rules
3.8 JSON and R in practice
Summary
Further reading
Problems
4 XPath
4.1 XPath – a query language for web documents
4.2 Identifying node sets with XPath
4.2.1 Basic structure of an XPath query
4.2.2 Node relations
4.2.3 XPath predicates
4.3 Extracting node elements
4.3.1 Extending the fun argument
4.3.2 XML namespaces
4.3.3 Little XPath helper tools