Automated Data Collection with R PDF

477 Pages·2015·8.04 MB·English

Preview Automated Data Collection with R

Automated Data Collection with R
A Practical Guide to Web Scraping and Text Mining

Simon Munzert
Department of Politics and Public Administration, University of Konstanz, Germany

Christian Rubba
Department of Political Science, University of Zurich and National Center of Competence in Research, Switzerland

Peter Meißner
Department of Politics and Public Administration, University of Konstanz, Germany

Dominic Nyhuis
Department of Political Science, University of Mannheim, Germany

This edition first published 2015
© 2015 John Wiley & Sons, Ltd

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Munzert, Simon.
Automated data collection with R : a practical guide to web scraping and text mining / Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis.
pages cm
Summary: “This book provides a unified framework of web scraping and information extraction from text data with R for the social sciences” – Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-118-83481-7 (hardback)
1. Data mining. 2. Automatic data collection systems. 3. Social sciences–Research–Data processing. 4. R (Computer program language) I. Title.
QA76.9.D343M865 2014
006.3′12–dc23
2014032266

A catalogue record for this book is available from the British Library.

ISBN: 9781118834817

Set in 10/12pt Times by Aptara Inc., New Delhi, India.

1 2015

To my parents, for their unending support. Also, to Stefanie.
– Simon

To my parents, for their love and encouragement.
– Christian

To Kristin, Buddy, and Paul for love, regular walks, and a final deadline.
– Peter

To my family.
– Dominic

Contents

Preface

1 Introduction
  1.1 Case study: World Heritage Sites in Danger
  1.2 Some remarks on web data quality
  1.3 Technologies for disseminating, extracting, and storing web data
    1.3.1 Technologies for disseminating content on the Web
    1.3.2 Technologies for information extraction from web documents
    1.3.3 Technologies for data storage
  1.4 Structure of the book

Part One: A Primer on Web and Data Technologies

2 HTML
  2.1 Browser presentation and source code
  2.2 Syntax rules
    2.2.1 Tags, elements, and attributes
    2.2.2 Tree structure
    2.2.3 Comments
    2.2.4 Reserved and special characters
    2.2.5 Document type definition
    2.2.6 Spaces and line breaks
  2.3 Tags and attributes
    2.3.1 The anchor tag <a>
    2.3.2 The metadata tag <meta>
    2.3.3 The external reference tag <link>
    2.3.4 Emphasizing tags <b>, <i>, <strong>
    2.3.5 The paragraphs tag <p>
    2.3.6 Heading tags <h1>, <h2>, <h3>, …
    2.3.7 Listing content with <ul>, <ol>, and <dl>
    2.3.8 The organizational tags <div> and <span>
    2.3.9 The <form> tag and its companions
    2.3.10 The foreign script tag <script>
    2.3.11 Table tags <table>, <tr>, <td>, and <th>
  2.4 Parsing
    2.4.1 What is parsing?
    2.4.2 Discarding nodes
    2.4.3 Extracting information in the building process
  Summary
  Further reading
  Problems

3 XML and JSON
  3.1 A short example XML document
  3.2 XML syntax rules
    3.2.1 Elements and attributes
    3.2.2 XML structure
    3.2.3 Naming and special characters
    3.2.4 Comments and character data
    3.2.5 XML syntax summary
  3.3 When is an XML document well formed or valid?
  3.4 XML extensions and technologies
    3.4.1 Namespaces
    3.4.2 Extensions of XML
    3.4.3 Example: Really Simple Syndication
    3.4.4 Example: scalable vector graphics
  3.5 XML and R in practice
    3.5.1 Parsing XML
    3.5.2 Basic operations on XML documents
    3.5.3 From XML to data frames or lists
    3.5.4 Event-driven parsing
  3.6 A short example JSON document
  3.7 JSON syntax rules
  3.8 JSON and R in practice
  Summary
  Further reading
  Problems

4 XPath
  4.1 XPath – a query language for web documents
  4.2 Identifying node sets with XPath
    4.2.1 Basic structure of an XPath query
    4.2.2 Node relations
    4.2.3 XPath predicates
  4.3 Extracting node elements
    4.3.1 Extending the fun argument
    4.3.2 XML namespaces
    4.3.3 Little XPath helper tools
