


RESEARCH ARTICLE

Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content

Shawn M. Jones1☯*, Herbert Van de Sompel1☯, Harihar Shankar1☯, Martin Klein1☯, Richard Tobin2‡, Claire Grover2‡

1 Digital Library Research and Prototyping Team, Research Library, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America, 2 Language Technology Group, The University of Edinburgh, Edinburgh, Scotland, United Kingdom

☯ These authors contributed equally to this work.
‡ These authors also contributed equally to this work.
* [email protected]

Abstract

Increasingly, scholarly articles contain URI references to “web at large” resources including project web sites, scholarly wikis, ontologies, online debates, presentations, blogs, and videos. Authors reference such resources to provide essential context for the research they report on. A reader who visits a web at large resource by following a URI reference in an article, some time after its publication, is led to believe that the resource’s content is representative of what the author originally referenced. However, due to the dynamic nature of the web, that may very well not be the case. We reuse a dataset from a previous study in which several authors of this paper were involved, and investigate to what extent the textual content of web at large resources referenced in a vast collection of Science, Technology, and Medicine (STM) articles published between 1997 and 2012 has remained stable since the publication of the referencing article. We do so in a two-step approach that relies on various well-established similarity measures to compare textual content. In a first step, we use 19 web archives to find snapshots of referenced web at large resources that have textual content that is representative of the state of the resource around the time of publication of the referencing paper. We find that representative snapshots exist for about 30% of all URI references. In a second step, we compare the textual content of representative snapshots with that of their live web counterparts. We find that for over 75% of references the content has drifted away from what it was when referenced. These results raise significant concerns regarding the long term integrity of the web-based scholarly record and call for the deployment of techniques to combat these problems.

OPEN ACCESS

Citation: Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, Grover C (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLoS ONE 11(12): e0167475. doi:10.1371/journal.pone.0167475

Editor: Neil R. Smalheiser, University of Illinois-Chicago, UNITED STATES

Received: March 8, 2016
Accepted: November 15, 2016
Published: December 2, 2016

Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Data Availability Statement: data is available on the Open Science Framework database at http://dx.doi.org/10.17605/OSF.IO/B6KJZ.

Funding: The author(s) received no specific funding for this work.

Competing Interests: The authors have declared that no competing interests exist.

PLOS ONE | DOI:10.1371/journal.pone.0167475 December 2, 2016

Introduction

Increasingly, scholarly papers reference web at large resources [1], that is, web resources that themselves are not scholarly papers but rather supporting resources including project web sites, scholarly wikis, ontologies, online debates, presentations, blogs, and videos. While these resources may not convey the scholarly essence of a paper, they do provide an important context pertaining to the research reported in the paper. After all, the author judged that it was relevant to include them, and, in the case of peer-reviewed papers, both reviewers and editors agreed by deciding to publish the paper with inclusion of the references to the web at large resources.
While scholarly papers are commonly referenced by means of persistent HTTP URIs that carry a Digital Object Identifier, web at large resources are referenced by means of regular HTTP URIs, which, from now on, are simply referred to as URIs. The Hiberlink project [2] coined the term Reference rot to denote the combination of two problems related to the use of HTTP URIs for referencing. Both problems are caused by the dynamic and ephemeral nature of the web:

• Link rot: The resource identified by a URI vanishes from the web. As a result, a URI reference to the resource ceases to provide access to referenced content.
• Content drift: The resource identified by a URI changes over time. The resource’s content evolves and can change to such an extent that it ceases to be representative of the content that was originally referenced.

As a result of reference rot, the context (made up of referenced web at large resources) that surrounds a paper may change over time. As such, a reader who looks up referenced resources some time after the publication of the referencing paper effectively explores the current context surrounding the paper, which may be significantly different from the past context that surrounded the paper at the time of its publication:

• The past context consists of the web at large resources referenced by a paper as they were at the time of publication of the referencing paper. If a link to a referenced resource still works when a reader follows it on the live web some time after the referencing paper was published, it is possible that the content at the end of the link is the same as it was at the time of publication. But, given content drift, the chances of this being the case diminish the further the consultation date is removed from the publication date.
• The current context is what a reader of a paper sees when following links to web at large resources on the live web. The current context may differ from what it was when the referencing paper was published because links may have rotted. In these cases, the reader will receive an error message as an explicit indicator that content that used to be there no longer is. But the current context may also differ because linked content has drifted. In these cases, when following the link, the reader may encounter content that significantly differs from the originally referenced content but has no means to assess whether or not that is the case.

When it comes to revisiting the past context of a paper, snapshots of linked resources in web archives, from now on referred to as Mementos, can come to the rescue. Web archives worldwide store hundreds of billions of Mementos and hence may coincidentally also store Mementos of referenced web at large resources. However, if such Mementos exist, the question arises as to what constitutes a representative Memento, a Memento that accurately reflects the referenced content as it was at the publication time of the referencing paper. Note that the publication time is chosen as an approximation for the time the paper’s author actually visited and referenced the resource, because consistent, machine-processable information for this latter time is unavailable.

In this paper, we assess the extent of content drift for URI references to web at large resources that are made in STM articles. We do so by first making a quantitative assessment about the existence of representative Mementos for URI references to web at large resources. Once we have identified URI references for which representative Mementos exist, we proceed to assess the extent of content drift to which these references are subject. As such, this paper addresses two research questions. Tackling the first is a necessary step towards answering the second:

1. To what extent do representative Mementos exist for URI references to web at large resources (Fig 1)? For each URI reference, we poll multiple web archives in search of two Mementos: a Memento Pre that has a snapshot date closest and prior to the publication date of the referencing article, and a Memento Post that has a snapshot date closest and past the publication date. We then assess the similarity between these Pre and Post Mementos using a variety of similarity measures. Because these representative Mementos are used as the ground truth to answer the second research question, we use a high threshold to decide whether the Pre and Post Mementos are similar according to these measures. If they are, we decide that the Mementos are representative of the referenced content as it was around the time of publication of the referencing paper. We arrive at a novel insight into the extent to which the past context surrounding scholarly articles can be revisited.
2. What is the extent of content drift to which URI references to web at large resources are subject (Fig 2)? We use the resulting subset of all URI references for which representative Mementos exist and look up each URI on the live web. Predictably, and as shown by extensive prior link rot research, many URIs no longer exist. But, for those that still do, we use the same measures to assess the similarity between the representative Memento for the URI reference and its counterpart on the live web. We arrive at an unprecedented quantitative insight into the extent to which the current context, which surrounds a paper at consultation time, drifts away from the past context, which surrounded it at publication time.

Fig 1. Past Context: Finding Representative Mementos. doi:10.1371/journal.pone.0167475.g001

Fig 2. Current Context: Assessing Content Drift by Comparing a Representative Memento with the Associated Live Web Resource. doi:10.1371/journal.pone.0167475.g002

To help explain the notion of content drift, we show two examples with snapshots of web pages that have changed over time (Fig 3) and that have not changed at all (Fig 4). A case of significant content drift is demonstrated by the home page of the IceCube Neutrino Observatory at the University of Wisconsin. Its URI http://icecube.wisc.edu is referenced in [3] published on August 15, 2009. The left part of Fig 3 shows a Memento from the Internet Archive with URI http://web.archive.org/web/20090508003554/http://icecube.wisc.edu/ and archival date May 8, 2009. The right part shows a Memento, also from the Internet Archive, with URI http://web.archive.org/web/20090827100339/http://icecube.wisc.edu/ archived on August 27, 2009. Both the content and the presentation have dramatically changed in the course of about three months. Note that the banners on the Mementos are dynamically inserted by the web archive and are not an integral part of the archived page.

Fig 3. Significant Content Drift within a 3-Month Period. doi:10.1371/journal.pone.0167475.g003

On the other hand, Fig 4 shows a page of the Institute for Astronomy at the University of Hawaii with URI http://www.ifa.hawaii.edu/~cowie/k_table.html, referenced in [4] published on July 4, 1997. The left part of Fig 4 shows a Memento from the Internet Archive with URI http://web.archive.org/web/19970607115534/http://www.ifa.hawaii.edu/~cowie/k_table.html that was archived on June 7, 1997. The right part shows the live web version of that page as it was at the time of writing this paper. Note that the pages are identical; the content seemingly has not drifted a bit in 19 years.

Fig 4. No Content Drift over a 19-Year Period. doi:10.1371/journal.pone.0167475.g004

Related Work

Characterizing the frequency of change of web resources has been the subject of numerous research efforts in the past. Without exception, these studies support our intuition of the volatility of web at large resources. For example, Cho and Garcia-Molina [5] sought to develop improved web crawling strategies by studying the changes of web pages over time. They observed 270 web sites over the course of four months and determined that “more than 40% of pages in the com domain changed every day, while less than 10% of the pages in other domains changed at that frequency”. They also found that pages in edu and gov domains are more static.
In a companion paper [6], they were able to determine that web pages change by a Poisson process, and suggest that the value of the HTTP response header last-modified is a better estimate of the frequency of change. In a related study from 2003, Fetterly et al. [7] crawled 151 million pages once a week for eleven weeks to determine the correlation between the change of web resources and factors such as document length. Their results show that longer documents changed more often and that the observed frequency of change is a good predictor of future changes. In a more recent study, Adar et al. [8] investigated the change of web content by crawling 55,000 web pages at hourly and sub-hourly periods and comparing the text between the same page at different times during the crawl. They found that a large portion of pages in their dataset changed more frequently than once per hour. In order to optimize web crawlers they proposed focusing on the detection of “meaningful change” in web pages with the goal of avoiding the recrawl of pages that merely change based on non-content elements (e.g., embedded advertisements). In related work, Adar et al. [9] investigated the structural changes of web pages. They analyzed the change frequency of Document Object Model (DOM) elements within web pages and found that the median survival rate of DOM elements is 98% after one day, 95% after one week, 63% after five weeks, and only 11% after one year.
In addition to [1], link rot in scholarly literature was, for example, explored by Lawrence [10] in 2000, Casserly in 2002 [11], Sellitto in 2003 [12], McCown in 2004 [13], Falagas in 2007 [14], Duda in 2008 [15], Wagner in 2009 [16], and Moghaddam in 2010 [17]. These studies noted that the number of web at large resources cited in academic work was growing, but many of the referenced resources were no longer available on the live web. In 2011 Sanderson, Phillips, and Van de Sompel [18] analyzed the use of the Memento protocol [19] to find archived versions of resources referenced from arXiv and the institutional repository from the University of North Texas. They were principally concerned about the existence of Mementos for a referenced resource, but were also interested in the temporal delta between paper publication and the archiving of the referenced resource. These scholarly literature studies aimed at analyzing whether or not a live version of a referenced resource was still available and whether an archived version of the resource existed.

In addition to link rot, Jackson [20], Zittrain et al. [21] and Klein et al. [1] also investigated content drift of URI references. Jackson sampled 1,000 Memento URIs per year with archival dates between 2004 and 2013 from the UK web archive and checked their status on the live web. He found that after only two years, around 40% of the URIs were gone from the live web (link rot). He also found that the same ratio of URIs (40%) are “unrecognizably different” after two years (content drift). In aggregate, according to this study, 60% of URIs from the UK web archive corpus either suffer from link rot or content drift after only two years. Zittrain et al. manually evaluated all URI references found in U.S. Supreme Court opinion papers starting in 1996. They analyzed the contexts in which the references occur and determined whether the originally intended content was still available at the live version of the URI. They found that around 50% of URI references suffered from content drift by 2014 when the study was conducted. Zittrain et al. also investigated reference rot in three law journals: Harvard Law Review (HLR), the Harvard Journal of Law and Technology (JOLT), and the Harvard Human Rights Journal (HRJ). Their corpus contained editions of the journals from 1999, 1996, and 1997, respectively, until mid 2012. Their study found that only 29.9% of the HRJ references, 26.8% of the HLR references, and 34.2% of the JOLT references contained the material originally cited. Hence between 65% and 73% of the URI references suffered from link rot or content drift. The study by Klein et al. extracted URI references from three vast scholarly corpora consisting of STM articles published between 1997 and 2012. They estimated the existence of representative Mementos for those URI references using an intuitive technique: if a Memento for a referenced URI existed with an archival datetime in a temporal window of 14 days prior and after the publication date of the referencing paper, the Memento was regarded representative. Using this ad-hoc technique, they found that representative Mementos existed for about 25% of URI references across the considered corpora. The study also assessed the extent of reference rot but did so at the aggregate level of journal articles instead of URI references, concluding that one out of five STM articles suffered from reference rot. In the work reported here, we make a quantitative assessment on the basis of text similarity measures of the existence of representative Mementos and the extent of content drift at the level of individual URI references.
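The 14-day window heuristic attributed to Klein et al. above can be sketched in a few lines. This is only an illustration of the rule as described in the text, not code from either study; the function name `has_memento_in_window` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def has_memento_in_window(publication_date, memento_datetimes, days=14):
    """Return True if any Memento was archived within +/- `days` of publication.

    `memento_datetimes` is an iterable of datetime objects (archival datetimes).
    The 14-day default follows the window described in the text; treating the
    window boundaries as inclusive is an assumption of this sketch.
    """
    window = timedelta(days=days)
    return any(abs(archived - publication_date) <= window
               for archived in memento_datetimes)


# Dates taken from the IceCube example in the Introduction.
publication = datetime(2009, 8, 15)
snapshots = [datetime(2009, 5, 8, 0, 35, 54), datetime(2009, 8, 27, 10, 3, 39)]
in_window = has_memento_in_window(publication, snapshots)
```

In this example the August 27 snapshot falls inside the window while the May 8 snapshot alone would not, which shows why such a purely temporal rule says nothing about whether the content itself was stable around publication.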
Assessing the similarity of textual documents is a common problem in the web science and information retrieval realm. Multiple similarity measures have successfully been applied in the past across disciplines and research on the efficacy of these techniques exists. Though we are not engaging in improving web crawlers or document clustering, the success of the following work supports our choices of similarity measures.

Manku, Jain, and Sarma [22] evaluated the use of Charikar’s Simhash algorithm [23, 24] for web crawlers as a way to detect a recently-crawled page. After testing it during a crawl of eight billion web pages, they determined that Simhash was a very effective method of analyzing duplicate web pages for this purpose, due to its efficiency by using small hash fingerprints for comparison.

The above mentioned study by Adar et al. [8] used the Sørensen-Dice coefficient [25, 26] to compare the textual content of the same web resource at different times during their crawl.

Figuerola et al. [27] demonstrated that Kornblum’s Spamsum algorithm [28] can successfully be used to support web crawlers. By conducting tests on more than 80,000 web pages, they established a threshold score of 0.9 to determine if pages had changed between crawls. Jackson also used Spamsum in his 2015 analysis of the UK Web Archive’s holdings [20] mentioned above.

The Jaccard coefficient [29] was, for example, used by Sivakumar [30] in 2015. The purpose of the study was to ultimately improve search results by comparing blocks of text within web pages, as well as identifying and removing duplicate advertisements, headers, and other recurring features present in web sites. Other applications of the Jaccard coefficient include document clustering (Karun, Philip, and Lubna [31]) and keyword similarity analysis (Niwattankul et al. [32]).

Another very popular text similarity measure is cosine similarity [33, 34]. It was, for example, applied by Hajishirzi, Yih, and Kolcz [35] for the detection of duplicate web pages and by Sandhya and Govardhan [36] for comparing documents as part of their document clustering approach. For AlNoamany, Weigle, and Nelson [37] the use of the cosine similarity measure proved essential to detect off-topic web pages in various web archive collections from Archive-It.
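To make three of the measures named above concrete, the following sketch computes the Jaccard coefficient, the Sørensen-Dice coefficient, and cosine similarity over word tokens. It uses the textbook definitions only; the tokenizer, the function names, and the choice of word-level (rather than n-gram or block-level) features are assumptions of this example and are not taken from the paper or from the cited implementations.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a deliberately simple, assumed tokenizer."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the sets of tokens."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def sorensen_dice(a, b):
    """2 |A ∩ B| / (|A| + |B|) over the sets of tokens."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    """Cosine of the angle between term-frequency vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


old = "the quick brown fox"
new = "the quick red fox"
scores = (jaccard(old, new), sorensen_dice(old, new), cosine(old, new))
```

For these two four-word strings, which share three words, the scores are 0.6 (Jaccard), 0.75 (Sørensen-Dice), and 0.75 (cosine). The same edit therefore yields different scores under different measures, which is one reason a threshold has to be chosen per measure.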
In addition to the above mentioned five similarity measures, other approaches have been demonstrated in the past. Examples include computing the Levenshtein distance between two DOM trees [38, 39] and calculating the deltas in embedded resources [40] of web pages. However, if the application of a similarity measure involves the HTML elements of web pages, the process relies upon the fact that the structure of the resources to be compared has not been altered. Since web archives often inject additional HTML elements for user experience and branding, and hence alter the DOM structure of our Pre and Post Mementos, we cannot use these measures of similarity. Also, many of our references are not in HTML format, but are instead PDF documents, which renders any DOM tree comparison inapplicable. Further methods to assess similarity rely upon a vast corpus to compute similarity [41] and others, such as Hamming distance [42], are suitable for same-length text strings only. Neither of these methods is suitable for our experiment since we seek methods that compare two documents without knowledge of a larger corpus and we cannot rely on equal length of web pages.

Methods

Our study uses the same URI references to web at large resources, 1,059,742 in total, that were used for [1]. We briefly describe how these URI references were obtained and refer to [1] for a detailed description. The URI references were extracted from three scholarly corpora consisting of 3.5 million articles published between 1997 and 2012: arXiv, Elsevier, and PubMed Central (PMC). We downloaded all articles from arXiv and PMC published in that time frame and used a (meanwhile discontinued) CrossRef API to randomly select DOIs of Elsevier articles. The arXiv corpus mainly covers physics, mathematics, and computer science; the Elsevier corpus covers a wide range of STM subjects; PMC mostly covers biomedical and life sciences. After processing these corpora, for example removing articles without URI references, a dataset consisting of 707,667 arXiv articles, 655,040 Elsevier articles, and 479,194 PMC articles remained. Next, all URIs were extracted from each article using a highly accurate regular expression-based URI extraction approach that is described in [43]. URIs were extracted from all sections of an article including the abstract, the body, footnotes, and the reference section. This resulted in a total of 3,983,985 URI references, which were classified into three categories:

1. References to web at large resources
2. References to journal articles
3. References to be excluded, for example, because of URI syntax errors.

References to web at large resources were the focus of the research reported in [1] and are the starting point of the research reported here too. A total of 1,059,742 of such URIs were extracted from the three corpora. As shown in the top row of Table 1, 346,177 of those URIs are from arXiv papers, 232,712 from Elsevier papers, and 480,853 were extracted from PMC papers.

Table 1. Obtaining Pre/Post Mementos for URI References.

                                        arXiv     Elsevier  PMC       Total
URI references                          346,177   232,712   480,853   1,059,742
No TimeMap                              62,923    34,897    54,068    151,888
No Memento Pre                          36,942    11,870    50,357    99,169
No Memento Post                         17,027    22,303    23,140    62,470
Problem dereferencing Memento           18,594    18,525    28,960    66,079
URI references with Pre/Post Mementos   210,691   145,117   324,328   680,136

doi:10.1371/journal.pone.0167475.t001
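The counts in Table 1 form a simple funnel: each corpus starts with its URI references and loses those with no TimeMap, no Memento Pre, no Memento Post, or a dereferencing problem. The short check below reproduces that subtraction from the reported figures; the dictionary layout and key names are this example's own, and only the numbers come from the table.

```python
# Counts copied from Table 1; the key names are illustrative only.
TABLE_1 = {
    "arXiv": {"uri_references": 346_177, "no_timemap": 62_923,
              "no_memento_pre": 36_942, "no_memento_post": 17_027,
              "dereferencing_problem": 18_594},
    "Elsevier": {"uri_references": 232_712, "no_timemap": 34_897,
                 "no_memento_pre": 11_870, "no_memento_post": 22_303,
                 "dereferencing_problem": 18_525},
    "PMC": {"uri_references": 480_853, "no_timemap": 54_068,
            "no_memento_pre": 50_357, "no_memento_post": 23_140,
            "dereferencing_problem": 28_960},
}


def with_pre_post_mementos(row):
    """URI references left after subtracting every failure category."""
    losses = sum(count for key, count in row.items() if key != "uri_references")
    return row["uri_references"] - losses


remaining = {corpus: with_pre_post_mementos(row) for corpus, row in TABLE_1.items()}
total_remaining = sum(remaining.values())
```

The per-corpus results are 210,691 (arXiv), 145,117 (Elsevier), and 324,328 (PMC), which sum to 680,136 and match the bottom row of the table.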
We use these URIs for two experiments. The first is aimed at assessing to what extent representative Mementos of URI references to web at large resources exist in web archives worldwide. The second is aimed at assessing the extent of content drift for URI references to web at large resources in scholarly papers. It uses only those URIs for which representative Mementos are found in the first experiment.

Web Archive Lookup of URI References

Web archives continuously create snapshots of URIs around the web. Naturally, they may also take snapshots of URIs that are referenced in scholarly papers. For each URI for which snapshots exist, the Memento protocol [19] and its associated aggregation infrastructure provide a TimeMap that gives an overview of all Mementos for the URI held by public web archives worldwide. Each TimeMap lists the original URI as well as the URI and the archival date of each of its Mementos. At the time the experiment was conducted, the aggregation infrastructure covered all existing web archives that exposed an openly accessible machine interface. There were 19 in total, only 6 of which were also used in [1]: Archive-It, archive.is, Bibliotheca Alexandrina Web Archive, Canadian Government Web Archive, Croatian Web Archive, Estonian Web Archive, Icelandic Web Archive, Internet Archive, Library of Congress Web Archive, NARA Web Archive, Portuguese Web Archive, PRONI Web Archive, Slovenian Web Archive, Stanford Web Archive, UK National Archives Web Archive, UK Parliament Web Archive, UK Web Archive, Web Archive Singapore, and WebCite.
We programmatically obtain TimeMaps for all referenced URIs, and from each select two Mementos: the Memento Pre, which has an archival date that is temporally closest and prior to the publication date of the article that references the URI, and the Memento Post, which is temporally closest and past that date. This selection is motivated by the insight that, if these two Mementos are highly similar (the resource has hardly changed in the pre/post interval that surrounds the publication date), then these two Mementos are very likely representative of the referenced content as it was when the article was published. After having selected these two Mementos per URI reference, we obtain their archived content from the web archive they reside in by dereferencing the URI of the respective Memento found in the TimeMap.

This entire process was conducted over the course of August 2015. To address possible temporary glitches, the process was repeated once for those URIs for which the process had failed in the first run. Table 1 shows the outcome. As can be seen, we were able to obtain content for Pre/Post Mementos for 680,136 of the 1,059,742 URI references in our collection. The table also details the reasons why the process was not successful for the remaining 379,606 URIs. These include the unavailability of TimeMaps (no Mementos exist), the unavailability of either Pre or Post Mementos (Mementos exist only pre or only post the publication date), and several instances whereby dereferencing a Memento’s URI failed. For URI dereferencing, we used the command line tool cURL to issue HTTP GET requests and configured cURL to follow a generous maximum of 50 HTTP redirects. Conducting this process was straightforward for all web archives except WebCite, an archive that was explicitly introduced more than a decade ago to combat reference rot in scholarly communication. This archive sets a very low and seemingly variable limit to the number of requests a client can issue on a daily basis. Also, programmatically obtaining Mementos from this archive required extracting content from frames; therefore we used the PhantomJS [44] browser automation tool instead of cURL [45].
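The Pre/Post selection described above amounts to a nearest-neighbour lookup on either side of the publication date. The following is a minimal sketch under stated assumptions: the TimeMap has already been fetched and parsed into `(archival_datetime, memento_uri)` pairs, the function name `select_pre_post` is hypothetical, and a snapshot taken exactly at the publication datetime is counted as Pre (the text does not say how such ties were handled).

```python
from datetime import datetime


def select_pre_post(publication_date, mementos):
    """Pick the closest Memento before and after the publication date.

    `mementos` is a list of (archival_datetime, memento_uri) tuples, assumed
    to come from an already-parsed TimeMap. Either element of the returned
    pair is None when no Memento exists on that side of the publication date.
    """
    pre = [m for m in mementos if m[0] <= publication_date]
    post = [m for m in mementos if m[0] > publication_date]
    memento_pre = max(pre, key=lambda m: m[0], default=None)
    memento_post = min(post, key=lambda m: m[0], default=None)
    return memento_pre, memento_post


# The two Internet Archive snapshots of http://icecube.wisc.edu shown in Fig 3,
# with the August 15, 2009 publication date of the referencing article [3].
publication_date = datetime(2009, 8, 15)
timemap = [
    (datetime(2009, 5, 8, 0, 35, 54),
     "http://web.archive.org/web/20090508003554/http://icecube.wisc.edu/"),
    (datetime(2009, 8, 27, 10, 3, 39),
     "http://web.archive.org/web/20090827100339/http://icecube.wisc.edu/"),
]
memento_pre, memento_post = select_pre_post(publication_date, timemap)
```

A `None` on either side corresponds to the "No Memento Pre" and "No Memento Post" rows of Table 1: the URI was archived, but only before or only after the publication date.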
Assessing the Representativeness of Mementos

We use the 680,136 URI references that remain from the process described above. For each of these references, a Pre and Post Memento exists. Our goal is to measure the similarity between these Mementos and then to declare highly similar ones as representative of the referenced content as it existed at the time of publication of the referencing paper. Doing so involves two distinct challenges: measuring similarity and setting a threshold for a similarity measure above which content is regarded as sufficiently similar.

Measuring Similarity. An inspection of the content types of our Pre and Post Mementos, summarized in Table 2, shows that the vast majority of Mementos are textual. Over 1 million are text/html, about 93,000 are application/pdf, and about 5,000 are text/plain. The numbers for other content types are very small in comparison. Since well established techniques exist for measuring the similarity of textual content, we decide to only retain Mementos with these three most frequently occurring content types. This leads to dismissing 10,112 Pre/Post Mementos, which corresponds to 5,056 URI references.

Table 2. Top 10 Content Types of Pre and Post Mementos.

Content Type                Mementos    In %
text/html                   1,041,192   90.68%
application/pdf             93,428      8.14%
text/plain                  5,424       0.47%
application/postscript      1,690       0.15%
application/msword          911         0.07%
application/vnd.ms-excel    610         0.05%
image/gif                   493         0.04%
application/xhtml+xml       484         0.04%
application/octent-stream   447         0.04%
image/jpeg                  405         0.04%

doi:10.1371/journal.pone.0167475.t002

In order to be able to compare the content of the remaining Mementos, we extract the text from the HTML and PDF files. This obviously involves removing tags from the HTML and control characters from the PDF. We respectively use lxml [46] and pyPDF [47] to achieve this. However, the text extraction process posed significant additional challenges that required writing custom code. Detailing these is beyond the scope of this paper but a technical report on the matter is available [45]. For the purpose of this discussion it suffices to mention that the most common challenge was removal of content inserted into Mementos by web archives, including JavaScript, CSS, and textual information. Table 3 provides an overview of URI references with Pre/Post Mementos that were excluded from the textual comparison because their content type was not one of the top three selected types or because the text extraction process was unsuccessful. We are left with 648,253 URI references for which the textual content of Pre/Post Mementos can be compared.

With the extracted text from our Pre/Post Memento pairs remaining, we proceeded to assess their similarity. We do so by using the text similarity measures introduced in the Related Work section, and their corresponding implementations, shown in Table 4. These four measures were selected because their specific and complementary characteristics provide insight into different notions of similarity. Hence, their combination offers a well-rounded view of changing textual content.

We use Simhash, a hash-based measure, that splits two strings into n-grams and creates one vector of n-grams per string. It then computes hash values for both vectors and returns their distance indicating the similarity of both strings. Simhash is often used for large-scale web page comparisons and is designed to be sensitive to editorial changes in the compared texts.
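The Simhash idea sketched in the paragraph above can be illustrated with the standard Charikar-style construction: hash every n-gram, let each hash vote on every bit of a fixed-width fingerprint, and compare fingerprints by Hamming distance. This is a generic sketch and not the implementation used in the study; the character 4-gram shingling, the 64-bit width, and the use of MD5 as the per-shingle hash are all assumptions made for the example.

```python
import hashlib


def shingles(text, n=4):
    """Character n-grams over whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def simhash(text, n=4, bits=64):
    """Fingerprint built by per-bit majority vote over hashed n-grams."""
    votes = [0] * bits
    for gram in shingles(text, n):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


pre_text = "IceCube is a neutrino observatory at the South Pole."
post_text = "IceCube is a neutrino telescope located at the South Pole."
distance = hamming_distance(simhash(pre_text), simhash(post_text))
```

Identical inputs always produce a distance of 0, and the distance can never exceed the fingerprint width (64 here). How small a distance must be for two documents to count as similar is a separate threshold decision that this sketch does not address.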
Table 3. Selecting Pre/Post Mementos for Textual Comparison.

                                                   arXiv     Elsevier  PMC       Total
URI references with Pre/Post Mementos              210,691   145,117   324,328   680,136
Not in top 3 content types                         1,358     1,296     2,402     5,056
Text extraction processing errors                  1,562     3,064     13,669    18,295
No content after text extraction                   2,864     1,720     3,948     8,532
URI references for Pre/Post Memento comparison     204,907   139,037   304,309   648,253

doi:10.1371/journal.pone.0167475.t003

We initially also used Spamsum as a byte-level similarity measure since it was applied in previous related work [20]. However, we found many instances in which the Spamsum score differed dramatically from other scores and found the explanation in [48, 49] where the creator of the algorithm states that input strings need to be more than 4KB of length for the measure

