RESEARCH ARTICLE

Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content

Shawn M. Jones1☯*, Herbert Van de Sompel1☯, Harihar Shankar1☯, Martin Klein1☯, Richard Tobin2‡, Claire Grover2‡

1 Digital Library Research and Prototyping Team, Research Library, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America, 2 Language Technology Group, The University of Edinburgh, Edinburgh, Scotland, United Kingdom

☯ These authors contributed equally to this work.
‡ These authors also contributed equally to this work.
* [email protected]

OPEN ACCESS

Citation: Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, Grover C (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLoS ONE 11(12): e0167475. doi:10.1371/journal.pone.0167475

Editor: Neil R. Smalheiser, University of Illinois-Chicago, UNITED STATES

Received: March 8, 2016
Accepted: November 15, 2016
Published: December 2, 2016

Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Data Availability Statement: data is available on the Open Science Framework database at http://dx.doi.org/10.17605/OSF.IO/B6KJZ.

Funding: The author(s) received no specific funding for this work.

Competing Interests: The authors have declared that no competing interests exist.

PLOS ONE | DOI:10.1371/journal.pone.0167475 December 2, 2016

Abstract

Increasingly, scholarly articles contain URI references to “web at large” resources including project websites, scholarly wikis, ontologies, online debates, presentations, blogs, and videos. Authors reference such resources to provide essential context for the research they report on. A reader who visits a web at large resource by following a URI reference in an article, some time after its publication, is led to believe that the resource’s content is representative of what the author originally referenced. However, due to the dynamic nature of the web, that may very well not be the case. We reuse a dataset from a previous study in which several authors of this paper were involved, and investigate to what extent the textual content of web at large resources referenced in a vast collection of Science, Technology, and Medicine (STM) articles published between 1997 and 2012 has remained stable since the publication of the referencing article. We do so in a two-step approach that relies on various well-established similarity measures to compare textual content. In a first step, we use 19 web archives to find snapshots of referenced web at large resources that have textual content that is representative of the state of the resource around the time of publication of the referencing paper. We find that representative snapshots exist for about 30% of all URI references. In a second step, we compare the textual content of representative snapshots with that of their live web counterparts. We find that for over 75% of references the content has drifted away from what it was when referenced. These results raise significant concerns regarding the long term integrity of the web-based scholarly record and call for the deployment of techniques to combat these problems.

Introduction

Increasingly, scholarly papers reference web at large resources [1], that is, web resources that themselves are not scholarly papers but rather supporting resources including project websites, scholarly wikis, ontologies, online debates, presentations, blogs, and videos. While these resources may not convey the scholarly essence of a paper, they do provide an important context pertaining to the research reported in the paper. After all, the author judged that it was relevant to include them, and, in the case of peer-reviewed papers, both reviewers and editors agreed by deciding to publish the paper with inclusion of the references to the web at large resources.
While scholarly papers are commonly referenced by means of persistent HTTP URIs that carry a Digital Object Identifier, web at large resources are referenced by means of regular HTTP URIs, which, from now on, are simply referred to as URIs. The Hiberlink project [2] coined the term Reference rot to denote the combination of two problems related to the use of HTTP URIs for referencing. Both problems are caused by the dynamic and ephemeral nature of the web:

• Link rot: The resource identified by a URI vanishes from the web. As a result, a URI reference to the resource ceases to provide access to referenced content.
• Content drift: The resource identified by a URI changes over time. The resource’s content evolves and can change to such an extent that it ceases to be representative of the content that was originally referenced.

As a result of reference rot, the context that surrounds a paper, made up of referenced web at large resources, may change over time. As such, a reader who looks up referenced resources some time after the publication of the referencing paper effectively explores the current context surrounding the paper, which may be significantly different from the past context that surrounded the paper at the time of its publication:

• The past context consists of the web at large resources referenced by a paper as they were at the time of publication of the referencing paper. If a link to a referenced resource still works when a reader follows it on the live web some time after the referencing paper was published, it is possible that the content at the end of the link is the same as it was at the time of publication. But, given content drift, the chances of this being the case diminish the further the consultation date is removed from the publication date.
• The current context is what a reader of a paper sees when following links to web at large resources on the live web. The current context may differ from what it was when the referencing paper was published because links may have rotted. In these cases, the reader will receive an error message as an explicit indicator that content that used to be there no longer is. But the current context may also differ because linked content has drifted. In these cases, when following the link, the reader may encounter content that significantly differs from the originally referenced content but has no means to assess whether or not that is the case.

When it comes to revisiting the past context of a paper, snapshots of linked resources in web archives, from now on referred to as Mementos, can come to the rescue. Web archives worldwide store hundreds of billions of Mementos and hence may coincidentally also store Mementos of referenced web at large resources. However, if such Mementos exist, the question arises as to what constitutes a representative Memento, that is, a Memento that accurately reflects the referenced content as it was at the publication time of the referencing paper. Note that the publication time is chosen as an approximation for the time the paper’s author actually visited and referenced the resource, because consistent, machine-processable information for this latter time is unavailable.

In this paper, we assess the extent of content drift for URI references to web at large resources that are made in STM articles. We do so by first making a quantitative assessment about the existence of representative Mementos for URI references to web at large resources. Once we have identified URI references for which representative Mementos exist, we proceed to assess the extent of content drift to which these references are subject. As such, this paper addresses two research questions. Tackling the first is a necessary step towards answering the second:

1. To what extent do representative Mementos exist for URI references to web at large resources (Fig 1)? For each URI reference, we poll multiple web archives in search of two Mementos: a Memento Pre that has a snapshot date closest and prior to the publication date of the referencing article, and a Memento Post that has a snapshot date closest and past the publication date. We then assess the similarity between these Pre and Post Mementos using a variety of similarity measures. Because these representative Mementos are used as the ground truth to answer the second research question, we use a high threshold to decide whether the Pre and Post Mementos are similar according to these measures. If they are, we decide that the Mementos are representative of the referenced content as it was around the time of publication of the referencing paper. We arrive at a novel insight into the extent to which the past context surrounding scholarly articles can be revisited.

2. What is the extent of content drift to which URI references to web at large resources are subject (Fig 2)? We use the resulting subset of all URI references for which representative Mementos exist and look up each URI on the live web. Predictably, and as shown by extensive prior link rot research, many URIs no longer exist. But, for those that still do, we use the same measures to assess the similarity between the representative Memento for the URI reference and its counterpart on the live web. We arrive at an unprecedented quantitative insight into the extent to which the current context, which surrounds a paper at consultation time, drifts away from the past context, which surrounded it at publication time.

Fig 1. Past Context: Finding Representative Mementos.
doi:10.1371/journal.pone.0167475.g001

Fig 2. Current Context: Assessing Content Drift by Comparing a Representative Memento with the Associated Live Web Resource.
doi:10.1371/journal.pone.0167475.g002

To help explain the notion of content drift, we show two examples with snapshots of web pages: one that has changed over time (Fig 3) and one that has not changed at all (Fig 4). A case of significant content drift is demonstrated by the home page of the IceCube Neutrino Observatory at the University of Wisconsin. Its URI http://icecube.wisc.edu is referenced in [3], published on August 15 2009. The left part of Fig 3 shows a Memento from the Internet Archive with URI http://web.archive.org/web/20090508003554/http://icecube.wisc.edu/ and archival date May 8 2009. The right part shows a Memento, also from the Internet Archive, with URI http://web.archive.org/web/20090827100339/http://icecube.wisc.edu/ archived on August 27 2009. Both the content and the presentation have dramatically changed in the course of about three months. Note that the banners on the Mementos are dynamically inserted by the web archive and are not an integral part of the archived page.

Fig 3. Significant Content Drift within a 3-Month Period.
doi:10.1371/journal.pone.0167475.g003

On the other hand, Fig 4 shows a page of the Institute for Astronomy at the University of Hawaii with URI http://www.ifa.hawaii.edu/~cowie/k_table.html, referenced in [4], published on July 4 1997. The left part of Fig 4 shows a Memento from the Internet Archive with URI http://web.archive.org/web/19970607115534/http://www.ifa.hawaii.edu/~cowie/k_table.html that was archived on June 7 1997. The right part shows the live web version of that page as it was at the time of writing this paper. Note that the pages are identical; the content seemingly has not drifted a bit in 19 years.

Fig 4. No Content Drift over a 19-Year Period.
doi:10.1371/journal.pone.0167475.g004

Related Work

Characterizing the frequency of change of web resources has been the subject of numerous research efforts in the past. Without exception, these studies support our intuition of the volatility of web at large resources. For example, Cho and Garcia-Molina [5] sought to develop improved web crawling strategies by studying the changes of web pages over time. They observed 270 web sites over the course of four months and determined that “more than 40% of pages in the com domain changed every day, while less than 10% of the pages in other domains changed at that frequency”. They also found that pages in edu and gov domains are more static.
In a companion paper [6], they were able to determine that web pages change following a Poisson process, and suggested that the value of the HTTP response header Last-Modified is a better estimate of the frequency of change. In a related study from 2003, Fetterly et al. [7] crawled 151 million pages once a week for eleven weeks to determine the correlation between the change of web resources and factors such as document length. Their results show that longer documents changed more often and that the observed frequency of change is a good predictor of future changes. In a more recent study, Adar et al. [8] investigated the change of web content by crawling 55,000 web pages at hourly and sub-hourly periods and comparing the text between the same page at different times during the crawl. They found that a large portion of pages in their dataset changed more frequently than once per hour. In order to optimize web crawlers they proposed focusing on the detection of “meaningful change” in web pages with the goal of avoiding the recrawl of pages that merely change based on non-content elements (e.g., embedded advertisements). In related work, Adar et al. [9] investigated the structural changes of web pages. They analyzed the change frequency of Document Object Model (DOM) elements within web pages and found that the median survival rate of DOM elements is 98% after one day, 95% after one week, 63% after five weeks, and only 11% after one year.
In addition to [1], link rot in scholarly literature was, for example, explored by Lawrence [10] in 2000, Casserly in 2002 [11], Sellitto in 2003 [12], McCown in 2004 [13], Falagas in 2007 [14], Duda in 2008 [15], Wagner in 2009 [16], and Moghaddam in 2010 [17]. These studies noted that the number of web at large resources cited in academic work was growing, but many of the referenced resources were no longer available on the live web. In 2011 Sanderson, Phillips, and Van de Sompel [18] analyzed the use of the Memento protocol [19] to find archived versions of resources referenced from arXiv and the institutional repository from the University of North Texas. They were principally concerned about the existence of Mementos for a referenced resource, but were also interested in the temporal delta between paper publication and the archiving of the referenced resource. These scholarly literature studies aimed at analyzing whether or not a live version of a referenced resource was still available and whether an archived version of the resource existed.

In addition to link rot, Jackson [20], Zittrain et al. [21] and Klein et al. [1] also investigated content drift of URI references. Jackson sampled 1,000 Memento URIs per year with archival dates between 2004 and 2013 from the UK web archive and checked their status on the live web. He found that after only two years, around 40% of the URIs were gone from the live web (link rot). He also found that the same ratio of URIs (40%) are “unrecognizably different” after two years (content drift). In aggregate, according to this study, 60% of URIs from the UK web archive corpus either suffer from link rot or content drift after only two years. Zittrain et al. manually evaluated all URI references found in U.S. Supreme Court opinion papers starting in 1996. They analyzed the contexts in which the references occur and determined whether the originally intended content was still available at the live version of the URI. They found that around 50% of URI references suffered from content drift by 2014 when the study was conducted. Zittrain et al. also investigated reference rot in three law journals: Harvard Law Review (HLR), the Harvard Journal of Law and Technology (JOLT), and the Harvard Human Rights Journal (HRJ). Their corpus contained editions of the journals from 1999, 1996, and 1997, respectively, until mid 2012. Their study found that only 29.9% of the HRJ references, 26.8% of the HLR references, and 34.2% of the JOLT references contained the material originally cited. Hence between 65% and 73% of the URI references suffered from link rot or content drift. The study by Klein et al. extracted URI references from three vast scholarly corpora consisting of STM articles published between 1997 and 2012. They estimated the existence of representative Mementos for those URI references using an intuitive technique: if a Memento for a referenced URI existed with an archival datetime in a temporal window of 14 days prior and after the publication date of the referencing paper, the Memento was regarded representative. Using this ad-hoc technique, they found that representative Mementos existed for about 25% of URI references across the considered corpora. The study also assessed the extent of reference rot but did so at the aggregate level of journal articles instead of URI references, concluding that one out of five STM articles suffered from reference rot. In the work reported here, we make a quantitative assessment on the basis of text similarity measures of the existence of representative Mementos and the extent of content drift at the level of individual URI references.
Assessing the similarity of textual documents is a common problem in the web science and information retrieval realm. Multiple similarity measures have successfully been applied in the past across disciplines and research on the efficacy of these techniques exists. Though we are not engaging in improving web crawlers or document clustering, the success of the following work supports our choices of similarity measures.

Manku, Jain, and Sarma [22] evaluated the use of Charikar’s Simhash algorithm [23, 24] for web crawlers as a way to detect a recently-crawled page. After testing it during a crawl of eight billion web pages, they determined that Simhash was a very effective method of analyzing duplicate web pages for this purpose, due to its efficiency by using small hash fingerprints for comparison.

The above mentioned study by Adar et al. [8] used the Sørensen-Dice coefficient [25, 26] to compare the textual content of the same web resource at different times during their crawl.

Figuerola et al. [27] demonstrated that Kornblum’s Spamsum algorithm [28] can successfully be used to support web crawlers. By conducting tests on more than 80,000 web pages, they established a threshold score of 0.9 to determine if pages had changed between crawls. Jackson also used Spamsum in his 2015 analysis of the UK Web Archive’s holdings [20] mentioned above.

The Jaccard coefficient [29] was, for example, used by Sivakumar [30] in 2015. The purpose of the study was to ultimately improve search results by comparing blocks of text within web pages, as well as identifying and removing duplicate advertisements, headers, and other recurring features present in web sites. Other applications of the Jaccard coefficient include document clustering (Karun, Philip, and Lubna [31]) and keyword similarity analysis (Niwattankul et al. [32]).

Another very popular text similarity measure is cosine similarity [33, 34]. It was, for example, applied by Hajishirzi, Yih, and Kolcz [35] for the detection of duplicate web pages and by Sandhya and Govardhan [36] for comparing documents as part of their document clustering approach. For AlNoamany, Weigle, and Nelson [37] the use of the cosine similarity measure proved essential to detect off-topic web pages in various web archive collections from Archive-It.
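To make the set-based and vector-based measures named above concrete, the following minimal sketch computes the Jaccard coefficient, the Sørensen-Dice coefficient, and cosine similarity for two short texts. It is an illustration only and not the implementation used in this study: the word-level tokenizer, the use of raw term counts for the cosine measure, and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """Jaccard coefficient on token sets: |A & B| / |A | B|."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def sorensen_dice(a, b):
    """Sorensen-Dice coefficient on token sets: 2|A & B| / (|A| + |B|)."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    """Cosine similarity between raw term-count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

All three functions return a value between 0 and 1, where 1 indicates identical token sets (or identical term-count directions for cosine). Because Jaccard and Sørensen-Dice ignore term frequency while cosine does not, they can disagree on texts that repeat the same words in different proportions.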
In addition to the above mentioned five similarity measures, other approaches have been demonstrated in the past. Examples include computing the Levenshtein distance between two DOM trees [38, 39] and calculating the deltas in embedded resources [40] of web pages. However, if the application of a similarity measure involves the HTML elements of web pages, the process relies upon the fact that the structure of the resources to be compared has not been altered. Since web archives often inject additional HTML elements for user experience and branding, and hence alter the DOM structure of our Pre and Post Mementos, we cannot use these measures of similarity. Also, many of our references are not in HTML format, but are instead PDF documents, which renders any DOM tree comparison inapplicable. Further methods to assess similarity rely upon a vast corpus to compute similarity [41] and others, such as Hamming distance [42], are suitable for same-length text strings only. Neither of these methods is suitable for our experiment since we seek methods that compare two documents without knowledge of a larger corpus and we cannot rely on equal length of web pages.

Methods

Our study uses the same URI references to web at large resources, 1,059,742 in total, that were used for [1]. We briefly describe how these URI references were obtained and refer to [1] for a detailed description. The URI references were extracted from three scholarly corpora consisting of 3.5 million articles published between 1997 and 2012: arXiv, Elsevier, and PubMed Central (PMC). We downloaded all articles from arXiv and PMC published in that time frame and used a (meanwhile discontinued) CrossRef API to randomly select DOIs of Elsevier articles. The arXiv corpus mainly covers physics, mathematics, and computer science; the Elsevier corpus covers a wide range of STM subjects; PMC mostly covers biomedical and life sciences. After processing these corpora, for example removing articles without URI references, a dataset consisting of 707,667 arXiv articles, 655,040 Elsevier articles, and 479,194 PMC articles remained. Next, all URIs were extracted from each article using a highly accurate regular expression-based URI extraction approach that is described in [43]. URIs were extracted from all sections of an article including the abstract, the body, footnotes, and the reference section. This resulted in a total of 3,983,985 URI references, which were classified into three categories:

1. References to web at large resources
2. References to journal articles
3. References to be excluded, for example, because of URI syntax errors.

References to web at large resources were the focus of the research reported in [1] and are the starting point of the research reported here too. A total of 1,059,742 of such URIs were extracted from the three corpora. As shown in the top row of Table 1, 346,177 of those URIs are from arXiv papers, 232,712 from Elsevier papers, and 480,853 were extracted from PMC papers.

Table 1. Obtaining Pre/Post Mementos for URI References.

                                         arXiv     Elsevier  PMC       Total
URI references                           346,177   232,712   480,853   1,059,742
No TimeMap                               62,923    34,897    54,068    151,888
No Memento Pre                           36,942    11,870    50,357    99,169
No Memento Post                          17,027    22,303    23,140    62,470
Problem dereferencing Memento            18,594    18,525    28,960    66,079
URI references with Pre/Post Mementos    210,691   145,117   324,328   680,136

doi:10.1371/journal.pone.0167475.t001
We use these URIs for two experiments. The first is aimed at assessing to what extent representative Mementos of URI references to web at large resources exist in web archives worldwide. The second is aimed at assessing the extent of content drift for URI references to web at large resources in scholarly papers. It uses only those URIs for which representative Mementos are found in the first experiment.

Web Archive Lookup of URI References

Web archives continuously create snapshots of URIs around the web. Naturally, they may also take snapshots of URIs that are referenced in scholarly papers. For each URI for which snapshots exist, the Memento protocol [19] and its associated aggregation infrastructure provide a TimeMap that gives an overview of all Mementos for the URI held by public web archives worldwide. Each TimeMap lists the original URI as well as the URI and the archival date of each of its Mementos. At the time the experiment was conducted, the aggregation infrastructure covered all existing web archives that exposed an openly accessible machine interface. There were 19 in total, only 6 of which were also used in [1]: Archive-It, archive.is, Bibliotheca Alexandrina Web Archive, Canadian Government Web Archive, Croatian Web Archive, Estonian Web Archive, Icelandic Web Archive, Internet Archive, Library of Congress Web Archive, NARA Web Archive, Portuguese Web Archive, PRONI Web Archive, Slovenian Web Archive, Stanford Web Archive, UK National Archives Web Archive, UK Parliament Web Archive, UK Web Archive, Web Archive Singapore, and WebCite.
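The TimeMap structure described above (an original URI plus one URI and archival datetime per Memento) can be illustrated with a small parser. The sketch below assumes the TimeMap is serialized in the link format defined by the Memento protocol, and it parses a hand-written sample built from the two IceCube Memento URIs mentioned in the Introduction; the sample is not an actual TimeMap response, the regular expressions handle only this simple quoted-attribute form, and the function name is hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

# One link-format entry: <URI> followed by zero or more ;name="value" attributes.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')

# Hand-written sample for illustration; not a real TimeMap response.
SAMPLE = (
    '<http://icecube.wisc.edu/>; rel="original",\n'
    '<http://web.archive.org/web/20090508003554/http://icecube.wisc.edu/>; '
    'rel="first memento"; datetime="Fri, 08 May 2009 00:35:54 GMT",\n'
    '<http://web.archive.org/web/20090827100339/http://icecube.wisc.edu/>; '
    'rel="last memento"; datetime="Thu, 27 Aug 2009 10:03:39 GMT"\n'
)


def parse_timemap(body):
    """Return (original_uri, [(archival_datetime, memento_uri), ...])."""
    original = None
    mementos = []
    for uri, attr_text in LINK_RE.findall(body):
        attrs = dict(ATTR_RE.findall(attr_text))
        rels = attrs.get("rel", "").split()
        if "original" in rels:
            original = uri
        if "memento" in rels and "datetime" in attrs:
            archived = parsedate_to_datetime(attrs["datetime"])
            mementos.append((archived, uri))
    # Sort chronologically so later steps can search by date.
    return original, sorted(mementos)


original, mementos = parse_timemap(SAMPLE)
```

Sorting the Mementos by archival datetime is a design choice made here so that the entries closest to a given publication date can be located with a simple search.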
We programmatically obtain TimeMaps for all referenced URIs, and from each select two Mementos: the Memento Pre, which has an archival date that is temporally closest and prior to the publication date of the article that references the URI, and the Memento Post, which is temporally closest and past that date. This selection is motivated by the insight that, if these two Mementos are highly similar, meaning that the resource has hardly changed in the pre/post interval that surrounds the publication date, then these two Mementos are very likely representative of the referenced content as it was when the article was published. After having selected these two Mementos per URI reference, we obtain their archived content from the web archive they reside in by dereferencing the URI of the respective Memento found in the TimeMap.

This entire process was conducted over the course of August 2015. To address possible temporary glitches, the process was repeated once for those URIs for which the process had failed in the first run. Table 1 shows the outcome. As can be seen, we were able to obtain content for Pre/Post Mementos for 680,136 of the 1,059,742 URI references in our collection. The Table also details the reasons why the process was not successful for the remaining 379,606 URIs. These include the unavailability of TimeMaps (no Mementos exist), the unavailability of either Pre or Post Mementos (Mementos exist only pre or only post the publication date), and several instances whereby dereferencing a Memento’s URI failed. For URI dereferencing, we used the command line tool cURL to issue HTTP GET requests and configured cURL to follow a generous maximum of 50 HTTP redirects. Conducting this process was straightforward for all web archives except WebCite, an archive that was explicitly introduced more than a decade ago to combat reference rot in scholarly communication. This archive sets a very low and seemingly variable limit to the number of requests a client can issue on a daily basis. Also, programmatically obtaining Mementos from this archive required extracting content from frames; therefore we used the PhantomJS [44] browser automation tool instead of cURL [45].
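The selection of the Memento Pre and Memento Post described above amounts to a search in a chronologically sorted list. The following sketch shows one way to do this, using the IceCube example from the Introduction as input. It rests on assumptions not stated in the paper: the publication date is treated as midnight UTC, a Memento archived exactly at that instant counts as neither Pre nor Post, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timezone


def select_pre_post(mementos, published):
    """Pick the Mementos closest before and closest after `published`.

    `mementos` is a list of (archival_datetime, memento_uri) tuples sorted
    by datetime. Either element of the result is None when no Memento
    exists on that side of the publication date.
    """
    dates = [archived for archived, _ in mementos]
    before = bisect_left(dates, published)   # count of Mementos strictly earlier
    after = bisect_right(dates, published)   # index of first strictly later one
    pre = mementos[before - 1] if before > 0 else None
    post = mementos[after] if after < len(mementos) else None
    return pre, post


UTC = timezone.utc
mementos = [
    (datetime(2009, 5, 8, 0, 35, 54, tzinfo=UTC),
     "http://web.archive.org/web/20090508003554/http://icecube.wisc.edu/"),
    (datetime(2009, 8, 27, 10, 3, 39, tzinfo=UTC),
     "http://web.archive.org/web/20090827100339/http://icecube.wisc.edu/"),
]
# Publication date of the referencing article [3], assumed to be midnight UTC.
published = datetime(2009, 8, 15, tzinfo=UTC)
pre, post = select_pre_post(mementos, published)
```

The two `None` cases correspond to the "No Memento Pre" and "No Memento Post" rows of Table 1, where Mementos exist on only one side of the publication date.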
Assessing the Representativeness of Mementos

We use the 680,136 URI references that remain from the process described above. For each of these references, a Pre and Post Memento exists. Our goal is to measure the similarity between these Mementos and then to declare highly similar ones as representative of the referenced content as it existed at the time of publication of the referencing paper. Doing so involves two distinct challenges: measuring similarity and setting a threshold for a similarity measure above which content is regarded as sufficiently similar.

Measuring Similarity. An inspection of the content types of our Pre and Post Mementos, summarized in Table 2, shows that the vast majority of Mementos are textual. Over 1 million are text/html, about 93,000 are application/pdf, and about 5,000 are text/plain. The numbers for other content types are very small in comparison. Since well established techniques exist for measuring the similarity of textual content, we decide to only retain Mementos with these three most frequently occurring content types. This leads to dismissing 10,112 Pre/Post Mementos, which corresponds to 5,056 URI references.

Table 2. Top 10 Content Types of Pre and Post Mementos.

Content Type                 Mementos    In %
text/html                    1,041,192   90.68%
application/pdf              93,428      8.14%
text/plain                   5,424       0.47%
application/postscript       1,690       0.15%
application/msword           911         0.07%
application/vnd.ms-excel     610         0.05%
image/gif                    493         0.04%
application/xhtml+xml        484         0.04%
application/octet-stream     447         0.04%
image/jpeg                   405         0.04%

doi:10.1371/journal.pone.0167475.t002

In order to be able to compare the content of the remaining Mementos, we extract the text from the HTML and PDF files. This obviously involves removing tags from the HTML and control characters from the PDF. We respectively use lxml [46] and pyPDF [47] to achieve this. However, the text extraction process posed significant additional challenges that required writing custom code. Detailing these is beyond the scope of this paper but a technical report on the matter is available [45]. For the purpose of this discussion it suffices to mention that the most common challenge was removal of content inserted into Mementos by web archives, including JavaScript, CSS, and textual information. Table 3 provides an overview of URI references with Pre/Post Mementos that were excluded from the textual comparison because their content type was not one of the top three selected types or because the text extraction process was unsuccessful. We are left with 648,253 URI references for which the textual content of Pre/Post Mementos can be compared.

With the extracted text from our Pre/Post Memento pairs remaining, we proceed to assess their similarity. We do so by using the text similarity measures introduced in the Related Work section, and their corresponding implementations, shown in Table 4. These four measures were selected because their specific and complementary characteristics provide insight into different notions of similarity. Hence, their combination offers a well-rounded view of changing textual content.

We use Simhash, a hash-based measure that splits two strings into n-grams and creates one vector of n-grams per string. It then computes hash values for both vectors and returns their distance indicating the similarity of both strings. Simhash is often used for large-scale web page comparisons and is designed to be sensitive to editorial changes in the compared texts.
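The Simhash procedure outlined above can be sketched in a few lines. The version below follows the commonly described form of Charikar's scheme: each n-gram is hashed, the hash bits vote on the bits of a fixed-size fingerprint, and two fingerprints are compared by counting differing bits. It is not the implementation used in this study; the character 4-grams, the 64-bit fingerprint, the use of MD5 as the underlying hash, and the whitespace and case normalization are all assumptions chosen for the example.

```python
import hashlib
from collections import Counter

BITS = 64  # assumed fingerprint size


def ngrams(text, n=4):
    """Character n-grams of the whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def simhash(text, n=4):
    """Compute a BITS-wide Simhash fingerprint of `text`."""
    votes = [0] * BITS
    for gram, count in Counter(ngrams(text, n)).items():
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for bit in range(BITS):
            # Each n-gram votes +count or -count on every fingerprint bit.
            votes[bit] += count if (h >> bit) & 1 else -count
    return sum(1 << bit for bit in range(BITS) if votes[bit] > 0)


def simhash_distance(a, b, n=4):
    """Hamming distance between the two fingerprints (0 means identical)."""
    return bin(simhash(a, n) ^ simhash(b, n)).count("1")
```

A distance of 0 means the two fingerprints are identical, and the maximum possible distance equals the fingerprint size. Dividing the distance by `BITS` would yield a normalized score if a value between 0 and 1 is preferred.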
Table 3. Selecting Pre/Post Mementos for Textual Comparison.

                                                 arXiv     Elsevier  PMC       Total
URI references with Pre/Post Mementos            210,691   145,117   324,328   680,136
Not in top 3 content types                       1,358     1,296     2,402     5,056
Text extraction processing errors                1,562     3,064     13,669    18,295
No content after text extraction                 2,864     1,720     3,948     8,532
URI references for Pre/Post Memento comparison   204,907   139,037   304,309   648,253

doi:10.1371/journal.pone.0167475.t003

We initially also used Spamsum as a byte-level similarity measure since it was applied in previous related work [20]. However, we found many instances in which the Spamsum score differed dramatically from other scores and found the explanation in [48, 49] where the creator of the algorithm states that input strings need to be more than 4 KB in length for the measure