Can We Find Documents in Web Archives without Knowing their Contents?

Khoi Duy Vo, Tuan Tran, Tu Ngoc Nguyen, Xiaofei Zhu, Wolfgang Nejdl
L3S Research Center / Leibniz Universität Hannover, Germany
{khoi,ttran,tunguyen,zhu,nejdl}@L3S.de

arXiv:1701.03942v1 [cs.IR] 14 Jan 2017

ABSTRACT

Recent advances in preservation technologies have led to an increasing number of Web archive systems and collections. These collections are valuable for exploring the past of the Web, but their value can only be uncovered with effective access and exploration mechanisms. Ideal search and ranking methods must be robust to the high redundancy and the temporal noise of contents, as well as scalable to the huge amount of data archived. Despite several attempts at Web archive search, facilitating access to Web archives remains a challenging problem.

In this work, we conduct a first analysis of different ranking strategies that exploit evidences from metadata instead of the full content of documents. We perform a first study to compare the usefulness of non-content evidences for Web archive search, where the evidences are mined from the metadata of file headers, links and URL strings only. Based on these findings, we propose a simple yet surprisingly effective learning model that combines multiple evidences to distinguish “good” from “bad” search results. We conduct empirical experiments, quantitative as well as qualitative, to confirm the validity of our proposed method, as a first step towards better ranking in Web archives that takes metadata into account.

CCS Concepts
• Information systems → Document representation; Learning to rank;

Keywords
Web Archive Search, Temporal Ranking, Feature Analysis

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WebSci '16, May 22-25, 2016, Hannover, Germany
© 2016 ACM. ISBN 978-1-4503-4208-7/16/05...$15.00
DOI: http://dx.doi.org/10.1145/2908131.2908165

1. INTRODUCTION

The Web exhibits rapid growth and dynamic changes of its structure and contents, where ephemeral web pages are dominant and content disappearance is commonplace [22]. A study by Toyoda et al. [24] suggests that more than half of the URLs disappeared or changed their location within one year. In order to preserve parts of the Web for future generations, many Web archiving initiatives are active, among which the Internet Archive is the most prominent one. Web archive data, which can be up to petabytes or more in size, consist of several snapshots of contents crawled from the Web at different points in time. Such collections are valuable for journalists, economists, historians, social scientists and others who explore the past, but their value can only be uncovered with an effective access and exploration mechanism. Such a mechanism should rank documents not only by their content relevance to an information need, as in the case of the actual Web, but also by their long-term preservation value as reflected in the archive. Furthermore, Web archives are characterized by the high redundancy of contents and the imbalance of revisions due to selective crawling strategies [12], and thus call for different ranking approaches.

Searching in Web archives has drawn increasing attention in recent years. Early attempts focus on efficiently indexing the full contents of documents, with optimization tailored to special types of queries [3]. Other recent work suggests that non-content evidences such as timestamps [21], hyperlinks [7], or anchor texts [8, 12] can be used to improve ranking performance. For example, the anchor texts of hyperlinks, when aggregated by the target page, can represent the collective perception of the document's essence well [6], in contrast to the actual content of the document, which reflects the author's intent. Similarly, authority evidence of a web page such as its PageRank score can complement the relevance of the page with its importance or freshness [21]. Although several approaches have been proposed and developed, no optimal search strategy for Web archives has been developed yet. Furthermore, while the existing work suggests the benefit of different individual evidences, no study has compared the impact of these evidences in a unified ranking framework, or their contributions and disadvantages when put together.

In this paper, we follow the above school of thought, and argue that using information other than the document contents can help find relevant and important documents with adequate performance, while avoiding the high cost of fully indexing the Web archive. Furthermore, we provide the first study to compare the usefulness of non-content evidences in Web archive search, where the evidences are mined from the document metadata, such as URL strings and file headers, and from the links, such as anchor texts and linking statistics. We conduct experiments with an entity search scenario in mind, i.e. finding documents related to an entity in the Web archive, and focus on entities with low ambiguity to reduce the effect of spurious results. In summary, we investigate the following research questions:

• RQ1: How useful are non-content features of documents for finding relevant documents in a Web archive?
• RQ2: Does the combination of multiple features improve the performance over individual ones?

We explore the above questions on subsets of the Internet Archive dataset, with our experiments focusing on the .de domain. We investigate the problem of exploration in the Web archive, where the queries are unambiguous entities identifiable from Wikipedia, and the documents are web pages with several revisions captured in the archive^1.

^1 In this paper, we will use “document” and “web page” interchangeably.

We examine several features derived from different sources of the documents, without processing the full text of documents. While the results confirm our assumption that non-content evidences are valuable resources and deserve more attention, they also give interesting insights into the influence of features in different settings. Based on the findings of this analysis, we propose a simple yet effective ranking model that combines multiple evidences to distinguish “good” from “bad” search results in a Web archive without knowing their full contents, and a scalable approach to obtain the training data for our learning models without much human effort.

The remainder of the paper is organised as follows. After the discussion of related work in Section 2, in Section 3 we describe the dataset used in our study, the characteristics of the dataset, and some first exploratory analyses. Section 4 reports the analysis of Web archive search using non-content evidences. In Section 5, we discuss our ranking model based on non-content features, and show that it is indeed able to distinguish “good” from “bad” search results. Finally, in Section 6, we close with a discussion of future work and conclusions.

2. RELATED WORK

Related work can be categorised into three groups. The first group consists of work on building infrastructure for indexing and accessing Web archives. The second group consists of work on exploiting non-content features in information retrieval. As there exists a vast amount of work in this area, we focus on existing work on retrieving documents using anchor texts, as this raises several interesting questions for our analysis. The third group consists of ranking approaches in Web archives.

2.1 Indexing and Searching Web Archives

Web archives are used to preserve digital heritage and the web for future use [18]. One of the major issues in this context is how to provide infrastructures for indexing, searching, and accessing these often huge archives [23]. This leads to a number of technical difficulties which have to be solved. Stack [23] proposed a modification of Nutch to enable web archive indexing and full-text search on larger-scale collections, increasing the capacity of Nutch from under 100M to higher amounts. Lin et al. [17] provide a scalable tool using the MapReduce paradigm to parse data from and to the index.

2.2 Exploiting Anchor Texts for Retrieval

Anchor text of links in documents has been widely adopted as promising information in the context of Web search. A number of studies find that anchor text is useful for information retrieval when queries are navigational [6, 11], as well as for ad hoc information search [15]. In [6], Craswell et al. aggregated anchor texts for a target page and used them as surrogate documents for finding sites. In [11], Fujii et al. studied the effectiveness of content-based or anchor-based retrieval methods for different query types. Both [6, 11] concluded that incorporating anchor text can significantly improve the retrieval effectiveness for navigational queries. Koolen et al. [15] further investigated the importance of anchor text for ad hoc information search and showed that anchor text based methods can significantly improve retrieval effectiveness.

In the context of Web archive search, anchor texts can help in additional interesting ways. For example, they can be used to address the incompleteness of the archive collections and help retrieve unarchived Web pages, i.e. Web pages that were once present on the Web but not preserved in the archives. Klein and Nelson [13] leveraged the top-n words of link anchors to extract lexical signatures of such missing webpages, which were then used to retrieve alternative URLs for these missing webpages. Similarly, Huurdeman et al. [12] used links and anchors in the crawled webpages to recover unarchived webpages. Other research work combined anchor texts with additional information. Some researchers aggregated anchor texts from all pages that link to a page in the same domain [19], or the same website [14, 16], as the target page. Regarding time, historical trends of anchor texts have also been investigated for estimating anchor text importance. For instance, Dai et al. [8] differentiated pages with different in-link creation rates, and concluded that ranking performance can be improved by taking in-link creation rates into account. Nguyen et al. [20] study the problem of mining temporal subtopics in the Web archive. They propose to mine the relevant time of temporal subtopics by leveraging the trending behaviour of the corresponding anchor texts.

In this paper, we focus on a different scenario, an exploratory search scenario, and find that anchor texts alone do not provide an optimal solution. We incorporate anchor text and other non-content features, without involving the content of webpages, and show that this combination is better able to distinguish between good and bad search results.

2.3 Ranking Models in Web Archives

There are a number of approaches addressing the issue of ranking in Web archives, which can be classified into three main lines. The first approach is based on full content indexing, suggested in the early work by Berberich et al. [2]. This approach takes temporal expressions into account and integrates them into a language model retrieval framework. However, the method only works for limited types of queries. The second approach consists of graph-based methods, exploiting the hyperlinks in the documents of the archive. In [7], Dai et al. leverage features from historical author activities and propose to consider the authority over multiple web snapshots at different time points. They modify the traditional link-based web ranking algorithms by further incorporating web page freshness over time from page and in-link activity. Nguyen et al. [21] attempt to discover content that can cover the most interesting time periods for a given topic. To this end, they design a novel graph-based model by integrating relevance, temporal authority, diversity and time in a unified framework. A third line of research is represented by the learning approach of Costa et al. [5]. They assume that closer time periods are more likely to hold similar web characteristics. Based on this assumption, they propose a novel temporal-dependent ranking framework which exploits the variance of web characteristics over time. In contrast to this work, our work targets ranking documents and not revisions, and focuses on non-content features, to provide a more light-weight ranking framework.

3. WEB ARCHIVE DATA

3.1 Dataset

In this work, we use subsets of Web archive data provided by the Internet Archive, covering September 1996 to the end of December 2014. We focus our study on web pages in the .de top-level domain. This collection consists of 762,008 archive files, including 124,743 .arc files and 637,265 .warc files, which comprise 55.6 TB in volume^2. It contains 1,434,118,956 URLs and 141,835,258,519 revisions, belonging to 9,880,121 domains and 702,398,802 core URLs (i.e., URLs obtained after excluding query strings). Revisions are not evenly distributed over URLs, but instead heavily biased towards a small portion of domains, due to a number of selective crawling strategies [12]. This is known as the incompleteness problem, which is ubiquitous not only in the Internet Archive dataset, but in all Web archives [12]. We take this characteristic into account in our analysis in the next sections.

^2 All collection files are compressed in gzip format.

In this paper, we focus on investigating the usefulness of document metadata in Web archives without analysing the full contents. The metadata come from the following major sources: the URL of the document, the header of the archive files (.arc and .warc formats), and the hyperlinks. As for the URL, we tokenize the string into words using a fixed set of defined delimiters (e.g. “/”, “.”, “-”, “_”). The hyperlink information requires slightly more work to extract; we describe this separately in the following.

3.2 Preprocessing

Web Archive Graphs. Hyperlink information has been proven to be a powerful indicator for searching in Web archives [21, 7]. Inspired by previous work, we extracted the link structure of the Web archive documents in our set. There are 305,726,983,071 links between revisions, with 14 link types (shown in Table 1). Most link types are links to backgrounds or images in documents. We select only links to content pages, identified by the tag <a>. There are 161,728,553,318 links after removing noisy pages. The existing links contain anchor texts to a document revision in the archive, with approximately 2 links per revision on average.

HTML tag pattern              Destination type
A/href                        Either a hyperlink or an anchor
IMG/src                       Image source
AREA/href                     Area inside an image-map
EMBED/src                     Embedded object
FRAME/src                     Frame source
INPUT/src                     Input source
IFRAME/src                    Iframe source
FORM/action                   Form processing URL
TD/background                 Table cell background image
TR/background                 Table row background image
BODY/background               Page background image
OBJECT/codebase               Code of embedded object
TABLE/background              Table background image
FB:LOGIN-BUTTON/background    (Facebook login) button background image

Table 1: List of link types and their patterns, as used to extract from the Internet Archive data

Web Archive Anchor Texts. One special aspect of a hyperlink is the anchor text, which is a visible text snippet used to label the link. To some extent, anchor texts encode the testimony of the users about the target and, when put together collectively, can represent a surrogate of the document. We extracted 161,728,553,318 distinct anchor text links from the hyperlinks, with 118 links per document on average.

Indexing. Inspired by early work on anchor text-based search [6], we represent a web document by concatenating all anchor texts pointing to it, and utilize a full-text index to index this document “anchor representation”. We use ElasticSearch^3 to achieve indexing capacity at large scale. One subtle issue is that in Web archives, a link can be repeated multiple times due to crawling strategies. For such links, we apply two strategies: (1) keep one unique anchor text for a source-destination pair for each revision, and (2) keep all anchor texts across revisions, even though it is possible (and likely) that such revisions are identical. This allows us to exploit the accumulated statistics of anchor texts over time, and to analyse the effect of time on the anchoring behaviour^4.

^3 https://www.elastic.co/
^4 In strategy (2), we truncate the anchor texts of 26 URLs whose length exceeds the storage limit of ElasticSearch.

There are 26,443,384,902 destination URLs for all links in the web archive. However, we only make use of the ones which are archived in our dataset, resulting in 990,031,302 corresponding documents. Among these, there are 100,047,693 documents that have no anchor text at all, which we remove, reducing the number of retrievable documents to 889,983,609. This data is indexed by ElasticSearch as mentioned above. The resulting documents serve as potential search results from the archive.

3.3 Anchor Texts Distribution

As mentioned above, anchor texts have been widely used as useful metadata besides full contents to virtually represent the (target) document, with a rich body of related work (Section 2.2). In this paper, we investigate this further by analysing the distribution of anchor texts in the context of Web archives.

[Figure 1, panels a) to d)]
Figure 1: Distribution of Anchor Texts in .de domains of Internet Archive. Y-axes correspond to the number of anchor texts, X-axes are the count of target URLs per anchor text. All axes are in log scales. a) and b) are distributions for the 350 biggest domains, judged by the number of Web pages; c) and d) are distributions over all domains. b) and d) are evolutions of the anchor text distribution over years in the archive.

Figures 1a-d show the distribution of anchor texts measured by the number of destination Web pages in log-log scales. From these figures, a heavy-tail distribution of anchor text usage can be observed. Another observation is that anchor texts follow a power law distribution quite strictly in the head part, while the tail part exhibits high noise for popular anchor texts. The degree of noise increases over the years, as shown in Figure 1d. In this figure, we group the distribution by projecting the timestamps of the source Web pages for each hyperlink, grouping them on a yearly scale, drawing the distribution for each year, and putting them on one timeline from 2007 to 2013. From this timeline, we can see the continuous increase of noise in the tails over the years. We believe that this is mainly due to the imbalanced number of snapshots of websites resulting from the crawling strategies, which leads to a bias in the sampling of domains in the archive (i.e., not all Web pages in a domain are crawled, or they are crawled at different frequencies).

To verify this assumption, in Figures 1a and 1b, we analyse and show the distributions for the 350 biggest domains, judged by the number of member Web pages preserved under the domain. For these domains, we observe that the crawling policy is more consistent over time, i.e. Web pages are all aggressively and comprehensively crawled regardless of their active time periods. This results in a balanced number of revisions crawled for both rare and popular anchor texts drawn from these domains. In the figures, the tail becomes much less noisy. This finding is interesting, as it reveals that the revision counts (driven by the crawling policies) have an implication for the distribution of anchor texts, making anchor texts different from those normally drawn from the contents of the Web pages. This suggests that they should not be exploited in isolation, but rather in combination with other evidences such as the size and crawling frequency of domains. In the next sections, we look further into the correlation between features and the quality of retrieved documents.

4. ANALYSIS ON WEB ARCHIVE SEARCH

As we are interested in exploiting non-content information for search, we investigate the two research questions described in Section 1. For question 1, we focus on entity search scenarios, described below, as the anchor texts for entities are more intuitive and interpretable, and also less noisy. For question 2, we analyse the correlation between the relevance of documents and different non-content features.

4.1 Entity Queries

Analysis Setup. In order to avoid spurious effects of ambiguous queries in Web archive search, we limit our queries to entities identifiable by a Wikipedia page. We choose Wikipedia pages that are not list pages or disambiguation pages, and whose titles contain no commas or round brackets, which indicate potential ambiguity of entity names^5. Our query set consists of 216 queries, chosen from entities with both high and low popularity^6. Specifically, we use the page view counts (extracted using Hedera [25]) of a Wikipedia page as a proxy for the popularity of the corresponding entity, partition the view counts into three buckets (Top, Middle, Bottom), and randomly choose entities in each partition to build the sample. We also group entities into different types to facilitate high-level analysis. We first used the DBpedia knowledge base [1] to resolve the types of the sampled entities, and manually mapped each entity to its most plausible coarse-grained type. Table 2 lists the types studied in our work. We acknowledge that the list is still primitive, and leave a deeper analysis with a more advanced taxonomy for the future (for example, multiple classes per entity).

^5 https://en.wikipedia.org/wiki/Wikipedia:Article_titles
^6 The full list of queries is omitted due to space limits, and is available upon request from [email protected]

Type                Examples
Politician          Barack Obama, Angela Merkel
Scientist           Max Planck, Josef Meixner
Artist              Michael Jackson, Scorpions
Sport player        Andreas Nilsson, Franz Beckenbauer
Author              Franz Hoellering, Artur Kaps
Entrepreneur        Enzo Ferrari, Artur Mest
Organisation        Maharashtra Film Company
Product             Chevrolet Vega, Apple Macintosh
Location            Volken, Larrabee State Park
Works               Volken, Alien Trespass
Event               FIS-Ladies-Winter-Tournee 2011
Biology             Rote Mangrove, Ortolan
Music               Appenzeller Streichmusik
Abstract Concept    [illegible in source]
Astrology           [illegible in source]

Table 2: List of Entity types

Example Analysis. Figure 2 shows the query “Angela Merkel” as covered on the Web (worldwide and in Germany), and as covered in our .de web archive. The Web coverage timelines are derived from Google Trends, while for the .de web archive, we derive them from our anchor text index described in Section 3. All values in the timelines are normalized to [0,1] by dividing by the maximum value.

[Figure 2: timelines for “worldwide”, “Germany” and “.de archive”, 2004-01 to 2013-03, y-axis normalized coverage]
Figure 2: The timeline of coverage of the query “Angela Merkel” from the Web (worldwide and in Germany), as compared to the timeline of coverage in the .de domain in our Internet Archive dataset.

The first observation from these timelines is that query coverage on the Web follows bursty patterns, with big spikes reflecting important events, quite synchronized between German and international websites. For instance, the first spikes at the end of 2005 correspond to Angela Merkel taking office as the chancellor of Germany. As for the archived anchor dataset, the trends are only visible during the periods of big events observed on the actual Web, i.e. the end of 2005 (her first term as chancellor) and the period of 2011-2012 (the active period of the Eurozone crisis). For many other medium-size events, we do not observe peaks, and the ratio of coverage stays relatively stable. We believe that the main cause of this asynchrony is that the purpose of putting an anchor text in a hyperlink from one content to another is to endorse the destination content, or to establish the contextual relevance between the two contents. This is different from queries on the actual Web, which directly encode the user's information need. The implication is that any existing work that relies purely on anchor text as the source of retrieval can only achieve adequate performance on certain sets of queries, such as finding a specific URL [6] or searching for highly debated topics [21]. For other types of information need, combination with other evidences is necessary.

4.2 Document Metadata and Search

Analysis Setup. In this section, we investigate the influence of different non-content evidences on Web archive search. To start with, we choose three main sources of evidence for the preliminary analysis:

1. URL: We investigate the depth of the document URL, i.e., the level of sub-domains or directories in which the corresponding web page resides on the Web server. A lower URL depth corresponds to more general web pages such as home pages, while a higher depth indicates that the document represented by the web page is likely to mention more specific information (such as product details on an e-commerce site).
2. Anchor Text: We count the frequency with which the query appears in the anchor text representation of the document (see Section 3, Indexing) as another evidence.
3. Revision Number: We count how many times a web page has been crawled. To some extent, this reflects the importance of the web page, as we observe popular and highly dynamic web pages to be captured more often than others.

Bing Search Engine as relevance proxy. For each entity query, we query our index to get all documents that contain the query either as part of the anchor text, or in the URL string of the corresponding web page. We call this dataset A. This dataset contains 1,276,900 URLs per query on average (the median value is 37,199 URLs per query). We also issue the query to the Bing search engine and fetch the URLs from the top-100 results that overlap with our results from the index. We call this dataset B. The intuition for this setting is that we assume the top results returned by Bing are more likely to be of higher quality and relevance to the query of interest. To somewhat reduce the bias introduced by time, we issued the query to Bing at different points in time over the last half year, and merged the top-100 URLs of all results.

Analysis. Figure 3 shows the distributions of the averaged evidence scores per query, as derived from result sets A and B. They are calculated by obtaining the retrieved documents for each entity query (either returned by Bing or by the local ElasticSearch index), calculating the corresponding feature values for each document and then taking the average. We observe that documents returned on the Bing first page tend to have a lower depth of URL strings compared to documents in general (median 1.82 compared to 2.87), as well as a higher number of revisions (3.80 compared to 2.74), and appear in more links where the queries are used as the anchor texts (10.00 compared to 1.93). On the other hand, these evidence scores of the top results exhibit a higher degree of diversity, reflected by the wider quartile windows in all three (for the Anchor Text evidence, the first and third quartiles of the top results are [4.56, 16.81] compared to [0.00, 5.15] for all results).

[Figure 3: three panels, “URL Depth”, “Revision No.” and “Anchor Frequency”, each comparing A and B]
Figure 3: Distributions of the three evidences in all results returned by the archive index (A), and in top results returned by Bing (B). Each point corresponds to an average value of the evidence derived from all results of a single query.

In Figure 4, we further investigate the correlation between this evidence diversity and the entity types. The scores, which are obtained in the same way as for Figure 3, are presented in log scales, and thus include some negative values (for instance, the query of the “Abstract Concept” entity type appears only 0.59 times on average in the top URLs returned by Bing). For different entity types, we observe a big difference in the scores. For example, queries about politicians or sport players tend to have results with deep URL strings, as they often appear in news or other well-organised websites. They are also captured more often than others, reflected in higher average revision counts. On the other hand, queries about authors (playwrights, novelists) have the least crawled results, mainly because many of them have static contents. As for the URL depth and anchor frequency, the highest average scores are for queries about locations. Analysing deeper, we observe that the location terms appear much more frequently in the result web pages, including boilerplate parts of the pages such as disclaimers, ads, headers, etc. This preliminary analysis suggests that in general, any search system that deals with entity queries must take into account different dimensions, for example the inherent semantics of the entity types.

[Figure 4: three panels, “URL Depth”, “Revision No.” and “Anchor Frequency”, one column per entity type]
Figure 4: Distributions of the three evidences from URL, revision count and anchor. Each column corresponds to an average value of the evidence for one query type (derived from top results by Bing). The y-axes are evidence scores in log scale.

5. RANKING BY MULTIPLE EVIDENCES

Based on our previous analysis, in this section we investigate the effectiveness of combining multiple evidences from metadata to distinguish good from bad search results from the Web archive. We extract different features and rely on learning-to-rank models to provide a unified ranking score.

5.1 Features

Table 3 lists the set of features investigated in our ranking models. In the following we explain some selected features.

NewsURL: This feature estimates the newsworthiness of content via its domain. We curate from the Wikipedia page “Liste deutscher Zeitungen” and from the web page “Paper Boy”^7, among other sources, to obtain a list of 507 German news sites.

^7 http://www.thepaperboy.com/germany/newspapers/country.cfm

WikipediaURL: This feature indicates the credibility of the content via its domain. We assume that domains that have more citations from the Wikipedia page of the entity query are more topically relevant. For instance, many citations from the Wikipedia page of Michael Jackson are from entertainment websites, while in Max Planck's Wikipedia page they are mainly from scientific websites. We measure this feature by the number of times a domain is cited in the Wikipedia page of the entity.

AnchorTimeSpans: Besides the anchor text frequency feature (mentioned in Section 4), we also consider the temporal dimension of the links. The intuition is that in the Web archive, the time when the anchor texts appear also reveals some information about how important the target is. If two anchor texts come from two links of revisions that have close timestamps, they might not count as completely separate endorsements. For example, if a Web page is crawled twice within a few hours, the contents might be just identical, and therefore counting the anchor texts twice would give falsely high endorsements to the target documents. Hence, for this feature, we measure the distance between any two consecutive timestamps of the anchor texts, and count the number of time spans whose distance is longer than 1 week.

RevDuration: This feature relies on the frequency of the crawlers to estimate the quality of the documents. If a document has many revisions that are crawled within a short time period (e.g., a few days), it might have less authority than other documents with revisions crawled in different years (the Web pages are still relevant to users, or are still referred to from other Web pages posted later). For this feature, we measure the number of durations between two consecutive revisions that are at least 1 week long.

PageRank: We also calculate the authority scores of the document by constructing the graphs of different links. The PageRankCore scores are calculated on the graph of direct hyperlinks between two web pages, and the PageRankDomain scores are calculated based on links between domains.

Source         Name              Description
URL            URLDepth          The depth of the document URL
URL            QueryString       Whether the document URL contains query strings
URL            SearchWord        Appearance of searching words (“such”, “suchergebnis”, “query=”, etc.) in the document URL
URL            QueryInURL        Frequency of query terms in the tokenized document URL
URL            NewsURL           Whether the document URL belongs to a news domain
URL            WikipediaURL      Whether the URL has been cited in the Wikipedia page of the queried entity
Links          InlinkCount       The number of incoming links pointing to the document in the Web archive
Links          PageRankCore      PageRank score of the document in the hyperlink graph
Links          PageRankDomain    PageRank score of the domain in the hyperlink graph, where nodes are projected to domains
Anchor Texts   AnchorFreq        The fraction of incoming links pointing to the document in the Web archive, and having queries in their anchor texts
Anchor Texts   AnchorTimeSpans   Time spans of two consecutive timestamps of the anchor texts that are longer than 1 week
Anchor Texts   LuceneTf          Max value of Lucene term frequencies [10]
Anchor Texts   DocLen            Document length counted by anchor texts
Anchor Texts   FieldNorm         Lucene field-length normalization score of anchor texts [10]
Anchor Texts   Idf               Inverse frequency of documents that contain the query terms [10]
Metadata       Revision          Number of revisions of the document
Metadata       RevDuration       Count of time spans of two consecutive revisions that are longer than 1 week
Metadata       DomainSize        Count of web pages having the same domain as the document
Metadata       EntType           Type of entity represented by the query

Table 3: Selected features for analysing the ranking models, as grouped into different categories of evidence sources

5.2 Experiment Setup

5.2.1 Training Data and Sampling

To study how the features can be combined in a unified learning-to-rank model, we need relevance feedback for documents in the Web archive, i.e. the training data labels. Since there is no standard benchmark for this problem, and since the focus of our work is to study the influence of features coming from different evidences, in this first study we choose two pragmatic approaches for building the training data labels, described subsequently. First, we choose a subset of 15 entity queries from the testbed for our further experiments and case study analysis. As judging all URLs returned in dataset A is infeasible, we decide to evaluate on a small random sample of the results, constructed as follows. For each feature dimension, we divide the results into three partitions according to the order of their normalized scores. Then we randomly pick 20 to 50 Web pages from each partition. The sample is then obtained by pooling all Web pages from all features.

[...]ing their content in the browser and in the Internet Archive archive.org site. Each evaluator annotated the documents from the sample on a three-value scale: 0 means irrelevant, 1 means relevant but not important, and 2 means relevant and important. Here the notion of “importance” is defined by guiding the evaluator to think of the document in a long-term exploration scenario, and to estimate whether the document contains lasting information about the queried entity or not. For example, a biography page about Albert Einstein should be labeled as 2, while an ad page about a product with an Albert Einstein portrait should be labeled as 1. The average Cohen's Kappa for the evaluators' pairwise inter-agreement is κ=0.75, suggesting that the assessment task has fair cognitive complexity.

Adding Positive Examples: To make sure the training data have enough positive and negative labeled URLs of both strategies, after pooling Web pages by features, we include all results of our dataset B in the sample. We acknowledge that the sample has a strong effect on the results
Oursamplingalgorithmmakes and models learnt, and plan future experiments with more sure there are overlaps between the samples of each feature advanced sampling strategies and larger samples. However, pair, so as to reduce the total size of the pool. In practice, ourcurrentsamplesatleastallowsustoanswerthequestion thisresultsin400sampleddocumentsperqueryinaverage. how well we can distinguish between good and bad search Then, two labeling strategies are applied: results, even if we cannot extrapolate these results to our full dataset A, which contains many more potential search Soft Labeling: SimilarlyastheapproachdiscussedinSec- results. tion 4, we compare the results retrieved by our index with 5.2.2 RankingMethods the top-100 results returned by Bing. If found, the label of thedocumentismeasuredbytheinverseofBingrank,other- Baselines. We consider the following baselines: 1) The wisethedocumentislabeledzero. Thisenablesustoexploit BM25-basedrankingscoreasreturnedfromouranchortext- the order in Bing search results for pairwise comparison in based index; 2) PageRank score; 3) Scores based on fre- a learning-to-rank model. The advantage of this approach quency of queries in the document URL string (QueryIn- is that we can easily scale up training data construction in URL). Each of these baselines corresponds to one source of future work. evidence, discussed in the analyses in Section 4. Manual Labeling: Besidestheautomatedapproachusing Learning Models. For the learning-to-rank method, we Bing search results, we also experimented with manual la- employ Random Forests (RF)[4] as implemented in the beling. Four evaluators annotated the documents by check- RankLibtoolkit[9]. Weuse5-foldcrossvalidationandNor- malized Discounted Cumulative Gain (NDCG) scores at search results, so cannot be generalized to performance of top-10 results to optimize the parameters. RF models are our ranking model on our larger dataset A. 
We will inves- trained on both training data with soft labels derived from tigate in future work how the two models perform on sub- Bing(denotedRF-B)andwithmanuallabels(denotedRF- stantially larger scale datasets. M). In each case, we use the labels of the same strategy We also see that the NDCG scores of both models are as ground truth for the evaluation (so RF-B is evaluated lower than Precision at the same cut-off threshold. This against Bing results and RF-M against human feedback). suggests that although we are able to distinguish relevant As we investigate in this first study the potential of using andimportantdocumentsfromlessrelevantones,theranks non-contentfeaturestoretrievedocuments,weleavesophis- in the top results are still not good as they should be (i.e. ticated settings of cross-model and cross-label evaluations relevantbutnotimportant),leavingadditionalroomforfu- for the future. ture improvement of the ranking models. 5.3 EmpiricalResults 5.4 FeatureAnalysis We use standard information retrieval metrics Precision RF-M RF-B (at different cut-off threshold of 1 and 10), NDCG as well as MAP for the evaluation. 1. InlinkCount 1. Revision 2. QueryInURL 2. InlinkCount 3. Revision 3. DomainSize P@1 P@10 NDCG@10 MAP 4. Idf 4. QueryInURL BM25 0.153 0.123 0.527 0.173 5. DomainSize 5. URLDepth PageRank 0.077 0.085 0.274 0.042 6. PageRankDomain 6. Idf 7. PageRankCore 7. AnchorTimeSpans QueryInURL 0.154 0.185 0.700 0.202 RF 0.870(cid:78) 0.736(cid:78) 0.754(cid:78) 0.804(cid:78) 8. LuceneTf 8. FieldNorm 9. AnchorTimeSpans 9. LuceneTf 10. AnchorFreq 10. PageRankDomain Table 4: Retrieval performance using Bing 11. URLDepth 11. AnchorFreq ranks as ground truth (RF-B). Symbol (cid:78) indicates significant improvement over Table 6: Top Features by Information Gain BM25. InTable6,weshowthetopfeaturesorderedbytheirim- portance, as measured using Information Gain, of the two P@1 P@10 NDCG@10 MAP learning models. 
InlinkCount and Revision are among the most discriminating features. QueryInURL is also a very BM25 0.385 0.269 0.517 0.270 useful feature, ranked second in RF-M and 4th in RF-B. In PageRank 0.077 0.277 0.508 0.261 ourexperiments,anchortextdidnotgetintothetopranks, QueryInURL 0.384 0.353 0.553 0.332 RF 0.846(cid:78) 0.769(cid:78) 0.760(cid:78) 0.798(cid:78) but stays at position 10th and 11th for the two models. If weincorporatethisfeaturewithtimedimension,itbecomes more discriminating (position 9 in RF-M and 7 in RF-B). Table 5: Retrieval Performance based on manual labels (RF-M). Symbol (cid:78) indicates Inaddition,inourRF-Bmodel(i.e. fordistinguishingBing searchresultsagainstothers),exceptInLinkCount,alltop-5 significant improvement over BM25. featuresare“light-weight”featuressuchasRevision,Query- Table 4 shows the performance of the baselines and of InURL,URLDepthwhichcanbeextractedeasilyfromdoc- our learning model RF-B, using Bing ranks as soft labels. ument metadata. For the DomainSize feature, we can also Table 5 shows the performance of the baselines and of our performaone-timeprocessjobtocomputethevalues. This learning model RF-M, using manual labels. In both set- will help us to scale to larger subsets of Web archives. tings, PageRank performs poorly, mainly because it does not target directly the query-driven relevance of the doc- 5.5 CaseStudyAnalysis uments. BM25 has better performance, and works better To look at our results in more detail, in Table 7 we show when judged by the manual labels than by comparing with thecomparisonoftop-5resultsasretrievedbythetwomod- Bing ranks. Given that this baseline is the same in both els. To guarantee a fair comparison, we only retrieve the experiments, it reveals that our human relevance feedback documents that are available in the results of both models, ismorerelaxedcomparedtotheBingsoftlabels. 
Asurpris- andshowtheminthesameorderastheyarerankedineach ingresultisthatitsperformanceiscomparabletoQueryIn- result list. URL,andactuallyslightlyworsewhenweexpandthecut-off threshold (from top-1 to top-10 results). In all three met- Angela Merkel: The top-1 result using the model learnt by ricsPrecision,NDCGandMAP.Inotherwords,forgeneral manual labels is the biography web page of Angela Merkel entity search scenario with no special information need, it hosted by wiwo.de, not included at top rank for the Bing mightbemoreusefultolookintotheURLstringandguess experiment. The pages at the 5th rank of both models are whether a web page is a good resource to explore, than to news topic pages, although the first one (zeit.de page) is examine the raw anchor texts without any other context. no longer available in the actual Web, and is replaced by Applying learning models introduces a significant im- another page on the same topic. provement on the retrieval in both settings (both with p- value<0.001). Notehowever,thatRF-MandRF-Bvalues Albert Einstein: The first-ranked document is the same are not directly comparable, as they are evaluated against for both models. For the other 4 top results, however, different ground truths. 
In addition, the high values are the model learnt by manual labels tends to be more di- caused by our small samples which include many positive verse, with one result (einstein-gymnasium-hameln.de) Query RF-M RF-B wiwo.de/koepfe-der-wirtschaft/angela- spiegel.de/international/topic/angela_merkel merkel/5288044.html spiegel.de/international/topic/angela_merkel spiegel.de/thema/angela_merkel Angela Merkel spiegel.de/thema/angela_merkel sueddeutsche.de/thema/angela_merkel sueddeutsche.de/thema/angela_merkel angela-merkel.de/ zeit.de/schlagworte/personen/angela- welt.de/themen/angela-merkel/2 merkel/index∗∗ spiegel.de/thema/albert_einstein spiegel.de/thema/albert_einstein amazon.de/albert-einstein-biographie- zitate-online.de/autor/einstein-albert Albert suhrkamp-taschenbuch/dp/3518389904 Einstein zitate-online.de/autor/einstein-albert einsteingalerie.de/ sueddeutsche.de/thema/albert_einstein einstein-website.de/z_kids/biographiekids.htm∗ einstein-gymnasium- helles-koepfchen.de/albert_einstein hameln.de/schule/einstein/einstein.php volkswagen.de/de.html volkswagen.de/de.html autohaus24.de/neuwagen-kaufen/volkswagen auto.de/kfzkatalog/vw Volkswagen volkswagen-nutzfahrzeuge.de/de.html motor-talk.de/forum/volkswagen-b22.html motor-talk.de/forum/volkswagen-b22.html volkswagen-nutzfahrzeuge.de/de.html spiegel.de/thema/vw autohaus24.de/neuwagen-kaufen/volkswagen sueddeutsche.de/thema/bruce_willis sueddeutsche.de/thema/bruce_willis amazon.de/bruce-willis/e/b000apunwa∗ kino.de/star/bruce-willis/8453∗∗ Bruce filmstarts.de/personen/197-bruce- new-video.de/darsteller-bruce-willis Willis willis.html kino.de/star/bruce-willis/8453∗∗ moviemaze.de/celebs/15/1.html∗∗ moviemaze.de/celebs/15/1.html∗∗ new-video.de/darsteller-bruce-willis Table 7: Comparison of top-5 results between RF models learnt using using manual labels and Bing rank. ∗ indicates URLs that are not alive anymore in the Web, whereas ∗∗ indicates URLs that are now redirected ones, checked in April 2016. 
having nothing to do with Albert Einstein as the physi- areaboutmovieswhereBruceWillisactedandnotabouthis cist. In constrast, the model using Bing ranks gives bet- information as a celebrity. Perhaps such information need ter results, all biographic information to the physicist Al- (search for Bruce Willis’ films) is more popular, as Bruce bert Einstein. Also, it manages to find the relevant page Willis is not known for many scandals and thus solicits less einstein-website.de/z_kids/biographiekids.htm in the personal commentsontheWebthanother celebrities. This archive, which is no longer available on the Web. Looking bias is handled better by relying on Bing labeling, rather intotheuserevaluationfeedback,weseethatuserstendnot than a small number of users labeling results. to agree on marginal cases where Web pages about Albert Einsteinactuallydiscusssomedistantlyrelatedinformation 5.6 Discussion such as the places named after him, or Web sites simply using the name Albert Einstein to symbolize outstanding In general, for top results, both models agree in many intelligence,soourmanuallabelsarenotreallyveryreliable cases, with some small difference between queries such as in this case. “AngelaMerkel”. ComparingtheanalysisinSection5.5with ourempiricalevaluationinSection5.3,weseethatinseveral Volkswagen: Both models give good results. Although the casesthemodellearntusingBingranksassoftlabelstends ranks of documents differ, it is difficult to judge which one to prefer documents which are more focused on the entity isreallybetterthantheother. Webelievethereasonisthat andalsomore important. AstheresultsofBingcomefrom the underlying information need is very diverse: User can advanced models with a much richer set of dimensions and searchforthecompany,orsearchforthecarproducts,either features,theintegrationofmorefeaturesintoourmodelwill forcommercialpurposesortolookuptechnicalinstructions. 
certainlyenableittoreflectbetterthemainaspectsembed- Thisambiguitymakesitdifficultforbothmodelstolearna ded in current search engine rankings. Given the small size sensible meaning from simple labeling approaches. of samples in our evaluation and the inclusion of all“good” results(fromBing),itwouldbeinterestingtoseehowgood Bruce Willis: Both models agree on the first position, but ourmodelscandealwithlargersubsetsofthearchive,which RF-M gives second position to an Amazon page with a will be more noisy and include more spamming page. list of Bruce Willis’ films, which is now only available in Forentityqueries,weseethatsomehighlyvaluablepages, archive.org. ItalsopushesredirectURLs(fourthandfifth suchasWikipediapagesforthebiographicalinformationof positions), because of the anchor texts pointing to. Using the entities, are not shown in the top results. It is inter- Bingrank,RF-Bdeliversbetterquality,withalltop5results esting to see how we can encode such credits with more biographic and topic pages about the movie star. Looking features, tailored to entity exploration scenarios. In this closelytotheresults,wecanseethatmostofthetopresults context, other issues such as spamming URLs, user quali- tative evaluation, etc. should be analysed deeper. For in- [9] V.Dang.RankLib. stance, the comparison between two models should not be https://sourceforge.net/p/lemur/wiki/RankLib/,2016. limited to quantitative assessment, but also be extended to [Online;accessed12-Feb-2016]. theusersatisfactionmeasurement,toreallyunderstandthe [10] Elastic.Lucene’sPracticalScoringFunction. https://www.elastic.co/guide/en/elasticsearch/guide/ usefulness of the document non-content evidences. We can current/practical-scoring-function.html,2016.[Online; investigatefurtherthesearchscenariosfordifferenttypesof accessed12-Feb-2016]. 
entities (e.g., the information need when user searches for [11] A.Fujii.Modelinganchortextandclassifyingqueriesto a celebrity is substantially different than when she searches enhancewebdocumentretrieval.InWWW,pages337–346, forascientist,oratravellinglocation),anddesigndifferent 2008. per-type questions for user experiments. [12] H.C.Huurdeman,A.Ben-David,J.Kamps,T.Samar,and A.P.deVries.Findingpagesontheunarchivedweb.In JCDL,pages331–340.IEEE,2014. 6. CONCLUSION [13] M.KleinandM.L.Nelson.Movedbutnotgone: an evaluationofreal-timemethodsfordiscoveringreplacement In this paper, we conducted a first study about the influ- webpages.14:17–38,2014. enceofnon-contentevidencesonsearchinginWebarchives. [14] J.M.Kleinberg.Authoritativesourcesinahyperlinked By using Bing search results as proxy for relevance, and environment.J. ACM,46(5):604–632,1999. by focusing our search scenario on exploring unambiguous [15] M.KoolenandJ.Kamps.Theimportanceofanchortext entities, we can identify the correlation between different foradhocsearchrevisited.InSIGIR,pages122–129,2010. featuresandthequalityofdocuments. Wefindoutthatal- [16] W.Kraaij,T.Westerveld,andD.Hiemstra.The thoughanchortextsareusefulresourcesforsearchinginthe importanceofpriorprobabilitiesforentrypagesearch.In Web,applyingthemtoWebarchivesrequiremorethoughts, SIGIR,pages27–34,2002. [17] J.Lin,M.Gholami,andJ.Rao.Infrastructurefor taking into account other evidences such as entity types, supportingexplorationanddiscoveryinwebarchives.In URL, etc. WWW,pages851–856,2014. We also study the possibility of combing multiple evi- [18] J.Masan`es.Webarchiving: issuesandmethods.Web dencestoimprovetheranking. Theresultispromising,and Archiving,pages1–53,2006. 
can be a good starting point for an efficient ranking frame- [19] D.Metzler,J.Novak,H.Cui,andS.Reddy.Building work that does not rely on full content indexing of docu- enricheddocumentrepresentationsusingaggregatedanchor ments,atleastforanentityexplorationscenario,wherethe text.InSIGIR,pages219–226.ACM,2009. user has no special ad-hoc information need. [20] T.N.Nguyen,N.Kanhabua,W.Nejdl,andC.Nieder´ee. Miningrelevanttimeforquerysubtopicsinwebarchives. Thereareseveralotherdirectionstoextendthiswork. For InWWW companion.ACM,2015. instance, we aim to target different search scenarios in the [21] T.N.Nguyen,N.Kanhabua,C.Nieder´ee,andX.Zhu.A Webarchive,suchasnavigationalsearchforparticular(but time-awarerandomwalkmodelforfindingimportant partially forgotten) resources in the past. We also aim to documentsinwebarchives.InSIGIR,pages915–918,2015. improve the retrieval component of our work, to facilitate [22] A.Ntoulas,J.Cho,andC.Olston.What’snewonthe bothambiguousandunambiguousqueries,andtoscaleand web?: theevolutionofthewebfromasearchengine evaluate our results for larger subsets of Web archives. perspective.InWWW,pages1–12.ACM,2004. [23] M.Stack.Fulltextsearchofwebarchivecollections.Proc. Acknowledgements. This work was funded by the ERC of IWAW,2006. Advanced Grant ALEXANDRIA (under the grant No. [24] M.ToyodaandM.Kitsuregawa.Extractingevolutionof 339233). We thank the reviewers for the fruitful discussion webcommunitiesfromaseriesofwebarchives.InHT, pages28–37.ACM,2003. andsuggestionsonfuturework. Wealsothankourcolleague [25] T.TranandT.N.Nguyen.Hedera: Scalableindexingand PhilippKemkesforsettinguptheusageofBingsearchAPI exploringentitiesinwikipediarevisionhistory.InISWC, to facilitate our experiments. 2014. 7. REFERENCES [1] S.Auer,C.Bizer,G.Kobilarov,J.Lehmann,R.Cyganiak, andZ.Ives.DBpedia: ANucleusforaWebofOpenData. InISWC / ASWC,pages722–735.Springer,2007. 
[2] K.Berberich,S.Bedathur,O.Alonso,andG.Weikum.A languagemodelingapproachfortemporalinformation needs.InECIR,pages13–25.Springer,2010. [3] K.Berberich,S.Bedathur,T.Neumann,andG.Weikum. Atimemachinefortextsearch.InSIGIR,pages519–526. ACM,2007. [4] L.Breiman.Randomforests.Machine learning,45(1):5–32, 2001. [5] M.Costa,F.Couto,andM.Silva.Learning temporal-dependentrankingmodels.InSIGIR,pages 757–766.ACM,2014. [6] N.Craswell,D.Hawking,andS.Robertson.Effectivesite findingusinglinkanchorinformation.InSIGIR,pages 250–257.ACM,2001. [7] N.DaiandB.D.Davison.Freshnessmatters: inflowers, food,andwebauthority.InSIGIR,pages114–121,2010. [8] N.DaiandB.D.Davison.Mininganchortexttrendsfor retrieval.InECIR,pages127–139.Springer,2010.
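As an illustration of the two temporal features described in Section 5.1 (AnchorTimeSpans and RevDuration), the following is a minimal sketch, not the implementation used in the experiments. Both features reduce to counting the gaps between consecutive timestamps that exceed one week. The function name `count_long_gaps` and the example crawl times are hypothetical, and the strict "longer than" comparison follows the wording of Table 3 (the RevDuration paragraph says "at least 1 week", which would use `>=` instead).

```python
from datetime import datetime, timedelta


def count_long_gaps(timestamps, min_gap=timedelta(weeks=1)):
    """Count gaps between consecutive timestamps that are longer than min_gap.

    Applied to anchor-text timestamps this approximates AnchorTimeSpans;
    applied to revision (crawl) timestamps it approximates RevDuration.
    """
    ordered = sorted(timestamps)
    # Pair each timestamp with its successor and keep only the long gaps.
    return sum(
        1
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > min_gap
    )


# Hypothetical crawl times: two captures six hours apart count as one
# endorsement, while the later captures are separated by long gaps.
example_revisions = [
    datetime(2010, 1, 1, 0, 0),
    datetime(2010, 1, 1, 6, 0),
    datetime(2010, 1, 20, 0, 0),
    datetime(2011, 3, 5, 0, 0),
]
example_score = count_long_gaps(example_revisions)
```

With these example values, the six-hour gap is ignored and the two longer gaps are counted, which matches the intuition that near-duplicate crawls should not inflate the score.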

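The soft labeling strategy of Section 5.2.1 and the NDCG metric used in Section 5.3 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code behind the reported numbers: "inverse of Bing rank" is read here as the reciprocal 1/rank, NDCG uses linear gains with a log2 discount (RankLib's own formulation may differ), and the ideal ranking is derived from the gains of the ranked list itself. The function names and the toy URLs are hypothetical.

```python
import math


def soft_labels(archive_urls, bing_top_urls):
    """Label each archive URL with 1/rank if Bing returned it, else 0.0."""
    rank_of = {}
    for rank, url in enumerate(bing_top_urls, start=1):
        # Keep the best (first) rank if a URL appears more than once.
        rank_of.setdefault(url, rank)
    return {
        url: (1.0 / rank_of[url] if url in rank_of else 0.0)
        for url in archive_urls
    }


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, k):
    """DCG of the given order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Toy example: the archive index returned three URLs, Bing returned two of them.
labels = soft_labels(["a.de", "b.de", "c.de"], ["b.de", "x.de", "a.de"])
# A model that ranks a.de first, then c.de, then b.de is scored like this:
score = ndcg_at_k([labels["a.de"], labels["c.de"], labels["b.de"]], k=10)
```

Because the labels are graded by Bing rank rather than binary, a model is rewarded for placing Bing's top results first, which is what makes these labels usable for pairwise learning-to-rank.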