Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora

Aidan Hogan

Supervisor: Dr. Axel Polleres
Internal Examiner: Prof. Stefan Decker
External Examiner: Prof. James A. Hendler

Dissertation submitted in pursuance of the degree of Doctor of Philosophy

Digital Enterprise Research Institute, Galway
National University of Ireland, Galway / Ollscoil na hÉireann, Gaillimh

April 11, 2011

Copyright © Aidan Hogan, 2011

The research presented herein was supported by an IRCSET Postgraduate Scholarship and by Science Foundation Ireland under Grant No. SFI/02/CE1/I131 (Lion) and Grant No. SFI/08/CE/I1380 (Lion-2).

“If you have an apple and I have an apple and we exchange these apples then you and I will still each have one apple. But if you have an idea and I have an idea and we exchange these ideas, then each of us will have two ideas.”
—George Bernard Shaw

Acknowledgements

First, thanks to the taxpayers for the pizza and (much needed) cigarettes;
...thanks to friends and family;
...thanks to the various students and staff of DERI;
...thanks to the URQ folk;
...thanks to people with whom I have worked closely, including Alex, Antoine, Jeff, Luigi and Piero;
...thanks to people with whom I have worked very closely, particularly Andreas and Jürgen;
...thanks to John and Stefan for the guidance;
...thanks to Jim for the patience and valuable time;
...and finally, a big thanks to Axel for everything.

Abstract

The Web contains a vast amount of information on an abundance of topics, much of which is encoded as structured data indexed by local databases. However, these databases are rarely interconnected, and information reuse across sites is limited. Semantic Web standards offer a possible solution in the form of an agreed-upon data model and set of syntaxes, as well as metalanguages for publishing schema-level information, offering a highly interoperable means of publishing and interlinking structured data on the Web. Thanks to the Linked Data community, an unprecedented lode of such data has now been published on the Web—by individuals, academia, communities, corporations and governmental organisations alike—on a medley of often overlapping topics.

This new publishing paradigm has opened up a range of new and interesting research topics with respect to how this emergent “Web of Data” can be harnessed and exploited by consumers. Indeed, although Semantic Web standards theoretically enable a high level of interoperability, heterogeneity still poses a significant obstacle when consuming this information: in particular, publishers may describe analogous information using different terminology, or may assign different identifiers to the same referents. Consumers must also overcome the classical challenges of processing Web data sourced from multitudinous and unvetted providers: primarily, scalability and noise.

In this thesis, we look at tackling the problem of heterogeneity with respect to consuming large-scale corpora of Linked Data aggregated from millions of sources on the Web. As such, we design bespoke algorithms—in particular, based on the Semantic Web standards and traditional Information Retrieval techniques—which leverage the declarative schemata (a.k.a. terminology) and various statistical measures to help smooth out the heterogeneity of such Linked Data corpora in a scalable and robust manner. All of our methods are distributed over a cluster of commodity hardware, which typically allows for enhancing performance and/or scale by adding more machines.
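As a flavour of the kind of schema-leveraging inference described above, the following minimal sketch (not the thesis's actual distributed implementation) forward-chains two standard RDFS entailment rules, rdfs9 and rdfs11, to a fixpoint over a toy set of triples; the ex: identifiers and the naive in-memory algorithm are purely illustrative assumptions.

```python
# Illustrative sketch of rule-based materialisation over RDF triples,
# applying two standard RDFS entailment rules:
#   rdfs9:  (?c1 rdfs:subClassOf ?c2), (?s rdf:type ?c1)        -> (?s rdf:type ?c2)
#   rdfs11: (?c1 rdfs:subClassOf ?c2), (?c2 rdfs:subClassOf ?c3) -> (?c1 rdfs:subClassOf ?c3)
# All ex: IRIs below are hypothetical.

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def materialise(triples):
    """Naive forward chaining to a fixpoint over rdfs9 and rdfs11."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = [(s, o) for s, p, o in closure if p == SUBCLASS]
        types = [(s, o) for s, p, o in closure if p == RDF_TYPE]
        new = set()
        for c1, c2 in subclass:
            for c2b, c3 in subclass:          # rdfs11: subclass transitivity
                if c2 == c2b:
                    new.add((c1, SUBCLASS, c3))
            for s, c in types:                # rdfs9: type propagation
                if c == c1:
                    new.add((s, RDF_TYPE, c2))
        fresh = new - closure
        if fresh:
            closure |= fresh
            changed = True
    return closure

triples = {
    ("ex:Aidan", RDF_TYPE, "ex:PhDStudent"),
    ("ex:PhDStudent", SUBCLASS, "ex:Student"),
    ("ex:Student", SUBCLASS, "ex:Person"),
}
for t in sorted(materialise(triples) - triples):
    print(t)  # e.g. infers ("ex:Aidan", "rdf:type", "ex:Person")
```

In practice, materialisation at the scale discussed in this thesis cannot use such a quadratic in-memory loop; the sketch only conveys what "inferring new information from declarative schemata" means at the level of individual rules.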
We first present a distributed crawler for collecting a generic Linked Data corpus from millions of sources; we perform an open crawl to acquire an evaluation corpus for our thesis, consisting of 1.118 billion facts collected from 3.985 million individual documents hosted by 783 different domains. Thereafter, we present our distributed algorithm for performing a links-based analysis of the data sources (documents) comprising the corpus, where the resultant ranks are used in subsequent chapters as an indication of the importance and trustworthiness of the information they contain. Next, we look at custom techniques for performing rule-based materialisation, leveraging RDFS and OWL semantics to infer new information, often using mappings—provided by the publishers themselves—to translate between different terminologies. Thereafter, we present a formal framework for incorporating meta-information—relating to trust, provenance and data quality—into this inferencing procedure; in particular, we derive and track ranking values for facts based on the sources they originate from, later using them to repair identified noise (logical inconsistencies) in the data. Finally, we look at two methods for consolidating coreferent identifiers in the corpus, and we present an approach for discovering and repairing incorrect coreference through analysis of inconsistencies.

Throughout the thesis, we empirically demonstrate our methods against our real-world Linked Data corpus, and on a cluster of nine machines.

Declaration

I declare that this thesis is composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

Aidan Hogan
April 11, 2011