Wright State University
CORE Scholar

Browse all Theses and Dissertations
2017

Exploiting Alignments in Linked Data for Compression and Query Answering

Amit Krishna Joshi
Wright State University

Part of the Computer Engineering Commons, and the Computer Sciences Commons

Repository Citation
Joshi, Amit Krishna, "Exploiting Alignments in Linked Data for Compression and Query Answering" (2017). Browse all Theses and Dissertations. 1766.
https://corescholar.libraries.wright.edu/etd_all/1766

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by

AMIT KRISHNA JOSHI
B.E., Institute of Engineering, Nepal, 2004
M.Sc., University of Reading, UK, 2008

2017
Wright State University

WRIGHT STATE UNIVERSITY
GRADUATE SCHOOL

March 28, 2017

I HEREBY RECOMMEND THAT THE DISSERTATION PREPARED UNDER MY SUPERVISION BY Amit Krishna Joshi ENTITLED Exploiting Alignments in Linked Data for Compression and Query Answering BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy.

Pascal Hitzler, Ph.D.
Dissertation Director

Michael L. Raymer, Ph.D.
Director, Computer Science and Engineering Ph.D. Program

Robert E. W. Fyffe, Ph.D.
Vice President for Research and Dean of the Graduate School

Committee on Final Examination
Pascal Hitzler, Ph.D.
Guozhu Dong, Ph.D.
Krishnaprasad Thirunaraya, Ph.D.
Michelle Cheatham, Ph.D.
Subhashini Ganapathy, Ph.D.

ABSTRACT

Joshi, Amit Krishna.
Ph.D., Department of Computer Science & Engineering, Wright State University, 2017. Exploiting Alignments in Linked Data for Compression and Query Answering.

Linked data has experienced accelerated growth in recent years due to its interlinking ability across disparate sources, made possible via machine-processable RDF data. Today, a large number of organizations, including governments and news providers, publish data in RDF format, inviting developers to build useful applications through reuse and integration of structured data. This has led to a tremendous increase in the amount of RDF data on the web. Although the growth of RDF data can be viewed as a positive sign for semantic web initiatives, it causes performance bottlenecks for RDF data management systems that store and provide access to data. In addition, a growing number of ontologies and vocabularies makes retrieving data a challenging task.

The aim of this research is to show how alignments in Linked Data can be exploited to compress and query linked datasets. First, we introduce two compression techniques that compress RDF datasets through identification and removal of semantic and contextual redundancies in linked data. Logical Linked Data Compression is a lossless compression technique which compresses a dataset by generating a set of new logical rules from the dataset and removing triples that can be inferred from these rules. Contextual Linked Data Compression is a lossy compression technique which compresses datasets by performing schema alignment and instance matching, followed by pruning of alignments based on confidence value and subsequent grouping of equivalent terms. Depending on the structure of the dataset, the first technique was able to prune more than 50% of the triples. Second, we propose an Alignment based Linked Open Data Querying System (ALOQUS) that allows users to write query statements using concepts and properties not present in linked datasets, and show that querying does not require a thorough understanding of the individual datasets and interconnecting relationships.
Finally, we present LinkGen, a multipurpose synthetic Linked Data generator that generates a large amount of repeatable and reproducible RDF data using statistical distributions, and interlinks with real world entities using alignments.

Contents

1 Introduction
  1.1 RDF Compression
  1.2 Query Answering
  1.3 Ontology Alignment
  1.4 Research Statements
  1.5 Contributions
  1.6 Dissertation Overview
2 Background
  2.1 Semantic Web and Linked Data
  2.2 Data Interchange: RDF
    2.2.1 RDF/XML
    2.2.2 N-Triples
    2.2.3 Turtle
    2.2.4 N-Quads
    2.2.5 TriG
    2.2.6 JSON-LD
  2.3 Query Language: SPARQL
  2.4 Ontology
  2.5 Alignment API Format
  2.6 Upper Ontology
  2.7 Dictionary Encoding
  2.8 Data Compression
  2.9 Development Tools and Technologies
3 Logical Linked Data Compression
  3.1 Frequent Itemset Mining
    3.1.1 Association Rule Mining
    3.1.2 Multi-Dimensional Association Rules
  3.2 Rule Based Compression
    3.2.1 Intra-property RB Compression
    3.2.2 Inter-property RB Compression
    3.2.3 Optimal Frequent Patterns
    3.2.4 Delta Compression
  3.3 Decompression
  3.4 Evaluation
    3.4.1 RB Compression - Triple Reduction
    3.4.2 Comparison using compressed dataset size
    3.4.3 RB Compression on Benchmark Dataset
  3.5 Soundness and Completeness
4 Contextual Linked Data Compression
  4.1 Ontology Alignments
    4.1.1 Schema Alignment
    4.1.2 Instance Matching
  4.2 Approach
  4.3 Evaluation
    4.3.1 OAEI Conference Ontology
    4.3.2 Dataset Generation
    4.3.3 Varied Alignments and Compression
  4.4 Discussion
5 Alignment based Linked Open Data Querying System
  5.1 Linked Open Data and Data Retrieval
  5.2 Challenges
    5.2.1 Intimate knowledge of datasets
    5.2.2 Schema heterogeneity
    5.2.3 Entity Co-reference
  5.3 Approach
    5.3.1 Automatic mapping between upper level Ontology and Ontologies used in LOD Datasets
    5.3.2 Identification and mapping of concepts in user defined queries to those in LOD Datasets
    5.3.3 Constructing Sub-queries
    5.3.4 Execution of sub-Queries
    5.3.5 Determining entity co-references
    5.3.6 Transformation and local storage of RDF graphs
    5.3.7 Joining and Processing of results
  5.4 Scenario Illustration
  5.5 Evaluation
    5.5.1 Statement and Query Types
    5.5.2 Queries and Results
    5.5.3 Qualitative comparison with other tools
  5.6 Discussion
6 Synthetic Linked Data Generator
  6.1 Synthetic Linked Data
  6.2 Data Generator Features
    6.2.1 Entity Distribution
    6.2.2 Noisy Data
    6.2.3 Inter-linking real world entities
    6.2.4 Output Data and Streaming mode
    6.2.5 Config Parameters
    6.2.6 Data Generation Steps
  6.3 Evaluation
  6.4 Conclusion
7 Related Work
  7.1 RDF Compression
    7.1.1 Adjacency List
    7.1.2 Bitmap
    7.1.3 HDT
  7.2 Querying LOD
  7.3 Synthetic Linked Data Generator
8 Conclusion
  8.1 Alignments and RDF compression
  8.2 Alignments and Query Answering
  8.3 Alignments and Synthetic Data Generation
  8.4 Future Work
Bibliography

List of Figures

2.1 Latest Semantic Web Layer Stack
2.2 Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
2.3 Example of RDF Graph data
2.4 SPARQL query to fetch five oldest US presidents
2.5 SPARQL query response in Turtle format for query listed in 2.4
2.6 Ontology snippet showing T-Box and A-Box
2.7 Alignment API format example showing 'map' element
2.8 Hierarchy example in SUMO
2.9 Triples encoded using numeric IDs
3.1 Rule Based Compression
3.2 List of encoded triples and corresponding transactions
3.3 Rule Based Compression, $G = G_D \cup R(G_A)$
3.4 Compression and Decompression time for various linked open datasets
3.5 Compression and Decompression time for various LUBM datasets
4.1 Conceptual System Overview
4.2 Grouping equivalent terms for ekaw#Regular_Paper and ekaw#Research_Topic using OAEI reference alignment
4.3 OAEI Conference Track Ontologies
4.4 Dataset size for various sets of queries
4.5 Varying alignment for same pair of items
4.6 Number of mappings at different thresholds for two versions of Conference reference alignments
4.7 Comparison of various automated alignment systems demonstrating varying number of equivalent terms for same threshold
4.8 Compressed size (in MB) against original size of 670 MB
5.1 LinkedMdb connects to DBpedia via NYTimes
5.2 ALOQUS Illustration
6.1 Power-law distribution of subjects in Wikipedia
6.2 Time taken for generating datasets of various sizes
7.1 List of Triples
7.2 Dictionary encoding for terms in 7.1
7.3 Compact Transformation from ID-based triples using adjacency list
7.4 Bitmap Transformation from Compact Streams
7.5 Dictionary and three possible triple representations for triples in 7.1
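
The lossless, rule-based compression summarized in the abstract (derive rules from the dataset, drop the triples those rules can re-derive, and re-apply the rules to decompress) can be illustrated with a small sketch. This is not the dissertation's algorithm: the function names mine_rules, compress and decompress, the single-antecedent rule form, the min_support parameter, and the toy triples are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical helpers; a triple is a (subject, property, object) tuple
# and an "item" is a (property, object) pair.

def mine_rules(triples, min_support=2):
    """Find rules item -> items that hold for every subject carrying item."""
    items_by_subject = defaultdict(set)
    for s, p, o in triples:
        items_by_subject[s].add((p, o))

    subjects_by_item = defaultdict(set)
    for s, items in items_by_subject.items():
        for item in items:
            subjects_by_item[item].add(s)

    rules = {}
    for antecedent, subjects in subjects_by_item.items():
        if len(subjects) < min_support:
            continue
        # Items shared by all subjects that have the antecedent.
        shared = set.intersection(*(items_by_subject[s] for s in subjects))
        implied = shared - {antecedent}
        if implied:
            rules[antecedent] = implied
    return rules


def compress(triples, min_support=2):
    """Return (active_triples, selected_rules) with inferable triples removed."""
    mined = mine_rules(triples, min_support)
    selected, removed_items = {}, set()
    # Prefer rules that imply more items; ties are broken deterministically.
    for antecedent in sorted(mined, key=lambda a: (-len(mined[a]), a)):
        if antecedent in removed_items:
            continue  # its triples may be removed, so it must not trigger a rule
        implied = mined[antecedent] - set(selected)  # never remove a chosen antecedent
        if implied:
            selected[antecedent] = implied
            removed_items |= implied

    items_by_subject = defaultdict(set)
    for s, p, o in triples:
        items_by_subject[s].add((p, o))
    active = set()
    for s, items in items_by_subject.items():
        inferable = set()
        for item in items:
            inferable |= selected.get(item, set())
        active |= {(s, p, o) for (p, o) in items - inferable}
    return active, selected


def decompress(active, rules):
    """Re-derive removed triples by applying each rule once to the active triples."""
    restored = set(active)
    for s, p, o in active:
        for p2, o2 in rules.get((p, o), ()):
            restored.add((s, p2, o2))
    return restored


# Toy data: every Student is also a Person, so those Person triples are inferable.
triples = [
    ("alice", "rdf:type", "Student"), ("alice", "rdf:type", "Person"),
    ("bob", "rdf:type", "Student"), ("bob", "rdf:type", "Person"),
    ("carol", "rdf:type", "Person"), ("alice", "name", "Alice"),
]
active, rules = compress(triples)
print(len(triples), "->", len(active))
```

Because a selected antecedent is never removed and every rule holds for all subjects that carry its antecedent, a single pass of decompress restores exactly the original triple set in this sketch.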