Wright State University
CORE Scholar
2017
Exploiting Alignments in Linked Data for Compression and Query Answering
Amit Krishna Joshi
Wright State University
Follow this and additional works at: https://corescholar.libraries.wright.edu/etd_all
Part of the Computer Engineering Commons, and the Computer Sciences Commons
Repository Citation
Joshi, Amit Krishna, "Exploiting Alignments in Linked Data for Compression and Query Answering" (2017).
Browse all Theses and Dissertations. 1766.
https://corescholar.libraries.wright.edu/etd_all/1766
This Dissertation is brought to you for free and open access by the Theses and Dissertations at CORE Scholar. It
has been accepted for inclusion in Browse all Theses and Dissertations by an authorized administrator of CORE
Scholar. For more information, please contact library-corescholar@wright.edu.
Exploiting Alignments in Linked Data for
Compression and Query Answering
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
by
AMIT KRISHNA JOSHI
B.E., Institute of Engineering, Nepal, 2004
M.Sc., University of Reading, UK, 2008
2017
Wright State University
WRIGHT STATE UNIVERSITY
GRADUATE SCHOOL
March 28, 2017
I HEREBY RECOMMEND THAT THE DISSERTATION PREPARED UNDER MY
SUPERVISION BY Amit Krishna Joshi ENTITLED Exploiting Alignments in Linked Data
for Compression and Query Answering BE ACCEPTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy.
Pascal Hitzler, Ph.D.
Dissertation Director
Michael L. Raymer, Ph.D.
Director, Computer Science and Engineering Ph.D. Program
Robert E. W. Fyffe, Ph.D.
Vice President for Research and Dean of
the Graduate School
Committee on
Final Examination
Pascal Hitzler, Ph.D.
Guozhu Dong, Ph.D.
Krishnaprasad Thirunaraya, Ph.D.
Michelle Cheatham, Ph.D.
Subhashini Ganapathy, Ph.D.
ABSTRACT
Joshi, Amit Krishna. Ph.D., Department of Computer Science & Engineering, Wright
State University, 2017. Exploiting Alignments in Linked Data for Compression and Query
Answering.
Linked data has experienced accelerated growth in recent years due to its interlinking
ability across disparate sources, made possible via machine-processable RDF data. Today,
a large number of organizations, including governments and news providers, publish data in
RDF format, inviting developers to build useful applications through reuse and integration
of structured data. This has led to a tremendous increase in the amount of RDF data on the
web. Although the growth of RDF data can be viewed as a positive sign for semantic web
initiatives, it causes performance bottlenecks for RDF data management systems that store
and provide access to data. In addition, a growing number of ontologies and vocabularies
makes retrieving data a challenging task.
The aim of this research is to show how alignments in Linked Data can be exploited
to compress and query linked datasets. First, we introduce two compression techniques
that compress RDF datasets through identification and removal of semantic and contextual
redundancies in linked data. Logical Linked Data Compression is a lossless compression
technique which compresses a dataset by generating a set of new logical rules from the
dataset and removing triples that can be inferred from these rules. Contextual Linked Data
Compression is a lossy compression technique which compresses datasets by performing
schema alignment and instance matching, followed by pruning of alignments based on
confidence value and subsequent grouping of equivalent terms. Depending on the structure of
the dataset, the first technique was able to prune more than 50% of the triples. Second, we
propose an Alignment based Linked Open Data Querying System (ALOQUS) that allows
users to write query statements using concepts and properties not present in linked datasets
and show that querying does not require a thorough understanding of the individual datasets
and interconnecting relationships. Finally, we present LinkGen, a multipurpose synthetic
Linked Data generator that generates a large amount of repeatable and reproducible RDF
data using statistical distributions, and interlinks with real-world entities using alignments.
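
The idea behind the lossless technique described above (generate rules, then drop the triples those rules can re-derive) can be illustrated with a minimal sketch. This is not the dissertation's algorithm, which the table of contents indicates is built on frequent itemset mining; it is a simplified stand-in that only considers rules of the form "every subject with (property, object) pair a also has pair b", holding without exception. The function names `compress` and `decompress`, the `min_support` parameter, and the sample triples are all assumptions made for illustration.

```python
from collections import defaultdict

def compress(triples, min_support=2):
    """Return (rules, kept): rules a -> b over (property, object) pairs,
    and the triples that must still be stored explicitly."""
    subjects_of = defaultdict(set)
    for s, p, o in triples:
        subjects_of[(p, o)].add(s)

    rules, antecedents, consequents = [], set(), set()
    items = sorted(subjects_of)  # fixed order keeps the output deterministic
    for a in items:
        # An item whose triples may be removed cannot trigger a rule,
        # otherwise decompression could lose the trigger itself.
        if a in consequents or len(subjects_of[a]) < min_support:
            continue
        for b in items:
            if b == a or b in antecedents:
                continue
            # Rule holds without exception: every subject with a also has b.
            if subjects_of[a] <= subjects_of[b]:
                rules.append((a, b))
                antecedents.add(a)
                consequents.add(b)

    removable = {(s, *b) for a, b in rules for s in subjects_of[a]}
    return rules, set(triples) - removable

def decompress(rules, kept):
    """Re-derive the removed triples by applying each rule once."""
    restored = set(kept)
    for a, b in rules:
        for s, p, o in kept:
            if (p, o) == a:
                restored.add((s, *b))
    return restored

# Hypothetical sample data: every Student is also a Person.
triples = [
    ("alice", "type", "Person"), ("alice", "type", "Student"),
    ("bob", "type", "Person"), ("bob", "type", "Student"),
    ("carol", "type", "Person"),
]
rules, kept = compress(triples)
assert decompress(rules, kept) == set(triples)  # lossless round trip
```

In this sample the single rule (type, Student) -> (type, Person) lets two of the five triples be dropped. Antecedent triples are never removed, so one pass over the stored triples is enough to restore the original set.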
Contents
1 Introduction 1
1.1 RDF Compression ... 2
1.2 Query Answering ... 3
1.3 Ontology Alignment ... 4
1.4 Research Statements ... 5
1.5 Contributions ... 5
1.6 Dissertation Overview ... 6
2 Background 7
2.1 Semantic Web and Linked Data ... 7
2.2 Data Interchange: RDF ... 9
2.2.1 RDF/XML ... 11
2.2.2 N-Triples ... 12
2.2.3 Turtle ... 12
2.2.4 N-Quads ... 13
2.2.5 TriG ... 13
2.2.6 JSON-LD ... 14
2.3 Query Language: SPARQL ... 14
2.4 Ontology ... 16
2.5 Alignment API Format ... 16
2.6 Upper Ontology ... 18
2.7 Dictionary Encoding ... 18
2.8 Data Compression ... 20
2.9 Development Tools and Technologies ... 21
3 Logical Linked Data Compression 22
3.1 Frequent Itemset Mining ... 23
3.1.1 Association Rule Mining ... 26
3.1.2 Multi-Dimensional Association Rules ... 27
3.2 Rule Based Compression ... 28
3.2.1 Intra-property RB Compression ... 29
3.2.2 Inter-property RB Compression ... 31
3.2.3 Optimal Frequent Patterns ... 32
3.2.4 Delta Compression ... 33
3.3 Decompression ... 34
3.4 Evaluation ... 34
3.4.1 RB Compression - Triple Reduction ... 34
3.4.2 Comparison using compressed dataset size ... 36
3.4.3 RB Compression on Benchmark Dataset ... 37
3.5 Soundness and Completeness ... 37
4 Contextual Linked Data Compression 40
4.1 Ontology Alignments ... 40
4.1.1 Schema Alignment ... 41
4.1.2 Instance Matching ... 41
4.2 Approach ... 42
4.3 Evaluation ... 44
4.3.1 OAEI Conference Ontology ... 45
4.3.2 Dataset Generation ... 46
4.3.3 Varied Alignments and Compression ... 46
4.4 Discussion ... 49
5 Alignment based Linked Open Data Querying System 50
5.1 Linked Open Data and Data Retrieval ... 50
5.2 Challenges ... 51
5.2.1 Intimate knowledge of datasets ... 52
5.2.2 Schema heterogeneity ... 52
5.2.3 Entity Co-reference ... 53
5.3 Approach ... 53
5.3.1 Automatic mapping between upper level Ontology and Ontologies used in LOD Datasets ... 54
5.3.2 Identification and mapping of concepts in user defined queries to those in LOD Datasets ... 55
5.3.3 Constructing Sub-queries ... 55
5.3.4 Execution of sub-Queries ... 56
5.3.5 Determining entity co-references ... 56
5.3.6 Transformation and local storage of RDF graphs ... 58
5.3.7 Joining and Processing of results ... 58
5.4 Scenario Illustration ... 59
5.5 Evaluation ... 63
5.5.1 Statement and Query Types ... 64
5.5.2 Queries and Results ... 65
5.5.3 Qualitative comparison with other tools ... 67
5.6 Discussion ... 69
6 Synthetic Linked Data Generator 71
6.1 Synthetic Linked Data ... 71
6.2 Data Generator Features ... 73
6.2.1 Entity Distribution ... 73
6.2.2 Noisy Data ... 73
6.2.3 Inter-linking real world entities ... 74
6.2.4 Output Data and Streaming mode ... 75
6.2.5 Config Parameters ... 75
6.2.6 Data Generation Steps ... 75
6.3 Evaluation ... 77
6.4 Conclusion ... 78
7 Related Work 79
7.1 RDF Compression ... 79
7.1.1 Adjacency List ... 80
7.1.2 Bitmap ... 81
7.1.3 HDT ... 82
7.2 Querying LOD ... 83
7.3 Synthetic Linked Data Generator ... 84
8 Conclusion 86
8.1 Alignments and RDF compression ... 86
8.2 Alignments and Query Answering ... 87
8.3 Alignments and Synthetic Data Generation ... 88
8.4 Future Work ... 88
Bibliography 90
List of Figures
2.1 Latest Semantic Web Layer Stack ... 8
2.2 Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/ ... 9
2.3 Example of RDF Graph data ... 10
2.4 SPARQL query to fetch five oldest US president ... 15
2.5 SPARQL query response in Turtle format for query listed in 2.4. ... 15
2.6 Ontology snippet showing T-Box and A-Box ... 17
2.7 Alignment API format example showing ‘map’ element ... 17
2.8 Hierarchy example in SUMO ... 19
2.9 Triples encoded using numeric IDs ... 19
3.1 Rule Based Compression ... 23
3.2 List of encoded triples and corresponding transactions ... 24
3.3 Rule Based Compression, $G = G_D \cup R(G_A)$ ... 28
3.4 Compression and Decompression time for various linked open datasets ... 35
3.5 Compression and Decompression time for various LUBM datasets ... 38
4.1 Conceptual System Overview ... 42
4.2 Grouping equivalent terms for ekaw#Regular_Paper and ekaw#Research_Topic using OAEI reference alignment. ... 43
4.3 OAEI Conference Track Ontologies ... 45
4.4 Dataset size for various set of queries. ... 46
4.5 Varying alignment for same pair of items. ... 47
4.6 Number of mappings at different thresholds for two versions of Conference reference alignments ... 48
4.7 Comparison of various automated alignment systems demonstrating varying number of equivalent terms for same threshold ... 48
4.8 Compressed size (in MB) against original size of 670 MB ... 49
5.1 LinkedMdb connects to DBpedia via NYTimes ... 53
5.2 ALOQUS Illustration ... 59
6.1 Power-law distribution of subjects in Wikipedia ... 74
6.2 Time taken for generating datasets of various sizes ... 78
7.1 List of Triples ... 80
7.2 Dictionary encoding for terms in 7.1 ... 80
7.3 Compact Transformation from ID-based triples using adjacency list ... 81
7.4 Bitmap Transformation from Compact Streams ... 82
7.5 Dictionary and three possible triple representations for triples in 7.1 ... 82