
Towards High Quality Semantic Web Data: Detecting Abnormal Data on the Semantic Web

225 Pages·2015·3.01 MB·English
by Yang Yu

Preview Towards High Quality Semantic Web Data: Detecting Abnormal Data on the Semantic Web

Lehigh University
Lehigh Preserve
Theses and Dissertations, 2012

Towards High Quality Semantic Web Data: Detecting Abnormal Data on the Semantic Web
Yang Yu, Lehigh University

Follow this and additional works at: http://preserve.lehigh.edu/etd

Recommended Citation
Yu, Yang, "Towards High Quality Semantic Web Data: Detecting Abnormal Data on the Semantic Web" (2012). Theses and Dissertations. Paper 1373.

This Dissertation is brought to you for free and open access by Lehigh Preserve. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of Lehigh Preserve. For more information, please contact [email protected].

TOWARDS HIGH QUALITY SEMANTIC WEB DATA: DETECTING ABNORMAL DATA ON THE SEMANTIC WEB
by Yang Yu

A Dissertation Presented to the Graduate Committee of Lehigh University in Candidacy for the Degree of Doctor of Philosophy in Computer Science

Lehigh University
May 2012

© Copyright 2012 by Yang Yu. All Rights Reserved.

This dissertation is accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

(Date)
Jeff Heflin (Chair)
Brian Davison
Donald Hillman
Lin Lin

Acknowledgements

I would like to thank Professor Jeff Heflin for all his help. I cannot thank him enough for his support and patience. He is a great advisor and I have been inspired by his brilliance, hard work, and dedication and have benefited greatly from his guidance and example. Without him this work would not have been possible. I would also like to thank my committee members, Professors Brian Davison, Donald Hillman and Lin Lin, for their thoughtful comments and encouragement. I am also warmly grateful to my colleagues Yingjie Li, Xingjian Zhang, Dezhao Song and Zhengxiang Pan for their insightful suggestions and help.

Finally, this dissertation is dedicated to my wife Yun Sun, my parents Yukun Yu and Yafen Huang, and my parents-in-law Changqing Sun and Xianzhi Zhang. Without their endless love, patience, understanding and support I would certainly not be where I am today.
Contents

Acknowledgements
Abstract
1 Introduction
  1.1 The Semantic Web
  1.2 Quality Assessment on the Semantic Web
  1.3 Contributions
  1.4 Thesis Overview
2 Background
  2.1 Semantic Web Languages
    2.1.1 RDF and RDF schema
    2.1.2 RDF Graph
    2.1.3 RDF Query Language SPARQL
    2.1.4 OWL
  2.2 Data Quality and Data Cleansing
    2.2.1 Data Quality Dimensions
    2.2.2 Data Quality Methodologies
    2.2.3 Data Cleansing
  2.3 Data Quality on the Semantic Web
    2.3.1 Quality Annotation
    2.3.2 Semantic Web Data Evaluation
3 The Problem of Detecting Abnormal Semantic Web Data
  3.1 Outliers
  3.2 Outlier Detection
    3.2.1 Univariate
    3.2.2 Multivariate
  3.3 Abnormal Semantic Web data
  3.4 Design Considerations of a Practical System
4 Data Correctness under the Closed World Assumption
  4.1 Approach
  4.2 Context Construction
  4.3 Link Prediction
  4.4 Credible Relation
  4.5 Patterns of Relation
  4.6 Experimental Setup
  4.7 Results
5 Data Correctness under the Open World Assumption
  5.1 Context Representation Model
    5.1.1 Representing Context for Two Instances
    5.1.2 Context Expansion
    5.1.3 Semantic Similarity of Contexts
  5.2 Learning Predicate Similarity
    5.2.1 Motivation
    5.2.2 Learning Model for Predicate Similarity
    5.2.3 Dimensionality Reduction for Learning
  5.3 Experiments
    5.3.1 Parameters Analysis
    5.3.2 Results
6 Data Correctness without Training
  6.1 Approach Overview
  6.2 Semantic Dependency
    6.2.1 Summary Graph
    6.2.2 Finding Candidate Semantic Dependencies
  6.3 Probability of Semantic Dependency
    6.3.1 Computing Probability of a Semantic Dependency
    6.3.2 Refine Probability of a Semantic Dependency
    6.3.3 Triple's Posterior Probability
  6.4 Experiments
    6.4.1 Results
7 Detecting Abnormal Data using Value-clustered Graph Functional Dependency
  7.1 Functional Dependency
  7.2 Value-clustered Graph Functional Dependency
  7.3 System Overview
  7.4 Discovering VGFDs
    7.4.1 Heuristics for Static Pruning
    7.4.2 Computing VGFDs
    7.4.3 Run-time Pruning
  7.5 Clustering Property Values
    7.5.1 Pre-clustering
    7.5.2 Optimal k-Means Clustering
  7.6 Experiments
8 Conclusion
  8.1 Analysis
  8.2 Future Work
A Example comparisons of clustering results
Bibliography

