ebook img

Semantic Profiling in Data Lake PDF

81 Pages·2017·1.24 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Semantic Profiling in Data Lake

RWTHAachenUniversity|52056Aachen|Germany Chairfor ComputerScience5 InformationSystems RWTHAachen University Semantic Profiling in Data Lake MasterthesisatRWTHAachenUniversity. SubmittedforthedegreeofMasterofScience(M.Sc.) inSoftwareSystemsEngineering. Submittedby: JasimWaheedAnsari [email protected] Matriculationnumber: 350820 Thesisadvisors: Dr.OyaDenizBeyan,Dr.MichaelCochez,MSc.NailaKarim Informatik5,RWTHAachenUniversity Supervisor: Prof. Dr.StefanDecker Aachen,February2018 VersionFebruary13,2018 ©2018JasimWaheedAnsari ThisworkislicensedundertheCreativeCommonsAttribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. To view a copy of this license, please visit https://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, POBox1866,MountainView,CA94042,USA. Eidesstattliche Versicherung Waheed Ansari, Jasim 350820 ___________________________ ___________________________ Name, Vorname Matrikelnummer (freiwillige Angabe) Ich versichere hiermit an Eides Statt, dass ich die vorliegende Arbeit/Bachelorarbeit/ Masterarbeit* mit dem Titel Semantic Profiling in Data Lake __________________________________________________________________________ __________________________________________________________________________ __________________________________________________________________________ selbständig und ohne unzulässige fremde Hilfe erbracht habe. Ich habe keine anderen als die angegebenen Quellen und Hilfsmittel benutzt. Für den Fall, dass die Arbeit zusätzlich auf einem Datenträger eingereicht wird, erkläre ich, dass die schriftliche und die elektronische Form vollständig übereinstimmen. Die Arbeit hat in gleicher oder ähnlicher Form noch keiner Prüfungsbehörde vorgelegen. Aachen, 13.02.2018 ___________________________ ___________________________ Ort, Datum Unterschrift *Nichtzutreffendes bitte streichen Belehrung: § 156 StGB: Falsche Versicherung an Eides Statt Wer vor einer zur Abnahme einer Versicherung an Eides Statt zuständigen Behörde eine solche Versicherung falsch abgibt oder unter Berufung auf eine solche Versicherung falsch aussagt, wird mit Freiheitsstrafe bis zu drei Jahren oder mit Geldstrafe bestraft. § 161 StGB: Fahrlässiger Falscheid; fahrlässige falsche Versicherung an Eides Statt (1) Wenn eine der in den §§ 154 bis 156 bezeichneten Handlungen aus Fahrlässigkeit begangen worden ist, so tritt Freiheitsstrafe bis zu einem Jahr oder Geldstrafe ein. (2) Straflosigkeit tritt ein, wenn der Täter die falsche Angabe rechtzeitig berichtigt. Die Vorschriften des § 158 Abs. 2 und 3 gelten entsprechend. Die vorstehende Belehrung habe ich zur Kenntnis genommen: Aachen, 13.02.2018 ___________________________ ___________________________ Ort, Datum Unterschrift PPoowweerreedd bbyy TTCCPPDDFF ((wwwwww..ttccppddff..oorrgg)) I dedicate this work to my parents and friends for their constant support and encouragement during my course of study. Thank you. Acknowledgements Foremost, I would like to express my gratitude to my supervisor Prof. Dr. Stefan Decker to make it possible for me to work in his department. I am deeply grateful to Dr. Oya Deniz Beyan. Without her constant motivation, optimismandguidanceinthisworkuntiltheveryendwouldnothavebeenpossible. I also owe a great dept of gratitude to Dr. Michael Cochez, for helping me out with research insights and iterative revision of my work. Last but not the least, I am very thankful to M.Sc. Naila Karim for sharing her knowledge and advice on Big Data and Data Lake’s state-of-art. Abstract In the Big Data community, Data Lakes have become the de facto standard for storing data. Often, these data lakes are contained within the Hadoop ecosystem, where the actual storage happens on Hadoop Distributed File System(HDFS). The data is then stored in its raw (structured or unstructured) form and whenever an application needs the data, it interprets the raw data. This approach is a schema- on-read in which the interpretation of the data and potential consistency checks happen when the data is read by an application. The biggest challenge for data lake governance is to avoid that it turns in a so-called data swamp. Data swamp occurs due to various reasons. First, there is the data quality aspect, such as noisy or incorrect data. Second, exponential growth of data ingested and lack of schema enforcement. Lastly, the amount of used schemas tends to grow. Inthisworkwepresentanewmetadataextensiontodatalakesystemsbysemantic profiling,whichattemptstorecognizethemeaningofthedatawhichisingestedinto the Data Lake. The developed tool does not only detect meaning at schema level, but also at the data instance level by employing domain vocabularies and ontolo- gies. With our tool, ingested datasets can easily be mapped to common domain concepts with unique identifiers and the meaning of the data can be discovered by thesystem. Thedevelopedprofilingtoolwillhelptoproducemeaningfulsummaries oftheingestedcontentandprovidesopportunitiestolinkrelevantdatasetsingested using different data schemas. We evaluate our tool by using two cancer genome datasets. We use semantic profiling tool during data ingestion and observe how data sets are tagged and profiled. Our experiments show that Semantic Ingestion is a promising approach for enriching the data sets in a data lake.

Description:
knowledge and advice on Big Data and Data Lake's state-of-art 2.4 PWC's basic Hadoop architecture for scalable data lake infrastruc- ture[36] RQ1: How can Semantic Web improve the re-usability and discoverability of data In particular, we created algorithms for semantic data profiling us-.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.