Master thesis at RWTH Aachen University.
Submitted for the degree of Master of Science (M.Sc.) in Software Systems Engineering.

Submitted by: Jasim Waheed Ansari
Matriculation number: 350820
Thesis advisors: Dr. Oya Deniz Beyan, Dr. Michael Cochez, M.Sc. Naila Karim
Informatik 5, RWTH Aachen University
Supervisor: Prof. Dr. Stefan Decker
Aachen, February 2018

I am deeply grateful to Dr. Oya Deniz Beyan. Without her constant motivation, optimism, and guidance until the very end, this work would not have been possible. I also owe a great debt of gratitude to Dr. Michael Cochez for helping me with research insights and iterative revisions of my work. Last but not least, I am very thankful to M.Sc. Naila Karim for sharing her knowledge and advice on the state of the art in Big Data and Data Lakes.

Abstract

In the Big Data community, Data Lakes have become the de facto standard for storing data. Often, these data lakes are contained within the Hadoop ecosystem, where the actual storage happens on the Hadoop Distributed File System (HDFS). The data is stored in its raw (structured or unstructured) form, and whenever an application needs the data, it interprets the raw data. This approach is called schema-on-read: the interpretation of the data and potential consistency checks happen when the data is read by an application. The biggest challenge for data lake governance is to prevent the lake from turning into a so-called data swamp. A data swamp arises for several reasons. First, there is the data quality aspect, such as noisy or incorrect data. Second, there is the exponential growth of ingested data combined with the lack of schema enforcement. Lastly, the number of schemas in use tends to grow.

In this work we present a new metadata extension to data lake systems based on semantic profiling, which attempts to recognize the meaning of the data that is ingested into the Data Lake. The developed tool detects meaning not only at the schema level but also at the data instance level by employing domain vocabularies and ontologies.
With our tool, ingested datasets can easily be mapped to common domain concepts with unique identifiers, and the meaning of the data can be discovered by the system. The developed profiling tool helps to produce meaningful summaries of the ingested content and provides opportunities to link relevant datasets that were ingested using different data schemas. We evaluate our tool using two cancer genome datasets. We apply the semantic profiling tool during data ingestion and observe how the datasets are tagged and profiled. Our experiments show that Semantic Ingestion is a promising approach for enriching the datasets in a data lake.
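
The idea of profiling at both the schema level and the data instance level can be illustrated with a minimal sketch. It is not the tool developed in this thesis: the vocabulary, the concept identifiers with the `ex:` prefix, and the function names `normalize`, `lookup`, `profile`, and `linkable` are all hypothetical and chosen only for illustration. A real system would presumably consult domain vocabularies and ontologies rather than a hard-coded dictionary.

```python
from collections import Counter

# Hypothetical toy vocabulary: normalized term -> made-up concept identifier.
# A real deployment would query domain vocabularies or ontologies instead.
VOCABULARY = {
    "gene": "ex:Gene",
    "gene symbol": "ex:Gene",
    "tp53": "ex:Gene/TP53",
    "brca1": "ex:Gene/BRCA1",
    "tumor site": "ex:AnatomicalSite",
    "primary site": "ex:AnatomicalSite",
    "breast": "ex:AnatomicalSite/Breast",
    "lung": "ex:AnatomicalSite/Lung",
}


def normalize(term):
    """Lower-case a term, turn underscores into spaces, collapse whitespace."""
    return " ".join(str(term).replace("_", " ").lower().split())


def lookup(term):
    """Return the concept identifier for a term, or None if it is unknown."""
    return VOCABULARY.get(normalize(term))


def profile(rows):
    """Build a semantic profile for records interpreted at read time.

    rows is a list of dicts (column name -> raw value). For every column the
    profile stores the concept matched by the column name (schema level) and
    a count of concepts matched by the values (instance level).
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            entry = columns.setdefault(
                name,
                {"schema_concept": lookup(name), "instance_concepts": Counter()},
            )
            concept = lookup(value)
            if concept is not None:
                entry["instance_concepts"][concept] += 1
    return columns


def linkable(profile_a, profile_b):
    """List column pairs from two profiles that share a schema-level concept."""
    return sorted(
        (a, b)
        for a, pa in profile_a.items()
        for b, pb in profile_b.items()
        if pa["schema_concept"] is not None
        and pa["schema_concept"] == pb["schema_concept"]
    )


# Two small, invented datasets that use different schemas for the same concepts.
dataset_a = [
    {"gene_symbol": "TP53", "primary_site": "Breast"},
    {"gene_symbol": "BRCA1", "primary_site": "Lung"},
]
dataset_b = [{"Gene": "tp53", "Tumor Site": "lung"}]

links = linkable(profile(dataset_a), profile(dataset_b))
```

In this sketch, `links` pairs `gene_symbol` with `Gene` and `primary_site` with `Tumor Site`, because both column names in each pair resolve to the same hypothetical concept identifier even though the two datasets use different schemas. Exact dictionary lookup is the simplest possible matching strategy and is used here only to keep the example short.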

