ECOLE CENTRALE PARIS
ECOLE NATIONALE DES SCIENCES DE L'INFORMATIQUE

PHD THESIS
to obtain the title of Doctor of Computer Science
Defended by Rania Soussi

Querying and Extracting Heterogeneous Graphs from Structured Data and Unstructured Content

prepared at Ecole Centrale Paris, MAS Laboratory, and Ecole Nationale des Sciences de l'Informatique, RIADI Laboratory
defended on June 22nd, 2012

Jury:
Advisors: Henda Ben Ghezala: Professor at Ecole Nationale des Sciences de l'Informatique
Marie-Aude Aufaure: Professor at Ecole Centrale Paris
President: Bernd Amann: Professor at Université Pierre et Marie Curie (Paris 6)
Reviewers: Alexander Löser: Professor at Technische Universität Berlin, Germany
Faiez Gargouri: Professor at Institut Supérieur d'Informatique et de Multimédia de Sfax
Examiner: Bénédicte Le Grand: HDR at Université Pierre et Marie Curie (Paris 6)

Acknowledgments

You'll meet more angels on a winding path than on a straight one.
Terri Guillemet

It is a pleasure to thank the many people who made this thesis possible. I wish to express my deepest gratitude to my advisors at Ecole Centrale Paris and Ecole Nationale des Sciences de l'Informatique.

I would really like to thank my advisor Prof. Marie-Aude Aufaure, Professor at Ecole Centrale Paris, for her gentleness, guidance, inspiration and patience throughout these years. I am grateful for having been given the opportunity to work in the ARSA project. By working in such a project, I had the chance to gain experience in a variety of aspects related to research and teamwork.

I am also grateful to Prof. Henda Ben Ghezala, Professor at Ecole Nationale des Sciences de l'Informatique, for accepting me as a member of the RIADI Laboratory. I would like to thank her for her gentleness, encouragement and confidence during my thesis and my master's.

I am especially thankful to Prof. Alexander Löser, Professor at Technische Universität Berlin, and Prof. Faiez Gargouri, Professor at ISIMS, for agreeing to review my PhD thesis. I would also like to thank Dr. Bénédicte Le Grand, HDR at Université Pierre et Marie Curie, who agreed to be my jury examiner. I express my sincere gratitude to Prof. Bernd Amann, Professor at Université Pierre et Marie Curie, for the honor he did me by agreeing to be president of the jury.

I would like to thank Dr. Yves Lechevallier for his valuable comments and remarks about my work. I would also like to thank Dr. Hajer Baazaoui for her help and advice during the first year of my thesis.

My appreciation likewise goes to the whole BI team and MAS Laboratory, my second family, particularly to Etienne Cuvelier for the research advice and for naming my graph model SPIDER-Graph, and also to Nesrine Ben Mustapha, Casio Mello, Carlos Mello, Hichem Bannour and Nizar Messai for their advice and for supporting me in the most difficult moments; and to Micheline Elyes, Raphael Thollot and Nicolas Kuchmann-Beauger for the productive discussions. I would like to thank the new BI members: Yves Vanrompay, Mario Cataldi, Alexandre Mikheev, Baptiste Thillay and Ilaria Tiddi for their help during the final preparations. I also thank Sylvie, Annie and Danny for all the administrative and technical help at ECP.

During this work I have collaborated with many colleagues for whom I have great regard, and I wish to extend my warmest thanks to all those at ECP or ENSI who have helped me with my work and whose names I have forgotten to mention.

The permanent support of my family was extremely precious to me. Thank you, my aunts Salwa and Naziha, my grandmother Habiba, and my brother Mohamed, to whom I owe so much. I wish to thank Hassen for the support he has given me and for believing that I could finish my thesis on time.

Lastly, and most importantly, I wish to thank my parents, Zakia and Mhamed. They bore me, raised me, supported me, taught me, and loved me. To them I dedicate this thesis.
Abstract

Nowadays, a huge volume of data is collected and stored daily in enterprises. Efficiently extracting and managing this valuable knowledge, which supports the decision-making process, is a hard task. First, the data can be stored in different ways: in a highly structured way, as in databases, but also in unstructured repositories. Moreover, the description of a single object, person, or process can be spread across several sources with different structures. Managing these different data models is difficult and makes information extraction inefficient. Classical query techniques retrieve the set of data matching specific criteria under a specific model, but richer results could be provided if a unified representation of the enterprise's disparate data, and of their interactions and links, could be defined. Graphs can be used for this unified data model, and facilitate information search as well as the extraction of object interactions.

The present work introduces a set of solutions to extract graphs from enterprise data and to facilitate information search on these graphs. First, we have defined a new graph model called the SPIDER-Graph, which models complex objects and allows heterogeneous graphs to be defined. Furthermore, we have developed a set of algorithms to extract the content of an enterprise database and to represent it in this new model. This representation allows us to discover relations that exist in the data but are hidden because they fit poorly in the classical relational model. Moreover, in order to unify the representation of all the enterprise's data, we have developed a second approach, which extracts from unstructured data an enterprise ontology containing the most important concepts and relations that can be found in a given enterprise. Having extracted the graphs from the relational databases and documents using the enterprise ontology, we propose an approach that allows users to extract an interaction graph between a set of chosen enterprise objects (for example, between customers and products, or even the enterprise social network). This approach is based on a set of relation patterns extracted from the graph and on the enterprise ontology's concepts and relations. Finally, information retrieval is facilitated by a new visual graph query language called GraphVQL, which allows users to query graphs by visually drawing a pattern for the query. This language covers different query types, from simple selection and aggregation queries to social network analysis queries. All these approaches and methods have been developed and evaluated using real enterprise data.

Keywords: Graph model, SPIDER-Graph, relational database, enterprise ontology, visual query language

Résumé

The volume of data collected and stored within enterprises keeps growing. However, extracting and managing the knowledge that lies in such data repositories can prove difficult, even though this knowledge is very valuable to the enterprise's decision-making processes. The first difficulty is that the data can be stored in various forms: structured, in relational databases, or unstructured, in documents, e-mails, etc. A further difficulty is that the description of an entity, object, person, or process can be scattered across several structures.
While classical data query techniques can find the set of data matching precise criteria within a specific data model, richer results can only be obtained by modeling the enterprise's scattered data in a unified way. Graphs are a natural means of representing and modeling these structurally disparate data. They can serve as a unified data model and facilitate information retrieval. The advantage of this kind of model lies in its dynamic aspects and its ability to represent simple relations, in the ease with which it allows data from heterogeneous sources to be queried, and in its ability to uncover implicit relations and information about the modeled enterprise objects. The ability of graphs to model interactions between heterogeneous objects (e.g., customers and products, products and projects, or interactions between people, as in social networks) is also a significant advantage. Using these graph representations can greatly help enterprises in decision-making processes, for instance by suggesting which recommendations to send to a customer (using the graph of products and customers) or by finding the most suitable expert on a given topic (using the social network).

This work introduces a set of solutions for extracting graphs from enterprise data and for facilitating information retrieval on these graphs. First, we defined a new data model called SPIDER-Graph, which models complex objects and allows heterogeneous graphs to be defined. Then, we developed a set of algorithms to extract the content of the enterprise's databases and transform it according to this new graph model. This representation brings to light non-explicit relations between objects: relations that exist but are not visible in the relational model. Furthermore, to unify the representation of all the enterprise's data, we developed, in a second approach, a method for building an enterprise ontology containing the most important concepts and relations of an enterprise, extracted from that enterprise's unstructured data.

Next, after extracting the various graphs from the enterprise data, we proposed an approach for extracting interaction graphs between the heterogeneous objects that model the enterprise. This approach extracts social network graphs or interaction graphs through the following process. First, the user chooses, from the ontology concepts, the objects whose interactions they want to see; an object identification process then identifies the nodes corresponding to these concepts in the graph extracted from the relational database. Next, based on the ontology relations and a set of relation patterns built from the relational database, a relation extraction process creates the relations between these objects.

Extracting knowledge in graph form from an enterprise's data (structured or not) is of little value if that knowledge cannot be queried and interacted with.
To facilitate information retrieval, we proposed a new visual query language called GraphVQL (Graph Visual Query Language), which lets non-expert users express their queries visually as graph patterns. The language supports several query types, from simple selection and aggregation to social network analysis. It can also query different types of graphs (SPIDER-Graph, RDF, or GraphML), relying on pattern matching algorithms or on the translation of queries into SPARQL. All these approaches and methods were evaluated on real enterprise datasets.

Keywords: database, graph, SPIDER-Graph, visual query language, enterprise ontology.

List of Personal Publications

Chapters in Books
1. Rania Soussi, Etienne Cuvelier, Marie-Aude Aufaure, Amine Louati, and Yves Lechevallier: DB2SNA: an All-in-one Tool for Extraction and Aggregation of underlying Social Networks from Relational Databases, book chapter in "Social Network Analysis and Mining", eds. Tansel Ozyer, Springer LNSN, 2011 (to appear).
2. Rania Soussi, Marie-Aude Aufaure and Hajer Baazaoui (2011) Graph Database For collaborative Communities, In: Community-Built Databases: Research and Development, Pardede, Eric (Ed.), 1st Edition, 2011, 400 p., Hardcover ISBN: 978-3-642-19046-9, Springer, Due: May 2011.

International conferences and workshops
1. Rania Soussi (2012) SPIDER-Graph: A model for heterogeneous graphs extracted from a relational database, In 31st International Conference on Conceptual Modeling (ER 2012), PhD Symposium.
2. Rania Soussi, Marie-Aude Aufaure and Hajer Baazaoui (2010) Towards a Social Network Extraction using Graph Databases, The Second International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA 2010), April 11-16, 2010, Menuires, France.
3. Rania Soussi, Hajer Baazaoui Zghal, Marie-Aude Aufaure (2009) Toward a social network extraction approach from relational database, Workshop ARS'09 Social Network Analysis: Models and Methods for Relational Data, 13-14 July 2009, Fisciano (SA), Italy.
4. Rania Soussi, Hajer Baazaoui, Marie-Aude Aufaure, Henda Ben Ghezala. Social Networks Extraction: State of the Art, ICWIT'09, July 12-14, Kerkennah, Tunisia.

National conferences
1. Amine Louati, Rania Soussi, Marie-Aude Aufaure, Hejer Baazaoui and Yves Lechevallier: Recherche de classes dans des réseaux sociaux, SFC 2011, Orléans, February 2011.
2. Rania Soussi, Amine Louati, Marie-Aude Aufaure, Hajer Baazaoui Zghal, Yves Lechevallier, Henda Ben Ghézala: Extraction et analyse de réseaux sociaux issus de bases de données relationnelles. EGC 2011: 371-376.

Poster
1. Rania Soussi: Extraction of interaction graphs from a relational database using the SPIDER-Graph model, ebiss 2011, http://cs.ulb.ac.be/conferences/ebiss2011/files/soussi.pdf

Technical Reports (in the ARSA project)
1. Rania Soussi. Survey on Graph query languages. Ecole Centrale Paris. 06-2010.
2. Rania Soussi. Visual Graph Query Language (global analysis algorithms). Ecole Centrale Paris. 07-2011.


208 Pages·2012·6.77 MB·English
