Faculty of Bioscience Engineering
Academic year 2015-2016

An ontology based query engine for querying biological sequences

Jim Clauwaert

Promoter: Prof. Dr. ir. Wim Van Criekinge
Tutor: Martijn Devisscher

Master's dissertation submitted to obtain the degree of Master of Bioscience Engineering: Cell and Gene Biotechnology

Foreword

This thesis came about during the academic year 2015-2016, as the final project for my master's degree in bioscience engineering. In many ways, this year has been very demanding. Even though working on my thesis often meant not working on something else that needed my attention, I have enjoyed researching the subject and am satisfied when I look back at the work invested. This feeling is due to the many external influences that have guided and supported me. I wish to extend my gratitude to the people who have stood by my side throughout the last year.

First, I would like to thank Martijn Devisscher, my tutor and the spiritual father of boinq. I owe the rich experience gained through my work on boinq and The Semantic Web mainly to the positive working environment he created. I was given both the responsibility and the trust to handle important parts of the boinq program. This gave me not only the opportunity, but also the ability to think for myself and to introduce solutions when these presented themselves. Through weekly appointments, I was able to follow up on and discuss my work, and to get directions when no path was obvious. Because of this, I feel that I was able to contribute to the creation of boinq, and that my input was of value. This has been both my strongest motivation and the most fulfilling aspect of my thesis.

I also extend my gratitude to the BioBix group. Specifically, to my promoter, Prof. Wim Van Criekinge, for helping to make this thesis possible, and to Prof. Tim de Meyer and Dr. Gerben Menschaert, for helping me define a use case and assisting me throughout.
I want to thank my family for supporting me all these years. I want to thank my friends for being awesome in general. A special thanks to Meaghan Blanchard, for being the first helping hand when correcting and revising my work, and for being there for whatever reason.

Gent, 2016
Jim Clauwaert

Table of Contents

Foreword
1 Abstract
2 Introduction
3 The Semantic Web
  3.1 Introduction
  3.2 What is The Semantic Web?
  3.3 RDF
    3.3.1 Structure of RDF
    3.3.2 Vocabularies of RDF
  3.4 Linked Data
    3.4.1 Inference
    3.4.2 Linked databases
  3.5 RDF data management
    3.5.1 RDF formats
    3.5.2 Triplestores
  3.6 SPARQL
    3.6.1 SPARQL syntax
4 Boinq
  4.1 Introduction
  4.2 Design
    4.2.1 Data unification
    4.2.2 Data organization
  4.3 Comparison to other frameworks
    4.3.1 Biological query building
    4.3.2 Semantic access to sequence information
  4.4 Material and methods
5 Genomic Data Implementation
  5.1 Introduction
  5.2 Genomic Data
    5.2.1 Browser Extensible Data format
    5.2.2 Generic Feature Format
    5.2.3 Variant Call Format
    5.2.4 Sequence Alignment/Map format
  5.3 Data integration into The Semantic Web
    5.3.1 Overview
    5.3.2 Basic data model
    5.3.3 Vocabularies
    5.3.4 Data models
    5.3.5 Metadata
    5.3.6 Practical implementation
  5.4 Evaluation
    5.4.1 sparql-bed and sparql-vcf
    5.4.2 Big data files
    5.4.3 JBrowse
6 Biological research in RDF
  6.1 Introduction
  6.2 A biomarker for colon cancer
    6.2.1 Introduction
    6.2.2 Material and methods
    6.2.3 Results
  6.3 Discussion
    6.3.1 Methods
    6.3.2 Results
7 Conclusion and Future Prospects
A Code Examples
B Tables
C Figures
List of Acronyms

List of Acronyms

B
Boinq      Bio ontology integrated query platform
BED        Browser Extensible Data

C
CDS        Coding DNA Sequence
CNV        Copy Number Variations
CIMP       CpG Island Methylator Phenotype
CRC        Colorectal Cancer
CTD        Comparative Toxicogenomics Database

D
DBMS       Database Management Systems
DDBJ       DNA Data Bank of Japan
DKO        Double Knock-Out

E
EBI-EMBL   The European Bioinformatics Institute

G
GDA        Gene Disease Association
GFF/GFF3   General Feature Format
GFVO       Genomic Feature and Variation Ontology
GMOD       Generic Model Organism Database
GRC        Genome Reference Consortium
GTF        Gene Transfer Format

I
IRI        Internationalized Resource Identifier

J
JSON-LD    JavaScript Object Notation for Linked Data

M
MeSH       Medical Subject Headings

N
NCBI       National Center for Biotechnology Information
NCI        National Cancer Institute
NHGRI      National Human Genome Research Institute

O
OWL        Web Ontology Language

R
RDF        Resource Description Framework
RDFS       Resource Description Framework Schema

S
SKOS       Simple Knowledge Organization System
SNP        Single Nucleotide Polymorphism
SO         Sequence Ontology
SPARQL     SPARQL Protocol and RDF Query Language
STS        Spring Tool Suite

T
TCGA       The Cancer Genome Atlas

U
UniProt    The Universal Protein Resource
URI        Uniform Resource Identifier
URL        Uniform Resource Locator

V
VCF        Variant Call Format

W
W3C        World Wide Web Consortium
WT         Wild Type
WWW        World Wide Web

X
XML        Extensible Markup Language
XSD        XML Schema Definition