Undefined 0 (2016) 1
IOS Press
J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web
0000-0000/16/$00.00 © 2016 – IOS Press and the authors. All rights reserved

Information Extraction meets the Semantic Web: A Survey

Editor(s): Andreas Hotho, Julius-Maximilians-Universität Würzburg, Würzburg, Germany
Solicited review(s): Dat Ba Nguyen, Vietnam National University, Hanoi, Vietnam; Simon Scerri, University of Bonn, Germany; 2 Anonymous Reviewers
Open review(s): Michelle Cheatham, Wright State University, Dayton, Ohio, USA

Jose L. Martinez-Rodriguez (a), Aidan Hogan (b) and Ivan Lopez-Arevalo (a)
(a) Cinvestav Tamaulipas, Ciudad Victoria, Mexico. E-mail: {lmartinez,ilopez}@tamps.cinvestav.mx
(b) IMFD Chile; Department of Computer Science, University of Chile, Chile. E-mail: [email protected]

Abstract. We provide a comprehensive survey of the research literature that applies Information Extraction techniques in a Semantic Web setting. Works in the intersection of these two areas can be seen from two overlapping perspectives: using Semantic Web resources (languages/ontologies/knowledge-bases/tools) to improve Information Extraction, and/or using Information Extraction to populate the Semantic Web. In more detail, we focus on the extraction and linking of three elements: entities, concepts and relations. Extraction involves identifying (textual) mentions referring to such elements in a given unstructured or semi-structured input source. Linking involves associating each such mention with an appropriate disambiguated identifier referring to the same element in a Semantic Web knowledge-base (or ontology), in some cases creating a new identifier where necessary. With respect to entities, works involving (Named) Entity Recognition, Entity Disambiguation, Entity Linking, etc. in the context of the Semantic Web are considered. With respect to concepts, works involving Terminology Extraction, Keyword Extraction, Topic Modeling, Topic Labeling, etc., in the context of the Semantic Web are considered. Finally, with respect to relations, works involving Relation Extraction in the context of the Semantic Web are considered. The focus of the majority of the survey is on works applied to unstructured sources (text in natural language); however, we also provide an overview of works that develop custom techniques adapted for semi-structured inputs, namely markup documents and web tables.

Keywords: Information Extraction, Entity Linking, Keyword Extraction, Topic Modeling, Relation Extraction, Semantic Web

1. Introduction

The Semantic Web pursues a vision of the Web where increased availability of structured content enables higher levels of automation. Berners-Lee [20] described this goal as being to “enrich human readable web data with machine readable annotations, allowing the Web’s evolution as the biggest database in the world”. However, making annotations on information from the Web is a non-trivial task for human users, particularly if some formal agreement is required to ensure that annotations are consistent across sources. Likewise, there is simply too much information available on the Web – information that is constantly changing – for it to be feasible to apply manual annotation to even a significant subset of what might be of relevance.

While the amount of structured data available on the Web has grown significantly in the past years, there is still a significant gap between the coverage of structured and unstructured data available on the Web [248]. Mika referred to this as the semantic gap [205], whereby the demand for structured data on the Web outstrips its supply. For example, in an analysis of the 2013 Common Crawl dataset, Meusel et al. [201] found that of the 2.2 billion webpages considered, 26.3% contained some structured metadata. Thus, despite initiatives like Linking Open Data [274], Schema.org [200,204] (promoted by Google, Microsoft, Yahoo, and Yandex) and the Open Graph Protocol [127] (promoted by Facebook), this semantic gap is still observable on the Web today [205,201].

As a result, methods to automatically extract or enhance the structure of various corpora have been a core topic in the context of the Semantic Web. Such processes are often based on Information Extraction methods, which in turn are rooted in techniques from areas such as Natural Language Processing, Machine Learning and Information Retrieval. The combination of techniques from the Semantic Web and from Information Extraction can be seen from two perspectives: on the one hand, Information Extraction techniques can be applied to populate the Semantic Web, while on the other hand, Semantic Web techniques can be applied to guide the Information Extraction process. In some cases, both aspects are considered together, where an existing Semantic Web ontology or knowledge-base is used to guide the extraction, which further populates the given ontology and/or knowledge-base (KB).¹

¹ Herein we adopt the convention that the term “ontology” refers primarily to terminological knowledge, meaning that it describes classes and properties of the domain, such as person, knows, country, etc. On the other hand, we use the term “KB” to refer to primarily “assertional knowledge”, which describes specific entities (aka. individuals) of the domain, such as Barack Obama, China, etc.

In the past years, we have seen a wealth of research dedicated to Information Extraction in a Semantic Web setting. While many such papers come from within the Semantic Web community, many recent works have come from other communities, where, in particular, general-knowledge Semantic Web KBs – such as DBpedia [170], Freebase [26] and YAGO2 [138] – have been broadly adopted as references for enhancing Information Extraction tasks. Given the wide variety of works emerging in this particular intersection from various communities (sometimes under different nomenclatures), we see that a comprehensive survey is needed to draw together the techniques proposed in such works. Our goal is then to provide such a survey.

Survey Scope: This survey provides an overview of published works that directly involve both Information Extraction methods and Semantic Web technologies. Given that both are very broad areas, we must be rather explicit in our inclusion criteria.

With respect to Semantic Web technologies, to be included in the scope of this survey, a work must make non-trivial use of an ontology, knowledge-base, tool or language that is founded on one of the core Semantic Web standards: RDF/RDFS/OWL/SKOS/SPARQL.²

² Works that simply mention general terms such as “semantic” or “ontology” may be excluded by these criteria if they do not also directly use or depend upon a Semantic Web standard.

By Information Extraction methods, we focus on the extraction and/or linking of three main elements from an (unstructured or semi-structured) input source.

1. Entities: anything with named identity, typically an individual (e.g., Barack Obama, 1961).
2. Concepts: a conceptual grouping of elements. We consider two types of concepts:
   – Classes: a named set of individuals (e.g., U.S. President(s));
   – Topics: categories to which individuals or documents relate (e.g., U.S. Politics).
3. Relations: an n-ary tuple of entities (n ≥ 2) with a predicate term denoting the type of relation (e.g., marry(Barack Obama, Michelle Obama, Chicago)).

More formally, we can consider entities as atomic elements from the domain, concepts as unary predicates, and relations as n-ary (n ≥ 2) predicates. We take a rather liberal interpretation of concepts to include both classes based on set-theoretic subsumption of instances (e.g., OWL classes [135]), as well as topics that form categories over which broader/narrower relations can be defined (e.g., SKOS concepts [206]). This is rather a practical decision that will allow us to draw together a collective summary of works in the interrelated areas of Terminology Extraction, Keyword Extraction, Topic Modeling, etc., under one heading.

Returning to “extracting and/or linking”, we consider the extraction process as identifying mentions referring to such entities/concepts/relations in the unstructured or semi-structured input, while we consider the linking process as associating a disambiguated identifier in a Semantic Web ontology/KB for a mention, possibly creating one if not already present and using it to disambiguate and link further mentions.

Information Extraction Tasks: The survey deals with various Information Extraction tasks. We now give an introductory summary of the main tasks considered (though we note that the survey will delve into each task in much more depth later):

Named Entity Recognition: demarcate the locations of mentions of entities in an input text:
   – aka. Entity Recognition, Entity Extraction;
   – e.g., in the sentence “Barack Obama was born in Hawaii”, mark the phrases “Barack Obama” and “Hawaii” as entity mentions.

Entity Linking: associate mentions of entities with an appropriate disambiguated KB identifier:
   – involves, or is sometimes synonymous with, Entity Disambiguation;³ often used for the purposes of Semantic Annotation;
   – e.g., associate “Hawaii” with the DBpedia identifier dbr:Hawaii for the U.S. state (rather than the identifier for various songs or books by the same name).⁴

Terminology Extraction: extract the main phrases that denote concepts relevant to a given domain described by a corpus, sometimes inducing hierarchical relations between concepts;
   – aka. Term Extraction, often used for the purposes of Ontology Learning;
   – e.g., identify from a text on Oncology that “breast cancer” and “melanoma” are important concepts in the domain;
   – optionally identify that both of the above concepts are specializations of “cancer”;
   – terms may be linked to a KB/ontology.

Keyphrase Extraction: extract the main phrases that categorize the subject/domain of a text (unlike Terminology Extraction, the focus is often on describing the document, not the domain);
   – aka. Keyword Extraction, which is often generically applied to cover extraction of multi-word phrases; often used for the purposes of Semantic Annotation;
   – e.g., identify that the keyphrases “breast cancer” and “mammogram” help to summarize the subject of a particular document;
   – keyphrases may be linked to a KB/ontology.

Topic Modeling: cluster words/phrases frequently co-occurring together in the same context; these clusters are then interpreted as being associated with abstract topics to which a text relates;
   – aka. Topic Extraction, Topic Classification;
   – e.g., identify that words such as “cancer”, “breast”, “doctor”, “chemotherapy” tend to co-occur frequently and thus conclude that a document containing many such occurrences is about a particular abstract topic.

Topic Labeling: for clusters of words identified as abstract topics, extract a single term or phrase that best characterizes the topic;
   – aka. Topic Identification, esp. when linked with an ontology/KB identifier; often used for the purposes of Text Classification;
   – e.g., identify that the topic {“cancer”, “breast”, “doctor”, “chemotherapy”} is best characterized with the term “cancer” (potentially linked to dbr:Cancer for the disease and not, e.g., the astrological sign).

Relation Extraction: extract potentially n-ary relations (for n ≥ 2) from an unstructured (i.e., text) or semi-structured (e.g., HTML table) source;
   – a goal of the area of Open Information Extraction;
   – e.g., in the sentence “Barack Obama was born in Hawaii”, extract the binary relation wasBornIn(Barack Obama, Hawaii);
   – binary relations may be represented as RDF triples after linking entities and linking the predicate to an appropriate property (e.g., mapping wasBornIn to the DBpedia property dbo:birthPlace);
   – n-ary (n ≥ 3) relations are often represented with a variant of reification [133,271].

³ In some cases Entity Linking is considered to include both recognition and disambiguation; in other cases, it is considered synonymous with disambiguation applied after recognition.

Note that we will use a more simplified nomenclature {Entity, Concept, Relation} × {Extraction, Linking} as previously described to structure our survey with the goal of grouping related works together; thus, works on Terminology Extraction, Keyphrase Extraction, Topic Modeling and Topic Labeling will be grouped under the heading of Concept Extraction and Linking.

Again we are only interested in such tasks in the context of the Semantic Web. Our focus is on unstructured (text) inputs, but we will also give an overview of methods for semi-structured inputs (markup documents and tables) towards the end of the survey.

Related Areas, Surveys and Novelty: There are a variety of areas that relate and overlap with the scope of this survey, and likewise there have been a number of previous surveys in these areas. We now discuss such areas and surveys, how they relate to the current contribution, and outline the novelty of the current survey.

As we will see throughout this survey, Information Extraction (IE) from unstructured sources – i.e., textual corpora expressed primarily in natural language – relies heavily on Natural Language Processing (NLP).
A number of resources have been published within the intersection of NLP and the Semantic Web (SW), where we can point, for example, to a recent book published by Maynard et al. [190] in 2016, which likewise covers topics relating to IE. However, while IE tools may often depend on NLP techniques, this is not always the case, where many modern approaches to tasks such as Entity Linking do not use a traditional NLP processing pipeline. Furthermore, unlike the introductory textbook by Maynard et al. [190], our goal here is to provide a comprehensive survey of the research works in the area. Note that we also provide a brief primer on the most important NLP techniques in a supplementary appendix, discussed later.

⁴ We use well-known IRI prefixes as consistent with the lookup service hosted at: http://prefix.cc. All URLs in this paper were last accessed on 2018/05/30.

On the other hand, Data Mining involves extracting patterns inherent in a dataset. Example Data Mining tasks include classification, clustering, rule mining, predictive analysis, outlier detection, recommendation, etc. Knowledge Discovery refers to a higher-level process to help users extract knowledge from raw data, where a typical pipeline involves selection of data, pre-processing and transformation of data, a Data Mining phase to extract patterns, and finally evaluation and visualization to help users gain knowledge from the raw data and provide feedback. Some IE techniques may rely on extracting patterns from data, which can be seen as a Data Mining step;⁵ however, Information Extraction need not use Data Mining techniques, and many Data Mining tasks – such as outlier detection – have only a tenuous relation to Information Extraction. A survey of approaches that combine Data Mining/Knowledge Discovery with the Semantic Web was published by Ristoski and Paulheim [261] in 2016.

⁵ In fact, the title “Information Extraction” pre-dates that of the title “Data Mining” in its modern interpretation.

With respect to our survey, both Natural Language Processing and Data Mining form part of the background of our scope, but as discussed, Information Extraction has a rather different focus from both areas, neither covering nor being covered by either.

On the other hand, relating more specifically to the intersection of Information Extraction and the Semantic Web, we can identify the following (sub-)areas:

Semantic Annotation: aims to annotate documents with entities, classes, topics or facts, typically based on an existing ontology/KB. Some works on Semantic Annotation fall within the scope of our survey as they include extraction and linking of entities and/or concepts (though not typically relations). A survey focused on Semantic Annotation was published by Uren et al. [300] in 2006.

Ontology-Based Information Extraction: refers to leveraging the formal knowledge of ontologies to guide a traditional Information Extraction process over unstructured corpora. Such works fall within the scope of this survey. A prior survey of Ontology-Based Information Extraction was published by Wimalasuriya and Dou [313] in 2010.

Ontology Learning: helps automate the (costly) process of ontology building by inducing an (initial) ontology from a domain-specific corpus. Ontology Learning also often includes Ontology Population, meaning that instances of concepts and relations are also extracted. Such works fall within our scope. A survey of Ontology Learning was provided by Wong et al. [316] in 2012.

Knowledge Extraction: aims to lift an unstructured or semi-structured corpus into an output described using a knowledge representation formalism (such as OWL). Thus Knowledge Extraction can be seen as Information Extraction but with a stronger focus on using knowledge representation techniques to model outputs. In 2013, Gangemi [110] provided an introduction and comparison of fourteen tools for Knowledge Extraction over unstructured corpora.

Other related terms such as “Semantic Information Extraction” [108], “Knowledge-Based Information Extraction” [139], “Knowledge-Graph Completion” [178], and so forth, have also appeared in the literature. However, many such titles are used specifically within a given community, whereas works in the intersection of IE and SW have appeared in many communities. For example, “Knowledge Extraction” is used predominantly by the SW community and not others.⁶ Hence our survey can be seen as drawing together works in such (sub-)areas under a more general scope: works involving IE techniques in a SW setting.

⁶ Here we mean “Knowledge Extraction” in an IE-related context. Other works on generating explanations from neural networks use the same term in an unrelated manner.

Intended Audience: This survey is written for researchers and practitioners who are already quite familiar with the main SW standards and concepts – such as the RDF, RDFS, OWL and SPARQL standards, etc. – but are not necessarily familiar with IE techniques. Hence we will not introduce SW concepts (such as RDF, OWL, etc.) herein. Otherwise, our goal is to make the survey as accessible as possible. For example, in order to make the survey self-contained, in Appendix A we provide a detailed primer on some traditional NLP and IE processes; the techniques discussed in this appendix are, in general, not in the scope of the survey, since they do not involve SW resources, but are heavily used by works that fall in scope. We recommend that readers unfamiliar with the IE area read the appendix as a primer prior to proceeding to the main body of the survey. Knowledge of some core Information Retrieval concepts – such as TF–IDF, PageRank, cosine similarity, etc. – and some core Machine Learning concepts – such as logistic regression, SVM, neural networks, etc. – may be necessary to understand finer details, but not to understand the main concepts.

Nomenclature: The area of Information Extraction is associated with a diverse nomenclature that may vary in use and connotation from author to author. Such variations may at times be subtle and at other times be entirely incompatible. Part of this relates to the various areas in which Information Extraction has been applied and the variety of areas from which it draws influence. We will attempt to use generalized terminology and indicate when terminology varies.

Survey Methodology: Based on the previous discussion, this survey includes papers that:

   – deal with extraction and/or linking of entities, concepts and/or relations,
   – deal with some Semantic Web standard – namely RDF, RDFS or OWL – or a resource published or otherwise using those standards,
   – have details published, in English, in a relevant workshop, conference or journal since 1999,
   – consider extraction from unstructured sources.

For finding in-scope papers, our methodology begins with a definition of keyphrases appropriate to the section at hand. These keyphrases are divided into lists of IE-related terms (e.g., “entity extraction”, “entity linking”) and SW-related terms (e.g., “ontology”, “linked data”), where we apply a conjunction of their products to create search phrases (e.g., “entity extraction ontology”). Given the diverse terminology used in different communities, often we need to try many variants of keyphrases to capture as many papers as possible. Table 1 lists the base keyphrases used to search for papers; the final keyword searches are given by the set (E ∪ C ∪ R) ‖ SW, where “‖” denotes concatenation (with a delimiting space).

Table 1
Keywords used to search for candidate papers
E/C/R list keywords relating to Entities, Concepts and Relations; SW lists keywords relating to the Semantic Web

   Type  Keyword set
   E     "coreference resolution", "entity disambiguation", "entity linking", "entity recognition", "entity resolution", "named entity", "semantic annotation"
   C     "concept models", "glossary extraction", "group detection", "keyphrase assignment", "keyphrase extraction", "keyphrase recognition", "keyword assignment", "keyword extraction", "keyword recognition", "latent variable models", "LDA", "LSA", "pLSA", "term extraction", "term recognition", "terminology mining", "topic extraction", "topic identification", "topic modeling"
   R     "OpenIE", "open information extraction", "open knowledge extraction", "relation detection", "relation extraction", "semantic relation"
   SW    "linked data", "ontology", "OWL", "RDF", "RDFS", "semantic web", "SPARQL", "web of data"

Our survey methodology consists of four initial phases to search, extract and filter papers. For each defined keyphrase, we (I) perform a search on Google Scholar for related papers, merging and deduplicating lists of candidate papers (numbering in the thousands in total); (II) we initially apply a rough filter for relevance based on the title and type of publication; (III) we filter for relevance by abstract; and (IV) finally we filter for relevance by the body of the paper.

To collect further literature, while reading relevant papers, we also take note of other works referenced in related works, works that cite more prominent relevant papers, and also check the bibliography of prominent authors in the area for other papers that they have written; such works were added in phase III to be later filtered in phase IV. Table 2 presents the numbers of papers considered by each phase of the methodology.⁷

⁷ Table 2 refers to papers considering text as input; a further 20 papers considering semi-structured inputs are presented later in the survey, which will bring the total to 109 selected papers.

Table 2
Number of papers included in the survey (by Phase)
E/C/R denote counts of highlighted papers in this survey relating to Entities, Concepts, and Relations, resp.; Σ denotes the sum of E + C + R by phase.

   Phase                          E       C       R        Σ
   Seed collection (I)        2,418   8,666   8,008   19,092
   Filter by title (II)         114     167     148      429
   Filter by abstract (III)     100     115     102      317
   Final list (IV)               25      36      28       89

We provide further details of our survey online, including the lists of papers considered by each phase.⁸

⁸ http://www.tamps.cinvestav.mx/~lmartinez/survey/

We may include out-of-scope papers to the extent that they serve as important background for the in-scope papers: for example, it is important for an uninitiated reader to understand some of the core techniques considered in the traditional Information Extraction area and to understand some of the core standards and resources considered in the core Semantic Web area. Furthermore, though not part of the main survey, in Section 5, we provide a brief overview of otherwise related papers that consider semi-structured input sources, such as markup documents, tables, etc.

Survey Structure: The structure of the remainder of this survey is as follows:

Section 2 discusses extraction and linking of entities for unstructured sources.
Section 3 discusses extraction and linking of concepts for unstructured sources.
Section 4 discusses extraction and linking of relations for unstructured sources.
Section 5 discusses techniques adapted specifically for extracting entities, concepts and/or relations from semi-structured sources.
Section 6 concludes the survey with a discussion.

Additionally, Appendix A provides a primer on classical Information Extraction techniques for readers previously unfamiliar with the IE area; we recommend that such a reader review this material before continuing.

2. Entity Extraction & Linking

Entity Extraction & Linking (EEL)⁹ refers to identifying mentions of entities in a text and linking them to a reference KB provided as input.

⁹ We note that naming conventions can vary widely: sometimes Named Entity Linking (NEL) is used; sometimes the acronym (N)ERD is used for (Named) Entity Recognition & Disambiguation; sometimes EEL is used as a synonym for NED; other phrases can also be used, such as Named Entity Extraction (NEE), or Named Entity Resolution, or variations on the idea of semantic annotation or semantic tagging (which we consider applications of EEL).

Entity Extraction can be performed using an off-the-shelf Named Entity Recognition (NER) tool as used in traditional IE scenarios (see Appendix A.1); however such tools typically extract entities for limited numbers of types, such as persons, organizations, places, etc.; on the other hand, the reference KB may contain entities from hundreds of types. Hence, while some Entity Extraction & Linking tools rely on off-the-shelf NER tools, others define bespoke methods for identifying entity mentions in text, typically using entities’ labels in the KB as a dictionary to guide the extraction.

Once entity mentions are extracted from the text, the next phase involves linking – or disambiguating – these mentions by assigning them to KB identifiers; typically each mention in the text is associated with a single KB identifier chosen by the process as the most likely match, or is associated with multiple KB identifiers and an associated weight (aka. support) indicating confidence in the matches, allowing the application to choose which entity links to trust.

Example: In Listing 1, we provide an excerpt of an EEL response given by the online DBpedia Spotlight demo¹⁰ in JSON format. Within the result, the “@URI” attribute is the selected identifier obtained from DBpedia, the “@support” is a degree of confidence in the match, the “@types” list matches classes from the KB, the “@surfaceForm” represents the text of the entity mention, the “@offset” indicates the character position of the mention in the text, the “@similarityScore” indicates the strength of a match with the entity label in the KB, and the “@percentageOfSecondRank” indicates the ratio of the support computed for the first- and second-ranked documents thus indicating the level of ambiguity.

¹⁰ http://dbpedia-spotlight.github.io/demo/

Of course, the exact details of the output of an EEL process will vary from tool to tool, but such a tool will minimally return a KB identifier and the location of the entity mention; a support will also often be returned.

Listing 1: DBpedia Spotlight EEL example

    Input: Bryan Cranston is an American actor. He is known for portraying "Walter White" in the drama series Breaking Bad.
    Output:
    {
      "@text": "Bryan Cranston is an American actor. He is known for portraying \"Walter White\" in the drama series Breaking Bad.",
      "@confidence": "0.35",
      "@support": "0",
      "@types": "",
      "@sparql": "",
      "@policy": "whitelist",
      "Resources": [
        { "@URI": "http://dbpedia.org/resource/Bryan_Cranston",
          "@support": "199",
          "@types": "DBpedia:Agent,Schema:Person,Http://xmlns.com/foaf/0.1/Person,DBpedia:Person",
          "@surfaceForm": "Bryan Cranston",
          "@offset": "0",
          "@similarityScore": "1.0",
          "@percentageOfSecondRank": "0.0" },
        { "@URI": "http://dbpedia.org/resource/United_States",
          "@support": "560750",
          "@types": "Schema:Place,DBpedia:Place,DBpedia:PopulatedPlace,Schema:Country,DBpedia:Country",
          "@surfaceForm": "American",
          "@offset": "21",
          "@similarityScore": "0.9940788480408",
          "@percentageOfSecondRank": "0.003612999020" },
        { "@URI": "http://dbpedia.org/resource/Actor",
          "@support": "35596",
          "@types": "",
          "@surfaceForm": "actor",
          "@offset": "30",
          "@similarityScore": "0.9999710345342",
          "@percentageOfSecondRank": "2.433621943E-5" },
        { "@URI": "http://dbpedia.org/resource/Walter_White_(Breaking_Bad)",
          "@support": "856",
          "@types": "DBpedia:Agent,Schema:Person,Http://xmlns.com/foaf/0.1/Person,DBpedia:Person,DBpedia:FictionalCharacter",
          "@surfaceForm": "Walter White",
          "@offset": "66",
          "@similarityScore": "0.9999999999753",
          "@percentageOfSecondRank": "2.471061675E-11" },
        { "@URI": "http://dbpedia.org/resource/Drama",
          "@support": "6217",
          "@types": "",
          "@surfaceForm": "drama",
          "@offset": "87",
          "@similarityScore": "0.8446404328140",
          "@percentageOfSecondRank": "0.1565036704039" },
        { "@URI": "http://dbpedia.org/resource/Breaking_Bad",
          "@support": "638",
          "@types": "Schema:CreativeWork,DBpedia:Work,DBpedia:TelevisionShow",
          "@surfaceForm": "Breaking Bad",
          "@offset": "100",
          "@similarityScore": "1.0",
          "@percentageOfSecondRank": "4.6189529850E-23" }
      ]
    }

Applications: EEL is used in a variety of applications, such as semantic annotation [41], where entities mentioned in text can be further detailed with reference data from the KB; semantic search [295], where search over textual collections can be enhanced – for example, to disambiguate entities or to find categories of relevant entities – through the structure provided by the KB; question answering [299], where the input text is a user question and the EEL process can identify which entities in the KB the question refers to; focused archival [79], where the goal is to collect and preserve documents relating to particular entities; detecting emerging entities [136], where entities that do not yet appear in the KB, but may be candidates for adding to the KB, are extracted.¹¹ EEL can also serve as the basis for later IE processes, such as Topic Modeling, Relation Extraction, etc., as discussed later.

¹¹ Emerging entities are also sometimes known as Out-Of-Knowledge-Base (OOKB) entities or Not In Lexicon (NIL) entities.

Process: As stated by various authors [62,153,243,246], the EEL process is typically composed of two main steps: recognition, where relevant entity mentions in the text are found; and disambiguation, where entity mentions are mapped to candidate identifiers with a final weighted confidence. Since these steps are (often) loosely coupled, this section surveys the various techniques proposed for the recognition task and thereafter discusses disambiguation.

2.1. Recognition

The goal of EEL is to extract and link entity mentions in a text with entity identifiers in a KB; some tools may additionally detect and propose identifiers for emerging entities that are not yet found in the KB [238,247,231]. In both cases, the first step is to mark entity mentions in the text that can be linked (or proposed as an addition) to the KB. Thus traditional NER tools – discussed in Appendix A.1 – can be used. However, in the context of EEL where a target KB is given as input, there can be key differences between a typical EEL recognition phase and traditional NER:

   – In cases where emerging entities are not detected, the KB can provide a full list of target entity labels, which can be stored in a dictionary that is used to find mentions of those entities. While dictionaries can be found in traditional NER scenarios, these often refer to individual tokens that strongly indicate an entity of a given type, such as common first or family names, lists of places and companies, etc. On the other hand, in EEL scenarios, the dictionary can be populated with complete entity labels from the KB for a wider range of types; in scenarios not involving emerging entities, this dictionary will be complete for the entities to recognize. Of course, this can lead to a very large dictionary, depending on the KB used.
   – Relating to the previous point, (particularly) in scenarios where a complete dictionary is available, the line between extraction and linking can become blurred since labels in the dictionary from the KB will often be associated with KB identifiers; hence, dictionary-based detection of entities will also provide initial links to the KB. Such approaches are sometimes known as End-to-End (E2E) approaches [247], where extraction and linking phases become more tightly coupled.
   – In traditional NER scenarios, extracted entity mentions are typically associated with a type, usually with respect to a number of trained types such as person, organization, and location. However, in many EEL scenarios, the types are already given by the KB and are in fact often much richer than what traditional NER models support.

In this section, we thus begin by discussing the preparation of a dictionary and methods used for recognizing entities in the context of EEL.

2.1.1. Dictionary

The predominant method for performing EEL relies on using a dictionary – also known as a lexicon or gazetteer – which maps labels of target entities in the KB to their identifiers; for example, a dictionary might map the label “Bryan Cranston” to the DBpedia IRI dbr:Bryan_Cranston. In fact, a single KB entity may have multiple labels (aka. aliases) that map to one identifier, such as “Bryan Cranston”, “Bryan Lee Cranston”, “Bryan L. Cranston”, etc. Furthermore, some labels may be ambiguous, where a single label may map to a set of identifiers; for example, “Boston” may map to dbr:Boston, dbr:Boston_(band), and so forth. Hence a dictionary may map KB labels to identifiers in a many-to-many fashion. Finally, for each KB identifier, a dictionary may contain contextual features to help disambiguate entities in a later stage; for example, context information may tell us that dbr:Boston is typed as dbo:City in the KB, or that known mentions of dbr:Boston in a text often have words like “population” or “metropolitan” nearby.

Thus, with respect to dictionaries, the first important aspect is the selection of entities to consider (or, indeed, the source from which to extract a selection of entities). The second important aspect – particularly given large dictionaries and/or large corpora of text – is the use of optimized indexes that allow for efficient matching of mentions with dictionary labels. The third aspect to consider is the enrichment of each entity in the dictionary with contextual information to improve the precision of matches. We now discuss these three aspects of dictionaries in turn.

Selection of entities: In the context of EEL, an obvious source from which to form the dictionary is the labels of target entities in the KB. In many Information Extraction scenarios, KBs pertaining to general knowledge are employed; the most commonly used are:

DBpedia [170]: A KB extracted from Wikipedia and used by ADEL [247], DBpedia Spotlight [199], ExPoSe [238], Kan-Dis [144], NERSO [124], Seznam [94], SDA [42] and THD [82], as well as works by Exner and Nugues [95], Nebhi [226], Giannini et al. [116], amongst others;

Freebase [26]: A collaboratively-edited KB – previously hosted by Google but now discontinued in favor of Wikidata [293] – used by JERL [183], Kan-Dis [144], NEMO [68], Neofonie [157], NereL [280], Seznam [94], Tulip [180], as well as works by Zheng et al. [330], amongst others;

Wikidata [309]: A collaboratively-edited KB hosted by the Wikimedia Foundation that, although released more recently than other KBs, has been used by HERD [283];

YAGO(2) [138]: Another KB extracted from Wikipedia with richer meta-data, used by AIDA [139], AIDA-Light [230], CohELL [121], J-NERD [231], KORE [137] and LINDEN [277], as well as works by Abedini et al. [1], amongst others.

These KBs are tightly coupled with owl:sameAs links establishing KB-level coreference and are also tightly coupled with Wikipedia; this implies that once entities are linked to one such KB, they can be transitively linked to the other KBs mentioned, and vice versa. KBs that are tightly coupled with Wikipedia in this manner are popular choices for EEL since they describe a comprehensive set of entities that cover numerous domains of general interest; furthermore, the text of Wikipedia articles on such entities can form a useful source of contextual information.

On the other hand, many of the entities in these general-interest KBs may be irrelevant for certain application scenarios. Some systems support selecting a subset of entities from the KB to form the dictionary, potentially pertaining to a given domain or a selection of types. For example, DBpedia Spotlight [199] can build a dictionary from the DBpedia entities returned as results for a given SPARQL query. Such a pre-selection of relevant entities can help reduce ambiguity and tailor EEL for a given application.

Conversely, in EEL scenarios targeting niche domains not covered by Wikipedia and its related KBs, custom KBs may be required. For example, for the purposes of supporting multilingual EEL, Babelfy [215] constructs its own KB from a unification of Wikipedia, WordNet, and BabelNet. In the context of Microsoft Research, JERL [183] uses a proprietary KB (Microsoft’s Satori) alongside Freebase. Other approaches …

… dictionary preloaded into the index, the text can then be streamed through the index to find (prefix) matches. Phrases are typically indexed separately to allow both word-level and phrase-level matching. This algorithm is implemented by GATE [69] and LingPipe [38], with the latter being used by DBpedia Spotlight [199].
makeminimalassumptionsabouttheKBused,where The main drawback of tries is that, for the match- earlier EEL approaches such as SemTag [80] and ing process to be performed efficiently, the dictio- KIM[250]onlyassumethatKBentitiesareassociated nary index must fit in memory, which may be pro- with labels (in experiments, SemTag [80] uses Stan- hibitive for very large dictionaries. For these reasons, ford’s TAP KB, while KIM [250] uses a custom KB theLucene/SolrTaggerimplementsamoregeneralfi- calledKIMO). nitestatetransducerthatalsoreusessuffixesandbyte- encodings to reduce space [71]; this index is used by Dictionarymatchingandindexing: Inordertomatch HERD[283]andTulip[180]tostoreKBlabels. mentions with the dictionary in an efficient manner – Inothercases,ratherthanusingtraditionalInforma- with O(1) or O(log(n)) lookup performance – opti- tion Extraction frameworks, some authors have pro- mized data structures are required, which depend on posedtoimplementcustomindexingmethods.Togive the form of matching employed. The need for effi- some examples, KIM [250] uses a hash-based index ciency is particularly important for some of the KBs overtokensinanentitymention12;AIDA-Light[230] previouslymentioned,wherethenumberoftargeten- usesaLocalitySensitiveHashing(LSH)indextofind titiesinvolvedcangointothemillions.Thesizeofthe approximate matches in cases where an initial exact- inputcorporaisalsoanimportantconsideration:while matchlookupfails;andsoforth. slower(butpotentiallymoreaccurate)matchingalgo- Of course, the problem of indexing the dictionary rithms can be tolerated for smaller inputs, such algo- is closely related to the problem of inverted indexing rithmsareimpracticalforlargerinputtexts. inInformationRetrieval,wherekeywordsareindexed Amajorchallengeisthatdesirablematchesmaynot againstthedocumentsthatcontainthem.Suchinverted be an exact match, but may rather only be captured indexeshaveproventheirscalabilityandefficiencyin by an approximate string-matching algorithm. 
While Web search engines such as Google, Bing, etc., and one could consider, for example, approximate match- likewisesupportsimpleformsofapproximatematch- ingbasedonregularexpressionsoreditdistances,such ingbasedon,forexample,stemmingorlemmatization, measures do not lend themselves naturally to index- which pre-normalize document and query keywords. based approaches. Instead, for large dictionaries, or Exploiting this natural link to Information Retrieval, large input corpora, it may be necessary to trade re- the ADEL [247], AGDISTIS [301], Kan-Dis [144], call(i.e.,thepercentageofcorrectspotscaptured)for TagMe [99] and WAT [243] systems use inverted- efficiency by using coarser matching methods. Like- indexingschemessuchasLucene13andElastic14. wise, it is important to note that KBs such as DBpe- To manage the structured data associated with en- dia enumerate multiple “alias” labels for entities (ex- tities, such as identifiers or contextual features, some tracted from the redirect entries in Wikipedia), which tools use more relational-style data management sys- ifincludedinthedictionary,canhelptoimproverecall whileusingcoarsermatchingmethods. tems.Forexample,AIDA[139]usesthePostgreSQL A popular approach to index the dictionary is to relationaldatabasetoretrieveentitycandidates,while usesomevariationonaprefixtree(aka.trie),suchas ADEL[247]andNeofonie[157]usetheCouchbase15 usedbytheAho–Corasickstring-searchingalgorithm, andRedis16NoSQLstores,respectively,tomanagethe which can find mentions of an input list of strings labelsandmeta-dataoftheirdictionaries. withinaninputtextintimelineartothecombinedsize of the inputs and output. 
The main idea is to repre- 12ThisimplementationwaslaterintegratedintoGATE:https: sent the dictionary as a prefix tree where nodes refer //gate.ac.uk/sale/tao/splitch13.html to letters, and transitions refer to sequences of letters 13http://lucene.apache.org/core/ 14https://www.elastic.co;notethatElasticSearchisinfact in a dictionary word; further transitions are put from basedonLucene failed matches (dead-ends) to the node representing 15http://www.couchbase.com thelongestmatchingprefixinthedictionary.Withthe 16https://redis.io/ 10 J.L.Martinez-Rodriguezetal./InformationExtractionmeetstheSemanticWeb Contextualfeatures: Ratherthanbeingaflatmapof thatmaynotbeavailableinothertextualcorpora, entity labels to (sets of) KB identifiers, dictionaries including disambiguation pages, redirections of often include contextual features to later help disam- aliases, category information, info-boxes, article biguatecandidatelinks.Suchcontextualfeaturesmay edithistory,andsoforth.17 becategorizedasbeingstructuredorunstructured. We will further discuss how contextual features – Structuredcontextualfeaturesarethosethatcanbe stored as part of the dictionary – can be used for dis- extracteddirectlyfromastructuredorsemi-structured ambiguationlaterinthissection. source.InthecontextofEEL,suchfeaturesareoften 2.1.2. Spotting extracted from the reference KB itself. For example, Wenowassumeadictionarythatmapslabels(e.g., eachentityinthedictionarycanbeassociatedwiththe “Bryan Cranston”, “Bryan Lee Cranston”, etc.) (labelsofthe)typesofthatentity,butalsoperhapsthe to a (set of) KB identifier(s) for the entity ques- labelsofthepropertiesthataredefinedforit,oracount tion (e.g„ “dbr:Bryan_Cranston”) and potentially ofthenumberoftriplesitisassociatedwith,ortheen- some contextual information (e.g., often co-occurs titiesitisrelatedto,oritscentrality(andthus“impor- with“dbr:Breaking_Bad”,anchortextoftenusesthe tance”)inthegraph-structureoftheKB,andsoforth. term “Heisenberg”, etc.). 
In the next step, we iden- Ontheotherhand,unstructuredcontextualfeatures tify entity mentions in the input text. We refer to this are those that must instead be extracted from textual processasspotting,wherewesurveykeyapproaches. corpora. In most cases, this will involving extracting statistics and patterns from an external reference cor- Token-based: Given that entity mentions may con- pusthatpotentiallyhasalreadyhaditsentitieslabeled sistofmultiplesequentialwords–aka.n-grams–the (and linked with the KB). Such features may capture brute-forceoptionwouldbetosendalln-gramsinthe patternsintextsurroundingthementionsofanentity, inputtexttothedictionary,fornupto,say,themaxi- entities that are frequently mentioned close together, mumnumberofwordsfoundinadictionaryentry,ora patternsintheanchor-textoflinkstoapageaboutthat fixedparameter.Werefergenericallytothesen-grams entity, in how many documents a particular entity is astokensandtothesemethodsforextractingn-grams mentioned, how many times it tends to be mentioned astokenization.Sometimesthesemethodsarereferred inaparticulardocument,andsoforth;clearlysuchin- toaswindow-basedspottingorrecognitiontechniques. formationwillnotbeavailablefromtheKBitself. A number of systems use such a form of tokeniza- A very common choice of text corpora for extract- tion. SemTag uses the TAP ontology for seeking en- ing both structured and unstructured contextual fea- titymentionsthatmatchtokensfromtheinputtext.In tures is Wikipedia, whose use in this setting was – to AIDA-Light [230], AGDISTIS [301], Lupedia [203], thebestofourknowledge–firstproposedbyBunescu and NERSO [124], recognition uses sliding windows and Pasca [33], then later followed by many other overthetextforvarying-lengthn-grams. subsequent works [67,99,42,255,39,40,243,246]. 
The Although relatively straightforward, a fundamental widespread use of Wikipedia can be explained by the weaknesswithtoken-basedmethodsrelatestoperfor- uniqueadvantagesithasforsuchtasks: mance:givenalargetext,thedictionary-lookupimple- mentation will have to be very efficient to deal with – The text in Wikipedia is primarily factual and thenumberoftokensatypicalsuchprocesswillgen- availableinavarietyoflanguages. erate, many of which will be irrelevant. While some – Wikipedia has broad coverage, with documents basicfeatures,suchascapitalization,canalsobetaken aboutentitiesinavarietyofdomains. intoaccounttofilter(some)tokens,still,notallmen- – ArticlesinWikipediacanbedirectlylinkedtothe tions may have capitalization, and many irrelevant or entities they describe in various KBs, including incoherent entities can still be retrieved; for example, DBpedia,Freebase,Wikidata,YAGO(2),etc. by decomposing the text “New York City”, the sec- – MentionsofentitiesinWikipediaoftenprovidea ond bi-gram may produce York City in England as a linktothearticleaboutthatentity,thusproviding candidate,though(probably)irrelevanttothemention. labeled examples of entity mentions and associ- Suchentitiesareknownasoverlappingentities,where atedexamplesofanchortextinvariouscontexts. post-processingmustbeapplied(discussedlater). – Aside from the usual textual features such as term frequencies and co-occurrences, a variety 17Informationfrominfo-boxes,disambiguation,redirectsandcat- of richer features are available from Wikipedia egoriesarealsorepresentedinastructuredformatinDBpedia.
