Semantic technologies and linked data for the Italian PA: the case of data.cnr.it

Semantic technologies and linked data, with a case study at the Consiglio Nazionale delle Ricerche (CNR)

Aldo Gangemi

JLIS.it. Vol. 4, n. 1 (Gennaio/January 2013). DOI: 10.4403/jlis.it-5457

Web evolution and resource dereferencing

The Web is evolving from a web of documents (often called Web 1.0) to a web of entities (called, with subtle differences in meaning, semantic web, web of data, or Web 3.0). This evolution also passes through the ability of users to edit its contents and to generate complex social networks through simple interaction paradigms (known as social web or Web 2.0). This is happening primarily thanks to a deeper exploitation of the Web architecture designed in the nineties,[1] which enables the dereferencing and linking of web resources (identified by means of a Web address) through simple communication protocols (e.g. HTTP). For example, when one types the address (URI) http://www.cnr.it (the web address of the portal of the Consiglio Nazionale delle Ricerche, CNR) into a browser, the HTTP client of the browser dereferences that address by communicating with a server at CNR, which returns an HTML page that is in turn rendered by the browser. Other web pages can be linked from the rendered page, creating a network of hypertextual links that enables the browsing experience. This is basically the web of documents.

Sometimes dereferencing is indirect, as when an address represents a call to a database, e.g. when looking up one's tax data on the website of the Agenzia delle Entrate (the Italian fiscal authority): this is still the web of documents, but the documents are generated from a query to a database, whose answer is rendered in HTML by using XML stylesheets. Web 2.0 is a more sophisticated case of indirect dereferencing, which also enables users to make direct changes to a database: applications such as voice protocols, email, tagging, automatic log analysis, opinion mining etc. converge in rich, customizable and dynamic HTML pages, as in the case of Facebook.

[1] http://www.w3.org/TR/webarch
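To make the dereferencing step concrete, the following minimal Python sketch builds the kind of HTTP GET request that a browser's client would send for http://www.cnr.it. The helper names `build_request` and `dereference`, and the choice of an `Accept: text/html` header, are assumptions made for illustration; only the standard `urllib` module is used, and nothing is sent over the network unless `dereference` is actually called.

```python
from urllib.request import Request, urlopen


def build_request(uri: str, accept: str = "text/html") -> Request:
    """Prepare a GET request for a web address (hypothetical helper).

    The Accept header states which representation the client would like;
    text/html is an assumption matching the browser example in the text.
    """
    return Request(uri, headers={"Accept": accept})


def dereference(uri: str, timeout: float = 10.0) -> bytes:
    """Send the request and return the body the server answers with.

    This needs network access, so it is defined here but not called.
    """
    with urlopen(build_request(uri), timeout=timeout) as response:
        return response.read()


# Inspect the request without contacting any server.
request = build_request("http://www.cnr.it")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Separating the construction of the request from its execution keeps the example inspectable offline; a real client would also handle redirects, errors and character encodings.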
Two difficult problems: identity and semantic interoperability

Web 1.0 and 2.0 have two limitations, which have actually existed in information systems for centuries: identity and semantic interoperability. The identity issue arises e.g. in the following example. Aldo Gangemi has different home pages (one on his institute site, ISTC-CNR, one on the wiki of his lab, STLab, one on the semanticweb.org site etc.). He is also registered on many other portals providing services to citizens, to members of associations, conference committees, commercial services etc. Moreover, he has several accounts on social web applications (e.g. Facebook, Gmail, Flickr, iTunes etc.). Furthermore, Aldo Gangemi is a datum within public or personal databases, like Google Scholar, DBLP etc.; that datum has identifiers that are owned specifically by those databases (thus gathering a sort of "positional" identity within one of their tables). Finally, Aldo Gangemi is cited in other web pages: articles, bibliographic references, event reports. Now the issue is: how can we know that (the physical or social person) Aldo Gangemi is the entity denoted by his home pages, registrations, accounts, database IDs, citations? Intuitively, the issue is not limited to persons, but affects everything that has an identity: places, organizations, products, services, events, laws, ideas, concepts, fictional things etc.

The semantic interoperability issue, besides purely system-oriented problems (e.g. different computational platforms), arises from the identity issue: if we cannot resolve the identity of something across the different sources and systems that refer to it, it becomes really difficult to aggregate (i.e. assemble) and integrate (i.e. appropriately connect) the information about it.
This is quite limiting when considering that the relations between something and something else can be similar within different systems: the relation between Aldo Gangemi and the email messages addressed to him, or between him and his recipients, is similar in any emailing system, but those systems assign different identities to the same persons, if any. In addition, each system works on a proprietary infrastructure: different languages, formats, protocols etc. All this makes data integration between different systems partial in the best cases.

Some traditional solutions

In recent years, a sort of cartel has emerged among commercial service providers such as Facebook, Google etc., in order to make social network data interoperable: this however concerns only the data exchanges that are commercially interesting for those systems, and for the third-party applications that rely on them. Database management systems use complex procedures to integrate their data when required: schema integration, identity resolution, data warehousing etc. Each process is typically carried out ad hoc on a pair of databases. Partial solutions for data integration also come from data mining or natural language processing techniques. For example, there are effective statistical approaches for named entity recognition and resolution, as well as for discovering similarity and indirect relations in data. Document annotation is an approach that dates back at least to the beginning of the 20th century: a document, or a part of it (paragraphs, terms), is annotated with a category or tag taken from some knowledge organization system: thesauri, classification schemes, nomenclatures, controlled vocabularies, which have been developed in most scientific, library, and commercial disciplines. Exemplary cases of such large efforts include SnoMed, ICD, MeSH (medicine), the Getty thesaurus (cultural heritage), Agrovoc (agriculture) etc. Recently, annotation procedures have been assisted either by computational support for manual annotation, or by automatic annotation algorithms (e.g. text classification), with variable precision.

The web of data

In 2006, Tim Berners-Lee introduced linked data, a simple and elegant method[2] to realize some practical data identity integration and interoperability on the Web. Linked data are aimed at realizing a web of data (or Entities, or Things, depending on whether the interest lies in data management, in entity linking, or in sensors and things in the physical world). Linked data is one of the technologies for the semantic web (discussed in the next section), and consists of four principles and many good practices. The principles are:

1. use web addresses (URIs) as names for entities/things;
2. use HTTP URIs so that people can look up and dereference those names;
3. when someone looks up a URI, provide useful information, using the standards (RDF, RDFS, SPARQL, OWL, RIF);
4. include links to other URIs, in order to be able to discover more things and data.

Among good practices, it is useful to mention those that have best supported the Linking Open Data (LOD) bootstrap, whose state of play is visualized periodically as a cloud:[3]

• use open licenses to obtain highly reusable data;
• use non-proprietary formats (e.g. CSV instead of Excel);
• use W3C open standards (typically RDF,[4] SPARQL,[5] OWL[6]) to identify things, so that people can point at your stuff, new links can be created, and better queries and more extended reasoning can be performed.

These practices also fit recommendations from the Open Data movement, and are currently adopted in many different fields, including Public Administration data;[7] they are also used in the integration and enrichment of data, for example for the expert finding task.[8]

The LOD Cloud contains linked data from many different domains, in particular biomedical, cultural, multimedia, bibliographic, geographic etc. An example of the potential of linked data is shown in figure 1, a graph built automatically by an application (RelFinder[9]), which incrementally visualizes the relations between any two entities, provided that they have an identity on the web of data. In figure 1, graph building starts from the entities:

    <http://dbpedia.org/wiki/Neo-positivism>
    <http://dbpedia.org/wiki/Francis_Bacon>

Figure 1: The emerging relations between two entities across the Linking Open Data graph.

[2] http://www.w3.org/DesignIssues/LinkedData.html
[3] http://linkeddata.org
[4] http://www.w3.org/RDF/
[5] http://www.w3.org/TR/rdf-sparql-query/
[6] http://www.w3.org/2004/OWL/
[7] http://data.gov; http://data.gov.uk; http://dati.gov.it
[8] http://data.cnr.it
[9] http://www.visualdataweb.org/relfinder.php

Semantic web standards

W3C open standards, primarily RDF,[10] SPARQL[11] and OWL,[12] enable an elegant and homogeneous representation of, as well as querying and reasoning on, the data from most traditional data structures and data models.

[10] http://www.w3.org/RDF/
[11] http://www.w3.org/TR/rdf-sparql-query/
[12] http://www.w3.org/2004/OWL/

RDF is based on a recursive data structure, called triple, made of a Subject, a Predicate, and an Object, analogously to the most abstract grammatical structure of Western languages, SVO (Subject-Verb-Object).

Listing 1: Sample RDF triples.

    <http://www.cnr.it/ontology/cnr/individuo/unitaDiPersonaleInterno/MATRICOLA1582>
        <http://www.w3.org/2000/01/rdf-schema#label>
        "Aldo Gangemi" .

    <http://www.cnr.it/ontology/cnr/individuo/unitaDiPersonaleInterno/MATRICOLA1582>
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
        <http://www.cnr.it/ontology/cnr/personale.owl#UnitaDiPersonaleInterno> .

    <http://www.cnr.it/ontology/cnr/individuo/unitaDiPersonaleInterno/MATRICOLA1582>
        <http://purl.org/dc/terms/subject>
        <http://dbpedia.org/resource/Category:Semantic_Web> .

RDF triples can be queried via the SPARQL language.

Listing 2: Query on the triples in listing 1.

    SELECT ?l WHERE {
      ?x <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Semantic_Web> .
      ?x <http://www.w3.org/2000/01/rdf-schema#label> ?l
    }
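The way the two patterns of listing 2 are matched against the triples of listing 1 can be imitated with a few lines of Python. This is only a toy sketch under stated assumptions: triples are plain tuples of strings, the helpers `match` and `labels_about` are hypothetical names introduced here, and a real deployment would rely on an RDF store and a SPARQL engine rather than on this code.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC_SUBJECT = "http://purl.org/dc/terms/subject"

PERSON = ("http://www.cnr.it/ontology/cnr/individuo/"
          "unitaDiPersonaleInterno/MATRICOLA1582")
PERSON_CLASS = "http://www.cnr.it/ontology/cnr/personale.owl#UnitaDiPersonaleInterno"
SEMANTIC_WEB = "http://dbpedia.org/resource/Category:Semantic_Web"

# The three triples of listing 1, as (subject, predicate, object) tuples.
TRIPLES: List[Triple] = [
    (PERSON, RDFS_LABEL, "Aldo Gangemi"),
    (PERSON, RDF_TYPE, PERSON_CLASS),
    (PERSON, DC_SUBJECT, SEMANTIC_WEB),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield the triples that fit a pattern; None plays the role of a variable."""
    for triple in triples:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


def labels_about(triples: List[Triple], topic: str) -> List[str]:
    """Imitate listing 2: join two patterns on the shared variable ?x."""
    labels = []
    for x, _, _ in match(triples, p=DC_SUBJECT, o=topic):
        for _, _, label in match(triples, s=x, p=RDFS_LABEL):
            labels.append(label)
    return labels


print(labels_about(TRIPLES, SEMANTIC_WEB))  # prints ['Aldo Gangemi']
```

The nested loop is the simplest possible join strategy; it is enough to show that the shared variable `?x` is what ties the topic pattern to the label pattern.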
The query in listing 2 gets the answer:

    l
    "Aldo Gangemi"

Each triple contains Subjects and Objects that have a type, which is in turn a Class, e.g.

    <http://www.cnr.it/ontology/cnr/personale.owl#UnitaDiPersonaleInterno>
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
        <http://www.w3.org/2002/07/owl#Class> .

Each triple contains a Predicate (or Property); properties together with classes form the vocabulary (also called schema or ontology) used by a dataset. In cases where logical validation and reasoning are required, a vocabulary is defined in the OWL (Web Ontology Language) standard,[13] a language that allows the use of automated reasoners to derive logical inferences out of data structures. For example, an automated reasoner infers the inverses of existing triples, the symmetric triples, the triples holding transitively (when appropriate rules have been defined for the vocabulary) etc.

[13] http://www.w3.org/2004/OWL/

With the expressive power of OWL and SPARQL on the web of data, one can ask complex questions of heterogeneous knowledge sources. E.g. in the Roman Law domain, the following natural language questions can be formalized as queries, but terms need to be mapped to appropriate entity types in RDF and OWL. In this case, underlined terms are supposed to be mapped as classes, bold-faced terms as properties, and terms in italics as specific entities or values:

- which Roman Law sources contain maxims concerning stipulation, cite Ulpian, and include commentaries published in the last 10 years?
- which cases that appeared in Common Law systems contain interpretations relative to contracts analogous to stipulatio?

In order to improve vocabulary quality and inference capabilities, additional axioms need to be defined (e.g. what type of entities can be analogous to what, what can be cited in what etc.).
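The inverse, symmetric and transitive inferences mentioned in this section depend on exactly this kind of axiom: someone has to declare which properties behave in which way. The Python sketch below illustrates the idea on a handful of invented facts. Every identifier with the `ex:` prefix, the function `materialize` and the property declarations are hypothetical, and the loop covers only these three rule types, far less than an actual OWL reasoner.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def materialize(triples: Iterable[Triple],
                inverse_of: Dict[str, str],
                symmetric: Set[str],
                transitive: Set[str]) -> Set[Triple]:
    """Add inferred triples until nothing new can be derived (a fixpoint)."""
    closure = set(triples)
    while True:
        new: Set[Triple] = set()
        for s, p, o in closure:
            if p in inverse_of:          # p(s, o) implies inverse(o, s)
                new.add((o, inverse_of[p], s))
            if p in symmetric:           # p(s, o) implies p(o, s)
                new.add((o, p, s))
            if p in transitive:          # p(s, o) and p(o, z) imply p(s, z)
                for s2, p2, o2 in closure:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= closure:
            return closure
        closure |= new


# Invented facts loosely inspired by the Roman Law questions in the text.
FACTS = [
    ("ex:Digest", "ex:cites", "ex:Ulpian"),
    ("ex:stipulatio", "ex:analogousTo", "ex:contract"),
    ("ex:maxim1", "ex:partOf", "ex:title1"),
    ("ex:title1", "ex:partOf", "ex:Digest"),
]

inferred = materialize(
    FACTS,
    inverse_of={"ex:cites": "ex:citedBy"},
    symmetric={"ex:analogousTo"},
    transitive={"ex:partOf"},
)

for triple in sorted(inferred - set(FACTS)):
    print(triple)
```

With these declarations the loop derives that Ulpian is cited by the Digest, that a contract is analogous to stipulatio, and that the maxim is part of the Digest; changing the declarations changes what can be inferred, which is why their design matters.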
Vocabulary design therefore requires a certain accuracy and quality control, which can be obtained by means of approaches oriented to user requirements, and with the reuse of standard vocabularies and ontology design patterns[14] that are known to describe the domain of interest and/or to solve the modeling problems emerging from user requirements.

[14] http://www.ontologydesignpatterns.org

Semantic applications

The availability of large open data can provide a good motivation to develop next-generation applications, which build on both existing and novel solutions, focused on the semantic paradigm: using the meaning of data as a widespread organizational schema.

The life cycle of a semantic application is typically the following:

1. reengineering existing data, by producing datasets in RDF triples (data) and OWL (vocabularies);
2. linking between entities in multiple datasets, and production of new triples;
3. extraction of new entities and triples by means of data mining and natural language processing techniques, and production of new triples;
4. reasoning on the logical structures obtained from previous steps, and possible production of new triples (materialization);
5. publishing of datasets on appropriate platforms, for SPARQL querying;
6. presentation of enriched data to be used by web users: textual, graphic, rich snippets, explorative etc.

The life cycle reflects multiple interpretations of the term semantics. In steps 2 and 3, we refer primarily to the linguistic semantics that is implicit in the analyzed texts; related technologies are those of text and data analysis, and aim at recognizing entities, names, terms, relations, facts, topics etc. Only once we have extracted them can we produce new formal triples. In steps 1, 4 and 5, we refer to the logical (or formal) semantics of data and schemas; the related technology is basically what we have mentioned in previous sections as "semantic web" (which is a mix of web science and knowledge representation). In step 6, we refer to the semantics of user interaction.

Technologies oriented to linguistic semantics make it possible, e.g., to recognize entities in texts, and to resolve their identity with respect to known datasets. Once identity has been resolved, it is possible to enrich the dataset with known relations between that entity and other entities. For example, given the following text from the proceedings of the European Parliament:

    The sensitivities of Northern Ireland are too important for any ill-informed bandwagoning on the International Fund for Ireland. Raytheon has been welcomed to Derry by no less than Nobel Peace Prize winners, John Hume – one of our own colleagues, and David Trimble. Raytheon will be funded by the Industrial Development Board in Northern Ireland. Not one euro nor one Irish pound from the International Fund for Ireland is going to Raytheon.
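As a rough illustration of recognition and identity resolution on the last two sentences of the passage above, the Python sketch below looks up names from a small hand-made list and emits one "mentions" triple per resolved entity. It is deliberately naive: the gazetteer, all example.org URIs, the `mentions` property and the function names are assumptions invented for this sketch, whereas real systems use statistical named entity recognition and resolve against actual datasets.

```python
import re
from typing import Dict, List, Tuple

TEXT = (
    "Raytheon will be funded by the Industrial Development Board in "
    "Northern Ireland. Not one euro nor one Irish pound from the "
    "International Fund for Ireland is going to Raytheon."
)

# Hypothetical gazetteer: surface names mapped to made-up identifiers.
KNOWN_ENTITIES: Dict[str, str] = {
    "Raytheon": "http://example.org/entity/Raytheon",
    "Industrial Development Board": "http://example.org/entity/IDB",
    "Northern Ireland": "http://example.org/entity/Northern_Ireland",
    "International Fund for Ireland": "http://example.org/entity/IFI",
}

DOCUMENT = "http://example.org/doc/debate-excerpt"   # hypothetical URI
MENTIONS = "http://example.org/vocab/mentions"       # hypothetical property


def recognize(text: str, gazetteer: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """Return (offset, name, uri) for each exact occurrence of a known name."""
    hits = []
    for name, uri in gazetteer.items():
        for found in re.finditer(re.escape(name), text):
            hits.append((found.start(), name, uri))
    return sorted(hits)


def mention_triples(document: str, text: str,
                    gazetteer: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Produce one triple per distinct entity, in order of first appearance."""
    seen: List[str] = []
    for _, _, uri in recognize(text, gazetteer):
        if uri not in seen:
            seen.append(uri)
    return [(document, MENTIONS, uri) for uri in seen]


for triple in mention_triples(DOCUMENT, TEXT, KNOWN_ENTITIES):
    print(triple)
```

Exact string lookup cannot tell apart different entities sharing a name, nor find names missing from the list; it only shows where the new triples of steps 2 and 3 come from once an identity has been resolved.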
