ebook img

Big Data Integration Theory: Theory and Methods of Database Mappings, Programming Languages, and Semantics PDF

528 Pages·2014·6.237 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Big Data Integration Theory: Theory and Methods of Database Mappings, Programming Languages, and Semantics

Texts in Computer Science Zoran Majkić Big Data Integration Theory Theory and Methods of Database Mappings, Programming Languages, and Semantics Texts in Computer Science Editors DavidGries FredB.Schneider Forfurthervolumes: www.springer.com/series/3191 Zoran Majkic´ Big Data Integration Theory Theory and Methods of Database Mappings, Programming Languages, and Semantics ZoranMajkic´ ISRST Tallahassee,FL,USA SeriesEditors DavidGries FredB.Schneider DepartmentofComputerScience DepartmentofComputerScience CornellUniversity CornellUniversity Ithaca,NY,USA Ithaca,NY,USA ISSN1868-0941 ISSN1868-095X(electronic) TextsinComputerScience ISBN978-3-319-04155-1 ISBN978-3-319-04156-8(eBook) DOI10.1007/978-3-319-04156-8 SpringerChamHeidelbergNewYorkDordrechtLondon LibraryofCongressControlNumber:2014931373 ©SpringerInternationalPublishingSwitzerland2014 Thisworkissubjecttocopyright.AllrightsarereservedbythePublisher,whetherthewholeorpartof thematerialisconcerned,specificallytherightsoftranslation,reprinting,reuseofillustrations,recitation, broadcasting,reproductiononmicrofilmsorinanyotherphysicalway,andtransmissionorinformation storageandretrieval,electronicadaptation,computersoftware,orbysimilarordissimilarmethodology nowknownorhereafterdeveloped.Exemptedfromthislegalreservationarebriefexcerptsinconnection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’slocation,initscurrentversion,andpermissionforusemustalwaysbeobtainedfromSpringer. PermissionsforusemaybeobtainedthroughRightsLinkattheCopyrightClearanceCenter.Violations areliabletoprosecutionundertherespectiveCopyrightLaw. Theuseofgeneraldescriptivenames,registerednames,trademarks,servicemarks,etc.inthispublication doesnotimply,evenintheabsenceofaspecificstatement,thatsuchnamesareexemptfromtherelevant protectivelawsandregulationsandthereforefreeforgeneraluse. Whiletheadviceandinformationinthisbookarebelievedtobetrueandaccurateatthedateofpub- lication,neithertheauthorsnortheeditorsnorthepublishercanacceptanylegalresponsibilityforany errorsoromissionsthatmaybemade.Thepublishermakesnowarranty,expressorimplied,withrespect tothematerialcontainedherein. Printedonacid-freepaper SpringerispartofSpringerScience+BusinessMedia(www.springer.com) Preface Bigdataisapopulartermusedtodescribetheexponentialgrowth,availabilityand useofinformation,bothstructuredandunstructured.Muchhasbeenwrittenonthe big data trend and how it can serve as the basis for innovation,differentiation and growth. According to International Data Corporation (IDC) (one of the premier global providers of market intelligence,advisory services, and eventsfor the information technology,telecommunicationsandconsumertechnologymarkets),itisimperative thatorganizationsandITleadersfocusontheever-increasingvolume,varietyand velocityof informationthat forms big data.From Internet sources,availableto all riders,hereIbrieflycitemostofthem: • Volume. Many factors contribute to the increase in data volume—transaction- baseddatastoredthroughtheyears,textdataconstantlystreaminginfromsocial media, increasing amounts of sensor data being collected, etc. In the past, ex- cessivedatavolumecreatedastorageissue.Butwithtoday’sdecreasingstorage costs,otherissuesemerge,includinghowtodeterminerelevanceamidstthelarge volumesofdataandhowtocreatevaluefromdatathatisrelevant. • Variety.Datatodaycomesinalltypesofformats—fromtraditionaldatabasesto hierarchical data stores created by end users and OLAP systems, to text docu- ments,email,meter-collecteddata,video,audio,stocktickerdataandfinancial transactions. • Velocity.AccordingtoGartner,velocitymeansbothhowfastdataisbeingpro- ducedandhowfastthedatamustbeprocessedtomeetdemand.Reactingquickly enoughtodealwithvelocityisachallengetomostorganizations. • Variability. In addition to the increasing velocities and varieties of data, data flowscanbehighlyinconsistentwithperiodicpeaks.Daily,seasonalandevent- triggeredpeakdataloadscanbechallengingtomanage—especiallywithsocial mediainvolved. • Complexity. When you deal with huge volumes of data, it comes from mul- tiple sources. It is quite an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relation- ships,hierarchiesandmultipledatalinkagesoryourdatacanquicklyspiralout v vi Preface ofcontrol.Datagovernancecanhelpyoudeterminehowdisparatedatarelatesto commondefinitionsandhowtosystematicallyintegratestructuredandunstruc- tureddataassets toproducehigh-qualityinformationthatis useful,appropriate andup-to-date. Technologies today not only support the collection and storage of large amounts ofdata,theyprovidetheabilitytounderstandandtakeadvantageofitsfullvalue, whichhelpsorganizationsrunmoreefficientlyandprofitably. WecanconsideraRelationalDatabase(RDB)asanunifyingframeworkinwhich wecanintegrateallcommercialdatabasesanddatabasestructuresoralsounstruc- tureddatawrappedfromdifferentsourcesandusedasrelationaltables.Thus,from thetheoreticalpointofview,wecanchoseRDBasageneralframeworkfordatain- tegrationandresolvesomeoftheissuesabove,namelyvolume,variety,variability and velocity, by using the existing Database Management System (DBMS) tech- nologies. Moreover,simplerformsofintegrationbetweendifferentdatabasescanbeeffi- cientlyresolvedbyDataFederationtechnologiesusedforDBMStoday. Moreoften,emergentproblemsrelatedtothecomplexity(thenecessitytocon- nectandcorrelaterelationships)inthesystematicintegrationofdataoverhundreds andhundredsofdatabasesneednotonlytoconsidermorecomplexschemadatabase mappings,butalsoanevolutionarygraphicalinterfaceforauserinordertofacilitate themanagementofsuchhugeandcomplexsystems. Suchresultsarepossibleonlyunderacleartheoreticalandalgebraicframework (similartothealgebraicframeworkforRDB)whichextendsthestandardRDBwith more powerful features in order to manage the complex schema mappings (with, for example,mergingand matchingof databases,etc.).More work aboutDataIn- tegration is given in pure logical framework (as in RDB where we use a subset of the First Order Logic (FOL)). However, unlike with the pure RDB logic, here we have to deal with a kind of Second Order Logic based on the tuple-generating dependencies(tgds).Consequently,weneedtoconsideran‘algebraization’ofthis subclassoftheSecondOrderLogicandtotranslatethedeclarativespecificationsof logic-based mapping between schemas into the algebraic graph-based framework (sketches)and,ultimately,toprovidedenotationalandoperationalsemanticsofdata integrationinsideauniversalalgebraicframework:thecategorytheory. The kind of algebraization used here is different from the Lindenbaum method (used, for example, to define Heyting algebras for the propositional intuitionistic logic (in Sect. 1.2), or used to obtain cylindric algebras for the FOL), in order to supportthecompositionalpropertiesoftheinter-schemamapping. In this framework, especially because of Big Data, we need to theoretically consider both the inductive and coinductive principles for databases and infinite databasesas well.In thissemanticframeworkof Big-Dataintegration,wehaveto investigatethepropertiesofthebasicDBcategorybothwithitstopologicalproper- ties. Integrationacrossheterogeneousdataresources—somethatmightbeconsidered “big data” and others not—presents formidable logistic as well as analytic chal- lenges,butmanyresearchersarguethatsuchintegrationsarelikelytorepresentthe Preface vii mostpromisingnewfrontiersinscience[2,5,6,10,11].Thismonographisasyn- thesisofmypersonalresearchinthisfieldthatIdevelopedfrom2002to2013:this workpresentsacompleteformalframeworkforthesenewfrontiersinscience. Sincethelate1960s,therehasbeenconsiderableprogressinunderstandingthe algebraic semantics of logic and type theory, particularly because of the develop- ment of categorical analysis of most of the structures of interest to logicians. Al- though there have been other algebraic approaches to logic, none has been as far reachinginitsaimsandinitsresultsasthecategoricalapproach.Fromafairlymod- estbeginning,categoricallogichasmaturedverynicelyinthepastfourdecades. Categoricallogicisabranchofcategorytheorywithinmathematics,adjacentto mathematicallogicbutmorenotableforitsconnectionstotheoreticalcomputersci- ence[4].Inbroadterms,categoricallogicrepresentsbothsyntaxandsemanticsby acategory,andaninterpretationbyafunctor.Thecategoricalframeworkprovidesa richconceptualbackgroundforlogicalandtype-theoreticconstructions.Thesubject hasbeenrecognizableinthesetermssincearound1970. This monograph presents a categorical logic (denotational semantics) for databaseschemamappingbasedonviewsinaverygeneralframeworkfordatabase- integration/exchange and peer-to-peer. The base database category DB (instead of traditional Set category), with objects instance-databases and with morphisms (mappings which are not simple functions) between them, is used at an instance levelasapropersemanticdomainforadatabasemappingsbasedonasetofcom- plexquerycomputations. Thehigherlogicalschemalevelofmappingsbetweendatabases,usuallywritten insomehighexpressivelogicallanguage(ex.[3,7],GLAV(LAVandGAV),tuple generatingdependency)canthenbetranslatedfunctoriallyintothisbase“computa- tion”category. DifferentfromtheData-exchangesettings,wearenotinterestedinthe‘minimal’ instance B of the target schema B, of a schema mapping M :A→B. In our AB more general framework, we do not intend to determine ‘exactly’, ‘canonically’ or ‘completely’ the instance of the target schema B (for a fixed instance A of the sourceschema A)justbecauseoursettingismoregeneralandthetargetdatabase is only partially determined by the source database A. Another part of this target databasecanbe,forexample,determinedbyanotherdatabaseC (thatisnotanyof the‘intermediate’databasesbetweenAandB),orbythesoftwareprogramswhich updatetheinformationcontainedinthistargetdatabaseB.Inotherwords,theData- exchange(andDataintegration)settingsareonlyspecialparticularsimplercasesof thisgeneralframeworkofdatabasemappingswhereeachdatabasecanbemapped from other sources, or maps its own information into other targets, and is to be locallyupdatedaswell. The new approach based on the behavioral point of view for databases is as- sumed, and behavioral equivalences for databases and their mappings are estab- lished. The introduction of observations, which are computations without side- effects, defines the fundamental (from Universal algebra) monad endofunctor T, which is also the closure operator for objects and for morphisms such that the database lattice (cid:3)Ob ,(cid:4)(cid:5) is an algebraic (complete and compact) lattice, where DB viii Preface Ob is a set of all objects (instance-database) of DB category and “(cid:4)” is a pre- DB orderrelationbetweenthem.Thejoinandmeetoperatorsofthisdatabaselatticeare MergingandMatchingdatabaseoperators,respectively. Theresulting2-category DB issymmetric(alsoamappingisrepresentedasan object (i.e., instance-database)) and hence the mappings between mappings are a 1-cellmorphismsforallhighermeta-theories.Moreover,eachmappingisahomo- morphismfromaKleislimonadicT-coalgebraintothecofreemonadicT-coalgebra. ThedatabasecategoryDBhasniceproperties:itisequaltoitsdual,completeand cocomplete, locally small and locally finitely presentable, and monoidal biclosed V-category enriched over itself. The monad derived from the endofunctor T is an enrichedmonad. Generally, database mappings are not simply programs from values (i.e., rela- tions) into computations (i.e., views) but an equivalence of computations because each mapping between any two databases A and B is symmetric and provides a dualityproperty to the DB category.The denotationalsemantics of databasemap- pingsisgivenbymorphismsoftheKleislicategoryDB whichmaybe“internal- T ized” in DB category as “computations”. Special attention is devoted to a number ofpracticalexamples:querydefinition,queryrewritingintheDatabase-integration environment,P2PmappingsandtheirequivalentGAVtranslation. Thebookisintendedtobeaccessibletoreaderswhohavespecialistknowledge of theoretical computer science or advanced mathematics (category theory), so it attempts to treat the important database mapping issues accurately and in depth. The book exposes the author’s original work on database mappings, its program- minglanguage(algebras)anddenotationalandoperationalsemantics.Theanalysis method is constructed as a combination of technics from a kind of Second Order Logic,datamodeling,(co)algebrasandfunctorialcategorialsemantics. Two primary audiences exist in the academic domain. First, the book can be usedasatextforagraduatecourseintheBigDataIntegrationtheoryandmethods withinadatabaseengineeringmethodscurriculum,perhapscomplementinganother course on (co)algebras and category theory. This would be of interest to teachers ofcomputerscience,databaseprogramminglanguagesandapplicationsincategory theory.Second,researchesmaybeinterestedinmethodsofcomputerscienceusedin databasesandlogics,andtheoriginalcontributions:acategorytheoryappliedtothe databases.ThesecondaryaudienceIhaveinmindistheITsoftwareengineersand, generally, people who work in the development of the database tools: the graph- based categorial formal framework for Big Data Integration is helpful in order to developnewgraphictoolsinBigDataIntegration.Inthisbook,anewapproachto thedatabaseconceptsdevelopedfromanobservationalequivalencebasedonviews ispresented.ThemainintuitiveresultoftheobtainedbasicdatabasecategoryDB, moreappropriatethanthecategorySetusedforcategorialLawvere’stheories,isto havethepossibilityofmakingsyntheticrepresentationsofdatabasemappingsand queriesoverdatabasesinagraphicalform,suchthatallmapping(andquery)arrows can be composed in order to obtain the complex database mapping diagrams. For example,fortheP2Psystemsorthemappingsbetweendatabasesincomplexdata warehouses. Formally, it is possible to develop a graphic (sketch-based) tool for a Preface ix meta-mapping description of complex (and partial) mappings in various contexts withaformalmathematicalbackground.Apartofthisbookhasbeenpresentedto severalaudiencesatvariousconferencesandseminars. DependenciesBetweentheChapters Aftertheintroduction,thebookisdividedintothreeparts.Thefirstpartiscomposed of Chaps. 2, 3 and 4, which is a nucleus of this theory with a number of practical examples. The second part, composed of Chaps. 5, 6 and 7, is dedicated to com- putationalpropertiesoftheDBcategory,comparedtotheextensionsoftheCodd’s SPRJUrelationalalgebraΣ andStructuredQueryLanguage(SQL).Itisdemon- R stratedthattheDBcategory,asadenotationalsemanticsmodelfortheschemamap- pings,iscomputationallyequivalenttotheΣ relationalalgebrawhichisacom- RE pleteextensionofΣ withallupdateoperationsfortherelations.Chapter6isthen R dedicated to define the abstract computational machine, the categorial RDB ma- chine,abletosupportallDBcomputationsbySQLembedding.Thefinalsections are dedicated to categorial semantics for the database transactions in time-sharing DBMS.BasedontheresultsinChaps.5and6,thefinalchapterofthesecondpart, Chap.7,thenpresentsfulloperationalsemanticsfordatabasemappings(programs). Thethirdpart,composedofChaps.8and9,isdedicatedtomoreadvancedthe- oretical issues about DB category: matching and merging operators (tensors) for databases,universalalgebraconsiderationsandalgebraiclatticeofthedatabases.It isdemonstratedthattheDBcategoryisnotaCartesianClosedCategory(CCC)and henceitisnotanelementarytopos.ItisdemonstratedthatDBismonoidalbiclosed, finitelycompleteandcocomplete,locallysmallandlocallyfinitelypresentablecat- egorywithhom-objects(“exponentiations”)andasubobjectclassifier. Thus,DBisaweakmonoidaltoposandhenceitdoesnotcorrespondtoproposi- tionalintuitionisticlogic(asanelementary,or“standard”topos)buttooneinterme- diate superintuitionistic logic with strictly more theorems than intuitionistic logic but less then the propositional logic. In fact, as in intuitionistic logic, it does not holdtheexcludedmiddle φ∨¬φ,rathertheweak excludedmiddle ¬φ∨¬¬φ is valid. DetailedPlan 1. Chapter 1 is a formal and short introduction to different topics and concepts: logics, (co)algebras, databases, schema mappings and category theory, in order torenderthismonographmoreself-contained;thismaterialwillbewidelyused intherestofthisbook.Itisimportantalsoduetothefactthatusuallydatabase expertsdoalotwithlogicsandrelationalalgebras,butmuchlesswithprogram- minglanguages(theirdenotationalandoperationalsemantics)andstillmuchless with categorial semantics. For the experts in programming languages and cate- gory theory, having more information on FOL and its extensions used for the databasetheorywillbeuseful. x Preface 2. In Chap. 2, the formal logical framework for the schema mappings is defined, based on the second-order tuple generating dependencies (SOtgds), with exis- tentiallyquantifiedfunctionalsymbols.Eachtgdis amaterialimplicationfrom the conjunctive formula (with relational symbols of a source schema, preceded withnegationaswell)intoaparticularrelationalsymbolofthetargetschema.It providesanumberofalgorithmswhichtransformtheselogicalformulaeintothe algebraic structure based on the theory of R-operads. The schema database in- tegrityconstraintsaretransformedinasimilarwaysothatboththeschemamap- pings and schema integrity-constraints are formally represented by R-operads. Thenthecompositionalpropertiesareexplored,inordertorepresentadatabase mapping system as a graph where the nodes are the database schemas and the arrows are the schema mappings or the integrity-constraints for schemas. This representationisusedtodefinethedatabasemappingsketches(smallcategories), basedonthefactthateachschemahasanidentityarrow(mapping)andthatthe mapping-arrowssatisfytheassociativelowforthecompositionofthem. ThealgebraictheoryofR-operads,presentedinSect.2.4,representstheseal- gebras in a non-standard way (with carrier set and the signature) because it is orientedtoexpressthecompositionalproperties,usefulwhenformalizingtheal- gebraicpropertiesforacompositionofdatabasemappingsanddefiningacatego- rialsemanticsforthem.ThestandardalgebraiccharacterizationofR-operads,as akindofrelationalalgebras,willbepresentedinChap.4inordertounderstand the relationship with the extensions of the Select–Project–Rename–Join–Union (SPRJU) Codd’s relational algebras. Each Tarski’s interpretation of logical for- mulae(SOtgds),usedtospecifythedatabasemappings,resultsintheinstance- databasemappingscomposedofasetofparticularfunctionsbetweenthesource instance-database and the target instance-database. Thus, an interpretation of a database-mapping system may be formally represented as a functor from the sketch category (schema database graph) into a category where an object is an instance-database(i.e.,asetofrelationaltables)andanarrowisasetofmapping functions.Section2.5isdedicatedtotheparticularpropertyforsuchacategory, namelythedualityproperty(basedoncategorysymmetry). Thus, at the end of this chapter, we obtain a complete algebraization of the SecondOrderLogicbasedonSOtgdsusedforthelogical(declarative)specifica- tionofschemamappings.Thesketchcategory(agraphofschemamappings)rep- resentsthesyntaxofthedatabase-programminglanguage.Afunctorα,derived from a specific Tarski’s interpretation of the logical schema database mapping system, represents the semantics of this programming language whose objects are instance-databases and a mapping is a set of functions between them. The formal denotational semantics of this programming language will be provided by a database category DB in Chap. 3, while the operational semantics of this programminglanguagewillbepresentedinChap.7. 3. Chapter 3 provides the basic results of this theory, including the definition of theDBcategoryasadenotationalsemanticsfortheschemadatabasemappings. The objects of this category are the instance-databases (composed of the rela- tionaltablesandanemptyrelation⊥)andeveryarrowisjustasetoffunctions

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.