Table Of Content

On the Discovery of Semantically Meaningful SQL Constraints from Armstrong Samples: Foundations, Implementation, and Evaluation by Van Tran Bao Le Athesis submittedtotheVictoriaUniversityofWellington infulfilmentoftherequirementsforthedegreeof DoctorofPhilosophy inInformationManagement. VictoriaUniversityofWellington 2014 Abstract AdatabaseissaidtobeC-ArmstrongforafinitesetΣofdatadependencies inaclassC ifthedatabasesatisfiesalldatadependenciesinΣandviolates all data dependencies in C that are not implied by Σ. Therefore, Arm- strongdatabasesareconcise,user-friendlyrepresentationsofabstractdata dependenciesthatcanbeusedtojudge,justify,convey,andtesttheunder- standingofdatabasedesignchoices. Indeed,anArmstrongdatabasesatis- fiesexactlythosedatadependenciesthatareconsideredmeaningfulbythe currentdesignchoiceΣ. StructuralandcomputationalpropertiesofArm- strong databases have been deeply investigated in Codd’s Turing Award winning relational model of data. Armstrong databases have been incor- poratedinapproachestowardsrelationaldatabasedesign. Theyhavealso been found useful for the elicitation of requirements, the semantic sam- plingofexistingdatabases,andthespecificationofschemamappings. This research establishes a toolbox of Armstrong databases for SQL data. ThisischallengingasSQLdatacancontainnullmarkeroccurrences incolumnsdeclaredNULL,andmaycontainduplicaterows. Thus,theex- istingtheoryofArmstrongdatabasesonlyappliestoidealizedinstancesof SQL data, that is, instances without null marker occurrences and without duplicaterows. Forthethesis,twopopularinterpretationsofnullmarkers areconsidered: thenoinformationinterpretationusedinSQL,andtheexists but unknown interpretation by Codd. Furthermore, the study is limited to the popular class C of functional dependencies. However, the presence of duplicaterowsmeansthattheclassofuniquenessconstraintsisnolonger subsumed by the class of functional dependencies, in contrast to the relational model of data. As a first contribution a provably-correct algorithm is developed that computes Armstrong databases for an arbitrarily given finitesetofuniquenessconstraintsandfunctionaldependencies. Thiscon- tribution is based on axiomatic, algorithmic and logical characterizations of the associated implication problem that are also established in this thesis. While the problem to decide whether a given database is Armstrong for a given set of such constraints is precisely exponential, our algorithm computes an Armstrong database with a number of rows that is at most quadraticinthenumberofrowsofaminimum-sizedArmstrongdatabase. Asasecondcontributionthealgorithmsareimplementedintheformofa design tool. Users of the tool can therefore inspect Armstrong databases toanalyzetheircurrentdesignchoiceΣ. Intuitively,Armstrongdatabases areusefulfortheacquisitionofsemanticallymeaningfulconstraints,ifthe users can recognize the actual meaningfulness of constraints that they in- correctly perceived as meaningless before the inspection of an Armstrong database. As a final contribution, measures are introduced that formal- ize the term “useful” and it is shown by some detailed experiments that Armstrongtables,ascomputedbythetool,areindeeduseful. Insummary,thisresearchestablishesatoolboxofArmstrongdatabases thatcanbeappliedbydatabasedesignerstoconciselyvisualizeconstraints onSQLdata. Suchsupportcanleadtodatabasedesignsthatguaranteeef- ficientdatamanagementinpractice. Acknowledgments My first acknowledgement is to Sebastian Link. You not only guided me during the development of this thesis, but also willingly showed me how to sharpen and publish research results and deal with critical comments. Thank you, Sebastian, for your patient guidance, enthusiastic encouragement, and excellent support that made it possible for me to complete this work. Your positive outlook and confidence in my capabilities inspired me, making the tough PhD journey become enjoyable. It has been an honor to be your PhD student, and I could not have imagined having a bettersupervisorformyPhDstudy. Thanks also to Flavio Ferrarotti. Despite your busy schedule, you al- ways made yourself available to clarify my doubts. Your guidance and advice served me well, and I owe you my heartfelt appreciation. I also thank David Mason. You welcomed me warmly and made the last two yearsofmyPhDjourneysmooththroughyoursupervisionandtimelyas- sistance. MygratefulthanksarealsoextendedtoDanDorner,ValHooper, and Pak Yoong. Your open doors and ready support were much appreci- ated. Thanks to Usha Varatharaju, Jean Grant, and Tiso Ross. You helped menavigatebureaucraticpathways. MythanksalsogotoDionPeszynski. Withoutyou,thewebapplicationdevelopedforthisworkcouldnothave gonelive. ThankstothemanyotherhelpfulandgenerouspeopleatVictoriaUni- versity of Wellington and particularly the academic, administrative, and technical staff of the School of Information Management: Jocelyn Crane- iii iv field, Chern Li Liew, Gillian Oliver, Lalita Rajasingham, Mary Tate, Janet Toland, Brenda Chawner, Alastair Smith, Christine King, and Cristina Se- bold. It has been a great privilege to spend four years in the School of Information Management (SIM) at Victoria University of Wellington, and itsmemberswillalwaysremaindeartome. I would especially like to acknowledge Palve Mogin and Hui Ma in the School of Engineering and Computer Science at Victoria University of Wellington,andNguyenThiThanhTamintheComputerScienceDepart- ment at Lotus University of Vietnam. You were there to kindly assist me duringthedatagatheringprocess. Thankstoallthecomputersciencestu- dents for participating in my experiments. Without them, I simply could nothaveconductedempiricalstudiesandcompletedthisthesis. Finally,Iwishtothankmyparents,LeDuyQuangandTranThiPhuong Mai, who have never lost their belief in me and helped me go the dis- tance. Thankstomyhusband,NguyenTrongDuc,foryourunconditional support by being willing to share the ups and downs of the PhD process despite the massive challenges you had to face when making a big move to New Zealand to support me. Thanks also to my great friends in New Zealand and in Vietnam for your support and encouragement, for step- ping in when things got tough, and making me take out time for a bit of funeverynowandthen. Contents 1 Introduction 1 1.1 Challengeswithconstraints . . . . . . . . . . . . . . . . . . . 1 1.2 Stateofdatabasepractice . . . . . . . . . . . . . . . . . . . . . 2 1.3 Stateofdatabasetheory . . . . . . . . . . . . . . . . . . . . . 3 1.4 Stateofdisparity . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Objectiveandapproachofthethesis . . . . . . . . . . . . . . 4 1.6 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.7 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2 LiteratureReview 9 2.1 DataDependenciesoverRelations . . . . . . . . . . . . . . . 9 2.1.1 TheRelationalDataModel . . . . . . . . . . . . . . . 10 2.1.2 DataDependencies . . . . . . . . . . . . . . . . . . . . 11 2.1.3 TheClassofKeys . . . . . . . . . . . . . . . . . . . . . 14 2.1.4 TheClassofFunctionalDependencies . . . . . . . . . 15 2.2 ArmstrongRelations . . . . . . . . . . . . . . . . . . . . . . . 21 2.2.1 GeneralDefinitionofanArmstrongRelation . . . . . 23 2.2.2 ArmstrongRelationsforFunctionalDependencies . . 24 2.2.3 EvidenceFortheUsefulnessofArmstrongRelations 31 2.2.4 Constraint Acquisition by Sample Data and Natural LanguageProcessing . . . . . . . . . . . . . . . . . . . 32 2.3 PartialInformationinDatabases . . . . . . . . . . . . . . . . 33 2.3.1 NoInformation . . . . . . . . . . . . . . . . . . . . . . 34 v vi CONTENTS 2.3.2 ValueUnknownatPresent . . . . . . . . . . . . . . . 35 2.4 DataDependenciesoverPartialRelations . . . . . . . . . . . 36 2.4.1 FunctionalDependenciesoverSQLRelations . . . . . 36 2.4.2 FDsandNOTNULLconstraintsoverSQLRelations 39 2.4.3 FDsoverCoddRelations . . . . . . . . . . . . . . . . 42 2.4.4 UniquenessConstraintsandKeys . . . . . . . . . . . 44 2.5 SQLArmstrongTables . . . . . . . . . . . . . . . . . . . . . . 49 2.5.1 NFDsDoNotEnjoySQLArmstrongTables . . . . . . 50 2.5.2 StructuralPropertiesofSQLArmstrongTables . . . . 50 2.5.3 ComputationalProperties . . . . . . . . . . . . . . . . 54 2.5.4 FurtherRemarks . . . . . . . . . . . . . . . . . . . . . 59 2.6 SummaryandResearchGap . . . . . . . . . . . . . . . . . . . 60 2.6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 60 2.6.2 Researchgap . . . . . . . . . . . . . . . . . . . . . . . 61 2.6.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 62 3 Foundations-PartI 65 3.1 PreliminaryDefinitions . . . . . . . . . . . . . . . . . . . . . . 66 3.2 AxiomaticCharacterization . . . . . . . . . . . . . . . . . . . 71 3.3 AlgorithmicCharacterization . . . . . . . . . . . . . . . . . . 78 3.4 LogicalCharacterization . . . . . . . . . . . . . . . . . . . . . 83 3.4.1 S-3Logics . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.4.2 EquivalenceofImplicationProblems . . . . . . . . . 85 3.5 FurtherComments . . . . . . . . . . . . . . . . . . . . . . . . 90 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4 Foundations-PartII 93 4.1 TwoMotivatingExamples . . . . . . . . . . . . . . . . . . . . 94 4.2 StructuralProperties . . . . . . . . . . . . . . . . . . . . . . . 95 4.3 ComputationalProperties . . . . . . . . . . . . . . . . . . . . 102 4.3.1 ComputationofMaximalSets . . . . . . . . . . . . . . 102 4.3.2 ComputationofDuplicateSets . . . . . . . . . . . . . 106 CONTENTS vii 4.3.3 ComputationofArmstrongTables . . . . . . . . . . . 109 4.4 ComplexityConsiderations . . . . . . . . . . . . . . . . . . . 114 4.4.1 Worst-caseComplexitytoFindArmstrongTables . . 114 4.4.2 Minimum-sizedArmstrongTables . . . . . . . . . . . 116 4.4.3 TheSizeofRepresentations . . . . . . . . . . . . . . . 117 4.5 FurtherRemarks . . . . . . . . . . . . . . . . . . . . . . . . . 120 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5 Implementation: SQL-Sampler 123 5.1 IntroductiontoSQL-Sampler . . . . . . . . . . . . . . . . . . 124 5.2 SystemRequirements . . . . . . . . . . . . . . . . . . . . . . . 127 5.3 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.3.1 DEFINECONTEXTmodule . . . . . . . . . . . . . . . 127 5.3.2 INPUTDATAmodule . . . . . . . . . . . . . . . . . . 130 5.3.3 COMPUTEmodule . . . . . . . . . . . . . . . . . . . . 130 5.3.4 OUTPUTDATAmodule . . . . . . . . . . . . . . . . . 133 5.3.5 ARMSTRONGTABLE database . . . . . . . . . . . . . . 133 5.4 ImplementationDetails . . . . . . . . . . . . . . . . . . . . . . 136 5.4.1 Contextvariable . . . . . . . . . . . . . . . . . . . . . 137 5.4.2 DATAINPUTfunctions . . . . . . . . . . . . . . . . . 137 5.4.3 COMPUTATIONfunctions . . . . . . . . . . . . . . . 138 5.4.4 DATAOUTPUTfunctions . . . . . . . . . . . . . . . . 143 5.5 UseCaseExample . . . . . . . . . . . . . . . . . . . . . . . . . 145 5.5.1 TheUseCase . . . . . . . . . . . . . . . . . . . . . . . 145 5.5.2 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 5.5.3 InputData . . . . . . . . . . . . . . . . . . . . . . . . . 145 5.5.4 ComputingArmstrongTable . . . . . . . . . . . . . . 146 5.5.5 Output . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6 Evaluation: ExperimentDesign 151 6.1 Theoverallprocessoftheexperiment . . . . . . . . . . . . . 152 viii CONTENTS 6.2 Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 6.3 Applicationdomainandthetargetset . . . . . . . . . . . . . 155 6.4 Qualitymeasures . . . . . . . . . . . . . . . . . . . . . . . . . 156 6.4.1 Preliminarynotation . . . . . . . . . . . . . . . . . . . 156 6.4.2 Soundness . . . . . . . . . . . . . . . . . . . . . . . . . 158 6.4.3 Completeness . . . . . . . . . . . . . . . . . . . . . . . 159 6.4.4 Proximity . . . . . . . . . . . . . . . . . . . . . . . . . 159 6.4.5 Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 6.4.6 Actualandpossiblegain . . . . . . . . . . . . . . . . . 161 6.5 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 6.5.1 Datagathering . . . . . . . . . . . . . . . . . . . . . . 162 6.5.2 Dataanalysis . . . . . . . . . . . . . . . . . . . . . . . 163 6.6 Furtherconsiderationsandfutureresearch . . . . . . . . . . 170 7 Evaluation: DataAnalysis 173 7.1 Quantitativedataanalysis . . . . . . . . . . . . . . . . . . . . 173 7.1.1 Quantitativeanalysisforsoundness . . . . . . . . . . 174 7.1.2 Quantitativeanalysisforcompleteness . . . . . . . . 180 7.1.3 Quantitativeanalysisforproximity . . . . . . . . . . 186 7.1.4 Quantitativeanalysisforgroups . . . . . . . . . . . . 191 7.2 Qualitativedataanalysis . . . . . . . . . . . . . . . . . . . . . 195 7.2.1 Qualitativeanalysisforsoundness . . . . . . . . . . . 196 7.2.2 Qualitativeanalysisforcompleteness . . . . . . . . . 202 7.2.3 Qualitativeanalysisforproximity . . . . . . . . . . . 208 7.2.4 Qualitativeanalysisbyundiscoveredconstraints . . . 211 7.2.5 Qualitativeanalysispergroups . . . . . . . . . . . . . 213 7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 8 ConclusionandFutureWork 217 8.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 8.2 FutureWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

Description:

to the relational model of data. As a first contribution a provably-correct algorithm My thanks also go to Dion Peszynski. Without you, the web

On the Discovery of Semantically Meaningful SQL Constraints from Armstrong Samples PDF

266 Pages·2014·2.98 MB·English

Checking for file health...

Save to my drive

Quick download

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview On the Discovery of Semantically Meaningful SQL Constraints from Armstrong Samples

Description:

to the relational model of data. As a first contribution a provably-correct algorithm My thanks also go to Dion Peszynski. Without you, the web

See more

The list of books you might like

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.