Conceptual Modeling of Data with Provenance
by
David William Archer

A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science

Dissertation Committee:
Lois M. L. Delcambre, Chair
David Maier
Leonard Shapiro
Mark Jones
Charles Weber

Portland State University
© 2011

ABSTRACT

Traditional database systems manage data, but often do not address its provenance. In the past, users were often implicitly familiar with data they used, how it was created (and hence how it might be appropriately used), and from which sources it came. Today, users may be physically and organizationally remote from the data they use, so this information may not be easily accessible to them. In recent years, several models have been proposed for recording provenance of data. Our work is motivated by opportunities to make provenance easy to manage and query. For example, current approaches model provenance as expressions that may be easily stored alongside data, but are difficult to parse and reconstruct for querying, and are difficult to query with available languages. We contribute a conceptual model for data and provenance, and evaluate how well it addresses these opportunities. We compare the expressive power of our model’s language to that of other models. We also define a benchmark suite with which to study performance of our model, and use this suite to study key model aspects implemented on existing software platforms. We discover some salient performance bottlenecks in these implementations, and suggest future work to explore improvements. Finally, we show that our implementations can comprise a logical model that faithfully supports our conceptual model.

DEDICATION

To Cynthia

Acknowledgements

This research was a team effort, and was successful because of the contributions of everyone involved.
First and foremost among contributors was my advisor, Professor Lois M. L. Delcambre, who consistently pushed me to think about the big picture, always brought to the table new ideas for us to consider, and was a patient and thorough reviewer of this work. The members of my thesis committee also contributed new perspectives, ideas that I had missed, and excellent constructive critique. For these I thank Professor David Maier, Professor Leonard Shapiro, Professor Mark Jones, and Professor Charles Weber. I also appreciate the ideas, critiques, and inputs of other faculty and students in the PSU DataLab research group: Kristin Tufte, Rafael J. Fernandez-Moctezuma, Jeremy Steinhauer, Nick Rayner, Scott Brittell, and James Terwilliger.

This work was supported in part by the National Science Foundation, grant IIS-0534762, and by DARPA. Support for this work came from a number of others, including the PSU Computer Science Department staff: Beth Holmes, Kathi Lee, and Rene Remillard.

I recognize and deeply appreciate the contributions, patience, and encouragement of my wife, Cynthia L. Archer, PhD. Without her, this achievement would have been impossible.

Contents

Abstract
Dedication
Acknowledgements
List of Tables
List of Figures
1 Introduction
1.1 Example Settings for Provenance
1.1.1 Development of Targeted Cancer Therapies
1.1.2 Corporate Budget Planning
1.1.3 Battlefield Information Management
1.1.4 Opportunities to Enhance Provenance Models
1.2 Where Current Provenance Models Fall Short for Our Settings
1.3 Research Goals
2 Conceptual Model Overview
2.1 Model Fundamentals
2.2 Structure of a Relational MMP Data Face
2.2.1 External Sources of Data
2.3 Structure of an Example MMP Provenance Model
2.3.1 Continuity of Existing Data
2.3.2 Granularity and Inheritance of Provenance
2.4 Interacting with and Visualizing MMP
2.4.1 The MMP Language
2.4.2 Data Semantics of the MMP Language
2.4.2.1 Data Definition, Manipulation, and Confidence Operations
2.4.2.2 Query Operations
2.4.3 Confidence Language
2.4.4 Predicate Language for Selection and Projection Operators
2.5 Provenance Creation Semantics of the MMP Language
2.6 Provenance Graphs as Visualization Tools
2.7 Chapter Summary
3 Formalizing the Conceptual Model
3.1 Modeling Evolving Data: Faces
3.2 Modeling The Outside World: External Source Referents
3.3 Modeling Data Derivation: Provenance Links
3.3.1 Operation-induced Provenance Links
3.3.2 Continuity Provenance Links
3.4 Modeling Operations Applied to Data: Revisions
3.5 Modeling Creation of External Source Referents
3.6 Single-revision and Source-Creation Impact on Data and Provenance
3.6.1 DDL Revisions and Source Creations
3.6.1.1 Create Relation
3.6.1.2 Create Source
3.6.1.3 Create Attribute
3.6.1.4 Drop Relation
3.6.1.5 Drop Attribute
3.6.2 DML and DCL Revisions
3.6.2.1 Insert Value
3.6.2.2 Drop Value
3.6.2.3 Insert Tuple
3.6.2.4 Drop Tuple
3.6.2.5 Paste Value
3.6.2.6 Paste Tuple
3.6.2.7 Paste Relation
3.6.2.8 Confirm Value and Doubt Value
3.6.3 Query Revisions
3.6.3.1 Selection Operator Provenance
3.6.3.2 Projection Operator Provenance
3.6.3.3 Cartesian Product Operator Provenance
3.6.3.4 Union Operator Provenance
3.6.4 Provenance for Results of General MMP Queries
3.7 Accessing Provenance Information
3.7.1 Provenance Graphs
3.7.1.1 Preliminaries: Tracing Continuity and Inheritance
3.7.1.2 Defining Provenance Graphs
3.7.2 Querying Provenance
3.7.2.1 Example of Provenance Predicate Evaluation
3.7.3 Provenance Polynomials
3.7.3.1 Representing Operations in Provenance Polynomials
3.7.3.2 Evaluating Plurality of Support with Provenance Polynomials
3.7.4 Chapter Summary
4 Conceptual Model Evaluation
4.1 Evaluating MMP Against Gaps in the Literature
4.2 Evaluating MMP Against Needs in Target Settings
4.3 Relative Expressiveness of Algebraic Provenance Representations
4.4 Relative Expressiveness of Provenance-related Queries
4.4.1 Provenance Selection Queries
4.4.2 Query set for Expressiveness Comparison
4.4.3 Comparison of Expressiveness
4.4.3.1 Buneman’s Why-provenance model
4.4.3.2 Trio
4.4.3.3 Green’s model
4.4.3.4 Example query 1
4.4.3.5 Example query 2
4.4.3.6 Example query 3
4.4.3.7 Example query 4
4.4.3.8 Example query 5
4.4.3.9 Example query 6
4.4.3.10 Example query 7
4.4.3.11 Example query 8
4.4.3.12 Example query 9
4.4.3.13 Conclusions About Expressiveness of Provenance Selection Queries
4.5 Other Advantages of MMP Relative to Other Models
4.5.1 Accessing Ancestors and Operational History of Data
4.5.2 Computing Forward-Looking Provenance
4.6 Relative Complexity of Provenance-related Queries
4.7 Chapter Summary
5 Characterizing Performance of Implementation Choices for MMP
5.1 Benchmarks and Metrics
5.1.1 Data query benchmark
5.1.1.1 Data structure for relational database testing
5.1.1.2 Data structure for graph database testing
5.1.1.3 Data query workload
5.1.2 Provenance query benchmark
5.1.2.1 Provenance structure for relational database testing
5.1.2.2 Provenance structure for graph database testing
5.1.2.3 Provenance query workload
5.1.3 Performance Comparison Metrics
5.2 Experimental Setup
5.3 Experiments and Results
5.3.1 Relational Data Query Tests
5.3.1.1 Test for Data Query 1
5.3.1.2 Test for Data Query 2
5.3.1.3 Test for Data Query 3
5.3.1.4 Test for Data Query 4
5.3.1.5 Test Results Using Warm-Start Caches
5.3.1.6 Conclusions on Data Tests
5.3.2 Provenance Predicate Tests
5.3.2.1 Conclusions for Provenance Tests
5.3.3 Implications for MMP Implementations
5.4 Other Ideas for Accelerating MMP Implementations
5.5 Chapter Summary
6 A Logical Model to Support MMP Implementation
6.1 Transforming Conceptual Models into Logical Models
6.1.1 Equivalence Classes of Language Operators
6.1.1.1 Class 1: Drop Attribute
6.1.1.2 Class 2: Insert Tuple
6.1.1.3 Class 3: Paste Tuple
6.1.1.4 Class 4: Queries
6.2 Faithful Support of MMP by MMPL
6.2.1 Basis Case for Induction
6.2.2 Inductive Case
6.2.2.1 Data Portion of Inductive Case
6.2.2.2 Provenance Portion of Inductive Case
6.3 Efficiency of the Logical Model
6.4 Chapter Summary
7 Related Work
7.1 The Open Provenance Model
7.2 Provenance Models in the Literature
7.2.1 Lineage Tracing for General Data Warehouse Transformations
7.2.2 Annotation Management Systems
7.2.3 CPDB
7.2.4 Trio
7.2.5 Panda
7.2.6 Orchestra
7.3 Comparing Expressiveness of Popular Provenance Models
7.4 Performance of Provenance Models
7.5 Chapter Summary