Managing and Mining Uncertain Data ADVANCES IN DATABASE SYSTEMS Volume35 SeriesEditors AhmedK. Elmagarmid AmitP.Sheth PurdueUniversity WrightStateUniversity WestLafayette,IN47907 Dayton,Ohio45435 OtherbooksintheSeries: PRIVACY-PRESERVINGDATAMINING:ModelsandAlgorithmsbyCharuC.AggarwalandPhilip S.Yu;ISBN:978-0-387-70991-8 SEQUENCEDATAMININGbyGuozhuDong,JianPei;ISBN:978-0-387-69936-3 DATASTREAMS:ModelsandAlgorithms,editedbyCharuC.Aggarwal;ISBN:978-0-387-28759-1 SIMILARITYSEARCH:TheMetricSpaceApproach,P.Zezula,G.Amato,V.Dohnal,M.Batko;ISBN: 0-387-29146-6 STREAM DATA MANAGEMENT, Nauman Chaudhry, Kevin Shaw, Mahdi Abdelguerfi; ISBN: 0-387-24393-3 FUZZYDATABASEMODELINGWITHXML,ZongminMa;ISBN:0-387-24248-1 MININGSEQUENTIALPATTERNSFROMLARGEDATASETS,WeiWangandJiongYang;ISBN: 0-387-24246-5 ADVANCEDSIGNATUREINDEXINGFORMULTIMEDIAANDWEBAPPLICATIONS,Yannis Manolopoulos,AlexandrosNanopoulos,EleniTousidou;ISBN:1-4020-7425-5 ADVANCESINDIGITALGOVERNMENT:Technology,HumanFactors,andPolicy,editedbyWilliam J.McIver,Jr.andAhmedK.Elmagarmid;ISBN:1-4020-7067-5 INFORMATION AND DATABASE QUALITY, Mario Piattini, Coral Calero and Marcela Genero; ISBN:0-7923-7599-8 DATAQUALITY,RichardY.Wang,MostaphaZiad,YangW.Lee:ISBN:0-7923-7215-8 THE FRACTAL STRUCTURE OF DATA REFERENCE: Applications to the Memory Hierarchy, BruceMcNutt;ISBN:0-7923-7945-4 SEMANTIC MODELS FORMULTIMEDIA DATABASESEARCHING AND BROWSING, Shu- ChingChen,R.L.Kashyap,andArifGhafoor;ISBN:0-7923-7888-1 INFORMATIONBROKERINGACROSSHETEROGENEOUSDIGITALDATA:AMetadata-based Approach,VipulKashyap,AmitSheth;ISBN:0-7923-7883-0 Foracompletelistingofbooksinthisseries,gotohttp://www.springer.com Managing and Mining Uncertain Data Editedby Charu C. Aggarwal IBMT.J.WatsonResearchCenter USA 123 Editor CharuC.Aggarwal IBMThomasJ.WatsonResearchCenter 19SkylineDrive Hawthorne,NY10532 [email protected] SeriesEditors AhmedK.Elmagarmid AmitP.Sheth PurdueUniversity WrightStateUniversity WestLafayette,IN47907 Dayton,Ohio45435 ISBN978-0-387-09689-6 e-ISBN978-0-387-09690-2 DOI10.1007/978-0-387-09690-2 LibraryofCongressControlNumber:2008939360 (cid:2)c SpringerScience+BusinessMedia,LLC2009 All rights reserved. This work may not be translated or copied in whole orin part without the written permissionofthepublisher(SpringerScience+Business Media,LLC,233SpringStreet,NewYork,NY 10013,USA),exceptforbriefexcerptsinconnectionwithreviewsorscholarlyanalysis.Useinconnection withanyformofinformationstorageandretrieval,electronicadaptation,computersoftware,orbysimilar ordissimilarmethodologynowknownorhereafterdevelopedisforbidden. Theuseinthispublicationoftradenames,trademarks,servicemarksandsimilarterms,eveniftheyare notidentifiedassuch,isnottobetakenasanexpressionofopinionastowhetherornottheyaresubjectto proprietaryrights. Printedonacid-freepaper springer.com Preface Uncertaindatamanagementhasseenarevivalininterestinrecentyearsbe- causeofanumberofnewfieldswhichutilizethiskindofdata. Forexample,in fields such as privacy-preserving data mining, additional errors may be added to data in order to mask the identity of the records. Often the data may be imputed using statistical methods such as forecasting. In such cases, the data is uncertain in nature. Such data sets may often be probabilistic in nature. In othercases,databasesmayshowexistentialuncertaintyinwhichoneormore records may be present or absent from the data set. Such data sets lead to a numberofuniquechallengesinprocessingandmanagingtheunderlyingdata. The field of uncertain data management has been studied in the traditional database literature, but the field has seen a revival in recent years because of newwaysofcollectingdata. Thefieldofuncertaindatamanagementpresents a number of challenges in terms of collecting, modeling, representing, query- ing, indexing and mining the data. We further note that many of these issues areinter-relatedandcannoteasilybeaddressedindependently. Whilemanyof theseissueshavebeenaddressedinrecentresearch,theresearchinthisareais oftenquitevariedinitsscope. Forexample, eventheunderlyingassumptions of uncertainty are different across different papers. It is often difficult for re- searchers and students to find a single place containing a coherent discussion onthetopic. Thisbookisdesignedtoprovideacoherenttreatmentofthetopicofuncer- taindatamanagementbyprovidingsurveysofthekeytopicsinthisfield. The book is structured as an edited volume containing surveys by prominent re- searchersinthefield. Thechoiceofchaptersiscarefullydesigned,sothatthe overall content of the uncertain data management and mining field is covered reasonablywell. Eachchaptercontainsthekeyresearchcontentonaparticular topic,alongwithpossibleresearchdirections. Thisincludesabroadoverview of the topic, the different models and systems for uncertain data, discussions ondatabaseissuesformanaginguncertaindata,andminingissueswithuncer- taindata. Twoofthemostprominentsystemsforuncertaindatahavealsobeen describedinthebookinordertoprovideanideahowrealuncertaindataman- agement systems might work. The idea is to structurally organize the topic, vi Preface and provide insights which are not easily available otherwise. It is hoped that this structural organization and survey approach will be a great help to stu- dents,researchers,andpractitionersinthefieldofuncertaindatamanagement andmining. Contents Preface v ListofFigures xv ListofTables xxi 1 AnIntroductiontoUncertainDataAlgorithmsandApplications 1 CharuC.Aggarwal 1. Introduction 1 2. AlgorithmsforUncertainData 3 3. Conclusions 6 References 7 2 ModelsforIncompleteandProbabilisticInformation 9 ToddJ.Green 1. Introduction 9 2. IncompleteInformationandRepresentationSystems 13 3. -CompletenessandFiniteCompleteness 14 RA 4. ClosureUnderRelationalOperations 18 5. AlgebraicCompletion 19 6. ProbabilisticDatabasesandRepresentationSystems 21 7. Probabilistic?-TablesandProbabilisticOr-SetTables 22 8. Probabilisticc-tables 24 9. QueriesonAnnotatedRelations 25 10. K-Relations 27 11. PolynomialsforProvenance 30 12. QueryContainment 33 13. RelatedWork 34 14. ConclusionandFurtherWork 34 References 37 15. Appendix 41 3 RelationalModelsandAlgebraforUncertainData 45 SumitSarkarandDebabrataDey 1. Introduction 45 viii Contents 2. DifferentProbabilisticDataModels 48 2.1 Point-ValuedProbabilityMeasuresAssignedtoEachTuple 48 2.2 Point-ValuedProbabilityMeasuresAssignedtoAttributesandAt- tributeSets 52 2.3 Interval-ValuedProbabilityMeasuresAssignedtoAttribute-Values 54 2.4 Interval-valuedProbabilityMeasuresAssignedtoTuples 56 3. ProbabilisticRelationalAlgebra 56 3.1 BasicDefinitions 56 3.2 PrimaryandForeignKeys 58 3.3 RelationalOperations 59 3.4 RelationalAlgebra 62 3.5 IncompleteDistributionandNullValues 63 4. AlgebraicImplicationsoftheDifferentRepresentationsandAssociated Assumptions 67 4.1 PointValuedProbabilityMeasuresinTuples 68 4.2 Point-ValuedProbabilityMeasuresinAttributes 69 4.3 Interval-ValuedProbabilityMeasuresinAttributes 70 4.4 Interval-ValuedProbabilityMeasuresinTuples 71 4.5 ObservationsonTupleIndependence 72 5. ConcludingRemarks 72 References 75 4 GraphicalModelsforUncertainData 77 AmolDeshpande,LiseGetoorandPrithvirajSen 1. Introduction 78 2. GraphicalModels: Overview 80 2.1 DirectedGraphicalModels: BayesianNetworks 82 2.2 UndirectedGraphicalModels: MarkovNetworks 84 2.3 InferenceQueries 85 3. RepresentingUncertaintyusingGraphicalModels 87 3.1 PossibleWorldSemantics 88 3.2 SharedFactors 91 3.3 RepresentingProbabilisticRelations 91 4. QueryEvaluationoverUncertainData 92 4.1 Example 94 4.2 GeneratingFactorsduringQueryEvaluation 96 4.3 QueryEvaluationasInference 99 4.4 Optimizations 100 5. RelatedWorkandDiscussion 101 5.1 SafePlans 101 5.2 RepresentingUncertaintyusingLineage 102 5.3 ProbabilisticRelationalModels 102 5.4 LiftedInference 104 5.5 ScalableInferenceusingaRelationalDatabase 104 6. Conclusions 105 References 107 Contents ix 5 Trio: ASystemforData,Uncertainty,andLineage 113 JenniferWidom 1. ULDBs: Uncertainty-LineageDatabases 116 1.1 Alternatives 117 1.2 ‘?’ (Maybe)Annotations 117 1.3 Confidences 118 1.4 Lineage 119 1.5 RelationalQueries 121 2. TriQL:TheTrioQueryLanguage 122 2.1 OperationalSemantics 122 2.2 QueryingConfidences 124 2.3 QueryingLineage 124 2.4 DuplicateElimination 125 2.5 Aggregation 126 2.6 ReorganizingAlternatives 127 2.7 HorizontalSubqueries 128 2.8 Query-DefinedResultConfidences 128 2.9 OtherTriQLQueryConstructs 129 3. DataModificationsinTrio 130 3.1 Inserts 130 3.2 Deletes 131 3.3 Updates 131 3.4 DataModificationsandVersioning 133 4. ConfidenceComputation 133 5. AdditionalTrioFeatures 135 6. TheTrioSystem 136 6.1 EncodingULDBData 138 6.2 BasicQueryTranslationScheme 140 6.3 DuplicateElimination 141 6.4 Aggregation 141 6.5 ReorganizingAlternatives 142 6.6 HorizontalSubqueries 142 6.7 Built-InPredicatesandFunctions 143 6.8 Query-DefinedResultConfidences 144 6.9 RemainingConstructs 144 References 147 6 MayBMS:ASystemforManagingLargeProbabilisticDatabases 149 ChristophKoch 1. Introduction 149 2. ProbabilisticDatabases 151 3. QueryLanguageDesiderata 152 4. TheAlgebra 153 5. RepresentingProbabilisticData 159 6. ConceptualQueryEvaluation 164 7. TheMayBMSQueryandUpdateLanguage 170 8. TheMayBMSSystem 175 9. ConclusionsandOutlook 178
Description: