
DIT - University of Trento Modeling and Querying Data Series and Data Streams with Uncertainty PDF

178 Pages·2012·1.24 MB·English

Preview DIT - University of Trento Modeling and Querying Data Series and Data Streams with Uncertainty

PhD Dissertation
International Doctorate School in Information and Communication Technologies
DIT - University of Trento

Modeling and Querying Data Series and Data Streams with Uncertainty

Michele Dallachiesa
Advisor: Prof. Themis Palpanas
Università degli Studi di Trento
March 2014

To my parents for their endless love and support.

Abstract

Many real applications consume data that is intrinsically uncertain and error-prone. An uncertain data series is a series whose point values are uncertain. An uncertain data stream is a data stream whose tuples are existentially uncertain and/or have an uncertain value. Typical sources of uncertainty in data series and data streams include sensor data, data synopses, privacy-preserving transformations and forecasting models. In this thesis, we focus on the following three problems: (1) the formulation and the evaluation of similarity search queries in uncertain data series; (2) the evaluation of nearest neighbor search queries in uncertain data series; (3) the adaptation of sliding windows in uncertain data stream processing to accommodate existential and value uncertainty. We demonstrate experimentally that the correlation among neighboring time-stamps in data series can be leveraged to increase the accuracy of the results. We further show that the "possible world" semantics can be used as underlying uncertainty model to formulate nearest neighbor queries that can be evaluated efficiently. Finally, we discuss the relation between existential and value uncertainty in data stream applications, and verify experimentally our proposal of uncertain sliding windows.

Keywords
[Uncertain data, Similarity, Data series, Data streams]

Acknowledgements

First, and foremost, I would like to thank my advisor Themis Palpanas for his enormous help and encouragement not only in research but also in life. His patience, dedication, intelligence and positiveness will continue to inspire me for a long time to come.

I would like to thank the many new friends that I encountered in these years, including the great colleagues at the dbTrento research group. I am grateful to the members of my Ph.D. committee, Prof. Johann-Christoph Freytag and Prof. Minos Garofalakis. I have been very lucky to work with Charu Aggarwal, Gabriela Jacques da Silva, Buğra Gedik and Kun-Lung Wu at the IBM T.J. Watson Research Center, and with Prof. Ihab F. Ilyas at the Qatar Computing Research Institute. I have really enjoyed working with these bright minds.

I would like to thank my parents for their unconditional support and my grandparents for being my eternal advocates. Lastly, I am grateful to my girlfriend for her patience and love.

Contents

1 Introduction
    1.1 Motivating Scenarios
    1.2 Modeling and Querying Uncertain Data Series
        1.2.1 Contributions
    1.3 Top-k Nearest Neighbor Search in Uncertain Data Series
        1.3.1 Contributions
    1.4 Management of Sliding Windows in Uncertain Data Streams
        1.4.1 Contributions
    1.5 Structure of the Thesis
2 Related Work
    2.1 Nearest Neighbor Queries
    2.2 Uncertain Data Streams
3 Preliminaries
4 Uncertain Time-Series Similarity: Return to the Basics
    4.1 Similarity Matching for Uncertain Time Series
        4.1.1 MUNICH
        4.1.2 PROUD
        4.1.3 DUST
    4.2 Analytical Comparison
        4.2.1 Uncertainty Models and Assumptions
        4.2.2 Type of Distance Measures
        4.2.3 Type of Similarity Queries
    4.3 Comparative Study
        4.3.1 Experimental Setup
        4.3.2 Quality Performance
        4.3.3 Time Performance
    4.4 Moving Average for Uncertain Time Series
        4.4.1 Neighborhood-Aware Models
        4.4.2 Performance
    4.5 Discussion
    4.6 Summary
5 Top-k Nearest Neighbor Search for Uncertain Data Series
    5.1 Preliminaries
        5.1.1 Problem Statement
    5.2 Baseline Algorithm
        5.2.1 Complexity Analysis
    5.3 Proposed Approach
        5.3.1 Bounding the PNN Probability Estimates
        5.3.2 The Holistic-PkNN Algorithm
        5.3.3 Tightening the PNN Bounds
        5.3.4 Managing the Distance Partitions
    5.4 Indexing Uncertain Data Series
        5.4.1 Bulk-loading Algorithm
        5.4.2 Pruning the Search Space
    5.5 Extensions
    5.6 Experimental results
        5.6.1 Datasets
        5.6.2 Evaluation Methodology
        5.6.3 Quality Results
        5.6.4 Time performance
    5.7 Summary
6 Sliding Windows over Uncertain Data Streams
    6.1 Uncertain data streams
        6.1.1 Preliminaries
        6.1.2 From value to existential uncertainty
        6.1.3 From existential to value uncertainty
    6.2 Uncertain Sliding Windows
        6.2.1 Modeling uncertain sliding windows
        6.2.2 Processing uncertain sliding windows
        6.2.3 The Poisson-binomial distribution
        6.2.4 Efficient approximations of the Poisson-binomial distribution
    6.3 Adapting stream operators to handle data uncertainty
    6.4 Efficient similarity join processing
        6.4.1 Upper-bounding the match probability
        6.4.2 Pruning the similarity search space
    6.5 Experimental evaluation
        6.5.1 Datasets
        6.5.2 Poisson-binomial distribution approximations
        6.5.3 Uncertain sliding windows for sum aggregation
        6.5.4 Uncertain sliding windows for similarity join
    6.6 Extensions
        6.6.1 Other sliding window policies
        6.6.2 Integration into System S
    6.7 Summary
7 Conclusions and Future Work
    7.1 Future Directions
Bibliography
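
The abstract describes data streams whose tuples are existentially uncertain, and the contents list the Poisson-binomial distribution (Section 6.2.3) in the chapter on uncertain sliding windows. As a rough illustration of how the two ideas connect, and not the thesis's own algorithm, the sketch below computes the distribution of the number of tuples that actually exist among a set of uncertain tuples. It assumes each tuple exists independently with a known probability; the function name `poisson_binomial_pmf` and the example probabilities are hypothetical.

```python
from typing import List


def poisson_binomial_pmf(probs: List[float]) -> List[float]:
    """Return pmf where pmf[k] = P(exactly k tuples exist).

    Assumption (not taken from the thesis): tuples exist independently,
    tuple i with probability probs[i].
    """
    pmf = [1.0]  # with no tuples seen, the count is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # tuple absent: count stays at k
            nxt[k + 1] += mass * p      # tuple present: count becomes k + 1
        pmf = nxt
    return pmf


# Made-up existence probabilities for three tuples.
existence_probs = [0.9, 0.5, 0.2]
pmf = poisson_binomial_pmf(existence_probs)
expected_count = sum(k * mass for k, mass in enumerate(pmf))
```

The dynamic program takes time quadratic in the number of tuples, which is one plausible reason a text on this topic would also look at approximations of the distribution, as the title of Section 6.2.4 suggests.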

Description:
... of uncertainty in data series and data streams include sensor data, data ... I have been very lucky to work with Charu Aggarwal, Gabriela Jacques da Silva ... Second, the formulation of traditional database queries and mining ... geologic observing systems, pollution management in urban settings, ...
