Statistical Models for Querying and Managing Time-Series Data

THESIS NO. 5705 (2013)
PRESENTED ON 21 JUNE 2013
AT THE SCHOOL OF COMPUTER AND COMMUNICATION SCIENCES
DISTRIBUTED INFORMATION SYSTEMS LABORATORY
DOCTORAL PROGRAM IN COMPUTER, COMMUNICATION AND INFORMATION SCIENCES
ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE
FOR THE DEGREE OF DOCTOR OF SCIENCES
BY
Saket SATHE

Accepted on the proposal of the jury:
Prof. M. Grossglauser, jury president
Prof. K. Aberer, thesis director
Prof. R. Cheng, examiner
Prof. C. Koch, examiner
Prof. G. Trajcevski, examiner

Switzerland
2013

To my dear parents, Vandana Sathe and the late Keshav Sathe.

Acknowledgements

The first person I would like to thank is my thesis supervisor, Prof. Karl Aberer. He did an excellent job in supporting me throughout my PhD and at the same time gave me the opportunity to concentrate on the topics I liked most. I learnt a lot from Karl and I consider myself very lucky that I did my PhD in his lab.

I am very thankful to the members of my thesis committee: Prof. Christoph Koch, Prof. Reynold Cheng, Prof. Goce Trajcevski and Prof. Matthias Grossglauser for their important comments and discussions to improve my dissertation.

I wish to thank all my colleagues whom I collaborated with during the work on this thesis, especially Dipanjan Chakraborty, Hoyoung Jeung, Thanasis Papaioannou, Sebastian Cartier, and Gleb Skobeltsyn. I thank all my colleagues from LSIR - we have a great team. A special thanks goes to Chantal, who helped me sort out so many issues, not only administrative ones.

I thank all my friends from the doctoral school group and beyond, for their support and for all the great moments we spent together, including all our travels, sports, adventures and parties: Surender, Dipanjan, Rammohan, Mehdi, Michele, Prem, Devika, Marc, Tri, Alexendra, Nishanth, Abhishek, Raj, Shobha, Laura, Shrinivas, Dinkar, Satish, Sanket, Sayali and many, many others.

I would like to especially thank Nicola Pozza, who not only did a French-Hindi tandem with me for the last 4 years, but also perfectly translated the abstract of this thesis. I owe all my knowledge of the French language, Hindi grammar, and Swiss politics to him.

Finally, I would like to thank my parents for their love and support, and for the many sacrifices that they’ve had to make. And, of course, I must never ever forget my wife Aishwarya. Not only is she the most intelligent and beautiful person I know, but she is also an amazing mother; our son Devansh is 10 months old and is learning to stand on his own.

Abstract

In recent years we have experienced a dramatic increase in the amount of available time-series data. Primary sources of time-series data are sensor networks, medical monitoring, financial applications, news feeds and social networking applications. The availability of large amounts of time-series data calls for scalable data management techniques that enable efficient querying and analysis of such data in real-time and archival settings. Often the time-series data generated from sensors (environmental, RFID, GPS, etc.) is imprecise and uncertain in nature. Thus, it is necessary to characterize this uncertainty in order to produce clean answers. In this thesis we propose methods that address these important issues pertaining to time-series data. In particular, this thesis is centered around the following three topics:

Computing Statistical Measures on Large Time-Series Datasets.
Computing statistical measures for large databases of time series is a fundamental primitive for querying and mining time-series data [31, 81, 97, 111, 132, 137]. This primitive is gaining importance with the increasing number and rapid growth of time-series databases. In Chapter 3, we introduce the Affinity framework for efficient computation of statistical measures by exploiting the concept of affine relationships [113, 114].
Affine relationships can be used to infer a large number of statistical measures for time series from other related time series instead of computing them directly, thus significantly reducing the overall computational cost. Moreover, the Affinity framework proposes a unified approach for computing several statistical measures at once.

Creating Probabilistic Databases from Imprecise Data.
A large amount of the time-series data produced in the real world has an inherent element of uncertainty, arising from the various sources of imprecision affecting its sources (such as sensor data, GPS trajectories, and environmental monitoring data). The primary sources of imprecision in such data include imprecise sensors, limited communication bandwidth, and sensor failures. Recently there has been an exponential rise in the number of such imprecise sensors, which has led to an explosion of imprecise data. Standard database techniques cannot be used to provide clean and consistent answers in such scenarios. Therefore, probabilistic databases that factor in the inherent uncertainty and produce clean answers are required. An important assumption when using probabilistic databases is that each data point has a probability distribution associated with it. This is not true in practice: the distributions are absent. As a solution to this fundamental limitation, in Chapter 4 we propose methods for inferring such probability distributions and using them for efficiently creating probabilistic databases [116].

Managing Participatory Sensing Data.
Community-driven participatory sensing is a rapidly evolving paradigm in mobile geo-sensor networks. Here, sensors of various sorts (e.g., multi-sensor units monitoring air quality, cell phones, thermal watches, thermometers in vehicles, etc.) are carried by the community (public vehicles, private vehicles, or individuals) during their daily activities, collecting various types of data about their surroundings.
The data generated by these devices is large in quantity, and geographically and temporally skewed. Therefore, systems designed for managing such data should be aware of these unique data characteristics. In Chapter 5, we propose the ConDense (Community-driven Sensing of the Environment) framework for managing and querying community-sensed data [5, 19, 115]. ConDense exploits the spatial smoothness of environmental parameters (such as ambient pollution [5] or radiation [2]) to construct statistical models of the data. Since the number of constructed models is significantly smaller than the original data, we show that using our approach leads to a dramatic increase in query processing efficiency [19, 115] and significantly reduces memory usage.

Keywords: time-series data management, statistical query processing, adaptive clustering, community sensing, probabilistic databases, affine transformations, view generation, approximate caching.

Résumé

The amount of data in the form of time series has increased dramatically in recent years. The main sources of this data are sensor networks, medical monitoring, financial applications, news feeds, and social networks. Making these large quantities of time-series data available requires scalable data management techniques that enable efficient query processing and analysis of the data both in real time and in archival form. However, when this data comes from sensors (environmental, RFID, GPS, etc.), it is often imprecise and unreliable. It is therefore necessary to characterize this uncertainty in order to provide reliable answers. In this thesis, we propose methods for addressing these important problems related to time-series data. In particular, the thesis focuses on the following three topics:

Computing statistical measures on large-scale time series:
Computing statistical measures over large-scale time-series data is an essential prerequisite for collecting and examining such data [31, 81, 97, 111, 132, 137]. This prerequisite grows in importance as the number and size of databases increase. Chapter 3 presents the Affinity framework for the efficient computation of statistical measures, based on the concept of affine relationships [113, 114]. Affine relationships can be used to derive a large number of measures from other, related time series rather than computing them from the original series, thereby drastically reducing the overall computation. In addition, the Affinity framework offers a unified approach for computing several statistical measures at once.

Creating probabilistic databases from imprecise data:
A large amount of the time-series data produced in the real world carries an inherent share of uncertainty, owing to the various causes of imprecision affecting its sources (e.g., sensors, geolocation, environmental observatories, etc.). The main sources of imprecision are imprecise sensors, limited bandwidth, sensor failures, etc. Recently, the number of imprecise sensors has grown exponentially, which has led to an explosion of imprecise data. Moreover, the usual database techniques cannot provide reliable and consistent answers in such situations. It is therefore necessary to use probabilistic databases that take this uncertainty into account and provide reliable answers. When this kind of data is used, it is often assumed that each data item is associated with a probability distribution. This is not the case in practice, however: these probability distributions are absent. As a solution to this fundamental limitation, we propose in Chapter 4 methods for inferring these distributions and for efficiently creating databases that use the inferred distributions [116].

Managing participatory sensing data:
Participatory sensing is a rapidly evolving model among mobile geo-sensor networks. In this setting, various kinds of sensors (for example, multi-sensor units observing air quality, mobile phones, thermal watches, thermometers installed in vehicles, etc.) are carried by people (public and private vehicles, individuals) during their daily activities, collecting various types of data about their environment. The large volume of information collected by these sensors is frequently skewed in space and time. It is therefore important that systems designed to manage this data take its particular characteristics into account. In Chapter 5, we propose the ConDense (Community-driven Sensing of the Environment) framework for managing and processing queries on this type of data [5, 19, 115]. ConDense exploits the spatial smoothness of environmental parameters (for example, ambient pollution, radiation, etc.) to construct statistical models of the data. Since the number of constructed models is much smaller than the amount of data, we can show that using this approach leads to a dramatic increase in query processing efficiency [19, 115] and a considerable reduction in memory usage.

Keywords: time-series data management, statistical query processing, adaptive clustering, community sensing, probabilistic database, affine transformation, view generation, approximate caching.