Table Of Content

High Performance Data Mining in Time Series: Techniques and Case Studies by Yunyue Zhu A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of Computer Science New York University January 2004 Dennis Shasha (cid:176)c Yunyue Zhu All Rights Reserved, 2004 To my parents and Amy, for many wonderful things in life. iii Acknowledgments This dissertation would never have materialized without the contribution of many individuals to whom I have the pleasure of expressing my appreciation and gratitude. First of all, I gratefully acknowledge the persistent support and encourage- ment from my advisor, Professor Dennis Shasha. He provided constant academic guidance and inspired many of the ideas presented here. Dennis is a superb teacher and a great friend. I wish to express my deep gratitude to Professor Ernest Davis and Profes- sor Chee Yap for serving on my proposal and dissertation committees. Their comments on this thesis are precious. I also thank the other members of my dissertation committee, Professor Richard Cole, Dr. Flip Korn and Professor Arthur Goldberg, for their interest in this dissertation and for their feedback. Rich interactions with colleagues improve research and make it enjoyable. Professor Allen Mincer has both introduced me to high-energy physics and arranged the access to Milagro data and software. Stuart Lewis has helped with many exciting ideas and promising introductions to the Magnetic Resonance Imagery community. Within the database group, Tony Corso, Hsiao-Lan Hsu, Alberto Lerner, Nicolas Levi, David Rothman, David Tanzer, Aris Tsirigos, Zhihua Wang, Xiaojian Zhao have lent both voices and helpful suggestions in iv the course of this work. This is certainly not a complete list. I am thankful for many friends with whom I share more than just an academic relationship. Rosemary Amico, Anina Karmen and Maria L. Petagna performed the ad- ministrative work required for this research. They were vital in making my stay at NYU enjoyable. Finally and most importantly, I would like to thank my parents for their efforts to provide me with the best possible education. v Abstract As extremely large time series data sets grow more prevalent in a wide vari- ety of settings, we face the significant challenge of developing efficient analysis methods. This dissertation addresses the problem in designing fast, scalable algorithms for the analysis of time series. The first part of this dissertation describes the framework for high performance time series data mining based on important primitives. Data reduction transform such as the Discrete Fourier Transform, the Discrete Wavelet Trans- form, Singular Value Decomposition and Random Projection, can reduce the size of the data without substantial loss of information, therefore provides a synopsis of the data. Indexing methods organize data so that the time series data can be retrieved efficiently. Transformation on time series, such as shifting, scaling, time shifting, time scaling and dynamic time warping, facilitates the discovery of flexible patterns from time series. The second part of this dissertation integrates the above primitives into useful applications ranging from music to physics to finance to medicine. StatStream StatStream is a system based on fast algorithms for finding the most highly correlated pairs of time series from among thousands of time series streams and doing so in a moving window fashion. It can be used to find correlations in time series in finance and in scientific applications. vi HumFinderMostpeoplehumratherpoorly. Nevertheless, somehowpeople have some idea what we are humming when we hum. The goal of the query by humming program, HumFinder, is to make a computer do what a person can do. Using pitch translation, time dilation, and dynamic time warping, one can match an inaccurate hum to a melody remarkably accurately. OmniBurstBurstdetectionistheactivityoffindingabnormalaggregatesin data streams. Our software, OmniBurst, can detect bursts of varying durations. Our example applications are monitoring gamma rays and stock market price volatility. The software makes use of a shifted wavelet structure to create a linear time filter that can guarantee that no bursts will be missed at the same time that it guarantees (under a reasonable statistical model) that the filter eliminates nearly all false positives. vii Contents Dedication iii Acknowledgments iv Abstract vi List of Figures xi List of Tables xviii I Review of Techniques 1 1 Time Series Preliminaries 2 1.1 High Performance Time Series Analysis . . . . . . . . . . . . . . 8 2 Data Reduction Techniques 11 2.1 Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Wavelet Transform . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.3 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . 71 2.4 Sketches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 2.5 Comparison of Data Reduction Techniques . . . . . . . . . . . . 92 viii 3 Indexing Methods 97 3.1 B-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 3.2 KD-B-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 3.3 R-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 3.4 Grid Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 4 Transformations on Time Series 114 4.1 GEMINI Framework . . . . . . . . . . . . . . . . . . . . . . . . 117 4.2 Shifting and Scaling . . . . . . . . . . . . . . . . . . . . . . . . . 121 4.3 Time Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 4.4 Local Dynamic Time Warping . . . . . . . . . . . . . . . . . . . 129 II Case Studies 135 5 StatStream 136 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 5.2 Data And Queries . . . . . . . . . . . . . . . . . . . . . . . . . . 140 5.3 Statistics Over Sliding Windows . . . . . . . . . . . . . . . . . . 142 5.4 StatStream System . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.5 Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 6 Query by Humming 173 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 6.3 Architecture of the HumFinder System . . . . . . . . . . . . . . 179 ix 6.4 Indexing Scheme for Dynamic Time Warping . . . . . . . . . . . 185 6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 7 Elastic Burst Detection 206 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 7.2 Data Structure and Algorithm . . . . . . . . . . . . . . . . . . . 211 7.3 Empirical Results of the OmniBurst System . . . . . . . . . . . 225 7.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 7.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . 238 8 A Call to Exploration 239 Bibliography 241 x