
Efficient Analysis of Data Streams

132 Pages·2017·20.27 MB·English

Efficient Analysis of Data Streams

Rhian Natalie Davies

Submitted for the degree of Doctor of Philosophy at Lancaster University.

September 2017

Abstract

Data streams provide a challenging environment for statistical analysis. Data points can arrive at a high velocity and may need to be deleted once they have been observed. Due to these restrictions, standard techniques may not be applicable to the data streaming scenario. This leads to the need for data summaries to represent the data stream. This thesis explores how data summaries can be used to perform clustering and classification on data streams across a broad range of applications.

Spectral clustering is one such technique which, prior to this work, had not been applicable to the data streaming setting because of the high computational cost involved. CluStream is an existing method which uses micro-clusters to summarise data streams. We present two algorithms which utilise these micro-cluster summaries to enable spectral clustering to be performed on data streams. The methods were tested on simulated data streams, as well as textured images and hand-written digits.

Distributed acoustic sensing is used to monitor oil flow at various depths throughout an oil well. Vibrations are recorded at very high resolutions, up to 10,000 observations per second at each depth. Unfortunately, corruption can occur in the signal and engineers need to know where it occurs. We develop a method which treats the multiple time series as a high-dimensional clustering problem and uses the cluster labels to identify changes within the signal.

The final piece of work concerns identifying areas of activity within a video stream, in particular CCTV footage. It is more efficient if this classification stage is performed on a compressed version of the video stream. In order to reconstruct areas of activity in the original video, a recovery algorithm is needed.
We present a comparison of the performance of two recovery algorithms and identify an ideal range for the compression ratio.

Acknowledgements

I would like to thank my supervisors Dr Nicos Pavlidis and Professor Idris Eckley for their endless support, guidance and patience throughout this project. This final version of the thesis has been improved thanks to the helpful comments of my viva examiners, Dr Sotirios Tasoulis and Professor Kevin Glazebrook. Thanks are also due to Professor Lyudmila Mihaylova for her help with the conference paper presented in Chapter 4. The data analysed in Chapter 3 was kindly provided by Shell.

My PhD has been supported by the EPSRC funded Statistics and Operational Research (STOR-i) Centre for Doctoral Training. The STOR-i CDT has provided an excellent research environment and many additional training opportunities which have allowed me to develop my research skills. Particular thanks go to the STOR-i director Professor Jonathan Tawn for his constant support, especially during the late PhD stages.

I would also like to give special thanks to the ‘11-‘15 cohort of STOR-i: Ben, Dave, Emma, Hugo, James, Judd, Rob and Tom. My PhD experience was greatly enhanced by their knowledge, encouragement and friendship.

Finally, I would like to thank my family, Geraint, Joan, Sian and Tom. Dach chi’n werth y byd (You are worth the world).

Declaration

I declare that the work in this thesis has been done by myself and has not been submitted elsewhere for the award of any other degree.

I declare that the word count of this thesis is 23469 words.

Chapter 4 has been accepted for publication as R. Davies, L. Mihaylova, N. Pavlidis, and I. A. Eckley. The effect of recovery algorithms on compressive sensing background subtraction. In Sensor Data Fusion: Trends, Solutions, Applications, 1-6, 2013.

Rhian Davies

Contents

Abstract
Acknowledgements
Declaration
Contents
1 Introduction
  1.1 Thesis Outline
2 Spectral Clustering for Data Streams
  2.1 Introduction
  2.2 Spectral Clustering Background
    2.2.1 Graph cut problems
    2.2.2 Choice of affinity matrix
  2.3 Advanced Spectral Clustering
    2.3.1 Large-scale Spectral Clustering
    2.3.2 Incremental methods for Spectral Clustering
  2.4 CluStream for Spectral Clustering
    2.4.1 Data Stream Clustering
    2.4.2 Weighting the Micro-Clusters
  2.5 Experimentation
    2.5.1 Methodology
    2.5.2 Performance Measures
    2.5.3 Parameter Choices
    2.5.4 Simulated Results
    2.5.5 Texture data
    2.5.6 Pendigit data
    2.5.7 Non-stationary data
  2.6 Conclusion
3 Identifying corruption within acoustic sensing signals
  3.1 Introduction
  3.2 Motivation
    3.2.1 What is Distributed Acoustic Sensing?
    3.2.2 Relevant literature
    3.2.3 Using CluStream to identify boundary locations
    3.2.4 Stage one: Micro-clustering
    3.2.5 Stage two: Identifying corruption
  3.3 Results on DAS data
  3.4 Conclusion
4 The Effect of Recovery Algorithms on Compressive Sensing Background Subtraction
  4.1 Introduction
  4.2 Related Works
  4.3 Methodology
    4.3.1 Sparse and Compressible Signals
    4.3.2 Compressive Sensing
    4.3.3 Recovery Algorithms
    4.3.4 Background Subtraction with Compressive Sensing
  4.4 Performance Evaluation
  4.5 Conclusions and Further Work
5 Conclusion
A Supplementary Material on Compressive Sensing
  A.1 Introduction to Compressive Sensing
  A.2 Conditions for a Stable Measurement Matrix
    A.2.1 Null Space Conditions
    A.2.2 The Restricted Isometry Property
  A.3 Intuition for Orthogonal Matching Pursuit
Bibliography

Chapter 1

Introduction

The volume of data collected on a daily basis is staggering.
In 2013, IBM stated that over 90% of the world’s data had been created in the preceding two years. This creates a challenge for researchers and practitioners, as computers may not have enough memory to deal with such large quantities of data, and their algorithms may not run fast enough, or even at all. Big data (Buhlmann et al., 2016) is a term used to refer to data sets so huge that traditional statistical techniques may not be directly applicable.

The challenge of big data can be defined in terms of the three V’s (Laney, 2001): volume, variety and velocity. Volume refers to the vast amount of data collected. Variety relates to the many different types of data it is possible to collect from multiple sources, such as text, images, GPS location and tweets. Velocity is the speed at which the data is observed. Velocity is perhaps the least studied of the three V’s within Statistics and is most commonly found in the analysis of data streams.

A data stream (Aggarwal, 2007; Gama and Gaber, 2007) is data which is observed continuously in an ordered sequence. Data streams arise in many different applications. Examples include retail, e.g. Macy’s use of advanced data collection (Shankar et al., 2016), and oil and gas, including the development of the digital oil field, which uses sensors throughout oil wells to monitor flow and other operational characteristics (Cramer et al., 2008; Patri et al., 2012). The rate at which data arrives could be as fast as millions of data points each hour, as in the Macy’s and oil company examples. However, even if the data arrives more slowly, this can still provide a challenge if the available processing power, storage capabilities or transmission rates are limited.

It is possible for a data stream to be unbounded in size, by which we mean that there is no time point where the data stream ends. This is common in data streams found in meteorology, the stock market, online shopping and social media.
Since these data streams are potentially endless in size, it is not possible to store the data in its entirety. This means that instead of performing analysis once the data has been collected, analysis must be performed and updated in real time as new data points are observed.

This leads to another issue, sometimes referred to as the one-pass-access problem. As data streams are processed serially, once a data point has been seen it is discarded and cannot be accessed again. Some seemingly trivial analyses, such as computing the median of the data, become impossible in the data stream setting because of this one-pass-access issue. In fact, many statistical techniques make assumptions which do not hold in the data streaming scenario. A summary of the restrictions imposed by data streams is given below (Silva et al., 2013):

1. Data objects arrive continuously.
2. There is no control over the order in which the data objects should be processed.
3. The size of the stream is potentially unbounded.
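
The one-pass-access restriction described above can be illustrated with a small sketch. Unlike the median, the mean and variance of a stream can be maintained from a fixed-size summary that is updated as each point arrives. The example below uses Welford's online algorithm, which is a standard technique and not a method taken from the thesis; the names `RunningSummary` and `summarise` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningSummary:
    """Constant-memory summary of a stream (Welford's online algorithm).

    Illustrative assumption: this is not the summary structure used in
    the thesis, only a minimal example of one-pass processing.
    """
    n: int = 0          # number of points seen so far
    mean: float = 0.0   # running mean
    m2: float = 0.0     # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Each point is used once and then dropped; only n, mean and m2 persist.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Unbiased sample variance; undefined for fewer than two points.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def summarise(stream: Iterable[float]) -> RunningSummary:
    """Consume the stream in a single pass and return its summary."""
    summary = RunningSummary()
    for x in stream:
        summary.update(x)
    return summary
```

A call such as `summarise(x for x in sensor_readings)` (where `sensor_readings` is a hypothetical source) touches each value exactly once, so it works even when the source is a generator that cannot be replayed. The design choice is to store three numbers instead of the data itself, which is the same idea, in miniature, as representing a stream by a data summary.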
