Data Streams
Models and Algorithms
ADVANCES IN DATABASE SYSTEMS
Series Editor
Ahmed K. Elmagarmid
Purdue University
West Lafayette, IN 47907
Other books in the Series:
SIMILARITY SEARCH: The Metric Space Approach, P. Zezula, G. Amato, V.
Dohnal, M. Batko; ISBN: 0-387-29146-6
STREAM DATA MANAGEMENT, Nauman Chaudhry, Kevin Shaw, Mahdi
Abdelguerfi; ISBN: 0-387-24393-3
FUZZY DATABASE MODELING WITH XML, Zongmin Ma; ISBN: 0-387-24248-1
MINING SEQUENTIAL PATTERNS FROM LARGE DATA SETS, Wei Wang
and Jiong Yang; ISBN: 0-387-24246-5
ADVANCED SIGNATURE INDEXING FOR MULTIMEDIA AND WEB
APPLICATIONS, Yannis Manolopoulos, Alexandros Nanopoulos, Eleni
Tousidou; ISBN: 1-4020-7425-5
ADVANCES IN DIGITAL GOVERNMENT: Technology, Human Factors, and
Policy, edited by William J. McIver, Jr. and Ahmed K. Elmagarmid; ISBN: 1-4020-7067-5
INFORMATION AND DATABASE QUALITY, Mario Piattini, Coral Calero and
Marcela Genero; ISBN: 0-7923-7599-8
DATA QUALITY, Richard Y. Wang, Mostapha Ziad, Yang W. Lee; ISBN: 0-7923-7215-8
THE FRACTAL STRUCTURE OF DATA REFERENCE: Applications to the
Memory Hierarchy, Bruce McNutt; ISBN: 0-7923-7945-4
SEMANTIC MODELS FOR MULTIMEDIA DATABASE SEARCHING AND
BROWSING, Shu-Ching Chen, R.L. Kashyap, and Arif Ghafoor; ISBN: 0-7923-7888-1
INFORMATION BROKERING ACROSS HETEROGENEOUS DIGITAL DATA:
A Metadata-based Approach, Vipul Kashyap, Amit Sheth; ISBN: 0-7923-7883-0
DATA DISSEMINATION IN WIRELESS COMPUTING ENVIRONMENTS,
Kian-Lee Tan and Beng Chin Ooi; ISBN: 0-7923-7866-0
MIDDLEWARE NETWORKS: Concept, Design and Deployment of Internet
Infrastructure, Michah Lerner, George Vanecek, Nino Vidovic, Dado Vrsalovic;
ISBN: 0-7923-7840-7
ADVANCED DATABASE INDEXING, Yannis Manolopoulos, Yannis Theodoridis,
Vassilis J. Tsotras; ISBN: 0-7923-7716-8
MULTILEVEL SECURE TRANSACTION PROCESSING, Vijay Atluri, Sushil
Jajodia, Binto George; ISBN: 0-7923-7702-8
FUZZY LOGIC IN DATA MODELING, Guoqing Chen; ISBN: 0-7923-8253-6
For a complete listing of books in this series, go to http://www.springer.com
Data Streams
Models and Algorithms
edited by
Charu C. Aggarwal
IBM, T.J. Watson Research Center
Yorktown Heights, NY, USA
Springer
Charu C. Aggarwal
IBM
Thomas J. Watson Research Center
19 Skyline Drive
Hawthorne NY 10532
Library of Congress Control Number: 2006934111
DATA STREAMS: Models and Algorithms edited by Charu C. Aggarwal
ISBN-10: 0-387-28759-0
ISBN-13: 978-0-387-28759-1
e-ISBN-10: 0-387-47534-6
e-ISBN-13: 978-0-387-47534-9
Cover by Will Ladd, NRL Mapping, Charting and Geodesy Branch
utilizing NRL's GIDB® Portal System that can be utilized at
http://dmap.nrlssc.navy.mil
Printed on acid-free paper.
© 2007 Springer Science+Business Media, LLC.
All rights reserved. This work may not be translated or copied in whole or
in part without the written permission of the publisher (Springer
Science+Business Media, LLC, 233 Spring Street, New York, NY 10013,
USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and
retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks and
similar terms, even if they are not identified as such, is not to be taken as
an expression of opinion as to whether or not they are subject to
proprietary rights.
Contents
List of Figures
List of Tables
Preface
1
An Introduction to Data Streams
Charu C. Aggarwal
1. Introduction
2. Stream Mining Algorithms
3. Conclusions and Summary
References
2
On Clustering Massive Data Streams: A Summarization Paradigm
Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu
1. Introduction
2. The Micro-clustering Based Stream Mining Framework
3. Clustering Evolving Data Streams: A Micro-clustering Approach
3.1 Micro-clustering Challenges
3.2 Online Micro-cluster Maintenance: The CluStream Algorithm
3.3 High Dimensional Projected Stream Clustering
4. Classification of Data Streams: A Micro-clustering Approach
4.1 On-Demand Stream Classification
5. Other Applications of Micro-clustering and Research Directions
6. Performance Study and Experimental Results
7. Discussion
References
3
A Survey of Classification Methods in Data Streams
Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy
1. Introduction
2. Research Issues
3. Solution Approaches
4. Classification Techniques
4.1 Ensemble Based Classification
4.2 Very Fast Decision Trees (VFDT)
4.3 On Demand Classification
4.4 Online Information Network (OLIN)
4.5 LWClass Algorithm
4.6 ANNCAD Algorithm
4.7 SCALLOP Algorithm
5. Summary
References
4
Frequent Pattern Mining in Data Streams
Ruoming Jin and Gagan Agrawal
1. Introduction
2. Overview
3. New Algorithm
4. Work on Other Related Problems
5. Conclusions and Future Directions
References
5
A Survey of Change Diagnosis Algorithms in Evolving Data Streams
Charu C. Aggarwal
1. Introduction
2. The Velocity Density Method
2.1 Spatial Velocity Profiles
2.2 Evolution Computations in High Dimensional Case
2.3 On the use of clustering for characterizing stream evolution
3. On the Effect of Evolution in Data Mining Algorithms
4. Conclusions
References
6
Multi-Dimensional Analysis of Data Streams Using Stream Cubes
Jiawei Han, Y. Dora Cai, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, and
Jianyong Wang
1. Introduction
2. Problem Definition
3. Architecture for On-line Analysis of Data Streams
3.1 Tilted time frame
3.2 Critical layers
3.3 Partial materialization of stream cube
4. Stream Data Cube Computation
4.1 Algorithms for cube computation
5. Performance Study
6. Related Work
7. Possible Extensions
8. Conclusions
References
7
Load Shedding in Data Stream Systems
Brian Babcock, Mayur Datar and Rajeev Motwani
1. Load Shedding for Aggregation Queries
1.1 Problem Formulation
1.2 Load Shedding Algorithm
1.3 Extensions
2. Load Shedding in Aurora
3. Load Shedding for Sliding Window Joins
4. Load Shedding for Classification Queries
5. Summary
References
8
The Sliding-Window Computation Model and Results
Mayur Datar and Rajeev Motwani
0.1 Motivation and Road Map
1. A Solution to the BASICCOUNTING Problem
1.1 The Approximation Scheme
2. Space Lower Bound for BASICCOUNTING Problem
3. Beyond 0's and 1's
4. References and Related Work
5. Conclusion
References
9
A Survey of Synopsis Construction in Data Streams
Charu C. Aggarwal, Philip S. Yu
1. Introduction
2. Sampling Methods
2.1 Random Sampling with a Reservoir
2.2 Concise Sampling
3. Wavelets
3.1 Recent Research on Wavelet Decomposition in Data Streams
4. Sketches
4.1 Fixed Window Sketches for Massive Time Series
4.2 Variable Window Sketches of Massive Time Series
4.3 Sketches and their applications in Data Streams
4.4 Sketches with p-stable distributions
4.5 The Count-Min Sketch
4.6 Related Counting Methods: Hash Functions for Determining Distinct Elements
4.7 Advantages and Limitations of Sketch Based Methods
5. Histograms
5.1 One Pass Construction of Equi-depth Histograms
5.2 Constructing V-Optimal Histograms
5.3 Wavelet Based Histograms for Query Answering
5.4 Sketch Based Methods for Multi-dimensional Histograms
6. Discussion and Challenges
References
10
A Survey of Join Processing in Data Streams
Junyi Xie and Jun Yang
1. Introduction
2. Model and Semantics
3. State Management for Stream Joins
3.1 Exploiting Constraints
3.2 Exploiting Statistical Properties
4. Fundamental Algorithms for Stream Join Processing
5. Optimizing Stream Joins
6. Conclusion
Acknowledgments
References
11
Indexing and Querying Data Streams
Ahmet Bulut, Ambuj K. Singh
1. Introduction
2. Indexing Streams
2.1 Preliminaries and definitions
2.2 Feature extraction
2.3 Index maintenance
2.4 Discrete Wavelet Transform
3. Querying Streams
3.1 Monitoring an aggregate query
3.2 Monitoring a pattern query
3.3 Monitoring a correlation query
4. Related Work
5. Future Directions
5.1 Distributed monitoring systems
5.2 Probabilistic modeling of sensor networks
5.3 Content distribution networks
6. Chapter Summary
References
12
Dimensionality Reduction and Forecasting on Streams
Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos
1. Related work
2. Principal component analysis (PCA)
3. Auto-regressive models and recursive least squares
4. MUSCLES
5. Tracking correlations and hidden variables: SPIRIT
6. Putting SPIRIT to work
7. Experimental case studies
8. Performance and accuracy
9. Conclusion
Acknowledgments
References
13
A Survey of Distributed Mining of Data Streams
Srinivasan Parthasarathy, Amol Ghoting and Matthew Eric Otey
1. Introduction
2. Outlier and Anomaly Detection
3. Clustering
4. Frequent itemset mining
5. Classification
6. Summarization
7. Mining Distributed Data Streams in Resource Constrained Environments
8. Systems Support
References
14
Algorithms for Distributed Data Stream Mining
Kanishka Bhaduri, Kamalika Das, Krishnamoorthy Sivakumar, Hillol Kargupta, Ran
Wolff and Rong Chen
1. Introduction
2. Motivation: Why Distributed Data Stream Mining?
3. Existing Distributed Data Stream Mining Algorithms
4. A local algorithm for distributed data stream mining
4.1 Local Algorithms: definition
4.2 Algorithm details
4.3 Experimental results
4.4 Modifications and extensions
5. Bayesian Network Learning from Distributed Data Streams
5.1 Distributed Bayesian Network Learning Algorithm
5.2 Selection of samples for transmission to global site
5.3 Online Distributed Bayesian Network Learning
5.4 Experimental Results
6. Conclusion
References
15
A Survey of Stream Processing Problems and Techniques in Sensor Networks
Sharmila Subramaniam, Dimitrios Gunopulos
1. Challenges
Description: This book primarily discusses issues related to the mining aspects of data streams, and it is unique in its primary focus on the subject. This volume covers mining aspects of data streams comprehensively: each contributed chapter contains a survey on the topic, the key ideas in the field for that par