Data-Centric Systems and Applications
Series editors
M.J. Carey
S. Ceri
Editorial Board
A. Ailamaki
S. Babu
P. Bernstein
J.C. Freytag
A. Halevy
J. Han
D. Kossmann
I. Manolescu
G. Weikum
K.-Y. Whang
J.X. Yu
More information about this series at
http://www.springer.com/series/5258
Minos Garofalakis · Johannes Gehrke · Rajeev Rastogi
Editors
Data Stream Management
Processing High-Speed Data Streams
Editors
Minos Garofalakis
School of Electrical and Computer Engineering
Technical University of Crete
Chania, Greece

Johannes Gehrke
Microsoft Corporation
Redmond, WA, USA

Rajeev Rastogi
Amazon India
Bangalore, India
ISSN 2197-9723    ISSN 2197-974X (electronic)
Data-Centric Systems and Applications
ISBN 978-3-540-28607-3    ISBN 978-3-540-28608-0 (eBook)
DOI 10.1007/978-3-540-28608-0
Library of Congress Control Number: 2016946344
Springer Heidelberg New York Dordrecht London
© Springer-Verlag Berlin Heidelberg 2016
The fourth chapter in part 4 is published with kind permission of © 2004 Association for
Computing Machinery, Inc. All rights reserved.
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole
or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical
way, and transmission or information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information
in this book are believed to be true and accurate at the date of publication. Neither the
publisher nor the authors or the editors give a warranty, express or implied, with respect to
the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Contents
Data Stream Management: A Brave New World
Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi
Part I Foundations and Basic Stream Synopses
Data-Stream Sampling: Basic Techniques and Results
Peter J. Haas
Quantiles and Equi-depth Histograms over Streams
Michael B. Greenwald and Sanjeev Khanna
Join Sizes, Frequency Moments, and Applications
Graham Cormode and Minos Garofalakis
Top-k Frequent Item Maintenance over Streams
Moses Charikar
Distinct-Values Estimation over Data Streams
Phillip B. Gibbons
The Sliding-Window Computation Model and Results
Mayur Datar and Rajeev Motwani
Part II Mining Data Streams
Clustering Data Streams
Sudipto Guha and Nina Mishra
Mining Decision Trees from Streams
Geoff Hulten and Pedro Domingos
Frequent Itemset Mining over Data Streams
Gurmeet Singh Manku
Temporal Dynamics of On-Line Information Streams
Jon Kleinberg
Part III Advanced Topics
Sketch-Based Multi-Query Processing over Data Streams
Alin Dobra, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi
Approximate Histogram and Wavelet Summaries of Streaming Data
S. Muthukrishnan and Martin Strauss
Stable Distributions in Streaming Computations
Graham Cormode and Piotr Indyk
Tracking Queries over Distributed Streams
Minos Garofalakis
Part IV System Architectures and Languages
STREAM: The Stanford Data Stream Management System
Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz,
Mayur Datar, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and
Jennifer Widom
The Aurora and Borealis Stream Processing Engines
Uğur Çetintemel, Daniel Abadi, Yanif Ahmad, Hari Balakrishnan,
Magdalena Balazinska, Mitch Cherniack, Jeong-Hyon Hwang,
Samuel Madden, Anurag Maskey, Alexander Rasin, Esther Ryvkina,
Mike Stonebraker, Nesime Tatbul, Ying Xing, and Stan Zdonik
Extending Relational Query Languages for Data Streams
N. Laptev, B. Mozafari, H. Mousavi, H. Thakkar, H. Wang, K. Zeng,
and Carlo Zaniolo
Hancock: A Language for Analyzing Transactional Data Streams
Corinna Cortes, Kathleen Fisher, Daryl Pregibon, Anne Rogers, and
Frederick Smith
Sensor Network Integration with Streaming Database Systems
Daniel Abadi, Samuel Madden, and Wolfgang Lindner
Part V Applications
Stream Processing Techniques for Network Management
Charles D. Cranor, Theodore Johnson, and Oliver Spatscheck
High-Performance XML Message Brokering
Yanlei Diao and Michael J. Franklin
Fast Methods for Statistical Arbitrage
Eleftherios Soulas and Dennis Shasha
Adaptive, Automatic Stream Mining
Spiros Papadimitriou, Anthony Brockwell, and Christos Faloutsos
Conclusions and Looking Forward
Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi
Data Stream Management: A Brave New World
Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

M. Garofalakis (✉)
School of Electrical and Computer Engineering, Technical University of Crete,
University Campus - Kounoupidiana, Chania 73100, Greece
e-mail: minos@softnet.tuc.gr
J. Gehrke
Microsoft Corporation, One Microsoft Way, Redmond, WA 98052-6399, USA
e-mail: johannes@microsoft.com
R. Rastogi
Amazon India, Brigade Gateway, Malleshwaram (W), Bangalore 560055, India
e-mail: rastogi@amazon.com
© Springer-Verlag Berlin Heidelberg 2016
M. Garofalakis et al. (eds.), Data Stream Management,
Data-Centric Systems and Applications, DOI 10.1007/978-3-540-28608-0_1

1 Introduction
Traditional data-management systems software is built on the concept of persistent
data sets that are stored reliably in stable storage and queried/updated several
times throughout their lifetime. For several emerging application domains, however,
data arrives and needs to be processed on a continuous (24×7) basis, without
the benefit of several passes over a static, persistent data image. Such continuous
data streams arise naturally, for example, in the network installations of large
Telecom and Internet service providers where detailed usage information
(Call-Detail-Records (CDRs), SNMP/RMON packet-flow data, etc.) from different parts of the
underlying network needs to be continuously collected and analyzed for interesting
trends. Other applications that generate rapid, continuous and large volumes of
stream data include transactions in retail chains, ATM and credit card operations in
banks, financial tickers, Web server log records, etc. In most such applications, the
data stream is actually accumulated and archived in a database-management system
of a (perhaps, off-site) data warehouse, often making access to the archived data
prohibitively expensive. Further, the ability to make decisions and infer interesting
patterns on-line (i.e., as the data stream arrives) is crucial for several mission-critical
tasks that can have significant dollar value for a large corporation (e.g., telecom
fraud detection). As a result, recent years have witnessed an increasing interest in
designing data-processing algorithms that work over continuous data streams, i.e.,
algorithms that provide results to user queries while looking at the relevant data
items only once and in a fixed order (determined by the stream-arrival pattern).

Fig. 1 ISP network monitoring data streams
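The one-pass processing model described in the introduction can be illustrated with a
minimal sketch. It is not taken from the chapter: the class name RunningStats, the function
summarize, and the choice of statistics (count, mean, variance, maximum) are assumptions made
for illustration only. The sketch uses Welford's well-known incremental update so that each
item is examined exactly once, in arrival order, while the summary stays constant in size.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Constant-size summary that is updated once per arriving item."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0              # running sum of squared deviations from the mean
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        # Welford's incremental update: no earlier item is ever revisited.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        if x > self.maximum:
            self.maximum = x

    @property
    def variance(self) -> float:
        # Population variance of the items seen so far (0.0 for an empty stream).
        return self.m2 / self.count if self.count else 0.0


def summarize(stream: Iterable[float]) -> RunningStats:
    """Consume the stream in a single pass, in the order the items arrive."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

Because summarize accepts any iterable, it can be fed a generator that yields measurements
as they arrive; the stream itself never has to be stored, and the summary can be queried at
any point during processing.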
Example 1 (Application: ISP Network Monitoring) To effectively manage the operation
of their IP-network services, large Internet Service Providers (ISPs), like
AT&T and Sprint, continuously monitor the operation of their networking infrastructure
at dedicated Network Operations Centers (NOCs). This is truly a large-scale
monitoring task that relies on continuously collecting streams of usage information
from hundreds of routers, thousands of links and interfaces, and blisteringly-fast
sets of events at different layers of the network infrastructure (ranging from
fiber-cable utilizations to packet forwarding at routers, to VPNs and higher-level
transport constructs). These data streams can be generated through a variety of
network-monitoring tools (e.g., Cisco’s NetFlow [10] or AT&T’s GigaScope probe [5] for
monitoring IP-packet flows). For instance, Fig. 1 depicts an example ISP monitoring
setup, with an NOC tracking NetFlow measurement streams from four edge routers
in the network, $R_1$–$R_4$. The figure also depicts a small fragment of the streaming
data tables retrieved from routers $R_1$ and $R_2$ containing simple summary
information for IP sessions. In real life, such streams are truly massive, comprising hundreds
of attributes and billions of records; for instance, AT&T collects over one terabyte
of NetFlow measurement data from its production network each day!
Typically, this measurement data is periodically shipped off to a backend data
warehouse for off-line analysis (e.g., at the end of the day). Unfortunately, such
off-line analyses are painfully inadequate when it comes to critical
network-management tasks, where reaction in (near) real-time is absolutely essential. Such
tasks include, for instance, detecting malicious/fraudulent users, DDoS attacks, or
Service-Level Agreement (SLA) violations, as well as real-time traffic engineering
to avoid congestion and improve the utilization of critical network resources. Thus,