Data-Centric Systems and Applications

Series editors
M.J. Carey
S. Ceri

Editorial Board
A. Ailamaki
S. Babu
P. Bernstein
J.C. Freytag
A. Halevy
J. Han
D. Kossmann
I. Manolescu
G. Weikum
K.-Y. Whang
J.X. Yu

More information about this series at http://www.springer.com/series/5258

Minos Garofalakis · Johannes Gehrke · Rajeev Rastogi
Editors

Data Stream Management
Processing High-Speed Data Streams

Editors
Minos Garofalakis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece
Johannes Gehrke, Microsoft Corporation, Redmond, WA, USA
Rajeev Rastogi, Amazon India, Bangalore, India

ISSN 2197-9723
ISSN 2197-974X (electronic)
Data-Centric Systems and Applications
ISBN 978-3-540-28607-3
ISBN 978-3-540-28608-0 (eBook)
DOI 10.1007/978-3-540-28608-0
Library of Congress Control Number: 2016946344
Springer Heidelberg New York Dordrecht London
© Springer-Verlag Berlin Heidelberg 2016

The fourth chapter in part 4 is published with kind permission of © 2004 Association for Computing Machinery, Inc. All rights reserved.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Contents

Data Stream Management: A Brave New World
    Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Part I Foundations and Basic Stream Synopses

Data-Stream Sampling: Basic Techniques and Results
    Peter J. Haas
Quantiles and Equi-depth Histograms over Streams
    Michael B. Greenwald and Sanjeev Khanna
Join Sizes, Frequency Moments, and Applications
    Graham Cormode and Minos Garofalakis
Top-k Frequent Item Maintenance over Streams
    Moses Charikar
Distinct-Values Estimation over Data Streams
    Phillip B. Gibbons
The Sliding-Window Computation Model and Results
    Mayur Datar and Rajeev Motwani

Part II Mining Data Streams

Clustering Data Streams
    Sudipto Guha and Nina Mishra
Mining Decision Trees from Streams
    Geoff Hulten and Pedro Domingos
Frequent Itemset Mining over Data Streams
    Gurmeet Singh Manku
Temporal Dynamics of On-Line Information Streams
    Jon Kleinberg

Part III Advanced Topics

Sketch-Based Multi-Query Processing over Data Streams
    Alin Dobra, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi
Approximate Histogram and Wavelet Summaries of Streaming Data
    S. Muthukrishnan and Martin Strauss
Stable Distributions in Streaming Computations
    Graham Cormode and Piotr Indyk
Tracking Queries over Distributed Streams
    Minos Garofalakis

Part IV System Architectures and Languages

STREAM: The Stanford Data Stream Management System
    Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Mayur Datar, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and Jennifer Widom
The Aurora and Borealis Stream Processing Engines
    Uğur Çetintemel, Daniel Abadi, Yanif Ahmad, Hari Balakrishnan, Magdalena Balazinska, Mitch Cherniack, Jeong-Hyon Hwang, Samuel Madden, Anurag Maskey, Alexander Rasin, Esther Ryvkina, Mike Stonebraker, Nesime Tatbul, Ying Xing, and Stan Zdonik
Extending Relational Query Languages for Data Streams
    N. Laptev, B. Mozafari, H. Mousavi, H. Thakkar, H. Wang, K. Zeng, and Carlo Zaniolo
Hancock: A Language for Analyzing Transactional Data Streams
    Corinna Cortes, Kathleen Fisher, Daryl Pregibon, Anne Rogers, and Frederick Smith
Sensor Network Integration with Streaming Database Systems
    Daniel Abadi, Samuel Madden, and Wolfgang Lindner

Part V Applications

Stream Processing Techniques for Network Management
    Charles D. Cranor, Theodore Johnson, and Oliver Spatscheck
High-Performance XML Message Brokering
    Yanlei Diao and Michael J. Franklin
Fast Methods for Statistical Arbitrage
    Eleftherios Soulas and Dennis Shasha
Adaptive, Automatic Stream Mining
    Spiros Papadimitriou, Anthony Brockwell, and Christos Faloutsos

Conclusions and Looking Forward
    Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Data Stream Management: A Brave New World

Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

1 Introduction

Traditional data-management systems software is built on the concept of persistent data sets that are stored reliably in stable storage and queried/updated several times throughout their lifetime. For several emerging application domains, however, data arrives and needs to be processed on a continuous (24×7) basis, without the benefit of several passes over a static, persistent data image.
Such continuous data streams arise naturally, for example, in the network installations of large Telecom and Internet service providers where detailed usage information (Call-Detail-Records (CDRs), SNMP/RMON packet-flow data, etc.) from different parts of the underlying network needs to be continuously collected and analyzed for interesting trends. Other applications that generate rapid, continuous and large volumes of stream data include transactions in retail chains, ATM and credit card operations in banks, financial tickers, Web server log records, etc. In most such applications, the data stream is actually accumulated and archived in a database-management system of a (perhaps, off-site) data warehouse, often making access to the archived data prohibitively expensive. Further, the ability to make decisions and infer interesting patterns on-line (i.e., as the data stream arrives) is crucial for several mission-critical tasks that can have significant dollar value for a large corporation (e.g., telecom fraud detection). As a result, recent years have witnessed an increasing interest in designing data-processing algorithms that work over continuous data streams, i.e., algorithms that provide results to user queries while looking at the relevant data items only once and in a fixed order (determined by the stream-arrival pattern).

M. Garofalakis (B)
School of Electrical and Computer Engineering, Technical University of Crete, University Campus - Kounoupidiana, Chania 73100, Greece
e-mail: [email protected]

J. Gehrke
Microsoft Corporation, One Microsoft Way, Redmond, WA 98052-6399, USA
e-mail: [email protected]

R. Rastogi
Amazon India, Brigade Gateway, Malleshwaram (W), Bangalore 560055, India
e-mail: [email protected]

© Springer-Verlag Berlin Heidelberg 2016
M. Garofalakis et al. (eds.), Data Stream Management, Data-Centric Systems and Applications, DOI 10.1007/978-3-540-28608-0_1

Fig. 1 ISP network monitoring data streams
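To make the one-pass constraint described in the introduction concrete, the following is a minimal Python sketch, not taken from the book. It maintains a running mean over a stream while reading each item exactly once, in arrival order, and keeping only constant state. The FlowRecord type, its field names, and the flow_source generator are hypothetical stand-ins chosen for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FlowRecord(NamedTuple):
    """Hypothetical, simplified flow record; the field names are assumptions."""
    source: str
    dest: str
    num_bytes: int


def running_mean_bytes(stream: Iterable[FlowRecord]) -> Iterator[float]:
    """Yield the mean bytes per record after each arrival.

    Each record is read exactly once, in arrival order, and the only
    state kept is a count and the current mean.
    """
    count = 0
    mean = 0.0
    for record in stream:
        count += 1
        # Incremental update: no earlier record is stored or re-read.
        mean += (record.num_bytes - mean) / count
        yield mean


def flow_source() -> Iterator[FlowRecord]:
    """Stand-in for a live feed; a generator can be consumed only once."""
    yield FlowRecord("10.0.0.1", "10.0.0.9", 100)
    yield FlowRecord("10.0.0.2", "10.0.0.9", 300)
    yield FlowRecord("10.0.0.1", "10.0.0.7", 200)


means = list(running_mean_bytes(flow_source()))
print(means)  # [100.0, 200.0, 200.0]
```

Because the input is a generator, the sketch cannot go back for a second pass even if it wanted to, which mirrors the constraint that stream items are seen once and in a fixed order.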
Example 1 (Application: ISP Network Monitoring) To effectively manage the operation of their IP-network services, large Internet Service Providers (ISPs), like AT&T and Sprint, continuously monitor the operation of their networking infrastructure at dedicated Network Operations Centers (NOCs). This is truly a large-scale monitoring task that relies on continuously collecting streams of usage information from hundreds of routers, thousands of links and interfaces, and blisteringly fast sets of events at different layers of the network infrastructure (ranging from fiber-cable utilizations to packet forwarding at routers, to VPNs and higher-level transport constructs). These data streams can be generated through a variety of network-monitoring tools (e.g., Cisco's NetFlow [10] or AT&T's GigaScope probe [5] for monitoring IP-packet flows). For instance, Fig. 1 depicts an example ISP monitoring setup, with an NOC tracking NetFlow measurement streams from four edge routers in the network, R1–R4. The figure also depicts a small fragment of the streaming data tables retrieved from routers R1 and R2 containing simple summary information for IP sessions. In real life, such streams are truly massive, comprising hundreds of attributes and billions of records; for instance, AT&T collects over one terabyte of NetFlow measurement data from its production network each day!

Typically, this measurement data is periodically shipped off to a backend data warehouse for off-line analysis (e.g., at the end of the day). Unfortunately, such off-line analyses are painfully inadequate when it comes to critical network-management tasks, where reaction in (near) real-time is absolutely essential. Such tasks include, for instance, detecting malicious/fraudulent users, DDoS attacks, or Service-Level Agreement (SLA) violations, as well as real-time traffic engineering to avoid congestion and improve the utilization of critical network resources. Thus,