
Data Stream Management: Processing High-Speed Data Streams PDF

528 Pages·2016·6.99 MB·English

Preview Data Stream Management: Processing High-Speed Data Streams

Data-Centric Systems and Applications

Series editors
M.J. Carey, S. Ceri

Editorial Board
A. Ailamaki, S. Babu, P. Bernstein, J.C. Freytag, A. Halevy, J. Han, D. Kossmann, I. Manolescu, G. Weikum, K.-Y. Whang, J.X. Yu

More information about this series at http://www.springer.com/series/5258

Minos Garofalakis · Johannes Gehrke · Rajeev Rastogi
Editors

Data Stream Management
Processing High-Speed Data Streams

Editors
Minos Garofalakis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece
Johannes Gehrke, Microsoft Corporation, Redmond, WA, USA
Rajeev Rastogi, Amazon India, Bangalore, India

ISSN 2197-9723
ISSN 2197-974X (electronic)
Data-Centric Systems and Applications
ISBN 978-3-540-28607-3
ISBN 978-3-540-28608-0 (eBook)
DOI 10.1007/978-3-540-28608-0
Library of Congress Control Number: 2016946344
Springer Heidelberg New York Dordrecht London
© Springer-Verlag Berlin Heidelberg 2016

The fourth chapter in part 4 is published with kind permission of © 2004 Association for Computing Machinery, Inc. All rights reserved.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Contents

Data Stream Management: A Brave New World
    Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Part I Foundations and Basic Stream Synopses

Data-Stream Sampling: Basic Techniques and Results
    Peter J. Haas
Quantiles and Equi-depth Histograms over Streams
    Michael B. Greenwald and Sanjeev Khanna
Join Sizes, Frequency Moments, and Applications
    Graham Cormode and Minos Garofalakis
Top-k Frequent Item Maintenance over Streams
    Moses Charikar
Distinct-Values Estimation over Data Streams
    Phillip B. Gibbons
The Sliding-Window Computation Model and Results
    Mayur Datar and Rajeev Motwani

Part II Mining Data Streams

Clustering Data Streams
    Sudipto Guha and Nina Mishra
Mining Decision Trees from Streams
    Geoff Hulten and Pedro Domingos
Frequent Itemset Mining over Data Streams
    Gurmeet Singh Manku
Temporal Dynamics of On-Line Information Streams
    Jon Kleinberg

Part III Advanced Topics

Sketch-Based Multi-Query Processing over Data Streams
    Alin Dobra, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi
Approximate Histogram and Wavelet Summaries of Streaming Data
    S. Muthukrishnan and Martin Strauss
Stable Distributions in Streaming Computations
    Graham Cormode and Piotr Indyk
Tracking Queries over Distributed Streams
    Minos Garofalakis

Part IV System Architectures and Languages

STREAM: The Stanford Data Stream Management System
    Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Mayur Datar, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and Jennifer Widom
The Aurora and Borealis Stream Processing Engines
    Uğur Çetintemel, Daniel Abadi, Yanif Ahmad, Hari Balakrishnan, Magdalena Balazinska, Mitch Cherniack, Jeong-Hyon Hwang, Samuel Madden, Anurag Maskey, Alexander Rasin, Esther Ryvkina, Mike Stonebraker, Nesime Tatbul, Ying Xing, and Stan Zdonik
Extending Relational Query Languages for Data Streams
    N. Laptev, B. Mozafari, H. Mousavi, H. Thakkar, H. Wang, K. Zeng, and Carlo Zaniolo
Hancock: A Language for Analyzing Transactional Data Streams
    Corinna Cortes, Kathleen Fisher, Daryl Pregibon, Anne Rogers, and Frederick Smith
Sensor Network Integration with Streaming Database Systems
    Daniel Abadi, Samuel Madden, and Wolfgang Lindner

Part V Applications

Stream Processing Techniques for Network Management
    Charles D. Cranor, Theodore Johnson, and Oliver Spatscheck
High-Performance XML Message Brokering
    Yanlei Diao and Michael J. Franklin
Fast Methods for Statistical Arbitrage
    Eleftherios Soulas and Dennis Shasha
Adaptive, Automatic Stream Mining
    Spiros Papadimitriou, Anthony Brockwell, and Christos Faloutsos

Conclusions and Looking Forward
    Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Data Stream Management: A Brave New World
Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

1 Introduction

Traditional data-management systems software is built on the concept of persistent data sets that are stored reliably in stable storage and queried/updated several times throughout their lifetime. For several emerging application domains, however, data arrives and needs to be processed on a continuous (24×7) basis, without the benefit of several passes over a static, persistent data image.
Such continuous data streams arise naturally, for example, in the network installations of large Telecom and Internet service providers where detailed usage information (Call-Detail-Records (CDRs), SNMP/RMON packet-flow data, etc.) from different parts of the underlying network needs to be continuously collected and analyzed for interesting trends. Other applications that generate rapid, continuous and large volumes of stream data include transactions in retail chains, ATM and credit card operations in banks, financial tickers, Web server log records, etc. In most such applications, the data stream is actually accumulated and archived in a database-management system of a (perhaps, off-site) data warehouse, often making access to the archived data prohibitively expensive. Further, the ability to make decisions and infer interesting patterns on-line (i.e., as the data stream arrives) is crucial for several mission-critical tasks that can have significant dollar value for a large corporation (e.g., telecom fraud detection). As a result, recent years have witnessed an increasing interest in designing data-processing algorithms that work over continuous data streams, i.e., algorithms that provide results to user queries while looking at the relevant data items only once and in a fixed order (determined by the stream-arrival pattern).

M. Garofalakis: School of Electrical and Computer Engineering, Technical University of Crete, University Campus, Kounoupidiana, Chania 73100, Greece
J. Gehrke: Microsoft Corporation, One Microsoft Way, Redmond, WA 98052-6399, USA
R. Rastogi: Amazon India, Brigade Gateway, Malleshwaram (W), Bangalore 560055, India
M. Garofalakis et al. (eds.), Data Stream Management, Data-Centric Systems and Applications, DOI 10.1007/978-3-540-28608-0_1

Fig. 1 ISP network monitoring data streams
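The single-pass model described in the introduction (each item examined once, in arrival order) can be illustrated with a minimal sketch. This is not code from the book: the names `RunningSummary`, `summarize_stream`, and `byte_counts` are hypothetical, and the choice of count, sum, and maximum as the maintained summary is an assumption made only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class RunningSummary(NamedTuple):
    """Constant-size state kept while the stream is read (hypothetical)."""
    count: int
    total: float
    maximum: float


def summarize_stream(items: Iterable[float]) -> Iterator[RunningSummary]:
    """Yield an updated summary after each item, reading every item once."""
    count = 0
    total = 0.0
    maximum = float("-inf")
    for value in items:  # items arrive in a fixed order and are never revisited
        count += 1
        total += value
        maximum = max(maximum, value)
        yield RunningSummary(count, total, maximum)


def byte_counts() -> Iterator[float]:
    """Hypothetical stand-in for an unbounded feed of per-record byte counts."""
    for value in (120.0, 64.0, 1500.0, 40.0):
        yield value


if __name__ == "__main__":
    for summary in summarize_stream(byte_counts()):
        # An answer is available after every arrival, without storing the stream.
        print(summary.count, summary.total / summary.count, summary.maximum)
```

Because the input is a generator, it cannot be rewound, which mirrors the constraint that no second pass over the data is available. The state kept between items has a fixed size regardless of how long the stream runs.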
Example 1 (Application: ISP Network Monitoring) To effectively manage the operation of their IP-network services, large Internet Service Providers (ISPs), like AT&T and Sprint, continuously monitor the operation of their networking infrastructure at dedicated Network Operations Centers (NOCs). This is truly a large-scale monitoring task that relies on continuously collecting streams of usage information from hundreds of routers, thousands of links and interfaces, and blisteringly-fast sets of events at different layers of the network infrastructure (ranging from fiber-cable utilizations to packet forwarding at routers, to VPNs and higher-level transport constructs). These data streams can be generated through a variety of network-monitoring tools (e.g., Cisco's NetFlow [10] or AT&T's GigaScope probe [5] for monitoring IP-packet flows). For instance, Fig. 1 depicts an example ISP monitoring setup, with an NOC tracking NetFlow measurement streams from four edge routers in the network R1–R4. The figure also depicts a small fragment of the streaming data tables retrieved from routers R1 and R2 containing simple summary information for IP sessions. In real life, such streams are truly massive, comprising hundreds of attributes and billions of records; for instance, AT&T collects over one terabyte of NetFlow measurement data from its production network each day!

Typically, this measurement data is periodically shipped off to a backend data warehouse for off-line analysis (e.g., at the end of the day). Unfortunately, such off-line analyses are painfully inadequate when it comes to critical network-management tasks, where reaction in (near) real-time is absolutely essential. Such tasks include, for instance, detecting malicious/fraudulent users, DDoS attacks, or Service-Level Agreement (SLA) violations, as well as real-time traffic engineering to avoid congestion and improve the utilization of critical network resources. Thus,

Description:
This volume focuses on the theory and practice of data stream management, and the novel challenges this emerging domain poses for data-management algorithms, systems, and applications. The collection of chapters, contributed by authorities in the field, offers a comprehensive introduction to both th