MACHINE LEARNING FOR DATA STREAMS
with Practical Examples in MOA

Adaptive Computation and Machine Learning
Francis Bach, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

A complete list of books published in The Adaptive Computation and Machine Learning series appears at the back of this book.

Albert Bifet
Ricard Gavaldà
Geoff Holmes
Bernhard Pfahringer

The MIT Press
Cambridge, Massachusetts
London, England

© 2017 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form or by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

This book was set in Times Roman and Mathtime Pro 2 by the authors.
Printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Data is available
ISBN: 978-0-262-03779-2
10 9 8 7 6 5 4 3 2 1

Contents

List of Figures
List of Tables
Preface

I INTRODUCTION
1 Introduction
  1.1 Big Data
    1.1.1 Tools: Open-Source Revolution
    1.1.2 Challenges in Big Data
  1.2 Real-Time Analytics
    1.2.1 Data Streams
    1.2.2 Time and Memory
    1.2.3 Applications
  1.3 What This Book Is About
2 Big Data Stream Mining
  2.1 Algorithms
  2.2 Classification
    2.2.1 Classifier Evaluation in Data Streams
    2.2.2 Majority Class Classifier
    2.2.3 No-Change Classifier
    2.2.4 Lazy Classifier
    2.2.5 Naive Bayes
    2.2.6 Decision Trees
    2.2.7 Ensembles
  2.3 Regression
  2.4 Clustering
  2.5 Frequent Pattern Mining
3 Hands-on Introduction to MOA
  3.1 Getting Started
  3.2 The Graphical User Interface for Classification
    3.2.1 Drift Stream Generators
  3.3 Using the Command Line

II STREAM MINING
4 Streams and Sketches
  4.1 Setting: Approximation Algorithms
  4.2 Concentration Inequalities
  4.3 Sampling
  4.4 Counting Total Items
  4.5 Counting Distinct Elements
    4.5.1 Linear Counting
    4.5.2 Cohen’s Logarithmic Counter
    4.5.3 The Flajolet-Martin Counter and HyperLogLog
    4.5.4 An Application: Computing Distance Functions in Graphs
    4.5.5 Discussion: Log vs. Linear
  4.6 Frequency Problems
    4.6.1 The SPACESAVING Sketch
    4.6.2 The CM-Sketch Algorithm
    4.6.3 CountSketch
    4.6.4 Moment Computation
  4.7 Exponential Histograms for Sliding Windows
  4.8 Distributed Sketching: Mergeability
  4.9 Some Technical Discussions and Additional Material
    4.9.1 Hash Functions
    4.9.2 Creating (ε, δ)-Approximation Algorithms
    4.9.3 Other Sketching Techniques
  4.10 Exercises
5 Dealing with Change
  5.1 Notion of Change in Streams
  5.2 Estimators
    5.2.1 Sliding Windows and Linear Estimators
    5.2.2 Exponentially Weighted Moving Average
    5.2.3 Unidimensional Kalman Filter
  5.3 Change Detection
    5.3.1 Evaluating Change Detection
    5.3.2 The CUSUM and Page-Hinkley Tests
    5.3.3 Statistical Tests
    5.3.4 Drift Detection Method
    5.3.5 ADWIN
  5.4 Combination with Other Sketches and Multidimensional Data
  5.5 Exercises
6 Classification
  6.1 Classifier Evaluation
    6.1.1 Error Estimation
    6.1.2 Distributed Evaluation
    6.1.3 Performance Evaluation Measures
    6.1.4 Statistical Significance
    6.1.5 A Cost Measure for the Mining Process
  6.2 Baseline Classifiers
    6.2.1 Majority Class
    6.2.2 No-change Classifier
    6.2.3 Naive Bayes
    6.2.4 Multinomial Naive Bayes
  6.3 Decision Trees
    6.3.1 Estimating Split Criteria
    6.3.2 The Hoeffding Tree
    6.3.3 CVFDT
    6.3.4 VFDTc and UFFT
    6.3.5 Hoeffding Adaptive Tree
  6.4 Handling Numeric Attributes
    6.4.1 VFML
    6.4.2 Exhaustive Binary Tree
    6.4.3 Greenwald and Khanna’s Quantile Summaries
    6.4.4 Gaussian Approximation
  6.5 Perceptron
  6.6 Lazy Learning
  6.7 Multi-label Classification
    6.7.1 Multi-label Hoeffding Trees
  6.8 Active Learning
    6.8.1 Random Strategy
    6.8.2 Fixed Uncertainty Strategy
    6.8.3 Variable Uncertainty Strategy
    6.8.4 Uncertainty Strategy with Randomization
  6.9 Concept Evolution
  6.10 Lab Session with MOA
7 Ensemble Methods
  7.1 Accuracy-Weighted Ensembles
  7.2 Weighted Majority
  7.3 Stacking
  7.4 Bagging
    7.4.1 Online Bagging Algorithm
    7.4.2 Bagging with a Change Detector
    7.4.3 Leveraging Bagging
  7.5 Boosting
  7.6 Ensembles of Hoeffding Trees
    7.6.1 Hoeffding Option Trees
    7.6.2 Random Forests
    7.6.3 Perceptron Stacking of Restricted Hoeffding Trees
    7.6.4 Adaptive-Size Hoeffding Trees
  7.7 Recurrent Concepts
  7.8 Lab Session with MOA
8 Regression
  8.1 Introduction
  8.2 Evaluation
  8.3 Perceptron Learning
  8.4 Lazy Learning
  8.5 Decision Tree Learning
  8.6 Decision Rules
  8.7 Regression in MOA
9 Clustering
  9.1 Evaluation Measures
  9.2 The k-means Algorithm
  9.3 BIRCH, BICO, and CLUSTREAM
  9.4 Density-Based Methods: DBSCAN and Den-Stream
  9.5 CLUSTREE
  9.6 StreamKM++: Coresets
  9.7 Additional Material
  9.8 Lab Session with MOA
10 Frequent Pattern Mining
  10.1 An Introduction to Pattern Mining
    10.1.1 Patterns: Definitions and Examples
    10.1.2 Batch Algorithms for Frequent Pattern Mining
    10.1.3 Closed and Maximal Patterns
  10.2 Frequent Pattern Mining in Streams: Approaches
    10.2.1 Coresets of Closed Patterns
  10.3 Frequent Itemset Mining on Streams
    10.3.1 Reduction to Heavy Hitters
    10.3.2 Moment
    10.3.3 FP-STREAM
    10.3.4 IncMine
  10.4 Frequent Subgraph Mining on Streams
    10.4.1 WINGRAPHMINER
    10.4.2 ADAGRAPHMINER
  10.5 Additional Material
  10.6 Exercises

III THE MOA SOFTWARE
11 Introduction to MOA and Its Ecosystem
  11.1 MOA Architecture
  11.2 Installation
  11.3 Recent Developments in MOA
  11.4 Extensions to MOA