Frequent Pattern Mining

Charu C. Aggarwal • Jiawei Han
Editors

Charu C. Aggarwal
IBM T.J. Watson Research Center
Yorktown Heights, New York, USA

Jiawei Han
University of Illinois at Urbana-Champaign
Urbana, Illinois, USA

ISBN 978-3-319-07820-5
ISBN 978-3-319-07821-2 (eBook)
DOI 10.1007/978-3-319-07821-2
Springer Cham Heidelberg New York Dordrecht London
Library of Congress Control Number: 2014944536
© Springer International Publishing Switzerland 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

The field of data mining has four main “super-problems” corresponding to clustering, classification, outlier analysis, and frequent pattern mining. Compared to the other three problems, the frequent pattern mining model was formulated relatively recently. In spite of its shorter history, frequent pattern mining is considered the marquee problem of data mining. The reason for this is that interest in the data mining field increased rapidly soon after the seminal paper on association rule mining by Agrawal, Imielinski, and Swami. The earlier data mining conferences were often dominated by a large number of frequent pattern mining papers. This is one of the reasons that frequent pattern mining has a very special place in the data mining community. At this point, the field of frequent pattern mining is considered a mature one.

While the field has reached a relative level of maturity, very few books cover different aspects of frequent pattern mining. Most of the existing books are either too generic or do not cover frequent pattern mining in an exhaustive way. A need exists for a book on the topic that covers its different nuances exhaustively.

This book provides comprehensive surveys in the field of frequent pattern mining. Each chapter is designed as a survey that covers the key aspects of the field of frequent pattern mining. The chapters are typically of the following types:

• Algorithms: In these cases, the key algorithms for frequent pattern mining are explored. These include join-based methods such as Apriori, and pattern-growth methods.
• Variations: Many variations of frequent pattern mining such as interesting patterns, negative patterns, constrained pattern mining, or compressed patterns are explored in these chapters.
• Scalability: The large sizes of data in recent years have led to the need for big data and streaming frameworks for frequent pattern mining. Frequent pattern mining algorithms need to be modified to work with these advanced scenarios.
• Data Types: Different data types lead to different challenges for frequent pattern mining algorithms. Frequent pattern mining algorithms need to be able to work with complex data types, such as temporal or graph data.
• Applications: In these chapters, different applications of frequent pattern mining are explored. These include the application of frequent pattern mining methods to problems such as clustering and classification. Other more complex algorithms are also explored.

This book is, therefore, intended to provide an overview of the field of frequent pattern mining, as it currently stands. It is hoped that the book will serve as a useful guide for students, researchers, and practitioners.

Contents

1 An Introduction to Frequent Pattern Mining
  Charu C. Aggarwal
  1 Introduction
  2 Frequent Pattern Mining Algorithms
    2.1 Frequent Pattern Mining with the Traditional Support Framework
    2.2 Interesting and Negative Frequent Patterns
    2.3 Constrained Frequent Pattern Mining
    2.4 Compressed Representations of Frequent Patterns
  3 Scalability Issues in Frequent Pattern Mining
    3.1 Frequent Pattern Mining in Data Streams
    3.2 Frequent Pattern Mining with Big Data
  4 Frequent Pattern Mining with Advanced Data Types
    4.1 Sequential Pattern Mining
    4.2 Spatiotemporal Pattern Mining
    4.3 Frequent Patterns in Graphs and Structured Data
    4.4 Frequent Pattern Mining with Uncertain Data
  5 Privacy Issues
  6 Applications of Frequent Pattern Mining
    6.1 Applications to Major Data Mining Problems
    6.2 Generic Applications
  7 Conclusions and Summary
  References

2 Frequent Pattern Mining Algorithms: A Survey
  Charu C. Aggarwal, Mansurul A. Bhuiyan and Mohammad Al Hasan
  1 Introduction
    1.1 Definitions
  2 Join-Based Algorithms
    2.1 Apriori Method
    2.2 DHP Algorithm
    2.3 Special Tricks for 2-Itemset Counting
    2.4 Pruning by Support Lower Bounding
    2.5 Hypercube Decomposition
  3 Tree-Based Algorithms
    3.1 AIS Algorithm
    3.2 TreeProjection Algorithms
    3.3 Vertical Mining Algorithms
  4 Recursive Suffix-Based Growth
    4.1 The FP-Growth Approach
    4.2 Variations
  5 Maximal and Closed Frequent Itemsets
    5.1 Definitions
    5.2 Frequent Maximal Itemset Mining Algorithms
    5.3 Frequent Closed Itemset Mining Algorithms
  6 Other Optimizations and Variations
    6.1 Row Enumeration Methods
    6.2 Other Exploration Strategies
  7 Reducing the Number of Passes
    7.1 Combining Passes
    7.2 Sampling Tricks
    7.3 Online Association Rule Mining
  8 Conclusions and Summary
  References

3 Pattern-Growth Methods
  Jiawei Han and Jian Pei
  1 Introduction
  2 FP-Growth: Pattern Growth for Mining Frequent Itemsets
  3 Pushing More Constraints in Pattern-Growth Mining
  4 PrefixSpan: Mining Sequential Patterns by Pattern Growth
  5 Further Development of Pattern Growth-Based Pattern Mining Methodology
  6 Conclusions
  References

4 Mining Long Patterns
  Feida Zhu
  1 Introduction
  2 Preliminaries
  3 A Pattern Lattice Model
  4 Pattern Enumeration Approach
    4.1 Breadth-First Approach
    4.2 Depth-First Approach
  5 Row Enumeration Approach
  6 Pattern Merge Approach
    6.1 Piece-wise Pattern Merge
    6.2 Fusion-style Pattern Merge
  7 Pattern Traversal Approach
  8 Conclusion
  References

5 Interesting Patterns
  Jilles Vreeken and Nikolaj Tatti
  1 Introduction
  2 Absolute Measures
    2.1 Frequent Itemsets
    2.2 Tiles
    2.3 Low Entropy Sets
  3 Advanced Methods
  4 Static Background Models
    4.1 Independence Model
    4.2 Beyond Independence
    4.3 Maximum Entropy Models
    4.4 Randomization Approaches
  5 Dynamic Background Models
    5.1 The General Idea
    5.2 Maximum Entropy Models
    5.3 Tile-based Techniques
    5.4 Swap Randomization
  6 Pattern Sets
    6.1 Itemsets
    6.2 Tiles
    6.3 Swap Randomization
  7 Conclusions
  References

6 Negative Association Rules
  Luiza Antonie, Jundong Li and Osmar Zaiane
  1 Introduction
  2 Negative Patterns and Negative Association Rules
  3 Current Approaches
  4 Associative Classification and Negative Association Rules
  5 Conclusions
  References

7 Constraint-Based Pattern Mining
  Siegfried Nijssen and Albrecht Zimmermann
  1 Introduction
  2 Problem Definition
    2.1 Constraints
  3 Level-Wise Algorithm
    3.1 Generic Algorithm
  4 Depth-First Algorithm
    4.1 Basic Algorithm
    4.2 Constraint-based Itemset Mining
    4.3 Generic Frameworks
    4.4 Implementation Considerations
  5 Languages
  6 Conclusions
  References

8 Mining and Using Sets of Patterns through Compression
  Matthijs van Leeuwen and Jilles Vreeken
  1 Introduction
  2 Foundations
    2.1 Kolmogorov Complexity
    2.2 MDL
    2.3 MDL in Data Mining
  3 Compression-based Pattern Models
    3.1 Pattern Models for MDL
    3.2 Code Tables
    3.3 Instances of Compression-based Models
  4 Algorithmic Approaches
    4.1 Candidate Set Filtering
    4.2 Direct Mining of Patterns that Compress
  5 MDL for Data Mining
    5.1 Classification
    5.2 A Dissimilarity Measure for Datasets
    5.3 Identifying and Characterizing Components
    5.4 Other Data Mining Tasks
    5.5 The Advantage of Pattern-based Models
  6 Challenges Ahead
    6.1 Toward Mining Structured Data
    6.2 Generalization
    6.3 Task- and/or User-specific Usefulness
  7 Conclusions
  References

9 Frequent Pattern Mining in Data Streams
  Victor E. Lee, Ruoming Jin and Gagan Agrawal
  1 Introduction
  2 Preliminaries
    2.1 Frequent Pattern Mining: Definition
    2.2 Data Windows
    2.3 Frequent Item Mining
  3 Frequent Itemset Mining Algorithms