Data Mining for Association Rules and Sequential Patterns
Sequential and Parallel Algorithms

Jean-Marc Adamo

With 54 Illustrations

Springer Science+Business Media, LLC

Jean-Marc Adamo
Université de Lyon
43 bd. 11 novembre 1918
Bat. 308, B.P. 2077
69616 Villeurbanne cedex, France
[email protected]

Library of Congress Cataloging-in-Publication Data
Adamo, Jean-Marc, 1943-
Data mining for association rules and sequential patterns: sequential and parallel algorithms / Jean-Marc Adamo
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4612-6511-5
ISBN 978-1-4613-0085-4 (eBook)
DOI 10.1007/978-1-4613-0085-4
1. Data mining. 2. Computer algorithms. I. Title.
QA76.9.D343A33 2000
006.3--dc21 00-056267

Printed on acid-free paper.

© 2001 Springer Science+Business Media New York
Originally published by Springer-Verlag New York, Inc. in 2001
Softcover reprint of the hardcover 1st edition 2001

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.

Production managed by Steven Pisano; manufacturing supervised by Jeffrey Taub.
Camera-ready pages prepared from the authors' Microsoft Word files.
ISBN 978-1-4612-6511-5

Preface

Data mining includes a wide range of activities such as classification, clustering, similarity analysis, summarization, association rule and sequential pattern discovery, and so forth. This book focuses on the last two. It provides a unified presentation of algorithms for association rule and sequential pattern discovery. For both mining problems, the presentation relies on the lattice structure of the search space. All algorithms are built as processes running on this structure, and the proofs of their properties take advantage of its mathematical properties.

Part of the motivation for writing this book was postgraduate teaching. One of the main intentions was to make the book a suitable support for the clear exposition of problems and algorithms as well as a sound base for further discussion and investigation. Since the book assumes only elementary mathematical knowledge in the domains of lattices, combinatorial optimization, probability calculus, and statistics, it is fit for use by undergraduate students as well. The algorithms are described in a C-like pseudo programming language, and the computations are shown in great detail. This also makes the book fit for use by implementers: computer scientists in many domains as well as industry engineers.

Mining for association rules and sequential patterns is known to be a problem of high computational complexity. Most mining algorithms typically take hours to complete when run on large real-life datasets. Designing efficient parallel algorithms should therefore be considered critical. Most algorithms in the book are devised for both sequential and parallel execution. Parallel algorithm design takes advantage of the lattice structure of the search space: partitioning is performed via recursive bisection of the lattice. Database partitioning is also used as an additional source of parallelism.
The book contains ten chapters including the introduction. Chapter 2 is dedicated to search space partitioning and to mining with partitioned search spaces. Chapter 3 contains a review of all rule-mining algorithms that have been presented thus far in the literature. Chapter 4 extends the search space partition-based algorithm so that it can also deal with taxonomies. Chapter 5 investigates the problem of rule mining under Boolean constraints. Chapter 6 presents a database partitioning method based on sampling. Methods for merging search space partitioning and database partitioning are proposed, leading to new sequential and parallel algorithms. Chapter 7 investigates the problem of mining rules with categorical and metric attributes, a problem that is treated by exhaustive enumeration. Another way of drawing useful information from quantitative association rules leads to optimization problems; Chapter 8 proposes a unified presentation of these problems and their solutions. Chapter 9 describes new measures aimed at improving the predictive ability of the rules and new algorithms aimed at limiting combinatorial explosion. Finally, Chapter 10 deals with sequential pattern mining. The problem is investigated by using a method similar to the one used for rule mining.

The implementation of all algorithms presented in the book is currently under development. The work is carried out on a cluster of SMP machines and uses the ARCH library (which runs on top of MPI) as a development tool. Progress reports will be found at the address http://www.cpe.fr/-arch.

Acknowledgments

The creation of this book benefited from the assistance of a number of people. I am grateful to Yves Kodratoff for reading the manuscript and making constructive comments. I am also grateful to my wife, Monique, for reading the manuscript so professionally and making many comments that helped improve the syntax significantly.
Many thanks also go to my editors, Steven Pisano, Wayne Wheeler, and Wayne Yuhasz at Springer-Verlag, for their assistance. The Ecole Supérieure de Chimie Physique et Electronique de Lyon hosted me and provided logistic support and facilities.

The cover photograph is by Olivier Lernout, CINES, Montpellier, France. Permission was granted by Olivier Lernout and Alain Quere, Director of CINES. The photograph shows a large-scale sand counter designed by Jean Bernard Metais for counting the time between two eclipses: August 11, 1999 and June 22, 2001. The work of art is being exhibited at the Muséum National d'Histoire Naturelle de Paris. The sand flows down multiple holes, thereby creating a series of patterns that strikingly evokes parallel data mining, one of the main topics of this book.

Jean-Marc Adamo

Contents

Preface
1. Introduction
2. Search Space Partition-Based Rule Mining
  2.1 Problem Statement
    2.1.1 Canonical Attribute Sequences (cas)
    2.1.2 Database
    2.1.3 Support
    2.1.4 Association Rule
    2.1.5 Problem Statement
  2.2 Search Space
  2.3 Splitting Procedure
  2.4 Enumerating σ-Frequent Attribute Sets (cass)
  2.5 Sequential Enumeration Procedure
  2.6 Parallel Enumeration Procedure
    2.6.1 Initial Load Balancing
    2.6.2 Computing the Starting Sets
    2.6.3 Enumeration Procedure
    2.6.4 Dynamic Load Balancing
  2.7 Generating the Association Rules
    2.7.1 Sequential Generation
    2.7.2 Parallel Generation
3. Apriori and Other Algorithms
  3.1 Early Algorithms
    3.1.1 AIS
    3.1.2 SETM
  3.2 The Apriori Algorithms
    3.2.1 Apriori
    3.2.2 AprioriTid
  3.3 Direct Hashing and Pruning
    3.3.1 Filtering Candidates
    3.3.2 Database Trimming
    3.3.3 The DHP Algorithm
  3.4 Dynamic Set Counting
4. Mining for Rules over Attribute Taxonomies
  4.1 Association Rules over Taxonomies
  4.2 Problem Statement and Algorithms
  4.3 Pruning Uninteresting Rules
    4.3.1 Measure of Interest
    4.3.2 Rule Pruning Algorithm
    4.3.3 Attribute Presence-Based Pruning
5. Constraint-Based Rule Mining
  5.1 Boolean Constraints
    5.1.1 Syntax
    5.1.2 Semantics
    5.1.3 Propagation of Boolean Constraints
  5.2 Prime Implicants
  5.3 Problem Statement and Algorithms
6. Data Partition-Based Rule Mining
  6.1 Data Partitioning
    6.1.1 Building a Probabilistic Model
    6.1.2 Bounding Large Deviations for One cas (Chernoff bounds)
    6.1.3 Bounding Large Deviations for Sets of cass
  6.2 cas Enumeration with Partitioned Data
    6.2.1 Data Partitioning
    6.2.2 Local σ-Frequent cas Generation
    6.2.3 Global σ-Frequent cas Generation
7. Mining for Rules with Categorical and Metric Attributes
  7.1 Interval Systems and Quantitative Rules
  7.2 k-Partial Completeness
  7.3 Pruning Uninteresting Rules
    7.3.1 Measure of Interest
    7.3.2 Attribute Presence-Based Pruning
  7.4 Enumeration Algorithms
8. Optimizing Rules with Quantitative Attributes
  8.1 Solving 1-1-Type Rule Optimization Problems
    8.1.1 Problem Statement
    8.1.2 MCIS Problem
    8.1.3 MSIC Problem
    8.1.4 MG Problem
  8.2 Solving d-1-Type Rule Optimization Problems
  8.3 Solving 1-q-Type Rule Optimization Problems
    8.3.1 Problem Statement
    8.3.2 MSIC Problem
    8.3.3 MG Problem
  8.4 Solving d-q-Type Rule Optimization Problems
    8.4.1 Problem Statement
    8.4.2 Basic Enumeration
    8.4.3 Enumeration with Pruning
    8.4.4 Pruning the Instantiation Set
9. Beyond Support-Confidence Framework
  9.1 A Criticism of the Support-Confidence Framework
  9.2 Conviction
  9.3 Pruning Conviction-Based Rules
    9.3.1 Analyzing Conviction
    9.3.2 Transitivity-Based Pruning
    9.3.3 Improvement-Based Pruning
  9.4 One-Step Association Rule Mining
    9.4.1 Building a Procedure for One-Step Mining
    9.4.2 Building a Procedure for Improvement-Based Pruning
  9.5 Correlated Attribute-Set Mining
    9.5.1 Collective Strength
    9.5.2 Correlated Attribute-Set Enumeration
  9.6 Refining Conviction: Association Rule Intensity
    9.6.1 Measure Construction
    9.6.2 Properties
    9.6.3 Relating a-int(s ⇒ u) to conv(s ⇒ u)
    9.6.4 Mining with the Intensity Measure
    9.6.5 a-Intensity Versus Intensity as Defined in [G96]
10. Search Space Partition-Based Sequential Pattern Mining
  10.1 Problem Statement
    10.1.1 Sequences of cass
    10.1.2 Database
    10.1.3 Support
    10.1.4 Problem Statement
  10.2 Search Space
  10.3 Splitting the Search Space
  10.4 Splitting Procedure
  10.5 Sequence Enumeration
    10.5.1 Extending the Support Set Notion
    10.5.2 Join Operations
    10.5.3 Sequential Enumeration Procedure
    10.5.4 Parallel Enumeration Procedure
Appendix 1. Chernoff Bounds
Appendix 2. Partitioning in Figure 10.5: Beyond 3rd Power
Appendix 3. Partitioning in Figure 10.6: Beyond 3rd Power
References
Index