Advanced Information and Knowledge Processing SeriesEditors ProfessorLakhmiJain [email protected] ProfessorXindongWu [email protected] Forothertitlespublishedinthisseries,goto www.springer.com/series/4738 Dan A. Simovici Chabane Djeraba • Mathematical Tools for Data Mining Set Theory, Partial Orders, Combinatorics 123 DanA.Simovici,MS,MS,PhD ChabaneDjeraba,BSc,MSc,PhD UniversityofMassachusetts,Boston UniversityofSciencesandTechnologies USA ofLille(USTL) France AI&KPISSN1610-3947 ISBN:978-1-84800-200-5 e-ISBN:978-1-84800-201-2 DOI:10.1007/978-1-84800-201-2 BritishLibraryCataloguinginPublicationData AcataloguerecordforthisbookisavailablefromtheBritishLibrary LibraryofCongressControlNumber:2008932365 (cid:2)cSpringer-VerlagLondonLimited2008 Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permittedundertheCopyright,DesignsandPatentsAct1988,thispublicationmayonlybereproduced, storedortransmitted,inanyformorbyanymeans,withthepriorpermissioninwritingofthepublishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the CopyrightLicensingAgency.Enquiriesconcerningreproductionoutsidethosetermsshouldbesentto thepublishers. Theuseofregisterednames,trademarks,etc.,inthispublicationdoesnotimply,evenintheabsenceofa specificstatement,thatsuchnamesareexemptfromtherelevantlawsandregulationsandthereforefree forgeneraluse. Thepublishermakesnorepresentation,expressorimplied,withregardtotheaccuracyoftheinformation containedinthisbookandcannotacceptanylegalresponsibilityorliabilityforanyerrorsoromissions thatmaybemade. Printedonacid-freepaper 9 8 7 6 5 4 3 2 1 SpringerScience+BusinessMedia springer.com Preface This volume was born from the experience of the authors as researchers and educators,whichsuggeststhatmanystudentsofdataminingarehandicapped in their research by the lack of a formal, systematic education in its mathe- matics. Thedataminingliteraturecontainsmanyexcellenttitlesthataddressthe needsofuserswithavarietyofinterestsrangingfromdecisionmakingtopat- tern investigation in biological data. However, these books do not deal with the mathematical tools that are currently needed by data mining researchers and doctoral students. We felt it timely to produce a book that integrates themathematicsofdataminingwithitsapplications.Weemphasizethatthis bookisaboutmathematical tools fordatamining andnotaboutdatamining itself; despite this, a substantial amount of applications of mathematical con- cepts in data mining are presented. The book is intended as a reference for the working data miner. In our opinion, three areas of mathematics are vital for data mining: set theory,includingpartiallyorderedsetsandcombinatorics;linearalgebra,with its many applications in principal component analysis and neural networks; and probability theory, which plays a foundational role in statistics, machine learning and data mining. Thisvolumeisdedicatedtothestudyofset-theoreticalfoundationsofdata mining. Two further volumes are contemplated that will cover linear algebra and probability theory. The first part of this book, dedicated to set theory, begins with a study of functionsandrelations.Applicationsofthesefundamentalconceptstosuchis- suesasequivalencesandpartitionsarediscussed.Also,wepreparetheground forthefollowingvolumesbydiscussingindicatorfunctions,fieldsandσ-fields, and other concepts. In this part, we have also included a pr´ecis of universal and linear algebra that covers the needs of subsequent chapters. This part concludes with a chapter on graphs and hypergraphs. vi Preface Thesecondpartiscenteredaroundpartiallyorderedsets.Wepresentalge- braicstructurescloselyrelatedtopartialorders,namelylattices,andBoolean algebras. We study basic issues about lattices, such as their dual roles as special partially ordered sets and algebraic structures, the theory of complete latticesandGaloisconnections,andtheirapplicationstothestudyofassocia- tionrules.SpecialattentionispaidtoBooleanalgebraswhichareofincreasing interest for data mining because they allow the discovery of minimal sets of features necessary for explaining observations and the discovery of hidden patterns. Anintroductiontotopologyandmeasuretheory,whichisessentialforthe study of various concepts of dimension and the recent preoccupations of data mining researchers with the applications of fractal theory to data mining, is also a component of this part. Avarietyofapplicationsindataminingarediscussed,suchasthenotionof entropy, presented in a new algebraic framework related to partitions rather than random distributions, levelwise algorithms that generalize the Apriori technique, and generalized measures and their use in the study of frequent item sets. This part concludes with a chapter on rough sets. Thethirdpartisfocusedonmetricspaces.Metricsplayanimportantrole in clustering, classification, and certain data preprocessing techniques. We study a variety of concepts related to metrics, from dissimilarities to metrics, treemetrics,andultrametrics.Thischapterisfollowedbyanapplicationchap- ter dedicated to clustering that includes basic types of clustering algorithms, limitations of clustering, and techniques for evaluating cluster quality. The fourth part focuses on combinatorics, an area of mathematics dedi- cated to the study of finite collections of objects that satisfy certain criteria. Themaintopicsdiscussedaretheinclusion-exclusionprinciple,combinatorics ofpartitions,countingproblemsrelatedtocollectionsofsets,andtheVapnik- Chervonenkis dimension of collections of sets. Eachchapterendswithsuggestionsforfurtherreading.Thebookcontains morethan400exercises;theyformanintegralpartofthematerial.Someofthe exercisesareinrealitysupplementalmaterial.Forthese,weincludesolutions. The mathematics required for making the best use of our book is a typical three-semester sequence in calculus. We would like to thank Catherine Brett and Frank Ganz from Springer- Verlag for their professionalism and helpfulness. Boston and Villeneuve d’Ascq Dan A. Simovici January 2008 Chabane Djeraba Contents Preface ........................................................ v Part I Set Theory 1 Sets, Relations, and Functions ............................. 3 1.1 Introduction ............................................ 3 1.2 Sets and Collections ..................................... 3 1.3 Relations and Functions.................................. 9 1.3.1 Cartesian Products of Sets ......................... 9 1.3.2 Relations......................................... 10 1.3.3 Functions ........................................ 15 1.3.4 Finite and Infinite Sets............................. 22 1.3.5 Generalized Set Products and Sequences ............. 24 1.3.6 Equivalence Relations.............................. 30 1.3.7 Partitions and Covers.............................. 32 1.4 The Axiom of Choice .................................... 34 1.5 Countable Sets.......................................... 35 1.6 Elementary Combinatorics................................ 38 1.7 Multisets ............................................... 44 1.8 Relational Databases..................................... 46 Exercises and Supplements .................................... 49 Bibliographical Comments .................................... 55 2 Algebras................................................... 57 2.1 Introduction ............................................ 57 2.2 Operations and Algebras ................................. 57 2.3 Morphisms, Congruences, and Subalgebras.................. 61 2.4 Linear Spaces ........................................... 64 2.5 Matrices ............................................... 68 viii Contents Exercises and Supplements .................................... 74 Bibliographical Comments .................................... 77 3 Graphs and Hypergraphs .................................. 79 3.1 Introduction ............................................ 79 3.2 Basic Notions of Graph Theory ........................... 79 3.2.1 Degrees of Vertices ................................ 80 3.2.2 Graph Representations............................. 84 3.2.3 Paths ............................................ 85 3.2.4 Directed Graphs .................................. 86 3.3 Trees .................................................. 92 3.4 Flows in Digraphs .......................................111 3.5 Hypergraphs............................................118 Exercises and Supplements ....................................121 Bibliographical Comments ....................................124 Part II Partial Orders 4 Partially Ordered Sets.....................................129 4.1 Introduction ............................................129 4.2 Partial Orders ..........................................129 4.3 Special Elements of Partially Ordered Sets..................133 4.4 The Poset of Real Numbers...............................137 4.5 Closure and Interior Systems..............................139 4.6 The Poset of Partitions of a Set ...........................144 4.7 Chains and Antichains ...................................148 4.8 Poset Product ..........................................155 4.9 Functions and Posets ....................................158 4.10 Posets and the Axiom of Choice...........................160 4.11 Locally Finite Posets and Mo¨bius Functions ................162 Exercises and Supplements ....................................168 Bibliographical Comments ....................................172 5 Lattices and Boolean Algebras.............................173 5.1 Introduction ............................................173 5.2 Lattices as Partially Ordered Sets and Algebras .............173 5.3 Special Classes of Lattices ................................180 5.4 Complete Lattices .......................................188 5.5 Boolean Algebras and Boolean Functions ...................192 5.6 Logical Data Analysis....................................211 Exercises and Supplements ....................................219 Bibliographical Comments ....................................224 Contents ix 6 Topologies and Measures ..................................225 6.1 Introduction ............................................225 6.2 Topologies..............................................225 6.3 Closure and Interior Operators in Topological Spaces ........226 6.4 Bases ..................................................235 6.5 Compactness............................................239 6.6 Continuous Functions ....................................241 6.7 Connected Topological Spaces.............................244 6.8 Separation Hierarchy of Topological Spaces .................247 6.9 Products of Topological Spaces............................249 6.10 Fields of Sets ...........................................251 6.11 Measures ...............................................256 Exercises and Supplements ....................................265 Bibliographical Comments ....................................272 7 Frequent Item Sets and Association Rules .................273 7.1 Introduction ............................................273 7.2 Frequent Item Sets ......................................273 7.3 Borders of Collections of Sets .............................279 7.4 Association Rules .......................................281 7.5 Levelwise Algorithms and Posets ..........................283 7.6 Lattices and Frequent Item Sets...........................288 Exercises and Supplements ....................................290 Bibliographical Comments ....................................292 8 Applications to Databases and Data Mining ...............295 8.1 Introduction ............................................295 8.2 Tables and Indiscernibility Relations .......................295 8.3 Partitions and Functional Dependencies ....................298 8.4 Partition Entropy .......................................305 8.5 Generalized Measures and Data Mining ....................321 8.6 Differential Constraints ..................................325 Exercises and Supplements ....................................330 Bibliographical Comments ....................................332 9 Rough Sets ................................................333 9.1 Introduction ............................................333 9.2 Approximation Spaces ...................................333 9.3 Decision Systems and Decision Trees.......................337 9.4 Closure Operators and Rough Sets.........................345 Exercises and Supplements ....................................347 Bibliographical Comments ....................................348 x Contents Part III Metric Spaces 10 Dissimilarities, Metrics, and Ultrametrics..................351 10.1 Introduction ............................................351 10.2 Classes of Dissimilarities .................................351 10.3 Tree Metrics ............................................357 10.4 Ultrametric Spaces ......................................366 10.5 Metrics on Rn...........................................377 10.6 Metrics on Collections of Sets .............................388 10.7 Metrics on Partitions ....................................394 10.8 Metrics on Sequences ....................................398 10.9 Searches in Metric Spaces ................................402 Exercises and Supplements ....................................411 Bibliographical Comments ....................................421 11 Topologies and Measures on Metric Spaces ................423 11.1 Introduction ............................................423 11.2 Metric Space Topologies..................................423 11.3 Continuous Functions in Metric Spaces.....................426 11.4 Separation Properties of Metric Spaces.....................427 11.5 Sequences in Metric Spaces ...............................435 11.6 Completeness of Metric Spaces ............................439 11.7 Contractions and Fixed Points ............................445 11.8 Measures in Metric Spaces................................449 11.9 Embeddings of Metric Spaces .............................452 Exercises and Supplements ....................................454 Bibliographical Comments ....................................458 12 Dimensions of Metric Spaces ..............................459 12.1 Introduction ............................................459 12.2 The Dimensionality Curse ................................459 12.3 Inductive Dimensions of Topological Metric Spaces ..........462 12.4 The Covering Dimension .................................472 12.5 The Cantor Set .........................................475 12.6 The Box-Counting Dimension.............................479 12.7 The Hausdorff-Besicovitch Dimension ......................482 12.8 Similarity Dimension.....................................486 Exercises and Supplements ....................................490 Bibliographical Comments ....................................493 Contents xi 13 Clustering .................................................495 13.1 Introduction ............................................495 13.2 Hierarchical Clustering...................................496 13.2.1 Matrix-Based Hierarchical Clustering ................498 13.2.2 Graph-based Hierarchical Clustering .................506 13.3 The k-Means Algorithm ..................................512 13.4 The PAM Algorithm.....................................514 13.5 Limitations of Clustering .................................516 13.6 Clustering Quality.......................................520 13.6.1 Object Silhouettes.................................520 13.6.2 Supervised Evaluation .............................521 Exercises and Supplements ....................................523 Bibliographical Comments ....................................525 Part IV Combinatorics 14 Combinatorics .............................................529 14.1 Introduction ............................................529 14.2 The Inclusion-Exclusion Principle .........................529 14.3 Ramsey’s Theorem ......................................533 14.4 Combinatorics of Partitions...............................536 14.5 Combinatorics of Collections of Sets .......................539 Exercises and Supplements ....................................544 Bibliographical Comments ....................................549 15 The Vapnik-Chervonenkis Dimension ......................551 15.1 Introduction ............................................551 15.2 The Vapnik-Chervonenkis Dimension ......................551 15.3 Perceptrons.............................................563 Exercises and Supplements ....................................565 Bibliographical Comments ....................................567 Part V Appendices A Asymptotics ...............................................571 B Convex Sets and Functions ................................573 C Useful Integrals and Formulas .............................583 C.1 Euler’s Integrals.........................................583 C.2 Wallis’s Formula ........................................587 C.3 Stirling’s Formula .......................................588 C.4 The Volume of an n-Dimensional Sphere ...................590