Table Of ContentAdvanced Information and Knowledge Processing
SeriesEditors
ProfessorLakhmiJain
Lakhmi.jain@unisa.edu.au
ProfessorXindongWu
xwu@cs.uvm.edu
Forothertitlespublishedinthisseries,goto
www.springer.com/series/4738
Dan A. Simovici Chabane Djeraba
•
Mathematical Tools
for Data Mining
Set Theory, Partial Orders, Combinatorics
123
DanA.Simovici,MS,MS,PhD ChabaneDjeraba,BSc,MSc,PhD
UniversityofMassachusetts,Boston UniversityofSciencesandTechnologies
USA ofLille(USTL)
France
AI&KPISSN1610-3947
ISBN:978-1-84800-200-5 e-ISBN:978-1-84800-201-2
DOI:10.1007/978-1-84800-201-2
BritishLibraryCataloguinginPublicationData
AcataloguerecordforthisbookisavailablefromtheBritishLibrary
LibraryofCongressControlNumber:2008932365
(cid:2)cSpringer-VerlagLondonLimited2008
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as
permittedundertheCopyright,DesignsandPatentsAct1988,thispublicationmayonlybereproduced,
storedortransmitted,inanyformorbyanymeans,withthepriorpermissioninwritingofthepublishers,
or in the case of reprographic reproduction in accordance with the terms of licenses issued by the
CopyrightLicensingAgency.Enquiriesconcerningreproductionoutsidethosetermsshouldbesentto
thepublishers.
Theuseofregisterednames,trademarks,etc.,inthispublicationdoesnotimply,evenintheabsenceofa
specificstatement,thatsuchnamesareexemptfromtherelevantlawsandregulationsandthereforefree
forgeneraluse.
Thepublishermakesnorepresentation,expressorimplied,withregardtotheaccuracyoftheinformation
containedinthisbookandcannotacceptanylegalresponsibilityorliabilityforanyerrorsoromissions
thatmaybemade.
Printedonacid-freepaper
9 8 7 6 5 4 3 2 1
SpringerScience+BusinessMedia
springer.com
Preface
This volume was born from the experience of the authors as researchers and
educators,whichsuggeststhatmanystudentsofdataminingarehandicapped
in their research by the lack of a formal, systematic education in its mathe-
matics.
Thedataminingliteraturecontainsmanyexcellenttitlesthataddressthe
needsofuserswithavarietyofinterestsrangingfromdecisionmakingtopat-
tern investigation in biological data. However, these books do not deal with
the mathematical tools that are currently needed by data mining researchers
and doctoral students. We felt it timely to produce a book that integrates
themathematicsofdataminingwithitsapplications.Weemphasizethatthis
bookisaboutmathematical tools fordatamining andnotaboutdatamining
itself; despite this, a substantial amount of applications of mathematical con-
cepts in data mining are presented. The book is intended as a reference for
the working data miner.
In our opinion, three areas of mathematics are vital for data mining: set
theory,includingpartiallyorderedsetsandcombinatorics;linearalgebra,with
its many applications in principal component analysis and neural networks;
and probability theory, which plays a foundational role in statistics, machine
learning and data mining.
Thisvolumeisdedicatedtothestudyofset-theoreticalfoundationsofdata
mining. Two further volumes are contemplated that will cover linear algebra
and probability theory.
The first part of this book, dedicated to set theory, begins with a study of
functionsandrelations.Applicationsofthesefundamentalconceptstosuchis-
suesasequivalencesandpartitionsarediscussed.Also,wepreparetheground
forthefollowingvolumesbydiscussingindicatorfunctions,fieldsandσ-fields,
and other concepts.
In this part, we have also included a pr´ecis of universal and linear algebra
that covers the needs of subsequent chapters. This part concludes with a
chapter on graphs and hypergraphs.
vi Preface
Thesecondpartiscenteredaroundpartiallyorderedsets.Wepresentalge-
braicstructurescloselyrelatedtopartialorders,namelylattices,andBoolean
algebras. We study basic issues about lattices, such as their dual roles as
special partially ordered sets and algebraic structures, the theory of complete
latticesandGaloisconnections,andtheirapplicationstothestudyofassocia-
tionrules.SpecialattentionispaidtoBooleanalgebraswhichareofincreasing
interest for data mining because they allow the discovery of minimal sets of
features necessary for explaining observations and the discovery of hidden
patterns.
Anintroductiontotopologyandmeasuretheory,whichisessentialforthe
study of various concepts of dimension and the recent preoccupations of data
mining researchers with the applications of fractal theory to data mining, is
also a component of this part.
Avarietyofapplicationsindataminingarediscussed,suchasthenotionof
entropy, presented in a new algebraic framework related to partitions rather
than random distributions, levelwise algorithms that generalize the Apriori
technique, and generalized measures and their use in the study of frequent
item sets. This part concludes with a chapter on rough sets.
Thethirdpartisfocusedonmetricspaces.Metricsplayanimportantrole
in clustering, classification, and certain data preprocessing techniques. We
study a variety of concepts related to metrics, from dissimilarities to metrics,
treemetrics,andultrametrics.Thischapterisfollowedbyanapplicationchap-
ter dedicated to clustering that includes basic types of clustering algorithms,
limitations of clustering, and techniques for evaluating cluster quality.
The fourth part focuses on combinatorics, an area of mathematics dedi-
cated to the study of finite collections of objects that satisfy certain criteria.
Themaintopicsdiscussedaretheinclusion-exclusionprinciple,combinatorics
ofpartitions,countingproblemsrelatedtocollectionsofsets,andtheVapnik-
Chervonenkis dimension of collections of sets.
Eachchapterendswithsuggestionsforfurtherreading.Thebookcontains
morethan400exercises;theyformanintegralpartofthematerial.Someofthe
exercisesareinrealitysupplementalmaterial.Forthese,weincludesolutions.
The mathematics required for making the best use of our book is a typical
three-semester sequence in calculus.
We would like to thank Catherine Brett and Frank Ganz from Springer-
Verlag for their professionalism and helpfulness.
Boston and Villeneuve d’Ascq Dan A. Simovici
January 2008 Chabane Djeraba
Contents
Preface ........................................................ v
Part I Set Theory
1 Sets, Relations, and Functions ............................. 3
1.1 Introduction ............................................ 3
1.2 Sets and Collections ..................................... 3
1.3 Relations and Functions.................................. 9
1.3.1 Cartesian Products of Sets ......................... 9
1.3.2 Relations......................................... 10
1.3.3 Functions ........................................ 15
1.3.4 Finite and Infinite Sets............................. 22
1.3.5 Generalized Set Products and Sequences ............. 24
1.3.6 Equivalence Relations.............................. 30
1.3.7 Partitions and Covers.............................. 32
1.4 The Axiom of Choice .................................... 34
1.5 Countable Sets.......................................... 35
1.6 Elementary Combinatorics................................ 38
1.7 Multisets ............................................... 44
1.8 Relational Databases..................................... 46
Exercises and Supplements .................................... 49
Bibliographical Comments .................................... 55
2 Algebras................................................... 57
2.1 Introduction ............................................ 57
2.2 Operations and Algebras ................................. 57
2.3 Morphisms, Congruences, and Subalgebras.................. 61
2.4 Linear Spaces ........................................... 64
2.5 Matrices ............................................... 68
viii Contents
Exercises and Supplements .................................... 74
Bibliographical Comments .................................... 77
3 Graphs and Hypergraphs .................................. 79
3.1 Introduction ............................................ 79
3.2 Basic Notions of Graph Theory ........................... 79
3.2.1 Degrees of Vertices ................................ 80
3.2.2 Graph Representations............................. 84
3.2.3 Paths ............................................ 85
3.2.4 Directed Graphs .................................. 86
3.3 Trees .................................................. 92
3.4 Flows in Digraphs .......................................111
3.5 Hypergraphs............................................118
Exercises and Supplements ....................................121
Bibliographical Comments ....................................124
Part II Partial Orders
4 Partially Ordered Sets.....................................129
4.1 Introduction ............................................129
4.2 Partial Orders ..........................................129
4.3 Special Elements of Partially Ordered Sets..................133
4.4 The Poset of Real Numbers...............................137
4.5 Closure and Interior Systems..............................139
4.6 The Poset of Partitions of a Set ...........................144
4.7 Chains and Antichains ...................................148
4.8 Poset Product ..........................................155
4.9 Functions and Posets ....................................158
4.10 Posets and the Axiom of Choice...........................160
4.11 Locally Finite Posets and Mo¨bius Functions ................162
Exercises and Supplements ....................................168
Bibliographical Comments ....................................172
5 Lattices and Boolean Algebras.............................173
5.1 Introduction ............................................173
5.2 Lattices as Partially Ordered Sets and Algebras .............173
5.3 Special Classes of Lattices ................................180
5.4 Complete Lattices .......................................188
5.5 Boolean Algebras and Boolean Functions ...................192
5.6 Logical Data Analysis....................................211
Exercises and Supplements ....................................219
Bibliographical Comments ....................................224
Contents ix
6 Topologies and Measures ..................................225
6.1 Introduction ............................................225
6.2 Topologies..............................................225
6.3 Closure and Interior Operators in Topological Spaces ........226
6.4 Bases ..................................................235
6.5 Compactness............................................239
6.6 Continuous Functions ....................................241
6.7 Connected Topological Spaces.............................244
6.8 Separation Hierarchy of Topological Spaces .................247
6.9 Products of Topological Spaces............................249
6.10 Fields of Sets ...........................................251
6.11 Measures ...............................................256
Exercises and Supplements ....................................265
Bibliographical Comments ....................................272
7 Frequent Item Sets and Association Rules .................273
7.1 Introduction ............................................273
7.2 Frequent Item Sets ......................................273
7.3 Borders of Collections of Sets .............................279
7.4 Association Rules .......................................281
7.5 Levelwise Algorithms and Posets ..........................283
7.6 Lattices and Frequent Item Sets...........................288
Exercises and Supplements ....................................290
Bibliographical Comments ....................................292
8 Applications to Databases and Data Mining ...............295
8.1 Introduction ............................................295
8.2 Tables and Indiscernibility Relations .......................295
8.3 Partitions and Functional Dependencies ....................298
8.4 Partition Entropy .......................................305
8.5 Generalized Measures and Data Mining ....................321
8.6 Differential Constraints ..................................325
Exercises and Supplements ....................................330
Bibliographical Comments ....................................332
9 Rough Sets ................................................333
9.1 Introduction ............................................333
9.2 Approximation Spaces ...................................333
9.3 Decision Systems and Decision Trees.......................337
9.4 Closure Operators and Rough Sets.........................345
Exercises and Supplements ....................................347
Bibliographical Comments ....................................348
x Contents
Part III Metric Spaces
10 Dissimilarities, Metrics, and Ultrametrics..................351
10.1 Introduction ............................................351
10.2 Classes of Dissimilarities .................................351
10.3 Tree Metrics ............................................357
10.4 Ultrametric Spaces ......................................366
10.5 Metrics on Rn...........................................377
10.6 Metrics on Collections of Sets .............................388
10.7 Metrics on Partitions ....................................394
10.8 Metrics on Sequences ....................................398
10.9 Searches in Metric Spaces ................................402
Exercises and Supplements ....................................411
Bibliographical Comments ....................................421
11 Topologies and Measures on Metric Spaces ................423
11.1 Introduction ............................................423
11.2 Metric Space Topologies..................................423
11.3 Continuous Functions in Metric Spaces.....................426
11.4 Separation Properties of Metric Spaces.....................427
11.5 Sequences in Metric Spaces ...............................435
11.6 Completeness of Metric Spaces ............................439
11.7 Contractions and Fixed Points ............................445
11.8 Measures in Metric Spaces................................449
11.9 Embeddings of Metric Spaces .............................452
Exercises and Supplements ....................................454
Bibliographical Comments ....................................458
12 Dimensions of Metric Spaces ..............................459
12.1 Introduction ............................................459
12.2 The Dimensionality Curse ................................459
12.3 Inductive Dimensions of Topological Metric Spaces ..........462
12.4 The Covering Dimension .................................472
12.5 The Cantor Set .........................................475
12.6 The Box-Counting Dimension.............................479
12.7 The Hausdorff-Besicovitch Dimension ......................482
12.8 Similarity Dimension.....................................486
Exercises and Supplements ....................................490
Bibliographical Comments ....................................493
Contents xi
13 Clustering .................................................495
13.1 Introduction ............................................495
13.2 Hierarchical Clustering...................................496
13.2.1 Matrix-Based Hierarchical Clustering ................498
13.2.2 Graph-based Hierarchical Clustering .................506
13.3 The k-Means Algorithm ..................................512
13.4 The PAM Algorithm.....................................514
13.5 Limitations of Clustering .................................516
13.6 Clustering Quality.......................................520
13.6.1 Object Silhouettes.................................520
13.6.2 Supervised Evaluation .............................521
Exercises and Supplements ....................................523
Bibliographical Comments ....................................525
Part IV Combinatorics
14 Combinatorics .............................................529
14.1 Introduction ............................................529
14.2 The Inclusion-Exclusion Principle .........................529
14.3 Ramsey’s Theorem ......................................533
14.4 Combinatorics of Partitions...............................536
14.5 Combinatorics of Collections of Sets .......................539
Exercises and Supplements ....................................544
Bibliographical Comments ....................................549
15 The Vapnik-Chervonenkis Dimension ......................551
15.1 Introduction ............................................551
15.2 The Vapnik-Chervonenkis Dimension ......................551
15.3 Perceptrons.............................................563
Exercises and Supplements ....................................565
Bibliographical Comments ....................................567
Part V Appendices
A Asymptotics ...............................................571
B Convex Sets and Functions ................................573
C Useful Integrals and Formulas .............................583
C.1 Euler’s Integrals.........................................583
C.2 Wallis’s Formula ........................................587
C.3 Stirling’s Formula .......................................588
C.4 The Volume of an n-Dimensional Sphere ...................590