Clustering Techniques for
Large Data Sets
From the Past to the Future
Alexander Hinneburg, Daniel A. Keim
University of Halle
Introduction -
Preliminary Remarks
Problem: Analyze a (large) set of objects and
group them into a smaller number of clusters,
using the similarity and factual closeness
between the objects.
Goals:
– Finding representatives for homogeneous groups ->
Data Reduction
– Finding “natural” clusters and describing their
unknown properties -> “natural” Data Types
– Finding useful and suitable groupings -> “useful”
Data Classes
– Finding unusual data objects -> Outlier Detection
Introduction -
Preliminary Remarks
• Examples:
– Plant / Animal classification
– Book ordering
– Sizes for clothing
– Fraud detection
Introduction -
Preliminary Remarks
• Goal: objective instead of subjective
clustering
• Preparations:
– Data Representation
· Feature Vectors, real / categorical values
· Strings, Key Words
– Similarity Function, Distance Matrix
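The preparations above (feature vectors plus a distance function) can be illustrated with a minimal NumPy sketch that precomputes a pairwise distance matrix; the function name is illustrative and not part of the slides, and Euclidean distance stands in for any suitable similarity function:

```python
import numpy as np

def distance_matrix(X):
    """X: (N, d) array of feature vectors -> (N, N) Euclidean distance matrix."""
    # Broadcast to all pairwise differences, then take the norm per pair.
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=2)
```

For categorical values or strings, the Euclidean norm would be replaced by an appropriate dissimilarity (e.g. Hamming or edit distance); the matrix layout stays the same.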
Introduction
• Application Example: Marketing
– Given:
· Large database of customer data containing
their properties and past buying records
– Goal:
· Find groups of customers with similar
behavior
· Find customers with unusual behavior
Introduction
• Application Example:
Class Finding in CAD Databases
– Given:
· Large database of CAD data containing abstract
feature vectors (Fourier, Wavelet, ...)
– Goal:
· Find homogeneous groups of similar CAD parts
· Determine standard parts for each group
· Use standard parts instead of special parts
(→ reduction of the number of parts to be produced)
The KDD-Process (CRISP)
Data Mining vs. Statistics

Data Mining:
• Algorithms scale to large data sets
• Data is used secondarily for data mining
• DM tools are for end users with background knowledge
• Strategy: explorative, cyclic, few loops

Statistics:
• Many algorithms with quadratic run-time
• Data is collected for the statistical analysis (primary use)
• Statistical background is often required
• Strategy: confirmatory, verifying
Data Mining, an interdisciplinary Research Area

[Diagram: Data Mining at the intersection of Statistics, Databases, Visualization, Machine Learning, and Logic Programming]
Introduction
• Related Topics
– Unsupervised Learning (AI)
– Data Compression
– Data Analysis / Exploration
Role of Clustering in the
KDD Process
• Clustering is, besides classification and
association rule mining, a basic technique
for Knowledge Discovery.

[Diagram: Clustering – separation of classes; Classification; Association Rules – frequent patterns]
Introduction
Problem Description
• Given:
A data set of N data items, each described by a
d-dimensional feature vector.
• Task:
Determine a natural, useful partitioning of the
data set into a number of clusters (k) and
noise.
Introduction
From the Past ...
• Clustering is a well-known problem in
statistics [Sch 64, Wis 69, DH 73, Fuk 90]
• More recent research in
– machine learning [Roj 96],
– databases [CHY 96], and
– visualization [Kei 96] ...
Introduction
... to the Future
• Effective and efficient clustering algorithms for
large high-dimensional data sets with high
noise levels
• Requires scalability with respect to
– the number of data points (N)
– the number of dimensions (d)
– the noise level
• New understanding of the problems
Overview
(First Lesson)
1. Introduction
2. Clustering Methods              From the Past ...
2.1 Model- and Optimization-based Approaches
2.2 Linkage-based Methods / Linkage Hierarchies
2.3 Density-based Approaches
2.4 Categorical Clustering         ... to the Future
3. Techniques for Improving the Efficiency
4. Recent Research Topics
5. Summary and Conclusions
Model-based Approaches
• Optimize the parameters for a given model

Statistics / KDD:
• K-Means/LBG
• CLARANS
• EM
• LBG-U
• K-Harmonic Means

Artificial Intelligence:
• Kohonen Net / SOM
• Neural Gas / Hebb Learning
• Growing Networks
Model-based Methods:
Statistics/KDD
• K-Means [Fuk 90]
• Expectation Maximization [Lau 95]
• CLARANS [NH 94]
• Focused CLARANS [EKX 95]
• LBG-U [Fri 97]
• K-Harmonic Means [ZHD 99, ZHD 00]
K-Means / LBG
[Fuk 90, Gra 92]
• Determine k prototypes (p) of a given data set
• Assign data points to the nearest prototype:
$R_p$ → Voronoi set of prototype $p$
• Minimize the distance criterion:
$E(D,P) = \frac{1}{|D|} \sum_{p \in P} \sum_{x \in R_p} \mathrm{dist}(p,x)$
• Iterative Algorithm:
– Shift the prototypes towards the mean of their
point set
– Re-assign the data points to the nearest prototype
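The iterative procedure above (assign each point to its nearest prototype, then shift each prototype to the mean of its Voronoi set) can be sketched in a few lines of NumPy; the function and parameter names are illustrative, not from the slides:

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Minimal k-means/LBG sketch.

    X: (N, d) data matrix; k: number of prototypes.
    Returns the prototypes P and the cluster label of each point.
    """
    rng = np.random.default_rng(seed)
    # Initialize prototypes with k distinct random data points.
    P = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assignment step: nearest prototype (the Voronoi sets R_p).
        d = np.linalg.norm(X[:, None, :] - P[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: shift each prototype to the mean of its point set.
        new_P = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else P[j] for j in range(k)])
        if np.allclose(new_P, P):  # converged: prototypes stopped moving
            break
        P = new_P
    return P, labels
```

Each iteration can only decrease the distortion criterion E(D, P), so the algorithm converges, though possibly to a local optimum that depends on the initialization; in practice one runs it several times with different seeds.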
K-Means: Example
Expectation Maximization
[Lau 95]
• Generalization of k-Means
(probabilistic assignment of points to clusters)
• Basic Idea:
– Estimate the parameters of k Gaussians
– Optimize the probability that the mixture of
parameterized Gaussians fits the data
– Iterative algorithm similar to k-Means
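A minimal sketch of this idea, assuming a 1-D mixture of k Gaussians for brevity (the algorithm on the slide applies to general d-dimensional data; all names are illustrative):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50):
    """EM sketch for a 1-D mixture of k Gaussians.

    Returns mixture weights pi, means mu, and variances var.
    """
    # Deterministic initialization: spread the means over the data quantiles.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: probabilistic ("soft") assignment of points to clusters,
        # the generalization of k-means' hard nearest-prototype assignment.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances so that the
        # mixture's likelihood of the data increases.
        Nk = resp.sum(axis=0)
        pi = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var
```

With hard 0/1 responsibilities and fixed equal variances, the M-step reduces to the k-means prototype update, which is the sense in which EM generalizes k-means.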