Clustering Techniques for
Large Data Sets
From the Past to the Future
Alexander Hinneburg, Daniel A. Keim
University of Halle
Introduction -
Preliminary Remarks
Problem: Analyze a (large) set of objects and
group them into a smaller number of clusters,
using the similarity and factual closeness
between the objects.
Goals:
– Finding representatives for homogeneous groups ->
Data Reduction
– Finding “natural” clusters and describing their
unknown properties -> “natural” Data Types
– Finding useful and suitable groupings -> “useful”
Data Classes
– Finding unusual data objects -> Outlier Detection
Introduction -
Preliminary Remarks
• Examples:
– Plant / Animal classification
– Book ordering
– Sizes for clothing
– Fraud detection
Introduction -
Preliminary Remarks
• Goal: objective instead of subjective
clustering
• Preparations:
– Data Representation
· Feature Vectors, real / categorical values
· Strings, Key Words
– Similarity Function, Distance Matrix
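The preparations above (feature vectors plus a distance function) can be illustrated with a minimal NumPy sketch that precomputes a pairwise distance matrix; the function name is illustrative and not part of the slides, and Euclidean distance stands in for any suitable similarity function:

```python
import numpy as np

def distance_matrix(X):
    """X: (N, d) array of feature vectors -> (N, N) Euclidean distance matrix."""
    # Broadcast to all pairwise differences, then take the norm per pair.
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=2)
```

For categorical values or strings, the Euclidean norm would be replaced by an appropriate dissimilarity (e.g. Hamming or edit distance); the matrix layout stays the same.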
Introduction
• Application Example: Marketing
– Given:
· Large database of customer data containing
their properties and past buying records
– Goal:
· Find groups of customers with similar
behavior
· Find customers with unusual behavior
Introduction
• Application Example:
Class Finding in CAD Databases
– Given:
· Large database of CAD data containing abstract
feature vectors (Fourier, Wavelet, ...)
– Goal:
· Find homogeneous groups of similar CAD parts
· Determine standard parts for each group
· Use standard parts instead of special parts
(→ reduction of the number of parts to be produced)
The KDD-Process (CRISP)
Data Mining vs. Statistics

Data Mining:
• Algorithms scale to large data sets
• Data is used secondarily for data mining
• DM tools are for end users with background knowledge
• Strategy: explorative, cyclic, few loops

Statistics:
• Many algorithms with quadratic run-time
• Data is collected for the statistical analysis (primary use)
• Statistical background is often required
• Strategy: confirmatory, verifying
Data Mining, an interdisciplinary Research Area

[Diagram: Data Mining at the intersection of Statistics, Databases, Visualization, Machine Learning, and Logic Programming]
Introduction
• Related Topics
– Unsupervised Learning (AI)
– Data Compression
– Data Analysis / Exploration
Role of Clustering in the
KDD Process
• Clustering is, besides classification and
association rule mining, a basic technique
for Knowledge Discovery.

[Diagram: Clustering – separation of classes; Classification; Association Rules – frequent patterns]
Introduction
Problem Description
• Given:
A data set of N data items, each described by a
d-dimensional feature vector.
• Task:
Determine a natural, useful partitioning of the
data set into a number of clusters (k) and
noise.
Introduction
From the Past ...
• Clustering is a well-known problem in
statistics [Sch 64, Wis 69, DH 73, Fuk 90]
• More recent research in
– machine learning [Roj 96],
– databases [CHY 96], and
– visualization [Kei 96] ...
Introduction
... to the Future
• Effective and efficient clustering algorithms for
large high-dimensional data sets with high
noise levels
• Requires scalability with respect to
– the number of data points (N)
– the number of dimensions (d)
– the noise level
• New understanding of the problems
Overview
(First Lesson)
1. Introduction
2. Clustering Methods              From the Past ...
2.1 Model- and Optimization-based Approaches
2.2 Linkage-based Methods / Linkage Hierarchies
2.3 Density-based Approaches
2.4 Categorical Clustering         ... to the Future
3. Techniques for Improving the Efficiency
4. Recent Research Topics
5. Summary and Conclusions
Model-based Approaches
• Optimize the parameters for a given model

Statistics / KDD:
• K-Means/LBG
• CLARANS
• EM
• LBG-U
• K-Harmonic Means

Artificial Intelligence:
• Kohonen Net / SOM
• Neural Gas / Hebb Learning
• Growing Networks
Model-based Methods:
Statistics/KDD
• K-Means [Fuk 90]
• Expectation Maximization [Lau 95]
• CLARANS [NH 94]
• Focused CLARANS [EKX 95]
• LBG-U [Fri 97]
• K-Harmonic Means [ZHD 99, ZHD 00]
K-Means / LBG
[Fuk 90, Gra 92]
• Determine k prototypes (p) of a given data set
• Assign data points to the nearest prototype:
$R_p$ → Voronoi set of prototype $p$
• Minimize the distance criterion:
$E(D,P) = \frac{1}{|D|} \sum_{p \in P} \sum_{x \in R_p} \mathrm{dist}(p,x)$
• Iterative Algorithm:
– Shift the prototypes towards the mean of their
point set
– Re-assign the data points to the nearest prototype
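The iterative procedure above (assign each point to its nearest prototype, then shift each prototype to the mean of its Voronoi set) can be sketched in a few lines of NumPy; the function and parameter names are illustrative, not from the slides:

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Minimal k-means/LBG sketch.

    X: (N, d) data matrix; k: number of prototypes.
    Returns the prototypes P and the cluster label of each point.
    """
    rng = np.random.default_rng(seed)
    # Initialize prototypes with k distinct random data points.
    P = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assignment step: nearest prototype (the Voronoi sets R_p).
        d = np.linalg.norm(X[:, None, :] - P[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: shift each prototype to the mean of its point set.
        new_P = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else P[j] for j in range(k)])
        if np.allclose(new_P, P):  # converged: prototypes stopped moving
            break
        P = new_P
    return P, labels
```

Each iteration can only decrease the distortion criterion E(D, P), so the algorithm converges, though possibly to a local optimum that depends on the initialization; in practice one runs it several times with different seeds.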
K-Means: Example
Expectation Maximization
[Lau 95]
• Generalization of k-Means
(probabilistic assignment of points to clusters)
• Basic Idea:
– Estimate the parameters of k Gaussians
– Optimize the probability that the mixture of
parameterized Gaussians fits the data
– Iterative algorithm similar to k-Means
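A minimal sketch of this idea, assuming a 1-D mixture of k Gaussians for brevity (the algorithm on the slide applies to general d-dimensional data; all names are illustrative):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50):
    """EM sketch for a 1-D mixture of k Gaussians.

    Returns mixture weights pi, means mu, and variances var.
    """
    # Deterministic initialization: spread the means over the data quantiles.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: probabilistic ("soft") assignment of points to clusters,
        # the generalization of k-means' hard nearest-prototype assignment.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances so that the
        # mixture's likelihood of the data increases.
        Nk = resp.sum(axis=0)
        pi = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var
```

With hard 0/1 responsibilities and fixed equal variances, the M-step reduces to the k-means prototype update, which is the sense in which EM generalizes k-means.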