Mining Very Large Databases with Parallel Processing PDF

210 Pages·2000·11.45 MB·English

Preview Mining Very Large Databases with Parallel Processing

MINING VERY LARGE DATABASES WITH PARALLEL PROCESSING

The Kluwer International Series on ADVANCES IN DATABASE SYSTEMS
Series Editor: Ahmed K. Elmagarmid, Purdue University, West Lafayette, IN 47907

Other books in the Series:

DATABASE CONCURRENCY CONTROL: Methods, Performance, and Analysis, by Alexander Thomasian. ISBN: 0-7923-9741-X
TIME-CONSTRAINED TRANSACTION MANAGEMENT: Real-Time Constraints in Database Transaction Systems, by Nandit R. Soparkar, Henry F. Korth, Abraham Silberschatz. ISBN: 0-7923-9752-5
SEARCHING MULTIMEDIA DATABASES BY CONTENT, by Christos Faloutsos. ISBN: 0-7923-9777-0
REPLICATION TECHNIQUES IN DISTRIBUTED SYSTEMS, by Abdelsalam A. Helal, Abdelsalam A. Heddaya, Bharat B. Bhargava. ISBN: 0-7923-9800-9
VIDEO DATABASE SYSTEMS: Issues, Products, and Applications, by Ahmed K. Elmagarmid, Haitao Jiang, Abdelsalam A. Helal, Anupam Joshi, Magdy Ahmed. ISBN: 0-7923-9872-6
DATABASE ISSUES IN GEOGRAPHIC INFORMATION SYSTEMS, by Nabil R. Adam and Aryya Gangopadhyay. ISBN: 0-7923-9924-2
INDEX DATA STRUCTURES IN OBJECT-ORIENTED DATABASES, by Thomas A. Mueck and Martin L. Polaschek. ISBN: 0-7923-9971-4
INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS, by Elisa Bertino, Beng Chin Ooi, Ron Sacks-Davis, Kian-Lee Tan, Justin Zobel, Boris Shidlovsky and Barbara Catania. ISBN: 0-7923-9985-4

MINING VERY LARGE DATABASES WITH PARALLEL PROCESSING
by
Alex A. Freitas, University of Essex, Colchester, United Kingdom
and
Simon H. Lavington, University of Essex, Colchester, United Kingdom

SPRINGER SCIENCE+BUSINESS MEDIA, LLC

Library of Congress Cataloging-in-Publication Data
Freitas, Alex A., 1964-
Mining very large databases with parallel processing / by Alex A. Freitas and Simon H. Lavington.
p. cm. -- (The Kluwer international series on advances in database systems)
Includes bibliographical references and index.
ISBN 978-1-4613-7523-4
ISBN 978-1-4615-5521-6 (eBook)
DOI 10.1007/978-1-4615-5521-6
1. Database management. 2. Data mining. 3. Transaction systems (Computer systems) 4. Parallel processing (Electronic computers)
I. Lavington, S. H. (Simon Hugh), 1939- . II. Title. III. Series.
QA76.9.D3F745 1998
006.3--dc21 97-41615 CIP

Copyright © by Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 2000
Softcover reprint of the hardcover 1st edition 2000

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.

This book is dedicated to all the people who believe that learning is not only one of the most necessary but also one of the noblest human activities.

CONTENTS

PREFACE
ACKNOWLEDGMENTS
INTRODUCTION
    The Motivation for Data Mining and Knowledge Discovery
    The Inter-disciplinary Nature of Knowledge Discovery in Databases (KDD)
    The Challenge of Efficient Knowledge Discovery in Large Databases and Data Warehouses
    Organization of the Book

Part I KNOWLEDGE DISCOVERY AND DATA MINING

1 KNOWLEDGE DISCOVERY TASKS
    1.1 Discovery of Association Rules
    1.2 Classification
    1.3 Other KDD Tasks
2 KNOWLEDGE DISCOVERY PARADIGMS
    2.1 Rule Induction (RI)
    2.2 Instance-Based Learning (IBL)
    2.3 Neural Networks (NN)
    2.4 Genetic Algorithms (GA)
    2.5 On-Line Analytical Processing (OLAP)
    2.6 Focus on Rule Induction
3 THE KNOWLEDGE DISCOVERY PROCESS
    3.1 An Overview of the Knowledge Discovery Process
    3.2 Data Warehouse (DW)
    3.3 Attribute Selection
    3.4 Discretization
    3.5 Rule-Set Refinement
4 DATA MINING
    4.1 Decision-Tree Building
    4.2 Overfitting
    4.3 Data-Mining-Algorithm Bias
    4.4 Improved Representation Languages
    4.5 Integrated Data Mining Architectures
5 DATA MINING TOOLS
    5.1 Clementine
    5.2 Darwin
    5.3 MineSet
    5.4 Intelligent Miner
    5.5 Decision-Tree-Building Tools

Part II PARALLEL DATABASE SYSTEMS

6 BASIC CONCEPTS ON PARALLEL PROCESSING
    6.1 Temporal and Spatial Parallelism
    6.2 Granularity, Level and Degree of Parallelism
    6.3 Shared and Distributed Memory
    6.4 Evaluating the Performance of a Parallel System
    6.5 Communication Overhead
    6.6 Load Balancing
    6.7 Approaches for Exploiting Parallelism
7 DATA PARALLELISM, CONTROL PARALLELISM AND RELATED ISSUES
    7.1 Data Parallelism and Control Parallelism
    7.2 Ease of Use and Automatic Parallelization
    7.3 Machine-Architecture Independence
    7.4 Scalability
    7.5 Data Partitioning
    7.6 Data Placement (Declustering)
8 PARALLEL DATABASE SERVERS
    8.1 Architectures of Parallel Database Servers
    8.2 From the Teradata DBC 1012 to the NCR WorldMark 5100
    8.3 ICL Goldrush Running Oracle Parallel Server
    8.4 IBM SP2 Running DB2 Parallel Edition (DB2-PE)
    8.5 Monet

Part III PARALLEL DATA MINING

9 APPROACHES TO SPEED UP DATA MINING
    9.1 Overview of Approaches to Speed up Data Mining
    9.2 Discretization
    9.3 Attribute Selection
    9.4 Sampling and Related Approaches
    9.5 Fast Algorithms
    9.6 Distributed Data Mining
    9.7 Parallel Data Mining
    9.8 Discussion
10 PARALLEL DATA MINING WITHOUT DBMS FACILITIES
    10.1 Parallel Rule Induction
    10.2 Parallel Decision-Tree Building
    10.3 Parallel Instance-Based Learning
    10.4 Parallel Genetic Algorithms
    10.5 Parallel Neural Networks
    10.6 Discussion
11 PARALLEL DATA MINING WITH DATABASE FACILITIES
    11.1 An Overview of Integrated Data Mining/Data Warehouse Frameworks
    11.2 The Case for Integrating Data Mining and the Data Warehouse
    11.3 Server-Based KDD Systems
    11.4 Hybrid Client/Server-Based KDD Systems
    11.5 Generic, Set-Oriented Primitives for the Hybrid Client/Server-Based KDD Framework
    11.6 A Generic, Set-Oriented Primitive for Candidate-Rule (CR) Evaluation in Rule Induction
    11.7 A Generic, Set-Oriented Primitive for Computing Distance Metrics in Instance-Based Learning
    11.8 Parallel Data Mining with Specialized-Hardware Parallel Database Servers
12 SUMMARY AND SOME OPEN PROBLEMS
    12.1 Data-Parallel vs. Control-Parallel Data Mining
    12.2 Client/Server Frameworks for Parallel Data Mining
    12.3 Open Problems

REFERENCES
INDEX

PREFACE

This book addresses the problem of large-scale data mining. It is an interdisciplinary text, describing advances in the integration of three computer science areas, namely "intelligent" (machine learning-based) data mining techniques, relational databases, and parallel processing. The basic idea is to use concepts and techniques of the latter two areas, particularly parallel processing, to speed up and scale up data mining algorithms.

The book is divided into three parts. The first part presents a comprehensive review of intelligent data mining techniques such as rule induction, instance-based learning, neural networks and genetic algorithms. Likewise, the second part presents a comprehensive review of parallel processing and parallel databases. Each of these parts includes an overview of commercially-available, state-of-the-art tools. The third part deals with the application of parallel processing to data mining. The emphasis is on finding generic, cost-effective solutions for realistic data volumes. Two parallel computational environments are discussed, firstly excluding the use of commercial-strength DBMS, and then using parallel DBMS servers.

It is assumed that the reader has a knowledge roughly equivalent to a first degree (B.Sc.) in accurate sciences, so that (s)he is reasonably familiar with basic concepts of statistics and computer science.

The primary audience for this book is industry data miners and practitioners in general, who would like to apply intelligent data mining techniques to large amounts of data. The book will also be of interest to academic researchers and post-graduate students, particularly database researchers interested in advanced, intelligent database applications and artificial intelligence researchers interested in industrial, real-world applications of machine learning.

ACKNOWLEDGMENTS

Since we started to work on data mining we have had the help of several good people. We are grateful to all of them for their support. In particular, we would like to express our thanks to the following people:

To Dominicus R. Thoen and Neil E.J. Dewhurst, for their help in some data mining experiments and for their support in general.
To Paul Scott, for interesting discussions about data mining and machine learning.
To Steve Hassan, for his help in using the White Cross WX9010 parallel database server.
To Foster Provost, Richard Kufrin, and Sarabjot Anand, for interesting discussions about parallel data mining and for their encouragement.

During the project that led to the writing up of this book, the first author was financially supported by a grant from the Brazilian government's National Council of Scientific and Technological Development (CNPq), process number 200384/93-7.
