Data Mining: A Heuristic Approach Hussein A. Abbass Ruhul A. Sarker Charles S. Newton University of New South Wales, Australia Idea Group Information Science Publishing Publishing Hershey • London • Melbourne • Singapore • Beijing Acquisitions Editor: Mehdi Khosrowpour Managing Editor: Jan Travers Development Editor: Michele Rossi Copy Editor: Maria Boyer Typesetter: Tamara Gillis Cover Design: Debra Andree Printed at: Integrated Book Technology Published in the United States of America by Idea Group Publishing 1331 E. Chocolate Avenue Hershey PA 17033-1117 Tel: 717-533-8845 Fax: 717-533-8661 E-mail: [email protected] Web site: http://www.idea-group.com and in the United Kingdom by Idea Group Publishing 3 Henrietta Street Covent Garden London WC2E 8LU Tel: 44 20 7240 0856 Fax: 44 20 7379 3313 Web site: http://www.eurospan.co.uk Copyright © 2002 by Idea Group Publishing. All rights reserved. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Library of Congress Cataloging-in-Publication Data Data mining : a heuristic approach / [edited by] Hussein Aly Abbass, Ruhul Amin Sarker, Charles S. Newton. p. cm. Includes index. ISBN 1-930708-25-4 1. Data mining. 2. Database searching. 3. Heuristic programming. I. Abbass, Hussein. II. Sarker, Ruhul. III. Newton, Charles, 1942- QA76.9.D343 D36 2001 006.31--dc21 2001039775 British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library. NEW from Idea Group Publishing • Data Mining: A Heuristic Approach Hussein Aly Abbass, Ruhul Amin Sarker and Charles S. Newton/ 1-930708-25-4 • Managing Information Technology in Small Business: Challenges and Solutions Stephen Burgess/ 1-930708-35-1 • Managing Web Usage in the Workplace: A Social, Ethical and Legal Perspective Murugan Anandarajan and Claire A. Simmers/ 1-930708-18-1 • Challenges of Information Technology Education in the 21st Century Eli Cohen/ 1-930708-34-3 • Social Responsibility in the Information Age: Issues and Controversies Gurpreet Dhillon/ 1-930708-11-4 • Database Integrity: Challenges and Solutions Jorge H. Doorn and Laura Rivero/ 1-930708-38-6 • Managing Virtual Web Organizations in the 21st Century: Issues and Challenges Ulrich Franke/ 1-930708-24-6 • Managing Business with Electronic Commerce: Issues and Trends Aryya Gangopadhyay/ 1-930708-12-2 • Electronic Government: Design, Applications and Management Åke Grönlund/ 1-930708-19-X • Knowledge Media in Health Care: Opportunities and Challenges Rolf Grutter/ 1-930708-13-0 • Internet Management Issues: A Global Perspective John D. Haynes/ 1-930708-21-1 • Enterprise Resource Planning: Global Opportunities and Challenges Liaquat Hossain, Jon David Patrick and M. A. Rashid/ 1-930708-36-X • The Design and Management of Effective Distance Learning Programs Richard Discenza, Caroline Howard, and Karen Schenk/ 1-930708-20-3 • Multirate Systems: Design and Applications Gordana Jovanovic-Dolecek/ 1-930708-30-0 • Managing IT/Community Partnerships in the 21st Century Jonathan Lazar/ 1-930708-33-5 • Multimedia Networking: Technology, Management and Applications Syed Mahbubur Rahman/ 1-930708-14-9 • Cases on Worldwide E-Commerce: Theory in Action Mahesh Raisinghani/ 1-930708-27-0 • Designing Instruction for Technology-Enhanced Learning Patricia L. Rogers/ 1-930708-28-9 • Heuristic and Optimization for Knowledge Discovery Ruhul Amin Sarker, Hussein Aly Abbass and Charles Newton/ 1-930708-26-2 • Distributed Multimedia Databases: Techniques and Applications Timothy K. Shih/ 1-930708-29-7 • Neural Networks in Business: Techniques and Applications Kate Smith and Jatinder Gupta/ 1-930708-31-9 • Information Technology and Collective Obligations: Topics and Debate Robert Skovira/ 1-930708-37-8 • Managing the Human Side of Information Technology: Challenges and Solutions Edward Szewczak and Coral Snodgrass/ 1-930708-32-7 • Cases on Global IT Applications and Management: Successes and Pitfalls Felix B. Tan/ 1-930708-16-5 • Enterprise Networking: Multilayer Switching and Applications Vasilis Theoharakis and Dimitrios Serpanos/ 1-930708-17-3 • Measuring the Value of Information Technology Han T. M. van der Zee/ 1-930708-08-4 • Business to Business Electronic Commerce: Challenges and Solutions Merrill Warkentin/ 1-930708-09-2 Excellent additions to your library! Receive the Idea Group Publishing catalog with descriptions of these books by calling, toll free 1/800-345-4332 or visit the IGP Online Bookstore at: http://www.idea-group.com! Data Mining: A Heuristic Approach Table of Contents Preface ............................................................................................................................vi Part One: General Heuristics Chapter 1: From Evolution to Immune to Swarm to …? A Simple Introduction to Modern Heuristics....................................................... 1 Hussein A. Abbass, University of New South Wales, Australia Chapter 2: Approximating Proximity for Fast and Robust Distance-Based Clustering ................................................................................22 Vladimir Estivill-Castro, University of Newcastle, Australia Michael Houle, University of Sydney, Australia Part Two: Evolutionary Algorithms Chapter 3: On the Use of Evolutionary Algorithms in Data Mining..........................48 Erick Cantú-Paz, Lawrence Livermore National Laboratory, USA Chandrika Kamath, Lawrence Livermore National Laboratory, USA Chapter 4: The discovery of interesting nuggets using heuristic techniques..........72 Beatriz de la Iglesia, University of East Anglia, UK Victor J. Rayward-Smith, University of East Anglia, UK Chapter 5: Estimation of Distribution Algorithms for Feature Subset Selection in Large Dimensionality Domains.....................................................97 Iñaki Inza, University of the Basque Country, Spain Pedro Larrañaga, University of the Basque Country, Spain Basilio Sierra, University of the Basque Country, Spain Chapter 6: Towards the Cross-Fertilization of Multiple Heuristics: Evolving Teams of Local Bayesian Learners ...................................................117 Jorge Muruzábal, Universidad Rey Juan Carlos, Spain Chapter 7: Evolution of Spatial Data Templates for Object Classification..............143 Neil Dunstan, University of New England, Australia Michael de Raadt, University of Southern Queensland, Australia Part Three: Genetic Programming Chapter 8: Genetic Programming as a Data-Mining Tool .......................................157 Peter W.H. Smith, City University, UK Chapter 9: A Building Block Approach to Genetic Programming for Rule Discovery.............................................................................................174 A.P. Engelbrecht, University of Pretoria, South Africa Sonja Rouwhorst, Vrije Universiteit Amsterdam, The Netherlands L. Schoeman, University of Pretoria, South Africa Part Four: Ant Colony Optimization and Immune Systems Chapter 10: An Ant Colony Algorithm for Classification Rule Discovery .............191 Rafael S. Parpinelli, Centro Federal de Educacao Tecnologica do Parana, Brazil Heitor S. Lopes, Centro Federal de Educacao Tecnologica do Parana, Brazil Alex A. Freitas, Pontificia Universidade Catolica do Parana, Brazil Chapter 11: Artificial Immune Systems: Using the Immune System as Inspiration for Data Mining .........................................................................209 Jon Timmis, University of Kent at Canterbury, UK Thomas Knight, University of Kent at Canterbury, UK Chapter 12: aiNet: An Artificial Immune Network for Data Analysis....................231 Leandro Nunes de Castro, State University of Campinas, Brazil Fernando J. Von Zuben, State University of Campinas, Brazil Part Five: Parallel Data Mining Chapter 13: Parallel Data Mining.............................................................................261 David Taniar, Monash University, Australia J. Wenny Rahayu, La Trobe University, Australia About the Authors......................................................................................................290 Index ...........................................................................................................................297 vi Preface The last decade has witnessed a revolution in interdisciplinary research where the boundaries of different areas have overlapped or even disappeared. New fields of research emerge each day where two or more fields have integrated to form a new identity. Examples of these emerging areas include bioinformatics (synthesizing biology with computer and information systems), data mining (combining statistics, optimization, machine learning, artificial intelligence, and databases), and modern heuristics (integrating ideas from tens of fields such as biology, forest, immunology, statistical mechanics, and physics to inspire search techniques). These integrations have proved useful in substantiating problem- solving approaches with reliable and robust techniques to handle the increasing demand from practitioners to solve real-life problems. With the revolution in genetics, databases, automa- tion, and robotics, problems are no longer those that can be solved analytically in a feasible time. Complexity arises because of new discoveries about the genome, path planning, changing environments, chaotic systems, and many others, and has contributed to the increased demand to find search techniques that are capable of getting a good enough solution in a reasonable time. This has directed research into heuristics. During the same period of time, databases have grown exponentially in large stores and companies. In the old days, system analysts faced many difficulties in finding enough data to feed into their models. The picture has changed and now the reverse picture is a daily problem–how to understand the large amount of data we have accumulated over the years. Simultaneously, investors have realized that data is a hidden treasure in their companies. With data, one can analyze the behavior of competitors, understand the system better, and diagnose the faults in strategies and systems. Research into statistics, machine learning, and data analysis has been resurrected. Unfortunately, with the amount of data and the complexity of the underlying models, traditional approaches in statistics, machine learning, and tradi- tional data analysis fail to cope with this level of complexity. The need therefore arises for better approaches that are able to handle complex models in a reasonable amount of time. These approaches have been named data mining (sometimes data farming) to distinguish them from traditional statistics, machine learning, and other data analysis techniques. In addition, decision makers were not interested in techniques that rely too much on the underlying assumptions in statistical models. The challenge is to not have any assumptions about the model and try to come up with something new, something that is not obvious or predictable (at least from the decision makers’ point of view). Some unobvious thing may have significant values to the decision maker. Identifying a hidden trend in the data or a buried fault in the system is by all accounts a treasure for the investor who knows that avoiding loss results in profit and that knowledge in a complex market is a key criterion for success and continuity. Notwithstanding, models that are free from assumptions–or at least have minimum assumptions–are expensive to use. The dramatic search space cannot be navigated using traditional search techniques. This has highlighted a natural demand for the use of heuristic search methods in data mining. This book is a repository of research papers describing the applications of modern vii heuristics to data mining. This is a unique–and as far as we know, the first–book that provides up-to-date research in coupling these two topics of modern heuristics and data mining. Although it is by all means an incomplete coverage, it does provide some leading research in this area. This book contains open-solicited and invited chapters written by leading researchers in the field. All chapters were peer reviewed by at least two recognized researchers in the field in addition to one of the editors. Contributors come from almost all the continents and therefore, the book presents a global approach to the discipline. The book contains 13 chapters divided into five parts as follows: • Part 1: General Heuristics • Part 2: Evolutionary Algorithms • Part 3: Genetic Programming • Part 4: Ant Colony Optimization and Immune Systems • Part 5: Parallel Data Mining Part 1 gives an introduction to modern heuristics as presented in the first chapter. The chapter serves as a textbook-like introduction for readers without a background in heuristics or those who would like to refresh their knowledge. Chapter 2 is an excellent example of the use of hill climbing for clustering. In this chapter, Vladimir Estivill-Castro and Michael E. Houle from the University of Newcastle and the University of Sydney, respectively, provide a methodical overview of clustering and hill climbing methods to clustering. They detail the use of proximity information to assess the scalability and robustness of clustering. Part 2 covers the well-known evolutionary algorithms. After almost three decades of continuous research in this area, the vast amount of papers in the literature is beyond a single survey paper. However, in Chapter 3, Erick Cantú-Paz and Chandrika Kamath from Lawrence Livermore National Laboratory, USA, provide a brave and very successful attempt to survey the literature describing the use of evolutionary algorithms in data mining. With over 75 references, they scrutinize the data mining process and the role of evolutionary algorithms in each stage of the process. In Chapter 4, Beatriz de la Iglesia and Victor J. Rayward-Smith, from the University of East Anglia, UK, provide a superb paper on the application of Simulated Annealing, Tabu Search, and Genetic Algorithms (GA) to nugget discovery or classification where an important class is under-represented in the database. They summarize in their chapter different measures of performance for the classification problem in general and compare their results against 12 classification algorithms. Iñaki Inza, Pedro Larrañaga, and Basilio Sierra from the University of the Basque Country, Spain, follow, in Chapter 5, with an outstanding piece of work on feature subset selection using a different type of evolutionary algorithms, the Estimation of Distribution Algorithms (EDA). In EDA, a probability distribution of the best individuals in the population is maintained to sample the individuals in subsequent generations. Traditional crossover and mutation operators are replaced by the re-sampling process. They applied EDA to the Feature Subset Selection problem and showed that it significantly improves the prediction accuracy. In Chapter 6, Jorge Muruzábal from the University of Rey Juan Carlos, Spain, presents the brilliant idea of evolving teams of local Bayesian learners. Bayes theorem was resurrected as a result of the revolution in computer science. Nevertheless, Bayesian approaches, such as viii Bayesian Networks, require large amounts of computational effort, and the search algorithm can easily become stuck in a local minimum. Dr. Muruzábal combined the power of the Bayesian approach with the ability of Evolutionary Algorithms and Learning Classifier Systems for the classification process. Neil Dunstan from the University of New England, and Michael de Raadt from the University of Southern Queensland, Australia, provide an interesting application of the use of evolutionary algorithms for the classification and detection of Unexploded Ordnance present on military sites in Chapter 7. Part 3 covers the area of Genetic Programming (GP). GP is very similar to the traditional GA in its use of selection and recombination as the means of evolution. Different from GA, GP represents the solution as a tree, and therefore the crossover and mutation operators are adopted to handle tree structures. This part starts with Chapter 8 by Peter W.H. Smith from City University, UK, who provides an interesting introduction to the use of GP for data mining and the problems facing GP in this domain. Before discarding GP as a useful tool for data mining, A.P. Engelbrecht and L Schoeman from the University of Pretoria, South Africa along with Sonja Rouwhorst from the University of Vrije, The Netherlands, provide a building block approach to genetic programming for rule discovery in Chapter 9. They show that their proposed GP methodology is comparable to the famous C4.5 decision tree classifier–a famous decision tree classifier. Part 4 covers the increasingly growing areas of Ant Colony Optimization and Immune Systems. Rafael S. Parpinelli and Heitor S. Lopes from Centro Federal de Educacao Tecnologica do Parana, and Alex A. Freitas from Pontificia Universidade Catolica do Parana, Brazil, present a pioneer attempt, in Chapter 10, to apply ant colony optimization to rule discovery. Their results are very promising and through an extremely interesting approach, they present their techniques. Jon Timmis and Thomas Knight, from the University of Kent at Canterbury, UK, introduce Artificial Immune Systems (AIS) in Chapter 11. In a notable presentation, they present the AIS domain and how can it be used for data mining. Leandro Nunes de Castro and Fernando J. Von Zuben, from the State University of Campinas, Brazil, follow in Chapter 12 with the use of AIS for clustering. The chapter presents a remarkable metaphor for the use of AIS with an outstanding potential for the proposed algorithm. In general, the data mining task is very expensive, whether we are using heuristics or any other technique. It was therefore impossible not to present this book without discussing parallel data mining. This is the task carried out by David Taniar from Monash University and J. Wenny Rahayu from La Trobe University, Australia, in Part 5, Chapter 13. They both have written a self-contained and detailed chapter in an exhilarating style, thereby bringing the book to a close. It is hoped that this book will trigger great interest into data mining and heuristics, leading to many more articles and books! ix Acknowledgments We would like to express our gratitude to the contributors without whose submissions this book would not have been born. We owe a great deal to the reviewers who reviewed entire chapters and gave the authors and editors much needed guidance. Also, we would like to thank those dedicated reviewers, who did not contribute through authoring chapters to the current book or to our second book Heuristics and Optimization for Knowledge Discovery– Paul Darwen, Ross Hayward, and Joarder Kamruzzaman. A further special note of thanks must go also to all the staff at Idea Group Publishing, whose contributions throughout the whole process from the conception of the idea to final publication have been invaluable. In closing, we wish to thank all the authors for their insights and excellent contributions to this book. In addition, this book would not have been possible without the ongoing professional support from Senior Editor Dr. Mehdi Khosrowpour, Managing Editor Ms. Jan Travers and Development Editor Ms. Michele Rossi at Idea Group Publishing. Finally, we want to thank our families for their love, support, and patience throughout this project. Hussein A. Abbass, Ruhul Sarker, and Charles Newton Editors (2001)
Description: