Data Mining and Computational Intelligence

Studies in Fuzziness and Soft Computing

Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw, Poland
E-mail: [email protected]
http://www.springer.de/cgi-bin/search_book.pl?series=2941

Further volumes of this series can be found at our homepage.

Vol. 46. J. N. Mordeson and P. S. Nair, Fuzzy Graphs and Fuzzy Hypergraphs, 2000, ISBN 3-7908-1286-2
Vol. 47. E. Czogała and J. Łęski, Fuzzy and Neuro-Fuzzy Intelligent Systems, 2000, ISBN 3-7908-1289-7
Vol. 48. M. Sakawa, Large Scale Interactive Fuzzy Multiobjective Programming, 2000, ISBN 3-7908-1293-5
Vol. 49. L. I. Kuncheva, Fuzzy Classifier Design, 2000, ISBN 3-7908-1298-6
Vol. 50. F. Crestani and G. Pasi (Eds.), Soft Computing in Information Retrieval, 2000, ISBN 3-7908-1299-4
Vol. 51. J. Fodor, B. De Baets and P. Perny (Eds.), Preferences and Decisions under Incomplete Knowledge, 2000, ISBN 3-7908-1303-6
Vol. 52. E. E. Kerre and M. Nachtegael (Eds.), Fuzzy Techniques in Image Processing, 2000, ISBN 3-7908-1304-4
Vol. 53. G. Bordogna and G. Pasi (Eds.), Recent Issues on Fuzzy Databases, 2000, ISBN 3-7908-1319-2
Vol. 54. P. Sinčák and J. Vaščák (Eds.), Quo Vadis Computational Intelligence?, 2000, ISBN 3-7908-1324-9
Vol. 55. J. N. Mordeson, D. S. Malik and S.-C. Cheng, Fuzzy Mathematics in Medicine, 2000, ISBN 3-7908-1325-7
Vol. 56. L. Polkowski, S. Tsumoto and T. Y. Lin (Eds.), Rough Set Methods and Applications, 2000, ISBN 3-7908-1328-1
Vol. 57. V. Novak and I. Perfilieva (Eds.), Discovering the World with Fuzzy Logic, 2001, ISBN 3-7908-1330-3
Vol. 58. D. S. Malik and J. N. Mordeson, Fuzzy Discrete Structures, 2000, ISBN 3-7908-1335-4
Vol. 59. T. Furuhashi, Shun'Ichi Tano and H.-A. Jacobsen (Eds.), Deep Fusion of Computational and Symbolic Processing, 2001, ISBN 3-7908-1339-7
Vol. 60. K. J. Cios (Ed.), Medical Data Mining and Knowledge Discovery, 2001, ISBN 3-7908-1340-0
Vol. 61. D. Driankov, A. Saffiotti (Eds.), Fuzzy Logic Techniques for Autonomous Vehicle Navigation, 2001, ISBN 3-7908-1341-9
Vol. 62. N. Baba, L. C. Jain (Eds.), Computational Intelligence in Games, 2001, ISBN 3-7908-1348-6
Vol. 63. O. Castillo, P. Melin, Soft Computing for Control of Non-Linear Dynamical Systems, 2001, ISBN 3-7908-1349-4
Vol. 64. I. Nishizaki, M. Sakawa, Fuzzy and Multiobjective Games for Conflict Resolution, 2001, ISBN 3-7908-1341-9
Vol. 65. E. Orlowska, A. Szalas (Eds.), Relational Methods for Computer Science Applications, 2001, ISBN 3-7908-1365-6
Vol. 66. R. J. Howlett, L. C. Jain (Eds.), Radial Basis Function Networks 1, 2001, ISBN 3-7908-1367-2

Abraham Kandel
Mark Last
Horst Bunke
Editors

Data Mining and Computational Intelligence

With 90 Figures and 45 Tables

Springer-Verlag Berlin Heidelberg GmbH

Dr. Abraham Kandel
Computer Science and Engineering
University of South Florida
4202 E. Fowler Ave., ENB 118
Tampa, Florida 33620
USA
kandel@csee.usf.edu

Dr. Mark Last
Information Systems Engineering
Ben-Gurion University of the Negev
Beer-Sheva 84105
Israel
[email protected]

Dr. Horst Bunke
Department of Computer Science
University of Bern
Neubruckstrasse 10
CH-3012 Bern
Switzerland
[email protected]

ISSN 1434-9922
ISBN 978-3-7908-2484-1

Cataloging-in-Publication Data applied for
Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Data mining and computational intelligence: with 45 tables / Abraham Kandel ... ed.
(Studies in fuzziness and soft computing; Vol. 68)
ISBN 978-3-7908-2484-1
ISBN 978-3-7908-1825-3 (eBook)
DOI 10.1007/978-3-7908-1825-3

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks.
Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag Berlin Heidelberg GmbH. Violations are liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 2001
Originally published by Physica-Verlag Heidelberg New York in 2001
Softcover reprint of the hardcover 1st edition 2001

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Hardcover Design: Erich Kirchner, Heidelberg
SPIN 10793207 88/2202-5 4 3 2 1 0 - Printed on acid-free paper

Preface

Many business decisions are made in the absence of complete information about the decision consequences. Credit lines are approved without knowing the future behavior of the customers; stocks are bought and sold without knowing their future prices; parts are manufactured without knowing all the factors affecting their final quality; etc. All these cases can be categorized as decision making under uncertainty.

Decision makers (human or automated) can handle uncertainty in different ways. Deferring the decision due to the lack of sufficient information may not be an option, especially in real-time systems. Sometimes expert rules, based on experience and intuition, are used. A decision tree is a popular form of representing a set of mutually exclusive rules. An example of a two-branch tree is: if a credit applicant is a student, approve; otherwise, decline. Expert rules are usually based on some hidden assumptions, which attempt to predict the decision consequences. A hidden assumption of the last rule set is: a student will be a profitable customer.
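
The two-branch tree described above can be written down directly as code. The following is only a minimal sketch: the Applicant record and its field names are hypothetical and serve purely as illustration. The hidden assumption of the rule set (that a student will be a profitable customer) does not appear anywhere in the code, which is exactly why it is easy to overlook.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    # Hypothetical record; the rule below inspects only is_student.
    name: str
    is_student: bool


def decide(applicant: Applicant) -> str:
    """Two-branch decision tree: one test at the root, one leaf per outcome."""
    if applicant.is_student:
        return "approve"
    return "decline"


# The two rules are mutually exclusive: every applicant reaches exactly one leaf.
applicants = [Applicant("A", True), Applicant("B", False)]
decisions = {a.name: decide(a) for a in applicants}
print(decisions)
```
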

Since direct predictions of the future may not be accurate, a decision maker can consider using some information from the past. The idea is to utilize the potential similarity between the patterns of the past (e.g., "most students used to be profitable") and the patterns of the future (e.g., "students will be profitable").

The problem of inference from data is closely related to the old and well-established area of statistics. According to Mendenhall et al. (1993), modern statistics is concerned with "examining and summarizing data to predict, estimate, and, ultimately, make business decisions." Statisticians have a variety of tools at their disposal. These include linear and nonlinear regression models, which produce mathematical equations for estimating the value of a dependent variable. Regression models, like other statistical methods, are based on restrictive assumptions regarding the type and the distribution of the analyzed data. Thus, the linear regression model requires all the model variables to be continuous. This requirement is not necessarily satisfied in every real-world dataset. The assumption regarding the "normality" of the data distribution is also very common in statistics, though the actual distribution of the real variables may be completely different. As indicated by Elder and Pregibon (1996), statisticians are more interested in the interpretability of their results than in the classification/estimation performance of the statistical models. The distinction between the real patterns and the "noise" is another important consideration in statistics: the sample data is assumed to include some amount of noise, and a confidence interval is associated with every statistical conclusion.

The increasing availability of electronic information has accentuated the limitations of the classical statistical models.
On one hand, most statisticians still adhere to simple and global models (Elder and Pregibon 1996), and, on the other hand, today's computers have enough memory and computational power to find the best, though not necessarily the simplest, models in a complex hypothesis space within minutes or even seconds. Alternative model representations include neural networks, decision trees, Bayesian networks, and others. Algorithms for computationally efficient search in a large set of models, specified by a given representation, have been developed by statisticians as well as by researchers from the artificial intelligence, pattern recognition, and machine learning communities (see Mitchell, 1997).

A book by Fayyad et al. (1996) has defined data mining as "the application of specific algorithms for extracting patterns from data." According to the same book, data mining is a step within the process of knowledge discovery in databases, which starts with pre-processing the raw data and ends with a business-oriented interpretation of data mining results. Fayyad et al. (1996) present a list of data analysis methods (decision tree learning, clustering, regression, etc.) that can be used at the data mining step.

Most research challenges for knowledge discovery and data mining have not changed much during the last five years. The list of research topics raised by Fayyad et al. (1996) includes the following issues.

Understandability of patterns. Classification/prediction accuracy is still the most common criterion for comparing the performance of data mining algorithms. However, knowledge discovery means that the user gets a better insight into a specific domain or problem. Improving the interpretability of the discovered patterns is a major concern for most papers in this volume, especially Chapters 1-6 and 9.
Since the discovered knowledge may include a certain amount of uncertainty and imprecision, fuzzy sets (see below) can be used to represent the extracted patterns in a more understandable, linguistic form.

Complex relationships between attributes. Several data mining methods (e.g., decision trees and association rules) automatically produce sets of rules of the form if condition then consequence. The task of learning rules from attribute-value records has been extensively studied in machine learning (see Mitchell, 1997). Though in simple systems the cause-effect relationships may be straightforward, automated rule induction from data representing complex phenomena should be done with caution. Extraction of complex relationships by using a two-phase approach to data mining is covered in Chapter 2. Chapters 3 and 7 handle the problem of finding complex associations in relational and transactional data. Discovering complex relationships in other types of data (e.g., financial and image data) is covered by Chapters 10 and 12.

Missing and noisy data. Business databases suffer from high rates of data entry errors. Moreover, to avoid operational delays, many important attributes are defined as optional, leading to a large number of missing values. Alternative techniques for dealing with missing and noisy data are described in Chapters 1, 4 and 8 of this book.

Mining very large databases. The UCI Machine Learning Repository (Blake and Merz 1998) has been recognized as a benchmark for evaluating the performance of data mining algorithms. The repository is a collection of flat tables, most of which have fewer than 1,000 rows (records) and 50 columns (attributes). This is much less data than one can find in a typical commercial database application, where multi-gigabyte tables are commonplace. When dealing with large volumes of data, loading complete tables into the computer's main memory becomes impractical.
A scalable data mining algorithm, which requires a single scan of a database, is presented in Chapter 7. Another problem associated with large databases, high dimensionality, is handled by the Fuzzy-ROSA method in Chapter 6.

Changing data. The original versions of many data mining methods assume the patterns to be static (time-invariant). The time dimension is absent from most benchmark datasets of the UCI Repository. However, modeling the dynamic behavior of non-stationary time series is very important for analyzing different types of financial data, like exchange rates and stock indices. Chapter 13 of this book is concerned with the problem of detecting changes in nonlinear time series.

Integration with database systems. Since most business information is stored by database management systems (DBMS), an interface between DBMS and data mining tools might be very useful. Chapter 5 of this book presents a fuzzy querying interface, which can support a specific data mining technique called "linguistic summaries."

As shown by several chapters in this book, fuzzy set theory can play an important role in the process of knowledge discovery. Central to fuzzy set theory, introduced by Lotfi A. Zadeh (1965), is the concept of fuzzy sets, which are sets with imprecise boundaries. The membership of an object in a fuzzy set is a matter of degree: for example, two persons of different height may belong to the same set of tall people, but their membership degrees may be different. In the above example, tall is an imprecise linguistic term, which can be used by humans for communication and even for decision-making. This view of uncertainty is different from the probabilistic approach used by most data mining methods, since the calculation of membership grades is based on a user-specific understanding of the domain (expressed mathematically by membership functions) rather than on purely statistical information.
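
The notion of a membership function for the linguistic term tall can be illustrated with a short sketch. The piecewise-linear shape and the breakpoints of 160 cm and 190 cm are assumptions chosen purely for illustration; in line with the text above, a different user could reasonably pick a different function for the same term.

```python
def tall_membership(height_cm: float, low: float = 160.0, high: float = 190.0) -> float:
    """Degree of membership in the fuzzy set 'tall'.

    Assumed shape: 0 at or below `low`, 1 at or above `high`,
    and linear in between. Both breakpoints are illustrative.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)


# Two persons of different height belong to the same set to different degrees.
for height in (175.0, 185.0):
    print(height, round(tall_membership(height), 2))
```

Under these assumed breakpoints, a person of 175 cm is tall to degree 0.5 and a person of 185 cm to degree of about 0.83; neither value is a probability, since both persons' heights are known exactly.
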

Knowledge discovery in databases can be seen as a process of approximate reasoning, since it is concerned with inferring imprecise conclusions from imprecise (noisy) data. Traditionally, data mining methods have been optimized along a single dimension, namely classification or estimation accuracy. However, business users are aware of the inherent uncertainty of the decision making process, and they may prefer comprehensible models that do not achieve the best classification performance. As demonstrated by this book, fuzzy set theory provides an efficient tool for representing the trade-off between good performance and high comprehensibility of data mining methods.

The areas to which the chapters of this volume contribute can be categorized in more detail as follows.

Rule extraction and reduction. A neuro-fuzzy method for rule learning is presented by Klose et al. in Chapter 1. The emphasis of the method is on producing a set of interpretable rules, which may be examined by a human expert. Pedrycz (Chapter 2) proposes a two-phase approach to the rule induction process: in the first phase, associations are built and scored by their relevancy and, in the second phase, some associations can be converted into production (direction-driven) rules. According to Pedrycz's approach, associations are relations between two or more information granules. An information-theoretic fuzzy approach to reducing the dimensionality of a rule set, without disclosing any confidential information to the users, is presented by Last and Kandel in Chapter 3. As demonstrated by Chan and Au (Chapter 4), fuzzy rules may be particularly useful for mining databases that contain both relational and transactional data. A fuzzy querying interface and a procedure for mining fuzzy association rules in a Microsoft Access™ database are presented by Kacprzyk and Zadrozny in Chapter 5. Chapter 6 by Slawinski et al. describes the Fuzzy-ROSA method for data-based generation of small rule bases in high-dimensional search spaces. Ben Yahia and Jaoua (Chapter 7) introduce a new efficient algorithm, called FARD, for mining fuzzy association rules in transaction databases.

New data mining methods and techniques. Two Dimensional Partitioning Techniques (DPT1 and DPT2) are applied by Chang and Halgamuge (Chapter 8) to the problem of mining labeled data with missing values. In Chapter 9, Alahakoon et al. present a method for automated identification of clusters using a Growing Self Organizing Map (GSOM). Shnaider and Schneider (Chapter 10) have developed a fuzzy analog of the traditional regression model, called "soft regression," that evaluates the relative importance of each explanatory variable related to the dependent variable.

Mining non-relational data. Chapters 11 and 12 are concerned with mining image databases, while Chapter 13 deals with time series analysis. Nguyen et al. (Chapter 11) apply a combination of data mining and soft computing techniques to the classification of dynamically changing images. A new FFT-based mosaicing algorithm is developed and implemented by Gibson et al. (Chapter 12) for finding common patterns in several images. The algorithm is applied to two problems: mosaicing satellite photos and searching images stored on the web. In Chapter 13, Wu employs a genetic-based approach for modeling time-series data. The genetic modeling is used to detect a change period and/or change point in a nonlinear time series.

The methods and application results presented in this volume suggest many promising directions for future research in data mining, soft computing, and related areas. Some of the main problems and challenges remaining in this field are covered below.

Generalization and overfitting.
Statistical techniques (e.g., regression and analysis of variance) provide a clear relationship between the distribution of noise and the significance of simple data models. Applying the standard statistical approach to more complex models, like a decision tree, has been unsatisfactory (see Quinlan 1993, p. 37). Reliable assessment of model generalization (with and without the time factor) is one of the most important research challenges for the data mining community.

Use of prior knowledge. Expert knowledge is usually expressed in linguistic terms, while most business data is still stored in a numeric format. As demonstrated by neuro-fuzzy methods, fuzzy sets are a natural tool for combining the available prior knowledge with the patterns discovered in data. New methodology should be developed for enabling the integration of fuzzy set technology with additional data mining algorithms (e.g., C4.5 or CART).

New forms of data. The last three chapters in this volume elucidate the problems associated with mining non-relational data. With multimedia databases becoming the main source of information in the 21st century, the existing data mining methods need a thorough revision to make them applicable to new types of data. The capability of a data mining method to quickly identify the most important features in a high-dimensional data set is crucial for mining text, image, and video databases.

Publication of this book was possible due to the enthusiastic response of all the contributors. We would like to thank them for their effort and for their constructive cooperation and support. We would also like to acknowledge the partial support by the USF Center for Software Testing (SOFTEC) under grant No. 2108-004-00. We hope the book will promote future research and development in data mining, computational intelligence and soft computing.

Tampa, Florida, USA
December 2000

Abraham Kandel
Mark Last
Horst Bunke