Mathematical Methods for Knowledge Discovery and Data Mining

Giovanni Felici
Consiglio Nazionale delle Ricerche, Rome, Italy

Carlo Vercellis
Politecnico di Milano, Italy

Information Science Reference
Hershey • New York

Acquisitions Editor: Kristin Klinger
Development Editor: Kristin Roth
Senior Managing Editor: Jennifer Neidig
Managing Editor: Sara Reed
Copy Editor: Angela Thor
Typesetter: Jamie Snavely
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com/reference

and in the United Kingdom by
Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanonline.com

Copyright © 2008 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Felici, Giovanni.
Mathematical methods for knowledge discovery and data mining / Giovanni Felici & Carlo Vercellis, editors.
p. cm.
Summary: “This book focuses on the mathematical models and methods that support most data mining applications and solution techniques, covering such topics as association rules; Bayesian methods; data visualization; kernel methods; neural networks; text, speech, and image recognition; an invaluable resource for scholars and practitioners in the fields of biomedicine, engineering, finance, manufacturing, marketing, performance measurement, and telecommunications”--Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-59904-528-3 (hardcover) -- ISBN 978-1-59904-530-6 (ebook)
1. Data mining. 2. Data mining--Mathematical models. 3. Knowledge acquisition (Expert systems) I. Felici, Giovanni. II. Vercellis, Carlo. III. Title.
QA76.9.D343F46 2007
006.3’12--dc22
2007022228

British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book set is new, previously unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.
Table of Contents

Foreword
Preface
Acknowledgment

Chapter I
Discretization of Rational Data / Jonathan Mugan and Klaus Truemper

Chapter II
Vector DNF for Datasets Classifications: Application to the Financial Timing Decision Problem / Massimo Liquori and Andrea Scozzari

Chapter III
Reducing a Class of Machine Learning Algorithms to Logical Commonsense Reasoning Operations / Xenia Naidenova

Chapter IV
The Analysis of Service Quality Through Stated Preference Models and Rule-Based Classification / Giovanni Felici and Valerio Gatta

Chapter V
Support Vector Machines for Business Applications / Brian C. Lovell and Christian J. Walder

Chapter VI
Kernel Width Selection for SVM Classification: A Meta-Learning Approach / Shawkat Ali and Kate A. Smith

Chapter VII
Protein Folding Classification Through Multicategory Discrete SVM / Carlotta Orsenigo and Carlo Vercellis

Chapter VIII
Hierarchical Profiling, Scoring, and Applications in Bioinformatics / Li Liao

Chapter IX
Hierarchical Clustering Using Evolutionary Algorithms / Monica Chiş

Chapter X
Exploratory Time Series Data Mining by Genetic Clustering / T. Warren Liao

Chapter XI
Development of Control Signatures with a Hybrid Data Mining and Genetic Algorithm Approach / Alex Burns, Shital Shah, and Andrew Kusiak

Chapter XII
Bayesian Belief Networks for Data Cleaning / Enrico Fagiuoli, Sara Omerino, and Fabio Stella

Chapter XIII
A Comparison of Revision Schemes for Cleaning Labeling Noise / Chuck P. Lam and David G. Stork

Chapter XIV
Improving Web Clickstream Analysis: Markov Chains Models and Genmax Algorithms / Paolo Baldini and Paolo Giudici

Chapter XV
Advanced Data Mining and Visualization Techniques with Probabilistic Principal Surfaces: Applications to Astronomy and Genetics / Antonino Staiano, Lara De Vinco, Giuseppe Longo, and Roberto Tagliaferri

Chapter XVI
Spatial Navigation Assistance System for Large Virtual Environments: The Data Mining Approach / Mehmed Kantardzic, Pedram Sadeghian, and Walaa M. Sheta

Chapter XVII
Using Grids for Distributed Knowledge Discovery / Antonio Congiusta, Domenico Talia, and Paolo Trunfio

Chapter XVIII
Fuzzy Miner: Extracting Fuzzy Rules from Numerical Patterns / Nikos Pelekis, Babis Theodoulidis, Ioannis Kopanakis, and Yannis Theodoridis

Chapter XIX
Routing Attribute Data Mining Based on Rough Set Theory / Yanbing Liu, Menghao Wang, and Jong Tang

Compilation of References
About the Contributors
Index

Detailed Table of Contents

Foreword
Preface
Acknowledgment

Chapter I
Discretization of Rational Data / Jonathan Mugan and Klaus Truemper
Frequently, one wants to extend the use of a classification method that, in principle, requires records with True/False values, so that records with rational numbers can be processed. In such cases, the rational numbers must first be replaced by True/False values before the method may be applied. In other cases, a classification method, in principle, can process records with rational numbers directly, but replacement by True/False values improves the performance of the method. The replacement process is usually called discretization or binarization. This chapter describes a recursive discretization process called Cutpoint. The key step of Cutpoint detects points where classification patterns change abruptly. The chapter includes computational results where Cutpoint is compared with entropy-based methods that, to date, have been found to be the best discretization schemes. The results indicate that Cutpoint is preferred by certain classification schemes, while entropy-based methods are better for other classification methods. Thus, one may view Cutpoint as an additional discretization tool that one may want to consider.

Chapter II
Vector DNF for Datasets Classifications: Application to the Financial Timing Decision Problem / Massimo Liquori and Andrea Scozzari
Traditional classification approaches consider a dataset formed by an archive of observations classified as positive or negative according to a binary classification rule. In this chapter, we consider the financial timing decision problem, which is the problem of deciding when it is profitable for the investor to buy shares, to sell shares, or to wait in the stock exchange market. The decision is based on classifying a dataset of observations, each represented by a vector containing the values of some numerical financial attributes, according to a ternary classification rule. We propose a new technique based on partially defined vector Boolean functions. We test our technique on different time series of the Mibtel stock exchange market in Italy, and we show that it provides high classification accuracy, as well as wide applicability to other classification problems where a classification into three or more classes is needed.

Chapter III
Reducing a Class of Machine Learning Algorithms to Logical Commonsense Reasoning Operations / Xenia Naidenova
The purpose of this chapter is to demonstrate the possibility of transforming a large class of machine-learning algorithms into commonsense reasoning processes based on well-known deductive and inductive logical rules. The concept of a good classification (diagnostic) test for a given set of positive examples lies at the basis of our approach to machine-learning problems. The task of inferring all good diagnostic tests is formulated as searching for the best approximations of a given classification (a partitioning) on a given set of examples.
Lattice theory is used as a mathematical language for constructing good classification tests. The algorithms for inferring good tests are decomposed into subtasks and operations that accord with the main rules of human commonsense reasoning.

Chapter IV
The Analysis of Service Quality Through Stated Preference Models and Rule-Based Classification / Giovanni Felici and Valerio Gatta
The analysis of quality of services is an important issue for the planning and the management of many businesses. The ability to address the demands and the relevant needs of the customers of a given service is crucial to determine its success in a competitive environment. Many quantitative tools in the areas of statistics and mathematical modeling have been designed and applied to serve this purpose. Here we consider an application of a well-established statistical technique, the stated preference models (SP), to identify, from a sample of customers, significant weights to attribute to different aspects of the service provided; such aspects may additively compose an overall satisfaction index. In addition, this weighting system is applied to a larger set of customers, and a comparison is made between the overall satisfaction identified by the SP index and the overall satisfaction directly declared by the customers. This comparison is performed by two rule-based classification systems, decision trees and the logic data miner Lsquare. The results of these two tools help in identifying the differences between the two measurements from the structural point of view, and provide an improved interpretation of the results. The application considered is related to the customers of a large Italian airport.

Chapter V
Support Vector Machines for Business Applications / Brian C. Lovell and Christian J. Walder
This chapter discusses the use of support vector machines (SVM) for business applications.
It provides a brief historical background on inductive learning and pattern recognition, and then an intuitive motivation for SVM methods. The method is compared to other approaches, and the tools and background theory required to successfully apply SVM to business applications are introduced. The authors hope that the chapter will help practitioners to understand when the SVM should be the method of choice, as well as how to achieve good results in minimal time.

Chapter VI
Kernel Width Selection for SVM Classification: A Meta-Learning Approach / Shawkat Ali and Kate A. Smith
The most critical component of kernel-based learning algorithms is the choice of an appropriate kernel and its optimal parameters. In this chapter, we propose a rule-based metalearning approach for automatic selection of the radial basis function (RBF) kernel and its parameters for support vector machine (SVM) classification. First, the best parameter selection is considered on the basis of prior information about the data, with the help of the maximum likelihood (ML) method and the Nelder-Mead (N-M) simplex method. Then the new rule-based metalearning approach is constructed and tested on 112 datasets of different sizes, covering binary as well as multiclass classification problems. We observe that our rule-based methodology provides a significant improvement in computational time, as well as in accuracy in some specific cases.

Chapter VII
Protein Folding Classification Through Multicategory Discrete SVM / Carlotta Orsenigo and Carlo Vercellis
In the context of biolife science, predicting the folding structure of a protein plays an important role in investigating its function and discovering new drugs.
Protein folding recognition can be naturally cast in the form of a multicategory classification problem, which appears challenging due to the high number of fold classes. Thus, in the last decade, several supervised learning methods have been applied in order to discriminate between proteins characterized by different folds. Recently, discrete support vector machines have been introduced as an effective alternative to traditional support vector machines. Discrete SVM have been shown to outperform other competing classification techniques both on binary and multicategory benchmark datasets. In this chapter, we adopt discrete SVM for protein folding classification. Computational tests performed on benchmark datasets empirically support the effectiveness of discrete SVM, which are able to achieve the highest prediction accuracy.

Chapter VIII
Hierarchical Profiling, Scoring, and Applications in Bioinformatics / Li Liao
Recently, clustering and classification methods have seen many applications in bioinformatics. Some are simply straightforward applications of existing techniques, but most have been adapted to cope with peculiar features of the biological data. Many biological data take the form of vectors, whose components correspond to attributes characterizing the biological entities being studied. Comparing these vectors, a.k.a. profiles, is a crucial step for most clustering and classification methods. We review the recent developments related to hierarchical profiling, where the attributes are not independent, but rather are correlated in a hierarchy. Hierarchical profiling arises in a wide range of bioinformatics problems, including protein homology detection, protein family classification, and metabolic pathway clustering. We discuss in detail several clustering and classification methods where hierarchical correlations are tackled in effective and efficient ways by incorporating domain-specific knowledge.
Relations to other statistical learning methods and more potential applications are also discussed.

Chapter IX
Hierarchical Clustering Using Evolutionary Algorithms / Monica Chiş
Clustering is an important technique used in discovering some inherent structure present in data. The purpose of cluster analysis is to partition a given data set into a number of groups such that objects in a particular cluster are more similar to each other than to objects in different clusters. Hierarchical clustering refers to the formation of a recursive clustering of the data points: a partition into many clusters, each of which is itself hierarchically clustered. Hierarchical structures solve many problems across a wide range of areas of interest. In this chapter, a new evolutionary algorithm for detecting the hierarchical structure of an input data set is proposed. The method could be very useful in economics, market segmentation, management, biological taxonomy, and other domains. A new linear representation of the cluster structure within the data set is proposed. An evolutionary algorithm evolves a population of clustering hierarchies. The proposed algorithm uses mutation and crossover as (search) variation operators. The final goal is to present a data clustering representation that makes it possible to quickly find a hierarchical clustering structure.

Chapter X
Exploratory Time Series Data Mining by Genetic Clustering / T. Warren Liao
In this chapter, we present genetic-algorithm (GA)-based methods developed for clustering univariate time series with equal or unequal length as an exploratory step of data mining. These methods basically implement the k-medoids algorithm. Each chromosome encodes, in binary, the data objects serving as the k-medoids. To compare their performance, both fixed-parameter and adaptive GAs were used.
We first employed the synthetic control chart data set to investigate the performance of three fitness functions, two distance measures, and other GA parameters such as population size, crossover rate, and mutation rate. Two more sets of time series, with or without a known number of clusters, were also tested: one is the cylinder-bell-funnel data and the other is the novel battle simulation data. The clustering results are presented and discussed.

Chapter XI
Development of Control Signatures with a Hybrid Data Mining and Genetic Algorithm Approach / Alex Burns, Shital Shah, and Andrew Kusiak
This chapter presents a hybrid approach that integrates a genetic algorithm (GA) and data mining to produce control signatures. The control signatures define the best parameter intervals leading to a desired outcome. This hybrid method integrates multiple rule sets generated by a data-mining algorithm with the fitness function of a GA. The solutions of the GA represent intersections among rules providing tight parameter bounds. The integration of intuitive rules provides an explanation for each generated control setting, and it provides insights into the decision-making process. The ability to analyze parameter trends and the feasible solutions generated by the GA with respect to the outcomes is another benefit of the proposed hybrid method. The presented approach for deriving control signatures is applicable to various domains, such as energy, medical protocols, manufacturing, airline operations, customer service, and so on. Control signatures were developed and tested for control of a power-plant boiler. These signatures