
Foundations of Data Mining and Knowledge Discovery PDF

382 Pages·2005·5.805 MB·English


T.Y. Lin, S. Ohsuga, C.J. Liau, X. Hu, S. Tsumoto (Eds.)
Foundations of Data Mining and Knowledge Discovery

Studies in Computational Intelligence, Volume 6

Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springeronline.com

Vol. 1. Tetsuya Hoya, Artificial Mind System – Kernel Memory Approach, 2005, ISBN 3-540-26072-2
Vol. 2. Saman K. Halgamuge, Lipo Wang (Eds.), Computational Intelligence for Modelling and Prediction, 2005, ISBN 3-540-26071-4
Vol. 3. Bożena Kostek, Perception-Based Data Processing in Acoustics, 2005, ISBN 3-540-25729-2
Vol. 4. Saman Halgamuge, Lipo Wang (Eds.), Classification and Clustering for Knowledge Discovery, 2005, ISBN 3-540-26073-0
Vol. 5. Da Ruan, Guoqing Chen, Etienne E. Kerre, Geert Wets (Eds.), Intelligent Data Mining, 2005, ISBN 3-540-26256-3
Vol. 6. Tsau Young Lin, Setsuo Ohsuga, Churn-Jung Liau, Xiaohua Hu, Shusaku Tsumoto (Eds.), Foundations of Data Mining and Knowledge Discovery, 2005, ISBN 3-540-26257-1

Tsau Young Lin, Setsuo Ohsuga, Churn-Jung Liau, Xiaohua Hu, Shusaku Tsumoto (Eds.)
Foundations of Data Mining and Knowledge Discovery

Professor Tsau Young Lin
Department of Computer Science
San Jose State University
95192-0103, San Jose, CA
U.S.A.
E-mail: [email protected]

Professor Xiaohua Hu
College of Information Science and Technology
Drexel University
3141 Chestnut Street, 19104-2875 Philadelphia
U.S.A.
E-mail: [email protected]

Professor Setsuo Ohsuga
Emeritus Professor of University of Tokyo
Tokyo
Japan
E-mail: [email protected]

Professor Shusaku Tsumoto
Department of Medical Informatics
Shimane Medical University
Enyo-cho 89-1, 693-8501 Izumo, Shimane-ken
Japan
E-mail: [email protected]

Dr. Churn-Jung Liau
Institute of Information Science
Academia Sinica
128 Academia Road Sec. II, 115 Taipei
Taiwan
E-mail: [email protected]

Library of Congress Control Number: 2005927318
ISSN print edition: 1860-949X
ISSN electronic edition: 1860-9503
ISBN-10 3-540-26257-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-26257-2 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2005
Printed in The Netherlands

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: by the authors and TechBooks using a Springer LaTeX macro package
Printed on acid-free paper
SPIN: 11498186 55/TechBooks 543210

Preface

While the notion of knowledge is important in many academic disciplines such as philosophy, psychology, economics, and artificial intelligence, the storage and retrieval of data is the main concern of information science.
In modern experimental science, knowledge is usually acquired by observing such data, and the cause-effect or association relationships between attributes of objects are often observable in the data. However, when the amount of data is large, it is difficult to analyze and extract information or knowledge from it. Data mining is a scientific approach that provides effective tools for extracting knowledge so that, with the aid of computers, the large amount of data stored in databases can be transformed into symbolic knowledge automatically.

Data mining, which is one of the fastest growing fields in computer science, integrates various technologies including database management, statistics, soft computing, and machine learning. We have also seen numerous applications of data mining in medicine, finance, business, information security, and so on. Many data mining techniques, such as association or frequent pattern mining, neural networks, decision trees, inductive logic programming, fuzzy logic, granular computing, and rough sets, have been developed. However, such techniques have been developed, though vigorously, under rather ad hoc and vague concepts. For further development, a close examination of its foundations seems necessary. It is expected that this examination will lead to new directions and novel paradigms.

The study of the foundations of data mining poses a major challenge for the data mining research community. To meet such a challenge, we initiated a preliminary workshop on the foundations of data mining. It was held on May 6, 2002, at the Grand Hotel, Taipei, Taiwan, as part of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-02). This conference is recognized as one of the most important events for KDD researchers in the Pacific-Asia area. The proceedings of the workshop were published as a special issue in [1], and the success of the workshop has encouraged us to organize an annual workshop on the foundations of data mining.
The workshop, which started in 2002, is held in conjunction with the IEEE International Conference on Data Mining (ICDM). The goal is to bring together individuals interested in the foundational aspects of data mining to foster the exchange of ideas with each other, as well as with more application-oriented researchers.

This volume is a collection of expanded versions of selected papers originally presented at the IEEE ICDM 2002 workshop on the Foundation of Data Mining and Discovery, and represents the state-of-the-art for much of the current research in data mining. Each paper has been carefully peer-reviewed again to ensure journal quality. The following is a brief summary of this volume's contents.

The papers in Part I are concerned with the foundations of data mining and knowledge discovery. There are eight papers in this part.1 In the paper Knowledge Discovery as Translation by S. Ohsuga, discovery is viewed as a translation from non-symbolic to symbolic representation. A quantitative measure is introduced into the syntax of predicate logic, to measure the distance between symbolic and non-symbolic representations quantitatively. This makes translation possible when there is little (or no) difference between some symbolic representation and the given non-symbolic representation. In the paper Mathematical Foundation of Association Rules - Mining Associations by Solving Integral Linear Inequalities by T. Y. Lin, the author observes, after examining the foundation, that high frequency expressions of attribute values are the most general notion of patterns in association mining. Such patterns, of course, include classical high frequency itemsets (as conjunctions) and high level association rules. Based on this new notion, the author shows that such patterns can be found by solving a finite set of linear inequalities. The results are derived from the key notions of isomorphism and canonical representations of relational tables.
1 There were three keynotes and two plenary talks, given by S. Smale, S. Ohsuga, L. Xu, H. Tsukimoto, and T. Y. Lin. The papers by Smale and Tsukimoto are collected in the book Foundation and Advances of Data Mining, W. Chu and T. Y. Lin (eds.).
In the paper Comparative Study of Sequential Pattern Mining Models by H.C. Kum, S. Paulsen, and W. Wang, the problem of mining sequential patterns is examined. In addition, four evaluation criteria are proposed for quantitatively assessing the quality of the mined results from a wide variety of synthetic datasets with varying randomness and noise levels. It is demonstrated that an alternative approximate pattern model based on sequence alignment can better recover the underlying patterns with little confounding information under all examined circumstances, including those where the frequent sequential pattern model fails. The paper Designing Robust Regression Models by M. Viswanathan and K. Ramamohanarao presents a study of the preference among competing models from a family of polynomial regressors. It includes an extensive empirical evaluation of five polynomial selection methods. The behavior of these five methods is analyzed with respect to variations in the number of training examples and the level of noise in the data. The paper A Probabilistic Logic-based Framework for Characterizing Knowledge Discovery in Databases by Y. Xie and V.V. Raghavan provides a formal logical foundation for data mining based on Bacchus' probability logic. The authors give formal definitions of "pattern" as well as its determiners, which were "previously unknown" and "potentially useful". They also propose a logic induction operator that defines a standard process through which all the potentially useful patterns embedded in the given data can be discovered. The paper A Careful Look at the Use of Statistical Methodology in Data Mining by N. Matloff presents a statistical foundation of data mining. The usage of statistics in data mining has typically been vague and informal, or even worse, seriously misleading.
This paper seeks to take the first step in remedying this problem by pairing precise mathematical descriptions of some of the concepts in KDD with practical interpretations and implications for specific KDD issues. The paper Justification and Hypothesis Selection in Data Mining by T.F. Fan, D.R. Liu, and C.J. Liau presents a precise formulation of Hume's induction problem in rough set-based decision logic and discusses its implications for research in data mining. Because of the justification problem in data mining, a mined rule is nothing more than a hypothesis from a logical viewpoint. Hence, hypothesis selection is of crucial importance for successful data mining applications. In this paper, the hypothesis selection issue is addressed in terms of two data mining contexts. The paper On Statistical Independence in a Contingency Table by S. Tsumoto gives a proof showing that statistical independence in a contingency table is a special type of linear independence, where the rank of a given table as a matrix is equal to 1. By relating the result with that in projective geometry, the author suggests that a contingency matrix can be interpreted in a geometrical way.

The papers in Part II are devoted to methods of data mining. There are nine papers in this category. The paper A Comparative Investigation on Model Selection in Binary Factor Analysis by Y. An, X. Hu, and L. Xu presents methods of binary factor analysis based on the framework of Bayesian Ying-Yang (BYY) harmony learning. They investigate the BYY criterion and BYY harmony learning with automatic model selection (BYY-AUTO) in comparison with typical existing criteria. Experiments have shown that the methods are either comparable with, or better than, the previous best results. The paper Extraction of Generalized Rules with Automated Attribute Abstraction by Y. Shidara, M. Kudo, and A. Nakamura proposes a novel method for mining generalized rules with high support and confidence.
Using the method, generalized rules can be obtained in which the abstraction of attribute values is implicitly carried out without the requirement of additional information, such as information on conceptual hierarchies. The paper Decision Making Based on Hybrid of Multi-knowledge and Naïve Bayes Classifier by Q. Wu et al. presents a hybrid approach to making decisions for unseen instances, or for instances with missing attribute values. In this approach, uncertain rules are introduced to represent multi-knowledge. The experimental results show that the decision accuracies for unseen instances are higher than those obtained by using other approaches in a single body of knowledge. The paper First-Order Logic Based Formalism for Temporal Data Mining by P. Cotofrei and K. Stoffel presents a formalism for a methodology whose purpose is the discovery of knowledge, represented in the form of general Horn clauses, inferred from databases with a temporal dimension. The paper offers the possibility of using statistical approaches in the design of algorithms for inferring higher order temporal rules, denoted as temporal meta-rules. The paper An Alternative Approach to Mining Association Rules by J. Rauch and M. Šimůnek presents an approach for mining association rules based on the representation of analyzed data by suitable strings of bits. The procedure, 4ft-Miner, which is the contemporary application of this approach, is described therein. The paper Direct Mining of Rules from Data with Missing Values by V. Gorodetsky, O. Karsaev, and V. Samoilov presents an approach to, and technique for, direct mining of binary data with missing values. It aims to extract classification rules whose premises are represented in a conjunctive form. The idea is to first generate two sets of rules serving as the upper and lower bounds for any other sets of rules corresponding to all arbitrary assignments of missing values.
Then, based on these upper and lower bounds, as well as a testing procedure and a classification criterion, a subset of rules for classification is selected. The paper Cluster Identification using Maximum Configuration Entropy by C.H. Li proposes a normalized graph sampling algorithm for clustering. The important question of how many clusters exist in a dataset and when to terminate the clustering algorithm is solved via computing the ensemble average change in entropy. The paper Mining Small Objects in Large Images Using Neural Networks by M. Zhang describes a domain independent approach to the use of neural networks for mining multiple class, small objects in large images. In the approach, the networks are trained by the back propagation algorithm with examples that have been taken from the large images. The trained networks are then applied, in a moving window fashion, over the large images to mine the objects of interest. The paper Improved Knowledge Mining with the Multimethod Approach by M. Lenič presents an overview of the multimethod approach to data mining and its concrete integration and possible improvements. This approach combines different induction methods in a unique manner by applying different methods to the same knowledge model in no predefined order. Although each method may contain inherent limitations, there is an expectation that a combination of multiple methods may produce better results.

The papers in Part III deal with issues related to knowledge discovery in a broad sense. This part contains four papers. The paper Posting Act Tagging Using Transformation-Based Learning by T. Wu et al.
presents the application of transformation-based learning (TBL) to the task of assigning tags to postings in online chat conversations. The authors describe the templates used for posting act tagging in the context of template selection, and extend traditional approaches used in part-of-speech tagging and dialogue act tagging by incorporating regular expressions into the templates. The paper Identification of Critical Values in Latent Semantic Indexing by A. Kontostathis, W.M. Pottenger, and B.D. Davison deals with the issue of information retrieval. The authors analyze the values used by Latent Semantic Indexing (LSI) for information retrieval. By manipulating the values in the Singular Value Decomposition (SVD) matrices, it has been found that a significant fraction of the values have little effect on overall performance, and can thus be removed (i.e., changed to zero). This makes it possible to convert the dense term-by-dimension and document-by-dimension matrices into sparse matrices by identifying and removing such values. The paper Reporting Data Mining Results in a Natural Language by P. Strossa, Z. Černý, and J. Rauch represents an attempt to report the results of data mining in automatically generated natural language sentences. An experimental software system, AR2NL, that can convert implicational rules into both English and Czech is presented. The paper An Algorithm to Calculate the Expected Value of an Ongoing User Session by S. Millán et al. presents an application of data mining methods to the analysis of information collected from consumer web sessions. An algorithm is given that makes it possible to calculate, at each point of an ongoing navigation, not only the possible paths a viewer may follow, but also the potential value of each possible navigation.

We would like to thank the referees for their efforts in reviewing the papers and providing valuable comments and suggestions to the authors. We are also grateful to all the contributors for their excellent work.
We hope that this book will be valuable and fruitful for data mining researchers, no matter whether they would like to uncover the fundamental principles behind data mining, or apply the theories to practical application problems.

San Jose, Tokyo, Taipei, Philadelphia, and Izumo
February, 2005

T.Y. Lin
S. Ohsuga
C.J. Liau
X. Hu
S. Tsumoto

References

1. T.Y. Lin and C.J. Liau (2002) Special Issue on the Foundation of Data Mining, Communications of Institute of Information and Computing Machinery, Vol. 5, No. 2, Taipei, Taiwan.
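
As an illustrative aside on one result summarized in the preface: the paper On Statistical Independence in a Contingency Table is described as showing that independence corresponds to the table having rank 1 as a matrix. The sketch below is not taken from that paper. It assumes a nonzero table of non-negative integer counts and checks the statement on two made-up tables, using the fact that a nonzero matrix has rank 1 exactly when all of its 2x2 minors vanish. The function names and the example counts are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def marginals(table):
    """Return row totals, column totals, and the grand total."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols, sum(rows)


def is_independent(table):
    """True when every cell equals (row total * column total) / grand total."""
    rows, cols, n = marginals(table)
    return all(
        Fraction(cell) == Fraction(rows[i] * cols[j], n)
        for i, r in enumerate(table)
        for j, cell in enumerate(r)
    )


def has_rank_at_most_one(table):
    """True when every 2x2 minor is zero, i.e. the matrix has rank <= 1."""
    n_rows, n_cols = len(table), len(table[0])
    return all(
        table[i][j] * table[k][l] == table[i][l] * table[k][j]
        for i, k in combinations(range(n_rows), 2)
        for j, l in combinations(range(n_cols), 2)
    )


# Made-up counts: the second row of `independent` is twice the first.
independent = [[10, 20, 30], [20, 40, 60]]
dependent = [[10, 20, 30], [20, 10, 60]]

for name, t in (("independent", independent), ("dependent", dependent)):
    print(name, is_independent(t), has_rank_at_most_one(t))
```

On these two tables the independence check and the rank check agree, which is the pattern the summarized result predicts. Exact arithmetic with `Fraction` is used so that the comparison does not depend on floating-point rounding.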
