T.Y. Lin, S. Ohsuga, C.J. Liau, X. Hu, S. Tsumoto (Eds.)
Foundations of Data Mining and Knowledge Discovery
Studies in Computational Intelligence, Volume 6
Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: kacprzyk@ibspan.waw.pl
Further volumes of this series
can be found on our homepage:
springeronline.com
Vol. 1. Tetsuya Hoya
Artificial Mind System – Kernel Memory Approach, 2005
ISBN 3-540-26072-2
Vol. 2. Saman K. Halgamuge, Lipo Wang (Eds.)
Computational Intelligence for Modelling and Prediction, 2005
ISBN 3-540-26071-4
Vol. 3. Bożena Kostek
Perception-Based Data Processing in Acoustics, 2005
ISBN 3-540-25729-2
Vol. 4. Saman Halgamuge, Lipo Wang (Eds.)
Classification and Clustering for Knowledge Discovery, 2005
ISBN 3-540-26073-0
Vol. 5. Da Ruan, Guoqing Chen, Etienne E. Kerre, Geert Wets (Eds.)
Intelligent Data Mining, 2005
ISBN 3-540-26256-3
Vol. 6. Tsau Young Lin, Setsuo Ohsuga, Churn-Jung Liau, Xiaohua Hu,
Shusaku Tsumoto (Eds.)
Foundations of Data Mining and Knowledge Discovery, 2005
ISBN 3-540-26257-1
Tsau Young Lin
Setsuo Ohsuga
Churn-Jung Liau
Xiaohua Hu
Shusaku Tsumoto
(Eds.)
Foundations of Data Mining
and Knowledge Discovery
Professor Tsau Young Lin
Department of Computer Science
San Jose State University
95192-0103, San Jose, CA
U.S.A.
E-mail: tylin@cs.sjsu.edu

Professor Setsuo Ohsuga
Emeritus Professor of
University of Tokyo
Tokyo
Japan
E-mail: ohsuga@fd.catv.ne.jp

Dr. Churn-Jung Liau
Institute of Information Science
Academia Sinica
128 Academia Road
Sec. II, 115 Taipei
Taiwan
E-mail: liaucj@iis.sinica.edu.tw

Professor Xiaohua Hu
College of Information Science and Technology
Drexel University
3141 Chestnut Street 19104-2875
Philadelphia
U.S.A.
E-mail: thu@cis.drexel.edu

Professor Shusaku Tsumoto
Department of Medical Informatics
Shimane Medical University
Enyo-cho 89-1, 693-8501
Izumo, Shimane-ken
Japan
E-mail: tsumoto@computer.org
Library of Congress Control Number: 2005927318
ISSN print edition: 1860-949X
ISSN electronic edition: 1860-9503
ISBN-10 3-540-26257-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-26257-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole
or part of the material is concerned, specifically the rights of translation,
reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilm or in any other way, and storage in data banks. Duplication of this
publication or parts thereof is permitted only under the provisions of the
German Copyright Law of September 9, 1965, in its current version, and
permission for use must always be obtained from Springer. Violations are
liable for prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2005
Printed in The Netherlands
The use of general descriptive names, registered names, trademarks, etc. in
this publication does not imply, even in the absence of a specific statement,
that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
Typesetting: by the authors and TechBooks using a Springer LaTeX macro package
Printed on acid-free paper SPIN: 11498186 55/TechBooks 543210
Preface
While the notion of knowledge is important in many academic disciplines such
as philosophy, psychology, economics, and artificial intelligence, the storage
and retrieval of data is the main concern of information science. In modern
experimental science, knowledge is usually acquired by observing such data,
and the cause-effect or association relationships between attributes of objects
are often observable in the data.
However, when the amount of data is large, it is difficult to analyze and
extract information or knowledge from it. Data mining is a scientific approach
that provides effective tools for extracting knowledge so that, with the aid of
computers, the large amount of data stored in databases can be transformed
into symbolic knowledge automatically.
Data mining, which is one of the fastest growing fields in computer science,
integrates various technologies including database management, statistics,
soft computing, and machine learning. We have also seen numerous applications
of data mining in medicine, finance, business, information security, and so
on. Many data mining techniques, such as association or frequent pattern
mining, neural networks, decision trees, inductive logic programming, fuzzy
logic, granular computing, and rough sets, have been developed. However,
such techniques have been developed, though vigorously, under rather ad hoc
and vague concepts. For further development, a close examination of their
foundations seems necessary. It is expected that this examination will lead
to new directions and novel paradigms.
The study of the foundations of data mining poses a major challenge for
the data mining research community. To meet such a challenge, we initiated
a preliminary workshop on the foundations of data mining. It was held on
May 6, 2002, at the Grand Hotel, Taipei, Taiwan, as part of the 6th
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-02).
This conference is recognized as one of the most important events for KDD
researchers in the Pacific-Asia area. The proceedings of the workshop were
published as a special issue in [1], and the success of the workshop has
encouraged us to organize an annual workshop on the foundations of data
mining. The workshop, which started in 2002, is held in conjunction with the
IEEE International Conference on Data Mining (ICDM). The goal is to bring
together individuals interested in the foundational aspects of data mining to
foster the exchange of ideas with each other, as well as with more
application-oriented researchers.
This volume is a collection of expanded versions of selected papers
originally presented at the IEEE ICDM 2002 workshop on the Foundation of Data
Mining and Discovery, and represents the state of the art for much of the
current research in data mining. Each paper has been carefully peer-reviewed
again to ensure journal quality. The following is a brief summary of this
volume’s contents.
The papers in Part I are concerned with the foundations of data mining
and knowledge discovery. There are eight papers in this part.¹ In the paper
Knowledge Discovery as Translation by S. Ohsuga, discovery is viewed
as a translation from non-symbolic to symbolic representation. A quantitative
measure is introduced into the syntax of predicate logic to measure the
distance between symbolic and non-symbolic representations. This makes
translation possible when there is little (or no) difference between
some symbolic representation and the given non-symbolic representation. In
the paper Mathematical Foundation of Association Rules - Mining Associations
by Solving Integral Linear Inequalities by T. Y. Lin, the author observes,
after examining the foundation, that high frequency expressions of attribute
values are the most general notion of patterns in association mining. Such
patterns, of course, include classical high frequency itemsets (as
conjunctions) and high level association rules. Based on this new notion, the
author shows that such patterns can be found by solving a finite set of linear
inequalities. The results are derived from the key notions of isomorphism and
canonical representations of relational tables. In the paper Comparative Study
of Sequential Pattern Mining Models by H.C. Kum, S. Paulsen, and W. Wang, the
problem of mining sequential patterns is examined. In addition, four
evaluation criteria are proposed for quantitatively assessing the quality of
the mined results from a wide variety of synthetic datasets with varying
randomness and noise levels. It is demonstrated that an alternative
approximate pattern model based on sequence alignment can better recover the
underlying patterns with little confounding information under all examined
circumstances, including those where the frequent sequential pattern model
fails. The paper Designing Robust Regression Models by M. Viswanathan and
K. Ramamohanarao presents a study of the preference among competing models
from a family of polynomial regressors. It includes an extensive empirical
evaluation of five polynomial selection methods. The behavior of these five
methods is analyzed with respect to variations in the number of training
examples and the level of noise in the data. The paper A Probabilistic
Logic-based Framework for Characterizing Knowledge Discovery in Databases by
Y. Xie and V.V. Raghavan provides a formal logical foundation for data mining
based on Bacchus’ probability logic. The authors give formal definitions of
“pattern” as well as its determiners, namely “previously unknown” and
“potentially useful”. They also propose a logic induction operator that
defines a standard process through which all the potentially useful patterns
embedded in the given data can be discovered. The paper A Careful Look at the
Use of Statistical Methodology in Data Mining by N. Matloff presents a
statistical foundation of data mining. The usage of statistics in data mining
has typically been vague and informal, or even worse, seriously misleading.
This paper seeks to take the first step in remedying this problem by pairing
precise mathematical descriptions of some of the concepts in KDD with
practical interpretations and implications for specific KDD issues. The paper
Justification and Hypothesis Selection in Data Mining by T.F. Fan, D.R. Liu,
and C.J. Liau presents a precise formulation of Hume’s induction problem in
rough set-based decision logic and discusses its implications for research in
data mining. Because of the justification problem in data mining, a mined rule
is nothing more than a hypothesis from a logical viewpoint. Hence, hypothesis
selection is of crucial importance for successful data mining applications. In
this paper, the hypothesis selection issue is addressed in terms of two data
mining contexts. The paper On Statistical Independence in a Contingency Table
by S. Tsumoto gives a proof showing that statistical independence in a
contingency table is a special type of linear independence, where the rank of
a given table as a matrix is equal to 1. By relating the result to that in
projective geometry, the author suggests that a contingency matrix can be
interpreted in a geometrical way.
¹ There were three keynotes and two plenary talks, given by S. Smale,
S. Ohsuga, L. Xu, H. Tsukimoto, and T. Y. Lin. Smale’s and Tsukimoto’s papers
are collected in the book Foundation and Advances of Data Mining, W. Chu and
T. Y. Lin (eds.).
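The rank-one characterization of statistical independence mentioned above can be checked on a toy example. The following is only a minimal sketch, not the construction used in the paper: the function names and the sample tables are hypothetical, and rank one is tested by verifying that every 2x2 minor of a non-zero table vanishes.

```python
def is_rank_one(table):
    """True if the table is non-zero and all of its 2x2 minors vanish."""
    if not any(any(row) for row in table):
        return False  # the zero matrix has rank 0, not 1
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for k in range(i + 1, n_rows):
            for j in range(n_cols):
                for m in range(j + 1, n_cols):
                    if table[i][j] * table[k][m] != table[i][m] * table[k][j]:
                        return False
    return True


def is_independent(table):
    """True if every cell satisfies n_ij * N == (row total) * (column total)."""
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return all(
        table[i][j] * total == row_totals[i] * col_totals[j]
        for i in range(len(table))
        for j in range(len(table[0]))
    )


# Hypothetical contingency tables of counts (illustrative data only).
independent_table = [[10, 20], [30, 60]]
dependent_table = [[10, 20], [30, 50]]

print(is_rank_one(independent_table), is_independent(independent_table))
print(is_rank_one(dependent_table), is_independent(dependent_table))
```

Integer counts are compared by cross-multiplication so that no floating-point tolerance is needed; on both sample tables the two checks agree.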
The papers in Part II are devoted to methods of data mining. There are
nine papers in this category. The paper A Comparative Investigation on Model
Selection in Binary Factor Analysis by Y. An, X. Hu, and L. Xu presents
methods of binary factor analysis based on the framework of Bayesian
Ying-Yang (BYY) harmony learning. The authors investigate the BYY criterion
and BYY harmony learning with automatic model selection (BYY-AUTO) in
comparison with typical existing criteria. Experiments have shown that the
methods are either comparable with, or better than, the previous best results.
The paper Extraction of Generalized Rules with Automated Attribute Abstraction
by Y. Shidara, M. Kudo, and A. Nakamura proposes a novel method for mining
generalized rules with high support and confidence. Using the method,
generalized rules can be obtained in which the abstraction of attribute values
is implicitly carried out without the requirement of additional information,
such as information on conceptual hierarchies. The paper Decision Making Based
on Hybrid of Multi-knowledge and Naïve Bayes Classifier by Q. Wu et al.
presents a hybrid approach to making decisions for unseen instances, or for
instances with missing attribute values. In this approach, uncertain rules are
introduced to represent multi-knowledge. The experimental results show that
the decision accuracies for unseen instances are higher than those obtained
by using other approaches in a single body of knowledge. The paper
First-Order Logic Based Formalism for Temporal Data Mining by P. Cotofrei and
K. Stoffel presents a formalism for a methodology whose purpose is the
discovery of knowledge, represented in the form of general Horn clauses,
inferred from databases with a temporal dimension. The paper offers the
possibility of using statistical approaches in the design of algorithms for
inferring higher order temporal rules, denoted as temporal meta-rules. The
paper An Alternative Approach to Mining Association Rules by J. Rauch and
M. Šimůnek presents an approach for mining association rules based on the
representation of analyzed data by suitable strings of bits. The procedure,
4ft-Miner, which is the contemporary application of this approach, is
described therein. The paper Direct Mining of Rules from Data with Missing
Values by V. Gorodetsky, O. Karsaev, and V. Samoilov presents an approach to,
and technique for, direct mining of binary data with missing values. It aims
to extract classification rules whose premises are represented in a
conjunctive form. The idea is to first generate two sets of rules serving as
the upper and lower bounds for any other sets of rules corresponding to all
arbitrary assignments of missing values. Then, based on these upper and lower
bounds, as well as a testing procedure and a classification criterion, a
subset of rules for classification is selected. The paper Cluster
Identification using Maximum Configuration Entropy by C.H. Li proposes a
normalized graph sampling algorithm for clustering. The important question of
how many clusters exist in a dataset and when to terminate the clustering
algorithm is solved by computing the ensemble average change in entropy. The
paper Mining Small Objects in Large Images Using Neural Networks by M. Zhang
describes a domain independent approach to the use of neural networks for
mining multiple class, small objects in large images. In the approach, the
networks are trained by the back propagation algorithm with examples that have
been taken from the large images. The trained networks are then applied, in a
moving window fashion, over the large images to mine the objects of interest.
The paper Improved Knowledge Mining with the Multimethod Approach by M. Lenič
presents an overview of the multimethod approach to data mining and its
concrete integration and possible improvements. This approach combines
different induction methods in a unique manner by applying different methods
to the same knowledge model in no predefined order. Although each method may
contain inherent limitations, there is an expectation that a combination of
multiple methods may produce better results.
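As a small illustration of the bit-string idea mentioned above for mining association rules, the sketch below encodes each condition as an integer whose bits mark the matching rows, then obtains the 2x2 contingency counts of a rule with bitwise operations. It is not the 4ft-Miner procedure itself: the function names, the toy columns, and the choice of support and confidence as measures are all assumptions made for illustration.

```python
def bitstring(column, value):
    """Encode 'column == value' as an int; bit i is set when row i matches."""
    bits = 0
    for i, cell in enumerate(column):
        if cell == value:
            bits |= 1 << i
    return bits


def count_bits(bits):
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")


def contingency_counts(antecedent, succedent, n_rows):
    """Return (a, b, c, d) for the rule antecedent => succedent.

    a: rows satisfying both, b: antecedent only,
    c: succedent only, d: neither.
    """
    mask = (1 << n_rows) - 1  # restricts negations to the real rows
    a = count_bits(antecedent & succedent)
    b = count_bits(antecedent & ~succedent & mask)
    c = count_bits(~antecedent & succedent & mask)
    d = n_rows - a - b - c
    return a, b, c, d


# Hypothetical toy data: one entry per row of the analyzed table.
smoker = ["y", "y", "n", "y", "n", "n"]
cough = ["y", "y", "n", "n", "y", "n"]

ant = bitstring(smoker, "y")
suc = bitstring(cough, "y")
a, b, c, d = contingency_counts(ant, suc, len(smoker))

support = a / len(smoker)   # fraction of rows satisfying both conditions
confidence = a / (a + b)    # fraction of antecedent rows that also satisfy the succedent
print((a, b, c, d), support, confidence)  # counts are (2, 1, 1, 2)
```

Once the bit strings are built, each rule is evaluated with a handful of word-level operations instead of a pass over the rows, which is the usual reason for choosing this representation.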
The papers in Part III deal with issues related to knowledge discovery in
a broad sense. This part contains four papers. The paper Posting Act Tagging
Using Transformation-Based Learning by T. Wu et al. presents the application
of transformation-based learning (TBL) to the task of assigning tags to
postings in online chat conversations. The authors describe the templates used
for posting act tagging in the context of template selection, and extend
traditional approaches used in part-of-speech tagging and dialogue act tagging
by incorporating regular expressions into the templates. The paper
Identification of Critical Values in Latent Semantic Indexing by
A. Kontostathis, W.M. Pottenger, and B.D. Davison deals with the issue of
information retrieval. The authors analyze the values used by Latent Semantic
Indexing (LSI) for information retrieval. By manipulating the values in the
Singular Value Decomposition (SVD) matrices, it has been found that a
significant fraction of the values have little effect on overall performance,
and can thus be removed (i.e., changed to zero). This makes it possible to
convert the dense term-by-dimension and document-by-dimension matrices into
sparse matrices by identifying and removing such values. The paper Reporting
Data Mining Results in a Natural Language by P. Strossa, Z. Černý, and
J. Rauch represents an attempt to report the results of data mining in
automatically generated natural language sentences. An experimental software
system, AR2NL, that can convert implicational rules into both English and
Czech is presented. The paper An Algorithm to Calculate the Expected Value of
an Ongoing User Session by S. Millán et al. presents an application of data
mining methods to the analysis of information collected from consumer web
sessions. An algorithm is given that makes it possible to calculate, at each
point of an ongoing navigation, not only the possible paths a viewer may
follow, but also the potential value of each possible navigation.
We would like to thank the referees for their efforts in reviewing the papers
and providing valuable comments and suggestions to the authors. We are also
grateful to all the contributors for their excellent work. We hope that this
book will be valuable and fruitful for data mining researchers, whether they
would like to uncover the fundamental principles behind data mining or apply
the theories to practical application problems.
San Jose, Tokyo, Taipei, Philadelphia, and Izumo T.Y. Lin
February, 2005 S. Ohsuga
C.J. Liau
X. Hu
S. Tsumoto
References
1. T.Y. Lin and C.J. Liau (2002) Special Issue on the Foundation of Data Mining,
Communications of Institute of Information and Computing Machinery, Vol. 5,
No. 2, Taipei, Taiwan.