Intelligent Systems Reference Library, Volume 41

Editors-in-Chief

Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: [email protected]

Prof. Lakhmi C. Jain
School of Electrical and Information Engineering
University of South Australia
Adelaide
South Australia SA 5095
Australia
E-mail: [email protected]

For further volumes: http://www.springer.com/series/8578

Igor Chikalov, Vadim Lozin, Irina Lozina, Mikhail Moshkov, Hung Son Nguyen, Andrzej Skowron, and Beata Zielosko

Three Approaches to Data Analysis
Test Theory, Rough Sets and Logical Analysis of Data

Authors

Igor Chikalov
Mathematical and Computer Sciences and Engineering Division
King Abdullah University of Science and Technology
Thuwal
Saudi Arabia

Vadim Lozin
Mathematics Institute
The University of Warwick
Coventry
United Kingdom

Irina Lozina
Mathematics Institute
The University of Warwick
Coventry
United Kingdom

Mikhail Moshkov
Mathematical and Computer Sciences and Engineering Division
King Abdullah University of Science and Technology
Thuwal
Saudi Arabia

Hung Son Nguyen
Institute of Mathematics
The University of Warsaw
Warsaw
Poland

Andrzej Skowron
Institute of Mathematics
The University of Warsaw
Warsaw
Poland

Beata Zielosko
Mathematical and Computer Sciences and Engineering Division
King Abdullah University of Science and Technology
Thuwal
Saudi Arabia
and
Institute of Computer Science
University of Silesia
Sosnowiec
Poland

ISSN 1868-4394, e-ISSN 1868-4408
ISBN 978-3-642-28666-7, e-ISBN 978-3-642-28667-4
DOI 10.1007/978-3-642-28667-4
Springer Heidelberg New York Dordrecht London
Library of Congress Control Number: 2012933770
© Springer-Verlag Berlin Heidelberg 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

To the memory of Peter L. Hammer, Zdzisław I. Pawlak and Sergei V. Yablonskii

Preface

In this book, we consider the following three approaches to data analysis:

• Test Theory (TT), founded by Sergei V. Yablonskii (1924-1998); the first publications [5, 11] appeared in 1955 and 1958,
• Rough Sets (RS), founded by Zdzisław I. Pawlak (1926-2006); the first publications [8, 9] appeared in 1981 and 1982 (see also the book by Pawlak [10]),
• Logical Analysis of Data (LAD), founded by Peter L. Hammer (1936-2006); the first publications [6, 7] appeared in 1986 and 1988.

These three approaches have much in common. For example, they are all related to Boolean functions and Boolean reasoning, with roots in the works of George Boole (see, e.g., [1-4]).
However, we quite often observe that researchers active in one of these areas have limited knowledge of the results and methods developed in the other two. On the other hand, each of the approaches shows some originality, and we believe that the exchange of knowledge can stimulate further development of each of them. This can lead to new theoretical results and real-life applications. In particular, we expect new results based on a combination of these three data analysis approaches.

It would be very interesting to make a comprehensive comparative analysis of the three approaches. However, in the present book, we restrict ourselves to a simpler task and present only a detailed overview of each of the three approaches. To make the reading easier, in the preface, we give a brief comparison of the main notions used in TT, RS and LAD, and a short outline of the overviews.

All three data analysis approaches use decision tables for data representation. A decision table $T$ is a rectangular table with $n$ columns labeled with conditional attributes $a_1, \ldots, a_n$. This table is filled with values of the (conditional) attributes $a_1, \ldots, a_n$, and each row of the table is labeled by a value of the decision attribute $d$. There are different types of data and different problems associated with them.

A decision table is said to be consistent if each combination of values of conditional attributes uniquely determines the value of the decision attribute, and inconsistent otherwise. In TT and LAD, only consistent decision tables are considered, while RS allows inconsistent decision tables. In TT, decision tables are also named test tables.

LAD interprets a decision table as a partially defined function $d = f(a_1, \ldots, a_n)$, in which case conditional attributes are called variables. The rows of the table describe the values of the variables for which the value of the function (i.e., the decision) is known. If all the attributes are binary, the decision table represents a partially defined Boolean function.
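The notions of decision table and consistency can be illustrated with a minimal Python sketch. The representation (a list of pairs, each holding a tuple of conditional attribute values and a decision) and the name `is_consistent` are assumptions made for this illustration only; they are not notation used by the authors.

```python
from typing import Hashable, Sequence

# Hypothetical representation: each row is (values of a_1..a_n, decision d).
Row = tuple[Hashable, ...]
DecisionTable = Sequence[tuple[Row, Hashable]]


def is_consistent(table: DecisionTable) -> bool:
    """Return True if equal tuples of conditional values always carry the same decision."""
    seen: dict[Row, Hashable] = {}
    for values, decision in table:
        # setdefault stores the first decision seen for this tuple of values
        # and returns it on every later occurrence of the same tuple.
        if seen.setdefault(values, decision) != decision:
            return False
    return True


# Two binary conditional attributes; the rows define a partial Boolean function.
consistent_table = [((0, 0), "no"), ((0, 1), "yes"), ((1, 1), "yes")]
# Adding a row that repeats (0, 1) with another decision breaks consistency.
inconsistent_table = consistent_table + [((0, 1), "no")]
```

Under this representation, a consistent table is exactly one that can be read as a partially defined function from value tuples to decisions, which matches the LAD interpretation described above.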
A typical problem in all three approaches is the problem of revealing functional dependencies between conditional attributes and the decision attribute. However, the three approaches use different terminology related to this problem.

In RS, a super-reduct is a set of conditional attributes that gives us the same information about the decision attribute as the whole set of conditional attributes. In other words, a super-reduct is a set of conditional attributes such that any two rows that differ on the whole set of conditional attributes and have different decisions also differ on the attributes from this set. A reduct is a minimal super-reduct, i.e., a super-reduct not including any other super-reduct as its proper subset. In TT, the notion of test corresponds to the notion of super-reduct, and the notion of dead-end test corresponds to the notion of reduct. In LAD, the notion of support set corresponds to the notions of super-reduct and test.

In RS, a decision rule is a relation of the following form:

$$(a_{i_1} = b_1) \wedge \ldots \wedge (a_{i_m} = b_m) \rightarrow (d = b),$$

where $a_{i_1}, \ldots, a_{i_m}$ are conditional attributes, $d$ is the decision attribute, and $b_1, \ldots, b_m, b$ are values of the attributes $a_{i_1}, \ldots, a_{i_m}, d$, respectively. Here, we assume that the above decision rule is true in the table $T$, i.e., each row of $T$ having $b_1, \ldots, b_m$ at the intersection with the columns $a_{i_1}, \ldots, a_{i_m}$ is labeled with the decision $b$. In TT, decision rules are also called representative tuples. In LAD, the notion of pattern corresponds to the notion of decision rule. Decision rules are often generated on the basis of different kinds of reducts.
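The definitions of super-reduct (test, support set), reduct (dead-end test) and true decision rule can be checked directly on a small table. The following Python sketch is only an illustration: the table, the 0-based attribute indices and the function names `is_super_reduct`, `is_reduct` and `rule_is_true` are hypothetical, and the brute-force checks are not the algorithms discussed in the book.

```python
from itertools import combinations

# Hypothetical table: rows are (values of a_1, a_2, a_3), decision d.
# Attributes are addressed by 0-based column index in this sketch.
table = [
    ((0, 0, 1), "no"),
    ((0, 1, 1), "yes"),
    ((1, 0, 0), "yes"),
    ((1, 1, 0), "yes"),
]


def is_super_reduct(table, attrs):
    """Rows that differ overall and have different decisions must differ on attrs."""
    for (r1, d1), (r2, d2) in combinations(table, 2):
        if d1 != d2 and r1 != r2 and all(r1[i] == r2[i] for i in attrs):
            return False
    return True


def is_reduct(table, attrs):
    """A reduct is a super-reduct with no proper subset that is a super-reduct.

    Any superset of a super-reduct is again a super-reduct, so it is enough
    to check that removing any single attribute breaks the property.
    """
    attrs = set(attrs)
    return is_super_reduct(table, attrs) and all(
        not is_super_reduct(table, attrs - {a}) for a in attrs
    )


def rule_is_true(table, conditions, decision):
    """conditions maps column index -> required value.

    The rule is true if every row matching all conditions carries `decision`;
    a rule matched by no row is vacuously true under this definition.
    """
    return all(
        d == decision
        for row, d in table
        if all(row[i] == v for i, v in conditions.items())
    )
```

In this example, `{0, 1}` and `{1, 2}` are reducts, the full set `{0, 1, 2}` is a super-reduct but not a reduct, and the rule $(a_1 = 1) \rightarrow (d = \text{yes})$ is true in the table, while $(a_2 = 0) \rightarrow (d = \text{no})$ is not.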
A decision tree for a given decision table $T$ is a rooted directed tree in which nonterminal nodes are labeled with conditional attributes, terminal nodes are labeled with values of the decision attribute, and the edges coming out of any nonterminal node are labeled with pairwise different values of the conditional attribute attached to this node. The computation of a decision tree on a given row is defined in a natural way. It is required that, for each row of the table $T$, the computation of the decision tree ends at a terminal node which is labeled with the decision attached to the considered row. The notion of decision tree is commonly used in TT and it also finds applications in RS and LAD. In TT, decision trees are also called conditional tests.

A common problem for TT, RS and LAD is the classification problem, i.e., the problem of finding the value of the decision attribute based on the values of the conditional attributes. Tests (reducts, support sets), decision rules (patterns) and decision trees (conditional tests) are important tools for dealing with this problem. To solve the classification problem, we construct a classification algorithm (also known as classifier, predictor, model). In constructing classifiers, one can distinguish two main approaches.

In the first approach, which is typical for problems in computational geometry, discrete optimization and fault diagnosis, it is assumed that the decision table represents a complete description of the universe, in which case the efforts are focused on optimizing the classification algorithms in terms of time and space complexity.

In the second approach, which is typical for experimental and statistical data analysis, it is assumed that the decision table represents the universe only partially and the main task of the classifier is to predict the unseen part of the universe. In this case, the accuracy of prediction is, usually, more important than the complexity of the classifier, although the description length is an important issue in searching for high quality classifiers in all three approaches.
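The definition of a decision tree given earlier in this preface, and the "natural" computation on a row, can be sketched in Python as follows. The classes `Leaf` and `Node`, the functions `compute` and `is_decision_tree_for`, and the example table are assumptions introduced for illustration; the book does not prescribe this representation.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    decision: object  # terminal node: a value of the decision attribute


@dataclass
class Node:
    attribute: int   # nonterminal node: 0-based index of a conditional attribute
    children: dict   # edge label (attribute value) -> subtree; dict keys are pairwise different


def compute(tree, row):
    """Follow the edge labeled with the row's value of the node's attribute until a leaf."""
    while isinstance(tree, Node):
        tree = tree.children[row[tree.attribute]]
    return tree.decision


def is_decision_tree_for(tree, table):
    """Check that the computation on every row ends in that row's decision."""
    try:
        return all(compute(tree, row) == d for row, d in table)
    except KeyError:
        # Some row carries an attribute value with no outgoing edge.
        return False


# Hypothetical table: rows are (values of a_1, a_2, a_3), decision d.
table = [
    ((0, 0, 1), "no"),
    ((0, 1, 1), "yes"),
    ((1, 0, 0), "yes"),
    ((1, 1, 0), "yes"),
]

# Ask a_1 first; if a_1 = 0, ask a_2. The attribute a_3 is never needed.
tree = Node(0, {1: Leaf("yes"), 0: Node(1, {0: Leaf("no"), 1: Leaf("yes")})})
```

Storing the outgoing edges in a dictionary keyed by attribute value is one simple way to guarantee that the edges leaving a node have pairwise different labels, as the definition requires.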
In addition to building a classifier, each of the three approaches has also developed numerous methods to solve a number of accompanying problems such as reduct generation, decision rule (pattern) generation, feature selection, discretization (binarization), symbolic value grouping, inducing classifiers, or clustering. In RS, the algorithms that solve these problems are often based on (approximate) Boolean reasoning, and in LAD, on optimization, combinatorics, and the theory of Boolean functions.

The book consists of the preface, three main parts devoted to TT, RS and LAD respectively, and final remarks. Each main part ends with a note about the founder of the corresponding theory.

Test Theory. The first part of the book, written by Igor Chikalov, Mikhail Moshkov and Beata Zielosko, is devoted to Test Theory (TT). This theory was created in the mid-1950s as a tool for solving problems of control and diagnosis of faults in circuits. In the mid-1960s, the methods of TT were extended to prediction problems in such domains as geology and medicine. This part consists of the chapter "Test Theory: Tools and Applications" and a note about the founder of TT, Sergei V. Yablonskii.

In the chapter, we consider three main areas of TT: (i) theoretical results related to tests, (ii) applications to control and diagnosis of faults, and (iii) applications to pattern recognition (prediction). We also discuss three less known areas of TT associated mainly with our research interests: the results of studies on infinite or large finite sets of attributes, and applications to discrete optimization and mathematical linguistics. We also give a common view on tests, decision trees and decision rule systems, which can be placed at the intersection of TT and RS.

The chapter consists of seven sections.

The first three sections include theoretical results on tests, trees and rules. In the first section, we consider bounds on complexity and algorithms for construction of tests, rules and trees from a, in some sense, uniform point of view. In the second section, we present results on the minimum length (cardinality) of tests and the number of reducts.
This is the best-known area of TT research. In the third section, we study problems over infinite or large finite sets of attributes. Such problems often arise in discrete optimization and computational geometry.

The following three sections deal with applications of TT. The fourth section is devoted to the TT methods for prediction. In the next two sections, we discuss applications of TT to problems with complete information, i.e., when all possible tuples of attribute values for the considered problem are given. The fifth section is dedicated to the most developed area of TT applications, i.e., control and diagnosis of faults. In the sixth section, we study problems of discrete optimization and recognition problems for words from formal languages. The last section contains conclusions.

Rough Sets. The second part of this book, written by Hung Son Nguyen and Andrzej Skowron, is dedicated to rough sets as a tool to deal with imperfect data, in particular, with vague concepts. In the development of rough set theory and its applications, one can distinguish three main stages. At the beginning, researchers concentrated on descriptive properties such as reducts of information systems preserving indiscernibility relations, and descriptions of concepts or classifications [8-10]. Next, they moved to applications of rough sets in machine learning, pattern recognition, and data mining. After gaining some experience, they developed the foundations for inductive reasoning leading to, e.g., inducing classifiers. The first period was based on the assumption that objects are perceived by means of partial information represented by attributes. In the second period, the assumption was also used that information about the approximated concepts is partial, too.
Approximation spaces and searching strategies for relevant approximation spaces were recognized as the basic tools for rough sets. Important achievements both in theory and applications were obtained using Boolean reasoning and approximate Boolean reasoning applied, e.g., in searching for relevant features, discretization, symbolic value grouping, or, in a more general sense, in searching for relevant approximation spaces. Nowadays, we observe that a new period is emerging in which two new important topics are investigated: (i) strategies for discovering relevant (complex) contexts of analyzed objects or granules, which are strongly related to the information granulation process and granular computing, and (ii) interactive computations on granules. Both directions aim at developing tools for approximation of complex vague concepts such as behavioral patterns or adaptive strategies, making it possible to achieve satisfactory quality of the resulting interactive computations. This chapter presents this development from the rudiments of rough sets to some challenges.

In more detail, the contents of this chapter are as follows. The chapter starts with a short discussion on vague concepts. Next, the basic concepts of rough set theory are recalled, including indiscernibility and discernibility relations, approximation of concepts, rough sets, decision rules, dependencies, reducts, discernibility and Boolean reasoning as the main methodology used in developing algorithms and heuristics based on rough sets, and also rough membership functions. Next, some extensions of the rough set approach are briefly presented. In the next part of this chapter, the relationship of the rough set approach to inductive reasoning is discussed. In particular, an outline of the rough set approach to inducing relevant approximation spaces and rough set-based classifiers is given.
Also, some comments on the relationship between the rough set approach and higher order vagueness are included. In the following part of the chapter, an extension of the rough set approach from concept approximation to approximation of ontologies is presented.

The rough set approach to scalability in data mining, based on the combination of rough sets and Boolean reasoning, is discussed in the following section of this chapter. Some comments on relationships of rough sets and logic are also included.

Finally, some challenging issues for rough sets are included in the last section of this chapter. Interactive Granular Computing (IGC) and, in particular, Interactive Rough Granular Computing (IRGC) are proposed as a framework that makes it possible to search for solutions to problems related to inducing of relevant contexts, process mining and perception based computing (PBC).

Logical Analysis of Data. The third part of the book was written by Vadim Lozin and Irina Lozina and is devoted to Logical Analysis of Data (LAD), the youngest of the three approaches. The idea of LAD was first described by Peter L. Hammer in a lecture given in 1986 at the International Conference on Multi-attribute Decision Making via OR-based Expert Systems [7] and was later expanded and developed in [6]. That first publication was followed by a stream of research studies. In early publications, the focus of research was on theoretical developments and on computational implementation. In recent years, attention was concentrated on practical applications varying from medicine to credit risk ratings. Following this pattern, we divided the chapter devoted to LAD into three main sections: Theory, Methodology and Applications.

In the section devoted to theory, we define the main notions used in Logical Analysis of Data, such as partially defined Boolean function and pattern, and discuss various problems associated with these notions. This section is intended mainly for theoreticians and can be skipped, except possibly the first subsection devoted to terminology and notation, by those who are interested in practical implementations of the methodology.
The section devoted to methodology describes the main steps in Logical Analysis of Data, which include binarization, attribute selection, pattern generation, model construction and validation. We also outline specific algorithms implementing these steps.

In the section devoted to applications, we illustrate the LAD methodology with a number of particular examples, such as estimating passenger show rates in the airline industry and credit risk ratings.

We do hope that this book will stimulate intensive and successful research on the relationships between the three data analysis approaches as well as the development of new methods based on a combination of the existing methods.

Igor Chikalov, Vadim Lozin, Irina Lozina, Mikhail Moshkov, Hung Son Nguyen, Andrzej Skowron, Beata Zielosko
Coventry, Thuwal, Warsaw, January 2012

References

1. Blake, A.: Canonical Expressions in Boolean Algebra. Dissertation, Dept. of Mathematics, University of Chicago, 1937. University of Chicago Libraries (1938)
2. Boole, G.: The Mathematical Analysis of Logic. G. Bell, London (1847) (reprinted by Philosophical Library, New York, 1948)
3. Boole, G.: An Investigation of the Laws of Thought. Walton, London (1854) (reprinted by Dover Books, New York, 1954)
4. Brown, F.: Boolean Reasoning. Kluwer Academic Publishers, Dordrecht (1990)
5. Chegis, I.A., Yablonskii, S.V.: Logical methods of control of work of electric schemes. Trudy Mat. Inst. Steklov. 51, 270–360 (1958) (in Russian)
6. Crama, Y., Hammer, P.L., Ibaraki, T.: Cause-effect relationships and partially defined Boolean functions. Annals of Operations Research 16, 299–326 (1988)
7. Hammer, P.L.: Partially defined Boolean functions and cause-effect relationships. In: International Conference on Multi-attribute Decision Making via OR-based Expert Systems. University of Passau, Passau, Germany, April (1986)
8. Pawlak, Z.: Rough sets. Basic notions. ICS PAS Reports 431/81. Institute of Computer Science Polish Academy of Sciences (ICS PAS), pp. 1–12, Warsaw, Poland (1981)
9. Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11, 341–356 (1982)
10. Pawlak, Z.: Rough Sets - Theoretical Aspects of Reasoning about Data. System Theory, Knowledge Engineering and Problem Solving 9, Kluwer Academic Publishers, Boston, Dordrecht (1991)
11. Yablonskii, S.V., Chegis, I.A.: On tests for electric circuits. Uspekhi Mat. Nauk 10, 182–184 (1955) (in Russian)