
Mining of Data with Complex Structures PDF

339 Pages·2010·5.378 MB·English

Preview Mining of Data with Complex Structures

Fedja Hadzic, Henry Tan, and Tharam S. Dillon
Mining of Data with Complex Structures

Studies in Computational Intelligence, Volume 333

Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com

Vol. 311. Juan D. Velásquez and Lakhmi C. Jain (Eds.)
Advanced Techniques in Web Intelligence, 2010
ISBN 978-3-642-14460-8

Vol. 312. Patricia Melin, Janusz Kacprzyk, and Witold Pedrycz (Eds.)
Soft Computing for Recognition based on Biometrics, 2010
ISBN 978-3-642-15110-1

Vol. 313. Imre J. Rudas, János Fodor, and Janusz Kacprzyk (Eds.)
Computational Intelligence in Engineering, 2010
ISBN 978-3-642-15219-1

Vol. 314. Lorenzo Magnani, Walter Carnielli, and Claudio Pizzi (Eds.)
Model-Based Reasoning in Science and Technology, 2010
ISBN 978-3-642-15222-1

Vol. 315. Mohammad Essaaidi, Michele Malgeri, and Costin Badica (Eds.)
Intelligent Distributed Computing IV, 2010
ISBN 978-3-642-15210-8

Vol. 316. Philipp Wolfrum
Information Routing, Correspondence Finding, and Object Recognition in the Brain, 2010
ISBN 978-3-642-15253-5

Vol. 317. Roger Lee (Ed.)
Computer and Information Science 2010
ISBN 978-3-642-15404-1

Vol. 318. Oscar Castillo, Janusz Kacprzyk, and Witold Pedrycz (Eds.)
Soft Computing for Intelligent Control and Mobile Robotics, 2010
ISBN 978-3-642-15533-8

Vol. 319. Takayuki Ito, Minjie Zhang, Valentin Robu, Shaheen Fatima, Tokuro Matsuo, and Hirofumi Yamaki (Eds.)
Innovations in Agent-Based Complex Automated Negotiations, 2010
ISBN 978-3-642-15611-3

Vol. 320. xxx

Vol. 321. Dimitri Plemenos and Georgios Miaoulis (Eds.)
Intelligent Computer Graphics 2010
ISBN 978-3-642-15689-2

Vol. 322. Bruno Baruque and Emilio Corchado (Eds.)
Fusion Methods for Unsupervised Learning Ensembles, 2010
ISBN 978-3-642-16204-6

Vol. 323. Yingxu Wang, Du Zhang, and Witold Kinsner (Eds.)
Advances in Cognitive Informatics, 2010
ISBN 978-3-642-16082-0

Vol. 324. Alessandro Soro, Vargiu Eloisa, Giuliano Armano, and Gavino Paddeu (Eds.)
Information Retrieval and Mining in Distributed Environments, 2010
ISBN 978-3-642-16088-2

Vol. 325. Quan Bai and Naoki Fukuta (Eds.)
Advances in Practical Multi-Agent Systems, 2010
ISBN 978-3-642-16097-4

Vol. 326. Sheryl Brahnam and Lakhmi C. Jain (Eds.)
Advanced Computational Intelligence Paradigms in Healthcare 5, 2010
ISBN 978-3-642-16094-3

Vol. 327. Slawomir Wiak and Ewa Napieralska-Juszczak (Eds.)
Computational Methods for the Innovative Design of Electrical Devices, 2010
ISBN 978-3-642-16224-4

Vol. 328. Raoul Huys and Viktor K. Jirsa (Eds.)
Nonlinear Dynamics in Human Behavior, 2010
ISBN 978-3-642-16261-9

Vol. 329. Santi Caballé, Fatos Xhafa, and Ajith Abraham (Eds.)
Intelligent Networking, Collaborative Systems and Applications, 2010
ISBN 978-3-642-16792-8

Vol. 330. Steffen Rendle
Context-Aware Ranking with Factorization Models, 2010
ISBN 978-3-642-16897-0

Vol. 331. Athena Vakali and Lakhmi C. Jain (Eds.)
New Directions in Web Data Management 1, 2011
ISBN 978-3-642-17550-3

Vol. 332. Jianguo Zhang, Ling Shao, Lei Zhang, and Graeme A. Jones (Eds.)
Intelligent Video Event Analysis and Understanding, 2011
ISBN 978-3-642-17553-4

Vol. 333. Fedja Hadzic, Henry Tan, and Tharam S. Dillon
Mining of Data with Complex Structures, 2011
ISBN 978-3-642-17556-5

Dr. Fedja Hadzic
Digital Ecosystems and Business Intelligence Institute,
Curtin University
GPO Box U1987
Perth, Western Australia 6845
Australia

Prof. Tharam S. Dillon
Digital Ecosystems and Business Intelligence Institute,
Curtin University
GPO Box U1987
Perth, Western Australia 6845
Australia

Dr. Henry Tan
183rd St. SE., 3514
98012 Bothel
Washington
USA

ISBN 978-3-642-17556-5
e-ISBN 978-3-642-17557-2
DOI 10.1007/978-3-642-17557-2
Studies in Computational Intelligence ISSN 1860-949X

© 2011 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks.
Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

springer.com

I would like to equally dedicate this book to my neglected darling Mishele for her everlasting love, patience and understanding during my whole research career and to my mother, who has provided strong support to me throughout my life and inspiration to take a research path.
Fedja

I thank my wife Theresia and my two daughters Enrica & Eidee for consistently giving me courage and inspiration. Last but not least, I devote the book to my parents, who have been instrumental in guiding my life and encouraging me to succeed.
Henry

We would all like to thank Professor Elizabeth Chang for providing a unique environment within the Digital Ecosystems and Business Intelligence Institute that allowed us to concentrate on top level research.
Tharam, Fedja and Henry

Foreword

With the rapid development of computer technology and applications, data collected is mounting up both in size and in complexity on interconnections and structures. Thus, “mining of data with complex structures” becomes an increasingly important task in data mining.
Although there are many books on data mining, this is a unique book dedicated to mining of data with complex structures, especially on tree structures. Tree structures have many important applications. XML documents, ontological structures, many semantic structures on the internet, structures in many social and economic organizations, Weblog structures, patient records in healthcare, and so on are tree-structured data.

Despite of the existence of a lot of general data mining algorithms and methods, tree-structure data mining deserves dedicated study and in-depth treatment because of its unique nature of structure and ordering, which leads to many interesting knowledge to be discovered, including simple subtree patterns, ordered subtree patterns, distance-constrained embedded subtrees, various kinds of application-oriented subtree patterns, and so on; and these kinds of patterns will naturally promote the development of new pattern analysis methods. From the discussions of various kinds of tree pattern mining methods, one can see that tree pattern mining contains many challenging research problems, and recent research has made good contributions to the understanding and solving the problems. Moreover, starting with tree pattern mining, this book also discusses methods for mining several other kinds of patterns with complex structures, including frequent subsequences and subgraphs.

This book, by Fedja Hadzic, Henry Tan, and Tharam S. Dillon provides a comprehensive coverage on these topics timely, with conciseness and clear organization. The authors of the book are active researchers on tree pattern mining and have made good contributions to the progress of this dynamic research theme. This ensures that the book is authoritative and reflects the current state of the art. Nevertheless, the book gives a balanced treatment on a wide spectrum of topics, well beyond the authors’ own methodologies and research scopes.

Mining data with complex structures is still a fairly young and dynamic research field.
This book may serve researchers and application developers a comprehensive overview of the general concepts, techniques, and applications on mining of data with complex structures and help them explore this exciting field and develop new methods and applications. It may also serve graduate students and other interested readers a general introduction to the state-of-the-art of this promising research theme.

I find the book is enjoyable to read. I hope you like it too.

Professor Jiawei Han
University of Illinois

Preface

For many practical applications in domains such as biology, chemistry, network analysis, Web Intelligence applications, the expressional power of relational data is not capable of effectively capturing the necessary relationships and semantics that need to be expressed in the domain. This gave rise to semi-structured data sources, capable of dealing with 2-dimensional relationships among data entities that are manifested through structural relationships among attribute nodes. Some examples are XML databases, RDF databases, molecular databases, graph databases, etc. Generally speaking, developing data mining methods for mining of data with complex structures is a non-trivial task and many issues exist that need to be carefully considered. The proper use of such methods needs to be explained to the practitioners in the application that may not be so familiar with the area. In the proposed book we intend to precisely define the existing problems associated with mining of data with complex structures, including trees, graphs and sequences, with a stronger focus on tree-structured data. The implications and possible applications for mining of different subpattern types under different constraints are discussed. We look at the current approaches for solving the different sub-problems in the area and discuss their advantages and disadvantages. A number of important application areas are discussed, where we explain how the described methods can effectively be used for the extraction of knowledge patterns useful for the domain.

The book overview is as follows.
An introduction to the general aspects of the field of knowledge discovery and data mining is presented in Chapter 1, together with an overview of the sources of data with complex structures and the challenges of mining such data. Chapter 2 is concerned with the problem of mining frequent patterns from data where the underlying information can be represented as a tree. We first discuss the motivation behind the problem together with an explanation of how a great deal of information can be effectively represented as a tree structure. We then show the importance of extracting frequent patterns from such data, known as the frequent subtree mining problem. This problem is the main focus of this book and detailed definitions of some general tree concepts are provided together with many specific terms necessary for understanding the problem in general. This involves the types of subtree patterns considered and existing frequency criteria definitions. In Chapter 3 we look at the major issues that arise when developing algorithms for the frequent subtree mining problem. A number of different approaches are discussed and their advantages and disadvantages highlighted. At the end of the chapter, an overview of the existing frequent subtree mining algorithms is provided. The Tree Model Guided (TMG) framework for frequent subtree mining is discussed in Chapter 4. Here we discuss the underlying strategy of the TMG framework when approaching each of the implementation aspects discussed in Chapter 3. Chapter 5 explains in detail the general mechanism of the TMG framework and its most generic implementation for mining of induced/embedded ordered subtrees. We also provide a mathematical model of the worst case analysis of a model-guided enumeration approach, and use this model to theoretically and practically show the differences in complexity between mining of induced and embedded subtrees. The approach is experimentally evaluated by comparing it with the current state-of-the-art techniques.
A number of different real-world and synthetic datasets with varying tree characteristics are used to highlight the important differences and advantages/disadvantages of the different approaches. We then explain the necessary extensions to this general TMG framework to enable its use for mining of induced/embedded unordered, and ordered/unordered distance-constrained embedded subtrees, in Chapter 6 and Chapter 7, respectively. Each extension is accompanied with a motivation of the problem, and a number of experiments with real world and synthetic datasets and comparisons to the current state-of-the-art approaches (when available). The problem of mining maximal and closed frequent subtrees is addressed in Chapter 8. A number of existing solution techniques are described, with one popular algorithm being explained in more detail. Chapter 9 takes the frequent subtree mining problem and places it within the context of general knowledge analysis. The implications behind mining different subtree types using different support definitions and constraints are highlighted with the aid of motivating examples. A number of independent applications are then discussed related to analysis of health information, web data, knowledge structures for the purpose of matching, and protein structures. The problem of mining sequential data is addressed in Chapter 10. It starts with the necessary formulations of the problem, and looks at some existing sequence mining algorithms and applications. By considering a sequence as a special type of tree, we explain the way that the TMG framework can be used to mine sequential data. In Chapter 11, the graph mining problem is formally defined and existing approaches to the problem explained, with an overview of some existing algorithms. To conclude the book, in Chapter 12, we consider several future research directions in the area of frequent subtree mining, and discuss a number of emerging applications of the techniques to some important research areas.

Author Introduction

Dr Fedja Hadzic received his PhD from Curtin University of Technology in 2008.
His PhD thesis is entitled: “Advances in Knowledge Learning Methodologies and their Applications”. He is currently a Research Fellow at the Digital Ecosystems and Business Intelligence Institute of the Curtin University of Technology. He has contributed in a number of fields of data mining and knowledge discovery and published his work in a number of refereed conferences and journals. His research interests include data mining and AI in general with more focus on tree mining, graph mining, neural networks, knowledge matching and ontology learning.

Dr Henry Tan received his PhD from University of Technology, Sydney in 2008. His PhD thesis is entitled: “Tree Model Guided (TMG) Enumeration as the Basis for Mining Frequent Patterns from XML Documents”. He has become an expert in the field of data mining with a number of publications in refereed conferences and journals. Other research interests include AI, neural networks, game and software development.

Professor Tharam Dillon is an expert in the field of software engineering, data mining, XML based systems, ontologies, trust, security and component-oriented access control. Professor Dillon has published five authored books and four co-edited books. He has also published over 750 scientific papers in refereed journals and conferences. Over the last fifteen years, he has more than 4500 citations. Many of his research outcomes have been applied by Industry worldwide. This indicates the high impact of his research work.
