
Data Mining in Large Sets of Complex Data PDF

123 Pages·2013·3.4 MB·English

Preview Data Mining in Large Sets of Complex Data

SpringerBriefs in Computer Science

Series Editors: Stan Zdonik, Peng Ning, Shashi Shekhar, Jonathan Katz, Xindong Wu, Lakhmi C. Jain, David Padua, Xuemin Shen, Borko Furht, V. S. Subrahmanian, Martial Hebert, Katsushi Ikeuchi, Bruno Siciliano

For further volumes: http://www.springer.com/series/10028

Robson L. F. Cordeiro • Christos Faloutsos • Caetano Traina Júnior

Data Mining in Large Sets of Complex Data

Robson L. F. Cordeiro
Computer Science Department (ICMC)
University of São Paulo
São Carlos, SP
Brazil

Caetano Traina Júnior
Computer Science Department (ICMC)
University of São Paulo
São Carlos, SP
Brazil

Christos Faloutsos
Department of Computer Science
Carnegie Mellon University
Pittsburgh, PA
USA

ISSN 2191-5768
ISSN 2191-5776 (electronic)
ISBN 978-1-4471-4889-0
ISBN 978-1-4471-4890-6 (eBook)
DOI 10.1007/978-1-4471-4890-6
Springer London Heidelberg New York Dordrecht
Library of Congress Control Number: 2012954371
© The Author(s) 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Both the amount and the complexity of the data gathered by current scientific and productive enterprises are increasing at an exponential rate, in the most diverse knowledge areas, such as biology, physics, medicine, astronomy, climate forecasting, etc. Finding patterns and trends in these data is increasingly important and challenging for decision making. As a consequence, the analysis and management of Big Data is currently a central challenge in Computer Science, especially with regard to complex datasets.

Finding clusters in large complex datasets is one of the most important tasks in data analysis. For example, given a satellite image database containing several tens of Terabytes, how can we find regions in order to identify native rainforests, deforestation, or reforestation? Can it be done automatically? Based on the results of the work discussed in this book, the answer to both questions is a resounding "yes", and the results can be obtained in just minutes. In fact, results that used to require days or weeks of hard work from human specialists can now be obtained in minutes with high precision.

Clustering complex data is a computationally expensive task, and the best existing algorithms have a super-linear complexity regarding both the data set cardinality (number of data elements) and dimensionality (number of attributes describing each element).
Therefore, those algorithms do not scale well, which prevents them from processing large data sets efficiently. Focused on the analysis of Big Data, this book discusses new algorithms created to perform clustering in moderate-to-high dimensional data involving many billions of elements stored in several Terabytes of data, such as features extracted from large sets of complex objects, but that can nonetheless be executed quickly, in just a few minutes.

To achieve that performance, the algorithms exploit the observation that, in high-dimensional data, each cluster is bounded to a few dimensions and thus exists only in a subspace of the original high-dimensional space, although each cluster can have correlations among dimensions distinct from those correlated in the other clusters. The novel techniques were developed to perform both hard and soft clustering (that is, assuming that each element can participate in just one cluster or in several clusters that overlap in the space), and they can be executed by serial or by parallel processing. Moreover, their applications are shown in several practical test cases.

Unlike most of the existing algorithms (and all of the fastest ones), the new clustering techniques do not require the number of expected clusters to be defined in advance; rather, it is inferred from the data and returned to the user. Besides, due to the assumption that each cluster exists because of correlations in a subset of the space dimensions, the new techniques not only find clusters with high quality and speed, but also spot the most significant dimensions for each cluster, a benefit that the previous algorithms only achieve at the expense of costly processing.
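
The idea that a cluster can be concentrated along only a few axes, while looking like noise along the others, can be illustrated with a small sketch. The Python below is not one of the algorithms discussed in the book; it is a toy illustration under stated assumptions: coordinates are normalized to [0, 1), the cluster is axis-aligned, and an axis is called "relevant" when its spread is well below that of uniform noise. The helper names `make_subspace_cluster` and `relevant_dimensions`, and the `ratio` cut-off, are hypothetical.

```python
import random
import statistics

# Standard deviation of a uniform distribution on [0, 1): 1 / sqrt(12).
UNIFORM_STD = (1.0 / 12.0) ** 0.5


def make_subspace_cluster(rng, n_points, n_dims, relevant):
    """Generate a toy cluster that is tight only along the `relevant` axes."""
    points = []
    for _ in range(n_points):
        point = [
            # Tight Gaussian on relevant axes, uniform noise on the others.
            rng.gauss(0.5, 0.02) if d in relevant else rng.random()
            for d in range(n_dims)
        ]
        points.append(point)
    return points


def relevant_dimensions(points, ratio=0.5):
    """Return the axes whose spread is well below that of uniform noise.

    `ratio` is an assumed cut-off chosen for this illustration only.
    """
    n_dims = len(points[0])
    found = []
    for d in range(n_dims):
        spread = statistics.pstdev([p[d] for p in points])
        if spread < ratio * UNIFORM_STD:
            found.append(d)
    return found


if __name__ == "__main__":
    rng = random.Random(0)
    cluster = make_subspace_cluster(rng, n_points=200, n_dims=5, relevant={1, 3})
    print(relevant_dimensions(cluster))
```

In this toy setting the points are tight only along axes 1 and 3, so those are the axes the helper should report. Real data would also need the clusters themselves to be found first, which is the hard part the book addresses.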
The methodology used to develop the techniques discussed in this book was based on the extension of hierarchical data structures, on multidimensional multi-scaling analysis of the spatial data distribution based on a convolution process using Laplacian filters, on the evaluation of alternative cluster entropies, and on new cost functions that make it possible to evaluate the best strategies before executing them, allowing a dynamic dataflow optimization of the parallel processing.

The new algorithms were compared with at least nine of the most efficient existing ones, and it was shown that the performance improvement is at least one order of magnitude, while the quality is always equivalent to the best achieved by the competing techniques. In extreme situations, it took just two seconds to obtain clusters from real data for which the best competing techniques required two days, with equivalent accuracy. In one of the real cases evaluated, the new techniques described were able to find correct tags for every image from a dataset containing several tens of thousands of images, performing soft clustering (thus assigning one or more tags to each image), using as guidelines the labeling performed by a user in not more than five images for each tag (that is, in at most 0.001 % of the image set). Experiments reported in the book show that the novel techniques achieved excellent results in real data from high impact applications, such as breast cancer diagnosis, region classification in satellite images, assistance to climate change forecast, recommendation systems for the Web, and social networks.

In summary, the work described here takes steps forward from traditional data mining (especially for clustering) by considering large, complex data sets.
Note that, usually, current works focus on one aspect, either size or data complexity. The work described in this book considers both: it enables mining complex data from high impact applications; the data are large, at the Terabyte scale rather than the usual Gigabyte scale; and very accurate results are found in just minutes. Thus, it provides a crucial and well-timed contribution for allowing the creation of real time applications that deal with Big Data of high complexity in which mining on the fly can make an immeasurable difference, like in cancer diagnosis or deforestation detection.

São Carlos, October 2012
Robson L. F. Cordeiro
Christos Faloutsos
Caetano Traina Júnior

Acknowledgments

This material is based upon work supported by FAPESP (São Paulo State Research Foundation), CAPES (Brazilian Coordination for Improvement of Higher Level Personnel), CNPq (Brazilian National Council for Supporting Research), Microsoft Research, and the National Science Foundation under Grant No. IIS1017415. Research was sponsored by the Defense Threat Reduction Agency and was accomplished under contract No. HDTRA1-10-1-0120, and by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, or other funding parties.

The authors also thank the following collaborators for providing valuable support to the development of this work: Agma J. M. Traina, Fan Guo, U. Kang, Julio López, Donna S. Haverkamp, James H. Horne, Ellen K. Hughes, Gunhee Kim, Mirella M. Moro, Carlos A. Heuser and João Eduardo Ferreira.

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Definition and Main Objectives
  1.3 Main Contributions
  1.4 Conclusions
  References

2 Related Work and Concepts
  2.1 Processing Complex Data
  2.2 Knowledge Discovery in Traditional Data
  2.3 Clustering Complex Data
  2.4 Labeling Complex Data
  2.5 MapReduce
  2.6 Conclusions
  References

3 Clustering Methods for Moderate-to-High Dimensionality Data
  3.1 Brief Survey
  3.2 CLIQUE
  3.3 LAC
  3.4 CURLER
  3.5 Conclusions
  References

4 Halite
  4.1 Introduction
  4.2 General Proposal
  4.3 Presented Method: Basics
    4.3.1 Building the Counting-Tree
    4.3.2 Finding β-Clusters
    4.3.3 Building the Correlation Clusters
  4.4 Presented Method: The Algorithm Halite
  4.5 Presented Method: Soft Clustering
  4.6 Implementation Discussion
  4.7 Experimental Results
    4.7.1 Comparing Hard Clustering Approaches
    4.7.2 Comparing Soft Clustering Approaches
    4.7.3 Scalability
    4.7.4 Sensitivity Analysis
  4.8 Discussion
  4.9 Conclusions
  References

5 BoW
  5.1 Introduction
  5.2 Main Ideas of BoW: Reducing Bottlenecks
    5.2.1 Parallel Clustering: ParC
    5.2.2 Sample and Ignore: SnI
  5.3 Cost-Based Optimization of BoW
  5.4 Finishing Touches: Data Partitioning and Cluster Stitching
    5.4.1 Random-Based Data Partition
    5.4.2 Location-Based Data Partition
    5.4.3 File-Based Data Partition
  5.5 Experimental Results
    5.5.1 Comparing the Data Partitioning Strategies
    5.5.2 Quality of Results
    5.5.3 Scale-Up Results
    5.5.4 Accuracy of the Cost Equations
  5.6 Conclusions
  References

6 QMAS
  6.1 Introduction
  6.2 Presented Method
    6.2.1 Mining and Attention Routing
    6.2.2 Low-Labor Labeling (LLL)
  6.3 Experimental Results
    6.3.1 Results on the Initial Example
    6.3.2 Speed
    6.3.3 Quality and Non-labor Intensive
    6.3.4 Functionality
    6.3.5 Experiments on the SATLARGE Dataset
  6.4 Conclusions
  References

7 Conclusion
  7.1 Main Contributions
  7.2 Discussion
  References

Chapter 1
Introduction

Abstract This chapter presents an overview of the book. It contains brief descriptions of the facts that motivated the work, along with the corresponding problem definition, main objectives and central contributions. The following sections detail each one of these topics.

Keywords Knowledge discovery in databases · Data mining · Clustering · Labeling · Summarization · Big data · Complex data · Linear or quasi-linear complexity · Terabyte-scale data analysis

1.1 Motivation

The information generated or collected in digital formats for various application areas is growing not only in the number of objects and attributes, but also in the complexity of the attributes that describe each object [5, 7, 9–13]. This scenario has prompted the development of techniques and tools aimed at intelligently and automatically assisting humans in analyzing, understanding and extracting knowledge from raw data [5, 8, 13], molding the research area of Knowledge Discovery in Databases (KDD).

The increasing amount of data makes the KDD tasks especially interesting, since they allow the data to be treated as useful resources in the decision-making processes of the organizations that own them, instead of being left unused on computer disks, stored and never accessed, like real 'tombs of data' [6]. On the other hand, the increasing complexity of the data creates several challenges for researchers, given that most of the existing techniques are not appropriate for analyzing complex data, such as images, audio, graphs and long texts.
Common knowledge discovery tasks are clustering, classification and labeling, identifying measurement errors and outliers, inferring association rules and missing data, and dimensionality reduction.

R. L. F. Cordeiro et al., Data Mining in Large Sets of Complex Data, SpringerBriefs in Computer Science, DOI: 10.1007/978-1-4471-4890-6_1, © The Author(s) 2013
