Unsupervised and Semi-Supervised Learning
Series Editor: M. Emre Celebi, Computer Science Department, Conway, Arkansas, USA

Olfa Nasraoui
Chiheb-Eddine Ben N’Cir
Editors

Clustering Methods for Big Data Analytics
Techniques, Toolboxes and Applications

Springer’s Unsupervised and Semi-Supervised Learning book series covers the latest theoretical and practical developments in unsupervised and semi-supervised learning. Titles, including monographs, contributed works, professional books, and textbooks, tackle various issues surrounding the proliferation of massive amounts of unlabeled data in many application domains and how unsupervised learning algorithms can automatically discover interesting and useful patterns in such data. The books discuss how these algorithms have found numerous applications including pattern recognition, market basket analysis, web mining, social network analysis, information retrieval, recommender systems, market research, intrusion detection, and fraud detection. Books also discuss semi-supervised algorithms, which can make use of both labeled and unlabeled data and can be useful in application domains where unlabeled data is abundant, yet it is possible to obtain a small amount of labeled data.

Topics of interest include:
– Unsupervised/Semi-Supervised Discretization
– Unsupervised/Semi-Supervised Feature Extraction
– Unsupervised/Semi-Supervised Feature Selection
– Association Rule Learning
– Semi-Supervised Classification
– Semi-Supervised Regression
– Unsupervised/Semi-Supervised Clustering
– Unsupervised/Semi-Supervised Anomaly/Novelty/Outlier Detection
– Evaluation of Unsupervised/Semi-Supervised Learning Algorithms
– Applications of Unsupervised/Semi-Supervised Learning

While the series focuses on unsupervised and semi-supervised learning, outstanding contributions in the field of supervised learning will also be considered. The intended audience includes students, researchers, and practitioners.
More information about this series at http://www.springer.com/series/15892

Olfa Nasraoui · Chiheb-Eddine Ben N’Cir
Editors

Clustering Methods for Big Data Analytics
Techniques, Toolboxes and Applications

Editors

Olfa Nasraoui
Department of Computer Engineering and Computer Science
University of Louisville
Louisville, KY, USA

Chiheb-Eddine Ben N’Cir
University of Jeddah
Jeddah, KSA

ISSN 2522-848X
ISSN 2522-8498 (electronic)
Unsupervised and Semi-Supervised Learning
ISBN 978-3-319-97863-5
ISBN 978-3-319-97864-2 (eBook)
https://doi.org/10.1007/978-3-319-97864-2
Library of Congress Control Number: 2018957659

© Springer Nature Switzerland AG 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Data has become the lifeblood of today’s knowledge-driven economy and society.
Big data clustering aims to summarize, segment, and group large volumes and varieties of data that are generated at an accelerated velocity into groups of similar contents. This has become one of the most important techniques in exploratory data analysis. Unfortunately, conventional clustering techniques are increasingly unable to process such data due to its high complexity, heterogeneity, large volume, and rapid generation. This raises exciting challenges for researchers to design new scalable and efficient clustering methods and tools that are able to extract valuable information from these tremendous amounts of data. The progress in this topic is fast and exciting.

This volume aims to help the reader capture new advances in big data clustering. It provides a systematic, in-depth understanding of the scope and rapidly builds an overview of new big data clustering challenges, methods, tools, and applications.

The volume opens with a chapter entitled “Overview of Scalable Partitional Methods for Big Data Clustering.” In this chapter, Ben HajKacem et al. present an overview of the existing clustering methods with a special emphasis on scalable partitional methods. The authors design a new categorizing model based on the main properties pointed out in the big data partitional clustering methods to ensure scalability when analyzing a large amount of data. Furthermore, a comparative experimental study of most of the existing methods is given over simulated and real large datasets. The authors finally elaborate a guide for researchers and end users who want to decide the best method or framework to use when a large-scale clustering task is required.

In the second chapter, “Overview of Efficient Clustering Methods for High-dimensional Big Data Streams,” Hassani focuses on analyzing continuous, possibly infinite streams of data arriving at high velocity, such as web traffic data, surveillance data, sensor measurements, and stock trading.
The author reviews recent subspace clustering methods of high-dimensional big data streams while discussing approaches that efficiently combine the anytime clustering concept with the stream subspace clustering paradigm. Additionally, a novel open-source assessment framework and evaluation measures are presented for subspace stream clustering.

In the chapter entitled “Clustering Blockchain Data,” Chawathe gives recent challenges and advances related to clustering blockchain data such as those generated by popular cryptocurrencies like Bitcoin, Ethereum, etc. Analysis of these datasets has diverse applications, such as detecting fraud and illegal transactions, characterizing major services, identifying financial hotspots, and characterizing usage and performance characteristics of large peer-to-peer consensus-based systems. The author motivates the study of clustering methods for blockchain data and introduces the key blockchain concepts from a data-centric perspective. He presents different models and methods used for clustering blockchain data and describes the challenges and solutions to the problem of evaluating such methods.

Deep Learning is another interesting challenge, which is discussed in the chapter titled “An Introduction to Deep Clustering” by Gopi et al. The chapter presents a simplified taxonomy of deep clustering methods based mainly on the overall procedural structure or design, which helps beginning readers quickly grasp how almost all approaches are designed. This also allows more advanced readers to learn how to design increasingly sophisticated deep clustering pipelines that fit their own machine learning problem-solving aims. Like Deep Learning, deep clustering promises to leave an impact on diverse application domains ranging from computer vision and speech recognition to recommender systems and natural language processing.
A new efficient Spark-based implementation of PSO (particle swarm optimization) clustering is described in a chapter entitled “Spark-Based Design of Clustering Using Particle Swarm Optimization.” Moslah et al. take advantage of the in-memory operations of Spark to build groupings from large-scale data and accelerate the convergence of the method when approaching the global optimum region. Experiments conducted on real and simulated large datasets show that their proposed method is scalable and improves the efficiency of existing PSO methods.

The last two chapters describe new applications of big data clustering techniques. In “Data Stream Clustering for Real-Time Anomaly Detection: An Application to Insider Threats,” Haidar and Gaber investigate a new streaming anomaly detection approach, namely, Ensemble of Random subspace Anomaly detectors In Data Streams (E-RAIDS), for insider threat detection. The investigated approach solves the issues of the high velocity of incoming data from different sources and the high number of false alarms/positives (FPs). Furthermore, in “Effective Tensor-Based Data Clustering Through Sub-tensor Impact Graphs,” which completes the volume, Candan et al. investigate tensor-based methods for clustering multimodal data such as web graphs, sensor streams, and social networks. The authors deal with the computational complexity problem of tensor decomposition by partitioning the tensor and then obtaining the tensor decomposition by leveraging the resulting smaller partitions. They introduce the notion of sub-tensor impact graphs (SIGs), which quantify how the decompositions of these sub-partitions impact each other and the overall tensor decomposition accuracy, and present several complementary algorithms that leverage this novel concept to address various key challenges in tensor decomposition.
We hope that the volume will give an overview of the significant progress and the new challenges arising from big data clustering in recent years. We also hope that its contents will help researchers, practitioners, and students in their study and research.

Louisville, KY, USA    Olfa Nasraoui
Manouba, Tunisia    Chiheb-Eddine Ben N’Cir

Contents

1 Overview of Scalable Partitional Methods for Big Data Clustering
  Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N’Cir, and Nadia Essoussi
2 Overview of Efficient Clustering Methods for High-Dimensional Big Data Streams
  Marwan Hassani
3 Clustering Blockchain Data
  Sudarshan S. Chawathe
4 An Introduction to Deep Clustering
  Gopi Chand Nutakki, Behnoush Abdollahi, Wenlong Sun, and Olfa Nasraoui
5 Spark-Based Design of Clustering Using Particle Swarm Optimization
  Mariem Moslah, Mohamed Aymen Ben HajKacem, and Nadia Essoussi
6 Data Stream Clustering for Real-Time Anomaly Detection: An Application to Insider Threats
  Diana Haidar and Mohamed Medhat Gaber
7 Effective Tensor-Based Data Clustering Through Sub-Tensor Impact Graphs
  K. Selçuk Candan, Shengyu Huang, Xinsheng Li, and Maria Luisa Sapino
Index

Chapter 1
Overview of Scalable Partitional Methods for Big Data Clustering

Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N’Cir, and Nadia Essoussi

1.1 Introduction

Clustering, also known as cluster analysis, has become an important technique in machine learning used to discover the natural grouping of the observed data. Often, a clear distinction is made between learning problems that are supervised, also known as classification, and those that are unsupervised, known as clustering [24].
The former deals only with labeled data, while the latter deals only with unlabeled data [16]. In many real applications, there is a large supply of unlabeled data but limited labeled data. This fact makes clustering more difficult and more challenging than classification. Consequently, there is a growing interest in a hybrid setting, called semi-supervised learning [11], where the labels of only a small portion of the observed data are available.

M. A. B. HajKacem · N. Essoussi
LARODEC, Institut Supérieur de Gestion de Tunis, Université de Tunis, Tunis, Tunisia
e-mail: [email protected]

C.-E. Ben N’Cir
University of Jeddah, Jeddah, KSA

© Springer Nature Switzerland AG 2019
O. Nasraoui, C.-E. Ben N’Cir (eds.), Clustering Methods for Big Data Analytics, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-319-97864-2_1

During the last four decades, many clustering methods were designed based on different approaches such as hierarchical, partitional, probabilistic, and density-based [24]. Among them, partitional clustering methods have been widely used in several real-life applications given their simplicity and their competitive computational complexity. This category of methods aims to divide the dataset into a number of groups based on the optimization of one or several objective criteria. The optimized criteria may emphasize a local or a global structure of the data, and their optimization is based on an exact or an approximate optimization technique. Despite the competitive computational complexity of partitional methods compared to other methods, they fail to perform clustering on huge amounts of data