Unsupervised and Semi-Supervised Learning
Series Editor: M. Emre Celebi
Olfa Nasraoui
Chiheb-Eddine Ben N’Cir Editors
Clustering Methods for Big Data Analytics
Techniques, Toolboxes and Applications
Unsupervised and Semi-Supervised Learning
Series Editor
M. Emre Celebi, Computer Science Department, Conway, Arkansas, USA
Springer’s Unsupervised and Semi-Supervised Learning book series covers the
latest theoretical and practical developments in unsupervised and semi-supervised
learning. Titles – including monographs, contributed works, professional books, and
textbooks – tackle various issues surrounding the proliferation of massive amounts
of unlabeled data in many application domains and how unsupervised learning
algorithms can automatically discover interesting and useful patterns in such
data. The books discuss how these algorithms have found numerous applications
including pattern recognition, market basket analysis, web mining, social network
analysis, information retrieval, recommender systems, market research, intrusion
detection, and fraud detection. Books also discuss semi-supervised algorithms,
which can make use of both labeled and unlabeled data and can be useful in
application domains where unlabeled data is abundant, yet it is possible to obtain a
small amount of labeled data.
Topics of interest include:
– Unsupervised/Semi-Supervised Discretization
– Unsupervised/Semi-Supervised Feature Extraction
– Unsupervised/Semi-Supervised Feature Selection
– Association Rule Learning
– Semi-Supervised Classification
– Semi-Supervised Regression
– Unsupervised/Semi-Supervised Clustering
– Unsupervised/Semi-Supervised Anomaly/Novelty/Outlier Detection
– Evaluation of Unsupervised/Semi-Supervised Learning Algorithms
– Applications of Unsupervised/Semi-Supervised Learning
While the series focuses on unsupervised and semi-supervised learning,
outstanding contributions in the field of supervised learning will also be considered.
The intended audience includes students, researchers, and practitioners.
More information about this series at http://www.springer.com/series/15892
Editors
Olfa Nasraoui, Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY, USA
Chiheb-Eddine Ben N’Cir, University of Jeddah, Jeddah, KSA
ISSN 2522-848X    ISSN 2522-8498 (electronic)
Unsupervised and Semi-Supervised Learning
ISBN 978-3-319-97863-5    ISBN 978-3-319-97864-2 (eBook)
https://doi.org/10.1007/978-3-319-97864-2
Library of Congress Control Number: 2018957659
© Springer Nature Switzerland AG 2019
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Data has become the lifeblood of today’s knowledge-driven economy and society.
Big data clustering aims to summarize, segment, and group large volumes and
varieties of data that are generated at an accelerated velocity into groups of similar
contents. This has become one of the most important techniques in exploratory data
analysis. Unfortunately, conventional clustering techniques are increasingly
unable to process such data due to its high complexity, heterogeneity, large
volume, and rapid generation. This raises exciting challenges for researchers to
design new scalable and efficient clustering methods and tools that can
extract valuable information from these tremendous amounts of data. Progress
on this topic is fast and exciting.
This volume aims to help the reader capture new advances in big data clustering.
It provides a systematic, in-depth understanding of the scope and rapidly builds an
overview of new big data clustering challenges, methods, tools, and applications.
The volume opens with a chapter entitled “Overview of Scalable Partitional
Methods for Big Data Clustering.” In this chapter, Ben HajKacem et al. present
an overview of existing clustering methods with a special emphasis on scalable
partitional methods. The authors design a new categorizing model based on the
main properties that big data partitional clustering methods rely on to ensure
scalability when analyzing large amounts of data. Furthermore, a comparative
experimental study of most of the existing methods is given over simulated and real
large datasets. Finally, the authors provide a guide for researchers and end users
who want to decide on the best method or framework to use when clustering
large-scale data.
In the second chapter, “Overview of Efficient Clustering Methods for
High-Dimensional Big Data Streams,” Hassani focuses on analyzing continuous, possibly
infinite streams of data arriving at high velocity, such as web traffic data,
surveillance data, sensor measurements, and stock trading. The author reviews recent
subspace clustering methods for high-dimensional big data streams while discussing
approaches that efficiently combine the anytime clustering concept with the stream
subspace clustering paradigm. Additionally, a novel open-source assessment
framework and evaluation measures are presented for subspace stream clustering.
In the chapter entitled “Clustering Blockchain Data,” Chawathe presents recent
challenges and advances related to clustering blockchain data such as those
generated by popular cryptocurrencies like Bitcoin and Ethereum. Analysis of
these datasets has diverse applications, such as detecting fraud and illegal transactions,
characterizing major services, identifying financial hotspots, and characterizing usage
and performance characteristics of large peer-to-peer consensus-based systems.
The author motivates the study of clustering methods for blockchain data and
introduces the key blockchain concepts from a data-centric perspective. He presents
different models and methods used for clustering blockchain data and describes the
challenges and solutions to the problem of evaluating such methods.
Deep Learning is another interesting challenge, which is discussed in the chapter
titled “An Introduction to Deep Clustering” by Nutakki et al. The chapter presents
a simplified taxonomy of deep clustering methods based mainly on the overall
procedural structure or design, which helps beginning readers quickly grasp how
almost all approaches are designed. This also allows more advanced readers to
learn how to design increasingly sophisticated deep clustering pipelines that fit
their own machine learning problem-solving aims. Like Deep Learning, deep
clustering promises to leave an impact on diverse application domains ranging
from computer vision and speech recognition to recommender systems and natural
language processing.
A new efficient Spark-based implementation of PSO (particle swarm
optimization) clustering is described in a chapter entitled “Spark-Based Design of
Clustering Using Particle Swarm Optimization.” Moslah et al. take advantage
of the in-memory operations of Spark to build clusters from large-scale data and
accelerate the convergence of the method when approaching the global optimum
region. Experiments conducted on real and simulated large datasets show that their
proposed method is scalable and improves the efficiency of existing PSO methods.
The last two chapters describe new applications of big data clustering techniques.
In “Data Stream Clustering for Real-Time Anomaly Detection: An Application to
Insider Threats,” Haidar and Gaber investigate a new streaming anomaly detection
approach, namely, Ensemble of Random subspace Anomaly detectors In Data
Streams (E-RAIDS), for insider threat detection. The investigated approach addresses
the issues of the high velocity of incoming data from different sources and the high
number of false alarms/positives (FPs). Furthermore, in “Effective Tensor-Based
Data Clustering Through Sub-Tensor Impact Graphs,” which completes the volume,
Candan et al. investigate tensor-based methods for clustering multimodal data such
as web graphs, sensor streams, and social networks. The authors deal with the
computational complexity problem of tensor decomposition by partitioning the
tensor and then obtaining the tensor decomposition by leveraging the resulting smaller
partitions. They introduce the notion of sub-tensor impact graphs (SIGs), which
quantify how the decompositions of these sub-partitions impact each other and
the overall tensor decomposition accuracy, and present several complementary
algorithms that leverage this novel concept to address various key challenges in
tensor decomposition.
We hope that the volume will give an overview of the significant progress and
the new challenges arising from big data clustering in recent years. We also
hope that its contents will help researchers, practitioners, and students in their
study and research.
Louisville, KY, USA    Olfa Nasraoui
Manouba, Tunisia    Chiheb-Eddine Ben N’Cir
Contents
1 Overview of Scalable Partitional Methods for Big Data Clustering
  Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N’Cir, and Nadia Essoussi
2 Overview of Efficient Clustering Methods for High-Dimensional Big Data Streams
  Marwan Hassani
3 Clustering Blockchain Data
  Sudarshan S. Chawathe
4 An Introduction to Deep Clustering
  Gopi Chand Nutakki, Behnoush Abdollahi, Wenlong Sun, and Olfa Nasraoui
5 Spark-Based Design of Clustering Using Particle Swarm Optimization
  Mariem Moslah, Mohamed Aymen Ben HajKacem, and Nadia Essoussi
6 Data Stream Clustering for Real-Time Anomaly Detection: An Application to Insider Threats
  Diana Haidar and Mohamed Medhat Gaber
7 Effective Tensor-Based Data Clustering Through Sub-Tensor Impact Graphs
  K. Selçuk Candan, Shengyu Huang, Xinsheng Li, and Maria Luisa Sapino
Index
Chapter 1
Overview of Scalable Partitional Methods
for Big Data Clustering
Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N’Cir,
and Nadia Essoussi
1.1 Introduction
Clustering, also known as cluster analysis, has become an important technique in
machine learning used to discover the natural grouping of the observed data. Often, a
clear distinction is made between learning problems that are supervised, also known
as classification, and those that are unsupervised, known as clustering [24]. The
first deals with only labeled data while the latter deals with only unlabeled data
[16]. In many real applications, there is a large supply of unlabeled data but limited
labeled data. This fact makes clustering more difficult and more challenging than
classification. Consequently, there is a growing interest in a hybrid setting, called
semi-supervised learning [11], where the labels of only a small portion of the observed
data are available.
During the last four decades, many clustering methods have been designed based
on different approaches such as hierarchical, partitional, probabilistic, and
density-based [24]. Among them, partitional clustering methods have been widely used
in several real-life applications given their simplicity and their competitive
computational complexity. This category of methods aims to divide the dataset into a
number of groups based on the optimization of one or several objective criteria.
The optimized criteria may emphasize a local or a global structure of the data, and
their optimization is based on an exact or an approximate optimization technique.
Despite their competitive computational complexity compared to other methods,
partitional methods fail to perform clustering on huge amounts of data
M. A. B. HajKacem (✉) · N. Essoussi
LARODEC, Institut Supérieur de Gestion de Tunis, Université de Tunis, Tunis, Tunisia
e-mail: nadia.essoussi@isg.rnu.tn
C.-E. Ben N’Cir (✉)
University of Jeddah, Jeddah, KSA
© Springer Nature Switzerland AG 2019
O. Nasraoui, C.-E. Ben N’Cir (eds.), Clustering Methods for Big Data Analytics,
Unsupervised and Semi-Supervised Learning,
https://doi.org/10.1007/978-3-319-97864-2_1