Efficient Evidence Accumulation Clustering for large datasets/big data

Diogo Alexandre Oliveira Silva

Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering

Supervisors:
Dra. Ana Luísa Nobre Fred
Dra. Helena Isabel Aidos Lopes

Examination Committee
Chairperson: Prof. Dr. João Fernando Cardoso Silva Sequeira
Supervisor: Dra. Ana Luísa Nobre Fred
Members of the Committee:
Dr. Pedro Filipe Zeferino Tomás
BGEN ENGEL José Manuel dos Santos Vicêncio
TCOR ENGEL Ana Paula da Silva Jorge

December 2015

Acknowledgments

It would be a great oversight if I did not acknowledge the help and contributions I received toward the completion of this undertaking.

I should start by expressing deep appreciation to my supervisors, Dr. Ana Fred and Dr. Helena Lopes. The freedom I was allowed was crucial for keeping my motivation up for such a long period. When that was lacking, their guidance and encouragement quickly put me back on track.

I leave my warmest regards to the family I built throughout my military and academic career during these last years. Your camaraderie has shaped where I stand and who I am today. Your support throughout these long months was invaluable.

To my loving girlfriend, who would always hear my boundless enthusiasm or disheartening frustrations, whichever was the case, and who followed me closest in the last months, I am deeply grateful. You motivated me most of all, and that was no small task.

And to my parents, thank you for all your support and motivation during all these years. You were always caring and encouraging and I am forever grateful for that.

Last, but not least, thank you to all my friends and family for understanding my absence during these challenging months.

Resumo

Advances in technology allow the collection and storage of unprecedented quantities and varieties of data. Most of this data is stored electronically, and there is interest in analyzing it automatically. Clustering techniques are among the most popular for this task because they assume nothing a priori about the structure of the data. Many techniques exist but, typically, they do not perform well on all datasets because of the specificities of each one. Ensemble clustering techniques try to address this challenge by combining other algorithms. This dissertation focuses on one in particular, Evidence Accumulation Clustering (EAC). EAC is a robust algorithm that has shown good performance in the literature on a variety of datasets. However, this robustness comes with a higher computational cost. Its application is not only slower but also restricted to small datasets. Thus, the objective of this dissertation is to scale EAC, enabling its application to large datasets with the technology available in a typical workstation. With this in mind, several approaches were explored: speeding up processing with other algorithms (quantum clustering) or through parallel processing (with GPU), scaling with external memory (hard drive) algorithms, and exploiting the sparse nature of EAC. In addition, an efficient method for building a sparse matrix, specific to EAC, was developed. The proposed solution is applicable to large datasets and is between 6 and 200 times faster than the original for small datasets.

Key words: Clustering methods, EAC, K-Means, MST, GPGPU, CUDA, Sparse matrices, Single-Link

Abstract

Advances in technology allow for the collection and storage of an unprecedented amount and variety of data. Most of this data is stored electronically and there is an interest in automated analysis for the generation of knowledge and new insights. Since the structure of the data is unknown, clustering techniques become particularly interesting for knowledge discovery and data mining, as they make as few assumptions on the data as possible. A vast body of work on these algorithms exists, yet, typically, no single algorithm is able to respond to the specificities of all data. Ensemble clustering algorithms address this problem by combining other algorithms. Evidence Accumulation Clustering (EAC) is a robust ensemble algorithm that has shown good results and is the focus of this dissertation. However, this robustness comes with a higher computational cost. Its application is slower and restricted to smaller datasets. Thus, the objective of this dissertation is to scale EAC, allowing its application to big datasets with the technology available at a typical workstation. Accordingly, several approaches were explored: speed-up with other algorithms (quantum clustering) or parallel computing (with GPU), and reducing space complexity by using external memory (hard drive) algorithms and exploiting the sparse nature of EAC. A relevant contribution is a novel method to build a sparse matrix specialized for EAC. Results show that the proposed solution is applicable to large datasets and presents speed-ups of between 6 and 200 times over the original implementation on different phases of EAC for small datasets.

Key words: Clustering methods, EAC, K-Means, MST, GPGPU, CUDA, Sparse matrices, Single-Link

Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Glossary

1 Introduction
  1.1 Challenges and Motivation
  1.2 Goals
  1.3 Contributions
  1.4 Outline

2 Clustering: basic concepts, definitions and algorithms
  2.1 The problem of clustering
  2.2 Definitions and Notation
  2.3 Characteristics of clustering techniques
  2.4 K-Means
  2.5 Single-Link
  2.6 Ensemble Clustering
  2.7 Evidence Accumulation Clustering
    2.7.1 Overview

3 State of the art
  3.1 Scalability of EAC
    3.1.1 p neighbors approach
    3.1.2 Increased sparsity approach
  3.2 Clustering with large datasets
    3.2.1 Parallel K-Means
    3.2.2 Parallel Single-Link Clustering
  3.3 Quantum clustering
    3.3.1 The quantum bit approach
    3.3.2 Schrödinger's equation approach

4 EAC for large datasets: proposed solutions
  4.1 Quantum Clustering
  4.2 Speeding up Ensemble Generation with Parallel K-Means
    4.2.1 Maximum potential speed-up
    4.2.2 Implementation details
  4.3 Dealing with the space complexity of the co-association matrix
  4.4 Building the sparse matrix
    4.4.1 EAC CSR
    4.4.2 EAC CSR Condensed
  4.5 Final partition recovery
    4.5.1 Single-Link and GPGPU
    4.5.2 Single-Link and external memory algorithms
  4.6 Overview of the developed work
  4.7 Guidelines for using the different solutions

5 Results and Discussion
  5.1 Experimental environment
  5.2 Quantum Clustering
    5.2.1 Quantum K-Means
    5.2.2 Horn and Gottlieb's algorithm
  5.3 Parallel K-Means
    5.3.1 Analysis of speed-up
    5.3.2 Effective bandwidth and computational throughput
    5.3.3 Influence of the number of points per thread
  5.4 Building the co-association matrix with different sparse formats
  5.5 GPU MST
  5.6 EAC Validation
  5.7 EAC
    5.7.1 Performance of production and combination phases
    5.7.2 Performance comparison between SLINK, SL-MST and SL-MST-Disk
    5.7.3 Performance of all phases combined
    5.7.4 Analysis of the number of associations
    5.7.5 Space complexity
    5.7.6 Accuracy
  5.8 Performance on real world datasets