Ritu Arora, Editor
Conquering Big Data with High Performance Computing

Editor: Ritu Arora, Texas Advanced Computing Center, Austin, TX, USA

ISBN 978-3-319-33740-1
ISBN 978-3-319-33742-5 (eBook)
DOI 10.1007/978-3-319-33742-5
Library of Congress Control Number: 2016945048

© Springer International Publishing Switzerland 2016
Chapter 7 was created within the capacity of US governmental employment. US copyright protection does not apply.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG Switzerland

Preface

Scalable solutions for computing and storage are a necessity for the timely processing and management of big data. In the last several decades, High-Performance Computing (HPC) has already impacted the process of developing innovative solutions across various scientific and nonscientific domains.
There are plenty of examples of data-intensive applications that take advantage of HPC resources and techniques for reducing the time-to-results.

This peer-reviewed book is an effort to highlight some of the ways in which HPC resources and techniques can be used to process and manage big data with speed and accuracy. Through the chapters included in the book, HPC has been demystified for the readers. HPC is presented both as an alternative to commodity clusters on which the Hadoop ecosystem typically runs in mainstream computing and as a platform on which alternatives to the Hadoop ecosystem can be efficiently run.

The book includes a basic overview of HPC, High-Throughput Computing (HTC), and big data (in Chap. 1). It introduces the readers to the various types of HPC and high-end storage resources that can be used for efficiently managing the entire big data lifecycle (in Chap. 2). Data movement across various systems (from storage to computing to archival) can be constrained by the available bandwidth and latency. An overview of the various aspects of moving data across a system is included in the book (in Chap. 3) to inform the readers about the associated overheads. A detailed introduction to a tool that can be used to run serial applications on HPC platforms in HTC mode is also included (in Chap. 4).

In addition to the gentle introduction to HPC resources and techniques, the book includes chapters on latest research and development efforts that are facilitating the convergence of HPC and big data (see Chaps. 5, 6, 7, and 8).

The R language is used extensively for data mining and statistical computing. A description of efficiently using R in parallel mode on HPC resources is included in the book (in Chap. 9). A chapter in the book (Chap. 10) describes efficient sampling methods to construct a large dataset, which can then be used to address theoretical questions as well as econometric ones.

Through the multiple test cases from diverse domains like high-frequency financial trading, archaeology, and eDiscovery, the book demonstrates the process of conquering big data with HPC (in Chaps. 11, 13, and 14).
The need and advantage of involving humans in the process of data exploration (as discussed in Chaps. 12 and 14) indicate that the hybrid combination of man and the machine (HPC resources) can help in achieving astonishing results. The book also includes a short discussion on using databases on HPC resources (in Chap. 15). The Wrangler supercomputer at the Texas Advanced Computing Center (TACC) is a top-notch data-intensive computing platform. Some examples of the projects that are taking advantage of Wrangler are also included in the book (in Chap. 16).

I hope that the readers of this book will feel encouraged to use HPC resources for their big data processing and management needs. The researchers in academia and at government institutions in the United States are encouraged to explore the possibilities of incorporating HPC in their work through TACC and the Extreme Science and Engineering Discovery Environment (XSEDE) resources.

I am grateful to all the authors who have contributed toward making this book a reality. I am grateful to all the reviewers for their timely and valuable feedback in improving the content of the book. I am grateful to my colleagues at TACC and my family for their selfless support at all times.

Austin, TX, USA
Ritu Arora

Contents

1 An Introduction to Big Data, High Performance Computing, High-Throughput Computing, and Hadoop
  Ritu Arora
2 Using High Performance Computing for Conquering Big Data
  Antonio Gómez-Iglesias and Ritu Arora
3 Data Movement in Data-Intensive High Performance Computing
  Pietro Cicotti, Sarp Oral, Gokcen Kestor, Roberto Gioiosa, Shawn Strande, Michela Taufer, James H. Rogers, Hasan Abbasi, Jason Hill, and Laura Carrington
4 Using Managed High Performance Computing Systems for High-Throughput Computing
  Lucas A. Wilson
5 Accelerating Big Data Processing on Modern HPC Clusters
  Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K. (DK) Panda
6 dispel4py: Agility and Scalability for Data-Intensive Methods Using HPC
  Rosa Filgueira, Malcolm P. Atkinson, and Amrey Krause
7 Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters
  Wucherl Yoo, Michelle Koo, Yi Cao, Alex Sim, Peter Nugent, and Kesheng Wu
8 Big Data Behind Big Data
  Elizabeth Bautista, Cary Whitney, and Thomas Davis
9 Empowering R with High Performance Computing Resources for Big Data Analytics
  Weijia Xu, Ruizhu Huang, Hui Zhang, Yaakoub El-Khamra, and David Walling
10 Big Data Techniques as a Solution to Theory Problems
  Richard W. Evans, Kenneth L. Judd, and Kramer Quist
11 High-Frequency Financial Statistics Through High-Performance Computing
  Jian Zou and Hui Zhang
12 Large-Scale Multi-Modal Data Exploration with Human in the Loop
  Guangchen Ruan and Hui Zhang
13 Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection
  Ritu Arora, Jessica Trelogan, and Trung Nguyen Ba
14 Big Data Processing in the eDiscovery Domain
  Sukrit Sondhi and Ritu Arora
15 Databases and High Performance Computing
  Ritu Arora and Sukrit Sondhi
16 Conquering Big Data Through the Usage of the Wrangler Supercomputer
  Jorge Salazar

Chapter 1
An Introduction to Big Data, High Performance Computing, High-Throughput Computing, and Hadoop

Ritu Arora

Abstract Recent advancements in the field of instrumentation, adoption of some of the latest Internet technologies and applications, and the declining cost of storing large volumes of data, have enabled researchers and organizations to gather increasingly large datasets. Such vast datasets are precious due to the potential of discovering new knowledge and developing insights from them, and they are also referred to as “Big Data”.
While in a large number of domains, Big Data is a newly found treasure that brings in new challenges, there are various other domains that have been handling such treasures for many years now using state-of-the-art resources, techniques and technologies. The goal of this chapter is to provide an introduction to such resources, techniques, and technologies, namely, High Performance Computing (HPC), High-Throughput Computing (HTC), and Hadoop. First, each of these topics is defined and discussed individually. These topics are then discussed further in the light of enabling short time to discoveries and, hence, with respect to their importance in conquering Big Data.

R. Arora, Texas Advanced Computing Center, Austin, TX, USA
© Springer International Publishing Switzerland 2016. R. Arora (ed.), Conquering Big Data with High Performance Computing, DOI 10.1007/978-3-319-33742-5_1

1.1 Big Data

Recent advancements in the field of instrumentation, adoption of some of the latest Internet technologies and applications, and the declining cost of storing large volumes of data, have enabled researchers and organizations to gather increasingly large and heterogeneous datasets. Due to their enormous size, heterogeneity, and high speed of collection, such large datasets are often referred to as “Big Data”. Even though the term “Big Data” and the mass awareness about it have gained momentum only recently, there are several domains, from life sciences to geosciences to archaeology, that have been generating and accumulating large and heterogeneous datasets for many years now. As an example, a geoscientist could have more than 30 years of global Landsat data [1], NASA Earth Observation System data