Table Of Content

Computer Communications and Networks K.G. Srinivasa Anil Kumar Muppalla Guide to High Performance Distributed Computing Case Studies with Hadoop, Scalding and Spark Computer Communications and Networks Series editor A.J. Sammes Centre for Forensic Computing Cranfield University, Shrivenham campus Swindon, UK The Computer Communications and Networks series is a range of textbooks, monographs and handbooks. It sets out to provide students, researchers, and non- specialists alike with a sure grounding in current knowledge, together with com- prehensible access to the latest developments in computer communications and networking. Emphasis is placed on clear and explanatory styles that support a tutorial approach, so that even the most complex of topics is presented in a lucid and intelligible manner. More information about this series at http://www.springer.com/series/4198 K.G. Srinivasa Anil Kumar Muppalla (cid:129) Guide to High Performance Distributed Computing Case Studies with Hadoop, Scalding and Spark 123 K.G.Srinivasa AnilKumar Muppalla M.S.Ramaiah InstituteofTechnology M.S.Ramaiah InstituteofTechnology Bangalore Bangalore India India ISSN 1617-7975 ISSN 2197-8433 (electronic) Computer Communicationsand Networks ISBN 978-3-319-13496-3 ISBN 978-3-319-13497-0 (eBook) DOI 10.1007/978-3-319-13497-0 LibraryofCongressControlNumber:2014956502 SpringerChamHeidelbergNewYorkDordrechtLondon ©SpringerInternationalPublishingSwitzerland2015 Thisworkissubjecttocopyright.AllrightsarereservedbythePublisher,whetherthewholeorpart of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilarmethodologynowknownorhereafterdeveloped. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publicationdoesnotimply,evenintheabsenceofaspecificstatement,thatsuchnamesareexempt fromtherelevantprotectivelawsandregulationsandthereforefreeforgeneraluse. Thepublisher,theauthorsandtheeditorsaresafetoassumethattheadviceandinformationinthis book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained hereinorforanyerrorsoromissionsthatmayhavebeenmade. Printedonacid-freepaper Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com) DedicatedtoOneness Preface Overview Astheuseofcomputersbecamewidespreadinthelasttwentyyears,therehasbeen anavalancheofdigitaldatagenerated.Theadventofdigitizationofallequipments andtoolsinhomesandindustryhavealsocontributedtothegrowthofdigitaldata. The demand to store, process and analyze this huge, growing data is answered by a host of tools in the market. On the hardware front the High Performance Com- puting(HPC)systemsthatfunctionabovetera-floating-pointoperationspersecond undertakethetaskofmanaginghugedata.HPCsystemsneedstoworkindistributed environmentassinglemachinecannothandlethecomplexnatureofitsoperations. Therearetwotrendsinachievingtheteraflopscaleoperationsinadistributedway. Connecting computers via global network and handling the complex task of data management in distributed way is one approach. In other approach dedicated pro- cessors are kept close to each other thereby saving the data transfer time between themachines.Theconvergenceofbothtrendsisfastemergingandpromisestopro- vide faster, efficient hardware solutions to the problems of handling voluminous data. The popular software solution to the problem of huge data management has been ApacheHadoop.Hadoop’secosystemconsistsofHadoopDistributedFileSystem (HDFS), MapReduce framework with support for multiple data formats and data sources, unit testing, clustering variants and related projects like Pig, Hive etc. It providestoolsforlife-cyclemanagementofdataincludingstorageandprocessing. The strength of Hadoop is that it is built to manage very large amounts of data throughadistributedmodel.Itcanalsoworkwithunstructureddatawhichmakesit attractive.CombinedwithaHPCbackbone,Hadoopcanmakethetaskofhandling hugedataveryeasy. Today there are many high level Hadoop frameworks like Pig, Hive, Scoobi, Scrunch, Cascalog, Scalding and Spark that which make it easy to use Hadoop. Most of them are supported by well known organizations like Yahoo (Pig), Face- book (Hive), Cloudera (Scrunch) and Twitter (Scalding) demonstrating the wide vii viii Preface patronageHadoopenjoysintheindustry.TheseframeworksusethebasicHadoop moduleslikeHDFSandMapReducebutprovidesaneasymethodtomanagecom- plex data processing jobs by creating an abstraction to hide the complexities of Hadoopmodules.AnexampleofsuchabstractionisCascading.Manyspecificlan- guages are built using the framework of Cascading. One such implementation by TwitteriscalledScaldingwhichitusestoquerylargedatasetliketweetsstoredin HDFS. DatastorageinHadoopandScaldingismostlydiskbased.Thisarchitectureimpacts theperformanceduetolongseek/transfertimeofdata.Ifdataisreadfromdiskand thenheldinmemorywheretheycanalsobecached,theperformanceofthesystem willincreasemanifold.Sparkimplementsthisconceptandclaimsitis100xfaster than MapReduce in memory and 10x faster on disk. Spark uses the basic abstrac- tionofResilientDistributedDatasetswhicharedistributedimmutablecollections. SinceSparkstoresdatainmemoryiterativealgorithmsindataminingandmachine learningcanbeperformedefficiently. Objectives Theaimofthisbookistopresenttherequiredskillstosetupandbuildlargescale distributedprocessingsystemsusingthefreeandopensourcetoolsandtechnologies likeHadoop,Scalding,Spark.Thekeyobjectivesforthisbookinclude: • Capturingthestateoftheartinbuildinghighperformancedistributedcomputing systemsusingHadoop,ScaldingandSpark • Providingrelevanttheoreticalsoftwareframeworksandpracticalapproaches • Providingguidanceandbestpracticesforstudentsandpractitionersoffreeand opensourcesoftwaretechnologieslikeHadoop,ScaldingandSpark • Advancing the understanding of building scalable software systems for large scale data processing as relevant to the emerging new paradigm of High Per- formanceDistributedComputing(HPDC) Organization Thereare8chaptersinAGuideToHighPerformanceDistributedComputing Case StudieswithHadoop,ScaldingandSpark.Theseareorganizedintwoparts. Part I: Programming fundamentals of High Performance Distributed Comput- ing Chapter1coversthebasicsofdistributedsystemswhichformthebackboneofmod- ernHPDCparadigmslikeCloudComputing,Grid/ClusterSystems.Itstartsbydis- cussingvariousformsofdistributedsystemsandexplainingtheirgenericarchitec- ture.Distributedfilesystemswhichformthecentralthemeofsuchdesignarealso covered.Thetechnicalchallengesencounteredintheirdevelopmentandtherecent trendsinthisdomainarealsodealtwithahostofrelevantexamples. Preface ix ThediscussionontheoverviewofHadoopecosysteminChapter2isfollowedby astep-by-stepinstructiononitsinstallation,programmingandexecution.Chapter3 startsbydescribingthecoreofSparkwhichisResilientDistributedDatabases.The installation,programmingAPIandsomeexamplesarealsocoveredinthischapter. HadoopstreamingisthefocusofChapter4whichalsocoversworkingwithScald- ing.UsingPythonwithHadoopandSparkisalsodiscussed. PartII:CasestudiesusingHadoop,ScaldingandSpark Thatthecurrentbookdoesnotlimititselftoexplainingthebasictheoreticalfoun- dations and presenting sample programs is its biggest advantage. There are four casestudiespresentedinthisbookwhichcoversahostofapplicationdomainsand computational approaches so as to convert any doubter into a believer of Scald- ing and Spark. Chapter 5 takes up the task of implementing K-Means Clustering Algorithm while Chapter 6 covers data classification problems using Naive-Bayes classifier.Continuingthecoverageofdataminingandmachinelearningapproaches in distributed systems using Scalding and Spark, regression analysis is covered in Chapter7. Recommendersystemshavebecomeverypopulartodayinvariousdomains.They automate the task of middleman who can connect two otherwise disjoint entities. This is becoming much needed feature in all modern networked applications in shopping, searching and publishing. A working recommender system should not only have a strong computational engine but should also be scalable at real-time. Chapter8explainstheprocessofcreatingsucharecommendersystemusingScald- ingandSpark. TargetAudience AGuideToHighPerformanceDistributedComputing CaseStudieswithHadoop, ScaldingandSparkhasbeendevelopedtosupportanumberofpotentialaudiences, includingthefollowing: • SoftwareEngineersandApplicationDevelopers • StudentsandUniversityLecturers • ContributorstoFreeandOpenSourceSoftware • Researchers CodeRepository Thecompletelistofsourcecodeanddatasetsusedinthisbookcanbefoundhere https://github.com/4ni1/hpdc-scalding-spark Bangalore,India SrinivasaKG September2014 AnilKumarMuppalla About the Authors KGSrinivasa SrinivasaKGreceivedhisPhDinComputerScienceandEngineeringfromBanga- loreUniversityin2007.HeisnowworkingasaProfessorandHeadintheDepart- ment of Computer Science and Engineering, M S Ramaiah Institute of Technol- ogy, Bangalore. He is the recipient of All India Council for Technical Education - CareerAwardforYoungTeachers,IndianSocietyofTechnicalEducation ISGITS National Award for Best Research Work Done by Young Teachers, Institution of Engineers(India) IEI Young Engineer Award in Computer Engineering, Rajaram- bapuPatilNationalAwardforPromisingEngineeringTeacherAwardfromISTE- 2012,IMSSingapore VisitingScientistFellowshipAward.Hehaspublishedmore thanhundredresearchpapersinInternationalConferencesandJournals.Hehasvis- itedmanyUniversitiesabroadasavisitingresearcher HehasvisitedUniversityof Oklahoma,USA,IowaStateUniversity,USA,HongKongUniversity,KoreanUni- versity,NationalUniversityofSingaporearefewprominentvisits.Hehasauthored twobooksnamelyFileStructuresusingC++byTMHandSoftComputerforData MiningApplicationsLNAISeries Springer.HehasbeenawardedBOYSCASTFel- lowshipbyDST,forconductingcollaborativeResearchwithCloudsLaboratoryin UniversityofMelbourneintheareaofCloudComputing.HeistheprincipalInves- tigator for many funded projects from UGC, DRDO, and DST. His research areas includeDataMining,MachineLearning,HighPerformanceComputingandCloud Computing. He is the Senior Member of IEEE and ACM. He can be reached at kgsrinivas@msrit.edu AnilKumarMuppalla Mr.AnilMuppallaisaresearcherandauthor.HeholdsdegreeinComputerScience and Engineering. He is a developer and software consultant for many industries. Heisalsoactiveresearcherandpublishedmanypapersininternationalconferences and journals. His skills include application development using Hadoop, Scalding andSpark.Hecanbecontactedatanil@msrit.edu. xi

Description:

This timely text/reference describes the development and implementation of large-scale distributed processing systems using open source tools and technologies. Comprehensive in scope, the book presents state-of-the-art material on building high performance distributed computing systems, providing pr

Guide to High Performance Distributed Computing: Case Studies with Hadoop, Scalding and Spark PDF

310 Pages·2015·5.65 MB·English

by K.G. Srinivasa

Checking for file health...

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Guide to High Performance Distributed Computing: Case Studies with Hadoop, Scalding and Spark • Digital Edition • Free Download

By K.G. Srinivasa| File: 5.65| English| 2015

The digital edition of "Guide to High Performance Distributed Computing: Case Studies with Hadoop, Scalding and Spark" is available from our community library collection.

Retrieve Document

Resource Overview

Publication Details

Title:	Guide to High Performance Distributed Computing: Case Studies with Hadoop, Scalding and Spark
Author:	K.G. Srinivasa
Published by:
Publication date:	2015
ISBN:
Document length:	310 pages
Written in:	English
Digital size:	5.65
Access type:	Library Collection • No Cost

Reader Resources

Digital Reading Tips

This document works on all major e-readers and devices
Adjust brightness for comfortable extended reading
Use bookmarking features to track your progress
Searchable text helps locate specific content quickly

Library Collection Policy

Zlibrary.cc maintains an extensive collection of digital documents for educational and research purposes. We believe in providing access to knowledge for all.

Access "Guide to High Performance Distributed Computing: Case Studies with Hadoop, Scalding and Spark" Document

Available in multiple formats • No account creation required

Common Questions

Is this the complete version of "Guide to High Performance Distributed Computing: Case Studies with Hadoop, Scalding and Spark"?

Yes, this is the complete digital edition of "Guide to High Performance Distributed Computing: Case Studies with Hadoop, Scalding and Spark" by K.G. Srinivasa. The document includes all content from the original publication with no omissions.

What is Zlibrary.cc's approach to digital resources?

Zlibrary.cc serves as a catalog and access point to educational materials freely available across the internet. We do not host these files on our servers but provide links to sources where they can be accessed. Our mission is to support education and research by making knowledge more discoverable. Users should adhere to copyright laws in their jurisdiction when accessing these resources.