Table Of Content

Computer Communications and Networks K.G. Srinivasa Anil Kumar Muppalla Guide to High Performance Distributed Computing Case Studies with Hadoop, Scalding and Spark Computer Communications and Networks Series editor A.J. Sammes Centre for Forensic Computing Cranfield University, Shrivenham campus Swindon, UK The Computer Communications and Networks series is a range of textbooks, monographs and handbooks. It sets out to provide students, researchers, and non- specialists alike with a sure grounding in current knowledge, together with com- prehensible access to the latest developments in computer communications and networking. Emphasis is placed on clear and explanatory styles that support a tutorial approach, so that even the most complex of topics is presented in a lucid and intelligible manner. More information about this series at http://www.springer.com/series/4198 K.G. Srinivasa Anil Kumar Muppalla • Guide to High Performance Distributed Computing Case Studies with Hadoop, Scalding and Spark 123 K.G.Srinivasa AnilKumar Muppalla M.S.Ramaiah InstituteofTechnology M.S.Ramaiah InstituteofTechnology Bangalore Bangalore India India ISSN 1617-7975 ISSN 2197-8433 (electronic) Computer Communicationsand Networks ISBN 978-3-319-13496-3 ISBN 978-3-319-13497-0 (eBook) DOI 10.1007/978-3-319-13497-0 LibraryofCongressControlNumber:2014956502 SpringerChamHeidelbergNewYorkDordrechtLondon ©SpringerInternationalPublishingSwitzerland2015 Thisworkissubjecttocopyright.AllrightsarereservedbythePublisher,whetherthewholeorpart of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilarmethodologynowknownorhereafterdeveloped. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publicationdoesnotimply,evenintheabsenceofaspecificstatement,thatsuchnamesareexempt fromtherelevantprotectivelawsandregulationsandthereforefreeforgeneraluse. Thepublisher,theauthorsandtheeditorsaresafetoassumethattheadviceandinformationinthis book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained hereinorforanyerrorsoromissionsthatmayhavebeenmade. Printedonacid-freepaper Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com) DedicatedtoOneness Preface Overview Astheuseofcomputersbecamewidespreadinthelasttwentyyears,therehasbeen anavalancheofdigitaldatagenerated.Theadventofdigitizationofallequipments andtoolsinhomesandindustryhavealsocontributedtothegrowthofdigitaldata. The demand to store, process and analyze this huge, growing data is answered by a host of tools in the market. On the hardware front the High Performance Com- puting(HPC)systemsthatfunctionabovetera-floating-pointoperationspersecond undertakethetaskofmanaginghugedata.HPCsystemsneedstoworkindistributed environmentassinglemachinecannothandlethecomplexnatureofitsoperations. Therearetwotrendsinachievingtheteraflopscaleoperationsinadistributedway. Connecting computers via global network and handling the complex task of data management in distributed way is one approach. In other approach dedicated pro- cessors are kept close to each other thereby saving the data transfer time between themachines.Theconvergenceofbothtrendsisfastemergingandpromisestopro- vide faster, efficient hardware solutions to the problems of handling voluminous data. The popular software solution to the problem of huge data management has been ApacheHadoop.Hadoop’secosystemconsistsofHadoopDistributedFileSystem (HDFS), MapReduce framework with support for multiple data formats and data sources, unit testing, clustering variants and related projects like Pig, Hive etc. It providestoolsforlife-cyclemanagementofdataincludingstorageandprocessing. The strength of Hadoop is that it is built to manage very large amounts of data throughadistributedmodel.Itcanalsoworkwithunstructureddatawhichmakesit attractive.CombinedwithaHPCbackbone,Hadoopcanmakethetaskofhandling hugedataveryeasy. Today there are many high level Hadoop frameworks like Pig, Hive, Scoobi, Scrunch, Cascalog, Scalding and Spark that which make it easy to use Hadoop. Most of them are supported by well known organizations like Yahoo (Pig), Face- book (Hive), Cloudera (Scrunch) and Twitter (Scalding) demonstrating the wide vii viii Preface patronageHadoopenjoysintheindustry.TheseframeworksusethebasicHadoop moduleslikeHDFSandMapReducebutprovidesaneasymethodtomanagecom- plex data processing jobs by creating an abstraction to hide the complexities of Hadoopmodules.AnexampleofsuchabstractionisCascading.Manyspecificlan- guages are built using the framework of Cascading. One such implementation by TwitteriscalledScaldingwhichitusestoquerylargedatasetliketweetsstoredin HDFS. DatastorageinHadoopandScaldingismostlydiskbased.Thisarchitectureimpacts theperformanceduetolongseek/transfertimeofdata.Ifdataisreadfromdiskand thenheldinmemorywheretheycanalsobecached,theperformanceofthesystem willincreasemanifold.Sparkimplementsthisconceptandclaimsitis100xfaster than MapReduce in memory and 10x faster on disk. Spark uses the basic abstrac- tionofResilientDistributedDatasetswhicharedistributedimmutablecollections. SinceSparkstoresdatainmemoryiterativealgorithmsindataminingandmachine learningcanbeperformedefficiently. Objectives Theaimofthisbookistopresenttherequiredskillstosetupandbuildlargescale distributedprocessingsystemsusingthefreeandopensourcetoolsandtechnologies likeHadoop,Scalding,Spark.Thekeyobjectivesforthisbookinclude: Capturingthestateoftheartinbuildinghighperformancedistributedcomputing • systemsusingHadoop,ScaldingandSpark Providingrelevanttheoreticalsoftwareframeworksandpracticalapproaches • Providingguidanceandbestpracticesforstudentsandpractitionersoffreeand • opensourcesoftwaretechnologieslikeHadoop,ScaldingandSpark Advancing the understanding of building scalable software systems for large • scale data processing as relevant to the emerging new paradigm of High Per- formanceDistributedComputing(HPDC) Organization Thereare8chaptersinAGuideToHighPerformanceDistributedComputing Case StudieswithHadoop,ScaldingandSpark.Theseareorganizedintwoparts. Part I: Programming fundamentals of High Performance Distributed Comput- ing Chapter1coversthebasicsofdistributedsystemswhichformthebackboneofmod- ernHPDCparadigmslikeCloudComputing,Grid/ClusterSystems.Itstartsbydis- cussingvariousformsofdistributedsystemsandexplainingtheirgenericarchitec- ture.Distributedfilesystemswhichformthecentralthemeofsuchdesignarealso covered.Thetechnicalchallengesencounteredintheirdevelopmentandtherecent trendsinthisdomainarealsodealtwithahostofrelevantexamples. Preface ix ThediscussionontheoverviewofHadoopecosysteminChapter2isfollowedby astep-by-stepinstructiononitsinstallation,programmingandexecution.Chapter3 startsbydescribingthecoreofSparkwhichisResilientDistributedDatabases.The installation,programmingAPIandsomeexamplesarealsocoveredinthischapter. HadoopstreamingisthefocusofChapter4whichalsocoversworkingwithScald- ing.UsingPythonwithHadoopandSparkisalsodiscussed. PartII:CasestudiesusingHadoop,ScaldingandSpark Thatthecurrentbookdoesnotlimititselftoexplainingthebasictheoreticalfoun- dations and presenting sample programs is its biggest advantage. There are four casestudiespresentedinthisbookwhichcoversahostofapplicationdomainsand computational approaches so as to convert any doubter into a believer of Scald- ing and Spark. Chapter 5 takes up the task of implementing K-Means Clustering Algorithm while Chapter 6 covers data classification problems using Naive-Bayes classifier.Continuingthecoverageofdataminingandmachinelearningapproaches in distributed systems using Scalding and Spark, regression analysis is covered in Chapter7. Recommendersystemshavebecomeverypopulartodayinvariousdomains.They automate the task of middleman who can connect two otherwise disjoint entities. This is becoming much needed feature in all modern networked applications in shopping, searching and publishing. A working recommender system should not only have a strong computational engine but should also be scalable at real-time. Chapter8explainstheprocessofcreatingsucharecommendersystemusingScald- ingandSpark. TargetAudience AGuideToHighPerformanceDistributedComputing CaseStudieswithHadoop, ScaldingandSparkhasbeendevelopedtosupportanumberofpotentialaudiences, includingthefollowing: SoftwareEngineersandApplicationDevelopers • StudentsandUniversityLecturers • ContributorstoFreeandOpenSourceSoftware • Researchers • CodeRepository Thecompletelistofsourcecodeanddatasetsusedinthisbookcanbefoundhere https://github.com/4ni1/hpdc-scalding-spark Bangalore,India SrinivasaKG September2014 AnilKumarMuppalla About the Authors KGSrinivasa SrinivasaKGreceivedhisPhDinComputerScienceandEngineeringfromBanga- loreUniversityin2007.HeisnowworkingasaProfessorandHeadintheDepart- ment of Computer Science and Engineering, M S Ramaiah Institute of Technol- ogy, Bangalore. He is the recipient of All India Council for Technical Education - CareerAwardforYoungTeachers,IndianSocietyofTechnicalEducation ISGITS National Award for Best Research Work Done by Young Teachers, Institution of Engineers(India) IEI Young Engineer Award in Computer Engineering, Rajaram- bapuPatilNationalAwardforPromisingEngineeringTeacherAwardfromISTE- 2012,IMSSingapore VisitingScientistFellowshipAward.Hehaspublishedmore thanhundredresearchpapersinInternationalConferencesandJournals.Hehasvis- itedmanyUniversitiesabroadasavisitingresearcher HehasvisitedUniversityof Oklahoma,USA,IowaStateUniversity,USA,HongKongUniversity,KoreanUni- versity,NationalUniversityofSingaporearefewprominentvisits.Hehasauthored twobooksnamelyFileStructuresusingC++byTMHandSoftComputerforData MiningApplicationsLNAISeries Springer.HehasbeenawardedBOYSCASTFel- lowshipbyDST,forconductingcollaborativeResearchwithCloudsLaboratoryin UniversityofMelbourneintheareaofCloudComputing.HeistheprincipalInves- tigator for many funded projects from UGC, DRDO, and DST. His research areas includeDataMining,MachineLearning,HighPerformanceComputingandCloud Computing. He is the Senior Member of IEEE and ACM. He can be reached at kgsrinivas@msrit.edu AnilKumarMuppalla Mr.AnilMuppallaisaresearcherandauthor.HeholdsdegreeinComputerScience and Engineering. He is a developer and software consultant for many industries. Heisalsoactiveresearcherandpublishedmanypapersininternationalconferences and journals. His skills include application development using Hadoop, Scalding andSpark.Hecanbecontactedatanil@msrit.edu. xi

Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark PDF

310 Pages·2015·5.714 MB·English

by Muppalla, Anil Kumar;Srinivasa, K. G

Checking for file health...

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Download Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark PDF Free - Full Version

by Muppalla, Anil Kumar;Srinivasa, K. G| 2015| 310 pages| 5.714| English

Download Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark by Muppalla, Anil Kumar;Srinivasa, K. G in PDF format completely FREE. No registration required, no payment needed. Get instant access to this valuable resource on PDFdrive.to!

Free Download PDF

About Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark

No description available for this book.

Detailed Information

Author:	Muppalla, Anil Kumar;Srinivasa, K. G
Publication Year:	2015
ISBN:	3319134973
Pages:	310
Language:	English
File Size:	5.714
Format:	PDF
Price:	FREE

Download Free PDF

Safe & Secure Download - No registration required

Why Choose PDFdrive for Your Free Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark Download?

100% Free: No hidden fees or subscriptions required for one book every day.
No Registration: Immediate access is available without creating accounts for one book every day.
Safe and Secure: Clean downloads without malware or viruses
Multiple Formats: PDF, MOBI, Mpub,... optimized for all devices
Educational Resource: Supporting knowledge sharing and learning

Frequently Asked Questions

Is it really free to download Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark PDF?

Yes, on https://PDFdrive.to you can download Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark by Muppalla, Anil Kumar;Srinivasa, K. G completely free. We don't require any payment, subscription, or registration to access this PDF file. For 3 books every day.

How can I read Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark on my mobile device?

After downloading Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark PDF, you can open it with any PDF reader app on your phone or tablet. We recommend using Adobe Acrobat Reader, Apple Books, or Google Play Books for the best reading experience.

Is this the full version of Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark?

Yes, this is the complete PDF version of Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark by Muppalla, Anil Kumar;Srinivasa, K. G. You will be able to read the entire content as in the printed version without missing any pages.

Is it legal to download Guide to high performance distributed computing: case studies with Hadoop, Scalding and Spark PDF for free?

https://PDFdrive.to provides links to free educational resources available online. We do not store any files on our servers. Please be aware of copyright laws in your country before downloading.

The materials shared are intended for research, educational, and personal use in accordance with fair use principles.