On-the-fly Task Execution for Speeding Up Pipelined MapReduce

Diana Moise (1), Gabriel Antoniu (1), and Luc Bougé (2)
(1) INRIA Rennes - Bretagne Atlantique / IRISA, France
(2) ENS Cachan - Brittany / IRISA, France

Euro-Par - 18th International European Conference on Parallel and Distributed Computing, Aug 2012, Rhodes Island, Greece. HAL Id: hal-00706844 (https://hal.inria.fr/hal-00706844), submitted on 11 Jun 2012.

Abstract. The MapReduce programming model is widely acclaimed as a key solution to designing data-intensive applications. However, many of the computations that fit this model cannot be expressed as a single MapReduce execution, but require a more complex design. Such applications, consisting of multiple jobs chained into a long-running execution, are called pipeline MapReduce applications. Standard MapReduce frameworks are not optimized for the specific requirements of pipeline applications, which leads to performance issues. In order to optimize the execution of pipelined MapReduce, we propose a mechanism for creating map tasks along the pipeline, as soon as their input data becomes available. We implemented our approach in the Hadoop MapReduce framework. The benefits of our dynamic task scheduling are twofold: reducing job-completion time and increasing cluster utilization by involving more resources in the computation. Experimental evaluation performed on the Grid'5000 testbed shows that our approach delivers performance gains between 9% and 32%.

Keywords: MapReduce, pipeline MapReduce applications, intermediate data management, task scheduling, Hadoop, HDFS

1 Introduction

The MapReduce abstraction has revolutionized the data-intensive community and has rapidly spread to various research and production areas. Google introduced MapReduce [8] as a solution to the need to process datasets up to multiple terabytes in size on a daily basis. The goal of the MapReduce programming model is to provide an abstraction that enables users to perform computations on large amounts of data.

The MapReduce abstraction is inspired by the "map" and "reduce" primitives commonly used in functional programming. When designing an application using the MapReduce paradigm, the user has to specify two functions, map and reduce, that are executed in parallel on multiple machines. Applications that can be modeled by means of MapReduce mostly consist of two computations: the "map" step, which applies a filter on the input data, selecting only the data that satisfies a given condition, and the "reduce" step, which collects and aggregates all the data produced by the first phase.
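To make the two user-supplied functions concrete, the following is a minimal Hadoop-style sketch (our illustration, not taken from the paper): the map step filters records matching a condition, and the reduce step counts the matches. The record format and the "ERROR" predicate are hypothetical assumptions.

// A minimal sketch (not from the paper) of the two user-supplied functions:
// the map step filters input lines (here: keep lines containing "ERROR",
// a hypothetical predicate) and the reduce step aggregates the results.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FilterCount {
  public static class FilterMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // "map" as a filter: emit only records satisfying the condition
      if (line.toString().contains("ERROR")) {
        ctx.write(new Text("error"), ONE);
      }
    }
  }

  public static class CountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      // "reduce" as an aggregator: collect all values emitted for a key
      int sum = 0;
      for (IntWritable v : vals) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
}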
The MapReduce model exposes a simple interface that can be easily manipulated by users without any experience with parallel and distributed systems. However, the interface is versatile enough to suit a wide range of data-intensive applications. These are the main reasons for which MapReduce has enjoyed increasing popularity ever since it was introduced.

An open-source implementation of Google's abstraction was provided by Yahoo! through the Hadoop [5] project. This framework is considered the reference MapReduce implementation and is currently heavily used for various purposes and on several infrastructures. The MapReduce paradigm has also been adopted by the cloud computing community as a support for data-intensive cloud-based applications. Cloud providers support MapReduce computations so as to take advantage of the huge processing and storage capabilities the cloud holds, while at the same time providing the user with a clean and easy-to-use interface. Amazon released Elastic MapReduce [2], a web service that enables users to easily and cost-effectively process large amounts of data. The service consists of a hosted Hadoop framework running on Amazon's Elastic Compute Cloud (EC2) [1]. Amazon's Simple Storage Service (S3) [3] serves as storage layer for Hadoop. AzureMapReduce [9] is an implementation of the MapReduce programming model based on the infrastructure services offered by the Azure cloud [6]. Azure's infrastructure services are built to provide scalability, high throughput and data availability. These features are used by the AzureMapReduce runtime as mechanisms for achieving fault tolerance and efficient data processing at large scale.

MapReduce is used to model a wide variety of applications belonging to numerous domains such as analytics (data processing), image processing, machine learning, bioinformatics, astrophysics, etc. There are many scenarios in which designing an application with MapReduce requires users to employ several MapReduce processing steps. These applications, consisting of multiple MapReduce jobs chained into a long-running execution, are called pipeline MapReduce applications. In this paper, we study the characteristics of pipeline MapReduce applications and focus on optimizing their execution. Existing MapReduce frameworks manage pipeline MapReduce applications as a sequence of MapReduce jobs. Whether they are employed directly by users or through higher-level tools, MapReduce frameworks are not optimized for executing pipeline applications. A major drawback comes from the fact that the jobs in the pipeline have to be executed sequentially: a job cannot start until all the input data it processes has been generated by the previous job in the pipeline.

In order to speed up the execution of pipelined MapReduce, we propose a new mechanism for creating "map" tasks along the pipeline, as soon as their input data becomes available. Our approach allows successive jobs in the pipeline to overlap the execution of "reduce" tasks with that of "map" tasks. In this manner, by dynamically creating and scheduling tasks, the framework is able to complete the execution of the pipeline faster. In addition, our approach ensures a more efficient cluster utilization, with respect to the amount of resources involved in the computation. We implemented the proposed mechanisms in the Hadoop MapReduce framework [5] and evaluated the benefits of our approach through extensive experimental evaluation.
In Section 2 we present an overview of pipelined MapReduce, as well as the scenarios in which this type of processing is employed. Section 3 introduces the mechanisms we propose and shows their implementation in Hadoop. Section 4 is dedicated to the experiments we performed; we detail the environmental setup and the scenarios we selected for execution in order to measure the impact of our approach. Section 5 summarizes the contributions of this work and presents directions for future work.

2 Pipeline MapReduce applications: overview and related work

Many of the computations that fit the MapReduce model cannot be expressed as a single MapReduce execution, but require a more complex design. These applications, consisting of multiple MapReduce jobs chained into a long-running execution, are called pipeline MapReduce applications. Each stage in the pipeline is a MapReduce job (with 2 phases, "map" and "reduce"), and the output data produced by one stage is fed as input to the next stage in the pipeline. Usually, pipeline MapReduce applications are long-running tasks that generate large amounts of intermediate data (the data produced between stages). This type of data is transferred between stages and has different characteristics from the meaningful data (the input and output of an application). While the input and output data are expected to be persistent and are likely to be read multiple times (during and after the execution of the application), intermediate data is transient data that is usually written once, by one stage, and read once, by the next stage.

However, there are few scenarios in which users directly design their application as a pipeline of MapReduce jobs. Most of the use cases of MapReduce pipelines come from applications that translate into a chain of MapReduce jobs. One of the drawbacks of the extreme simplicity of the MapReduce model is that it cannot be straightforwardly used in more complex scenarios. For instance, in order to use MapReduce for higher-level computations (for example, the operations performed in the database domain), one has to deal with issues like multi-stage execution plans, branching data-flows, etc. The trend of using MapReduce for database-like operations led to the development of high-level query languages that are executed as MapReduce jobs, such as Hive [14], Pig [12], and Sawzall [13]. Pig is a distributed infrastructure for performing high-level analysis on large datasets. The Pig platform consists of a high-level query language called Pig Latin and the framework for running computations expressed in Pig Latin. Pig Latin programs comprise SQL-like high-level constructs for manipulating data that are interleaved with MapReduce-style processing. The Pig framework compiles these programs into a pipeline of MapReduce jobs that are executed within the Hadoop environment.

The scenarios in which users actually devise their applications as MapReduce pipelines involve binary data whose format does not fit the high-level structures of the aforementioned frameworks. In order to facilitate the design of pipeline MapReduce applications, Cloudera recently released Crunch [4], a tool that generates a pipeline of MapReduce jobs and manages their execution. While there are several frameworks that generate pipeline MapReduce applications, few works focus on optimizing the actual execution of this type of applications. In [11], the authors propose a tool for estimating the progress of MapReduce pipelines generated by Pig queries. The Hadoop Online Prototype (HOP) [7] is a modified version of the Hadoop MapReduce framework that supports online aggregation, allowing users to get snapshots from a job as it is being computed. HOP employs pipelining of data between MapReduce jobs, i.e., the reduce tasks of one job can optionally pipeline their output directly to the map tasks of the next job.
However, by circumventing the storing of data in a distributed file system (DFS) between the jobs, fault tolerance cannot be guaranteed by the system. Furthermore, as the computation of the reduce function from the previous job and the map function of the next job cannot be overlapped, the jobs in the pipeline are executed sequentially.

3 Introducing dynamic scheduling of map tasks in Hadoop

3.1 Motivation

In a pipeline of MapReduce applications, the intermediate data generated between the stages represents the output data of one stage and the input data for the next stage. The intermediate data is produced by one job and consumed by the next job in the pipeline. When running this kind of applications in a dedicated framework, the intermediate data is usually stored in the distributed file system that also stores the user input data and the output result. This approach ensures intermediate data availability, and thus provides fault tolerance, a very important factor when executing pipeline applications. However, using MapReduce frameworks to execute pipeline applications raises performance issues, since MapReduce frameworks are not optimized for the specific features of intermediate data. The main performance issue comes from the fact that the jobs in the pipeline have to be executed sequentially: a job cannot start until all the input data it processes has been generated by the job in the previous stage of the pipeline. Consequently, the framework runs only one job at a time, which results in inefficient cluster utilization and, basically, a waste of resources.

3.2 Executing pipeline MapReduce applications with Hadoop

The Hadoop project provides an open-source implementation of Google's MapReduce paradigm through the Hadoop MapReduce framework [5, 15]. The framework was designed following Google's architectural model and has become the reference MapReduce implementation. The architecture is tailored in a master-slave manner, consisting of a single master jobtracker and multiple slave tasktrackers. The jobtracker's main role is to act as the task scheduler of the system, by assigning work to the tasktrackers. Each tasktracker disposes of a number of available slots for running tasks. Every active map or reduce task takes up one slot, thus a tasktracker usually executes several tasks simultaneously. When dispatching "map" tasks to tasktrackers, the jobtracker strives to keep the computation as close to the data as possible. This technique is enabled by the data-layout information previously acquired by the jobtracker. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes closer to the data (belonging to the same network rack); a minimal sketch of this preference order is given at the end of this subsection. The jobtracker first schedules "map" tasks, as the reducers must wait for the "map" execution to generate the intermediate data.

Hadoop executes the jobs of a pipeline MapReduce application in a sequential manner. Each job in the pipeline consists of a "map" and a "reduce" phase. The "map" computation is executed by Hadoop tasktrackers only when all the data it processes is available in the underlying DFS. Thus, the mappers are scheduled to run only after all the reducers from the preceding job have completed their execution. This scenario is also representative for a Pig processing: the jobs in the logical plan generated by the Pig framework are submitted to Hadoop sequentially. In consequence, at each step of the pipeline, at most the "map" and "reduce" tasks of the same job are being executed. Running the mappers and the reducers of a single job involves only a part of the cluster nodes. The rest of the cluster's computational and storage capabilities remain idle.
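As a concrete reading of the locality preference described above, the following minimal Java sketch (our illustration with simplified types, not Hadoop's actual scheduler code) orders candidate tasktrackers as data-local first, then rack-local, then any node with a free slot.

// A sketch of the locality preference: Node/MapTask are simplified stand-ins
// for Hadoop's internal types, introduced here purely for illustration.
import java.util.List;

class LocalityScheduler {
  record Node(String host, String rack, int freeSlots) {}
  record MapTask(List<String> chunkHosts, String chunkRack) {}

  static Node pickNode(MapTask task, List<Node> idleNodes) {
    // 1. data-local: a node that already stores the task's input chunk
    for (Node n : idleNodes)
      if (n.freeSlots() > 0 && task.chunkHosts().contains(n.host())) return n;
    // 2. rack-local: a node in the same rack as a chunk replica
    for (Node n : idleNodes)
      if (n.freeSlots() > 0 && n.rack().equals(task.chunkRack())) return n;
    // 3. fall back to any node with a free slot
    for (Node n : idleNodes)
      if (n.freeSlots() > 0) return n;
    return null; // no capacity: the task stays in the queue
  }
}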
3.3 Our approach

In order to speed up the execution of pipeline MapReduce applications, and also to improve cluster utilization, we propose an optimized Hadoop MapReduce framework in which the scheduling is done in a dynamic manner. For a better understanding of our approach, we first detail the process through which "map" and "reduce" tasks are created and scheduled in the original Hadoop MapReduce framework.

[Fig. 1. Job submission process in Hadoop: the client submits the job to the jobtracker, which creates the Map and Reduce tasks, adds the job to the scheduling queue, and dispatches tasks to tasktrackers.]

Figure 1 displays the job submission process. The first step is for the user to specify the "map" and "reduce" computations of the application. The Hadoop client generates all the job-related information (input and output directories, data placement, etc.) and then submits the job for execution to the jobtracker. On the jobtracker's side, the list of "map" and "reduce" tasks for the submitted job is created. The number of "map" tasks is equal to the number of chunks in the input data, while the number of "reduce" tasks is computed by taking into account various factors, such as the cluster capacity, the user specification, etc. The list of tasks is added to the job queue that holds the jobs to be scheduled for execution on tasktrackers. In the Hadoop MapReduce framework, the "map" and "reduce" tasks are created by the jobtracker when the job is submitted for execution. When they are created, the "map" tasks need to know the location of the chunks they will work on.

In the context of multiple jobs executed in a pipeline, the jobs are submitted by the client to the jobtracker sequentially, as the chunk-location information is available only when the previous job completes. Our approach is based on the remark that a "map" task is created for a single input chunk, and it only needs to be aware of the location of this very chunk. Furthermore, when it is created, the only information that a "map" task requires is the list of nodes that store the data of its associated chunk. We modified the Hadoop MapReduce framework to create "map" tasks dynamically, that is, as soon as a chunk is available for processing. This approach can bring substantial benefits to the execution of pipeline MapReduce applications. Since the execution of a job can start as soon as the first chunk of data is generated by the previous job, the total runtime is significantly reduced. Additionally, the tasks belonging to several jobs in the pipeline can be executed at the same time, which leads to a more efficient cluster utilization.

The modifications and extensions of the Hadoop MapReduce framework that we propose are further presented below and summarized in Figure 2.

[Fig. 2. Dynamic creation of "map" tasks.]

Algorithm 1 Report output size (on tasktracker)
procedure commitTask
    (size, files) ← tasktracker.writeReduceOutputData()
    jobtracker.transmitOutputInfo(size, files)
end procedure

Job-submission process

Client side. On the client side, we modified the submission process between the Hadoop client and the jobtracker. Instead of waiting for the execution to complete, the client launches a job monitor that reports the execution progress to the user. With this approach, a pipeline MapReduce application employs a single Hadoop client to run the application. The client submits all the jobs in the pipeline from the beginning, instead of submitting them sequentially, i.e., the modified Hadoop client submits the whole set of jobs job_1 ... job_n for execution, as shown in the sketch below.
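The following is a hedged sketch (ours, not the authors' code) of the intended client-side behaviour: all stages are submitted up front with the non-blocking Job.submit(), and a single monitor polls progress. Note that stock Hadoop validates input paths at submission time, so stages 2..n would be rejected there; the paper's modified framework defers map-task creation instead. Directory names are hypothetical.

// Sketch of a single client submitting an entire pipeline at once.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PipelineClient {
  public static void main(String[] args) throws Exception {
    int n = 3; // pipeline length (illustrative)
    List<Job> jobs = new ArrayList<>();
    for (int i = 1; i <= n; i++) {
      Job job = Job.getInstance(new Configuration(), "stage-" + i);
      // stage i reads the output directory of stage i-1 (hypothetical paths)
      FileInputFormat.addInputPath(job, new Path("/pipeline/stage-" + (i - 1)));
      FileOutputFormat.setOutputPath(job, new Path("/pipeline/stage-" + i));
      job.submit(); // non-blocking: all stages submitted from the beginning
      jobs.add(job);
    }
    // single job monitor reporting progress for the whole pipeline
    for (Job job : jobs) {
      while (!job.isComplete()) {
        System.out.printf("%s map %.0f%% reduce %.0f%%%n", job.getJobName(),
            job.mapProgress() * 100, job.reduceProgress() * 100);
        Thread.sleep(5000);
      }
    }
  }
}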
Jobtracker side. The job-submission protocol is similar to the one displayed in Figure 1. However, at submission time, only the input data for job_1 is available in the DFS. Regarding job_2 ... job_n, the input data has to be generated throughout the pipeline. Thus, the jobtracker creates the set of "map" and "reduce" tasks only for job_1. For the rest of the jobs, the jobtracker creates only "reduce" tasks, while "map" tasks will be created along the pipeline, as the data is being generated.

Job scheduling

Tasktracker side. For a job_i in the pipeline, the data produced by the job's "reduce" phase (reduce_i) represents the input data of job_{i+1}'s "map" tasks (map_{i+1}). When reduce_i is completed, the tasktracker writes the output data to the backend storage. We modified the tasktracker code to notify the jobtracker whenever it successfully completes the execution of a "reduce" function: the tasktracker informs the jobtracker about the size of the data produced by the "reduce" task.

Jobtracker side. In our modified framework, the jobtracker keeps track of the output data generated by reducers in the DFS. This information is important for the scheduling of the jobs in the pipeline, as the output directory of job_i is the input directory of job_{i+1}. Each time data is produced in job_i's output directory, the jobtracker checks to see if it can create new "map" tasks for job_{i+1}. If the data accumulated in job_{i+1}'s input directory is at least the size of a chunk, the jobtracker creates "map" tasks for the newly generated data. For each new chunk, the jobtracker creates a "map" task to process it. All the "map" tasks are added to the scheduling queue and then dispatched to idle tasktrackers for execution.

The modifications on the tasktracker side are described in Algorithm 1. We extended the code with a primitive that sends to the jobtracker the information about the "reduce" output data: the files written to the DFS and the total size of the data. Algorithm 2 shows the process of updating a job with information received from tasktrackers. The algorithm is integrated in the jobtracker code, mainly in the scheduling phase. The jobtracker also plays the role of task scheduler. It keeps a list of data written to the input directory of each job. For each received update, the jobtracker checks if the data in the job's input directory reaches at least a chunk in size (64 MB by default). If this is the case, "map" tasks will be created, one for each new data chunk. Otherwise, the job's information is stored for subsequent processing.

Algorithm 2 Update job (on jobtracker)
procedure transmitOutputInfo(size, files)
    invoke updateJob(size, files) on task scheduler
end procedure

procedure updateJob(size, files)
    for all job ∈ jobQueue do
        dir ← job.getInputDirectory()
        if dir = getDirectory(files) then
            if writtenBytes.contains(dir) = False then
                writtenBytes.put(dir, size)
            else
                allBytes ← writtenBytes.get(dir)
                writtenBytes.put(dir, allBytes + size)
            end if
            allBytes ← writtenBytes.get(dir)
            if allBytes ≥ CHUNK_SIZE then
                b ← job.createMapsForSplits(files)
                writtenBytes.put(dir, allBytes − b)
            else
                job.addToPending(files)
            end if
        end if
    end for
end procedure
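To make the bookkeeping in Algorithm 2 concrete, here is a Java sketch of the per-directory byte accounting; the types and names (PipelineJob, MapCreationTracker) are ours, not Hadoop internals.

// Sketch of the jobtracker-side bookkeeping from Algorithm 2: accumulate
// the bytes reported for each job's input directory and create map tasks
// once at least a full chunk has been written.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapCreationTracker {
  static final long CHUNK_SIZE = 64L * 1024 * 1024; // HDFS default: 64 MB

  interface PipelineJob {
    String getInputDirectory();
    long createMapsForSplits(List<String> files); // returns bytes consumed
    void addToPending(List<String> files);
  }

  private final Map<String, Long> writtenBytes = new HashMap<>();

  // Called when a tasktracker reports that a reduce task wrote `size`
  // bytes into `dir`; `dir` is also the input directory of the next job.
  void updateJob(PipelineJob nextJob, String dir, long size,
                 List<String> files) {
    long all = writtenBytes.merge(dir, size, Long::sum);
    if (all >= CHUNK_SIZE) {
      // enough data for at least one chunk: create new map tasks now
      long consumed = nextJob.createMapsForSplits(files);
      writtenBytes.put(dir, all - consumed);
    } else {
      // not a full chunk yet: remember the files for later
      nextJob.addToPending(files);
    }
  }
}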
The mechanism of creating "map" tasks is presented in Algorithm 3; it is executed by the jobtracker and integrated into the job code.

Algorithm 3 Create map tasks (on job)
procedure addToPending(files)
    pendingFiles.addAll(files)
end procedure

function createMapsForSplits(files) returns splitBytes
    pendingFiles.addAll(files)
    splits ← getSplits(pendingFiles)
    pendingFiles.clear()
    newSplits ← splits.length
    jobtracker.addWaitingMaps(newSplits)
    for i ∈ [1 .. newSplits] do
        maps[numMapTasks + i] ← new MapTask(splits[i])
    end for
    numMapTasks ← numMapTasks + newSplits
    notifyAllReduceTasks(numMapTasks)
    for all s ∈ splits do
        splitBytes ← splitBytes + s.getLength()
    end for
    return splitBytes
end function

We extended the code so that each job holds the list of files that were generated so far in the job's input directory. When the jobtracker computes that at least a chunk of input data has been generated, new "map" tasks are created for the job. The data in the files is split into chunks. A "map" task is created for each chunk and the newly launched tasks are added to the scheduling queue. The jobtracker also informs the "reduce" tasks that the number of "map" tasks has changed. The reducers need to be aware of the number of mappers of the same job, as they have to transfer their assigned part of the output data from all the mappers to their local disk.

4 Evaluation

We validate the proposed approach through a series of experiments that compare the original Hadoop framework with our modified version when running pipeline applications.

4.1 Environmental setup

The experiments were carried out on the Grid'5000 [10] testbed. The Grid'5000 project is a widely-distributed infrastructure devoted to providing an experimental platform for the research community. The platform is spread over 10 geographical sites located on French territory and 1 in Luxembourg. For our experiments, we employed nodes from the Orsay cluster of Grid'5000. The nodes are outfitted with dual-core x86_64 CPUs and 2 GB of RAM. Intra-cluster communication is done through a 1 Gbps Ethernet network. We performed an initial test at a small scale, i.e., 20 nodes, in order to assess the impact of our approach. The second set of tests involves 50 nodes belonging to the Orsay cluster.

4.2 Results

[Fig. 3. Completion time for short-running pipeline applications: time (s) vs. number of jobs, for Original Hadoop and Modified Hadoop.]

The evaluation presented here focuses on assessing the performance gains of the optimized MapReduce framework we propose over the original one. To this end, we developed a benchmark that creates a pipeline of n MapReduce jobs and submits them to Hadoop for execution. Each job in the pipeline simulates a load that parses key-value pairs from the input data and outputs 90% of them as the final result. In this manner, we obtain a long-running application that generates a large amount of data, allowing our dynamic scheduling mechanism to optimize the execution of the pipeline. The computation itself is not relevant in this case, as our goal is to create a scenario in which enough data chunks are generated along the pipeline so that "map" tasks can be dynamically created. We run this type of application first with the original Hadoop framework, then with our optimized version of Hadoop. In both cases, we measure the pipeline completion time and compare the results.
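A hedged sketch of the benchmark load described above follows: each stage parses key-value pairs and emits roughly 90% of them. The record format and the sampling strategy (hashing the key) are our assumptions; the paper does not specify how the 10% is dropped.

// Sketch of a benchmark stage's map function: parse "key<TAB>value"
// records (assumed format) and deterministically drop about 10% of them.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Keep90PercentMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] kv = line.toString().split("\t", 2);
    if (kv.length < 2) return; // skip malformed records
    // keep records whose key hashes into 9 of 10 buckets (~90%)
    if (Math.floorMod(kv[0].hashCode(), 10) != 0) {
      ctx.write(new Text(kv[0]), new Text(kv[1]));
    }
  }
}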
Short-running pipeline applications

In a first set of experiments, we run the benchmark in a small setup involving 20 nodes, on top of which HDFS and Hadoop MapReduce are deployed as follows: a dedicated machine is allocated for each centralized entity (namenode, jobtracker), one node serves as the Hadoop client that submits the jobs, and the remaining 17 nodes act as both datanodes and tasktrackers. At each step, we keep the same deployment setup and increase the number of jobs in the pipeline to be executed. The first test consists in running a single job, while the last one runs a pipeline of 9 MapReduce jobs. The application's input data, i.e., job_1's input data, consists of 5 data chunks (a total of 320 MB). Job_i keeps 90% of the input data it receives from job_{i-1}. In the case of the 9-job pipeline, this data-generation mechanism leads to a total of 2 GB of data produced throughout the pipeline, out of which 1.6 GB account for intermediate data.
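These volumes follow from the geometric decay of the data. As a consistency check (our derivation, not given in the paper, under the reading that the 2 GB total includes the initial input), job_i outputs $320 \cdot 0.9^i$ MB, so the intermediate data, i.e., the outputs of jobs 1 through 8, amounts to

\[
\sum_{i=1}^{8} 320 \cdot 0.9^i \;=\; 320 \cdot \frac{0.9\,(1 - 0.9^8)}{1 - 0.9} \;\approx\; 1640~\text{MB} \;\approx\; 1.6~\text{GB},
\]

while the initial input plus the outputs of all 9 jobs gives $320 + \sum_{i=1}^{9} 320 \cdot 0.9^i \approx 320 + 1764~\text{MB} \approx 2~\text{GB}$, matching the reported figures.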