On the diversity of cluster workloads and its impact on research results

George Amvrosiadis, Jun Woo Park, Gregory R. Ganger, and Garth A. Gibson, Carnegie Mellon University; Elisabeth Baseman and Nathan DeBardeleben, Los Alamos National Laboratory

https://www.usenix.org/conference/atc18/presentation/amvrosiadis

This paper is included in the Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC '18), July 11-13, 2018, Boston, MA, USA. ISBN 978-1-939133-02-1. Open access to the Proceedings of the 2018 USENIX Annual Technical Conference is sponsored by USENIX.

Abstract

Six years ago, Google released an invaluable set of scheduler logs which has already been used in more than 450 publications. We find that the scarcity of other data sources, however, is leading researchers to overfit their work to Google's dataset characteristics. We demonstrate this overfitting by introducing four new traces from two private and two High Performance Computing (HPC) clusters. Our analysis shows that the private cluster workloads, consisting of data analytics jobs expected to be more closely related to the Google workload, display more similarity to the HPC cluster workloads. This observation suggests that additional traces should be considered when evaluating the generality of new research.

To aid the community in moving forward, we release the four analyzed traces, including: the longest publicly available trace spanning all 61 months of an HPC cluster's lifetime and a trace from a 300,000-core HPC cluster, the largest cluster with a publicly available trace. We present an analysis of the private and HPC cluster traces that spans job characteristics, workload heterogeneity, resource utilization, and failure rates. We contrast our findings with the Google trace characteristics and identify affected work in the literature. Finally, we demonstrate the importance of dataset plurality and diversity by evaluating the performance of a job runtime predictor using all four of our traces and the Google trace.
1 Introduction

Despite intense activity in the areas of cloud and job scheduling research, publicly available cluster workload datasets remain scarce. The three major dataset sources today are: the Google cluster trace [58] collected in 2011, the Parallel Workload Archive [19] of High Performance Computing (HPC) traces collected since 1993, and the SWIM traces released in 2011 [10]. Of these, the Google trace has been used in more than 450 publications, making it the most popular trace by far. Unfortunately, this 29-day trace is often the only one used to evaluate new research. By contrasting its characteristics with newer traces from different environments, we have found that the Google trace alone is insufficient to accurately prove the generality of a new technique.

Our goal is to uncover overfitting of prior work to the characteristics of the Google trace. To achieve this, our first contribution is to introduce four new traces: two from the private cloud of Two Sigma, a hedge fund, and two from HPC clusters located at the Los Alamos National Laboratory (LANL). Our Two Sigma traces are the longest non-academic private cluster traces to date, spanning 9 months and more than 3 million jobs. The two HPC traces we introduce are also unique. The first trace spans the entire 5-year lifetime of a general-purpose HPC cluster, making it the longest public trace to date, while also exhibiting shorter jobs than existing public HPC traces. The second trace originates from the 300,000-core current flagship supercomputer at LANL, making it the largest cluster with a public trace, to our knowledge. We introduce all four traces, and the environments where they were collected, in Section 2.

Our second contribution is an analysis examining the generality of workload characteristics derived from the Google trace, when our four new traces are considered. Overall, we find that the private Two Sigma cluster workloads display similar characteristics to HPC, despite consisting of data analytics jobs that more closely resemble the Google workload. Table 1 summarizes all our findings. For those characteristics where the Google workload is an outlier, we have surveyed the literature and list affected prior work. In total, we surveyed 450 papers that reference the Google trace study [41] to identify popular workload assumptions, and we contrast them to the Two Sigma and LANL workloads to detect violations. We group our findings into four categories: job characteristics (Section 3), workload heterogeneity (Section 4), resource utilization (Section 5), and failure analysis (Section 6).

Our findings suggest that evaluating new research using the Google trace alone is insufficient to guarantee generality. To aid the community in moving forward, our third contribution is to publicly release the four traces introduced and analyzed in this paper. We further present a case study on the importance of dataset plurality and diversity when evaluating new research. For our demonstration we use JVuPredict, the job runtime predictor of the JamaisVu scheduling system [51]. Originally, JVuPredict was evaluated using only the Google trace [51]. Evaluating its performance with our four new traces, however, helped us identify features that make it easier to detect related and recurring jobs with predictable behavior. This enabled us to quantify the importance of individual trace fields in runtime prediction. We describe our findings in Section 7.

Finally, we briefly discuss the importance of trace length in accurately representing a cluster's workload in Section 8. We list related work studying cluster traces in Section 9, before concluding.

Section                      Characteristic                                        Google  TwoSigma  Mustang  OpenTrinity
Job Characteristics (§3)     Majority of jobs are small                            ✓       ✗         ✗        ✗
                             Majority of jobs are short                            ✓       ✗         ✗        ✗
Workload Heterogeneity (§4)  Diurnal patterns in job submissions                   ✗       ✓         ✓        ✓
                             High job submission rate                              ✓       ✓         ✗        ✗
Resource Utilization (§5)    Resource over-commitment                              ✓       ✗         ✗        ✗
                             Sub-second job inter-arrival periods                  ✓       ✓         ✓        ✓
                             User request variability                              ✗       ✓         ✓        ✓
Failure Analysis (§6)        High fraction of unsuccessful job outcomes            ✓       ✓         ✗        ✓
                             Jobs with unsuccessful outcomes consume a
                             significant fraction of resources                     ✓       ✓         ✗        ✗
                             Longer/larger jobs often terminate unsuccessfully     ✓       ✗         ✗        ✗

Table 1: Summary of the characteristics of each trace. Note that the Google workload appears to be an outlier.
2 Dataset information

We introduce four sets of job scheduler logs that were collected from a general-purpose cluster and a cutting-edge supercomputer at LANL, and across two clusters of Two Sigma, a hedge fund. The following subsections describe each dataset in more detail, and the hardware configuration of each cluster is shown in Table 2.

Platform       Nodes   CPUs    RAM     Length
LANL Trinity   9408    32      128GB   3 months
LANL Mustang   1600    24      64GB    5 years
TwoSigma A     872     24      256GB   9 months
TwoSigma B     441     24      256GB   9 months
Google B       6732    0.50*   0.50*   29 days
Google B       3863    0.50*   0.25*   29 days
Google B       1001    0.50*   0.75*   29 days
Google C       795     1.00*   1.00*   29 days
Google A       126     0.25*   0.25*   29 days
Google B       52      0.50*   0.12*   29 days
Google B       5       0.50*   0.03*   29 days
Google B       5       0.50*   0.97*   29 days
Google C       3       1.00*   0.50*   29 days
Google B       1       0.50*   0.06*   29 days

Table 2: Hardware characteristics of the clusters analyzed in this paper. For the Google trace [41], (*) signifies a resource has been normalized to the largest node.

Users typically interact with the cluster scheduler by submitting commands that spawn multiple processes, or tasks, distributed across cluster nodes to perform a specific computation. Each such command is considered to be a job, and users often compose scripts that generate more complex, multi-job schedules. In HPC clusters, where resources are allocated at the granularity of physical nodes similar to Emulab [4, 16, 27, 57], tasks from different jobs are never scheduled on the same node. This is not necessarily true in private clusters like the TwoSigma clusters.

2.1 TwoSigma clusters

The private workload traces we introduce originate from two datacenters of Two Sigma, a hedge fund firm. The workload consists of data analytics jobs processing financial data. A fraction of these jobs are handled by a Spark [49] installation, while the rest are serviced by home-grown data analytics frameworks. The dataset spans 9 months of the two datacenters' operation starting in January 2016, covering a total of 1313 identical compute nodes with 31512 CPU cores and 328TB RAM. The logs contain 3.2 million jobs and 78.5 million tasks, collected by an internally-developed job scheduler running on top of Mesos [28]. Because both datacenters experience the same workload and consist of homogeneous nodes, we collectively refer to both data sources as the TwoSigma trace, and analyze them together.

We expect this workload to resemble the Google cluster more closely than the HPC clusters, where long-running, compute-intensive, and tightly-coupled scientific jobs are the norm. First, unlike LANL, job runtime is not budgeted strictly; users of the hedge fund clusters do not have to specify a time limit when submitting a job. Second, users can allocate individual cores, as opposed to entire physical nodes allocated at LANL. Collected data include: timestamps for job stages from submission to termination, job properties such as size and owner, and the job's return status.

2.2 LANL Mustang cluster

Mustang was an HPC cluster used for capacity computing at LANL from 2011 to 2016. Capacity clusters such as Mustang are architected as cost-effective, general-purpose resources for a large number of users. Mustang was largely used by scientists, engineers, and software developers at LANL, and it was allocated to these users at the granularity of physical nodes. The cluster consisted of 1600 identical compute nodes, with a total of 38400 AMD Opteron 6176 2.3GHz cores and 102TB RAM.

Our Mustang dataset covers the entire 61 months of the machine's operation, from October 2011 to November 2016, which makes this the longest publicly available cluster trace to date. The Mustang trace is also unique because its jobs are shorter than those in existing HPC traces. Overall, it consists of 2.1 million multi-node jobs submitted by 565 users and collected by SLURM [45], an open-source cluster resource manager. The fields available in the trace are similar to those in the TwoSigma trace, with the addition of a time budget field per job that, if exceeded, causes the job to be killed.
2.3 LANL Trinity supercomputer

In 2018, Trinity is the largest supercomputer at LANL and it is used for capability computing. Capability clusters are a large-scale, high-demand resource introducing novel hardware technologies that aid in achieving crucial computing milestones, such as higher-resolution climate and astrophysics models. Trinity's hardware was stood up in two pre-production phases before being put into full production use, and our trace was collected before the second phase completed. At the time of data collection, Trinity consisted of 9408 identical compute nodes, a total of 301056 Intel Xeon E5-2698v3 2.3GHz cores and 1.2PB RAM, making this the largest cluster with a publicly available trace by number of CPU cores.

Our Trinity dataset covers 3 months, from February to April 2017. During that time, Trinity was operating in OpenScience mode, i.e., the machine was undergoing beta testing and was available to a wider number of users than it is expected to have after it receives its final security classification. We note that OpenScience workloads are representative of a capability supercomputer's workload, as they occur roughly every 18 months when a new machine is introduced, or before an older one is decommissioned. The dataset, which we will henceforth refer to as OpenTrinity, consists of 25237 multi-node jobs issued by 88 users and collected by MOAB [1], an open-source cluster scheduling system. The information available in the trace is the same as that in the Mustang trace.

2.4 Google cluster

In 2012, Google released a trace of jobs that ran in one of their compute clusters [41]. It is a 29-day trace consisting of 672074 jobs and 48 million tasks, some of which were issued through the MapReduce framework, and ran on 12583 heterogeneous nodes in May 2011. The workload consists of both long-running services and batch jobs [55]. Google has not released the exact hardware specifications of each cluster node. Instead, as shown in Table 2, nodes are presented through anonymized platform names representing machines with different combinations of microarchitectures and chipsets [58]. Note that the number of CPU cores and RAM for each node in the trace have been normalized to the most powerful node in the cluster. In our analysis, we estimate the total number of cores in the Google cluster to be 106544. We derive this number by assuming that the most popular node type (Google B with 0.5 CPU cores) is a dual-socket server carrying quad-core AMD Opteron Barcelona CPUs that Google allegedly used in their datacenters at the time [26]. Unlike previous workloads, jobs can be allocated fractions of a CPU core [46].
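As a quick sanity check, the 106544-core estimate can be reproduced directly from the node counts and normalized CPU values in Table 2, under the stated assumption that the most popular node type (0.50 normalized CPU) corresponds to a dual-socket, quad-core server, i.e., 8 physical cores. A minimal sketch in Python, with the Table 2 values inlined:

```python
# Back-of-the-envelope estimate of the Google cluster's total core count,
# following the assumption stated above: a normalized CPU value of 0.50
# corresponds to a dual-socket, quad-core server (8 physical cores),
# so 1.00 of normalized CPU corresponds to roughly 16 cores.

# (node_count, normalized_cpu) pairs taken from Table 2.
google_platforms = [
    (6732, 0.50), (3863, 0.50), (1001, 0.50), (795, 1.00), (126, 0.25),
    (52, 0.50), (5, 0.50), (5, 0.50), (3, 1.00), (1, 0.50),
]

CORES_PER_NORMALIZED_UNIT = 8 / 0.50  # 16 cores per 1.00 of normalized CPU

total_cores = sum(count * cpu * CORES_PER_NORMALIZED_UNIT
                  for count, cpu in google_platforms)
print(int(total_cores))  # 106544, matching the estimate used in the analysis
```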
3 Job characteristics

Many instances of prior work in the literature rely on the assumption of heavy-tailed distributions to describe the size and duration of individual jobs [2, 8, 13, 14, 40, 50]. In the LANL and TwoSigma workloads these tails appear significantly lighter.

Observation 1: On average, jobs in the TwoSigma and LANL traces request 3-406 times more CPU cores than jobs in the Google trace. Job sizes in the LANL traces are more uniformly distributed.

Figure 1: CDF of job sizes based on allocated CPU cores.

Figure 1 shows the Cumulative Distribution Functions (CDFs) of job requests for CPU cores across all traces, with the x-axis in logarithmic scale. We find that the 90% of smallest jobs in the Google trace request 16 CPU cores or fewer. The same fraction of TwoSigma jobs request 108 cores or fewer, and 1-16K cores in the LANL traces. Very large jobs are also more common outside Google. This is unsurprising for the LANL HPC clusters, where allocating thousands of CPU cores to a single job is not uncommon, as the clusters' primary use is to run massively parallel scientific applications. It is interesting to note, however, that while the TwoSigma clusters contain fewer cores than the other clusters we examine (3 times fewer than the Google cluster), their median job is more than an order of magnitude larger than a job in the Google trace. An analysis of allocated memory yields similar trends.

Observation 2: The median job in the Google trace is 4-5 times shorter than in the LANL or TwoSigma traces. The longest 1% of jobs in the Google trace, however, are 2-6 times longer than the same fraction of jobs in the LANL and TwoSigma traces.

Figure 2: CDF of the durations of individual jobs.

Figure 2 shows the CDFs of job durations for all traces. We find that in the Google trace, 80% of jobs last less than 12 minutes each. In the LANL and TwoSigma traces jobs are at least an order of magnitude longer. In TwoSigma, the same fraction of jobs last up to 2 hours and in LANL, they last up to 3 hours for Mustang and 6 hours for OpenTrinity. Surprisingly, the tail end of the distribution is slightly shorter for the LANL clusters than for the Google and TwoSigma clusters. The longest job is 16 hours on Mustang, 32 hours in OpenTrinity, 200 hours in TwoSigma, and at least 29 days in Google (the duration of the trace). For LANL, this is due to hard limits causing jobs to be indiscriminately killed. For Google, the distribution's long tail is likely attributed to long-running services.

Implications. These observations impact the immediate applicability of job scheduling approaches whose efficiency relies on the assumption that the vast majority of jobs' durations are in the order of minutes, and job sizes are insignificant compared to the size of the cluster. For example, Ananthanarayanan et al. [2] propose to mitigate the effect of stragglers by duplicating tasks of smaller jobs. This is an effective approach for Internet service workloads (Microsoft and Facebook are represented in the paper) because the vast majority of jobs can benefit from it without significantly increasing the overall cluster utilization. For the Google trace, for example, 90% of jobs request less than 0.01% of the cluster each, so duplicating them only slightly increases cluster utilization. At the same time, 25-55% of jobs in the LANL and TwoSigma traces each request more than 0.1% of the cluster's cores, decreasing the efficiency of the approach and suggesting replication should be used judiciously. This does not consider that LANL tasks are also tightly-coupled, so the entire job would have to be duplicated. Another example is the work by Delgado et al. [14], which improves the efficiency of distributed schedulers for short jobs by dedicating a fraction of the cluster to them. This partition ranges from 2% for the Yahoo and Facebook traces to 17% for the Google trace, where jobs are significantly longer, to avoid increasing job service times. For the TwoSigma and LANL traces we have shown that jobs are even longer than for the Google trace (Figure 2), so larger partitions will likely be necessary to achieve similar efficiency. At the same time, jobs running in the TwoSigma and LANL clusters are also larger (Figure 1), so service times for long jobs are expected to increase unless the partition is shrunk. Other examples of work that is likely affected include task migration of short and small jobs [50] and hybrid scheduling aimed at improving head-of-line blocking for short jobs [13].
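The comparisons above are read off empirical CDFs such as those in Figures 1 and 2. A minimal sketch of how such curves and percentile statements can be computed from a per-job trace is shown below; the file name and column names (cores, runtime_s) are placeholders, not the actual field names of the released traces.

```python
import numpy as np
import pandas as pd

def empirical_cdf(values):
    """Return sorted values and the fraction of jobs at or below each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical per-job trace with one row per job; column names are illustrative.
jobs = pd.read_csv("cluster_trace.csv")          # e.g., columns: cores, runtime_s

cores_x, cores_cdf = empirical_cdf(jobs["cores"])
# Quantile lookups back statements such as "90% of jobs request 16 cores or
# fewer": the job size at the 90th percentile.
p90_cores = np.quantile(jobs["cores"], 0.90)

dur_x, dur_cdf = empirical_cdf(jobs["runtime_s"] / 3600.0)        # hours
median_duration_h = np.quantile(jobs["runtime_s"] / 3600.0, 0.50)
```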
4 Workload heterogeneity

Another common assumption about cloud workloads is that they are characterized by heterogeneity in terms of resources available to jobs and job interarrival times [7, 23, 31, 46, 56]. The private and HPC clusters we study, however, consist of homogeneous hardware (see Table 2), and user activity follows well-defined diurnal patterns, even though the rate of scheduling requests varies significantly across clusters.

Observation 3: Diurnal patterns are universal. Clusters received more scheduling requests and smaller jobs at daytime, with minor deviations for the Google trace.

In Figure 3 we show the number of job scheduling requests for every hour of the day. We choose to show metrics for the median day surrounded by the other two quartiles because the high variation across days causes the averages to be unrepresentative of the majority of days (see Section 8). Overall, diurnal patterns are evident in every trace and user activity is concentrated at daytime (7AM to 7PM), similar to prior work [38]. An exception to this is the Google trace, which is most active from midnight to 4AM, presumably due to batch jobs leveraging the available resources.

Figure 3: Hourly job submission rates for a given day. The lines represent the median, while the shaded region shows the distance between the 25th and 75th percentiles.

Sizes of submitted jobs are also correlated with the time of day. We find that longer, larger jobs in the LANL traces are typically scheduled during the night, while shorter, smaller jobs tend to be scheduled during the day. The reverse is true for the Google trace, which prompts our earlier assumption on nightly batch jobs. Long, large jobs are also scheduled at daytime in the TwoSigma clusters, despite having a diurnal pattern similar to the LANL clusters. This is likely due to TwoSigma's workload consisting of financial data analysis, which bears a dependence on stock market hours.
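The per-hour curves in Figure 3 are medians with interquartile bands computed across days, rather than per-hour averages. A sketch of that aggregation, assuming a hypothetical per-job submission timestamp column named submit_time (Unix seconds):

```python
import pandas as pd

# Hypothetical per-job trace; `submit_time` is a Unix timestamp in seconds.
jobs = pd.read_csv("cluster_trace.csv")
ts = pd.to_datetime(jobs["submit_time"], unit="s")

# Count job submissions per (calendar day, hour of day).
# Hours with no submissions on a given day simply do not appear here,
# which is acceptable for a sketch.
counts = (
    jobs.assign(day=ts.dt.date, hour=ts.dt.hour)
        .groupby(["day", "hour"])
        .size()
        .rename("submissions")
        .reset_index()
)

# For each hour of the day, summarize across days with the median and the
# 25th/75th percentiles, mirroring the shaded bands in Figure 3.
hourly = counts.groupby("hour")["submissions"].quantile([0.25, 0.5, 0.75]).unstack()
print(hourly)   # columns 0.25, 0.5, 0.75 indexed by hour 0-23
```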
Observation 4: Scheduling request rates differ by up to 3 orders of magnitude across clusters. Sub-second scheduling decisions seem necessary in order to keep up with the workload.

One more thing to take away from Figure 3 is that the rate of scheduling requests can differ significantly across clusters. For the Google and TwoSigma traces, hundreds to thousands of jobs are submitted every hour. On the other hand, the LANL schedulers never receive more than 40 requests in any given hour. This could be related to the workload or the number of users in the system, as the Google cluster serves 2 times as many user IDs as the Mustang cluster and 9 times as many as OpenTrinity.

Implications: Previous work such as Omega [46] and ClusterFQ [56] propose distributed scheduling designs especially applicable to heterogeneous clusters. This does not seem to be an issue for environments such as LANL and TwoSigma, which intentionally architect homogeneous clusters to lower performance optimization and administration costs.

As cluster sizes increase, so does the rate of scheduling requests, urging us to reexamine prior work. Quincy [31] represents scheduling as a Min-Cost Max-Flow (MCMF) optimization problem over a task-node graph and continuously refines task placement. The complexity of this approach, however, becomes a drawback for large-scale clusters such as the ones we study. Gog et al. [23] find that Quincy requires 66 seconds (on average) to converge to a placement decision in a 10,000-node cluster. The Google and LANL clusters we study already operate on that scale (Table 2). We have shown in Figure 3 that the average frequency of job submissions in the LANL traces is one job every 90 seconds, which implies that this scheduling latency may work, but this will not be the case for long. Trinity is currently operating with 19,000 nodes and, under the DoE's Exascale Computing Project [39], 25 times larger machines are planned within the next 5 years. Note that so far we refer to jobs, since when discussing scheduling, HPC jobs have a gang-scheduling requirement. Placement algorithms such as Quincy, however, focus on task placement.

An improvement to Quincy is Firmament [23], a centralized scheduler employing a generalized approach based on a combination of MCMF optimization techniques to achieve sub-second task placement latency on average. As Figure 4 shows, sub-second latency is paramount, since the rate of task placement requests in the Google and TwoSigma traces can be as high as 100K requests per hour, i.e., one task every 36ms. Firmament's placement latency, however, increases to several seconds as cluster utilization increases. For the TwoSigma and Google traces this can be problematic.

Figure 4: Hourly task placement requests for a given day. The lines represent the median, while the shaded region shows the distance between the 25th and 75th percentiles.
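The sub-second requirement follows from simple arithmetic on the peak placement rate: at roughly 100K task placement requests per hour, a centralized scheduler is left with only a few tens of milliseconds per decision on average. A toy sketch of that calculation over a hypothetical task-level trace (the file and column names are assumptions):

```python
import pandas as pd

# Hypothetical task-level trace; `submit_time` is a Unix timestamp in seconds.
tasks = pd.read_csv("task_events.csv")
ts = pd.to_datetime(tasks["submit_time"], unit="s")

# Task placement requests per (day, hour of day), as in Figure 4.
hourly_counts = tasks.groupby([ts.dt.date, ts.dt.hour]).size()

peak = hourly_counts.max()
budget_ms = 3600.0 / peak * 1000.0   # average time available per placement decision
print(f"peak: {peak} placements/hour -> {budget_ms:.0f} ms per decision")
# At the 100K-per-hour peaks quoted above, this works out to roughly 36 ms per task.
```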
5 Resource utilization

A well-known motivation for the cloud has been resource consolidation, with the intention of reducing equipment ownership costs. An equally well-known property of the cloud, however, is that its resources remain underutilized [6, 15, 35, 36, 41]. This is mainly due to a disparity between user resource requests and actual resource usage, which recent research efforts try to alleviate through workload characterization and aggressive consolidation [15, 33, 34]. Our analysis finds that user resource requests in the LANL and TwoSigma traces are characterized by higher variability than in the Google trace. We also look into job inter-arrival times and how they are approximated when evaluating new research.

Observation 5: Unlike the Google cluster, none of the other clusters we examine overcommit resources.

Overall, we find that the fraction of CPU cores allocated to jobs is stable over time across all the clusters we study. For Google, CPU cores are overprovisioned by 10%, while for the other clusters unallocated cores range between 2-12%, even though resource overprovisioning is supported by their schedulers. Memory allocation numbers follow a similar trend. Unfortunately, the LANL and TwoSigma traces do not contain information on actual resource utilization. As a result, we can neither confirm nor contradict results from earlier studies on the imbalance between resource allocation and utilization. What differs between organizations is the motivation for keeping resources utilized or available. For Google [41], Facebook [10], and Twitter [15], there is a tension between the financial incentive of maintaining only the necessary hardware to keep operational costs low and the need to provision for peak demand, which leads to low overall utilization. For LANL, clusters are designed to accommodate a predefined set of applications for a predetermined time period, and high utilization is planned as part of efficiently utilizing federal funding. For the TwoSigma clusters, provisioning for peak demand is more important, even if it leads to low overall utilization, since business revenue is heavily tied to the response times of their analytics jobs.

Observation 6: The majority of job interarrival periods are sub-second in length.

Interarrival periods are a crucial parameter of an experimental setup, as they dictate the load on the system under test. Two common configurations are second-granularity [15] or Poisson-distributed interarrivals [29], and we find that neither characterizes interarrivals accurately. In Figure 5 we show the CDFs for job interarrival period lengths. We observe that 44-62% of interarrival periods are sub-second, implying that jobs arrive at a faster rate than previously assumed. Furthermore, our attempts to fit a Poisson distribution to this data have been unsuccessful, as Kolmogorov-Smirnov tests [37] reject the null hypothesis with p-values < 2.2×10^-16. This result does not account for a scenario where there is an underlying Poisson process with a rate parameter changing over time, but it suggests that caution should be used when a Poisson distribution is assumed.

Figure 5: CDF of job interarrival times.

Another common assumption is that jobs are very rarely big, i.e., made up of multiple tasks [29, 56]. In Figure 6 we show the CDFs for the number of tasks per job across organizations. We observe that 77% of Google jobs are single-task jobs, but the rest of the clusters carry many more multi-task jobs. We note that the TwoSigma distribution approaches that of Google only for larger jobs. This suggests that task placement may be a harder problem outside Google, where single-task jobs are common, exacerbating the evaluation issues we outlined in Section 4 for existing task placement algorithms.

Figure 6: CDF of the number of tasks per job.

Observation 7: User resource requests are more variable in the LANL and TwoSigma traces than in the Google trace.

Resource under-utilization can be alleviated through workload consolidation. To ensure minimal interference, applications are typically profiled and classified according to historical data [15, 33]. Our analysis suggests that this approach is likely to be less successful outside the Internet services world. To quantify variability in user behavior we examine the Coefficient of Variation (CoV), a unit-less measure of spread derived by dividing a sample's standard deviation by its mean, across all requests of individual users. For the Google trace we find that the majority of users issue jobs within 2x of their average request in CPU cores. For the LANL and TwoSigma traces, on the other hand, 60-80% of users can deviate by 2-10x from their average request.
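Two of the checks behind Observations 6 and 7 can be restated concretely: testing whether interarrival gaps are consistent with a fixed-rate Poisson process (i.e., exponentially distributed gaps), and computing each user's Coefficient of Variation over requested cores. The sketch below is an illustration of this kind of analysis, not the exact methodology used here; the file and column names (submit_time, user, cores) are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

jobs = pd.read_csv("cluster_trace.csv")   # hypothetical columns: submit_time, user, cores

# --- Observation 6: are interarrival gaps consistent with fixed-rate Poisson arrivals?
arrivals = np.sort(jobs["submit_time"].to_numpy(dtype=float))   # seconds
gaps = np.diff(arrivals)
gaps = gaps[gaps > 0]

# KS test against an exponential distribution with the sample mean as its scale.
# (Estimating the parameter from the same data makes the test optimistic, so a
# rejection here is conservative evidence against the Poisson model.)
stat, p_value = stats.kstest(gaps, "expon", args=(0, gaps.mean()))
print(f"KS statistic={stat:.3f}, p-value={p_value:.2e}")
print("fraction of sub-second gaps:", np.mean(gaps < 1.0))

# --- Observation 7: variability of each user's core requests.
def coefficient_of_variation(x):
    x = np.asarray(x, dtype=float)
    return x.std() / x.mean() if x.mean() > 0 else np.nan

cov_per_user = jobs.groupby("user")["cores"].apply(coefficient_of_variation)
# Share of users whose requests stay roughly within 2x of their own average
# (loosely, a CoV of about 1 or below).
print("users with CoV <= 1:", (cov_per_user <= 1.0).mean())
```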
Implications: A number of earlier studies of Google [41], Twitter [15], and Facebook [10] data have highlighted the imbalance between resource allocation and utilization. Google tackles this issue by over-committing resources, but this is not the case for LANL and TwoSigma. Another proposed solution is Quasar [15], a system that consolidates workloads while guaranteeing a predefined level of QoS. This is achieved by profiling jobs at submission time and classifying them as one of the previously encountered workloads; misclassifications are detected by inserting probes in the running application code. For LANL, this approach would not be feasible. First, jobs cannot be carefully profiled before being submitted at their requested allocation size. Second, submitted codes are too complex to be accurately profiled in seconds, and probing them at runtime to detect misclassifications can introduce performance jitter that is prohibitive in tightly-coupled HPC applications. Third, in our LANL traces we often find that users tweak jobs before resubmitting them, as they re-calibrate simulation parameters to achieve a successful run, which is likely to affect classification accuracy. Fourth, resources are carefully reserved for workloads and utilization is high, which makes it hard to provision resources for profiling. For the TwoSigma and Google traces Quasar may be a better fit; however, at the rate of 2.7 jobs per second (Figure 3), 15 seconds of profiling [15] at submission time would result in an expected load of 6 jobs being profiled together. Since Quasar requires 4 parallel and isolated runs to collect sufficient profiling data, we would need resources to run at least 360 VMs concurrently, with guaranteed performance isolation between them, to keep up with the average load. This further assumes the profiling time does not need to be increased beyond 15 seconds. Finally, Quasar [15] was evaluated using multi-second inter-arrival periods, so testing would be necessary to ensure that one order of magnitude more load can be handled (Figure 5), and that it will not increase the profiling cost further.

Another related approach to workload consolidation is provided by TSF [56], a scheduling algorithm that attempts to maximize the number of task slots allocated to each job, without favoring bigger jobs. This ensures that the algorithm remains starvation-free; however, it results in significant slowdowns in the runtime of jobs with 100+ tasks, which the authors define as big. This would be prohibitive for LANL, where jobs must be scheduled as a whole and such "big" jobs are much more prevalent and longer in duration. Other approaches for scheduling and placement assume the availability of resources that may be unavailable in the clusters we study here, and their performance is shown to be reduced in highly-utilized clusters [25, 29].
6 Failure analysis

Job scheduler logs are often analyzed to gain an understanding of job failure characteristics in different environments [9, 17, 21, 22, 43]. This knowledge allows for building more robust systems, which is especially important as we transition to exascale computing systems where failures are expected every few minutes [48], and cloud computing environments built on complex software stacks that increase failure rates [9, 44].

Definitions. An important starting point for any failure analysis is defining what constitutes a failure event. Across all traces we consider, we define as failed jobs all those that end due to events whose occurrence was not intended by users or system administrators. We do not distinguish failed jobs by their root cause, e.g., software and hardware issues, because this information is not reliably available. There are other job termination states in the traces, in addition to success and failure. For the Google trace, jobs can be killed by users, tasks can be evicted in order to schedule higher-priority ones, or have an unknown exit status. For the LANL traces, jobs can be cancelled intentionally. We group all these job outcomes as aborted jobs and collectively refer to failed and aborted jobs as unsuccessful jobs.

There is another job outcome category. At LANL, users are required to specify a runtime estimate for each job. This estimate is treated as a time limit, similar to an SLO, and the scheduler kills the job if the limit is exceeded. We refer to these killings as timeout jobs and present them separately because they can produce useful work in three cases: (a) when HPC jobs use the time limit as a stopping criterion, (b) when job state is periodically checkpointed to disk, and (c) when a job completes its work before the time limit but fails to terminate cleanly.

Observation 8: Unsuccessful job terminations in the Google trace are 1.4-6.8x higher than in other traces. Unsuccessful jobs at LANL use 34-80% less CPU time.

Figure 7: Breakdown of the total number of jobs, as well as CPU time, by job outcome.

In Figure 7, we break down the total number of jobs (left), as well as the total CPU time consumed by all jobs (right), by job outcome. First, we observe that the fraction of unsuccessful jobs is significantly higher (1.4-6.8x) for the Google trace than for the other traces. This comparison ignores jobs that time out for Mustang because, as we explained above, it is unlikely they represent wasted resources. We also note that almost all unsuccessful jobs in the Google trace were aborted. According to the trace documentation [58], these jobs could have been aborted by a user or the scheduler, or by dependent jobs that failed. As a result, we cannot rule out the possibility that these jobs were linked to a failure. For this reason, prior work groups all unsuccessful jobs under the "failed" label [17], which we choose to avoid for clarity. Another fact that further highlights how blurred the line between failed and aborted jobs can be is that all unsuccessful jobs in the TwoSigma trace are assigned a failure status. In short, our classification of jobs as "unsuccessful" may seem broad, but it is consistent with the liberal use of the term "failure" in the literature.
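The breakdown in Figure 7 is a pair of straightforward aggregations once each job carries an outcome label and a CPU-time figure. A minimal sketch, assuming hypothetical columns named outcome (e.g., success, failed, aborted, timeout) and cpu_time (core-seconds consumed):

```python
import pandas as pd

jobs = pd.read_csv("cluster_trace.csv")   # hypothetical columns: outcome, cpu_time

# Fraction of jobs per outcome (left panel of Figure 7).
job_share = jobs["outcome"].value_counts(normalize=True).sort_index()

# Fraction of total CPU time per outcome (right panel of Figure 7).
cpu_share = (
    jobs.groupby("outcome")["cpu_time"].sum() / jobs["cpu_time"].sum()
).sort_index()

print(pd.DataFrame({"job_share": job_share, "cpu_time_share": cpu_share}))
```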
We also find that unsuccessful jobs are not equally detrimental to the overall efficiency of all clusters. While the rate of unsuccessful jobs for the TwoSigma trace is similar to the rate of unsuccessful jobs in the OpenTrinity trace, each unsuccessful job lasts longer. Specifically, unsuccessful jobs in the LANL traces waste 34-80% less CPU time than in the Google and TwoSigma traces. It is worth noting that 49-55% of CPU time at LANL is allocated to jobs that time out, which suggests that at least a small fraction of that time may become available through the use of better checkpoint strategies.

Observation 9: For the Google trace, unsuccessful jobs tend to request more resources than successful ones. This is untrue for all other traces.

Figure 8: CDFs of job sizes (in CPU cores) for unsuccessful and successful jobs.

In Figure 8, we show the CDFs of job sizes (in CPU cores) of individual jobs. For each trace, we show separate CDFs for unsuccessful and successful jobs. By separating jobs based on their outcome, we observe that successful jobs in the Google trace request fewer resources, overall, than unsuccessful jobs. This observation has also been made in earlier work [17, 21], but it does not hold for our other traces. CPU requests for successful jobs in the TwoSigma and LANL traces are similar to requests made by unsuccessful jobs. This trend is opposite to what is seen in older HPC job logs [59], and since these traces were also collected through SLURM and MOAB, we do not expect this discrepancy to be due to semantic differences in the way failure is defined across traces.

Observation 10: For the Google and TwoSigma traces, success rates drop for jobs consuming more CPU hours. The opposite is true for the LANL traces.

Figure 9: Success rates for jobs grouped by CPU hours.

For the traces we analyze, the root cause behind unsuccessful outcomes is not reliably recorded. Without this information, it is difficult to interpret and validate the results. For example, we expect that hardware failures are random events whose occurrence roughly approximates some frequency based on the components' Mean Time Between Failure ratings. As a result, jobs that are larger and/or longer would be more likely to fail. In Figure 9 we have grouped jobs based on the CPU hours they consume (a measure of both size and length), and we show the success rate for each group. The trend that stands out is that success rates decrease for jobs consuming more CPU hours in the Google and TwoSigma traces, but they increase and remain high for both LANL clusters. This could be attributed to larger, longer jobs at LANL being more carefully planned and tested, but it could also be due to semantic differences in the way success and failure are defined across traces.

Implications. The majority of papers analyzing the characteristics of job failures in the Google trace build failure prediction models that assume the existence of the trends we have shown on success rates and resource consumption of unsuccessful jobs. Chen et al. [9] highlight the difference in resource consumption between unsuccessful and successful jobs, and El-Sayed et al. [17] note that this is the second most influential predictor (next to early task failures) for their failure prediction models. As we have shown in Figure 9, unsuccessful jobs are not linked to resource consumption in other traces. Another predictor highlighted in both studies is job re-submissions, with successful jobs being re-submitted fewer times. We confirm that this trend is consistent across all traces, even though the majority of jobs (83-93%) are submitted exactly once. A final observation that does not hold true for LANL is that the CPU time of unsuccessful jobs increases with job runtime [17, 22].
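The grouping behind Figure 9 buckets jobs by consumed CPU hours and reports the fraction of successful outcomes per bucket. A sketch along those lines, reusing the hypothetical outcome and cpu_time columns from the previous example:

```python
import numpy as np
import pandas as pd

jobs = pd.read_csv("cluster_trace.csv")   # hypothetical columns: outcome, cpu_time (core-seconds)

cpu_hours = jobs["cpu_time"] / 3600.0
# Bucket edges mirroring Figure 9: <1, 1-10^1, 10^1-10^2, ..., >10^4 CPU hours.
edges = [0, 1, 10, 100, 1_000, 10_000, np.inf]
labels = ["<1", "1-1e1", "1e1-1e2", "1e2-1e3", "1e3-1e4", ">1e4"]
bucket = pd.cut(cpu_hours, bins=edges, labels=labels, right=False)

# Success rate (in percent) per CPU-hour bucket.
success_rate = (
    (jobs["outcome"] == "success")
    .groupby(bucket)
    .mean()
    .mul(100)
)
print(success_rate)
```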
7 A case study on plurality and diversity

Evaluating systems against multiple traces enables researchers to identify practical sensitivities of new research and prove its generality. We demonstrate this through a case study on JVuPredict, the job runtime predictor module of the JamaisVu cluster scheduler [51] (the terms runtime and duration are used interchangeably here). Our evaluation of JVuPredict with all the traces we have introduced revealed the predictive power of logical job names and consistent user behavior in workload traces. Conversely, we found it difficult to obtain accurate runtime predictions in systems that provide insufficient information to identify job re-runs. This section briefly describes the architecture of JVuPredict (Section 7.1) and our evaluation results (Section 7.2).

7.1 JVuPredict background

Recent schedulers [12, 24, 32, 51, 52] use information on job runtimes to make better scheduling decisions. Accurate knowledge of job runtimes allows a scheduler to pack jobs more aggressively in a cluster [12, 18, 54], or to delay a high-priority batch job to schedule a latency-sensitive job without exceeding the deadline of the batch job. In heterogeneous clusters, knowledge of a job's runtime can also be used to decide whether it is better to immediately start a job on hardware that is sub-optimal for it, let it wait until preferred hardware is available, or simply preempt other jobs to let it run [3, 52]. Such schedulers assume most of the provided runtime information is accurate. The accuracy of the provided runtime is important, as these schedulers are only robust to a reasonable degree of error [52].

Traditional approaches for obtaining runtime knowledge are often as trivial as expecting the user to provide an estimate, an approach used in HPC environments such as LANL. As we have seen in Section 6, however, users often use these estimates as a stopping criterion (jobs get killed when they exceed them), specify a value that is too high, or simply fix them to a default value. Another option is to detect jobs with a known structure that are easy to profile as a means of ensuring accurate predictions, an approach followed by systems such as Dryad [30], Jockey [20], and ARIA [53]. For periodic jobs, simple history-based predictions can also work well [12, 32]. But these approaches are still inadequate for consolidated clusters without a known structure or history.

JVuPredict, the runtime prediction module of JamaisVu [51], aims to predict a job's runtime when it is submitted, using historical data on past job characteristics and runtimes. It differs from traditional approaches by attempting to detect jobs that repeat, even when successive runs are not declared as repeats. It is more effective, as only the part of the history relevant to the newly submitted job is used to generate the estimate. To do this, it uses features of submitted jobs, such as user IDs and job names, to build multiple independent predictors. These predictors are then evaluated based on the accuracy achieved on historic data, and the most accurate one is selected for future predictions. Once a prediction is made, the new job is added to the history and the accuracy scores of each model are recalculated. Based on the updated scores a new predictor is selected and the process is repeated.
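The selection loop described above, several per-feature predictors scored on history with the currently most accurate one answering each new query, can be sketched as follows. This is a simplified illustration of the idea rather than JVuPredict's actual implementation; the feature names and the within-2x scoring rule are assumptions.

```python
from collections import defaultdict

class FeaturePredictor:
    """Predicts a job's runtime as the mean of past runtimes that share one
    feature value (e.g., the same user ID or logical job name)."""

    def __init__(self, feature):
        self.feature = feature
        self.history = defaultdict(list)   # feature value -> past runtimes
        self.hits = 0                      # predictions within 2x of the actual runtime
        self.total = 0

    def predict(self, job):
        past = self.history.get(job[self.feature])
        return sum(past) / len(past) if past else None

    def update(self, job, actual_runtime):
        guess = self.predict(job)
        if guess is not None:
            self.total += 1
            if guess > 0 and 0.5 <= actual_runtime / guess <= 2.0:
                self.hits += 1
        self.history[job[self.feature]].append(actual_runtime)

    def score(self):
        return self.hits / self.total if self.total else 0.0


def predict_runtime(predictors, job):
    """Use whichever predictor currently has the best historical accuracy."""
    best = max(predictors, key=lambda p: p.score())
    return best.predict(job)


# Usage sketch: one predictor per feature, updated as jobs complete.
predictors = [FeaturePredictor("user"), FeaturePredictor("job_name")]
stream = [
    {"user": "alice", "job_name": "sim-a", "runtime": 120.0},
    {"user": "alice", "job_name": "sim-a", "runtime": 130.0},
    {"user": "alice", "job_name": "sim-b", "runtime": 3600.0},
]
for job in stream:
    estimate = predict_runtime(predictors, job)   # None until some history exists
    for p in predictors:
        p.update(job, job["runtime"])
```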
7.2 Evaluation results

JVuPredict had originally been evaluated using only the Google trace. Although predictions are not expected to be perfect, performance under the Google trace was reasonably good, with 86% of predictions falling within a factor of two of the actual runtime. This level of accuracy is sufficient for the JamaisVu scheduler, which further applies techniques to mitigate the effects of such mispredictions. In the end, the performance of JamaisVu with the Google trace is sufficient to closely match that of a hypothetical scheduler with perfect job runtime information and to outperform runtime-unaware scheduling [51]. This section repeats the evaluation of JVuPredict using our new TwoSigma and LANL traces. Our criterion for success is meeting or surpassing the prediction accuracy achieved with the Google trace.

A feature expected to effectively predict job repeats is the job's name. This field is typically anonymized by hashing the program's name and arguments, or simply by hashing the user-defined, human-readable job name provided to the scheduler. For the Google trace, predictors using the logical job name field are selected most frequently by JVuPredict due to their high accuracy.

Figure 10: Accuracy of JVuPredict predictions of runtime estimates, for all four traces.

Figure 10 shows our evaluation results. On the x-axis we plot the prediction error for JVuPredict's runtime estimates, as a percentage of the actual runtime of the job. Each data point in the plot is a bucket representing val-
