A Game-Theoretic Approach for Runtime Capacity Allocation in MapReduce

Eugenio Gianniti∗, Danilo Ardagna†, and Michele Ciavotta‡
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
Milano, Italy

Mauro Passacantando§
Dipartimento di Informatica, Università di Pisa
Pisa, Italy

Abstract

Nowadays many companies have available large amounts of raw, unstructured data. Among Big Data enabling technologies, a central place is held by the MapReduce framework and, in particular, by its open source implementation, Apache Hadoop. For cost effectiveness considerations, a common approach entails sharing server clusters among multiple users. The underlying infrastructure should provide every user with a fair share of computational resources, ensuring that Service Level Agreements (SLAs) are met and avoiding wastes.

In this paper we consider two mathematical programming problems that model the optimal allocation of computational resources in a Hadoop 2.x cluster, with the aim to develop new capacity allocation techniques that guarantee better performance in shared data centers. Our goal is to obtain a substantial reduction of power consumption while respecting the deadlines stated in the SLAs and avoiding penalties associated with job rejections. The core of this approach is a distributed algorithm for runtime capacity allocation, based on Game Theory models and techniques, that mimics the MapReduce dynamics by means of interacting players, namely the central Resource Manager and Class Managers.

Keywords: Hadoop, Resource Management, Capacity Allocation, Admission Control, Game Theory, Generalized Nash Equilibrium Problem.

∗ [email protected]
† [email protected]
‡ [email protected]
§ [email protected]

1 Introduction

A large number of enterprises currently commits to the extraction of information from huge data sets as part of their core business activities.
Applications range from fraud detection to one-to-one marketing, encompassing business analytics and support to decision making in both private and public sectors. In order to cope with the unprecedented amount of data and the need to process them in a timely fashion, new technologies are increasingly adopted in industry, following the Big Data paradigm. Among such technologies, Apache Hadoop [1] is already widespread and predictions suggest a further increase in its future adoption. IDC estimates that, by 2020, nearly 40% of Big Data analyses will be supported by public Clouds [2], while Hadoop touched half of the data worldwide by 2015 [3].

Apache Hadoop is an open source software suite that enables the elaboration of vast amounts of data on clusters of commodity hardware. Hadoop implements the MapReduce paradigm, automatically ensuring parallelization, distribution, fault-tolerance, reliability, and monitoring. In order to obtain a high level of scalability, Hadoop 2.x overcomes the drawbacks present in the previous versions by implementing a distributed resource management system, with a central Resource Manager (RM) that provides resources for computation to Application Masters (AMs) entitled to manage the submitted jobs.

Despite the convenience of this paradigm and the undeniably widespread adoption of Hadoop within the IT industry, there are still no tools that support developers and operators in achieving optimal capacity planning of MapReduce applications. In this context the main drawback [4], [5] is that the execution time of a MapReduce job is generally unknown in advance: for some systems, capacity allocation can become a critical aspect. Moreover, resource allocation policies need to decide job execution and rejection rates in a way that users' workloads can meet their Service Level Agreements (SLAs) and the overall cost is minimized.

This paper investigates the theoretical foundations for the optimal runtime management of cluster resources in private Clouds.
We envisage a scenario where a novel resource allocation policy, based on our findings, is implemented and adopted in order to optimally address the discussed issues. Precisely, we focus on the joint admission control and capacity allocation problem, seeking to fulfill SLAs while minimizing energy-related costs. Overall, ICT energy demand sums up to 7% of the world consumption and was expected to rise up to 12% by 2017 [6], with a further tendency towards a shift from devices to networks and data centers consumption [7]. Indeed, worldwide ICT systems account for 2–4% of global CO2 emissions and it is expected that they can reach up to 10% in 5–10 years [8].

We propose a theoretical approach in which the allocation problem is solved periodically based on a prediction of the forthcoming system load. In particular, we adopt Game Theory techniques, which found successful application in the field of Cloud computing [9]–[12], and use them to provide a distributed, scalable solution to the joint admission control and capacity allocation of multi-class Hadoop clusters. We propose a distributed solution leading to a Generalized Nash Equilibrium Problem (GNEP), a class of games that generalizes classical Nash problems, yielding much more difficult instances.

This paper is organized as follows. Initially, we give a clear statement of the problem at hand alongside relevant design assumptions, in Section 2. Afterwards, we show how we developed models to solve the joint capacity allocation and admission control problem. Section 3 presents a preliminary, centralized mathematical programming formulation, whilst Section 4 builds on it to propose a distributed game-theoretic model. Then we analyze our results in Section 5, whilst Section 6 discusses other literature proposals. In the end, Section 7 wraps up this work and draws conclusions on the outcomes.

[Figure 1: Reference technology — the Resource Manager assigns resources to the Class Managers, whose Application Masters run in containers hosted on the Node Managers' VMs.]
2 Problem Statement and Design Assumptions

Figure 1 shows the reference technological system, featuring the Hadoop 2.x framework running on a private Cloud. The private Cloud cluster supports several user classes competing for resources, which are managed via the YARN CapacityScheduler. Each class collects similar jobs, i.e., applications that share analogous values for the parameters characterizing their performance: they have the same job profile. Following the notation brought forth in [5], [13], job profiles include the following contributions: n^M_i and n^R_i, the total number of Map and Reduce tasks per job, respectively; M^max_i, R^max_i, Sh^max_{1,i}, and Sh^max_{typ,i}, the maximum durations of one single Map, Reduce, and Shuffle task (notice that the first Shuffle wave of a given job is distinguished from all the subsequent ones); M^avg_i, R^avg_i, and Sh^avg_{typ,i}, i.e., the average duration of Map, Reduce, and Shuffle tasks, respectively.

The modeled cluster supports the concurrent execution of a maximum of R virtual machines (VMs), which we assume homogeneous for the sake of simplicity. In order to allow for elasticity, the reference system does not store data on the Hadoop Distributed File System (HDFS), as this would expose it to data corruption or poor performance. On the contrary, according to the practice suggested by major Cloud providers [14], [15], data reside on external storage [16], [17].

According to our vision of a novel resource allocation policy, every application class is managed by a Class Manager (CM), which negotiates the required resources with a central RM, entitled to split the available capacity among submitted jobs. The set of application classes is denoted with A and N = |A|. To each CM i ∈ A, the RM assigns r_i VMs. In other words, in this scenario the proposed framework acts as the YARN CapacityScheduler [18], assigning every application class i to a separate queue and providing a portion φ_i of the total resources, where:

    φ_i ≜ r_i / Σ_{j=1}^N r_j,    ∀i ∈ A.

Given ρ̄, the time unit cost to run a single VM, it is possible to obtain the total cost of execution as Σ_{i=1}^N ρ̄ r_i.

For every application class i, an SLA establishes that a maximum of H^up_i jobs can be executed concurrently. However, the system can autonomously decide to reject a portion of such jobs upon payment of a penalty. Finally, the accepted h_i jobs cannot be fewer than H^low_i and the system commits to complete them within a deadline D_i. We denote with P_i(h_i) the penalty functions associated to the possible rejection of some jobs. They are assumed to be decreasing and convex: this is reasonable, as it means that penalties increase at least linearly in the number of rejected jobs.

According to the obtained number of resources r_i, a CM may need to reject some jobs; it then proceeds to activate a suitable number of AMs to coordinate the admitted ones. In this scenario, the AMs have the only duty of managing the resources obtained by the CMs so as to carry out the associated job tasks, without directly taking part in the allocation process. We propose to solve our problems hourly, based on a prediction of the load H^up_i, to dynamically reallocate resources among application classes, while also avoiding the overhead and costs of booting and shutting down VMs too frequently.

In the Hadoop framework each computational node hosts several slots that execute Map and Reduce tasks. In particular, according to the YARN configuration, VM resources are split in containers, so that every VM can be used to concurrently run c^M_i Map or c^R_i Reduce tasks¹. These parameters depend only on the job classes, owing to the assumption of homogeneity made on VMs. The total number of Map and Reduce slots assigned to an application class is represented by s^M_i and s^R_i, respectively. Again, these variables give a simple representation of the workload required to complete jobs in each class, due to the homogeneity assumption on VMs.
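The capacity shares φ_i and the total execution cost defined above are straightforward to compute; the sketch below illustrates them in Python, with made-up VM grants and a made-up unit cost ρ̄, not values taken from the paper:

```python
def capacity_shares(r):
    """CapacityScheduler-style queue shares: phi_i = r_i / sum_j r_j."""
    total = sum(r)
    return [r_i / total for r_i in r]

def execution_cost(r, rho):
    """Total time-unit cost of the assigned VMs: rho * sum_i r_i."""
    return rho * sum(r)

# hypothetical grants for N = 3 job classes and an invented unit cost
r = [40, 25, 15]
print(capacity_shares(r))       # [0.5, 0.3125, 0.1875]
print(execution_cost(r, 0.17))  # ~13.6
```

Note that the shares sum to one by construction, so the RM can hand them directly to the scheduler as queue capacities.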
Precisely, with s^M_i and s^R_i we represent the number of Map and Reduce tasks that run concurrently, hence the maximum size of each wave.

As previously stated, according to [5], it is possible to derive from the Hadoop logs a job profile, i.e., a set of parameters that characterize the execution of jobs in each class. In this paper we use a more refined formulation, as in [13]. The estimated minimum and maximum execution times are computed with formulae similar to the following:

    T_i = A_i h_i / s^M_i + B_i h_i / s^R_i + C_i.    (1)

The parameters A_i, B_i, and C_i aggregate the already mentioned n^M_i, n^R_i, M^max_i, R^max_i, Sh^max_{1,i}, Sh^max_{typ,i}, M^avg_i, R^avg_i, and Sh^avg_{typ,i} parameters, which are measured directly by Hadoop and easily obtainable from its execution logs. These formulae are used to predict the jobs' execution time, given the number of allocated resources and the concurrency level.

Equations (1) can be used to derive deadline constraints; two main alternatives have to be considered, though. On one hand, it is possible to express constraints giving strong guarantees of meeting hard deadlines by considering a conservative upper bound estimate.

¹ Note that in Hadoop 1.x, each node's resources can be partitioned between slots assigned to Map tasks and slots assigned to Reduce tasks. In Hadoop 2.x, the resource capacity configured for each container is available for both Map and Reduce tasks and cannot be partitioned anymore [19]. The maximum number of concurrent mappers and reducers (the slot count) is calculated by YARN based on administrator settings [20]. A node is eligible to run a task when its available memory and CPU can satisfy the task resource requirement. With our hypothesis above, we assume that the configuration settings are such that whatever combination of Map and Reduce tasks can be executed within a container, no CPU remains idle because of a wrong setting of these parameters.
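As a concrete illustration, the estimate of Equation (1) can be evaluated directly once a job profile is available. The coefficients below are invented for the example, not measured profiles:

```python
def exec_time(h, s_map, s_red, A, B, C):
    """Estimated execution time of Eq. (1): T = A*h/s_M + B*h/s_R + C."""
    return A * h / s_map + B * h / s_red + C

# invented profile coefficients for one job class
A_i, B_i, C_i = 60.0, 30.0, 10.0
# 4 concurrent jobs on 20 Map slots and 10 Reduce slots
print(exec_time(4, 20, 10, A_i, B_i, C_i))   # 12 + 12 + 10 = 34.0
# doubling the slots halves only the variable part of the estimate
print(exec_time(4, 40, 20, A_i, B_i, C_i))   # 6 + 6 + 10 = 22.0
```

The constant term C_i bounds how much extra capacity can speed a class up, which is why the deadline parameter E_i introduced below must be negative.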
When dealing with soft deadlines, instead, the arithmetic mean of the upper and lower bounds has been shown to be a better suited estimate (see [5], [13]), giving a quite accurate forecast of the actual execution time, with just an average 10% gap between predicted and measured times [13], and leading to the allocation of comparably fewer resources. Notwithstanding, in both cases we can formulate the deadline constraints as:

    T_i = A_i h_i / s^M_i + B_i h_i / s^R_i + C_i ≤ D_i,    ∀i ∈ A,    (2)

where the D_i are the deadlines. In the following, we adopt the parameter E_i = C_i − D_i. Notice that, by definition, it holds E_i < 0, as nonnegative values would mean that jobs of class i cannot be completed on time. In this paper, we adopt the average formulation, hence forgoing guarantees that the admitted jobs are completed on time, in favor of a less demanding allocation.

In light of the above, we can say that the ultimate goal of the proposed approach is to determine the optimal values of h_i, s^M_i, s^R_i, and r_i so that the sum of costs and rejection penalties is minimized, while the deadlines set by SLAs are met. Table 1 reports all the parameters used in the models discussed in the subsequent sections, while Table 2 summarizes the decision variables.

3 Mathematical Programming Formulation

Building upon the observations and assumptions previously discussed, we formulate a preliminary mathematical programming model that formalizes the problem.
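Rewriting constraint (2) with E_i = C_i − D_i gives a quick feasibility check for a tentative allocation: the left-hand side is nonpositive exactly when the deadline is met. A minimal sketch, again with invented numbers:

```python
def deadline_slack(h, s_map, s_red, A, B, E):
    """LHS of constraint (2) rewritten with E = C - D: nonpositive iff met."""
    return A * h / s_map + B * h / s_red + E

# invented profile; E = C - D must be negative for the class to be schedulable
A_i, B_i, C_i, D_i = 60.0, 30.0, 10.0, 40.0
E_i = C_i - D_i                                   # -30.0
print(deadline_slack(4, 20, 10, A_i, B_i, E_i))   # -6.0 -> deadline met
print(deadline_slack(8, 20, 10, A_i, B_i, E_i))   # 18.0 -> too many jobs admitted
```

The second call shows why admission control matters: with the same slots, doubling the concurrency h violates the deadline, so either more slots must be granted or some jobs rejected.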
The model is the following:

    min_{r,h,s^M,s^R}  Σ_{i=1}^N ρ̄ r_i + Σ_{i=1}^N P_i(h_i)    (P1a)

subject to:

    Σ_{i=1}^N r_i ≤ R,    (P1b)
    H^low_i ≤ h_i ≤ H^up_i,    ∀i ∈ A,    (P1c)
    A_i h_i / s^M_i + B_i h_i / s^R_i + E_i ≤ 0,    ∀i ∈ A,    (P1d)
    s^M_i / c^M_i + s^R_i / c^R_i ≤ r_i,    ∀i ∈ A,    (P1e)
    r_i ∈ ℕ,    ∀i ∈ A,    (P1f)
    h_i ∈ ℕ,    ∀i ∈ A,    (P1g)
    s^M_i ∈ ℕ,    ∀i ∈ A,    (P1h)
    s^R_i ∈ ℕ,    ∀i ∈ A.    (P1i)

Table 1: Centralized Model Parameters

    A         Set of job classes
    N         Number of CMs, or |A|
    ρ̄         Time unit cost for running a VM in the cluster
    H^up_i    Maximum concurrency required in the SLA contract for job class i
    H^low_i   Minimum concurrency required in the SLA contract for job class i
    ψ^low_i   Reciprocal of H^up_i
    ψ^up_i    Reciprocal of H^low_i
    R         Total capacity of the cluster as number of VMs
    A_i       Coefficient associated to Map tasks in the job profile for job class i, [13]
    B_i       Coefficient associated to Reduce tasks in the job profile for job class i, [13]
    E_i       Parameter lumping the constant terms associated neither to Map, nor to Reduce tasks in the job profile for job class i, as well as the deadlines, [13]
    c^M_i     Map slots supported on one VM for job class i
    c^R_i     Reduce slots supported on one VM for job class i
    α_i       Slope of the penalty contribution linear in ψ_i for job class i
    β_i       Constant term of the penalty contribution linear in ψ_i for job class i

Table 2: Centralized Model Decision Variables

    r_i      Number of VMs assigned for the execution of job class i
    h_i      Number of jobs concurrently executed in job class i
    ψ_i      Reciprocal of the concurrency degree h_i
    s^M_i    Number of Map slots assigned for the execution of job class i
    s^R_i    Number of Reduce slots assigned for the execution of job class i

In problem (P1) the objective function (P1a) has a term representing the cost of executing all the assigned VMs and another for penalties. Constraint (P1b) ensures that the cluster capacity bounds the total assigned resources. Further, the set of constraints (P1c) imposes the minimum and maximum job concurrency levels, according to the SLAs.
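To make the formulation concrete, a toy instance of (P1) can be solved by exhaustive search over the integer variables. The class parameters and the linear penalty below are deliberately tiny, invented numbers; real instances are far too large for enumeration, which is why the continuous relaxation is studied next:

```python
from itertools import product

def feasible(h, sM, sR, r, c):
    """Per-class constraints (P1d) and (P1e)."""
    return (c["A"] * h / sM + c["B"] * h / sR + c["E"] <= 0
            and sM / c["cM"] + sR / c["cR"] <= r)

def solve_p1(R, rho, classes):
    """Exhaustive search over (h_i, sM_i, sR_i, r_i) for a toy (P1) instance."""
    per_class = [[(h, sM, sR, r)
                  for h in range(c["Hlow"], c["Hup"] + 1)
                  for sM in range(1, c["cM"] * R + 1)
                  for sR in range(1, c["cR"] * R + 1)
                  for r in range(1, R + 1)
                  if feasible(h, sM, sR, r, c)]
                 for c in classes]
    best, best_val = None, float("inf")
    for combo in product(*per_class):
        if sum(x[3] for x in combo) > R:   # cluster capacity (P1b)
            continue
        # objective (P1a): running cost plus rejection penalties
        val = sum(rho * x[3] + c["P"](x[0]) for x, c in zip(combo, classes))
        if val < best_val:
            best, best_val = combo, val
    return best, best_val

# two identical invented classes; P(h) charges 2 per rejected job (Hup - h)
cls = {"A": 2, "B": 1, "E": -4, "cM": 2, "cR": 2,
       "Hlow": 1, "Hup": 2, "P": lambda h: 2 * (2 - h)}
best, val = solve_p1(R=3, rho=1, classes=[cls, cls])
print(val)   # 5: one class runs both its jobs, the other rejects one
```

Even this tiny instance exhibits the trade-off the model captures: with R = 3 VMs only one class can afford the two VMs needed to run both jobs on time, so the other pays one rejection penalty.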
Similarly, constraints (P1d) exploit the job profiles to ensure the deadlines are met. Constraints (P1e) guarantee that every application class receives enough VMs to support the number of slots it should run concurrently. The left hand side is a conservative estimate of the resources needed to support s^M_i and s^R_i slots at the same time: this expression greatly simplifies the analysis. Constraints (P1f)–(P1i) require all the variables to be nonnegative integers, as expected for their interpretation. In particular, notice that the other constraints impose that all the variables must be positive integers.

Since the optimization problem is nonlinear due to the constraints family (P1d) and the penalty terms P_i(h_i), it is advisable to study its continuous relaxation. Instances of practical interest may well have hundreds of application classes, thus making the solution methods for nonlinear integer problems infeasible for supporting admission control and capacity allocation at runtime. Indeed, the model includes 4N integer variables and 8N + 1 constraints. Nonetheless, the solutions to the proposed models have to be integer, as it is only possible to instantiate a discrete number of VMs; hence we will discuss a heuristic approach to the issue in Section 4.5.

Moreover, constraints (P1d) are not convex, thus ruling out many important results for nonlinear optimization. We address this issue by introducing a new set of variables ψ_i ≜ h_i^{-1}, so that constraints (P1d) become convex, as shown in Proposition 3.1.

Proposition 3.1. The function

    f(ψ_i, s^M_i, s^R_i) = A_i / (s^M_i ψ_i) + B_i / (s^R_i ψ_i) + E_i

is convex.

Proof. We note that it is sufficient to prove that the function g(x, y) = 1/(xy) is convex whenever x and y are positive. The Hessian matrix of g is:

    ∇²g(x, y) = [ 2/(x³y)    1/(x²y²) ]
                [ 1/(x²y²)   2/(xy³)  ]

Since both the trace and the determinant are positive, the Hessian matrix of g is positive definite for any positive x and y, hence g is convex.
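Proposition 3.1 is simple enough to double-check symbolically. The sketch below recomputes the Hessian of g(x, y) = 1/(xy) with SymPy and confirms that its trace and determinant are positive for positive x and y:

```python
import sympy as sp

x, y = sp.symbols("x y", positive=True)
g = 1 / (x * y)

# Hessian of g with respect to (x, y)
H = sp.hessian(g, (x, y))
trace = sp.simplify(H.trace())   # 2/(x**3*y) + 2/(x*y**3)
det = sp.simplify(H.det())       # 3/(x**4*y**4)

# the determinant matches 4/(x^4 y^4) - 1/(x^4 y^4) = 3/(x^4 y^4)
assert sp.simplify(det - 3 / (x**4 * y**4)) == 0
# with x, y declared positive, SymPy can certify both quantities positive
assert trace.is_positive and det.is_positive
```

Since a symmetric 2x2 matrix with positive trace and positive determinant has two positive eigenvalues, this certifies positive definiteness, exactly as the proof argues.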
According to Proposition 3.1, with this change of variables it is possible to write a convex nonlinear problem. Note that we explicitly impose ψ_i > 0 with the rewriting of constraints (P1c), whilst the same does not hold for s^M_i and s^R_i. Besides the trivial consideration that s^M_i = 0 and s^R_i = 0 are outside the domain of f, we should notice that this mirrors the fact that not assigning slots to a job class is not acceptable in the modeled system. Now, let us assume that the penalties are linear in the new variables ψ_i, hence it is possible to write them as α_i ψ_i − β_i, ∀i ∈ A. The corresponding penalty term in the objective function (P1a) is then P_i(h_i) = α_i h_i^{-1} − β_i, ∀i ∈ A. This expression is consistent with the assumptions of convexity and monotonicity made on P_i(h_i). The formulation reads:

    min_{r,ψ,s^M,s^R}  Σ_{i=1}^N ρ̄ r_i + Σ_{i=1}^N (α_i ψ_i − β_i)    (P2a)

subject to:

    Σ_{i=1}^N r_i ≤ R,    (P2b)
    ψ^low_i ≤ ψ_i ≤ ψ^up_i,    ∀i ∈ A,    (P2c)
    A_i / (s^M_i ψ_i) + B_i / (s^R_i ψ_i) + E_i ≤ 0,    ∀i ∈ A,    (P2d)
    s^M_i / c^M_i + s^R_i / c^R_i ≤ r_i,    ∀i ∈ A,    (P2e)
    r_i ≥ 0,    ∀i ∈ A,    (P2f)
    ψ_i ≥ 0,    ∀i ∈ A,    (P2g)
    s^M_i ≥ 0,    ∀i ∈ A,    (P2h)
    s^R_i ≥ 0,    ∀i ∈ A.    (P2i)

Following the proposed change of variables, constraints (P1c) become constraints (P2c), where ψ^low_i = (H^up_i)^{-1} and ψ^up_i = (H^low_i)^{-1}. Further, as can be seen from constraints (P2f)–(P2i), we take the continuous relaxation of the otherwise mixed integer problem. Thanks to the high values typically attained by s^M_i, s^R_i, and r_i, it is possible to round the real solution without affecting too much the optimal value.

We now proceed with the analysis of this formulation. Problem (P2) is convex and the Slater constraint qualification holds: the Karush-Kuhn-Tucker (KKT) conditions are, then, necessary and sufficient for optimality. The associated Lagrangian is:

    L(r, ψ, s^M, s^R) = Σ_{i=1}^N ρ̄ r_i + Σ_{i=1}^N (α_i ψ_i − β_i)
        + a (Σ_{i=1}^N r_i − R) + Σ_{i=1}^N b_i (ψ^low_i − ψ_i)
        + Σ_{i=1}^N c_i (ψ_i − ψ^up_i)
        + Σ_{i=1}^N d_i (A_i / (s^M_i ψ_i) + B_i / (s^R_i ψ_i) + E_i)    (3)
        + Σ_{i=1}^N e_i (s^M_i / c^M_i + s^R_i / c^R_i − r_i)
        − Σ_{i=1}^N f_i s^M_i − Σ_{i=1}^N g_i s^R_i − Σ_{i=1}^N k_i r_i − Σ_{i=1}^N l_i ψ_i.

The associated KKT conditions are:

    ∂L/∂r_i = ρ̄ + a − e_i − k_i = 0,    ∀i ∈ A,    (4a)
    ∂L/∂ψ_i = α_i − b_i + c_i − d_i A_i / (s^M_i ψ_i²) − d_i B_i / (s^R_i ψ_i²) = 0,    ∀i ∈ A,    (4b)
    ∂L/∂s^M_i = −d_i A_i / (ψ_i (s^M_i)²) + e_i / c^M_i = 0,    ∀i ∈ A,    (4c)
    ∂L/∂s^R_i = −d_i B_i / (ψ_i (s^R_i)²) + e_i / c^R_i = 0,    ∀i ∈ A.    (4d)

And the complementary slackness conditions:

    a (Σ_{i=1}^N r_i − R) = 0,    a ≥ 0,    (5a)
    b_i (ψ^low_i − ψ_i) = 0,    b_i ≥ 0,    ∀i ∈ A,    (5b)
    c_i (ψ_i − ψ^up_i) = 0,    c_i ≥ 0,    ∀i ∈ A,    (5c)
    d_i (A_i / (s^M_i ψ_i) + B_i / (s^R_i ψ_i) + E_i) = 0,    d_i ≥ 0,    ∀i ∈ A,    (5d)
    e_i (s^M_i / c^M_i + s^R_i / c^R_i − r_i) = 0,    e_i ≥ 0,    ∀i ∈ A,    (5e)
    f_i s^M_i = 0,    f_i ≥ 0,    ∀i ∈ A,    (5f)
    g_i s^R_i = 0,    g_i ≥ 0,    ∀i ∈ A,    (5g)
    k_i r_i = 0,    k_i ≥ 0,    ∀i ∈ A,    (5h)
    l_i ψ_i = 0,    l_i ≥ 0,    ∀i ∈ A.    (5i)

Now, we can easily prove the following propositions.

Proposition 3.2. Constraints (P2d) and (P2e) are active in every optimal solution.

Proof. Building upon the previous consideration that all the variables must be positive in feasible solutions and owing to (5h), we have k_i = 0. From (4a), then:

    e_i = ρ̄ + a ≥ ρ̄ > 0,    ∀i ∈ A,

meaning that every constraint (P2e) is active in optimal solutions. Now, conditions (4c) yield:

    d_i = e_i ψ_i (s^M_i)² / (A_i c^M_i),    ∀i ∈ A,

and, since all the parameters and variables are positive, it is proved that d_i > 0, ∀i ∈ A, hence all the (P2d) are active in every optimal solution as well.

Proposition 3.3.
The optimal values attained by s^M_i, s^R_i, and ψ_i in problem (P2) are:

    s^M_i = ξ^M_i r_i,    ∀i ∈ A,    (6a)
    s^R_i = ξ^R_i r_i,    ∀i ∈ A,    (6b)
    ψ_i = K_i r_i^{-1},    ∀i ∈ A,    (6c)

where:

    ξ^M_i ≜ c^M_i / (1 + √(B_i c^M_i / (A_i c^R_i))),    ∀i ∈ A,    (7a)
    ξ^R_i ≜ c^R_i / (1 + √(A_i c^R_i / (B_i c^M_i))),    ∀i ∈ A,    (7b)
    K_i ≜ −(√(A_i / c^M_i) + √(B_i / c^R_i))² / E_i,    ∀i ∈ A.    (7c)

Proof. From (4c) and (4d) we obtain:

    s^M_i = s^R_i √(A_i c^M_i / (B_i c^R_i)),    ∀i ∈ A.

Substituting in (P2e) we get:

    s^R_i = c^R_i r_i / (1 + √(A_i c^R_i / (B_i c^M_i))),    ∀i ∈ A,
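The closed forms of Proposition 3.3 can be verified numerically: substituting s^M_i = ξ^M_i r_i, s^R_i = ξ^R_i r_i, and ψ_i = K_i / r_i back into (P2d) and (P2e) must make both constraints active, independently of r_i, as Proposition 3.2 requires. The profile coefficients below are invented for the check:

```python
from math import sqrt

def closed_form(A, B, E, cM, cR):
    """xi_M, xi_R and K from Eqs. (7a)-(7c)."""
    xiM = cM / (1 + sqrt(B * cM / (A * cR)))
    xiR = cR / (1 + sqrt(A * cR / (B * cM)))
    K = -(sqrt(A / cM) + sqrt(B / cR)) ** 2 / E
    return xiM, xiR, K

# invented job-profile coefficients (E < 0, as required)
A_i, B_i, E_i, cM_i, cR_i = 60.0, 30.0, -40.0, 2.0, 2.0
xiM, xiR, K = closed_form(A_i, B_i, E_i, cM_i, cR_i)

# (P2e) active: xi_M/c_M + xi_R/c_R = 1, so s_M/c_M + s_R/c_R = r for any r
assert abs(xiM / cM_i + xiR / cR_i - 1.0) < 1e-12
# (P2d) active: A/(xi_M K) + B/(xi_R K) + E = 0 after substituting (6a)-(6c)
assert abs(A_i / (xiM * K) + B_i / (xiR * K) + E_i) < 1e-9
print(xiM, xiR, K)
```

The two assertions are exactly the activity of (P2e) and (P2d) after the substitution, with r_i cancelling out; this is the algebraic content of the proposition, so any positive coefficients with E < 0 should pass the same check.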