Energy-Efficient Multiprocessor Scheduling for Flow Time and Makespan

Hongyang Sun∗  Yuxiong He†  Wen-Jing Hsu‡

arXiv:1010.4110v1 [cs.DS] 20 Oct 2010

Abstract

We consider energy-efficient scheduling on multiprocessors, where the speed of each processor can be individually scaled, and a processor consumes power s^α if it runs at speed s, where α > 1. A scheduling algorithm needs to decide both processor allocations and speeds for a set of parallel jobs whose parallelism can vary with time. The objective is to minimize the sum of overall energy consumption and some performance metric, which in this paper includes flow time and makespan. For both objectives, we present semi-clairvoyant algorithms that are aware of the instantaneous parallelism of the jobs but not their future information. We present the U-Ceq algorithm for flow time plus energy, and show that it is O(1)-competitive. This is the first O(1)-competitive result for multiprocessor speed scaling on parallel jobs. We also consider, for the first time in the literature, makespan plus energy. We present the P-First algorithm and show that it is O(ln^{1−1/α} P)-competitive for parallel jobs consisting of fully-parallel and sequential phases, where P is the total number of processors. Moreover, we prove that P-First is asymptotically optimal in this setting by providing a matching lower bound. In addition, we revisit non-clairvoyant scheduling for flow time plus energy, and show that the N-Equi algorithm is O(ln P)-competitive. We then prove a lower bound of Ω(ln^{1/α} P) for any non-clairvoyant algorithm.

1 Introduction

Energy has been widely recognized as a key consideration in the design of modern high-performance computer systems. One popular approach to control energy is by dynamically changing the speeds of the processors, a technique known as dynamic speed scaling [29, 8, 16].
It has been observed that for most CMOS-based processors, the dynamic power consumption satisfies the cube-root rule, that is, the power of a processor is proportional to s^3 when it runs at speed s [8, 23]. Since the seminal paper by Yao, Demers and Shenker [31], however, most researchers have assumed a more general power function s^α, where α > 1 is the power parameter. As this power function is strictly convex, the total energy usage when executing a job can be significantly reduced by slowing down the speed of the processor at the expense of the job's performance. Thus, how to optimally trade off the conflicting objectives of energy and performance has become an active research topic in the algorithmic community. (See [19, 1] for two excellent surveys of the field.)

∗School of Computer Engineering, Nanyang Technological University, Block N4, Nanyang Avenue, Singapore 639798. [email protected]
†Microsoft Research, Redmond, WA, USA 98052. [email protected]
‡School of Computer Engineering, Nanyang Technological University, Block N4, Nanyang Avenue, Singapore 639798. [email protected]

In this paper, we focus on scheduling parallel jobs on multiprocessors with per-processor speed scaling capability [18, 21], that is, the speed of each processor can be individually scaled. A scheduling algorithm needs to have both a processor allocation policy, which decides the number of processors allocated to each job, and a speed scaling policy, which decides the speed of each allocated processor. Moreover, we assume that the parallel jobs under consideration can have time-varying parallelism. Thus, if the scheduling algorithm is not designed properly, it may incur a large amount of energy waste when the jobs have relatively low parallelism, or cause severe execution delay and hence performance degradation when the parallelism of the jobs is high. This poses additional challenges to the speed scaling problem for parallel jobs compared with its sequential counterpart.
We adopt the objective function proposed by Albers and Fujiwara [2] that consists of a linear combination of overall energy consumption and some performance metric, which in this paper includes total flow time and makespan. The flow time of a job is defined to be the duration between its release and completion. The total flow time for a set of jobs is the sum of the flow times of all jobs, and the makespan is the completion time of the last completed job in the job set. Both total flow time and makespan are widely used performance metrics in the scheduling literature: the former often measures the average response time of all users in the system, while the latter is closely related to the throughput of the system. Although energy and flow time (or makespan) have different units, optimizing a linear combination of the two can be naturally interpreted by looking at both objectives from a unified point of view. Suppose that the user is willing to spend one unit of energy in order to reduce ρ units of total flow time (or makespan). Then, by changing the units of time and energy, we can assume without loss of generality that ρ = 1. Thus, the objective reduces to optimizing the total flow time (or makespan) plus energy for a set of jobs. In fact, minimizing a sum of conflicting objectives is quite common in many bi-criteria optimization problems. In the scheduling literature, similar metrics that combine both performance and cost of scheduling as part of the objective function have been considered previously [28, 26, 11]. Since Albers and Fujiwara [2] first proposed total flow time plus energy, many excellent results (see, e.g., [4, 22, 3, 9, 10, 27, 30]) have been obtained under different scheduling models.
For instance, some results assume that the scheduling algorithm is clairvoyant, that is, it gains complete knowledge of a job, such as its total work, immediately upon the job's arrival; the other results are based on a more practical non-clairvoyant model, where the scheduler knows nothing about the un-executed portion of a job. Most of these results, however, are applicable to scheduling sequential jobs on a single processor, and to the best of our knowledge, no previous work is known that minimizes makespan plus energy. The closest results to ours are from Chan, Edmonds and Pruhs [10], and Sun, Cao and Hsu [27], who studied non-clairvoyant scheduling for parallel jobs on multiprocessors to minimize total flow time plus energy. In both works, it is observed that any non-clairvoyant algorithm that allocates one set of uniform-speed processors to a job performs poorly, or specifically is Ω(P^{(α−1)/α²})-competitive, where P is the total number of processors. The intuition is that any non-clairvoyant algorithm may in the worst case allocate a "wrong" number of processors to a job compared to its parallelism, and thus either incur excessive energy waste or cause severe execution delay. Therefore, to obtain reasonable results, a non-clairvoyant algorithm needs to be more flexible in assigning processors of different speeds to a job. To this end, Chan, Edmonds and Pruhs [10] assumed an execution model in which each job can be executed by multiple groups of different-speed processors. The execution rate of a job at any time is given by the fastest rate of all groups. They proposed a scheduling algorithm MultiLaps and showed that it is O(log P)-competitive with respect to the total flow time plus energy. In addition, they also gave a lower bound of Ω(log^{1/α} P) for any non-clairvoyant algorithm.
Sun, Cao and Hsu [27], on the other hand, assumed a different execution model, in which only one group of processors with different speeds is allocated to each job at any time, and the execution rate is determined by the speeds of the fastest processors that can be effectively utilized. They proposed algorithm N-Equi and showed that it is O(ln^{1/α} P)-competitive with respect to the total flow time plus energy on batched parallel jobs (i.e., all jobs are released at the same time). Both execution models are based on certain assumptions and can each be justified on its own terms. It is, however, quite difficult to predict which model is more practical to implement. In this paper, we first revisit non-clairvoyant scheduling for total flow time plus energy under the model by Sun, Cao and Hsu, and show that:

• N-Equi is O(ln P)-competitive with respect to total flow time plus energy for parallel jobs with arbitrary release times, and any non-clairvoyant algorithm is Ω(ln^{1/α} P)-competitive.

Interestingly, both results match asymptotically those obtained under the execution model by Chan, Edmonds and Pruhs. The lower bound also suggests that N-Equi is asymptotically optimal in the batched setting.

Moreover, in this paper we consider a new scheduling model, which we call the semi-clairvoyant model. Compared to the non-clairvoyant model, which does not allow a scheduler to have any knowledge about the un-executed portion of a job, we allow a semi-clairvoyant algorithm to know the available parallelism of a job at the immediate next step, or the instantaneous parallelism. Any future characteristic of the job, such as its remaining parallelism and work, is still unknown. In many parallel systems using a centralized task queue or thread pool, instantaneous parallelism is simply the number of ready tasks in the queue or the number of ready threads in the pool, which is information practically available to the scheduler.
Even for parallel systems using distributed scheduling such as work stealing [6], instantaneous parallelism can be collected or estimated through counting or sampling without introducing much system overhead. We first show that such semi-clairvoyance about the instantaneous parallelism of the jobs can bring significant performance improvement with respect to the total flow time plus energy. In particular,

• We present a semi-clairvoyant algorithm U-Ceq, and show that it is O(1)-competitive with respect to total flow time plus energy. This is the first O(1)-competitive result on multiprocessor speed scaling for parallel jobs.

Compared to the performance of non-clairvoyant algorithms, the reason for the improvement is that upon knowing the instantaneous parallelism, a semi-clairvoyant algorithm can now allocate a "right" number of processors to a job at any time, thus ensuring that no energy is wasted. At the same time, it can also guarantee a sufficient execution rate by setting the total power consumption proportionally to the number of active jobs at any time and dividing it equally among the active jobs. This has been a common practice that intuitively provides the optimal balance between energy and total flow time [4, 22, 3, 9]. Moreover, unlike the best non-clairvoyant algorithms known so far [10, 27], which require nonuniform speed scaling for an individual job, our semi-clairvoyant algorithm only requires allocating processors of uniform speed to a job, and thus may be more feasible in practice.

We also consider, for the first time in the literature, the objective of makespan plus energy. Unlike total flow time plus energy, where the completion time of each job contributes to the overall objective function, makespan is the completion time of the last job, and the other jobs only contribute to the energy consumption part of the objective; hence they can be slowed down to improve the overall performance.
However, without knowing the future information, such as the remaining work of the jobs, we show that it is harder to minimize makespan plus energy even in the semi-clairvoyant setting. Specifically,

• We present a semi-clairvoyant algorithm P-First and show that it is O(ln^{1−1/α} P)-competitive with respect to makespan plus energy for batched parallel jobs consisting of sequential and fully-parallel phases. We also give a matching lower bound of Ω(ln^{1−1/α} P) for any semi-clairvoyant algorithm.

In addition, compared to minimizing total flow time plus energy, where the common practice is to set the power proportionally to the number of active jobs, we show that the optimal strategy for minimizing makespan plus energy is to set the power consumption at a constant level, or more precisely 1/(α−1) at any time, where α is the power parameter.

The rest of this paper is organized as follows. Section 2 formally defines the models and the objective functions. Section 3 studies both non-clairvoyant and semi-clairvoyant scheduling for total flow time plus energy. Section 4 presents our semi-clairvoyant algorithm for makespan plus energy. Finally, Section 5 provides some discussions and future directions.

2 Models and Objective Functions

We model parallel jobs using time-varying parallelism profiles. Specifically, we consider a set J = {J_1, J_2, ..., J_n} of n jobs to be scheduled on P processors. Adopting the notions in [14, 13, 15, 10], each job J_i ∈ J contains k_i phases ⟨J_i^1, J_i^2, ..., J_i^{k_i}⟩, and each phase J_i^k has an amount of work w_i^k and a linear speedup function Γ_i^k up to a certain parallelism h_i^k, where h_i^k ≥ 1. Suppose that at any time t, job J_i is in its k-th phase and is allocated a_i(t) processors, which may not have the same speed.
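The phase-based model above can be made concrete with a short sketch. This is an illustrative encoding of ours, not the authors' code: a job is a list of (work, parallelism) pairs, and the execution rate of a job under an allocation of possibly different-speed processors sums the fastest effectively utilized speeds, following the maximum utilization policy and the total work and span definitions of this section.

```python
# Illustrative encoding of the job model of Section 2 (ours, not the paper's
# code): a job is a list of (w, h) phases with work w and parallelism h >= 1.

def total_work(phases):
    # w(J_i) = sum over phases of w_i^k
    return sum(w for w, h in phases)

def total_span(phases):
    # l(J_i) = sum over phases of l_i^k = w_i^k / h_i^k
    return sum(w / h for w, h in phases)

def execution_rate(speeds, h):
    # Maximum utilization: only the min(a_i(t), h) fastest allocated
    # processors are effectively utilized; the rate is the sum of their speeds.
    fastest = sorted(speeds, reverse=True)[:min(len(speeds), h)]
    return sum(fastest)

job = [(4.0, 1), (12.0, 4), (2.0, 1)]        # sequential, parallel, sequential
print(total_work(job), total_span(job))      # 18.0 9.0
print(execution_rate([1.0, 3.0, 2.0], h=2))  # 5.0: the two fastest are used
```

With parallelism h = 2, the allocation of three processors wastes the slowest one, which is exactly the kind of mismatch the analysis below charges to a scheduler.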
We assume that the execution of the job at time t is then based on the maximum utilization policy [20, 5], which always utilizes faster processors before slower ones until the number of utilized processors exceeds the parallelism of the job. In particular, let s_j denote the speed of the j-th allocated processor, and we can assume without loss of generality that s_1 ≥ s_2 ≥ ... ≥ s_{a_i(t)}. Then, only the ā_i(t) = min{a_i(t), h_i^k} fastest processors are effectively utilized, and the speedup or execution rate of the job at time t is given by Γ_i^k(a_i(t)) = Σ_{j=1}^{ā_i(t)} s_j. The span l_i^k of phase J_i^k, which is a convenient parameter representing the time to execute the phase with h_i^k or more processors of unit speed, is then given by l_i^k = w_i^k / h_i^k. We say that phase J_i^k is fully-parallel if h_i^k = ∞ and it is sequential if h_i^k = 1. Moreover, if job J_i consists of only sequential and fully-parallel phases, we call it a (Par-Seq)* job [25]. Finally, for each job J_i, we define its total work to be w(J_i) = Σ_{k=1}^{k_i} w_i^k and its total span to be l(J_i) = Σ_{k=1}^{k_i} l_i^k.

At any time t, a scheduling algorithm needs to specify the number a_i(t) of processors allocated to each job J_i, as well as the speed of each allocated processor. We say that an algorithm is non-clairvoyant if it makes all scheduling decisions without any current and future information about the jobs, such as their release times, parallelism profiles and remaining work. In addition, we say that an algorithm is semi-clairvoyant if it is only aware of the current parallelism, or instantaneous parallelism, of the jobs, but not their future parallelism and remaining work. We require that the total processor allocation cannot exceed the total number of processors at any time in a valid schedule, i.e., Σ_{i=1}^n a_i(t) ≤ P. Let r_i denote the release time of job J_i. If all jobs are released together in a single batch, then their release times can be assumed to be all 0.
Otherwise, we can assume without loss of generality that the first released job arrives at time 0. Let c_i denote the completion time of job J_i. We also require that a valid schedule must complete all jobs in a finite amount of time and cannot begin to execute a phase of a job unless it has completed all its preceding phases, i.e., r_i = c_i^0 ≤ c_i^1 ≤ ... ≤ c_i^{k_i} = c_i < ∞, and ∫_{c_i^{k−1}}^{c_i^k} Γ_i^k(a_i(t)) dt = w_i^k for all 1 ≤ k ≤ k_i, where c_i^k denotes the completion time of phase J_i^k.

The flow time f_i of any job J_i is the duration between its completion and release, i.e., f_i = c_i − r_i. The total flow time F(J) of all jobs in J is given by F(J) = Σ_{i=1}^n f_i, and the makespan M(J) is the completion time of the last completed job, i.e., M(J) = max_{i=1,...,n} c_i. Job J_i is said to be active at time t if it is released but not completed at t, i.e., r_i ≤ t ≤ c_i. An alternative expression for the total flow time is F(J) = ∫_0^∞ n_t dt, where n_t is the number of active jobs at time t. For each processor at a particular time, its power is given by s^α if it runs at speed s, where α > 1 is the power parameter. Hence, if a processor is not allocated to any job, we can set its speed to 0, so it does not consume any power. Let u_i(t) denote the power consumed by job J_i at time t, i.e., u_i(t) = Σ_{j=1}^{a_i(t)} s_j^α. The overall energy consumption e_i of the job is given by e_i = ∫_0^∞ u_i(t) dt, and the total energy consumption E(J) of the job set is E(J) = Σ_{i=1}^n e_i, or alternatively E(J) = ∫_0^∞ u_t dt, where u_t = Σ_{i=1}^n u_i(t) denotes the total power consumption of all jobs at time t. In this paper, we consider the total flow time plus energy G(J) and the makespan plus energy H(J) of the job set, i.e., G(J) = F(J) + E(J) and H(J) = M(J) + E(J). The objective is to minimize either G(J) or H(J).

We use competitive analysis [7] to evaluate an online scheduling algorithm by comparing its performance with that of an optimal offline scheduler.
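Before turning to competitive analysis, the objective G(J) = F(J) + E(J) can be checked on a toy schedule. The instance below is hypothetical (all numbers are ours, for illustration); it also cross-checks the alternative expression F(J) = ∫_0^∞ n_t dt.

```python
# Two jobs given as (release r_i, completion c_i); total power u_t given as
# piecewise-constant segments (start, end, power). Numbers are illustrative.
jobs = [(0.0, 3.0), (1.0, 4.0)]
F = sum(c - r for r, c in jobs)                         # flow time: 3 + 3 = 6
power = [(0.0, 3.0, 2.0), (3.0, 4.0, 1.0)]
E = sum((end - start) * u for start, end, u in power)   # energy: 6 + 1 = 7
G = F + E
# Alternative form F = integral of n_t:
# n_t = 1 on [0,1), 2 on [1,3), 1 on [3,4), so the integral is 1 + 4 + 1 = 6.
F_alt = 1 * 1 + 2 * 2 + 1 * 1
print(G, F == F_alt)  # 13.0 True
```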
An online algorithm A is said to be c_1-competitive with respect to total flow time plus energy if G_A(J) ≤ c_1 · G*(J) for any job set J, where G*(J) denotes the total flow time plus energy of J under an optimal offline scheduler. Similarly, an online algorithm B is said to be c_2-competitive with respect to makespan plus energy if for any job set J we have H_B(J) ≤ c_2 · H*(J), where H*(J) denotes the makespan plus energy of the job set under an optimal offline scheduler.

3 Total Flow Time Plus Energy

We consider the objective of total flow time plus energy in this section. We first revisit the non-clairvoyant algorithm N-Equi [27] by showing its competitive ratio for jobs with arbitrary release times. We then derive a lower bound on the competitive ratio of any non-clairvoyant algorithm. Finally, we present a semi-clairvoyant algorithm U-Ceq and show that it significantly improves upon any non-clairvoyant algorithm.

3.1 Preliminaries

We first derive a lower bound on the total flow time plus energy of any scheduler, which will help us conveniently bound the performance of the online algorithms through indirect comparison instead of comparing directly with the optimal.

Lemma 1 The total flow time plus energy of any set J of n jobs under the optimal scheduler satisfies

G*(J) ≥ G_1*(J) = α/(α−1)^{1−1/α} · Σ_{i=1}^n Σ_{k=1}^{k_i} w_i^k / (h_i^k)^{1−1/α}.

Proof. Consider any phase J_i^k of job J_i. The optimal scheduler will only perform better if there is an unlimited number of processors at its disposal. In this case, it will allocate a processors of the same speed, say s, to the phase throughout its execution, since by the convexity of the power function, if different speeds are used, then averaging the speeds will result in the same execution rate but less energy consumed [31]. Moreover, we have a ≤ h_i^k, since allocating more processors to a phase than its parallelism will incur more energy without improving flow time.
The flow time plus energy introduced by the execution of J_i^k is then given by

w_i^k/(as) + (w_i^k/(as)) · a s^α = w_i^k (1/(as) + s^{α−1}) ≥ α/(α−1)^{1−1/α} · w_i^k / a^{1−1/α} ≥ α/(α−1)^{1−1/α} · w_i^k / (h_i^k)^{1−1/α}.

Extending this property over all phases and all jobs gives the lower bound.

We now outline the amortized local competitiveness argument [4] used to prove the competitive ratio of an online scheduling algorithm A. We first define some notations. For any job set J at time t, let dG_A(J(t))/dt denote the rate of change of flow time plus energy under online algorithm A, and let dG*(J*(t))/dt denote the rate of change of flow time plus energy under the optimal. Apparently, we have dG_A(J(t))/dt = n_t + u_t and dG*(J*(t))/dt = n_t* + u_t*, where n_t* and u_t* denote the number of active jobs and the power under the optimal at time t. Moreover, we let dG_1*(J(t))/dt denote the rate of change of the lower bound given in Lemma 1 with respect to the execution of the job set under A at time t. We also need to define a potential function Φ(t) associated with the status of the job set at any time t under both the online algorithm and the optimal. Then, we can similarly define dΦ(t)/dt to be the rate of change of the potential function at t. The following lemma shows that the competitive ratio of algorithm A can be obtained by bounding the instantaneous performance of A at any time t with respect to the optimal scheduler through these rates of change.

Lemma 2 Suppose that an online algorithm A schedules a set J of jobs. Then A is (c_1 + c_2)-competitive with respect to total flow time plus energy, if given a potential function Φ(t), the execution of the job set under A satisfies
- Boundary condition: Φ(0) ≤ 0 and Φ(∞) ≥ 0;
- Arrival condition: Φ(t) does not increase when a new job arrives;
- Completion condition: Φ(t) does not increase when a job completes under either A or the optimal offline scheduler;
- Running condition: dG_A(J(t))/dt + dΦ(t)/dt ≤ c_1 · dG*(J*(t))/dt + c_2 · dG_1*(J(t))/dt.
Proof. Let T denote the set of time instances when a job arrives or completes under either the online algorithm A or the optimal offline scheduler. Integrating the running condition over time, we get

G_A(J) + Φ(∞) − Φ(0) + Σ_{t∈T} (Φ(t−) − Φ(t+)) ≤ c_1 · G*(J) + c_2 · G_1*(J),

where t− and t+ denote the time instances right before and after time t. Now, applying the boundary, arrival and completion conditions to the above inequality, we get G_A(J) ≤ c_1 · G*(J) + c_2 · G_1*(J). Since G_1*(J) is a lower bound on the total flow time plus energy of job set J according to Lemma 1, the performance of algorithm A thus satisfies G_A(J) ≤ (c_1 + c_2) · G*(J).

3.2 Non-clairvoyant Algorithm: N-EQUI

In this subsection, we revisit the non-clairvoyant algorithm N-Equi (Nonuniform Equipartition) [27], which is described in Algorithm 1. The idea of N-Equi is that at any time it allocates an equal share P/n_t of processors to each active job, and the speeds of the allocated processors are set to decrease monotonically according to a scaled version of the harmonic series. We assume that the processor allocation P/n_t is always an integer; otherwise, by rounding it to ⌊P/n_t⌋, the bounds derived will increase by at most a constant factor.

Algorithm 1 N-Equi (at any time t)
1: allocate a_i(t) = P/n_t processors to each active job J_i,
2: set the speed of the j-th allocated processor of job J_i as s_{ij}(t) = (1/((α−1) H_P · j))^{1/α}, where 1 ≤ j ≤ a_i(t) and H_P = 1 + 1/2 + ... + 1/P is the P-th harmonic number.

At time t, when job J_i is in its k-th phase, we say that it is satisfied if its processor allocation is at least the instantaneous parallelism, i.e., a_i(t) ≥ h_i^k. Otherwise, the job is deprived if a_i(t) < h_i^k. Let J_S(t) and J_D(t) denote the sets of satisfied and deprived jobs at time t, respectively. For convenience, we let n_t^S = |J_S(t)| and n_t^D = |J_D(t)|. Since a job is either satisfied or deprived, we have n_t = n_t^S + n_t^D.
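Algorithm 1 can be sketched as follows (an illustrative Python rendering of ours, not the authors' code); the assertion spot-checks the per-job power bound u_i(t) ≤ 1/(α−1) used later in the analysis, which follows because Σ_j s_{ij}(t)^α = H_{a_i(t)} / ((α−1) H_P).

```python
def n_equi(n_active, P, alpha):
    """One scheduling step of N-Equi (sketch): each active job gets an equal
    share of P/n_t processors, with harmonically decreasing speeds."""
    H_P = sum(1.0 / j for j in range(1, P + 1))
    share = P // n_active                 # the paper assumes P/n_t is integral
    speeds = [(1.0 / ((alpha - 1) * H_P * j)) ** (1.0 / alpha)
              for j in range(1, share + 1)]
    return share, speeds

alpha, P = 3.0, 8
share, speeds = n_equi(2, P, alpha)
print(share)                              # 4 processors per active job
# per-job power: sum_j s_j^alpha = H_share / ((alpha-1) H_P) <= 1/(alpha-1)
power = sum(s ** alpha for s in speeds)
assert power <= 1.0 / (alpha - 1) + 1e-12
```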
Moreover, we define x_t = n_t^D / n_t to be the deprived ratio. Let ā_i(t) = min{a_i(t), h_i^k}. By approximating summations with integrals, the execution rate of job J_i at time t can be shown to satisfy

(1/((α−1)H_P))^{1/α} · ā_i(t)^{1−1/α} / 2^{1/α} ≤ Γ_i^k(a_i(t)) ≤ (1/((α−1)H_P))^{1/α} · ā_i(t)^{1−1/α}.

Moreover, the power consumption of job J_i satisfies u_i(t) ≤ 1/(α−1), and hence the overall power consumption satisfies u_t ≤ n_t/(α−1).

To bound the performance of N-Equi, we adopt the potential function by Lam et al. [22] used in the analysis of online speed scaling algorithms for sequential jobs. Specifically, we define n_t(z) to be the number of active jobs whose remaining work is at least z at time t under N-Equi, and define n_t*(z) to be the number of active jobs whose remaining work is at least z under the optimal. The potential function is defined to be

Φ(t) = η ∫_0^∞ ( Σ_{i=1}^{n_t(z)} i^{1−1/α} − n_t(z)^{1−1/α} n_t*(z) ) dz,    (1)

where η = η′ H_P^{1/α} / P^{1−1/α} and η′ is a constant to be specified later. With the help of Lemma 2, the competitive ratio of N-Equi is proved in the following theorem.

Theorem 3 N-Equi is O(ln P)-competitive with respect to the total flow time plus energy for any set of parallel jobs, where P is the total number of processors.

Proof. We will show that the execution of any job set under N-Equi (NE for short) satisfies the boundary, arrival and completion conditions, as well as the running condition dG_NE(J(t))/dt + dΦ(t)/dt ≤ c_1 · dG*(J*(t))/dt + c_2 · dG_1*(J(t))/dt, where c_1 = O(ln P) and c_2 = O(ln^{1/α} P). Then the theorem is directly implied.

- Boundary condition: at time 0, no jobs exist, so n_t(z) and n_t*(z) are 0 for all z. Hence, Φ(0) = 0. At time ∞, all jobs are completed, so again Φ(∞) = 0.

- Arrival condition: Let t− and t+ denote the instances right before and after a new job with work w arrives at time t.
Hence, we have n_{t+}(z) = n_{t−}(z) + 1 for z ≤ w and n_{t+}(z) = n_{t−}(z) for z > w, and similarly n_{t+}*(z) = n_{t−}*(z) + 1 for z ≤ w and n_{t+}*(z) = n_{t−}*(z) for z > w. For convenience, we define φ_t(z) = Σ_{i=1}^{n_t(z)} i^{1−1/α} − n_t(z)^{1−1/α} n_t*(z). It is obvious that for z > w, we have φ_{t+}(z) = φ_{t−}(z). For z ≤ w, we can get φ_{t+}(z) − φ_{t−}(z) = n_{t−}*(z) (n_{t−}(z)^{1−1/α} − (n_{t−}(z)+1)^{1−1/α}) ≤ 0. Hence, Φ(t+) = η ∫_0^∞ φ_{t+}(z) dz ≤ η ∫_0^∞ φ_{t−}(z) dz = Φ(t−).

- Completion condition: when a job completes under either N-Equi or the optimal, Φ(t) is unchanged, since n_t(z) or n_t*(z) is unchanged for all z > 0.

- Running condition: At any time t, suppose that the optimal offline scheduler sets the speed of the j-th processor to s_j*. We have dG_NE(J(t))/dt = n_t + u_t ≤ α/(α−1) · n_t and dG*(J*(t))/dt = n_t* + u_t* = n_t* + Σ_{j=1}^P (s_j*)^α. To bound the rate of change dG_1*(J(t))/dt, we consider each satisfied job J_i ∈ J_S(t). Suppose that at time t, J_i is in its k-th phase under N-Equi; then the execution rate of the job is given by Γ_i^k(a_i(t)) ≥ (1/((α−1)H_P))^{1/α} · (h_i^k)^{1−1/α} / 2^{1/α}. Since dG_1*(J(t))/dt only depends on the parts of the jobs that are executed by N-Equi at time t, we have

dG_1*(J(t))/dt ≥ α/(α−1)^{1−1/α} · Σ_{J_i∈J_S(t)} Γ_i^k(a_i(t)) / (h_i^k)^{1−1/α} ≥ α/(α−1) · (1/(2H_P))^{1/α} · n_t^S = α/(α−1) · (1/(2H_P))^{1/α} · (1−x_t) n_t.

Now, we focus on finding an upper bound on the rate of change dΦ(t)/dt of the potential function Φ(t) at time t. In particular, we consider the set J_D(t) of deprived jobs. In the worst case, the n_t^D deprived jobs may have the most remaining work. Again, we assume that at time t job J_i ∈ J_D(t) is in its k-th phase under N-Equi. The change of the potential
Thechangeof thepotential i D function can then be bounded by dΦ(t) η ∞ nt+dt(z) nt(z) ≤ i1−1/α − i1−1/α dz dt dt Z0 i=1 i=1 X X η ∞ + n (z)1−1/α n∗(z)−n∗ (z) +n∗(z) n (z)1−1/α −n (z)1−1/α dz dt t t t+dt t t t+dt Z0 h (cid:16) (cid:17)i η′H1/α nDt (cid:0) (cid:1) P nt ≤ P − i1−1/α ·Γk(a (t))+n1−1/α s∗+n∗ i1−1/α −(i−1)1−1/α Γk(a (t)) . P1−1/α i i t j t i i Xi=1 Xj=1 Xi=1(cid:16) (cid:17) Wecanget nDt i1−1/α ≥ nDt i1−1/αdi= (nDt )2−1/α ≥ x2tnt2−1/α and nt i1−1/α −(i−1)1−1/α = i=1 0 2−1/α 2 i=1 n1−1/α. MoreovPer,accordingtoRLemma4,wehaven1−1/α P s∗ ≤ λ(HPP·P)1−(cid:0)1/α P s∗ α+ (cid:1) t t j=1 j α j=1 j λ1/(α−11−)(1H/Pα·P)1/αPnt, where λ is a constant to be specifiedPlater. Substituting thePse bou(cid:16)nds(cid:17) as well as the upper and lower bounds of Γk(a (t)) into dΦ(t) and simplify, we have i i dt dΦ(t) x2 λH P 1−1/α α ≤ η′ − t n + P s∗ α+ n + n∗ . (2) dt 4(α−1)1/α t α j λ1/(α−1) t (α−1)1+1/α t j=1 X(cid:0) (cid:1) Now, we set η′ = 4α2 and λ = 4α−1(α−1)1−1/α. Substituting Inequality (3) as (α−1)1−1/α well as the rates of change dGNE(J(t)), dG∗(J∗(t)) and dG∗1(J(t)) into the running condition, dt dt dt we can see that in order to satisfy it for all values of x , the multipliers c and c can be t 1 2 set to c = max{ 4α3 ,4ααH } and c = 2α·(2H )1/α. Since α can be considered as a 1 (α−1)2 P 2 P constant, and it is well-known that H = O(lnP), the theorem is proved. P Lemma 4 For anyn ≥ 0, s∗ ≥ 0and λ > 0, wehave that n1−1/αs∗ ≤ λ(HP·P)1−1/α s∗ α+ t j t j α j 1−1/α n . (cid:16) (cid:17) λ1/(α−1)(HP·P)1/α t 8 Proof. Thelemma is a direct result of Young’s Inequality [17], which is stated formally as follows. If f is a continuous and strictly increasing function on [0,c] with c > 0, f(0) = 0, a ∈ [0,c] and b ∈ [0,f(c)], then ab ≤ af(x)dx + bf−1(x)dx, where f−1 is the inverse 0 0 function of f. By setting f(x) = λ(H ·P)1−1/αxα−1, a = s∗ and b = n1−1/α, the lemma PR R j t is directly implied. 
3.3 Lower Bound for Non-clairvoyant Algorithms

In this subsection, we prove a lower bound on the competitive ratio of any deterministic non-clairvoyant algorithm. In particular, this lower bound matches asymptotically the upper bound of N-Equi for batched parallel jobs [27], hence suggesting that N-Equi is asymptotically optimal in the batched setting.

Theorem 5 Any deterministic non-clairvoyant algorithm is Ω(ln^{1/α} P)-competitive with respect to the total flow time plus energy, where P is the total number of processors.

Proof. Consider a job set J consisting of a single job with constant parallelism h and work w, where 1 ≤ h ≤ P and w > 0. For any non-clairvoyant algorithm A, we can assume without loss of generality that it allocates all P processors to the job with speeds satisfying s_1 ≥ s_2 ≥ ... ≥ s_P ≥ 0, which do not change throughout the execution. Let u = Σ_{j=1}^P s_j^α denote the power of A at any time. The flow time plus energy of J scheduled by A is G_A(J) = (1+u) · w / Σ_{j=1}^h s_j. The optimal offline scheduler, knowing the parallelism h, will allocate exactly h processors of speed (1/((α−1)h))^{1/α}, thus incurring flow time plus energy of G*(J) = α/(α−1)^{1−1/α} · w/h^{1−1/α}. The competitive ratio of A is

G_A(J)/G*(J) = (α−1)^{1−1/α}/α · (1+u) · h^{1−1/α} / Σ_{j=1}^h s_j.

The adversary will choose parallelism h to maximize this ratio, i.e., to find max_{1≤h≤P} G_A(J)/G*(J), while the online algorithm A chooses (s_1, ..., s_P) to minimize max_{1≤h≤P} G_A(J)/G*(J) regardless of the choice of h. According to Lemma 6, max_{1≤h≤P} h^{1−1/α}/Σ_{j=1}^h s_j is minimized when h^{1−1/α}/Σ_{j=1}^h s_j = (h−1)^{1−1/α}/Σ_{j=1}^{h−1} s_j for h = 2, ..., P. Hence, the best non-clairvoyant algorithm will set s_j = (j^{1−1/α} − (j−1)^{1−1/α}) s_1 for j = 1, 2, ..., P. Since j^{1−1/α} − (j−1)^{1−1/α} ≥ (1−1/α)/j^{1/α}, we have s_j ≥ (1−1/α) s_1 / j^{1/α}. Substituting these into u = Σ_{j=1}^P (s_j)^α, we get s_1 ≤ α u^{1/α} / ((α−1) H_P^{1/α}). The competitive ratio of any non-clairvoyant algorithm then satisfies

G_A(J)/G*(J) ≥ (α−1)^{1−1/α}(1+u)/(α s_1) ≥ (α−1)^{2−1/α}/α² · (1+u) H_P^{1/α} / u^{1/α} ≥ (α−1)/α · H_P^{1/α}.
The last inequality holds because (1+u)/u^{1/α} is minimized when u = 1/(α−1). Since H_P = Ω(ln P), the theorem is proved.

Lemma 6 For any P ≥ 1, α > 1 and b > 0, subject to the conditions that Σ_{j=1}^P s_j^α = b and s_1 ≥ s_2 ≥ ... ≥ s_P ≥ 0, max_{1≤h≤P} h^{1−1/α}/Σ_{j=1}^h s_j is minimized when (s_1, s_2, ..., s_P) satisfy h^{1−1/α}/Σ_{j=1}^h s_j = (h−1)^{1−1/α}/Σ_{j=1}^{h−1} s_j for all h = 2, ..., P.

Proof. The proof is in Appendix A.

3.4 Semi-clairvoyant Algorithm: U-CEQ

We now present our semi-clairvoyant scheduling algorithm U-Ceq (Uniform Conservative Equi) and analyze its total flow time plus energy. In particular, we show that semi-clairvoyance makes a big difference to the performance of an online algorithm by proving that U-Ceq is O(1)-competitive. As shown in Algorithm 2, U-Ceq at any time t works similarly to N-Equi in terms of processor allocation, except that it never allocates more processors than a job's instantaneous parallelism h_i^k. Moreover, the speed of all processors allocated to a job by U-Ceq is set in a uniform manner.

Algorithm 2 U-Ceq (at any time t)
1: allocate a_i(t) = min{h_i^k, P/n_t} processors to each active job J_i,
2: set the speed of all processors allocated to job J_i as s_i(t) = (1/((α−1) a_i(t)))^{1/α}.

Again, we say that active job J_i is satisfied at time t if a_i(t) = h_i^k, and that it is deprived if a_i(t) < h_i^k. We can see that job J_i at time t scheduled by U-Ceq has execution rate Γ_i^k(a_i(t)) = a_i(t)^{1−1/α}/(α−1)^{1/α} and consumes power u_i(t) = 1/(α−1). Therefore, the overall power consumption is u_t = n_t/(α−1). Since there is no energy waste, we will show that this execution rate is sufficient to ensure the competitive performance of the U-Ceq algorithm.

Theorem 7 U-Ceq is O(1)-competitive with respect to the total flow time plus energy for any set of parallel jobs.

Proof. As with N-Equi, we prove the O(1)-competitiveness of U-Ceq using the amortized local competitiveness argument with the same potential function given in Eq.
(1), but η is now set to η = η′/P^{1−1/α} with η′ = 2α²/(α−1)^{1−1/α}. Apparently, the boundary, arrival and completion conditions hold regardless of the scheduling algorithm. We need only show that the execution of any job set under U-Ceq (UC for short) satisfies the running condition dG_UC(J(t))/dt + dΦ(t)/dt ≤ c_1 · dG*(J*(t))/dt + c_2 · dG_1*(J(t))/dt, where c_1 and c_2 are both constants with respect to P.

Following the proof of Theorem 3, we have dG_UC(J(t))/dt = α/(α−1) · n_t, dG*(J*(t))/dt = n_t* + Σ_{j=1}^P (s_j*)^α, and dG_1*(J(t))/dt ≥ α/(α−1)^{1−1/α} · Σ_{J_i∈J_S(t)} Γ_i^k(a_i(t))/(h_i^k)^{1−1/α} = α/(α−1) · (1−x_t) n_t. Moreover, the rate of change dΦ(t)/dt of the potential function Φ(t) at time t can be shown to satisfy

dΦ(t)/dt ≤ (η′/P^{1−1/α}) · ( − Σ_{i=1}^{n_t^D} i^{1−1/α} · Γ_i^k(a_i(t)) + n_t^{1−1/α} Σ_{j=1}^P s_j* + n_t* Σ_{i=1}^{n_t} (i^{1−1/α} − (i−1)^{1−1/α}) Γ_i^k(a_i(t)) )
≤ η′ ( − x_t²/(2(α−1)^{1/α}) · n_t + λ/α · Σ_{j=1}^P (s_j*)^α + (1−1/α)/λ^{1/(α−1)} · n_t + n_t*/(α−1)^{1/α} ),

where λ = 2^{α−1}(α−1)^{1−1/α}. Substituting these bounds into the running condition, we can see that in order to satisfy it for all values of x_t, we can set c_1 = max{2α²/(α−1), 2α^α} and c_2 = 2α, which are both constants in terms of P. Hence, the theorem is proved.

We can see that U-Ceq significantly improves upon any non-clairvoyant algorithm with respect to the total flow time plus energy, which is essentially a result of not wasting any