Table Of Content

Energy-Delay Tradeoffs in a Load-Balanced Router Matthew Andrews Lisa Zhang Bell Labs, Murray Hill, NJ Bell Labs, Murray Hill, NJ [email protected] [email protected] Abstract—The Load-Balanced Router architecture has re- mostoftheworkfocusesonoptimizinganindividualelement ceived a lot of attention because it does not requirecentralized in isolation [16], [14], [29], [23], [3], [5], [17], [18], [25], scheduling at the internal switch fabrics. In this paper we 3 [13]. A central question to both methods is to set the speed reexamine the architecture, motivated by its potential to turn 1 soastominimizeenergyusagewhilemaintainingadesirable off multiple components and thereby conserve energy in the 0 presence of low traffic. performance, e.g. latency or throughput. 2 We perform a detailed analysis of the queue and delay The Load-Balanced Router architecture has the potential n performanceofaLoad-BalancedRouterunderasimplerandom tohandlethetrafficinsuchawaythatportionsofthedevice a routingalgorithm. Wecalculate probabilisticboundsfor queue can be turned off in response to lightly loaded traffic. We J size and delay, and show that the probabilities drop exponen- 3 tially with increasingqueuesize or delay. We also demonstrate providewhatwebelieveisthefirstdetailedanalysisofqueue a tradeoff in energy consumption against the queue and delay size and delay in a Load-BalancedRouter, and we providea ] performance. tradeoff between energy consumption and queue size/delay. I Inorderto describeourresultsin moredetailwe nowgivea N I. INTRODUCTION brief description of the Load-Balanced Router architecture. . s The concept of a Load-Balanced Router was studied at c length at the beginning of last decade. See for example [6], [ A. Motivation for Traditional Load-Balanced Architecture [7],[22],[21],[9],[8],[15].Inthisworkweanalyzevarious 1 performance aspects of a Load-Balanced Router, motivated One ofthe mostfundamentalgoalsof anyrouterarchitec- v by the potential energy saving enabled by this architecture. tureis to achievestability, whichis sometimesreferredto as 6 Energy efficiency in networking has recently attracted a 100% throughput. In other words the router aims to process 7 4 large amount of attention. One of the main aims in much all the arriving traffic so long as no input and no output 0 of this work is captured by the slogan energy-follows-load, are inherently overloaded.The key difficulty with doing this 1. also known as energy-proportionality. In other words, we is that the arriving traffic may be highly non-uniform, i.e. 0 wishtomakesurethattheenergyconsumedbyanetworking if Aik is the arrival rate for traffic going from input i to 3 devicematchesthe amountoftraffic thatthe deviceneedsto output k, we will typically have Aik 6= Ai′k′ for ik 6= i′k′. 1 carry.This is in contrastto moretraditionalarchitecturesfor Early work on switching considered a crossbar architecture : v whichthe deviceoperatesatfullrate atalltimesevenif itis in which matchings between the inputs and the outputs are i lightly loaded. Indeed, by a conservative estimate in a study set up at every time step. It was shown in [24] that the X conducted by the Department of Energy in 2008, at least MaximumWeightMatchingalgorithm(withweightsequalto r a 40% of the total consumption by network elements such as the backlog for each input-output pair) can ensure stability. switches and routers can be saved if energy proportionality Subsequentpaperslookedatsimplificationsofthisscheduler isachieved.Thistranslatestoasavingof24billionkWhper that could still achieve stability. year attributed to data networking [1]. A recent study [12] However, a major drawback of all these approaches is further confirms that the power consumption of some state- that they require a centralized scheduler with information of-the-artcommercialroutersstayswithinasmallpercentage about the backlogs of data on each input-output pair. A ofthe peakpowerprofileregardlessof trafficfluctuation,for solution is the Load-Balanced Router that could make use example the significant daily variation in traffic load [26]. of randomized routing ideas first proposed by Valiant [28]. Variousapproacheshavebeenproposedinordertoachieve In the Load-Balanced Router there is a middle stage placed energy-proportionality. Speed scaling, also known as rate betweenthe inputnodesand the outputnodes.Each arriving adaptation,andpoweringdownaretwo popularmethodsfor packet is routed to a middle-stage node chosen at random. effectivelymatchingenergyconsumptionto traffic load. The After passing through the middle stage each packet is then former refers to setting the processing speed of a network forwarded to its designated output. In order to realize this elementaccordingto trafficload. Itistypicallyassumed that architecture we place a switching fabric between the input the energy consumption is superlinear with respect to the stage and middle stage and between the middle stage and operating rate. The latter refers to turning off the element at theoutputstage.Thebeautyofthisdesignisthattherandom certain times and so it either operatesat the full rate or zero routing ensures that for each of these switching fabrics no rate. Both methodsare the subjectof activeresearch, though complicatedschedulingisneeded.Allweneedtodoisrepeat a uniform schedule in which each connection is served at alpha/m beta/m least once [6]. 1 1 B. Energy Consideration for Revisiting Load-Balanced Ar- chitecture 1 1 Our motivation for revisiting the Load-Balanced Router architecture is that it provides an attractive framework for 1 1 studyingenergyproportionality[2].Forexample,thenumber of active nodes in the middle stage can be reduced in the 1 1 presence of light traffic, and increased with heavier traffic. The switch fabric between the input and middle stage and between the middle stage and the outputcan be functionally viewed as full meshes, as indicated in Figure 1. One possi- Input Middle stage Output bility is to implement the mesh with round-robin crossbars. For example, Keslassy’s thesis [22] assumes that the input, Fig.1. A4×6×4Load-BalancedRouterarchitecture,wherethenumber ofnodesinthemiddlestagecanbedifferentfromthenumberofinputsand output and the middle stage are all of size n and each outputs. fabric is an n×n crossbar that operates in a time-slotted fashion. At each time slot t the first fabric connects input be active in the middle stage and that such a configuration node i to middle node (i+t) mod n and the second fabric requires power w(m′) for some function w(·). connectsmiddlenodej tooutputnode(j+t) mod n.Inthis implementation, if the number of active nodes in the middle When an ik packet p arrives we choose a random middle stage isreducedthenthe crossbarcaneitherslowdownorbe nodej amongtheactivenodesinthemiddlestageandplace turned off periodically. p in bufferat nodeij. 1 The link ij operatescontinuouslyat Adjusting the size of the middle stage or the speed of rate α/m and transmits packets in the ij buffer in a FIFO the switching fabric requires considerable technological and manner. Similarly, when p arrives at node j in the middle engineeringchallenges.Inthisnotewedonotaimtoaddress stage, it is placed in the buffer jk. The link jk continuously these issues. Our focusis on analyzingqueue size and delay operates at rate β/m and transmits packets in the jk buffer in a FIFO manner. As we can see the scheduling for both given the active portion of the middle stage, which leads to stages requires no centralized intelligence and is extremely a tradeoff between power consumption and the size of the simple. active middle stage. Lastly we describe our traffic model. As is common in C. Model and Definition work on scheduling in routers, we assume that packets are WeformallydefinetheLoad-BalancedRouterarchitecture of unit size (or else are partitioned into cells of unit size). as follows. The router has n inputs and n outputs. We For any s,t let A (s,t) be the amount of ik traffic arriving ik normalizethelineratesuchthatitequals1oneachinputand at the router in the time interval [s,t) and let A (s,t) = i outputlink.Wealsohaveamiddlestagelyinginbetweenthe A (s,t) and A (s,t) = A (s,t). We assume that k ik k i ik inputs and outputs that consists of m nodes. The traditional the ik traffic, the input i traffic and the output k traffic are P P Load-BalancedRouterhasm=n.However,herewetreatm (σ ,r ), (σ,1) and (σ,1−ε) constrained respectively for ik ik as a separateparameter.Inbetweenthe inputand the middle some burst parameters σ and σ, for some rate parameters ik stage we have an n×m mesh. Effectively each link in this r andforsomeloadparameterε.Inotherwordsweassume ik meshoperatesatrateα/mforaspeedupofα≥1.Similarly, that, inbetweenthemiddleandtheoutputstagewehaveanm×n meshwithlinkrateβ/mforaspeedupfactorofβ ≥1.Each Aik(s,t) ≤ σik+rik(t−s) of the two meshes is input-buffered,i.e. each input node has A (s,t) ≤ σ+(t−s) i aseparatebufferforeachmiddlenodeandeachmiddlenode A (s,t) ≤ σ+(1−ε)(t−s). k has a separate buffer for each output node. See Figure 1. From now on, we use i to index the input, j the middle We remark that the arrival rates r will typically vary ik stage and k the output. We refer to the packets that wish to over time. Indeed, the energy savings that we hope to gain go from input i to output k as ik packets. Similarly, link ij come precisely from the fact that we can match the number connects input i and node j in the middle stage and link jk ofactivecomponentstothetraffic.However,we assumethat connects node j in the middle stage and output k. Buffer ij this happens over a slow timescale and so we perform our (resp. jk) at node i (resp. j) bufferspackets that are waiting scheduling analysis as if the rates r are fixed. ik to traverse link ij (resp. jk). The key to possible energysavings is that we assume that 1Notethatroundrobinisanotherpossibilityforchoosingamiddle-stage not all nodes need to be active at periods of low load. We node.However,arandomchoiceismorerobustagainstadversarialtypesof suppose that at any time we can choose m′ ≤ m nodes to trafficarrivals. Wedonotgointodetails here. D. Results forsomeα>1,thisimpliesthattheijlinkisnotoverloaded. A similar argument applies to each link jk between the • In Section II we make the simple statement that ⌈r¯m⌉ middle and output stages activemiddle-stagenodessufficeforhandlingthetraffic load where r¯≥ r for all k and r¯≥ r for all i ik k ik III. OVERVIEWOF DELAY ANALYSIS i. This in turn implies that the energy required to serve P P Before diving into details of the delay analysis, we first the traffic over the long-term is w(⌈r¯m⌉). provide a high level overview of our techniques. • In SectionsIII to IV-Ewe presenta probabilisticanaly- sis that bounds the queue sizes at the input and middle A. Relationship with Stochastic Network Calculus stages and bounds the delay experienced by packets as We use a variant of network calculus [10], [11], [4] theytravelfromtherouterinputtotherouteroutput.We sometimes referred to as stochastic network calculus. The first bound the probability for a queue size to exceed original form of network calculus derives delay bounds by a certain amount q, and the delay to exceed a certain imposingupperboundsontheamountoftrafficarrivingata amount of d, assuming a fixed sized number of middle node via arrival curves and lower bounds on the amount of stages. An important feature of our bounds is that they traffic served by a node via service curves. By relatingthese decrease exponentially with q and d. For a fixed traffic twocurveswecanbothobtainaboundonthedelaysuffered load, we then derive a trade off between the queue bydataatanetworkelementandalsocharacterizethearrival size/delayperformanceandthenumberofactivemiddle- curves for the data at any downstream nodes. However, in stage nodes which in turn gives a energy-delay and the traditionalnetwork calculusall such boundsare required energy-queuetradeoff. to hold with probability 1. In our context this will lead to Leavingenergyminimizationaside, we believethatthis extremely weak boundssince there is a non-zeroprobability is the first detailed analysis of the delay and queue thattherouterwillsendalargenumberofpacketstoasingle performanceof a Load-BalancedRouter, which may be middle-stage node, thus condemning them all to extremely interesting in its own right. poor service. • In Section V we present some numerical examples to An alternative therefore is to use a stochastic network validate our analytical findings. calculusinwhichweonlywishforboundsonservicetohold We note that ourapproachis differentfrom the traditional with high probability. A detailed formulation of a stochastic powerdown and rate adaptation techniques since we will be network calculus was outlined in a series of papers by Jiang directingtrafficinsuchawaythatenablessomecomponents and others [20], [19]. At a high level Jiang’s approach to be off. In other words for the middle stage nodes we are obtains curves that bound the probability that the delay not trying to match service rate to a traffic process that is exceedsa certainamountatupstreamelements,andthenuse exogenous. We are trying to match the active middle stage these curves to bound the worst-case arrivals at downstream nodestoa trafficprocessthatisunderourcontroldueto our elements. However, we follow a slightly different approach ability to route within the router. since we are able to obtain better bounds by not directly We also remark that our bounds could be used to govern utilizing the service curves at the input nodes to bound how many middle nodes are active in a Load-Balanced the arrivals at the middle-stage nodes. We instead base our Router without necessarily computing all the bounds on the calculations at the middle stage on the external arrivals of fly. We could instead precompute the bounds and create a various ensembles of flows that might then be time-shifted simplelook-uptablethatdetermineshowmanymiddlestage due to delays at the input. This allows us to avoid handling nodes should be active based on measurements of the load complicated convolutions of arrival and service curves. We at the inputs and outputs. elaborate further on this distinction later. II. THROUGHPUT ANALYSIS B. Our Approach Recall that r represents the current arrival rate of traffic We divide our analysis into a series of pieces. ik that wishes to be routed from input i to output k. Let r = a) Bound the queue build-up at the input: Recall that i r and let r = r . Let r¯ be such that r ≤ r¯ and betweeninputi andmiddle-stagenodej we effectivelyhave k ik k i ik i r ≤r¯for all i,k. alinkwithspeedα/m.Alsorecallthattheinputhasabuffer k P P especiallydedicatedtothetrafficthatwishestogofrominput Lemma1. Ifweusem′middlestageelementsthentherouter i to middle-stage node j. Suppose that at some time t this is not overloaded if m′ ≥r¯m. Hence the power required to buffer has level q. Let s be the last time that this buffer was serve all the traffic in the long-term is at most w(⌈r¯m⌉). empty. Note that during the time interval [s,t) the ij link Proof: Follows from the fact that if we turn on m′ serveddata oftotalsize α(t−s)/m. Thereforethe totaldata middle stage elements then for each middle element j, thatarrivedforlinkiduringthetimeinterval[s,t)isatleast 1 ≤ j ≤ m′, the traffic rate that will be routed on link q+(α(t−s)/m). ij will be at most r¯/m′ which by assumption is at most Therefore the probability that link i has a backlog of size (m′/m)/m′ =1/m.Sincethecapacityonthelinkijisα/m q attime t isupperboundedbytheprobabilitythatforsome s ≤ t, the amount of ik data arriving at link i during the approximation will typically not have a large effect for interval [s,t) is at least q+(α(t−s)/m). However, recall larger values of m since the ijk traffic only forms a thatthetrafficarrivingtoinputiarrivesatrateatmostr¯and 1/m fraction of the total kj. However, in order to get each packetis sent to a middle-stagenode chosen uniformly a more accurate bound, in Section IV-E we present a at random. Hence, for fixed s and t, we can use a Chernoff more detailed expression that deals with the correlation bound to bound the probability that the amount of ik data explicitly. arrivingduringtheinterval[s,t)isatleastq+(α(t−s)/m). • Our analysis is based on Chernoff bounds. However, Itiseasytoshowthatthisprobabilitydecreasesexponentially the form of the Chernoff bound that we use changes insandsowecanuseaunionboundtoboundtheprobability depending on whether the bound on data arrivals that that this occurs for any s≤t. we are considering for a particular link is close to or b) Bound the delay experienced at the input: Translat- far from the expected number of data arrivals for that ingourboundonqueuesizeintoaboundondelayissimple. link. This unfortunately leads to a somewhat involved Since the transmission rate on the ij link is α/m, the event case analysis. that the head of line packet for the ij link at time t has d) Bound the delay experienced at the middle stage: experienced delay d, implies the event that at time t − d, The conversion of the queue-size bound to a delay bound the queue size for the ij link was at least dα/m. An upper at the middle stage can be done in the exact same way as bound on the probability of the latter event can be derived for the input. Now that we have an expression for the delay using the method described earlier. at both stages we can convert it into an expression for the c) Bound the queue build-up at middle stage: We now end-to-end delay from the inputs to the outputs. giveanoverviewofhowweperformtheanalysisatthehead IV. ANALYTICAL BOUNDS ON DELAY of the jk queue at middle element j. This forms the crux of our analysis since we are in a more complicated situation We now present the details of our delay analysis. We rely due to the fact that the packet arrivals at the middle stage heavily on the following Chernoff bounds[27]. In particular are affected by how they are served at the input. As before we use (2) to derive analytical bounds and we use (1) to note that if the queue for link jk has size q at time t, then derive tighter bounds for numerical simulation. for some time s ≤ t, the arrivals for link jk at middle- Theorem 2 (Chernoff Bound). Let X ,...,X be indepen- 1 n stage node j during the time interval [s,t) must be at least dent binary random variables, and let µ be an upper bound q+(β(t−s)/m). Now suppose that the oldest of this data on the expectation E[ X ]. For all δ >0, i i arrived at the input at time s−d, and suppose in addition that this data arrived at middle-stage node j on link ij. We P eδ µ Pr[ X ≥(1+δ)µ] ≤ (1) can therefore state that the total amount of data arriving at i (1+δ)1+δ i (cid:18) (cid:19) the system in the time interval [s−d,t) that is destined for X ≤ e−min(δ2,δ)·µ/3 (2) linkjk isatleastq+(β(t−s)/m)andthedelayexperienced by data arriving at link ik at time s−d is at least d. Via Inwhatfollowswesometimesrefertoµastheaggregated unionboundswecancalculatetheprobabilitythatthisoccurs mean and δ as the excess factor. For conciseness we shall for any i,s,d. In particular, for d small we can say that the often use the aggregated mean even though µ is strictly probabilitythatthearrivalsexceedq+(β(t−s)/m)issmall speaking a bound on the mean. whereas if d is large then the probability that the link ik A. Input stage analysis delay is at least d is small. We make three points about this analysis at the middle We begin by computingthe queuedistribution at the head stage. of link ij, where i ∈ [1,n] is an input and j ∈ [1,m] is in the middle stage. As in Section II we assume that r ≤ r¯ • This is where our analysis deviates slightly from the and r ≤r¯for all i,k and we assume that m′ middleistage traditionalmethodologyofnetworkcalculus.We donot k elements are currently active. Initially, in order to keep the calculate a service curve for the input and use that formulas manageable, we shall derive our formulas for the service curve to bound the arrivals at the middle stage. case in which r¯=1 and m′ =m. We shallalso assume that Instead we analyze the delay behavior at the input and thearrivingtrafficissmoothandsowe donothavetheburst then use that calculation to bound the middle stage terms σ and σ . Later on, we shall show how to adapt the queuesizeusinganexpressionthatstillinvolvesarrivals i k formulas when these assumptions do not hold. at the input. Let A (t) be the binary random variable that indicates • Our initial analysis makes a slight approximating as- ijk whether a packet with input i output k is mapped to sumption in that for the ijk defined above,we treat the middle stage j at time t. We use A (t ,t ) to denote arriving traffic for link ij and the delay behavior on ijk 1 2 t2 A (t), the total arrival in the duration (t ,t ]. Let link ij as independent. In reality of course they will t=t1 ijk 1 2 be correlated due to traffic that passes from input i to QP(i,1j)(t)betherandomvariableforthequeuesizeatthehead output k through middle-stage node j. However, this oflink ij attime t. To computeQ(1)(t), let usassumes<t i,j bethelasttimethatthequeueatijisempty.ForQ(1)(t)tobe B. Middle stage analysis i,j largerthanavalueq, thearrivalA (s,t)overallk∈[1,n] ijk Wenowcomputethequeuedistributionattheheadoflink mustbe at least α·t−s+q since link ij is operatedat a rate m jk, where j ∈ [1,m] is in the middle stage and k ∈ [1,n] α. Formally, m is an output. Let Q(2)(t) be defined similarly as Q(1)(t). To jk ij Pr[Q(ij1)(t)≥q]≤ Pr" n Aijk(s,t)≥α· tm−s +q# (3bq)ouueunedQQ((j22k))waatsteimmeptty,.FleotrsQ(≤2)(tt)bteotbheelalargstertitmhaentqh,atthtehree Xs≤t kX=1 jk jk We now bound the right-hand-side of (3). Since Aijk(t) are must be at least (t−ms)β +q distinct packets in Q(j2k) during independentbinaryvariables,weapplytheChernoffboundto the time period [s,t]. Further, let s−d be the earliest time everyterminthesummation.Thefollowingistheaggregated one of these packets arrived at an input, say i. This packet mean µ and the excess factor δ for the above probability. must experience a delay of at least d in Q(1). Therefore, ij µ = t−s δ = αm−1+ qm , Pr[Q(j2k)(t)≥q] (cid:26) t−s since α· t−s +q =µ(1+δ). If α≥2, we have δ ≥1 and ≤ Pr A (s−d,t)≥ (t−s)β +q m ijk we use (2) to derive, " m # d s≤t i XX X Pr[Q(ij1)(t)≥q]≤t−∞s=0e−t3−ms(α−1)−q3 ≤ 1−ee−−3qα3−m1. (4) ne−3mdα ·Pr[∃i, Di(j1)(s)≥d] X ≤ If α < 2, we have δ ≥ 1 when t−s ≤ 2q−mα. Otherwise d 1−e−α3−m1 δ <1. We apply (2) as follows. X (t−s)β Pr[Q(1)(t)≥q] · Pr Aijk(s−d,t)≥ m +q ij " # s≤t i qm X X 2−α ∞ ≤ e−t3−ms(α−1)−q3 + e−t3−ms(α−1)2 Note thatthe bound(6) on the delaydistributionatthe input t−Xs=0 t−sX=2q−mα smtaogvee iPs rin[∃die,peDnd(1e)n(t(so)f)th≥e tdim] etoinoduetxs.idWe ethceansutmhemreaftioorne e−q3 e−(3α(2−−1α)2)q indexed by time iinj the second inequality above. ≤ + (5) 1−e−α3−m1 1−e−(α3m−1)2 We proceed to bound s≤tPr[ iAijk(s−d,t) ≥ Note thatbothexpressionsare exponentiallydecreasingin (t−s)β +q based on the following expressions for the m P P q. For the time being we shall proceedaccordingto the case expectationiµ and the excess factor δ. α ≥ 2 since this leads to more manageable formulas. Later on, we shall indicate where to adapt the formulas when we µ = t−s+d m , are in a scenario where α<2. δ = (t−s)(β−1)+qm−d Let D(1)(t) be the maximum delay that some packet has ( t−s+d ij experienced at time t in the queue Q(ij1)(t). For this delay since(1+δ)µ= (t−s)β+q.Therearetwocasestoconsider, to be more than d at time t, some packet must be in the m β ≤2 or β >2. queue at time t−d. Since link ij operates at rate α in a m For β ≤ 2, we further consider the following subcases, FIFOmanner,thequeueattimet−dmustbeatleastdα/m. depending on qm −1, the value of δ when t−s =0. Note Therefore, d that as t−s increases, the value of δ approaches β−1. −dα Pr[Di(j1)(t)≥d]≤Pr[Q(ij1)(t−d)≥dα/m]≤ 1−ee3−mα3−m1. • 1Ca≤seδ.1aS:in1ce≤δµqdm=−(t1−,sa)(nβd−t1)−+qsm≤−dq,m2b−−oβu2dn.dI(n2)thiimspclaisees m By a union bound, (8). Pr[∃i, D(1)(t)≥d]≤n· e−3mdα . (6) • cCaasseeβ1b−:11≤≤δq≤dm1−, w1h,iacnhdimqpm2l−−ieβ2sdδ2≤µt≥−(βs.−In1)t2hµis. ij 1−e−α3−m1 Bound (2) in turn implies (9). Recall that the above analysis was performed in the • Case2:β−1≤ qm−1≤1,andforallvaluesoft−s. absence of the burst term σ. If we do have bursty traffic In this case, β−1d≤δ ≤1, which is the same situation then we can adjust the formulas by making the following as 1b and implies (10) in the same way. changes to the aggregated mean and the excess factor and • Case 3a: qm −1 ≤ β −1, d(β+1)−2qm ≤ t−s and propagating these changes through the resulting formulas. d β−1 d(β+1)−2qm ≤0. In this case β−1 ≤δ ≤β−1 for all µ = t−ms+σ valuβes−1oft−s.Sinceδ2µ≥ (β−21)2µ,bound(2)implies δ = (t−s)(α−1)+qm−σ 4 (cid:26) t−s+σ (11). • Case3b: qm−1≤β−1,and0< d(β+1)−2qm ≤t−s. Note that every term above decreases exponentiallywith the d β−1 In this case β−1 ≤δ ≤β−1 for t−s≥ d(β+1)−2qm. queue size q. 2 β−1 Since δ2µ≥ (β−1)2µ, bound (2) implies (12). The case in which β >2 is simpler. We omit the analysis 4 forspaceconsideration. Recallthatalloftheaboveformulas • Case 3c: qdm−1≤β−1, andt−s< d(β+β1−)−12qm. We are for the case α≥2. If α<2 then we need to replace all trivially upper bound the probability by 1 as in (13). the factors (4), −dα e 3m Pr[Q(2)(t)≥q] 1−e−α3−m1 jk (t−s)β with (5) ≤ Pr[ A(ij1k)(s−d,t)≥ m +q] (7) e−3dmα e−3dmα((2α−−α1))2 Xd Xs≤t Xi ·Pr[∃i, D(1)((s))≥d] 1−e−α3−m1 + 1−e−(α3−m1)2 . ij qm qm−2d If output k has a burst term σ then as in the input stage 1a ≤ d=20 1−nee−−3dmαα3−m1!t−2−s=β0e−13(t−s)(β−m1)+qm−d +(8) wexececsasnfraecfltoercttoth,is byadjusting the aggregatedmean and the X X 1b Xdq=2m0 1−nee−−3dmαα3−m1!t−s=X∞qm2−−β2de−13(β−1)2t−ms+d +(9) We conc(ludδµe th==is se(tc−tt−isos+mn)d(βt+b−−σys1+)b+do+qumσn−didn−gσthe. delay distribu- qm 2 dX=βq2m 1−nee−−3dmαα3−m1!t−X∞s=0e−31(β−1)2t−ms+d +(10) tFsiooomrntehatpisatchdkeeelamtyhidatosdlbeeexpsmtearogireeen.ctDheadj(n2ka)dtbtaiemttteihmeteimnt,atxshoiemmqueumepuadecekQleat(jy2km)t(hutas)tt. 2qm 3a dX=β+q1βm 1−nee−−3dmαα3−m1!t−X∞s=0e−31(β−41)2t−ms+d +(11) FHbeIeFniOncemthPaenr[nqDeurej(,2kut)eh(eta)qt≥utiemdu]ee≤attP−timrd[eQ. tS(j2−ki)n(dctem−luidns)kt≥bjekdaβto/lpemearsa]t.tedsβ/inma. ∞ ne−3dmα We stress again that all of the above formulas are for the 3b d=X2βq+m1 1−e−α3−m1! (c4a)sewαith≥fa2c.tIofrsα(5<).2 then we need to replace all the factors ∞ e−13(β−41)2 t−ms+d + (12) C. Eventual end-to-end delay t−s=dX(β+β1−)−12qm Now that we have delay bounds for the two stages of the router we can obtain a boundon the end-to-enddelay distri- d(β+1)−2qm ∞ ne−3dmα β−1 bution.Inthefollowingweletgm,β(q)beashorthandforthe 3c d=X2βq+m1 1−e−α3−m1! t−Xs=0 1 (13) ufpper(qb)oausnhdotrhthaatnwdefohraovuerdueprpiveredboounndPorn[QP(jr2k[)Q(t()1)≥(t)q]≥aqn]d. m,α ij Supposethatanijkpacketisstilltraversingtherouterattime n e−q3 1 t, but it arrived at the router before time t−d. It is not hard 1a ≤ + (cid:18)1−e−α3−m1(cid:19)(cid:18)1−e−β3−m1(cid:19)(cid:18)1−e−α3−m1(cid:19) to see that either it is still waiting to traverse the inputstage n e−31(β−1)2q 1 at time t−d2, or it arrivedat the middle stage by time t−d2. 1b +Hence the probability that the end-to-end delay is at least d (cid:18)1−e−α3−m1(cid:19) 1−e−(β3−m1)2 ! 1−e−α−(3βm−1)2 ! is at most, ne−α6q e−31(β−1)2q 1 d d d 2 (cid:18)1−e−α3−m1(cid:19) 1−e−(β3−m1)2 ! 1−e−α−(3βm−1)2 !+ Pr[Di(j1)(t− 2)≥ 2]+Pr[Dj(2k)(t)≥ 2] dα dβ ne−3αβq e−121β(β−1)2q 1 ≤ fm,α( )+gm,β( ). 3a + 2m 2m 1−e−α3−m1! 1−e−(β1−2m1)2 ! 1−e−4α+1(2βm−1)2 !D. Characterizing the tradeoff with the number of middle ne−3(β2α+1)q e−61(β−1)2q 1 elements 3b + 1−e−α3−m1 ! 1−e−(β1−2m1)2 !(cid:18)1−e−2α+6βm(β−1)(cid:19) In the above delay analysis we made a number of simplifying assumptions to keep the notation manageable. For 3c (β+1)ne−3αm 1 example we held m′ = m and r¯= 1. We now demonstrate (β−1)(1−e−α3−m1)!(cid:18)(1−e−3αm)2(cid:19) that we can use the above formulas to handle the case of 2qm e−3αm(2βq+m1−1)− 2qm −1 e−3αm(2βq+m1)arbitrary m′ and r¯. This in turn allows to characterize the β+1 β+1 tradeoff between energy consumption and end-to-end delay. (cid:18)(cid:18) (cid:19) (cid:18) (cid:19) (cid:19) The main idea is to scale time so that the arrival rate at the the queue Q(1) at time s− d or it must be waiting in the ij 2 inputs is scaled to 1. We also adjust the link rates on the queueQ(2) attimes−d. Inthe formercase thepacketmust jk 2 two stages of the mesh. In particular, if we wish to analyze have experienced delay of d in Q(1) at time s− d and in a system with a given m, m′, r¯, α and β, we define a new 2 ij 2 system characterized by mˆ, mˆ′, rˆ¯, αˆ and βˆ in which we set, the latter case it must have experienced delay of d2 in Q(j2k) at time s. Therefore, αm′ βm′ mˆ′ =mˆ =m rˆ¯=1 αˆ = βˆ= mr¯ mr¯ Pr[Q(3)(t)≥q] k Then it is not hardto see that if we scale time by a factor r¯, thenewsystemhasexactlythesamebehaviorastheoldone. ≤ Pr A (s−d,t)≥t−s+q · ijk However,wearenowworkinginasystemwithmˆ′ =mˆ and d s≤t  ij  XX X r¯=1.Hencewecanapplytheanalysisthatwehavealready  d d  d derived. Pr[∃ij,Di(j1)(s− 2)≥ 2]+ Pr[∃j,Dj(2k)(s)≥ 2] Our main result is thus, (cid:18) (cid:19) dα dβ 1 ≤ nmf ( )·mg ( )· (d−q+σ ). Theorem 3. If we run the router with m′ middle stage m,α 2m m,β 2m ε k elements then it uses energy w(m′) and the probability that Xd the end-to-end delay is at least d is bounded by, In addition,as before,we can convertthis bound to a bound ondelayforthethirdstage.LetD(3) bethe maximumdelay dα dβ k fm′,αmmr¯′(2mr¯)+gm′,βmmr¯′(2mr¯). Qtha(3t)(sto)m. Feopratchkiestdhealasyetxopebreienmcoerdeathtatnimdeatt itnimtehet,qsuoemuee E. Dealing with dependence k packetmustbe in the queueat time t−d. Since link k oper- Inouranalysisofthemiddlestagewe implicitlymadethe atesinaFIFOmanner,thequeueattimet−dmustbeatleast simplifying assumption that i′A(i′1j)k(s−d,t) is indepen- d(1−ε). Hence Pr[Dk(3)(t)≥d]≤Pr[Q(k3)(t−d)≥d(1− adrernivtaflrsomforDthije(sp)a.tHhoiwjkevweril,ltPhaifsfeicstnbootthstrictlyAt(r1u)e(ssin−cedt,hte) ε)]≤ d′nmfm,α(d2′mα)·mgm,β(d2′mβ)·1ε(d′−d(1−ε)+σk). i′ i′jk as well as Dij(s). Since the ijk flow representsonly a 1/m P V. NUMERICAL RESULTS P fraction of the traffic that contributes to D (s), this will ij In this section we show numerical examples of the queue typically have a negligible effect on the eventual results. bounds. In particular for these calculations we use formulas However,ifwerequiredatrueupperboundontheprobability that are similar to those derived in Section IV but we use distribution of Q(2) we must use the following adaptation jk Chernoff bounds of the form (1) rather than (2) since the of the formula, which for each ˆı, conditions the event former are tighter. D(1)(s) ≥ d on whether or not A(1)(s − d,t) is greater ˆıj ˆıjk Figure2plotsthelogarithmoftheprobabilityofamiddle- than 5(t−s)β. More formally, mn stage queue jk exceeding q packets against the queue size q. For this instance, the router is 20×80×20. The traffic is Pr[Q(2)(t)≥q] fully loaded for each of the inputs and outputs. We vary the jk speedup α=β in the range of 2,3,4 and 5. As we can see 5(t−s)β ≤ Pr[A(1)(s−d,t)≥ ]+ from the figure the logarithm of the probability decreases ˆıjk mn linearly with the queue size, which means the probability d s≤t ˆı XX X decreasesexponentiallywiththequeue.Asexpected,wecan (t−s)β 5(t−s)β Pr[ A(1)(s−d,t)≥ +q− ]×also see that with increasing speedup, the probability of the ijk m mn ˆı i6=ˆı queue size exceeding q drops. X X 5(t−s)β Figure 3 demonstrates the tradeoff between the queue Pr[D(1)((s))≥d|A(1)(s−d,t)≤ ] ˆıj ˆıjk mn size and the number of active middle-stage nodes. For this (cid:19) instance, the router is again 20×80×20. The link rate of F. Queues at output the interconnectis set to 1/20.The traffic is fully loadedfor We conclude this section by explaining how the analysis each of the inputsand outputs. If we keep all m=80 nodes can be extended if we also wish to bound the delay on the inthemiddlestageactive,wecansee fromthebottomcurve output link from the router. Let Q(3)(t) be the queue at the k of Figure3 thatthe queuesize is the smallest. However,this head of the output link and recall that this link has speed 1. option is also the most energy consuming as all 80 middle- To bound Q(k3)(t) at time t, let s ≤ t be the last time that stage nodes are kept active. For the other extreme, we can queue Q(3) was empty. For Q(3) to be larger than q, there activate 40 middle-stage nodes, which is the most energy k k must be at least t−s+q distinct packets in Q(3) during the efficient. However, we can see from the bottom curve of k time period [s,t]. Further, let s−d be the earliest time one Figure 3 that the queue size is the smallest. The curves in of these packetsarrivedatan input.Supposethatthe path of between correspond to the intermediate cases in which the this packetis ijk. Then the packet must either be waiting in number of active middle-stage nodes are 50, 60 and 70. 0 alpha=2 [3] N.Bansal,T.Kimbrel,andK.Pruhs. Speedscalingtomanageenergy alpha=3 alpha=4 andtemperature. Journal oftheACM,54(1),2007. −10 alpha=5 [4] J. Y. L. Boudec and P. Thiran. Network Calculus. Springer Verlag, y http://ica1www.epfl.ch/ PS files/NetCal.htm, 2004. bilit [5] H.-L. Chan, W.-T. Chan, T. W. Lam, L.-K. Lee, K.-S. Mak, and ba−20 P. W. H. Wong. Energy efficient online deadline scheduling. In o pr Proceedings ofACM-SIAMSODA,pages 795–804,2007. of [6] C.Chang,D.Lee,andY.Jou. LoadbalancedBirkhoff-vonNeumann hm −30 switches,partI:one-stagebuffering. InIEEEHPSR’01,Dallas,TX, arit 2001. og [7] C.Chang,D.Lee,andC.Lien.LoadbalancedBirkhoff-vonNeumann l −40 switches, part II: multi-stage buffering. Computer Comm., 25:623 – 634,2002. [8] C. Chang, D. Lee, and Y. Shih. Mailbox switch: a scalable two- −50 stage switch architecture for conflict resolution of ordered packets. 0 10 20 30 40 50 InProceedings ofIEEEINFOCOM2004,HongKong,2004. queue size Fig. 2. Log of probability against queue size, for n=20, m=80 and [9] C.Chang, D.Lee,andC.Yue. Providing guaranteed rateservices in theloadbalancedBirkhoff-vonNeumannswitches. InProceedingsof fullyloadedtraffic.Fromtoptobottom,thecurvescorrespondtoincreasing speedupfromα=β=2,3,4and5. IEEEINFOCOM2003,SanFrancisco, CA,2003. [10] R. L. Cruz. A calculus for network delay, Part I: Network elements in isolation. IEEE Transactions on Information Theory, 37(1):114 – 0 m=40 131,1991. m=50 [11] R. L.Cruz. A calculus for network delay, Part II: Network analysis. m=60 m=70 IEEETransactions onInformation Theory,37(1):132–141,1991. −10 m=80 [12] A. Francini, S. Fortune, T. Klein, and M. Ricca. Energy profiling of y abilit Dneotwcuomrkenetqsu,iApmlceantetl-fLorucraetnet,ad2a0p1t1a.tiontechnologies. InternalTechnical b−20 pro [13] A. Francini and D. Stiliadis. Energy efficiency with rate of adaptation. In Proc. of IEEE HPSR, 2010. http://ect.bell- m labs.com/who/francini/ra9.pdf. h−30 garit [[1154]] NM..MGa.rrIe.ttK.ePsolawsesryi,ngS.dTo.wCnh.uCanogm.muAn.lAoaCdM-b,a5la1n(c9e):d42s–w4i6tc,h20w08it.h an o l−40 arbitrary number of linecards. In Proceedings of IEEE INFOCOM 2004,HongKong,2004. [16] S. Irani and K. Pruhs. Algorithmic problems in power management. −50 SIGACTNews,36(2):63–76, 2005. 0 10 20 30 40 50 [17] S. Irani, S. K.Shukla, and R. Gupta. Algorithms for power savings. queue size ACMTransactions onAlgorithms,3(4),2007. Fig.3. Logofprobabilityagainstqueuesize,forsameamountoftotaltraffic [18] S.Irani,S.K.Shukla,andR.K.Gupta. Onlinestrategiesfordynamic butvarying numberofactive middle-stage nodes. Fromtopto bottom, the powermanagementinsystemswithmultiplepower-savingstates.ACM curves correspond toincreasing numberofactive middle-stage nodes from Trans.EmbeddedComput.Syst.,2(3):325–346, 2003. m=40,50,60,70,80. The numberofinputs andoutputs is n=20and [19] Y.Jiang. Abasicstochasticnetworkcalculus. InProceedingsofACM interconnect linkrateis1/20. SIGCOMM’06,pages123–134,NewYork,NY,2006. [20] Y. Jiang and P. J. Emstad. Analysis of stochastic service guarantees VI. CONCLUSION in communication networks: A server model. In In Proc. of the International Workshop on Quality of Service (IWQoS 2005, pages InthispaperwerevisittheLoad-BalancedRouterarchitec- 233–245,2005. ture, motivated by its potential of delivering energy propor- [21] I. Keslassy, S.-T. Chuang, D. M. K. Yu, M. Horowitz, and O. Sol- gaard. Scaling internet routers using optics. In Proceedings of ACM tionalityforrouters.Weofferadetailedanalysisonthequeue SIGCOMM’03,Karlsruhe, Germany,2003. lengths and packet delays under a simple random routing [22] I.KeslassyandN.McKeown. Maintaining packet orderintwo-stage algorithm which is robust against all admissible traffic. This switches. In Proceedings of IEEE INFOCOM 2002, New York, NY, June2002. allows us to observe a trade off between performances such [23] M. Li, B. J. Liu, and F. F. Yao. Min-energy voltage allocation for as queue size against energy consumption. tree-structured tasks. In Proceedings of COCOON, pages 283–296, Our paper does not focus on algorithms that optimize the 2005. [24] N. W. McKeown, V. Anantharam, and J. Walrand. Achieving 100% numberof active middle-stagenodes.We give a very simple throughput in an input-queued switch. In Proceedings of IEEE argument for which the size of the active middle stage is INFOCOM’96,pages 296–302,SanFrancisco, CA,March1996. proportionalto the maximum traffic load over all inputs and [25] S.Nedevschi, L.Popa,G.Iannaccone, S.Ratnasamy,andD.Wether- all. Reducing network energy consumption via sleeping and rate- outputs. It is an intriguing open question to see how one adaptation. InProceedings ofNSDI,pages 323–336,2008. could make sure that the size of the active middle stage is [26] M.Roughan,A.Greenber,C.Kalmanek,M.Rumescwicz,J.Yates,and proportional to the traffic average over all input, not to the Y.Zhang.Experienceinmeasuringinternetbackbonetrafficvariability. InProc.ACMSIGCOMMIMW,pages 91–92,2002. maximum. [27] C. Scheideler. Probabilistic Methods for Coordination Problems. Habilitation thesis,PaderbornUniversity, 2000. REFERENCES [28] L. G. Valiant. A scheme for fast parallel communication. SIAM J. Comput.,11(2):350–361,1982. [1] Routing telecom and data centers toward efficient energy use. In [29] F.F.Yao,A.Demers,andS.Shenker.Aschedulingmodelforreduced Proceedings of the Vision and Roadmap Workshop, U.S. Department CPUenergy. InProceedings ofIEEEFOCS,pages374–382,1995. ofEnergy,October2008. [2] S. Antonakopoulos, S. Fortune, A. Francini, T. Klein, R. McLellan, D.Nielson,andL.Zhang. Personalcommunication, 2011.