A local limit theorem for Quicksort key comparisons via multi-round smoothing

Béla Bollobás∗†, James Allen Fill‡§, Oliver Riordan¶

January 16, 2017

arXiv:1701.04365v1 [math.PR] 16 Jan 2017

∗Department of Pure Mathematics and Mathematical Statistics, Wilberforce Road, Cambridge CB3 0WB, UK and Department of Mathematical Sciences, University of Memphis, Memphis TN 38152, USA. E-mail: [email protected].
†Research supported in part by NSF grant DMS-1301614 and EU MULTIPLEX grant 317532.
‡Department of Applied Mathematics and Statistics, The Johns Hopkins University, Baltimore, MD 21218-2682, USA. E-mail: [email protected].
§Research supported by the Acheson J. Duncan Fund for the Advancement of Research in Statistics.
¶Mathematical Institute, University of Oxford, Radcliffe Observatory Quarter, Woodstock Road, Oxford OX2 6GG, UK. E-mail: [email protected].

Abstract

As proved by Régnier [11] and Rösler [13], the number of key comparisons required by the randomized sorting algorithm QuickSort to sort a list of $n$ distinct items (keys) satisfies a global distributional limit theorem. Fill and Janson [5, 6] proved results about the limiting distribution and the rate of convergence, and used these to prove a result part way towards a corresponding local limit theorem. In this paper we use a multi-round smoothing technique to prove the full local limit theorem.

1 Introduction

QuickSort, a basic sorting algorithm, may be described as follows. The input is a list, of length $n \ge 0$, of distinct real numbers (say). If $n = 0$ or $n = 1$, do nothing (the list is already sorted). Otherwise, pick an element of the list uniformly at random to use as the pivot, and compare each other element with the pivot. Recursively sort the two resulting sublists, and combine them in the obvious way, with the pivot in the middle. (Equivalently, one can sort the initial list randomly, and always use the first element in each (sub)list as the pivot.) The recursive calls to the algorithm lead to a tree, the execution tree, with one node for each call.
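The description of QuickSort above can be made concrete with a short program that sorts a list and counts key comparisons. The following Python sketch is only an illustration: the function name `quicksort_comparisons`, the seed and the sample size are assumptions introduced here, not part of the paper.

```python
import random


def quicksort_comparisons(items, rng):
    """Sort distinct items with randomized QuickSort.

    Returns (sorted_list, comparisons), counting one key comparison for
    each non-pivot element at every call.
    """
    if len(items) <= 1:
        return list(items), 0  # n = 0 or n = 1: the list is already sorted

    pivot_index = rng.randrange(len(items))  # pivot chosen uniformly at random
    pivot = items[pivot_index]

    smaller, larger = [], []
    comparisons = 0
    for i, x in enumerate(items):
        if i == pivot_index:
            continue
        comparisons += 1  # x is compared with the pivot exactly once
        if x < pivot:
            smaller.append(x)
        else:
            larger.append(x)

    sorted_smaller, c_smaller = quicksort_comparisons(smaller, rng)
    sorted_larger, c_larger = quicksort_comparisons(larger, rng)
    total = comparisons + c_smaller + c_larger
    return sorted_smaller + [pivot] + sorted_larger, total


rng = random.Random(2017)              # fixed seed, so the run is reproducible
keys = rng.sample(range(10_000), 200)  # 200 distinct keys in random order
sorted_keys, q = quicksort_comparisons(keys, rng)
print(q)  # one sample of the number of comparisons for n = 200
```

Each call contributes (length of the current list minus one) comparisons, so the returned count is a single sample of the number of comparisons studied in this paper; repeating the run with different seeds gives an empirical picture of its distribution.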
Each node either has no children (if the corresponding list had length 0 or 1) or two children. The main quantity we study here is the random variable $Q_n$, the total number of comparisons used in sorting a list of $n$ distinct items.

Régnier [11] and Rösler [13] each established, using different methods, a distributional limit theorem for $Q_n$, proving that $(Q_n - \mathbb{E}Q_n)/n \xrightarrow{d} Q$ as $n \to \infty$, where $Q$ has a certain distribution that can be characterized in a variety of ways; to name one, as the unique fixed point of a certain distributional identity. Using that distributional identity, Fill and Janson [5] showed (among stronger results) that the distribution of $Q$ has a continuous and strictly positive density $f$ on $\mathbb{R}$.

Fill and Janson [6] proved bounds on the rate of convergence in various metrics, including the Kolmogorov–Smirnov distance (i.e., sup-norm distance for distribution functions). Using this and their results about $f$ from [5], they proved a 'semi-local' limit theorem for $Q_n$; see their Theorem 6.1, which is reproduced in large part as Theorem 14 below. They posed the question [6, Open Problem 6.2] of whether the corresponding local limit theorem (LLT) holds. Here we show that the answer is yes, using a multi-round smoothing technique developed in an initial draft of [2], but not used in the final version of that paper. This method may well be applicable to other distributions in which one can find 'smooth parts' on various different scales, including other distributions obeying recurrences of a type similar to that obeyed by $Q_n$. Taking the 'semi-local' limit theorem of [6] as a starting point, in this paper we shall prove the following LLT for $Q_n$, together with an explicit (but almost certainly not sharp) rate of convergence.

Theorem 1. Defining $Q_n$ and $Q$ as above, and setting $q_n := \mathbb{E}Q_n$, there exists a constant $\varepsilon > 0$ such that the following holds.
We have
\[ \mathbb{P}(Q_n = x) = n^{-1} f\bigl((x - q_n)/n\bigr) + O(n^{-1-\varepsilon}) \tag{1} \]
uniformly in integers $x$, where $f$ is the continuous probability density function of $Q$.

In fact, our proof of Theorem 1 gives a bound of the form $O(n^{-19/18}\log n)$ on the error probability in (1).

The basic idea used in our proof, that of strengthening a distributional (often normal) limit theorem to a local one by smoothing, is by now quite old. Suppose that $X_n$ takes integer values, and that we know that
\[ (X_n - \mu_n)/\sigma_n \xrightarrow{d} X, \tag{2} \]
for some nice distribution $X$ (say with continuous, strictly positive density $f$ on $\mathbb{R}$). By the corresponding LLT we mean the statement that whenever $x_n$ is a sequence of deterministic values with $x_n = \mu_n + O(\sigma_n)$ then
\[ \mathbb{P}(X_n = x_n) = \sigma_n^{-1} f\bigl((x_n - \mu_n)/\sigma_n\bigr) + o(\sigma_n^{-1}). \tag{3} \]
It is not hard to see that to deduce (3) from (2), it suffices to show that 'nearby' values have similar probabilities, i.e., that if $x_n, x_n' = \mu_n + O(\sigma_n)$ and $x_n - x_n' = o(\sigma_n)$, then
\[ \mathbb{P}(X_n = x_n) = \mathbb{P}(X_n = x_n') + o(\sigma_n^{-1}). \tag{4} \]

In turn, to prove (4) we might (as in MacDonald [8]) try to find a 'smooth part' within the distribution of $X_n$. More precisely, we might try to write $X_n = A_n + B_n$ where, for some σ-algebra $\mathcal{F}_n$, we have that $A_n$ is $\mathcal{F}_n$-measurable and the conditional distribution of $B_n$ given $\mathcal{F}_n$ obeys (or nearly always obeys) a relation corresponding to (4). Then it follows easily (by first considering conditional probabilities given $\mathcal{F}_n$) that (4) holds. One idea is to choose $\mathcal{F}_n$ so that $B_n$ has a very well understood distribution, such as a binomial one.

In some contexts, this approach works directly. Here (as far as we can see) it does not. We can decompose $Q_n$ as above with $B_n$ binomial (see Lemma 12), but $B_n$ will have variance $\Theta(n)$, whereas $\operatorname{Var} Q_n = \Theta(n^2)$.
This would, roughly speaking, allow us to establish that $\mathbb{P}(Q_n = x_n)$ and $\mathbb{P}(Q_n = x_n')$ are similar for $x_n - x_n' = o(\sqrt{n})$, but we need this relation for all $x_n - x_n' = o(n)$.¹

The key idea, as in the draft of [2], is not to try to jump straight from the global limit theorem to the local one, but to proceed in stages.² For certain pairs of values $\ell < m$ with $\ell \ge 1$ and $m = o(n)$ we attempt to show that for any two length-$\ell$ subintervals $I_1$, $I_2$ of an interval $J$ of length $m$ we have
\[ \mathbb{P}(Q_n \in I_1) = \mathbb{P}(Q_n \in I_2) + o(\ell/n). \tag{5} \]
The distributional limit theorem gives us that for some $m = o(n)$ each interval $J$ of length $m$ has about the right probability, and we then use the relation above to transfer this to shorter and shorter scales, eventually ending with $\ell = 1$. In establishing (5), the idea is as before to find a suitable decomposition $Q_n = A_n + B_n$, but we can use a different decomposition for each scale; there is no requirement that these decompositions be 'compatible' in any way. For each pair $(\ell, m)$ we need such a decomposition where the distribution of $B_n$ has a property analogous to (5).

¹Actually, since [6] already contains a 'semi-LLT', it would suffice to consider $x_n - x_n' = O(n^{5/6})$.
²A related idea has recently been used (independently) by Diaconis and Hough [4], in a different context. They work with characteristic functions, rather than directly with probabilities as we do here, establishing smoothness at a range of frequency scales.

There are some complications in carrying this out. Our random variables $B_n$ will have smaller variances than the original random variables $Q_n$. This means that the point probabilities $\mathbb{P}(B_n = x_n)$, and (as it turns out) their differences $\mathbb{P}(B_n = x_n) - \mathbb{P}(B_n = x_n')$, are too large compared with the bounds we are aiming for, and the same holds with the points $x_n$ and $x_n'$ replaced by intervals. For this reason we mostly work with ratios, showing under suitable conditions that $\mathbb{P}(B_n \in I_1) \sim \mathbb{P}(B_n \in I_2)$.
But this is not always true: even if $I_1$ and $I_2$ are close, if both are far into a tail of $B_n$ the ratio of the probabilities may be far from 1. To deal with this we use another trick: if for some interval $I$ there is a significant probability $p$ that $A_n + B_n \in I$ with the translated interval $I - A_n$ being far above the mean of $B_n$, say, then there is another interval $J$ (to the left of $I$) such that there is a probability much larger than $p$ that $A_n + B_n \in J$. Hence what we will actually show, for a series of scales $m$, is that (i) each interval of length $m$ has about the right $Q_n$-probability, and (ii) no interval of length $m$ has $Q_n$-probability much larger than it should. We will use (ii) at the longer scale $m$ to show that the 'tail contributions' described above are small at scale $\ell$. Thus we will be able to transfer the combined statement (i)+(ii) from longer to shorter scales.

In the particular context of QuickSort there is a very nice way to find binomial-like smooth parts: we partially expand the execution tree, looking, roughly speaking, for a way of writing the original instance as the union of $\Theta(s)$ instances of QuickSort each run on $\Theta(r)$ input values, where $s = n/r$. Conditioning on this partial expansion (plus a little further information) the unknown part of the distribution is then 'binomial-like': it is a sum of $\Theta(s)$ independent random variables each with 'scale' $\Theta(r)$.

The rest of the paper is organized as follows: in Section 2 we state two standard results we shall need later, and then establish the existence of the decompositions described in the previous paragraph. In Section 3 we prove some simple properties of 'binomial-like' distributions. Section 4 is the heart of the paper; here we present the core smoothing argument, showing how to transfer 'smoothness' from a scale $m$ to a scale $\ell \le m$ under suitable conditions. In Section 5 we complete the proof of Theorem 1; this is a matter of applying the results from Section 4 with suitable parameters, taking as a starting point the 'semi-local' limit theorem established by Fill and Janson [6]. Finally, in Section 6 we outline a different way of applying the same smoothing results, taking a weaker distributional convergence result as the starting point; this may be applicable in other settings.

2 Preliminaries

2.1 Some standard inequalities

We shall use the Azuma–Hoeffding inequality (see [1] and [7]) in the following form (see, for example, Ross [14, Theorem 6.3.3]).

Theorem 2. Let $(Z_n)_{n \ge 1}$ be a martingale with mean $\mu = \mathbb{E}Z_n$. Let $Z_0 = \mu$ and suppose that for nonnegative constants $\alpha_i, \beta_i$, $i \ge 1$, we have
\[ -\alpha_i \le Z_i - Z_{i-1} \le \beta_i. \]
Then, for any $n \ge 0$ and $a > 0$ we have
\[ \mathbb{P}(Z_n - \mu \ge a) \le \exp\Bigl\{ -2a^2 \Big/ \sum_{i=1}^{n} (\alpha_i + \beta_i)^2 \Bigr\}, \]
and the same bound applies to $\mathbb{P}(Z_n - \mu \le -a)$.

We shall also need Esseen's inequality, also known as the Berry–Esseen Theorem; see, for example, Petrov [10, Ch. V, Theorem 3]. We write $\Phi$ for the distribution function of the standard normal random variable.

Theorem 3. Let $Z_1, \dots, Z_t$ be independent random variables with $\rho = \sum_{i=1}^{t} \mathbb{E}(|Z_i|^3)$ finite, and let $S = \sum_{i=1}^{t} Z_i$. Then
\[ \sup_x \bigl| \mathbb{P}(S \le x) - \Phi((x - \mu)/\sigma) \bigr| \le A\rho/\sigma^3, \]
where $\mu$ and $\sigma^2$ are the mean and variance of $S$, and $A$ is an absolute constant.

2.2 Decomposing the execution tree

In this subsection we shall show that, given a parameter $r$, a single run of QuickSort on a list of length $n$ will, with high probability, involve $\Omega(n/r)$ instances of QuickSort run on disjoint lists of length between $r/2$ and $r$.

Let $2 \le r < n$ be integers. We can implement QuickSort on a list of length $n$ in two phases as follows: in the first step of Phase I, pick the random pivot dividing the original list into two sublists of total length $n - 1$. In step $t$ of Phase I, if all the current sublists have length at most $r$, do nothing.
Otherwise, pick a sublist of length at least $r + 1$ arbitrarily, and pick the random pivot in this sublist, dividing its remaining elements into two new sublists. After $n$ steps, we proceed to Phase II, where we simply run QuickSort on all remaining sublists. Let $X_{n,r}$ denote the number of sublists at the end of Phase I that have length between $r/2$ and $r$.

Lemma 4. Let $r \ge 20$ be even and $n \ge 5r$. Then
\[ \mathbb{P}\bigl( X_{n,r} \le \tfrac{n}{3r} \bigr) \le e^{-n/(400r)}. \]

Proof. We have specified that $r$ be even only for convenience. We have made no attempt to optimize the values of the various constants; these will be irrelevant later.

Running QuickSort in two phases as above, let $T$ be the number of 'active' steps in Phase I, i.e., steps in which we divide a sublist into two. Clearly, $T \le n$, the first $T$ steps of Phase I are active, and after $T$ steps we have $T + 1$ sublists of total length $n - T$. The idea of the proof is to show that $T$ is very unlikely to be larger than $20n/r$, say, that $\mathbb{E}X_{n,r}$ is of order $n/r$, and that each decision in the first phase of our algorithm alters the conditional expectation of $X_{n,r}$ by at most 1. The result will then follow from the Azuma–Hoeffding inequality.

Throughout the proof we keep $r \ge 20$ fixed. Let $t_0 = \lceil 20n/r \rceil$. Observe that if $T \ge t_0$, then after step $t_0$ we have $t_0 + 1$ sublists with total length $< n$. Since at most $10n/r \le t_0/2$ of these sublists can have length at least $r/10$, at least $t_0/2$ of our sublists have length $< r/10$. Let $N$ be the number of sublists after $t_0$ steps that have length less than $r/10$, so we have shown that $\mathbb{P}(T \ge t_0) \le \mathbb{P}(N \ge t_0/2)$.

In any step of Phase I, we either do nothing, or randomly divide a list of some length $\ell \ge r + 1$. In the latter case, the (conditional, given the past) probability of producing a sublist of length $< r/10$ is at most
\[ \frac{2(r/10 + 1)}{\ell} \le \frac{3r/10}{\ell} < \frac{3}{10}, \]
since $r \ge 20$ and $\ell > r$.
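As a numerical companion to the two-phase implementation, the following Python sketch simulates Phase I while tracking only sublist lengths. It rests on the standard fact that the rank of a uniformly random pivot within a sublist of length $\ell$ is uniform on $\{1, \dots, \ell\}$, so the two new sublists have lengths (rank − 1) and ($\ell$ − rank). The names `phase_one` and `count_mid_sublists`, the stack-based choice of which sublist to split next, and the parameter values are assumptions made for illustration; the text allows the sublist to be chosen arbitrarily.

```python
import random


def phase_one(n, r, rng):
    """Simulate Phase I for a list of n keys, tracking sublist lengths only.

    Returns (lengths, active_steps): the lengths of all sublists at the end
    of Phase I (empty sublists included) and the number T of active steps.
    """
    if not 2 <= r < n:
        raise ValueError("need 2 <= r < n")
    finished = []   # sublists of length <= r; Phase I never splits these
    waiting = [n]   # sublists not yet examined
    active_steps = 0
    while waiting:
        length = waiting.pop()
        if length <= r:
            finished.append(length)
            continue
        rank = rng.randint(1, length)  # pivot rank, uniform on {1, ..., length}
        waiting.append(rank - 1)       # elements below the pivot
        waiting.append(length - rank)  # elements above the pivot
        active_steps += 1
    return finished, active_steps


def count_mid_sublists(lengths, r):
    """X_{n,r}: number of final sublists with length between r/2 and r."""
    return sum(1 for length in lengths if r / 2 <= length <= r)


rng = random.Random(4)  # fixed seed, so the run is reproducible
n, r = 20_000, 20
lengths, T = phase_one(n, r, rng)
x = count_mid_sublists(lengths, r)
print(T, x, n / (3 * r))  # Lemma 4 bounds the probability that x <= n / (3r)
```

The run can be used to check the bookkeeping in the proof: after $T$ active steps there are $T + 1$ sublists of total length $n - T$, and every remaining sublist has length at most $r$.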
It follows that $N$ is stochastically dominated by a binomial distribution with parameters $t_0$ and $3/10$, so
\[ \mathbb{P}(T \ge t_0) \le \mathbb{P}(N \ge t_0/2) \le \mathbb{P}\bigl(\operatorname{Bin}(t_0, 3/10) \ge t_0/2\bigr) \le e^{-2t_0/25}, \tag{6} \]
using Theorem 2, or a standard Chernoff bound, for the last step.

Turning to the next part of the argument, as $r$ is fixed throughout, let us write $X_n$ for the random variable $X_{n,r}$. We extend the definition of $X_n$ to the case $n \le r$ by considering Phase I to end immediately (with one 'sublist' of length $n$) in this case. The sequence $(X_n)$ satisfies the deterministic initial conditions
\[ X_0 = \cdots = X_{(r/2)-1} = 0, \qquad X_{r/2} = \cdots = X_r = 1, \]
and (considering the first step in Phase I as described above) the distributional recurrence relation
\[ X_n \overset{\mathcal{L}}{=} X_{U_n - 1} + X^*_{n - U_n}, \qquad n \ge r + 1, \tag{7} \]
where, on the right, $X_j$ and $X_j^*$ are independent probabilistic copies of $X_j$ for each $j = 1, \dots, n - 1$ and $U_n$ is uniformly distributed on $\{1, \dots, n\}$, and is independent of all the $X_j$ and $X_j^*$ variables.

Let $\xi_n := \mathbb{E}X_n$. From (7) we have $\xi_n = \frac{2}{n} \sum_{i=0}^{n-1} \xi_i$ for $n \ge r + 1$. It follows that
\[ \xi_0 = \cdots = \xi_{(r/2)-1} = 0, \qquad \xi_{r/2} = \cdots = \xi_r = 1, \]
and
\[ \xi_n = \frac{n+1}{r+1}, \qquad n \ge r + 1. \tag{8} \]
(The last equation holds also for $n = r$.) Define $\tilde{\xi}_n = \frac{n+1}{r+1}$ for all $n$. Then $\tilde{\xi}_{k-1} + \tilde{\xi}_{n-k} = \tilde{\xi}_n$ always. Since
\[ |\xi_n - \tilde{\xi}_n| \le \frac{r/2}{r+1} < \frac{1}{2}, \]
it follows that if $n \ge r + 1$ (and so $\xi_n = \tilde{\xi}_n$), then
\[ -1 < \xi_{k-1} + \xi_{n-k} - \xi_n < 1 \tag{9} \]
for all $1 \le k \le n$.

Let $\mathcal{F}_t$ denote the σ-algebra corresponding to information revealed in the first $t$ steps of Phase I as described above. Define
\[ M_t = \mathbb{E}[X_n \mid \mathcal{F}_t], \]
so that $(M_t)_{t=0}^{n}$ is a (Doob) martingale. It follows from (9) that the martingale $(M_t)$, which has mean $M_0 = \xi_n$ given by (8), satisfies
\[ -1 < M_t - M_{t-1} < 1 \]
for every $t$.

Let $E$ be the event that $X_n \le \frac{n}{3r}$. Since
\[ \xi_n = \frac{n+1}{r+1} > \frac{n}{r+1} > \frac{2n}{3r}, \]
when $E$ holds we have $X_n - \xi_n \le -\frac{n}{3r}$. After the first $T$ steps of Phase I, nothing further happens, so $M_T = M_{T+1} = \cdots = M_n = X_n$.
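The closed form (8) and the bound (9) can be checked exactly for small parameters by iterating the recurrence $\xi_n = \frac{2}{n}\sum_{i=0}^{n-1}\xi_i$ in rational arithmetic. The sketch below is a sanity check under assumed names and parameter values (`xi_values`, $r = 20$, $n \le 300$); it is not part of the proof.

```python
from fractions import Fraction


def xi_values(n_max, r):
    """Return [xi_0, ..., xi_{n_max}] for even r, where xi_n = E X_{n,r}.

    Uses the initial conditions (0 below r/2, 1 from r/2 to r) and the
    recurrence xi_n = (2/n) * (xi_0 + ... + xi_{n-1}) for n >= r + 1.
    """
    if r % 2 != 0 or n_max < r:
        raise ValueError("need r even and n_max >= r")
    xi = [Fraction(1) if n >= r // 2 else Fraction(0) for n in range(r + 1)]
    running_sum = sum(xi)  # xi_0 + ... + xi_r
    for n in range(r + 1, n_max + 1):
        xi_n = Fraction(2, n) * running_sum
        xi.append(xi_n)
        running_sum += xi_n
    return xi


r, n_max = 20, 300
xi = xi_values(n_max, r)

# Closed form (8), including the case n = r.
closed_form_ok = all(
    xi[n] == Fraction(n + 1, r + 1) for n in range(r, n_max + 1)
)

# Inequality (9): -1 < xi_{k-1} + xi_{n-k} - xi_n < 1 for n >= r+1, 1 <= k <= n.
bound_ok = all(
    -1 < xi[k - 1] + xi[n - k] - xi[n] < 1
    for n in range(r + 1, n_max + 1)
    for k in range(1, n + 1)
)
print(closed_form_ok, bound_ok)  # prints: True True
```

Exact fractions are used instead of floating point so that the equality in (8) and the strict inequalities in (9) are tested without rounding error.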
Hence, writing $t_0 = \lceil 20n/r \rceil$ as before, we have
\[ \mathbb{P}(E) \le \mathbb{P}(T \ge t_0) + \mathbb{P}\bigl( M_{t_0} - \xi_n \le -\tfrac{n}{3r} \bigr). \]
By (6) and the Azuma–Hoeffding inequality (Theorem 2), it follows that
\[
\begin{aligned}
\mathbb{P}(E) &\le e^{-2t_0/25} + \exp\Bigl( -\frac{n^2}{18 r^2 t_0} \Bigr) \\
&\le \exp\Bigl( -\frac{40n}{25r} \Bigr) + \exp\Bigl( -\frac{n}{378r} \Bigr) \\
&\le \exp\Bigl( -\frac{n}{400r} \Bigr),
\end{aligned}
\]
where the penultimate inequality holds because $20n/r \le t_0 \le 21n/r$, since $n/r \ge 5$, and the final inequality holds because $e^{-8x/5} + e^{-x/378} \le e^{-x/400}$ for $x \ge 5$.

Corollary 5. Let $r \ge 20$ be even and $n \ge 5r$. Then we may write $Q_n = A + B$ where, for some σ-algebra $\mathcal{F}$, we have that $A$ is $\mathcal{F}$-measurable, and, with probability at least $1 - e^{-n/(400r)}$, the conditional distribution of $B$ given $\mathcal{F}$ is the sum of $s = \lceil n/(3r) \rceil$ independent random variables $B_1, \dots, B_s$ with each $B_i$ having the distribution $Q_{r_i}$ for some $r_i$ with $r/2 \le r_i \le r$.

Proof. Run QuickSort in two phases as above, and define $X_{n,r}$ as in Lemma 4. Let $E$ be the event that $X_{n,r} \ge s = \lceil n/(3r) \rceil$, so $\mathbb{P}(E) \ge 1 - e^{-n/(400r)}$ by Lemma 4. We now subdivide Phase II into two parts. When $E$ holds, we select $s$ sublists from the end of Phase I with length between $r/2$ and $r$; otherwise we do not select any. In Phase IIa, we run QuickSort on all sublists except the selected ones. In Phase IIb, we run QuickSort on the selected sublists. Take the σ-algebra $\mathcal{F}$ to be the σ-algebra corresponding to all the information uncovered in Phases I and IIa, and $A$ to be the total number of comparisons made during Phases I and IIa. Take $B_i$, $i = 1, \dots, s$, to be (when $E$ occurs) the number of comparisons involved in running QuickSort on the $i$th selected sublist.

2.3 Truncating the summands

The sum of the $B_i$ above will roughly serve as our 'binomial-like' distribution, but we would like a little more information about it. Knowing that $B_i$ has 'scale' roughly $r_i \approx r$, we shall condition on $|B_i - \mathbb{E}B_i|$ being at most $2r_i$.
This will still keep a constant fraction of the variance, while giving us better control on the distribution of the sum of such random variables.

Writing $q_n$ for $\mathbb{E}Q_n$, for $n \ge 1$ let $Q_n^* = (Q_n - q_n)/n$ denote the centered and normalized form of $Q_n$. Since $Q_n^*$ converges in distribution to $Q$, a distribution with a continuous positive density on $\mathbb{R}$, we know that there are constants $n_0$ and $c_1 > 0$ such that for all $n \ge n_0$ we have, say, $\mathbb{P}(Q_n^* \in [-2, -1]) \ge c_1$ and $\mathbb{P}(Q_n^* \in [1, 2]) \ge c_1$. Hence, for $n \ge n_0$,
\[ \mathbb{P}(Q_n^* \in [-2, 2]) \ge 2c_1 \tag{10} \]
and, since $\mathbb{P}(Q_n^* \in I \mid Q_n^* \in [-2, 2]) \ge c_1/1 = c_1$ for $I = [-2, -1]$ and $I = [1, 2]$, we have $\operatorname{Var}(Q_n^* \mid Q_n^* \in [-2, 2]) \ge c_1$.

Let $W_n'$ denote the distribution of $Q_n^*$ conditioned to lie in $[-2, 2]$, and let $W_n := W_n' - \mathbb{E}W_n'$. Then $|W_n| \le 4$ and $\operatorname{Var} W_n \ge c_1$. We will record the consequences for the unrescaled distribution of $Q_n$ immediately after the following definition.

Definition 1. Given $r > 0$ let $\mathcal{D}_r$ denote the set of probability distributions of random variables $X$ with the following properties: $\mathbb{E}X = 0$, $|X| \le 4r$, and $\operatorname{Var} X \ge c_1 (r/2)^2$.

The calculations above have the following simple consequence: for any $r \ge 2n_0$ and any $r'$ satisfying $r/2 \le r' \le r$, we have $\mathbb{P}(Q_{r'} \in [q_{r'} - 2r', q_{r'} + 2r']) \ge 2c_1$, and the conditional distribution of $Q_{r'}$ given this event is of the form $z_{r'} + X_{r'}$ for some constant $z_{r'}$ and some $X_{r'}$ with law in $\mathcal{D}_r$.

Definition 2. Given $r > 0$ and a positive integer $s$, let $\mathcal{B}_{r,s}$ denote the set of $s$-fold convolutions of distributions from $\mathcal{D}_r$.

In other words, $X$ has a distribution in $\mathcal{B}_{r,s}$ if we can write $X = X_1 + \cdots + X_s$ where the $X_i$ are independent and each has law in $\mathcal{D}_r$. The distributions in $\mathcal{B}_{r,s}$ will be the 'binomial-like' ones we shall use in the smoothing argument.

Remark. More properly we should write $\mathcal{D}_{r,c_1}$ and $\mathcal{B}_{r,s,c_1}$ for the classes defined in Definitions 1 and 2.
In this paper we need only consider a particular value of $c_1$ as at the start of this subsection, but in other contexts one might consider these classes for other values of $c_1$. The results below of course extend to this setting.

The next lemma, a simple consequence of Corollary 5, will play a key role in our smoothing arguments.

Lemma 6. There are positive constants $r_0$, $c_2$ and $c_3$ such that the following holds whenever $n$ and $r$ are positive integers with $r$ even and $r_0 \le r \le c_2 n$: we may write $Q_n = A + B$ where, for some σ-algebra $\mathcal{F}$, we have that $A$ is $\mathcal{F}$-measurable, and, with probability at least $1 - e^{-c_3 n/r}$, the conditional distribution of $B$ given $\mathcal{F}$ is in the class $\mathcal{B}_{r,s}$, with $s = \lceil c_2 n/r \rceil$.

Proof. We start by taking $\mathcal{F}'$, $A'$, and $B'$ to be as in Corollary 5. Let $E' \in \mathcal{F}'$ be the event that we may write the conditional distribution of $B'$ as the sum of independent variables $B_1', \dots, B_t'$, $t = \lceil n/(3r) \rceil$, with $B_i'$ having (conditionally given $\mathcal{F}'$) the distribution of $Q_{r_i}$ for some $r/2 \le r_i \le r$. By Corollary 5 we have $\mathbb{P}(E') \ge 1 - e^{-\Omega(n/r)}$.

We choose $c_2 \le c_1/6$, and set $s = \lceil c_2 n/r \rceil$. Note that $c_2 n/r \ge 1$, so $s \le 2c_2 n/r$. We shall reveal certain extra information as described in a moment. Let $E_i$ denote the event that $B_i' \in [q_{r_i} - 2r_i, q_{r_i} + 2r_i]$, and let $E$ denote the event that at least $s$ of the events $E_i$ occur. Each event $E_i$ has conditional probability at least $2c_1$ by (10). Since the $E_i$'s are conditionally independent given $\mathcal{F}'$, and $c_1 t \ge 2c_2 n/r \ge s$, we see [from $\mathbb{P}(E \mid E') \ge \mathbb{P}(\operatorname{Bin}(t, 2c_1) \ge c_1 t)$ and Chernoff's inequality] that $\mathbb{P}(E \mid E') \ge 1 - e^{-\Omega(t)} = 1 - e^{-\Omega(n/r)}$. Hence $\mathbb{P}(E) \ge 1 - e^{-\Omega(n/r)}$.

The extra information we reveal is as follows: firstly, which $E_i$'s occur, and hence whether $E$ occurs. When $E$ does occur, we let $\mathcal{I}$ be the set of the first $s$ indices $i$ such that $E_i$ occurs; otherwise we may take $\mathcal{I} = \emptyset$, say.