Electronic Journal of Statistics
ISSN: 1935-7524

Fast rates for empirical vector quantization

Clément Levrard
Université Paris Sud 11 and Paris 6, e-mail: [email protected]

Abstract: We consider the rate of convergence of the expected loss of empirically optimal vector quantizers. Earlier results show that the mean-squared expected distortion for any fixed distribution supported on a bounded set and satisfying some regularity conditions decreases at the rate O(log n/n). We prove that this rate is actually O(1/n). Although these conditions are hard to check, we show that well-polarized distributions with continuous densities supported on a bounded set are included in the scope of this result.

Keywords and phrases: quantization, clustering, localization, fast rates.

1. Introduction

Clustering is the problem of identifying groups of similar points that are relatively far from one another, or, in other words, of partitioning the data into dissimilar groups of similar items. For a comprehensive introduction to this topic, the reader is referred to the monograph of Graf and Luschgy [8]. Isolating meaningful groups from a cloud of data is a topic of interest in many fields, from social science to biology. In fact this issue originates in the theory of signal processing of the late 40's, where it is known as the quantization issue, or lossy data compression (see Gersho and Gray [7] for a comprehensive approach to this topic). More precisely, let X_1, ..., X_n denote n independent and identically distributed random variables drawn from a distribution P over R^d, equipped with its Euclidean norm ‖.‖, and let Q denote a k-quantizer, that is, a map from R^d to R^d such that Card(Q(R^d)) ≤ k. Let c ∈ (R^d)^k be a concatenation of k d-dimensional vectors c_1, ..., c_k. Without loss of generality we only consider quantizers of the type x ↦ c_i, where ‖x − c_i‖ = min_{j=1,...,k} ‖x − c_j‖. The c_i's are called clusters. To measure how well the quantizer Q performs in representing the source distribution, a possible way is to look at the distortion

  R(c) = E‖X − Q(X)‖² = E min_{j=1,...,k} ‖X − c_j‖²,

when E‖X‖² < ∞. The goal here is to find a set of clusters ĉ_n, drawn from the data X_1, ..., X_n, whose distortion is as close as possible to the optimal distortion R* = inf_{c ∈ (R^d)^k} R(c). To solve the problem, most approaches to date attempt to implement the principle of empirical error minimization in the vector quantization context. According to this principle, good clusters can be found by searching for ones that minimize the empirical distortion over the training data, defined by

  R̂_n(c) = (1/n) Σ_{i=1}^n ‖X_i − Q(X_i)‖² = (1/n) Σ_{i=1}^n min_{j=1,...,k} ‖X_i − c_j‖².

The existence of such empirically optimal clusters has been established by Graf and Luschgy [8, Theorem 4.12]. Let us denote by ĉ_n one of these vectors of empirically optimal clusters. If the training data represents the source well, ĉ_n will hopefully perform near optimally also on the real source. Roughly, this means that we expect R(ĉ_n) ≈ R*. The problem of quantifying how good empirically designed clusters are, compared to the truly optimal ones, has been extensively studied, see for instance Linder [10]. To reach the latter goal, a standard route is to exploit the Wasserstein distance between the empirical distribution and the source distribution, to derive upper bounds on the average distortion of empirically optimal clusters. Following this approach, Pollard [14] proved that if E‖X‖² < ∞, then R(ĉ_n) − R* → 0 almost surely, as n → ∞.
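As a small illustration of these quantities, the following sketch (Python with NumPy only; the function names are ours and not part of any cited work) computes the empirical distortion R̂_n(c) and approximates an empirically optimal vector ĉ_n with Lloyd's algorithm, the usual k-means heuristic. Lloyd's algorithm only returns a local minimizer of R̂_n, so this is a proxy for the exact empirical minimizer studied in this paper, not the object itself.

import numpy as np

def empirical_distortion(X, c):
    # R_hat_n(c): average squared distance from each sample point to its nearest cluster
    d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
    return d2.min(axis=1).mean()

def lloyd(X, k, n_iter=100, seed=0):
    # Lloyd's algorithm: a standard heuristic for minimizing the empirical distortion
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)].copy()   # centers initialized on data points
    for _ in range(n_iter):
        labels = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                c[j] = X[labels == j].mean(axis=0)            # each center moves to its cell mean
    return c

# Toy data: n = 2000 points forced into the closed unit ball of R^2, quantized with k = 3.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1))[:, None]
c_hat = lloyd(X, k=3)
print("empirical distortion of c_hat:", empirical_distortion(X, c_hat))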
More recently, Linder, Lugosi and Zeger [11], and Biau, Devroye and Lugosi [3] showed that if the support of P is bounded, then E(R(ĉ_n) − R*) = O(1/√n), using techniques borrowed from statistical learning theory. Bartlett, Linder and Lugosi [2] established that this rate is minimax over distributions supported on a finite set of points. However, faster rates can be achieved, using methods inspired from statistical learning theory. For example, it is shown by Chou [6], following a result of Pollard [16], that R(ĉ_n) − R* = O_P(1/n), under some regularity conditions on the source distribution. Nevertheless, this consistency result does not provide any information on how many training samples are needed to ensure that the average distortion of empirically optimal clusters is close to the optimum. Antos, Györfi and György established in [1] that E(R(ĉ_n) − R*) = O(log n/n) under the same conditions, paying a log n factor to derive a non-asymptotic bound. It is worth pointing out that these conditions cannot be checked in practice, and consequently remain of a theoretical nature. Moreover, the rate of 1/n for the average distortion can be achieved when the source distribution is supported on a finite set of points. Consequently, an open question is to know whether this optimal rate can be attained for more general distributions.

In the present paper, we improve previous results of Antos, Györfi and György [1], by getting rid of the log n factor. Besides, we express Pollard's condition in a more reader-friendly framework, involving the density of the source distribution. To this aim we use statistical learning arguments and prove that the average distortion of empirically optimal clusters decreases at the rate O(1/n). To get this result we use techniques such as the localization principle borrowed from Massart, Blanchard and Bousquet [4] or Koltchinskii [9]. The condition we offer can easily be interpreted as a margin-type condition, similar to the ones of Massart and Nedelec in [13], showing a clear connection between statistical learning theory and vector quantization.

The paper is organized as follows. In Section 2 we introduce notation and definitions of interest. In Section 3 we offer our main results. These results are discussed in Section 4, and illustrated on examples such as Gaussian mixtures or quasi-finite distributions. Finally, proofs are gathered in Section 5.

2. The quantization problem

Throughout the paper, X_1, ..., X_n is a sequence of independent R^d-valued random observations with the same distribution P as a generic random variable X. To frame the quantization problem as a statistical learning one, we first have to consider quantization as a contrast minimization issue. To this aim we introduce the following notation. Let c = (c_1, ..., c_k) be the set of possible clusters. The contrast function γ is defined as

  γ : (R^d)^k × R^d → R,  (c, x) ↦ min_{j=1,...,k} ‖x − c_j‖².

Within this framework, the risk R(c) takes the form R(Q) = R(c) = Pγ(c, .), where Pf(.) means integration of the function f with respect to P. In the same way, if P_n denotes the empirical distribution that is induced on R^d by the n-sample X_1, ..., X_n, we can express the empirical risk R̂_n(Q) as P_nγ(c, .). Note that, within this context, an optimal c* minimizes Pγ(c, .), whereas ĉ_n ∈ argmin_{c ∈ (R^d)^k} P_nγ(c, .). It is worth pointing out that the existence of both c* and ĉ_n is guaranteed by Graf and Luschgy [8, Theorem 4.12]. In the sequel we denote by ℳ the set of such minimizers of the true risk Pγ(c, .), so that c* ∈ ℳ. To measure how well a vector of clusters c performs compared to an optimal one, we will make use of the loss

  ℓ(c, c*) = R(c) − R(c*) = P(γ(c, .) − γ(c*, .)).
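In this notation the empirical risk P_nγ(c, .) is directly computable from the sample, while Pγ(c, .), and hence the loss ℓ(c, c*), can only be approximated, for instance by Monte Carlo when P can be simulated. The sketch below (Python/NumPy; the vector c_star is a hand-picked stand-in for an element of ℳ, which is unknown in general) makes these objects concrete.

import numpy as np

def gamma(c, x):
    # contrast gamma(c, x) = min_j ||x - c_j||^2 for a single point x
    return ((c - x[None, :]) ** 2).sum(axis=1).min()

def mean_contrast(c, sample):
    # Q gamma(c, .) for the empirical measure of `sample`: this is P_n gamma(c, .) when
    # `sample` is the data, and a Monte Carlo proxy for P gamma(c, .) when `sample` is
    # a large independent draw from P
    d2 = ((sample[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

rng = np.random.default_rng(0)
big = rng.uniform(-1, 1, size=(400_000, 2))
big = big[np.linalg.norm(big, axis=1) <= 1]                     # toy P: uniform on B(0,1), by rejection
c = np.array([[0.5, 0.5], [-0.5, 0.0], [0.2, -0.6]])
c_star = np.array([[0.55, 0.5], [-0.5, 0.05], [0.2, -0.55]])    # stand-in for an optimal vector
print("gamma(c, x) at the first sample point:", gamma(c, big[0]))
print("estimated loss l(c, c_star):", mean_contrast(c, big) - mean_contrast(c_star, big))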
Throughout the paper we will use the following assumptions on the source distribution. Let B(0, M) denote the closed ball of radius M, with M ≥ 0.

Assumption 1 (Peak Power Constraint). The distribution P is such that P(B(0,1)) = 1.

Note that Assumption 1 is stronger than the requirement E‖X‖² < ∞, as it imposes an L^∞-boundedness condition on the random variable X. For convenience we assume that the distribution is bounded by 1. However, it is important to note that our results hold for random variables X bounded from above by an arbitrary M. We will also need the following regularity requirement, first introduced by Pollard [16].

Assumption 2 (Pollard's regularity condition). The distribution P satisfies the following two conditions:
1. P has a continuous density f with respect to the Lebesgue measure on R^d,
2. the Hessian matrix of c ↦ Pγ(c, .) is positive definite for every optimal vector of clusters c*.

One can point out that Condition 1 of Assumption 2 does not guarantee the existence of a second derivative of the expectation of the contrast function. Nevertheless, Assumption 1 and Condition 1 of Assumption 2 are enough to guarantee that the map c ↦ Pγ(c, .) is twice differentiable. Let V_i be the Voronoi cell associated with c_i, for i = 1, ..., k. In this situation, the Hessian matrix is composed of the following d × d blocks:

  H(c)_{i,j} = 2P(V_i) I_d − 2 Σ_{ℓ≠i} r_{iℓ}^{-1} ∫_{∂(V_i ∩ V_ℓ)} f(x)(x − c_i)(x − c_i)^t dσ(x)   for i = j,
  H(c)_{i,j} = 2 r_{ij}^{-1} ∫_{∂(V_i ∩ V_j)} f(x)(x − c_i)(x − c_j)^t dσ(x)                          for i ≠ j,

where r_{ij} = ‖c_i − c_j‖, ∂(V_i ∩ V_j) denotes the possibly empty common face of V_i and V_j, and σ means integration with respect to the (d−1)-dimensional Lebesgue measure. For a proof of that statement, we refer to Pollard [16].

When Assumption 1 and Assumption 2 are satisfied, Chou [6] proved that ℓ(ĉ_n, c*) = O_P(1/n), whereas Antos, Györfi and György established that Eℓ(ĉ_n, c*) ≤ C log n / n, where C is a constant depending on the distribution P.

The proofs of these two results are both based on arguments which have a connection with the localization principle ([12], [9]), which provides faster rates of convergence when the expectation and the variance of γ(c, .) − γ(c*, .) are connected. To prove his result, Pollard used conditions under which the distortion and the Euclidean distance are connected, and used chaining arguments to bound from above a term which looks like a Rademacher complexity, constrained on an area around an optimal vector of clusters. Note that Koltchinskii [9] used a similar method to apply the localization principle. On the other hand, Antos, Györfi and György exploited Pollard's condition, and used a concentration inequality based on the fact that the variance and the expectation of the distortion are connected to get their result. Interestingly, this point of view has been developed by Blanchard, Bousquet and Massart [4] to get bounds on the classification risk of the SVM, using the localization principle. That is the approach that will be followed in the present document.

3. Main results

We are now in a position to state our main result.

Theorem 3.1. Assume that Assumption 1 and Assumption 2 are satisfied. Then, denoting by ĉ_n an empirical risk minimizer, we have

  Eℓ(ĉ_n, c*) ≤ C_0/n,

where C_0 is a positive constant depending on P, k and d.
This result improves previous non-asymptotic results of Antos, Györfi and György [1] and Linder, Lugosi and Zeger [11], showing that a convergence rate of 1/n can be achieved in expectation. To prove Theorem 3.1, the key result is based on a version of Talagrand's inequality due to Bousquet [5] and its application to localization, following the approach of Massart and Nedelec [13]. The main point is to connect Var(γ(c, .) − γ(c*, .)) to P(γ(c, .) − γ(c*, .)) for all possible c. To be more precise, Pollard's condition involves differentiability of the distortion, therefore γ(c, .) − γ(c*, .) is naturally linked to ‖c − c*‖, the Euclidean distance between c and c*. However, it is noteworthy that, mimicking the proof of Antos, Györfi and György [1, Corollary 1], we have in fact:

Proposition 3.1. Suppose that Assumption 1 and Assumption 2 are satisfied. Then there exist two positive constants A_1 and A_2, depending on the distribution P, such that

1. (H1): ∀c ∈ B(0,1)^k,  ‖c − c*(c)‖² ≤ A_1 ℓ(c, c*(c)),
2. (H2): ∀c ∈ B(0,1)^k, ∀c* ∈ ℳ,  Var(γ(c, .) − γ(c*, .)) ≤ A_2 ‖c − c*‖²,

where c*(c) ∈ argmin_{c* ∈ ℳ} ‖c − c*‖.

When considering several possible optimal vectors of clusters, we have to choose one to be compared with our empirical vector ĉ_n. A nearest optimal vector of clusters c*(ĉ_n) is a natural choice. It is important to note that, for every c ∈ (R^d)^k and c* ∈ ℳ, ℓ(c, c*(c)) = ℓ(c, c*). Consequently, Theorem 3.1 holds for every possible c* ∈ ℳ. Besides, it is easy to see, using the compactness of B(0,1), that there is only a finite set of optimal clusters c* when Assumption 1 is satisfied and the Hessian matrices H(c*) are positive definite for every possible c*. This compactness argument is also the key to turn respectively the local positiveness of H(c*) into property (H1) and the regularity of the contrast function γ into the global property (H2). These two properties exactly match the two parts of the proof of Antos, Györfi and György [1, Corollary 1], which in turn implies Proposition 3.1. Note also that, from Proposition 3.1, we get Var(γ(c, .) − γ(c*(c), .)) ≤ A_1 A_2 ℓ(c, c*(c)). This allows us to use localization techniques such as in the paper of Blanchard, Bousquet and Massart [4].

Pollard's regularity condition (Assumption 2) involves second derivatives of the distortion. Consequently, checking Assumption 2, even theoretically, remains a hard issue. We give a more general condition, regarding the L^∞-norm of the density f on the boundaries of the Voronoi diagram, for the distribution to satisfy Assumption 2. We recall that ℳ denotes the set of all possible optimal vectors of clusters c*.

Theorem 3.2. Denote by V_i* the Voronoi cell associated with c*_i in the Voronoi diagram associated with c*, by N* the union of all possible boundaries of Voronoi cells with respect to all possible optimal vectors of clusters c*, and by Γ the Gamma function. Let B = inf_{c* ∈ ℳ, i≠j} ‖c*_i − c*_j‖. Suppose that

  ‖f_{|N*}‖_∞ ≤ (Γ(d/2) B / (2^{d+5} π^{d/2})) inf_{c* ∈ ℳ, i=1,...,k} P(V_i*).

Then P satisfies Assumption 2.

The proof is given in Section 5. It is important to note that, for general distributions supported on B(0,M), we can state a similar theorem, involving M^{d+1} in the right-hand side of the inequality in Theorem 3.2. However, a source distribution supported on B(0,M) can be turned into a distribution supported on B(0,1), using a homothetic transformation. Therefore we will only state results for a distribution supported on B(0,1).
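For a concrete distribution, the sufficient condition of Theorem 3.2 can at least be probed numerically. The sketch below (Python/NumPy) does this for a truncated two-component Gaussian mixture on B(0,1) with d = k = 2: the optimal vector c* is approximated by a few Lloyd iterations and ℳ is treated as a singleton, so the check is only indicative; the constant Γ(d/2)B/(2^{d+5}π^{d/2}) is the one displayed in Theorem 3.2 above.

import numpy as np
from math import gamma as Gamma, pi

rng = np.random.default_rng(0)
d, k, sigma = 2, 2, 0.1
m = np.array([[0.5, 0.0], [-0.5, 0.0]])                          # the two modes of the density

raw = m[rng.integers(0, k, size=300_000)] + sigma * rng.normal(size=(300_000, d))
inside = np.linalg.norm(raw, axis=1) <= 1
Z = inside.mean()                                                # mixture mass inside B(0,1)
X = raw[inside]                                                  # sample from the truncated mixture

def density(x):
    # equal-weight truncated-mixture density on B(0,1), normalized by the Monte Carlo estimate Z
    e = np.exp(-((x[:, None, :] - m[None, :, :]) ** 2).sum(axis=2) / (2 * sigma**2))
    return e.mean(axis=1) / (2 * pi * sigma**2) / Z * (np.linalg.norm(x, axis=1) <= 1)

# Approximate c* with Lloyd iterations started at the modes (an assumption, not a guarantee).
c_star = m.astype(float).copy()
for _ in range(50):
    labels = ((X[:, None, :] - c_star[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    c_star = np.array([X[labels == j].mean(axis=0) for j in range(k)])
labels = ((X[:, None, :] - c_star[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

B = np.linalg.norm(c_star[0] - c_star[1])                        # smallest inter-cluster distance
p_min = min((labels == j).mean() for j in range(k))              # smallest Voronoi cell mass

# sup of f on N*, here the bisector segment of [c*_1, c*_2] inside B(0,1), evaluated on a grid
mid = (c_star[0] + c_star[1]) / 2
normal = np.array([c_star[0][1] - c_star[1][1], c_star[1][0] - c_star[0][0]])
normal /= np.linalg.norm(normal)
pts = mid[None, :] + np.linspace(-1, 1, 2001)[:, None] * normal[None, :]
pts = pts[np.linalg.norm(pts, axis=1) <= 1]
f_sup = density(pts).max()

bound = Gamma(d / 2) * B / (2 ** (d + 5) * pi ** (d / 2)) * p_min
print(f"sup f on N* = {f_sup:.2e}, bound = {bound:.2e}, condition satisfied: {f_sup <= bound}")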
This theorem emphasizes the idea that if P is well concentrated around its optimal clusters, then some localization conditions can hold and therefore it is a favorable case. The intuition behind this result is given by the extremal case where the Voronoi cell boundaries are empty with respect to P. This case is described in detail in Section 4. Moreover, the notion of a well-concentrated distribution looks like the margin-type conditions for the classification case, as described by Massart and Nedelec [13]. This confirms the intuition of an easy-to-quantize distribution, when the poles are well-separated.

4. Discussion and examples

4.1. Minimax lower bound

Let 𝒫 denote the set of probability distributions on B(0,1). Bartlett, Linder and Lugosi [2] offered a minimax lower bound for general distributions:

  sup_{P ∈ 𝒫} Eℓ(ĉ_n, c*) ≥ c_0 √(k^{1−4/d}/n).

Consequently, for general distributions, this minimax bound matches the upper bound on Eℓ(ĉ_n, c*) that Linder, Lugosi and Zeger [11] obtained. A question of interest is to know whether the rate of 1/n we get in Theorem 3.1 is minimax over the set of distributions which satisfy Assumption 1 and Assumption 2. Proposition 4.1 below answers this question.

Proposition 4.1. Let n be an integer, and denote by 𝒟 the set of distributions P satisfying Assumption 1 and Assumption 2. Then there exist a constant c_0 and an integer k_n such that, for any k_n-point quantizer ĉ_n,

  sup_{P ∈ 𝒟} Eℓ(ĉ_n, c*) ≥ c_0 √(k_n^{1−4/d}/n).

There is no contradiction between Theorem 3.1 and Proposition 4.1. In fact, in Theorem 3.1, k is fixed, whereas, in Proposition 4.1, k_n strongly depends on n. Therefore, it is an interesting point to know whether we can get such a minimax bound when k is fixed.

The proof of Proposition 4.1 follows the proof of Bartlett, Linder and Lugosi [2, Theorem 1], and it is therefore omitted in this paper. The main idea is to replace the distribution supported on 2n points proposed by these authors in Step 3 of their proof with a distribution supported on 2n small balls satisfying Assumption 1 and Assumption 2.

4.2. Assumption 1 is necessary

The original result of Pollard [16] assumes only that E‖X‖² < ∞, to get an asymptotic rate of O_P(1/n). Consequently, it is an interesting question to know whether Assumption 1 can be replaced with the assumption E‖X‖² < ∞ in Theorem 3.1. In fact, Assumption 1 is useful to get a global localization result from a local one, through a compactness argument. This is precisely the result of Proposition 3.1, which provides us with the global argument required for applying some localization result from a local regularity condition. However, following the idea of Antos, Györfi and György [1], it is possible to suppose only that E‖X‖² < ∞ and nevertheless get (H2) in Proposition 3.1, as expressed in the following result.

Proposition 4.2. Suppose that E‖X‖² < ∞ and that the set ℳ of all possible optimal vectors of clusters c* is finite. Then there exists a constant A_2, depending on P, such that

  ∀c ≠ c*,  Var(γ(c, .) − γ(c*, .)) ≤ A_2 ‖c − c*‖².

A proof of Proposition 4.2 can be directly deduced from the proof of [1, Theorem 2]. Consequently it is omitted in this paper. According to Proposition 4.2, we can expect to control the variance of our process, indexed by c and c*, with the Euclidean distance ‖c − c*‖, even if the support of P is not contained within a ball. Unfortunately, when the distribution is not supported on a bounded set, there are cases where the term ℓ(c, c*(c)) cannot dominate ‖c − c*(c)‖² for all c, as expressed in the following counter-example.
Let η > 0, q(η) = e^{2R−1+η/2}, and define the density f of the distribution P supported on R by

  f(x) = 1/(3η)            if x ∈ [0, η],
  f(x) = 1/(3η)            if x ∈ [R, R+η],
  f(x) = (q(η)/3) e^{−x}   if x > 2R − 1 + η/2,
  f(x) = 0                 elsewhere.

Proposition 4.3. Set η = 2, R = 10, and define c_n = (0, n, n²). Then

(i) P satisfies Assumption 2.
(ii) We have ℓ(c_n, c*(c_n)) → Px² < ∞ as n → ∞.
(iii) We have ‖c_n − c*(c_n)‖² ∼ n⁴ as n → ∞.

One easily deduces from Proposition 4.3 that the distribution P satisfies Assumption 2, but fails to satisfy (H1) in Proposition 3.1. Therefore Assumption 1 is necessary to get the result of Theorem 3.2. The intuition behind this counter-example is that two phenomena prevent ℓ(c, c*(c)) from being at most proportional to ‖c − c*(c)‖² when c is arbitrarily far from 0. Firstly, the underlying measure "erases" the Euclidean distance in the expression of ℓ(c, c*(c)), which implies that ℓ(c_n, c*(c_n)) converges. Therefore, a suitable criterion to link ℓ(c, c*(c)) and Var(γ(c, .) − γ(c*(c), .)) should probably involve a weight drawn from the tail of P. Typically we expect such a criterion to be a function of ‖c − c*(c)‖², taking into account a tail constraint on the distribution P we consider. Secondly, this example shows that, if for instance we take the 3-quantizer c_n = (n, n², n⁴), the relative loss ℓ(c_n, c*(c_n)) will mostly depend on the contribution of the smallest cluster n, when n grows to infinity, whereas ‖c_n − c*(c_n)‖² essentially depends on the distance to the cluster farthest from 0, namely n⁴.

To conclude, the Euclidean distance does not take into account the weight induced by the underlying distribution over the space. Thus, when Assumption 1 is relaxed, dominant clusters for the Euclidean distance from c* are essentially the farthest ones. On the other hand, when integrating with respect to P, far-from-zero clusters lose their influence in the loss ℓ(c_n, c*(c_n)).

4.3. A toy example

In this subsection we intend to understand which conditions on the density f can guarantee that the Hessian matrices H are positive definite. To this aim we consider an extremal case, in which the probability distribution is supported on small balls scattered in B(0,1). Roughly, if the balls are small enough and far from one another, the optimal quantization points should be the centers of these balls. These are the ideas behind the following proposition.

Proposition 4.4. Let z_1, ..., z_k be vectors in R^d. Let ρ be a positive number and R = inf_{i≠j} ‖z_i − z_j‖ be the smallest possible distance between these vectors. Let the distribution P be defined as follows:

  ∀i ∈ {1, ..., k},  P(B(z_i, ρ)) = 1/k,  P_{|B(z_i,ρ)} ∼ U_{|B(z_i,ρ)},

where U_{|B(z_i,ρ)} denotes the uniform distribution over B(z_i, ρ). Then, if (R/2 − 3ρ/2)² ≥ 2ρ²d/(d+2), the optimal k-centroid vector is (z_1, ..., z_k).

The proof of Proposition 4.4, which is given in Section 5, is inspired from a proof of Bartlett, Linder and Lugosi [2, Step 3]. It is interesting to note that Proposition 4.4 can be extended to the situation where we assume that the underlying distribution is supported on k small enough subsets. In this context, if each subset has a not too small P-measure, and if those subsets are far enough from one another, it can be proved in the same way that an optimal quantizer has a point in every small subset.

Let us now consider the distribution described in Proposition 4.4, with relevant values for ρ and R. We immediately see that if R/2 > ρ, then every boundary of the Voronoi diagram for the optimal vector of clusters lies in a null-measured area. Thus, for this distribution,

  H(c*) = ( (1/k) I_d   ···      0
               ⋮         ⋱       ⋮
              0          ···   (1/k) I_d ),
which is clearly positive definite. This short example illustrates the idea behind Theorem 3.2. Namely, if the density of the distribution is not too large at the boundaries of the Voronoi diagram associated with every optimal k-quantizer, then the Hessian matrix H will roughly behave as a positive diagonal matrix. Thus Pollard's condition (Assumption 2) will be satisfied. This most favorable case is in fact derived from the special case where the distribution is supported on k points. Antos, Györfi and György [1] proved that if the distribution has only a finite number of atoms, then the expected distortion ℓ(ĉ_n, c*) is at most C/n, where C is a constant. Here we spread the atoms into small balls to give a density to the distribution and match the regularity conditions.

4.4. Quasi-Gaussian mixture example

The aim of this subsection is to apply our results to Gaussian mixtures in dimension d = 2. However, since the support of a Gaussian random variable is not bounded, we will restrict ourselves to the "quasi-Gaussian mixture" model, which is defined as follows. Let the density f of the distribution P be defined by

  f(x) = Σ_{i=1}^k (p_i / (N_i 2πσ²)) e^{−‖x − m_i‖²/(2σ²)} 1_{B(0,1)}(x),

where N_i denotes a normalization constant for each Gaussian variable. To ensure that this model is close to the Gaussian mixture model, we assume that there exists a constant ε ∈ [0,1] such that, for i = 1, ..., k, N_i ≥ 1 − ε. Denote by B̃ = inf_{i≠j} ‖m_i − m_j‖ the smallest possible distance between two different means of the mixture. To avoid boundary issues we suppose that, for all i = 1, ..., k, B(m_i, B̃/3) ⊂ B(0,1). For such a model, we have:

Proposition 4.5. Suppose that

  p_min/p_max ≥ max( 288kσ² / ((1−ε) B̃² (1 − e^{−B̃²/(288σ²)})),  96k / ((1−ε) σ² B̃ (e^{B̃/(72σ²)} − 1)) ).

Then P satisfies Assumption 2.

The inequality we propose as a condition in Proposition 4.5 can be decomposed as follows. If

  p_min/p_max ≥ 288kσ² / ((1−ε) B̃² (1 − e^{−B̃²/(288σ²)})),

then the optimal vector of clusters c* is close to the vector of means of the mixture m = (m_1, ..., m_k). Knowing that, we can locate the Voronoi boundaries of the Voronoi diagram associated to c* and apply Theorem 3.2. This leads to the second term of the maximum in Proposition 4.5.

This condition can be interpreted as a condition on the polarization of the mixture. A favorable case for vector quantization seems to be when the poles of the mixture are well-separated, which, for Gaussian mixtures, amounts to σ being small compared to B̃. Proposition 4.5 makes explicit how small σ has to be compared to B̃, in order to satisfy Assumption 2 and therefore apply Theorem 3.1, to reach an improved convergence rate of 1/n for the loss ℓ(ĉ_n, c*). Notice that Proposition 4.5 can be considered as an extension of Proposition 4.4. In these two propositions a key point is to locate c*, which is possible when the distribution P is well-polarized. The definition of a well-polarized distribution takes two similar forms when looking at Proposition 4.4 or Proposition 4.5. In Proposition 4.4 the favorable case is when the poles are far from one another, separated by an empty area with respect to P, which ensures that the Hessian matrices H(c*) are positive definite (in this case they are diagonal matrices). When slightly disturbing the framework of Proposition 4.4, it is quite natural to think that the Hessian matrices H(c*) should remain positive definite.
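The toy example of Proposition 4.4 is easy to simulate. The sketch below (Python/NumPy) builds the uniform distribution on k small balls, checks the separation condition stated above, and verifies that a k-means heuristic started from a farthest-point initialization lands on the ball centers; since Lloyd's algorithm only yields a local minimizer of the empirical distortion, this is an illustration rather than a certificate of optimality.

import numpy as np

rng = np.random.default_rng(2)
d, k, rho = 2, 3, 0.05
z = np.array([[0.0, 0.6], [-0.55, -0.3], [0.55, -0.3]])         # ball centers inside B(0,1)
R = min(np.linalg.norm(z[i] - z[j]) for i in range(k) for j in range(k) if i != j)

# Separation condition of Proposition 4.4, as stated above.
print("condition holds:", (R / 2 - 3 * rho / 2) ** 2 >= 2 * rho**2 * d / (d + 2))

# Sample from P: each ball B(z_i, rho) has mass 1/k and carries the uniform distribution.
n = 60_000
u = rng.normal(size=(n, d))
u = u / np.linalg.norm(u, axis=1)[:, None] * rng.uniform(0, 1, size=n)[:, None] ** (1 / d)
X = z[rng.integers(0, k, size=n)] + rho * u

# Farthest-point initialization, then Lloyd iterations.
c = [X[0]]
for _ in range(k - 1):
    dist = np.min([np.linalg.norm(X - cc, axis=1) for cc in c], axis=0)
    c.append(X[dist.argmax()])
c = np.array(c)
for _ in range(50):
    labels = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    c = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# Each fitted center should lie close to one of the z_i.
print("max distance to the nearest z_i:", max(np.linalg.norm(z - cj, axis=1).min() for cj in c))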
Proposition 4.5 is an illustration of this idea: the empty separation area between the poles is replaced with an area where the density f is small compared to its value around the poles. The condition on σ and B̃ we offer in Proposition 4.5 gives a theoretical definition of a well-polarized distribution for quasi-Gaussian mixtures.

It is important to note that our result holds when k is known and matches exactly the number of components of the mixture. When the number of clusters k is larger than the number of components k̃ of the mixture, we have no general idea of where the optimal clusters can be placed. Moreover, suppose that we are able to locate the optimal vector of clusters c*. As explained in the proof of Proposition 4.5, the quantity involved in Proposition 4.5 is in fact B = inf_{i≠j} ‖c*_i − c*_j‖. Thus, in this case, we expect B to be much smaller than B̃. Consequently, a condition like the one in Proposition 4.5 could not involve the natural parameter B̃ of the mixture.

The two assumptions N_i ≥ 1 − ε and B(m_i, B̃/3) ⊂ B(0,1) can easily be satisfied when P is constructed via a homothetic transformation. To see this, take a generic Gaussian mixture on R², denote by m̄_i, i = 1, ..., k, its means and by σ̄² its variance. For a given ε > 0, choose M > 0 such that, for all i = 1, ..., k, ∫_{B(0,M)} e^{−‖x − m̄_i‖²/(2σ̄²)} dx ≥ 2πσ̄²(1 − ε) and B(m̄_i, B̃/3) ⊂ B(0,M). Denote by P_0 the "quasi-Gaussian mixture" we obtain on B(0,M) for such an M. Then, applying a homothetic transformation with coefficient 1/M to P_0 provides a quasi-Gaussian mixture on B(0,1), with means m_i = m̄_i/M, i = 1, ..., k, and variance σ² = σ̄²/M². This distribution satisfies both N_i ≥ 1 − ε and B(m_i, B̃/3) ⊂ B(0,1).

5. Proofs

5.1. Proof of Theorem 3.1

The proof strongly relies on the localization principle and its application by Blanchard, Bousquet and Massart [4]. We start with the following definition.

Definition 5.1. Let Φ be a real-valued function. Φ is called a sub-α function if and only if Φ is non-decreasing and the map x ↦ Φ(x)/x^α is non-increasing.

The next theorem is an adaptation of the result of Blanchard, Bousquet and Massart [4, Theorem 6.1]. For the sake of clarity its proof is given in Subsection 5.2.

Theorem 5.1. Let ℱ be a class of bounded measurable functions such that

(i) ∀f ∈ ℱ,  ‖f‖_∞ ≤ b,
(ii) ∀f ∈ ℱ,  Var(f) ≤ ω(f).

Let K be a positive constant, Φ a sub-α function, α ∈ [1/2, 1). Then there exists a constant C(α) such that, if D is a constant satisfying D ≤ 6KC(α) and r_* is the unique solution of the equation Φ(r) = r/D, the following holds. Assume that

  ∀r ≥ r_*,  E( sup_{f ∈ ℱ, ω(f) ≤ r} (P − P_n)f ) ≤ Φ(r).

Then, for all x > 0, with probability larger than 1 − e^{−x},

  ∀f ∈ ℱ,  Pf − P_n f ≤ K^{-1}( ω(f) + (6KC(α)/D)^{1/(1−α)} r_* + (9K² + 16Kb)x/(4n) ).

This theorem emphasizes the fact that if we are able to control the variance and the complexity term controlled by the variance, we can get a possibly interesting oracle inequality. Obviously the main point is to find a suitable control function for the variance of the process. Here the interesting set is

  ℱ = { γ(c, .) − γ(c*, .) : c ∈ B(0,1)^k, c* ∈ ℳ }.

According to Section 3, the relevant control function for the variance of the process γ(c, .) − γ(c*, .) is proportional to ‖c − c*‖². Thus it remains to bound from above the quantity

  E( sup_{c* ∈ ℳ, ‖c − c*‖² ≤ δ} (P_n − P)(γ(c*, .) − γ(c, .)) ).

This is done in the following proposition.

Proposition 5.1. Suppose that P satisfies Assumption 1. Then

  E( sup_{c* ∈ ℳ, ‖c − c*‖² ≤ δ} (P_n − P)(γ(c*, .) − γ(c, .)) ) ≤ C √(δ/n),

where C is a constant depending on k, d, and P.
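The whole mechanism revolves around the fixed point r_* of Φ(r) = r/D. With the complexity bound of Proposition 5.1, the natural choice Φ(δ) = C√(δ/n) is a sub-1/2 function, and the fixed point has the closed form r_* = C²D²/n, which is precisely the expression plugged in below. A short numerical check (Python; the values of C, D and n are arbitrary, chosen only for illustration):

C, D, n = 3.0, 2.0, 10_000                  # arbitrary illustrative constants
phi = lambda r: C * (r / n) ** 0.5          # sub-1/2 function from Proposition 5.1

r = 1.0
for _ in range(200):                        # fixed-point iteration r <- D * phi(r)
    r = D * phi(r)
print(r, C**2 * D**2 / n)                   # both printed values should equal C^2 D^2 / n = 0.0036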
Assuming that Assumption 1 and Assumption 2 are satisfied, we can apply Theorem 5.1, with ω(c, c*) = A_2 ‖c − c*‖² and b = 2.

Lemma 5.1. Let D > 0. For all c* ∈ ℳ, x > 0 and K > D/7, if r_* is the (unique) solution of Φ(δ) = δ/D, then we have, with probability larger than 1 − e^{−x},

  (P − P_n)(γ(c, .) − γ(c*, .)) ≤ K^{-1} A_2 ‖c − c*‖² + (50K/D²) r_* + ((K + 18)/n) x.

We are now in a position to prove Theorem 3.1. Take c* = c*(c), a nearest optimal vector of clusters to c, and use (H1) to connect ‖c − c*(c)‖² to ℓ(c, c*(c)). Introducing the explicit form r_* = C²D²/n, we get, with K = 2A_1A_2, D = 6K, and probability larger than 1 − e^{−x},

  (1/2)(P − P_n)(γ(ĉ_n, .) − γ(c*(c), .)) ≤ 50C²D²/(36Kn) + ((K + 18)/n) x.

Observing that P_n(γ(ĉ_n, .) − γ(c*(c), .)) ≤ 0, and taking expectation, leads to, for all c* ∈ ℳ,

  Eℓ(ĉ_n, c*) ≤ C_0/n,

for some constant C_0 > 0 depending only on k, d, and P.

5.2. Proof of Theorem 5.1

This proof is a modification of the proof of Blanchard, Bousquet and Massart [4, Theorem 6.1]. For r ≥ 0, set

  V_r = sup_{f ∈ ℱ} (P − P_n)( f/(ω(f) + r) ).

We start with a modified version of the so-called peeling lemma.

Lemma 5.2. Under the assumptions of Theorem 5.1, there exists a constant C(α) depending only on α such that, for all r > 0,

  E(V_r) ≤ C(α) Φ(r)/r.

Furthermore, we have C(α) → ∞ as α → 1.

Proof of Lemma 5.2. Let x > 1 be a real number. Because 0 ∈ Conv(ℱ),

  sup_{f ∈ ℱ} (P − P_n)( f/(ω(f)+r) ) ≤ sup_{ω(f) ≤ r} (P − P_n)( f/(ω(f)+r) ) + Σ_{k ≥ 0} sup_{r x^k < ω(f) ≤ r x^{k+1}} (P − P_n)( f/(ω(f)+r) ).

Taking expectation on both sides leads to

  E(V_r) ≤ Φ(r)/r + Σ_{k ≥ 0} Φ(r x^{k+1}) / (r(1 + x^k)).

Recalling that Φ is a sub-α function, we may write Φ(r x^{k+1}) ≤ x^{α(k+1)} Φ(r). Hence we get

  E(V_r) ≤ Φ(r)/r + (Φ(r)/r) Σ_{k ≥ 0} x^{α(k+1)}/(1 + x^k)
         ≤ (Φ(r)/r)( 1 + x^α (1/2 + 1/(x^{1−α} − 1)) ).

Taking C(α) = inf_{x > 1} ( 1 + x^α (1/2 + 1/(x^{1−α} − 1)) ) proves the result.

We are now in a position to prove Theorem 5.1. Using the inequality of Talagrand for a supremum of bounded variables that Bousquet [5] offered, we have, with probability larger than 1 − e^{−x},

  V_r ≤ E(V_r) + √(x/(2rn)) + 2√(xbE(V_r)/(nr)) + bx/(3rn).

Using Lemma 5.2 and the inequality (a + b)² ≤ 2(a² + b²),

  V_r ≤ 2C(α)Φ(r)/r + √(x/(2rn)) + 4bx/(3rn).

Let r_* be the solution of Φ(r) = r/D. If r ≥ r_*, then Φ(r)/r ≤ (r_*/r)^{1−α} (1/D). For such an r we have

  V_r ≤ A_1 r^{−(1−α)} + A_2 r^{−1/2} + A_3 r^{−1},

with

  A_1 = 2C(α)(r_*)^{1−α}/D,  A_2 = √(x/(2n)),  A_3 = 4bx/(3n).

We want to find a suitable r such that r ≥ r_* and V_r ≤ 1/K. To this aim, it suffices to see that if r ≥ (3KA_1)^{1/(1−α)} + (3KA_2)² + 3KA_3 and r ≥ r_*, then V_r ≤ 1/K, using the previous upper bound on V_r. It remains to check that the condition (3KA_1)^{1/(1−α)} + (3KA_2)² + 3KA_3 ≥ r_* holds. To see this, just recall that

  (3KA_1)^{1/(1−α)} = r_* × (6KC(α)/D)^{1/(1−α)}.

Thus, we deduce that, if D ≤ 6KC(α), the choice r = (3KA_1)^{1/(1−α)} + (3KA_2)² + 3KA_3 guarantees V_r ≤ K^{-1} and, consequently,

  Pf − P_n f ≤ (1/K)( ω(f) + (6KC(α)/D)^{1/(1−α)} r_* + (9K² + 16Kb)x/(4n) ).

5.3. Proof of Proposition 5.1

Using the differentiability of Pγ(c, .), we get, for any c ∈ (R^d)^k and c* ∈ ℳ,

  γ(c, x) = γ(c*, x) + ⟨c − c*, ∆(c*, x)⟩ + ‖c − c*‖ R(c*, c − c*, x),

where, using Pollard's [16] notation,

  ∆(c*, x) = −2( (x − c*_1) 1_{V_1*}, ..., (x − c*_k) 1_{V_k*} ),
  R(c*, c − c*, x) = Σ_{i,j=1,...,k} 1_{V_i*} 1_{V_j} ‖c − c*‖^{-1} [ 2(c_i − c_j)^t x + ‖c*_i‖² − 2(c*_i)^t c_i + ‖c_j‖² ].

Observe that, because ℳ is a finite set, by the dominated convergence theorem,

  R(c*, c − c*, .) → 0, almost surely and in L_2, when (c − c*) → 0.

Splitting the expectation in two parts, we obtain

  E( sup_{c* ∈ ℳ, ‖c − c*‖² ≤ δ} (P_n − P)(γ(c*, .) − γ(c, .)) )
    ≤ E( sup_{c* ∈ ℳ, ‖c − c*‖² ≤ δ} (P_n − P)⟨−(c − c*), ∆(c*, .)⟩ )
    + √δ E( sup_{c* ∈ ℳ, ‖c − c*‖² ≤ δ} (P_n − P)(−R(c*, c − c*, .)) )        (1)
    := A + B.