Optimal Transportation to the Entropy-Power Inequality

Olivier Rioul

arXiv:1701.08534v2 [cs.IT] 3 Mar

Abstract: We present a simple proof of the entropy-power inequality using an optimal transportation argument which takes the form of a simple change of variables. The same argument yields a reverse inequality involving a conditional differential entropy which has its own interest. It can also be generalized in various ways. The equality case is easily captured by this method and the proof is formally identical in one and several dimensions.

I. INTRODUCTION

The entropy-power inequality gives a lower bound on the differential entropy of a sum of independent random vectors in terms of their individual differential entropies, and is perhaps the most fascinating inequality that was stated by Shannon in his 1948 seminal paper [15]. To simplify the presentation we assume, without loss of generality, that all considered random vectors have zero mean and we first restrict ourselves to real-valued random variables in one dimension.

Letting $P(X)=E\{X^2\}$ be the (average) power of a random variable $X$, Shannon defined the entropy-power $N(X)$ as the power of a Gaussian random variable $X^*$ having the same entropy as $X$. He argued [15, §21] that for continuous random variables it is more convenient to work with the entropy-power $N(X)$ than with the differential entropy $h(X)$.

By the well-known formula $h(X^*)=\tfrac12\log\bigl(2\pi e P(X^*)\bigr)$ of the entropy of the Gaussian $X^*$, the closed-form expression of $N(X)=P(X^*)$ when $h(X^*)=h(X)$ is
\[ N(X)=\frac{e^{2h(X)}}{2\pi e} \tag{1} \]
which is essentially $e$ to the power twice the entropy of $X$, also the “entropy power” of $X$ in this sense. Since the Gaussian maximizes entropy for a given power, $h(X)\le\tfrac12\log\bigl(2\pi e P(X)\bigr)$, the entropy-power does not exceed the actual power: $N(X)\le P(X)$ with equality if and only if $X$ is Gaussian.

A basic property of the entropy-power is the scaling property. The power of a scaled random variable is given by $P(aX)=a^2P(X)$, and the same property holds for the entropy-power:
\[ N(aX)=a^2N(X) \tag{2} \]
thanks to the well-known scaling property of the entropy:
\[ h(aX)=h(X)+\log a \qquad (a>0). \tag{3} \]

For any two independent continuous random variables $X$ and $Y$, the power of the sum equals the sum of the individual powers: $P(X+Y)=P(X)+P(Y)$, and clearly the same relation holds for the entropy-power in the case of Gaussian variables. For non-Gaussian variables, however, the entropy-power of the sum exceeds the sum of the individual entropy-powers:
\[ N(X+Y)\ge N(X)+N(Y) \tag{4} \]
where equality holds only if $X$ and $Y$ are Gaussian. This is the celebrated entropy-power inequality (EPI) as stated by Shannon. It is remarkable that Shannon had the intuition of this inequality since it turns out to be quite difficult to prove. The first rigorous proof is due to Stam [16] more than ten years after Shannon’s paper and is quite involved.

Thirty years after Shannon’s paper, Lieb [9] gave a very different proof of an equivalent entropy-power inequality that is more convenient to prove. By the scaling property (2), one has $N(\sqrt{\lambda}X)=\lambda N(X)$ for any $0<\lambda<1$, and the EPI (4) is clearly equivalent to
\[ N(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y)\ge\lambda N(X)+(1-\lambda)N(Y). \tag{5} \]
Taking the logarithm on both sides it follows from the concavity of the logarithm that
\[ h(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y)\ge\lambda h(X)+(1-\lambda)h(Y). \tag{6} \]
Conversely, to prove (5) it is sufficient, by appropriately scaling the variables, to assume that $X$ and $Y$ have the same entropy power $N(X)=N(Y)$, hence the same entropy $h(X)=h(Y)$. In this case, multiplying (6) by 2 and taking the exponential on both sides, the r.h.s. becomes $(e^{2h(X)})^{\lambda}(e^{2h(Y)})^{1-\lambda}=\lambda e^{2h(X)}+(1-\lambda)e^{2h(Y)}$ (the two expressions coincide because $h(X)=h(Y)$), which gives (5). Thus Lieb’s restatement (6) is equivalent to the EPI. Equality holds in (6) if and only if $X$ and $Y$ are Gaussian with the same power.

Both (5) and (6) have a nice interpretation [5]: both the entropy-power $N$ and the entropy $h$ are concave under the power-preserving combination $\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y$. That linear combination is power-preserving because if $X$ and $Y$ have the same power $P$, then $\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y$ also has the same power $P$.

All available proofs of the EPI (6) can be seen as either variants of Stam’s proof using a Gaussian perturbation argument (where the entropies are differentiated with respect to the power of an additive Gaussian noise), or variants of Lieb’s proof using sharp inequalities from functional analysis such as Young’s convolutional inequality (where the EPI is obtained as a limit case). In this paper, we present a new proof from [14] using a transportation argument in which the Gaussian distribution is “transported” to another probability distribution by a simple change of variable. The idea is to relate (6) to the case of equality: let $X^*$, $Y^*$ be independent Gaussian with the same power, so that
\[ h(\sqrt{\lambda}X^*+\sqrt{1-\lambda}\,Y^*)=\lambda h(X^*)+(1-\lambda)h(Y^*). \tag{7} \]
A transportation from $X^*$ to $X$, and similarly from $Y^*$ to $Y$, can be made to compare $h(X)$ to $h(X^*)$, $h(Y)$ to $h(Y^*)$, and also $h(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y)$ to $h(\sqrt{\lambda}X^*+\sqrt{1-\lambda}\,Y^*)$. This is described in the following section.

II. INGREDIENTS

Hereafter we assume that the considered random variables have continuous and positive densities. This assumption can be made without loss of generality (see [14] for details). It follows that all considered cumulative distribution functions are continuously differentiable and (strictly) increasing.

The following lemma is the “not Gaussian to Gaussian” Lemma 1 used in [11]:

Lemma 1 (Transportation). There exists an increasing function $T$ such that $T(X^*)$ has the same distribution as $X$.

Proof. Let $F_X$ denote the cumulative distribution function of $X$. Then $P\{X\le x\}=F_X(x)=F_{X^*}\bigl(F_{X^*}^{-1}(F_X(x))\bigr)=P\{X^*\le F_{X^*}^{-1}(F_X(x))\}=P\{F_X^{-1}\bigl(F_{X^*}(X^*)\bigr)\le x\}$, which proves the lemma with $T=F_X^{-1}\circ F_{X^*}$.

Notice that the lemma is well-known when $X^*$ is uniformly distributed, to justify the inverse transform sampling method. This function $T$ is sometimes referred to as an “optimal transport” [17] because it solves a Monge-Kantorovitch transportation problem of the type:
\[ \min_{\substack{(X,X^*)\\ X\sim p_X,\ X^*\sim p_{X^*}}}\sqrt{E\{(X-X^*)^2\}} \]
where the marginal densities are fixed and the minimization of the transportation cost is done on the joint distribution. The resulting minimum is known as the Wasserstein distance $W_2(X,X^*)$. Thus $X=T(X^*)$ is the random variable which is maximally correlated to $X^*$ for fixed marginals; this is a restatement of a well-known Hardy-Littlewood rearrangement inequality and can be generalized to other convex cost functions. This type of optimality was used in [10] to prove Costa’s corner point conjecture for the Gaussian interference channel (see also [13]). However, we shall not need such an optimality property here.

By Lemma 1, to prove the EPI we can always assume that $X=T(X^*)$ using transport $T$, and similarly $Y=U(Y^*)$ using another transport $U$. Thus the EPI can be restated in terms of the Gaussian variables $X^*,Y^*$ as
\[ h(\sqrt{\lambda}\,T(X^*)+\sqrt{1-\lambda}\,U(Y^*))\ge\lambda h(T(X^*))+(1-\lambda)h(U(Y^*)). \tag{8} \]
We have the following well-known lemma (also used in [11]).

Lemma 2 (Change of Variable in the Entropy).
\[ h(T(X))=h(X)+E\{\log T'(X)\} \tag{9} \]
where $T'>0$ denotes the derivative of $T$.

For linear $T(x)=ax$ we recover the scaling property (3). The general proof is similar:

Proof. Make the change of variable $p_{T(X)}(T(x))\,dT(x)=p_X(x)\,dx$ in the expression of the entropy: $h(T(X))=-E\{\log p_{T(X)}(T(X))\}=-E\{\log(p_X(X)/T'(X))\}=h(X)+E\{\log T'(X)\}$.

Lemma 2 allows one to evaluate the differences $h(T(X^*))-h(X^*)$ and $h(U(Y^*))-h(Y^*)$. However, the remaining terms $h(\sqrt{\lambda}\,T(X^*)+\sqrt{1-\lambda}\,U(Y^*))$ and $h(\sqrt{\lambda}X^*+\sqrt{1-\lambda}\,Y^*)$ cannot be compared directly because two variables are involved instead of one. One variable can nevertheless be fixed by conditioning, and an extended version of Lemma 2 can be used:

Lemma 3 (Change of Variable in the Conditional Entropy).
\[ h(T_Y(X)\mid Y)=h(X\mid Y)+E\{\log T'_Y(X)\}. \tag{10} \]

Proof. By Lemma 2, we have $h(T_Y(X)\mid Y=y)=h(X\mid Y=y)+E\{\log T'_Y(X)\mid Y=y\}$ for a fixed value $Y=y$. The result follows by taking the expectation over $Y$.

Using these ingredients, a simple proof of the EPI is obtained as shown in the next section.

III. A SIMPLE PROOF OF THE EPI

From Lemma 1 we can assume that $X=T(X^*)$ using transport $T$ and $Y=U(Y^*)$ using transport $U$. By Lemma 2,
\[ \begin{aligned} h(X)&=h(X^*)+E\{\log T'(X^*)\}\\ h(Y)&=h(Y^*)+E\{\log U'(Y^*)\}. \end{aligned} \tag{11} \]
It remains to compare $h(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y)=h(\sqrt{\lambda}\,T(X^*)+\sqrt{1-\lambda}\,U(Y^*))$ to $h(\sqrt{\lambda}X^*+\sqrt{1-\lambda}\,Y^*)$, which is the entropy of the Gaussian variable $\tilde X=\sqrt{\lambda}X^*+\sqrt{1-\lambda}\,Y^*$. Two independent variables are involved in the expression $\sqrt{\lambda}\,T(X^*)+\sqrt{1-\lambda}\,U(Y^*)$, which does not depend on $\tilde X$ alone, but rather on the two variables $(\tilde X,\tilde Y)$ obtained by rotation from $(X^*,Y^*)$:
\[ \begin{pmatrix}\tilde X\\ \tilde Y\end{pmatrix}=\begin{pmatrix}\sqrt{\lambda}&\sqrt{1-\lambda}\\ -\sqrt{1-\lambda}&\sqrt{\lambda}\end{pmatrix}\begin{pmatrix}X^*\\ Y^*\end{pmatrix}. \tag{12} \]
The inverse rotation reads
\[ \begin{pmatrix}X^*\\ Y^*\end{pmatrix}=\begin{pmatrix}\sqrt{\lambda}&-\sqrt{1-\lambda}\\ \sqrt{1-\lambda}&\sqrt{\lambda}\end{pmatrix}\begin{pmatrix}\tilde X\\ \tilde Y\end{pmatrix} \tag{13} \]
which gives $\sqrt{\lambda}\,T(X^*)+\sqrt{1-\lambda}\,U(Y^*)=\sqrt{\lambda}\,T(\sqrt{\lambda}\tilde X-\sqrt{1-\lambda}\,\tilde Y)+\sqrt{1-\lambda}\,U(\sqrt{1-\lambda}\,\tilde X+\sqrt{\lambda}\tilde Y)$, a function of $(\tilde X,\tilde Y)$ which we denote by $T_{\tilde Y}(\tilde X)$. Now since conditioning reduces entropy,
\[ h(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y)=h(T_{\tilde Y}(\tilde X))\ge h(T_{\tilde Y}(\tilde X)\mid\tilde Y). \tag{14} \]
Lemma 3 applies with
\[ \begin{aligned} T'_{\tilde Y}(\tilde X)&=\lambda T'(\sqrt{\lambda}\tilde X-\sqrt{1-\lambda}\,\tilde Y)+(1-\lambda)U'(\sqrt{1-\lambda}\,\tilde X+\sqrt{\lambda}\tilde Y)\\ &=\lambda T'(X^*)+(1-\lambda)U'(Y^*) \end{aligned} \tag{15} \]
which gives
\[ h(T_{\tilde Y}(\tilde X)\mid\tilde Y)=h(\tilde X\mid\tilde Y)+E\bigl\{\log\bigl(\lambda T'(X^*)+(1-\lambda)U'(Y^*)\bigr)\bigr\}. \tag{16} \]
Since $X^*,Y^*$ are independent Gaussian with identical powers, so are the rotated variables $\tilde X,\tilde Y$. By independence,
\[ h(\tilde X\mid\tilde Y)=h(\tilde X)=h(\sqrt{\lambda}X^*+\sqrt{1-\lambda}\,Y^*). \tag{17} \]
Therefore, combining (14), (16) and (17) we obtain
\[ h(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y)\ge h(\sqrt{\lambda}X^*+\sqrt{1-\lambda}\,Y^*)+E\bigl\{\log\bigl(\lambda T'(X^*)+(1-\lambda)U'(Y^*)\bigr)\bigr\}. \tag{18} \]
With (11) we conclude that
\[ \begin{aligned} h(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y)&-\lambda h(X)-(1-\lambda)h(Y)\\ &\ge h(\sqrt{\lambda}X^*+\sqrt{1-\lambda}\,Y^*)-\lambda h(X^*)-(1-\lambda)h(Y^*)\\ &\quad+E\bigl\{\log\bigl(\lambda T'(X^*)+(1-\lambda)U'(Y^*)\bigr)\bigr\}-\lambda E\{\log T'(X^*)\}-(1-\lambda)E\{\log U'(Y^*)\} \end{aligned} \tag{19} \]
where the first line in the r.h.s. vanishes by (7) and the remaining part is $\ge 0$ by Jensen’s inequality (concavity of the logarithm). This proves the EPI (6).

IV. THE EQUALITY CASE

The equality case is easily captured by the above method. If equality holds in (19) then
\[ \log\bigl(\lambda T'(X^*)+(1-\lambda)U'(Y^*)\bigr)=\lambda\log T'(X^*)+(1-\lambda)\log U'(Y^*)\quad\text{a.e.} \tag{20} \]
Because the logarithm is strictly concave and $0<\lambda<1$, this implies
\[ T'(X^*)=U'(Y^*)\quad\text{a.e.} \tag{21} \]
Since $X^*$ and $Y^*$ are independent, it follows that $T'$ and $U'$ are constant and equal, hence $T$ and $U$ are linear and $X=c\cdot X^*$, $Y=c\cdot Y^*$ are Gaussian with the same power. This is the required equality case of the EPI (6). Of course this condition also implies equality in (14) since then $\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y=T_{\tilde Y}(\tilde X)=c\tilde X$ is independent of $\tilde Y$.

V. GENERALIZATION TO RANDOM VECTORS

The above proof of the EPI carries over verbatim to random vectors in $n$ dimensions. The only change is that transport maps $T:\mathbb{R}^n\to\mathbb{R}^n$ are $n$-dimensional; accordingly, $T'$ denotes the Jacobian determinant of $T$. Lemma 1 is easily extended to random vectors using the so-called Knöthe’s map in the theory of convex bodies [8], [17], of the form
\[ T(x)=\bigl(T_1(x_1),T_2(x_1,x_2),\ldots,T_n(x_1,\ldots,x_n)\bigr) \tag{22} \]
where $x=(x_1,x_2,\ldots,x_n)\in\mathbb{R}^n$. The Jacobian matrix of $T$ is triangular with positive diagonal elements:
\[ \begin{pmatrix} \dfrac{\partial T_1}{\partial x_1} & 0 & \cdots & 0\\ \dfrac{\partial T_2}{\partial x_1} & \dfrac{\partial T_2}{\partial x_2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ \dfrac{\partial T_n}{\partial x_1} & \dfrac{\partial T_n}{\partial x_2} & \cdots & \dfrac{\partial T_n}{\partial x_n} \end{pmatrix} \tag{23} \]
so that
\[ T'(x)=\prod_{i=1}^{n}\frac{\partial T_i}{\partial x_i}>0. \tag{24} \]
This transport map was used in [11] and details about its construction can also be found in [14]. Lemmas 2 and 3 are then obtained by a change of variable in $n$ dimensions. The above proof of the EPI is identical word for word, where the concavity of the logarithm in the last step (19) is used on each dimension.

VI. A REVERSE EPI

A. Derivation: Generalization to non-Gaussian $X^*$ and $Y^*$

The above proof of the EPI can also be generalized to the case where $X^*$ and $Y^*$ are not necessarily Gaussian. In fact a closer look at the proof reveals that the Gaussian assumption is never used except for the simplification in (17), which relies on the independence of $\tilde X=\sqrt{\lambda}X^*+\sqrt{1-\lambda}\,Y^*$ and $\tilde Y=-\sqrt{1-\lambda}\,X^*+\sqrt{\lambda}Y^*$. If such an independence does not hold, we obtain the more general inequality
\[ \begin{aligned} h(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y)&-\lambda h(X)-(1-\lambda)h(Y)\\ &\ge h(\sqrt{\lambda}X^*+\sqrt{1-\lambda}\,Y^*\mid-\sqrt{1-\lambda}\,X^*+\sqrt{\lambda}Y^*)-\lambda h(X^*)-(1-\lambda)h(Y^*) \end{aligned} \tag{25} \]
valid for any independent $X,Y$ and any independent $X^*,Y^*$. In fact this gives two independent inequalities. For Gaussian $X^*,Y^*$ with identical powers the r.h.s. vanishes and we recover the classical EPI. But for Gaussian $X,Y$ with identical powers the l.h.s. vanishes, so that the r.h.s. is $\le 0$, and we obtain a reverse inequality which (rewritten for $X,Y$) takes the form
\[ h(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y\mid-\sqrt{1-\lambda}\,X+\sqrt{\lambda}Y)\le\lambda h(X)+(1-\lambda)h(Y). \tag{26} \]
Compared to (6), the opposite inequality holds but for a conditional differential entropy. In other words, $\lambda h(X)+(1-\lambda)h(Y)$ is upper bounded by the differential entropy of $\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y$ and lower bounded by its conditional differential entropy given $-\sqrt{1-\lambda}\,X+\sqrt{\lambda}Y$, the difference between the bounds being equal to the mutual information $I(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y;\,-\sqrt{1-\lambda}\,X+\sqrt{\lambda}Y)$. Thus an equivalent restatement is
\[ 0\le h(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y)-\lambda h(X)-(1-\lambda)h(Y)\le I(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y;\,-\sqrt{1-\lambda}\,X+\sqrt{\lambda}Y). \tag{27} \]
This mutual information can be seen as an upper bound on the deficit in the EPI for $X$ and $Y$, which is zero if and only if $X$ and $Y$ are Gaussian with identical powers. Courtade [3] recently derived a similar bound on the deficit in the logarithmic Sobolev inequality, which is equivalent to another type of “reverse EPI”. As above the extension to random vectors in $n$ dimensions is straightforward.

B. The Equality Case and Bernstein’s Lemma

We have seen that equality holds in the above proof of the EPI if and only if $(X,Y)$ and $(X^*,Y^*)$ are proportional. The same argument shows that the same equality condition holds for the reverse EPI. Thus both the EPI (6) and its reverse (26) are equalities if and only if $X$ and $Y$ are i.i.d. Gaussian. This also corresponds to the case where the mutual information vanishes in (27). This gives an alternative proof of Bernstein’s lemma (see e.g., [6, Appendix I] and [2, Chap. 5]):

Lemma 4 (Bernstein). Let $X$ and $Y$ be independent. Then the rotated variables $\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y$ and $-\sqrt{1-\lambda}\,X+\sqrt{\lambda}Y$ are independent if and only if $X$, $Y$ are i.i.d. Gaussian.

C. The Gaussian Case

If $X$ and $Y$ are Gaussian with not necessarily equal powers $P(X)$ and $P(Y)$, it is easily seen that (26) and (6) reduce to the harmonic/geometric/arithmetic inequalities
\[ \bigl(\lambda P(X)^{-1}+(1-\lambda)P(Y)^{-1}\bigr)^{-1}\le P(X)^{\lambda}P(Y)^{1-\lambda}\le\lambda P(X)+(1-\lambda)P(Y). \tag{28} \]
More generally for Gaussian vectors, if $X\sim\mathcal N(0,K_X)$ and $Y\sim\mathcal N(0,K_Y)$, not necessarily of identical covariances, it is known [5, Thm. 8] that the EPI reduces to Ky Fan’s concavity inequality of the log-determinant: using the well-known formula $h(U)=\tfrac12\log\bigl((2\pi e)^n|K_U|\bigr)$, the EPI reduces to $\log|\lambda K_X+(1-\lambda)K_Y|\ge\lambda\log|K_X|+(1-\lambda)\log|K_Y|$.

Similarly for the reverse EPI, noting that $h(U\mid V)=\tfrac12\log\bigl((2\pi e)^n|K_{U|V}|\bigr)$ where $K_{U|V}$ is Schur’s complement $K_{U|V}=K_U-K_{UV}K_V^{-1}K_{VU}$ (where $K_{UV}$ is an inter-covariance matrix), set $U=\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y$ and $V=-\sqrt{1-\lambda}\,X+\sqrt{\lambda}Y$, so that $K_U=\lambda K_X+(1-\lambda)K_Y$, $K_V=(1-\lambda)K_X+\lambda K_Y$, and $K_{UV}=K_{VU}=\sqrt{\lambda(1-\lambda)}\,(K_Y-K_X)$. A direct calculation gives
\[ \begin{aligned} K_{U|V}&=\lambda K_X+(1-\lambda)K_Y-\lambda(1-\lambda)(K_Y-K_X)\bigl[(1-\lambda)K_X+\lambda K_Y\bigr]^{-1}(K_Y-K_X)\\ &=\lambda K_X-\lambda(1-\lambda)K_X\bigl[(1-\lambda)K_X+\lambda K_Y\bigr]^{-1}K_X+\lambda(1-\lambda)K_X\bigl[(1-\lambda)K_X+\lambda K_Y\bigr]^{-1}K_Y\\ &\quad+(1-\lambda)K_Y-\lambda(1-\lambda)K_Y\bigl[(1-\lambda)K_X+\lambda K_Y\bigr]^{-1}K_Y+\lambda(1-\lambda)K_Y\bigl[(1-\lambda)K_X+\lambda K_Y\bigr]^{-1}K_X\\ &=\lambda K_X\bigl[(1-\lambda)K_X+\lambda K_Y\bigr]^{-1}\bigl((1-\lambda)K_X+\lambda K_Y-(1-\lambda)K_X\bigr)+\lambda(1-\lambda)K_X\bigl[(1-\lambda)K_X+\lambda K_Y\bigr]^{-1}K_Y\\ &\quad+(1-\lambda)K_Y\bigl[(1-\lambda)K_X+\lambda K_Y\bigr]^{-1}\bigl((1-\lambda)K_X+\lambda K_Y-\lambda K_Y\bigr)+\lambda(1-\lambda)K_Y\bigl[(1-\lambda)K_X+\lambda K_Y\bigr]^{-1}K_X\\ &=\bigl(\lambda^2+\lambda(1-\lambda)+(1-\lambda)^2+\lambda(1-\lambda)\bigr)\bigl[\lambda K_X^{-1}+(1-\lambda)K_Y^{-1}\bigr]^{-1}\\ &=\bigl[\lambda K_X^{-1}+(1-\lambda)K_Y^{-1}\bigr]^{-1}, \end{aligned} \]
so that the reverse EPI reduces to the inequality $\log|\lambda K_X^{-1}+(1-\lambda)K_Y^{-1}|^{-1}\le\lambda\log|K_X|+(1-\lambda)\log|K_Y|$. Thus (26) and (6) reduce to the generalized harmonic/geometric/arithmetic inequalities:
\[ |\lambda K_X^{-1}+(1-\lambda)K_Y^{-1}|^{-1}\le|K_X|^{\lambda}|K_Y|^{1-\lambda}\le|\lambda K_X+(1-\lambda)K_Y|. \tag{29} \]

D. Equivalence Between the EPI and its Reverse

As observed by Chandra Nair in a private communication to the author, it turns out that the reverse EPI is in fact equivalent to the EPI where the roles of $X$ and $Y$ are permuted. In fact (26) is equivalent to
\[ h(\sqrt{\lambda}X+\sqrt{1-\lambda}\,Y,\,-\sqrt{1-\lambda}\,X+\sqrt{\lambda}Y)\le\lambda h(X)+(1-\lambda)h(Y)+h(-\sqrt{1-\lambda}\,X+\sqrt{\lambda}Y) \tag{30} \]
where the joint entropy in the l.h.s. equals $h(X,Y)=h(X)+h(Y)$ by the scaling property of the differential entropy for vectors (the rotation has unit determinant). Reorganizing terms one obtains the following version of the EPI:
\[ (1-\lambda)h(X)+\lambda h(Y)\le h(-\sqrt{1-\lambda}\,X+\sqrt{\lambda}Y). \tag{31} \]
We recover, in particular, that the cases of equality are the same for the EPI and its reverse. The above calculation was already used by Wang and Madiman [18] as a short proof of the EPI under the hypothesis that $X$ and $Y$ follow symmetrical and identical distributions. One reason why the EPI is equivalent to its reverse version is suggested below in relation to Young’s convolutional inequality and its reverse.

VII. ZAMIR AND FEDER’S GENERALIZATION TO LINEAR TRANSFORMATIONS

An immediate generalization of the EPI (6) for $n$ independent variables $X_1,X_2,\ldots,X_n$ is
\[ h\Bigl(\sum_i a_iX_i\Bigr)\ge\sum_i a_i^2\,h(X_i) \tag{32} \]
where the coefficients are normalized such that $\sum_i a_i^2=1$. The above proof of the EPI can easily be adapted to prove this inequality directly by letting $A$ be an orthogonal matrix whose first row is $(a_1,a_2,\ldots,a_n)$ and defining
\[ \tilde X=AX^* \tag{33} \]
where $X^*$ is a column vector of $n$ i.i.d. Gaussian variables $X_1^*,X_2^*,\ldots,X_n^*$. The inverse transformation is $X^*=A^t\tilde X$ and the proof is easily modified along these lines.

Essentially the same proof can be used for Zamir and Feder’s generalized EPI [19] (see also [12, §IV]):
\[ h(AX)\ge\sum_{i,j}a_{i,j}^2\,h(X_j) \tag{34} \]
where $X$ is the column vector of components $X_1,X_2,\ldots,X_n$ and $A=(a_{i,j})$ is any real-valued (possibly rectangular) matrix with orthonormal rows. By adding orthonormal rows we form a square orthonormal matrix (still denoted by $A$) and the same transformation $\tilde X=AX^*$, $X^*=A^t\tilde X$ is used. The conclusion follows from a simple inequality [7, Lemma 1] which generalizes Jensen’s inequality for the logarithm.

[…]

In fact Barthe [1] gave a transportation proof of both inequalities. Since one obtains the EPI by letting $p,q,r\to1^+$ from above (from Young’s inequality) and also by letting $p,q,r\to1^-$ from below (from the reverse Young’s inequality), the EPI and its reverse are equivalent at the limit $p,q,r\to1$.

ACKNOWLEDGMENT

The author would like to thank Tom Courtade, Chandra Nair and Michèle Wigger for their discussions.

REFERENCES

VIII. GENERALIZATION TO RÉNYI ENTROPIES AND THE RELATION TO YOUNG’S INEQUALITY

The Rényi entropy of order $p>0$ ($p\ne1$) is defined as
\[ h_p(X)=h_p(f)=\frac{1}{1-p}\log\int f^p=-p'\log\|f\|_p \tag{35} \]
where $\|f\|_p$ denotes the $L^p$ norm of the density $f$ of $X$ and $p'$ is $p$’s conjugate such that $1/p+1/p'=1$.
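As an illustration of definition (35), the following Python sketch (not part of the original paper) evaluates the Rényi entropy of a zero-mean Gaussian density by numerical integration and compares it with the closed form $\tfrac12\log(2\pi\sigma^2)+\frac{\log p}{2(p-1)}$, which follows from $\int f^p=(2\pi\sigma^2)^{(1-p)/2}p^{-1/2}$. The function names, the standard deviation `sigma`, the integration range and the use of natural logarithms are all assumptions made for this example.

```python
import math

def gaussian_pdf(x, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def renyi_entropy_numeric(pdf, p, lo, hi, steps=20000):
    """Approximate h_p = log(integral of pdf**p) / (1 - p), in nats,
    using the composite Simpson rule on [lo, hi]."""
    if p <= 0 or p == 1:
        raise ValueError("order p must be positive and different from 1")
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of sub-intervals
    width = (hi - lo) / steps
    total = pdf(lo) ** p + pdf(hi) ** p
    for k in range(1, steps):
        weight = 4.0 if k % 2 else 2.0  # interior Simpson weights 4, 2, 4, ...
        total += weight * pdf(lo + k * width) ** p
    integral = total * width / 3.0
    return math.log(integral) / (1.0 - p)

def renyi_entropy_gaussian(p, sigma):
    """Closed form for a zero-mean Gaussian (assumed helper, derived from (35))."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + math.log(p) / (2.0 * (p - 1.0))

def shannon_entropy_gaussian(sigma):
    """Differential entropy 0.5*log(2*pi*e*sigma^2), the p -> 1 limit of h_p."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

if __name__ == "__main__":
    sigma = 1.5  # hypothetical value chosen for the demonstration
    for p in (0.5, 0.999, 1.001, 2.0):
        numeric = renyi_entropy_numeric(lambda x: gaussian_pdf(x, sigma), p, -40.0, 40.0)
        print(p, numeric, renyi_entropy_gaussian(p, sigma))
    print("Shannon:", shannon_entropy_gaussian(sigma))
```

For orders close to 1 the two Rényi values bracket the Shannon differential entropy $\tfrac12\log(2\pi e\sigma^2)$, consistent with the limit $p\to1$ considered in this section.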