Shannon Shakes Hands with Chernoff: Big Data Viewpoint On Channel Information Measures

Shanyun Liu, Rui She, Jiaxun Lu, Pingyi Fan
Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Electronic Engineering, Tsinghua University, Beijing, P.R. China
E-mail: [email protected], [email protected], [email protected], [email protected]

Abstract—Shannon entropy is the most important foundation of Information Theory and has proven effective in many fields such as communications. Rényi entropy and Chernoff information are two other popular information measures with wide applications. The mutual information is an effective measure of channel information because it reflects the relation between the output variables and the input variables. In this paper, we reexamine these channel information measures from a big data viewpoint by means of the ACE algorithm. The simulation results show that the decomposition results of Shannon and Chernoff mutual information with respect to the channel parameters are almost the same. In this sense, Shannon shakes hands with Chernoff, since they are different measures of the same information quantity. We also propose a conjecture that there is a nature of channel information which is decided only by the channel parameters.

Keywords—Shannon Information; Chernoff Information; Rényi Divergence; ACE; Big Data

I. INTRODUCTION

Shannon entropy is the most important foundation of Information Theory. In traditional Information Theory and its applications, relative entropy (or Kullback-Leibler distance) is the basic information divergence. Shannon Information Theory has been useful for developing communications, data science and other subjects concerning information [1]. Due to its success, there is a large literature attempting to extend these concepts. Unfortunately, nearly none of the extensions have been widely adopted except the Rényi entropy [2][3]. The Rényi entropy has a wide range of applications. For example, according to [4], it is useful in traditional signal processing. Besides that, the Rényi entropy has also been used in independent component analysis (ICA), since information measures based on the Rényi entropy can provide distance measures among a cluster of probability densities [5]. Moreover, some studies suggested that it can help with blind source separation (BSS) [6][7]. The Chernoff information was developed to measure information based on the Rényi entropy [3]. In fact, the Chernoff information is the error exponent obtained when minimizing the overall probability of error, so it is very useful in hypothesis testing [8].

Over the last 68 years of the development of Information Theory, a large number of concepts have been proposed to describe channel information. Among them, the most popular one is the channel capacity. As a matter of fact, the channel capacity is defined as the maximum of the mutual information. The mutual information measures the amount of information that one random variable carries about another; it is the reduction in the uncertainty of one random variable when the other variable is given [8]. With respect to (w.r.t.) the channel capacity, the mutual information is easier to use and to extend, without losing the ability to measure the channel "information". As a result, in order to generalize to other, non-Shannon entropies, the mutual information between the channel input and output is adopted in this paper.

Unlike traditional research methods in Information Theory, we attempt to revisit this traditional problem from a big data viewpoint. Obviously, the era of big data is coming: large amounts of data can be obtained easily, and data processing is becoming more and more important. Two data rules are assumed. For one thing, the data does not lie. For another, in the era of big data we only need to guarantee that our conclusion is correct with very high probability rather than with probability one [9]. In this paper, we choose the alternating conditional expectation (ACE) algorithm as the tool for dealing with the data.

The ACE algorithm was proposed by Breiman and Friedman (1985) [10] for estimating the transformations of a dependent variable and a set of independent variables in multiple regression that achieve maximal correlation among these variables [11]. It can help us analyze a multivariate function because it enables the separation of multiple variables. Its effectiveness and correctness were established in [10]. With the ACE algorithm, it is easy to separate the effects of the different channel parameters and thus to find the natural connection between the channel information and the channel parameters. Furthermore, one can investigate the relation between Shannon and Chernoff as well.
The main contribution of this paper can be summarized as follows. We decompose the Shannon and Chernoff channel mutual information w.r.t. the channel parameters by the ACE algorithm. The simulation results show that Shannon shakes hands with Chernoff from the big data viewpoint. Based on these results, we put forward a conjecture that there is a nature of channel information, and that Shannon mutual information, Chernoff mutual information and other information measures are just different measures of this same information quantity. This conclusion can help us construct new information measures, or judge whether a new channel information measure is reasonable.

A. Introduction of Mutual Information

1) Shannon mutual information: The mutual information is a measure of the amount of information that one random variable contains about another variable. In fact, it describes the reduction in the uncertainty of one random variable when the other is known [8]. According to [8], the Shannon mutual information is defined as

\[
I_S(X;Y) = D(p(x,y)\,\|\,p(x)p(y)) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} \qquad (1)
\]

where X and Y are two random variables with marginal probability mass functions p(x) and p(y) and joint probability mass function p(x,y). It is noted that I_S(X;Y) is the relative entropy (or Kullback-Leibler distance) between the joint distribution p(x,y) and the product distribution p(x)p(y). The relative entropy is given by [8]

\[
D(p\,\|\,q) = \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}. \qquad (2)
\]

2) Chernoff mutual information: The Chernoff information is derived from the classical hypothesis testing problem; it is the resulting error exponent when minimizing the overall probability of error [8]. As defined in [8], it is given by

\[
I_C(P_1,P_2) \triangleq -\min_{0\le\alpha\le 1}\log\left(\sum_{x} P_1^{\alpha}(x)P_2^{1-\alpha}(x)\right) \qquad (3)
\]

where α is a real parameter with 0 ≤ α ≤ 1.

With the aid of the Chernoff information to describe the relationship between two variables, a new mutual information is defined as

\[
I_C(X,Y) \triangleq -\min_{0\le\alpha\le 1}\log\left(\sum_{x}\sum_{y} p^{\alpha}(x,y)\,(p(x)p(y))^{1-\alpha}\right). \qquad (4)
\]

In this paper, we call it the Chernoff mutual information, relative to the Shannon mutual information. In fact, both the Shannon mutual information and the Chernoff mutual information depict the inner relationship between two random variables. The difference is that they adopt different measures of the distance between two distributions: the relative entropy is adopted in the Shannon mutual information, while the Rényi divergence is used in the Chernoff mutual information [3].
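To make the two definitions concrete, here is a minimal Python sketch (not part of the original paper) that evaluates Eq. (1) and Eq. (4) for an arbitrary discrete joint pmf. The grid search over α is just one simple way to take the minimum in Eq. (4), and natural logarithms are used, which only rescales the values.

```python
import numpy as np

def shannon_mi(p_xy):
    """Shannon mutual information, Eq. (1), for a joint pmf given as a 2-D array.
    Natural logarithms are used; a different base only rescales the value."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (m, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, n)
    q_xy = p_x @ p_y                        # product distribution p(x)p(y)
    mask = p_xy > 0                         # 0 log 0 terms are dropped
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / q_xy[mask])))

def chernoff_mi(p_xy, n_grid=999):
    """Chernoff mutual information, Eq. (4). The minimum over alpha in (0, 1) is
    taken by a simple grid search (an implementation choice, not from the paper)."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    q_xy = p_x @ p_y
    mask = p_xy > 0                         # zero-mass cells vanish for alpha > 0
    alphas = np.linspace(0.0, 1.0, n_grid + 2)[1:-1]
    vals = [np.log(np.sum(p_xy[mask] ** a * q_xy[mask] ** (1.0 - a))) for a in alphas]
    return float(-min(vals))

# Example: BSC joint pmf with input prior (0.3, 0.7) and crossover probability 0.1.
lam, eps = 0.3, 0.1
p_xy = np.array([[lam * (1 - eps), lam * eps],
                 [(1 - lam) * eps, (1 - lam) * (1 - eps)]])
print(shannon_mi(p_xy), chernoff_mi(p_xy))
```

Both functions return 0 when X and Y are independent, since the joint pmf then equals the product distribution.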
B. Outline of the Paper

The rest of this paper is organized as follows. Section II gives a brief introduction to several special channel models, namely the binary symmetric channel (BSC) and the multiple symmetric channel (MSC), and also gives a brief introduction to the ACE algorithm. In Section III, some simulated examples are given. Next, Section IV gives the analysis of the simulated examples. Finally, we present the conclusion in Section V.

II. CHANNEL INFORMATION MEASURES AND ACE ALGORITHM DESCRIPTION

The mutual information between the input and output of a channel can be used to describe how well the channel transmits information. For example, for a discrete memoryless channel the maximum mutual information is defined as the channel capacity. Obviously, the mutual information is a measure of channel information.

A. Binary Symmetric Channel

Consider the BSC shown in Fig. 1. It is a binary channel where the probability of the input symbols is (λ, 1−λ) and the transmission error probability is ε. It is argued in [8] that, although it is the simplest model, it reflects the common characteristics of general channels with errors.

Fig. 1: Binary symmetric channel.

The Shannon mutual information is given by

\[
\begin{aligned}
I_S(X;Y) &= H(Y)-H(Y|X) \\
&= \sum_{k=1}^{2}\sum_{j=1}^{2} p(x_k,y_j)\log\frac{p(x_k,y_j)}{p(x_k)p(y_j)} \\
&= (1-\varepsilon)\log\frac{1-\varepsilon}{\lambda(1-\varepsilon)+(1-\lambda)\varepsilon} + \varepsilon\log\frac{\varepsilon}{\lambda\varepsilon+(1-\lambda)(1-\varepsilon)}. \qquad (5)
\end{aligned}
\]

It can be seen as I_S(X,Y) = D(p(x,y)||p(x)p(y)). The Chernoff mutual information is given by

\[
\begin{aligned}
I_C(X;Y) &= -\min_{0\le\alpha\le 1}\log\left(\sum_{x} P_1^{\alpha}(x)P_2^{1-\alpha}(x)\right) \\
&= -\min_{0\le\alpha\le 1}\log\Big((1-\varepsilon)^{\alpha}\big(\lambda(1-\varepsilon)+(1-\lambda)\varepsilon\big)^{1-\alpha} + \varepsilon^{\alpha}\big(\lambda\varepsilon+(1-\lambda)(1-\varepsilon)\big)^{1-\alpha}\Big). \qquad (6)
\end{aligned}
\]

Unfortunately, there is no explicit solution for Eq. (6). As a result, it is arduous to analyze the Chernoff mutual information analytically, and the Shannon mutual information seems to be completely different from the Chernoff mutual information.
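The following small Python sketch (again not from the paper) evaluates Eq. (5) and Eq. (6) exactly as written above. Since Eq. (6) has no explicit solution, the inner minimum over α is taken numerically with SciPy's bounded scalar minimizer, which is an implementation choice rather than anything prescribed here.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def bsc_shannon_mi(lam, eps):
    """Evaluate Eq. (5) as written above, for input prior (lam, 1 - lam) and error eps."""
    py1 = lam * (1 - eps) + (1 - lam) * eps        # lambda(1-eps) + (1-lambda)eps
    py2 = lam * eps + (1 - lam) * (1 - eps)        # lambda*eps + (1-lambda)(1-eps)
    return (1 - eps) * np.log((1 - eps) / py1) + eps * np.log(eps / py2)

def bsc_chernoff_mi(lam, eps):
    """Evaluate Eq. (6); the minimum over alpha in [0, 1] is found by bounded
    scalar minimization, since no closed form is available."""
    py1 = lam * (1 - eps) + (1 - lam) * eps
    py2 = lam * eps + (1 - lam) * (1 - eps)

    def objective(a):
        # log of the sum inside Eq. (6); convex in a, so a bounded minimizer suffices.
        return np.log((1 - eps) ** a * py1 ** (1 - a) + eps ** a * py2 ** (1 - a))

    res = minimize_scalar(objective, bounds=(0.0, 1.0), method="bounded")
    return -res.fun

print(bsc_shannon_mi(0.3, 0.1), bsc_chernoff_mi(0.3, 0.1))
```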
B. Multiple Symmetric Channel

The MSC is shown in Fig. 2. The number of input symbols is M. The probability of the input symbols is (λ_1, λ_2, ..., λ_M) with λ_1 + λ_2 + ... + λ_M = 1 (λ_i > 0). The transmission error probability to each of the other M−1 symbols is ε/(M−1). The BSC is the special case of the MSC for M = 2.

Fig. 2: Multiple symmetric channel.

Similarly, the Shannon mutual information is given by

\[
I_S(X;Y) = D(p(x,y)\,\|\,p(x)p(y)) = \sum_{k=1}^{M}\sum_{j=1}^{M} p(x_k,y_j)\log\frac{p(x_k,y_j)}{p(x_k)p(y_j)}. \qquad (7)
\]

The Chernoff mutual information is given by

\[
I_C(X;Y) = -\min_{0\le\alpha\le 1}\log\left(\sum_{x}\sum_{y} p^{\alpha}(x,y)\,(p(x)p(y))^{1-\alpha}\right). \qquad (8)
\]

C. ACE Algorithm Description

Much of the research in regression analysis has examined the optimal transformation between one or more predictors and the response. Unlike traditional multiple regression algorithms, which require prior information about the functional forms, the ACE algorithm of Breiman and Friedman (1985) [10] does not; it is a non-parametric transformation. That is to say, it is a fully automated algorithm for estimating the optimal transformations between the predictors and the response. Furthermore, it can also be used to estimate the maximal correlation among random variables. For the implementation of the ACE algorithm, one can consult [10][11].

Assume the random variables X_1, X_2, ..., X_p are predictors and Y is the response. Supposing φ_1(X_1), φ_2(X_2), ..., φ_p(X_p), θ(Y) are arbitrary zero-mean functions of the corresponding variables, the residual error is

\[
e^2(\theta,\phi_1,\cdots,\phi_p) = \frac{E\left\{\left[\theta(Y) - \sum_{i=1}^{p}\phi_i(X_i)\right]^2\right\}}{E[\theta^2(Y)]}. \qquad (9)
\]

The algorithm is summarized in Alg. 1 as in [10].

Algorithm 1 ACE Algorithm
1: Set θ(Y) = Y/‖Y‖ and φ_1(X_1), ..., φ_p(X_p) = 0;
2: Iterate until e²(θ, φ_1, ..., φ_p) fails to decrease (outer loop);
3: Iterate until e²(θ, φ_1, ..., φ_p) fails to decrease (inner loop);
4: For k = 1 to p do:
   φ_{k,1}(X_k) = E[θ(Y) − Σ_{i≠k} φ_i(X_i) | X_k];
   replace φ_k(X_k) with φ_{k,1}(X_k);
   End for loop;
5: End inner iteration loop;
6: θ_1(Y) = E[Σ_{i=1}^{p} φ_i(X_i) | Y] / ‖E[Σ_{i=1}^{p} φ_i(X_i) | Y]‖;
   replace θ(Y) with θ_1(Y);
7: End outer iteration loop;
8: θ, φ_1, ..., φ_p are the solution θ*, φ_1*, ..., φ_p*.

In this paper, we use the ACE algorithm to help us separate a multivariate function. Let X_1, X_2, ..., X_p be the channel parameters and let Y be the measured value of the channel information. The functional relation between them is

\[
Y = f(X_1, X_2, \ldots, X_p). \qquad (10)
\]

It is supposed that X_1, X_2, ..., X_p are known and independent. Moreover, the functional relation f is also known, since it is defined by us to describe the channel information. Therefore, it is easy to obtain Y, and together they form a data set {Y, X_1, ..., X_p}. This data set meets the preconditions of the ACE algorithm, so the ACE algorithm can be used to analyze it. As a result, one can get

\[
\theta(Y) = \sum_{i=1}^{p}\phi_i(X_i) + \delta \qquad (11)
\]

where δ is the residual error. In this case, it is easy to find out the separate influence of each corresponding channel parameter.
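For concreteness, below is a minimal Python sketch of Algorithm 1. It is not the reference implementation of [10]: the conditional expectations are estimated by a crude quantile-bin average instead of the data smoothers used there, fixed iteration counts stand in for the "fails to decrease" stopping rules, and θ(Y) is simply standardized.

```python
import numpy as np

def smooth(target, cond, n_bins=50):
    """Crude estimate of E[target | cond]: average target within quantile bins of cond.
    Breiman and Friedman [10] use a data smoother; binning is a simple stand-in."""
    order = np.argsort(cond)
    est = np.empty_like(target, dtype=float)
    for chunk in np.array_split(order, n_bins):
        est[chunk] = target[chunk].mean()
    return est

def ace(y, x, n_outer=20, n_inner=20):
    """Minimal sketch of Algorithm 1 (ACE). x has shape (n_samples, p)."""
    n, p = x.shape
    theta = (y - y.mean()) / y.std()            # initialize theta(Y) (standardized)
    phi = np.zeros((n, p))
    for _ in range(n_outer):
        for _ in range(n_inner):                # inner loop: update each phi_k in turn
            for k in range(p):
                resid = theta - (phi.sum(axis=1) - phi[:, k])
                phi[:, k] = smooth(resid, x[:, k])
                phi[:, k] -= phi[:, k].mean()   # keep the transformations zero-mean
        s = smooth(phi.sum(axis=1), y)          # theta(Y) = E[sum phi_i | Y], normalized
        theta = (s - s.mean()) / s.std()
    return theta, phi
```

Despite these simplifications, it returns estimates of θ(Y) and of each φ_i(X_i) evaluated at the sample points, which is all that the decomposition in Eq. (11) requires.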
III. SIMULATED EXAMPLE

In this section, various numerical simulation results are presented to analyze the two channel information measures. We focus on Monte Carlo simulations that compare the Shannon and Chernoff mutual information in different channels by the ACE algorithm. The procedure is given by the following simulation procedure.

Simulation Procedure
1: Generate channel parameters, such as λ and ε;
2: Calculate the channel information measures, such as the Shannon mutual information;
3: Decompose these channel information measures by the ACE algorithm;
4: Repeat the above process the required number of times.

A. Binary Symmetric Channel

20000 observations are generated from Eq. (5) and Eq. (6), where λ is the probability of the input symbol and ε is the transmission error. λ and ε are independently drawn from a uniform distribution U(0,1). In this case, our channel information is a multivariate function with two input parameters.

The ACE algorithm is applied to this simulated data set and the results are shown in Fig. 3. The correlation between θ(y) and φ_1(λ) + φ_2(ε) is extremely close to 1. Furthermore, the error of the ACE decomposition δ is invariably near zero. Both of these facts show that the ACE decomposition results are excellent.

It is noted that the function curves of φ_1(λ) and φ_2(ε) w.r.t. the Shannon mutual information almost coincide with those w.r.t. the Chernoff mutual information after the ACE decomposition. As a result, within the permitted error range, φ_1(λ) and φ_2(ε) appear to coincide with each other. φ_1(λ) is a monotonically decreasing function: the greater λ, the faster the decline in φ_1(λ). Furthermore, φ_2(ε) is symmetric around the line ε = 0.5, because the channel is symmetrical about ε.

The results of the Shannon mutual information are smaller than those of the Chernoff mutual information, but the overall trend is identical: θ(y) increases rapidly when y is close to zero, and the slope of its curve becomes smaller as y increases.

Fig. 3: ACE optimal transformations of the simulated data set in the BSC. The correlation between θ(y) and φ_1(λ) + φ_2(ε) in Shannon mutual information is 0.9994. The same correlation in Chernoff mutual information is 0.9991.
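A sketch of this BSC pipeline is given below, reusing the hypothetical bsc_shannon_mi, bsc_chernoff_mi and ace helpers sketched in Section II; their names and iteration counts are assumptions of those sketches, not of the paper.

```python
import numpy as np

# Monte Carlo data set for the BSC experiment above: 20000 draws of (lambda, eps)
# from U(0, 1), then the two measures of Eq. (5) and Eq. (6). bsc_shannon_mi,
# bsc_chernoff_mi and ace refer to the sketches given in Section II.
rng = np.random.default_rng(0)
lam = rng.uniform(0.0, 1.0, 20000)
eps = rng.uniform(0.0, 1.0, 20000)

y_shannon = np.array([bsc_shannon_mi(l, e) for l, e in zip(lam, eps)])
y_chernoff = np.array([bsc_chernoff_mi(l, e) for l, e in zip(lam, eps)])
x = np.column_stack([lam, eps])

# Decompose each measure as theta(y) ~ phi_1(lambda) + phi_2(eps), as in Eq. (11),
# and report the correlation between theta(y) and the sum of the phi's (cf. Fig. 3).
theta_s, phi_s = ace(y_shannon, x)
theta_c, phi_c = ace(y_chernoff, x)
print(np.corrcoef(theta_s, phi_s.sum(axis=1))[0, 1],
      np.corrcoef(theta_c, phi_c.sum(axis=1))[0, 1])
```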
B. Multiple Symmetric Channel

60000 observations are generated from Eq. (7) and Eq. (8), where λ_1, λ_2, λ_3 are the probabilities of the input symbols and ε is the transmission error. They are generated randomly and independently, bounded in (0,1). In this case, our channel information is a multivariate function with four input parameters.

Fig. 4 shows characteristics similar to those in Fig. 3. The value of the ACE decomposition error δ is still very small. On the other hand, the correlation between θ(y) and φ_1(λ_1) + φ_2(λ_2) + φ_3(λ_3) + φ_4(ε) is close to 1. These two points verify the validity of the ACE algorithm again.

It is obvious that the function curves of θ(y), φ_1(λ_1), φ_2(λ_2), φ_3(λ_3) and φ_4(ε) are almost the same for these two channel information measures. It is worth noting that the minimum of φ_4(ε) occurs when ε ≈ 0.75, and the values of φ_4(ε) are large when ε is very close to 0 or 1. Moreover, φ_1(λ_1), φ_2(λ_2) and φ_3(λ_3) are very similar: their function curves are flat and close to zero when the independent variable (the probability of the input symbol) is less than 0.7, and when the independent variable approaches 1, their values decrease rapidly. In fact, this is reasonable, because these input variables play equivalent roles in the channel.

As illustrated in Fig. 4, the result of the Shannon mutual information is smaller than that of the Chernoff mutual information, but the overall trend is identical: θ(y) increases rapidly when y is close to zero and the slope of its curve becomes smaller as y increases.

Fig. 4: ACE optimal transformations of the simulated data set in the MSC. The correlation between θ(y) and φ_1(λ_1) + φ_2(λ_2) + φ_3(λ_3) + φ_4(ε) in Shannon mutual information is 0.9999. The same correlation in Chernoff mutual information is 0.9999.

IV. DISCUSSION AND PHYSICAL EXPLANATION

A. Shannon Shakes Hands with Chernoff

When examined more closely, the results lead to further interesting conclusions. In this section, we analyze these ACE simulation results from the viewpoint of big data. Several principles lie behind this. First of all, the data does not lie; naturally, the ACE results can provide true and useful information. Furthermore, we always turn to numerical analysis when theoretical analysis is difficult. The complicated expressions for the Shannon or Chernoff mutual information are arduous to analyze intuitively, especially when they are multivariate, so it is quite appropriate to do numerical analysis by the ACE algorithm. In fact, the probably approximately correct (PAC) model is enough for us most of the time [9]. In this model, one only needs to construct algorithms that are guaranteed to be correct with high probability (not necessarily probability one). This model is the foundation of the Support Vector Machine (SVM) [9].

The ACE result is

\[
Y = f(X) \overset{(a)}{=} \theta^{-1}\left(\sum_{i=1}^{p}\phi_i(X_i)\right) = \psi(\varphi(X)) \qquad (12)
\]

where Y is the channel information and X is the set of all the input channel parameters. The equality (a) holds because the residual error δ can be ignored. From the figures above, it is noted that φ_i(X_i) is almost the same for both the Shannon and the Chernoff mutual information. For these two measures, the only difference is θ(·): the function ψ(·) is different while ϕ(X) is the same. From Fig. 3 and Fig. 4, it is also noted that the function θ(·) is monotonic. As mentioned above, the curves of θ(·) have the same trend, and the values of θ(Y) for the Shannon mutual information are invariably smaller than those for the Chernoff information. Therefore, Shannon's θ^{−1}(·) is bigger than Chernoff's. The facts agree with this corollary very well, which explains the phenomena in Fig. 5.

Fig. 5: The comparison of Shannon mutual information and Chernoff mutual information.

In this sense, Shannon shakes hands with Chernoff. Over the last decades, the Shannon information and the Chernoff information have been considered two different kinds of channel information measures, related only in that the Kullback-Leibler divergence is the special case of the Rényi divergence when α = 1. The channel information measures deduced from them look very different, even for a channel as simple as the BSC. However, with the aid of the ACE algorithm, we find the inner relation between them. In fact, they are different metrics of the same physical quantity. For a channel at a certain time, if all of its parameters are given, all of its states and properties are determined; thus the information about the same channel is always consistent. For example, the kilometer and the centimeter are both measures of length, but they are used in different situations for convenience. Generally speaking, we do not adopt the centimeter to measure the distance between two cities; in contrast, we would adopt it to measure the length of a steel rule. Their true values are identical when measuring the same thing, even though their numerical values are different. Similarly, the Shannon and Chernoff information are two metrics of the same quantity of information.
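If both measures really are monotone functions ψ(·) of the same inner quantity ϕ(X), as Eq. (12) suggests, then the Chernoff mutual information should be approximately a monotone function of the Shannon mutual information. The sketch below, reusing the hypothetical bsc_* helpers from Section II, checks this on a grid of BSC parameters via the Spearman rank correlation; it is an illustrative check, not a result reported in the paper.

```python
import numpy as np
from scipy.stats import spearmanr

# If I_S = psi_S(Psi) and I_C = psi_C(Psi) with monotone psi's (Eq. (12)), then
# I_C should be roughly a monotone function of I_S, i.e. a high rank correlation.
lams = np.linspace(0.05, 0.95, 60)
epss = np.linspace(0.05, 0.95, 60)
i_s = np.array([bsc_shannon_mi(l, e) for l in lams for e in epss])
i_c = np.array([bsc_chernoff_mi(l, e) for l in lams for e in epss])
rho, _ = spearmanr(i_s, i_c)
print("Spearman rank correlation between I_S and I_C:", rho)
```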
B. The Nature of Channel Information

Obviously, the ability of a channel to convey information is decided by its input parameters. That is to say, the parameters represent the channel. As a result, we hold the opinion that these channel parameters decide the nature of the channel. It is noted that ϕ(X) in Eq. (12) is the same whether we adopt the Shannon channel information measure or the Chernoff channel information measure. In this paper, we take ϕ(X) as the nature of the channel information, and we get

\[
\Psi = \varphi(X) \qquad (13)
\]

where Ψ is the nature of the channel information. It is decided only by the channel parameters, and it contains the core of the channel information. Any other physical quantity describing channel information is just a function of Ψ, namely

\[
I = \psi(\Psi) \qquad (14)
\]

where I is a measurement of channel information. As far as the Shannon mutual information and the Chernoff mutual information are concerned, their only difference is the function ψ(·). They can be seen as compound functions: the inner part is Ψ, which is a function of the channel parameters, and the outer part is a function of Ψ. For different channel information measures, the inner part is always the same, but the outer part is different.

There is a conjecture that if a novel information measure is put forward to measure the channel information, it can, to a very large extent, be decomposed by the ACE algorithm into a function of ψ(·). This rule can be used to judge the rationality of the new information measure. This conjecture also provides a way to propose a new information measure, since we can construct one by designing the function ψ(·). In fact, even if this guess is not right with probability one, a new measure which conforms to this law is correct with high probability, and we can easily find the relation between it and previous information measures, such as the Shannon and Chernoff information.

V. CONCLUSION

In this paper, the different channel information measures in several channels are revisited from a Big Data viewpoint by the ACE algorithm. The decomposition results with respect to the independent variables are the same, with only a slight difference in the function of the channel information values. That is to say, Shannon shakes hands with Chernoff, in the sense that they are just two metrics of the same quantity of channel information. For every channel, there is a nature of channel information, which is decided only by the channel parameters, no matter what information measure is adopted. In fact, the Big Data viewpoint, such as the ACE algorithm, provides a new perspective from which to reexamine Information Theory and to find that Shannon and Chernoff shake hands with each other, which has been ignored for decades. Furthermore, this method can help us construct new information measures while keeping the nature of the channel information unchanged. In addition, we put forward a criterion for judging whether a new channel information measure is appropriate.

ACKNOWLEDGEMENT

This work was supported by the China Major State Basic Research Development Program (973 Program) No. 2012CB316100(2) and the National Natural Science Foundation of China (NSFC) No. 61171064.

REFERENCES

[1] S. Verdu, "Fifty years of Shannon theory," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2057–2078, 1998.
[2] A. Rényi, "On measures of entropy and information," in Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1961, pp. 547–561.
[3] T. van Erven and P. Harremoës, "Rényi divergence and Kullback-Leibler divergence," IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
[4] J. C. Principe, D. Xu, and J. Fisher, "Information theoretic learning," Unsupervised Adaptive Filtering, vol. 1, pp. 265–319, 2000.
[5] Y. Bao and H. Krim, "Renyi entropy based divergence measures for ICA," in Statistical Signal Processing, 2003 IEEE Workshop on. IEEE, 2003, pp. 565–568.
[6] K. E. Hild, D. Erdogmus, J. Príncipe et al., "Blind source separation using Renyi's mutual information," IEEE Signal Processing Letters, vol. 8, no. 6, pp. 174–176, 2001.
[7] K. E. Hild, D. Erdogmus, and J. C. Principe, "An analysis of entropy estimators for blind source separation," Signal Processing, vol. 86, no. 1, pp. 182–194, 2006.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory. New Jersey, USA: Wiley, 2006.
[9] V. Vapnik, Estimation of Dependences Based on Empirical Data. Springer Science & Business Media, 2006.
[10] L. Breiman and J. H. Friedman, "Estimating optimal transformations for multiple regression and correlation," Journal of the American Statistical Association, vol. 80, no. 391, pp. 580–598, 1985.
[11] D. Wang and M. Murphy, "Estimating optimal transformations for multiple regression using the ACE algorithm," Journal of Data Science, vol. 2, no. 4, pp. 329–346, 2004.