Coarse-graining and the Blackwell order Johannes Rauh∗, Pradeep Kr. Banerjee∗, Eckehard Olbrich∗, Ju¨rgen Jost∗‡, Nils Bertschinger† and David Wolpert‡§ ∗Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany {jrauh,pradeep,olbrich,jjost}@mis.mpg.de †Frankfurt Institute for Advanced Studies, Frankfurt, Germany bertschinger@fias.uni-frankfurt.de ‡Santa Fe Institute, Santa Fe, NM, USA §MIT, Cambridge, MA, USA [email protected] 7 1 0 2 Abstract—Suppose we have a pair of information chan- showsthatwhateverthethreestochasticmatricesareandwhat- nels,κ1,κ2withacommoninput.TheBlackwellorderisapartial ever the decision rule we would use if we chose channel κ , n order over channels that compares κ1 and κ2 by the maximal we can always get the same expected utility (as we woul1d a expectedutilityanagentcanobtainwhendecisionsarebasedon J get by choosingκ and using the associated decision rule) by theoutputsofκ1andκ2.Equivalently,κ1issaidtobeBlackwell- 1 6 inferiortoκ2 ifandonlyifκ1 canbeconstructedbygarblingthe instead choosing channel κ2 (for the appropriate associated 2 output of κ2. A related partial order stipulates that κ2 is more decisionrule).In thiskindof situation,whereκ1 =λ·κ2, we capablethanκ1 ifthemutualinformationbetweentheinputand say that κ1 is a garbling (or degradation) of κ2. It is much T] output is larger for κ2 than for κ1 for any distribution over more difficult to prove that the converse also holds true: inputs. If one channel is Blackwell-inferior to another then it .I must be less capable. However examples are known where κ1 is Theorem 1 (Blackwell’s theorem [1]). Let κ1,κ2 be two s less capable than κ2 even though it is not Blackwell-inferior. stochastic matrices representing two channels with the same [c We give a new example of this phenomenon in which κ1 is input alphabet. Then the following two conditions are equiv- constructed by coarse-graining the inputs of κ2. Such a coarse- grainingisaspecialkindof“pre-garbling”ofthechannelinputs. alent: 1 This example directly establishes that the expected value of 1) When the agent chooses κ (and uses the decision rule v 2 2 the shared utility function for the coarse-grained channel is that is optimal for κ2), her expected utility is always at larger than it is for the non-coarse-grained channel. This is 0 least as big as the expected utility when she choosesκ surprising, as one might think that coarse-graining can only 1 6 (andusestheoptimaldecisionruleforκ ),independent destroy information and lead to inferior channels. 1 7 oftheutilityfunctionandthedistributionoftheinputS. 0 . I. INTRODUCTION 2) κ1 is a garbling of κ2. 1 0 Suppose we are given the choice of two channels that both Blackwell’stheoremmotivateslookingatthefollowingpar- 7 provideinformationaboutthe same randomvariable,and that tialorderoverchannelsκ ,κ withacommoninputalphabet: 1 2 1 we want to make a decision based on the channel outputs. : one of the two statements v Suppose that our utility function depends on the joint value κ (cid:22)κ :⇐⇒ 1 2 i of the input to the channel and our resultant decision based (in Blackwell’s theorem holds true. X on the channel outputs. Suppose as well that we know the WecallthispartialordertheBlackwellorder.Ifκ (cid:22)κ ,then r 1 2 a preciseconditionaldistributionsdefiningthechannels,andthe κ issaidtobeBlackwell-inferiortoκ .Strictlyspeaking,the 1 2 distribution over channel inputs. Which channel should we Blackwell order is only a preorder, since there are channels choose?Theanswerto thisquestiondependsonthe choiceof κ 6=κ thatsatisfyκ (cid:22)κ (cid:22)κ .However,forourpurposes 1 2 1 2 1 ourutilityfunctionaswellasonthedetailsofthechannelsand such channels can usually be considered as equivalent. theinputdistribution.Soforexample,withoutspecifyinghow For a given distribution of S, we can also compare κ 1 we will use the channels, in general we cannot just compare and κ by comparing the two mutual informations I(S;X ), 2 1 their information capacities to choose between them. I(S;X ) between the common input S and the channel 2 Nonetheless, for certain pairs of channels we can make outputs X and X . The data processing inequality shows 1 2 our choice, even without knowing the utility functions or the that κ (cid:23) κ implies I(S;X ) ≥ I(S;X ). However, the 2 1 2 1 distribution over inputs. Let us denote the two channels with converseimplicationdoesnothold.Theintuitivereasonisthat κ andκ ,respectively.Thenifthereexistsanotherstochastic fortheBlackwellorder,notonlytheamountofinformationis 1 2 matrix λ such that κ = λ · κ , there is never any reason important.Rather,thequestionishowmuchoftheinformation 1 2 to strictly prefer κ . This is because if we choose κ , we can that κ or κ preserve is relevant for a decision problem at 1 2 1 2 alwaysmakeourdecisionbychainingtheoutputofκ through hand. 2 the channel λ and then using the same decision function we Given two channels κ ,κ , suppose that I(S;X ) ≥ 1 2 2 would have used had we chosen κ . This simple argument I(S;X ) for all distributions of S. In this case, we say 1 1 that κ2 is more capable than κ1. Conversely, κ1 is less II. PRE-GARBLING capable than κ . Does this imply that κ (cid:22)κ ? Surprisingly, 2 1 2 As discussed above, Blackwell’s theorem states that gar- specificexamplesareknownwheretheanswerisnegative[2]. blingtheoutputofachannel(“post-garbling”)neverincreases In Proposition 6 we introduce a new example of this phe- thequalityofachannel.Ontheotherhand,garblingtheinput nomenon. In this example, κ is a Markov approximation 1 of a channel (“pre-garbling”) may increase the quality of a of κ by a deterministic function, in the following sense: 2 channel, as the following example shows. Consider a function f : S → S′ from the support S := Example 2. Suppose that an agent can choose an action from {s : P (s) = Pr(S = s) > 0} of S to another set S′. S a finite set A. She then receives a utility u(a,s) that depends Given two random variables S, X, denote by X ← S the both on the chosen action a ∈ A and on the value s of a channel defined by the conditional probabilities P (x|s) X|S random variable S. Consider the channels on S. Consider the two channels κ := (X ← S) and 2 κ :=(X ←f(S))·(f(S)←S). Which channelis superior? 0.9 0 0 1 0 0.9 1 κ = and κ =κ · = , Using the data processing inequality, it is easy to see that κ 1 0.1 1 2 1 1 0 1 0.1 1 (cid:18) (cid:19) (cid:18) (cid:19) (cid:18) (cid:19) is less capable than κ . However, as Proposition 6 shows, in 2 and the utility function general κ 6(cid:22)κ . 1 2 We call κ a Markov approximation, because the output s a u(s,a) 1 of κ is independent of the input S given f(S). κ can also 0 0 2 1 1 be obtained from κ by “pre-garbling” (Lemma 8); that is, 0 1 0 2 κ is obtained by composing another channel λf with κ . 1 0 0 1 2 It is known that pre-garbling may improve the performance 1 1 1 of a channel (but not its capacity!) as we recall in Section II. For uniform input the optimal decision rule for κ is 1 Whatmaybesurprisingisthatthiscanhappenforpre-garbling distributions of the form λf, which have the effect of coarse- a(0)=0, a(1)=1 graining according to f. and the opposite The fact that the more capable preorder does not imply the Blackwell order shows that “Shannon information,” as a(0)=1, a(1)=0 captured by the mutual information, is not the same as “Blackwellinformation,”asneededfortheBlackwelldecision for κ2. The expected utility with κ1 is 1.4, while using κ2, it problems. Indeed, our example explicitly shows that even is slightly higher, 1.45. though coarse-graining always reduces Shannon information, The intuitive reason for the difference in the expected it may not reduce Blackwell information. utilities is that the channel κ transmits one of the states 2 Proposition 6 builds upon another effect that we find para- withoutnoiseandtheotherstatewithnoise.Withaconvenient doxical: Namely, there exist random variables S,X ,X and pre-processing, it is possible to make sure that the relevant 1 2 there exists a function f :S →S′ from the supportof S to a information for choosing which action has better expected finite set S′ such that the following holds: utility is transmitted with less noise. 1) S and X are independent given f(S). Note the symmetry of the example: Each of the two chan- 1 2) (X ←f(S)) (cid:22) (X ←f(S)). nelsarisesfromtheotherbyaconvenientpre-processing,since 1 2 3) (X ←S) 6(cid:22) (X ←S). the pre-processing is invertible, and the two channels are not 1 2 Statement1)saysthateverythingX knowsaboutS,itknows comparable by the Blackwell order. In contrast, two channels 1 through f(S). Statement 2) says that X knows more about that only differ by an invertible garbling of the output are 2 f(S) than X . Still, 3) says that we cannot conclude that X equivalent only with respect to the Blackwell order. 1 2 knows more about S than X . The paradox illustrates that it The pre-garbling in Example 2 is invertible, and so it is 1 is difficult to formalize what it means to “know more.” more aptly described as a pre-processing. In general, though, Theremainderofthisworkisorganizedasfollows:InSec- pure pre-garbling and pure pre-processing are not easily dis- tionII,werecallhowpre-garblingcanbeusedtoimprovethe tinguishable, and it is easy to perturb Example 2 by adding performance of a channel. We also show that the pre-garbled noise without changing the conclusion. In Section III, we channel will always be less capable and that simultaneous will present an example in which the pre-garblingconsists of pre-garbling of both channels preserves the Blackwell order. coarse-graining. It is much more difficult to understand how In Section III, we state a few properties of the Blackwell coarse-graining can be used as sensible pre-processing. order, and we explain why we find these properties counter- Even though pre-garbling can make a channel better (or, intuitive and paradoxical. In particular, we also show that more precisely, more suited for a particular decision problem coarse-graining the input can improve the performance of at hand), pre-garbling cannot invert the Blackwell order: a channel. Section IV contains a detailed discussion of an Lemma 3. If κ ≺κ ·λ, then κ 6(cid:23)κ . example that illustrates these properties. In the appendix, we 1 2 1 2 briefly remark how to compute the Blackwell order and how Proof. Suppose that κ ≺ κ · λ. Then the capacity of κ 1 2 1 to quantify deviations from the same. is less than the capacity of κ ·λ, which is bounded by the 2 capacity of κ . Therefore, the capacity of κ is less than the 2) (X ←f(S)) (cid:22) (X ←f(S)). 2 1 1 2 capacity of κ . 3) (X ←S) 6(cid:22) (X ←S). 2 1 2 It only remains to show that it is possible to also achieve Also, it follows directly from Blackwell’s theorem that the strict relation (X ← f(S)) ≺ (X ← f(S)) in the 1 2 κ1 (cid:22)κ2 implies κ1·λ(cid:22)κ2·λ second statement. This can easily be done by adding a small garbling to the channel X ← f(S) (e.g. by adding a binary for any channel λ, where the input and output alphabets 1 symmetric channel with sufficiently small noise parameter ǫ). of λ equal the input alphabet of κ ,κ . Thus, pre-garbling 1 2 This ensures (X ← f(S)) ≺ (X ← f(S)), and if the preserves the Blackwell order when applied to both channels 1 2 garbling is small enough, this does not destroy the property simultaneously. (X ←S) 6(cid:22) (X ←S). Finally, let us remark that certain kinds of simultaneous 1 2 pre-garbling can also be “hidden” in the utility function: TheexamplefromProposition5 alsoleadstothefollowing Namely, in Blackwell’s theorem, it is not necessary to vary paradoxical property: the distribution of S, as long as the (fixed) input distribution hasfullsupportS (thatis, everystate of the inputalphabetof Proposition 6. There exist random variables S,X and there κ and κ appears with positive probability). In this setting, exists a function f :S →S′ from the support of S to a finite 1 2 it suffices to look only at differentutility functions.When the set S′ such that the following holds: input distribution is fixed, it is more convenient to think in (X ←f(S))·(f(S)←S) 6(cid:22) X ←S. terms of random variables instead of channels, which slightly changes the interpretation of the decision problem. Suppose Let us again give a heuristic argument why we find this wearegivenrandomvariablesS,X1,X2andautilityfunction property paradoxical. Namely, the combined channel (X ← u(a,s) depending on the value of S and an action a ∈ A as f(S))·(f(S) ← S) can be seen as a Markov chain approx- above.IfwecannotlookatbothX1 andX2,shouldwerather imation of the direct channel X ← S that corresponds to look at X1 or at X2 to take our decision? replacing the conditional distribution Theorem 4 (Blackwell’s theorem for random variables [3]). P (x|s)= P (x|s,f(s))P (f(s)|s). X|S X|Sf(S) f(S)|S The following two conditions are equivalent: fX(s) 1) Undertheoptimaldecisionrule,whentheagentchooses by X , her expected utility is always at least as big as the 2 expected utility when she chooses X1, independent of PX|f(S)(x|f(s))Pf(S)|S(f(s)|s). the utility function. fX(s) 2) (X ←S)(cid:22)(X ←S). 1 2 Proposition 6 together with Blackwell’s theorem states that III. PRE-GARBLING BY COARSE-GRAINING there are situations where this approximation is better than the direct channel. In this section we presenta few counter-intuitiveproperties of the Blackwell order. Proof of Proposition 6. Let S,X ,X be as in Example 9 in 1 2 Proposition 5. There exist random variables S,X ,X and SectionIVthatalsoprovesProposition5, andletX =X2. In 1 2 a function f :S →S′ from the supportof S to a finite set S′ that example, the two channels X1 ←f(S) and X2 ←f(S) such that the following holds: are equal. Moreover, X1 and S are independent given f(S). Thus, (X ←f(S)·(f(S)←S)=(X ←S). Therefore, the 1) S and X are independent given f(S). 1 1 statement follows from (X ←S)6(cid:22)(X ←S). 2) (X ←f(S)) ≺ (X ←f(S)). 1 2 1 2 3) (X1 ←S) 6(cid:22) (X2 ←S). On the other hand, the channel (X ←f(S))·(f(S)←S) This result may at first seem paradoxical. After all, prop- is less capable than X ←S: erty 3) implies that there exists a decision problem involving Lemma 7. For any random variables S, X, and function f : S for which it is better to use X1 than X2. Now, property 1) S →S,thechannel(X ←f(S))·(f(S)←S)islesscapable impliesthatanyinformationthatX hasaboutS iscontained 1 than X ←S. in X ’s information about f(S). One would therefore expect 1 that,fromtheviewpointofX ,anydecisionprobleminwhich Proof. For any distribution of S, let X′ be the output of the 1 thetask isto predictS andtoreactonS lookslike adecision channel (X ←f(S))·(f(S)←S). Then, X′ is independent problemin which the task is to reactto f(S). Butproperty2) ofS givenf(S).Ontheotherhand,sincef isadeterministic implies that for such a decision problem, it is better to look function, X′ is independent of f(S) given S. Together, this at X . implies I(S;X′)=I(f(S);X′). Using the fact that the joint 2 distributions of (X,f(S)) and (X′,f(S)) are identical and Proof of Proposition 5. The proof is by Example 9, which applying the data processing inequality again gives will be given in Section IV. This example satisfies 1) S and X are independent given f(S). I(S;X′)=I(f(S);X′)=I(f(S);X)≤I(S;X). 1 s x P (s|x ) s x P (s|x ) The setting of Proposition 6 can also be understood as a 1 S|X1 1 2 S|X2 2 0 0 1/2 0 0 3/4 specific kind of pre-garbling. Namely, consider the channel λf defined by 1 0 1/2 1 0 1/4 0 1 1/4 0 1 0 λf :=P (s′|f(s)). 1 1 1/4 1 1 1/2 s′,s S|f(S) 2 1 1/2 2 1 1/2 The effect of this channel is the following: The precise value The optimal decision rule for X is a(0)=0, a(1)=1, with ofSisforgotten,andonlythevalueoff(S)ispreserved.Then 1 a new value s′ is sampled for S according to the conditional expected utility distribution of S given f(S). u :=1/2·1/2+1/2·1/2=1/2. X1 Lemma 8. (X ←f(S))·(f(S)←S)=(X ←S)·λf. The optimal decision rule for X is a(0) = 0, a(1) ∈ {0,1} 2 (this is not unique in this case), with expected utility Proof. P (x|s )P (s |f(s)) X|S 1 S|f(S) 1 Xs1 uX2 :=1/2·1/4+1/2·1/2=3/8<1/2. = P (x|s )P (s |t)P (t|s) X|S 1 S|f(S) 1 f(S)|S How can we understand this example? Some observations: Xs1,t = P (x|t)P (t|s), • It is easy to see that X2 has more irrelevant information X|f(S) f(S)|S than X : namely, X can determine relatively precisely t 1 2 X when S = 0. However, since S = 0 gives no utility where we have used that X − S − f(S) forms a Markov independentoftheaction,thisinformationisnotrelevant. chain. ItismoredifficulttounderstandwhyX haslessrelevant 2 information than X . Surprisingly, X can determine 1 1 While it is easy to understand that pre-garbling can be more precisely when S = 1: If S = 1, then X 1 advantageousin general (since it can work as preprocessing), “detectsthis”(inthesensethatX choosesaction0)with 1 wefinditsurprisingthatthiscanalsohappeninthecasewhere probability2/3.ForX ,thesameprobabilityisonly1/3. 2 ttherempsreo-fgaarbchlianngniesl λdofntehaint dteoremsscooafrsae-fgurnacintiionng.f; that is, in • HH((SS||XX1 ==0)0)==2lologg((22))−, 3Hlo(Sg(|X3)1≈=0.411)50=37325lloogg((22)),, 2 2 H(S|X = 1) = log(2). Thus, the conditional entropies 2 of S given X are smaller than the conditionalentropies 2 IV. EXAMPLES of S given X . 1 Example 9. Consider the joint distribution • One can see in which sense f(S) captures the relevant information for X , and indeed for the whole decision 1 f(s) s x1 x2 Pf(S)SX1X2 problem: knowing f(S) is completely sufficient in or- 0 0 0 0 1/4 der to receive the maximal utility for each state of S. 0 1 0 1 1/4 However, when information is incomplete, it matters 0 0 1 0 1/8 how the information about the different states of S is 0 1 1 0 1/8 mixed, and two variables X ,X that have the same 1 2 1 2 1 1 1/4 joint distribution with f(S) may perform differently. It and the function f :{0,1,2}→{0,1} with f(0)=f(1)=0 is somewhatsurprisingthatit is the randomvariablethat and f(2) = 1. Then X and X are independent uniform has less information about S and that is conditionally 1 2 binary random variables, and f(S) = AND(X1,X2). By independent of S given f(S) which actually performs symmetry, the joint distributions of the pairs (f(S),X ) and better. 1 (f(S),X )areidentical,andsothetwochannelsX ←f(S) Example 9 is different from the pre-garbling Example 2 2 1 and X ← f(S) are identical. In particular X ← f(S) (cid:22) discussed in Section III. In the latter, both channels had the 2 1 X ←f(S). sameamountofinformation(mutualinformation)aboutS,but 2 On the other hand, consider the utility function forthegivendecisionproblemtheinformationprovidedbyκ2 was more relevant than the information provide by κ . The 1 s a u(s,a) first difference in Example 9 is that X has less information 1 0 0 0 about S than X (Lemma 7). Moreover, both channels are 2 0 1 0 identicalwithrespecttof(S),i.e.theyareprovidingthesame 1 0 1 informationaboutf(S), and forX it is the only information 1 1 1 0 it has about S. So, one could argue that X has additional 2 2 0 0 information, that does not help though, but decreases the 2 1 1 expected utility instead. To compute the optimal decision rule, let us look at the To visualize how far the channel X ← S is from being 1 conditional distributions: a garbling of the channel X ← S, we use a function UI 2 proposed in [3] which has the property that UI = 0 if and only if (S ← X ) (cid:22) (S ← X ) (see Lemma 6 in [3]). 1 2 The definition of UI is briefly outlined in the Appendix. UI is a nonnegative continuous function on the set of all joint distributionsof (S,X ,X ). UI can be construedas the 1 2 unique information that X conveys about S (with respect to 1 X ) [3]. So, for instance, in Example 9, while both channels 2 havenouniqueinformationaboutf(S),“addinginformation” about S in one channel leads to having unique information about S in both channels. We now apply our measure to a parameterized version of the AND gate in Example 9. Example 10. Figure 1 shows a heat map of UI computed on the set of all distributions P , the (S,X ,X )-marginal SX1X2 1 2 of joint distributions of the form f(s) s x x P 1 2 f(S)SX1X2 0 0 0 0 1/8+2b 0 1 0 0 1/8−2b Fig.1. ThefunctionUI inExample10. 0 0 0 1 1/8+a 0 1 0 1 1/8−a a linear program, which is considered computationally easy. 0 0 1 0 1/8+a/2+b However, when the feasibility check returns a negative result, 0 1 1 0 1/8−a/2−b this approach does not give any more information, e.g. how 1 2 1 1 1/4 far κ is away from being a garbling of κ . A measure for 1 2 where −1/8 ≤ a ≤ 1/8 and −1/16 ≤ b ≤ 1/16. One can this can be constructed as follows: Let P be a joint SX1X2 show that this is the set of distributions of S,X1,X2 such distribution of S and the outputs X1 and X2 of κ1 and κ2 that f(S)= AND(X1,X2), where f is as in Example 9. For satisfying the constraints that X1 ← S = κ1 and X2 ← (a,b)=(−1/8,1/16), we recover Example 9. S =κ and that the supportof S equals the input alphabetS 2 To generate the heat map in Figure 1, we randomly sam- of κ and κ . Let ∆ be the set of all joint distributions 1 2 P pled a distribution PSX1X2 from the set of all distributions of the random variables S,X1,X2 (with the same alphabets) of (S,X1,X2) that are compatible with the chosen func- thatarecompatiblewith themarginaldistributionsofPSX1X2 tion f, with Pf(S)X1X2, and which satisfies the condition for the pairs (S,X1) and (S,X2), i.e., ∆P := {QSX1X2 ∈ S − f(S) − X1. We then computed UI to check whether ∆ : QSX1 = PSX1, QSX2 = PSX2, ∀ (S,X1,X2) ∈ ∆}. (X1 ←S) 6(cid:22) (X2 ←S)holdsornot.UI vanishesalongthe In other words, ∆P consists of all joint distributions that are secondary diagonal (b= a/2) when the marginal distributions compatiblewithκ andκ andthathavethesamedistribution 1 2 of the pairs (S,X1) and (S,X2) are identical. For all other for S as PSX1X2. Consider the function tuples (a,b), UI is nonvanishing and achieves its maximum at the corners of the main diagonal for (a,b) = (−1/8,1/16), UI(S;X1\X2):= min IQ(S;X1|X2), Q∈∆P (1/8,−1/16). where I denotes the conditional mutual information evalu- Q ated with respectto the the jointdistributionQ. Thisfunction REFERENCES has the following property: UI(S;X1\X2) = 0 if and only if κ (cid:22) κ [3]. Note that UI depends not only on the two [1] D. Blackwell, “Equivalent comparisons of experiments,” The Annals of 1 2 Mathematical Statistics, vol.24,no.2,pp.265–272, 1953. channelsκ1,κ2,butalsoonthechoiceoftheinputdistribution. [2] J.Ko¨rnerandK.Marton,“Comparisonoftwonoisychannels,”inTopics Computing UI is a convex optimization problem. However, in information theory. Keszthely (Hungary): Colloquia Mathematica the condition number can be very bad, which makes the Societatis Ja´nosBolyai, 1975,vol.16,pp.411–423. problem difficult in practice. [3] N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay, “Quantifying uniqueinformation,” Entropy,vol.16,no.4,pp.2161–2183, 2014. APPENDIX COMPUTING THEBLACKWELL ORDER Given two channelsκ ,κ , how can one decide whether or 1 2 notκ (cid:22)κ ?Theeasiestwayistocheckwhethertheequation 1 2 κ =λ·κ has a solution λ that is a stochastic matrix. In the 1 2 finite alphabet case, this amounts to checking feasibility of