ebook img

Improved Measures of Integrated Information PDF

3.5 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Improved Measures of Integrated Information

Improved Measures of Integrated Information Max Tegmark Dept. of Physics & MIT Kavli Institute, Massachusetts Institute of Technology, Cambridge, MA 02139 and Theiss Research, La Jolla, CA 92037 (Dated: November 21 2016, published in PLoS Comput. Biol. 12 (11): e1005123, doi:10.1371/journal.pcbi.1005123) Although there is growing interest in measuring integrated information in computational and cognitivesystems,currentmethodsfordoingsoinpracticearecomputationallyunfeasible. Existing and novel integration measures are investigated and classified by various desirable properties. A simple taxonomy of Φ-measures is presented where they are each characterized by their choice of factorization method (5 options), choice of probability distributions to compare (3×4 options) and choice of measure for comparing probability distributions (7 options). When requiring the Φ- measurestosatisfyaminimumofattractiveproperties,thesehundredsofoptionsreducetoamere 6 handful,someofwhichturnouttobeidentical. Usefulexactandapproximateformulasarederived 1 that can be applied to real-world data from laboratory experiments without posing unreasonable 0 computational demands. 2 v o I. INTRODUCTION evaluate for large systems, growing super-exponentially N with the system’s information content. This has lead to 9 What makes an information-processing system con- the development of various alternative integration mea- sures that are simpler to compute or have other desir- 2 scious in the sense of having a subjective experience? able properties. For example, Barrett & Seth [13] pro- Although many scientists used to view this topic as be- ] yondthereachofscience, thestudyofNeuralCorrelates posed an attractive integration measure that is easier to C compute from neuroscience data, but whose interpreta- of Consciousness (NCCs) has become quite mainstream N tion is complicated by the fact that it can be negative in the neuroscience community in recent years — see, . e.g., [1, 2]. To move beyond correlation to causation in some cases [14, 15]. [16] used an integration mea- o sure inspired by complexity theory to successfully pre- i [3], neuroscientists have begun searching for a theory of b dict who was conscious in a sample including patients consciousnessthatcanpredictwhatphysicalphenomena - who were awake, in deep sleep, dreaming, sedated and q causeconsciousness(definedassubjectiveexperience[3]) with locked-in syndrome. [17] suggest that state tran- [ to occur. Dehaene [4] reviews a number of candidate sition entropy correlates with consciousness. Griffith & theories currently under active discussion, including the 3 Koch have proposed defining integration of a system as NonlinearIgnitionmodel(NI)[5,6],theGlobalNeuronal v the synergistic information that its parts have about the 6 Workspace (GNW) model [7–9] and Integrated Informa- future, whichappearspromisingalthoughtheredoesnot 2 tion Theory(IIT)[10,11]. Rapidprogressinartificialin- 6 telligence is further fueling interest in such theories and yet exist a unique formula for it [18]. Even the team 2 how they can be generalized to apply not only to bio- behind IIT has updated their integration measure twice 0 through successive refinements of their theory [10, 11]. logical systems, but also to engineered systems such as . Despite these definitional and computational challenges, 1 computers and robots and ultimately arbitrary arrange- 0 ments of elementary particles [12]. interest in measuring integration is growing, not only in 6 neurosciencebutalsoinotherfields,rangingfromphysics Although there is still no consensus on necessary and 1 [12] and evolution [19] to the study of collective intelli- sufficientconditionsforaphysicalsystemtobeconscious, v: thereisbroadagreementthatitneedstobeabletostore gence in social networks [20]. Xi and process information in a way that is somehow inte- grated, not consisting of nearly independent parts. As r a emphasized by Tononi [10], it must be impossible to decompose a conscious system into nearly independent It is therefore interesting and timely to do a compre- parts — otherwise these parts would feel like two sepa- hensive investigation of existing and novel integration rate conscious entities. While integration as a necessary measures, classifying them by various desirable proper- conditionforconsciousnessisratheruncontroversial, IIT ties. Thisisthegoalofthepresentpaper,assummarized goes further and makes the bold and controversial claim inTableIandTableII.Therestofthispaperisorganized thatitisalsoasufficientconditionforconsciousness,us- as follows. In Section II, we investigate general integra- inganelaboratemathematicalintegrationdefinition[11]. tionmeasuresandtheirproperties,presentingourresults Asneurosciencedataimprovesinquantityandquality, inSectionIII.InSectionIV,wederiveusefulformulasfor itistimelytoresolvethiscontroversybytestingthemany many of these measures that can be applied to the sort experimental predictions that IIT makes [11] with state- of time-series data that is typically measured in labora- of-the-artlaboratorymeasurements. Unfortunately,such tory experiments, with continuous variables. We explore testshavebeenhamperedbythefactthattheintegration furtheralgorithmicspeedupsandapproximationsinSec- measureproposedbyIITiscomputationallyinfeasibleto tion V and summarize our conclusions in Section VI. 2 φ2.5 φ2.5(cid:48) φ2.5(cid:48)(cid:48) φ3.0 φM φB φMD φM φoak φopk φots φofu φnas φmas φxfk kk(cid:48) Alwaysnon-negative y y y y y N y y y y y y y y y or Alwaysfiniteevenfor∞-dimensionalsystem N y y N y y y y y y y y N y y j a Vanishesfordeterministicsystem(drawback) n n n n n n n n n n n n n n Y M Vanishesforseparablesystem y y y y y N y y y y N y y y y Vanishesforafferentsystem y y y y N N N N y N N N y y N orVanishesforefferentsystem y y y y N N N N N y N N N N N n MiState-dependent y y y y N N N y y y N N N N y Basedonsymmetricprobabilitydistance N N N y N N N N N N N N N N N Intuitivelyinterpretatable 2 2 2 2 2 0 2 2 2 2 0 1 0 0 0 Computationallytractable 1 2 2 0 2 2 2 2 2 2 2 2 1 2 2 TABLE I: Properties of different integration measures. All but the third are desirable properties; capitalized N/Y (no/yes) indicatewhenanintegrationmeasurelacksadesirablepropertyorhasanundesirableone. Thefirstfourpropertiesaregenerally agreedtobeimportant,whilethesecondsetoffourhavebeenarguedtobeimportantbysomeauthors. Interpretabilityrefers to the extent to which the measure can be given an information-theoretic interpretation satisfying desirable properties of integration (see text). Computability refers to the feasibility of evaluating the measure in practice (see text). II. MEASURES OF INTEGRATION Following Tononi [10], we will use the symbol Φ to denoteintegratedinformation.1 AllmeasuresofΦaimto Probability Probability quantify the extent to which a system is interconnected, distribution p : distribution p : 0 1 yielding Φ=0 if the system consists of two independent parts,andalargerΦthemorethepartsaffecteachother. Mathematically,allΦ-measuresaredefinedinatwo-step xA xA process: 0 1 Markov 1. Given an imaginary cut that partitions the system process M into two parts, define a measure φ of how much xB xB these two parts affect each other. Table II lists 0 1 many φ-options. p = M p 1 0 2. Define Φ as the φ-value for the “cruelest cut” that measures inability to tensor factorize M = MA MB minimizes φ. A major numerical challenge is that the number of cuts to be minimized over grows super-exponentially with the number of bits in the FIG. 1: We model the time-evolution of the system state as system. A further challenge in this step is how to a Markov process defined by a transition matrix M: when the (possibly unknown) system state evolves from x to x , best handle cuts splitting the system into parts of 0 1 thecorrespondingprobabilitydistributionevolvesfromp to unequal size. 0 p ≡Mp . AllcompetingdefinitionsofΦquantifytheinabil- 1 0 itytotensorfactorizeM,whichcorrespondstoapproximating Before delving into the many different options for the system as two disconnected parts A and B that do not definingΦ,letusfirstintroduceconvenientnotationgen- affect one another. eral enough to describe all proposed integration mea- sures, as illustrated in Figure 1. them as the state of a time-dependent system x(t) at two separate times t and t . For example, if these are A. Interpreting evolution as a Markov process 0 1 twovectorsof5bitseach,thenpisatableof210numbers giving the probability of each possible bit string, while if Consider two random vectors x and x whose joint 0 1 these are two vectors in 3D space, then p is a function probability distribution is p(x ,x ). We will interpret 0 1 of 6 real continuous variables. We obtain the marginal distribution p(n)(x ) for the nth vector, where n = 0 or n n=1, by summing/integrating p over the other vector. 1 Notethatouranalysisisfocusedonlyonintegration,notoncon- Below we will often find it convenient to denote these sciousness; besides integration, a true measure of consciousness vectorsassingleindicesi=x0 andj =x1. Forexample, may involve additional requirements that this paper does not this allows us to write the marginal distribution p (x ) 0 0 consider. Forexample,ScottAaronsonhascriticizedinawidely as (cid:80) p , where the sum over j is to be interpreted as readblogposttheclaimthatintegrationisasufficientcondition j ij summation/integration over all allowed values of x . We for consciousness, and IIT discusses postulates including cause- 1 effectpower,compositionandexclusion[11]. alsoadoptthenotationwherereplacinganindexbyadot 3 Name Definition FormulaforGaussianvariables φotu(φM) I(xA,xB)−I(xA,xB) 1log|TA||TB||C| = 1log|Σ(cid:98)A||Σ(cid:98)B| 0 0 2 |T||CA||CB| 2 |Σ| φB I(x0,x1)−I(xA0,xA1)−I(xB0,xB1) 12log||TC|||2C|AT|A2|||CTBB||2 = 12log||ΣC||||CΣ(cid:98)AA|||Σ(cid:98)CBB|| φotum (φMD) dMD(p,q), qii(cid:48)jj(cid:48) = pii(cid:48)p··ip··i··pj··ip(cid:48)··i·(cid:48)·j(cid:48) 12(cid:104)ln||CΣ(cid:98)(cid:98)|| +tr(cid:16)C(cid:98)−1C−Σ(cid:98)−1Σ−A´tΣ(cid:98)−1A´C(cid:17)(cid:105)forβ=1,C(cid:98) ≡A(cid:98)CA(cid:98)t+Σ(cid:98) φots I(xA,xB) 1log|TA||TB| 2 |T| φofs I(xA,xB) 1log|CA||CB| 1 1 2 |C| φofu −j(cid:80)j(cid:48)p··jj(cid:48)log(cid:80)ii(cid:48) ppii··j···pp··ii(cid:48)(cid:48)··j·(cid:48)pp··ijij(cid:48)·(cid:48)· 12(cid:104)ln||CCq|| +trC−q1C−n(cid:105), Cq ≡A(cid:98)CA(cid:98)t+Σ(cid:98) φokfkk(cid:48)(φMkk(cid:48)) j(cid:80)j(cid:48) ppkkkk(cid:48)(cid:48)j·j·(cid:48) logppkkkk(cid:48)(cid:48)·j·jp(cid:48)kp(cid:48)k·j····pp·k·k(cid:48)(cid:48)·j··(cid:48) 12(cid:104)xt0(A(cid:98) −A)Σ(cid:98)−1(A(cid:98) −A)x+ln|ΣA||Σ|Σ| B| +trΣ(cid:98)−1Σ−n(cid:105) φokakk(cid:48) (cid:80)j ppkkkk(cid:48)(cid:48)j··· logppkkkk(cid:48)(cid:48)j···ppkk··j··· 12(cid:104)∆mtΣ¯−A1∆m+ln||ΣΣ¯AA|| +trΣ¯−A1ΣA−nA(cid:105), ∆m≡(AA−A(cid:98)A)xA0 +ABxB0 (cid:20) (cid:21) φokpkk(cid:48) (cid:80)i ppi···kkkk(cid:48)(cid:48) logppi···kkkk(cid:48)(cid:48)ppi···kk·· 12 ∆mtΣ˜¯−A1∆m+ln||ΣΣ˜˜¯AA|| +trΣ˜¯−A1Σ˜A−nA , ∆m≡(A˜A−Aˆ˜A)xA0 +A˜BxB1 φxkfkk(cid:48) I(xA1,xB1|x0=kk(cid:48)) 12log|ΣA|Σ||Σ|B| φnas −(cid:80)j p··j·log(cid:80)ii(cid:48) nBpipi(cid:48)iji(cid:48)··p·ip·····j· ∞ φnak −(cid:80)pk·j· log(cid:80) pki(cid:48)j·pk··· ∞ k j pk··· i(cid:48) nBpki(cid:48)··pk·j· φnps −(cid:80)i pi···logj(cid:80)j(cid:48) nBpip·j··jj(cid:48)jp(cid:48)·p·ji···· ∞ φnpk (φ2.0) −(cid:80)pi·k· log(cid:80) pi·kj(cid:48)p··k· ∞ k k i p··k· j(cid:48) nBp··kj(cid:48)pi·k· φmas −(cid:80)j p··j·log(cid:80)ii(cid:48) piip(cid:48)jii·(cid:48)p·i··p····pj··i(cid:48)·· 12(cid:104)ln||CCAq|| +trC−q1CA−nA(cid:105), Cq ≡ΣA+AACAAtA+AABCBAtAB φmkak −(cid:80)j ppkk··j··· log(cid:80)i(cid:48) pkip(cid:48)kji·(cid:48)p·k·p··k··pj··i(cid:48)·· 12(cid:104)∆mtΣ¯−A1∆m+ln||ΣΣ¯AA|| +trΣ¯−A1ΣA−nA(cid:105), ∆m≡(A(cid:98)A−AA)xA0 φmps −(cid:80)i pi···logj(cid:80)j(cid:48) pi·jpj··(cid:48)jpj·(cid:48)·jp·ip······j(cid:48) 12(cid:104)ln||CCAq|| +trC−q1CA−nA(cid:105), Cq ≡Σ˜A+A˜ACAA˜tA+A˜ABCBA˜tAB (cid:20) (cid:21) φmkpk −(cid:80)i ppi···kk·· log(cid:80)j(cid:48) pi·pk·j·(cid:48)kpj·(cid:48)·pki··pk····j(cid:48) 12 ∆mtΣ˜¯−A1∆m+ln||ΣΣ˜˜¯AA|| +trΣ˜¯−A1Σ˜A−nA , ∆m≡(Aˆ˜A−A˜A)xA1 φ2.5 min{φnak,φnpk} ∞ φ2.5(cid:48) min{φmak,φmpk} φ2.5(cid:48)(cid:48) min{φoak,φopk} TABLE II: Integration φ for different measures. A≡BtC−1, A ≡BC−1 =ΣAtΣ−1, Σ≡C−BtC−1B=C−ACAt and b b Σ ≡C−BC−1Bt =C−A CAt =[C−1+AtΣ−1A]−1 =C−CAtC−1AC. C is the data covariance matrix and B is the b b b cross-covariance between different times as defined by equation (46). means that this index is to be summed/integrated over. and p(1), this Markov process is defined by Thisletsuswritethemarginaldistributionsp(0)(x )and 0 p(1) =Mp(0), (2) p(1)(x ) as 1 where the Markov matrix M specifies the probability ji p(0) =p and p(1) =p (1) that a state i transitions to a state j, and satisfies the i i· j ·j conditions M ≥ 0 (non-negative transition probabili- ji ties)andM =1(unitcolumnsums,guaranteeingprob- ·i As illustrated in Figure 1, it is always possible to ability conservation). The standard rule for conditional model this relation between x and x as resulting from probabilities gives 0 1 a Markov process, where x is causally determined by a 1 p = P(x =i&x =j)= combination of x and random effects. If we write the ij 0 1 0 marginal distributions from equation (1) as vectors p(0) = P(x =i)P(x =j|x =i)=p(0)M , (3) 0 1 0 i ji 4 which uniquely determines the Markov matrix as If our system is integrated so that M cannot be fac- tored as in equation (7), we can nonetheless choose to p p M = ij = ij, (4) approximate M by a matrix of the factorizable form ji p(0) pi· MA ⊗ MB. If we retain the initial probability distri- i bution p for x but replace the correct Markov ma- ii(cid:48)·· 0 whichisseentosatisfytheMarkovrequirementsMji ≥0 trix M by the separable approximation MA⊗MB, then and M =1. ·i equation (6) shows that the probability distribution Note that any system obeying the laws of classical physics can be accurately modeled as a Markov process p =M p (8) ii(cid:48)jj(cid:48) jj(cid:48)ii(cid:48) ii(cid:48)·· as long as the time step ∆t≡t −t is sufficiently short 1 0 (definingx(t)asthepositioninphasespace). Ifthepro- gets replaced by the probability distribution q given ii(cid:48)jj(cid:48) cess has “memory” such that the next state depends not by only on the current state but also on some finite number of past states, it can reformulated as a standard memo- q =MAMB p (9) ii(cid:48)jj(cid:48) ji j(cid:48)i(cid:48) ii(cid:48)·· rylessMarkovprocessbysimplyexpandingthedefinition of the state x to include elements of the past.2 which is an approximation of pii(cid:48)jj(cid:48). If M is factoriz- able(meaningthatthereisnointegration),wecanfactor M such that the two probability distributions q and ii(cid:48)jj(cid:48) B. A taxonomy of integration measures pii(cid:48)jj(cid:48) are equal and, conversely, if the two probability distributions are different, we can use how different they We will now see that this Markov process interpreta- are as an integration measure φ. tion allows us create a simple taxonomy of integration To define an integration measure φ in this spirit, we measures φ that quantify the interaction between two thus need to make four different choices, which collec- subsystems. TheideaistoapproximatetheMarkovpro- tivelyspecifyitfullyanddeterminewheretheφ-measure cess by a separable Markov process that does not mix belongs in our taxonomy: information between subsystems, and to define the inte- 1. Choose a recipe defining an approximate factoriza- gration as a measure of how bad the best such approxi- tion M≈MA⊗MB. mation is. Consider the system x as being composed of two subsystems xA and xB, so that the elements of the 2. Choose which probability distributions p and q to vector x are simply the union of the elements of xA and compare for exact and approximate M (the distri- xB, and let us define the probability distribution bution for x, x or xA, say). 1 1 pii(cid:48)jj(cid:48) ≡P(xA0 =i&xB0 =i(cid:48)&xA1 =j&xB1 =j(cid:48)). (5) 3. Choose what to treat as known about pii(cid:48)·· when computing these probability distributions. (For brevity, we will sometimes refer to this distribution pii(cid:48)jj(cid:48) as simply p below, suppressing the indices, and we 4. Chooseametricforhowdifferentthetwoprobabil- willsometimeswritexwithoutindicestorefertothefull ity distributions p and q are. stateatbothtimes.) TheMarkovmatrixofequation(4) then takes the form ThesefouroptionsaredescribedinTablesIII,IVandV, and we will now explore them in detail. p M = ii(cid:48)jj(cid:48). (6) jj(cid:48)ii(cid:48) p ii(cid:48)·· The Markov process of equation (2) is separable if the C. Options for approximately factoring M Markov matrix M is a tensor product MA⊗MB, i.e., if Table III lists five factoring options which all have at- Mjj(cid:48)ii(cid:48) =MjAiMjB(cid:48)i(cid:48) (7) tractive features, and we will now describe each in turn. for Markov matrices MA and MB that determine the evolution of xA and xB. 1. Approximately factoring M using noising The first option corresponds to the “noising” method usedinIIT[10]: thetimeevolutionofonepartofthesys- 2 NotethatalthoughfullknowledgeoftheMarkovmatrixMcom- tem(xA,say)isdeterminedfromthepaststatexAalone, pletelyspecifiesthedynamicsofthesystem,apersonwishingto treatingxB asrandomnoisewithsomeprobabilit0ydistri- compute its integration may not know M exactly. If M is not 0 bution p(B0) that is independent of xA. In other words, known from having built the system or having examined its in- 0 ner workings, then passively observing it in action (without ac- wereplacetheinitialprobabilitydistributionp(0) =p ii(cid:48) ii(cid:48)·· tiveinterventions)maynotprovideenoughinformationtofully by the separable distribution p(0) = p(A0)p(B0). We will reconstruct M [21]. Section IV describes a convenient class of ii(cid:48) i i(cid:48) systemswhereMisrelativelyeasytodetermine. now see that if we start with equation (2), i.e., the 5 Code Factorizationmethod MA MB M˜A M˜B State-dependent? ji j(cid:48)i(cid:48) ij i(cid:48)j(cid:48) n Noising 1 (cid:80)pii(cid:48)j· 1 (cid:80)pii(cid:48)·j(cid:48) 1 (cid:80)pi·jj(cid:48) 1 (cid:80)p·i(cid:48)jj(cid:48) N nB i(cid:48) pii(cid:48)·· nA i pii(cid:48)·· nB j(cid:48) p··jj(cid:48) nA j p··jj(cid:48) m Mildnoising (cid:80)pii(cid:48)j·p·i(cid:48)·· (cid:80)pii(cid:48)·j(cid:48)pi··· (cid:80)pi·jj(cid:48)p···j(cid:48) (cid:80)p·i(cid:48)jj(cid:48)p··j· N i(cid:48) pii(cid:48)·· i pii(cid:48)·· j(cid:48) p··jj(cid:48) j p··jj(cid:48) o Optimalnotknowingstatex0 ppii··j··· pp··ii(cid:48)(cid:48)··j·(cid:48) ppi···jj·· pp··i·(cid:48)··jj(cid:48)(cid:48) N x Optimalgivenx0 ppkkkk(cid:48)(cid:48)j··· ppkkkk(cid:48)(cid:48)··j·(cid:48) ppkkkk(cid:48)(cid:48)j··· ppkkkk(cid:48)(cid:48)··j·(cid:48) Y a Optimalgivenx0,onaverage ppii··j··· pp··ii(cid:48)(cid:48)··j·(cid:48) ppi···jj·· pp··i·(cid:48)··jj(cid:48)(cid:48) N TABLE III: Different options for approximate factorizations M≈MA⊗MB and M˜ ≈M˜A⊗M˜B. These options correspond to the first superscript in φ-measures such as φofuk. The optimal factorizations maximize the accuracy of the approximate probabilitydistributionthattheypredict,whilethe“noising”factorizationsareinsteaddefinedbytreatingtheinputfromthe other subsystem as random noise, either with uniform distribution (option “n”) or with the observed marginal distribution (option “m”). Markov equation p(1) = Mp(0), then this noising pre- (its entropy is lower) than with the previous noising op- scription gives p(A1) =MAp(A0) for a particular matrix tion. TableIIIliststheMA-matrixcorrespondingtothis MA. Equation (2) states that mild noising choice as well as the analogous MB-matrix. p(1) =(cid:88)M p(0), (10) jj(cid:48) jj(cid:48)ii(cid:48) ii(cid:48) ii(cid:48) 3. Optimally factoring M andsubstitutingtheseparable“noising”formofp(0)from ii(cid:48) above gives Adrawbackofbothfactorizationsthatwehaveconsid- ered so far is that they might overestimate integration: p(A1) ≡p(1) =(cid:88)M p(A0)p(B0) =(cid:88)MAp(A0), theremayexistanalternativefactorizationthatisbetter j j· j·ii(cid:48) i i(cid:48) ji i ii(cid:48) i in the sense of giving a smaller φ. The natural way to (11) remedythisproblemistodefineφbyminimizingoverall where we have defined factorizations. This elegantly unifies with the fact that capital Φ is defined by minimizing over all partitions of MA ≡(cid:88)M p(B0) =(cid:88)pii(cid:48)j·p(i(cid:48)B0). (12) thesystemintotwoparts: wecancapturebothminimiza- ji j·ii(cid:48) i(cid:48) pii(cid:48)·· tionsbysimplysaying“minimizeoverallfactorizations”, i(cid:48) i(cid:48) sincethechoiceofatensorfactorizationincludesachoice IIT chooses the noise to have maximum entropy, i.e., a of partition. uniform distribution over the n possible states of sub- B In practice, the definition of the optimal factorization system B [10]: dependsonwhatweoptimize. Wediscussvariousoptions 1 below, and identify three particularly natural choices p(B0) = . (13) i(cid:48) n which are listed in Table III. The first option makes the B approximate probability distribution q as similar as ii(cid:48)jj(cid:48) TableIIIliststheMA-matrixcorrespondingtothisnois- possible to p , where similarity is quantified by KL- ii(cid:48)jj(cid:48) ing choice as well as the analogous MB-matrix. divergence. The second option treats the present state x as known and makes the conditional probability dis- 0 tribution for the future state x as similar as possible to 1 2. Approximately factoring M using mild noising the correct distribution. This factorization thus depends onthestateandhenceontime,whereasalltheotherswe One drawback of this choice is that uniform distribu- have considered are state-independent. The third option tionsareundefinedforcontinuousvariablessuchasmea- is the factorization that minimizes this state-dependent suredvoltages, becausetheycannotbenormalized. This φ on average; we will prove below that this factorization means that any φ-measure based on this noising factor- is identical to the first option. ization is undefined and useless for continuous systems. In summary, Table III lists five factorization options This problem can be solved by adopting another natural that each have various attractive features; options 3 and choice for the noise distribution: 5 turn out to be identical. It is easy to show that if the Markov matrix M is factorizable (which means that the p(i(cid:48)B0) =p·i(cid:48)··, (14) probability distribution is separable as pii(cid:48)jj(cid:48) =pAijpBi(cid:48)j(cid:48)), i.e., simply the marginal distribution for xB. We term then all five factorizations coincide, all giving MA = 0 ji thisoption“mildnoising”,sincethenoiseislessextreme pA/pA and MB = pA /pA. This means that they will ji i· j(cid:48)i(cid:48) i(cid:48)j(cid:48) i(cid:48)· 6 all agree on when φ=0; otherwise the noising factoriza- where equations (6) and (9) get replaced by tions will yield higher φ than an optimized factorization. p M˜ = ii(cid:48)jj(cid:48) (16) ii(cid:48)jj(cid:48) p ··jj(cid:48) D. Options for which probability distributions to and compare q˜ =M˜AM˜B p(1). (17) ii(cid:48)jj(cid:48) ij i(cid:48)j(cid:48) jj(cid:48) TableIVlistsfouroptionsforwhichprobabilitydistri- This time reversal symmetry doubles the number of q- butions p and q to compare. Arguably the most natural options we could list in Table IV to six in total, aug- option is to simply compare the full distributions p ii(cid:48)jj(cid:48) menting q , q and q by q˜ , q˜ and q˜ . In and q that describe our knowledge of the system at ii(cid:48)jj(cid:48) ··jj(cid:48) ··j· ii(cid:48)jj(cid:48) ··jj(cid:48) ··j· ii(cid:48)jj(cid:48) the interest of brevity, we have chosen to only list q˜ , both times (the present state and the future state). An- ··j· becauseofitsabilitytokillφforefferentpathways—the other obvious option is to merely compare the predic- formulas for the two omitted options are trivially analo- tions, i.e., the probability distributions p and q ··jj(cid:48) ··jj(cid:48) gous to those listed. forthefuturestate. Athirdinterestingoptionistocom- paremerelythepredictionsforoneofthetwosubsystems (which we without loss of generality can take to be sub- E. Options for what to treat as known about the system A), thus comparing p and q . ··j· ··j· current state Generally, thelesswecompare, theeasieritistogeta low φ-value. To see this, consider a system where A af- Abovewelistedoptionsforwhichprobabilitiespandq fectsB butB hasnoeffectonA. Wecould,forexample, to compare to compute φ. To complete our specification consider A to be photoreceptor cells in your retina and of these probabilities, we need to choose between vari- B tobetherestofyourbrain. Thenthesecondcompar- ous options for our knowledge of the present state; the ison option (“f”) in Table IV would give φ > 0 because threerightmostcolumnsofTableIVcorrespondtothree we predict the future of your brain worse if we ignore interesting choices. the information flow from your retina, while the third The first option is where the state is unknown, de- comparison option (“a”) in the table would give φ = 0 scribed simply by the probability distribution we have because the rest of your brain does not help predict the used above: future of your retina. In other words, comparison option “a” makes φ vanish for afferent pathways, where infor- p(0) =p (18) ij ii(cid:48)·· mation flows only inward toward the rest of the system. IITarguesthatanygoodφ-measureindeedshouldvan- This corresponds to us knowing M, the mechanism by ish for afferent pathways, because a system can only be whichthestateevolves,butnotknowingitscurrentstate conscious if it can have effects on itself — other systems x . Note that a generic Markov process eventually con- 0 that it is affected by without affecting will act merely as verges to a unique stationary state p = p(0) = p(1) parts of its unconscious outside world [10]. Analogously, which, since it satisfies Mp = p, can be computed di- IIT argues that any good φ-measure should vanish also rectlyfromMastheuniqueeigenvectorwhoseeigenvalue for efferent pathways, where information flows only out- is unity.3 This means that if we consider a system that ward away from the rest of the system. The argument has been evolving for a significantly long time, its full is that other systems that the conscious system affects two-time distribution pii(cid:48)jj(cid:48) is determined by M alone; withoutbeingaffectedbywillagainbeunconscious, act- conversely, pii(cid:48)jj(cid:48) determines M through equation (6). ing merely as unconscious parts of the outside world as Alternatively, if pii(cid:48)jj(cid:48) is measured empirically from a far as the conscious system is concerned. time-series x which is then used to compute M, we can t Option“p”inTableIVhasthispropertyofφvanishing use equation (18) to describe our knowledge of the state for efferent pathways. It is simply the time-reverse of at a random time. option“a”,quantifyingtheabilityofxA todetermineits A second option is to assume that we know the ini- 1 past cause xA instead of quantifying the ability of xA to tial probability distributions for xA and xB, but know 0 0 0 0 determine its future effect xA. nothing about any correlations between them. This cor- 1 To formalize this, consider that there is nothing in the respondstoreplacingequation(18)bytheseparabledis- probability distribution p that breaks time-reversal tribution ii(cid:48)jj(cid:48) symmetry and says that we must interpret causation as p(0) =p p , (19) going from t to t rather than vice versa. In complete ij i··· ·i(cid:48)·· 0 1 analogy with our formalism above, we can therefore de- fine a time-reversed Markov process M˜ whereby the fu- turedeterminesthepastaccordingtothetime-reverseof 3 The only Markov processes that do not converge to a unique equation (2): steady state are ones where M has more than one eigenvalue equaltounity;theseformasetofmeasurezeroonthesetofall p(0) =M˜p(1), (15) Markovprocesses. 7 Conditioningoption u s k xt unknown xt-distributionseparable xt known Code Comparisonoption p q qu (p(ij0)=pii(cid:48)··) qs (p(ij0)=pi···p·i(cid:48)··) qk (p(ij0)=δikδi(cid:48)k(cid:48)) t Two-timestate pii(cid:48)jj(cid:48) qii(cid:48)jj(cid:48) MjAiMjB(cid:48)i(cid:48)p(ii0(cid:48)) (cid:16)MjAip(i·0)(cid:17)(cid:16)MjB(cid:48)i(cid:48)p(·i0(cid:48))(cid:17) MjAiδikMjB(cid:48)i(cid:48)δi(cid:48)k(cid:48) (cid:18) (cid:19)(cid:32) (cid:33) f Futurestate p··jj(cid:48) q··jj(cid:48) (cid:80)MjAiMjB(cid:48)i(cid:48)p(ii0(cid:48)) (cid:80)MjAip(i·0) (cid:80)MjB(cid:48)i(cid:48)p(·i0(cid:48)) MjAkMjB(cid:48)k(cid:48) ii(cid:48) i i(cid:48) a FuturestateofsubsystemA p··j· q··j· (cid:80)MjAip(i·0) (cid:80)MjAip(i·0) MjAk i i p PaststateofsubsystemA pi··· q˜i··· (cid:80)M˜iAjp(j1·) (cid:80)M˜iAjp(j1·) M˜iAk j j TABLE IV: Different options for which probability distributions p and q to compare, corresponding to the second and third superscripts in φ-measures such as φofuk. The last three columns specify the formula for q for the three conditioning options we consider: when the state x is unknown (u), has a separable probability distribution (s) and is known (k), respectively. 0 andcanbeadvantageousforφ-measuresthatwouldcon- integrationφtoquantifyhowdifferenttheyarefromone flateintegrationwithinitialcorrelationsbetweenthesub- another: systems. A third option, advocated by IIT [10], is to treat the φ≡d(p,q) (22) current state as known: for some distance measure d that is larger the worse q p(0) =δ δ , (20) approximates p. There are a number of properties that ij ik i(cid:48)k(cid:48) we may consider desirable for d to quantify integration: i.e., we know with certainty that the current state x = 0 kk(cid:48) for some constants k and k(cid:48). IIT argues that this is 1. Positivity: d(p,q) ≥ 0, with equality if and only if p=q. the correct option from the vantage point of a conscious system which, by definition, knows its own state. 2. Monotonicity: The more different q is from p in A natural fourth option is a more extreme version of some intuitive sense, the larger d(p,q) gets. thefirst: treatingthestatenotmerelyasunknown, with p(0) given by its ensemble distribution, but completely 3. Interpretability: d(p,q) can be intuitively inter- unknown, with a uniform distribution: preted, forexampleintermsofinformationtheory. p(0) =constant. (21) 4. Tractability: d(p,q) is easy to compute numeri- ij cally. Ideally, the optimal factorizations from Sec- Although straightforward enough to use in our formulas, tion IIC can be found analytically rather than wehavechosennottoincludethisoptioninTableIVbe- through time-consuming numerical minimization. causeitisratherinappropriateformostphysicalsystems. 5. Symmetry: d(p,q)=d(q,p). Forcontinuousvariablessuchasvoltages,itbecomesun- defined. For brains, such maximum-entropy states never Anydistancemeasuredmeetsthemathematicalrequire- occur: they would have typical neurons firing about half mentsofbeingametriconthespaceofprobabilitydistri- the time, corresponding to much more extreme “on” be- butions if it obeys positivity, symmetry and the triangle havior than during an epileptic seizure. The related op- inequality d(p,q)≤d(p,r)+d(r,q). tion of consistently treating xA as known but xB as un- 0 0 Table V lists seven interesting probability distribution knownwhenpredictingxA (andviceversawhenpredict- 1 distance measures d(p,q) from the literature together ing xB) corresponds to the noising factorization options 1 with their definitions and properties. All these measures describedabove. Forfurtherdiscussionofthis,including are seen to have the positivity and monotonicity, and all so-called “noising at the connection”, see [11, 14, 22]. except the first are also symmetric and true metrics. We Finally, please note that if we choose to determine the will now discuss them one by one in greater detail. pastratherthanthefuture(the“p”-optionfromthepre- The distance d is the Kullback-Leibler divergence, KL vioussectionandTableIV),thenallthechoiceswehave andmeasureshowmanybitsofinformationarelostwhen described should be applied to p(1) rather than p(0). q is used to approximate p, in the sense that if you de- ij ij veloped an optimal data compression algorithm to com- press data drawn from a probability distribution q, it F. Options for comparing probability distributions would on average require d (p,q) more bits to com- KL press data drawn from a probability distribution p than Theoptionsinthepastthreesectionsuniquelyspecify if the algorithm had been optimized for p [23]. This has two probability distributions p and q, and we want the been argued to be the be the best measure because of 8 Code Metric Definition Positivity Monotonicity Interpretability Tractability Symmetry k dKL(p,q) (cid:80)α pαlogpqαα Y Y Y Y N 1 d1(p,q) (cid:80)|pα−qα| Y Y (Y) N Y α (cid:20) (cid:21)1/2 2 d2(p,q) (cid:80)(pα−qα)2 Y Y (N) (Y) Y α h dH(p,q) cos−1(cid:80)(pαqα)1/2 Y Y Y N Y α s dSJ(p,q) (cid:20)(cid:80)α (cid:16)p2α logpα2p+αqα + q2α logpα2+qαqα(cid:17)(cid:21)1/2 Y Y Y N Y e dEM(p,q) min (cid:80)fαβdαβ; fα·=pα,f·β =qβ Y Y Y N Y fαβ≥0αβ m dMD(p,q) Seeequations(24)-(25) Y Y Y N N TABLE V: Different options for measuring the difference d between two probability distributions p and q: Kullback-Leibler divergence d , L -norm d , L -norm d , Hilbert-space distance d , Shannon-Jensen distance d , Earth-Movers distance KL 1 1 2 2 H SJ d and Mismatched Decoding distance d . These options correspond to the fourth superscript in φ-measures such as EM MD φofuk. In the text, we considered options where p and q had one, two or four indices, but in this table, we have for simplicity combined all indices into a single Greek index α. its desirable properties related to information geometry bit string, d may be chosen to be the L “Manhattan ij 1 [24, 25]. distance”, i.e., the number of bit flips required to trans- d and d measure the distance between the vectors form one bit string into another. IIT 3.0 argues that 1 2 p and q using the L -norm and L -norm, respectively. the earth mover’s distance d is the most appropriate 1 2 EM The former is particularly natural for probability dis- measure d on conceptual grounds (whereas IIT 2.0 was tributions p, since they all have L norm of unity: still implemented using d ). Unfortunately, d rates 1 KL EM d (0,p)=p =1. It is easy to see that 0≤d (p,q)≤2 poorly on the tractability criterion. It’s definition in- 1 . √ 1 and 0≤d (p,q)≤ 2. volves a linear programming problem which needs to be 2 The measure d is the Hilbert-space distance: if, for solved numerically, and even with the fastest algorithms H each probability distribution, we define a corresponding currently available, the computation grows faster than wavefunction ψ ≡ p1/2, then all wavefunctions lie on a quadraticallywiththenumberofsystemstates—which i i inturngrowsexponentiallywiththenumberofbits. For unithyperspheresincetheyallhaveunitlength: (cid:104)ψ|ψ(cid:105)= continuous variables x, the number of states and hence p =1. Thedistance d is simplythe angle between two . H the computational time is formally infinite. wavefunctions, i.e.., the distance along the great circle The measure d is based on “mismatched decod- onthehyperspherethatconnectsthetwo, sod (p,q)≤ MD H ing” as advocated by [15]. The distance measure d π/2. ItisalsothegeodesicdistanceoftheFishermetric, MD is defined not for all probability distributions, but for all hence a natural “coordinate free” distance measure on distributionsovertwovariables,whichwecanwritewith the manifold of all probability distributions. two indices as p : The measure d is the Shannon-Jensen distance, ij SJ whose square is defined as the average of the KL- d (p,q)≡I(p)−maxI∗(p,q,β), (24) MD divergences of the two distributions to their average: β where d (p,[p+q]/2)+d (q,[p+q]/2) d (p,q)2 ≡ KL KL (cid:88) (cid:88) SJ 2 I(p) ≡ − p·jlogp·j + pijlogpj|i, (25) = S[(p+q)/2]−(S[p]+S[q])/2. (23) j ij (cid:88) (cid:88) (cid:88) I∗(p,q,β) ≡ − p log qβ p + p logqβ , It is bounded by 0 ≤ d ≤ 1, satisfies the triangle in- ·j j|i i· ij j|i SJ j i ij equality and is information-theoretically motivated [26]. The measure dEM is the Earth-Movers distance [27]. and the conditional distribution qj|i ≡ qij/qi·. Here If we imagine piles of earth scattered across the space I(p) is simply the mutual information between the two x, with p(x) specifying the fraction of the earth that is variables, since combining equation (34) with the con- in each location, then d is the average distance that ditional entropy definition from equation (37) gives the EM youneedtomoveearthtoturnthedistributionp(x)into well-knownequivalentexpressionformutualinformation q(x). The quantity d in the definition in Table V spec- ij I(A,B)=S(A)−S(A|B). (26) ifies the distance between points i and j in this space. For example, if x is a 3D Euclidean space, this may be I∗(p,q,β) can be interpreted as the amount of infor- chosen to be simply the Euclidean metric, while if x is a mation that one variable predicts about the other if the 9 correct conditional distribution p is replaced by a pos- As mentioned, numerical tractability is a key issue for j|i sibly incorrect one qβ (renormalized to sum to unity) integration measures. This means that it is valuable if j|i theLagrangeminimizationcanberapidlysolvedanalyt- when making the prediction [28]. This renormalization ically rather than slowly by numerical means, since this is strictly speaking unnecessary, because it cancels out needstobedoneseparatelyforlargenumbersofpossible between the two terms in I∗(p,q,β). Raising probabili- system partitions. There is only one d-option out of the ties to positive powers β has the effect of concentrating above-mentioned five for which I have been able to solve them (decreasing entropy) if β > 1 and spreading them the optimization over M-factorizations analytically: the more evenly (increasing entropy) if β < 1. It can be KL-divergenced . Therunner-upfortractabilityisd , shown that I∗(p,q,β) ≤ I(p) with equality for q = p KL 2 forwhicheverythingcanbeeasilysolvedanalyticallyex- and β = 1, and that I∗(p,q,β) ≥ 0, so one always ceptforafinalcolumnnormalizationstep,buttheresult- has 0 ≤ d (p,q) ≤ I(p) [28]. Mismatched decoding MD ing formulas are cumbersome and unilluminating, falling can presumably be further generalized by replacing the fouloftheinterpretabilitycriterion. Althoughd lacks maximizationoverpowerspβ bymaximizationoverarbi- KL the symmetry property, it has the above-mentioned pos- trary monotonically increasing functions f(p) that map itivity, monotonicity and interpretability properties, and the unit interval onto itself. we will now show that it also has the tractability prop- The integration measures of IIT3.0 have a more com- erty. plex probability comparison that cannot be fully cast in Let us begin with the q-options in the upper left cor- the form of a simple function of d(p,q): it makes the ner of Table IV, i.e., comparing the two-time distribu- metricchoiced(p,q)=d (p,q),butconsidersnotonly EM tions treating the present state as unknown. Substitut- probability distributions for the whole system and a bi- ing equation (9) into the definition of d from Table V partition, but also for all possible subsets, providing an KL gives elaborate interpretation of the results in terms of “con- ceptual structures” [11]. d (p,q)= (cid:88) p log pii(cid:48)jj(cid:48) = (29) KL ii(cid:48)jj(cid:48) p MAMB ii(cid:48)jj(cid:48) ii(cid:48)·· ji j(cid:48)i(cid:48) (cid:88) (cid:88) III. TAXONOMY RESULTS S(x )−S(x)− p logMA− p logMB , 0 i·j· ji ·i(cid:48)·j(cid:48) j(cid:48)i(cid:48) ij i(cid:48)j(cid:48) A. Optimal factorization with d KL where the entropy for a random variable x with proba- bility distribution p is given by Shannon’s formula [29] Our taxonomy of integration measures is determined by four choices: of factorization, variable selection, con- S(x)=−(cid:88)p logp . (30) i i ditioning and distance measure. Although we have now i explored these four choices one at a time, there are im- portant interplays between them that we must examine. To avoid a profusion of notation, we will often write as Firstofall,thethreeoptimalfactorizationoptionsinTa- theargumentofSarandomvariableratherthanitsprob- ble III depend on what is being optimized, so let us now abilitydistribution. Forconvenience,wewilltakealllog- explore which of these optimizations are feasible and in- arithmstobeinbase2fordiscretedistributions(sothat teresting to perform in practice and let us find out what entropies are measured in units of bits) and in base e for the corresponding factorizations and φ-measures are. continuous Gaussian distributions (so that equations get The mathematics problem we wish to solve is simpler). In the latter case, where the entropy is based on the natural logarithm, entropy is measured in “nits” φ≡ min d(p,q) (27) or “nats” which equal 1/ln2≈1.44 bits. MA,MB Substituting equation (29) into equation (28) and re- quiring vanishing derivatives with respect to MA, MB , i.e., minimizingd(p,q)overMA andMB giventhecon- ij i(cid:48)j(cid:48) λ and µ shows that the solution to our minimization straints that MA and MB are markov Matrices: MA = j j(cid:48) ·j problem is 1, MB = 1, MA ≥ 0 and MB ≥ 0. Table IV specifies ·j(cid:48) ij i(cid:48)j(cid:48) the options for how p and q are computed and how q MA = pi·j·, MB = p·i(cid:48)·j(cid:48). (31) dependsonMA andMB,whileTableVspecifiestheop- ji p j(cid:48)i(cid:48) p i··· ·i(cid:48)·· tions for computing the distance measure d. We enforce We recognize these equations as simply the Markov ma- the column sum constraints using Lagrange multipliers, trix estimator from equation (4) applied separately to minimizing subsystems A and B after marginalizing over the other L≡d(p,q)−(cid:88)λ (MA−1)−(cid:88)µ (MB −1), (28) system. Substituting this back into equation (9) gives i ·i i(cid:48) ·i(cid:48) i i(cid:48) q = pii(cid:48)··pi·j·p·i(cid:48)·j(cid:48). (32) ii(cid:48)jj(cid:48) p p and need to check afterwards that all elements of MA i··· ·i(cid:48)·· and MB come out to be non-negative (we will see that Although the full probability distributions q and p typ- this is indeed the case). ically differ, equation (32) implies that three marginal 10 distributions are identical: q = p , q = p in time, which would have yielded the alternative inte- i·j· i·j· ·i(cid:48)·j(cid:48) ·i(cid:48)·j(cid:48) and q =p . gration measure ij(cid:48)·· ij(cid:48)·· Substituting equation (32) back into the definition of d gives the extremely simple result that the integra- φot˜u(p)=I(xA,xB)−I(xA,xB). (35) KL 1 1 tion is Inpractice,oneusuallyestimatesallstatisticalproperties φotuk(p) = (cid:88) p logpii(cid:48)jj(cid:48)pi···p·i(cid:48)·· fromatime-seriesthatisassumedtobestationary. This ii(cid:48)jj(cid:48) p p p ii(cid:48)jj(cid:48) ii(cid:48)·· i·j· ·i(cid:48)·j(cid:48) meansthatI(xA0,xB0)=I(xA1,xB1),sothatthethesetwo = I(xA,xB)−I(xA,xB), (33) integration measures become identical. 0 0 wherethemutualinformationbetweentworandomvari- ables is given in terms of entropies by the standard defi- B. Comparison with the Ay/Barrett/Seth nition integration measures I(xA,xB)≡S(xA)+S(xB)−S(x). (34) In the paper [13] where Barrett & Seth proposed their easier-to-compute integration measure φB (see below), Since we will be deriving a large number of different they also mentioned an alternative measure that they φ-measures that we do not wish to conflate with one an- termed φ˜ , defined by E other, we superscript each one with four code letters de- notingthefourtaxonomicalchoicesthatdefineit. These φ˜ ≡S(xA|xA)+S(xB|xB)−S(x |x ), (36) E 0 1 0 1 0 1 letter codes are where the conditional entropy of two variables A and B 1. factorization: n/m/o/x/a is defined by 2. comparison: t/f/a/p S(A|B)≡S(A,B)−S(B). (37) 3. conditioning: u/s/k This measure had been introduced earlier by Ay [30, 31] inacontextunrelatedtoIIT,underthename“stochastic 4. measure: k/1/2/h/s/e/m interaction”, and was further discussed in [15, 32]. Ap- plying equations (37) and (34) to equation (36) shows and are defined in Tables III, IV and V. For example, that theintegrationmeasureφotuk fromequation(33)denotes optimized (o) factorization comparing the two-time (t) φ˜ = S(xA)−S(xA)+S(xB)−S(xB)−S(x)+S(x ) probabilitydistributionswiththecurrentstateunknown E 1 1 1 (u) and KL-divergence (k). Almost all measures dis- = I(xA,xB)−I(xA,xB)=φot˜u, (38) 1 1 cussedbelowwillinvolvethek-measure(KL-divergence), sowhenthisisthecasewewilltypicallydropthislastin- i.e., that φ˜ is identical to the time-reversed Markov E dexk toavoidaconfusingprofusionofindices,forexam- measure φot˜u. This equivalence provides another con- ple writing φotuk =φotu. For brevity, we will also define venient interpretation of φot˜u: as the average KL- φM ≡ φotu, since we will be referring to this “Markov divergencebetween(i)theprobabilitydistributionofthe measure” φotu many times below. paststatex giventhepresentstatex and(ii)theprod- 0 1 Althoughwederivedthisoptimalfactorizationbycom- uct of these conditional distributions for the two subsys- paring the two-time distribution (option t) for an un- tems. known state (option u), an analogous calculation leads It is also interesting to compare our result in equa- to the exact same optimal factorization for the options tion (33) with the popular integration measure a+u, s+f and a+s. The option t+s is undefined and the option f+u gives messy equations I have been unable to φB(p)=I(x ,x )−I(xA,xA)−I(xB,xB) (39) 0 1 0 1 0 1 solveanalytically. Itisthereforereasonabletoviewequa- tion (31) as the optimal factorization when the state is proposed by Barrett & Seth [13]. The intuition behind unknown(optiono),andfortheremainderofthispaper, this definition is to take the amount of information that we will simply define the o-option as using the factoriza- asystempredictsaboutitsfutureandsubtractofthein- tion given by equation (31). formation predicted by both of its subsystems. Unfortu- Note that our result in equation (33) involves a time- nately, the result can sometimes go negative [14, 15], vi- asymmetry, singling out t rather than t in the second olating the desirable positivity property and making the 0 1 term. This is because we chose to interpret our Markov φ difficult to interpret. Consider the simple example of B process as operating forward in time, determining the twoindependentbitsthatneverchange. Iftheystartout state at t from the state at t . As we discussed in perfectly correlated, then they will remain perfectly cor- 1 0 Section IID, we could equally well have done the op- related, giving I(x ,x ) = I(xA,xA) = I(xB,xB) = 1 0 1 0 1 0 1 posite, using the Markov process M˜ operating backward and integrated information φB(p)=−1.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.