Causal Discovery in a Binary Exclusive-or Skew Acyclic Model: BExSAM

Takanori Inazumi, Takashi Washio, Shohei Shimizu, Joe Suzuki, Akihiro Yamamoto, and Yoshinobu Kawahara

Abstract—Discovering causal relations among observed variables in a given data set is a major objective in studies of statistics and artificial intelligence. Recently, some techniques to discover a unique causal model have been explored based on non-Gaussianity of the observed data distribution. However, most of these are limited to continuous data. In this paper, we present a novel causal model for binary data and propose an efficient new approach to deriving the unique causal model governing a given binary data set under skew distributions of external binary noises. Experimental evaluation shows excellent performance for both artificial and real world data sets.

1 INTRODUCTION

Many approaches to causal inference and learning of Bayesian network structures have been studied in statistics and artificial intelligence [9], [19]. Most of these derive candidate causal models from an observed data set by assuming acyclicity of the causal dependencies. These mainly use information from second-order statistics such as correlations of the observed variables, and narrow down the candidate directed acyclic graphs (DAGs) by using some constraints and/or scoring functions. However, these approaches often produce multiple candidate causal structures for the data set because of the Markov equivalence or the local optimality of the structures.

Recently, some non-Gaussianity-based approaches applied to a linear acyclic model (LiNGAM) have been proposed [15], [16], [17]. Such approaches derive a unique causal order of observed variables in a linear acyclic structural equation model under the condition that external noises driving the variables are non-Gaussian and jointly independent, and estimate a unique model based on the derived causal order. However, these approaches require linearity of the objective system. Recent studies extended this principle to non-linear models [6], [22]. They clarified conditions to identify unique causal orders in bivariate non-linear and post non-linear (PNL) models, and applied these conditions to derive candidate causal orders in their multivariate models and estimate models based on these orders. However, their applicability is limited to the model consisting of continuous variables and smooth non-linear functions, and identifying a unique causal order in the multivariate model is not guaranteed. More recent work further extended the principle for a bivariate model where the two variables have ordered discrete values [10]. However, this did not address the identification of a unique causal order in a multivariate model and the estimation of the model under the identified order.

In contrast, many real world domains such as computer networks, medicine [9], bioinformatics [21], [14] and sociology [19] maintain recently-accumulated stochastic binary data sets, and practitioners need to discover the causal structures and structural models governing the data sets for various purposes. However, to the best of our knowledge, no past studies have addressed principles and algorithms to practically derive a unique causal order and a causal model following the order for a given binary data set. In this regard, our objective in this paper is to propose a novel and practical approach to discovering such a causal order and the associated model within a given stochastic binary data set under some feasible assumptions.

In the next section, we briefly review some related work to indicate important technical issues.
In the third section, we introduce a novel binary exclusive-or (EXOR) skew acyclic model, termed "BExSAM," to represent an objective system, and characterize the model with respect to causal order identification and causal model estimation. In the fourth section, we propose novel criteria and algorithms for causal order identification and causal model estimation based on the model characterization. In the fifth section, we present an experimental evaluation of this approach using both artificial and real world data sets.

• T. Inazumi, T. Washio, S. Shimizu and Y. Kawahara are with the Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka, Japan. E-mail: [email protected]
• J. Suzuki is with Graduate School of Science, Osaka University, and A. Yamamoto is with Graduate School of Informatics, Kyoto University.
• A preliminary result has been presented in an international conference: Discovering causal structures in binary exclusive-or skew acyclic models, Proc. the 27th Conference on Uncertainty in Artificial Intelligence (UAI2011), pp. 373-382, 2011.

2 RELATED WORK

Many studies on causal inference in statistics and learning of Bayesian network structures have concentrated on developing principles for efficiently focusing on candidate causal structures of a given data set within a feasible search space by using information from second-order statistics of the data set [9], [19]. This has arisen because exhaustive searches are intractable as the number of possible directed acyclic graphs (DAGs) grows exponentially with the number of variables. To address this issue, constraint-based approaches such as the PC and the CPC algorithms [19], [12] and score-based approaches such as the GES algorithm [3] have been studied for both continuous and discrete variables. However, these admit multiple solutions because of the Markov equivalence and the local optimality of those solutions in many cases, and thus often fail to generate a uniquely identifiable causal structure. They also need the assumption of faithfulness, implying that correlations between the variables are entailed by the graph structure only.

A recent technique, LiNGAM [15], formulates the causal DAG structure search and the structural modeling in the form of an independent component analysis (ICA), which ensures the existence of a unique global optimum under assumptions of linear relations among the observed variables, non-Gaussianity and joint independence of their external noises. Such a condition that enables the identification of a unique causal order of observed variables in a model, even if some other structural models are Markov equivalent to the model, is called an "identifiability condition." However, this may often provide a locally optimal solution through the nature of its greedy search. In contrast, the more recent DirectLiNGAM [16], [17] efficiently derives a uniquely identifiable solution under the same identifiability condition through its iterative search for exogenous variables, i.e., causally top upstream variables, by applying simple bivariate linear regressions and independence measures. The other notable advantage of these LiNGAM approaches is that they do not need the faithfulness assumption. Two studies [6], [22] proposed extensions of the principles of LiNGAM to a non-linear additive noise model and a post-nonlinear (PNL) model respectively. However, these two studies presented the identifiability conditions for two variable cases only, i.e., "bivariate identifiability conditions," and did not provide "multivariate identifiability conditions" or the algorithms to identify unique identifiable causal orders in multivariate models, as mentioned in the former section. Another study [8] proposed a novel regression to allow causal inference in a non-linear additive noise model containing multiple variables by introducing HSIC (Hilbert-Schmidt independence criterion). However, the identifiability of a unique solution is not ensured because of the non-convexity of the regression problem.

In contrast to these studies for continuous variables, only a few studies have addressed the issue of discovering causal structure for discrete variable sets based on particular characteristics of their data distributions. A study on this topic was recently reported in [10]. It assessed the identifiability of a unique causal order and an algorithm to find a model entailed by the order for integer variables in a finite range and/or a cyclic range having a modulus. However, the focus is on bivariate models and their bivariate identifiability conditions only. Another study [20] proposed a principle for finding a causal order of binary variables to explain a given sample distribution by mutually independent conditional probability distributions named Markov kernels. However, its applicability is limited to very simple Boolean relations because of the high complexity of the kernel functions for the generic cases.

More recent work showed that the acyclic causal structure in a multivariate model is identifiable if the causal relation on every pair of variables conditioned by all other variables in the model is bivariate identifiable [11].
It also indicated that the faithfulness assumption is not needed for the bivariate identifiability based approach to deriving a unique identifiable causal order from a given data set. Furthermore, the multivariate identifiability of all aforementioned models under their respective bivariate identifiability conditions was shown. A generic algorithm for deriving candidate identifiable causal structures from given data sets based on these results was also demonstrated. However, the algorithm and the independence measure used in it are not adapted to the model consisting of multivariate discrete variables, and the applicability of the algorithm was confirmed only for linear and non-linear additive noise models containing up to four continuous variables.

On the other hand, many real world applications need to discover a feasible causal order and a causal model entailed by the order from a given binary data set. Because binary variables do not constitute a continuous algebra, we need to develop a structural model of acyclic relations among the binary variables for other algebraic systems such as a Boolean algebra. In addition, we need to apply a binary data distribution such as a Bernoulli distribution instead of Gaussian or non-Gaussian distributions in modeling stochastic characteristics of the variables, and to use adapted measures to evaluate the independence of the binary distribution. Moreover, we have to design novel algorithms for both causal order identification and model estimation under the order based upon characteristics of the structural model and the data distribution. In the following sections, we present our ideas concerning these issues.

3 PROPOSED MODEL
3.1 BExSAM

We first introduce a novel structural model representing generic acyclic causal relations among binary variables.

Definition 1: Given d >= 1, let e_k in {0,1} for all k = 1,...,d be jointly independent random variables, f_k : {0,1}^(k-1) -> {0,1} deterministic Boolean functions, and

    x_k = f_k(x_1,...,x_(k-1)) ⊕ e_k,

where f_1 is constant, and ⊕ denotes the EXOR operation defined by Table 1.

TABLE 1
A truth table of a⊕b, ab and a+b.

    a b | a⊕b | ab | a+b
    0 0 |  0  |  0 |  0
    0 1 |  1  |  0 |  1
    1 0 |  1  |  0 |  1
    1 1 |  0  |  1 |  1

Every external noise e_k affects its corresponding variable x_k via an EXOR operation. Each f_k expresses any deterministic binary relation without loss of generality, because such a relation is always represented by Boolean algebraic formulae [5]. We further assume a skew Bernoulli distribution of every noise e_k as follows.

Assumption 1: We assume that the probability p_k of e_k = 1 satisfies 0 < p_k < 0.5 for all k = 1,...,d.

This assumption covers the case 0.5 < p_k < 1 without loss of generality, because x_k = f_k(x_1,...,x_(k-1)) ⊕ e_k is equivalent to x_k = f̄_k(x_1,...,x_(k-1)) ⊕ ē_k, where the probability p̄_k of ē_k = 1 is 1 - p_k and satisfies 0 < p̄_k < 0.5.^1 The condition p_k ≠ 0, 0.5 is an essential assumption for causal identification in our setting, as will be shown later; it is an analogue of the aforementioned non-Gaussianity in the case of LiNGAM. The model provided by Definition 1 and Assumption 1 is called a binary EXOR skew acyclic model, or "BExSAM" for short.

If f_k in Definition 1 depends on x_h (h < k), we say that x_h is a "parent" of x_k and x_k is a "child" of x_h. As is widely noted in causal inference studies [9], [19], we divide X = {x_k | k = 1,...,d} into two classes: x_k having no parents ("exogenous variables") and x_k having some parents ("endogenous variables"). In this study, we further introduce the following definition of a particular endogenous variable.

Definition 2: Endogenous variables having no children are called "sinks."

As shown in our later discussion, finding sink endogenous variables plays a key role with regard to principles and algorithms for identification of a unique causal order and estimation of a BExSAM.

Example 1: The following is an example of a BExSAM consisting of four binary variables:

    x_1 = e_1,
    x_2 = x_1 ⊕ e_2,
    x_3 = x_1 x_2 ⊕ e_3,
    x_4 = (x_1 + x_3) ⊕ e_4,

where the values of x_1 x_2 and x_1 + x_3 are given in Table 1. As shown in Fig. 1, x_1 and x_4 are exogenous and sink endogenous variables, respectively.

[Fig. 1. A DAG structure of the BExSAM in Example 1: x_1 -> x_2; x_1, x_2 -> x_3; x_1, x_3 -> x_4, with the noises e_1,...,e_4 entering the respective variables.]

1. f̄_i and ē_i are f_i ⊕ 1 and e_i ⊕ 1, respectively.

3.2 Characterization

In this subsection, characteristics of BExSAM associated with sink endogenous variables concerning the identification of a unique causal order and the estimation of a structural model are analyzed. First, we define a notion of "selection" which specifies the values of some variables in X.

Definition 3: For k = 1,...,d, we denote by X_k = V_k an assignment x_1 = v_1,...,x_(k-1) = v_(k-1), x_(k+1) = v_(k+1),...,x_d = v_d for X_k := X \ {x_k} and V_k := (v_1,...,v_(k-1),v_(k+1),...,v_d)^T in {0,1}^(d-1). This assignment is called a "selection" of X_k at V_k.

The following theorem is important for causal ordering of the variables in X by using the selection.

Theorem 1: The following conditions are equivalent.
1) x_k in X is a sink endogenous variable.
2) There is a common constant q_k such that p(x_k = 1 | X_k = V_k) = q_k or 1 - q_k (and therefore, p(x_k = 0 | X_k = V_k) = 1 - q_k or q_k equivalently) for all selections X_k = V_k in {0,1}^(d-1).

Proof. See Appendix A. □

For example, if we are given the two selections X_4 = {x_1, x_2, x_3} at V_4 = (0,0,0) and V_4' = (0,0,1) in Example 1, we have the following conditional probabilities for the sink endogenous variable:

    p(x_4 = 1 | X_4 = V_4)  = p_4,
    p(x_4 = 1 | X_4 = V_4') = 1 - p_4.

Actually, p(x_4 = 1 | X_4 = V_4) = p_4 (= q_4) or 1 - p_4 (= 1 - q_4) holds for any V_4 in this case.
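This dichotomy is easy to observe numerically before turning to the contrasting non-sink case below. The following is a minimal simulation sketch of Example 1 (our own Python illustration, not the authors' published code; the noise probabilities and function names are ours): the estimated conditional probability table of the sink x_4 shows only two values, q_4 and 1 - q_4, whereas the table of x_3 does not.

    import random
    from collections import defaultdict

    def simulate_example1(n, p=(0.2, 0.3, 0.25, 0.1), seed=0):
        """Sample n instances from the Example 1 BExSAM; p[k-1] is p_k."""
        rng = random.Random(seed)
        data = []
        for _ in range(n):
            e = [int(rng.random() < pk) for pk in p]  # skew binary noises
            x1 = e[0]
            x2 = x1 ^ e[1]
            x3 = (x1 & x2) ^ e[2]          # x_3 = x_1 x_2 XOR e_3
            x4 = (x1 | x3) ^ e[3]          # x_4 = (x_1 + x_3) XOR e_4
            data.append((x1, x2, x3, x4))
        return data

    def conditional_table(data, i):
        """Estimate p(x_i = 1 | X_i = V_i) for every observed selection."""
        num, den = defaultdict(int), defaultdict(int)
        for v in data:
            key = v[:i] + v[i + 1:]
            den[key] += 1
            num[key] += v[i]
        return {key: num[key] / den[key] for key in den}

    data = simulate_example1(100000)
    print(conditional_table(data, 3))  # sink x_4: entries near 0.1 or 0.9
    print(conditional_table(data, 2))  # non-sink x_3: more than two values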
In contrast, if we are provided with selections X_3 = {x_1, x_2, x_4} at V_3 = (0,0,0) and V_3' = (0,0,1), then

    p(x_3 = 1 | X_3 = V_3)  = p_3 p_4 / (p_3 p_4 + (1 - p_3)(1 - p_4)),
    p(x_3 = 1 | X_3 = V_3') = p_3 (1 - p_4) / (p_3 (1 - p_4) + (1 - p_3) p_4)

hold. Because p_3, p_4 ≠ 0.5 by Assumption 1, these probabilities are not equal, and also their sum is not unity. Accordingly, no constant q_3 or 1 - q_3 can be assigned to both p(x_3 = 1 | X_3 = V_3) and p(x_3 = 1 | X_3 = V_3') in this case. These results reflect Theorem 1: we can find a sink endogenous variable in X by checking the conditional probability of every variable.

Next, we present an important proposition for estimating a structural model of X.

Proposition 1: Let x_k in X be a sink endogenous variable.
1) f_k(x_1,...,x_(k-1)) = 1 under X_k = V_k ⇔ p(x_k = 1 | X_k = V_k) > p(x_k = 0 | X_k = V_k).
2) f_k(x_1,...,x_(k-1)) = 0 under X_k = V_k ⇔ p(x_k = 1 | X_k = V_k) < p(x_k = 0 | X_k = V_k).

Proof. See Appendix B. □

The function f_k under a selection X_k = V_k is constant since all of its arguments are constant. Accordingly, the probability distribution of x_k under the selection is determined by the constant f_k and the fact that 0 < p_k < 0.5 in Assumption 1. In Example 1, under a selection X_4 = {x_1, x_2, x_3} = V_4 = (1,0,0), f_4 = x_1 + x_3 = 1 and thus x_4 = 1 ⊕ e_4 = ē_4. This implies that p(x_4 = 1 | X_4 = V_4) > p(x_4 = 0 | X_4 = V_4) since 0 < p_4 < 0.5. Conversely, p(x_4 = 1 | X_4 = V_4) > p(x_4 = 0 | X_4 = V_4) implies that 0 < p(x_4 = 0 | X_4 = V_4) < 0.5. This further implies that f_4 = 1 under X_4 = V_4, since x_4 = f_4 ⊕ e_4 and 0 < p_4 < 0.5. Proposition 1 thus indicates a way to identify the part of the truth table of f_k belonging to a sink endogenous variable x_k and X_k = V_k.

4 PROPOSED ALGORITHMS

4.1 Problem Setting

First, we define our problem of causal order identification and structural model estimation for a BExSAM. In our setting, the causal order x_1,...,x_d is unknown in advance, but we have a data set D containing a finite number of instances V = (v_i(1),...,v_i(d)) in {0,1}^d of variables X = {x_i(1),...,x_i(d)}, where i(k) labels a variable x_k while k is unknown. i is a permutation i : {1,...,d} -> {1,...,d} to be identified from D in determining the causal order. In addition, the Boolean functions f_i(k) for all k = 1,...,d in the BExSAM need to be estimated. Note that the values of {e_i(1),...,e_i(d)} can be estimated only from D. D is generated through a process well modeled by a BExSAM where the distributions of {e_i(1),...,e_i(d)} are skew and jointly independent. Accordingly, if the sample size n = |D| is sufficiently larger than 2^d, then D contains a variety of instances V, which enables estimation of the conditional probabilities under various selections, similar to the other constraint based approaches [19].

In summary, we assume that a given data set D = {V(h) | h = 1,...,n} is generated in a BExSAM:

    x_i(k) = f_i(k)(x_i(1),...,x_i(k-1)) ⊕ e_i(k)   (k = 1,...,d),

where f_i(1) is a constant in {0,1} and f_i(k) : {0,1}^(k-1) -> {0,1} (k >= 2) is a deterministic Boolean function. Our problem is to identify the permutation i : {1,...,d} -> {1,...,d}, that is, the causal order, and to estimate the functions f_i(k) (k = 1,...,d) only from D.

4.2 Outline of Proposed Algorithm

We propose an approach to solving our problem based on Theorem 1 and Proposition 1.^2 Figure 2 shows the outline of our proposed algorithm.

    input: a binary data set D and its variable list X.
    1. compute a frequency table FT of D.
    2. for k := d to 1 do
    3.   i(k) := find_sink(FT, X).
    4.   TT_i(k) := find_truth_table(FT, X, i(k)).
    5.   remove x_i(k) from X and marginalize FT over x_i(k).
    6. end
    output: a list [{x_i(k), TT_i(k)} | k = 1,...,d].

    Fig. 2. Main algorithm.

Since we only need the values of FT rather than D to identify the order x_1,...,x_d, we compute the values of FT in the first stage of the procedure. In the loop from the next step, the algorithm seeks a sink endogenous variable x_i(k) using the function "find_sink" at step 3 and a Boolean function f_i(k) in the form of a truth table TT_i(k) via the function "find_truth_table" at step 4. These functions perform the identification of a unique causal order and the estimation of a BExSAM entailed by the order. Step 5 reduces the search space in the next loop by removing the estimated sink endogenous variable x_i(k) from X and marginalizing x_i(k) in FT.

2. Code is available from http://www.ar.sanken.osaka-u.ac.jp/~inazumi/bexsam.html.
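Read as code, Fig. 2 is a loop over frequency-table marginalizations. The skeleton below is our own Python restatement, not the authors' implementation (which is linked in the footnote above); sketches of find_sink and find_truth_table are given after their descriptions in the next subsection.

    from collections import Counter

    def frequency_table(D):
        """FT: count of each observed binary tuple in the data set D."""
        return Counter(map(tuple, D))

    def marginalize(FT, j):
        """Sum FT over the variable at position j, shrinking every key."""
        out = Counter()
        for v, c in FT.items():
            out[v[:j] + v[j + 1:]] += c
        return out

    def bexsam_fit(D):
        """Main algorithm of Fig. 2: repeatedly peel off a sink variable."""
        FT = frequency_table(D)
        X = list(range(len(next(iter(FT)))))  # current variable labels
        result = []
        while X:
            j = find_sink(FT)                 # position of the sink in FT
            TT = find_truth_table(FT, j)      # truth table of its f (Fig. 5)
            result.append((X.pop(j), TT))
            FT = marginalize(FT, j)
        return list(reversed(result))         # causal order, top first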
The entire list of x_i(k) and TT_i(k) in the output represents the causal order of the variables in the causal DAG structure and the BExSAM reflecting the structure. This iterative reduction from the bottom in the causal order is similar to the causal ordering of [8]. However, their causal structure estimation needs a second sweep from the top to the bottom. We should note here that our approach consisting of the main algorithm, find_sink and find_truth_table does not require any parameters to be tuned, as shown in the next subsection.

4.3 Finding a Unique Causal Order and Functions

Our algorithm for finding a sink endogenous variable is summarized in Fig. 3. In the loop starting from step 1, a mutual information adapted to our problem, MI_s(x_i, X_i), is computed for each x_i from FT as explained below. This represents the degree to which x_i fits condition 2 in Theorem 1. Finally, the x_i with the minimum value of MI_s(x_i, X_i), that is, the highest possibility of being a sink endogenous variable, is selected.

    input: a frequency table FT and its variable list X.
    1. for i := 1 to |X| do
    2.   X_i := X \ {x_i}.
    3.   compute p_s(x_i = v_i, X_i = V_i), p_s(x_i = v_i), p(X_i = V_i)
         for all v_i in {0,1} and V_i in {0,1}^(|X|-1) from FT.
    4.   compute independence measure MI_s(x_i, X_i).
    5. end
    6. select i having the minimum MI_s(x_i, X_i) in X.
    output: i.

    Fig. 3. Algorithm for find_sink.

As noted in Theorem 1, p(x_i = 1 | X_i = V_i) takes one of the values q_i or 1 - q_i depending on the selections, as depicted in the upper table in Fig. 4, if and only if x_i is a sink endogenous variable. We then obtain the bottom table, which represents the independence of x_i from X_i, after sorting p(x_i = v_i | X_i = V_i) in ascending order in every column according to q_i < 1 - q_i. This implies that the independence of x_i from X_i in the table sorted on x_i is equivalent to the fact that x_i is a sink endogenous variable.

[Fig. 4. Sort on x_i in a conditional probability table: in the upper table, each column V_i of p(x_i = v_i | X_i = V_i) holds q_i and 1 - q_i in some order; after sorting every column in ascending order, the bottom table has the constant rows q_i and 1 - q_i, i.e., the sorted x_i is independent of X_i.]

Practically, if the frequencies of some X_i = V_i are zero in FT because of the incomplete cover of the selections of X_i = V_i in the given data set D, the probabilities on such X_i = V_i are not computable. Thus, we obtain the probabilities on N_i selections of X_i = V_i, fewer than 2^(d-1). Based on these considerations, we use the following mutual information MI_s(x_i, X_i) >= 0 between the sorted x_i and X_i to evaluate the degree to which x_i is a sink endogenous variable:

    MI_s(x_i, X_i) = (2^(d-1) / N_i) Σ_{v_i, V_i} p_s(x_i = v_i, X_i = V_i)
                     × ln [ p_s(x_i = v_i, X_i = V_i) / ( p_s(x_i = v_i) p(X_i = V_i) ) ],

where p_s represents a probability for the sorted x_i and the summation is taken over the available N_i selections. Because MI_s(x_i, X_i) is rewritten as

    MI_s(x_i, X_i) = (2^(d-1) / N_i) Σ_{v_i, V_i} p_s(x_i = v_i | X_i = V_i) p(X_i = V_i)
                     × ln [ p_s(x_i = v_i | X_i = V_i) / p_s(x_i = v_i) ],

it is zero when p_s(x_i = v_i | X_i = V_i) = p_s(x_i = v_i), as in the bottom table in Fig. 4. A smaller MI_s(x_i, X_i) represents a higher possibility of x_i being a sink endogenous variable.
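The sorted mutual information can be computed directly from FT. The sketch below is one reading of the definition (ours; in particular, the handling of uncovered selections and of the sorted marginals p_s follows our own interpretation of the text): for each covered selection, the two conditional probabilities are sorted in ascending order, so a true sink yields identical columns and a measure near zero.

    from math import log

    def sorted_mutual_information(FT, i):
        """MI_s(x_i, X_i) from the frequency table FT (a mapping from
        binary tuples to counts); smaller means more sink-like."""
        n = sum(FT.values())
        cols = {}
        for v, c in FT.items():           # group counts by selection V_i
            cols.setdefault(v[:i] + v[i + 1:], [0, 0])[v[i]] += c
        joint = []                        # p_s(x_i = v, X_i = V_i), sorted v
        for c0, c1 in cols.values():
            m = c0 + c1
            lo, hi = sorted((c0 / m, c1 / m))
            joint.append((lo * m / n, hi * m / n))
        N_i = len(joint)                  # number of covered selections
        p_lo = sum(a for a, _ in joint)   # sorted marginals p_s(x_i = v_i)
        p_hi = sum(b for _, b in joint)
        mi = 0.0
        for a, b in joint:
            pV = a + b                    # p(X_i = V_i)
            for p_joint, p_marg in ((a, p_lo), (b, p_hi)):
                if p_joint > 0:
                    mi += p_joint * log(p_joint / (p_marg * pV))
        d = len(next(iter(FT)))
        return (2 ** (d - 1) / N_i) * mi

    def find_sink(FT):
        """Step 6 of Fig. 3: the position with the minimum MI_s."""
        d = len(next(iter(FT)))
        return min(range(d), key=lambda i: sorted_mutual_information(FT, i))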
Figure 5 outlines our algorithm for estimating every function f_i. In the loop beginning from step 2, the conditional probability of x_i for each selection X_i = V_i is computed at step 3, and the value of f_i is estimated by following Proposition 1 at step 4. This is further listed in a predefined order on V_i in a truth table TT_i at step 5, as depicted in Fig. 6. Similarly to the former algorithm of find_sink, we assign 'void' to f_i when the frequency of X_i = V_i is zero by the incompleteness of FT. The final output holds the entire truth table of f_i.

    input: a frequency table FT, its variable list X
           and an index of a sink endogenous variable i.
    1. X_i := X \ {x_i} and TT_i = φ.
    2. for all V_i in {0,1}^(|X|-1) do
    3.   compute p(x_i = v_i | X_i = V_i) for all v_i in {0,1} from FT.
    4.   if p(x_i = 1 | X_i = V_i) > p(x_i = 0 | X_i = V_i), f_i = 1,
         otherwise f_i = 0.
    5.   TT_i = TT_i + {f_i}.
    6. end
    output: TT_i.

    Fig. 5. Algorithm for find_truth_table.

[Fig. 6. A truth table of f_i: the selections V_i = (0,0,...,0), (0,0,...,1), ..., (1,1,...,1) are listed in a predefined order, each with its estimated value of f_i.]
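A direct transcription of Fig. 5 follows (our sketch; the dict-based truth table and the use of None for 'void' are our own conventions):

    from itertools import product

    def find_truth_table(FT, i):
        """Estimate the truth table of f_i for an identified sink x_i."""
        d = len(next(iter(FT)))
        TT = {}
        for key in product((0, 1), repeat=d - 1):  # predefined order on V_i
            c0 = FT.get(key[:i] + (0,) + key[i:], 0)
            c1 = FT.get(key[:i] + (1,) + key[i:], 0)
            # Proposition 1: f_i = 1 iff p(x_i = 1 | V_i) > p(x_i = 0 | V_i);
            # None marks 'void' when the selection V_i never occurs in D.
            TT[key] = None if c0 + c1 == 0 else int(c1 > c0)
        return TT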
4.4 Computational Complexity

The largest table used in the above algorithms is the frequency table FT, which has size 2^d. Thus, the memory complexity of our algorithms is O(2^d). According to the requirement on the data size, n >= 2^d, as noted in section 4.1, this is also written as O(n).

The loop involved in the "find_sink" function computes the probabilities at most 2^(d-1) times, and computes the independence measure MI_s(x_i, X_i) by aggregating these 2^(d-1) probabilities. Thus, this process is O(2^d) ≃ O(n). Since the loop repeats d times at most, the time complexity of "find_sink" is O(d 2^d) ≃ O(dn). The loop involved in the "find_truth_table" function computes the conditional probabilities and estimates an element in the truth table at most 2^(d-1) times. Therefore, the time complexity of "find_truth_table" is O(2^d) ≃ O(n). Step 1 of the main algorithm needs n >= 2^d counts to compute FT, and so is O(2^d) ≃ O(n). The functions "find_sink" and "find_truth_table", which are O(d 2^d) ≃ O(dn) and O(2^d) ≃ O(n), are repeated d times in the main algorithm. Accordingly, the total time complexity of the proposed algorithms is O(d^2 2^d) ≃ O(d^2 n).

This computational complexity is tractable when the number of observed variables is moderate, as shown in the numerical experiments later. This complexity is also favorable compared with past work. For example, DirectLiNGAM, which also has an iterative algorithm structure and is considered to be one of the most efficient algorithms, has O(d^3 n) complexity.

5 EXPERIMENTAL EVALUATION

5.1 Basic Performance for Artificial Data

For our numerical experiments, we generated artificial data sets using BExSAMs produced by the following procedure. For every f_k (k > 1) in Definition 1, we randomly choose each x_i from the set of potential ancestors X_{1:k-1} = {x_i | i = 1,...,k-1} as a parent of x_k with probability p_a. Given a set of parents X_k^Pa ⊆ X_{1:k-1} chosen in this way, we set f_k to 0 or 1 uniformly at random for all selections X_k^Pa = V_k^Pa in {0,1}^|X_k^Pa|. We do not care about the other non-parent variables in X_{1:k-1} when defining f_k, and obtain a truth table TT_k = {f_k | X_{1:k-1} = V_{1:k-1} in {0,1}^(d-1)} based on f_k for all V_k^Pa. For f_1, we simply form the truth table TT_1 = {f_1}, where f_1 is a constant chosen from {0,1} uniformly at random. This random procedure generates a generic BExSAM in the form of a truth table TT = {TT_k | k = 1,...,d}.

We obtained our artificial data set D = {V(h) | h = 1,...,n} from the generated BExSAM in the following way. We randomly generate e_k(h) (k = 1,...,d and h = 1,...,n) under respective p_k in (0,0.5) which are common over all h by Assumption 1. For each h, we successively derive the value of f_k from k = 1 to d by applying the values of X_{1:k-1} to TT_k, and compute the value of x_k by f_k ⊕ e_k. Once this tentative data set is obtained, we randomly permute the indices k = 1,...,d of the causal ordering to define new variable indices i(k) (k = 1,...,d), and obtain the final data set D. The series of model generation, data generation and application of our approach was repeated 1000 times for various combinations of the parameters d, n, p_a and p_i(k) (k = 1,...,d).
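For concreteness, this generation procedure can be sketched as follows (our own illustration; the variable names and the explicit seed are ours, and the final random permutation of the variable indices is omitted for brevity):

    import random
    from itertools import product

    def random_bexsam(d, pa, rng):
        """Random BExSAM of section 5.1: each earlier variable becomes a
        parent with probability pa; f_k is random over parent selections."""
        model = []
        for k in range(d):
            parents = [j for j in range(k) if rng.random() < pa]
            f = {pv: rng.randint(0, 1)
                 for pv in product((0, 1), repeat=len(parents))}
            model.append((parents, f))
        return model

    def sample(model, n, p, rng):
        """Draw n instances; p[k] in (0, 0.5) is the noise probability."""
        data = []
        for _ in range(n):
            x = []
            for k, (parents, f) in enumerate(model):
                e = int(rng.random() < p[k])
                x.append(f[tuple(x[j] for j in parents)] ^ e)
            data.append(tuple(x))
        return data

    rng = random.Random(1)
    d = 6
    model = random_bexsam(d, pa=0.5, rng=rng)
    D = sample(model, n=5000, p=[rng.uniform(0.0, 0.5) for _ in range(d)], rng=rng)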
Three performance indices were used for the evaluation. Given two binary adjacency matrices A_t and A_e representing the parent-child relationships between the variables in the generated BExSAM and its estimated BExSAM respectively, we compute their precision and recall as follows:

    P(A) = |A_t ∧ A_e| / |A_e|  and  R(A) = |A_t ∧ A_e| / |A_t|,

where ∧ is an element-wise AND operation and |·| is the number of non-zero elements in a matrix. We then obtain their resultant F-measure as the first performance index:

    F(A) = 2 P(A) · R(A) / (P(A) + R(A)).

This represents the performance of the causal ordering. Similarly, we compute an F-measure F(TT) between the true truth table TT_t and the estimated truth table TT_e as the second performance index, where each f_i having the 'void' value in TT_e and its corresponding f_i in TT_t were both skipped in the element-wise AND operation of ∧. This indicates the performance of the model estimation. The third index is simply the total computational time CT (msec) of our algorithm explained in section 4. These indices are averaged over the 1000 trials.

In the first experiment, every combination of d = 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 and n = 100, 500, 1000, 5000, 10000 was evaluated with p_a = 0.5 and p_i(k) defined uniformly at random over (0,0.5) for k = 1,...,d. These choices of p_i(k) ensure the skew and non-deterministic distribution of e_i(k) as required by Assumption 1. p_a reflects the density of the variable couplings in the generated BExSAM. Table 2 summarizes the performance of our approach; in the original table, values of F(A) and F(TT) greater than 0.9 and values of CT less than 1000 msec were typed in boldface.

TABLE 2
Performance under various n and d (top: F(A), middle: F(TT) and bottom: CT (msec) in each cell).

    n\d        2      4      6      8     10     12     14     16     18     20
    100    1.000  0.737  0.658  0.550  0.444  0.386  0.368  0.371  0.395  0.417
           0.932  0.802  0.686  0.563  0.465  0.404  0.359  0.323  0.299  0.274
           0.521   1.14   2.20   4.61   13.7   57.4    233   1150   6158  32197
    500    1.000  0.893  0.912  0.867  0.734  0.571  0.458  0.386  0.351  0.339
           0.971  0.909  0.860  0.771  0.619  0.498  0.427  0.386  0.353  0.328
           0.462   1.13   2.19   4.59   13.6   57.6    234   1155   6192  32321
    1000   1.000  0.917  0.941  0.933  0.851  0.704  0.559  0.454  0.382  0.346
           0.980  0.934  0.890  0.829  0.715  0.570  0.472  0.411  0.374  0.348
           0.446   1.14   2.18   4.59   13.6   57.6    235   1156   6206  32338
    5000   1.000  0.962  0.978  0.984  0.970  0.905  0.783  0.661  0.553  0.464
           0.986  0.966  0.946  0.910  0.856  0.749  0.600  0.495  0.432  0.396
           0.460   1.17   2.25   4.71   14.0   54.2    242   1174   6195  32280
    10000  1.000  0.971  0.987  0.990  0.984  0.956  0.860  0.738  0.632  0.534
           0.994  0.976  0.959  0.933  0.888  0.815  0.674  0.544  0.464  0.417
           0.474   1.17   2.25   4.70   14.0   54.0    242   1166   6202  32234

We observe that F(A) can be less than 0.9 even with n > 2^d for a small d. This is because the statistical accuracy of MI_s(x_i, X_i) is not very high, owing to the summation over the small number N_i (≃ 2^d) of V_i. The accuracy of the causal ordering is affected by erroneous values of MI_s(x_i, X_i). On the other hand, F(A) is greater than 0.9 under n > 2^d when d is large because of the higher accuracy of MI_s(x_i, X_i). We further observe that F(TT) is more sensitive to shortages in the data than F(A). This is because the estimation accuracy of p(x_i = v_i | X_i = V_i) required to derive TT is directly affected by the frequency of the individual X_i = V_i in the data. The estimation accuracy is strongly reduced by the smaller frequency for smaller n. These results indicate that our causal ordering approach and model estimation approach work properly for up to 12 variables and up to 8 or 10 variables, respectively, with a data set of several thousands of samples. We note in particular that CT increases with d but not with n. This is consistent with the aforementioned complexity analysis and the fact that n affects only the computation of FT in the initial stage of our algorithm. The results show that our algorithm completes the causal ordering and the model estimation within a second up to d = 14, even for large amounts of data.

In the second experiment, we gave an identical value p_e in (0,0.5) to p_i(k) for all k = 1,...,d, and evaluated F(A) and F(TT) for d = 4, n = 1000 and p_a = 0.5. Figure 7 depicts the resultant dependency of F(A) and F(TT) on various p_e. If p_e is close to 0, our approach fails to accurately estimate the conditional probabilities and thus its accuracy is degraded. If p_e is close to 0.5, again the accuracy of our approach is lost, because it relies heavily on Theorem 1 and Proposition 1, which require p_i(k) ≠ 0.5.

[Fig. 7. Dependency of F(A) and F(TT) on p_e, plotted over p_e = 0 to 0.5.]
Through some extra experiments, we confirmed that F(A) and F(TT) do not show strong dependency on p_a, the causal density of the BExSAM.

5.2 Comparison with Other Algorithms

Our algorithm, the PC algorithm [19], the CPC algorithm [12] and the GES algorithm [3] were compared by applying them to data generated by the following artificial BExSAM forming a Y-structure [7]:

    x_1 = e_1,
    x_2 = e_2,
    x_3 = x_1 x_2 ⊕ e_3,
    x_4 = x_3 ⊕ e_4.

Each p_i(k) was given similarly to the first experiment in the previous subsection. The significance levels α in both PC and CPC were set at 0.05. Table 3 shows the frequencies of estimated relationships between variables over their true relationships for 20 trials. The columns and rows represent estimated relationships and true relationships, respectively. Because of the Y-structure among the four variables, the number of true directed edges is 3 × 20 = 60 in total, while the number of true non-edges is (2 · C(4,2) − 3) × 20 = 180 by double counting the two missing directed edges between two variables for a non-edge. This counting method gives double penalties to an incorrect estimation of an edge direction, which often comes from causal ordering failures affecting the global structure estimation. The directed/directed and no edge/no edge cells give the numbers of correct estimations. Note that this Y structure is a typical example which enables a valid estimation by the PC algorithm. However, our approach based on the skewness of the binary data distribution provides better accuracy. Similar advantageous results of our approach were obtained for the case where x_3 = (x_1 + x_2) ⊕ e_3. The results of CPC and GES are similar to PC, since CPC and GES do not have any significant advantages over PC at identifying Y-structures.

TABLE 3
Comparison with other algorithms.

    Our Algorithm
    true\est.   directed   no edge   undirected
    directed          55         5            0
    no edge            8       172            0

    PC Algorithm
    true\est.   directed   no edge   undirected
    directed          28        18           14
    no edge            6       158           16

    CPC Algorithm
    true\est.   directed   no edge   undirected
    directed          27        12           21
    no edge            0       157           23

    GES Algorithm
    true\est.   directed   no edge   undirected
    directed          24        17           19
    no edge            0       161           19

5.3 Example Applications to Real-World Data

Our approach has been applied to two real-world data sets. One is on leukemia deaths and survivals (LE = 1/0) in children in southern Utah who have high/low exposure to radiation (EX = 1/0) from the fallout of nuclear tests in Nevada [4], [9]. As this contains only two binary variables, conventional constraint/score-based approaches cannot estimate any unique causal structure. In contrast, our approach found a causal order EX -> LE consistent with our intuition.

Another data set is for college plans of 10318 Wisconsin high school seniors [13], [19]. While the original study aimed to find a feasible causal structure among five variables constituting a Y structure, we focus on the causality between three variables: yes/no college plans (CP = 0/1), low/high parental encouragement (PE = 0/1) and least to highest intelligence quotient (IQ = 0,...,3). Conventional constraint-based approaches are known to give multiple candidate causal structures in an equivalence class for the three variables. We selected 2543 male seniors (SEX = male) having a higher socioeconomic status (SES >= 2 where SES = 0,...,3) to retain individuals having similar social background while maintaining an appropriate sample size. We further transformed IQ to a binary variable using a threshold value between 1 and 2, which gives a data set containing 1035 and 1508 seniors having IQ = 0 (lower IQ) and IQ = 1 (higher IQ), respectively. The application of our approach to this data set produced the unique causal structure depicted in Fig. 8. This states that the intelligence quotient of a senior affects both his parental encouragement and his college plan, and that the parental encouragement further influences the college plan. This is consistent with our intuition and the structure estimated by the PC algorithm from the original five variables constituting a Y structure.

[Fig. 8. Discovered causal structure among CP, PE and IQ: IQ -> PE, IQ -> CP and PE -> CP.]

6 DISCUSSION

When we apply our approach to a data set, we assume that the data generation process approximately follows a BExSAM. A crucial property required in a BExSAM is the skew distribution of every external noise. Because the noise is not directly observable, a measure to check this property using the given data set is desirable. The following lemma can be used for this purpose.

Lemma 1: Assuming that a data set D is generated by a BExSAM, if a variable x_k in X has a distribution p(x_k = 1) ≠ 0.5, then the following conditions hold:

    p_k < 0.5 and p(f_k = 1) ≠ 0.5.

Proof. See Appendix C. □

We simply check whether the frequency of x_k = 1 is different from 0.5 for every observed variable in X. If it is, then the skewness of their noise distribution is ensured. Another strong assumption of a BExSAM is the intervention of external binary noises via EXOR operations. However, Lemma 1 also suggests the applicability of the BExSAM to generic Boolean interventions of the noises. For example, if a given data set is generated by x_k = f_k + e_k, then we have p(x_k = 1) = p_k + p(f_k = 1) - p_k p(f_k = 1). On the other hand, a BExSAM x_k = f_k ⊕ e_k provides p(x_k = 1) = p_k + p(f_k = 1) - 2 p_k p(f_k = 1). Accordingly, these two models show very similar distributions of the observed variables if p_k, p(f_k = 1) ≪ 0.5. These conditions can be checked by the insights of Lemma 1.
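A minimal sketch of this check follows (ours; the tolerance threshold is our own illustrative choice, since the paper does not fix one, and a formal statistical test of the 0.5 hypothesis could be substituted):

    def skewness_check(FT, tol=0.05):
        """Heuristic Lemma 1 check: flag each variable whose marginal
        frequency of 1 differs from 0.5 by more than tol, which supports
        the skew-noise assumption for that variable."""
        n = sum(FT.values())
        d = len(next(iter(FT)))
        freq = [sum(c for v, c in FT.items() if v[k] == 1) / n
                for k in range(d)]
        return [abs(f - 0.5) > tol for f in freq]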
As mentioned in subsection 4.1, our algorithm requires a complete data set D in principle, which is similar to other constraint based approaches [19]. Therefore, to analyze a data set D in which a large portion is incomplete, we need to estimate the missing data in D by introducing some data completion techniques such as [2]. Another associated issue is the applicability to many variables, since the low error rates are ensured only for up to 12 variables with several thousands of samples. A promising way to overcome this issue may be to combine our approach with other constraint-based approaches such as the PC algorithm, as discussed in [22]. The extensions of our approach toward these issues are topics for future studies.

7 CONCLUSION

In this paper, we presented a novel binary structural model involving exclusive-or noise and proposed an efficient new approach to deriving an identifiable causal structure governing a given binary data set based on the skewness of the distributions of external noises. The approach has low computational complexity and does not require any tunable parameters. The experimental evaluation shows promising performance for both artificial and real world data sets. This study provides an extension of the non-Gaussianity-based causal inference for continuous variables to causal inference for discrete variables, and suggests a new perspective on more generic causal inference.

APPENDIX A
PROOF OF THEOREM 1

(1 ⇒ 2) Let the value of x_k be v_k in {0,1}. Under a selection X_k = V_k, f_k is a constant in {0,1}. Accordingly, the following relation holds by Definition 1:

    x_k = v_k ⇔ v_k = f_k ⊕ e_k ⇔ e_k = v_k ⊕ f_k.

Because x_k is a sink endogenous variable, e_k and X_k are mutually independent by Definition 1. Under this fact and Assumption 1,

    p(x_k = v_k | X_k = V_k) = p(e_k = v_k ⊕ f_k)
        = p_k       for v_k ⊕ f_k = 1,
        = 1 - p_k   for v_k ⊕ f_k = 0.

By letting q_k = p_k or q_k = 1 - p_k, 1 ⇒ 2 holds.

(1 ⇐ 2) Assume that x_k is not a sink endogenous variable. Let X_k be partitioned into X_k^l and X_k^u, where X_k^l is the set of all descendants of x_k in a BExSAM, and X_k^u is the complement of X_k^l in X_k. Then, the following holds:

    p(x_k = v_k | X_k = V_k) = p(x_k = v_k | X_k^u = V_k^u, X_k^l = V_k^l)
    = p(X_k^l = V_k^l | x_k = v_k, X_k^u = V_k^u) p(x_k = v_k | X_k^u = V_k^u)
      / Σ_{v' in {v_k, v̄_k}} p(X_k^l = V_k^l | x_k = v', X_k^u = V_k^u) p(x_k = v' | X_k^u = V_k^u),   (t1)

where v̄_k = v_k ⊕ 1. Furthermore, let X_j = X \ {x_j} for x_j in X_k^l be partitioned into X_j^l and X_j^u similarly to X_k^l and X_k^u for x_k. By Definition 1, each f_j for x_j in X_k^l is given by X_j^u = V_j^u, and thus x_j = f_j(X_j^u = V_j^u) ⊕ e_j for all x_j in X_k^l. Accordingly, X_k^l = V_k^l is equivalent to e_j = v_j ⊕ f_j(X_j^u = V_j^u) for all x_j in X_k^l. We rewrite the r.h.s., v_j ⊕ f_j(X_j^u = V_j^u), as l_j(v_k), since x_k is an ancestor of x_j (x_k in X_j^u) and the values of all variables except x_k = v_k are constant under the selection X_k = V_k. Because every e_j is independent of its upper variables,

    p(X_k^l = V_k^l | x_k = v_k, X_k^u = V_k^u) = Π_{x_j in X_k^l} p(e_j = l_j(v_k)).

x_k has at least one child x_h in X_k^l where l_h(1) ≠ l_h(0) for some selection X_k = V_k, from the assumption that x_k is not a sink endogenous variable. Accordingly,

    p(X_k^l = V_k^l | x_k = v_k, X_k^u = V_k^u)
        = p_h Π_{x_j in X_k^l, x_j ≠ x_h} p(e_j = l_j(v_k))        for l_h(v_k) = 1,
        = (1 - p_h) Π_{x_j in X_k^l, x_j ≠ x_h} p(e_j = l_j(v_k))  for l_h(v_k) = 0.   (t2)

On the other hand, since e_k is independent of X_k^u and x_k = v_k ⇔ e_k = v_k ⊕ f_k,

    p(x_k = v_k | X_k^u = V_k^u) = p(e_k = v_k ⊕ f_k)
        = p_k       for v_k ⊕ f_k = 1,
        = 1 - p_k   for v_k ⊕ f_k = 0.   (t3)

By substituting Eqs. (t2) and (t3) into Eq. (t1), we obtain the following four cases:

    p(x_k = v_k | X_k = V_k) =
        α p_k p_h / (α p_k p_h + β (1-p_k)(1-p_h))           (t4),
        α p_k (1-p_h) / (α p_k (1-p_h) + β (1-p_k) p_h)      (t5),
        α (1-p_k) p_h / (α (1-p_k) p_h + β p_k (1-p_h))      (t6),
        α (1-p_k)(1-p_h) / (α (1-p_k)(1-p_h) + β p_k p_h)    (t7),

where α = Π_{x_j in X_k^l, x_j ≠ x_h} p(e_j = l_j(v_k)) and β = Π_{x_j in X_k^l, x_j ≠ x_h} p(e_j = l_j(v̄_k)). α and β are nonzero from Assumption 1, and each of (t4) = (t5), (t4) = (t6), (t5) = (t7) and (t6) = (t7) for any α and β, that is, any X_k^l = V_k^l excluding x_h, implies that p_k = 1/2 or p_h = 1/2, respectively.
Because neither p_k = 1/2 nor p_h = 1/2 is allowed by Assumption 1, this implies that all conditions (t4) ≠ (t5), (t4) ≠ (t6), (t5) ≠ (t7) and (t6) ≠ (t7) hold simultaneously for some X_k^l = V_k^l excluding x_h. If we assume that (t4) = (t7) and (t5) = (t6) simultaneously for such X_k^l = V_k^l excluding x_h, then p_k^2 p_h^2 = (1-p_k)^2 (1-p_h)^2 and p_k^2 (1-p_h)^2 = (1-p_k)^2 p_h^2, and so p_k = 1/2 and p_h = 1/2. Accordingly, p_k ≠ 1/2 and p_h ≠ 1/2 from Assumption 1 imply that one of (t4) = (t7) and (t5) = (t6) does not hold for X_k^l = V_k^l excluding x_h. This result shows that p(x_k = v_k | X_k = V_k) takes more than two values for some given selection X_k = V_k if x_k is not a sink endogenous variable. By taking the contrapositive, we obtain 1 ⇐ 2. □

APPENDIX B
PROOF OF PROPOSITION 1

(i) Since f_k is binary, f_k only takes the values 1 or 0 under any selection X_k = V_k. This and x_k = f_k ⊕ e_k deduce the relations x_k = ē_k or x_k = e_k, respectively. Accordingly, one of 0 < p(x_k = 0 | X_k = V_k) < 0.5 and 0 < p(x_k = 1 | X_k = V_k) < 0.5 holds because 0 < p_k < 0.5. This implies that p(x_k = 1 | X_k = V_k) ≠ p(x_k = 0 | X_k = V_k).
(ii) If f_k = 1, then x_k = ē_k is deduced from x_k = f_k ⊕ e_k. This implies that 0 < p(x_k = 0 | X_k = V_k) < 0.5 by 0 < p_k < 0.5, and thus p(x_k = 1 | X_k = V_k) > p(x_k = 0 | X_k = V_k).
(iii) If f_k = 0, then x_k = e_k is deduced from x_k = f_k ⊕ e_k. This implies that 0 < p(x_k = 1 | X_k = V_k) < 0.5 by 0 < p_k < 0.5, and thus p(x_k = 1 | X_k = V_k) < p(x_k = 0 | X_k = V_k).

From (ii), (iii) and their contrapositives with (i), the proposition is proved. □

APPENDIX C
PROOF OF LEMMA 1

Without loss of generality, p(e_k = 1) (= p_k), p(e_k = 0) (= 1 - p_k), p(f_k = 1) and p(f_k = 0), with the definition p(e_k = 1) <= p(e_k = 0), are represented as

    p(e_k = 1) = (1 + ε_k)/2,  p(e_k = 0) = (1 - ε_k)/2   (-1 <= ε_k <= 0),
    p(f_k = 1) = (1 + ξ_k)/2,  p(f_k = 0) = (1 - ξ_k)/2   (-1 <= ξ_k <= 1).
Because e_k and f_k are mutually independent and have the relation x_k = f_k ⊕ e_k,

    p(x_k = 1) = p(e_k = 1) p(f_k = 0) + p(e_k = 0) p(f_k = 1)
               = (1 + ε_k)(1 - ξ_k)/4 + (1 - ε_k)(1 + ξ_k)/4
               = (1 - ε_k ξ_k)/2,
    p(x_k = 0) = p(e_k = 1) p(f_k = 1) + p(e_k = 0) p(f_k = 0)
               = (1 + ε_k)(1 + ξ_k)/4 + (1 - ε_k)(1 - ξ_k)/4
               = (1 + ε_k ξ_k)/2.

    ⇒ p(x_k = 0) - p(x_k = 1) = ε_k ξ_k.

Accordingly, if p(x_k = 1) ≠ p(x_k = 0) then ε_k ξ_k ≠ 0. This implies that ε_k < 0 and ξ_k ≠ 0. Therefore, if p(x_k = 1) ≠ 0.5 then p(e_k = 1) = p_k < 0.5 and p(f_k = 1) ≠ 0.5. □

ACKNOWLEDGMENTS

This work was partially supported by JST, ERATO, Minato Discrete Structure Manipulation System Project and JSPS Grant-in-Aid for Scientific Research (B) #22300054. The authors would like to thank Dr. Tsuyoshi Ueno, a research fellow of the JST ERATO project, for his valuable technical comments.

REFERENCES

[1] Y. Benjamini and Y. Hochberg, "Controlling the false discovery rate: a practical and powerful approach to multiple testing," Journal of the Royal Statistical Society: Series B, vol. 57, no. 1, pp. 289-300, 1995.
[2] C.A. Bernaards, T.R. Belin and J.L. Schafer, "Robustness of a multivariate normal approximation for imputation of incomplete binary data," Statistics in Medicine, vol. 26, no. 6, pp. 1368-1382, 2007.
[3] D.M. Chickering, "Optimal structure identification with greedy search," Journal of Machine Learning Research, vol. 3, pp. 507-554, 2002.
[4] M.O. Finkelstein and B. Levin, Statistics for Lawyers, New York, N.Y.: Springer-Verlag, 1990.
[5] J.R. Gregg, Ones and Zeros: Understanding Boolean Algebra, Digital Circuits, and the Logic of Sets, Chap. 5, pp. 101-123, Hoboken, N.J.: John Wiley & Sons, 1998.
[6] P.O. Hoyer, D. Janzing, J. Mooij, J. Peters and B. Schölkopf, "Nonlinear causal discovery with additive noise models," in Advances in Neural Information Processing Systems, Proc. the 22nd Annual Conference on Neural Information Processing Systems (NIPS2008), vol. 21, pp. 689-696, 2009.
[7] S. Mani, G. Cooper and P. Spirtes, "A theoretical study of Y structures for causal discovery," Proc. the 22nd Conference on Uncertainty in Artificial Intelligence (UAI2006), pp. 314-323, 2006.
[8] J. Mooij, D. Janzing, J. Peters and B. Schölkopf, "Regression by dependence minimization and its application to causal inference in additive noise models," Proc. the 26th Annual International Conference on Machine Learning (ICML2009), pp. 94-101, 2009.
[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge, United Kingdom: Cambridge University Press, 2000.
[10] J. Peters, D. Janzing and B. Schölkopf, "Causal inference on discrete data using additive noise models," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 61, no. 2, pp. 282-293, 2011.
[11] J. Peters, J.M. Mooij, D. Janzing and B. Schölkopf, "Identifiability of causal graphs using functional models," Proc. the 27th Conference on Uncertainty in Artificial Intelligence (UAI2011), pp. 589-598, 2011.
[12] J. Ramsey, J. Zhang and P. Spirtes, "Adjacency-faithfulness and conservative causal inference," Proc. the 22nd Conference on Uncertainty in Artificial Intelligence (UAI2006), pp. 401-408, 2006.
[13] W. Sewell and V. Shah, "Social class, parental encouragement, and educational aspirations," American Journal of Sociology, vol. 73, pp. 559-572, 1968.
[14] Y.Y. Shi, G.A. Miller, O. Denisenko, H. Qian and K. Bomsztyk, "Quantitative model for binary measurements of protein-protein interactions," Journal of Computational Biology, vol. 14, no. 7, pp. 1011-1023, 2007.
[15] S. Shimizu, P.O. Hoyer, A. Hyvärinen and A. Kerminen, "A linear non-Gaussian acyclic model for causal discovery," Journal of Machine Learning Research, vol. 7, pp. 2003-2030, 2006.
[16] S. Shimizu, A. Hyvärinen, Y. Kawahara and T. Washio, "A direct method for estimating a causal ordering in a linear non-Gaussian acyclic model," Proc. the 25th Conference on Uncertainty in Artificial Intelligence (UAI2009), pp. 506-513, 2009.
[17] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P.O. Hoyer and K. Bollen, "DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model," Journal of Machine Learning Research, vol. 12, pp. 1225-1248, 2011.
[18] R.R. Sokal and F.J. Rohlf, Biometry: the Principles and Practice of Statistics in Biological Research, third ed., New York, N.Y.: Freeman, 1994.
[19] P. Spirtes, C. Glymour and R. Scheines, Causation, Prediction, and Search (Adaptive Computation and Machine Learning), Cambridge, M.A.: The MIT Press, 2000.
[20] X. Sun and D. Janzing, "Exploring the causal order of binary variables via exponential hierarchies of Markov kernels," Proc. European Symposium on Artificial Neural Networks (ESANN2007), pp. 25-27, 2007.
[21] S.R. Veflingstad and E. Plahte, "Analysis of gene regulatory network models with graded and binary transcriptional responses," Biosystems, vol. 90, no. 2, pp. 323-339, 2007.
[22] K. Zhang and A. Hyvärinen, "On the identifiability of the post-nonlinear causal model," Proc. the 25th Conference on Uncertainty in Artificial Intelligence (UAI2009), pp. 647-655, 2009.