Optimal statistical decision for Gaussian graphical model selection Valery A. Kalyagin (a), Alexander P. Koldanov (a), Petr A. Koldanov (a), Panos M. Pardalos (a, b) 1 Abstract. Gaussian graphical model is a graphical representation of the dependence structure for a Gaussian random vector. It is recognized as a powerful tool in different 7 applied fields such as bioinformatics, error-control codes, speech language, information 1 0 retrieval and others. Gaussian graphical model selection is a statistical problem to identify 2 the Gaussian graphical model from a sample of a given size. Different approaches for n a Gaussian graphical model selection are suggested in the literature. One of them is based on J 9 considering the family of individual conditional independence tests. The application of this ] approach leads to the construction of a variety of multiple testing statistical procedures for L M Gaussian graphical model selection. An important characteristic of these procedures is its . error ratefor a given sample size. Inexisting literature great attentionis paid to thecontrol t a t of error rates for incorrect edge inclusion (Type I error). However, in graphical model s [ selection it is also important to take into account error rates for incorrect edge exclusion 1 (Type II error). To deal with this issue we consider the graphical model selection problem v 1 in the framework of the multiple decision theory. The quality of statistical procedures is 7 0 measured by a risk function with additive losses. Additive losses allow both types of errors 2 0 to be taken into account. We construct the tests of a Neyman structure for individual . 1 hypotheses and combine them to obtain a multiple decision statistical procedure. We show 0 7 thattheobtained procedureis optimalinthe sense that itminimizes the linear combination 1 : v of expected numbers of Type I and Type II errors in the class of unbiased multiple decision i X procedures. r a 1(a) - Laboratory of Algorithms and Technologies for Network Analysis, National Research Univer- sity Higher School of Economics, Bolshaya Pecherskaya 25/12, Nizhny Novgorod, 603155 Russia, vkalya- [email protected] and (b) - University of Florida, ISE Department, Gainesville, FL 32611,USA. 1 Keywords. Gaussian graphical model; Model selection; Loss function; Risk function; Optimal statistical decision; Unbiased multiple decision statistical procedures; Exponential families; tests of a Neyman structure. 1 Introduction Gaussian graphical model is a graphical representation of the dependence structure for a Gaussian random vector. It is recognized as a powerful tool in different applied fields such as bioinformatics, error-control codes, speech language, information retrieval and others [4]. There are two aspects related to the study of the graphical model: the computational and algorithmic aspect, and the statistical aspect. The computational aspect has become increasingly popular due to the growing interest in large scale networks [5]. A central question in the statistical aspect is how to recover the structure of an undirected Gaussian graph from observations. This problem is called the Gaussian graphical model selection problem(GGMSproblem). Acomprehensive surveyondifferentapproachestothisproblem is given in [3]. Most of the constructed statistical procedures for GGMS are based on asymptotically optimal estimations of correlations [1], [2], [12]. However, as far as we know, there are no results related to the optimality of statistical procedures for GGMS for a fixed sample size. This is the subject of this paper. The quality of statistical procedures for GGMS can be measured by two types of errors: false edge inclusion (Type I error) and false edge exclusion (Type II error). Traditional measures of quality used in GGMS are: FWER (Family Wise Error Rate), FDR (False Discovery Rate), FDP (False Discovery Proportion) and others [3]. Most of them are connected with the Type I error (false edge inclusion). It is clear, that the quality of GGMS procedures has to be related to the difference between two graphs: the true graph and selected graph. Therefore in GGMS it is important to take into account both types of errors (Type I and Type II errors). The quality of a statistical procedure can be measured 2 in this case by the linear combination of expected values of the numbers of Type I and Type II errors. We refer to a selection statistical procedure which minimizes this value for a fixed sample size as optimal. The main goal of this paper is to find an optimal statistical procedure for GGMS. To achieve this goal we consider the graphical model selection problem as part of the multiple decision theory [8]. The quality of statistical procedures is measured by the risk function. To take into account the numbers of Type I and Type II errors we consider the additive loss function. We prove that in this case the risk function is a linear combination of expected values of the numbers of Type I and Type II errors. Subsequently, we construct tests of a Neyman structure for individual hypotheses and combine them to obtain a multiple decision statistical procedure. We show that the obtained procedure minimizes the risk function in the class of unbiased multiple decision procedures. The paper is organized as follows. In Section 2 we give basic definitions and notations. In Section 3 we describe the multiple decision framework for the Gaussian graphical model selection problem and prove the representation of the risk function. In Section 4 we con- struct the tests of a Neyman structure for individual hypotheses. In Section 5 we combine individual tests in the multiple decision procedure and prove its optimality. In Section 6 we discuss the proposed approach for different model selection problems. 2 Problem statement Let X = (X ,X ,...,X ) be a random vector with the multivariate Gaussian distribution 1 2 p from N(µ,Σ), where µ = (µ ,µ ,...,µ ) is the vector of means and Σ = (σ ) is the 1 2 p i,j covariance matrix, σ = cov(X ,X ), i,j = 1,2,...,p. Let x(t), t = 1,2,...,n be a i,j i j sample of size n from the distribution of X. We assume in this paper that n > p, and that the matrix Σ is non degenerate. The case n < p has a practical interest too [10], but it is not considered in this paper. The undirected Gaussian graphical model is an 3 undirected graph with p nodes. The nodes of the graph are associated with the random variables X ,X ,...,X , edge (i,j) is included in the graph if the random variables X ,X 1 2 p i j are conditionally dependent [7], [1]. Gaussian graphical model selection problem consists of the identification of a graphical model from observations. The partial correlation ρ of X , X given X , k N(i,j) = 1,2,...,p i,j i,j•N(i,j) i j k ∈ { }\{ } is defined as the correlation of X , X in the conditional distribution of X , X given i j i j X , k N(i,j). It is known[1] that the conditional distribution of X , X given X , k i j k ∈ k N(i,j) is Gaussian with the correlation ρ . It implies that the conditional i,j•N(i,j) ∈ independence of X , X given X , k N(i,j) = 1,2,...,p i,j is equivalent to the i j k ∈ { } \ { } equation ρ = 0. Therefore, the Gaussian graphical model selection is equivalent to i,j•N(i,j) simultaneous inference on hypotheses of pairwise conditional independence ρ = 0, i,j•N(i,j) i = j, i,j = 1,2,...,p. 6 The inverse matrix for Σ, Σ−1 = (σi,j) is known as the concentration or precision matrix for the distribution of X. For simplicity we use the notation ρi,j = ρ . The problem i,j•N(i,j) of pairwise conditional independence testing has the form: h : ρi,j = 0 vs k : ρi,j = 0, i = j,i,j = 1,2,...,p (1) i,j i,j 6 6 According to [7] the partial correlation can be written as σi,j ρi,j = −√σi,iσj,j Note that the problem of pairwise conditional independence testing (1) is equivalent to h : σi,j = 0, vs k : σi,j = 0,i = j,i,j = 1,2,...,p i,j i,j 6 6 The Gaussian graphical model selection problem can be formulated now as multiple testing problem for the set of hypotheses (1). 4 3 Multiple decision approach In this Section we consider the GGMS problem in the framework of decision theory [11]. According to this approach we specify the decision statistical procedures and risk function. Let X = (X ,X ,...,X ) be a random vector with multivariate Gaussian distribution 1 2 p from N(µ,Σ). In the GGMS study observations are modeled as a sequence of random vectors X(t), t = 1,2,...,n where n is the sample size and vectors X(t) are independent and identically distributed as X. Let x = (x (t)) be observations of the random variables i X (t), t = 1,2,...,n, i = 1,2,...,p. Consider the set of all p p symmetric matrices i G × G = (g ) with g 0,1 , i,j = 1,2,...,p, g = 0, i = 1,2,...,p. Matrices G i,j i,j i,i ∈ { } ∈ G represent adjacency matrices of all simple undirected graphs with p vertices. The total number of matrices in is equal to L = 2M with M = p(p 1)/2. The GGMS problem G − can be formulated as a multiple decision problem of the choice between L hypotheses: H : ρi,j = 0,ifg = 0, ρi,j = 0 if g = 1; i = j, i,j = 1,2,...,p (2) G i,j i,j 6 6 The multiple decision statistical procedure δ(x) is a mapfrom the sample space Rp×n to the decision space D = d ,g , where the decision d is the acceptance of hypothesis H , G G G { ∈ G} G . Let ϕ (x) be tests for the individual hypothesis (1). More precisely, ϕ (x) = 1 i,j i,j ∈ G means that hypothesis h is rejected (edge (i,j) is included in the graphical model), and i,j ϕ (x) = 0meansthathypothesish isaccepted(edge(i,j)isnotincludedinthegraphical i,j i,j model). Let Φ(x) be the matrix 0 ϕ (x) ... ϕ (x) 1,2 1,p ϕ2,1(x) 0 ... ϕ2,p(x) Φ(x) = . (3) ... ... ... ... ϕ (x) ϕ (x) ... 0 p,1 p,2 Any multiple decision statistical procedure δ(x) based on the simultaneous inference of individual edge tests (1) can be written as δ(x) = d , iff Φ(x) = G (4) G 5 According to [11] the quality of the statistical procedure is defined by the risk function. LetΩbethesetofparametersΩ = θ : θ = (µ,Σ),µ Rp,Σ is a symmetric positive definite matrix . { ∈ } By Ω we denote the parametric region corresponding to hypothesis H . Let S = (s ), S S i,j Q = (q ), S, Q . By w(S,Q) we denote the loss from decision d when hypothesis H i,j Q S ∈ G is true, i.e. w(H ;d ) = w(S,Q), S,Q S Q ∈ G Assume that w(S,S) = 0,S . The risk function is defined by ∈ G Risk(S,θ;δ) = w(S,Q)P (δ(x) = d ), θ Q Q∈G X where P (δ(x) = d ) is the probability that decision d is taken. θ Q Q As mentioned before, for the GGMS problem it is important to control Type I and Type II errors. Let a be the loss from the false inclusion of edge (i,j) in the graphical i,j model, and let b , be the loss from the false non inclusion of the edge (i,j) in the graphical i,j model, i,j = 1,2,...,p; i = j. 6 Define the individual loss as a , if s = 0,q = 1, i,j i,j i,j w (S,Q) = b , if s = 1,q = 0, i,j i,j i,j i,j 0, otherwise To take into account both types of errors we suggest the total loss w(S,Q) is defined as: p p w(S,Q) = w (S,Q) (5) i,j i=1 j=1 XX It means that the total loss from the misclassification of H is equal to the sum of losses S from the misclassification of individual edges: w(S,Q) = a + b i,j i,j {i,j:si,jX=0;qi,j=1} {i,j:si,jX=1;qi,j=0} The main result of this Section is the following theorem 6 Theorem 1 Let the loss function w be defined by (5), and a = a, b = b, i = j, i,j i,j 6 i,j = 1,2,...,p. Then Risk(S,θ;δ) = aE [Y (S,δ)]+bE [Y (S,δ)] θ I θ II where Y (S,δ), Y (S,δ) are the numbers of Type I and Type II errors for model selection I II by the statistical procedure δ when the true decision is d . S Proof. One has Risk(S,θ;δ) = w(S,Q)P (δ(x) = d ) = θ Q Q∈G X [ a + b ]P (δ(x) = d ) = i,j i,j θ Q QX∈G {i,j:si,jX=0;qi,j=1} {i,j:si,jX=1;qi,j=0} = [aN (Q)+bN (Q)]P (δ(x) = d ) = aE [Y (S,δ]+bE [(Y (S,δ)] I II θ Q θ I θ II Q∈G X where N (Q) is the number of Type I errors when the procedure δ(x) takes decision d I Q and θ Ω , and N (Q) is the number of Type II errors when the procedure δ(x) takes S II ∈ decision d and θ Ω . Q S ∈ 4 Uniformly most powerful unbiased tests for individ- ual hypotheses In this Section we briefly present the uniformly most powerful unbiased tests for individual hypotheses (1). More details are given in our paper [6]. Consider the statistics 1 S = Σn (X (t) X )(X (t) X ), k,l n t=1 k − k l − l 7 The joint distribution of statistics S , k,l = 1,2,...,N, n > p is given by the Wishart k,l density function [1]: [det(σk,l)]n/2 [det(s )](n−p−2)/2 exp[ (1/2) s σk,l] f( s ) = × k,l × − k l k,l k,l { } 2(pn/2) πp(p−1)/4 Γ(n/2)Γ((n 1)/2) Γ((n p+1)/2) × × − ··· P−P if the matrix S = (s ) is positive definite, and f( s ) = 0 otherwise. The Wishart k,l k,l { } density function can be written as: 1 f( s ) = C( σk,l )exp[ σi,js s σk,l]m( s ) k,l i,j k,l k,l { } { } − − 2 { } (k,l)6=(i,j);(k,l)6=(j,i) X where C( σk,l ) = c−1[det(σk,l)]n/2 { } 1 c = 2(pn/2) πp(p−1)/4 Γ(n/2)Γ((n 1)/2) Γ((n p+1)/2) 1 × × − ··· − m( s ) = [det(s )](n−p−2)/2 k,l k,l { } According to [9] (Ch. 4) the uniformly most powerful unbiased (UMPU) test for hy- pothesis h has the form: i,j 0, if c′ ( s ) < s < c′′ ( s ), (k,l) = (i,j) ϕ ( s ) = i,j { k,l} i,j i,j { k,l} 6 (6) i,j k,l { } ( 1, if s c′ ( s ) or s c′′ ( s ), (k,l) = (i,j) i,j ≤ i,j { k,l} i,j ≥ i,j { k,l} 6 where the critical values c′ ,c′′ are defined from the equations i,j i,j [det(s )](n−p−2)/2ds I∩[c′ ;c′′ ] k,l i,j i,j i,j = 1 α (7) i,j R [det(s )](n−p−2)/2ds − I k,l i,j R s [det(s )](n−p−2)/2ds + i,j k,l i,j ZI∩(−∞;c′i,j] + s [det(s )](n−p−2)/2ds = (8) i,j k,l i,j ZI∩[c′i′,j;+∞) = α s [det(s )](n−p−2)/2ds i,j I i,j k,l i,j R 8 where I is the interval of values of s such that the matrix S = (s ) is positive definite, i,j k,l and α is the significance level of the test. i,j It is shown in [6] that the constructed UMPU test is equivalent to the following partial correlation test 0, 2q 1 < ri,j < 1 2q ϕumpu = i,j − − i,j (9) i,j ( 1, otherwise, where ri,j is the sample partial correlation, and q is the (α /2)-quantile of the beta i,j i,j distribution Be(n−p, n−p). 2 2 Finally, we need to specify the optimality and unbiasedeness of the test (9). Let ω be i,j the set of parameters defined by ω = Ω i,j S S:s[i,j=0 Denote by ω−1 = Ω w . Let ϕ be a test for individual hypothesis h with significance i,j \ i,j i,j i,j level α . The test ϕ is reffered to as unbiased [9] if i,j i,j E (ϕ ) α , θ ω ; E (ϕ ) α , θ ω−1 (10) θ i,j ≤ i,j ∀ ∈ i,j θ i,j ≥ i,j ∀ ∈ i,j The test ϕumpu defined by (9) is unbiased and the following inequality holds: i,j E (ϕumpu) E (ϕ ), θ ω−1 (11) θ i,j ≥ θ i,j ∀ ∈ i,j for any unbiased test ϕ . i,j 5 Optimal multiple decision procedures According to Wald [11] the procedure δ∗ is referred to as optimal in the class of statistical procedures if C Risk(S,θ;δ∗) Risk(S,θ;δ), (12) ≤ for any S , θ Ω , δ . S ∈ G ∈ ∈ C 9 In this paper we consider the class of w-unbiased statistical procedures. Statistical procedure δ(x) is reffered to as w-unbiased if one has Risk(S,θ;δ) = E w(S;δ) E w(S′;δ) = Risk(S′,θ;δ), (13) θ θ ≤ for any S,S′ , θ Ω . The following theorem describes the optimal procedure in the S ∈ G ∈ class of w-unbiased multiple decision procedures for Gaussian graphical model selection. Theorem 2 Let the loss function w be defined by (5). Let the procedure δou be defined by (3)- (4), where ϕ = ϕumpu is defined by (9) and i,j i,j b i,j α = . (14) i,j a +b i,j i,j Then procedure δou is an optimal multiple decision statistical procedure in the class of w- unbiased procedures for Gaussian graphical model selection. Proof. We use the general approach by Lehmann [8] and give a direct proof. Let δ be a statistical procedure defined by (3)-(4). If the loss function w satisfies (5) then the risk of statistical procedure δ is the sum of the risks of individual tests ϕ . Indeed, one has i,j Risk(S,θ;δ) = w(S,Q)P (δ(x) = d ) = θ Q Q∈G X = [ a + b ]P (δ(x) = d ) = i,j i,j θ Q QX∈G {i,j:si,jX=0;qi,j=1} {i,j:si,jX=1;qi,j=0} p p = a P (δ(x) = d )+ b P (δ(x) = d ) = i,j θ Q i,j θ Q i,j=X1,si,j=0 Q,Xqi,j=1 i,j=X1,si,j=1 Q,Xqi,j=0 p p = a P (ϕ (x) = 1)+ b P (ϕ (x) = 0) = i,j θ i,j i,j θ i,j i,j=X1,si,j=0 i,j=X1,si,j=1 10