Sequential identification of nonignorable missing data mechanisms

Mauricio Sadinle and Jerome P. Reiter
Duke University

January 6, 2017

arXiv:1701.01395v1 [math.ST] 5 Jan 2017

Abstract

With nonignorable missing data, likelihood-based inference should be based on the joint distribution of the study variables and their missingness indicators. These joint models cannot be estimated from the data alone, thus requiring the analyst to impose restrictions that make the models uniquely obtainable from the distribution of the observed data. We present an approach for constructing classes of identifiable nonignorable missing data models. The main idea is to use a sequence of carefully set up identifying assumptions, whereby we specify potentially different missingness mechanisms for different blocks of variables. We show that the procedure results in models with the desirable property of being non-parametric saturated.

Key words and phrases: Identification; Non-parametric saturated; Missing not at random; Partial ignorability; Sensitivity analysis.

1 Introduction

When data are missing not at random (MNAR) (Rubin (1976)), appropriate likelihood-based inference requires explicit models for the full-data distribution, i.e., the joint distribution of the study variables and their missingness indicators. Because of the missing data, this distribution is not uniquely identified from the observed data alone (Little and Rubin (2002)). To enable inference, analysts must impose restrictions on the full-data distribution. Such assumptions generally are untestable; however, a minimum desideratum is that they result in a unique full-data distribution for the observed-data distribution at hand, i.e., the distribution that can be identified from the incomplete data.

We present a strategy for constructing identifiable full-data distributions with nonignorable missing data. In its most general form, the strategy is to expand the observed-data distribution sequentially by identifying parts of the full-data distribution associated with blocks of variables, one block at a time. This partitioning of the variables allows analysts to specify different missingness mechanisms in the different blocks; for example, use the missing at random (MAR, Rubin (1976)) assumption for some variables and a nonignorable missingness assumption for the rest to obtain a partially ignorable mechanism (Harel and Schafer (2009)). We ensure that the resulting full-data distributions are non-parametric saturated (NPS, Robins (1997)), that is, their implied observed-data distribution matches the actual observed-data distribution, as detailed in Section 2.2.

Related approaches to partitioning variables with missing data have appeared previously in the literature. Zhou et al. (2010) proposed to model blocks of study variables and their missingness indicators in a sequential manner; however, their approach does not guarantee identifiability of the full-data distribution. Harel and Schafer (2009) mentioned the possibility of treating the missingness in blocks of variables differently, but they do not provide results on identification. Robins (1997) proposed the group permutation missingness mechanism, which assumes MAR sequentially for blocks of variables and results in a NPS model. This is a particular case of our more general procedure, as we describe in Section 3.4.

The remainder of the article is organized as follows. In Section 2, we describe notation and provide more details on the NPS property.
In Section 3, we introduce our strategy for identifying a full-data distribution in a sequential manner. In Section 4 we present some examples of how to use this strategy for the case of two categorical study variables, for the construction of partially ignorable mechanisms, and for sensitivity analyses. Finally, in Section 5 we discuss possible future uses of our identifying approach.

2 Notation and Background

2.1 Notation

Let $X = (X_1, \dots, X_p)$ denote $p$ random variables taking values on a sample space $\mathcal{X}$. Let $M_j$ be the missingness indicator for variable $j$, where $M_j = 1$ when $X_j$ is missing and $M_j = 0$ when $X_j$ is observed. Let $M = (M_1, \dots, M_p)$, which takes values on $\{0,1\}^p$. Let $\mu$ be a dominating measure for the distribution of $X$, and let $\nu$ represent the product measure between $\mu$ and the counting measure on $\{0,1\}^p$. The full-data distribution is the joint distribution of $X$ and $M$. We call its density $f$ with respect to $\nu$ the full-data density. In practice, the full-data distribution cannot be recovered from sampled data, even with an infinite sample size.

An element $m = (m_1, \dots, m_p) \in \{0,1\}^p$ is called a missingness pattern. Given $m \in \{0,1\}^p$ we define $\bar{m} = 1_p - m$ to be the indicator vector of observed variables, where $1_p$ is a vector of ones of length $p$. For each $m$, we define $X_m = (X_j : m_j = 1)$ to be the missing variables and $X_{\bar{m}} = (X_j : \bar{m}_j = 1)$ to be the observed variables, which have sample spaces $\mathcal{X}_m$ and $\mathcal{X}_{\bar{m}}$, respectively. The observed-data distribution is the distribution involving the observed variables and the missingness indicators, which has density $f(X_{\bar{m}} = x_{\bar{m}}, M = m) = \int_{\mathcal{X}_m} f(X = x, M = m)\,\mu(dx_m)$, where $x \in \mathcal{X}$ represents a generic element of the sample space, and we define $x_m$ and $x_{\bar{m}}$ similarly as with the random vectors.

An alternative way of representing the observed-data distribution is by introducing the materialized variables $X^* = (X^*_1, \dots, X^*_p)$, where

$$X^*_j \equiv \begin{cases} X_j, & \text{if } M_j = 0; \\ *, & \text{if } M_j = 1; \end{cases}$$

and "$*$" is a placeholder for missingness. The sample space $\mathcal{X}^*_j$ of each $X^*_j$ is the union of $\{*\}$ and the sample space $\mathcal{X}_j$ of $X_j$. The materialized variables contain all the observed information: if $X^*_j = *$ then $X_j$ was not observed, and if $X^*_j = x_j$ for any value $x_j \neq *$ then $X_j$ was observed and $X_j = x_j$. Given $m \in \{0,1\}^p$ and $x_{\bar{m}} \in \mathcal{X}_{\bar{m}}$, we define $x^* \equiv x^*(m, x_{\bar{m}})$ such that $x^*_{\bar{m}} = x_{\bar{m}}$ and $x^*_m = *$, where $*$ is a vector with the appropriate number of $*$ symbols. For example, if $m = (1,1,0)$ and $x_{\bar{m}} = x_3$, then $x^* = (*, *, x_3)$. The event $X^* = x^*(m, x_{\bar{m}})$ is equivalent to $M = m$ and $X_{\bar{m}} = x_{\bar{m}}$, which implies that the distribution of $X^*$ is equivalent to the observed-data distribution. Therefore, with some abuse of notation, the observed-data density can be written in terms of $X^*$, that is, $f\{X^* = x^*(m, x_{\bar{m}})\} \equiv f(X_{\bar{m}} = x_{\bar{m}}, M = m)$. When there is no need to refer to the $m$ and $x_{\bar{m}}$ that define $x^*$, we simply write $f(X^* = x^*)$ to denote the observed-data density evaluated at an arbitrary point.

In what follows we often write $f(X = x, M = m)$ simply as $f(X, M)$, $f(X^* = x^*)$ as $f(X^*)$, and likewise for other expressions, provided that there is no ambiguity. For the sake of simplicity, we use "$f$" for technically different functions, but their actual interpretations should be clear from the arguments passed to them. For example, we denote the missingness mechanism as $f(M = m \mid X = x)$, or simply $f(M \mid X)$.
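To fix ideas, here is a minimal Python sketch of the materialized vector $x^* = x^*(m, x_{\bar{m}})$ for $p = 3$; this illustration is ours rather than the paper's, and np.nan simply stands in for the placeholder $*$.

```python
import numpy as np

def materialize(x, m):
    """Return x* = x*(m, x_mbar): x*_j = x_j when m_j = 0 and "*" (np.nan here) when m_j = 1."""
    x_star = np.array(x, dtype=float)          # copy of the full value vector
    x_star[np.asarray(m) == 1] = np.nan        # hide the coordinates flagged as missing
    return x_star

# Example from the text: m = (1, 1, 0) observes only X_3, so x* = (*, *, x_3).
print(materialize([2.5, 7.0, 1.0], [1, 1, 0]))   # -> [nan nan  1.]
```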
2.2 Non-Parametric Saturated Modeling

Since the true joint distribution of $X$ and $M$ cannot be identified from observed data alone, we need to work under the assumption that the full-data distribution falls within a class defined by a set of restrictions.

Definition 1 (Identifiability). Consider a class of full-data distributions $\mathcal{F}_A$ defined by a set of restrictions $A$. We say that the class $\mathcal{F}_A$ is identifiable if there is a mapping from the set of observed-data distributions to $\mathcal{F}_A$.

If we only require identifiability from a set of full-data distributions, two different observed-data distributions could map to the same full-data distribution. Robins (1997) introduced the stricter concept of a class of full-data distributions being non-parametric saturated, also called non-parametric identified (Vansteelandt et al. (2006); Daniels and Hogan (2008)).

Definition 2 (Non-parametric Saturation). Consider a class of full-data distributions $\mathcal{F}_A$ defined by a set of restrictions $A$. We say that the class $\mathcal{F}_A$ is non-parametric saturated if there is a one-to-one mapping from the set of observed-data distributions to $\mathcal{F}_A$.

The set $A$ of restrictions, or identifiability assumptions, that defines a NPS class allows us to build a full-data distribution, say with density $f_A(X = x, M = m)$, from an observed-data distribution with density $f(X_{\bar{m}} = x_{\bar{m}}, M = m)$, so that $f_A(X_{\bar{m}} = x_{\bar{m}}, M = m) = f(X_{\bar{m}} = x_{\bar{m}}, M = m)$, where by definition $f_A(X_{\bar{m}} = x_{\bar{m}}, M = m) = \int_{\mathcal{X}_m} f_A(X = x, M = m)\,\mu(dx_m)$. In terms of $X^*$, the NPS property is expressed as $f_A(X^*) = f(X^*)$.

NPS is a desirable property, particularly for comparing inferences under different approaches to handling nonignorable missing data. When two missing data models satisfy NPS, we can be sure that any discrepancies in inferences are due entirely to the different assumptions on the non-identifiable parts of the full-data distribution. In contrast, without NPS, it can be difficult to disentangle what parts of the discrepancies are due to the identifying assumptions and what parts are due to differing constraints on the observed-data distribution. Thus, NPS greatly facilitates sensitivity analysis (Robins (1997)).

For a given $m$, we refer to the conditional distribution of the missing study variables given the observed data as the missing-data distribution, also known as the extrapolation distribution (Daniels and Hogan (2008)), with density $f(X_m = x_m \mid X_{\bar{m}} = x_{\bar{m}}, M = m)$. These distributions correspond to the non-identifiable parts of the full-data distribution. A NPS approach is equivalent to a recipe for building these distributions from the observed-data distribution without imposing constraints on the latter.

NPS models can be constructed in many ways. For example, in pattern mixture models, one can use the complete-case missing-variable restriction (Little (1993)), which sets $f(X_m = x_m \mid X_{\bar{m}} = x_{\bar{m}}, M = m) = f(X_m = x_m \mid X_{\bar{m}} = x_{\bar{m}}, M = 0_p)$ for all $m \in \{0,1\}^p$. Although Little (1993) considered parametric models for each $f(X_{\bar{m}} = x_{\bar{m}} \mid M = m)$, this does not have to be the case, and therefore pattern mixture models can be NPS.
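As a concrete illustration of the complete-case missing-variable restriction, the short Python sketch below builds a full-data distribution for two binary variables by copying every extrapolation density from the complete cases and then checks that the construction is NPS. This is our sketch, not code from Little (1993) or from this paper, and the observed-data probabilities are made up for illustration.

```python
import numpy as np

# Made-up observed-data probabilities for two binary variables (values 0/1).
pi_full = np.array([[0.30, 0.10],      # P(X1=i, X2=j, M1=0, M2=0)
                    [0.05, 0.15]])
pi_x1_only = np.array([0.12, 0.08])    # P(X1=i, M1=0, M2=1)
pi_x2_only = np.array([0.06, 0.04])    # P(X2=j, M1=1, M2=0)
pi_none = 0.10                         # P(M1=1, M2=1)

# Complete-case missing-variable restriction: every extrapolation density
# f(X_m | X_mbar, M = m) is set equal to f(X_m | X_mbar, M = (0,0)).
f_x2_given_x1_cc = pi_full / pi_full.sum(axis=1, keepdims=True)
f_x1_given_x2_cc = pi_full / pi_full.sum(axis=0, keepdims=True)
f_x1x2_cc = pi_full / pi_full.sum()

full = np.zeros((2, 2, 2, 2))                             # axes: X1, X2, M1, M2
full[:, :, 0, 0] = pi_full
full[:, :, 0, 1] = pi_x1_only[:, None] * f_x2_given_x1_cc
full[:, :, 1, 0] = pi_x2_only[None, :] * f_x1_given_x2_cc
full[:, :, 1, 1] = pi_none * f_x1x2_cc

# NPS: integrating the missing coordinates back out recovers the inputs.
assert np.allclose(full[:, :, 0, 1].sum(axis=1), pi_x1_only)
assert np.allclose(full[:, :, 1, 0].sum(axis=0), pi_x2_only)
assert np.isclose(full[:, :, 1, 1].sum(), pi_none)
print(full.sum())                                         # 1.0
```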
Another example is the permutation missingness model of Robins (1997), which for a specific ordering of the study variables assumes that the probability of observing the $k$th variable depends on the previous study variables and the subsequent observed variables. The group permutation missingness model of Robins (1997) is an analog of the latter for groups of variables and is also NPS. Sadinle and Reiter (2017) introduced a missingness mechanism where each variable and its missingness indicator are conditionally independent given the remaining variables and missingness indicators, which leads to a NPS model. Tchetgen Tchetgen et al. (2016) proposed a NPS approach based on discrete choice models. Finally, we note that MAR models also can be NPS, as shown by Gill et al. (1997).

3 Sequential Identification Strategy

We consider the $p$ variables as divided into $K$ blocks, $X = (X_1, \dots, X_K)$, where block $X_k = (X_{t_{k-1}+1}, \dots, X_{t_k})$ contains $t_k - t_{k-1}$ variables. As our results only concern the identification of full-data distributions starting from an observed-data distribution, we assume that $f(X^*) = f(X^*_1, \dots, X^*_K)$ is known. The identification strategy consists of specifying a sequence of assumptions $A_1, \dots, A_K$, one for each block of variables, where each $A_k$ allows us to identify the conditional distribution of $X_k$ and $M_k$ given $X_{<k} \equiv (X_1, \dots, X_{k-1})$, $X^*_{>k} \equiv (X^*_{k+1}, \dots, X^*_K)$, and a carefully chosen subset of the missingness indicators $M_{<k} \equiv (M_1, \dots, M_{k-1})$ described below. We first provide a general description of how $A_1, \dots, A_K$ allow us to identify parts of the full-data distribution in a sequential manner, and then in Theorem 1 present the formal identification result.

3.1 Description

We now present the steps needed to implement the identification strategy. A graphical summary of the procedure is provided in Figure 1.

Step 1. Write $f(X^*) = f(X^*_1 \mid X^*_{>1})\, f(X^*_{>1})$. Consider an identifiability assumption $A_1$ on the distribution of $X_1$ and $M_1$ given $X^*_{>1}$ that allows us to obtain a distribution with density $f_{A_1}(X_1, M_1 \mid X^*_{>1})$ with the NPS property $f_{A_1}(X^*_1 \mid X^*_{>1}) = f(X^*_1 \mid X^*_{>1})$. From this we can define $f_{A_1}(X_1, M_1, X^*_{>1}) \equiv f_{A_1}(X_1, M_1 \mid X^*_{>1})\, f(X^*_{>1})$.

Step 2. Suppose we divide the $t_1$ variables in $X_1$ into two sets indexed by $R_1$ and $S_1$, where $R_1 \cup S_1 = \{1, \dots, t_1\}$ and $R_1 \cap S_1 = \emptyset$; see Remark 1 below for discussion of why one might want to do so. Let $M_{R_1}$ and $M_{S_1}$ be the corresponding missingness indicators. We can write $f_{A_1}(X_1, M_1, X^*_{>1}) = f_{A_1}(M_{S_1} \mid X_1, M_{R_1}, X^*_{>1})\, f_{A_1}(X_1, M_{R_1}, X^*_{>1})$, where $f_{A_1}(X_1, M_{R_1}, X^*_{>1}) = f_{A_1}(X^*_2 \mid X_1, M_{R_1}, X^*_{>2})\, f_{A_1}(X_1, M_{R_1}, X^*_{>2})$. Consider an identifiability assumption $A_2$ on the distribution of $X_2$ and $M_2$ given $X_1$, $M_{R_1}$ and $X^*_{>2}$ that allows us to obtain a distribution with density $f_{A_{\leq 2}}(X_2, M_2 \mid X_1, M_{R_1}, X^*_{>2})$ with the NPS property $f_{A_{\leq 2}}(X^*_2 \mid X_1, M_{R_1}, X^*_{>2}) = f_{A_1}(X^*_2 \mid X_1, M_{R_1}, X^*_{>2})$. From this we can define $f_{A_{\leq 2}}(X_{\leq 2}, M_{R_1}, M_2, X^*_{>2}) \equiv f_{A_{\leq 2}}(X_2, M_2 \mid X_1, M_{R_1}, X^*_{>2})\, f_{A_1}(X_1, M_{R_1}, X^*_{>2})$. The notation $f_{A_{\leq 2}}$ emphasizes that the distribution relies on $A_1$ and $A_2$.

Step k+1. At the end of the $k$th step we have $f_{A_{\leq k}}(X_{\leq k}, M_{R_{k-1}}, M_k, X^*_{>k})$. Let $R_k \cup S_k = \{t_{k-1}+1, \dots, t_k\} \cup R_{k-1}$, $R_k \cap S_k = \emptyset$, and let $M_{R_k}$ and $M_{S_k}$ be the corresponding missingness indicators. We can write $f_{A_{\leq k}}(X_{\leq k}, M_{R_{k-1}}, M_k, X^*_{>k}) = f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k})\, f_{A_{\leq k}}(X_{\leq k}, M_{R_k}, X^*_{>k})$, where $f_{A_{\leq k}}(X_{\leq k}, M_{R_k}, X^*_{>k}) = f_{A_{\leq k}}(X^*_{k+1} \mid X_{\leq k}, M_{R_k}, X^*_{>k+1})\, f_{A_{\leq k}}(X_{\leq k}, M_{R_k}, X^*_{>k+1})$. Now, consider an identifiability assumption $A_{k+1}$ on the distribution of $X_{k+1}$ and $M_{k+1}$ given $X_{\leq k}$, $M_{R_k}$ and $X^*_{>k+1}$ that allows us to obtain a distribution with density $f_{A_{\leq k+1}}(X_{k+1}, M_{k+1} \mid X_{\leq k}, M_{R_k}, X^*_{>k+1})$ with the NPS property $f_{A_{\leq k+1}}(X^*_{k+1} \mid X_{\leq k}, M_{R_k}, X^*_{>k+1}) = f_{A_{\leq k}}(X^*_{k+1} \mid X_{\leq k}, M_{R_k}, X^*_{>k+1})$. From this we can define

$$f_{A_{\leq k+1}}(X_{\leq k+1}, M_{R_k}, M_{k+1}, X^*_{>k+1}) \equiv f_{A_{\leq k+1}}(X_{k+1}, M_{k+1} \mid X_{\leq k}, M_{R_k}, X^*_{>k+1})\, f_{A_{\leq k}}(X_{\leq k}, M_{R_k}, X^*_{>k+1}).$$
Step K. For the final step, assumption $A_K$ is on the distribution of $X_K$ and $M_K$ given $X_{<K}$ and $M_{R_{K-1}}$. Following the previous generic identifying step, we obtain $f_{A_{\leq K}}(X, M_{R_{K-1}}, M_K) \equiv f_{A_{\leq K}}(X_K, M_K \mid X_{<K}, M_{R_{K-1}})\, f_{A_{<K}}(X_{<K}, M_{R_{K-1}})$. We can obtain the implied distribution of the study variables, with density $f_{A_{\leq K}}(X)$, from this last equation.

Figure 1: Sequential identification strategy. The diagram depicts the chain

$f(X^*)$
$\quad\downarrow\; A_1 \{\, f(X^*_1 \mid X^*_{>1}) \,\}$
$f_{A_1}(X_1, M_1, X^*_{>1})$
$\quad\downarrow\; A_2 \{\, f_{A_1}(X^*_2 \mid X_1, M_{R_1}, X^*_{>2}) \,\}$
$f_{A_{\leq 2}}(X_{\leq 2}, M_{R_1}, M_2, X^*_{>2})$
$\quad\downarrow\; A_3 \{\, f_{A_{\leq 2}}(X^*_3 \mid X_{\leq 2}, M_{R_2}, X^*_{>3}) \,\}$
$\quad\vdots$
$\quad\downarrow\; A_{K-1} \{\, f_{A_{\leq K-2}}(X^*_{K-1} \mid X_{\leq K-2}, M_{R_{K-2}}, X^*_K) \,\}$
$f_{A_{\leq K-1}}(X_{\leq K-1}, M_{R_{K-2}}, M_{K-1}, X^*_K)$
$\quad\downarrow\; A_K \{\, f_{A_{\leq K-1}}(X^*_K \mid X_{\leq K-1}, M_{R_{K-1}}) \,\}$
$f_{A_{\leq K}}(X, M_{R_{K-1}}, M_K)$

We write $A_k \{\, f_{A_{\leq k-1}}(X^*_k \mid \dots) \,\}$ to indicate that assumption $A_k$ is being used to obtain a conditional full-data density $f_{A_{\leq k}}(X_k, M_k \mid \dots)$ from $f_{A_{\leq k-1}}(X^*_k \mid \dots)$.

Remark 1. The main characteristic of the $R_k$ subsets is that if an index does not appear in $R_{k-1}$, then it cannot appear in $R_k$, unless it is one of $t_{k-1}+1, \dots, t_k$. The flexibility in the choosing of these subsets gives flexibility in the setting up of the identifiability assumptions: different versions of our identification approach can be obtained by making assumptions conditioning on different subsets of the missingness indicators. As long as the $R_k$ subsets satisfy $R_k \subseteq \{t_{k-1}+1, \dots, t_k\} \cup R_{k-1}$, Theorem 1 guarantees that the final full-data distribution is NPS.

Remark 2. The sequence $A_1, \dots, A_K$ follows the order of the blocks $X_1, \dots, X_K$. In many cases these blocks may not have a natural order. Different orderings of the blocks lead to different sets of assumptions, thereby leading to different final full-data distributions and implied distributions of the study variables. To clarify this point, suppose that we have three blocks of variables: demographic variables $X_D$, income variables $X_I$, and health-related variables $X_H$. When $X_D$ is first in the order, $A_1$ concerns the distribution of $X_D$ and $M_D$ given $X^*_I$ and $X^*_H$; likewise, when $X_I$ is first in the order, $A_1$ concerns the distribution of $X_I$ and $M_I$ given $X^*_D$ and $X^*_H$. Similarly, $A_2$ and $A_3$ also will change depending on the order of the variables, thereby implying changes in the final full-data distribution.
3.2 Non-Parametric Saturation

The previous presentation makes it clear that the identifying assumptions $A_1, \dots, A_K$ allow us to identify $f_{A_{\leq K}}(X, M_{R_{K-1}}, M_K)$, and furthermore, $f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k})$ for each $k < K$, although each of these conditional densities remains unused after step $k$ in the procedure. A full-data distribution with density $\tilde{f}_{A_{\leq K}}(X, M)$ that encodes $A_1, \dots, A_K$ can be expressed as

$$\tilde{f}_{A_{\leq K}}(X, M) = f_{A_{\leq K}}(X, M_{R_{K-1}}, M_K)\, \tilde{f}_{A_{\leq K}}(M_{S_1}, \dots, M_{S_{K-1}} \mid X, M_{R_{K-1}}, M_K),$$

where the second factor can be written as $\prod_{k=1}^{K-1} \tilde{f}_{A_{\leq K}}(M_{S_k} \mid X, M_{S_{>k}}, M_{R_{K-1}}, M_K)$, with $S_{>k} \equiv S_{k+1} \cup \cdots \cup S_{K-1}$ and $M_{S_{>k}} \equiv (M_{S_{k+1}}, \dots, M_{S_{K-1}})$. From the definition of the sets $S_k$ and $R_k$, it is easy to see that $S_{>k} \cup R_{K-1} = R_k \cup \{t_k+1, \dots, t_{K-1}\}$, and therefore we can rewrite $\tilde{f}_{A_{\leq K}}(M_{S_k} \mid X, M_{S_{>k}}, M_{R_{K-1}}, M_K) = \tilde{f}_{A_{\leq K}}(M_{S_k} \mid X, M_{R_k}, M_{>k})$.

The sequential identification procedure does not identify any $\tilde{f}_{A_{\leq K}}(M_{S_k} \mid X, M_{R_k}, M_{>k})$, but only $f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k})$; that is, it identifies the distribution of $M_{S_k}$ given the variables $X_{\leq k}$, the missingness indicators $M_{R_k}$, and the materialized variables $X^*_{>k}$, but not given the missing variables among $X_{>k}$ according to $M_{>k}$. Nevertheless, the full specification of $\tilde{f}_{A_{\leq K}}(M_{S_k} \mid X, M_{R_k}, M_{>k})$ is irrelevant given that any such conditional distribution that agrees with $f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k})$ would lead to the same $f_{A_{\leq K}}(X)$. One such distribution is one where

$$\tilde{f}_{A_{\leq K}}(M_{S_k} \mid X, M_{R_k}, M_{>k}) = f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k}), \quad k < K, \qquad (1)$$

that is, where the conditional distribution of $M_{S_k}$ given $X$, $M_{R_k}$ and $M_{>k}$ does not depend on the missing variables among $X_{>k}$ according to $M_{>k}$. This guarantees the existence of a full-data distribution with density

$$\tilde{f}_{A_{\leq K}}(X, M) = f_{A_{\leq K}}(X, M_{R_{K-1}}, M_K) \prod_{k=1}^{K-1} f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k}), \qquad (2)$$

which encodes the assumptions $A_1, \dots, A_K$. Theorem 1 guarantees that this construction leads to NPS full-data distributions.

Theorem 1. Let $R_1, \dots, R_{K-1}$ be a sequence of subsets such that $R_k \subseteq \{t_{k-1}+1, \dots, t_k\} \cup R_{k-1}$. Let $A_1, \dots, A_K$ be a sequence of identifying assumptions, with each $A_k$ being an assumption on the conditional distribution of $X_k$ and $M_k$ given $X_{<k}$, $M_{R_{k-1}}$, and $X^*_{>k}$, such that, for a given density $g(X^*_k \mid X_{<k}, M_{R_{k-1}}, X^*_{>k})$, it allows the construction of a density $g_{A_k}(X_k, M_k \mid X_{<k}, M_{R_{k-1}}, X^*_{>k})$ with the NPS property $g_{A_k}(X^*_k \mid X_{<k}, M_{R_{k-1}}, X^*_{>k}) = g(X^*_k \mid X_{<k}, M_{R_{k-1}}, X^*_{>k})$. Then, given an observed-data density $f(X^*)$, there exists a full-data density $\tilde{f}_{A_{\leq K}}(X, M)$ that encodes the assumptions $A_1, \dots, A_K$ and satisfies the NPS property $\tilde{f}_{A_{\leq K}}(X^*) = f(X^*)$.

Proof. We explained how assumptions $A_1, \dots, A_K$, along with the extra assumption in (1), lead to the full-data density in (2). We now show the NPS property of (2). To start, we integrate (2) over the missing variables in $X_K$ according to $M_K$. Since none of the factors in $\prod_{k=1}^{K-1} f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k})$ depend on these missing variables, we obtain

$$f_{A_{\leq K-1}}(X_{\leq K-1}, M_{R_{K-1}}, X^*_K) \prod_{k=1}^{K-1} f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k})$$
$$= f_{A_{\leq K-1}}(X_{\leq K-1}, M_{R_{K-1}}, M_{S_{K-1}}, X^*_K) \prod_{k=1}^{K-2} f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k})$$
$$= f_{A_{\leq K-1}}(X_{\leq K-1}, M_{R_{K-2}}, M_{K-1}, X^*_K) \prod_{k=1}^{K-2} f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k}). \qquad (3)$$

Similarly, we now integrate (3) over the missing variables in $X_{K-1}$ according to $M_{K-1}$. Given that none of the factors in $\prod_{k=1}^{K-2} f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k})$ depend on these missing variables, and given the way $f_{A_{\leq K-1}}(X_{\leq K-1}, M_{R_{K-2}}, M_{K-1}, X^*_K)$ is constructed (see generic step $k+1$ in Section 3.1), we obtain

$$f_{A_{\leq K-2}}(X_{\leq K-2}, M_{R_{K-2}}, X^*_{>K-2}) \prod_{k=1}^{K-2} f_{A_{\leq k}}(M_{S_k} \mid X_{\leq k}, M_{R_k}, X^*_{>k}).$$

These arguments can be repeated, sequentially integrating over the missing variables in $X_k$ according to $M_k$, $k = K-2, \dots, 1$, finally obtaining the observed-data density $f(X^*)$.

3.3 Special Cases

It is worth describing two special sequential identification schemes that can be derived from our general presentation. One is obtained when we take all $R_k = \{t_{k-1}+1, \dots, t_k\} \cup R_{k-1}$, $S_k = \emptyset$, and therefore $M_{R_k} = M_{\leq k}$. In this case, each $A_{k+1}$ is on the distribution of $X_{k+1}$ and $M_{k+1}$ given $X_{\leq k}$, $M_{\leq k}$ and $X^*_{>k+1}$; that is, the assumption conditions on the whole set of missingness indicators $M_{\leq k}$ and not just on a subset of these. The other is obtained when we take all $R_k = \emptyset$, $S_k = \{t_{k-1}+1, \dots, t_k\}$, and therefore $M_{S_k} = M_k$. In this case, each $A_{k+1}$ is on the distribution of $X_{k+1}$ and $M_{k+1}$ given $X_{\leq k}$ and $X^*_{>k+1}$; that is, each assumption conditions on none of the missingness indicators $M_{\leq k}$.
3.4 Connection with the Mechanisms of Robins (1997)

An important particular case of our sequential identification strategy is obtained when all $M_{S_k} = M_k$ and each $A_k$ is taken to be a conditional MAR assumption, that is, when we assume that $f(M_k \mid X_{\leq k-1}, X_k, X^*_{>k}) = f(M_k \mid X_{\leq k-1}, X^*_k, X^*_{>k})$. Along with (1), this leads to the combined assumption

$$f(M_k \mid X, M_{>k}) = f(M_k \mid X_{<k}, X^*_{\geq k}). \qquad (4)$$

The missingness mechanism derived from this approach corresponds to the group permutation missingness of Robins (1997). When each block contains only one variable, it corresponds to the permutation missingness mechanism of Robins (1997). If the ordering of the variables or blocks of variables is regarded as temporal, as in a longitudinal study or a survey that asks questions in a fixed sequence, Robins (1997) interpreted (4) as follows: the nonresponse propensity at the current time period depends on the values of study variables in the previous time periods, whether observed or not, but not on what is missing in the present and future time periods.

If the order of the blocks of variables was reversed, that is, if $A_1$ was on the distribution of $X_K$ and $M_K$ given $X^*_{<K}$, $A_2$ was on the distribution of $X_{K-1}$ and $M_{K-1}$ given $X^*_{<K-1}$ and $X_K$, and so on, then we would have the following interpretation: the nonresponse propensity at the current time period depends on the values of study variables in the future time periods, whether observed or not, but not on what is missing in the present and past time periods. This interpretation is arguably easier to explain in the context of respondents answering a questionnaire. The nonresponse propensity for question $t$ can depend on the respondent's answers to questions that appear later in the questionnaire and to questions that she has already answered, but not on the information that she has not revealed.
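The data-generating reading of (4) can be mimicked in a short simulation. The sketch below is our illustration with an arbitrary, hypothetical logistic propensity; it is not code from Robins (1997) or from this paper. Drawing the indicators for the last block first guarantees that, when $M_k$ is drawn, the observed parts of the later blocks are already determined, so the propensity can depend on $X_{<k}$ (observed or not) and on $X^*_{>k}$, but never on missing values in the present or future blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50_000, 3                                   # three single-variable blocks
X = rng.binomial(1, 0.5, size=(n, p))              # binary study variables

M = np.zeros((n, p), dtype=int)
for k in reversed(range(p)):                       # draw M_p, ..., M_1
    # materialized future values X*_{>k}; -1 plays the role of "*"
    x_star_future = np.where(M[:, k + 1:] == 1, -1, X[:, k + 1:])
    # hypothetical propensity: uses past X's (observed or not) and the
    # observed future values only, as permitted by (4)
    logit = (-1.0
             + 0.8 * X[:, :k].sum(axis=1)
             + 0.5 * (x_star_future == 1).sum(axis=1))
    M[:, k] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
```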
4 Applications

4.1 Sequential Identification for Two Categorical Variables

Consider two categorical random variables $X_1 \in \mathcal{X}_1 = \{1, \dots, I\}$ and $X_2 \in \mathcal{X}_2 = \{1, \dots, J\}$. Let $M_1$ and $M_2$ be their missingness indicators. Let $P$ denote the joint distribution of $(X_1, X_2, M_1, M_2)$. The observed-data distribution corresponds to the probabilities

$$\pi_{ij00} \equiv P(X^*_1 = i, X^*_2 = j) = P(X_1 = i, X_2 = j, M_1 = 0, M_2 = 0),$$
$$\pi_{i+01} \equiv P(X^*_1 = i, X^*_2 = *) = P(X_1 = i, M_1 = 0, M_2 = 1),$$
$$\pi_{+j10} \equiv P(X^*_1 = *, X^*_2 = j) = P(X_2 = j, M_1 = 1, M_2 = 0),$$
$$\pi_{++11} \equiv P(X^*_1 = *, X^*_2 = *) = P(M_1 = 1, M_2 = 1),$$

for $i \in \mathcal{X}_1$, $j \in \mathcal{X}_2$. We seek to construct a full-data distribution $P_{A_{\leq 2}}(X_1, X_2, M_1, M_2)$ from the observed-data distribution $P(X^*_1, X^*_2)$ by imposing some assumptions $A_1$ and $A_2$. The goal is a full-data distribution such that $P_{A_{\leq 2}}(X^*_1, X^*_2) = P(X^*_1, X^*_2)$, that is, we want $P_{A_{\leq 2}}$ to be NPS.

To use the general identification strategy presented in Section 3 we define each variable as its own block. With only two variables, set $R_1$ can be either $R_1 = \{1\}$ or $R_1 = \emptyset$. We present two examples below corresponding to these two options.

Example 1. We first consider $R_1 = \{1\}$, $S_1 = \emptyset$, and the following identifying assumptions: $A_1$: $X_1 \perp\!\!\!\perp M_1 \mid X^*_2$; and $A_2$: $X_2 \perp\!\!\!\perp M_2 \mid M_1, X_1$.

Under $A_1$, $P_{A_1}(X_1, M_1 \mid X^*_2) = P_{A_1}(X_1 \mid X^*_2)\, P_{A_1}(M_1 \mid X^*_2) = P(X_1 \mid X^*_2, M_1 = 0)\, P(M_1 \mid X^*_2)$, where $P(X_1 \mid X^*_2, M_1 = 0)$ and $P(M_1 \mid X^*_2)$ are identified from the observed-data distribution. When $X^*_2 = j \neq *$, $P(X_1 = i \mid X^*_2 = j, M_1 = 0) = P(X_1 = i \mid X_2 = j, M_1 = 0, M_2 = 0) = \pi_{ij00}/\pi_{+j00}$, and $P(M_1 = m_1 \mid X^*_2 = j) = P(M_1 = m_1 \mid X_2 = j, M_2 = 0) = \pi_{+jm_10}/\pi_{+j+0}$. Similarly, when $X^*_2 = *$ we find $P(X_1 = i \mid X^*_2 = *, M_1 = 0) = \pi_{i+01}/\pi_{++01}$ and $P(M_1 = m_1 \mid X^*_2 = *) = \pi_{++m_11}/\pi_{+++1}$. Since $P(X^*_2)$ can be obtained from the observed-data distribution as $P(X_2 = j, M_2 = 0) = \pi_{+j+0}$ when $X^*_2 = j \neq *$, and as $P(M_2 = 1) = \pi_{+++1}$ when $X^*_2 = *$, using $P_{A_1}(X_1, M_1 \mid X^*_2)$ we obtain a joint distribution for $(X_1, M_1, X^*_2)$ that relies on $A_1$, defined as $P_{A_1}(X_1, M_1, X^*_2) \equiv P_{A_1}(X_1, M_1 \mid X^*_2)\, P(X^*_2)$. Note that $P_{A_1}$ can be written as an explicit function of the observed-data distribution.

We now use $P_{A_1}$ and identifying assumption $A_2$ to obtain $P_{A_{\leq 2}}(X_2, M_2 \mid X_1, M_1)$. From the definition of $X^*_2$, $P_{A_1}(X_1, M_1, X^*_2)$ can be written as $P_{A_1}(X_1, M_1, X_2, M_2 = 0)$ when $X^*_2 \neq *$ and as $P_{A_1}(X_1, M_1, M_2 = 1)$ when $X^*_2 = *$. From this we can obtain

$$P_{A_1}(M_2 = 1 \mid X_1, M_1) = \frac{P_{A_1}(X_1, M_1, M_2 = 1)}{P_{A_1}(X_1, M_1, M_2 = 1) + \sum_{x_2 \in \mathcal{X}_2} P_{A_1}(X_1, M_1, X_2 = x_2, M_2 = 0)},$$

and

$$P_{A_1}(X_2 \mid X_1, M_1, M_2 = 0) = \frac{P_{A_1}(X_1, M_1, X_2, M_2 = 0)}{\sum_{x_2 \in \mathcal{X}_2} P_{A_1}(X_1, M_1, X_2 = x_2, M_2 = 0)}.$$

We then obtain $P_{A_{\leq 2}}(X_2, M_2 \mid X_1, M_1) = P_{A_{\leq 2}}(X_2 \mid X_1, M_1)\, P_{A_{\leq 2}}(M_2 \mid X_1, M_1) = P_{A_1}(X_2 \mid X_1, M_1, M_2 = 0)\, P_{A_1}(M_2 \mid X_1, M_1)$, which gives us a way to obtain $P_{A_{\leq 2}}(X_2, M_2 \mid X_1, M_1)$ as a function of the distribution $P_{A_1}$, which in turn is a function of the observed-data distribution. The final full-data distribution is obtained as $P_{A_{\leq 2}}(X_1, M_1, X_2, M_2) \equiv P_{A_{\leq 2}}(X_2, M_2 \mid X_1, M_1)\, P_{A_1}(X_1, M_1)$, where $P_{A_1}(X_1, M_1)$ can be obtained from $P_{A_1}$. After some algebra we find

$$P_{A_{\leq 2}}(X_1 = i, X_2 = j, M_1 = m_1, M_2 = m_2) = \frac{\frac{\pi_{ij00}}{\pi_{+j00}}\,\pi_{+jm_10}}{\left(\sum_{l} \frac{\pi_{il00}}{\pi_{+l00}}\,\pi_{+lm_10}\right)^{m_2}} \left(\frac{\pi_{i+01}}{\pi_{++01}}\,\pi_{++m_11}\right)^{m_2}.$$

It is easy to see that $P_{A_{\leq 2}}$ is NPS, that is, $P_{A_{\leq 2}}(X^*_1, X^*_2) = P(X^*_1, X^*_2)$. From the final distribution $P_{A_{\leq 2}}(X_1, X_2, M_1, M_2)$ we can now obtain

$$P_{A_{\leq 2}}(X_1 = i, X_2 = j) = \pi_{ij00} + \pi_{i+01}\,\frac{\pi_{ij00}}{\pi_{i+00}} + \pi_{+j10}\,\frac{\pi_{ij00}}{\pi_{+j00}} + \pi_{++11}\,\frac{\pi_{i+01}}{\pi_{++01}}\,\frac{\frac{\pi_{ij00}}{\pi_{+j00}}\,\pi_{+j10}}{\sum_l \frac{\pi_{il00}}{\pi_{+l00}}\,\pi_{+l10}}, \qquad (5)$$

which is the distribution of inferential interest.

In closing this example, we stress that the final full-data distribution is not invariant to the order in which the blocks of variables appear in the sequence of assumptions. From expression (5) it is clear that the final distribution of the study variables would be different had we identified a distribution for $(X^*_1, X_2, M_2)$ first. Indeed, if we were to follow the steps in the previous example but reversing the order of the variables, then we would be assuming that $X_2 \perp\!\!\!\perp M_2 \mid X^*_1$ and $X_1 \perp\!\!\!\perp M_1 \mid M_2, X_2$, which are different from $A_1$ and $A_2$ in this example.
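Expression (5) and the NPS property of Example 1 are easy to verify numerically. The following Python sketch is our illustration, not code from the paper: it starts from an arbitrary, made-up observed-data distribution $\{\pi_{ij00}, \pi_{i+01}, \pi_{+j10}, \pi_{++11}\}$, builds $P_{A_{\leq 2}}$ from the closed form above, and checks that integrating out the missing coordinates returns exactly the observed-data probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J = 3, 4

# Made-up observed-data distribution (any positive table summing to 1 works).
pi_ij00 = rng.random((I, J))     # P(X1*=i, X2*=j)
pi_i01  = rng.random(I)          # P(X1*=i, X2*=*)
pi_j10  = rng.random(J)          # P(X1*=*, X2*=j)
pi_11   = rng.random()           # P(X1*=*, X2*=*)
z = pi_ij00.sum() + pi_i01.sum() + pi_j10.sum() + pi_11
pi_ij00, pi_i01, pi_j10, pi_11 = pi_ij00 / z, pi_i01 / z, pi_j10 / z, pi_11 / z

# P_{A<=2}(X1=i, X2=j, M1=m1, M2=m2) from the closed form in Example 1
# (A1: X1 independent of M1 given X2*; A2: X2 independent of M2 given M1, X1).
cond_x1 = pi_ij00 / pi_ij00.sum(axis=0, keepdims=True)   # P(X1=i | X2=j, M1=M2=0)
P = np.zeros((I, J, 2, 2))                                # axes: X1, X2, M1, M2
for m1 in (0, 1):
    w_obs  = pi_ij00.sum(axis=0) if m1 == 0 else pi_j10   # pi_{+j m1 0}
    w_star = pi_i01.sum()        if m1 == 0 else pi_11    # pi_{++ m1 1}
    base = cond_x1 * w_obs                                 # cells with M2 = 0
    P[:, :, m1, 0] = base
    P[:, :, m1, 1] = (base / base.sum(axis=1, keepdims=True)
                      * (pi_i01 / pi_i01.sum() * w_star)[:, None])

# NPS check: the implied observed-data probabilities match the inputs.
assert np.allclose(P[:, :, 0, 0], pi_ij00)
assert np.allclose(P[:, :, 0, 1].sum(axis=1), pi_i01)
assert np.allclose(P[:, :, 1, 0].sum(axis=0), pi_j10)
assert np.isclose(P[:, :, 1, 1].sum(), pi_11)

# Distribution of inferential interest, expression (5).
P_x1x2 = P.sum(axis=(2, 3))
print(P_x1x2.round(3), P_x1x2.sum())                      # the table sums to 1
```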
