Nonparametric additive model-assisted estimation for survey data

Li Wang∗
Department of Statistics, University of Georgia, Athens, Georgia 30602

Suojin Wang
Department of Statistics, Texas A&M University, College Station, Texas 77843

Abstract

An additive model-assisted nonparametric method is investigated to estimate the finite population totals of massive survey data with the aid of auxiliary information. A class of estimators is proposed to improve the precision of the well-known Horvitz-Thompson estimators by combining spline and local polynomial smoothing methods. These estimators are calibrated, asymptotically design-unbiased, consistent, normal and robust in the sense of asymptotically attaining the Godambe-Joshi lower bound to the anticipated variance. A consistent model selection procedure is further developed to select the significant auxiliary variables. The proposed method is sufficiently fast to analyze large survey data of high dimension within seconds. The performance of the proposed method is assessed empirically via simulation studies.

Keywords: Calibration, Horvitz-Thompson estimator, local linear regression, model-assisted estimation, spline, superpopulation.

2000 MSC: primary 62D05; secondary 62G08

∗Corresponding author. Email address: [email protected] (Li Wang)

Preprint submitted to Journal of Multivariate Analysis, January 6, 2011

1. Introduction

Auxiliary information is often available in surveys for all elements of the population of interest. For instance, in many countries administrative registers provide extensive sources of auxiliary information; complete registers can give access to variables such as sex, age, income and country of birth. Studies of labor force characteristics or household expenditure patterns, for example, might benefit from these auxiliary data. Another example is the satellite imagery or GPS data used in spatial sampling. Such data are typically collected at the population level and are available at little or no extra cost, especially compared to the cost of collecting the survey data.

If no information other than the inclusion probabilities is used to estimate the population total, a well-known design-unbiased estimator is the Horvitz-Thompson (HT) estimator. Nowadays, "cheap" auxiliary information can be used routinely to obtain higher-precision estimates of unknown finite population quantities. For instance, post-stratification, calibration and regression estimation are different design-based approaches used to improve the precision of estimators. Auxiliary information can also be used to increase the accuracy of estimators of the finite population distribution function; see, for example, [30]. Model-assisted estimation ([21]) provides a convenient way to incorporate auxiliary variables into more efficient survey estimators. By model-assisted, it is meant that a superpopulation model is adopted (for example, model (1) below), in which the finite population is modeled conditionally on the auxiliary information; see, for instance, [4, 6, 8, 9].
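For concreteness, the baseline HT estimator $\hat t_y = \sum_{i\in s} y_i/\pi_i$ can be sketched as follows. This is a minimal illustration only; the variable names and the simple-random-sampling setup are assumptions made for the example, not taken from the paper.

```python
import numpy as np

def horvitz_thompson_total(y_sample, pi_sample):
    """Horvitz-Thompson estimator of a population total:
    sampled y-values weighted by their inverse inclusion probabilities."""
    y_sample = np.asarray(y_sample, dtype=float)
    pi_sample = np.asarray(pi_sample, dtype=float)
    return np.sum(y_sample / pi_sample)

# Toy usage: simple random sampling without replacement, n = 100 out of N = 1000,
# so every unit has inclusion probability pi_i = n / N = 0.1.
rng = np.random.default_rng(0)
y_pop = rng.normal(50.0, 10.0, size=1000)
sample_idx = rng.choice(1000, size=100, replace=False)
t_hat = horvitz_thompson_total(y_pop[sample_idx], np.full(100, 0.1))
print(t_hat, y_pop.sum())
```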
The traditional parametric model-assisted approach assumes that the superpopulation model is fully described by a finite set of parameters, e.g., the regression estimator introduced in [21]. However, survey data now being collected by many government, health and social science organizations have more complex design features, and it is often difficult to obtain prior model information for the various hypotheses of interest. In this sense, a preselected parametric model may be too restrictive to capture unexpected features. In contrast, nonparametric regression provides a useful tool for studying the dependence of the variables of interest on the auxiliary information without constraining that dependence to a fixed form with a small number of parameters. The flexibility of nonparametric regression is extremely helpful for capturing complicated relationships between variables as well as for obtaining robust predictions; see [10, 12] for details.

Breidt and Opsomer [1] first proposed a nonparametric model-assisted estimator based on local polynomial regression, which generalizes the parametric framework in survey sampling and improves the precision of the survey estimators considerably. Their investigation, however, is based on a single auxiliary variable. Most surveys involve more than one such variable, perhaps many; see [22]. Remote sensing data, for example, provide a wide and growing range of variables that can be employed. In this context, when the dimension of the auxiliary information vector is high, one unavoidable issue is the "curse of dimensionality", which refers to the poor convergence rate of nonparametric estimation of general multivariate functions. One solution is regression in the form of an additive model; see [13].

Estimation and inference for additive models have been well studied in the literature; see, for example, the classic backfitting estimators of [13], the marginal integration estimators of [16], the smooth backfitting estimators of [17], the spline estimators of Stone ([25, 26]) and the spline-backfitted kernel estimators of [29]. In the survey sampling context, [2] discussed a possible semiparametric extension to multiple auxiliary variables using penalized splines; [19] applied generalized additive models (GAMs) with an interaction model to the estimation of variables from forest inventory and analysis surveys; and [3] proposed a special case of the GAMs with an identity link function.

For large and high-dimensional survey data, it is important that estimation and inference methods be efficient and easy to implement computationally. However, few methods are theoretically justified and computationally efficient when there are multiple nonparametric terms. The kernel-based backfitting and marginal integration approaches are computationally expensive, which limits their use for high-dimensional data; see [18] for numerical comparisons of these methods. Spline methods, on the other hand, provide only convergence rates but no asymptotic distributions, so no measures of confidence can be attached to the estimators. Motivated by these demands, we propose approximating the nonparametric components by a spline-backfitted local polynomial method: a spline step produces a quick initial estimate of all additive components and removes all of them except the one of interest; kernel smoothing is then applied to the cleaned univariate data to obtain an estimator with an asymptotic distribution. This two-step estimator is computationally expedient for analyzing large and high-dimensional survey data, and it is theoretically reliable because it is uniformly oracle (asymptotically as efficient as the infeasible estimator that knows the other components) and admits asymptotic confidence intervals. The resulting estimator of the population total can therefore be computed easily and, more importantly, allows for a formal derivation of its asymptotic properties.
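To make the additive structure concrete before it is formalized in Section 2, the following sketch simulates a finite population from an additive superpopulation model of the form $Y = c + \sum_\alpha m_\alpha(X_\alpha) + \sigma\varepsilon$. The component functions, intercept and error scale below are arbitrary illustrations chosen for this example, not the settings used in the paper's simulation study.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 10_000, 3                                    # population size, number of auxiliary variables
X = rng.uniform(0.0, 1.0, size=(N, d))              # auxiliary variables on [0, 1]^d

# Illustrative additive components, each centered to have (approximately) mean zero
m1 = np.sin(2.0 * np.pi * X[:, 0]); m1 -= m1.mean()
m2 = (X[:, 1] - 0.5) ** 2;          m2 -= m2.mean()
m3 = np.exp(X[:, 2]);               m3 -= m3.mean()

c, sigma = 5.0, 0.5                                 # intercept and (constant) error scale
y = c + m1 + m2 + m3 + sigma * rng.normal(size=N)   # Y = m(X) + sigma * eps
t_y = y.sum()                                       # finite population total to be estimated
```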
In practice, a large number of auxiliary variables may be collected, and the insignificant ones should be excluded from the final model in order to improve predictive performance. The selection of auxiliary variables is therefore a fundamental issue for model-assisted survey sampling methods. In this paper, we propose a consistent variable selection method for additive model-assisted survey sampling based on the Bayes information criterion (BIC). A comprehensive Monte Carlo study demonstrates the superior performance of the proposed methods.

The rest of the paper is organized as follows. Section 2 gives details of the superpopulation model and the proposed method of estimation. Section 3 describes the weighting, calibration and asymptotic properties of the proposed estimator. Section 4 describes the auxiliary variable selection procedure for the superpopulation model under the simple random sampling (SRS) design. Section 5 reports the findings of an extensive simulation study. Lengthy technical arguments are given in the Appendix.

2. Superpopulation Model and Proposed Estimator

In what follows, let $U_N = \{1,\ldots,i,\ldots,N\}$ be the finite population of $N$ elements, called the target population, where $i$ represents the $i$th element of the population. Let $\mathbf{x}_i = (x_{i1},\ldots,x_{id})$ be a $d$-dimensional auxiliary variable vector, $i \in U_N$. We are interested in the estimation of the population total $t_y = \sum_{i\in U_N} y_i$, where $y_i$ is the value of the study variable $y$ for the $i$th element. To this end, a sample $s$ of size $n_N$ is drawn from $U_N$ according to a fixed sampling design $p_N(\cdot)$, where $p_N(s)$ is the probability of drawing the sample $s$. The inclusion probabilities, known for all $i \in U_N$, are $\pi_i \equiv \pi_{iN} = \Pr\{i\in s\} = \sum_{s\ni i} p_N(s)$. In addition to the $\pi_i$, denote by $\pi_{ij} \equiv \pi_{ijN} = \Pr\{i,j\in s\} = \sum_{s\ni i,j} p_N(s)$ the inclusion probability for both elements $i,j\in U_N$.

Let $\{(\mathbf{x}_i, y_i)\}_{i\in U_N}$ be a realization of $(\mathbf{X}, Y)$ from an infinite superpopulation $\xi$ satisfying
$$
Y = m(\mathbf{X}) + \sigma(\mathbf{X})\varepsilon, \qquad (1)
$$
in which the unknown $d$-variate function $m$ has the simpler additive form
$$
m(\mathbf{X}) = c + \sum_{\alpha=1}^{d} m_\alpha(X_\alpha), \quad E_\xi[m_\alpha(X_\alpha)] \equiv 0, \ 1\le\alpha\le d, \qquad (2)
$$
the function $\sigma(\cdot)$ is the unknown standard deviation function, and the error $\varepsilon$ satisfies $E_\xi(\varepsilon\mid\mathbf{X}) = 0$ and $E_\xi(\varepsilon^2\mid\mathbf{X}) = 1$. In the following, we assume the auxiliary variable $X_\alpha$ is distributed on a compact interval $[a_\alpha, b_\alpha]$, $\alpha=1,\ldots,d$; without loss of generality, we take all intervals $[a_\alpha, b_\alpha] = [0,1]$.

To estimate the additive components in (2), we employ a two-stage procedure based on spline-backfitted local polynomial smoothing. For any $\alpha = 1,\ldots,d$, we introduce a knot sequence with $J$ interior knots, $k_{0\alpha} = 0 < k_{1\alpha} < \cdots < k_{J\alpha} < 1 = k_{(J+1)\alpha}$, where $J \equiv J_N$ increases as $n_N$ increases, and the precise order is given in Assumption (A5). Denote the piecewise linear truncated power spline basis by
$$
\Gamma(\mathbf{x}) \equiv \big\{1,\ x_\alpha,\ (x_\alpha - k_{1\alpha})_+,\ \ldots,\ (x_\alpha - k_{J\alpha})_+,\ \alpha = 1,\ldots,d\big\}^{T}, \qquad (3)
$$
where $(a)_+ = a$ if $a > 0$ and $0$ otherwise. For the local linear smoothing, let $K_h(x) = h^{-1}K(x/h)$, where $K$ denotes a kernel function and $h = h_N$ is the bandwidth; see Assumption (A6) below.
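As an illustration of the basis in (3), the sketch below builds the $G_d = 1+(J+1)d$ dimensional spline design vector for a single auxiliary vector $\mathbf{x}_i$. The knot positions are arbitrary placeholders; the function and variable names are not from the paper.

```python
import numpy as np

def spline_design_vector(x_i, knots_per_var):
    """Gamma(x_i) from (3): a shared intercept, then for each covariate alpha the
    linear term x_alpha followed by the truncated powers (x_alpha - k_{j,alpha})_+."""
    pieces = [np.array([1.0])]                      # intercept
    for x_a, knots in zip(x_i, knots_per_var):
        pieces.append(np.concatenate(([x_a], np.clip(x_a - np.asarray(knots), 0.0, None))))
    return np.concatenate(pieces)

# Toy usage: d = 2 covariates with J = 3 interior knots each -> length 1 + (3 + 1) * 2 = 9
knots = [np.array([0.25, 0.5, 0.75]), np.array([0.2, 0.4, 0.8])]
print(spline_design_vector(np.array([0.3, 0.9]), knots))
```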
We now describe our two-stage estimator for the population total $t_y$. At the first stage, we apply spline smoothing to obtain a quick initial estimator of $m(\mathbf{x}_i)$,
$$
\hat m(\mathbf{x}_i) = \hat b_0 + \sum_{\alpha=1}^{d}\hat b_{0,\alpha}x_{i\alpha} + \sum_{\alpha=1}^{d}\sum_{j=1}^{J}\hat b_{j,\alpha}(x_{i\alpha}-k_{j\alpha})_+,
$$
where $\hat b_0$ and $\hat b_{j,\alpha}$, $j = 0,1,\ldots,J$, $\alpha = 1,\ldots,d$, are the minimizers of
$$
\sum_{i\in s}\pi_i^{-1}\Big\{y_i - b_0 - \sum_{\alpha=1}^{d}b_{0,\alpha}x_{i\alpha} - \sum_{\alpha=1}^{d}\sum_{j=1}^{J}b_{j,\alpha}(x_{i\alpha}-k_{j\alpha})_+\Big\}^2 \qquad (4)
$$
over a $G_d \equiv 1+(J+1)d$ dimensional coefficient vector. Because the components $m_\alpha(x_\alpha)$ can only be identified up to an additive constant, we center the estimator of $m_\alpha(x_\alpha)$ and define the centered pilot estimator of the $\alpha$th component as
$$
\hat m_\alpha(x_\alpha) = \hat b_{0,\alpha}x_\alpha + \sum_{j=1}^{J}\hat b_{j,\alpha}(x_\alpha-k_{j\alpha})_+ - \hat c_\alpha, \qquad (5)
$$
where $\hat c_\alpha = N^{-1}\sum_{i\in s}\pi_i^{-1}\big\{\hat b_{0,\alpha}x_{i\alpha} + \sum_{j=1}^{J}\hat b_{j,\alpha}(x_{i\alpha}-k_{j\alpha})_+\big\}$. The pilot estimators in (5) are then used to construct the new pseudo-responses
$$
\hat y_{i\alpha} = y_i - N^{-1}\hat t_y - \sum_{\beta\neq\alpha}\hat m_\beta(x_{i\beta}), \quad i\in s,\ \alpha = 1,\ldots,d, \qquad (6)
$$
where $\hat t_y = \sum_{i\in s} y_i/\pi_i$ is the well-known HT estimator.

At the second stage, local polynomial smoothing is applied to the cleaned univariate data $\{x_{i\alpha}, \hat y_{i\alpha}\}_{i\in s}$ to achieve the "oracle" property of [29]. To be specific, considering local linear smoothing, for any $\alpha = 1,\ldots,d$ we minimize
$$
\sum_{i\in s}\pi_i^{-1}\{\hat y_{i\alpha} - a_{0,\alpha} - a_{1,\alpha}(x_{i\alpha}-x)\}^2 K_h(x_{i\alpha}-x) \qquad (7)
$$
with respect to $a_{0,\alpha}$ and $a_{1,\alpha}$. The spline-backfitted local linear (SBLL) estimator of the $\alpha$th component $m_\alpha$ is $\hat m^*_\alpha = \hat a_{0,\alpha}$ from (7). The final sample design-based SBLL estimator of $m(\mathbf{x})$ is defined as
$$
\hat m^*(\mathbf{x}) = \frac{1}{N}\hat t_y + \sum_{\alpha=1}^{d}\hat m^*_\alpha(x_\alpha). \qquad (8)
$$
Substituting $\hat m^*_i \equiv \hat m^*(\mathbf{x}_i)$ into the existing generalized difference estimator (see page 221 of [21]), the SBLL estimator of $t_y$ is defined by
$$
\hat t_{y,\mathrm{SBLL}} = \sum_{i\in U_N}\hat m^*_i + \sum_{i\in s}\frac{y_i-\hat m^*_i}{\pi_i} = \sum_{i\in s}\frac{y_i}{\pi_i} + \sum_{i\in U_N}\Big(1-\frac{I_i}{\pi_i}\Big)\hat m^*_i, \qquad (9)
$$
where $I_i = 1$ if $i\in s$ and $I_i = 0$ otherwise.

Remark 1. In the first-step spline smoothing, the number of knots $J_N$ can be determined by $n_N$ and a tuning constant $c$:
$$
J_N = \min\Big(\big[c\,n_N^{1/4}\log(n_N)\big]+1,\ \big[(n_N/2-1)/d-1\big]\Big), \qquad (10)
$$
where $[a]$ denotes the integer part of $a$. As discussed in [29], the choice of $c$ makes little difference. In the second-step local polynomial smoothing, one can use the quartic kernel and the rule-of-thumb bandwidth.
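The two-stage procedure can be summarized in code. The following is a rough, illustrative sketch of equations (4)-(9) under simple assumptions, not the authors' implementation: the knot placement and bandwidth are ad hoc placeholders rather than the data-driven rules of Remark 1, and all function and variable names are invented for this example.

```python
import numpy as np

def trunc_basis(x, knots):
    """Columns x, (x - k_1)_+, ..., (x - k_J)_+ for one covariate."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x] + [np.clip(x - k, 0.0, None) for k in knots])

def local_linear_fit(x_s, y_s, w, x0, h):
    """Design-weighted local linear fit at x0 with a quartic kernel; returns the intercept."""
    u = (x_s - x0) / h
    k = np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u ** 2) ** 2, 0.0)
    ww = w * k
    if ww.sum() <= 0.0:                                      # no observations in the window
        return 0.0
    X = np.column_stack([np.ones_like(x_s), x_s - x0])
    sw = np.sqrt(ww)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y_s * sw, rcond=None)
    return beta[0]

def sbll_total(y_s, pi_s, X_pop, sample_idx, n_knots=3, h=0.15):
    """Sketch of the two-stage SBLL estimator of t_y; covariates assumed on [0, 1]."""
    y_s, pi_s = np.asarray(y_s, dtype=float), np.asarray(pi_s, dtype=float)
    X_s = X_pop[sample_idx]
    n, d = X_s.shape
    N = X_pop.shape[0]
    w = 1.0 / pi_s
    t_ht = np.sum(w * y_s)                                   # Horvitz-Thompson estimator
    knots = [np.linspace(0.0, 1.0, n_knots + 2)[1:-1] for _ in range(d)]
    # Stage 1: design-weighted additive spline fit, criterion (4)
    blocks_s = [trunc_basis(X_s[:, a], knots[a]) for a in range(d)]
    G = np.column_stack([np.ones(n)] + blocks_s)
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(G * sw[:, None], y_s * sw, rcond=None)
    # Centered pilot component estimates (5) at the sample points
    pilots, start = [], 1
    for a in range(d):
        c_a = coef[start:start + 1 + n_knots]
        fit_s = blocks_s[a] @ c_a
        pilots.append(fit_s - np.sum(w * fit_s) / N)
        start += 1 + n_knots
    pilot_sum = np.sum(pilots, axis=0)
    # Stage 2: local linear smoothing of the pseudo-responses (6), combined as in (8)
    m_star = np.full(N, t_ht / N)
    for a in range(d):
        y_pseudo = y_s - t_ht / N - (pilot_sum - pilots[a])
        m_star += np.array([local_linear_fit(X_s[:, a], y_pseudo, w, x0, h)
                            for x0 in X_pop[:, a]])
    # Generalized difference estimator (9)
    return np.sum(m_star) + np.sum(w * (y_s - m_star[sample_idx]))
```

With a finite population such as the toy one simulated earlier and a sample drawn with known inclusion probabilities, one would call, for instance, sbll_total(y[s_idx], pi_s, X, s_idx).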
3. Properties of the Estimator

3.1. Weighting and Calibration

In the last decade, calibration estimation has developed into an important field of research in survey sampling. As discussed in [7] and [15], calibration is a highly desirable property of survey weights, as it allows the survey practitioner to simply adjust the original design weights to incorporate the information in the auxiliary variables. Several national statistical agencies have developed software to compute calibrated weights based on auxiliary information available in population registers and other sources. The proposed SBLL estimator also shares this property in a certain sense.

Let $\mathbf{y}_s$ be the column vector of the response values $y_i$ for $i\in s$ and define the diagonal matrix of inverse inclusion probabilities $\Pi_s = \mathrm{diag}\{1/\pi_i\}_{i\in s}$. For $\Gamma(\mathbf{x})$ in (3), denote by $\Gamma_s = \{\Gamma(\mathbf{x}_i)^T\}_{i\in s}$ the sample truncated power spline matrix. Let $B_s$ be the vector of estimated spline coefficients in (4); then $B_s = (\Gamma_s^T\Pi_s\Gamma_s)^{-1}\Gamma_s^T\Pi_s\mathbf{y}_s$. Thus the pilot spline estimator of $m_\alpha(x_\alpha)$ in (5) can be written as
$$
\hat m_\alpha(x_\alpha) = \Gamma(\mathbf{x})^T D_\alpha B_s - N^{-1}\mathbf{1}_n^T\Pi_s\Gamma_s D_\alpha B_s, \qquad (11)
$$
where $\mathbf{1}_n$ is a vector of length $n_N$ with all entries equal to $1$, and
$$
D_\alpha = \mathrm{diag}\{0,\ldots,0,\underbrace{1,\ldots,1}_{\text{positions }(J+1)(\alpha-1)+2\text{ to }(J+1)\alpha+1},0,\ldots,0\} \qquad (12)
$$
is a $G_d\times G_d$ diagonal matrix. Denoting the spline smoothing matrix and its centered version by
$$
\Psi_{s\alpha} = \Gamma_s D_\alpha\big(\Gamma_s^T\Pi_s\Gamma_s\big)^{-1}\Gamma_s^T\Pi_s, \qquad \Psi^*_{s\alpha} = \big(\mathbf{I} - N^{-1}\mathbf{1}_n\mathbf{1}_n^T\Pi_s\big)\Psi_{s\alpha},
$$
we have $\hat{\mathbf{m}}_\alpha \equiv \{\hat m_\alpha(x_{i\alpha})\}_{i\in s} = \Psi^*_{s\alpha}\mathbf{y}_s$ for $\alpha = 1,\ldots,d$. Further, for $\hat y_{i\alpha}$ in (6), let $\hat{\mathbf{y}}_\alpha \equiv \{\hat y_{i\alpha}\}_{i\in s} = \mathbf{y}_s - \frac{1}{N}\hat t_y\mathbf{1}_n - \sum_{\beta\neq\alpha}\hat{\mathbf{m}}_\beta$, and define the matrices
$$
\mathbf{X}_{si\alpha} = \big\{\big(1,\ x_{k\alpha}-x_{i\alpha}\big)\big\}_{k\in s}, \qquad \mathbf{W}_{si\alpha} = \mathrm{diag}\Big\{\frac{1}{\pi_k}K_h(x_{k\alpha}-x_{i\alpha})\Big\}_{k\in s}.
$$
Then the SBLL estimator of $m_\alpha$ at $x_{i\alpha}$ can be written as
$$
\hat m^*_{i\alpha} \equiv \hat m^*_\alpha(x_{i\alpha}) = \mathbf{e}_1^T\big(\mathbf{X}_{si\alpha}^T\mathbf{W}_{si\alpha}\mathbf{X}_{si\alpha}\big)^{-1}\mathbf{X}_{si\alpha}^T\mathbf{W}_{si\alpha}\hat{\mathbf{y}}_\alpha, \qquad (13)
$$
where $\mathbf{e}_1 = (1,0)^T$. Therefore, the SBLL estimator in (8) of $m(\mathbf{x})$ at $\mathbf{x}_i$ is
$$
\hat m^*_i = \frac{1}{N}\hat t_y + \sum_{\alpha=1}^{d}\mathbf{e}_1^T\big(\mathbf{X}_{si\alpha}^T\mathbf{W}_{si\alpha}\mathbf{X}_{si\alpha}\big)^{-1}\mathbf{X}_{si\alpha}^T\mathbf{W}_{si\alpha}\Big(\mathbf{y}_s - \frac{\hat t_y}{N}\mathbf{1}_n - \sum_{\beta\neq\alpha}\Psi^*_{s\beta}\mathbf{y}_s\Big) \equiv \boldsymbol{\rho}_{si}^T\mathbf{y}_s,
$$
where
$$
\boldsymbol{\rho}_{si}^T = \mathbf{e}_1^T\sum_{\alpha=1}^{d}\big(\mathbf{X}_{si\alpha}^T\mathbf{W}_{si\alpha}\mathbf{X}_{si\alpha}\big)^{-1}\mathbf{X}_{si\alpha}^T\mathbf{W}_{si\alpha}\Big(\mathbf{I} + \frac{1-d}{dN}\mathbf{1}_n\mathbf{1}_n^T\Pi_s - \sum_{\beta\neq\alpha}\Psi^*_{s\beta}\Big).
$$
Similarly to [20], we define the "g-weight"
$$
g_{is} = 1 + \pi_i\sum_{j\in U_N}\Big(1-\frac{I_j}{\pi_j}\Big)\boldsymbol{\rho}_{sj}^T\mathbf{a}_i, \qquad (14)
$$
where $\mathbf{a}_i$ is an $n_N$-dimensional vector with a $1$ in the $i$th position and $0$ elsewhere. Thus the proposed estimator $\hat t_{y,\mathrm{SBLL}}$ in (9) can be written as
$$
\hat t_{y,\mathrm{SBLL}} = \sum_{i\in s}\frac{y_i}{\pi_i} + \sum_{j\in U_N}\Big(1-\frac{I_j}{\pi_j}\Big)\boldsymbol{\rho}_{sj}^T\mathbf{y}_s \equiv \sum_{i\in s}g_{is}y_i/\pi_i,
$$
which is a linear combination of the sampled $y_i$'s with a sampling weight $\pi_i^{-1}$ and the "g-weight". Because the weights do not depend on the $y_i$, they can be applied to any study variable of interest. As we show below, the weight system reproduces the known totals $\sum_{i\in U_N}x_{i\alpha}$ exactly.

Theorem 1. For any $\alpha = 1,\ldots,d$ and the "g-weight" defined in (14),
$$
\hat t_{x_\alpha,\mathrm{SBLL}} \equiv \sum_{i\in s}g_{is}x_{i\alpha}/\pi_i = \sum_{i\in U_N}x_{i\alpha}.
$$

Proof. Let $\mathbf{x}_\alpha = \{x_{i\alpha}\}_{i\in s}$. We have
$$
\hat t_{x_\alpha,\mathrm{SBLL}} = \sum_{i\in s}\pi_i^{-1}x_{i\alpha} + \sum_{j\in U_N}\big(1-I_j\pi_j^{-1}\big)\,\mathbf{e}_1^T\sum_{\gamma=1}^{d}\big(\mathbf{X}_{sj\gamma}^T\mathbf{W}_{sj\gamma}\mathbf{X}_{sj\gamma}\big)^{-1}\mathbf{X}_{sj\gamma}^T\mathbf{W}_{sj\gamma}\Big(\mathbf{I} + \frac{1-d}{dN}\mathbf{1}_n\mathbf{1}_n^T\Pi_s - \sum_{\beta\neq\gamma}\Psi^*_{s\beta}\Big)\mathbf{x}_\alpha.
$$
Observe that
$$
\big(\Gamma_s^T\Pi_s\Gamma_s\big)^{-1}\Gamma_s^T\Pi_s\mathbf{x}_\alpha = \boldsymbol{\tau}_\alpha, \qquad \Psi_{s\beta}\mathbf{x}_\alpha = \Gamma_s D_\beta\boldsymbol{\tau}_\alpha = \begin{cases}\mathbf{x}_\alpha, & \text{for }\beta = \alpha,\\ \mathbf{0}, & \text{for }\beta\neq\alpha,\end{cases}
$$
where $\boldsymbol{\tau}_\alpha$ is the vector of dimension $G_d$ with a $1$ in the $\{2+(J+1)(\alpha-1)\}$th position and $0$ elsewhere. Then we have
$$
\Big(\mathbf{I} + \frac{1-d}{dN}\mathbf{1}_n\mathbf{1}_n^T\Pi_s - \sum_{\beta\neq\gamma}\Psi^*_{s\beta}\Big)\mathbf{x}_\alpha = \begin{cases}\big(\mathbf{I} + \frac{1-d}{dN}\mathbf{1}_n\mathbf{1}_n^T\Pi_s\big)\mathbf{x}_\alpha, & \text{for }\gamma = \alpha,\\[4pt] \frac{1}{dN}\mathbf{1}_n\mathbf{1}_n^T\Pi_s\mathbf{x}_\alpha, & \text{for }\gamma\neq\alpha.\end{cases}
$$
Note that for any $i\in U_N$,
$$
\mathbf{e}_1^T\big(\mathbf{X}_{si\alpha}^T\mathbf{W}_{si\alpha}\mathbf{X}_{si\alpha}\big)^{-1}\mathbf{X}_{si\alpha}^T\mathbf{W}_{si\alpha}\mathbf{x}_\alpha = x_{i\alpha}, \qquad \mathbf{e}_1^T\big(\mathbf{X}_{si\alpha}^T\mathbf{W}_{si\alpha}\mathbf{X}_{si\alpha}\big)^{-1}\mathbf{X}_{si\alpha}^T\mathbf{W}_{si\alpha}\mathbf{1}_n = 1,
$$
thus
$$
\mathbf{e}_1^T\sum_{\gamma=1}^{d}\big(\mathbf{X}_{sj\gamma}^T\mathbf{W}_{sj\gamma}\mathbf{X}_{sj\gamma}\big)^{-1}\mathbf{X}_{sj\gamma}^T\mathbf{W}_{sj\gamma}\Big(\mathbf{I} + \frac{1-d}{dN}\mathbf{1}_n\mathbf{1}_n^T\Pi_s - \sum_{\beta\neq\gamma}\Psi^*_{s\beta}\Big)\mathbf{x}_\alpha
$$
$$
= \mathbf{e}_1^T\big(\mathbf{X}_{sj\alpha}^T\mathbf{W}_{sj\alpha}\mathbf{X}_{sj\alpha}\big)^{-1}\mathbf{X}_{sj\alpha}^T\mathbf{W}_{sj\alpha}\big\{\mathbf{I} + (dN)^{-1}(1-d)\mathbf{1}_n\mathbf{1}_n^T\Pi_s\big\}\mathbf{x}_\alpha + (dN)^{-1}\sum_{\gamma\neq\alpha}\mathbf{e}_1^T\big(\mathbf{X}_{sj\gamma}^T\mathbf{W}_{sj\gamma}\mathbf{X}_{sj\gamma}\big)^{-1}\mathbf{X}_{sj\gamma}^T\mathbf{W}_{sj\gamma}\mathbf{1}_n\mathbf{1}_n^T\Pi_s\mathbf{x}_\alpha = x_{j\alpha}.
$$
Therefore $\hat t_{x_\alpha,\mathrm{SBLL}} = \sum_{i\in s}\pi_i^{-1}x_{i\alpha} + \sum_{j\in U_N}(1-I_j/\pi_j)x_{j\alpha} = \sum_{j\in U_N}x_{j\alpha}$. Hence the proposed SBLL estimator defined in (9) preserves the calibration property. □
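The weighting representation above can be sketched in code. The following minimal illustration assumes that a matrix R holding the vectors $\boldsymbol{\rho}_{sj}^T$ (one row per population unit, so that $\hat m^*_j = R[j]\,\mathbf{y}_s$) has already been assembled from the SBLL machinery; building R is omitted, and all names are invented for the example.

```python
import numpy as np

def g_weights(pi_pop, sample_idx, R):
    """g-weights from (14).
    pi_pop: inclusion probabilities for all N population units;
    sample_idx: positions of the sampled units within the population;
    R: (N x n) matrix whose j-th row is rho_{sj}^T, so that m*_j = R[j] @ y_s."""
    N = len(pi_pop)
    ind = np.zeros(N)
    ind[sample_idx] = 1.0                              # sample membership indicators I_j
    adj = (1.0 - ind / pi_pop) @ R                     # sum_j (1 - I_j/pi_j) rho_{sj}^T, length n
    return 1.0 + pi_pop[sample_idx] * adj              # g_{is} for the sampled units

def weighted_total(z_s, pi_pop, sample_idx, R):
    """t_hat_{z,SBLL} = sum_{i in s} g_{is} z_i / pi_i for any study variable z."""
    g = g_weights(pi_pop, sample_idx, R)
    return np.sum(g * np.asarray(z_s) / pi_pop[sample_idx])
```

Because the g-weights do not depend on the study variable, the calibration property of Theorem 1 can be checked numerically: replacing z_s with the sampled values of an auxiliary column x_alpha, weighted_total should reproduce the known population total of that column (up to numerical error).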
3.2. Assumptions

For the asymptotic properties of the estimators, we adopt the traditional asymptotic framework of [1], in which both the population and sample sizes increase as $N\to\infty$. There are two sources of "variation" to be considered. The first is introduced by the random sampling design; the corresponding measure is denoted by $p$, and the "$O_p$", "$o_p$" and "$E_p(\cdot)$" notation below is with respect to this measure. The second is associated with the superpopulation from which the finite population is viewed as a sample; the corresponding measure and notation are "$\xi$". For simplicity, let $\Delta_{ij} = \pi_{ij} - \pi_i\pi_j$.

(A1) The density $f(\mathbf{x})$ of $\mathbf{X}$ is continuous and bounded away from $0$ and $\infty$. The marginal densities $f_\alpha(x_\alpha)$ of $X_\alpha$ have continuous derivatives and are bounded away from $0$ and $\infty$.

(A2) The second-order derivative of $m_\alpha(x_\alpha)$ is continuous for all $\alpha = 1,\ldots,d$.

(A3) There exists a positive constant $M$ such that $E_\xi\big(|\varepsilon|^{2+\delta}\mid\mathbf{X}\big) < M$ for some $\delta > 1/2$; $\sigma(\mathbf{x})$ is continuous on $[0,1]^d$ and bounded away from $0$ and $\infty$.

(A4) As $N\to\infty$, $n_N\to\infty$ and $n_N N^{-1}\to\pi < 1$.

(A5) The number of knots $J_N \sim n_N^{1/4}\log(n_N)$.

(A6) The kernel function $K$ is Lipschitz continuous, bounded, nonnegative, symmetric, and supported on $[-1,1]$. The bandwidth $h_N \sim n_N^{-1/5}$, i.e., $c_h n_N^{-1/5} \le h_N \le C_h n_N^{-1/5}$ for some positive constants $c_h$, $C_h$.

(A7) For all $N$, $\min_{i\in U_N}\pi_i \ge \lambda > 0$, $\min_{i,j\in U_N}\pi_{ij} \ge \lambda^* > 0$ and $\limsup_{N\to\infty} n_N \max_{i,j\in U_N,\, i\neq j}|\Delta_{ij}| < \infty$.

(A8) Let $D_{k,N}$ be the set of all distinct $k$-tuples $(i_1,i_2,\ldots,i_k)$ from $U_N$. Then
$$
\limsup_{N\to\infty} n_N^2\max_{(i_1,i_2,i_3,i_4)\in D_{4,N}}\big|E_p\big[(I_{i_1}-\pi_{i_1})(I_{i_2}-\pi_{i_2})(I_{i_3}-\pi_{i_3})(I_{i_4}-\pi_{i_4})\big]\big| < \infty,
$$
$$
\limsup_{N\to\infty} n_N^2\max_{(i_1,i_2,i_3,i_4)\in D_{4,N}}\big|E_p\big[(I_{i_1}I_{i_2}-\pi_{i_1i_2})(I_{i_3}I_{i_4}-\pi_{i_3i_4})\big]\big| < \infty,
$$
$$
\limsup_{N\to\infty} n_N^2\max_{(i_1,i_2,i_3)\in D_{3,N}}\big|E_p\big[(I_{i_1}-\pi_{i_1})^2(I_{i_2}-\pi_{i_2})(I_{i_3}-\pi_{i_3})\big]\big| < \infty.
$$

Remark 2. Assumptions (A1)-(A3) are typical in the smoothing literature; see, for instance, [10, 12, 29]. Assumption (A5) specifies the order of the number of interior knots $J_N$ for the spline estimation in the first stage; in practice, $J_N$ can be determined by (10). Assumption (A6) specifies how to select the kernel function and the corresponding bandwidth. Such assumptions were used in [29] for additive autoregressive model fitting. Assumptions (A7) and (A8) involve the inclusion probabilities of the design and were also assumed in [1].

3.3. Asymptotic properties of the estimator

Like the local polynomial estimators in [1], the following theorem shows that the estimator $\hat t_{y,\mathrm{SBLL}}$ in (9) is asymptotically design unbiased and design consistent.

Theorem 2. Under Assumptions (A1)-(A7), the estimator $\hat t_{y,\mathrm{SBLL}}$ in (9) is asymptotically design unbiased in the sense that
$$
\lim_{N\to\infty}E_p\Big[\frac{\hat t_{y,\mathrm{SBLL}}-t_y}{N}\Big] = 0 \quad\text{with $\xi$-probability 1},
$$
and is design consistent in the sense that for all $\eta > 0$,
$$
\lim_{N\to\infty}E_p\big[I_{\{|\hat t_{y,\mathrm{SBLL}}-t_y| > N\eta\}}\big] = 0 \quad\text{with $\xi$-probability 1}.
$$

Let $\tilde t_{y,\mathrm{SBLL}}$ be the population-based generalized difference estimator of $t_y$ when the entire realization is known; see (A.4) in Appendix A.1 for the formal definition. Like the local polynomial estimators in [1], the penalized spline estimators in [2] and the backfitting estimators in [3], the proposed estimator $\hat t_{y,\mathrm{SBLL}}$ inherits the limiting distribution of the "oracle" estimator $\tilde t_{y,\mathrm{SBLL}}$, as the following theorem shows.

Theorem 3. Under Assumptions (A1)-(A8),
$$
\frac{N^{-1}\big(\tilde t_{y,\mathrm{SBLL}}-t_y\big)}{\mathrm{Var}_p^{1/2}\big(N^{-1}\tilde t_{y,\mathrm{SBLL}}\big)}\xrightarrow{d} N(0,1)
$$
as $N\to\infty$ implies
$$
\frac{N^{-1}\big(\hat t_{y,\mathrm{SBLL}}-t_y\big)}{\widehat V^{1/2}\big(N^{-1}\hat t_{y,\mathrm{SBLL}}\big)}\xrightarrow{d} N(0,1),
$$
where
$$
\widehat V\big(N^{-1}\hat t_{y,\mathrm{SBLL}}\big) = \frac{1}{N^2}\sum_{i,j\in s}\frac{\Delta_{ij}}{\pi_{ij}}\,\frac{y_i-\hat m^*_i}{\pi_i}\,\frac{y_j-\hat m^*_j}{\pi_j}. \qquad (15)
$$
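The variance estimator in (15) is a double sum over the sample that is straightforward to vectorize. The sketch below is illustrative only; the joint inclusion probabilities must be supplied by the design, and the SRS values used in the toy usage are standard facts, not taken from the paper.

```python
import numpy as np

def sbll_variance(y_s, m_star_s, pi_s, pi_joint, N):
    """Variance estimator (15) for N^{-1} * t_hat_{y,SBLL}:
    (1/N^2) * sum_{i,j in s} (Delta_ij / pi_ij) * e_i * e_j,
    where e_i = (y_i - m*_i) / pi_i, Delta_ij = pi_ij - pi_i * pi_j,
    and the diagonal of pi_joint holds pi_i (since pi_ii = pi_i)."""
    pi_s = np.asarray(pi_s, dtype=float)
    e = (np.asarray(y_s, dtype=float) - np.asarray(m_star_s, dtype=float)) / pi_s
    delta = np.asarray(pi_joint, dtype=float) - np.outer(pi_s, pi_s)
    return float(e @ (delta / pi_joint) @ e) / N ** 2

# Toy usage under SRS without replacement: pi_i = n/N, pi_ij = n(n-1)/(N(N-1)) for i != j
n, N = 4, 10
pi = np.full(n, n / N)
pij = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pij, n / N)
print(sbll_variance([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], pi, pij, N))
```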
