Recursive Partitioning Applied to Complex Sample Survey Data

Myron Katzoff (a) (Retired), Kumer Das (b), Meena Khare (a)
(a) CDC/National Center for Health Statistics (NCHS), Hyattsville, MD 20782
(b) Lamar University
May, 2013

Outline

1 Introduction
    Two Interesting Quotes
    Why Sample Surveys?
2 Methods to be Investigated
    The Use of Superpopulation Concepts & Alternatives
    An Example Based on the Helix Tree Methodology
3 References

Introduction: Two Interesting Quotes

Daniell Toth & John Eltinge

An observation from the abstract of their 2010 BLS Statistical Survey paper, "Building Consistent Regression Trees from Complex Sample Data":

    Many prospective [RP] applications involve data collected through a complex sample design. At present, however, relatively little is known regarding the properties of these methods under complex designs.

The Signal and the Noise (2012)

Nate Silver, the author, instructively asks in the age of Big Data ... “Who needs theory when you have so much information?”

My adaptation of Silver’s answer is:

    Statistical inferences are much stronger when backed up by theory or at least some deeper thinking about ... [the statistical principles upon which statistical practice is based and for which there is some general agreement.]

Introduction: Why Sample Surveys?

Sample Survey Advantages in CER

- Enable broad coverage of a general population at reduced cost.
- Intensify focus on certain socio-demographic subgroups of a population for which coverage may be poor.
- By concentrating resources, provide data with fewer recording errors (i.e., greater accuracy).
- Provide opportunities to collect data to apply statistical techniques for missing information and informed data editing.

Methods to be Investigated: The Use of Superpopulation Concepts & Alternatives

Three Basic Directions

- Ignore the sample design and treat the data as observations on iid random variables (rv’s). Problem: sample units (i.e., people) are not merely carriers of information; observations depend on variables associated with those persons.
- Restructure the sample so that the data can be analyzed as representative of a simple random sample from an animal or human population.
- Investigate the influence of the sample design: use superpopulation concepts, asymptotics, etc. Alternatively, use design variables in the splitting steps.

A Few Remarks about Superpopulation (SP) Models

- The model is the survey sampler’s conceptualization of how a finite population, U, might have been generated.
- The SP model is often an expression of prior knowledge or belief.
- U can be a random sample (r.s.) from some parametric distribution.
- The SP model can be different from one survey to another.
- Two important types of inference: (1) inference about the characteristics of U; and (2) inference about the SP model, i.e., the process underlying U.

Methods to be Investigated: An Example Based on the Helix Tree Methodology

Modification of a Helix Tree Procedure

- Mathematical discussion of the computational procedures used in the Helix Tree software can be found in Chapter 26 of the manual.
- Useful to think in terms of invoking frequentist paradigm asymptotics.
- Example based upon the test statistic of Section 26.2.1 for determining splits; it involves sample totals and cross-product sums.

Some General Remarks about Finite Population Sample Procedures

- Finite population (FP) parameters are superpopulation sample quantities.
- Use survey sample weights to get estimates of FP parameters.
- Need to prove asymptotic design unbiasedness (ADU) and consistency (ADC) for a test statistic $\hat{T}$ quantified with FP sample data.

General Remarks about Finite Population Sample Procedures (continued)

Definition (to be perfected)

Allowing the size of U, $N_t$, and the size of the sample from U, $n_t$, to increase according to assumptions we must specify, we will consider adapting definitions from Robinson and Särndal (1983) under various hypotheses for a superpopulation model $\xi$.

Asymptotic Design Unbiasedness (ADU):

\[ \lim_{t \to \infty} E_p\left[\hat{T}_t - T_t\right] = 0 \]

Asymptotic Design Consistency (ADC):

\[ \lim_{t \to \infty} P_p\left(\left|\hat{T}_t - T_t\right| > \epsilon\right) = 0 \quad \text{with } \xi\text{-probability 1, for every } \epsilon > 0. \]

Here, $E_p$ and $P_p$ denote expectation and probability with respect to the finite population sample design.
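
The ADU and ADC conditions can be explored numerically with a small Monte Carlo sketch. The example below is not taken from the slides or from the Helix Tree software; it rests on assumptions chosen only for illustration: the superpopulation model $\xi$ is Normal(50, 10), the design is simple random sampling without replacement, $T_t$ is the finite population mean, and $\hat{T}_t$ is the weighted sample mean with design weights $N_t / n_t$. The function names `draw_population`, `weighted_mean`, and `design_check` are hypothetical.

```python
import random


def draw_population(N, rng):
    """Generate a finite population U of size N from an ASSUMED superpopulation model."""
    # Assumption for illustration only: y_k ~ Normal(50, 10), independent.
    return [rng.gauss(50.0, 10.0) for _ in range(N)]


def weighted_mean(sample, weights):
    """Survey-weighted estimate of the finite population mean."""
    return sum(w * y for w, y in zip(weights, sample)) / sum(weights)


def design_check(N, n, eps=1.0, reps=2000, seed=0):
    """Monte Carlo approximation of E_p[T_hat - T] and P_p(|T_hat - T| > eps).

    The population U is drawn once and then held fixed, so the averages
    below are taken over repeated samples from the design only.
    """
    rng = random.Random(seed)
    U = draw_population(N, rng)
    T = sum(U) / N                      # finite population parameter T_t
    errors = []
    for _ in range(reps):
        s = rng.sample(U, n)            # simple random sample without replacement
        w = [N / n] * n                 # design weight = 1 / inclusion probability
        errors.append(weighted_mean(s, w) - T)
    bias = sum(errors) / reps
    exceed = sum(abs(e) > eps for e in errors) / reps
    return bias, exceed


if __name__ == "__main__":
    # An increasing sequence (N_t, n_t); the growth rates are an assumption.
    for N, n in [(1_000, 25), (10_000, 100), (100_000, 400)]:
        bias, exceed = design_check(N, n)
        print(f"N={N:>7} n={n:>4}  bias={bias:+.4f}  P(|err|>eps)={exceed:.3f}")
```

Under these assumptions the estimated design bias should stay near zero and the exceedance frequency should shrink as $n_t$ grows, which is the behavior the ADU and ADC definitions describe. A complex design with unequal inclusion probabilities would replace the constant weights with unit-specific ones.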