Recursive Partitioning Applied to Complex Sample Survey Data
Myron Katzoff (a, Retired), Kumer Das (b), Meena Khare (a)
(a) CDC/National Center for Health Statistics (NCHS), Hyattsville, MD 20782
(b) Lamar University
May 2013
Outline
1 Introduction
Two Interesting Quotes
Why Sample Surveys?
2 Methods to be Investigated
The Use of Superpopulation Concepts & Alternatives
An Example Based on the Helix Tree Methodology
3 References
Introduction: Two Interesting Quotes
Daniell Toth & John Eltinge
An observation from the abstract of their 2010 BLS Statistical Survey
paper "Building Consistent Regression Trees from Complex Sample Data":
Many prospective [RP] applications involve data collected
through a complex sample design. At present, however, relatively
little is known regarding the properties of these methods under
complex designs.
The Signal and the Noise (2012)
Nate Silver, the author, instructively asks in the age of Big Data ... “Who
needs theory when you have so much information?” My adaptation of
Silver’s answer is:
Statistical inferences are much stronger when backed up by
theory or at least some deeper thinking about ... [the statistical
principles upon which statistical practice is based and for which
there is some general agreement.]
Introduction: Why Sample Surveys?
Sample Survey Advantages in CER
- Enable broad coverage of a general population at reduced cost
- Intensify focus on certain socio-demographic subgroups of a
  population for which coverage may be poor
- By concentrating resources, provide data with fewer recording errors
  (i.e., greater accuracy)
- Provide opportunities to collect data to apply statistical techniques
  for missing information and informed data editing
Methods to be Investigated: The Use of Superpopulation Concepts & Alternatives
Three Basic Directions
- Ignore the sample design and treat the data as observations on iid
  random variables (rv's).
    Problem: sample units (i.e., people) are not merely carriers
    of information – observations depend on variables associated
    with those persons
- Restructure the sample so that the data can be analyzed as
  representative of a simple random sample from an animal or human
  population
- Investigate the influence of the sample design – use superpopulation
  concepts, asymptotics, etc. Alternatively, use design variables in the
  splitting steps.
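
To make the contrast between ignoring the design and accounting for it concrete, here is a minimal Python sketch of a single regression-tree split. It assumes, purely for illustration, that the design enters through survey weights in a weighted within-node sum of squares. This is one simple option, not the procedure proposed in the talk, and the names `weighted_sse` and `best_split` and the toy data are hypothetical.

```python
def weighted_sse(y, w):
    """Weighted sum of squared deviations about the weighted mean."""
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    return sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y))


def best_split(x, y, w):
    """Return (cut point, criterion) minimizing the weighted within-node SSE.

    Candidate splits are x <= c versus x > c for each observed value c
    except the largest, so both child nodes are always non-empty.
    """
    best = None
    for c in sorted(set(x))[:-1]:
        left = [i for i, xi in enumerate(x) if xi <= c]
        right = [i for i, xi in enumerate(x) if xi > c]
        sse = (weighted_sse([y[i] for i in left], [w[i] for i in left])
               + weighted_sse([y[i] for i in right], [w[i] for i in right]))
        if best is None or sse < best[1]:
            best = (c, sse)
    return best


# Toy data (hypothetical): four sampled units, one predictor, one response.
x = [1, 2, 3, 4]
y = [0, 2, 3, 5]

# Equal weights reproduce the iid analysis that ignores the design.
print(best_split(x, y, [1, 1, 1, 1])[0])   # 2

# Assumed survey weights: the first unit represents ten population units.
print(best_split(x, y, [10, 1, 1, 1])[0])  # 1
```

With equal weights the split falls at x <= 2. Once the first unit carries a large weight, the split moves to x <= 1, so the tree grown from the same data depends on whether the design is taken into account.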
A Few Remarks about Superpopulation (SP) Models
- The model is the survey sampler's conceptualization of how a finite
  population, U, might have been generated.
- The SP model is often an expression of prior knowledge or belief. U can
  be regarded as a random sample from some parametric distribution.
- The SP model can differ from one survey to another.
- Two important types of inference: (1) inference about the
  characteristics of U; and (2) inference about the SP model – the
  process underlying U.
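
The two targets of inference can be illustrated with a small simulation. The sketch below assumes a normal superpopulation model and simple random sampling without replacement. Both are illustrative choices rather than anything specified in the talk, and the function name `simulate` and its defaults are hypothetical.

```python
import random
import statistics


def simulate(mu=50.0, sigma=10.0, N=5000, n=500, seed=1):
    """Two-stage view: the SP model generates U, then the design samples from U."""
    rng = random.Random(seed)
    # Stage 1 (assumed SP model): U is a random sample of size N from Normal(mu, sigma).
    U = [rng.gauss(mu, sigma) for _ in range(N)]
    # Stage 2 (assumed design): simple random sampling without replacement from U.
    s = rng.sample(U, n)
    return {
        "sp_mean": mu,                          # target of inference (2)
        "fp_mean": statistics.fmean(U),         # target of inference (1)
        "sample_mean": statistics.fmean(s),     # what the survey observes
    }


out = simulate()
print(out)
```

The finite population mean differs slightly from the model mean because U is itself one realization of the model. The sample mean then estimates the finite population mean directly and the model mean only indirectly, through U.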
Methods to be Investigated: An Example Based on the Helix Tree Methodology
Modification of a Helix Tree Procedure
- A mathematical discussion of the computational procedures used in the
  Helix Tree software can be found in Chapter 26 of the manual.
- It is useful to think in terms of frequentist asymptotics.
- The example is based upon the test statistic of section 26.2.1 for
  determining splits – it involves sample totals and cross-product sums.
Some General Remarks about Finite Population Sample Procedures
- Finite population (FP) parameters are sample quantities with respect
  to the superpopulation.
- Use survey sample weights to get estimates of FP parameters.
- Need to prove asymptotic design unbiasedness (ADU) and consistency
  (ADC) for a test statistic $\hat{T}$ computed from FP sample data.
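
As a minimal sketch of weighted estimation, the Python below computes weighted sample totals and a weighted cross-product sum. It assumes each weight is the inverse of the unit's inclusion probability (a Horvitz-Thompson-type estimator). It does not reproduce the Helix Tree test statistic, and the function names and the three-unit data set are hypothetical.

```python
def weighted_total(w, y):
    """Estimate the FP total of y as the weighted sample sum of y."""
    return sum(wi * yi for wi, yi in zip(w, y))


def weighted_cross_product(w, x, y):
    """Estimate the FP sum of x*y as the weighted sample sum of x*y."""
    return sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))


# Hypothetical sample of three units with survey weights w.
w = [2, 2, 4]
x = [1, 0, 1]
y = [3, 5, 2]

N_hat = weighted_total(w, [1, 1, 1])         # estimated population size: 8
Y_hat = weighted_total(w, y)                 # estimated total of y: 24
XY_hat = weighted_cross_product(w, x, y)     # estimated sum of x*y: 14
print(N_hat, Y_hat, XY_hat, Y_hat / N_hat)   # 8 24 14 3.0
```

A split statistic built from totals and cross-product sums could then be evaluated with these weighted versions in place of the unweighted sums. Whether the resulting statistic is ADU and ADC is exactly what would still have to be shown.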
General Remarks about Finite Population Sample Procedures (continued)
Definition (to be perfected)
Allowing the size of U, $N_t$, and the size of the sample from U, $n_t$, to
increase according to assumptions we must specify, we will consider
adapting definitions from Robinson and Särndal (1983) under various
hypotheses for a superpopulation model $\xi$.
Asymptotic Design Unbiasedness (ADU)
$\lim_{t \to \infty} E_p[\hat{T}_t - T_t] = 0$
Asymptotic Design Consistency (ADC)
$\lim_{t \to \infty} P_p(|\hat{T}_t - T_t| > \epsilon) = 0$ with $\xi$-probability 1, for $\epsilon > 0$.
Here, $E_p$ and $P_p$ denote expectation and probability with respect to the
finite population sample design.
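
The two definitions can be explored numerically. The Monte Carlo sketch below is an illustration under assumed conditions, not a proof: it takes $T_t$ to be the finite population mean, $\hat{T}_t$ the sample mean under simple random sampling without replacement, and lets $N_t = 200t$ and $n_t = 20t$. It uses a single realization of U for each t, so it says nothing about the "$\xi$-probability 1" part of ADC. The function name and all settings are hypothetical.

```python
import random
import statistics


def design_simulation(N, n, eps=0.2, reps=2000, seed=7):
    """Approximate E_p[T_hat - T] and P_p(|T_hat - T| > eps) by repeated sampling."""
    rng = random.Random(seed)
    # One fixed realization of U from an assumed Normal(0, 1) SP model.
    U = [rng.gauss(0.0, 1.0) for _ in range(N)]
    T = statistics.fmean(U)  # FP parameter: the population mean
    # Repeat the sample design: SRS without replacement of size n from U.
    errors = [statistics.fmean(rng.sample(U, n)) - T for _ in range(reps)]
    bias = statistics.fmean(errors)
    exceed = sum(abs(e) > eps for e in errors) / reps
    return bias, exceed


# Population and sample sizes grow together with the index t.
results = [design_simulation(N=200 * t, n=20 * t) for t in (1, 4, 16)]
for t, (bias, exceed) in zip((1, 4, 16), results):
    print(t, round(bias, 4), exceed)
```

Under these assumptions the estimated design bias stays near zero for every t, and the estimated exceedance probability shrinks as t grows, which is the behavior the ADU and ADC definitions describe.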