Table Of ContentA Bayesian view of doubly robust causal inference
O.Saarela∗, L.R.Belzile†AND D.A.Stephens‡
Abstract
Incausalinferenceconfoundingmaybecontrolledeitherthroughregressionadjustmentinanout-
comemodel,orthroughpropensityscoreadjustmentorinverseprobabilityoftreatmentweighting,or
7
1 both. Thelatterapproaches,whicharebasedonmodellingofthetreatmentassignmentmechanism
0 andtheirdoublyrobustextensionshavebeendifficulttomotivateusingformalBayesianarguments;
2 inprinciple,forlikelihood-basedinferences,thetreatmentassignmentmodelcanplaynopartininfer-
encesconcerningtheexpectedoutcomesifthemodelsareassumedtobecorrectlyspecified. Onthe
n
otherhand,forcingdependencybetweentheoutcomeandtreatmentassignmentmodelsbyallowing
a
J theformertobemisspecifiedresultsinlossofthebalancingpropertyofthepropensityscoresandthe
lossofanydoublerobustness.Inthispaper,weexplainintheframeworkofmisspecifiedmodelswhy
5
1 doublyrobustinferencescannotarisefrompurelylikelihood-basedarguments,anddemonstratethis
throughsimulations. AsanalternativetoBayesianpropensityscoreanalysis,weproposeaBayesian
] posterior predictive approach for constructing doubly robust estimation procedures. Our approach
E
appropriatelydecouplestheoutcomeandtreatmentassignmentmodelsbyincorporatingtheinverse
M treatmentassignmentprobabilitiesinBayesiancausalinferencesasimportancesamplingweightsin
MonteCarlointegration.
.
t A revised version of this article has been accepted for publication in Biometrika, published by
a
t OxfordUniversityPress.
s
Saarela,O.,Belzile,L.R.andD.A.Stephens. ABayesianviewofdoublyrobustcausalinference,
[
Biometrika(2016),103(3):667-681.doi:10.1093/biomet/asw025.
1
v 1. INTRODUCTION
3
9 In causal inference contexts, confounding is most often controlled by one of two approaches: first, by
0 conditioningontheconfoundersinaregressionmodelfortheoutcomeinanoutcomeregressionmodel;
4 secondly,bymodellingthetreatmentassignmentmechanismtoobtainso-calledpropensityscorevalues,
0
and then using these scores to construct strata within the observed sample, or a pseudo-sample from
.
1 a hypothetical population, within which treatment assignment is not confounded. This pseudo-sample
0
can be obtained via inverse probability of treatment weighting of the original sample, analogously to
7
surveysamplingprocedures. Theoutcomeregressionadjustmentmethodrequirescorrectspecification
1
: of the regression function in order to obtain consistent inference; this may be achieved in practice us-
v
ingflexibleregressionstrategiesandcomplexfunctionsoftypicallyalargenumberofcovariates. The
i
X propensityscoreadjustmentmethodsfocusprincipallyonthespecificationofthetreatmentassignment
r model, which may be similarly flexible or complex. Under either approach, sufficient control of con-
a foundingisthereforedependentonpossiblyunverifiablemodellingassumptions. Thishasmotivatedthe
developmentofdoublyrobustmethodsinwhichbothmodelcomponentsarespecified,butonlyoneof
themneedstobecorrectlyspecifiedtosufficientlycontrolforconfounding.
Adjustments that depend on the propensity score using regression or reweighting are not easy to
interpretfromtheBayesianperspective,sinceBayesianinferencesarenaturallybasedonmodellingof
the outcome, with modelling of the treatment assignment playing no role in inference relating to the
outcome/treatment relationship (Robins and Ritov, 1997). Gustafson (2012) attempted a Bayesian in-
terpretation as a compromise between a saturated outcome model and a parametric one; however, the
treatmentassignmentmodeldidnotfeatureinthisinterpretation. Scharfsteinetal.(1999)andBangand
∗Dalla Lana School of Public Health, University of Toronto, 155 College Street, Toronto, Ontario M5T 3M7, Canada.
olli.saarela@utoronto.ca
†École Polytechnique Fédérale de Lausanne, EPFL-SB-MATHAA-STAT, Station 8, CH-1015 Lausanne, Switzerland.
leo.belzile@epfl.ch
‡Department of Mathematics and Statistics, McGill University, Montreal, Quebec H3A 2K6, Canada.
d.stephens@math.mcgill.ca
1
2 O.SAARELA,L.R.BELZILED.A.STEPHENS
Robins(2005)pointedoutaconnectionbetweenadoublyrobustestimatorandamodel-basedestima-
tor; the two are equivalent if the outcome model in the latter features the so called clever covariate, a
particularfunctionofthepropensityscore.
Separatelyfromtheabovedevelopments,therehasbeenabodyofworkstudyingBayesianversions
ofpropensityscoreadjustmenttocontrolforconfounding(Hoshino,2008;McCandlessetal.,2009a,b,
2010,2012;An,2010;KaplanandChen,2012;Zigleretal.,2013;ChenandKaplan,2015;Grahametal.,
2015). Theseapproachesrequireoneoftwokindsofcompromises;theyeitherforceaparametrization
whichmakestheoutcomeandtreatmentassignmentmodelsdependent,thuslosingthebalancingprop-
ertyofthepropensityscore,orcutsuchafeedbackinwhichcasetheinferenceproceduresarenolonger
Bayesian. Because of these difficulties, some authors (e.g. Achy-Brou et al., 2010, p. 828) have been
contenttofixthepropensityscorestotheirbestestimatesinmodel-basedinferences,withoutattempting
tojointlyestimatethetwomodelcomponents. Inanalternativeapproach,Wangetal.(2012)andZigler
andDominici(2014)havesuggestedconnectingtheoutcomeandtreatmentassignmentmodelsthrough
thepriordistributioninordertoincorporatetheuncertaintyinconfounderselection. Hereinwedonot
consider model uncertainty, but rather, concentrate on inferences with a priori specified outcome and
treatmentassignmentmodels.
ThepurposeofthispaperistoclarifythetheoreticalandpracticalmotivationsforBayesianpropen-
sityscoreadjustment,andtherelationshipsbetweenthedifferentmethodsproposedforthis,whichhave
notbeenfullyexploredpreviously.Weaddresstheseinthecontextofdoubleadjustmentforboththepo-
tentialconfoundersandthepropensityscore,andarguethattheproblemcannotbeproperlyunderstood
withoutconsideringitintheframeworkofmisspecifiedmodels. ToprovideanalternativetoBayesian
propensity score adjustment, we propose deriving Bayesian versions of various inverse probability of
treatmentweightedestimators,includinginverseprobabilityoftreatmentweightedoutcomeregression
and the semi-parametric double robust estimator, through posterior predictive expectations, with the
weightsintroducedasimportancesamplingweightsinMonteCarlointegration.
2. PRELIMINARIES
2. Notationandassumptions
Forsimplicity,weconsiderthecaseofasinglebinarytreatment,anddeferdiscussionofthelongitudinal,
multipleexposureandcontinuouscasestotheDiscussion. LettherandomvectorsX representasetof
i
pre-treatment covariate measurements, Z a binary treatment allocation indicator, and Y an outcome
i i
forindividuali,measuredaftersufficienttimehaspassedsinceadministeringthetreatment. Weadopt,
for convenience, the standard potential outcome (or counterfactual) framework: for individual i, the
observedoutcomeisrelatedtothetwopossiblepotentialoutcomes(Y ,Y )byY =(1−Z )Y +
0i 1i i i 0i
Z Y . We assume that X includes a sufficient set of covariates to control for confounding in the
i 1i i
sense that Z ⊥⊥ (Y ,Y )|X (cf. ignorable treatment assignment, Rosenbaum and Rubin, 1983,
i 0i 1i i
p. 43). The propensity score e(X ) ≡ pr(Z = 1|X ) has the balancing property Z ⊥⊥ X |e(X ),
i i i i i i
whichalsoimpliesthat(Y ,Y )⊥⊥Z |e(X )–thisscalarscoreisthereforeusefulincontrollingfor
0i 1i i i
confounding.
IfthecovariatespaceX ishigh-dimensional,inpracticethetaskofcontrollingforconfoundingoften
i
involves some covariate selection; here one can either select for the features that predict the outcome,
orthetreatmentassignment. Torepresentthis,letS andB representsomeaprioriselectedsubsetsof
i i
the all observed features X , so that the latter can be partitioned as X = (S ,R ) or X = (B ,C ),
i i i i i i i
where possibly R = ∅ and C = ∅. If the selected set of features S captures all relevant prognostic
i i i
information, then Y ⊥⊥ X |S (Hansen, 2008). For the remainder of the paper, we consider the
0i i i
strongercondition(i)(Y ,Y )⊥⊥X |S ,whichrequiresthatS alsocapturesallrelevantinformation
0i 1i i i i
aboutpossibleeffectmodification.
InthiscaseS issufficienttocontrolforconfounding,sincefromthepropertiesofconditionalinde-
i
pendence(Dawid,1979)itfollowsthat(Y ,Y ) ⊥⊥ Z |S . If,ontheotherhand,theselectedsetof
0i 1i i i
featuresB hasthebalancingproperty(ii)Z ⊥⊥ X |B ,itissufficienttocontrolforconfounding. We
i i i i
areinterestedinestimationprocedureswhicharevalidwheneither(Y ,Y ) ⊥⊥ X |S (butpossibly
0i 1i i i
Z (cid:54)⊥⊥X |B )orZ ⊥⊥X |B (butpossibly(Y ,Y )(cid:54)⊥⊥X |S ).
i i i i i i 0i 1i i i
Ourparameterofinterestisanaveragecausalcontrastsuchas
E(Y )−E(Y )=E {E(Y |X )}−E {E(Y |X )}.
1i 0i Xi 1i i Xi 0i i
Interest then lies in the identifiability of the average potential outcomes based on the observed data.
preprint,versionofOctober29th,2015 Author’sOriginalVersion
ABayesianviewofdoublyrobustcausalinference, 3
Whenthe‘nounmeasuredconfounder’assumptionholds,itfollowsthat(e.g.HernánandRobins,2006,
p. 43)
(cid:90)
E(Y )−E(Y )= {E(Y |Z =1,x )−E(Y |Z =0,x )}p(x )dx .
1i 0i i i i i i i i i
xi
2. Bayesianmodelformulationforoutcomeregression
AnyBayesianmodelspecificationisconstructedviadeFinetti’srepresentationforexchangeablerandom
sequencesv ≡(x ,y ,z )(seeforexampleBernardoandSmith,1994,Chap. 4). The(subjective)joint
i i i i
distributionforobservationsv ≡(v ,...,v )maythenberepresentedby
1 n
(cid:90) (cid:40) n (cid:41)
(cid:89)
p(v)= p(y |z ,x ;φ)p(z |x ;γ)p(x ;ψ) π (φ,γ,ψ)dφdγdψ, (1)
i i i i i i 0
φ,γ,ψ i=1
implying the existence of the parametric models and the prior density π (φ,γ,ψ). Since the repre-
0
sentation theorem is not constructive, that is, does not specify the models implicit in (1), in practice
inferencesaboutagivenfinite-dimensionalparametrizationinvolvestheoftenimplicitassumptionthat
p(y |z ,x ;φ )=f(y |z ,x ),p(z |x ;γ )=f(z |x )andp(x ;ψ )=f(x ),where(φ ,γ ,ψ )is
i i i 0 i i i i i 0 i i i 0 i 0 0 0
thelimitingvalueoftheposteriorπ (φ,γ,ψ)≡p(φ,γ,ψ|v)inthesenseofe.g.vanderVaart(1998,p.
n
139),andwherethefsrepresentthe‘true’limiting(sampling)distributions. Wemightfurtherassume
thattheparametersareaprioriindependent, sothatπ (φ,γ,ψ) = π (φ)π (γ)π (ψ), inwhichcaseit
0 0 0 0
italsofollowsalsothattheposteriorfactorizesasπ (φ,γ,ψ)=π (φ)π (γ)π (ψ)(e.g.Gelmanetal.,
n n n n
2004,p. 354–355).
Themarginaldistributionp(x ;ψ)caninpracticebespecifiednonparametricallyandestimatedusing
i
theempiricalcovariatedistribution,leadingtoaBayesianestimatoroftheaveragecausalcontrast,
(cid:90) 1 (cid:88)n
{m(1,x ;φ)−m(0,x ;φ)}π (φ)dφ, (2)
n i i n
φ i=1
whereπ (φ)∝(cid:81)n p(y |z ,x ;φ)π (φ)dφ,andm(z,x;φ)≡(cid:82) yp(y|z,x;φ)dy. InSupplementary
n i=1 i i i 0
Appendix1,weshowthat(2)canbemotivatedwithouttheuseofpotentialoutcomesnotationthrough
posterior predictive expectations for a new observation under a hypothetical completely randomized
setting.
Theestimator(2)istheBayesianversionofthewell-knowndirectstandardizationorg-formula. We
return to it in Section 6, but note here that it does not feature the treatment assignment model; rather,
theposteriorpredictiveapproachforestimatingthemarginalcausalcontrastdependsentirelyoncorrect
specificationofthedistributionp(y |z ,x ;φ),orinamoment-basedrepresentation,m(z,x;φ). Before
i i i
discussingBayesianalternativesto(2),webrieflyreviewsomecommonlyusedfrequentistapproaches
forcombiningoutcomeregressionandpropensityscoreadjustment.
3. FREQUENTIST APPROACHES FOR COMBINING OUTCOME REGRESSION AND
PROPENSITY SCORE ADJUSTMENT
3. Includingpropensityscoresintooutcomeregression
Because of the balancing property of the propensity score, it is tempting to specify a propensity score
e(b ;γ)≡pr(Z =1|b ;γ)anduseastatisticalmodelsuchasp{y |z ,e(b ),s ;φ},inthehopethat,if
i i i i i i i
theprognosticmodelisismisspecified,adjustingforthepropensityscorewouldstillsufficientlycontrol
for any residual confounding. For simplicity, we take the parameters φ to specify also the functional
dependencebetweenthepropensityscoreandtheoutcome;tomodelthisdependency,itisadvisableto
useflexibleformulationssuchassplines(e.g.ZhangandLittle,2009). Usingsuchanoutcomemodel,
themarginalcausalcontrastwouldthenbeestimatedby
n
1 (cid:88)(cid:104) (cid:110) (cid:111) (cid:110) (cid:111)(cid:105)
n m 1,e(bi;γ(cid:98)),si;φ(cid:98) −m 0,e(bi;γ(cid:98)),si;φ(cid:98) , (3)
i=1
(cid:82)
where m{z,e(b;γ),s;φ} ≡ yp{y|z,e(b;γ),s;φ}dy, and where φ(cid:98)and γ(cid:98) are maximum likelihood
estimators for the parameters in the outcome regression and treatment assignment model, respectively.
The motivation for such a double adjustment is that it is sufficient to control for confounding if either
Author’sOriginalVersion preprint,versionofOctober29th,2015
4 O.SAARELA,L.R.BELZILED.A.STEPHENS
condition (i) or (ii) of Section 2.1 applies. We summarize this property in the following theorem (a
proofinSupplementaryAppendix2;inthenotationallconvergenciesareinprobabilityunlessotherwise
stated).
THEOREM1. Estimator(3)isconsistentiftheoutcomemodeliscorrectlyspecifiedinthesensethat
(cid:82)
m{z,e(b;γ ),s;φ } = yf(y|z,e(b),s)dy, the parameters in this can be consistently estimated so
0 0
that φ(cid:98) → φ0, the treatment assignment model is correctly specified in the sense that p(zi|bi;γ0) =
f(z |b )andγ →γ ,andeither(i)holdstrue,or(ii)holdstrue.
i i (cid:98) 0
Theestimator(3)maybeconsidered‘doublyrobust’intermsofthecovariateselectioninthesense
that only one of the sets S and B needs to be correctly specified, although it still always relies on a
i i
correctparametricspecificationofthemodelfortheexpectedoutcome,conditionalon{Z ,e(B ),S }.
i i i
3. Theclevercovariateandaugmentedoutcomeregression
The estimator discussed in the previous section did not specify which function of the propensity score
e(B ) should be added to the regression model. Scharfstein et al. (1999, p. 1141–1142) and Bang
i
andRobins(2005,p. 964–965)drewaconnectionbetweenpropensityscoreregressionadjustmentand
doublyrobustestimatorsoftheform
n
1 (cid:88)yizi−{zi−e(bi;γ(cid:98))}m(1,si;φ(cid:98))
n e(b ;γ)
i (cid:98)
i=1
n
− 1 (cid:88)yi(1−zi)−[(1−zi)−{1−e(bi;γ(cid:98))}]m(0,si;φ(cid:98)),
n 1−e(b ;γ)
i (cid:98)
i=1
whichcanbeequivalentlyrepresentedas
n1 (cid:88)n (cid:26)e(bzi;γ) − 1−1e−(bzi;γ)(cid:27)(cid:110)yi−m(zi,si;φ(cid:98))(cid:111)+ n1 (cid:88)n (cid:26)m(1,si;φ(cid:98))−m(0,si;φ(cid:98))(cid:27). (4)
i (cid:98) i (cid:98)
i=1 i=1
On considering the score equation derived from a regression of Y on Z and S with mean function
i i i
m(z,s;φ),thisformsuggestsincorporatingthederivedcovariatec(z ,b ) = z /e(b )−(1−z )/{1−
i i i i i
e(b )}(termedthe‘clevercovariate’byRoseandvanderLaan,2008,p. 8)additivelyintotheoutcome
i
regression,thatis,forexample
m(z,s;φ)=φ +φ z+φ(cid:62)s+φ c(z,b). (5)
0 1 2 3
Thefirsttermin(4)isthenzerothroughthemaximumlikelihoodscoreequation, leavingonlythelast
termwhichisthemodelbasedestimatorofthemarginaltreatmenteffect. Thuswiththeclevercovariate
in the outcome model, the doubly robust estimator is equivalent to the model-based estimator. In the
specialcaseofmodel(5),thisbecomes
1 (cid:88)n (cid:110) (cid:111) 1 (cid:88)n (cid:26) 1 1 (cid:27)
n m(1,xi;φ(cid:98))−m(0,xi;φ(cid:98)) =φ(cid:98)1+φ(cid:98)3n e(b ;γ) − 1−e(b ;γ) .
i (cid:98) i (cid:98)
i=1 i=1
A potential drawback of using this covariate is that it may lead to extreme variability for the resulting
meandifferenceestimator,evencomparedtoinverseprobabilityoftreatmentweightedestimatorsofthe
form
n (cid:26) (cid:27)
1 (cid:88) yizi − yi(1−zi) . (6)
n e(b ;γ) 1−e(b ;γ)
i (cid:98) i (cid:98)
i=1
Toseewhythisisthecase,thedistributionofc(z ,b )initselfcanbeveryskewedtotheright,butthis
i i
becomesevenmorepronouncedinthemodel-basedestimatorwheretheclevercovariatehastoevaluated
at both c(1,b ) and c(0,b ) for each i = 1,...,n. In contrast, the inverse probability of treatment
i i
weightedestimator(6)involvesonlytheprobabilitiesoftreatmentsthatwereactuallyassigned.
Eq. (4) is doubly robust in a stronger, semi-parametric, sense than (3); it does not require correct
specificationoftheoutcomemodel,ifthetreatmentassignmentmodeliscorrectlyspecified.Theapprox-
imateBayesiandoublerobustapproachproposedbyGrahametal.(2015)involvedreplacingm(z,x;φ)
preprint,versionofOctober29th,2015 Author’sOriginalVersion
ABayesianviewofdoublyrobustcausalinference, 5
in(2) withalinear predictoraugmentedwith theclevercovariate. Wetake thistobe aspecialcase of
thetwo-stepBayesianmethodstobediscussedinSection4,andthusdonotseparatelyconsideritinthe
presentpaper. However,inSection6.2wewillshowhowtheform(4)maybederivedthroughposterior
predictiveexpectationsandimportancesampling.
3. Inverseprobabilityoftreatmentweightedoutcomeregression
Yetanotherestimatorforthemarginalcausalcontrastis
n
1 (cid:88)(cid:110) (cid:111)
n E(Y1i|si;φ(cid:98))−E(Y0i|si;φ(cid:98)) , (7)
i=1
wheretheparametersφinthemodelforthepotentialoutcomes(Y ,Y )areestimatedusinganinverse
1i 0i
probabilityoftreatmentweightedestimatingfunctionl(φ)=(cid:80)n l (φ),where
i=1 i
1
l (φ)=(cid:88)1 logp(yai|si;φ).
i {zi=a} pr(Z =a|b )
i i
a=0
Thecorrespondingestimatingequationisu(φ)=(cid:80)n u (φ)=0,wherethepseudo-scorefunctionis
i=1 i
1 ∂ logp(y |s ;φ)
u (φ)=(cid:88)1 ∂φ ai i . (8)
i {zi=a} pr(Z =a|b )
i i
a=0
Herethetreatmentassignmentprobabilitiespr(Z =a|b )wouldinpracticebereplacedwithestimates
i i
pr(Z = a|b ;γ). To motivate the use of (8) for the parameter estimation, we present the following
i i (cid:98)
theorem(aproofgiveninSupplementaryAppendix2).
THEOREM2. Theestimatingequationu(φ)=0isunbiasedifthemodelforthepotentialoutcomes
iscorrectlyspecifiedinthesensethatp(y |s ;φ )=f(y |s ),andeither(i)or(ii)holdstrue.
ai i 0 ai i
If we further assume the consistency of the estimator for φ, as well as consistency of γ when the
(cid:98)
weightsarecorrectlyspecified,italsofollowsthat(7)consistentlyestimatesthemarginalcausalcontrast.
In Section 6.1 we demonstrate that an estimator of the form (7) can also be motivated from Bayesian
arguments.
4. TWO-STEP ESTIMATION WHEN THE PROPENSITY SCORE IS UNKNOWN
In observational settings the function e(B ) is unknown and has to be estimated. When using an es-
i
timator of the form (3), a central question from a Bayesian perspective then is how the uncertainty in
the estimation of the parameters γ is incorporated in the inference of the marginal causal contrast. A
‘Bayesian’approachcouldbemotivatedbywritingtheposteriorpredictiveexpectationas
(cid:90) (cid:90)
m(a,e(b ;γ),s ;φ)p(x ;ψ)π (φ|γ)π (γ)π (ψ)dx dφdγdψ, (9)
i i i n n n i
ψ,γ,φ xi
where
n n
(cid:89) (cid:89)
π (φ|γ)∝ p{y |z ,e(b ;γ),s ;φ}π (φ) and π (γ)∝ p(z |b ;γ)π (γ).
n i i i i 0 n i i 0
i=1 i=1
The integrals of the form (9) could be evaluated by Monte Carlo integration, by forward sampling
first from π (γ) and given the current value γ, from the conditional posterior π (φ|γ). However, a
n n
concernrelatedtosuchanapproachisthattheproductoftheposteriordistributionsπ (φ|γ)andπ (γ)
n n
in (9) does not necessarily correspond to any well defined joint posterior, except in the case when the
outcomeiscorrectlyspecifiedinthesense(i),inwhichcaseitdoesnotdependonthepropensityscore.
Incontrast,forthecorrectlyspecifiedmodelsin(1)weindeedhavethatp(φ|v)p(φ|x,z)=p(φ,γ|v).
Asaresult,theaboveoutlinedtwo-stepapproachisnotproperBayesian,andwouldhavetobeevaluated
onitsfrequency-basedproperties.
To give an example of a situation where the two-step Bayesian approach does not result in correct
frequency-basedinferences,wecanconsiderthespecialcaseoftheoutcomemodelm{z,e(b;γ),s;φ}=
Author’sOriginalVersion preprint,versionofOctober29th,2015
6 O.SAARELA,L.R.BELZILED.A.STEPHENS
φ +φ z +φ(cid:62)s+φ(cid:62)g{e(b;γ)}, where g is, for example, some appropriate spline basis transforma-
0 1 2 3
tion of the propensity score. With this model the estimator based on (9) for the average causal con-
(cid:82)
trastE(Y )−E(Y )reducesto φ π (φ|γ)π (γ)dφdγ,thatis,toanestimatoroftheposterior
1i 0i φ,γ 1 n n
meanEΓ|X,Z{E(Φ1|v;γ)}.Thisestimatorinturncanbeapproximatedwith(cid:80)mj=1φ(cid:98)1(γ(j))/m,where
(γ(1),...,γ(m))isaMonteCarlosamplefromπ (γ)(cf.KaplanandChen,2012,p. 589). Wenotefirst
n
thatthisestimatorhasthesameasymptoticdistributionasφ(cid:98)1(γ(cid:98)),wherethetreatmentassignmentmodel
parametershavebeenfixedtotheirmaximumlikelihoodestimates(seeSupplementaryAppendix3). In
Supplementary Appendix 3 we further show that avar{φ(cid:98)1(γ(cid:98))} ≤ avar{φ(cid:98)1(γ0)}, where φ(cid:98)1(γ0) is the
estimator given the ‘true’ propensity scores. Thus, with the propensity score adjusted outcome model
specification, a variance adjustment due to estimating the propensity scores should reduce the asymp-
toticvarianceoftheresultingtreatmenteffectestimatorcomparedtoahypotheticalsituationwherethe
truepropensityscoresareknown(cf.HenmiandEguchi,2004). Incontrast,KaplanandChen(2012,p.
592)andGrahametal.(2015,p. 11)proposevarianceestimatorsbasedonthevariancedecomposition
formula
var(Φ |v)=E {var(Φ |v;γ)}+var {E(Φ |v;γ)}, (10)
1 Γ|X,Z 1 Γ|X,Z 1
whichappearstoaddafurthervariancecomponent. Anexplanationforthediscrepancyisthatwiththe
correctly specified models in the representation (1), we have that p(φ |v;γ) = p(φ |v), and the the
1 1
secondvariancecomponentbecomeszero.Ontheotherhand,ifthemodelsaremisspecified,theproduct
formlikelihoodintherepresentation(1)doesnotapplyinthefirstplace. Thisillustratesthedifficulty
in applying Bayesian procedures in the context of incompatible models. Even though this is routinely
doneinthecontextofmultipleimputation(e.g.Rubin,1996),andoftenproducesreasonableresults,in
thepresentcontextthereislittlemotivationtouseanapproachwhichintroducesanadditionalvariance
component to the posterior variance, given that estimation of the propensity scores should reduce the
varianceofthetreatmenteffectestimator. Wefurtherdiscussthisdiscrepancyinthefollowingsection.
5. JOINT ESTIMATION OF OUTCOME AND TREATMENT ASSIGNMENT MODELS
Even though the outcome Y can obviously be predictive of the individual treatment assignment Z ,
i i
the outcomes are not informative of the treatment assignment mechanism (a proof in Supplementary
Appendix2):
THEOREM3. If the outcome model is correctly specified, the outcomes are non-informative of the
parameterscharacterizingthetreatmentassignmentprocess.
Insuchacasethetreatmentassignmentmodelplaysnopartintheinferences,sincethecorresponding
posterior predictive estimator is (2). However, the Bayesian propensity score approach proposed by
McCandless et al. (2009a) specifies a parametrization making the outcome and treatment assignment
models dependent and estimates the parameters jointly. More recently, Zigler et al. (2013) suggested
thatasimilarapproachcouldbeusedtoobtainaBayesiananaloguetodoublyrobustinferences. Such
an approach can be understood by assuming that there exists a de Finetti parametrization (φ∗,γ∗) for
which
(cid:90) (cid:40) n (cid:41)
(cid:89)
p(v)= p(y ,z |x ;φ∗,γ∗)p(x ;ψ) π (φ∗)π (γ∗)π (ψ)dφ∗dγ∗dψ,
i i i i 0 0 0
φ∗,γ∗,ψ i=1
wherep(y ,z |x ;φ∗,γ∗) = p{y |z ,e(b ;γ∗),s ;φ∗}p(z |b ;γ∗). Comparedto(φ,γ)in(1),neither
i i i i i i i i i
φ∗orγ∗retainstheoriginalinterpretation,butnowthereisawelldefinedjointposteriordistribution
n
(cid:89)
π (φ∗,γ∗)∝ [p{y |z ,e(b ;γ∗),s ;φ∗}p(z |b ;γ∗)]π (φ∗)π (γ∗).
n i i i i i i 0 0
i=1
Inferencescouldnowbebasedontheposteriorpredictiveexpectations
(cid:90) (cid:90)
m{a,e(b ;γ∗),s ;φ∗}p(x ;ψ)π (φ∗,γ∗)π (ψ)dx dφ∗dγ∗dψ. (11)
i i i n n i
φ∗,γ∗,ψ xi
Atfirstsight,(11)wouldseemmorenaturalthan(9),sincethespecification(11)doesnotmakeuseof
incompatiblemodels. However,nowthequantitiese(b ;γ∗)donotpossessthebalancingpropertiesof
i
preprint,versionofOctober29th,2015 Author’sOriginalVersion
ABayesianviewofdoublyrobustcausalinference, 7
propensityscores,andthusitwouldbedifficulttoshowwhether(11)wouldhavethe‘doublerobustness’
propertyoftheestimator(3).
Toaddressthelackofbalance,McCandlessetal.(2010)suggestedaGibbssamplertypeapproach
similartothatofLunnetal.(2009)tocutthefeedbackfromtheoutcomemodel,bysuccessivelydraw-
ing from the conditional posteriors π (γ) and π (φ|γ) to approximate the joint posterior of (φ∗,γ∗).
n n
However, as discussed in the previous Section, these posteriors are incompatible and generally such a
sampling procedure is not guaranteed to converge to any well defined joint distribution. In fact, if the
conditionalposteriorscanbesampleddirectly,orifthesecondsamplingstepisallowedtoconvergeto
thecorrespondingconditionaldistribution,theinferencesbasedontheformulations(9)and(11)willbe
equivalent.
Tosumup,tryingtorecoverfullyprobabilisticinferencesthroughsamplingfromajointposteriorof
theoutcomeandtreatmentassignmentmodelparameterslosesthebalancingpropertyofthepropensity
scores,andconsequently,thepropertiesoftheresultingestimatorwouldbedifficulttoestablish. Onthe
other hand, cutting the feedback in an attempt to recover the balancing property would mean that the
inferences are no longer based on well-defined posterior distributions. Thus, in the following section,
followingtheapproachoutlinedinSaarelaetal.(2015b),weformulatealternativeBayesianestimators
thatarenotbasedonBayesianpropensityscoreadjustment.
6. POSTERIOR PREDICTIVE INFERENCES WITH IMPORTANCE SAMPLING
6. Inverseprobabilityoftreatmentweightedoutcomeregression
It has been recognized by various authors (e.g. Røysland, 2011; Chakraborty and Moodie, 2013) that
inverseprobabilityoftreatmentweightingcanbemotivatedthroughachangeofprobabilitymeasures,
or equivalently, importance sampling. However, as far as we know, before Saarela et al. (2015b) this
approachhasnotbeenusedtoformulateBayesiancausalinferences. Herewearguethatthisapproach
canbeusedtoresolvetheparadoxesdiscussedinSections(4)and(5).Wefollowtheposteriorpredictive
reasoning of Supplementary Appendix 1, but rather than trying to directly predict a new observation
undertheexperimentalsettingE,weconsiderfirstthetaskofparameterestimationunderthisregime.For
thispurpose,aBayesestimatorfortheoutcomemodelparameterscanbeconstructedbymaximizingthe
expectedutilityE {l(φ;V )|v}withrespecttoφ,wherethelog-likelihoodl(φ;v )≡logp(y |z ,s ;φ)
E i i i i i
takes the role of a parametric utility function, and the expectation is over a predicted new observation
v =(y ,z ,x ),i∈/ {1,...,n}.
i i i i
Let further ξ be a set of parameters characterizing the entire data-generating mechanism under the
observationalregimeO. Wecanfurtherwritetheexpectationas
E {l(φ;V )|v}=E [E {l(φ;V )|v;ξ}],
E i Ξ|V E i
where, following Walker (2010, p. 26–27), we can consider the lower dimensional decision of maxi-
mizingtheexpectedutilityE {l(φ;V )|v;ξ}withrespecttoφconditionalonξ. Withaknownregime
E i
E andthestabilityassumptiondiscussedinSupplementaryAppendix1,argmax E {l(φ;V )|v;ξ}is
φ E i
a deterministic function of ξ. Thus, the uncertainty represented by the posterior distribution p (ξ|v)
O
thenalsoreflectstheuncertaintyonφ,providingmeanstoconstructaposteriordistributionforφ. This
proceedsasfollows;theinnerexpectationcanbewrittenas
(cid:90)
E [l(φ;V )|v;ξ]= l(φ;v )p (v |v;ξ)dv
E i i E i i
vi
(cid:90) p (v |v;ξ)
= l(φ;v ) E i p (v |v;ξ)dv
i p (v |v;ξ) O i i
vi O i
(cid:90) p (y |z ,x ,v;ξ)p (z )p (x |v;ξ)
= l(φ;v ) E i i i E i E i p (v |v;ξ)dv
i p (y |z ,x ,v;ξ)p (z |x ,v;ξ)p (x |v;ξ) O i i
vi O i i i O i i O i
(cid:90) p (z )
= l(φ;v ) E i p (v |v;ξ)dv ,
i p (z |x ,v;ξ) O i i
vi O i i
whereinthelastequalitywemadeuseofthestabilityassumptionofSupplementaryAppendix1. Inthe
last form we can replace the predictive distribution under O with the Bayesian bootstrap specification
p (v |v;ξ) = (cid:80)n ξ δ (v ), where ξ ≡ (ξ ,...,ξ ) and Ξ|v ∼ Dirichlet(1,...,1), as in the
O i k=1 k vk i 1 n
weightedlikelihoodbootstrapofNewtonandRaftery,1994. Denotingw (ξ)≡p (z )/p (z |x ,v;ξ),
i E i O i i
Author’sOriginalVersion preprint,versionofOctober29th,2015
8 O.SAARELA,L.R.BELZILED.A.STEPHENS
theexpectedutilitynowbecomes
(cid:90) n n
(cid:88) (cid:88)
E [l(φ;V )|v;ξ]= l(φ;v )w (ξ) ξ δ (v )dv = w (ξ)ξ l(φ;v ), (12)
E i i i k vk i i k k k
vi k=1 k=1
thatis,aweightedlog-likelihood,motivatingtheestimator
n
(cid:88)
φ(cid:98)(ξ)≡argmax wk(ξ)ξkl(φ;vk).
φ
k=1
An approximate posterior distribution for φ under E can now be constructed by repeatedly sampling
the weight vectors from Ξ|v ∼ Dirichlet(1,...,1), and recalculating φ(cid:98)(ξ) at each realization. This
approachofcreatingamappingbetweenthenon-parametricspecificationandaparametrizationrelevant
toinferencesisanalogoustoChamberlainandImbens(2003),butaddstheimportancesamplingweights
totheDirichletweightsinordertomakeinferencesacrosstheobservationalandexperimentalregimes.
The weights w (ξ) function as importance sampling weights in Monte Carlo integration; they add
i
variabilitytotheestimation,butinthepresentcontextprovidesomeprotectiontowardsmisspecification
oftheoutcomemodel,inthesenseofSection3.3. Theabovedidnotyetaddresshowtocalculatethese
weights;inprinciplethesearefullydeterminedbythecurrentrealizationofξ underthenon-parametric
specification, but in practice parametric model specifications are needed for smoothing purposes, and
weneedawaytolinktheξ andthetreatmentassignmentmodelparametersγ. Forthispurposeγ itself
can be estimated through the weighted likelihood bootstrap since this readily gives the deterministic
function linking the two parametrizations; thus in (12) we choose w (ξ) = p (z )/p {z |b ;γ(ξ)},
i E i O i i (cid:98)
where γ(ξ) ≡ argmax (cid:80)n ξ logp(z |b ;γ). The probabilities p (z ) are given by the chosen
(cid:98) γ k=1 k k k E i
regimeE thatistheobjectofinference;inpracticetheestimationismostefficientwhenwechoosethe
targetregimetobeascloseaspossibletotheobservedregimeO;thiscanbeachievedbyfixingp (z )
E i
to the marginal treatment assignment probabilities under O, which would result in the usual kind of
stabilizedinverseprobabilityoftreatmentweightsusedinmarginalstructuralmodelling(Robinsetal.,
2000;Hernánetal.,2001;ColeandHernán,2008).
Themarginalcausalcontrastmaynowbeestimatedthroughtheexpectations
(cid:90) (cid:90) n
(cid:88)
EΞ|V{EE(Yi|Zi =a,v;ξ)}= m{a,si;φ(cid:98)(ξ)} ξkδsk(si)pO(ξ|v)dsidξ
ξ si k=1
(cid:90) n
(cid:88)
= ξkm{a,sk;φ(cid:98)(ξ)}pO(ξ|v)dξ, (13)
ξk=1
whereweusedthenon-parametricspecificationp(s |v;ξ)=(cid:80)n ξ δ (s ),andwhereagainp (ξ|v)
i k=1 k sk i O
is replaced with the uniform Dirichlet distribution. Since all uncertainty is contained in the parameter
vectorξ,aposteriordistributionforthepredictivemeanormeandifferencecanbeconstructedasbefore
throughresampling. Thepointestimatorgivenby(13)isthedirectBayesiananalogueof(7),wherethe
outcomemodelwasestimatedusinginverseprobabilityoftreatmentweightedregression. Infact,ifwe
fix ξ = 1/n, k = 1,...,n, instead of considering these as unknown parameters, the two estimators
k
are equivalent. Thus, we conjecture that the estimator given by (13) has a similar ‘double robustness’
propertyas(7). WedemonstratethisthroughsimulationsinSection7,butbeforethat,weshowhowthe
semi-parametricdoublyrobustestimator(4)canbemotivatedasaposteriorpredictiveexpectation.
6. Doublyrobustestimation
In Supplementary Appendix 4 we show that under the non-parametric specification in terms of ξ, the
posteriorpredictivecausalcontrastmaybewrittenas
E (Y |Z =1,v;ξ)−E (Y |Z =0,v;ξ)
E i i E i i
n (cid:26) (cid:27)
=(cid:88)ξ {y −m(z ,x ;ξ)} zk − 1−zk
k i k k pr (Z =1|x ,v;ξ) pr (Z =0|x ,v;ξ)
k=1 O k k O k k
n
(cid:88)
+ ξ {m(1,x ;ξ)−m(0,x ;ξ)}, (14)
k k k
k=1
preprint,versionofOctober29th,2015 Author’sOriginalVersion
ABayesianviewofdoublyrobustcausalinference, 9
whichcorrespondstoformulation(4). Sincethenon-parametricspecificationplacesnorestrictionson
theconditionaldistributions, inpractice, toobtainanestimator, thenon-parametricallyspecifiedquan-
tities m(z ,x ;ξ) and pr (Z = a|x ,v;ξ) would have to be replaced with the parametric versions
k k O k k
m{zk,sk;φ(cid:98)(ξ)} and prO{Zk = a|bk;γ(cid:98)(ξ)}, estimated using the weighted likelihood bootstrap, as in
the previous section. It is straightforward to see that if the outcome model is correctly specified, the
expression(14)reducestothemodel-basedestimator(thesecondadditiveterm). Thisagainreflectsthe
factthatifwebelieveinouroutcomemodel,thetreatmentassignmentmodeldoesnotplayapartinthe
inferences. However,aBayesianmightwanttouseanestimatoroftheform(14)ifbeingrestrictedby
twoparametricconstraints,intermsofφandγ,butnotknowingwhichoneoftheseiscorrect. Ifeither
oneoftheparametricspecificationsiscorrect,(14)willstillreducetotheposteriorpredictivemeandif-
ference,thenaturalBayesianestimator. Wesummarizethispropertyinthefollowingtheorem(aproof
inSupplementaryAppendix2).
THEOREM4. Theestimatorobtainedbysubstitutingtheparametricspecificationsm{zk,sk;φ(cid:98)(ξ)}
andpr {Z =a|b ;γ(ξ)}intoexpression(14)isequivalenttotheposteriorpredictivemeandifference
O k k (cid:98)
ifeitheroneofthesemodelsiscorrectlyspecified.
Aposteriordistributionforthemeandifferencecanbegeneratedasintheprevioussectionthrough
resamplingoftheparametervectorsξandrecalculating(14)ateachrealization;wewillillustratethisin
thefollowingsection.
7. SIMULATION STUDY
Above we have made a distinction between model misspecification due to omission of relevant co-
variates, and misspecification of the parametric functional relationship between the outcome and the
covariates, and noted that all the estimators discussed in Section 3 should be ‘doubly robust’ against
theformertypeofmisspecification. However, inpracticetheconsequencesofthesetwo types ofmis-
specification will often be similar; they result in residual confounding. Therefore, herein we investi-
gate how the different estimators perform when the covariate sets S and B are not only created by
i i
a partitioning, but also a transformation of the x ’s. For this purpose, we simulated X ∼ N(0,1),
i ij
j = 1,...,4, and transformed these as c = |x |/(1−2/π)1/2. The true treatment assignment
ij ij
andoutcomemechanismswerespecifiedasZ |x ∼ Bernoulli(expit{0.4c +0.4x +0.8x })and
i i i1 i2 i3
Y |z ,x ∼ N(z − c − x − x ,1), respectively. For estimation, we considered two scenarios:
i i i i i1 i2 i4
(I) s ≡ (x ,x ,x ) and b ≡ (c ,x ,x ) (misspecified outcome model and correctly specified
i i1 i2 i3 i i1 i2 i4
treatment assignment model), and (II) s ≡ (c ,x ,x ) and b ≡ (x ,x ,x ) (correctly specified
i i1 i2 i3 i i1 i2 i4
outcomemodelandmisspecifiedtreatmentassignmentmodel).
WeareinterestedinthemarginalcausalcontrastE(Y )−E(Y )=1.Toestimatethis,weapplied
i1 i0
the Bayesian estimators discussed in Sections 4, 5, and 6. In the two-step estimation we attempted
bothforwardsamplingfromtheposteriordistributions,andthevariancedecompositionformula(10). In
theformer, insteadofMarkovchainMonteCarlo, weappliednormalapproximationsfortheposterior
distributions,oftheformΦ|v;γ ∼N{φ(cid:98)(γ),S},whereφ(cid:98)(γ)isthemaximumlikelihoodestimateandS
itsestimatedvariance-covariancematrix. TheposteriordistributionΓ|x,z wasapproximatedusingthe
weightedlikelihoodbootstrap.Inthejointestimation,weagainusedanormalapproximation,centeredat
thejointmaximumlikelihoodestimates(φ(cid:98),γ(cid:98)),andthevariance-covariancematrixgivenbytheinverse
oftheobservedinformationatthemaximumlikelihoodpoint. Inbothtwo-stepandjointestimation,the
fittedmodelswerespecifiedasm{z ,e(b ;γ),s ;φ}=φ +φ z +φ(cid:62)s +φ(cid:62)g{e(b ;γ)},wheregis
i i i 0 1 i 2 i 3 i
acubicpolynomialbasis, ande(b ;γ) = expit(γ +γ(cid:62)b ). Intheimportancesampling(IS)estimator
i 0 1 i
proposed in Section 6.1, and in the importance sampling/doubly robust estimator (IS/DR) of Section
6.2, the fitted treatment assignment model was the same, with the outcome model specified through
m(z ,s ;φ)=φ +φ z +φ(cid:62)s .
i i 0 1 i 2 i
ForcomparisontotheBayesianestimators,wealsocalculatenaiveunadjustedcomparison(naive),
outcome regression adjusted for covariates s (adjusted), the standard inverse probability of treatment
i
weighted estimator (6) (IPTW), the semi-parametric doubly robust estimator (4) (DR), the clever co-
variate version of this (CC), the inverse probability of treatment weighted outcome regression based
estimator(7)(OR/IPTW),aswellaspropensityscoreadjustedoutcomeregressionbasedestimator(3)
(OR/PS). For IPTW, DR, CC, and OR/IPTW, the standard errors were estimated through the standard
frequentistnonparametricbootstrap(Efron, 1979). ForOR/PS,todemonstrate thevarianceestimation
Author’sOriginalVersion preprint,versionofOctober29th,2015
10 O.SAARELA,L.R.BELZILED.A.STEPHENS
issues discussed in Section 4, we calculated both observed information based standard errors, and the
adjustedsandwichtypestandarderrorsdiscussedinSupplementaryAppendix3.
The results over 1000 replications are shown in Table 1. Under scenario (I), all estimators except
naiveandadjustedcancorrectforconfounding,althoughthejointestimationapproachproducesaslight
bias.Intermsofefficiency,theestimatorsbasedonpropensityscoreadjustedoutcomeregressionarethe
best,withtheinverseprobabilityoftreatmentweightingbasedestimatorslosingslightly. Asdiscussed
inSection3.2, theclevercovariateapproachresultsinhighervariabilitycomparedtotheotherdoubly
adjustedestimators.Intermsofvarianceestimation,thecomparisonbetweentheunadjustedandadjusted
standarderrorsforOR/PSsuggeststhatunderthissimulationsettingestimationofthepropensityscores
substantially reduces the variance, and not adjusting for this results in overcoverage. The resampling
based variance estimators adjust for this automatically. However, the two-step approach to variance
estimationperformspoorly;asdemonstratedinSupplementaryAppendix3,thetwo-steppointestimator
has the same asymptotic variance as the other OR/PS estimators, but the two-step variance estimators
unnecessarilyaddafurthervariancecomponent. Thus,thesimulationresultssupportthediscussionin
Sections 4 and 5; the two-step and joint estimation approaches are difficult to justify from Bayesian
arguments,anddonotseemtoprovidepracticaladvantagesintermsoftheirfrequency-basedproperties.
Ontheotherhand, theimportancesamplingbasedBayesianestimatorsproduce verysimilarresultsto
theOR/IPTWandDRestimators,respectively.
Underscenario(II),alltheestimatorsexceptIPTWareunbiased, whichisexpectedbasedontheir
previouslydiscussedtheoreticalproperties. Whentheoutcomemodeliscorrectlyspecified,thereisalso
verylittledifferenceintheefficiencyofthevariousestimators.
Table1: Estimatesforthemarginalcausalcontrast(truevalue=1)over1000simulationrounds
Point Relative MC Coverage
Scenario Estimator SD SE
estimate bias(%) error (%)
(I) Naive 0.347 −65.3 0.128 0.128 0.004 0.3
Adjusted 0.667 −33.3 0.091 0.092 0.003 3.6
IPTW 1.001 0.1 0.135 0.138 0.005 94.4
OR/PS(obs.information) 0.997 −0.3 0.071 0.095 0.002 99.3
OR/PS(adj.sandwich) 0.997 −0.3 0.071 0.073 0.002 95.3
DR 0.998 −0.2 0.087 0.088 0.003 94.2
Clevercovariate 1.026 2.6 0.110 0.110 0.003 93.0
OR/IPTW 0.990 −1.0 0.082 0.083 0.003 93.9
Two-step(forwardsampling) 0.999 −0.1 0.071 0.113 0.002 99.8
Two-step(variancedecomposition) 0.995 −0.5 0.071 0.112 0.002 99.8
Jointestimation 1.046 4.6 0.071 0.071 0.002 89.5
Importancesampling 0.991 −0.9 0.083 0.080 0.003 93.6
Importancesampling/DR 0.997 −0.3 0.087 0.086 0.003 93.9
(II) Naive 0.347 −65.3 0.128 0.128 0.004 0.3
Adjusted 0.997 −0.3 0.065 0.067 0.002 95.3
IPTW 0.629 −37.1 0.133 0.131 0.005 20.3
OR/PS(obs.information) 0.997 −0.3 0.071 0.071 0.002 95.4
OR/PS(adj.sandwich) 0.997 −0.3 0.071 0.071 0.002 95.4
DR 0.999 −0.1 0.074 0.074 0.003 95.9
Clevercovariate 0.999 −0.1 0.075 0.075 0.003 95.9
OR/IPTW 0.999 −0.1 0.074 0.073 0.003 95.7
Two-step(forwardsampling) 1.000 0.0 0.071 0.072 0.002 96.5
Two-step(variancedecomposition) 0.997 −0.3 0.071 0.071 0.002 95.5
Jointestimation 0.997 −0.3 0.071 0.071 0.002 95.4
Importancesampling 0.999 −0.1 0.074 0.072 0.003 95.2
Importancesampling/DR 0.999 −0.1 0.074 0.073 0.003 95.5
Thecolumnscorrespondtoestimator,meanpointestimate,relativebias,MonteCarlostandarddeviation(SD),mean
standarderrorestimate(SE),MonteCarlo(MC)error(batchmeans)ofthemeanpointestimate,and95%confidence
intervalcoverageprobability.Thetwoscenarioscorrespondto(I)misspecifiedoutcomemodelandcorrectlyspeci-
fiedtreatmentassignmentmodel,and(II)correctlyspecifiedoutcomemodelandmisspecifiedtreatmentassignment
model.
preprint,versionofOctober29th,2015 Author’sOriginalVersion