Sufficient Covariate, Propensity Variable and Doubly Robust Estimation

Hui Guo, Philip Dawid and Giovanni Berzuini

arXiv:1501.07761v1 [math.ST] 30 Jan 2015

Abstract Statistical causal inference from observational studies often requires adjustment for a possibly multi-dimensional variable, where dimension reduction is crucial. The propensity score, first introduced by Rosenbaum and Rubin, is a popular approach to such reduction. We address causal inference within Dawid's decision-theoretic framework, where it is essential to pay attention to sufficient covariates and their properties. We examine the role of a propensity variable in a normal linear model. We investigate both population-based and sample-based linear regressions, with adjustments for a multivariate covariate and for a propensity variable. In addition, we study the augmented inverse probability weighted estimator, involving a combination of a response model and a propensity model. In a linear regression with homoscedasticity, a propensity variable is proved to provide the same estimated causal effect as multivariate adjustment. An estimated propensity variable may, but need not, yield better precision than the true propensity variable. The augmented inverse probability weighted estimator is doubly robust and can improve precision if the propensity model is correctly specified.

Hui Guo
Centre for Biostatistics, Institute of Population Health, The University of Manchester, Jean McFarlane Building, Oxford Road, Manchester M13 9PL, UK, e-mail: [email protected]

Philip Dawid
Statistical Laboratory, University of Cambridge, Wilberforce Road, Cambridge CB3 0WB, UK, e-mail: [email protected]

Giovanni Berzuini
Department of Brain and Behavioural Sciences, University of Pavia, Pavia, Italy, e-mail: [email protected]

1 Introduction

Causal effects can be identified from well-designed experiments, such as randomised controlled trials (RCT), because treatment assignment is entirely unrelated to subjects' characteristics, both observed and unobserved. Suppose there are two treatment arms in an RCT: a treatment group and a control group. Then the average causal effect (ACE) can simply be estimated as the difference in outcome between the two groups in the observed data. However, randomised experiments, although ideal and to be conducted whenever possible, are not always feasible. For instance, to investigate whether smoking causes lung cancer, we cannot randomly force a group of subjects to smoke cigarettes. Moreover, it may take years or longer for this disease to develop. Instead, a retrospective case-control study may have to be considered. The task of drawing causal conclusions, however, becomes problematic since similarity of subjects from the two groups will rarely hold, e.g., lifestyles of smokers might be different from those of non-smokers. Thus, we are unable to “compare like with like”: this is the classic problem of confounding in observational studies, which may require adjusting for a suitable set of variables (such as age, sex, health status, diet).
Otherwise, the relationship between treatment and response will be distorted, leading to biased inferences. In general, linear regression, matching or subclassification is used for adjustment. If there are multiple confounders, then, especially for matching and subclassification, identifying two individuals with very similar values of all confounders simultaneously would be cumbersome or impossible. Thus, it would be sensible to replace all the confounders by a scalar variable. The propensity score [22] is a popular dimension reduction approach in a variety of research fields.

2 Framework

The aim of statistical causal inference is to understand and estimate a “causal effect”, and to identify scientific and in principle testable conditions under which the causal effect can be identified from observational studies. The philosophical nature of “causality” is reflected in the diversity of its statistical formalisations, as exemplified by three frameworks:

1. Rubin's potential response framework [24, 25, 26] (also known as Rubin's causal model), based on counterfactual theory;
2. Pearl's causal framework [16, 17], richly developed from graphical models;
3. Dawid's decision-theoretic framework [6, 7], based on decision theory and probabilistic conditional independence.

In Dawid's framework, causal relations are modelled entirely by conditional probability distributions. We adopt it throughout this chapter to address causal inference; the assumptions required are, at least in principle, testable.

Let $X$, $T$ and $Y$ denote, respectively, a (typically multivariate) confounder, treatment, and response (or outcome). For simplicity, $Y$ is a scalar and $X$ a multi-dimensional variable. We assume that $T$ is binary: 1 (treatment arm) and 0 (control arm). Within Dawid's framework, a non-stochastic regime indicator variable $F_T$, taking values $\emptyset$, 0 and 1, is introduced to denote the treatment assignment mechanism operating. This divides the world into three distinct regimes, as follows:

1. $F_T = \emptyset$: the observational (idle) regime. In this regime, the value of the treatment is passively observed and treatment assignment is determined by Nature.
2. $F_T = 1$: the interventional treatment regime, i.e., treatment $T$ is set to 1 by manipulation.
3. $F_T = 0$: the interventional control regime, i.e., treatment $T$ is set to 0 by manipulation.

For example, in an observational study of custodial sanctions, our interest is in the effect of custodial sanction, as compared to probation (noncustodial sanction), on the probability of re-offence. Then $F_T = \emptyset$ denotes the actual observational regime under which data were collected; $F_T = 1$ is the (hypothetical) interventional regime that always imposes imprisonment; and $F_T = 0$ is the (hypothetical) interventional regime that always imposes probation. Throughout, we assume full compliance and no dropouts, i.e., each individual actually takes whichever treatment they are assigned to. Then we have a joint distribution $P_f$ of all relevant variables in each regime $F_T = f$ ($f = 0, 1, \emptyset$).

In the decision-theoretic framework, causal assumptions are construed as assertions that certain marginal or conditional distributions are common to all regimes. Such assumptions can be formally expressed as properties of conditional independence, where this is extended to allow non-stochastic variables such as $F_T$ [4, 5, 7]. For example, the “ignorable treatment assignment” assumption in Rubin's causal model (RCM) [22] can be expressed as

$$Y \perp\!\!\!\perp F_T \mid T, \tag{1}$$

read as “$Y$ is independent of $F_T$ given $T$”. However, this condition will most likely be inappropriate in observational studies, where randomisation is absent.

A causal effect is defined as the difference in response induced by manipulating the treatment, which involves only the interventional regimes. In particular, the population-based average causal effect (ACE) of the treatment is defined as:

$$\mathrm{ACE} := E(Y \mid F_T = 1) - E(Y \mid F_T = 0), \tag{2}$$

or alternatively,$^1$

$$\mathrm{ACE} := E_1(Y) - E_0(Y). \tag{3}$$

Without further assumptions, by its definition ACE is not identifiable from the observational regime.

$^1$ For convenience, the values of the regime indicator $F_T$ are presented as subscripts.
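The three regimes and definitions (2) and (3) can be illustrated by a small exact calculation. The sketch below is only an illustration under stated assumptions: the binary covariate, the probabilities in `P_X` and `P_T1_GIVEN_X`, and the response mean in `mean_y` are hypothetical choices that do not come from this chapter, and the response mean is simply assumed to be shared by all regimes.

```python
from fractions import Fraction as F

IDLE = "idle"  # stands for the observational regime F_T = ∅

# Hypothetical ingredients, chosen only for illustration.
P_X = {0: F(1, 2), 1: F(1, 2)}           # distribution of a binary X, same in every regime
P_T1_GIVEN_X = {0: F(1, 5), 1: F(4, 5)}  # Nature's assignment, used in the idle regime only


def mean_y(x, t):
    """Hypothetical E(Y | X=x, T=t), assumed common to all regimes."""
    return 1 + 2 * t + 3 * x


def p_t_given_x(t, x, regime):
    """P_f(T=t | X=x): fixed by manipulation unless the regime is idle."""
    if regime == IDLE:
        p1 = P_T1_GIVEN_X[x]
        return p1 if t == 1 else 1 - p1
    return F(1) if t == regime else F(0)


def expected_y(regime):
    """E_f(Y), obtained by enumerating (X, T) in regime f."""
    return sum(P_X[x] * p_t_given_x(t, x, regime) * mean_y(x, t)
               for x in P_X for t in (0, 1))


# Definition (3): ACE = E_1(Y) - E_0(Y), using the interventional regimes only.
ace = expected_y(1) - expected_y(0)
print(ace)  # 2
```

In this toy setting only the two interventional regimes enter the calculation of ACE; the idle-regime assignment probabilities are never used.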
3 Identification of ACE

Suppose the joint distribution of $(F_T, T, Y)$ is known and satisfies (1). Is ACE identifiable from data collected in the observational regime? Note that (1) asserts that the distribution of $Y$ given $T = t$ is the same, whether $t$ is observed in the observational regime $F_T = \emptyset$, or set in the interventional regime $F_T = t$. As discussed, this assumption would typically not be satisfied in observational studies, and thus a direct comparison of the responses of the two treatment groups in observational data cannot be interpreted as the causal effect.

Definition 1. The “face-value average causal effect” (FACE) is defined as:

$$\mathrm{FACE} := E_\emptyset(Y \mid T = 1) - E_\emptyset(Y \mid T = 0). \tag{4}$$

It will rarely be true that FACE = ACE, as we would not expect the conditional distribution of $Y$ given $T = t$ to be the same in every regime. In fact, identification of ACE from observational studies requires, on the one hand, adjusting for confounders and, on the other hand, an interplay of distributional information between different regimes. One can make no further progress unless some properties are satisfied.

3.1 Strongly sufficient covariate

Rigorous conditions must be investigated so as to identify ACE.

Definition 2. $X$ is a covariate if:

Property 1. $X \perp\!\!\!\perp F_T$.

That is, the distribution of $X$ is the same in any regime, be it observational or interventional. In most cases, the components of $X$ are attributes determined prior to the treatment, for example, blood types and genes.

Definition 3. $X$ is a sufficient covariate for the effect of treatment $T$ on response $Y$ if, in addition to Property 1, we have

Property 2. $Y \perp\!\!\!\perp F_T \mid (X, T)$.

Property 2 requires that the distribution of $Y$, given $X$ and $T$, is the same in all regimes. It can also be described as “strongly ignorable treatment assignment, given $X$” [22]. We assume that readers are familiar with the concept and properties of directed acyclic graphs (DAGs). Then Properties 1 and 2 can be represented by means of a DAG as in Fig. 1. The dashed arrow from $X$ to $T$ indicates that $T$ is partially dependent on $X$, i.e., the distribution of $T$ depends on $X$ in the observational regime, but not in the interventional regime where $F_T = t$.
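The gap between FACE in (4) and ACE in (2) can be seen in a small simulation. The following is a minimal sketch under assumptions of our own choosing, not a model from this chapter: a scalar standard normal $X$, a logistic assignment mechanism in the observational regime, and a linear response with true ACE equal to 1. The function `draw` and all numerical constants are hypothetical.

```python
import math
import random


def draw(regime, rng):
    """Draw (x, t, y) in regime None (observational), 0 or 1 (interventional)."""
    x = rng.gauss(0.0, 1.0)                      # Property 1: same law of X in every regime
    if regime is None:
        p_treat = 1.0 / (1.0 + math.exp(-x))     # Nature's assignment depends on x
        t = 1 if rng.random() < p_treat else 0
    else:
        t = regime                               # treatment set by manipulation
    y = 1.0 * t + 2.0 * x + rng.gauss(0.0, 1.0)  # Property 2: same law of Y given (x, t)
    return x, t, y


def mean(values):
    return sum(values) / len(values)


rng = random.Random(0)
n = 20000

# FACE: contrast of the two treatment groups within the observational regime.
obs = [draw(None, rng) for _ in range(n)]
face = (mean([y for _, t, y in obs if t == 1])
        - mean([y for _, t, y in obs if t == 0]))

# ACE: contrast of the two interventional regimes (true value 1 by construction).
ace = (mean([draw(1, rng)[2] for _ in range(n)])
       - mean([draw(0, rng)[2] for _ in range(n)]))

print(f"FACE estimate: {face:.2f}")
print(f"ACE estimate:  {ace:.2f}")
```

Because treated subjects tend to have larger $X$ in this hypothetical mechanism, and $X$ also raises $Y$, the face-value contrast overstates the effect even though $X$ satisfies Properties 1 and 2.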
[Fig. 1: Sufficient covariate. DAG on the nodes $X$, $F_T$, $T$ and $Y$.]

Definition 4. $X$ is a strongly sufficient covariate if, in addition to Properties 1 and 2, we have

Property 3. $P_\emptyset(T = t \mid X) > 0$ with probability 1, for $t = 0, 1$.

Property 3 requires that, for any $X = x$, both treatment and control groups are observed in the observational regime.

Lemma 1. Suppose $X$ is a strongly sufficient covariate. Then, considered as joint distributions for $(Y, X, T)$, $P_t$ is absolutely continuous with respect to $P_\emptyset$ (denoted by $P_t \ll P_\emptyset$), for $t = 0$ and $t = 1$. That is, for every event $A$ determined by $(X, T, Y)$,

$$P_\emptyset(A) = 0 \implies P_t(A) = 0. \tag{5}$$

Equivalently, if an event $A$ occurs with probability 1 under the measure $P_\emptyset$, then it occurs with probability 1 under the measure $P_t$ ($t = 0, 1$).

Proof. Property 2, expressed equivalently as $(Y, X, T) \perp\!\!\!\perp F_T \mid (X, T)$, asserts that there exists a function $w(X, T)$ such that

$$P_f(A \mid X, T) = w(X, T)$$

almost surely (a.s.) in each regime $f = 0, 1, \emptyset$. Let $P_\emptyset(A) = 0$. Then a.s. $[P_\emptyset]$,

$$0 = P_\emptyset(A \mid X) = w(X, 1)\, P_\emptyset(T = 1 \mid X) + w(X, 0)\, P_\emptyset(T = 0 \mid X).$$

By Property 3, for $t = 0, 1$,

$$w(X, t) = 0 \tag{6}$$

a.s. $[P_\emptyset]$. As $w(X, t)$ is a function of $X$, it follows that (6) holds a.s. $[P_t]$ by Property 1. Consequently,

$$w(X, T) = 0 \quad \text{a.s. } [P_t], \tag{7}$$

since a.s. $[P_t]$, $T = t$ and $w(X, T) = w(X, t)$ for any bounded function $w$. Then by (7),

$$P_t(A) = E_t\{P_t(A \mid X, T)\} = E_t\{w(X, T)\} = 0.$$

Lemma 2. For any integrable $Z \preceq (Y, X, T)$,$^2$ and any versions of the conditional expectations,

$$E_t(Z \mid X) = E_t(Z \mid X, T) \quad \text{a.s. } [P_t]. \tag{8}$$

$^2$ The $\preceq$ symbol is interpreted as “a function of”.

Proof. Let $j(X, T)$ be an arbitrary but fixed version of $E_t(Z \mid X, T)$. Then $j(X, T) = j(X, t)$ a.s. $[P_t]$, and $j(X, t)$ serves as a version of $E_t(Z \mid X, T)$ under $[P_t]$. So

$$E_t(Z \mid X) = E_t\{j(X, T) \mid X\} = E_t\{j(X, t) \mid X\} = j(X, t) \quad \text{a.s. } [P_t].$$

Thus $j(X, t)$ is a version of $E_t(Z \mid X)$ under $[P_t]$ and (8) follows.

Since $E_t(Z \mid X)$ is a function of $X$, then by Property 1, $j(X, t)$ is a version of $E_t(Z \mid X)$ in any regime. Let $g(X, T)$ be some arbitrary but fixed version of $E_\emptyset(Z \mid X, T)$.

Theorem 1. Suppose that $X$ is a strongly sufficient covariate.
Then for any integrable $Z \preceq (Y, X, T)$, and with notation as above,

$$j(X, t) = g(X, t) \tag{9}$$

almost surely in any regime.

Proof. By Property 2, there exists a function $h(X, T)$ which is a common version of $E_f(Z \mid X, T)$ under $[P_f]$ for $f = 0, 1, \emptyset$. Then $h(X, T)$ serves as a version of $E_\emptyset(Z \mid X, T)$ under $[P_\emptyset]$, and a version of $E_t(Z \mid X, T)$ under $[P_t]$. As $j(X, T)$ is a version of $E_t(Z \mid X, T)$,

$$j(X, T) = h(X, T) \quad \text{a.s. } [P_t],$$

and consequently

$$j(X, t) = h(X, t) \quad \text{a.s. } [P_t].$$

Since $j(X, t)$ and $h(X, t)$ are functions of $X$, by Property 1

$$j(X, t) = h(X, t) \quad \text{a.s. } [P_f] \tag{10}$$

for $f = 0, 1, \emptyset$. We also have that $g(X, T) = h(X, T)$ a.s. $[P_\emptyset]$, and so, by Lemma 1, a.s. $[P_t]$. Then $g(X, t) = h(X, t)$ a.s. $[P_t]$, where $g(X, t)$ and $h(X, t)$ are both functions of $X$. By Property 1,

$$g(X, t) = h(X, t) \quad \text{a.s. } [P_f] \tag{11}$$

for $f = 0, 1, \emptyset$. Thus (9) holds by (10) and (11).

3.2 Specific causal effect

Let $X$ be a covariate.

Definition 5. The specific causal effect of $T$ on $Y$, relative to $X$, is

$$\mathrm{SCE} := E_1(Y \mid X) - E_0(Y \mid X).$$

We write $\mathrm{SCE}_X$ to express SCE as a function of $X$, and $\mathrm{SCE}(x)$ to indicate that $X$ takes the specific value $x$. Because it is defined in the interventional regimes, SCE has a direct causal interpretation, i.e., $\mathrm{SCE}(x)$ is the average causal effect in the subpopulation with $X = x$.

Although we do not assume the existence of potential responses, when this assumption is made we might proceed as follows. Take $X$ to be the pair $\mathbf{Y} = (Y(1), Y(0))$ of potential responses, which is assumed to satisfy Property 1. Then $E_t(Y \mid X) = Y(t)$, and consequently

$$\mathrm{SCE}_{\mathbf{Y}} = Y(1) - Y(0),$$

which is the definition of “individual causal effect”, ICE, in Rubin's causal model. Thus, although the formalisations of causality are different, SCE in Dawid's decision-theoretic framework can be regarded as a generalisation of ICE in Rubin's causal model.
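Theorem 1 and Definition 5 together suggest that, for a strongly sufficient covariate, $\mathrm{SCE}(x)$ can be estimated by contrasting the two treatment groups within each stratum of $X$ in observational data. The sketch below illustrates this with a binary $X$. It is an illustration under our own assumptions: the assignment probabilities in `P_T1_GIVEN_X`, the response mean `mean_y` (which includes a hypothetical treatment-covariate interaction) and the sample size are not taken from this chapter.

```python
import random

# Hypothetical observational assignment; both arms have positive probability
# for each x, in the spirit of Property 3.
P_T1_GIVEN_X = {0: 0.2, 1: 0.8}


def mean_y(x, t):
    """Hypothetical E(Y | X=x, T=t), assumed common to all regimes (Property 2)."""
    return 1.0 + 2.0 * t + 3.0 * x + 1.0 * t * x


def draw_observational(rng):
    """Draw (x, t, y) in the observational regime F_T = ∅."""
    x = 1 if rng.random() < 0.5 else 0
    t = 1 if rng.random() < P_T1_GIVEN_X[x] else 0
    y = mean_y(x, t) + rng.gauss(0.0, 1.0)
    return x, t, y


def g_hat(data, x, t):
    """Stratum mean: an estimate of g(x, t) = E_∅(Y | X=x, T=t)."""
    ys = [y for xi, ti, y in data if xi == x and ti == t]
    return sum(ys) / len(ys)


rng = random.Random(1)
data = [draw_observational(rng) for _ in range(40000)]

for x in (0, 1):
    sce_hat = g_hat(data, x, 1) - g_hat(data, x, 0)   # observational estimate
    sce_true = mean_y(x, 1) - mean_y(x, 0)            # j(x, 1) - j(x, 0) by construction
    print(f"x={x}: estimated SCE {sce_hat:.2f}, interventional SCE {sce_true:.2f}")
```

In this toy model the stratum contrasts should be close to the interventional values 2 (for $x = 0$) and 3 (for $x = 1$), which is the content of equation (9) with $Z = Y$.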
We can easily prove that, for any covariate $X$, $\mathrm{ACE} = E(\mathrm{SCE}_X)$, where the expectation may be taken in any regime. Indeed, by Property 1,

$$E_\emptyset\{E_t(Y \mid X)\} = E_t\{E_t(Y \mid X)\} = E_t(Y),$$

for $t = 0, 1$, and the same holds with $E_\emptyset$ replaced by $E_f$ for any regime $f$. Thus, by subtraction, $\mathrm{ACE} = E_f(\mathrm{SCE}_X)$ for any regime $f = 0, 1, \emptyset$, and therefore the subscript $f$ can be dropped. Hence, ACE is identifiable from observational data so long as $\mathrm{SCE}_X$ is identifiable from observational data. If $X$ is a strongly sufficient covariate, by Theorem 1, $E_t(Y \mid X)$ is identifiable from the observational regime. It follows that $\mathrm{SCE}_X$ can be estimated from data collected purely in the observational regime. Then ACE, expressed as

$$\mathrm{ACE} = E_\emptyset(\mathrm{SCE}_X), \tag{12}$$

is identifiable from the observational joint distribution of $(X, T, Y)$. Formula (12) is Pearl's “back-door formula” [17] because, by the property of modularity, $P(X)$ is the same with or without intervention on $T$ and thus can be taken as the distribution of $X$ in the observational regime.

3.3 Dimension reduction of strongly sufficient covariate

Suppose $X$ is a multi-dimensional strongly sufficient covariate. The adjustment process might be simplified if we could replace $X$ by some reduced variable $V \preceq X$, with fewer dimensions, so long as $V$ is itself a strongly sufficient covariate. Now since $V$ is a function of $X$, Properties 1 and 3 will automatically hold for $V$. We thus only need to ensure that $V$ satisfies Property 2: that is,

$$Y \perp\!\!\!\perp F_T \mid (V, T). \tag{13}$$

Since two arrows initiate from $X$ in Fig. 1, possible reductions may be naturally considered on the pathways from $X$ to $T$, and from $X$ to $Y$. Indeed, the following theorem gives two alternative sufficient conditions for (13) to hold. However, (13) can still hold without these conditions.

Theorem 2. Suppose $X$ is a strongly sufficient covariate and $V \preceq X$. Then $V$ is a strongly sufficient covariate if either of the following conditions is satisfied:

(a). Response-sufficient reduction:

$$Y \perp\!\!\!\perp X \mid (V, F_T = t), \tag{14}$$

or

$$Y \perp\!\!\!\perp X \mid (V, T, F_T = \emptyset), \tag{15}$$

for $t = 0, 1$. Condition (14) says that, in each interventional regime, $X$ contributes nothing towards predicting $Y$ once we know $V$. In other words, as long as $V$ is observed, $X$ need not be observed to make inference on $Y$.
Condition (15), in turn, implies that, in the observational regime, knowing $X$ is of no value for predicting $Y$ if $V$ and $T$ are known.

(b). Treatment-sufficient reduction:

$$T \perp\!\!\!\perp X \mid (V, F_T = \emptyset). \tag{16}$$

That is, in the observational regime, treatment does not depend on $X$ once we condition on $V$.

Proofs of the above reductions were provided in [9]. An alternative proof of (b) can be implemented graphically [9], which results in a DAG as in Fig. 2,$^3$ from which (16) and (13) can be directly read.

[Fig. 2: Treatment sufficient reduction. DAG on the nodes $X$, $V$, $F_T$, $T$ and $Y$.]

$^3$ The hollow arrowhead, pointing from $X$ to $V$, is used to emphasise that $V$ is a function of $X$.

A graphical approach to (a) does not work since Property 3 is required. However, while not serving as a proof, Fig. 3 conveniently embodies the conditional independencies of Properties 1 and 2 and the trivial property $V \perp\!\!\!\perp T \mid (X, F_T)$, as well as (13).

[Fig. 3: Response sufficient reduction. DAG on the nodes $X$, $V$, $F_T$, $T$ and $Y$.]

4 Propensity analysis

Here we further discuss the treatment-sufficient reduction, which does not involve the response. This brings in the concept of a propensity variable: a minimal treatment-sufficient covariate, for which we investigate the unbiasedness and precision of the estimator of ACE. The asymptotic precision of the estimated ACE, as well as the variation of the estimate from the actual data, will also be analysed. In a simple normal linear model applied for covariate adjustment, two cases are considered: homoscedasticity and heteroscedasticity. A non-parametric approach, subclassification, will also be conducted, for different covariance matrices of $X$ in the two treatment arms. The estimated ACE obtained by adjusting for multivariate $X$ and that obtained by adjusting for a scalar propensity variable will then be compared theoretically and through simulations [9].

4.1 Propensity score and propensity variable

The propensity score (PS), first introduced by Rosenbaum and Rubin, is a balancing score [22].
Regarded as a useful tool to reduce bias and increase precision, it is a very popular approach to causal effect estimation. The PS matching (or subclassification) method, widely used in various research fields, exploits the property of conditional (within-stratum) exchangeability, whereby individuals with the same value of PS (or belonging to a group with similar values of PS) are taken as comparable or exchangeable. We will, however, mainly focus on the application of PS within a linear regression. The definitions of the balancing score and PS given below are borrowed from [22].

Definition 6. A balancing score $b(X)$ is a function of $X$ such that, in the observational regime,$^4$ the conditional distribution of $X$ given $b(X)$ is the same for both treatment groups. That is,

$$X \perp\!\!\!\perp T \mid (b(X), F_T = \emptyset).$$

$^4$ Rosenbaum and Rubin do not define the balancing score and the PS explicitly for observational studies, although they do aim to apply the PS approach in such studies.

It has been shown that adjusting for a balancing score rather than $X$ results in an unbiased estimate of ACE, under the assumption of strongly ignorable treatment assignment [22]. One can trivially choose $b(X) = X$, but it is more constructive to find a balancing score that is a many-to-one function.

Definition 7. The propensity score, denoted by $\Pi$, is the probability of being assigned to the treatment group given $X$ in the observational regime:

$$\Pi := P_\emptyset(T = 1 \mid X).$$

We shall use the symbol $\pi$ to denote a particular realisation of $\Pi$. By (16) and Definitions 6 and 7, we assert that PS is the coarsest balancing score. For a subject $i$, PS is assumed to lie strictly between 0 and 1, i.e., $0 < \pi_i < 1$. Those with the same value of PS are equally likely to be allocated to the treatment group (or equivalently, to the control group), which provides observational studies with a randomised-experiment-like property based on measured $X$. This is because the characteristics of the two groups with the same or similar PS are “balanced”.
Therefore, the scalar PS serves as a proxy for the multi-dimensional variable $X$, and thus it is sufficient to adjust for the former instead of the latter. In observational studies, PS is generally unknown because we do not know exactly which components of $X$ have an impact on $T$ and how the treatment is associated with them. However, we can estimate PS from the observational data.

PS analysis for causal inference is based on a sequence of two stages:

Stage 1: PS estimation. PS is estimated from the observed $T$ and $X$, normally by a logistic regression of $T$ on $X$ for binary treatment. Note that the response $Y$ is irrelevant at this stage. Because we can estimate PS without observing $Y$, there is no harm in finding an “optimal” regression model of $T$ on $X$ by repeated trials.

Stage 2: Adjusting for PS. Various adjustment approaches have been developed, e.g., linear regression. If we are unclear about the conditional distribution of $Y$ given $T$ and PS, non-parametric adjustment such as matching or subclassification could be applied instead.

Although two alternatives for dimension reduction have been provided, in practice the treatment-sufficient reduction may be more convenient in many cases. For example, certain values of the response may occur rarely and only after long observation periods after treatment. In addition, it may sometimes be tricky to determine a “correct” form for a regression model of $Y$ on $X$, $T$ and $F_T$. Swapping the positions of $X$ and $T$, Equation (16) can be re-expressed as

$$X \perp\!\!\!\perp T \mid (V, F_T = \emptyset), \tag{17}$$

which states that the observational distribution of $X$ given $V$ is the same for both treatment arms. That is to say, $V$ is a balancing score for $X$.

The treatment-sufficient condition (b) can be equivalently interpreted as follows. Consider the family $\mathcal{Q} = \{Q_0, Q_1\}$ consisting of the observational distributions of $X$ for the two groups $T = 0$ and $T = 1$. Then Equation (16), re-expressed as (17), says that $V$ is a sufficient statistic (in the usual Fisherian sense [8]) for this family. In par-