BDSI ML4HC2023: Causal Inference
Session 10 | 2023-07-10
Notation and definitions
- Each data point is defined by the triple $(X_i, W_i, Y_i)$
- The vector $X_i$ represents the covariates observed for individual $i$
- Treatment assignment is indicated by $W_i \in \{0, 1\}$, with 1 representing treatment and 0 representing control
- The scalar $Y_i$ is the observed outcome; it can be real-valued or binary
- Each observation is drawn independently from the same distribution
We’ll be interested in assessing the causal effect of
treatment on outcome.
Fundamental Problem in Causal Inference
A difficulty in estimating causal quantities is that we observe each individual in only one treatment state: either they were treated, or they weren't.
This is often called the fundamental problem of causal inference.
Potential Outcomes
- However, it's often useful to imagine that each individual is endowed with two random variables $(Y_i(1), Y_i(0))$, where
  - $Y_i(1)$ represents the value of this individual's outcome if they receive treatment, and
  - $Y_i(0)$ represents their outcome if they are not treated
- These random variables are called potential outcomes
- The observed outcome $Y_i$ corresponds to whichever potential outcome we got to see:
$$Y_i \equiv Y_i(W_i) = \begin{cases} Y_i(1) & \text{if } W_i = 1 \text{ (treated)} \\ Y_i(0) & \text{if } W_i = 0 \text{ (control)} \end{cases}$$
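The mapping from potential outcomes to the observed outcome can be sketched in a few lines. This is a hypothetical simulation; the constant unit effect and the specific distributions are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical potential outcomes: Y(0) is a baseline draw,
# Y(1) adds a constant effect of 1.0 (an illustrative assumption).
y0 = rng.normal(size=n)
y1 = y0 + 1.0

# Random (coin-flip) treatment assignment.
w = rng.integers(0, 2, size=n)

# The observed outcome reveals exactly one potential outcome per person.
y_obs = np.where(w == 1, y1, y0)
```

Note that `y1 - y0` is never observable for any single individual, which is the fundamental problem restated in code.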
Estimating the Average Treatment Effect
- Since we can't observe both $Y_i(1)$ and $Y_i(0)$, we won't be able to make statistical claims about the individual treatment effect $Y_i(1) - Y_i(0)$. Instead, our goal will be to estimate the average treatment effect (ATE):
$$\tau := E[Y_i(1) - Y_i(0)]$$
ATE in Randomized Setting
Here, when we refer to the randomized setting we mean that the data were generated by a randomized controlled trial.
The key characteristic of this setting is that the probability that an individual is assigned to the treatment arm is fixed. In particular, it does not depend on the individual's potential outcomes:
$$Y_i(1), Y_i(0) \perp W_i.$$
This precludes situations in which individuals may self-select into or out of treatment.
Canonical failure example: a job training program in which workers enroll more often when they are more likely to benefit from treatment, because in that case $W_i$ and $Y_i(1) - Y_i(0)$ would be positively correlated.
Violation of Unconfoundedness
- When the unconfoundedness condition above (independence of treatment assignment and potential outcomes) is violated, we say that we are in an observational setting
- This is a more complex setting encompassing several different
scenarios:
- sometimes it’s still possible to estimate the ATE under additional
assumptions
- sometimes the researcher can exploit other sources of variation to
obtain other interesting estimands
- We will focus on ATE estimation under the following assumption of
conditional unconfoundedness
Conditional Unconfoundedness
$$Y_i(1), Y_i(0) \perp W_i \mid X_i.$$
- It is also known as no unmeasured confounders, ignorability, or selection on observables
- It says that all possible sources of self-selection, etc., can be explained by the observable covariates $X_i$
- Job-training example: older or more educated people are more likely to self-select into treatment; but when we compare two workers that have the same age, level of education, etc., there's nothing else we could infer about their relative potential outcomes if we knew that one went into job training and the other did not
Propensity Score
- A key quantity of interest will be the treatment assignment probability, or propensity score, $e(X_i) := P[W_i = 1 \mid X_i]$
- In an experimental setting this quantity is usually known and fixed
- In observational settings it must be estimated from the data
Propensity Scores - Overlap
- We will often need to assume that the propensity score is bounded away from zero and one. That is, there exists some $\eta > 0$ such that $\eta < e(x) < 1 - \eta$ for all $x$.
- This assumption is known as overlap, and it means that for all types of people in our population (i.e., all values of observable characteristics) we can find some portion of individuals in treatment and some in control
- Intuitively, this is necessary because we'd like to compare treatment and control at each level of the covariates and then aggregate those results
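In observational data the propensity score must be estimated before overlap can be checked. A minimal sketch, using a hand-rolled logistic regression fit by gradient descent as a stand-in for whatever propensity model one prefers; the trimming threshold `eta = 0.05` is an arbitrary illustrative choice:

```python
import numpy as np

def estimate_propensity(X, w, n_iter=500, lr=0.5):
    """Estimate e(x) = P[W = 1 | X = x] with a logistic regression
    fit by batch gradient descent (a simple stand-in model)."""
    Xb = np.column_stack([np.ones(len(X)), X])  # add intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        beta += lr * Xb.T @ (w - p) / len(w)  # ascent on log-likelihood
    return 1.0 / (1.0 + np.exp(-Xb @ beta))

def check_overlap(e_hat, eta=0.05):
    """Flag observations whose estimated score is too close to 0 or 1."""
    return (e_hat > eta) & (e_hat < 1 - eta)
```

If many observations fail the check, a common (if debatable) practice is to trim them before estimating the ATE.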
Estimators - Difference-in-means
Let's begin by considering a simple estimator that is available in experimental settings. The difference-in-means estimator is the sample average of outcomes in treatment minus the sample average of outcomes in control:
$$\hat{\tau}_{DIFF} = \frac{1}{n_1} \sum_{i: W_i = 1} Y_i - \frac{1}{n_0} \sum_{i: W_i = 0} Y_i, \qquad n_w := |\{i : W_i = w\}|$$
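As a sketch, applied to a simulated trial where the true effect is 2.0 (all numbers here are illustrative assumptions):

```python
import numpy as np

def difference_in_means(y, w):
    """Sample mean of treated outcomes minus sample mean of controls."""
    return y[w == 1].mean() - y[w == 0].mean()

# Simulated randomized trial with a true ATE of 2.0.
rng = np.random.default_rng(0)
n = 10_000
w = rng.integers(0, 2, size=n)      # coin-flip assignment
y = 2.0 * w + rng.normal(size=n)    # outcome = effect + noise

tau_hat = difference_in_means(y, w)  # should be close to 2.0
```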
Difference-in-means through Linear Regression
We can also compute the same quantity via linear regression, using the fact that
$$Y_i = Y_i(0) + W_i \, (Y_i(1) - Y_i(0)) \quad \text{[consistency assumption]}$$
so that, taking expectations conditional on treatment assignment,
$$E[Y_i \mid W_i] = \alpha + W_i \tau, \qquad \alpha := E[Y_i(0)]$$
Therefore, we can estimate the ATE of a binary treatment via a linear regression of observed outcomes $Y_i$ on a vector consisting of an intercept and the treatment assignment, $(1, W_i)$.
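The equivalence can be checked numerically; here the OLS fit uses plain least squares from NumPy rather than a dedicated regression library:

```python
import numpy as np

def ate_via_ols(y, w):
    """Regress Y on (1, W); the coefficient on W is the DIM estimate."""
    design = np.column_stack([np.ones_like(y), w.astype(float)])
    alpha_hat, tau_hat = np.linalg.lstsq(design, y, rcond=None)[0]
    return tau_hat

# Simulated data (illustrative): intercept 1.0, treatment effect 2.0.
rng = np.random.default_rng(0)
w = rng.integers(0, 2, size=500)
y = 1.0 + 2.0 * w + rng.normal(size=500)

# The two estimators agree up to floating point.
dim = y[w == 1].mean() - y[w == 0].mean()
ols = ate_via_ols(y, w)
```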
Difference-in-means estimator
- A simple, easily computable, unbiased, “model-free” estimator of the
treatment effect
- It should always be reported when dealing with data collected in a
randomized setting
Estimators for Observational Data
- When does the difference-in-means estimator fail?
- Often in observational settings
- Example: we'll focus on two covariates of interest: age and political views (polviews)
- Treatment: framing of a question on government spending on the social safety net
- Presumably, younger or more liberal individuals are less affected by the change in wording of the question. So let's see what happens when we make these individuals more prominent in our sample of treated individuals, and less prominent in our sample of untreated individuals.
- The treated population is much younger and more liberal, while the untreated population is older and more conservative
Covariate Distributions in the data
(Figure: distributions of age and polviews by treatment status.)
DIM Estimator is biased
- We find that the DIM estimator is biased toward 0, since treatment was disproportionately assigned to individuals for whom we expect the effect to be smaller
- Note that the dataset above still satisfies unconfoundedness, since discrepancies in treatment assignment probability are explained by observable covariates (age and polviews)
- It also satisfies the overlap assumption, since we never completely dropped all treated or all untreated observations in any region of the covariate space
- This is important because we’ll consider different estimators of the
ATE that are available in observational settings under unconfoundedness
and overlap
Direct estimation
Our first estimator is suggested by the following decomposition of the ATE, which is possible due to conditional unconfoundedness:
$$E[Y_i(1) - Y_i(0)] = E\big[E[Y_i \mid X_i, W_i = 1]\big] - E\big[E[Y_i \mid X_i, W_i = 0]\big]$$
Direct Estimation
The decomposition above suggests the following procedure, sometimes called the direct estimate of the ATE:
- Estimate $\mu(x, w) := E[Y_i \mid X_i = x, W_i = w]$, preferably using nonparametric methods.
- Predict $\hat{\mu}(X_i, 1)$ and $\hat{\mu}(X_i, 0)$ for each observation in the data.
- Average the differences of the predictions:
$$\hat{\tau}_{DM} := \frac{1}{n} \sum_{i=1}^{n} \big( \hat{\mu}(X_i, 1) - \hat{\mu}(X_i, 0) \big)$$
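A minimal sketch of the procedure, fitting $\hat{\mu}(x, 1)$ and $\hat{\mu}(x, 0)$ separately on treated and control observations. A linear model stands in for the nonparametric learner the slides recommend, and the data-generating process below is an illustrative assumption in which treatment probability depends on the covariate:

```python
import numpy as np

def fit_mu(X, y):
    """Least-squares fit of y on (1, X); returns a prediction function.
    (Linear model as a stand-in; nonparametric methods are preferred.)"""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.linalg.lstsq(Xb, y, rcond=None)[0]
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta

def direct_estimate(X, w, y):
    """Fit mu(x, 1) on treated, mu(x, 0) on controls, then average the
    difference of predictions over the whole sample."""
    mu1 = fit_mu(X[w == 1], y[w == 1])
    mu0 = fit_mu(X[w == 0], y[w == 0])
    return np.mean(mu1(X) - mu0(X))

# Confounded data: treatment is more likely for large X, true ATE = 2.0.
rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 1))
w = rng.binomial(1, 1.0 / (1.0 + np.exp(-2 * X[:, 0])))
y = X[:, 0] + 2.0 * w + rng.normal(size=n)

tau_dm = direct_estimate(X, w, y)               # close to 2.0
tau_dim = y[w == 1].mean() - y[w == 0].mean()   # biased upward here
```

In this simulation the naive difference in means overstates the effect because treated individuals have systematically larger $X$, while the direct estimate adjusts for it.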
Direct Estimation - Properties
- This estimator allows us to leverage regression techniques to estimate the ATE, so the resulting estimate should have smaller root-mean-squared error
- However, it has several disadvantages that make it undesirable:
  - its properties rely heavily on the model $\hat{\mu}(x, w)$ being correctly specified: it will be an unbiased and/or consistent estimate of the ATE provided that $\hat{\mu}(x, w)$ is an unbiased and/or consistent estimator of $E[Y \mid X = x, W = w]$
  - in practice, having a well-specified model is not something we want to rely upon
  - in general, it will also not be asymptotically normal, which means that we can't easily compute t-statistics and p-values for it
A technical note on Step 1 of Direct Estimate
- Step 1 above (estimating $\mu(x, w)$) can be done by regressing $Y_i$ on $X_i$ using only the treated observations to get an estimate $\hat{\mu}(x, 1)$ first, and then repeating the same on the control observations to obtain $\hat{\mu}(x, 0)$
- Or it can be done by regressing $Y_i$ on the covariates and treatment $(X_i, W_i)$ together, obtaining a single function $\hat{\mu}(x, w)$
- Both have advantages and disadvantages, and we refer to Künzel, Sekhon, Bickel, and Yu (2019) for a discussion.
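The second (single-model) approach can be sketched as follows. With a purely additive linear stand-in model the estimate collapses to the coefficient on $W$; more flexible learners are what make the two approaches genuinely different, which is part of what the comparison in Künzel et al. (2019) is about:

```python
import numpy as np

def single_model_direct_estimate(X, w, y):
    """Regress y on (1, X, W) jointly, then predict with W forced to 1
    and to 0 for every observation and average the difference.
    (Linear stand-in model; any regression learner can be substituted.)"""
    n = len(y)
    design = np.column_stack([np.ones(n), X, w.astype(float)])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    d1 = np.column_stack([np.ones(n), X, np.ones(n)])   # everyone treated
    d0 = np.column_stack([np.ones(n), X, np.zeros(n)])  # everyone control
    return np.mean(d1 @ beta - d0 @ beta)
```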
Next Steps
- Inverse Propensity-weighted Estimator
- Doubly Robust Estimator