BDSI ML4HC2023: Causal Inference
Session 10 | 2023-07-10
Notation and definitions
- Each data point is defined by the triple $(X_i, W_i, Y_i)$
- The vector $X_i$ represents the covariates observed for individual $i$
- Treatment assignment is indicated by $W_i \in \{0, 1\}$, with 1 representing treatment and 0 representing control
- The scalar $Y_i$ is the observed outcome; it can be real-valued or binary
- Each observation is drawn independently from the same distribution
We’ll be interested in assessing the causal effect of
treatment on outcome.
Fundamental Problem in Causal Inference
A difficulty in estimating causal quantities is that we observe each individual in only one treatment state: either they were treated, or they weren't.
This is often called the fundamental problem of causal inference.
Potential Outcomes
- However, it's often useful to imagine that each individual is endowed with two random variables $(Y_i(1), Y_i(0))$, where
  - $Y_i(1)$ represents the value of this individual's outcome if they receive treatment, and
  - $Y_i(0)$ represents their outcome if they are not treated
- These random variables are called potential outcomes
- The observed outcome $Y_i$ corresponds to whichever potential outcome we got to see:
$$Y_i \equiv Y_i(W_i) = \begin{cases} Y_i(1) & \text{if } W_i = 1 \text{ (treated)} \\ Y_i(0) & \text{if } W_i = 0 \text{ (control)} \end{cases}$$
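The mapping from potential outcomes to the observed outcome can be sketched in a few lines. This is a hypothetical simulation; the constant unit effect and the specific distributions are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical potential outcomes: Y(0) is a baseline draw,
# Y(1) adds a constant effect of 1.0 (an illustrative assumption).
y0 = rng.normal(size=n)
y1 = y0 + 1.0

# Random (coin-flip) treatment assignment.
w = rng.integers(0, 2, size=n)

# The observed outcome reveals exactly one potential outcome per person.
y_obs = np.where(w == 1, y1, y0)
```

Note that `y1 - y0` is never observable for any single individual, which is the fundamental problem restated in code.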
Estimating the Average Treatment Effect
- Since we can't observe both $Y_i(1)$ and $Y_i(0)$, we won't be able to make statistical claims about the individual treatment effect $Y_i(1) - Y_i(0)$. Instead, our goal will be to estimate the average treatment effect (ATE):
$$\tau := E[Y_i(1) - Y_i(0)]$$
ATE in Randomized Setting
Here, when we refer to the randomized setting we mean that the data were generated by a randomized controlled trial.
The key characteristic of this setting is that the probability that an individual is assigned to the treatment arm is fixed. In particular, it does not depend on the individual's potential outcomes:
$$Y_i(1), Y_i(0) \perp W_i.$$
This precludes situations in which individuals may self-select into or out of treatment.
Canonical failure example: a job training program in which workers enroll more often when they are more likely to benefit from treatment, because in that case $W_i$ and $Y_i(1) - Y_i(0)$ would be positively correlated.
Violation of Unconfoundedness
- When the unconfoundedness condition above (independence of treatment assignment and potential outcomes) is violated, we say that we are in an observational setting
- This is a more complex setting encompassing several different
scenarios:
- sometimes it’s still possible to estimate the ATE under additional
assumptions
- sometimes the researcher can exploit other sources of variation to
obtain other interesting estimands
- We will focus on ATE estimation under the following assumption of
conditional unconfoundedness
Conditional Unconfoundedness
$$Y_i(1), Y_i(0) \perp W_i \mid X_i.$$
- It is also known as no unmeasured confounders, ignorability, or selection on observables
- It says that all possible sources of self-selection, etc., can be explained by the observable covariates $X_i$
- Job-training example: older or more educated people are more likely to self-select into treatment; but when we compare two workers that have the same age, level of education, etc., there's nothing else we could infer about their relative potential outcomes if we knew that one went into job training and the other did not
Propensity Score
- A key quantity of interest will be the treatment assignment probability, or propensity score, $e(X_i) := P[W_i = 1 \mid X_i]$
- In an experimental setting this quantity is usually known and fixed
- In observational settings it must be estimated from the data
Propensity Scores - Overlap
- We will often need to assume that the propensity score is bounded away from zero and one. That is, there exists some $\eta > 0$ such that $\eta < e(x) < 1 - \eta$ for all $x$.
- This assumption is known as overlap, and it means that for all types of people in our population (i.e., all values of observable characteristics) we can find some portion of individuals in treatment and some in control
- Intuitively, this is necessary because we'd like to compare treatment and control at each level of the covariates and then aggregate those results
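In observational data the propensity score must be estimated before overlap can be checked. A minimal sketch, using a hand-rolled logistic regression fit by gradient descent as a stand-in for whatever propensity model one prefers; the trimming threshold `eta = 0.05` is an arbitrary illustrative choice:

```python
import numpy as np

def estimate_propensity(X, w, n_iter=500, lr=0.5):
    """Estimate e(x) = P[W = 1 | X = x] with a logistic regression
    fit by batch gradient descent (a simple stand-in model)."""
    Xb = np.column_stack([np.ones(len(X)), X])  # add intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        beta += lr * Xb.T @ (w - p) / len(w)  # ascent on log-likelihood
    return 1.0 / (1.0 + np.exp(-Xb @ beta))

def check_overlap(e_hat, eta=0.05):
    """Flag observations whose estimated score is too close to 0 or 1."""
    return (e_hat > eta) & (e_hat < 1 - eta)
```

If many observations fail the check, a common (if debatable) practice is to trim them before estimating the ATE.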
Estimators - Difference-in-means
Let's begin by considering a simple estimator that is available in experimental settings. The difference-in-means estimator is the sample average of outcomes in treatment minus the sample average of outcomes in control:
$$\hat{\tau}_{DIFF} = \frac{1}{n_1} \sum_{i: W_i = 1} Y_i - \frac{1}{n_0} \sum_{i: W_i = 0} Y_i, \qquad n_w := |\{i : W_i = w\}|$$
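As a sketch, applied to a simulated trial where the true effect is 2.0 (all numbers here are illustrative assumptions):

```python
import numpy as np

def difference_in_means(y, w):
    """Sample mean of treated outcomes minus sample mean of controls."""
    return y[w == 1].mean() - y[w == 0].mean()

# Simulated randomized trial with a true ATE of 2.0.
rng = np.random.default_rng(0)
n = 10_000
w = rng.integers(0, 2, size=n)      # coin-flip assignment
y = 2.0 * w + rng.normal(size=n)    # outcome = effect + noise

tau_hat = difference_in_means(y, w)  # should be close to 2.0
```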
Difference-in-means through Linear Regression
We can also compute the same quantity via linear regression, using the fact that
$$Y_i = Y_i(0) + W_i \, (Y_i(1) - Y_i(0)) \quad \text{[consistency assumption]}$$
so that, taking expectations conditional on treatment assignment,
$$E[Y_i \mid W_i] = \alpha + W_i \tau, \qquad \alpha := E[Y_i(0)]$$
Therefore, we can estimate the ATE of a binary treatment via a linear regression of observed outcomes $Y_i$ on a vector consisting of an intercept and the treatment assignment, $(1, W_i)$.
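The equivalence can be checked numerically; here the OLS fit uses plain least squares from NumPy rather than a dedicated regression library:

```python
import numpy as np

def ate_via_ols(y, w):
    """Regress Y on (1, W); the coefficient on W is the DIM estimate."""
    design = np.column_stack([np.ones_like(y), w.astype(float)])
    alpha_hat, tau_hat = np.linalg.lstsq(design, y, rcond=None)[0]
    return tau_hat

# Simulated data (illustrative): intercept 1.0, treatment effect 2.0.
rng = np.random.default_rng(0)
w = rng.integers(0, 2, size=500)
y = 1.0 + 2.0 * w + rng.normal(size=500)

# The two estimators agree up to floating point.
dim = y[w == 1].mean() - y[w == 0].mean()
ols = ate_via_ols(y, w)
```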
Difference-in-means estimator
- A simple, easily computable, unbiased, “model-free” estimator of the
treatment effect
- It should always be reported when dealing with data collected in a
randomized setting
Estimators for Observational Data
- When does the difference-in-means estimator fail?
- Often in observational settings
- Example: we'll focus on two covariates of interest: age and political views (polviews)
- Treatment: framing of a question on government spending on the social safety net
- Presumably, younger or more liberal individuals are less affected by the change in wording of the question. So let's see what happens when we make these individuals more prominent in our sample of treated individuals, and less prominent in our sample of untreated individuals.
- The treated population is much younger and more liberal, while the untreated population is older and more conservative
Covariate Distributions in the data
(Figure: distributions of age and polviews by treatment status.)
DIM Estimator is biased
- We find that the DIM estimator is biased toward 0, since treatment was disproportionately assigned to individuals for whom we expect the effect to be smaller
- Note that the dataset above still satisfies unconfoundedness, since discrepancies in treatment assignment probability are explained by observable covariates (age and polviews)
- It also satisfies the overlap assumption, since we never completely dropped all treated or all untreated observations in any region of the covariate space
- This is important because we’ll consider different estimators of the
ATE that are available in observational settings under unconfoundedness
and overlap
Direct estimation
Our first estimator is suggested by the following decomposition of the ATE, which is possible due to conditional unconfoundedness:
$$E[Y_i(1) - Y_i(0)] = E\big[E[Y_i \mid X_i, W_i = 1]\big] - E\big[E[Y_i \mid X_i, W_i = 0]\big]$$
Direct Estimation
The decomposition above suggests the following procedure, sometimes called the direct estimate of the ATE:
- Estimate $\mu(x, w) := E[Y_i \mid X_i = x, W_i = w]$, preferably using nonparametric methods.
- Predict $\hat{\mu}(X_i, 1)$ and $\hat{\mu}(X_i, 0)$ for each observation in the data.
- Average the differences of the predictions:
$$\hat{\tau}_{DM} := \frac{1}{n} \sum_{i=1}^{n} \big( \hat{\mu}(X_i, 1) - \hat{\mu}(X_i, 0) \big)$$
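A minimal sketch of the procedure, fitting $\hat{\mu}(x, 1)$ and $\hat{\mu}(x, 0)$ separately on treated and control observations. A linear model stands in for the nonparametric learner the slides recommend, and the data-generating process below is an illustrative assumption in which treatment probability depends on the covariate:

```python
import numpy as np

def fit_mu(X, y):
    """Least-squares fit of y on (1, X); returns a prediction function.
    (Linear model as a stand-in; nonparametric methods are preferred.)"""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.linalg.lstsq(Xb, y, rcond=None)[0]
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta

def direct_estimate(X, w, y):
    """Fit mu(x, 1) on treated, mu(x, 0) on controls, then average the
    difference of predictions over the whole sample."""
    mu1 = fit_mu(X[w == 1], y[w == 1])
    mu0 = fit_mu(X[w == 0], y[w == 0])
    return np.mean(mu1(X) - mu0(X))

# Confounded data: treatment is more likely for large X, true ATE = 2.0.
rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 1))
w = rng.binomial(1, 1.0 / (1.0 + np.exp(-2 * X[:, 0])))
y = X[:, 0] + 2.0 * w + rng.normal(size=n)

tau_dm = direct_estimate(X, w, y)               # close to 2.0
tau_dim = y[w == 1].mean() - y[w == 0].mean()   # biased upward here
```

In this simulation the naive difference in means overstates the effect because treated individuals have systematically larger $X$, while the direct estimate adjusts for it.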
Direct Estimation - Properties
- This estimator allows us to leverage regression techniques to estimate the ATE, so the resulting estimate should have smaller root-mean-squared error
- However, it has several disadvantages that make it undesirable:
  - its properties rely heavily on the model $\hat{\mu}(x, w)$ being correctly specified: it will be an unbiased and/or consistent estimate of the ATE provided that $\hat{\mu}(x, w)$ is an unbiased and/or consistent estimator of $E[Y \mid X = x, W = w]$
  - in practice, having a well-specified model is not something we want to rely upon
  - in general, it will also not be asymptotically normal, which means that we can't easily compute t-statistics and p-values for it
A technical note on Step 1 of Direct Estimate
- Step 1 above (estimating $\mu(x, w)$) can be done by regressing $Y_i$ on $X_i$ using only the treated observations to get an estimate $\hat{\mu}(x, 1)$ first, and then repeating the same on the control observations to obtain $\hat{\mu}(x, 0)$
- Or it can be done by regressing $Y_i$ on the covariates and treatment $(X_i, W_i)$ together, obtaining a single function $\hat{\mu}(x, w)$
- Both have advantages and disadvantages, and we refer to Künzel, Sekhon, Bickel, and Yu (2019) for a discussion.
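The second (single-model) approach can be sketched as follows. With a purely additive linear stand-in model the estimate collapses to the coefficient on $W$; more flexible learners are what make the two approaches genuinely different, which is part of what the comparison in Künzel et al. (2019) is about:

```python
import numpy as np

def single_model_direct_estimate(X, w, y):
    """Regress y on (1, X, W) jointly, then predict with W forced to 1
    and to 0 for every observation and average the difference.
    (Linear stand-in model; any regression learner can be substituted.)"""
    n = len(y)
    design = np.column_stack([np.ones(n), X, w.astype(float)])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    d1 = np.column_stack([np.ones(n), X, np.ones(n)])   # everyone treated
    d0 = np.column_stack([np.ones(n), X, np.zeros(n)])  # everyone control
    return np.mean(d1 @ beta - d0 @ beta)
```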
Next Steps
- Inverse Propensity-weighted Estimator
- Doubly Robust Estimator