Most pandemic research has focused on questions related to prediction such as: how many COVID cases do we expect to see next week in Allegheny County? Causal inference, on the other hand, is concerned with “what if” questions. For example: what would happen if everyone wore masks? How many cases would there be if there was a lockdown? How many deaths would occur if 40 percent of the population is vaccinated? These questions involve hypothetical interventions on a population.
Estimating causal effects is tricky. We all know that “correlation isn’t causation.” For example, mask usage and cases could both increase over time. But that doesn’t mean that masks cause more cases to occur. Activation of “Buckle your seatbelt” signs on airplanes is correlated with turbulence. But activating the sign does not cause turbulence.
So how do we estimate causal effects? There is a collection of methods for this task. The important thing is that we can’t just use standard prediction methods. We need to use specialized methods that were designed for causal inference. In this post we will discuss our work on the causal effect of social mobility on deaths from COVID. The social mobility variable we will use is the proportion of people staying home. We can think of this of anti-mobility, and expect that higher values of this variable will lead to fewer deaths from COVID.
Let’s start with a brief introduction to causal inference.
Causal inference is a huge, complex topic. We give a very brief exposition of some key ideas here. To learn more, we recommend the book by Hernan and Robins and the paper “Statistics and Causal Inference” by Paul Holland (Journal of the American Statistical Association 1986, pp. 945-960). And one can find many tutorials on the web.
Suppose we have three variables , and , where is the outcome of interest (such as deaths from COVID), is a treatment or exposure (such as mobility level), and are confounding variables, which are variables that affect both and (such as age). The joint density is
We want to know: what would be the mean of if we could set to be equal to some value ? For example, what would the number of COVID cases be if 92 percent of people wore masks? We let denote this hypothetical outcome, which is called a counterfactual or potential outcome. And we denote the mean of by .
If we have measured all the confounding variables then the answer (ignoring a few technical details) is given by Robins’ -formula:
where is the mean of given and . This is not equal to the mean of given , which is . This is the difference between causation and prediction (correlation). You can think of the -formula as the mean of if we replace in the joint density with a point mass at . Using the -formula is one example of adjusting for confounding.
Now suppose we observe data . How do we estimate ? The simplest approach is to get an estimate of (this is just regression) and replace the integral in the -formula with a sample average to get the estimate
This is called the plug-in estimator. (Note that for prediction we would not use this formula. We would just use .) There are often better estimators, but we won’t get into that here. The important thing is: there is a formula for the causal effect and we can estimate it.
The first plot below shows an example where we would predict higher values of when is large. For pure prediction, this is the correct conclusion. The second plot shows that once we account for (corresponding to different colors) there is a negative relationship between and : for a given person with a fixed age, increasing would decrease . In this case, age is a confounder and the -formula would correctly recover the negative relationship. For causal inference, this is the correct conclusion.
Things get trickier when there are time varying variables. Consider weekly mobility and death data in one state. For simplicity, we’ll assume that there are no variables. But we’ll see that at time , the variables are confounding variables for the causal effect of mobility on . Define and for . The density of can be written as
Now consider the causal question:
What would be if we set equal to some value ?
Let denote this counterfactual quantity. For example, if is mobility, measuring, say, the percentage of people who stay home, and then is the number of deaths we would observe if we set mobility to 12 for three weeks then set it to 6 for three weeks.
How do we find the mean ? As a general rule, we adjust for pre-treatment variables but not for post-treatment variables. But in the time varying case, this rule makes no sense. Each is both a pre-treatment variable (it occurs before ) and a post-treatment variable (it occurs after ). The correct way to get the mean of is to use the more general form of the -formula:
You can ignore that equation if you like; suffice it to say that there is indeed a formula for the quantity we are interested in.
We note, by the way, that some authors denote by . They mean the same thing.
Having a formula for is great, but how do we estimate it? Again we could plug-in estimates of all the quantities in the -formula. But there are better things we can do, as we now explain.
The approach we use is to make use of something called a marginal structural model (MSM).
Usually when we construct a statistical model, we are trying to come up with a way to express the distribution of the data, in this case, . In effect, we want a formula that explains how we would generate a dataset like the one we have. But to do so in a tractable way usually involves making a bunch of assumptions about the model. Instead, with an MSM, we specify a plausible model for the form of the function but we otherwise leave the model for the data unspecified. For example, we might assume that . In a sense, we are spending our modeling assumptions wisely: we want the effect of the treatment to be smooth and fairly simple. But we otherwise don’t want to impose assumptions.
The model we use is
where is log-deaths, is an arbitrary, smooth, increasing function and
is cumulative mobility (compared to baseline mobility). Based on published studies, we take weeks. (We switch from deaths to log-deaths mainly for certain numerical reasons related to the software we are using.) Here, represents the evolution of the pandemic if there were no interventions and no reduction in susceptible individuals. The term captures the effect of mobility. With no change in mobility, deaths increase exponentially as .
The intuition for using cumulative mobility comes from ideas from epidemic modeling. Roughly, if denotes new infections in week , we might assume that for some function . This implies that and so on. If we continue back to we find that where . If we further assume that some fraction of people infected a time will die in weeks, then we can derive the above model. (More complex models are discussed in the paper.)
To re-cap: we are leaving the distribution of the data unspecified, except that we require: if you apply the -formula to you must get . In other words, the statistical model is the set of all densities that satisfy the condition
This is an example of a semiparametric model: a model that has some sort of constraint but is otherwise nonparametric.
Estimating the MSM is a bit involved. A key step is estimating the following function
where represents potential confounding variables. The model is fit with these weights and the weights are crucial because that is how we adjust for confounding in the MSM. We won’t go into detail here. Instead, let’s look at the data and the estimates.
The data we use are weekly deaths and weekly averaged mobility starting February 15 2019. The mobility signal we’ll consider here is the proportion of people of who stay home, which is kindly provided to Delphi by SafeGraph. (We publish this data in our COVIDcast Epidata API, so it is available to anyone interested in studying mobility during the pandemic.) We use data at the state level. Note that this is actually a sort of anti-mobility measure: the higher it is, the less mobility there is. Here are plots of weekly deaths and mobility starting February 15, 2020, for a few states:
library(covidcast)
library(directlabels)
library(dplyr)
library(lubridate)
library(ggplot2)
options(covidcast.auth = Sys.getenv("API_KEY")) # for more on API keys, see: https://cmu-delphi.github.io/delphi-epidata/api/api_keys.html
# Top 5 most populated states according to
# https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population
states_of_interest <- c("ca", "tx", "fl", "ny", "pa")
states_deaths <- covidcast_signal(
"jhu-csse", "deaths_incidence_num",
start_day = "2020-02-15", end_day = "2020-12-16",
geo_type = "state", geo_values = states_of_interest
)
states_mobility <- covidcast_signal(
"safegraph", "completely_home_prop",
start_day = "2020-02-15", end_day = "2020-12-16",
geo_type = "state", geo_values = states_of_interest
)
dat <- full_join(x = select(states_deaths, time_value, geo_value, value),
y = select(states_mobility, time_value, geo_value, value),
by = c("time_value" = "time_value", "geo_value" = "geo_value"))
dat <- dat %>%
rename(date = time_value,
state = geo_value,
stay_home = value.y,
deaths = value.x) %>%
group_by(state) %>%
# epi week is from saturday to sunday
mutate(weekday = wday(date, label = TRUE, week_start = 6),
weeknum = wday(date, week_start = 6),
week = cumsum(weeknum == 1))
dat_week <- dat %>%
group_by(state, week) %>%
mutate(weekly_deaths = sum(deaths),
weekly_deaths = 1 + weekly_deaths, # avoid 0 in the log scale
weekly_stay_home = mean(stay_home)) %>%
select(week, state, weekly_stay_home, weekly_deaths, weekly_deaths) %>%
unique()
ggplot(dat_week, aes(x = week, y = weekly_deaths, color = state)) +
geom_line() +
scale_y_log10() +
geom_dl(aes(label = toupper(state)), method = "last.bumpup") +
labs(x = "Week", y = "# deaths (log scale)",
title = "Reported deaths over time",
subtitle = "From JHU CSSE COVID-19 data",
caption = "Data from Delphi COVIDcast, delphi.cmu.edu") +
theme_bw() +
guides(color = FALSE)
ggplot(dat_week, aes(x = week, y = weekly_stay_home, color = state)) +
geom_line() +
geom_dl(aes(label = toupper(state)), method = "last.bumpup") +
labs(x = "Week", y = "Proportion staying at home",
title = "(Anti)mobility over time",
subtitle = "From SafeGraph",
caption = "Data from Delphi COVIDcast, delphi.cmu.edu") +
theme_bw() +
guides(color = FALSE)
After fitting the MSM, we can estimate various quantities. Let’s consider the effect of changing the observed stay-at-home series to the hypothetical value . In other words, what would have happened if we had increased stay-at-home by ? Below is the estimated difference in deaths (with 90 percent pointwise confidence bands) for a few states: Arizona, California, Florida and Georgia. The plot shows the mean of
where is the number of deaths and is the (counterfactual) number of deaths we would have observed of mobility had been increases by at all times, where we take (a ten percent increase in stay-at-home). The fact that the values are negative indicates that increasing stay-at-home decreases deaths, as we would expect. (We remind the reader that these results are preliminary.)
One important question in every causal analysis is: what if there is unobserved confounding? In the current problem, a confounder is a variable that affects both mobility and death from COVID. We try to include as many plausible observed confounding variables as possible. So far we use all past values of death and mobility as confounding variables. But of course, it is always possible that there are unobserved confounders. Currently, we are conducting a sensitivity analysis to assess how robust the estimates are to unobserved confounders.
In this post we have only addressed mobility. As soon as the mobility analysis is complete we will turn our attention to estimating the causal effect of masks on COVID cases.
Acknowledgements: We’d like to thank the Delphi engineering team for making this data available. And we’d like to thank Roni Rosenfeld and Ryan Tibshirani for posing the challenge of making counterfactual predictions. We thank Rob Tibshirani for suggesting several improvements on this post, and Alex Reinhart for getting our post into the appropriate format.
Related Posts: