Debiasing with Orthogonalization#

Previously, we saw how to evaluate a causal model. That, by itself, is a huge feat. Causal models estimate the elasticity \(\frac{\delta y}{\delta t}\), which is an unseen quantity. Since we can't see the ground truth of what our model is estimating, we had to be very creative in how to go about evaluating it.

The technique shown in the previous chapter relied heavily on data where the treatment was randomly assigned. The idea was to estimate the elasticity \(\frac{\delta y}{\delta t}\) as the coefficient of a single-variable linear regression of y ~ t. However, this only works if the treatment is randomly assigned. If it isn't, we get into trouble due to omitted variable bias.

To work around this, we need to make the data look as if the treatment had been randomly assigned. I would say there are two main techniques to do this: one uses the propensity score and the other uses orthogonalization. We will cover the latter in this chapter.

One final word of caution before we continue. I would argue that probably the safest way out of non-random data is to go out and run some sort of experiment to gather random data. I myself don't place much trust in debiasing techniques, because you can never know whether you've accounted for every confounder. Having said that, orthogonalization is still very much worth learning. It's an incredibly powerful technique that will be the foundation of many causal models to come.

Linear Regression Reborn#

The idea of orthogonalization is based on a theorem formulated in 1933 by the econometricians Ragnar Frisch and Frederick V. Waugh, and later generalized by Michael C. Lovell. Simply put, it states that you can decompose any multivariable linear regression model into three stages, or models. Let's say that your features are in an \(X\) matrix. Now, you partition that matrix in such a way that you get one part, \(X_1\), with some of the features and another part, \(X_2\), with the rest of the features.

In the first stage, we take the first set of features and estimate the following linear regression model

\[ y_i = \theta_0 + \pmb{\theta_1 X}_{1i} + e_i \]

where \(\pmb{\theta_1}\) is a vector of parameters. We then take the residuals of that model

\[ y^*_i = y_i - (\hat{\theta}_0 + \pmb{\hat{\theta}_1 X}_{1i}) \]

In the second stage, we take the first set of features again, but now we use them in a model that estimates the second set of features

\[ \pmb{X}_{2i} = \gamma_0 + \pmb{\gamma_1 X}_{1i} + e_i \]

Here, we are using the first set of features to predict the second set of features. Finally, we also take the residuals for this second stage.

\[ \pmb{X}^*_{2i} = \pmb{X}_{2i} - (\hat{\gamma}_0 + \pmb{\hat{\gamma}_1 X}_{1i}) \]

Lastly, we take the residuals from the first and second stage, and estimate the following model

\[ y_i^* = \beta_0 + \pmb{\beta_2 X}^*_{2i} + e_i \]

The Frisch–Waugh–Lovell theorem states that the parameter estimate \(\pmb{\hat{\beta}_2}\) from estimating this model is equivalent to the one we get by running the full regression, with all the features:

\[ y_i = \beta_0 + \pmb{\beta_1 X}_{1i} + \pmb{\beta_2 X}_{2i} + e_i \]


OK. Let’s unpack this a bit further. We know that regression is a very special model. Each of its parameters has the interpretation of a partial derivative: how much would \(Y\) increase if I increase one feature while holding all the others fixed. This is very nice for causal inference, because it means we can control for variables in the analysis, even if those same variables have not been held fixed during the collection of the data.

We also know that if we omit variables from the regression, we get bias. Specifically, omitted variable bias (or confounding bias). Still, the Frisch–Waugh–Lovell theorem is saying that I can break my regression model into two parts, neither of which contains the full feature set, and still get the same estimate I would get by running the entire regression. Not only that, the theorem also provides some insight into what linear regression is doing. To get the coefficient of one variable \(X_k\), regression first uses all the other variables to predict \(X_k\) and takes the residuals. This "cleans" \(X_k\) of any influence from those variables. That way, when we try to understand \(X_k\)'s impact on \(Y\), it will be free from omitted variable bias. Second, regression uses all the other variables to predict \(Y\) and takes the residuals. This "cleans" \(Y\) of any influence from those variables, reducing the variance of \(Y\) so that it is easier to see how \(X_k\) impacts \(Y\).

I know it can be hard to appreciate how awesome this is. But remember what linear regression is doing. It’s estimating the impact of \(X_2\) on \(y\) while accounting for \(X_1\). This is incredibly powerful for causal inference. It says that I can build a model that predicts my treatment \(t\) using my features \(X\), a model that predicts the outcome \(y\) using the same features, take the residuals from both models and run a model that estimates how the residual of \(t\) affects the residual of \(y\). This last model will tell me how \(t\) affects \(y\) while controlling for \(X\). In other words, the first two models are controlling for the confounding variables. They are generating data which is as good as random. This is debiasing my data. That’s what we use in the final model to estimate the elasticity.

There is a (not so complicated) mathematical proof for why that is the case, but I think the intuition behind this theorem is so straightforward we can go directly into it.
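To make the theorem concrete before we move on, here is a minimal numerical sketch on simulated data (the variables and coefficients below are made up purely for illustration): the coefficient on x2 from the full regression matches the one recovered from the residual-on-residual regression.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(1)
n = 10000
sim = pd.DataFrame({"x1": np.random.normal(size=n)})
sim["x2"] = 0.5 * sim["x1"] + np.random.normal(size=n)            # x2 is correlated with x1
sim["y"] = 1 + 2 * sim["x1"] + 3 * sim["x2"] + np.random.normal(size=n)

# full regression: coefficient on x2 controlling for x1
full = smf.ols("y ~ x1 + x2", data=sim).fit()

# FWL stages: residualize y and x2 on x1, then regress residual on residual
sim["y_star"] = smf.ols("y ~ x1", data=sim).fit().resid
sim["x2_star"] = smf.ols("x2 ~ x1", data=sim).fit().resid
fwl = smf.ols("y_star ~ x2_star", data=sim).fit()

print(full.params["x2"], fwl.params["x2_star"])  # the two estimates match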

The Intuition Behind Orthogonalization#

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

import statsmodels.formula.api as smf
import statsmodels.api as sm

from nb21 import cumulative_elast_curve_ci, elast, cumulative_gain_ci

Let's take our price data once again. But now, we will only take the sample where prices were not randomly assigned. Once again, we separate it into a training and a test set. Since we will use the test set to evaluate our causal model, let's see how we can use orthogonalization to debias it.

prices = pd.read_csv("./data/ice_cream_sales.csv")

train, test = train_test_split(prices, test_size=0.5)
train.shape, test.shape
((5000, 5), (5000, 5))

If we show the correlations on the test set, we can see that price is positively correlated with sales, meaning that sales should go up as we increase prices. This is obviously nonsense. People don’t buy more if ice cream is expensive. We probably have some sort of bias here.

test.corr()
             temp   weekday      cost     price     sales
temp     1.000000 -0.016172 -0.002254 -0.000696  0.375368
weekday -0.016172  1.000000  0.007748 -0.000841 -0.004849
cost    -0.002254  0.007748  1.000000  0.394668 -0.033491
price   -0.000696 -0.000841  0.394668  1.000000  0.068623
sales    0.375368 -0.004849 -0.033491  0.068623  1.000000

If we plot our data, we can see why this is happening. Weekends (Saturday and Sunday) have higher prices but also higher sales. We can see that this is the case because the weekend cloud of points sits in the upper right part of the plot.

Weekend is probably playing an important role in the bias here. On the weekends, there are more ice cream sales because there is more demand. In response to that demand, prices go up. So it is not that the increase in price causes sales to go up. It is just that both sales and prices are high on weekends.

np.random.seed(123)
sns.scatterplot(data=test.sample(1000), x="price", y="sales", hue="weekday");

To debias this dataset, we will need two models. The first model, let's call it \(M_t(X)\), predicts the treatment (price, in our case) using the confounders. It corresponds to one of the stages we've seen above in the Frisch–Waugh–Lovell theorem.

m_t = smf.ols("price ~ cost + C(weekday) + temp", data=test).fit()
debiased_test = test.assign(**{"price-Mt(X)":test["price"] - m_t.predict(test)})

Once we have this model, we will construct the residuals

\[ \hat{t}_i = t_i - M_t(X_i) \]

You can think of this residual as a version of the treatment that is unbiased or, better yet, that is impossible to predict from the confounders \(X\). Since the confounders were already used to predict \(t\), the residual is, by definition, unpredictable from \(X\). Another way of saying this is that the bias has been explained away by the model \(M_t(X_i)\), producing \(\hat{t}_i\), which is as good as randomly assigned. Of course, this only works if we have in \(X\) all the confounders that cause both \(T\) and \(Y\).
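As a quick sanity check (my own addition, just a sketch), we can regress the price residual back on the confounders. Since OLS residuals are orthogonal to the regressors, the fit should be essentially zero.

# the price residual should be unpredictable from the confounders;
# OLS residuals are orthogonal to the regressors, so this R² is essentially zero
check = smf.ols("Q('price-Mt(X)') ~ cost + C(weekday) + temp", data=debiased_test).fit()
check.rsquared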

We can also plot this data to see what it looks like.

np.random.seed(123)
sns.scatterplot(data=debiased_test.sample(1000), x="price-Mt(X)", y="sales", hue="weekday")
plt.vlines(0, debiased_test["sales"].min(), debiased_test["sales"].max(), linestyles='--', color="black");

We can see that the weekends are no longer in the upper right corner. They got pushed to the center. Moreover, we can no longer differentiate between price levels (the treatment) using the weekday variable. We can say that the residual \(price-M_t(X)\), plotted on the x-axis, is a "random" or debiased version of the original treatment.

This alone is sufficient to debias the dataset. This new treatment we’ve created is as good as randomly assigned. But we can still do one other thing to make the debiased dataset even better. Namely, we can also construct residuals for the outcome.

\[ \hat{y}_i = y_i - M_y(X_i) \]

This is another stage from the Frisch–Waugh–Lovell theorem. It doesn't make the dataset less biased, but it makes it easier to estimate the elasticity by reducing the variance in \(y\). Once again, you can think of \(\hat{y}_i\) as a version of \(y_i\) that is unpredictable from \(X\), or that has had all its variance due to \(X\) explained away. Think about it. We've already used \(X\) to predict \(y\) with \(M_y(X_i)\), and \(\hat{y}_i\) is the error of that prediction. So, by definition, it's not possible to predict it from \(X\). All the information in \(X\) useful for predicting \(y\) has already been used. If that is the case, the only thing left to explain \(\hat{y}_i\) is something we didn't use to construct it (not included in \(X\)), which is only the treatment (again, assuming no unmeasured confounders).

m_y = smf.ols("sales ~ cost + C(weekday) + temp", data=test).fit()

debiased_test = test.assign(**{"price-Mt(X)":test["price"] - m_t.predict(test),
                               "sales-My(X)":test["sales"] - m_y.predict(test)})

Once we do both transformations, not only do the weekdays not predict the price residual, they also can't predict the residual of sales, \(\hat{y}\). The only thing left to predict these residuals is the treatment. Also, notice something interesting. In the plot above, it was hard to tell the direction of the price elasticity. It looked like sales decreased as prices went up, but there was such a large variance in sales that it was hard to say that for sure.

Now, when we plot the two residuals against each other, it becomes much clearer that increasing prices indeed causes sales to go down.

np.random.seed(123)
sns.scatterplot(data=debiased_test.sample(1000), x="price-Mt(X)", y="sales-My(X)", hue="weekday")
plt.vlines(0, debiased_test["sales-My(X)"].min(), debiased_test["sales-My(X)"].max(), linestyles='--', color="black");

One small disadvantage of this debiased data is that the residuals have been shifted to a different scale. As a result, it’s hard to interpret what they mean (what is a price residual of -3?). Still, I think this is a small price to pay for the convenience of building random data from data that was not initially random.

To summarize, by predicting the treatment, we've constructed \(\hat{t}\), which works as an unbiased version of the treatment; by predicting the outcome, we've constructed \(\hat{y}\), which is a version of the outcome that can only be further explained by the treatment. This data, where we replace \(y\) by \(\hat{y}\) and \(t\) by \(\hat{t}\), is the debiased data we wanted. We can use it to evaluate our causal model just like we did previously with random data.

To see this, let’s once again build a causal model for price elasticity using the training data.

m3 = smf.ols("sales ~ price*cost + price*C(weekday) + price*temp", data=train).fit()

Then, we’ll make elasticity predictions on the debiased test set.

def predict_elast(model, price_df, h=0.01):
    # elasticity as a numerical derivative: change in predicted sales
    # when the price is nudged by a small step h
    return (model.predict(price_df.assign(price=price_df["price"]+h))
            - model.predict(price_df)) / h

debiased_test_pred = debiased_test.assign(**{
    "m3_pred": predict_elast(m3, debiased_test),
})

debiased_test_pred.head()
      temp  weekday  cost  price  sales  price-Mt(X)  sales-My(X)    m3_pred
7592  27.8        4   1.0    4.9    203    -1.117175     4.365828  -1.921196
4489  27.4        5   1.0    8.2    214     2.190180    16.319865  -1.629962
7680  22.3        7   0.5    8.7    205     2.194122   -13.607892 -12.693222
7303  19.0        1   0.5    8.2    210     1.722688    -2.520010 -10.424218
6676  21.4        7   1.0    6.0    231    -1.025774    14.677591 -13.048581

Now, when it comes to plotting the cumulative elasticity, we still order the dataset by the predicted elasticity, but now we use the debiased versions of the treatment and the outcome to compute that elasticity. This is equivalent to estimating \(\beta_1\) in the following regression model

\[ \hat{y}_i = \beta_0 + \beta_1 \hat{t}_i + e_i \]

where the residuals are as described before.
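As a sanity check (again my own addition, not part of the original evaluation code), we could run exactly that regression on the residual columns; the slope on the price residual is the debiased estimate of the average elasticity.

# final-stage regression: outcome residual on treatment residual;
# the slope is the debiased average price elasticity (Q() quotes the column names)
final_stage = smf.ols("Q('sales-My(X)') ~ Q('price-Mt(X)')", data=debiased_test).fit()
final_stage.params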

plt.figure(figsize=(10,6))

cumm_elast = cumulative_elast_curve_ci(debiased_test_pred, "m3_pred", "sales-My(X)", "price-Mt(X)", min_periods=50, steps=200)
x = np.array(range(len(cumm_elast)))
plt.plot(x/x.max(), cumm_elast, color="C0")

plt.hlines(elast(debiased_test_pred, "sales-My(X)", "price-Mt(X)"), 0, 1, linestyles="--", color="black", label="Avg. Elast.")
plt.xlabel("% of Top Elast. Customers")
plt.ylabel("Elasticity of Top %")
plt.title("Cumulative Elasticity")
plt.legend();

We can do the same thing for the cumulative gain curve, of course.

plt.figure(figsize=(10,6))

cumm_gain = cumulative_gain_ci(debiased_test_pred, "m3_pred", "sales-My(X)", "price-Mt(X)", min_periods=50, steps=200)
x = np.array(range(len(cumm_gain)))
plt.plot(x/x.max(), cumm_gain, color="C1")

plt.plot([0, 1], [0, elast(debiased_test_pred, "sales-My(X)", "price-Mt(X)")], linestyle="--", label="Random Model", color="black")

plt.xlabel("% of Top Elast. Customers")
plt.ylabel("Cumulative Gain")
plt.title("Cumulative Gain on Debiased Sample")
plt.legend();

Notice how similar these plots are to the ones in the previous chapter. This is some indication that the debiasing worked wonders here.

In contrast, let’s see what the cumulative gain plot would look like if we used the original, biased data.

plt.figure(figsize=(10,6))

cumm_gain = cumulative_gain_ci(debiased_test_pred, "m3_pred", "sales", "price", min_periods=50, steps=200)
x = np.array(range(len(cumm_gain)))
plt.plot(x/x.max(), cumm_gain, color="C1")

plt.plot([0, 1], [0, elast(debiased_test_pred, "sales", "price")], linestyle="--", label="Random Model", color="black")

plt.xlabel("% of Top Elast. Customers")
plt.title("Cumulative Gains on Biased Sample")
plt.ylabel("Cumulative Gains")
plt.legend();

The first thing you should notice is that the average elasticity is positive instead of negative. We've seen this before: in the biased data, it looks like sales go up as price increases. As a result, the final point in the cumulative gain plot is positive. This makes little sense, since we know people don't buy more as we increase ice cream prices. And if the average price elasticity is already messed up, any ordering based on it also makes little sense. The bottom line is that this data should not be used for model evaluation.

Orthogonalization with Machine Learning#

In a 2016 paper, Victor Chernozhukov et al. showed that you can also do orthogonalization with machine learning models. This is obviously very recent science and we still have much to discover about what we can and can't do with ML models. Still, it's a very interesting idea to know about.

The nuts and bolts are pretty much the same as what we've already covered. The only difference is that now we use machine learning models for the debiasing.

\[\begin{split} \begin{align} \hat{y}_i &= y_i - M_y(X_i) \\ \hat{t}_i &= t_i - M_t(X_i) \end{align} \end{split}\]

There is a catch, though. As we know very well, machine learning models can be so powerful that they fit the data perfectly, or rather, overfit. Just by looking at the equations above, we can see what happens in that case. If \(M_y\) somehow overfits, the residuals will all be very close to zero. If that happens, it will be hard to see how \(t\) affects \(y\). Similarly, if \(M_t\) somehow overfits, its residuals will also be close to zero. Hence, there won't be enough variation in the treatment residual to see how it impacts the outcome.

To account for that, we need to do sample splitting: we estimate the model on one part of the dataset and make predictions on the other part. The simplest way to do this is to split the test sample in half and fit two models, each estimated on one half of the dataset and making predictions on the other half.
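Here is a minimal sketch of that split-in-half idea, shown for the treatment residuals only (the cross_residuals helper below is mine, not part of the original code or any library):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def cross_residuals(df_fit, df_pred, features, target):
    # fit on one half, compute out-of-sample residuals on the other half
    model = RandomForestRegressor(n_estimators=100).fit(df_fit[features], df_fit[target])
    return df_pred[target] - model.predict(df_pred[features])

features = ["cost", "weekday", "temp"]
half_a, half_b = train_test_split(test, test_size=0.5, random_state=123)

# swap the roles of the two halves and stack the residuals back together
price_res = pd.concat([cross_residuals(half_a, half_b, features, "price"),
                       cross_residuals(half_b, half_a, features, "price")])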

A slightly more elegant implementation uses K-fold cross validation. The advantage is that each model gets trained on a sample which is bigger than half the test set.


Fortunately, this sort of cross prediction is very easy to implement using Sklearn’s cross_val_predict function.

from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestRegressor

X = ["cost", "weekday", "temp"]
t = "price"
y = "sales"

folds = 5

np.random.seed(123)
m_t = RandomForestRegressor(n_estimators=100)
t_res = test[t] - cross_val_predict(m_t, test[X], test[t], cv=folds)

m_y = RandomForestRegressor(n_estimators=100)
y_res = test[y] - cross_val_predict(m_y, test[X], test[y], cv=folds)

Now that we have the residuals, let’s store them as columns on a new dataset.

ml_debiased_test = test.assign(**{
    "sales-ML_y(X)": y_res,
    "price-ML_t(X)": t_res,
})
ml_debiased_test.head()
      temp  weekday  cost  price  sales  sales-ML_y(X)  price-ML_t(X)
7592  27.8        4   1.0    4.9    203       1.747333      -0.639717
4489  27.4        5   1.0    8.2    214      16.084714       1.986000
7680  22.3        7   0.5    8.7    205     -27.699500       2.622268
7303  19.0        1   0.5    8.2    210     -28.049321       2.159750
6676  21.4        7   1.0    6.0    231      16.825929      -1.336283

Finally, we can plot the debiased dataset.

np.random.seed(123)
sns.scatterplot(data=ml_debiased_test.sample(1000),
                x="price-ML_t(X)", y="sales-ML_y(X)", hue="weekday");

Contribute#

Causal Inference for the Brave and True is an open-source textbook on causal inference, dedicated to making the "statistical foundations of science" accessible, both financially and intellectually. The book is written in Python using only free and open-source software, and the original English version is written and maintained by Matheus Facure.

The Chinese edition is a joint translation by 黄文喆 and Assistant Professor 许文立, hosted on the project's Chinese GitHub homepage. We hope the localized content helps more Chinese readers learn and master causal inference methods.

If you find this book helpful and want to support the project, you can support the original author on Patreon.

If financial support isn't convenient for you right now, you can also contribute in the following ways:

  • Fixing typos

  • Suggesting improvements to the translation or wording

  • Reporting parts of the content you couldn't understand

Feel free to open an issue in the issues section of the English or Chinese repository.

Finally, if you enjoy the content of this book, please share it with friends who might be interested and give the project a star on GitHub: English repository / Chinese repository.