# Machine Learning Trick of the Day (8): Instrumental Thinking5

· Read in 8 minutes · 1499 words · ·

This trick is unlike the others we've conjured. It will not reveal a clever manipulation of a probability or an integral or a derivative, or produce a code one-liner. But like all our other tricks, it will give us a powerful instrument in our toolbox. The instruments of thinking are rare and always sought-after, because with them we can actively and confidently challenge the assumptions and limitations of our machine learning practice. Something we must constantly do.

One of the most common tasks we can attempt with data is to use features x to make predictions of targets y. This regression often makes a key assumption:  any noise $\epsilon$ only affects the regression targets y (see figure 1(left)). In linear regression this is:

If this assumption is actually true for the problem we are addressing—that features x are linearly related to targets y using a set of parameters $\beta$, and noise only affects the targets—then we can also call our model a structural model (or structural equation). Structural models can be used to make cause-effect statements, allowing us to use the model to make predictions about how actively controlling the features x affect the targets y. What if this is not true; it will often not be. We explore this question here, and develop one solution based on instrumental variables.

Figure 1. Three common regression scenarios.

# Learning with Errors in Variables

Consider what is known as an errors-in-variables scenario1 (see figure 1 (centre)): a regression problem where the same source of noise $\epsilon$ affects the features and the target. Here, predictions of the target y are affected in two ways: directly through the variability in y, and indirectly through the effect on y that the noise affecting x has on it. This is generally an undesirable scenario.

If we ever find ourselves in this situation, then we no longer have a structural equation; any  model we have will simply track correlations in the data and will leave us with biased predictions. This is where our trick of the day enters. The instrumental variables trick asks us to use the data itself to account for noise, and makes it easier for us to define structural models and to make causal predictions.

The instrumental variables idea is conceptually simple: we introduce new observed variables z, called instrumental variables, into our model; figure 1 (right). And this is the trick: instrumental variables are special subset of the data we already have, but they allow us to remove the effect of confounders. Confounders are the noise and other interactions in our system whose effect we may not know or be able to observe, but which affect our ability to write structural models. We manipulate these instruments so that we transform the undesirable regression in figure 1(centre) into a structural regression of the form of figure 1 (left).

For a variable to be an instrumental variable, it must satisfy two conditions:

• The instruments z should be (strongly) correlated with the features x. There should be a direct association between the instrument and the data we wish to use to make predictions.
• The instruments should be uncorrelated with the noise $\epsilon$. This says that changes in instruments z should lead to changes in x, but not to y; otherwise z would be subject to the same errors-in-variables problem.

The common example is the prediction of future earnings (y) based on education (x). A person's ability ($\epsilon$) affects both their education and earnings, so is a confounder and source of common noise in this setting. There are many instruments we can find to satisfy the two needed conditions, meaning that there are no unique instruments. As instruments, we could use their mother's birth-month, the number of siblings they have,  the proximity of the person to schools2, or even the person's month of birth3.

# Two-stage Least Squares

The instrumental variables construction simply says that we should remove (marginalise) the effect of the variables (x) that have coupled errors, and instead consider the following marginalised distribution:

Where $\theta$ are the model parameters we are interested in learning, and $\phi$ are parameters of a new predictor relating z and x. This integral suggests a very simple new type of regression algorithm that exploits the integral in two stages.

1. Train a model to predict the inputs x given instruments z. This can be a regression model, or a more complex high-dimensional conditional generative model. Call these predictions $\hat{\mathbf{x}}$.
2. We now train a model to predict targets y given the predicted inputs $\hat{\mathbf{x}}$.

If a linear regression model is used for both these steps, then we will recover the famous two-stage least squares (2SLS) algorithm. Using the closed form solution (the normal equations) for each stage of the regression, we get:

Stage 1: Feature prediction using instruments

• Optimal parameters: $\boldsymbol{\phi} = (\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top \mathbf{X}$
• Predictions: $\hat{\mathbf{X}}=\mathbf{Z}\mathbf{\phi}=\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top \mathbf{X} =\mathbf{P}_z\mathbf{X}$

Stage 2: Target prediction using predicted features

• Optimal stage 2 parameters: $\boldsymbol{\theta} = (\hat{\mathbf{X}}^\top\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}^\top \mathbf{y} = (\mathbf{X}\mathbf{P}_z\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{P}_z\mathbf{y}$

2SLS does something intuitive: by introducing the instrumental variable, it creates a way to eliminate the paths through which confounding noise enters the model to create an errors-in-variables scenario. Using the predicted inputs $\hat{x}$ we recover a structural equation.

There are many powerful real-word examples where this thinking has been used, especially in settings where we cannot do randomised control trials but must rely on observational data. There is much more that could be written about these applications alone. Yet, how often will we have a situation in which we have an additional source of data to use as an instrument?

Instrumental variables are hard to find in real-world problems. And the assumption, hidden in the use of the Normal equations, that the number of instruments we use is the same as the number of input features (called the just-specified case), makes it difficult in high-dimensional problems. But the power of instrumental variables is not lost. They still shine, especially in settings where we control the definition of all the variables involved. One such area is reinforcement learning.

# Reinforcement Learning with Instruments

It may not look like it, but the problem of learning value-functions is an errors-in-variables scenario. Our problem is to learn a linear value function using features $s_x$ (when in state x)  using parameters $\theta$ so that $V(x) = s_x^\top\theta$.  The detailed derivations of what now follows are given in the paper by Bradtke and Barto (1996)4.

Let's start from the definition of the value function under transition distribution $p(x,x')$ from state x to x', where a reward $R(x,x')$ is observed upon transition and $\gamma$ is a discount factor:

We can also rewrite the value function as an expected immediate reward $\bar{r}_x =\sum_{x'}p(x,x')R(x,x')$ and an average next-state value. With this rewriting we will find a regression problem in plain sight. Let's first substitute the linear value function, and rewrite this equation in terms of $\bar{r}$:

This is a regression from features that capture the change-in-state $\Delta$ to the average reward $\bar{r}_x$ using parameters $\theta$. It is also an errors-in-variables regression: both sides of the equation are affected by the same source of variability given by the unknown state transition dynamics $p(x,x')$.

But we do have a trick for such scenarios: we can use instrumental variables regression and remain able to learn value-function parameters that correctly capture the causal structure of future rewards.

Because we control this setting, with a bit of thought, we can conclude that a set of instrumental variables that are strongly correlated with the features $\Delta$ are the state features $s$ themselves. If we make this choice, we can apply exactly the two-stage least squares algorithm:

Stage 1: Instrumental variables regression

• Linear instrumental parameters: $\phi = (S_t^\top S_t)^{-1}S_t^\top\Delta$
• Instrumental prediction: $\hat{\Delta}=S_t^\top\phi$

Stage 2: Average reward prediction

• Parameter estimation: $\theta = (S_t^\top\Delta)^{-1}S_t^\top r_t$

In reinforcement learning, this approach is known as Least squares TD (LSTD) learning. Most RL will instead reach this conclusion using the theory of Bellman projections. But this probabilistic viewpoint through instrumental variables means that we can think of alternative ways of extending this view.

We can go much deeper, and these papers can be used to explore this topic further:

# Summary

The instrumental variables tell us to critically consider the assumptions underlying the models we develop and to think deeply about how to use their predictions correctly. The importance of this for our machine learning practice cannot be overstated.

Like every trick in this series, the instrumental variables give us an alternative way to think about existing problems. It is to find these alternative views that I write this series, and is the real reason to study these tricks at all: they give us new ways to see.

Complement this trick with other tricks in the series. Or read one my earliest pieces on a Widely-applicable information criterion, or a piece of exploratory thinking on Decolonising artificial intelligence.

## Some References

1.
Durbin, J. Errors in Variables. Review of the International Statistical Institute 22, 23 (1954).
2.
David Card and Alan B. Krueger. Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania. The American Economic Review 84, 772–793 (1994). [Source]
3.
Angrist, J. D. Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records. American Economic Review 80, 313–336 (1990). [Source]
4.
Bradtke, S. J. & Barto, A. G. Linear Least-Squares algorithms for temporal difference learning. Machine Learning 22, 33–57 (1996).