# What is (a helpful mental picture of) a causal model?

2022-06-13

\gdef\do#1{\mathrm{do}(#1)}

What is a causal model and how is it different from a “common” statistical model? Here we consider a mental picture and intuition on how one may think about (the subclass of interventional) causal models.1

## Causal models as posets of distributions

A causal model is a partially ordered set of distributions, one for each intervention – this has been my go-to definition ever since Paul and I scribbled dozens of commuting diagrams and posets on whiteboards when working out sensible requirements for transforming one causal model into another [1]. Diagram aficionados may find the link between causality and category theory appealing as it allows us to draw more diagrams and to do some string diagram surgery, yeah 🙃 [2, 3, 4]. Anyways, let’s get back to what a causal model is.

A “common” statistical model models one joint distribution \mathbb{P}_X over variables X = \{A, B, ...\};
a causal model models a set \{\mathbb{P}_X^{\do{i}}\mid i\in\mathcal{I}\} of joint distributions over X, one for each intervention i\in\mathcal{I}.
We may visualise this as follows 2
Here \varnothing denotes the null-intervention and line segments correspond to the modellable distributions for varying model parameters, that is, for different model parameters the distributions \mathbb{P}_X, \mathbb{P}_X^{\do{i_1}}, ... may lie somewhere different on the respective line segments.

The distributions are indexed by interventions i \in \mathcal{I}. The intervention set \mathcal{I} admits a partial ordering that reflects the compositionality of interventions. For example, \do{A=5} \leq \do{A=5, B=5} since we can implement the intervention \do{A=5,B=5} by composing an action that implements \do{A=5} with another that implements \do{B=5}. In contrast, \do{A=5} \not\leq \do{A=1, B=5} since we cannot implement \do{A=1,B=5} by combining \do{A=5} with another intervention. Thus, a causal model is a partially ordered set, a poset, of distributions; for example, the set \{ \mathbb{P}_X^{{\varnothing}}, \mathbb{P}_X^{\do{i_1}}, \mathbb{P}_X^{\do{i_2}}, \mathbb{P}_X^{\do{i_3}} \} inherits relations, such as \mathbb{P}_X^{\do{i_1}} \leq \mathbb{P}_X^{\do{i_3}}, from the partial ordering i_1 \leq i_3 of the interventions.
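
The ordering can be made concrete with a small sketch. It assumes that we only consider interventions that set variables to constants and that each intervention is represented as a Python dict from variable name to value; the helper name `leq` is made up for this illustration. Under these assumptions, i \leq j holds exactly when every assignment made by i is also made by j, so the null-intervention (the empty dict) sits below every other intervention.

```python
def leq(i: dict, j: dict) -> bool:
    """Return True if i <= j, that is, if j sets every variable that i sets, to the same value."""
    return all(var in j and j[var] == val for var, val in i.items())


null = {}                       # null-intervention
do_a5 = {"A": 5}                # do(A=5)
do_a5_b5 = {"A": 5, "B": 5}     # do(A=5, B=5)
do_a1_b5 = {"A": 1, "B": 5}     # do(A=1, B=5)

print(leq(null, do_a5))         # True
print(leq(do_a5, do_a5_b5))     # True
print(leq(do_a5, do_a1_b5))     # False: A would have to be 5, not 1
print(leq(do_a5_b5, do_a5))     # False: the ordering is not symmetric
```

Two interventions such as \do{A=5} and \do{B=5} are incomparable under `leq`, which is why the ordering is only partial.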

## SCMs conveniently describe posets of distributions

Structural Causal Models (SCMs) are one convenient way to describe such a structured set of distributions: A set of equations and noise variables together with instructions on how to manipulate the equations upon intervention is enough to describe the entire poset.

For example, the following structural equations

\begin{align*} A &= -1 + N_A \\ B &= 2 + 2A + N_B \end{align*}

with independent noise variables N_A,N_B \sim \mathcal{N}(0, 1) induce the following distribution over A and B

\begin{pmatrix} A \\ B \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \\ \end{pmatrix} \right) = \mathcal{N} \left( \begin{pmatrix} -1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 2 \\ 2 & 5 \\ \end{pmatrix} \right).

If all we cared about is this one distribution over X = \{A, B\}, we could as well just specify the multivariate Gaussian distribution without referring to any structural equations, that is, the “common” statistical model can also be concisely specified via

\mathbb{P}_X = \mathcal{N} \left( \begin{pmatrix} -1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 2 \\ 2 & 5 \\ \end{pmatrix} \right)

and can be sampled from as follows

# "classical" statistical model
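import numpy as np
rng = np.random.default_rng()
mu = [-1, 0]          # mean vector of (A, B)
sigma = [[1, 2],      # covariance matrix of (A, B)
         [2, 5]]
AB = rng.multivariate_normal(
    mean=mu,
    cov=sigma,
    size=1000
)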

This distribution coincides with the distribution induced by the above structural equations and noise variables3, which we can sample from as follows

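# SCM
# observational distribution
import numpy as np
rng = np.random.default_rng()
Na = rng.normal(size=1000)    # N_A ~ N(0, 1)
Nb = rng.normal(size=1000)    # N_B ~ N(0, 1)
A = -1 + Na
B = 2 + 2*A + Nb
AB = np.column_stack((A, B))  # shape (1000, 2)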
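
As a quick sanity check, one can compare the empirical means and covariances of the two samplers. The following is only a sketch: the seed and the sample size are arbitrary choices, and the estimates agree with the stated mean vector and covariance matrix only up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 200_000                     # arbitrary sample size

# "common" statistical model: sample the joint Gaussian directly
AB_joint = rng.multivariate_normal(mean=[-1, 0], cov=[[1, 2], [2, 5]], size=n)

# SCM: push independent noise through the structural equations
A = -1 + rng.normal(size=n)
B = 2 + 2 * A + rng.normal(size=n)
AB_scm = np.column_stack((A, B))

for name, sample in [("joint", AB_joint), ("SCM", AB_scm)]:
    print(name, "mean:", sample.mean(axis=0).round(2))
    print(name, "cov:")
    print(np.cov(sample, rowvar=False).round(2))
```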

A causal model not only induces this so-called observational distribution, for which we also presented an alternative “common” statistical model above, but also interventional distributions. For SCMs, the instructions to obtain the interventional distributions are simple: (1) replace the structural equations of the intervened-upon variables, (2) consider the distribution induced by the new set of structural equations and the original noise variables. For example, under the intervention \do{B=5}, the structural equations are

\begin{align*} A &= -1 + N_A \\ B &= 5 \qquad {\color{gray}\sout{B = 2 + 2A + N_B}} \end{align*}

and induce a joint distribution over A and B where A \sim \mathcal{N}(-1, 1), B \equiv 5, and A and B are independent. We can sample from the interventional distribution \mathbb{P}_X^{\do{B=5}} as follows

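# SCM
# do(B=5) distribution
import numpy as np
rng = np.random.default_rng()
Na = rng.normal(size=1000)
Nb = rng.normal(size=1000)    # original noise, unused once B is set
A = -1 + Na
# B = 2 + 2*A + Nb
B = 5 * np.ones(1000)
AB = np.column_stack((A, B))  # shape (1000, 2)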
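
The two-step recipe can also be wrapped into one sampler that covers the observational and the interventional distributions alike. This is a minimal sketch for the example SCM only; the function name `sample_scm` and its `do` argument (a dict from variable name to constant) are assumptions made for this illustration, not part of any library.

```python
import numpy as np


def sample_scm(n, do=None, rng=None):
    """Sample (A, B) from the example SCM; `do` maps variable names to constants."""
    do = {} if do is None else do
    if rng is None:
        rng = np.random.default_rng()
    # the noise variables are the same for every intervention
    Na = rng.normal(size=n)
    Nb = rng.normal(size=n)
    # (1) replace the equations of the intervened-upon variables by constants,
    # (2) evaluate the resulting equations in causal order (A before B)
    A = np.full(n, float(do["A"])) if "A" in do else -1 + Na
    B = np.full(n, float(do["B"])) if "B" in do else 2 + 2 * A + Nb
    return np.column_stack((A, B))


rng = np.random.default_rng(0)                      # arbitrary seed
obs = sample_scm(100_000, rng=rng)                  # null-intervention
do_b5 = sample_scm(100_000, do={"B": 5}, rng=rng)   # do(B=5)
print("observational means:", obs.mean(axis=0).round(2))
print("do(B=5) means:      ", do_b5.mean(axis=0).round(2))
```

Passing `do={"A": 5, "B": 5}` yields the distribution for the composed intervention \do{A=5, B=5}.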

## Intervening is not conditioning

Interventional distributions, and their marginals, need not in general coincide with the corresponding conditional distributions of the observational distribution. For example, conditioning the multivariate Gaussian observational distribution \mathbb{P}_X^\varnothing on B=5 we obtain

A\mid B=5 \sim \mathcal{N}\left( \underbrace{\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(5 - \mu_B)}_{1}, \underbrace{\Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}}_{0.2} \right)

which we can sample from as follows

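# condition observational distribution
# on B=5
import numpy as np
rng = np.random.default_rng()
mu = [-1, 0]
S = [[1, 2],
     [2, 5]]
# conditional mean and variance of A given B=5
muA = mu[0] + S[0][1] / S[1][1] * (5 - mu[1])   # = 1
varA = S[0][0] - S[0][1] / S[1][1] * S[1][0]    # = 0.2
A = rng.normal(loc=muA, scale=np.sqrt(varA), size=1000)
B = 5 * np.ones(1000)
AB = np.column_stack((A, B))                    # shape (1000, 2)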
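
The difference between the two operations also shows up in simulation. The sketch below approximates conditioning on B=5 by keeping only those observational samples whose B lies in a narrow window around 5 (the window width, seed, and sample size are arbitrary choices) and contrasts this with intervening, which leaves the structural equation for A untouched.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 1_000_000

# observational samples from the SCM
A = -1 + rng.normal(size=n)
B = 2 + 2 * A + rng.normal(size=n)

# conditioning: select the observations whose B happens to lie close to 5
near_5 = np.abs(B - 5) < 0.1    # arbitrary window width
A_cond = A[near_5]

# intervening: B is set to 5, the equation A = -1 + N_A is unchanged
A_do = -1 + rng.normal(size=n)

print(f"A | B near 5: mean {A_cond.mean():.2f}, variance {A_cond.var():.2f}")
print(f"A, do(B=5):   mean {A_do.mean():.2f}, variance {A_do.var():.2f}")
```

Up to sampling error and the small bias introduced by the window, the first line should come out close to the conditional mean 1 and variance 0.2 derived above, while the second stays close to mean -1 and variance 1.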

This visualisation of the conditional distribution of A given B corresponds to taking a cross section of the above visualisation of the observational joint distribution over X=\{A,B\}.

## Cause and effect

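Under the observational distribution \mathbb{P}_X^\varnothing the marginal distribution of A is \mathcal{N}(-1, 1); under the interventional distribution \mathbb{P}_X^{\do{B=5}} the marginal distribution of A is also \mathcal{N}(-1, 1). If the marginal distribution of A is the same under all interventions on B, B is not a cause of A. This is the case in our example SCM, since the structural equation for A does not involve B. Conversely, A is a cause of B in our example SCM: under \do{A=a} the marginal distribution of B is \mathcal{N}(2 + 2a, 1), which changes with a.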
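
This asymmetry can be checked by simulation. The sketch below intervenes on A and on B in turn and reports the mean of the other variable; the helper names `sample_do_a` and `sample_do_b`, the seed, and the chosen intervention values are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 100_000


def sample_do_a(a):
    """Sample (A, B) under do(A=a): the equation for A is replaced, the one for B is kept."""
    A = np.full(n, float(a))
    B = 2 + 2 * A + rng.normal(size=n)
    return A, B


def sample_do_b(b):
    """Sample (A, B) under do(B=b): the equation for B is replaced, the one for A is kept."""
    A = -1 + rng.normal(size=n)
    B = np.full(n, float(b))
    return A, B


for value in (0, 1, 5):
    _, B = sample_do_a(value)
    A, _ = sample_do_b(value)
    print(f"do(A={value}): mean of B = {B.mean():.2f} | "
          f"do(B={value}): mean of A = {A.mean():.2f}")
```

The mean of B should track 2 + 2a as a varies, whereas the mean of A should stay near -1 whatever value B is set to.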

## Causal discovery

In causal discovery we aim to infer a (structural) causal model that correctly predicts the effects of interventions, even though we are only given samples from a subset of the distributions, often only from the observational distribution. Within our mental picture, we could visualise the task of causal discovery as follows

We aim to infer not merely a model of the distributions for which observations are available (common statistical modelling), but a causal model that enables reasoning about the effects of interventions, even beyond those interventional distributions that we may have observed.

The statistical treatment of causal discovery lays out different approaches that clarify under which additional assumptions causal structure can indeed be (partially) identified. Since a causal model is more than one observational joint distribution and makes predictions about an entire poset of interventional (joint) distributions, we require assumptions that reach beyond the observed and, for example, exclude the existence of unobserved variables that cause several of the observed variables or restrict the distributions of the noise terms (cf. this tweet).
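
To illustrate why observational data alone does not pin down the causal model, the sketch below compares the SCM from the main text with the first alternative SCM from footnote 3, in which B appears in the structural equation for A rather than the other way round. The function names, the seed, and the sample size are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 200_000


def scm_a_to_b(do_b=None):
    """SCM from the main text: A = -1 + N_A, B = 2 + 2A + N_B."""
    A = -1 + rng.normal(size=n)
    B = 2 + 2 * A + rng.normal(size=n) if do_b is None else np.full(n, float(do_b))
    return np.column_stack((A, B))


def scm_b_to_a(do_b=None):
    """First SCM from footnote 3: B = sqrt(5) N_B, A = -1 + (2/5) B + N_A / sqrt(5)."""
    B = np.sqrt(5) * rng.normal(size=n) if do_b is None else np.full(n, float(do_b))
    A = -1 + (2 / 5) * B + rng.normal(size=n) / np.sqrt(5)
    return np.column_stack((A, B))


for name, scm in [("A -> B", scm_a_to_b), ("B -> A", scm_b_to_a)]:
    obs = scm()
    intervened = scm(do_b=5)
    print(name, "observational mean:", obs.mean(axis=0).round(2))
    print(name, "observational cov:")
    print(np.cov(obs, rowvar=False).round(2))
    print(name, f"mean of A under do(B=5): {intervened[:, 0].mean():.2f}")
```

Both SCMs should reproduce the same observational mean and covariance up to sampling error, yet they disagree about the mean of A under \do{B=5} (about -1 for the first and about 1 for the second), so samples from the observational distribution alone cannot tell them apart.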

## References

1. PK Rubenstein, S Weichwald, S Bongers, JM Mooij, D Janzing, M Grosse-Wentrup, B Schölkopf (co-first authorship between PKR and SW). Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
2. B Fong. arXiv Preprint arXiv:1301.6201, 2013.
3. B Jacobs, A Kissinger, F Zanasi. Foundations of Software Science and Computation Structures, 2019.
4. EF Rischel, S Weichwald. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI), 2021.

1. This causerie is adapted from my earlier tweetorial and includes some anecdotes and new code snippets.

2. Due credit and thanks to my colleague Paul for the poset of distributions TikZ template premiered at UAI 2017 (see also the paper, slides, and poster).

3. The same distribution is also induced by other SCMs, such as, for example,

\begin{align*} A &= -1 + \frac{2}{5}B + \frac{1}{\sqrt{5}} N_A \\ B &= \sqrt{5}N_B \end{align*}

or

\begin{align*} A &= -1 - H + \frac{1}{\sqrt{2}} N_A \\ B &= \frac{3}{2} - H + \frac{3}{2} A + \frac{\sqrt{3}}{2} N_B \\ H &= \frac{1}{\sqrt{2}} N_H \end{align*}

with independent noise variables N_A,N_B,N_H \sim \mathcal{N}(0, 1) and H unobserved. The SCMs in the main text and in this footnote are said to be observationally equivalent, as they induce the same observational distribution. Observationally equivalent SCMs need not induce the same interventional distributions, however; SCMs are richer than “common” statistical models and are not uniquely determined by the observational distribution they induce.
