*2022-06-13*

\gdef\do#1{\mathrm{do}(#1)}

What is a causal model and how is it different from a “common”
statistical model? Here we consider a mental picture and intuition on
how one may think about (the subclass of interventional) causal
models.^{1}

**A causal model is a partially ordered set of distributions,
one for each intervention** – this has been my go-to definition
ever since Paul and I
scribbled dozens of commuting diagrams and posets on whiteboards while
working out sensible requirements for transforming one causal model into
another [1].
Diagram aficionados may find the link between causality and category
theory appealing as it allows us to draw more diagrams and to do some
string diagram surgery, yeah 🙃 [2–4].
Anyways, let’s get back to what a causal model is.

A “common” statistical model models one joint distribution \mathbb{P}_X over variables X = \{A, B, ...\};

a causal model models a set \{\mathbb{P}_X^{\do{i}}\mid i\in\mathcal{I}\}
of joint distributions over X, one for
each intervention i\in\mathcal{I}.

We may visualise this as follows ^{2}

Here \varnothing denotes the null intervention and the
line segments correspond to the modellable distributions for varying
model parameters; that is, for different model parameters the
distributions \mathbb{P}_X,
\mathbb{P}_X^{\do{i_1}}, \ldots may lie at different points on the
respective line segments.

The distributions are indexed by interventions i \in \mathcal{I}. The intervention set \mathcal{I} admits a partial ordering that reflects the compositionality of interventions: for example, \do{A=5} \leq \do{A=5, B=5} but \do{A=5} \not\leq \do{A=1, B=5}, since we can implement the intervention \do{A=5,B=5} by composing an action that implements \do{A=5} with another that also sets \do{B=5}, but cannot implement \do{A=1,B=5} by combining \do{A=5} with another intervention.

Thus, a causal model is a partially ordered set, a poset, of distributions; for example, the set \{ \mathbb{P}_X^{\varnothing}, \mathbb{P}_X^{\do{i_1}}, \mathbb{P}_X^{\do{i_2}}, \mathbb{P}_X^{\do{i_3}} \} inherits relations, such as \mathbb{P}_X^{\do{i_1}} \leq \mathbb{P}_X^{\do{i_3}}, from the partial ordering i_1 \leq i_3 of the interventions.
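As a toy sketch of this ordering (my own illustration, not from the original post: here a perfect intervention is represented as a dict mapping intervened variables to the values they are set to), the composability criterion can be coded as:

```python
def leq(i, j):
    """i <= j iff j can be obtained by composing i with further
    interventions, i.e. j agrees with i on every variable i sets."""
    return all(var in j and j[var] == val for var, val in i.items())

do_A5 = {"A": 5}
do_A5_B5 = {"A": 5, "B": 5}
do_A1_B5 = {"A": 1, "B": 5}

assert leq(do_A5, do_A5_B5)      # do(A=5) <= do(A=5, B=5)
assert not leq(do_A5, do_A1_B5)  # do(A=5) not<= do(A=1, B=5)
assert leq({}, do_A5)            # the null intervention precedes all
```

Note that `leq` is only a partial order: for incomparable interventions such as \do{A=5} and \do{B=5}, neither direction holds.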

Structural Causal Models (SCMs) are **one convenient way to
describe such a structured set of distributions**: A set of
equations and noise variables together with instructions on how to
manipulate the equations upon intervention is enough to describe the
entire poset.

For example, the following structural equations \begin{align*} A &= -1 + N_A \\ B &= 2 + 2A + N_B \end{align*} with independent noise variables N_A,N_B \sim \mathcal{N}(0, 1) induce the following joint distribution of A and B \begin{pmatrix} A \\ B \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \\ \end{pmatrix} \right) = \mathcal{N} \left( \begin{pmatrix} -1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 2 \\ 2 & 5 \\ \end{pmatrix} \right). If all we cared about were this one distribution over X = \{A, B\}, we could just as well specify the multivariate Gaussian distribution without referring to any structural equations; that is, the “common” statistical model can be concisely specified as \mathbb{P}_X = \mathcal{N} \left( \begin{pmatrix} -1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 2 \\ 2 & 5 \\ \end{pmatrix} \right) and can be sampled from as follows

```
# "classical" statistical model
import numpy as np
rng = np.random.default_rng()

mu = [-1, 0]
sigma = [[1, 2],
         [2, 5]]
AB = rng.multivariate_normal(
    mean=mu,
    cov=sigma,
    size=1000)
```

This distribution coincides with the distribution induced by the
above structural equations and noise variables^{3},
which we can sample from as follows

```
# SCM
# observational distribution
Na = rng.normal(size=1000)
Nb = rng.normal(size=1000)
A = -1 + Na
B = 2 + 2*A + Nb
AB = np.column_stack([A, B])
```

A causal model not only induces this so-called observational distribution, for which we also presented an alternative “common” statistical model above, but also interventional distributions. For SCMs, the instructions to obtain the interventional distributions are simple: (1) replace the structural equations of the intervened-upon variables; (2) consider the distribution induced by the new set of structural equations and the original noise variables. For example, under the intervention \do{B=5}, the structural equations are \begin{align*} A &= -1 + N_A \\ B &= 5 \qquad {\color{gray}\sout{B = 2 + 2A + N_B}} \end{align*} and induce a joint distribution over A and B where A \sim \mathcal{N}(-1, 1), B \equiv 5, and A and B are independent. We can sample from the interventional distribution \mathbb{P}_X^{\do{B=5}} as follows

```
# SCM
# do(B=5) distribution
Na = rng.normal(size=1000)
Nb = rng.normal(size=1000)
A = -1 + Na
# B = 2 + 2*A + Nb
B = 5 * np.ones(1000)
AB = np.column_stack([A, B])
```

Interventional distributions (and their marginals) need not, in general, coincide with the corresponding conditional distributions of the observational distribution. For example, conditioning the multivariate Gaussian observational distribution \mathbb{P}_X^\varnothing on B=5, we obtain A\mid B=5 \sim \mathcal{N}\left( \underbrace{\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(5 - \mu_B)}_{1}, \underbrace{\Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}}_{0.2} \right) which we can sample from as follows

```
# condition observational distribution
# on B=5
mu = [-1, 0]
S = [[1, 2],
     [2, 5]]
muA = -1 + 2 * (1/5) * (5 - 0)
sigmaA = 1 - 2 * (1/5) * 2
A = rng.multivariate_normal(
    mean=[muA],
    cov=[[sigmaA]],
    size=1000)
B = 5 * np.ones(1000)
AB = np.column_stack([A, B])
```
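To make the contrast between intervening and conditioning concrete, here is a small numerical check (my own sketch, not from the original post): we sample A under \do{B=5} directly from its unchanged equation, and approximate conditioning on B=5 by crude rejection sampling from the observational distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# interventional: under do(B=5), A keeps its own equation A = -1 + N_A
A_do = -1 + rng.normal(size=n)

# conditional: sample observationally, keep draws with B close to 5
Na, Nb = rng.normal(size=(2, n))
A_obs = -1 + Na
B_obs = 2 + 2 * A_obs + Nb
A_cond = A_obs[np.abs(B_obs - 5) < 0.05]  # crude rejection sampling

print(A_do.mean())    # close to -1: intervening on B does not move A
print(A_cond.mean())  # close to  1: conditioning on B=5 does
```

Observing B=5 is evidence that A was drawn unusually large, whereas setting B=5 by force tells us nothing about A.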

This visualisation of the conditional distribution of A given B corresponds to taking a cross section of the above visualisation of the observational joint distribution over X=\{A,B\}.

Under the observational distribution \mathbb{P}_X^\varnothing the marginal distribution of A is \mathcal{N}(-1, 1); under the interventional distribution \mathbb{P}_X^{\do{B=5}} the marginal distribution of A is \mathcal{N}(-1, 1). If the marginal distribution of A is the same under all interventions on B, B is not a cause of A. In our example SCM, A is a cause of B.
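This asymmetry can be checked by simulation (a sketch of my own, using the structural equations from above): intervening on B leaves the marginal of A untouched, while intervening on A shifts the marginal of B.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# marginal of A under do(B=5): A's equation does not involve B,
# so the intervention leaves A's distribution at N(-1, 1)
A_doB = -1 + rng.normal(size=n)
assert abs(A_doB.mean() + 1) < 0.05

# marginal of B under do(A=5): B's equation does involve A,
# so the intervention shifts B's mean from 0 to 2 + 2*5 = 12
A_do = 5 * np.ones(n)
B_doA = 2 + 2 * A_do + rng.normal(size=n)
assert abs(B_doA.mean() - 12) < 0.05

# observationally, E[B] = 2 + 2*(-1) = 0
A_obs = -1 + rng.normal(size=n)
B_obs = 2 + 2 * A_obs + rng.normal(size=n)
assert abs(B_obs.mean() - 0) < 0.05
```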

In causal discovery we aim to infer a (structural) causal model that correctly predicts the effects of interventions, while we are only given samples from a subset of the distributions, often only the observational distribution. Within our mental picture, we could visualise the task of causal discovery as follows

We aim to infer a model that not only captures the distributions for which observations are available (common statistical modelling), but that also enables reasoning about the effects of interventions, even beyond those interventional distributions we may have observed.

The statistical treatment of causal discovery lays out different approaches and clarifies under which additional assumptions the causal structure can indeed be (partially) identified. Since a causal model is more than one observational joint distribution and makes predictions about an entire poset of interventional (joint) distributions, we require assumptions that reach beyond the observed data and, for example, exclude the existence of unobserved variables that cause several of the observed variables, or restrict the distributions of the noise terms (cf. this tweet).

1.

PK Rubenstein, S Weichwald, S Bongers, JM Mooij, D Janzing, M Grosse-Wentrup, B Schölkopf
Causal Consistency of Structural Equation Models
Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI), 2017
(co-first authorship between PKR and SW)

2.

B Fong
Causal Theories: A Categorical Perspective on Bayesian Networks
arXiv preprint arXiv:1301.6201, 2013

3.

B Jacobs, A Kissinger, F Zanasi
Causal Inference by String Diagram Surgery
Foundations of Software Science and Computation Structures (FoSSaCS), 2019

4.

EF Rischel, S Weichwald
Compositional Abstraction Error and a Category of Causal Models
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI), 2021

This causerie is adapted from my earlier tweetorial and includes some anecdotes and new code snippets.↩︎

Due credit and thanks to my colleague Paul for the poset of distributions TikZ template premiered at UAI 2017 (see also the paper, slides, and poster).↩︎

The same distribution is also induced by other SCMs, such as, for example, \begin{align*} A &= -1 + \frac{2}{5}B + \frac{1}{\sqrt{5}} N_A \\ B &= \sqrt{5}N_B \end{align*} or \begin{align*} A &= -1 - H + \frac{1}{\sqrt{2}} N_A \\ B &= \frac{3}{2} - H + \frac{3}{{2}} A + \frac{\sqrt{3}}{2} N_B \\ H &= \frac{1}{\sqrt{2}} N_H \end{align*} with independent noise variables N_A,N_B,N_H \sim \mathcal{N}(0, 1) and H unobserved. The SCMs in the main text and in this footnote are said to be observationally equivalent, as they induce the same observational distribution. Observationally equivalent SCMs need not induce the same interventional distributions, however; SCMs are richer than “common” statistical models and are not uniquely determined by the observational distribution they induce.↩︎
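As a sanity check of this observational equivalence (my own sketch, not part of the original footnote), we can sample from the SCM in the main text and from the first alternative SCM above and confirm that both samples match \mathcal{N}\left(\begin{pmatrix}-1\\0\end{pmatrix}, \begin{pmatrix}1&2\\2&5\end{pmatrix}\right) up to sampling error:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# SCM from the main text: A -> B
Na, Nb = rng.normal(size=(2, n))
A1 = -1 + Na
B1 = 2 + 2 * A1 + Nb

# first alternative SCM from this footnote: B -> A
Na, Nb = rng.normal(size=(2, n))
B2 = np.sqrt(5) * Nb
A2 = -1 + 2 / 5 * B2 + Na / np.sqrt(5)

# both samples agree with N([-1, 0], [[1, 2], [2, 5]])
for A, B in [(A1, B1), (A2, B2)]:
    assert np.allclose([A.mean(), B.mean()], [-1, 0], atol=0.02)
    assert np.allclose(np.cov(A, B), [[1, 2], [2, 5]], atol=0.05)
```

Under \do{B=5}, however, the two SCMs disagree: the first predicts A \sim \mathcal{N}(-1, 1), while the alternative predicts A \sim \mathcal{N}(1, 1/5).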