2022-06-01
I ponder, on occasion, what hinders fruitful practical applications of causal modelling and causal machine learning. Here are some partial musings and pointers as food for thought on how we may move forward (and as entry points into some of the relevant literature).
Speech recognition, text translation, automatic video captioning, … isn’t machine learning already fruitfully applied?
The deep learning approach to machine learning has arguably revolutionised data science. Fitting expressive function approximators and high-dimensional density estimators as black boxes to huge datasets has led to impressive language, computer vision, and RL tools [1,2,3 and various others]. Yet in high-risk applications, such as treatment planning or therapy chatbots for depressed and suicidal patients, we cannot deploy black-box models and risk severe failures [4–11]. Instead, we need models whose behaviour remains reliable and predictable under environmental shifts and interventions. For example, a vision system for autonomous driving ought to be robust to environmental shifts and work reliably in a white winter landscape, on a colourful autumn parkway, and between trucks with large-area nature paintings on them.
So we often require modelling abilities beyond those of classical probabilistic and machine learning models, which predict future observations of a system under the same conditions that held when we devised the model (that is, from the observational distribution). Perhaps we need causality for machine learning [12,13]. With a statistical causal model we can also reason about future observations of the system when it is subjected to external manipulations or environmental shifts (that is, about interventional distributions) [14–16].
This ability is crucial for reliable predictions and data-driven decisions, as the following example aims to illustrate (here, it suffices to understand the graph as an intuitive depiction of the cause-effect relationships, while its causal semantics and correspondence to an accompanying structural equation model can also be made formally precise):

$$\text{diabetes} \gets \text{insulin deficiency} \to \text{blurred vision} \gets \text{myopia}$$
A non-causal model may wrongly lead us to treat blurred vision (the side effect) instead of insulin deficiency (the cause), since blurred vision predicts but does not cause diabetes (the effect); likewise, the model may fail to accurately predict the incidence of diabetes in a different environment and population in which the prevalence of myopia (another cause of blurred vision) is higher than in the training population.
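To make this concrete, here is a minimal simulation sketch in Python of a structural equation model compatible with the graph above (the probabilities are made up purely for illustration): blurred vision predicts diabetes in the observational distribution, yet intervening on blurred vision leaves the incidence of diabetes unchanged, and its predictive strength degrades when the prevalence of myopia shifts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def simulate(do_blurred_vision=None, p_myopia=0.2):
    """Sample from a structural equation model compatible with the graph
    diabetes <- insulin_deficiency -> blurred_vision <- myopia."""
    insulin_deficiency = rng.binomial(1, 0.1, size=n)
    myopia = rng.binomial(1, p_myopia, size=n)
    diabetes = rng.binomial(1, np.where(insulin_deficiency == 1, 0.8, 0.05))
    p_blur = np.clip(0.05 + 0.7 * insulin_deficiency + 0.6 * myopia, 0, 0.95)
    blurred_vision = rng.binomial(1, p_blur)
    if do_blurred_vision is not None:
        # an intervention sets blurred vision by external force and thereby
        # cuts its dependence on insulin deficiency and myopia
        blurred_vision = np.full(n, do_blurred_vision)
    return diabetes, blurred_vision

# observationally, blurred vision predicts diabetes ...
diabetes, blurred = simulate()
print(diabetes[blurred == 1].mean(), diabetes[blurred == 0].mean())

# ... yet "treating" blurred vision, do(blurred_vision := 0), leaves the
# incidence of diabetes unchanged
diabetes_do, _ = simulate(do_blurred_vision=0)
print(diabetes_do.mean(), diabetes.mean())

# and in a population with higher myopia prevalence, blurred vision is a
# markedly weaker predictor of diabetes
diabetes_hi, blurred_hi = simulate(p_myopia=0.8)
print(diabetes_hi[blurred_hi == 1].mean())
```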
Indeed, research into causality has gained traction in the machine learning community, not least since Judea Pearl received the 2011 Turing Award for his work on probabilistic and causal reasoning, and research foci have shifted accordingly, for example, from representation learning [17] to causal representation learning [18].
Causal machine learning to the rescue?
The models deployed today for prediction tasks on text, image, video, and voice data are highly complex and non-linear with billions of parameters, composed of dozens of stacked neural network layers, and take days to train in the cloud. This sheer model complexity and flexibility appears to be key to the success of predictive machine learning and allows scaling to high-dimensional (low-risk) applications such as ad placement or text translation. While equipping these machine learning models with causal reasoning abilities is promising, it ain’t easy.
First, in statistical causal modelling we often require restrictive assumptions to provide a statistical account of causal model properties and to characterise their identifiability from data. There is a substantial gap between machine learning engineering practice and statistical causal modelling theory. On the one hand, there are universal function approximators, massive datasets, flexible high-dimensional density estimators, and a model zoo of often heavily overparameterised models; on the other hand, there are restricted model classes with limiting assumptions, for example on the linearity of cause-effect relationships or the (non-)existence of relevant yet unobserved variables, and impossibility results that characterise the limits of learning causal structure from data [19–24].
While machine learning models over thousands of variables are commonplace, for example, for high-resolution images, we struggle to scale causal structure learning algorithms beyond a few dozen variables.
We may make progress despite this gap by improving machine learning via causal principles instead of aiming to “solve causality via machine learning and learn the full causal graph from data”. For example, abstract causal knowledge may act as a weak supervision signal for improving rich machine learning pipelines. Here, abstract causal knowledge refers to knowledge that is often only partial and comes in the form of high-level properties instead of exact causal knowledge of all model variables and cause-effect relationships. In applications, systematically incorporating such prior knowledge can benefit our machine learning analyses [25].
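As one hypothetical instance of such weak supervision: suppose we know the abstract causal fact that image brightness does not cause the class label. Without knowing the full causal graph, we can encode this as an intervention-based data augmentation. A minimal sketch, with made-up data and a hypothetical `augment_brightness` helper:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_brightness(images, scale=0.3):
    """Hypothetical augmentation: simulate interventions on brightness,
    which we know (abstract causal knowledge) does not cause the label."""
    shifts = rng.uniform(-scale, scale, size=(len(images), 1, 1))
    return np.clip(images + shifts, 0.0, 1.0)

# made-up stand-in for an image dataset: 256 grayscale 8x8 images + labels
images = rng.uniform(0, 1, size=(256, 8, 8))
labels = rng.integers(0, 2, size=256)

# appending interventional copies with unchanged labels encodes the causal
# knowledge "brightness does not cause the label" as a training signal
aug_images = np.concatenate([images, augment_brightness(images)])
aug_labels = np.concatenate([labels, labels])
```

Any standard classifier trained on the augmented set is nudged towards the invariance implied by the causal knowledge, without us ever specifying a full causal graph.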
Second, in causal inference we often implicitly assume that the observable variables are meaningful causal entities. This assumption is often inapt and hinders fruitful application. In neuroimaging, for example, it is nonsensical to apply a causal structure learning algorithm to the raw fMRI voxel values or EEG signals. Before learning the causal structure, we need to construct representations, or recover latent variables from the observed ones, that are meaningful causal entities. We face similar issues when we transform the variables of an existing causal model in order to simplify it. For example, even if we could obtain a causal model over several billions of neurons, this model would hardly help a neurologist decide on patient treatment plans; only by systematically abstracting it into a manageable model on the level of brain regions, while retaining the essential causal information, do we obtain a pragmatically useful model. In both cases, when we recover latent causal variables or aggregate variables to simplify a model, we easily break the variables’ and the model’s causal semantics [26,27].
Conversely, choosing the wrong variables, or being unable to measure the right ones, conceptually hinders the applicability of causal inference methodology.
To address this hurdle, researchers have begun formalising and characterising causal model transformations and abstractions that guarantee that the causal semantics stay intact when the underlying variables are transformed [28–32]. This conceptual understanding may inform schemes to learn causal variables from data on observable variables [33–36]; still, we have almost certainly not yet exhausted the potential for improving and regularising deep representation learning via causal (consistency and compositionality) constraints derived from these conceptual works. A notion of approximate correspondence between two causal models, with an associated approximation error, may be used to learn optimal simplifications of causal models and causal representations.
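To illustrate what it can mean for the causal semantics to stay intact under such a transformation, here is a minimal sketch with a made-up linear micro model: summing micro variables into macro variables such that every micro realisation of a macro intervention yields the same macro effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# micro model: neurons X1, X2 (region A) drive neurons Y1, Y2 (region B)
# via Y = W @ X + noise; the column sums of W are equal (both 1.0), which
# is what makes summation an exact abstraction in this constructed example
W = np.array([[0.7, 0.4],
              [0.3, 0.6]])

def macro_effect_under_micro_intervention(x1, x2):
    """do(X1=x1, X2=x2) in the micro model; return samples of B = Y1 + Y2."""
    noise = rng.normal(0.0, 0.1, size=(n, 2))
    Y = np.array([x1, x2]) @ W.T + noise
    return Y.sum(axis=1)

# two different micro interventions realising the same macro value A = 3
b_alloc1 = macro_effect_under_micro_intervention(1.0, 2.0)
b_alloc2 = macro_effect_under_micro_intervention(2.5, 0.5)

# the macro model "B = 1.0 * A + noise" predicts B ≈ 3 either way
print(b_alloc1.mean(), b_alloc2.mean())
```

Had the column sums of W differed, different micro allocations of the same macro intervention would have produced different distributions of B, and no macro model over (A, B) could summarise the micro model without error; this is the kind of breakage that the cited formal notions of transformation and abstraction make precise.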
Third, the scarcity of real-world data with known
cause-effect relationships hinders the development of causal structure
learning algorithms. Instead of real-world data, benchmarks of causal
structure learning algorithms commonly rely on synthetic data and causal
additive noise models. Synthetic data often exhibits certain regularity
patterns that may be exploited for structure learning but are perhaps
implausible to assume for real-world data [37,38]. If
benchmark performance does not reflect what we actually care about, we
risk developing tools that solve the wrong task and are thus unfit for
fruitful application.
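As a minimal sketch of one such regularity pattern (assuming the common benchmark setup of a linear additive-noise model with comparable noise scales): marginal variances tend to grow along the causal order, so merely sorting variables by variance can recover the causal order without any actual causal discovery, a shortcut that real-world data need not afford.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# linear additive-noise chain X1 -> X2 -> X3, as commonly simulated in
# benchmarks, with unit-variance noise and edge weights of magnitude > 1
X1 = rng.normal(size=n)
X2 = 1.5 * X1 + rng.normal(size=n)
X3 = 1.5 * X2 + rng.normal(size=n)
data = np.stack([X1, X2, X3], axis=1)

# a "structure learner" that only looks at marginal variances ...
order = np.argsort(data.var(axis=0))
print(order)  # [0 1 2], the true causal order, found without any causal reasoning
```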
Obviously, data from real-world systems with known cause-effect
relationships and the ability to intervene would be golden. Lacking such
data, we may hope to somehow improve benchmarking, while I personally
remain sceptical that we can make progress without rethinking the
seemingly innocent and generic assumption of causal additive noise and
how our models capture stochasticity. More generally, perhaps
applications do not even require us to “learn the full causal graph from
data” in the first place but instead to answer concrete causal questions
and to choose optimal actions; if so, we may wish to develop benchmarks
for the eventual causal task and explore the potential for synergies
between control, reinforcement learning, and causality [39]
instead of improving the benchmarking of causal graph recovery as a (possibly dispensable) intermediate step.
Which causal tasks are we aiming to solve in practice and how do we align our research on causal modelling and causal machine learning to make real progress toward that goal?