
Moving beyond the (non-)applicability of causal ML

2022-06-01

I ponder, on occasion, what hinders fruitful practical applications of causal modelling and causal machine learning. Here are partial musings and pointers as nosh for thought on how we may move forward (and as entry points into some of the relevant literature).

Speech recognition, text translation, automatic video captioning, … isn’t machine learning already fruitfully applied?

The deep learning approach to machine learning has arguably revolutionised data science. Fitting expressive function approximators and high-dimensional density estimators as black boxes to huge datasets has led to impressive language, computer vision, and RL tools [1,2,3 and various others]. Yet, in high-risk applications, such as treatment planning and therapy chatbots for depressed or suicidal patients, we cannot deploy black-box models and risk severe failures [4–11]. Instead, we need models whose behaviour remains reliable and predictable also under environmental shifts and interventions. For example, a vision system for autonomous driving ought to be robust against environmental shifts and work reliably in a white winter landscape, on a colourful autumn parkway, and between trucks with large-area natural paintings on them.

So we often require modelling abilities beyond those of classical probabilistic and machine learning models, which predict future observations of a system under the same conditions that held true when we devised the model (that is, they model the observational distribution). Perhaps we need causality for machine learning [12,13]. With a statistical causal model we may also reason about future observations of the system when it is subjected to external manipulating forces or environmental shifts (that is, about interventional distributions) [14–16].

This ability is crucial for reliable predictions and data-driven decisions as the following example aims to illustrate (here, it suffices to understand the graph as an intuitive depiction of the cause-effect relationships, while its causal semantics and correspondence to an accompanying structural equation model can also be made formally precise):

\text{diabetes} \gets \text{insulin deficiency} \to \text{blurred vision}\gets \text{myopia}

A non-causal model may wrongly lead us to treat blurred vision (the side effect) instead of insulin deficiency (the cause), since blurred vision predicts but does not cause diabetes (the effect); likewise, the model may fail to accurately predict the incidence of diabetes in a different environment or population in which myopia (another cause of blurred vision) is more prevalent than in the training population.
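
To make both failure modes concrete, here is a minimal simulation sketch of the graph above. The function `simulate`, its parameters, and every probability in it are made-up assumptions for illustration, not estimates from any real population. The sketch compares how well blurred vision predicts diabetes before and after a shift in myopia prevalence, and what happens to the incidence of diabetes when we intervene on blurred vision versus on insulin deficiency.

```python
import numpy as np

def simulate(n, p_myopia, rng, do_blurred=None, do_insulin_def=None):
    """Sample n patients from a toy model of the graph above (all probabilities are made up)."""
    insulin_def = rng.random(n) < 0.2
    if do_insulin_def is not None:          # intervention: set insulin deficiency by hand
        insulin_def = np.full(n, do_insulin_def, dtype=bool)
    myopia = rng.random(n) < p_myopia
    # diabetes depends on insulin deficiency only
    diabetes = np.where(insulin_def, rng.random(n) < 0.9, rng.random(n) < 0.05)
    # blurred vision has two causes: insulin deficiency and myopia
    blurred = (insulin_def & (rng.random(n) < 0.7)) | (myopia & (rng.random(n) < 0.8))
    if do_blurred is not None:              # intervention: set blurred vision by hand
        blurred = np.full(n, do_blurred, dtype=bool)
    return diabetes, blurred

rng = np.random.default_rng(0)
n = 200_000

# Prediction under environmental shift: same mechanisms, higher myopia prevalence.
d_train, b_train = simulate(n, p_myopia=0.1, rng=rng)
d_test, b_test = simulate(n, p_myopia=0.6, rng=rng)
p_train = d_train[b_train].mean()   # P(diabetes | blurred vision) in the training population
p_test = d_test[b_test].mean()      # the same quantity in the shifted population

# Interventions: removing the side effect versus removing the cause.
p_obs = d_train.mean()
p_do_blurred = simulate(n, 0.1, rng, do_blurred=False)[0].mean()
p_do_insulin = simulate(n, 0.1, rng, do_insulin_def=False)[0].mean()

print(f"P(diabetes | blurred vision): train {p_train:.2f}, shifted {p_test:.2f}")
print(f"P(diabetes): observed {p_obs:.2f}, do(blurred=0) {p_do_blurred:.2f}, "
      f"do(insulin_def=0) {p_do_insulin:.2f}")
```

In this toy model, blurred vision becomes a much weaker predictor of diabetes once myopia is more common, and removing blurred vision leaves the incidence of diabetes essentially unchanged, whereas removing insulin deficiency lowers it.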

Indeed, research into causality has gained traction in the machine learning community, not least since Judea Pearl’s 2011 Turing Award for his work on causality, and research foci have shifted accordingly, for example, from representation learning [17] to causal representation learning [18].

Causal machine learning to the rescue?

The models deployed today for prediction tasks on text, image, video, and voice data are highly complex and non-linear with billions of parameters, composed of dozens of stacked neural network layers, and take days to train in the cloud. This sheer model complexity and flexibility appears to be key to the success of predictive machine learning and allows scaling to high-dimensional (low-risk) applications such as ad placement or text translation. While equipping these machine learning models with causal reasoning abilities is promising, it ain’t easy.

First, in statistical causal modelling we often require restrictive assumptions for providing a statistical account of causal model properties and for characterising their identifiability from data. The gap between machine learning engineering practice and statistical causal modelling theory is substantial. On the one hand are universal function approximators, massive datasets, flexible high-dimensional density estimators, and a model zoo of often heavily overparameterised models; on the other hand are restricted model classes with limiting assumptions, for example, on the linearity of cause-effect relationships or the (non-)existence of relevant yet unobserved variables, and impossibility results that characterise the limits of learning causal structure from data [19–24]. While machine learning models over thousands of variables are commonplace, for example, for high-resolution images, we struggle to scale causal structure learning algorithms beyond a couple of dozen variables.
We may make progress despite this gap by improving machine learning via causal principles instead of aiming to “solve causality via machine learning and learn the full causal graph from data”. For example, abstract causal knowledge may act as a weak supervision signal for improving rich machine learning pipelines. Here, abstract causal knowledge refers to often only partial knowledge that comes in the form of high-level properties instead of exact causal knowledge about all model variables and cause-effect relationships. In applications, systematically incorporating such prior knowledge can benefit our machine learning analyses [25].

Second, in causal inference we often implicitly assume that the observable variables are meaningful causal entities. This assumption is often inapt and hinders fruitful application. In neuroimaging, for example, it is nonsensical to apply a causal structure learning algorithm to the raw fMRI voxel values or EEG signals. Before learning the causal structure, we need to construct representations, or recover latent variables from the observed ones, that are meaningful causal entities. We face similar issues when we transform variables that admit a causal model in order to simplify the model. For example, even if we could obtain a causal model over several billions of neurons, this model may not be pragmatically useful for a neurologist to decide on patient treatment plans. To obtain a useful and comprehensible model, we need to abstract and simplify the billions-of-neurons model while retaining the essential causal information, for instance by systematically deriving a manageable model on the level of brain regions. In both cases, when we recover latent causal variables or aggregate variables to simplify a model, we easily break the variables’ and model’s causal semantics [26,27]. Either way, the choice of wrong variables or the inability to measure the right variables conceptually hinders the applicability of causal inference methodology.
To address this hurdle, researchers began formalising and characterising causal model transformations and abstractions that guarantee that the causal semantics stays intact when transforming the underlying variables [28–32]. This conceptual understanding may inform schemes to learn causal variables from data of observable variables [33–36]; still, we have almost certainly not yet exhausted the potential for improving and regularising deep representation learning by causal (consistency and compositionality) constraints that derive from these conceptual works. A notion of approximate correspondence between two causal models with an associated approximation error may be used to learn optimal simplifications of causal models and causal representations.
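
As a rough illustration of how aggregating variables can break causal semantics, consider a hypothetical noise-free micro-level model Y = a·X1 + b·X2 and the aggregated macro variable Z = X1 + X2. The sketch below enumerates several micro-level interventions that all realise the same macro-level intervention do(Z = z) and measures how much the resulting values of Y disagree. The function names and the quantity `abstraction_error` are assumptions made for this example; it is a crude stand-in for, not a definition of, the approximation errors studied in the cited works.

```python
def effect_on_y(x1, x2, a, b):
    """Outcome of the micro-level model Y = a*X1 + b*X2 under do(X1=x1, X2=x2)."""
    return a * x1 + b * x2

def macro_effects(z, a, b, splits=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Outcomes of several micro interventions that all realise do(Z=z), where Z = X1 + X2."""
    return [effect_on_y(s * z, (1 - s) * z, a, b) for s in splits]

def abstraction_error(z, a, b):
    """Spread of outcomes across micro realisations; zero means do(Z=z) has a well-defined effect."""
    ys = macro_effects(z, a, b)
    return max(ys) - min(ys)

# Equal coefficients: the sum Z carries all causally relevant information about (X1, X2).
consistent = abstraction_error(z=2.0, a=1.0, b=1.0)

# Opposite coefficients: the effect of "setting Z" depends on how Z is split into X1 and X2.
ambiguous = abstraction_error(z=2.0, a=1.0, b=-1.0)

print(consistent, ambiguous)
```

When the two coefficients agree, every way of setting Z has the same effect on Y and the macro-level model is a faithful simplification; when they differ, "the effect of Z on Y" is ambiguous, which is the kind of inconsistency that a formal notion of abstraction is meant to rule out or quantify.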

Third, the scarcity of real-world data with known cause-effect relationships hinders the development of causal structure learning algorithms. Instead of real-world data, benchmarks of causal structure learning algorithms commonly rely on synthetic data and causal additive noise models. Synthetic data often exhibits certain regularity patterns that may be exploited for structure learning but are perhaps implausible to assume for real-world data [37,38]. If benchmark performance does not reflect what we actually care about, we risk developing tools that solve the wrong task and are thus unfit for fruitful application.
Obviously, data from real-world systems with known cause-effect relationships and the ability to intervene would be golden. Lacking such data, we may hope to somehow improve benchmarking, while I personally remain sceptical that we can make progress without rethinking the seemingly innocent and generic assumption of causal additive noise and how our models capture stochasticity. More generally, perhaps applications do not even require us to “learn the full causal graph from data” in the first place but instead to answer concrete causal questions and to choose optimal actions; if so, we may wish to develop benchmarks for the eventual causal task and explore the potential for synergies between control, reinforcement learning, and causality [39] instead of improving the benchmarking of causal graph recovery as a (possibly dispensable) intermediate step.
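
One regularity pattern of the kind mentioned above, reported for simulated additive noise models, is that marginal variances tend to grow along the causal order. Whether this is the exact pattern the text has in mind is an assumption. The sketch below simulates a hypothetical linear chain with unit-variance Gaussian noise (the weights, sample size, and helper names are made up), hides the causal order by shuffling the columns, and then recovers it by sorting on sample variance alone. Standardising the data removes the pattern, which suggests how much such a shortcut depends on the arbitrary measurement scale.

```python
import numpy as np

def simulate_chain(n, weights, rng):
    """Linear additive noise model X1 -> X2 -> ... with unit-variance Gaussian noise."""
    cols = [rng.standard_normal(n)]
    for w in weights:
        cols.append(w * cols[-1] + rng.standard_normal(n))
    return np.column_stack(cols)

def order_by_variance(data):
    """Guess a causal order by sorting the columns by increasing sample variance."""
    return [int(i) for i in np.argsort(data.var(axis=0))]

rng = np.random.default_rng(0)
data = simulate_chain(n=5000, weights=[1.5, 1.5, 1.5], rng=rng)

# Hide the causal order by shuffling the columns.
perm = rng.permutation(data.shape[1])
shuffled = data[:, perm]

# Sort by variance and map the shuffled column indices back to the original variables.
guess = [int(perm[i]) for i in order_by_variance(shuffled)]
print("recovered order:", guess)

# After standardisation all variances are equal, so the shortcut no longer carries information.
standardised = (shuffled - shuffled.mean(axis=0)) / shuffled.std(axis=0)
variances_after = standardised.var(axis=0)
print("variances after standardising:", variances_after)
```

A structure learning algorithm that scores well on such data may be exploiting the simulation's scale rather than anything we would expect to hold in real-world measurements.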

Which causal tasks are we aiming to solve in practice and how do we align our research on causal modelling and causal machine learning to make real progress toward that goal?

Some pointers

A Ramesh, P Dhariwal, A Nichol, C Chu, M Chen
arXiv Preprint arXiv:2204.06125, 2022
S Reed, K Zolna, E Parisotto, SG Colmenarejo, A Novikov, G Barth-Maron, M Gimenez, Y Sulsky, J Kay, JT Springenberg, T Eccles, J Bruce, A Razavi, A Edwards, N Heess, Y Chen, R Hadsell, O Vinyals, M Bordbar, N de Freitas
arXiv Preprint arXiv:2205.06175, 2022
C Saharia, W Chan, S Saxena, L Li, J Whang, E Denton, SKS Ghasemipour, BK Ayan, SS Mahdavi, RG Lopes, T Salimans, J Ho, DJ Fleet, M Norouzi
arXiv Preprint arXiv:2205.11487, 2022
C Szegedy, W Zaremba, I Sutskever, J Bruna, D Erhan, I Goodfellow, R Fergus
arXiv Preprint arXiv:1312.6199, 2013
A Rajkomar, M Hardt, MD Howell, G Corrado, MH Chin
Annals of Internal Medicine, 169(12):866–72, 2018
Z Obermeyer, B Powers, C Vogeli, S Mullainathan
Science, 366(6464):447–53, 2019
N Mehrabi, F Morstatter, N Saxena, K Lerman, A Galstyan
ACM Computing Surveys, 54(6), 2021
J Pearl
Communications of the ACM, 62(3):54–60, 2019
B Schölkopf
Probabilistic and Causal Inference: The Works of Judea Pearl, 765–804, 2022
P Spirtes, CN Glymour, R Scheines
2nd ed, MIT Press, 2001
J Pearl
2nd ed, Cambridge University Press, 2009
J Peters, D Janzing, B Schölkopf
MIT Press, 2017
Y Bengio, A Courville, P Vincent
IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–828, 2013
B Schölkopf, F Locatello, S Bauer, NR Ke, N Kalchbrenner, A Goyal, Y Bengio
Proceedings of the IEEE, 109(5):612–34, 2021
DM Chickering
Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI), 1995
DM Chickering
Journal of Machine Learning Research, 2:445–98, 2002
J Zhang, P Spirtes
Minds and Machines, 18(2):239–71, 2008
S Shimizu, PO Hoyer, A Hyvärinen, A Kerminen
Journal of Machine Learning Research, 7:2003–30, 2006
J Peters, P Bühlmann
Biometrika, 101(1):219–28, 2014
G Park
Journal of Machine Learning Research, 21(75):1–34, 2020
B Schölkopf, DW Hogg, D Wang, D Foreman-Mackey, D Janzing, C-J Simon-Gabriel, J Peters
Proceedings of the National Academy of Sciences, 113(27):7391–98, 2016
P Spirtes, R Scheines
Philosophy of Science, 71(5):833–45, 2004
F Eberhardt
Synthese, 193(4):1029–46, 2016
PK Rubenstein, S Weichwald, S Bongers, JM Mooij, D Janzing, M Grosse-Wentrup, B Schölkopf; Co-first authorship between PKR and SW
Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI), 2017
S Beckers, JY Halpern
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019
S Beckers, F Eberhardt, JY Halpern
Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), 2019
V Belle
Knowledge-Based Systems, 199:105976, 2020
EF Rischel, S Weichwald
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI), 2021
K Chalupka, P Perona, F Eberhardt
Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI), 2015
K Chalupka, T Bischoff, P Perona, F Eberhardt
Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (UAI), 2016
K Chalupka, P Perona, F Eberhardt
The 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016
D Kinney, D Watson
Proceedings of the Tenth International Conference on Probabilistic Graphical Models (PGM), 2020
S Weichwald, ME Jakobsen, PB Mogensen, L Petersen, N Thams, G Varando
Proceedings of the NeurIPS 2019 Competition and Demonstration Track, Proceedings of Machine Learning Research (PMLR), 2020
AG Reisach, C Seiler, S Weichwald
Advances in Neural Information Processing Systems 34 (NeurIPS), 2021
S Weichwald, SW Mogensen, TE Lee, D Baumann, O Kroemer, I Guyon, S Trimpe, J Peters, N Pfister
arXiv Preprint arXiv:2202.06052, 2022