2022-06-01
I ponder, on occasion, what hinders fruitful practical applications of causal modelling and causal machine learning. Here are some partial musings and pointers as food for thought on how we may move forward (and as entry points into some of the relevant literature).
Speech recognition, text translation, automatic video captioning, … isn’t machine learning already fruitfully applied?
The deep learning approach to machine learning has arguably revolutionised data science. Fitting expressive function approximators and high-dimensional density estimators as black boxes to huge datasets has led to impressive language, computer vision, and RL tools [1,2,3 and various others]. Yet in high-risk applications, such as treatment planning or therapy chatbots for depressed and suicidal patients, we cannot deploy black-box models and risk severe failures [4–11]. Instead, we need models whose behaviour remains reliable and predictable under environmental shifts and interventions. For example, a vision system for autonomous driving ought to be robust to environmental shifts and work reliably in a white winter landscape, on a colourful autumn parkway, and between trucks with large-area nature paintings on them.
So we often require modelling abilities beyond those of classical probabilistic and machine learning models, which predict future observations of a system under the same conditions that held when we devised the model (that is, from the observational distribution). Perhaps we need causality for machine learning [12,13]. With a statistical causal model we can also reason about future observations of the system when it is subjected to external manipulations or environmental shifts (that is, about interventional distributions) [14–16].
This ability is crucial for reliable predictions and data-driven decisions, as the following example aims to illustrate (here, it suffices to understand the graph as an intuitive depiction of the cause-effect relationships, while its causal semantics and correspondence to an accompanying structural equation model can also be made formally precise):

$$\text{diabetes} \gets \text{insulin deficiency} \to \text{blurred vision} \gets \text{myopia}$$
A non-causal model may wrongly lead us to treat blurred vision (the side effect) instead of insulin deficiency (the cause), since blurred vision predicts but does not cause diabetes (the effect); likewise, the model may fail to accurately predict the incidence of diabetes in a different environment and population in which the prevalence of myopia (another cause of blurred vision) is higher than in the training population.
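To make this concrete, here is a minimal simulation sketch in Python of a structural equation model compatible with the graph above (the probabilities are made up purely for illustration): blurred vision predicts diabetes in the observational distribution, yet intervening on blurred vision leaves the incidence of diabetes unchanged, and its predictive strength degrades when the prevalence of myopia shifts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def simulate(do_blurred_vision=None, p_myopia=0.2):
    """Sample from a structural equation model compatible with the graph
    diabetes <- insulin_deficiency -> blurred_vision <- myopia."""
    insulin_deficiency = rng.binomial(1, 0.1, size=n)
    myopia = rng.binomial(1, p_myopia, size=n)
    diabetes = rng.binomial(1, np.where(insulin_deficiency == 1, 0.8, 0.05))
    p_blur = np.clip(0.05 + 0.7 * insulin_deficiency + 0.6 * myopia, 0, 0.95)
    blurred_vision = rng.binomial(1, p_blur)
    if do_blurred_vision is not None:
        # an intervention sets blurred vision by external force and thereby
        # cuts its dependence on insulin deficiency and myopia
        blurred_vision = np.full(n, do_blurred_vision)
    return diabetes, blurred_vision

# observationally, blurred vision predicts diabetes ...
diabetes, blurred = simulate()
print(diabetes[blurred == 1].mean(), diabetes[blurred == 0].mean())

# ... yet "treating" blurred vision, do(blurred_vision := 0), leaves the
# incidence of diabetes unchanged
diabetes_do, _ = simulate(do_blurred_vision=0)
print(diabetes_do.mean(), diabetes.mean())

# and in a population with higher myopia prevalence, blurred vision is a
# markedly weaker predictor of diabetes
diabetes_hi, blurred_hi = simulate(p_myopia=0.8)
print(diabetes_hi[blurred_hi == 1].mean())
```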
Indeed, research into causality has gained traction in the machine learning community, not least since Judea Pearl received the 2011 Turing Award for his work on probabilistic and causal reasoning, and research foci have shifted accordingly, for example, from representation learning [17] to causal representation learning [18].
Causal machine learning to the rescue?
The models deployed today for prediction tasks on text, image, video, and voice data are highly complex and non-linear with billions of parameters, composed of dozens of stacked neural network layers, and take days to train in the cloud. This sheer model complexity and flexibility appears to be key to the success of predictive machine learning and allows scaling to high-dimensional (low-risk) applications such as ad placement or text translation. While equipping these machine learning models with causal reasoning abilities is promising, it ain’t easy.
First, in statistical causal modelling we often require restrictive assumptions to provide a statistical account of causal model properties and to characterise their identifiability from data. There is a substantial gap between machine learning engineering practice and statistical causal modelling theory. On the one hand, there are universal function approximators, massive datasets, flexible high-dimensional density estimators, and a model zoo of often heavily overparameterised models; on the other hand, there are restricted model classes with limiting assumptions, for example on the linearity of cause-effect relationships or the (non-)existence of relevant yet unobserved variables, and impossibility results that characterise the limits of learning causal structure from data [19–24].
While machine learning models over thousands of variables are commonplace, for example, for high-resolution images, we struggle to scale causal structure learning algorithms beyond a few dozen variables.
We may make progress despite this gap by improving machine learning via causal principles instead of aiming to “solve causality via machine learning and learn the full causal graph from data”. For example, abstract causal knowledge may act as a weak supervision signal for improving rich machine learning pipelines. Here, abstract causal knowledge refers to knowledge that is often only partial and comes in the form of high-level properties instead of exact causal knowledge of all model variables and cause-effect relationships. In applications, systematically incorporating such prior knowledge can benefit our machine learning analyses [25].
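As one hypothetical instance of such weak supervision: suppose we know the abstract causal fact that image brightness does not cause the class label. Without knowing the full causal graph, we can encode this as an intervention-based data augmentation. A minimal sketch, with made-up data and a hypothetical `augment_brightness` helper:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_brightness(images, scale=0.3):
    """Hypothetical augmentation: simulate interventions on brightness,
    which we know (abstract causal knowledge) does not cause the label."""
    shifts = rng.uniform(-scale, scale, size=(len(images), 1, 1))
    return np.clip(images + shifts, 0.0, 1.0)

# made-up stand-in for an image dataset: 256 grayscale 8x8 images + labels
images = rng.uniform(0, 1, size=(256, 8, 8))
labels = rng.integers(0, 2, size=256)

# appending interventional copies with unchanged labels encodes the causal
# knowledge "brightness does not cause the label" as a training signal
aug_images = np.concatenate([images, augment_brightness(images)])
aug_labels = np.concatenate([labels, labels])
```

Any standard classifier trained on the augmented set is nudged towards the invariance implied by the causal knowledge, without us ever specifying a full causal graph.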
Second, in causal inference we often implicitly assume that the observable variables are meaningful causal entities. This assumption is often inapt and hinders fruitful application. In neuroimaging, for example, it is nonsensical to apply a causal structure learning algorithm to the raw fMRI voxel values or EEG signals. Before learning the causal structure, we need to construct representations, or recover latent variables from the observed ones, that are meaningful causal entities. We face similar issues when we transform the variables of an existing causal model in order to simplify it. For example, even if we could obtain a causal model over several billions of neurons, this model would hardly help a neurologist decide on patient treatment plans; only by systematically abstracting it into a manageable model on the level of brain regions, while retaining the essential causal information, do we obtain a pragmatically useful model. In both cases, when we recover latent causal variables or aggregate variables to simplify a model, we easily break the variables’ and the model’s causal semantics [26,27].
Conversely, choosing the wrong variables, or being unable to measure the right ones, conceptually hinders the applicability of causal inference methodology.
To address this hurdle, researchers have begun formalising and characterising causal model transformations and abstractions that guarantee that the causal semantics stay intact when the underlying variables are transformed [28–32]. This conceptual understanding may inform schemes to learn causal variables from data on observable variables [33–36]; still, we have almost certainly not yet exhausted the potential for improving and regularising deep representation learning via causal (consistency and compositionality) constraints derived from these conceptual works. A notion of approximate correspondence between two causal models, with an associated approximation error, may be used to learn optimal simplifications of causal models and causal representations.
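To illustrate what it can mean for the causal semantics to stay intact under such a transformation, here is a minimal sketch with a made-up linear micro model: summing micro variables into macro variables such that every micro realisation of a macro intervention yields the same macro effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# micro model: neurons X1, X2 (region A) drive neurons Y1, Y2 (region B)
# via Y = W @ X + noise; the column sums of W are equal (both 1.0), which
# is what makes summation an exact abstraction in this constructed example
W = np.array([[0.7, 0.4],
              [0.3, 0.6]])

def macro_effect_under_micro_intervention(x1, x2):
    """do(X1=x1, X2=x2) in the micro model; return samples of B = Y1 + Y2."""
    noise = rng.normal(0.0, 0.1, size=(n, 2))
    Y = np.array([x1, x2]) @ W.T + noise
    return Y.sum(axis=1)

# two different micro interventions realising the same macro value A = 3
b_alloc1 = macro_effect_under_micro_intervention(1.0, 2.0)
b_alloc2 = macro_effect_under_micro_intervention(2.5, 0.5)

# the macro model "B = 1.0 * A + noise" predicts B ≈ 3 either way
print(b_alloc1.mean(), b_alloc2.mean())
```

Had the column sums of W differed, different micro allocations of the same macro intervention would have produced different distributions of B, and no macro model over (A, B) could summarise the micro model without error; this is the kind of breakage that the cited formal notions of transformation and abstraction make precise.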
Third, the scarcity of real-world data with known
cause-effect relationships hinders the development of causal structure
learning algorithms. Instead of real-world data, benchmarks of causal
structure learning algorithms commonly rely on synthetic data and causal
additive noise models. Synthetic data often exhibits certain regularity
patterns that may be exploited for structure learning but are perhaps
implausible to assume for real-world data [37,38]. If
benchmark performance does not reflect what we actually care about, we
risk developing tools that solve the wrong task and are thus unfit for
fruitful application.
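As a minimal sketch of one such regularity pattern (assuming the common benchmark setup of a linear additive-noise model with comparable noise scales): marginal variances tend to grow along the causal order, so merely sorting variables by variance can recover the causal order without any actual causal discovery, a shortcut that real-world data need not afford.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# linear additive-noise chain X1 -> X2 -> X3, as commonly simulated in
# benchmarks, with unit-variance noise and edge weights of magnitude > 1
X1 = rng.normal(size=n)
X2 = 1.5 * X1 + rng.normal(size=n)
X3 = 1.5 * X2 + rng.normal(size=n)
data = np.stack([X1, X2, X3], axis=1)

# a "structure learner" that only looks at marginal variances ...
order = np.argsort(data.var(axis=0))
print(order)  # [0 1 2], the true causal order, found without any causal reasoning
```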
Obviously, data from real-world systems with known cause-effect
relationships and the ability to intervene would be golden. Lacking such
data, we may hope to somehow improve benchmarking, while I personally
remain sceptical that we can make progress without rethinking the
seemingly innocent and generic assumption of causal additive noise and
how our models capture stochasticity. More generally, perhaps
applications do not even require us to “learn the full causal graph from
data” in the first place but instead to answer concrete causal questions
and to choose optimal actions; if so, we may wish to develop benchmarks
for the eventual causal task and explore the potential for synergies
between control, reinforcement learning, and causality [39]
instead of improving the benchmarking of causal graph recovery as a (possibly dispensable) intermediate step.
Which causal tasks are we aiming to solve in practice and how do we align our research on causal modelling and causal machine learning to make real progress toward that goal?