2019-04-11
We encounter causal questions every day and, oftentimes unknowingly, reason about them with ease. Should we put on a jumper or turn on the heater if we are feeling cold in a room? How should we decide between these two actions if another person with a generally higher sensation of warmth were in the room? Does it matter for our decision whether the door to the cool hallway is briefly opened time and again as people enter or exit the room? Does the wall painting matter, or whether the room is directly under the roof, or the music we are listening to?
We are remarkably proficient at identifying the relevant concepts and quantities that enable us to reason about which actions to take in a given situation. Importantly, we are able to do this even without an encompassing understanding of all the underlying forces and mechanisms at work. We do not and need not know exactly how a heater works or how its thermal capacity and conductivity determine the temporal temperature profile in the room once we turn it on. Still, we have an idea of how different actions will shape the world around us, one that is effective and useful within a specific context. That is, we have a causal intuition that we can leverage pragmatically, one that serves its purpose and helps us navigate the world.
How does our ability to causally reason about a system depend on which variables and transformations thereof we include as descriptors in our model? The appropriate choice and construction of descriptor variables is an essential, yet largely overlooked, prerequisite to a pragmatic approach to statistical causal modelling.
What do we mean by pragmatism here? We appreciate that all models are wrong, yet some are useful. Hence, we neither claim nor aim to identify from data the true causal mechanisms that exist in the world and govern its evolution. We need not settle on the physical reality behind all people, objects, atoms, and fields in a room before we can reason about whether we should put on a jumper. Instead, we strive to identify appropriate descriptors and a causal model that serves a specific purpose and that can be justified given explicit assumptions, observations, and prior knowledge. We require a causal model to be derived in a systematic way, that is, given the same ingredients (assumptions, observations, prior knowledge), people necessarily arrive at the same causal model, and its predictions about the effects of interventions are testable. Such a model reflects our current best guess about how to causally reason about a given system and forecasts how certain aspects of the world around us depend on the actions we take. As such, causal models are viewed as a pragmatic tool for making informed and principled decisions, or for deliberating about what will happen if we take a certain course of action.
In practice, the variables that qualify as causal variables are often unknown. In the following example we discuss how causal reasoning may be complicated by our chosen representations, concepts, and descriptors.
Let us imagine a coach who is entrusted with the task of forming the first European Jugger team. (Readers unfamiliar with this sport, be it fictitious or not, are encouraged to instead think of handball, association football, or a generic team sport with the respective number of athletes on a team.) Selecting players from the South League and the North League, she aims to build a strong team and win next year's world championship. If she were to stick with already existing teams, a classical prediction-based approach would be to pick the most recently successful team from either league and expect somewhat similar chances of winning in the near future. Our coach, however, is interested in forming a new team and is not restricted to drawing from the pool of already existing Jugger teams in the South League or the North League. Her task is further complicated by the fact that not all players are allowed on a European team and that, since it is the first championship ever, the competing teams are unknown, which impedes targeted game-play strategies. The causal question thus is: “How should she fill the five places on her new team to maximise the chances of it being successful?” In contrast to sports betting, it is not enough to merely forecast the performance of existing teams for which historic observations are already available. Instead, she requires means to make an informed prediction about the chances of winning when putting together different players to form the new European team.
Her idea is intuitive: pick the athletes who have been most successful in the past. An individual assessment of each player, however, is not enough to forecast the team's performance. In particular, each Jugger team needs one runner, a role which requires elegant and swift movements and crouching down to sneak through the defence and score a goal, and four so-called enforcers. The fact that the team composition is important for predicting its success suggests that she cannot decide on the five places on the team independently but instead needs to decide on the players jointly, to ensure that at least one talented runner is among them. This illustrates our first point: it may be problematic to consider interventions on separate variables in isolation. In our example we know that using the five places on a team as descriptors is inapt and that we should rather think of a team as consisting of one runner place and four enforcer places. Yet in other realistic scenarios, when we are to infer a causal model from data, we often lack knowledge about such underlying structure and encounter problems in deriving meaningful causal descriptions as long as we consider interventions separately. If we did not know about the two roles in a Jugger team and scouted five Jugger talents independently, we would have a hard time explaining why one team performs poorly while another performs well.
Since the stakes are high, our coach consults a statistician to help her analyse historic data and inform the decision on which players to select for her new team. She approaches all major Jugger clubs and requests information on all athletes who are entitled to play on the European team. To the statistician's delight, she not only gathers the data but also cleans and consolidates it, resulting in a database as follows:
| athlete | body height | fraction of matches won |
|---|---|---|
| Carla | 1.76 m | 0.84 |
| Mariela | 1.62 m | 0.91 |
| … | … | … |
To complicate matters for our coach and for the sake of this example, assume that the clubs do not provide information about the preferred roles of their players. Instead, our coach leverages her observation that smaller athletes tend to be skilled in the role of the runner. Based on the athlete database, the coach and statistician decide that the best they can do is to select one shorter and four taller athletes. We emphasise that in our hypothetical scenario it is indeed instrumental for building a successful team to decide on the players based on body height. The reason is that in this example being a runner type manifests itself in body height, and so the coach's decision to select a shorter athlete for the runner place on her team is likely to result in a reasonable team composition. This may be puzzling at first, since body height does not cause a player to be a talented runner. The following consideration is instructive. Assume we observe a tendency for runners to wear short hair (since it is less distracting during their swift movements) and that hair length is reported in the athlete database. Then, just like body height, hair length may reflect being a runner type. Yet, if the coach were to select five skilled athletes and cut one's hair, that would generally not result in a team with one runner and four enforcers. How pragmatically useful a variable is for causal reasoning depends on the context in which we employ a certain descriptor and the (implicitly specified) way of intervening on it. In our example, having the runner place occupied by a shorter athlete is implemented by selecting an athlete who happens to be not that tall; having it occupied by a short-haired athlete is implemented by cutting an athlete's hair. The validity of the statement “putting one short-haired player on the team increases the chances of having one gifted runner on the team” depends on the context: it depends on whether a talented athlete is selected and then gets her hair cut, or whether a talent is selected only among the short-haired athletes.
So far we have only talked about our coach selecting one gifted runner and four enforcers. We can use body height to capture information about a player's role, but we have not yet discussed how our coach would select the skilled athletes. Provided a database as depicted above, one intuitive approach is to identify the five athletes with the highest fraction of matches won (within their respective role). High fractions correspond to athletes who often played on the victorious team of a match.
What could possibly go wrong when forming a team of five successful athletes? Imagine they query the database and obtain a list of athletes with the following highest fractions of matches won: 1.00, 1.00, 0.91, 0.90, 0.88. Impressed by these athletes' track records, they invite them over for a trial training session. Soon, though, the coach finds herself disappointed and only then realises that all five are inexperienced Jugger athletes. What went wrong? The database lacks information that is important for reasoning about the athletes' performance: our coach only entered the fraction of matches won, effectively concealing the information about an athlete's experience as reflected in the total number of matches played. Indeed, in the raw data records she finds the fractions of the above athletes to be made up as 1/1, 1/1, 10/11, 9/10, 7/8. Having participated in only 1, 1, 11, 10, and 8 matches, the players are evidently inexperienced. Their high numbers of victorious matches can be explained by the common practice of rotating in inexperienced players only once the team has reached a clear point advantage early on in a match. As it stands, the data in the athlete database cannot inform an argument along the following lines: “A formation of five individually successful athletes can be expected to perform well as a team.”
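To make the pitfall concrete, here is a minimal Python sketch with hypothetical names and match counts (the 1/1, 1/1, 10/11, 9/10, 7/8 records from above plus one invented, experienced athlete): ranking athletes by the raw win fraction alone puts exactly the inexperienced players on top.

```python
# Hypothetical match records: (athlete, matches won, matches played).
records = [
    ("Antje", 1, 1),
    ("Billie", 1, 1),
    ("Chris", 10, 11),
    ("Dana", 9, 10),
    ("Emre", 7, 8),
    ("Frida", 160, 200),  # experienced athlete with a strong track record
]

# The athlete database only stores the fraction of matches won ...
fraction_won = {name: won / played for name, won, played in records}

# ... so selecting the five athletes with the highest fraction favours
# exactly the inexperienced players and drops Frida (0.80).
top_five = sorted(fraction_won, key=fraction_won.get, reverse=True)[:5]
print(top_five)  # ['Antje', 'Billie', 'Chris', 'Dana', 'Emre']
```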
The problem is that the available variables are ill-suited to predicting a newly formed team's performance. This exemplifies how our ability to causally reason about a system crucially depends on which descriptors are available to us. In our hypothetical scenario we can explain and understand what went wrong. In practice, however, we often cannot, and we do not know which descriptors we may be missing; even if we knew, measurements of those variables might be impossible or expensive to obtain.
We carry on with the example to demonstrate that sometimes we need to combine observed quantities to recover (proxies for) appropriate descriptor variables. Assume that Jugger clubs collect data inconsistently and that the total number of matches played is missing for most players. One piece of information that all clubs do list, however, is the year a player joined the league. The athlete database is augmented as follows:
| athlete | body height | fraction of matches won | year joined |
|---|---|---|---|
| Carla | 1.76 m | 0.84 | 2015 |
| Mariela | 1.62 m | 0.91 | 2018 |
| … | … | … | … |
Our coach assumes that, on average, a player participates in 20 matches per season. She thus suggests that for each athlete the number of matches played can be reasonably estimated by multiplying the difference between the current year and the year the player joined the league by 20. Multiplying the resulting number by the fraction of matches won, we obtain an estimate of the number of victorious matches an athlete took part in. Let us further assume that during their first roughly 10 matches, new players are mostly rotated in only under a clear point advantage. These first 10 matches are thus almost certainly victorious and are not informative about an athlete's individual performance. The statistician therefore proposes to construct an adjusted winning ratio from the available variables by subtracting 10 from the estimated numbers of both victorious matches and total matches played, and dividing the former by the latter:
| athlete | body height | adjusted winning ratio |
|---|---|---|
| Carla | 1.76 m | [0.84 · (2019 − 2015) · 20 − 10] / [(2019 − 2015) · 20 − 10] ≈ 0.82 |
| Mariela | 1.62 m | [0.91 · (2019 − 2018) · 20 − 10] / [(2019 − 2018) · 20 − 10] = 0.82 |
| … | … | … |
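In code, this construction might look as follows; a minimal sketch that hard-codes the assumptions made above, namely 20 matches per season and a burn-in of 10 near-certain, uninformative wins for every new player.

```python
# Constants encode the assumptions stated above: 20 matches per season,
# and the first 10 matches of a new player being near-certain wins.
CURRENT_YEAR = 2019
MATCHES_PER_SEASON = 20
BURN_IN_MATCHES = 10

def adjusted_winning_ratio(fraction_won, year_joined):
    """Experience-adjusted win ratio estimated from the two available columns."""
    est_played = (CURRENT_YEAR - year_joined) * MATCHES_PER_SEASON
    est_won = fraction_won * est_played
    # Discount the burn-in matches from both counts before taking the ratio.
    return (est_won - BURN_IN_MATCHES) / (est_played - BURN_IN_MATCHES)

print(round(adjusted_winning_ratio(0.84, 2015), 2))  # Carla   -> 0.82
print(round(adjusted_winning_ratio(0.91, 2018), 2))  # Mariela -> 0.82
```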
The adjusted winning ratio is better suited to capturing an individual athlete's past success and predicting their future success. Note that this specific construction is motivated and derived by reference to mechanisms that we assume to be at work in our exemplary scenario. When modelling a real-world system, such knowledge is generally unavailable. It is a fundamental challenge to find the right transformations of the observed quantities that yield descriptors which facilitate causal reasoning.
This closes our example, which illustrates the problems we face in the causal modelling of real-world systems. These problems are inspiration and motivation for future conceptual research. Firstly, it is often not sensible to consider certain interventions in isolation, just as it is not enough to fill the five places on a Jugger team separately from one another; instead, causal reasoning about the future prospects of a newly formed team requires us to understand and respect the concept of a 4-enforcers-plus-1-runner composition of a Jugger team. Secondly, we may have no knowledge about, no access to, or no way to measure the descriptors that facilitate causal reasoning (e.g. an athlete's experience, as reflected in the total number of matches played). Thirdly, we may need to construct aggregate variables from the observed quantities (e.g. an adjusted winning ratio), but generally lack the prior knowledge to inform such a construction.
In real-world scenarios, we have often implicitly solved the above issues of choosing the appropriate level of description and transformations of variables without even thinking about it. For example, we know which aspects to pay attention to and how to conceptualise the world around us when feeling cold and deliberating about whether to put on a jumper or turn on the heating; seeing that others in the lecture hall are feeling warm and that the door is being opened every now and then, we reason that putting on a jumper is more likely to result in everyone feeling comfortable. The situation is more subtle when trying to infer pragmatically useful cause-effect relationships from data of new or less well understood systems, such as the brain or genetic pathways, for which we lack prior knowledge. Just as in our example, there are several complications to address before we can even begin to infer cause-effect relationships between the chosen descriptor variables, fit a model to our observations, and reason about the effects of interventions. Leveraging pragmatic causal modelling in practice requires an understanding of how to identify, characterise, and obtain the right causal variables as transformations of observed variables. Hoping to bring statistical causal modelling from pen and paper to fruitful application, we advocate a pragmatic approach to causal reasoning and delineate one overlooked key prerequisite: defining causal variables, that is, identifying the right descriptors as transformations of observed variables.