2019-04-11
We encounter causal questions
every day and, oftentimes unknowingly, reason about them with ease. Should
we put on a jumper or turn on the heater if we are feeling cold in a
room? How should we decide between these two actions if another person
with a generally higher sensation of warmth were in the room? Does it
matter for our decision whether the door to the cool hallway is
briefly opened time and again as people enter or exit the room? Does
the wall painting matter, whether the room is directly under the roof, or
the music we are listening to?
We are remarkably proficient at identifying the relevant concepts and quantities that enable us to reason about which actions to take in a given situation. Importantly, we are able to do this even without an encompassing understanding of all the underlying forces and mechanisms at work. We do not and need not know exactly how a heater works or how its thermal capacity and conductivity determine the temporal temperature profile in the room when we turn it on. Still, we have an idea of how different actions will shape the world around us that is effective and useful within a specific context. That is, we have a causal intuition that we can leverage pragmatically, one that serves its purpose and helps us navigate the world.
How does our ability to causally reason about a system depend on which variables and transformations thereof we include as descriptors in our model? The appropriate choice and construction of descriptor variables is an essential, yet largely overlooked, prerequisite to a pragmatic approach to statistical causal modelling.
What do we mean by pragmatism here? We
appreciate that all models are wrong, yet some are useful. Hence, we
neither claim nor aim to identify from data the true causal
mechanisms that exist in the world and govern its evolution.
It is not necessary to settle on the physical reality
behind all people, objects, atoms, and fields in a room before we can
reason about whether we should put on a jumper. Instead, we strive to
identify appropriate descriptors and a causal model that serves a
specific purpose and that can be justified given explicit assumptions,
observations, and prior knowledge. We require a causal model to be
derived in a systematic way, that is, given the same ingredients
(assumptions, observations, prior knowledge), people necessarily arrive
at the same causal model, and its predictions about the effects of
interventions are testable. Such a model reflects our current best guess on how to
causally reason about a given system and forecasts how certain aspects
of the world around us depend on the actions we take. As such, causal
models are viewed as a pragmatic tool to make informed and principled
decisions or to deliberate about what would happen if we took a certain
course of action.
In practice, the variables that qualify as causal variables are often unknown. In the following example we discuss how causal reasoning may be complicated by our chosen representations, concepts, and descriptors.
Let us imagine a coach who is entrusted with the task of forming the first European Jugger team. (Readers unfamiliar with this sport, be it fictitious or not, are encouraged to instead think of handball, association football, or a generic team sport with the respective number of athletes on a team.) Selecting players from the South League and the North League, she aims to build a strong team and win next year's world championship. If she were to stick with already existing teams, a classical prediction-based approach would be to pick the team of either league that was most successful lately and expect somewhat similar chances of winning in the near future. Our coach, however, is interested in forming a new team and is not restricted to drawing from the pool of already existing Jugger teams in the South League or the North League. Her task is further complicated by the fact that not all players are allowed on a European team and, since it is the first championship ever, the competing teams are unknown, which precludes targeted game-play strategies. The causal question thus is: “How should she fill the five places on her new team to maximise the chances of it being successful?” In contrast to sports betting, it is not enough to merely forecast the performance of existing teams for which historic observations are already available. Instead, she requires means to make an informed prediction about the chances of winning when putting together different players to form the new European team.
Her idea is intuitive: Pick the athletes who have been most
successful in the past. An individual assessment of each player,
however, is not enough to forecast the team's performance. In particular, each Jugger team needs one
runner, a role which requires elegant and swift movements and crouching
down to sneak through the defence and score a goal, and four so-called
enforcers. The fact that the team composition is important for
predicting its success suggests that she cannot decide on the five
places on the team independently but instead needs to decide on the
players jointly to ensure that at least one talented runner is among
them. This illustrates our first point, that it may be problematic to
consider interventions on separate variables in isolation. In our
example we know that using the five places on a team as descriptors is
inapt and we should rather think of a team as consisting of one runner
place and four enforcer places. Yet, in other realistic scenarios, when
we are to infer a causal model from data, we often lack knowledge about
such underlying structure and encounter problems in deriving meaningful
causal descriptions as long as we consider interventions separately. If
we did not know about the two roles in a Jugger team and scouted five
Jugger talents independently, we would have a hard time explaining why
one team may perform poorly while another may perform well.
Since the stakes are high, our coach consults a statistician to help her analyse historic data and inform the decision on which players to select for her new team. She approaches all major Jugger clubs and requests information on all athletes who are entitled to play on the European team. To the statistician's delight, she not only gathers all the data but also cleans and consolidates it, resulting in a database as follows:
athlete | body height | fraction of matches won |
---|---|---|
Carla | 1.76m | 0.84 |
Mariela | 1.62m | 0.91 |
… | … | … |
To complicate matters for our coach and for the sake of this example, assume that the clubs do not provide information about the preferred roles of their players. Instead, our coach leverages her observation that smaller athletes tend to be skilled in the role of the runner. Based on the athlete database, the coach and statistician decide the best they can do is to select one shorter and four taller athletes. We emphasise that in our hypothetical scenario it is indeed instrumental for building a successful team to decide on the players based on body height. The reason is that, in this example, being a runner type manifests itself in body height, and so the coach's decision to select a shorter athlete for the runner place on her team is likely to result in a reasonable team composition. This may be puzzling at first, since body height does not cause a player to be a talented runner. The following consideration is instructive. Assume we observe a tendency for runners to wear short hair (since it is less distracting during their swift movements) and that hair length is reported in the athlete database. Then, just like body height, hair length may reflect being a runner type. Yet, if the coach were to select five skilled athletes and cut one's hair, that would generally not result in a team with one runner and four enforcers. How pragmatically useful a variable is for causal reasoning depends on the context in which we are employing a certain descriptor and the (implicitly specified) way of intervening on it. In our example, having the runner place occupied by a shorter athlete is implemented by selecting an athlete who happens to be not that tall; having it occupied by a short-haired athlete would be implemented by cutting an athlete's hair. The validity of the statement “putting one short-haired player on the team increases the chances of having one gifted runner on the team” depends on the context: on whether a talented athlete is selected and then gets her hair cut, or whether a talent is selected only among the short-haired athletes.
So far we have only talked about our coach selecting one gifted runner and four enforcers. We can use body height to capture information about a player's role, but we have not yet discussed how our coach would select the skilled athletes. Given a database as depicted above, one intuitive approach is to identify the five athletes with the highest fraction of matches won (within their respective roles). High fractions correspond to athletes who often played on the victorious team of a match.
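To make this selection rule concrete, the following Python sketch implements it on a handful of athlete records. The records beyond Carla and Mariela and the 1.70 m height cut-off separating shorter (runner-type) from taller (enforcer-type) athletes are invented for illustration and are not part of the example's data.

```python
# Minimal sketch of the naive selection rule: one "runner place" filled by a
# shorter athlete and four "enforcer places" filled by taller athletes, each
# chosen by the highest fraction of matches won. Records and the 1.70 m
# cut-off are made up for illustration.
athletes = [
    # (name, body height in m, fraction of matches won)
    ("Carla",   1.76, 0.84),
    ("Mariela", 1.62, 0.91),
    ("Ines",    1.81, 0.79),
    ("Dora",    1.66, 0.88),
    ("Berta",   1.78, 0.90),
    ("Anna",    1.84, 0.86),
    ("Franka",  1.59, 0.75),
]

HEIGHT_CUTOFF = 1.70  # assumed proxy: shorter athletes tend to be runner types

shorter = [a for a in athletes if a[1] < HEIGHT_CUTOFF]
taller = [a for a in athletes if a[1] >= HEIGHT_CUTOFF]

# Runner place: the shorter athlete with the highest fraction of matches won.
runner = max(shorter, key=lambda a: a[2])
# Enforcer places: the four taller athletes with the highest fractions won.
enforcers = sorted(taller, key=lambda a: a[2], reverse=True)[:4]

print("runner:   ", runner[0])
print("enforcers:", [name for name, _, _ in enforcers])
```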
What could possibly go wrong when forming a team of five successful
athletes? Imagine they query the database and obtain a list of athletes
with the following highest fractions of matches won: 1.00, 1.00, 0.91,
0.90, 0.88. Impressed by these athletes' track record, they invite them
over for a trial training session. Soon the coach finds herself
disappointed, though, and realises only then that all five are
inexperienced Jugger athletes. What went wrong? The database lacks
information that is important for reasoning about the athletes'
performance: Our coach only entered the
fraction of matches won, effectively concealing the information about
athletes' experience as reflected in the total number of matches played.
Indeed, in the raw data records she finds the fractions of the above
athletes to be made up as 1/1, 1/1, 10/11, 9/10, 7/8. Having
participated in only 1, 1, 11, 10, and 8 matches, the players are evidently
inexperienced. The correspondingly high fractions of matches won can
be explained by the common practice of rotating in inexperienced players
only when the team has reached a clear point advantage early on in a
match. As it stands, the data in the athlete database cannot inform an
argument along the following lines: “A formation of five individually
successful athletes can be expected to perform well as a team.”
The problem is that the available variables are inapt for predicting a newly formed team's performance. This exemplifies how our ability to causally reason about a system crucially depends on which descriptors are available to us. In our hypothetical scenario we can explain and understand what went wrong. In practice, however, we often cannot, and it is unknown to us which descriptors we may be missing; and even if we knew, measurements of those variables may be impossible or expensive to obtain.
We continue the example to demonstrate that sometimes we need to combine observed quantities to recover (proxies for) appropriate descriptor variables. Assume that Jugger clubs collect data inconsistently and that the total number of matches played per player is mostly missing. One piece of information that all clubs do list, however, is the year a player joined the league. The athlete database is augmented as follows:
athlete | body height | fraction of matches won | year joined |
---|---|---|---|
Carla | 1.76m | 0.84 | 2015 |
Mariela | 1.62m | 0.91 | 2018 |
… | … | … | … |
Our coach assumes that on average a player participates in 20 matches per season. She thus suggests that for each athlete the number of matches played can be reasonably estimated by multiplying the difference between the current year and the year a player joined the league by 20. Multiplying the resulting number by the fraction of matches won, we obtain an estimate of the number of victorious matches an athlete took part in. Let us assume that during the first ~10 matches new players are mostly only rotated in under clear point advantage. The first 10 matches are thus almost certainly victorious matches and are not informative about an athlete's individual performance. The statistician proposes to construct an adjusted winning ratio from the available variables, subtracting 10 from the estimated numbers for both the victorious matches and total matches played, and dividing the former by the latter:
athlete | body height | adjusted winning ratio |
---|---|---|
Carla | 1.76m | [0.84 · (2019 − 2015) · 20 − 10] / [(2019 − 2015) · 20 − 10] ≈ 0.82 |
Mariela | 1.62m | [0.91 · (2019 − 2018) · 20 − 10] / [(2019 − 2018) · 20 − 10] ≈ 0.82 |
… | … | … |
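As a small sanity check of this construction, here is a minimal Python sketch of the adjusted winning ratio under the assumptions stated above: the current year is 2019, an athlete plays roughly 20 matches per season, and the first ~10 matches are almost certainly victorious and uninformative.

```python
# Minimal sketch of the adjusted winning ratio, assuming 20 matches per
# season and ~10 early matches that are won regardless of an athlete's
# individual performance.
CURRENT_YEAR = 2019
MATCHES_PER_SEASON = 20
UNINFORMATIVE_WINS = 10

def adjusted_winning_ratio(fraction_won, year_joined):
    """Winning ratio corrected for the ~10 early matches that are
    uninformative about an athlete's individual performance."""
    matches_played = (CURRENT_YEAR - year_joined) * MATCHES_PER_SEASON
    matches_won = fraction_won * matches_played
    return (matches_won - UNINFORMATIVE_WINS) / (matches_played - UNINFORMATIVE_WINS)

print(round(adjusted_winning_ratio(0.84, 2015), 2))  # Carla   -> 0.82
print(round(adjusted_winning_ratio(0.91, 2018), 2))  # Mariela -> 0.82
```

Note that this sketch implicitly assumes each athlete has already played more than the ten uninformative matches; otherwise the denominator becomes zero or negative.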
The adjusted winning ratio is better suited to capturing an individual athlete's past success and predicting their future success. Note that the specific construction is motivated and derived by reference to mechanisms that we assume to be in place in our exemplary scenario. When modelling a real-world system, such knowledge is generally unavailable. It is a fundamental challenge to find the right transformations of the observed quantities to obtain descriptors that facilitate causal reasoning.
This closes our example, which illustrates the problems we face in causal modelling of real-world systems. These problems are inspiration and motivation for future conceptual research. Firstly, it is often not sensible to consider certain interventions in isolation, just like it is not enough to fill the five places on a Jugger team separately from one another; instead, causal reasoning about the future prospects of a newly formed team requires us to understand and respect the concept of a 4-enforcers-plus-1-runner composition of a Jugger team. Secondly, we may have no knowledge about, no access to, or no way to measure the descriptors that facilitate causal reasoning (e.g. experience as reflected by the year a player joined the league). Thirdly, we may need to construct aggregate variables based on the observed quantities (e.g. an adjusted winning ratio), but generally lack any prior knowledge to inform such construction.
In real-world scenarios, we have
often implicitly solved the above issues of choosing the appropriate
level of description and transformations of variables without even
thinking about it. For example, we know which aspects to pay attention
to and how to conceptualise the world around us when feeling cold and
deliberating about whether to put on a jumper or turn on the heating;
seeing that others in the lecture hall are feeling warm and that the
door is being opened every now and then, we reason that putting on a
jumper is more likely to result in everyone feeling comfortable. The
situation is more subtle when trying to infer pragmatically useful
cause-effect relationships from data of new or less well understood
systems, such as the brain or genetic pathways, for which we lack prior
knowledge. Just like in our example, there are several complications
before we can even begin to infer cause-effect relationships between the
chosen descriptor variables, fit a model to our observations, and reason
about the effect of interventions. Leveraging pragmatic causal modelling
in practice requires an understanding of how to identify,
characterise, and obtain the right causal variables as transformations
of observed variables. Hoping to bring statistical causal modelling from
pen and paper to fruitful application, we advocate a pragmatic approach
to causal reasoning and delineate one overlooked key prerequisite:
defining causal variables and identifying the right descriptors as
transformations of observed variables.