8
Causal Modeling
In cities throughout the United States, the difficulty of escaping poverty is exacerbated by the difficulty in obtaining social services such as job training, mental health care, financial education classes, legal advice, child care support, and emergency food assistance. They are offered by different agencies in disparate locations with different eligibility requirements. It is difficult for poor individuals to navigate this perplexity and avail themselves of services that they are entitled to. To counteract this situation, the (fictional) integrated social service provider ABC Center takes a holistic approach by housing many individual social services in one place and having a centralized staff of social workers guide their clients. To better advise clients on how to advance themselves in various aspects of life, the center’s director and executive staff would like to analyze the data that the center collects on the services used by clients and the life outcomes they achieved. As problem owners, they do not know what sort of data modeling they should do. Imagine that you are a data scientist collaborating with the ABC Center problem owners to analyze the situation and suggest appropriate problem specifications, understand and prepare the data available, and finally perform modeling. (This chapter covers a large part of the machine learning lifecycle whereas other chapters so far have mostly focused on smaller parts.)
Your first instinct may be to suggest that ABC Center take a machine learning approach that predicts life outcomes (education, housing, employment, etc.) from a set of features that includes classes taken and sessions attended. Examining the associations and correlations in the resulting trained model may yield some insights, but misses something very important. Do you know what it is? It’s causality! If you use a standard machine learning formulation of the problem, you can’t say that taking an automobile repair training class causes an increase in the wages of the ABC Center client. When you want to understand the effect of interventions (specific actions that are undertaken) on outcomes, you have to do more than machine learning, you have to perform causal modeling.[1] Cause and effect are central to understanding the world, but standard supervised learning is not a method for obtaining them.
Toward the goal of suggesting problem formulations to ABC Center, understanding the relevant data, and creating models for them, in this chapter you will:
§ distinguish a situation as requiring a causal model or a typical predictive machine learning model,
§ discover the graph structure of causal relations among all random variables in a system, and
§ compute the quantitative causal effect of an intervention on an outcome, including from observational data.
8.1 Contrasting Causal Modeling and Predictive Modeling
If an ABC Center client completes a oneonone counseling session, it may cause a change in their level of anxiety. In contrast, completing the sessions does not cause an increase in, say, the price of eggs even if the price of eggs suddenly jumps the day after every client’s counseling session as the price is unrelated to ABC Center. In addition, two different things can cause the same result: both counseling sessions and an increase in wages can cause a reduction in anxiety. You can also be fooled by the common cause fallacy: a client secures stable housing and then purchases a used car. The stable housing does not cause the car purchase, but both are caused by a wage increase.
But what is this elusive notion called causality? It is not the same as correlation, the ability to predict, or even statistical dependence. Remember how we broke down the meanings of trustworthiness and safety into smaller components (in Chapter 1 and Chapter 3, respectively)? Unfortunately, we cannot do the same for causality since it is an elementary concept that cannot be broken down further. The basic definition of causality is: if doing something makes something else happen, then the something we did is a cause of the something that happened. The key word in the statement is do. Causation requires doing. The actions that are done are known as interventions or treatments. Interventions can be done by people or by nature; the focus in this chapter is on interventions done consciously by people.
“While probabilities encode our beliefs about a static world, causality tells us whether and how probabilities change when the world changes, be it by intervention or by act of imagination.”
—Judea Pearl, computer scientist at University of California, Los Angeles
8.1.1 Structural Causal Models
A causal model is a quantitative attempt at capturing notions of causality among random variables that builds upon probability theory. Structural causal models are one key approach for causal modeling. They contain two parts: a causal graph (a graphical model like the Bayesian networks we went over in Chapter 3) and structural equations. As shown in Figure 8.1, the graph for both counseling sessions and a change in wages causing a change in anxiety is made up of three nodes arranged in a common effect motif: counseling à anxiety ß wages. The graph for increased wages causing stable housing and car purchase is also made up of three nodes, but arranged in the common cause motif: housing ß wages à car. The graph in the figure puts both subgraphs together along with another common cause: having access to child care causing both wages and stable housing. (If a client has child care, they can more easily search for jobs and places to live since they don’t have to take their child around with them.)
Figure 8.1. An example causal graph of how the clients of ABC Center respond to interventions and life changes. Accessible caption. A graph with six nodes: counseling sessions, anxiety, wages, child care, used car, and stable housing. There are edges from counseling sessions to anxiety, anxiety to wages, wages to used car, wages to stable housing, and child care to both wages and stable housing.
Nodes represent random variables in structural causal models just as they do in Bayesian networks. However, edges don’t just represent statistical dependencies, they also represent causal relationships. A directed edge from random variable (e.g. counseling sessions) to (e.g. anxiety) indicates a causal effect of (counseling sessions) on (anxiety). Since structural causal models are a generalization of Bayesian networks, the Bayesian network calculations for representing probabilities in factored form and for determining conditional independence through dseparation continue to hold. However, structural causal models capture something more than the statistical relationships because of the structural equations.
Structural equations, also known as functional models, tell you what happens when you do something. Doing or intervening is the act of forcing a random variable to take a certain value. Importantly, it is not just passively observing what happens to all the random variables when the value of one of them has been revealed—that is simply a conditional probability. Structural causal modeling requires a new operator , which indicates an intervention. The interventional distribution is the distribution of the random variable when the random variable is forced to take value . For a causal graph with only two nodes (counseling session) and (anxiety) with a directed edge from to , à , the structural equation takes the functional form:
Equation 8.1
where is some noise or randomness in and is any function. There is an exact equation relating an intervention on counseling sessions, like starting counseling sessions (changing the variable from to ), to the probability of a client’s anxiety. The key point is that the probability can truly be expressed as an equation with the treatment as an argument on the righthand side. Functional models for variables with more parents would have those parents as arguments in the function , for example if has three parents. (Remember from Chapter 3 that directed edges begin at parent nodes and end at child nodes.)
8.1.2 Causal Model vs. Predictive Model
How do you tell that a problem is asking for a causal model rather than a predictive model that would come from standard supervised machine learning? The key is identifying whether something is actively changing one or more of the features. The act of classifying borrowers as good bets for a loan does not change anything about the borrowers at the time, and thus calls for a predictive model (also known as an associational model) as used by ThriveGuild in Chapter 3 and Chapter 6. However, wanting to understand if providing job training to a client of ABC Center (actively changing the value of a feature on job preparedness) results in a change to their probability of being approved by ThriveGuild is a causal modeling question.
Using predictive models to form causal conclusions can lead to great harms. Changes to input features of predictive models do not necessarily lead to desired changes of output labels. All hell can break loose if a decision maker is expecting a certain input change to lead to a certain output change, but the output simply does not change or changes in the opposite direction. Because ABC Center wants to model what happens to clients upon receiving social services, you should suggest to the director that the problem specification be focused on causal models to understand the center’s set of interventions (various social services) and outcomes that measure how well a client is progressing out of poverty.
An important point is that even if a model is only going to be used for prediction, and not for making decisions to change inputs, causal models help sidestep issues introduced in Chapter 4—construct validity (the data really measures with it should), internal validity (no errors in data processing), and external validity (generalization to other settings)—because it forces the predictions to be based on real, salient phenomena rather than spurious phenomena that just happen to exist in a given dataset. For this reason, causality is an integral component of trustworthy machine learning, and comes up in Part 4 of the book that deals with reliability. Settling for predictive models is a shortcut when just a little more effort to pursue causal models would make a world of difference.
8.1.3 Two Problem Formulations
There are two main problem formulations in causal modeling for ABC Center to consider in the problem specification phase of the development lifecycle. The first is obtaining the structure of the causal graph, which will allow them to understand which services yield effects on which outcomes. The second problem formulation is obtaining a number that quantifies the causal effect between a given treatment variable (maybe it is completing the automobile repair class) and a given outcome label (maybe it is wages). This problem is described further in Section 8.2.
Proceeding to the data understanding and data preparation phases of the lifecycle, there are two types of data, interventional data and observational data, that may come up in causal modeling. They are detailed in Section 8.3. Very briefly, interventional data comes from a purposefully designed experiment and observational data does not. Causal modeling with interventional data is usually straightforward and causal modeling with observational data is much more involved.
In the modeling phase when dealing with observational data, the two problem formulations correspond to two different categories of methods. Causal discovery is to learn the structural causal model. Causal inference is to estimate the causal effect. Specific methods for conducting causal discovery and causal inference from observational data are the topic of Sections 8.4 and 8.5, respectively. A mental model of the modeling methods for the two formulations is given in Figure 8.2.
Figure 8.2. Classes of methods for causal modeling from observational data. Accessible caption. A hierarchy diagram with causal modeling from observational data at its root with children causal discovery and causal inference. Causal discovery has children conditional independencebased methods and functional modelbased methods. Causal inference has children treatment models and outcome models.
8.2 Quantifying a Causal Effect
The second problem specification is computing the average treatment effect. For simplicity, let’s focus on being a binary variable taking values in : either a client doesn’t get the automobile repair class or they do. Then the average treatment effect is:
Equation 8.2
This difference of the expected value of the outcome label under the two values of the intervention precisely shows how the outcome changes due to the treatment. How much do wages change because of the automobile repair class? The terminology contains average because of the expected value.
For example, if is a Gaussian random variable with mean and standard deviation ,[2] and is a Gaussian random variable with mean and standard deviation , then the average treatment effect is . Being trained in automobile repair increases the earning potential of clients by . The standard deviation doesn’t matter here.
Figure 8.3. Motifs that block and do not block paths between the treatment node and the outcome label node . Backdoor paths are not blocked. Accessible caption. If an entire path is made up of causal chains without observation ( à à ), it is not a backdoor path. Backdoor paths begin with an edge coming into the treatment node and can contain common causes without observation ( ß à is a confounding variable) and common effects with observation ( à ß ; the underline indicates that or any of its descendants is observed). In this case, in the common cause without observation is a confounding variable. The other three motifs—causal chain with observation ( à à ), common cause with observation ( ß à ), and common effect without observation ( à ß )—are blockers. If any of them exist along the way from the treatment to the outcome label, there is no open backdoor path.
8.2.1 Backdoor Paths and Confounders
It is important to note that the definition of the average treatment effect is conditioned on , not on , and that and are generally not the same. Specifically, they are not the same when there is a socalled backdoor path between and . Remember that a path is a sequence of steps along edges in the graph, irrespective of their direction. A backdoor path is any path from to that (1) starts with an edge going into and (2) is not blocked. The reason for the name ‘backdoor’ is because the first edge goes backwards into . (Frontdoor paths have the first edge coming out of .) Recall from Chapter 3 that a path is blocked if it contains:
1. a causal chain motif with the middle node observed, i.e. the middle node is conditioned upon,
2. a common cause motif with the middle node observed, or
3. a common effect motif with the middle node not observed (this is a collider)
anywhere between and . Backdoor paths can contain (1) the common cause without observation motif and (2) the common effect with observation motif between and . The motifs that block and do not block a path are illustrated in Figure 8.3.
The lack of equality between the interventional distribution and the associational distribution is known as confounding bias.[3] Any middle nodes of common cause motifs along a backdoor path are confounding variables or confounders. Confounding is the central challenge to be overcome when you are trying to infer the average treatment effect in situations where intervening is not possible (you cannot ). Section 8.5 covers how to mitigate confounding while estimating the average treatment effect.
8.2.2 An Example
Figure 8.4 shows an example of using ABC Center’s causal graph (introduced in Figure 8.1) while quantifying a causal effect. The center wants to test whether reducing a client’s high anxiety to low anxiety affects their stable housing status. There is a backdoor path from anxiety to stable housing going through wages and child care. The path begins with an arrow going into anxiety. A common cause without observation, wages ß child care à stable housing, is the only other motif along the path to stable housing. It does not block the path. Child care, as the middle node of a common cause, is a confounding variable. If you can do the treatment, that is intervene on anxiety, which is represented diagrammatically with a hammer, the incoming edges to anxiety from counseling sessions and wages are removed. Now there is no backdoor path anymore, and you can proceed with the treatment effect quantification.
Often, however, you cannot do the treatment. These are observational rather than interventional settings. The observational setting is a completely different scenario than the interventional setting. Figure 8.5 shows how things play out. Since you cannot make the edge between anxiety and wages go away through intervention, you have to include the confounding variable of whether the client has child care or not in your model, and only then will you be able to do a proper causal effect quantification between anxiety and stable housing. Including, observing, or conditioning upon confounding variables is known as adjusting for them. Adjusting for wages rather than child care is an alternative way to block the backdoor path in the ABC Center graph.
Figure 8.4. The scenario of causal effect quantification when you can intervene on the treatment. Accessible caption. The causal graph of Figure 8.1 is marked with anxiety as the treatment and stable housing as the outcome. A backdoor path is drawn between the two passing through wages and child care, which is marked as a confounder. Intervening on anxiety is marked with a hammer. Its effect is the removal of edges into anxiety from counseling sessions and wages, and the removal of the backdoor path.
Figure 8.5. The scenario of causal effect quantification when you cannot intervene on the treatment and thus have to adjust for a variable along a backdoor path. Accessible caption. The causal graph of Figure 8.1 is marked with anxiety as the treatment and stable housing as the outcome. A backdoor path is drawn between the two passing through wages and child care, which is marked as a confounder. Adjusting for child care colors its node gray and removes the backdoor path.
At the end of the day, the whole point of computing causal effects is to inform decision making. If there are two competing social services that ABC Center can offer a client, causal models should recommend the one with the largest effect on the outcome that they care about. In the next sections, you will proceed to find models for this task from data.
8.3 Interventional Data and Observational Data
You have worked with the director of ABC Center on the problem specification phase and decided on a causal modeling approach rather than a standard machine learning approach for giving insights on interventions for clients to achieve good life outcomes. You have also decided on the specific form of causal modeling: either obtaining the structural causal model or the average treatment effect. The next phases in the machine learning lifecycle are data understanding and data preparation.
There are two types of data in causal modeling: interventional data and observational data, whose settings you have already been exposed to in the previous section. Interventional data is collected when you actually do the treatment. It is data collected as part of an experiment that has already been thought out beforehand. An experiment that ABC Center might conduct to obtain interventional data is to enroll one group of clients in a financial education seminar and not enroll another group of clients. The group receiving the treatment of the financial education seminar is the treatment group and the group not receiving the seminar is the control group. ABC Center would collect data about those clients along many feature dimensions, and this would constitute the dataset to be modeled in the next phase of the lifecycle. It is important to collect data for all features that you think could possibly be confounders.
As already seen in the previous sections and irrespective of whether collected interventionally or observationally, the random variables are: the treatment that designates the treatment and control groups (anxiety intervention), the outcome label (stable housing), and other features (child care and others). A collection of samples from these random variables constitute the dataset in average treatment effect estimation: . An example of such a dataset is shown in Figure 8.6. In estimating a structural causal model, you just have random variables and designate a treatment and outcome label later if needed.
Figure 8.6. A dataset for treatment effect estimation. The axes are two feature dimensions of . The unfilled data points are the control group and the filled data points are the treatment group . The diamond data points have the outcome label and the square data points have the outcome label .
The causal graph can be estimated from the entire data if the goal of ABC Center is to get a general understanding of all of the causal relationships. Alternatively, causal effects can be estimated from the treatment, confounding, and outcome label variables if the director wants to test a specific hypothesis such as anxiety reduction seminars having a causal effect on stable housing. A special kind of experiment, known as a randomized trial, randomly assigns clients to the treatment group and control group, thereby mitigating confounding within the population being studied.
It is not possible to observe both the outcome label under the control and its counterfactual under the treatment because the same individual cannot both receive and not receive the treatment at the same time. This is known as the fundamental problem of causal modeling. The fundamental problem of causal modeling is lessened by randomization because, on average, the treatment group and control group contain matching clients that almost look alike. Randomization does not prevent a lack of external validity however (recall from Chapter 4 that external validity is the ability of a dataset to generalize to a different population). It is possible that some attribute of all clients in the population is commonly caused by some other variable that is different in other populations.
Randomized trials are considered to be the gold standard in causal modeling that should be done if possible. Randomized trials and interventional data collection more broadly, however, are often prohibited by ethical or logistical reasons. For example, it is not ethical to withhold a treatment known to be beneficial such as a job training class to test a hypothesis. From a logistical perspective, ABC Center may not, for example, have the resources to give half its clients a $1000 cash transfer (similar to Unconditionally’s modus operandi in Chapter 4) to test the hypothesis that this intervention improves stable housing. Even without ethical or logistical barriers, it can also be the case that ABC Center’s director and executive staff think up a new causeandeffect relationship they want to investigate after the data has already been collected.
In all of these cases, you are in the setting of observational data rather than interventional data. In observational data, the treatment variable’s value has not been forced to a given value; it has just taken whatever value it happened to take, which could be dependent on all sorts of other variables and considerations. In addition, because observational data is often data of convenience that has not been purposefully collected, it might be missing a comprehensive set of possible confounding variables.
The fundamental problem of causal modeling is very apparent in observational data and because of it, testing and validating causal models becomes challenging. The only (unsatisfying) ways to test causal models are (1) through simulated data that can produce both a factual and counterfactual data point, or (2) to collect an interventional dataset from a very similar population in parallel. Regardless, if all you have is observational data, all you can do is work with it in the modeling phase of the lifecycle.
8.4 Causal Discovery Methods
After the data understanding phase comes modeling. How do you obtain a causal graph structure for ABC Center like the one in Figure 8.1? There are three ways to proceed:[4]
1. Enlist subject matter experts from ABC Center to draw out all the arrows (causal relationships) among all the nodes (random variables) manually,
2. Design and conduct experiments to tease out causal relationships, or
3. Discover the graph structure based on observational data.
The first manual option is a good option, but it can lead to the inclusion of human biases and is not scalable to problems with a large number of variables, more than twenty or thirty. The second experimental option is also a good option, but is also not scalable to a large number of variables because interventional experiments would have to be conducted for every possible edge. The third option, known as causal discovery, is the most tractable in practice and what you should pursue with ABC Center.[5]
You’ve probably heard the phrase “those who can, do; those who can’t, teach” which is shortened to “those who can’t do, teach.” In causal modeling from observational data when you can’t intervene, the phrase to keep in mind is “those who can’t do, assume.” Causal discovery has two branches, shown back in Figure 8.2, each with a different assumption that you need to make. The first branch is based on conditional independence testing and relies on the faithfulness assumption. The main idea of faithfulness is that the conditional dependence and independence relationships among the random variables encode the causal relationships. There is no coincidental or deterministic relationship among the random variables that masks a causal relationship. The Bayesian network edges are the edges of the structural causal model. Faithfulness is usually true in practice.
One probability distribution can be factored in many ways by choosing different sets of variables to condition on, which leads to different graphs. Arrows pointing in different directions also lead to different graphs. All of these different graphs arising from the same probability distribution are known as a Markov equivalence class. One especially informative example of a Markov equivalence class is the setting with just two random variables, say anxiety and wages.[6] The graph with anxiety as the parent and wages as the child and the graph with wages as the parent and anxiety as the child lead to the same probability distribution, but with opposite causeandeffect relationships. One important point about the conditional independence testing branch of causal discovery methods is that they find Markov equivalence classes of graph structures rather than finding single graph structures.
The second branch of causal discovery is based on making assumptions on the form of the structural equations introduced in Equation 8.1. Within this branch, there are several different varieties. For example, some varieties assume that the functional model has a linear function , others assume that the functional model has a nonlinear function with additive noise , and even others assume that the probability distribution of the noise has small entropy. Based on the assumed functional form, a best fit to the observational data is made. The assumptions in this branch are much stronger than in conditional independence testing, but lead to single graphs as the solution rather than Markov equivalence classes. These characteristics are summarized in Table 8.1.
Table 8.1. Characteristics of the two branches of causal discovery methods.
Branch 
Faithfulness 
Assumption on Functional Model 
Markov Equivalence Class Output 
Single Graph Output 
conditional 
X 

X 

functional model 

X 

X 
In the remainder of this section, you’ll see an example of each branch of causal discovery in action: the PC algorithm for conditional independence testingbased methods and the additive noise modelbased approach for functional modelbased methods.
8.4.1 An Example Conditional Independence TestingBased Method
One of the oldest, simplest, and still oftenused conditional independence testingbased causal discovery methods is the PC algorithm. Named for its originators, Peter Spirtes and Clark Glymour, the PC algorithm is a greedy algorithm. An ABC Center example of the PC algorithm is presented in Figure 8.7 for the nodes of wages, child care, stable housing, and used car. The steps are as follows:
4. The overall PC algorithm starts with a complete undirected graph with edges between all pairs of nodes.
1. As a first step, the algorithm tests every pair of nodes; if they are independent, it deletes the edge between them. Next it continues to test conditional independence for every pair of nodes conditioning on larger and larger subsets, deleting the edge between the pair of nodes if any conditional independence is found. The end result is the undirected skeleton of the causal graph.
The reason for this first step is as follows. There is an undirected edge between nodes and if and only if and are dependent conditioned on every possible subset of all other nodes. (So if a graph has three other nodes , , and , then you’re looking for and to be (1) unconditionally dependent given no other variables, (2) dependent given , (3) dependent given , (4) dependent given , (5) dependent given , (6) dependent given , (7) dependent given , and (8) dependent given .) These conditional dependencies can be figured out using dseparation, which was introduced back in Chapter 3.
2. The second step puts arrowheads on as many edges as it can. The algorithm conducts conditional independence tests between the first and third nodes of threenode chains. If they’re dependent conditioned on some set of nodes containing the middle node, then a common cause (collider) motif with arrows is created. The end result is a partiallyoriented causal graph. The direction of edges that the algorithm cannot figure out remain unknown. All choices of all of those orientations give you the different graphs that make up the Markov equivalence class.
The reason for the second step is that an undirected chain of nodes , , and can be made directed into à ß if and only if and are dependent conditioned on every possible subset of nodes containing . These conditional dependencies can also be figured out using dseparation.
At the end of the example in Figure 8.7, the Markov equivalence class contains four possible graphs.
In Chapter 3, dseparation was presented in the ideal case when you know the dependence and independence of each pair of random variables perfectly well. But when dealing with data, you don’t have that perfect knowledge. The specific computation you do on data to test for conditional independence between random variables is often based on an estimate of the mutual information between them. This seemingly straightforward problem of conditional independence testing among continuous random variables has a lot of tricks of the trade that continue to be researched and are beyond the scope of this book.[7]
Figure 8.7. An example of the steps of the PC algorithm. Accessible caption. In step 0, there is a fully connected undirected graph with the nodes child care, wages, stable housing, and used car. In step 1, the edges between child care and used car, and between stable housing and used car have been removed because they exhibit conditional independence. The undirected skeleton is left. In step 2, the edge between child care and stable housing is oriented to point from child care to stable housing, and the edge between wages and stable housing is oriented to point from wages to stable housing. The edges between child care and wages and between wages and used car remain undirected. This is the partiallyoriented graph. There are four possible directed graphs which constitute the Markov equivalence class solution: with edges pointing from child care to wages and used car to wages, with edges pointing from child care to wages and wages to used car, with edges pointing from wages to child care and from used care to wages, and with edges poiting from wages to child care and wages to used car.
8.4.2 An Example Functional ModelBased Method
In the conditional independence testbased methods, no strong assumption is made on the functional form of . Thus, as you’ve seen with the PC algorithm, there can remain confusion on the direction of some of the edges. You can’t tell which one of two nodes is the cause and which one is the effect. Functional modelbased methods do make an assumption on and are designed to avoid this confusion. They are best understood in the case of just two nodes, say and , or wages and anxiety. You might think that a change in wages causes a change in anxiety ( causes ), but it could be the other way around ( causes ).
One specific method in this functional modelbased branch of causal discovery methods is known as the additive noise model. It requires that not be a linear function and that the noise be additive: ; here should not depend on . The plot in Figure 8.8 shows an example nonlinear function along with a quantification of the noise surrounding the function. This noise band is equal in height around the function for all values of since the noise does not depend on . Now look at what’s going on when is the vertical axis and is the horizontal axis. Is the noise band equal in height around the function for all values of ? It isn’t, and that’s the key observation. The noise has constant height around the function when the cause is the horizontal axis and it doesn’t when the effect is the horizontal axis. There are ways to test for this phenomenon from data, but if you understand this idea shown in Figure 8.8, you’re golden. If ABC Center wants to figure out whether a decrease in anxiety causes an increase in wages, or if an increase in wages causes a decrease in anxiety, you know what analysis to do.
Figure 8.8. Example of using the additive noise model method to tell cause and effect apart. Since the height of the noise band is the same across all and different across all , is the cause and is the effect. Accessible caption. Two plots of the same nonlinear function and noise bands around it. The first plot has on the vertical axis and on the horizontal axis; the second has on the vertical axis and on the horizontal axis. In the first plot, the height of the noise band is consistently the same for different values of . In the second plot, the height of the noise band is consistently different for different values of .
This phenomenon does not happen when the function is linear. Try drawing Figure 8.8 for a linear function in your mind, and then flip it around as a thought experiment. You’ll see that the height of the noise is the same both ways and so you cannot tell cause and effect apart.
8.5 Causal Inference Methods
Based on Section 8.4, you have the tools to estimate the structure of the causal relations among random variables collected by ABC Center. But just knowing the relations is not enough for the director. He also wants to quantify the causal effects for a specific treatment and outcome label. Specifically, he wants to know what the effect of anxiety reduction is on stable housing. You now turn to average treatment effect estimation methods to try to answer the question. You are working with observational data because ABC Center has not run a controlled experiment to try to tease out this causeandeffect relationship. From Figure 8.5, you know that child care is a confounding variable, and because of proper foresight and not taking shortcuts, you have collected data on it. The values report those clients who received an anxiety reduction treatment, the values report data on clients’ child care situation and other possible confounders, and the values are the client’s outcome label on stable housing.
Remember our working phrase: “those who can’t do, assume.” Just like in causal discovery, causal inference from observational data requires assumptions. A basic assumption in causal inference is similar to the independent and identically distributed (i.i.d.) assumption in machine learning, introduced in Chapter 3. This causal inference assumption, the stable unit treatment value assumption, simply says that the outcome of one client only depends on the treatment made to that client, and is not affected by treatments to other clients. There are two important assumptions:
1. No unmeasured confounders also known as ignorability. The dataset needs to contain all the confounding variables within .
2. Overlap also known as positivity. The probability of the treatment given the confounding variables must be neither equal to nor equal to . It must take a value strictly greater than and strictly less than . This definition explains the name positivity because the probability has to be positive, not identically zero. Another perspective on the assumption is that the probability distribution of for the treatment group and the probability distribution of for the control group should overlap; there should not be any support for one of the two distributions where there isn’t support for the other. Overlap and the lack thereof is illustrated in Figure 8.9 using a couple of datasets.
Figure 8.9. On the left, there is much overlap in the support of the treatment and control groups, so average treatment effect estimation is possible. On the right, there is little overlap, so average treatment effect estimation should not be pursued. Accessible caption. Plots with data points from the treatment group and control group overlaid with regions indicating the support of their underlying distributions. In the left plot, there is much overlap in the support and in the right plot, there isn’t.
Both assumptions together go by the name strong ignorability. If the strong ignorability assumptions are not true, you should not conduct average treatment effect estimation from observational data. Why are these two assumptions needed and why are they important? If the data contains all the possible confounding variables, you can adjust for them to get rid of any confounding bias that may exist. If the data exhibits overlap, you can manipulate or balance the data to make it look like the control group and the treatment group are as similar as can be.
If you’ve just been given a cleaned and prepared version of ABC Center’s data to perform average treatment effect estimation on, what are your next steps? There are four tasks for you to do in an iterative manner, illustrated in Figure 8.10. The first task is to specify a causal method, choosing between (1) treatment models and (2) outcome models. These are the two main branches of conducting causal inference from observational data and were shown back in Figure 8.2. Their details are coming up in the next subsections. The second task in the iterative approach is to specify a machine learning method to be plugged in within the causal method you choose. Both types of causal methods, treatment models and outcome models, are based on machine learning under the hood. The third task is to train the model. The fourth and final task is to evaluate the assumptions to see whether the result can really be viewed as a causal effect.[8] Let’s go ahead and run the four tasks for the ABC Center problem.
Figure 8.10. The steps to follow while conducting average treatment effect estimation from observational data. Outside of this loop, you may also go back to the data preparation phase of the machine learning lifecycle if needed. Accessible caption. A flow diagram starting with specify causal method, leading to specify machine learning method, leading to train model, leading to evaluate assumptions. If assumptions are not met, flow back to specify causal method. If assumptions are met, return average treatment effect.
8.5.1 Treatment Models
The first option for you to consider for your causal method is treatment models. Before diving into treatment models, let’s define an important concept first: propensity score. It is the probability of the treatment (anxiety reduction intervention) conditioned on the possible confounding variables (child care and others), . Ideally, the decision to give a client an anxiety reduction intervention is independent of anything else, including whether the client has child care. This is true in randomized trials, but tends not to be true in observational data.
The goal of treatment models is to get rid of confounding bias by breaking the dependency between and . They do this by assigning weights to the data points so that they better resemble a randomized trial: on average, the clients in the control group and in the treatment group should be similar to each other in their confounding variables. The main idea of the most common method in this branch, inverse probability weighting, is to give more weight to clients in the treatment group that were much more likely to be assigned to the control group and vice versa. Clients given the anxiety reduction treatment are given weight inversely proportional to their propensity score . Clients not given the treatment are similarly weighted which also equals . The average treatment effect of anxiety reduction on stable housing is then simply the weighted mean difference of the outcome label between the treatment group and the control group. If you define the treatment group as and the control group as , then the average treatment effect estimate is
Equation 8.3
Getting the propensity score from training data samples is a machine learning task with features and labels in which you want a (calibrated) continuous score as output (the score was called in Chapter 6). The learning task can be done with any of the machine learning algorithms from Chapter 7. The domains of competence of the different choices of machine learning algorithms for estimating the propensity score are the same as for any other machine learning task, e.g. decision forests for structured datasets and neural networks for large semistructured datasets.
Once you’ve trained a propensity score model, the next step is to evaluate it to see whether it meets the assumptions for causal inference. (Just because you can compute an average treatment effect doesn’t mean that everything is hunkydory and that your answer is actually the causal effect.) There are four main evaluations of a propensity score model: (1) covariate balancing, (2) calibration, (3) overlap of propensity distribution, and (4) area under the receiver operating characteristic (AUC). Calibration and AUC were introduced in Chapter 6 as ways to evaluate typical machine learning problems, but covariate balancing and overlap of propensity distribution are new here. Importantly, the use of AUC to evaluate propensity score models is different than its use to evaluate typical machine learning problems.
Since the goal of inverse probability weighting is to make the potential confounding variables look alike in the treatment and control groups, the first evaluation, covariate balancing, tests whether that has been accomplished. This is done by computing the standardized mean difference (SMD) going onebyone through the features (child care and other possible confounders). Just subtract the mean value of the feature for the control group data from the mean value for the treatment group data, and divide by the square root of the average variance of the feature for the treatment and control groups. ‘Standardized’ refers to the division at the end, which is done so that you don’t have to worry about the absolute scale of different features. An absolute value of SMD greater than about for any feature should be a source of concern. If you see this happening, your propensity score model is not good and you shouldn’t draw causal conclusions from the average treatment effect.
The second evaluation is calibration. Since the propensity score model is used as an actual probability in inverse probability weighting, it has to have good calibration to be effective. Introduced in Chapter 6, the calibration loss needs to be small and the calibration curve needs to be as much of a straight line as possible. If they aren’t, you shouldn’t draw causal conclusions from the average treatment effect you compute and need to go back to step 1.
The third evaluation is based on the distributions of the propensity score for the treatment group and the control group, illustrated in Figure 8.11. Spikes in the distribution near or are bad because they indicate a possible large set of values that can be almost perfectly classified by the propensity score model. Perfect classification means that there is almost no overlap of the treatment group and control group in that region, which is not desired to meet the positivity assumption. If you see such spikes, you should not proceed with this model. (This evaluation doesn’t tell you what the nonoverlap region is, but just that it exists.)
Figure 8.11. Example propensity score distributions; the one on the left indicates a possible overlap violation, whereas the one on the right does not. Accessible caption. Plots with density on the vertical axis and propensity score on the horizontal axis. Each plot overlays a control pdf and a treatment pdf. The pdfs in the left plot do not overlap much and the control group distribution has a spike near . The pdfs in the right plot are almost completely on top of each other.
The fourth evaluation of a treatment model is the AUC. Although its definition and computation are the same as in Chapter 6, good values of the AUC of a propensity score model are not near the perfect . Intermediate values of the AUC, like or , are just right for average treatment effect estimation. A poor AUC of nearly remains bad for a propensity score model. If the AUC is too high or too low, do not proceed with this model. Once you’ve done all the diagnostic evaluations and none of them raise an alert, you should proceed with reporting the average treatment effect that you have computed as an actual causal insight. Otherwise, you have to go back and specify different causal methods and/or machine learning methods.
8.5.2 Outcome Models
The other branch of methods you should consider for the causal model in testing whether ABC Center’s anxiety reduction intervention has an effect on stable housing is outcome models. In treatment models, you use the data to learn the relationship, but you do something different in outcome models. In this branch, you use the data to learn the relationships and . You’re directly predicting the average outcome label of stable housing from the possible confounding variables such as child care along with the anxiety reduction treatment variable. Strong ignorability is required in both treatment models and outcome models. Before moving on to the details of outcome models, the difference between treatment models and outcome models is summarized in Table 8.2.
Table 8.2. Characteristics of the two branches of causal inference methods.
Branch 
Dataset 
What Is Learned 
Purpose 
treatment models 


use to weight the data points 
outcome models 


use directly for average treatment estimation 
Why is learning models from data useful? How do you get the average treatment effect of anxiety reduction on stable housing from them? Remember that the definition of the average treatment effect is . Also, remember that when there is no confounding, the associational distribution and interventional distribution are equal, so . Once you have , you can use something known as the law of iterated expectations to adjust for and get . The trick is to take an expectation over because (The subscripts on the expectations tell you which random variable you’re taking the expectation with respect to.) To take the expectation over , you sum the outcome model over all the values of weighted by the probabilities of each of those values of . It is clear sailing after that to get the average treatment effect because you can compute the difference directly.
You have the causal model; now on to the machine learning model. When the outcome label takes binary values and corresponding the absence and presence of stable housing, then the expected values are equivalent to the probabilities and . Learning these probabilities is a job for a calibrated machine learning classifier with continuous score output trained on labels and features . You can use any machine learning method from Chapter 7 with the same guidelines for domains of competence. Traditionally, it has been common practice to use linear marginbased methods for the classifier, but nonlinear methods should be tried especially for highdimensional data with lots of possible confounding variables.
Just like with treatment models, being able to compute an average treatment effect using outcome models does not automatically mean that your result is a causal inference. You still have to evaluate. A first evaluation, which is also an evaluation for treatment models, is calibration. You want small calibration loss and a straight line calibration curve. A second evaluation for outcome models is accuracy, for example measured using AUC. With outcome models, just like with regular machine learning models but different from treatment models, you want the AUC to be as large as possible approaching . If the AUC is too small, do not proceed with this model and go back to step 1 in the iterative approach to average treatment effect estimation illustrated in Figure 8.10.
A third evaluation for outcome models examines the predictions they produce to evaluate ignorability or no unmeasured confounders. The predicted and values coming out of the outcome models should be similar for clients who were actually part of the treatment group (received the anxiety reduction intervention) and clients who were part of the control group (did not receive the anxiety reduction intervention). If the predictions are not similar, there is still some confounding left over after adjusting for (child care and other variables), which means that the assumption of no unmeasured confounders is violated. Thus, if the predicted and values for the two groups do not mostly overlap, then do not proceed and go back to the choice of causal model and machine learning model.
8.5.3 Conclusion
You’ve evaluated two options for causal inference: treatment models and outcome models. Which option is better in what circumstances? Treatment and outcome modeling are inherently different problems with different features and labels. You can just end up having better evaluations for one causal method than the other using the different machine learning method options available. So just try both branches and see how well the results correspond. Some will be better matched to the relevant modeling tasks, depending on the domains of competence of the machine learning methods under the hood.
But do you know what? You’re in luck and don’t have to choose between the two branches of causal inference. There’s a hybrid approach called doublyrobust estimation in which the propensity score values are added as an additional feature in the outcome model.[9] Doublyrobust models give you the best of both worlds! ABC Center’s director is waiting to decide whether he should invest more in anxiety reduction interventions. Once you’re done with your causal modeling analysis, he’ll be able to make an informed decision.
8.6 Summary
§ Causality is a fundamental concept that expresses how changing one thing (the cause) results in another thing changing (the effect). It is different than correlation, predictability, and dependence.
§ Causal models are critical to inform decisions involving interventions and treatments with expected effects on outcomes. Predictive associational models are not sufficient when you are ‘doing’ something to an input.
§ In addition to informing decisions, causal modeling is a way to avoid harmful spurious relationships in predictive models.
§ Structural causal models extend Bayesian networks by encoding causal relationships in addition to statistical relationships. Their graph structure allows you to understand what causes what, as well as chains of causation, among many variables. Learning their graph structure is known as causal discovery.
§ Causal inference between a hypothesized pair of treatment and outcome is a different problem specification. To validly conduct causal inference from observational data, you must control for confounding.
§ Causal modeling requires assumptions that are difficult to validate, but there is a set of evaluations you should perform as part of modeling to do the best that you can.
[1]Ruocheng Guo, Lu Cheng, Jundong Li, P. Richard Hahn, and Huan Liu. “A Survey of Learning Causality from Data: Problems and Methods.” In: ACM Computing Surveys 53.4 (Jul. 2020), p. 75.
[2]The pdf of a Gaussian random variable with mean and standard deviation is . Its expected value is .
[3]There can be confounding bias without a backdoor path in special cases involving selection bias. Selection bias is when the treatment variable and another variable are common causes for the outcome label.
[4]Clark Glymour, Kun Zhang, and Peter Spirtes. “Review of Causal Discovery Methods Based on Graphical Models.” In: Frontiers in Genetics 10 (Jun. 2019), p. 524.
[5]There are advanced methods for causal discovery that start with observational data and tell you a few important experiments to conduct to get an even better graph, but they are beyond the scope of the book.
[6]Matthew Ridley, Gautam Rao, Frank Schilbach, and Vikram Patel. “Poverty, Depression, and Anxiety: Causal Evidence and Mechanisms.” In: Science 370.6522 (Dec. 2020), p. 1289.
[7]Rajat Sen, Ananda Theertha Suresh, Karthikeyan Shanmugam, Alexandros G. Dimakis, and Sanjay Shakkottai. “ModelPowered Conditional Independence Test.” In: Advances in Neural Information Processing Systems 31 (Dec. 2017), pp. 2955–2965.
[8]Yishai Shimoni, Ehud Karavani, Sivan Ravid, Peter Bak, Tan Hung Ng, Sharon Hensley Alford, Denise Meade, and Ya’ara Goldschmidt. “An Evaluation Toolkit to Guide Model Selection and Cohort Definition in Causal Inference.” arXiv:1906.00442, 2019.
[9]Miguel A. Hernán and James M. Robins. Causal Inference: What If. Boca Raton, Florida, USA: Chapman & Hall/CRC, 2020.