6
Detection Theory
Let’s continue from Chapter 3, where you are the data scientist building the loan approval model for the (fictional) peer-to-peer lender ThriveGuild. As then, you are in the first stage of the machine learning lifecycle, working with the problem owner to specify the goals and indicators of the system. You have already clarified that safety is important, and that it is composed of two parts: basic performance (minimizing aleatoric uncertainty) and reliability (minimizing epistemic uncertainty). Now you want to go into greater depth in the problem specification for the first part: basic performance. (Reliability comes in Part 4 of the book.)
What are the different quantitative metrics you could use in translating the problem-specific goals (e.g. expected profit for the peer-to-peer lender) to machine learning quantities? Once you’ve reached the modeling stage of the lifecycle, how would you know you have a good model? Do you have any special considerations when producing a model for risk assessment rather than simply offering an approve/deny output?
Machine learning models are decision functions: based on the borrower’s features, they decide a response that may lead to an autonomous approval/denial action or be used to support the decision making of the loan officer. The use of decision functions is known as statistical discrimination because we are distinguishing or differentiating one class label from the other. You should contrast the use of the term ‘discrimination’ here with the unwanted discrimination that leads to systematic advantages or disadvantages for certain groups in the context of algorithmic fairness in Chapter 10. Discrimination here is simply telling the difference between things. Your favorite wine snob talking about their discriminative palate is invoking a distinct concept from racial discrimination.
This chapter begins Part 3 of the book on basic modeling (see Figure 6.1 to remind yourself of the lay of the land) and uses detection theory, the study of optimal decision making in the case of categorical output responses,[1] to answer the questions above that you are struggling with.
Figure 6.1. Organization of the book. This third part focuses on the first attribute of trustworthiness, competence and credibility, which maps to machine learning models that are well-performing and accurate. Accessible caption. A flow diagram from left to right with six boxes: part 1: introduction and preliminaries; part 2: data; part 3: basic modeling; part 4: reliability; part 5: interaction; part 6: purpose. Part 3 is highlighted. Parts 3–4 are labeled as attributes of safety. Parts 3–6 are labeled as attributes of trustworthiness.
Specifically, this chapter focuses on:
§ selecting metrics to quantify the basic performance of your decision function (including ones that summarize performance across operating conditions),
§ testing whether your decision function is as good as it could ever be, and
§ differentiating performance in risk assessment problems from performance in binary decision problems.
6.1 Selecting Decision Function Metrics
You, the ThriveGuild data scientist, are faced with the binary detection problem, also known as the binary hypothesis testing problem, of predicting which loan applicants will default, and thereby which applications to deny.[2] Let $Y$ be the loan approval decision with label $y = 0$ corresponding to deny and label $y = 1$ corresponding to approve. Feature vector $X$ contains employment status, income, and other attributes. The value $y = 0$ is called a negative and the value $y = 1$ is called a positive. The random variables for the features and label are governed by the pmfs given the special name likelihood functions, $p_{X \mid Y}(x \mid 0)$ and $p_{X \mid Y}(x \mid 1)$, as well as by prior probabilities $p_0 = P(Y = 0)$ and $p_1 = P(Y = 1)$. The basic task is to find a decision function $\hat{y}(\cdot)$ that predicts a label from the features.[3]
6.1.1 Quantifying the Possible Events
There are four possible events in the binary detection problem:
1. the decision function predicts $\hat{y} = 0$ and the true label is $y = 0$,
2. the decision function predicts $\hat{y} = 0$ and the true label is $y = 1$,
3. the decision function predicts $\hat{y} = 1$ and the true label is $y = 1$, and
4. the decision function predicts $\hat{y} = 1$ and the true label is $y = 0$.
These are known as true negatives (TN), false negatives (FN), true positives (TP), and false positives (FP), respectively. A true negative is denying an applicant who should be denied according to some ground truth, a false negative is denying an applicant who should be approved, a true positive is approving an applicant who should be approved, and a false positive is approving an applicant who should be denied. Let’s organize these events in a table known as the confusion matrix:
| | true $y = 0$ | true $y = 1$ |
| predicted $\hat{y} = 0$ | TN | FN |
| predicted $\hat{y} = 1$ | FP | TP |

Equation 6.1
The probabilities of these events are:

$p_{\mathrm{TN}} = P(\hat{Y} = 0 \mid Y = 0)$
$p_{\mathrm{FN}} = P(\hat{Y} = 0 \mid Y = 1)$
$p_{\mathrm{TP}} = P(\hat{Y} = 1 \mid Y = 1)$
$p_{\mathrm{FP}} = P(\hat{Y} = 1 \mid Y = 0)$

Equation 6.2
These conditional probabilities are nothing more than a direct implementation of the definitions of the events. The probability $p_{\mathrm{TN}}$ is known as the true negative rate as well as the specificity and the selectivity. The probability $p_{\mathrm{FN}}$ is known as the false negative rate as well as the probability of missed detection and the miss rate. The probability $p_{\mathrm{TP}}$ is known as the true positive rate as well as the probability of detection, the recall, the sensitivity, and the power. The probability $p_{\mathrm{FP}}$ is known as the false positive rate as well as the probability of false alarm and the fall-out. The probabilities can be organized in a slightly different table as well:
| | true $y = 0$ | true $y = 1$ |
| predicted $\hat{y} = 0$ | $p_{\mathrm{TN}}$ | $p_{\mathrm{FN}}$ |
| predicted $\hat{y} = 1$ | $p_{\mathrm{FP}}$ | $p_{\mathrm{TP}}$ |

Equation 6.3
These probabilities give you some quantities by which to understand the performance of the decision function $\hat{y}(\cdot)$. Selecting one over the other involves thinking about the events themselves and how they relate to the real-world problem. A false positive, approving an applicant who should be denied, means that a ThriveGuild lender has to bear the cost of a default, so it should be kept small. A false negative, denying an applicant who should be approved, is a lost opportunity for ThriveGuild to make a profit through the interest they charge.
The events above are conditioned on the true label. Conditioning on the predicted label also yields events and probabilities of interest in characterizing performance:
$\mathrm{NPV} = P(Y = 0 \mid \hat{Y} = 0)$
$\mathrm{FOR} = P(Y = 1 \mid \hat{Y} = 0)$
$\mathrm{PPV} = P(Y = 1 \mid \hat{Y} = 1)$
$\mathrm{FDR} = P(Y = 0 \mid \hat{Y} = 1)$

Equation 6.4
These conditional probabilities are reversed from Equation 6.2. The probability $P(Y = 0 \mid \hat{Y} = 0)$ is known as the negative predictive value (NPV). The probability $P(Y = 1 \mid \hat{Y} = 0)$ is known as the false omission rate (FOR). The probability $P(Y = 1 \mid \hat{Y} = 1)$ is known as the positive predictive value (PPV) as well as the precision. The probability $P(Y = 0 \mid \hat{Y} = 1)$ is known as the false discovery rate (FDR). If you care about the quality of the decision function, focus on the first set ($p_{\mathrm{TN}}$, $p_{\mathrm{FN}}$, $p_{\mathrm{TP}}$, $p_{\mathrm{FP}}$). If you care about the quality of the predictions, focus on the second set (NPV, FOR, PPV, FDR).
When you need to numerically compute these probabilities, apply the decision function to several i.i.d. samples of $(X, Y)$ and denote the number of TN, FN, TP, and FP events as $n_{\mathrm{TN}}$, $n_{\mathrm{FN}}$, $n_{\mathrm{TP}}$, and $n_{\mathrm{FP}}$, respectively. Then use the following estimates of the probabilities:
$\hat{p}_{\mathrm{TN}} = \frac{n_{\mathrm{TN}}}{n_{\mathrm{TN}} + n_{\mathrm{FP}}} \qquad \hat{p}_{\mathrm{FN}} = \frac{n_{\mathrm{FN}}}{n_{\mathrm{FN}} + n_{\mathrm{TP}}} \qquad \hat{p}_{\mathrm{TP}} = \frac{n_{\mathrm{TP}}}{n_{\mathrm{TP}} + n_{\mathrm{FN}}} \qquad \hat{p}_{\mathrm{FP}} = \frac{n_{\mathrm{FP}}}{n_{\mathrm{FP}} + n_{\mathrm{TN}}}$

$\widehat{\mathrm{NPV}} = \frac{n_{\mathrm{TN}}}{n_{\mathrm{TN}} + n_{\mathrm{FN}}} \qquad \widehat{\mathrm{FOR}} = \frac{n_{\mathrm{FN}}}{n_{\mathrm{TN}} + n_{\mathrm{FN}}} \qquad \widehat{\mathrm{PPV}} = \frac{n_{\mathrm{TP}}}{n_{\mathrm{TP}} + n_{\mathrm{FP}}} \qquad \widehat{\mathrm{FDR}} = \frac{n_{\mathrm{FP}}}{n_{\mathrm{TP}} + n_{\mathrm{FP}}}$

Equation 6.5
As an example, let’s say that ThriveGuild makes the following (hypothetical) number of decisions: $n_{\mathrm{TN}} = 90$, $n_{\mathrm{FN}} = 10$, $n_{\mathrm{TP}} = 80$, and $n_{\mathrm{FP}} = 20$. You can estimate the various performance probabilities by plugging these numbers into the respective expressions above. The results are $\hat{p}_{\mathrm{TN}} \approx 0.82$, $\hat{p}_{\mathrm{FN}} \approx 0.11$, $\hat{p}_{\mathrm{TP}} \approx 0.89$, $\hat{p}_{\mathrm{FP}} \approx 0.18$, $\widehat{\mathrm{NPV}} = 0.90$, $\widehat{\mathrm{FOR}} = 0.10$, $\widehat{\mathrm{PPV}} = 0.80$, and $\widehat{\mathrm{FDR}} = 0.20$. These are all reasonably good values, but must ultimately be judged according to the ThriveGuild problem owner's goals and objectives.
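A minimal Python sketch of Equation 6.5; the counts are hypothetical illustrative values, not real ThriveGuild data:

```python
# Hypothetical counts of the four events (the n's in Equation 6.5).
n_tn, n_fn, n_tp, n_fp = 90, 10, 80, 20

# Rates conditioned on the true label (estimates of Equation 6.2).
specificity = n_tn / (n_tn + n_fp)   # true negative rate
miss_rate   = n_fn / (n_fn + n_tp)   # false negative rate
recall      = n_tp / (n_tp + n_fn)   # true positive rate
fall_out    = n_fp / (n_fp + n_tn)   # false positive rate

# Rates conditioned on the predicted label (estimates of Equation 6.4).
npv       = n_tn / (n_tn + n_fn)     # negative predictive value
f_o_rate  = n_fn / (n_tn + n_fn)     # false omission rate
precision = n_tp / (n_tp + n_fp)     # positive predictive value
fdr       = n_fp / (n_tp + n_fp)     # false discovery rate

print(round(specificity, 2), round(recall, 2), round(precision, 2))  # → 0.82 0.89 0.8
```

Note that each pair of rates conditioned on the same event sums to one, a quick sanity check when implementing these estimates.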
6.1.2 Summary Performance Metrics
Collectively, false negatives and false positives are errors. The probability of error, also known as the error rate, is the sum of the false negative rate and false positive rate weighted by the prior probabilities:

$P_e = p_0 \, p_{\mathrm{FP}} + p_1 \, p_{\mathrm{FN}}$

Equation 6.6
The balanced probability of error, also known as the balanced error rate, is the unweighted average of the false negative rate and false positive rate:

$P_e^{\mathrm{bal}} = \frac{1}{2}\left(p_{\mathrm{FP}} + p_{\mathrm{FN}}\right)$

Equation 6.7
They summarize the basic performance of the decision function. Balancing is useful when there are a lot more data points with one label than the other, and you care about each type of error equally. Accuracy, the complement of the probability of error, $A = 1 - P_e$, and balanced accuracy, the complement of the balanced probability of error, $A^{\mathrm{bal}} = 1 - P_e^{\mathrm{bal}}$, are sometimes easier for problem owners to appreciate than error rates.
The $F_1$-score, the harmonic mean of recall $p_{\mathrm{TP}}$ and precision $\mathrm{PPV}$, is an accuracy-like summary measure to characterize the quality of a prediction rather than the decision function:

$F_1 = \frac{2 \cdot \mathrm{PPV} \cdot p_{\mathrm{TP}}}{\mathrm{PPV} + p_{\mathrm{TP}}}$

Equation 6.8
Continuing the example from before with $\hat{p}_{\mathrm{FP}} \approx 0.18$ and $\hat{p}_{\mathrm{FN}} \approx 0.11$, let ThriveGuild’s prior probability of receiving applications to be denied according to some ground truth be $p_0 = 0.55$ and applications to be approved be $p_1 = 0.45$. Then, plugging in to the relevant equations above, you’ll find ThriveGuild to have $P_e \approx 0.15$, $A \approx 0.85$, and $F_1 \approx 0.84$. Again, these are reasonable values that may be deemed acceptable to the problem owner.
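Equations 6.6 through 6.8 can likewise be sketched in a few lines of Python; the counts and priors are the same illustrative assumptions as in the running example:

```python
# Hypothetical counts and assumed prior probabilities.
n_tn, n_fn, n_tp, n_fp = 90, 10, 80, 20
p_0, p_1 = 0.55, 0.45

p_fp = n_fp / (n_fp + n_tn)          # false positive rate
p_fn = n_fn / (n_fn + n_tp)          # false negative rate
recall = n_tp / (n_tp + n_fn)        # true positive rate
precision = n_tp / (n_tp + n_fp)     # positive predictive value

p_error = p_0 * p_fp + p_1 * p_fn                     # Equation 6.6
balanced_p_error = (p_fp + p_fn) / 2                  # Equation 6.7
accuracy = 1 - p_error
f1 = 2 * precision * recall / (precision + recall)    # Equation 6.8

print(round(p_error, 2), round(accuracy, 2), round(f1, 2))  # → 0.15 0.85 0.84
```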
As the data scientist, you can get pretty far with these abstract TN, FN, TP, and FP events, but they have to be put in the context of the problem owner’s goals. ThriveGuild cares about making good bets on borrowers so that they are profitable. More generally across real-world applications, error events yield significant consequences for affected people, including loss of life, loss of liberty, loss of livelihood, etc. Therefore, to truly characterize the performance of a decision function, it is important to consider the costs associated with the different events. You can capture these costs through a cost function $c(\hat{y}, y)$ and denote the costs of the TN, FN, TP, and FP events as $c_{\mathrm{TN}}$, $c_{\mathrm{FN}}$, $c_{\mathrm{TP}}$, and $c_{\mathrm{FP}}$, respectively.
Taking costs into account, the characterization of performance for the decision function is known as the Bayes risk $R$:

$R = c_{\mathrm{TN}}\, p_0\, p_{\mathrm{TN}} + c_{\mathrm{FP}}\, p_0\, p_{\mathrm{FP}} + c_{\mathrm{FN}}\, p_1\, p_{\mathrm{FN}} + c_{\mathrm{TP}}\, p_1\, p_{\mathrm{TP}}$

Equation 6.9
Breaking the equation down, you’ll see that the two error probabilities, $p_{\mathrm{FP}}$ and $p_{\mathrm{FN}}$, are the main components, multiplied by their relevant prior probabilities and costs. The probabilities of the non-error events are likewise just multiplied by their prior probabilities and costs. The Bayes risk is the performance metric most often used in finding optimal decision functions. Actually finding the decision function is known as solving the Bayesian detection problem. Eliciting the cost function $c(\hat{y}, y)$ for a given real-world problem from the problem owner is part of value alignment, described in Chapter 14.
A mental model or roadmap, shown in Figure 6.2, to hold throughout the rest of the chapter is that the Bayes risk and the Bayesian detection problem are the central concept, and all other concepts are related to the central concept in various ways and for various purposes. The terms and concepts that have not yet been defined and evaluated are coming up soon.
Figure 6.2. A mental model for different concepts in detection theory surrounding the central concept of Bayes risk and Bayesian detection. Accessible caption. A diagram with Bayes risk and Bayesian detection at the center and four other groups of concepts radiating outwards. False positive rate, false negative rate, error rate, and accuracy are special cases. Receiver operating characteristic, recall-precision curve, and area under the curve arise when examining all operating points. Brier score and calibration curve arise in probabilistic risk assessment. False discovery rate, false omission rate, and $F_1$-score relate to performance of predictions.
Because getting things right is a good thing, it is often assumed that there is no cost to correct decisions, i.e., $c_{\mathrm{TN}} = 0$ and $c_{\mathrm{TP}} = 0$, which is also assumed in this book going forward. In this case, the Bayes risk simplifies to:

$R = c_{\mathrm{FP}}\, p_0\, p_{\mathrm{FP}} + c_{\mathrm{FN}}\, p_1\, p_{\mathrm{FN}}$

Equation 6.10
To arrive at this simplified equation, just insert zeros for $c_{\mathrm{TN}}$ and $c_{\mathrm{TP}}$ in Equation 6.9. The Bayes risk with $c_{\mathrm{FN}} = 1$ and $c_{\mathrm{FP}} = 1$ is the probability of error.
We are implicitly assuming that $c$ does not depend on $x$ except through $\hat{y}(x)$. This assumption is not required, but made for simplicity. You can easily imagine scenarios in which the cost of a decision depends on the feature. For example, if one of the features used in the loan approval decision by ThriveGuild is the value of the loan, the cost of an error (monetary loss) depends on that feature. Nevertheless, for simplicity, we usually make the assumption that the cost function does not explicitly depend on the feature value. For example, under this assumption, the cost of a false negative may be $c_{\mathrm{FN}} = 10$ and the cost of a false positive $c_{\mathrm{FP}} = 30$ for all applicants.
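Equation 6.10 is a one-line computation. The sketch below uses illustrative priors, error rates, and costs (none of them come from real ThriveGuild data):

```python
# Bayes risk with zero cost for correct decisions (Equation 6.10).
# All numbers are illustrative assumptions.
p_0, p_1 = 0.55, 0.45            # prior probabilities of the two labels
c_fp, c_fn = 30.0, 10.0          # costs of a false positive / false negative
p_fp, p_fn = 20 / 110, 10 / 90   # error rates from the running example

bayes_risk = c_fp * p_0 * p_fp + c_fn * p_1 * p_fn
print(round(bayes_risk, 2))  # → 3.5
```

Setting both costs to 1 recovers the probability of error, which is a useful unit test when implementing the risk.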
6.1.3 Accounting for Different Operating Points
The
Bayes risk is all well and good if there is a fixed set of prior probabilities
and a fixed set of costs, but things change. If the economy improves, potential
borrowers might become more reliable in loan repayment. If a different problem
owner comes in and has a different interpretation of opportunity cost, then the
cost of false negatives changes. How
should you think about the performance of decision functions across different
sets of those values, known as different operating points?
Many decision functions are parameterized by a threshold $\eta$ (including the optimal decision function that will be demonstrated in Section 6.2). You can change the decision function to be more or less forgiving of false positives or false negatives, but not both at the same time. Varying $\eta$ explores this tradeoff and yields different error probability pairs $(p_{\mathrm{FN}}, p_{\mathrm{FP}})$, i.e. different operating points. Equivalently, different operating points correspond to different false positive rate and true positive rate pairs $(p_{\mathrm{FP}}, p_{\mathrm{TP}})$. The curve traced out on the $p_{\mathrm{FP}}$–$p_{\mathrm{TP}}$ plane as the parameter $\eta$ is varied from zero to infinity is the receiver operating characteristic (ROC). The ROC takes the value $(0, 0)$ when $\eta \to \infty$ and $(1, 1)$ when $\eta = 0$. You can understand this because at one extreme, the decision function always says $\hat{y} = 0$; in this case there are no FPs and no TPs. At the other extreme, the decision function always says $\hat{y} = 1$; in this case all decisions are either FPs or TPs.
The ROC is a concave, nondecreasing function illustrated in Figure 6.3. The closer to the top left corner it goes, the better. The best ROC for discrimination goes straight up to $(0, 1)$ and then makes a sharp turn to the right. The worst ROC is the diagonal line connecting $(0, 0)$ and $(1, 1)$ achieved by random guessing. The area under the ROC, also known as the area under the curve (AUC), synthesizes performance across all operating points and should be selected as a metric when it is likely that the same threshold-parameterized decision function will be applied in very different operating conditions. Given the shapes of the worst (diagonal line) and best (straight up and then straight to the right) ROC curves, you can see that the AUC ranges from $0.5$ (area of bottom right triangle) to $1$ (area of entire square).[4]
Figure 6.3. An example receiver operating characteristic (ROC). Accessible caption. A plot with $p_{\mathrm{TP}}$ on the vertical axis and $p_{\mathrm{FP}}$ on the horizontal axis. Both axes range from $0$ to $1$. A dashed diagonal line goes from $(0, 0)$ to $(1, 1)$ and corresponds to random guessing. A solid concave curve, the ROC, goes from $(0, 0)$ to $(1, 1)$ staying above and to the left of the diagonal line.
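To see the ROC and AUC in action, here is a small simulation sketch with synthetic Gaussian-distributed scores (the distributions, sample sizes, and seed are all assumptions for illustration). The AUC is estimated through its rank interpretation: the probability that a randomly chosen positive outscores a randomly chosen negative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scores: the positive class tends to score higher.
y = np.concatenate([np.zeros(1000), np.ones(1000)])
s = np.concatenate([rng.normal(0.0, 1.0, 1000),
                    rng.normal(1.5, 1.0, 1000)])

# Trace the ROC by sweeping the threshold over all observed scores.
thresholds = np.sort(s)[::-1]
p_tp = np.array([(s[y == 1] >= t).mean() for t in thresholds])
p_fp = np.array([(s[y == 0] >= t).mean() for t in thresholds])

# AUC as the probability a random positive outscores a random negative.
auc = (s[y == 1][:, None] > s[y == 0][None, :]).mean()
print(0.5 < auc < 1.0)  # → True
```

The sweep reproduces the endpoints discussed above: at the largest threshold almost nothing is declared positive (near $(0,0)$), and at the smallest threshold everything is (exactly $(1,1)$).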
6.2 The Best That You Can Ever Do
As the ThriveGuild data scientist, you have given the problem owner an entire menu of basic performance measures to select from and indicated when different choices are more and less appropriate. The Bayes risk is the most encompassing and most often used performance characterization for a decision function. Let’s say that Bayes risk was chosen in the problem specification stage of the machine learning lifecycle, including selecting the costs. Now you are in the modeling stage and need to figure out if the model is performing well. The best way to do that is to optimize the Bayes risk to obtain the best possible decision function with the smallest Bayes risk and compare the current model’s Bayes risk to it.
“The predictability ceiling is often ignored in mainstream ML research. Every prediction problem has an upper bound for prediction—the Bayes-optimal performance. If you don't have a good sense of what it is for your problem, you are in the dark.”
—Mert R. Sabuncu, computer scientist at Cornell University
Let us denote the best possible decision function as $\hat{y}^*(\cdot)$ and its corresponding Bayes risk as $R^*$. They are specified using the minimization of the expected cost:

$\hat{y}^* = \arg\min_{\hat{y}(\cdot)} \mathbb{E}\left[c(\hat{y}(X), Y)\right], \qquad R^* = \min_{\hat{y}(\cdot)} \mathbb{E}\left[c(\hat{y}(X), Y)\right]$

Equation 6.11

where the expectation is over both $X$ and $Y$. Because it achieves the minimal cost, the function $\hat{y}^*(\cdot)$ is the best possible by definition. Whatever Bayes risk $R^*$ it has, no other decision function can have a lower Bayes risk $R < R^*$.
We aren’t going to work it out here, but the solution to the minimization problem in Equation 6.11 is the Bayes optimal decision function, and takes the following form:

$\hat{y}^*(x) = \begin{cases} 1, & \Lambda(x) \ge \eta \\ 0, & \Lambda(x) < \eta \end{cases}$

Equation 6.12

where $\Lambda(x)$, known as the likelihood ratio, is defined as:

$\Lambda(x) = \frac{p_{X \mid Y}(x \mid 1)}{p_{X \mid Y}(x \mid 0)}$

Equation 6.13

and $\eta$, known as the threshold, is defined as:

$\eta = \frac{p_0\, c_{\mathrm{FP}}}{p_1\, c_{\mathrm{FN}}}$

Equation 6.14
The likelihood ratio is as its name says: it is the ratio of the likelihood functions. It is a scalar value even if the features are multivariate. As the ratio of two non-negative pdf values, it has the range $[0, \infty)$ and can be viewed as a random variable $\Lambda(X)$. The threshold is made up of both costs and prior probabilities. This optimal decision function given in Equation 6.12 is known as the likelihood ratio test.
6.2.1 Example
As an example, let ThriveGuild’s loan approval decision be determined solely by one feature $X$: the income of the applicant (in thousands of dollars). Recall that we modeled income to be exponentially-distributed in Chapter 3. Specifically, let $p_{X \mid Y}(x \mid 0) = \lambda_0 e^{-\lambda_0 x}$ with $\lambda_0 = 1/30$ and $p_{X \mid Y}(x \mid 1) = \lambda_1 e^{-\lambda_1 x}$ with $\lambda_1 = 1/60$, both for $x \ge 0$. Like earlier in this chapter, $p_0 = 0.55$, $p_1 = 0.45$, $c_{\mathrm{FN}} = 10$, and $c_{\mathrm{FP}} = 30$. Then simply plugging in to Equation 6.13, you’ll get:

$\Lambda(x) = \frac{\lambda_1 e^{-\lambda_1 x}}{\lambda_0 e^{-\lambda_0 x}} = \frac{1}{2} e^{x/60}$

Equation 6.15

and plugging in to Equation 6.14, you’ll get:

$\eta = \frac{0.55 \times 30}{0.45 \times 10} = \frac{11}{3} \approx 3.67$

Equation 6.16

Plugging these expressions into the Bayes optimal decision function given in Equation 6.12, you’ll get:

$\hat{y}^*(x) = \begin{cases} 1, & \frac{1}{2} e^{x/60} \ge \frac{11}{3} \\ 0, & \frac{1}{2} e^{x/60} < \frac{11}{3} \end{cases}$

Equation 6.17
which can be simplified to:

$\hat{y}^*(x) = \begin{cases} 1, & x \ge 60 \ln\frac{22}{3} \approx 119.5 \\ 0, & \text{otherwise} \end{cases}$

Equation 6.18

by multiplying both sides of the inequalities in both cases by $2$, taking the natural logarithm, and then multiplying by $60$. Applicants with an income less than approximately $119.5$ are denied and applicants with an income greater than that threshold are approved. The expected value of $X$ given $Y = 0$ is $1/\lambda_0 = 30$ and the expected value of $X$ given $Y = 1$ is $1/\lambda_1 = 60$. Thus in this example, an applicant's income has to be quite a bit higher than the mean to be approved.
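The likelihood ratio test for this kind of exponential-income example can be sketched directly in Python. All parameter values below (exponential means of 30 and 60, priors 0.55/0.45, costs 10 and 30) are illustrative assumptions; a real deployment would estimate them from data:

```python
import numpy as np

# Assumed parameters: income is exponential with mean 30 under deny (y=0)
# and mean 60 under approve (y=1); priors and costs are also assumed.
lam0, lam1 = 1 / 30, 1 / 60
p0, p1 = 0.55, 0.45
c_fp, c_fn = 30.0, 10.0

def likelihood_ratio(x):
    """Ratio of the two exponential pdfs (Equation 6.13)."""
    return (lam1 * np.exp(-lam1 * x)) / (lam0 * np.exp(-lam0 * x))

eta = (p0 * c_fp) / (p1 * c_fn)                       # Equation 6.14
x_star = np.log(eta * lam0 / lam1) / (lam0 - lam1)    # income threshold

def decide(x):
    """Bayes optimal decision: 1 (approve) iff the LRT exceeds eta."""
    return int(likelihood_ratio(x) >= eta)

print(round(x_star, 1))                        # → 119.5
print(decide(x_star - 1), decide(x_star + 1))  # → 0 1
```

Thresholding the likelihood ratio at $\eta$ is equivalent to thresholding income at $x^*$, since the likelihood ratio is monotonically increasing in income here.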
You should use the Bayes-optimal risk $R^*$ as a lower bound on the risk of any machine learning classifier that you might try for a given data distribution.[5] No matter how hard you work or how creative you are, you can never overcome the Bayes limit. So you should be happy if you get close. If the Bayes-optimal risk itself is too high, then the thing to do is to go back to the data understanding and data preparation stages of the machine learning lifecycle and get more informative data.
6.3 Risk Assessment and Calibration
To approve or to deny, that is the question for ThriveGuild. Or is it? Maybe the question is actually: what is the probability that the borrower will default? Maybe the problem is not binary classification, but probabilistic risk assessment. It is certainly an option for you, the data scientist, and the problem owner to consider during problem specification. Thresholding a probabilistic risk assessment yields a classification, but there are a few subtleties for you to weigh.
The likelihood ratio ranges from zero to infinity and the threshold value $\eta = 1$ is optimal for equal priors and equal costs. Applying any monotonically increasing function to both the likelihood ratio and the threshold still yields a Bayes optimal decision function with the same risk $R^*$. That is,

$\hat{y}(x) = \begin{cases} 1, & g(\Lambda(x)) \ge g(\eta) \\ 0, & g(\Lambda(x)) < g(\eta) \end{cases}$

Equation 6.19

for any monotonically increasing function $g(\cdot)$ is still optimal.
It is somewhat more natural to think of a score to be in the range $[0, 1]$ because it corresponds to the label values $\{0, 1\}$ and could also potentially be interpreted as a probability. The score, a continuous-valued output of the decision function, can then be thought of as a confidence in the prediction and be obtained by applying a suitable $g(\cdot)$ function to the likelihood ratio. In this case, $g(\eta) = 0.5$ is the threshold for equal priors and costs. Intermediate score values are less confident and extreme score values (towards $0$ and $1$) are more confident. Just as the likelihood ratio may be viewed as a random variable, the score may also be viewed as a random variable $S$. The Brier score is an appropriate performance metric for the continuous-valued output score of the decision function:

$BS = \mathbb{E}\left[(S - Y)^2\right]$

Equation 6.20

It is the mean-squared error of the score $S$ with respect to the true label $Y$. For a finite number of samples $(s_j, y_j)$, $j = 1, \dots, n$, you can compute it as:

$BS \approx \frac{1}{n} \sum_{j=1}^{n} (s_j - y_j)^2$

Equation 6.21
The Brier score decomposes into the sum of two separable components: calibration and refinement.[6] The concept of calibration is that the predicted score corresponds to the proportion of positive true labels. For example, a bunch of data points all having a calibrated score of $0.7$ implies that 70% of them have true label $y = 1$ and 30% of them have true label $y = 0$. Said another way, perfect calibration implies that the probability of the true label $Y$ being $1$ given the predicted score $S$ being $s$ is the value $s$ itself: $P(Y = 1 \mid S = s) = s$. Calibration is important for probabilistic risk assessments: a perfectly calibrated score can be interpreted as a probability of predicting one class or the other. It is also an important concept for evaluating causal inference methods, described in Chapter 8, for algorithmic fairness, described in Chapter 10, and for communicating uncertainty, described in Chapter 13.
Since any monotonically increasing transformation $g(\cdot)$ can be applied to a decision function without changing its ability to discriminate, you can improve the calibration of a decision function by finding a better $g(\cdot)$. The calibration loss quantitatively captures how close a decision function is to perfect calibration. The refinement loss is a sort of variance of how tightly the true labels distribute around a given score. For samples $(s_j, y_j)$ that have been sorted by their score values and binned into $B$ groups $\mathcal{B}_b$, with average values $\bar{s}_b$ and $\bar{y}_b$ within the bins:

$\text{calibration loss} = \frac{1}{n} \sum_{b=1}^{B} |\mathcal{B}_b| \left(\bar{s}_b - \bar{y}_b\right)^2, \qquad \text{refinement loss} = \frac{1}{n} \sum_{b=1}^{B} |\mathcal{B}_b|\, \bar{y}_b \left(1 - \bar{y}_b\right)$

Equation 6.22
As stated earlier, the sum of the calibration loss and refinement loss is the Brier score.
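A small simulation sketch shows the decomposition approximately holding under binning. The scores here are synthetic and deliberately miscalibrated (the true positive probability is the square of the score); all distributional choices are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic scores and labels: the label is 1 with probability s**2,
# so the score s is deliberately miscalibrated.
n = 10000
s = rng.uniform(0.0, 1.0, n)
y = (rng.uniform(0.0, 1.0, n) < s**2).astype(float)

brier = np.mean((s - y) ** 2)                 # Equation 6.21

# Bin by score; compare mean score to mean label per bin (Equation 6.22).
bins = np.clip((s * 10).astype(int), 0, 9)
calibration = refinement = 0.0
for b in range(10):
    mask = bins == b
    n_b = mask.sum()
    s_bar, y_bar = s[mask].mean(), y[mask].mean()
    calibration += n_b * (s_bar - y_bar) ** 2 / n
    refinement += n_b * y_bar * (1 - y_bar) / n

# The two losses approximately sum to the Brier score (exactly so if
# the score were constant within each bin).
print(abs(calibration + refinement - brier) < 0.01)  # → True
```

The nonzero calibration loss here reflects the built-in miscalibration; applying the monotone map $g(s) = s^2$ to the scores would shrink it without changing discrimination.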
A calibration curve, also known as a reliability diagram, shows the $(\bar{s}_b, \bar{y}_b)$ values as a plot. One example is shown in Figure 6.4. The closer to a straight diagonal from $(0, 0)$ to $(1, 1)$, the better. Plotting this curve is a good diagnostic tool for you to understand the calibration of a decision function.
Figure 6.4. An example calibration curve. Accessible caption. A plot with $\bar{y}$ on the vertical axis and $\bar{s}$ on the horizontal axis. Both axes range from $0$ to $1$. A dashed diagonal line goes from $(0, 0)$ to $(1, 1)$ and corresponds to perfect calibration. A solid S-shaped curve, the calibration curve, goes from $(0, 0)$ to $(1, 1)$ starting below and to the right of the diagonal line before crossing over to being above and to the left of the diagonal line.
6.4 Summary
§ Four possible events result from binary decisions: false negatives, true negatives, false positives, and true positives.
§ Different ways to combine the probabilities of these events lead to classifier performance metrics appropriate for different real-world contexts.
§ One important one is Bayes risk: the combination of the false negative probability and false positive probability weighted by both the costs of those errors and the prior probabilities of the labels. It is the basic performance measure for the first attribute of safety and trustworthiness.
§ Detection theory, the study of optimal decisions, provides fundamental limits on how well machine learning models may ever perform and is a tool for you to assess the basic performance of your models.
§ Decision functions may output continuous-valued scores rather than only hard, zero or one, decisions. Scores indicate confidence in a prediction. Calibrated scores are those for which the score value is the probability of a sample belonging to a label class.
[1]Estimation theory is the study of optimal decision making in the case of continuous output responses.
[2]For ease of explanation in this chapter and in later parts of the book, we mostly stick with the case of two label values and do not delve much into the case with more than two label values.
[3]This is also the basic task of supervised machine learning. In supervised learning, the decision function is learned from data samples drawn from the joint distribution rather than from the distributions themselves; supervised learning is coming up soon enough in the next chapter, Chapter 7.
[4]The recall-precision curve is an alternative way to understand performance across operating points. It is the curve traced out on the recall–precision plane, starting at $(0, 1)$ and ending at $(1, p_1)$. It has a one-to-one mapping with the ROC and is more easily understood by some people. Jesse Davis and Mark Goadrich. “The Relationship Between Precision-Recall and ROC Curves.” In: Proceedings of the International Conference on Machine Learning. Pittsburgh, Pennsylvania, USA, Jun. 2006, pp. 233–240.
[5]There are techniques for estimating the Bayes risk of a dataset without having access to its underlying probability distribution. Ryan Theisen, Huan Wang, Lav R. Varshney, Caiming Xiong, and Richard Socher. “Evaluating State-of-the-Art Classification Models Against Bayes Optimality.” In: Advances in Neural Information Processing Systems 34 (Dec. 2021).
[6]José Hernández-Orallo, Peter Flach, and Cèsar Ferri. “A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss.” In: Journal of Machine Learning Research 13 (Oct. 2012), pp. 2813–2869.