7

Supervised Learning

The (fictional) information technology company JCN Corporation is reinventing itself and changing its focus to artificial intelligence and cloud computing. As part of managing its talent during this enterprise transformation, it is conducting a machine learning project to estimate the expertise of its employees from a variety of data sources such as self-assessments of skills, work artifacts (patents, publications, software documentation, service claims, sales opportunities, etc.), internal non-private social media posts, and tabular data records including the employee’s length of service, reporting chain, and pay grade. A random subset of the employees has been explicitly evaluated on a binary yes/no scale for various AI and cloud skills, which constitute the labeled training data for machine learning. JCN Corporation’s data science team has been given the mission to predict the expertise evaluation for all the other employees in the company. For simplicity, let’s focus on only one of the expertise areas: serverless architecture.

Imagine that you are on JCN Corporation’s data science team and have progressed beyond the problem specification, data understanding, and data preparation phases of the machine learning lifecycle and are now at the modeling phase. By applying detection theory, you have chosen an appropriate quantification of performance for predicting an employee’s skill in serverless architecture: the error rate—the Bayes risk with equal costs for false positives and false negatives—introduced in Chapter 6.

It is now time to get down to the business of learning a decision function (a classifier) from the training data that generalizes well to predict expertise labels for the unlabeled employees. Deep learning is one family of classifiers that is on the tip of everyone’s tongue. It is certainly one option for you, but there are many other kinds of classifiers too. How will you evaluate different classification algorithms to select the best one for your problem?

“My experience in industry strongly confirms that deep learning is a narrow sliver of methods needed for solving complex automated decision making problems.”

—Zoubin Ghahramani, chief scientist at Uber

A very important concept in practicing machine learning, first mentioned in Chapter 2, is the no free lunch theorem. There is no one single machine learning method that is best for all datasets.[1] What is a good choice for one dataset might not be so great for another dataset. It all depends on the characteristics of the dataset and the inductive bias of the method: the assumptions on how the classifier should generalize outside the training data points. The challenge in achieving good generalization and a small error rate is protecting against overfitting (learning a model that too closely matches the idiosyncrasies of the training data) and underfitting (learning a model that does not adequately capture the patterns in the training data). The goal is to get to the Goldilocks point where things are not too hot (overfitting) and not too cold (underfitting), but just right.

An implication of the no free lunch theorem is that you must try several different methods for the JCN Corporation expertise problem and see how they perform empirically before deciding on one over another. Simply brute forcing it—training all the different methods and computing their test error to see which one is smallest—is common practice, but you decide that you want to take a more refined approach and analyze the inductive biases of different classifiers. Your analysis will determine the domains of competence of various classifiers: what types of datasets do they perform well on and what type of datasets do they perform poorly on.[2] Recall that competence or basic accuracy is the first attribute of trustworthy machine learning as well as the first half of safety.

Why would you want to take this refined approach instead of simply applying a bunch of machine learning methods from software packages such as scikit-learn, tensorflow, and pytorch without analyzing their inductive biases? First, you have heeded the admonitions from earlier chapters to be safe and to not take shortcuts. More importantly, however, you know you will later be creating new algorithms that respect the second (reliability) and third (interaction) attributes of trustworthiness. You must not only be able to apply algorithms, you must be able to analyze and evaluate them before you can create. Now go forth and analyze classifiers for inventorying expertise in the JCN Corporation workforce.

7.1 Domains of Competence

Different classifiers work well on different datasets depending on their characteristics.[3] But what characteristics of a dataset matter? What are the parameters of a domain of competence? A key concept to answer those questions is the decision boundary. In Chapter 6, you learned that the Bayes optimal decision function is a likelihood ratio test which is a threshold of the one-dimensional likelihood ratio. If you invert the likelihood ratio, you can go back to the feature space with feature dimensions and trace out surfaces to which that single threshold value maps. The collection of these surfaces is a level set of the likelihood ratio function and is known as the decision boundary. Imagine the likelihood ratio function being like the topography and bathymetry of the Earth. Anything underwater receives the classification (employee is unskilled in serverless architecture) and anything above water receives the classification (employee is skilled in serverless architecture). Sea level is the threshold value and the coastline is the level set or decision boundary. An example of a decision boundary for a two-dimensional feature space is shown in Figure 7.1.

Figure 7.1. *An example of a decision boundary in a feature
space. The gray regions correspond to feature values for which the decision
function predicts employees are skilled in serverless architecture. The white
regions correspond to features for which the decision function predicts
employees are unskilled in serverless architecture. The black lines are the
decision boundary.* Accessible caption. A stylized plot with the first
feature dimension on the horizontal
axis and the second feature dimension on the vertical
axis. The space is partitioned into a couple of blob-like gray regions labeled and a white region
labeled . The boundary
between the regions is marked as the decision boundary. Classifier regions do
not have to be all one connected component.

Three key characteristics of a dataset determine how well the inductive biases of a classifier match the dataset:

1. overlap of data points from the two class labels near the decision boundary,

2. linearity or nonlinearity of the decision boundary, and

3. number of data points, their density, and amount of clustering.

Classifier domains of competence are defined in terms of these three considerations.[4] Importantly, domains of competence are relative notions: does one classification algorithm work better than others?[5] They are not absolute notions, because at the end of the day, the absolute performance is limited by the Bayes optimal risk defined in Chapter 6. For example, one classification method that you tried may work better than others on datasets with a lot of class overlap near the decision boundary, nearly linear shape of the decision boundary, and not many data points. Another classification method may work better than others on datasets without much class overlap near a tortuously-shaped decision boundary. Yet another classification method may work better than others on very large datasets. In the remainder of this chapter, you will analyze many different supervised learning algorithms. The aim is not only describing how they work, but analyzing their inductive biases and domains of competence.

7.2 Two Ways to Approach Supervised Learning

Let’s begin by cataloging what you and the team of JCN Corporation data scientists have at your disposal. Your training dataset consists of samples independently drawn from the probability distribution . The features sampled from the random variable are numerical or categorical quantities derived from skill self-assessments, work artifacts, and so on. There are features, so is a -dimensional vector. The labels , sampled from the random variable are the binary (zero or one) expertise evaluations on serverless architecture. Importantly, you do not have access to the precise distribution , but only to the finite number of samples contained in the training dataset drawn from the distribution. This is the key difference between the supervised machine learning problem and the Bayesian detection problem introduced in Chapter 6. The goal is the same in both the machine learning and detection problems: find a decision function that predicts labels from features.

What are your options to find the classifier based on the training data? You cannot simply minimize the Bayes risk functional or the probability of error directly, because that would rely on full knowledge of the probability distribution of the features and labels, which you do not have. You and the team have two options:

1. plug-in approach: estimate the likelihood functions and prior probabilities from the training data, and plug them into the Bayes optimal likelihood ratio test described in Chapter 6, or

2. risk minimization: optimize a classifier over an empirical approximation to the error rate computed on the training data samples.

There are specific methods within these two broad categories of supervised classification algorithms. A mental model for different ways of doing supervised machine learning is shown in Figure 7.2.

7.3 Plug-In Approach

First, you and the rest of the JCN Corporation data science team try out plug-in methods for supervised classification. The main idea is to use the training data to estimate the likelihood functions and , and then plug them into the likelihood ratio to obtain the classifier.

7.3.1 Discriminant Analysis

One of the most straightforward plug-in methods, discriminant analysis, assumes a parametric form for the likelihood functions and estimates their parameters. Then just like in Chapter 6, it obtains a decision function by taking the ratio of these likelihood functions and comparing them to a threshold . The actual underlying likelihood functions do not have to be exactly their assumed forms and usually aren’t in practice. If they are somewhat close, that is good enough. The assumed parametric form is precisely the inductive bias of discriminant analysis.

Figure 7.2. *A mental model for different ways of
approaching supervised machine learning.* A hierarchy diagram with
supervised learning at its root. Supervised learning has children plug-in and
risk minimization. Plug-in has children parametric and nonparametric.
Parametric has children linear discriminant analysis and quadratic discriminant
analysis. Nonparametric has children k-nearest neighbor and naïve Bayes. Risk
minimization has children empirical risk minimization and structural risk
minimization. Structural risk minimization has children decision trees and
forests, margin-based methods, and neural networks.

If the assumed parametric form for the likelihood functions is multivariate Gaussian in dimensions with mean parameters and and covariance matrix parameters and ,[6] then the first step is to compute their empirical estimates , , , and from the training data, which you know how to do from Chapter 3. The second step is to plug those estimates into the likelihood ratio to get the classifier decision function. Under the Gaussian assumption, the method is known as quadratic discriminant analysis because after rearranging and simplifying the likelihood ratio, the quantity compared to a threshold turns out to be a quadratic function of . If you further assume that the two covariance matrices and are the same matrix , then the quantity compared to a threshold is even simpler: it is a linear function of , and the method is known as linear discriminant analysis.

Figure 7.3 shows examples of linear and quadratic discriminant analysis classifiers in dimensions trained on the data points shown in the figure. The red diamonds are the employees in the training set unskilled at serverless architecture. The green squares are the employees in the training set skilled at serverless architecture. The domain of competence for linear and quadratic discriminant analysis is datasets whose decision boundary is mostly linear, with a dense set of data points of both class labels near that boundary.

Figure 7.3. *Examples of linear discriminant analysis (left)
and quadratic discriminant analysis (right) classifiers.* Accessible Caption.
Stylized plot showing two classes of data points arranged in a noisy yin yang
or interleaving moons configuration. The linear discriminant decision boundary
is a straight line that cuts through the middle of the two classes. The
quadratic discriminant decision boundary is a smooth curve that turns a little
to enclose one of the classes.

7.3.2 Nonparametric Density Estimation

You continue your quest to analyze different classifiers for estimating the expertise of JCN employees. Instead of assuming a parametric form for the likelihood functions like in discriminant analysis, you try to estimate the likelihood functions in a nonparametric fashion. The word nonparametric is a misnomer. It does not mean that there are no parameters in the estimated likelihood function at all; it means that the number of parameters is on par with the number of training data points.

A common way of estimating a likelihood function nonparametrically is kernel density estimation. The idea is to place a smooth function like a Gaussian pdf centered on each of the training data points and take the normalized sum of those functions as the estimate of the likelihood function. In this case, the parameters are the centers of the smooth functions, so the number of parameters equals the number of data points. Doing this for both likelihood functions separately, taking their ratio, and comparing to a threshold yields a valid classifier. However, it is a pretty complicated classifier. You would need a lot of data to get a good kernel density estimate, especially when the data has a lot of feature dimensions .

Instead of doing the full density estimate, a simplification is to assume that all the feature dimensions of are mutually independent. Under this assumption, the likelihood functions factor into products of one-dimensional pdfs that can be estimated separately with much less data. If you take the ratio of these products of one-dimensional pdfs (a likelihood ratio) and compare to a threshold, voilà, you have a naïve Bayes classifier. The name of this method contains ‘naïve’ because it is somewhat naïve to assume that all feature dimensions are independent—they never are in real life. It contains ‘Bayes’ because of plugging in to the Bayes-optimal likelihood ratio test. Often, this classifier does not outperform other classifiers in terms of accuracy, so its domain of competence is often non-existent.

A different nonparametric method is the k-nearest neighbor classifier. The idea behind it is very simple. Look at the labels of the closest training data points and predict whichever label is more common in those nearby points. A distance metric is needed to measure ‘close’ and ‘near.’ Typically, Euclidean distance (the normal straight-line distance) is used, but other distance metrics could be used instead. The k-nearest neighbor method works better than other classifiers when the decision boundary is very wiggly and broken up into lots of components, and when there is not much overlap in the classes. Figure 7.4 shows examples of naïve Bayes and k-nearest neighbor classifiers in two dimensions. The k-nearest neighbor classifier in the figure has .

Figure 7.4. *Examples of naïve Bayes (left) and k-nearest
neighbor (right) classifiers. *Accessible caption. Stylized plot showing two
classes of data points arranged in a noisy yin yang or interleaving moons configuration.
The naïve Bayes decision boundary is a smooth curve that turns a little to
enclose one of the classes. The k-nearest neighbor decision boundary is very
jagged and traces out the positions of the classes closely.

7.4 Risk Minimization Basics

You and the JCN team have tried out a few plug-in methods for your task of predicting which employees are skilled in serverless architecture and are ready to move on to a different category of machine learning methods: risk minimization. Whereas plug-in methods take one step back from the Bayes-optimal likelihood ratio test by estimating the likelihood functions from data, risk minimization takes two steps back and directly tries to find decision functions or decision boundaries that minimize an estimate of the Bayes risk.

7.4.1 Empirical Risk Minimization

Remember from Chapter 6 that the probability of error , the special case of the Bayes risk with equal costs that you’ve chosen as the performance metric, is:

Equation 7.1

The prior probabilities of the class labels and multiply the probabilities of the events when the decision function is wrong and . You cannot directly compute the error probability because you do not have access to the full underlying probability distribution. But is there an approximation to the error probability that you can compute using the training data?

First, because the training data is sampled i.i.d. from the underlying distribution, the proportion of employees in the training data set skilled and unskilled at serverless architecture will approximately match the prior probabilities and , so you do not have to worry about them explicitly. Second, the probabilities of both the false positive event and false negative event event can be expressed collectively as , which corresponds to for training data samples. The zero-one loss function captures this by returning the value for and the value for . Putting all these things together, the empirical approximation to the error probability, known as the empirical risk , is:

Equation 7.2

Minimizing the empirical risk over all possible decision functions is a possible classification algorithm, but not one that you and the other JCN Corporation data scientists evaluate just yet. Let’s understand why not.

7.4.2 Structural Risk Minimization

Without any constraints, you can find a decision function that brings the empirical risk to zero but does not generalize well to new unseen data points. At the extreme, think about a classifier that memorizes the training data points and gets them perfectly correct, but always predicts (unskilled at serverless architecture) everywhere else. This is not the desired behavior—the classifier has overfit. So just minimizing the empirical risk does not yield a competent classifier. This memorizing classifier is pretty complex. There’s nothing smooth or simple about it because it has as many discontinuities as there are training set employees skilled at serverless architecture.

Constraining the complexity of the classifier forces it to not overfit. To be a bit more precise, if you constrain the decision function to be an element of some class of functions or hypothesis space that only includes low-complexity functions, then you will prevent overfitting. But you can go too far with the constraints as well. If the hypothesis space is too small and does not contain any functions with the capacity to capture the important patterns in the data, it may underfit the data and not generalize either. It is important to control the hypothesis space to be just right. This idea is known as the structural risk minimization principle.

Figure 7.5 shows the idea in pictorial form using a sequence of nested hypothesis spaces . As an example of nested hypothesis spaces, could be all constant functions, could be all linear functions, could be all quadratic functions, and could be all polynomial functions. contains the least complex functions while also contains more complex functions. underfits as it has large values for both the empirical risk calculated on the training data and the probability of error , which measures generalization. overfits as it has zero and a large value for . achieves a good balance and is just right.

Figure 7.5. *Illustration of the structural risk
minimization principle.* Accessible caption. A plot with or on the vertical
axis and increasing complexity of hypothesis spaces on the horizontal axis. The
empirical risk decreases all the way to zero with increasing complexity. The
generalization error first decreases and then increases. It has a sweet spot in
the middle.

The hypothesis space is the inductive bias of the classifier. Thus, within the paradigm of the structural risk minimization principle, different choices of hypothesis spaces yield different domains of competence. In the next section, you and your team of JCN data scientists analyze several different risk minimization classifiers popularly used in practice, including decision trees and forests, margin-based classifiers (logistic regression, support vector machines, etc.), and neural networks.

7.5 Risk Minimization Algorithms

You are now analyzing the competence of some of the most popular classifiers used today that fit into the risk minimization paradigm. The basic problem is to find the function within the hypothesis space that minimizes the average loss function :

Equation 7.3

This equation may look familiar because it is similar to the Bayesian detection problem in Chapter 6. The function in the hypothesis space that minimizes the sum of the losses on the training data is . Different methods have different hypothesis spaces and different loss functions . An alternative way to control the complexity of the classifier is not through changing the hypothesis space , but through a complexity penalty or regularization term weighted by a regularization parameter :

Equation 7.4

The choice of regularization term also yields an inductive bias for you to analyze.

7.5.1 Decision Trees and Forests

One of the simplest hypothesis spaces is the set of decision stumps or one-rules. These classifiers create a single split along a single feature dimension like a numerical expertise self-assessment feature or a length of service feature. Any data point whose value is on one side of a threshold gets classified as skilled in serverless architecture, and on the other side as unskilled in serverless architecture. For categorical features, a split is just a partitioning of the values into two groups. The other features besides the one participating in the decision stump can be anything. An example of a decision stump is shown in Figure 7.6 as a node with two branches and also through its decision boundary.

Figure 7.6. *An example of a decision stump classifier. *Accessible
caption. On the left is a decision node . When it is true, and when it is
false, . On the right is a
stylized plot showing two classes of data points arranged in a noisy yin yang
or interleaving moons configuration. The decision boundary is a horizontal
line.

The hypothesis space of decision trees includes decision functions with more complexity than decision stumps. A decision tree is created by splitting on single feature dimensions within each branch of the decision stump, splitting within those splits, and so on. An example of a decision tree with two levels is shown in Figure 7.7. Decision trees can go much deeper than two levels to create fairly complex decision boundaries. An example of a complex decision boundary from a decision tree classifier is shown in Figure 7.8.

Figure 7.7. *An example of a two-level decision tree classifier.*
Accessible caption. On the left is a decision node . When it is true,
there is another decision node . When this
decision node is true, and when it is
false, When the top
decision node is false, there is another decision node . When this
decision node is true, and when it is
false, On the right is a
stylized plot showing two classes of data points arranged in a noisy yin yang
or interleaving moons configuration. The decision boundary is a made up of
three line segments: the first segment is vertical, it turns right into a
horizontal segment, and then up into another vertical segment.

Figure 7.8. *An example decision tree classifier with many
levels.* Accessible caption. A stylized plot showing two classes of data
points arranged in a noisy yin yang or interleaving moons configuration. The
decision boundary is a made up of several vertical and horizontal segments.

The hypothesis space of decision forests is made up of ensembles of decision trees that vote for their prediction, possibly with an unequal weighting given to different trees. The weighted majority vote from the decision trees is the overall classification. The mental model for a decision forest is illustrated in Figure 7.9 and an example decision boundary is given in Figure 7.10.

Figure 7.9. *A mental model for a decision forest.* Accessible
caption. Three individual decision trees each predict separately. Their
predictions feed into a vote node which outputs . The predictions
from each tree are combined using a (weighted) majority vote.

Figure 7.10. A*n example decision forest classifier*. Accessible
caption. Stylized plot showing two classes of data points arranged in a noisy
yin yang or interleaving moons configuration. The decision boundary is fairly
jagged with mostly axis-aligned segments and traces out the positions of the
classes closely.

Decision stumps and decision trees can be directly optimized for the zero-one loss function that appears in the empirical risk.[7] More commonly, however, greedy heuristic methods are employed for learning decision trees in which the split for each node is done one at a time, starting from the root and progressing to the leaves. The split is chosen so that each branch is as pure as can be for the two classes: mostly just employees skilled at serverless architecture on one side of the split, and mostly just employees unskilled at serverless architecture on the other side of the split. The purity can be quantified by two different information-theoretic measures, information gain and Gini index, which were introduced in Chapter 3. Two decision tree algorithms are popularly-used: the C5.0 decision tree that uses information gain as its splitting criterion, and the classification and regression tree (CART) that uses Gini index as its splitting criterion. The depth of decision trees is controlled to prevent overfitting. The domain of competence of C5.0 and CART decision trees is tabular datasets in which the phenomena represented in the features tend to have threshold and clustering behaviors without much class overlap.

Decision forests are made up of a lot of decision trees. C5.0 or CART trees are usually used as these base classifiers. There are two popular ways to train decision forests: bagging and boosting. In bagging, different subsets of the training data are presented to different trees and each tree is trained separately. All trees have equal weight in the majority vote. In boosting, a sequential procedure is followed. The first tree is trained in the standard manner. The second tree is trained to focus on the training samples that the first tree got wrong. The third tree focuses on the errors of the first two trees, and so on. Earlier trees receive greater weight. Decision forests have good competence because of the diversity of their base classifiers. As long as the individual trees are somewhat competent, any unique mistake that any one tree makes is washed out by the others for an overall improvement in generalization.

The random forest classifier is the most popular bagged decision forest and the XGBoost classifier is the most popular boosted decision forest. Both have very large domains of competence. They are robust and work extremely well for almost all kinds of structured datasets. They are the first-choice algorithms for practicing data scientists to achieve good accuracy models with little to no tuning of parameters.

7.5.2 Margin-Based Methods

Margin-based classifiers constitute another popular family of supervised learning algorithms. This family includes logistic regression and support vector machines (SVMs). The hypothesis space of margin-based classifiers is more complex than decision stumps, but in a different way than decision trees. Margin-based classifiers allow any linear decision boundary rather than only ones parallel to single feature dimensions. Going even further, margin-based classifiers can have nonlinear decision boundaries in the original feature space by applying nonlinear functions to the features and finding linear decision boundaries in that transformed space.[8]

The main concept of these algorithms is the margin, the distance of data points to the decision boundary. With a linear decision boundary, the form of the classifier is [9] where is a weight vector or coefficient vector that is learned from the training data. The absolute value of is the distance of the data point to the decision boundary and is thus the margin of the point. The quantity is positive if is on one side of the hyperplane defined by and negative if is on the other side. The function gives a classification of (unskilled at serverless architecture) for negative margin and a classification of (skilled at serverless architecture) for positive margin. The stuff with the function (adding one and dividing by two) is just a way to recreate the behavior of the function.

Surrogates for the zero-one loss function are used in the risk minimization problem. Instead of taking two arguments, these margin-based loss functions take the single argument . When is multiplied by , the result is positive for a correct classification and negative for an incorrect classification.[10] The loss is large for negative inputs and small or zero for positive inputs. In logistic regression, the loss function is the logistic loss: and in SVMs, the loss function is the hinge loss: . The shape of these loss function curves is shown in Figure 7.11.

The regularization term for the standard forms of logistic regression and SVMs is , the length-squared of the coefficient vector (also known as the -norm squared). The loss function, regularization term, and nonlinear feature mapping together constitute the inductive bias of the classifier. An alternative regularization term is the sum of the absolute values of the coefficients in (also known as the -norm), which provides the inductive bias for to have many zero-valued coefficients. Example linear and nonlinear logistic regression and SVM classifiers are shown in Figure 7.12. The domain of competence for margin-based classifiers is fairly broad: structured datasets of moderate size. SVMs work a little better than logistic regression when the features are noisy.

Figure 7.11. *Margin-based loss functions.* Accessible
caption. A plot with loss on the vertical axis and margin on the horizontal
axis. The logistic loss decreases smoothly. The hinge loss decreases linearly
until the point , after which it is
for all larger
values of the margin.

Figure 7.12. *Example linear logistic regression (top
left), linear SVM (top right), nonlinear polynomial SVM (bottom left), and
nonlinear radial basis function SVM (bottom right) classifiers.* Accessible
caption. Stylized plot showing two classes of data points arranged in a noisy
yin yang or interleaving moons configuration. The linear logistic regression
and linear SVM decision boundaries are diagonal lines through the middle of the
moons. The polynomial SVM decision boundary is a diagonal line with a smooth
bump to better follow the classes. The radial basis function SVM decision
boundary smoothly encircles one of the classes with a blob-like region.

7.5.3 Neural Networks

The final family of classifiers that you and the other JCN Corporation data scientists analyze is artificial neural networks. Neural networks are all the rage these days because of their superlative performance on high-profile tasks involving large-scale semi-structured datasets (image classification, speech recognition, natural language processing, bioinformatics, etc.), which is their domain of competence. The hypothesis space of neural networks includes functions that are compositions of compositions of compositions of simple functions known as neurons. The best way to understand the hypothesis space is graphically as layers of neurons, represented as nodes, connected to each other by weighted edges. There are three types of layers: an input layer, possibly several hidden layers, and an output layer. The basic picture to keep in mind is shown in Figure 7.13. The term deep learning which is bandied about quite a bit these days refers to deep neural networks: architectures of neurons with many many hidden layers.

Figure 7.13. *Diagram of a neural network.* Accessible
caption. Three nodes on the left form the input layer. They are labeled , , and To the right of
the input layer is hidden layer 1 with four nodes. To the right of hidden layer
1 is hidden layer 2 with four hidden nodes. To the right of hidden layer 2 is
one node labeled constituting the
output layer. There are edges between each node of one layer and each node of
the adjacent layer. Each edge has a weight. Each hidden node sums all of its
inputs and applies an activation function. The output node sums all of its
inputs and applies a hard threshold.

Logistic regression is actually a very simple neural network with just an input layer and an output node, so let’s start there. The input layer is simply a set of nodes, one for each of the feature dimensions relevant for predicting the expertise of employees. They have weighted edges coming out of them, going into the output node. The weights on the edges are the coefficients in , i.e. . The output node sums the weighted inputs, so computes , and then passes the sum through the function. This overall procedure is exactly the same as logistic regression described earlier, but described in a graphical way.

In the regular case of a neural network with one or more hidden layers, nodes in the hidden layers also start with a weighted summation. However, instead of following the summation with an abrupt function, hidden layer nodes use softer, more gently changing activation functions. A few different activation functions are used in practice, whose choice contributes to the inductive bias. Two examples, the sigmoid or logistic activation function and the rectified linear unit (ReLU) activation function , are shown in Figure 7.14. The ReLU activation is typically used in all hidden layers of deep neural networks because it has favorable properties for optimization techniques that involve the gradient of the activation function.

Figure 7.14. *Activation functions.* Accessible caption.
A plot with activation function on the vertical axis and input on the
horizontal axis. The sigmoid function is a gently rolling S-shaped curve that
equals at input value , approaches as the input goes
to negative infinity, and approaches as the input goes
to positive infinity. The ReLU function is for all negative
inputs and increases linearly starting at .

When there are several hidden layers, the outputs of nodes in one hidden layer feed into the nodes of the next hidden layer. Thus, the neural network’s computation is a sequence of compositions of weighted sum, activation function, weighted sum, activation function, and so on until reaching the output layer, which finally applies the function. The number of nodes per hidden layer and the number of hidden layers is a design choice for JCN Corporation’s data scientists to make.

You and your team have analyzed the hypothesis space. Cool beans. The next thing for you to analyze is the loss function of neural networks. Recall that margin-based loss functions multiply the true label by the distance (not by the predicted label ), before applying the function. The cross-entropy loss, the most common loss function used in neural networks, does kind of the same thing. It compares the true label to a soft prediction in the range computed in the output node before the function has been applied to it. The cross-entropy loss function is:

Equation 7.5

The form of the expression comes from cross-entropy, the average information in the true label random variable when described using the predicted distance random variable , introduced in Chapter 3. Cross-entropy should be minimized because you want the description in terms of the prediction to be matched to the ground truth. It turns out that the cross-entropy loss is equivalent to the margin-based logistic loss function in binary classification problems, but it is pretty involved to show it mathematically because the margin-based loss function is a function of one variable that multiplies the prediction and the true label, whereas the two arguments are kept separate in cross-entropy loss.[11]

The last question to ask is about regularization. Although -norm, -norm, or other penalties can be added to the cross-entropy loss, the most common way to regularize neural networks is dropout. The idea is to randomly remove some nodes from the network on each iteration of an optimization procedure during training. Dropout’s goal is somewhat similar to bagging, but instead of creating an ensemble of several neural networks explicitly, dropout makes each iteration appear like a different neural network of an ensemble, which helps diversity and generalization. An example neural network classifier with one hidden layer and ReLU activation functions is shown in Figure 7.15. Repeating the statement from the beginning of this section, the domain of competence for artificial neural networks is semi-structured datasets with a large number of data points.

Figure 7.15. *Example neural network classifier.*
Accessible caption. Stylized plot showing two classes of data points arranged
in a noisy yin yang or interleaving moons configuration. The decision boundary
is mostly smooth and composed of two almost straight diagonal segments that
form a slightly bent elbow in the middle of the two moons.

7.5.4 Conclusion

You have worked your way through several different kinds of classifiers to compare and contrast their domains of competence and evaluate their appropriateness for your expertise assessment prediction task. Your dataset consists of mostly structured data, is of moderate size, and has a lot of feature axis-aligned separations between employees skilled and unskilled at serverless architecture. For these reasons, you can expect that XGBoost will be a competent classifier for your problem. But you should nevertheless do some amount of empirical testing of a few different methods.

7.6 Summary

§ There are many different methods for finding decision functions from a finite number of training samples, each with their own inductive biases for how they generalize.

§ Different classifiers have different domains of competence: what kinds of datasets they have lower generalization error on than other methods.

§ Parametric and non-parametric plug-in methods (discriminant analysis, naïve Bayes, k-nearest neighbor) and risk minimization methods (decision trees and forests, margin-based methods, neural networks) all have a role to play in practical machine learning problems.

§ It is important to analyze their inductive biases and domains of competence not only to select the most appropriate method for a given problem, but also to be prepared to extend them for fairness, robustness, explainability, and other elements of trustworthiness.

[1]David H. Wolpert. “The Lack of A Priori Distinctions Between Learning Algorithms.” In: Neural Computation 8.7 (Oct. 1996), pp. 1341–1390.

[2]Tin Kam Ho and Ester Bernadó-Mansilla. “Classifier Domains of Competence in Data Complexity Space.” In: Data Complexity in Pattern Recognition. Ed. by Mitra Basu and Tin Kam Ho. London, England, UK: Springer, 2006, pp. 135–152.

[3]Maniel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. “Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?” In: Journal of Machine Learning Research 15 (Oct. 2014), pp. 3133–3181.

[4]In the scope of this chapter, the JCN team use these characteristics qualitatively as a means of gaining intuition. Quantitative measures for these characteristics are described by Tin Kam Ho and Mitra Basu. “Complexity Measures of Supervised Classification Problems.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 24.3 (Mar. 2002), pp. 289–300.

[5]For the purposes of this chapter, ‘work better’ is only in terms of basic performance (the first attribute of trustworthiness), not reliability or interaction (the second and third attributes of trustworthiness). Classifier domains of reliability and domains of quality interaction can also be defined.

[6]The mathematical form of the likelihood functions is: and , where is the matrix determinant function.

[7]Oktay Günlük, Jayant Kalagnanam, Minhan Li, Matt Menickelly, and Katya Scheinberg. “Optimal Generalized Decision Trees via Integer Programming.” arXiv:1612.03225, 2019.

[8]The nonlinear functions of the features are usually kernel functions, which satisfy certain mathematical properties that allow for efficient optimization during training.

[9]In the nonlinear case, replace with for a kernel function . To avoid cluttering up the mathematical notation, always assume that or has a column of all ones to allow for a constant shift.

[10]Computing is the inverse of applying the function, adding one, and dividing by two. It is performed to get values and from the class labels. When a classification is correct, and have the same sign. Multiplying two numbers with the same sign results in a positive number. When a classification is incorrect, and have different signs. Multiplying two numbers with different signs results in a negative number.

[11]Tyler Sypherd, Mario Diaz, Lalitha Sankar, and Peter Kairouz. “A Tunable Loss Function for Binary Classification.” In: Proceedings of the IEEE International Symposium on Information Theory. Paris, France, Jul. 2019, pp. 2479–2483.