Sections in this Chapter:

Population & random samples

A population is a complete set of elements of interest for a specific problem. It is usually defined by the researcher of the problem. A population can be either finite or infinite. For example, if we define the set of real numbers as our population, it is infinite. But if we are only interested in the integers between 1 and 100, then we get a finite population.

A random sample is a set of elements selected from a population. The size of a random sample could be larger than the population since an element can be taken multiple times.
Why do we care random samples? Because we are interested in the population from which the random sample is taken, and the random sample could help to make inference on the population.

In predictive modeling, each element in the population has a set of attributes which are usually called features or covariates, as well as a label. For example, a bank may use the mortgage applicant’s personal information (FICO score, years of employment, debt to income ratio, etc.) as the covariates, and the status of the mortgage (default, or paid off) as the label. A predictive model can be used to predict the final status of the mortgage for a new applicant, and such kind of models are classification models. When a mortgage is in default status, the applicant may have already made payments partially. Thus, a model to predict the amount of loss for the bank is also useful, and such kind of models are regression models.

But why do we need a predictive model? If we know the labels of the entire population, nothing is needed to learn. All what we need is a database table or a dictionary (HashMap) for lookup. The issue is that many problems in the real world don’t allow us to have the labels of the entire population. And thus, we need to learn or infer based on the random sample collected from the unknown population.

Universal approximation & overfitting

Universal approximation

The Universal approximation theorem says that a single hidden layer neural network can approximate any continuous functions ($\mathbf{R}^n\rightarrow\mathbf{R}$) with sufficient number of neurons under mild assumptions on the activation function (for example, the sigmoid activation function)1. There are also other universal approximators, such as the decision trees.

Not all machine learning models can approximate universally, for example, the linear regression without polynomial items. If we have the data from the entire population, we may fit the population with a universal approximator. But as we have discussed earlier, when the entire population is available there is no need to fit a machine learning model for prediction. But if only a random sample is available, is a universal approximator still the best choice? It depends.

Overfitting & cross-validation

One of the risks of fitting a random sample with a predictive model is overfitting (see the figure below). Using universal approximators as the predictive model may even amplify such risk.

A random sample from a population that follows a linear model (the dashed line) is overfit by the solid curve.
A random sample from a population that follows a linear model (the dashed line) is overfit by the solid curve.

To mediate the risk of overfitting, we usually use cross-validation to assess how accurately a predictive model is able to predict for unseen data. The following steps specifies how to perform a cross-validation.

$k$ fold cross-validation.
$k$ fold cross-validation.

Cross-validation can also be considered as a metaheuristic algorithm since it is not problem-specific because it doesn’t matter what type of predictive models we use.

There are some ready-to-use tools in R and Python modules for cross-validation. But for pedagogical purpose, let’s do it step by step. We do the cross validation using the Lasso regression we build in previous Chapter on the Boston dataset.

R
 1 source('../chapter5/lasso.R')
 2 
 3 library(caret)
 4 library(MASS)
 5 library(Metrics) # we use the rmse function from this package
 6 k = 5
 7 
 8 set.seed(42)
 9 # if we set returnTrain = TRUE, we get the indices for train partition
10 test_indices = createFolds(Boston$medv, k = k, list = TRUE, returnTrain = FALSE)
11 scores = rep(NA, k)
12 
13 for (i in 1:k){
14   lr = Lasso$new(200)
15   # we exclude the indices for test partition and train the model
16   lr$fit(data.matrix(Boston[-test_indices[[i]], -ncol(Boston)]), Boston$medv[-test_indices[[i]]], 100)
17   y_hat = lr$predict(data.matrix(Boston[test_indices[[i]], -ncol(Boston)]))
18   scores[i] = rmse(Boston$medv[test_indices[[i]]], y_hat)
19 }
20 print(mean(scores))
Python
 1 import sys
 2 sys.path.append("..")
 3 
 4 from sklearn.metrics import mean_squared_error
 5 from sklearn.datasets import load_boston
 6 from sklearn.model_selection import KFold
 7 from chapter5.lasso import Lasso
 8 
 9 
10 boston = load_boston()
11 X, y = boston.data, boston.target
12 
13 # create the partitions with k=5
14 k = 5
15 kf = KFold(n_splits=k)
16 # create a placeholder for the rmse on each test partition
17 scores = []
18 
19 for train_index, test_index in kf.split(X):
20     X_train, X_test = X[train_index], X[test_index]
21     y_train, y_test = y[train_index], y[test_index]
22     # let's train the model on the train partitions
23     lr = Lasso(200.0)
24     lr.fit(X_train, y_train, max_iter=100)
25     # now test on the test partition
26     y_hat = lr.predict(X_test)
27     # we calculate the root of mean squared error (rmse)
28     rmse = mean_squared_error(y_test, y_hat) ** 0.5
29     scores.append(rmse)
30 
31 # average rmse from 5-fold cross-validation
32 print(sum(scores)/k)

Run the code snippets, we have the $5$-fold cross-validation accuracy as follows.

R
1 chapter6 $r -f cv.R
2 [1] 4.978324
Python
1 chapter6 $python3.7 cv.py
2 5.702339699398128

We add the line sys.path.append("..") in the Python code, otherwise it would throw an error because of the import mechanism2.

Evaluation metrics

In the example above, we measure the root of mean squared error (RMSE) as the accuracy of the linear regression model. There are various metrics to evaluate the accuracy of predictive models.

For regression models, RMSE, mean absolute error (MAE)3 and mean absolute percentage error (MAPE)4 are some of the commonly-used evaluation metrics. You may have heard of the coefficient of determination ($R^2$ or adjusted $R^{2}$) in Statistics. But from predictive modeling perspective, $R^{2}$ is not a metric that evaluates the predictive power of the model since its calculation is based on the training data. But what we are actually interested in is the model performance on the unseen data. In Statistics, goodness of fit is a term to describe how good a model fits the observations, and $R^{2}$ is one of these measures for goodness of fit. In predictive modeling, we care more about the error of the model on the unseen data, which is called generalization error. But of course, it is possible to calculate the counterpart of $R^{2}$ on testing data.

Some metrics for regression models
metric formula
RMSE $\sqrt{\frac {\sum_{i=1}^{n} {(\hat{y}_{i}-y})^{2}} n}$
MAE $ \frac {\sum_{i=1}^{n} {\lvert \hat{y}_{i} – y \rvert}} {n} $
MAPE $ \frac {1} n \sum_{i=1}^{n} \lvert{\frac {\hat{y}_{i}-y_{i}} {y_{i}}\rvert}$

The most intuitive metric for classification models is the accuracy, which is the percentage of corrected classified instances. To calculate the accuracy, we need to label each instance to classify. Recall the logistic regression we introduced in Chapter 5, the direct output of a logistic regression are probabilities rather than labels. In that case, we need to convert the probability output to the label for accuracy calculation. For example, consider a classification problem, where the possible labels of each instance is $0$, $1$, $2$ and $3$. If the predictive probabilities of each label are $0.2$, $0.25$, $0.5$, $0.05$ for label $0$, $1$, $2$ and $3$ respectively, then the predictive label is $2$ since its corresponding probability is the largest.

But actually we don’t always care the labels of an instance. For example, a classification model for mortgage default built in a bank may only be used to calculate the expected monetary loss. Another example is the recommendation system that predicts the probabilities which are used for ranking of items. In that case, the model performance could be evaluated by logloss, AUC, etc. using the output probabilities directly. We have seen in Chapter 5 the loss function of logistic regression is the log-likelihood function.

Actually, logloss is just the average evaluated log-likelihood on the testing data, and thus it can also be used for classification models with more than 2 classes (labels) because likelihood function is not restricted to Bernoulli distribution (extended Bernoulli distribution is called categorical distribution5). Another name of logloss is cross-entropy loss.

In practice, AUC (Area Under the ROC Curve) is a very popular evaluation metric for binary-class classification problems. AUC is bounded between 0 and 1. A perfect model leads to an AUC equal to 1. If a model’s predictions are $100\%$ wrong, the resulted AUC is equal to 0. But if we know a binary-class classification model always results in $100\%$ wrong predictions, we can instead use $1-\hat{y}$ as the corrected prediction and as a result we will get a perfect model and the AUC based on the corrected prediction becomes 1. Actually, a model using completely random guess as the prediction leads to an AUC equal to 0.5. Thus, in practice the evaluated AUC is usually between 0.5 and 1.

There are also many other metrics, such as recalls, precisions, and F1 score6.

The selection of evaluation metrics in predictive modeling is important but also subjective. Sometimes we may also need to define a customized evaluation metric.

Many evaluation metrics can be found from the R package Metrics and the Python module sklearn.metrics.

R
 1 > set.seed(42)
 2 > # regression metrics
 3 > y = rnorm(n = 10, mean = 10, sd = 2)
 4 > y
 5  [1] 12.741917  8.870604 10.726257 11.265725 10.808537  9.787751
 6  [7] 13.023044  9.810682 14.036847  9.874572
 7 > # we use random numbers as the predictions
 8 > y_hat = rnorm(n = 10, mean = 10.5, sd = 2.2) 
 9 > y_hat
10  [1] 13.370713 15.530620  7.444506  9.886665 10.206693 11.899091
11  [7]  9.874644  4.655798  5.130973 13.404249
12 > 
13 > rmse(actual = y, predicted = y_hat)
14 [1] 4.364646
15 > mae(actual = y, predicted = y_hat)
16 [1] 3.540164
17 > mape(actual = y, predicted = y_hat)
18 [1] 0.3259014
19 > # classification metrics
20 > y = rbinom(n = 10, size = 1, prob=0.25)
21 > y
22  [1] 0 0 0 1 0 1 1 0 1 0
23 > y_hat = runif(10, 0, 1)
24 > logLoss(y, y_hat)
25 [1] 0.4553994
26 > auc(y, y_hat)
27 [1] 0.8333333
Python
 1 >>> import numpy as np 
 2 >>> from sklearn.metrics import mean_squared_error, mean_absolute_error, log_loss, roc_auc_score
 3 >>> np.random.seed(42)
 4 >>> # regression metrics
 5 ... 
 6 >>> y = np.random.normal(10, 2, 10)
 7 >>> y_hat = np.random.normal(10.5, 2.2, 10)
 8 >>> mean_squared_error(y, y_hat) ** 0.5 # rmse
 9 3.0668667318485165
10 >>> mean_absolute_error(y, y_hat) # mae
11 2.1355703394788237
12 >>> # let's define mape since it's not available
13 ... 
14 >>> def mape(y, y_hat): return np.mean(np.abs(y-y_hat)/y_hat)
15 ...
16 >>> mape(y, y_hat)
17 0.292059554974094
18 >>> # classification metrics
19 >>> y = np.random.binomial(1, 0.25, 10)
20 >>> y_hat = np.random.uniform(0, 1, 10)
21 >>> log_loss(y, y_hat)
22 0.47071363776285635
23 >>> roc_auc_score(y, y_hat) # auc
24 0.8095238095238095

Feature engineering & embedding

According to the explanation of feature engineering7 on wikipedia, feature engineering is the process to use domain knowledge to create new features based on existing features. In reality, it is not rare to see feature engineering leads to better prediction accuracy. And sometimes I use feature engineering too. But I think feature engineering should be less and less useful in the future as the machine learning algorithms become more and more intelligent.

Let’s consider three features $x_{1},x_{2}$ and $x_{3}$ and assume the actual model is specified as $y=f(x_{1},g(x_{2}, x_{3}))$. After all, $y$ is still a function of $x_{1},x_{2}$ and $x_{3}$, and we can write it as $y=h(x_{1},x_{2}, x_{3})$. Thus, even without creating the new feature $x_{4}=g(x_{2}, x_{3})$ explicitly, a universal approximator should be able to learn (i.e., approximate) the function $h$ from the data ideally. This idea is also supported by the Kolmogorov–Arnold representation theorem8 which says any continuous real-valued multivariate functions can be written as a finite composition of continuous functions of a single variable.

$$
\begin{equation}
f(x\sb{1},…,x\sb{m})=\sum_{q=0}^{2m} {\Phi\sb{q}\big(\sum_{p=1} ^{m} \phi_{q,p} (x_{p})}\big) .
\label{eq:ka}
\end{equation}
$$

As of today, since the machine learning algorithms are not that intelligent, it is worth trying feature engineering especially when domain knowledge is available.

If you are familiar with dimension reduction, embedding can be considered as something similar. Dimension reduction aims reducing the dimension of $\boldsymbol{X}$. It sounds interesting and promising if we can transform the high-dimensional dataset into a a low-dimensional dataset and feed the dataset in a low dimension space to the machine learning model. However, I don’t think this is a good idea in general because it is not guaranteed the low-dimensional predictors still keep all the information related to the response variable. Actually, many machine learning models are capable to handle the high-dimensional predictors directly.

Embedding also transform the features into a new space, which usually has a lower dimension. But generally embedding is not done by the traditional dimension reduction techniques (for example, principal component analysis). In natural language process, a word can be embedded into a vector space by word2vec9 (or other techniques). When an instance is associated with an image, we may consider to use autoencoder10 to encode/embed the image into a space with lower dimension, which is usually achieved by (deep) neural networks.

Collinearity

Collinearity is one of the cliches in machine learning. For non-linear models, collinearity is usually not a problem. For linear models, I recommend reading this discussion11 to see when it is not a problem.

Feature selection & parameter tuning

We have seen how the Lasso solutions of linear models can be used for feature selection in Chapter 5. What about non-linear models? There are some model-specific techniques for feature selection. Also, there is a metaheuristic approach to select features – cross-validation. Specifically, we can try different combinations of the features and use cross-validation to select the set of features which results in the best cross-validation evaluation metric. However, the major problem of this approach is its efficiency. When the number of features is too large, it is impossible to try all different combinations with limited computational resources. Thus, it is better to use the model-specific feature selection techniques in practice.

To tune model parameters, such as the $\lambda$ in Lasso, we can also use cross-validation. But again, the efficiency is our major concern.

Can we have feature selection and parameter tuning done automatically? Actually, automated machine learning has been a hot research topic in both academia and industry.

Gradient boosting machine

Decision tree

A decision tree consists of a bunch of nodes. In a decision tree there is a node with no parent nodes, which is called root node. The node without any children is called leaf node.

A single decision tree.
A single decision tree.

The length (number of nodes) of the longest path from the root node to a leaf node is called the depth of the tree. For example, the depth of the tree above is 4. Each leaf node has a label. In regression tasks, the label is a real number, and in classification tasks the label could could be a real number which is used to get the class indirectly (for example, fed into a sigmoid function to get the probability), or an integer representing the predicted class directly.

Each node except the leaves in a decision tree is associated with a splitting rule. These splitting rules determine to which leaf an instance belongs. A rule is just a function taking a feature as input and returns true or false as output. For example, a rule on the root could be $x_{1}<0$ and if it is true, we go to the left node otherwise go to the right node. Once we arrive at a leaf, we can get the predicted value based on the label of the leaf.

To get a closer look, let’s try to implement a binary tree structure for regression tasks in R/Python from scratch.

Let’s implement the binary tree as a recursive data structure, which is composed partially of similar instances of the same data structure. More specifically, a binary tree can be decomposed into three components, i.e., its root node, the left subtree under the root, and the right subtree of the root. To define a binary (decision) tree, we only need to define these three components. And to define the left and right subtrees, this decomposition is applied recursively until the leaves.

Now we have the big picture how to define a binary tree. However, to make the binary tree a decision tree, we also need to define the
splitting rules. For simplicity, we assume there is no missing value in our data and all variables are numeric. Then a splitting rule of a node is composed of two components, i.e., the variable to split on, and the corresponding breakpoint for splitting.

There is one more component we need to define in a decision tree, that is, the predict method which takes an instance as input and returns the prediction.

Now we are ready to define our binary decision tree.

R

chapter6/tree.R

 1 library(R6)
 2 Tree = R6Class(
 3   "Tree",
 4   public = list(
 5     left = NULL,
 6     right = NULL,
 7     variable_id = NULL,
 8     break_point = NULL,
 9     val = NULL,
10     initialize = function(left, right, variable_id, break_point, val) {
11       self$left = left
12       self$right = right
13       self$variable_id = variable_id
14       self$break_point = break_point
15       self$val = val
16     },
17     is_leaf = function() {
18       is.null(self$left) && is.null(self$right)
19     },
20     depth = function() {
21       if (self$is_leaf()) {
22         1
23       } else if (is.null(self$left)) {
24         1 + self$right$depth()
25       } else if (is.null(self$right)) {
26         1 + self$left$depth()
27       } else{
28         1 + max(self$left$depth(), self$right$depth())
29       }
30     },
31     predict_single = function(x) {
32       # if x is a vector
33       if (self$is_leaf()) {
34         self$val
35       } else{
36         if (x[self$variable_id] < self$break_point) {
37           self$left$predict_single(x)
38         } else{
39           self$right$predict_single(x)
40         }
41       }
42     },
43     predict = function(x) {
44       # if x is an array
45       preds = rep(0.0, nrow(x))
46       for (i in 1:nrow(x)) {
47         preds[i] = self$predict_single(x[i, ])
48       }
49       preds
50     },
51     print = function() {
52       # we can call print(tree), similar to the magic method in Python
53       cat("variable_id:", self$variable_id, "\n")
54       cat("break at:", self$break_point, "\n")
55       cat("is_leaf:", self$is_leaf(), "\n")
56       cat("val:", self$val, "\n")
57       cat("depth:", self$depth(), "\n")
58       invisible(self)
59     }
60   )
61 )

Python

chapter6/tree.py

 1 class Tree:
 2     def __init__(self, left, right, variable_id, break_point, val):
 3         self.left = left
 4         self.right = right
 5         self.variable_id = variable_id
 6         self.break_point = break_point
 7         self.val = val
 8 
 9     @property
10     def is_leaf(self):
11         return self.left is None and self.right is None
12 
13     def _predict_single(self, x):
14         if self.is_leaf:
15             return self.val
16         if x[self.variable_id] < self.break_point:
17             return self.left._predict_single(x)
18         else:
19             return self.right._predict_single(x)
20 
21     def predict(self, x):
22         return [self._predict_single(e) for e in x]
23 
24     @property
25     def depth(self):
26         if self.is_leaf:
27             return 1
28         elif self.left is None:
29             return 1 + self.right.depth
30         elif self.right is None:
31             return 1 + self.left.depth
32         return 1 + max(self.left.depth, self.right.depth)
33 
34     def __repr__(self):
35         return "variable_id: {0}\nbreak_at: {1}\nval: {2}\nis_leaf: {3}\nheight: {4}".format(
36             self.variable_id, self.break_point, self.val, self.is_leaf, self.depth)

You may have noticed the usage @property in our Python implementation. It is one of builtin decorators in Python. We won’t talk too much of decorators. Basically, adding this decorator makes the method depth behave like a property, in the sense that we can call self.depth instead of self.depth() to get the depth.

In the R implementation, the invisible(self) is returned in the print method which seems strange. It is an issue12 of R6 class due to the S3 dispatch mechanism which is not introduced in this book.

The above implementation doesn’t involve the training or fitting of the decision tree. In this book, we wouldn’t talk about how to fit a traditional decision tree model due to its limited usage in the context of modern machine learning. Let’s see how to use the decision tree structures we defined above by creating a pseudo decision tree illustrated below.

R
 1 > source('tree.R')
 2 > node_2 = Tree$new(NULL, NULL, NULL, NULL, 0.6)
 3 > node_3 = Tree$new(NULL, NULL, NULL, NULL, 2.4)
 4 > node_4 = Tree$new(NULL, NULL, NULL, NULL, 4.5)
 5 > node_1 = Tree$new(node_3, node_4, 1, 10.5, NULL)
 6 > node_0 = Tree$new(node_1, node_2, 2, 3.2, NULL)
 7 > print(node_0)
 8 variable_id: 2 
 9 break at: 3.2 
10 is_leaf: FALSE 
11 val: 
12 depth: 3 
13 > print(node_4)
14 variable_id: 
15 break at: 
16 is_leaf: TRUE 
17 val: 4.5 
18 depth: 1 
19 > x = array(c(10, 0.5), c(1, 2))
20 > node_0$predict(x)
21 [1] 2.4
Python
 1 >>> from tree import Tree
 2 >>> node_2 = Tree(None, None, None, None, 0.6)
 3 >>> node_3 = Tree(None, None, None, None, 2.4)
 4 >>> node_4 = Tree(None, None, None, None, 4.5)
 5 >>> node_1 = Tree(node_3, node_4, 0, 10.5, None)
 6 >>> node_0 = Tree(node_1, node_2, 1, 3.2, None)
 7 >>> 
 8 >>> print(node_0)
 9 variable_id: 1
10 break_at: 3.2
11 val: None
12 is_leaf: False
13 depth: 3
14 >>> print(node_4)
15 variable_id: None
16 break_at: None
17 val: 4.5
18 is_leaf: True
19 depth: 1
20 >>> x = [[10, 0.5]]
21 >>> node_0.predict(x)
22 [2.4]

It’s worth noting decision trees can approximate universally.

Tree growing in gradient boosting machine

What is a gradient boosting machine (GBM) (or gradient boosting regression)? Essentially, a GBM is just a forest of decision trees.
If you have heard of random forest (RF), you may know that a random forest is also a bunch of trees. What is the difference between a GBM and RF?

Looking at the fitted trees from RF and GBM, there is no way to tell if the trees are fitted by a RF or a GBM. The major difference is how these trees are trained, rather than the trees themselves. A minor difference is how these trees are used for prediction. In many RF implementations, the prediction for an instance is the average prediction of each tree within the forest. If it is a classification task, there are two ways to get the final prediction – (a) predict the class with majority voting directly, i.e., the predicted class is the one with highest frequency among the predicted classes of all trees; (b) predict the probability based on the frequencies, for example, if among five trees there are three trees output class 1 then the predicted probability of class 1 is equal to $3/5$. In many GBM implementations, the prediction (for regression tasks) is the sum of the predictions of individual trees.

GBM fits trees sequentially, but RF fits trees independently. The obvious advantage of fitting trees independently is that it can be done in parallel. But accuracy-wise, GBM usually performs better according to my limited experience.

We have seen the structure of a single decision tree in GBM. Now it’s time to see how to get these trees fitted in GBM. Let’s start from the first tree.

To grow a tree, we start from its root node. In GBM fitting, usually we pre-determine a maximum depth $d$ for each tree to grow. And the final tree’s depth may be equal to or less than the maximum depth $d$. At a high level, the tree is grown in a recursive fashion. Specifically, first we attempt to split the current node and if the splitting improves the performance we grow the left subtree and the right subtree under the root node. When we grow the left subtree, its maximum depth is $d-1$, and the same applies to the right subtree. We can define a tree grow function for such purpose which takes a root node $Node_{root}$ (it is also a leaf) as input. The pseudo code of tree grow function is illustrated below.

In fact, the algorithm of tree growing is just a DFS algorithm.

To complete the pseudo algorithm above, we need to have a split function which takes a leaf node as input and returns a boolean value as output. If true is returned, we will do the splitting, i.e., to grow the left/right subtree. So now the challenge is how to define the split function, which requires the understanding of the theories behind GBM.

Optimization of GBM

Similar to other regression models we have seen so far, GBM with $K$ trees has a loss function which is defined below.

$$
\begin{equation}
\mathcal{L}=\sum_{i=1}^{n} {(y_{i}-\hat{y}_{i})^{2}},
\label{eq:gbm0}
\end{equation}
$$
where

$$
\begin{equation}
\hat{y}_{i} = \sum_{t=1}^{K} {f_{t}(\boldsymbol{x}_{i})}.
\label{eq:treepred}
\end{equation}
$$

$f_{t}$ denotes the prediction of $t^{th}$ tree in the forest. As we mentioned previously, the fitting is done sequentially. When we fit the $t^{th}$ tree, all the previous $t-1$ trees are fixed. And the loss function for fitting the $t^{th}$ tree is given below.

$$
\begin{equation}
\mathcal{L}^{(t)}=\sum_{i=1}^{n} {(y_{i}- \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})} – f_{t}(\boldsymbol{x}_{i}))^{2}}
\label{eq:gbm_t}
\end{equation}
$$

In practice, a regularized loss function \eqref{eq:treepred1} is used instead of \eqref{eq:treepred} to reduce overfitting.

$$
\begin{equation}
\mathcal{L}^{(t)}=\sum_{i=1}^{n} {(y_{i}- \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})} – f_{t}(\boldsymbol{x}_{i}))^{2}} + \Phi(f_{t}).
\label{eq:treepred1}
\end{equation}
$$

Let’s follow the paper13 and use the number of leaves as well as the L2 penalty of the values (also called weights) of the leaves for regularization. The loss function then becomes

$$
\begin{equation}
\mathcal{L}^{(t)}=\sum_{i=1}^{n} {(y_{i}- \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})} – f_{t}(\boldsymbol{x}_{i}))^{2}} + \gamma T + \frac 1 2 {\lambda \sum_{j=1}^{T}{\omega_{j}^{2}}} ,
\label{eq:growtree0}
\end{equation}
$$
where $\omega_{j}$ is the value associated with the $j_{th}$ leaf of the current tree.

Again, we get an optimization problem, i.e., to minimize the loss function \eqref{eq:growtree0}. The minimization problem can also be viewed as a quadratic programming problem. However, it seems different from the other optimization problems we have seen before, in the sense that the decision tree $f_{t}$ is a non-parametric model. A model is non-parametric if the model structure is learned from the data rather than pre-determined.

A common approach used in GBM is the second order approximation. By second order approximation, the loss function becomes

$$
\begin{equation}
\mathcal{L}^{(t)}\approx \sum_{i=1}^{n} {(y_{i} – \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})})^2} + \sum_{i=1}^{n} {(g_{i}f_{t}(\boldsymbol{x}_{i}) +\frac 1 2 {h_{i}f_{t}(\boldsymbol{x}_{i})^{2}})} + \gamma T + \frac 1 2 {\lambda \sum_{j=1}^{T}{\omega_{j}^{2}}} ,
\label{eq:growtree}
\end{equation}
$$
where $g\sb{i}=2(f\sb{t}(\boldsymbol{x}\sb{i}) + \sum\sb{l=1}^{t-1} {f\sb{l}(\boldsymbol{x}_{i})} – y_{i})$ and $h\sb{i}=2$ are the first and the second order derivatives of the function $(y\sb{i}- \sum_{l=1}^{t-1} {f\sb{l}(\boldsymbol{x}\sb{i})-f\sb{t}(\boldsymbol{x}\sb{i}))^{2}} $ with respect to the function $f_{t}(\boldsymbol{x}_{i})$.

Let’s implement the function to calculate $g$ and $h$ and put them into util.py.

Python

chapter6/util.py

1 import numpy as np
2 
3 def gh_lm(actual, pred):
4     '''
5     gradient and hessian for linear regression loss
6     '''
7     return 2*(pred-actual), 2.0

Since the first item is a constant, let’s ignore it.

$$
\begin{equation}
\mathcal{L}^{(t)}\approx\sum_{i=1}^{n} {(g_{i}f_{t}(\boldsymbol{x}_{i}) +\frac 1 2 {h_{i}f_{t}(\boldsymbol{x}_{i})^{2}})} + \gamma T + \frac 1 2 {\lambda \sum_{j=1}^{T}{\omega_{j}^{2}}} ).
\label{eq:growtree1}
\end{equation}
$$

Let’s think of the prediction of an instance of the current tree. The training data, i.e., instances fall under the leaves of a tree. Thus, the prediction of an instance is the value $\omega$ associated with the corresponding leaf that the instance belongs to. Based on this fact, the loss function can be further rewritten as follows,

$$
\begin{equation}
\mathcal{L}^{(t)} \approx \sum\sb{j=1}^{T} {(\omega \sb{j} \sum\sb{i\in I\sb{j} }{g\sb{i}} + \frac 1 2 {\omega \sb{j} ^{2}( \sum\sb{i\in I\sb{j}} h\sb{i} +\lambda ) }}) + \gamma T .
\label{eq:growtree2}
\end{equation}
$$

When the structure of the tree is fixed the loss function \eqref{eq:growtree2} is a quadratic convex function of $\omega_{j}$, and the optimal solution can be obtained by setting the derivative to zero.

$$
\begin{equation}
\omega_{j}=- \frac {\sum_{i\in I_{j} } g_{i}} {\sum_{i\in I_{j}} h_{i} +\lambda}
\label{eq:omega}
\end{equation}
$$

Plugging \eqref{eq:omega} into the loss function results in the minimal loss of the current tree structure

$$
\begin{equation}
-\frac 1 2 \sum_{j=1}^{T} {\frac {(\sum_{i\in I_{j} }g_{i})^{2}} {\sum_{i\in I_{j}}h_{i} +\lambda} } + \gamma T .
\label{eq:minimalloss}
\end{equation}
$$

Now let’s go back to the split function required in the tree grow function discussed previously. How to determine if we need a splitting on a leaf? \eqref{eq:minimalloss} gives the solution – we can calculate the loss reduction after splitting which is given below

$$
\begin{equation}
\frac 1 2 {\bigg(\frac {(\sum\sb{i\in I\sb{left} }g\sb{i})^{2}} {\sum\sb{i\in I\sb{left}}h\sb{i}+ \lambda } + \frac {(\sum\sb{i\in I\sb{right} }g\sb{i})^{2}} {\sum\sb{i\in I\sb{right}}h\sb{i} + \lambda} – \frac {(\sum\sb{i\in I } g\sb{i})^{2}} {\sum\sb{i\in I} h\sb{i}+\lambda} \bigg)} – \gamma .
\label{eq:lossreduction}
\end{equation}
$$

If the loss reduction is positive, the split function returns true otherwise returns false.

So far, we have a few ingredients ready to implement our own GBM, which are listed below,

However, there are a few additional items we need to go through before the implementation.

In Chapter 5, we have seen how stochasticity works in iterative optimization algorithms. The stochasticity can be applied in both instance-wise and feature-wise. The stochasticity technique is very important is the optimization of GBM. More specifically, we apply the stochasticity both instance-wise and feature-wise. Instance-wise, when we fit a new tree, we randomly get a subsample from the training sample. And feature-wise, we randomly select a few features/variables when we fit a new tree. The stochasticity technique could help to reduce overfitting. The extent of the stochasticity can be controlled by arguments.

Like the gradient decent algorithm, in GBM there is also a learning rate parameter which scales the values of the leaves after a tree is fitted.

In practice, we may also not want to have too few instances under a leaf, to reduce potential overfitting. When there are two few instances under a leaf, we may just stop the splitting process.

Now we have almost all the ingredients to make a working GBM. Let’s define the split_node function in the code snippet below.

Python

chapter6/grow.py

  1 from tree import Tree
  2 import numpy as np
  3 import pdb
  4 
  5 
  6 def split_node(f_in_tree, x_in_node, x_val_sorted, x_index_sorted, g_tilde, h_tilde, lam, gamma, min_instances):
  7     '''
  8     f_in_tree: a list of booleans indicating which variable/feature is selected in the tree
  9     x_in_ndoe: a list of booleans indicating which instance is used in the tree
 10     x_val_sorted: a nested list, x_val_sorted[feature index] is a list of instance values 
 11     x_index_sorted: a nested list, x_index_sorted[feature index] is a list of instance indexes
 12     g_tilde: first order derivative
 13     h_tilde: second order derivative
 14     lam: lambda for regularization
 15     gamma: gamma for regularization
 16     min_instances: the minimal number of instances under a leaf
 17     at the beginning we assume all instances are on the right
 18     '''
 19     if sum(x_in_node) < min_instances:
 20         return False, None, None, None, None, None, None
 21     best_break = 0.0
 22     best_feature, best_location = 0, 0
 23     ncol, nrow = len(f_in_tree), len(x_in_node)
 24     g, h = 0.0, 0.0
 25     for i, e in enumerate(x_in_node):
 26         if e:
 27             g += g_tilde[i]
 28             h += h_tilde[i]
 29     base_score = g*g/(h+lam)
 30     score_reduction = -np.inf
 31     best_w_left, best_w_right = None, None
 32     for k in range(ncol):
 33         if f_in_tree[k]:
 34             # if the feature is selected for this tree
 35             best_n_left_k = 0
 36             n_left_k = 0
 37             # feature is in the current tree
 38             g_left, h_left = 0.0, 0.0
 39             g_right, h_right = g-g_left, h-h_left
 40             # score reduction for current feature k
 41             score_reduction_k = -np.inf
 42             for i in range(nrow):
 43                 # for each in sample, we try to split on it
 44                 index = x_index_sorted[k][i]
 45                 if x_in_node[index]:
 46                     n_left_k += 1
 47                     best_n_left_k += 1
 48                     g_left += g_tilde[index]
 49                     g_right = g-g_left
 50                     h_left += h_tilde[index]
 51                     h_right = h-h_left
 52                     # new score reduction
 53                     score_reduction_k_i = g_left*g_left/(h_left+lam) + \
 54                         (g_right*g_right)/(h_right+lam)-base_score
 55                     if score_reduction_k <= score_reduction_k_i:
 56                         best_n_left_k = n_left_k
 57                         best_break_k = x_val_sorted[k][i]
 58                         best_location_k = i
 59                         score_reduction_k = score_reduction_k_i
 60                         w_left_k = -g_left/(h_left+lam)
 61                         w_right_k = -g_right/(h_right+lam)
 62 
 63             # if the score reduction on feature k is a better candidate
 64             if score_reduction_k >= score_reduction:
 65                 score_reduction = score_reduction_k
 66                 best_feature = k
 67                 best_break = best_break_k
 68                 best_location = best_location_k
 69                 best_w_left = w_left_k
 70                 best_w_right = w_right_k
 71     return 0.5*score_reduction >= gamma, best_feature, best_break, best_location, best_w_left, best_w_right, score_reduction
 72 
 73 
 74 def grow_tree(current_tree, f_in_tree, x_in_node, max_depth, x_val_sorted, x_index_sorted, y, g_tilde, h_tilde, eta, lam, gamma, min_instances):
 75     '''
 76     current_tree: the current tree to grow, i.e., a node
 77     f_in_tree, x_in_node, x_val_sorted, x_index_sorted: see split_node function
 78     max_depth: maximinum depth to grow
 79     eta: learning rate
 80     y: the response variable
 81     '''
 82     nrow = len(y)
 83     if max_depth == 0:
 84         return
 85     # check if we need a split
 86     do_split, best_feature, best_break, best_location, w_left, w_right, _ = split_node(
 87         f_in_tree, x_in_node, x_val_sorted, x_index_sorted, g_tilde, h_tilde, lam, gamma, min_instances)
 88 
 89     if do_split:
 90         # update the value/weight with the learning rate eta
 91         w_left_scaled = w_left*eta
 92         w_right_scaled = w_right*eta
 93         current_tree.variable_id = best_feature
 94         current_tree.break_point = best_break
 95         current_tree.val = None
 96 
 97         # initialize the left subtree
 98         current_tree.left = Tree(None, None, None, None, w_left_scaled)
 99         # initialize the right subtree
100         current_tree.right = Tree(None, None, None, None, w_right_scaled)
101         # update if an instance is in left or right
102         x_in_left_node = [False]*len(x_in_node)
103         x_in_right_node = [False]*len(x_in_node)
104         for i in range(nrow):
105             index = x_index_sorted[best_feature][i]
106             if x_in_node[index]:
107                 if i <= best_location:
108                     x_in_left_node[index] = True
109                 else:
110                     x_in_right_node[index] = True
111         # recursively grow its left subtree
112         grow_tree(current_tree.left, f_in_tree, x_in_left_node, max_depth-1,
113                   x_val_sorted, x_index_sorted, y, g_tilde, h_tilde, eta, lam, gamma, min_instances)
114         # recursively grow its right subtree
115         grow_tree(current_tree.right, f_in_tree, x_in_right_node, max_depth-1,
116                   x_val_sorted, x_index_sorted, y, g_tilde, h_tilde, eta, lam, gamma, min_instances)
117     else:
118         # current node is a leaf, so we update the value/weight of the leaf
119         g, h = 0.0, 0.0
120         for i, e in enumerate(x_in_node):
121             if e:
122                 g += g_tilde[i]
123                 h += h_tilde[i]
124         w_left_scaled = -g/(h+lam)*eta
125         current_tree.val = w_left_scaled

And the implementation of the GBM class is given below.

Python

chapter6/gbm.py

 1 from tree import Tree
 2 from grow import grow_tree, split_node
 3 from utils import gh_lm, rmse
 4 import numpy as np
 5 
 6 
 7 class GBM:
 8     def __init__(self, x_train, y_train, depth, eta, lam, gamma, sub_sample, sub_feature, min_instances=2):
 9         '''
10         x_train, y_train: training data
11         depth: maximum depth of each tree
12         eta: learning rate
13         lam and gamma: regularization parameters
14         sub_sample: control the instance-wise stochasticity
15         sub_feature: control the feature-wise stochasticity
16         min_instances: control the mimimum number of instances under a leaf
17         '''
18         self.n = len(x_train)
19         self.m = len(x_train[0])
20         self.x_train = x_train
21         self.y_train = y_train
22         self.x_test, self.y_test = None, None
23         self.depth = depth
24         self.eta = eta
25         self.lam = lam
26         self.gamma = gamma
27         self.sub_sample = sub_sample
28         self.sub_feature = sub_feature
29         self.min_instances = min_instances
30 
31         self.y_tilde = [0]*len(y_train)
32         self.g_tilde, self.h_tilde = [0] * \
33             len(self.y_tilde), [0]*len(self.y_tilde)
34         for i in range(len(self.y_tilde)):
35             self.g_tilde[i], self.h_tilde[i] = gh_lm(
36                 y_train[i], self.y_tilde[i])
37         x_columns = x_train.transpose()
38         self.nf = min(self.m, max(1, int(sub_feature*self.m)))
39         self.x_val_sorted = np.sort(x_columns)
40         self.x_index_sorted = np.argsort(x_columns)
41         self.forest = []
42 
43     def set_test_data(self, x_test, y_test):
44         self.x_test = x_test
45         self.y_test = y_test
46 
47     def predict(self, x_new):
48         y_hat = np.array([0.0]*len(x_new))
49         for tree in self.forest:
50             y_hat += np.array(tree.predict(x_new))
51         return y_hat
52 
53     def fit(self, max_tree, seed=42):
54         np.random.seed(seed)
55         self.forest = []
56         i = 0
57         while i < max_tree:
58             # let's fit tree i
59             # instance-wise stochasticity
60             x_in_node = np.random.choice([True, False], self.n, p=[
61                                          self.sub_sample, 1-self.sub_sample])
62             # feature-wise stochasticity
63             f_in_tree_ = np.random.choice(
64                 range(self.m), self.nf, replace=False)
65             f_in_tree = np.array([False]*self.m)
66             for e in f_in_tree_:
67                 f_in_tree[e] = True
68             del f_in_tree_
69             # initialize the root of this tree
70             root = Tree(None, None, None, None, None)
71             # grow the tree from root
72             grow_tree(root, f_in_tree, x_in_node, self.depth-1, self.x_val_sorted,
73                       self.x_index_sorted, self.y_train, self.g_tilde, self.h_tilde, self.eta, self.lam, self.gamma, self.min_instances)
74             if root is not None:
75                 i += 1
76                 self.forest.append(root)
77             else:
78                 next
79             for j in range(self.n):
80                 self.y_tilde[j] += self.forest[-1]._predict_single(
81                     self.x_train[j])
82                 self.g_tilde[j], self.h_tilde[j] = gh_lm(
83                     self.y_train[j], self.y_tilde[j])
84             if self.x_test is not None:
85                 # test on the testing instances
86                 y_hat = self.predict(self.x_test)
87                 print("iter: {0:>4}   rmse: {1:1.6f}".format(
88                     i, rmse(self.y_test, y_hat)))

Now let’s see the performance of our GBM implemented from scratch.

Python

chapter6/test_gbm.py

 1 from gbm import GBM
 2 import numpy as np
 3 from sklearn import datasets
 4 from sklearn.utils import shuffle
 5 
 6 
 7 def get_boston_data(seed=42):
 8     boston = datasets.load_boston()
 9     X, y = shuffle(boston.data, boston.target, random_state=seed)
10     X = X.astype(np.float32)
11     offset = int(X.shape[0] * 0.8)
12     X_train, y_train = X[:offset], y[:offset]
13     X_test, y_test = X[offset:], y[offset:]
14     return X_train, y_train, X_test, y_test
15     
16 if __name__ == "__main__":
17     X_train, y_train, X_test, y_test = get_boston_data(42)
18     gbm = GBM(X_train, y_train, depth=6, eta=0.05, lam=1.0,
19               gamma=1, sub_sample=0.5, sub_feature=0.7)
20     gbm.set_test_data(X_test, y_test)
21     gbm.fit(max_tree=200)

Running the code above, we have output as below.

Python
 1 chapter6 $python3.7 test_gbm.py
 2 iter:    1   rmse: 23.063019
 3 iter:    2   rmse: 21.997972
 4 iter:    3   rmse: 21.026602
 5 iter:    4   rmse: 20.043397
 6 iter:    5   rmse: 19.210746
 7 ...
 8 iter:  196   rmse: 2.560747
 9 iter:  197   rmse: 2.544847
10 iter:  198   rmse: 2.541102
11 iter:  199   rmse: 2.537366
12 iter:  200   rmse: 2.535143

We don’t implement the model in R, but it is not difficult to do based on the Python implementation above.

GBM can be used with various loss functions, and the major difference is the implementation of the first/second order derivatives, i.e., $g$ and $h$.

Regardless of the performance, there are two major missing features in our implementation is a) cross-validation and b) early stopping.
We have talked about cross-validation, but what is early stopping? It is a very useful technique in GBM. Usually, the cross-validated loss decreases when we add new trees at the beginning and at a certain point, the loss may increase when more trees are fitted (due to overfitting). Thus, we may select the best number of trees based on the cross-validated loss. Specifically, stop the fitting process when the cross-validated loss doesn’t decrease. In practice, we don’t want to stop the fitting immediately when the cross-validated loss starts increasing. Instead, we specify a number of trees, e.g. 50, as a buffer, after which the fitting process should stop if cross-validated loss doesn’t decrease.

Early stopping is also used in other machine learning models, for example, neural network. Ideally, we would like to have early stopping based on the cross-validated loss. But when the training process is time-consuming, it’s fine to use the loss on a testing date set14.

The commonly used GBM packages include XGBoost15, LightGBM16 and CatBoost17.

Let’s see how to use XGBoost for the same regression task on the Boston dataset.

R

chapter6/xgb.R

 1 library(xgboost)
 2 library(MASS)
 3 library(Metrics)
 4 set.seed(42)
 5 train_index = sample(nrow(Boston), floor(0.8 * nrow(Boston)), replace = FALSE)
 6 Boston = data.matrix(Boston)
 7 target_col = which(colnames(Boston) == 'medv')
 8 X_train = Boston[train_index, -target_col]
 9 y_train = Boston[train_index, target_col]
10 X_test = Boston[-train_index, -target_col]
11 y_test = Boston[-train_index, target_col]
12 # prepare the data for training and testing
13 dTrain = xgb.DMatrix(X_train, label = y_train)
14 dTest = xgb.DMatrix(X_test)
15 params = list(
16   "booster" = "gbtree",
17   "objective" = "reg:linear",
18   "eta" = 0.1,
19   "max_depth" = 5,
20   "subsample" = 0.6,
21   "colsample_bytree" = 0.8,
22   "min_child_weight" = 2
23 )
24 # run the cross-validation
25 hist = xgb.cv(
26   params = params,
27   data = dTrain,
28   nrounds = 500,
29   early_stopping_rounds = 50,
30   metrics = 'rmse',
31   nfold = 5,
32   verbose = FALSE
33 )
34 # since we have the best number of trees from cv, let's train the model with this number of trees
35 model = xgb.train(params, nrounds = hist$best_iteration, data = dTrain)
36 pred = predict(model, dTest)
37 
38 cat(
39   "rmse on testing instances is",
40   rmse(y_test, pred),
41   "with",
42   hist$best_iteration,
43   "trees"
44 )

Python

chapter6/xgb.py

 1 import xgboost as xgb
 2 from sklearn import datasets
 3 from sklearn.utils import shuffle
 4 from sklearn.metrics import mean_squared_error
 5 import numpy as np
 6 
 7 seed = 42
 8 boston = datasets.load_boston()
 9 X, y = shuffle(boston.data, boston.target, random_state=seed)
10 X = X.astype(np.float32)
11 offset = int(X.shape[0] * 0.8)
12 X_train, y_train = X[:offset], y[:offset]
13 X_test, y_test = X[offset:], y[offset:]
14 
15 params = {'booster': 'gbtree', 'objective': 'reg:linear', 'learning_rate': 0.1,
16           'max_depth': 5, 'subsample': 0.6, 'colsample_bytree': 0.8, 'min_child_weight': 2}
17 
18 # prepare the data for training and testing
19 dtrain = xgb.DMatrix(data=X_train, label=y_train, missing=None)
20 dtest = xgb.DMatrix(X_test)
21 
22 # run 5-fold cross-validation with maximum 1000 trees, and try to minimize the metric rmse
23 # early stopping 50 trees
24 hist = xgb.cv(params, dtrain=dtrain, nfold=5,
25               metrics=['rmse'], num_boost_round=1000, maximize=False, early_stopping_rounds=50)
26 
27 # find the best number of trees from the cross-validation history
28 best_number_trees = hist['test-rmse-mean'].idxmin()
29 
30 # since we have the best number of trees from cv, let's train the model with this number of trees
31 model = xgb.train(params, dtrain, num_boost_round=best_number_trees)
32 pred = model.predict(dtest)
33 print(
34     f"rmse on testing instances is {mean_squared_error(pred, y_test)**0.5:.6f} with {best_number_trees} trees")

The parameters subsample and colsample_bytree control the stochasticity, within the range of $[0, 1]$. If we set these two parameters to 1, then all instances and all features are selected for fitting every tree.

The two code snippets illustrate a minimal workflow of fitting a GBM model. First, we conduct (hyper) parameter tuning (such as learning rate, number of trees, regularization parameters, stochasticity parameters) with cross-validation, and next we train the model with the tuned parameters.

Running the code snippets, we have the following results.

R
1 > source('xgb.R')
2 rmse on testing instances is 2.632298 with 83 trees
Python
1 chapter6 $python3.7 xgb.py
2 rmse on testing instances is 2.736038 with 179 trees

In XGBoost, we could also use linear regression models as the booster (or base learner) instead of decision trees. However, when 'booster':'gblinear' is used, the sum of the prediction from all boosters in the model is equivalent to the prediction from a single (combined) linear model. In that sense, what we get is just a Lasso solution of a linear regression model.

GBM can be used in different tasks, such as classification, ranking, survival analysis, etc. When we use GBM for predictive modeling, missing value imputation is not required, which is one big advantage over linear models. But in our own implementation we don’t consider missing values for simplicity. In GBM, if a feature is categorical we could do label-encoding18, i.e., mapping the feature to integers directly without creating dummy variables (such as one-hot encoding). Of course one-hot encoding19 can also be used. But when there are too many new columns created by one-hot encoding, the probability that the original categorical feature is selected is higher than these numerical variables. In other words, we are assigning a prior weight to the categorical feature regarding the feature-wise stochasticity.


1 Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.

2 https://docs.python.org/3/reference/import.html

3 https://en.wikipedia.org/wiki/Mean_absolute_error

4 https://en.wikipedia.org/wiki/Mean_absolute_percentage_error

5 https://en.wikipedia.org/wiki/Categorical_distribution

6 https://en.wikipedia.org/wiki/Precision_and_recall

7 https://en.wikipedia.org/wiki/Feature_engineering

8 https://en.wikipedia.org/wiki/Kolmogorov-Arnold_representation_theorem

9 Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

10 https://en.wikipedia.org/wiki/Autoencoder

11 https://statisticalhorizons.com/multicollinearity

12 https://github.com/r-lib/R6/issues/140

13 Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.

14 https://en.wikipedia.org/wiki/Early_stopping

15 https://github.com/dmlc/xgboost

16 https://github.com/microsoft/LightGBM

17 https://github.com/catboost/catboost

18 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

19 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html