Sections in this Chapter:

A population is a complete set of elements of interest for a specific problem. It is usually defined by the researcher of the problem. A population can be either finite or infinite. For example, if we define the set of real numbers as our population, it is infinite. But if we are only interested in the integers between 1 and 100, then we get a finite population.

A random sample is a set of elements selected from a population. The size of a random sample could be larger than the population since an element can be taken multiple times.

Why do we care random samples? Because we are interested in the population from which the random sample is taken, and the random sample could help to make inference on the population.

In predictive modeling, each element in the population has a set of attributes which are usually called features or covariates, as well as a label. For example, a bank may use the mortgage applicant’s personal information (FICO score, years of employment, debt to income ratio, etc.) as the covariates, and the status of the mortgage (default, or paid off) as the label. A predictive model can be used to predict the final status of the mortgage for a new applicant, and such kind of models are classification models. When a mortgage is in default status, the applicant may have already made payments partially. Thus, a model to predict the amount of loss for the bank is also useful, and such kind of models are regression models.

But why do we need a predictive model? If we know the labels of the entire population, nothing is needed to learn. All what we need is a database table or a dictionary (HashMap) for lookup. The issue is that many problems in the real world don’t allow us to have the labels of the entire population. And thus, we need to learn or infer based on the random sample collected from the unknown population.

The Universal approximation theorem says that a single hidden layer neural network can approximate any continuous functions ($\mathbf{R}^n\rightarrow\mathbf{R}$) with sufficient number of neurons under mild assumptions on the activation function (for example, the sigmoid activation function)^{1}. There are also other universal approximators, such as the decision trees.

Not all machine learning models can approximate universally, for example, the linear regression without polynomial items. If we have the data from the entire population, we may fit the population with a universal approximator. But as we have discussed earlier, when the entire population is available there is no need to fit a machine learning model for prediction. But if only a random sample is available, is a universal approximator still the best choice? It depends.

One of the risks of fitting a random sample with a predictive model is overfitting (see the figure below). Using universal approximators as the predictive model may even amplify such risk.

To mediate the risk of overfitting, we usually use cross-validation to assess how accurately a predictive model is able to predict for unseen data. The following steps specifies how to perform a cross-validation.

- divide the training data into $k$ partitions randomly
- for $i=1,…,k$

- train a model using all partitions except partition $i$

- record the prediction accuracy of the trained model on partition $i$ - calculate the average prediction accuracy.

Cross-validation can also be considered as a metaheuristic algorithm since it is not problem-specific because it doesn’t matter what type of predictive models we use.

There are some ready-to-use tools in R and Python modules for cross-validation. But for pedagogical purpose, let’s do it step by step. We do the cross validation using the Lasso regression we build in previous Chapter on the Boston dataset.

Run the code snippets, we have the $5$-fold cross-validation accuracy as follows.

We add the line `sys.path.append("..")`

in the Python code, otherwise it would throw an error because of the `import`

mechanism^{2}.

In the example above, we measure the root of mean squared error (RMSE) as the accuracy of the linear regression model. There are various metrics to evaluate the accuracy of predictive models.

- Metrics for regression

For regression models, RMSE, mean absolute error (MAE)^{3} and mean absolute percentage error (MAPE)^{4} are some of the commonly-used evaluation metrics. You may have heard of the coefficient of determination ($R^2$ or adjusted $R^{2}$) in Statistics. But from predictive modeling perspective, $R^{2}$ is not a metric that evaluates the predictive power of the model since its calculation is based on the training data. But what we are actually interested in is the model performance on the unseen data. In Statistics, goodness of fit is a term to describe how good a model fits the observations, and $R^{2}$ is one of these measures for goodness of fit. In predictive modeling, we care more about the error of the model on the unseen data, which is called generalization error. But of course, it is possible to calculate the counterpart of $R^{2}$ on testing data.

metric |
formula |

RMSE | $\sqrt{\frac {\sum_{i=1}^{n} {(\hat{y}_{i}-y})^{2}} n}$ |

MAE | $ \frac {\sum_{i=1}^{n} {\lvert \hat{y}_{i} – y \rvert}} {n} $ |

MAPE | $ \frac {1} n \sum_{i=1}^{n} \lvert{\frac {\hat{y}_{i}-y_{i}} {y_{i}}\rvert}$ |

- Metrics for classification

The most intuitive metric for classification models is the accuracy, which is the percentage of corrected classified instances. To calculate the accuracy, we need to label each instance to classify. Recall the logistic regression we introduced in Chapter 5, the direct output of a logistic regression are probabilities rather than labels. In that case, we need to convert the probability output to the label for accuracy calculation. For example, consider a classification problem, where the possible labels of each instance is $0$, $1$, $2$ and $3$. If the predictive probabilities of each label are $0.2$, $0.25$, $0.5$, $0.05$ for label $0$, $1$, $2$ and $3$ respectively, then the predictive label is $2$ since its corresponding probability is the largest.

But actually we don’t always care the labels of an instance. For example, a classification model for mortgage default built in a bank may only be used to calculate the expected monetary loss. Another example is the recommendation system that predicts the probabilities which are used for ranking of items. In that case, the model performance could be evaluated by `logloss`

, `AUC`

, etc. using the output probabilities directly. We have seen in Chapter 5 the loss function of logistic regression is the log-likelihood function.

Actually, `logloss`

is just the average evaluated log-likelihood on the testing data, and thus it can also be used for classification models with more than 2 classes (labels) because likelihood function is not restricted to Bernoulli distribution (extended Bernoulli distribution is called categorical distribution^{5}). Another name of `logloss`

is `cross-entropy loss`

.

In practice, `AUC`

(Area Under the ROC Curve) is a very popular evaluation metric for binary-class classification problems. AUC is bounded between 0 and 1. A perfect model leads to an AUC equal to 1. If a model’s predictions are $100\%$ wrong, the resulted AUC is equal to 0. But if we know a binary-class classification model always results in $100\%$ wrong predictions, we can instead use $1-\hat{y}$ as the corrected prediction and as a result we will get a perfect model and the AUC based on the corrected prediction becomes 1. Actually, a model using completely random guess as the prediction leads to an AUC equal to 0.5. Thus, in practice the evaluated AUC is usually between 0.5 and 1.

There are also many other metrics, such as recalls, precisions, and F1 score^{6}.

The selection of evaluation metrics in predictive modeling is important but also subjective. Sometimes we may also need to define a customized evaluation metric.

Many evaluation metrics can be found from the R package `Metrics`

and the Python module `sklearn.metrics`

.

According to the explanation of feature engineering^{7} on wikipedia, feature engineering is the process to use domain knowledge to create new features based on existing features. In reality, it is not rare to see feature engineering leads to better prediction accuracy. And sometimes I use feature engineering too. But I think feature engineering should be less and less useful in the future as the machine learning algorithms become more and more intelligent.

Let’s consider three features $x_{1},x_{2}$ and $x_{3}$ and assume the actual model is specified as $y=f(x_{1},g(x_{2}, x_{3}))$. After all, $y$ is still a function of $x_{1},x_{2}$ and $x_{3}$, and we can write it as $y=h(x_{1},x_{2}, x_{3})$. Thus, even without creating the new feature $x_{4}=g(x_{2}, x_{3})$ explicitly, a universal approximator should be able to learn (i.e., approximate) the function $h$ from the data ideally. This idea is also supported by the Kolmogorov–Arnold representation theorem^{8} which says any continuous real-valued multivariate functions can be written as a finite composition of continuous functions of a single variable.

$$

\begin{equation}

f(x\sb{1},…,x\sb{m})=\sum_{q=0}^{2m} {\Phi\sb{q}\big(\sum_{p=1} ^{m} \phi_{q,p} (x_{p})}\big) .

\label{eq:ka}

\end{equation}

$$

As of today, since the machine learning algorithms are not that intelligent, it is worth trying feature engineering especially when domain knowledge is available.

If you are familiar with dimension reduction, embedding can be considered as something similar. Dimension reduction aims reducing the dimension of $\boldsymbol{X}$. It sounds interesting and promising if we can transform the high-dimensional dataset into a a low-dimensional dataset and feed the dataset in a low dimension space to the machine learning model. However, I don’t think this is a good idea in general because it is not guaranteed the low-dimensional predictors still keep all the information related to the response variable. Actually, many machine learning models are capable to handle the high-dimensional predictors directly.

Embedding also transform the features into a new space, which usually has a lower dimension. But generally embedding is not done by the traditional dimension reduction techniques (for example, principal component analysis). In natural language process, a word can be embedded into a vector space by word2vec^{9} (or other techniques). When an instance is associated with an image, we may consider to use autoencoder^{10} to encode/embed the image into a space with lower dimension, which is usually achieved by (deep) neural networks.

Collinearity is one of the cliches in machine learning. For non-linear models, collinearity is usually not a problem. For linear models, I recommend reading this discussion^{11} to see when it is not a problem.

We have seen how the Lasso solutions of linear models can be used for feature selection in Chapter 5. What about non-linear models? There are some model-specific techniques for feature selection. Also, there is a metaheuristic approach to select features – cross-validation. Specifically, we can try different combinations of the features and use cross-validation to select the set of features which results in the best cross-validation evaluation metric. However, the major problem of this approach is its efficiency. When the number of features is too large, it is impossible to try all different combinations with limited computational resources. Thus, it is better to use the model-specific feature selection techniques in practice.

To tune model parameters, such as the $\lambda$ in Lasso, we can also use cross-validation. But again, the efficiency is our major concern.

Can we have feature selection and parameter tuning done automatically? Actually, automated machine learning has been a hot research topic in both academia and industry.

A decision tree consists of a bunch of nodes. In a decision tree there is a node with no parent nodes, which is called root node. The node without any children is called leaf node.

The length (number of nodes) of the longest path from the root node to a leaf node is called the depth of the tree. For example, the depth of the tree above is 4. Each leaf node has a label. In regression tasks, the label is a real number, and in classification tasks the label could could be a real number which is used to get the class indirectly (for example, fed into a sigmoid function to get the probability), or an integer representing the predicted class directly.

Each node except the leaves in a decision tree is associated with a splitting rule. These splitting rules determine to which leaf an instance belongs. A rule is just a function taking a feature as input and returns true or false as output. For example, a rule on the root could be $x_{1}<0$ and if it is true, we go to the left node otherwise go to the right node. Once we arrive at a leaf, we can get the predicted value based on the label of the leaf.

To get a closer look, let’s try to implement a binary tree structure for regression tasks in R/Python from scratch.

Let’s implement the binary tree as a recursive data structure, which is composed partially of similar instances of the same data structure. More specifically, a binary tree can be decomposed into three components, i.e., its root node, the left subtree under the root, and the right subtree of the root. To define a binary (decision) tree, we only need to define these three components. And to define the left and right subtrees, this decomposition is applied recursively until the leaves.

Now we have the big picture how to define a binary tree. However, to make the binary tree a decision tree, we also need to define the

splitting rules. For simplicity, we assume there is no missing value in our data and all variables are numeric. Then a splitting rule of a node is composed of two components, i.e., the variable to split on, and the corresponding breakpoint for splitting.

There is one more component we need to define in a decision tree, that is, the predict method which takes an instance as input and returns the prediction.

Now we are ready to define our binary decision tree.

chapter6/tree.R

chapter6/tree.py

You may have noticed the usage `@property`

in our Python implementation. It is one of builtin decorators in Python. We won’t talk too much of decorators. Basically, adding this decorator makes the method `depth`

behave like a property, in the sense that we can call `self.depth`

instead of `self.depth()`

to get the depth.

In the R implementation, the `invisible(self)`

is returned in the print method which seems strange. It is an issue^{12} of `R6`

class due to the `S3`

dispatch mechanism which is not introduced in this book.

The above implementation doesn’t involve the training or fitting of the decision tree. In this book, we wouldn’t talk about how to fit a traditional decision tree model due to its limited usage in the context of modern machine learning. Let’s see how to use the decision tree structures we defined above by creating a pseudo decision tree illustrated below.

It’s worth noting decision trees can approximate universally.

What is a gradient boosting machine (GBM) (or gradient boosting regression)? Essentially, a GBM is just a forest of decision trees.

If you have heard of random forest (RF), you may know that a random forest is also a bunch of trees. What is the difference between a GBM and RF?

Looking at the fitted trees from RF and GBM, there is no way to tell if the trees are fitted by a RF or a GBM. The major difference is how these trees are trained, rather than the trees themselves. A minor difference is how these trees are used for prediction. In many RF implementations, the prediction for an instance is the average prediction of each tree within the forest. If it is a classification task, there are two ways to get the final prediction – (a) predict the class with majority voting directly, i.e., the predicted class is the one with highest frequency among the predicted classes of all trees; (b) predict the probability based on the frequencies, for example, if among five trees there are three trees output class 1 then the predicted probability of class 1 is equal to $3/5$. In many GBM implementations, the prediction (for regression tasks) is the sum of the predictions of individual trees.

GBM fits trees sequentially, but RF fits trees independently. The obvious advantage of fitting trees independently is that it can be done in parallel. But accuracy-wise, GBM usually performs better according to my limited experience.

We have seen the structure of a single decision tree in GBM. Now it’s time to see how to get these trees fitted in GBM. Let’s start from the first tree.

To grow a tree, we start from its root node. In GBM fitting, usually we pre-determine a maximum depth $d$ for each tree to grow. And the final tree’s depth may be equal to or less than the maximum depth $d$. At a high level, the tree is grown in a recursive fashion. Specifically, first we attempt to split the current node and if the splitting improves the performance we grow the left subtree and the right subtree under the root node. When we grow the left subtree, its maximum depth is $d-1$, and the same applies to the right subtree. We can define a tree grow function for such purpose which takes a root node $Node_{root}$ (it is also a leaf) as input. The pseudo code of tree grow function is illustrated below.

- if $d>1$:

- call split function on $Node_{root}$

- if true is returned:- call grow function on the empty $Node_{left}$ with $d-1$ maximum depth
- call grow function on the empty $Node_{right}$ with $d-1$ maximum depth

- return

In fact, the algorithm of tree growing is just a DFS algorithm.

To complete the pseudo algorithm above, we need to have a split function which takes a leaf node as input and returns a boolean value as output. If true is returned, we will do the splitting, i.e., to grow the left/right subtree. So now the challenge is how to define the split function, which requires the understanding of the theories behind GBM.

Similar to other regression models we have seen so far, GBM with $K$ trees has a loss function which is defined below.

$$

\begin{equation}

\mathcal{L}=\sum_{i=1}^{n} {(y_{i}-\hat{y}_{i})^{2}},

\label{eq:gbm0}

\end{equation}

$$

where

$$

\begin{equation}

\hat{y}_{i} = \sum_{t=1}^{K} {f_{t}(\boldsymbol{x}_{i})}.

\label{eq:treepred}

\end{equation}

$$

$f_{t}$ denotes the prediction of $t^{th}$ tree in the forest. As we mentioned previously, the fitting is done sequentially. When we fit the $t^{th}$ tree, all the previous $t-1$ trees are fixed. And the loss function for fitting the $t^{th}$ tree is given below.

$$

\begin{equation}

\mathcal{L}^{(t)}=\sum_{i=1}^{n} {(y_{i}- \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})} – f_{t}(\boldsymbol{x}_{i}))^{2}}

\label{eq:gbm_t}

\end{equation}

$$

In practice, a regularized loss function \eqref{eq:treepred1} is used instead of \eqref{eq:treepred} to reduce overfitting.

$$

\begin{equation}

\mathcal{L}^{(t)}=\sum_{i=1}^{n} {(y_{i}- \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})} – f_{t}(\boldsymbol{x}_{i}))^{2}} + \Phi(f_{t}).

\label{eq:treepred1}

\end{equation}

$$

Let’s follow the paper^{13} and use the number of leaves as well as the L2 penalty of the values (also called weights) of the leaves for regularization. The loss function then becomes

$$

\begin{equation}

\mathcal{L}^{(t)}=\sum_{i=1}^{n} {(y_{i}- \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})} – f_{t}(\boldsymbol{x}_{i}))^{2}} + \gamma T + \frac 1 2 {\lambda \sum_{j=1}^{T}{\omega_{j}^{2}}} ,

\label{eq:growtree0}

\end{equation}

$$

where $\omega_{j}$ is the value associated with the $j_{th}$ leaf of the current tree.

Again, we get an optimization problem, i.e., to minimize the loss function \eqref{eq:growtree0}. The minimization problem can also be viewed as a quadratic programming problem. However, it seems different from the other optimization problems we have seen before, in the sense that the decision tree $f_{t}$ is a non-parametric model. A model is non-parametric if the model structure is learned from the data rather than pre-determined.

A common approach used in GBM is the second order approximation. By second order approximation, the loss function becomes

$$

\begin{equation}

\mathcal{L}^{(t)}\approx \sum_{i=1}^{n} {(y_{i} – \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})})^2} + \sum_{i=1}^{n} {(g_{i}f_{t}(\boldsymbol{x}_{i}) +\frac 1 2 {h_{i}f_{t}(\boldsymbol{x}_{i})^{2}})} + \gamma T + \frac 1 2 {\lambda \sum_{j=1}^{T}{\omega_{j}^{2}}} ,

\label{eq:growtree}

\end{equation}

$$

where $g\sb{i}=2(f\sb{t}(\boldsymbol{x}\sb{i}) + \sum\sb{l=1}^{t-1} {f\sb{l}(\boldsymbol{x}_{i})} – y_{i})$ and $h\sb{i}=2$ are the first and the second order derivatives of the function $(y\sb{i}- \sum_{l=1}^{t-1} {f\sb{l}(\boldsymbol{x}\sb{i})-f\sb{t}(\boldsymbol{x}\sb{i}))^{2}} $ with respect to the function $f_{t}(\boldsymbol{x}_{i})$.

Let’s implement the function to calculate $g$ and $h$ and put them into `util.py`

.

chapter6/util.py

Since the first item is a constant, let’s ignore it.

$$

\begin{equation}

\mathcal{L}^{(t)}\approx\sum_{i=1}^{n} {(g_{i}f_{t}(\boldsymbol{x}_{i}) +\frac 1 2 {h_{i}f_{t}(\boldsymbol{x}_{i})^{2}})} + \gamma T + \frac 1 2 {\lambda \sum_{j=1}^{T}{\omega_{j}^{2}}} ).

\label{eq:growtree1}

\end{equation}

$$

Let’s think of the prediction of an instance of the current tree. The training data, i.e., instances fall under the leaves of a tree. Thus, the prediction of an instance is the value $\omega$ associated with the corresponding leaf that the instance belongs to. Based on this fact, the loss function can be further rewritten as follows,

$$

\begin{equation}

\mathcal{L}^{(t)} \approx \sum\sb{j=1}^{T} {(\omega \sb{j} \sum\sb{i\in I\sb{j} }{g\sb{i}} + \frac 1 2 {\omega \sb{j} ^{2}( \sum\sb{i\in I\sb{j}} h\sb{i} +\lambda ) }}) + \gamma T .

\label{eq:growtree2}

\end{equation}

$$

When the structure of the tree is fixed the loss function \eqref{eq:growtree2} is a quadratic convex function of $\omega_{j}$, and the optimal solution can be obtained by setting the derivative to zero.

$$

\begin{equation}

\omega_{j}=- \frac {\sum_{i\in I_{j} } g_{i}} {\sum_{i\in I_{j}} h_{i} +\lambda}

\label{eq:omega}

\end{equation}

$$

Plugging \eqref{eq:omega} into the loss function results in the minimal loss of the current tree structure

$$

\begin{equation}

-\frac 1 2 \sum_{j=1}^{T} {\frac {(\sum_{i\in I_{j} }g_{i})^{2}} {\sum_{i\in I_{j}}h_{i} +\lambda} } + \gamma T .

\label{eq:minimalloss}

\end{equation}

$$

Now let’s go back to the split function required in the tree grow function discussed previously. How to determine if we need a splitting on a leaf? \eqref{eq:minimalloss} gives the solution – we can calculate the loss reduction after splitting which is given below

$$

\begin{equation}

\frac 1 2 {\bigg(\frac {(\sum\sb{i\in I\sb{left} }g\sb{i})^{2}} {\sum\sb{i\in I\sb{left}}h\sb{i}+ \lambda } + \frac {(\sum\sb{i\in I\sb{right} }g\sb{i})^{2}} {\sum\sb{i\in I\sb{right}}h\sb{i} + \lambda} – \frac {(\sum\sb{i\in I } g\sb{i})^{2}} {\sum\sb{i\in I} h\sb{i}+\lambda} \bigg)} – \gamma .

\label{eq:lossreduction}

\end{equation}

$$

If the loss reduction is positive, the split function returns true otherwise returns false.

So far, we have a few ingredients ready to implement our own GBM, which are listed below,

- the structure of a single decision tree in the forest, i.e., the
`Tree`

class defined; - the node splitting mechanism, i.e., \eqref{eq:lossreduction};
- the tree growing mechanism, i.e., the pseudo algorithm with the leaf value calculation \eqref{eq:omega}.

However, there are a few additional items we need to go through before the implementation.

In Chapter 5, we have seen how stochasticity works in iterative optimization algorithms. The stochasticity can be applied in both instance-wise and feature-wise. The stochasticity technique is very important is the optimization of GBM. More specifically, we apply the stochasticity both instance-wise and feature-wise. Instance-wise, when we fit a new tree, we randomly get a subsample from the training sample. And feature-wise, we randomly select a few features/variables when we fit a new tree. The stochasticity technique could help to reduce overfitting. The extent of the stochasticity can be controlled by arguments.

Like the gradient decent algorithm, in GBM there is also a learning rate parameter which scales the values of the leaves after a tree is fitted.

In practice, we may also not want to have too few instances under a leaf, to reduce potential overfitting. When there are two few instances under a leaf, we may just stop the splitting process.

Now we have almost all the ingredients to make a working GBM. Let’s define the `split_node`

function in the code snippet below.

chapter6/grow.py

And the implementation of the `GBM`

class is given below.

chapter6/gbm.py

Now let’s see the performance of our GBM implemented from scratch.

chapter6/test_gbm.py

Running the code above, we have output as below.

We don’t implement the model in R, but it is not difficult to do based on the Python implementation above.

GBM can be used with various loss functions, and the major difference is the implementation of the first/second order derivatives, i.e., $g$ and $h$.

Regardless of the performance, there are two major missing features in our implementation is a) cross-validation and b) early stopping.

We have talked about cross-validation, but what is early stopping? It is a very useful technique in GBM. Usually, the cross-validated loss decreases when we add new trees at the beginning and at a certain point, the loss may increase when more trees are fitted (due to overfitting). Thus, we may select the best number of trees based on the cross-validated loss. Specifically, stop the fitting process when the cross-validated loss doesn’t decrease. In practice, we don’t want to stop the fitting immediately when the cross-validated loss starts increasing. Instead, we specify a number of trees, e.g. 50, as a buffer, after which the fitting process should stop if cross-validated loss doesn’t decrease.

Early stopping is also used in other machine learning models, for example, neural network. Ideally, we would like to have early stopping based on the cross-validated loss. But when the training process is time-consuming, it’s fine to use the loss on a testing date set^{14}.

The commonly used GBM packages include `XGBoost`

^{15}, `LightGBM`

^{16} and `CatBoost`

^{17}.

Let’s see how to use `XGBoost`

for the same regression task on the Boston dataset.

chapter6/xgb.R

chapter6/xgb.py

The parameters `subsample`

and `colsample_bytree`

control the stochasticity, within the range of $[0, 1]$. If we set these two parameters to 1, then all instances and all features are selected for fitting every tree.

The two code snippets illustrate a minimal workflow of fitting a GBM model. First, we conduct (hyper) parameter tuning (such as learning rate, number of trees, regularization parameters, stochasticity parameters) with cross-validation, and next we train the model with the tuned parameters.

Running the code snippets, we have the following results.

In `XGBoost`

, we could also use linear regression models as the booster (or base learner) instead of decision trees. However, when `'booster':'gblinear'`

is used, the sum of the prediction from all boosters in the model is equivalent to the prediction from a single (combined) linear model. In that sense, what we get is just a Lasso solution of a linear regression model.

GBM can be used in different tasks, such as classification, ranking, survival analysis, etc. When we use GBM for predictive modeling, missing value imputation is not required, which is one big advantage over linear models. But in our own implementation we don’t consider missing values for simplicity. In GBM, if a feature is categorical we could do label-encoding^{18}, i.e., mapping the feature to integers directly without creating dummy variables (such as one-hot encoding). Of course one-hot encoding^{19} can also be used. But when there are too many new columns created by one-hot encoding, the probability that the original categorical feature is selected is higher than these numerical variables. In other words, we are assigning a prior weight to the categorical feature regarding the feature-wise stochasticity.

^{1} Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.

^{2} https://docs.python.org/3/reference/import.html

^{3} https://en.wikipedia.org/wiki/Mean_absolute_error

^{4} https://en.wikipedia.org/wiki/Mean_absolute_percentage_error

^{5} https://en.wikipedia.org/wiki/Categorical_distribution

^{6} https://en.wikipedia.org/wiki/Precision_and_recall

^{7} https://en.wikipedia.org/wiki/Feature_engineering

^{8} https://en.wikipedia.org/wiki/Kolmogorov-Arnold_representation_theorem

^{9} Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

^{10} https://en.wikipedia.org/wiki/Autoencoder

^{11} https://statisticalhorizons.com/multicollinearity

^{12} https://github.com/r-lib/R6/issues/140

^{13} Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.

^{14} https://en.wikipedia.org/wiki/Early_stopping

^{15} https://github.com/dmlc/xgboost

^{16} https://github.com/microsoft/LightGBM

^{17} https://github.com/catboost/catboost

^{18} https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

^{19} https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html