Sections in this Chapter:

- Convexity
- Gradient descent
- Root-finding
- General purpose minimization tools
- Linear programming
- Miscellaneous

We are talking about mathematical optimization in this chapter. Mathematical optimization is the selection of a best element from some set of available alternatives. Why do we talk about optimization in this book on data science? Data science is used to help make better decisions, and so is optimization. Operations Research, with optimization at its core, is a discipline that deals with the application of advanced analytical methods to make better decisions^{1}. I always feel Data Science and Operations Research greatly overlap. Training a (supervised) machine learning model amounts to minimizing a loss function, which is essentially an optimization problem. In optimization, loss functions are usually referred to as objective functions.

Mathematical optimization is a very broad field; we skip the theory in this book and only touch on some of the simplest applications. Actually, in Chapter 4 we have already seen one of its important applications in linear regression.

The importance of convexity cannot be overstated. A convex optimization problem^{2} has the following form

$$

\begin{equation}

\begin{split}

&\min_{\boldsymbol{x}} {f_0(\boldsymbol{x})}\\

&subject\quad to\quad f_i(\boldsymbol{x})\leq{\boldsymbol{b}_i};i=1,…,m,

\end{split}

\label{eq:convex}

\end{equation}

$$

where the vector $\boldsymbol{x}\in\mathbf{R}^n$ represents the decision variable; $f_i;i=0,1,…,m$ are convex functions ($\mathbf{R}^n\rightarrow\mathbf{R}$). A function $f_i$ is convex if

$$

\begin{equation}\label{eq:convex2}

f_i(\alpha\boldsymbol{x}+(1-\alpha)\boldsymbol{y})\leq {\alpha f_i(\boldsymbol{x})+ (1-\alpha) f_i(\boldsymbol{y})},

\end{equation}

$$

for all $\boldsymbol{x}$, $\boldsymbol{y}$ $\in\mathbf{R}^n$, and all $\alpha$ $\in\mathbf{R}$ with $0\leq\alpha\leq1$. \eqref{eq:convex} implies that a convex optimization problem requires both the objective function and the set of feasible solutions to be convex. Why do we care about convexity? Because a convex optimization problem has a very nice property: a local minimum (a solution minimal among a neighboring set of candidate solutions) is also a global minimum (the minimal solution among all feasible solutions). Thus, if we find a local minimum of a convex optimization problem, we have found the global minimum.
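As a quick sanity check of \eqref{eq:convex2}, the inequality can be verified numerically on random samples for a candidate function. The snippet below is a minimal illustration (not one of the book's listings; the function name and tolerance are my choices):

```python
import random

def is_convex_on_samples(f, trials=1000, scale=10.0):
    """Numerically check f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) on random samples."""
    random.seed(0)
    for _ in range(trials):
        x = random.uniform(-scale, scale)
        y = random.uniform(-scale, scale)
        a = random.random()  # alpha in [0, 1)
        # small tolerance guards against floating-point round-off
        if f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + 1e-9:
            return False
    return True

print(is_convex_on_samples(lambda x: x * x))    # x^2 is convex -> True
print(is_convex_on_samples(lambda x: -abs(x)))  # -|x| is concave -> False
```

Passing this check does not prove convexity, of course; it only fails to find a counterexample among the sampled points.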

Many methods/tools we will introduce in the following sections won’t throw an error if you feed them a non-convex objective function; but the solution returned by these numerical optimization methods is not necessarily a global optimum. For example, the algorithm may get trapped in a local optimum.

What is the most commonly used optimization algorithm in machine learning? Probably the answer is gradient descent (and its variants). Gradient descent is also an iterative method, and in step $k$ the update of $\boldsymbol{x}$ follows

$$

\begin{equation}

\begin{split}

\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\gamma\nabla f(\boldsymbol{x}^{(k)}),

\end{split}

\label{eq:5_gradient_descent}

\end{equation}

$$

where $f$ is the objective function and $\gamma$ is called the step size or learning rate. Starting from an initial value $\boldsymbol{x}^{(0)}$ and following \eqref{eq:5_gradient_descent}, we obtain a monotonic sequence $f(\boldsymbol{x}^{(0)})$, $f(\boldsymbol{x}^{(1)})$, …, $f(\boldsymbol{x}^{(n)})$ in the sense that $f(\boldsymbol{x}^{(k)})\geq f(\boldsymbol{x}^{(k+1)})$, provided the step size $\gamma$ is sufficiently small. When the problem is convex, $f(\boldsymbol{x}^{(k)})$ converges to the global minimum.
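The update rule \eqref{eq:5_gradient_descent} takes only a few lines of Python. Here is a minimal sketch (not one of the book's listings) minimizing the convex function $f(x_1,x_2)=x_1^2+2x_2^2$, whose gradient is $(2x_1, 4x_2)$:

```python
def gradient_descent(grad, x0, gamma=0.1, n_iter=100):
    """Vanilla gradient descent: x_{k+1} = x_k - gamma * grad(x_k)."""
    x = list(x0)
    for _ in range(n_iter):
        g = grad(x)
        x = [xi - gamma * gi for xi, gi in zip(x, g)]
    return x

# minimize f(x1, x2) = x1^2 + 2*x2^2, whose gradient is (2*x1, 4*x2)
x_min = gradient_descent(lambda x: [2 * x[0], 4 * x[1]], x0=[3.0, -2.0])
print(x_min)  # close to the global minimum [0, 0]
```

With $\gamma=0.1$ each coordinate shrinks geometrically toward zero; a larger step size (here, anything above $0.5$ for the second coordinate) would make the iteration diverge, which is why the step-size caveat above matters.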

Many machine learning models can be fitted by the gradient descent algorithm, for example, the linear regression (with or without penalty) introduced in Chapter 5. Let’s see how to use the vanilla (standard) gradient descent method for linear regression.

According to the gradient derived in Chapter 5, the update for the parameters becomes:

$$

\begin{equation}

\begin{split}

\boldsymbol{\hat{\beta}}^{(k+1)}=\boldsymbol{\hat{\beta}}^{(k)}+2\gamma \boldsymbol{X}’(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat{\beta}}^{(k)}).

\end{split}

\end{equation}

$$

chapter6/linear_regression_gradient_descent.R

Accordingly, we have the Python implementation as follows.

chapter6/linear_regression_gradient_descent.py

Let’s use a simulated dataset to test our algorithm.

The results show that our implementation of gradient descent update works well on the simulated dataset for linear regression.
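For a self-contained picture of the same update, here is a minimal pure-Python sketch for a single-feature model $y=\beta_0+\beta_1 x$ (the book's actual listings are in the files referenced above; the function name is mine, and I absorb a $1/n$ factor into the step for numerical stability, which only rescales $\gamma$):

```python
def lr_gradient_descent(x, y, gamma=0.01, n_iter=5000):
    """Fit y = b0 + b1*x with the update beta <- beta + 2*gamma*X'(y - X*beta)."""
    n = len(x)
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        g0 = sum(resid)                                # first column of X is all ones
        g1 = sum(ri * xi for ri, xi in zip(resid, x))  # second column is x
        b0 += 2 * gamma * g0 / n   # the 1/n factor is absorbed into the step size
        b1 += 2 * gamma * g1 / n
    return b0, b1

# simulated (noise-free) data from y = 1 + 2*x
x = [i / 10 for i in range(50)]
y = [1 + 2 * xi for xi in x]
b0, b1 = lr_gradient_descent(x, y)
print(round(b0, 3), round(b1, 3))  # approximately 1.0 and 2.0
```

On this noise-free simulation the iterates recover the true coefficients essentially exactly, which matches the behavior described above.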

However, when the loss function is non-differentiable, the vanilla gradient descent algorithm \eqref{eq:5_gradient_descent} cannot be used. In Chapter 5, we added the $L^2$ norm (also referred to as the Euclidean norm^{3}) of $\boldsymbol{\beta}$ to the loss function to get the ridge regression. What if we change the $L^2$ norm to the $L^1$ norm in the loss function?

$$

\begin{equation}\label{eq:5_lasso}

\min_{\boldsymbol{\hat\beta}} \quad \boldsymbol{e}’\boldsymbol{e} + \lambda \sum_{i=1}^p {|\boldsymbol{\beta}_i|},

\end{equation}

$$

where $\lambda>0$.

Solving the optimization problem specified in \eqref{eq:5_lasso}, we get the Lasso solution of linear regression. Lasso is not a specific type of machine learning model. Actually, Lasso refers to the least absolute shrinkage and selection operator. It is a method that performs both variable selection and regularization. What does variable selection mean? When we build a machine learning model, we may collect as many data points (also called features, or independent variables) as possible. However, if only a subset of these variables is relevant for the task, we may need to select that subset explicitly, for several reasons. First, doing so reduces the cost (time or money) of collecting/processing the irrelevant variables. Second, adding irrelevant variables to some machine learning models may hurt the model performance. But why? Let’s take linear regression as an example. Including an irrelevant variable in a linear regression model is usually referred to as model misspecification (of course, omitting a relevant variable also results in a misspecified model). In theory, the estimator of the coefficients is still unbiased ($\hat{\theta}$ is an unbiased estimator of $\theta$ if $E(\hat{\theta})=\theta$) when an irrelevant variable is included. But in practice, when the irrelevant variable has a non-zero sample correlation coefficient with the relevant variables or the response variable $\boldsymbol{y}$, the coefficient estimates may change. In addition, the estimators of the coefficients become inefficient^{4}. Thus, it’s better to select the relevant variables for a linear regression.

The Lasso solution is very important in machine learning because of its sparsity, which results in automated variable selection. As for why the Lasso solution is sparse, the theoretical explanation can be found in the reference book^{5}. Let’s focus on how to solve it. Apparently, the gradient descent algorithm cannot be applied directly to solve \eqref{eq:5_lasso}: although it is still a convex optimization problem, the regularized loss function is non-differentiable because of the absolute value operator $|\cdot|$.

**Proximal gradient** One of the algorithms that can be used to solve \eqref{eq:5_lasso} is called proximal gradient descent^{6}. Let’s skip the theory of proximal methods and the tedious linear algebra, and jump to the solution directly. By proximal gradient descent, the update in each iteration is given as follows

$$

\begin{equation}

\begin{split}

\boldsymbol{\beta}^{(k+1)} &= S_{\lambda \gamma}(\boldsymbol{\beta}^{(k)}-\gamma \nabla{(\boldsymbol{e}’\boldsymbol{e} )}) \\

&= S_{\lambda \gamma}(\boldsymbol{\beta}^{(k)}+2\gamma \boldsymbol{X}’(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}^{(k)})) ,

\end{split}

\label{eq:5_lasso_1}

\end{equation}

$$

where $S_{\theta}(\cdot)$ is called soft-thresholding operator given by

$$

\begin{equation}\label{eq:sto}

[S_{\theta}(\boldsymbol{z})]_i =\left\{\begin{array}{ll} \boldsymbol{z}_i - \theta & \boldsymbol{z}_i>\theta;\\

0 & |\boldsymbol{z}_i|\leq\theta; \\

\boldsymbol{z}_i + \theta & \boldsymbol{z}_i<-\theta.

\end{array} \right.

\end{equation}

$$
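The operator \eqref{eq:sto} is straightforward to implement elementwise; a pure-Python sketch (the function name is mine, not from the book's listings):

```python
def soft_threshold(z, theta):
    """Elementwise soft-thresholding operator S_theta(z) from the equation above."""
    out = []
    for zi in z:
        if zi > theta:
            out.append(zi - theta)      # shrink large positive entries toward zero
        elif zi < -theta:
            out.append(zi + theta)      # shrink large negative entries toward zero
        else:
            out.append(0.0)             # zero out entries with |z_i| <= theta
    return out

print(soft_threshold([3.0, -0.5, -2.0, 0.2], 1.0))  # [2.0, 0.0, -1.0, 0.0]
```

The middle branch is exactly where the sparsity of the Lasso solution comes from: any coordinate whose magnitude falls below the threshold is set to exactly zero.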

Now we are ready to implement the Lasso solution of linear regression via \eqref{eq:sto}. Similar to ridge regression, we don’t want to put a penalty on the intercept, and a natural estimate of the intercept is the mean of the response, i.e., $\bar{\boldsymbol{y}}$. The learning rate $\gamma$ in each update could be fixed or adaptive (changing over iterations). In the implementation below, we use the reciprocal of the largest eigenvalue of $\boldsymbol{X}’\boldsymbol{X}$ as the fixed learning rate.

chapter5/lasso.R

chapter5/lasso.py

Now, let’s see the application of our own Lasso solution of linear regression on the Boston house-prices dataset.

In the example above, we set $\lambda=200.0$ arbitrarily. The results show that the coefficients of `age` and `tax` are zero. In practice, the selection of $\lambda$ is usually done by cross-validation, which will be introduced later in this book. If you have used the off-the-shelf Lasso solvers in R/Python, you may wonder why the $\lambda$ used in this example is so big. One major reason is that in our implementation the first term in the loss function, i.e., $\boldsymbol{e}’\boldsymbol{e}$, is not scaled by the number of observations.

In practice, one challenge in applying Lasso regression is the selection of the parameter $\lambda$. We will talk about this in more detail in Chapter 6. The basic idea is to select the $\lambda$ that minimizes the prediction error on unseen data.

There is a close relation between optimization and root-finding. To find the roots of $f(x)=0$, we may try to minimize $f^2(x)$ alternatively. Under specific conditions, to minimize $f(x)$ we may try to solve the root-finding problem of $f’(x)=0$ (e.g., see linear regression in Chapter 4). Various root-finding methods are available in both R and Python.

Let’s see an application of root-finding in Finance.

Internal rate of return (IRR) is a measure used to estimate an investment’s potential profitability. In fact, it is the discount rate which makes the net present value (NPV) of all future cashflows (positive for inflows and negative for outflows) equal to zero. Given the discount rate $r$, NPV is calculated as

$$

\begin{equation}\label{eq:npv}

\sum_{t=0}^T C_t/(1+r)^t,

\end{equation}

$$

where $C_t$ denotes the cashflow at time $t$, measured as the duration from time 0. Thus, we can solve for the IRR by finding the root of the NPV function.

Let’s try to solve IRR in R/Python.

chapter5/xirr.R

chapter5/xirr.py
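Independent of the listings above, the idea can also be illustrated with a hand-rolled bisection on the NPV function; all names below are my own, and the bracketing interval is an assumption (it must contain a sign change):

```python
def npv(rate, cashflows):
    """NPV of cashflows C_0, C_1, ..., C_T at annual discount rate `rate`."""
    return sum(c / (1 + rate) ** t for t, c in enumerate(cashflows))

def irr_bisect(cashflows, lo=-0.99, hi=10.0, tol=1e-10):
    """Find the rate where NPV = 0 by bisection (assumes NPV changes sign on [lo, hi])."""
    f_lo = npv(lo, cashflows)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, cashflows) * f_lo > 0:
            lo, f_lo = mid, npv(mid, cashflows)  # same sign as lo: move left end
        else:
            hi = mid                             # sign change in [lo, mid]
    return (lo + hi) / 2

# invest 100 now, receive 60 at the end of each of the next two years
r = irr_bisect([-100, 60, 60])
print(round(r, 4))  # the rate at which NPV([-100, 60, 60]) is ~0
```

Bisection is slow but very robust for one-dimensional problems like this; `uniroot` and `scipy.optimize.root` use more sophisticated schemes, but the contract is the same: hand in the NPV function and get back the rate that zeroes it.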

Please note the R function `uniroot` works on one-dimensional functions only, while the `scipy.optimize.root` function also works on multi-dimensional functions. `scipy.optimize.root` requires an initial guess for the solution, but `uniroot` does not. To find roots of multi-dimensional functions in R, the `optim` function introduced in the next Section can also be used. Also, the package `rootSolve` could be useful. There are other useful features in both functions; for example, we can control the tolerance for termination.

Implementing your own algorithms from scratch could be fun, but they may not always be efficient and stable; we do it in this book for pedagogical purposes. For general purpose minimization problems, we can use the `optim` function in R and the `scipy.optimize.minimize` function in Python as a black box. Various optimization algorithms have been implemented as the workhorses of these two functions. Some of these optimization algorithms (e.g., quasi-Newton methods^{7}) require the gradient; and some are derivative-free (e.g., the Nelder–Mead method^{8}). First, let’s see a simple example.

Maximum likelihood estimation (MLE)^{9} is a very important topic in Statistics. I will give a very brief introduction to what MLE is. To understand MLE, first we need to understand the likelihood function. Simply speaking, the likelihood function $L(\theta|{x})$ is a function of the unknown parameters $\theta$ that describes the probability or odds of obtaining the observed data ${x}$ ($x$ is a realization of the random variable $X$). For example, when the random variable $X$ follows a continuous probability distribution with probability density function (pdf) $f_\theta$, $L(\theta|{x}) = f_{\theta}(x)$. We often observe a random sample consisting of multiple independent observations rather than a single observation. In that case, the likelihood function is

$$

\begin{equation}

\begin{split}

L(\theta|x)=\prod_{i=1} ^{n} {f_{\theta}(x_i)}.

\end{split}

\label{eq:5_0}

\end{equation}

$$

Given the observed data and a model based on which the data are generated, we may maximize the corresponding likelihood function to estimate the model parameters. There are some nice properties of MLE, such as its consistency and efficiency. To better understand MLE and its properties, I recommend reading Statistical Inference^{10}.

In practice, it’s common to maximize the logarithm of the likelihood function, i.e., the log-likelihood function (or, equivalently, to minimize the negative log-likelihood). When $X\sim\mathcal{N}(\mu,\sigma^2)$, the pdf of $X$ is given as below

$$

\begin{equation}

\begin{split}

f(x|\mu,\sigma^2)=\frac 1 {\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/{2\sigma^2}}.

\end{split}

\label{eq:5_2}

\end{equation}

$$

Taking the logarithm, the log-likelihood function of $n$ independent observations is equal to

$$

\begin{equation}

\begin{split}

\mathcal{L}(\theta|x)={-\frac n 2 \log(2\pi)- n \log\sigma - \frac 1 {2\sigma^2} \sum_{i=1}^{n}(x_i-\mu)^2}.

\end{split}

\label{eq:5_3}

\end{equation}

$$

Since the first term in \eqref{eq:5_3} is a constant, we can simplify the log-likelihood function to

$$

\begin{equation}

\begin{split}

\mathcal{L}(\theta|x)={- n \log\sigma- \frac 1 {2\sigma^2} \sum_{i=1}^{n}(x_i-\mu)^2}.

\end{split}

\label{eq:5_4}

\end{equation}

$$

It’s worth noting that maximizing \eqref{eq:5_4} is equivalent to minimizing its negative, which is the form we feed to the numerical optimizers below.

Let’s implement the MLE for normal distribution in R/Python.

chapter5/normal_mle.R

chapter5/normal_mle.py

There is no bound set for the mean parameter $\mu$, while we need to set a lower bound for the standard deviation $\sigma$. We chose ‘L-BFGS-B’ as the optimization algorithm in this example, which uses the gradient of the objective function. When the gradient function is not provided, the gradient is estimated numerically. In general, providing the gradient function may speed up the optimization. ‘L-BFGS-B’ is a quasi-Newton method. Newton’s method requires the Hessian matrix of the objective function, but the calculation of the Hessian matrix may be expensive or even unavailable in some cases. Quasi-Newton methods approximate the Hessian matrix instead of calculating it exactly. Now let’s use these functions to estimate the parameters of the normal distribution.
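The normal MLE also has a closed form ($\hat\mu=\bar x$, $\hat\sigma^2=\frac1n\sum(x_i-\bar x)^2$), which gives a handy cross-check for any numerical optimizer: the negative of \eqref{eq:5_4} evaluated at the closed-form estimates should be no larger than at nearby parameter values. A pure-Python sketch (no `scipy` here; the helper names are mine):

```python
import math
import random

def neg_log_lik(mu, sigma, x):
    """Negative of the simplified log-likelihood: n*log(sigma) + sum((xi-mu)^2)/(2*sigma^2)."""
    return len(x) * math.log(sigma) + sum((xi - mu) ** 2 for xi in x) / (2 * sigma ** 2)

random.seed(42)
x = [random.gauss(2.0, 1.5) for _ in range(1000)]  # simulated N(2, 1.5^2) sample

# closed-form MLE: sample mean and (uncorrected) sample standard deviation
mu_hat = sum(x) / len(x)
sigma_hat = math.sqrt(sum((xi - mu_hat) ** 2 for xi in x) / len(x))

# the closed-form MLE should beat any perturbed parameter values
best = neg_log_lik(mu_hat, sigma_hat, x)
for d_mu, d_sigma in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert best <= neg_log_lik(mu_hat + d_mu, sigma_hat + d_sigma, x)
print(round(mu_hat, 2), round(sigma_hat, 2))  # close to the true (2.0, 1.5)
```

A numerical routine such as `optim` or `scipy.optimize.minimize` should land at (approximately) these same estimates; if it doesn't, the objective, gradient, or bounds are worth double-checking.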

Let’s see another application of the general purpose minimization – logistic regression.

We have seen how linear regression works in Chapter 4. Logistic regression works similarly to linear regression. The major difference is that logistic regression is used for classification problems. Classification means identifying to which of a set of categories a new observation belongs. Among classification problems, if each observation belongs to one of two disjoint categories, it is a binary classification problem. Binary classification has lots of real-world applications, such as spam email filtering, loan default prediction, and sentiment analysis.

In logistic regression, the prediction output for an observation is the probability that the observation falls into a specific category. Let $\boldsymbol{X}$ represent an $n\times (p+1)$ matrix ($n>p$) of independent variables (predictors) with the constant vector $\boldsymbol{1}$ in the first column, and let $\boldsymbol{y}$ represent a column vector of the response variable (target). Please note $\boldsymbol{y}$ is a binary vector, i.e., $\boldsymbol{y}_i\in{\{0,1\}}$. Formally, a logistic regression is specified as

$$

\begin{equation}

\begin{split}

Pr(\boldsymbol{y}_i=1|\boldsymbol{x}_i,\boldsymbol{\beta})=\frac 1 {1+e^{-\boldsymbol{x}’_i\boldsymbol{\beta}}},

\end{split}

\label{eq:5_5}

\end{equation}

$$

where $\boldsymbol{x}_i=[1,\boldsymbol{X}_{1i},\boldsymbol{X}_{2i},…,\boldsymbol{X}_{pi}]’$, and $\boldsymbol{\beta}$ denotes the parameter of the logistic model.

What does \eqref{eq:5_5} mean? It implies the response variable $\boldsymbol{y}_i$ given $\boldsymbol{x}_i$ and $\boldsymbol{\beta}$ follows a Bernoulli distribution. More specifically, $\boldsymbol{y}_i\sim{Bern(1/( 1+\exp(-\boldsymbol{x}’_i\boldsymbol{\beta}) ))}$. Based on the assumption that the observations are independent, the log-likelihood function is

$$

\begin{equation}

\begin{split}

&\mathcal{L}(\beta|\boldsymbol{X},\boldsymbol{y}) \\

&=\log{ \prod_{i=1}^{n} {\Big( {Pr(\boldsymbol{y}_i=1|\boldsymbol{x}_i,\boldsymbol{\beta})}^{\boldsymbol{y}_i } {Pr(\boldsymbol{y}_i=0|\boldsymbol{x}_i,\boldsymbol{\beta})}^{1-\boldsymbol{y}_i} \Big)}}\\

&= \sum_{i=1}^{n} {\Big( \boldsymbol{y}_i \log{Pr(\boldsymbol{y}_i=1|\boldsymbol{x}_i,\boldsymbol{\beta})} + (1-\boldsymbol{y}_i) \log{Pr(\boldsymbol{y}_i=0|\boldsymbol{x}_i,\boldsymbol{\beta})} \Big)}\\

&= \sum_{i=1} ^{n} {\Big(\boldsymbol{y}_i\boldsymbol{x}’_i\boldsymbol{\beta} – \log(1+e^{\boldsymbol{x}’_i\boldsymbol{\beta}})\Big)}.

\end{split}

\label{eq:5_6}

\end{equation}

$$

Given the log-likelihood function, we can get the maximum likelihood estimate of logistic regression by maximizing \eqref{eq:5_6} (equivalently, minimizing its negative, which is convex). The minimization can be done similarly to linear regression via the iteratively re-weighted least squares (IRLS) method^{5}. However, in this Section we will not use the IRLS method; instead, let’s use the `optim` function and the `scipy.optimize.minimize` function in R and Python, respectively.

chapter5/logistic_regression.R

chapter5/logistic_regression.py
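To see the pieces fit together, here is a minimal pure-Python logistic regression fitted by gradient ascent on \eqref{eq:5_6}, whose gradient is $\sum_i \boldsymbol{x}_i(\boldsymbol{y}_i - p_i)$ with $p_i$ the predicted probability. This is a sketch on a tiny synthetic dataset rather than the book's `optim`/`scipy.optimize.minimize` listing; the names and the $1/n$ step scaling are mine:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, gamma=0.1, n_iter=2000):
    """Gradient ascent on the log-likelihood; rows of X already include the constant 1."""
    p_dim = len(X[0])
    beta = [0.0] * p_dim
    for _ in range(n_iter):
        # gradient of the log-likelihood: sum_i x_i * (y_i - sigmoid(x_i' beta))
        grad = [0.0] * p_dim
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for j in range(p_dim):
                grad[j] += err * xi[j]
        beta = [b + gamma * g / len(X) for b, g in zip(beta, grad)]
    return beta

# tiny synthetic data: y tends to 1 when the feature is positive
X = [[1.0, v] for v in [-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
beta = fit_logistic(X, y)
preds = [1 if sigmoid(sum(b * v for b, v in zip(beta, xi))) > 0.5 else 0 for xi in X]
print(preds)  # matches y on this separable toy data
```

Note that this toy data is perfectly separable, so the unpenalized coefficients would drift to infinity if we iterated forever; with a finite iteration budget the fitted direction is what matters for classification.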

Now let’s see how to use our own logistic regression. We use the banknote dataset^{11} in this example. In the banknote dataset, there are four different predictors extracted from the image of a banknote, which are used to predict whether a banknote is genuine or forged.

The above implementation of logistic regression works well. But it is only a pedagogical toy to illustrate what a logistic regression is and how it can be solved with a general purpose optimization tool. In practice, there are fast and stable off-the-shelf software tools to choose from. It’s also worth noting that most iterative optimization algorithms allow specifying the maximum number of iterations as well as the tolerance. Sometimes it’s helpful to tune these parameters.

In practice, the penalized ($L^1, L^2$) versions of logistic regression are more commonly used than the vanilla logistic regression we introduced above.

In linear programming (LP)^{12} both the objective function and the constraints are linear. LP can be expressed as follows,

$$

\begin{equation}

\begin{split}

&\max_{\boldsymbol{x}} {\boldsymbol{c}’\boldsymbol{x}}\\

&subject\quad to\quad \boldsymbol{Ax}\leq{\boldsymbol{b}},

\end{split}

\label{eq:5_1}

\end{equation}

$$

where the vector $\boldsymbol{x}$ represents the decision variable; $\boldsymbol{A}$ is a matrix, $\boldsymbol{b}$ and $\boldsymbol{c}$ are vectors. We can do minimization instead of maximization in practice. Also, it’s completely valid to have equality constraints for LP problems. All LP problems can be converted into the form of \eqref{eq:5_1}. For example, the equality constraint $\boldsymbol{Cx}=\boldsymbol{d}$ can be transformed to inequality constraints $\boldsymbol{Cx}\leq\boldsymbol{d}$ and $\boldsymbol{-Cx}\leq-\boldsymbol{d}$.

Every LP problem falls into one of three categories:

- infeasible – no solution that satisfies all constraints exists;
- unbounded – for every feasible solution, there exists another feasible solution that improves the objective function;
- having an optimal solution – a unique maximum (or minimum) value of the objective function exists.

When an LP problem has an optimal solution, there might be different sets of decision variables that lead to the same optimal value of the objective function. LP problems are also convex problems^{2}. The algorithms to solve LP problems have been well-studied. The basic and popular algorithm is Dantzig’s simplex method, which was invented in 1947^{13}. More recent interior-point methods have better worst-case complexity than the simplex method. In R/Python, we can find packages that implement both algorithms.

LP has been extensively used in business, economics, and some engineering problems. First, let’s see a simple application of LP in portfolio construction.

Let’s assume there are 4 different investment opportunities, A, B, C, and D, listed in the table below. There is no intermediate payment between the vintage year and the maturity year. Payments received on maturity can be reinvested. We also assume the risk-free rate is 0. The goal is to construct a portfolio using these investment opportunities with $10,000 at year 1 so that the net asset value at year 5 is maximized.

Available investment opportunities

| investment | vintage | maturity | rate of return (annualized) |
|------------|---------|----------|-----------------------------|
| A          | year 1  | year 2   | 0.08                        |
| B          | year 2  | year 5   | 0.12                        |
| C          | year 3  | year 5   | 0.16                        |
| D          | year 1  | year 4   | 0.14                        |

Let $x_1$, $x_2$, $x_3$ and $x_4$ denote the amount to invest in each investment opportunity; and let $x_0$ denote the cash not invested in year 1. The problem can be formulated as

$$

\begin{equation}

\begin{split}

\max_{\boldsymbol{x}} \quad & \boldsymbol{x}_0+\boldsymbol{x}_1{(1+0.08)} - \boldsymbol{x}_2 - \boldsymbol{x}_3 + \\

\boldsymbol{x}_2{(1+0.12)^3} + & \boldsymbol{x}_3{(1+0.16)^2} + \boldsymbol{x}_4{(1+0.14)^3} \\

subject\quad to \quad\quad & \boldsymbol{x}_0 + \boldsymbol{x}_1 + \boldsymbol{x}_4 = 10000 ;\\

& \boldsymbol{x}_0 + \boldsymbol{x}_1(1+0.08)\geq\boldsymbol{x}_2 + \boldsymbol{x}_3 ; \\

& \boldsymbol{x}_i \geq0;i=0,…,4.

\end{split}

\label{eq:linear_investment}

\end{equation}

$$

In Python, there are quite a few tools that can be used to solve LP, such as `ortools`^{14}, `scipy.optimize.linprog`^{15}, and `CVXPY`^{16}. We will show the solution using `ortools` in Python and `lpSolve`^{17} in R.

chapter5/portfolio_construction.R

chapter5/portfolio_construction.py
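Since `scipy.optimize.linprog` is another option mentioned above, here is a hedged sketch of the same LP in that interface (my own translation, not one of the book's listings): the maximization is converted to a minimization by negating the objective, and decision variables are non-negative by default in `linprog`.

```python
from scipy.optimize import linprog

# objective (to maximize): x0 + 1.08*x1 + (1.12**3 - 1)*x2 + (1.16**2 - 1)*x3 + 1.14**3*x4
# linprog minimizes, so negate the coefficients
c = [-1.0, -1.08, -(1.12 ** 3 - 1), -(1.16 ** 2 - 1), -(1.14 ** 3)]
A_eq = [[1, 1, 0, 0, 1]]       # x0 + x1 + x4 = 10000 (the year-1 budget)
b_eq = [10000]
A_ub = [[-1, -1.08, 1, 1, 0]]  # x2 + x3 <= x0 + 1.08*x1 (funds available after year 1)
b_ub = [0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
print(round(-res.fun, 2))  # 15173.22, matching the lpSolve/ortools result below
```

Like `lpSolve`, `linprog` takes the compact matrix form \eqref{eq:5_1} directly, which keeps the code short compared with building the model constraint by constraint.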

We see that the optimal portfolio allocation results in a net asset value of 15173.22 in year 5. You may notice that the Python solution is lengthy compared with the R solution. That is because the `ortools` interface in Python is in an OOP fashion and we add the constraints one by one; but for `lpSolve`, we utilize the compact matrix form \eqref{eq:5_1} to specify an LP problem. In `ortools`, we specify the bounds of a decision variable during its declaration. In `lpSolve`, all decision variables are non-negative by default.

LP also has applications in machine learning, for example, Least absolute deviations (LAD) regression^{18} can be solved by LP. LAD regression is just another type of linear regression. The major difference between LAD regression and the linear regression we introduced in Chapter 4 is that the loss function to minimize in LAD is the sum of the absolute values of the residuals rather than the SSR loss.
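The reduction of LAD regression to LP is worth sketching. Introduce auxiliary variables $u_i$ bounding the absolute residuals; this is the standard textbook transformation, not a listing from this chapter:

$$
\begin{equation*}
\begin{split}
\min_{\boldsymbol{\beta},\boldsymbol{u}} \quad & \sum_{i=1}^n u_i\\
subject\quad to\quad & \boldsymbol{y}_i-\boldsymbol{x}'_i\boldsymbol{\beta}\leq u_i;i=1,…,n,\\
& -(\boldsymbol{y}_i-\boldsymbol{x}'_i\boldsymbol{\beta})\leq u_i;i=1,…,n.
\end{split}
\end{equation*}
$$

At the optimum each $u_i$ equals $|\boldsymbol{y}_i-\boldsymbol{x}'_i\boldsymbol{\beta}|$ (otherwise it could be decreased), so minimizing $\sum_i u_i$ minimizes the sum of absolute residuals, and the whole problem is linear in $(\boldsymbol{\beta},\boldsymbol{u})$.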

The update schedule of gradient descent is straightforward. But what do we do if the dataset is too big to fit into memory? That is a valid question, especially in the era of big data. There is a simple but brilliant solution: we can take a random sample from all the observations and evaluate the loss function only on that random sample in each iteration. This variant of the gradient descent algorithm is called stochastic gradient descent.

You may wonder: if we can randomly take a subset of all the observations, can we also take just a subset of the variables $\boldsymbol{x}$? Yes, and sometimes it is quite useful. For example, in coordinate descent, in each iteration we update a single coordinate $\boldsymbol{x}^j$ rather than the entire $\boldsymbol{x}$. Coordinate descent has been widely applied for the Lasso solution of generalized linear models^{19}.

Newton’s method looks similar to gradient descent, but it requires the Hessian matrix of the objective function (see \eqref{eq:newton_method}). In theory, the convergence of Newton’s method is faster than that of the gradient descent algorithm. But not all differentiable objective functions have second order derivatives; and the evaluation of the Hessian matrix may also be computationally expensive.

$$

\begin{equation}

\begin{split}

\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\gamma {H(\boldsymbol{x}^{(k)})}^{-1} \nabla f(\boldsymbol{x}^{(k)})

\end{split}

\label{eq:newton_method}

\end{equation}

$$
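In one dimension the Hessian in \eqref{eq:newton_method} is just the scalar second derivative, so the method reduces to $x \leftarrow x - f'(x)/f''(x)$ (taking $\gamma=1$). A minimal sketch, minimizing the convex function $f(x)=x^2+e^x$ (an illustrative choice of mine):

```python
import math

def newton_minimize(fprime, fsecond, x0, n_iter=20):
    """Newton's method for 1-D minimization: x <- x - f'(x)/f''(x), with gamma = 1."""
    x = x0
    for _ in range(n_iter):
        x -= fprime(x) / fsecond(x)
    return x

# minimize f(x) = x^2 + exp(x): f'(x) = 2x + exp(x), f''(x) = 2 + exp(x)
x_star = newton_minimize(lambda x: 2 * x + math.exp(x),
                         lambda x: 2 + math.exp(x), x0=0.0)
print(round(x_star, 4))  # the stationary point where 2x + exp(x) = 0
```

Near the solution the iteration converges quadratically, i.e., the number of correct digits roughly doubles per step; that is the payoff for supplying second-derivative information.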

We have seen the proximal gradient descent method for Lasso. Actually, the second order derivatives can also be incorporated into the proximal gradient descent method, which leads to the proximal Newton method.

In general, constrained optimization has the following form.

$$

\begin{equation}

\begin{split}

&\max_{\boldsymbol{x}} {f(\boldsymbol{x})}\\

subject\quad to\quad & g_i(\boldsymbol{x})\leq{\boldsymbol{b}_i};i=1,…,m,\\

& h_j(\boldsymbol{x})={\boldsymbol{c}_j};j=1,…,k,

\end{split}

\label{eq:constrainedoptimization}

\end{equation}

$$

where the $g_{i}$’s are the inequality constraints and the $h_{j}$’s are the equality constraints, both of which could be linear or nonlinear. A constrained optimization problem may or may not be convex. Although there are some tools in R/Python for constrained optimization, they may fail if you just throw the problem at them. Thus, it is better to have a tailored treatment for the problem. In general, that requires understanding the basic theory of constrained optimization, such as the Lagrangian and the Karush-Kuhn-Tucker (KKT) conditions^{2}.

Quadratic programming (QP)^{20} is also commonly seen in machine learning. For example, the optimization problems in vanilla linear regression, ridge regression, and the Lasso solution of linear regression all fall into the QP category. To see why the Lasso solution of linear regression is also QP, let’s convert the optimization problem from the unconstrained form \eqref{eq:5_lasso} to the constrained form \eqref{eq:5_lasso11}, similarly to ridge regression.

$$

\begin{equation}

\begin{split}

\min_{\boldsymbol{\hat\beta}}& \quad \boldsymbol{e}’\boldsymbol{e}\\

subject\quad to\quad & \lambda \sum_{i=1}^p {|\boldsymbol{\beta}_i|}\leq t.

\end{split}

\label{eq:5_lasso11}

\end{equation}

$$

The absolute value operator in the constraint in \eqref{eq:5_lasso11} can be eliminated by expanding the constraint into $2^p$ linear constraints.

$$

\begin{equation}

\begin{split}

\min_{\boldsymbol{\hat\beta}} \quad \boldsymbol{e}’\boldsymbol{e}& \\

subject \quad to & \\

\lambda\boldsymbol{\beta}_{1}+\lambda\boldsymbol{\beta}_{2}+…+ & \lambda\boldsymbol{\beta}_{p} \leq t \\

-\lambda\boldsymbol{\beta}_{1}+\lambda\boldsymbol{\beta}_{2}+…+ & \lambda\boldsymbol{\beta}_{p} \leq t \\

& … \\

-\lambda\boldsymbol{\beta}_{1}-\lambda\boldsymbol{\beta}_{2}-…- & \lambda\boldsymbol{\beta}_{p} \leq t.

\end{split}

\label{eq:5_lasso_2}

\end{equation}

$$

It is clear the problem specified in \eqref{eq:5_lasso_2} is a QP problem. The optimization of support vector machine (SVM) can also be formulated as a QP problem.

In R and Python, there are solvers^{21} designed specifically for QP problems, which can be used as a black box.

Some optimization algorithms are problem-specific. For example, the simplex method is mainly used for linear programming problems. In contrast, metaheuristic optimization algorithms are not problem-specific. Actually, metaheuristic algorithms are strategies to guide the search process for solutions. There are a wide variety of metaheuristic algorithms in the literature, and one of my favorites is simulated annealing (SA)^{22}. The SA algorithm can be used for both continuous and discrete optimization problems (or hybrid problems, such as mixed integer programming^{23}). The `optim` function in R has the SA algorithm implemented (set `method = 'SANN'`) for general purpose minimization with continuous variables. For discrete optimization problems, it usually requires a customized implementation. The pseudocode of SA is given as follows.

- define the cooling schedule $T_k$, an energy function $E$, and an initial solution $s_0$
- for $k=1,…,m$

    - pick a random neighbor $s_k$ based on $s_{k-1}$

    - calculate the energy change $\delta=E(s_k)-E(s_{k-1})$

    - calculate the acceptance probability $P=e^{-\delta/T_k}$

    - if $P<random(0, 1)$, set $s_k=s_{k-1}$

- return $s_m$.
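The pseudocode above can be turned into a tiny continuous-optimization sketch. Here SA minimizes $f(x)=(x-3)^2$ with a geometric cooling schedule $T_k = T_0\alpha^k$; all settings (seed, neighbor step, temperatures) are illustrative choices of mine, not from the book's TSP listings:

```python
import math
import random

def simulated_annealing(energy, s0, t0=10.0, alpha=0.99, n_iter=2000):
    """Minimal SA for a 1-D continuous problem, following the pseudocode above."""
    random.seed(42)
    s, t = s0, t0
    for _ in range(n_iter):
        t *= alpha                                # cooling schedule T_k = T_0 * alpha^k
        cand = s + random.uniform(-0.5, 0.5)      # random neighbor of the incumbent
        delta = energy(cand) - energy(s)
        # accept if better, or with probability exp(-delta/T_k) if worse
        if delta < 0 or random.random() < math.exp(-delta / t):
            s = cand
    return s

s = simulated_annealing(lambda x: (x - 3.0) ** 2, s0=10.0)
print(round(s, 2))  # close to the minimizer x = 3
```

Note how the acceptance rule matches the pseudocode: a downhill move ($\delta<0$) gives $P>1$ and is always accepted, while uphill moves are accepted less and less often as the temperature cools.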

Let’s see how we can apply SA for the famous traveling salesman problem.

Given a list of cities, what is the shortest route that visits each city once and returns to the original city? This combinatorial optimization problem is the traveling salesman problem (TSP). When there are $K$ cities, the number of feasible solutions is equal to $(K-1)!/2$. It is difficult to enumerate and evaluate all solutions for a large number of cities. SA algorithm fits this problem quite well.

Let’s consider a toy problem with only 10 cities. To apply the SA algorithm, we need to determine a cooling schedule. For simplicity, we choose $T_k=T_0\alpha^k$. As for the energy function, the length of the route is a natural choice. For many problems, the challenge in applying SA is how to generate a new solution from the incumbent. Most of the time there are various ways to do that, but some of them may not be efficient. In this TSP problem, we adopt a simple strategy, i.e., randomly select two different cities from the current route and switch them. If the switch results in a shorter route, we accept the switch; if the new route is longer, we may still accept the switch with the acceptance probability.

SA with this simple strategy is implemented in both R and Python as follows.

chapter5/TSP.R

chapter5/TSP.py

Let’s run our TSP solver to see the results.

In the implementation above, we set the initial temperature to $2000$ and the cooling rate $\alpha$ to $0.99$. After 2000 iterations, we get the same route from both implementations. In practice, it’s common to see metaheuristic algorithms get trapped in local optima.

^{1} https://en.wikipedia.org/wiki/Operations_research

^{2} Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

^{3} https://en.wikipedia.org/wiki/Norm_(mathematics)

^{4} https://en.wikipedia.org/wiki/Efficient_estimator

^{5} Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer Series in Statistics. New York, NY: Springer, second edition, 2009.

^{6} Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3): 127–239, 2014.

^{7} https://en.wikipedia.org/wiki/Quasi-Newton_method

^{8} https://en.wikipedia.org/wiki/Nelder-Mead_method

^{9} https://en.wikipedia.org/wiki/Maximum_likelihood_estimation

^{10} George Casella and Roger L Berger. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.

^{11} https://archive.ics.uci.edu/ml/datasets/banknote+authentication

^{12} https://en.wikipedia.org/wiki/Linear_programming

^{13} George Dantzig. Linear programming and extensions. Princeton university press, 2016.

^{14} https://developers.google.com/optimization

^{15} https://scipy.org

^{16} https://www.cvxpy.org

^{17} https://cran.r-project.org/web/packages/lpSolve/lpSolve.pdf

^{18} https://en.wikipedia.org/wiki/Least_absolute_deviations

^{19} Jerome Friedman, Trevor Hastie, and Rob Tibshirani. glmnet: Lasso and elastic-net regularized generalized linear models. R package version, 1(4), 2009.

^{20} https://en.wikipedia.org/wiki/Quadratic_programming

^{21} https://cran.r-project.org/web/views/Optimization.html

^{22} Peter JM Van Laarhoven and Emile HL Aarts. Simulated annealing. In Simulated annealing: Theory and applications, pages 7–15. Springer, 1987.

^{23} https://en.wikipedia.org/wiki/Integer_programming