Sections in this Chapter:

There are numerous books/online courses available on the theory of linear regression regression, among which my favorites are – The Elements of Statistical Learning \citep{friedman2001elements} and \citep{greene2003econometric}. So what is the purpose to write a chapter about these models in another book of data science? Many audience would be interested in how to implement their own regression models rather than using the off-the-shelf software packages. By the end of this chapter, we will build up our own linear regression regression models in R/Python. We would also see how we would reuse some functions in different regressions, such as linear regression regression, and these regressions with L2 penalties.

We start from the matrix form of linear regression.

$$

\begin{equation}

\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon},

\label{eq:1}

\end{equation}

$$

where the column vector $\boldsymbol{y}$ contains $n$ observations on the dependent variable, $\boldsymbol{X}$ is a $n\times (p+1)$ matrix ($n>p$) of independent variables with constant vector $\boldsymbol{1}$ in the first column, $\boldsymbol{\beta}$ is a column vector of unknown population parameters to estimate based on the data, and $\boldsymbol{\epsilon}$ is the error term (or noise). For the sake of illustration, \eqref{eq:1} can be extended as in

$$

\begin{equation}\label{eq:2}

\begin{bmatrix}

\boldsymbol{y}_1 \\

\boldsymbol{y}_2 \\

\vdots \\

\boldsymbol{y}_n

\end{bmatrix}

=

\begin{bmatrix}

1 & \boldsymbol{X}_{11} & \boldsymbol{X}_{21} & \cdots & \boldsymbol{X}_{p1} \\

1 & \boldsymbol{X}_{12} & \boldsymbol{X}_{22} & \cdots & \boldsymbol{X}_{p2} \\

\vdots & \vdots & \vdots & \vdots & \vdots \\

1 & \boldsymbol{X}_{1n} & \boldsymbol{X}_{2n} & \cdots & \boldsymbol{X}_{pn}

\end{bmatrix}

\begin{bmatrix}

\boldsymbol{\beta}_0 \\

\boldsymbol{\beta}_1 \\

\vdots \\

\boldsymbol{\beta}_p

\end{bmatrix}

+

\begin{bmatrix}

\boldsymbol{\epsilon}_1 \\

\boldsymbol{\epsilon}_2 \\

\vdots \\

\boldsymbol{\epsilon}_n

\end{bmatrix}

\end{equation}

$$

We apply ordinary least squares (OLS)^{3} approach to estimate the model parameter $\boldsymbol{\beta}$ since it requires fewer assumptions than other estimation methods such as maximum likelihood estimation^{4}. Suppose the estimated model parameter is denoted as $\boldsymbol{\hat\beta}$, we define the residual vector of the system as $\boldsymbol{e}=\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat\beta}$. The idea of OLS is to find $\boldsymbol{\hat\beta}$ which can minimize the sum of squared residuals (SSR), i.e.,

$$

\begin{equation}\label{eq:3}

\min_{\boldsymbol{\hat\beta}} \quad \boldsymbol{e}’\boldsymbol{e}

\end{equation}

$$

Now the question is how to solve the optimization problem \eqref{eq:2}. First, let’s expand the SSR.

$$

\begin{equation}\label{eq:4}

\begin{split}

\boldsymbol{e}’\boldsymbol{e}&=(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat\beta})‘(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat\beta}) \\

&= \boldsymbol{y}’\boldsymbol{y}-2\hat{\beta}’\boldsymbol{X}’\boldsymbol{y}+\hat{\beta}’\boldsymbol{X}’\boldsymbol{X}\hat{\boldsymbol{\beta}}

\end{split}

\end{equation}

$$

The first and second order derivatives are calculated as follows

$$

\begin{equation}\label{eq:5}

\begin{cases}

\frac{\partial \boldsymbol{e}’\boldsymbol{e}}{\partial \boldsymbol{\hat\beta}}

=-2\boldsymbol{X}’\boldsymbol{y}+2\boldsymbol{X}’\boldsymbol{X}\boldsymbol{\hat\beta}\\

\frac{\partial^2 \boldsymbol{e}’\boldsymbol{e}}{\partial \boldsymbol{\hat\beta} \partial \boldsymbol{\hat\beta}’}

=2\boldsymbol{X}’\boldsymbol{X}.

\end{cases}

\end{equation}

$$

We see that the second order derivative is positive semidefinite which implies the SSR in OLS is a convex function (see section 3.1.4^{5}) and for an unconstrained convex optimization problem, the necessary as well as sufficient condition for optimality is that the first order derivative equals 0 (see section 4.2.3^{5}). Optimization of convex function is very important in machine learning. Actually, the parameter estimations of many machine learning models are convex optimization problems.

Based on the analysis above, the solution of \eqref{eq:3} is given in \eqref{eq:6}.

$$

\begin{equation}\label{eq:6}

\boldsymbol{\hat\beta}=(\boldsymbol{X}’\boldsymbol{X})^{-1}\boldsymbol{X}’\boldsymbol{y}

\end{equation}

$$

Now it seems we are ready to write our own linear regression model in R/Python. The solution in \eqref{eq:6} involves matrix transportation, multiplication and inversion, all of which are supported in both R and Python. In Python, we can use the `numpy`

module for the matrix operations.

However, in practice we don’t solve linear regression with \eqref{eq:6} directly. Why? Let’s see an example with

$$

x=\begin{bmatrix}

1e+6 & -1 \\

-1 & 1e-6

\end{bmatrix}.

$$

The R code above throws an error because of the singularity of $\boldsymbol{X}’\boldsymbol{X}$. It’s interesting that the corresponding Python code doesn’t behave in the same way as R, which has been reported as an issue on github^{6}.

When the matrix $\boldsymbol{X}’\boldsymbol{X}$ is singular, how to solve the OLS problem? In this book, we would focus on the QR decomposition based solution. Singular value decomposition (SVD) can also be used to solve OLS, which would not be covered in this book.

In linear algebra, a QR decomposition^{7} of matrix $\boldsymbol{X}$ would factorize $\boldsymbol{X}$ into a product, i.e., $\boldsymbol{X}=\boldsymbol{Q}\boldsymbol{R}$ where $\boldsymbol{Q}$ are orthogonal matrices and $\boldsymbol{R}$ is an upper triangular matrix. Since the matrix $\boldsymbol{Q}$ is orthogonal ($\boldsymbol{Q}’=\boldsymbol{Q}^{-1}$), we have

$$

\begin{equation}

\begin{split}

\boldsymbol{\hat\beta}&=(\boldsymbol{X}’\boldsymbol{X})^{-1}\boldsymbol{X}’\boldsymbol{y} \\

& = (\boldsymbol{R}’\boldsymbol{Q}’\boldsymbol{Q}\boldsymbol{R})^{-1}\boldsymbol{R}’\boldsymbol{Q}’\boldsymbol{y} \\

& = (\boldsymbol{R}’\boldsymbol{R})^{-1}\boldsymbol{R}’\boldsymbol{Q}’\boldsymbol{y} \\

& = \boldsymbol{R}^{-1}\boldsymbol{Q}’\boldsymbol{y}

\end{split}

\label{eq:7}

\end{equation}

$$

Now we are ready to write our simple R/Python functions for linear regression with the help of QR decomposition according to $\eqref{eq:7}$.

chapter4/qr_solver.R

chapter4/qr_solver.py

Of course, we don’t need to implement our own OLS solvers in a production environment; but if you do, still you may find some well-written and well-tested functions such as `np.linalg.lstsq`

to save your time and effort from doing it from scratch.

Ok, now we have finished the training part of a linear regression model in both R and Python. After we train a model we want to use it, i.e., to make predictions based on the model. For most of machine learning models, training is much more complex than prediction (Exceptions include Lazy-learning models such as KNN). Let’s continue developing our own linear regression model by including the prediction function and enclose everything in an object.

chapter4/linear_regression.R

chapter4/linear_regression.py

Now let’s try to use our linear regression model to solve a real regression problem with the Boston dataset^{8}, and check the results.

The results from our own linear regression models are almost identical to the results from `lm()`

function or the `sklearn.linear_model`

module, which means we have done a great job so far.

I’m not a big fan of applying hypothesis testing in data science or machine learning. But sometimes it is irreplaceable, or at least useful, for example in the Design of Experiment (DOE).

Most of applications of hypothesis testing in machine learning models are about feature/variable selections. There are lots of debates on whether hypothesis-testings-based feature selections are good or bad. The major criticism of such approach is that the entire process is built on the data used for model training and the model performance on testing data is not considered at all.

I think it is still worth giving a brief introduction of hypothesis testing in linear regression, as it is still popular among data scientists with statistician’s mindset. I would assume the readers already have basic ideas of hypothesis-testing, p-value, significance level.

If you have done linear regressions using a computer software (R, Stata, SPSS, Minitab etc.), you may have noticed that the outputs of these softwares contain the p-values^{9} and t-statistics of the coefficient of each variable. If the p-value is less than a pre-determined significance level (usually 0.1 or 0.05 are used in practice),

the null hypothesis (always denoted as $H_0$) should be rejected. The hypothesis against the null hypothesis is called alternative hypothesis (denoted as $H_1$). An example of $H_0$ and $H_1$ regarding model \eqref{eq:2} could be stated as:

$$

\begin{equation}

\begin{split}

H_0: \boldsymbol{\beta}_1&=0 \\

H_1:\boldsymbol{\beta}_1 &\neq0.

\end{split}

\label{eq:8}

\end{equation}

$$

If the p-value of $\boldsymbol{\beta}_1$ suggests not to reject $H_0$, we may exclude the corresponding feature and re-fit the linear model.

The theory behind the hypothesis testing in \eqref{eq:8} is not complex. But we have to make an additional assumption for our linear regression model. More specifically, we assume the error term $\boldsymbol{\epsilon}$ follows a multivariate normal distribution, i.e., $\boldsymbol{\epsilon}{\sim}N(\boldsymbol{0},\sigma^2\boldsymbol{I})$. Based on this assumption, we could conclude that $\boldsymbol{\hat\beta}$ also follows a multivariate normal distribution. Furthermore, we can construct a test statistic^{10} containing $\boldsymbol{\hat\beta_j}$ which follows a t-distribution^{11}. The additional assumption we made for hypothesis testing is not required for least square estimation. It is also not one of assumptions for the Gauss-Markov theorem^{12}.

The example given in \eqref{eq:8} focuses on a single coefficient. Actually, it is a special case of a more general hypothesis testing, which is called linear hypothesis testing. In linear hypothesis testing, people are interested in a linear combination of the coefficients, i.e.,

$$

\begin{equation}

\begin{split}

H_0: \boldsymbol{A}\boldsymbol{\beta}+\boldsymbol{b}&=0\\

H_1: \boldsymbol{A}\boldsymbol{\beta} +\boldsymbol{b}&\neq0.

\end{split}

\label{eq:9}

\end{equation}

$$

The linear hypothesis testing is usually conducted by constructing a F-distributed test statistic. The general hypothesis testings is very powerful. For example, we may start from a full model with a larger list of variables to train a linear regression model; we may also exclude some variables from the full list to train a reduced model. By setting proper values of $\boldsymbol{A}$ and $\boldsymbol{b}$ in \eqref{eq:9} we can conduct a linear hypothesis testing to accept or reject the reduced model. For example, if the full model involves three variables and in the reduced model we only keep the first variable, we will set

$$\boldsymbol{A}=\begin{bmatrix}

0 & 0 & 0 \\

0 & 1 & 0 \\

0 & 0 & 1

\end{bmatrix}

$$

and

$$\boldsymbol{b}=\begin{bmatrix}

0 \\

0 \\

0 \\

\end{bmatrix}.

$$

What is the problem of solving linear regression model specified by \eqref{eq:3}? There is nothing wrong at all with that approach. But I would like to cite my favorite quote – “Essentially, all models are wrong, but some are useful”^{13}. \eqref{eq:3} provides one solution to model \eqref{eq:1}. There are some alternative solutions, such as lasso regression and ridge regression. In this section, let’s focus on ridge regression.

What is ridge regression? Ridge regression doesn’t change the model itself. Instead, it changes the way to estimate model parameters. By naive OLS, we minimize the SSR directly. In ridge regression, we change the objective function (which is commonly called loss function in machine learning models) by adding an additional penalty

$$

\begin{equation}\label{eq:10}

\min_{\boldsymbol{\hat\beta}} \quad \boldsymbol{e}’\boldsymbol{e} + \lambda\boldsymbol{\beta}’\boldsymbol{\beta}.

\end{equation}

$$

The optimization problem \eqref{eq:10} is still an unconstrained optimization problem and it is convex. It can also be formulated as a constrained convex optimization problem as

$$

\begin{equation}

\begin{split}

\min_{\boldsymbol{\hat\beta}}& \quad \boldsymbol{e}’\boldsymbol{e}\\

subject\quad to\quad &\boldsymbol{\beta}’\boldsymbol{\beta}\leq{t}.

\end{split}

\label{eq:11}

\end{equation}

$$

The theory behind ridge regression can be found from The Elements of Statistical Learning^{1}. Let’s turn our attention to the implementation of ridge regression. The solution to \eqref{eq:11} can be obtained in the same way as the solution to \eqref{eq:3}, i.e.,

$$

\begin{equation}\label{eq:12}

\boldsymbol{\hat\beta}=(\boldsymbol{X}’\boldsymbol{X}+\lambda\boldsymbol{I})^{-1}\boldsymbol{X}’\boldsymbol{y}.

\end{equation}

$$

Again, in practice we don’t use \eqref{eq:12} to implement ridge regression for the same reasons that we don’t use \eqref{eq:6} to solve linear regression without penalty.

Actually, we don’t need new techniques to solve \eqref{eq:10}. Let’s make some transformation on the objective function in \eqref{eq:10}:

$$

\begin{equation}

\begin{split}

&\boldsymbol{e}’\boldsymbol{e} + \lambda\boldsymbol{\beta}’\boldsymbol{\beta}=\sum_{i=1}^{n}(\boldsymbol{y}_i-\boldsymbol{x}’_i\boldsymbol{\beta})^2+\sum_{i=1}^{p}(0-\sqrt{\lambda}\boldsymbol{\beta}_i)^2 ,

\end{split}

\label{eq:13}

\end{equation}

$$

where $\boldsymbol{x}_i=[1,\boldsymbol{X}_{1i},\boldsymbol{X}_{2i},…,\boldsymbol{X}_{pi}]’$.

Let’s define an augmented data set:

$$

\begin{equation}\label{eq:14}

\boldsymbol{X}_\lambda=

\begin{bmatrix}

1 & \boldsymbol{X}_{11} & \boldsymbol{X}_{21} & \cdots & \boldsymbol{X}_{p1} \\

1 & \boldsymbol{X}_{12} & \boldsymbol{X}_{22} & \cdots & \boldsymbol{X}_{p2} \\

\vdots & \vdots & \vdots & \vdots & \vdots \\

1 & \boldsymbol{X}_{1n} & \boldsymbol{X}_{2n} & \cdots & \boldsymbol{X}_{pn} \\

\sqrt{\lambda} & 0 & 0 & \cdots & 0\\

0 & \sqrt{\lambda} & 0 & \cdots & 0\\

0 & 0 & \sqrt{\lambda} & \cdots & 0\\

\vdots & \vdots & \vdots & \ddots & \vdots\\

0 & 0 & 0 & 0 & \sqrt{\lambda}\\

\end{bmatrix} , \quad

\boldsymbol{y}_{\lambda} =

\begin{bmatrix}

\boldsymbol{y}_1 \\

\boldsymbol{y}_2 \\

\vdots \\

\boldsymbol{y}_n\\

0\\

0\\

0\\

\vdots\\

0

\end{bmatrix}.

\end{equation}

$$

If we regress $\boldsymbol{y}_{\lambda}$ on $\boldsymbol{X}_\lambda$, the OLS solution is just what we are looking after. However, usually the penalty is not applied on the intercept. Thus, we modify $\boldsymbol{y}_{\lambda}$ and $\boldsymbol{X}_\lambda$ to

$$

\begin{equation}\label{eq:15}

\boldsymbol{X}_\lambda=

\begin{bmatrix}

1 & \boldsymbol{X}_{11} & \boldsymbol{X}_{21} & \cdots & \boldsymbol{X}_{p1} \\

1 & \boldsymbol{X}_{12} & \boldsymbol{X}_{22} & \cdots & \boldsymbol{X}_{p2} \\

\vdots & \vdots & \vdots & \vdots & \vdots \\

1 & \boldsymbol{X}_{1n} & \boldsymbol{X}_{2n} & \cdots & \boldsymbol{X}_{pn} \\

0 & \sqrt{\lambda} & 0 & \cdots & 0\\

0 & 0 & \sqrt{\lambda} & \cdots & 0\\

\vdots & \vdots & \vdots & \ddots & \vdots\\

0 & 0 & 0 & 0 & \sqrt{\lambda}\\

\end{bmatrix} , \quad

\boldsymbol{y}_{\lambda} =

\begin{bmatrix}

\boldsymbol{y}_1 \\

\boldsymbol{y}_2 \\

\vdots \\

\boldsymbol{y}_n\\

0\\

0\\

\vdots\\

0

\end{bmatrix}.

\end{equation}

$$

Now we are ready to implement our own ridge regression model based on the description above. It is also common to normalize^{14} the independent variables before applying ridge regression to make different variables in the same order of magnitude.

chapter4/linear_regression_ridge.R

chapter4/linear_regression_ridge.py

In R, we implement our own scaler; but in the python implementation, we use `StandardScaler`

from `sklearn.preprocessing`

module, which is very handy. Please pay attention to how we calculation the standard deviation^{15} in R – we are using the formula of population standard deviation rather than the formula of sample standard deviation. Actually using which formula to calculate the standard deviation doesn’t matter at all. I made the choice to use population standard deviation formula in order to generate consistent result of the `StandardScaler`

since in `StandardScaler`

uses the formula of population standard deviation.

The selection of best $\lambda$ requires solving the OLS problem repeatedly with different values of $\lambda$, which implies the QR decomposition procedure would be called multiple times. Using SVD decomposition could be more efficient in terms of selection of best $\lambda$. But it wouldn’t be covered in the current version of this book.

We are ready to run our own ridge regression on the Boston dataset.

It’s excited to see the outputs from R and Python are quite consistent.

Linear regression is not as simple as it seems. To learn more about it, I recommend the book Econometric Analysis^{2}.

^{1} Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer series in statistics New York, NY, USA:, second edition, 2009.

^{2} William H Greene. Econometric analysis. Pearson Education India, 2003.

^{3} https://en.wikipedia.org/wiki/Ordinary_least_squares

^{4} https://en.wikipedia.org/wiki/Maximum_likelihood_estimation

^{5} Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

^{6} https://github.com/numpy/numpy/issues/10471

^{7} https://en.wikipedia.org/wiki/QR_decomposition

^{8} https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

^{9} https://www.statsdirect.com/help/basics/p_values.htm

^{10} https://en.wikipedia.org/wiki/Test_statistic

^{11} https://en.wikipedia.org/wiki/Student\%27s_t-distribution

^{12} https://en.wikipedia.org/wiki/Gauss-Markov_theorem

^{13} https://en.wikipedia.org/wiki/All_models_are_wrong

^{14} https://en.wikipedia.org/wiki/Normalization_(statistics)