Another Book on Data Science - Linear Regression

Sections in this Chapter:

Basics of linear regression
Linear hypothesis testing
Ridge regression

There are numerous books ¹ ² available on the theory of linear regression. What is the purpose to write a chapter about these models in another book of data science? Many audience would be interested in how to implement their own regression models rather than using the off-the-shelf software packages. By the end of this chapter, we will build up our own linear regression models in R/Python. We would also see how we would reuse some functions in different regressions, such as linear regression, and these regressions with L2 penalties.

Basics of linear regression

We start from the matrix form of linear regression.

$$
\begin{equation}
\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon},
\label{eq:1}
\end{equation}
$$

where the column vector $\boldsymbol{y}$ contains $n$ observations on the dependent variable, $\boldsymbol{X}$ is a $n\times (p+1)$ matrix ($n>p$) of independent variables with constant vector $\boldsymbol{1}$ in the first column, $\boldsymbol{\beta}$ is a column vector of unknown population parameters to estimate based on the data, and $\boldsymbol{\epsilon}$ is the error term (or noise). For the sake of illustration, \eqref{eq:1} can be extended as in

$$
\begin{equation}\label{eq:2}
\begin{bmatrix}
\boldsymbol{y}_1 \\
\boldsymbol{y}_2 \\
\vdots \\
\boldsymbol{y}_n
\end{bmatrix}
=
\begin{bmatrix}
1 & \boldsymbol{X}_{11} & \boldsymbol{X}_{21} & \cdots & \boldsymbol{X}_{p1} \\
1 & \boldsymbol{X}_{12} & \boldsymbol{X}_{22} & \cdots & \boldsymbol{X}_{p2} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & \boldsymbol{X}_{1n} & \boldsymbol{X}_{2n} & \cdots & \boldsymbol{X}_{pn}
\end{bmatrix}
\begin{bmatrix}
\boldsymbol{\beta}_0 \\
\boldsymbol{\beta}_1 \\
\vdots \\
\boldsymbol{\beta}_p
\end{bmatrix}
+
\begin{bmatrix}
\boldsymbol{\epsilon}_1 \\
\boldsymbol{\epsilon}_2 \\
\vdots \\
\boldsymbol{\epsilon}_n
\end{bmatrix}
\end{equation}
$$

We apply ordinary least squares (OLS)³ approach to estimate the model parameter $\boldsymbol{\beta}$ since it requires fewer assumptions than other estimation methods such as maximum likelihood estimation⁴. Suppose the estimated model parameter is denoted as $\boldsymbol{\hat\beta}$, we define the residual vector of the system as $\boldsymbol{e}=\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat\beta}$. The idea of OLS is to find $\boldsymbol{\hat\beta}$ which can minimize the sum of squared residuals (SSR), i.e.,

$$
\begin{equation}\label{eq:3}
\min_{\boldsymbol{\hat\beta}} \quad \boldsymbol{e}’\boldsymbol{e}
\end{equation}
$$

Now the question is how to solve the optimization problem \eqref{eq:2}. First, let’s expand the SSR.

$$
\begin{equation}\label{eq:4}
\begin{split}
\boldsymbol{e}’\boldsymbol{e}&=(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat\beta})‘(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat\beta}) \\
&= \boldsymbol{y}’\boldsymbol{y}-2\hat{\beta}’\boldsymbol{X}’\boldsymbol{y}+\hat{\beta}’\boldsymbol{X}’\boldsymbol{X}\hat{\boldsymbol{\beta}}
\end{split}
\end{equation}
$$

The first and second order derivatives are calculated as follows

$$
\begin{equation}\label{eq:5}
\begin{cases}
\frac{\partial \boldsymbol{e}’\boldsymbol{e}}{\partial \boldsymbol{\hat\beta}}
=-2\boldsymbol{X}’\boldsymbol{y}+2\boldsymbol{X}’\boldsymbol{X}\boldsymbol{\hat\beta}\\
\frac{\partial^2 \boldsymbol{e}’\boldsymbol{e}}{\partial \boldsymbol{\hat\beta} \partial \boldsymbol{\hat\beta}’}
=2\boldsymbol{X}’\boldsymbol{X}.
\end{cases}
\end{equation}
$$

We see that the second order derivative is positive semidefinite which implies the SSR in OLS is a convex function (see section 3.1.4⁵) and for an unconstrained convex optimization problem, the necessary as well as sufficient condition for optimality is that the first order derivative equals 0 (see section 4.2.3⁵). Optimization of convex function is very important in machine learning. Actually, the parameter estimations of many machine learning models are convex optimization problems.

Based on the analysis above, the solution of \eqref{eq:3} is given in \eqref{eq:6}.

$$
\begin{equation}\label{eq:6}
\boldsymbol{\hat\beta}=(\boldsymbol{X}’\boldsymbol{X})^{-1}\boldsymbol{X}’\boldsymbol{y}
\end{equation}
$$

Now it seems we are ready to write our own linear regression model in R/Python. The solution in \eqref{eq:6} involves matrix transportation, multiplication and inversion, all of which are supported in both R and Python. In Python, we can use the numpy module for the matrix operations.

However, in practice we don’t solve linear regression with \eqref{eq:6} directly. Why? Let’s see an example with

$$
x=\begin{bmatrix}
1e+6 & -1 \\
-1 & 1e-6
\end{bmatrix}.
$$

1 > x=array(c(10^6, -1, -1, 10^-6), c(2,2))
2 > solve(t(x) %*% x) # solve() would return the inverse matrix
3 Error in solve.default(t(x) %*% x) : 
4   system is computationally singular: reciprocal condition number = 2.22044e-28\end{rl}
5 }

Python

1 >>> import numpy as np
2 >>> x=np.array([[1e6,-1], [-1,1e-6]])
3 >>> np.linalg.inv( np.dot(x.transpose(), x))
4 array([[4.50359963e+03, 4.50359963e+09],
5        [4.50359963e+09, 4.50359963e+15]])

The R code above throws an error because of the singularity of $\boldsymbol{X}’\boldsymbol{X}$. It’s interesting that the corresponding Python code doesn’t behave in the same way as R, which has been reported as an issue on github⁶.

When the matrix $\boldsymbol{X}’\boldsymbol{X}$ is singular, how to solve the OLS problem? In this book, we would focus on the QR decomposition based solution. Singular value decomposition (SVD) can also be used to solve OLS, which would not be covered in this book.

In linear algebra, a QR decomposition⁷ of matrix $\boldsymbol{X}$ would factorize $\boldsymbol{X}$ into a product, i.e., $\boldsymbol{X}=\boldsymbol{Q}\boldsymbol{R}$ where $\boldsymbol{Q}$ are orthogonal matrices and $\boldsymbol{R}$ is an upper triangular matrix. Since the matrix $\boldsymbol{Q}$ is orthogonal ($\boldsymbol{Q}’=\boldsymbol{Q}^{-1}$), we have

$$
\begin{equation}
\begin{split}
\boldsymbol{\hat\beta}&=(\boldsymbol{X}’\boldsymbol{X})^{-1}\boldsymbol{X}’\boldsymbol{y} \\
& = (\boldsymbol{R}’\boldsymbol{Q}’\boldsymbol{Q}\boldsymbol{R})^{-1}\boldsymbol{R}’\boldsymbol{Q}’\boldsymbol{y} \\
& = (\boldsymbol{R}’\boldsymbol{R})^{-1}\boldsymbol{R}’\boldsymbol{Q}’\boldsymbol{y} \\
& = \boldsymbol{R}^{-1}\boldsymbol{Q}’\boldsymbol{y}
\end{split}
\label{eq:7}
\end{equation}
$$

Now we are ready to write our simple R/Python functions for linear regression with the help of QR decomposition according to $\eqref{eq:7}$.

chapter4/qr_solver.R

1 qr_solver=function(x,y){
2   qr.coef(qr(x),y)
3 }

Python

chapter4/qr_solver.py

1 import numpy as np
2 
3 def qr_solver(x,y):
4   q,r=np.linalg.qr(x)
5   p = np.dot(q.T,y)
6   return np.dot(np.linalg.inv(r),p)

Of course, we don’t need to implement our own OLS solvers in a production environment; but if you do, still you may find some well-written and well-tested functions such as np.linalg.lstsq to save your time and effort from doing it from scratch.

Ok, now we have finished the training part of a linear regression model in both R and Python. After we train a model we want to use it, i.e., to make predictions based on the model. For most of machine learning models, training is much more complex than prediction (Exceptions include Lazy-learning models such as KNN). Let’s continue developing our own linear regression model by including the prediction function and enclose everything in an object.

chapter4/linear_regression.R

 1 library(R6)
 2 LR = R6Class(
 3   "LR",
 4   public = list(
 5     coef = NULL,
 6     initialize = function() {
 7       
 8     },
 9     fit = function(x, y) {
10       self$qr_solver(cbind(1, x), y)
11     },
12     qr_solver = function(x, y) {
13       self$coef = qr.coef(qr(x), y)
14     },
15     predict = function(new_x) {
16       cbind(1, new_x) %*% self$coef
17     }
18   )
19 )

Python

chapter4/linear_regression.py

 1 import numpy as np
 2 
 3 class LR:
 4     def __init__(self):
 5         self.coef = None
 6 
 7     def qr_solver(self, x, y):
 8         q, r = np.linalg.qr(x)
 9         p = np.dot(q.T, y)
10         return np.dot(np.linalg.inv(r), p)
11 
12     def fit(self, x, y):
13         self.coef = self.qr_solver(np.hstack((np.ones((x.shape[0], 1)), x)), y)
14 
15     def predict(self, x):
16         return np.dot(np.hstack((np.ones((x.shape[0], 1)), x)), self.coef)

Now let’s try to use our linear regression model to solve a real regression problem with the Boston dataset⁸, and check the results.

 1 > source('linear_regression.R')
 2 > 
 3 > library(MASS) # load Boston data from this package
 4 > 
 5 > lr = LR$new()
 6 > # -i means excluding the ith column from the data.frame
 7 > lr$fit(data.matrix(Boston[,-ncol(Boston)]),Boston$medv)
 8 > print(lr$coef)
 9                        crim            zn         indus          chas 
10  3.645949e+01 -1.080114e-01  4.642046e-02  2.055863e-02  2.686734e+00 
11           nox            rm           age           dis           rad 
12 -1.776661e+01  3.809865e+00  6.922246e-04 -1.475567e+00  3.060495e-01 
13           tax       ptratio         black         lstat 
14 -1.233459e-02 -9.527472e-01  9.311683e-03 -5.247584e-01 
15 > # let's make prediction on the same data
16 > pred=lr$predict(data.matrix(Boston[,-ncol(Boston)]))
17 > print(pred[1:5])
18 [1] 30.00384 25.02556 30.56760 28.60704 27.94352
19 > 
20 > # compare it with the R built-in linear regression model
21 > rlm = lm(medv ~ ., data=Boston)
22 > print(rlm$coef)
23   (Intercept)          crim            zn         indus          chas 
24  3.645949e+01 -1.080114e-01  4.642046e-02  2.055863e-02  2.686734e+00 
25           nox            rm           age           dis           rad 
26 -1.776661e+01  3.809865e+00  6.922246e-04 -1.475567e+00  3.060495e-01 
27           tax       ptratio         black         lstat 
28 -1.233459e-02 -9.527472e-01  9.311683e-03 -5.247584e-01 
29 > print(rlm$fitted[1:5])
30        1        2        3        4        5 
31 30.00384 25.02556 30.56760 28.60704 27.94352

Python

 1 >>> from sklearn.datasets import load_boston
 2 >>> from linear_regression import LR
 3 >>> boston = load_boston()
 4 >>> X, y = boston.data, boston.target
 5 >>> # first, let's run our own linear regression
 6 ... lr = LR()
 7 >>> lr.fit(X, y)
 8 >>> print(lr.coef)
 9 [ 3.64594884e+01 -1.08011358e-01  4.64204584e-02  2.05586264e-02
10   2.68673382e+00 -1.77666112e+01  3.80986521e+00  6.92224640e-04
11  -1.47556685e+00  3.06049479e-01 -1.23345939e-02 -9.52747232e-01
12   9.31168327e-03 -5.24758378e-01]
13 >>> print(lr.predict(X)[:5])
14 [30.00384338 25.02556238 30.56759672 28.60703649 27.94352423]
15 >>>
16 >>> # now, use sklearn's linear regression model
17 ... from sklearn.linear_model import LinearRegression
18 >>> reg = LinearRegression().fit(X, y)
19 >>> print(reg.coef_)
20 [-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
21  -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
22   3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
23  -5.24758378e-01]
24 >>> print(reg.predict(X)[:5])
25 [30.00384338 25.02556238 30.56759672 28.60703649 27.94352423]

The results from our own linear regression models are almost identical to the results from lm() function or the sklearn.linear_model module, which means we have done a great job so far.

Linear hypothesis testing

I’m not a big fan of applying hypothesis testing in data science or machine learning. But sometimes it is irreplaceable, or at least useful, for example in the Design of Experiment (DOE).

Most of applications of hypothesis testing in machine learning models are about feature/variable selections. There are lots of debates on whether hypothesis-testings-based feature selections are good or bad. The major criticism of such approach is that the entire process is built on the data used for model training and the model performance on testing data is not considered at all.

I think it is still worth giving a brief introduction of hypothesis testing in linear regression, as it is still popular among data scientists with statistician’s mindset. I would assume the readers already have basic ideas of hypothesis-testing, p-value, significance level.

If you have done linear regressions using a computer software (R, Stata, SPSS, Minitab etc.), you may have noticed that the outputs of these softwares contain the p-values⁹ and t-statistics of the coefficient of each variable. If the p-value is less than a pre-determined significance level (usually 0.1 or 0.05 are used in practice),
the null hypothesis (always denoted as $H_0$) should be rejected. The hypothesis against the null hypothesis is called alternative hypothesis (denoted as $H_1$). An example of $H_0$ and $H_1$ regarding model \eqref{eq:2} could be stated as:

$$
\begin{equation}
\begin{split}
H_0: \boldsymbol{\beta}_1&=0 \\
H_1:\boldsymbol{\beta}_1 &\neq0.
\end{split}
\label{eq:8}
\end{equation}
$$

If the p-value of $\boldsymbol{\beta}_1$ suggests not to reject $H_0$, we may exclude the corresponding feature and re-fit the linear model.

The theory behind the hypothesis testing in \eqref{eq:8} is not complex. But we have to make an additional assumption for our linear regression model. More specifically, we assume the error term $\boldsymbol{\epsilon}$ follows a multivariate normal distribution, i.e., $\boldsymbol{\epsilon}{\sim}N(\boldsymbol{0},\sigma^2\boldsymbol{I})$. Based on this assumption, we could conclude that $\boldsymbol{\hat\beta}$ also follows a multivariate normal distribution. Furthermore, we can construct a test statistic¹⁰ containing $\boldsymbol{\hat\beta_j}$ which follows a t-distribution¹¹. The additional assumption we made for hypothesis testing is not required for least square estimation. It is also not one of assumptions for the Gauss-Markov theorem¹².

The example given in \eqref{eq:8} focuses on a single coefficient. Actually, it is a special case of a more general hypothesis testing, which is called linear hypothesis testing. In linear hypothesis testing, people are interested in a linear combination of the coefficients, i.e.,

$$
\begin{equation}
\begin{split}
H_0: \boldsymbol{A}\boldsymbol{\beta}+\boldsymbol{b}&=0\\
H_1: \boldsymbol{A}\boldsymbol{\beta} +\boldsymbol{b}&\neq0.
\end{split}
\label{eq:9}
\end{equation}
$$

The linear hypothesis testing is usually conducted by constructing a F-distributed test statistic. The general hypothesis testings is very powerful. For example, we may start from a full model with a larger list of variables to train a linear regression model; we may also exclude some variables from the full list to train a reduced model. By setting proper values of $\boldsymbol{A}$ and $\boldsymbol{b}$ in \eqref{eq:9} we can conduct a linear hypothesis testing to accept or reject the reduced model. For example, if the full model involves three variables and in the reduced model we only keep the first variable, we will set

$$\boldsymbol{A}=\begin{bmatrix}
0 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
$$
and
$$\boldsymbol{b}=\begin{bmatrix}
0 \\
0 \\
0 \\
\end{bmatrix}.
$$

Ridge regression

What is the problem of solving linear regression model specified by \eqref{eq:3}? There is nothing wrong at all with that approach. But I would like to cite my favorite quote – “Essentially, all models are wrong, but some are useful”¹³. \eqref{eq:3} provides one solution to model \eqref{eq:1}. There are some alternative solutions, such as lasso regression and ridge regression. In this section, let’s focus on ridge regression.

What is ridge regression? Ridge regression doesn’t change the model itself. Instead, it changes the way to estimate model parameters. By naive OLS, we minimize the SSR directly. In ridge regression, we change the objective function (which is commonly called loss function in machine learning models) by adding an additional penalty

$$
\begin{equation}\label{eq:10}
\min_{\boldsymbol{\hat\beta}} \quad \boldsymbol{e}’\boldsymbol{e} + \lambda\boldsymbol{\beta}’\boldsymbol{\beta}.
\end{equation}
$$

The optimization problem \eqref{eq:10} is still an unconstrained optimization problem and it is convex. It can also be formulated as a constrained convex optimization problem as

$$
\begin{equation}
\begin{split}
\min_{\boldsymbol{\hat\beta}}& \quad \boldsymbol{e}’\boldsymbol{e}\\
subject\quad to\quad &\boldsymbol{\beta}’\boldsymbol{\beta}\leq{t}.
\end{split}
\label{eq:11}
\end{equation}
$$

The theory behind ridge regression can be found from The Elements of Statistical Learning¹. Let’s turn our attention to the implementation of ridge regression. The solution to \eqref{eq:11} can be obtained in the same way as the solution to \eqref{eq:3}, i.e.,

$$
\begin{equation}\label{eq:12}
\boldsymbol{\hat\beta}=(\boldsymbol{X}’\boldsymbol{X}+\lambda\boldsymbol{I})^{-1}\boldsymbol{X}’\boldsymbol{y}.
\end{equation}
$$

Again, in practice we don’t use \eqref{eq:12} to implement ridge regression for the same reasons that we don’t use \eqref{eq:6} to solve linear regression without penalty.

Actually, we don’t need new techniques to solve \eqref{eq:10}. Let’s make some transformation on the objective function in \eqref{eq:10}:

$$
\begin{equation}
\begin{split}
&\boldsymbol{e}’\boldsymbol{e} + \lambda\boldsymbol{\beta}’\boldsymbol{\beta}=\sum_{i=1}^{n}(\boldsymbol{y}_i-\boldsymbol{x}’_i\boldsymbol{\beta})^2+\sum_{i=1}^{p}(0-\sqrt{\lambda}\boldsymbol{\beta}_i)^2 ,
\end{split}
\label{eq:13}
\end{equation}
$$

where $\boldsymbol{x}_i=[1,\boldsymbol{X}_{1i},\boldsymbol{X}_{2i},…,\boldsymbol{X}_{pi}]’$.

Let’s define an augmented data set:

$$
\begin{equation}\label{eq:14}
\boldsymbol{X}_\lambda=
\begin{bmatrix}
1 & \boldsymbol{X}_{11} & \boldsymbol{X}_{21} & \cdots & \boldsymbol{X}_{p1} \\
1 & \boldsymbol{X}_{12} & \boldsymbol{X}_{22} & \cdots & \boldsymbol{X}_{p2} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & \boldsymbol{X}_{1n} & \boldsymbol{X}_{2n} & \cdots & \boldsymbol{X}_{pn} \\
\sqrt{\lambda} & 0 & 0 & \cdots & 0\\
0 & \sqrt{\lambda} & 0 & \cdots & 0\\
0 & 0 & \sqrt{\lambda} & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & 0 & \sqrt{\lambda}\\
\end{bmatrix} , \quad
\boldsymbol{y}_{\lambda} =
\begin{bmatrix}
\boldsymbol{y}_1 \\
\boldsymbol{y}_2 \\
\vdots \\
\boldsymbol{y}_n\\
0\\
0\\
0\\
\vdots\\
0
\end{bmatrix}.
\end{equation}
$$

If we regress $\boldsymbol{y}_{\lambda}$ on $\boldsymbol{X}_\lambda$, the OLS solution is just what we are looking after. However, usually the penalty is not applied on the intercept. Thus, we modify $\boldsymbol{y}_{\lambda}$ and $\boldsymbol{X}_\lambda$ to

$$
\begin{equation}\label{eq:15}
\boldsymbol{X}_\lambda=
\begin{bmatrix}
1 & \boldsymbol{X}_{11} & \boldsymbol{X}_{21} & \cdots & \boldsymbol{X}_{p1} \\
1 & \boldsymbol{X}_{12} & \boldsymbol{X}_{22} & \cdots & \boldsymbol{X}_{p2} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & \boldsymbol{X}_{1n} & \boldsymbol{X}_{2n} & \cdots & \boldsymbol{X}_{pn} \\
0 & \sqrt{\lambda} & 0 & \cdots & 0\\
0 & 0 & \sqrt{\lambda} & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & 0 & \sqrt{\lambda}\\
\end{bmatrix} , \quad
\boldsymbol{y}_{\lambda} =
\begin{bmatrix}
\boldsymbol{y}_1 \\
\boldsymbol{y}_2 \\
\vdots \\
\boldsymbol{y}_n\\
0\\
0\\
\vdots\\
0
\end{bmatrix}.
\end{equation}
$$

Now we are ready to implement our own ridge regression model based on the description above. It is also common to normalize¹⁴ the independent variables before applying ridge regression to make different variables in the same order of magnitude.

chapter4/linear_regression_ridge.R

  3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40

/> class="language-r" data-lang="r"> 1 library(R6) 2 LR_Ridge = R6Class( "LR_Ridge", public = list( coef = NULL, mu = NULL, sd = NULL, lambda = NULL, initialize = function(lambda) { self$lambda = lambda }, scale = function(x) { self$mu = apply(x, 2, mean) self$sd = apply(x, 2, function(e) { sqrt((length(e) - 1) / length(e)) * sd(e) }) }, transform = function(x) { return(t((t(x) - self$mu) / self$sd)) }, fit = function(x, y) { self$scale(x) x_transformed = self$transform(x) x_lambda = rbind(x_transformed, diag(rep(sqrt(self$lambda), ncol(x)))) y_lambda = c(y, rep(0, ncol(x))) self$qr_solver(cbind(c(rep(1, nrow( x )), rep(0, ncol( x ))), x_lambda), y_lambda) }, qr_solver = function(x, y) { self$coef = qr.coef(qr(x), y) }, predict = function(new_x) { new_x_transformed = self$transform(new_x) cbind(rep(1, nrow(new_x)), new_x_transformed) %*% self$coef } ) )

Python

chapter4/linear_regression_ridge.py

 1 import numpy as np
 2 from sklearn.preprocessing import StandardScaler
 3 
 4 
 5 class LR_Ridge:
 6     def __init__(self, l):
 7         self.l = l
 8         self.coef = None
 9         self.scaler = StandardScaler()
10 
11     def qr_solver(self, x, y):
12         q, r = np.linalg.qr(x)
13         p = np.dot(q.T, y)
14         return np.dot(np.linalg.inv(r), p)
15 
16     def fit(self, x, y):
17         x_transformed = self.scaler.fit_transform(x)
18         x_lambda = np.vstack(
19             (x_transformed, np.diag([self.l**0.5]*x.shape[1])))
20         x_lambda = np.hstack(
21             (np.vstack((np.ones((x.shape[0], 1)), np.zeros((x.shape[1], 1)))), x_lambda))
22         y_lambda = np.hstack((y, np.zeros((x.shape[1],))))
23         self.coef = self.qr_solver(x_lambda, y_lambda)
24 
25     def predict(self, x):
26         new_x_transformed = self.scaler.transform(x)
27         new_x_transformed = np.hstack(
28             (np.ones((x.shape[0],1)), new_x_transformed)
29         )
30         return np.dot(new_x_transformed, self.coef)

In R, we implement our own scaler; but in the python implementation, we use StandardScaler from sklearn.preprocessing module, which is very handy. Please pay attention to how we calculation the standard deviation¹⁵ in R – we are using the formula of population standard deviation rather than the formula of sample standard deviation. Actually using which formula to calculate the standard deviation doesn’t matter at all. I made the choice to use population standard deviation formula in order to generate consistent result of the StandardScaler since in StandardScaler uses the formula of population standard deviation.

The selection of best $\lambda$ requires solving the OLS problem repeatedly with different values of $\lambda$, which implies the QR decomposition procedure would be called multiple times. Using SVD decomposition could be more efficient in terms of selection of best $\lambda$. But it wouldn’t be covered in the current version of this book.

We are ready to run our own ridge regression on the Boston dataset.

 1 > source('linear_regression_ridge.R')
 2 > 
 3 > library(MASS) # load Boston data from this package
 4 > 
 5 > # let's try lmabda = 0.5
 6 > ridge = LR_Ridge$new(0.5)
 7 > ridge$fit(data.matrix(Boston[,-ncol(Boston)]),Boston$medv)
 8 > print(ridge$coef)
 9                    crim          zn       indus        chas         nox 
10 22.53280632 -0.92396151  1.07393055  0.12895159  0.68346136 -2.04275750 
11          rm         age         dis         rad         tax     ptratio 
12  2.67854971  0.01627328 -3.09063352  2.62636926 -2.04312573 -2.05646414 
13       black       lstat 
14  0.84905910 -3.73711409 
15 > # let's make prediction on the same data
16 > pred=ridge$predict(data.matrix(Boston[,-ncol(Boston)]))
17 > print(pred[1:5])
18 [1] 30.01652 25.02429 30.56839 28.61521 27.95385

Python

 1 >>> from sklearn.datasets import load_boston
 2 >>> from linear_regression_ridge import LR_Ridge
 3 >>>
 4 >>> boston = load_boston()
 5 >>> X, y = boston.data, boston.target
 6 >>> # first, let's run our own linear regression
 7 ...
 8 >>> ridge = LR_Ridge(0.5)
 9 >>> ridge.fit(X, y)
10 >>> print(ridge.coef)
11 [ 2.25328063e+01 -9.23961511e-01  1.07393055e+00  1.28951591e-01
12   6.83461360e-01 -2.04275750e+00  2.67854971e+00  1.62732755e-02
13  -3.09063352e+00  2.62636926e+00 -2.04312573e+00 -2.05646414e+00
14   8.49059103e-01 -3.73711409e+00]
15 >>> print(ridge.predict(X)[:5])
16 [30.01652397 25.02429359 30.56839459 28.61520864 27.95385422]

It’s excited to see the outputs from R and Python are quite consistent.

Linear regression is not as simple as it seems. To learn more about it, I recommend the book Econometric Analysis².

¹ Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer series in statistics New York, NY, USA:, second edition, 2009.

² William H Greene. Econometric analysis. Pearson Education India, 2003.

³ https://en.wikipedia.org/wiki/Ordinary_least_squares

⁴ https://en.wikipedia.org/wiki/Maximum_likelihood_estimation

⁵ Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

⁶ https://github.com/numpy/numpy/issues/10471

⁷ https://en.wikipedia.org/wiki/QR_decomposition

⁸ https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

⁹ https://www.statsdirect.com/help/basics/p_values.htm

¹⁰ https://en.wikipedia.org/wiki/Test_statistic

¹¹ https://en.wikipedia.org/wiki/Student\%27s_t-distribution

¹² https://en.wikipedia.org/wiki/Gauss-Markov_theorem

¹³ https://en.wikipedia.org/wiki/All_models_are_wrong

¹⁴ https://en.wikipedia.org/wiki/Normalization_(statistics)