Linear Regression from Scratch

7 minute read

In the past few weeks of my Applied Machine Learning course, we’ve been focusing on linear regression and its extensions. I wanted to continue my exploration of the algorithms by trying to build an implementation from scratch. A main part of this section of the course focused on gradient descent, so this implementation uses gradient descent as the main driver of the algorithm, as opposed to the closed-form solution to linear regression. As I did with K Nearest Neighbors, I built an object-oriented implementation modeled after the SKLearn “fit-predict” paradigm. Here are a few other notes about this project:

  • I am using full gradient descent here. I tried to implement a stochastic version but I was having trouble with the indexing/slicing inside the _gd function. Any tips would be very helpful!

  • The objective function that I used in this is mean squared error. Throughout this section of the course, we became familiar with the gradient of this objective function.

  • I added L2 regularization, which is also known as ridge regression. By setting the “l2_reg” hyperparameter to 0, you can “turn it off”.

  • The learning rate (eta) and regularization hyperparameter are passed at the initial instantiation of the class, opposed to at the “fit” method. Are there pros/cons to each?

  • I didn’t realize exactly how important scaling the data is for this to work. I spent a lot of time trying to figure out why gradient descent wasn’t working. Actually gradient descent was going in the opposite direction! I’ll do a little demonstration at the end to show this. It was a hard lesson to learn, but very valuable at the end of the day.

First, I will import numpy, pandas, matplotlib, and seaborn. Numpy is the foundation for this project, and I’ve used pandas to build a nicer-looking dataframe that holds the history of gradient descent. Matplotlib and seaborn are imported so that I could show some visualizations.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

Now that I have the necessary packages, I can go ahead and create the class.

class AndyLinRegGD(object):

    def __init__(self, eta = 0.001, l2_reg = 0.0):
        self.coef_ = None       # weight vector
        self.bias_ = None       # bias term
        self._theta = None      # augmented weight vector, i.e., bias + weights
        self._eta = eta         # step size for gradient descent
        self.l2_reg = l2_reg    # control parameter for regularization penalty term

        #gradient descent history, will build as DataFrame in later function
        self.history = {"MSE": [],
                       "Updated Weights":[]}

    def _gradient(self, X, y):
        """
        Calculate the gradient of the MSE objective function

        Args:
            X(ndarray):        training data
            y(ndarray):        target data
        Return:
            gradient(ndarray): vector of partial derivatives wrt each weight
        """
        # calc predictions
        predictions = np.dot(X, self._theta)
        # get error for each example
        errors = predictions - y
        #calculate gradient
        gradient = 2 * np.dot(errors, X) / X.shape[0]
        # penalties only for weights
        gradient[1:] += 2 * self.l2_reg * self._theta[1:]
        return gradient

    def score(self, X, y):
        """
        Calculate the Mean Squared Error

        Args:
            X(ndarray):        training data
            y(ndarray):        target data
        Return:
            score(float): mean squared error
        """
        predictions = np.dot(X, self._theta)
        errors = predictions - y
        mse = np.sum(errors**2) / X.shape[0]

        return mse

    def _gd(self, X, y, max_iter):
        """
        Performs full gradient descent

        Args:
            X(ndarray):        training data
            y(ndarray):        target data
            max_iter:          number of times gradient descent happens, passed into fit method
        Return:
            None, but updates self._theta
        """
        # for each interation
        for epoch in range(max_iter):
            # calc mse
            mse = self.score(X, y)
            # store in history dict
            self.history["MSE"].append(mse)

            # calculate gradient
            gradient = self._gradient(X, y)

            # do gradient step, update theta
            self._theta -= self._eta * gradient
            # store in history dict
            self.history['Updated Weights'].append(self._theta)

            if mse < 1e-6:
                break

    def fit(self, X, y, max_iter=1000):
        """
        Utilizes other functions to train model and update theta, coef, and intercept

        Args:
            X(ndarray):        training data
            y(ndarray):        target data
            max_iter:          number of times gradient descent happens
        Return:
            self(object)
        """
        # create array of ones for bias term
        bias_term = np.ones((X.shape[0],1))
        # add the bias term to the data
        X = np.concatenate((bias_term, X), axis = 1)
        # initialize weights
        self._theta = np.random.rand(X.shape[1])
        # perform gradient descent
        self._gd(X, y, max_iter)
        # set intercept from theta
        self.intercept_ = self._theta[0]
        # set coefs from theta
        self.coef_ = self._theta[1:]

        return self

    def predict(self, X):
        """
        Utilizes the fitted model to generate predictions

        Args:
            X(ndarray):        data to predict target value for
        Return:
            predictions(array)
        """
        # create array of ones for bias term
        bias_term = np.ones((X.shape[0],1))
        # add the bias term to the data
        X = np.concatenate((bias_term, X), axis = 1)
        # calc predictions
        predictions = np.dot(X, self._theta)

        return predictions

    def get_history(self):
        """
        Generate DataFrame of gradient descent history

        Args:
            None
        Return:
            history(DataFrame)
        """
        # pass self.history dict into pandas df
        df = pd.DataFrame(self.history)
        # create iterations column from index
        df['Iterations'] = df.index + 1
        # reorder dataframe
        return df[['Iterations','MSE','Updated Weights']]

In order to demonstrate this implementation, I will use the prototypical Boston dataset. I will load the data, split it into train and test, use the standard scaler, and then perform the fitting. I will also output a lineplot so that you can see the progress of gradient descent in action.

import sklearn.datasets
from sklearn.preprocessing import StandardScaler
boston = sklearn.datasets.load_boston()
from sklearn.model_selection import train_test_split

# load the data
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target

# split the data
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3)

# scale the training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

# instantiate the model and fit to the scaled training data
lr = AndyLinRegGD(eta = 0.1, l2_reg = 0.01)
lr.fit(X_train_scaled, y_train, max_iter = 100)

# plot the gradient descent history
hist = lr.get_history()
plt.figure(figsize=(8,8))
sns.lineplot(x = 'Iterations', y = "MSE", data = hist)
<AxesSubplot:xlabel='Iterations', ylabel='MSE'>

Let’s also take a look at the history dataframe for gradient descent.

hist.head(5)
Iterations MSE Updated Weights
0 1 593.982458 [22.38926553220308, -0.870699471124182, 0.9237...
1 2 358.076426 [22.38926553220308, -0.870699471124182, 0.9237...
2 3 234.355188 [22.38926553220308, -0.870699471124182, 0.9237...
3 4 157.144036 [22.38926553220308, -0.870699471124182, 0.9237...
4 5 108.103451 [22.38926553220308, -0.870699471124182, 0.9237...
hist.tail(5)
Iterations MSE Updated Weights
95 96 19.646877 [22.38926553220308, -0.870699471124182, 0.9237...
96 97 19.645203 [22.38926553220308, -0.870699471124182, 0.9237...
97 98 19.643577 [22.38926553220308, -0.870699471124182, 0.9237...
98 99 19.641997 [22.38926553220308, -0.870699471124182, 0.9237...
99 100 19.640460 [22.38926553220308, -0.870699471124182, 0.9237...

Now that we have the model, let’s go ahead and create predictions for the test set and see how it performs

predictions = lr.predict(X_test_scaled)

mse = np.sum((predictions - y_test)**2)/len(y_test)

print("The MSE for the test set is", mse)
The MSE for the test set is 27.919555078006432

There it is! The model is at least functioning how I would expect. Before I end this post, I want to demonstrate how important scaling the data is before you fit the model to it. It took me so long to figure out why gradient descent wasn’t working, only to find out that scaling the data single-handedly fixed the issue. I never realized just how important it was, but for me, gradient descent just was not working at all. Here’s a demo so that you can see what I mean.

#Use the pre-scaled dataset and a new lr object
lr2 = AndyLinRegGD(eta = 0.1, l2_reg = 0.01)
lr2.fit(X_train, y_train)

# plot the gradient descent history, notice the scale on the y-axis
hist2 = lr2.get_history()
plt.figure(figsize=(8,8))
sns.lineplot(x = 'Iterations', y = "MSE", data = hist2)
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:46: RuntimeWarning: overflow encountered in square
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:29: RuntimeWarning: overflow encountered in multiply





<AxesSubplot:xlabel='Iterations', ylabel='MSE'>

hist2.head(10)
Iterations MSE Updated Weights
0 1 1.178545e+05 [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
1 2 4.437804e+14 [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
2 3 1.707692e+24 [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
3 4 6.571943e+33 [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
4 5 2.529172e+43 [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
5 6 9.733367e+52 [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
6 7 3.745827e+62 [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
7 8 1.441559e+72 [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
8 9 5.547752e+81 [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
9 10 2.135019e+91 [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
hist2.tail(5)
Iterations MSE Updated Weights
995 996 NaN [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
996 997 NaN [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
997 998 NaN [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
998 999 NaN [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
999 1000 NaN [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...

Isn’t that wild? I don’t really know what’s happening with the weights but you can see at least in the first ten iterations of gradient descent how the MSE just skyrockets instead of steadily decreasing. I can’t tell you much time I spent pouring over the gradient descent function in this project! Ultimately, now I know how important scaling can be to a linear regression model.

Thank you for taking the time to read through this post. If you notice any errors, inconsistencies, or opportunities to improve the code, please let me know! I want to eventually extend this to an elastic net/stochastic version, so any advice or tips would be very appreciated! Take care!