
This is Day 30 of the #100DaysOfPython challenge.

This post continues on from part one. It breaks down the basics of linear regression, then explains how we can expand on the work we did by applying a train-test split to our dataset.

Source code can be found on my GitHub repo `okeeffed/regression-with-scikit-learn-part-two`.

## Prerequisites

- Familiarity with the Conda package, dependency and virtual environment manager. A handy additional reference for Conda is the blog post “The Definitive Guide to Conda Environments” on “Towards Data Science”.
- Familiarity with JupyterLab. See here for my post on JupyterLab.
- These projects will also run Python notebooks on VSCode with the Jupyter Notebooks extension. If you do not use VSCode, it is expected that you know how to run notebooks (or alter the method for what works best for you).
- Read “Regression With Scikit Learn (Part One)”

## Getting started

Let’s create the `regression-with-scikit-learn-part-two` project by cloning the work we did yesterday. The packages required will be available in our `conda` environment.

If you are unsure on how to activate the `conda` virtual environment, please look to the prerequisites or resources section for links on `conda` fundamentals.

At this stage, the file `docs/linear_regression.ipynb` already exists and we can work off this material.

Before we start, let’s go over the basics of linear regression.

## Linear regression basics

The equation that describes the fitted line is the following:

$$y = ax + b$$

The statement can be broken down into the following:

| Variable/Statement | Description |
| --- | --- |
| `y` | Target variable |
| `x` | Single feature |
| `a`, `b` | Parameters of the model |

To calculate the values of `a` and `b`, we need to define an **error function** (also known as the **cost function** or **loss function**) for any line and choose the line that minimizes the error function.

The aim is to minimize the vertical distance between the fitted line and each data point.

The distance itself is known as the **residual**. Because positive and negative residuals (from data points above and below the line) would cancel each other out, we use the sum of the squares of the residuals.
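To see why squaring matters, here is a minimal sketch using made-up data points. The values and the helper names `residuals` and `sum_squared_residuals` are illustrative assumptions, not part of the original notebook. A flat line through the middle of the data has residuals that sum to zero even though it fits poorly, while the sum of squares tells the two lines apart.

```python
# Made-up data that lies exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def residuals(a, b, xs, ys):
    """Vertical distance from each data point to the line y = a*x + b."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


def sum_squared_residuals(a, b, xs, ys):
    """Sum of the squared residuals for the line y = a*x + b."""
    return sum(r ** 2 for r in residuals(a, b, xs, ys))


# A poor fit: the flat line y = 4. Residuals are -3, -1, 1, 3.
flat_plain_sum = sum(residuals(0.0, 4.0, xs, ys))        # positives and negatives cancel
flat_squared_sum = sum_squared_residuals(0.0, 4.0, xs, ys)

# The true line y = 2x + 1 passes through every point.
true_squared_sum = sum_squared_residuals(2.0, 1.0, xs, ys)

print(flat_plain_sum, flat_squared_sum, true_squared_sum)
```

The plain sum cannot distinguish the flat line from a perfect fit, which is why the squared version is used as the loss.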

This will be our **loss function**, and minimizing it is called Ordinary Least Squares (OLS).

Wikipedia describes OLS as the following:

*OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the given dataset and those predicted by the linear function of the independent variable.*

*Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface — the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.*

To put that into some human speak, the axis of the dependent variable is our `y` axis, so for each data point we measure the vertical distance to the point on the regression line with the same `x` value, square it, and sum the results. The smaller the sum, the better the fit.
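For a single feature, the parameters that minimize the sum of squared residuals have a well-known closed form: the slope is the covariance of `x` and `y` divided by the variance of `x`, and the intercept follows from the means. The sketch below applies that formula to made-up data. The function name `fit_simple_ols` and the data values are assumptions for illustration, and this is not a claim about how any particular library computes the fit.

```python
def fit_simple_ols(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of co-deviations divided by sum of squared x deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


def loss(a, b, xs, ys):
    """Sum of squared residuals for the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Made-up, slightly noisy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

a, b = fit_simple_ols(xs, ys)
print(a, b, loss(a, b, xs, ys))
```

Nudging `a` or `b` away from the returned values should only increase the loss, which is a quick way to sanity-check the result.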

When we call the `fit` method from our `LinearRegression` object, we are actually calculating the parameters of the line by performing OLS under the hood.
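As a small check of that idea, the sketch below fits `LinearRegression` on synthetic points that lie exactly on a known line and reads the parameters back from the fitted object's `coef_` and `intercept_` attributes. The data is made up for illustration and is not the housing data from part one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 3x + 2 (values chosen for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, n_features)
y = 3.0 * X[:, 0] + 2.0                     # shape (n_samples,)

reg = LinearRegression()
reg.fit(X, y)

slope = reg.coef_[0]        # plays the role of `a`
intercept = reg.intercept_  # plays the role of `b`
print(slope, intercept)
```

Because the points sit exactly on the line, the fitted slope and intercept should match the values used to generate them, up to floating-point error.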

## Higher dimensions of linear regression

So far, the examples we have done are working on a dimension that is easily understood with `y` being calculated by one feature on the X-axis (from our example yesterday, this was "Number Of Rooms (feature) vs Value Of House (target variable)").

However, in the real world, we often have more than one feature.

To account for multiple features (or dimensions), our linear regression equation becomes the following, with one coefficient per feature:

$$y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b$$
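With several features, each sample contributes one coefficient-times-feature term per feature plus the intercept, which is the same thing as a matrix-vector product. The sketch below shows this with NumPy. The coefficient values and the data are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical values: 3 samples, 2 features.
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
])
a = np.array([0.5, -1.0])  # one coefficient per feature
b = 2.0                    # intercept

# For each row: y = a_1 * x_1 + a_2 * x_2 + b
y = X @ a + b
print(y)
```

Each row of `X` is one sample and each column is one feature, which is the layout scikit-learn expects for its feature array.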

In application, the Scikit-learn API can help us with this as we pass two arrays to the `fit` method:

- Array with the features.
- Array with the target variable.

Let’s do just that and see how it works.

## Applying the train/test split to our dataset

In our file `docs/linear_regression.ipynb`, we can add the following:
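The original notebook cell is not reproduced here, so what follows is a minimal sketch of the steps rather than the exact code from the repo. It assumes a synthetic three-feature dataset in place of the housing data, and the `test_size=0.3` and `random_state=42` values are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: 200 samples, 3 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_coefs = np.array([2.0, -1.0, 0.5])  # made-up coefficients
y = X @ true_coefs + 4.0 + rng.normal(scale=0.1, size=200)

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training rows only.
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)

# Predict on, and score against, rows the model has not seen.
y_pred = reg_all.predict(X_test)
r_squared = reg_all.score(X_test, y_test)
print(r_squared)
```

Scoring on the held-out rows gives an estimate of how the model behaves on unseen data, which a score computed on the training rows cannot provide.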

The default score method for linear regression is `R squared`. For more details, see the documentation.
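R squared (the coefficient of determination) compares the model's squared error with the squared error of always predicting the mean of the targets. The sketch below computes it by hand on made-up numbers. The function name `r_squared` and the values are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Squared error of always predicting the mean.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

score = r_squared(y_true, y_pred)
print(score)
```

A score of 1 means the predictions match the targets exactly, and a score of 0 means the model does no better than predicting the mean every time.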

Note: You will rarely use Linear Regression out of the box like this. You will almost always want to use regularization. We will dive into this in the next part.

## Summary

Today’s post covered the math that describes the line generated by the linear regression fit.

We then spoke about how this calculation is worked out with more dimensions added into the mix.

Finally, we demonstrated this with a `train_test_split` and `LinearRegression` object.

As noted in the last section, this is not how you would use Linear Regression in practice. You will (almost) always want to use **regularization**.

This will be our topic in tomorrow’s post.

## Resources and further reading

- Conda
- JupyterLab
- Jupyter Notebooks
- “The Definitive Guide to Conda Environments”
- `okeeffed/regression-with-scikit-learn-part-two`
- Ordinary Least Squares (OLS)

*Photo credit: deepakrautela*

*Originally posted on my blog. To see new posts without delay, read the posts there and subscribe to my newsletter.*
