Dennis O'Keeffe
4 min read · Sep 30, 2021



This is Day 28 of the #100DaysOfPython challenge.

This post will take the work that was done yesterday in the blog post “First Look At Supervised Learning With Classification” and introduce the concept of training/test sets and output a graph for us to interpret the accuracy of the k-nearest neighbors classifier.

The final code can be found on my GitHub repo okeeffed/measuring-classifier-model-performance.

Prerequisites

  1. Familiarity with the Conda package, dependency and virtual environment manager. A handy additional reference for Conda is the blog post “The Definitive Guide to Conda Environments” on “Towards Data Science”.
  2. Familiarity with JupyterLab. See here for my post on JupyterLab.
  3. These projects also run Python notebooks in VSCode with the Jupyter Notebooks extension. If you do not use VSCode, it is expected that you know how to run notebooks (or can alter the method for what works best for you).
  4. Read “First Look At Supervised Learning With Classification”.

Getting started

Let’s create the measuring-classifier-model-performance directory and install the required packages.
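The original setup commands are not shown here, but a rough sketch (assuming a Conda environment and pip-installed scikit-learn, matplotlib and numpy — the exact package list and environment name are assumptions) looks like this:

```bash
# Create the project directory
mkdir measuring-classifier-model-performance
cd measuring-classifier-model-performance

# Create and activate a Conda environment, then install the packages we need
conda create -n measuring-classifier-model-performance python=3.9 -y
conda activate measuring-classifier-model-performance
pip install scikit-learn matplotlib numpy
```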

At this stage, we will need to bring across our initial code from yesterday’s post.

Bringing the code up to par

In our file docs/measuring-classifier-model-performance, we can add the following:
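The exact starter code comes from yesterday’s post; a minimal sketch, assuming the scikit-learn iris dataset was used for the classification example, looks like this:

```python
# Sketch of the starter code from the previous post.
# Assumption: the iris dataset from scikit-learn; your X and y may differ.
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data    # feature matrix (one row per sample)
y = iris.target  # class labels
```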

The above code was introduced in the previous post. From here on out, we want to create a training and test set for our classifier.

Creating a training and test set

The “training” set is the data that we will use to train our classifier, while the “test” set is held back so that we can measure the accuracy of the classifier on data it has not seen.

We do this by splitting up the entire data set using the train_test_split function. In a new cell, add the following:
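A sketch of that cell, matching the split described below (the variable names follow the assignment order mentioned later):

```python
from sklearn.model_selection import train_test_split

# Split the data: 70% for training, 30% for testing,
# with a reproducible, stratified split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)
```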

In the above code, we are doing the following:

  1. Splitting our data into a test size of 30% and a training size of 70% (as denoted in the kwarg test_size).
  2. Setting the random_state keyword arg to 21. This will ensure that the split is reproducible.
  3. Setting the stratify keyword arg to the y variable. This will ensure that the split is stratified. That is to say, that the ratio of the training set to the test set will be the same for each class.

The function itself returns four numpy.ndarray types in the order we assign X_train, X_test, y_train, y_test.

More information for train_test_split can be found here.

Checking a classifier for fit

For the k-Nearest Neighbors classifier, we need to check how well the model fits our data.

As the value of k increases, the decision boundary becomes smoother. This is known as "a less complex model".

A smaller k gives a more complex model and can lead to overfitting. Overfitting is the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.

If you increase k even more, you can end up underfitting. This is the opposite of overfitting and occurs when a statistical model cannot adequately capture the underlying structure of the data.

There is a sweet spot in the middle that we are aiming for that gives us the best fit.

We can manually inspect this by using the score method and iterating over different values of k.

To see this in action, we will add the following code to a new cell in our Python notebook:
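The original cell is not reproduced here; a sketch that follows the description below (k from 1 to 8, scoring both the training and test sets, then plotting the results) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Candidate values of k to evaluate
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    # Fit a k-NN classifier for this value of k
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    # Record accuracy on both the training and test sets
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Plot testing vs training accuracy against k
plt.title("k-NN: Comparing Testing vs Training accuracy")
plt.plot(neighbors, test_accuracy, label="Testing accuracy")
plt.plot(neighbors, train_accuracy, label="Training accuracy")
plt.legend()
plt.xlabel("Number of neighbors (k)")
plt.ylabel("Accuracy")
plt.show()
```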

The above iterates through possible k values 1 to 8 and plots the accuracy of both the testing and training data on a graph for us to interpret.

The output image looks like so:

Comparing Testing vs Training accuracy

When looking at the graph, we can see that the accuracy of the test set decreases as we increase k after 5. This tells us that we may be experiencing underfitting.

As for k=1, we see the accuracy is quite high, but this is likely a sign of overfitting.

As for the sweet spot, we see that k=3, k=4 and k=5 are the best values for our model, with k=5 looking like the best candidate.

Using our classifier with the determined parameter

The final step is to use our classifier with the determined parameter. In a new cell, we can add some unlabelled data and use our classifier to label it.
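A sketch of that cell, assuming k=5 and the iris feature layout from earlier (the unlabelled sample values here are purely hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Fit the classifier with the parameter we settled on (k=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Hypothetical unlabelled samples, four features each as in the iris data
X_new = np.array([
    [5.4, 3.2, 1.6, 0.4],
    [6.1, 2.8, 4.7, 1.2],
])

# Predict a class label for each unlabelled sample
print(knn.predict(X_new))
```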

Summary

Today’s post demonstrated how to produce a graph to help us search for parameters that produce a good fit for our k-Nearest Neighbors classifier.

We explored how to split our dataset into a training and test set, then produced a graph for us to look at to determine the best value of k for our classifier.

Resources and further reading

  1. Conda
  2. JupyterLab
  3. Jupyter Notebooks
  4. “The Definitive Guide to Conda Environments”
  5. okeeffed/measuring-classifier-model-performance

Photo credit: kelvinhan

Originally posted on my blog. To see new posts without delay, read the posts there and subscribe to my newsletter.

I write content for AWS, Kubernetes, Python, JavaScript and more. To view all the latest content, be sure to visit my blog and subscribe to my newsletter. Follow me on Twitter.
