This is Day 28 of the #100DaysOfPython challenge.
This post builds on yesterday’s blog post, “First Look At Supervised Learning With Classification”. It introduces the concept of training and test sets, then plots a graph that helps us interpret the accuracy of the k-nearest neighbors classifier.
The final code can be found on my GitHub repo.
- Familiarity with the Conda package, dependency and virtual environment manager. A handy additional reference for Conda is the blog post “The Definitive Guide to Conda Environments” on “Towards Data Science”.
- Familiarity with JupyterLab. See here for my post on JupyterLab.
- These projects will also run as Python notebooks in VSCode with the Jupyter Notebooks extension. If you do not use VSCode, it is expected that you know how to run notebooks (or can alter the method to whatever works best for you).
- Read “First Look At Supervised Learning With Classification”.
Let’s create the measuring-classifier-model-performance directory and install the required packages.
At this stage, we will need to bring across our initial code from yesterday’s post.
Bringing the code up to par
In our file docs/measuring-classifier-model-performance, we can add the following:
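As a minimal sketch of that starting point, the cell below assumes the Iris dataset bundled with scikit-learn. The dataset choice, the names iris, X, y and knn, and the n_neighbors value are all assumptions for illustration, so substitute whatever the earlier post used.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the Iris dataset stands in for the data used in the earlier post.
iris = datasets.load_iris()

X = iris.data    # feature matrix, one row per flower
y = iris.target  # integer class labels (0, 1, 2)

# A k-nearest neighbors classifier; n_neighbors=6 is an arbitrary example value.
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)

print(X.shape, y.shape)
```

Note that this fits on the whole dataset, which leaves nothing held back for measuring accuracy.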
The above code was introduced previously. From here on out, we want to create a training and test set for our classifier.
Creating a training and test set
The “training” set is the data that we will use to train our classifier. The “test” set is data held back from training, which we will use to measure the accuracy of our classifier on examples it has not seen.
We do this by splitting up the entire data set using the train_test_split function. In a new cell, add the following:
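A minimal sketch of such a cell follows. The three keyword values are the ones this post describes. The Iris dataset is an assumption, loaded here only so that the cell runs on its own.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Assumption: Iris stands in for the data set; X and y would normally
# already exist from the earlier cell.
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold back 30% of the rows for testing, keep the split reproducible,
# and preserve the class proportions of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

print(X_train.shape, X_test.shape)
```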
In the above code, we are doing the following:
- Splitting our data into a test size of 30% and a training size of 70% (as denoted by the test_size keyword arg).
- Setting the random_state keyword arg to 21. This will ensure that the split is reproducible.
- Setting the stratify keyword arg to the y variable. This will ensure that the split is stratified. That is to say, each class will make up the same proportion of the training set and of the test set as it does of the full data set.
The function itself returns four numpy.ndarray types in the order we assign X_train, X_test, y_train, y_test.
More information for train_test_split can be found here.
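To make the effect of stratify and random_state concrete, here is a small check, again assuming the Iris dataset (an illustrative stand-in). It counts the samples per class in each subset and repeats the split to confirm that it is reproducible.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Assumption: Iris stands in for the post's data set.
iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# The return values are NumPy arrays because the inputs were arrays.
print(type(X_train).__name__, type(y_test).__name__)

# Per-class sample counts: stratify=y keeps the class proportions equal.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train:", train_counts, "test:", test_counts)

# Repeating the call with the same random_state reproduces the same split.
_, X_test_again, _, y_test_again = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)
is_reproducible = np.array_equal(X_test, X_test_again) and np.array_equal(
    y_test, y_test_again
)
print("reproducible:", is_reproducible)
```

With three equally sized classes, each class contributes the same number of rows to the training set and the same number to the test set.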
Checking a classifier for fit
In relation to the k-Nearest Neighbors classifier, we need to check how good the fit is for our model.
As the value k increases, the decision boundary becomes smoother. This is known as "a less complex model".
A smaller k is more complex and can lead to overfitting. This can be defined as the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
If you increase k even more, you can end up underfitting. This is the opposite of overfitting and occurs when a statistical model cannot adequately capture the underlying structure of the data.
There is a sweet spot in the middle that we are aiming for that gives us the best fit.
We can manually inspect this by using the score method and iterating over different values of k.
To see this in action, we will add the following code to a new cell in our Python notebook:
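The cell below is a sketch of that loop rather than the exact original listing. It assumes the Iris dataset and the 70/30 stratified split described earlier. The variable names, plot title and labels are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: Iris stands in for the post's data set.
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

neighbors = list(range(1, 9))  # k = 1 through 8
train_accuracy = []
test_accuracy = []

for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # score returns the mean accuracy on the data it is given.
    train_accuracy.append(knn.score(X_train, y_train))
    test_accuracy.append(knn.score(X_test, y_test))

fig, ax = plt.subplots()
ax.plot(neighbors, test_accuracy, label="Testing accuracy")
ax.plot(neighbors, train_accuracy, label="Training accuracy")
ax.set_title("k-NN: varying number of neighbors")
ax.set_xlabel("Number of neighbors (k)")
ax.set_ylabel("Accuracy")
ax.legend()
# In a notebook the figure renders when the cell finishes;
# call plt.show() when running this as a plain script.
```

Because the dataset here is an assumption, the curves this sketch draws may differ from the graph discussed in this post.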
The above iterates through the possible k values 1 to 8 and plots the accuracy on both the training and test data on a graph for us to interpret.
The output image looks like so:
When looking at the graph, we can see that the accuracy on the test set decreases as we increase k beyond 5. This tells us that we may be experiencing underfitting.
At k=1, we see the accuracy is quite high, but this could well be a sign of overfitting.
As for the sweet spot, k=5 looks like the most eligible fit for our model.
Using our classifier with the determined parameter
The final step is to use our classifier with the determined parameter. In a new cell, we can add some unlabelled data and use our classifier to label it.
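A minimal sketch of that final cell is shown here, assuming the Iris dataset again. The two rows in X_new are made-up measurements used purely for illustration, and fitting on the training split (rather than on all of the data) is a choice made for this sketch.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: Iris stands in for the post's data set.
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# Use the k value chosen from the graph.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Hypothetical unlabelled samples: sepal length, sepal width,
# petal length, petal width (all in cm).
X_new = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.7, 3.0, 5.2, 2.3],
])

predictions = knn.predict(X_new)
print(predictions)
print(iris.target_names[predictions])
```

The predict method returns one integer label per row, and indexing target_names with those labels turns them into readable class names.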
Today’s post demonstrated how to produce a graph to help us search for parameters that produce a good fit for our k-Nearest Neighbors classifier.
We explored how to split our dataset into a training and test set, then produced a graph for us to look at to determine the best value of k for our classifier.
Resources and further reading
- Jupyter Notebooks
- “The Definitive Guide to Conda Environments”
Originally posted on my blog.