Dennis O'Keeffe
5 min read · Sep 28, 2021

This is Day 27 of the #100DaysOfPython challenge.

This post will look at setting up our template repository for scikit-learn with Miniconda (a minimal installer for conda).

We will do this by using the scikit-learn package to create a GitHub template repository for our Machine Learning projects.

The final code can be found at okeeffed/supervised-learning-with-scikit-learn-template.

Prerequisites

  1. Familiarity with Conda, the package, dependency and virtual environment manager. A handy additional reference for Conda is the blog post “The Definitive Guide to Conda Environments” on “Towards Data Science”.
  2. Familiarity with JupyterLab. See here for my post on JupyterLab.
  3. These projects will also run Python notebooks on VSCode with the Jupyter Notebooks extension. If you do not use VSCode, it is expected that you know how to run notebooks (or alter the method for what works best for you).

Getting started

Let’s create the supervised-learning-with-scikit-learn-template directory and install the required packages.
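
The exact commands are not reproduced here, but a minimal sketch with Miniconda might look like the following (the environment name and Python version are assumptions):

```bash
# Create the project directory
mkdir supervised-learning-with-scikit-learn-template
cd supervised-learning-with-scikit-learn-template

# Create and activate a Conda environment
# (environment name and Python version are assumptions)
conda create --name sklearn-template python=3.9 -y
conda activate sklearn-template

# Install the packages used throughout these posts
conda install scikit-learn pandas numpy matplotlib jupyterlab -y
```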

At this stage, we are ready to take a first look at some of the packages we will be using over the upcoming posts.

There will be more in-depth posts over the coming days with each package.

Today will include a short look at the iris dataset provided by scikit-learn.

With this in mind, we can now begin adding code to our notebook.

Writing our first notebook

We will write seven cells in the notebook:

  1. Importing our required packages and setting a graph style.
  2. Exploring the iris dataset.
  3. Assigning the iris dataset to their X and y variables.
  4. Creating and exploring the data frame.
  5. Visualizing the output and making sense of the data.
  6. Creating a k-nearest neighbors classifier.
  7. Applying the classifier to some unlabelled data and assigning predicted classes to that data.

Importing required packages

In our file docs/supervised-learning-with-scikit-learn-template.ipynb, we can add the following:
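
A sketch of that first cell (the ggplot style choice follows the post; the aliases are the conventional ones):

```python
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Use the ggplot style for our charts
plt.style.use('ggplot')
```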

We are using four main libraries here:

  1. sklearn, which includes simple and efficient tools for predictive data analysis.
  2. pandas as our data analysis and manipulation tool.
  3. numpy to help with scientific computing.
  4. matplotlib as our data visualization library.

Finally, we are updating the pyplot style to use ggplot for aesthetics. More on that can be found in the docs here.

Exploring the dataset

As a first look, we will explore the dataset with some helpful functions to get a better idea of what is happening.
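
A sketch of that exploration cell (the specific calls here are an assumption, but they surface everything the takeaways below rely on):

```python
from sklearn import datasets

# Load the iris dataset that ships with scikit-learn
iris = datasets.load_iris()

print(type(iris))          # a dictionary-like Bunch object
print(iris.feature_names)  # the four feature column names
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris.target)         # vector of 150 integers encoding the classes
print(iris.data.shape)     # (150, 4): 150 rows, 4 feature columns
```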

Some things to take away:

  1. iris.data holds our features for the data (also known as independent or predictor variables). There are 4 features (4 columns) in the data.
  2. The features themselves can be explored with the feature_names property. In this data, the features are sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm).
  3. We notice that the target is a vector of integers. Our three possible classes of setosa, versicolor and virginica are encoded as 0, 1 and 2.
  4. The iris.data.shape property tells us that there are 150 rows of data to use as historical data to help us find features which might be useful in identifying future entries.

Assigning the iris dataset to a variable

The next step is to assign the data to more aptly named variables.

Our features are assigned to X while the target variables are assigned to y.
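
That cell is just two assignments (assuming iris from the earlier cells):

```python
# By convention, the feature matrix is capital X
# and the target vector is lowercase y
X = iris.data
y = iris.target
```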

Creating and exploring the data frame

We use X, together with the dataset’s feature names, to create a data frame.

The data frame is a tabular data structure with rows and columns.

Calling the .head() method on a data frame allows us to explore the first 5 entries of the data frame in the tabular structure.
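
A sketch of that cell (assuming X and iris from the earlier cells):

```python
import pandas as pd

# Build a tabular data frame from the feature matrix, using the
# dataset's feature names as the column headers
df = pd.DataFrame(X, columns=iris.feature_names)

# Preview the first five rows
df.head()
```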

The output, the familiar first five rows of the iris data, is as follows:
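
```
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
```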

This is a helpful preview to understand how our data will be used in the final matrix.

Visualizing the output

Finally, we can visualize the output by using a scatter matrix.

The matrix is a grid of scatter plots that shows the relationship between each pair of features. It allows us to explore many relationships in one chart.
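
A sketch of that cell (the figure size and marker styling are assumptions; df, y and plt come from the earlier cells):

```python
# Pairwise scatter plots of the four features, with histograms
# on the diagonal, colored by the target class
pd.plotting.scatter_matrix(
    df,
    c=y,            # color each point by its target class
    figsize=(8, 8),
    s=150,          # marker size
    marker='D',     # diamond markers
)
plt.show()
```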

In our notebook, this will output the following scatter matrix:

Scatter matrix in VSCode

It is up to us to interpret the data.

Along the diagonal, we can see histograms that bucket the values of the feature corresponding to that row and column.

The colors on the scatter plots are assigned by our target classes. As we have three target classes, we get three different colors plotted out.

The rest are scatter plots of the column feature versus the row feature, colored by target class.

Something you will notice in the second-from-the-bottom, right-most scatter plot (petal length vs petal width) is that we get a linear grouping of points. This tells us that there is a strong correlation between the two features.

You can read more about interpreting scatter plots here.

Constructing a classifier

There are different algorithms for classifying data. In our example, we will be going with k-nearest neighbors, an algorithm that creates prediction boundaries to label data based on the k closest data points.

We will do more of a deep dive on this classifier in another blog post. For now, we will see how to construct the classifier and train it against our labeled data.

The KNeighborsClassifier docs are helpful for understanding more about the classifier and the available arguments.

In general, there are defaults for all possible arguments. For example, n_neighbors defaults to 5.

Again, we will deep dive into this in another topic, but all you need to understand in our code is that we are overriding the default of n_neighbors to be 6 to make the prediction against the six closest neighbors.

The knn.fit(iris.data, iris.target) invocation will train the classifier on the data. As soon as we have called fit, the classifier is ready to make predictions.
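
Putting that together (assuming iris from the earlier cells):

```python
from sklearn.neighbors import KNeighborsClassifier

# Override the default n_neighbors (5) so predictions are made
# against the six closest neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Train the classifier on the labeled iris data
knn.fit(iris.data, iris.target)
```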

Predicting unlabeled data

To make predictions, we need to call predict on the classifier and pass some unlabeled data.

We can use what we learned already about data frames to display that data as mapped to their features.
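
A sketch of those two steps follows. The sample values here are hypothetical (chosen so the first two rows resemble versicolor measurements and the third a setosa), not the original post’s data:

```python
import numpy as np

# Hypothetical unlabeled samples, one per row, in the same
# four-feature order as the training data
X_new = np.array([
    [5.6, 2.8, 3.9, 1.1],
    [5.7, 2.6, 3.8, 1.3],
    [4.7, 3.2, 1.3, 0.2],
])

# Map the raw values back to the named feature columns
df_new = pd.DataFrame(X_new, columns=iris.feature_names)
print(df_new)
```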

With the hypothetical samples above, this prints the three rows mapped to the four named feature columns.

Finally, we can apply what we have done to predict the class of the unlabeled data.
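
The prediction itself is a single call on the trained classifier:

```python
# Predict a class for each unlabeled sample
prediction = knn.predict(X_new)
print(prediction)  # [1 1 0] for the hypothetical samples above
```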

Our prediction printed out [1 1 0], which, when decoded and mapped back to our labels, results in the labels [versicolor versicolor setosa].

Therefore, our classifier has predicted that the first and second data points are versicolor and that the final data point is a setosa.

Summary

Today’s post set up a starting repository for all future posts on Machine Learning.

We then wrote a Python notebook, adding cells that show how to load the iris dataset, how to label the data, and how to create a classifier and apply it.

Future posts will start to become more granular and dive deeper into particular topics around classifiers (and more machine learning applications).

Resources and further reading

  1. Conda
  2. JupyterLab
  3. Jupyter Notebooks
  4. “The Definitive Guide to Conda Environments”
  5. matplotlib.pyplot.style.use
  6. sklearn
  7. pandas
  8. numpy
  9. matplotlib
  10. data frame
  11. scatter matrix
  12. What is a scatter plot?
  13. okeeffed/supervised-learning-with-scikit-learn-template

Photo credit: itssammoqadam

Originally posted on my blog. To see new posts without delay, read the posts there and subscribe to my newsletter.

I write content for AWS, Kubernetes, Python, JavaScript and more. To view all the latest content, be sure to visit my blog and subscribe to my newsletter. Follow me on Twitter.
