This is Day 27 of the #100DaysOfPython challenge.
This post will look at setting up our template repository for scikit-learn with Miniconda (a minimal installer for conda).
We will do this by using the scikit-learn package to create a GitHub template repository for our Machine Learning projects.
The final code can be found at okeeffed/supervised-learning-with-scikit-learn-template.
Prerequisites
- Familiarity with the Conda package, dependency and virtual environment manager. A handy additional reference for Conda is the blog post “The Definitive Guide to Conda Environments” on “Towards Data Science”.
- Familiarity with JupyterLab. See here for my post on JupyterLab.
- These projects will also run Python notebooks on VSCode with the Jupyter Notebooks extension. If you do not use VSCode, it is expected that you know how to run notebooks (or alter the method for what works best for you).
Getting started
Let’s create the supervised-learning-with-scikit-learn-template directory and install the required packages.
At this stage, we are ready to take a first look at some of the packages we will be using over the upcoming posts.
There will be more in-depth posts over the coming days with each package.
Today will include a short look at the iris dataset provided by scikit-learn.
With this in mind, we can now begin adding code to our notebook.
Writing our first notebook
We will write seven cells in the notebook:
- Importing our required packages and setting a graph style.
- Exploring the iris dataset.
- Assigning the iris dataset to the X and y variables.
- Creating and exploring the data frame.
- Visualizing the output and making sense of the data.
- Creating a k-nearest neighbors classifier.
- Applying the classifier to some unlabelled data and assigning predicted classes to that data.
Importing required packages
In our file docs/supervised-learning-with-scikit-learn-template.ipynb, we can add the following:
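A minimal sketch of what that first cell might contain. The aliases pd, np and plt are the conventional ones and are an assumption here, not something the post prescribes:

```python
# Hypothetical first cell of the notebook.
from sklearn import datasets      # bundled example datasets, including iris
import pandas as pd               # data frames
import numpy as np                # arrays and numerical helpers
import matplotlib.pyplot as plt   # plotting

# Use the ggplot-inspired style sheet for every plot that follows.
plt.style.use("ggplot")
```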
We are using four main libraries here:
- sklearn, which includes simple and efficient tools for predictive data analysis.
- pandas, a data analysis and manipulation tool.
- numpy, to help with scientific computing.
- matplotlib, as our data visualization library.
Finally, we are updating the pyplot style to use ggplot for aesthetics. More on that can be found in the docs here.
Exploring the dataset
As a first look, we will explore the dataset with some helpful functions to get a better idea of what is happening.
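One way such an exploration cell could look, as a sketch rather than the exact cell from the repository. It loads the dataset with load_iris and prints the attributes discussed in this section:

```python
from sklearn import datasets

# load_iris returns a Bunch, a dictionary-like container.
iris = datasets.load_iris()

print(type(iris))
print(iris.keys())

# Both the features and the target are NumPy arrays.
print(type(iris.data), type(iris.target))

# Number of rows and columns in the feature matrix.
print(iris.data.shape)

# Human-readable names for the feature columns and the classes.
print(iris.feature_names)
print(iris.target_names)
```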
Some things to take away:
- iris.data is our features for the data (also known as independent or predictor variables). There are 4 features (4 columns) in the data.
- The features themselves can be explored with the feature_names property. In this data, the features are sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm).
- We notice that the target is a vector of integers. Our three possible classes of setosa, versicolor and virginica are encoded as 0, 1 and 2.
- The iris.data.shape tells us that there are 150 rows of data to use as historical data to help us find features which might be useful in identifying future entries.
Assigning the iris dataset to a variable
The next step is to assign the data to more conventional variable names.
Our features are assigned to X while the target values are assigned to y.
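The assignment described above is a two-liner. A self-contained sketch that reloads the dataset so it can run on its own:

```python
from sklearn import datasets

iris = datasets.load_iris()

# X holds the feature matrix, y holds the integer-encoded class labels.
X = iris.data
y = iris.target

print(X.shape, y.shape)
```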
Creating and exploring the data frame
We use the X features to create a data frame.
The data frame is a tabular data structure with rows and columns.
Calling the .head() method on a data frame allows us to explore the first 5 entries of the data frame in the tabular structure.
The output is as follows:
This is a helpful preview to understand how our data will be used in the final matrix.
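The data frame step above could be sketched as follows. Using iris.feature_names as the column labels is an assumption about how the columns were named:

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
X = iris.data

# Wrap the feature matrix in a data frame with named columns.
df = pd.DataFrame(X, columns=iris.feature_names)

# head() returns the first 5 rows by default.
print(df.head())
```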
Visualizing the output
Finally, we can visualize the output by using a scatter matrix.
The matrix is a grid of scatter plots that shows the relationship between each pair of features. It allows us to explore many relationships in one chart.
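A sketch of how the scatter matrix might be drawn with pandas.plotting.scatter_matrix. The figure size, marker size and marker shape are illustrative choices, not values taken from the post:

```python
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use("ggplot")

iris = datasets.load_iris()
X, y = iris.data, iris.target
df = pd.DataFrame(X, columns=iris.feature_names)

# c=y is forwarded to the underlying scatter call, so each point is
# colored by its class; the diagonal shows a histogram per feature.
axes = pd.plotting.scatter_matrix(
    df, c=y, figsize=(8, 8), s=150, marker="D"
)
```

In a notebook the figure renders inline; in a plain script, calling plt.show() afterwards would display it.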
In our notebook, this will output the following scatter matrix:
It is up to us to interpret the data.
On the diagonal, we can see histograms showing the distribution of the feature that corresponds to that row and column.
The colors on the scatter plots are assigned by our target variable. As we have three target classes, we will get three different colors plotted out.
The rest are scatter plots of the column feature vs the row feature, colored by target class.
Something that you will notice in the scatter plot second from the bottom on the right (petal length vs petal width) is that the points fall along a roughly linear band. This suggests that there is a strong correlation between the two features.
You can read more about interpreting scatter plots here.
Constructing a classifier
There are different algorithms for classifying data. In our example, we will be going with k-nearest neighbors, an algorithm that labels a data point according to the majority class among its n closest labelled data points.
We will do more of a deep dive on this classifier in another blog post. For now, we will see how to construct the classifier and train it against our labelled data.
The KNeighborsClassifier docs are helpful for understanding more about the classifier and the available arguments.
In general, there are defaults for all possible arguments. Taken from the docs:
Again, we will deep dive into this in another topic, but all you need to understand in our code is that we are overriding the default of n_neighbors to be 6 to make the prediction against the six closest neighbors.
The knn.fit(iris.data, iris.target) invocation will train the classifier on the data. As soon as we have called fit, the classifier is ready to make predictions.
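A minimal sketch of the construction and training steps described above. The call to get_params is an addition for illustration; it lists every argument together with its current value, which makes the remaining defaults visible:

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Override only n_neighbors; every other argument keeps its default.
knn = KNeighborsClassifier(n_neighbors=6)

# Inspect the full set of arguments and their values.
print(knn.get_params())

# Train the classifier on the labelled data.
knn.fit(iris.data, iris.target)
```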
Predicting unlabeled data
To make predictions, we need to call predict on the classifier and pass some unlabelled data.
We can use what we learned already about data frames to display that data as mapped to their features.
This will print the following:
Finally, we can apply what we have done to predict the class of the unlabeled data.
Our prediction printed out [1 1 0] which, when decoded and mapped back to our labels, results in the labels [versicolor versicolor setosa].
Therefore, our classifier has predicted that the first and second data points are versicolor and that the final data point is setosa.
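The prediction steps above can be sketched end to end. The three measurements in X_new are hypothetical values chosen for illustration and are not confirmed to be the data used in the post; with these values the classifier also predicts [1 1 0]:

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import pandas as pd

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris.data, iris.target)

# Hypothetical unlabelled observations, one row per flower, with the
# columns in the same order as iris.feature_names.
X_new = np.array([
    [5.6, 2.8, 3.9, 1.1],
    [5.7, 2.6, 3.8, 1.3],
    [4.7, 3.2, 1.3, 0.2],
])

# Display the new data mapped to its feature names.
print(pd.DataFrame(X_new, columns=iris.feature_names))

# Predict the integer-encoded class for each row.
prediction = knn.predict(X_new)
print(prediction)

# Decode the integers back to the class names.
print(iris.target_names[prediction])
```

Indexing iris.target_names with the prediction array is one simple way to decode the integers back into labels.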
Summary
Today’s post set up a starting repository for all future posts on Machine Learning.
We then wrote a Python notebook whose cells show how to load the iris dataset, how to assign the features and labels, and how to create a classifier and apply it to unlabelled data.
Future posts will start to become more granular and dive deeper into particular topics around classifiers (and more machine learning applications).
Resources and further reading
- Conda
- JupyterLab
- Jupyter Notebooks
- “The Definitive Guide to Conda Environments”
- matplotlib.pyplot.style.use
- sklearn
- pandas
- numpy
- matplotlib
- data frame
- scatter matrix
- What is a scatter plot?
- okeeffed/supervised-learning-with-scikit-learn-template
Photo credit: itssammoqadam
Originally posted on my blog.