This is Day 27 of the #100DaysOfPython challenge.
This post will look at setting up our template repository for
Miniconda (a minimal installer for
We will do this by using the
scikit-learn package to create a GitHub template repository for our Machine Learning projects.
The final code can be found at
- Familiarity with the Conda package, dependency and virtual environment manager. A handy additional reference for Conda is the blog post “The Definitive Guide to Conda Environments” on “Towards Data Science”.
- Familiarity with JupyterLab. See here for my post on JupyterLab.
- These projects will also run Python notebooks on VSCode with the Jupyter Notebooks extension. If you do not use VSCode, it is expected that you know how to run notebooks (or alter the method for what works best for you).
Let’s create the supervised-learning-with-scikit-learn-template directory and install the required packages.
At this stage, we are ready to take a first look at some of the packages we will be using over the upcoming posts.
There will be more in-depth posts on each package over the coming days.
Today will include a short look at the iris dataset provided by scikit-learn.
With this in mind, we can now begin adding code to our notebook.
Writing our first notebook
We will write seven cells in the notebook:
- Importing our required packages and setting a graph style.
- Exploring the dataset.
- Assigning the iris dataset to variables.
- Creating and exploring the data frame.
- Visualizing the output and making sense of the data.
- Creating a k-nearest neighbors classifier.
- Applying the classifier to some unlabelled data and assigning predicted classes to that data.
Importing required packages
In our file
docs/supervised-learning-with-scikit-learn-template.ipynb, we can add the following:
We are using four main libraries here:
- sklearn, which includes simple and efficient tools for predictive data analysis.
- pandas, a data analysis and manipulation tool.
- numpy, to help with scientific computing.
- matplotlib, as our data visualization library.
Finally, we are updating the pyplot style to use ggplot for aesthetics. More on that can be found in the docs here.
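Put together, the import cell described above could look something like the following. This is a minimal sketch based on the four libraries listed. The aliases (pd, np, plt) are the usual community conventions and are an assumption here, not necessarily what the original cell used.

```python
# Tools for predictive data analysis, including the bundled iris dataset
from sklearn import datasets

# Data analysis and manipulation
import pandas as pd

# Scientific computing
import numpy as np

# Data visualization
import matplotlib.pyplot as plt

# Switch the plotting style to ggplot for nicer-looking charts
plt.style.use("ggplot")
```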
Exploring the dataset
As a first look, we will explore the dataset with some helpful functions to get a better idea of what is happening.
Some things to take away:
- iris.data holds the features of the data (also known as independent or predictor variables). There are 4 features (4 columns) in the data.
- The features themselves can be explored with the feature_names property. In this data, the features are sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm).
- We notice that the target is a vector of integers. Our three possible classes of setosa, versicolor and virginica are encoded as 0, 1 and 2.
- iris.data.shape tells us that there are 150 rows of data to use as historical data to help us find features which might be useful in identifying future entries.
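The points above can be checked with a handful of calls. The following is a sketch of an exploration cell, assuming the dataset is loaded with load_iris and stored in a variable named iris. The exact calls in the original cell may differ.

```python
from sklearn import datasets

# Load the bundled iris dataset (a dictionary-like Bunch object)
iris = datasets.load_iris()

# Which pieces of information does the dataset carry?
print(iris.keys())

# Rows and columns of the feature matrix
print(iris.data.shape)

# Names of the four feature columns
print(iris.feature_names)

# Class names, in the order of their integer encoding
print(iris.target_names)

# The target is a vector of integers, one per row
print(iris.target[:5])
```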
Assigning the iris dataset to a variable
The next step is to assign the data to more convenient variable names.
Our features are assigned to X, while the target values are assigned to a variable of their own.
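As a sketch, the assignment might look like this. The name y for the target is an assumption that follows the common scikit-learn convention; only X is named in the text.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix: one row per flower, one column per measurement
X = iris.data

# Target vector (the name y is assumed here): one encoded class per row
y = iris.target

print(X.shape, y.shape)
```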
Creating and exploring the data frame
We use X to create a data frame.
The data frame is a tabular data structure with rows and columns.
The .head() method on a data frame allows us to explore the first 5 entries of the data frame in the tabular structure.
The output is as follows:
This is a helpful preview to understand how our data will be used in the final matrix.
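A minimal sketch of this step, assuming the data frame is stored in a variable named df and uses the dataset's feature names as column labels:

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# Build a data frame from the feature matrix, labelling columns by feature name
df = pd.DataFrame(X, columns=iris.feature_names)

# Preview the first 5 rows
print(df.head())
```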
Visualizing the output
Finally, we can visualize the output by using a scatter matrix.
The matrix is a grid of scatter plots that shows the relationship between each pair of features. It allows us to explore many relationships in one chart.
In our notebook, this will output the following scatter matrix:
It is up to us to interpret the data.
On the diagonal, we can see histograms of the feature that corresponds to both the row and the column.
The colors on the scatter plots are assigned by our target values. As we have three target classes, we will get three different colors plotted out.
The remaining cells are scatter plots of the column feature against the row feature, colored by target class.
Something that you will notice in the second-from-the-bottom scatter plot on the right (petal length vs petal width) is that the points fall roughly along a line. This tells us that there is a strong correlation between the two features.
You can read more about interpreting scatter plots here.
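One way to draw such a matrix is with pandas' scatter_matrix helper. The sketch below assumes the names df and y for the data frame and the target, and the figsize, s and marker values are arbitrary styling choices. It also computes the correlation between petal length and petal width to back up the visual impression with a number.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

plt.style.use("ggplot")

iris = datasets.load_iris()
X = iris.data
y = iris.target  # assumed name for the target vector
df = pd.DataFrame(X, columns=iris.feature_names)

# One scatter plot per pair of features, with histograms on the diagonal.
# c=y colors each point by its class.
axes = pd.plotting.scatter_matrix(df, c=y, figsize=(8, 8), s=150, marker="D")

# Pearson correlation between the two petal measurements
corr = df["petal length (cm)"].corr(df["petal width (cm)"])
print(round(corr, 2))
```

A correlation value close to 1 is consistent with the near-linear grouping visible in the petal length vs petal width plot.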
Constructing a classifier
There are different algorithms for classifying data. In our example, we will be going with k-nearest neighbors, an algorithm that creates prediction boundaries to label data based on the n closest data points.
We will do more of a deep dive on this classifier in another blog post. For now, we will see how to construct the classifier and train it against our labelled data.
The docs for KNeighborsClassifier are helpful to understand more about the classifier and the available arguments.
In general, there are defaults for all possible arguments. Taken from the docs:
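The same defaults can also be checked from inside the notebook. As a small sketch, scikit-learn estimators expose a get_params method that returns their current settings, so a classifier constructed with no arguments reports the default values:

```python
from sklearn.neighbors import KNeighborsClassifier

# A classifier built without arguments carries only default settings
defaults = KNeighborsClassifier().get_params()

# Print each argument with its default value
for name, value in sorted(defaults.items()):
    print(f"{name} = {value!r}")
```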
Again, we will deep dive into this in another post, but all you need to understand in our code is that we are overriding the default of n_neighbors to be 6 to make the prediction against the six closest neighbors.
The knn.fit(iris.data, iris.target) invocation will train the classifier on the data. As soon as we have called fit, the classifier is ready to make predictions.
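A sketch of the classifier cell, using the n_neighbors override and the knn.fit call mentioned above (the variable name knn is taken from that call):

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Override the default number of neighbors so that six are consulted
knn = KNeighborsClassifier(n_neighbors=6)

# Train the classifier on the labelled data
knn.fit(iris.data, iris.target)

# After fitting, the classifier knows which classes it can predict
print(knn.classes_)
```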
Predicting unlabeled data
To make predictions, we need to call predict on the classifier and pass some unlabelled data.
We can use what we learned already about data frames to display that data as mapped to their features.
This will print the following:
Finally, we can apply what we have done to predict the class of the unlabeled data.
Our prediction printed out [1 1 0], which when decoded and mapped back to our labels results in the labels [versicolor versicolor setosa].
Therefore, our classifier has predicted that the first and second data points are versicolor and that the final data point is setosa.
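The whole prediction step can be sketched as follows. The measurements in X_new are hypothetical values chosen for illustration and are not necessarily the data used in the original notebook. The names X_new, prediction and labels are assumptions as well.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris.data, iris.target)

# Hypothetical unlabelled measurements, in the same column order as iris.feature_names
X_new = np.array([
    [5.6, 2.8, 3.9, 1.1],
    [5.7, 2.6, 3.8, 1.3],
    [4.7, 3.2, 1.3, 0.2],
])

# Display the unlabelled data mapped to its features
print(pd.DataFrame(X_new, columns=iris.feature_names))

# Predict an encoded class for each row
prediction = knn.predict(X_new)
print(prediction)

# Decode the integers back into class names
labels = iris.target_names[prediction]
print(labels)
```

Indexing target_names with the prediction array is a compact way to decode the integers, since the position of each name matches its encoding.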
Today’s post set up a starting repository for all future posts on Machine Learning.
We then wrote a Python notebook with cells that show how to load the iris dataset, how to label the data, and how to create a classifier and apply that classifier.
Future posts will start to become more granular and dive deeper into particular topics around classifiers (and more machine learning applications).
Resources and further reading
- Jupyter Notebooks
- “The Definitive Guide to Conda Environments”
- What is a scatter plot?
Originally posted on my blog. To see new posts without delay, read the posts there and subscribe to my newsletter.