Generating Fake CSV Data With Python

Sep 25, 2021

This is Day 24 of the #100DaysOfPython challenge.

This post will use the Faker library to generate fake data and export it to a CSV file.

We will be emulating one of the free datasets from Kaggle, in particular the Netflix original films IMDB scores dataset, to generate something similar.

The final code can be found here.

Prerequisites

Familiarity with Pipenv. See here for my post on Pipenv.
Familiarity with JupyterLab. See here for my post on JupyterLab.

Getting started

Let’s create the generating-fake-csv-data-with-python directory and install Faker and JupyterLab.

At this stage, we have the packages that we need.

Now we can start up the notebook server.

The server will now be up and running.

Creating the notebook

Once on http://localhost:8888/lab, create a new Python 3 notebook from the launcher.

Ensure that this notebook is saved in generating-fake-csv-data-with-python/docs/generating-fake-data.ipynb.

The notebook will handle two parts of this mini project:

Importing Faker and generating data.
Importing the CSV module and exporting the data to a CSV file.

Before generating our data, we need to look at what we are trying to emulate.

Emulating The Netflix Original Movies IMDB Scores Dataset

Looking at the preview for our dataset, we can see that it contains the following columns and example rows:

| Title | Genre | Premiere | Runtime | IMDB Score | Language |
| --- | --- | --- | --- | --- | --- |
| Enter the Anime | Documentary | August 5, 2019 | 58 | 2.5 | English/Japanese |
| Dark Forces | Thriller | August 21, 2020 | 81 | 2.6 | Spanish |

We only have two example rows, but from here we can make a few assumptions about how we want to emulate the dataset.

For our languages, we will stick to a single language per movie (unlike the example English/Japanese).
We will treat IMDB scores as being between 1 and 5. We won’t be too harsh on any movie, so scores start at 1 rather than 0.
Runtimes should resemble those of a real movie, so we can set them to be between 50 and 150 minutes.
Genres may be something we need to write our own Faker provider for.
We are going to be okay with nonsense data, so we can just use a string generator for the names.

With this said, let’s look at how we can fake this.

Emulating a value for each column

We will create seven cells — one to import Faker and one for each column.

For the first cell, we will import Faker.

Second, we will fake a movie name with words:

Third, we will generate a date from this decade and use the same format as the example:

Fourth, we will create our own fake data generator for the genre:

Fifth, we will do the same for a language:

Sixth, we need to generate a runtime:

Lastly, we need a rating with one decimal place between 1.0 and 5.0:

Now that we have all our information together, it is time to generate a CSV with 100 entries.

Generating the CSV

We can place everything we know into a last cell to generate some data:

Running the cell will output the CSV file movie_data.csv in our root that looks like this:

Success!

Summary

Today’s post demonstrated how to use the Faker package to generate fake data and the CSV library to export that data to a file.

In the future, we may use this approach to make our own data sets to work with and do some data science around them.

Kaggle and Open Data are great resources for datasets and data visualization when you are not generating your own data.

This “100 Days in Python” series will move towards data science and machine learning from here on out.

Resources and further reading

Photo credit: pawel_czerwinski

Originally posted on my blog. To see new posts without delay, read the posts there and subscribe to my newsletter.
