Split data – deep learning Python code

November 14, 2018

The very first script you write when working in deep learning will often not be for training but for splitting the dataset. Every time I start a new project, the first task is to organize the given data. This blog is about image data for computer vision tasks, mostly classification.

If you are lucky, the data will already be arranged for you, but in most cases it will be a mess, and our job is to simplify it first. In this blog, I will show you some of the possible dataset patterns and how to arrange and split them. I have made the code available on GitHub and will share the links too.

Split percentage

Before we begin, let’s decide how to divide our dataset. We will need three parts – train, validation and test. Different researchers make different choices. The split I mostly prefer is 70% for train, 15% for validation and 15% for test. You will also often see 80-10-10.

Forms of raw data

Raw data can come in many different forms. Mostly, I have encountered these two:

1 – Directory containing subdirectories with class names

This type of arrangement makes our task a bit easier. Most deep learning libraries’ data generators expect the dataset to be in this form, so the only remaining task is to split it.
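For example, a raw dataset in this form might look like the tree below (the class and file names here are made up for illustration):

```
dataset/
├── cats/
│   ├── cat_001.jpg
│   ├── cat_002.jpg
│   └── ...
└── dogs/
    ├── dog_001.jpg
    ├── dog_002.jpg
    └── ...
```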

The Python script below takes two arguments – the input and output directories. It creates directories named train, validation and test in the output directory, then copies 70% of the images from each class into a directory with the same class name under train, 15% under validation and 15% under test.
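The original script is hosted on GitHub, so here is a minimal sketch of the idea, assuming a 70-15-15 split and that every entry in the input directory is a class subdirectory of image files (function and variable names are my own, not from the original repo):

```python
import os
import random
import shutil
import sys

# Assumed split ratios: 70% train, 15% validation, 15% test.
SPLITS = {"train": 0.70, "validation": 0.15, "test": 0.15}

def split_dataset(input_dir, output_dir, seed=42):
    """Copy files from input_dir/<class>/ into
    output_dir/{train,validation,test}/<class>/."""
    random.seed(seed)  # fixed seed so the split is reproducible
    for class_name in sorted(os.listdir(input_dir)):
        class_dir = os.path.join(input_dir, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level
        files = sorted(os.listdir(class_dir))
        random.shuffle(files)

        n = len(files)
        n_train = int(n * SPLITS["train"])
        n_val = int(n * SPLITS["validation"])
        parts = {
            "train": files[:n_train],
            "validation": files[n_train:n_train + n_val],
            "test": files[n_train + n_val:],  # remainder goes to test
        }
        for split_name, split_files in parts.items():
            dest = os.path.join(output_dir, split_name, class_name)
            os.makedirs(dest, exist_ok=True)
            for fname in split_files:
                shutil.copy2(os.path.join(class_dir, fname),
                             os.path.join(dest, fname))

if __name__ == "__main__":
    split_dataset(sys.argv[1], sys.argv[2])
```

Giving the leftover files to test (rather than recomputing 15%) guarantees that every image lands in exactly one split even when the class size is not divisible neatly.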

A script like the one above divides the data 70-15-15. To improve the code or to check its latest version, please see it here on GitHub.

2 – Directory containing images

Images are given in a single directory. The name of each image file is a unique image id, and the labels for the image ids are given separately in a CSV file.

In this case, the approach to preparing the split dataset will be a little different.
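One way to handle this case is sketched below: first group the image ids by label from the CSV file, then split each group 70-15-15 and copy the files into class-named subdirectories, producing the same layout as in the first case. The CSV column names (`id`, `label`) and the `.jpg` extension are assumptions; adjust them for the actual dataset.

```python
import csv
import os
import random
import shutil

def split_from_csv(images_dir, csv_path, output_dir, ext=".jpg", seed=42):
    """Split images labelled by a CSV (assumed header: id,label) into
    output_dir/{train,validation,test}/<label>/."""
    # Group image ids by their class label.
    by_label = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            by_label.setdefault(row["label"], []).append(row["id"])

    random.seed(seed)  # fixed seed so the split is reproducible
    for label, ids in by_label.items():
        random.shuffle(ids)
        n = len(ids)
        n_train = int(n * 0.70)
        n_val = int(n * 0.15)
        splits = {
            "train": ids[:n_train],
            "validation": ids[n_train:n_train + n_val],
            "test": ids[n_train + n_val:],  # remainder goes to test
        }
        for split_name, split_ids in splits.items():
            dest = os.path.join(output_dir, split_name, label)
            os.makedirs(dest, exist_ok=True)
            for image_id in split_ids:
                shutil.copy2(os.path.join(images_dir, image_id + ext),
                             os.path.join(dest, image_id + ext))
```

Splitting per label (rather than shuffling the whole CSV at once) keeps each class represented at roughly the same 70-15-15 ratio in every split.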

To improve the code or to check its latest version, please see it here on GitHub.