Mixing datasets using symlinks

July 01, 2023

I recently had to fine-tune around 100 Computer Vision AI models for an assignment I’m completing as part of my Master’s Degree (the same assignment as this blogpost).

The models were all the same variants; the only difference was that they were trained on different mixes of datasets (specifically, the CamVid dataset, the SYNTHIA dataset and the Playing for Data (PFD) dataset).

For the assignment I needed to generate mixes of the CamVid dataset with a second dataset, increasing the share of the second dataset in steps of 10%, for example:

CamVid + 10% SYNTHIA
CamVid + 10% PFD
CamVid + 20% SYNTHIA
…etc
CamVid + 100% SYNTHIA

Given that the SYNTHIA and PFD datasets are quite large (tens of thousands of images), creating a separate copy of the files for every mix was going to be prohibitively expensive in terms of both time and storage.

I considered overriding the way that the computer vision models I was using (specifically the YOLO family of models) load data. However, under the time constraints of the assignment this looked like a risky approach. The default way of feeding AI models, especially image-based ones, leans heavily towards providing a directory of files and letting the library figure out how to use it.

Here is where symlinks became invaluable.

Symlinks, or symbolic links, are files whose sole purpose is to point to another file. Deleting a symbolic link does not affect the file that it links to. On Linux systems they are created using ln -s and on Windows using mklink.
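As a minimal sketch of that behaviour in Python (the file names and the demo_symlink function are made up for illustration), os.symlink creates the link, and removing the link leaves the original file untouched:

```python
import os
import tempfile


def demo_symlink():
    """Create a symlink, read through it, delete it, and report what happened."""
    # Work in a throwaway directory so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "image.png")     # hypothetical original file
        link = os.path.join(tmp, "image_link.png")  # hypothetical symlink path

        with open(source, "wb") as f:
            f.write(b"fake image bytes")

        # Create the symlink: `link` now points at `source`.
        os.symlink(source, link)
        is_link = os.path.islink(link)

        # Reading through the link returns the content of the original file.
        with open(link, "rb") as f:
            same_content = f.read() == b"fake image bytes"

        # Deleting the link removes only the link, not the file it points to.
        os.remove(link)
        source_survives = os.path.exists(source)

    return is_link, same_content, source_survives


if __name__ == "__main__":
    print(demo_symlink())
```

Note that on Windows, creating symlinks may require elevated privileges or Developer Mode to be enabled.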

Using some very simple Python, I came up with two scripts which:

Read the file structures of the datasets and created a text file for each dataset mixture, such as 100_synthia_20_camvid.txt, indicating that the mix was all of SYNTHIA and 20% of CamVid. Each line in the text file contained a file path.
Read each text file and, for each line, derived a target file path from the source file path (so if the path was /foo/bar.png in 100_synthia_20_camvid.txt, the target would be ./datasets/100_synthia_20_camvid/foo/bar.png) and created a symlink at the target pointing to the source. Any directories that did not exist were created along the way.
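The two steps above could be sketched roughly as follows. This is not the original pair of scripts, just a minimal illustration under some assumptions: the function names (list_images, write_mix_file, link_mix), the set of image extensions, and the use of seeded random sampling to pick the subset are all hypothetical choices, since the post does not say how the subset of each dataset was selected.

```python
import os
import random

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}  # assumed image types


def list_images(root):
    """Return the absolute paths of all image files under root, sorted."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)


def write_mix_file(mix_file, parts, seed=0):
    """Step 1: write one source file path per line for a dataset mixture.

    parts is a list of (dataset_root, fraction) pairs with 0 <= fraction <= 1,
    e.g. [("/data/synthia", 1.0), ("/data/camvid", 0.2)] (hypothetical paths).
    """
    rng = random.Random(seed)  # fixed seed so the same mix can be regenerated
    lines = []
    for root, fraction in parts:
        images = list_images(root)
        count = round(len(images) * fraction)
        # Assumption: the subset is a uniform random sample of the dataset.
        lines.extend(rng.sample(images, count))
    with open(mix_file, "w") as f:
        for line in lines:
            f.write(line + "\n")
    return lines


def link_mix(mix_file, output_root="./datasets"):
    """Step 2: create one symlink per line of the mix file."""
    mix_name = os.path.splitext(os.path.basename(mix_file))[0]
    created = []
    with open(mix_file) as f:
        for line in f:
            source = line.strip()
            if not source:
                continue
            # /foo/bar.png -> <output_root>/<mix_name>/foo/bar.png
            relative = os.path.splitdrive(source)[1].lstrip("/\\")
            target = os.path.join(output_root, mix_name, relative)
            # Create any missing directories before placing the link.
            os.makedirs(os.path.dirname(target), exist_ok=True)
            if not os.path.lexists(target):
                os.symlink(source, target)
            created.append(target)
    return created
```

Calling write_mix_file("100_synthia_20_camvid.txt", [...]) followed by link_mix("100_synthia_20_camvid.txt") would then produce a ./datasets/100_synthia_20_camvid directory containing only symlinks. The sketch handles image files only; any label or annotation files a model needs would have to be listed and linked in the same way.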

Using this method I managed to create around 30 different datasets that, even though they are made up only of symlinks, occupied around 6GB of disk space. This is a massive improvement over the 82.5GB of the source datasets. Had I copied each file, the mixes would easily have required single-digit terabytes of storage and taken a prohibitively long time to generate.

For completeness, you can find the Python scripts here: