How to read pictures from a big folder and split them into train, validation and test sets?


I am working on a sign language gesture classifier using PyTorch. I have pictures representing each letter, stored in a folder named after that letter. E.g. folder “A” has “1_A_1.jpg”, “1_A_2.jpg”, “21_A_3.jpg”, etc.

I am trying to build a function that:

  1. Iterates through the different folders
  2. Splits the data into training, validation, and test sets
  3. Labels those pictures with their respective folder name (i.e. letter label)
  4. Returns three newly created folders: train, validation and test

All the examples I found online split data that comes from torchvision's built-in datasets; none of them start from raw image folders.

I found the following on Stack Overflow:

import os
import shutil
import numpy as np
import argparse

def get_files_from_folder(path):

    files = os.listdir(path)
    return np.asarray(files)

def main(path_to_data, path_to_test_data, train_ratio):
    # get dirs
    _, dirs, _ = next(os.walk(path_to_data))

    # counts the files per class, then how many of them go to the test set
    data_counter_per_class = np.zeros((len(dirs)))

    for i in range(len(dirs)):
        path = os.path.join(path_to_data, dirs[i])
        files = get_files_from_folder(path)
        data_counter_per_class[i] = len(files)
    test_counter = np.round(data_counter_per_class * (1 - train_ratio))

    # transfers files
    for i in range(len(dirs)):
        path_to_original = os.path.join(path_to_data, dirs[i])
        path_to_save = os.path.join(path_to_test_data, dirs[i])

        # creates the destination dir if it does not exist yet
        if not os.path.exists(path_to_save):
            os.makedirs(path_to_save)
        files = get_files_from_folder(path_to_original)
        # moves data
        for j in range(int(test_counter[i])):
            dst = os.path.join(path_to_save, files[j])
            src = os.path.join(path_to_original, files[j])
            shutil.move(src, dst)

When I tried the following:

path_to_data = r'path\A'

nothing happened.

If I can get this working for train and test, I can easily extend it for validation.


Give this a go:

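from pathlib import Path
import shutil

def main(data_path, out_path, train_ratio):
    # every sub-folder of data_path is one class (e.g. one letter)
    dir_paths = [child for child in Path(data_path).iterdir() if child.is_dir()]

    for dir_path in dir_paths:
        files = list(dir_path.iterdir())
        # number of files that go to the test set
        test_len = int(len(files) * (1 - train_ratio))

        # mirror the class folder name under out_path
        out_dir = Path(out_path).joinpath(dir_path.name)
        if not out_dir.exists():
            out_dir.mkdir(parents=True)

        for file_ in files[:test_len]:
            # moves the file, as in the question's code;
            # use shutil.copy2 instead to keep the originals
            shutil.move(str(file_), str(out_dir.joinpath(file_.name)))

if __name__ == '__main__':
    main('data', 'test', 0.8)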
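
To get the validation set as well, the same idea can be extended to a three-way split. The following is only a sketch under some assumptions: the function name `split_dataset`, the `ratios` and `seed` parameters, and the output folder names `train`, `val` and `test` are made up for illustration, and files are copied rather than moved so the original folders stay untouched. Files are shuffled per class before splitting so the split does not depend on filename order.

```python
import random
import shutil
from pathlib import Path


def split_dataset(data_path, out_path, ratios=(0.7, 0.15, 0.15), seed=0):
    """Copy each class folder's files into out_path/{train,val,test}/<class>/.

    Returns the total number of files copied into each split.
    """
    names = ("train", "val", "test")  # hypothetical split folder names
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    counts = {name: 0 for name in names}

    # every sub-folder of data_path is treated as one class (e.g. one letter)
    class_dirs = sorted(p for p in Path(data_path).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(files)

        n_train = int(len(files) * ratios[0])
        n_val = int(len(files) * ratios[1])
        parts = {
            "train": files[:n_train],
            "val": files[n_train:n_train + n_val],
            "test": files[n_train + n_val:],  # whatever is left goes to test
        }

        for name, part in parts.items():
            # the class folder name is kept, so it still serves as the label
            target = Path(out_path) / name / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for src in part:
                shutil.copy2(src, target / src.name)
            counts[name] += len(part)

    return counts
```

Calling `split_dataset('data', 'split')` should then leave you with `split/train/A`, `split/val/A`, `split/test/A` and so on. Keep `out_path` outside `data_path`, otherwise the output folder would be picked up as a class on the next run. The one-sub-folder-per-class layout is also what `torchvision.datasets.ImageFolder` reads, so each of the three folders can be loaded that way if you want the labels inferred from the folder names.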

Answered By – adrianp

Answer Checked By – Mildred Charles (AngularFixing Admin)
