How does this split of train and evaluation data ensure there is no overlap?


I am reading the sentiment classification tutorial from TensorFlow.

It splits the data into training and validation sets with the following code:

batch_size = 32
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(

raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(

Shouldn’t a single call of the function text_dataset_from_directory generate the two sets? If it is called twice, does it ensure there will be no overlap between the two split sets?


Calling the function twice is fine, as long as you either pass the same seed to both calls or set shuffle=False. That is what guarantees the two sets do not overlap. Here’s what happens under the hood:

When validation_split is set, the shuffle and seed arguments are checked (Source):

if validation_split and shuffle and seed is None:
        raise ValueError(
            'If using `validation_split` and shuffling the data, you must provide '
            'a `seed` argument, to make sure that there is no overlap between the '
            'training and validation subset.')
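
To see why this check exists, here is a minimal pure-Python sketch (not the Keras code) of what would go wrong if each call shuffled the file list in its own random order. The helper take_subset, the file names, and the use of random.Random are assumptions made for illustration only; the two different seeds stand in for two unseeded calls.

```python
import random

def take_subset(samples, validation_split, subset, seed):
    """Shuffle a copy of `samples` with `seed`, then slice off one subset."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    num_val = int(validation_split * len(shuffled))
    if subset == "training":
        return shuffled[:-num_val]
    return shuffled[-num_val:]

# Hypothetical file list standing in for the files found on disk.
files = [f"review_{i}.txt" for i in range(100)]

# Different seeds mimic two calls that each picked their own shuffle order.
train = take_subset(files, 0.2, "training", seed=1)
val = take_subset(files, 0.2, "validation", seed=2)

# Files present in both subsets would leak validation data into training.
leaked = set(train) & set(val)
print(f"{len(leaked)} validation files also appear in the training set")
```

With mismatched orders, the validation slice is drawn from a different permutation than the one the training slice was cut from, so some files end up in both.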

Then the requested subset is sliced out of the (already shuffled) file list, with the validation files taken from the end (Source):

num_val_samples = int(validation_split * len(samples))
if subset == 'training':
    print('Using %d files for training.' % (len(samples) - num_val_samples,))
    samples = samples[:-num_val_samples]
    labels = labels[:-num_val_samples]
elif subset == 'validation':
    print('Using %d files for validation.' % (num_val_samples,))
    samples = samples[-num_val_samples:]
    labels = labels[-num_val_samples:]

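After this code runs, samples and labels are restricted to either the training or the validation subset. Because you passed the same seed to both calls, the files are shuffled into the same order each time. The training call keeps everything except the last num_val_samples entries and the validation call keeps exactly those last entries, so the two slices cannot overlap.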
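
The argument above can be checked with a small pure-Python sketch. It is not the Keras implementation: the split function, the file names, and random.Random are assumptions used as stand-ins. It re-seeds before shuffling each list so that samples and labels receive the same permutation, then slices as in the snippet above.

```python
import random

def split(samples, labels, validation_split, subset, seed):
    """Shuffle samples and labels identically, then return one subset."""
    samples, labels = list(samples), list(labels)
    # A fresh generator with the same seed applies the same permutation
    # to both lists, so each label stays next to its sample.
    random.Random(seed).shuffle(samples)
    random.Random(seed).shuffle(labels)
    num_val = int(validation_split * len(samples))
    if subset == "training":
        return samples[:-num_val], labels[:-num_val]
    return samples[-num_val:], labels[-num_val:]

# Hypothetical dataset: 50 negative and 50 positive reviews.
files = [f"neg_{i}.txt" for i in range(50)] + [f"pos_{i}.txt" for i in range(50)]
labels = [0] * 50 + [1] * 50

# Two separate calls with the same seed, as in the tutorial.
train_x, train_y = split(files, labels, 0.2, "training", seed=42)
val_x, val_y = split(files, labels, 0.2, "validation", seed=42)

no_overlap = set(train_x).isdisjoint(val_x)
covers_all = set(train_x) | set(val_x) == set(files)
print("no overlap:", no_overlap, "| every file used:", covers_all)
```

Both calls see the same ordering, so the training slice and the validation slice are complementary pieces of one list.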

Answered By – Frightera

Answer Checked By – Robin (AngularFixing Admin)
