Is it possible to resume training from a checkpoint model in Tensorflow?

Issue

I am doing automatic segmentation. I was training a model over the weekend when the power went out. I had trained my model for 50+ hours, saving it every 5 epochs using the line:

model_checkpoint = ModelCheckpoint('test_{epoch:04}.h5', monitor=observe_var, mode='auto', save_weights_only=False, save_best_only=False, period = 5)

I’m loading the saved model using the line:

model = load_model('test_{epoch:04}.h5', custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})

I have included all of my data that splits my training data into train_x for the scan and train_y for the label. When I run the line:

loss, dice_coef = model.evaluate(train_x,  train_y, verbose=1)

I get the error:

ResourceExhaustedError:  OOM when allocating tensor with shape[32,8,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[node model/conv3d_1/Conv3D (defined at <ipython-input-1-4a66b6c9f26b>:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_3673]

Function call stack:
distributed_function

Solution

You are basically running out of GPU memory, so you need to run the evaluation in smaller batches. The default batch size is 32; try a smaller value:
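A quick back-of-the-envelope check (pure Python, no TensorFlow needed) shows why the default batch size blows up here: a single float32 activation with the shape reported in the OOM message already needs 2 GiB on its own.

```python
# One float32 tensor of the shape in the error: [batch=32, channels=8, 128, 128, 128]
elements = 32 * 8 * 128 * 128 * 128   # number of values in the tensor
bytes_needed = elements * 4           # float32 = 4 bytes per value
print(bytes_needed / 2**30)           # -> 2.0 (GiB), for just this one activation
```

And that is only one intermediate tensor of one conv layer; the forward pass holds several such activations at once, so a smaller batch size shrinks the peak proportionally.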

model.evaluate(train_x, train_y, batch_size=<batch size>)

From the Keras documentation:

batch_size: Integer or None. Number of samples per gradient update. If
unspecified, batch_size will default to 32.
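A minimal runnable sketch of the fix, using a tiny stand-in model (in the question this would instead be the `load_model('test_....h5', custom_objects=...)` checkpoint, with `train_x`/`train_y` as the 3D scan data; those names and the batch size of 4 are illustrative assumptions):

```python
import numpy as np
import tensorflow as tf

# Stand-in for the restored checkpoint; the real code would use
# tensorflow.keras.models.load_model(...) with the dice custom_objects.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16,)),
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse")

x = np.random.rand(64, 16).astype("float32")  # stand-in for train_x
y = np.random.rand(64, 1).astype("float32")   # stand-in for train_y

# The key change: an explicit small batch_size, so evaluate() never
# materialises all 32 samples' activations on the GPU at once.
loss = model.evaluate(x, y, batch_size=4, verbose=0)
```

If a batch size of 4 still triggers the OOM, keep halving it; for 3D volumes of 128^3 voxels, a batch size of 1 or 2 is not unusual.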

Answered By – Rajith Thennakoon

Answer Checked By – Mary Flores (AngularFixing Volunteer)
