Is it normal to get ETA: 6:43:26 hours to complete the first epoch

Issue

I have created the VGG16-based CNN below and want to train it for 50 epochs, but it shows nearly 7 hours (ETA: 6:43:26) to complete the first epoch. Could anyone tell me whether this is normal with 209222 training images and 40000 validation images (DeepFashion dataset), or is there an issue with my steps_per_epoch? I am training this model on an HPC system with 16 workers.

  train_gen = ImageDataGenerator(rescale=1./255)

  val_gen = ImageDataGenerator(rescale=1./255)

  train_batches = train_gen.flow_from_directory(train_path,
          target_size=(img_r, img_c),
          batch_size=batch_size,
          class_mode='categorical',
          shuffle=True)
          
  val_batches = val_gen.flow_from_directory(validation_path,
          target_size=(img_r, img_c),
          batch_size=batch_size_val,
          class_mode='categorical',
          shuffle=False)
  
  return train_batches, val_batches



def fit_model(model, batches, val_batches):

    print("started model training")
    history = model.fit(train_batches,
                                  steps_per_epoch = 209222/32,
                                  epochs = 50,
                                  validation_data= val_batches,
                                  validation_steps=40000/32,
                                  verbose=1,
                                  use_multiprocessing=True,
                                  workers=16
                                  )

This is the model part:

def create_model(input_shape, output_classes):
    logging.debug('input_shape {}'.format(input_shape))
    logging.debug('input_shape {}'.format(type(input_shape)))
    
    #optimizer_mod = keras.optimizers.SGD(lr=0.001, momentum=momentum, decay=decay, nesterov=False)
    
    vgg16 = VGG16(weights='imagenet',include_top=False)
  
    for layer in vgg16.layers[:15]:
        layer.trainable = False
    
    x= vgg16.get_layer('block4_conv3').input
    x = vgg16.get_layer('block4_conv3')(x)
  
    if True:
        x = Reshape([28*28,512])(x)
        att = MultiHeadsAttModel(l=28*28, d=512 , dv=64, dout=512, nv = 8 )
        x = att([x,x,x])
        x = Reshape([28,28,512])(x)   
        x = BatchNormalization()(x)
        
    #x = vgg16.get_layer('block5_conv1')(x)
    #x = vgg16.get_layer('block5_conv2')(x)
    #x = vgg16.get_layer('block5_conv3')(x)
    #x = vgg16.get_layer('block5_pool')(x)
    
    x = Flatten()(x)
    x = Dense(256, activation="relu")(x)
    x = Dropout(0.5)(x)
    outputs = Dense(output_classes, activation='softmax')(x)
    
    
    model =tf.keras.Model(inputs=vgg16.input, outputs=outputs)
    
    top3_acc = functools.partial(keras.metrics.top_k_categorical_accuracy, k=3)
    top3_acc.__name__ = 'top3_acc' 
    opt = tf.keras.optimizers.Adam(learning_rate=0.01)
    
    model.compile(
                  optimizer=opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy',top3_acc]) 

    return model

Solution

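If you are using VGG16 with the ImageNet weights, you should preprocess the images the same way the model was trained. The Keras VGG16 weights do not expect inputs scaled to [0, 1] or to [-1, +1]; they expect the Caffe-style preprocessing (RGB converted to BGR, then the per-channel ImageNet mean subtracted, with no scaling). Note also that rescale is only a multiplicative factor, so it cannot express a shift such as x/127.5 - 1. In place of rescale=1./255, pass

  preprocessing_function=tf.keras.applications.vgg16.preprocess_input

to both ImageDataGenerator instances.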
That will not solve your long first epoch, however.
steps_per_epoch and validation_steps should be integers, not the floats that 209222/32 and 40000/32 produce, so use

steps_per_epoch= 209222//32+1
validation_steps= 40000//32 +1
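
The `+1` form overshoots by one step whenever the sample count divides evenly by the batch size, as 40000 / 32 does. A ceiling division avoids that. This is a minimal sketch, assuming a batch size of 32 for both generators; `steps_for` is a hypothetical helper, not part of Keras:

```python
def steps_for(num_samples: int, batch_size: int) -> int:
    """Batches needed to see every sample once (the last batch may be partial)."""
    # Integer ceiling division: rounds up without going through floats.
    return (num_samples + batch_size - 1) // batch_size


train_steps = steps_for(209222, 32)  # training images from the question
val_steps = steps_for(40000, 32)     # validation images from the question
print(train_steps, val_steps)
```

These values can then be passed as `steps_per_epoch=train_steps` and `validation_steps=val_steps`. The iterators returned by `flow_from_directory` also know their own length, so `len(train_batches)` should give the same number for the batch size actually in use.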

I suspect that will not solve the problem either.
Each training epoch requires 6539 steps and each validation pass 1251 steps
(one more than strictly needed, since 40000 divides evenly by 32). That is a lot of steps.
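
For scale, the ETA from the question can be converted into a per-step cost. This is a rough sketch, assuming the 6:43:26 estimate covers only the 6539 training steps and that the batch size is 32; `eta_to_seconds` is an illustrative helper:

```python
def eta_to_seconds(eta: str) -> int:
    """Convert an 'H:MM:SS' string, as printed by the Keras progress bar, to seconds."""
    hours, minutes, seconds = (int(part) for part in eta.split(":"))
    return hours * 3600 + minutes * 60 + seconds


eta_s = eta_to_seconds("6:43:26")   # ETA reported in the question
steps = 6539                         # training steps per epoch
batch_size = 32                      # assumed from the steps_per_epoch expression

sec_per_step = eta_s / steps
ms_per_image = 1000 * sec_per_step / batch_size
print(f"{sec_per_step:.2f} s per step, {ms_per_image:.0f} ms per image")
```

Several seconds per batch of 32 images would be consistent with the model running on the CPU, or with the data loader starving the GPU, rather than with the step count alone.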
The processing time also depends heavily on the image size.
What values did you use for img_r and img_c?
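
The `Reshape([28*28, 512])` applied after `block4_conv3` in the question suggests 224 x 224 inputs, since three 2x2 max-pooling layers precede that layer in VGG16. Under that assumption, the size of the first Dense layer after `Flatten` can be estimated as below; the function name is made up for illustration:

```python
def dense_params_after_flatten(image_side: int, channels: int = 512, units: int = 256) -> int:
    """Parameter count of Dense(units) fed by Flatten() of the block4_conv3 output."""
    side = image_side // 8            # three 2x2 max-pools halve the side three times
    flat = side * side * channels     # length of the flattened feature vector
    return flat * units + units       # weights plus biases


print(dense_params_after_flatten(224))
```

That single layer holds on the order of 100 million trainable weights at this input size, so the image size has a large effect on both memory use and time per step.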
Also, VGG16 is computationally intensive to begin with: the full model has
about 138 million parameters, and the convolutional base alone
(include_top=False) has about 14.7 million. I would recommend trying the
MobileNet model, which has on the order of 4 million parameters and is about
as accurate on ImageNet. As noted by Edwin Cheong above, you need to
check whether your GPU is actually being used. I suspect it is not.
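
One way to check is to ask TensorFlow which GPUs it can see from inside the training job. This is a small sketch, assuming TensorFlow 2.1 or later; `describe_gpus` is a hypothetical helper, and the import is guarded only so the script still reports something useful if it is run in the wrong environment on the cluster:

```python
def describe_gpus() -> str:
    """Return a one-line summary of the GPUs visible to TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return "TensorFlow is not installed in this environment"

    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        return "No GPU visible to TensorFlow; training will run on the CPU"
    return f"{len(gpus)} GPU(s) visible: " + ", ".join(gpu.name for gpu in gpus)


print(describe_gpus())
```

If no GPU is listed, the usual suspects are a CPU-only TensorFlow build, missing CUDA libraries, or a job that was scheduled on a node without a GPU.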

Answered By – Gerry P

Answer Checked By – Marie Seifert (AngularFixing Admin)
