keras TensorBoard write_grads prevents training from starting - Python

I am trying to use the TensorBoard callback to visualize the training of my network. The network reuses parts of ResNet (pre-trained and untrainable for my purposes). When omitting the write_grads=True option, everything works as expected and I can see the histograms for my own trainable layers in TensorBoard. However, with write_grads set to True, the last visible output is "Train on 30 samples, validate on 10 samples" and the training itself never seems to start.

I noticed that the main memory used by the Python process slowly increases from a couple of hundred megabytes to several gigabytes during that time. It occurred to me that saving all gradients in ResNet might take a significant amount of space. But since most of the network is untrainable in my particular case, I think this is not a general problem but rather an implementation issue.
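A back-of-envelope check supports that hunch (the numbers below are my own assumptions, not from the logs: roughly 25M parameters in ResNet50 and float32 gradients):

```python
# Rough estimate of the memory one full gradient copy of ResNet50 needs.
# Assumptions: ~25M parameters, float32 (4 bytes per value).
n_params = 25_000_000
bytes_per_value = 4  # float32
grad_mb = n_params * bytes_per_value / 1e6
print(f"one gradient copy: ~{grad_mb:.0f} MB")
```

A single gradient copy is only on the order of 100 MB, so growth to several gigabytes suggests the callback is constructing and buffering gradient ops for every layer, frozen ones included, rather than just storing one set of gradient values.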

Please find below a script to reproduce the issue.

import numpy as np

from keras.callbacks import TensorBoard
from keras.models import Model
from keras.layers import Input, Dense, Conv2D, GlobalMaxPool2D
from keras.applications.resnet50 import ResNet50

x = np.random.normal(size=(30, 200, 200, 3))
xv = np.random.normal(size=(10, 200, 200, 3))
y = np.random.randint(0, 2, size=(30, 5))
yv = np.random.randint(0, 2, size=(10, 5))

tb = TensorBoard(log_dir=r'C:\logs', histogram_freq=1, batch_size=2, write_grads=True)

# Build network that reuses a pre-trained part of ResNet and is then followed by some other layers
# Note that the ResNet part is untrainable!
inputs = Input((200, 200, 3))
net = ResNet50(input_tensor=inputs, weights='imagenet', include_top=False)
for layer in net.layers:
    layer.trainable = False
net = Conv2D(10, 5, activation='relu')(net.layers[141].output)
net = GlobalMaxPool2D()(net)
net = Dense(5, activation='softmax')(net)

model = Model(inputs=inputs, outputs=net)
model.compile(optimizer='sgd', loss='binary_crossentropy'), y, batch_size=2, validation_data=(xv, yv), epochs=10, callbacks=[tb])
  • [X] Check that you are up-to-date with the master branch of Keras.
  • [X] If running on TensorFlow, check that you are up-to-date with the latest version.
Asked Oct 17 '21 at 12:10
jprellberg

4 Answers:

I'm having this same issue. With write_grads=False training starts; with write_grads=True it stalls at the start. I'm also using ResNet50.

Answered Nov 06 '17 at 05:51
paragon00

I have this issue too with a moderately deep model (27 layers, 1.1M parameters), built entirely from Conv2D and Concatenate layers.

I did notice, however, that it will run if I just wait long enough after it prints that line about train/validate sizes. Specifically, each epoch takes an extra ~4 min (with write_grads=True, write_images=False) compared to training without the callback (~2 min per epoch).

I also had issues with my GPU going OOM when computing gradients until I reduced the batch_size passed to the TensorBoard callback by 4x (64 -> 16). Same problem: #6217, #6231
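For reference, a minimal sketch of that mitigation (hypothetical values; it assumes the same keras.callbacks.TensorBoard signature used in the question, with only batch_size and histogram_freq changed):

```python
# Mitigation sketch: shrink the histogram/gradient logging pass.
# Assumption: same keras.callbacks.TensorBoard signature as in the question.
tb_kwargs = dict(
    log_dir='./logs',
    histogram_freq=5,   # log histograms/gradients every 5 epochs, not every epoch
    batch_size=16,      # 4x smaller than the original 64, to avoid GPU OOM
    write_grads=True,
)
# tb = TensorBoard(**tb_kwargs), then pass callbacks=[tb] to
```

A smaller batch_size shrinks the per-step memory of the gradient pass, and a larger histogram_freq means the expensive pass runs far less often; neither removes the startup delay entirely.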

Answered Dec 22 '17 at 01:04
talmo

Same problem here: every time I set write_grads=True, it takes a long time to start training.

Answered Mar 28 '18 at 07:15
JustinhoCHN

I have experienced this issue as well; training starts after about an hour and a half. It took me a whole day to figure out what was wrong with my code and why it wouldn't just start training normally. Could someone at least write in the documentation that this will delay the start of training for a freaking long time? I guess most people just assume it has frozen, give up, and stop it.

Answered Jun 25 '18 at 13:33
janzd