keras TensorBoard write_grads prevents training from starting - Python
I am trying to use the TensorBoard callback to visualize the training of my network. The network reuses parts of ResNet (pre-trained and untrainable for my purposes). When omitting the write_grads=True option everything works as expected and I can see the histograms for my own trainable layers in TensorBoard. However, with write_grads set to True the last visible output is "Train on 30 samples, validate on 10 samples" and the training itself never seems to start.
I noticed that the main memory used by the python process slowly increases from a couple of hundred megabytes to several gigabytes during that time. It occurred to me that saving all gradients in ResNet might take a significant amount of space. But since most of it is untrainable in my particular case, I think this is not a general problem but rather an implementation issue.
Please find below a script to reproduce the issue.
import numpy as np
from keras.callbacks import TensorBoard
from keras.models import Model
from keras.layers import Input, Dense, Conv2D, GlobalMaxPool2D
from keras.applications.resnet50 import ResNet50
x = np.random.normal(size=(30, 200, 200, 3))
xv = np.random.normal(size=(10, 200, 200, 3))
y = np.random.randint(0, 2, size=(30, 5))
yv = np.random.randint(0, 2, size=(10, 5))
tb = TensorBoard(log_dir=r'C:\logs', histogram_freq=1, batch_size=2, write_grads=True)
# Build network that reuses a pre-trained part of ResNet and is then followed by some other layers
# Note that the ResNet part is untrainable!
inputs = Input((200, 200, 3))
net = ResNet50(input_tensor=inputs, weights='imagenet', include_top=False)
for layer in net.layers:
    layer.trainable = False
net = Conv2D(10, 5, activation='relu')(net.layers[141].output)
net = GlobalMaxPool2D()(net)
net = Dense(5, activation='softmax')(net)
model = Model(inputs=inputs, outputs=net)
model.compile(optimizer='sgd', loss='binary_crossentropy')
model.fit(x, y, batch_size=2, validation_data=(xv, yv), epochs=10, callbacks=[tb])
- [X] Check that you are up-to-date with the master branch of Keras.
- [X] If running on TensorFlow, check that you are up-to-date with the latest version.
4 Answers:
I'm having this same issue as well. With write_grads=False training starts; with write_grads=True it stalls at the start. I'm also using ResNet50.
I have this issue too with a moderately deep (27 layers, 1.1M) model, all Conv2D and Concatenate layers.
I did notice, however, that it'll run if I just wait long enough after it prints that line about train/validate sizes. Specifically, it takes an extra ~4 min per epoch (with write_grads=True, write_images=False) as compared to training without the callback (~2 min per epoch).
I also had issues with my GPU going OOM when computing gradients until I reduced the batch_size on the TensorBoard call by 4x (64 -> 16). Same problem: #6217, #6231
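The batch_size workaround described above can be sketched roughly as follows. This is only an illustration, not the commenter's actual code: the helper name tensorboard_kwargs and its shrink_factor parameter are hypothetical, and the keyword names (log_dir, histogram_freq, batch_size, write_grads) are simply the ones used in the script from the question, so they assume the same older Keras version.

```python
def tensorboard_kwargs(log_dir, train_batch_size, shrink_factor=4, write_grads=True):
    """Build keyword arguments for the TensorBoard callback used in the question.

    shrink_factor is a hypothetical knob (not a Keras parameter): the callback's
    own batch_size is set to the training batch size divided by this factor,
    mirroring the 64 -> 16 reduction reported above.
    """
    if shrink_factor < 1:
        raise ValueError("shrink_factor must be >= 1")
    return {
        "log_dir": log_dir,
        "histogram_freq": 1,
        # Never go below a batch size of 1.
        "batch_size": max(1, train_batch_size // shrink_factor),
        "write_grads": write_grads,
    }


kwargs = tensorboard_kwargs("logs", train_batch_size=64)
print(kwargs["batch_size"])  # 16
```

With the Keras version from the question, the result would presumably be passed as TensorBoard(**kwargs). Whether a smaller callback batch size also shortens the delay before training starts is not something this thread establishes; the comment only reports that it avoided the GPU OOM.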
Same problem here: every time I set write_grads=True, it takes a long time to start training.
I have experienced this issue as well. Training starts after about an hour and a half. It took me a whole day to figure out what was wrong with my code and why it wouldn't just start training normally. Could anyone at least write in the documentation that this will delay the start of training for a freaking long time? I guess most people would just assume it froze, give up, and stop it.