keras TensorBoard write_grads prevents training from starting - Python
I am trying to use the TensorBoard callback to visualize the training of my network. The network reuses parts of ResNet (pre-trained and untrainable for my purposes). When I omit the `write_grads=True` option, everything works as expected and I can see the histograms for my own trainable layers in TensorBoard. However, with `write_grads` set to `True`, the last visible output is

Train on 30 samples, validate on 10 samples

and the training itself never seems to start.
I noticed that the main memory used by the Python process slowly increases from a couple of hundred megabytes to several gigabytes during that time. It occurred to me that saving all gradients in ResNet might take a significant amount of space. But since most of the network is untrainable in my particular case, I think this is not a general problem but rather an implementation issue.
Please find below a script to reproduce the issue.
```python
import numpy as np
from keras.callbacks import TensorBoard
from keras.models import Model
from keras.layers import Input, Dense, Conv2D, GlobalMaxPool2D
from keras.applications.resnet50 import ResNet50

# Random stand-in data: 30 training and 10 validation samples
x = np.random.normal(size=(30, 200, 200, 3))
xv = np.random.normal(size=(10, 200, 200, 3))
y = np.random.randint(0, 2, size=(30, 5))
yv = np.random.randint(0, 2, size=(10, 5))

tb = TensorBoard(log_dir=r'C:\logs', histogram_freq=1, batch_size=2, write_grads=True)

# Build network that reuses a pre-trained part of ResNet and is then followed by some other layers
# Note that the ResNet part is untrainable!
inputs = Input((200, 200, 3))
net = ResNet50(input_tensor=inputs, weights='imagenet', include_top=False)
for layer in net.layers:
    layer.trainable = False
# Take the output tensor of the last ResNet layer
net = Conv2D(10, 5, activation='relu')(net.layers[-1].output)
net = GlobalMaxPool2D()(net)
net = Dense(5, activation='softmax')(net)

model = Model(inputs=inputs, outputs=net)
model.compile(optimizer='sgd', loss='binary_crossentropy')
model.fit(x, y, batch_size=2, validation_data=(xv, yv), epochs=10, callbacks=[tb])
```
- [X] Check that you are up-to-date with the master branch of Keras.
- [X] If running on TensorFlow, check that you are up-to-date with the latest version.
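To check whether the frozen ResNet part could plausibly be responsible, it may help to count how many weight tensors are trainable versus frozen. The sketch below assumes (this is not confirmed anywhere in this thread) that the callback sets up gradient summaries for every weight tensor, not only the trainable ones. The helper name `count_weight_tensors` is hypothetical; it relies only on the `trainable_weights` and `non_trainable_weights` attributes that Keras layers expose, and the demo uses stand-in objects so it runs without Keras.

```python
from types import SimpleNamespace


def count_weight_tensors(layers):
    """Return (trainable, frozen) counts of weight tensors across layers.

    Works with any objects that expose `trainable_weights` and
    `non_trainable_weights` lists, as Keras layers do.
    """
    trainable = sum(len(layer.trainable_weights) for layer in layers)
    frozen = sum(len(layer.non_trainable_weights) for layer in layers)
    return trainable, frozen


# Stand-in layers for illustration only (not real Keras layers):
# three frozen layers and one trainable head, each with a kernel and a bias.
frozen_layer = SimpleNamespace(trainable_weights=[],
                               non_trainable_weights=["kernel", "bias"])
head_layer = SimpleNamespace(trainable_weights=["kernel", "bias"],
                             non_trainable_weights=[])

print(count_weight_tensors([frozen_layer] * 3 + [head_layer]))  # (2, 6)
```

With the model from the script, `count_weight_tensors(model.layers)` should report a handful of trainable tensors against a few hundred frozen ones, which would be consistent with the long setup time if the assumption above holds.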
I'm having this same issue as well. With `write_grads=False` training starts; with `write_grads=True` it stalls at the start. I'm also using ResNet50.
I have this issue too with a moderately deep (27 layers, 1.1M) model, all
I did notice, however, that it'll run if I just wait long enough after it prints that line about train/validate sizes. Specifically, it takes an extra ~4 min per epoch (with `write_grads=True, write_images=False`) as compared to training without the callback (~2 min per epoch).
I also had issues with my GPU going OOM when computing gradients until I reduced the `batch_size` on the TensorBoard call by 4x (64 -> 16).

Same problem: #6217, #6231
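The batch-size workaround mentioned above could be sketched as follows. This is only an illustration: the helper `tensorboard_kwargs` and its `shrink_factor` parameter are hypothetical names, the factor of 4 is simply the value reported in this thread, and the keyword arguments are the ones already used in the reproduction script.

```python
def tensorboard_kwargs(log_dir, fit_batch_size, shrink_factor=4, write_grads=True):
    """Build keyword arguments for the TensorBoard callback.

    The callback's batch size is derived from the training batch size
    divided by `shrink_factor`, and never drops below 1.
    """
    tb_batch_size = max(1, fit_batch_size // shrink_factor)
    return {
        "log_dir": log_dir,
        "histogram_freq": 1,
        "batch_size": tb_batch_size,
        "write_grads": write_grads,
    }


kwargs = tensorboard_kwargs("logs", fit_batch_size=64)
print(kwargs["batch_size"])  # 16

# Intended use with Keras (left as a comment so the sketch runs without it):
# tb = TensorBoard(**kwargs)
```

Whether a smaller callback batch size is enough to avoid the OOM will depend on the model and the GPU; it does not address the slow start itself.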
Same problem here: every time I set `write_grads=True`, it takes a long time to start training.
I have experienced this issue as well. Training starts after about an hour and a half. It took me a whole day to figure out what was wrong with my code and why it wouldn't just start training normally. Could anyone at least write in the documentation that this will delay the start of training for a freaking long time? I guess most people would just give up, assume it froze, and stop it.