Updated: 2023-12-02 15:39:52
I'm unable to save a Keras model because I get the error mentioned in the title. I am using tensorflow-gpu. My model has 4 inputs, each of which is a ResNet50. With only a single input, the callbacks below worked perfectly, but with multiple inputs I get the following error:
RuntimeError: Unable to create link (name already exists)
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    EarlyStopping(monitor='val_loss', patience=30, mode='min', min_delta=0.0001, verbose=1),
    ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True, mode='min', verbose=1),
]
Without the callbacks, I could not save the model at the end of training either, because I got the same error, but I was able to fix that with this code:
import tensorflow as tf
from tensorflow.python.keras import backend as K

# Recreate each optimizer weight under a unique name before saving.
with K.name_scope(model.optimizer.__class__.__name__):
    for i, var in enumerate(model.optimizer.weights):
        name = 'variable{}'.format(i)
        model.optimizer.weights[i] = tf.Variable(var, name=name)
This code only works with the single-input model, and it is placed after the training call, model.fit.
With the callbacks, even the above code does not work. This post is somewhat related to my previous one.
I have read that this issue can be fixed with tf-nightly, so I tried that, but it didn't work.
I tested standalone code with generated data in Google Colab and it worked. So I checked the TF version there, and it is the same as mine: 2.3.0. As for CUDA, both Colab and my machine are running:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
Could this be the issue?
Update:
Here is the error output:
113/113 [==============================] - ETA: 0s - loss: 30.0107 - mae: 1.3525
Epoch 00001: val_loss improved from inf to 0.18677, saving model to saved_models/multi_channel_model.h5
Traceback (most recent call last):
File "fine_tuning.py", line 111, in <module>
run()
File "fine_tuning.py", line 104, in run
model.fit(x=train_x_list, y=train_y, validation_split=0.2,
File "/home/abderrezzaq/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/abderrezzaq/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1137, in fit
callbacks.on_epoch_end(epoch, epoch_logs)
File "/home/abderrezzaq/.local/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 412, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/home/abderrezzaq/.local/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 1249, in on_epoch_end
self._save_model(epoch=epoch, logs=logs)
File "/home/abderrezzaq/.local/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 1301, in _save_model
self.model.save(filepath, overwrite=True, options=self._options)
File "/home/abderrezzaq/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1978, in save
save.save_model(self, filepath, overwrite, include_optimizer, save_format,
File "/home/abderrezzaq/.local/lib/python3.8/site-packages/tensorflow/python/keras/saving/save.py", line 130, in save_model
hdf5_format.save_model_to_hdf5(
File "/home/abderrezzaq/.local/lib/python3.8/site-packages/tensorflow/python/keras/saving/hdf5_format.py", line 125, in save_model_to_hdf5
save_optimizer_weights_to_hdf5_group(f, model.optimizer)
File "/home/abderrezzaq/.local/lib/python3.8/site-packages/tensorflow/python/keras/saving/hdf5_format.py", line 593, in save_optimizer_weights_to_hdf5_group
param_dset = weights_group.create_dataset(
File "/home/abderrezzaq/.local/lib/python3.8/site-packages/h5py/_hl/group.py", line 139, in create_dataset
self[name] = dset
File "/home/abderrezzaq/.local/lib/python3.8/site-packages/h5py/_hl/group.py", line 373, in __setitem__
h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 202, in h5py.h5o.link
RuntimeError: Unable to create link (name already exists)
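The last frames of the traceback show h5py refusing to create a dataset under a name that already exists while the optimizer weights are being written. One way to check whether repeated weight names are involved is to count them. The helper below is only a minimal sketch: `find_duplicate_names` is a hypothetical name, and the sample names are made up for illustration.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return a dict mapping each name that occurs more than once to its count."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# Illustrative names only (assumption). With a real model one would pass
# [w.name for w in model.optimizer.weights] instead.
sample_names = ["conv1/kernel/m:0", "conv1/bias/m:0", "conv1/kernel/m:0"]
duplicates = find_duplicate_names(sample_names)
print(duplicates)
```

An empty result would suggest that the collision comes from somewhere other than repeated optimizer weight names.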
Try CUDA 10.1. https://www.tensorflow.org/install/gpu says "TensorFlow supports CUDA® 10.1".
Something is wrong with the ModelCheckpoint callback. Check the checkpoint_path location: is it writable? Also, the reference says "if save_best_only=True, the latest best model according to the quantity monitored will not be overwritten." So you may want to delete the last saved model, or provide a new unique name in checkpoint_path, every time you run the model. Most likely it prevents overwriting the previous model and throws the error.
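Both suggestions (check that the location is writable, and use a fresh file name per run) can be sketched with the standard library alone. `unique_checkpoint_path` is a hypothetical helper, and the timestamp-based naming is an assumption, not something the Keras reference prescribes.

```python
import os
import tempfile
import time


def unique_checkpoint_path(directory, stem="multi_channel_model", ext=".h5"):
    """Create `directory` if needed, check that it is writable, and return
    a path inside it whose file name contains a timestamp."""
    os.makedirs(directory, exist_ok=True)
    if not os.access(directory, os.W_OK):
        raise PermissionError("Cannot write to {!r}".format(directory))
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return os.path.join(directory, "{}_{}{}".format(stem, stamp, ext))


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint_path = unique_checkpoint_path(os.path.join(tmp, "saved_models"))
    print(os.path.basename(checkpoint_path))
```

The returned path would then be passed to ModelCheckpoint in place of a fixed file name. The timestamp has one-second resolution, so two runs started within the same second would still share a name.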