Updated: 2023-12-02 09:42:16
Assume that your batch_size for a single GPU is N and the time taken per batch is X seconds.
You can measure training speed as the time taken for the model to converge, but you have to make sure that you feed in the right batch_size: since 2 GPUs have twice the memory of a single GPU, you should linearly scale your batch_size to 2N. It might be deceiving to see that the model still takes X seconds per batch, but keep in mind that your model is now seeing 2N samples per batch, which leads to quicker convergence because you can now train with a higher learning rate.
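The linear scaling described above can be sketched as a small helper. The function name and values below are illustrative, not part of any particular framework's API:

```python
# Sketch of the linear scaling rule: with k GPUs, multiply the
# single-GPU batch_size N by k, and scale the learning rate by the
# same factor (a common heuristic, not a guarantee of convergence).

def scale_for_gpus(batch_size: int, base_lr: float, num_gpus: int):
    """Linearly scale batch size and learning rate with the GPU count."""
    return batch_size * num_gpus, base_lr * num_gpus

# With N = 32 per GPU and 2 GPUs, the effective batch becomes 2N = 64,
# and the learning rate doubles accordingly.
effective_batch, effective_lr = scale_for_gpus(32, 0.001, 2)
print(effective_batch, effective_lr)  # 64 0.002
```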
If both of your GPUs have their memory utilized but are sitting at 40% utilization, there might be multiple reasons:
Your batch_size is small and your GPUs can handle a bigger batch_size.
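One way to check whether your GPUs can handle a bigger batch_size is to keep doubling it until a step fails, then back off. The sketch below uses a hypothetical `try_step` stand-in for running one training step at a given size; in a real run it would raise on an out-of-memory error:

```python
# Minimal sketch (hypothetical helper names) of probing for the largest
# batch_size that fits: double until a step fails, return the last size
# that succeeded.

def find_max_batch_size(try_step, start: int, limit: int = 1 << 16) -> int:
    """Double the batch size until try_step fails or the limit is hit."""
    best = start
    size = start
    while size <= limit:
        try:
            try_step(size)    # run one forward/backward pass at this size
            best = size
            size *= 2         # it fit, so try double
        except RuntimeError:  # e.g. CUDA out of memory in a real run
            break
    return best

# Simulated device that "fits" batches of up to 256 samples:
def fake_step(size):
    if size > 256:
        raise RuntimeError("out of memory")

print(find_max_batch_size(fake_step, 32))  # 256
```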