Updated: 2023-12-02 09:42:16
Assume that your batch_size for a single GPU is N and the time taken per batch is X seconds.
You can measure training speed as the time taken for the model to converge, but you have to make sure that you feed in the right batch_size: since 2 GPUs have twice the memory of a single GPU, you should linearly scale your batch_size to 2N. It might be deceiving to see that the model still takes X seconds per batch, but keep in mind that your model is now seeing 2N samples per batch, which leads to quicker convergence because you can now train with a higher learning rate.
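The linear scaling described above can be sketched as a small helper. The function name and values below are illustrative, not part of any particular framework's API:

```python
# Sketch of the linear scaling rule: with k GPUs, multiply the
# single-GPU batch_size N by k, and scale the learning rate by the
# same factor (a common heuristic, not a guarantee of convergence).

def scale_for_gpus(batch_size: int, base_lr: float, num_gpus: int):
    """Linearly scale batch size and learning rate with the GPU count."""
    return batch_size * num_gpus, base_lr * num_gpus

# With N = 32 per GPU and 2 GPUs, the effective batch becomes 2N = 64,
# and the learning rate doubles accordingly.
effective_batch, effective_lr = scale_for_gpus(32, 0.001, 2)
print(effective_batch, effective_lr)  # 64 0.002
```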
If both of your GPUs have their memory utilized but are sitting at 40% utilization, there might be multiple reasons:
Your batch_size is small and your GPUs can handle a bigger batch_size.
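One way to check whether your GPUs can handle a bigger batch_size is to keep doubling it until a step fails, then back off. The sketch below uses a hypothetical `try_step` stand-in for running one training step at a given size; in a real run it would raise on an out-of-memory error:

```python
# Minimal sketch (hypothetical helper names) of probing for the largest
# batch_size that fits: double until a step fails, return the last size
# that succeeded.

def find_max_batch_size(try_step, start: int, limit: int = 1 << 16) -> int:
    """Double the batch size until try_step fails or the limit is hit."""
    best = start
    size = start
    while size <= limit:
        try:
            try_step(size)    # run one forward/backward pass at this size
            best = size
            size *= 2         # it fit, so try double
        except RuntimeError:  # e.g. CUDA out of memory in a real run
            break
    return best

# Simulated device that "fits" batches of up to 256 samples:
def fake_step(size):
    if size > 256:
        raise RuntimeError("out of memory")

print(find_max_batch_size(fake_step, 32))  # 256
```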