Choosing between different cost functions and activation functions for a neural network

Updated: 2023-01-06 12:27:17

I will answer your questions a little bit out of order, starting with more general answers, and finishing with those specific to your particular experiment.

Activation functions. Different activation functions do, in fact, have different properties. Let's first consider an activation function between two layers of a neural network. The only purpose of an activation function there is to serve as a nonlinearity. If you do not put an activation function between two layers, then the two layers together will serve no better than one, because their effect will still be just a linear transformation. For a long while people were using the sigmoid function and tanh, choosing pretty much arbitrarily, with sigmoid being more popular, until recently, when ReLU became the dominant nonlinearity. The reason people use ReLU between layers is that it is non-saturating (and is also faster to compute). Think about the graph of the sigmoid function. If the absolute value of x is large, then the derivative of the sigmoid function is small, which means that as we propagate the error backwards, the gradient of the error will vanish very quickly as we go back through the layers. With ReLU the derivative is 1 for all positive inputs, so the gradient for those neurons that fired will not be changed by the activation unit at all and will not slow down the gradient descent.
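
To make the vanishing-gradient point concrete, here is a small NumPy sketch (my own illustration, not part of the original answer) comparing the two derivatives at a few input values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))  # about 4.5e-05 at |z| = 10: the gradient all but vanishes
relu_grad = (z > 0).astype(float)               # exactly 1 for every positive input
print(sigmoid_grad)
print(relu_grad)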

For the last layer of the network the activation unit also depends on the task. For regression you will want to use the sigmoid or tanh activation, because you want the result to be between 0 and 1. For classification you will want only one of your outputs to be one and all others zeros, but there's no differentiable way to achieve precisely that, so you will want to use a softmax to approximate it.
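
As a quick illustration of that last point (again my own sketch, not from the original answer), softmax turns arbitrary scores into something very close to a one-hot vector while remaining differentiable:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([5.0, 1.0, -2.0])
print(softmax(logits))        # roughly [0.981, 0.018, 0.001] -- nearly one-hot, yet differentiable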

Your example. Now let's look at your example. Your first example tries to compute the output of AND in the following form:

sigmoid(W1 * x1 + W2 * x2 + B)

Note that W1 and W2 will always converge to the same value, because the output for (x1, x2) should be equal to the output of (x2, x1). Therefore, the model that you are fitting is:

sigmoid(W * (x1 + x2) + B)

x1 + x2 can only take one of three values (0, 1, or 2), and you want to return 0 when x1 + x2 < 2 and 1 when x1 + x2 = 2. Since the sigmoid function is rather smooth, it takes very large values of W and B to make the output close to the desired targets, but with a small learning rate the weights cannot reach those large values quickly. Increasing the learning rate in your first example will increase the speed of convergence.
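
To see roughly how large those values have to be, here is a back-of-the-envelope check I am adding (0.1 and 0.9 are arbitrary stand-ins for "close to 0" and "close to 1"):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid(2.2) ~= 0.9 and sigmoid(-2.2) ~= 0.1, so we need 2W + B ~= 2.2 and W + B ~= -2.2,
# i.e. W ~= 4.4 and B ~= -6.6 -- far larger in magnitude than typical initial weights.
W, B = 4.4, -6.6
for s in (0, 1, 2):                  # s = x1 + x2
    print(s, sigmoid(W * s + B))     # prints roughly 0.001, 0.1 and 0.9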

Your second example converges better because the softmax function is good at making precisely one output equal to 1 and all others 0. Since this is precisely your case, it does converge quickly. Note that sigmoid would also eventually converge to good values, but it would take significantly more iterations (or a higher learning rate).

What to use. Now to the last question: how does one choose which activation and cost functions to use? The following advice will work for the majority of cases (a short sketch that puts it all together follows the list):

  1. If you do classification, use softmax for the last layer's nonlinearity and cross entropy as a cost function.

  2. If you do regression, use sigmoid or tanh for the last layer's nonlinearity and squared error as a cost function.

  3. Use ReLU as a nonlinearity between layers.

  4. Use better optimizers (AdamOptimizer, AdagradOptimizer) instead of GradientDescentOptimizer, or use momentum for faster convergence.
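
Putting that advice together, here is a minimal sketch in the TensorFlow 1.x style that the optimizer names above come from (my own illustrative code, with an arbitrary toy architecture of one 8-unit hidden layer and two classes):

import tensorflow as tf   # TensorFlow 1.x API, matching the optimizer names above

x = tf.placeholder(tf.float32, [None, 2])   # two inputs, e.g. for the AND example
y = tf.placeholder(tf.float32, [None, 2])   # one-hot targets

W1 = tf.Variable(tf.random_normal([2, 8], stddev=0.1))
b1 = tf.Variable(tf.zeros([8]))
W2 = tf.Variable(tf.random_normal([8, 2], stddev=0.1))
b2 = tf.Variable(tf.zeros([2]))

hidden = tf.nn.relu(tf.matmul(x, W1) + b1)   # point 3: ReLU between layers
logits = tf.matmul(hidden, W2) + b2          # no activation here; softmax is folded into the loss

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))   # point 1
train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)    # point 4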