Choosing between different cost functions and activation functions for a neural network

I will answer your questions a little bit out of order, starting with more general answers, and finishing with those specific to your particular experiment.

Activation functions. Different activation functions do, in fact, have different properties. Let's first consider an activation function between two layers of a neural network. The only purpose of an activation function there is to serve as a nonlinearity. If you do not put an activation function between two layers, then the two layers together will serve no better than one, because their effect will still be just a linear transformation. For a long while people used the sigmoid function and tanh, choosing pretty much arbitrarily, with sigmoid being more popular, until recently, when ReLU became the dominant nonlinearity. The reason people use ReLU between layers is that it does not saturate (and it is also faster to compute). Think about the graph of the sigmoid function. If the absolute value of x is large, then the derivative of the sigmoid is small, which means that as we propagate the error backwards, the gradient will vanish very quickly as we go back through the layers. With ReLU the derivative is 1 for all positive inputs, so the gradient for the neurons that fired is not changed by the activation unit at all and does not slow down gradient descent.
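
To make the saturation argument concrete, here is a small NumPy sketch (my own illustration, not part of the original answer) that evaluates both derivatives at a few points: the sigmoid derivative shrinks rapidly as |x| grows, while the ReLU derivative stays at 1 for every positive input.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def d_sigmoid(x):
        # Derivative of the sigmoid: s(x) * (1 - s(x)), which vanishes for large |x|.
        s = sigmoid(x)
        return s * (1.0 - s)

    def d_relu(x):
        # Derivative of ReLU: 1 for x > 0, 0 otherwise.
        return (x > 0).astype(float)

    xs = np.array([0.0, 2.0, 5.0, 10.0])
    print(d_sigmoid(xs))  # ~[0.25, 0.105, 0.0066, 0.000045]
    print(d_relu(xs))     # [0., 1., 1., 1.]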

For the last layer of the network, the activation unit also depends on the task. For regression you will want to use the sigmoid or tanh activation, because you want the result to be between 0 and 1. For classification you will want only one of your outputs to be one and all the others zero, but there is no differentiable way to achieve precisely that, so you will want to use a softmax to approximate it.
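
As a quick illustration (again my own sketch, not code from the question), softmax turns the last layer's logits into a differentiable approximation of a one-hot vector: the outputs sum to 1 and the largest logit takes almost all of the mass.

    import numpy as np

    def softmax(z):
        # Subtract the max for numerical stability; the result is unchanged.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    print(softmax(np.array([1.0, 2.0, 6.0])))  # ~[0.007, 0.018, 0.976]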

Your example. Now let's look at your example. Your first example tries to compute the output of AND in the following form:

sigmoid(W1 * x1 + W2 * x2 + B)

Note that W1 and W2 will always converge to the same value, because the output for (x1, x2) should be equal to the output for (x2, x1). Therefore, the model that you are fitting is:

sigmoid(W * (x1 + x2) + B)

x1 + x2 can only take one of three values (0, 1, or 2), and you want to return 0 when x1 + x2 < 2 and 1 when x1 + x2 = 2. Since the sigmoid function is rather smooth, it takes very large values of W and B to make the output close to the desired one, but because of the small learning rate they cannot reach those large values quickly. Increasing the learning rate in your first example will increase the speed of convergence.
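
A quick numeric check (with hand-picked, purely illustrative values of W and B) shows why the weights have to grow large before the outputs get close to the 0/0/0/1 targets:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for W, B in [(1.0, -1.5), (10.0, -15.0)]:
        outputs = [sigmoid(W * (x1 + x2) + B) for (x1, x2) in inputs]
        print(W, B, np.round(outputs, 3))
    # W=1,  B=-1.5  -> [0.182, 0.378, 0.378, 0.622]  (nowhere near 0/0/0/1)
    # W=10, B=-15.0 -> [0.0,   0.007, 0.007, 0.993]  (close, but only with large W and B)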

Your second example converges better because the softmax function is good at making precisely one output equal to 1 and all the others equal to 0. Since this is precisely your case, it does converge quickly. Note that sigmoid would also eventually converge to good values, but it would take significantly more iterations (or a higher learning rate).

What to use. Now to the last question: how does one choose which activation and cost functions to use? This advice will work for the majority of cases (a short sketch combining these recommendations follows the list):

  1. If you do classification, use softmax for the last layer's nonlinearity and cross entropy as a cost function.

  2. If you do regression, use sigmoid or tanh for the last layer's nonlinearity and squared error as a cost function.

  3. Use ReLU as a nonlinearity between layers.

  4. Use better optimizers (AdamOptimizer, AdagradOptimizer) instead of GradientDescentOptimizer, or use momentum for faster convergence.
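
To tie the four points together, here is a minimal sketch of the AND example following this advice. It assumes the TensorFlow 1.x-style API that the optimizer names above come from (available as tf.compat.v1 under TensorFlow 2); the hidden layer size and learning rate are arbitrary illustrative choices.

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    # AND truth table; labels are one-hot over the two classes.
    X = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
    Y = [[1., 0.], [1., 0.], [1., 0.], [0., 1.]]

    x = tf.placeholder(tf.float32, [None, 2])
    y = tf.placeholder(tf.float32, [None, 2])

    # Hidden layer with ReLU as the nonlinearity between layers (point 3).
    W1 = tf.Variable(tf.random_normal([2, 4]))
    b1 = tf.Variable(tf.zeros([4]))
    hidden = tf.nn.relu(tf.matmul(x, W1) + b1)

    # Last layer produces logits; softmax and cross entropy are applied in the loss (point 1).
    W2 = tf.Variable(tf.random_normal([4, 2]))
    b2 = tf.Variable(tf.zeros([2]))
    logits = tf.matmul(hidden, W2) + b2

    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
    train_step = tf.train.AdamOptimizer(0.1).minimize(loss)  # point 4

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(500):
            sess.run(train_step, feed_dict={x: X, y: Y})
        # Softmax probabilities for the four inputs; the last row should be close to [0, 1].
        print(sess.run(tf.nn.softmax(logits), feed_dict={x: X}))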