
Soft Attention vs. Hard Attention

Updated: 2023-12-02 19:46:16

What exactly is hard attention?

To understand this question, we need to dive a little into the problems that hard attention seeks to solve. One of the seminal papers on hard attention is Recurrent Models of Visual Attention, and I would encourage the reader to go through that paper, even if it doesn't seem fully comprehensible at first.

To answer the question of what exactly attention is, I'll pose a different question which I believe is easier to answer: why attention? The paper linked above seeks to answer that question succinctly, and I'll reproduce a part of the reasoning here.

Imagine you were blindfolded and taken to a surprise birthday party, and you just opened your eyes. What would you see?

Now, when we say you see the picture, that's a shorter version of the following, more technically correct, sequence of actions: moving your eyes around over time and gathering information about the scene. You don't see every pixel of the image at once. You attend to certain aspects of the picture one time-step at a time and aggregate the information. Even in such a cluttered picture, for example, you would recognize your uncle Bill and cousin Sam :). Why is that? Because you attend to certain salient aspects of the current image.

That is exactly the kind of power we want to give to our neural network models. Why? Think of it as a form of regularization. (This portion of the answer references the paper.) Your usual convolutional network model does have the ability to recognize cluttered images, but how do we find the exact set of weights that are "good"? That is a difficult task. By providing the network with an architecture-level feature that allows it to attend to different parts of the image sequentially and aggregate information over time (as sketched below), we make that job easier, because the network can now simply learn to ignore the clutter (or so the hope goes).
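To make that idea concrete, here is a minimal PyTorch sketch of the "attend to parts sequentially, aggregate over time" loop. This is not the architecture from Recurrent Models of Visual Attention; the patch size, the linear encoder, and the GRU cell are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlimpseAggregator(nn.Module):
    """Sketch: crop a small patch (a 'glimpse') at each time-step,
    encode it, and fold it into a recurrent hidden state.
    All names and sizes are illustrative, not from the paper."""

    def __init__(self, patch=8, hidden=64):
        super().__init__()
        self.patch = patch
        self.encode = nn.Linear(patch * patch, hidden)  # glimpse encoder
        self.rnn = nn.GRUCell(hidden, hidden)           # aggregates over time

    def forward(self, image, locations):
        # image: (H, W) grayscale; locations: (row, col) patch corners
        h = torch.zeros(1, self.rnn.hidden_size)
        for (r, c) in locations:
            glimpse = image[r:r + self.patch, c:c + self.patch]
            g = torch.relu(self.encode(glimpse.reshape(1, -1)))
            h = self.rnn(g, h)  # fold this glimpse into the running summary
        return h  # summary of everything attended to so far

image = torch.rand(32, 32)
model = GlimpseAggregator()
summary = model(image, locations=[(0, 0), (12, 12), (20, 4)])
print(summary.shape)  # torch.Size([1, 64])
```

The network never sees the whole image at once; it only ever sees the patches it chose to attend to, which is exactly what lets it learn to ignore the clutter.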

I hope this answers the question of what hard attention is. Now, on to the nature of its differentiability. Remember how we conveniently picked the correct spots to look at while looking at the birthday picture? How did we do that? This process involves making choices that are difficult to represent as a differentiable function of the input (the image): for example, based on the image and what you've looked at already, decide where to look next. You could have a neural network that outputs the answer here, but we do not know the correct answer! In fact, there is no correct answer. How then are we to train the network parameters? Neural network training depends critically on a differentiable loss function of the inputs; examples include the log-likelihood loss, the squared loss, etc. But in this case, we do not have a correct answer for where to look next, so how can we define a loss? This is where the field of machine learning called reinforcement learning (RL) comes in. RL allows you to perform gradient-based optimization in the space of policies, using methods such as REINFORCE and actor-critic algorithms.
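As a concrete illustration, here is a minimal sketch of the REINFORCE (score-function) trick applied to exactly this kind of non-differentiable "where to look next" choice. The state vector, the nine candidate locations, and the constant reward are placeholder assumptions; a real model would derive the reward from something like classification success.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Policy network: maps "what we've seen so far" to logits over
# 9 candidate glimpse locations (both sizes are illustrative).
policy = nn.Sequential(nn.Linear(64, 9))
opt = torch.optim.SGD(policy.parameters(), lr=0.1)

state = torch.rand(1, 64)                 # stand-in for the aggregated state
dist = Categorical(logits=policy(state))  # distribution over next locations
loc = dist.sample()                       # non-differentiable choice

reward = torch.tensor(1.0)                # pretend this glimpse helped
loss = (-dist.log_prob(loc) * reward).mean()  # score-function estimator
opt.zero_grad()
loss.backward()                           # gradient still reaches the policy
opt.step()
```

Sampling `loc` blocks ordinary backprop, but weighting the log-probability of the sampled choice by the reward gives an unbiased gradient estimate for the policy anyway; that is the whole trick.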

What is soft attention?

This part of the answer borrows from a paper which goes by the name Teaching Machines to Read and Comprehend. A major problem with RL methods such as REINFORCE is that they have high variance (in terms of the computed gradient of the expected reward), and this variance scales linearly with the number of hidden units in your network. That's not a good thing, especially if you're going to build a large network. Hence, people look for differentiable models of attention. All this means is that the attention term, and as a result the loss function, is a differentiable function of the inputs, so all gradients exist. We can therefore use our standard backpropagation algorithm, along with one of the usual loss functions, to train our network. So what is soft attention?

In the context of text, it refers to the model's ability to associate more importance with certain words in the document than with other tokens. If you're reading a document and have to answer a question based on it, concentrating on certain tokens in the document might help you answer the question better than reading each token as if it were equally important. That is the basic idea behind soft attention in text. The reason it is a differentiable model is that you decide how much attention to pay to each token based purely on that token and the query at hand. For example, you could represent the tokens of the document and the query in the same vector space and use the dot product / cosine similarity as a measure of how much attention you should pay to a particular token, given the query. Note that the cosine similarity operation is completely differentiable with respect to its inputs, so the overall model ends up being differentiable. (The exact model used by the paper differs, and this argument is just for demonstration's sake, although other models do use a dot-product-based attention score.)
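For illustration, here is a minimal sketch of that dot-product scoring scheme (again, not the exact model from the paper). The token and query embeddings are random stand-ins for whatever the model would produce; the point is that every step is differentiable, so ordinary backprop applies.

```python
import torch

torch.manual_seed(0)
tokens = torch.rand(5, 16, requires_grad=True)  # 5 document token vectors
query = torch.rand(16)                          # the question, same space

scores = tokens @ query                 # one dot-product score per token
weights = torch.softmax(scores, dim=0)  # how much attention each token gets
context = weights @ tokens              # attention-weighted summary vector

context.sum().backward()                # gradients exist w.r.t. the tokens
print(weights, tokens.grad.shape)
```

Swapping the raw dot product for cosine similarity just means normalizing both vectors first, which is equally differentiable, so the conclusion is the same.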