Understanding the linear algebra of Bahdanau attention

Maybe a specific example will help: say you have a 19-word tweet and you want to translate it into another language. You create embeddings for the words and then pass them through a bi-directional LSTM layer of 128 units. The encoder now outputs 19 hidden states of 256 dimensions for each tweet. Say the decoder is uni-directional and has 128 units. It starts translating the words, outputting a hidden state at each time step as it goes.
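A minimal sketch of that encoder, assuming TensorFlow/Keras (the vocabulary size and embedding dimension are made-up numbers, not from the original example):

```python
import tensorflow as tf

vocab_size = 10000   # assumed vocabulary size (illustrative)
max_len = 19         # the 19-word tweet
emb_dim = 100        # assumed embedding size (illustrative)

tokens = tf.keras.Input(shape=(max_len,), dtype="int32")         # (?, 19)
x = tf.keras.layers.Embedding(vocab_size, emb_dim)(tokens)       # (?, 19, 100)
# Bi-directional LSTM with 128 units per direction -> 256 features per time step
enc_out = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(x)         # (?, 19, 256)

encoder = tf.keras.Model(tokens, enc_out)
print(encoder.output_shape)  # (None, 19, 256): 19 hidden states of 256 dims
```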

Now you want to bring Bahdanau attention into this setup. You feed in s_tminus1 of the decoder and all hidden states of the encoder (hj), and you get the context using the following steps (sketched in code after these steps):

Generate the score v * tanh(w * s_tminus1 + u * hj) for each of the 19 encoder hidden states (the tanh is the non-linearity in Bahdanau's additive score).

Take a softmax of the above to get the 19 attention weights for each tweet, then multiply these attention weights by the encoder hidden states; the resulting weighted sum is the context.
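Expressed as a custom Keras layer, the two steps look roughly like this (a sketch, not the original poster's code; self.w, self.u and self.v correspond to the weights discussed below):

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive (Bahdanau) attention: score = v(tanh(w(s_tminus1) + u(hj)))."""
    def __init__(self, n_units):
        super().__init__()
        self.w = tf.keras.layers.Dense(n_units)  # applied to decoder state s_tminus1
        self.u = tf.keras.layers.Dense(n_units)  # applied to encoder states hj
        self.v = tf.keras.layers.Dense(1)        # collapses n_units to a single score

    def call(self, s_tminus1, hj):
        # s_tminus1: (?, 128) -> expand to (?, 1, 128) so it broadcasts over time
        s = tf.expand_dims(s_tminus1, 1)
        # one score per encoder time step: (?, 19, 1)
        scores = self.v(tf.nn.tanh(self.w(s) + self.u(hj)))
        # softmax over the 19 time steps gives the attention weights
        weights = tf.nn.softmax(scores, axis=1)                   # (?, 19, 1)
        # context is the weighted sum of the encoder hidden states
        context = tf.reduce_sum(weights * hj, axis=1)             # (?, 256)
        return context, weights
```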

Note that in the Bahdanau model the decoder should be uni-directional. The shapes would then be as follows:

Assume n = 10 units for the alignment layer that determines w and u. Then the shapes of s_tminus1 and hj are (?, 128) and (?, 19, 256) respectively. Note that s_tminus1 is the single decoder hidden state at t-1 and hj is the set of 19 hidden states of the bi-directional encoder.

We have to expand s_tminus1 to (?, 1, 128) so that it can be added along the time axis later. The layer weights for w, u and v are determined automatically by the framework as (128, 10), (256, 10) and (10, 1) respectively. Notice how self.w(s_tminus1) works out to (?, 1, 10). This is broadcast-added to each of the self.u(hj) outputs to give a shape of (?, 19, 10). The result is fed to self.v and the output is (?, 19, 1), which is the shape we want: a set of 19 weights. Softmaxing this gives the attention weights.
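To check these shapes, you can call the layer sketched above on dummy tensors (the batch size of 32 is arbitrary):

```python
s_tminus1 = tf.random.normal((32, 128))      # single decoder state at t-1
hj = tf.random.normal((32, 19, 256))         # 19 bi-directional encoder states

attn = BahdanauAttention(n_units=10)
context, weights = attn(s_tminus1, hj)
print(weights.shape)   # (32, 19, 1) -- one attention weight per time step
print(context.shape)   # (32, 256)   -- the weighted sum of encoder states
```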

Multiplying these attention weights with the encoder hidden states and summing over the time axis returns the context.

Hope this clarifies the shapes of the various tensors and layer weights.

To answer your other questions: the dimensions of ht and hs can be different, as shown in the example above. As to your other question, I have seen the two vectors being concatenated and then a single weight matrix applied to them; at least that is what I remember reading in the original paper.
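For what it is worth, a sketch of that concatenation variant, reusing the dummy s_tminus1 and hj from above (splitting one weight matrix across the concatenated vector is algebraically equivalent to the separate w and u used earlier):

```python
# Broadcast the decoder state across the 19 time steps, concatenate with the
# encoder states, and apply a single Dense layer instead of separate w and u.
s = tf.tile(tf.expand_dims(s_tminus1, 1), [1, 19, 1])    # (?, 19, 128)
concat = tf.concat([s, hj], axis=-1)                     # (?, 19, 384)
scores = tf.keras.layers.Dense(1)(
    tf.nn.tanh(tf.keras.layers.Dense(10)(concat)))       # (?, 19, 1)
```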