Updated: 2022-06-25 00:36:34
Unfortunately, there's no instruction to do that even in AVX (none that I'm aware of). So you will have to do it manually, like you are right now.
However, your current method is very sub-optimal, and you're relying on .m128i_u8, which is an MSVC extension. Based on my experience with MSVC, it will use an aligned buffer in memory to access the individual elements. This has a very heavy penalty because of partial-word access.
Instead of .m128i_u8, use _mm_extract_epi32(). It requires SSE4.1, but you're already relying on SSE4.1 through _mm_cvtepu8_epi32().
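As a rough sketch of that suggestion, the snippet below widens four bytes with _mm_cvtepu8_epi32() and then reads each 32-bit lane with _mm_extract_epi32() instead of going through .m128i_u8. The helper name widen_and_extract, its signature, and the SSE41_FN macro are hypothetical and only for illustration; they are not from your code. The sketch assumes an x86 target with SSE4.1 and, for the function attribute, GCC or Clang (on MSVC the macro expands to nothing).

```c
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepu8_epi32, _mm_extract_epi32 */
#include <stdint.h>
#include <string.h>

/* Hypothetical convenience macro: enables SSE4.1 for one function on
   GCC/Clang so the file builds without -msse4.1. Empty elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

/* Hypothetical helper: zero-extends 4 bytes to 4 x 32-bit lanes and
   pulls each lane out into a scalar. */
SSE41_FN static void widen_and_extract(const uint8_t bytes[4], uint32_t out[4])
{
    int32_t packed;
    memcpy(&packed, bytes, sizeof packed);      /* safe unaligned 4-byte load */

    __m128i v    = _mm_cvtsi32_si128(packed);   /* bytes in the low 32 bits */
    __m128i wide = _mm_cvtepu8_epi32(v);        /* 4 x u8 -> 4 x u32 */

    /* The lane index must be a compile-time constant. */
    out[0] = (uint32_t)_mm_extract_epi32(wide, 0);
    out[1] = (uint32_t)_mm_extract_epi32(wide, 1);
    out[2] = (uint32_t)_mm_extract_epi32(wide, 2);
    out[3] = (uint32_t)_mm_extract_epi32(wide, 3);
}
```

Each extract moves a lane from the vector register into a general-purpose register, so the values do not have to round-trip through a memory buffer the way the .m128i_u8 accesses appear to with MSVC.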
This situation is particularly bad since you're working with 1-byte granularity. If you were working with 2-byte (16-bit integer) granularity instead, there is an efficient solution using shuffle intrinsics.