且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

ARM Neon:将非零字节的第n个位置存储在8字节向量通道中

更新时间:2023-11-14 09:47:22

事实证明这根本不简单.

This turns out to be not at all simple.

天真高效的方法始于简单地获取索引(只需使用位掩码加载0 1 2 3 4 5 6 7vand的静态向量).但是,为了随后在输出矢量的一端(在它们代表的输入通道的不同通道中)收集它们,则需要进行一些任意置换操作.只有一个指令能够任意排列向量vtbl(或vtbx,这基本上是同一件事).但是,vtbl按目的地顺序获取源索引的向量,这实际上与您要生成的内容完全相同.因此,为了产生最终结果,您需要使用最终结果,因此,天真的高效解决方案是不可能的. QED.

The naïve efficient approach starts with trivially getting the indices (just load a static vector of 0 1 2 3 4 5 6 7 and vand it with the bitmask). However, in order to then collect them at one end of the output vector - in different lanes to the input lanes they represent - you then need some arbitrary permutation operation. There's only one instruction capable of arbitrarily permuting vectors, vtbl (or vtbx, which is essentially the same thing). However, vtbl takes a vector of source indices in destination order, which turns out to be the exact same thing you're trying to produce. Thus in order to produce your final result, you need to use your final result, therefore the naïve efficient solution is impossible; QED.

基本问题是您实际上在做的是对向量进行排序,这本质上不是并行的SIMD操作. NEON是专为媒体处理而设计的并行SIMD指令集,对于更通用的矢量处理中任何与数据相关的/水平/分散/聚集操作,实际上并没有切入.

The fundamental problem is that what you're effectively doing is sorting a vector, which is inherently not a parallel SIMD operation. NEON is a parallel SIMD instruction set designed for media processing, and is really not cut out for any of the data-dependent/horizontal/scatter-gather operations of more general vector processing.

为了证明这一点,我确实在纯NEON中做到了这一点,根本没有任何标量代码,这是可怕的;我能想到的***的一个或两个移位NEON指令"是一些基于条件选择的旋转位掩码累积技巧.如果不清楚,我建议逐步进入调试器或模拟器以遵循其功能(

To prove the point, I did manage to do this in pure NEON, without any scalar code at all, and it's horrible; the best "one or two bit-shift NEON instructions" I could come up with is some conditional-select-based rotating bitmask accumulation trickery. If it's not clear, I'd suggest stepping through in a debugger or simulator to follow what it does (example):

// d0 contains input vector
vmov.u8 d1, #0
vmov.u8 d2, #0
vmvn.u8 d3, #0
vdup.u8 d4, d0[0]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[1]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[2]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[3]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[4]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[5]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[6]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[7]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vbic.u8 d1, d1, d3
// d1 contains output vector

作弊并使用循环(必须沿相反的方向旋转d0,以便我们可以通过d0[0]访问每个原始车道)使它变小,但实际上并没有那么糟糕:

Cheating and using a loop (which necessitates rotating d0 in the opposite direction such that we can access each original lane through d0[0]) makes it smaller, but not really any less awful:

vmov.u8 d1, #0
vmov.u8 d2, #0
vmvn.u8 d3, #0
mov r0, #8
1:
vdup.u8 d4, d0[0]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
subs r0, r0, #1
vext.u8 d0, d0, d0, #1
vsub.u8 d1, d1, d3
bne 1b
vbic.u8 d1, d1, d3

理想情况下,如果可以对算法的其他部分进行重做以避免需要向量的非恒定排列,则应这样做.

Ideally, if it's at all possible to rework other parts of the algorithm to avoid needing non-constant permutations of vectors, do that instead.