Updated: 2021-07-23 21:33:49
Like Jester, I'm surprised that your SIMD code showed any significant improvement. Did you compile the C code with optimization turned on?
The one additional suggestion I can make is to unroll your Packetloop loop. This is a fairly simple optimization and reduces the number of instructions per "iteration" to just two:
pextrb ebx, xmm0, 0
inc dword [ebx * 4 + Hist]
pextrb ebx, xmm0, 1
inc dword [ebx * 4 + Hist]
pextrb ebx, xmm0, 2
inc dword [ebx * 4 + Hist]
...
pextrb ebx, xmm0, 15
inc dword [ebx * 4 + Hist]
If you're using NASM you can use the %rep directive to save some typing:
%assign pixel 0
%rep 16
pextrb ebx, xmm0, pixel      ; ebx = byte number "pixel" of xmm0, zero-extended
inc dword [ebx * 4 + Hist]   ; increment that byte's 32-bit histogram bin
%assign pixel pixel + 1
%endrep
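
If you want a C baseline to time against, here is a minimal sketch of the same unrolling in plain C. Your actual C code isn't shown, so the function name `histogram_unrolled`, its signature, and the tail loop for leftover bytes are all assumptions of mine. Build it with optimization enabled (for example `-O2` with GCC or Clang) before comparing timings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): adds the bytes in
 * pixels[0..count) to a 256-bin histogram, one 16-byte packet per pass.
 * The caller is expected to zero hist[] beforehand. */
void histogram_unrolled(const uint8_t *pixels, size_t count, uint32_t hist[256])
{
    assert(pixels != NULL || count == 0);

    size_t i = 0;

    /* Main loop: the body is written out 16 times, one increment per
     * byte, mirroring the 16 pextrb/inc pairs in the assembly. */
    for (; i + 16 <= count; i += 16) {
        const uint8_t *p = pixels + i;
        hist[p[0]]++;  hist[p[1]]++;  hist[p[2]]++;  hist[p[3]]++;
        hist[p[4]]++;  hist[p[5]]++;  hist[p[6]]++;  hist[p[7]]++;
        hist[p[8]]++;  hist[p[9]]++;  hist[p[10]]++; hist[p[11]]++;
        hist[p[12]]++; hist[p[13]]++; hist[p[14]]++; hist[p[15]]++;
    }

    /* Tail loop (an assumption): handles any bytes left over when
     * count is not a multiple of 16. */
    for (; i < count; i++) {
        hist[pixels[i]]++;
    }
}
```

An optimizing compiler may well unroll the simple one-byte-per-iteration loop on its own, which is one more reason to check that the C build you're measuring has optimization turned on.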