Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?

Updated: 2022-04-18 03:18:12

sqrtss gives a correctly rounded result. rsqrtss gives only an approximation to the reciprocal square root, accurate to about 11 bits.
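To make the difference concrete, here is a minimal sketch using the SSE intrinsics that map onto the two instructions. It assumes an x86 target with SSE and a compiler that provides xmmintrin.h; the function names (sqrt_exact, sqrt_approx, rel_error) are made up for illustration and do not come from the original question.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_sqrt_ss, _mm_rsqrt_ss */

/* Correctly rounded square root; this is what sqrtss computes. */
static float sqrt_exact(float x) {
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}

/* sqrt(x) estimated as x * rsqrt(x), using the rsqrtss approximation.
 * Assumes x > 0: for x == 0 the product is 0 * inf, which is NaN. */
static float sqrt_approx(float x) {
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_mul_ss(v, _mm_rsqrt_ss(v)));
}

/* Relative error of an estimate against a positive reference value. */
static float rel_error(float approx, float exact) {
    float d = approx - exact;
    if (d < 0.0f) d = -d;
    return d / exact;
}
```

Comparing rel_error(sqrt_approx(x), sqrt_exact(x)) for a few positive inputs should show errors on the order of 2^-12, against zero for the sqrtss path by construction.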

sqrtss generates a far more accurate result, for when accuracy is required. rsqrtss exists for cases where an approximation suffices but speed matters. If you read Intel's documentation, you will also find an instruction sequence (a reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember correctly) and is still somewhat faster than sqrtss.
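The refinement mentioned above can be sketched as follows. This is not Intel's exact published sequence, only the standard Newton-Raphson update for y ≈ 1/sqrt(x), namely y' = y * (1.5 - 0.5 * x * y * y), applied once to the rsqrtss estimate. The function names are hypothetical, and the code assumes x > 0 on an x86 target with SSE.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rsqrt_ss */

/* rsqrtss estimate followed by one Newton-Raphson step.
 * The step roughly doubles the number of correct bits. */
static float rsqrt_refined(float x) {
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    return y * (1.5f - 0.5f * x * y * y);
}

/* sqrt(x) recovered as x * (1/sqrt(x)); assumes x > 0. */
static float sqrt_refined(float x) {
    return x * rsqrt_refined(x);
}
```

The result is close to single precision but is not guaranteed to be correctly rounded, which is the trade-off against sqrtss.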

Edit: If speed is critical and you really are calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
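A loop over an array might look roughly like this with the packed forms. It is a sketch under a few assumptions: an x86 target with SSE, unaligned loads and stores so the caller does not need 16-byte-aligned buffers, strictly positive inputs for the approximate path, and illustrative function names.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_sqrt_ps, _mm_rsqrt_ps */

/* out[i] = sqrt(in[i]), correctly rounded, four floats per sqrtps. */
static void sqrt_array_exact(const float *in, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_sqrt_ps(_mm_loadu_ps(in + i)));
    for (; i < n; ++i)  /* scalar tail for the last n % 4 elements */
        out[i] = _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(in[i])));
}

/* out[i] ~= sqrt(in[i]) computed as in[i] * rsqrt(in[i]) via rsqrtps.
 * Assumes every in[i] > 0; accuracy is roughly 11 to 12 bits. */
static void sqrt_array_approx(const float *in, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(x, _mm_rsqrt_ps(x)));
    }
    for (; i < n; ++i) {  /* scalar tail */
        __m128 x = _mm_set_ss(in[i]);
        out[i] = _mm_cvtss_f32(_mm_mul_ss(x, _mm_rsqrt_ss(x)));
    }
}
```

Whether the approximate loop is actually faster depends on the CPU and on how much other work surrounds it, so it is worth measuring both versions on the target hardware.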