Signed and unsigned 64-bit * 64-bit to 128-bit multiplication with SIMD

I have created a function which does 64-bit * 64-bit to 128-bit using SIMD. Currently I have implemented it using SSE2 (actually SSE4.1). This means it does two 64b*64b to 128b products at the same time. The same idea could be extended to AVX2 or AVX512, giving four or eight 64b*64b to 128b products at the same time. I based my algorithm on http://www.hackersdelight.org/hdcodetxt/muldws.c.txt

That algorithm does one unsigned multiplication, one signed multiplication, and two signed * unsigned multiplications. The signed * signed and unsigned * unsigned products are easy to do using _mm_mul_epi32 and _mm_mul_epu32. But the mixed signed * unsigned products caused me trouble. Consider, for example:

int32_t x = 0x80000000;
uint32_t y = 0x7fffffff;
int64_t z = (int64_t)x*y;

The double-word product should be 0xc000000080000000. But how can you get this if you assume your compiler does not know how to handle mixed types? This is what I came up with:

int64_t sign = x<0; sign*=-1;        //get the sign and make it all ones
uint32_t t = x<0 ? -(uint32_t)x : x;    //if x<0 take two's complement again (abs without signed overflow)
uint64_t prod = (uint64_t)t*y;       //unsigned product
int64_t z = (prod ^ sign) - sign;    //take two's complement based on the sign
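
With the example values above this works out as: sign = 0xffffffffffffffff (since x < 0), t = 0x80000000, prod = 0x80000000 * 0x7fffffff = 0x3fffffff80000000, and z = (prod ^ sign) - sign = ~prod + 1 = 0xc000000080000000, which is the expected double-word product.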

Using SSE this can be done like this:

__m128i xh;    //(xl2, xh2, xl1, xh1) high is signed, low unsigned
__m128i yl;    //(yh2, yl2, yh1, yl1)
__m128i xs     = _mm_cmpgt_epi32(_mm_setzero_si128(), xh); // get sign
        xs     = _mm_shuffle_epi32(xs, 0xA0);              // extend sign
__m128i t      = _mm_sign_epi32(xh,xh);                    // abs(xh)
__m128i prod   = _mm_mul_epu32(t, yl);                     // unsigned (xh2*yl2,xh1*yl1)
__m128i inv    = _mm_xor_si128(prod,xs);                   // invert bits if negative
__m128i z      = _mm_sub_epi64(inv,xs);                    // add 1 if negative

This gives the correct result. But I have to do this twice (once when squaring) and it's now a significant fraction of my function. Is there a more efficient way of doing this with SSE4.2, AVX2 (four 128bit products), or even AVX512 (eight 128bit products)?

Maybe there are more efficient ways of doing this than with SIMD? It's a lot of calculations to get the upper word.

Edit: based on the comment by @ElderBug it looks like the way to do this is not with SIMD but with the mul instruction. For what it's worth, in case anyone wants to see how complicated this is, here is the full working function (I just got it working so I have not optimized it, but I don't think it's worth it).

void muldws1_sse(__m128i x, __m128i y, __m128i *lo, __m128i *hi) {
    __m128i lomask = _mm_set1_epi64x(0xffffffff);

    __m128i xh     = _mm_shuffle_epi32(x, 0xB1);    // x0l, x0h, x1l, x1h
    __m128i yh     = _mm_shuffle_epi32(y, 0xB1);    // y0l, y0h, y1l, y1h

    __m128i xs     = _mm_cmpgt_epi32(_mm_setzero_si128(), xh);
    __m128i ys     = _mm_cmpgt_epi32(_mm_setzero_si128(), yh);
            xs     = _mm_shuffle_epi32(xs, 0xA0);
            ys     = _mm_shuffle_epi32(ys, 0xA0);

    __m128i w0     = _mm_mul_epu32(x,  y);          // x0l*y0l, x1l*y1l
    __m128i w3     = _mm_mul_epi32(xh, yh);         // x0h*y0h, x1h*y1h
            xh     = _mm_sign_epi32(xh,xh);
            yh     = _mm_sign_epi32(yh,yh);

    __m128i w1     = _mm_mul_epu32(x,  yh);         // x0l*y0h, x1l*y1h
    __m128i w2     = _mm_mul_epu32(xh, y);          // x0h*y0l, x1h*y1l

    __m128i yinv   = _mm_xor_si128(w1,ys);          // invert bits if negative
            w1     = _mm_sub_epi64(yinv,ys);         // add 1
    __m128i xinv   = _mm_xor_si128(w2,xs);          // invert bits if negative
            w2     = _mm_sub_epi64(xinv,xs);         // add 1

    __m128i w0l    = _mm_and_si128(w0, lomask);
    __m128i w0h    = _mm_srli_epi64(w0, 32);

    __m128i s1     = _mm_add_epi64(w1, w0h);         // xl*yh + w0h;
    __m128i s1l    = _mm_and_si128(s1, lomask);      // lo(xl*yh + w0h)
    __m128i s1h    = _mm_srai_epi64(s1, 32);

    __m128i s2     = _mm_add_epi64(w2, s1l);         //xh*yl + s1l
    __m128i s2l    = _mm_slli_epi64(s2, 32);
    __m128i s2h    = _mm_srai_epi64(s2, 32);           //arithmetic shift right

    __m128i hi1    = _mm_add_epi64(w3, s1h);
            hi1    = _mm_add_epi64(hi1, s2h);

    __m128i lo1    = _mm_add_epi64(w0l, s2l);
    *hi = hi1;
    *lo = lo1;
}

It gets worse. There is no _mm_srai_epi64 intrinsic/instruction until AVX512, so I had to make my own.

static inline __m128i _mm_srai_epi64(__m128i a, int b) {
    // Emulated 64-bit arithmetic right shift; only the b == 32 case used above is handled.
    __m128i sra  = _mm_srai_epi32(a, 32);          // every dword filled with its own sign bit
    __m128i srl  = _mm_srli_epi64(a, 32);          // high dwords shifted down into the low dwords
    __m128i mask = _mm_set_epi32(-1, 0, -1, 0);    // select sra for the high dwords, srl for the low
    __m128i out  = _mm_blendv_epi8(srl, sra, mask);
    return out;
}

The right way to think about the throughput limits of integer multiplication using various instructions is in terms of how many "product bits" you can compute per cycle.

mulx produces one 64x64 -> 128 result every cycle; that's 64x64 = 4096 "product bits per cycle"

If you piece together a multiplier on SIMD out of instructions that do 32x32 -> 64 bit multiplies, you need to be able to get four results every cycle to match mulx (4x32x32 = 4096). If there were no arithmetic other than the multiplies, you'd just break even on AVX2. Unfortunately, as you've noticed, there's lots of arithmetic other than the multiplies, so this is a total non-starter on current-generation hardware.
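
For comparison, the scalar route through mul/mulx needs none of the sign gymnastics above. A minimal sketch of a signed 64x64 -> 128 multiply, assuming GCC/Clang on x86-64 with __int128 support (the function name muldws1_scalar is just illustrative), could look like this:

#include <stdint.h>

// Signed 64x64 -> 128: the compiler typically lowers this to a single widening imul,
// producing both halves of the product at once.
static inline void muldws1_scalar(int64_t x, int64_t y, uint64_t *lo, int64_t *hi) {
    __int128 p = (__int128)x * y;   // full 128-bit signed product
    *lo = (uint64_t)p;              // low 64 bits
    *hi = (int64_t)(p >> 64);       // high 64 bits
}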