Generating CMOV instructions with Microsoft compilers

Updated: 2022-04-26 01:36:18

It is extremely difficult, if not downright impossible, to get Microsoft's 32-bit C/C++ compiler to emit CMOVcc instructions.

What you have to remember is that conditional moves were first introduced with the Pentium Pro processor, and while Microsoft had a compiler switch that would tune the generated code for this 6th generation processor (the long-deprecated /G6), they never emitted code that would run exclusively on this processor. The code still needed to run on 5th generation processors (i.e., Pentium and AMD K6), so it couldn't use CMOVcc instructions because those would have generated illegal instruction exceptions. Unlike Intel's compiler, global dynamic dispatching was not (and is still not) implemented.

Also, it is worth noting that no switch was ever introduced to target exclusively 6th generation processors and later. There's no /arch:CMOV or whatever they might call it. Supported values for the /arch switch go straight from IA32 (the lowest common denominator, for which CMOV would be potentially illegal) to SSE. However, the documentation does confirm that, as one might expect, enabling SSE or SSE2 code generation implicitly enables the use of conditional-move instructions and anything else that was introduced before SSE:

In addition to using the SSE and SSE2 instructions, the compiler also uses other instructions that are present on the processor revisions that support SSE and SSE2. An example is the CMOV instruction that first appeared on the Pentium Pro revision of the Intel processors.

Therefore, in order to have any hope of getting the compiler to emit CMOV instructions, you must set /arch:SSE or higher. Nowadays, of course, this is no big deal: you can simply set /arch:SSE or /arch:SSE2 and be safe, since all modern processors support these instruction sets.

But that's only half of the battle. Even when you have the right compiler switches enabled, it is extremely difficult to get MSVC to emit CMOV instructions. Here are two important observations:

  1. MSVC 10 (Visual Studio 2010) and earlier virtually never generate CMOV instructions. I've never seen them in the output, no matter how many variations of source code I've tried. I say "virtually" because there might be some crazy edge case that I missed, but I very much doubt it. None of the optimization flags have any effect on this.

However, MSVC 11 (Visual Studio 2012) introduced significant improvements to the code generator, at least in this aspect. This and later versions of the compiler now seem to be at least aware of the existence of the CMOVcc instructions, and may emit them under the right conditions (i.e., /arch:SSE or later, and use of the conditional operator, as described below).

  2. I've found that the most effective way to cajole the compiler into emitting a CMOV instruction is to use the conditional operator instead of a long-form if-else statement. Although these two constructs should be completely equivalent as far as the code generator is concerned, they are not.

In other words, while you might see the following translated to a branchless CMOVLE instruction:

int value = (a < b) ? a : b;

you will always get branching code for the following sequence:

int value;
if (a < b)    value = a;
else          value = b;
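
To compare the two forms yourself, it may help to put each in its own function and look at the assembly listing. The following is only a sketch: the file name select.cpp and the function names SelectTernary and SelectIfElse are made up for illustration, and whether a CMOVcc actually appears depends on the compiler version and switches discussed above.

```cpp
// Hypothetical file name: select.cpp
// One way to inspect the generated code with MSVC is an assembly listing:
//   cl /c /O2 /arch:SSE2 /FAs select.cpp
// (/FAs writes select.asm with the source lines interleaved.)

// Conditional-operator form: the shape the 32-bit compiler is more
// likely to translate into a CMOVcc instruction.
int SelectTernary(int a, int b)
{
    return (a < b) ? a : b;
}

// Long-form if-else: identical semantics, but expect branching code
// from the 32-bit compiler.
int SelectIfElse(int a, int b)
{
    int value;
    if (a < b) value = a;
    else       value = b;
    return value;
}
```

Both functions return the same result for every input; only the generated instructions are expected to differ.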

At the very least, even if your use of the conditional operator doesn't cause a CMOV instruction (such as on MSVC 10 or earlier), you might still be lucky enough to get branchless code by some other means—e.g., SETcc or clever use of SBB and NEG/NOT/INC/DEC. This is what the disassembly you've shown in the question uses, and although it is not quite as optimal as CMOVcc, it's certainly comparable and the difference is not worth worrying about. (The only other branching instruction is part of the loop.)


If you truly want branchless code (which you often do when hand-optimizing), and you're not having any luck getting the compiler to generate the code you want, you'll need to get more clever in how you write the source code. I've had good luck with writing code that computes the result branchlessly using bitwise or arithmetic operators.
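
As a rough illustration of that approach, here is a generic mask-based select. The helper name SelectBranchless is hypothetical, and the sketch assumes the usual two's-complement wraparound when converting the unsigned result back to a signed type (guaranteed from C++20, and what MSVC does in practice). It shows the source-level idea only; it makes no promise about the exact instructions a given compiler emits.

```cpp
#include <cstdint>

// Hypothetical helper: returns a when cond is true and b otherwise,
// without an if statement or conditional operator in the source.
std::int32_t SelectBranchless(bool cond, std::int32_t a, std::int32_t b)
{
    // cond converts to 0 or 1; subtracting it from 0 in unsigned
    // arithmetic yields 0x00000000 or 0xFFFFFFFF.
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    const std::uint32_t ua = static_cast<std::uint32_t>(a);
    const std::uint32_t ub = static_cast<std::uint32_t>(b);
    // Keep the bits of a where the mask is set, the bits of b elsewhere.
    return static_cast<std::int32_t>((ua & mask) | (ub & ~mask));
}
```

For example, SelectBranchless(x < 0, -x, x) would pick between two already-computed values without a branch in the source.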

For example, you might wish that the following function generated optimal code:

int Minimum(int a, int b)
{
    return (a < b) ? a : b;
}

You followed rule #2 and used a conditional operator, but if you're using an older version of the compiler, you'll get branching code anyway. Outsmart the compiler using the classic trick:

int Minimum_Optimized(int a, int b)
{
    return (b + ((a - b) & -(a < b)));
}

The resulting object code is not perfectly optimal (it contains a CMP instruction that is redundant since SUB already sets the flags), but it is branchless and will therefore still be significantly faster than the original attempt on random inputs that cause branch prediction to fail.

As another example, imagine that you want to determine whether a 64-bit integer is negative in a 32-bit application. You write the following self-evident code:

bool IsNegative(int64_t value)
{
    return (value < 0);
}

and will find yourself sorely disappointed in the results. GCC and Clang optimize this sensibly, but MSVC spits out a nasty conditional branch. The (non-portable) trick is realizing that the sign bit is in the upper 32 bits, so you can isolate and test that explicitly using bitwise manipulation:

bool IsNegative_Optimized(int64_t value)
{
    return (static_cast<int32_t>((value & 0xFFFFFFFF00000000ULL) >> 32) < 0);
}
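
Because the rewritten version relies on the bit layout, it seems worth checking it against the straightforward one on boundary values. The sketch below repeats both functions so that it compiles on its own; the checker IsNegativeVariantsAgree and its sample values are additions for illustration, not part of the original code.

```cpp
#include <cstdint>

bool IsNegative(std::int64_t value)
{
    return (value < 0);
}

bool IsNegative_Optimized(std::int64_t value)
{
    // Isolate the upper 32 bits, which contain the sign bit.
    return (static_cast<std::int32_t>((value & 0xFFFFFFFF00000000ULL) >> 32) < 0);
}

// Hypothetical self-check: true when both versions agree on a handful
// of boundary values.
bool IsNegativeVariantsAgree()
{
    const std::int64_t samples[] = {
        0, 1, -1, INT64_MAX, INT64_MIN,
        0x00000000FFFFFFFFLL,   // low word all ones, still positive
        -0x100000000LL          // low word zero, negative
    };
    for (std::int64_t v : samples)
    {
        if (IsNegative(v) != IsNegative_Optimized(v))
            return false;
    }
    return true;
}
```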

In addition, one of the commenters suggests using inline assembly. While this is possible (Microsoft's 32-bit compiler supports inline assembly), it is often a poor choice. Inline assembly disrupts the optimizer in rather significant ways, so unless you're writing significant swaths of code in inline assembly, there is unlikely to be a substantial net performance gain. Furthermore, Microsoft's inline assembly syntax is extremely limited. It trades flexibility for simplicity in a big way. In particular, there is no way to specify input values, so you're stuck loading the input from memory into a register, and the caller is forced to spill the input from a register to memory in preparation. This creates a phenomenon I like to call "a whole lotta shufflin' goin' on", or for short, "slow code". You don't drop to inline assembly in cases where slow code is acceptable. Thus, it is always preferable (at least on MSVC) to figure out how to write C/C++ source code that persuades the compiler to emit the object code you want. Even if you can only get close to the ideal output, that's still considerably better than the penalty you pay for using inline assembly.

Note that none of these contortions are necessary if you target x86-64. Microsoft's 64-bit C/C++ compiler is significantly more aggressive about using CMOVcc instructions whenever possible, even the older versions. As this blog post explains, the x64 compiler bundled with Visual Studio 2010 contains a number of code-quality improvements, including better identification and use of CMOV instructions.

No special compiler flags or other considerations are necessary here, since all processors that support 64-bit mode support conditional moves. I suppose this is why they were able to get it right for the 64-bit compiler. I also suspect that some of these changes made to the x86-64 compiler in VS 2010 were ported to the x86-32 compiler in VS 2012, explaining why it is at least aware of the existence of CMOV, but it still doesn't use it as aggressively as the 64-bit compiler.

The bottom line is, when targeting x86-64, write the code in the way that makes the most sense. The optimizer actually knows how to do its job!