GCC optimization flags for matrix/vector operations

Updated: 2022-10-26 18:18:11


    I am performing matrix operations using C. I would like to know the various compiler optimization flags that improve the speed of execution of these matrix operations for double and int64 data - like multiplication, inverse, etc. I am not looking for hand-optimized code; I just want to make the native code faster using compiler flags and learn more about these flags.

    The flags that I have found so far which improve matrix code.

    -O3/O4
    -funroll-loops
    -ffast-math
    

    First of all, I don't recommend using -ffast-math for the following reasons:

    1. It has been proved that the performance actually degrades when using this option in most (if not all) cases. So "fast math" is not actually that fast.

    2. This option breaks strict IEEE compliance for floating-point operations, which ultimately results in the accumulation of computational errors of an unpredictable nature (see the sketch after this list).

    3. You may well get different results in different environments, and the difference may be substantial. The term environment (in this case) means the combination of hardware, OS, and compiler. This means that the diversity of situations in which you can get unexpected results grows exponentially.

    4. Another sad consequence is that programs which link against the library built with this option might expect correct (IEEE compliant) floating-point math, and this is where their expectations break, but it will be very tough to figure out why.

    5. Finally, have a look at this article.
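
    As a minimal sketch of points 2 and 3 (the exact behavior depends on the GCC version and target), the re-association that -ffast-math permits can silently change a result:

        #include <stdio.h>

        /* Under strict IEEE semantics, (big + small) - big is evaluated
         * left to right: 1e16 + 1.0 rounds back to 1e16 (the spacing
         * between doubles near 1e16 is 2.0), so the result is 0.0.
         * -ffast-math allows GCC to re-associate the expression into
         * big - big + small, which yields 1.0 instead. */
        double f(double big, double small) {
            return (big + small) - big;
        }

        int main(void) {
            /* prints 0 under IEEE rules; may print 1 with -ffast-math */
            printf("%g\n", f(1e16, 1.0));
            return 0;
        }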

    For the same reasons you should avoid -Ofast (as it includes the evil -ffast-math). Extract:

    -Ofast

    Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.

    There is no such flag as -O4. At least I'm not aware of one, and there is no trace of it in the official GCC documentation. So the maximum in this regard is -O3, and you should definitely be using it, not only to optimize math but in release builds in general.
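
    For example, a plain release build (matmul.c is just a placeholder name) would be:

        gcc -O3 -o matmul matmul.c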

    -funroll-loops is a very good choice for math routines, especially involving vector/matrix operations where the size of the loop can be deduced at compile-time (and as a result unrolled by the compiler).
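
    A sketch of the kind of loop this helps (a fixed-size 4x4 matrix-vector product; the function name is illustrative):

        #define N 4

        /* y = A * x.  Because N is a compile-time constant, -funroll-loops
         * lets GCC replace both loops with straight-line code, removing
         * the branch and loop-counter overhead entirely. */
        void matvec(const double A[N][N], const double x[N], double y[N]) {
            for (int i = 0; i < N; ++i) {
                y[i] = 0.0;
                for (int j = 0; j < N; ++j)
                    y[i] += A[i][j] * x[j];
            }
        }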

    I can recommend 2 more flags: -march=native and -mfpmath=sse. Similarly to -O3, -march=native is good in general for release builds of any software, not only math-intensive ones. -mfpmath=sse enables the use of XMM registers in floating-point instructions (instead of the x87 register stack).
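
    Putting the flags recommended so far together (file names are again placeholders):

        gcc -O3 -march=native -mfpmath=sse -funroll-loops -o matmul matmul.c

    Note that on x86-64 targets -mfpmath=sse is already the default; the flag matters mainly for 32-bit x86 builds.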

    Furthermore, I'd like to say that it's a pity you don't want to modify your code to get better performance, as that is the main source of speedup for vector/matrix routines. Thanks to SIMD, SSE intrinsics, and vectorization, heavy linear-algebra code can be orders of magnitude faster than without them. However, proper application of these techniques requires in-depth knowledge of their internals and quite some time/effort to modify (actually rewrite) the code.
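
    As a taste of what such a rewrite looks like, here is a minimal SSE-intrinsics sketch that adds two float arrays four lanes at a time (it assumes n is a multiple of 4):

        #include <xmmintrin.h>  /* SSE intrinsics */

        /* c[i] = a[i] + b[i], 4 floats per instruction.  Unaligned
         * loads/stores are used, so the arrays need no special alignment. */
        void add4(const float *a, const float *b, float *c, int n) {
            for (int i = 0; i < n; i += 4) {
                __m128 va = _mm_loadu_ps(a + i);
                __m128 vb = _mm_loadu_ps(b + i);
                _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
            }
        }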

    Nevertheless, there is one option that could be suitable in your case. GCC offers auto-vectorization, which can be enabled by -ftree-vectorize, but there is no need for that since you are using -O3 (which already includes -ftree-vectorize). The point is that you should still help GCC a little to understand which code can be auto-vectorized. The modifications are usually minor (if needed at all), but you have to make yourself familiar with them. So see the Vectorizable Loops section in the article linked above.
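
    The typical "help" is writing simple, countable loops over arrays that provably do not overlap; the C99 restrict qualifier is often what unblocks the vectorizer. A sketch:

        /* GCC can auto-vectorize this at -O3: a countable loop with unit
         * stride and no data dependences.  'restrict' promises that the
         * two arrays do not alias. */
        void scale_add(double *restrict y, const double *restrict x,
                       double a, int n) {
            for (int i = 0; i < n; ++i)
                y[i] += a * x[i];
        }

    Recent GCC versions can report what was (and was not) vectorized via -fopt-info-vec.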

    Finally, I recommend you look into Eigen, a C++ template-based library with highly efficient implementations of the most common linear algebra routines. It utilizes all the techniques mentioned here so far in a very clever way. The interface is purely object-oriented, neat, and pleasant to use. The object-oriented approach is a natural fit for linear algebra, which usually manipulates pure objects such as matrices, vectors, quaternions, rotations, filters, and so on. As a result, when programming with Eigen, you never have to deal with low-level concepts (such as SSE or vectorization) yourself, but can just enjoy solving your specific problem.