且构网

Why does the Mac ABI require 16-byte stack alignment for x86-32?

Updated: 2023-11-14 09:38:52

I can understand this requirement for the old PPC RISC systems and even for x86-64, but for the old tried-and-true x86? There, the stack only needs to be aligned on 4-byte boundaries. Yes, some SSE instructions require 16-byte alignment, but if that is a requirement of the callee, then the callee should ensure the alignment is correct. Why burden every caller with this extra requirement? It can actually cost some performance, because every call site must manage the requirement. Am I missing something?
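To make the "let the callee do it" alternative concrete, here is a minimal C sketch of a callee that assumes nothing about the incoming stack alignment and carves a 16-byte-aligned slot out of an oversized local buffer. The names `align_up` and `callee_realigns_locally` are hypothetical, and a real compiler would more likely realign the stack pointer itself in the prologue; this only illustrates the arithmetic involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr up to the next multiple of alignment.
   alignment must be a nonzero power of two. */
uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical callee: over-allocate by 15 bytes, then pick the
   16-byte-aligned slot inside the buffer. Returns 1 if the slot is
   aligned and lies entirely within the buffer. */
int callee_realigns_locally(void)
{
    unsigned char raw[16 + 15];
    uintptr_t base = (uintptr_t)raw;
    size_t offset = (size_t)(align_up(base, 16) - base); /* 0..15 */
    unsigned char *slot = raw + offset;

    return ((uintptr_t)slot % 16 == 0) && (offset + 16 <= sizeof raw);
}
```

The cost on the callee side is the wasted padding plus a few extra instructions in every function that needs aligned locals, which is the trade-off the question is weighing against a caller-side rule.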

Update: After investigating this further and consulting some colleagues internally, I have a few theories:

  1. Consistency between the PPC, x86, and x64 versions of the OS.
  2. It seems that the GCC codegen now consistently does a "sub esp,xxx" and then "mov"s the data onto the stack rather than simply using "push" instructions. This could actually be faster on some hardware.
  3. While this does complicate the call sites a little, there is very little extra overhead when using the default "cdecl" convention, where the caller cleans up the stack.
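As a rough sketch of what item 2 implies for the prologue, the function below computes how many bytes a "sub esp,xxx" would have to reserve so that ESP ends up 16-byte aligned again. It assumes ESP is 16-byte aligned immediately before each "call" instruction, so a function that has pushed only the return address and a saved EBP starts 8 bytes past an aligned boundary. The name `sub_esp_amount` and its parameters are hypothetical.

```c
#include <assert.h>

/* needed_bytes:   space wanted for locals plus outgoing arguments.
   already_pushed: bytes pushed since the last 16-byte-aligned point
                   (return address, saved registers), 4 bytes each.
   Returns the smallest amount >= needed_bytes that leaves the stack
   pointer 16-byte aligned after the subtraction. */
unsigned sub_esp_amount(unsigned needed_bytes, unsigned already_pushed)
{
    assert(already_pushed % 4 == 0); /* x86-32 pushes are 4 bytes wide */

    unsigned total = already_pushed + needed_bytes;
    unsigned rounded = (total + 15u) & ~15u; /* round up to 16 */
    return rounded - already_pushed;
}
```

For example, with a return address and saved EBP on the stack (8 bytes), a function needing 9 bytes would reserve 24, since 8 + 24 = 32 is a multiple of 16. The padding is folded into a single subtraction, which fits the observation that the overhead under "cdecl" is small.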

The issue I have with the last item is that for calling conventions that rely on the callee cleaning the stack, the above requirement really "uglifies" the codegen. For instance, what if some compiler decided to implement a faster register-based calling style for its own internal use (i.e., any code that isn't intended to be called from other languages or sources)? This stack-alignment requirement could negate some of the performance gained by passing parameters in registers.

Update: So far the only real answer has been consistency, but to me that is a bit too easy an answer. I have well over 20 years of experience with the x86 architecture, and if consistency, rather than performance or something else concrete, is really the reason, then I respectfully suggest that it is a bit naive of the developers to require it. They're ignoring nearly three decades of tools and support, especially if they expect tools vendors to quickly and easily adapt their tools to the platform (maybe not... it is Apple...) without having to jump through several seemingly unnecessary hoops.

I'll give this topic another day or so then close it...

Related

From the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", section 4.4.2:

"For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data."

From Appendix D:

"It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation."
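A small C sketch of the property the quote is concerned with: whether a local 16-byte vector actually sits on a 16-byte boundary. It uses C11 `_Alignas` as a stand-in for `__m128` so that it does not depend on SSE headers; `is_aligned` and `local_vector_is_aligned` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if p is a multiple of alignment (a nonzero power of two). */
int is_aligned(const void *p, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* A local four-float vector, requested on a 16-byte boundary the way
   a local __m128 would be. */
int local_vector_is_aligned(void)
{
    _Alignas(16) float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    return is_aligned(v, 16);
}
```

If the ABI guarantees an aligned stack at entry, the compiler can satisfy such a request with a fixed frame layout; otherwise it has to realign the stack pointer inside the function.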

http://www.intel.com/Assets/PDF/manual/248966.pdf