Matrix multiplication with std::vector is 10 times slower than numpy

Updated: 2022-01-15 15:15:40

Matrix multiplication is relatively easy to optimize. However, getting to decent CPU utilization is tricky, because you need deep knowledge of the hardware you are using. The steps to implement a fast matmul kernel are the following:

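Use SIMD instructions
Use register blocking and fetch multiple data elements at once
Optimize for your caches (mainly L2 and L3)
Parallelize your code to use multiple threads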
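
To make the cache step concrete, here is a minimal sketch of loop tiling on flat, row-major `std::vector<float>` matrices. It is not the code from the question or from the linked gist. The function names `matmul_naive` and `matmul_tiled` are hypothetical, the default tile size of 64 is an arbitrary assumption that would need tuning per machine, and the matrices are assumed to be square. There are no explicit SIMD intrinsics and no threads here; the contiguous inner loop is only meant to give the compiler a chance to auto-vectorize when optimizations are enabled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version: i-j-k loop order on row-major n x n matrices.
// The inner loop walks b with stride n, which is unfriendly to the cache.
std::vector<float> matmul_naive(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}

// Tiled version: the work is split into tile x tile blocks so that the
// blocks of a, b and c being touched are more likely to stay in cache.
// Inside a block the order is i-k-j, so the innermost loop reads b and
// writes c contiguously. The tile size is an assumption, not a tuned value.
std::vector<float> matmul_tiled(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n,
                                std::size_t tile = 64) {
    assert(a.size() == n * n && b.size() == n * n && tile > 0);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a_ik = a[i * n + k];  // reused across the j loop
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

Both functions should produce the same product; how much faster the tiled one is depends on the matrix size, the compiler flags and the cache sizes of the machine, so it is worth measuring rather than assuming. Register blocking and multithreading would be further steps on top of this.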

The following link is a very good resource that explains all the nasty details: https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0

If you want more in-depth advice, leave a comment.
