
Performance: Matlab vs. C++ matrix-vector multiplication

Updated: 2022-04-28 22:02:16

As said in the comments, MatLab relies on Intel's MKL library for matrix products, which is the fastest library for this kind of operation. Nonetheless, Eigen alone should be able to deliver similar performance. To this end, make sure to use the latest Eigen (e.g. 3.4) and the proper compilation flags to enable AVX/FMA (if available) and multithreading:

-O3 -DNDEBUG -march=native
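For instance, a GCC/Clang invocation with these flags might look like the following (the file names and include path are placeholders, not from the original answer):

```shell
# hypothetical file names; -fopenmp enables the OpenMP multithreading discussed below
g++ -O3 -DNDEBUG -march=native -fopenmp -I /path/to/eigen kernel.cpp -o kernel
```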

Since `charges_` is a vector, better use a `VectorXd` so that Eigen knows you want a matrix-vector product and not a matrix-matrix one.

If you have Intel's MKL, you can also let Eigen use it to get exactly the same performance as MatLab for this precise operation.
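Per Eigen's documentation this is a compile-time switch: define `EIGEN_USE_MKL_ALL` before including any Eigen header and link against MKL (the link line is omitted here, as it depends on your MKL setup):

```cpp
// Routes supported Eigen operations (including matrix-vector products) to Intel MKL.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>
```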

Regarding the assembly of the kernel matrix, better to interchange the two loops to enable vectorization, then enable multithreading with OpenMP (add -fopenmp as a compiler flag) so that the outermost loop runs in parallel, and finally you can simplify your code using Eigen:

#include <Eigen/Dense>
using Eigen::ArrayXd;
using Eigen::MatrixXd;

void kernel_2D(const unsigned long M, double* x, const unsigned long N, double* y, MatrixXd& kernel) {
    kernel.resize(M, N);
    // Map the raw buffers as Eigen arrays: x packs the two coordinates of M points, y of N points.
    auto x0 = ArrayXd::Map(x, M);
    auto x1 = ArrayXd::Map(x + M, M);
    auto y0 = ArrayXd::Map(y, N);
    auto y1 = ArrayXd::Map(y + N, N);
    #pragma omp parallel for
    for (unsigned long j = 0; j < N; ++j)
        // One whole column at a time: contiguous in memory (column-major) and vectorized by Eigen.
        kernel.col(j) = sqrt((x0 - y0(j)).abs2() + (x1 - y1(j)).abs2());
}
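Without Eigen, the same loop interchange can be sketched with raw loops (column-major storage is assumed here to match Eigen's default, and the x/y buffer layout follows the function above):

```cpp
#include <cmath>

// K is an M x N column-major buffer; x packs (x0..., x1...), y packs (y0..., y1...).
void kernel_2D_loops(unsigned long M, const double* x,
                     unsigned long N, const double* y, double* K) {
    #pragma omp parallel for
    for (unsigned long j = 0; j < N; ++j)        // outermost loop runs in parallel
        for (unsigned long i = 0; i < M; ++i) {  // inner loop walks a contiguous column
            const double dx = x[i]     - y[j];
            const double dy = x[M + i] - y[N + j];
            K[j * M + i] = std::sqrt(dx * dx + dy * dy);
        }
}
```

With the j-loop outermost, consecutive iterations of the inner i-loop write to consecutive addresses, which is what lets the compiler vectorize it.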

With multi-threading you need to measure the wall-clock time. Here (Haswell with 4 physical cores running at 2.6GHz) the assembly time drops to 0.36s for N=20000, and the matrix-vector products take 0.24s, so 0.6s in total, which is faster than MatLab even though my CPU seems to be slower than yours.