How to generate, compile and run CUDA kernels at runtime

Updated: 2023-01-15 20:38:28

In his comment, Roger Dahl has linked the following post

Passing the PTX program to the CUDA driver directly

in which the use of two functions, namely cuModuleLoad and cuModuleLoadDataEx, is addressed. The former loads PTX code from a file and passes it to the CUDA driver (not to nvcc), which compiles it for the current device. The latter avoids file I/O and lets you pass the PTX code to the driver as a C string. In either case, you must already have the PTX code at hand, either as the result of compiling a CUDA kernel (to be loaded from a file or copied and pasted into the C string) or as hand-written source.

But what if you have to create the PTX code on the fly, starting from a CUDA kernel? Following the approach in CUDA Expression templates, you can build a string containing your CUDA kernel, like

ss << "extern \"C\" __global__ void kernel( ";
ss << def_line.str() << ", unsigned int vector_size, unsigned int number_of_used_threads ) { \n";
ss << "\tint idx = blockDim.x * blockIdx.x + threadIdx.x; \n";
ss << "\tfor(unsigned int i = 0; i < ";
ss << "(vector_size + number_of_used_threads - 1) / number_of_used_threads; ++i) {\n";
ss << "\t\tif(idx < vector_size) { \n";
ss << "\t\t\t" << eval_line.str() << "\n";
ss << "\t\t\tidx += number_of_used_threads;\n";
ss << "\t\t}\n";
ss << "\t}\n";
ss << "}\n\n\n\n";
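
The fragment above depends on two streams, def_line and eval_line, whose contents the post does not show. As a minimal self-contained sketch, assume they hold a parameter list and a single assignment statement. The function names and the vector-addition inputs below are hypothetical, and the loop is written as a plain grid-stride loop, which visits the same indices as the original counted loop:

```cpp
#include <sstream>
#include <string>

// Builds the source of a grid-stride kernel as a string.
// `params` and `body` play the roles of def_line and eval_line in the
// fragment above; their contents are assumptions made for illustration.
std::string build_kernel_source(const std::string& params,
                                const std::string& body) {
    std::ostringstream ss;
    // extern "C" disables C++ name mangling, so the kernel can later be
    // looked up under the plain name "kernel".
    ss << "extern \"C\" __global__ void kernel( " << params
       << ", unsigned int vector_size, unsigned int number_of_used_threads ) {\n";
    ss << "\tunsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;\n";
    ss << "\tfor (; idx < vector_size; idx += number_of_used_threads) {\n";
    ss << "\t\t" << body << "\n";
    ss << "\t}\n";
    ss << "}\n";
    return ss.str();
}

// Hypothetical inputs: element-wise c = a + b on float arrays.
std::string example_kernel_source() {
    return build_kernel_source("float* c, const float* a, const float* b",
                               "c[idx] = a[idx] + b[idx];");
}
```

The string returned by example_kernel_source() is ordinary CUDA C++ source. It still has to be written to a file and compiled before the driver can load it.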

then compile it to PTX with a system call to nvcc (the string is assumed to have been written to kernel_filename beforehand), as in

int nvcc_exit_status = system(
    (std::string(NVCC) + " -ptx " + NVCC_FLAGS + " " + kernel_filename
        + " -o " + kernel_comp_filename).c_str()
);

if (nvcc_exit_status) {
    std::cerr << "ERROR: nvcc exits with status code: " << nvcc_exit_status << std::endl;
    exit(1);
}
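
The system call above takes two things for granted that the post does not show: the generated source is already on disk, and NVCC and NVCC_FLAGS are defined somewhere. A rough sketch of those steps follows. The helper names write_kernel_file and build_nvcc_command are hypothetical, and the nvcc path and flags are passed as arguments instead of macros:

```cpp
#include <fstream>
#include <string>

// Writes the generated kernel source to disk; returns false on I/O failure.
bool write_kernel_file(const std::string& path, const std::string& source) {
    std::ofstream out(path);
    if (!out) return false;
    out << source;
    out.close();  // flush to disk before nvcc reads the file
    return !out.fail();
}

// Assembles the command line that is handed to system().
// `nvcc` and `flags` stand in for the NVCC and NVCC_FLAGS macros.
std::string build_nvcc_command(const std::string& nvcc,
                               const std::string& flags,
                               const std::string& cu_file,
                               const std::string& ptx_file) {
    return nvcc + " -ptx " + flags + " " + cu_file + " -o " + ptx_file;
}
```

Because system() hands the string to a shell, file names containing spaces would need quoting. This sketch assumes plain names such as kernel.cu and kernel.ptx.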

and finally use cuModuleLoad to load the PTX code from the file and pass it to the CUDA driver, and cuModuleGetFunction to obtain a handle to the kernel, like

result = cuModuleLoad(&cuModule, kernel_comp_filename.c_str());
assert(result == CUDA_SUCCESS);
result = cuModuleGetFunction(&cuFunction, cuModule, "kernel");
assert(result == CUDA_SUCCESS);

Of course, expression templates have nothing to do with this problem; I am only citing the source of the ideas reported in this answer.