Reading global variables using foreach in R

Updated: 2023-11-14 09:42:46

The doParallel package auto-exports to the workers any variables that are referenced in the foreach loop. If you don't want it to do that, you can use the foreach ".noexport" option to prevent it from auto-exporting particular variables. But if I understand you correctly, your problem is that R is subsequently duplicating some of those variables, which is even more of a problem than usual since it is happening in multiple processes on a single machine.
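
As a rough sketch of how ".noexport" might be used, the following assumes a snow-style cluster and a variable that is sent to the workers once with clusterExport, so that the loop does not need to export it again. The name big and the worker count of 2 are made up for illustration.

```r
library(doParallel)

cl <- makeCluster(2)        # 2 workers is an arbitrary choice
registerDoParallel(cl)

big <- matrix(1:100, 10)    # hypothetical stand-in for a large object

# Send `big` to each worker's global environment once, up front.
clusterExport(cl, "big")

# .noexport stops foreach from auto-exporting `big` for this loop;
# the workers use the copy that clusterExport already placed there.
res <- foreach(i = 1:10, .combine = 'c', .noexport = 'big') %dopar% {
    mean(big[, i])
}

stopCluster(cl)
```

Without the clusterExport call, the loop body would fail on the workers, because ".noexport" only suppresses the export and does nothing to make the variable available.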

There isn't a way to declare a variable so that R will never make a duplicate of it. You either need to replace the problem variables with objects from a package like bigmemory so that copies are never made, or you can try modifying the code in such a way as to not trigger the duplication. You can use the tracemem function to help you, since it will print a message whenever that object is duplicated.
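
A minimal illustration of tracemem, assuming your R build has memory profiling enabled (the capabilities("profmem") check guards against builds that do not). The variable names are arbitrary.

```r
x <- matrix(as.numeric(1:100), 10)

if (capabilities("profmem")) {
    tracemem(x)      # report every duplication of x from now on
}

y <- x               # no copy yet: x and y share the same memory
y[1, 1] <- 0         # the first write to y forces a copy; tracemem prints a message here

if (capabilities("profmem")) {
    untracemem(x)    # stop tracing
}
```

The assignment y <- x is cheap; the duplication only happens when one of the two names is modified, which is the kind of step worth looking for in a loop body.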

However, you may be able to avoid the problem by reducing the data that is needed by the workers. That reduces the amount of data that needs to be copied to each of the workers, as well as decreasing their memory footprint.

Here is a classic example of giving the workers more data than they need:

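library(doParallel)
cl <- makeCluster(2)   # snow-style cluster; 2 workers is an arbitrary choice
registerDoParallel(cl)
x <- matrix(1:100, 10)
# x is referenced in the loop body, so the whole matrix is exported to every worker
foreach(i=1:10, .combine='c') %dopar% {
    mean(x[,i])
}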

Since the matrix x is referenced in the foreach loop, it will be auto-exported to each of the workers, even though each worker only needs a subset of the columns. The simplest solution is to iterate over the actual columns of the matrix rather than over column indices:

foreach(xc=x, .combine='c') %dopar% {
    mean(xc)
}

Not only is less data transferred to the workers, but each of the workers only actually needs to have one column in memory at a time, which greatly decreases its memory footprint for large matrices. The xc vector may still end up being duplicated, but it doesn't hurt nearly as much because it is much smaller than x.
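
To check that the two forms give the same answer, here is a small self-contained comparison. It assumes a two-worker snow-style cluster; the names by_index and by_column are only for illustration.

```r
library(doParallel)

cl <- makeCluster(2)        # 2 workers is an arbitrary choice
registerDoParallel(cl)

x <- matrix(1:100, 10)

# Index-based loop: the whole matrix x is exported to each worker.
by_index <- foreach(i = 1:10, .combine = 'c') %dopar% {
    mean(x[, i])
}

# Column-based loop: foreach iterates over the columns of x,
# so each task receives only the column it works on.
by_column <- foreach(xc = x, .combine = 'c') %dopar% {
    mean(xc)
}

stopCluster(cl)

stopifnot(isTRUE(all.equal(by_index, by_column)))
```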

Note that this technique only helps when doParallel uses the "snow-derived" functions, such as parLapply and clusterApplyLB, not when it uses mclapply. With mclapply the workers are forked and get the matrix x for free, so sending them individual columns only adds overhead and can make the loop a bit slower. However, on Windows, doParallel can't use mclapply, so this technique is very important there.
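
One way to see which case applies is to look at how the backend is registered. The sketch below assumes the usual doParallel behaviour: registering a cluster object gives the snow-derived path, while registering with cores on a non-Windows system gives the mclapply path. The worker count is arbitrary.

```r
library(doParallel)

n_workers <- 2  # arbitrary choice for illustration

if (.Platform$OS.type == "windows") {
    # No forking on Windows: use an explicit snow-style cluster.
    cl <- makeCluster(n_workers)
    registerDoParallel(cl)
} else {
    # Forked workers (mclapply): they already see x without any export.
    cl <- NULL
    registerDoParallel(cores = n_workers)
}

x <- matrix(1:100, 10)
res <- foreach(i = 1:10, .combine = 'c') %dopar% mean(x[, i])

if (!is.null(cl)) stopCluster(cl)
```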

The important thing is to think about what data is really needed by the workers in order to perform their work and to try to decrease it if possible. Sometimes you can do that by using special iterators, either from the iterators or itertools packages, but you may also be able to do that by changing your algorithm.
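
As one possible example of a special iterator, iter from the iterators package can hand out blocks of columns instead of single columns, which keeps the number of tasks small while still sending each worker only part of the matrix. The chunk size of 5 and the name chunked are arbitrary choices for this sketch.

```r
library(doParallel)   # also attaches foreach and iterators

cl <- makeCluster(2)  # 2 workers is an arbitrary choice
registerDoParallel(cl)

x <- matrix(1:100, 10)

# Each task receives a 10 x 5 block of columns rather than the full matrix.
chunked <- foreach(xb = iter(x, by = "column", chunksize = 5),
                   .combine = 'c') %dopar% {
    colMeans(xb)
}

stopCluster(cl)
```

Larger chunks mean fewer, bigger transfers; smaller chunks mean less memory per task, so the right size depends on the matrix and the number of workers.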