
Parallelizing sklearn pipelines

Updated: 2021-08-08 02:34:44

In general, no.

If you look at the interface for sklearn stages, the methods are of the form:

fit(X, y, other_stuff)

predict(X)

That is, they work on the entire dataset, and can't do incremental learning on streams (or chunked streams) of data.
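A minimal sketch of that interface in action (the stage names and toy data here are illustrative, not from the original question):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = (X[:, 0] > 0.5).astype(int)

pipe = Pipeline([("SCALE", StandardScaler()),
                 ("CLF", LogisticRegression())])
pipe.fit(X, y)           # consumes the entire X and y in one call
preds = pipe.predict(X)  # likewise operates on the full matrix at once
```

Each stage's `fit` runs to completion on the whole dataset before the next stage starts.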

Moreover, fundamentally, some of the algorithms are not amenable to this. Consider for example your stage

("SCALE", Normalizer()),

Presumably, this normalizes using mean and/or variance. Without seeing the entire dataset, how can it know these things? It must therefore wait for the entire input before operating, and hence can't be run in parallel with the stages after it. Most (if not nearly all) stages are like that.
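To see why such a stage is a barrier, note that the statistics only exist once every row has been seen. (Strictly, sklearn's `Normalizer` rescales each sample independently; `StandardScaler` is the transformer that computes dataset-wide mean and variance, so it is used below to illustrate the point.)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
scaler.fit(X)  # must observe every row before the statistics exist

# mean of 1..4 is 2.5; population variance is 1.25
print(scaler.mean_, scaler.var_)
```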

However, in some cases you can still use multiple cores with sklearn.

  1. Some stages have an n_jobs parameter. Stages like this run sequentially relative to other stages, but can parallelize the work within them.

  2. In some cases you can roll your own (approximate) parallel versions of other stages. E.g., given any regressor stage, you can wrap it in a stage that randomly chunks your data into n parts, learns the parts in parallel, and outputs a regressor that averages all the resulting regressors. YMMV.
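The n_jobs case can be sketched as follows (the estimator choice is illustrative): a random forest builds its trees across cores, even though the stage as a whole still runs in sequence with the rest of a pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] > 0.5).astype(int)

# n_jobs=-1: use all available cores to fit the individual trees
clf = RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=0)
clf.fit(X, y)
preds = clf.predict(X)
```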
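The roll-your-own averaging idea could look roughly like this, assuming joblib for the parallel fits (the class and helper names are made up for illustration, not an sklearn API):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import LinearRegression

def _fit_chunk(estimator, X, y):
    # Fit a fresh clone of the base estimator on one chunk of the data.
    return clone(estimator).fit(X, y)

class ChunkedAverageRegressor(BaseEstimator, RegressorMixin):
    """Fit clones of a base regressor on random chunks of the data in
    parallel; predict with the average of their predictions."""

    def __init__(self, base_estimator, n_chunks=4):
        self.base_estimator = base_estimator
        self.n_chunks = n_chunks

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        idx = np.random.default_rng(0).permutation(len(X))
        chunks = np.array_split(idx, self.n_chunks)
        self.estimators_ = Parallel(n_jobs=self.n_chunks)(
            delayed(_fit_chunk)(self.base_estimator, X[ix], y[ix])
            for ix in chunks
        )
        return self

    def predict(self, X):
        return np.mean([e.predict(X) for e in self.estimators_], axis=0)

# Each chunk sees the same linear relation, so the average is exact here;
# on real data the averaged model is only an approximation.
X = np.arange(40, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 1
model = ChunkedAverageRegressor(LinearRegression(), n_chunks=4).fit(X, y)
```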